Mathematical Information Retrieval (MIR) from scanned PDF and MathML conversion
Date
2014
Abstract
This paper describes part of an ongoing research project that aims to generate MathML from images of mathematical expressions extracted from scanned PDF documents. A MathML representation of a scanned PDF document reduces the document's storage size and encodes both the mathematical notation and its meaning. The MathML representation is then suitable for vocalization and accessible through assistive technologies. To achieve an accurate layout analysis of a scanned PDF document, all textual and non-textual components must be recognised, identified and tagged. These components may be text or mathematical expressions, or graphics in the form of images, figures, tables and/or diagrams. Mathematical expressions are among the most significant components of scanned scientific and engineering PDF documents and need to be machine readable for use with assistive technologies. This research is a work in progress and comprises several modules: detection and extraction of mathematical expressions, recursive primitive component extraction, recognition of non-alphanumeric symbols, structural semantic analysis, and merging of primitive components to generate the MathML of the scanned PDF document. An optional module converts MathML to audio using a text-to-speech (TTS) engine to make the document accessible to vision-impaired users.
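As an illustration only, the final step named in the abstract (merging primitive components into MathML) might be sketched as follows. This is not the authors' implementation: the nested-tuple component tree, the node kinds ("id", "num", "op", "sup", "row") and the function names are all hypothetical, and the sketch assumes that recognition and structural analysis have already produced such a tree. It uses only Python's standard `xml.etree.ElementTree` module and emits Presentation MathML.

```python
import xml.etree.ElementTree as ET

MATHML_NS = "http://www.w3.org/1998/Math/MathML"


def append_mathml(node, parent):
    """Append Presentation MathML for one (hypothetical) component node."""
    kind = node[0]
    if kind == "id":        # identifier such as a variable name
        ET.SubElement(parent, "mi").text = node[1]
    elif kind == "num":     # numeric literal
        ET.SubElement(parent, "mn").text = node[1]
    elif kind == "op":      # operator symbol
        ET.SubElement(parent, "mo").text = node[1]
    elif kind == "sup":     # base followed by its superscript
        msup = ET.SubElement(parent, "msup")
        append_mathml(node[1], msup)
        append_mathml(node[2], msup)
    elif kind == "row":     # horizontal sequence of components
        mrow = ET.SubElement(parent, "mrow")
        for child in node[1:]:
            append_mathml(child, mrow)
    else:
        raise ValueError(f"unknown component kind: {kind!r}")


def expression_to_mathml(tree):
    """Serialise a component tree as a MathML string."""
    math = ET.Element("math", {"xmlns": MATHML_NS})
    append_mathml(tree, math)
    return ET.tostring(math, encoding="unicode")


# Assumed output of the earlier modules for the expression x^2 + 1.
tree = ("row", ("sup", ("id", "x"), ("num", "2")), ("op", "+"), ("num", "1"))
mathml = expression_to_mathml(tree)
print(mathml)
```

A string of this kind could then be passed to a text-to-speech front end or read by assistive technology that understands MathML, which is the use the abstract describes.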
Related items
Showing items related by title, author, creator and subject.
-
Nazemi, Azadeh; Murray, Iain; McMeekin, David (2014) Information can include text, pictures and signatures that can be scanned into a document format, such as the Portable Document Format (PDF), and easily emailed to recipients around the world. Upon the document’s arrival, ...