dc.contributor.author: Nazemi, Azadeh
dc.contributor.author: Murray, Iain
dc.contributor.author: McMeekin, David
dc.date.accessioned: 2017-01-30T11:04:46Z
dc.date.available: 2017-01-30T11:04:46Z
dc.date.created: 2014-11-02T20:00:30Z
dc.date.issued: 2014
dc.identifier.citation: Nazemi, A. and Murray, I. and McMeekin, D. 2014. Layout Analysis for Scanned PDF and Transformation to the Structured PDF Suitable for Vocalization and Navigation. Computer and Information Science. 7 (1): pp. 162-171.
dc.identifier.uri: http://hdl.handle.net/20.500.11937/8128
dc.identifier.doi: 10.5539/cis.v7n1p162
dc.description.abstract:

Information can include text, pictures and signatures that can be scanned into a document format, such as the Portable Document Format (PDF), and easily emailed to recipients around the world. Upon the document’s arrival, the receiver can open and view it using a wide range of PDF viewing applications such as Adobe Reader and Apple Preview. As a result, the use of PDF has become pervasive. Because a scanned PDF is an image format, it is inaccessible to assistive technologies such as screen readers, so retrieving the information requires Optical Character Recognition (OCR). OCR software processes the scanned PDF file and, through text extraction, generates an editable text document. This text document can then be edited, formatted, searched and indexed, as well as translated or converted to speech. A problem that OCR software does not solve is the accurate regeneration of the full text layout. This paper presents a technology that addresses this issue by closely preserving the original textual layout of the scanned PDF using the open source document analysis and OCR system (OCRopus), based on geometric layout and positioning information. The main issues considered in this research are the preservation of the correct reading order and the representation of common logical structural elements such as section headings, line breaks, paragraphs, captions, sidebars, foot-bars, running headers, embedded images, graphics, tables and mathematical expressions.

dc.publisher: Canadian Center of Science and Education
dc.subject: document layout analysis
dc.subject: assistive technology
dc.subject: optical character recognition
dc.title: Layout Analysis for Scanned PDF and Transformation to the Structured PDF Suitable for Vocalization and Navigation
dc.type: Journal Article
dcterms.source.volume: 7
dcterms.source.number: 1
dcterms.source.startPage: 162
dcterms.source.endPage: 171
dcterms.source.issn: 1913-8989
dcterms.source.title: Computer and Information Science
curtin.note:

This article is published under the Open Access publishing model and distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/3.0/). Please refer to the licence to obtain terms for any further reuse or distribution of this work.

curtin.department: Department of Electrical and Computer Engineering
curtin.accessStatus: Open access

