
    espace - Curtin’s institutional repository


    Layout Analysis for Scanned PDF and Transformation to the Structured PDF Suitable for Vocalization and Navigation

    203476_124038_Layout_Analysis_for_Scanned_PDF_and_Transformation.pdf (4.142Mb)
    Access Status
    Open access
    Authors
    Nazemi, Azadeh
    Murray, Iain
    McMeekin, David
    Date
    2014
    Type
    Journal Article
    
    Citation
    Nazemi, A. and Murray, I. and McMeekin, D. 2014. Layout Analysis for Scanned PDF and Transformation to the Structured PDF Suitable for Vocalization and Navigation. Computer and Information Science. 7 (1): pp. 162-171.
    Source Title
    Computer and Information Science
    DOI
    10.5539/cis.v7n1p162
    ISSN
    1913-8989
    School
    Department of Electrical and Computer Engineering
    Remarks

    This article is published under the Open Access publishing model and distributed under the terms of the Creative Commons Attribution License http://creativecommons.org/licenses/by/3.0/ Please refer to the licence to obtain terms for any further reuse or distribution of this work.

    URI
    http://hdl.handle.net/20.500.11937/8128
    Collection
    • Curtin Research Publications
    Abstract

    Information can include text, pictures and signatures that can be scanned into a document format, such as the Portable Document Format (PDF), and easily emailed to recipients around the world. Upon the document’s arrival, the receiver can open and view it using a wide range of PDF viewing applications such as Adobe Reader and Apple Preview. Hence, the use of PDF has become pervasive. Because a scanned PDF is an image format, it is inaccessible to assistive technologies such as screen readers, and retrieving its information requires Optical Character Recognition (OCR). OCR software processes the scanned PDF file and, through text extraction, generates an editable text document. This text document can then be edited, formatted, searched and indexed, as well as translated or converted to speech. A problem that OCR software does not solve is the accurate regeneration of the full text layout. This paper presents a technology that addresses this issue by closely preserving the original textual layout of the scanned PDF using the open source document analysis and OCR system OCRopus, based on geometric layout and positioning information. The main issues considered in this research are the preservation of the correct reading order and the representation of common logical structural elements such as section headings, line breaks, paragraphs, captions, sidebars, foot-bars, running headers, embedded images, graphics, tables and mathematical expressions.
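
To make the reading-order problem named in the abstract concrete, here is a minimal Python sketch. It is not the paper's method and does not use the OCRopus API. It assumes that an earlier layout-analysis step has already produced text blocks with bounding boxes; the `Block` type, its field names and the sample page are hypothetical. Blocks that overlap horizontally are grouped into a column, columns are read left to right, and blocks within a column are read top to bottom.

```python
from dataclasses import dataclass


@dataclass
class Block:
    """A text block with its bounding box (hypothetical structure)."""
    text: str
    x0: float  # left edge
    y0: float  # top edge (y grows downward, as in image coordinates)
    x1: float  # right edge
    y1: float  # bottom edge


def overlaps_horizontally(a: Block, b: Block) -> bool:
    """Return True if the two blocks share any horizontal extent."""
    return min(a.x1, b.x1) > max(a.x0, b.x0)


def reading_order(blocks: list[Block]) -> list[Block]:
    """Order blocks column by column, top to bottom within each column."""
    columns: list[list[Block]] = []
    for block in sorted(blocks, key=lambda b: b.x0):
        for column in columns:
            # Join the first column that this block overlaps horizontally.
            if any(overlaps_horizontally(block, member) for member in column):
                column.append(block)
                break
        else:
            # No overlap with any existing column: start a new one.
            columns.append([block])
    # Read columns left to right, then blocks top to bottom.
    columns.sort(key=lambda col: min(b.x0 for b in col))
    return [b for col in columns for b in sorted(col, key=lambda b: b.y0)]


# Hypothetical two-column page, listed in a scrambled order.
page = [
    Block("right column, second paragraph", 320, 400, 580, 700),
    Block("left column, first paragraph", 20, 100, 280, 380),
    Block("right column, first paragraph", 320, 100, 580, 380),
    Block("left column, second paragraph", 20, 400, 280, 700),
]
ordered = [b.text for b in reading_order(page)]
```

This simple grouping breaks down when a full-width element such as a title or table spans several columns, because that element overlaps every column and merges them. Handling such cases is part of what a fuller layout analysis has to address.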

    Related items

    Showing items related by title, author, creator and subject.

    • Mathematical Information Retrieval (MIR) from scanned PDF and MathML conversion
      Nazemi, Azadeh; Murray, Iain; McMeekin, David (2014)
      This paper describes part of an ongoing comprehensive research project that is aimed at generating a MathML format from images of mathematical expressions that have been extracted from scanned PDF documents. A MathML ...
    • Practical Segmentation Methods for Logical and Geometric Layout Analysis to Improve Scanned PDF Accessibility to Vision Impaired
      Nazemi, Azadeh; Murray, Iain; McMeekin, David (2014)
      The use of electronic documents has rapidly increased in recent decades and the PDF is one of the most commonly used electronic document formats. A scanned PDF is an image and does not actually contain any text. For the ...
    • Improving the relevance of web search results by combining web snippet categorization, clustering and personalization
      Zhu, Dengya (2010)
      Web search results are far from perfect due to the polysemous and synonymous characteristics of natural languages, information overload as a result of the information explosion on the Web, and the flat list, “one size fits ...
