
    espace - Curtin’s institutional repository


    Attribute-Based Semantic Type Detection and Data Quality Assessment

    97300.pdf (609.6Kb)
    Access Status
    Open access
    Authors
    Silva, Marcelo Valentim
    Herrmann, Hannes
    Maxville, Valerie
    Date
    2024
    Type
    Conference Paper
    
    Citation
    Silva, M.V. and Herrmann, H. and Maxville, V. 2024. Attribute-Based Semantic Type Detection and Data Quality Assessment. In IEEE/ACM International Symposium on Big Data Computing (BDC), 16-19 Dec 2024, Sharjah, United Arab Emirates.
    Source Title
    2024 IEEE/ACM International Conference on Big Data Computing, Applications and Technologies (BDCAT)
    Source Conference
    IEEE/ACM International Symposium on Big Data Computing (BDC)
    DOI
    10.1109/BDCAT63179.2024.00030
    Remarks

    © 2024 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works.

    URI
    http://hdl.handle.net/20.500.11937/97536
    Collection
    • Curtin Research Publications
    Abstract

    The increasing reliance on data-driven decision-making highlights the critical need for high-quality data. Despite advancements, data quality issues continue to impact both business strategies and scientific research. Current methods often fail to utilize the semantic richness embedded in attribute labels (or column names/headers in tables), leading to a crucial gap in comprehensive data quality evaluation.

    This research addresses this gap by introducing an innovative methodology focused on Attribute-Based Semantic Type Detection and Data Quality Assessment. By leveraging semantic information in attribute labels, combined with rule-based analysis and comprehensive dictionaries, our approach effectively addresses four key Big Data challenges: variety, veracity, volume and value.

    Our method provides a practical classification system of 23 semantic types, including numerical non-negative, categorical, ID, names, strings, geographical, temporal, and complex formats like URLs, IP addresses, email, and binary values plus several numerical bounded types, such as age and percentage (variety). The approach was validated across fifty diverse datasets from the UCI Machine Learning Repository, covering multiple domains, further highlighting its adaptability (variety). We also compared our types with the ones from Sherlock, a renowned method for Semantic Type Detection.

    Our evaluation showcases our method's proficiency in identifying data quality issues, detecting 81 missing values out of 922 attributes, compared to only one detected by YData Profiling (veracity). One dataset, containing over 2 million records, was processed efficiently, demonstrating the scalability of our approach (volume). These results underscore the enhanced capabilities of our method in streamlining data cleaning processes, ultimately improving the efficiency and effectiveness of data-driven decision-making across various domains (value).
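
    The general idea in the abstract (infer a semantic type from the attribute label, then apply type-specific rules to flag suspect values) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the keyword dictionary `LABEL_KEYWORDS`, the rule table `RULES`, the age bound of 120, and the function names `detect_semantic_type` and `find_suspect_values` are all assumptions made for illustration, and only a handful of the 23 types are covered.

```python
import re

# Hypothetical keyword dictionary mapping label tokens to semantic types.
# The paper describes 23 types; only a few are sketched here.
LABEL_KEYWORDS = {
    "age": "age",
    "percent": "percentage",
    "pct": "percentage",
    "email": "email",
    "id": "id",
    "count": "numerical_non_negative",
}

# Deliberately simple email pattern, for illustration only.
EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")


def _is_number(value):
    """Return True if the value can be parsed as a float."""
    try:
        float(value)
        return True
    except (TypeError, ValueError):
        return False


# Hypothetical per-type validity rules (bounds are assumptions).
RULES = {
    "age": lambda v: _is_number(v) and 0 <= float(v) <= 120,
    "percentage": lambda v: _is_number(v) and 0 <= float(v) <= 100,
    "numerical_non_negative": lambda v: _is_number(v) and float(v) >= 0,
    "email": lambda v: isinstance(v, str) and EMAIL_RE.match(v) is not None,
}


def detect_semantic_type(label):
    """Guess a semantic type from the attribute label alone."""
    tokens = re.split(r"[^a-z0-9]+", label.lower())
    for token in tokens:
        if token in LABEL_KEYWORDS:
            return LABEL_KEYWORDS[token]
    return "string"  # fallback when no keyword matches


def find_suspect_values(label, values):
    """Return the detected type and the values that violate its rule."""
    sem_type = detect_semantic_type(label)
    rule = RULES.get(sem_type)
    if rule is None:
        return sem_type, []
    return sem_type, [v for v in values if not rule(v)]
```

    For example, calling `find_suspect_values("Patient_Age", [34, -1, 999, "52"])` in this sketch would flag `-1` and `999`, the kind of placeholder values that often stand in for missing data and that a purely statistical profiler may not report as missing.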
