Attribute-Based Semantic Type Detection and Data Quality Assessment
Date
2024
Abstract
The increasing reliance on data-driven decision-making highlights the critical need for high-quality data. Despite advancements, data quality issues continue to impact both business strategies and scientific research. Current methods often fail to utilize the semantic richness embedded in attribute labels (or column names/headers in tables), leading to a crucial gap in comprehensive data quality evaluation.

This research addresses this gap by introducing an innovative methodology focused on Attribute-Based Semantic Type Detection and Data Quality Assessment. By leveraging semantic information in attribute labels, combined with rule-based analysis and comprehensive dictionaries, our approach effectively addresses four key Big Data challenges: variety, veracity, volume and value.

Our method provides a practical classification system of 23 semantic types, including numerical non-negative, categorical, ID, names, strings, geographical, temporal, and complex formats like URLs, IP addresses, email, and binary values plus several numerical bounded types, such as age and percentage (variety). The approach was validated across fifty diverse datasets from the UCI Machine Learning Repository, covering multiple domains, further highlighting its adaptability (variety). We also compared our types with the ones from Sherlock, a renowned method for Semantic Type Detection.

Our evaluation showcases our method's proficiency in identifying data quality issues, detecting 81 missing values out of 922 attributes, compared to only one detected by YData Profiling (veracity). One dataset, containing over 2 million records, was processed efficiently, demonstrating the scalability of our approach (volume). These results underscore the enhanced capabilities of our method in streamlining data cleaning processes, ultimately improving the efficiency and effectiveness of data-driven decision-making across various domains (value).
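The general idea described in the abstract, inferring a semantic type from an attribute label and then checking values against type-specific rules, can be illustrated with a minimal Python sketch. This is not the authors' implementation: the keyword dictionary, the rule thresholds (for example, the upper bound of 120 for age), and the function names `tokenize_label`, `detect_type`, and `find_suspect_values` are all hypothetical and chosen only for illustration.

```python
import re

# Hypothetical dictionary mapping label tokens to semantic types.
LABEL_KEYWORDS = {
    "age": "age",
    "percent": "percentage",
    "pct": "percentage",
    "email": "email",
}


def _is_number(value):
    """Return True if the string parses as a float."""
    try:
        float(value)
        return True
    except ValueError:
        return False


# Hypothetical validity rules per semantic type (bounds are assumptions).
TYPE_RULES = {
    "age": lambda v: v.isdecimal() and 0 <= int(v) <= 120,
    "percentage": lambda v: _is_number(v) and 0 <= float(v) <= 100,
    "email": lambda v: re.fullmatch(r"[^@\s]+@[^@\s]+\.[^@\s]+", v) is not None,
}


def tokenize_label(label):
    """Split a column name on camelCase boundaries and non-alphanumerics."""
    spaced = re.sub(r"([a-z])([A-Z])", r"\1 \2", label)
    return [t.lower() for t in re.split(r"[^A-Za-z0-9]+", spaced) if t]


def detect_type(label):
    """Return the first semantic type whose keyword appears in the label."""
    for token in tokenize_label(label):
        if token in LABEL_KEYWORDS:
            return LABEL_KEYWORDS[token]
    return "string"  # fallback when no keyword matches


def find_suspect_values(label, values):
    """Flag values that violate the rule for the label's detected type."""
    semantic_type = detect_type(label)
    rule = TYPE_RULES.get(semantic_type)
    if rule is None:
        return semantic_type, []
    suspects = [v for v in values if not rule(v.strip())]
    return semantic_type, suspects


if __name__ == "__main__":
    print(find_suspect_values("patient_age", ["34", "?", "-1", "999", "61"]))
```

In this sketch, placeholder entries such as `?` or out-of-range entries such as `999` in an age column are flagged because the label, not the raw values alone, determines which rule applies. A real system would need far larger dictionaries and rules covering all semantic types.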
Related items
Showing items related by title, author, creator and subject.
- Berwick, Lyndon (2009) The analytical capacity of MSSV pyrolysis has been used to extend the structural characterisation of aquatic natural organic matter (NOM). NOM can contribute to various potable water issues and is present in high ...
- Harrison, Christopher Bernard (2009) The use of seismic methods in hard rock environments in Western Australia for mineral exploration is a new and burgeoning technology. Traditionally, mineral exploration has relied upon potential field methods and surface ...
- Shiyab, Adnan M S H (2007) This study aimed at investigating the current practices and methods adopted by roads agencies around the world with regard to collection, analysis and utilization of the data elements pertaining to the main pavement ...