Attribute-Based Semantic Type Detection and Data Quality Assessment

Silva, Marcelo Valentim; Herrmann, Hannes; Maxville, Valerie

doi:10.1109/BDCAT63179.2024.00030

dc.contributor.author	Silva, Marcelo Valentim
dc.contributor.author	Herrmann, Hannes
dc.contributor.author	Maxville, Valerie
dc.date.accessioned	2025-04-17T01:50:24Z
dc.date.available	2025-04-17T01:50:24Z
dc.date.issued	2024
dc.identifier.citation	Silva, M.V. and Herrmann, H. and Maxville, V. 2024. Attribute-Based Semantic Type Detection and Data Quality Assessment. In IEEE/ACM International Symposium on Big Data Computing (BDC), 16-19 Dec 2024, Sharjah, United Arab Emirates.
dc.identifier.uri	http://hdl.handle.net/20.500.11937/97536
dc.identifier.doi	10.1109/BDCAT63179.2024.00030
dc.description.abstract	The increasing reliance on data-driven decision-making highlights the critical need for high-quality data. Despite advancements, data quality issues continue to impact both business strategies and scientific research. Current methods often fail to utilize the semantic richness embedded in attribute labels (or column names/headers in tables), leading to a crucial gap in comprehensive data quality evaluation. This research addresses this gap by introducing an innovative methodology focused on Attribute-Based Semantic Type Detection and Data Quality Assessment. By leveraging semantic information in attribute labels, combined with rule-based analysis and comprehensive dictionaries, our approach effectively addresses four key Big Data challenges: variety, veracity, volume and value. Our method provides a practical classification system of 23 semantic types, including numerical non-negative, categorical, ID, names, strings, geographical, temporal, and complex formats like URLs, IP addresses, email, and binary values plus several numerical bounded types, such as age and percentage (variety). The approach was validated across fifty diverse datasets from the UCI Machine Learning Repository, covering multiple domains, further highlighting its adaptability (variety). We also compared our types with the ones from Sherlock, a renowned method for Semantic Type Detection. Our evaluation showcases our method's proficiency in identifying data quality issues, detecting 81 missing values out of 922 attributes, compared to only one detected by YData Profiling (veracity). One dataset, containing over 2 million records, was processed efficiently, demonstrating the scalability of our approach (volume). These results underscore the enhanced capabilities of our method in streamlining data cleaning processes, ultimately improving the efficiency and effectiveness of data-driven decision-making across various domains (value).
dc.publisher	IEEE
dc.title	Attribute-Based Semantic Type Detection and Data Quality Assessment
dc.type	Conference Paper
dcterms.source.startPage	119
dcterms.source.endPage	124
dcterms.source.title	2024 IEEE/ACM International Conference on Big Data Computing, Applications and Technologies (BDCAT)
dcterms.source.conference	IEEE/ACM International Symposium on Big Data Computing (BDC)
dcterms.source.conference-start-date	16 Dec
dcterms.source.conferencelocation	Sharjah, United Arab Emirates
dc.date.updated	2025-04-17T01:50:24Z
curtin.note	© 2024 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works.
curtin.accessStatus	Open access
dcterms.source.conference-end-date	19 Dec
curtin.repositoryagreement	V3

Files in this item

Name: 97300.pdf
Size: 609.6Kb
Format: PDF

This item appears in the following Collection(s)

Curtin Research Publications
