Attribute-Based Semantic Type Detection and Data Quality Assessment
dc.contributor.author | Silva, Marcelo Valentim | |
dc.contributor.author | Herrmann, Hannes | |
dc.contributor.author | Maxville, Valerie | |
dc.date.accessioned | 2025-04-17T01:50:24Z | |
dc.date.available | 2025-04-17T01:50:24Z | |
dc.date.issued | 2024 | |
dc.identifier.citation | Silva, M.V. and Herrmann, H. and Maxville, V. 2024. Attribute-Based Semantic Type Detection and Data Quality Assessment. In IEEE/ACM International Symposium on Big Data Computing (BDC), 16-19 Dec 2024, Sharjah, United Arab Emirates. | |
dc.identifier.uri | http://hdl.handle.net/20.500.11937/97536 | |
dc.identifier.doi | 10.1109/BDCAT63179.2024.00030 | |
dc.description.abstract |
The increasing reliance on data-driven decision-making highlights the critical need for high-quality data. Despite advancements, data quality issues continue to impact both business strategies and scientific research. Current methods often fail to utilize the semantic richness embedded in attribute labels (or column names/headers in tables), leading to a crucial gap in comprehensive data quality evaluation. This research addresses this gap by introducing an innovative methodology focused on Attribute-Based Semantic Type Detection and Data Quality Assessment. By leveraging semantic information in attribute labels, combined with rule-based analysis and comprehensive dictionaries, our approach effectively addresses four key Big Data challenges: variety, veracity, volume and value. Our method provides a practical classification system of 23 semantic types, including numerical non-negative, categorical, ID, names, strings, geographical, temporal, and complex formats like URLs, IP addresses, email, and binary values, plus several numerical bounded types, such as age and percentage (variety). The approach was validated across fifty diverse datasets from the UCI Machine Learning Repository, covering multiple domains, further highlighting its adaptability (variety). We also compared our types with the ones from Sherlock, a renowned method for Semantic Type Detection. Our evaluation showcases our method's proficiency in identifying data quality issues, detecting 81 missing values out of 922 attributes, compared to only one detected by YData Profiling (veracity). One dataset, containing over 2 million records, was processed efficiently, demonstrating the scalability of our approach (volume). These results underscore the enhanced capabilities of our method in streamlining data cleaning processes, ultimately improving the efficiency and effectiveness of data-driven decision-making across various domains (value). | |
dc.publisher | IEEE | |
dc.title | Attribute-Based Semantic Type Detection and Data Quality Assessment | |
dc.type | Conference Paper | |
dcterms.source.startPage | 119 | |
dcterms.source.endPage | 124 | |
dcterms.source.title | 2024 IEEE/ACM International Conference on Big Data Computing, Applications and Technologies (BDCAT) | |
dcterms.source.conference | IEEE/ACM International Symposium on Big Data Computing (BDC) | |
dcterms.source.conference-start-date | 16 Dec | |
dcterms.source.conferencelocation | Sharjah, United Arab Emirates | |
dc.date.updated | 2025-04-17T01:50:24Z | |
curtin.note |
© 2024 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works. | |
curtin.accessStatus | Open access | |
dcterms.source.conference-end-date | 19 Dec | |
curtin.repositoryagreement | V3 |