Quality and interestingness of association rules derived from data mining of relational and semi-structured data
Access Status
Authors
Date
2012Supervisor
Type
Award
Metadata
Show full item recordSchool
Collection
Abstract
Deriving useful and interesting rules from a data mining system are essential and important tasks. Problems such as the discovery of random and coincidental patterns or patterns with no significant values, and the generation of a large volume of rules from a database commonly occur. Works on sustaining the interestingness of rules generated by data mining algorithms are actively and constantly being examined and developed. As the data mining techniques are data-driven, it is beneficial to affirm the rules using a statistical approach. It is important to establish the ways in which the existing statistical measures and constraint parameters can be effectively utilized and the sequence of their usage.In this thesis, a systematic way to evaluate the association rules discovered from frequent, closed and maximal itemset mining algorithms; and frequent subtree mining algorithm including the rules based on induced, embedded and disconnected subtrees is presented. With reference to the frequent subtree mining, in addition a new direction is explored based on utilizing the DSM approach capable of preserving all information from tree-structured database in a flat data format, consequently enabling the direct application of a wider range of data mining analysis/techniques to tree-structured data. Implications of this approach were investigated and it was found that basing rules on disconnected subtrees, can be useful in terms of increasing the accuracy and the coverage rate of the rule set.A strategy that combines data mining and statistical measurement techniques such as sampling, redundancy and contradictive checks, correlation and regression analysis to evaluate the rules is developed. This framework is then applied to real-world datasets that represent diverse characteristics of data/items. Empirical results show that with a proper combination of data mining and statistical analysis, the proposed framework is capable of eliminating a large number of non-significant, redundant and contradictive rules while preserving relatively valuable high accuracy rules. Moreover, the results reveal the important characteristics and differences between mining frequent, closed or maximal itemsets; and mining frequent subtree including the rules based on induced, embedded and disconnected subtrees; as well as the impact of confidence measure for the prediction and classification task.
Related items
Showing items related by title, author, creator and subject.
-
Hadzic, Fedja; Dillon, Tharam S.; Sidhu, Amandeep; Chang, Elizabeth; Tan, H. (2006)In this paper we consider the 'Prions' database that describes protein instances stored for Human Prion Proteins. The Prions database can be viewed as a database of rooted ordered labeled subtrees. Mining frequent ...
-
Tan, H.; Hadzic, Fedja; Dillon, T. (2012)The increasing need for representing information through more complex structures where semantics and relationships among data objects can be more easily expressed has resulted in many semi-structured data sources. Structure ...
-
Tan, Henry; Hadzic, Fedja; Dillon, Tharam S.; Chang, Elizabeth; Feng, Ling; Feng, L. (2008)Due to the inherent flexibilities in both structure and semantics, XML association rules mining faces few challenges, such as: a more complicated hierarchical data structure and ordered data context. Mining frequent ...