
dc.contributor.author: Liu, Xiu
dc.contributor.author: Aldrich, Chris
dc.date.accessioned: 2025-04-30T00:36:04Z
dc.date.available: 2025-04-30T00:36:04Z
dc.date.issued: 2022
dc.identifier.citation: Liu, X. and Aldrich, C. 2022. Explaining anomalies in coal proximity and coal processing data with Shapley and tree-based models. FUEL. 335: 126891.
dc.identifier.uri: http://hdl.handle.net/20.500.11937/97646
dc.identifier.doi: 10.1016/j.fuel.2022.126891
dc.description.abstract:

Modelling the characteristics and composition of coal is important, as proximity data and other measurements needed to do so are typically expensive or hard to acquire in real time. Understanding anomalies in these relatively small data sets is important, as their removal may result in an unnecessary loss of data or bias in the data used in the model. Although anomaly detection has been considered in depth in the literature, very little work has been devoted to the explanation of anomalies. In this paper, a general anomaly detection and identification methodology is considered, based on three models, viz. an isolation forest (IF), a random forest (RF) and a tree SHAP explanatory model. Three case studies related to the composition of coal and coal processing are considered. In these case studies, the IF-RF-SHAP approach identified outliers or data anomalies not identifiable with principal component analysis. The model is a new variant of some of the integrated approaches that have recently been considered. A further contribution of the study lies in the empirical comparison of IF anomaly scores with distance-based and reconstruction-based anomaly scores generated with principal component models. In the case studies considered, the IF anomaly scores were better able to identify anomalies in the data than the scores derived from the principal component models. As a result, the methodology can complement distance-based approaches, such as principal component analysis, to explain anomalies or outliers detected in data. Apart from the proposed IF-RF-SHAP approach, four approaches to compare the contributions of variables in random forest models are considered as well. These were simple correlation of individual predictors with anomaly scores of samples, random forest prediction based on an impurity criterion, random forest prediction based on a permutation criterion, and the tree SHAP approach. If the latter is considered as a benchmark, then the impurity criterion gave the most reliable results, while simple predictor correlations gave the least reliable results.
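
The workflow outlined in the abstract can be illustrated with a minimal sketch. It is not the authors' code: it assumes scikit-learn and NumPy, uses synthetic stand-in data rather than any of the coal data sets, and all function names, hyperparameters and the random seed are hypothetical choices made for illustration. An isolation forest scores the samples, a random forest is fitted to those scores, and the correlation, impurity and permutation measures are computed per variable. The tree SHAP step is shown as a separate helper that needs the optional `shap` package and is not called here.

```python
import numpy as np
from sklearn.ensemble import IsolationForest, RandomForestRegressor
from sklearn.inspection import permutation_importance


def fit_if_rf(X, random_state=0):
    """Score samples with an isolation forest, then fit a random forest to the scores."""
    iso = IsolationForest(n_estimators=200, random_state=random_state).fit(X)
    # score_samples is higher for normal points; negate so larger = more anomalous.
    scores = -iso.score_samples(X)
    rf = RandomForestRegressor(n_estimators=200, random_state=random_state)
    rf.fit(X, scores)
    return scores, rf


def importance_table(X, scores, rf, random_state=0):
    """Correlation, impurity and permutation measures, one value per variable."""
    corr = np.array(
        [abs(np.corrcoef(X[:, j], scores)[0, 1]) for j in range(X.shape[1])]
    )
    perm = permutation_importance(
        rf, X, scores, n_repeats=10, random_state=random_state
    )
    return {
        "correlation": corr,
        "impurity": rf.feature_importances_,
        "permutation": perm.importances_mean,
    }


def tree_shap_values(rf, X):
    """Per-sample, per-variable contributions; requires the optional `shap` package."""
    import shap  # imported lazily so the rest of the sketch runs without it

    return shap.TreeExplainer(rf).shap_values(X)


# Synthetic stand-in data (NOT from the paper): 200 samples, 4 variables,
# with the first five samples shifted far out along variable 2.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
X[:5, 2] += 8.0

scores, rf = fit_if_rf(X)
table = importance_table(X, scores, rf)
```

Sorting the samples by `scores` gives a ranking of candidate anomalies, and the three arrays in `table` can be compared against the output of `tree_shap_values(rf, X)` when `shap` is installed, which mirrors the comparison of variable-contribution measures mentioned in the abstract.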

dc.language: English
dc.publisher: Elsevier
dc.subject: Science & Technology
dc.subject: Technology
dc.subject: Energy & Fuels
dc.subject: Engineering, Chemical
dc.subject: Engineering
dc.subject: Anomaly detection
dc.subject: Isolation forest
dc.subject: Shapley value regression
dc.subject: Coal
dc.subject: Variable importance measures
dc.subject: Random forests
dc.subject: PRINCIPAL COMPONENT ANALYSIS
dc.subject: BITUMINOUS COAL
dc.subject: COMBUSTION
dc.subject: SYSTEM
dc.subject: PREDICTION
dc.subject: REGRESSION
dc.subject: BOILER
dc.subject: FOREST
dc.subject: CARBON
dc.subject: ASH
dc.title: Explaining anomalies in coal proximity and coal processing data with Shapley and tree-based models
dc.type: Journal Article
dcterms.source.volume: 335
dcterms.source.issn: 0016-2361
dcterms.source.title: FUEL
dc.date.updated: 2025-04-30T00:36:02Z
curtin.department: WASM: Minerals, Energy and Chemical Engineering
curtin.accessStatus: Open access
curtin.faculty: Faculty of Science and Engineering
curtin.contributor.orcid: Liu, Xiu [0000-0003-4592-7232]
curtin.contributor.orcid: Aldrich, Chris [0000-0003-2963-1140]
curtin.identifier.article-number: 126891
dcterms.source.eissn: 1873-7153
curtin.contributor.scopusauthorid: Aldrich, Chris [7103255150]
curtin.repositoryagreement: V3

