Explaining anomalies in coal proximity and coal processing data with Shapley and tree-based models

Liu, Xiu; Aldrich, Chris

doi:10.1016/j.fuel.2022.126891

dc.contributor.author	Liu, Xiu
dc.contributor.author	Aldrich, Chris
dc.date.accessioned	2025-04-30T00:36:04Z
dc.date.available	2025-04-30T00:36:04Z
dc.date.issued	2022
dc.identifier.citation	Liu, X. and Aldrich, C. 2022. Explaining anomalies in coal proximity and coal processing data with Shapley and tree-based models. FUEL. 335: 126891.
dc.identifier.uri	http://hdl.handle.net/20.500.11937/97646
dc.identifier.doi	10.1016/j.fuel.2022.126891
dc.description.abstract	Modelling the characteristics and composition of coal is important, as proximity data and other measurements to do so are typically expensive or hard to acquire in real time. Understanding anomalies in these relatively small data sets is important, as their removal may result in an unnecessary loss of data or bias in the data used in the model. Although anomaly detection has been considered in depth in the literature, very little work has been devoted to the explanation of anomalies. In this paper, a general anomaly detection and identification methodology is considered, based on three models, viz. an isolation forest (IF), a random forest (RF) and a tree SHAP explanatory model. Three case studies related to the composition of coal and coal processing are considered. In these case studies, the IF-RF-SHAP approach identified outliers or data anomalies not identifiable with principal component analysis. The model is a new variant of some of the integrated approaches that have recently been considered. A further contribution of the study lies in the empirical comparison of IF anomaly scores with distance-based and reconstruction-based anomaly scores generated with principal component models. In the case studies considered, the IF anomaly scores were better able to identify anomalies in the data than the scores derived from the principal component models. As a result, the methodology can complement distance-based approaches, such as principal component analysis, to explain anomalies or outliers detected in data. Apart from the proposed IF-RF-SHAP approach, four approaches to compare the contributions of variables in random forest models are considered as well. These were simple correlation of individual predictors with anomaly scores of samples, random forest prediction based on an impurity criterion, random forest prediction based on a permutation criterion, as well as the tree SHAP approach. If the latter is considered as a benchmark, then the impurity criterion gave the most reliable results, while simple predictor correlations gave the least reliable results.
dc.language	English
dc.publisher	Elsevier
dc.relation.sponsoredby	http://purl.org/au-research/grants/arc/CE200100009
dc.subject	Science & Technology
dc.subject	Technology
dc.subject	Energy & Fuels
dc.subject	Engineering, Chemical
dc.subject	Engineering
dc.subject	Anomaly detection
dc.subject	Isolation forest
dc.subject	Shapley value regression
dc.subject	Coal
dc.subject	Variable importance measures
dc.subject	Random forests
dc.subject	PRINCIPAL COMPONENT ANALYSIS
dc.subject	BITUMINOUS COAL
dc.subject	COMBUSTION
dc.subject	SYSTEM
dc.subject	PREDICTION
dc.subject	REGRESSION
dc.subject	BOILER
dc.subject	FOREST
dc.subject	CARBON
dc.subject	ASH
dc.title	Explaining anomalies in coal proximity and coal processing data with Shapley and tree-based models
dc.type	Journal Article
dcterms.source.volume	335
dcterms.source.issn	0016-2361
dcterms.source.title	FUEL
dc.date.updated	2025-04-30T00:36:02Z
curtin.department	WASM: Minerals, Energy and Chemical Engineering
curtin.accessStatus	Open access
curtin.faculty	Faculty of Science and Engineering
curtin.contributor.orcid	Liu, Xiu [0000-0003-4592-7232]
curtin.contributor.orcid	Aldrich, Chris [0000-0003-2963-1140]
curtin.identifier.article-number	126891
dcterms.source.eissn	1873-7153
curtin.contributor.scopusauthorid	Aldrich, Chris [7103255150]
curtin.repositoryagreement	V3
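
The abstract above relies on isolation forest (IF) anomaly scores as the first stage of the IF-RF-SHAP approach. The block below is a minimal, standard-library sketch of how such scores can be computed; it is not the authors' implementation, and it does not cover the random forest or tree SHAP stages. The function names (`c_factor`, `build_tree`, `path_length`, `isolation_scores`), the parameter defaults, and the synthetic two-column data are all hypothetical choices made for illustration.

```python
import math
import random

EULER_GAMMA = 0.5772156649


def c_factor(n):
    """Average path length of an unsuccessful binary-search-tree lookup over n points."""
    if n <= 1:
        return 0.0
    if n == 2:
        return 1.0
    return 2.0 * (math.log(n - 1) + EULER_GAMMA) - 2.0 * (n - 1) / n


def build_tree(rows, depth, max_depth, rng):
    """Grow one isolation tree using random axis-aligned splits."""
    if depth >= max_depth or len(rows) <= 1:
        return {"size": len(rows)}
    feature = rng.randrange(len(rows[0]))
    lo = min(r[feature] for r in rows)
    hi = max(r[feature] for r in rows)
    if lo == hi:  # no split possible on this feature, stop here
        return {"size": len(rows)}
    split = rng.uniform(lo, hi)
    left = [r for r in rows if r[feature] < split]
    right = [r for r in rows if r[feature] >= split]
    return {
        "feature": feature,
        "split": split,
        "left": build_tree(left, depth + 1, max_depth, rng),
        "right": build_tree(right, depth + 1, max_depth, rng),
    }


def path_length(x, node):
    """Depth at which x reaches a leaf, plus a correction for unsplit leaf rows."""
    depth = 0
    while "size" not in node:
        node = node["left"] if x[node["feature"]] < node["split"] else node["right"]
        depth += 1
    return depth + c_factor(node["size"])


def isolation_scores(data, n_trees=100, sample_size=64, seed=0):
    """Return one anomaly score in (0, 1] per row; higher means easier to isolate."""
    if len(data) < 2:
        raise ValueError("need at least two rows")
    rng = random.Random(seed)
    m = min(sample_size, len(data))
    max_depth = math.ceil(math.log2(m))
    trees = [build_tree(rng.sample(data, m), 0, max_depth, rng) for _ in range(n_trees)]
    norm = c_factor(m)
    return [
        2.0 ** (-sum(path_length(x, t) for t in trees) / (n_trees * norm))
        for x in data
    ]


# Hypothetical data: 60 samples clustered around (10, 30) plus one distant sample.
rng = random.Random(1)
data = [(rng.gauss(10.0, 1.0), rng.gauss(30.0, 1.0)) for _ in range(60)]
data.append((25.0, 5.0))
scores = isolation_scores(data)
print("index of highest anomaly score:", max(range(len(data)), key=scores.__getitem__))
```

In the methodology the abstract describes, scores of this kind would then be explained with a random forest and tree SHAP; a maintained library implementation is likely preferable to this sketch for real data.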

Files in this item

Name: 97410.pdf
Size: 1.587Mb
Format: PDF

This item appears in the following Collection(s)

Curtin Research Publications
