Variable selection in principal component analysis : using measures of multivariate association.
|Sithole, Moses M.
|Dr S. Ganeshanandam
This thesis is concerned with the selection of important variables in Principal Component Analysis (PCA) in such a way that the selected subsets of variables retain, as far as possible, the overall multivariate structure of the complete data. Throughout the thesis, the criteria used to meet this requirement are collectively referred to as measures of Multivariate Association (MVA). Most of the currently available selection methods may lead to inappropriate subsets, while Krzanowski's (1987) M_2-Procrustes criterion successfully identifies structure-bearing variables, particularly when groups are present in the data. Our major objective, however, is to use the idea of multivariate association to select subsets of the original variables that preserve any (unknown) multivariate structure that may be present in the data.

The first part of the thesis is devoted to the choice of the number of components (say, k) to be used in the variable selection process. Various methods that exist in the literature for choosing k are described, and comparative studies of these methods are reviewed. Currently available methods based exclusively on the eigenvalues of the covariance or correlation matrices, and those based on cross-validation, are unsatisfactory. Hence, we propose a new technique for choosing k based on the bootstrap methodology. A full comparative study of this new technique and the cross-validatory choice of k proposed by Eastment and Krzanowski (1982) is then carried out using data simulated from Monte Carlo experiments.

The remainder of the thesis focuses on variable selection in PCA using measures of MVA. Various existing selection methods are described, and the comparative studies of these methods available in the literature are reviewed. New methods for selecting variables, based on measures of MVA, are then proposed and compared among themselves as well as with the M_2-Procrustes criterion. This comparison is based on Monte Carlo simulation, and the behaviour of the selection methods is assessed in terms of the performance of the selected variables.

In summary, the Monte Carlo results suggest that the proposed bootstrap technique for choosing k generally performs better than the cross-validatory technique of Eastment and Krzanowski (1982). Similarly, the Monte Carlo comparison of the variable selection methods shows that the proposed methods are comparable with or better than Krzanowski's (1987) M_2-Procrustes criterion. These conclusions are based mainly on data simulated by means of Monte Carlo experiments. However, the techniques for choosing k and the various variable selection techniques are also evaluated on some real data sets. Some comments on alternative approaches and suggestions for possible extensions conclude the thesis.
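For readers unfamiliar with the Procrustes criterion used as the benchmark above, the following is a minimal sketch of how such a criterion is commonly computed: the scores on the first k principal components of the full data are compared with the scores obtained from a candidate subset of variables, and the residual sum of squares after optimal rotation is reported. It illustrates the general idea only and is not taken from the thesis; the function names `pc_scores` and `procrustes_m2`, the use of NumPy, and the simulated data are all assumptions made for illustration.

```python
import numpy as np


def pc_scores(X, k):
    """Scores on the first k principal components of column-centred X."""
    Xc = X - X.mean(axis=0)
    U, s, _ = np.linalg.svd(Xc, full_matrices=False)
    return U[:, :k] * s[:k]


def procrustes_m2(X, subset, k):
    """Procrustes residual between full-data and subset PC scores.

    Computes trace(YY' + ZZ') - 2 * sum of singular values of Z'Y, where
    Y holds the full-data scores and Z the subset scores (both n x k).
    Smaller values mean the subset reproduces the k-dimensional
    configuration more closely. Requires len(subset) >= k.
    """
    Y = pc_scores(X, k)
    Z = pc_scores(X[:, subset], k)
    sigma = np.linalg.svd(Z.T @ Y, compute_uv=False)
    return np.sum(Y ** 2) + np.sum(Z ** 2) - 2.0 * sigma.sum()


# Hypothetical data: 50 observations on 5 variables.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 5))

full = procrustes_m2(X, [0, 1, 2, 3, 4], k=2)  # all variables retained
part = procrustes_m2(X, [0, 1, 2], k=2)        # a candidate subset
```

Retaining every variable gives a residual of zero up to rounding error, so candidate subsets can be ranked by how small their residual is. The choice of k is taken as given in this sketch.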
|principal component analysis
|School of Mathematics and Statistics