Unsupervised modeling of multiple data sources : a latent shared subspace approach

Gupta, Sunil Kumar

170275_Gupta2011.pdf (1.187Mb)

Access Status

Open access

Authors

Gupta, Sunil Kumar

Date

2011

Supervisor

Prof. Svetha Venkatesh

Type

Thesis

Award

PhD

Metadata

Show full item record

School

Department of Computing

URI

http://hdl.handle.net/20.500.11937/2583

Collection

Curtin Theses

Abstract

The growing number of information sources has given rise to joint analysis. While the research community has mainly focused on analyzing data from a single source, there has been relatively few attempts on jointly analyzing multiple data sources exploiting their statistical sharing strengths. In general, the data from these sources emerge without labeling information and thus it is imperative to perform the joint analysis in an unsupervised manner.This thesis addresses the above problem and presents a general shared subspace learning framework for jointly modeling multiple related data sources. Since the data sources are related, there exist common structures across these sources, which can be captured through a shared subspace. However, each source also has some individual structures, which can be captured through an individual subspace. Incorporating these concepts in nonnegative matrix factorization (NMF) based subspace learning, we develop a nonnegative shared subspace learning model for two data sources and demonstrate its application to tag based social media retrieval. Extending this model, we impose additional regularization constraints of mutual orthogonality on the shared and individual subspaces and show that, compared to its unregularized counterpart, the new regularized model effectively deals with the problem of negative knowledge transfer – a key issue faced by transfer learning methods. The effectiveness of the regularized model is demonstrated through retrieval and clustering applications for a variety of data sets. To take advantage from more than one auxiliary source, we extend above models generalizing two sources to multiple sources with an added flexibility of allowing sources having arbitrary sharing configurations. The usefulness of this model is demonstrated through improved performance, achieved with multiple auxiliary sources. In addition, this model is used to relate the items from disparate media types allowing us to perform cross-media retrieval using tags.Departing from the nonnegative models, we use a linear-Gaussian framework and develop Bayesian shared subspace learning, which not only models the mixed-sign data but also learns probabilistic subspaces. Learning the subspace dimensionalities for the shared subspace models has an important role in optimum knowledge transfer but requires model selection – a task that is computationally intensive and time consuming. To this end, we xii propose a nonparametric Bayesian joint factor analysis model that circumvents the problem of model selection by using a hierarchical beta process prior, inferring subspace dimensionalities automatically from the data. The effectiveness of this model is shown on both synthetic and real data sets. For synthetic data set, successful recovery of both shared and individual subspace dimensionalities is demonstrated, whilst for real data set, the model outperforms recent state-of-the-art techniques for text modeling and image retrieval.