Unsupervised modeling of multiple data sources : a latent shared subspace approach
Date
2011
Abstract
The growing number of information sources has given rise to joint analysis. While the research community has mainly focused on analyzing data from a single source, there have been relatively few attempts at jointly analyzing multiple data sources by sharing statistical strength across them. In general, the data from these sources emerge without labeling information and thus it is imperative to perform the joint analysis in an unsupervised manner.
This thesis addresses the above problem and presents a general shared subspace learning framework for jointly modeling multiple related data sources. Since the data sources are related, there exist common structures across these sources, which can be captured through a shared subspace. However, each source also has some individual structures, which can be captured through an individual subspace. Incorporating these concepts in nonnegative matrix factorization (NMF) based subspace learning, we develop a nonnegative shared subspace learning model for two data sources and demonstrate its application to tag-based social media retrieval. Extending this model, we impose additional regularization constraints of mutual orthogonality on the shared and individual subspaces and show that, compared to its unregularized counterpart, the new regularized model effectively deals with the problem of negative knowledge transfer – a key issue faced by transfer learning methods. The effectiveness of the regularized model is demonstrated through retrieval and clustering applications for a variety of data sets. To take advantage of more than one auxiliary source, we extend the above models from two sources to multiple sources, with the added flexibility of allowing arbitrary sharing configurations among the sources. The usefulness of this model is demonstrated through improved performance achieved with multiple auxiliary sources.
In addition, this model is used to relate items from disparate media types, allowing us to perform cross-media retrieval using tags.
Departing from the nonnegative models, we use a linear-Gaussian framework and develop Bayesian shared subspace learning, which not only models mixed-sign data but also learns probabilistic subspaces. Learning the subspace dimensionalities for the shared subspace models plays an important role in optimal knowledge transfer but requires model selection – a task that is computationally intensive and time consuming. To this end, we propose a nonparametric Bayesian joint factor analysis model that circumvents the problem of model selection by using a hierarchical beta process prior, inferring subspace dimensionalities automatically from the data. The effectiveness of this model is shown on both synthetic and real data. For the synthetic data, successful recovery of both shared and individual subspace dimensionalities is demonstrated, whilst for the real data, the model outperforms recent state-of-the-art techniques for text modeling and image retrieval.
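The two-source nonnegative model described in the abstract can be illustrated with a minimal sketch. The abstract does not give the thesis's objective function or update rules, so everything below is an assumption: an unweighted squared Frobenius reconstruction error and Lee-Seung style multiplicative updates, with each source approximated as X1 ≈ [W | V1] H1 and X2 ≈ [W | V2] H2, where W is the shared basis and V1, V2 are the individual bases. The function name `shared_subspace_nmf` and all parameter names are hypothetical, and the mutual orthogonality regularization mentioned above is not included.

```python
import numpy as np


def shared_subspace_nmf(X1, X2, k_shared, k1, k2, n_iter=200, eps=1e-9, seed=0):
    """Illustrative two-source shared-subspace NMF (not the thesis's exact method).

    X1 (m x n1) and X2 (m x n2) are nonnegative matrices over the same m features.
    Assumed objective: ||X1 - [W|V1] H1||_F^2 + ||X2 - [W|V2] H2||_F^2.
    """
    rng = np.random.default_rng(seed)
    m = X1.shape[0]
    W = rng.random((m, k_shared)) + eps    # basis shared by both sources
    V1 = rng.random((m, k1)) + eps         # basis specific to source 1
    V2 = rng.random((m, k2)) + eps         # basis specific to source 2
    H1 = rng.random((k_shared + k1, X1.shape[1])) + eps
    H2 = rng.random((k_shared + k2, X2.shape[1])) + eps
    errors = []

    for _ in range(n_iter):
        B1 = np.hstack([W, V1])
        B2 = np.hstack([W, V2])

        # Multiplicative update of the encodings for each source.
        H1 *= (B1.T @ X1) / (B1.T @ B1 @ H1 + eps)
        H2 *= (B2.T @ X2) / (B2.T @ B2 @ H2 + eps)

        # Current reconstructions, reused by all basis updates below.
        R1 = B1 @ H1
        R2 = B2 @ H2
        H1s, H1i = H1[:k_shared], H1[k_shared:]
        H2s, H2i = H2[:k_shared], H2[k_shared:]

        # The shared basis receives evidence from both sources;
        # each individual basis receives evidence from its own source only.
        W *= (X1 @ H1s.T + X2 @ H2s.T) / (R1 @ H1s.T + R2 @ H2s.T + eps)
        V1 *= (X1 @ H1i.T) / (R1 @ H1i.T + eps)
        V2 *= (X2 @ H2i.T) / (R2 @ H2i.T + eps)

        # Track the assumed objective after this iteration.
        err = (np.linalg.norm(X1 - np.hstack([W, V1]) @ H1) ** 2
               + np.linalg.norm(X2 - np.hstack([W, V2]) @ H2) ** 2)
        errors.append(err)

    return W, V1, V2, H1, H2, errors
```

In this sketch the only coupling between the two sources is the shared basis W, which is how common structure would be transferred from one source to the other; the sizes `k_shared`, `k1` and `k2` must be chosen by hand, which is the model selection burden the nonparametric Bayesian model in the abstract is designed to avoid.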
Related items
Showing items related by title, author, creator and subject.
- Gupta, Sunil; Phung, Dinh; Adams, Brett; Venkatesh, Svetha (2011) Joint modeling of related data sources has the potential to improve various data mining tasks such as transfer learning, multitask clustering, information retrieval etc. However, diversity among various data sources might ...
- Gupta, Sunil; Phung, Dinh; Adams, Brett; Venkatesh, Svetha (2011) This paper presents a novel Bayesian formulation to exploit shared structures across multiple data sources, constructing foundations for effective mining and retrieval across disparate domains. We jointly analyze diverse ...
- Gupta, Sunil; Phung, Dinh; Adams, Brett; Tran, Truyen; Venkatesh, Svetha (2010) Although tagging has become increasingly popular in online image and video sharing systems, tags are known to be noisy, ambiguous, incomplete and subjective. These factors can seriously affect the precision of a social ...