Improving the relevance of web search results by combining web snippet categorization, clustering and personalization

Zhu, Dengya

155024_Zhu2010.pdf (4.096Mb)

Access Status

Open access

Authors

Zhu, Dengya

Date

2010

Supervisor

Dr. Heinz Dreher

Type

Thesis

Award

PhD

School

School of Information Systems

URI

http://hdl.handle.net/20.500.11937/326

Collection

Curtin Theses

Abstract

Web search results are far from perfect because of the polysemous and synonymous characteristics of natural languages, information overload resulting from the information explosion on the Web, and the flat-list, “one size fits all” strategy that search engines use to present results without regard to users’ personal information needs.

Re-organizing Web search results, or Web snippets, by text categorization or by clustering are the two dominant approaches to these issues. Text categorization uses a collection of labeled documents to train a classifier, which can then predict labels for new unlabeled documents, while text clustering groups unlabeled documents by finding common properties shared among the documents in the same group. The difficulty with categorization is that human-labeled training documents are very expensive to obtain and thus scarce at the moment; for text clustering, how to label the generated groups is still an open research question. In addition, a Web snippet returned by a search engine contains only the title of a webpage and an optional, very short (fewer than 30 words) description of the page. This lack of information in Web snippets is another challenge for both text categorization and clustering.

The primary objective of this research is to improve the relevance of Web search results and thus provide the user with a better search experience. To achieve this objective, the research combines Web snippet categorization, clustering and personalization techniques to recommend relevant results to search users. Using design research methodology, the study develops an IT artifact named RIB (Recommender Intelligent Browser). RIB categorizes Web snippets using a socially constructed Web directory such as the Open Directory Project (ODP); the semantic characteristics of the ODP categories are extracted to generate a series of labeled document sets.
At the same time, the Web snippets are clustered to boost the quality of the categorization. Based on search preferences in a user profile, which is generated automatically from information extracted from the user’s personal computer with the user’s approval, the proposed search method recommends personalized search results to users. Experimental data demonstrate that the mean average precision improvement of RIB over Yahoo Search Web Services API, based on 25 search terms with 1250 Web snippets, is 7.84%, from 55.55% for Yahoo to 64.29% for RIB.

A novel boostingUp algorithm is also proposed in this research to improve the performance of text categorization by leveraging the power of text clustering, and vice versa. Experimental results illustrate that boostingUp can marginally improve the performance of both Web snippet categorization and clustering in terms of Adjusted Rand Index and F₁. For Naïve Bayes combined with K-Means, boostingUp produces a 0.97% improvement in macro-averaged F₁ (from 24.51% to 25.48%) and a 2.04% improvement in micro-averaged F₁ (from 32.17% to 34.21%). Conversely, for K-Means combined with Naïve Bayes, the improvement in Adjusted Rand Index is 2.35% (from 13.17% to 15.52%), and the improvement in F₁ is 2.37% (from 21.45% to 23.82%).

The lack of labeled data sets that can be used for Web snippet categorization, and as benchmark document collections for evaluating text categorization and clustering algorithms, is addressed by extracting the semantic characteristics of ODP categories to generate a series of labeled categoryDocument sets. Statistical information about the generated data sets is provided as well.
The generated categoryDocuments are used to evaluate the performance of the Naïve Bayes, AdaBoost, and kNN text categorization algorithms when feature selection algorithms, including Chi-square, Mutual Information, Information Gain, and Odds Ratio, are employed to select 50, 80, 100, 200, 300, 500, 1000, 2000, 3000, 5000, and 10000 features. Other text categorization algorithms, such as SVMlight and a Statistical Language Model based algorithm, and feature selection algorithms such as GSS Coefficient, NGL Coefficient, and Relevancy Score, are also evaluated on a specially designed small data set. Two other proposed algorithms, the R²Cut thresholding strategy and Z-tfidf, are evaluated at the same time and are shown to slightly improve the performance of text categorization. Text clustering algorithms such as K-Means and Hierarchical Agglomerative Clustering are also evaluated using the generated categoryDocument sets. All algorithms involved in this research were implemented in Java.

In addition, this research is the first to present detailed information about the hierarchy of the ODP, the world’s most comprehensive human-edited Web directory, by analyzing the data in two publicly accessible files under a Free Use License. Although ODP is adopted as the core directory service for the world’s most popular search engines, such as Google, AOL Search, Netscape Search, Lycos, HotBot and hundreds of others, and is used for a wide range of research purposes, no detailed hierarchical information about ODP has been published so far.

The research further verifies the relationship between precision improvement and the degree of convergence of relevance judgments when the effectiveness of an information retrieval system is evaluated on the basis of human relevance judgments, and reveals that the two variables are to some extent correlated in terms of the correlation coefficient.

Improving the relevance of Web searching is challenging.
This research proposes combining text categorization, clustering and personalization to provide a better search experience to users. Comprehensive experimental evidence and favorable comparisons against the search results of the Yahoo API demonstrate that the designed search objectives have been achieved.
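
The abstract reports both macro-averaged and micro-averaged F1 scores. As a reading aid, the sketch below shows how the two averages are conventionally computed from per-category counts. It is an illustration only, not the thesis's Java implementation: the `CategoryCounts` container, the function names, and the toy counts are all hypothetical.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class CategoryCounts:
    """Per-category confusion counts (hypothetical container, not from the thesis)."""
    tp: int  # snippets correctly assigned to the category
    fp: int  # snippets wrongly assigned to the category
    fn: int  # snippets of the category that the classifier missed


def f1(tp: int, fp: int, fn: int) -> float:
    """F1 = 2PR / (P + R), which simplifies to 2*tp / (2*tp + fp + fn)."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def macro_f1(counts: Sequence[CategoryCounts]) -> float:
    """Average the per-category F1 scores; every category weighs the same."""
    if not counts:
        raise ValueError("at least one category is required")
    return sum(f1(c.tp, c.fp, c.fn) for c in counts) / len(counts)


def micro_f1(counts: Sequence[CategoryCounts]) -> float:
    """Pool the counts over all categories first, then compute a single F1."""
    tp = sum(c.tp for c in counts)
    fp = sum(c.fp for c in counts)
    fn = sum(c.fn for c in counts)
    return f1(tp, fp, fn)


if __name__ == "__main__":
    # Toy counts, invented for illustration: one frequent and one rare category.
    toy = [CategoryCounts(tp=8, fp=2, fn=2), CategoryCounts(tp=1, fp=1, fn=3)]
    print(f"macro F1 = {macro_f1(toy):.4f}")
    print(f"micro F1 = {micro_f1(toy):.4f}")
```

In this toy input the poorly classified rare category pulls the macro average below the micro average, because macro averaging gives each category equal weight while micro averaging gives each snippet equal weight.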