Improving the relevance of search results via search-term disambiguation and ontological filtering

Zhu, Dengya

9348_Simon Thesis Final.pdf (12.41Mb)

Access Status

Open access

Authors

Zhu, Dengya

Date

2007

Supervisor

Assoc. Prof. Heinz Dreher

Type

Thesis

Award

MInfoSys

School

School of Information Systems

URI

http://hdl.handle.net/20.500.11937/2486

Collection

Curtin Theses

Abstract

With the exponential growth of the Web and the inherent polysemy and synonymy of natural languages, search engines face many challenges, such as information overload, mismatched search results, missing relevant documents, poorly organized search results, and a mismatch between the human mental model and clustering engines. To address these issues, much effort has been devoted to improving the relevance of search results, including the use of different information retrieval (IR) models, information categorization and clustering, personalization, the semantic Web, and ontology-based IR. The major focus of this study is to dynamically re-organize Web search results under a socially constructed hierarchical knowledge structure, to help information seekers access and manipulate the retrieved search results, and consequently to improve the relevance of search results.

To achieve this research goal, a special search-browser is developed and its retrieval effectiveness is evaluated. The hierarchical structure of the Open Directory Project (ODP) is employed as the socially constructed knowledge structure and is represented by the Tree component of Java. The Yahoo! Search Web Services API is used to obtain search results directly from the Yahoo! search engine databases. The Lucene text search engine calculates similarities between each returned search result and the semantic characteristics of each ODP category, and the search results are then assigned to the corresponding ODP categories by a Majority Voting algorithm. When a user selects a category of interest, only the search results categorized under that category are presented, and the quality of the search results is consequently improved.

Experiments demonstrate that the proposed approach can improve the precision of Yahoo! search results at the 11 standard recall levels from an average of 41.7 per cent to 65.2 per cent, an improvement of 23.5 percentage points.
This conclusion is verified by comparing the P@5 and P@10 of the Yahoo! search results with those of the categorized search results of the special search-browser. The improvements in P@5 and P@10 are 38.3 percentage points (85 per cent - 46.7 per cent) and 28 percentage points (70 per cent - 42 per cent), respectively. The experiment is well designed and controlled. To minimize the subjectivity of relevance judgments, five judges (experts) are asked to make their relevance judgments independently, and the final relevance judgment is a combination of the five judges' judgments. The judges are presented with only the search-terms, the information needs, and the 50 search results returned by the Yahoo! Search Web Services API. They are asked to make relevance judgments based on this information alone; no categorization information is provided.

The first contribution of this research is the use of an extracted category-document to represent the semantic characteristics of each ODP category. A category-document is composed of the topic of the category, the description of the category, and the titles and brief descriptions of the Web pages submitted under this category. Experimental results demonstrate that the category-documents can represent the semantic characteristics of the ODP categories in most cases. Furthermore, for machine learning algorithms, the extracted category-documents can be used as training data, which would otherwise demand much human labor to create in order to ensure that the learning algorithm is properly trained. The second contribution of this research is the proposal of two new concepts, relevance judgment convergent degree and relevance judgment divergent degree, which measure how well different judges agree with each other when they are asked to judge the relevance of a list of search results. When the relevance judgment convergent degree of a search-term is high, an IR algorithm should obtain a higher precision as well.
On the other hand, if the relevance judgment convergent degree is low, or the relevance judgment divergent degree is high, it is questionable whether the data should be used to evaluate the IR algorithm. This intuition is supported by the experiment of this research. The last contribution of this research is that the developed search-browser is, to the best of my knowledge, the first IR system (IRS) to utilize the ODP hierarchical structure to categorize and filter search results.
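
The evaluation idea in the abstract (combine five judges' judgments, then compare P@k for the raw result list against P@k for the results under one selected category) can be illustrated with a minimal Python sketch. Everything concrete here is an assumption made for illustration only: the toy results, the category labels, the function names, and in particular the simple-majority rule used to combine the judges, since the abstract says only that the final judgment is "a combination" of the five. It is not the thesis's implementation.

```python
def combine_judgments(votes):
    """Combine per-judge binary labels (1 = relevant) into one label.

    Assumption: simple majority. The abstract does not state the actual rule.
    """
    return 2 * sum(votes) > len(votes)


def precision_at_k(relevance, k):
    """Fraction of the top-k ranked results that are judged relevant."""
    if k <= 0:
        raise ValueError("k must be positive")
    return sum(relevance[:k]) / k


def filter_by_category(results, category):
    """Keep only results assigned to the selected category, preserving rank order."""
    return [r for r in results if r["category"] == category]


# Hypothetical ranked results for an ambiguous search-term ("jaguar"),
# where the information need is the animal. Votes come from five judges.
results = [
    {"title": "Jaguar cars",            "category": "Autos",   "votes": [0, 0, 0, 0, 0]},
    {"title": "Jaguar (animal) facts",  "category": "Biology", "votes": [1, 1, 1, 1, 1]},
    {"title": "Jaguar dealership",      "category": "Autos",   "votes": [0, 0, 1, 0, 0]},
    {"title": "Big cats of S. America", "category": "Biology", "votes": [1, 1, 0, 1, 1]},
    {"title": "Jaguar XK review",       "category": "Autos",   "votes": [0, 0, 0, 0, 0]},
    {"title": "Jaguar conservation",    "category": "Biology", "votes": [1, 1, 1, 0, 1]},
    {"title": "Jaguars football team",  "category": "Sports",  "votes": [0, 0, 0, 0, 0]},
    {"title": "Panthera onca taxonomy", "category": "Biology", "votes": [1, 0, 1, 1, 0]},
    {"title": "Jaguar parts catalogue", "category": "Autos",   "votes": [0, 1, 0, 0, 0]},
    {"title": "Zoo gift shop",          "category": "Biology", "votes": [0, 0, 1, 0, 0]},
]

# Relevance of the raw ranked list, as the judges see it (no category information).
baseline = [combine_judgments(r["votes"]) for r in results]
baseline_p5 = precision_at_k(baseline, 5)

# Relevance of the list a user sees after selecting the "Biology" category.
selected = filter_by_category(results, "Biology")
filtered = [combine_judgments(r["votes"]) for r in selected]
filtered_p5 = precision_at_k(filtered, 5)

print(f"P@5 raw list:      {baseline_p5:.2f}")
print(f"P@5 under category: {filtered_p5:.2f}")
print(f"Improvement (percentage points): {100 * (filtered_p5 - baseline_p5):.1f}")
```

In this toy data the gain comes entirely from removing off-topic senses of the search-term before ranking is measured, which is the mechanism the abstract describes. The figures produced by the sketch have no relation to the thesis's reported results.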