Cumulated Gain-Based Evaluation of IR Techniques

Modern large retrieval environments tend to overwhelm their users with their large output. Since not all documents are equally relevant to their users, highly relevant documents should be identified and ranked first for presentation. To develop IR techniques in this direction, it is necessary to develop evaluation approaches and methods that credit IR methods for their ability to retrieve highly relevant documents. This can be done by extending traditional evaluation methods, that is, recall and precision based on binary relevance judgments, to graded relevance judgments. Alternatively, novel measures based on graded relevance judgments may be developed. This article proposes several novel measures that compute the cumulative gain the user obtains by examining the retrieval result up to a given ranked position. The first accumulates the relevance scores of retrieved documents along the ranked result list. The second is similar but applies a discount factor to the relevance scores in order to devalue late-retrieved documents. The third computes the relative-to-the-ideal performance of IR techniques, based on the cumulative gain they are able to yield.

Introduction. Modern large retrieval environments tend to overwhelm their users with their large output. Since not all documents are equally relevant to their users, highly relevant documents, or document components, should be identified and ranked first for presentation. This is often desirable from the user's point of view. To develop IR techniques in this direction, it is necessary to develop evaluation approaches and methods that credit IR methods for their ability to retrieve highly relevant documents. The current practice of liberal binary judgment of topical relevance gives a retrieval technique equal credit for retrieving highly relevant and marginally relevant documents. For example, TREC is based on binary relevance judgments with a very low threshold for accepting a document as relevant: the document needs to have at least one sentence pertaining to the request to count as relevant [TREC 2001]. Therefore, differences between sloppy and excellent retrieval techniques, regarding highly relevant documents, may not become apparent in evaluation.
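To make this point concrete, the following minimal sketch compares two hypothetical rankings whose documents carry graded relevance scores. The 0 to 3 scale, the scores themselves, the threshold parameter, and the function name `precision_at_k` are illustrative assumptions, not taken from the article. Under a liberal binary threshold, both rankings receive the same precision, although one of them retrieves far more highly relevant documents.

```python
def precision_at_k(graded_scores, k, threshold=1):
    """Precision at rank k after collapsing graded scores to binary relevance.

    A document counts as relevant when its graded score is >= threshold,
    which mimics a liberal binary judgment (assumed convention).
    """
    top_k = graded_scores[:k]
    relevant = sum(1 for score in top_k if score >= threshold)
    return relevant / k


# Hypothetical graded judgments (0 = irrelevant, 3 = highly relevant).
excellent_run = [3, 3, 2, 0, 3]
sloppy_run = [1, 1, 1, 0, 1]

p_excellent = precision_at_k(excellent_run, k=5)
p_sloppy = precision_at_k(sloppy_run, k=5)

# Both runs look identical under the binary view ...
print(p_excellent, p_sloppy)
# ... while the summed graded scores clearly differ.
print(sum(excellent_run), sum(sloppy_run))
```

Raising the threshold would separate the two runs, but only by discarding the marginally relevant documents altogether; graded measures keep both kinds of information.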

Discussion / Conclusion. We have argued that in modern large database environments, the development and evaluation of IR methods should be based on their ability to retrieve highly relevant documents. This is often desirable from the user viewpoint and presents a not too liberal test for IR techniques. We then developed novel methods for IR technique evaluation, which aim at taking the document relevance degrees into account. These are the CG and the DCG measures, which give the (discounted) cumulated gain up to any given document rank in the retrieval results, and their normalized variants nCG and nDCG, based on the ideal retrieval performance. They are related to some traditional measures such as average search length [Losee 1998], expected search length [Cooper 1968], normalized recall [Rocchio 1966; Salton and McGill 1983], sliding ratio [Pollack 1968; Korfhage 1997], satisfaction-frustration-total measure [Myaeng and Korfhage 1990], and ranked half-life [Borlund and Ingwersen 1998].
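The measures named above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the article's reference implementation: the function names, the example gain vector, the 0 to 3 relevance scale, and the choice of a logarithmic discount with base b = 2 applied only from rank b onward are assumptions made for the example. The ideal vector is also simplified here to the same judged scores sorted in descending order, whereas a full evaluation would build it from all relevant documents known for the request.

```python
import math


def cumulated_gain(gains):
    """CG: running sum of relevance scores along the ranked list."""
    cg, total = [], 0.0
    for g in gains:
        total += g
        cg.append(total)
    return cg


def discounted_cumulated_gain(gains, b=2):
    """DCG: like CG, but a gain at rank i >= b is divided by log_b(i).

    Ranks below b stay undiscounted (assumed convention), which also
    avoids dividing by log_b(1) = 0.
    """
    dcg, total = [], 0.0
    for rank, g in enumerate(gains, start=1):
        total += g if rank < b else g / math.log(rank, b)
        dcg.append(total)
    return dcg


def normalized(actual, ideal):
    """Divide a (D)CG vector position by position by the ideal vector."""
    return [a / i if i > 0 else 0.0 for a, i in zip(actual, ideal)]


# Hypothetical graded scores of a ranked result list (0 to 3 scale).
gains = [3, 2, 3, 0, 1, 2]
# Simplified ideal ordering: the same scores, best first.
ideal_gains = sorted(gains, reverse=True)

cg = cumulated_gain(gains)
dcg = discounted_cumulated_gain(gains)
ncg = normalized(cg, cumulated_gain(ideal_gains))
ndcg = normalized(dcg, discounted_cumulated_gain(ideal_gains))

for rank, row in enumerate(zip(cg, dcg, ncg, ndcg), start=1):
    print(rank, [round(value, 3) for value in row])
```

A larger base b discounts late-retrieved documents more gently, so it can be read as modelling a more patient user; the normalized vectors equal 1.0 wherever the run matches the ideal ordering up to that rank.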