Heterogeneous Benchmarking across Domains and Languages

Date:

Title

Heterogeneous Benchmarking across Domains and Languages: The Key to Enabling Meaningful Progress in IR Research

Abstract

Benchmarks are essential for measuring realistic progress in Information Retrieval. However, existing benchmarks saturate quickly and are prone to overfitting, which hurts retrieval model generalization. To address these challenges, I will present two of my research efforts: BEIR, a heterogeneous benchmark for zero-shot evaluation across specialized domains, and MIRACL, a monolingual retrieval benchmark covering a diverse range of languages. With BEIR, we show that neural retrievers surprisingly struggle to generalize zero-shot to specialized domains due to a lack of in-domain training data. To overcome this, we develop GPL, which distills cross-encoder knowledge into dense retrievers using synthetic in-domain data generated for the BEIR datasets. On the language side, MIRACL provides robust annotations and broader language coverage. However, generating supervised training data remains cumbersome in realistic settings. To fill this gap, we construct SWIM-IR, a synthetic training dataset of 28 million LLM-generated pairs spanning 37 languages, and use it to train multilingual retrievers that are competitive with supervised models on three multilingual retrieval benchmarks and can be extended to new languages.
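As a rough illustration of the zero-shot evaluation protocol that BEIR standardizes, the sketch below follows the quickstart pattern from the `beir` Python package: download one benchmark dataset, retrieve with a pretrained dense model, and score with standard ranking metrics. The dataset (scifact) and checkpoint (msmarco-distilbert-base-tas-b) are illustrative choices, not ones prescribed by the talk; any BEIR dataset and Sentence-BERT model slot in the same way.

```python
# Minimal zero-shot evaluation sketch using the `beir` package.
from beir import util
from beir.datasets.data_loader import GenericDataLoader
from beir.retrieval import models
from beir.retrieval.evaluation import EvaluateRetrieval
from beir.retrieval.search.dense import DenseRetrievalExactSearch as DRES

# Download and unzip one BEIR dataset (corpus, queries, relevance judgments).
url = "https://public.ukp.informatik.tu-darmstadt.de/thakur/BEIR/datasets/scifact.zip"
data_path = util.download_and_unzip(url, "datasets")
corpus, queries, qrels = GenericDataLoader(data_folder=data_path).load(split="test")

# Wrap a pretrained dense retriever (trained on MS MARCO, never on scifact)
# and run exact brute-force search over the corpus.
model = DRES(models.SentenceBERT("msmarco-distilbert-base-tas-b"), batch_size=128)
retriever = EvaluateRetrieval(model, score_function="dot")
results = retriever.retrieve(corpus, queries)

# Report standard ranking metrics (nDCG, MAP, Recall, Precision) at the usual cutoffs.
ndcg, _map, recall, precision = retriever.evaluate(qrels, results, retriever.k_values)
print(ndcg)
```

Because the retriever is trained only on general-domain data (MS MARCO here), the scores above measure exactly the zero-shot generalization gap the abstract describes; GPL-style distillation then narrows that gap with synthetic in-domain training pairs.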

YouTube Video Recording


Slides