Research

:mag: Research Interests

My research interests lie in NLP and Information Retrieval. I like to study and build simple, scalable, and efficient systems in the neural search paradigm. My long-term research ambition is to build robust and generalizable retrieval models that help serve information better for everyone. A few of my notable works include developing the BEIR Benchmark (Thakur et al., 2021) and Augmented SBERT (Thakur et al., 2021).

:bar_chart: Data Efficiency: Transfer Learning, Data Augmentation and Zero-shot Learning

Training neural retrieval systems requires large amounts of human-labeled training data, which is often cumbersome and expensive to generate for real-world tasks. Data efficiency plays a crucial role in addressing this challenge. Transfer learning distills knowledge from pretrained models or LLMs to train data-efficient models. Data augmentation techniques generate high-quality synthetic data for training purposes. Zero-shot learning enables models to generalize to unseen classes or queries without any training examples.
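The data augmentation idea above (used, for example, in Augmented SBERT and GPL) can be sketched as follows. This is a minimal toy illustration, not a real system: `cross_encoder_score` is a stand-in for a trained cross-encoder, implemented here as simple token overlap so the snippet runs without any model downloads.

```python
def cross_encoder_score(query, passage):
    # Toy "cross-encoder": Jaccard token overlap as a stand-in for a
    # trained relevance scorer (a real system would use a neural model).
    q, p = set(query.lower().split()), set(passage.lower().split())
    return len(q & p) / max(len(q | p), 1)

def pseudo_label(queries, passages):
    # Score every (query, passage) pair with the expensive cross-encoder;
    # the resulting soft labels become synthetic training data for a
    # cheaper, faster bi-encoder retriever.
    return [(q, p, cross_encoder_score(q, p))
            for q in queries for p in passages]

labels = pseudo_label(
    ["neural search systems"],
    ["scalable neural search", "cooking recipes"],
)
# The relevant passage receives a higher pseudo-label than the irrelevant one.
```

In practice the scorer would be a fine-tuned cross-encoder and the labeled pairs would be used to train a dense bi-encoder with a regression or distillation loss.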

:speaking_head: Languages: Multilingual Retrieval

Multilingual retrieval aims to provide relevant search results for users searching across multiple languages. It involves various challenges, including language mismatch, translation ambiguity, and language-specific resource limitations. To overcome these challenges, machine translation, cross-lingual IR, and multilingual embeddings have been employed. However, training data for such tasks is even scarcer than for English, making it an important challenge.
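The multilingual-embedding approach mentioned above can be sketched in a few lines. This is a toy illustration under an explicit assumption: the hard-coded vectors stand in for the output of a multilingual encoder trained so that translations land near each other in a shared space; they are illustrative only.

```python
import math

# Hypothetical pre-computed embeddings in a shared multilingual space.
# In a real system these would come from a multilingual encoder; here the
# English and German weather sentences are deliberately placed close together.
DOCS = {
    "en: the weather is sunny today": [0.90, 0.10, 0.00],
    "de: das Wetter ist heute sonnig": [0.85, 0.15, 0.05],
    "en: stock prices fell sharply":   [0.05, 0.90, 0.20],
}

def cosine(a, b):
    # Cosine similarity between two vectors.
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def search(query_vec, docs, k=2):
    # Rank documents of any language by similarity in the shared space:
    # language mismatch is handled by the embedding geometry, not translation.
    return sorted(docs, key=lambda d: cosine(query_vec, docs[d]), reverse=True)[:k]

# A (hypothetical) embedding of the query "sunny weather" retrieves both the
# English and the German weather documents ahead of the unrelated one.
hits = search([0.88, 0.12, 0.02], DOCS)
```

The design choice here is that a single index serves queries in any language, which is why shared embedding spaces are attractive when per-language training data is scarce.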

:scroll: Publications

For an updated list of publications, please refer to either my Google Scholar profile or my Semantic Scholar profile.

* denotes equal contribution


Leveraging LLMs for Synthesizing Training Data Across Many Languages in Multilingual Dense Retrieval

Nandan Thakur, Jianmo Ni, Gustavo Hernández Ábrego, John Wieting, Jimmy Lin, Daniel Cer. To appear in the 2024 Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL-HLT 2024), 2024.
University of Waterloo, Google Research, and Google DeepMind.
[paper] / [code] / [dataset] /

A large-scale synthetic LLM-generated dataset for improving multilingual retrieval systems without human-labeled training data.

Resources for Brewing BEIR: Reproducible Reference Models and Statistical Analyses

Ehsan Kamalloo, Nandan Thakur, Carlos Lassance, Xueguang Ma, Jheng-Hong Yang, Jimmy Lin. To appear in SIGIR 2024 (Resource Track), 2024.
University of Waterloo and Naver Labs.
[paper] / [code] /

Resources to support the BEIR benchmark: reproducible lexical, sparse, and dense baselines, together with statistical analyses.

Injecting Domain Adaptation with Learning-to-hash for Effective and Efficient Zero-shot Dense Retrieval

Nandan Thakur, Nils Reimers, Jimmy Lin. 2023 Workshop on Reaching Efficiency in Neural Information Retrieval (ReNeuIR’23), 2023.
University of Waterloo and Cohere.
[paper] / [code] / [dataset] /

A domain adaptation technique that improves the zero-shot performance of dense retrieval models while achieving 32x memory efficiency and lower latency.

HAGRID: A Human-LLM Collaborative Dataset for Generative Information-Seeking with Attribution

Ehsan Kamalloo, Aref Jafari, Xinyu Zhang, Nandan Thakur, Jimmy Lin. arXiv preprint, 2023.
University of Waterloo.
[paper] / [code] / [dataset] /

A high-quality dataset for training and evaluating generative search (RAG) models with citations.

SPRINT: A Unified Toolkit for Evaluating and Demystifying Zero-shot Neural Sparse Retrieval

Nandan Thakur, Kexin Wang, Iryna Gurevych, Jimmy Lin. In Proceedings of the 46th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR ’23), 2023.
University of Waterloo and UKPLab, Technical University of Darmstadt.
[paper] / [code] /

A unified toolkit for evaluation of diverse zero-shot neural sparse retrieval models.

Evaluating Embedding APIs for Information Retrieval

Ehsan Kamalloo, Xinyu Zhang, Odunayo Ogundepo, Nandan Thakur, David Alfonso-Hermelo, Mehdi Rezagholizadeh, Jimmy Lin. Association for Computational Linguistics (ACL) 2023 Industry Track, 2023.
University of Waterloo and Huawei Noah’s Ark Lab.
[paper] /

Analyzes semantic embedding APIs in realistic retrieval scenarios to assist practitioners and researchers in finding suitable services.

Simple Yet Effective Neural Ranking and Reranking Baselines for Cross-Lingual Information Retrieval

Jimmy Lin, David Alfonso-Hermelo, Vitor Jeronymo, Ehsan Kamalloo, Carlos Lassance, Rodrigo Nogueira, Odunayo Ogundepo, Mehdi Rezagholizadeh, Nandan Thakur, Jheng-Hong Yang, Xinyu Zhang. arXiv preprint, 2023.
University of Waterloo, Huawei Noah’s Ark Lab, UNICAMP and Naver Labs Europe.
[paper] / [code] /

Simple yet effective cross-lingual baselines, spanning both sparse and dense retrieval models built with IR toolkits, for the test collections of the TREC 2022 NeuCLIR Track.

Making a MIRACL: Multilingual Information Retrieval Across a Continuum of Languages

Xinyu Zhang*, Nandan Thakur*, Odunayo Ogundepo, Ehsan Kamalloo, David Alfonso-Hermelo, Xiaoguang Li, Qun Liu, Mehdi Rezagholizadeh, Jimmy Lin. Transactions of the Association for Computational Linguistics (TACL), 2022.
University of Waterloo and Huawei Noah’s Ark Lab.
[paper] / [code] / [dataset] /

A human-labeled multilingual retrieval dataset covering 18 languages from diverse language families, built to advance retrieval systems across languages.

GPL: Generative Pseudo Labeling for Unsupervised Domain Adaptation of Dense Retrieval

Kexin Wang, Nandan Thakur, Nils Reimers, Iryna Gurevych. Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT), 2022.
UKPLab, Technical University of Darmstadt.
[paper] / [code] / [dataset] /

A novel unsupervised domain adaptation method that combines a query generator with pseudo labeling from a cross-encoder.

BEIR: A Heterogenous Benchmark for Zero-shot Evaluation of Information Retrieval Models

Nandan Thakur, Nils Reimers, Andreas Rücklé, Abhishek Srivastava, Iryna Gurevych. Thirty-fifth Conference on Neural Information Processing Systems Datasets and Benchmarks Track (Round 2), 2021.
UKPLab, Technical University of Darmstadt.
[paper] / [code] / [video] / [dataset] /

A novel heterogeneous zero-shot retrieval benchmark containing 18 datasets from diverse text retrieval tasks and domains in English.

Augmented SBERT: Data Augmentation Method for Improving Bi-Encoders for Pairwise Sentence Scoring Tasks

Nandan Thakur, Nils Reimers, Johannes Daxenberger, Iryna Gurevych. Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT), 2021.
UKPLab, Technical University of Darmstadt.
[paper] / [code] / [video] /

A simple yet efficient data augmentation strategy that uses a cross-encoder to label training data for a bi-encoder on pairwise sentence scoring tasks.