Xubo Qin, Jun Bai, Jiaqi Li, Zixia Jia, Zilong Zheng
Natural Language and Conversational AI Lab (NLCo), Beijing Institute for General Artificial Intelligence (BIGAI)
<aside> 📌
Chinese version (updated in sync): https://r3obs6x1ee.feishu.cn/docx/NMkRdoyOZoQOUwxrafpcelwGnfb
Full Technical Report Available:
https://arxiv.org/abs/2506.11603
Current Checkpoints Available:
Visit Our GitHub Repo for More Information:
https://github.com/bigai-nlco/TongSearch-QR
</aside>
<aside> 💡
Details about our ablation studies will be released later in our formal paper.
</aside>
Reasoning-intensive retrieval performance on BRIGHT (Su et al., 2025). Last updated on 2025-05-14.
We present TongSearch QR (previously known as TongSearch Reasoner), a family of small-scale language models that serve as reasoning-intensive query reasoners, trained via deep reinforcement learning to capture deep causal relationships between queries and documents. “QR” is short for “Query Reasoning”.
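To make the role of a query reasoner concrete, the sketch below shows how such a model slots in front of a retriever: it rewrites the raw query into a reasoning-enriched one before any retrieval happens. The model id and prompt template here are illustrative placeholders (see our GitHub repo for the released checkpoints), not the exact ones used in training.

```python
# Minimal inference sketch: a query reasoner rewrites the raw query before
# retrieval. NOTE: the model id and prompt are illustrative placeholders; see
# https://github.com/bigai-nlco/TongSearch-QR for the released checkpoints.
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "bigai-nlco/TongSearch-QR-7B"  # placeholder name

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(MODEL_ID, device_map="auto")

query = "Why does my recursive function hit the maximum recursion depth?"
prompt = (
    "Reason about what kind of document would answer the query, then rewrite "
    f"it as a retrieval query.\nQuery: {query}\nReasoned query:"
)

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=256)
reasoned_query = tokenizer.decode(
    output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True
)
# The reasoned query, not the raw one, is what gets handed to BM25.
```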
While traditional information retrieval methods excel at textual and semantic matching, they fail in complex scenarios requiring logical reasoning, a critical shortcoming in modern RAG applications. Our evaluation on the BRIGHT benchmark confirms these limitations across diverse domains. Instead of pursuing labor-intensive, large-scale annotated datasets, we developed a reinforcement learning (RL) based approach with semi-rule-based reward functions. This enables smaller models, e.g., Qwen2.5-7B-Instruct and Qwen2.5-1.5B-Instruct, to achieve query-reasoning performance rivaling much larger LLMs without their prohibitive inference costs. With BM25 as the retriever, the resulting TongSearch QR 7B and 1.5B models significantly outperform non-reasoning baselines, delivering performance comparable to substantially larger reasoning models like DeepSeek-R1 and QwQ-32B, while offering superior adaptability for real-world deployment.
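As a hedged illustration of what "semi-rule-based" means here, the sketch below combines a hard format rule with a cheap retrieval-based signal (gold-document recall under BM25). This is one plausible instantiation of the idea, not the exact reward used in our training; see the technical report for the actual formulation.

```python
# A hedged sketch of a "semi-rule-based" reward: a hard format rule gating a
# cheap retrieval-based signal. Illustrative only; the exact reward used for
# TongSearch QR is described in the technical report.
import re
from rank_bm25 import BM25Okapi

def format_reward(generation: str) -> float:
    """Rule part: the output must be a non-empty rewritten query of sane length."""
    return 1.0 if re.search(r"\S", generation) and len(generation.split()) <= 512 else 0.0

def retrieval_reward(reasoned_query: str, corpus: list[str], gold_ids: set[int], k: int = 10) -> float:
    """Retrieval part: fraction of gold documents recalled in the BM25 top-k."""
    bm25 = BM25Okapi([doc.lower().split() for doc in corpus])
    scores = bm25.get_scores(reasoned_query.lower().split())
    topk = sorted(range(len(corpus)), key=lambda i: -scores[i])[:k]
    return len(set(topk) & gold_ids) / max(len(gold_ids), 1)

def reward(generation: str, corpus: list[str], gold_ids: set[int]) -> float:
    # The rule check gates the retrieval score: malformed outputs get zero reward.
    return format_reward(generation) * retrieval_reward(generation, corpus, gold_ids)
```

Because both parts are computed from rules and a frozen retriever, the reward needs no learned reward model and no per-example human annotation.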
An information retrieval system aims to retrieve documents relevant to a given query from a document collection. How to define and measure the fundamental concept of "relevance" is thus the core question of retrieval algorithms. Traditional approaches, sparse retrieval (e.g., BM25) and dense retrieval, model query-document relevance from the perspectives of textual matching and semantic matching, respectively. These approaches are now relatively mature, straightforward, and easy to use. However, in scenarios such as RAG (Retrieval-Augmented Generation) and DeepResearch, query-document relevance often goes beyond simple textual or semantic correlation and instead involves deeper causal relationships. For example, when addressing a coding problem, finding the relevant API documentation requires understanding the logic and syntax of the functions in the code.
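To see concretely why lexical matching alone misses this kind of causal relevance, consider a toy BM25 example (using the rank_bm25 package; the documents and query are made up for illustration):

```python
# Toy BM25 example (rank_bm25 package): relevance here is causal, but the
# score rests purely on term overlap.
from rank_bm25 import BM25Okapi

corpus = [
    "The sort() method sorts the list in place and returns None.",  # relevant API doc
    "Python lists support append, insert, and remove operations.",
]
query = "why does my code print None after sorting a list"

bm25 = BM25Okapi([doc.lower().split() for doc in corpus])
print(bm25.get_scores(query.lower().split()))
# The only lexical overlap with the relevant doc is the generic token "list";
# nothing in the score reflects the causal link between sort() returning None
# and the unexpected printed output.
```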
Image credit: BRIGHT: A Realistic and Challenging Benchmark for Reasoning-Intensive Retrieval (Su et al., 2025)
A typical example is the BRIGHT benchmark, which includes 1,398 real-world questions spanning multiple domains (e.g., StackExchange, LeetCode, Olympiad mathematics) together with human-annotated gold documents. These gold documents exhibit a significant gap in textual and semantic alignment with the original queries, making it difficult for traditional retrieval methods to retrieve them directly. On the leaderboard for BRIGHT's short-document tasks, neither BM25, typical embedding models (e.g., BGE), nor cross-encoder reranker models achieve satisfactory nDCG@10 scores on the test set.
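For reference, nDCG@10, the metric quoted on the leaderboard, rewards placing gold documents near the top of the ranking. A minimal binary-relevance version looks like this:

```python
# Minimal nDCG@10 with binary relevance: discounted gain of gold documents in
# the ranking, normalized by the ideal ranking (all gold docs ranked first).
import math

def ndcg_at_k(ranked_ids: list[str], gold_ids: set[str], k: int = 10) -> float:
    gains = [1.0 if doc_id in gold_ids else 0.0 for doc_id in ranked_ids[:k]]
    dcg = sum(g / math.log2(i + 2) for i, g in enumerate(gains))
    ideal = sum(1.0 / math.log2(i + 2) for i in range(min(len(gold_ids), k)))
    return dcg / ideal if ideal > 0 else 0.0

# Gold docs retrieved at ranks 1 and 4:
print(ndcg_at_k(["d3", "d7", "d1", "d5"], {"d3", "d5"}))  # ~0.88
```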