Distribution-Driven Dense Retrieval: Modeling Many-to-One Query-Document Relationship
Abstract
Dense retrieval has emerged as the leading approach in information retrieval, aiming to find semantically relevant documents based on natural language queries. Given that a single document can be retrieved by multiple distinct queries, existing methods aim to represent a document with multiple vectors. Each vector is aligned with a different query to model the many-to-one relationship between queries and documents. However, these multiple vector-based approaches encounter challenges such as Increased Storage, Vector Collapse, and Search Efficiency. To address these issues, we introduce the Distribution-Driven Dense Retrieval framework (DDR). Specifically, we use vectors to represent queries and distributions to represent documents. This approach not only captures the relationships between multiple queries corresponding to the same document but also avoids the need to use multiple vectors to represent the document. Furthermore, to ensure search efficiency for DDR, we propose a dot product-based computation method to calculate the similarity between documents represented by distributions and queries represented by vectors. This allows for seamless integration with existing approximate nearest neighbor (ANN) search algorithms for efficient search. Finally, we conduct extensive experiments on real-world datasets, which demonstrate that our method significantly outperforms traditional dense retrieval methods.