Evidence Diversity, Redundancy, and Semantic Similarity Trade-Offs in Multi-Hop RAG Systems
DOI:
https://doi.org/10.63575/CIA.2026.40203
Keywords:
multi-hop retrieval, retrieval-augmented generation, evidence diversity, redundancy, semantic similarity, metadata retrieval, Maximal Marginal Relevance, BM25, TF-IDF, evidence-set completion
Abstract
Multi-hop retrieval-augmented generation requires a retriever to assemble a complementary evidence set rather than return a single locally similar passage. This study examines the interaction among query relevance, within-set redundancy, and evidence diversity using a deterministic benchmark instance aligned with the published MultiHop-RAG scale and schema. The evaluation covers 2,556 queries, 609 documents, four query classes, metadata-rich records, and answerable evidence sets spanning two to four documents. We compare BM25, TF-IDF cosine retrieval, metadata-augmented variants, weighted hybrid fusion, four Maximal Marginal Relevance settings, and a hybrid-plus-MMR selector. Performance is assessed with Partial Recall, Complete Recall, MRR, nDCG, mean pairwise redundancy, diversity, semantic similarity, source diversity, null-query rejection, bootstrap confidence intervals, and measured retrieval latency. TF-IDF with metadata achieves the strongest Complete Recall@10 at 0.793, together with Partial Recall@10 of 0.916 and nDCG@10 of 0.739. Novelty-dominant selection is counterproductive: MMR with λ=0.35 raises diversity to 0.834 but reduces Complete Recall@10 to 0.011. A relevance-dominant setting, λ=0.80, recovers 0.665 Complete Recall@10 while lowering redundancy relative to similarity-only ranking. The results show that metadata-aware relevance should establish the candidate pool, after which diversification can be applied as a bounded set-composition correction. For multi-hop retrieval, diversity is valuable only when it preserves the shared entities, temporal anchors, and source constraints that bind an evidence chain.
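The trade-off summarized above can be illustrated with a minimal sketch of greedy MMR selection and the two recall measures. Several things here are assumptions rather than the study's implementation: the standard MMR scoring rule λ·rel(d) − (1 − λ)·max sim(d, s) over already selected items s, the reading of Partial Recall as the fraction of gold evidence retrieved and Complete Recall as an all-or-nothing indicator, the function names `mmr_select`, `partial_recall`, and `complete_recall`, and the four-document toy data, which is invented for illustration and is not drawn from the benchmark.

```python
from typing import Dict, FrozenSet, List, Set

# Hypothetical toy data: a1 and a2 are near-duplicates for the first hop,
# b is the second-hop document (it shares entities with a1, so it is
# moderately similar to it), and n is an unrelated, dissimilar document.
REL: Dict[str, float] = {"a1": 0.9, "a2": 0.85, "b": 0.8, "n": 0.1}
SIM: Dict[FrozenSet[str], float] = {
    frozenset(("a1", "a2")): 0.95,
    frozenset(("a1", "b")): 0.4,
    frozenset(("a2", "b")): 0.4,
}  # pairs that are not listed are treated as similarity 0.0
GOLD: Set[str] = {"a1", "b"}


def mmr_score(doc: str, selected: List[str], lam: float) -> float:
    """Assumed MMR rule: lam * relevance - (1 - lam) * max similarity to the set."""
    redundancy = max(
        (SIM.get(frozenset((doc, s)), 0.0) for s in selected), default=0.0
    )
    return lam * REL[doc] - (1.0 - lam) * redundancy


def mmr_select(k: int, lam: float) -> List[str]:
    """Greedily pick k documents, re-scoring the remaining ones after each pick."""
    selected: List[str] = []
    remaining = set(REL)
    while remaining and len(selected) < k:
        # sorted() keeps tie-breaking deterministic across runs
        best = max(sorted(remaining), key=lambda d: mmr_score(d, selected, lam))
        selected.append(best)
        remaining.remove(best)
    return selected


def partial_recall(retrieved: List[str], gold: Set[str]) -> float:
    """Assumed definition: fraction of gold evidence documents retrieved."""
    return len(set(retrieved) & gold) / len(gold)


def complete_recall(retrieved: List[str], gold: Set[str]) -> float:
    """Assumed definition: 1.0 only if every gold document was retrieved."""
    return 1.0 if gold <= set(retrieved) else 0.0


if __name__ == "__main__":
    for lam in (1.0, 0.8, 0.35):
        picked = mmr_select(k=2, lam=lam)
        print(lam, picked, partial_recall(picked, GOLD), complete_recall(picked, GOLD))
```

On this toy data, similarity-only ranking (λ=1.0) returns the two near-duplicates and misses the second hop, λ=0.8 swaps the duplicate for the second-hop document and completes the evidence set, and λ=0.35 penalizes the second-hop document for the very overlap that links it to the first hop and selects the unrelated document instead. The numbers are constructed to make that pattern visible and say nothing about the magnitudes reported in the study.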


