Coverage-Aware Prediction of Multi-Hop RAG Answer Correctness on FRAMES
DOI: https://doi.org/10.63575/CIA.2026.40202
Keywords: retrieval-augmented generation, FRAMES, multi-hop question answering, retrieval coverage, strict correctness, BM25, TF-IDF, calibration
Abstract
Multi-hop retrieval-augmented generation (RAG) can retrieve context that appears relevant while omitting an article needed to complete the reasoning chain. This study examines whether that evidence deficit can be detected before answer generation. The evaluation covers all 824 questions in the public FRAMES test file, which provides gold answers, reasoning labels, and relevant Wikipedia links. Because the file identifies articles but does not include article bodies, a closed-world title corpus is constructed from every parsed gold link. Four deterministic retrievers are compared: BM25 over normalized titles, word-level TF-IDF, character-level TF-IDF, and a fixed hybrid. Coverage is defined as the fraction of a question's parsed gold article set returned in the top-k. Strict coverage-gated correctness is one only when the full set is present; it therefore measures retrieval readiness rather than neural-reader accuracy. At k=10, the hybrid reaches mean coverage of 0.547 and strict correctness of 0.225; at k=25, the values rise to 0.590 and 0.273. Performance declines sharply as article count grows. In five-fold cross-validation, logistic regression using observable query and ranking signals predicts strict correctness at k=10 with ROC-AUC 0.796. Adding benchmark hop-count metadata raises ROC-AUC to 0.856. The findings show that incomplete multi-hop evidence is both a central retrieval bottleneck and a predictable risk signal that can guide additional retrieval, decomposition, or abstention before generation.
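The two retrieval metrics defined in the abstract can be illustrated with a minimal sketch. It assumes that retrieved and gold articles are compared by exact match on already-normalized title strings; the function names `coverage_at_k` and `strict_correct_at_k` and the example titles are hypothetical and are not taken from the study's code.

```python
from typing import Iterable, Sequence


def coverage_at_k(ranked: Sequence[str], gold: Iterable[str], k: int) -> float:
    """Fraction of the gold article set found in the top-k ranked titles."""
    gold_set = set(gold)
    if not gold_set:
        raise ValueError("gold article set must be non-empty")
    top_k = set(ranked[:k])
    return len(gold_set & top_k) / len(gold_set)


def strict_correct_at_k(ranked: Sequence[str], gold: Iterable[str], k: int) -> int:
    """1 only when every gold article appears in the top-k, else 0."""
    gold_set = set(gold)
    if not gold_set:
        raise ValueError("gold article set must be non-empty")
    return int(gold_set <= set(ranked[:k]))


# Hypothetical ranking for one question with three gold articles.
ranked_titles = ["a", "b", "c", "d"]
gold_titles = {"a", "d", "e"}

partial = coverage_at_k(ranked_titles, gold_titles, k=4)        # 2 of 3 gold articles found
gated = strict_correct_at_k(ranked_titles, gold_titles, k=4)    # "e" is missing
full = strict_correct_at_k(ranked_titles, {"a", "b"}, k=2)      # full set present
```

In this sketch a question can have high partial coverage and still fail the strict gate, which is the gap between the mean coverage and strict correctness figures reported above.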


