An Empirical Comparison of Few-Shot Example Selection Strategies for In-Context Learning on Public Reasoning and QA Benchmarks
DOI:
https://doi.org/10.63575/CIA.2025.30209
Keywords:
in-context learning, few-shot prompting, demonstration selection, reasoning benchmarks.
Abstract
In-context learning lets large language models adapt to a new task by conditioning on a small set of labelled demonstrations placed in the prompt, and a growing body of work shows that the choice of demonstrations can shift task accuracy by more than ten absolute points. Four families of selection strategies dominate current practice: random sampling, similarity-based retrieval, diversity-based coverage, and complexity-based ranking. Their relative strengths across task types, however, have not been examined within a single controlled grid. This work presents an empirical comparison of six representative strategies drawn from these four families on four widely used public benchmarks (GSM8K, MMLU, BIG-Bench Hard, and CommonsenseQA) with two open-weight instruction-tuned backbones (Llama-3-8B-Instruct and Mistral-7B-Instruct-v0.2), plus a robustness check on StrategyQA. Every strategy is evaluated under the same shot budget and prompt template, and stability is quantified across random seeds. No single strategy dominates across tasks: similarity-based retrieval excels on commonsense QA, complexity-based ranking leads on multi-step arithmetic and algorithmic reasoning, and a similarity-plus-diversity hybrid delivers the most stable average accuracy. The gap between the best and worst strategies is moderate, at 3.1 points. These findings support a task-aware view of demonstration selection and suggest that selection can be tuned at the task-type level.


