An Empirical Evaluation of Prompt Injection Detection and Refusal-Usefulness Tradeoffs Using the deepset/prompt-injections Dataset
DOI:
https://doi.org/10.63575/CIA.2025.30205Keywords:
prompt injection, jailbreaks, large language models, text classification, abstention, security evaluationAbstract
Prompt injection is a leading security risk for large language model (LLM) applications because adversaries can embed instructions that override system intent, exfiltrate hidden prompts, or trigger unsafe tool use. This paper presents a fully empirical evaluation of prompt-injection defenses on the public deepset/prompt-injections dataset (662 labeled prompts; 399 benign, 263 injection/attack) using the official train/test split. We compare lightweight detectors that can be deployed as an input gate: a keyword-based rule system, word-level TF-IDF with Logistic Regression (LR), character-level TF-IDF with LR, calibrated linear Support Vector Machines (SVMs), and Complement Naive Bayes. We report attack success rate (ASR), detection F1, and a refusal rate–usefulness tradeoff. On the test split, the best detector is a character TF-IDF + calibrated linear SVM with F1=0.901 and ROC-AUC=0.977, substantially outperforming keyword rules (F1=0.125). When used as a refusal gate, the same family of character models reduces ASR from 1.000 (no gate) to 0.117 at 92.9% usefulness (1 - benign refusal) under a low-false-positive operating point derived from benign-score quantiles on the training split. Error analysis shows that most remaining bypasses are short, multilingual, or typo-heavy injections, indicating that robust defenses require character-level generalization and abstention tuning. Overall, our results quantify the operational tradeoffs between security and usability and provide reproducible baselines for prompt injection detection research.


