Cross-Modal Artifact Mining for Generalizable Deepfake Detection in the Wild
DOI: https://doi.org/10.63575/CIA.2024.20208

Keywords: Deepfake Detection, Cross-Modal Learning, Frequency Domain Analysis, Generalization

Abstract
The proliferation of deepfake content poses unprecedented threats to digital security and information integrity. Existing detection methods suffer significant performance degradation when confronted with cross-dataset scenarios and real-world manipulated media. This paper proposes a novel cross-modal artifact mining framework that integrates frequency-domain analysis with audio-visual consistency verification to improve generalization. Our approach employs adaptive high-frequency enhancement modules coupled with discrete cosine transform feature extraction to capture subtle manipulation artifacts. A cross-modal attention fusion mechanism then exploits temporal alignment inconsistencies between the audio and visual streams. In a comprehensive evaluation on six benchmark datasets, our method achieves superior cross-dataset generalization with 89.7% average accuracy and demonstrates robust detection of diffusion-generated deepfakes. Ablation studies validate the contribution of each proposed component, while robustness analysis confirms resilience against adversarial perturbations and the compression artifacts encountered in real-world deployment.
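To make the frequency-domain pipeline described above concrete, the following is a minimal sketch of DCT-based high-frequency feature extraction with a simple energy-scaled enhancement step. It is an illustrative stand-in, not the paper's actual module: the cutoff mask, the `gamma` scaling rule, and the function names `high_frequency_dct_features` and `adaptive_enhancement` are all hypothetical choices made for this example.

```python
import numpy as np
from scipy.fft import dctn, idctn


def high_frequency_dct_features(image: np.ndarray, cutoff: float = 0.25) -> np.ndarray:
    """Keep only high-frequency DCT coefficients of a grayscale image.

    A simple anti-diagonal mask retains coefficients whose normalized
    frequency indices sum to more than `cutoff`, approximating the band
    where blending and upsampling artifacts tend to concentrate.
    """
    coeffs = dctn(image.astype(np.float64), norm="ortho")
    h, w = coeffs.shape
    # Normalized frequency position relative to the DC term (top-left).
    yy, xx = np.meshgrid(np.arange(h) / h, np.arange(w) / w, indexing="ij")
    mask = (yy + xx) > cutoff  # True for the high-frequency band
    return coeffs * mask


def adaptive_enhancement(coeffs: np.ndarray, gamma: float = 1.5) -> np.ndarray:
    """Amplify coefficients in proportion to their relative magnitude.

    A toy stand-in for 'adaptive high-frequency enhancement': stronger
    coefficients receive a larger boost, weak ones stay near unchanged.
    """
    energy = np.abs(coeffs)
    scale = 1.0 + gamma * energy / (energy.max() + 1e-8)
    return coeffs * scale


if __name__ == "__main__":
    frame = np.random.rand(256, 256)  # placeholder for a face crop
    hf = adaptive_enhancement(high_frequency_dct_features(frame))
    # Invert back to the pixel domain to obtain a high-frequency
    # residual that a CNN backbone could consume as an extra channel.
    residual = idctn(hf, norm="ortho")
    print(residual.shape)  # (256, 256)
```

In a full detector, such a residual would typically be stacked with the RGB input (or fed to a separate branch) before the cross-modal attention fusion stage described in the abstract.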


