A Comparative Empirical Study of Semantic Signal Enhancement Methods for User Interest Features in CTR Prediction: Applicability of TF-IDF Weighting, Sentence-BERT Embeddings, and LDA Topic Fusion

Authors

  • Tianxing Tang, Translation and Localization Management, Middlebury Institute of International Studies, CA, USA
  • Mingzhuo Yu, Computer Science, Northeastern University, MA, USA

DOI:

https://doi.org/10.63575/CIA.2024.20114

Keywords:

Semantic signal enhancement, click-through rate prediction, comparative empirical study, user interest feature engineering

Abstract

User interest feature engineering in recommendation and advertising platforms increasingly relies on semantic signals derived from item text, user queries, and tag annotations. Yet practitioners lack a unified empirical comparison of the dominant fusion paths under a shared evaluation protocol. This study reports a comparative empirical analysis of three representative semantic signal enhancement methods (TF-IDF weighting, Sentence-BERT embeddings, and latent Dirichlet allocation (LDA) topic distributions) applied to user interest features for click-through rate (CTR) prediction. All three methods are evaluated on four public datasets (MovieLens-25M, MIND-small, Amazon Reviews 2023, and KuaiRec 2.0) using the Deep Interest Network and the Deep Interest Evolution Network as fixed CTR backbones. Sentence-BERT yields a mean AUC lift of 1.71 percent over the identifier-only baseline, while TF-IDF and LDA deliver 0.81 percent and 0.29 percent, respectively. Granularity analysis indicates that TF-IDF peaks on short text such as titles and tags, Sentence-BERT scales monotonically with document length, and LDA only matches TF-IDF once content exceeds roughly one hundred tokens. Cost-benefit profiling places Sentence-BERT with cached item vectors on the accuracy-latency Pareto frontier for mid-to-long text, while TF-IDF remains preferable in short-text, cold-start, and long-tail regimes.
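
As an illustration of the simplest of the three paths, the following minimal sketch shows how TF-IDF weights over item titles might be pooled into a user interest vector, and how an AUC lift could be expressed. It is not the authors' pipeline: the function names (tfidf_vectors, user_interest, relative_auc_lift), the smoothed IDF variant, the mean pooling over clicked items, the toy titles, and the reading of "lift" as a relative percentage are all assumptions made here for illustration.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return one {term: weight} dict per tokenized document."""
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append({
            # Assumed smoothed IDF variant: log((1 + N) / (1 + df)) + 1.
            term: (count / len(doc))
            * (math.log((1 + n_docs) / (1 + df[term])) + 1)
            for term, count in tf.items()
        })
    return vectors


def user_interest(clicked_ids, item_vectors):
    """Mean-pool the TF-IDF vectors of the items a user clicked (assumed pooling)."""
    profile = Counter()
    for item_id in clicked_ids:
        for term, weight in item_vectors[item_id].items():
            profile[term] += weight / len(clicked_ids)
    return dict(profile)


def relative_auc_lift(auc_baseline, auc_enhanced):
    """AUC lift in percent, relative to the baseline (assumed definition)."""
    return 100.0 * (auc_enhanced - auc_baseline) / auc_baseline


# Hypothetical toy catalogue of tokenized item titles.
titles = [
    ["space", "adventure", "movie"],
    ["space", "documentary"],
    ["romantic", "comedy", "movie"],
]
item_vectors = tfidf_vectors(titles)

# A user who clicked items 0 and 1; terms shared by both clicks dominate.
profile = user_interest([0, 1], item_vectors)
```

In a CTR model the resulting sparse profile would typically be projected or hashed into a dense feature before being concatenated with identifier embeddings; that step is omitted here because the abstract does not specify it.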

Author Biography

  • Mingzhuo Yu, Computer Science, Northeastern University, MA, USA

Published

2024-02-20

How to Cite

[1] Tianxing Tang and Mingzhuo Yu, “A Comparative Empirical Study of Semantic Signal Enhancement Methods for User Interest Features in CTR Prediction: Applicability of TF-IDF Weighting, Sentence-BERT Embeddings, and LDA Topic Fusion”, Journal of Computing Innovations and Applications, vol. 2, no. 1, pp. 165–174, Feb. 2024, doi: 10.63575/CIA.2024.20114.