Multimodal Transformers for Detecting Bad Quality Ads on YouTube


Vijaya Teja Rayavarapu, Bharath Bhat, Myra Nam, Vikas Bahirwani and Shobha Diwakar

An ads ecosystem needs robust, scalable mechanisms to safeguard users from bad quality ads. Contemporary ad creatives typically combine several modalities, such as text, images and video, so any system that flags bad quality ad content needs a holistic multimodal representation of the ad. In this paper, we demonstrate that modern Transformer-based neural network models are effective multimodal learners. We report significant performance gains on the task of content quality prediction for YouTube video ads by transitioning from simpler feed-forward neural networks to Transformer-based models. We provide ablation studies to understand the impact of each input modality, and compare various flavors of Transformer architectures. We hope that our experiments help practitioners looking to incorporate these powerful multimodal models into other parts of the ads ecosystem.