Combating Cyberbullying with a Fairer Toxic Classification Algorithm Using Hierarchical Attention Networks
Social media is replete with toxic comments. AI algorithms built to identify toxicity often exhibit bias: they associate identity terms (racial, gender, ethnic, national, religious, and others) with toxicity because they lack an understanding of context. Current approaches to bias-free toxic classification in forums do not scale, owing to manual identity-term selection and uninterpretable classifications. Automatically detecting the identity terms associated with toxicity, and surgically removing those biases by adding more non-toxic comments containing those terms, scales better and helps build a more accurate, less biased toxic classifier. The objective of this work was therefore to develop a targeted AI algorithm that scales identity-term selection for bias removal and improves toxic comment classifiers. The novel method in this project addressed these issues by using a Hierarchical Attention-based sequence learning neural network that is more interpretable; adopting a linguistics-driven noun-adjective criterion that broadened the set of identity terms contributing to toxicity; scaling by auto-filtering the selection of relevant identity terms and grid-searching the hyperparameters; and finally debiasing by augmenting the dataset with an appropriate number of non-toxic examples containing those identity terms. The model was trained on over 99,000 labeled (toxic/non-toxic) comments from the Wikipedia comment dataset and tested on a held-out set with a similar distribution. The model achieved an AUC of 0.98, compared to 0.95 in a prior paper by Google. The method identified several hundred more identity terms than prior papers and debiased significantly better than the control set, without any human intervention. Unlike prior approaches, this model has no comment-length limitations and is far more scalable and adaptable.
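The identity-term selection and debiasing steps described above can be sketched as follows. This is a minimal, hypothetical illustration, not the paper's implementation: it assumes per-token attention weights are already available from a trained attention model, that POS tags are supplied alongside tokens (a real pipeline would run a POS tagger), and the function names and threshold are invented for the example.

```python
from collections import defaultdict

def select_identity_terms(comments, attn_threshold=0.5):
    """Auto-select candidate identity terms via the noun-adjective criterion.

    comments: list of (tokens, pos_tags, attn_weights, is_toxic) tuples.
    Keeps nouns/adjectives whose average attention weight within toxic
    comments is high, i.e. terms the model strongly associates with toxicity.
    """
    score = defaultdict(float)
    count = defaultdict(int)
    for tokens, tags, attn, toxic in comments:
        if not toxic:
            continue
        for tok, tag, a in zip(tokens, tags, attn):
            if tag in ("NOUN", "ADJ"):  # noun-adjective criterion
                score[tok] += a
                count[tok] += 1
    return {t for t in score if score[t] / count[t] >= attn_threshold}

def augment_for_balance(dataset, identity_terms, nontoxic_pool):
    """Debias by augmentation: for each selected term, add non-toxic
    comments containing it until the term appears at least as often in
    non-toxic as in toxic examples."""
    def term_counts(data, term):
        toxic = sum(1 for c in data if term in c["tokens"] and c["toxic"])
        clean = sum(1 for c in data if term in c["tokens"] and not c["toxic"])
        return toxic, clean

    augmented = list(dataset)
    for term in identity_terms:
        toxic_n, clean_n = term_counts(augmented, term)
        extras = [c for c in nontoxic_pool if term in c["tokens"]]
        augmented.extend(extras[: max(0, toxic_n - clean_n)])
    return augmented
```

The key design point is that no human curates the term list: the attention weights surface which nouns and adjectives drive toxic predictions, and the augmentation step rebalances only those terms.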
The approach can also be used to build debiasing models in other languages, although that is left for future work. It breaks new ground on multiple levels: it is novel in its use of attention networks to debias, it automates and scales the selection and debiasing of identity terms, and it improves toxic classification accuracy significantly. The models in this paper were used to develop the www.detoxifAI.com live demo and a Chrome Extension that addresses the cyberbullying problem by flagging toxic comments and highlighting the specific words that contribute to toxicity in Gmail (under construction), Google Docs, and other applications.
Arjun Neervannan is a 16-year-old junior at University High School in Irvine, California. He is interested in deep sequence learning, natural language processing, AI ethics, and deep reinforcement learning. Over the past year, Arjun has been working under the guidance of Prof. Sameer Singh at the University of California, Irvine, to develop a fair toxic comment classification algorithm. Using these algorithms, Arjun also created detoxifAI (www.detoxifAI.com), a toxic comment blocker available as a Chrome Extension to combat the cyberbullying crisis at schools. Previously, Arjun developed reinforcement learning models that learned to walk in a simulated environment under the guidance of Prof. Alex Ihler at UCI, and published the results in the October 2018 issue of the Baltic Journal of Modern Computing. Arjun is also a founding member and the captain of his FIRST Robotics team, which made it to the FRC (FIRST Robotics Competition) World Championships twice.