Can algorithmic moderation eradicate online hate speech?

Hate speech has become a pervasive issue in the digital age, with social media providing a platform for its dissemination. It not only poisons the online speech environment and chills freedom of expression, but also causes harm beyond the platform itself: traumatizing victims, disrupting their everyday lives, and even triggering serious social conflicts and terrorist incidents. Although it is a focal point for internet companies and an active area of research, the sheer volume of content leaves human moderators overwhelmed and unable to address online hate speech effectively. Many companies therefore manage it through artificial intelligence, or a combination of artificial intelligence and human moderation.

We understand hate speech to be speech that expresses or encourages hatred against a particular collective simply because of a feature of that group, such as race, ethnicity, gender, religion, nationality, or sexual orientation (Parekh, 2012). Because it carries explicit emotional bias, the negative words it contains can often be recognized by common sense. Gorwa and others (2020) define algorithmic commercial content moderation (referred to as algorithmic moderation for brevity in the following sections) as systems that classify user-generated content based on either matching or prediction, leading to a decision and governance outcome (e.g. removal, geoblocking, account takedown). Most algorithmic moderation is driven by language models. In its simplest form, a language model assigns a probability to a sequence of words (Ke-Li et al., 2022). For instance, the sequence ‘the cat in the hat’ is probably more likely than ‘the cat in the computer’. Such models learn the patterns and structures of language from large amounts of text data and use this knowledge to generate new text or solve specific language tasks. In moderation, they act as a filter, detecting offending words and phrases in a text based on the training corpus. This technique is widely used on social media platforms such as YouTube and has achieved notable success.
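As a toy illustration of the "assigning a probability to a sequence of words" idea (not a depiction of any platform's production system), a minimal add-alpha-smoothed bigram model can score the two example sequences; the tiny corpus, vocabulary size, and smoothing constant below are invented for the sketch:

```python
from collections import defaultdict

def train_bigram(corpus):
    # count how often each word follows each preceding word
    counts = defaultdict(lambda: defaultdict(int))
    for sentence in corpus:
        tokens = ["<s>"] + sentence.split()
        for prev, word in zip(tokens, tokens[1:]):
            counts[prev][word] += 1
    return counts

def sequence_prob(counts, sentence, alpha=1.0, vocab_size=50):
    # smoothed probability of the whole sequence: the product of
    # P(word | previous word) over every adjacent pair
    tokens = ["<s>"] + sentence.split()
    prob = 1.0
    for prev, word in zip(tokens, tokens[1:]):
        total = sum(counts[prev].values())
        prob *= (counts[prev][word] + alpha) / (total + alpha * vocab_size)
    return prob

corpus = ["the cat in the hat", "the cat sat on the hat", "a hat on the cat"]
model = train_bigram(corpus)

likely = sequence_prob(model, "the cat in the hat")
unlikely = sequence_prob(model, "the cat in the computer")
assert likely > unlikely  # "hat" has been seen after "the"; "computer" never has
```

The same scoring logic, scaled up to neural models trained on vast corpora, is what lets a classifier judge whether a sequence looks like ordinary language or like a pattern associated with violating content.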

In an ideal operational state, its most significant advantage is a substantial improvement in both the efficiency and the scale of handling online hate speech. This is borne out by the public reports on content that violated the YouTube Community Guidelines. From October to December 2023, over eight million videos were removed after algorithmic moderation, while over three hundred thousand videos were manually flagged by users and removed, only about 3.5% of the number detected by AI. Moreover, AI detection is nearly instantaneous: according to the report, approximately half of the videos were removed before users had the chance to view them, and 26.43% were removed after being viewed fewer than ten times. Within seconds of a user uploading a post or video, the AI has already completed a preliminary check; if a video does not pass AI review, it cannot be uploaded or is immediately taken down. This ensures that hate speech is promptly removed, preventing a range of the negative impacts it may bring. Finally, the vast amount of data processed by AI can be incorporated into databases, facilitating analysis of the overall trends and characteristics of hate speech and providing samples and directions for its management and study.

However, many problems arise in the practical application of algorithmic moderation. Even though it has identified a great deal of hate speech, we still often find hateful videos and comments that have not been deleted, and since there is no way to know how much hate speech exists on the network, the ratio of processed to unprocessed content remains unknown. Matching-based moderation compares the content to be checked against content already in a database. The weakness of this approach is that if a user modifies a video slightly, for example by watermarking or mosaicking the offending content or changing the video’s dimensions, it will be allowed through. Furthermore, not all hate speech is overt; it often manifests in “scientific” language or is presented as jokes or ironic commentary (Flew, 2021). To avoid systematic censorship, users hide offending speech in a variety of ways. Gorwa and others (2020) mention that users hide flagged words inside ordinary words or sentences, such as ASS-ociation, to evade detection. Indeed, many disguised curse words coined to avoid algorithmic moderation have become internet buzzwords; they are not always used hatefully, but when they are, the system cannot confirm it. Algorithmic moderation thus faces a moving target: by the time it learns one word morph, new buzzwords have already replaced it.
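A minimal sketch of why simple matching fails against this kind of obfuscation. The blocked term is an invented placeholder rather than a real slur, and `naive_match` / `normalized_match` are hypothetical helpers, not any platform's actual pipeline:

```python
import re

BLOCKLIST = {"hatefulword"}  # invented placeholder; real systems use curated lexicons

def naive_match(text):
    # flags text only when a blocked term appears as an exact token,
    # so inserting separators between letters defeats it
    return any(tok in BLOCKLIST for tok in re.findall(r"[a-z]+", text.lower()))

def normalized_match(text):
    # strip everything except letters before matching, so spellings like
    # "h-a-t-e-f-u-l-w-o-r-d" or "hateful.word" no longer slip through
    collapsed = re.sub(r"[^a-z]", "", text.lower())
    return any(term in collapsed for term in BLOCKLIST)

assert naive_match("this is a hatefulword")
assert not naive_match("this is a h-a-t-e-f-u-l-w-o-r-d")  # evasion succeeds
assert normalized_match("this is a h-a-t-e-f-u-l-w-o-r-d")  # caught again
```

Even the normalized variant is easy to defeat with homoglyphs, misspellings, or freshly coined euphemisms, and it risks false positives when a blocked term happens to span two innocent adjacent words, which mirrors the cat-and-mouse dynamic described above.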

Language is another barrier to algorithmic moderation. Despite its successes, English still accounts for the vast majority of detected violations and remains at the forefront of current language technology research. For example, Cinelli and others (2021), analyzing a corpus of more than one million comments on Italian YouTube videos related to COVID-19, found that only 32% of violent comments had become unavailable, whether through moderation or removal by the author. Wickramaarachchi and others (2024) describe the difficulty of detecting hate speech in Sinhala on YouTube. One might think that because smaller languages are spoken by fewer people, hate speech in them spreads less and causes less harm, but this is mistaken. Facebook drew widespread criticism over allegations that it helped incite genocide in Myanmar in 2017, after which it stepped up its moderation of Burmese-language hate speech. Yatanar Htun, director of the Myanmar Information and Communication Technology Development Organisation (MIDO), which monitors online hate speech, told Reuters that YouTube videos were used in fake emails to voters that spoofed messages from Myanmar authorities alleging fraud and foreign interference. Reuters reported that the same fake videos had been removed from Facebook but remained on YouTube, where hate speech was accumulating in large numbers.

The use of automated techniques can help firms remove illegal content more quickly and effectively, and firms will continue investing heavily down the moderation ‘stack’, optimizing their systems to improve precision and recall. It is precisely the speed of AI processing, applied to a massive volume of content, that may remove offending videos before we ever see them. As mentioned earlier, YouTube publicly releases transparency reports on the algorithmic moderation of hate speech. Yet although the hate speech policy section categorizes violating videos and briefly explains each type of violation, many users have received notification that a video was removed for hate speech without any clear indication of which specific content or words violated the rules. There is no fixed, unified value judgment on hate speech; it is not a simple distinction between right and wrong. Rather, labelling content is a societal assertion that certain content should be treated as hate speech, one that simultaneously conveys an understanding of what hate speech is (Tarleton, 2020). Being labeled as hate speech on social media thus communicates a social and political value system, and users cannot escape it on the platforms even when they believe the vetting criteria are questionable. When users lose trust in the platform’s vetting, resistance naturally emerges. Cobbe (2021) categorizes this resistance into everyday resistance and organized resistance. Everyday resistance refers to small-scale and relatively safe circumvention, such as the previously mentioned tactic of changing the spelling or order of words to avoid censorship. Organized resistance is collective action by users who band together to target algorithmic review, weakening the power of social media’s algorithmic content moderation.
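The precision and recall mentioned above can be made concrete with a short calculation; the counts below are purely illustrative, not figures from YouTube's reports:

```python
def precision_recall(tp, fp, fn):
    # precision: of everything the system flagged, what share truly violated policy
    # recall:    of everything that truly violated policy, what share was flagged
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# hypothetical counts: 900 correct removals, 100 wrongful removals,
# 300 violating videos the system missed
p, r = precision_recall(tp=900, fp=100, fn=300)
assert abs(p - 0.90) < 1e-9  # 900 / 1000
assert abs(r - 0.75) < 1e-9  # 900 / 1200
```

The two metrics pull in opposite directions: flagging more aggressively raises recall but lowers precision (more wrongful takedowns, feeding the distrust discussed below), which is why platforms keep tuning this trade-off rather than solving it once.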

An algorithm, as a product of human design, inevitably reflects the values and biases of its creators. This inherent bias poses a significant challenge when algorithms are employed for scrutiny at scale: it can manifest in various forms, leading to disproportionate scrutiny of specific groups by gender, religion, race, and more (Ahmed et al., 2022; Cobbe, 2021). While platforms may ostensibly be responding to legal obligations, regulatory mandates, or even the political agendas of authoritarian regimes, their algorithms are tuned chiefly to the corporate interests and values the platform represents, rather than to moderation criteria grounded in the public interest. This may mean that platforms are more inclined to restrict or promote certain speech in line with their business strategies and profit-maximizing goals than to attend to the fairness or social impact of that speech. In effect, they establish and enforce the boundaries of acceptable speech according to commercial interests. This shift undermines the original role of social platforms as open and inclusive spaces for communication, gradually rendering them more susceptible to control by corporate agendas.

An opaque moderation system condones the persistence of algorithmic bias. As a result, marginalized groups are subject to stricter control, their voices diminish, and the speech environment becomes increasingly dominated by groups already in positions of advantage, making it harder still for disadvantaged groups to express themselves. However, fully disclosing the reasons for post removal and how moderation operates may increase the risk that bad actors bypass platform review by gaming the rules and developing new methods to spread inappropriate content (Wang & Kim, 2023). Rather than upholding shared values, algorithmic moderation systems enable social platforms to align public and private online communication with commercial priorities, granting the platforms increased power while eroding their capacity to facilitate discourse, communication, and interpersonal relationships. It is therefore crucial to balance the transparency and fairness of moderation standards against protecting them from manipulation, and external oversight of platforms’ moderation standards is needed to keep them relatively sound.

In conclusion, according to current reports, algorithmic moderation has made significant contributions to cleaning up the online environment and preventing the harm of hate speech. However, on major social platforms such as YouTube, the standards for algorithmic moderation still have significant loopholes: hate speech is frequently misidentified because modified words or videos cannot be matched, or because relevant data for minority languages is lacking. In addition, the opacity and biases of algorithmic moderation are serious obstacles to its implementation and may cause users to distrust and resist it. These problems may not be solvable at the current stage, since they require AI to keep learning and improving its efficiency and accuracy in moderation. In the current environment, a combination of AI and human moderation remains the relatively optimal solution, and it is the choice of most companies.


Ahmed, Z., Vidgen, B., & Hale, S. A. (2022). Tackling racial bias in automated online hate detection: Towards fair and accurate detection of hateful users with geometric deep learning. EPJ Data Science, 11(1), 1–22.

Cobbe, J. (2021). Algorithmic Censorship by Social Platforms: Power and Resistance. Philosophy & Technology, 34(4), 739–766.

Croes, E. A. J., & Antheunis, M. L. (2021). Perceived Intimacy Differences of Daily Online and Offline Interactions in People’s Social Network. Societies (Basel, Switzerland), 11(1), 13.

Flew, T. (2021). Regulating platforms. Polity Press.

Gorwa, R., Binns, R., & Katzenbach, C. (2020). Algorithmic content moderation: Technical and political challenges in the automation of platform governance. Big Data & Society, 7(1), 205395171989794.

Ke-Li, C., Collins, A., & Alexander, R. (2022). Detecting Hate Speech with GPT-3. arXiv.Org.

Ma, R., & Kou, Y. (2021). “How advertiser-friendly is my video?”: YouTuber’s Socioeconomic Interactions with Algorithmic Content Moderation. Proceedings of the ACM on Human-Computer Interaction, 5(CSCW2), 1–25.

Parekh, B. (2012). Is there a case for banning hate speech? In M. Herz and P. Molnar (eds), The Content and Context of Hate Speech: Rethinking Regulation and Responses (pp. 37–56). Cambridge: Cambridge University Press.

Wickramaarachchi, W. A. K. M., Subasinghe, S. S., Wijerathna, K. K. R. T., Athukorala, A. S. U., Abeywardhana, L., & Karunasena, A. (2024). Identifying False Content and Hate Speech in Sinhala YouTube Videos by Analyzing the Audio. arXiv.Org.

Wang, S., & Kim, K. J. (2023). Content Moderation on Social Media: Does It Matter Who and Why Moderates Hate Speech? Cyberpsychology, Behavior and Social Networking, 26(7), 527–534.

YouTube Community Guidelines enforcement. (n.d.). Transparency Report.
