Machine Learning-Based Hate Speech Detection in the Kazakh Language

Milana Bolatbek; Shynar Mussiraliyeva; Moldir Sagynay

doi:10.18778/1731-7533.23.19

Authors

Milana Bolatbek, Al-Farabi Kazakh National University
Shynar Mussiraliyeva, Al-Farabi Kazakh National University
Moldir Sagynay, Al-Farabi Kazakh National University

Keywords:

hate speech, bullying, violent extremism, TF-IDF, MTF-IDF, LSTM, MLPClassifier

Abstract

Modern text data processing and classification methods require extensive use of machine learning and neural networks. Categorizing text into different classes has become a crucial task in many fields. This paper presents a multi-class text classification model utilizing a Modified TF-IDF (MTF-IDF) approach in combination with Long Short-Term Memory (LSTM) neural networks, XGBoost, and MLPClassifier algorithms. Additionally, the study explores the integration of TF-IDF and CountVectorizer (MTF-IDF) methods for text vectorization, aiming to enhance classification efficiency.
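The combination of TF-IDF and CountVectorizer features described above can be illustrated with a minimal scikit-learn sketch. It rests on several assumptions: the abstract does not give the exact MTF-IDF formulation, so the two feature sets are simply concatenated with `FeatureUnion`; the English toy sentences and the label names `neutral`, `bullying`, and `extremism` are hypothetical placeholders rather than the paper's Kazakh dataset; and the `MLPClassifier` hyperparameters are arbitrary.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import FeatureUnion, Pipeline

# Hypothetical toy data; the paper's Kazakh dataset is not reproduced here.
texts = [
    "have a nice day friend",
    "thank you for the help",
    "you are stupid and ugly",
    "nobody likes you loser",
    "we will attack and destroy them",
    "join us and fight the enemy",
]
labels = ["neutral", "neutral", "bullying", "bullying", "extremism", "extremism"]

# Assumption: TF-IDF weights and raw counts are concatenated side by side.
# The paper's actual MTF-IDF weighting may be defined differently.
features = FeatureUnion([
    ("tfidf", TfidfVectorizer()),
    ("counts", CountVectorizer()),
])

# Multi-class classifier on top of the combined features
# (hyperparameters are illustrative, not taken from the paper).
model = Pipeline([
    ("features", features),
    ("clf", MLPClassifier(hidden_layer_sizes=(16,), max_iter=500, random_state=0)),
])

model.fit(texts, labels)

# Inspect the combined feature matrix and the predictions on the training texts.
X = model.named_steps["features"].transform(texts)
preds = model.predict(texts)
print(X.shape)
print(list(preds))
```

Because both vectorizers use the same default tokenizer, the combined matrix has twice as many columns as there are vocabulary terms. Swapping the final step for a gradient-boosting or recurrent model would leave the feature-extraction stage unchanged.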

The research findings indicate that the LSTM model achieved the highest accuracy, at 89%. The MLPClassifier model reached 85% accuracy and XGBoost reached 81%. Moreover, combining TF-IDF with MTF-IDF significantly improved the detection of rare but essential words, enhancing the overall performance of the models.

This study is dedicated to addressing the problem of automated detection of harmful content in the Kazakh language. Hate speech in the digital space refers to any online material that harms individuals or communities through aggression, manipulation, discrimination, or the intentional spread of socially damaging narratives. The results provide a solid foundation for future research aimed at the early identification and mitigation of hate speech in the digital space, contributing to a safer online environment.

Author Biographies

Milana Bolatbek, Al-Farabi Kazakh National University

Milana Bolatbek is a researcher specializing in Artificial Intelligence and Natural Language Processing. She holds a PhD degree in Information Security Systems and focuses on the development of intelligent systems for text analysis, hate speech detection, and digital content monitoring. Her academic interests include deep learning, computational linguistics, and social media analytics. Milana Bolatbek has contributed to several interdisciplinary projects integrating linguistics, psychology, and AI for cybersecurity applications. She has co-authored papers published in peer-reviewed and Scopus-indexed journals and actively participates in international conferences on artificial intelligence and data science.

Shynar Mussiraliyeva, Al-Farabi Kazakh National University

Shynar Mussiraliyeva is a researcher in the field of Cyber Security and Data Analytics. She is a professor in the Department of Cybersecurity and Cryptology at Al-Farabi Kazakh National University and has extensive experience in machine learning, natural language processing, and intelligent information systems. Her research focuses on applying AI technologies to solve problems in cybersecurity, social media analysis, and digital communication. Shynar Mussiraliyeva has published numerous papers in international peer-reviewed and Scopus-indexed journals. She is actively involved in academic collaborations and has supervised several research projects related to AI applications in language and behavior analysis.

Moldir Sagynay, Al-Farabi Kazakh National University

Moldir Sagynay is a researcher in the field of Artificial Intelligence and Computational Linguistics. She holds a Master’s degree in Information Technology and focuses on developing intelligent algorithms for text processing, emotion detection, and online communication analysis. Her academic interests include neural network models, low-resource language processing, and digital safety systems.
