Paper Title: Large-Scale Korean Text Dataset for Classifying Biased Speech in Real-World Online Services
- This repository provides the Korean multi-task text dataset and PyTorch implementations of the classification models.
- (News) This work has been accepted to the NeurIPS 2023 Workshop on Socially Responsible Language Modelling Research (SoLaR).
Dasol Choi, Jooyoung Song, Eunsun Lee, Jinwoo Seo, Heejune Park, Dongbin Na
The anonymous nature of online services often leads to the presence of biased and harmful language, posing challenges to maintaining the health of online communities. This phenomenon is especially relevant in South Korea, where large-scale hate speech detection algorithms have not yet been broadly explored. In this paper, we introduce a new comprehensive, large-scale dataset collected from a well-known South Korean SNS platform. Our proposed dataset provides annotations including (1) Preferences, (2) Profanities, and (3) Nine types of Bias for the text samples, enabling multi-task learning for simultaneous classification of user-generated texts. Leveraging state-of-the-art BERT-based language models, our approach surpasses human-level accuracy across diverse classification tasks, as measured by various metrics.
|  | RoBERTa | KR-BERT | KoELECTRA | KoBigBird |
|---|---|---|---|---|
| Multi-task | RoBERTa | KR-BERT | KoELECTRA | KoBigBird |
| Single-task (Preference) | RoBERTa | KR-BERT | KoELECTRA | KoBigBird |
| Single-task (Profanity) | RoBERTa | KR-BERT | KoELECTRA | KoBigBird |
| Single-task (Bias) | RoBERTa | KR-BERT | KoELECTRA | KoBigBird |
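Below is a minimal sketch of how a multi-task classifier on top of one of these Korean BERT-style backbones could be assembled. The `klue/roberta-base` checkpoint name, the head sizes, and the label layout are illustrative assumptions, not the exact configuration used in this repository.

```python
# Minimal multi-task classification sketch (not the exact training code of this repo).
# Assumptions: "klue/roberta-base" as the Korean backbone, a binary Preference head,
# a binary Profanity head, and a 9-way multi-label Bias head.
import torch
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

class MultiTaskClassifier(nn.Module):
    def __init__(self, backbone_name="klue/roberta-base", num_bias_types=9):
        super().__init__()
        self.backbone = AutoModel.from_pretrained(backbone_name)
        hidden = self.backbone.config.hidden_size
        self.preference_head = nn.Linear(hidden, 2)          # binary preference
        self.profanity_head = nn.Linear(hidden, 2)           # binary profanity
        self.bias_head = nn.Linear(hidden, num_bias_types)   # multi-label bias

    def forward(self, input_ids, attention_mask):
        out = self.backbone(input_ids=input_ids, attention_mask=attention_mask)
        cls = out.last_hidden_state[:, 0]                    # [CLS] representation
        return (
            self.preference_head(cls),
            self.profanity_head(cls),
            self.bias_head(cls),
        )

tokenizer = AutoTokenizer.from_pretrained("klue/roberta-base")
model = MultiTaskClassifier()
batch = tokenizer(["예시 댓글입니다."], return_tensors="pt", padding=True, truncation=True)
pref_logits, prof_logits, bias_logits = model(batch["input_ids"], batch["attention_mask"])
```

In such a setup, the preference and profanity heads would typically be trained with cross-entropy and the multi-label bias head with a binary cross-entropy loss, with the three losses summed for the multi-task objective.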
The dataset is sourced from the "Real-time Best Gallery" forum of DC Inside, a well-known online community in South Korea.
- Total: 150,000 comments
- Labeled Dataset: Train Dataset (38,361 comments / 5MB), Test Dataset (2,000 comments / 286KB)
- Unlabeled Dataset: 110,000 comments / 11.5MB
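As a rough illustration of how the labeled split could be wrapped for PyTorch training, the sketch below assumes a `train.csv` file with a `text` column and per-task label columns; the actual file names and schema of the release may differ.

```python
# Illustrative loader for the labeled split; the file name "train.csv" and the
# column names ("text", "preference", "profanity") are placeholders -- check the
# released files for the actual schema.
import pandas as pd
import torch
from torch.utils.data import Dataset

class KoreanBiasDataset(Dataset):
    def __init__(self, csv_path, tokenizer, max_length=256):
        self.df = pd.read_csv(csv_path)
        self.tokenizer = tokenizer
        self.max_length = max_length

    def __len__(self):
        return len(self.df)

    def __getitem__(self, idx):
        row = self.df.iloc[idx]
        enc = self.tokenizer(
            row["text"],
            truncation=True,
            max_length=self.max_length,
            padding="max_length",
            return_tensors="pt",
        )
        labels = {
            "preference": torch.tensor(row["preference"], dtype=torch.long),
            "profanity": torch.tensor(row["profanity"], dtype=torch.long),
            # the nine bias label columns would be stacked similarly for the Bias head
        }
        return {k: v.squeeze(0) for k, v in enc.items()}, labels
```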
- Overall classification performance for both single-task and multi-task settings across the Preference, Profanity, and Bias tasks. The AUROC and PRROC scores for the Bias task are averages over all nine bias types.
- Detailed AUROC, F1-score, and PRROC results for each specific bias type.
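For reference, here is a small sketch of how these metrics can be computed with scikit-learn, interpreting PRROC as the area under the precision-recall curve (average precision); the label and score arrays are dummy values for illustration only.

```python
# Sketch of the reported metrics using scikit-learn; not tied to this repo's evaluation script.
import numpy as np
from sklearn.metrics import roc_auc_score, f1_score, average_precision_score

y_true = np.array([0, 1, 1, 0, 1])             # ground-truth labels for one bias type (dummy)
y_score = np.array([0.1, 0.8, 0.6, 0.3, 0.9])  # predicted probabilities (dummy)
y_pred = (y_score >= 0.5).astype(int)          # thresholded predictions for F1

auroc = roc_auc_score(y_true, y_score)
f1 = f1_score(y_true, y_pred)
pr_auc = average_precision_score(y_true, y_score)  # area under the precision-recall curve
print(f"AUROC={auroc:.3f}  F1={f1:.3f}  PR-AUC={pr_auc:.3f}")
```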
If you find this work useful for your research, please cite our paper:
@misc{choi2023largescale,
      title={Large-Scale Korean Text Dataset for Classifying Biased Speech in Real-World Online Services},
      author={Dasol Choi and Jooyoung Song and Eunsun Lee and Jinwoo Seo and Heejune Park and Dongbin Na},
      year={2023},
      eprint={2310.04313},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}