(NeurIPS 2023 workshop on SoLaR) Korean Multi-task Text Dataset for Classifying Biased Speech in Real-World Online Services

Dasol-Choi/KoMultiText

KoMultiText

arXiv

Korean Multi-task Dataset for Classifying Biased Speech in Real-World Online Services

Paper Title: Large-Scale Korean Text Dataset for Classifying Biased Speech in Real-World Online Services

Authors

Dasol Choi, Jooyoung Song, Eunsun Lee, Jinwoo Seo, Heejune Park, Dongbin Na

Abstract

The anonymous nature of online services often leads to the presence of biased and harmful language, posing challenges to maintaining the health of online communities. This phenomenon is especially relevant in South Korea, where large-scale hate speech detection algorithms have not yet been broadly explored. In this paper, we introduce a new comprehensive, large-scale dataset collected from a well-known South Korean SNS platform. Our proposed dataset provides annotations including (1) Preferences, (2) Profanities, and (3) Nine types of Bias for the text samples, enabling multi-task learning for simultaneous classification of user-generated texts. Leveraging state-of-the-art BERT-based language models, our approach surpasses human-level accuracy across diverse classification tasks, as measured by various metrics.

Source Codes

Setting                    RoBERTa    KR-BERT    KoELECTRA    KoBigBird
Multi-task                 RoBERTa    KR-BERT    KoELECTRA    KoBigBird
Single-task (Preference)   RoBERTa    KR-BERT    KoELECTRA    KoBigBird
Single-task (Profanity)    RoBERTa    KR-BERT    KoELECTRA    KoBigBird
Single-task (Bias)         RoBERTa    KR-BERT    KoELECTRA    KoBigBird
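
The multi-task setting trains one model on the Preference, Profanity, and Bias annotations at the same time. As a rough illustration of how three task losses might be combined for a single comment, here is a minimal sketch in plain Python. It is not taken from the repository's notebooks: the unweighted sum, the treatment of Profanity as a binary label, the treatment of the nine bias types as independent binary labels, and all function names are assumptions made for this example. The number of Preference classes is not stated here, so it is simply the length of the logits list.

```python
import math

NUM_BIAS_TYPES = 9  # the abstract mentions nine types of bias


def softmax_cross_entropy(logits, target_index):
    """Cross-entropy of one sample for a single-label, multi-class task."""
    peak = max(logits)  # subtract the max for numerical stability
    log_norm = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return log_norm - logits[target_index]


def binary_cross_entropy(logit, target):
    """Cross-entropy of one sample for a binary task, from a raw logit."""
    # Numerically stable form of -[t*log(sigmoid(z)) + (1-t)*log(1-sigmoid(z))].
    return max(logit, 0.0) - logit * target + math.log1p(math.exp(-abs(logit)))


def multi_task_loss(pref_logits, pref_label, prof_logit, prof_label,
                    bias_logits, bias_labels):
    """Unweighted sum of the three task losses (an assumed combination)."""
    if not (len(bias_logits) == len(bias_labels) == NUM_BIAS_TYPES):
        raise ValueError("expected one logit and one label per bias type")
    pref = softmax_cross_entropy(pref_logits, pref_label)
    prof = binary_cross_entropy(prof_logit, prof_label)
    # Assumption: each bias type is an independent yes/no label, averaged.
    bias = sum(binary_cross_entropy(z, t)
               for z, t in zip(bias_logits, bias_labels)) / NUM_BIAS_TYPES
    return pref + prof + bias


# Hypothetical logits for one comment; a real model head would produce these.
example_loss = multi_task_loss(
    pref_logits=[0.2, 1.5, -0.3], pref_label=1,
    prof_logit=2.0, prof_label=1,
    bias_logits=[-1.0] * NUM_BIAS_TYPES, bias_labels=[0] * NUM_BIAS_TYPES,
)
```

In a real training loop the logits would come from task-specific heads on top of a shared encoder such as the models listed above, and the per-task weights would be a tuning choice rather than fixed at one.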

Dataset

The dataset is sourced from "Real-time Best Gallery", a forum on DC Inside, a well-known online community in South Korea.

Download Dataset

  • Total: about 150,000 comments
    • Labeled Dataset: Train Dataset (38,361 comments/5MB), Test Dataset (2,000 comments/286KB)
    • Unlabeled Dataset (110,000 comments/11.5MB)

Models Performance

Download Models

  • Overall classification performance in both the single-task and multi-task settings for the Preference, Profanity, and Bias tasks. The AUROC and PRROC reported for the Bias task are averages over all bias types.

  • Detailed AUROC, F1-score, and PRROC results for each specific bias type.
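
To make the averaging over bias types concrete, the following is a minimal sketch of computing AUROC per bias type and then taking the mean. It uses the pairwise definition of AUROC in plain Python so that it has no dependencies. The function names, the bias names `bias_a` and `bias_b`, and the toy labels and scores are all hypothetical and are not taken from the dataset or the evaluation code.

```python
def auroc(labels, scores):
    """AUROC as the fraction of positive/negative pairs ranked correctly."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUROC needs at least one positive and one negative")
    # A pair counts fully if the positive scores higher; ties count as half.
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def mean_auroc(labels_by_bias, scores_by_bias):
    """Unweighted average of the per-bias AUROC values."""
    values = [auroc(labels_by_bias[name], scores_by_bias[name])
              for name in labels_by_bias]
    return sum(values) / len(values)


# Toy data for two hypothetical bias types, four comments each.
labels_by_bias = {"bias_a": [1, 0, 1, 0], "bias_b": [1, 0, 1, 0]}
scores_by_bias = {"bias_a": [0.9, 0.1, 0.8, 0.3],
                  "bias_b": [0.9, 0.8, 0.2, 0.1]}

per_bias = {name: auroc(labels_by_bias[name], scores_by_bias[name])
            for name in labels_by_bias}
average = mean_auroc(labels_by_bias, scores_by_bias)
```

The pairwise loop is quadratic in the number of comments, which is fine for illustration; for the full test set a library routine such as scikit-learn's `roc_auc_score` would be the more practical choice.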

Citation

If you find this work useful for your research, please cite our paper:

@misc{choi2023largescale,
      title={Large-Scale Korean Text Dataset for Classifying Biased Speech in Real-World Online Services}, 
      author={Dasol Choi and Jooyoung Song and Eunsun Lee and Jinwoo Seo and Heejune Park and Dongbin Na},
      year={2023},
      eprint={2310.04313},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}
