Zampieri et al.: SemEval-2020 Task 12: Multilingual Offensive Language Identification in Social Media (OffensEval 2020)
Tags: papers, nlp, bias nlp project, arabic nlp
Summary
-
Defined a set of classification tasks for offensive language across multiple languages
- Offensive language identification (yes/no: is the text offensive?)
- Automatic categorization of offense types (targeted or not targeted?)
- targeted - insults towards a group or individual
- not targeted - profanity or swearing
- Target identification (individual, group, other)
- other category includes organizations, events, issues
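The three tasks above nest: the targeted / not targeted question only applies to offensive text, and the target type only applies to targeted text. A minimal sketch of that hierarchy as a data structure follows. The class name, field names, and label strings are illustrative assumptions, not the tag set used in the paper or in OLID.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical label strings for the third level (not the paper's tag names).
TARGET_TYPES = {"individual", "group", "other"}


@dataclass(frozen=True)
class Annotation:
    """One text labeled at up to three nested levels."""
    offensive: bool                   # level 1: offensive or not
    targeted: Optional[bool] = None   # level 2: only set for offensive text
    target: Optional[str] = None      # level 3: only set for targeted text

    def __post_init__(self) -> None:
        # Non-offensive text carries no further labels.
        if not self.offensive and (self.targeted is not None or self.target is not None):
            raise ValueError("levels 2 and 3 apply only to offensive text")
        # Offensive text must say whether it is targeted.
        if self.offensive and self.targeted is None:
            raise ValueError("offensive text needs a targeted / not targeted label")
        # Targeted text must name a valid target type.
        if self.targeted and self.target not in TARGET_TYPES:
            raise ValueError("targeted text needs a target type")
        # Untargeted text (profanity, swearing) has no target type.
        if self.targeted is False and self.target is not None:
            raise ValueError("untargeted text has no target type")


clean = Annotation(offensive=False)
swearing = Annotation(offensive=True, targeted=False)
insult = Annotation(offensive=True, targeted=True, target="group")
```

Constructing an inconsistent combination, such as a non-offensive text with a target type, raises a ValueError, which makes the dependency between the levels explicit.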
-
Used the Offensive Language Identification Dataset (OLID), a set of 14k+ English tweets labeled at all three levels
-
Teams were provided OLID and SOLID (the Semi-Supervised Offensive Language Identification Dataset)
-
Language breakdown
- Arabic dataset had 10k tweets (collected with the Twitter language filter lang:ar)
- Danish had 3.6k comments from Facebook, Reddit, and a local newspaper
- Greek used the Offensive Greek Twitter Dataset, 10k+ tweets
- Turkish used 35k+ tweets
-
Models
- Most teams used contextualized transformers (BERT/RoBERTa/mBERT)
- Word embeddings also showed up, along with some sentiment analysis
-
Results
- Best-performing team used a cross-lingual model (XLM-RoBERTa), ensembling XLM-R base and large
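These notes do not record how the ensemble was combined. As a rough illustration of the general idea only, the sketch below averages the class probabilities of two models (soft voting) and picks the most likely label. The function names, label strings, and probability values are all hypothetical, and no real model is loaded.

```python
from typing import List, Sequence


def soft_vote(prob_sets: Sequence[Sequence[float]]) -> List[float]:
    """Average per-class probabilities across models (assumed scheme)."""
    n_models = len(prob_sets)
    n_classes = len(prob_sets[0])
    return [sum(p[i] for p in prob_sets) / n_models for i in range(n_classes)]


def predict(prob_sets: Sequence[Sequence[float]],
            labels: Sequence[str] = ("not offensive", "offensive")) -> str:
    """Return the label with the highest averaged probability."""
    avg = soft_vote(prob_sets)
    best = max(range(len(avg)), key=avg.__getitem__)
    return labels[best]


# Made-up outputs standing in for a base-size and a large-size model.
base_probs = [0.75, 0.25]
large_probs = [0.125, 0.875]
label = predict([base_probs, large_probs])
```

Here the two hypothetical models disagree, and the averaged probabilities favor the second class, so `label` is "offensive". Weighted averaging or majority voting would be equally plausible readings of "ensembled".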