Zampieri et al.: SemEval-2020 Task 12: Multilingual Offensive Language Identification in Social Media (OffensEval 2020)

Summary

Set up a set of classification tasks for offensive language across multiple languages
1. Identification of offensive language (y/n on whether something is offensive)
2. Automatic categorization of offensive types (targeted or not targeted?)
  - targeted - insults towards a group or individual
  - not targeted - profanity or swearing
3. Target identification (individual, group, other)
  - other category includes organizations, events, issues
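
The three levels above can be sketched as a small data structure. This is only an illustration, not code from the paper: the `Annotation` class and its field names are hypothetical, and it assumes the levels are nested (level 2 is labeled only for offensive posts, level 3 only for targeted ones).

```python
from dataclasses import dataclass
from typing import Optional

# Level 3 categories as listed in the notes above.
TARGET_TYPES = {"individual", "group", "other"}


@dataclass(frozen=True)
class Annotation:
    """Hypothetical container for one post's three-level label."""

    offensive: bool                  # level 1: offensive or not
    targeted: Optional[bool] = None  # level 2: set only for offensive posts
    target: Optional[str] = None     # level 3: set only for targeted posts

    def __post_init__(self) -> None:
        # Assumed nesting: each level is defined only if the previous one is positive.
        if not self.offensive and self.targeted is not None:
            raise ValueError("level 2 applies only to offensive posts")
        if self.offensive and self.targeted is None:
            raise ValueError("offensive posts need a level 2 label")
        if self.targeted:
            if self.target not in TARGET_TYPES:
                raise ValueError("targeted posts need a valid level 3 label")
        elif self.target is not None:
            raise ValueError("level 3 applies only to targeted posts")


# Example labels: a non-offensive post, untargeted profanity, and a group insult.
clean = Annotation(offensive=False)
profanity = Annotation(offensive=True, targeted=False)
group_insult = Annotation(offensive=True, targeted=True, target="group")
```

Rejecting inconsistent combinations at construction time keeps the nesting explicit, so a level 3 label can never appear on a post that is not targeted.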
Used the Offensive Language Identification Dataset (OLID), 14k+ English tweets labeled at all three levels
Teams were provided OLID and SOLID (Semi-Supervised Offensive Language Identification Dataset)
Language breakdown
- Arabic dataset had 10k tweets (Twitter language filter lang:ar)
- Danish had 3.6k comments from FB, Reddit, and a local newspaper
- Greek used the Offensive Greek Twitter Dataset, 10k+ tweets
- Turkish used 35k+ tweets
Models
- Most teams used contextualized transformers (BERT/RoBERTa/mBERT)
- Word embeddings also showed up, some sentiment analysis as well
Results
- Best performing team used a cross-lingual model (XLM-RoBERTa), ensembling XLM-R base and large
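
These notes do not record how the winning ensemble combined its models. As a rough sketch of one common option, soft voting, the function below averages per-class probabilities across models and picks the highest. The name `soft_vote` and the probability values are made up for illustration, and the winning team may have used a different combination method.

```python
from typing import List, Sequence


def soft_vote(model_probs: Sequence[Sequence[Sequence[float]]]) -> List[int]:
    """Average class probabilities over models and return the argmax per example.

    model_probs[m][i][c] is model m's probability of class c for example i.
    """
    n_models = len(model_probs)
    predictions = []
    # zip(*model_probs) groups each example's probability rows across all models.
    for per_example in zip(*model_probs):
        n_classes = len(per_example[0])
        avg = [sum(p[c] for p in per_example) / n_models for c in range(n_classes)]
        predictions.append(max(range(n_classes), key=avg.__getitem__))
    return predictions


# Made-up probabilities for three posts; class 0 = not offensive, class 1 = offensive.
probs_base = [[0.6, 0.4], [0.2, 0.8], [0.9, 0.1]]
probs_large = [[0.3, 0.7], [0.1, 0.9], [0.7, 0.3]]
labels = soft_vote([probs_base, probs_large])
```

In the first post the two models disagree, and the average decides the label, which is the main reason to ensemble models of different sizes.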