عجفت الغور

Zampieri et al.: SemEval-2020 Task 12: Multilingual Offensive Language Identification in Social Media (OffensEval 2020)

Tags: papers, nlp, bias nlp project, arabic nlp

Summary

  • Set up three classification tasks for offensive language across multiple languages

    1. Identification of offensive language (y/n on whether something is offensive)
    2. Automatic categorization of offensive types (targeted or not targeted?)
      • targeted - insults towards a group or individual
      • not targeted - profanity or swearing
    3. Target identification (individual, group, other)
      • other category includes organizations, events, issues
  • Used the Offensive Language Identification Dataset (OLID), a set of 14k+ English tweets labeled for all three tasks

  • Teams were provided OLID and SOLID (Semi-supervised offensive language identification dataset)

  • Language breakdown

    • Arabic dataset had 10k tweets (Twitter language filter on lang:ar)
    • Danish had 3.6k comments from FB, Reddit, and local newspaper
    • Greek used the Offensive Greek Twitter Dataset, 10k+ tweets
    • Turkish used 35k+ tweets
  • Models

    • Most teams used contextualized transformers (BERT/RoBERTa/mBERT)
    • Word embeddings also showed up, as did some sentiment analysis
  • Results

    • Best performing team used a cross-lingual model (XLM-RoBERTa), ensembling the XLM-R base and large variants
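
The three tasks in the summary nest: the targeted/not-targeted question only applies to offensive posts, and the target type only applies to targeted ones. Below is a minimal sketch of that hierarchy as a data structure. The class name `OffenseLabel`, its field names, and the `TARGETS` strings are my own placeholders for illustration, not the label codes used in OLID or by the task organizers.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical category names; the notes only list individual, group, other.
TARGETS = {"individual", "group", "other"}


@dataclass(frozen=True)
class OffenseLabel:
    offensive: bool                  # level 1: is the post offensive?
    targeted: Optional[bool] = None  # level 2: only set when offensive
    target: Optional[str] = None     # level 3: only set when targeted

    def __post_init__(self) -> None:
        # Enforce the nesting: each level requires the one above it.
        if not self.offensive and self.targeted is not None:
            raise ValueError("level 2 applies only to offensive posts")
        if not self.targeted and self.target is not None:
            raise ValueError("level 3 applies only to targeted posts")
        if self.target is not None and self.target not in TARGETS:
            raise ValueError(f"unknown target: {self.target!r}")


# A targeted insult aimed at a group, untargeted profanity, and a clean post.
insult = OffenseLabel(offensive=True, targeted=True, target="group")
profanity = OffenseLabel(offensive=True, targeted=False)
clean = OffenseLabel(offensive=False)
```

Keeping the lower levels as `None` when they do not apply makes it hard to build an inconsistent label, such as a non-offensive post with a target.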