A convolutional neural network approach for gender and language variety identification

Helena Gómez-Adorno, Roddy Fuentes-Alba, Ilia Markov, Grigori Sidorov, Alexander Gelbukh

Research output: Contribution to journal, Article

Abstract

We present a method for gender and language variety identification using a convolutional neural network (CNN). We compare its performance with that of a traditional machine learning algorithm, support vector machines (SVM), trained on character n-grams (n = 3-8), lexical features (word unigrams and bigrams), and their combinations. We use a single multi-labeled corpus of news articles in different varieties of Spanish, developed specifically for these tasks. The CNN architecture is trained on word- and sentence-level embeddings and can be successfully applied to gender and language variety identification on a relatively small corpus (fewer than 10,000 documents). Our experiments show that the deep learning approach outperforms the traditional machine learning approach on both tasks when named entities are present in the corpus. However, when all named entities are reduced to a single symbol, NE, to avoid topic-dependent features, the drop in accuracy is larger for the deep learning approach.
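The baseline features named in the abstract (character n-grams with n = 3-8, word unigrams and bigrams, and the replacement of named entities by the symbol NE) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names, the whitespace tokenization, and the assumption that the entity strings come from some external named-entity recognizer are all hypothetical choices made for illustration.

```python
from collections import Counter


def char_ngrams(text, n_min=3, n_max=8):
    """Count every character n-gram of length n_min..n_max (inclusive)."""
    counts = Counter()
    for n in range(n_min, n_max + 1):
        for i in range(len(text) - n + 1):
            counts[text[i:i + n]] += 1
    return counts


def word_ngrams(text, n_max=2):
    """Count word unigrams up to n_max-grams.

    Assumption: plain whitespace tokenization; the paper's tokenizer
    is not described in the abstract.
    """
    tokens = text.split()
    counts = Counter()
    for n in range(1, n_max + 1):
        for i in range(len(tokens) - n + 1):
            counts[" ".join(tokens[i:i + n])] += 1
    return counts


def mask_entities(text, entities, symbol="NE"):
    """Replace each known entity string with a single symbol.

    Assumption: `entities` is supplied by an external recognizer.
    Longer entities are replaced first so that a multi-word name is
    masked as a whole before any shorter name contained in it.
    """
    for entity in sorted(entities, key=len, reverse=True):
        text = text.replace(entity, symbol)
    return text


if __name__ == "__main__":
    doc = "Messi juega en Buenos Aires"
    masked = mask_entities(doc, ["Messi", "Buenos Aires"])
    print(masked)
    print(word_ngrams(masked).most_common(3))
    print(len(char_ngrams(masked)))
```

The resulting counts could be fed to any linear classifier such as an SVM. Masking before feature extraction is what removes topic-dependent cues like place and person names from both the character and the word features.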

Original language: English
Pages (from-to): 4845-4855
Number of pages: 11
Journal: Journal of Intelligent and Fuzzy Systems
Volume: 36
Issue number: 5
DOI: 10.3233/JIFS-179032
State: Published - 1 Jan 2019

Keywords

  • Author profiling
  • Character n-grams
  • Convolutional neural networks
  • Deep learning
  • Gender identification
  • Language variety identification
  • Machine learning
  • Spanish

Cite this

@article{b771213dc78d49268b1de0bf7862e25a,
title = "A convolutional neural network approach for gender and language variety identification",
keywords = "Author profiling, Character n-grams, Convolutional neural networks, Deep learning, Gender identification, Language variety identification, Machine learning, Spanish",
author = "Helena G{\'o}mez-Adorno and Roddy Fuentes-Alba and Ilia Markov and Grigori Sidorov and Alexander Gelbukh",
year = "2019",
month = "1",
day = "1",
doi = "10.3233/JIFS-179032",
language = "English",
volume = "36",
pages = "4845--4855",
journal = "Journal of Intelligent and Fuzzy Systems",
issn = "1064-1246",
publisher = "IOS Press",
number = "5",

}

A convolutional neural network approach for gender and language variety identification. / Gómez-Adorno, Helena; Fuentes-Alba, Roddy; Markov, Ilia; Sidorov, Grigori; Gelbukh, Alexander.

In: Journal of Intelligent and Fuzzy Systems, Vol. 36, No. 5, 01.01.2019, p. 4845-4855.

TY - JOUR

T1 - A convolutional neural network approach for gender and language variety identification

AU - Gómez-Adorno, Helena

AU - Fuentes-Alba, Roddy

AU - Markov, Ilia

AU - Sidorov, Grigori

AU - Gelbukh, Alexander

PY - 2019/1/1

Y1 - 2019/1/1

KW - Author profiling

KW - Character n-grams

KW - Convolutional neural networks

KW - Deep learning

KW - Gender identification

KW - Language variety identification

KW - Machine learning

KW - Spanish

UR - http://www.scopus.com/inward/record.url?scp=85066429431&partnerID=8YFLogxK

U2 - 10.3233/JIFS-179032

DO - 10.3233/JIFS-179032

M3 - Article

AN - SCOPUS:85066429431

VL - 36

SP - 4845

EP - 4855

JO - Journal of Intelligent and Fuzzy Systems

JF - Journal of Intelligent and Fuzzy Systems

SN - 1064-1246

IS - 5

ER -