Language- and subtask-dependent feature selection and classifier parameter tuning for Author Profiling: Notebook for PAN at CLEF 2017

Ilia Markov; Helena Gómez-Adorno; Grigori Sidorov

Centro de Investigación en Computación (CIC)

Research output: Contribution to journal › Conference article › peer-review

17 Scopus citations

Abstract

We present the CIC's approach to the Author Profiling (AP) task at PAN 2017. This year's task consists of two subtasks: gender and language variety identification in English, Spanish, Portuguese, and Arabic. We use typed and untyped character n-grams, word n-grams, and non-textual features (domain names). We experimented with various feature representations (binary, raw frequency, normalized frequency, log-entropy weighting, tf-idf), machine-learning algorithms (liblinear and libSVM implementations of Support Vector Machines (SVM), multinomial naive Bayes, an ensemble classifier, and meta-classifiers), and frequency threshold values. We adjusted the system configuration for each language and subtask.
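As a rough illustration of three of the feature representations named in the abstract (binary, raw frequency, normalized frequency) and of a frequency threshold, the following minimal sketch works on untyped character n-grams. It is not the authors' implementation: the function names `char_ngrams`, `represent`, and `apply_threshold` are hypothetical, and treating the threshold as a minimum total corpus frequency is an assumption, since the abstract does not define it.

```python
from collections import Counter


def char_ngrams(text, n=3):
    """Return all overlapping (untyped) character n-grams of `text`."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def represent(ngrams, scheme="raw"):
    """Map a list of n-grams to a feature dict under the given scheme."""
    counts = Counter(ngrams)
    if scheme == "binary":
        # 1 if the n-gram occurs at all, regardless of how often
        return {g: 1 for g in counts}
    if scheme == "raw":
        # plain occurrence counts
        return dict(counts)
    if scheme == "normalized":
        # counts divided by the total number of n-grams in the document
        total = sum(counts.values())
        return {g: c / total for g, c in counts.items()}
    raise ValueError(f"unknown scheme: {scheme}")


def apply_threshold(docs_ngrams, min_freq=2):
    """Drop n-grams whose total corpus frequency is below `min_freq`.

    Assumption: the threshold is applied to corpus-wide counts.
    """
    corpus = Counter()
    for ngrams in docs_ngrams:
        corpus.update(ngrams)
    vocab = {g for g, c in corpus.items() if c >= min_freq}
    return [[g for g in ngrams if g in vocab] for ngrams in docs_ngrams]


if __name__ == "__main__":
    docs = ["hello there", "hello world"]
    grams = apply_threshold([char_ngrams(d) for d in docs], min_freq=2)
    for g in grams:
        print(represent(g, scheme="normalized"))
```

The resulting dictionaries could then be vectorized and passed to any of the classifiers listed in the abstract; the choice of scheme and threshold is what would be tuned per language and subtask.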

Original language	English
Journal	CEUR Workshop Proceedings
Volume	1866
State	Published - 2017
Event	18th Working Notes of CLEF Conference and Labs of the Evaluation Forum, CLEF 2017 - Dublin, Ireland Duration: 11 Sep 2017 → 14 Sep 2017

Cite this

@article{ee2694df428746a892ddf03869127343,

title = "Language- and subtask-dependent feature selection and classifier parameter tuning for Author Profiling: Notebook for PAN at CLEF 2017",

abstract = "We present the CIC's approach to the Author Profiling (AP) task at PAN 2017. This year's task consists of two subtasks: gender and language variety identification in English, Spanish, Portuguese, and Arabic. We use typed and untyped character n-grams, word n-grams, and non-textual features (domain names). We experimented with various feature representations (binary, raw frequency, normalized frequency, log-entropy weighting, tf-idf), machine-learning algorithms (liblinear and libSVM implementations of Support Vector Machines (SVM), multinomial naive Bayes, an ensemble classifier, and meta-classifiers), and frequency threshold values. We adjusted the system configuration for each language and subtask.",

author = "Ilia Markov and Helena G{\'o}mez-Adorno and Grigori Sidorov",

note = "Funding Information: This work was partially supported by the Mexican Government (CONACYT projects 240844, SNI, COFAA-IPN, SIP-IPN 20162204, 20162064, 20171813, 20171344, and 20172008).; 18th Working Notes of CLEF Conference and Labs of the Evaluation Forum, CLEF 2017 ; Conference date: 11-09-2017 Through 14-09-2017",

year = "2017",

language = "English",

volume = "1866",

journal = "CEUR Workshop Proceedings",

issn = "1613-0073",

publisher = "CEUR-WS",

}

TY - JOUR

T1 - Language- and subtask-dependent feature selection and classifier parameter tuning for Author Profiling

T2 - 18th Working Notes of CLEF Conference and Labs of the Evaluation Forum, CLEF 2017

AU - Markov, Ilia

AU - Gómez-Adorno, Helena

AU - Sidorov, Grigori

N1 - Funding Information: This work was partially supported by the Mexican Government (CONACYT projects 240844, SNI, COFAA-IPN, SIP-IPN 20162204, 20162064, 20171813, 20171344, and 20172008).

PY - 2017

Y1 - 2017

N2 - We present the CIC's approach to the Author Profiling (AP) task at PAN 2017. This year's task consists of two subtasks: gender and language variety identification in English, Spanish, Portuguese, and Arabic. We use typed and untyped character n-grams, word n-grams, and non-textual features (domain names). We experimented with various feature representations (binary, raw frequency, normalized frequency, log-entropy weighting, tf-idf), machine-learning algorithms (liblinear and libSVM implementations of Support Vector Machines (SVM), multinomial naive Bayes, an ensemble classifier, and meta-classifiers), and frequency threshold values. We adjusted the system configuration for each language and subtask.

AB - We present the CIC's approach to the Author Profiling (AP) task at PAN 2017. This year's task consists of two subtasks: gender and language variety identification in English, Spanish, Portuguese, and Arabic. We use typed and untyped character n-grams, word n-grams, and non-textual features (domain names). We experimented with various feature representations (binary, raw frequency, normalized frequency, log-entropy weighting, tf-idf), machine-learning algorithms (liblinear and libSVM implementations of Support Vector Machines (SVM), multinomial naive Bayes, an ensemble classifier, and meta-classifiers), and frequency threshold values. We adjusted the system configuration for each language and subtask.

UR - http://www.scopus.com/inward/record.url?scp=85034760880&partnerID=8YFLogxK

M3 - Conference article

AN - SCOPUS:85034760880

SN - 1613-0073

VL - 1866

JO - CEUR Workshop Proceedings

JF - CEUR Workshop Proceedings

Y2 - 11 September 2017 through 14 September 2017

ER -
