Language- and subtask-dependent feature selection and classifier parameter tuning for Author Profiling: Notebook for PAN at CLEF 2017

Ilia Markov, Helena Gómez-Adorno, Grigori Sidorov

Research output: Contribution to journal › Conference article › peer-review

17 Scopus citations

Abstract

We present the CIC's approach to the Author Profiling (AP) task at PAN 2017. This year's task consists of two subtasks: gender identification and language variety identification in English, Spanish, Portuguese, and Arabic. We use typed and untyped character n-grams, word n-grams, and non-textual features (domain names). We experiment with various feature representations (binary, raw frequency, normalized frequency, log-entropy weighting, tf-idf), machine-learning algorithms (liblinear and libSVM implementations of Support Vector Machines (SVM), multinomial naive Bayes, an ensemble classifier, meta-classifiers), and frequency threshold values. We adjust the system configuration for each language and subtask.
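
As a rough illustration of the feature pipeline the abstract describes, the following sketch extracts untyped character n-grams and word n-grams, applies a frequency threshold, and builds binary, raw-frequency, normalized-frequency, and tf-idf vectors. It is a minimal sketch and not the authors' implementation: all function names, the whitespace tokenization, the n-gram sizes, the threshold value, and the plain `tf * log(N / df)` weighting are assumptions chosen for illustration. Typed n-grams, log-entropy weighting, domain-name features, and the classifiers themselves are omitted.

```python
import math
from collections import Counter


def char_ngrams(text, n=3):
    """Return all overlapping (untyped) character n-grams of a string."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def word_ngrams(text, n=2):
    """Return word n-grams from a lowercased, whitespace-tokenized string."""
    tokens = text.lower().split()
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def build_vocabulary(docs_features, min_freq=2):
    """Keep features whose total corpus frequency reaches the threshold.

    min_freq is a hypothetical threshold; the abstract only says that
    several threshold values were tried.
    """
    totals = Counter()
    for feats in docs_features:
        totals.update(feats)
    return sorted(f for f, c in totals.items() if c >= min_freq)


def inverse_document_frequency(docs_features, vocab):
    """Plain idf = log(N / df); every vocabulary feature has df >= 1."""
    n_docs = len(docs_features)
    doc_sets = [set(feats) for feats in docs_features]
    return {f: math.log(n_docs / sum(f in s for s in doc_sets)) for f in vocab}


def vectorize(feats, vocab, scheme="raw", idf=None):
    """Map one document's feature list to a vector over the vocabulary."""
    counts = Counter(feats)
    # Total count of in-vocabulary features, used for normalization.
    total = sum(counts[f] for f in vocab) or 1
    row = []
    for f in vocab:
        c = counts[f]
        if scheme == "binary":
            row.append(1.0 if c else 0.0)
        elif scheme == "raw":
            row.append(float(c))
        elif scheme == "normalized":
            row.append(c / total)
        elif scheme == "tfidf":
            row.append(c * idf[f])
        else:
            raise ValueError(f"unknown scheme: {scheme}")
    return row


# Toy corpus (made up for illustration only).
docs = ["the cat sat", "the cat ran", "a dog"]
unigrams = [word_ngrams(d, n=1) for d in docs]
vocab = build_vocabulary(unigrams, min_freq=2)
idf = inverse_document_frequency(unigrams, vocab)
vectors = [vectorize(u, vocab, scheme="tfidf", idf=idf) for u in unigrams]
```

The resulting vectors could then be passed to any of the classifiers named in the abstract; selecting the n-gram type, representation, and threshold separately for each language and subtask is the tuning step the paper refers to.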

Original language: English
Journal: CEUR Workshop Proceedings
Volume: 1866
State: Published - 2017
Event: 18th Working Notes of CLEF Conference and Labs of the Evaluation Forum, CLEF 2017 - Dublin, Ireland
Duration: 11 Sep 2017 to 14 Sep 2017
