Abstract
This paper describes the CIC-IPN submissions to the shared task on Indian Native Language Identification (INLI 2018). We use a Support Vector Machine (SVM) classifier trained on several feature types: word, character, part-of-speech tag, and punctuation mark n-grams, as well as character n-grams from misspelled words and emotion-based features. The features are weighted using the log-entropy scheme. Our team achieved 41.8% accuracy on test set 1 and 34.5% accuracy on test set 2, ranking 3rd in the official INLI shared task scoring.
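To make the weighting step concrete, the following is a minimal sketch of character n-gram counting followed by log-entropy weighting. It assumes the commonly used formulation (local weight log(1 + tf), global weight 1 + Σ p·log p / log N, where p is a term's share of its collection-wide count in one document and N is the number of documents); the abstract does not state which variant the authors used. The function names `char_ngrams` and `log_entropy_weights` are hypothetical, and the SVM training step is omitted.

```python
import math
from collections import Counter


def char_ngrams(text, n=3):
    """Count overlapping character n-grams in a string."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def log_entropy_weights(doc_counts):
    """Apply log-entropy weighting to per-document term counts.

    doc_counts: list of Counter objects, one per document.
    Returns a list of dicts mapping term -> weighted value.
    Assumed formulation (not confirmed by the source):
      local  = log(1 + tf)
      global = 1 + sum_j(p_j * log(p_j)) / log(N)
    """
    n_docs = len(doc_counts)

    # Total count of each term across the whole collection.
    global_freq = Counter()
    for counts in doc_counts:
        global_freq.update(counts)

    global_weight = {}
    for term, gf in global_freq.items():
        if n_docs <= 1:
            # log(1) == 0 would divide by zero; treat the weight as neutral.
            global_weight[term] = 1.0
            continue
        entropy_sum = 0.0
        for counts in doc_counts:
            tf = counts.get(term, 0)
            if tf > 0:
                p = tf / gf
                entropy_sum += p * math.log(p)
        global_weight[term] = 1.0 + entropy_sum / math.log(n_docs)

    return [
        {term: math.log(1 + tf) * global_weight[term] for term, tf in counts.items()}
        for counts in doc_counts
    ]


# Toy usage: two short "documents" represented by character bigrams.
docs = [char_ngrams("aaa", 2), char_ngrams("aab", 2)]
weighted = log_entropy_weights(docs)
```

Under this formulation, a term spread evenly over all documents receives a global weight of 0, while a term concentrated in a single document keeps a global weight of 1, so features that discriminate between documents are emphasized.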
Original language | English |
---|---|
Pages (from-to) | 82-88 |
Number of pages | 7 |
Journal | CEUR Workshop Proceedings |
Volume | 2266 |
State | Published - 2018 |
Event | 10th Working Notes of FIRE - Forum for Information Retrieval Evaluation, FIRE-WN 2018 - Gandhinagar, India. Duration: 6 Dec 2018 → 9 Dec 2018 |
Keywords
- Feature engineering
- Indian languages
- Machine learning
- Native Language Identification
- Social media