Efficient discrimination between closely related languages

In this paper, we revisit the problem of language identification with the focus on proper discrimination between closely related languages. Strong similarities between certain languages make it very hard to classify them correctly using standard methods that have been proposed in the literature. Ded...

Permalink: http://skupni.nsk.hr/Record/ffzg.KOHA-OAI-FFZG:318226/Details
Parent publication: Proceedings of COLING 2012
Mumbai : 2012
Main authors: Tiedemann, Jörg (-), Ljubešić, Nikola, computer scientist (Author)
Material type: Article
Language: eng
LEADER 02000naa a2200229uu 4500
008 131111s2012 xx 1 eng|d
035 |a (CROSBI)616773 
040 |a HR-ZaFF  |b hrv  |c HR-ZaFF  |e ppiak 
100 1 |a Tiedemann, Jörg 
245 1 0 |a Efficient discrimination between closely related languages /  |c Tiedemann, Jörg ; Ljubešić, Nikola. 
246 3 |i Title in English:  |a Efficient Discrimination Between Closely Related Languages 
300 |a 2619-2634  |f pp. 
520 |a In this paper, we revisit the problem of language identification with the focus on proper discrimination between closely related languages. Strong similarities between certain languages make it very hard to classify them correctly using standard methods that have been proposed in the literature. Dedicated models that focus on specific discrimination tasks help to improve the accuracy of general-purpose language identification tools. We propose and compare methods based on simple document classification techniques trained on parallel corpora of closely related languages and methods that emphasize discriminating features in terms of blacklisted words. Our experiments demonstrate that these techniques are highly accurate for the difficult task of discriminating between Bosnian, Croatian and Serbian. The best setup yields an absolute improvement of over 9% in accuracy over the best performing baseline using a state-of-the-art language identification tool. 
536 |a Project MZOS  |f FP7-288342 
546 |a ENG 
690 |a 5.04 
693 |a language identification, language discrimination, closely related languages  |l hrv  |2 crosbi 
693 |a language identification, language discrimination, closely related languages  |l eng  |2 crosbi 
773 0 |a COLING 2012 (10.-15.12.2012. ; Mumbai, India)  |t Proceedings of COLING 2012  |d Mumbai : 2012  |g pp. 2619-2634 
700 1 |9 445  |a Ljubešić, Nikola,   |c computer scientist  |4 aut 
942 |c RZB  |u 2  |v Peer review  |z Scientific - Lecture - Full paper  |t 1.08 
999 |c 318226  |d 318224