Language identification: how to distinguish similar languages?

The goal of this paper is to discuss the language identification problem of Croatian, a language that even state-of-the-art language identification tools find hard to distinguish from similar languages such as Serbian, Slovenian, or Slovak. We developed a tool that implements the...

Permalink: http://skupni.nsk.hr/Record/ffzg.KOHA-OAI-FFZG:315303/Details
Host publication: ITI 2007 Proceedings of the 29th International Conference on INFORMATION TECHNOLOGY INTERFACES
Zagreb : SRCE, 2007.
Main authors: Ljubešić, Nikola (-), Mikelić, Nives (Author), Boras, Damir
Material type: Article
Language: eng
LEADER 02005naa a2200241uu 4500
008 131111s2007 xx 1 eng|d
035 |a (CROSBI)324219 
040 |a HR-ZaFF  |b hrv  |c HR-ZaFF  |e ppiak 
100 1 |a Ljubešić, Nikola 
245 1 0 |a Language identification: how to distinguish similar languages? /  |c Ljubešić, Nikola ; Mikelić, Nives ; Boras, Damir. 
246 3 |i Title in English:  |a Language identification: how to distinguish similar languages? 
300 |f str. 
520 |a The goal of this paper is to discuss the language identification problem of Croatian, a language that even state-of-the-art language identification tools find hard to distinguish from similar languages such as Serbian, Slovenian, or Slovak. We developed a tool that combines a list of the most frequent Croatian words, with a threshold that each document must satisfy, a specific-characters elimination rule, second-order Markov model classification, and a rule of forbidden words. The resulting tool outperforms current tools in discriminating between these similar languages. 
536 |a Projekt MZOS  |f 130-1301679-1380 
546 |a ENG 
690 |a 5.04 
693 |a Written language identification, Croatian language, second-order Markov model, web-corpus, most frequent words method, forbidden words method  |l hrv  |2 crosbi 
693 |a Written language identification, Croatian language, second-order Markov model, web-corpus, most frequent words method, forbidden words method  |l eng  |2 crosbi 
700 1 |a Mikelić, Nives  |4 aut 
700 1 |a Boras, Damir  |4 aut 
773 0 |a 29th International Conference on INFORMATION TECHNOLOGY INTERFACES (25-28.06.2007. ; Cavtat, Hrvatska)  |t ITI 2007 Proceedings of the 29th International Conference on INFORMATION TECHNOLOGY INTERFACES  |d Zagreb : SRCE, 2007.  |n Lužar - Stiffler, Vesna ; Hljuz Dobrić, Vesna  |z 978-953-7138-10-3 
942 |c RZB  |u 1  |v Peer review  |z Scientific - Lecture - Full paper
999 |c 315303  |d 315301
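
The abstract names four components: a most-frequent-words threshold, a specific-characters elimination rule, second-order Markov model classification, and a forbidden-words rule. The record does not say how they are implemented or combined, so the following is only a minimal sketch under stated assumptions, not the authors' tool. It assumes that the Markov model is a character trigram model with add-one smoothing, that the character and forbidden-word rules simply reject a document, and that the rules run in the order shown. All function names, word lists, thresholds, and sample strings are hypothetical.

```python
import math
from collections import Counter

PAD = "  "  # two-space left padding so the first characters have a 2-char context


def tokens(text):
    # Naive whitespace tokenizer (assumption); punctuation is not stripped.
    return text.lower().split()


def frequent_word_ratio(text, frequent_words):
    """Share of tokens that appear in the most-frequent-words list."""
    toks = tokens(text)
    if not toks:
        return 0.0
    return sum(t in frequent_words for t in toks) / len(toks)


def train_models(samples):
    """Train one character-level second-order Markov model per label.

    Each model stores trigram counts, counts of the 2-character contexts,
    and the size of an alphabet shared by all models (used for smoothing).
    """
    alphabet = {ch for text in samples.values() for ch in PAD + text.lower()}
    models = {}
    for label, text in samples.items():
        padded = PAD + text.lower()
        tri = Counter(padded[i:i + 3] for i in range(len(padded) - 2))
        ctx = Counter(padded[i:i + 2] for i in range(len(padded) - 2))
        models[label] = (tri, ctx, len(alphabet))
    return models


def log_prob(text, model):
    """Log-probability of text under a model, with add-one smoothing."""
    tri, ctx, alphabet_size = model
    padded = PAD + text.lower()
    total = 0.0
    for i in range(len(padded) - 2):
        gram = padded[i:i + 3]
        total += math.log((tri[gram] + 1) / (ctx[gram[:2]] + alphabet_size))
    return total


def identify(text, models, frequent_words, forbidden_words,
             foreign_chars, threshold=0.2):
    """Return the best-scoring label, or None if any rule rejects the text."""
    # Rule 1 (assumed form): characters specific to other languages.
    if any(ch in foreign_chars for ch in text.lower()):
        return None
    # Rule 2: not enough hits in the most-frequent-words list.
    if frequent_word_ratio(text, frequent_words) < threshold:
        return None
    # Rule 3 (assumed form): any forbidden word rejects the document.
    if set(tokens(text)) & forbidden_words:
        return None
    # Rule 4: pick the language whose Markov model scores the text highest.
    return max(models, key=lambda label: log_prob(text, models[label]))


# Toy usage with made-up training strings; real models need large corpora.
models = train_models({"hr": "što je to", "sr": "šta je to"})
result = identify("što je", models, frequent_words={"je"},
                  forbidden_words={"šta"}, foreign_chars={"ô", "ľ"})
```

With such tiny training strings the scores only illustrate the mechanics. In practice the word lists, the threshold, and the training corpora would have to be chosen empirically.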