Towards Obtaining High Quality Sentence-Aligned English-Croatian Parallel Corpus

This paper presents the acquisition of parallel bilingual corpus and all the steps involved in the process of unsupervised sentence alignment, such as tokenization, lowercasing, etc. The problem of sentence alignment is not trivial because translators do not necessarily translate one sentence in the...

Permalink: http://skupni.nsk.hr/Record/ffzg.KOHA-OAI-FFZG:317111/Details
Host publication: Proceedings of the 4th IEEE International Conference on Computer Science and Information Technology ICCSIT 2011
Chengdu, China : 2011
Main authors: Brkić, Marija (-), Matetić, Maja (Author), Seljan, Sanja
Material type: Article
Language: eng
LEADER 02090naa a2200241uu 4500
008 131111s2011 xx 1 eng|d
035 |a (CROSBI)516695 
040 |a HR-ZaFF  |b hrv  |c HR-ZaFF  |e ppiak 
100 1 |a Brkić, Marija 
245 1 0 |a Towards Obtaining High Quality Sentence-Aligned English-Croatian Parallel Corpus /  |c Brkić, Marija ; Matetić, Maja ; Seljan, Sanja. 
246 3 |i Naslov na engleskom:  |a Towards Obtaining High Quality Sentence-Aligned English-Croatian Parallel Corpus 
300 |a 1068-1070  |f str. 
520 |a This paper presents the acquisition of parallel bilingual corpus and all the steps involved in the process of unsupervised sentence alignment, such as tokenization, lowercasing, etc. The problem of sentence alignment is not trivial because translators do not necessarily translate one sentence in the source language into one sentence in the target language. Three different unsupervised and language independent approaches to sentence alignment are presented and implementations of these approaches through three different freely available tools are tested. A gold standard for English-Croatian automatic sentence alignment evaluation is created. Finally, a detailed analysis of the acquired corpus is given. 
536 |a Projekt MZOS  |f 130-1300646-0909 
546 |a ENG 
690 |a 5.04 
693 |a Sentence alignment ; alignment tools ; sentence alignment evaluation ; parallel corpus ; sentence-length ; word-correspondence  |l hrv  |2 crosbi 
693 |a Sentence alignment ; alignment tools ; sentence alignment evaluation ; parallel corpus ; sentence-length ; word-correspondence  |l eng  |2 crosbi 
700 1 |a Matetić, Maja  |4 aut 
700 1 |9 430  |a Seljan, Sanja  |4 aut 
773 0 |a 4th IEEE International Conference on Computer Science and Information Technology ICCSIT 2011 (10-12.06.2011. ; Sečuan, Kina)  |t Proceedings of the 4th IEEE International Conference on Computer Science and Information Technology ICCSIT 2011  |d Chengdu, China : 2011  |g str. 1068-1070 
942 |c RZB  |u 2  |v Recenzija  |z Znanstveni - Predavanje - CijeliRad  |t 1.08 
999 |c 317111  |d 317109