The applicability of lemmatisation in translation equivalents detection

The aim of the research is to help identify translation equivalents (TEs) in 1:1 aligned sentences at the level of single-word units. The research is based on the Croatian-English parallel corpus compiled at the University of Zagreb. The method is based entirely on a statistical approach with no linguistic filter...

Permalink: http://skupni.nsk.hr/Record/ffzg.KOHA-OAI-FFZG:310964/Details
Host publication: Meaningful Texts: The Extraction of Semantic Information from Monolingual and Multilingual Corpora
Barnbrook, Geoff ; Danielsson, Pernilla ; Mahlberg, Michaela
Main authors: Tadić, Marko; Fulgosi, Sanja (Author); Šojat, Krešimir
Material type: Article
Language: eng
Online access: http://www.is.bham.ac.uk/ubpress/corpus_meaningful.asp
LEADER 02432naa a2200265uu 4500
008 131111s2004 xx eng|d
020 |a 082647490X 
035 |a (CROSBI)125583 
040 |a HR-ZaFF  |b hrv  |c HR-ZaFF  |e ppiak 
100 1 |a Tadić, Marko 
245 1 4 |a The applicability of lemmatisation in translation equivalents detection /  |c Tadić, Marko ; Fulgosi, Sanja ; Šojat, Krešimir. 
246 3 |i Title in English:  |a The applicability of lemmatisation in translation equivalents detection 
300 |a 195-206  |f pp. 
520 |a The aim of the research is to help identify translation equivalents (TEs) in 1:1 aligned sentences at the level of single-word units. The research is based on the Croatian-English parallel corpus compiled at the University of Zagreb. The method is entirely statistical, with no linguistic filter applied before or after processing, and has three steps: 1) generation of all possible pairs of tokens from 1:1 aligned sentences (Cartesian product); 2) application of mutual information (MI) to the generated pairs in order to detect candidates for real TEs; 3) sorting the pairs by calculated MI and choosing real TEs for further use. The same method was applied to non-lemmatized and lemmatized material. The latter yielded 4.5% higher precision, confirming our hypothesis that for the Croatian-English pair (and possibly for other morphologically rich languages like Croatian) lemmatized corpus data helps statistical methods of TE detection. 
536 |a Projekt MZOS  |f 0130418 
546 |a ENG 
690 |a 6.03 
693 |a Croatian Language, English Language, Croatian-English Parallel Corpus, parallel corpus, lemmatization, translation equivalents, translation equivalents detection  |l hrv  |2 crosbi 
693 |a Croatian Language, English Language, Croatian-English Parallel Corpus, parallel corpus, lemmatization, translation equivalents, translation equivalents detection  |l eng  |2 crosbi 
700 1 |a Fulgosi, Sanja  |4 aut 
700 1 |a Šojat, Krešimir  |4 aut 
773 0 |t Meaningful Texts: The Extraction of Semantic Information from Monolingual and Multilingual Corpora  |d London, New York : Continuum International Publishing Group, 2004  |n Barnbrook, Geoff ; Danielsson, Pernilla ; Mahlberg, Michaela  |z 082647490X  |g pp. 195-206 
856 |u http://www.is.bham.ac.uk/ubpress/corpus_meaningful.asp 
942 |c POG  |t 1.16.1  |u 1  |z Scientific 
999 |c 310964  |d 310962
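
The three steps named in the abstract (field 520) can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation: the record does not say how mutual information was estimated, so the sketch assumes pointwise mutual information over sentence-level co-occurrence counts. The function name rank_candidate_pairs, the optional normalise hook (where a lemmatizer could be plugged in) and the three toy sentence pairs are all hypothetical and are not taken from the Croatian-English parallel corpus.

```python
import math
from collections import Counter
from itertools import product


def rank_candidate_pairs(aligned_sentences, normalise=lambda token: token):
    """Rank source/target token pairs from 1:1 aligned sentences.

    Assumption: the score is pointwise mutual information (PMI) estimated
    from sentence-level co-occurrence; the catalogue record does not give
    the exact formula used in the paper.
    """
    n = len(aligned_sentences)
    src_counts, tgt_counts, pair_counts = Counter(), Counter(), Counter()

    for src_sentence, tgt_sentence in aligned_sentences:
        # Whitespace tokenisation; each token type counted once per sentence.
        src_tokens = {normalise(tok) for tok in src_sentence.lower().split()}
        tgt_tokens = {normalise(tok) for tok in tgt_sentence.lower().split()}
        src_counts.update(src_tokens)
        tgt_counts.update(tgt_tokens)
        # Step 1: all possible token pairs (Cartesian product).
        pair_counts.update(product(src_tokens, tgt_tokens))

    scored = []
    for (src, tgt), joint in pair_counts.items():
        # Step 2: PMI = log2( p(src, tgt) / (p(src) * p(tgt)) ).
        pmi = math.log2(joint * n / (src_counts[src] * tgt_counts[tgt]))
        scored.append((src, tgt, pmi))

    # Step 3: sort by descending score; ties broken alphabetically.
    scored.sort(key=lambda item: (-item[2], item[0], item[1]))
    return scored


# Invented toy data for illustration only.
toy_corpus = [
    ("pas spava", "dog sleeps"),
    ("mačka spava", "cat sleeps"),
    ("pas jede", "dog eats"),
]

ranked = rank_candidate_pairs(toy_corpus)
for src, tgt, score in ranked[:4]:
    print(f"{src}\t{tgt}\t{score:.3f}")
```

To compare non-lemmatized and lemmatized input in the spirit of the abstract, one could call the function twice, once with the default identity hook and once with normalise set to a lemmatizer of one's choice, and compare the precision of the top-ranked pairs. PMI is known to favour rare pairs, so a real experiment would likely need a frequency threshold, which the sketch leaves out.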