The applicability of lemmatisation in translation equivalents detection
The aim of the research is to help identify translation equivalents (TEs) in 1:1 aligned sentences at the level of single-word units. The research is based on the Croatian-English parallel corpus compiled at the University of Zagreb. The method is based entirely on a statistical approach with no linguistic filter...
Permalink: | http://skupni.nsk.hr/Record/ffzg.KOHA-OAI-FFZG:310964/Details |
---|---|
Parent publication: |
Meaningful Texts: The Extraction of Semantic Information from Monolingual and Multilingual Corpora Barnbrook, Geoff ; Danielsson, Pernilla ; Mahlberg, Michaela |
Main authors: | Tadić, Marko, Fulgosi, Sanja (Author), Šojat, Krešimir |
Material type: | Article |
Language: | English |
Online access: |
http://www.is.bham.ac.uk/ubpress/corpus_meaningful.asp |
LEADER | 02432naa a2200265uu 4500 | ||
---|---|---|---|
008 | 131111s2004 xx eng|d | ||
020 | |a 082647490X | ||
035 | |a (CROSBI)125583 | ||
040 | |a HR-ZaFF |b hrv |c HR-ZaFF |e ppiak | ||
100 | 1 | |a Tadić, Marko | |
245 | 1 | 4 | |a The applicability of lemmatisation in translation equivalents detection / |c Tadić, Marko ; Fulgosi, Sanja ; Šojat, Krešimir. |
246 | 3 | |i Title in English: |a The applicability of lemmatisation in translation equivalents detection | |
300 | |a 195-206 |f pp. | ||
520 | |a The aim of the research is to help identify translation equivalents (TEs) in 1:1 aligned sentences at the level of single-word units. The research is based on the Croatian-English parallel corpus compiled at the University of Zagreb. The method is based entirely on a statistical approach, with no linguistic filter applied before or after the processing, which has 3 steps: 1) generation of all possible pairs of tokens from 1:1 aligned sentences (Cartesian product); 2) application of mutual information (MI) to the generated pairs in order to detect candidates for real TEs; 3) sorting the pairs according to the calculated MI and choosing real TEs for further use. The same method was applied to non-lemmatized and lemmatized material. The latter demonstrated 4.5% higher precision, supporting our hypothesis that for the Croatian-English pair (and possibly for other morphologically rich languages like Croatian) the lemmatized form of corpus data helps statistical methods of TE detection. | ||
536 | |a Projekt MZOS |f 0130418 | ||
546 | |a ENG | ||
690 | |a 6.03 | ||
693 | |a Croatian Language, English Language, Croatian-English Parallel Corpus, parallel corpus, lemmatization, translation equivalents, translation equivalents detection |l hrv |2 crosbi | ||
693 | |a Croatian Language, English Language, Croatian-English Parallel Corpus, parallel corpus, lemmatization, translation equivalents, translation equivalents detection |l eng |2 crosbi | ||
700 | 1 | |a Fulgosi, Sanja |4 aut | |
700 | 1 | |a Šojat, Krešimir |4 aut | |
773 | 0 | |t Meaningful Texts: The Extraction of Semantic Information from Monolingual and Multilingual Corpora |d London, New York : Continuum international publishing group, 2004 |n Barnbrook, Geoff ; Danielsson, Pernilla ; Mahlberg, Michaela |z 082647490X |g pp. 195-206 | |
856 | |u http://www.is.bham.ac.uk/ubpress/corpus_meaningful.asp | ||
942 | |c POG |t 1.16.1 |u 1 |z Scientific | ||
999 | |c 310964 |d 310962 |
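
The three steps named in the abstract (field 520) can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function name `rank_translation_pairs`, the `min_count` parameter, the toy sentences, and the choice to estimate probabilities from sentence-pair counts with pointwise mutual information are all assumptions made here for illustration. The record does not state which exact MI formulation or thresholds were used.

```python
import math
from collections import Counter
from itertools import product


def rank_translation_pairs(aligned, min_count=1):
    """Rank candidate translation equivalents by pointwise mutual information.

    `aligned` is a list of (source_tokens, target_tokens) tuples, one per
    1:1 aligned sentence pair. Tokens may be word forms or lemmas.
    Hypothetical helper: probabilities are estimated as the share of
    sentence pairs in which a token (or a token pair) occurs.
    """
    n = len(aligned)
    src_freq, tgt_freq, pair_freq = Counter(), Counter(), Counter()

    for src_tokens, tgt_tokens in aligned:
        src_set, tgt_set = set(src_tokens), set(tgt_tokens)
        src_freq.update(src_set)
        tgt_freq.update(tgt_set)
        # Step 1: all possible token pairs (Cartesian product) per sentence pair.
        pair_freq.update(product(src_set, tgt_set))

    scored = []
    for (s, t), c in pair_freq.items():
        if c < min_count:
            continue
        # Step 2: MI = log2( P(s, t) / (P(s) * P(t)) ), with P = count / n.
        mi = math.log2(c * n / (src_freq[s] * tgt_freq[t]))
        scored.append((mi, s, t))

    # Step 3: sort by descending MI; ties are broken alphabetically.
    scored.sort(key=lambda item: (-item[0], item[1], item[2]))
    return scored


if __name__ == "__main__":
    # Toy data, invented for this example only.
    toy = [
        (["pas", "spava"], ["dog", "sleeps"]),
        (["pas", "jede"], ["dog", "eats"]),
        (["ptica", "spava"], ["bird", "sleeps"]),
    ]
    for mi, s, t in rank_translation_pairs(toy):
        print(f"{mi:6.3f}  {s}  {t}")
```

To compare lemmatized and non-lemmatized material in the spirit of the abstract, the same function could be run twice, once with word forms and once with lemmas as tokens. In an inflected language, lemmatisation merges counts that would otherwise be split across several word forms of the same lexeme.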