MARC: Evaluating full lemmatization of Croatian texts

Evaluating full lemmatization of Croatian texts

The chapter presents the implementation and evaluation of a module for full lemmatization of Croatian texts. The module implements several lemmatization procedures, all of them based on merging outputs of the previously developed stochastic morphosyntactic tagger CroTag and the inflectional lexicon o...

Permalink:	http://skupni.nsk.hr/Record/ffzg.KOHA-OAI-FFZG:312303/Details
Host publication:	Technologies for the Processing and Retrieval of Semi-Structured Documents: Experience from the CADIAL Project Language and Technology
Main authors:	Agić, Željko (-), Tadić, Marko (Author), Dovedan Han, Zdravko
Material type:	Article
Language:	eng
Online access:	http://langtech.jrc.ec.europa.eu/Documents/2009_Cadial-Book_TOC.pdf


LEADER	02689naa a2200325uu 4500
008	131111s2009 xx eng\|d
020			\|a 978-953-55375-1-9
035			\|a (CROSBI)426782
040			\|a HR-ZaFF \|b hrv \|c HR-ZaFF \|e ppiak
100	1		\|9 495 \|a Agić, Željko
245	1	0	\|a Evaluating full lemmatization of Croatian texts / \|c Agić, Željko ; Tadić, Marko ; Dovedan, Zdravko.
246	3		\|i Title in English: \|a Evaluating Full Lemmatization of Croatian Texts
300			\|a 133-144 \|f pp.
500			\|a This is a corrected version of a paper published in Klopotek, M. ; Przepiorkowski, A. ; Wierzchon, S. ; Trojanowski, K. (eds.) (2009) Recent Advances in Intelligent Information Systems, Academic Publishing House EXIT, Warsaw, 175-184.
520			\|a The chapter presents the implementation and evaluation of a module for full lemmatization of Croatian texts. The module implements several lemmatization procedures, all of them based on merging outputs of the previously developed stochastic morphosyntactic tagger CroTag and the inflectional lexicon of Croatian. Evaluation of the lemmatization module on two test cases, simulating realistic and ideal operating conditions, provided full lemmatization accuracy scores of 96.96 and 98.15 percent on a newspaper corpus, respectively. It is also shown that a majority of errors in this framework, 57.14 percent in the realistic testing scenario, occur on word forms with external homography. Moreover, approximately 80 percent of all lemmatization errors occur on nouns, adjectives, verbs and adverbs in that particular order. Language resources, testing environment and procedure descriptions are provided in the paper along with a discussion of obtained results and possible future research directions.
536			\|a Projekt MZOS \|f 036-1300646-1986
536			\|a Projekt MZOS \|f 130-1300646-0645
536			\|a Projekt MZOS \|f 130-1300646-1776
546			\|a ENG
690			\|a 2.09
690			\|a 5.04
690			\|a 6.03
693			\|a full lemmatization, morphosyntactic tagging, Croatian language \|l hrv \|2 crosbi
693			\|a full lemmatization, morphosyntactic tagging, Croatian language \|l eng \|2 crosbi
773	0		\|t Technologies for the Processing and Retrieval of Semi-Structured Documents: Experience from the CADIAL Project \|d Zagreb : Croatian Language Technologies Society, 2009 \|k Language and Technology \|h 238 \|n Tadić, Marko ; Dalbelo Bašić, Bojana ; Moens, Marie-Francine \|z 978-953-55375-1-9 \|g pp. 133-144
700	1		\|9 888 \|a Tadić, Marko \|4 aut
700	1		\|9 415 \|a Dovedan Han, Zdravko \|4 aut
856			\|u http://langtech.jrc.ec.europa.eu/Documents/2009_Cadial-Book_TOC.pdf
942			\|c POG \|t 1.16.1 \|u 2 \|z Scientific
999			\|c 312303 \|d 312301
