MARC: A comparison of approaches for measuring the semantic similarity of short texts based on word embeddings

Measuring the semantic similarity of texts has a vital role in various tasks from the field of natural language processing. In this paper, we describe a set of experiments we carried out to evaluate and compare the performance of different approaches for measuring the semantic similarity of short te...

Permalink:	http://skupni.nsk.hr/Record/nsk.NSK01001088895/Details
Host publication:	Journal of information and organizational sciences (Online) 44 (2020), 2 ; pp. 231-246
Main authors:	Babić, Karlo (Author), Guerra, Francesco, Martinčić-Ipšić, Sanda, Meštrović, Ana
Material type:	e-article
Language:	eng
Subject:	Semantic similarity > Text similarity > Natural language text processing
Online access:	https://doi.org/10.31341/jios.44.2.2 Journal of information and organizational sciences (Online) Hrčak


LEADER	02745naa a22003854i 4500
001	NSK01001088895
003	HR-ZaNSK
005	20210217162027.0
006	m d
007	cr\|\|\|\|\|\|\|\|\|\|\|\|
008	210201s2020 ci d \|o \|0\|\| \|\|eng
024	7		\|2 doi \|a 10.31341/jios.44.2.2
035			\|a (HR-ZaNSK)001088895
040			\|a HR-ZaNSK \|b hrv \|c HR-ZaNSK \|e ppiak
041	0		\|a eng \|b eng
042			\|a croatica
044			\|a ci \|c hr
080	1		\|a 004 \|2 2011
100	1		\|a Babić, Karlo \|4 aut
245	1	2	\|a A comparison of approaches for measuring the semantic similarity of short texts based on word embeddings \|h [Elektronička građa] / \|c Karlo Babić, Francesco Guerra, Sanda Martinčić-Ipšić, Ana Meštrović.
300			\|b Graf. prikazi.
504			\|a Bibliografija: 46 jed.
504			\|a Abstract.
520			\|a Measuring the semantic similarity of texts has a vital role in various tasks from the field of natural language processing. In this paper, we describe a set of experiments we carried out to evaluate and compare the performance of different approaches for measuring the semantic similarity of short texts. We perform a comparison of four models based on word embeddings: two variants of Word2Vec (one based on Word2Vec trained on a specific dataset and the second extending it with embeddings of word senses), FastText, and TF-IDF. Since these models provide word vectors, we experiment with various methods that calculate the semantic similarity of short texts based on word vectors. More precisely, for each of these models, we test five methods for aggregating word embeddings into a text embedding. We introduce three methods by making variations of two commonly used similarity measures. One method is an extension of the cosine similarity based on centroids, and the other two methods are variations of the Okapi BM25 function. We evaluate all approaches on two publicly available datasets, SICK and Lee, in terms of the Pearson and Spearman correlation. The results indicate that the extended methods perform better than the original ones in most cases.
653		0	\|a Semantička sličnost \|a Sličnost tekstova \|a Obrada prirodnog teksta
700	1		\|a Guerra, Francesco \|4 aut \|9 HR-ZaNSK
700	1		\|a Martinčić-Ipšić, Sanda \|4 aut
700	1		\|a Meštrović, Ana \|4 aut
773	0		\|t Journal of information and organizational sciences (Online) \|x 1846-9418 \|g 44 (2020), 2 ; str. 231-246 \|w nsk.(HR-ZaNSK)000672813
981			\|b Be2020 \|b B02/20
998			\|b tino2102
856	4	0	\|u https://doi.org/10.31341/jios.44.2.2
856	4	0	\|u https://jios.foi.hr/index.php/jios/article/view/1427 \|y Journal of information and organizational sciences (Online)
856	4	0	\|u https://hrcak.srce.hr/247489 \|y Hrčak
856	4	1	\|y Digitalna.nsk.hr
