Journal:
Natural Language Engineering, 2018, 24(4): 523-549. ISSN: 1351-3249
Corresponding author:
Li, Bo
Author affiliations:
[Li, Bo] Cent China Normal Univ, Dept Comp Sci, Wuhan, Hubei, Peoples R China; [Gaussier, Eric] Univ Grenoble Alpes, CNRS, LIG, AMA, Grenoble, France; [Yang, Dan] China Elect Power Res Inst, Wuhan, Hubei, Peoples R China.
Corresponding institution:
[Li, Bo] Cent China Normal Univ, Dept Comp Sci, Wuhan, Hubei, Peoples R China.
Abstract:
Comparable corpora serve as an important substitute for parallel resources in the case of under-resourced language pairs. Previous work mostly aims to find better strategies for exploiting existing comparable corpora, while ignoring variation in corpus quality. The quality of comparable corpora strongly affects their usability in practice, a fact confirmed by several studies. However, researchers have not yet established a widely accepted and fully validated framework for measuring corpus quality. In this paper, we therefore investigate a comprehensive methodology for assessing the quality of comparable corpora. Specifically, we propose several comparability measures and a quantitative strategy to test those measures. Our experiments show that the proposed comparability measure captures gold-standard comparability levels very well and is robust to the choice of bilingual dictionary. Moreover, we show on the task of bilingual lexicon extraction that the proposed measure correlates well with the performance of this real-world application.