Author affiliations:
[Zhong, Duo; Jiang, Xingpeng; Li, Bojing] Cent China Normal Univ, Hubei Key Lab Artificial Intelligence & Smart Learning, Wuhan, Peoples R China.;[Zhong, Duo; Jiang, Xingpeng; Li, Bojing] Cent China Normal Univ, Sch Comp, Wuhan, Peoples R China.;[Qiao, Jimei] Shanghai Normal Univ, Math & Sci Coll, Shanghai, Peoples R China.;[Jiang, Xingpeng] Cent China Normal Univ, Natl Language Resources Monitoring & Res Ctr Network Media, Wuhan, Peoples R China.
Corresponding author's affiliation:
[Xingpeng Jiang] Hubei Key Laboratory of Artificial Intelligence and Smart Learning, Central China Normal University, Wuhan, China; School of Computer, Central China Normal University, Wuhan, China; National Language Resources Monitoring & Research Center for Network Media, Central China Normal University, Wuhan, China
Abstract:
Microorganisms play important roles in our lives, especially in metabolism and diseases. Determining the probability that a person suffers from a specific disease, and the severity of that disease, based on microbial genes is crucial for understanding the relationship between microbes and diseases. Previous studies could extract the topological information of phylogenetic trees and integrate it into metagenomic datasets, enabling classifiers to learn more from limited data and thus improving model performance. In this paper, we propose GNPI, a model that better learns the structure of phylogenetic trees. GNPI maintains the original vector format of metagenomic datasets, whereas previous approaches had to convert the input into matrices. The vector form of the input data can be easily adopted by baseline machine learning models and is also applicable to deep learning models. Datasets processed with GNPI improve the accuracy of machine learning and deep learning models on three different datasets. GNPI is an interpretable data processing method for host phenotype prediction and other bioinformatics tasks.
Abstract:
In cross-language question retrieval (CLQR), users employ a new question in one language to search community question answering (CQA) archives for similar questions in another language. In addition to the ranking problem in monolingual question retrieval, one needs to bridge the language gap in CLQR. Existing adversarial models for cross-language learning normally rely on a single adversarial component. Since natural languages consist of units at different levels of abstraction, we argue that crossing the language gap adaptively at different levels with multiple adversarial components should lead to smoother text representations and better CLQR performance. To this end, we first encode questions into multi-layer representations at different levels of abstraction with a CNN-based model, which enhances conventional models with diverse kernel shapes and a corresponding pooling strategy so as to capture different aspects of a text segment. We then impose a set of adversarial components on different layers of the question representation so as to decide the appropriate abstraction levels and their roles in performing cross-language mapping. Experimental results on two real-world datasets demonstrate that our model outperforms state-of-the-art models for CLQR and is on par with strong machine translation baselines and most monolingual baselines. (C) 2020 Elsevier Inc. All rights reserved.
Abstract:
Dense motion estimates obtained from optical flow techniques play a significant role in many image processing and computer vision tasks. Remarkable progress has been made in both theory and practice. In this paper, we provide a systematic review of recent optical flow techniques with a focus on variational methods and approaches based on Convolutional Neural Networks (CNNs), the two categories that have led to state-of-the-art performance. We discuss recent modifications and extensions of the original variational model and highlight remaining challenges. For the first time, we provide an overview of recent CNN-based optical flow methods and discuss their potential and current limitations.
Abstract:
One category of neural information retrieval models tries to learn text representations in a common embedding space for both queries and documents. However, a single embedding space is not always sufficient, since queries and documents differ in length, number of topics covered, and so on. We argue that queries and documents should be mapped into different but overlapping embedding spaces, a scheme we name the Partially Shared Embedding Space (PSES) model. PSES consists of two embedding spaces, one each for queries and documents, and a shared embedding space capturing features common to both sources. The three embeddings are learned jointly under three constraints: a feature separation constraint, a pairwise matching constraint, and a reconstruction constraint. Experiments on standard TREC collections indicate that PSES yields significantly better retrieval performance than traditional IR models and several neural IR models with only one embedding space.
Journal:
Natural Language Engineering, 2018, 24(4):523-549. ISSN: 1351-3249
Corresponding author:
Li, Bo
Author affiliations:
[Li, Bo] Cent China Normal Univ, Dept Comp Sci, Wuhan, Hubei, Peoples R China.;[Gaussier, Eric] Univ Grenoble Alpes, CNRS, LIG, AMA, Grenoble, France.;[Yang, Dan] China Elect Power Res Inst, Wuhan, Hubei, Peoples R China.
Corresponding author's affiliation:
[Li, Bo] Cent China Normal Univ, Dept Comp Sci, Wuhan, Hubei, Peoples R China.
Abstract:
Comparable corpora serve as an important substitute for parallel resources for under-resourced language pairs. Previous work mostly aims at better strategies for exploiting existing comparable corpora, while ignoring variation in corpus quality. The quality of a comparable corpus greatly affects its usability in practice, a fact that has been confirmed by several studies. However, researchers have not yet established a widely accepted and fully validated framework for measuring corpus quality. In this paper, we therefore investigate a comprehensive methodology for assessing the quality of comparable corpora. Specifically, we propose several comparability measures and a quantitative strategy for testing them. Our experiments show that the proposed comparability measure captures gold-standard comparability levels very well and is robust to the bilingual dictionary used. Moreover, we show on the task of bilingual lexicon extraction that the proposed measure correlates well with the performance of a real-world application.
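The abstract does not spell out the comparability measure itself; a common dictionary-based instantiation scores a corpus pair by the proportion of source words whose dictionary translations appear in the target corpus. The sketch below is purely illustrative: the function name, the toy data, and the one-directional form (the two directions can be averaged for symmetry) are assumptions, not the paper's exact definition.

```python
def comparability(source_vocab, target_vocab, dictionary):
    """Proportion of dictionary-covered source words whose translation
    occurs in the target corpus (illustrative, one direction only)."""
    translatable = [w for w in source_vocab if w in dictionary]
    if not translatable:
        return 0.0
    found = sum(1 for w in translatable
                if any(t in target_vocab for t in dictionary[w]))
    return found / len(translatable)

# Toy bilingual dictionary and vocabularies (hypothetical data):
# "maison" and "chat" have translations in the target, "arbre" does not.
dictionary = {"maison": ["house", "home"], "chat": ["cat"], "arbre": ["tree"]}
score = comparability({"maison", "chat", "arbre", "oov"},
                      {"house", "cat", "dog"}, dictionary)
```

Here `score` is 2/3: three source words are covered by the dictionary, and two of them have a translation in the target vocabulary; the out-of-dictionary word "oov" is ignored.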
Journal:
Information Processing & Management, 2018, 54(2):291-302. ISSN: 0306-4573
Corresponding author:
Li, Bo
Author affiliations:
[Li, Bo] Cent China Normal Univ, Sch Comp Sci, Wuhan, Hubei, Peoples R China.;[Gaussier, Eric] Univ Grenoble Alpes, CNRS, LIG AMA, Grenoble, France.;[Yang, Dan] China Elect Power Res Inst, Wuhan, Hubei, Peoples R China.
Corresponding author's affiliation:
[Li, Bo] Cent China Normal Univ, Sch Comp Sci, Wuhan, Hubei, Peoples R China.
Keywords:
Cross-language information retrieval;D/C condition;Information retrieval heuristic
Abstract:
Centralized group key establishment protocols are the most commonly used type due to their efficiency in computation and communication. In such protocols, a key generation center (KGC) acts as a server that initially registers users. Since the KGC selects the group key for group communication, all users must trust the KGC. Needing a mutually trusted KGC can cause problems in some applications; for example, users in a social network cannot trust the network server to select a group key for secure group communication. In this paper, we remove the need for a mutually trusted KGC by assuming that each user trusts only himself. During registration, each user acts as a KGC to register the other users and issue sub-shares to them. By the secret sharing homomorphism, all sub-shares held by a user can be combined into a master share. The master shares enable a pairwise shared key between any pair of users, and a verification procedure allows all users to check that their master shares were generated consistently without revealing the master shares. In a group communication, the initiator can act as the server, selecting a group key and distributing it to each other user over a pairwise shared channel. Our design is unique in that the storage per user is minimal, the verification of master shares is efficient, and the group key distribution is centralized. Public-key based group key establishment protocols without a trusted third party exist, but they can only establish a single group key. Our protocol is a non-public-key solution that can establish multiple group keys and is computationally efficient.
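The secret sharing homomorphism underlying this construction can be demonstrated with additive combination of Shamir shares: when every user deals sub-shares of their own secret, the sum of the sub-shares each user holds is a valid share of the summed secret. The sketch below is only a demonstration of that algebraic property; the field size, threshold, and helper names are assumptions, not the paper's actual protocol.

```python
import random

P = 2**31 - 1  # prime modulus for the share arithmetic (illustrative choice)

def make_shares(secret, t, ids):
    """Shamir (t, n): random degree-(t-1) polynomial f with f(0) = secret;
    user i receives the sub-share f(i)."""
    coeffs = [secret] + [random.randrange(P) for _ in range(t - 1)]
    return {i: sum(c * pow(i, k, P) for k, c in enumerate(coeffs)) % P
            for i in ids}

def reconstruct(shares):
    """Lagrange interpolation at x = 0 over the prime field."""
    secret = 0
    for i, y in shares.items():
        num = den = 1
        for j in shares:
            if j != i:
                num = num * (-j) % P
                den = den * (i - j) % P
        secret = (secret + y * num * pow(den, P - 2, P)) % P
    return secret

# Each of three users acts as a dealer for their own secret; summing the
# sub-shares held by each user yields that user's master share, which is a
# share of the sum of all secrets -- no trusted KGC ever sees the secrets.
ids = [1, 2, 3]
secrets = [11, 22, 33]
dealt = [make_shares(s, t=2, ids=ids) for s in secrets]
master = {i: sum(d[i] for d in dealt) % P for i in ids}
```

Any threshold-sized subset of master shares now reconstructs the combined secret, e.g. `reconstruct({1: master[1], 2: master[2]})` equals `sum(secrets)`.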
Abstract:
Buzzwords are a main embodiment of internet culture and play an important role in public opinion analysis, social focus tracking, and the study of language evolution. At present, questionnaires are widely used as the standard method for obtaining network buzzwords, which is subjective and costly. In this paper, we propose a novel algorithm that relies on the time-distribution feature of words and a KL-divergence measure to estimate word popularity and thereby identify buzzwords in a specific period. The time-distribution feature simply states the fact that a buzzword's usage increases sharply during a very short period, which is then modeled formally with the KL-divergence measure. Compared with the traditional method, which involves substantial manual work, the automatic algorithm presented here is clearly more efficient. Moreover, buzzwords identified in this manner are not affected by individuals' subjective opinions, so they better reflect language usage in practice. When applying the algorithm to a social media big data set, our experimental results show that the proposed approach can accurately identify buzzwords in a given period, with results highly coincident with manually tagged ones.
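The time-distribution idea can be sketched in a few lines: bucket a word's usage counts over time, normalize them into a distribution, and score burstiness as the KL divergence from a background distribution. The uniform background and the function names below are illustrative assumptions, not the paper's exact formulation.

```python
import math

def kl_divergence(p, q):
    """KL(p || q) for discrete distributions given as equal-length lists."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def popularity_score(counts, background):
    """Burstiness of a word: KL divergence between its empirical
    time distribution and a background distribution."""
    total = sum(counts)
    if total == 0:
        return 0.0
    p = [c / total for c in counts]
    return kl_divergence(p, background)

# A word whose usage spikes in one time bucket (buzzword-like behaviour)
# scores much higher than a word with flat usage.
weeks = 5
background = [1 / weeks] * weeks  # uniform background (an assumption here)
buzz = popularity_score([1, 2, 40, 3, 1], background)
flat = popularity_score([9, 10, 9, 10, 9], background)
```

Ranking words by this score over a fixed window and keeping the top of the list is then a direct way to surface candidate buzzwords for the period.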
Journal:
Lecture Notes in Computer Science, 2014, 8801:223-233. ISSN: 0302-9743
Corresponding author:
Li, Bo
Author affiliations:
[He, Tingting; Li, Bo; Chen, Qianjun; Zhu, Qunyan] Cent China Normal Univ, Hubei Univ, Sch Comp, Ctr Natl Language Tracing & Res Network, Natl Engn Res Ctr E Learning, Network Ctr, Wuhan 430079, Peoples R China.
Corresponding author's affiliation:
[Li, Bo] Cent China Normal Univ, Hubei Univ, Sch Comp, Ctr Natl Language Tracing & Res Network, Natl Engn Res Ctr E Learning, Network Ctr, Wuhan 430079, Peoples R China.
Conference:
13th China National Conference on Chinese Computational Linguistics (CCL) / 2nd International Symposium on Natural Language Processing Based on Naturally Annotated Big Data (NLP-NABD)
Conference date:
OCT 18-19, 2014
Conference venue:
Cent China Normal Univ, Wuhan, PEOPLES R CHINA
Journal:
Lecture Notes in Computer Science, 2014, 8444 LNAI (Part 2):134-145. ISSN: 0302-9743
Author affiliations:
[Luo, Jing; Tu, Xinhui; Gu, Jinguang] College of Computer Science and Technology, Wuhan University of Science and Technology, Wuhan, China;[Gu, Jinguang] State Key Lab. of Software Engineering, Wuhan University, Wuhan, China;[He, Tingting; Li, Bo] Department of Computer Science, Central China Normal University, Wuhan, China;[Luo, Jing; Tu, Xinhui] Hubei Province Key Laboratory of Intelligent Information Processing and Real-time Industrial System, Wuhan, China
Journal:
International Conference on Information and Knowledge Management, Proceedings, 2013:1237-1240
Author affiliations:
[Luo, Jing; Tu, Xinhui; Liu, Maofu] College of Computer Science and Technology, Wuhan University of Science and Technology, Wuhan, China;[Li, Bo; He, Tingting] Department of Computer Science, Central China Normal University, Wuhan, China;[Luo, Jing; Tu, Xinhui; Liu, Maofu] Hubei Province Key Laboratory of Intelligent Information Processing and Real-time Industrial System, Wuhan, China
Conference:
Proceedings of the 22nd ACM International Conference on Information & Knowledge Management (CIKM 2013)
Abstract:
A main challenge in applying translation language models to information retrieval is how to estimate the "true" probability that a query could be generated as a translation of a document. State-of-the-art methods rely on document-based word co-occurrences to estimate word-word translation probabilities. However, these methods do not take into account the proximity of co-occurrences. Intuitively, proximity can be exploited to estimate more accurate translation probabilities, since two words that occur closer together are more likely to be related. In this paper, we study how to explicitly incorporate proximity information into the existing translation language model and propose a proximity-based translation language model, called TM-P, with three variants. In our TM-P models, a new concept, the proximity-based word co-occurrence frequency, is introduced to model the proximity of word co-occurrences, which is then used to estimate translation probabilities. Experimental results on standard TREC collections show that our TM-P models achieve significant improvements over state-of-the-art translation models. Copyright 2013 ACM.
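The proximity-based word co-occurrence frequency can be illustrated with a simple sketch: word pairs co-occurring within a window contribute weights that decay with their distance, and the accumulated weights are normalized into translation probabilities. The inverse-distance kernel, window size, and function names below are illustrative assumptions, not the exact TM-P definition or any of its three variants.

```python
from collections import defaultdict

def proximity_cooccurrence(docs, max_dist=5):
    """Proximity-based word co-occurrence frequency: each pair within
    max_dist positions contributes a weight decaying with distance
    (inverse-distance kernel assumed here)."""
    cooc = defaultdict(float)
    for doc in docs:
        for i, w in enumerate(doc):
            for j in range(i + 1, min(i + 1 + max_dist, len(doc))):
                weight = 1.0 / (j - i)  # closer pairs count more
                cooc[(w, doc[j])] += weight
                cooc[(doc[j], w)] += weight
    return cooc

def translation_probs(cooc):
    """Normalize accumulated co-occurrence weights into p(w | u)."""
    totals = defaultdict(float)
    for (u, _), c in cooc.items():
        totals[u] += c
    return {(u, w): c / totals[u] for (u, w), c in cooc.items()}

# Toy "collection" of one tokenized document (hypothetical data).
docs = [["query", "likelihood", "model", "query", "term"]]
probs = translation_probs(proximity_cooccurrence(docs))
```

In a translation language model, `probs[(u, w)]` would play the role of the word-word translation probability p(w | u); by construction the probabilities for each conditioning word u sum to one.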