
Research on Text Similarity Computing Based on Word Vector Model of Neural Networks

Yuan Sun*†, Weikang Li*†, Peilei Dong*†


*School of Information Engineering, Minzu University of China
†Minority Languages Branch, National Language Resource and Monitoring Research Center
100081, Beijing, China
tracy.yuan.sun@gmail.com

Abstract-Text similarity computing plays an important role in natural language processing. In this paper, we build a word vector model based on a neural network and train it on a Chinese corpus drawn from Sohu News, World News, and other sources. We then propose a method of calculating semantic text similarity using word vectors. Finally, by comparing it with the traditional TF-IDF calculation method, the experimental results show that the method is effective.

Keywords-neural networks; word vector; semantics; text similarity

I. INTRODUCTION

With the rapid development of the Internet, people have a growing demand for obtaining information from the network, and it is important to calculate text similarity in many areas of text analysis.

Text similarity computing is an important metric for comparing the similarity of two or more articles. In general, it can be divided into two aspects: the calculation of semantic similarity and the calculation of non-semantic similarity. Research on text similarity has a wide range of applications in information retrieval, automatic question answering, and machine learning [1-2]. At present, there are many mature research methods for non-semantic text similarity computing, but semantic text similarity computing still leaves much work to do. The current state of research on text similarity is as follows.

Pan Qianhong proposed a way to calculate text similarity based on Attribute Theory [3] and built a property-gravity splitting model of text. It calculated the correlation between key words with the help of the distance between coordinate points.

Zhang Huanjiong proposed a method of calculating text similarity based on the Hamming Distance and also introduced the Hamming Concept [4]. It used a new way of calculating, with great convenience and high speed, and differed from the traditional way of using space concepts [5]. It represented text information using code words, which made it possible to describe text information jointly.

Huo Hua proposed a way to calculate text similarity based on Compressed Sparse Vector Multiplication, effectively reducing the cost of computing and the storage space required [6]. Only non-zero elements are stored and represented in this method.

Yu Gang proposed a method of calculating text similarity based on lexical semantics [7]. It calculated the correlation of two article vectors based on the lexical semantics of HowNet and obtained the similarity of two articles via the Maximum Matching Algorithm.

In this paper, we first build a word vector training model based on neural networks to improve system efficiency. Secondly, we use the model to train words from a large Chinese corpus for calculating semantic distance. Thirdly, we propose a method of calculating text similarity based on neural networks. Finally, we compare the method with TF-IDF on text similarity and analyze the results.

The rest of this paper is organized as follows. The next section describes the process of obtaining word vectors based on neural networks. Section III introduces a new text similarity computing method based on neural networks. The analysis of the results is presented and discussed in Section IV. Finally, the conclusion is drawn.

II. WORD VECTOR BUILDING BASED ON NEURAL NETWORK

A. Neural Network Model

The word vectors in this article use a Distributed Representation, which depends on statistical language models. We assume a sentence consists of the words w_1, w_2, w_3, \ldots, w_t; the probability of the sentence can then be calculated as follows [8]:

p(w) = p(w_1, w_2, \ldots, w_t) = p(w_1)\, p(w_2 \mid w_1) \cdots p(w_t \mid w_1, \ldots, w_{t-1})

The network model consists of an input layer, a projection layer and an output layer. For any word w in the corpus, we take the c words before and the c words after w to form its context, Context(w) [9]. The input layer therefore consists of the 2c word vectors v(Context(w), 1), v(Context(w), 2), \ldots, v(Context(w), 2c), where the length of each word vector is m.


The 2c word vectors are summed in the projection layer:

X_w = \sum_{i=1}^{2c} v(\mathrm{Context}(w), i)

The output layer is a Huffman tree built from the weights (frequencies) of the words in the corpus. Each leaf node corresponds to a word in the dictionary, so the number of leaf nodes equals the size of the dictionary. A left subtree is recorded as the negative class 1, and a right subtree as the positive class 0.

We assume each node has a code d, whose value is 0 or 1. The category of every node is represented by the formula:

\mathrm{Label}(p_i^w) = 1 - d_i^w, \quad i = 1, 2, 3, \ldots
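To make the projection step concrete, the following is a minimal sketch of how the 2c context vectors could be gathered and summed. The lookup table word_vectors, the window size c, the vector length m, and the helper name project_context are illustrative assumptions, not part of the original system.

```python
import numpy as np

def project_context(sentence, position, word_vectors, c=2, m=100):
    """Sum the vectors of the c words before and after the target word.

    sentence     : list of tokens (the training sentence)
    position     : index of the target word w in the sentence
    word_vectors : dict mapping each word to an m-dimensional numpy vector
    Returns X_w, the projection-layer vector (illustrative sketch).
    """
    x_w = np.zeros(m)
    # Context(w): up to c words on each side of the target word.
    for i in range(max(0, position - c), min(len(sentence), position + c + 1)):
        if i == position:
            continue  # skip the target word itself
        vec = word_vectors.get(sentence[i])
        if vec is not None:
            x_w += vec
    return x_w
```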

In this paper, we use the sigmoid function as the excitation function of the network structure [10]. The probability that a node is classified into the positive class is:

\sigma(X_w) = \frac{1}{1 + e^{-X_w}}

The probability that a node is classified into the negative class is 1 - \sigma(X_w).

The objective function of the neural network structure is [11]:

G = \sum_{w \in C} \sum_{j=2}^{L^w} \left\{ (1 - d_j^w) \log\left[\sigma\left(X_w^{\top} \theta_{j-1}^w\right)\right] + d_j^w \log\left[1 - \sigma\left(X_w^{\top} \theta_{j-1}^w\right)\right] \right\}

where \theta_{j-1}^w denotes the parameter vector attached to the (j-1)-th node on the Huffman path of w.
B. Model Parameter Optimization

In this paper, we use the Stochastic Gradient Ascent method [12] for optimization. Supposing G(w, j) represents the term enclosed by the two summation signs in the objective function, the gradient of G(w, j) with respect to X_w simplifies to:

\frac{\partial G(w, j)}{\partial X_w} = \left(1 - d_j^w - \sigma\left(X_w^{\top} \theta_{j-1}^w\right)\right) \theta_{j-1}^w

Here L^w is the number of nodes on the path of w in the Huffman tree, i.e. the upper limit of the inner summation in the objective function. So far, we have obtained the training procedure that generates the word vectors.
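As a rough illustration of how this gradient could drive one stochastic gradient ascent step (in the CBOW-style hierarchical softmax setting described above), the sketch below updates both the node parameters theta and the context word vectors. The sigmoid helper, the learning rate alpha, and all variable names are assumptions made for illustration, not the authors' implementation.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sga_step(x_w, path_nodes, codes, thetas, context_vectors, alpha=0.025):
    """One stochastic gradient ascent step for a single target word w.

    x_w             : summed context vector X_w (projection-layer output)
    path_nodes      : indices of the inner Huffman-tree nodes on w's path
    codes           : Huffman codes d_j^w (0 or 1) for those nodes
    thetas          : array of parameter vectors theta for the inner nodes
    context_vectors : list of the context word vectors that formed X_w
    alpha           : learning rate (illustrative default)
    """
    e = np.zeros_like(x_w)                 # accumulated update for the context vectors
    for node, d in zip(path_nodes, codes):
        q = sigmoid(np.dot(x_w, thetas[node]))
        g = alpha * (1 - d - q)            # gradient scale from the formula above
        e += g * thetas[node]              # contribution to the context vectors
        thetas[node] += g * x_w            # update the node parameters
    for v in context_vectors:
        v += e                             # propagate the update to each context word
    return thetas, context_vectors
```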
III. TEXT SIMILARITY STUDY

A. Building of Word Vector Model

The steps of word vector model building in this paper are as follows:

(1) Get the words from the training texts.

(2) Get the word frequencies, initialize the word vectors and put them into a hash table.

(3) Construct the Huffman tree and obtain the Huffman-tree path of every word.

(4) Remove high-frequency words, obtain the word vectors and optimize the objective function.

(5) Count the number of trained words and update the learning rate when the count exceeds 1000.

(6) Store the word vectors.

To optimize the word vectors [13], we need to calculate \sigma(X_w). For ease of operation, we use an approximate calculation method in this paper (a short sketch follows these cases). The interval [-6, 6] is divided into 1000 equal parts with endpoints x_0, x_1, x_2, \ldots, x_{1000}. The sigmoid function value is calculated at each x_k and stored in memory. Let x denote the sum of a word's context word vectors projected onto a node; then:

When x \le -6, \sigma(x) \approx 0.

When x \ge 6, \sigma(x) \approx 1.

When -6 < x < 6, \sigma(x) \approx \sigma(x_k), where x_k is the nearest table point, and we read the value from the table directly.
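A minimal sketch of this table-based approximation might look as follows. The table size (1000) and the clipping bound (6) come from the description above; the function and variable names are illustrative assumptions.

```python
import numpy as np

EXP_TABLE_SIZE = 1000
MAX_EXP = 6

# Precompute sigmoid values on [-MAX_EXP, MAX_EXP] once, at start-up.
_xs = np.linspace(-MAX_EXP, MAX_EXP, EXP_TABLE_SIZE + 1)
_sigmoid_table = 1.0 / (1.0 + np.exp(-_xs))

def fast_sigmoid(x):
    """Approximate sigmoid(x) via the precomputed table."""
    if x <= -MAX_EXP:
        return 0.0
    if x >= MAX_EXP:
        return 1.0
    # Map x onto the nearest precomputed point x_k.
    k = int(round((x + MAX_EXP) / (2 * MAX_EXP) * EXP_TABLE_SIZE))
    return _sigmoid_table[k]
```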
During training, the size of the learning rate has a great impact on the convergence rate of the network and on the training results. If its value is too small, the training rate is low; if it is too high, it may cause oscillation or divergence [14]. The learning rate is set to 0.025 at the beginning. In order to optimize the learning rate, we adjust it after every 1000 trained words. The adjustment formula is:

\alpha = \alpha_0 \left(1 - \frac{wordCountActual}{trainWordsCount + 1}\right)

where wordCountActual is the number of words trained so far and trainWordsCount is the total number of words.

The formula for the removal of high-frequency words is:

ran = \left(\sqrt{\frac{f(w)}{t \cdot trainWordsCount}} + 1\right) \cdot \frac{t \cdot trainWordsCount}{f(w)}

where t is a value set by ourselves and f(w) is the frequency of the word w. We also set a threshold nextRandom; if the value of ran is bigger than nextRandom, the word is removed.
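The following sketch shows how these two adjustments could be applied during training, directly following the formulas above. The variable names mirror the text (wordCountActual, trainWordsCount, t, nextRandom); the surrounding functions and the small lower bound on the learning rate are illustrative assumptions.

```python
import math

def adjusted_learning_rate(alpha0, word_count_actual, train_words_count):
    """Learning-rate decay applied after every 1000 trained words (formula above)."""
    alpha = alpha0 * (1 - word_count_actual / (train_words_count + 1))
    return max(alpha, alpha0 * 0.0001)  # small floor as a practical safeguard (assumption)

def should_remove(word_freq, t, train_words_count, next_random):
    """High-frequency word removal rule as described in the text."""
    ran = (math.sqrt(word_freq / (t * train_words_count)) + 1) * \
          (t * train_words_count) / word_freq
    return ran > next_random  # removed when ran exceeds the threshold

# Example usage (illustrative values):
# alpha = adjusted_learning_rate(0.025, word_count_actual=50_000, train_words_count=1_000_000)
# drop = should_remove(word_freq=80_000, t=1e-4, train_words_count=1_000_000, next_random=0.37)
```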
B. Calculating Semantic Distance

The word semantic distance measures the degree of conformity between words [15]. In this paper, if the distance is large, the degree of conformity between the words is high; if it is small, the degree of conformity is low. The steps of calculating the word semantic distance are as follows:

(1) Load the model and get the trained word vectors.

(2) Calculate the semantic distance. We first get the central words and their word vectors. Then we calculate the distances between the central words and the other words in the word library via the Cosine Law.

To make the calculation of semantic distance more convenient, we divide each vector by its length. The formula is:

Vector(i) = \frac{Vector(i)}{\sqrt{\sum_{i=0}^{n} Vector(i)^2}}


The vector of word A is represented as (Va_1, Va_2, Va_3, \ldots, Va_n) and the vector of word B as (Vb_1, Vb_2, Vb_3, \ldots, Vb_n). The formula for calculating the semantic distance between word A and word B is:

D = \frac{\sum_{i=1}^{n} Va_i \, Vb_i}{\sqrt{\sum_{j=1}^{n} Va_j^2} \, \sqrt{\sum_{j=1}^{n} Vb_j^2}}
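A small sketch of the normalization and cosine-based distance described above; the function names and the use of NumPy are illustrative assumptions.

```python
import numpy as np

def normalize(vec):
    """Divide a word vector by its length, as in the formula above."""
    norm = np.sqrt(np.sum(vec ** 2))
    return vec / norm if norm > 0 else vec

def semantic_distance(vec_a, vec_b):
    """Cosine-based semantic distance D between two word vectors."""
    denom = np.sqrt(np.sum(vec_a ** 2)) * np.sqrt(np.sum(vec_b ** 2))
    return float(np.dot(vec_a, vec_b) / denom) if denom > 0 else 0.0

# With pre-normalized vectors, the distance reduces to a plain dot product:
# d = float(np.dot(normalize(va), normalize(vb)))
```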
C. Semantic Text Similarity Computing

This paper explores the calculation of semantic text similarity via word vectors; we use the word vectors described above to calculate the semantic similarity of texts. The main idea is to compare the similarity of texts by calculating the semantic similarity of their feature words. The whole process is divided into two parts: text pre-processing and similarity calculation.

We first segment the texts with word-segmentation software and obtain the frequency of each word. After that, we represent each text by its feature words, using an evaluation parameter a: a word becomes a feature word if its frequency is larger than a.

The steps of this part of the algorithm are as follows:

(1) Get the vocabulary of the texts.

(2) Count the word frequencies and exclude low-frequency words.

(3) Produce the feature words prepared for the similarity calculation.

After obtaining the feature words of the texts, we use the word vectors to calculate the distances between them. Assuming a threshold k, we count the number of word pairs whose distance is larger than k. In the end, we use the following formula to obtain the text similarity:

Similarity = \frac{2 \times Length}{Length_A + Length_B}

where Length is the number of matched word pairs (those above k) and Length_A and Length_B are the numbers of feature words of the two texts.

The steps of the similarity algorithm are as follows (a sketch is given after this list):

(1) Load the binary vector model.

(2) Get the feature words of texts LA and LB.

(3) Use the formula to calculate the text similarity.
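The following is a minimal sketch of that procedure under the interpretation given above (Length counts feature-word pairs whose semantic distance exceeds k). It reuses semantic_distance() from the earlier sketch; the function name, the per-word matching rule, and the default threshold are illustrative assumptions rather than the authors' implementation.

```python
def text_similarity(features_a, features_b, word_vectors, k=0.6):
    """Semantic text similarity between two feature-word lists.

    features_a, features_b : feature words of texts A and B
    word_vectors           : dict mapping words to numpy vectors
    k                      : distance threshold for counting a matched pair
    """
    matched = 0
    for wa in features_a:
        va = word_vectors.get(wa)
        if va is None:
            continue
        for wb in features_b:
            vb = word_vectors.get(wb)
            if vb is not None and semantic_distance(va, vb) > k:
                matched += 1
                break  # count each word of text A at most once (assumption)
    length_a, length_b = len(features_a), len(features_b)
    if length_a + length_b == 0:
        return 0.0
    return 2.0 * matched / (length_a + length_b)
```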


IV. ANALYSIS OF EXPERIMENTAL RESULTS

A. Training Word Vectors and Semantic Distances

In this paper, we use web crawlers to collect the corpus for our analysis from many websites, such as Sohu News, World News, Netease News and Car Home News. The corpus is as follows:

TABLE I. CORPUS RESOURCE

  Website          Size of corpus
  Sohu News        136M
  World News       64M
  Netease News     145M
  Car Home News    55M

We then select about 1,000,000 Chinese sentences as the corpus for word vector training and obtain about 400M of word vector files. The results of the semantic distance calculation are as follows.

Central word: "Liu Xiang"

TABLE II. RESULTS OF LIUXIANG

  Related word (meaning in English)    Semantic distance
  Ge tian                              0.721582
  Liu xuegen                           0.635796
  Licensing                            0.629734
  Fenke                                0.625426
  Star style                           0.60653
  Liao haiguang                        0.605873
  Angelina-Jolie                       0.605339
  Deborah                              0.596614
  Beans                                0.594925
  Crowe Goodale                        0.587726

Central word: "Driving"

TABLE III. RESULTS OF DRIVING

  Related word (meaning in English)    Semantic distance
  Diver                                0.708154
  Driving                              0.659605
  Tired                                0.645744
  Wrong lane                           0.624946
  Distraction                          0.617421
  Near Schools                         0.610519
  Attention                            0.60669
  Out                                  0.598952
  Newer                                0.598586
  Get on                               0.586931

From the results above, we can observe that the distance values of words that are close to the central word are larger than those of other words: the closer the relationship, the larger the distance value. Some word pairs do not look closely related intuitively, but their computed distances turn out to be large; for example, the word "Xiaomi" is close to "Apple". This can happen when word associations exist in the corpus, so the selected corpus also has an important influence on the training results of the word vectors.


B. Analysis of Results From Text Similarity Calculating

First, we analyze the text similarity of plain texts using the TF-IDF method and the Word Vector method. The test texts were obtained by a web crawler from the Sohu News website, and 4900 groups of texts were compared. The results are as follows:

Figure 1. Results of plain text similarity computing

At the same time, we also analyze the text similarity of semantically related texts with the two methods. For example, the semantic texts are as follows:

Sentence 1 (Chinese, translated): Xiao Ming likes to play basketball, and he is a member of the school basketball team.

Sentence 2 (Chinese, translated): Kobe is an all-star in the NBA. He loves basketball very much and sees it as his life.

The value of text similarity calculated by TF-IDF is 0.435, while the value calculated by our method based on neural networks is 0.867. The results of 20 groups are as follows:

Figure 2. Results of semantic text similarity computing

From the figures above, we can observe the following. For plain texts, when the similarity of the texts is very high or very low, the results produced by the two methods are almost the same. When texts are partly similar, there is a small difference between the two methods. The reason is that the traditional method only calculates the relationship between identical key words, whereas the Word Vector method considers all the words in the text.

For semantically related texts, the Word Vector method is better than the traditional one. After analysis, we find that the traditional method only calculates identical key words; when such key words are very few or make up almost all of the text, the results can be poor. The Word Vector method can directly calculate the semantic distance.

V. CONCLUSION

This paper has designed and implemented a method of calculating text similarity based on neural networks. Compared with the traditional way of calculating text similarity using TF-IDF and the Cosine Law, the method not only ensures the accuracy of text similarity between non-semantic texts, but also has obvious advantages when calculating the similarity between semantically related texts. This method can be applied in specific areas such as information retrieval, data mining, and so on.

ACKNOWLEDGMENT

This work is supported by the National Nature Science Foundation (No. 61331013), the Beijing Higher Education Young Elite Teacher Project (No. YETP1291), the National Language Committee Project (No. YB125-139, ZDI125-36), and the Minzu University of China Scientific Research Project (No. 2015MDQN11).

REFERENCES

[1] Xu Baowen, Zhang Weifeng. Search Engine and Information Retrieval Technology. Beijing: Tsinghua University Press, 2003, 10-11.
[2] Salton G. Automatic Text Processing: The Transformation, Analysis, and Retrieval of Information by Computer. Reading, Pennsylvania: Addison-Wesley, 1989.
[3] Pan Qianhong, Wang Ju, Shi Zhongzhi. Text Similarity Computing Based on Attribute Theory. Chinese Journal of Computers, 1999, 6: 651-655.
[4] Zhang Huanjiong, Wang Guosheng, Zhong Yixin. Text Similarity Computing Based on Hamming Distance. Computer Engineering and Applications, 2001, 19: 21-22.
[5] Li Guohui, Tang Daquan, Wu Defeng. Information Organization and Retrieval. Beijing: Science Press, 2003, 91-92.
[6] Huo Hua, Feng Boqin. Document Similarity Degree Measuring Based on Compressed Sparse Matrix Vector Multiplication Technique. Journal of Chinese Computer Systems, 2005, 26(6): 28-31.
[7] Yu Gang, Pei Yangjun, Zhu Zhengyu. Research of Text Similarity Based on Word Similarity Computing. Computer Engineering and Design, 2006, 27(2): 241-244.
[8] Li Xiaoguang, Wang Daling, Yu Ge. Information Retrieval Based on Statistical Language Model. Computer Science, 2005.
[9] Yoshua Bengio, Rejean Ducharme, Pascal Vincent, Christian Jauvin. A Neural Probabilistic Language Model. Journal of Machine Learning Research, 2003.
[10] Zhang Yunong, Qu Lu, Chen Junwei, Liu Jinrong, Guo Dongsheng. Weights and Structure Determination Method of Multiple-input Sigmoid Activation Function Neural Network. Application Research of Computers, 2012.
[11] Simon Haykin. Neural Networks and Learning Machines (Third Edition). 2011.
[12] Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg Corrado, Jeffrey Dean. Distributed Representations of Words and Phrases and their Compositionality. arXiv:1310.4546, 2013.
[13] Wong S K M, Ziarko W, Wong P C N. Generalized Vector Space Model in Information Retrieval. In: Proc. of the 8th Annual ACM SIGIR International Conference on Research and Development in Information Retrieval, 1985, 18-25.
[14] Wei Min, Yu LeAn. A RBF Neural Network with Optimum Learning Rates and Its Application. Journal of Management Sciences in China, 2012.
[15] Tomas Mikolov, Kai Chen, Greg Corrado, Jeffrey Dean. Efficient Estimation of Word Representations in Vector Space. arXiv:1301.3781, 2013.


