CN108255813A - A text matching method based on term frequency-inverse document frequency (TF-IDF) and CRF - Google Patents
- Publication number
- CN108255813A (application number CN201810062016.3A)
- Authority
- CN
- China
- Prior art keywords
- text
- word
- CRF
- IDF
- corpus
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/289—Phrasal analysis, e.g. finite state techniques or chunking
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/33—Querying
- G06F16/3331—Query processing
- G06F16/334—Query execution
- G06F16/3344—Query execution using natural language analysis
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- Computational Linguistics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Artificial Intelligence (AREA)
- Audiology, Speech & Language Pathology (AREA)
- General Health & Medical Sciences (AREA)
- Health & Medical Sciences (AREA)
- Data Mining & Analysis (AREA)
- Databases & Information Systems (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
- Machine Translation (AREA)
Abstract
The present invention claims a semantic matching method based on term frequency-inverse document frequency (TF-IDF) and conditional random fields (CRF). The attribute features mined by CRF and the statistical features of TF-IDF are selected to assign weights to the word vectors of a text, thereby representing the weighting of the text. This method solves the problem that TF-IDF and CRF obtain weights purely from a statistical angle or from demand information without accounting for the semantics between words, and also solves the problem that the fixed expression of word features in Word2vec is ambiguous. Handling text matching problems with a combination of the above methods significantly improves matching accuracy.
Description
Technical field
The invention belongs to the technical field of text processing, and in particular relates to a text semantic matching method combining term frequency-inverse document frequency and CRF.
Background technology
Text matching is one of the tasks of natural language processing (NLP) and is often applied to problems such as information retrieval, community question answering, and recommender systems. Converting unstructured textual data into structured data requires a text representation model; such a model deepens the system's understanding of text by enhancing the semantics of keywords. The vector space model (Vector Space Model, VSM) is currently one of the most mature and most widely used text representation models. Enhancing the weights of the feature items in a text strengthens its semantics, and the correct selection of feature items is crucial for correctly expressing the theme or particular meaning of a text. The term frequency-inverse document frequency algorithm (TF-IDF) is one of the most common weighting strategies in current information retrieval systems. TF-IDF can therefore be combined with a word vector model so as to add, for the feature words of a text, semantic connections between words, suiting both the holistic relevance of the text and the local relevance of certain specific words, so that the features have stronger generalization ability.
A conditional random field (Conditional Random Field, CRF) is a probabilistic graphical model. This model can express long-distance dependencies and overlapping features, and can better solve problems such as label bias; it considers the transition probabilities between contextual labels, optimizes the parameters of all features globally over the sequence, and can find the global optimum. It has strong inference ability and can train on and reason over complex, overlapping, and interdependent features. It can therefore be used to match user demand text, mining user attributes in descriptive labelling tasks and thereby obtaining rich information.
Combining the feature vectors of TF-IDF and CRF is therefore attractive. In terms of computation, the statistics-based TF-IDF algorithm is relatively simple and fast, while CRF can analyse user demand and apply a corresponding weight enhancement. The semantic information obtained is thus both fuller and more targeted, the feature-vector representation of the text is more accurate, and the accuracy of text matching can be significantly improved.
Invention content
The present invention aims to solve the above problems of the prior art. It proposes a text matching method based on term frequency-inverse document frequency and CRF whose feature-vector representation is more accurate and which can significantly improve the accuracy of text matching. The technical scheme of the present invention is as follows:
A text matching method based on term frequency-inverse document frequency and CRF, comprising the steps of:
Step 1: Collect a text matching corpus from the network, consisting of sentence pairs of product descriptions and search terms; the label of a pair is 1 if it is an associated match, otherwise 0. Randomly divide the corpus into a training-set corpus and a test-set corpus.
Step 2: Segment the corpus prepared in step 1 with a Chinese word segmentation method, collect a stop-word list, and remove the stop words in the corpus according to the list. The method further comprises the following steps:
Step 3: Label the training set of the corpus obtained in step 2 with a conditional random field (CRF), tagging identity words (ide), behavior words (act), and irrelevant words (non). Append the part of speech to the tail of each labelled sample as a surface feature, build an attribute feature template based on Bigram features, perform CRF modelling with the CRF++ tool, learn from the labelled text, train an attribute model, and obtain the attributes of all words in the text. Apply word-weight enhancement to identity words or behavior words according to whether similarity or relevance matching is required;
Step 4: Train on the corpus obtained in step 2 with term frequency-inverse document frequency (TF-IDF), and take the TF-IDF value of each word as that word's TF-IDF weight;
Step 5: Train on the corpus prepared in step 2 with Word2vec to obtain a word-vector model;
Step 6: Fuse the two kinds of weights obtained in steps 3 and 4 into the weight of each word, then multiply the weight of each word by the corresponding word vector obtained in step 5 to obtain a new text feature vector;
Step 7: Input the text semantic feature vectors of the training-set corpus obtained in step 6 into Softmax to train a text matching model;
Step 8: Input the text feature vectors of the test-set corpus obtained in step 6 into Softmax, perform text matching according to the model trained in step 7, and calculate the accuracy of the matching results.
Further, the corpus is segmented with a Chinese word segmentation method based on N-shortest paths.
Further, segmenting the corpus with the N-shortest-path Chinese word segmentation method specifically comprises the steps of: first representing the coupling relations between phrases with an adjacency list; then determining the first segmentation path by computing the coupling relations between phrases; finally, after all paths have been computed, finding the optimal path as the segmentation result.
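The shortest-path idea above can be sketched in Python. The toy dictionary, the unit edge cost, and keeping only the single best path (rather than the N best paths of a full N-shortest-path segmenter) are simplifying assumptions for illustration:

```python
import heapq

# A minimal sketch of shortest-path word segmentation over a word graph.
# The dictionary below is a toy assumption; single characters are always
# allowed as fallback edges so every sentence is segmentable.
DICT = {"他", "说", "的", "确实", "在", "理", "的确", "实在"}

def segment(sentence, dictionary=DICT):
    n = len(sentence)
    # Edges of the word graph: node i -> node j if sentence[i:j] is a word.
    edges = [[] for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n + 1):
            if sentence[i:j] in dictionary or j - i == 1:
                edges[i].append(j)
    # Dijkstra with unit edge cost: the path with the fewest edges
    # (fewest words) is taken as the segmentation result.
    dist = {0: (0, None)}
    heap = [(0, 0)]
    while heap:
        d, u = heapq.heappop(heap)
        if u == n:
            break
        for v in edges[u]:
            if v not in dist or d + 1 < dist[v][0]:
                dist[v] = (d + 1, u)
                heapq.heappush(heap, (d + 1, v))
    # Recover the optimal path and cut the sentence along its nodes.
    path, u = [], n
    while u is not None:
        path.append(u)
        u = dist[u][1]
    path.reverse()
    return [sentence[a:b] for a, b in zip(path, path[1:])]

print(segment("他说的确实在理"))
```

Note that on this classic ambiguous sentence the single shortest path prefers the fewest words, which is exactly why practical systems keep N candidate paths and rescore them.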
Further, step 3 is specifically: label the training set of the corpus obtained in step 2 with CRF, marking user behavior words, identity words, and meaningless words as act, ide, and non respectively, so as to extract the user's preference information and demand information; append the part of speech to the tail of each labelled sample as a surface feature; build an attribute feature template based on Bigram features, so that the current word and its combinations with the preceding and following words are considered during feature extraction; perform CRF modelling with the CRF++ tool, learn from the labelled text, and train the attribute model.
Further, training on the corpus obtained in step 2 with TF-IDF (term frequency-inverse document frequency) and obtaining the TF-IDF value of each word: the TF-IDF algorithm takes words as the feature items of a text, and the weight of each feature item is composed of two parts, a TF weight and an IDF weight. The specific calculation formulas are as follows:

w_ji = TF_ji · IDF_i   (2)
TF_ji = f_ji / T   (3)
IDF_i = log(N / n_i + 0.01)   (4)
W_j = {w_j1, w_j2, ..., w_ji}   (5)

TF is the frequency with which a feature item occurs in the text, representing the word's degree of importance in the current text, where T is the total number of words of the j-th text and f_ji is the number of occurrences in the text of the i-th word of the j-th text. IDF is the inverse document frequency of the feature item, judging the importance of the word from a global view, where N is the total number of texts and n_i is the number of texts in which word i occurs. W_j is the weight set of the feature vector of the j-th text, and w_ji is the weight of the i-th word in the j-th text.
Further, step 5 inputs the corpus prepared in step 2 into a model using the hierarchical Softmax bag-of-words algorithm in Word2vec, thereby obtaining the representation of each word in the vector space.
Further, step 6 fuses the two kinds of weights obtained in steps 3 and 4 into the weight of each word, then multiplies the weight of each word by the corresponding word vector obtained in step 5 to obtain a new text feature vector. Specifically: the TF-IDF value of each word obtained in step 4 is multiplied by the corresponding word vector, and the user-attribute words obtained in step 3 receive a weight enhancement by a multiple, giving the semantically enhanced feature vectors.
Further, step 8 inputs the text feature vectors of the obtained test-set corpus into Softmax and, according to the trained model, performs text matching and calculates the accuracy of the matching results. Specifically: input the text feature vectors of the test-set corpus obtained in step 6 into Softmax and perform text matching according to the model trained in step 7. If the label actually output for a text equals 1, the sentence pair is judged to match in content; if the label actually output is not equal to 1, i.e. the label equals 0, the sentence pair is judged not to match in content. Count the number of differences between the actually output labels and the expected labels, and calculate the sentence-pair matching accuracy.
Further, the CRF is a discriminative undirected graphical model; a linear-chain conditional random field is used. x = (x_1, x_2, ..., x_n) denotes the observed input data sequence and y = (y_1, y_2, ..., y_n) denotes a state sequence. Given an input sequence, the linear-chain CRF model defines the conditional probability of the state sequence as:

P(y|x) = (1/Z(x)) exp( Σ_{i,j} λ_j t_j(y_{i-1}, y_i, x, i) + Σ_{i,k} μ_k s_k(y_i, x, i) )   (1)

where Z(x) is the normalization factor, t_j(y_{i-1}, y_i, x, i) is a transition feature function at positions i-1 and i of the observation sequence, s_k(y_i, x, i) is a state feature function at position i of the observation sequence, and the parameters λ_j and μ_k can be estimated from the training data: the larger a non-negative value obtained, the more the corresponding feature event is preferred; the larger a negative value obtained, the less likely the corresponding feature event is to occur.
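As a sketch of formula (1), the conditional probability of a linear-chain CRF can be computed by brute-force enumeration on a toy problem. The label set, feature functions, and weights λ and μ below are illustrative assumptions, not the patent's trained model:

```python
import itertools
import math

# Toy labels matching the patent's act/ide/non tagging scheme.
LABELS = ["act", "ide", "non"]

def transition(prev_y, y, x, i):
    # t_j: a single transition feature that favours repeating a label.
    return 1.0 if prev_y == y else 0.0

def state(y, x, i):
    # s_k: a single state feature that favours "non" on a period token.
    return 1.0 if (x[i] == "." and y == "non") else 0.0

LAMBDA, MU = 0.8, 1.5  # λ and μ; estimated from training data in practice

def score(y_seq, x):
    # Unnormalized log-score: Σ μ·s_k + Σ λ·t_j over the sequence.
    s = sum(MU * state(y_seq[i], x, i) for i in range(len(x)))
    s += sum(LAMBDA * transition(y_seq[i - 1], y_seq[i], x, i)
             for i in range(1, len(x)))
    return s

def probability(y_seq, x):
    # P(y|x) = exp(score(y, x)) / Z(x), Z summing over all label sequences.
    z = sum(math.exp(score(list(cand), x))
            for cand in itertools.product(LABELS, repeat=len(x)))
    return math.exp(score(list(y_seq), x)) / z

x = ["buy", "phone", "."]
total = sum(probability(list(c), x)
            for c in itertools.product(LABELS, repeat=len(x)))
print(round(total, 6))  # the probabilities over all label sequences sum to 1
```

Brute-force enumeration is exponential in the sequence length; real CRF tools such as CRF++ compute Z(x) with forward-backward dynamic programming instead.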
The advantages and beneficial effects of the present invention are as follows:
The present invention proposes a semantic matching method for text. The method selects TF-IDF and CRF to mine the features of the user's demand information, and fuses them with the deep semantic features of Word2vec to jointly represent the features of a text. The present invention first uses CRF to obtain the attribute features of the vocabulary in a text so as to do matching targeted at the demand, then computes the TF-IDF values of the text as shallow semantic features, and takes the two kinds of features together as weights. This solves the problem that TF-IDF and CRF obtain weights purely from a statistical angle or from demand information without accounting for the semantics between words. Finally, the weights obtained by TF-IDF and CRF are fused with the word vectors obtained by Word2vec, which, while obtaining deep semantic features, solves the problem that the fixed expression of word features in Word2vec is ambiguous. Handling text matching problems with a combination of the above methods significantly improves matching accuracy.
Description of the drawings
Fig. 1 is a flow chart of the text matching method based on term frequency-inverse document frequency and CRF according to a preferred embodiment of the present invention.
Specific embodiment
The technical solutions in the embodiments of the present invention are described clearly and in detail below with reference to the accompanying drawings. The described embodiments are only some of the embodiments of the present invention.
The technical solution by which the present invention solves the above technical problems is:
As shown in Fig. 1, the concrete steps of the semantic matching method based on TF-IDF and CRF of the present invention are:
Step 1: Collect a text matching corpus from the network, consisting of sentence pairs of product descriptions and search terms; the label of a pair is 1 if it is an associated match, otherwise 0. Divide the corpus into a training set and a test set; the training set is used to train the model, and the test set is used to test the model's classification performance.
Step 2: Before segmentation, the matching text to be segmented is preprocessed, including removal of non-Chinese-character information such as special characters, punctuation marks, and English letters. Chinese word segmentation is then performed on the corpus of step 1; the segmentation method used here is the Chinese word segmentation method based on N-shortest paths. First, the coupling relations between phrases are represented by an adjacency list (a binary segmentation word graph): each node represents an edge in the word graph, the row value represents the start of the edge, and the column value represents the end of the edge. Then the first segmentation path is determined by computing the coupling relations between phrases. Finally, after all paths have been computed, the optimal path (i.e. the shortest path) is found as the segmentation result.
After segmentation, each text becomes a text corpus of words separated by spaces. A stop-word list is then collected, vocabulary in the list that is useful for the experiment is manually deleted, and the stop words in the segmented corpus are removed according to the list. Removing stop words saves storage space and improves efficiency.
Step 3: Perform user requirement analysis on the text with CRF and extract user attributes. A CRF is a discriminative undirected graphical model; the most commonly used is the linear-chain conditional random field. x = (x_1, x_2, ..., x_n) denotes the observed input data sequence and y = (y_1, y_2, ..., y_n) denotes a state sequence. Given an input sequence, the linear-chain CRF model defines the conditional probability of the state sequence as:

P(y|x) = (1/Z(x)) exp( Σ_{i,j} λ_j t_j(y_{i-1}, y_i, x, i) + Σ_{i,k} μ_k s_k(y_i, x, i) )   (1)

where Z(x) is the normalization factor, t_j(y_{i-1}, y_i, x, i) is a transition feature function at positions i-1 and i of the observation sequence, s_k(y_i, x, i) is a state feature function at position i, and the parameters λ_j and μ_k can be estimated from the training data: the larger a non-negative value obtained, the more the corresponding feature event is preferred; the larger a negative value obtained, the less likely the corresponding feature event is to occur.
The training set of the corpus obtained in step 2 is labelled with CRF: user behavior words, identity words, and meaningless words are marked as act, ide, and non respectively, to extract the user's preference information and demand information. The part of speech is appended to the tail of each labelled sample as a surface feature, making the extracted user-attribute information more accurate. An attribute feature template based on Bigram features is built, so that the current word and its combinations with the preceding and following words are considered during feature extraction. CRF modelling is performed with the CRF++ tool; the labelled text is learnt, and the attribute model is trained.
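A hypothetical CRF++ feature template along the lines described above, with the token in column 0 and its appended part of speech in column 1; the exact feature macros chosen are assumptions, not the patent's template:

```
# Unigram features: current token, its neighbours, and its part of speech
U00:%x[-1,0]
U01:%x[0,0]
U02:%x[1,0]
U03:%x[0,1]

# Bigram-style combinations of the current word with its neighbours,
# so feature extraction considers the word together with its context
U04:%x[-1,0]/%x[0,0]
U05:%x[0,0]/%x[1,0]

# B emits label-transition features between adjacent output tags
B
```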
Step 4: Obtain the weight value of each word with TF-IDF. The TF-IDF algorithm takes words as the feature items of a text, and the weight of each feature item is composed of two parts, a TF weight and an IDF weight. The specific calculation formulas are as follows:

w_ji = TF_ji · IDF_i   (2)
TF_ji = f_ji / T   (3)
IDF_i = log(N / n_i + 0.01)   (4)
W_j = {w_j1, w_j2, ..., w_ji}   (5)

TF (Term Frequency) is the frequency with which a feature item occurs in the text, representing the word's degree of importance in the current text, where T is the total number of words of the j-th text and f_ji is the number of occurrences in the text of the i-th word of the j-th text. IDF (Inverse Document Frequency) is the inverse document frequency of the feature item, judging the importance of the word from a global view, where N is the total number of texts and n_i is the number of texts in which word i occurs. W_j is the weight set of the feature vector of the j-th text, and w_ji is the weight of the i-th word in the j-th text. Computing TF-IDF for each word in the corpus extracts the highly discriminative words of a text and assigns them weights related to their importance.
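The TF-IDF weighting of formulas (2)-(5) can be sketched directly; the toy corpus of pre-segmented texts is an illustrative assumption:

```python
import math

# Toy corpus: each text is a list of already segmented words.
corpus = [
    ["screen", "good", "phone"],
    ["battery", "good"],
    ["phone", "battery", "life"],
]

def tf_idf_weights(texts):
    n_texts = len(texts)  # N: total number of texts
    # n_i: number of texts in which word i occurs
    doc_freq = {}
    for text in texts:
        for word in set(text):
            doc_freq[word] = doc_freq.get(word, 0) + 1
    weights = []
    for text in texts:
        total = len(text)  # T: total number of words in this text
        w = {}
        for word in text:
            tf = text.count(word) / total          # formula (3)
            idf = math.log(n_texts / doc_freq[word] + 0.01)  # formula (4)
            w[word] = tf * idf                     # formula (2)
        weights.append(w)                          # W_j, formula (5)
    return weights

W = tf_idf_weights(corpus)
# "screen" occurs in only one text, so its IDF (and weight) is higher than
# that of "good", which occurs in two texts.
print(W[0]["screen"] > W[0]["good"])
```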
Step 5: Using the DBOW model with the hierarchical Softmax algorithm in Word2vec, the language model is built: each word in a text is mapped as a feature into a k-dimensional real vector. The word vectors are trained by stochastic gradient descent, obtaining gradients by backpropagation and computing the gradient error, then updating the model parameters, finally obtaining the representation of each word in the vector space. In the original feature vector of formula (6), V_j denotes the feature-vector set of the j-th text and v_ji denotes the i-th feature vector of the j-th text.

V_j = {v_j1, v_j2, ..., v_ji}   (6)

Step 6: Finally, the weight w_ji of each word from step 4 is multiplied by its corresponding feature vector v_ji, giving the improved V_j' shown in formula (7):

V_j' = {w_j1·v_j1, w_j2·v_j2, ..., w_ji·v_ji}   (7)
When similarity matching is needed, the user identity words extracted in step 3 receive the weight enhancement; when relevance matching is needed, the user behavior words extracted in step 3 receive it. Because the improved feature vectors incorporate the weights obtained by TF-IDF and CRF, they are more targeted for the different classes of features; using the word-vector model adds, for the feature words of a text, semantic connections between words, suiting both the holistic relevance of the text and the local relevance of certain specific words, so that the features have stronger generalization ability.
Step 7: Input the text semantic feature vectors of the training set of the corpus of step 6 into Softmax to train the text matching model.
Step 8: Input the text feature vectors of the test set of the corpus of step 6 into Softmax and perform text matching according to the model trained in step 7. If the label actually output for a text equals 1, the sentence pair is judged to match in content; if the label actually output is not equal to 1 (i.e. the label equals 0), the sentence pair is judged not to match in content. Count the number of differences between the actually output labels and the expected labels, and calculate the sentence-pair matching accuracy.
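The evaluation in step 8 can be sketched as follows; the label lists are toy data:

```python
# Labels actually output by the matching model vs. the expected labels:
# 1 = matched sentence pair, 0 = unmatched sentence pair.
predicted = [1, 0, 1, 1, 0]
expected = [1, 0, 0, 1, 0]

def matching_accuracy(pred, gold):
    # Count positions where the actual and expected labels differ,
    # then turn the error count into an accuracy.
    errors = sum(1 for p, g in zip(pred, gold) if p != g)
    return 1.0 - errors / len(gold)

print(matching_accuracy(predicted, expected))  # 4 of 5 pairs correct -> 0.8
```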
The above embodiments should be understood as merely illustrating the present invention rather than limiting its scope. After reading the content recorded in the present invention, those skilled in the art can make various changes or modifications to the present invention, and such equivalent changes and modifications likewise fall within the scope of the claims of the present invention.
Claims (9)
1. A text matching method based on term frequency-inverse document frequency and CRF, comprising the steps of: step 1: collecting a text matching corpus from the network, consisting of sentence pairs of product descriptions and search terms, the label of a pair being 1 if it is an associated match and otherwise 0, and randomly dividing the corpus into a training-set corpus and a test-set corpus; step 2: segmenting the corpus prepared in step 1 with a Chinese word segmentation method, collecting a stop-word list, and removing the stop words in the corpus according to the list; characterized in that it further comprises the following steps:
step 3: labelling the training set of the corpus obtained in step 2 with a conditional random field CRF, tagging identity words, behavior words, and irrelevant words; appending the part of speech to the tail of each labelled sample as a surface feature; building an attribute feature template based on Bigram features; performing CRF modelling with the CRF++ tool; learning from the labelled text; training an attribute model; obtaining the attributes of all words in the text; and applying word-weight enhancement to identity words or behavior words according to whether similarity or relevance matching is required;
step 4: training on the corpus obtained in step 2 with term frequency-inverse document frequency TF-IDF, and taking the TF-IDF value of each word as that word's TF-IDF weight;
step 5: training on the corpus prepared in step 2 with Word2vec to obtain a word-vector model;
step 6: fusing the two kinds of weights obtained in steps 3 and 4 into the weight of each word, then multiplying the weight of each word by the corresponding word vector obtained in step 5 to obtain a new text feature vector;
step 7: inputting the text semantic feature vectors of the training-set corpus obtained in step 6 into Softmax to train a text matching model;
step 8: inputting the text feature vectors of the test-set corpus obtained in step 6 into Softmax, performing text matching according to the model trained in step 7, and calculating the accuracy of the matching results.
2. The text matching method based on term frequency-inverse document frequency and CRF according to claim 1, characterized in that the corpus is segmented with a Chinese word segmentation method based on N-shortest paths.
3. The text matching method based on term frequency-inverse document frequency and CRF according to claim 2, characterized in that segmenting the corpus with the N-shortest-path Chinese word segmentation method specifically comprises the steps of: first representing the coupling relations between phrases with an adjacency list; then determining the first segmentation path by computing the coupling relations between phrases; finally, after all paths have been computed, finding the optimal path as the segmentation result.
4. The text matching method based on term frequency-inverse document frequency and CRF according to any one of claims 1-3, characterized in that step 3 is specifically: labelling the training set of the corpus obtained in step 2 with CRF, marking user behavior words, identity words, and meaningless words as act, ide, and non respectively, so as to extract the user's preference information and demand information; appending the part of speech to the tail of each labelled sample as a surface feature; building an attribute feature template based on Bigram features, so that the current word and its combinations with the preceding and following words are considered during feature extraction; performing CRF modelling with the CRF++ tool; learning from the labelled text; and training the attribute model.
5. The text matching method based on term frequency-inverse document frequency and CRF according to any one of claims 1-3, characterized in that training on the corpus obtained in step 2 with TF-IDF and obtaining the TF-IDF value of each word: the TF-IDF algorithm takes words as the feature items of a text, and the weight of each feature item is composed of two parts, a TF weight and an IDF weight, specifically comprising the following calculation formulas:

w_ji = TF_ji · IDF_i   (2)
TF_ji = f_ji / T   (3)
IDF_i = log(N / n_i + 0.01)   (4)
W_j = {w_j1, w_j2, ..., w_ji}   (5)

where TF is the frequency with which a feature item occurs in the text, representing the word's degree of importance in the current text; T is the total number of words of the j-th text; f_ji is the number of occurrences in the text of the i-th word of the j-th text; IDF is the inverse document frequency of the feature item, judging the importance of the word from a global view; N is the total number of texts; n_i is the number of texts in which word i occurs; W_j is the weight set of the feature vector of the j-th text; and w_ji is the weight of the i-th word in the j-th text.
6. The text matching method based on term frequency-inverse document frequency and CRF according to any one of claims 1-3, characterized in that step 5 inputs the corpus prepared in step 2 into a model using the hierarchical Softmax bag-of-words DBOW algorithm in Word2vec, thereby obtaining the representation of each word in the vector space.
7. The text matching method based on term frequency-inverse document frequency and CRF according to claim 6, characterized in that step 6 fuses the two kinds of weights obtained in steps 3 and 4 into the weight of each word, then multiplies the weight of each word by the corresponding word vector obtained in step 5 to obtain a new text feature vector, specifically: the TF-IDF value of each word obtained in step 4 is multiplied by the corresponding word vector, and the user-attribute words obtained in step 3 receive a weight enhancement by a multiple, giving the semantically enhanced feature vectors.
8. The text matching method based on term frequency-inverse document frequency and CRF according to claim 6, characterized in that step 8 inputs the text feature vectors of the obtained test-set corpus into Softmax and, according to the trained model, performs text matching and calculates the accuracy of the matching results, specifically comprising: inputting the text feature vectors of the test-set corpus obtained in step 6 into Softmax and performing text matching according to the model trained in step 7; if the label actually output for a text equals 1, judging the sentence pair to match in content; if the label actually output is not equal to 1, i.e. the label equals 0, judging the sentence pair not to match in content; counting the number of differences between the actually output labels and the expected labels; and calculating the sentence-pair matching accuracy.
9. The text matching method based on term frequency-inverse document frequency and CRF according to claim 6, characterized in that the CRF is a discriminative undirected graphical model, using a linear-chain conditional random field, where x = (x_1, x_2, ..., x_n) denotes the observed input data sequence and y = (y_1, y_2, ..., y_n) denotes a state sequence; given an input sequence, the linear-chain CRF model defines the conditional probability of the state sequence as:

P(y|x) = (1/Z(x)) exp( Σ_{i,j} λ_j t_j(y_{i-1}, y_i, x, i) + Σ_{i,k} μ_k s_k(y_i, x, i) )   (1)

where Z(x) is the normalization factor, t_j(y_{i-1}, y_i, x, i) is a transition feature function at positions i-1 and i of the observation sequence, s_k(y_i, x, i) is a state feature function at position i, and the parameters λ_j and μ_k can be estimated from the training data: the larger a non-negative value obtained, the more the corresponding feature event is preferred; the larger a negative value obtained, the less likely the corresponding feature event is to occur.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810062016.3A CN108255813B (en) | 2018-01-23 | 2018-01-23 | Text matching method based on word frequency-inverse document and CRF |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810062016.3A CN108255813B (en) | 2018-01-23 | 2018-01-23 | Text matching method based on word frequency-inverse document and CRF |
Publications (2)
Publication Number | Publication Date |
---|---|
CN108255813A true CN108255813A (en) | 2018-07-06 |
CN108255813B CN108255813B (en) | 2021-11-16 |
Family
ID=62742366
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201810062016.3A Active CN108255813B (en) | 2018-01-23 | 2018-01-23 | Text matching method based on word frequency-inverse document and CRF |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN108255813B (en) |
Application Events
- 2018-01-23: Application CN201810062016.3A filed in China (CN); granted as CN108255813B, status Active
Patent Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN104978356A (en) * | 2014-04-10 | 2015-10-14 | 阿里巴巴集团控股有限公司 | Synonym identification method and device |
KR20170000185A (en) * | 2015-06-23 | 2017-01-02 | 아시아나아이디티 주식회사 | Text categorizing system of sparse vector-space document, device and method thereof |
CN105677779A (en) * | 2015-12-30 | 2016-06-15 | 山东大学 | Feedback-based question-type classifier system using a scoring mechanism, and its working method |
CN105740236A (en) * | 2016-01-29 | 2016-07-06 | 中国科学院自动化研究所 | Method and system for recognizing new Chinese sentiment words by combining writing features and sequence features |
CN107193959A (en) * | 2017-05-24 | 2017-09-22 | 南京大学 | Business entity classification method for plain text |
Non-Patent Citations (2)
Title |
---|
Shilin Zhang and Mei Gu: "Improved Text Classification to Acquire Job Opportunities for Chinese Disabled Persons", 2010 2nd International Conference on Advanced Computer Control * |
Tang Xianlun: "Text Semantic Matching and Recommendation Based on Conditional Random Fields and TF-IDF", Abstracts of the 28th Chinese Process Control Conference (CPCC 2017) and the 30th Anniversary of the Chinese Process Control Conference * |
Cited By (21)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109062899A (en) * | 2018-07-31 | 2018-12-21 | 中国科学院信息工程研究所 | Document similarity measurement method based on part-of-speech tagging |
CN109062899B (en) * | 2018-07-31 | 2021-10-15 | 中国科学院信息工程研究所 | Document similarity measurement method based on part-of-speech tagging |
CN109271626A (en) * | 2018-08-31 | 2019-01-25 | 北京工业大学 | Text semantic analysis method |
CN109271626B (en) * | 2018-08-31 | 2023-09-26 | 北京工业大学 | Text semantic analysis method |
CN109446321A (en) * | 2018-10-11 | 2019-03-08 | 深圳前海达闼云端智能科技有限公司 | Text classification method, text classification device, terminal and computer readable storage medium |
CN109522549A (en) * | 2018-10-30 | 2019-03-26 | 云南电网有限责任公司信息中心 | Corpus construction method based on Web collection and balanced distribution of text features |
CN109522549B (en) * | 2018-10-30 | 2022-06-10 | 云南电网有限责任公司信息中心 | Corpus construction method based on Web collection and text feature balanced distribution |
CN109558489A (en) * | 2018-12-03 | 2019-04-02 | 南京中孚信息技术有限公司 | Text classification method and device |
CN112784062A (en) * | 2019-03-15 | 2021-05-11 | 北京金山数字娱乐科技有限公司 | Idiom knowledge graph construction method and device |
CN112784062B (en) * | 2019-03-15 | 2024-06-04 | 北京金山数字娱乐科技有限公司 | Idiom knowledge graph construction method and device |
CN109933670A (en) * | 2019-03-19 | 2019-06-25 | 中南大学 | Text classification method for calculating semantic distance based on combined matrix |
CN109933670B (en) * | 2019-03-19 | 2021-06-04 | 中南大学 | Text classification method for calculating semantic distance based on combined matrix |
CN110297913A (en) * | 2019-06-12 | 2019-10-01 | 中电科大数据研究院有限公司 | Entity extraction method for electronic government documents |
CN110427627B (en) * | 2019-08-02 | 2023-04-28 | 北京百度网讯科技有限公司 | Task processing method and device based on semantic representation model |
CN110427627A (en) * | 2019-08-02 | 2019-11-08 | 北京百度网讯科技有限公司 | Task processing method and device based on semantic representation model |
CN111881668A (en) * | 2020-08-06 | 2020-11-03 | 成都信息工程大学 | Improved TF-IDF calculation model based on chi-square statistics and TF-CRF |
CN111881668B (en) * | 2020-08-06 | 2023-06-30 | 成都信息工程大学 | TF-IDF computing device based on chi-square statistics and TF-CRF improvement |
CN112580691A (en) * | 2020-11-25 | 2021-03-30 | 北京北大千方科技有限公司 | Term matching method, matching system and storage medium for metadata fields |
CN112580691B (en) * | 2020-11-25 | 2024-05-14 | 北京北大千方科技有限公司 | Term matching method, matching system and storage medium for metadata field |
CN117951256A (en) * | 2024-03-25 | 2024-04-30 | 北京长河数智科技有限责任公司 | Document duplicate checking method based on hierarchical feature vector search |
CN117951256B (en) * | 2024-03-25 | 2024-05-31 | 北京长河数智科技有限责任公司 | Document duplicate checking method based on hierarchical feature vector search |
Also Published As
Publication number | Publication date |
---|---|
CN108255813B (en) | 2021-11-16 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN108255813A (en) | Text matching method based on term frequency-inverse document frequency and CRF | |
Downey et al. | Locating complex named entities in web text. | |
US9672205B2 (en) | Methods and systems related to information extraction | |
CN107315738B | Method for assessing the innovation degree of text information | |
CN108959258B (en) | Specific field integrated entity linking method based on representation learning | |
CN109190117A | Short text semantic similarity calculation method based on word vectors | |
CN110688836A (en) | Automatic domain dictionary construction method based on supervised learning | |
CN110879831A (en) | Chinese medicine sentence word segmentation method based on entity recognition technology | |
Sabuna et al. | Summarizing Indonesian text automatically by using sentence scoring and decision tree | |
CN110414009A (en) | The remote bilingual parallel sentence pairs abstracting method of English based on BiLSTM-CNN and device | |
CN110851593B (en) | Complex value word vector construction method based on position and semantics | |
CN113095087B (en) | Chinese word sense disambiguation method based on graph convolution neural network | |
Babhulgaonkar et al. | Language identification for multilingual machine translation | |
Gan et al. | Character-level deep conflation for business data analytics | |
CN115017903A (en) | Method and system for extracting key phrases by combining document hierarchical structure with global local information | |
CN114707615B (en) | Ancient character similarity quantification method based on duration Chinese character knowledge graph | |
CN112613318B (en) | Entity name normalization system, method thereof and computer readable medium | |
CN114925198A (en) | Knowledge-driven text classification method fusing character information | |
Priya et al. | Intelligent Aspect based Model for Efficient Sentiment Analysis of User Reviews | |
Thilagavathi et al. | Tamil english language sentiment analysis system | |
Paul et al. | Multi-facet universal schema | |
CN111881678A (en) | Domain word discovery method based on unsupervised learning | |
Saidi et al. | New approch of opinion analysis from big social data environment using a supervised machine learning algirithm | |
Shekhar | Text Mining and Sentiment Analysis | |
CN112463928B (en) | Technical list generation method and system for field evaluation prediction |
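The patent title and many of the similar documents above rely on TF-IDF (term frequency-inverse document frequency) weighting. As a minimal generic sketch of that weighting scheme only, not of the patented CRF-based matching method, the standard formulation (raw term count times log of inverse document frequency) can be written as:

```python
import math
from collections import Counter

def tf_idf(docs):
    """Compute TF-IDF weights for a list of tokenized documents.

    TF is the raw count of a term in a document; IDF uses the
    classic log(N / df) form, where N is the number of documents
    and df is how many documents contain the term.
    """
    n = len(docs)
    # Document frequency: count each term once per document.
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)
        weights.append({t: c * math.log(n / df[t]) for t, c in tf.items()})
    return weights

# Toy corpus of three pre-tokenized "documents".
docs = [["text", "matching", "crf"],
        ["text", "classification"],
        ["entity", "matching"]]
w = tf_idf(docs)
# "crf" occurs in 1 of 3 docs, so its weight in doc 0 is log(3);
# "text" occurs in 2 of 3 docs, so its weight is log(3/2).
```

Terms appearing in every document get weight log(1) = 0, which is why common stop words are suppressed by this scheme; variants (smoothed IDF, sublinear TF) differ only in these two factors.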
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||