CN108255813A - Text matching method based on term frequency-inverse document frequency (TF-IDF) and CRF

Info

Publication number
CN108255813A
CN108255813A CN201810062016.3A CN201810062016A CN108255813A CN 108255813 A CN108255813 A CN 108255813A CN 201810062016 A CN201810062016 A CN 201810062016A CN 108255813 A CN108255813 A CN 108255813A
Authority
CN
China
Prior art keywords
text
word
crf
idf
language material
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201810062016.3A
Other languages
Chinese (zh)
Other versions
CN108255813B (en)
Inventor
唐贤伦
李佳歆
万辉
马艺玮
蔡军
黄淼
刘想德
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Chongqing University of Post and Telecommunications
Original Assignee
Chongqing University of Post and Telecommunications
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Chongqing University of Post and Telecommunications
Priority to CN201810062016.3A
Publication of CN108255813A
Application granted
Publication of CN108255813B
Active (current legal status)
Anticipated expiration

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00: Handling natural language data
    • G06F40/20: Natural language analysis
    • G06F40/279: Recognition of textual entities
    • G06F40/289: Phrasal analysis, e.g. finite state techniques or chunking
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30: Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33: Querying
    • G06F16/3331: Query processing
    • G06F16/334: Query execution
    • G06F16/3344: Query execution using natural language analysis

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Machine Translation (AREA)

Abstract

The present invention claims a semantic matching method based on term frequency-inverse document frequency (TF-IDF) and CRF. The attribute features mined by CRF and the statistical features of TF-IDF are used to assign weights to the word vectors of a text, and the weighted vectors represent the text. The method addresses the problem that TF-IDF and CRF derive weights purely from statistics or from demand information without considering the semantics between words, and it also addresses the problem that the fixed word feature representation in Word2vec is ambiguous. Handling text matching with the combined method can significantly improve matching accuracy.

Description

Text matching method based on term frequency-inverse document frequency and CRF
Technical field
The invention belongs to the technical field of text processing, and more particularly relates to a text semantic matching method that combines term frequency-inverse document frequency (TF-IDF) and CRF.
Background
Text matching is a natural language processing (NLP) task that is often applied to information retrieval, community question answering, recommender systems, and similar problems. Converting unstructured data such as text into structured data requires a text representation model, and strengthening keyword semantics through such a model can deepen a system's understanding of the text. The vector space model (VSM) is currently one of the most mature and widely used text representation models. Increasing the weights of feature items in a text strengthens its semantics, and whether the feature items are selected correctly is critical to expressing the theme or specific meaning of a text. The term frequency-inverse document frequency algorithm (TF-IDF) is one of the most commonly used weighting strategies in current information retrieval systems. TF-IDF can therefore be combined with a word vector model so that the features capture the semantic links between the words of a text. The resulting features suit both the overall relevance of a text and the local relevance of particular words, which gives them better generalization ability.
A conditional random field (CRF) is a probabilistic graphical model. It can express long-distance dependencies and overlapping features, and it handles problems such as label bias well. It takes into account the transition probabilities between contextual labels, optimizes all features globally over the sequence, and can reach a globally optimal solution. It has strong inference ability and can be trained and run with complex, overlapping, and non-independent features. CRF can therefore be used for mining user attributes and product description text, and for matching user demand text, in order to obtain richer information.
Combining the feature vectors of TF-IDF and CRF has two benefits. In terms of computation, the statistics-based TF-IDF algorithm is simple and fast. In addition, CRF can analyze the user's demand and strengthen the corresponding weights. The semantic information obtained is therefore both more comprehensive and more targeted, the feature vector representation of the text is more accurate, and text matching accuracy can be significantly improved.
Summary of the invention
The present invention aims to solve the above problems of the prior art. It proposes a text matching method based on TF-IDF and CRF whose feature vector representation is more accurate and which can significantly improve text matching accuracy. The technical solution of the invention is as follows:
A text matching method based on term frequency-inverse document frequency and CRF, comprising: Step 1: collecting a text matching corpus from the network, consisting of sentence pairs of a product description and a search term, where a pair is labeled 1 if the two are a relevant match and 0 otherwise, and randomly dividing the corpus into a training set and a test set; Step 2: segmenting the corpus prepared in Step 1 with a Chinese word segmentation algorithm, collecting a stop-word list, and removing the stop words from the corpus according to that list; and further comprising the following steps:
Step 3: Using a conditional random field (CRF), label the training set of the corpus obtained in Step 2 with identity words (ide), behavior words (act), and irrelevant words (non); append the part of speech to each labeled sample as an external feature; build an attribute feature template based on bigram features; perform CRF modeling with the CRF++ tool and learn from the labeled text to train an attribute model; and obtain the attribute of every word in the text. Depending on whether relevance matching or similarity matching is required, strengthen the weights of the behavior words or the identity words;
Step 4: Train TF-IDF on the corpus obtained in Step 2 and take the TF-IDF value of each word as that word's TF-IDF weight;
Step 5: Train Word2vec on the corpus prepared in Step 2 to obtain a word vector model;
Step 6: Fuse the two weights obtained in Steps 3 and 4 to obtain the weight of each word, then multiply each word's weight by the corresponding word vector obtained in Step 5 to obtain a new text feature vector;
Step 7: Input the text semantic feature vectors of the training set obtained in Step 6 into Softmax to train the text matching model;
Step 8: Input the text feature vectors of the test set obtained in Step 6 into Softmax, perform text matching with the model trained in Step 7, and calculate the accuracy of the matching results.
Further, the corpus is segmented using a Chinese word segmentation algorithm based on N-shortest paths.
Further, segmenting the corpus with the Chinese word segmentation algorithm based on N-shortest paths specifically comprises: first, representing the connection relations between candidate words with an adjacency list; next, determining the initial segmentation paths by evaluating the connection relations between the words; finally, once all paths have been computed, taking the optimal path as the segmentation result.
Further, Step 3 is specifically: the training set of the corpus obtained in Step 2 is labeled by CRF, with user behavior words, identity words, and meaningless words labeled act, ide, and non respectively, in order to extract the user's preference information and demand information; the part of speech is appended to each labeled sample as an external feature; an attribute feature template based on bigram features is built so that feature extraction considers the combinations of the current word with the word before and the word after it; CRF modeling is performed with the CRF++ tool, and the labeled text is learned to train the attribute model.
Further, the corpus obtained in Step 2 is trained with TF-IDF (term frequency-inverse document frequency) and the TF-IDF value of each word is obtained. The TF-IDF algorithm takes words as the feature items of a text, and the weight of each feature item consists of two parts, a TF weight and an IDF weight. The calculation formulas are as follows:
w_{ji} = TF_{ji} \cdot IDF_i    (2)
TF_{ji} = f_{ji} / T    (3)
IDF_i = \log(N / n_i + 0.01)    (4)
W_j = \{w_{j1}, w_{j2}, \ldots, w_{ji}\}    (5)
TF is the frequency with which a feature item occurs in a text and represents the importance of the word in the current text, where T is the total number of words in the j-th text and f_{ji} is the number of times the i-th word of the j-th text occurs in that text. IDF is the inverse document frequency of a feature item and judges the importance of the word from a global view, where N is the total number of texts and n_i is the number of texts in which word i occurs. W_j is the set of weights of the feature vector of the j-th text, and w_{ji} is the weight of the i-th word in the j-th text.
Further, Step 5 uses the bag-of-words model with the hierarchical Softmax algorithm in Word2vec: the corpus prepared in Step 2 is input into the model to obtain the representation of each word in vector space.
Further, Step 6 fuses the two weights obtained in Steps 3 and 4 to obtain the weight of each word and multiplies each word's weight by the corresponding word vector obtained in Step 5 to obtain a new text feature vector. Specifically, the TF-IDF value of each word obtained in Step 4 is multiplied by the corresponding word vector, and the weights of the user attribute words obtained in Step 3 are strengthened by a multiple, giving the semantically enhanced feature vector.
Further, Step 8 inputs the text feature vectors of the test set into Softmax, performs text matching with the trained model, and calculates the accuracy of the matching results. Specifically, the text feature vectors of the test set obtained in Step 6 are input into Softmax and matched with the model trained in Step 7. If the actual output label is 1, the contents of the sentence pair are judged to match; if the actual output label is not 1 (that is, the label is 0), the contents of the sentence pair are judged not to match. The number of differences between the actual output labels and the expected output labels is counted, and the sentence-pair matching accuracy is calculated from it.
Further, the CRF is a discriminative undirected graph model, and a linear-chain conditional random field is used. Let x = (x_1, x_2, \ldots, x_n) denote the observed input data sequence and y = (y_1, y_2, \ldots, y_n) denote a state (label) sequence. Given an input sequence, the linear-chain CRF model defines the joint conditional probability of the state sequence as:
p(y \mid x) \propto \exp\left( \sum_{i} \sum_{j} \lambda_j t_j(y_{i-1}, y_i, x, i) + \sum_{i} \sum_{k} \mu_k s_k(y_i, x, i) \right)    (1)
where t_j(y_{i-1}, y_i, x, i) is the transition feature function of the observation sequence between positions i-1 and i, and s_k(y_i, x, i) is the state feature function of the observation sequence at position i. The parameters \lambda_j and \mu_k are estimated from the training data: the larger a non-negative value, the more the corresponding feature event is preferred; the larger the magnitude of a negative value, the less likely the corresponding feature event is to occur.
The advantages and beneficial effects of the present invention are as follows:
The present invention proposes a semantic matching method for text. The method uses TF-IDF and CRF to mine features of user demand information and fuses them with the deep semantic features of Word2vec to jointly represent the features of a text. The invention first uses CRF to obtain the attribute features of the words in a text so that matching can be tailored to the demand, then computes the TF-IDF values of the text as shallow semantic features, and uses the two kinds of features together as weights. This addresses the problem that TF-IDF and CRF derive weights purely from statistics or from demand information without considering the semantics between words. Finally, the weights obtained from TF-IDF and CRF are fused with the word vectors obtained from Word2vec, which yields deep semantic features while addressing the ambiguity of the fixed word feature representation in Word2vec. Handling text matching with the combined method can significantly improve matching accuracy.
Description of the drawings
Fig. 1 is a flow chart of the text matching method based on TF-IDF and CRF according to a preferred embodiment of the present invention.
Specific embodiment
The technical solution in the embodiments of the present invention is described clearly and in detail below with reference to the drawings of the embodiments. The described embodiments are only some of the embodiments of the present invention.
The technical solution by which the present invention solves the above technical problem is as follows:
As shown in Fig. 1, the specific steps of the semantic matching method based on TF-IDF and CRF are:
Step 1: Collect a text matching corpus from the network, consisting of sentence pairs of a product description and a search term. A pair is labeled 1 if the two are a relevant match and 0 otherwise. The corpus is divided into a training set and a test set: the training set is used to train the matching model, and the test set is used to test how well the model classifies.
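The random split in Step 1 can be sketched as follows. This is a minimal illustration, not the patent's implementation: the function name `split_corpus`, the 20% test ratio, the fixed seed, and the example sentence pairs are all assumptions introduced here.

```python
import random

def split_corpus(pairs, test_ratio=0.2, seed=42):
    """Randomly split labeled sentence pairs into (training set, test set).

    Each item is (product_description, search_term, label), where label is
    1 for a relevant match and 0 otherwise. The ratio and seed are assumed.
    """
    shuffled = list(pairs)
    random.Random(seed).shuffle(shuffled)  # seeded for a reproducible split
    n_test = int(len(shuffled) * test_ratio)
    return shuffled[n_test:], shuffled[:n_test]

# Hypothetical corpus of (product description, search term, label) triples.
corpus = [
    ("无线蓝牙降噪耳机", "蓝牙耳机", 1),
    ("儿童保温水杯", "蓝牙耳机", 0),
    ("学生双肩书包", "书包", 1),
    ("男士运动跑鞋", "保温杯", 0),
    ("便携蓝牙音箱", "音箱", 1),
]
train_set, test_set = split_corpus(corpus)
```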
Step 2: The text to be matched is preprocessed before segmentation by removing non-Chinese information such as symbols, punctuation marks, and English letters. Chinese word segmentation is then applied to the corpus from Step 1; the segmentation method used here is the Chinese word segmentation algorithm based on N-shortest paths. First, the connection relations between candidate words are represented by an adjacency list (a binary segmentation graph): each node represents an edge of the segmentation graph, with the row value giving the start of the edge and the column value giving its end. The initial segmentation paths are then determined by evaluating the connection relations between the words. Finally, once all paths have been computed, the optimal path (that is, the shortest path) is taken as the segmentation result.
After segmentation, each text becomes a sequence of words separated by spaces. A stop-word list is then collected, words that are useful to the experiment are manually deleted from the stop-word list, and the stop words in the segmented corpus are removed according to the list. Removing stop words saves storage space and improves efficiency.
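The idea behind Step 2 can be illustrated with a simplified sketch. It is not the full N-shortest-path algorithm: it keeps only the single shortest path (N = 1), gives every edge a cost of 1, and uses a tiny hypothetical dictionary and stop-word list. The function names and `max_word_len` are assumptions introduced here.

```python
def shortest_path_segment(text, dictionary, max_word_len=4):
    """Segment text by finding the path with the fewest words (N = 1 case).

    Positions between characters are graph nodes; every dictionary word
    (and every single character, as a fallback) is an edge of cost 1.
    """
    n = len(text)
    # best[i] = (fewest words needed to reach position i, previous position)
    best = [(0, -1)] + [(float("inf"), -1)] * n
    for end in range(1, n + 1):
        for start in range(max(0, end - max_word_len), end):
            word = text[start:end]
            if len(word) == 1 or word in dictionary:
                cost = best[start][0] + 1
                if cost < best[end][0]:
                    best[end] = (cost, start)
    # Walk back from the end of the text to recover the words on the path.
    words, pos = [], n
    while pos > 0:
        start = best[pos][1]
        words.append(text[start:pos])
        pos = start
    return words[::-1]

def remove_stop_words(words, stop_words):
    """Drop every word that appears in the stop-word list."""
    return [w for w in words if w not in stop_words]

# Hypothetical dictionary and stop-word list for illustration only.
dictionary = {"我", "喜欢", "蓝牙", "耳机", "蓝牙耳机"}
segmented = shortest_path_segment("我喜欢蓝牙耳机", dictionary)
filtered = remove_stop_words(segmented, {"我"})
print(segmented)  # ['我', '喜欢', '蓝牙耳机']
print(filtered)   # ['喜欢', '蓝牙耳机']
```

A full N-shortest-path segmenter would keep the N best paths at each node and usually rescore them with word statistics; the single-path version above only shows how the graph search produces a segmentation.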
Step 3: User demand analysis is performed on the text using CRF in order to extract user attributes. CRF is a discriminative undirected graph model, and the most commonly used form is the linear-chain conditional random field. Let x = (x_1, x_2, \ldots, x_n) denote the observed input data sequence and y = (y_1, y_2, \ldots, y_n) denote a state (label) sequence. Given an input sequence, the linear-chain CRF model defines the joint conditional probability of the state sequence as:
p(y \mid x) \propto \exp\left( \sum_{i} \sum_{j} \lambda_j t_j(y_{i-1}, y_i, x, i) + \sum_{i} \sum_{k} \mu_k s_k(y_i, x, i) \right)    (1)
where t_j(y_{i-1}, y_i, x, i) is the transition feature function of the observation sequence between positions i-1 and i, and s_k(y_i, x, i) is the state feature function of the observation sequence at position i. The parameters \lambda_j and \mu_k are estimated from the training data: the larger a non-negative value, the more the corresponding feature event is preferred; the larger the magnitude of a negative value, the less likely the corresponding feature event is to occur. The training set of the corpus obtained in Step 2 is labeled by CRF, with user behavior words, identity words, and meaningless words labeled act, ide, and non respectively, in order to extract the user's preference information and demand information. The part of speech is appended to each labeled sample as an external feature so that the extracted user attribute information is more accurate. An attribute feature template based on bigram features is built so that feature extraction considers the combinations of the current word with the word before and the word after it. CRF modeling is performed with the CRF++ tool, and the labeled text is learned to train the attribute model.
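To make the labeling format in Step 3 concrete, the sketch below writes labeled sentences in the column layout that CRF++ reads (one token per line, whitespace-separated columns, label in the last column, blank line between sentences) together with a feature template. The template contents, the part-of-speech tags, the example sentences, and the helper name `to_crfpp` are assumptions for illustration; the patent does not list its actual template.

```python
# Hypothetical CRF++ feature template. Column 0 is the word and column 1 is
# the part of speech. The U-lines look at the current word, its neighbours,
# and their combinations; the final "B" line adds label-bigram features.
CRFPP_TEMPLATE = """\
# Unigram
U00:%x[-1,0]
U01:%x[0,0]
U02:%x[1,0]
U03:%x[-1,0]/%x[0,0]
U04:%x[0,0]/%x[1,0]
U05:%x[0,1]

# Bigram
B
"""

def to_crfpp(sentences):
    """Format sentences of (word, pos, label) tuples as CRF++ training text."""
    blocks = []
    for sentence in sentences:
        rows = ["\t".join(token) for token in sentence]
        blocks.append("\n".join(rows))
    # A blank line separates consecutive sentences.
    return "\n\n".join(blocks) + "\n"

# Labels follow the text: act = behavior word, ide = identity word,
# non = meaningless word. The POS tags here are illustrative only.
labeled = [
    [("学生", "n", "ide"), ("购买", "v", "act"), ("耳机", "n", "non")],
    [("妈妈", "n", "ide"), ("想要", "v", "act"), ("水杯", "n", "non")],
]
training_text = to_crfpp(labeled)
print(training_text)
```

With the two strings saved to files, training would typically be run from the command line as `crf_learn template_file train_file model_file`, and tagging as `crf_test -m model_file test_file`.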
Step 4: Obtain the weight of each word using TF-IDF. The TF-IDF algorithm takes words as the feature items of a text, and the weight of each feature item consists of two parts, a TF weight and an IDF weight. The calculation formulas are as follows:
w_{ji} = TF_{ji} \cdot IDF_i    (2)
TF_{ji} = f_{ji} / T    (3)
IDF_i = \log(N / n_i + 0.01)    (4)
W_j = \{w_{j1}, w_{j2}, \ldots, w_{ji}\}    (5)
TF (Term Frequency) is the frequency with which a feature item occurs in a text and represents the importance of the word in the current text, where T is the total number of words in the j-th text and f_{ji} is the number of times the i-th word of the j-th text occurs in that text. IDF (Inverse Document Frequency) is the inverse document frequency of a feature item and judges the importance of the word from a global view, where N is the total number of texts and n_i is the number of texts in which word i occurs. W_j is the set of weights of the feature vector of the j-th text, and w_{ji} is the weight of the i-th word in the j-th text. Computing TF-IDF for each word in the corpus extracts the highly discriminative words of a text and gives each a weight related to its importance.
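Formulas (2) to (4) translate directly into a few lines of code. The sketch below is one possible reading, assuming the natural logarithm (the text does not state the base) and a corpus already segmented into word lists; the function name `tf_idf_weights` and the example texts are introduced here for illustration.

```python
import math
from collections import Counter

def tf_idf_weights(docs):
    """Return one {word: w_ji} dict per text, following formulas (2)-(4).

    docs is a list of texts, each a list of words. TF_ji = f_ji / T and
    IDF_i = log(N / n_i + 0.01), with the natural logarithm assumed.
    """
    n_docs = len(docs)                      # N: total number of texts
    doc_freq = Counter()                    # n_i: texts containing word i
    for doc in docs:
        doc_freq.update(set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)               # f_ji for every word of text j
        total = len(doc)                    # T: number of words in text j
        weights.append({
            word: (count / total) * math.log(n_docs / doc_freq[word] + 0.01)
            for word, count in counts.items()
        })
    return weights

# Hypothetical segmented texts.
docs = [["蓝牙", "耳机", "耳机"], ["蓝牙", "音箱"]]
weights = tf_idf_weights(docs)
```

In this toy corpus "蓝牙" occurs in both texts, so its IDF is log(1.01) and its weight is close to zero, while "耳机" occurs in only one text and receives a much larger weight.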
Step 5: Using the DBOW model with the Hierarchical Softmax algorithm in word2vec, a language model is built that maps each word in the text, as a feature, to a k-dimensional real vector. The word vectors are trained by stochastic gradient descent: during training the error gradient is obtained through backpropagation, and the parameters of the model are then updated, finally yielding the representation of each word in vector space. In the original feature vector of formula (6), V_j denotes the set of feature vectors of the j-th text and v_{ji} denotes the i-th feature vector of the j-th text.
V_j = \{v_{j1}, v_{j2}, \ldots, v_{ji}\}    (6)
Step 6: Finally, the feature vector v_{ji} of each word is multiplied by its corresponding weight w_{ji} obtained in Step 4, giving the improved V_j' shown in formula (7):
V_j' = \{w_{j1} \cdot v_{j1}, w_{j2} \cdot v_{j2}, \ldots, w_{ji} \cdot v_{ji}\}    (7)
When similarity matching is required, the weights of the user identity words extracted in Step 3 are strengthened; when relevance matching is required, the weights of the user behavior words extracted in Step 3 are strengthened. Because the improved feature vector incorporates the weights obtained from TF-IDF and CRF, it is more targeted toward features of different categories. The word vector model adds the semantic links between the words of a text to the features, so the features suit both the overall relevance of a text and the local relevance of particular words, which gives them better generalization ability.
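The weighting in Step 6 can be sketched as below. The text only says that attribute words are strengthened "by a multiple", so the boost factor of 2.0 is a placeholder, and the function name `weight_vectors`, the mode names, and the toy weights and vectors are likewise assumptions for illustration.

```python
def weight_vectors(words, labels, tfidf, vectors, mode="similarity", boost=2.0):
    """Apply formula (7) with a CRF-based boost.

    Each word vector is scaled by the word's TF-IDF weight. Identity words
    (ide) get an extra boost for similarity matching, behavior words (act)
    for relevance matching. The boost value is a hypothetical choice.
    """
    if mode not in ("similarity", "relevance"):
        raise ValueError("mode must be 'similarity' or 'relevance'")
    target = "ide" if mode == "similarity" else "act"
    weighted = []
    for word, label in zip(words, labels):
        weight = tfidf[word]
        if label == target:
            weight *= boost
        weighted.append([weight * component for component in vectors[word]])
    return weighted

# Toy inputs: CRF labels from Step 3, weights from Step 4, vectors from Step 5.
words = ["学生", "购买", "耳机"]
labels = ["ide", "act", "non"]
tfidf = {"学生": 0.5, "购买": 0.25, "耳机": 0.125}
vectors = {"学生": [1.0, 2.0], "购买": [4.0, 0.0], "耳机": [8.0, 8.0]}

similarity_features = weight_vectors(words, labels, tfidf, vectors, "similarity")
relevance_features = weight_vectors(words, labels, tfidf, vectors, "relevance")
print(similarity_features)  # [[1.0, 2.0], [1.0, 0.0], [1.0, 1.0]]
print(relevance_features)   # [[0.5, 1.0], [2.0, 0.0], [1.0, 1.0]]
```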
Step 7: Input the text semantic feature vectors of the training set obtained in Step 6 into Softmax to train the text matching model.
Step 8: Input the text feature vectors of the test set obtained in Step 6 into Softmax and perform text matching with the model trained in Step 7. If the actual output label is 1, the contents of the sentence pair are judged to match; if the actual output label is not 1 (that is, the label is 0), the contents of the sentence pair are judged not to match. The number of differences between the actual output labels and the expected output labels is counted, and the sentence-pair matching accuracy is calculated from it.
The above embodiments should be understood as merely illustrating the present invention, not as limiting its scope of protection. After reading the content recorded in this document, a person skilled in the art can make various changes or modifications to the invention, and these equivalent changes and modifications likewise fall within the scope of the claims of the present invention.

Claims (9)

1. A text matching method based on term frequency-inverse document frequency (TF-IDF) and CRF, comprising: Step 1: collecting a text matching corpus from the network, consisting of sentence pairs of a product description and a search term, where a pair is labeled 1 if the two are a relevant match and 0 otherwise, and randomly dividing the corpus into a training set and a test set; Step 2: segmenting the corpus prepared in Step 1 with a Chinese word segmentation algorithm, collecting a stop-word list, and removing the stop words from the corpus according to that list; characterized in that the method further comprises the following steps:
Step 3: using a conditional random field (CRF), labeling the training set of the corpus obtained in Step 2 with identity words, behavior words, and irrelevant words; appending the part of speech to each labeled sample as an external feature; building an attribute feature template based on bigram features; performing CRF modeling with the CRF++ tool and learning from the labeled text to train an attribute model; obtaining the attribute of every word in the text; and, depending on whether relevance matching or similarity matching is required, strengthening the weights of the behavior words or the identity words;
Step 4: training TF-IDF on the corpus obtained in Step 2 and taking the TF-IDF value of each word as that word's TF-IDF weight;
Step 5: training Word2vec on the corpus prepared in Step 2 to obtain a word vector model;
Step 6: fusing the two weights obtained in Steps 3 and 4 to obtain the weight of each word, then multiplying each word's weight by the corresponding word vector obtained in Step 5 to obtain a new text feature vector;
Step 7: inputting the text semantic feature vectors of the training set obtained in Step 6 into Softmax to train the text matching model;
Step 8: inputting the text feature vectors of the test set obtained in Step 6 into Softmax, performing text matching with the model trained in Step 7, and calculating the accuracy of the matching results.
2. The text matching method based on TF-IDF and CRF according to claim 1, characterized in that the corpus is segmented using a Chinese word segmentation algorithm based on N-shortest paths.
3. The text matching method based on TF-IDF and CRF according to claim 2, characterized in that segmenting the corpus with the Chinese word segmentation algorithm based on N-shortest paths specifically comprises: first, representing the connection relations between candidate words with an adjacency list; next, determining the initial segmentation paths by evaluating the connection relations between the words; finally, once all paths have been computed, taking the optimal path as the segmentation result.
4. The text matching method based on TF-IDF and CRF according to any one of claims 1-3, characterized in that Step 3 is specifically: the training set of the corpus obtained in Step 2 is labeled by CRF, with user behavior words, identity words, and meaningless words labeled act, ide, and non respectively, in order to extract the user's preference information and demand information; the part of speech is appended to each labeled sample as an external feature; an attribute feature template based on bigram features is built so that feature extraction considers the combinations of the current word with the word before and the word after it; CRF modeling is performed with the CRF++ tool, and the labeled text is learned to train the attribute model.
5. The text matching method based on TF-IDF and CRF according to any one of claims 1-3, characterized in that the corpus obtained in Step 2 is trained with TF-IDF and the TF-IDF value of each word is obtained; the TF-IDF algorithm takes words as the feature items of a text, and the weight of each feature item consists of two parts, a TF weight and an IDF weight. The calculation formulas are as follows:
w_{ji} = TF_{ji} \cdot IDF_i    (2)
TF_{ji} = f_{ji} / T    (3)
IDF_i = \log(N / n_i + 0.01)    (4)
W_j = \{w_{j1}, w_{j2}, \ldots, w_{ji}\}    (5)
TF is the frequency with which a feature item occurs in a text and represents the importance of the word in the current text, where T is the total number of words in the j-th text and f_{ji} is the number of times the i-th word of the j-th text occurs in that text. IDF is the inverse document frequency of a feature item and judges the importance of the word from a global view, where N is the total number of texts and n_i is the number of texts in which word i occurs. W_j is the set of weights of the feature vector of the j-th text, and w_{ji} is the weight of the i-th word in the j-th text.
6. The text matching method based on TF-IDF and CRF according to any one of claims 1-3, characterized in that Step 5 uses the bag-of-words model DBOW with the hierarchical Softmax algorithm in Word2vec: the corpus prepared in Step 2 is input into the model to obtain the representation of each word in vector space.
7. The text matching method based on TF-IDF and CRF according to claim 6, characterized in that Step 6 fuses the two weights obtained in Steps 3 and 4 to obtain the weight of each word and multiplies each word's weight by the corresponding word vector obtained in Step 5 to obtain a new text feature vector, specifically: the TF-IDF value of each word obtained in Step 4 is multiplied by the corresponding word vector, and the weights of the user attribute words obtained in Step 3 are strengthened by a multiple, giving the semantically enhanced feature vector.
8. The text matching method based on TF-IDF and CRF according to claim 6, characterized in that Step 8 inputs the text feature vectors of the test set into Softmax, performs text matching with the trained model, and calculates the accuracy of the matching results, specifically: the text feature vectors of the test set obtained in Step 6 are input into Softmax and matched with the model trained in Step 7; if the actual output label is 1, the contents of the sentence pair are judged to match; if the actual output label is not 1, that is, the label is 0, the contents of the sentence pair are judged not to match; the number of differences between the actual output labels and the expected output labels is counted, and the sentence-pair matching accuracy is calculated from it.
9. The text matching method based on TF-IDF and CRF according to claim 6, characterized in that the CRF is a discriminative undirected graph model and a linear-chain conditional random field is used; x = (x_1, x_2, \ldots, x_n) denotes the observed input data sequence and y = (y_1, y_2, \ldots, y_n) denotes a state (label) sequence; given an input sequence, the linear-chain CRF model defines the joint conditional probability of the state sequence as:
p(y \mid x) \propto \exp\left( \sum_{i} \sum_{j} \lambda_j t_j(y_{i-1}, y_i, x, i) + \sum_{i} \sum_{k} \mu_k s_k(y_i, x, i) \right)    (1)
where t_j(y_{i-1}, y_i, x, i) is the transition feature function of the observation sequence between positions i-1 and i, and s_k(y_i, x, i) is the state feature function of the observation sequence at position i. The parameters \lambda_j and \mu_k are estimated from the training data: the larger a non-negative value, the more the corresponding feature event is preferred; the larger the magnitude of a negative value, the less likely the corresponding feature event is to occur.
CN201810062016.3A 2018-01-23 2018-01-23 Text matching method based on word frequency-inverse document and CRF Active CN108255813B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810062016.3A CN108255813B (en) 2018-01-23 2018-01-23 Text matching method based on word frequency-inverse document and CRF

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810062016.3A CN108255813B (en) 2018-01-23 2018-01-23 Text matching method based on word frequency-inverse document and CRF

Publications (2)

Publication Number Publication Date
CN108255813A true CN108255813A (en) 2018-07-06
CN108255813B CN108255813B (en) 2021-11-16

Family

ID=62742366

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810062016.3A Active CN108255813B (en) 2018-01-23 2018-01-23 Text matching method based on word frequency-inverse document and CRF

Country Status (1)

Country Link
CN (1) CN108255813B (en)

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104978356A (en) * 2014-04-10 2015-10-14 阿里巴巴集团控股有限公司 Synonym identification method and device
KR20170000185A (en) * 2015-06-23 2017-01-02 아시아나아이디티 주식회사 Text categorizing system of sparse vector-space document, device and method thereof
CN105677779A (en) * 2015-12-30 2016-06-15 山东大学 Feedback-type question type classifier system based on scoring mechanism and working method thereof
CN105740236A (en) * 2016-01-29 2016-07-06 中国科学院自动化研究所 Writing feature and sequence feature combined Chinese sentiment new word recognition method and system
CN107193959A (en) * 2017-05-24 2017-09-22 南京大学 A kind of business entity's sorting technique towards plain text

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
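Shilin Zhang and Mei Gu: "Improved Text Classification to acquire job opportunities for Chinese disabled persons", 2010 2nd International Conference on Advanced Computer Control *
唐贤伦 (Tang Xianlun): "Text semantic matching and recommendation based on conditional random fields and TF-IDF", Abstracts of the 28th Chinese Process Control Conference (CPCC 2017) and the 30th Anniversary of the Chinese Process Control Conference *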

Cited By (21)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109062899A (en) * 2018-07-31 2018-12-21 中国科学院信息工程研究所 A kind of file similarity measure method based on part-of-speech tagging
CN109062899B (en) * 2018-07-31 2021-10-15 中国科学院信息工程研究所 Document similarity measurement method based on part-of-speech tagging
CN109271626A (en) * 2018-08-31 2019-01-25 北京工业大学 Text semantic analysis method
CN109271626B (en) * 2018-08-31 2023-09-26 北京工业大学 Text semantic analysis method
CN109446321A (en) * 2018-10-11 2019-03-08 深圳前海达闼云端智能科技有限公司 Text classification method, text classification device, terminal and computer readable storage medium
CN109522549A (en) * 2018-10-30 2019-03-26 云南电网有限责任公司信息中心 Building of corpus method based on Web acquisition and text feature equiblibrium mass distribution
CN109522549B (en) * 2018-10-30 2022-06-10 云南电网有限责任公司信息中心 Corpus construction method based on Web collection and text feature balanced distribution
CN109558489A (en) * 2018-12-03 2019-04-02 南京中孚信息技术有限公司 File classification method and device
CN112784062A (en) * 2019-03-15 2021-05-11 北京金山数字娱乐科技有限公司 Idiom knowledge graph construction method and device
CN112784062B (en) * 2019-03-15 2024-06-04 北京金山数字娱乐科技有限公司 Idiom knowledge graph construction method and device
CN109933670A (en) * 2019-03-19 2019-06-25 中南大学 A kind of file classification method calculating semantic distance based on combinatorial matrix
CN109933670B (en) * 2019-03-19 2021-06-04 中南大学 Text classification method for calculating semantic distance based on combined matrix
CN110297913A (en) * 2019-06-12 2019-10-01 中电科大数据研究院有限公司 A kind of electronic government documents entity abstracting method
CN110427627B (en) * 2019-08-02 2023-04-28 北京百度网讯科技有限公司 Task processing method and device based on semantic representation model
CN110427627A (en) * 2019-08-02 2019-11-08 北京百度网讯科技有限公司 Task processing method and device based on semantic expressiveness model
CN111881668A (en) * 2020-08-06 2020-11-03 成都信息工程大学 Improved TF-IDF calculation model based on chi-square statistics and TF-CRF
CN111881668B (en) * 2020-08-06 2023-06-30 成都信息工程大学 TF-IDF computing device based on chi-square statistics and TF-CRF improvement
CN112580691A (en) * 2020-11-25 2021-03-30 北京北大千方科技有限公司 Term matching method, matching system and storage medium of metadata field
CN112580691B (en) * 2020-11-25 2024-05-14 北京北大千方科技有限公司 Term matching method, matching system and storage medium for metadata field
CN117951256A (en) * 2024-03-25 2024-04-30 北京长河数智科技有限责任公司 Document duplicate checking method based on hierarchical feature vector search
CN117951256B (en) * 2024-03-25 2024-05-31 北京长河数智科技有限责任公司 Document duplicate checking method based on hierarchical feature vector search

Also Published As

Publication number Publication date
CN108255813B (en) 2021-11-16

Similar Documents

Publication Publication Date Title
CN108255813A (en) A kind of text matching technique based on term frequency-inverse document and CRF
Downey et al. Locating complex named entities in web text.
US9672205B2 (en) Methods and systems related to information extraction
CN107315738B (en) A kind of innovation degree appraisal procedure of text information
CN108959258B (en) Specific field integrated entity linking method based on representation learning
CN109190117A (en) A kind of short text semantic similarity calculation method based on term vector
CN110688836A (en) Automatic domain dictionary construction method based on supervised learning
CN110879831A (en) Chinese medicine sentence word segmentation method based on entity recognition technology
Sabuna et al. Summarizing Indonesian text automatically by using sentence scoring and decision tree
CN110414009A (en) The remote bilingual parallel sentence pairs abstracting method of English based on BiLSTM-CNN and device
CN110851593B (en) Complex value word vector construction method based on position and semantics
CN113095087B (en) Chinese word sense disambiguation method based on graph convolution neural network
Babhulgaonkar et al. Language identification for multilingual machine translation
Gan et al. Character-level deep conflation for business data analytics
CN115017903A (en) Method and system for extracting key phrases by combining document hierarchical structure with global local information
CN114707615B (en) Ancient character similarity quantification method based on duration Chinese character knowledge graph
CN112613318B (en) Entity name normalization system, method thereof and computer readable medium
CN114925198A (en) Knowledge-driven text classification method fusing character information
Priya et al. Intelligent Aspect based Model for Efficient Sentiment Analysis of User Reviews
Thilagavathi et al. Tamil english language sentiment analysis system
Paul et al. Multi-facet universal schema
CN111881678A (en) Domain word discovery method based on unsupervised learning
Saidi et al. New approch of opinion analysis from big social data environment using a supervised machine learning algirithm
Shekhar Text Mining and Sentiment Analysis
CN112463928B (en) Technical list generation method and system for field evaluation prediction

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant