CN108255813A - A text matching method based on term frequency-inverse document frequency (TF-IDF) and CRF - Google Patents
- Publication number
- CN108255813A (application number CN201810062016.3A)
- Authority
- CN
- China
- Prior art keywords
- text
- word
- CRF
- IDF
- corpus
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/289—Phrasal analysis, e.g. finite state techniques or chunking
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/33—Querying
- G06F16/3331—Query processing
- G06F16/334—Query execution
- G06F16/3344—Query execution using natural language analysis
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- Computational Linguistics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Artificial Intelligence (AREA)
- Audiology, Speech & Language Pathology (AREA)
- General Health & Medical Sciences (AREA)
- Health & Medical Sciences (AREA)
- Data Mining & Analysis (AREA)
- Databases & Information Systems (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
- Machine Translation (AREA)
Abstract
The present invention claims a semantic matching method based on term frequency-inverse document frequency (TF-IDF) and conditional random fields (CRF). The attribute features mined by CRF and the statistical features of TF-IDF are selected to assign weights to the word vectors of a text, thereby representing the weighting of the text. This method solves the problem that TF-IDF and CRF obtain weights purely from a statistical angle or from demand information without accounting for the semantics between words, and also solves the problem that the fixed expression of word features in Word2vec is ambiguous. Handling text matching problems with a combination of the above methods significantly improves matching accuracy.
Description
Technical field
The invention belongs to the technical field of text processing, and in particular relates to a text semantic matching method combining term frequency-inverse document frequency and CRF.
Background technology
Text matching is one of the tasks of natural language processing (NLP) and is often applied to problems such as information retrieval, community question answering, and recommender systems. Converting unstructured textual data into structured data requires a text representation model; such a model deepens the system's understanding of text by enhancing the semantics of keywords. The vector space model (Vector Space Model, VSM) is currently one of the most mature and most widely used text representation models. Enhancing the weights of the feature items in a text strengthens its semantics, and the correct selection of feature items is crucial for correctly expressing the theme or particular meaning of a text. The term frequency-inverse document frequency algorithm (TF-IDF) is one of the most common weighting strategies in current information retrieval systems. TF-IDF can therefore be combined with a word vector model so as to add, for the feature words of a text, semantic connections between words, suiting both the holistic relevance of the text and the local relevance of certain specific words, so that the features have stronger generalization ability.
A conditional random field (Conditional Random Field, CRF) is a probabilistic graphical model. This model can express long-distance dependencies and overlapping features, and can better solve problems such as label bias; it considers the transition probabilities between contextual labels, optimizes the parameters of all features globally over the sequence, and can find the global optimum. It has strong inference ability and can train on and reason over complex, overlapping, and interdependent features. It can therefore be used to match user demand text, mining user attributes in descriptive labelling tasks and thereby obtaining rich information.
Combining the feature vectors of TF-IDF and CRF is therefore attractive. In terms of computation, the statistics-based TF-IDF algorithm is relatively simple and fast, while CRF can analyse user demand and apply a corresponding weight enhancement. The semantic information obtained is thus both fuller and more targeted, the feature-vector representation of the text is more accurate, and the accuracy of text matching can be significantly improved.
Invention content
The present invention aims to solve the above problems of the prior art. It proposes a text matching method based on term frequency-inverse document frequency and CRF whose feature-vector representation is more accurate and which can significantly improve the accuracy of text matching. The technical scheme of the present invention is as follows:
A text matching method based on term frequency-inverse document frequency and CRF, comprising the steps of:
Step 1: Collect a text matching corpus from the network, consisting of sentence pairs of product descriptions and search terms; the label of a pair is 1 if it is an associated match, otherwise 0. Randomly divide the corpus into a training-set corpus and a test-set corpus.
Step 2: Segment the corpus prepared in step 1 with a Chinese word segmentation method, collect a stop-word list, and remove the stop words in the corpus according to the list. The method further comprises the following steps:
Step 3: Label the training set of the corpus obtained in step 2 with a conditional random field (CRF), tagging identity words (ide), behavior words (act), and irrelevant words (non). Append the part of speech to the tail of each labelled sample as a surface feature, build an attribute feature template based on Bigram features, perform CRF modelling with the CRF++ tool, learn from the labelled text, train an attribute model, and obtain the attributes of all words in the text. Apply word-weight enhancement to identity words or behavior words according to whether similarity or relevance matching is required;
Step 4: Train on the corpus obtained in step 2 with term frequency-inverse document frequency (TF-IDF), and take the TF-IDF value of each word as that word's TF-IDF weight;
Step 5: Train on the corpus prepared in step 2 with Word2vec to obtain a word-vector model;
Step 6: Fuse the two kinds of weights obtained in steps 3 and 4 into the weight of each word, then multiply the weight of each word by the corresponding word vector obtained in step 5 to obtain a new text feature vector;
Step 7: Input the text semantic feature vectors of the training-set corpus obtained in step 6 into Softmax to train a text matching model;
Step 8: Input the text feature vectors of the test-set corpus obtained in step 6 into Softmax, perform text matching according to the model trained in step 7, and calculate the accuracy of the matching results.
Further, the corpus is segmented with a Chinese word segmentation method based on N-shortest paths.
Further, segmenting the corpus with the N-shortest-path Chinese word segmentation method specifically comprises the steps of: first representing the coupling relations between phrases with an adjacency list; then determining the first segmentation path by computing the coupling relations between phrases; finally, after all paths have been computed, finding the optimal path as the segmentation result.
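The shortest-path idea above can be sketched in Python. The toy dictionary, the unit edge cost, and keeping only the single best path (rather than the N best paths of a full N-shortest-path segmenter) are simplifying assumptions for illustration:

```python
import heapq

# A minimal sketch of shortest-path word segmentation over a word graph.
# The dictionary below is a toy assumption; single characters are always
# allowed as fallback edges so every sentence is segmentable.
DICT = {"他", "说", "的", "确实", "在", "理", "的确", "实在"}

def segment(sentence, dictionary=DICT):
    n = len(sentence)
    # Edges of the word graph: node i -> node j if sentence[i:j] is a word.
    edges = [[] for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n + 1):
            if sentence[i:j] in dictionary or j - i == 1:
                edges[i].append(j)
    # Dijkstra with unit edge cost: the path with the fewest edges
    # (fewest words) is taken as the segmentation result.
    dist = {0: (0, None)}
    heap = [(0, 0)]
    while heap:
        d, u = heapq.heappop(heap)
        if u == n:
            break
        for v in edges[u]:
            if v not in dist or d + 1 < dist[v][0]:
                dist[v] = (d + 1, u)
                heapq.heappush(heap, (d + 1, v))
    # Recover the optimal path and cut the sentence along its nodes.
    path, u = [], n
    while u is not None:
        path.append(u)
        u = dist[u][1]
    path.reverse()
    return [sentence[a:b] for a, b in zip(path, path[1:])]

print(segment("他说的确实在理"))
```

Note that on this classic ambiguous sentence the single shortest path prefers the fewest words, which is exactly why practical systems keep N candidate paths and rescore them.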
Further, step 3 is specifically: label the training set of the corpus obtained in step 2 with CRF, marking user behavior words, identity words, and meaningless words as act, ide, and non respectively, so as to extract the user's preference information and demand information; append the part of speech to the tail of each labelled sample as a surface feature; build an attribute feature template based on Bigram features, so that the current word and its combinations with the preceding and following words are considered during feature extraction; perform CRF modelling with the CRF++ tool, learn from the labelled text, and train the attribute model.
Further, training on the corpus obtained in step 2 with TF-IDF (term frequency-inverse document frequency) and obtaining the TF-IDF value of each word: the TF-IDF algorithm takes words as the feature items of a text, and the weight of each feature item is composed of two parts, a TF weight and an IDF weight. The specific calculation formulas are as follows:

w_ji = TF_ji · IDF_i   (2)
TF_ji = f_ji / T   (3)
IDF_i = log(N / n_i + 0.01)   (4)
W_j = {w_j1, w_j2, ..., w_ji}   (5)

TF is the frequency with which a feature item occurs in the text, representing the word's degree of importance in the current text, where T is the total number of words of the j-th text and f_ji is the number of occurrences in the text of the i-th word of the j-th text. IDF is the inverse document frequency of the feature item, judging the importance of the word from a global view, where N is the total number of texts and n_i is the number of texts in which word i occurs. W_j is the weight set of the feature vector of the j-th text, and w_ji is the weight of the i-th word in the j-th text.
Further, step 5 inputs the corpus prepared in step 2 into a model using the hierarchical Softmax bag-of-words algorithm in Word2vec, thereby obtaining the representation of each word in the vector space.
Further, step 6 fuses the two kinds of weights obtained in steps 3 and 4 into the weight of each word, then multiplies the weight of each word by the corresponding word vector obtained in step 5 to obtain a new text feature vector. Specifically: the TF-IDF value of each word obtained in step 4 is multiplied by the corresponding word vector, and the user-attribute words obtained in step 3 receive a weight enhancement by a multiple, giving the semantically enhanced feature vectors.
Further, step 8 inputs the text feature vectors of the obtained test-set corpus into Softmax and, according to the trained model, performs text matching and calculates the accuracy of the matching results. Specifically: input the text feature vectors of the test-set corpus obtained in step 6 into Softmax and perform text matching according to the model trained in step 7. If the label actually output for a text equals 1, the sentence pair is judged to match in content; if the label actually output is not equal to 1, i.e. the label equals 0, the sentence pair is judged not to match in content. Count the number of differences between the actually output labels and the expected labels, and calculate the sentence-pair matching accuracy.
Further, the CRF is a discriminative undirected graphical model; a linear-chain conditional random field is used. x = (x_1, x_2, ..., x_n) denotes the observed input data sequence and y = (y_1, y_2, ..., y_n) denotes a state sequence. Given an input sequence, the linear-chain CRF model defines the conditional probability of the state sequence as:

P(y|x) = (1/Z(x)) exp( Σ_{i,j} λ_j t_j(y_{i-1}, y_i, x, i) + Σ_{i,k} μ_k s_k(y_i, x, i) )   (1)

where Z(x) is the normalization factor, t_j(y_{i-1}, y_i, x, i) is a transition feature function at positions i-1 and i of the observation sequence, s_k(y_i, x, i) is a state feature function at position i of the observation sequence, and the parameters λ_j and μ_k can be estimated from the training data: the larger a non-negative value obtained, the more the corresponding feature event is preferred; the larger a negative value obtained, the less likely the corresponding feature event is to occur.
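As a sketch of formula (1), the conditional probability of a linear-chain CRF can be computed by brute-force enumeration on a toy problem. The label set, feature functions, and weights λ and μ below are illustrative assumptions, not the patent's trained model:

```python
import itertools
import math

# Toy labels matching the patent's act/ide/non tagging scheme.
LABELS = ["act", "ide", "non"]

def transition(prev_y, y, x, i):
    # t_j: a single transition feature that favours repeating a label.
    return 1.0 if prev_y == y else 0.0

def state(y, x, i):
    # s_k: a single state feature that favours "non" on a period token.
    return 1.0 if (x[i] == "." and y == "non") else 0.0

LAMBDA, MU = 0.8, 1.5  # λ and μ; estimated from training data in practice

def score(y_seq, x):
    # Unnormalized log-score: Σ μ·s_k + Σ λ·t_j over the sequence.
    s = sum(MU * state(y_seq[i], x, i) for i in range(len(x)))
    s += sum(LAMBDA * transition(y_seq[i - 1], y_seq[i], x, i)
             for i in range(1, len(x)))
    return s

def probability(y_seq, x):
    # P(y|x) = exp(score(y, x)) / Z(x), Z summing over all label sequences.
    z = sum(math.exp(score(list(cand), x))
            for cand in itertools.product(LABELS, repeat=len(x)))
    return math.exp(score(list(y_seq), x)) / z

x = ["buy", "phone", "."]
total = sum(probability(list(c), x)
            for c in itertools.product(LABELS, repeat=len(x)))
print(round(total, 6))  # the probabilities over all label sequences sum to 1
```

Brute-force enumeration is exponential in the sequence length; real CRF tools such as CRF++ compute Z(x) with forward-backward dynamic programming instead.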
The advantages and beneficial effects of the present invention are as follows:
The present invention proposes a semantic matching method for text. The method selects TF-IDF and CRF to mine the features of the user's demand information, and fuses them with the deep semantic features of Word2vec to jointly represent the features of a text. The present invention first uses CRF to obtain the attribute features of the vocabulary in a text so as to do matching targeted at the demand, then computes the TF-IDF values of the text as shallow semantic features, and takes the two kinds of features together as weights. This solves the problem that TF-IDF and CRF obtain weights purely from a statistical angle or from demand information without accounting for the semantics between words. Finally, the weights obtained by TF-IDF and CRF are fused with the word vectors obtained by Word2vec, which, while obtaining deep semantic features, solves the problem that the fixed expression of word features in Word2vec is ambiguous. Handling text matching problems with a combination of the above methods significantly improves matching accuracy.
Description of the drawings
Fig. 1 is a flow chart of the text matching method based on term frequency-inverse document frequency and CRF according to a preferred embodiment of the present invention.
Specific embodiment
The technical solutions in the embodiments of the present invention are described clearly and in detail below with reference to the accompanying drawings. The described embodiments are only some of the embodiments of the present invention.
The technical solution by which the present invention solves the above technical problems is:
As shown in Fig. 1, the concrete steps of the semantic matching method based on TF-IDF and CRF of the present invention are:
Step 1: Collect a text matching corpus from the network, consisting of sentence pairs of product descriptions and search terms; the label of a pair is 1 if it is an associated match, otherwise 0. Divide the corpus into a training set and a test set; the training set is used to train the model, and the test set is used to test the model's classification performance.
Step 2: Before segmentation, the matching text to be segmented is preprocessed, including removal of non-Chinese-character information such as special characters, punctuation marks, and English letters. Chinese word segmentation is then performed on the corpus of step 1; the segmentation method used here is the Chinese word segmentation method based on N-shortest paths. First, the coupling relations between phrases are represented by an adjacency list (a binary segmentation word graph): each node represents an edge in the word graph, the row value represents the start of the edge, and the column value represents the end of the edge. Then the first segmentation path is determined by computing the coupling relations between phrases. Finally, after all paths have been computed, the optimal path (i.e. the shortest path) is found as the segmentation result.
After segmentation, each text becomes a text corpus of words separated by spaces. A stop-word list is then collected, vocabulary in the list that is useful for the experiment is manually deleted, and the stop words in the segmented corpus are removed according to the list. Removing stop words saves storage space and improves efficiency.
Step 3: Perform user requirement analysis on the text with CRF and extract user attributes. A CRF is a discriminative undirected graphical model; the most commonly used is the linear-chain conditional random field. x = (x_1, x_2, ..., x_n) denotes the observed input data sequence and y = (y_1, y_2, ..., y_n) denotes a state sequence. Given an input sequence, the linear-chain CRF model defines the conditional probability of the state sequence as:

P(y|x) = (1/Z(x)) exp( Σ_{i,j} λ_j t_j(y_{i-1}, y_i, x, i) + Σ_{i,k} μ_k s_k(y_i, x, i) )   (1)

where Z(x) is the normalization factor, t_j(y_{i-1}, y_i, x, i) is a transition feature function at positions i-1 and i of the observation sequence, s_k(y_i, x, i) is a state feature function at position i, and the parameters λ_j and μ_k can be estimated from the training data: the larger a non-negative value obtained, the more the corresponding feature event is preferred; the larger a negative value obtained, the less likely the corresponding feature event is to occur.
The training set of the corpus obtained in step 2 is labelled with CRF: user behavior words, identity words, and meaningless words are marked as act, ide, and non respectively, to extract the user's preference information and demand information. The part of speech is appended to the tail of each labelled sample as a surface feature, making the extracted user-attribute information more accurate. An attribute feature template based on Bigram features is built, so that the current word and its combinations with the preceding and following words are considered during feature extraction. CRF modelling is performed with the CRF++ tool; the labelled text is learnt, and the attribute model is trained.
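A hypothetical CRF++ feature template along the lines described above, with the token in column 0 and its appended part of speech in column 1; the exact feature macros chosen are assumptions, not the patent's template:

```
# Unigram features: current token, its neighbours, and its part of speech
U00:%x[-1,0]
U01:%x[0,0]
U02:%x[1,0]
U03:%x[0,1]

# Bigram-style combinations of the current word with its neighbours,
# so feature extraction considers the word together with its context
U04:%x[-1,0]/%x[0,0]
U05:%x[0,0]/%x[1,0]

# B emits label-transition features between adjacent output tags
B
```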
Step 4: Obtain the weight value of each word with TF-IDF. The TF-IDF algorithm takes words as the feature items of a text, and the weight of each feature item is composed of two parts, a TF weight and an IDF weight. The specific calculation formulas are as follows:

w_ji = TF_ji · IDF_i   (2)
TF_ji = f_ji / T   (3)
IDF_i = log(N / n_i + 0.01)   (4)
W_j = {w_j1, w_j2, ..., w_ji}   (5)

TF (Term Frequency) is the frequency with which a feature item occurs in the text, representing the word's degree of importance in the current text, where T is the total number of words of the j-th text and f_ji is the number of occurrences in the text of the i-th word of the j-th text. IDF (Inverse Document Frequency) is the inverse document frequency of the feature item, judging the importance of the word from a global view, where N is the total number of texts and n_i is the number of texts in which word i occurs. W_j is the weight set of the feature vector of the j-th text, and w_ji is the weight of the i-th word in the j-th text. Computing TF-IDF for each word in the corpus extracts the highly discriminative words of a text and assigns them weights related to their importance.
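The TF-IDF weighting of formulas (2)-(5) can be sketched directly; the toy corpus of pre-segmented texts is an illustrative assumption:

```python
import math

# Toy corpus: each text is a list of already segmented words.
corpus = [
    ["screen", "good", "phone"],
    ["battery", "good"],
    ["phone", "battery", "life"],
]

def tf_idf_weights(texts):
    n_texts = len(texts)  # N: total number of texts
    # n_i: number of texts in which word i occurs
    doc_freq = {}
    for text in texts:
        for word in set(text):
            doc_freq[word] = doc_freq.get(word, 0) + 1
    weights = []
    for text in texts:
        total = len(text)  # T: total number of words in this text
        w = {}
        for word in text:
            tf = text.count(word) / total          # formula (3)
            idf = math.log(n_texts / doc_freq[word] + 0.01)  # formula (4)
            w[word] = tf * idf                     # formula (2)
        weights.append(w)                          # W_j, formula (5)
    return weights

W = tf_idf_weights(corpus)
# "screen" occurs in only one text, so its IDF (and weight) is higher than
# that of "good", which occurs in two texts.
print(W[0]["screen"] > W[0]["good"])
```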
Step 5: Using the DBOW model with the hierarchical Softmax algorithm in Word2vec, the language model is built: each word in a text is mapped as a feature into a k-dimensional real vector. The word vectors are trained by stochastic gradient descent, obtaining gradients by backpropagation and computing the gradient error, then updating the model parameters, finally obtaining the representation of each word in the vector space. In the original feature vector of formula (6), V_j denotes the feature-vector set of the j-th text and v_ji denotes the i-th feature vector of the j-th text.

V_j = {v_j1, v_j2, ..., v_ji}   (6)

Step 6: Finally, the weight w_ji of each word from step 4 is multiplied by its corresponding feature vector v_ji, giving the improved V_j' shown in formula (7):

V_j' = {w_j1·v_j1, w_j2·v_j2, ..., w_ji·v_ji}   (7)
When similarity matching is needed, the user identity words extracted in step 3 receive the weight enhancement; when relevance matching is needed, the user behavior words extracted in step 3 receive it. Because the improved feature vectors incorporate the weights obtained by TF-IDF and CRF, they are more targeted for the different classes of features; using the word-vector model adds, for the feature words of a text, semantic connections between words, suiting both the holistic relevance of the text and the local relevance of certain specific words, so that the features have stronger generalization ability.
Step 7: Input the text semantic feature vectors of the training set of the corpus of step 6 into Softmax to train the text matching model.
Step 8: Input the text feature vectors of the test set of the corpus of step 6 into Softmax and perform text matching according to the model trained in step 7. If the label actually output for a text equals 1, the sentence pair is judged to match in content; if the label actually output is not equal to 1 (i.e. the label equals 0), the sentence pair is judged not to match in content. Count the number of differences between the actually output labels and the expected labels, and calculate the sentence-pair matching accuracy.
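The evaluation in step 8 can be sketched as follows; the label lists are toy data:

```python
# Labels actually output by the matching model vs. the expected labels:
# 1 = matched sentence pair, 0 = unmatched sentence pair.
predicted = [1, 0, 1, 1, 0]
expected = [1, 0, 0, 1, 0]

def matching_accuracy(pred, gold):
    # Count positions where the actual and expected labels differ,
    # then turn the error count into an accuracy.
    errors = sum(1 for p, g in zip(pred, gold) if p != g)
    return 1.0 - errors / len(gold)

print(matching_accuracy(predicted, expected))  # 4 of 5 pairs correct -> 0.8
```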
The above embodiments should be understood as merely illustrating the present invention rather than limiting its scope. After reading the content recorded in the present invention, those skilled in the art can make various changes or modifications to the present invention, and such equivalent changes and modifications likewise fall within the scope of the claims of the present invention.
Claims (9)
1. A text matching method based on term frequency-inverse document frequency and CRF, comprising the steps of: step 1: collecting a text matching corpus from the network, consisting of sentence pairs of product descriptions and search terms, the label of a pair being 1 if it is an associated match and otherwise 0, and randomly dividing the corpus into a training-set corpus and a test-set corpus; step 2: segmenting the corpus prepared in step 1 with a Chinese word segmentation method, collecting a stop-word list, and removing the stop words in the corpus according to the list; characterized in that it further comprises the following steps:
step 3: labelling the training set of the corpus obtained in step 2 with a conditional random field CRF, tagging identity words, behavior words, and irrelevant words; appending the part of speech to the tail of each labelled sample as a surface feature; building an attribute feature template based on Bigram features; performing CRF modelling with the CRF++ tool; learning from the labelled text; training an attribute model; obtaining the attributes of all words in the text; and applying word-weight enhancement to identity words or behavior words according to whether similarity or relevance matching is required;
step 4: training on the corpus obtained in step 2 with term frequency-inverse document frequency TF-IDF, and taking the TF-IDF value of each word as that word's TF-IDF weight;
step 5: training on the corpus prepared in step 2 with Word2vec to obtain a word-vector model;
step 6: fusing the two kinds of weights obtained in steps 3 and 4 into the weight of each word, then multiplying the weight of each word by the corresponding word vector obtained in step 5 to obtain a new text feature vector;
step 7: inputting the text semantic feature vectors of the training-set corpus obtained in step 6 into Softmax to train a text matching model;
step 8: inputting the text feature vectors of the test-set corpus obtained in step 6 into Softmax, performing text matching according to the model trained in step 7, and calculating the accuracy of the matching results.
2. The text matching method based on term frequency-inverse document frequency and CRF according to claim 1, characterized in that the corpus is segmented with a Chinese word segmentation method based on N-shortest paths.
3. The text matching method based on term frequency-inverse document frequency and CRF according to claim 2, characterized in that segmenting the corpus with the N-shortest-path Chinese word segmentation method specifically comprises the steps of: first representing the coupling relations between phrases with an adjacency list; then determining the first segmentation path by computing the coupling relations between phrases; finally, after all paths have been computed, finding the optimal path as the segmentation result.
4. The text matching method based on term frequency-inverse document frequency and CRF according to any one of claims 1-3, characterized in that step 3 is specifically: labelling the training set of the corpus obtained in step 2 with CRF, marking user behavior words, identity words, and meaningless words as act, ide, and non respectively, so as to extract the user's preference information and demand information; appending the part of speech to the tail of each labelled sample as a surface feature; building an attribute feature template based on Bigram features, so that the current word and its combinations with the preceding and following words are considered during feature extraction; performing CRF modelling with the CRF++ tool; learning from the labelled text; and training the attribute model.
5. The text matching method based on term frequency-inverse document frequency and CRF according to any one of claims 1-3, characterized in that training on the corpus obtained in step 2 with TF-IDF and obtaining the TF-IDF value of each word: the TF-IDF algorithm takes words as the feature items of a text, and the weight of each feature item is composed of two parts, a TF weight and an IDF weight, specifically comprising the following calculation formulas:

w_ji = TF_ji · IDF_i   (2)
TF_ji = f_ji / T   (3)
IDF_i = log(N / n_i + 0.01)   (4)
W_j = {w_j1, w_j2, ..., w_ji}   (5)

where TF is the frequency with which a feature item occurs in the text, representing the word's degree of importance in the current text; T is the total number of words of the j-th text; f_ji is the number of occurrences in the text of the i-th word of the j-th text; IDF is the inverse document frequency of the feature item, judging the importance of the word from a global view; N is the total number of texts; n_i is the number of texts in which word i occurs; W_j is the weight set of the feature vector of the j-th text; and w_ji is the weight of the i-th word in the j-th text.
6. The text matching method based on term frequency-inverse document frequency and CRF according to any one of claims 1-3, characterized in that step 5 inputs the corpus prepared in step 2 into a model using the hierarchical Softmax bag-of-words DBOW algorithm in Word2vec, thereby obtaining the representation of each word in the vector space.
7. The text matching method based on term frequency-inverse document frequency and CRF according to claim 6, characterized in that step 6 fuses the two kinds of weights obtained in steps 3 and 4 into the weight of each word, then multiplies the weight of each word by the corresponding word vector obtained in step 5 to obtain a new text feature vector, specifically: the TF-IDF value of each word obtained in step 4 is multiplied by the corresponding word vector, and the user-attribute words obtained in step 3 receive a weight enhancement by a multiple, giving the semantically enhanced feature vectors.
8. The text matching method based on term frequency-inverse document frequency and CRF according to claim 6, characterized in that step 8 inputs the text feature vectors of the obtained test-set corpus into Softmax and, according to the trained model, performs text matching and calculates the accuracy of the matching results, specifically comprising: inputting the text feature vectors of the test-set corpus obtained in step 6 into Softmax and performing text matching according to the model trained in step 7; if the label actually output for a text equals 1, judging the sentence pair to match in content; if the label actually output is not equal to 1, i.e. the label equals 0, judging the sentence pair not to match in content; counting the number of differences between the actually output labels and the expected labels; and calculating the sentence-pair matching accuracy.
9. The text matching method based on term frequency-inverse document frequency and CRF according to claim 6, characterized in that the CRF is a discriminative undirected graphical model, using a linear-chain conditional random field, where x = (x_1, x_2, ..., x_n) denotes the observed input data sequence and y = (y_1, y_2, ..., y_n) denotes a state sequence; given an input sequence, the linear-chain CRF model defines the conditional probability of the state sequence as:

P(y|x) = (1/Z(x)) exp( Σ_{i,j} λ_j t_j(y_{i-1}, y_i, x, i) + Σ_{i,k} μ_k s_k(y_i, x, i) )   (1)

where Z(x) is the normalization factor, t_j(y_{i-1}, y_i, x, i) is a transition feature function at positions i-1 and i of the observation sequence, s_k(y_i, x, i) is a state feature function at position i, and the parameters λ_j and μ_k can be estimated from the training data: the larger a non-negative value obtained, the more the corresponding feature event is preferred; the larger a negative value obtained, the less likely the corresponding feature event is to occur.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810062016.3A CN108255813B (en) | 2018-01-23 | 2018-01-23 | Text matching method based on word frequency-inverse document and CRF |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810062016.3A CN108255813B (en) | 2018-01-23 | 2018-01-23 | Text matching method based on word frequency-inverse document and CRF |
Publications (2)
Publication Number | Publication Date |
---|---|
CN108255813A true CN108255813A (en) | 2018-07-06 |
CN108255813B CN108255813B (en) | 2021-11-16 |
Family
ID=62742366
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201810062016.3A Active CN108255813B (en) | 2018-01-23 | 2018-01-23 | Text matching method based on word frequency-inverse document and CRF |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN108255813B (en) |
Application Events
- 2018-01-23: Application CN201810062016.3A filed in China (CN); granted as CN108255813B, status Active
Patent Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN104978356A (en) * | 2014-04-10 | 2015-10-14 | 阿里巴巴集团控股有限公司 | Synonym identification method and device |
KR20170000185A (en) * | 2015-06-23 | 2017-01-02 | 아시아나아이디티 주식회사 | Text categorizing system of sparse vector-space document, device and method thereof |
CN105677779A (en) * | 2015-12-30 | 2016-06-15 | 山东大学 | Feedback-based question-type classifier system using a scoring mechanism, and its working method |
CN105740236A (en) * | 2016-01-29 | 2016-07-06 | 中国科学院自动化研究所 | Method and system for recognizing new Chinese sentiment words by combining writing features and sequence features |
CN107193959A (en) * | 2017-05-24 | 2017-09-22 | 南京大学 | Business entity classification method for plain text |
Non-Patent Citations (2)
Title |
---|
Shilin Zhang and Mei Gu: "Improved Text Classification to Acquire Job Opportunities for Chinese Disabled Persons", 2010 2nd International Conference on Advanced Computer Control * |
Tang Xianlun: "Text Semantic Matching and Recommendation Based on Conditional Random Fields and TF-IDF", Abstracts of the 28th Chinese Process Control Conference (CPCC 2017) and the 30th Anniversary of the Chinese Process Control Conference * |
Cited By (21)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109062899A (en) * | 2018-07-31 | 2018-12-21 | 中国科学院信息工程研究所 | Document similarity measurement method based on part-of-speech tagging |
CN109062899B (en) * | 2018-07-31 | 2021-10-15 | 中国科学院信息工程研究所 | Document similarity measurement method based on part-of-speech tagging |
CN109271626A (en) * | 2018-08-31 | 2019-01-25 | 北京工业大学 | Text semantic analysis method |
CN109271626B (en) * | 2018-08-31 | 2023-09-26 | 北京工业大学 | Text semantic analysis method |
CN109446321A (en) * | 2018-10-11 | 2019-03-08 | 深圳前海达闼云端智能科技有限公司 | Text classification method, text classification device, terminal and computer readable storage medium |
CN109522549A (en) * | 2018-10-30 | 2019-03-26 | 云南电网有限责任公司信息中心 | Corpus construction method based on Web collection and balanced distribution of text features |
CN109522549B (en) * | 2018-10-30 | 2022-06-10 | 云南电网有限责任公司信息中心 | Corpus construction method based on Web collection and text feature balanced distribution |
CN109558489A (en) * | 2018-12-03 | 2019-04-02 | 南京中孚信息技术有限公司 | Text classification method and device |
CN112784062A (en) * | 2019-03-15 | 2021-05-11 | 北京金山数字娱乐科技有限公司 | Idiom knowledge graph construction method and device |
CN112784062B (en) * | 2019-03-15 | 2024-06-04 | 北京金山数字娱乐科技有限公司 | Idiom knowledge graph construction method and device |
CN109933670A (en) * | 2019-03-19 | 2019-06-25 | 中南大学 | Text classification method for calculating semantic distance based on combined matrix |
CN109933670B (en) * | 2019-03-19 | 2021-06-04 | 中南大学 | Text classification method for calculating semantic distance based on combined matrix |
CN110297913A (en) * | 2019-06-12 | 2019-10-01 | 中电科大数据研究院有限公司 | Entity extraction method for electronic government documents |
CN110427627B (en) * | 2019-08-02 | 2023-04-28 | 北京百度网讯科技有限公司 | Task processing method and device based on semantic representation model |
CN110427627A (en) * | 2019-08-02 | 2019-11-08 | 北京百度网讯科技有限公司 | Task processing method and device based on semantic representation model |
CN111881668A (en) * | 2020-08-06 | 2020-11-03 | 成都信息工程大学 | Improved TF-IDF calculation model based on chi-square statistics and TF-CRF |
CN111881668B (en) * | 2020-08-06 | 2023-06-30 | 成都信息工程大学 | TF-IDF computing device based on chi-square statistics and TF-CRF improvement |
CN112580691A (en) * | 2020-11-25 | 2021-03-30 | 北京北大千方科技有限公司 | Term matching method, matching system and storage medium for metadata fields |
CN112580691B (en) * | 2020-11-25 | 2024-05-14 | 北京北大千方科技有限公司 | Term matching method, matching system and storage medium for metadata field |
CN117951256A (en) * | 2024-03-25 | 2024-04-30 | 北京长河数智科技有限责任公司 | Document duplicate checking method based on hierarchical feature vector search |
CN117951256B (en) * | 2024-03-25 | 2024-05-31 | 北京长河数智科技有限责任公司 | Document duplicate checking method based on hierarchical feature vector search |
Also Published As
Publication number | Publication date |
---|---|
CN108255813B (en) | 2021-11-16 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN108255813A (en) | Text matching method based on term frequency-inverse document frequency and CRF | |
Downey et al. | Locating complex named entities in web text. | |
US9672205B2 (en) | Methods and systems related to information extraction | |
CN107315738B | Method for assessing the innovation degree of text information | |
CN108959258B (en) | Specific field integrated entity linking method based on representation learning | |
CN109190117A | Short text semantic similarity calculation method based on word vectors | |
CN110688836A (en) | Automatic domain dictionary construction method based on supervised learning | |
CN110879831A (en) | Chinese medicine sentence word segmentation method based on entity recognition technology | |
Sabuna et al. | Summarizing Indonesian text automatically by using sentence scoring and decision tree | |
CN110414009A (en) | The remote bilingual parallel sentence pairs abstracting method of English based on BiLSTM-CNN and device | |
CN110851593B (en) | Complex value word vector construction method based on position and semantics | |
CN113095087B (en) | Chinese word sense disambiguation method based on graph convolution neural network | |
Babhulgaonkar et al. | Language identification for multilingual machine translation | |
Gan et al. | Character-level deep conflation for business data analytics | |
CN115017903A (en) | Method and system for extracting key phrases by combining document hierarchical structure with global local information | |
CN114707615B (en) | Ancient character similarity quantification method based on duration Chinese character knowledge graph | |
CN112613318B (en) | Entity name normalization system, method thereof and computer readable medium | |
CN114925198A (en) | Knowledge-driven text classification method fusing character information | |
Priya et al. | Intelligent Aspect based Model for Efficient Sentiment Analysis of User Reviews | |
Thilagavathi et al. | Tamil english language sentiment analysis system | |
Paul et al. | Multi-facet universal schema | |
CN111881678A (en) | Domain word discovery method based on unsupervised learning | |
Saidi et al. | New approch of opinion analysis from big social data environment using a supervised machine learning algirithm | |
Shekhar | Text Mining and Sentiment Analysis | |
CN112463928B (en) | Technical list generation method and system for field evaluation prediction |
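The patent title and many of the similar documents above rely on TF-IDF (term frequency-inverse document frequency) weighting. As a minimal generic sketch of that weighting scheme only, not of the patented CRF-based matching method, the standard formulation (raw term count times log of inverse document frequency) can be written as:

```python
import math
from collections import Counter

def tf_idf(docs):
    """Compute TF-IDF weights for a list of tokenized documents.

    TF is the raw count of a term in a document; IDF uses the
    classic log(N / df) form, where N is the number of documents
    and df is how many documents contain the term.
    """
    n = len(docs)
    # Document frequency: count each term once per document.
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)
        weights.append({t: c * math.log(n / df[t]) for t, c in tf.items()})
    return weights

# Toy corpus of three pre-tokenized "documents".
docs = [["text", "matching", "crf"],
        ["text", "classification"],
        ["entity", "matching"]]
w = tf_idf(docs)
# "crf" occurs in 1 of 3 docs, so its weight in doc 0 is log(3);
# "text" occurs in 2 of 3 docs, so its weight is log(3/2).
```

Terms appearing in every document get weight log(1) = 0, which is why common stop words are suppressed by this scheme; variants (smoothed IDF, sublinear TF) differ only in these two factors.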
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||