CN109062887A - Part-of-speech tagging method based on the averaged perceptron algorithm - Google Patents
Part-of-speech tagging method based on the averaged perceptron algorithm
Info
- Publication number
- CN109062887A (application CN201810561207.4A)
- Authority
- CN
- China
- Prior art keywords
- word
- speech
- sentence
- feature
- list
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/284—Lexical analysis, e.g. tokenisation or collocates
Abstract
The present invention relates to a part-of-speech tagging method based on the averaged perceptron algorithm, and belongs to the field of natural language processing technology. The invention first trains on a training set: word features are extracted from the training data, such as the form of the current word, its last two letters, and the part of speech of the previous word; the weight of each candidate part of speech is updated for each feature according to the corpus; and the result is finally stored locally as a byte stream in a nested-dictionary data structure. Then, in the tagging stage, the sentence to be tagged is preprocessed, the features of each word are extracted, and the most probable part of speech is returned by comparison against the model file. The invention reaches high accuracy with a small training set, its hardware requirements are modest, and its training time is short.
Description
Technical field
The present invention relates to a part-of-speech tagging method based on the averaged perceptron algorithm, and belongs to the technical field of natural language processing.
Background art
Part-of-speech tagging is a fundamental task in natural language processing. It underlies many other natural language processing tasks and to a large extent determines the final performance of downstream work. Building a high-performance, efficient part-of-speech tagging system therefore has important academic significance and application value.
The perceptron is a linear model for binary classification. Its input is the feature vector of an instance, and its output is the class of the instance, taking the two values +1 and -1. The perceptron corresponds to a separating hyperplane in the input (feature) space that divides instances into a positive and a negative class, and it is a discriminative model. Perceptron learning aims to find a separating hyperplane that linearly partitions the training data; to this end, a loss function based on misclassification is introduced and minimized by gradient descent, which yields the perceptron model. The perceptron learning algorithm is simple and easy to implement, and comes in a primal form and a dual form. Perceptron prediction classifies new input instances with the learned perceptron model.
Summary of the invention
The technical problem to be solved by the present invention is to propose a part-of-speech tagging method based on the averaged perceptron algorithm, so as to solve the above problems.
The technical scheme of the invention is a part-of-speech tagging method based on the averaged perceptron algorithm. Training stage: first, the corpus is preprocessed; then, training data is read from the corpus; next, features are extracted from the corpus; then, the weight values of the feature templates are trained; finally, the average weights are computed. Test stage: first, the input is preprocessed; then, the features of each word in the sentence to be tagged are extracted and compared against the model obtained in the training stage to yield the word's most probable part of speech.
The specific steps are as follows:
(1) Read training data from the corpus: words are read from the corpus, and a full stop marks the end of a sentence, so the words preceding it are grouped into one sentence. Each sentence is stored in a sentence variable, which is then appended to the train_data list that serves as the training set (a minimal sketch of this reading step is given after this list);
(2) Read one sentence from train_data, in which the words form the words list and the parts of speech form the tags list;
(3) Preprocess the words obtained in step (2);
(4) Add special tokens before and after the words list from step (3), to prevent errors when processing the first or last word;
(5) Tag each word of the words list obtained in step (3) in turn: look the word up in the high-frequency dictionary; if it is there, its part of speech is determined; if not, extract the word's features;
(6) Predict the word's part of speech from the features extracted in step (5), and update the weights according to the prediction result;
(7) Check whether train_data has been fully processed; if not, repeat steps (2) to (6); if so, proceed to the next step;
(8) Average the weights, and store each feature's candidate parts of speech and their weights locally as a byte stream, using a nested-dictionary data structure;
(9) To tag an input sentence, store the words of the sentence to be processed, in order, in the list words;
(10) Preprocess the words list from step (9);
(11) Tag each word of the words list obtained in step (10) in turn: look the word up in the high-frequency dictionary; if it is there, its part of speech is determined; if not, extract the word's features;
(12) Predict the word's part of speech from the features extracted in step (11), and store it in the tokens list;
(13) Check whether the words list from step (9) has been fully processed; if not, repeat steps (11) to (12); if so, output the tokens list.
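As referenced in step (1), a minimal Python sketch of the reading step is given below (Python matches the reference implementation cited in this document; the one-word/tag-pair-per-line corpus format is an assumption for illustration):

```python
def read_train_data(corpus_path):
    """Step (1): group word/tag pairs into sentences at each full stop."""
    train_data, words, tags = [], [], []
    with open(corpus_path, encoding="utf-8") as f:
        for line in f:
            if not line.strip():
                continue
            word, tag = line.split()
            words.append(word)
            tags.append(tag)
            if word == ".":  # a full stop ends the current sentence
                train_data.append((words, tags))
                words, tags = [], []
    return train_data  # [([word, ...], [tag, ...]), ...]
```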
The sentence in step (1) is of the form sentence = ([], []), where the first list contains the words and the second list contains the part of speech of each word in the first list;
train_data in step (1) is of the form train_data = [sentence1, sentence2, ...], that is, train_data = [([], []), ([], []), ...], where each () is one sentence;
Preprocessing the words in step (3) means: first, all words are converted to lower case; second, numbers between 1900 and 2200 are replaced by YEAR and other numbers by DIGITS; finally, an eleven-digit number is replaced by TELENUM;
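A minimal sketch of this preprocessing, assuming YEAR, DIGITS and TELENUM are literal placeholder strings:

```python
def normalize(word):
    """Steps (3)/(10): lower-case words and bucket numbers into placeholders."""
    if word.isdigit():
        if len(word) == 11:  # eleven consecutive digits, e.g. a telephone number
            return "TELENUM"
        if 1900 <= int(word) <= 2200:
            return "YEAR"
        return "DIGITS"
    return word.lower()
```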
Adding special tokens before and after the words list in step (4) means: 'START' and 'START2' are added before each words list, and 'END' and 'END2' are added at its end, to prevent errors when processing the first or last word;
The high-frequency dictionary in step (5) means: in English, roughly half of all words have a fixed part of speech. These words need no prediction and are read directly from the local high-frequency dictionary; this not only improves the efficiency of the algorithm but also raises accuracy;
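The text does not fix how this dictionary is built; one plausible construction keeps only words that are frequent and almost always carry the same tag (the thresholds below are illustrative assumptions):

```python
from collections import Counter, defaultdict

def build_tagdict(train_data, min_freq=20, min_ratio=0.97):
    """Map frequent, unambiguous words straight to their tag (steps (5)/(11))."""
    counts = defaultdict(Counter)
    for words, tags in train_data:
        for word, tag in zip(words, tags):
            counts[word][tag] += 1
    tagdict = {}
    for word, tag_counts in counts.items():
        tag, n = tag_counts.most_common(1)[0]
        total = sum(tag_counts.values())
        if total >= min_freq and n / total >= min_ratio:
            tagdict[word] = tag  # e.g. {'good': 'adj', 'man': 'n', ...}
    return tagdict
```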
The feature extraction in step (5) uses the following features: the last three letters of the word; the first letter of the word; the part of speech of the previous word in the sentence; the part of speech of the word two positions back; the word itself; the previous word itself; the last three letters of the previous word; the word two positions back itself; the next word itself; the last three letters of the next word; the word two positions ahead itself;
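A sketch of these eleven features, using the key format of the worked example in the embodiment below ('i suffix ...', 'i pref1 ...', 'i-1 tag ...'); here context is the padded words list from step (4), and i indexes into it past the two START tokens:

```python
def get_features(i, word, context, prev_tag, prev2_tag):
    """Steps (5)/(11): features of context[i] (i >= 2 because of the padding)."""
    return [
        "i suffix " + word[-3:],              # last three letters of the word
        "i pref1 " + word[0],                 # first letter of the word
        "i-1 tag " + prev_tag,                # part of speech of the previous word
        "i-2 tag " + prev2_tag,               # part of speech of the word two back
        "i word " + context[i],               # the word itself
        "i-1 word " + context[i - 1],         # previous word itself
        "i-1 suffix " + context[i - 1][-3:],  # last three letters of previous word
        "i-2 word " + context[i - 2],         # word two back itself
        "i+1 word " + context[i + 1],         # next word itself
        "i+1 suffix " + context[i + 1][-3:],  # last three letters of next word
        "i+2 word " + context[i + 2],         # word two ahead itself
    ]
```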
Predicting the part of speech from the features in step (6) means: from weights of the form {feature 1: {part of speech 1: weight 1, part of speech 2: weight 2, ...}, feature 2: {part of speech 1: weight 1, part of speech 2: weight 2, ...}, ...}, the part of speech corresponding to the largest weight value across all features of the word is selected as the predicted part of speech of the word;
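A sketch of this prediction; it scores every candidate part of speech by summing its weights over all of the word's features and takes the argmax, the usual averaged-perceptron decision rule, of which the "largest weight" wording above is the single-dominant-feature special case:

```python
def predict(features, weights):
    """Steps (6)/(12): return the best-scoring part of speech for these features."""
    scores = {}
    for feat in features:
        for tag, weight in weights.get(feat, {}).items():
            scores[tag] = scores.get(tag, 0.0) + weight
    if not scores:
        return None  # entirely unseen features: no prediction possible
    return max(scores, key=lambda tag: (scores[tag], tag))  # deterministic ties
```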
Updating the weights in step (6) means: if the part of speech predicted from the current features is correct, the weight of each part of speech in each of the word's features is unchanged; if the prediction is wrong, the weight of the correct part of speech is increased by one in each of the word's features, and the weight of the wrongly predicted part of speech is decreased by one;
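A sketch of this ±1 update over the nested dictionary {feature: {part of speech: weight}}:

```python
def update(truth, guess, features, weights):
    """Step (6): reward the correct tag, punish the wrongly guessed one."""
    if truth == guess:
        return  # correct prediction: all weights stay unchanged
    for feat in features:
        w = weights.setdefault(feat, {})
        w[truth] = w.get(truth, 0.0) + 1.0  # +1 for the correct part of speech
        w[guess] = w.get(guess, 0.0) - 1.0  # -1 for the predicted wrong one
```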
Averaging the weights in step (8) means: to improve the accuracy of the perceptron algorithm, the final updated weights cannot be used directly. The problem with the algorithm above is that training on two slightly different sets of examples may produce completely different models, so the model does not generalize well. A bigger problem is that the algorithm pays too much attention to the misclassified points and adjusts the entire model to fit them. The remedy is therefore to return the averaged weights rather than the final weights. Concretely, an additional dictionary is maintained that records the step at which each weight last changed; whenever a feature weight is changed, this value is read out to bring the weight's running total up to date;
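A sketch of this bookkeeping, following the standard lazy-averaging trick the text paraphrases: _tstamps records the update step at which each weight last changed, _totals accumulates each weight multiplied by the number of steps it held its value, and averaging divides the total by the number of steps:

```python
class AveragedWeights:
    """Perceptron weights with lazily maintained running totals (steps (6)/(8))."""

    def __init__(self):
        self.weights = {}   # {feature: {tag: current weight}}
        self._totals = {}   # {(feature, tag): accumulated weight * steps}
        self._tstamps = {}  # {(feature, tag): step of the last change}
        self.i = 0          # number of update steps seen so far

    def _change(self, feat, tag, delta):
        key = (feat, tag)
        w = self.weights.setdefault(feat, {})
        # Bring the running total up to date before changing the weight.
        self._totals[key] = self._totals.get(key, 0.0) + \
            (self.i - self._tstamps.get(key, 0)) * w.get(tag, 0.0)
        self._tstamps[key] = self.i
        w[tag] = w.get(tag, 0.0) + delta

    def update(self, truth, guess, features):
        self.i += 1
        if truth == guess:
            return
        for feat in features:
            self._change(feat, truth, +1.0)
            self._change(feat, guess, -1.0)

    def average(self):
        """Step (8): replace every weight by its average over all steps."""
        for feat, tags in self.weights.items():
            for tag, w in tags.items():
                key = (feat, tag)
                total = self._totals.get(key, 0.0) + \
                    (self.i - self._tstamps.get(key, 0)) * w
                if self.i:
                    tags[tag] = round(total / self.i, 3)
```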
The nested dictionary in step (8) is a data structure of the form {feature 1: {part of speech 1: weight 1, part of speech 2: weight 2, ...}, feature 2: {part of speech 1: weight 1, part of speech 2: weight 2, ...}, ...};
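Storing this structure locally "in the manner of a byte stream" maps naturally onto Python's pickle module; the file name below is an illustrative assumption:

```python
import pickle

def save_model(weights, tagdict, path="tagger.pickle"):
    """Step (8): serialize the averaged weights and high-frequency dictionary."""
    with open(path, "wb") as f:
        pickle.dump((weights, tagdict), f)

def load_model(path="tagger.pickle"):
    """Tagging stage (steps (9)-(13)): restore the model file for comparison."""
    with open(path, "rb") as f:
        return pickle.load(f)
```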
The preprocessing in step (10) means: first, all words are converted to lower case; second, numbers between 1900 and 2200 are replaced by YEAR and other numbers by DIGITS; finally, an eleven-digit number is replaced by TELENUM;
The high-frequency dictionary in step (11) means: in English, roughly half of all words have a fixed part of speech. These words need no prediction and are read directly from the local high-frequency dictionary;
The feature extraction in step (11) uses the following features: the last three letters of the word; the first letter of the word; the part of speech of the previous word in the sentence; the part of speech of the word two positions back; the word itself; the previous word itself; the last three letters of the previous word; the word two positions back itself; the next word itself; the last three letters of the next word; the word two positions ahead itself;
Predicting the part of speech in step (12) means: from weights of the form {feature 1: {part of speech 1: weight 1, part of speech 2: weight 2, ...}, feature 2: {part of speech 1: weight 1, part of speech 2: weight 2, ...}, ...}, the part of speech corresponding to the largest weight value across all features of the word is selected as the predicted part of speech of the word;
The tokens list in step (12) is of the form [(word 1, part of speech 1), (word 2, part of speech 2), ...].
The beneficial effects of the present invention are: the method reaches high accuracy with a small training set, its hardware requirements are modest, its training time is short, and it is simple and easy to implement.
Detailed description of the invention
Fig. 1 is the flow chart of the steps of the invention.
Specific embodiment
The invention is further described below with reference to the accompanying drawing and a specific embodiment.
Embodiment 1: as shown in Fig. 1, a part-of-speech tagging method based on the averaged perceptron algorithm, specifically:
(1) Read training data from the corpus: words are read from the corpus, and a full stop marks the end of a sentence, so the words preceding it are grouped into one sentence. Each sentence is stored in a sentence variable, which is then appended to the train_data list that serves as the training set;
(2) Read one sentence from train_data, in which the words form the words list and the parts of speech form the tags list, specifically: words = [I, love, you], tags = [NNP, VB, NNP];
(3) Preprocess the words obtained in step (2);
(4) Add special tokens before and after the words list from step (3), to prevent errors when processing the first or last word;
(5) Tag each word of the words list obtained in step (3) in turn: look the word up in the high-frequency dictionary; if it is there, its part of speech is determined; if not, extract the word's features;
(6) Predict the word's part of speech from the features extracted in step (5), and update the weights according to the prediction result;
(7) Check whether train_data has been fully processed; if not, repeat steps (2) to (6); if so, proceed to the next step;
(8) Average the weights, and store each feature's candidate parts of speech and their weights locally as a byte stream, using a nested-dictionary data structure;
(9) To tag an input sentence, store the words of the sentence to be processed, in order, in the list words;
(10) Preprocess the words list from step (9);
(11) Tag each word of the words list obtained in step (10) in turn: look the word up in the high-frequency dictionary; if it is there, its part of speech is determined; if not, extract the word's features;
(12) Predict the word's part of speech from the features extracted in step (11), and store it in the tokens list, specifically: tokens = [(I, NNP), (love, VB), (you, NNP)];
(13) Check whether the words list from step (9) has been fully processed; if not, repeat steps (11) to (12); if so, output the tokens list.
The sentence in step (1) is of the form sentence = ([], []), where the first list contains the words and the second list contains the part of speech of each word in the first list, specifically: sentence = ([I, love, you], [NNP, VB, NNP]);
train_data in step (1) is of the form train_data = [sentence1, sentence2, ...], that is, train_data = [([], []), ([], []), ...], where each () is one sentence, specifically: train_data = [([I, love, you], [nnp, vb, nnp]), ([a, good, man], [at, adj, n])];
Preprocessing the words in step (3) means: first, all words are converted to lower case; second, numbers between 1900 and 2200 are replaced by YEAR and other numbers by DIGITS; finally, an eleven-digit number is replaced by TELENUM; specifically, 'There is a tree' is transformed into 'there is a tree';
Adding special tokens before and after the words list in step (4) means: 'START' and 'START2' are added before each words list, and 'END' and 'END2' are added at its end, to prevent errors when processing the first or last word, specifically: [START, START2, I, love, you, END, END2];
The high-frequency dictionary in step (5) means: in English, roughly half of all words have a fixed part of speech. These words need no prediction and are read directly from the local high-frequency dictionary; this not only improves the efficiency of the algorithm but also raises accuracy; specifically: {good: adj, man: n, ...};
The feature extraction in step (5) uses the following features: the last three letters of the word; the first letter of the word; the part of speech of the previous word in the sentence; the part of speech of the word two positions back; the word itself; the previous word itself; the last three letters of the previous word; the word two positions back itself; the next word itself; the last three letters of the next word; the word two positions ahead itself. Specifically, for the sentence 'this be despite a system', the features of 'this' are: 'i suffix his', 'i pref1 t', 'i-1 tag -START-', 'i-2 tag -START2-', 'i word this', 'i-1 suffix T2-', 'i-2 word -START-', 'i+1 word be', 'i+1 suffix be', 'i+2 word despite';
Predicting the part of speech from the features in step (6) means: from weights of the form {feature 1: {part of speech 1: weight 1, part of speech 2: weight 2, ...}, feature 2: {part of speech 1: weight 1, part of speech 2: weight 2, ...}, ...}, the part of speech corresponding to the largest weight value across all features of the word is selected as the predicted part of speech. Specifically, the stored weights for the features of 'this' include (abridged):
{'i suffix his': {'dd1': 3.803, 'zz1_np1@': -1.0, 'nn2': -1.793, 'nn1_jjr': -0.947, 'jj': -0.892, 'appge': 2.185, 'nn1': -0.814, 'np1': -0.541},
'i pref1 t': {'dd1': 2.437, 'to': 1.674, 'cst': 1.827, 'ex': 1.827, 'dd2': 2.649, 'rt': 2.07, ...},
'i-1 tag -START-': {...}, 'i-2 tag -START2-': {...}, 'i tag+i-2 tag -START- -START2-': {...},
'i word this': {'dd1': 3.803, 'zz1_np1@': -1.0, 'nn2': -0.963, 'nn1_jjr': -0.947, 'jj': -0.892},
'i-1 tag+i word -START- this': {'dd1': 1.963, 'zz1_np1@': -1.0, 'nn2': -0.963},
'i-1 word -START2-': {...}, 'i-1 suffix T2-': {...}, 'i-2 word -START-': {...},
'i+1 word be': {'dd1': 1.09, 'np1': 1.092, 'nn2': 1.411, 'csa': 1.492, ...},
'i+1 suffix be': {...},
'i+2 word despite': {'dd1': 1.0, 'zz1_np1@': -1.0, 'ii': 0.686, 'nn1_vv0_jj': -0.686}},
and the part of speech 'dd1', whose weight 3.803 is the largest, is selected;
Updating the weights in step (6) means: if the part of speech predicted from the current features is correct, the weight of each part of speech in each of the word's features is unchanged; if the prediction is wrong, the weight of the correct part of speech is increased by one in each of the word's features, and the weight of the wrongly predicted part of speech is decreased by one;
Averaging the weights in step (8) means: to improve the accuracy of the perceptron algorithm, the final updated weights cannot be used directly. The problem with the algorithm above is that training on two slightly different sets of examples may produce completely different models, so the model does not generalize well. A bigger problem is that the algorithm pays too much attention to the misclassified points and adjusts the entire model to fit them. The remedy is therefore to return the averaged weights rather than the final weights. Concretely, an additional dictionary is maintained that records the step at which each weight last changed; whenever a feature weight is changed, this value is read out to bring the weight's running total up to date;
The nested dictionary in step (8) is a data structure of the form {feature 1: {part of speech 1: weight 1, part of speech 2: weight 2, ...}, feature 2: {part of speech 1: weight 1, part of speech 2: weight 2, ...}, ...}, such as the weight structure illustrated in the prediction example above;
The preprocessing in step (10) means: first, all words are converted to lower case; second, numbers between 1900 and 2200 are replaced by YEAR and other numbers by DIGITS; finally, an eleven-digit number is replaced by TELENUM; specifically, 'There is a tree' is transformed into 'there is a tree';
The high-frequency dictionary in step (11) means: in English, roughly half of all words have a fixed part of speech. These words need no prediction and are read directly from the local high-frequency dictionary;
The feature extraction in step (11) uses the following features: the last three letters of the word; the first letter of the word; the part of speech of the previous word in the sentence; the part of speech of the word two positions back; the word itself; the previous word itself; the last three letters of the previous word; the word two positions back itself; the next word itself; the last three letters of the next word; the word two positions ahead itself;
Predicting the part of speech in step (12) means: from weights of the form {feature 1: {part of speech 1: weight 1, part of speech 2: weight 2, ...}, feature 2: {part of speech 1: weight 1, part of speech 2: weight 2, ...}, ...}, the part of speech corresponding to the largest weight value across all features of the word is selected as the predicted part of speech of the word;
The tokens list in step (12) is of the form [(word 1, part of speech 1), (word 2, part of speech 2), ...], specifically: [(a, at), (good, adj), (man, n)].
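Putting the pieces together, a toy run of the tagging stage (steps (9) to (13)) could look as follows; the high-frequency dictionary and the single weight entry are made-up fixtures for illustration, not values from the trained model:

```python
tagdict = {"good": "adj", "man": "n"}             # toy high-frequency dictionary
weights = {"i word a": {"at": 2.0, "nn1": -0.5}}  # toy nested weight dictionary

def tag_sentence(words):
    tokens = []
    for word in words:
        tag = tagdict.get(word)         # step (11): high-frequency lookup first
        if tag is None:                 # otherwise score the word's features
            feats = ["i word " + word]  # stand-in for the full feature set
            scores = {}
            for feat in feats:          # step (12): pick the best-weighted tag
                for t, w in weights.get(feat, {}).items():
                    scores[t] = scores.get(t, 0.0) + w
            tag = max(scores, key=scores.get) if scores else "nn1"
        tokens.append((word, tag))      # collect (word, tag) pairs
    return tokens                       # step (13): the finished tokens list

print(tag_sentence(["a", "good", "man"]))
# [('a', 'at'), ('good', 'adj'), ('man', 'n')]
```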
The embodiment of the present invention has been explained in detail above with reference to the accompanying drawing, but the present invention is not limited to the above embodiment; various changes may also be made within the knowledge of a person skilled in the art without departing from the concept of the invention.
Claims (4)
1. A part-of-speech tagging method based on the averaged perceptron algorithm, characterized in that:
(1) Read training data from the corpus: words are read from the corpus, and a full stop marks the end of a sentence, so the words preceding it are grouped into one sentence. Each sentence is stored in a sentence variable, which is then appended to the train_data list that serves as the training set;
(2) Read one sentence from train_data, in which the words form the words list and the parts of speech form the tags list;
(3) Preprocess the words obtained in step (2);
(4) Add special tokens before and after the words list from step (3), to prevent errors when processing the first or last word;
(5) Tag each word of the words list obtained in step (3) in turn: look the word up in the high-frequency dictionary; if it is there, its part of speech is determined; if not, extract the word's features;
(6) Predict the word's part of speech from the features extracted in step (5), and update the weights according to the prediction result;
(7) Check whether train_data has been fully processed; if not, repeat steps (2) to (6); if so, proceed to the next step;
(8) Average the weights, and store each feature's candidate parts of speech and their weights locally as a byte stream, using a nested-dictionary data structure;
(9) To tag an input sentence, store the words of the sentence to be processed, in order, in the list words;
(10) Preprocess the words list from step (9);
(11) Tag each word of the words list obtained in step (10) in turn: look the word up in the high-frequency dictionary; if it is there, its part of speech is determined; if not, extract the word's features;
(12) Predict the word's part of speech from the features extracted in step (11), and store it in the tokens list;
(13) Check whether the words list from step (9) has been fully processed; if not, repeat steps (11) to (12); if so, output the tokens list.
2. The part-of-speech tagging method based on the averaged perceptron algorithm according to claim 1, characterized in that the preprocessing of the words in step (3) means: first, all words are converted to lower case; second, numbers between 1900 and 2200 are replaced by YEAR and other numbers by DIGITS; finally, an eleven-digit number is replaced by TELENUM.
3. The part-of-speech tagging method based on the averaged perceptron algorithm according to claim 1, characterized in that the feature extraction in step (5) uses the following features: the last three letters of the word; the first letter of the word; the part of speech of the previous word in the sentence; the part of speech of the word two positions back; the word itself; the previous word itself; the last three letters of the previous word; the word two positions back itself; the next word itself; the last three letters of the next word; the word two positions ahead itself.
4. The part-of-speech tagging method based on the averaged perceptron algorithm according to claim 1, characterized in that updating the weights in step (6) means: if the part of speech predicted from the current features is correct, the weight of each part of speech in each of the word's features is unchanged; if the prediction is wrong, the weight of the correct part of speech is increased by one in each of the word's features, and the weight of the wrongly predicted part of speech is decreased by one.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810561207.4A CN109062887A (en) | 2018-06-04 | 2018-06-04 | Part-of-speech tagging method based on the averaged perceptron algorithm
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810561207.4A CN109062887A (en) | 2018-06-04 | 2018-06-04 | Part-of-speech tagging method based on the averaged perceptron algorithm
Publications (1)
Publication Number | Publication Date |
---|---|
CN109062887A true CN109062887A (en) | 2018-12-21 |
Family
ID=64820276
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201810561207.4A Pending CN109062887A (en) | 2018-06-04 | 2018-06-04 | Part-of-speech tagging method based on the averaged perceptron algorithm
Country Status (1)
Country | Link |
---|---|
CN (1) | CN109062887A (en) |
- 2018
- 2018-06-04 CN CN201810561207.4A patent/CN109062887A/en active Pending
Patent Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20160140104A1 (en) * | 2005-05-05 | 2016-05-19 | Cxense Asa | Methods and systems related to information extraction |
CN101295295A (en) * | 2008-06-13 | 2008-10-29 | 中国科学院计算技术研究所 | Chinese language lexical analysis method based on linear model |
CN102831558A (en) * | 2012-07-20 | 2012-12-19 | 桂林电子科技大学 | System and method for automatically scoring college English compositions independent of manual pre-scoring |
CN107807910A (en) * | 2017-10-10 | 2018-03-16 | 昆明理工大学 | A kind of part-of-speech tagging method based on HMM |
Non-Patent Citations (3)
Title |
---|
TOGETHER_CZ: "Python implementation of part-of-speech tagging, based on the averaged perceptron algorithm", HTTPS://BLOG.CSDN.NET/TOGETHER_CZ/ARTICLE/DETAILS/73821852 * |
LI, ZHENGHUA: "Research on key technologies of Chinese dependency parsing", China Doctoral Dissertations Full-text Database (Information Science and Technology) * |
PU, HAOYU: "Research on detection methods for non-specific events on Twitter", China Master's Theses Full-text Database (Information Science and Technology) * |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109697289A (en) * | 2018-12-28 | 2019-04-30 | 北京工业大学 | It is a kind of improved for naming the Active Learning Method of Entity recognition |
CN109697289B (en) * | 2018-12-28 | 2023-01-13 | 北京工业大学 | Improved active learning method for named entity recognition |
Legal Events
Date | Code | Title | Description
---|---|---|---
| PB01 | Publication | |
| SE01 | Entry into force of request for substantive examination | |
| RJ01 | Rejection of invention patent application after publication | Application publication date: 20181221 |