CN109062887A - Part-of-speech tagging method based on the averaged perceptron algorithm - Google Patents
Part-of-speech tagging method based on the averaged perceptron algorithm
Info
- Publication number
- CN109062887A (application CN201810561207.4A)
- Authority
- CN
- China
- Prior art keywords
- word
- speech
- sentence
- feature
- list
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/284—Lexical analysis, e.g. tokenisation or collocates
Abstract
The present invention relates to a part-of-speech tagging method based on the averaged perceptron algorithm, and belongs to the field of natural language processing technology. The invention first trains on a training set: word features are extracted from the training data, such as the form of the current word, its last two letters, and the part of speech of the previous word; the weight of each candidate part of speech is updated for each feature according to the corpus; and the result is finally stored locally as a byte stream in a nested-dictionary data structure. Then, in the tagging stage, the sentence to be tagged is preprocessed, the features of each word are extracted, and the most probable part of speech is returned by comparison against the model file. The invention reaches high accuracy with a small training set, its hardware requirements are modest, and its training time is short.
Description
Technical field
The present invention relates to a part-of-speech tagging method based on the averaged perceptron algorithm, and belongs to the technical field of natural language processing.
Background art
Part-of-speech tagging is a fundamental task in natural language processing. It underlies many other natural language processing tasks and to a large extent determines the final performance of downstream work. Building a high-performance, efficient part-of-speech tagging system therefore has important academic significance and application value.
The perceptron is a linear model for binary classification. Its input is the feature vector of an instance, and its output is the class of the instance, taking the two values +1 and -1. The perceptron corresponds to a separating hyperplane in the input (feature) space that divides instances into a positive and a negative class, and it is a discriminative model. Perceptron learning aims to find a separating hyperplane that linearly partitions the training data; to this end, a loss function based on misclassification is introduced and minimized by gradient descent, which yields the perceptron model. The perceptron learning algorithm is simple and easy to implement, and comes in a primal form and a dual form. Perceptron prediction classifies new input instances with the learned perceptron model.
Summary of the invention
The technical problem to be solved by the present invention is to propose a part-of-speech tagging method based on the averaged perceptron algorithm, so as to solve the above problems.
The technical scheme of the invention is a part-of-speech tagging method based on the averaged perceptron algorithm. Training stage: first, the corpus is preprocessed; then, training data is read from the corpus; next, features are extracted from the corpus; then, the weight values of the feature templates are trained; finally, the average weights are computed. Test stage: first, the input is preprocessed; then, the features of each word in the sentence to be tagged are extracted and compared against the model obtained in the training stage to yield the word's most probable part of speech.
The specific steps are as follows:
(1) Read training data from the corpus: words are read from the corpus, and a full stop marks the end of a sentence, so the words preceding it are grouped into one sentence. Each sentence is stored in a sentence variable, which is then appended to the train_data list that serves as the training set (a minimal sketch of this reading step is given after this list);
(2) Read one sentence from train_data, in which the words form the words list and the parts of speech form the tags list;
(3) Preprocess the words obtained in step (2);
(4) Add special tokens before and after the words list from step (3), to prevent errors when processing the first or last word;
(5) Tag each word of the words list obtained in step (3) in turn: look the word up in the high-frequency dictionary; if it is there, its part of speech is determined; if not, extract the word's features;
(6) Predict the word's part of speech from the features extracted in step (5), and update the weights according to the prediction result;
(7) Check whether train_data has been fully processed; if not, repeat steps (2) to (6); if so, proceed to the next step;
(8) Average the weights, and store each feature's candidate parts of speech and their weights locally as a byte stream, using a nested-dictionary data structure;
(9) To tag an input sentence, store the words of the sentence to be processed, in order, in the list words;
(10) Preprocess the words list from step (9);
(11) Tag each word of the words list obtained in step (10) in turn: look the word up in the high-frequency dictionary; if it is there, its part of speech is determined; if not, extract the word's features;
(12) Predict the word's part of speech from the features extracted in step (11), and store it in the tokens list;
(13) Check whether the words list from step (9) has been fully processed; if not, repeat steps (11) to (12); if so, output the tokens list.
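As referenced in step (1), a minimal Python sketch of the reading step is given below (Python matches the reference implementation cited in this document; the one-word/tag-pair-per-line corpus format is an assumption for illustration):

```python
def read_train_data(corpus_path):
    """Step (1): group word/tag pairs into sentences at each full stop."""
    train_data, words, tags = [], [], []
    with open(corpus_path, encoding="utf-8") as f:
        for line in f:
            if not line.strip():
                continue
            word, tag = line.split()
            words.append(word)
            tags.append(tag)
            if word == ".":  # a full stop ends the current sentence
                train_data.append((words, tags))
                words, tags = [], []
    return train_data  # [([word, ...], [tag, ...]), ...]
```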
The sentence in step (1) is of the form sentence = ([], []), where the first list contains the words and the second list contains the part of speech of each word in the first list;
train_data in step (1) is of the form train_data = [sentence1, sentence2, ...], that is, train_data = [([], []), ([], []), ...], where each () is one sentence;
Preprocessing the words in step (3) means: first, all words are converted to lower case; second, numbers between 1900 and 2200 are replaced by YEAR and other numbers by DIGITS; finally, an eleven-digit number is replaced by TELENUM;
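A minimal sketch of this preprocessing, assuming YEAR, DIGITS and TELENUM are literal placeholder strings:

```python
def normalize(word):
    """Steps (3)/(10): lower-case words and bucket numbers into placeholders."""
    if word.isdigit():
        if len(word) == 11:  # eleven consecutive digits, e.g. a telephone number
            return "TELENUM"
        if 1900 <= int(word) <= 2200:
            return "YEAR"
        return "DIGITS"
    return word.lower()
```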
Adding special tokens before and after the words list in step (4) means: 'START' and 'START2' are added before each words list, and 'END' and 'END2' are added at its end, to prevent errors when processing the first or last word;
The high-frequency dictionary in step (5) means: in English, roughly half of all words have a fixed part of speech. These words need no prediction and are read directly from the local high-frequency dictionary; this not only improves the efficiency of the algorithm but also raises accuracy;
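The text does not fix how this dictionary is built; one plausible construction keeps only words that are frequent and almost always carry the same tag (the thresholds below are illustrative assumptions):

```python
from collections import Counter, defaultdict

def build_tagdict(train_data, min_freq=20, min_ratio=0.97):
    """Map frequent, unambiguous words straight to their tag (steps (5)/(11))."""
    counts = defaultdict(Counter)
    for words, tags in train_data:
        for word, tag in zip(words, tags):
            counts[word][tag] += 1
    tagdict = {}
    for word, tag_counts in counts.items():
        tag, n = tag_counts.most_common(1)[0]
        total = sum(tag_counts.values())
        if total >= min_freq and n / total >= min_ratio:
            tagdict[word] = tag  # e.g. {'good': 'adj', 'man': 'n', ...}
    return tagdict
```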
The feature extraction in step (5) uses the following features: the last three letters of the word; the first letter of the word; the part of speech of the previous word in the sentence; the part of speech of the word two positions back; the word itself; the previous word itself; the last three letters of the previous word; the word two positions back itself; the next word itself; the last three letters of the next word; the word two positions ahead itself;
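A sketch of these eleven features, using the key format of the worked example in the embodiment below ('i suffix ...', 'i pref1 ...', 'i-1 tag ...'); here context is the padded words list from step (4), and i indexes into it past the two START tokens:

```python
def get_features(i, word, context, prev_tag, prev2_tag):
    """Steps (5)/(11): features of context[i] (i >= 2 because of the padding)."""
    return [
        "i suffix " + word[-3:],              # last three letters of the word
        "i pref1 " + word[0],                 # first letter of the word
        "i-1 tag " + prev_tag,                # part of speech of the previous word
        "i-2 tag " + prev2_tag,               # part of speech of the word two back
        "i word " + context[i],               # the word itself
        "i-1 word " + context[i - 1],         # previous word itself
        "i-1 suffix " + context[i - 1][-3:],  # last three letters of previous word
        "i-2 word " + context[i - 2],         # word two back itself
        "i+1 word " + context[i + 1],         # next word itself
        "i+1 suffix " + context[i + 1][-3:],  # last three letters of next word
        "i+2 word " + context[i + 2],         # word two ahead itself
    ]
```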
Predicting the part of speech from the features in step (6) means: from weights of the form {feature 1: {part of speech 1: weight 1, part of speech 2: weight 2, ...}, feature 2: {part of speech 1: weight 1, part of speech 2: weight 2, ...}, ...}, the part of speech corresponding to the largest weight value across all features of the word is selected as the predicted part of speech of the word;
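A sketch of this prediction; it scores every candidate part of speech by summing its weights over all of the word's features and takes the argmax, the usual averaged-perceptron decision rule, of which the "largest weight" wording above is the single-dominant-feature special case:

```python
def predict(features, weights):
    """Steps (6)/(12): return the best-scoring part of speech for these features."""
    scores = {}
    for feat in features:
        for tag, weight in weights.get(feat, {}).items():
            scores[tag] = scores.get(tag, 0.0) + weight
    if not scores:
        return None  # entirely unseen features: no prediction possible
    return max(scores, key=lambda tag: (scores[tag], tag))  # deterministic ties
```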
Updating the weights in step (6) means: if the part of speech predicted from the current features is correct, the weight of each part of speech in each of the word's features is unchanged; if the prediction is wrong, the weight of the correct part of speech is increased by one in each of the word's features, and the weight of the wrongly predicted part of speech is decreased by one;
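A sketch of this ±1 update over the nested dictionary {feature: {part of speech: weight}}:

```python
def update(truth, guess, features, weights):
    """Step (6): reward the correct tag, punish the wrongly guessed one."""
    if truth == guess:
        return  # correct prediction: all weights stay unchanged
    for feat in features:
        w = weights.setdefault(feat, {})
        w[truth] = w.get(truth, 0.0) + 1.0  # +1 for the correct part of speech
        w[guess] = w.get(guess, 0.0) - 1.0  # -1 for the predicted wrong one
```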
Averaging the weights in step (8) means: to improve the accuracy of the perceptron algorithm, the final updated weights cannot be used directly. The problem with the algorithm above is that training on two slightly different sets of examples may produce completely different models, so the model does not generalize well. A bigger problem is that the algorithm pays too much attention to the misclassified points and adjusts the entire model to fit them. The remedy is therefore to return the averaged weights rather than the final weights. Concretely, an additional dictionary is maintained that records the step at which each weight last changed; whenever a feature weight is changed, this value is read out to bring the weight's running total up to date;
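A sketch of this bookkeeping, following the standard lazy-averaging trick the text paraphrases: _tstamps records the update step at which each weight last changed, _totals accumulates each weight multiplied by the number of steps it held its value, and averaging divides the total by the number of steps:

```python
class AveragedWeights:
    """Perceptron weights with lazily maintained running totals (steps (6)/(8))."""

    def __init__(self):
        self.weights = {}   # {feature: {tag: current weight}}
        self._totals = {}   # {(feature, tag): accumulated weight * steps}
        self._tstamps = {}  # {(feature, tag): step of the last change}
        self.i = 0          # number of update steps seen so far

    def _change(self, feat, tag, delta):
        key = (feat, tag)
        w = self.weights.setdefault(feat, {})
        # Bring the running total up to date before changing the weight.
        self._totals[key] = self._totals.get(key, 0.0) + \
            (self.i - self._tstamps.get(key, 0)) * w.get(tag, 0.0)
        self._tstamps[key] = self.i
        w[tag] = w.get(tag, 0.0) + delta

    def update(self, truth, guess, features):
        self.i += 1
        if truth == guess:
            return
        for feat in features:
            self._change(feat, truth, +1.0)
            self._change(feat, guess, -1.0)

    def average(self):
        """Step (8): replace every weight by its average over all steps."""
        for feat, tags in self.weights.items():
            for tag, w in tags.items():
                key = (feat, tag)
                total = self._totals.get(key, 0.0) + \
                    (self.i - self._tstamps.get(key, 0)) * w
                if self.i:
                    tags[tag] = round(total / self.i, 3)
```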
The nested dictionary in step (8) is a data structure of the form {feature 1: {part of speech 1: weight 1, part of speech 2: weight 2, ...}, feature 2: {part of speech 1: weight 1, part of speech 2: weight 2, ...}, ...};
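Storing this structure locally "in the manner of a byte stream" maps naturally onto Python's pickle module; the file name below is an illustrative assumption:

```python
import pickle

def save_model(weights, tagdict, path="tagger.pickle"):
    """Step (8): serialize the averaged weights and high-frequency dictionary."""
    with open(path, "wb") as f:
        pickle.dump((weights, tagdict), f)

def load_model(path="tagger.pickle"):
    """Tagging stage (steps (9)-(13)): restore the model file for comparison."""
    with open(path, "rb") as f:
        return pickle.load(f)
```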
The preprocessing in step (10) means: first, all words are converted to lower case; second, numbers between 1900 and 2200 are replaced by YEAR and other numbers by DIGITS; finally, an eleven-digit number is replaced by TELENUM;
The high-frequency dictionary in step (11) means: in English, roughly half of all words have a fixed part of speech. These words need no prediction and are read directly from the local high-frequency dictionary;
The feature extraction in step (11) uses the following features: the last three letters of the word; the first letter of the word; the part of speech of the previous word in the sentence; the part of speech of the word two positions back; the word itself; the previous word itself; the last three letters of the previous word; the word two positions back itself; the next word itself; the last three letters of the next word; the word two positions ahead itself;
Predicting the part of speech in step (12) means: from weights of the form {feature 1: {part of speech 1: weight 1, part of speech 2: weight 2, ...}, feature 2: {part of speech 1: weight 1, part of speech 2: weight 2, ...}, ...}, the part of speech corresponding to the largest weight value across all features of the word is selected as the predicted part of speech of the word;
The tokens list in step (12) is of the form [(word 1, part of speech 1), (word 2, part of speech 2), ...].
The beneficial effects of the present invention are: the method reaches high accuracy with a small training set, its hardware requirements are modest, its training time is short, and it is simple and easy to implement.
Detailed description of the invention
Fig. 1 is the flow chart of the steps of the invention.
Specific embodiment
The invention is further described below with reference to the accompanying drawing and a specific embodiment.
Embodiment 1: as shown in Fig. 1, a part-of-speech tagging method based on the averaged perceptron algorithm, specifically:
(1) Read training data from the corpus: words are read from the corpus, and a full stop marks the end of a sentence, so the words preceding it are grouped into one sentence. Each sentence is stored in a sentence variable, which is then appended to the train_data list that serves as the training set;
(2) Read one sentence from train_data, in which the words form the words list and the parts of speech form the tags list, specifically: words = [I, love, you], tags = [NNP, VB, NNP];
(3) Preprocess the words obtained in step (2);
(4) Add special tokens before and after the words list from step (3), to prevent errors when processing the first or last word;
(5) Tag each word of the words list obtained in step (3) in turn: look the word up in the high-frequency dictionary; if it is there, its part of speech is determined; if not, extract the word's features;
(6) Predict the word's part of speech from the features extracted in step (5), and update the weights according to the prediction result;
(7) Check whether train_data has been fully processed; if not, repeat steps (2) to (6); if so, proceed to the next step;
(8) Average the weights, and store each feature's candidate parts of speech and their weights locally as a byte stream, using a nested-dictionary data structure;
(9) To tag an input sentence, store the words of the sentence to be processed, in order, in the list words;
(10) Preprocess the words list from step (9);
(11) Tag each word of the words list obtained in step (10) in turn: look the word up in the high-frequency dictionary; if it is there, its part of speech is determined; if not, extract the word's features;
(12) Predict the word's part of speech from the features extracted in step (11), and store it in the tokens list, specifically: tokens = [(I, NNP), (love, VB), (you, NNP)];
(13) Check whether the words list from step (9) has been fully processed; if not, repeat steps (11) to (12); if so, output the tokens list.
The sentence in step (1) is of the form sentence = ([], []), where the first list contains the words and the second list contains the part of speech of each word in the first list, specifically: sentence = ([I, love, you], [NNP, VB, NNP]);
train_data in step (1) is of the form train_data = [sentence1, sentence2, ...], that is, train_data = [([], []), ([], []), ...], where each () is one sentence, specifically: train_data = [([I, love, you], [nnp, vb, nnp]), ([a, good, man], [at, adj, n])];
Preprocessing the words in step (3) means: first, all words are converted to lower case; second, numbers between 1900 and 2200 are replaced by YEAR and other numbers by DIGITS; finally, an eleven-digit number is replaced by TELENUM; specifically, 'There is a tree' is transformed into 'there is a tree';
Adding special tokens before and after the words list in step (4) means: 'START' and 'START2' are added before each words list, and 'END' and 'END2' are added at its end, to prevent errors when processing the first or last word, specifically: [START, START2, I, love, you, END, END2];
The high-frequency dictionary in step (5) means: in English, roughly half of all words have a fixed part of speech. These words need no prediction and are read directly from the local high-frequency dictionary; this not only improves the efficiency of the algorithm but also raises accuracy; specifically: {good: adj, man: n, ...};
The feature extraction in step (5) uses the following features: the last three letters of the word; the first letter of the word; the part of speech of the previous word in the sentence; the part of speech of the word two positions back; the word itself; the previous word itself; the last three letters of the previous word; the word two positions back itself; the next word itself; the last three letters of the next word; the word two positions ahead itself. Specifically, for the sentence 'this be despite a system', the features of 'this' are: 'i suffix his', 'i pref1 t', 'i-1 tag -START-', 'i-2 tag -START2-', 'i word this', 'i-1 suffix T2-', 'i-2 word -START-', 'i+1 word be', 'i+1 suffix be', 'i+2 word despite';
Predicting the part of speech from the features in step (6) means: from weights of the form {feature 1: {part of speech 1: weight 1, part of speech 2: weight 2, ...}, feature 2: {part of speech 1: weight 1, part of speech 2: weight 2, ...}, ...}, the part of speech corresponding to the largest weight value across all features of the word is selected as the predicted part of speech. Specifically, the stored weights for the features of 'this' include (abridged):
{'i suffix his': {'dd1': 3.803, 'zz1_np1@': -1.0, 'nn2': -1.793, 'nn1_jjr': -0.947, 'jj': -0.892, 'appge': 2.185, 'nn1': -0.814, 'np1': -0.541},
'i pref1 t': {'dd1': 2.437, 'to': 1.674, 'cst': 1.827, 'ex': 1.827, 'dd2': 2.649, 'rt': 2.07, ...},
'i-1 tag -START-': {...}, 'i-2 tag -START2-': {...}, 'i tag+i-2 tag -START- -START2-': {...},
'i word this': {'dd1': 3.803, 'zz1_np1@': -1.0, 'nn2': -0.963, 'nn1_jjr': -0.947, 'jj': -0.892},
'i-1 tag+i word -START- this': {'dd1': 1.963, 'zz1_np1@': -1.0, 'nn2': -0.963},
'i-1 word -START2-': {...}, 'i-1 suffix T2-': {...}, 'i-2 word -START-': {...},
'i+1 word be': {'dd1': 1.09, 'np1': 1.092, 'nn2': 1.411, 'csa': 1.492, ...},
'i+1 suffix be': {...},
'i+2 word despite': {'dd1': 1.0, 'zz1_np1@': -1.0, 'ii': 0.686, 'nn1_vv0_jj': -0.686}},
and the part of speech 'dd1', whose weight 3.803 is the largest, is selected;
Updating the weights in step (6) means: if the part of speech predicted from the current features is correct, the weight of each part of speech in each of the word's features is unchanged; if the prediction is wrong, the weight of the correct part of speech is increased by one in each of the word's features, and the weight of the wrongly predicted part of speech is decreased by one;
Averaging the weights in step (8) means: to improve the accuracy of the perceptron algorithm, the final updated weights cannot be used directly. The problem with the algorithm above is that training on two slightly different sets of examples may produce completely different models, so the model does not generalize well. A bigger problem is that the algorithm pays too much attention to the misclassified points and adjusts the entire model to fit them. The remedy is therefore to return the averaged weights rather than the final weights. Concretely, an additional dictionary is maintained that records the step at which each weight last changed; whenever a feature weight is changed, this value is read out to bring the weight's running total up to date;
The nested dictionary in step (8) is a data structure of the form {feature 1: {part of speech 1: weight 1, part of speech 2: weight 2, ...}, feature 2: {part of speech 1: weight 1, part of speech 2: weight 2, ...}, ...}, such as the weight structure illustrated in the prediction example above;
The preprocessing in step (10) means: first, all words are converted to lower case; second, numbers between 1900 and 2200 are replaced by YEAR and other numbers by DIGITS; finally, an eleven-digit number is replaced by TELENUM; specifically, 'There is a tree' is transformed into 'there is a tree';
The high-frequency dictionary in step (11) means: in English, roughly half of all words have a fixed part of speech. These words need no prediction and are read directly from the local high-frequency dictionary;
The feature extraction in step (11) uses the following features: the last three letters of the word; the first letter of the word; the part of speech of the previous word in the sentence; the part of speech of the word two positions back; the word itself; the previous word itself; the last three letters of the previous word; the word two positions back itself; the next word itself; the last three letters of the next word; the word two positions ahead itself;
Predicting the part of speech in step (12) means: from weights of the form {feature 1: {part of speech 1: weight 1, part of speech 2: weight 2, ...}, feature 2: {part of speech 1: weight 1, part of speech 2: weight 2, ...}, ...}, the part of speech corresponding to the largest weight value across all features of the word is selected as the predicted part of speech of the word;
The tokens list in step (12) is of the form [(word 1, part of speech 1), (word 2, part of speech 2), ...], specifically: [(a, at), (good, adj), (man, n)].
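Putting the pieces together, a toy run of the tagging stage (steps (9) to (13)) could look as follows; the high-frequency dictionary and the single weight entry are made-up fixtures for illustration, not values from the trained model:

```python
tagdict = {"good": "adj", "man": "n"}             # toy high-frequency dictionary
weights = {"i word a": {"at": 2.0, "nn1": -0.5}}  # toy nested weight dictionary

def tag_sentence(words):
    tokens = []
    for word in words:
        tag = tagdict.get(word)         # step (11): high-frequency lookup first
        if tag is None:                 # otherwise score the word's features
            feats = ["i word " + word]  # stand-in for the full feature set
            scores = {}
            for feat in feats:          # step (12): pick the best-weighted tag
                for t, w in weights.get(feat, {}).items():
                    scores[t] = scores.get(t, 0.0) + w
            tag = max(scores, key=scores.get) if scores else "nn1"
        tokens.append((word, tag))      # collect (word, tag) pairs
    return tokens                       # step (13): the finished tokens list

print(tag_sentence(["a", "good", "man"]))
# [('a', 'at'), ('good', 'adj'), ('man', 'n')]
```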
The embodiment of the present invention has been explained in detail above with reference to the accompanying drawing, but the present invention is not limited to the above embodiment; various changes may also be made within the knowledge of a person skilled in the art without departing from the concept of the invention.
Claims (4)
1. A part-of-speech tagging method based on the averaged perceptron algorithm, characterized in that:
(1) Read training data from the corpus: words are read from the corpus, and a full stop marks the end of a sentence, so the words preceding it are grouped into one sentence. Each sentence is stored in a sentence variable, which is then appended to the train_data list that serves as the training set;
(2) Read one sentence from train_data, in which the words form the words list and the parts of speech form the tags list;
(3) Preprocess the words obtained in step (2);
(4) Add special tokens before and after the words list from step (3), to prevent errors when processing the first or last word;
(5) Tag each word of the words list obtained in step (3) in turn: look the word up in the high-frequency dictionary; if it is there, its part of speech is determined; if not, extract the word's features;
(6) Predict the word's part of speech from the features extracted in step (5), and update the weights according to the prediction result;
(7) Check whether train_data has been fully processed; if not, repeat steps (2) to (6); if so, proceed to the next step;
(8) Average the weights, and store each feature's candidate parts of speech and their weights locally as a byte stream, using a nested-dictionary data structure;
(9) To tag an input sentence, store the words of the sentence to be processed, in order, in the list words;
(10) Preprocess the words list from step (9);
(11) Tag each word of the words list obtained in step (10) in turn: look the word up in the high-frequency dictionary; if it is there, its part of speech is determined; if not, extract the word's features;
(12) Predict the word's part of speech from the features extracted in step (11), and store it in the tokens list;
(13) Check whether the words list from step (9) has been fully processed; if not, repeat steps (11) to (12); if so, output the tokens list.
2. The part-of-speech tagging method based on the averaged perceptron algorithm according to claim 1, characterized in that the preprocessing of the words in step (3) means: first, all words are converted to lower case; second, numbers between 1900 and 2200 are replaced by YEAR and other numbers by DIGITS; finally, an eleven-digit number is replaced by TELENUM.
3. The part-of-speech tagging method based on the averaged perceptron algorithm according to claim 1, characterized in that the feature extraction in step (5) uses the following features: the last three letters of the word; the first letter of the word; the part of speech of the previous word in the sentence; the part of speech of the word two positions back; the word itself; the previous word itself; the last three letters of the previous word; the word two positions back itself; the next word itself; the last three letters of the next word; the word two positions ahead itself.
4. The part-of-speech tagging method based on the averaged perceptron algorithm according to claim 1, characterized in that updating the weights in step (6) means: if the part of speech predicted from the current features is correct, the weight of each part of speech in each of the word's features is unchanged; if the prediction is wrong, the weight of the correct part of speech is increased by one in each of the word's features, and the weight of the wrongly predicted part of speech is decreased by one.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810561207.4A CN109062887A (en) | 2018-06-04 | 2018-06-04 | Part-of-speech tagging method based on the averaged perceptron algorithm
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810561207.4A CN109062887A (en) | 2018-06-04 | 2018-06-04 | Part-of-speech tagging method based on the averaged perceptron algorithm
Publications (1)
Publication Number | Publication Date |
---|---|
CN109062887A true CN109062887A (en) | 2018-12-21 |
Family
ID=64820276
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201810561207.4A Pending CN109062887A (en) | 2018-06-04 | 2018-06-04 | Part-of-speech tagging method based on the averaged perceptron algorithm
Country Status (1)
Country | Link |
---|---|
CN (1) | CN109062887A (en) |
- 2018
- 2018-06-04 CN CN201810561207.4A patent/CN109062887A/en active Pending
Patent Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20160140104A1 (en) * | 2005-05-05 | 2016-05-19 | Cxense Asa | Methods and systems related to information extraction |
CN101295295A (en) * | 2008-06-13 | 2008-10-29 | 中国科学院计算技术研究所 | Chinese language lexical analysis method based on linear model |
CN102831558A (en) * | 2012-07-20 | 2012-12-19 | 桂林电子科技大学 | System and method for automatically scoring college English compositions independent of manual pre-scoring |
CN107807910A (en) * | 2017-10-10 | 2018-03-16 | 昆明理工大学 | A kind of part-of-speech tagging method based on HMM |
Non-Patent Citations (3)
Title |
---|
TOGETHER_CZ: "Python implementation of part-of-speech tagging, based on the averaged perceptron algorithm", HTTPS://BLOG.CSDN.NET/TOGETHER_CZ/ARTICLE/DETAILS/73821852 * |
LI, ZHENGHUA: "Research on key technologies of Chinese dependency parsing", China Doctoral Dissertations Full-text Database (Information Science and Technology) * |
PU, HAOYU: "Research on detection methods for non-specific events on Twitter", China Master's Theses Full-text Database (Information Science and Technology) * |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109697289A (en) * | 2018-12-28 | 2019-04-30 | 北京工业大学 | It is a kind of improved for naming the Active Learning Method of Entity recognition |
CN109697289B (en) * | 2018-12-28 | 2023-01-13 | 北京工业大学 | Improved active learning method for named entity recognition |
Legal Events
Date | Code | Title | Description
---|---|---|---
| PB01 | Publication | |
| SE01 | Entry into force of request for substantive examination | |
| RJ01 | Rejection of invention patent application after publication | Application publication date: 20181221 |