Summary of the invention
To address the technical deficiencies of the prior art, according to one aspect of the present invention, there is provided a control method for structuring medical text data, used to perform structuring processing on medical text data corresponding to natural language to obtain a structured text, the method including the following steps:
a. performing entity extraction on the medical text data based on a structural model to obtain multiple medical entity mappings, where the structural model includes multiple sequence labels, the sequence labels are formed through model training on the basis of manual annotation, and the medical text data includes multiple word embeddings;
b. combining the multiple medical entity mappings to obtain the structured text.
Preferably, step a includes the following steps:
a1. converting the medical text data into a word-embedding two-dimensional matrix and inputting the matrix into a bidirectional long short-term memory network;
a2. outputting, by the bidirectional long short-term memory network, a medical text data two-dimensional matrix whose length is the sequence length corresponding to the medical text data and whose width is a specified length;
a3. passing the medical text data two-dimensional matrix into a conditional random field and taking the word embeddings corresponding to the highest-scoring sequence label as the medical entity mapping, where the score is determined by the conditional random field based on the structural model according to the global information of the medical text data, and each sequence label corresponds to one score.
Preferably, the following steps are also executed before step a:
i. converting a standard word sequence into a word-embedding two-dimensional matrix and inputting the matrix into the bidirectional long short-term memory network;
ii. outputting, by the bidirectional long short-term memory network, a standard word sequence two-dimensional matrix whose length is the sequence length corresponding to the standard word sequence and whose width is a specified length, and passing the standard word sequence two-dimensional matrix into the conditional random field;
iii. calculating, by the conditional random field, the conditional probability of the structural model and obtaining the loss value of the structural model, and updating the weights of each layer of the structural model using the backpropagation algorithm to optimize the loss value;
iv. repeating steps i, ii and iii until the structural model converges.
Preferably, step b includes the following steps:
b1. performing a word segmentation operation on the medical text data to obtain a text word segmentation result, and performing a word segmentation operation on the multiple medical entity mappings to obtain a medical entity mapping word segmentation result, where the word segmentation operation is completed by a word segmentation model, and the word segmentation model is formed through model training on the basis of manual annotation;
b2. matching the text word segmentation result against the medical entity mapping word segmentation result, and filtering multiple preferred medical entity mappings out of the multiple medical entity mappings based on the matching result;
b3. combining the multiple preferred medical entity mappings to obtain the structured text.
Preferably, the following steps are included after step b:
c. inputting the multiple medical entity mappings into a database for conversion to obtain multiple standard information fragments;
d. combining the multiple standard information fragments to obtain a standard structured text.
According to another aspect of the present invention, there is also provided a control device for structuring medical text data, used to perform structuring processing on medical text data corresponding to natural language to obtain a structured text, including:
an entity extraction device, used to perform entity extraction on the medical text data based on a structural model to obtain multiple medical entity mappings, where the structural model includes multiple sequence labels, the sequence labels are formed through model training on the basis of manual annotation, and the medical text data includes multiple word embeddings;
a structuring device, used to combine the multiple medical entity mappings to obtain the structured text.
Preferably, the entity extraction device includes the following devices:
a first input device, used to convert the medical text data into a word-embedding two-dimensional matrix and input the matrix into a bidirectional long short-term memory network;
a first output device, used for the bidirectional long short-term memory network to output a medical text data two-dimensional matrix whose length is the sequence length corresponding to the medical text data and whose width is a specified length;
a first acquisition device, used to pass the medical text data two-dimensional matrix into a conditional random field and take the word embeddings corresponding to the highest-scoring sequence label as the medical entity mapping, where the score is determined by the conditional random field based on the structural model according to the global information of the medical text data, and each sequence label corresponds to one score.
Preferably, the control device further includes the following devices:
a second input device, used to convert a standard word sequence into a word-embedding two-dimensional matrix and input the matrix into the bidirectional long short-term memory network;
a second output device, used for the bidirectional long short-term memory network to output a standard word sequence two-dimensional matrix whose length is the sequence length corresponding to the standard word sequence and whose width is a specified length, and to pass the standard word sequence two-dimensional matrix into the conditional random field;
a second acquisition device, used for the conditional random field to calculate the conditional probability of the structural model and obtain the loss value of the structural model, and to update the weights of each layer of the structural model using the backpropagation algorithm to optimize the loss value.
Preferably, the structuring device includes the following devices:
a word segmentation device, used to perform a word segmentation operation on the medical text data to obtain a text word segmentation result, and to perform a word segmentation operation on the multiple medical entity mappings to obtain a medical entity mapping word segmentation result, where the word segmentation operation is completed by a word segmentation model, and the word segmentation model is formed through model training on the basis of manual annotation;
a matching device, used to match the text word segmentation result against the medical entity mapping word segmentation result, and to filter multiple preferred medical entity mappings out of the multiple medical entity mappings based on the matching result;
a combination device, used to combine the multiple preferred medical entity mappings to obtain the structured text.
Preferably, the control device further includes the following devices:
a conversion device, used to input the multiple medical entity mappings into a database for conversion to obtain multiple standard information fragments;
a standard combination device, used to combine the multiple standard information fragments to obtain a standard structured text.
The present invention performs entity extraction on medical text data using a structural model that includes multiple sequence labels to obtain medical entity mappings, and forms the final structured text from those medical entity mappings. By using entity extraction, the present invention avoids dependence on a matching dictionary, improves extraction quality, generalization ability and scalability, and reduces maintenance cost. The present invention can also update the structural model automatically, which better suits the continual growth of specialized vocabulary in the medical industry.
Detailed description of the embodiments
Fig. 1 shows a specific embodiment of the present invention: a flow chart of a control method for structuring medical text data, used to perform structuring processing on medical text data corresponding to natural language to obtain a structured text. Those skilled in the art will understand that the content corresponding to the medical text data includes both natural language and medical technical terms. Before the following steps are executed, the medical text data can be converted into multiple word embeddings with a single character as the unit, so that each character is represented by a fixed-length vector for use in subsequent data processing. Specifically, the medical text data can be understood as a character sequence. For example, if the medical text data includes "A, B, C, D, E, F", each character is to be assigned a vector (usually a low-dimensional one): the vector corresponding to A might be [0.3 0.7], the vector corresponding to B might be [-0.3 0.6], and so on until the vectors for all characters are obtained. These are all the word embeddings included in the medical text data, and subsequent operations can be carried out on this basis. Those skilled in the art will understand that this example is illustrative only and does not limit the present invention.
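The character-to-vector conversion described above can be sketched as a simple table lookup. This is a minimal illustration only: the table name `EMBEDDING_TABLE`, the fallback vector `UNKNOWN` and the value for "C" are assumptions, and in practice the vectors would be learned rather than written by hand.

```python
# Hypothetical embedding table: one fixed-length vector per character.
# The vectors for "A" and "B" follow the example in the text; "C" is made up.
EMBEDDING_TABLE = {
    "A": [0.3, 0.7],
    "B": [-0.3, 0.6],
    "C": [0.1, -0.2],
}
UNKNOWN = [0.0, 0.0]  # assumed fallback for characters missing from the table


def embed(text):
    """Return a two-dimensional matrix with one embedding row per character."""
    return [list(EMBEDDING_TABLE.get(ch, UNKNOWN)) for ch in text]


matrix = embed("ABC")  # 3 rows (sequence length) by 2 columns (embedding size)
```

Each row of `matrix` corresponds to one character, so the matrix length equals the sequence length of the text.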
Step S101 is executed first: entity extraction is performed on the medical text data based on the structural model to obtain multiple medical entity mappings, where the structural model includes multiple sequence labels, the sequence labels are formed through model training on the basis of manual annotation, and the medical text data includes multiple word embeddings. Specifically, the sequence labels can be understood as being set manually according to the professional categories of medical terminology. For example, the sequence labels may correspond to: symptom, symptom modifier; examination item, examination result; disease, disease modifier; treatment, treatment modifier; drug, drug modifier; onset time; and so on. Those skilled in the art will understand that, compared with building a dictionary, setting multiple sequence labels is quicker, and the amount of data involved is also considerably smaller than that of a dictionary. More specifically, the multiple medical entity mappings are obtained by using the multiple sequence labels to map segments of the word embeddings of the medical text data. For example, suppose the text content corresponding to the medical text data is "The patient attended the clinic at 9 a.m.; the patient reports a sudden fever of 37 degrees last night, but is currently in good spirits", and the content corresponding to the multiple sequence labels includes "onset time", "symptom" and "symptom description". After entity extraction is performed on the medical text data using the multiple sequence labels, "time: yesterday, symptom: fever, symptom description: 37 degrees" is obtained. The medical text data is thereby refined and can finally be incorporated into a large medical database. As the above description shows, this step does not need to recognize the context of every word in the medical text, which greatly reduces the processing time for the medical text. The purpose of this step is to screen the medical text based on the structural model and thereby contribute to the accumulation of big data.
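As a rough illustration of how per-token labels can be turned into medical entity mappings, the sketch below groups consecutive tokens that carry the same label. The B-/I-/O tagging scheme, the function name `extract_entities` and the English tokens are assumptions made for the example; the text does not prescribe a tagging scheme.

```python
def extract_entities(tokens, tags):
    """Group consecutive tokens that share a label into (label, text) pairs.

    Assumed tag scheme: "B-X" starts an entity of label X, "I-X" continues it,
    and "O" marks a token outside any entity.
    """
    entities = []
    current_label, current_tokens = None, []
    for token, tag in zip(tokens, tags):
        starts_new = tag.startswith("B-") or (
            tag.startswith("I-") and tag[2:] != current_label
        )
        if starts_new:
            if current_label is not None:
                entities.append((current_label, " ".join(current_tokens)))
            current_label, current_tokens = tag[2:], [token]
        elif tag.startswith("I-"):
            current_tokens.append(token)
        else:  # "O" closes any open entity
            if current_label is not None:
                entities.append((current_label, " ".join(current_tokens)))
            current_label, current_tokens = None, []
    if current_label is not None:
        entities.append((current_label, " ".join(current_tokens)))
    return entities


tokens = ["fever", "started", "yesterday", "reaching", "37", "degrees"]
tags = ["B-symptom", "O", "B-time", "O",
        "B-symptom_description", "I-symptom_description"]
entities = extract_entities(tokens, tags)
```

Here `entities` holds one (label, text) pair per extracted entity, mirroring the "time: yesterday, symptom: fever, symptom description: 37 degrees" example.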
Further, step S102 is executed: the multiple medical entity mappings are combined to obtain the structured text. Specifically, this step can be understood as organizing the results of step S101. The structured text may or may not be ordered according to the ordinary logic of natural language. For example, the multiple medical entity mappings obtained in step S101 may be laid out directly in modular form; that is, each medical entity mapping is assigned to a corresponding module, and each module forms part of the medical big data. Taking the specific example of step S101 again, after "time: yesterday, symptom: fever, symptom description: 37 degrees" is obtained, one combination yields "fever of 37 degrees yesterday", while another combination assigns "yesterday", "fever" and "37 degrees" to different modules. The corresponding modules may relate to the population of a particular region, so that statistics can be compiled on the medical big data of the population in that region.
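The two ways of combining entity mappings mentioned above can be sketched as follows. The function names and the (label, text) representation are assumptions chosen for illustration, not a prescribed data format.

```python
entities = [
    ("time", "yesterday"),
    ("symptom", "fever"),
    ("symptom_description", "37 degrees"),
]


def combine_as_sentence(entities):
    """Join the entity texts in order into one short phrase."""
    return " ".join(text for _, text in entities)


def combine_as_modules(entities):
    """File each entity text under a module named after its label."""
    modules = {}
    for label, text in entities:
        modules.setdefault(label, []).append(text)
    return modules


sentence = combine_as_sentence(entities)
modules = combine_as_modules(entities)
```

The modular form keeps each label's values in a separate list, which makes later counting per label (for example, per symptom within a region) straightforward.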
Specifically, Fig. 2 shows a specific implementation of step S102, namely the process of combining the multiple medical entity mappings to obtain the structured text, which includes the following steps:
Step S1021 is executed: a word segmentation operation is performed on the medical text data to obtain a text word segmentation result, and a word segmentation operation is performed on the multiple medical entity mappings to obtain a medical entity mapping word segmentation result. The word segmentation operation is completed by a word segmentation model, which is formed through model training on the basis of manual annotation. Specifically, to reduce the errors produced by entity extraction, the medical entity mappings can be further optimized after entity extraction has produced them. To do so, word segmentation must first be performed on both the medical text data and the medical entity mappings. Word segmentation is a common technique in the prior art. One approach is based on string matching: the character string is scanned, and a substring that is identical to a known word counts as a match. Another approach is based on statistics and machine learning. Those skilled in the art can implement this step with existing word segmentation algorithms.
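The string-matching approach can be illustrated with forward maximum matching, one well-known dictionary-based segmentation technique. It is shown here only as an example of the string-matching family; the text itself uses a trained segmentation model, and the vocabulary and `max_len` value below are assumptions.

```python
def forward_max_match(text, vocabulary, max_len=9):
    """Segment text by repeatedly taking the longest vocabulary word at the cursor.

    A single character is emitted as its own word when nothing longer matches.
    """
    words = []
    i = 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if size == 1 or candidate in vocabulary:
                words.append(candidate)
                i += size
                break
    return words


vocabulary = {"fever", "yesterday"}  # hypothetical word list
words = forward_max_match("feveryesterday", vocabulary)
```

Unspaced English is used here as a stand-in for text written without word delimiters.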
Further, step S1022 is executed: the text word segmentation result is matched against the medical entity mapping word segmentation result, and multiple preferred medical entity mappings are filtered out of the multiple medical entity mappings based on the matching result. Specifically, the purpose of this step is to proofread the medical entity mappings against the medical text data; the reason is that the medical text data is the original text and is therefore more objective as a reference. Preferably, in addition to proofreading against the medical text data, proofreading can also be based on the logical structure of the multiple medical entity mappings themselves, that is, by judging the logical coherence of all the medical entity mapping word segmentation results; the preferred medical entity mappings are then obtained on this basis.
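One possible matching rule is sketched below: an entity mapping is kept only if its segmented words appear as a contiguous run in the segmented original text. The text does not specify the matching rule, so this criterion, the function names and the use of whitespace splitting as the segmenter are all assumptions.

```python
def is_contiguous_run(words, text_words):
    """Return True if `words` occurs as a contiguous slice of `text_words`."""
    n = len(words)
    return n > 0 and any(
        text_words[i:i + n] == words for i in range(len(text_words) - n + 1)
    )


def filter_preferred(entities, text_words, segment=str.split):
    """Keep entity mappings whose segmentation matches the text segmentation."""
    return [
        (label, text)
        for label, text in entities
        if is_contiguous_run(segment(text), text_words)
    ]


text_words = ["patient", "reports", "fever", "yesterday", "37", "degrees"]
entities = [
    ("symptom", "fever"),
    ("time", "yester"),  # truncated extraction, should be rejected
    ("symptom_description", "37 degrees"),
]
preferred = filter_preferred(entities, text_words)
```

In this example the truncated entity "yester" does not line up with any word boundary in the original text and is filtered out.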
Further, step S1023 is executed: the multiple preferred medical entity mappings are combined to obtain the structured text.
Fig. 3 shows the first embodiment of the present invention, a flow chart of a control method for structuring medical text data, which specifically includes the following steps:
Step S201 is executed: the medical text data is converted into a word-embedding two-dimensional matrix and then input into a bidirectional long short-term memory network. Specifically, the conversion of the medical text data into a word-embedding two-dimensional matrix can be understood with reference to the description of word embeddings given above for Fig. 1; that is, the medical text data is represented by a two-dimensional matrix of vectors. To explain the bidirectional long short-term memory network, the long short-term memory network ("LSTM") must first be introduced. The LSTM emerged to solve problems of the recurrent neural network ("RNN") and achieves its function by improving the hidden layer of the RNN. An LSTM is in essence still an RNN and can be understood as a network improved on the RNN architecture, which achieves better computational results through the cooperation of multiple network layers. An LSTM includes at least a cell for memory, an Input Gate and an Output Gate for parameter input and output, and a Forget Gate for forgetting. On this basis, the bidirectional long short-term memory network ("BiLSTM") can be understood as an improvement on the bidirectional recurrent neural network ("BiRNN"). A BiRNN differs from an RNN in that it can access not only past context but also future context. Its basic idea is to process each training sequence with two RNNs, one forward and one backward, both connected to one output layer, so that the output layer is supplied with complete past and future context for every point in the input sequence. Correspondingly, the BiLSTM improves on the BiRNN by adding bidirectional cell units to it, which can be understood with reference to the first half of this paragraph. Those skilled in the art will understand that the above network architectures also correspond to specific algorithms in practical applications, but these are not the focus of the present invention and are not described further here.
Further, step S202 is executed: the bidirectional long short-term memory network outputs a medical text data two-dimensional matrix whose length is the sequence length corresponding to the medical text data and whose width is a specified length. Specifically, the sequence length corresponding to the medical text data is determined by the length of the multiple word embeddings included in the medical text data; the specified length is a manually preset value, and different values can be set for different algorithms. More specifically, the medical text data two-dimensional matrix can be understood as the two-dimensional matrix obtained after the BiLSTM processes the vectors corresponding to all the word embeddings.
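To make the gate structure and the shape of the output matrix concrete, the following toy sketch runs a one-unit LSTM forward and backward over a sequence and pairs the two hidden states at each position. It is a deliberately reduced illustration, assuming scalar inputs, a single hidden unit per direction and hand-picked weights (`FWD`, `BWD`); a real BiLSTM uses vector inputs and learned weight matrices.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def lstm_step(x, h_prev, c_prev, params):
    """One LSTM step for a scalar input and a single hidden unit (toy size)."""
    def gate(name, activation):
        w_x, w_h, b = params[name]
        return activation(w_x * x + w_h * h_prev + b)

    i = gate("input", sigmoid)        # how much new information to write
    f = gate("forget", sigmoid)       # how much of the old cell state to keep
    o = gate("output", sigmoid)       # how much of the cell state to expose
    g = gate("candidate", math.tanh)  # candidate value for the cell
    c = f * c_prev + i * g
    h = o * math.tanh(c)
    return h, c


def run(sequence, params):
    """Run the cell over a sequence and collect the hidden state at each step."""
    h, c, outputs = 0.0, 0.0, []
    for x in sequence:
        h, c = lstm_step(x, h, c, params)
        outputs.append(h)
    return outputs


def bidirectional(sequence, fwd_params, bwd_params):
    """Pair forward and backward hidden states for every position."""
    forward = run(sequence, fwd_params)
    backward = run(sequence[::-1], bwd_params)[::-1]
    return [[f, b] for f, b in zip(forward, backward)]


# Hypothetical weights: (input weight, recurrent weight, bias) per gate.
FWD = {"input": (0.5, 0.1, 0.0), "forget": (0.3, 0.2, 1.0),
       "output": (0.4, 0.1, 0.0), "candidate": (0.9, 0.3, 0.0)}
BWD = {"input": (0.2, 0.4, 0.0), "forget": (0.1, 0.3, 1.0),
       "output": (0.6, 0.2, 0.0), "candidate": (0.7, 0.1, 0.0)}

matrix = bidirectional([0.3, -0.3, 0.1], FWD, BWD)
```

The result has one row per input position (the sequence length) and a fixed number of columns (here 2, one per direction), which corresponds to the "specified length" width described in step S202.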
Further, step S203 is executed: the medical text data two-dimensional matrix is passed into a conditional random field, and the word embeddings corresponding to the highest-scoring sequence label are taken as the medical entity mapping; the score is determined by the conditional random field based on the structural model according to the global information of the medical text data, and each sequence label corresponds to one score. Specifically, the conditional random field ("CRF") is one of the algorithms commonly used in natural language processing, typically for syntactic analysis, named entity recognition, part-of-speech tagging and the like. It uses a Markov chain as the score transition model for the hidden variables and infers the hidden variables from the observable states; the scores are obtained from statistics over an annotated corpus, and the model is discriminative. A CRF is in essence a Markov chain over hidden variables together with scores from observable states to hidden variables. In prior-art applications, taking part-of-speech scores as an example, the part-of-speech labels are assumed to satisfy the Markov property, i.e., the current part of speech has a score transition relationship only with the preceding part of speech and is independent of the parts of speech at other positions. For example, the score for an adjective being followed by another adjective is 0.5, the score for it being followed by a modifying particle is 0.5, and the score for it being followed by a verb is 0.
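Selecting the highest-scoring labelling in a linear-chain CRF is commonly done with Viterbi decoding, sketched below. This illustrates the conventional adjacency-based CRF described in this paragraph, not necessarily the global judgment the invention adopts; the label names and all score values are made up for the example.

```python
def viterbi(emissions, transitions, labels):
    """Return the highest-scoring label path and its score.

    emissions[t][k] is the score of label k at position t;
    transitions[j][k] is the score of moving from label j to label k.
    """
    n = len(labels)
    best = list(emissions[0])
    backpointers = []
    for t in range(1, len(emissions)):
        new_best, pointers = [], []
        for k in range(n):
            prev = max(range(n), key=lambda j: best[j] + transitions[j][k])
            new_best.append(best[prev] + transitions[prev][k] + emissions[t][k])
            pointers.append(prev)
        best = new_best
        backpointers.append(pointers)
    last = max(range(n), key=lambda k: best[k])
    path = [last]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    path.reverse()
    return [labels[k] for k in path], best[last]


labels = ["O", "symptom"]
emissions = [[1.0, 0.0], [0.0, 2.0], [0.5, 0.4]]  # hypothetical scores
transitions = [[0.5, 0.0], [0.0, 0.5]]            # hypothetical scores
path, score = viterbi(emissions, transitions, labels)
```

In this example the third position prefers "O" on its emission score alone, but the transition score from "symptom" to "symptom" tips the overall best path toward "symptom".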
Further, based on the above description of the CRF and its prior-art applications, a CRF is usually applied to judge targets that stand in an adjacency relationship; in the part-of-speech judgment above, for example, it is confined to the score transition relationship between the current part of speech and the preceding or following one. In this step, by contrast, the CRF is used to judge the score with which a word embedding matches a sequence label. Unlike the prior art, this step makes a global judgment rather than one based on front-to-back positional (i.e., adjacency) relationships. That is, to judge whether a given word embedding can serve as a medical entity mapping, the relationship between the sequence label and all the sequence labels included in the structural model must be considered comprehensively to obtain the score of that sequence label; the specific judgment algorithm can be implemented with reference to the prior art. Such a technical solution can be adopted because the present invention does not need to obtain the value of the text corresponding to each word embedding, but only needs to use the CRF to screen out the word embeddings that match a sequence label. More specifically, a prior-art judgment would be made among all the word embeddings, whereas the present invention converts this into a judgment between word embeddings and sequence labels; this is the process of judging "based on the structural model according to the global information of the medical text data". Those skilled in the art will understand that the data volume of the sequence labels is significantly smaller than that of all the word embeddings, so avoiding judgment by front-to-back positional (adjacency) relationships can significantly improve computational efficiency. At the same time, entity extraction avoids dependence on a matching dictionary, improves extraction quality, generalization ability and scalability, and reduces maintenance cost.
Further, step S204 is executed: the multiple medical entity mappings are combined to obtain the structured text. This step can be understood with reference to step S102.
Fig. 4 shows the second embodiment of the present invention, a flow chart for training the structural model. Specifically, the process of training the structural model can be carried out before step S101 shown in Fig. 1 is executed, and includes the following steps:
Step S301 is executed first: a standard word sequence is converted into a word-embedding two-dimensional matrix and then input into the bidirectional long short-term memory network. Specifically, the "standard word sequence" here plays the role of the "medical text data" in step S201, so this step can be understood with reference to the description of step S201.
Further, step S302 is executed: the bidirectional long short-term memory network outputs a standard word sequence two-dimensional matrix whose length is the sequence length corresponding to the standard word sequence and whose width is a specified length, and the standard word sequence two-dimensional matrix is passed into the conditional random field. Specifically, the "standard word sequence" here plays the role of the "medical text data" in step S202, so this step can be understood with reference to the description of step S202.
Further, step S303 is executed: the conditional random field calculates the conditional probability of the structural model and obtains the loss value of the structural model, and the weights of each layer of the structural model are updated using the backpropagation algorithm to optimize the loss value. Specifically, the standard word sequence mentioned in steps S301 and S302 can be understood as a training sample, which can be varied according to the actual training situation. The calculation of the conditional probability of the sequence labels by the conditional random field can be understood as the problem of computing, given a conditional random field P(Y | X), an input sequence x and an output sequence y, the conditional probabilities P(Y_i = y_i | x) and P(Y_{i-1} = y_{i-1}, Y_i = y_i | x) and the corresponding mathematical expectations, where x and y are the scores corresponding to the sequence labels; the specific formulas can be implemented with reference to the prior art. Correspondingly, after the conditional probabilities of all the sequence labels of the structural model are obtained, the mean square error of all the conditional probabilities of the sequence labels is calculated, and the loss value of the structural model is obtained from this mean square error. The loss value is used to judge whether the current structural model is sufficiently complete.
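For a linear-chain CRF, the conditional probability of one label sequence is its exponentiated score divided by the sum over all label sequences, and that sum can be computed with the forward algorithm. The sketch below shows this standard computation under assumed score values; it does not reproduce the mean-square-error loss described in the text, whose exact formula is not given.

```python
import math
from itertools import product


def sequence_score(emissions, transitions, path):
    """Sum of emission and transition scores along one label path."""
    score = emissions[0][path[0]]
    for t in range(1, len(path)):
        score += transitions[path[t - 1]][path[t]] + emissions[t][path[t]]
    return score


def log_partition(emissions, transitions):
    """Forward algorithm: log of the summed exp-scores over all label paths."""
    n = len(emissions[0])
    alpha = list(emissions[0])
    for t in range(1, len(emissions)):
        alpha = [
            math.log(sum(math.exp(alpha[j] + transitions[j][k]) for j in range(n)))
            + emissions[t][k]
            for k in range(n)
        ]
    return math.log(sum(math.exp(a) for a in alpha))


def conditional_probability(emissions, transitions, path):
    """P(y | x) for one label path under the given scores."""
    log_z = log_partition(emissions, transitions)
    return math.exp(sequence_score(emissions, transitions, path) - log_z)


emissions = [[1.0, 0.0], [0.0, 2.0], [0.5, 0.4]]  # hypothetical scores
transitions = [[0.5, 0.0], [0.0, 0.5]]            # hypothetical scores
p = conditional_probability(emissions, transitions, (0, 1, 1))
total = sum(
    conditional_probability(emissions, transitions, path)
    for path in product(range(2), repeat=3)
)
```

Summing the probability over every possible path gives 1 up to rounding, which is a convenient sanity check on the forward computation.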
Further, after the loss value is obtained, it is also optimized in order to improve the accuracy of the current structural model. Specifically, the backpropagation algorithm is currently the most commonly used and most effective algorithm for training artificial neural networks ("ANN"). Its main idea is as follows: the training data is fed into the input layer of the ANN, passes through the hidden layers, and finally reaches the output layer, which outputs a result; this is the forward propagation process of the ANN. Because the output of the ANN differs from the actual result, the error between the estimated value and the actual value is calculated and propagated backward from the output layer through the hidden layers until it reaches the input layer. During backpropagation, the parameters are adjusted according to the error. This process is iterated until convergence.
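The forward pass, error calculation, backward pass and parameter update described above can be sketched on a tiny network with one hidden unit. The network shape, learning rate, stopping tolerance and training data are all assumptions chosen to keep the example small; they are not the structural model of the invention.

```python
import math


def train(samples, w1=0.5, w2=0.5, lr=0.1, max_epochs=5000, tol=1e-10):
    """Fit y_hat = w2 * tanh(w1 * x) by gradient descent with backpropagation."""
    losses = []
    for _ in range(max_epochs):
        grad_w1 = grad_w2 = loss = 0.0
        for x, y in samples:
            h = math.tanh(w1 * x)  # forward: hidden layer
            y_hat = w2 * h         # forward: output layer
            error = y_hat - y      # difference between estimate and target
            loss += 0.5 * error ** 2
            grad_w2 += error * h                      # backward: output weight
            grad_w1 += error * w2 * (1 - h ** 2) * x  # backward: hidden weight
        losses.append(loss)
        if len(losses) > 1 and abs(losses[-2] - losses[-1]) < tol:
            break  # treat a negligible change in loss as convergence
        w1 -= lr * grad_w1
        w2 -= lr * grad_w2
    return w1, w2, losses


# Hypothetical training set generated from the target function y = tanh(x).
samples = [(x, math.tanh(x)) for x in (-1.0, -0.5, 0.5, 1.0)]
w1, w2, losses = train(samples)
```

The list `losses` records the loss before each update, so comparing its first and last entries shows how far training reduced the error.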
Those skilled in the art will understand that, to obtain a structural model that approaches perfection, steps S301, S302 and S303 must be repeated until the structural model converges; that is, steps S301 to S303 form one training cycle.
Fig. 5 shows the fourth embodiment of the present invention, a flow chart of a control method for structuring medical text data to obtain a standard structured text, which specifically includes the following steps:
Step S501 is executed: entity extraction is performed on the medical text data based on the structural model to obtain multiple medical entity mappings, where the structural model includes multiple sequence labels, the sequence labels are formed by manual annotation, and the medical text data includes multiple word embeddings. Step S502 is executed: the multiple medical entity mappings are combined to obtain the structured text. Steps S501 and S502 can be understood with reference to steps S101 and S102.
Further, step S503 is executed: the multiple medical entity mappings are input into a database for conversion to obtain multiple standard information fragments. Those skilled in the art will understand that, to make the format of the final structured text more standardized and uniform, a database can be established in advance and used to convert the medical entity mappings into standard information, for example into the format [type, description, value, time, additional information], so as to give a standard expression of each dimension of information in the medical text.
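The conversion into the [type, description, value, time, additional information] format can be sketched with a small lookup table standing in for the database. The table contents, the function name `to_standard_fragment` and the choice to return None for unknown terms are illustrative assumptions.

```python
# Hypothetical lookup table standing in for the database described in the text.
# Each surface form maps to a (type, standard description) pair.
STANDARD_TERMS = {
    "fever": ("symptom", "fever"),
    "high temperature": ("symptom", "fever"),
    "cough": ("symptom", "cough"),
}


def to_standard_fragment(entity_text, value=None, time=None, additional=None):
    """Convert one entity mapping into [type, description, value, time, additional]."""
    entry = STANDARD_TERMS.get(entity_text.lower())
    if entry is None:
        return None  # assumed behavior: unknown terms are left for manual review
    entity_type, description = entry
    return [entity_type, description, value, time, additional]


fragment = to_standard_fragment(
    "High temperature", value="37 degrees", time="yesterday"
)
```

Because different surface forms map to the same standard description, fragments produced from differently worded records become directly comparable.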
Further, step S504 is executed: the multiple standard information fragments are combined to obtain a standard structured text. Specifically, the fragments may or may not be combined according to the logical structure of natural language; either way achieves the object of the present invention.
As another embodiment of the present invention, Fig. 6 shows a functional block diagram of a control device for structuring medical text data, used to perform structuring processing on medical text data corresponding to natural language to obtain a structured text, including:
an entity extraction device 10, used to perform entity extraction on the medical text data based on a structural model to obtain multiple medical entity mappings, where the structural model includes multiple sequence labels, the sequence labels are formed through model training on the basis of manual annotation, and the medical text data includes multiple word embeddings;
a structuring device 20, used to combine the multiple medical entity mappings to obtain the structured text;
a conversion device 30, used to input the multiple medical entity mappings into a database for conversion to obtain multiple standard information fragments;
a standard combination device 40, used to combine the multiple standard information fragments to obtain a standard structured text.
Preferably, the entity extraction device 10 includes the following devices:
a first input device 101, used to convert the medical text data into a word-embedding two-dimensional matrix and input the matrix into a bidirectional long short-term memory network;
a first output device 102, used for the bidirectional long short-term memory network to output a medical text data two-dimensional matrix whose length is the sequence length corresponding to the medical text data and whose width is a specified length;
a first acquisition device 103, used to pass the medical text data two-dimensional matrix into a conditional random field and take the word embeddings whose score is greater than a first score threshold as the medical entity mapping, where the score is determined by the conditional random field based on the structural model according to the global information of the medical text data, and the score corresponds to the word embedding.
Preferably, the structuring device 20 further includes the following devices:
a word segmentation device 201, used to perform a word segmentation operation on the medical text data to obtain a text word segmentation result, and to perform a word segmentation operation on the multiple medical entity mappings to obtain a medical entity mapping word segmentation result;
a matching device 202, used to match the text word segmentation result against the medical entity mapping word segmentation result, and to filter multiple preferred medical entity mappings out of the multiple medical entity mappings based on the matching result;
a combination device 203, used to combine the multiple preferred medical entity mappings to obtain the structured text.
Fig. 7 shows the fifth embodiment of the present invention, a functional block diagram of a control device for training the structural model, including:
a second input device 104, used to convert a standard word sequence into a word-embedding two-dimensional matrix and input the matrix into the bidirectional long short-term memory network;
a second output device 105, used for the bidirectional long short-term memory network to output a standard word sequence two-dimensional matrix whose length is the sequence length corresponding to the standard word sequence and whose width is a specified length, and to pass the standard word sequence two-dimensional matrix into the conditional random field;
a second acquisition device 106, used for the conditional random field to calculate the conditional probability of the structural model and obtain the loss value of the structural model, and to update the weights of each layer of the structural model using the backpropagation algorithm to optimize the loss value.
Those skilled in the art will understand that the functions of the devices in Figs. 6 and 7 above can be implemented in hardware, in software executed by a processor, or in a combination of the two. Specifically, if implemented by software modules, the program can be burned into the processor in advance, or the software can be installed in a preset system; if implemented in hardware, a field-programmable gate array (FPGA) can be used to fix the corresponding functions in hardware.
Further, the software module may be stored in RAM, flash memory, ROM, EPROM, a hard disk, or any other form of storage medium known in the art. The storage medium is coupled to the processor so that the processor can read information from, and write information to, the storage medium. As a variation, the storage medium may be an integral part of the processor, or the processor and the storage medium may both reside on an application-specific integrated circuit (ASIC).
Further, the hardware may be a general-purpose processor capable of implementing the specific functions, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, discrete gate or transistor logic, discrete hardware components, or a combination of the above. As a variation, it may also be implemented by a combination of computing devices, for example a combination of a DSP and a microprocessor, a combination of multiple microprocessors, or a combination of one or more microprocessors in communication with a DSP.
Specific embodiments of the present invention have been described above. It should be understood that the present invention is not limited to the particular embodiments described; those skilled in the art can make various variations or modifications within the scope of the claims without affecting the substance of the present invention.