CN1965349A - Multimodal disambiguation of speech recognition - Google Patents

Multimodal disambiguation of speech recognition

Info

Publication number
CN1965349A
Authority
CN
China
Prior art keywords
input
word
candidate
speech recognition
mobile device
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CNA2005800178056A
Other languages
Chinese (zh)
Inventor
M·朗格
R·埃亚德
K·C·贺尔费什
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
AMERICAN ON-LINE
Original Assignee
AMERICAN ON-LINE
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by AMERICAN ON-LINE
Publication of CN1965349A
Legal status: Pending

Landscapes

  • Machine Translation (AREA)

Abstract

The present invention provides a speech recognition system combined with one or more alternate input modalities to ensure efficient and accurate text input. The speech recognition system achieves less than perfect accuracy due to limited processing power, environmental noise, and/or natural variations in speaking style. The alternate input modalities use disambiguation or recognition engines to compensate for reduced keyboards, sloppy input, and/or natural variations in writing style. The ambiguity remaining in the speech recognition process is mostly orthogonal to the ambiguity inherent in the alternate input modality, such that the combination of the two modalities resolves the recognition errors efficiently and accurately. The invention is especially well suited for mobile devices with limited space for keyboards or touch-screen input.

Description

Multimodal disambiguation of speech recognition
Technical field
The invention relates to user entry of information into a system with an input device. More particularly, the invention relates to speech recognition combined with disambiguating systems for text input.
Background
For many years, portable computers have been getting smaller and smaller. The principal size-limiting component in the effort to produce a smaller portable computer has been the keyboard. If standard typewriter-size keys are used, the portable computer must be at least as large as a standard keyboard. Miniature keyboards have been used on portable computers, but their keys are too small for a user to operate easily or quickly. Incorporating a full-size keyboard in a portable computer also hinders truly portable use of the computer. Most portable computers cannot be operated unless they are placed on a flat work surface so that the user can type with both hands. A user cannot easily use a portable computer while standing or moving.
At present, tremendous growth in the wireless industry has produced large numbers of reliable, convenient, and very popular mobile devices available to the average consumer, such as mobile phones, PDAs, and so on. As a result, handheld wireless communication and computing devices that require text input are becoming smaller still. Recent advances in mobile phones and other portable wireless technologies have created a demand for small, portable two-way messaging systems. Most wireless communication device manufacturers also want to offer consumers devices that can be operated with the same hand that holds the device.
Speech recognition has long been expected to be the best means of text input, both as a way to raise productivity on the desktop computer and as a solution to the size limitations of mobile devices. A speech recognition system typically includes a microphone to detect and record the voice input. The voice input is digitized and analyzed to extract a speech pattern. Speech recognition typically requires a powerful system to process the voice input. Some speech recognition systems with limited capability have been used on small devices, for example for command and control on mobile phones, but for voice-controlled operation a device only needs to recognize a few commands. Even for such a narrow scope of speech recognition, a small device may not achieve satisfactory accuracy, because speech patterns vary greatly between speakers and environmental noise adds complexity to the signal.
Suhm et al. discuss a particular problem of speech recognition in a paper published in ACM Transactions on Computer-Human Interaction (2001). The "repair problem" is that of correcting the errors that occur due to imperfect recognition. They found that using the same modality (re-speaking) was unlikely to correct the recognition error, due in large part to the "Lombard" effect, in which people speak differently than usual after they are initially misunderstood, and that using a different modality, such as a keyboard, was a much more effective and efficient remedy. Unfortunately, mobile devices in particular lack the processing power and memory to offer full speech recognition capability, resulting in even higher recognition error rates, and lack the physical space to offer full keyboard and mouse input for efficiently correcting the errors.
Disambiguation
Prior development work has considered the use of a keyboard with a reduced number of keys. As suggested by the keypad layout of a touch-tone telephone, many reduced keyboards use a 3-by-4 array of keys. Each key in the array carries several characters. Ambiguity therefore arises as a user enters a sequence of keys, because each keystroke may indicate one of several letters. Several approaches have been suggested for resolving the ambiguity of the keystroke sequence. Such approaches are referred to as disambiguation.
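The ambiguity described above can be illustrated with a short sketch. It assumes the common telephone letter layout and a tiny illustrative lexicon; the names KEYPAD, key_sequence, and matches are hypothetical and are not taken from any of the cited systems.

```python
# Letter layout commonly printed on a 12-key telephone keypad.
KEYPAD = {
    "2": "abc", "3": "def", "4": "ghi", "5": "jkl",
    "6": "mno", "7": "pqrs", "8": "tuv", "9": "wxyz",
}
LETTER_TO_KEY = {ch: key for key, letters in KEYPAD.items() for ch in letters}


def key_sequence(word):
    """Return the digit sequence pressed to type `word` (letters a-z only)."""
    return "".join(LETTER_TO_KEY[ch] for ch in word.lower())


def matches(sequence, lexicon):
    """Return every lexicon word whose key sequence equals `sequence`."""
    return [w for w in lexicon if key_sequence(w) == sequence]


# Illustrative lexicon only; a real system would use a full dictionary.
LEXICON = ["good", "home", "gone", "hood", "hoof", "winter"]

# One four-key sequence corresponds to several different words.
print(matches("4663", LEXICON))
```

Because a single key sequence maps to several words, some further source of information (word frequency, context, or a second modality) is needed to pick the intended one.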
Some suggested approaches for determining the correct character sequence that corresponds to an ambiguous keystroke sequence are summarized by J. Arnott and M. Javad in their paper "Probabilistic Character Disambiguation for Reduced Keyboards Using Small Text Samples," published in the Journal of the International Society for Augmentative and Alternative Communication.
T9 Text Input, based on U.S. Pat. No. 5,818,437 and subsequent patents, is the leading commercial product offering word-level disambiguation for reduced keyboards such as telephone keypads. Ordering the ambiguous words by frequency of use reduces the efficiency problems identified in earlier research, and the ability to add new words makes the product easier to use over time. Input sequences may be interpreted simultaneously as words, word stems and/or completions, numbers, and unambiguous character strings, based on stylus tap location or on keying patterns such as multi-tap.
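Frequency ordering and the addition of new words can be sketched as follows. This is a minimal illustration rather than the method of the cited product: the class name FrequencyLexicon, its methods, and the usage counts are all assumptions made for the example.

```python
KEYPAD = {
    "2": "abc", "3": "def", "4": "ghi", "5": "jkl",
    "6": "mno", "7": "pqrs", "8": "tuv", "9": "wxyz",
}
LETTER_TO_KEY = {ch: key for key, letters in KEYPAD.items() for ch in letters}


def key_sequence(word):
    """Return the digit sequence pressed to type `word` (letters a-z only)."""
    return "".join(LETTER_TO_KEY[ch] for ch in word.lower())


class FrequencyLexicon:
    """Hypothetical word list that orders ambiguous matches by usage count."""

    def __init__(self, counts):
        self.counts = dict(counts)

    def add_word(self, word, count=1):
        """Add a new word, or raise the count of a word already present."""
        self.counts[word] = self.counts.get(word, 0) + count

    def candidates(self, sequence):
        """Words matching the key sequence, most frequently used first."""
        hits = [w for w in self.counts if key_sequence(w) == sequence]
        return sorted(hits, key=lambda w: (-self.counts[w], w))


# The counts below are invented for illustration.
lexicon = FrequencyLexicon({"good": 500, "home": 400, "gone": 120, "hood": 15})
print(lexicon.candidates("4663"))
lexicon.add_word("hoof", 2)
print(lexicon.candidates("4663"))
```

Ties are broken alphabetically here only to keep the ordering deterministic; a real system might break ties by recency of use instead.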
The T9 and the similar products like that also have the miniature keyboard of (alphabetic) language (as Chinese) of (ideographic) that sacrificial vessel expresses the meaning but not letter to use.These products usually adopt a kind of in the following dual mode: basic handwritten stroke or stroke classification mapped (map) are extremely on the available button; And the stroke of the character wanted according to the input of traditional order of user; Or one phonetic letter be mapped on these buttons and the voice spelling of the character that user input is wanted.No matter any method, the user must find out and choose the character of being wanted from many characters that meet input sequence.These input products all can benefit from the context (context) of the character imported before usually, in order to improving the DISPLAY ORDER of the most normal character that is used, because define an individual character or phrase usually need two or more characters of expressing the meaning.
Unfortunately, mobile devices are being designed with ever-smaller keypads, whose keys are more stylish but also more difficult to type on quickly and accurately. Disambiguation of ambiguous keystroke sequences could therefore benefit from further improvement. For example, the syntactic or application context is not usually taken into account when disambiguating an entered sequence or when predicting the next one.
Another keyboard commonly used on small devices consists of a touch-sensitive panel on which some type of keyboard overlay has been printed, or a touch-sensitive screen on which a keyboard is displayed. Depending on the size and nature of the specific keyboard, either a finger or a stylus can be used to interact with the panel or display screen in the area associated with the key or letter that the user intends to activate. Because many portable devices are small, a stylus is often used to attain the accuracy needed to tap each key. The small overall size of such keyboards means that the area associated with each key is also small, so it is difficult for the average user to type with sufficient accuracy.
A number of built-in and add-on products offer word prediction for touch-screen keyboards such as those described above. After the user carefully taps the first letters of a word, the prediction system displays a list of the most likely complete words that start with those letters. If there are too many possible choices, however, the user must keep typing until the desired word appears or the word is completed. Switching visual focus between the touch-screen keyboard and the list of word completions after every letter tends to slow text entry rather than accelerate it.
The system described in U.S. Pat. No. 6,801,190 uses word-level auto-correction to resolve the accuracy problem and permit rapid entry on small keyboards. Because tap locations are presumed to be inaccurate, there is some ambiguity as to what the user intended to type. The user is presented with one or more interpretations of each keystroke sequence corresponding to a word, so that the user can easily select the desired interpretation. This approach lets the system use the information contained in the entire keystroke sequence to resolve the user's intention for each character of the sequence. When auto-correction is enabled, however, the system may not be able to offer many word completions, because it does not presume that the first letters are correct, cannot determine whether the user is typing the entire word, and may have many other interpretations of the key sequence to display.
Handwriting recognition is another approach taken to solve the text input problem on small devices that have a touch-sensitive screen or a pad that detects the motion of a finger or stylus. Writing on the touch-sensitive panel or display screen generates a stream of data input indicating the contact points. Handwriting recognition software analyzes the geometric characteristics of this data stream to determine each character or word.
Unfortunately, current handwriting recognition solutions have many problems:
1) handwriting is generally slower than typing;
2) on small devices, memory limitations reduce handwriting recognition accuracy; and
3) individual handwriting styles may differ greatly from those used to train the handwriting software.
Because of these problems, many handwriting products require the user to learn a very specific set of strokes for each letter. These specific stroke sets are designed to simplify the geometric pattern recognition process of the system and to increase the recognition rate. The strokes are often very different from the natural way in which the letter is written. This results in very low product adoption.
Handwriting input on mobile devices poses further challenges to recognition accuracy: the orientation of handwriting while the user tries to hold the device may vary or skew the input; and use while on the move, for example amid the vibration or bumpiness of a bus ride, causes loss of contact with the touch screen, resulting in "noise" in the stream of contact points.
Therefore, current ambiguous and recognizer-based systems for text input, while compensating somewhat for the constraints imposed by small devices, have limitations that reduce their speed and accuracy to a level that users might consider unacceptable.
Suhm's paper defines "multimodal error correction" as using an alternate (non-speech) modality to re-enter the entire word or phrase that was misrecognized. This is found to be more efficient than re-speaking, in part because the speech modality has already been shown to be inaccurate. The user considers the alternate modality's own recognition accuracy problems when deciding which modality to use for re-entry, but each modality is operated independently in the attempt to complete the text entry task.
It would therefore be advantageous to provide an apparatus and method for speech recognition that offers smart editing of speech recognition output.
It would be advantageous to provide an apparatus and method for speech recognition that maximizes the benefit of an alternate input modality in correcting recognition errors.
It would be advantageous to provide an apparatus and method for speech recognition that offers an efficient alternate input modality when speech recognition is ineffective or undesirable for the current task or environment.
Summary of the invention
The invention provides a speech recognition system combined with one or more alternate input modalities to ensure efficient and accurate text input. The speech recognition system achieves less than perfect accuracy due to limited processing power, environmental noise, and/or natural variations in speaking style. The alternate input modalities use disambiguation or recognition engines to compensate for reduced keyboards, sloppy input, and/or natural variations in writing style. The ambiguity remaining in the speech recognition process is mostly orthogonal to the ambiguity inherent in the alternate input modality, such that the combination of the two modalities resolves the recognition errors efficiently and accurately. The invention is especially well suited for mobile devices with limited space for keyboard or touch-screen input.
One embodiment of the invention provides a method for processing language input in a data processing system, comprising the steps of: receiving a first input comprising voice input; determining a first plurality of word candidates according to the first input; receiving a second input comprising non-voice input; and determining one or more word candidates according to the first input and the second input. The one or more word candidates are determined based on the second input under the constraint of the first input. Alternatively, the union or intersection of the two word candidate lists is determined, rather than one input filtering the other.
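The two ways of combining the inputs can be sketched as follows, assuming that the first input has already produced a list of spoken word candidates and that the second input is a sequence of telephone keypad presses. The function names and the sample words are hypothetical and serve only as an illustration.

```python
KEYPAD = {
    "2": "abc", "3": "def", "4": "ghi", "5": "jkl",
    "6": "mno", "7": "pqrs", "8": "tuv", "9": "wxyz",
}
LETTER_TO_KEY = {ch: key for key, letters in KEYPAD.items() for ch in letters}


def key_sequence(word):
    """Return the digit sequence pressed to type `word` (letters a-z only)."""
    return "".join(LETTER_TO_KEY[ch] for ch in word.lower())


def constrain_by_keys(speech_candidates, keys):
    """Keep the spoken candidates whose spelling begins with the typed keys."""
    return [w for w in speech_candidates if key_sequence(w).startswith(keys)]


def intersection(first, second):
    """Candidates present in both lists, in the order of the first list."""
    second_set = set(second)
    return [w for w in first if w in second_set]


def union(first, second):
    """All candidates from the first list, then new ones from the second."""
    merged = list(first)
    merged.extend(w for w in second if w not in first)
    return merged


# Hypothetical speech candidates for an utterance of "winter".
speech = ["winner", "winter", "winder", "inner"]
print(constrain_by_keys(speech, "9468"))  # user pressed the keys for w-i-n-t
```

Four ambiguous key presses are enough to isolate one spoken candidate here, which reflects the observation that the ambiguity of the two modalities is largely orthogonal.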
In another embodiment, the one or more word candidates are determined based on the first input in view of word context. The word context is based on any of an N-gram language model and a language model of a speech recognition engine.
In another embodiment, determining the one or more word candidates comprises the step of correcting or filtering the first plurality of word candidates based on the second input.
In another embodiment, the second input is received on a mobile device; and speech recognition of the voice input is performed partially on the mobile device and partially on a server coupled to the mobile device through a wireless communication connection.
In another embodiment, the speech recognition is activated by a push-to-talk button on the mobile device.
In another embodiment, the second input is received while one or more of the word candidates is presented for selection or editing.
In another embodiment, the second input comprises any of touch-screen keyboard input, handwriting gesture recognition, and keypad input.
One embodiment of the invention provides a machine-readable medium having instructions stored thereon which, when executed on a data processing system, cause the data processing system to perform a method for processing language input, the method comprising the steps of: receiving a first input comprising voice input; determining a first plurality of word candidates according to the first input; receiving a second input comprising non-voice input; and determining one or more word candidates according to the first input and the second input.
In another embodiment, the one or more word candidates are determined based on the second input under the constraint of the first input and in view of word context; and the word context is based on any of an N-gram language model and a language model of a speech recognition engine.
In another embodiment, the one or more word candidates are determined by correcting the first plurality of word candidates.
In another embodiment, speech recognition of the voice input is performed partially on the mobile device and partially on a server coupled to the mobile device through a wireless communication connection; and the speech recognition is activated by a push-to-talk button on the mobile device.
In another embodiment, the second input is received while one of the first plurality of word candidates is presented for editing or while the first plurality of word candidates is presented for selection; and the second input comprises any of touch-screen keyboard input, handwriting gesture recognition, and keypad input.
In another embodiment, a separate input mode can be used for words that denote punctuation. A temporary mode (such as T9's Symbols mode) may be invoked to recognize a single character, such as a symbol or digit. For example, when the word "period" is spoken, "." may be recognized.
In one embodiment, "smart" punctuation may be entered during the second input to interpret part of the voice input as punctuation. In another embodiment, no special mode needs to be entered to recognize punctuation. For example, when the user says "period", both the word "period" and "." may appear in the list.
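The second variant, in which the spoken word and its symbol both appear in the candidate list, can be sketched as follows. The table SPOKEN_SYMBOLS and the function expand_with_symbols are hypothetical; only the "period" and "smiley" pairs come from the text, and the "comma" entry is added purely for illustration.

```python
# Hypothetical table of spoken names and the symbols they may denote.
SPOKEN_SYMBOLS = {
    "period": ".",
    "smiley": ":-)",
    "comma": ",",  # illustrative addition, not from the source text
}


def expand_with_symbols(candidates):
    """Insert the matching symbol right after any candidate that names one."""
    expanded = []
    for word in candidates:
        expanded.append(word)
        symbol = SPOKEN_SYMBOLS.get(word.lower())
        if symbol is not None and symbol not in expanded:
            expanded.append(symbol)
    return expanded


print(expand_with_symbols(["period", "periodic"]))
```

The user can then pick either the literal word or the symbol from the same list, with no mode switch.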
Brief description of the drawings
FIG. 1 is a diagram that illustrates a system for recognizing user input on a data processing system according to the invention;
FIG. 2 is a block diagram of a data processing system for recognizing user input according to the invention;
FIG. 3 is a flow diagram of a method for processing language input in a data processing system according to the invention;
FIG. 4 is a block diagram that provides an example in which a user dictates a word according to one embodiment of the invention; and
FIGS. 5A-5C are block diagrams that provide an example in which a user dictates a word according to one embodiment of the invention.
Description of main reference numerals
101 user
103 display
105 digitizer
109 decoder
111 recognition engine
113 text buffer
107 digitizer
115 disambiguation engine
117 disambiguation engine
119 language database
201 processor
202 handwriting input device
203 display
204 voice input device
205 voice output device
206 key input device
210 memory
211 operating system
220 application programs
214 word list
216 word-based disambiguation engine
217 phrase-based recognition or disambiguation engine
218 context-based recognition or disambiguation engine
215 phrase list
213 phoneme recognition engine
Detailed description
The invention provides an apparatus and method for smart editing of speech recognition output, which offers the most likely choice, or hypothesis, given the user's input. The speech recognition engine scores alternate hypotheses, which add value to the information provided to the user. For example, if the speech recognition offers the user the wrong first-choice hypothesis, the user may want to access the other N-best hypotheses to correct what the recognizer returned. In a multimodal environment, the N-best list of hypotheses from the speech recognition output is available. Specifically, the N-best list is incorporated into the current word choice list for easy editing.
One embodiment of the present of invention provide use on N the best hypothesis acoustics (acoustic) information and literal context the two.This can be that grammer is interdependent or independent.That is, language model can provide can influence the syntactic information of probability of a given literal, or it merely provides and shows that for a moment some are connected on the N-gram model of the probability of a word or several words specific word afterwards.
Acoustically similar utterances appear in the N-best list. This is facilitated by a confusability matrix, which informs N-best hypothesis formulation about the frequency of specific phonemic errors. For example, if the speech recognition engine confuses /p/ with /b/ in word-final position, the resulting N-best hypotheses with these phonemes take this into account. Information may also be available on how frequently each phoneme in a given language is confused with every other phoneme, including positional context, for example whether it occurs at the beginning, middle, or end of a word. In addition to the confusability information, information on when phonemes are deleted or inserted may be provided.
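A position-aware confusability table of this kind can be sketched as a mapping from a phoneme and its position to the phonemes it is mistaken for. The data layout, the function names, and the 0.2 confusion rate below are assumptions for illustration; the source does not specify a representation.

```python
def position_of(index, length):
    """Classify a phoneme index as initial, medial, or final in the word."""
    if index == 0:
        return "initial"
    if index == length - 1:
        return "final"
    return "medial"


def confusable_variants(phonemes, confusions):
    """Return single-substitution variants of a phoneme tuple.

    `confusions` maps (phoneme, position) to a list of
    (alternative_phoneme, confusion_rate) pairs. Variants are returned
    most-confusable first.
    """
    variants = []
    for i, phoneme in enumerate(phonemes):
        position = position_of(i, len(phonemes))
        for alternative, rate in confusions.get((phoneme, position), []):
            variant = phonemes[:i] + (alternative,) + phonemes[i + 1:]
            variants.append((variant, rate))
    variants.sort(key=lambda item: item[1], reverse=True)
    return variants


# Invented rate: /p/ is heard as /b/ in word-final position 20% of the time.
CONFUSIONS = {("p", "final"): [("b", 0.2)]}
print(confusable_variants(("k", "ae", "p"), CONFUSIONS))
```

Phoneme insertions and deletions, which the text also mentions, would need additional tables and are left out of this sketch.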
In the invention, the user's text input created in this multimodal environment is also used to update any ambiguous-input or recognition system language databases. Ideally, databases that can be applied to any modality are updated in every modality. If a word offered by the speech recognition engine is not in, for example, the T9 dictionary, it may be added. In addition, word and phrase frequency and N-gram information can also be updated with use.
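The update step can be sketched with two counters shared by all modalities. The class LanguageDatabase and its method record_entry are hypothetical names, and the sample words are illustrative; a real database would also track phrases and recency.

```python
from collections import Counter


class LanguageDatabase:
    """Hypothetical shared store of word counts and bigram counts."""

    def __init__(self, words=()):
        self.word_counts = Counter({w: 1 for w in words})
        self.bigram_counts = Counter()

    def record_entry(self, words):
        """Update counts from confirmed text; return the newly added words."""
        added = []
        for word in words:
            if word not in self.word_counts:
                added.append(word)  # previously unknown word joins the list
            self.word_counts[word] += 1
        for previous, current in zip(words, words[1:]):
            self.bigram_counts[(previous, current)] += 1
        return added


db = LanguageDatabase(["winter", "storm"])
print(db.record_entry(["winter", "snowpocalypse", "storm"]))
```

Because every modality reads from the same store, a word learned through speech correction becomes available to keypad disambiguation as well.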
The invention provides a smart edit feature. For example, a user dictates into the mobile device. The resulting text output from the recognizer is returned to the user at the position of the cursor in the text entry screen. The output is rich in that it is tagged with the N-best information for the purpose of editing and correction.
One embodiment of the invention also provides a client-server feature, whereby the utterances are preprocessed on the device, recognized on a server connected through a wireless data channel, and returned to the device as N-best lists for text display and editing. Hypotheses are more dynamic and more relevant given any changes that the user makes to the text. For example, if the speech recognition engine proposed the word "winner" and the user corrects it to "winter", this action increases the likelihood that the following word "storm" is recognized correctly, provided the user's correction is also passed back to the server. Server-side language models can provide a more comprehensive morpho-syntactic analysis of the input to improve recognition performance. These models have more power to predict the user's next word, enhancing both word prediction and word completion algorithms. In addition, language-specific features (for example, subject-verb agreement, case, gender, and number agreement) can be implemented more easily on a powerful server to increase recognition accuracy. The system may allow the user to control the flow of corrections and updates to the server through client-side configuration or prompts.
The invention also provides "smart" punctuation. Speech recognition systems may have difficulty detecting when a user intends to insert a symbol rather than a word (for example, "." instead of "period", or ":-)" instead of "smiley"). Ambiguous text input systems have a limited number of keys or gestures for selecting a symbol rather than a letter. Correcting speech with an ambiguous "smart" punctuation feature, however, informs the system that the proper interpretation of the utterance is a symbol.
The invention allows a temporary "push-to-dictate" mode, which is similar to the "push-to-talk" feature except that the speech is converted into text instead of being transmitted as an audio signal to another phone or saved as an audio attachment to an email.
In addition, the invention allows for vector quantization, which can be performed on the device, with the matching/hypothesis lists generated on either the device or the server.
FIG. 1 is a diagram that illustrates a system for recognizing user input on a data processing system according to the invention. The user 101 begins by dictating a word, phrase, sentence, or paragraph. The digitizer 105 and decoder 109 convert the acoustic input to phonetic data, using an acoustic model (not shown). The recognition engine 111 analyzes that data based on the lexicon and/or language model in the language databases 119, optionally including frequency or recency of use, and optionally based on the surrounding context in the text buffer 113. The best interpretation is added to the text buffer 113 and shown to the user 101 via the text and list display 103. Alternatively, the N-best list of interpretations is stored in the text buffer 113 for later reference and/or presented to the user 101 for confirmation via the text and list display 103.
At some later point, the user 101 selects a word or phrase for correction via the text and list display 103. Depending on the input capabilities of the alternate modality, the user presses keys, or taps or writes on a touch screen, which is converted to an input sequence by an appropriate digitizer 107. The disambiguation engine 115 determines possible interpretations based on the lexicon and/or language model in the language databases 119, optionally including frequency or recency of use, and optionally based on the surrounding context in the text buffer 113. The multimodal disambiguation engine 117 compares the ambiguous input sequence and/or its interpretations against the best or N-best interpretations of the speech recognition, and presents the revised interpretations to the user 101 for confirmation via the text and list display 103. In another embodiment, the disambiguation engines 115 and 117 are combined, and mutual disambiguation occurs as an inherent part of processing the input from the alternate modality.
In another embodiment, the multimodal disambiguation engine 117 directs the ambiguous interpretations back to the recognition engine 111 for reinterpretation along with the best or N-best list of speech interpretations. In one such embodiment, the original vectors or phoneme tags are stored in the text buffer 113; in another, the multimodal disambiguation engine 117 or the recognition engine 111 maps the characters (graphs) of the words in the best or N-best and/or ambiguous interpretations back to vectors or phonemes for reinterpretation by the recognition engine 111.
The recognition and disambiguation engines 111, 115, 117 may update one or more of the language databases 119 to add novel words or phrases that the user 101 has explicitly spelled or compounded, and to reflect the frequency or recency of use of words and phrases entered or corrected by the user 101.
In another embodiment of the invention, the system recognizes handwriting (whether block, cursive, or even shorthand) instead of speech. The system components 105, 109, 111 serve the same functions for handwriting recognition as they do for speech recognition. The alternate modality may be ambiguous input from a keypad or touch-screen keyboard, or speech recognition (whether continuous, discrete, or by letter), depending on the input capabilities and processing power of the equipment.
FIG. 2 is a block diagram of a data processing system for recognizing user input according to the invention. Although FIG. 2 illustrates various components of an example data processing system, it is understood that a data processing system according to the invention may in general include components other than those illustrated in FIG. 2. For example, in a mobile phone embodiment, some systems may have communication circuitry. FIG. 2 illustrates various components closely related to at least some features of the invention. For this reason, a person skilled in the art would understand that the arrangement of a data processing system according to the invention is not limited to the particular architecture illustrated in FIG. 2.
The display 203 is coupled to the processor 201 through appropriate interface circuitry. A handwriting input device 202, such as a touch screen, a mouse, or a digitizing pen, is coupled to the processor 201 to receive user input for handwriting recognition and/or other user input. A voice input device 204, such as a microphone, is coupled to the processor 201 to receive user input for speech recognition and/or other user input. A key input device 206, such as a phone keypad, a set of dedicated or configurable buttons, or a small keyboard displayed on a touch screen, is coupled to the processor 201 to receive typed input and/or other user input. Optionally, a sound output device 205, such as a speaker, is also coupled to the processor.
Processor 201 receives input from the input devices, such as handwriting input device 202, voice input device 204, or key input device 206, and manages output to the display and the speaker. Processor 201 is coupled to a memory 210. The memory comprises temporary storage media, such as random access memory (RAM), and permanent storage media, such as read-only memory (ROM), floppy disks, hard disks, or CD-ROMs. Memory 210 contains all the software routines and data required to manage the operation of the system. The memory typically contains an operating system 211 and application programs 220. Examples of application programs include word processors, messaging clients, and foreign language translators. Speech synthesis software may also be provided as part of the data processing system.
In one embodiment of the invention, memory 210 includes separate modules for each part of the recognition and/or disambiguation process, which may include: a word-based disambiguation engine 216, a phrase-based recognition or disambiguation engine 217, a context-based recognition or disambiguation engine 218, a selection module 219, and others, such as a word list 214 and a phrase list 215. In this embodiment, the context-based recognition or disambiguation engine applies contextual aspects of the user's actions to disambiguating the input. For example, a vocabulary may be selected based on the user's location, such as at work or at home; the time of day, such as working hours versus leisure time; the recipient; and so on.
In one embodiment of the invention, most of the components used for recognition and disambiguation are shared among the different input modalities, for example between speech recognition and reduced-keypad input. Word list 214 contains a list of the known words of a language for all modalities. Word list 214 further includes usage-frequency information for the corresponding words in the language. In one embodiment, a word not in word list 214 for the language is treated as having zero frequency. Alternatively, an unknown word may be assigned a very small frequency of use. By applying this assumed frequency of use to unknown words, known and unknown words can be processed in substantially the same way. Word list 214 can be used by the word-based recognition or disambiguation engine 216 to rank, eliminate, and/or select word candidates determined from the results of a pattern recognition engine, such as stroke/character recognition engine 212 or phoneme recognition engine 213, and to predict words for word completion based on a partial user input. Similarly, phrase list 215 contains a list of phrases of two or more words, together with usage-frequency information, which the phrase-based recognition or disambiguation engine 217 can use and which can be used to predict words for phrase completion.
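The role of the word list can be illustrated with a minimal Python sketch. It is not taken from the patent: the words, the frequency values, the `UNKNOWN_FREQUENCY` constant, and the function names are all hypothetical, chosen only to show how a small assumed frequency lets known and unknown words pass through the same ranking and completion code.

```python
from typing import Iterable

# Hypothetical word list: word -> relative usage frequency (made-up values).
WORD_LIST = {"the": 5000, "top": 120, "stop": 90, "tap": 40}

# Assumed very small frequency for words missing from the list, so that
# unknown words need no special-case handling.
UNKNOWN_FREQUENCY = 0.01


def frequency(word: str) -> float:
    """Usage frequency of a word, falling back to the assumed small value."""
    return WORD_LIST.get(word.lower(), UNKNOWN_FREQUENCY)


def rank_candidates(candidates: Iterable[str]) -> list[str]:
    """Order word candidates from most to least frequent."""
    return sorted(candidates, key=frequency, reverse=True)


def complete(prefix: str, limit: int = 3) -> list[str]:
    """Predict known words that complete a partial input."""
    matches = [w for w in WORD_LIST if w.startswith(prefix.lower())]
    return rank_candidates(matches)[:limit]


print(rank_candidates(["tap", "zzyx", "stop", "top"]))
print(complete("t"))
```

In this sketch the unknown word "zzyx" is ranked last rather than rejected, which corresponds to the alternative in which an unknown word is assigned a very small frequency instead of zero.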
Fig. 3 is a flow diagram of a method for processing language input in a data processing system according to the present invention. The method begins at step 300 with receiving a first input comprising voice input. The method proceeds to step 302 to determine a first plurality of word candidates according to the first input. The method continues to step 304 to receive a second input comprising non-voice input. Finally, at step 306 the method determines one or more word candidates according to the first input and the second input.
The speech recognition system converts the acoustic signal into a sequence of digital vectors, which are matched against potential phones given their context. Further, the phonetic forms are matched against a lexicon and language model to produce an N-best list of words for each discrete utterance.
In continuous speech recognition, there may be no clear pauses between words, so the recognition output may be one or more likely phrase or sentence interpretations. The most likely interpretation is shown at the text insertion point in the application's current input field.
Following the steps of the method, the user then determines that some of the previously recognized words are incorrect. Using a stylus, arrow keys, or a voice command, the user selects one or more words for correction. The input system may display a list of the most likely interpretations at this point, but it will not always show the desired word, especially if there are display constraints.
Using the available or preferred alternate modality, such as T9 Text Input on a phone keypad, the user begins to retype the first highlighted word. Because the letters mapped to each key, such as A B C on the 2 key, are typically not acoustically similar, the system can immediately determine that the first phoneme, such as a plosive /b/ or /p/, is in fact a B rather than a P, because the 2 key was pressed rather than the 7 key, which carries P Q R S. Similarly, a tap on an auto-correcting QWERTY keyboard in the V B N neighborhood rather than in the I O P neighborhood increases the likelihood that B was the desired letter. Similarly, a gesture that a handwriting recognition engine interprets as closer to a B or a 3 than to a P or an R mutually resolves the ambiguity in both recognizers.
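The keypad case can be sketched in Python as a filter over the recognizer's N-best list. This is an illustrative sketch under stated assumptions, not the patented implementation: the N-best words are invented, and the names `fits_key_sequence` and `filter_n_best` are hypothetical. Only the letter layout of a standard phone keypad is taken as given.

```python
# Letter assignment of a standard phone keypad.
KEYPAD = {
    "2": "abc", "3": "def", "4": "ghi", "5": "jkl",
    "6": "mno", "7": "pqrs", "8": "tuv", "9": "wxyz",
}
LETTER_TO_KEY = {
    letter: key for key, letters in KEYPAD.items() for letter in letters
}


def fits_key_sequence(word: str, keys: str) -> bool:
    """True if the first len(keys) letters of the word lie on the pressed keys."""
    if len(word) < len(keys):
        return False
    return all(
        LETTER_TO_KEY.get(ch) == key for ch, key in zip(word.lower(), keys)
    )


def filter_n_best(n_best: list[str], keys: str) -> list[str]:
    """Keep the hypotheses compatible with the keys pressed so far, in order."""
    return [word for word in n_best if fits_key_sequence(word, keys)]


# Hypothetical recognizer output in which /b/ and /p/ were confusable.
n_best = ["pat", "bat", "pad", "bad"]
print(filter_n_best(n_best, "2"))    # a single press of the 2 key
print(filter_n_best(n_best, "223"))  # further presses narrow the list
```

A single press of the 2 key removes every hypothesis that starts with P, and each additional press narrows the list further while preserving the recognizer's original ordering.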
When the user re-enters the incorrectly recognized word, a system implementing an embodiment of the method immediately offers better interpretations of the original recognizer output given each ambiguous correction. As the examples above indicate, re-entering only the first letter or two may be enough for the system to mutually disambiguate the entire word and offer the desired word as the best choice. The context and grammar of the preceding and/or following words in the input field, which were not selected for correction and are therefore presumed correct, can further prioritize and refine the interpretations of the utterance being corrected by the user. Given the most likely text interpretation of the current utterance, subsequent utterances may be reinterpreted as other, more likely words. In another embodiment, the other selected words are mapped back to phonemes, using the lexicon or language-specific rules that specify the pronunciation of each letter, before being reinterpreted as other, more likely words.
In one embodiment, the method has the vectors or phoneme tags and the ambiguous correction input directed back to the speech recognition system for a refined hypothesis search. In another embodiment, the method requires the disambiguation system to use the vectors or phoneme tags to refine and filter the correction, so that only ambiguous interpretations with characters compatible with the vectors or phonemes are considered.
As the user corrects words, the speech recognition system may determine that its segmentation of the continuous speech was in error and reinterpret the boundaries between words in light of the user's corrections; or it may determine that a pause is less likely to have represented a delimiter between words, and so reinterpret the utterance and display it as a single word.
If the input options on the device are limited, the user may be able to select only one word at a time for correction. In that case, after the user selects the corrected word, the method may include the step of reconsidering the following word in light of the corrected word's context and/or how the original vectors map to the end of the corrected word and the beginning of the next word. The system may indicate that the following word has a lower confidence score, or may automatically display a list of interpretations for the associated utterance.
In one embodiment of the invention, the system automatically interprets ambiguous input following a recognized utterance as a correction of the preceding word or phrase. In another embodiment, the system simultaneously interprets the input as a correction to the preceding word and as the start of a new word to be added to the text; by the time the user completes entry of the word, few valid corrections or new-word interpretations may remain, and the most likely one is offered.
In another embodiment of the invention, the first and second inputs are nearly simultaneous or overlapping; in effect, the user is voicing what he or she is typing. The system automatically interprets both inputs and mutually disambiguates them to produce the best interpretation of both. The user does not need to go back and correct words or phrases very often, because combining the two inputs increases the likelihood that the system chooses the correct interpretation. Entering only a few ambiguous inputs representing the beginning of each word may be sufficient in many cases. In another embodiment of the invention, the two inputs are concurrently entered, recognized, and mutually disambiguated only after a word or phrase is selected for correction.
For example, the user can press the 2 key for "a" and speak a word that starts with "a". In one embodiment, the key press could be taken to represent the first letter of the intended word. Thus, when both forms of input seem to agree, one form of input can reinforce the other and increase the system's confidence in the words that it presents. However, the two forms of input could also disagree. In that case, words matching both forms of input could be presented in the word candidate list. The user would then be able to further clarify using either mode or both.
In addition, one form of input could be used to "build around" words from the other. For example, the user can speak the word "home" and then press the 9 key shortly thereafter. Since these two inputs seem to conflict, the list of word possibilities should include words that are phonetically like "home" as well as words that start with the letters "w", "x", "y", or "z" found on the 9 key. The press of the 9 key could also be taken as the start of the next part of a compound word, so that when the user says "work", the press of the 9 key can be used to help disambiguate the next spoken input.
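The agreement and conflict cases described in the two preceding paragraphs can be sketched as a single merge step in Python. Everything here is an assumption made for illustration: the confidence scores, the `boost` factor, the `key_only_score` value, and the function name `combine_candidates` are hypothetical and do not come from the patent, which does not specify a scoring scheme.

```python
def combine_candidates(
    speech: dict[str, float],
    key_letters: str,
    vocabulary: list[str],
    boost: float = 2.0,
    key_only_score: float = 0.05,
) -> list[str]:
    """Merge speech hypotheses with words that fit the pressed key.

    speech      -- hypothetical recognizer hypotheses with confidence scores
    key_letters -- letters on the pressed key, e.g. "abc" for the 2 key
    vocabulary  -- known words that the key press alone could be starting
    """
    scores: dict[str, float] = {}
    for word, score in speech.items():
        agrees = word[:1].lower() in key_letters
        # Agreement between the modalities reinforces the hypothesis.
        scores[word] = score * boost if agrees else score
    for word in vocabulary:
        # On conflict, words matching only the key press are still listed.
        if word[:1].lower() in key_letters and word not in scores:
            scores[word] = key_only_score
    return sorted(scores, key=lambda w: (-scores[w], w))


# Agreement: the 2 key (a, b, c) is pressed while a word starting with "a" is spoken.
print(combine_candidates(
    {"apple": 0.6, "maple": 0.4, "ample": 0.3}, "abc", ["apple", "bat", "cat"]))

# Conflict: "home" is spoken and the 9 key (w, x, y, z) is pressed.
print(combine_candidates(
    {"home": 0.7, "hone": 0.2}, "wxyz", ["work", "yes", "home"]))
```

In the conflict case the resulting list holds both the words that sound like "home" and the vocabulary words beginning with a letter on the 9 key, leaving the final choice to the user.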
Fig. 4 is a block diagram illustrating an example in which a user has dictated a word, according to one embodiment of the invention. The speech engine recognizes an utterance 400. The word is displayed to the user 402. If the user reselects the word in the application's input field, the word choice list presents the alternate hypotheses from the speech recognition output 404. The user may then select the correct interpretation from the word choice list and continue with speech recognition input 406. If the user presses one or more ambiguous keys while a word is active, the word choice list reflects only those words from the N-best list that fit the key sequence 408.
Figs. 5A-5C are diagrams and sample display screens illustrating an example in which a user has dictated the words "The top", according to one embodiment of the invention. The speech engine recognizes the utterance as "The stop", which is returned to the user's mobile device (Fig. 5A). If the user makes the word "stop" active in multimodal T9, the word choice list presents the alternate hypotheses from the speech recognition output (Fig. 5B). The user may then choose the intended utterance from the word choice list and continue with T9 input or speech recognition input.
If the user enters a key press, the word choice list displays the words from the N-best list that are constrained by that key press (Fig. 5C). While a word is active, an additional key press extends the letter sequence. Thus, a soft-key "Edit" option may invoke the correction method.
It will be apparent that the invention also applies to reduced keyboards and to languages written with ideographic characters. For example, with Pinyin letters mapped to each key, such as A B C on the 2 key, the user corrects the utterance "bing" that was misrecognized as "ping"; after the 2 key is pressed, the system can immediately determine that the first phoneme is in fact a B rather than a P. Similarly, with a stroke-order input system, after the user presses the key representing the first stroke category of the desired character, the speech recognition engine can take characters beginning with a stroke in another category into account and offer a better interpretation of the utterance. Similarly, beginning to draw the first character using a handwritten ideographic character recognition engine can also correct the speech interpretation.
Although an ambiguous stroke-order input system or a handwriting recognition engine may not be able to determine definitively which handwritten stroke was intended, the combination of the speech interpretation and the stroke interpretation sufficiently disambiguates the two input modalities to offer the user the intended character. As described above for speech correction in alphabetic languages, when the user selects the corrected ideographic character, the method may include the step of recognizing the following character in light of the context of the correction and/or how the original acoustic vectors map to the end of the corrected character and the beginning of the next character. As a result of the corrections, the speech recognition system may also determine that a momentary pause is less likely to have represented a delimiter between words or phrases, and so reinterpret the utterance and display it as a series of characters representing a single word or phrase rather than two separate words or phrases; or vice versa.
The combination of speech recognition and ambiguous input has other benefits. In a noisy environment, such as on a city sidewalk, in a crowded restaurant, or on a construction site, speech recognition accuracy may fall below a level acceptable to the user. Or, in a quiet environment, such as in a library or during a meeting, or when the subject matter is private or sensitive, it may not be possible to use speech dictation. The user then has the ambiguous input system as a reliable fallback for free entry of text. In addition, it is difficult to recognize or spell out a word that is not in the speech recognition system's vocabulary, whereas the ambiguous input system typically offers a reliable means to type any character sequence and add it to its vocabulary. In addition, the speech recognition engine may be used to select a word from the list of candidates displayed by the ambiguous input system.
In one embodiment of the invention, the word or phrase interpretations are ordered by the frequency of occurrence of those words or phrases in general use of the language. In one embodiment of the invention, the ordering is adapted, continuously or occasionally, to the user's frequency and/or recency of use of each word or phrase relative to the others.
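One way to picture this adaptive ordering is the Python sketch below. It is only an assumption about how such an ordering might be built: the class name `AdaptiveRanker`, the made-up frequency values, and the choice to let the user's own usage count dominate, then recency, then general language frequency, are illustrative and are not specified by the patent.

```python
class AdaptiveRanker:
    """Orders candidates by user usage first, then general language frequency."""

    def __init__(self, language_frequency: dict[str, int]) -> None:
        self.language_frequency = dict(language_frequency)
        self.use_count: dict[str, int] = {}
        self.last_used: dict[str, int] = {}
        self.clock = 0  # increments on every recorded selection

    def record_use(self, word: str) -> None:
        """Note that the user selected this word."""
        self.clock += 1
        self.use_count[word] = self.use_count.get(word, 0) + 1
        self.last_used[word] = self.clock

    def order(self, candidates: list[str]) -> list[str]:
        """Sort candidates by (user frequency, recency, language frequency)."""
        def key(word: str) -> tuple[int, int, int]:
            return (
                self.use_count.get(word, 0),
                self.last_used.get(word, 0),
                self.language_frequency.get(word, 0),
            )
        return sorted(candidates, key=key, reverse=True)


ranker = AdaptiveRanker({"top": 120, "stop": 90, "tap": 40})  # made-up values
print(ranker.order(["tap", "stop", "top"]))  # general frequency only
ranker.record_use("tap")
print(ranker.order(["tap", "stop", "top"]))  # the user's own choice moves up
```

Before any selection is recorded the order follows the general language frequency; after the user picks "tap" once, that word moves to the front of the list.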
In one embodiment of the invention, word completions or predictions that match the key presses or stylus taps entered so far are offered along with the other word interpretations, to make retyping of corrections and additional words faster and easier. In one embodiment of the invention, diacritics such as vowel accents are placed on the proper characters of the word being spoken or corrected, without the user indicating that a diacritic mark is needed.
In one embodiment of the invention, some or all of the inputs from the alternate modality are not ambiguous. This may reduce or remove the need for the disambiguation engine 115 in Fig. 1, but still requires the multimodal disambiguation engine 117 to reinterpret the vectors or phoneme tags of the word or phrase being corrected in light of the new input sequence entered so far.
In one embodiment of the invention, such as when the ambiguous input system is an auto-correcting keyboard displayed on a touch-screen device, each character that is the best interpretation of the user's input during correction or retyping, such as the character closest to each stylus tap, forms a sequence that the system displays as an unambiguous interpretation, which the user may select if the desired word is not in the vocabulary.
In one embodiment of the invention, such as when the ambiguous input system uses a reduced keyboard such as a standard phone keypad, the unambiguous interpretation is a two-key or multi-tap interpretation of the key sequence.
In one embodiment of the invention, the unambiguous interpretation is added to the vocabulary if the user selects it for correction or output. In one embodiment of the invention, the recognized or corrected word, or the unambiguous interpretation, identifies a replacement word or phrase for output, such as an abbreviation for a longer phrase or an acceptable substitute for a term of profanity. In one embodiment of the invention, the system adapts to systematic differences between the user's input, such as tap locations or the slant of handwritten shapes, and the intended characters or words, based on the subsequent word or phrase interpretations actually selected by the user.
In one embodiment of the invention, the user invokes a mode in which utterances are recognized as discrete characters, such as a letter, digit, or punctuation symbol. The character sequence may be added to the vocabulary if it is novel. In one embodiment of the invention, alternate words for spelling, such as "Alpha Tango Charlie" or "A as in Andy, P as in Paul", are recognized as discrete characters.
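A minimal Python sketch of this spelling mode, applied to text that a recognizer has already transcribed, is shown below. The function name `spell` and its simplifying rules are assumptions for illustration: a code word contributes its first letter, and a phrase of the form "X as in Word" contributes X. Digits and punctuation symbols, which the embodiment also mentions, are not handled here.

```python
import re


def spell(utterance: str) -> str:
    """Turn a transcribed spelling utterance into a character sequence."""
    tokens = re.findall(r"[A-Za-z]+", utterance)
    chars = []
    i = 0
    while i < len(tokens):
        token = tokens[i]
        is_as_in_phrase = (
            len(token) == 1
            and i + 3 < len(tokens)
            and tokens[i + 1].lower() == "as"
            and tokens[i + 2].lower() == "in"
        )
        if is_as_in_phrase:
            # "A as in Andy": keep the letter, skip the three words after it.
            chars.append(token.upper())
            i += 4
        else:
            # "Alpha", "Tango", ...: the code word stands for its first letter.
            chars.append(token[0].upper())
            i += 1
    return "".join(chars)


print(spell("Alpha Tango Charlie"))
print(spell("A as in Andy, P as in Paul"))
```

The resulting character sequence could then be offered as a word and, if it is novel, added to the vocabulary as the paragraph above describes.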
In one embodiment of the invention, the system may choose to disregard the vectors or phoneme tags when they no longer provide useful guidance for reinterpretation or disambiguation. In one embodiment of the invention, the system provides a means, such as a key or gesture, for the user to dismiss some or all of the speech data associated with the recognized words.
In another embodiment, during the installation phase, or upon receipt of text messages or other data, information files are scanned for words to be added to the lexicon. Methods for scanning such information files are known in the art. As new words are found during scanning, they are added to a vocabulary module as low-frequency words and, as such, are placed at the end of the word selection list. Depending on the number of times that a given new word is detected during a scan, it is assigned a higher priority by promoting it within its word selection list, thus increasing the likelihood of the word appearing in the word selection list during information entry. Standard pronunciation rules for the current or determined language may be applied to novel words in order to arrive at their phonetic form for future recognition.
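The scanning step can be sketched in Python as follows. The tokenizing regular expression, the `LOW_FREQUENCY` constant, the promotion rule (one low-frequency unit per detection), and the sample message are all hypothetical choices made for this illustration; the patent leaves the scanning method to techniques known in the art.

```python
import re
from collections import Counter

# Assumed starting frequency for a word first seen during a scan.
LOW_FREQUENCY = 1


def scan_text(text: str, vocabulary: dict[str, int]) -> None:
    """Add new words found in the text to the vocabulary as low-frequency words.

    A new word seen several times in the scan is promoted in proportion to
    the number of detections. Known words keep their existing frequency.
    """
    counts = Counter(re.findall(r"[a-z']+", text.lower()))
    for word, seen in counts.items():
        if word not in vocabulary:
            vocabulary[word] = LOW_FREQUENCY * seen


def selection_order(vocabulary: dict[str, int]) -> list[str]:
    """Word selection list, most frequent first (ties keep insertion order)."""
    return sorted(vocabulary, key=lambda w: -vocabulary[w])


vocabulary = {"call": 50, "me": 80}  # made-up existing entries
scan_text("Call me about the podcast. The podcast podcast starts at noon.", vocabulary)
print(selection_order(vocabulary))
```

New words land behind the established entries, and a word detected three times in the scan ("podcast") is promoted ahead of words detected only once.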
Those skilled in the art will also recognize that additional vocabulary modules (whether defined by rule or specified in a lexicon) can be enabled within the computer, for example vocabulary modules containing legal terms, medical terms, and other languages. Further, in some languages, such as Indic languages, the vocabulary module may employ templates of valid sub-word sequences to determine which word component candidates are possible or likely given the preceding inputs and the word candidates being considered. Via a system menu, the user can configure the system so that the additional vocabulary words appear first or last in the list of possible words, for example with special coloration or highlighting, or the system may automatically switch the order of the words based on which vocabulary module supplied the words selected immediately before. Consequently, within the scope of the claims below, it will be appreciated that the invention can be practiced otherwise than as specifically described herein.
Although the invention is described herein with reference to the preferred embodiment, those skilled in the art will readily appreciate that other applications may be substituted for those set forth herein without departing from the spirit and scope of the present invention. Accordingly, the scope of the invention is defined only by the claims below.

Claims (20)

1. A method for processing language input in a data processing system, comprising at least the steps of:
receiving a first input comprising voice input;
determining a first plurality of word candidates according to the first input;
receiving a second input comprising non-voice input; and
determining one or more word candidates according to the first input and the second input.
2. The method of claim 1, wherein the one or more word candidates are determined according to the second input under constraint of the first input.
3. The method of claim 2, wherein the one or more word candidates are determined according to the first input in view of word context.
4. The method of claim 3, wherein the word context is based on any of the following:
an N-gram language model; and
a language model of a speech recognition engine.
5. The method of claim 1, wherein the step of determining the one or more word candidates comprises the step of correcting or filtering the first plurality of word candidates according to the second input.
6. The method of claim 1, wherein the second input is received on a mobile device; and wherein speech recognition of the voice input is performed partially on the mobile device and partially on a server coupled to the mobile device through a wireless communication connection.
7. The method of claim 6, wherein the speech recognition is activated through a push-to-talk button on the mobile device.
8. The method of claim 1, wherein the second input is received while one or more of the word candidates are presented for selection or editing.
9. The method of claim 8, wherein the second input comprises any of the following:
a touch-screen keyboard;
handwriting gesture recognition; and
a keypad input.
10. The method of claim 1, wherein the first input is interpreted as a punctuation mark or one or more other symbols when the second input is associated with punctuation or symbols.
11. A machine-readable medium having instructions stored thereon which, when executed on a data processing system, cause the data processing system to perform a method for processing language input, the method comprising at least the steps of:
receiving a first input comprising voice input;
determining a first plurality of word candidates according to the first input;
receiving a second input comprising non-voice input; and
determining one or more word candidates according to the first input and the second input.
12. The machine-readable medium of claim 11, wherein the one or more word candidates are determined according to the first input in view of word context; and wherein the word context is based on any of the following:
an N-gram language model; and
a language model of a speech recognition engine.
13. The machine-readable medium of claim 11, wherein the step of determining the one or more word candidates comprises the step of correcting the first plurality of word candidates.
14. The machine-readable medium of claim 11, wherein the second input is received on a mobile device; wherein speech recognition of the voice input is performed partially on the mobile device and partially on a server coupled to the mobile device through a data connection; and wherein the speech recognition is activated through a push-to-talk button on the mobile device.
15. The machine-readable medium of claim 11, wherein the second input is received while one or more of the word candidates are presented for editing, or while the first plurality of word candidates is presented for selection; and wherein the second input comprises any of the following:
a touch-screen keyboard;
handwriting gesture recognition; and
a keypad input.
16. A mobile device for processing language input, comprising at least:
a speech recognition module to process a first input comprising voice input;
one or more second input modules to process a second input comprising non-voice input; and
a processing module coupled to the one or more second input modules and the speech recognition module, the processing module determining a first plurality of word candidates according to the first input and subsequently determining one or more word candidates according to the first input and the second input.
17. The device of claim 16, wherein the one or more word candidates are determined according to the second input under constraint of the first input and in view of word context; and wherein the word context is based on any of the following:
an N-gram language model; and
a language model of a speech recognition engine.
18. The device of claim 16, wherein the one or more word candidates are determined by correcting the first plurality of word candidates.
19. The device of claim 16, wherein speech recognition of the voice input is performed partially on the mobile device and partially on a server coupled to the mobile device through a wireless communication connection; and wherein the speech recognition is activated through a push-to-talk button on the mobile device.
20. The device of claim 16, wherein the second input is received while one or more of the word candidates are presented for editing, or while the first plurality of word candidates is presented for selection; and wherein the second input comprises any of the following:
a touch-screen keyboard;
handwriting gesture recognition; and
a keypad input.
CNA2005800178056A 2004-06-02 2005-06-02 Multimodal disambiguation of speech recognition Pending CN1965349A (en)

Applications Claiming Priority (6)

Application Number Priority Date Filing Date Title
US57673204P 2004-06-02 2004-06-02
US60/576,732 2004-06-02
US10/866,634 2004-06-10
US11/043,506 2005-01-25
US60/651,302 2005-02-08
US11/143,409 2005-06-01

Publications (1)

Publication Number Publication Date
CN1965349A true CN1965349A (en) 2007-05-16

Family

ID=38083522

Family Applications (1)

Application Number Title Priority Date Filing Date
CNA2005800178056A Pending CN1965349A (en) 2004-06-02 2005-06-02 Multimodal disambiguation of speech recognition

Country Status (1)

Country Link
CN (1) CN1965349A (en)

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102282610B (en) * 2009-01-20 2013-02-20 旭化成株式会社 Voice conversation device, conversation control method
CN103038728A (en) * 2010-03-12 2013-04-10 纽昂斯通信有限公司 Multimodal text input system, such as for use with touch screens on mobile phones
CN103038728B (en) * 2010-03-12 2016-01-20 纽昂斯通信有限公司 Such as use the multi-mode text input system of touch-screen on a cellular telephone
CN109147791A (en) * 2017-06-16 2019-01-04 深圳市轻生活科技有限公司 A kind of shorthand system and method
TWI815658B (en) * 2022-09-14 2023-09-11 仁寶電腦工業股份有限公司 Speech recognition device, speech recognition method and cloud recognition system

Similar Documents

Publication Publication Date Title
US11914925B2 (en) Multi-modal input on an electronic device
TWI266280B (en) Multimodal disambiguation of speech recognition
US8311829B2 (en) Multimodal disambiguation of speech recognition
JP4829901B2 (en) Method and apparatus for confirming manually entered indeterminate text input using speech input
CN1918578B (en) Handwriting and voice input with automatic correction
US7395203B2 (en) System and method for disambiguating phonetic input
CN102272827B (en) Method and apparatus utilizing voice input to resolve ambiguous manually entered text input
JP2005202917A (en) System and method for eliminating ambiguity over phonetic input
JP5703491B2 (en) Language model / speech recognition dictionary creation device and information processing device using language model / speech recognition dictionary created thereby
US11416214B2 (en) Multi-modal input on an electronic device
CN101415259A (en) System and method for searching information of embedded equipment based on double-language voice enquiry
JP3476007B2 (en) Recognition word registration method, speech recognition method, speech recognition device, storage medium storing software product for registration of recognition word, storage medium storing software product for speech recognition
CN1965349A (en) Multimodal disambiguation of speech recognition
CN1275174C (en) Chinese language input method possessing speech sound identification auxiliary function and its system

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C02 Deemed withdrawal of patent application after publication (patent law 2001)
WD01 Invention patent application deemed withdrawn after publication

Open date: 20070516