AU2006201110A1 - Dynamic match lattice spotting for indexing speech content - Google Patents
Dynamic match lattice spotting for indexing speech content Download PDFInfo
- Publication number
- AU2006201110A1 AU2006201110A1 AU2006201110A AU2006201110A AU2006201110A1 AU 2006201110 A1 AU2006201110 A1 AU 2006201110A1 AU 2006201110 A AU2006201110 A AU 2006201110A AU 2006201110 A AU2006201110 A AU 2006201110A AU 2006201110 A1 AU2006201110 A1 AU 2006201110A1
- Authority
- AU
- Australia
- Prior art keywords
- phone
- observed
- lattice
- sequences
- sequence
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Abandoned
Links
Landscapes
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Description
INO 1 0TITLE Dynamic Match Lattice Spotting for indexing Speech Content BACKGROUND OF THE INVENTION 2Field of the Invention oThe present invention generally relates to speech indexing. In particular, although not exclusively, the present invention relates to an improved unrestricted vocabulary
NO
speech indexing system and method for audio, video and multimedia data.
Discussion of the Background Art The continued development of a number of transmission and storage media such as the Internet has seen an increase in the transmission of various forms of information such as voice, video and multimedia data. The rapid growth in such transmission media has necessitated the development of a number technologies that can index and search the multitude of available data formats effectively Internet search engines). Such systems are and will continue to be paramount in providing an effective means of accessing information provided within these data formats.
One method of indexing and searching large speech corpora has been to use a twopass speech transcription approach. Speech is first prepared for indexing by using a large vocabulary speech recogniser to generate approximate textual transcriptions.
These transcriptions are then indexed using traditional text indexing approaches, thus allowing rapid information retrieval at search time.
Unfortunately, such an approach is severely restricted by the vocabulary of the speech recogniser used to generate textual transcriptions. The vocabulary of a speech recogniser is usually a finite size, and thus it is unlikely to contain every word that may be of interest in speech search, such as names, acronyms and foreign keywords. Since the content of any generated transcripts are constrained by the vocabulary of the speech recogniser, the set of possible query words is thus finite.
A constrained query vocabulary poses significant implications for many types of speech search applications with dynamic or very large vocabularies. These include 2
O
0 c"I tasks such as news-story indexing, technical document database searching and multi-language surveillance. As such, novel techniques are required that allow unrestricted vocabulary speech indexing and retrieval.
Another approach is to use acoustic keyword spotting techniques, such as the method described by J. R. Rohlicek in 'Modem methods of Speech Processing'.
Here, audio content is searched at query time using a simplified recogniser that has been dynamically tuned for the detection of the query words only. Such a search is considerably faster than performing a complete transcription of the speech. However, the technique is not scalable to searching very large corpora, as the required 0 acoustic processing is still considerably slower than typical text-based search techniques.
A considerably faster unrestricted vocabulary search approach using a reverse dictionary lookup was proposed by S. Dharanipragada and S. Roukos, "A multistage algorithm for spotting new words in speech," IEEE Transactions on, Speech and Audio Processing, vol. 10, no. 8, pp. 542-550, November 2002. In this approach, the speech is first processed offline to generate low-level phonetic or syllabic transcriptions. At query time, the target words are first decomposed into their lowlevel phonetic/syllabic representations, and then the intermediary transcriptions are searched in a bottom-up fashion to infer the locations of the query words.
Unfortunately, the phonetic/syllabic transcriptions upon which this approach is based are typically quite erroneous, since accurate phonetic/syllabic transcription in itself is a difficult task. As a result, the overall approach suffers from poor detection error rates.
Phone lattice based searching is another fast unrestricted vocabulary search technique, proposed by S. J. Young and M. G. Brown, "Acoustic indexing for multimedia retrieval and browsing" in IEEE International Conference on Acoustics, Speech, and Signal Processing, vol. 1, pp. 199-202 April 1997. This technique attempts to incorporate a degree of robustness to phone recogniser error by indexing the speech using phonetic lattices rather than transcriptions. Phone lattices encode a significantly greater number of recognition paths than phone transcriptions, and therefore preserve a considerably broader search space for query time processing.
As a result, the phone lattice approach tends to achieve better detection error rates than the above discussed approaches. However, the resulting error rates are still quite poor and have thus presented a significant barrier for usable information
NO
Sretrieval.
Thus a system and method for indexing and retrieval of various data formats such as speech is required that allows accurate yet rapid search of large information repositories. Clearly it would be advantageous to provide system and method that provides significantly more user-friendly access to the important information contained in the vast amounts of speech and multimedia being generated daily.
oThroughout the specification reference to any prior art is not, and should not be taken as an acknowledgement or any form of suggestion that the referenced prior art forms part of the common general knowledge in Australia.
SUMMARY OF THE INVENTION Disclosure of the Invention Accordingly in one aspect of the present invention there is provided a method of indexing speech content, the method including the steps of: generating a phone lattice from said speech content; processing the phone lattice to generate a set of observed sequences Q (®,i)where 0 are the set of observed phone sequences for each node i in said phone lattice; and storing said set of observed sequences Qfor each node.
In a further aspect of the present invention there is provided a method for searching indexed speech content wherein said indexed speech content is stored in the form of a phone lattice, the method including the steps of: obtaining a target sequence P P2 PN) comparing the target sequence with a set of observed sequences Q=(®,i)generated for each node i, where Gare the set of observed phone sequences for each node i in said phone lattice wherein the comparison between the target sequence P and observed sequences Q includes scoring each observed sequence against the target sequence using a Minimum Edit Distance calculation; and N4
O
0 outputting a set of sequences R from said set of observed sequences Q= (E,i)that match said target sequence.
In yet another aspect of the present invention there is provided a method of indexing and searching speech content, the method including the steps of: ogenerating a phone lattice from said speech content; processing the phone lattice to generate a set of observed sequences Q wherein 0 are the set of observed sequences for each node i in said o phone lattice; 0 N 10 storing said set of observed sequences Q for each node; obtaining a target sequence P (p PN); comparing said target sequence P with the set of observed sequences Q wherein the comparison between the target sequence and observed sequences includes scoring each observed sequence against the target sequence using a Minimum Edit Distance calculation; and outputting a set of sequences R from said set of observed sequences Qthat match said target sequence P.
Preferably the step of generating the phone lattice further includes the step of optimising the size and complexity of said lattice utilising one or more of the following; tuning the number of tokens U used to produce the lattice, lattice pruning to remove less likely paths outside the pruning beamwidth W, and/or tuning the number of lattice traversals V.
In a still further aspect of the present invention there is provided a system for indexing speech content, the system including: a speech recognition engine for generating a phone lattice from said speech content; a database for storing said phone lattice generated by said recognition engine; and at least one processor coupled to said database said processor configured to: v. 0 0 process said phone lattice to generate a set of observed sequences Q (0,i)for each node i in said phone lattice; Sstore said observed sequences in a said second database.
In another aspect of the present invention there is provided a system for searching indexed speech content wherein said indexed speech content is stored in the form of c an observed phone sequence database, the system including:
INO
an input for obtaining a target sequence P (p 1 P2, PN); a database containing a set of observed sequences Q (®,i)generated for each node i where 0 are the set of observed sequences in said phone lattice; at least one processor configured to: compare said target sequence P with the set of observed sequences Q wherein the comparison between the target sequence and observed sequences includes scoring each observed sequence against the target sequence using a Minimum Edit Distance calculation; and output a set of sequences R from said set of observed sequences Qthat match said target sequence P.
In yet another aspect of the present invention there is provided a system for indexing and searching speech content, the system including: a speech recognition engine for generating a phone lattice from said speech content; a first database for storing said an observed phone sequence database generated by said recognition engine; an input for obtaining a target sequence P (pI, PN); at least one processor configured to: Sprocess said phone lattice to generate a set of observed sequences Q i)where 0 are the set.of observed sequences for each node i in said phone lattice; ~6 O store said observed sequences Q in a second database; compare said target sequence P with the set of observed sequences Q wherein the comparison between the target sequence and observed sequences includes scoring each observed sequence against the target sequence using a 0Minimum Edit Distance calculation; and output a set of sequences R from said set of observed 0 N sequences Qthat match said target sequence P.
ON
S 10 The speech content may be in the form of a plurality of speech files, audio files, video files or other suitable multimedia data type.
Preferably the speech recognition engine is configured to perform a feature based extraction process to generate a feature-based representation of the speech files to construct a phone recognition network; and perform an N-best decoding utilising said phone recognition network to produce the phone lattice.
The feature based extraction process may be performed by any suitable speech recognition software. Suitably the phone recognition network is constructed using a number of available techniques such as phone loop or phone sequence fragment loop wherein common M-Length phone grams are placed in parallel. Preferably the N-best decoding utilises a well-trained set of acoustic models such as triphone Hidden Markov Models (HMM), and an appropriate language model such as an Ngram language model in conjunction with the phone recognition network to produce the phone lattice. Preferably the size and complexity the lattice is optimised utilising one or more of the following; tuning the number of tokens U used to produce the lattice, lattice pruning to remove less likely paths outside the pruning beamwidth W, and/or tuning the number of lattice traversals Vtokens.
The set of observed sequences Q may be in the form of a constrained sequence set i.e. the constrained sequence set i) containing the N0 7 0 C- K top scoring sequences wherein Q' i,K) is derived by the equation i, K) ke r(gl(rk H(r (K Suitably the Minimum Edit Distance scoring includes calculating the minimum cost A(A,B) of transforming the observed sequence to the target sequence in accordance with a set of insertion deletion substitution C, and match operations 0 wherein A(A,B) is defined by A(A,B)=BESTMED(A, and where I BESTMED(...) returns the last column of the MED cost matrix that is less than the 0 maximum MED score threshold Sma,. Preferably the MED cost matrix is produced in accordance with the Levenstein algorithm.
The generation of the cost function CT' may utilise one or more cost rules including same letter substitution, vowel substitution and/or closure/stop substitution. Suitably the maximum MED score threshold is adjusted to the optimal value for a given lattice the value of S,,x is adjusted so as to reduce the number of false alarms per keyword searched without substantive loss query-time execution speeds for a given lattice).
Suitably the process of calculating the Minimum Edit Distance cost matrix further includes the steps of applying prefix sequence optimisation and/or early stopping optimisation, and/or a linearizing the cost matrix.
In a further aspect of the present invention the set of observed sequences Q (9,i) may undergo further processing to produce a set of hypersequences. Each hypersequence may then be used to represent a particular group of observed sequences Q Preferably the hypersequences are generated by mapping the observed sequences to a hypersequence domain in accordance with a suitable mapping function.
Suitably the mapping of the observed sequences to domain is performed on an element by element basis in accordance with one or more of the following a linguistic
OU
0 knowledge based mapping, a data driven acoustic mapping and or a context dependent mapping.
Preferably the comparison of the target sequence and the observed sequences includes: o* comparing the target sequence with each hypersequence to identify the 4sequence groups most likely to yield a match for the target sequence; and then comparing said target sequence with the set of observed o sequencesQ contained within the identified hypersequence sequence 0 N 10 group/s.
BRIEF DETAILS OF THE DRAWINGS In order that this invention may be more readily understood and put into practical effect, reference will now be made to the accompanying drawings, which illustrate preferred embodiments of the invention, and wherein: FIG. 1 is schematic diagram of a speech indexing system in accordance with an embodiment of the present invention; FIG's. 2a, 2b, 2c are schematic diagrams depicting the sequence generation process according to an embodiment of the present invention; FIG. 3 is a schematic diagram depicting the speech retrieval process according to an embodiment of the invention; FIG. 4 is a schematic depicting the hypersequence generation process according to an embodiment of the present invention FIG. 5 is a schematic diagram depicting the speech retrieval process according to an embodiment of the invention; FIG. 6 is an example of a cost matrix for transforming the word deranged to hanged calculated using Levenstein algorithm in accordance with an embodiment of the invention; FIG. 7 is a schematic diagram illustrating the effects of manipulating the lattice traversal token parameter in accordance with an embodiment of the invention; v.0 9 0 0 FIG. 8 is an example of the relationship between cost matrices for subsequences utilised in the prefix optimisation process according to an embodiment of the present invention; and FIG. 9 is an example of the Minimum Edit Distance prefix optimisation algorithm applied in an embodiment of the present invention.
DESCRIPTION OF EMBODIMENTS OF THE INVENTION 0 With reference to FIG. 1 there is illustrated the basic structure of a typical speech 0 10 indexing system 10 of one embodiment of the invention. The system consists primarily of two distinct stages, a speech indexing stage 100 and a speech retrieval stage 200.
The speech indexing stage consists of three main components a library of speech files 101, a speech recognition engine 102 and a phone lattice database 103.
In order to generate the phone lattice 103 the speech files from the library of speech files 102 are passed through the recogniser 102. The recogniser 102 performs a feature extraction process to generate a feature-based representation of the speech file. A phone recognition network is then constructed via a number of available techniques, such as phone loop or phone sequence fragment loop wherein common M-Length phone grams are placed in parallel.
In order to produce the resulting phone lattice 103 an N-best decoding is then preformed. Such a decoding utilises the phone recognition network discussed above in conjunction with a well-trained set of acoustic models such as tri-phone Hidden Markov Models (HMM), and an appropriate language model such as an N-gram language model.
The lattice may then be further refined by performing a high order language expansion. An output beamwidth pruning operation may then be performed to remove paths from outside a predefined beamwidth of the top scoring path of the lattice.
In the speech retrieval stage 200, a user firstly inputs a query 201 into the systems search engine which then searches the lattice 202 for instances of the given query.
The results 203 of the search are then displayed to the user.
Traditional retrieval methods perform an online lattice search during query time.
O Given a target phone sequence, a lattice can be traversed using standard lattice traversal techniques to locate instances of the target sequence within the complex o lattice structure.
IND
S 10 The standard lattice-based search algorithm can be summarised as follows. Let the target phone sequence be defined as P P 2 PN then let A where A is the collection of matching sequences and 0 be defined as the set of all N-length sequences O' ,0 exist in the lattice, where each element 0.
corresponds to a node in the lattice. Each node may have properties including node time, which is the time at which the node was generated in the associated speech utterance, and node label which is the associated label for the node phone label).
The mapping function to map a node sequence to a phone symbol sequence is then defined as follows: where the function O(x)returns the phone symbol for node x.
For each node i in the phone-lattice, where node list traversal is done in time-order, the subset of all node sequences that terminate at the current node are derived. This set is termed the observed sequence set and is defined as follow: QIQ2{O k Ok (2) VO 11 c The members of the observed sequence that match the target phone sequence are then determined. The matching operation is performed using the matching Soperator® as follows: EQ(Q') PI (3) Since for the standard lattice-based search algorithm, the equality operation is used for matching sequences, then R(Q, P)can also be defined as: (4) IND R(Q, I Q' PI (4) o Any matching sequences are then appended to the set of matches for this lattice, using the update equation: A A R(Q, P) where output A is the set of putative matches. Timings of the matches can be easily derived from the node properties.
Searching through a lattice provides improvements for detection rates, compared to searching N-best transcriptions, since a sufficiently rich lattice provides multiple localised hypotheses at any given point in time. However, the lattice search task is computationally intensive, and although textual, still poses significant implications for the query-time execution speed.
Accordingly the query-time execution speed may be improved by performing a significant portion of the lattice traversal offline during the speech indexing stage.
Since the paths traversed through the lattice are independent of the queried target sequences (traversal is done purely by maximum likelihood), it is possible to perform the actual lattice traversal during the indexing stage. That is, it is possible to obtain for each node i during the indexing stage, since it is independent of the target sequence P. However, it is not possible to determine the set of target sequence matches, R(Q, since P is not known at the time of indexing. For example in FIG.
2a, a listing of possible occurrences Q(,i)of the term stack is compiled 105 by traversing the lattice 106 backwards from node i 107.
IN 12
O
0 The data storage requirements for storing may be quite large particularly for Svery rich lattices. Thus in order to reduce the storage space required it is necessary to apply some constraints to restrict the size of Instead of storing a constrained sequence set Q' is used. This constrained sequence set is derived as follows.
0 cl Let be defined as the path likelihood of sequence This path likelihood can
ID
0 0 be computed from the lattice by traversing the path traced by 0' and accumulating the total acoustic and language likelihoods for this path.
The sequence set F(Q) is then determined and is simply the sequence set Q ordered by the path likelihood and is given by: F, (6) F(DescSort H(Q (7) Where y, (8) DescSort(([x,,y, [x ,y 2 [m 2 n 2 (9) me X,n e Yn, n~k i) The reduced seq F'(Q)is then obtained by finding the subset of sequences in F(Q) that have unique phone symbol sequences. An element with a higher likelihood is always chosen when comparing two items with the same phone symbol sequence.
Finally, the constrained sequence set is derived as follows:
I
*E r()l H(r 11 IO 13 0 Thus, the constrained sequence set will contain the K top scoring sequences based on the path likelihoods as shown in FIG 2b. The resulting constrained Ssequence set is then stored in a sequence database 108 see FIG 2c.
However such an approach requires prior knowledge of the length of the target sequence. Since® represents the set of all, N-length sequences in a lattice, the value Sof N, must be known during the indexing stage. However, this can be easily solved by C, selecting as the maximum length supported for target phone sequence queries.
Va SThen all length sequences can be stored in the database and used for the retrieval of any target sequence that is shorter than Nmx.
The sequence generation stage can then be performed for each lattice as follows.
Let be defined as the maximum length of target phone sequences that will be supported for speech retrieval and let K be defined as the maximum node sequence set size that will be stored for each node. Let where A is the collection of nodes and let S 0 2 be the set of Nm,, -length sequences that occur within the lattice.
Then for each node i in the phone-lattice, where node list traversal is done in timeorder, the set of -length sequences terminating at this node, iis determined using equation 2 as detailed above. The constrained set of sequences terminating at this node, i, K) is then determined using equation 11 as above.
The collection of node sequences, A A u i, K) is then updated. Next the ordered set of observed phone sequences B is determined in accordance with equation 12 below and the corresponding observed phone sequence times C is determined in accordance with equation 13 below.
B ={(DA')JA'EAl (12) c= (A')jA'EA} (13) where transforms the node sequence A' to the corresponding node time sequence. The node sequences A, the observed phone sequences collection, B and INO 14
O
o the phone sequence times collection C are then stored in the sequence database 108.
Thus retrieval process is significantly simplified. For each node i, R(Q,P) can quickly be computed by retrieving Q(O,i) from the sequence database 108 as shown in FIG ~3.
0 Firstly a desired target word 205 is obtained, a target phone sequence 207 is then o derived by performing a phone decomposition operation 206 on the target word 205.
S 10 Typically, the phone sequence representation can be obtained by simply performing a lookup in a lexicon 210 containing pronunciations for a crosssection of common words in the target language. The target phone sequence is then compared against the phone sequences Q(O,i) 208 stored in database 105 and the results output for display 209.
However, since the purpose of this speech indexing and retrieval system is to provide an unrestricted query vocabulary, it is not sufficient to assume that every target word will exist in the provided lexicon, no matter how large the lexicon. Thus, as a fallback, standard spelling-to-sound rules can be used to derive this phone decomposition. A number of well-established methods exist to perform this phone decomposition, including spelling-to-sound rules and heuristics. A number of simple techniques are reviewed in the paper by D.H. Klatt entitled "Review of text-to-speech conversion for English", published in the Journal of the Acoustical Society of America vol. 82, pp 737-793, September 1987.
In order to further improve robustness of the system, the search operation in a further embodiment of the present invention is an extension of the search operation used in the standard phone lattice search. This search operation utilises the Minimum Edit Distance (MED) during the lattice search to compensate for phone recogniser errors.
The MED can be calculated using the Levenstein algorithm. A basic implementation of the Levenstein algorithm uses a cost matrix to accumulate transformation costs. A 0O O recursive process is desirably used to update successive elements of this matrix in order to discover the overall minimum transformation cost.
Let the sequence P (I be defined as the source sequence and the sequence be defined as the target sequence. In addition three o transformation cost functions are defined as follows: c C, represents the cost of transforming symbol x in P to symbol y in Q.
\V
STypically this has a cost of 0 ifx=y i.e. a match operation; c" 10 the cost of inserting symbol y into sequence P; and Cd(x) the cost of deleting the symbol x from sequence P.
The element at row i and columnj in the cost matrix represents the minimum cost of transforming the subsequence (pk) to Hence the bottom-right element of the cost matrix represents the total minimum cost of transforming the entire source sequence P to the target sequence Q.
The basic premise of the Levenstein algorithm is that the minimum cost of transforming the sequence is any of: 1. The cost of transforming to plus the cost of inserting qj; 2. The cost of transforming )'to plus the cost of deleting pi; 3. The cost of transforming to plus the cost of substituting pi with qj.
If qj then this is usually taken to have a cost of 0.
In this way, the cost matrix can be filled from the top-left corner to the bottom-right corner in an iterative fashion. The Levenstein algorithm is then as follows: 1. Initialise a x matrix 0. This is called the Levenstein cost matrix.
2. The top left element o,o represents the cost of transforming the empty sequence to the empty sequence; this is therefore initialised to 0
\O
o 3. The first row of the cost matrix represents the sequence of successive insertions. Hence it can be initialsed to be: O Oj-, S4. The first column of the cost matrix represents successive deletions. It therefore can also be immediately initialised to be:
IN
,o C 0 Update elements of the cost matrix from the top-left down to the bottom-right using the Levenstien update equation: S= Minn, C, C, q FIG 6 shows an example of a cost matrix 110 obtained using the MED method for transforming the word "deranged' to the word "hanged' using constant transformation functions. It shows that the cheapest transformation cost is 3. There are multiple means of obtaining this minimum cost. For example, both the operation sequences (del, del, subst, match, match, match, match, match) and (subst, del, del, match, match, match, match, match) have costs of 3.
Given source and target sequences, the MED calculates the minimum cost of transforming the source sequence to the target sequence using a combination of insertion, deletion, substitution and match operations, where each operation has an associated cost. In this embodiment of the invention, each observed lattice phone sequence is scored against the target phone sequence using the MED. Lattice sequences are then accepted or rejected by thresholding on the MED score, hence providing robustness against phone recogniser errors. The standard lattice search is a special case of this embodiment of the invention where a threshold of 0 is used.
Let P (P (I P 2 1 PJ) be defined as the target phone sequence, where N is the target phone sequence length. In addition let be the maximum MED score threshold, K be the maximum number of observed phone sequences to be emitted at each node, and V be defined as the number of tokens used during lattice traversal. Then for each node in the phone-lattice, where node list traversal is done in time-order, then o ~for each token in the top K scoring tokens in the current node letQ=q.
M=N+MAX(C,) x be the observed phone sequence obtained by traversing the token history backwards M levels, where Cj is the insertion MVED cost function.
ci 10 Let S =JIESTMED(Q,P,Cj,,CCJ, where C 4 is the deletion cost function, C, is the substitution cost functions, and BESTMED( returns the score of the first element in the last column of the MVED cost matrix that is S5 (or otherwise). Then emit Q as a keyword occurrences if S:5 For each node linked to the current node, perform V-best token set merging of the current node's token set into the target node's token set.
A number of optimisations can be used to improve throughput of this search process.
In particular, MVED calculations can be aggressively optimised to reduce processing time. One such optimisation is to only calculate successive columns of the MVED matrix if the minimum element of the current column is less than since by definition the minimum of a MED matrix column is always greater than or equal to the minimum of the previous column.
Another optimisation is the removal of lattice traversal from query-time processing.
Since the paths traversed through the lattice are independent of the queried phone sequence (traversal is done purely by maximum likelihood), it is possible to perform the lattice traversal during the speech preparation stage and hence only store the observed phone sequences at each node for searching at query-time. Therefore, if the maximum query phone sequence length is fixed at Nm,, and the maximum sequence match score is preset at it is only necessary to store observed phone sequences of length -IMAX(C,) x Smax, for searching at query time. Querytime processing then reduces to simply calculating the MVED between each stored observed phone sequence Q(Qi) and the target phone sequence P P P N N18 0 This optimisation results in a significant reduction in the complexity of query-time processing. Whereas in the previous approach, full Viterbi traversal was required, processing using this optimised approach is now a linear progression through a set of observed phone sequences. The improved lattice building algorithm is as follows.
Firstly the recognition lattice is constructed using the same approach as in the basic o method discussed above. Let A where A is the collection of observed phone N sequences. For each node in the phone-lattice, where node list traversal is done in 0 S 10 time-order, for each token in the top K scoring tokens in the current node, let m) be the observed phone sequence obtained by traversing the token history backwards levels. The sequence Q is the appended to the collection A, for each node linked to the current node perform V-best token set merging of the current node's token set into the target node's token set. The observed phone sequence collection is the stored for subsequent searching.
The recognition lattice can now be discarded as it is no longer required for query-time searching, this allows the considerably simpler query time search algorithm. Firstly the previously computed observed phone sequence collection for the current utterance then for each member, Q of the collection of observation sequences, A Let S BESTMED(Q,P,Ci,Cd,Cs) then emit Q as a putative occurrence if S5 Both of the above discussed algorithms utilise a Minimum Edit Distance (or Levenstien distance) Cost matrix. The Levenstein distance measures the minimum cost of transforming one string to another. Transformation is performed by successive applications of one of four transformations matching, substitution, insertion and deletion. Typically, each transformation has an associated cost, and hence implicitly the Levenstein algorithm must discover which sequence of transformations results in the cheapest total transformation cost.
In a further embodiment of the present invention the sequences stored within the sequence database may be grouped together to form what the applicant calls hypersequences using a number of predefined criteria.
ID19 0 FIG 4 illustrates the basic hypersequence generation process 300, here the set of observed sequences 301 generated during the sequence generation process discussed above are grouped together to form a set of hypersequences 302 in accordance with number of predefined criteria (hypersequence rules) 303. The o resulting set of observed hypersequences are then stored the hypersequence database 213.
0 A single hypersequence may then be used to represent each group of sequences. If a sensible hypersequence mapping is used, then at search time, it is possible to simply search the hypersequence database to identify which hypersequences (and thus which corresponding groups of sequences) are most likely to yield a match for the target sequence.
In this way, the sequence database can be represented using a hierarchical structure. Since a hierarchically structured database can be more efficiently searched than a flat structured database, the entire speech retrieval process becomes significantly faster and thus more scalable for large database search tasks.
A significant factor that affects the overall performance gain in hypersequence based searching is the structure of the hypersequence mapping. Let E(9) be defined as a hypersequence mapping that maps a sequence e to its corresponding hypersequence3. In order to produce a hypersequence domain that has a smaller search space than the sequence database, the mapping E must be an N-1 mapping. Thus the inverse mapping B which maps a hyper sequence to its set of associated sequences (or hypersequence cluster) must be a N mapping.
The hypersequence mapping can thus be considered as a compressing transform, with a compression factorS. The larger the value of 5 the more restricted hypersquence search space which results in a faster hypersequence database search. However, too large a value of J results in a very large number of sequences being associated with each hypersequence, thus resulting in a greater amount of
O
Ssubsequence processing required during the refined search of the sequence database.
Thus the mapping function E(O) should be selected such that it provides a compromise between domain compression and the average size of the resulting S hypersequence sequence clusters. In this particular instance a sequence is mapped Sto a hypersequence on a per-element basis. That is, given a sequence 0 of length N Sa hypersequence is generated by performing a per-element mapping of each o element of e to result in a hypersequence of length N. Thus the hypersequence 0 10 transform has the following form: (O (14) Then development of the hypersequence transform reduces to a careful selection of the form of the element mapping function A number of forms for this element mapping function may be applied a number of which are detailed below.
The first is a linguistic knowledge based mapping. Here mapping rules are derived using well-defined classings from linguistic knowledge. For example, a valid mapping function may be to map all phones to one to their most representative linguistic class, such as vowels, fricatives, nasals, etc. as shown in table I below.
Table I Linguistic-based hypersequence mapping function 9' (O aa,ae,ah,ao,aw,ax,ay eh,en,er,ey ih,iy Vowel ow,oy uh,uw b,d,g,k,p,t Closure/Stop ch,fjh,s,sh,v,z,zh Fricative hh,l,r,w,wh,y Glide m,n,nx Nasal IND 21 0 0 The second form of the element mapping is a data driven acoustic mapping. In this form of the mapping, the appropriate rules are derived using an acoustic-based clustering approach. For example, a maximum likelihood decision-tree clustering approach can be used to derive a set of clusterings for phones, which can then be translated into a rule-based mapping. A similar approach is used for deriving triphone o mappings in triphone decision-tree clustering.
oThe third approach is a context dependent mapping. Incorporating context restricts Imapping rules in this manner is more theoretically correct, but results in a larger number of target hypersequence symbols, and thus a much smaller hypersequence domain compression factor. It should be noted that both the linguistic knowledge based and data driven acoustic mapping can be constructed in a context dependent or context independent fashion.
Once a valid hypersequence function has been selected, the hypersequence database can then be generated as follows. Firstly the inverse hypersequence mapping function 8 k) is initialised to be a null mapping function defined by: v9 For each node sequence, Ak, in the sequence database A the corresponding hypersequence is obtained using: 9k (A k) (16) The inverse hypersequence mapping function is then updated in accordance with: k= (S k)U Ak (17) the resulting inverse hypersequence mapping function is then stored to disk for use during speech retrieval stage.
IND 22
\O
o A unique list of hypersequences D is then generated, the list is simply the domain of the inverse mapping function defined as follows: D=dom{( (18) oThe hypersequence collection, D, is then stored to disk for use during speech retrieval stage.
\O
o The retrieval process, as shown in FIG 5 firstly involves obtaining the target word 205 and then producing the target phone sequence 207 by performing a phone decomposition 206 on a given target word. A crude search 212 is the preformed on the constrained domain hypersequence database 213, and the associated sequence clusters of any matching hypersequences corresponding to a set of target phone sequences P {p],P 2 are then emitted 214 as detailed below.
For each target phone sequence, p the equivalent target hypersequence is determined The set of candidate sequence function is initialised to be a null function Then for each hypersequence d in the domain set D of the hypersequence inverse mapping function 9 calculate the sequence distance Sequence distance can be calculated using one of the many approaches described in detail below. A number of experiments conducted by the applicant have demonstrated that a simple equality-based distance function is sufficient to retrieve a significant portion of the correct candidate sequences, however, more complex distance functions provide greater control on the retrieval accuracy of the overall system.
If the sequence distance is within the hypersequence emission threshold,5, then update the candidate sequence function to include the corresponding sequence cluster using: H(p) (19) I23 0 0 c- The final candidate sequence function is then outputted. This outputted set of Ssequences then undergoes the refined sequence search briefly mentioned above.
SThe refined search 208 is conducted on the candidate sequence function, {O(pi),fl(p 2 in order to better identify the correct instances of the target 0 Sphone sequences, P 0 C Such a search process can be performed by evaluating the set of sequence Va o distances between each member of and the corresponding target sequence C<1 and emitting only those sequences that are within a predefined acceptance threshold, 5' in accordance with the following algorithm.
For each target sequence, p, the set of candidate sequences function is initialised as a null function and the candidate sequence set rI(p,) obtained using the hyper sequence search. Then for each member 7c, of evaluate the sequence distance A((D(r),p,)and emit the sequence 7r, as a candidate if its distance is within the sequence acceptance threshold S'wherein The final result set is then obtain using R(p, I 2 Y( (21) where (o(ir)is the phone symbol sequence of 7r, and is the timing information for 7r obtained from the phone sequence times collection C.
Thus the output 209 of this stage is a set of refined results, R 2 containing matched phone symbol sequences and corresponding timing information.
O24 0 As discussed both the hypersequence database search and sequence database search stages in this particular instance utilise the sequence distance to obtain a measure of the similarity between a candidate sequence and a target sequence.
There are a number of methods that can be used to evaluate sequence distance. For example traditional lattice-based search methods use an equality-based distance o function. That is, the sequence distance function is formulated as: 0 A(AD fNeq(a,,b,)= N ,0 (22) Sotherwise ,o where eq(x, hebrwise (23) fotherwise ,0 Unfortunately, this type of distance measure does not provide suitable robustness to erroneous lattice realisations. Since the phone recognition is highly erroneous (quoted error rates are typically in the vicinity of 30 to it is quite probable that a true instance of a target phone sequence will actually be realised erroneously with substitution, insertion or deletion errors.
As discussed above in order to improve robustness to lattice errors the Minimum Edit Distance (MED) algorithm is used to determine the cost of transforming an observed lattice sequence to the target sequence. This cost is indicative of the distance between the two sequences, and thus provides a measure of similarity between the observed lattice sequence and the target sequence.
Thus, the sequence distance for speech retrieval stage in this particular embodiment can be formulated as follows:
O
A(A, B) BESTMED(A, B, C (24) where CT' insertion cost function C' =delection cost function C, substition cost function 0 BESTMED(....) the minimum score of the last column of the MEDcost matrix should be noted that in this instance the inverse cost functions C' and have V. been use as opposed to the cost functions C, and Cd as discussed above. This is 0 o 5 because it is more natural within the context of this speech retrieval algorithm to consider the insertion function as corresponding to the cost of assuming that an insertion error occurred in the source sequence the phone sequence from the recogniser). This corresponds to a deletion within conventional MED framework, represented by the cost function Cd Similarly, it is more natural to consider the deletion function as corresponding to a deletion error within the source sequence, which corresponds to an insertion in the conventional MED framework.
Accordingly within the context of this embodiment of the invention with or without hypersquence searching, the insertion function refers to the inverse cost function Cd and the deletion function refers to the inverse cost function C,.
The quality of the distance function is dependent on the three cost functions, C, andC,. During experimentation the applicant found that it was useful to set either the insertion cost or deletion cost functions to cr to prevent a substitution error being mimicked by a combination of insertion and deletion.
Additionally, it was found that using an infinite deletion cost function lead to a slight improvement in detection performance, and thus, C o is used.
The insertion and substitution cost functions may then be derived in one of two fashions. The first is via the use of knowledge based linguistic rules. Under this method a set of rules are obtained using a combination of linguistic knowledge and observations regarding confusion of phones within the phone recogniser. For IN 26 example, it is probable that the stop phones Ib, Idl, and Ip/ are likely to be confused by a recogniser and thus there would be a small substitution cost. In contrast, it is less likely that a stop phone would be confused with a vowel phone, and thus this substitution would have a correspondingly higher cost.
o Thus, through careful selection and experimentation, it is possible to craft a set of rules that can be used to represent both the insertion and substitution cost functions.
o For example a well-performing system was built using a fixed insertion cost function I of C' 1 1 and a variable substitution cost function as shown in table II below: 10 Table II Rule-based substitution costs Rule Cost same-letter consonant phone substitution eg. iz/ /h/0 vowel substitutions 1 closure and stop substitutions 1 all other substitutions oo Rules used to derive phone substitution costs Phones Cost Phones Cost aa ae ah ao aw ax ay d dh 0 eh en er ey ih iy 1 n nx 0 ow oy uh uw t th 0 bddhgkptthjh I uww 1 z zh s sh 1 w wh 0 Actual phone substitution costs The alternative approach for deriving the required cost functions is to estimate probabilistically based representations estimated from a set of development data. In this instance the estimation was based on the phone recogniser confusion matrix which can be generated by comparing transcripts hypothesised by the recogniser to a given set of reference transcripts.
Given a confusion matrix, a number of base statistics can be calculated. The following derivation uses the events defined below.
I0 21 0O c, E, :phone x was emitted by the regoniser R, phone x was in the reference transcript (26) Thus the elements of the confusion matrix, F can be defined as: r(x,y)=p(E (27) 0 SOThe unconditional probabilities P(E,)and P(Ry)can be computed from the confusion 0 0 matrix in accordance with the following: p(E,Ry) (28) P(E) y p(E (29) Thus it is possible to obtain the substitution probability using Bayes theorem, as follows: p(Ex R)P(R
P(E,)
R) (31) S {p(ExI R )Zk p(Ek jR which is the likelihood that the reference phone was y given that the phone recogniser output the phone x. This provides a basis for the subsequent derivation of the substitution cost function.
The insertion probability p(Ix)can be simply computed by counting the number of times the phone x is inserted in the hypothesised transcriptions with respect to the set of reference transcripts.
Given these probabilities, the required cost functions, C, and CT' can then be computed using a number of approaches: 0 a) Information-theoretic: Here the information of a given event is used to represent G its cost. The information of an event is representative of the uncertainty of the event, and thus an indication of the cost that should be incurred if the event occurs. Thus an event with a high level of uncertainty will be penalised more 0 severely than a common event.
o From information theory, the information of a given event is given by: \0O o 10 log p(X) (32) Thus, the required cost functions can derived as follows: C, y)=I(Ry E) (33) =-logp(Ry (34) p(E, Rj )Ep(Ek R) (36) logp(I,) (37) This approach may be further extended to use the mutual information I(x; relative entropy, or other information-theoretic concepts.
b) Affine transform: An alternative means of deriving the cost functions from the given conditional event probabilities is to simply apply a numerical transform. The cost of an event can be said to be inversely proportional to the likelihood of its occurrence, and thus, a variety of decreasing functions could be used to transform the given conditional probabilities to the required costs.
The types of functions that could be used to perform this transformation include: a) The negative exponential, Ae-h(x-); b) The tangent function, A tan b(x c) 0 A cIc) The inverse function, bx- c) For example, using the negative exponential, the substitution cost function is given _b by: Q -eb(((RY (38) 2 ex brp(Exj Ry)X, p(Eil Ry) J Z ep(xIA- p(b~i C (39) ci 5 where the parameters A, b and c need to be tuned appropriately
IND
o Both the knowledge-based and data-driven approaches described above have been shown by experimentation to estimate robust MVED distances. However, both approaches are approximations of a true cost function: the knowledge based approach is based on human derived rules which are clearly prone to error, while the data-driven approach is based on a confusion matrix which is estimated from a development data set.
Thus both approaches are suboptimal and are prone to error. To reduce the effects of this error, fusion can be used to obtain a hopefully more robust cost function estimate. A linear fusion approach is used to derive a overall fused substitution cost function of the form: Cjf(x, y) =a C,a(x,y) (I a) C,b(x, y) where C,a(x, y) and C,b(x, y) are the two candidate substitution cost functions to be fused.
The quality of the resulting fused function wilt depend on two important factors: the amount of complimentary information between the candidate functions, and the choice of the fusion coefficient. The amount of complimentary information is most appropriately measured qualitatively by experimentation. However, the selection of the fusion coefficient can be done using a number of standard approaches, including gradient descent or expectation maximisation.
0 The above derivations of cost functions have focused on a context-independent cost function. That is, the value of the cost function is only dependent on the hypothesised phone and the reference phone.
However, it is an established fact that the types of phone substitution that occur are o typically well correlated with the context in which the event is present. That is, there is a considerable amount of dependence between the current hypothesised phone, the o set of previously hypothesised and reference phones, and the set of following 0 hypothesised and reference phones. Thus, there is likely to be a considerable improvement in the quality of the cost functions if contextual information is incorporated into the process.
One approach to incorporating context into the cost function is to only include the short term hypothesised phone context. To facilitate this derivation, the following notation is used: E,(t)=phone x was emitted by regoniser at time t (41) R, phone x was in the reference transcription at time t (42) Then, the contextual event likelihoods required for estimating the necessary cost functions are p F E p(E (I E E (t RIRY (t))P(Ry p(RY Q EC 43) if only single phone context is used. The required probabilities p(Ea (t E, E, (t 1)jRy(t)), and P(E (x E5 E, (i can be easily computed using the context dependent counts from the confusion matrix.
Clearly, the above formulation can be extended to longer contexts if sufficient data is available to maintain robustness in the estimates. However, this introduces additional complexity into the computation of the function during MED scoring and may thus reduce the overall speed of speech retrieval.
0O Further gains in throughput can be obtained through optimisation of the MED C calculations. MED calculations are in fact the mostly costly operations performed during the search stage. The basic MED algorithm is an O(N 2 algorithm and hence not particularly suitable for high-speed calculation. However, within the context of this o embodiment of the invention, a number of optimisations can be applied to reduce the computational cost of these MED calculations. These optimisations include the prefix o sequence optimisation and the early stopping optimisation.
IND O o 10 The prefix sequence optimisation utilises the similarities in the MED cost matrix of two observed phone sequences that share a common prefix sequence.
Let A =(al,a 2 and B=(b,,b 2 Also let B'be defined as the first order prefix sequence of B, given byB' m Finally, let the MED cost matrix between two sequences be defined as Y).
From the basic definition of the MED cost matrix, the (N I)x M cost matrix rn(A, B') is equal to the first M columns of the cost matrix This is because B' is equal to the first M-1 elements of B Therefore given the cost matrix it is only necessary to calculate the values of the (M+l)th column of Q(A, B) as shown in FIG 8. The argument extends to even shorter prefix sequences of B. For example, let B" be defined as the third-order prefix sequence of B, given by B" Then given n(A, it is only necessary to calculate the values of the (M-1)th, Mth and (M+l)th columns of (A,B)to obtain the full cost matrix.
Now, given that the MED cost matrix Q(A,B) is known, consider the task of calculating the MED cost matrix Let P(B,C)return the longest prefix I32 0 c- sequence of B that is also a prefix sequence of C. Then, t(A,C) can be obtained by taking Q(A,B) and recalculating the last ICl-lP(B,C columns.
In the case of the present invention, an utterance is represented by a collection of observed phone sequences. Typically, there is a degree of prefix similarity between sequences from the same temporal location, and in particular between sequences emitted from the same node. As demonstrated above, knowledge of prefix similarity 0 Cl will allow a significant reduction in the number of MED calculations required.
O
S 10 The simplest means of obtaining this knowledge is to simply sort the phone sequences of an utterance lexically during the lattice building stage. Then the degree of prefix similarity between each sequence and it's predecessor can be calculated and stored. For this purpose, the degree of prefix similarity is defined as the length of the longest common prefix subsequence of two sequences.
Then, during the search stage, all that is required is to step through the sequence collection and use the predetermined prefix similarity value to determine what portion of the MED cost matrix needs to be calculated, as demonstrated in FIG 9. As such, only changed portions of the MED cost matrix are iteratively updated, greatly reducing computational burden.
The early stopping optimisation uses knowledge about the threshold to limit the extent of the MED matrix that has to be calculated. From MED theory, the element of the MED cost matrix o(X,Y) corresponds to the minimum cost of transforming the sequence to the sequence For convenience, the notation 0 id used to represent o(X, The value of is given by the recursive operation 0 -i C, y Min Q C (44) Iij,1 (yj IN0 jj 0 0 Given the above formulation, and assuming non-negative cost functions, the value of SQ,j has a lower bound (LB) governed by: 1 That is, it is bounded by the minimum value of columnj 1 and all values above row i 2 5 in column j. This states that the lower bound ofh, j is a function of which Simplies o Min({LB(Q, 1
{Q,
1 1 z) (46)
IND
0 S This states that the lower bound of Qj is governed by all entries in the previous column and the lower bound of the element directly above it in the cost matrix. If the recursion is continuously unrolled, then the lower bound reduces to being only a function of the previous column and the very first element in column j, that is LB^>Min({LB(n {oQ, kj (47) Now MED theory states that Q,,j C, (yi) for all values ofJ. This means that for a positive insertion cost function: LB(n,j) LB(nQj_) (48) Substituting this back into equation 17 gives: LB(QJ) Min({LB(nj,, (49) This reduces to the simple relationship LB(OJ) It has therefore been demonstrated that the lower bound of Q,j is only a function of the values of the previous column of the MED matrix. This lends itself to a significant optimisation within the framework of the present invention.
ID 34
O
Since is fixed prior to the search, there is an upper bound on the MED score of observed phone sequences that are to be considered as putative hits. When calculating columns of the MED matrix, the relationship in equation 52 can be used to predict what the lower bound of the current column is. If this lower bound exceeds Sm. then it is not necessary to calculate the current or any subsequent columns of o the cost matrix, since all elements will exceed oThis is a very powerful optimisation, particularly when comparing two sequences that are very different. It means that in many cases only the first few columns will need to be calculated before it can be declared that a sequence is not a putative occurrence.
The early stopping optimisation and the prefix sequence optimisation can be easily combined to give even greater speed improvements. Essentially the prefix sequence optimisation uses prior information to eliminate computation of the starting columns of the cost matrix, while the early stopping optimisation uses prior information to prevent unnecessary computation of the final columns of the cost matrix.
When combined, all that remains during MED costing is to calculate the necessary inbetween columns of the cost matrix.
In the combined algorithm the first step is to initialise a MED cost matrix of size (N+1) x where N is the length of the target phone sequence and M is the maximum length of the observed phone sequences. Then for each sequence in the observed sequence collection let k be defined as the previously computed degree of prefix similarity metric between this sequence and the previous sequence. Using the prefix sequence optimisation, it is only necessary to update the trailing columns of the MED matrix. Thus, for each column, j, from (M 1)-k 1 to M 1 of the MED cost matrix determine the minimum score, MinScore(] 1) in columnj 1 of the cost matrix.
If MinScore (j 1) S,,m then using the early stopping optimisation, this sequence can be declared as not being a putative occurrence and processing can stop. All the elements for column j of the cost matrix are then calculated and S BESTMED(...) in the normal fashion given this MED cost matrix.
v.0
O
0 Determining the MED between two sequences requires the computation of a cost matrix, 0, which is an O(N 2 algorithm. Using the optimisations described above significantly reduces the number of computations required to generate this cost matrix, but nevertheless the overall algorithm is still of O(N 2 complexity.
In this instance an O(N) complexity derivative of the MED algorithm is used to oestimate an approximation of the MED. This results in a further reduction in the IN number of computations required for MED scoring, thus yielding even faster search speeds.
As discussed above discussed it was necessary to use an infinite deletion cost function to obtain good retrieval accuracy. Given this, the standard cost matrix update equation for MED becomes: =Min +C (51) -1 C, p, qj =Min 0i, (p (52) Mini Q C5 ,P qj) which is simply a direct comparison between the cost of inserting an element versus substituting an element. This effectively makes the top-right triangle of the MED cost matrix redundant since information from it is no longer required for computation of the cost matrix.
Additionally, because deletions are no longer considered, then element Q(i,N)in the final column of the cost matrix must be the result of exactly (M -i)Vi:M insertions where M is the number of elements in the source sequence A and N is the number of elements in the target sequence B. This is because, it is now only possible to move diagonally down and right (substitution or match), or directly right (insertion), since ND 36
O
S deletions are no longer allowed. Thus it is not possible to trace a path ton(i,N) Swithout performing exactly (i M) insertions (or moving right (M i)times).
If it can be assumed that insertions on average are more expensive than substitution, then clearly it would be less costly to select the set of transforms dictated by element (i,N)than those dictated by element Q(k,N)if i>k. The validity of this assumption can be maintained by using a high insertion penalty when generating CN lattices.
C 10 Then to obtain an approximation of MED(A,B), it would be on average more computationally efficient to compute the values in the final column in the order followed by followed by etc, since the transformations dictated by are on average more likely to be less costly than the transformations dictated by V i< k.
Thus an algorithm that favours computation of over where i k will on average result in the correct estimate of the minimum cost to transform sequence A to sequence B. The following algorithm is used to perform this type of biased calculation.
Let the source sequence be given by P=(pI, p 2 ,p,)and the target sequence be given byQ=(ql,q 2 ,q Then assuming N Mthe linearised MED cost vector, is derived, where Q'(i)corresponds to the approximate minimum cost of transforming subsequence to subsequence(B) k where k is equal to the number of insertions that have occurred prior too'(i). The Linearised MED cost vector thus plays a similar role to the MED cost matrix in the standard MED formulation.
Elements ofQ' are computed recursively in a similar fashion to the standard MED algorithm, using an update equation based on equation 52 above as follows.
IND 37 0 SFirstly a linearised MED cost vector of size M is initialised (where M is the length of the source sequence P) and the position pointers i and j are set to 1 i=1, j=1).
While i< Mthe current cost vector is updated using the update equation (0 i (Pi (53) Min( i-l+(pC 3q) SIf minimum cost was substitution then increment target sequence position pointer,j o j 1, if j N then terminate matching as the entire target sequence has been C- matched, with a cost of else increment the source sequence position pointer, i i 1. If the entire source sequence has been processed without reaching the end of the target sequence a match score of co is then declared.
The above algorithm has an O(N) complexity and thus considerably more computationally efficient than the standard O(N 2 MED algorithm. This provides significant benefits in the sequence database and hypersequence database search stages.
In addition to this the previously described prefix sequence and early stopping optimisations can be easily incorporated into the algorithm to provide further improvements in speed. The resulting Linearised MED algorithm that incorporates these optimisations is as follows.
Firstly a linearised MED cost vector of size where is the maximum length of the observed phone sequences. For each sequence in the observed phone sequence collection, where the collection has been sorted lexically, and the orders of prefix similarity have been computed, let k be defined as the previously computed degree of prefix similarity metric between this sequence and the previous sequence.
A(P,Q) is then determined as follows using the prefix sequence optimisation, it is only necessary to update the trailing elements of the linearised MED cost vector.
Thus, the initial position pointers are initialised as: i=k (54) j wO(i)
\O
0 Where co(i) is the value of j at which the value of was last computed (or zero on the first iteration). While i<Mthe current cost vector is updated using the update E equation Min(Q'(i-1 (56) Qi (i-1 +C,,,qJ If minimum cost was substitution then increment target sequence position pointer, j ci j 1, if I else increment the source sequence position pointer, i i 1. Then o using the early stopping optimisation, declare this sequence as having a match score Cl of oo and terminate matching for the sequence. Ifj N then terminate matching as the entire target sequence has been matched, with a cost of else increment the source sequence position pointer, i i 1. If the entire source sequence has been processed without reaching the end of the target sequence a match score of oo is then declared. Once A(P,Q) has been determined normal retrieval processing is then performed.
In one experiment performed by the applicant to compare the performance of the Dynamic Match Lattice Spotting technique of the present invention against conventional indexing techniques, a keyword spotting evaluation set was constructed using speech taken from the TIMIT test database. The choice of query words was constrained to words that had 6-phone-length pronunciations to reduce target word length dependent variability.
Approximately 1 hour of TIMIT test speech (excluding SA1 and SA2 utterances) was labelled as evaluation speech. From this speech, 200 6-phone-length unique words were randomly chosen and labelled as query words. These query words appeared a total of 480 times in the evaluation speech.
16-mixture triphone HMM acoustic models and a 256-mixture Gaussian Mixture Model background model were then trained on a 140 hour subset of the Wall Street Journal I (WSJ1) database for use in the experiment. Additionally 2-gram and 4gram phone-level language models were trained on the same section of WSJ1 for D39 0 use during the lattice building stages of DMLS and the conventional lattice-based methods.
All speech was parameterised using Perceptual Linear Prediction coefficient feature extraction and Cepstral Mean Subtraction. In addition to 13 static Cepstral o coefficients (including the 0th coefficient), deltas and accelerations were computed to generate 39-dimension observation vectors.
o Lattices were then constructed, based on the optimised DMLS lattice building approach described above. The lattices were generated for each utterance by performing a U-token Viterbi decoding pass using the 2-gram phone-level language model. The resulting lattices were expanded using the 4-gram phone-level language model. Output likelihood lattice pruning was then applied using a beam-width of W to reduce the complexity of the lattices. This essentially removed all paths from the lattice that had a total likelihood outside a beamwidth of W of the top-scoring path. A second V-token traversal was performed to generate the top 10 scoring observed phone sequences of length 11 at each node (allowing detection of sequences of up to 11 MAX(C,) x phones).
Lattice building was only performed once per utterance. The resulting phone sequence collections were then stored to disk and used during subsequent querytime search experiments.
The sequence matching threshold, Sm,, was fixed at 2 for all experiments unless noted otherwise. MED calculations used a constant deletion cost of C co, as preliminary experiments obtained poor results when non-infinite values of C' were used. The insertion cost was also fixed at C, 1.
In contrast, C, was allowed to vary based on phone substitution rules. The basic rules used to obtain these costs are shown in table III(a) and were determined by examining phone recogniser confusion matrices. However some exceptions to these rules were made based on empirical observations from small scale experiments. The final set of substitution costs used in the reported experiments is given in table III(b).
o Substitutions were completely symmetric. Hence the substitution of a phone, m, in a given phone group with another phone, n, in the same group yielded the same cost as the substitution of n with phone m.
Table III PHONE SUBSTITUTION COSTS FOR DMLS Rule Cost same-letter consonant phone sub tatuton 0 eg- /sh/ vowel substituions 1 closure and stop substitution; 1 all other substitutions .d Phones Cost Phones Cost aa ae ah ao aw ax ay d db 0 eh e er ey ihy 1 n xn 0 ow oy uh uw t th 0 bddh gkptthjh I uw w I z zh s sh I wwh 0 Rules used to derive phone substitution cots Actual phone substitution cost" Each system was then evaluated by performing single-word keyword spotting for each query word across all utterances in the evaluation set. The total miss rate for all query words and the False Alarm per keyword occurrence rate (FA/kw) were then calculated using reference transcriptions of the evaluation data. Additionally the total CPU processing seconds per queried keyword per hour (CPU/kw-hr) was measured for each experiment using a 3GHz IntelTM Pentium 4 processor. For DMLS, CPU/kwhr only included the CPU time used during the DMLS search stage. That is, the time required for lattice building was not included. All experiments used a commercialgrade decoder to ensure that the best possible CPU/kw-hr results were reported for the HMM-based system. This is because HMM-based keyword spotting time performance is bound by decoder performance.
For clarity of discussion the notation DMLS V, W, Smnax] is used to specify DMLS configurations, where U is the number of tokens for lattice generation, V is the number of tokens for lattice traversal, W is the pruning beamwidth, and S,,a is the sequence match score threshold. The notation HMM[a] is used when referring to baseline HMM systems where a was the duration-normalised output likelihood threshold used. Additionally the baseline conventional lattice-based method is referred to as CLS.
IN 41 0 Performances for the DMLS, HMM-based and lattice-based systems measured for the TIMIT evaluation set are shown in Table IV. For this set of experiments, the SDMLS[3,10,200,2] configuration was arbitrarily chosen as the baseline DMLS configuration.
Table IV SBaseline results Evaluated on TIMIT 0 0 ci
^D
0 Method Miss FA] CPU/ Rate k, kw-hr HlMM(ol 1.6 44.2 94.8 HMM(f-7580) 10.4 36.6 94.8 HMMi(-7000) 39.8 16.8 94.S CLS[3,10,200,0] 32.9 0.4 DM.LS(3,10J002] 10.2 18.5 18.0 The timing results demonstrate that as expected DMLS was significantly faster than the HMM method, running approximately 5 times faster. This amounts to a baseline DMLS system being capable of searching 1 hour of speech in 18 seconds. DMLS also had more favourable FA/kw performance at 10.2% miss rate, it had a FA/kw rate of 18.5, significantly lower than the 36.6 FA/kw rate achieved by the HMM[-7580] system. However, the HMM system was still capable of achieving a much lower miss rate of 1.6% using the HMM[oo] configuration, though at the expense of considerably more false alarms. The miss rate achieved by the conventional lattice-based system was very poor compared to that of DMLS. This confirms that the phone error robustness inherent in DMLS yields considerable detection performance benefits.
However, the false alarm rate for CLS was dramatically better than all other systems, though with a high miss rate.
Thus a considerable improvement in miss rate for DMLS over the baseline latticebased system is achievable. The improvements in performance could be attributed to the four main cost rules used in the dynamic match process insertions, i.e. same letter substitutions Idl+ldhl, In Inx), vowel substitutions and closure/stop substitutions Ibl Accordingly further experiments were conducted to quantify the benefits of individual cost rules.
A number of specialized DMLS systems were built to evaluate the effects of individual cost rules in isolation. The systems were implemented using customized MED cost functions as show in table V below: Table V Cost rules for Isolated Rule DMLS systems System CG C, C,(o,b) Same letter oo oo a and b same letter base, 1 subst otherwi5e c Vowel co co a and b are vowels, 1 subst otherwise, oo Closureftop oa o a and h ae closuestop, I subst otherwise, cc Insei:ons coc oc The evaluation set, recogniser parameters, experimental procedure and DMLS algorithm are the as that used in the above discussed evaluation experiment for the optimised DMLS system. Table VI shows the results of the specialised DMLS systems, baseline lattice-based CLS system and the previously evaluated DMLS [3,10,200,2] system with all MED rules.
Table VI TIMIT Performance When Isolating Various DP Rules Method Miss FA Rate kw CLS[3,10,200,0] 32.9 0.4 DMLS[3,10,200,] insertions 2.5 1.2 DMLS[3,10200,2] same letter subst 31.0 D.LS[3,10,200:2] vawel subst 15.6 7.8 DMLS[3,10,200:21 closuastop subst 23.5 DMILS[3,10,200.,2 all rules 10.2 18.5 The experiments demonstrate that the magnitude of contributions of the various rules to overall detection performance varies drastically. Interestingly no single rule brought performance down to the all rules DMLS system. This indicates that the rules are complementary in nature and yield a combined overall improvement in miss rate performance.
IND) 4 o Using the same letter substitution rules only yielded a small gain in performance over the null-rule CLS system: 1.9% absolute in miss rate with only a 0.1 drop in FA/kw rate. The result suggests that the phone-lattice is already robust to same letter substitutions, and as such, inclusion of this does not obtain significant gains in performance. Empirical study of the phone-lattices revealed this to be the case in o many situations. For example, typically if the phone Isi appeared in the lattice, then it was almost guaranteed that the phone Ishl also appeared at a similar time location in 0 the lattice.
IND
S 10 The insertions-only system yielded a slightly larger gain of 4.4% absolute in miss rate with only a 0.8 drop in FA/kw rate. The result indicates that the lattices contain extraneous insertions across many of the multiple hypotheses paths, preventing detection of the target phone sequence when insertions are not accounted for. This observation is to be expected since phone recognisers typically do have significant insertion error rates, even when considering multiple levels of transcription hypotheses.
A significant absolute miss rate gain of 17.3% was observed for the vowel substitution system. However, this gain was at the expense of a'7.4 absolute increase in FA/kw rate. This is a pleasing gain and is supported by the fact that vowel substitution is a frequent occurrence in the realisation of speech. As such, incorporating support for vowel substitutions in DMLS not only corrects errors in the phone recogniser but also accommodates this substitutionary habit of human speech.
Finally, significant gains were also observed for the closure/stop substitution system.
An absolute gain of 9.4% in miss rate combined with an unfortunate 2.6 absolute increase in FA/kw rate was obtained for this system. Typically closures and stops are shorter acoustic units and therefore more likely to yield classification errors. As such, even though the phone lattice encodes multiple hypotheses, it appears that it is still necessary to incorporate robustness against closure/stop confusion for lattice-based keyword spotting.
IND) 44
O
o The above discussed experiments demonstrate the benefits of the various classes of MED rules used in the evaluated DMLS systems. It should be noted that even the simplest of these rules provided tangible gains in DMLS system performance over that of the baseline system. The experimental results showed that insertion and same-letter consonant substitution rules only provided a small performance benefit 0 over a conventional lattice-based system, whereas vowel and closure/stop substitution rules yielded considerable gains in miss rate. Gains in miss rate were otypically offset by increases in FA/kw rate, although the majority of these gains were Ifairly small, and would most likely be justifiable in light of the resulting improvements in miss rate.
In addition to the above experiments were also conducted to quantify the effect of individual DMLS algorithm parameters on detection performance. The evaluation set, recogniser parameters, experimental procedure are the same as that used in evaluation of the fixed DMLS [3,10, 200, 2] configuration discussed above.
The number of tokens used for lattice generation, U, has a direct impact on the maximum size of the resulting phone lattice. For example, if a value of U 3 is used, then a lattice node can have at most 3 predecessor nodes. Whereas, if a value of U 5 is used, then the same node can have up to 5 predecessor nodes, greatly increasing the size and complexity of the lattice when applied across all nodes.
Tuning of U directly affects the number of hypotheses encoded in the lattice, and hence the best achievable miss rate. However, using larger values of U also increases the number of nodes in the lattice, resulting in an increased amount of processing during DMLS searching and therefore increased execution time.
Table VII shows the result of increasing U from 3 to 5. As expected, increasing U resulted in an improvement in miss rate of 4.4% absolute but also in an increase in execution time by a factor of 2.3. A corresponding 19.9 increase FA/kw rate was also observed. The obvious benefit of tuning the number of lattice generation tokens is that appreciable gains in miss rate can be obtained. Although this has a negative IND o effect on FA/kw rate, a subsequent keyword verification stage may be able to accommodate the increase.
Table VII Effect of Adjusting Number of Lattice Generation Tokens
O
0 t-1 0 0 0 0~ Method Mi ssRate FA'w CPUA-w-br DMLS[3,10,200,2 10.2 18.5 18.0 DMLS(5,10-0,2} 5.8 38.4 42.6 As discussed above lattice pruning is to remove less likely paths from the generated phone lattice, thus making the lattice more compact. This is typically necessary when language model expansion is applied. For example, applying 4-gram language model expansion to a lattice generated using a 2-gram language model results in a significant increase in the number of nodes in the lattice, many of which may now have much poorer likelihoods due to additional 4-gram language model scores.
The direct benefit of applying lattice pruning is an immediate reduction in the size of the lattice that needs to be searched. Resulting improvements in execution time, though at the expense of losing potentially correct paths that unfortunately did not score well linguistically.
Table VIII shows the effect of pruning beamwidth W for four different values 150, 200, 250 and o. As predicted, decreasing pruning beamwidth yielded significant gains in execution speed at the expense of reductions in miss rate. Corresponding drops in FA/kw rate were also observed.
Table VIII Effect of Adjusting the Pruning Beamwidth Method Miss Rate FAAw CPUikw-hr DNLS[3,10,150.2] 12.5 12.2 10.8 DMIS[3,10400,2] 10.2 18.5 18.0 DMIS[3,10:250,2] 9.2 24.7 2S.2 DMLS[3,10.o.2A] 7.3 60.6 175.8 IND 4b 0 0 Thus adjusting the pruning beamwidth appears to be particularly well suited for tuning execution time. The changes in CPU/kw-hr figures were dramatic, and in G comparison, the miss rate figures varied in a much smaller range.
The number of lattice traversal tokens, V, corresponds to the number of tokens used o during the secondary Viterbi traversal. Tuning this parameter affects how many tokens are propagated out from a node, and hence, the number of paths entering a o node that survive subsequent propagation.
IND ci The impact of this on DMLS is actually more subtle, and is demonstrated by FIG 7. In this instance, the scores of tokens propagated from the t node are much higher than the scores from the other nodes. As such, in the 5-token propagation case, the majority of the high-scoring tokens in the target node are from the t node. Hence the tokens above the emission cutoff the tokens from which observed phone sequences are generated) are mainly i nodes. However, using the same emission cutoff and 3-token propagation results in a set of top-scoring tokens from a variety of source nodes it is not immediately obvious whether it is better to use a high or low number of lattice traversal tokens for optimal DMLS performance.
Table IX shows the results of experiments using three different numbers of traversal tokens 5, 10 and 20. It appears that all three measured performance metrics were fairly insensitive to changes in the number of traversal tokens. There was a slight decrease in miss rate when using a higher value of V, though this may not be considered a dramatic enough change to justify the additional processing burden required at the lattice building stage.
Table IX Effect of Adjusting Number of Traversal Tokens Method Mi-s Rae CPU kw-hr DMLS[3,5,200,21 10.4 t17.4 16.8 DMLS[3,10.2002] 10.2 18.5 13.0 DMILS[3,20200.2] 9.8 18.8 17.4 ID 4(
O
o Tuning of the MED cost threshold, is the most direct means of tuning miss and FA/kw performance. However, if discrete MED costs are used, then Sx, itself will be G a discrete variable, and as such, thresholding will not be on a continuous scale. The Sax parameter controls the maximum allowable discrepancy between an observed phone sequence and the target phone sequence. Experiments were carried out to 0 study the effects of changes in S,,m on performance. The results of these experiments are shown in table X. Since thresholding was applied on the result set of DMLS, 0 there were no changes in execution time.
\O
o 10 The experiments demonstrated that adjusting gave dramatic changes in FA/kw.
In contrast, the changes in miss rate were considerably more conservative except for the 0 case. Tuning of the MED cost threshold therefore appears to be most applicable to adjusting the FA/kw operating point. This is intuitive since adjusting Sma, adjusts how much error an observed phone sequence is allowed to have, and as such has a direct correlation with false alarm rate.
Table X Effect of Adjusting MED cost threshold Method Min Rate FA'cw CPUkw-hr DMLS[3,10,200,0] 31.0 0.5 18.0 DMLS[3,10200,1] 13.3 4.3 81.0 DMLSI3,10,200,.2 10.2 18.5 18.0 DMLS[3,10,2003] 8.7 52.0 18.0 Given the above experiments were conducted on systems using a combination of the tuned parameters discussed above. As such, two tuned systems were constructed and evaluated on the TIMIT data set. Parameters for these systems were selected as follows: 1. The number of lattice generation tokens U was set to a value of 2. As the DMLS system performance appeared insensitive to changes in the number of lattice traversal tokens, the value of V was maintained at the value used for the previously discussed experiments i.e. IND 4 0 o 3. The speed increases observed using a reduced lattice pruning beamwidth were quite dramatic, and in comparison only resulted in a small decrease in miss G rate. Considering the anticipated gains in miss rate from the increase in the number of lattice generation tokens, a reduced value of W= 150 was used.
o 4. Two values of were evaluated to obtain performance at different false alarm points. The values evaluated were S,,m 1 and Smax 2. Although it was 0 anticipated that a reduction in miss rate would be observed for the lower S,,x 1 I system, it was hoped that this would be compensated for by the increase in the 0 10 number of lattice generation tokens, and further justified by the significantly lower false alarm rate.
The results of the tuned systems on the TIMIT evaluation set are shown in table XI.
The first system achieved a significant reduction in FA/kw rate over the initial DMLS [3,10,200,2] system at the expense of only a small 1.3% absolute increase in miss rate. The second system obtained a good decrease in miss rate of 2.9% with only a small 3.8 FA/kw rate increase. Both these systems maintained almost the same execution speed as the initial DMLS system.
Table XI Tuned DMLS Configurations Evaluated on TIMIT Method Mis Rate FAlw CPU.Aw-hi U=hmed DMLS[3.10,200,2] 10.2 18.5 18.0 Tumed DMLS[5.1,150,1] 11.5 5.6 18.6 Tuned DMLS[5,10,150,2] 7.3 22.3 18.6 The above discussed experimental evaluation of the DMLS system was conducted in respect of a clean microphone speech domain. The conversational telephone speech domain is a more difficult domain but is more representative of a real-world practical application of DMLS. Accordingly experiments were performed using the Switchboard 1 telephone speech corpus. To maintain consistency, the same baseline systems, DMLS algorithms and evaluation procedure as discussed above.
IO 4U 0 The evaluation set in this instance was constructed in a similar fashion to the previously constructed TIMIT evaluation set. Approximately 2 hours of speech was Staken from the Switchboard corpus and labelled as evaluation speech. From this speech, 360 6-phone-length unique words were randomly chosen and marked as query words. In total, these query words appeared a total of 808 times in the o evaluation set.
O Acoustic and language models were trained up on a 165 hour subset of the IND Switchboard-1 database using the same approach as used for the previous TIMIT 0 10 experiments.
The results for the HMM-based, conventional lattice-based, and DMLS experiments on conversational speech SWB1 data are shown in table XII. DMLS performance was measured using the baseline DMLS[3,10,200,2] system as well as a number of tuned configurations. Tuned systems were constructed using a combination of lattice generation tokens, pruning beamwidth and Sm tuning.
Table XII Keyword Spotting Results on SWB1 Method Miss Rate FA'w CPUAwk-hr HMM[-75001 8.0 366.9 106.2 HlB iIl-7300) 14.1 319.6 106.2 CLS[3,0,200,0] 38.4 3.2 DMLS[3,10,200,2] 17.5 59.0 30.6 DMLS(5,10,150.2 11.0 83.6 43.2 DMLS[5,10,150,1) 14.2 23.0 43.2 DMLS[5.10,100,2] 13.9 36.1 10.8 It is noted that a dramatic increase in FA/kw rates for all systems compared to those observed for the TIMIT evaluations. This is an expected result, since the conversational telephone speech domain is a more difficult domain for recognition.
For DMLS, this increase in false alarm rate is a result of the increased complexity of the lattices. It was found that the lattices generated for the Switchboard data were significantly larger than those generated for the TIMIT data when using the same pruning beamwidth. This meant that there were more paths with high likelihoods, indicating a greater degree of confusability within the lattices. As a result, more false alarms were generated.
\O
0 Losses in miss rate in the vicinity of 5% absolute were also observed for all systems compared to the TIMIT evaluations. Although this is unfortunate, these losses are still G minor in light of the increased difficulty of the data.
Overall though, DMLS still achieved more favourable performance than the baseline o HMM-based and lattice-based systems. The DMLS systems not only yielded considerably lower miss rates than CLS but also significantly lower FA/kw and o CPU/kw-hr rates than the HMM-based systems.
IND ci In terms of detection performance, the two best DMLS systems were the DMLS[5,10,150,1] and the DMLS[5,10,100,2 configurations. Both had lower false alarm rates than the other DMLS systems and still maintained fairly low miss rates.
However, the execution speed of the DMLS[5,10,100,2] configuration was 4 times faster than the DMLS[5,10,150,1] system. In fact, this system was capable of searching 1 hour of speech in 10 seconds and thus would be more appropriate for applications requiring very fast search speeds.
Overall, the experiments demonstrate that DMLS is capable of delivering good keyword spotting performance on the more difficult conversational telephone speech domain. Although there was some degradation in performance compared to the clean speech microphone domain, the losses were in line with what would be expected. Also, DMLS offered much faster performance than the HMM-based system and considerably lower miss rates than the conventional lattice-based system.
Experiments were performed to evaluate the execution time benefits of the prefix sequence and early stopping optimisations. Five systems were evaluated as follows: 1) NOPT: DMLS system without prefix sequence and early stopping optimisations; 2) ESOPT: DMLS system with early stopping optimisation; 3) PSOPT: DMLS system with prefix sequence optimisation; 4) COPT: DMLS system with combined early stopping and prefix sequence optimisations; and o 5) CXOPT: The COPT system with miscellaneous coding optimisations applied such as removal of dynamic memory allocation, more efficient passing of data, etc.
Experiments were performed using 10 randomly selected utterances from the Switchboard evaluation set detailed above. Single word keyword spotting was o performed for each utterance using a 6-phonelength target word. Each utterance was processed repeatedly for the same word 1400 times and the total execution time was O measured for all passes. The total time was then summed across all tested ci I0 utterances to obtain the total time required to perform 10 x 1400 passes. The relative 0 10 speeds were calculated by finding the ratio between the measured speed of the tested system and the measured speed of the baseline NOPT system. The entire evaluation was then repeated a total of 10 times and the average relative speed factor was calculated. Execution time was measured on a single 3GHz Intel T M Pentium 4 processor.
In addition the final putative occurrence result sets were examined to ensure that exactly the same miss rate and FA/kw rates were obtained across all methods, since both optimisations should not affect these metrics. Table XIII shows the speed of each system relative to the baseline unoptimised NOPT system. Tests were performed using S,,m values of 2 and 4, since the benefits of the early stopping optimisation depend on the value of S,,x.
Table XIII Relative Speeds of Optimised DMLS systems Sma, System Relate speed factor 2 NOPT 1.00 2 PSOPT 0.60 2 ESOPT 0.42 2 COPT 0.25 2 CXOPT 0.16 4 NOPT 1.00 4 PSOPT 0.60 4 ESOPT 0.64 4 COPT 0.32 4 CXOPT 0.21 0 o The results clearly demonstrate that both optimisations yielded significant speed benefits. An even more pleasing result was that the two optimisations combined Seffectively to reduce execution time by a factor of 4 for the 2 tests, and by a factor of 3 for the 4 tests. Overall the fully optimised CXOPT system ran about 5 to 6 times faster than the original unoptimised system. Table XIV shows the 0 execution time of the unoptimised DMLS system discussed above as well as the CPU/kw-hr figure for the same system incorporating the early stopping and prefix 0 sequence optimisations. It can be seen that the resultant speed is 1.8 CPU/kw-hr.
I This is an impressive result and clearly emphasises the suitability of DMLS for very 0 10 fast large database keyword spotting applications.
Table XIV Fully Optimised System on Switchboard Method Miss FA1 CPU/ Rate kw kw-hr O,100,2] 13.9 36.1 10.8 DMLS[5,10,100,2] with CXOPT 13.9 36.1 1.8 While the above discussed experiments were performed utilising TIMIT and SWO1 test database. It will be appreciated by those skilled in the art that such DMLS systems can be utilised across a variety of applications such as Podcast indexing and searching, Radio/television/media indexing and searching, Telephone conversation indexing and searching, Call Centre telephone indexing and searching etc.
It is to be understood that the above embodiments have been provided only by way of exemplification of this invention, and that further modifications and improvements thereto, as would be apparent to persons skilled in the relevant art, are deemed to fall within the broad scope and ambit of the present invention described herein.
Claims (50)
1. A method of indexing speech content, the method including the steps of: generating a phone lattice from said speech content; processing the phone lattice to generate a set of observed sequences o wherein E are the set of observed phone sequences for each node i in said phone lattice; and 0 storing said set of observed sequences Qfor each node. O N 10 2. The method of claim 1 wherein the step of generating of the phone lattice further includes the steps of: performing a feature based extraction process to construct a phone recognition network; and performing an N-best decoding on said phone recognition network to produce the phone lattice.
3. The method of claim 2 wherein the phone recognition network is constructed using phone loop or phone sequence fragment loop techniques.
4. The method of claim 2 or 3 where in the N-best decoding utilises a set of well trained acoustic models and an appropriate language model. The method of claim 4 wherein the set of well trained acoustic models are triphone Hidden Markov Models (HMM) and the language model is an N-gram language model.
6. The method of any one of the preceding claims wherein the step of generating the phone lattice further includes the step of optimising lattice size and complexity utilising one or more of the following: tuning a plurality of tokens U used to generate the phone lattice; pruning less likely paths outside a pruning beamwidth W;and/or tuning the number of lattice traversals V.
7. The method of any one of the preceding claims wherein Q=QOti) is generated in accordance with Q(O i) {,kc No hr is the set of all N-length sequences 0'=(O9 1 that exist in the lattice, and wherein each element 0, corresponds to a node within the lattice. o
8. A method for searching indexed speech content wherein said indexed speech o content is stored in the form of a phone lattice, the method including the steps of: INO ~obtaining a target sequence P=(IIP I cicomparing the target sequence with a set of observed sequences Q9= generated for each node i, where 0 are the set of observed phone sequences for each node i in said phone lattice, wherein the comparison between the target sequence and observed sequences includes scoring each observed sequence against the target sequence using a Minimum Edit Distance (MED) calculation; and outputting a set of sequences R from said set of observed sequences Q that match said target sequence P.
9. The method of claim 8 wherein 9 =(EUi) is generated in accordance with Q0 k Ce0 190 =4,l where is the set of all N-length sequences exist in the phone lattice, and wherein each element 9. corresponds to a node within the phone lattice. The method of claim 8 or 9 wherein scoring each observed sequence against the target sequence includes the step of generating a MVED cost matrix.
11. The method of claim 10 wherein the MVED calculation includes calculating the minimum cost S of transforming each observed sequence within the set of observed sequences into the target sequence in accordance with a set of insertion C, deletion C, and substitution C, costs, where S is defined by S BESTMEL(P, Q, C, CI, C,) and wherein JJESTMED( returns the last column MVED cost matrix that is less than a maximum score threshold
12. The method claim 11 wherein C, and Cd are fixed andC, is varied according Sto the following substitution rules: C, =0 for same letter consonant phone substitutions; C, =1 for vowel substitutions; C, =1 for closure and stop substitutions; and C, for all other substitutions. D 13. The method of any one of claims 10 to 12 wherein the MED cost matrix is generated in accordance with a Levenstein algorithm.
14. The method of any one of claims 11 to 13 wherein the MED calculations are optimised by only calculating successive columns of the MED cost matrix if the minimum element of the current column is less than
15. The method of any one of claims 8 to 14 wherein the method further includes the step of: processing the set of observed sequences Q to produce a set of hypersequences, wherein each hypersequence represents a particular group of observed sequences Q=
16. The method of claim 15 wherein the hypersequences are produced by mapping the observed sequences to a hypersequence domain in accordance with a predetermined mapping function.
17. The method of claim 16 wherein the mapping of the observed sequences to the hypersequence domain is performed on an element by element basis using a mapping method selected from: a linguistic knowledge based mapping; a data driven acoustic mapping; and a context dependent mapping. NO O o 18. The method of claim 15 or 16 wherein the step of comparing the target sequence and the observed sequences includes: comparing the target sequence with each hypersequence to identify sequence groups most likely to yield a match for the target sequence; and comparing said target sequence with the set of observed sequences Q ocontained within the identified hypersequence sequence groups.
19. A method of indexing and searching speech content, the method including the o steps of: generating a phone lattice from said speech content; processing the phone lattice to generate a set of observed sequences Q= wherein 0 are the set of observed phone sequences for each node i in said phone lattice; storing said set of observed sequences Q for each node; obtaining a target sequence P p 2 P); comparing said target sequence P with the set of observed sequences Q wherein the comparison between the target sequence and observed sequences includes scoring each observed sequence against the target sequence using a Minimum Edit Distance (MED) calculation; and outputting a set of sequences R from said set of observed sequences Qthat match said target sequence P. The method of claim 19 wherein the phone recognition network is constructed using phone loop or phone sequence fragment loop techniques.
21. The method of claim 19 or 20 where in the N-best decoding utilises a set of well trained acoustic models and an appropriate language model.
22. The method of claim 21 wherein the set of well trained acoustic models are triphone Hidden Markov Models (HMM) and the language model is an N-gram language model. o 23. The method of any one of claims 19 to 22 wherein the step of generating the phone lattice further includes the step of optimising lattice size and complexity of the G phone lattice utilising one or more of the following: tuning a plurality of tokens U used to generate the phone lattice; pruning less likely paths outside a pruning beamwidth W ;and/or S(c) tuning the number of lattice traversals V. o 24. The method of any one of claims 19 to 23 wherein Q is generated in O accordance with k e O where is the set of all N-length sequences that exist in the lattice, and wherein each element 8 corresponds to a node within the lattice. The method of any one of claims 19 to 24 wherein scoring each observed sequence against the target sequence includes the step of generating a MED cost matrix.
26. The method of claim 25 wherein the MED calculation includes calculating the minimum cost S of transforming each observed sequence within the set of observed sequences into the target sequence in accordance with a set of insertion Ci, deletion Cd and substitution C, costs, where S is defined by S BESTMED(P,Q,C,, C,C,) and wherein BESTMED(...) returns the last column MED cost matrix that is less than a maximum score threshold S,,a.
27. The method claim 26 wherein C, and are fixed and C, is varied according to the following substitution rules: C, =0 for same letter consonant phone substitutions; C, =1 for vowel substitutions; C, =1 for closure and stop substitutions; and C, for all other substitutions. 0 o 28. The method of any one of claims 25 to 27 wherein the MED cost matrix is generated in accordance with a Levenstein algorithm.
29. The method of any one of claims 25 to 28 wherein the MED calculations are optimised by only calculating successive columns of the MED cost matrix if the Sminimum element of the current column is less than The method of any one of claims 19 to 29 wherein the method further D includes the step of: 0 processing the set of observed sequences Q=(Ei) to produce a set of hypersequences, wherein each hypersequence represents a particular group of observed sequences Q=
31. The method of claim 30 wherein the hypersequences are produced by mapping the observed sequences to a hypersequence domain in accordance with a predetermined mapping function.
32. The method of claim 31 wherein the mapping of the observed sequences to the hypersequence domain is performed on an element by element basis using a mapping method selected from: a linguistic knowledge based mapping; a data driven acoustic mapping; and a context dependent mapping.
33. The method of claim 30 or 31 wherein the step of comparing the target sequence and the observed sequences includes: comparing the target sequence with each hypersequence to identify sequence groups most likely to yield a match for the target sequence; and comparing said target sequence with the set of observed sequences Q (0,i) contained within the identified hypersequence sequence groups. IND 59 0 0 34. A system for indexing speech content, the system including: a speech recognition engine for generating a phone lattice from said speech content; a database for storing said phone lattice generated by said recognition engine; and o at least one processor coupled to the database said processor configured to: process said phone lattice to generate a set of observed sequences Q= (O,i)wherein 0 is the set of observed phone ci Isequences for each node i in said phone lattice; and 0 c 10 store said observed sequences Q in a said database. The system of claim 34 wherein the speech recognition engine is configured to: construct a phone recognition network utilising a feature based extraction process; perform an N-best decoding operation on said phone recognition network to produce the phone lattice; and store said phone lattice in the database.
36. The system of claim 35 wherein the feature based extraction process is performed by a suitable speech recognition programme.
37. The system of claim 34 or 35 wherein the phone recognition network is constructed using phone loop or phone sequence fragment loop techniques.
38. The system of any one of claims 35 to 37 where in the N-best decoding utilises a set of well trained acoustic models and an appropriate language model.
39. The system of claims 38 wherein the set of well trained acoustic models are tri-phone Hidden Markov Models (HMM) and the language model is an N-gram language model. 0O tU The system of any one of claims 34 to 39 wherein phone lattice size and complexity is optimised utilising one or more of the following: S(a) tuning a plurality of tokens U used to generate the phone lattice; pruning less likely paths outside a pruning beamwidth W ;and/or tuning the number of lattice traversals V.
41. The system of any one of claims 34 to 40 wherein Q is generated in accordance with Q',Q 2 e where is the set of all N-length sequences 0' that exist in the lattice, and wherein each element O8 corresponds to a node within the lattice.
42. A system for searching indexed speech content wherein said indexed speech content is stored in the form of an observed phone sequence database, the system including: an input for obtaining a target sequence P p 2 P); a database containing a set of observed sequences Q i) generated for each node i in said phone lattice, wherein are the set of observed phone sequences for each node i; at least one processor coupled to said input and said database, which processor is configured to: compare said target sequence P with the set of observed sequences Q, wherein the comparison between the target sequence and observed sequences includes scoring each observed sequence against the target sequence using a Minimum Edit Distance calculation; and output a set of sequences R from said set of observed sequences that match said target sequence.
43. The system of claim 42 wherein Q is generated in accordance with Q1Q2}.. =k e |O where is the set of all N-length \O sequences O' exist in the phone lattice, and wherein each element 0 corresponds to a node within the phone lattice.
44. The system of claim 42 or 43 wherein scoring each observed sequence 5 against the target sequence further includes generating a MED cost matrix. 0 The system of claim 44 wherein the MED calculation includes calculating the C minimum cost S of transforming each observed sequence within the set of observed S sequences into the target sequence in accordance with a set of insertion C, deletion and substitution C, costs, where S is defined by S BESTMED(P,Q, C,,Cd,C, and wherein BESTMED(...) returns the last column MED cost matrix that is less than a maximum score threshold
46. The system of claim 45 wherein C, and Cd are fixed and C, is varied according to the following substitution rules: C, =0 for same letter consonant phone substitutions; C, =1 for vowel substitutions; C,=1 for closure and stop substitutions; and C, for all other substitutions.
47. The system of any one of claims 44 to 46 wherein the MED cost matrix is generated in accordance with a Levenstein algorithm.
48. The system of any one of claims 44 to 47 wherein the MED calculations are optimised by only calculating successive columns of the MED cost matrix if the minimum element of the current column is less than
49. The system of any one of claims 42 to 48 wherein the method further includes the step of: SbZ 0 processing the set of observed sequences to produce a set of Shypersequences, wherein each hypersequence represents a particular group of observed sequences Q=
50. The system of claim 49 wherein the hypersequences are produced by mapping the observed sequences to a hypersequence domain in accordance with a predetermined mapping function. O o 51. The system of claim 49 wherein the mapping of the observed sequences to N 10 the hypersequence domain is performed on an element by element basis using a mapping method selected from: a linguistic knowledge based mapping; a data driven acoustic mapping; and a context dependent mapping.
52. The system of claim 49 or 50 wherein the step of comparing the target sequence and the observed sequences includes: comparing the target sequence with each hypersequence to identify sequence groups most likely to yield a match for the target sequence; and comparing said target sequence with the set of observed sequences Q contained within the identified hypersequence sequence groups.
53. A system for indexing and searching speech content, the system including: a speech recognition engine for generating a phone lattice form said speech content; a first database for storing said phone lattice generated by said recognition engine; an input for obtaining a target sequence P p 2 at least one processor coupled to said input and said first database, which processor is configured to: NOW process said phone lattice to generate a set of observed sequences Q= where E are the set of observed phone sequences for each node i in said phone lattice; store said observed sequences Q in a second database; compare said target sequence P with the set of observed sequences 0Q, wherein the comparison between the target sequence and observed sequences includes scoring each observed sequence against the target sequence using a Minimum Edit Distance 1calculation; and output a set of sequences R from said set of observed sequences Q that match said target sequence.
54. The system of claim 53 wherein the speech recognition engine is configured to: construct a phone recognition network utilising a feature based extraction process; perform an N-best decoding operation on said phone recognition network to produce the phone lattice; and store said phone lattice in the first database. The system of claim 54 wherein the feature based extraction process is performed by a suitable speech recognition programme.
56. The system of claim 54 or 55 wherein the phone recognition network is constructed using phone loop or phone sequence fragment loop techniques.
57. The system of any one of claims 54 to 56 where in the N-best decoding utilises a set of well trained acoustic models and an appropriate language model.
58. The system of claims 57 wherein the set of well trained acoustic models are tri-phone Hidden Markov Models (HMM) and the language model is an N-gram language model.
59. The system of any one of claims 53 to 58 wherein phone lattice size and G complexity is optimised utilising one or more of the following: tuning a plurality of tokens U used to generate the phone lattice; pruning less likely paths outside a pruning beamwidth W ;and/or S(c) tuning the number of lattice traversals V. o 60. The system of any one of claims 53 to 59 wherein Q is generated in 0O accordance with 8 e ©9 0 i, where Q is the set of all N-length sequences that exist in the lattice, and wherein each element O corresponds to a node within the lattice.
61. The system of claim 53 to 61 wherein scoring each observed sequence against the target sequence further includes generating a MED cost matrix.
62. The system of claim 61 wherein the MED calculation includes calculating the minimum cost S of transforming each observed sequence within the set of observed sequences into the target sequence in accordance with a set of insertion deletion C, and substitution C, costs, where S is defined by S BESTMED(P,Q,C,,C,, C,) and wherein BESTMED(...) returns the last column MED cost matrix that is less than a maximum score threshold
63. The system of claim 62 wherein C, and C, are fixed and C, is varied according to the following substitution rules: C, =0 for same letter consonant phone substitutions; C, =1 for vowel substitutions; C, =1 for closure and stop substitutions; and C, for all other substitutions. NO 0 64. The method of any one of claims 61 to 63 wherein the MED calculations are optimised by only calculating successive columns of the MED cost matrix if the minimum element of the current column is less than
65. The system of any one of claims 61 to 64 wherein the MED cost matrix is 2generated in accordance with a Levenstein algorithm.
66. The system of any one of claims 53 to 64 wherein the method further includes 0 N the step of: o 10 processing the set of observed sequences Q= to produce a set of C hypersequences, wherein each hypersequence represents a particular group of observed sequences Q
67. The system of claim 66 wherein the hypersequences are produced by mapping the observed sequences to a hypersequence domain in accordance with a predetermined mapping function.
68. The system of claim 67 wherein the mapping of the observed sequences to the hypersequence domain is performed on an element by element basis using a mapping method selected from: a linguistic knowledge based mapping; a data driven acoustic mapping; and a context dependent mapping.
69. The system of claim 66 or 67 wherein the step of comparing the target sequence and the observed sequences includes: comparing the target sequence with each hypersequence to identify sequence groups most likely to yield a match for the target sequence; and comparing said target sequence with the set of observed sequencesQ contained within the identified hypersequence sequence groups. DATED THIS SEVENTEENTH DAY OF MARCH 2006 QUEENSLAND UNIVERSITY OF TECHNOLOGY BY PIZZEYS PATENT AND TRADE MARK
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
AU2006201110A AU2006201110A1 (en) | 2006-02-02 | 2006-03-17 | Dynamic match lattice spotting for indexing speech content |
Applications Claiming Priority (3)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
AU2006900497 | 2006-02-02 | ||
AU2006900497A AU2006900497A0 (en) | 2006-02-02 | Dynamic match lattice spotting | |
AU2006201110A AU2006201110A1 (en) | 2006-02-02 | 2006-03-17 | Dynamic match lattice spotting for indexing speech content |
Publications (1)
Publication Number | Publication Date |
---|---|
AU2006201110A1 true AU2006201110A1 (en) | 2007-08-16 |
Family
ID=38429978
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
AU2006201110A Abandoned AU2006201110A1 (en) | 2006-02-02 | 2006-03-17 | Dynamic match lattice spotting for indexing speech content |
Country Status (1)
Country | Link |
---|---|
AU (1) | AU2006201110A1 (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN114861057A (en) * | 2022-05-17 | 2022-08-05 | 北京百度网讯科技有限公司 | Resource sending method, training of recommendation model and device |
-
2006
- 2006-03-17 AU AU2006201110A patent/AU2006201110A1/en not_active Abandoned
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN114861057A (en) * | 2022-05-17 | 2022-08-05 | 北京百度网讯科技有限公司 | Resource sending method, training of recommendation model and device |
CN114861057B (en) * | 2022-05-17 | 2023-05-30 | 北京百度网讯科技有限公司 | Resource sending method, training of recommendation model and device |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US20070179784A1 (en) | Dynamic match lattice spotting for indexing speech content | |
US6877001B2 (en) | Method and system for retrieving documents with spoken queries | |
US7542966B2 (en) | Method and system for retrieving documents with spoken queries | |
US20030204399A1 (en) | Key word and key phrase based speech recognizer for information retrieval systems | |
Riccardi et al. | Stochastic automata for language modeling | |
US5870706A (en) | Method and apparatus for an improved language recognition system | |
Thambiratnam et al. | Rapid yet accurate speech indexing using dynamic match lattice spotting | |
Parlak et al. | Spoken term detection for Turkish broadcast news | |
WO2003010754A1 (en) | Speech input search system | |
JPWO2004034378A1 (en) | Language model generation and storage device, speech recognition device, language model generation method and speech recognition method | |
Bulyko et al. | Subword speech recognition for detection of unseen words. | |
Morris et al. | Combining phonetic attributes using conditional random fields. | |
Federico et al. | Language modelling for efficient beam-search | |
Gao et al. | The Use of Clustering Techniques for Language Modeling V Application to Asian Language | |
Bufyko et al. | Detection of unseen words in conversational Mandarin | |
Deoras et al. | Joint Decoding for Speech Recognition and Semantic Tagging. | |
JP5360414B2 (en) | Keyword extraction model learning system, method and program | |
CA2596126A1 (en) | Speech recognition by statistical language using square-root discounting | |
Wallace et al. | Spoken term detection using fast phonetic decoding | |
Lecouteux et al. | Combined low level and high level features for out-of-vocabulary word detection | |
Thambiratnam | Acoustic keyword spotting in speech with applications to data mining | |
Sangwan et al. | Keyword recognition with phone confusion networks and phonological features based keyword threshold detection | |
CN101937450B (en) | Method for retrieving items represented by particles from an information database | |
JP2886121B2 (en) | Statistical language model generation device and speech recognition device | |
Ramabhadran et al. | Fast decoding for open vocabulary spoken term detection |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
MK4 | Application lapsed section 142(2)(d) - no continuation fee paid for the application |