JPH0792997A

JPH0792997A - Speech synthesizing device

Info

Publication number: JPH0792997A
Application number: JP23613293A
Authority: JP
Inventors: Keiji Hayashi; 慶士林; Noriya Murakami; 憲也村上
Original assignee: N T T DATA TSUSHIN KK; NTT Data Communications Systems Corp
Current assignee: N T T DATA TSUSHIN KK; NTT Data Corp
Priority date: 1993-09-22
Filing date: 1993-09-22
Publication date: 1995-04-07

Abstract

PURPOSE: To provide a speech synthesizer that makes it easy to add, delete, or otherwise change the selection conditions used when selecting waveform segments, and that handles such changes flexibly. CONSTITUTION: A preprocessing unit 12 divides an input character string into phoneme units. A selection reference parameter setting unit 13 sets, on the basis of the phoneme units, the selection reference parameters used to select the waveform segments that serve as synthesis parameters. A segment selection unit 14 calculates the squared errors between the set selection reference parameters and the segment parameters read from a segment parameter table 15, selects segment parameters in ascending order of squared error to form primary candidates, and determines, as the optimal segment for each phoneme unit, the segment whose parameters best satisfy the segment environment suitability conditions set in a condition setting unit 141. A segment connection unit 17 reads the optimal segments determined for the phoneme units from a segment file 16 and connects them in phoneme-unit order to generate synthesized speech.

Description

Detailed Description of the Invention

[0001]

[Industrial Field of Application] The present invention relates to speech synthesis technology, and more particularly to a speech synthesizer that generates synthesized speech from an input character string by using waveform segments extracted from natural speech as synthesis parameters.

[0002]

[Prior Art] FIG. 6 shows a configuration example of a conventional, general speech synthesizer. In FIG. 6, 61 is an input terminal, 62 is a preprocessing unit, 63 is a selection reference parameter setting unit, 64 is a segment selection unit, 65 is a waveform dictionary, 66 is a waveform file, 67 is a segment transformation unit, 68 is a segment connection unit, and 69 is an output terminal.

[0003] In a speech synthesizer of this configuration, an input character string consisting of phoneme symbols and accent symbols is input from the input terminal 61 and then divided into phoneme units in the preprocessing unit 62. From the phoneme units and the accent symbols, the selection reference parameter setting unit 63 sets the selection reference parameters (average pitch frequency, pitch slope, duration, and average power (RMS value)) and a specific segment file storing the corresponding waveform segments. The selection reference parameters serve as the criteria for selecting the waveform segments, which are the synthesis parameters. The segment selection unit 64 uses the following evaluation function to select the optimal segment for each phoneme concatenation.

[0004]

[Equation 1]

$$P = \alpha \cdot \frac{n' - n}{\sigma_n} + (1 - \alpha) \cdot \frac{W' - W}{\sigma_w}$$

$$W = \omega_v \left| \frac{V_t - V_i}{\sigma_v} \right|^2 + \omega_f \left| \frac{F_t - F_i}{\sigma_f} \right|^2 + \omega_t \left| \frac{T_t - T_i}{\sigma_t} \right|^2 + \omega_a \left| \frac{A_t - A_i}{\sigma_a} \right|^2$$

$$n = \frac{1}{e^{N}} \qquad (1)$$

[0005] In equation (1), α is a balance coefficient between the phonological environment and the prosodic characteristics, n is the phonological environment coefficient, and W is the prosodic characteristic coefficient. n' and W' are the mean values of n and W, respectively; N is the maximum number of matching phonemes; V is the average pitch frequency (Hz); F is the pitch slope; T is the duration; and A is the average power (RMS value). σn, σw, σv, σf, σt, and σa are the variance values (of the average pitch frequency and so on) for the prosody parameters of the segments; ωv, ωf, ωt, and ωa are weighting coefficients for the average pitch frequency and so on; the subscript t denotes a selection reference parameter; and the subscript i denotes a prosody parameter of a segment.
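To illustrate how equation (1) combines its two terms, the following Python sketch computes W, n, and P for one candidate. The function names, the dictionary keys "V", "F", "T", "A", and every numeric value a caller would pass in are assumptions made for the example; the description gives only the formula.

```python
import math

KEYS = ("V", "F", "T", "A")  # average pitch, pitch slope, duration, average power


def prosody_coefficient(target, segment, sigma, omega):
    """W in equation (1): weighted sum of squared, normalized differences."""
    return sum(
        omega[k] * abs((target[k] - segment[k]) / sigma[k]) ** 2 for k in KEYS
    )


def environment_coefficient(matching_phonemes):
    """n = 1 / e^N, where N is the maximum number of matching phonemes."""
    return 1.0 / math.exp(matching_phonemes)


def evaluation(alpha, n, n_mean, sigma_n, w, w_mean, sigma_w):
    """P in equation (1); alpha balances environment against prosody."""
    return alpha * (n_mean - n) / sigma_n + (1.0 - alpha) * (w_mean - w) / sigma_w
```

Because α and the ω weights are free parameters of this function, they are exactly the values that would need retuning for each new speaker.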

[0006] The waveform file 66 stores speech data (waveform information) for text such as novels and essays as is, with a file number assigned to each sentence number; it is not organized as a set of segments. The segment transformation unit 67 reads the optimal segment selected by the segment selection unit 64 from the waveform file 66 and applies a predetermined transformation for each selection reference parameter so that the segment matches the selection reference parameters.

[0007] In a speech synthesizer of this configuration, the coefficients α and ω in the evaluation function used for segment selection (equation (1)) are empirically set variables, not constants. They are also values tuned specifically for the particular speaker (hereinafter, the first speaker) used to create the waveform file. Consequently, when a new waveform file 66 is created using a different person (hereinafter, the second speaker), the coefficients have to be retuned for the second speaker, which is a drawback. In addition, the transformation in the segment transformation unit 67, taking pitch transformation as an example, is performed by setting a pitch change ratio between the selection reference parameter and the optimal segment selected by the segment selection unit 64. When this ratio is large, however, changing the pitch degrades the sound quality of the original optimal segment. A sophisticated transformation process is therefore required, which is a further drawback. This is also related to the file organization, in which an optimal segment containing the relevant phoneme concatenation does not exist in the waveform file 66.

[0008] The present inventors therefore previously proposed a speech synthesizer that can eliminate these drawbacks (see Japanese Patent Application No. 5-59385). In short, from among the prosody parameters of the waveform segments containing the phoneme concatenations decomposed from the input character string, that synthesizer selects the specific waveform segment corresponding to the prosody parameters with the smallest error (for example, squared error) from the selection reference parameters, and generates synthesized speech by connecting the selected waveform segments in sequence, one per phoneme concatenation, so that it can cope with changes in speaker data.

[0009]

[Problem to Be Solved by the Invention] With the speech synthesizer of the earlier proposal, the coefficients of the evaluation function (equation (1)) no longer need to be retuned even when the second speaker's data differ from the first speaker's data. That synthesizer also takes the influence of the prosody parameters of vowels well into account, and fairly natural synthesized speech is obtained. Subsequent verification from many angles, however, showed that speech closer to natural speech is obtained when changes in the phonological environment and other conditions are also taken into account in selecting waveform segments. Selecting waveform segments with such new conditions is difficult with the configuration of that synthesizer as it stands; in particular, conditions that are hard to reflect in the selection reference parameters, such as the sign of the pitch slope, could not be taken into account. The present invention was made against this background, and its object is to provide a speech synthesizer that can handle the selection conditions for waveform segments more flexibly.

[0010]

[Means for Solving the Problem] A speech synthesizer of the present invention that achieves the above object comprises: a preprocessing unit that decomposes an input character string corresponding to speech into phoneme units; waveform information storage means that stores a plurality of waveform segments and the prosody parameters of each waveform segment; segment selection means that extracts, from the waveform information storage means, the prosody parameters of the waveform segments corresponding to each decomposed phoneme and selects the waveform segment corresponding to the extracted prosody parameters that satisfy preset segment environment suitability conditions and have the smallest error from the selection reference parameters set as the criteria for selecting waveform segments; and a segment connection unit that connects the selected waveform segments in the order of the input character string to generate synthesized speech.

[0011] In the speech synthesizer configured as above, the waveform information storage means specifically consists of segment storage means, which cuts waveform segments of a plurality of phonological environments, including accent information for the phoneme units, out of natural speech and stores them individually, and a segment parameter table storing the prosody parameters corresponding to each waveform segment. The error is, for example, the sum of the squares of the values obtained by dividing the difference between each selection reference parameter and the corresponding prosody parameter by the variation width of that prosody parameter.

[0012] Another configuration of the present invention that achieves the above object is characterized in that the segment selection means has condition changing means for adding, changing, and deleting the segment environment suitability conditions. Any segment environment suitability condition can be set; good choices include, for example, the degree of agreement between the phonological environment before and after the phoneme unit and the phonological environment of the waveform segment corresponding to each prosody parameter in the selection candidate group, or the consistency of the sign of the pitch slope between the phoneme unit and each prosody parameter in the selection candidate group.

[0013]

[Operation] In the speech synthesizer of the present invention, the prosody parameters of the waveform segments corresponding to each phoneme decomposed by the preprocessing unit are extracted from the waveform information storage means. In this case, a plurality of prosody parameters are extracted for each phoneme unit. In the segment connection unit, the extracted prosody parameters are arranged in ascending order of their error (for example, squared error) from the selection reference parameters to create a selection candidate group (hereinafter, primary candidates) for each phoneme unit. Then, from the selection candidate group, the waveform segment corresponding to the prosody parameters that satisfy the preset segment environment suitability conditions and have the smallest error is selected. Each selected waveform segment is the optimal waveform segment for its phoneme unit, so the selected segments are connected in the order of the input character string to obtain synthesized speech.

[0014] Selecting the prosody parameters in ascending order of error from the selection reference parameters in this way means that the waveform segment corresponding to each prosody parameter is not tied to the utterances of a particular speaker. Selecting the one with the smallest error among those satisfying the preset segment environment suitability conditions yields waveform segments that take various segment environments into account. In particular, the degree of agreement of the phonological environment and the consistency of the pitch tendency are difficult to evaluate as errors from the selection reference parameters, but adopting them as segment environment suitability conditions allows their respective influences to be evaluated appropriately, bringing the synthesized speech closer to natural speech.

[0015]

[Embodiment] An embodiment of the present invention will now be described in detail with reference to the drawings. FIG. 1 is a block diagram of a speech synthesizer according to one embodiment of the present invention, in which 11 is an input terminal, 12 is a preprocessing unit, 13 is a selection reference parameter setting unit, 14 is a segment selection unit, 141 is a condition setting unit, 15 is a segment parameter table, 16 is a segment file, 17 is a segment connection unit, and 18 is an output terminal. The condition setting unit 141 sets the various segment environment suitability conditions used in the segment selection process of the segment selection unit 14, and is configured so that these conditions can be freely added, changed, and deleted.
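One way to picture a condition setting unit whose conditions can be added, changed, and deleted is a small registry of named predicates. The sketch below is only an illustration under that assumption; the class name `ConditionSet`, its method names, and the predicate signature `(target, candidate)` are hypothetical and do not come from the description.

```python
class ConditionSet:
    """Named suitability conditions that can be added, changed, or deleted."""

    def __init__(self):
        self._conditions = {}

    def add(self, name, predicate):
        if name in self._conditions:
            raise KeyError(f"condition already exists: {name}")
        self._conditions[name] = predicate

    def change(self, name, predicate):
        if name not in self._conditions:
            raise KeyError(f"no such condition: {name}")
        self._conditions[name] = predicate

    def delete(self, name):
        del self._conditions[name]

    def satisfied_by(self, target, candidate):
        """True when the candidate passes every registered condition."""
        return all(p(target, candidate) for p in self._conditions.values())
```

Keeping the conditions outside the error calculation is what lets a criterion such as slope-sign agreement be introduced or dropped without touching the squared-error formula.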

[0016] FIG. 2 is a conceptual diagram of the segment parameter table 15 and the segment file 16. The segment file 16 individually stores the waveform segments cut out of natural speech according to the phoneme unit and the presence or absence of an accent on the central phoneme of each phoneme unit. The segment parameter table 15 stores the segment file 16 and the prosody parameters (hereinafter, segment parameters) of the waveform segments belonging to that segment file 16.

[0017] A plurality of sets of the segment parameter table 15 and the segment file 16 are prepared, one for each waveform segment cut out of natural speech, and, as shown in FIG. 2, a specific set is selected according to the phoneme unit 22 and the accent symbol 23 indicating the presence or absence of an accent. Here, the phoneme units 22 include, in addition to ordinary phonemes such as /a/ and /k/, vowel sequences such as /ai/ and compound phonemes such as /aN/, in consideration of the smoothness of segment connection.

[0018] Next, the operation of this embodiment is described concretely, taking speech synthesis for the input character string #karewa# ("he", with the topic particle wa) as an example. The input character string is assumed to contain the accent symbols 23 in addition to phoneme symbols. First, the preprocessing unit 12 decomposes the character string (#karewa#) input from the input terminal 11 into the phoneme units /ka/, /re/, and /wa/. For convenience, /ka/ is referred to below as the first phoneme unit, /re/ as the second phoneme unit, and /wa/ as the third phoneme unit. The symbol # is an additional mark indicating that a phoneme unit is at the beginning or end of a word. Accordingly, #karewa# is divided into the phoneme units #ka-, re-, and wa#-. Here, #ka- denotes the phoneme /ka/ at the beginning of natural speech (a word or sentence) without an accent. With an accent, it is written #ka+. The same applies to the other phonemes.
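The labelling scheme for #karewa# can be sketched in Python as follows. This is a deliberately naive illustration: the consonant-plus-vowel splitter, the function names, and the `accented` argument are assumptions, and the splitter does not handle compound units such as /ai/ or /aN/ or any other part of the actual preprocessing.

```python
import re


def split_units(text):
    """Split a romanized string into consonant(s)+vowel units (simplified)."""
    body = text.strip("#")
    return re.findall(r"[^aiueo]*[aiueo]", body)


def label_units(text, accented=frozenset()):
    """Attach the boundary mark '#' and the accent mark '+' or '-' to each unit."""
    units = split_units(text)
    labels = []
    for idx, unit in enumerate(units):
        head = "#" if idx == 0 and text.startswith("#") else ""
        tail = "#" if idx == len(units) - 1 and text.endswith("#") else ""
        mark = "+" if idx in accented else "-"
        labels.append(f"{head}{unit}{tail}{mark}")
    return labels


print(label_units("#karewa#"))  # ['#ka-', 're-', 'wa#-']
```

Passing `accented={0}` would label the first unit `#ka+` instead.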

[0019] From each phoneme unit and the accent symbol 23, the selection reference parameter setting unit 13 sets the selection reference parameters used to select the waveform segments, which are the synthesis parameters, and the segment file 16 that stores those waveform segments. This is done as follows.

[0020] First, the average pitch frequency Vt, pitch slope Ft, duration Tt, and average power (RMS value) At are determined using the first phoneme unit and the accent symbol. The parameter set made up of these elements is the first selection reference parameter. Likewise, the second, third, and subsequent selection reference parameters are obtained for the second, third, and subsequent phoneme units. Next, the segment file corresponding to the first selection reference parameter is set as the first selection file, and the segment files corresponding to the second, third, and subsequent selection reference parameters are set as the second, third, and subsequent selection files.

[0021] Each selection file thus set holds the waveform segments corresponding to the designated phoneme unit and their characteristics. For example, waveform segments corresponding to the phoneme unit #ka- (the unaccented phoneme /ka/ at the beginning of natural speech, that is, of a word or sentence; #ka+ would be accented) can be cut from the waveforms of words such as #kakaeru# and #kakageru#, and their average pitch frequency Vt, pitch slope Ft, duration Tt, and so on differ depending on the word from which they are extracted.

[0022] In the segment parameter table 15, each extracted waveform segment is identified by an assigned file number, the word from which it was extracted is recorded as the segment extraction environment, and the segment parameters (average pitch frequency Vi, pitch slope Fi, duration Ti, and average power (RMS value) Ai) are stored for each waveform segment. The mean, variance, maximum, and minimum of each segment parameter are also stored. FIG. 3 illustrates the contents.

[0023] FIG. 3 is an example of the contents for the first phoneme unit /ka/, in which 31 is the table name, 32 is the segment extraction environment, 33 is the average pitch frequency (Hz) Vi, 34 is the pitch slope Fi, 35 is the duration Ti, 36 is the average power Ai, 37 is the maximum value of each of the parameters 33 to 36, 38 is the minimum value of each parameter, and 39 is the file number that is the target of selection in the segment selection unit 14. The segment parameters consist of the average pitch frequency 33, pitch slope 34, duration 35, and average power (RMS value) 36.
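The table layout described for FIG. 3 could be modelled roughly as below. This is a sketch under assumptions: the class and field names are invented for illustration, the choice of population variance is a guess (the description says only "variance"), and no actual values from FIG. 3 are reproduced.

```python
from dataclasses import dataclass
from statistics import mean, pvariance


@dataclass(frozen=True)
class SegmentEntry:
    file_number: str    # identifies the waveform segment in the segment file
    source_word: str    # segment extraction environment, e.g. "#kakaeru#"
    pitch_hz: float     # average pitch frequency Vi
    pitch_slope: float  # pitch slope Fi
    duration: float     # duration Ti
    power_rms: float    # average power Ai (RMS value)


FIELDS = ("pitch_hz", "pitch_slope", "duration", "power_rms")


def table_statistics(entries):
    """Per-parameter mean, variance, maximum, and minimum over one table."""
    stats = {}
    for name in FIELDS:
        values = [getattr(entry, name) for entry in entries]
        stats[name] = {
            "mean": mean(values),
            "variance": pvariance(values),  # population variance (assumption)
            "max": max(values),
            "min": min(values),
        }
    return stats
```

Since the values are fixed once the waveforms have been cut out and analyzed, the statistics can be computed once when a table is built rather than at synthesis time.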

[0024] The table name 31 serves, for example, as an index to the segment extraction environments 32. In the illustrated example, an index for listing #kakaeru# through #kagami# is recorded. The values of the parameters 33 to 36, the maximum value 37, and so on are values analyzed when the waveforms were cut out, and are fixed.

[0025] Returning to FIG. 1, the segment selection unit 14 calculates the squared error between each selection reference parameter and the segment parameters belonging to the selection file set for that selection reference parameter, selects a plurality of segment parameters in ascending order of error, and takes them as the primary candidates for the first phoneme unit. Primary candidates are selected in the same way for the second, third, and subsequent phoneme units. Then, referring to the conditions set by the condition setting unit 141, it selects from each set of primary candidates the segment parameters with the highest likelihood in terms of characteristics such as the phonological environment, the average pitch, and the sign of the pitch slope, and determines the final optimal segment for each of the first, second, and subsequent phoneme units. This principle is explained with reference to FIG. 4.

[0026] FIG. 4 is a flowchart showing the detailed processing procedure of the segment selection unit 14, where S denotes the step number of each process. Referring to FIG. 4, the segment selection unit 14 reads the first selection reference parameter set by the selection reference parameter setting unit 13 (S41), and reads from the segment parameter table 15 the corresponding segment parameters (the average pitch frequency 33, pitch slope 34, duration 35, and average power 36 in FIG. 3) together with their maximum values 37 and minimum values 38 (S42). It then calculates the difference between the maximum value 37 and the minimum value 38 of each parameter and passes it to the next stage as the parameter variation width (S43). Next, it divides each difference between the first selection reference parameter and a segment parameter in the first selection file by the variation width of that parameter, and calculates the sum of the squares of these values (the squared error) for each segment (S44). This squared error is used as the selection criterion SC (Selection Criteria) for the optimal segment. Expressed as a formula, this calculation is equation (2) below.

[0027]

[Equation 2]

$$SC = \left( \frac{V_t - V_i}{\mathit{wideV}} \right)^2 + \left( \frac{F_t - F_i}{\mathit{wideF}} \right)^2 + \left( \frac{T_t - T_i}{\mathit{wideT}} \right)^2 + \left( \frac{A_t - A_i}{\mathit{wideA}} \right)^2 \qquad (2) \quad (2 \le i \le n)$$
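Steps S43 and S44 can be sketched in Python as follows, assuming each parameter set is a dictionary keyed by "V", "F", "T", and "A". The function names and key names are illustrative assumptions, and the sketch assumes every variation width is non-zero.

```python
KEYS = ("V", "F", "T", "A")  # average pitch, pitch slope, duration, average power


def parameter_widths(maxima, minima):
    """Step S43: variation width = maximum - minimum for each parameter."""
    return {k: maxima[k] - minima[k] for k in KEYS}


def selection_criterion(target, segment, widths):
    """Equation (2): sum of squared differences, each scaled by its width.

    Assumes no width is zero; a table with a constant parameter would need
    separate handling.
    """
    return sum(((target[k] - segment[k]) / widths[k]) ** 2 for k in KEYS)
```

Dividing by the variation width puts the four parameters on a comparable scale without speaker-specific weights, which is why no tuning coefficients appear in this criterion.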

[0028] Here, the variables wideV, wideF, wideT, and wideA are the parameter variation widths obtained in S43 for the average pitch frequency V, pitch slope F, duration T, and average power A, respectively, and n is the number of segment parameters belonging to the selected table. When the central phoneme of the phoneme concatenation is a devoiced vowel, the pitch-related terms of the squared error in equation (2) are set to 0, and SC is calculated by the following equation.

[0029]

[Equation 3]

$$SC = \left( \frac{T_t - T_i}{\mathit{wideT}} \right)^2 + \left( \frac{A_t - A_i}{\mathit{wideA}} \right)^2 \qquad (3) \quad (2 \le i \le n)$$
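The devoiced-vowel case amounts to dropping the two pitch terms from the sum. A minimal Python sketch, assuming dictionary-based parameter sets and a hypothetical `devoiced` flag supplied by the caller:

```python
def selection_criterion(target, segment, widths, devoiced=False):
    """Equation (2), or equation (3) when the central vowel is devoiced.

    With devoiced=True the pitch terms (V and F) are skipped, leaving only
    the duration and power terms.
    """
    keys = ("T", "A") if devoiced else ("V", "F", "T", "A")
    return sum(((target[k] - segment[k]) / widths[k]) ** 2 for k in keys)
```

How the devoiced flag is derived from the input is not covered here; the description states only that the pitch terms are set to 0 in that case.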

[0030] After the squared error SC has been calculated for each segment parameter in the first selection file as described above, m waveform segments (0 ≤ m ≤ n) are selected in ascending order of SC value and taken as the primary candidates of waveform segments for the first phoneme unit (S45). The value of m may be set arbitrarily and may always be n = m, but when the number n of waveform segments is very large, selecting the optimal waveform segment requires a great deal of computation, so it is preferable to set m to a suitable value in advance. The operations of S41 to S45 are likewise performed for the second, third, and subsequent phoneme units, so that primary candidates are selected for all phoneme units (S46).
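Step S45 is a top-m selection by ascending SC. One possible Python sketch uses the standard-library `heapq.nsmallest`; the function name and the `(file_number, sc)` pair layout are assumptions for illustration.

```python
import heapq


def primary_candidates(scored_segments, m):
    """Step S45: the m segments with the smallest SC, in ascending SC order.

    scored_segments is an iterable of (file_number, sc) pairs.
    """
    return heapq.nsmallest(m, scored_segments, key=lambda pair: pair[1])
```

Keeping only m candidates bounds the work of the later suitability check, which matches the stated reason for choosing m smaller than n.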

[0031] Next, the primary candidates for the first phoneme unit are subjected to a segment suitability check based on the following segment environment suitability conditions read from the condition setting unit 141. Specifically, this check determines, as the optimal segment for the first phoneme unit, the waveform segment that satisfies all of these conditions and has the smallest SC value among those that do.

[0032]

Segment environment suitability condition (1): The phonological environment from which the segment was extracted matches the phonological environment at synthesis time, or their manners of articulation are similar.

Segment environment suitability condition (2): The magnitude relationship of the average pitch matches the magnitude relationship in the selection reference parameters.

Segment environment suitability condition (3): The sign of the pitch slope (positive/negative/zero) matches the sign in the selection reference parameters.

The above conditions are also applied to the primary candidates for the second, third, and subsequent phoneme units to determine their respective optimal segments (S47), and the file number of each optimal segment thus determined is output (S48).
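Conditions (1) and (3) lend themselves to simple predicates, sketched below in Python. Several simplifications are assumed: only the following phoneme is compared for condition (1), the similarity group contains just /r/, /w/, and /j/ (the only similar sounds the description names), and condition (2) is omitted because the description does not spell out which magnitudes are compared. All names are hypothetical.

```python
# Assumed grouping of sounds by manner of articulation; only /r/, /w/, /j/
# are named in the description.
SIMILAR_ARTICULATION = [{"r", "w", "j"}]


def sign(x):
    """Return +1, -1, or 0, matching the positive/negative/zero classes."""
    return (x > 0) - (x < 0)


def environment_matches(target_next, candidate_next):
    """Condition (1): same following phoneme, or a similar articulation class."""
    if target_next == candidate_next:
        return True
    return any(
        target_next in group and candidate_next in group
        for group in SIMILAR_ARTICULATION
    )


def slope_sign_matches(target_slope, candidate_slope):
    """Condition (3): the pitch slope signs agree."""
    return sign(target_slope) == sign(candidate_slope)
```

These checks are categorical rather than numeric, which is why they sit outside the squared-error criterion.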

[0033] The processing in the segment suitability check (S47) is now described in more detail. FIG. 5 shows the state of processing for the input character string #karewa# at the point where the primary candidate selection (S45) has finished for each phoneme unit. As shown in FIG. 5, the waveform segments for each phoneme unit are arranged in ascending order of SC value. The determination of the optimal segment starts with the segments having the smallest SC values (the segments with file numbers 0760, 0050, and 0391).

[0034] Consider, for example, the first phoneme unit #ka. The waveform segment with the smallest SC value is that of file number 0760. This waveform segment satisfies segment environment suitability conditions (2) and (3) but not condition (1): the phonological environment following #ka is /r/, whereas the phonological environment following the waveform segment of file number 0760 is /k/, so the two do not match. Accordingly, the waveform segment of file number 0746, which has a segment extraction environment that fully matches the phonological environment at synthesis time and has the smallest SC value among such segments, is determined as the optimal segment for the first phoneme unit.
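The walk through the candidates for the first phoneme unit can be mimicked with a short Python sketch. The SC values below are invented for illustration, since the contents of FIG. 5 are not reproduced in the text; only the file numbers 0760 and 0746 and their following phonemes come from the description.

```python
def choose_optimal(candidates, conditions):
    """Return the first candidate, in ascending SC order, passing every condition."""
    for candidate in sorted(candidates, key=lambda c: c["sc"]):
        if all(check(candidate) for check in conditions):
            return candidate
    return None


# Hypothetical primary candidates for "#ka-".
candidates = [
    {"file": "0760", "sc": 0.12, "next": "k"},
    {"file": "0746", "sc": 0.31, "next": "r"},
]


def follows_r(candidate):
    """Synthesis-time environment: the unit is followed by /r/."""
    return candidate["next"] == "r"


best = choose_optimal(candidates, [follows_r])
print(best["file"])  # 0746
```

The candidate with the smallest SC is rejected by the environment check, so the next one in SC order is taken.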

[0035] When no waveform segment has a segment extraction environment that matches the phoneme environment /r/ at synthesis time, a waveform segment whose extraction environment is /w/ or /j/, which are similar to /r/ in manner of articulation, is selected as the optimum segment. In this way, when no waveform segment satisfies all three conditions, a waveform segment that satisfies them to a high degree is selected as appropriate.
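
The decision for a single phoneme unit, including the fallback to a similar phoneme environment, might be sketched as below. The text does not say how the "degree of satisfaction" is computed, so the scoring here (one point per fully met condition, half a point for a similar environment) is purely an assumption, as are the names `Candidate`, `SIMILAR`, `satisfaction`, and `pick_optimum` and all SC values.

```python
from dataclasses import dataclass

# Phonemes treated as similar in manner of articulation. Only the /r/ entry
# comes from the text; the table structure itself is an assumption.
SIMILAR = {"r": {"w", "j"}}


@dataclass
class Candidate:
    file_number: str
    sc: float        # hypothetical SC value
    following: str   # following phoneme in the segment's extraction environment
    pitch_ok: bool   # condition (2) satisfied
    slope_ok: bool   # condition (3) satisfied


def satisfaction(c, target_following):
    """Assumed score: 1 per met condition, 0.5 for a similar environment."""
    if c.following == target_following:
        env = 1.0
    elif c.following in SIMILAR.get(target_following, set()):
        env = 0.5
    else:
        env = 0.0
    return env + c.pitch_ok + c.slope_ok


def pick_optimum(candidates, target_following):
    """Prefer the highest satisfaction; break ties by the smallest SC."""
    return max(
        candidates,
        key=lambda c: (satisfaction(c, target_following), -c.sc),
    )


ka = [
    Candidate("0760", 0.12, "k", True, True),
    Candidate("0746", 0.21, "r", True, True),
]
print(pick_optimum(ka, "r").file_number)
```

Under this assumed scoring, a candidate that meets all three conditions always outranks one that does not, and among equally satisfying candidates the one with the smallest SC value is chosen.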

[0036] For the second phoneme unit, re-, and the third phoneme unit, wa#-, the waveform segments with the smallest SC value satisfy all three conditions, so these segments are determined as the optimum segments as they are. Accordingly, the waveform segment of file number 0050 is determined as the optimum segment for the second phoneme unit, and the waveform segment of file number 0391 as the optimum segment for the third phoneme unit.

[0037] The segment connection unit 17 uses the file number of each optimum segment, output from the segment selection unit 14, as a search key to read the corresponding optimum segment from the segment file 16, and generates synthesized speech by sequentially concatenating these optimum segments.
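
The lookup and concatenation performed by the segment connection unit can be illustrated with a minimal sketch. The segment file is modeled here as a dictionary from file number to a list of samples; that representation, the sample values, and the name `connect_segments` are assumptions for illustration only.

```python
def connect_segments(file_numbers, segment_file):
    """Read each optimum segment by file number and join them in order."""
    speech = []
    for number in file_numbers:
        # The file number acts as the search key into the segment file.
        speech.extend(segment_file[number])
    return speech


# Hypothetical waveform samples for the three optimum segments of #karewa#.
segment_file = {
    "0746": [0.1, 0.2],
    "0050": [0.3],
    "0391": [0.4, 0.5],
}
synthesized = connect_segments(["0746", "0050", "0391"], segment_file)
print(synthesized)
```

The sketch joins the segments end to end without any smoothing at the boundaries, since the text describes only sequential concatenation.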

[0038] Through the processing described above, synthesized speech corresponding to the input character string #karewa# is output at the output terminal 18.

[0039] In the speech synthesizer of this embodiment, the unit of synthesis and concatenation is a segment for which the phoneme environment and accent information of natural speech have been individually taken into account, so the segment transformation unit 57 shown in FIG. 6 is unnecessary. Further, each optimum segment is individually extracted from the plurality of segments stored in the segment file 16 as the one for which the above squared error is smallest, so substantially the same waveform is obtained even when the speaker differs, and tuning for each speaker is unnecessary. Furthermore, after a plurality of waveform segments with a small error from the selection criterion parameters have been selected, the waveform segment that is highly compatible with the phoneme environment, the magnitude relationship of the average pitch, and the sign of the pitch slope, which particularly influence the evaluation of synthesized speech, and that also has a small error, is taken as the optimum waveform segment, so more natural synthesized speech can be obtained. Thus, according to this embodiment, a speech synthesizer that can deal more flexibly with the selection conditions for waveform segments can be realized.

[0040] The embodiment is as described above, but the present invention is not limited to the configuration of this embodiment, and the configuration may be changed as desired without departing from its gist. For example, in this embodiment the selection measure SC for the optimum segment was described as a form that minimizes the squared error from the reference values. This is an example that is suitable for taking the direction of each error into account, and the invention is not necessarily bound to this configuration. As other configurations, the segment may be selected so as to minimize the sum of the parameters each divided by its variation width, or the sum of errors raised to a power higher than two. In addition, other elements may be used as the elements constituting the selection criterion parameters and the segment parameters.
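
A selection measure of the kind discussed here could be sketched as a single function with an adjustable exponent. With `power=2` it follows the form recited in claim 3 (differences divided by each parameter's variation width, squared, and summed); a larger `power` stands in for the higher-order alternative. The function name, the argument layout, and the sample numbers are assumptions, not the patent's own definition of SC.

```python
def selection_error(reference, parameters, widths, power=2):
    """Sum of width-normalized differences raised to the given power."""
    return sum(
        abs((r - p) / w) ** power
        for r, p, w in zip(reference, parameters, widths)
    )


# Hypothetical values: two prosodic parameters with unit variation width.
reference = [0.0, 0.0]
parameters = [2.0, 1.0]
widths = [1.0, 1.0]

print(selection_error(reference, parameters, widths))           # squared form
print(selection_error(reference, parameters, widths, power=4))  # higher order
```

Raising the exponent penalizes a large deviation in any single parameter more heavily than several small deviations, which is one reason such a variant might be chosen.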

[0041]

[Effects of the Invention] As described above, according to the speech synthesizer of the present invention, among the prosodic parameters of the phoneme units extracted on the basis of the input character string, the waveform segments corresponding to those that satisfy the predetermined segment environment suitability conditions and have the minimum error from the selection criterion parameters are selected and connected in character-string order. As a result, substantially the same speech waveform is obtained even if the speaker data changes, and this speech waveform reflects the various segment environment suitability conditions. A speech synthesizer that can deal more flexibly with the selection conditions used when selecting waveform segments can therefore be provided. Moreover, because segment environment suitability conditions can be freely added or otherwise changed by the segment selection means, it becomes very easy to bring the synthesized speech closer to natural speech, and performance is markedly improved compared with conventional speech synthesizers of this kind.

[Brief description of drawings]

FIG. 1 is a block diagram of a speech synthesizer according to an embodiment of the present invention.

FIG. 2 is a conceptual configuration diagram of the segment file and the segment parameter table.

FIG. 3 is an explanatory diagram of the contents of the segment parameter table.

FIG. 4 is a flowchart showing the processing in the segment selection unit.

FIG. 5 is an explanatory diagram of the optimum segment determination process for the input character string #karewa#.

FIG. 6 is a block diagram of a speech synthesizer according to a conventional example.

[Explanation of symbols]

11, 61: Input terminal
12, 62: Preprocessing unit
13, 63: Selection criterion parameter setting unit
14, 64: Segment selection unit (segment selection means)
141: Condition setting unit (segment selection means, condition changing means)
15: Segment parameter table (waveform information storage means)
16: Segment file (waveform information storage means)
17: Segment connection unit
18: Output terminal

Claims

[Claims]

1. A speech synthesizer comprising: a preprocessing unit that decomposes an input character string corresponding to speech into phoneme units; waveform information storage means that stores a plurality of waveform segments and the prosodic parameters of each waveform segment; segment selection means that extracts, from the waveform information storage means, the prosodic parameters of the waveform segments corresponding to each decomposed phoneme unit and selects, from among the extracted prosodic parameters, the waveform segment corresponding to the one that satisfies a preset segment environment suitability condition and has the minimum error from a selection criterion parameter set as the selection criterion for waveform segments; and a segment connection unit that connects the selected waveform segments in the order of the input character string to generate synthesized speech.

2. The speech synthesizer according to claim 1, wherein the waveform information storage means comprises: segment storage means that individually stores waveform segments extracted from natural speech for a plurality of phoneme environments, including accent information, for each phoneme unit; and a segment parameter table that stores the prosodic parameters corresponding to each waveform segment.

3. The described speech synthesizer, wherein the error is the sum of squares of the values obtained by dividing the difference between the selection criterion parameter and each prosodic parameter by the variation width of that prosodic parameter.

4. The speech synthesizer according to claim 1, wherein the segment selection means includes condition changing means for adding, changing, or deleting the segment environment suitability condition.