JP2000172483A

JP2000172483A - Method and system for speech recognition by common virtual picture and storage medium stored with speech recognition program by common virtual picture

Info

Publication number: JP2000172483A
Application number: JP10351771A
Authority: JP
Inventors: Kiyotada Usami; 潔忠宇佐美; Takashi Kono; 隆志河野
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: Nippon Telegraph and Telephone Corp
Priority date: 1998-12-10
Filing date: 1998-12-10
Publication date: 2000-06-23

Abstract

PROBLEM TO BE SOLVED: To enable efficient voice communication of intent to an object in a common virtual picture by having a grammar management server distribute to a user terminal the difference information of the grammar needed for speech recognition. SOLUTION: The system has a grammar management server, which manages a grammar in which the recognition target vocabulary is registered as text or the like, and a speech recognition result distribution server, which distributes speech recognition results. When a user terminal requests the grammar management server to update the difference information of the grammar needed for speech recognition (S1), the grammar management server distributes the necessary difference information to the user terminal at the start of speech recognition (S2). The user terminal updates its grammar according to the distributed difference information (S3), recognizes input speech using the updated grammar (S4), extracts the maximum-likelihood word from the vocabulary in the recognition result (S5), and sends it to the distribution server (S6). The distribution server distributes it to other user terminals (S7).

Description

DETAILED DESCRIPTION OF THE INVENTION

[0001]

BACKGROUND OF THE INVENTION 1. Field of the Invention The present invention relates to a speech recognition method and system for a shared virtual screen, and to a storage medium storing a speech recognition program for a shared virtual screen. More particularly, it concerns systems that recognize speech communicated via a shared virtual screen in which the scene a user views can be redrawn according to the user's viewpoint. It relates to a method, a system, and a storage medium that are effective for communicating intent by voice efficiently, for example when a user addresses by voice an object in the shared virtual screen with which the user is not directly involved.

[0002]

2. Description of the Related Art One example of a three-dimensional shared virtual space, in which multiple user terminals are connected to a server device via a network and multiple users can each have their avatar participate in a three-dimensional virtual space built with 3D CG, is InterSpace (http://www.ntts.com/ispace.html). In this system, a user can communicate by voice with other users participating in the three-dimensional shared virtual space by using a voice input device such as a microphone.

[0003] Meanwhile, speech recognition technology, which lets a computer understand human speech, is important for exchanging information between humans and machines, and expectations are high for it as a means of improving the operability of multimedia. One form of network-based speech recognition is the client-server system, in which speech data input at a client is sent to a server, the server performs speech recognition, and the recognition result is returned to the client. When handling a computationally heavy task such as continuous speech recognition, or a task that requires immediacy, this arrangement lets a fast (and expensive) workstation serve as the server and be used from multiple clients running on low-cost personal computers.

[0004] General speech recognition processing uses three resources: an acoustic model, which models the acoustic properties of each phoneme (vowels, consonants, and so on), the phoneme being the unit that makes up spoken language; a language model, which is either the vocabulary and grammar that define the search space of the recognition process or a model of the statistical chaining probabilities of linguistic units; and a grammar in which the recognition target vocabulary is registered as text or the like. The input speech data (speech waveform) is analyzed, speech features useful for recognition are extracted, the search space is restricted by the language model, and acoustic matching against the acoustic model is performed. The word with the highest likelihood among the words registered in the grammar is then obtained as the speech recognition result.
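The final selection step described above (choosing the highest-likelihood word among those registered in the grammar) can be illustrated with a minimal sketch. The function name `best_word`, the score values, and the use of log-likelihoods are assumptions made for illustration; the patent does not specify how scores are represented.

```python
from typing import Dict, Optional, Set


def best_word(scores: Dict[str, float], grammar_vocab: Set[str]) -> Optional[str]:
    """Return the grammar-registered word with the highest likelihood score.

    `scores` maps candidate words to hypothetical log-likelihoods produced by
    acoustic matching; candidates outside the grammar are ignored.
    """
    candidates = {w: s for w, s in scores.items() if w in grammar_vocab}
    if not candidates:
        return None  # no candidate is registered in the grammar
    return max(candidates, key=candidates.get)


scores = {"sit": -12.5, "shake": -9.1, "hello": -3.0}
print(best_word(scores, {"sit", "shake"}))  # prints: shake
```

Note that "hello" has the best raw score but is discarded because it is not in the grammar, which is how the grammar constrains the result.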

[0005]

Problems to Be Solved by the Invention Recently, the growing processing power of personal computers, together with the simplification of speech recognition programs that has come from a deeper understanding of recognition algorithms, has made it possible to run speech recognition on a personal computer as well.

[0006] However, speaker-independent acoustic models are still used, which makes it difficult to obtain recognition performance equal to that of an acoustic model created for a specific speaker. It is also difficult to communicate intent by voice to an object that is not under human operation (for example, giving a dog character spoken commands such as "shake" or "sit").

[0007] Thus, in conventional network-based speech recognition, the speech recognition server is provided with a speaker-independent acoustic model in order to recognize the speech of multiple users. Even a speaker-independent model, however, can hardly absorb all the varied factors that arise from individual differences in speech, so it is difficult to obtain recognition performance equal to that of an acoustic model created for a specific speaker.

[0008] Moreover, conventional voice communication within a shared virtual screen takes the form of a human listening directly to speech propagated over the network, understanding it, and directly producing the reaction to it. Real-time human intervention is indispensable here, so it is difficult to realize spoken commands such as "shake" or "sit" directed at an object that is not under human operation, such as the dog character mentioned above.

[0009] The conventional technology described above therefore has the problem that intent cannot be communicated efficiently by voice to an object in the shared virtual screen with which the user is not directly involved. The present invention has been made in view of this point. Its object is to provide a speech recognition method and system for a shared virtual screen, and a storage medium storing a speech recognition program for a shared virtual screen, that are effective for communicating intent by voice efficiently to such an object.

[0010]

Means for Solving the Problems The present invention (claim 1) is a speech recognition method for a shared virtual screen in which the scene a user views can be redrawn according to the user's viewpoint. The method uses a grammar management server, which manages grammars in which the recognition target vocabulary is registered as text or the like, and a speech recognition result distribution server, which distributes speech recognition results to user terminals. When a user terminal starts speech recognition processing, the grammar management server distributes to that user terminal the grammar difference information needed for that processing.

[0011] In the present invention (claim 2), the user terminal stores and manages a user-specific acoustic model and a language model needed for speech recognition, together with the grammar. In the present invention (claim 3), the user terminal recognizes input speech locally using the grammar, the acoustic model, and the language model. In the present invention (claim 4), the user terminal sends the speech recognition result to the speech recognition result distribution server.

[0012] FIG. 1 is a diagram illustrating the principle of the present invention. The present invention (claim 5) is a speech recognition method for a shared virtual screen in which the scene a user views can be redrawn according to the user's viewpoint. The method uses a grammar management server, which manages grammars in which the recognition target vocabulary is registered as text or the like, and a speech recognition result distribution server, which distributes speech recognition results to user terminals. The user terminal requests the grammar management server to update the grammar difference information needed for speech recognition (step 1). When the user terminal starts speech recognition processing, the grammar management server distributes to it the grammar difference information needed for that processing (step 2). The user terminal acquires the difference information distributed from the grammar management server and updates its grammar (step 3), recognizes the input speech using the updated grammar (step 4), extracts the maximum-likelihood word from the vocabulary in the recognition result as the speech recognition result (step 5), and sends the speech recognition result to the speech recognition result distribution server (step 6). The speech recognition result distribution server distributes the result sent from the user terminal to other user terminals (step 7).
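Steps 1 to 7 above can be walked through with a small in-memory sketch. This is only an illustration under stated assumptions: the class names (`GrammarServer`, `ResultServer`, `Terminal`), the representation of a grammar as a set of words, a diff that only adds words, and the pre-scored candidates standing in for real acoustic matching are all hypothetical and not taken from the patent.

```python
from typing import Dict, List, Optional, Set


class GrammarServer:
    """Holds the latest vocabulary per recognition target (hypothetical layout)."""

    def __init__(self, grammars: Dict[str, Set[str]]) -> None:
        self.grammars = grammars

    def diff_for(self, target: str, known: Set[str]) -> Set[str]:
        # Step 2: return only the words the terminal does not have yet.
        return self.grammars.get(target, set()) - known


class ResultServer:
    """Relays a recognition result to every terminal except the sender."""

    def __init__(self) -> None:
        self.terminals: List["Terminal"] = []

    def distribute(self, sender: "Terminal", result: str) -> None:
        for terminal in self.terminals:  # Step 7
            if terminal is not sender:
                terminal.inbox.append(result)


class Terminal:
    def __init__(self, grammar_srv: GrammarServer, result_srv: ResultServer) -> None:
        self.grammar_srv = grammar_srv
        self.result_srv = result_srv
        self.vocab: Set[str] = set()  # simplified: one vocabulary for all targets
        self.inbox: List[str] = []
        result_srv.terminals.append(self)

    def speak(self, target: str, scores: Dict[str, float]) -> Optional[str]:
        diff = self.grammar_srv.diff_for(target, self.vocab)  # Steps 1-2
        self.vocab |= diff                                    # Step 3
        # Step 4 (stubbed): keep only candidates registered in the grammar.
        in_grammar = {w: s for w, s in scores.items() if w in self.vocab}
        if not in_grammar:
            return None
        result = max(in_grammar, key=in_grammar.get)          # Step 5
        self.result_srv.distribute(self, result)              # Steps 6-7
        return result


grammar_srv = GrammarServer({"dog": {"sit", "shake"}})
result_srv = ResultServer()
alice = Terminal(grammar_srv, result_srv)
bob = Terminal(grammar_srv, result_srv)
alice.speak("dog", {"sit": -4.0, "shake": -7.5, "jump": -1.0})
print(bob.inbox)  # prints: ['sit']
```

Only the recognized text travels to the other terminals, not the audio, which is the point of performing recognition on the speaker's own terminal.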

[0013] FIG. 2 is a diagram illustrating the principle configuration of the present invention. The present invention (claim 6) is a speech recognition system for a shared virtual screen in which the scene a user views can be redrawn according to the user's viewpoint. The system comprises a user terminal 100 that performs speech recognition, a grammar management server 200 that manages grammars in which the recognition target vocabulary is registered as text or the like, and a speech recognition result distribution server 300 that distributes speech recognition results to user terminals 100. The grammar management server 200 has grammar management means 10, which manages the grammars in which the recognition target vocabulary is registered as text or the like, and difference information distribution means 20, which distributes to a user terminal the grammar difference information needed for speech recognition processing when that terminal starts the processing. The user terminal 100 has model storage means 30, which stores and manages a user-specific acoustic model and a language model needed for speech recognition together with the grammar; speech recognition means 40, which recognizes input speech on the user terminal using the grammar, the acoustic model, and the language model; and speech recognition result transmission means 50, which transmits the result obtained by the speech recognition means 40. The speech recognition result distribution server 300 has means for distributing the speech recognition result sent from the user terminal 100 to other user terminals.

[0014] The present invention (claim 7) is a storage medium storing a speech recognition program for a shared virtual screen in which the scene a user views can be redrawn according to the user's viewpoint, the program being installed on a grammar management server. When a user terminal starts speech recognition processing, the program distributes to that user terminal the grammar difference information needed for the processing.

[0015] The present invention (claim 8) is a storage medium storing a speech recognition program for a shared virtual screen in which the scene a user views can be redrawn according to the user's viewpoint, the program being installed on a user terminal. The program has a process of storing and managing in storage means a user-specific acoustic model and a language model needed for speech recognition together with the grammar; a process of acquiring and storing grammar difference information distributed from the grammar management server that manages the grammars needed for speech recognition; a process of recognizing input speech using the grammar, the acoustic model, and the language model; and a process of sending the recognition result to the speech recognition result distribution server.

[0016] As described above, the present invention has a grammar management server, which manages grammars in which a recognition target vocabulary is registered as text or the like, and a speech recognition result distribution server, which distributes speech recognition results to user terminals. The grammar management server exists so that a personally created grammar for a given speech recognition target can be shared by many user terminals. When a user terminal starts speech recognition processing, the grammar management server distributes to it the grammar difference information needed for that processing. The terminal recognizes the input speech using the grammar updated with that difference information and sends the result to the speech recognition result distribution server, which distributes it to other user terminals. This makes it possible to redraw the scene the user views.

[0017]

DESCRIPTION OF THE PREFERRED EMBODIMENTS FIG. 3 shows the configuration of the speech recognition system of the present invention. The system shown in the figure comprises a user terminal 100, a grammar management server 200, a speech recognition result distribution server 300, and a network 400. The user terminal 100, the grammar management server 200, and the speech recognition result distribution server 300 are connected to the network 400.

[0018] The user terminal 100 comprises a data management unit 110, a grammar creation processing unit 120, an acoustic model adaptation unit 130, a speech recognition processing unit 140, a grammar update request unit 150, a display unit 160, and a communication control unit 170. Although FIG. 3 shows only one user terminal 100 connected to the network for simplicity, n terminals are assumed to be connected in practice.

[0019] The data management unit 110 stores and manages the data needed for speech recognition processing, such as the acoustic model, the language model, and the grammar. The grammar creation processing unit 120 creates a user-specific grammar in which the recognition target vocabulary is registered as text or the like. The acoustic model adaptation unit 130 adapts the speaker-independent acoustic model to the user 180 who inputs speech, according to that user's acoustic characteristics. Concretely, following instructions that the acoustic model adaptation unit 130 shows on the display unit 160, the user 180 utters sounds that cover all the vowels and consonants in the speaker-independent acoustic model. The acoustic model adaptation unit 130 then captures the acoustic characteristics of the user 180 from those utterances and adapts the speaker-independent acoustic model to the user.

[0020] The speech recognition processing unit 140 produces a speech recognition result using the data needed for speech recognition processing that is managed by the data management unit 110. The grammar update request unit 150 requests the grammar management server 200 to update the grammar difference information for the target on which the speech recognition processing unit 140 performs recognition. The display unit 160 draws and updates the shared virtual scene for the user 180.

[0021] The communication control unit 170 manages communication with the grammar management server 200 and the speech recognition result distribution server 300. The grammar management server 200 manages grammars so that a grammar a user has personally created for a given speech recognition target can be shared by many other user terminals. It comprises a grammar database 210, a search unit 220, a load unit 230, an upload unit 240, and a communication control unit 250.

[0022] The grammar database 210 stores and manages grammar information sent from the user terminal 100. The search unit 220 searches the grammar database 210 according to the request issued by the grammar update request unit 150 of the user terminal 100. The load unit 230 loads the difference information found by the search unit 220, that is, the difference from the grammar managed by the data management unit 110 of the user terminal 100.

[0023] The upload unit 240 uploads grammar information sent from the user terminal 100 to the grammar database 210. The communication control unit 250 manages communication with the user terminal 100. The speech recognition result distribution server 300 distributes a speech recognition result obtained from the user terminal 100 to selected other user terminals. It comprises a speech recognition result management unit 310, a distribution user management unit 320, and a communication control unit 330.

[0024] The speech recognition result management unit 310 stores and manages speech recognition results sent from the user terminal 100. The distribution user management unit 320 manages the user terminals 100 to which the results managed by the speech recognition result management unit 310 are distributed. The user terminals to distribute to are selected by the master terminal selection method disclosed in Japanese Patent Application No. Hei 10-347396.

[0025] The communication control unit 330 manages communication with the user terminal 100. Next, an outline of the operation of the present invention is given. Of the acoustic model, language model, and grammar required for speech recognition processing, the language model based on common rules and the acoustic model that reflects an individual's acoustic characteristics (rather than a speaker-independent one) are stored in each user terminal 100. A grammar created by an individual user and shared with other users is stored in the user terminal 100 that created it and also in the grammar management server 200 so that other users can share it.

[0026] A grammar that an individual user creates with the grammar creation processing unit 120 of the user terminal 100 is updated and stored in the grammar database 210 via the communication control unit 170 of the user terminal 100, the communication control unit 250 of the grammar management server 200, and the upload unit 240. Grammars of common rules may be stored in advance in the data management unit 110 of the user terminal 100, or in both the data management unit 110 and the grammar database 210.

[0027] The latest grammar is reflected on each user terminal as follows. The first grammar reflection process covers the case where the user 180 selects a speech recognition target and the user terminal 100 issues a grammar update request. In this case only the difference information of the grammar related to that target is updated. The user terminal 100 sends its grammar version for the speech recognition target to the grammar management server 200.

[0028] Next, the grammar management server 200 compares the received grammar version for the speech recognition target with the latest grammar version for that target stored in the grammar database 210. Only when the two differ does it return to the user terminal 100 the difference from the latest version of the grammar in the grammar database 210. The second grammar reflection process is a method in which the grammar management server 200 updates the grammar of the user terminal 100 when a common grammar is needed for an entire world (for example, a world that reflects the Kansai cultural sphere needs to understand the Kansai dialect).

[0029] In this case the grammar is updated, for example, when the user terminal 100 joins the world via a world management server (a server, not shown, that manages user logins to and logouts from the world). The terminal sends its overall grammar version at that time. The grammar management server 200 compares the overall grammar version received via the world management server (not shown) with its own, and if they differ it sends the entire latest grammar information to the user terminal 100.
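The first and second reflection processes both hinge on a version comparison and differ in what is returned: a diff for one recognition target, or the whole grammar. A minimal sketch of that server-side logic follows. The `GrammarStore` class, the use of list indices as version numbers, the added/removed diff format, and keeping a full history so a diff can be computed from a version number alone are all assumptions made for illustration; the patent does not describe the storage layout.

```python
from typing import Dict, List, Optional, Set


class GrammarStore:
    """Server-side store that keeps every published version of each grammar.

    The version number is simply the index into the per-target history list
    (an assumption made for this sketch).
    """

    def __init__(self) -> None:
        self.history: Dict[str, List[Set[str]]] = {}

    def publish(self, target: str, words: Set[str]) -> int:
        """Store a new version of a target's vocabulary and return its version."""
        versions = self.history.setdefault(target, [])
        versions.append(set(words))
        return len(versions) - 1

    def diff_since(self, target: str, client_version: int) -> Optional[dict]:
        """First process: reply with a diff only when the versions differ."""
        versions = self.history[target]
        latest = len(versions) - 1
        if client_version == latest:
            return None  # client is up to date, nothing is returned
        known = 0 <= client_version < len(versions)
        old = versions[client_version] if known else set()
        new = versions[latest]
        return {"version": latest, "added": new - old, "removed": old - new}

    def full_if_stale(self, target: str, client_version: int) -> Optional[dict]:
        """Second process: reply with the entire latest grammar when stale."""
        versions = self.history[target]
        latest = len(versions) - 1
        if client_version == latest:
            return None
        return {"version": latest, "words": set(versions[latest])}


store = GrammarStore()
store.publish("dog", {"sit"})           # version 0
store.publish("dog", {"sit", "shake"})  # version 1
print(store.diff_since("dog", 0))
print(store.diff_since("dog", 1))  # prints: None
```

Sending only a version number from the terminal keeps the request small; the trade-off in this sketch is that the server must retain older versions to compute the diff.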

[0030] The third grammar reflection process is a method in which the speech recognition target approaches the user's avatar and the grammar update needed for it is performed passively. In this case, for example, the speech recognition target carries the grammar version for itself. When the target comes within a certain distance of the user's avatar, that version is compared with the grammar version for the target stored in the user terminal 100. If they differ, the user terminal's grammar version is sent to the grammar management server 200, and the grammar difference information is obtained and applied. Alternatively, the first grammar reflection process described above is performed, triggered by the target coming within a certain distance of the user's avatar.
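The trigger condition of the third reflection process can be sketched as a distance test combined with a version comparison. The function name `needs_update`, the use of Euclidean distance in 3D coordinates, and the `radius` parameter are assumptions for illustration; the patent says only "a certain distance".

```python
import math
from typing import Sequence


def needs_update(avatar_pos: Sequence[float], target_pos: Sequence[float],
                 radius: float, local_version: int, target_version: int) -> bool:
    """Return True when the target is within `radius` of the avatar and the
    grammar version it carries differs from the one stored on the terminal."""
    within_range = math.dist(avatar_pos, target_pos) <= radius
    return within_range and local_version != target_version


# The dog is 5 units away, inside a radius of 6, and carries a newer grammar.
print(needs_update((0, 0, 0), (3, 4, 0), 6.0, local_version=1, target_version=2))  # prints: True
```

When this returns True, the terminal would go on to request the difference information as in the first reflection process.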

[0031] Next, the operation of speech recognition processing with the above configuration is described. FIG. 4 is a flowchart of the speech recognition processing of the present invention.

Step 101) The user terminal 100, through the grammar update request unit 150, requests an update of the grammar difference information needed for speech recognition and determines whether the grammar needs to be updated. If an update is needed, processing proceeds to step 102; otherwise it proceeds to step 103.

[0032] Step 102) The grammar managed by the data management unit 110 of the user terminal 100 is updated on the basis of the grammar difference information loaded by the load unit 230 of the grammar management server 200.

Step 103) Next, the speech recognition processing unit 140 performs speech recognition using the data needed for speech recognition processing that is managed by the data management unit 110 of the user terminal 100.

[0033] Step 104) Then, the word with the highest likelihood among the words registered in the grammar managed by the data management unit 110 of the user terminal 100 is extracted as the speech recognition result in text form.

Step 105) Finally, the speech recognition result expressed in text form is sent to the speech recognition result distribution server 300, and the processing ends.

[0034]

EXAMPLES Embodiments of the present invention are described below with reference to the drawings. As a premise of this embodiment, the data management unit 110 of the user terminal 100 holds an action table 600 that lists commands and the action corresponding to each command. When the user terminal 100 becomes the master terminal (the processing of a user terminal that has become the master terminal is detailed in Japanese Patent Application No. Hei 10-347396), it performs the processing corresponding to the actions listed in the action table 600.

【００３５】以下の実施例では、共有仮想画面として、
３次元共有仮想空間を例にとって説明する。図５は、本
発明の一実施例の３次元共有仮想空間における音声認識
の例を示す。同図に示すように、実際に、ユーザ端末１
００を用いてあるワールド５００に参加するユーザ５１
０が、当該ワールド５００内に存在する犬キャラクタ５
２０に対して音声による意思伝達を行う場合について説
明する。なお、この犬キャラクタ５２０は、図６に示す
ようなアクションテーブル６００に基づいて動作を行う
ものとする。
In the following embodiment, a three-dimensional shared virtual space is used as an example of the shared virtual screen. FIG. 5 shows an example of speech recognition in a three-dimensional shared virtual space according to one embodiment of the present invention. As shown in the figure, the description covers the case where a user 510, who participates in a world 500 using the user terminal 100, communicates by voice with a dog character 520 that exists in the world 500. The dog character 520 acts according to an action table 600 such as the one shown in FIG. 6.

【００３６】図７は、本発明の一実施例の文法データベ
ースの構成を示す。同図に示す文法データベース２１０
は、音声認識対象毎に１つのファイルが割り当てられた
複数のファイルで構成される。同図では、１つのファイ
ルに１つの文法が格納されている状態を表しており、文
法識別子と開始は１つの文法の区切りであり、多数の文
法の一つ一つが文法識別子と開始に囲まれて１つのファ
イルに格納されている。
FIG. 7 shows the structure of a grammar database according to one embodiment of the present invention. The grammar database 210 shown in the figure consists of a plurality of files, with one file assigned to each speech recognition target. The figure shows the state in which one grammar is stored in one file. The grammar identifier and the start marker delimit a single grammar, and each of the many grammars is enclosed by a grammar identifier and a start marker and stored in one file.

【００３７】単語宣言は、１つの文法に使われる有意義
な単語を全て定義しており、音響モデル及び言語モデル
を利用して認識された単語が有意義な単語にどのように
対応するか、また、同じ意味の単語はあるかを表してい
る。この例では、同じ意味の単語は、「｜」印のＯＲ記
号により表されており、「いどう」と「いどうしてくだ
さい。」は同じ有意義な単語としている。
The word declaration defines all the meaningful words used in one grammar. It indicates how words recognized using the acoustic model and the language model correspond to meaningful words, and whether there are words with the same meaning. In this example, words with the same meaning are joined by the OR symbol "|"; "いどう" (idou, "move") and "いどうしてください。" (idou shite kudasai, "please move") are treated as the same meaningful word.

【００３８】文法宣言は、意味のある１つの文として、
どのように有意義な単語群からどのような順序配列で構
成されているかを表している。この文法の表記方法は既
に知られている方法である。この例では、・＄number=($1｜$2｜$3);「１」または、「２」また
は、「３」は『ｎｕｍｂｅｒ』とする。・$num=$number$ 歩；「１歩」または、「２歩」また
は、「３歩」は『ｎｕｍ』とする。・$direction=$前；「前」は『ｄｉｒｅｃｔｉｏｎ』と
する。・$dir=$direction$に；「前に」は『ｄｉｒ』とする。・number=($num$dir｜$dir$num) ；「１歩」または、
「２歩」または、「３歩」かつ「前に」または、「前
に」から「１歩」または「２歩」または、「３歩」は
『ｎｕｍｂｅｒ』とする。・$sentence=$numdir$移動；「１歩」または、「２歩」
または、「３歩」かつ「前に」または、「前に」から
「１歩」または「２歩」または、「３歩」かつ「移動」
は『ｓｅｎｔｅｎｃｅ』とする。
The grammar declaration describes how one meaningful sentence is composed: from which meaningful words, and in what order. The notation used for this grammar is a known one. In this example:
・$number=($1|$2|$3); "1", "2", or "3" is a "number".
・$num=$number$歩; "1歩" (one step), "2歩" (two steps), or "3歩" (three steps) is a "num".
・$direction=$前; "前" (forward) is a "direction".
・$dir=$direction$に; "前に" (forward) is a "dir".
・$numdir=($num$dir|$dir$num); "1歩", "2歩", or "3歩" followed by "前に", or "前に" followed by "1歩", "2歩", or "3歩", is a "numdir".
・$sentence=$numdir$移動; any "numdir" followed by "移動" (move) is a "sentence".
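The rules above can be read as a small rewrite grammar. The following Python sketch expands them into every sentence they accept. The dictionary encoding and the function name `expand` are assumptions made for illustration; the sketch mirrors the rules rather than parsing the notation used in FIG. 7.

```python
from itertools import product
from typing import Dict, List

# Each rule maps a symbol to a list of alternatives; an alternative is a
# sequence of symbols or literal strings. Hypothetical encoding of the rules.
RULES: Dict[str, List[List[str]]] = {
    "number": [["1"], ["2"], ["3"]],
    "num": [["number", "歩"]],
    "direction": [["前"]],
    "dir": [["direction", "に"]],
    "numdir": [["num", "dir"], ["dir", "num"]],
    "sentence": [["numdir", "移動"]],
}


def expand(symbol: str) -> List[str]:
    """List every string a symbol can produce (literals produce themselves)."""
    if symbol not in RULES:
        return [symbol]
    results: List[str] = []
    for alternative in RULES[symbol]:
        # Combine the expansions of each part, keeping the written order.
        for parts in product(*(expand(s) for s in alternative)):
            results.append("".join(parts))
    return results


sentences = expand("sentence")
```

Under this encoding `sentences` holds six strings, covering both word orders, for example "1歩前に移動" and "前に1歩移動".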

【００３９】従って「１歩前に移動」も「前に１歩移
動」も「１歩前に移動して下さい」も同じ意味となる。
また、１つのファイル毎に２つのバージョン情報があ
り、また、文法識別子と開始とで区切られた文法毎にバ
ージョン情報がある。これらを利用する場合には、文法
データベース２１０の１ファイル毎にある２つのバージ
ョン情報として、一つは、１ファイル全体を文法管理サ
ーバ２００を維持・管理する制御装置等からの制御によ
り更新するときに、更新・利用されるバージョン情報で
ある。このバージョン情報は、バージョン番号と作成日
等からなり、ワールド全体に関して共通の文法が必要な
場合（関西の文化圏を反映させたワールドでは関西弁が
通じる必要がある）に、文法管理サーバ２００からユー
ザ端末１００に対して文法全体の更新を行うときに利用
される。
Therefore, "1歩前に移動" (move one step forward), "前に1歩移動" (move forward one step), and "1歩前に移動して下さい" (please move one step forward) all have the same meaning. In addition, each file carries two pieces of version information, and each grammar delimited by a grammar identifier and a start marker carries its own version information. Of the two pieces of version information attached to each file of the grammar database 210, one is updated and used when an entire file is updated under the control of a control device or the like that maintains and manages the grammar management server 200. This version information consists of a version number, a creation date, and so on. It is used when the grammar management server 200 updates the entire grammar on the user terminal 100 because a common grammar is required for a whole world (for example, the Kansai dialect must be understood in a world that reflects the Kansai cultural sphere).

【００４０】もう一つの文法データベース２１０の１フ
ァイル毎にあるバージョン情報は、最新の更新日・時刻
等からなり、ユーザ端末１００からの文法作成による文
法データベース２１０のファイル更新毎に、その更新日
・時刻に更新される。このバージョン情報は、ユーザ端
末１００の文法更新要求部１５０からデータベースと構
成フォーマットが同じであるデータ管理部１１０に記録
されている音声認識対象に対してのバージョン情報を送
信する時に利用されるものである。このバージョン情報
は、個々の文法にあるバージョン情報から検索処理によ
り、最新の更新日・時刻を得る処理よりも高速に最新の
更新日・時刻を取得するために用いるものであるが、ユ
ーザ端末１００の処理能力が高く、当該バージョン情報
を用いずに文法毎にあるバージョン情報の検索処理をし
ても容易に最新の更新日・時刻を取り出すことができ
て、他の処理に影響を与えないとするならばこのファイ
ル毎のバージョン情報は無くてもよい。
The other piece of per-file version information in the grammar database 210 consists of the latest update date and time and so on. It is set to the update date and time whenever a file of the grammar database 210 is updated by a grammar created on a user terminal 100. This version information is used when the grammar update request unit 150 of the user terminal 100 sends the version information for a speech recognition target recorded in the data management unit 110, which has the same format as the database. It exists so that the latest update date and time can be obtained faster than by searching the version information of the individual grammars. If the user terminal 100 has enough processing capability to obtain the latest update date and time easily by searching the per-grammar version information, without affecting other processing, this per-file version information may be omitted.
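The trade-off described here, a cached per-file timestamp versus a search over per-grammar timestamps, might look like the following Python sketch. The class `GrammarFile`, the grammar names, and the dates are all hypothetical and serve only as illustration.

```python
from datetime import datetime
from typing import Dict

# Hypothetical per-grammar version information for one recognition target.
grammar_versions: Dict[str, datetime] = {
    "move": datetime(1998, 12, 1, 9, 0),
    "sit": datetime(1998, 12, 3, 18, 30),
    "bark": datetime(1998, 11, 20, 12, 0),
}


def latest_by_search(versions: Dict[str, datetime]) -> datetime:
    """Scan every grammar's timestamp; needs no per-file version stamp."""
    return max(versions.values())


class GrammarFile:
    """One file of the database with a cached per-file update time."""

    def __init__(self, versions: Dict[str, datetime]) -> None:
        self.versions = dict(versions)
        self.file_updated = latest_by_search(self.versions)

    def put(self, grammar_id: str, updated: datetime) -> None:
        # Keep the per-file stamp in step with each grammar update.
        self.versions[grammar_id] = updated
        self.file_updated = max(self.file_updated, updated)
```

Reading `file_updated` is a single lookup, while `latest_by_search` visits every grammar. On a sufficiently capable terminal the search alone would do, which matches the remark that the per-file stamp is optional.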

【００４１】文法毎にあるバージョン情報は、ユーザ端
末１００の文法作成処理部１２０で音声認識対象に対し
て作成された文法が送信される毎に更新される。このバ
ージョン情報は、最新の更新日・時刻（送信された日）
等からなっている。また、作成文法を送信したユーザ端
末１００のデータ管理部１１０においても、作成文法と
共に、バージョン情報が送信時に更新される。この時の
ユーザ端末１００における文法作成時のデータ管理部１
１０の作成文法及びバージョン情報の更新方法は、単に
作成した文法及びそのバージョン情報だけのユーザ端末
１００の処理による更新では、他の文法のバージョンが
旧世代のままになってしまう恐れもある（なぜなら、音
声認識対象に対する全体文法の最新の更新日・時刻であ
るバージョン情報は、最新になるのに、他の文法は、依
然と旧世代のままの内容であり、修正前の最新更新日の
日付から修正日の最新更新日までの変化が反映されなく
なる）ので、データ管理部１１０に記憶されている修正
前の音声認識対象に対する全体文法の最新の更新日・時
刻であるバージョン情報も送信し、下記に示す文法管理
サーバ２００の処理により差分情報を得て更新する必要
がある。
The version information for each grammar is updated every time a grammar created for a speech recognition target by the grammar creation processing unit 120 of the user terminal 100 is transmitted. It consists of the latest update date and time (the date of transmission) and so on. In the data management unit 110 of the user terminal 100 that transmitted the created grammar, the version information is likewise updated together with the created grammar at the time of transmission. Here, the user terminal 100 must not update only the created grammar and its version information in the data management unit 110, because the other grammars could then be left at an old generation: the version information giving the latest update date and time of the whole grammar for the speech recognition target would become current while the other grammars kept their old contents, and the changes made between the previous latest update date and the new one would no longer be picked up. The terminal therefore also transmits the version information stored in the data management unit 110 before the modification (the latest update date and time of the whole grammar for the speech recognition target), and it must obtain the difference information through the processing of the grammar management server 200 described below and update with it.

【００４２】なお、文法管理サーバ２００は、文法デー
タベース２１０の文法毎にあるこのバージョン情報を、
ユーザ端末１００の文法更新要求部１５０から送信され
てきたデータ管理部１１０に記憶されている音声認識対
象に対してのバージョン情報（最新の更新日・時刻）と
比較して、より新しい更新日・時刻のバージョンの文法
データベース２１０にある文法だけを差分情報として、
ユーザ端末１００に送信する。ユーザ端末１００では、
送信された音声認識対象に対する文法の差分情報でデー
タ管理部１１０の個々の文法を更新し、最新のものとす
る。同じく、バージョン情報（最新の更新日・時刻）も
更新する。
The grammar management server 200 compares the version information of each grammar in the grammar database 210 with the version information (latest update date and time) for the speech recognition target that is stored in the data management unit 110 and sent by the grammar update request unit 150 of the user terminal 100. It sends to the user terminal 100, as difference information, only those grammars in the grammar database 210 whose update date and time are newer. The user terminal 100 updates the individual grammars in the data management unit 110 with the received difference information for the speech recognition target, bringing them up to date. The version information (latest update date and time) is updated as well.
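The comparison and update described in this paragraph can be sketched in Python as follows. The data layout, the function names `grammar_diff` and `apply_diff`, and the sample grammars and dates are assumptions for illustration, not the format used by the grammar database 210.

```python
from datetime import datetime
from typing import Dict, Tuple

# grammar id -> (grammar text, last update time); hypothetical layout.
Grammars = Dict[str, Tuple[str, datetime]]


def grammar_diff(server: Grammars, client_latest: datetime) -> Grammars:
    """Server side: keep only grammars newer than the client's version."""
    return {gid: entry for gid, entry in server.items()
            if entry[1] > client_latest}


def apply_diff(client: Grammars, diff: Grammars,
               client_latest: datetime) -> datetime:
    """Client side: overwrite changed grammars, return the new version."""
    client.update(diff)
    return max([client_latest] + [updated for _, updated in diff.values()])


# Illustrative contents only.
server_db: Grammars = {
    "move": ("$sentence=$numdir$移動", datetime(1998, 12, 5)),
    "sit": ("$sentence=$おすわり", datetime(1998, 12, 9)),
}
client_db: Grammars = {
    "move": ("$sentence=$numdir$移動", datetime(1998, 12, 5)),
}
client_version = datetime(1998, 12, 5)

diff = grammar_diff(server_db, client_version)
client_version = apply_diff(client_db, diff, client_version)
```

Only the "sit" grammar crosses the network in this example, because "move" is no newer than the version the client reported.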

【００４３】図８は、本発明の一実施例の音声認識処理
の動作の例を示すシーケンスチャートである。ステップ２０１）図５において、ワールド５００内の
犬キャラクタ５２０をマウスによりクリックすることな
どにより音声認識処理を開始すると、ユーザ端末１００
の文法更新要求部１５０により、文法管理サーバ２００
に対して、犬キャラクタ５２０に対する音声認識処理を
行う際に必要となる文法の差分情報の更新を要求する。
FIG. 8 is a sequence chart showing an example of the operation of the speech recognition processing according to one embodiment of the present invention.
Step 201) In FIG. 5, when speech recognition processing is started, for example by clicking the dog character 520 in the world 500 with a mouse, the grammar update request unit 150 of the user terminal 100 requests the grammar management server 200 to update the grammar difference information required for performing speech recognition processing on the dog character 520.

【００４４】ステップ２０２）更新要求を受信した文
法管理サーバ２００は、検索部２２０により、その犬キ
ャラクタ５２０に対する音声認識処理を行うために必要
な文法を文法データベース２１０から検索し、ロード部
２３０により、ユーザ端末１００のデータ管理部１１０
により管理される文法との差分情報をロードする。ステップ２０３）次に、ロード部２３０によりロード
された文法の差分情報は、ユーザ端末１００のデータ管
理部１１０により更新・格納され、この文法を用いて実
際の音声認識処理が可能となる。なお、ここでは、音声
認識対象をユーザが選択した際のクライアント側（ユー
ザ）からの文法更新要求を例としているが、例えば、関
西の文化圏を反映させたワールドでは関西弁が通じるよ
うに、あるワールド全体に関して共通の文法が必要な場
合には、サーバ側（文法管理サーバ２００）からクライ
アント（ユーザ）に対して文法更新を行ったり、音声認
識対象の方から近づいてきて受動的にそれに必要な文法
更新を行うなど、色々な適用例が考えられる。
Step 202) Upon receiving the update request, the grammar management server 200 uses the search unit 220 to search the grammar database 210 for the grammar required for speech recognition processing on the dog character 520, and uses the load unit 230 to load the difference from the grammar managed by the data management unit 110 of the user terminal 100.
Step 203) Next, the grammar difference information loaded by the load unit 230 is applied and stored by the data management unit 110 of the user terminal 100, and actual speech recognition processing becomes possible using this grammar. The example here is a grammar update request from the client side (user) issued when the user selects a speech recognition target, but various other applications are conceivable. For example, when a common grammar is required for an entire world, as when the Kansai dialect must be understood in a world reflecting the Kansai cultural sphere, the server side (grammar management server 200) may push a grammar update to the client (user). Alternatively, when a speech recognition target approaches the user, the required grammar update may be performed passively.

【００４５】ここで言う文法とは、ユーザが発声する音
声を仮名で記述した文字列とそれに対応する文字列の表
記との組み合わせのことを指す。例えば、図６に示すよ
うなアクションテーブル６００において、「１歩前に移
動」というテキストコマンドを音声認識結果として得る
場合には、文法の中には最低限「いっぽまえにいどう」
という発声が「１歩前に移動」という文字列の表記に対
応する、と登録されていなければ認識することはできな
い。また、「１歩前に移動」も「前に１歩移動」も「１
歩前に移動して下さい」という発声も音声認識結果とし
ては「１歩前に移動」と同一である、と文法の中に登録
されていなければ、それぞれが異なる音声認識結果を得
ることになってしまう。これらの発声は内容的には皆同
じではあるが、文法の登録の仕方によっては異なる音声
認識結果を得ることに繋がる。逆に全く同じ発声ではあ
っても、文法の登録のしかたによって、異なる音声認識
結果を得ることができるので、音声認識処理を行う対象
毎に異なった文法を登録して、その文法をユーザ端末に
逐次反映することにより、ユーザの音声による意思伝達
を柔軟に行うことが可能となる。
Here, a grammar means a combination of a character string that describes the user's utterance in kana and the written form of the corresponding character string. For example, to obtain the text command "1歩前に移動" (move one step forward) in the action table 600 of FIG. 6 as a speech recognition result, the grammar must at minimum register that the utterance "いっぽまえにいどう" corresponds to the written string "1歩前に移動"; otherwise the utterance cannot be recognized. Likewise, unless the grammar registers that the utterances "1歩前に移動", "前に1歩移動", and "1歩前に移動して下さい" all yield the same speech recognition result "1歩前に移動", each will produce a different recognition result. These utterances all mean the same thing, but depending on how the grammar is registered they can lead to different recognition results. Conversely, exactly the same utterance can be made to yield different recognition results depending on how the grammar is registered. Registering a different grammar for each speech recognition target and reflecting that grammar on the user terminal as needed therefore makes voice communication by the user flexible.
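The idea of a grammar as a mapping from kana utterances to written commands, registered separately for each recognition target, can be illustrated with a short Python sketch. The target names, the second target's command, and the function `recognize` are hypothetical; only the dog character's "1歩前に移動" example comes from the text.

```python
from typing import Dict, Optional

# Hypothetical per-target grammars: kana utterance -> written command.
GRAMMARS: Dict[str, Dict[str, str]] = {
    "dog": {
        "いっぽまえにいどう": "1歩前に移動",
        "まえにいっぽいどう": "1歩前に移動",
        "いっぽまえにいどうしてください": "1歩前に移動",
    },
    "cat": {
        # Same utterance, registered differently for another target.
        "いっぽまえにいどう": "前へ1歩",
    },
}


def recognize(target: str, kana: str) -> Optional[str]:
    """Look up the written form registered for this target, if any."""
    return GRAMMARS.get(target, {}).get(kana)
```

With this table, three different utterances addressed to the dog collapse to one command, while the same utterance addressed to the other target yields a different result. An utterance that is not registered returns `None`, which corresponds to the case where recognition is not possible.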

【００４６】ステップ２０４）犬キャラクタ５２０に
対する音声認識処理は、ユーザ端末１００の文法更新要
求部１４０の更新要求により更新された文法を用いて、
音声認識処理部１３０により行われる。ステップ２０５）テキスト形式にて表現された音声認
識結果は、音声認識結果配信サーバ３００に送信され
る。
Step 204) Speech recognition processing for the dog character 520 is performed by the speech recognition processing unit 140, using the grammar updated in response to the update request from the grammar update request unit 150 of the user terminal 100.
Step 205) The speech recognition result expressed in text format is transmitted to the speech recognition result distribution server 300.

【００４７】ステップ２０６）音声認識結果を受信し
た音声認識結果配信サーバ３００は、音声認識結果管理
部３１０により、受信した音声認識結果を一旦格納す
る。ステップ２０７）次に、配信ユーザ管理部３２０によ
り、受信した音声認識結果を配信するユーザ端末を選択
する。ステップ２０８）通信制御部３３０により、選択され
たユーザ端末１００に対して音声認識結果の配信を行
う。
Step 206) Upon receiving the speech recognition result, the speech recognition result distribution server 300 temporarily stores it using the speech recognition result management unit 310.
Step 207) Next, the distribution user management unit 320 selects the user terminals to which the received speech recognition result will be distributed.
Step 208) The communication control unit 330 distributes the speech recognition result to the selected user terminals 100.
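Steps 206 to 208 amount to store, select, and deliver. The Python sketch below shows that flow under one stated assumption: the selection policy is simply "every registered terminal except the sender", which the specification does not prescribe. The class and method names are hypothetical.

```python
from typing import Callable, Dict, List


class ResultDistributionServer:
    """Toy stand-in for the speech recognition result distribution server."""

    def __init__(self) -> None:
        self.stored: List[str] = []  # role of the result management unit
        self.terminals: Dict[str, Callable[[str], None]] = {}

    def register(self, terminal_id: str,
                 deliver: Callable[[str], None]) -> None:
        """Remember how to deliver a result to a terminal."""
        self.terminals[terminal_id] = deliver

    def receive(self, sender_id: str, result: str) -> List[str]:
        self.stored.append(result)  # step 206: store temporarily
        # Step 207 (assumed policy): every terminal except the sender.
        selected = [t for t in self.terminals if t != sender_id]
        for terminal_id in selected:  # step 208: deliver
            self.terminals[terminal_id](result)
        return selected


inbox: List[str] = []
server = ResultDistributionServer()
server.register("A", lambda result: None)
server.register("B", inbox.append)
delivered_to = server.receive("A", "1歩前に移動")
```

A real system could replace the selection line with any policy, for example only the master terminal of the addressed character.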

【００４８】ステップ２０９）音声認識結果配信サー
バ３００から配信された音声認識結果を受信したユーザ
端末１００は、アクションテーブル６００により定義さ
れた該当するアクションを犬キャラクタ５２０に対して
開始させる。これらにより、ユーザが直接関知していな
い共有仮想空間内の対象に対して音声による意思伝達を
行う際に、効率的に音声による意思伝達を行うことが可
能となる。
Step 209) Upon receiving the speech recognition result distributed from the speech recognition result distribution server 300, the user terminal 100 makes the dog character 520 start the corresponding action defined in the action table 600. In this way, the user can efficiently communicate by voice with an object in the shared virtual space that the user is not directly concerned with.

【００４９】ステップ２１０）一方、ユーザ端末１０
０の文法作成部１２０により作成されたユーザ独自の文
法は、データ管理部１１０により更新・格納される。ステップ２１１）同時に、通信制御部１７０により、
文法管理サーバ２００に送信される。ステップ２１２）文法情報を受信した文法管理サーバ
２００は、アップロード部２４０により、文法データベ
ース２１０にその文法情報をアップロードする。
Step 210) Meanwhile, a user-specific grammar created by the grammar creation processing unit 120 of the user terminal 100 is updated and stored by the data management unit 110.
Step 211) At the same time, it is transmitted to the grammar management server 200 by the communication control unit 170.
Step 212) Upon receiving the grammar information, the grammar management server 200 uploads it to the grammar database 210 using the upload unit 240.

【００５０】これにより、ユーザ独自の文法は常に最新
の文法として他のユーザ端末に反映させることが可能と
なる。ここで、上記のステップ２０９におけるアクショ
ンテーブル６００の定義に基づいてアクションを起こす
場合について説明する。犬キャラクタ５２０に対する命
令を認識したユーザ端末１００は、その音声認識した命
令を音声認識結果配信サーバ３００に送信する。音声認
識結果配信サーバ３００は、その犬キャラクタ５２０に
対する命令を犬キャラクタ５２０の行動を管理するマス
タ端末に配信する。犬キャラクタの行動を管理するマス
タ端末では、マスタ端末の共有オブジェクト処理手段
（図示せず）がデータ管理部に記憶されているアクショ
ンテーブルを利用して、犬キャラクタに対する命令に対
応する行動を行わせる。
In this way, a user-specific grammar can always be reflected on other user terminals as the latest grammar. The case where an action is triggered based on the definitions in the action table 600 in step 209 above is described next. The user terminal 100 that has recognized a command for the dog character 520 transmits the recognized command to the speech recognition result distribution server 300. The speech recognition result distribution server 300 distributes the command to the master terminal that manages the behavior of the dog character 520. On that master terminal, the shared object processing means (not shown) uses the action table stored in the data management unit to make the dog character perform the action corresponding to the command.
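The action table lookup on the master terminal can be sketched as a dictionary from text commands to actions. The contents of FIG. 6 are not reproduced in this text, so the table entries, the `DogCharacter` class, and `handle_command` below are hypothetical examples rather than the actual table.

```python
from typing import Callable, Dict


class DogCharacter:
    """Minimal character state for the illustration."""

    def __init__(self) -> None:
        self.position = 0
        self.sitting = False

    def move_forward(self, steps: int) -> None:
        self.sitting = False
        self.position += steps

    def sit(self) -> None:
        self.sitting = True


# Hypothetical action table: text command -> action on the character.
ACTION_TABLE: Dict[str, Callable[[DogCharacter], None]] = {
    "1歩前に移動": lambda dog: dog.move_forward(1),
    "2歩前に移動": lambda dog: dog.move_forward(2),
    "おすわり": lambda dog: dog.sit(),
}


def handle_command(dog: DogCharacter, command: str) -> bool:
    """Master-terminal side: run the action for a command, if defined."""
    action = ACTION_TABLE.get(command)
    if action is None:
        return False  # commands missing from the table are ignored
    action(dog)
    return True
```

Because the recognizer normalizes equivalent utterances to one written command, the table needs only one entry per action.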

【００５１】また、音声認識処理に必要となる音響モデ
ルをユーザ個別に管理することにより、実際に音声を入
力するユーザ毎に音響モデルを柔軟に適応させることが
できるので、ユーザ毎に最高の音声認識性能を得ること
が可能となる。このように、本発明によれば、ユーザの
視点に応じて、ユーザが眺めるシーンを描画更新するこ
とができる共有仮想画面において、ユーザが直接関知し
ていない共有仮想画面内の対象に対して音声による意思
伝達を行う際に、効率的に音声による意思伝達を行うこ
とができる。
In addition, by managing the acoustic model required for speech recognition processing separately for each user, the acoustic model can be flexibly adapted to each user who actually inputs speech, so the best speech recognition performance can be obtained for each user. Thus, according to the present invention, on a shared virtual screen where the scene viewed by a user is redrawn according to the user's viewpoint, the user can efficiently communicate by voice with an object in the shared virtual screen that the user is not directly concerned with.

【００５２】また、上記の実施例は、図３の構成に基づ
いて説明しているが、ユーザ端末１００、文法管理サー
バ２００及び音声認識結果配信サーバ３００の各構成要
素をプログラムとして構築し、それぞれの端末、サーバ
として利用されるコンピュータに接続されるディスク装
置や、フロッピーディスク、ＣＤ−ＲＯＭ等の可搬記憶
媒体に格納しておき、本発明を実施する際にインストー
ルすることにより、容易に本発明を実現できる。
Although the above embodiment has been described based on the configuration of FIG. 3, the components of the user terminal 100, the grammar management server 200, and the speech recognition result distribution server 300 can each be built as a program and stored on a disk device connected to the computer used as the respective terminal or server, or on a portable storage medium such as a floppy disk or CD-ROM. The present invention can then be realized easily by installing the program when the invention is to be carried out.

【００５３】なお、本発明は、上記の実施例に限定され
ることなく、特許請求の範囲内で種々変更・応用が可能
である。The present invention is not limited to the above embodiment, but can be variously modified and applied within the scope of the claims.

【００５４】[0054]

【発明の効果】上述のように、本発明では、共有仮想画
面における音声認識において、ユーザの視点に応じてユ
ーザが眺めるシーンを描画更新することが可能な共有仮
想画面において、ユーザが直接関知していない共有仮想
画面内の対象に対して音声による意思伝達を行う際に、
効率的に音声による意思伝達を行うことができる。
As described above, according to the present invention, in speech recognition on a shared virtual screen where the scene viewed by a user can be redrawn according to the user's viewpoint, the user can efficiently communicate by voice with an object in the shared virtual screen that the user is not directly concerned with.

[Brief description of the drawings]

【図１】本発明の原理を説明するための図である。FIG. 1 is a diagram for explaining the principle of the present invention.

【図２】本発明の原理構成図である。FIG. 2 is a principle configuration diagram of the present invention.

【図３】本発明の音声認識システムの構成図である。FIG. 3 is a configuration diagram of a speech recognition system of the present invention.

【図４】本発明の音声認識処理のフローチャートであ
る。FIG. 4 is a flowchart of a voice recognition process according to the present invention.

【図５】本発明の一実施例の３次元共有仮想空間におけ
る音声認識の例である。FIG. 5 is an example of speech recognition in a three-dimensional shared virtual space according to one embodiment of the present invention.

【図６】本発明の一実施例のアクションテーブルの例で
ある。FIG. 6 is an example of an action table according to an embodiment of the present invention.

【図７】本発明の一実施例である文法の例である。FIG. 7 is an example of a grammar according to an embodiment of the present invention.

【図８】本発明の一実施例の音声認識の動作の例を示す
シーケンスチャートである。FIG. 8 is a sequence chart showing an example of a speech recognition operation according to an embodiment of the present invention.

[Explanation of symbols]

１０文法管理手段２０差分情報配信手段３０モデル格納手段４０音声認識手段５０音声認識結果送信手段１００ユーザ端末１１０データ管理部１２０文法作成処理部１３０音響モデル適応部１４０音声認識処理部１５０文法更新要求部１６０表示部１７０通信制御部１８０ユーザ２００文法管理サーバ２１０文法データベース２２０検索部２３０ロード部２４０アップロード部２５０通信制御部３００音声認識結果配信サーバ３１０音声認識結果管理部３２０配信ユーザ管理部３３０通信制御部
Reference Signs List
10 grammar management means
20 difference information distribution means
30 model storage means
40 speech recognition means
50 speech recognition result transmission means
100 user terminal
110 data management unit
120 grammar creation processing unit
130 acoustic model adaptation unit
140 speech recognition processing unit
150 grammar update request unit
160 display unit
170 communication control unit
180 user
200 grammar management server
210 grammar database
220 search unit
230 load unit
240 upload unit
250 communication control unit
300 speech recognition result distribution server
310 speech recognition result management unit
320 distribution user management unit
330 communication control unit

Claims

[Claims]

1. A speech recognition method for a shared virtual screen in which the scene viewed by a user can be drawn and updated according to the user's viewpoint, the method using a grammar management server that manages a grammar in which the vocabulary to be recognized is registered as text or the like, and a speech recognition result distribution server that distributes speech recognition results to user terminals, wherein, at the start of speech recognition processing on a user terminal, the grammar management server distributes to the user terminal the difference information of the grammar required for the speech recognition processing.

2. The speech recognition method for a shared virtual screen according to claim 1, wherein the user terminal stores and manages a user-specific acoustic model and a language model required for speech recognition together with a grammar.

3. The speech recognition method for a shared virtual screen according to claim 1, wherein the user terminal performs speech recognition on input speech using the grammar, the acoustic model, and the language model.

4. The speech recognition method for a shared virtual screen according to claim 1, wherein the user terminal transmits a speech recognition result to the speech recognition result distribution server.

5. A speech recognition method for a shared virtual screen in which the scene viewed by a user can be drawn and updated according to the user's viewpoint, the method using a grammar management server that manages a grammar in which the vocabulary to be recognized is registered as text or the like, and a speech recognition result distribution server that distributes speech recognition results to user terminals, wherein: a user terminal requests the grammar management server to update the difference information of the grammar required for speech recognition; the grammar management server, at the start of speech recognition processing on the user terminal, distributes the difference information of the grammar required for the speech recognition processing to the user terminal; the user terminal acquires the difference information distributed from the grammar management server and updates the grammar, performs speech recognition on input speech using the updated grammar, extracts the maximum-likelihood vocabulary among the vocabulary in the recognition result as the speech recognition result, and transmits the speech recognition result to the speech recognition result distribution server; and the speech recognition result distribution server distributes the speech recognition result transmitted from the user terminal to other user terminals.

6. A speech recognition system for a shared virtual screen in which the scene viewed by a user can be drawn and updated according to the user's viewpoint, comprising: a user terminal that performs speech recognition; a grammar management server that manages a grammar in which the vocabulary to be recognized is registered as text or the like; and a speech recognition result distribution server that distributes speech recognition results to user terminals, wherein: the grammar management server has grammar management means for managing the grammar in which the vocabulary to be recognized is registered as text or the like, and difference information distribution means for distributing, at the start of speech recognition processing on the user terminal, the difference information of the grammar required for the speech recognition processing to the user terminal; the user terminal has model storage means for storing and managing, together with the grammar, the user-specific acoustic model and language model required for speech recognition, speech recognition means for performing speech recognition on input speech using the grammar, the acoustic model, and the language model stored in the user terminal, and speech recognition result transmission means for transmitting the speech recognition result obtained by the speech recognition means; and the speech recognition result distribution server has means for distributing the speech recognition result transmitted from the user terminal to other user terminals.

7. A storage medium storing a speech recognition program for a shared virtual screen, the program being installed on a grammar management server for a shared virtual screen in which the scene viewed by a user can be drawn and updated according to the user's viewpoint, wherein the program has a process of distributing, at the start of speech recognition processing on a user terminal, the difference information of the grammar required for the speech recognition processing to the user terminal.

8. A storage medium storing a speech recognition program for a shared virtual screen, the program being installed on a user terminal for a shared virtual screen in which the scene viewed by a user can be drawn and updated according to the user's viewpoint, wherein the program has: a process of storing and managing in storage means, together with the grammar, the user-specific acoustic model and language model required for speech recognition; a process of acquiring the difference information of the grammar distributed from a grammar management server that manages the grammar required for speech recognition; a process of performing speech recognition on input speech using the grammar, the acoustic model, and the language model; and a process of transmitting the speech recognition result to a speech recognition result distribution server.