JP2011159100A - Successive similar document retrieval apparatus, successive similar document retrieval method and program

Info

Publication number: JP2011159100A
Application number: JP2010020137A
Authority: JP
Inventors: Takeharu Eda; 毅晴江田; Katsuto Bessho; 克人別所; Toshiro Uchiyama; 俊郎内山; Masashi Uchiyama; 匡内山
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: Nippon Telegraph and Telephone Corp
Priority date: 2010-02-01
Filing date: 2010-02-01
Publication date: 2011-08-18

Abstract

PROBLEM TO BE SOLVED: To provide a successive similar document retrieval apparatus, method, and program that are highly convenient for retrieving documents similar to a given document from texts such as news, blog articles, and social network diaries, which are produced in large quantities every day. SOLUTION: The successive similar document retrieval apparatus has: a successive similar document retrieval means for successively retrieving similar documents; and an updating means for updating the results retrieved by the successive similar document retrieval means. COPYRIGHT: (C)2011, JPO&INPIT

Description

The present invention relates to a sequential similar document search apparatus, a sequential similar document search method, and a program.

With the development of the web, texts such as news, blog articles, and social network diaries are produced in large quantities every day. Because recorded text is meant to be read by someone, whether a text resembles one that someone has already written is a fundamental question bearing on its value.

Similar document search over a document database consisting of a large number of documents, made possible by advances in computing, is therefore a very important technique. Its range of application is wide, and attempts have been made to apply it to patent documents, academic papers, and the like.

One way to measure the similarity between documents is to represent the words used in each document as a vector over a set of symbols and to measure the distance between the vectors; this approach is widely used in similar document search. However, words are created by people, and treating them merely as symbols is not enough to capture the diversity of their meanings.

The concept base method has therefore been proposed as a way of representing a document that takes into account the rich connotations of a word that do not appear in the document at hand (see, for example, Non-Patent Document 1). The concept base method uses a large amount of training data to represent each word as a vector (a word concept vector). A word concept vector expresses how the word is used, and a document can be represented as the centroid of the word concept vectors of the words that appear in it. Word concept vectors make it possible to search for similar documents in a way that takes the rich connotations of human language into account.
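The centroid representation can be illustrated with a minimal sketch. The three-dimensional `CONCEPT_BASE` and the function name `document_vector` are hypothetical and exist only for illustration; how a real concept base is trained is outside the scope of this sketch.

```python
from typing import Dict, List, Sequence

# Hypothetical 3-dimensional concept base. A real one would be learned from
# training data and have hundreds to thousands of dimensions.
CONCEPT_BASE: Dict[str, List[float]] = {
    "weather": [0.9, 0.1, 0.0],
    "rain": [0.8, 0.2, 0.0],
    "stock": [0.0, 0.1, 0.9],
}


def document_vector(words: Sequence[str],
                    concept_base: Dict[str, List[float]]) -> List[float]:
    """Return the centroid of the concept vectors of the words found in the base."""
    vectors = [concept_base[w] for w in words if w in concept_base]
    if not vectors:
        raise ValueError("no word of the document is in the concept base")
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


# Words missing from the concept base ("unknown") are simply skipped here,
# which is an assumption of this sketch.
doc = document_vector(["weather", "rain", "unknown"], CONCEPT_BASE)
```

Two documents can then be compared by a distance or similarity between their centroid vectors.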

With the concept base method, similar document search is replaced by similarity search in a dense vector space of several hundred to several thousand dimensions (the number of dimensions is determined by the application). However, similarity search in a high-dimensional vector space has conventionally been difficult to process quickly because of the "curse of dimensionality". In the concept base method as well, similar documents are retrieved by the naive approach of comparing the query against every document in the database and then ranking the results, so a search over a database of several million documents takes several seconds.
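The naive approach amounts to a linear scan followed by a sort. The sketch below assumes cosine similarity as the measure and uses hypothetical names (`cosine`, `naive_search`); its cost grows with both the number of documents and the number of dimensions, which is the bottleneck described above.

```python
import math
from typing import List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def naive_search(query: Sequence[float],
                 database: Sequence[Sequence[float]],
                 top_k: int = 2) -> List[int]:
    """Compare the query with every document vector, then rank."""
    scored = [(cosine(query, vec), doc_id) for doc_id, vec in enumerate(database)]
    scored.sort(key=lambda t: (-t[0], t[1]))  # best score first, ties by id
    return [doc_id for _, doc_id in scored[:top_k]]


database = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
top = naive_search([1.0, 0.1], database)
```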

Meanwhile, in the database field, which is separate from document processing, research on approximate vector similarity search techniques that overcome the curse of dimensionality began around 1998 (see, for example, Non-Patent Document 6). Among the various techniques proposed, locality-sensitive hashing (LSH) (see, for example, Non-Patent Document 4) is the most promising. LSH approximates multidimensional vectors and indexes them using multiple hashes. As a result, the computational cost of a search is only (number of hashes) × (number of sampled bits), while the search accuracy is guaranteed probabilistically.

In theory, fast similarity search is possible without strong dependence on the number of dimensions, so applications to information recommendation, image retrieval, and the like on the web are anticipated.

Although LSH is an approximation algorithm, its accuracy is guaranteed probabilistically (see, for example, Non-Patent Documents 3 and 4), and it can perform neighbor (similarity) search very quickly. The theoretical framework is laid out in Non-Patent Document 3, and Non-Patent Document 4 implements an LSH that maps a multidimensional vector space equipped with the L1 norm into Hamming space. More recently, in addition to the L1 norm version, an L2 norm version that uses stable distributions, a cosine similarity version, and a Jaccard coefficient version have also been proposed.
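The mapping from an L1 space into Hamming space, and the role of the sampled bits, can be sketched as follows. The sketch assumes non-negative integer coordinates bounded by a known maximum and uses a unary encoding; the names `unary_embed`, `hamming`, and `make_bit_sampling_hash` are illustrative and are not taken from the cited documents.

```python
import random
from typing import Callable, List, Sequence, Tuple


def unary_embed(vec: Sequence[int], max_value: int) -> List[int]:
    """Encode each coordinate x as x ones followed by (max_value - x) zeros.

    With this encoding, the Hamming distance between two bit strings equals
    the L1 distance between the original integer vectors.
    """
    bits: List[int] = []
    for x in vec:
        if not 0 <= x <= max_value:
            raise ValueError("coordinate out of range")
        bits.extend([1] * x + [0] * (max_value - x))
    return bits


def hamming(a: Sequence[int], b: Sequence[int]) -> int:
    return sum(x != y for x, y in zip(a, b))


def make_bit_sampling_hash(n_bits: int, k: int,
                           rng: random.Random) -> Callable[[Sequence[int]], Tuple[int, ...]]:
    """Build one hash function that reads k randomly chosen bit positions."""
    positions = rng.sample(range(n_bits), k)
    return lambda bits: tuple(bits[p] for p in positions)


p, q = [3, 1, 4], [1, 1, 2]
bp, bq = unary_embed(p, 4), unary_embed(q, 4)
h = make_bit_sampling_hash(len(bp), 5, random.Random(0))
```

Vectors that are close in L1 distance differ in few bits, so they are likely to agree on all sampled positions and fall into the same bucket. Using several such hash functions gives the (number of hashes) × (number of sampled bits) cost mentioned above.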

In the prior art, sequential search by keyword (incremental search) is widespread and widely used in both English and Japanese (see various search engines and Non-Patent Document 2).

"Sequential search" is a search method in which candidates are displayed immediately each time the user types a character, instead of searching only after the entire search term has been entered. It is a kind of "Dynamic Query", in which the search proceeds dynamically as the user types, and it improves the efficiency not only of searching but also of text editing (see, for example, Non-Patent Document 2). Ordinary sequential search is a partial match search on a substring of the search query: it treats both the input string and the search target as symbols and is a technique for reaching the matching location quickly.
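Ordinary keyword-based sequential search can be sketched as a substring match that is repeated after every keystroke. The function names and the candidate list are hypothetical.

```python
from typing import List, Tuple


def incremental_search(typed: str, candidates: List[str]) -> List[str]:
    """Return the candidates that contain the text typed so far."""
    if not typed:
        return []
    return [c for c in candidates if typed in c]


def simulate_typing(text: str, candidates: List[str]) -> List[Tuple[str, List[str]]]:
    """Re-run the search after each character, as an incremental UI would."""
    return [(text[:i], incremental_search(text[:i], candidates))
            for i in range(1, len(text) + 1)]


steps = simulate_typing("wea", ["weather", "wealth", "web"])
```

The match is purely symbolic: a candidate is shown only if it literally contains the typed characters, regardless of meaning.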

In contrast to ordinary match search, which treats keywords as symbols, a similar document search technique that exploits the rich concepts of language has been proposed (see, for example, Non-Patent Document 1). However, the invention described in Non-Patent Document 1 cannot reduce the cost of computing distances between high-dimensional concept vectors.

[Non-Patent Document 1] Katsuhito Bessho, Toshiro Uchiyama, Ryoji Kataoka, "A concept base extension method based on co-occurrence between words and semantic attributes", IPSJ SIG Technical Report 2006-ICS-144, 2006/7/28, pp. 29-34.
[Non-Patent Document 2] Satoshi Takabayashi, Hiroyuki Komatsu, Toshiyuki Masui, "Migemo: Japanese Incremental Search", Transactions of Information Processing Society of Japan, Vol. 43, No. 12, pp. 3698-3705, December 2002.
[Non-Patent Document 3] Kevin S. Beyer, Jonathan Goldstein, Raghu Ramakrishnan, Uri Shaft, "When Is 'Nearest Neighbor' Meaningful?", ICDT 2005.
[Non-Patent Document 4] Piotr Indyk, Rajeev Motwani, "Approximate nearest neighbors: towards removing the curse of dimensionality", Annual ACM Symposium on Theory of Computing, 1998.
[Non-Patent Document 5] Aristides Gionis, Piotr Indyk, Rajeev Motwani, "Similarity Search in High Dimensions via Hashing", Very Large Data Bases, 1999.
[Non-Patent Document 6] Roger Weber, Hans-Jorg Schek, Stephen Blott, "A Quantitative Analysis and Performance Study for Similarity-Search Methods in High-Dimensional Spaces", Very Large Data Bases, 1998.

In the conventional examples above, similar document search using word concepts suffers from slow search caused by the curse of dimensionality, which makes the similar document search interface poorly responsive. In other words, similar document search is inconvenient.

In addition, in the conventional examples, the search starts only after the user has finished composing the query sentence. The user therefore cannot obtain similar document search results while still composing the query and must wait until the query is complete, which again makes similar document search inconvenient.

Furthermore, the conventional examples have the problem that search accuracy is degraded, the CPU cost of the search server is not reduced, and fast sequential similar document search is not possible.

An object of the present invention is to provide a sequential similar document search apparatus, a sequential similar document search method, and a program that are highly convenient for similar document search.

The present invention is a sequential similar document search apparatus comprising: sequential similar document search means for sequentially searching for similar documents; and update means for updating the search results obtained by the sequential similar document search means.

The present invention has the effect of making similar document search highly convenient.

FIG. 1 is a block diagram showing an outline of a sequential similar document search system 100 according to a first embodiment of the present invention.
FIG. 2 is a diagram showing the hardware configuration of the client CL1 and the similar document search application 10 in the sequential similar document search system 100.
FIG. 3 is an explanatory diagram of the operation of the sequential similar document search system 100, showing an outline of sequential similar document search processing using an LSH index.
FIG. 4 is a flowchart showing the operation of the first embodiment.
FIG. 5 is an explanatory diagram of the operation of a sequential similar document search system 200 according to a second embodiment of the present invention, illustrating processing that uses a concept base in sequential similar document search using an LSH index.
FIG. 6 is a flowchart showing the operation of the second embodiment.
FIG. 7 is a diagram showing an outline of processing that uses a concept base and a cache of word concept vectors in sequential similar document search using an LSH index, in a sequential similar document search system 300 according to a third embodiment of the present invention.
FIG. 8 is a flowchart showing the operation of the third embodiment.

The modes for carrying out the invention are the following embodiments.

FIG. 1 is a block diagram showing an outline of a sequential similar document search system 100 according to a first embodiment of the present invention.

The sequential similar document search system 100 is an embodiment that searches in units of index words rather than in units of characters.

Here, sequential similar document search means performing similar document search incrementally: the index word boundaries of the input search sentence q are recognized, and the search results are dynamically fetched and updated each time an index word is added.

The sequential similar document search system 100 comprises a similar document search application 10, a network NW1, and clients CL1, CL2, and CL3, which correspond to users U1, U2, and U3, respectively.

The similar document search application 10 is an example of a similar document search apparatus and includes a router 11, a LAN 12, an application server 13, database servers 14 and 15, and a similar document search engine 16. The similar document search engine 16 includes an LSH construction engine 161 and a query processing engine 162.

FIG. 2 is a diagram showing the hardware configuration of the client CL1 and the similar document search application 10 in the sequential similar document search system 100.

The client CL1 includes communication means 31, storage means 32, data processing means 33, and a user interface 34.

The similar document search application 10 includes a communication interface 17, control means 18, and storage means 19, and is connected to an input device 21 and an output device 22.

The control means 18 includes LSH construction means 181 and query processing means 182.

The storage means 19 includes a ROM 191, a RAM 192, an HDD 193, and an SSD 194.

FIG. 3 is an explanatory diagram of the operation of the sequential similar document search system 100, showing an outline of sequential similar document search processing using an LSH index.

LSH stands for locality-sensitive hashing, a method that uses hashes to perform approximate nearest neighbor search. Hash constructions have been proposed for the Hamming distance, the Euclidean distance, the L2 norm, and cosine similarity.

The original document set D1 is turned into a document database by using a document vector model or the like. Locality-sensitive hashing H1 is applied to the document database DB1 to build an LSH index. These steps are executed as preprocessing.

The user U1 enters a search sentence q, which is segmented into words in real time and likewise converted into a multidimensional vector by using a document vector model or the like. A multidimensional vector here represents features of an object's position or shape, an image, a video, text, and so on as a vector in Euclidean space; its number of dimensions is determined by the measuring equipment or the application. For example, a word concept vector in a concept base is a multidimensional vector.

An approximate similarity search is executed using the LSH index, yielding a similar document set D2. Finally, the result list is displayed to the user U1 while being updated sequentially.
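The preprocessing step (building the LSH index) and the query step (collecting candidates for the similar document set D2) can be sketched as below. The sketch assumes the cosine similarity variant of LSH based on random hyperplanes, which is only one of the variants mentioned in this description; the class name `CosineLSHIndex` and its parameters are hypothetical.

```python
import random
from collections import defaultdict
from typing import DefaultDict, List, Sequence, Set, Tuple


class CosineLSHIndex:
    """LSH index using the signs of random projections (an assumed variant)."""

    def __init__(self, dim: int, n_tables: int = 4, n_bits: int = 8, seed: int = 0):
        rng = random.Random(seed)
        # One set of n_bits random hyperplanes per hash table.
        self.planes = [[[rng.gauss(0.0, 1.0) for _ in range(dim)]
                        for _ in range(n_bits)] for _ in range(n_tables)]
        self.tables: List[DefaultDict[Tuple[int, ...], List[int]]] = [
            defaultdict(list) for _ in range(n_tables)]
        self.vectors: List[List[float]] = []

    def _key(self, table: int, vec: Sequence[float]) -> Tuple[int, ...]:
        return tuple(1 if sum(p * x for p, x in zip(plane, vec)) >= 0 else 0
                     for plane in self.planes[table])

    def add(self, vec: Sequence[float]) -> int:
        """Preprocessing: register a document vector in every hash table."""
        doc_id = len(self.vectors)
        self.vectors.append(list(vec))
        for t, table in enumerate(self.tables):
            table[self._key(t, vec)].append(doc_id)
        return doc_id

    def candidates(self, query: Sequence[float]) -> Set[int]:
        """Query: collect documents sharing a bucket with the query."""
        found: Set[int] = set()
        for t, table in enumerate(self.tables):
            found.update(table.get(self._key(t, query), []))
        return found


idx = CosineLSHIndex(dim=3)
a = idx.add([1.0, 0.0, 0.0])
b = idx.add([0.0, 1.0, 0.0])
hits = idx.candidates([2.0, 0.0, 0.0])  # same direction as document a
```

In a complete system the candidates would then be ranked by the exact similarity before being shown to the user; only the candidate set, not the whole database, needs to be compared.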

The sequential similar document search system 100 achieves fast similar document search by using locality-sensitive hashing H1. It then exploits this fast search to realize sequential similar document search (incremental similar document search), which was previously impossible, and thereby improves the similar document search interface.

Moreover, because the user U1 generally appends to the search sentence q step by step, the word vectors of words already contained in q can be cached. When the index word database DB2 is queried, only the words added since the previous segmentation result need to be looked up, so queries to the index word database DB2 are also fast. Because similar document search is processed quickly in this way, sequential similar document search becomes feasible.
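The caching idea can be sketched as a small wrapper around the database lookup. The class `WordVectorCache`, the counter `db_queries`, and the in-memory `FAKE_DB2` standing in for the index word database DB2 are all hypothetical.

```python
from typing import Callable, Dict, List, Sequence


class WordVectorCache:
    """Remember word vectors so that only newly added words hit the database."""

    def __init__(self, lookup: Callable[[str], List[float]]):
        self._lookup = lookup
        self._cache: Dict[str, List[float]] = {}
        self.db_queries = 0  # number of lookups actually sent to the database

    def get(self, word: str) -> List[float]:
        if word not in self._cache:
            self.db_queries += 1
            self._cache[word] = self._lookup(word)
        return self._cache[word]

    def vectors_for(self, words: Sequence[str]) -> List[List[float]]:
        return [self.get(w) for w in words]


# Stand-in for the index word database DB2 (contents are made up).
FAKE_DB2 = {"today": [1.0, 0.0], "weather": [0.0, 1.0], "now": [0.5, 0.5]}

cache = WordVectorCache(FAKE_DB2.__getitem__)
cache.vectors_for(["today", "weather"])            # two database lookups
first = cache.db_queries
cache.vectors_for(["today", "weather", "now"])     # only "now" is looked up
```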

According to the above embodiment, a search that uses a long natural-language sentence as the search sentence q can be performed quickly, which enables sequential similar document search in a free-form document editing environment. For example, on a Q&A site such as 「教えてgoo」 (Oshiete goo), similar questions can be presented immediately by searching sequentially as the asker adds to a new question, even while the question is still being written. Because similar questions are shown to the user while the search sentence is being composed, the user's doubt may be resolved without the question being posted at all. Repeated posting of similar questions is then avoided, which benefits both the asker and the site operator.

The proposed technique is not limited to Japanese. For a language with clear word delimiters, such as English, the query vector can be obtained without word segmentation, so sequential similar document search can be performed even faster.

The similar document search system 100 can be used as the similar document search engine 16 for various kinds of content such as news and weblogs.

The sequential similar document search system 100 shown in FIG. 1 is an example in which the system is used inside the similar document search application 10.

The similar document search engine 16 is used by the application server 13 and consists of the LSH construction engine 161 and the query processing engine 162. These processing units can be implemented within a single server or distributed across multiple machines.

Next, the operation of the first embodiment will be described.

[Sequential search processing]
FIG. 4 is a flowchart showing the operation of the first embodiment.

In S1, the LSH index idx built for the document database DB1 and the search sentence q at a given point in time are input.

The document database DB1 shown in FIG. 3 need not be limited to one based on the concept base method. For example, the tf*idf method, which treats words as symbols and weights them by frequency, may be used. Either way, each document becomes a multidimensional vector and can therefore be indexed using locality-sensitive hashing H1.
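A minimal sketch of tf*idf weighting follows. Many variants of the formula exist; this one assumes term frequency normalized by document length and an unsmoothed logarithmic idf, and the function name `tfidf_vectors` is hypothetical.

```python
import math
from collections import Counter
from typing import Dict, List, Sequence


def tfidf_vectors(docs: Sequence[Sequence[str]]) -> List[Dict[str, float]]:
    """Return one sparse tf*idf vector (term -> weight) per document."""
    n_docs = len(docs)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for doc in docs for term in set(doc))
    vectors: List[Dict[str, float]] = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append({term: (count / len(doc)) * math.log(n_docs / df[term])
                        for term, count in tf.items()})
    return vectors


docs = [["weather", "today"], ["weather", "stock"]]
vecs = tfidf_vectors(docs)
```

A term that appears in every document ("weather" here) receives weight zero, while terms specific to one document receive positive weight. The resulting sparse vectors can be mapped to fixed-length multidimensional vectors over the vocabulary before hashing.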

Next, in S2, the search sentence q is segmented into words to generate a result word string Vq. For example, the tf*idf method, which treats words as symbols and weights them by frequency, may be used. Then, in S3, it is determined whether the end of the search sentence q is an index word boundary. If it is not, the process waits in S4 for the user U1 to append to the search sentence q (the query). If it is, a similarity search is performed in S5 using the result word string Vq and the LSH index idx built in advance.

The similarity search retrieves, from a multidimensional database storing a large number of multidimensional vectors, vectors that are close (similar) to the given search sentence q. Closeness is determined by a distance (similarity). In particular, when the vectors being searched represent documents, the search is called similar document search. The distance is a measure defined between two multidimensional vectors that satisfies the distance axioms.

Whether the end of the search sentence q is an index word boundary can be determined in either of two ways. As shown in [Example 1] below, the end is judged to be an index word boundary if the part of speech of the last word obtained by morphological analysis is a noun or a sa-hen (suru) verb. Alternatively, as shown in [Example 2] below, the end is judged to be an index word boundary if the last word exists in the index word database DB2.

"Examples of determining whether the end of the search sentence q is an index word boundary"
[Example 1]: A method in which the document set is indexed by morphological analysis using nouns and sa-hen verbs (and index word boundaries are determined accordingly)
In [Example 1], the search sentence q is morphologically analyzed, and if the last morpheme is a noun or a sa-hen verb, the end is judged to be an index word boundary. For example, morphological analysis of the search sentence q 「今日の天気は、いまい」 (a partially typed form of "Today's weather is not so good") yields 今日: noun, の: case particle, 天気: noun, は: continuative particle, い: verb stem, まい: verb suffix. The last morpheme, まい, is a verb suffix and is neither a noun nor a sa-hen verb, so the end of this search sentence q is judged not to be an index word boundary.
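The judgment in [Example 1] can be sketched as a check on the part of speech of the last morpheme. The analysis result below is written by hand from the text; a real system would obtain it from a morphological analyzer, and the English part-of-speech labels and the function name are assumptions of this sketch.

```python
from typing import List, Tuple

# Parts of speech treated as index words in [Example 1].
INDEXABLE_POS = {"noun", "sa-hen verb"}


def ends_at_index_word_boundary(morphemes: List[Tuple[str, str]]) -> bool:
    """morphemes: (surface form, part of speech) pairs in input order."""
    return bool(morphemes) and morphemes[-1][1] in INDEXABLE_POS


# Hand-written analysis of 「今日の天気は、いまい」, copied from the text.
analysis = [
    ("今日", "noun"),
    ("の", "case particle"),
    ("天気", "noun"),
    ("は", "continuative particle"),
    ("い", "verb stem"),
    ("まい", "verb suffix"),
]

at_boundary = ends_at_index_word_boundary(analysis)            # ends with まい
at_boundary_weather = ends_at_index_word_boundary(analysis[:3])  # ends with 天気
```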

If the part of speech of the last word in the word string produced by morphological analysis is one of the parts of speech to be indexed (noun or sa-hen verb), that last word is an index word boundary. When the characters of 「今日の天気は、いまいちだった」 ("Today's weather was not so good") are entered one after another as shown below, three index word boundaries x, y, and z appear. Similar documents are searched only at index word boundaries, so three similar document searches suffice, and the CPU cost of the search server (the similar document search engine 16 of the similar document search application 10 shown in FIG. 5) is reduced.
・今日の天 (noun) ... index word boundary x
That is, when 「天」 is entered after 「今日の」, the entered 「天」 is a noun, and nouns are a part of speech to be indexed, so 「天」 is an index word boundary. This boundary is called index word boundary x.
・今日の天気 (noun) ... index word boundary y
That is, when 「気」 is entered after 「今日の天」, the last word becomes 「天気」, which is a noun, and nouns are a part of speech to be indexed, so this is an index word boundary. This boundary is called index word boundary y.
・今日の天気は (continuative particle)
That is, when 「は」 is entered after 「今日の天気」, the entered 「は」 is a continuative particle, which is not a part of speech to be indexed, so 「は」 is not an index word boundary.
・今日の天気は、い (verb stem)
・今日の天気は、いま (noun) ... index word boundary z
That is, when 「いま」 is entered after 「今日の天気は、」, the entered 「いま」 is a noun, and nouns are a part of speech to be indexed, so 「いま」 is an index word boundary. This boundary is called index word boundary z.
・今日の天気は、いまい (verb suffix)
・今日の天気は、いまいち (continuative word)
・今日の天気は、いまいちだ (copula)
・今日の天気は、いまいちだっ (sentence-final particle)
・今日の天気は、いまいちだった (copula)
[Example 2]: A method of determining index word boundaries using the index word database DB2
[Example 2] uses only the result of word segmentation: if the last word of the segmentation result is contained in the index word database DB2, it is judged to be an index word boundary. In [Example 2], there is no need to go as far as morphological analysis. The index word database DB2 is stored on the HDD 193 or the like.
・今日の「天」 (contained) index word boundary
That is, when 「天」 is entered after 「今日の」, the last word 「天」 is contained in the index word database DB2, so the entered 「天」 is an index word boundary. This boundary is called index word boundary x.
・今日の「天気」 (contained) index word boundary
That is, when 「気」 is entered after 「今日の天」, the last word 「天気」 is contained in the index word database DB2, so this is an index word boundary.
・今日の天気「は」 (not contained)
That is, when 「は」 is entered after 「今日の天気」, the last word 「は」 is not contained in the index word database DB2, so the entered 「は」 is not an index word boundary.
・今日の天気は、「い」 (not contained)
・今日の天気は、「いま」 (not contained)
If 「いま」 were written with the kanji 「今」, it would be an index word boundary, because 「今」 is contained in the index word database DB2. However, 「いま」 is not contained in the index word database DB2, so the entered 「いま」 is not an index word boundary.
・今日の天気は、い「まい」 (not contained)
・今日の天気は、「いまいち」 (contained) index word boundary
・今日の天気は、いまいち「だ」 (not contained)
・今日の天気は、「いまいちだっ」 (not contained)
・今日の天気は、いまいち「だった」 (not contained)
Here, "(contained)" means that the last word of the segmentation result is contained in the index word database DB2, and "(not contained)" means that it is not.
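The dictionary-based judgment in [Example 2] can be sketched as a set membership test on the last word of each segmentation result. The list `LAST_WORDS` is copied from the bracketed words above; the contents of `INDEX_WORDS`, standing in for the index word database DB2, are limited to the words the text states are contained and are otherwise an assumption.

```python
from typing import List, Set

# Stand-in for the index word database DB2 (only words named in the text).
INDEX_WORDS: Set[str] = {"天", "天気", "今", "いまいち"}

# Last word of the segmentation result after each input step of
# 「今日の天気は、いまいちだった」, as listed in [Example 2].
LAST_WORDS: List[str] = [
    "天", "天気", "は", "い", "いま", "まい", "いまいち", "だ", "いまいちだっ", "だった",
]


def is_index_word_boundary(last_word: str, index_words: Set[str]) -> bool:
    """A state is a boundary if its last segmented word is an index word."""
    return last_word in index_words


# Similar document search is triggered only at these states.
boundaries = [w for w in LAST_WORDS if is_index_word_boundary(w, INDEX_WORDS)]
```

Out of ten input states, only three trigger a search, and no morphological analysis is needed to decide which ones.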

索引語境界以外の分かち書き結果に含まれている索引語集合と、その直前の索引語境界での分かち書き結果に含まれている索引語集合とが同一であることに注意を要する。換言すれば、検索語境界の状態で類似文書検索しても、検索語境界以外の状態で類似文書検索しても、検索結果が変化しない。つまり、検索語境界の状態でのみ類似文書検索しても、検索精度が低下しない。すなわち、検索語境界以外の状態で類似文書検索せずに、これによって、検索回数を減らしても、検索結果に影響がなく、したがって、検索精度が低下しない。なお、文字を入力する度に、索引語境界の検出処理を実行し、上記直前は、検索文の索引語境界を検出する直前であり、今実行した境界検出の直前の境界検出である。 It should be noted that the index word set included in the segmentation result other than the index word boundary is the same as the index word set included in the segmentation result immediately before the index word boundary. In other words, the search result does not change even if the similar document search is performed in the state of the search word boundary or the similar document search is performed in a state other than the search word boundary. That is, even if a similar document is searched only in the search word boundary state, the search accuracy does not decrease. That is, even if the number of searches is reduced without searching for similar documents in a state other than the search word boundary, the search results are not affected, and therefore the search accuracy is not lowered. It should be noted that every time a character is input, index word boundary detection processing is executed, and immediately before the above is immediately before detecting an index word boundary of a search sentence, which is boundary detection immediately before the boundary detection just performed.

次に、Ｓ６で、類似検索の結果ベクトル集合Ｒｑを、利用者Ｕ１が見ているインタフェースに逐次的に追加する。これによって、逐次検索を実現することができる。 Next, in S6, the result vector set Rq of the similarity search is sequentially added to the interface viewed by the user U1. Sequential search is realized in this way.

たとえば、上記［例１］において、索引語境界ｘ、ｙ、ｚまでのそれぞれの検索文ｑ（クエリ）で検索すると、下記の検索結果を得ることができる。 For example, in the above [Example 1], the following search results can be obtained by performing a search using the respective search sentences q (queries) up to the index word boundaries x, y, and z.

索引語境界ｘ…今日の天→今日、天［検索結果］：「今日は天気が良いね」、「今日、天に召されました」 Index word boundary x: "Today's sky" → "today", "sky". [Search results]: "The weather is nice today", "Today, he was called to heaven"
つまり、索引語境界ｘまでの検索文ｑである「今日の天」で検索すると、この検索結果（ランキングの上位２つの検索結果）は、「今日は天気が良いね」、「今日、天に召されました」であったとする。 In other words, suppose that searching with "Today's sky", the search sentence q up to index word boundary x, returns "The weather is nice today" and "Today, he was called to heaven" as the top two results in the ranking.

索引語境界ｙ…今日の天気→今日、天気［検索結果］：「今日の天気はいかが？」、「今日は天気が良いね」 Index word boundary y: "Today's weather" → "today", "weather". [Search results]: "How is the weather today?", "The weather is nice today"
つまり、索引語境界ｙまでの検索文ｑである「今日の天気」で検索すると、この検索結果（ランキングの上位２つの検索結果）は、「今日の天気はいかが？」、「今日は天気が良いね」であったとする。 In other words, suppose that searching with "Today's weather", the search sentence q up to index word boundary y, returns "How is the weather today?" and "The weather is nice today" as the top two results in the ranking.

索引語境界ｚ…今日の天気は、いま→今日、天気、いま［検索結果］：「今日の天気は、いまからどうなる？」、「今日の天気はいかが？」 Index word boundary z: "Today's weather is, now" → "today", "weather", "now". [Search results]: "What will today's weather be like from now on?", "How is the weather today?"
検索文が追記されればされる程、より多くのキーワードの特徴を利用できるので、検索結果の精度が高くなることが期待できる。 The more text is appended to the search sentence, the more keyword features can be used, so the accuracy of the search results can be expected to improve.

最後に、Ｓ７で、検索文ｑに更新があると判定すれば、すなわち利用者Ｕ１が検索文ｑ（クエリ）を修正した場合、クエリからベクトルを生成するステップＳ２に戻り、同じ処理を繰り返す。 Finally, if it is determined in S7 that the search sentence q is updated, that is, if the user U1 modifies the search sentence q (query), the process returns to step S2 where a vector is generated from the query, and the same processing is repeated.

実施例１は、多次元ベクトル空間をハミング空間に写像する局所性検知可能ハッシングＨ１を用いて、性能の課題を解決しつつ、逐次類似文書検索を実現する発明であり、逐次類似文書検索では、入力検索文ｑの索引語境界を認識し、索引語が追加されたタイミングで、検索結果の取得更新を動的に行い、検索文ｑを入力するにつれ、逐次的に検索結果を得ることができる。 Example 1 is an invention that realizes sequential similar document search while addressing the performance problem by using locality-detectable hashing H1, which maps a multidimensional vector space to a Hamming space. In the sequential similar document search, the index word boundaries of the input search sentence q are recognized, and the search results are dynamically acquired and updated each time an index word is added, so that search results are obtained sequentially as the search sentence q is entered.
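As an illustration of mapping a vector space to a Hamming space, the sketch below uses random-hyperplane (sign random projection) hashing, a common form of what is usually called locality-sensitive hashing (LSH). This section does not state which hash family H1 actually is, so the choice of hash, the function names, and the parameters (`dim`, `bits`, `seed`) are all assumptions.

```python
import random


def make_hyperplanes(dim, bits, seed=0):
    """Draw `bits` random hyperplane normals in a `dim`-dimensional space."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(bits)]


def signature(vec, planes):
    """Map a real vector to a bit string (as an int): one sign bit per hyperplane."""
    sig = 0
    for plane in planes:
        dot = sum(p * x for p, x in zip(plane, vec))
        sig = (sig << 1) | (1 if dot >= 0.0 else 0)
    return sig


def hamming(a, b):
    """Number of differing bits between two signatures."""
    return bin(a ^ b).count("1")


BITS = 16
planes = make_hyperplanes(dim=3, bits=BITS)
doc = [0.2, 0.7, 0.1]
same_direction = [0.4, 1.4, 0.2]   # doc scaled by 2
opposite = [-0.2, -0.7, -0.1]      # doc negated

d_same = hamming(signature(doc, planes), signature(same_direction, planes))
d_opposite = hamming(signature(doc, planes), signature(opposite, planes))
```

With this hash, vectors pointing in the same direction share a signature, while opposite vectors differ in every bit; vectors at intermediate angles fall in between, which is what allows Hamming distance to stand in for vector similarity.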

実施例１によれば、検索精度が低下せず、検索サーバのＣＰＵコストを下げて、高速な逐次類似文書検索が可能である。 According to the first embodiment, it is possible to perform a high-speed sequential similar document search without reducing the search accuracy and reducing the CPU cost of the search server.

なお、実施例１において、索引語単位で逐次的に類似文書を検索しなくてもよく、また、検索文の索引語境界を検出しなくてもよい。つまり、類似文書を逐次的に検索すれば足り、上記実施例は、類似文書を逐次的に検索する逐次的類似文書検索手段と、上記逐次的類似文書検索手段が検索した検索結果を更新する更新手段とを有する逐次類似文書検索装置の例である。このようにすれば、ユーザによる質問文の作成の途中で類似文書検索の結果を得ることができ、類似文書検索の結果を質問文完成まで待つ必要がなく、類似文書検索の利便性が高い。 In the first embodiment, it is not essential to search for similar documents sequentially in units of index words, nor to detect the index word boundaries of the search sentence. That is, it suffices to search for similar documents sequentially; the above embodiment is an example of a sequential similar document search apparatus having sequential similar document search means for sequentially searching for similar documents and updating means for updating the search results retrieved by the sequential similar document search means. In this way, similar document search results can be obtained while the user is still composing a question, so there is no need to wait until the question is complete, which makes similar document search more convenient.

本発明の実施例２である逐次類似文書検索システム２００は、索引語単位で検索し、しかも概念ベース法を用いる実施例である。 The sequential similar document search system 200 according to the second embodiment of the present invention is an embodiment in which search is performed in index word units and the concept-based method is used.

［概念ベースを用いた逐次検索処理］ [Sequential search processing using concept base]
逐次類似文書検索システム２００のハードウェアは、図１、図２に示す逐次類似文書検索システム１００と同様である。 The hardware of the sequential similar document search system 200 is the same as that of the sequential similar document search system 100 shown in FIGS. 1 and 2.

次に、本発明の実施例２の動作について説明する。 Next, the operation of the second embodiment of the present invention will be described.

図５は、本発明の実施例２である逐次類似文書検索システム２００の動作の説明図であり、ＬＳＨ索引を用いた逐次類似文書検索において、概念ベースを用いた処理の説明図である。 FIG. 5 is an explanatory diagram of the operation of the sequential similar document search system 200 according to the second embodiment of the present invention, and is an explanatory diagram of processing using the concept base in the sequential similar document search using the LSH index.

上記概念ベースは、単語と意味属性共起行列とに、特異値分解を施すことによって抽出されたデータベースである。意味属性付与機能を持つ形態素解析器を利用することによって、膨大なトレーニングデータの意味を反映させた単語の意味ベクトルを構築することができる。登場する単語ベクトルの重心として、文書を表現することができ、文書の意味的類似性に基づいた概念検索を、ベクトル間の距離の近さを用いて実現できる。 The concept base is a database extracted by performing singular value decomposition on words and semantic attribute co-occurrence matrices. By using a morphological analyzer having a semantic attribute assigning function, it is possible to construct a word semantic vector reflecting the meaning of a large amount of training data. A document can be expressed as the center of gravity of the word vector that appears, and a concept search based on the semantic similarity of the document can be realized using the proximity of the distance between vectors.
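As a sketch of this construction, assume a truncated singular value decomposition of rank k applied to the word-by-semantic-attribute co-occurrence matrix X. The rank k, the scaling of the word vectors, and the symbols w_i and W_d are assumptions for illustration; this section does not specify them.

\[
X \approx U_k \Sigma_k V_k^{\top}, \qquad
w_i = \left(U_k \Sigma_k\right)_{i,:}, \qquad
d = \frac{1}{|W_d|} \sum_{i \in W_d} w_i
\]

Here w_i is the concept vector of the i-th word, W_d is the set of words appearing in a document, and d is the document vector obtained as the centroid of its word vectors. Documents whose vectors lie close together are then treated as semantically similar.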

元の文書集合Ｄ１から、概念ベース法を用いて文書データベースＤＢ１と索引語データベースＤＢ２とを構築する。文書データベースＤＢ１に対して、局所性検知可能ハッシングＨ１を適用し、ＬＳＨ索引を構築する。以上は、前処理として実行する。 A document database DB1 and an index word database DB2 are constructed from the original document set D1 using a concept-based method. A locality detectable hashing H1 is applied to the document database DB1 to construct an LSH index. The above is executed as preprocessing.

利用者Ｕ１が検索文ｑ（クエリ）を入力し、リアルタイムに分かち書きを行い、索引語データベースＤＢ２を利用し、入力された検索文ｑ（クエリ）を図５に示すクエリ概念ベクトルに変換する。ＬＳＨ索引を用いて近似類似検索を実行し、類似文書集合Ｄ２を結果として得る。最後に、利用者Ｕ１に結果リストを逐次更新を行いながら表示する。 The user U1 inputs the search sentence q (query), performs the division in real time, uses the index word database DB2, and converts the input search sentence q (query) into the query concept vector shown in FIG. An approximate similarity search is performed using the LSH index, and a similar document set D2 is obtained as a result. Finally, the result list is displayed to the user U1 while being sequentially updated.

図６は、実施例２の動作を示すフローチャートである。 FIG. 6 is a flowchart illustrating the operation of the second embodiment.

実施例２では、Ｓ１１で、概念ベース法によって構築した単語概念ベース（図５参照、単語概念ベクトルを単語名で検索できるようにしたデータベース）を用意する（非特許文献１参照）。Ｓ２で、概念ベース法の類似文書検索処理と同様に、まず、検索文ｑを分かち書き出し、単語集合Ｖｑを得る。次に、単語集合Ｖｑ中のそれぞれの単語の概念ベクトルを、単語概念ベースから取得する。 In the second embodiment, in S11, a word concept base constructed by the concept-based method (see FIG. 5; a database in which word concept vectors can be looked up by word name) is prepared (see Non-Patent Document 1). In S2, as in the similar document search process of the concept-based method, the search sentence q is first segmented into words to obtain the word set Vq. Next, the concept vector of each word in the word set Vq is acquired from the word concept base.

そして、Ｓ１２で、これらの単語概念ベクトル集合の重心を求める。重心は、たとえば、単語概念ベクトルの集合Ｄｗを、Ｄｗ＝｛ｖ１，ｖ２，…ｖｍ｝としたときに、重心ベクトルＧ_ｑを、 Then, in S12, the centroid of this set of word concept vectors is obtained. For example, when the set of word concept vectors Dw is Dw = {v1, v2, ..., vm}, the centroid vector G_q can be expressed as

\[ G_q = \frac{1}{m} \sum_{i=1}^{m} v_i \tag{1} \]

と表すことができる。なお、ｍは、単語概念ベクトルの数である。Ｓ１３で、この重心ベクトルＧ_ｑを検索キーとして、ＬＳＨ索引ｉｄｘに対して検索処理を行い、Ｓ１４で、検索結果を逐次的にインタフェースに追加する。さらに、検索文ｑに更新があれば、Ｓ２に戻り、再度検索を行い、結果を逐次的に更新する。 Here, m is the number of word concept vectors. In S13, a search is performed on the LSH index idx using this centroid vector G_q as the search key, and in S14, the search results are sequentially added to the interface. If the search sentence q is updated, the process returns to S2, the search is performed again, and the results are updated sequentially.
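The centroid computation of S12 can be sketched as follows. The function names and the two-dimensional toy word concept base are hypothetical; words that are not found in the concept base are simply skipped, which is an assumption about how non-index words are handled.

```python
def centroid(vectors):
    """Component-wise mean of m equal-length vectors."""
    if not vectors:
        raise ValueError("at least one vector is required")
    m = len(vectors)
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / m for i in range(dim)]


def query_vector(words, concept_base):
    """Centroid of the concept vectors of the words found in the concept base."""
    vectors = [concept_base[w] for w in words if w in concept_base]
    return centroid(vectors)


# Toy word concept base (hypothetical 2-dimensional concept vectors).
concept_base = {"今日": [1.0, 0.0], "天気": [0.0, 1.0]}
g_q = query_vector(["今日", "の", "天気"], concept_base)
```

The resulting `g_q` would then serve as the search key for the LSH index in S13.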

実施例２によれば、索引語境界を検出する場合、概念ベース法による類似文書検索を実現するので、語の意味合いを考慮した逐次類似文書検索が可能である。つまり、実施例２は、概念ベース法によってベクトル化した文書データベースＤＢ１に対して、予め、局所性検知可能ハッシングＨ１を用いて、索引付けする（索引語境界を検出する）。また、概念ベース法によって構築した単語概念ベクトル集合も、単語名に索引（ハッシュやＢ木等）を付与して索引語データベースＤＢ２として索引付けする。利用者Ｕ１が検索文ｑの文字列を入力する度に、単語集合を取得する。語の切れ目の無い日本語等の場合、即座に検索文ｑ（検索クエリ）の分かち書きを行い、単語集合を取得する。直前状態から単語の追加、削除があれば、分割されたそれぞれの単語を用いて、索引語データベースＤＢ２を検索し、このデータベースに含まれている単語集合の重心ベクトルを求め、これを検索クエリベクトルとする。なお、この検索クエリベクトル（＝クエリベクトル）は、クエリ概念ベクトルの上位概念であり、検索クエリベクトルのうちで、単語概念ベースを用いて作成したベクトルが、クエリ概念ベクトルである。 According to the second embodiment, when an index word boundary is detected, a similar document search by the concept-based method is performed, so a sequential similar document search that takes the meaning of words into account is possible. That is, in the second embodiment, the document database DB1, vectorized by the concept-based method, is indexed in advance using the locality-detectable hashing H1 (index word boundaries are detected). The word concept vector set constructed by the concept-based method is likewise indexed as the index word database DB2 by attaching an index (hash, B-tree, etc.) to the word names. Each time the user U1 inputs characters of the search sentence q, a word set is acquired. For languages without word delimiters, such as Japanese, the search sentence q (search query) is segmented immediately to obtain the word set. If a word has been added or deleted relative to the immediately preceding state, the index word database DB2 is searched with each segmented word, the centroid vector of the words found in the database is computed, and this is used as the search query vector. Note that the search query vector (= query vector) is a superordinate concept of the query concept vector: a search query vector created using the word concept base is a query concept vector.

上記検索クエリベクトルを検索キーにして、文書データベースＤＢ１に対して、問合せする。文書データベースＤＢ１は、局所性検知可能ハッシングＨ１によって、高速に類似文書集合Ｄ２を返却可能であり、擬似的に類似文書集合Ｄ２を逐次的に利用者Ｕ１に返し、見せることができる。通常の逐次検索（インクリメンタルサーチ）とは異なり、検索結果の更新を文字単位ではなく、単語単位で検索結果を更新することによって、サーバの負荷を下げることができる。 The document database DB1 is queried using the search query vector as the search key. Thanks to the locality-detectable hashing H1, the document database DB1 can return the similar document set D2 at high speed, so the similar document set D2 can be returned and shown to the user U1 in a pseudo-sequential manner. Unlike ordinary sequential search (incremental search), the search results are updated per word rather than per character, which reduces the load on the server.

［単語ベクトルのキャッシュを用いた逐次検索処理］ [Sequential search processing using word vector cache]
本発明の実施例３である逐次類似文書検索システム３００は、索引語単位で検索し、概念ベース法を用い、しかも、直前の重心ベクトルと単語とを記憶し、差分のみによって、次の重心ベクトルを更新する実施例である。 The sequential similar document search system 300 according to the third embodiment of the present invention is an embodiment that searches in index word units, uses the concept-based method, stores the immediately preceding centroid vector and words, and updates the next centroid vector using only the difference.

次に、逐次類似文書検索システム３００の動作について説明する。 Next, the operation of the sequential similar document search system 300 will be described.

図７は、本発明の実施例３である逐次類似文書検索システム３００において、ＬＳＨ索引を用いた逐次類似文書検索において、概念ベースを用い、単語概念ベクトルのキャッシュを利用した処理の概要を示す図である。 FIG. 7 is a diagram showing an overview of processing using a concept base and using a word concept vector cache in sequential similar document search using an LSH index in the sequential similar document search system 300 according to the third embodiment of the present invention. It is.

利用者Ｕ１が検索文ｑを入力し、リアルタイムに分かち書きを行い、検索文ｑ中に、直前の検索文ｑからの差分があるかどうかを検出する。新規追加または削除があった単語のみを、索引語データベースＤＢ２を用いて問い合わせ、直前の検索文ｑと合わせて用いて、クエリ概念ベクトルを構築する。検索文ｑの集合は、次回の問い合わせに備えて保存する。ＬＳＨ索引を用いて近似類似検索を実行し、類似文書集合Ｄ２を結果として得る。最後に、利用者Ｕ１に結果リストを逐次更新を行いながら表示する。 The user U1 inputs the search sentence q, which is segmented into words in real time, and it is detected whether the search sentence q differs from the immediately preceding search sentence q. Only the words that have been newly added or deleted are looked up in the index word database DB2 and combined with the immediately preceding search sentence q to construct the query concept vector. The set of search sentences q is saved for the next query. An approximate similarity search is performed using the LSH index, and a similar document set D2 is obtained as a result. Finally, the result list is displayed to the user U1 while being sequentially updated.

図８は、実施例３の動作を示すフローチャートである。 FIG. 8 is a flowchart illustrating the operation of the third embodiment.

通常の検索時には、利用者Ｕ１は検索文ｑを追記していくことが多い。また、本技術が想定する、日記記入テキストエリアやＱ＆Ａサイトにおける質問文作成テキストエリア等、フリーフォームのテキスト編集環境では、ある程度の長い文書を追記する。そこで、逐次類似文書検索する直前の重心ベクトルと単語を記憶し、Ｓ２１で、更新前の分かち書き結果から取得した索引語集合と、更新後の分かち書き結果から取得した索引語集合との差分を検出し、この検出された差分のみについて検索することによって、単語概念ベースへの問合せ回数を減らして重心ベクトルを更新する。差分Δは、分かち書きした単語集合の差分であり、追加された単語の集合δ₊と削除された単語の集合δ₋とからなる。この場合、更新後の重心ベクトルＧ_ｑ’は以下の式（２）によって、更新することができる。 In a typical search, the user U1 often keeps appending to the search sentence q. Moreover, in the free-form text editing environments assumed by the present technique, such as a diary entry text area or a question composition text area on a Q&A site, fairly long text is appended. Therefore, the centroid vector and the words immediately before the sequential similar document search are stored; in S21, the difference between the index word set obtained from the segmentation result before the update and the index word set obtained from the segmentation result after the update is detected, and only the detected difference is looked up, so that the centroid vector is updated with fewer queries to the word concept base. The difference Δ is the difference between the segmented word sets and consists of the set of added words δ₊ and the set of deleted words δ₋. In this case, the updated centroid vector G_q' can be obtained by the following equation (2).

\[ G_{q'} = \frac{|V_q|\, G_q + \sum_{w \in \delta_{+}} v_w - \sum_{w \in \delta_{-}} v_w}{|V_q| + |\delta_{+}| - |\delta_{-}|} \tag{2} \]

｜Ｖ_ｑ｜が十分に大きければ（つまり、直前の検索文ｑが長ければ）、単語概念ベースの検索回数を大幅に削減することができ、検索速度が向上する。なお、上記式（２）において、Ｖ_ｑ、δ₊、δ₋、はいずれも集合を示し、これらの絶対値記号は、当該集合に含まれている要素の個数を示す。 If |V_q| is sufficiently large (that is, if the immediately preceding search sentence q is long), the number of lookups in the word concept base can be greatly reduced, and the search speed improves. In equation (2), V_q, δ₊, and δ₋ all denote sets, the absolute value signs denote the number of elements in each set, and v_w denotes the word concept vector of word w.
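The differential update of the centroid can be sketched as below. The formula used in the code (previous sum plus added vectors minus removed vectors, divided by the new word count) is an assumption that follows from the definition of a centroid; the function name `update_centroid` and the toy vectors are hypothetical.

```python
def update_centroid(g, n, added, removed):
    """Update a centroid without revisiting the unchanged word vectors.

    g       -- centroid of the previous n word concept vectors
    added   -- concept vectors of newly added words
    removed -- concept vectors of deleted words
    Returns (new_centroid, new_count).
    """
    n_new = n + len(added) - len(removed)
    if n_new <= 0:
        raise ValueError("no words left in the query")
    total = [x * n for x in g]              # recover the previous vector sum
    for v in added:
        total = [t + x for t, x in zip(total, v)]
    for v in removed:
        total = [t - x for t, x in zip(total, v)]
    return [t / n_new for t in total], n_new


# Previous query held vectors a and b; the edit removes a and adds c.
a, b, c = [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]
g_prev = [0.5, 0.5]                          # centroid of {a, b}
g_next, n_next = update_centroid(g_prev, 2, added=[c], removed=[a])
```

Only the vectors for the changed words are fetched, so the cost of the update depends on the size of the difference rather than on the length of the whole query.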

上記各実施例は、コンピュータで使用可能なソフトウェアとして実施できる。プログラムは、ハードディスク、ＣＤ−ＲＯＭ、光記憶装置または磁気記憶装置等の任意のコンピュータ可読媒体に記憶できる。 Each of the above embodiments can be implemented as software usable on a computer. The program can be stored in any computer-readable medium such as a hard disk, CD-ROM, optical storage device, or magnetic storage device.

実施例３によれば、直前の重心ベクトルと単語とを記憶し、直前の重心ベクトルと現在の重心ベクトルとの差分のみに応じて、次の重心ベクトルを更新するので、語の意味合いを考慮した逐次類似文書検索が高速で実行される。 According to the third embodiment, the immediately preceding centroid vector and words are stored, and the next centroid vector is updated using only the difference between the immediately preceding centroid vector and the current one, so a sequential similar document search that takes the meaning of words into account is executed at high speed.

上記実施例によれば、従来では不可能であった逐次類似文書検索を実現することができる。 According to the above-described embodiment, it is possible to realize a sequential similar document search that has been impossible in the past.

実施例１では、文字単位ではなく、単語単位で検索を行うことによって、検索精度およびレスポンスタイムが低下せずに、データベースへの検索回数を下げて逐次類似文書検索を実現できる。また、実施例２では、概念ベース法を用いることによって、語の持つ豊かな意味合いを考慮した逐次類似文書検索を実現できる。さらに、実施例３では、索引語境界を検索する直前の重心ベクトルと単語集合を記憶して次の検索文ｑへの差分についてのみデータベースへ検索することによって、より高速に逐次類似文書検索を実現できる。 In the first embodiment, searching per word rather than per character makes it possible to realize sequential similar document search with fewer searches against the database, without degrading search accuracy or response time. In the second embodiment, using the concept-based method makes it possible to realize sequential similar document search that takes the rich meaning of words into account. Furthermore, in the third embodiment, the centroid vector and word set immediately before searching for an index word boundary are stored, and the database is searched only for the difference from the next search sentence q, so that sequential similar document search can be realized at higher speed.

提案技術の特徴として、長い自然文を検索文ｑとする検索が高速に実現可能である。これによって、フリーフォームによる文書編集環境での逐次類似文書検索が可能になる。たとえば、「教えてｇｏｏ」に代表されるＱ＆Ａサイトにおいて、質問者が新しい質問文を作っている最中に、質問文を追記するにつれ、逐次的な検索によって似た質問文を即座に提示することができる。このように、似た質問を直ちに検索するので、その似た質問を行なうことを控えるであろうから、似た質問が繰り返されることを回避し、質問者・サイト運営者の双方にとってメリットが得られる。 A feature of the proposed technique is that a search using a long natural sentence as the search sentence q can be performed at high speed. This enables sequential similar document search in free-form document editing environments. For example, on a Q&A site such as "教えてｇｏｏ" (Oshiete goo), while a questioner is composing a new question, similar questions can be presented immediately by sequential search as the question text is appended. Because similar questions are found right away, the questioner is likely to refrain from posting a similar question, which avoids repeated similar questions and benefits both the questioner and the site operator.

また、上記各実施例における手段を工程に変更すれば、上記実施例を方法の発明として把握することができる。つまり、上記実施例は、逐次的類似文書検索手段が、類似文書を逐次的に検索し、記憶手段に記憶する逐次的類似文書検索段階と、上記逐次的類似文書検索段階で検索された検索結果を更新する更新段階とを有することを特徴とする逐次類似文書検索方法の例である。 Further, if the means in each of the above embodiments is changed to a process, the above embodiment can be grasped as a method invention. That is, in the above embodiment, the sequential similar document search means sequentially searches for similar documents and stores them in the storage means, and the search results searched in the sequential similar document search stage. It is an example of the sequential similar document search method characterized by having the update step which updates.

この場合、検索語境界検索手段が、検索文の索引語境界を検出し、記憶手段に記憶する検索語境界検索段階を有し、上記逐次的類似文書検索段階は、上記索引語境界検出段階で上記検索文の索引語境界が検出される度に、索引語単位で逐次的に類似文書を検索する段階である。また、上記索引語境界検出段階は、局所性検知可能ハッシングを利用して、検索文の索引語境界を検出する段階である。さらに、上記索引語境界検出段階は、概念ベース法による類似文書検索を実現する段階である。 In this case, the search word boundary search means has a search word boundary search stage that detects an index word boundary of the search sentence and stores it in the storage means, and the sequential similar document search stage includes the index word boundary detection stage. This is a step of sequentially searching for similar documents in units of index words each time an index word boundary of the search sentence is detected. The index word boundary detection step is a step of detecting an index word boundary of a search sentence using locality detectable hashing. Further, the index word boundary detection step is a step of realizing a similar document search by a concept-based method.

また、上記実施例をプログラムの発明として把握することができる。つまり、上記逐次類似文書検索方法をコンピュータに実行させるプログラムを想定することができる。そして、このプログラムを、半導体メモリ、ＣＤ、ＤＶＤ、磁気ディスク、光磁気ディスク、ＨＤ等、コンピュータ読み取り可能な記録媒体に記録するようにしてもよい。 The above embodiments can also be understood as a program invention. That is, a program that causes a computer to execute the above sequential similar document search method can be envisaged. This program may be recorded on a computer-readable recording medium such as a semiconductor memory, CD, DVD, magnetic disk, magneto-optical disk, or HD.

１００、２００、３００…逐次類似文書検索システム、
１０…類似文書検索アプリケーション、
１６…類似文書検索エンジン、
１８…制御手段、
１８１…ＬＳＨ構築手段、
１８２…問合せ処理手段、
ＤＢ１…文書データベース、
ＤＢ２…索引語データベース、
Ｄ１…元の文書集合、
Ｄ２…類似文書集合。
100, 200, 300 ... Sequential similar document search system,
10 ... Similar document search application,
16 ... Similar document search engine,
18 ... Control means,
181 ... LSH construction means,
182 ... Inquiry processing means,
DB1 ... Document database,
DB2 ... Index word database,
D1 ... Original document set,
D2 ... Similar document set.

Claims

A sequential similar document search means for sequentially searching similar documents;
Updating means for updating a search result searched by the sequential similar document search means;
A sequential similar document search apparatus characterized by comprising:

In claim 1,
Having an index word boundary detecting means for detecting an index word boundary of a search sentence;
The sequential similar document search device, wherein the sequential similar document search means is means for sequentially searching for similar documents in units of index words each time the index word boundary detecting means detects an index word boundary of the search sentence.

In claim 2,
The sequential similar document search device, wherein the index word boundary detecting means is means for detecting an index word boundary of a search sentence using locality detectable hashing.

In claim 2,
The sequential similar document search device, wherein the index word boundary detection means is means for realizing similar document search by a concept-based method.

In claim 4,
Storage means for storing a centroid vector of the search sentence and a word vector in the search sentence immediately before detecting an index word boundary of the search sentence;
The index word boundary detecting means is a means for searching only index words newly added and deleted from the immediately preceding state,
The sequential similar document search device, wherein the updating means is means for updating the centroid vector.

A sequential similar document search stage in which sequential similar document search means sequentially searches for similar documents and stores them in storage means;
An update stage for updating the search results searched in the sequential similar document search stage;
A sequential similar document search method characterized by comprising:

In claim 6,
Having an index word boundary detection step in which index word boundary detecting means detects an index word boundary of the search sentence and stores it in the storage means,
The sequential similar document search method, wherein the sequential similar document search step is a step of sequentially searching for similar documents in units of index words each time an index word boundary of the search sentence is detected in the index word boundary detection step.

In claim 7,
The sequential similar document search method, wherein the index word boundary detection step is a step of detecting an index word boundary of a search sentence using locality detectable hashing.

In claim 7,
The sequential similar document search method, wherein the index word boundary detection step is a step of realizing a similar document search by a concept-based method.

A program for causing a computer to execute the sequential similar document search method according to any one of claims 6 to 9.