JP3861529B2 - Document search method

Info

Publication number: JP3861529B2
Application number: JP29760499A
Authority: JP
Inventors: 靖彦稲場; 勝己多田; 菅谷　　奈津子; 忠孝松林; 明彦山口; 靖司川下
Original assignee: Hitachi Ltd
Current assignee: Hitachi Ltd
Priority date: 1999-10-20
Filing date: 1999-10-20
Publication date: 2006-12-20
Anticipated expiration: 2019-10-20
Also published as: JP2001117937A

Description

【０００１】
【発明の属する技術分野】
本発明は、検索条件に基づいて文書データベースから文書を検索する方法および装置に関し、その検索の結果として得られた文書に対してユーザが評価を与え、その評価に基づき検索条件を変更する方法および装置に関する。
【０００２】
【従来の技術】
近年、パーソナルコンピュータやインターネット等の普及に伴い、電子化文書が急激に増加している。このような状況において、ユーザが所望する情報を含んだ文書を高速かつ効率的に検索したいという要求が高まってきている。
【０００３】
このような要求に応えるための検索技術としてレリバンスフィードバックとよばれる技術がある。この技術は、全文検索や類似文書検索による検索結果に対して、ユーザが「所望の文書である」か「所望の文書でない」かなどの評価をシステムに入力し、その評価情報を検索条件に反映させることにより、その後の検索結果を改善する技術である。
【０００４】
具体的な処理の内容としては、例えば「"Information Retrieval",William B.Frakes / Ricardo Baeza-Yates, Prentice Hall PTR, 1992 pp.241〜263」に示されるように、ユーザが所望であると評価した文書から抽出した単語に関する検索条件中の重みを加算し、所望でないと評価した文書から抽出された単語に関する検索条件中の重みを減算する方法がある。以下この技術を従来技術１と呼ぶ。検索条件中のある単語について、具体的な重みの加減算の方法の例を式１に示す。
【０００５】
【数１】

【０００６】
ここでＷ'はその単語の新たな重み、Ｗは元の重みであり、ＦＰ（ｉ）は所望であると評価されたｉ番目の文書におけるその単語の出現回数、ＦＮ（ｊ）は所望でないと評価されたｊ番目の文書におけるその単語の出現回数である。また、Ｐは所望であると評価された文書の数、Ｎは所望でないと評価された文書の数である。なお、α、βはパラメータである。ここで、この新たな重みＷ'は負になってもよく、そのような場合は、その単語が含まれる文書は類似度が下がることになる。
【０００７】
この従来技術１によるレリバンスフィードバック処理の例を図２に示す。本図に示す例は、ユーザが「高校野球」に関する文書を所望する場合に、「サッカーに続き高校野球が開幕した」という文書を種文書に選んだ場合である。その後、「サッカー」に関するノイズ文書に対し「所望でない」と評価をして、システムに入力した場合である。この結果、本図に示すように「サッカー」という単語の重みが下がり、以後「サッカー」に関する文書の類似度を下げることができる。
【０００８】
【発明が解決しようとする課題】
しかし、従来技術１による方式では、ユーザが「所望のものでない」といった評価をしたときに検索結果が改善しない場合がある。この問題を図３を用いて説明する。本図に示した例は、「高校野球」に関する文書を所望する場合に、「高校サッカーが開幕した・・・」といったノイズ文書に対し「所望の文書でない」と評価した場合である。このとき従来技術１によれば、このノイズ文書から「高校」「サッカー」「開幕」といった単語を抽出し、検索条件中のそれぞれの単語の重みを減算することになる。この場合、「サッカー」の重みを減算するだけでなく、「高校」という単語の重みまでも減算してしまう。その結果、更新された検索条件によって検索を行なうと、「高校野球」に関する文書の類似度が、「プロ野球」「社会人野球」といった文書の類似度よりも低くなってしまうという問題がある。
【０００９】
このように、従来の方法によりユーザが「所望のものでない」と評価した文書から抽出した単語の重みを単純に減算すると、ユーザが所望とする概念を表す単語の重みまで減算してしまい、検索結果が改善しないという問題がある。
【００１０】
本発明の目的は、ユーザが「所望のものでない」といった評価を与えた文書から抽出した情報のうち適切なものを使用して、検索結果を改善することにある。
【００１１】
【課題を解決するための手段】
上記課題を解決するため、第１の手段として、
文字列に付与された重みを含む検索条件により文書データベースを検索し、該検索により得られた文書に対してユーザが入力した「所望である」または「所望でない」の評価を受け取り、上記検索の結果得られた文書から抽出した文字列の重みを上記評価に基づき変更して検索する文書検索方法において、
上記「所望である」と評価した文書から抽出した第一の文字列に正の重みを付与し、
上記「所望でない」と評価した文書から抽出した第二の文字列に負の重みを付与し、
第二の文字列のうち上記第一の文字列と一致するとともに当該第一の文字列の重みが所定値以上のものを除外したものとその重みおよび上記第一の文字列とその重みとを含む検索条件を生成して検索する。
【００１２】
この方法により、ユーザが所望のものと評価した文書から抽出した所望の内容を特徴付ける文字列に付与された負の重みにより検索精度を下げてしまうという課題を改善することができる。
【００１３】
また、第２の手段は、
文字列に付与された重みを含む検索条件により文書データベースを検索し、該検索により得られた文書に対してユーザが入力した「所望である」または「所望でない」の評価を受け取り、上記検索の結果得られた文書から抽出した文字列の重みを上記評価に基づき変更して検索する文書検索方法において、
上記「所望である」と評価した文書から第一の文字列を抽出し、上記「所望でない」と評価した文書から抽出した文字列で上記第一の文字列と一致する場合は、当該第一の文字列の重みが所定値以下の場合は上記抽出した文字列を第二の文字列として抽出し、第二の文字列の重みを第一の文字列の重みよりも低くし、一致しない場合は上記抽出した文字列を第二の文字列として抽出し、第二の文字列の重みを第一の文字列の重みよりも低くする。
【００１４】
この方法により、ユーザが所望のものと評価した文書から抽出した所望の内容を特徴付ける文字列に、負の重みを付与してしまい以降の検索精度を下げてしまうという課題を改善できる。
【００１５】
【発明の実施の形態】
以下、本発明の第一の実施例について説明する。
【００１６】
まず、本発明の第一の実施例のシステム構成を図１に示す。本実施例におけるシステムは、ディスプレイ１００、キーボード１０１、中央演算処理装置（ＣＰＵ）１０２、磁気ディスク装置１０５、フロッピディスクドライブ（ＦＤＤ）１０６、主メモリ１０９およびこれらを結ぶバス１０８から構成される。
【００１７】
磁気ディスク装置１０５は二次記憶装置の一つであり、テキスト１０３、出現頻度ファイル１０４が格納される。ＦＤＤ１０６を介してフロッピディスク１０７に格納されている情報が、主メモリ１０９あるいは磁気ディスク装置１０５へ読み込まれる。
【００１８】
主メモリ１０９には、システム制御プログラム１１０、文書登録プログラム１１１、検索制御プログラム１１２が格納される。検索制御プログラム１１２は、検索条件生成プログラム１１３、類似文書検索プログラム１１４、検索結果文書内容表示プログラム１１５、検索条件修正制御プログラム１１６、およびプロファイル重み調整プログラム１１９で構成される。ここで、検索条件修正制御プログラム１１６は、プロファイル更新プログラム１１７、および検索使用文字列選択プログラム１１８で構成される。
【００１９】
また、正のプロファイル１２０、負のプロファイル１２１、総合プロファイル１２２、種文書保存エリア１２３、登録文書保存エリア１２４、特徴文字列保存エリア１２５、および表示用文書保存エリア１２６が同じく主メモリ１０９に確保される。
【００２０】
ここで、正のプロファイル１２０、負のプロファイル１２１、総合プロファイル１２２とは後述する図１５に示すように、いずれも幾つかの検索文字列とその重みを保持したデータである。正のプロファイル１２０には、ユーザが所望であると評価した文書から抽出した文字列が格納される。負のプロファイル１２１には、ユーザが所望のものでないと評価した文書から抽出した文字列が格納される。総合プロファイル１２２は、正負のプロファイルから選択された検索に用いる文字列が格納される。
【００２１】
以下に、第一の実施例における、各プログラムの処理手順について説明する。
【００２２】
まず、システム制御プログラム１１０の処理手順について図４のＰＡＤ（ＰｒｏｂｌｅｍＡｎａｌｙｓｉｓＤｉａｇｒａｍ）図を用いて説明する。
【００２３】
システム制御プログラム１１０は、まずステップ４０１においてユーザがキーボードから入力したコマンドを解析する。
【００２４】
次にステップ４０２において、このコマンドが文書登録のコマンドであると解析された場合には、ステップ４０４で文書登録プログラム１１１を起動して文書の登録を行なう。
【００２５】
またステップ４０３において、検索実行のコマンドであると解析された場合には、ステップ４０５で検索制御プログラム１１２を起動して文書の検索を行なう。
【００２６】
以上が、システム制御プログラム１１０の処理手順である。
【００２７】
次に、図４に示したステップ４０４でシステム制御プログラムにより起動される、文書登録プログラム１１１について図５のＰＡＤ図を用いて説明する。
【００２８】
文書登録プログラム１１１は、まずステップ５０１においてＦＤＤ１０６に挿入されたフロッピディスク１０７から登録すべき文書データを読み込み、これをテキスト１０３として磁気ディスク装置１０５に格納する。文書データは、フロッピディスク１０７を用いて入力するだけに限らず、通信回線やＣＤ−ＲＯＭ装置（図１には示していない）等を用いて他の装置から入力するような構成を取ることも可能である。
【００２９】
次にステップ５０２で、検索対象文書から抽出される自立語の可能性がある文字列（以下、特徴文字列と呼ぶ）がどの文書に何回出現したかを高速に抽出するためのデータとして、出現頻度ファイル１０４を各登録対象文書について生成する。ここで出現頻度ファイルの生成方法としては「特開平１１−１４３９０２号公報」に開示されている出現頻度ファイルの生成方法と同一の方法でも良いし、形態素解析等を用いて各文書中の単語を抽出する方法やニューラルネットワークの学習データを用いた方法でもかまわない。また、単純ｎ−ｇｒａｍを抽出する方法であってもかまわない。
【００３０】
以上が、文書登録プログラム１１１の処理手順である。
次に、図４に示したステップ４０５でシステム制御プログラムにより起動される、検索制御プログラム１１２の処理手順を図６のＰＡＤ図を用いて説明する。
【００３１】
検索制御プログラム１１２は、まずステップ６０１において検索条件生成プログラム１１３を起動し、検索条件を生成する。
【００３２】
次にステップ６０２において、ステップ６０３〜ステップ６１２の処理を、ステップ６０４においてユーザから検索セッションの終了が要求されたと解析されるまで繰り返す。
【００３３】
この繰り返し処理では、まずステップ６０３において、類似文書検索プログラム１１４を起動し、ステップ６０１で生成された検索条件にもとづき類似文書検索を行なう。
【００３４】
次にステップ６０４において、キーボードから入力されるコマンドを解析する。
【００３５】
次にステップ６０５において、このコマンドが文書の内容表示コマンドであると解析された場合には、ステップ６０９で検索結果文書内容表示プログラム１１５を起動し、指定された検索結果文書の内容を表示する。
【００３６】
次にステップ６０６において、検索結果文書に対するユーザの評価の入力コマンドであると解析された場合には、ステップ６１０で検索条件修正制御プログラム１１６を起動し、検索条件を修正する。
【００３７】
次にステップ６０７において、プロファイルの内容調整コマンドであると解析された場合には、ステップ６１１でプロファイル重み調整プログラム１１９を起動し、プロファイルの内容を調整する。
【００３８】
次にステップ６０８において、検索セッション終了コマンドであると解析された場合には、ステップ６１２で、正のプロファイル１２０、負のプロファイル１２１、および総合プロファイル１２２の内容をクリアし、ステップ６０２の繰り返しを終了する。
【００３９】
以上が検索制御プログラム１１２の処理手順である。
【００４０】
次に、図６に示したステップ６０１で検索制御プログラムにより起動される、検索条件生成プログラム１１３の処理手順を図７のＰＡＤ図を用いて説明する。
【００４１】
検索条件生成プログラム１１３は、まずステップ７０１において、キーボード１０１から入力される種文書を読み込み、種文書保存エリア１２３に格納する。
【００４２】
次にステップ７０２において、種文書保存エリア１２３に格納された種文書から特徴文字列を抽出し、種文書内出現回数を計数して、特徴文字列保存エリア１２５に格納する。
【００４３】
ここで、特徴文字列を抽出する方法は、図５に示した文書登録プログラム１１１のステップ５０２における方法を用いても良いし、その他の方法を用いても良い。
【００４４】
次にステップ７０３において、ステップ７０２で抽出した特徴文字列をステップ７０２で計数した出現回数と共に総合プロファイル１２２に書き込む。ここで総合プロファイル１２２は、後述する図１５に示すように特徴文字列とその重みが保持されたものであり、後述するように類似文書検索プログラム１１４の入力として使用する。ここで重みとしては種文書内出現回数を用いるものとするが、他のものを用いても良い。また、ここで総合プロファイル１２２に書き込む文字列は、ステップ７０２で抽出した特徴文字列のうち重みの上位から所定数のものに限定しても良い。
【００４５】
次にステップ７０４において、ステップ７０２で抽出した文字列をステップ７０２で計数した出現回数と共に正のプロファイル１２０に書き込む。この正のプロファイル１２０は、後述するように、検索結果文書に対しユーザが評価をした場合に、検索条件を修正する際に使用する。また、ここで正のプロファイル１２０に書き込む文字列は、ステップ７０２で抽出した特徴文字列のうち重みの上位のもの所定数に限定しても良い。
【００４６】
以上が、検索条件生成プログラム１１３の処理手順である。
【００４７】
次に、図６に示したステップ６０３で検索制御プログラムにより起動される、類似文書検索プログラム１１４の処理手順を図８のＰＡＤ図を用いて説明する。
【００４８】
類似文書検索プログラム１１４は、まずステップ８０１において、図７に示したステップ７０３で検索条件生成プログラム１１３により生成された総合プロファイル１２２を読み込む。
【００４９】
次にステップ８０２において、出現頻度ファイル１０４を読み込む。
【００５０】
次にステップ８０３において、総合プロファイル１２２内の特徴文字列の重みと、出現頻度ファイル１０４内の各文書における該文字列の出現頻度から、テキスト１０３内の各文書の類似度を算出する。ここで類似度の算出式としては、例えば以下の式２のようなものを用いる。
【００５１】
【数２】

【００５２】
この式で、Ｓ（Ｄ）はテキスト１０３内の文書番号Ｄの類似度であり、Ｆｒｑ（ｉ）は出現頻度ファイル１０４内の単語ｉの文書Ｄにおける出現頻度であり、ｗ（ｉ）は総合プロファイル内の単語ｉの重みである。ここで類似度算出式としては、これ以外のものを用いても構わない。
【００５３】
次にステップ８０４において、テキスト１０３内の各文書の文書番号を類似度の順に降順にソートし、ディスプレイ１００に出力する。ここで、類似度の上位所定件のみを出力するようにしても良いし、所定の類似度を上回るもののみを出力するようにしても良い。また、文書にタイトルのような属性があればそれを出力しても良い。
【００５４】
以上が、類似文書検索プログラム１１４の処理手順である。
【００５５】
次に、図６に示したステップ６０９で検索制御プログラムにより起動される、検索結果文書内容表示プログラム１１５の処理手順を図９のＰＡＤ図を用いて説明する。
【００５６】
検索結果文書内容表示プログラム１１５は、まずステップ９０１において、ユーザがキーボード１０１から入力する文書番号を読み込む。
【００５７】
次にステップ９０２において、ステップ９０１で入力された文書番号に該当する文書を登録文書保存エリア１２４に読み込む。
【００５８】
次にステップ９０３において、ステップ９０４で該文書を最後まで読み込むまで以下に示すステップ９０４からステップ９０７の処理を繰り返す。
【００５９】
ステップ９０３の繰り返し処理では、まずステップ９０４において、登録文書保存エリア１２４の文書の文字列を順次読み込み、総合プロファイル１２２に格納された文字列と照合する。
【００６０】
次にステップ９０５において、ステップ９０４で読み込んだ文字列が総合プロファイル１２２において正の重みを持つ文字列と一致した場合には、ステップ９０８で「該文字列を赤色表示する」という情報を付与して表示用文書保存エリア１２６に追加する。ここで例えばＨＴＭＬ（ＨｙｐｅｒＴｅｘｔＭａｒｋｕｐＬａｎｇｕａｇｅ）の形式で表示する場合は、該文字列の前後に赤色表示を表すタグを挿入し、表示用文書保存エリア１２６に追加する。ここで、重みが所定値以下の文字列や、重みの上位所定件に含まれないものは、この処理の対象外にするなどしても構わない。また、表示色は別の色を用いても構わない。
【００６１】
次にステップ９０６において、ステップ９０４で読み込んだ文字列が総合プロファイル１２２において負の重みを持つ文字列と一致した場合には、ステップ９０９で「該文字列を青色表示する」という情報を付与して表示用文書保存エリア１２６に追加する。ここで例えばＨＴＭＬの形式で表示する場合は、該文字列の前後に青色表示を表すタグを挿入し、表示用文書保存エリア１２６に追加する。ここで、重みが所定値以下の文字列や、重みの上位所定件に含まれないものは、この処理の対象外にするなどしても構わない。また、表示色はステップ９０８で指定する色以外の別の色を用いても構わない。
【００６２】
次にステップ９０７において、ステップ９０４で読み込んだ文字列が総合プロファイル内の文字列と一致しない場合には、ステップ９１０で「該文字列を黒色表示する」という情報を付与して表示用文書保存エリア１２６に追加する。ここで例えばＨＴＭＬの形式で表示する場合は、該文字列の前後に黒色表示を表すタグを挿入し、表示用文書保存エリア１２６に追加する。ここで、表示色はステップ９０８、９０９で指定する以外の別の色を用いても構わない。
【００６３】
次にステップ９１１において、表示用文書保存エリア１２６に保存された内容をディスプレイ１００に表示する。
【００６４】
以上が、検索結果文書内容表示プログラム１１５の処理手順である。
【００６５】
次に、図６に示したステップ６１０で検索制御プログラムにより起動される、検索条件修正制御プログラム１１６の処理手順を図１０のＰＡＤ図を用いて説明する。
【００６６】
検索条件修正制御プログラム１１６は、まずステップ１００１においてプロファイル更新プログラム１１７を起動し、正のプロファイル１２０および負のプロファイル１２１の内容を更新する。
【００６７】
次にステップ１００２において、検索使用文字列選択プログラム１１８を起動し、ステップ１００１で更新された正のプロファイル１２０および負のプロファイル１２１の内容にもとづき、総合プロファイル１２２の内容を更新する。
【００６８】
以上が検索条件修正制御プログラム１１６の処理手順である。
【００６９】
次に、図６に示したステップ６１１で検索制御プログラムにより起動される、プロファイル重み調整プログラム１１９の処理手順を図１１のＰＡＤ図を用いて説明する。
【００７０】
プロファイル重み調整プログラム１１９は、まずステップ１１０１において、正のプロファイル１２０に格納された文字列とその重みを一覧表示する。
【００７１】
次にステップ１１０２において、負のプロファイル１２１に格納された文字列とその重みを一覧表示する。
【００７２】
次にステップ１１０３において、ユーザがキーボード１０１により入力した、ユーザが重みを変更したい文字列、またはいずれかのプロファイルに追加したい文字列と、その重みを取得する。ここで、正のプロファイルにある文字列に負の重みを付与しようとした場合や、負のプロファイルにある文字列に正の重みを付与しようとした場合には、ユーザへの警告を出力するようにする等しても良い。
【００７３】
次にステップ１１０４において、ステップ１１０３で取得したとおりに正のプロファイル１２０または負のプロファイル１２１の内容を変更する。
【００７４】
以上が、プロファイル重み調整プログラム１１９の処理手順である。
【００７５】
ここで、図１２にプロファイル重み調整プログラム１１９により、ユーザがプロファイルを調整する際にディスプレイ１００に表示する入力画面の例を示す。正のプロファイル１２０の内容が１２０１に、負のプロファイル１２１の内容が１２０２に表示される。それぞれスクロールバー１２０３および１２０４を用いて、全ての内容を表示させることも可能である。ユーザがテキストボックス１２０５に重みを変更したい文字列、またはいずれかのプロファイルに追加したい文字列を入力し、重みを１２０６に入力して送信ボタン１２０７を押下する。ここで、重みを変更したい文字列はテキストボックス１２０５に入力する形ではなく、表示される一覧の中からラジオボタン等により選択する形にしても良い。
【００７６】
次に、図１０に示したステップ１００１で検索条件修正制御プログラム１１６により起動される、プロファイル更新プログラム１１７の処理手順を図１３のＰＡＤ図を用いて説明する。
【００７７】
プロファイル更新プログラム１１７は、まずステップ１３０１において、ユーザがキーボード１０１により入力した文書番号と、その文書番号の文書に対するユーザの評価（「所望のものであった」あるいは「所望のものでなかった」等の評価）を読み込む。
【００７８】
次にステップ１３０２において、ステップ１３０１で読み込んだ文書番号に該当する文書を、テキスト１０３から登録文書保存エリア１２４に読み込む。
【００７９】
次にステップ１３０３において、登録文書保存エリア１２４に格納された文書から特徴文字列を抽出し、該文書内出現回数を出現頻度ファイル１０４を参照することにより抽出し、共に特徴文字列保存エリア１２５に格納する。ここで、特徴文字列の抽出方法としては前掲の「特開平１１−１４３９０２号公報」による方法を用いても良いし、形態素解析やニューラルネットワークによる学習データなどを用いる方法でもかまわない。
【００８０】
次にステップ１３０４において、ステップ１３０１で読み込んだユーザの評価が正の評価であった場合には、ステップ１３０６において、特徴文字列保存エリア１２５内の文字列の出現回数を正のプロファイルの該当文字列の重みに加算する。このとき、正のプロファイル１２０に無い文字列の場合には、ステップ１３０３で読み込んだ出現回数を重みとして付与し、該文字列を正のプロファイル１２０に追加する。
【００８１】
次にステップ１３０５において、ステップ１３０１で読み込んだユーザの評価が負の評価であった場合には、ステップ１３０７において、特徴文字列保存エリア１２５内の文字列の出現回数を負のプロファイルの該当文字列の重みから減算する。このとき、負のプロファイル１２１に無い文字列の場合には、ステップ１３０３で読み込んだ出現回数の負値を重みとして付与し、該文字列を負のプロファイル１２１に追加する。
【００８２】
ここでステップ１３０６、１３０７において重みの加減算の方法は、ユーザの評価により調整しても良い。例えばステップ１３０６において、ユーザが「所望のものである」という評価をした場合には、その文書内の特徴文字列の出現回数を、そのまま正のプロファイル１２０の該文字列の重みに足し、「やや所望のものである」という評価をした場合には、その文書内の特徴文字列の出現回数の半数を、正のプロファイル１２０の該文字列の重みに足す、などといった方法にしても良い。また、ステップ１３０６およびステップ１３０７で重みを加減算する特徴文字列は、ステップ１３０３において抽出した出現回数の上位所定数に限定しても構わない。
【００８３】
以上が、プロファイル更新プログラム１１７の処理手順である。
【００８４】
次に、図１０に示したステップ１００２において検索条件修正制御プログラム１１６により起動される、検索使用文字列選択プログラム１１８の処理手順を図１４のＰＡＤ図を用いて説明する。
【００８５】
検索使用文字列選択プログラム１１８は、まずステップ１４０１において、総合プロファイル１２２の内容をクリアする。
【００８６】
次にステップ１４０２において、正のプロファイル１２０の中の特徴文字列のうち重みの上位所定件を抽出し、その重みと共に総合プロファイル１２２に追加する。
【００８７】
次にステップ１４０３において、負のプロファイル１２１の中の特徴文字列のうち、重みの絶対値の上位所定件のもので、かつ正のプロファイル１２０の中の特徴文字列の重みの上位所定件に含まれないものを、総合プロファイル１２２に追加する。
【００８８】
ここでステップ１４０２、ステップ１４０３で使用する所定件数はそれぞれ異なった値でも良い。
【００８９】
以上が検索使用文字列選択プログラム１１８の処理手順である。
【００９０】
以上が、本実施例における各プログラムの処理手順である。
【００９１】
以下、本実施例において検索結果文書に対しユーザが負の評価をした場合の、検索条件の修正および再検索処理の流れを、図１５を用いて説明する。
【００９２】
本図においては、ユーザが「高校野球」に関する文書を検索したいものとし、最初に種文書に指定した「サッカーに続き、高校野球が開幕した…」という文書１５０１から抽出された「サッカー」「高校」「野球」「開幕」という文字列１５０２が検索条件生成プログラム１１３により、正のプロファイル１２０に登録されているものとする。
【００９３】
ここで、「高校サッカーが開幕した・・・」という検索結果文書１５０３に対して負の評価をした場合を想定する。
【００９４】
まず、出現頻度ファイル１０４に格納された出現頻度情報のうち、ユーザが負の評価をした「高校サッカーが開幕した・・・」という文書１５０３から特徴文字列１５０４を抽出し、それぞれの特徴文字列の文書１５０３内の出現頻度とともに特徴文字列保存エリア１２５に読み込む。本図の例では、「高校」、「サッカー」、「開幕」、・・・という文字列とその出現頻度を読み込む。
【００９５】
次に、特徴文字列保存エリア１２５の文字列のうち負のプロファイル１２１にある文字列についてはその重みを減算し、負のプロファイル１２１に無い文字列については、その出現回数の負の数を重みとして負のプロファイル１２１に登録する。本図の例では、「高校」、「サッカー」、「開幕」、…という文字列にそれぞれ重み「−４」、「−４」、「−１」、…を付与して負のプロファイル１２１に追加する。
【００９６】
次に、正のプロファイル１２０の文字列のうち重みの上位所定数のもの１５０５と、負のプロファイル１２１のうち重みの下位所定数１５０６に含まれ、かつ正のプロファイル１２０の文字列のうち上位所定数のもの１５０７に含まれないものを、総合プロファイル１２２に登録する。本図に示した例では、正のプロファイル１２０から「高校」と「野球」、負のプロファイル１２１から「サッカー」という文字列を選択し、総合プロファイル１２２に追加する。
【００９７】
検索時には、この総合プロファイル１２２の文字列とその重みにより検索を行なう。本図に示した例では、負のプロファイル中の「高校」という文字列に関する重み値−４は検索に使用されないことになる。このことにより、「高校サッカー」の文書に負の評価をしても、「高校」という文字列の重みが下がらないため、「高校野球」よりも「プロ野球」の文書に高い類似度が算出されてしまうといった問題を防ぐことができる。
【００９８】
以上が、検索結果文書に対しユーザが負の評価をした場合の、検索条件の修正および再検索処理の流れである。
【００９９】
以上示したように本実施例によれば、ユーザが「所望のものでない」と評価した文書から抽出された文字列のうち、ユーザが「所望のものである」と評価した文書から抽出された文字列を、重みを下げる対象から除外する形態をとる。そのため、ユーザの所望ではない概念を表す文字列のみの重みを適切に減算することができる。したがって、ユーザが「所望のものでない」と評価した文書から抽出した文字列の重みを単純に減算すると、ユーザの所望の概念を表す文字列の重みまで減算してしまい、検索結果が改善しない、といった問題を解決できる。
【０１００】
また、本実施例によれば、検索結果文書の内容を表示する際、検索条件データに保存されている文字列の重みの正負により文字列を別の形式でハイライト表示する形態をとる。
【０１０１】
この方法により、ユーザは、検索結果文書がどの程度所望の内容を示しているかを視覚的に容易に判断できる。また、正の重みが付与された文字列や負の重みが付与された文字列として、どのようなものが所望文書やノイズ文書に含まれているかを見ることにより、次回以降のプロファイルの調整に役立てることができるようになる。
【０１０２】
また、本実施例によれば、検索条件データの中の文字列のうち検索に用いる文字列をユーザが選択、あるいはそれぞれの文字列の重みをユーザが調整する形態をとる。
【０１０３】
この方法により、ユーザの所望する内容を特徴付けるものでないものを、検索に使用することを防ぐことができ、適切な検索結果を得られるようになる。
【０１０４】
図１３に示したプロファイル更新プログラムの処理においては、ユーザが負の評価をした際に、評価対象文書から抽出した文字列を負のプロファイル１２１に追加した後、総合プロファイル１２２に追加する文字列を選択する形態をとっている。ここで図１６に示すように、評価対象文書から抽出した文字列のうち、負のプロファイル１２１に追加する文字列を選択する形態をとっても良い。
【０１０５】
すなわち、図１６のステップ１３０５において、ステップ１３０１で読み込んだユーザの評価が負の評価であった場合には、ステップ１３０７を実行する前に図１６に示すプロファイル更新用文字列選択ステップ１６０１を実行しても良い。ここでプロファイル更新用文字列選択ステップ１６０１は、特徴文字列保存エリア１２５の文字列のうち、正のプロファイル１２０中の重みの上位のものに含まれるものを、特徴文字列保存エリア１２５からクリアするステップである。これにより、正のプロファイル１２０に追加されているユーザの所望の概念を表す文字列に、負の重みを付与し負のプロファイル１２１に追加してしまうことを防ぐことができる。
【０１０６】
以下、本発明の第二の実施例について説明する。
【０１０７】
第一の実施例においては、検索時に使用する文字列、または検索条件の修正時にプロファイルに追加する文字列をシステムが自動的に選択する。したがって、検索結果文書に対するユーザの評価が不適切な場合には、検索精度が向上しないという問題がある。
【０１０８】
以上の問題を解決するために、本発明の第二の実施例では、ユーザが正または負の評価をした文書から抽出される文字列を一覧表示し、正の重みまたは負の重みを付与する文字列をユーザが選択する手段を提供するものである。
【０１０９】
本実施例は図１に示す第一の実施例とほぼ同様の構成をとる。ここで図１７に示すように検索条件修正制御プログラム１１６ａはプロファイル更新用文字列ユーザ選択プログラム１７０１、プロファイル更新プログラム１１７ａ、および検索使用文字列選択プログラム１１８により構成される。また、図１８に示すようにプロファイル更新プログラム１１７ａの処理手順が、第一の実施例におけるプロファイル更新プログラム１１７と異なる。
【０１１０】
以下、第二の実施例における、プロファイル更新プログラム１１７ａの処理手順について図１８のＰＡＤ図を用いて説明する。
【０１１１】
プロファイル更新プログラム１１７ａは、まずステップ１８０１において、ユーザがキーボード１０１により入力した文書番号と、その文書番号の文書に対するユーザの評価（「所望のものであった」あるいは「所望のものでなかった」等の評価）を読み込む。
【０１１２】
次にステップ１８０２において、ステップ１８０１で読み込んだ文書番号に該当する文書を、テキスト１０３から登録文書保存エリア１２４に読み込む。
【０１１３】
次にステップ１８０３において、登録文書保存エリア１２４に格納された文書から特徴文字列を抽出し、該文書内出現回数を出現頻度ファイル１０４を参照することにより抽出し、共に特徴文字列保存エリア１２５に格納する。ここで、特徴文字列の抽出方法としては前掲の「特開平１１−１４３９０２号公報」による方法を用いても良いし、形態素解析やニューラルネットワークによる学習データなどを用いる方法でもかまわない。
【０１１４】
次にステップ１８０４において、プロファイル更新用文字列ユーザ選択プログラム１７０１を起動し、ステップ１８０３において読み込んだ文字列のうちユーザが選択しなかった文字列を、特徴文字列保存エリア１２５からクリアする。
【０１１５】
次にステップ１８０５において、ステップ１８０１で読み込んだユーザの評価が正の評価であった場合には、ステップ１８０７において、特徴文字列保存エリア１２５の文字列の出現回数を正のプロファイルの該当文字列の重みに加算する。このとき、正のプロファイル１２０に無い文字列の場合には、ステップ１８０３で読み込んだ出現回数を重みとして付与し、該文字列を正のプロファイル１２０に追加する。
【０１１６】
次にステップ１８０６において、ステップ１８０１で読み込んだユーザの評価が負の評価であった場合には、ステップ１８０８において、特徴文字列保存エリア１２５の文字列の出現回数を負のプロファイルの該当文字列の重みから減算する。このとき、負のプロファイル１２１に無い文字列の場合には、ステップ１８０３で読み込んだ出現回数の負値を重みとして付与し、該文字列を負のプロファイル１２１に追加する。
【０１１７】
ここでステップ１８０７、１８０８において重みの加減算の方法は、ユーザの評価により調整しても良い。例えばステップ１８０７において、ユーザが「所望のものである」という評価をした場合には、その文書内の特徴文字列の出現回数を、そのまま正のプロファイル１２０の該文字列の重みに足し、「やや所望のものである」という評価をした場合には、その文書内の特徴文字列の出現回数の半数を、正のプロファイル１２０の該文字列の重みに足す、などといった方法にしても良い。また、ステップ１８０７およびステップ１８０８で重みを加減算する特徴文字列は、ステップ１８０３において抽出した出現回数の上位所定数に限定しても構わない。
【０１１８】
以上が、プロファイル更新プログラム１１７ａの処理手順である。
【０１１９】
次に図１８に示したステップ１８０４でプロファイル更新プログラム１１７ａにより起動される、プロファイル更新用文字列ユーザ選択プログラム１７０１の処理手順を、図１９のＰＡＤ図を用いて説明する。
【０１２０】
まずステップ１９０１において、特徴文字列保存エリア１２５内の特徴文字列を一覧表示する。
【０１２１】
次にステップ１９０２において、ステップ１９０１で表示した文字列のうち、ユーザが選択しなかった文字列を取得し、該文字列の情報を特徴文字列保存エリア１２５からクリアする。
【０１２２】
以上がプロファイル更新用文字列ユーザ選択プログラム１７０１の処理手順である。
【０１２３】
ここで、プロファイル更新用文字列ユーザ選択プログラム１７０１により、ユーザがプロファイルに追加したい文字列を選択する画面の例を図２０に示す。ウィンドウ２００１に、ユーザが評価した文書から抽出される特徴文字列がチェックボックスと共に表示される。特徴文字列が多数ある場合はスクロールバー２００２を用いてすべての文字列をウィンドウ２００１内で参照することができる。ユーザは、ウィンドウ２００１内の文字列のうち、プロファイルに追加したい文字列のチェックボックスをチェックし、送信ボタン２００３を押下する。
【０１２４】
なお、文字列の選択方法は図２０の例のようにチェックボックスを用いたものでも良いし、各文字列に識別番号を付与して識別番号と共に一覧表示するようにし、文字列の識別番号により選択する方法でも良い。
【０１２５】
以下、本実施例において検索結果テキストに対しユーザが負の評価をした場合の、検索条件の修正および再検索処理の流れを、図２１を用いて説明する。
【０１２６】
本図においては、ユーザが「高校野球」に関するテキストを検索したいものとし、最初に種文書に指定した「サッカーに続き、高校野球が開幕した…」というテキスト２１０１から抽出された「サッカー」「高校」「野球」「開幕」という文字列２１０２が検索条件生成プログラム１１３により、正のプロファイル１２０に登録されているものとする。
【０１２７】
ここで、「高校サッカーの１回戦が・・・」という検索結果テキストに対して負の評価をした場合を想定する。
【０１２８】
まず、出現頻度ファイル１０４に格納された出現頻度情報のうち、ユーザが負の評価をした「高校サッカーの１回戦が・・・」という文書２１０３から特徴文字列２１０４を抽出し、それぞれの特徴文字列の文書２１０３内の出現頻度とともに特徴文字列保存エリア１２５に読み込む。本図の例では、「高校」、「サッカー」、「１回戦」、・・・という文字列とその出現頻度が読み込まれる。
【０１２９】
次に、前述した図２０の画面でユーザが選択しなかった文字列の情報を、特徴文字列保存エリア１２５からクリアする。本図の例では、ユーザが「高校野球」に関するテキストを所望しており、「サッカー」に関するテキストは所望ではない。したがってユーザは「サッカー」という文字列のみに負の重みを加えると指定するものとする。このとき、特徴文字列保存エリア１２５から、「高校」および「１回戦」という文字列とその重みをクリアする。
【０１３０】
次に、出現頻度情報２１０４のうち負のプロファイル１２１にある文字列についてはその重みを減算し、負のプロファイル１２１に無い文字列については、その出現回数の負の数を重みとして負のプロファイル１２１に登録する。本図の例では、「サッカー」という文字列に重み「−４」を付与して負のプロファイル１２１に追加する。
【０１３１】
次に、正のプロファイル１２０の文字列のうち重みの上位所定数のもの２１０５と、負のプロファイル１２１のうち重みの下位所定数２１０６に含まれ、かつ正のプロファイル１２０の文字列のうち上位所定数のもの２１０７に含まれないものを、総合プロファイル１２２に登録する。検索時には、この総合プロファイル１２２の文字列とその重みにより検索を行なう。
【０１３２】
以上のように、本図に示した例では、「高校サッカーの１回戦が…」というテキストに負の評価をしても、「高校」という文字列の重みが下がらないため、「高校野球」よりも「プロ野球」のテキストに高い類似度が算出されてしまうといった問題を防ぐことができる。また、正のプロファイル１２０に無い「１回戦」という文字列の重みがさがらないため、「高校野球の１回戦」といったユーザが所望するテキストの類似度が下がってしまうといった問題を防ぐことができる。
【０１３３】
以上が、検索結果テキストに対しユーザが負の評価をした場合の、検索条件の修正および再検索処理の流れである。
【０１３４】
なお、本実施例において検索結果文書に対しユーザが正の評価をした場合にも同様に、正のプロファイルに追加する文字列を選択することができる。したがって、正の評価をした文書から抽出されるがユーザの概念を表す文字列ではない文字列に、正の重みを付与してしまうことを防ぐことができる。
【０１３５】
以上が、本発明の第二の実施例である。
【０１３６】
以上示したように本実施例によれば、ユーザが「所望のものでない」と評価した文書から抽出された文字列のうち、ユーザが所望する概念を表す文字列をユーザが指定することにより、該文字列を重みを下げる対象から除外する形態をとる。そのため、ユーザの所望ではない概念を表す文字列のみの重みを適切に減算することができる。したがって、ユーザが「所望のものでない」と評価した文書から抽出した文字列の重みを単純に減算すると、ユーザの所望の概念を表す文字列の重みまで減算してしまい、検索結果が改善しない、といった問題を解決できる。
【０１３７】
また、ユーザが「所望のものである」と評価した文書から抽出された文字列のうち、ユーザが所望する概念を表さない文字列をユーザが指定することにより、該文字列を重みを上げる対象から除外する形態をとる。そのため、ユーザの所望する概念を表す文字列のみの重みを適切に加算することができる。したがって、ユーザが「所望のものである」と評価した文書から抽出した文字列の重みを単純に加算すると、ユーザの所望の概念を表さない文字列の重みまで加算してしまい、検索結果が改善しない、といった問題を解決できる。
【０１３８】
なお、第一、第二の実施例において、ひとつの検索結果文書に対しユーザが評価を入力し、その評価を反映した検索結果を出力するようにしたが、複数の検索結果文書に対しそれぞれ異なった評価を一度に入力し、それらの評価を反映した検索結果を出力するようにしても構わない。
【０１３９】
また、第一、第二の実施例において、最初に種文書を設定し、その種文書に類似した内容を持つ文書を検索するものとしたが、最初にキーワードを設定する全文検索を行なう形式にしても良い。その場合には、図７に示した検索条件生成プログラム１１３のステップ７０２、７０３のかわりに、入力したキーワードを所定の重みを付与して正のプロファイル１２０、および総合プロファイル１２２に追加すれば良い。
【０１４０】
本実施例によれば、ユーザの所望の概念を表す単語の重みを減算しないため、ユーザが「所望のものでない」といった評価を与えた検索結果文書から抽出した情報をもとに検索結果を改善することができる。
【０１４１】
【発明の効果】
本発明によれば、ユーザが「所望のものでない」といった評価を与えた文書から抽出した情報のうち適切なものを使用して、検索結果を改善することができる。
【図面の簡単な説明】
【図１】本発明の第一の実施例の構成を示す図である。
【図２】従来技術によるレリバンスフィードバック処理の例を示す図である。
【図３】従来技術によるレリバンスフィードバック処理により検索結果が改善しない例を示す図である。
【図４】本発明の第一の実施例におけるシステム制御プログラム１１０の処理手順を示すＰＡＤ図である。
【図５】本発明の第一の実施例における文書登録プログラム１１１の処理手順を示すＰＡＤ図である。
【図６】本発明の第一の実施例における検索制御プログラム１１２の処理手順を示すＰＡＤ図である。
【図７】本発明の第一の実施例における検索条件生成プログラム１１３の処理手順を示すＰＡＤ図である。
【図８】本発明の第一の実施例における類似文書検索プログラム１１４の処理手順を示すＰＡＤ図である。
【図９】本発明の第一の実施例における検索結果文書内容表示プログラム１１５の処理手順を示すＰＡＤ図である。
【図１０】本発明の第一の実施例における検索条件修正制御プログラム１１６の処理手順を示すＰＡＤ図である。
【図１１】本発明の第一の実施例におけるプロファイル重み調整プログラム１１９の処理手順を示すＰＡＤ図である。
【図１２】本発明の第一の実施例において、ユーザがプロファイルを調整する際にディスプレイ１００に表示する入力画面の例を示す図である。
【図１３】本発明の第一の実施例におけるプロファイル更新プログラム１１７の処理手順を示すＰＡＤ図である。
【図１４】本発明の第一の実施例における検索使用文字列選択プログラム１１８の処理手順を示すＰＡＤ図である。
【図１５】本発明の第一の実施例において、検索結果文書に対しユーザが負の評価をした場合の、検索条件の修正および再検索処理の流れを示す図である。
【図１６】本発明の第一の実施例におけるプロファイル更新プログラム１１７の処理の一形態を示すＰＡＤ図である。
【図１７】本発明の第二の実施例における検索条件修正制御プログラム１１６ａの構成を示す図である。
【図１８】本発明の第二の実施例におけるプロファイル更新プログラム１１７ａの処理手順を示すＰＡＤ図である。
【図１９】本発明の第二の実施例におけるプロファイル更新用文字列ユーザ選択プログラム１７０１の処理手順を示すＰＡＤ図である。
【図２０】本発明の第二の実施例において、ユーザがプロファイルに追加したい文字列を選択する画面の例を示す図である。
【図２１】本発明の第二の実施例において、検索結果文書に対しユーザが負の評価をした場合の、検索条件の修正および再検索処理の流れを示す図である。
【符号の説明】
１００ディスプレイ
１０１キーボード
１０２中央演算処理装置（ＣＰＵ）
１０３テキスト
１０４出現頻度ファイル
１０５磁気ディスク装置
１０６フロッピディスクドライブ（ＦＤＤ）
１０７フロッピディスク
１０８バス
１０９主メモリ
１１０システム制御プログラム
１１１文書登録プログラム
１１２検索制御プログラム
１１３検索条件生成プログラム
１１４類似文書検索プログラム
１１５検索結果文書内容表示プログラム
１１６検索条件修正制御プログラム
１１７プロファイル更新プログラム
１１８検索使用文字列選択プログラム
１１９プロファイル重み調整プログラム
１２０正のプロファイル
１２１負のプロファイル
１２２総合プロファイル
１２３種文書保存エリア
１２４登録文書保存エリア
１２５特徴文字列保存エリア
１２６表示文書保存エリア
[0001]
[Technical field of the invention]
The present invention relates to a method and an apparatus for searching a document database for documents based on a search condition, and in particular to a method and an apparatus in which a user evaluates the documents obtained as search results and the search condition is changed based on that evaluation.
[0002]
[Prior art]
In recent years, with the spread of personal computers, the Internet, and the like, the number of electronic documents has been increasing rapidly. Under these circumstances, there is a growing demand to search quickly and efficiently for documents containing the information a user wants.
[0003]
One search technique that addresses this demand is called relevance feedback. In this technique, the user inputs to the system an evaluation such as “desired document” or “not a desired document” for the results of a full-text search or a similar document search, and the evaluation information is reflected in the search condition to improve subsequent search results.
[0004]
As a concrete example of this processing, as shown in “Information Retrieval”, William B. Frakes / Ricardo Baeza-Yates, Prentice Hall PTR, 1992, pp. 241-263, there is a method that increases the weights, in the search condition, of words extracted from documents the user evaluated as desired, and decreases the weights of words extracted from documents evaluated as not desired. Hereinafter, this technique is referred to as Conventional Technique 1. Expression 1 shows an example of how the weight of a given word in the search condition is increased or decreased.
[0005]
[Expression 1]

[0006]
Here, W' is the new weight of the word and W is its original weight. FP(i) is the number of occurrences of the word in the i-th document evaluated as desired, and FN(j) is the number of occurrences of the word in the j-th document evaluated as not desired. P is the number of documents evaluated as desired, and N is the number of documents evaluated as not desired. α and β are parameters. The new weight W' may be negative, in which case documents containing the word receive a lower similarity.
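The equation image for Expression 1 is not reproduced in this text. One form consistent with the symbol definitions above is sketched below. It is an assumption rather than the exact published notation; in particular, the text does not show whether the two sums are normalized by P and N.

```latex
W' = W + \alpha \sum_{i=1}^{P} FP(i) - \beta \sum_{j=1}^{N} FN(j)
```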
[0007]
An example of relevance feedback processing according to Conventional Technique 1 is shown in FIG. 2. In this example, a user who wants documents about “high school baseball” selects the document “Following soccer, high school baseball has opened” as the seed document, then evaluates a noise document about “soccer” as “not desired” and inputs that evaluation to the system. As a result, as shown in the figure, the weight of the word “soccer” decreases, and the similarity of documents about “soccer” can be lowered from then on.
[0008]
[Problems to be solved by the invention]
However, with the method of Conventional Technique 1, the search results may fail to improve when the user gives a “not desired” evaluation. This problem is described with reference to FIG. 3. In the example shown in the figure, a user who wants documents about “high school baseball” evaluates a noise document such as “High school soccer has opened ...” as “not a desired document”. According to Conventional Technique 1, words such as “high school”, “soccer”, and “opening” are extracted from this noise document, and the weight of each of these words in the search condition is reduced. In this case, not only the weight of “soccer” but also the weight of the word “high school” is reduced. As a result, when a search is performed with the updated search condition, documents about “high school baseball” receive a lower similarity than documents about, for example, “professional baseball” or “corporate league baseball”.
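The failure described above can be reproduced with a small sketch. The update rule, the weighted-sum similarity, the function names, and all weights and term lists below are illustrative assumptions, not values taken from FIG. 3.

```python
from collections import Counter


def naive_feedback(weights, doc_terms, desired, alpha=1.0, beta=1.0):
    """Return updated weights after one evaluated document (assumed update rule)."""
    updated = dict(weights)
    for term, count in Counter(doc_terms).items():
        # Add counts for a desired document, subtract them for an undesired one.
        delta = alpha * count if desired else -beta * count
        updated[term] = updated.get(term, 0.0) + delta
    return updated


def similarity(weights, doc_terms):
    """Weighted sum of term counts; an assumed scoring rule for illustration."""
    counts = Counter(doc_terms)
    return sum(weight * counts[term] for term, weight in weights.items())


# Hypothetical weights derived from a seed document about high school baseball.
query = {"soccer": 1.0, "high school": 1.0, "baseball": 1.0, "opening": 1.0}
# Hypothetical noise document about high school soccer, evaluated as "not desired".
noise_doc = ["high school", "high school", "soccer", "opening"]
updated = naive_feedback(query, noise_doc, desired=False)

high_school_baseball = ["high school", "baseball"]
professional_baseball = ["professional", "baseball"]
print(similarity(updated, high_school_baseball))
print(similarity(updated, professional_baseball))
```

With these assumed numbers, the weight of “high school” becomes negative, so the high school baseball document scores below the professional baseball document even though it is the one the user wants.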
[0009]
Thus, if the weights of words extracted from a document the user evaluated as “not desired” are simply reduced as in the conventional method, the weights of words representing the concept the user wants are reduced as well, and the search results do not improve.
[0010]
An object of the present invention is to improve search results by using the appropriate parts of the information extracted from documents that the user has evaluated as “not desired”.
[0011]
[Means for Solving the Problems]
To solve the above problem, a first means is as follows.
In a document search method that searches a document database with a search condition including weights assigned to character strings, receives a “desired” or “not desired” evaluation input by the user for a document obtained by the search, and searches again after changing, based on that evaluation, the weights of character strings extracted from the documents obtained by the search,
a positive weight is given to a first character string extracted from a document evaluated as “desired”,
a negative weight is given to a second character string extracted from a document evaluated as “not desired”, and
a search condition is generated and used for searching that includes the first character strings with their weights and the second character strings with their weights, excluding those second character strings that match a first character string whose weight is at least a predetermined value.
[0012]
This method mitigates the problem that search accuracy is lowered by negative weights given to character strings that characterize the desired content and were extracted from documents the user evaluated as desired.
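The first means can be sketched as a small filter over two weight tables. The function name, the representation of the search condition as a list of (string, weight) pairs, and the example weights and threshold are assumptions made for illustration only.

```python
def build_search_condition(positive, negative, threshold):
    """Combine positive and negative strings into one search condition.

    positive: dict mapping string -> positive weight
    negative: dict mapping string -> negative weight
    threshold: the "predetermined value" for the positive weight (assumed input)
    """
    condition = list(positive.items())
    for term, weight in negative.items():
        # Drop a negative string when it matches a positive string whose
        # weight is at least the threshold; such a string is taken to
        # characterize what the user wants.
        if term in positive and positive[term] >= threshold:
            continue
        condition.append((term, weight))
    return condition


# Hypothetical weights for the high school baseball example.
positive = {"high school": 3, "baseball": 3, "soccer": 1, "opening": 1}
negative = {"high school": -4, "soccer": -4, "opening": -1}
condition = build_search_condition(positive, negative, threshold=2)
print(condition)
```

In this sketch the negative entry for “high school” is left out of the condition, while the negative entries for “soccer” and “opening” are kept because their positive weights fall below the assumed threshold.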
[0013]
The second means is:
The document database is searched according to the search condition including the weight assigned to the character string, and the evaluation of “desired” or “not desired” input by the user with respect to the document obtained by the search is received. In a document search method for searching by changing the weight of a character string extracted from a document obtained as a result based on the above evaluation,
a first character string is extracted from a document evaluated as “desired”. For a character string extracted from a document evaluated as “not desired”: if it matches a first character string, it is extracted as a second character string only when the weight of that first character string is at or below a predetermined value, and the weight of the second character string is made lower than the weight of the first character string; if it does not match, it is extracted as a second character string, and the weight of the second character string is made lower than the weight of the first character string.
[0014]
This method mitigates the problem that a negative weight is given to a character string that characterizes the desired content and was extracted from a document the user evaluated as desired, which would lower the accuracy of subsequent searches.
[0015]
DETAILED DESCRIPTION OF THE INVENTION
The first embodiment of the present invention will be described below.
[0016]
First, the system configuration of the first embodiment of the present invention is shown in FIG. The system according to the present embodiment includes a display 100, a keyboard 101, a central processing unit (CPU) 102, a magnetic disk device 105, a floppy disk drive (FDD) 106, a main memory 109, and a bus 108 connecting them.
[0017]
The magnetic disk device 105 is one of secondary storage devices, and stores text 103 and an appearance frequency file 104. Information stored in the floppy disk 107 is read into the main memory 109 or the magnetic disk device 105 via the FDD 106.
[0018]
The main memory 109 stores a system control program 110, a document registration program 111, and a search control program 112. The search control program 112 includes a search condition generation program 113, a similar document search program 114, a search result document content display program 115, a search condition correction control program 116, and a profile weight adjustment program 119. Here, the search condition correction control program 116 includes a profile update program 117 and a search use character string selection program 118.
[0019]
A positive profile 120, a negative profile 121, a general profile 122, a seed document storage area 123, a registered document storage area 124, a characteristic character string storage area 125, and a display document storage area 126 are also allocated in the main memory 109.
[0020]
Here, the positive profile 120, the negative profile 121, and the general profile 122 are each data holding a number of search character strings and their weights, as shown in FIG. 15 described later. The positive profile 120 stores character strings extracted from documents the user evaluated as desired. The negative profile 121 stores character strings extracted from documents the user evaluated as not desired. The general profile 122 stores the character strings, selected from the positive and negative profiles, that are used for searching.
[0021]
The processing procedure of each program in the first embodiment will be described below.
[0022]
First, the processing procedure of the system control program 110 will be described using the PAD (Problem Analysis Diagram) of FIG. 4.
[0023]
The system control program 110 first analyzes a command input by the user from the keyboard in step 401.
[0024]
If it is determined in step 402 that this command is a document registration command, the document registration program 111 is activated in step 404 to register the document.
[0025]
If it is determined in step 403 that the command is a search execution command, the search control program 112 is activated in step 405 to search for a document.
[0026]
The processing procedure of the system control program 110 has been described above.
[0027]
Next, the document registration program 111, which is activated by the system control program in step 404 shown in FIG. 4, will be described with reference to the PAD of FIG. 5.
[0028]
In step 501, the document registration program 111 first reads the document data to be registered from the floppy disk 107 inserted in the FDD 106 and stores it in the magnetic disk device 105 as text 103. Document data need not be input from the floppy disk 107; it may also be input from another device via a communication line, a CD-ROM device (not shown in FIG. 1), or the like.
[0029]
Next, in step 502, an appearance frequency file 104 is generated for each document to be registered. This file serves as data for quickly determining in which documents, and how many times, each character string that may be an independent word (hereinafter referred to as a characteristic character string) extracted from the search target documents appears. The appearance frequency file may be generated by the same method as disclosed in Japanese Patent Application Laid-Open No. 11-143902, by a method that extracts the words in each document using morphological analysis or the like, or by a method using neural network training data. A method of extracting simple n-grams may also be used.
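Of the extraction options listed for step 502, the simple n-gram variant is the easiest to sketch. The in-memory dictionary below stands in for the appearance frequency file 104; the function names and the choice of n = 2 are assumptions for illustration.

```python
from collections import Counter


def extract_ngrams(text, n=2):
    """Return every contiguous run of n characters in the text."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def build_frequency_index(documents, n=2):
    """Map each document number to a Counter of its n-gram occurrence counts.

    documents: dict mapping document number -> document text
    """
    return {
        doc_id: Counter(extract_ngrams(text, n))
        for doc_id, text in documents.items()
    }


# Hypothetical registered documents.
index = build_frequency_index({1: "abab", 2: "abc"})
print(index[1]["ab"], index[2]["bc"])
```

Counting at registration time means a later search only needs to look up stored counts instead of rescanning the text of every document.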
[0030]
The processing procedure of the document registration program 111 has been described above.
Next, the processing procedure of the search control program 112, which is activated by the system control program in step 405 shown in FIG. 4, will be described using the PAD of FIG. 6.
[0031]
The search control program 112 first activates the search condition generation program 113 in step 601 to generate a search condition.
[0032]
Next, in step 602, the processing of steps 603 to 612 is repeated until the command analyzed in step 604 is found to be a request from the user to end the search session.
[0033]
In this iterative process, first, in step 603, the similar document search program 114 is activated, and a similar document search is performed based on the search conditions generated in step 601.
[0034]
Next, in step 604, a command input from the keyboard is analyzed.
[0035]
Next, when it is analyzed in step 605 that this command is a document content display command, the search result document content display program 115 is activated in step 609 to display the contents of the designated search result document.
[0036]
In step 606, if it is analyzed that the input command is a user evaluation for the search result document, the search condition correction control program 116 is activated in step 610 to correct the search condition.
[0037]
In step 607, if it is analyzed that the command is a profile content adjustment command, the profile weight adjustment program 119 is activated in step 611 to adjust the profile content.
[0038]
Next, when it is analyzed in step 608 that the command is a search session end command, the contents of the positive profile 120, the negative profile 121, and the general profile 122 are cleared in step 612, and the repetition of step 602 is ended.
[0039]
The processing procedure of the search control program 112 has been described above.
[0040]
Next, the processing procedure of the search condition generation program 113 activated by the search control program in step 601 shown in FIG. 6 will be described using the PAD diagram of FIG.
[0041]
First, in step 701, the search condition generation program 113 reads the seed document input from the keyboard 101 and stores it in the seed document storage area 123.
[0042]
Next, in step 702, a characteristic character string is extracted from the seed document stored in the seed document storage area 123, the number of appearances in the seed document is counted, and stored in the characteristic character string storage area 125.
[0043]
Here, as a method of extracting the characteristic character string, the method in step 502 of the document registration program 111 shown in FIG. 5 may be used, or another method may be used.
[0044]
Next, in step 703, the characteristic character string extracted in step 702 is written in the general profile 122 together with the number of appearances counted in step 702. Here, the general profile 122 holds a characteristic character string and its weight as shown in FIG. 15 described later, and is used as an input of the similar document search program 114 as described later. Here, the number of occurrences in the seed document is used as the weight, but other weights may be used. Further, the character string to be written to the general profile 122 may be limited to a predetermined number from the top of the weight among the characteristic character strings extracted in step 702.
[0045]
Next, in step 704, the character string extracted in step 702 is written in the positive profile 120 together with the number of appearances counted in step 702. As will be described later, the positive profile 120 is used when the search condition is corrected when the user evaluates the search result document. Further, the character string to be written in the positive profile 120 here may be limited to a predetermined number of higher-weighted character strings extracted in step 702.
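Steps 703 and 704 can be sketched as follows. This is a minimal illustration that assumes the characteristic character strings and their counts have already been extracted from the seed document (steps 701 and 702); the function name, the dictionary representation of the profiles, and the single shared `limit` parameter are assumptions, not part of the original description.

```python
from collections import Counter


def generate_search_condition(seed_counts, limit=None):
    """Build the general and positive profiles from seed-document counts.

    seed_counts maps each characteristic character string to its number of
    appearances in the seed document. limit optionally keeps only the
    highest-weighted strings; None keeps all of them.
    """
    ranked = Counter(seed_counts).most_common(limit)  # highest counts first
    general_profile = dict(ranked)    # step 703
    positive_profile = dict(ranked)   # step 704
    return general_profile, positive_profile
```

The two profiles start with the same content but are separate objects, because later evaluations update the positive profile while the general profile is rebuilt from it.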
[0046]
The processing procedure of the search condition generation program 113 has been described above.
[0047]
Next, the processing procedure of the similar document search program 114 started by the search control program in step 603 shown in FIG. 6 will be described using the PAD diagram of FIG.
[0048]
First, in step 801, the similar document search program 114 reads the general profile 122 generated by the search condition generation program 113 in step 703 shown in FIG.
[0049]
Next, in step 802, the appearance frequency file 104 is read.
[0050]
Next, in step 803, the similarity of each document in the text 103 is calculated from the weight of the characteristic character string in the general profile 122 and the appearance frequency of the character string in each document in the appearance frequency file 104. Here, as a formula for calculating the similarity, for example, the following formula 2 is used.
[0051]
[Expression 2]

[0052]
In this formula, S(D) is the similarity of the document with document number D in the text 103, Frq(i) is the appearance frequency of the word i in the document D as recorded in the appearance frequency file 104, and w(i) is the weight of the word i in the general profile. Other formulas may also be used to calculate the similarity.
[0053]
In step 804, the document numbers of the documents in the text 103 are sorted in descending order of similarity and output to the display 100. Here, only the upper predetermined number of similarities may be output, or only those exceeding a predetermined similarity may be output. If the document has an attribute such as a title, it may be output.
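Formula 2 itself is not reproduced in this text. The sketch below therefore assumes the simplest form consistent with the definitions of S(D), Frq(i), and w(i), namely a weighted sum S(D) = sum over i of w(i) × Frq(i); this form, the function names, and the sample data are assumptions for illustration only. The second function corresponds to the sorting and optional truncation of step 804.

```python
def similarity(general_profile, doc_frequencies):
    """Weighted sum of appearance frequencies (assumed form of formula 2)."""
    return sum(weight * doc_frequencies.get(string, 0)
               for string, weight in general_profile.items())


def rank_documents(general_profile, appearance_frequency, top_n=None):
    """Return (document number, similarity) pairs in descending similarity."""
    scored = [(doc_no, similarity(general_profile, freqs))
              for doc_no, freqs in appearance_frequency.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored if top_n is None else scored[:top_n]


# Hypothetical profile and appearance frequencies.
profile = {"high school": 1, "baseball": 1, "soccer": -4}
frequencies = {
    1: {"high school": 2, "baseball": 3},
    2: {"high school": 2, "soccer": 2},
    3: {"baseball": 2},
}
ranking = rank_documents(profile, frequencies)
```

With a negative weight on "soccer", document 2 falls to the bottom of the ranking even though it contains "high school".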
[0054]
The processing procedure of the similar document search program 114 has been described above.
[0055]
Next, the processing procedure of the search result document content display program 115 started by the search control program in step 609 shown in FIG. 6 will be described with reference to the PAD diagram of FIG.
[0056]
The search result document content display program 115 first reads a document number input by the user from the keyboard 101 in step 901.
[0057]
In step 902, the document corresponding to the document number input in step 901 is read into the registered document storage area 124.
[0058]
Next, in step 903, the following processing from step 904 to step 907 is repeated until the document is read to the end in step 904.
[0059]
In the repetitive processing in step 903, first, in step 904, the character string of the document in the registered document storage area 124 is sequentially read and collated with the character string stored in the general profile 122.
[0060]
Next, in step 905, if the character string read in step 904 matches a character string having a positive weight in the general profile 122, then in step 908 the information “display the character string in red” is attached to the character string, and it is added to the display document storage area 126. For example, when displaying in the HTML (HyperText Markup Language) format, tags indicating red display are inserted before and after the character string, and it is added to the display document storage area 126. A character string whose weight is equal to or less than a predetermined value, or one that is not within the top predetermined number by weight, may be excluded from this processing. A different display color may also be used.
[0061]
Next, in step 906, if the character string read in step 904 matches a character string having a negative weight in the general profile 122, then in step 909 the information “display the character string in blue” is attached to the character string, and it is added to the display document storage area 126. For example, when displaying in the HTML format, tags indicating blue display are inserted before and after the character string, and it is added to the display document storage area 126. A character string whose weight is equal to or less than a predetermined value, or one that is not within the top predetermined number by weight, may be excluded from this processing. A display color other than the color specified in step 908 may also be used.
[0062]
Next, in step 907, if the character string read in step 904 does not match any character string in the general profile 122, then in step 910 the information “display the character string in black” is attached to the character string, and it is added to the display document storage area 126. For example, when displaying in the HTML format, tags indicating black display are inserted before and after the character string, and it is added to the display document storage area 126. A display color other than those specified in steps 908 and 909 may also be used.
[0063]
In step 911, the content stored in the display document storage area 126 is displayed on the display 100.
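The color-coding of steps 905 to 910 can be sketched for HTML output as follows. The use of `span` elements with inline styles, the longest-match-first rule, and the function names are assumptions chosen for illustration; the description only requires that some tag indicating the color be placed before and after each character string.

```python
import html
import re


def _span(colour, text):
    """Wrap text in a tag that requests the given display colour."""
    return f'<span style="color:{colour}">{html.escape(text)}</span>'


def render_highlighted(document, general_profile):
    """Red for positive-weight strings, blue for negative, black otherwise."""
    # Try longer profile strings first so they win over shorter overlaps.
    strings = sorted((s for s in general_profile if s), key=len, reverse=True)
    if not strings:
        return _span("black", document)
    pattern = re.compile("|".join(re.escape(s) for s in strings))
    parts, pos = [], 0
    for match in pattern.finditer(document):
        if match.start() > pos:  # text not in the profile (step 910)
            parts.append(_span("black", document[pos:match.start()]))
        weight = general_profile[match.group()]
        colour = "red" if weight > 0 else "blue" if weight < 0 else "black"
        parts.append(_span(colour, match.group()))  # steps 908 and 909
        pos = match.end()
    if pos < len(document):
        parts.append(_span("black", document[pos:]))
    return "".join(parts)
```

The returned string plays the role of the content accumulated in the display document storage area 126 before it is shown in step 911.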
[0064]
The processing procedure of the search result document content display program 115 has been described above.
[0065]
Next, the processing procedure of the search condition correction control program 116 started by the search control program in step 610 shown in FIG. 6 will be described using the PAD diagram of FIG.
[0066]
The search condition correction control program 116 first activates the profile update program 117 in step 1001 to update the contents of the positive profile 120 and the negative profile 121.
[0067]
Next, in step 1002, the search use character string selection program 118 is started, and the contents of the general profile 122 are updated based on the contents of the positive profile 120 and the negative profile 121 updated in step 1001.
[0068]
The processing procedure of the search condition correction control program 116 has been described above.
[0069]
Next, the processing procedure of the profile weight adjustment program 119 activated by the search control program in step 611 shown in FIG. 6 will be described using the PAD diagram of FIG.
[0070]
First, in step 1101, the profile weight adjustment program 119 displays a list of character strings stored in the positive profile 120 and their weights.
[0071]
In step 1102, a list of character strings stored in the negative profile 121 and their weights is displayed.
[0072]
Next, in step 1103, the character string that the user enters with the keyboard 101, either one whose weight the user wants to change or one to be added to one of the profiles, is acquired together with its weight. If a negative weight is given to a character string in the positive profile, or a positive weight is given to a character string in the negative profile, a warning may be output to the user.
[0073]
Next, in step 1104, the contents of the positive profile 120 or the negative profile 121 are changed as acquired in step 1103.
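Steps 1103 and 1104, including the optional warning, might look like the following sketch. The function names, the string labels "positive" and "negative", and the warning wording are hypothetical; the description does not specify how the warning is presented.

```python
def sign_warning(profile_name, weight):
    """Return a warning message when the weight's sign contradicts the profile."""
    if profile_name == "positive" and weight < 0:
        return "warning: negative weight given to a positive-profile string"
    if profile_name == "negative" and weight > 0:
        return "warning: positive weight given to a negative-profile string"
    return None


def adjust_profile(profile, profile_name, string, weight):
    """Change or add one entry as entered by the user (step 1104)."""
    warning = sign_warning(profile_name, weight)
    profile[string] = weight  # the entry is still applied; the warning is advisory
    return warning
```

In this sketch the change is applied even when a warning is produced; rejecting the input instead would be an equally plausible design.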
[0074]
The processing procedure of the profile weight adjustment program 119 has been described above.
[0075]
Here, FIG. 12 shows an example of an input screen displayed on the display 100 when the user adjusts the profile by the profile weight adjustment program 119. The content of the positive profile 120 is displayed in 1201, and the content of the negative profile 121 is displayed in 1202. It is also possible to display all contents using scroll bars 1203 and 1204, respectively. The user inputs a character string whose weight is to be changed in the text box 1205 or a character string to be added to any profile, inputs a weight in 1206, and presses a send button 1207. Here, the character string to be changed in weight may not be input to the text box 1205 but may be selected from a list displayed by a radio button or the like.
[0076]
Next, the processing procedure of the profile update program 117 activated by the search condition correction control program 116 in step 1001 shown in FIG. 10 will be described using the PAD diagram of FIG.
[0077]
In step 1301, the profile update program 117 first reads the document number input by the user with the keyboard 101 and the user's evaluation of the document with that document number (“desired” or “not desired”).
[0078]
In step 1302, the document corresponding to the document number read in step 1301 is read from the text 103 into the registered document storage area 124.
[0079]
Next, in step 1303, characteristic character strings are extracted from the document stored in the registered document storage area 124, their numbers of appearances in the document are obtained by referring to the appearance frequency file 104, and both are stored in the characteristic character string storage area 125. Here, as a method for extracting the characteristic character string, the method described in the above-mentioned “Japanese Laid-Open Patent Publication No. 11-143902” may be used, or a method using morphological analysis or learning data by a neural network may be used.
[0080]
In step 1304, if the user's evaluation read in step 1301 is positive, then in step 1306 the number of appearances of each character string in the characteristic character string storage area 125 is added to the weight of the corresponding character string in the positive profile 120. If the character string is not in the positive profile 120, the number of appearances read in step 1303 is assigned as its weight, and the character string is added to the positive profile 120.
[0081]
Next, in step 1305, if the user's evaluation read in step 1301 is negative, then in step 1307 the number of appearances of each character string in the characteristic character string storage area 125 is subtracted from the weight of the corresponding character string in the negative profile 121. If the character string is not in the negative profile 121, the negative of the number of appearances read in step 1303 is assigned as its weight, and the character string is added to the negative profile 121.
[0082]
Here, the weight addition and subtraction method in steps 1306 and 1307 may be adjusted according to the user's evaluation. For example, when the user gives the evaluation “desired” in step 1306, the number of appearances of the characteristic character string in the document is added as it is to the weight of the character string in the positive profile 120, whereas if the evaluation is a weaker grade of “desired”, half the number of appearances of the characteristic character string in the document may be added to the weight of the character string in the positive profile 120. In addition, the characteristic character strings whose weights are added or subtracted in steps 1306 and 1307 may be limited to the top predetermined number by number of appearances among those extracted in step 1303.
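The update of steps 1304 to 1307 can be sketched as follows. The dictionary representation of the profiles, the evaluation labels, and the `factor` parameter (used here to model a graded evaluation, for example 0.5 for adding half the count) are assumptions for illustration.

```python
def update_profiles(evaluation, doc_counts, positive_profile, negative_profile,
                    factor=1.0):
    """Apply one user evaluation to the profiles (steps 1304-1307).

    evaluation is "desired" or "not desired". doc_counts maps each
    characteristic character string of the evaluated document to its number
    of appearances. factor scales the counts for graded evaluations.
    """
    for string, count in doc_counts.items():
        if evaluation == "desired":
            # Step 1306: add the count; missing strings start from weight 0.
            positive_profile[string] = positive_profile.get(string, 0) + factor * count
        elif evaluation == "not desired":
            # Step 1307: subtract the count; missing strings start from weight 0.
            negative_profile[string] = negative_profile.get(string, 0) - factor * count
```

Starting a missing string from weight 0 gives the same result as registering it with the count (or its negative) as the initial weight, which is what the description specifies.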
[0083]
The processing procedure of the profile update program 117 has been described above.
[0084]
Next, the processing procedure of the search use character string selection program 118 started by the search condition correction control program 116 in step 1002 shown in FIG. 10 will be described with reference to the PAD diagram of FIG.
[0085]
The search use character string selection program 118 first clears the contents of the general profile 122 in step 1401.
[0086]
Next, in step 1402, the top predetermined number of characteristic character strings by weight are extracted from the positive profile 120 and added to the general profile 122 together with their weights.
[0087]
Next, in step 1403, among the characteristic character strings in the negative profile 121, those that rank within the top predetermined number by absolute value of weight and that are not included in the top predetermined number by weight of the characteristic character strings in the positive profile 120 are added to the general profile 122.
[0088]
Here, the predetermined numbers used in step 1402 and step 1403 may be different values.
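Steps 1401 to 1403 can be sketched as follows. The function and parameter names are assumptions, and the positive-profile weights in the usage example are made up for illustration (the negative weights follow the example values −4, −4, −1 given for the first embodiment).

```python
def top_strings(profile, count, key):
    """Return the `count` strings of `profile` ranked highest by `key`."""
    return sorted(profile, key=key, reverse=True)[:count]


def select_search_strings(positive_profile, negative_profile,
                          positive_count, negative_count):
    """Rebuild the general profile from the two profiles (steps 1401-1403)."""
    general_profile = {}                                         # step 1401
    top_positive = top_strings(positive_profile, positive_count,
                               key=lambda s: positive_profile[s])
    for string in top_positive:                                  # step 1402
        general_profile[string] = positive_profile[string]
    top_negative = top_strings(negative_profile, negative_count,
                               key=lambda s: abs(negative_profile[s]))
    for string in top_negative:                                  # step 1403
        if string not in top_positive:
            general_profile[string] = negative_profile[string]
    return general_profile


# Hypothetical positive weights; negative weights as in the example of the text.
positive = {"high school": 2, "baseball": 2, "soccer": 1, "opening": 1}
negative = {"high school": -4, "soccer": -4, "opening": -1}
general = select_search_strings(positive, negative, 2, 2)
```

Because "high school" ranks in the top two of the positive profile, its weight of −4 in the negative profile is never copied into the general profile, while "soccer" is.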
[0089]
The processing procedure of the search use character string selection program 118 has been described above.
[0090]
The above is the processing procedure of each program in the present embodiment.
[0091]
The flow of search condition correction and re-search processing when the user negatively evaluates the search result document in this embodiment will be described below with reference to FIG.
[0092]
In this figure, it is assumed that the user wants to search for documents related to “high school baseball”, and that the character strings 1502 “soccer”, “high school”, “baseball”, and “opening”, extracted from the document 1501 “high school baseball started after soccer” first specified as the seed document, have been registered in the positive profile 120 by the search condition generation program 113.
[0093]
Here, it is assumed that a negative evaluation is made on the search result document 1503 “High school soccer has started ...”.
[0094]
First, the characteristic character strings 1504 of the document 1503 “high school soccer opened ...”, to which the user gave a negative evaluation, are extracted from the appearance frequency information stored in the appearance frequency file 104 and read into the characteristic character string storage area 125 together with their appearance frequencies in the document 1503. In the example of this figure, the character strings “high school”, “soccer”, “opening”, ... and their appearance frequencies are read.
[0095]
Next, for each character string in the characteristic character string storage area 125 that is already in the negative profile 121, the number of appearances is subtracted from its weight; each character string that is not in the negative profile 121 is registered in the negative profile 121 with the negative of its number of appearances as its weight. In the example of this figure, the weights “−4”, “−4”, “−1”, ... are assigned to the character strings “high school”, “soccer”, “opening”, ..., respectively, and they are added to the negative profile 121.
[0096]
Next, the following are registered in the general profile 122: the character strings 1505 that rank within the top predetermined number by weight in the positive profile 120, and those character strings that rank within the lowest predetermined number 1506 by weight in the negative profile 121 and are not included in the top predetermined number 1507 of character strings in the positive profile 120. In the example shown in the figure, the character strings “high school” and “baseball” are selected from the positive profile 120, and “soccer” is selected from the negative profile 121, and they are added to the general profile 122.
[0097]
At the time of search, the search is performed based on the character strings of the general profile 122 and their weights. In the example shown in the figure, the weight −4 of the character string “high school” in the negative profile is not used for the search. As a result, even if a negative evaluation is given to the “high school soccer” document, the character string “high school” does not lose weight, which prevents problems such as a higher similarity being calculated for a “professional baseball” document than for a “high school baseball” document.
[0098]
The above is the flow of search condition correction and re-search processing when the user negatively evaluates the search result document.
[0099]
As described above, according to this embodiment, among the character strings extracted from a document that the user evaluated as “not desired”, those that were also extracted from a document that the user evaluated as “desired” are excluded from the targets of weight reduction. Therefore, only the weights of character strings representing concepts not desired by the user are subtracted. This solves the problem that simply subtracting the weights of the character strings extracted from a document evaluated as “not desired” also subtracts the weights of character strings representing the concept the user desires, so that the search result does not improve.
[0100]
Further, according to the present embodiment, when displaying the contents of the search result document, the character string is highlighted in another format according to the weight of the character string stored in the search condition data.
[0101]
By this method, the user can easily judge visually how closely the search result document matches the desired content. In addition, by checking which character strings with positive weights or negative weights appear in a desired document or a noise document, the user can make use of this information when adjusting the profile from the next search onward.
[0102]
Further, according to the present embodiment, the user selects a character string to be used for the search among the character strings in the search condition data, or the user adjusts the weight of each character string.
[0103]
By this method, character strings that do not characterize the content desired by the user can be kept out of the search, and an appropriate search result can be obtained.
[0104]
In the processing of the profile update program shown in FIG. 13, when the user makes a negative evaluation, the character strings extracted from the evaluation target document are first added to the negative profile 121, and the character strings to be added to the general profile 122 are then selected. Alternatively, as illustrated in FIG. 16, the character strings to be added to the negative profile 121 may be selected from the character strings extracted from the evaluation target document.
[0105]
That is, in FIG. 16, if the user's evaluation read in step 1301 is found in step 1305 to be negative, the profile update character string selection step 1601 shown in FIG. 16 may be executed before step 1307. The profile update character string selection step 1601 clears from the characteristic character string storage area 125 those character strings that rank among the higher weights in the positive profile 120. This prevents a character string that represents the user's desired concept and has been added to the positive profile 120 from being given a negative weight and added to the negative profile 121.
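The filtering performed by step 1601 can be sketched as follows. The function name, the `top_count` parameter, and the sample weights are assumptions for illustration; the description says only that strings ranked among the higher weights in the positive profile 120 are cleared.

```python
def drop_top_positive_strings(doc_counts, positive_profile, top_count):
    """Step 1601 sketch: remove strings ranked high in the positive profile."""
    top_positive = set(sorted(positive_profile, key=positive_profile.get,
                              reverse=True)[:top_count])
    return {string: count for string, count in doc_counts.items()
            if string not in top_positive}


# Hypothetical contents of the characteristic character string storage area 125
# and of the positive profile 120.
doc_counts = {"high school": 4, "soccer": 4, "opening": 1}
positive = {"high school": 2, "baseball": 2, "soccer": 1, "opening": 1}
remaining = drop_top_positive_strings(doc_counts, positive, 2)
```

Only the remaining strings would then be passed to step 1307, so "high school" never enters the negative profile 121 in this variant.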
[0106]
The second embodiment of the present invention will be described below.
[0107]
In the first embodiment, the system automatically selects a character string to be used for search or a character string to be added to the profile when the search condition is corrected. Therefore, when the user's evaluation for the search result document is inappropriate, there is a problem that the search accuracy is not improved.
[0108]
In order to solve the above problem, the second embodiment of the present invention provides a means for displaying a list of character strings extracted from a document that the user has evaluated positively or negatively, and for letting the user select the character strings to which a positive weight or a negative weight is to be given.
[0109]
This embodiment has substantially the same configuration as the first embodiment shown in FIG. Here, as shown in FIG. 17, the search condition correction control program 116a includes a profile update character string user selection program 1701, a profile update program 117a, and a search use character string selection program 118. Further, as shown in FIG. 18, the processing procedure of the profile update program 117a is different from the profile update program 117 in the first embodiment.
[0110]
Hereinafter, the processing procedure of the profile update program 117a in the second embodiment will be described with reference to the PAD diagram of FIG.
[0111]
In step 1801, the profile update program 117a first reads the document number input by the user with the keyboard 101 and the user's evaluation of the document with that document number (“desired”, “not desired”, or the like).
[0112]
In step 1802, the document corresponding to the document number read in step 1801 is read from the text 103 into the registered document storage area 124.
[0113]
In step 1803, characteristic character strings are extracted from the document stored in the registered document storage area 124, their numbers of appearances in the document are obtained by referring to the appearance frequency file 104, and both are stored in the characteristic character string storage area 125. Here, as a method for extracting the characteristic character string, the method described in the above-mentioned “Japanese Laid-Open Patent Publication No. 11-143902” may be used, or a method using morphological analysis or learning data by a neural network may be used.
[0114]
Next, in step 1804, the profile update character string user selection program 1701 is activated, and the character string that the user did not select among the character strings read in step 1803 is cleared from the characteristic character string storage area 125.
[0115]
In step 1805, if the user's evaluation read in step 1801 is positive, then in step 1807 the number of appearances of each character string in the characteristic character string storage area 125 is added to the weight of the corresponding character string in the positive profile 120. If the character string is not in the positive profile 120, the number of appearances read in step 1803 is assigned as its weight, and the character string is added to the positive profile 120.
[0116]
In step 1806, if the user's evaluation read in step 1801 is negative, then in step 1808 the number of appearances of each character string in the characteristic character string storage area 125 is subtracted from the weight of the corresponding character string in the negative profile 121. If the character string is not in the negative profile 121, the negative of the number of appearances read in step 1803 is assigned as its weight, and the character string is added to the negative profile 121.
[0117]
Here, the weight addition and subtraction method in steps 1807 and 1808 may be adjusted according to the user's evaluation. For example, when the user gives the evaluation “desired” in step 1807, the number of appearances of the characteristic character string in the document is added as it is to the weight of the character string in the positive profile 120, whereas if the evaluation is a weaker grade of “desired”, half the number of appearances of the characteristic character string in the document may be added to the weight of the character string in the positive profile 120. Further, the characteristic character strings whose weights are added or subtracted in steps 1807 and 1808 may be limited to the top predetermined number by number of appearances among those extracted in step 1803.
[0118]
The processing procedure of the profile update program 117a has been described above.
[0119]
Next, the processing procedure of the profile update character string user selection program 1701 started by the profile update program 117a in step 1804 shown in FIG. 18 will be described with reference to the PAD diagram of FIG.
[0120]
First, in step 1901, a list of characteristic character strings in the characteristic character string storage area 125 is displayed.
[0121]
Next, in step 1902, the character string not selected by the user is acquired from the character strings displayed in step 1901, and information on the character string is cleared from the characteristic character string storage area 125.
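The effect of step 1902 on the subsequent profile update can be sketched as follows. The function name and the appearance counts for “high school” and “first round” are assumptions; only the resulting weight of −4 for “soccer” follows the example given for this embodiment.

```python
def clear_unselected(doc_counts, selected_strings):
    """Keep only the characteristic character strings the user selected."""
    return {string: count for string, count in doc_counts.items()
            if string in selected_strings}


# Hypothetical counts for a negatively evaluated document.
doc_counts = {"high school": 4, "soccer": 4, "first round": 1}
kept = clear_unselected(doc_counts, {"soccer"})  # the user ticks only "soccer"

# Negative evaluation: subtract the remaining counts in the negative profile.
negative_profile = {}
for string, count in kept.items():
    negative_profile[string] = negative_profile.get(string, 0) - count
```

Since “high school” and “first round” were cleared before the update, neither receives a negative weight.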
[0122]
The processing procedure of the profile update character string user selection program 1701 has been described above.
[0123]
Here, an example of a screen for selecting a character string that the user wants to add to the profile by the profile update character string user selection program 1701 is shown in FIG. In the window 2001, a characteristic character string extracted from the document evaluated by the user is displayed together with a check box. When there are many characteristic character strings, all the character strings can be referred to in the window 2001 using the scroll bar 2002. The user checks the check box of the character string to be added to the profile among the character strings in the window 2001, and presses the send button 2003.
[0124]
The character string may be selected with check boxes as in the example of FIG. 20, or an identification number may be assigned to each character string and displayed together with it, so that the character string is selected by its identification number.
[0125]
Hereinafter, the flow of search condition correction and re-search processing when the user negatively evaluates the search result text in this embodiment will be described with reference to FIG.
[0126]
In this figure, it is assumed that the user wants to search for text related to “high school baseball”, and that the character strings 2102 “soccer”, “high school”, “baseball”, and “opening”, extracted from the text 2101 “high school baseball has started after soccer” first specified as the seed document, have been registered in the positive profile 120 by the search condition generation program 113.
[0127]
Here, it is assumed that a negative evaluation is made on the search result text “The first round of high school soccer is ...”.
[0128]
First, the characteristic character strings 2104 of the document 2103 “high school soccer first round is ...”, to which the user gave a negative evaluation, are extracted from the appearance frequency information stored in the appearance frequency file 104 and read into the characteristic character string storage area 125 together with their appearance frequencies in the document 2103. In the example of this figure, the character strings “high school”, “soccer”, “first round”, ... and their appearance frequencies are read.
[0129]
Next, the information on the character strings that the user did not select on the screen of FIG. 20 described above is cleared from the characteristic character string storage area 125. In the example of this figure, the user desires text related to “high school baseball” and does not desire text related to “soccer”. Therefore, the user specifies that a negative weight is to be given only to the character string “soccer”. At this time, the character strings “high school” and “first round” and their weights are cleared from the characteristic character string storage area 125.
[0130]
Next, for each character string in the appearance frequency information 2104 that is already in the negative profile 121, the number of appearances is subtracted from its weight; each character string that is not in the negative profile 121 is registered in the negative profile 121 with the negative of its number of appearances as its weight. In the example of this figure, the weight “−4” is assigned to the character string “soccer”, which is added to the negative profile 121.
[0131]
Next, the following are registered in the general profile 122: the character strings 2105 that rank within the top predetermined number by weight in the positive profile 120, and those character strings that rank within the lowest predetermined number 2106 by weight in the negative profile 121 and are not included in the top predetermined number 2107 of character strings in the positive profile 120. At the time of search, the search is performed based on the character strings of the general profile 122 and their weights.
[0132]
As described above, in the example shown in this figure, even if a negative evaluation is made on the text “High school soccer first round ...”, the character string “high school” does not lose weight, which prevents the problem that a higher similarity is calculated for text about “professional baseball” than for text about “high school baseball”. Moreover, since the weight of the character string “first round”, which is not included in the positive profile 120, is not reduced either, it is possible to prevent the problem that the similarity of text desired by the user, such as “first round of high school baseball”, is lowered.
[0133]
The above is the flow of search condition correction and re-search processing when the user negatively evaluates the search result text.
[0134]
In this embodiment, when the user makes a positive evaluation of a search result document, it is likewise possible to select the character strings to be added to the positive profile. This prevents a positive weight from being assigned to a character string that was extracted from a positively evaluated document but does not represent the concept desired by the user.
[0135]
The above is the second embodiment of the present invention.
[0136]
As described above, according to the present embodiment, the user designates, among the character strings extracted from a document that the user evaluated as “not desired”, those that represent a concept the user desires, and these character strings are excluded from the targets of weight reduction. Therefore, only the weights of character strings representing concepts not desired by the user are subtracted. This solves the problem that simply subtracting the weights of the character strings extracted from a document evaluated as “not desired” also subtracts the weights of character strings representing the concept the user desires, so that the search result does not improve.
[0137]
Further, among the character strings extracted from a document that the user has evaluated as “desired”, the user designates the character strings that do not represent a concept the user desires, and these character strings are excluded from the targets of weight increase. The weight is therefore added only for character strings representing concepts that the user desires. This solves the problem that simply adding the weights of the character strings extracted from a document evaluated as “desired” also adds the weights of character strings that do not represent the user's desired concept, so that the search result does not improve.
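The two user-driven exclusions of this embodiment can be sketched together as follows. This is a minimal illustration, not the profile update program 117a itself: the function name `apply_feedback`, the single weight table, and the per-occurrence `step` are all hypothetical.

```python
def apply_feedback(weights, extracted, desired, excluded, step=1.0):
    """Return a new weight table updated from one evaluated document.

    weights:   dict mapping character string -> current weight
    extracted: dict mapping character string -> occurrence count in the document
    desired:   True for a "desired" evaluation, False for "not desired"
    excluded:  strings the user designated to be left out of the update
    step:      hypothetical weight change per occurrence
    """
    sign = 1.0 if desired else -1.0
    updated = dict(weights)
    for string, count in extracted.items():
        if string in excluded:
            continue  # user-designated string: its weight is neither raised nor lowered
        updated[string] = updated.get(string, 0.0) + sign * step * count
    return updated

weights = {"high school": 1.0, "baseball": 1.0}

# "Not desired" document: the user marks "high school" as still representing the desired concept.
after_negative = apply_feedback(
    weights, {"high school": 1, "soccer": 2}, desired=False, excluded={"high school"})

# "Desired" document: the user marks "rain" as not representing the desired concept.
after_positive = apply_feedback(
    weights, {"baseball": 1, "rain": 3}, desired=True, excluded={"rain"})
```

In the first call only “soccer” is reduced; in the second only “baseball” is increased, and “rain” never enters the weight table.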
[0138]
In the first and second embodiments, the user inputs an evaluation for one search result document, and the search result reflecting that evaluation is output. However, it is also possible to input evaluations for a plurality of search result documents at one time and to output the search result reflecting all of those evaluations.
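A batched variant of this kind might look like the sketch below. The normalization by the number of positively and negatively evaluated documents, the parameters `alpha` and `beta`, and the function name `batch_update` are assumptions chosen to match the symbols defined for the conventional weight update; they are not taken from the embodiments.

```python
from collections import Counter

def batch_update(weights, evaluated_docs, alpha=1.0, beta=1.0):
    """Apply several evaluations in one pass.

    evaluated_docs: list of (tokens, desired) pairs, where tokens is the list of
    character strings extracted from a document and desired is True or False.
    """
    pos, neg = Counter(), Counter()
    n_pos = n_neg = 0
    for tokens, desired in evaluated_docs:
        if desired:
            pos.update(tokens)
            n_pos += 1
        else:
            neg.update(tokens)
            n_neg += 1

    updated = dict(weights)
    for string, count in pos.items():
        # Assumed normalization: average occurrence count over the desired documents.
        updated[string] = updated.get(string, 0.0) + alpha * count / n_pos
    for string, count in neg.items():
        if string in pos:
            continue  # strings also found in desired documents are not reduced
        updated[string] = updated.get(string, 0.0) - beta * count / n_neg
    return updated

batch = [
    (["high school", "baseball"], True),
    (["high school", "soccer"], False),
    (["soccer", "first round"], False),
]
result = batch_update({"high school": 1.0, "baseball": 1.0}, batch)
```

Collecting the counts first and updating once means the result does not depend on the order in which the user entered the evaluations.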
[0139]
In the first and second embodiments, a seed document is set first, and documents having contents similar to the seed document are searched for. However, a full-text search in which a keyword is specified may be performed first instead. In that case, instead of steps 702 and 703 of the search condition generation program 113 shown in FIG. 7, the input keyword may be added to the positive profile 120 and the general profile 122 with a predetermined weight.
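The keyword-based initialization can be sketched as follows. Representing the positive profile 120 and the general profile 122 as dictionaries, the function name `seed_profiles_from_keywords`, and the value of the predetermined weight are all assumptions for illustration.

```python
DEFAULT_KEYWORD_WEIGHT = 1.0  # hypothetical "predetermined weight"

def seed_profiles_from_keywords(keywords, weight=DEFAULT_KEYWORD_WEIGHT):
    """Build initial profiles from input keywords instead of from a seed document."""
    positive_profile = {}
    general_profile = {}
    for keyword in keywords:
        # Each input keyword is registered in both profiles with the same weight.
        positive_profile[keyword] = weight
        general_profile[keyword] = weight
    return positive_profile, general_profile

positive_profile, general_profile = seed_profiles_from_keywords(["high school", "baseball"])
```

The two profiles are kept as separate objects so that later feedback can modify one without silently changing the other.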
[0140]
According to the present embodiment, since the weight of a word representing the user's desired concept is not subtracted, the search result can be improved based on information extracted from a search result document that the user has evaluated as “not desired”.
[0141]
[Effects of the invention]
According to the present invention, it is possible to improve a search result by using appropriate information extracted from a document that the user has evaluated as “not desired”.
[Brief description of the drawings]
FIG. 1 is a diagram showing a configuration of a first exemplary embodiment of the present invention.
FIG. 2 is a diagram illustrating an example of relevance feedback processing according to a conventional technique.
FIG. 3 is a diagram illustrating an example in which search results are not improved by relevance feedback processing according to the conventional technique.
FIG. 4 is a PAD diagram showing the processing procedure of the system control program 110 in the first embodiment of the present invention.
FIG. 5 is a PAD showing a processing procedure of the document registration program 111 in the first embodiment of the present invention.
FIG. 6 is a PAD diagram showing a processing procedure of the search control program 112 in the first embodiment of the present invention.
FIG. 7 is a PAD showing the processing procedure of the search condition generation program 113 in the first embodiment of the present invention.
FIG. 8 is a PAD showing the processing procedure of the similar document search program 114 in the first embodiment of the present invention.
FIG. 9 is a PAD showing the processing procedure of the search result document content display program 115 in the first embodiment of the present invention.
FIG. 10 is a PAD showing the processing procedure of the search condition correction control program 116 in the first embodiment of the present invention.
FIG. 11 is a PAD showing the processing procedure of the profile weight adjustment program 119 in the first embodiment of the present invention.
FIG. 12 is a diagram showing an example of an input screen displayed on the display when the user adjusts the profile in the first embodiment of the present invention.
FIG. 13 is a PAD showing the processing procedure of the profile update program 117 in the first embodiment of the present invention.
FIG. 14 is a PAD showing a processing procedure of a search use character string selection program 118 in the first embodiment of the present invention.
FIG. 15 is a diagram showing the flow of search condition correction and re-search processing when a user negatively evaluates a search result document in the first embodiment of the present invention.
FIG. 16 is a PAD showing one form of the processing of the profile update program 117 in the first embodiment of the present invention.
FIG. 17 is a PAD diagram showing a configuration of a search condition correction program 116a in the second embodiment of the present invention.
FIG. 18 is a PAD showing the processing procedure of the profile update program 117a in the second embodiment of the present invention.
FIG. 19 is a PAD showing a processing procedure of a profile update character string user selection program 1701 in the second embodiment of the present invention.
FIG. 20 is a diagram showing an example of a screen on which the user selects the character strings to be added to the profile in the second embodiment of the present invention.
FIG. 21 is a diagram showing the flow of search condition correction and re-search processing when a user negatively evaluates a search result document in the second embodiment of the present invention.
[Explanation of symbols]
100 Display
101 Keyboard
102 Central processing unit (CPU)
103 Text
104 Appearance frequency file
105 Magnetic disk drive
106 Floppy disk drive (FDD)
107 Floppy disk
108 Bus
109 Main memory
110 System control program
111 Document registration program
112 Search control program
113 Search condition generation program
114 Similar document search program
115 Search result document content display program
116 Search condition correction control program
117 Profile update program
118 Search use character string selection program
119 Profile weight adjustment program
120 Positive profile
121 Negative profile
122 General profile
123 Document storage area
124 Registered document storage area
125 Character string storage area
126 Display document storage area

Claims

A document search method performed by a system having a processing device, in which a search condition including character strings and their weights is input, a user's suitability evaluation of a document retrieved based on the search condition is acquired, and the search condition is corrected by adding, for a first character string extracted from a document that has received a positive evaluation, a predetermined value corresponding to the number of occurrences of that character string in the positively evaluated document to the weight of that character string, and by subtracting, for a second character string extracted from a document that has received a negative evaluation, a predetermined value corresponding to the number of occurrences of that character string in the negatively evaluated document from the weight of that character string,
wherein the processing device deletes, from the second character strings extracted from the document that has received the negative evaluation, any character string that matches the first character string, and corrects the search condition by subtracting, only for the second character strings that have not been deleted, the predetermined value corresponding to the number of occurrences of that character string in the negatively evaluated document from the weight of that character string.

A document search method performed by a system having a processing device, in which a search condition including character strings and their weights is input, a user's suitability evaluation of a document retrieved based on the search condition is acquired, and the search condition is corrected by adding, for a first character string extracted from a document that has received a positive evaluation, a predetermined value corresponding to the number of occurrences of that character string in the positively evaluated document to the weight of that character string, and by subtracting, for a second character string extracted from a document that has received a negative evaluation, a predetermined value corresponding to the number of occurrences of that character string in the negatively evaluated document from the weight of that character string,
wherein the processing device deletes, from the second character strings extracted from the document that has received the negative evaluation, any character string that matches a first character string having a weight greater than a predetermined value, and corrects the search condition by subtracting, only for the second character strings that have not been deleted, the predetermined value corresponding to the number of occurrences of that character string in the negatively evaluated document from the weight of that character string.
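
The subtraction step recited in the two claims above can be illustrated with the following sketch. It is only one possible reading: the function name `subtract_negative`, the per-occurrence `step`, and the choice to compare the current weight in the search condition against the threshold `min_weight` are assumptions, not limitations stated in the claims.

```python
def subtract_negative(weights, first_strings, second_counts, step=1.0, min_weight=None):
    """Subtract weights for strings from a negatively evaluated document.

    weights:       dict mapping character string -> weight in the search condition
    first_strings: set of strings extracted from positively evaluated documents
    second_counts: dict mapping string -> occurrence count in the negatively
                   evaluated document
    min_weight:    None deletes every matching first string (first claim);
                   a number deletes only matches whose weight exceeds it (second claim)
    """
    if min_weight is None:
        protected = set(first_strings)
    else:
        protected = {s for s in first_strings if weights.get(s, 0.0) > min_weight}

    updated = dict(weights)
    for string, count in second_counts.items():
        if string in protected:
            continue  # deleted from the second strings, so no subtraction is applied
        updated[string] = updated.get(string, 0.0) - step * count
    return updated

weights = {"high school": 2.0, "baseball": 2.0, "opening": 0.5}
first = {"high school", "baseball", "opening"}
second = {"high school": 1, "soccer": 1, "opening": 1}

without_threshold = subtract_negative(weights, first, second)
with_threshold = subtract_negative(weights, first, second, min_weight=1.0)
```

With the threshold, the low-weight string “opening” is no longer protected and is reduced along with “soccer”, while “high school” keeps its weight in both variants.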