CN115391495B - Method, device and equipment for searching keywords in Chinese context

Info

Publication number: CN115391495B
Application number: CN202211330466.9A
Authority: CN
Inventors: 王利烨; 韩亚林; 徐捷; 田恒
Original assignee: Qiangqi Baodian Shandong Information Technology Co ltd
Current assignee: Qiangqi Baodian Shandong Information Technology Co ltd
Priority date: 2022-10-28
Filing date: 2022-10-28
Publication date: 2023-01-24
Anticipated expiration: 2042-10-28
Also published as: CN115391495A

Abstract

The invention discloses a method, a device and equipment for searching keywords in a Chinese context, belonging to the technical field of information retrieval. The method comprises the following steps: creating, in computer memory, a word frequency table of the word stock that records the use frequency value of each Chinese character in the Chinese word stock, as counted over a Chinese corpus; querying the use frequency value of each Chinese character in the search keyword by reading the word frequency table of the word stock, according to the records in its database and their addressing mode; sorting the use frequency values from small to large to generate a keyword word frequency sorting table; and, in the Chinese text to be searched, performing one round of single-character comparison between the search keyword and the continuous Chinese character string starting at each Chinese character position in the text, wherein the single-character order represented by the keyword word frequency sorting table is used as the order in which the Chinese characters of the search keyword are compared with the characters at the corresponding positions in the search text. The invention improves the efficiency of keyword retrieval.

Description

Method, device and equipment for searching keywords in Chinese context

Technical Field

The invention relates to a method, a device and equipment for searching keywords in a Chinese context, belonging to the technical field of information retrieval.

Background

Searching content by a specific keyword is widely used in computer information technology. A search operation generally means looking, within a search text (sometimes called the text string or main string), for a continuous character string that matches a search keyword (sometimes called the pattern or substring). If the match succeeds, the position at which the search keyword appears in the search text is usually returned; otherwise, a result indicating that the match failed is returned. Besides locally stored electronic documents, the searched object may be a Web page containing text data, or even video content containing text data; application scenarios include document retrieval, search engines, spell checking, language translation, data compression and the like.

During keyword retrieval, each character of the keyword must be compared individually with the character at the corresponding position in the retrieval text, beginning at the start of the retrieval text. A match succeeds only when the character string at some position in the retrieval text is completely identical to the whole retrieval keyword in both character composition and character order. Otherwise, the next round of matching begins from the character following the previous round's starting position in the retrieval text. The match is declared failed only when no character string identical to the retrieval keyword exists anywhere in the retrieval text.

The current matching process for a search keyword is shown in fig. 1 and fig. 2. The first three rounds of matching in fig. 1 show the single-character comparisons, and in particular their order, when the search keyword T is searched for in the search text S; fig. 2 shows the complete flow of single-character comparison in each round of the keyword search. Referring to fig. 1 and fig. 2, the search text S and the search keyword T are stored as character arrays: S consists of the characters $S_0, S_1, S_2, \ldots, S_{m-1}$ and T consists of the characters $T_0, T_1, T_2, \ldots, T_{n-1}$, where m and n are the lengths of the search text and the search keyword respectively, i.e., the number of characters each contains. In each round of single-character comparison, the characters of T are compared in their natural (spatial) order $T_0, T_1, T_2, \ldots, T_{n-1}$, each with the character at the corresponding position in the search text. It is thus customary to take the order in which the characters appear in the search keyword as the comparison order, and the comparison order itself has not been regarded as something that could be adjusted to improve search efficiency.
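The conventional round-by-round matching described above can be sketched as follows. This is a minimal illustration only; the function name `naive_search` is hypothetical and does not come from the patent.

```python
def naive_search(text: str, keyword: str) -> int:
    """Return the first index at which keyword occurs in text, or -1."""
    m, n = len(text), len(keyword)
    for start in range(m - n + 1):           # one round per start position
        for j in range(n):                   # natural order T_0, T_1, ..., T_{n-1}
            if text[start + j] != keyword[j]:
                break                        # a mismatch ends this round
        else:
            return start                     # every character matched
    return -1                                # no round succeeded


print(naive_search("我们在中文语境中检索关键词", "检索"))
```

Each round always begins with the first character of the keyword, regardless of how common that character is.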

Keyword search in a search text is implemented in low-level computer software and usually uses the BF (Brute-Force) algorithm, the KMP (Knuth, Morris and Pratt) algorithm, or an optimized KMP algorithm to perform pattern matching between the main string and the substring. The BF algorithm is a naive pattern matching algorithm; its execution efficiency is relatively low because the character pointers of both the main string and the substring backtrack. As improvements on the BF algorithm, the KMP algorithm and the optimized KMP algorithm do not backtrack the main-string pointer when a character comparison fails during a round of matching. Instead, they use the "partial match" already obtained to "slide" the pattern to the right as far as possible before continuing the comparison. Because only the substring pointer backtracks and the main-string pointer does not, the KMP algorithm and the optimized KMP algorithm execute more efficiently than the BF algorithm.
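For reference, the "slide" behaviour of KMP can be sketched as below. This is the common textbook formulation with a failure table, not code from the patent; the names `kmp_search` and `fail` are illustrative.

```python
def kmp_search(text: str, pattern: str) -> int:
    """Return the first index of pattern in text, or -1 (textbook KMP)."""
    if not pattern:
        return 0
    # fail[i]: length of the longest proper prefix of pattern[:i+1]
    # that is also a suffix of it (the reusable "partial match").
    fail = [0] * len(pattern)
    k = 0
    for i in range(1, len(pattern)):
        while k and pattern[i] != pattern[k]:
            k = fail[k - 1]
        if pattern[i] == pattern[k]:
            k += 1
        fail[i] = k
    k = 0                                    # pattern pointer; may fall back
    for i, ch in enumerate(text):            # text pointer never moves back
        while k and ch != pattern[k]:
            k = fail[k - 1]                  # slide the pattern to the right
        if ch == pattern[k]:
            k += 1
        if k == len(pattern):
            return i - k + 1
    return -1
```

When partial matches are rare, `fail` is almost always consulted at `k == 0`, so the algorithm behaves much like the brute-force scan.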

It is worth noting, however, that the KMP algorithm and the optimized KMP algorithm are both improvements on the BF algorithm and show their efficiency advantage only when there are many "partial matches" between the main string and the substring. This is the case when the character set is small, so that the probability of any two characters comparing equal is relatively high, as in keyword search in an English context built mainly from 26 letters. For keyword search in a Chinese context, the Chinese character set is large (for example, the national standard GB/T2312-1980 contains 6763 basic Chinese characters), so the probability of a "partial match" between the search keyword and the search text is relatively low. As a result, the efficiency advantage of the KMP algorithm and the optimized KMP algorithm is difficult to realize in Chinese keyword search.

In addition, in a Chinese context, and especially in application scenarios where the retrieved text is large and the retrieval operation is called frequently, the amount of comparison computation grows accordingly and the execution efficiency of the retrieval operation directly affects the user experience. When the retrieved text is large (for example, more than 1 million Chinese characters in total), or when users are sensitive to the response time of the retrieval operation (for example, in a network search engine), keyword retrieval using only the three traditional pattern matching algorithms above can hardly meet a high standard of user experience.

Disclosure of Invention

The technical problem to be solved by the invention is how to improve retrieval efficiency when performing keyword retrieval in a Chinese retrieval text. More specifically, it is how to reduce, overall, the number of single-character comparisons in each round of matching, especially when the leading Chinese characters of the retrieval keyword have a higher use frequency and are therefore more likely to compare equal to the corresponding characters in the retrieval text, while the trailing Chinese characters have a lower use frequency and are therefore more likely to compare unequal.

The invention aims to make full use of the statistical regularity that, viewed macroscopically, use frequencies differ greatly among Chinese characters, and to optimize the order in which the Chinese characters of the search keyword are compared, in each round of matching, with the characters at the corresponding positions in the search text, so as to reduce the number of single-character comparisons in each round as far as possible. That is, the ideal effect to be achieved is that every unsuccessful round fails on the first Chinese character compared rather than on a later one, because stopping a round at its first comparison means that the round incurs the fewest single-character comparisons.

To solve the above problems, the present invention provides a method, an apparatus and a device for searching keywords in a Chinese context, which improve the efficiency of keyword search in the Chinese context.

The technical scheme adopted by the invention for solving the technical problem is as follows:

In a first aspect, an embodiment of the present invention provides a method for retrieving keywords in a Chinese context, where the retrieved keyword is composed of two or more Chinese characters, and the method includes the following steps:

creating a word frequency table of a word stock in computer memory, and recording in the database of the word frequency table of the word stock the use frequency value of each Chinese character in the Chinese word stock, as counted over a Chinese corpus;

querying the use frequency value of each Chinese character in the search keyword by reading the word frequency table of the word stock, according to the records in its database and the addressing mode of those records;

sorting the use frequency values of the Chinese characters in the search keyword from small to large to generate a keyword word frequency sorting table;

in the Chinese context to be searched, performing one round of single-character comparison between the search keyword and each continuous Chinese character string that starts at a position from the first character of the text through to the tail of the text, wherein, in each round, the single-character order represented by the generated keyword word frequency sorting table is used as the order in which the Chinese characters of the search keyword are compared with the characters at the corresponding positions in the search text.
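The four steps above can be sketched end to end as follows. This is a minimal illustration under stated assumptions: the frequency table is a plain `Counter` built from a tiny stand-in corpus, and the names `build_freq_table`, `comparison_order` and `freq_ordered_search` are hypothetical rather than taken from the patent.

```python
from collections import Counter


def build_freq_table(corpus: str) -> Counter:
    """Step 1: count how often each character occurs in the corpus."""
    return Counter(corpus)


def comparison_order(keyword: str, freq: Counter) -> list[int]:
    """Steps 2-3: keyword positions sorted by ascending use frequency."""
    return sorted(range(len(keyword)), key=lambda j: freq[keyword[j]])


def freq_ordered_search(text: str, keyword: str, freq: Counter) -> int:
    """Step 4: scan every start position, comparing rarest characters first."""
    order = comparison_order(keyword, freq)
    n = len(keyword)
    for start in range(len(text) - n + 1):
        # all() stops at the first mismatch, which ends the round early.
        if all(text[start + j] == keyword[j] for j in order):
            return start
    return -1


freq = build_freq_table("的的的的一一一是是检索")   # stand-in corpus
print(comparison_order("的检", freq))               # rare character first
print(freq_ordered_search("他的检索", "的检", freq))
```

The result is the same as for a natural-order scan; only the order of the comparisons inside each round changes.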

As a possible implementation manner of this embodiment, the process of creating a word frequency table of a word stock includes the following steps:

establishing a word stock standard, wherein the standard is limited to a national standard document or government document defining a Chinese word stock;

collecting a Chinese corpus, which is in modern Chinese and contains more than one million Chinese characters;

and counting word frequency values, wherein the word frequency value of each Chinese character in the Chinese word stock is the total number of times it occurs in the Chinese corpus.
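The counting step can be sketched as below, assuming the word stock is supplied as a collection of characters and the corpus as a single string. The name `count_frequencies` and the tiny sample data are illustrative only.

```python
def count_frequencies(word_stock, corpus: str) -> dict[str, int]:
    """Count corpus occurrences of each character in the word stock."""
    table = {ch: 0 for ch in word_stock}     # every stock character gets an entry
    for ch in corpus:
        if ch in table:                      # ignore punctuation and other symbols
            table[ch] += 1
    return table


table = count_frequencies("的一是检索", "这是一的的,是的。")
print(table)
```

Characters of the word stock that never occur in the corpus keep a frequency of zero, so they sort first in a low-frequency-first comparison order.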

As a possible implementation manner of this embodiment, the use frequency value of each Chinese character recorded in the word frequency table of the word stock is continuously updated as the user searches various search texts. The continuous updating process is as follows: during matching of the search keyword, for each Chinese character read from the search text, its record position in the word frequency table of the word stock is looked up, and 1 is added to the use frequency value stored at that position.

As a possible implementation manner of this embodiment, the keyword word frequency sorting table represents a sequence of the ordinal values (positions) of the Chinese characters within the search keyword. The ordinal values are stored in order in a one-dimensional integer array. In each round of matching, as the array subscript increases step by step, the successive element values give the order in which the Chinese characters of the keyword are compared, that is, the comparison order represented by the keyword word frequency sorting table.
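A sketch of such a sorting table follows, assuming 0-based ordinals and Python's `array` module as the one-dimensional integer array. The frequency values in the example are made up for illustration and are not taken from fig. 7.

```python
from array import array


def build_sorting_table(keyword: str, freq: dict[str, int]) -> array:
    """Return keyword positions ordered by ascending use frequency."""
    ordinals = sorted(range(len(keyword)),
                      key=lambda j: freq.get(keyword[j], 0))
    return array("i", ordinals)              # one-dimensional integer array


# Hypothetical frequency values, for illustration only.
freq = {"的": 900, "确": 40, "良": 15}
table = build_sorting_table("的确良", freq)
for subscript, ordinal in enumerate(table):  # increasing subscript = comparison order
    print(subscript, ordinal, "的确良"[ordinal])
```

Walking the array from subscript 0 upward visits the rarest character of the keyword first and the most common one last.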

As a possible implementation manner of this embodiment, the Chinese context to be retrieved includes the Chinese text of a single file stored locally on the computer or in a server web page, the Chinese text within a specified range of such a file, or the Chinese text formed by multiple files taken as a whole.

As a possible implementation manner of this embodiment, in each round of single-character comparison, the round is exited as soon as the two characters at any corresponding position are found to differ. A round is determined to match successfully only when a continuous character string in the search text is completely identical to the whole Chinese search keyword in both its Chinese characters and their order. Matching continues until every continuous character string in the search text whose length equals that of the Chinese search keyword has been compared with the keyword.

As a possible implementation manner of this embodiment, the word frequency table of the word stock is addressed as follows: for any Chinese character in the word stock, the value of its internal code (machine code) in the computer, after addition or subtraction of a constant, is used directly as the storage position of that character's use frequency value in the word frequency table of the word stock. When the use frequency value of a Chinese character is queried or updated, its storage position in the table is derived directly from the character's internal code by this addition or subtraction.

As a possible implementation manner of this embodiment, the word stock of the word frequency table adopts either the set of 6763 Chinese characters included in GB/T2312-1980, the basic set of the Chinese character coded character set for information exchange, or the set of 6866 Chinese characters included in GB/T12345-1990, the auxiliary set of the Chinese character coded character set for information exchange. The use frequency values of the Chinese characters are stored in a two-dimensional array (matrix). When the use frequency value of a Chinese character is read or written, the row and column addresses of the array element are obtained by subtracting B0H from the hexadecimal value of the high byte and A1H from the hexadecimal value of the low byte of the character's internal code.
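The row and column arithmetic can be sketched as follows. The sketch assumes that the internal code is the two-byte encoding produced by Python's `gb2312` codec and that Chinese characters occupy high bytes B0H to F7H and low bytes A1H to FEH; the function names are illustrative.

```python
ROWS = 0xF7 - 0xB0 + 1    # 72 possible high-byte values, B0H..F7H
COLS = 0xFE - 0xA1 + 1    # 94 possible low-byte values, A1H..FEH


def cell_address(ch: str) -> tuple[int, int]:
    """Map a GB2312 Chinese character to (row, column) in the table."""
    code = ch.encode("gb2312")               # UnicodeEncodeError if not in GB2312
    if len(code) != 2 or not (0xB0 <= code[0] <= 0xF7):
        raise ValueError(f"{ch!r} is not a GB2312 Chinese character")
    return code[0] - 0xB0, code[1] - 0xA1    # the "-B0H" and "-A1H" subtractions


def new_table() -> list[list[int]]:
    """Create a zero-filled two-dimensional frequency matrix."""
    return [[0] * COLS for _ in range(ROWS)]


def add_use(table: list[list[int]], ch: str, amount: int = 1) -> None:
    row, col = cell_address(ch)
    table[row][col] += amount


def use_frequency(table: list[list[int]], ch: str) -> int:
    row, col = cell_address(ch)
    return table[row][col]


print(cell_address("啊"))   # internal code B0A1H -> (0, 0)
```

No hashing or searching is needed: both reading and updating a frequency cost two subtractions and one array access. The matrix has 72 × 94 = 6768 cells, slightly more than the 6763 characters, because a few code points in the range are unassigned.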

In a second aspect, an embodiment of the present invention provides an apparatus for retrieving keywords in a Chinese context, where the retrieved keyword is composed of two or more Chinese characters, including:

a word stock word frequency table creating module, used for creating a word frequency table of a word stock in computer memory, the database of the word frequency table recording the use frequency value of each Chinese character in the Chinese word stock, as counted over a Chinese corpus;

a use frequency query module, used for querying the use frequency value of each Chinese character in the search keyword by reading the word frequency table of the word stock, according to the records in its database and the addressing mode of those records;

a word frequency sorting table generating module, used for sorting the use frequency values of the Chinese characters in the search keyword from small to large to generate a keyword word frequency sorting table;

and a keyword retrieval module, used for performing, in the Chinese context to be retrieved, one round of single-character comparison between the search keyword and each continuous Chinese character string that starts at a position from the first character of the text through to the tail of the text, wherein, in each round, the single-character order represented by the generated keyword word frequency sorting table is used as the order in which the Chinese characters of the search keyword are compared with the characters at the corresponding positions in the search text.

In a third aspect, an embodiment of the present invention provides a computer device, including a processor, a memory and a bus. The memory stores machine-readable instructions executable by the processor; when the computer device is running, the processor and the memory communicate via the bus, and the processor executes the machine-readable instructions to implement any of the methods for retrieving keywords in a Chinese context described above.

In a fourth aspect, an embodiment of the present invention provides a computer-readable storage medium storing a computer program for executing any of the methods for retrieving keywords in a Chinese context described above.

The technical scheme of the embodiment of the invention has the following beneficial effects:

In comparing the Chinese characters during keyword retrieval, the invention abandons the customary practice of comparing the characters one by one in their natural order within the search keyword. It treats the use frequency of Chinese characters as an important factor affecting retrieval efficiency and adopts a low-frequency-first, high-frequency-later single-character comparison order, so that a failed match between the search keyword and the search text is detected as early as possible in each round of matching. In the overall statistical sense, the invention reduces the total number of single-character comparisons in the matching process, thereby improving the retrieval efficiency of Chinese keywords.

Compared with the prior art, the steps of setting up the word frequency table of the word stock, querying it, and generating the keyword word frequency sorting table add some time and space overhead. The method is therefore better suited to keyword retrieval in large texts, where the improvement in overall retrieval efficiency, which holds in the word-frequency statistical sense, is more evident. In other words, the optimization of execution efficiency pays off when the search text is massive and the amount of computation is correspondingly large.

Drawings

FIG. 1 is a diagram illustrating the comparison and sequence of each round of matched words in keyword search in the prior art;

FIG. 2 is a flowchart illustrating the comparison of words matched in each round during keyword search in the prior art;

FIG. 3 is a flowchart illustrating a method of retrieving keywords in a Chinese context in accordance with an exemplary embodiment;

FIG. 4 is a block diagram illustrating an apparatus for retrieving keywords in a Chinese context in accordance with an exemplary embodiment;

FIG. 5 is a flow chart of an embodiment of the present invention for retrieving keywords in a Chinese context;

FIG. 6 is a block diagram of the differences between the prior art and the present technology;

FIG. 7 is a schematic diagram of an example of a word frequency table of the word stock in the present invention;

FIGS. 8a and 8b are schematic diagrams illustrating the difference between the word comparison sequence and the prior art and the use of the word frequency sorting table of the keywords in the present invention;

FIG. 9 is a schematic diagram of four end information distributions of the Chinese character library GB/T2312-1980 basic set of Chinese character coding sets for information exchange;

FIG. 10a is a schematic diagram showing the conversion relationship between the Chinese character internal code, the national standard code, the zone-position code and the address of the word frequency table storage array of the word stock in the present invention;

FIG. 10b is a schematic diagram illustrating the conversion relationship between the various numerical values in FIG. 10a, taking the Chinese character "o" as an example.

Detailed Description

In order to more clearly illustrate the technical features of the present invention, the present invention will be described in detail by the following embodiments with reference to the accompanying drawings.

As shown in fig. 3, a method for retrieving keywords in a Chinese context according to an embodiment of the present invention, where the retrieved keyword is composed of two or more Chinese characters, includes the following steps:

in the Chinese context to be searched, performing one round of single-character comparison between the search keyword and each continuous Chinese character string that starts at a position from the first character of the text through to the tail of the text, wherein, in each round, the single-character order represented by the generated keyword word frequency sorting table is used as the order in which the Chinese characters of the search keyword are compared with the characters at the corresponding positions in the search text.

In the prior art, the order of single-character comparison does not take into account the differences in use frequency among the Chinese characters that make up the search keyword. When the leading characters of the keyword occur at high frequency and the trailing characters at low frequency, the matching operation therefore tends to incur more single-character comparisons overall, which reduces search efficiency. In the present invention, when the search keyword is matched against the search text in a Chinese environment, the Chinese characters of the keyword that occur at low frequency are compared first and those that occur at high frequency are compared afterwards; that is, the single-character comparison order within the search keyword is determined by character frequency. The invention differs from the prior art in recognizing the following fact: for a given search text and a given search keyword, the number of single-character comparisons incurred in each round of matching depends on whether low-frequency or high-frequency characters of the keyword are compared first. Comparing first the characters that occur at low frequency in the search text, and then those that occur at high frequency, reduces the number of single-character comparisons per round relative to the opposite order, and thus reduces the total number of single-character comparisons required for full-text matching.
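The effect of the comparison order can be illustrated by counting single-character comparisons on a contrived text. This is a sketch only: the helper `count_comparisons` and the sample text are hypothetical, and the scan stops at the first successful round.

```python
def count_comparisons(text: str, keyword: str, order: list[int]) -> tuple[int, int]:
    """Return (match position or -1, number of single-character comparisons)."""
    n = len(keyword)
    comparisons = 0
    for start in range(len(text) - n + 1):
        for j in order:                      # compare in the given order
            comparisons += 1
            if text[start + j] != keyword[j]:
                break                        # the round fails at this comparison
        else:
            return start, comparisons
    return -1, comparisons


text = "的" * 9 + "检"                       # common character repeated, rare one last
print(count_comparisons(text, "的检", [0, 1]))   # natural order      -> (8, 18)
print(count_comparisons(text, "的检", [1, 0]))   # low-frequency first -> (8, 10)
```

In natural order, every failed round needs two comparisons because the common leading character matches; with the rare character first, every failed round ends after one comparison.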

As shown in fig. 4, an apparatus for retrieving keywords in a Chinese context according to an embodiment of the present invention, where the retrieved keyword is composed of two or more Chinese characters, includes:

a word stock word frequency table creating module, used for creating a word frequency table of a word stock in computer memory, the database of the word frequency table recording the use frequency value of each Chinese character in the Chinese word stock, as counted over a Chinese corpus;

As shown in FIG. 5, the specific process of searching keywords in the Chinese context using the apparatus of the present invention is as follows.

S1: a word stock word frequency table is arranged in a memory and used for storing the use frequency value of each Chinese character in the Chinese word stock after the linguistic data statistics.

To sort the Chinese characters of the search keyword by use frequency and compare them from low frequency to high frequency in each round of matching, the use frequency value of each Chinese character must be known in advance. Therefore, before the keyword word frequency sorting table is generated, a word frequency table of the word stock must be set up in computer memory. It records, for each Chinese character in the Chinese word stock, a use frequency value for the Chinese context in the general sense, obtained by counting the total number of occurrences of each Chinese character in a Chinese corpus of a certain scale.

To obtain an objective, scientific and accurate word frequency table of the word stock, the table is established as follows:

First, a word stock standard is established. Commonly used Chinese character set standards at present include the 6763 Chinese characters of the national standard GB/T2312-1980 (basic set of the Chinese character coded character set for information exchange), the 6866 Chinese characters of the national standard GB/T12345-1990 (auxiliary set of the Chinese character coded character set for information exchange), and the 8105 Chinese characters of the General Standard Chinese Character Table (State Council document [2013] No. 23). Any of these can serve as the word stock used in the word frequency table of the word stock.

Next, the Chinese corpus used for counting the occurrences of each Chinese character in the word stock is collected. The corpus consists of large-scale text that actually appears in real use of the Chinese language, stored as electronic text for convenient processing. In the invention, the corpus text is modern Chinese and covers various topics and fields such as literature, news, and science and technology; the corpus contains more than one million Chinese characters, so that the word frequency table generated from it reliably reflects actual character frequencies.

Finally, the word frequency values are counted: for each Chinese character in the word stock, its use frequency value over the whole Chinese corpus is counted, and the word frequency table of the word stock is generated. In principle, the use frequency value of a Chinese character may be the total number of its occurrences in the whole corpus, or the percentage of its occurrences in the whole corpus, or both may be counted and recorded in the word frequency table of the word stock.

As an important Chinese character standard that meets the needs of Chinese character use in all areas of contemporary society, the General Standard Chinese Character Table (State Council document [2013] No. 23), published by the State Council in 2013, defines a set of 8105 Chinese characters, which can be used as the word stock in the invention. The invention further uses as its Chinese corpus more than three thousand modern Chinese works written since the 1990s and obtained by searching and downloading from the network; the corpus contains 106205826 Chinese characters in total, a large sample. Modern information technology provides scientific and efficient means for building and applying large corpora, and a large, evenly distributed corpus provides a reliable basis for measuring and verifying the use frequency of Chinese characters. The inventors counted the number of occurrences and the occurrence percentage of the 8105 word stock characters among the 106205826 corpus characters and produced a report as an example of the word frequency table of the word stock, as shown in fig. 7. This example records the total number of occurrences and the occurrence percentage of each of the 8105 Chinese characters over the whole corpus, ranks the characters from high to low by these values and numbers them from 1 to 8105, and also gives the cumulative occurrence percentage up to each character's rank. It should be noted that, because this is a patent document, the table in the figure lists only some of the Chinese characters, for convenience of explanation or because they are used in the examples.
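A report with the columns described for fig. 7 (rank, character, occurrence count, percentage, cumulative percentage) could be produced along the following lines. This is a sketch under assumptions: percentages are taken relative to the number of word stock characters found in the corpus, and the function name and sample data are illustrative rather than the inventors' actual procedure.

```python
from collections import Counter


def frequency_report(word_stock: str, corpus: str) -> list[tuple]:
    """Return rows of (rank, character, count, percent, cumulative percent)."""
    stock = set(word_stock)
    counts = Counter(ch for ch in corpus if ch in stock)
    total = sum(counts.values())
    # Sort from high to low frequency; ties keep word stock order.
    ranked = sorted(word_stock, key=lambda ch: -counts[ch])
    rows, cumulative = [], 0.0
    for rank, ch in enumerate(ranked, start=1):
        percent = 100.0 * counts[ch] / total if total else 0.0
        cumulative += percent
        rows.append((rank, ch, counts[ch], percent, cumulative))
    return rows


for row in frequency_report("的一是", "的的的一是的"):
    print(row)
```

The cumulative column shows what share of the corpus is covered by the characters up to a given rank.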

The corpus on which the statistical conclusion of fig. 7 is based draws on multidisciplinary content in modern Chinese; the breadth and balance of the corpus and the soundness of the statistical method are believed to guarantee the credibility of the word frequency and coverage data. Chinese character statistics on a number of Chinese texts from other sources show that the ranking of character occurrence frequency in those texts matches the word frequency ranking shown in fig. 7 very closely.

The following points about the word frequency table of the word stock also need to be explained:

Regarding the data stability of the word frequency table of the word stock: the conclusion report of the word frequency table is obviously closely tied to the specific corpus on which the statistics are based. For a specific context with a small number of characters or a high degree of specialization, the general statistical law may deviate slightly, but the conclusion of a data report compiled from a large corpus tends to be stable. In addition, the use frequency of each Chinese character in the character library changes as times and cultures change. In the invention, the data of the word frequency table of the word stock is regarded, from an empirical point of view, as relatively stable, and it serves as the reference for the order in which the single characters are compared with the search text in each round of matching during keyword retrieval.

Regarding the dynamic data adjustment of the word frequency table of the word stock: the initial assignment state of the whole table is the set of use frequency values of each Chinese character in the Chinese character library, generated at one time from statistics on a Chinese corpus of a certain scale. In the invention, this initial data state may be kept unchanged for a long time. Alternatively, starting from the initial assignment state, all Chinese characters in the searched text content may be added to the statistics of the table during everyday keyword searches, so that the word frequency table database of the word stock is continuously updated and the word frequency attributes it reflects come closer to the search texts the user frequently deals with. After all, the regularity of word frequency differences is the basis on which the technical effect of the invention is achieved.

Regarding the data ordering of the word frequency table of the word stock: the records in the word frequency table database may be arranged from high frequency to low frequency, from low frequency to high frequency, or in an order unrelated to the use frequency values. The function of the table is only to record the differences between the use frequency values of the Chinese characters in the character library; in short, no frequency ordering among the records of the word frequency table database is required.

Regarding the data storage location of the word frequency table of the word stock: the memory that stores the table data belongs to the external storage of the computer. Because the table data may go through a continuous accumulation of statistics, it exists independently of the core part of the software implementing the invention and is set up specifically to solve the technical problem addressed by the invention.

S2: For each Chinese character making up the Chinese search keyword, query the use frequency value recorded for it in the word frequency table of the word stock.

The Chinese search keyword is broken down into the Chinese characters it contains, and the use frequency value of each single character is queried one by one by reading the word frequency table of the word stock set up in step S1, according to its data records and their addressing mode.

In the invention, for convenience of data statistics and updating, the use frequency value of a Chinese character is represented in the word frequency table of the word stock by the total number of occurrences of that character in the whole statistical corpus.

In addition, a reasonable data storage structure must be designed for the word frequency table of the word stock; otherwise, finding the use frequency value of a given Chinese character among thousands of character records is time-consuming.

S3: Arrange all Chinese characters of the search keyword in ascending order of use frequency value to generate a keyword word frequency ordering table.

The Chinese characters forming the search keyword are sorted from small to large according to the use frequency values queried in step S2, and the keyword word frequency ordering table is generated.
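Steps S2 and S3 can be sketched as follows. This is a minimal illustration only: the function name, the dictionary used as the frequency table, the handling of characters missing from the table, and the counts themselves are assumptions made for the example and are not taken from fig. 7.

```python
def keyword_frequency_order(keyword, freq_table):
    """Return the positions of `keyword` sorted by ascending use frequency.

    Characters missing from the table are treated as frequency 0, so they
    are compared first (an assumption of this sketch). Python's sort is
    stable, so equal frequencies keep their natural keyword order.
    """
    return sorted(range(len(keyword)), key=lambda i: freq_table.get(keyword[i], 0))


# Hypothetical occurrence counts, NOT the values of fig. 7.
freq_table = {"供": 9000, "应": 40000, "链": 1500}
order = keyword_frequency_order("供应链", freq_table)
print(order)  # [2, 0, 1]: third character first, then first, then second
```

The result is a list of keyword positions rather than a list of characters, so that step S4 still knows which position of the search text each character has to be compared with.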

The keyword word frequency ordering table expresses the order of single-character comparison used in each round of matching the search keyword against the search text. This order differs from the natural order usually adopted in the prior art, in which the earlier characters of the search keyword are compared first and the later characters afterwards.

S4: Compare the single characters of the search keyword with the single characters at the corresponding positions in the search text, in the order given by the keyword word frequency ordering table.

When the search keyword input by the user is searched for in the search text of the Chinese context, one round of single-character comparison is carried out for the continuous Chinese character string starting at each character position of the search text, from the first character towards the end of the text. In each round, the single-character order represented by the keyword word frequency ordering table generated in step S3 is used as the order in which the Chinese characters of the search keyword are compared with the characters at the corresponding positions in the search text.

The search text of the Chinese context may be Chinese text stored locally on the computer or Chinese text on a web page stored on a server; it may be a complete local file, the Chinese text of a web page file, Chinese text made up of several files as a whole, or the Chinese text within a designated range of one file.

In each round of single-character comparison, as soon as the two characters at any corresponding position are found to differ, the current round is abandoned immediately. The current round matches successfully only when a continuous character string in the search text is identical to the whole Chinese search keyword in both character composition and character order. The rounds continue until every continuous character string in the search text that has the same length as the Chinese search keyword has been compared with the keyword.
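The rounds of step S4 can be sketched as below. The description does not fix which conventional pattern matching algorithm is used, so a naive sliding-window scan is swapped in here as an assumption; the function name and the `order` argument (a list of keyword positions, rarest character first) are likewise illustrative.

```python
def find_keyword(text, keyword, order):
    """Return the start positions at which `keyword` occurs in `text`.

    `order` lists the keyword positions in the sequence in which they are
    compared within each round, for example rarest character first.
    """
    hits = []
    for start in range(len(text) - len(keyword) + 1):
        for i in order:
            if text[start + i] != keyword[i]:
                break  # characters differ: quit the current round at once
        else:
            hits.append(start)  # every position matched in this round
    return hits


print(find_keyword("abcabc", "bc", [1, 0]))  # [1, 4]
```

The comparison order changes only how early a failing round is abandoned; the set of reported positions is the same for any order.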

The above describes the four steps of the method of the invention. The whole technical scheme rests on one assumption: the word frequency table of the word stock generated statistically from the specific Chinese corpus in step S1 is used in step S4 as the basis for estimating how frequently each Chinese character of the search keyword occurs in the search text. From the standpoint of empirical linguistic research, it is reasonable to take the frequency with which each character of the search keyword occurred in the earlier Chinese corpus as its frequency in the current search text. Moreover, all Chinese characters in the searched text content can be included in the subsequent statistics of the word frequency table during everyday keyword searches, so that the table is continuously updated and comes closer to the actual occurrence frequency of each Chinese character in the search texts the user usually deals with.

For a given search text and a given search keyword, the efficiency of a search method can be measured by the total number of single-character comparisons required over the whole matching process, which is the sum of the single-character comparisons made in every round of matching. Rounds that match successfully occur only under specific conditions, so the total depends mainly on the number of single-character comparisons already made in each failed round by the time its failure is established. The first single-character comparison of each round is unavoidable, because no conclusion about the match can be reached without comparing at least one character. To reduce the total number of single-character comparisons, the key problem is therefore how to avoid, as far as possible, the second, third and subsequent single-character comparisons in each failed round. According to the technical scheme of the invention, the Chinese characters of the search keyword that are used at low frequency are compared with the search text first, so that the failure of a round is established at the first character comparison as often as possible and further comparisons are avoided. The technical effect of reducing the total number of single-character comparisons in the keyword search process can therefore be expected.
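The efficiency measure described above, the total number of single-character comparisons, can be illustrated with a small counting sketch. The toy text, the two-letter keyword and the naive sliding-window scan are assumptions chosen only to make the arithmetic easy to follow; Latin letters stand in for Chinese characters.

```python
def count_comparisons(text, keyword, order):
    """Return (number of matches, total single-character comparisons)."""
    matches = comparisons = 0
    for start in range(len(text) - len(keyword) + 1):
        for i in order:
            comparisons += 1
            if text[start + i] != keyword[i]:
                break  # failed round: no further comparisons
        else:
            matches += 1
    return matches, comparisons


text = "aaaaab"  # toy text in which "a" is frequent and "b" is rare
natural = count_comparisons(text, "ab", [0, 1])     # keyword order
rare_first = count_comparisons(text, "ab", [1, 0])  # rare character first
print(natural)     # (1, 10)
print(rare_first)  # (1, 6)
```

Both orders find the single match, but the four failed rounds cost two comparisons each in the natural order and only one each when the rare character is compared first.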

In the invention, as in the prior art, the Chinese character string serving as the search keyword is first broken down into single Chinese characters, which are then compared one by one with single characters in the search text. The technical scheme of the invention differs from the prior art in that the order of comparison is not the order in which the characters are arranged in the search keyword; instead, the characters forming the search keyword are compared in order of their occurrence frequency in the whole context, those with lower occurrence frequency first and those with higher occurrence frequency afterwards. In other words, in the prior art the temporal order in which the single characters are compared with the corresponding positions of the search text always coincides with their positional order in the search keyword, whereas in the invention this is not necessarily so, although the two orders sometimes happen to be the same.

In the prior art, whichever pattern matching algorithm is used to search for a Chinese search keyword, differences in how frequently each character of the keyword occurs in the whole search text are not considered. In a Chinese context, particularly in the corpus environment of a large text, if the regularity and characteristics of the differences in Chinese character use frequency are fully exploited and combined with a suitable storage structure and an efficient algorithm design, the total number of single-character comparisons in the whole search process can certainly be reduced. Although the final implementation of the technical scheme of the invention also uses a conventional prior-art algorithm, the invention can further optimize retrieval efficiency in the Chinese context on the basis of the prior art.

Fig. 6 shows, in block diagram form, the difference between the prior art and the solution of the invention shown in fig. 5. The prior art involves only step S0 in the figure, in which a conventional pattern matching algorithm implements the keyword search operation. In the present solution, the keyword search operation comprises the four steps S1, S2, S3 and S4 of fig. 5. Although a conventional pattern matching algorithm is still used, the solution differs from the prior art in the order in which the characters of the keyword are compared with the characters at the corresponding positions in the search text, and this difference is realized by means of the word frequency table of the word stock and the keyword word frequency ordering table. The technical field of the invention is further limited to methods for searching for keywords composed of two or more Chinese characters in the search text of a Chinese context.

In the following, the four Chinese search keywords listed in fig. 8a and fig. 8b (fig. 8b is a continuation of fig. 8a) are taken as examples to illustrate schematically how the invention differs from the prior art in the character-by-character comparison order during keyword search.

In the prior art, when the four Chinese words 'correction', 'transmission', 'vegetables' and 'supply chain' are used as keywords to search text content, the character-by-character comparison order is the natural order of the Chinese characters in each search keyword: the first character is compared, then the second and, for the three-character keyword 'supply chain', then the third.

In the invention, the word frequency ranking determines the order in which the characters of the search keyword are compared. The comparison order of each Chinese character in the matching of the whole search keyword is determined not by its natural position in the search keyword but by its position in the use frequency ordering of all Chinese characters forming the search keyword: the single characters used at low frequency are compared with the search text first, and the single characters used at high frequency afterwards.

Before the character-by-character comparison of the search keyword 'transmission' with the search text, the use frequency of each of its two characters (that is, the total number of occurrences, or the occurrence percentage, of the character in the corpus statistics) is first queried in the word frequency table of the word stock of fig. 7. One character has rank 367, a total of 55119 occurrences in the corpus and a use frequency of 0.05190; the other has rank 1515, a total of 7739 occurrences and a use frequency of 0.00729. The character with rank 1515, being used less frequently, is therefore compared with the search text first.
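The figures quoted for the two characters of 'transmission' can be checked against the stated corpus size. The sketch below assumes that the use frequency is the occurrence count expressed as a percentage of all 106205826 corpus characters, and it refers to the two characters only by the order in which the description lists them; the function name is illustrative.

```python
CORPUS_SIZE = 106_205_826  # total corpus characters stated in the description


def use_frequency_percent(count, corpus_size=CORPUS_SIZE):
    """Occurrence count as a percentage of all corpus characters."""
    return 100.0 * count / corpus_size


counts = [55119, 7739]  # the two characters, in the order listed above
percents = [f"{use_frequency_percent(c):.5f}" for c in counts]
order = sorted(range(len(counts)), key=lambda i: counts[i])
print(percents)  # ['0.05190', '0.00729']
print(order)     # [1, 0]: the character with 7739 occurrences is compared first
```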

For the other three Chinese search keywords, 'correction', 'vegetables' and 'supply chain', the comparison order of the single characters can be derived in the same way; for 'correction' and 'vegetables', the resulting order happens to be the same as in the prior art.

The Chinese corpus used for word frequency statistics is a basic resource for achieving the technical effect of the invention, and the word frequency table of the word stock generated from it is a specific application of empirical linguistic research methods in the technical field of keyword search.

As can be seen from fig. 5 and 6, in the matching process of the invention the comparison order of the single characters of the search keyword depends on the keyword word frequency ordering table, and the generation of that table depends on the use frequency values recorded in the word frequency table of the word stock. The accuracy and objectivity of the differences between the use frequency values recorded for the Chinese characters in the word frequency table of the word stock are therefore important for achieving the technical effect of the invention. Thus, as a further optimization measure, the word frequency table of the word stock is further defined in two respects: first, its initial assignment state; second, a mechanism by which the use frequency value recorded for each Chinese character is dynamically accumulated as the Chinese characters of the search texts are read.

The invention uses modern Chinese works totalling on the order of a hundred million Chinese characters as the Chinese corpus to generate the statistical report of fig. 7. One very clear conclusion from this report is that a small number of the Chinese characters in a character library account for most use in real Chinese contexts. For example, in the use frequency ranking of the 8105 Chinese characters, the cumulative context coverage of the 10 most frequently used characters is 16%, that of the top 50 is 35%, that of the top 100 is 46%, and that of the top 500 is 77%, and so on.
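The cumulative coverage figures above are sums of occurrence counts over the highest-ranked characters divided by the corpus total. A minimal sketch of that calculation follows; the function name and the five counts are hypothetical and are not data from fig. 7.

```python
def cumulative_coverage(counts, top_n):
    """Percentage of the corpus covered by the `top_n` most frequent characters."""
    ranked = sorted(counts, reverse=True)
    return 100.0 * sum(ranked[:top_n]) / sum(ranked)


counts = [50, 30, 10, 6, 4]  # hypothetical occurrence counts of five characters
print(cumulative_coverage(counts, 2))  # 80.0
```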

Regarding the initial assignment state of the word frequency table of the word stock, the 50 Chinese characters found by the inventor's statistics to be used with the highest frequency (i.e. occurring most often in the statistical Chinese corpus) are, in order: one, yes, i, me, no, in, he, person, present, this, one, to, say, you, up, people, up, to, just, big, her, ground, in, time, that, son, yes, also, want, in, get, how, and, middle, meeting, out, from, under, in, can, past, right, go, birth, back, no, see, all, good, energy.

In a preferred embodiment, in the initial data assignment of the word frequency table of the word stock, the 50 Chinese characters with the largest use frequency values are consistent with the above 50 Chinese characters in both composition and order.

A word frequency table of the word stock that is generated once from statistics on a large-scale Chinese corpus and then kept unchanged for a long time may also be used in the technical scheme of the invention. As a preferred embodiment, however, the various search texts that the user encounters in search operations are treated as additional Chinese corpus material: all Chinese characters they contain are added to the statistics of the word frequency table, so that its data is continuously supplemented and updated. Such continuous updating brings the table closer to the contexts the user actually faces in search operations and further ensures the accuracy and objectivity of the word frequency attributes it records.

In the invention, the use frequency value recorded for a Chinese character in the word frequency table of the word stock takes the form of its total number of occurrences in the whole Chinese corpus. To update the table, it is therefore only necessary, during the matching of the search keyword, to locate in the table the record of each Chinese character read from the search text and add 1 to the use frequency value originally recorded for that character.
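The add-1 update can be sketched as follows. A dictionary stands in for the word frequency table here, and characters with no record in the table are simply skipped; both choices, like the function name, are assumptions of this illustration.

```python
def update_frequency_table(table, text):
    """Add 1 to the recorded count of every library character read in `text`.

    Characters without a record in `table` (outside the character library)
    are skipped; this is an assumption of the sketch.
    """
    for ch in text:
        if ch in table:
            table[ch] += 1
    return table


table = {"a": 5, "b": 0}  # hypothetical initial counts
update_frequency_table(table, "abac")
print(table)  # {'a': 7, 'b': 1}
```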

In step S2, the method of looking up the use frequency value of a Chinese character in the word frequency table database naturally depends on the specific storage structure of the table and on the addressing mode based on that structure. Searching the database generally incurs some time cost, and the cost is especially large if all Chinese character records in the whole table are searched and compared one by one.

Fig. 7 shows an example of the word frequency table of the word stock, and the records of the table could indeed be stored in memory in the order of the use frequency values shown in that example, which makes the idea of the invention easier for the public to understand. This does not mean, however, that such a storage structure is the best mode when the word frequency table of the word stock is implemented in computer software.

In order to improve the convenience of data reading and writing in the word frequency table of the word stock, in particular to improve the execution efficiency of reading and updating the use frequency value of the Chinese characters in the word frequency table of the word stock, a more reasonable technical scheme must be provided for the computer storage structure of the word frequency table of the word stock.

In fact, in the information processing of a computer system, Chinese characters are stored, transmitted and retrieved in the form of internal codes, and the values of these internal codes are distributed in a relatively concentrated range. The word frequency table of the word stock can therefore be stored linearly, with its storage addresses related to the Chinese character internal codes by a simple arithmetic rule; that is, a direct mapping is created between the addressing of the table and the internal codes. Naturally, the internal codes must correspond one-to-one with the storage addresses that hold the use frequency values in the table.

In view of this, and in combination with the computer data storage structure of the word frequency table of the word stock, the invention provides a further optimization of the method for reading and writing its data: the value of the internal code of any Chinese character in the character library, after addition or subtraction of a constant, directly gives the storage position of that character's use frequency value in the table. When the use frequency value of a Chinese character needs to be queried or updated, its storage position can thus be derived directly from the internal code by an addition or subtraction; that is, direct addressing of the data is achieved, the time-consuming record-by-record search of the table is avoided, and the time complexity of the software implementation is optimized. This address mapping is used both for addressing the read operations that precede the sorting of the use frequency values of the characters of the search keyword and for addressing the write operations that add 1 to the use frequency value of a character in the table.

To further explain how a direct mapping is established between the read and write addresses of each Chinese character's use frequency value in the word frequency table of the word stock and the internal code of the character, three specific character library standards are taken as examples.

Example one of a Chinese character library: the national standard GB/T2312-1980 basic set of Chinese character coding character sets for information exchange.

The national standard GB/T2312-1980 basic set of Chinese character coding character sets for information exchange contains 6763 Chinese characters in total and encodes all its characters with a two-dimensional matrix method: a square matrix of 94 rows (zones) and 94 columns (positions) is constructed and the characters are filled into it, with the 6763 Chinese characters occupying zones 16 to 87. In this national standard, the storage of Chinese characters in a computer is based on the zone-position code; the zone code and the position code each occupy one storage byte, so each Chinese character occupies two storage bytes. Fig. 9 shows the entries at the four corners of the zone code and position code layout of the 6763 Chinese characters contained in the standard, giving a glimpse of the layout of the whole character library.

In the invention, the coding order of the Chinese characters in the national standard is used as the addressing order of the Chinese characters in the word frequency table of the word stock. A two-dimensional array stores the use frequency value of each Chinese character in the table. To reduce the storage space of the array, its size need only cover the Chinese character part of the national standard, namely zones 16 to 87, i.e. 72 zones (rows) of 94 columns each; the whole two-dimensional array may be denoted A[72][94].

The Chinese characters in the national standard do not start from [0,0], whereas the subscript addresses of the two-dimensional array A[72][94] in the computer are usually addressed from [0,0]. To calculate from a Chinese character's internal code the storage position of its use frequency value in the array A[72][94], it is therefore only necessary to shift the code value by a constant offset.

As prior art, fig. 10a shows the conversion relations among the internal codes, national standard codes and zone-position codes of the 6763 Chinese characters in GB/T2312-1980. Fig. 10a also shows how the internal code of a Chinese character is converted to the row and column subscript address of the array element that stores the character's use frequency value in the two-dimensional array A[72][94] of the word frequency table of the word stock: the latter is obtained from the former by the hexadecimal subtraction "-B0A1H".

For example, if the two bytes used to store a Chinese character of the national standard GB/T2312-1980 in a computer are written in hexadecimal as XY (Hex), the use frequency value of that character is stored at position A[X-B0H, Y-A1H]. With reference to fig. 10a, the total offset of the row and column subscript addresses in the two-dimensional array A[72][94] is B0A1H = 8080H + 2020H + 1001H.
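The subtraction A[X-B0H, Y-A1H] can be sketched with Python's built-in "gb2312" codec, which yields the two-byte internal code of a character. The function name and the range check are assumptions of this sketch; the two test characters are the first and last Chinese characters of GB/T2312-1980 (internal codes B0A1H and F7FEH).

```python
def gb2312_table_index(ch):
    """Map a GB/T2312-1980 Chinese character to its (row, column) in A[72][94]."""
    high, low = ch.encode("gb2312")     # the two bytes X and Y of the internal code
    row, col = high - 0xB0, low - 0xA1  # the "-B0A1H" offset, applied per byte
    if not (0 <= row < 72 and 0 <= col < 94):
        raise ValueError(f"{ch!r} lies outside zones 16-87")
    return row, col


print(gb2312_table_index("啊"))  # (0, 0)
print(gb2312_table_index("齄"))  # (71, 93)
```

No searching is involved: the position follows from two subtractions, which is the direct addressing the description aims at.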

In fig. 10b, the first Chinese character of the national standard character library, "啊", is taken as an example to illustrate the conversion between the internal code and the subscript address of the two-dimensional array A[72][94]. The zone code and position code of "啊" are the decimal numbers 16 and 01, which written in hexadecimal give the zone-position code 1001H. Adding 2020H to this value gives the national standard code 3021H of "啊", and adding a further 8080H gives its internal code B0A1H. The corresponding storage position of the use frequency value of "啊" in the word frequency table of the word stock is A[0,0].
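The chain of conversions in this example (zone-position code, then +2020H, then +8080H) can be written out as a short sketch. The function name is illustrative; the offsets are those stated above.

```python
def zone_position_to_codes(zone, position):
    """Return (zone-position code, national standard code, internal code) as integers."""
    quwei = (zone << 8) | position  # zone in the high byte, position in the low byte
    national = quwei + 0x2020       # national standard code
    internal = national + 0x8080    # internal code
    return quwei, national, internal


q, n, m = zone_position_to_codes(16, 1)
print(f"{q:04X} {n:04X} {m:04X}")  # 1001 3021 B0A1
print(f"{m - 0xB0A1:04X}")         # 0000, i.e. row 0, column 0 of A[72][94]
```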

If the values of the two bytes of a Chinese character's internal code were used directly as the subscript address of its use frequency value in the two-dimensional array A, without the B0A1H offset of the row and column subscripts, the array A would occupy a very large storage space. Referring to fig. 9, the internal code of the last Chinese character of the national standard GB/T2312-1980, "齄", is F7FEH; the decimal values of its high and low bytes F7 and FE are 247 and 254, and the two-dimensional array A is addressed from [0,0] rather than [1,1]. The capacity of the array A would therefore reach (247 + 1) × (254 + 1) = 63240 elements, that is, A[248][255], far more than the 6763 Chinese characters of the national standard, which wastes storage space unnecessarily. In short, offsetting the addresses as a whole bounds the array size of the word frequency table of the word stock and greatly reduces its computer storage space; that is, the space complexity of the program is optimized.
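The storage comparison above is simple arithmetic, written out here for checking. The variable names are illustrative.

```python
# Subscripts taken straight from the internal-code bytes (last character F7FEH).
direct_rows, direct_cols = 0xF7 + 1, 0xFE + 1
# Subscripts after the B0A1H offset: zones 16-87, 94 positions per zone.
offset_rows, offset_cols = 87 - 16 + 1, 94

direct_slots = direct_rows * direct_cols
offset_slots = offset_rows * offset_cols
print(direct_slots, offset_slots)  # 63240 6768
print(offset_slots - 6763)         # 5 array elements remain unused
```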

Example two of the Chinese character library: the national standard GB/T12345-1990 auxiliary Chinese character set for information exchange.

The national standard GB/T2312-1980 "Chinese character encoding character set for information exchange, basic set" and the national standard GB/T12345-1990 "Chinese character encoding character set for information exchange, auxiliary set" are basically the same in character encoding scheme. The differences are: the former contains 6763 Chinese characters in total and the latter 6866; the Chinese character area of the former lies in zones 16 to 87 and that of the latter in zones 16 to 89 (the newly added zones 88 and 89 hold the 103 Chinese characters added in the latter national standard).

When this national standard is used as the Chinese character library, the storage structure of the word frequency table of the word stock and the addressing calculation for reading and writing data can follow the scheme of example one of the Chinese character library exactly. The only difference is that the size of the two-dimensional array changes from A[72][94] in the former to A[74][94] in the latter.

Example three of a Chinese character library: the national standard GB18030-2022 "Chinese coding character set for information technology".

When the Chinese character library adopts the 8105 standard Chinese characters included in the General Standard Chinese Character Table issued by the State Council (State Council document [2013] No. 23), the Chinese character library system of the computer must meet the requirements of the national standard GB18030-2022 "Chinese coding character set for information technology", which supports this 8105-character library. The national standard GB18030-2022 contains more Chinese characters; it is compatible with the double-byte Chinese character coding scheme adopted by GB/T2312-1980 and GB/T12345-1990, and additionally uses an extended four-byte coding for some of the Chinese characters included in the General Standard Chinese Character Table.

In this case, the internal codes of the four-byte coded Chinese characters are mapped to storage addresses in the array A in one of two ways. Either the storage of the already defined array A is extended, and the word frequency values of all four-byte Chinese characters are shifted as a whole to the tail of the array; or A is defined as an array of unequal length, so that it stores the use frequency values of both the double-byte and the four-byte Chinese characters.

In the prior art, in the computer software implementation of keyword search, for each round of matching and single word comparison in each round, the pointer position is usually adjusted in sequence by using a method of adding 1 to a loop variable (for example, loop variables j and k in fig. 1 and 2); in the present invention, the comparison order of each Chinese character in the search keyword structure is not the natural arrangement order in the structure, but is changed continuously in a jumping manner.

In order to realize the change of the jumping property, the invention adopts the following technical means:

First, the keyword word frequency sorting table is expressed as a number sequence with the same length as the search keyword, in which each number is the bit sequence number (position) of one Chinese character in the search keyword. The order of the numbers in the sequence is the result of sorting, from small to large, the use frequency values queried for each Chinese character from the word frequency table of the word stock, and it is also the order in which the single characters of the keyword are compared with the text in each round of the keyword matching operation. Because the Chinese characters corresponding to the numbers in the sequence are arranged from low frequency to high frequency, when the search keyword and the search text are compared character by character, the Chinese characters with a low use frequency are compared first and those with a high use frequency afterwards. In addition, following the convention of computer programming languages, the bit sequence number of the first Chinese character in the search keyword is recorded as 0, and the bit sequence numbers of the subsequent Chinese characters are recorded as 1, 2, 3, …, (n-1) in turn (where n is the length of the search keyword, i.e., the number of Chinese characters it contains).

Then, an integer array C[n] with a capacity of n (where n is the length of the search keyword, i.e., the number of Chinese characters it contains) is set up to convert the numerical subscript from the natural sequence 0, 1, 2, …, (n-1) of the prior art to the word frequency order of the invention. When the elements of the array are assigned, the bit sequence numbers of the Chinese characters of the search keyword are stored in the order given by the keyword word frequency sorting table, from low frequency to high frequency. As a result, when the subscript pointer k of the array C[n] increases from 0, the sequence of characters T[C[k]] selected from the search keyword T is the order of single Chinese character comparison in each round of the search matching.
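A minimal Python sketch of how the array C[n] might be filled is given below. The function name `build_comparison_order` and the frequency values are hypothetical, chosen only for illustration; a dictionary stands in for the word frequency table of the word stock, and characters missing from it are assumed to have frequency 0.

```python
from typing import Dict, List


def build_comparison_order(keyword: str, freq: Dict[str, int]) -> List[int]:
    """Return C, the keyword positions sorted by use frequency from low to high.

    sorted() is stable, so characters with equal frequency keep their
    natural left-to-right order.
    """
    return sorted(range(len(keyword)), key=lambda pos: freq.get(keyword[pos], 0))


# Hypothetical use frequency values, not taken from any real corpus.
sample_freq = {"供": 500, "应": 900, "链": 120}
T = "供应链"  # "supply chain"
C = build_comparison_order(T, sample_freq)

print(C)                     # [2, 0, 1]
print([T[c] for c in C])     # ['链', '供', '应']
```

As k runs over 0, 1, 2, the expression `T[C[k]]` visits the keyword's characters from the rarest to the most common, which is the comparison order the description requires.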

In the prior art, the comparison between a character of the searched text and a single character of the search keyword is represented as S[j] ↔ T[k]; in the present invention, this single character comparison is changed to S[i + C[j-i]] ↔ T[C[k]].

Therefore, the integer array C[n] realizes the conversion between the two kinds of address numbering: the natural arrangement order of the Chinese characters in the search keyword in the prior art is converted into the numerical order described in the keyword word frequency sorting table in the invention. This order is the order in which the Chinese characters of the search keyword are compared with the search text in the invention, namely the order of the single characters of the search keyword from the lowest use frequency to the highest.

In the following description, referring to fig. 8a and 8b, take the Chinese search keyword "供应链" (supply chain). The bit sequence numbers of its three Chinese characters "供" (supply), "应" (response) and "链" (chain) are 0, 1 and 2 in order. In the present invention, the three Chinese characters are arranged by use frequency from small to large as "链" → "供" → "应", and the bit sequence numbers of the characters in this order within the original search keyword are 2, 0 and 1, so the keyword word frequency sorting table is expressed as the sequence of three numbers 2, 0 and 1. After the numbers 2, 0 and 1 in the keyword word frequency sorting table are stored in order in a transformation array C[3], when the position pointer k in the search keyword is scanned in the natural sequence 0 → 1 → 2, the value of C[k] changes as 2 → 0 → 1 and T[C[k]] correspondingly changes as "链" → "供" → "应". Thus, while the subscript pointer is simply incremented by 1 in the loop, T[C[k]] produces the order required by the invention for comparing the single characters of the search keyword with the search text.

Therefore, in the flowchart of single character comparison in each round of the matching operation in the prior art shown in fig. 2, the judgment condition for whether the single characters are consistent, "S[j] == T[k]", is changed to "S[i + C[j-i]] == T[C[k]]", and the rest of the flowchart remains unchanged; this yields the flowchart of single character comparison in each round of the matching operation in the present invention.
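The modified judgment condition can be sketched as a complete matching loop in Python. This is an illustration under stated assumptions rather than the flowchart of fig. 2 itself: the function name `find_keyword` is hypothetical, the comparison order C is passed in ready-made, and the text pointer is assumed to satisfy j = i + k within a round, so that S[i + C[j-i]] is written as S[i + C[k]].

```python
from typing import List


def find_keyword(S: str, T: str, C: List[int]) -> List[int]:
    """Return every start position i at which keyword T occurs in text S.

    In each round the characters are compared in the order given by C
    (rarest keyword character first), and the round is abandoned at the
    first mismatch.
    """
    n = len(T)
    hits = []
    for i in range(len(S) - n + 1):      # one round per start position
        for k in range(n):               # k is still incremented by 1
            p = C[k]                     # position taken from the sorting table
            if S[i + p] != T[p]:         # S[i + C[k]] == T[C[k]] fails
                break
        else:
            hits.append(i)               # all n characters matched
    return hits


text = "全球供应链与供应商管理供应链"
print(find_keyword(text, "供应链", [2, 0, 1]))  # [2, 11]
```

In this example the round starting at position 6 ("供应商") is rejected after a single comparison, because the rarest keyword character "链" is tested first; with the natural order 0, 1, 2 the same round would need three comparisons.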

It can be seen that the transition from the traditional single character comparison order to the low-frequency to high-frequency order is realized by setting up the integer array C[n], storing the keyword word frequency sorting table in it, and then changing the subscript of the keyword's single characters in the traditional pattern matching algorithm from k to C[k]. The integer array C[n] occupies very little storage, and reading and writing its data involves little work, so the method is simple and easy to implement.

The embodiment of the invention provides computer equipment, which comprises: a processor, a memory and a bus, the memory storing machine-readable instructions executable by the processor, the processor and the memory communicating via the bus when the computer device is running, the processor executing the machine-readable instructions to implement a method for retrieving keywords in a chinese context as described in any of the above.

The memory and the processor are all general purpose memory and processor, and are not specifically limited herein, and the processor can execute the above method for retrieving keywords in the chinese context when executing a computer program stored in the memory.

Those skilled in the art will appreciate that the configuration of the computer device is not intended to be limiting and may include more or fewer components than those shown, or some components may be combined, some components may be split, or a different arrangement of components.

In some embodiments, the computer device may further include a touch screen for displaying a graphical user interface (e.g., a launch interface for an application, etc.) and receiving user operations with respect to the graphical user interface (e.g., launch operations for an application, etc.). The touch screen may include a Display panel and a touch panel, wherein the Display panel may be configured in the form of an LCD (Liquid Crystal Display), an OLED (Organic Light-Emitting Diode), and the like, and the touch panel may collect contact or non-contact operations on or near the touch panel by a user and generate preset operation instructions, such as operations on or near the touch panel by the user using a finger, a stylus, and any suitable object or accessory. In addition, the touch panel may include two parts of a touch detection device and a touch controller. The touch detection device detects the touch direction and gesture of a user, detects a signal brought by touch operation and transmits the signal to the touch controller; the touch controller receives touch information from the touch detection device, converts the touch information into information capable of being processed by the processor, sends the information to the processor, and receives and executes commands sent by the processor. Moreover, the touch panel can be realized by various types such as resistance, capacitance, infrared rays and surface acoustic waves, and can also be realized by any technology developed in the future. Further, the touch panel may overlay the display panel, the user may operate on or near the touch panel overlaid on the display panel according to a graphical user interface displayed by the display panel, the touch panel detects the operation thereon or near, and then the touch panel transmits the operation to the processor to determine a user input, and the processor then provides a corresponding visual output on the display panel in response to the user input. 
In addition, the touch panel and the display panel can be realized as two independent components or can be realized in an integrated manner.

Corresponding to the above method for searching keywords in the Chinese context, the embodiment of the invention also provides a computer readable storage medium, wherein the storage medium stores a computer program, and the computer program is used for executing the method for searching the keywords in the Chinese context.

The device provided by the embodiment of the present application can be specific hardware on the equipment, or software or firmware installed on the equipment. The device provided by the embodiment of the present application has the same implementation principle and technical effect as the foregoing method embodiments; for brevity, where the device embodiments do not mention a point, reference may be made to the corresponding contents in the foregoing method embodiments. It is clear to those skilled in the art that, for convenience and brevity of description, the specific working processes of the foregoing systems, apparatuses and units may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.

As will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.

In the embodiments provided in the present application, it should be understood that the disclosed apparatus and method may be implemented in other ways. The above-described embodiments of the apparatus are merely illustrative, and for example, a division of modules is merely a division of logical functions, and an actual implementation may have another division, and for example, a plurality of modules or components may be combined or integrated into another system, or some features may be omitted, or not executed.

Modules described as separate parts may or may not be physically separate, and parts displayed as modules may or may not be physical modules, may be located in one place, or may be distributed on a plurality of network modules. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment.

In addition, functional modules in the embodiments provided in the present application may be integrated into one processing module, or each module may exist alone physically, or two or more modules are integrated into one module.

The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.

These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

Finally, it should be noted that: although the present invention has been described in detail with reference to the above embodiments, it should be understood by those skilled in the art that various changes may be made and equivalents may be substituted for elements thereof without departing from the spirit and scope of the invention.

Claims

1. A method for searching keywords in a Chinese context, the searched keywords being composed of two or more Chinese characters, characterized in that the method comprises the following steps:

creating a word frequency table of a word stock in a computer memory, and recording a use frequency value which is counted by a Chinese language database and corresponds to each Chinese character in a Chinese word stock in a database of the word frequency table of the word stock;

according to records in the word frequency table database of the word stock, and in combination with an addressing mode recorded by the database, respectively inquiring a use frequency value corresponding to each Chinese character in the structure of the searched key words through reading operation on the word frequency table of the word stock;

in the context of Chinese to be retrieved, starting from the position of a first character of a text to the position of the tail part of the text, carrying out one round of single character comparison on a continuous Chinese character string starting from the position of any Chinese character in the text and a retrieval keyword, and in each round of single character comparison of the retrieval keyword, taking the single character arrangement sequence represented by a generated keyword character frequency ordering table as the sequence for comparing the characters at the corresponding positions in the retrieval text among the Chinese characters in the retrieval keyword;

the use frequency value of each Chinese character recorded in the word frequency table of the word stock is continuously updated as user search operations continue to scan various search texts; the continuous updating process comprises the following steps: in the matching process of the search keywords, for each Chinese character read from the search text, the corresponding record position in the word frequency table of the word stock is looked up, and 1 is added to the use frequency value of the Chinese character recorded at that position.

2. The method for searching keywords in Chinese context according to claim 1, wherein the process of creating a word frequency table of word stock comprises the steps of:

3. The method of claim 1, wherein the keyword word frequency sorting table is expressed as a sequence of the bit sequence values of the Chinese characters in the search keyword, and the bit sequence values are stored in order in a one-dimensional integer array; the comparison order of the Chinese characters of the search keyword in each round of the matching operation is formed by the successive values of the corresponding array elements as the subscript address of the integer array increases in sequence, i.e. it is the comparison order of the Chinese characters represented by the keyword word frequency sorting table.

4. The method for searching keywords in a chinese context according to claim 1, wherein the chinese context to be searched comprises a single chinese text saved on a computer local or server web page, or a chinese text within a specified range of one document thereof, or a chinese text composed of a plurality of documents thereof as a whole.

5. The method of claim 1, wherein in each round of single word comparison of the search keyword, once two single words at any corresponding position are found to be different, the current round of comparison is directly exited, and only when a continuous string in the search text is identical to the whole Chinese search keyword in terms of Chinese character composition and Chinese character sequence, the success of the current round of matching is determined until all continuous strings in the whole text of the search text having the same length as the Chinese search keyword are compared with the Chinese search keyword.

6. The method for searching keywords in Chinese context according to any of claims 1-5, wherein the word frequency table of the word stock is addressed in the following way: the method is characterized in that the self value of the in-machine code of any Chinese character in a Chinese character library of a character library character frequency table in a computer is directly arranged into the corresponding storage position of the use frequency value of the Chinese character in the character library character frequency table after the addition and subtraction operation of a constant value, and when the use frequency value of a certain Chinese character in the character library character frequency table is inquired or updated, the storage position of the use frequency value of the Chinese character in the character library character frequency table is directly deduced through the addition and subtraction operation based on the self value of the in-machine code of the Chinese character.

7. The method for searching keywords in Chinese context according to claim 6, wherein the word stock of the word frequency table of the word stock adopts the set of 6763 Chinese characters included in GB/T2312-1980 basic set of Chinese character coding set for information exchange or the set of 6866 Chinese characters included in GB/T12345-1990 auxiliary set of Chinese character coding set for information exchange, and stores the use frequency value corresponding to each Chinese character in the word stock in the matrix structure of a two-dimensional array; when the use frequency value of a Chinese character is read or written in the two-dimensional array matrix, the row and column addresses of the array element storing the use frequency value of the Chinese character are obtained by applying the operations "-B0H" and "-A1H" to the hexadecimal values of the high and low bytes of the internal code of the Chinese character, respectively.

8. An apparatus for retrieving a keyword in a chinese context, the retrieved keyword being composed of two or more chinese characters, comprising:

the keyword retrieval module is used for carrying out one round of single character comparison on a continuous Chinese character string starting from the position of a first character of a text to the position of the tail part of the text in the Chinese context to be retrieved, and taking the single character arrangement sequence represented by the generated keyword character frequency ordering table as the sequence for comparing the characters at the corresponding positions in the retrieval text among the Chinese characters in the retrieval keyword composition in each round of single character comparison of the retrieval keyword;

9. A computer device, comprising: a processor, a memory and a bus, the memory storing machine-readable instructions executable by the processor, the processor and the memory communicating over the bus when the computer device is running, the processor executing the machine-readable instructions to implement the method for retrieving keywords in a chinese context as claimed in any of claims 1-7.

10. A computer-readable storage medium, characterized in that the storage medium stores a computer program for executing the method of retrieving keywords in the chinese context according to any one of claims 1-7.