CN118364916A - News retrieval method and system based on large language model and knowledge graph - Google Patents
- Publication number
- CN118364916A (application number CN202410602922.3A)
- Authority
- CN
- China
- Prior art keywords
- news
- language model
- data
- large language
- knowledge graph
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Landscapes
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention belongs to the technical field of information retrieval, and aims to provide a news retrieval method and system based on a large language model and a knowledge graph. The method comprises the following steps: acquiring multi-source news data, and constructing a news database according to the multi-source news data; performing knowledge extraction on the multi-source news data in the news database by adopting an initial large language model to obtain news entities and news relations, and constructing a news knowledge graph according to the news entities and the news relations; integrating the news knowledge graph into the initial large language model by adopting a knowledge fusion module to obtain a final large language model; and retrieving news content based on the final large language model to obtain news retrieval results. The invention can provide richer news-related knowledge, more comprehensive news context information, and the latest news entities and news facts for news retrieval results, helping relevant personnel obtain smarter, more accurate and more personalized news retrieval results at lower cost.
Description
Technical Field
The invention belongs to the technical field of information retrieval, and particularly relates to a news retrieval method and system based on a large language model and a knowledge graph.
Background
With the development of the internet, the volume of news information has grown explosively, and the personalized retrieval demands of users, editors, journalists and related media staff for news are also increasing. Traditional knowledge-base news retrieval methods cannot retrieve relevant news accurately or in a personalized manner. To meet users' personalized retrieval demands for news, news retrieval methods based on knowledge graphs have appeared; a knowledge graph can semantically associate entities, relations and attributes, and thus provides richer semantic information.
However, in using the prior art, the inventors found that there are at least the following problems in the prior art:
because the construction of a news knowledge graph is incomplete and unknown entities and the latest facts cannot be effectively modeled, news retrieval based on a knowledge graph relies mainly on structural information and lacks the ability to understand news content; retrieval results that are accurate at the level of content understanding therefore cannot be obtained, and the precision of news retrieval needs to be improved.
Disclosure of Invention
The invention aims to solve the technical problems at least to a certain extent, and provides a news retrieval method and a news retrieval system based on a large language model and a knowledge graph.
In order to achieve the above purpose, the present invention adopts the following technical scheme:
In a first aspect, the present invention provides a news retrieval method based on a large language model and a knowledge graph, including:
acquiring multi-source news data, and constructing a news database according to the multi-source news data;
carrying out knowledge extraction on multi-source news data in the news database by adopting an initial large language model to obtain news entities and news relations, and constructing a news knowledge graph according to the news entities and the news relations;
Integrating the news knowledge graph into the initial large language model by adopting a knowledge fusion module to obtain a final large language model;
and searching news content based on the final large language model to obtain a news search result.
In one possible design, knowledge extraction is performed on multi-source news data in the news database by using an initial large language model to obtain news entities and news relations, including:
Acquiring a plurality of clause data corresponding to the multi-source news data in the news database by adopting an initial large language model;
Performing entity recognition on each clause data by adopting a pre-trained large model to obtain candidate entities in each clause data, and performing relation extraction on the candidate entities in each clause to obtain the initial relations between the candidate entities in each clause;
Performing generative relation labeling on the initial relations among the candidate entities by adopting the initial large language model, based on the candidate entities in each clause and the initial relations among them, to obtain candidate relations;
And screening the importance of the candidate entities in the clause data to obtain candidate entities with importance ranking greater than a preset ranking threshold, using the candidate entities as news entities, and using candidate relations between the candidate entities with importance ranking greater than the preset ranking threshold as news relations.
In one possible design, the initial large language model employs a ChatGLM model that includes an encoder and a decoder; correspondingly, acquiring a plurality of clause data corresponding to the multi-source news data in the news database by adopting an initial large language model comprises the following steps:
preprocessing the multisource news data in the news database to obtain preprocessed news data; wherein the preprocessed news data comprises a plurality of first word segmentation data;
Adopting an encoder of the initial large language model to encode the preprocessed news data to obtain first word embedding vectors of a plurality of first word segmentation data in the preprocessed news data;
And performing clause processing on the plurality of first word embedded vectors by adopting the decoder of the initial large language model to obtain a plurality of clause data.
In one possible design, constructing a news knowledge-graph according to the news entity and the news relationship includes:
Taking the news entities as nodes, taking news relations among the news entities as edges, and constructing a word graph structure;
and constructing and obtaining a news knowledge graph based on the word graph structure.
In one possible design, the knowledge fusion module includes a text encoder and a knowledge-graph encoder; correspondingly, integrating the news knowledge graph into the initial large language model by adopting a knowledge fusion module to obtain a final large language model, wherein the method comprises the following steps of:
receiving input text data entered into the initial large language model;
acquiring a text representation vector of the input text data by adopting a text encoder in the knowledge fusion module;
Adopting a knowledge graph encoder in the knowledge fusion module to encode the news entity and the news relationship in the news knowledge graph to obtain an entity representation vector corresponding to the news entity and a relationship representation vector corresponding to the news relationship;
Adopting the knowledge graph encoder to fuse the entity representation vector, the relation representation vector and the text representation vector to obtain a fusion representation vector;
and integrating the fusion expression vector into the initial large language model to obtain a final large language model.
In one possible design, obtaining a text representation vector of the input text data using a text encoder in the knowledge fusion module includes:
Preprocessing the input text data to obtain preprocessed text data; wherein the preprocessed text data comprises a plurality of second word segmentation data;
Encoding the preprocessed text data through the text encoder to obtain second word embedding vectors of a plurality of second word data in the preprocessed text data;
Performing multi-layer self-attention mechanism calculation on the plurality of second word embedded vectors through the text encoder to obtain position expression vectors of the plurality of second word embedded vectors;
Combining the plurality of second word embedding vectors with the plurality of position representation vectors through the text encoder to obtain a text representation vector of the input text data; wherein the text representation vector is:
TextEncoderOutput=TransformerEncoder(TokenEmbeddings+PositionEmbeddings);
Wherein TokenEmbeddings is a word vector sequence consisting of a plurality of second word embedded vectors; positionEmbeddings is a position vector sequence composed of a plurality of position representation vectors; transformerEncoder () represents a function that encodes an input sequence, which includes multiple layers of self-attention mechanisms and feedforward neural network layers.
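By way of non-limiting illustration, the input construction TokenEmbeddings+PositionEmbeddings can be sketched in Python as follows; the toy embedding function, the dimension of 4 and the identity stand-in for TransformerEncoder() are illustrative assumptions, not details from the patent:

```python
import math

def token_embeddings(tokens, dim=4):
    # Toy deterministic embedding (sum of character codes); a real system
    # would look tokens up in the model's embedding table instead.
    return [[(sum(ord(c) for c in t) % (i + 7)) / 10.0 for i in range(dim)]
            for t in tokens]

def position_embeddings(n, dim=4):
    # Sinusoidal position vectors, as in the original Transformer.
    return [[math.sin(p / 10000 ** (i / dim)) if i % 2 == 0
             else math.cos(p / 10000 ** ((i - 1) / dim))
             for i in range(dim)] for p in range(n)]

def transformer_encoder(seq):
    # Stand-in for TransformerEncoder(): identity pass, since the point
    # here is how the input sequence is built, not the attention layers.
    return seq

def text_encoder_output(tokens):
    tok = token_embeddings(tokens)
    pos = position_embeddings(len(tokens))
    # TokenEmbeddings + PositionEmbeddings: element-wise sum per position.
    summed = [[a + b for a, b in zip(t, p)] for t, p in zip(tok, pos)]
    return transformer_encoder(summed)
```

The sketch only shows that each position's word vector and position vector are summed before entering the encoder stack.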
In one possible design, the fusing the entity representation vector, the relationship representation vector and the text representation vector by using the knowledge-graph encoder to obtain a fused representation vector includes:
Acquiring an associated entity representation vector and an associated relation representation vector which have an associated relation with the text representation vector in the entity representation vector and the relation representation vector;
Carrying out fusion processing on the association entity representation vector, the association relation representation vector and the text representation vector to obtain a fusion representation vector; wherein, the fusion expression vector is:
SentenceRepresentation=TextEncoderOutput+α·KGEntityVectors;
Wherein + denotes splicing (concatenation); TextEncoderOutput is the text representation vector; α is a preset weight coefficient; KGEntityVectors is the combined vector obtained by weighted summation of the associated entity representation vectors and the associated relation representation vectors.
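A minimal sketch of this fusion step follows; the uniform averaging inside KGEntityVectors and the example α value are illustrative assumptions (the patent only specifies a weighted summation and a preset coefficient):

```python
def fuse(text_vec, entity_vecs, relation_vecs, alpha=0.3):
    n = len(entity_vecs) + len(relation_vecs)
    # KGEntityVectors: combined vector from the associated entity and
    # relation representation vectors (uniform weights assumed here).
    kg = [sum(v[i] for v in entity_vecs + relation_vecs) / n
          for i in range(len(text_vec))]
    # "+" is a splice, so the fused representation concatenates the
    # alpha-scaled knowledge vector onto the text representation vector.
    return text_vec + [alpha * x for x in kg]
```

For example, fusing a two-dimensional text vector with one entity vector and one relation vector yields a four-dimensional spliced representation.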
In one possible design, performing news content retrieval based on the final large language model to obtain a news retrieval result includes:
Customizing different prompting word templates according to different business scene requirements, and constructing a prompting word template library according to the different prompting word templates;
extracting a target prompt word template matched with a preset search requirement from the prompt word template library, receiving search request data, and combining the search request data with the target prompt word template to obtain target search request data;
and carrying out news retrieval on the target retrieval request data by adopting the final large language model to obtain a news retrieval result matched with the target retrieval request data.
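The prompt-word template library and the combination of retrieval request data with a matched template can be sketched as follows; the scenario names and template wording are hypothetical examples, not taken from the patent:

```python
TEMPLATE_LIBRARY = {
    # Hypothetical business-scenario templates in a prompt template library.
    "timeline": "List, in date order, the news events about: {query}",
    "background": "Summarize the background and key entities for: {query}",
}

def build_target_request(scenario, query):
    # Extract the target template matched to the retrieval requirement
    # and combine it with the retrieval request data.
    template = TEMPLATE_LIBRARY.get(scenario)
    if template is None:
        raise KeyError(f"no prompt template for scenario {scenario!r}")
    return template.format(query=query)
```

The resulting string is the target retrieval request data passed to the final large language model.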
In one possible design, after obtaining the news search result, the method further includes:
And carrying out news knowledge verification on the news search result by adopting a news knowledge assessment module to obtain a verification result, and feeding back the verification result to a database matched with the final large language model when the verification result characterizes that the matching degree of the news search result and the news knowledge graph is smaller than a preset matching degree threshold value.
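One simple realization of the verification-and-feedback loop is sketched below; the entity-overlap definition of matching degree and the threshold value are illustrative assumptions, since the patent does not fix how the matching degree is computed:

```python
MATCH_THRESHOLD = 0.6  # preset matching-degree threshold (illustrative)

def verify_and_feedback(result_entities, kg_entities, feedback_db):
    # Matching degree: fraction of entities in the retrieval result that
    # also appear in the news knowledge graph (one possible measure).
    if not result_entities:
        return 0.0
    match = len(set(result_entities) & set(kg_entities)) / len(result_entities)
    if match < MATCH_THRESHOLD:
        # Below threshold: feed the result back to the matched database
        # so the final large language model can be corrected later.
        feedback_db.append({"entities": result_entities, "match": match})
    return match
```

Results that match the knowledge graph well pass through; weak matches are recorded for feedback.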
In a second aspect, the present invention provides a news retrieval system based on a large language model and a knowledge graph, for implementing a news retrieval method based on a large language model and a knowledge graph as described in any one of the above; the news retrieval system based on the large language model and the knowledge graph comprises:
the database construction module is used for acquiring multi-source news data and constructing a news database according to the multi-source news data;
the knowledge graph construction module is in communication connection with the database construction module and is used for carrying out knowledge extraction on multi-source news data in the news database by adopting an initial large language model to obtain news entities and news relations, and constructing a news knowledge graph according to the news entities and the news relations;
The retrieval model generation module is in communication connection with the knowledge graph construction module and is used for integrating the news knowledge graph into the initial large language model by adopting the knowledge fusion module to obtain a final large language model;
And the news retrieval module is in communication connection with the retrieval model generation module and is used for retrieving news content based on the final large language model to obtain news retrieval results.
In a third aspect, the present invention provides an electronic device, comprising:
a memory for storing computer program instructions; and
A processor for executing the computer program instructions to perform the operations of the large language model and knowledge graph based news retrieval method as set forth in any one of the preceding claims.
In a fourth aspect, the present invention provides a computer program product comprising a computer program or instructions which, when executed by a computer, implement a large language model and knowledge graph based news retrieval method as claimed in any one of the preceding claims.
The beneficial effects of the invention are as follows:
The invention discloses a news-enhanced retrieval method and system jointly driven by a large language model and a knowledge graph, which can provide richer news-related knowledge, more comprehensive news context information, and the latest news entities and news facts for news retrieval results, helping relevant personnel obtain smarter, more accurate and more personalized news retrieval results at lower cost. Specifically, in the implementation process, multi-source news data are first acquired, and a news database is constructed from them; then, knowledge extraction is performed on the multi-source news data in the news database by adopting an initial large language model to obtain news entities and news relations, and a news knowledge graph is constructed according to the news entities and news relations; next, the news knowledge graph is integrated into the initial large language model by adopting a knowledge fusion module to obtain a final large language model; finally, news content retrieval is performed based on the final large language model to obtain news retrieval results. In this process, the method adopts the initial large language model to assist both knowledge extraction and the construction of the news knowledge graph, and then injects the news knowledge graph into the initial large language model to obtain the final large language model, so that real-time knowledge of the news field enters the model, the final large language model's understanding of news content information is improved, and the accuracy and degree of personalization of news content retrieval are improved.
Other advantageous effects of the present invention will be further described in the detailed description.
Drawings
FIG. 1 is a flow diagram of a news retrieval method based on a large language model and knowledge graph in an embodiment;
FIG. 2 is a block diagram of a news retrieval system based on a large language model and knowledge graph in an embodiment;
Fig. 3 is a block diagram of an electronic device in an embodiment.
Detailed Description
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the present invention is briefly described below with reference to the accompanying drawings and the description of the embodiments or the prior art. Obviously, the drawings described below show only some embodiments of the present invention, and a person skilled in the art can obtain other drawings from these drawings without inventive effort. It should be noted that the description of these embodiments is intended to aid understanding of the present invention, not to limit it.
Example 1:
The embodiment discloses a news retrieval method based on a large language model and a knowledge graph, which can be executed by a computer device with certain computing resources, such as a personal computer, a smart phone, a personal digital assistant or a wearable device, or by a virtual machine.
As shown in fig. 1, a news retrieval method based on a large language model and a knowledge graph may include, but is not limited to, the following steps:
S1, acquiring multi-source news data, and constructing a news database according to the multi-source news data; specifically, in this embodiment, external multi-source news manuscripts and the organization's own news manuscripts are captured, cleaned and sorted at regular intervals to obtain processed multi-source news data, which are then loaded into a news database; the capturing of the news manuscripts may be implemented by, but is not limited to, web crawlers, API (Application Programming Interface) interfaces or other data acquisition methods, which are not limited herein.
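The capture-clean-store step of S1 can be sketched as follows; the tag-stripping regex and the SQLite schema are illustrative assumptions standing in for a production cleaning pipeline and database:

```python
import re
import sqlite3

def clean(raw_html):
    # Strip markup and collapse whitespace -- a minimal stand-in for the
    # cleaning/sorting step; real pipelines use a proper HTML parser.
    text = re.sub(r"<[^>]+>", " ", raw_html)
    return re.sub(r"\s+", " ", text).strip()

def build_news_database(articles):
    # articles: (source, raw_text) pairs gathered by crawlers or APIs.
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE news (source TEXT, body TEXT)")
    db.executemany("INSERT INTO news VALUES (?, ?)",
                   [(src, clean(txt)) for src, txt in articles])
    db.commit()
    return db

db = build_news_database([("agency", "<p>Example  story</p>")])
```

An in-memory database is used here only so the sketch is self-contained; a persistent store would be used in practice.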
S2, carrying out knowledge extraction on multi-source news data in the news database by adopting an initial large language model (Large Language Model, LLM) to obtain news entities and news relations, and constructing a news knowledge graph according to the news entities and the news relations; specifically, step S2 discloses a technical scheme for aided construction of a news knowledge graph based on the initial large language model, and based on the technical scheme, a complete and accurate data basis can be provided for news retrieval of the subsequent large language model. The news entities and news relationships may be organized into a graph structure to form a rich knowledge representation, in which the news entities are nodes in the knowledge graph and the news relationships are edges connecting the nodes.
In step S2 of the present embodiment, knowledge extraction is performed on the multi-source news data in the news database by using an initial large language model to obtain a news entity and a news relationship, including:
S201, acquiring a plurality of clause data corresponding to multi-source news data in the news database by adopting an initial large language model.
Specifically, the initial large language model adopts a ChatGLM model including an encoder and a decoder, specifically the ChatGLM-6B model, an open-source dialogue language model supporting both Chinese and English that is implemented on the General Language Model (GLM) architecture. In the ChatGLM-6B model, the encoder converts the input sequence into a vector representation of fixed dimension, and the decoder generates the output sequence according to the encoder output and the current state. Correspondingly, in S201 of the present embodiment, acquiring, by using the initial large language model, a plurality of clause data corresponding to the multi-source news data in the news database includes:
S2011, preprocessing multi-source news data in the news database to obtain preprocessed news data; wherein the preprocessed news data comprises a plurality of first word segmentation data; specifically, in this embodiment, the preprocessing flow includes basic processing flows such as text word segmentation and punctuation removal, which are not limited herein.
S2012, adopting an encoder of the initial large language model to encode the preprocessed news data to obtain first word embedding vectors of a plurality of first word segmentation data in the preprocessed news data; the pre-processed news data may be encoded, or may be expressed as feature extraction of the pre-processed news data, that is, converting each element (such as a word, a character, or other entity) in the pre-processed news data into a first word embedding vector in the form of a vector.
S2013, performing clause processing on the first word embedded vectors by adopting a decoder of the initial large language model to obtain a plurality of clause data; it should be noted that, in this embodiment, a special tag may be preset as a sentence ending symbol, so as to identify sentence boundaries in the plurality of first word embedded vectors, so as to help divide the plurality of continuous first word embedded vectors into separate sentences. Specifically, the decoder of the initial large language model can predict whether the character corresponding to the next first word embedding vector in the plurality of first word embedding vectors is a sentence ending symbol one by one, and when the prediction probability exceeds a preset probability threshold, the decoder considers that the current character is followed by a new sentence; and sequentially carrying out clause processing on the plurality of first word embedded vectors to obtain a plurality of clause data.
It should be noted that, the steps S2011 to S2013 disclose a technical scheme for performing auxiliary clauses based on the initial large language model, that is, a process of obtaining multiple clause data according to the multi-source news data in the news database.
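The clause-splitting logic of S2013 can be sketched as follows; the punctuation-based probability function is a deterministic stand-in for the decoder's per-position sentence-end prediction, and the threshold value is illustrative:

```python
PROB_THRESHOLD = 0.5  # preset probability threshold (illustrative value)

def sentence_end_prob(token):
    # Stand-in for the decoder's prediction of the sentence-ending symbol;
    # a real model scores each position, here we flag end punctuation.
    return 1.0 if token in {"。", "！", "？", ".", "!", "?"} else 0.0

def split_clauses(tokens):
    # Walk the token sequence, starting a new clause whenever the
    # predicted end-of-sentence probability exceeds the threshold.
    clauses, current = [], []
    for tok in tokens:
        current.append(tok)
        if sentence_end_prob(tok) > PROB_THRESHOLD:
            clauses.append(current)
            current = []
    if current:
        clauses.append(current)
    return clauses
```

Each returned sublist corresponds to one piece of clause data passed on to entity recognition.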
S202, performing entity recognition on each clause data by adopting a pre-trained large model to obtain candidate entities in each clause data, and performing relation extraction on the candidate entities in each clause to obtain initial relations between the candidate entities in each clause; specifically, the candidate entities include person names, organization names and the like, which are not limited herein. In this embodiment, a BERT (Bidirectional Encoder Representations from Transformers) model may be used to identify entities in the clause data and extract relations between entities, and DeBERTa (Decoding-enhanced BERT with disentangled attention) may further be introduced in the relation extraction stage, adding relative position encoding and a bidirectional attention enhancement module so as to capture the association information between entities more accurately. In addition, since the same news content may have different expressions, in this embodiment, after entity recognition is performed on each clause data, entity coreference resolution may be performed to improve the accuracy of entity recognition.
S203, performing generative relation labeling on the initial relations among the candidate entities by adopting the initial large language model, based on the candidate entities in each clause and the initial relations among them, to obtain candidate relations; it should be noted that, in this embodiment, after the initial relations between the candidate entities in each clause are obtained, the initial large language model labels them generatively to obtain candidate relations with stronger comprehensibility and readability: the model directly generates a textual description of the relation type instead of a classification label, further refining the existing information so that it has better readability, comprehensibility and operability, which helps provide basic data for knowledge graph construction and makes it easier to convert text information into a graph structure.
In this embodiment, when entity recognition and relation extraction are performed, the performance of the BERT model on rare or hard-to-identify relations can be optimized through adversarial training or an active-learning strategy, and the accuracy of complex relation extraction can be further improved by combining an external knowledge base or prior knowledge.
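The entity-recognition and generative-labeling steps of S202 and S203 can be sketched together as follows; the lexicon lookup is a toy stand-in for a BERT NER model, and the label wording is an illustrative example of a generated textual relation description:

```python
def extract_candidates(clause, entity_lexicon):
    # Stand-in for BERT entity recognition: look tokens up in a
    # (hypothetical) lexicon mapping surface forms to types.
    return [(tok, entity_lexicon[tok]) for tok in clause
            if tok in entity_lexicon]

def generative_relation_label(head, tail, relation_type):
    # Generative labeling: emit a readable textual description of the
    # relation between two candidate entities, not a bare class label.
    return f"{head[0]} ({head[1]}) {relation_type} {tail[0]} ({tail[1]})"
```

The textual description is what later feeds the knowledge-graph construction step.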
S204, screening the candidate entities in the clause data by importance to obtain the candidate entities whose importance ranking is greater than a preset ranking threshold as news entities, and taking the candidate relations between those candidate entities as news relations. Specifically, in this embodiment, the importance of each clause in the initial multi-source news data may be quantified by a BERT-based vector clustering analysis, and the importance of each candidate entity may be obtained by comprehensively scoring each candidate entity on indicators such as its frequency of occurrence across clause data, the salience of its context, and the density of the relation network formed by the entity and other entities. In addition, a graph neural network (GNN) can be introduced to evaluate entity importance: the entities and relations are built into a graph structure, and the importance of each node (entity) is calculated through a message-passing mechanism. A dynamic threshold strategy may also be added to adjust the importance-score screening criteria according to changes in news topics.
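A minimal sketch of the importance scoring and ranking-threshold screening in S204 follows; combining frequency with relation-network degree reflects two of the indicators listed above, but the 0.5 weight is an illustrative choice, not a value from the patent:

```python
from collections import Counter

def score_entities(clauses, degree):
    # Composite importance: occurrence frequency across clause data plus
    # the entity's degree in the relation network (0.5 weight assumed).
    freq = Counter(e for clause in clauses for e in clause)
    return {e: freq[e] + 0.5 * degree.get(e, 0) for e in freq}

def select_news_entities(scores, rank_threshold):
    # Keep the candidates whose importance ranking lies within the preset
    # ranking threshold, i.e. the top-ranked entities.
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[:rank_threshold]
```

Entities kept by the screen become the news entities; the relations among them become the news relations.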
In step S2, a news knowledge graph is constructed according to the news entity and the news relationship, which includes:
S205, taking the news entities as nodes, taking news relations among the news entities as edges, and constructing a word graph structure; specifically, in this embodiment, the word graph structure includes news entities and news relationship information in the multi-source news data, and is represented by a graph structure, where each node represents a news entity, and a unique identifier may be used to represent the news entity, and each edge connects two nodes, which represents a relationship between two news entities corresponding to the two nodes. In this embodiment, in the process of constructing the word graph structure, for each word in the input clause data, if it corresponds to a certain knowledge entity, a corresponding node is created in the word graph structure and connected with other related entity nodes; in addition, it is necessary to ensure that all words (including non-entity words) in the input clause data form a full connection in the word graph structure, i.e., edges may be added between any two word nodes to express potential semantic associations.
S206, constructing and obtaining a news knowledge graph based on the word graph structure. Specifically, in this embodiment, when the news knowledge graph is constructed based on the word graph structure, the nodes and edges in the word graph structure may be further marked with attributes, and the marked attribute information may include types of entities and edges, attribute information, relationship types, and the like, where the attribute information may be extracted from text data or manually added, and is not limited herein; then, more entity, relation and attribute information can be added in the word graph structure to integrate a plurality of data sources, and operations such as data cleaning and normalization are performed to obtain a final news knowledge graph.
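The word-graph and knowledge-graph construction of S205–S206 can be sketched with a plain dictionary structure; the node/edge field names are illustrative, and a production system would use a dedicated graph store:

```python
def build_news_knowledge_graph(entities, relations):
    # Nodes are news entities (keyed by a unique identifier) carrying
    # attribute labels; edges are news relations connecting two nodes.
    graph = {"nodes": {}, "edges": []}
    for name, etype in entities:
        graph["nodes"][name] = {"type": etype}
    for head, rel, tail in relations:
        # Only keep relations whose endpoints are known entity nodes.
        if head in graph["nodes"] and tail in graph["nodes"]:
            graph["edges"].append((head, rel, tail))
    return graph
```

Attribute labeling, extra data-source integration and normalization would then be applied on top of this structure to yield the final news knowledge graph.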
S3, integrating the news knowledge graph into the initial large language model by adopting a knowledge fusion module to obtain a final large language model; it should be noted that, in this embodiment, by integrating the news knowledge graph into the initial large language model, the large language model can understand and generate more accurate and deep text response data by using rich knowledge in the news knowledge graph, so as to facilitate improving accuracy of subsequent news retrieval. Specifically, in this embodiment, the news knowledge graph is integrated into the initial large language model, that is, the structural information of the news knowledge graph is converted into characteristic data in a sequence form or in other coding modes that can be understood by the large language model, and the characteristic data is fused with text representations in the large language model to obtain a final large language model, so that the final large language model can refer to more background knowledge and context information when processing input sentences.
In step S3, the knowledge fusion module includes a text encoder and a knowledge graph encoder, that is, the knowledge fusion module adopts a text-knowledge dual encoder architecture; correspondingly, integrating the news knowledge graph into the initial large language model by adopting a knowledge fusion module to obtain a final large language model, wherein the method comprises the following steps of:
S301, receiving input text data input into the initial large language model; in particular, in the present embodiment, the text data includes any news stories, article summaries, or text content such as user query text, which is not limited herein. It should be further noted that, in this embodiment, the input text data may be new text data, or may be multi-source news data in a news database, that is, the news knowledge graph may be integrated into the initial large language model when the new text data is received, or the multi-source news data in the news database may be re-input into the initial large language model, where the news knowledge graph is integrated into the initial large language model, which is not limited herein.
S302, acquiring a text representation vector of the input text data by using the text encoder in the knowledge fusion module; the text representation vector is numerical data, such as a vector, obtained by converting the input text data.
Specifically, the text encoder in this embodiment adopts a BERT model based on the Transformer structure, and its main task is to deeply understand and characterize the input text data. In step S302 of the present embodiment, obtaining a text representation vector of the input text data by using the text encoder in the knowledge fusion module includes:
S3021, preprocessing the input text data to obtain preprocessed text data; wherein the preprocessed text data comprises a plurality of second word segmentation data;
S3022, encoding the preprocessed text data through a word embedding layer in the text encoder to obtain second word embedding vectors of a plurality of second word segmentation data in the preprocessed text data; specifically, in steps S3021 and S3022 of the present embodiment, when preprocessing and encoding the input text data, a preprocessing scheme for preprocessing the multi-source news data in the news database in step S2011 and an encoding scheme for encoding the preprocessed news data in step S2012 may be adopted, which are not described herein.
S3023, carrying out multi-layer self-attention mechanism calculation on the plurality of second word embedded vectors through the text encoder to obtain position representation vectors of the plurality of second word embedded vectors, namely generating context-related vector representations of each position of the plurality of second word embedded vectors;
S3024, fusing the plurality of second word embedding vectors and the plurality of position representation vectors through the text encoder to obtain a text representation vector of the input text data; specifically, in this embodiment, the text encoder uses the CLS (special classification embedding) token to obtain, from the plurality of second word embedding vectors and the plurality of position representation vectors, a semantic feature vector capable of representing the entire text as the text representation vector of the input text data. The text representation vector in this embodiment is a hidden state matrix, each row corresponding to the context-sensitive vector representation of one position (word or symbol).
Specifically, in step S3024, the process of acquiring a text vector representation may be expressed as:
TransformerEncoder(TokenEmbeddings+PositionEmbeddings);
Correspondingly, the text representation vector is:
TextEncoderOutput=TransformerEncoder(TokenEmbeddings+PositionEmbeddings);
Wherein TokenEmbeddings is a word vector sequence consisting of a plurality of second word embedded vectors; positionEmbeddings is a position vector sequence composed of a plurality of position representation vectors; transformerEncoder () represents a function that encodes an input sequence, which includes multiple layers of self-attention mechanisms and feedforward neural network layers.
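The data flow of the expression above can be sketched numerically. This is only a toy illustration of TransformerEncoder(TokenEmbeddings + PositionEmbeddings) with a single unprojected self-attention layer; a real BERT encoder has learned Q/K/V projections, multiple heads, feed-forward layers, and layer normalization, all omitted here.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax for the attention scores.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def transformer_encoder(token_embeddings, position_embeddings):
    # TokenEmbeddings + PositionEmbeddings: element-wise sum per position.
    h = token_embeddings + position_embeddings
    # One self-attention layer (identity projections, FFN omitted):
    scores = h @ h.T / np.sqrt(h.shape[-1])
    return softmax(scores) @ h   # each row is a context-mixed representation

rng = np.random.default_rng(0)
seq_len, dim = 5, 8
token_emb = rng.standard_normal((seq_len, dim))
pos_emb = rng.standard_normal((seq_len, dim))
text_encoder_output = transformer_encoder(token_emb, pos_emb)
```

The output is a hidden state matrix of shape (sequence length, hidden dimension), matching the description of the text representation vector in step S3024.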
S303, adopting the knowledge graph encoder in the knowledge fusion module to encode the news entities and news relations in the news knowledge graph to obtain entity representation vectors corresponding to the news entities and relation representation vectors corresponding to the news relations; specifically, in this embodiment, the knowledge graph encoder implements the encoding of the news entities and news relations by means of a graph neural network (GNN), such as a graph convolutional network, which is not limited herein.
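A single GNN message-passing round over the knowledge graph can be sketched as below. This is a deliberately simple mean-aggregation layer, offered only as an illustration of how entity vectors absorb neighbour information; the patent does not fix a particular GNN architecture, and the update rule here is an assumption.

```python
import numpy as np

def gnn_layer(entity_vecs, edges):
    """One toy GNN layer: each entity vector is averaged with the mean of
    its neighbours' vectors. entity_vecs: {entity: np.ndarray};
    edges: list of (head, tail) pairs, treated as undirected here."""
    neighbours = {e: [] for e in entity_vecs}
    for head, tail in edges:
        neighbours[head].append(tail)
        neighbours[tail].append(head)
    updated = {}
    for entity, vec in entity_vecs.items():
        if neighbours[entity]:
            msg = np.mean([entity_vecs[n] for n in neighbours[entity]], axis=0)
            updated[entity] = (vec + msg) / 2.0   # mix self and neighbour info
        else:
            updated[entity] = vec
    return updated
```

Stacking several such layers lets each entity representation reflect multi-hop relational context, which is the property the knowledge graph encoder relies on.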
S304, adopting the knowledge graph encoder to fuse the entity expression vector, the relation expression vector and the text expression vector to obtain a fused expression vector.
In step S304, the knowledge graph encoder is used to perform fusion processing on the entity representation vector, the relationship representation vector and the text representation vector to obtain a fusion representation vector, which includes:
S3041, acquiring an associated entity representation vector and an associated relation representation vector which have an associated relation with the text representation vector in the entity representation vector and the relation representation vector;
S3042, carrying out fusion processing on the association entity expression vector, the association relation expression vector and the text expression vector to obtain a fusion expression vector; wherein, the fusion expression vector is:
SentenceRepresentation=TextEncoderOutput+α·KGEntityVectors;
Wherein, + represents a concatenation symbol; TextEncoderOutput is the text representation vector; α is a preset weight coefficient; KGEntityVectors is a combined vector obtained by weighted summation of the association entity expression vector and the association relation expression vector.
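The fusion formula of step S3042 can be sketched numerically, reading "+" as concatenation as stated. The weighting scheme for combining entity and relation vectors into KGEntityVectors, and the values of α and the weights, are illustrative assumptions.

```python
import numpy as np

def fuse(text_vec, entity_vecs, relation_vecs, alpha=0.5,
         w_ent=0.7, w_rel=0.3):
    """Toy version of SentenceRepresentation = TextEncoderOutput + alpha * KGEntityVectors,
    with '+' taken as concatenation. KGEntityVectors is a weighted sum of
    the (mean) associated entity and relation vectors; weights are
    hypothetical hyper-parameters."""
    kg_entity_vectors = (w_ent * np.mean(entity_vecs, axis=0)
                         + w_rel * np.mean(relation_vecs, axis=0))
    return np.concatenate([text_vec, alpha * kg_entity_vectors])
```

The result doubles the dimensionality of the text vector, appending an α-scaled knowledge summary that downstream layers of the model can attend to.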
In addition, in step S304 of the present embodiment, the following steps may be used to obtain the fusion expression vector:
the knowledge-graph encoder uses an attention mechanism to focus the text representation vector on the knowledge-graph information most relevant to the context of the current input text data, and fuses this knowledge-graph information with the text representation vector to obtain a fused representation vector. The fusion with the knowledge-graph information most relevant to the context of the current input text data is:
SentenceRepresentation=TextEncoderOutput+WeightedKGVectors;
Where WeightedKGVectors represents the attention-weighted entity vectors, WeightedKGVectors = Softmax(AttentionScores) ⊙ KGEntityVectors; ⊙ denotes element-wise multiplication; Softmax() denotes the normalization function; AttentionScores represents the attention scores, AttentionScores = AttentionFunction(TextEncoderOutput, KGEntityVectors); and AttentionFunction() represents a function that calculates the attention score between the text representation vector and each entity representation vector.
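The attention-based variant can be sketched as follows. Dot-product attention is assumed here for AttentionFunction, and the softmax-weighted combination is read as a weighted sum over entity vectors; both choices are illustrative, since the patent leaves the attention function unspecified.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def attention_fuse(text_vec, kg_entity_vectors):
    """Toy attention fusion: score each entity vector against the text
    vector, normalise the scores, and add the weighted knowledge summary
    to the text representation."""
    attention_scores = kg_entity_vectors @ text_vec   # AttentionFunction(...)
    weights = softmax(attention_scores)               # Softmax(AttentionScores)
    weighted_kg = weights @ kg_entity_vectors         # weighted entity mix
    return text_vec + weighted_kg                     # SentenceRepresentation
```

Entities semantically closest to the current text thus dominate the injected knowledge, which is the "focus on the most relevant knowledge-graph information" behaviour described above.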
S305, integrating the fusion expression vector into the initial large language model to obtain a final large language model.
It should be noted that the fusion expression vector obtained by the knowledge fusion module in this embodiment may be regarded as a unified vector representation containing both the text context of the input text data and the additional knowledge information of the news knowledge graph. The fusion expression vector may serve as an intermediate representation, i.e., a vector feature, in the initial large language model and be integrated into it to obtain the final large language model. In this way, the knowledge in the news knowledge graph and the text representations in the initial large language model are effectively fused, the continual change of knowledge in the real world can be handled, and the retrieval can be updated and integrated without retraining the initial large language model.
S4, news content retrieval is carried out based on the final large language model, and news retrieval results are obtained. It should be noted that, in this embodiment, since the news knowledge graph is integrated in the final large language model, the news knowledge graph can be understood according to the context, so that the news event can be understood and interpreted more accurately, and the entity in the retrieved content can be linked to the corresponding entity in the knowledge graph to form the relevance retrieval such as development context and entity link, so as to further enhance the interpretation and relevance of the retrieval result.
In step S4, news content retrieval is performed based on the final large language model, so as to obtain a news retrieval result, which includes:
s401, customizing different prompt word templates according to different service scenario requirements, where the prompt word templates may also be called prompt templates in this embodiment, and constructing a prompt word template library from the different prompt word templates; it should be noted that, in this embodiment, the functions of the different prompt word templates include: processing multi-language news texts, understanding the meaning of news texts according to context, finding and associating related events, and the like. The prompt word templates enable the whole news retrieval system, i.e., the final large language model, to accurately understand news events, their development context, and key entity links, so as to obtain clearer retrieval interpretations and associations. In actual application, the task type to be solved by the final large language model may be determined according to the user's news retrieval requirement, and a corresponding prompt word template designed to ensure that it guides the final large language model to complete the reasoning and information retrieval required by the task.
S402, extracting a target prompt word template matched with a preset search requirement from the prompt word template library, receiving search request data, and combining the search request data with the target prompt word template to obtain target search request data; specifically, in this embodiment, when a user needs to perform news retrieval, one or more prompt word templates in the prompt word template library may be output in advance, and after the user selects one or more target prompt word templates, the search request data input by the user is received and combined with the target prompt word template, i.e., the search request data is filled into the target prompt word template, so as to obtain the target search request data, so that a news retrieval result meeting the user's news retrieval requirement can be obtained from the final large language model.
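Steps S401–S402 amount to maintaining a template library and filling the selected template with the user's request. A minimal sketch, in which the template names, wording, and field names are all hypothetical examples rather than the patent's actual templates:

```python
# Hypothetical prompt word template library for steps S401-S402.
PROMPT_TEMPLATES = {
    "cross_language": (
        "Extract key information from the source-language news digest: "
        "{digest}. Translate it into {target_language}, and find key "
        "digests of stories in other languages covering the same core event."
    ),
    "event_timeline": (
        "Comb the chain of news events about {topic}. List, in "
        "chronological order since {start_date}, all relevant important "
        "news headlines and their short summaries, and indicate the "
        "logical links between them."
    ),
}

def build_target_request(template_name, **fields):
    """Fill the selected target prompt word template with the user's
    search request data to produce the target search request data."""
    return PROMPT_TEMPLATES[template_name].format(**fields)
```

For instance, `build_target_request("event_timeline", topic="Tesla, Inc.", start_date="January 1, 2023")` yields the filled request of Example 2 below in spirit, ready to be passed to the final large language model in step S403.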
S403, carrying out news retrieval on the target retrieval request data by adopting the final large language model to obtain a news retrieval result matched with the target retrieval request data.
In the present embodiment, two examples of the relevant data and the retrieval process in steps S401 to S403 are given:
Example 1: business scenario for cross-language news retrieval
Service scene description: an international news organization needs to build a retrieval system that can understand and process news stories in english, chinese, and other languages, and accurately identify and associate different language version stories for the same news event.
The prompting word template customized based on the business scene is as follows: key information is extracted from the source language news digest and translated into the target language. At the same time, please find out the key summaries of other language stories related to this news and confirm whether they relate to the same core event.
The target retrieval request data after template filling is as follows: key information is extracted from the English news digest "London hosted the G20 summit discussing climate change initiatives" and translated into Chinese, while looking for key digests of French and Spanish news stories that discuss the same issues.
The news retrieval process of the target retrieval request data by adopting the final large language model comprises the following steps: using a final large language model with multi-language understanding and cross-language retrieval capabilities, the populated target retrieval request data is input into the model, which will output the translated abstract and related information of other language stories related to the original news event, i.e., news retrieval results.
Example 2: business scene constructed by news event association and time line
Service scene description: a financial news platform would like to build a retrieval system that would analyze the causal and chronological order of news events so that users would track the timeline of related news events for a particular company or industry.
The prompt word template customized for this business scenario is: comb the chain of news events about [topic]. Please list, in chronological order since [start date], all relevant important news headlines and their short summaries, and indicate the logical links between each news item and the others.
The target retrieval request data after template filling is as follows: compile the chain of major news events for Tesla, Inc., covering all important reports since January 1, 2023, including news headlines and their summaries, such as product releases, market performance, and management changes, and reveal the potential impact relationships between events.
The news retrieval process of the target retrieval request data by adopting the final large language model comprises the following steps: the filled target prompt word template is input into a final large language model with event understanding and reasoning capability, the model outputs a news event list related to Tesla company according to time sequence, and attempts to resolve causal relations among the events, so that a coherent news event time line is formed.
Based on the technical scheme, the embodiment can be beneficial to improving the interpretation and relevance of the news search results. Specifically, in this embodiment, through customizing the prompt word templates in different service scenarios, the accurate, detailed, personalized, customized and interpretable news search results can be obtained when the news search is performed by the final large language model, so that the news search effect is enhanced from multiple dimensions.
In this embodiment, the final large language model generally contains a large amount of cross-language and multi-dimensional news knowledge, but the news knowledge is stored in a hidden manner in the final large language model, so that the stored structure is difficult to determine, and in addition, the final large language model also has a hallucination problem, which results in generation of statements contradicting facts. The embodiment further provides the following technical scheme improvement: after obtaining the news search result, the method further comprises the following steps:
S5, performing news knowledge verification on the news retrieval result by using a news knowledge assessment module to obtain a verification result, and, when the verification result indicates that the matching degree between the news retrieval result and the news knowledge graph is smaller than a preset matching degree threshold, feeding the verification result back to a database matched with the final large language model, so as to help improve future news retrieval performance.
It should be noted that, the news knowledge assessment module in this embodiment may utilize authority of the news knowledge graph to compare the fact statement in the news search result obtained by the final large language model with the known fact in the news knowledge graph, so as to implement news knowledge verification on the news search result. Specifically, in this embodiment, when verifying the news knowledge of the news search result, the method includes the following steps:
S501, converting the entity relations in the news retrieval result output by the final large language model into triplet form, comparing these triples with the triples in the news knowledge graph, and performing real-time or offline verification in combination with the open network or other trusted sources;
s502, assigning a confidence score to the news retrieval result based on the quantity and quality of the supporting evidence found in the news knowledge graph and the degree of consistency with external resources.
S503, when the news retrieval result obtained by the final large language model is inconsistent with the news knowledge graph, recording the inconsistency and feeding it back to a database matched with the final large language model, so as to improve the accuracy of subsequent retrieval results.
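The verification loop of steps S501–S503 can be sketched as a triple-overlap check with a simple confidence score. The scoring rule (fraction of supported triples) and the threshold semantics are illustrative assumptions; the patent's assessment module may weigh evidence quality and external sources as well.

```python
# Toy sketch of steps S501-S503: compare triples extracted from a news
# retrieval result against the news knowledge graph and assign a simple
# confidence score; flag the result for feedback when the score falls
# below a preset matching-degree threshold.

def verify_result(result_triples, kg_triples, threshold=0.5):
    """result_triples / kg_triples: iterables of (head, relation, tail)."""
    kg = set(kg_triples)
    supported = [t for t in result_triples if t in kg]
    confidence = len(supported) / len(result_triples) if result_triples else 0.0
    needs_feedback = confidence < threshold
    return confidence, needs_feedback
```

Flagged results would then be recorded in the database matched with the final large language model, as described in step S503.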
Based on the above steps, the embodiment can accurately evaluate and ensure the correctness of knowledge contained in the final large language model by setting the news knowledge evaluation module.
It should be further noted that, in this embodiment, the news knowledge graph is used to store news knowledge, ensuring that the news is structured and layered, and it is integrated into the initial large language model through the fusion scheme described above. After integration, only the news knowledge assessment module and the knowledge fusion module need to be maintained, and updates to these modules do not affect the reasoning of the final large language model. The final news retrieval process is simply the inference process of the final large language model, and it can be extended to many different functions according to different input descriptions, giving the scheme high extensibility.
In addition, in this embodiment, the news knowledge graph may be updated in timed iterations: the news knowledge graph is updated cyclically and integrated into the corresponding modules and the final large language model, realizing cyclic iterative updating of the final large language model. For example, if the news knowledge graph is iteratively updated every few hours each day and the updates are propagated to the knowledge fusion module of the initial large language model, the retrieval content can be kept current to within hours, and the retrieval results of the final large language model come to resemble the analysis of a news specialist who combines a wealth of news facts with general knowledge.
The embodiment discloses a news enhancement retrieval method and a system based on large language model and knowledge graph joint driving, which can provide richer news related knowledge, more comprehensive news context information, latest news entities and news facts for news retrieval results and are beneficial to related personnel to obtain more intelligent, more accurate and more personalized news retrieval results at lower cost. Specifically, in the implementation process of the embodiment, firstly, multi-source news data are acquired, and a news database is constructed according to the multi-source news data; then, carrying out knowledge extraction on multi-source news data in the news database by adopting an initial large language model to obtain news entities and news relations, and constructing a news knowledge graph according to the news entities and the news relations; then, integrating the news knowledge graph into the initial large language model by adopting a knowledge fusion module to obtain a final large language model; and finally, searching news content based on the final large language model to obtain a news search result. In this process, the embodiment adopts the initial large language model to assist in knowledge extraction, and simultaneously assists in constructing a news knowledge graph, and obtains a final large language model from the initial large language model through the news knowledge graph, so that real-time knowledge in the news field is injected into the initial large language model, and the understanding degree of the obtained final large language model on news content information is improved, so that the accuracy and individuation degree of news content retrieval in the embodiment are improved.
Example 2:
The embodiment discloses a news retrieval system based on a large language model and a knowledge graph, which is used for realizing the news retrieval method based on the large language model and the knowledge graph in the embodiment 1; as shown in fig. 2, the news retrieval system based on a large language model and a knowledge graph includes:
the database construction module is used for acquiring multi-source news data and constructing a news database according to the multi-source news data;
the knowledge graph construction module is in communication connection with the database construction module and is used for carrying out knowledge extraction on multi-source news data in the news database by adopting an initial large language model to obtain news entities and news relations, and constructing a news knowledge graph according to the news entities and the news relations;
The retrieval model generation module is in communication connection with the knowledge graph construction module and is used for integrating the news knowledge graph into the initial large language model by adopting the knowledge fusion module to obtain a final large language model;
And the news retrieval module is in communication connection with the retrieval model generation module and is used for retrieving news content based on the final large language model to obtain news retrieval results.
It should be noted that, in the working process, working details and technical effects of the news retrieval system based on the large language model and the knowledge graph provided in the embodiment 2, reference may be made to the embodiment 1, and no description is repeated here.
Example 3:
On the basis of embodiment 1 or 2, this embodiment discloses an electronic device, which may be a smart phone, a tablet computer, a notebook computer, a desktop computer, or the like. The electronic device may be referred to as a user terminal, a portable terminal, a desktop terminal, etc., as shown in fig. 3, the electronic device includes:
a memory for storing computer program instructions; and
A processor for executing the computer program instructions to perform the operations of the large language model and knowledge graph based news retrieval method as described in any one of embodiment 1.
In particular, processor 301 may include one or more processing cores, such as a 4-core processor, an 8-core processor, and the like. The processor 301 may be implemented in at least one hardware form of DSP (Digital Signal Processing), FPGA (Field-Programmable Gate Array), or PLA (Programmable Logic Array). Processor 301 may also include a main processor and a coprocessor; the main processor, also referred to as a CPU (Central Processing Unit), is a processor for processing data in the awake state, while the coprocessor is a low-power processor for processing data in the standby state. In some embodiments, the processor 301 may be integrated with a GPU (Graphics Processing Unit) for rendering and drawing the content required to be displayed by the display screen.
Memory 302 may include one or more computer-readable storage media, which may be non-transitory. Memory 302 may also include high-speed random access memory, as well as non-volatile memory, such as one or more magnetic disk storage devices, flash memory storage devices. In some embodiments, a non-transitory computer readable storage medium in memory 302 is used to store at least one instruction for execution by processor 301 to implement the large language model and knowledge graph based news retrieval method provided in embodiment 1 of the present application.
In some embodiments, the terminal may further optionally include: a communication interface 303, and at least one peripheral device. The processor 301, the memory 302 and the communication interface 303 may be connected by a bus or signal lines. The respective peripheral devices may be connected to the communication interface 303 through a bus, signal line, or circuit board. Specifically, the peripheral device includes: at least one of radio frequency circuitry 304, a display screen 305, and a power supply 306.
The communication interface 303 may be used to connect at least one peripheral device associated with an I/O (Input/Output) to the processor 301 and the memory 302. In some embodiments, processor 301, memory 302, and communication interface 303 are integrated on the same chip or circuit board; in some other embodiments, either or both of the processor 301, the memory 302, and the communication interface 303 may be implemented on separate chips or circuit boards, which is not limited in this embodiment.
The Radio Frequency circuit 304 is configured to receive and transmit RF (Radio Frequency) signals, also known as electromagnetic signals. The radio frequency circuitry 304 communicates with a communication network and other communication devices via electromagnetic signals.
The display screen 305 is used to display a UI (User Interface). The UI may include graphics, text, icons, video, and any combination thereof.
The power supply 306 is used to power the various components in the electronic device.
Example 4:
on the basis of any one of embodiments 1 to 3, this embodiment discloses a computer program product comprising a computer program or instructions which, when executed by a computer, implement the news retrieval method based on the large language model and the knowledge graph as described in any one of embodiment 1.
It will be apparent to those skilled in the art that the modules or steps of the invention described above may be implemented in a general purpose computing device, they may be concentrated on a single computing device, or distributed across a network of computing devices, or they may alternatively be implemented in program code executable by computing devices, such that they may be stored in a memory device for execution by the computing devices, or they may be separately fabricated into individual integrated circuit modules, or multiple modules or steps within them may be fabricated into a single integrated circuit module. Thus, the present invention is not limited to any specific combination of hardware and software.
Finally, it should be noted that the above embodiments are merely illustrative of the technical solution of the present invention, and not limiting thereof; although the invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical scheme described in the foregoing embodiments can be modified or some of the technical features thereof can be replaced by equivalents. Such modifications and substitutions do not depart from the spirit and scope of the technical solutions of the embodiments of the present invention.
Claims (10)
1. A news retrieval method based on a large language model and a knowledge graph is characterized in that: comprising the following steps:
acquiring multi-source news data, and constructing a news database according to the multi-source news data;
carrying out knowledge extraction on multi-source news data in the news database by adopting an initial large language model to obtain news entities and news relations, and constructing a news knowledge graph according to the news entities and the news relations;
Integrating the news knowledge graph into the initial large language model by adopting a knowledge fusion module to obtain a final large language model;
and searching news content based on the final large language model to obtain a news search result.
2. The news retrieval method based on a large language model and a knowledge graph according to claim 1, wherein: knowledge extraction is carried out on multi-source news data in the news database by adopting an initial large language model to obtain news entities and news relations, and the method comprises the following steps:
Acquiring a plurality of clause data corresponding to the multi-source news data in the news database by adopting an initial large language model;
Performing entity recognition on each clause data by using a pre-trained large model to obtain candidate entities in each clause data, and performing relation extraction on the candidate entities in each clause to obtain initial relations between the candidate entities in each clause;
Generating, by the initial large language model, a generative relationship label for the initial relations among the candidate entities based on the candidate entities in each clause and the initial relations among them, to obtain candidate relations;
And screening the importance of the candidate entities in the clause data to obtain candidate entities with importance ranking greater than a preset ranking threshold, using the candidate entities as news entities, and using candidate relations between the candidate entities with importance ranking greater than the preset ranking threshold as news relations.
3. The news retrieval method based on the large language model and the knowledge graph according to claim 2, wherein: the initial large language model adopts ChatGLM model including encoder and decoder; correspondingly, acquiring a plurality of clause data corresponding to the multi-source news data in the news database by adopting an initial large language model comprises the following steps:
preprocessing the multisource news data in the news database to obtain preprocessed news data; wherein the preprocessed news data comprises a plurality of first word segmentation data;
Adopting an encoder of the initial large language model to encode the preprocessed news data to obtain first word embedding vectors of a plurality of first word segmentation data in the preprocessed news data;
And performing clause processing on the plurality of first word embedded vectors by adopting the decoder of the initial large language model to obtain a plurality of clause data.
4. The news retrieval method based on a large language model and a knowledge graph according to claim 1, wherein: and constructing a news knowledge graph according to the news entity and the news relationship, wherein the construction comprises the following steps:
Taking the news entities as nodes, taking news relations among the news entities as edges, and constructing a word graph structure;
and constructing and obtaining a news knowledge graph based on the word graph structure.
5. The news retrieval method based on a large language model and a knowledge graph according to claim 1, wherein: the knowledge fusion module comprises a text encoder and a knowledge graph encoder; correspondingly, integrating the news knowledge graph into the initial large language model by adopting a knowledge fusion module to obtain a final large language model, wherein the method comprises the following steps of:
receiving input text data entered into the initial large language model;
acquiring a text representation vector of the input text data by adopting a text encoder in the knowledge fusion module;
adopting the knowledge graph encoder in the knowledge fusion module to encode the news entity and the news relation in the news knowledge graph to obtain an entity representation vector corresponding to the news entity and a relation representation vector corresponding to the news relation;
adopting the knowledge graph encoder to fuse the entity representation vector, the relation representation vector and the text representation vector to obtain a fusion representation vector;
and integrating the fusion representation vector into the initial large language model to obtain the final large language model.
6. The news retrieval method based on the large language model and the knowledge graph according to claim 5, wherein: acquiring the text representation vector of the input text data by adopting a text encoder in the knowledge fusion module comprises the following steps:
Preprocessing the input text data to obtain preprocessed text data; wherein the preprocessed text data comprises a plurality of second word segmentation data;
encoding the preprocessed text data through the text encoder to obtain second word embedding vectors of the plurality of second word segmentation data in the preprocessed text data;
performing multi-layer self-attention mechanism calculation on the plurality of second word embedding vectors through the text encoder to obtain position representation vectors of the plurality of second word embedding vectors;
combining the plurality of second word embedding vectors with the plurality of position representation vectors through the text encoder to obtain the text representation vector of the input text data; wherein the text representation vector is:
TextEncoderOutput=TransformerEncoder(TokenEmbeddings+PositionEmbeddings);
wherein TokenEmbeddings is a word vector sequence composed of the plurality of second word embedding vectors; PositionEmbeddings is a position vector sequence composed of the plurality of position representation vectors; and TransformerEncoder() denotes a function that encodes an input sequence using multiple layers of self-attention mechanisms and feedforward neural network layers.
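The TextEncoderOutput formula of claim 6 can be illustrated numerically. In this sketch the encoder is replaced by an identity stub, so only the embedding-summation step is shown, not a real Transformer; the dimensions are arbitrary illustrative choices.

```python
import numpy as np

def transformer_encoder_stub(x):
    # Stand-in for TransformerEncoder(); a real encoder would apply stacked
    # self-attention and feed-forward layers to the summed embeddings.
    return x

def encode(token_embeddings, position_embeddings):
    # TextEncoderOutput = TransformerEncoder(TokenEmbeddings + PositionEmbeddings)
    return transformer_encoder_stub(token_embeddings + position_embeddings)

rng = np.random.default_rng(0)
tok = rng.normal(size=(4, 8))  # 4 tokens, hidden size 8 (illustrative)
pos = rng.normal(size=(4, 8))  # one position vector per token
out = encode(tok, pos)
```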
7. The news retrieval method based on the large language model and the knowledge graph according to claim 5, wherein: carrying out fusion processing on the entity representation vector, the relation representation vector and the text representation vector by adopting the knowledge graph encoder to obtain a fusion representation vector comprises the following steps:
acquiring, from the entity representation vector and the relation representation vector, an associated entity representation vector and an associated relation representation vector that have an association relationship with the text representation vector;
carrying out fusion processing on the associated entity representation vector, the associated relation representation vector and the text representation vector to obtain the fusion representation vector; wherein the fusion representation vector is:
SentenceRepresentation=TextEncoderOutput+α·KGEntityVectors;
wherein + denotes the splicing (combination) operation; TextEncoderOutput is the text representation vector; α is a preset weight coefficient; and KGEntityVectors is a combined vector obtained by weighted summation of the associated entity representation vector and the associated relation representation vector.
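The SentenceRepresentation formula of claim 7 reduces to a weighted vector combination and can be sketched as follows; the vector values and the choice of α are invented placeholders.

```python
import numpy as np

def fuse(text_encoder_output, kg_entity_vectors, alpha=0.3):
    # SentenceRepresentation = TextEncoderOutput + alpha * KGEntityVectors,
    # where alpha is the preset weight coefficient from the claim.
    return text_encoder_output + alpha * kg_entity_vectors

text_vec = np.ones(8)        # illustrative text representation vector
kg_vec = np.full(8, 2.0)     # illustrative combined KG vector
sentence_rep = fuse(text_vec, kg_vec, alpha=0.5)
```

Larger α weights the knowledge-graph signal more heavily relative to the text encoder's output.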
8. The news retrieval method based on a large language model and a knowledge graph according to claim 1, wherein: carrying out news content retrieval based on the final large language model to obtain a news retrieval result comprises the following steps:
customizing different prompt word templates according to different business scenario requirements, and constructing a prompt word template library from the different prompt word templates;
extracting a target prompt word template matched with a preset search requirement from the prompt word template library, receiving search request data, and combining the search request data with the target prompt word template to obtain target search request data;
and carrying out news retrieval on the target retrieval request data by adopting the final large language model to obtain a news retrieval result matched with the target retrieval request data.
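The template-library flow of claim 8 can be sketched with a simple keyed dictionary; the scenario names, template strings, and function names below are all invented for illustration, and the final retrieval call to the large language model is omitted.

```python
# Prompt word template library: one template per business scenario.
TEMPLATE_LIBRARY = {
    "finance": "Retrieve finance news about: {query}",
    "sports": "Retrieve sports news about: {query}",
}

def build_target_request(scenario, query):
    """Combine retrieval request data with the matched target prompt word
    template to form the target retrieval request data."""
    template = TEMPLATE_LIBRARY[scenario]
    return template.format(query=query)

request = build_target_request("finance", "interest rate cuts")
# `request` would then be passed to the final large language model.
```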
9. The news retrieval method based on a large language model and a knowledge graph according to claim 1, wherein after obtaining the news retrieval result, the method further comprises the following steps:
carrying out news knowledge verification on the news retrieval result by adopting a news knowledge assessment module to obtain a verification result; and when the verification result indicates that the matching degree between the news retrieval result and the news knowledge graph is smaller than a preset matching degree threshold, feeding the verification result back to a database matched with the final large language model.
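The threshold check of claim 9 can be sketched as follows; how the matching degree itself is computed is not specified in this claim, and the 0.7 threshold is an invented placeholder.

```python
def passes_verification(matching_degree, threshold=0.7):
    """Return True when the retrieval result passes news knowledge
    verification; False means it should be fed back for correction
    (matching degree below the preset matching degree threshold)."""
    return matching_degree >= threshold

# A result scoring below the threshold triggers the feedback path.
feedback_needed = not passes_verification(0.55)
```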
10. A news retrieval system based on a large language model and a knowledge graph, characterized in that the system applies the news retrieval method based on a large language model and a knowledge graph as claimed in any one of claims 1 to 9; the news retrieval system based on the large language model and the knowledge graph comprises:
the database construction module is used for acquiring multi-source news data and constructing a news database according to the multi-source news data;
the knowledge graph construction module is in communication connection with the database construction module and is used for carrying out knowledge extraction on multi-source news data in the news database by adopting an initial large language model to obtain news entities and news relations, and constructing a news knowledge graph according to the news entities and the news relations;
The retrieval model generation module is in communication connection with the knowledge graph construction module and is used for integrating the news knowledge graph into the initial large language model by adopting the knowledge fusion module to obtain a final large language model;
And the news retrieval module is in communication connection with the retrieval model generation module and is used for retrieving news content based on the final large language model to obtain news retrieval results.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202410602922.3A CN118364916A (en) | 2024-05-15 | 2024-05-15 | News retrieval method and system based on large language model and knowledge graph |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202410602922.3A CN118364916A (en) | 2024-05-15 | 2024-05-15 | News retrieval method and system based on large language model and knowledge graph |
Publications (1)
Publication Number | Publication Date |
---|---|
CN118364916A true CN118364916A (en) | 2024-07-19 |
Family
ID=91887538
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202410602922.3A Pending CN118364916A (en) | 2024-05-15 | 2024-05-15 | News retrieval method and system based on large language model and knowledge graph |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN118364916A (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN118520945A (en) * | 2024-07-22 | 2024-08-20 | 昆明理工大学 | Knowledge graph automatic construction method based on priori knowledge and knowledge connection |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN112507715B (en) | Method, device, equipment and storage medium for determining association relation between entities | |
WO2022100045A1 (en) | Training method for classification model, sample classification method and apparatus, and device | |
US20230142217A1 (en) | Model Training Method, Electronic Device, And Storage Medium | |
US11521603B2 (en) | Automatically generating conference minutes | |
WO2021121198A1 (en) | Semantic similarity-based entity relation extraction method and apparatus, device and medium | |
CN111291195B (en) | Data processing method, device, terminal and readable storage medium | |
US12086548B2 (en) | Event extraction from documents with co-reference | |
US20220100772A1 (en) | Context-sensitive linking of entities to private databases | |
WO2020232943A1 (en) | Knowledge graph construction method for event prediction and event prediction method | |
CN113157859B (en) | Event detection method based on upper concept information | |
CN113220836A (en) | Training method and device of sequence labeling model, electronic equipment and storage medium | |
US20220100967A1 (en) | Lifecycle management for customized natural language processing | |
CN118364916A (en) | News retrieval method and system based on large language model and knowledge graph | |
US20220414463A1 (en) | Automated troubleshooter | |
Zhang et al. | A multi-feature fusion model for Chinese relation extraction with entity sense | |
CN114595686B (en) | Knowledge extraction method, and training method and device of knowledge extraction model | |
US20240020538A1 (en) | Systems and methods for real-time search based generative artificial intelligence | |
CN115099239B (en) | Resource identification method, device, equipment and storage medium | |
EP4222635A1 (en) | Lifecycle management for customized natural language processing | |
CN114722832A (en) | Abstract extraction method, device, equipment and storage medium | |
CN114528840A (en) | Chinese entity identification method, terminal and storage medium fusing context information | |
KR20210125449A (en) | Method for industry text increment, apparatus thereof, and computer program stored in medium | |
CN113095082A (en) | Method, device, computer device and computer readable storage medium for text processing based on multitask model | |
CN117473054A (en) | Knowledge graph-based general intelligent question-answering method and device | |
Kasmuri et al. | Building a Malay-English code-switching subjectivity corpus for sentiment analysis |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||