
CN117153157A - Multi-mode full duplex dialogue method and system for semantic recognition - Google Patents


Info

Publication number
CN117153157A
Authority
CN
China
Prior art keywords
dialogue
vector
determining
mode
acquiring
Prior art date
Legal status
Granted
Application number
CN202311212596.7A
Other languages
Chinese (zh)
Other versions
CN117153157B (en)
Inventor
沈卫民
刘祖芳
马学文
王伟林
Current Assignee
Shenzhen Macchi Information Technology Co ltd
Original Assignee
Shenzhen Macchi Information Technology Co ltd
Priority date
Filing date
Publication date
Application filed by Shenzhen Macchi Information Technology Co ltd
Priority to CN202311212596.7A
Publication of CN117153157A
Application granted
Publication of CN117153157B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/22 Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L15/06 Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/063 Training
    • G10L15/08 Speech classification or search
    • G10L15/18 Speech classification or search using natural language modelling
    • G10L15/1822 Parsing for meaning understanding
    • G10L15/183 Speech classification or search using natural language modelling using context dependencies, e.g. language models
    • G10L2015/225 Feedback of the input speech

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • Machine Translation (AREA)

Abstract

The application provides a multi-mode full duplex dialogue method and system for semantic recognition, wherein the method comprises the following steps: step 1: acquiring an initiated dialogue between a dialogue user and a preset dialogue model; step 2: determining the dialogue mode selected by the dialogue user; step 3: acquiring the dialogue semantics of the initiated dialogue according to a semantic recognition technology and the dialogue mode; step 4: conducting a multi-mode full duplex dialogue according to the dialogue semantics and the dialogue mode. In the multi-mode full duplex dialogue method and system for semantic recognition, the dialogue semantics are determined according to the acquired initiated dialogue and the dialogue mode selected by the dialogue user, and a multi-mode full duplex dialogue is conducted according to the dialogue semantics and the dialogue mode, so that user intent is expressed and understood more richly, more diverse responses are provided, and interactivity is stronger.

Description

Multi-mode full duplex dialogue method and system for semantic recognition
Technical Field
The application relates to the technical field of semantic recognition, in particular to a multi-mode full duplex dialogue method and system for semantic recognition.
Background
A multi-mode full duplex dialogue is one in which inputs and outputs of multiple modalities (e.g., text, voice, image, video) are processed simultaneously during the dialogue. Semantic recognition refers to identifying what a piece of content (e.g., text, audio, or an image) expresses. A multi-mode full duplex dialogue method for semantic recognition is thus a method by which a dialogue system processes the inputs and outputs of multiple modalities simultaneously: it realizes full duplex dialogue interaction, extends semantic understanding and generation to multiple modalities, and enables the dialogue system to understand user input more comprehensively and to generate multi-modal reply information.
Chinese patent application CN201811010816.7 discloses a method for realizing full duplex voice dialogue and page control based on a web page. The method comprises the following steps: a user accesses the web page; the user initiates a voice session request in the web page; a server responds to the voice session request; the server establishes a full duplex voice dialogue with the user and outputs the user's intention; and a command control module receives the user's intention to realize page control. That application addresses the poor interaction experience and low communication efficiency of dialogues between existing web page administrators and site visitors.
However, the above prior art supports only voice-mode dialogue. Voice alone is a relatively limited form of expression: some content cannot be conveyed intuitively by voice, and the human-computer interaction experience is correspondingly poor.
In view of the foregoing, there is a need for a multi-modal full duplex dialogue method and system for semantic recognition that addresses at least the above-mentioned shortcomings.
Disclosure of Invention
The aim of the application is to provide a multi-mode full duplex dialogue method and system for semantic recognition, in which the dialogue semantics are determined according to the acquired initiated dialogue and the dialogue mode selected by the dialogue user, and a multi-mode full duplex dialogue is conducted according to the dialogue semantics and the dialogue mode, so that user intent is expressed and understood more richly, more diverse responses are provided, and interactivity is stronger.
The multi-mode full duplex dialogue method for semantic recognition provided by the embodiment of the application comprises the following steps:
step 1: acquiring an initiated dialogue between a dialogue user and a preset dialogue model;
step 2: determining the dialogue mode selected by the dialogue user;
step 3: acquiring the dialogue semantics of the initiated dialogue according to a semantic recognition technology and the dialogue mode;
step 4: conducting a multi-mode full duplex dialogue according to the dialogue semantics and the dialogue mode.
Preferably, step 1, acquiring an initiated dialogue between a dialogue user and a preset dialogue model, comprises:
determining an input port for initiating a dialog;
acquiring input information of a dialogue user;
determining a target port of the input information according to the information type of the input information;
determining port information according to input information based on an analysis rule corresponding to a target port;
determining the initiated dialogue according to the port information based on real-time Web technology.
Preferably, step 2, determining the dialogue mode selected by the dialogue user, comprises:
acquiring a mode selection instruction input by a dialogue user, and determining a dialogue mode according to the mode selection instruction;
and/or,
and acquiring the context information input by the dialogue user, determining the mode switching intention of the user according to the context information, and determining the dialogue mode according to the mode switching intention.
Preferably, step 3, acquiring the dialogue semantics of the initiated dialogue according to the semantic recognition technology and the dialogue mode, comprises:
collecting training data according to a dialogue mode;
based on a preset extraction rule, determining a plurality of extraction samples according to training data;
training semantic recognition decision trees according to the extraction samples based on a random forest algorithm;
determining a plurality of decision results according to the initiated dialog and the semantic recognition decision tree;
acquiring a decision result expression heat map;
carrying out hierarchical clustering on each decision result according to the decision result expression heat map to obtain a clustering tree;
determining tree nodes with the maximum volume in the cluster tree;
obtaining a central heat map value of a decision result corresponding to the tree node;
determining a meaning of the utterance according to the central heat map value;
wherein hierarchically clustering the decision results according to the decision result expression heat map to obtain the cluster tree comprises:
calculating the similarity between every two decision results, wherein the similarity is calculated as:

Dis(D_m, D_n) = √((X_m − X_n)² + (Y_m − Y_n)²)

Correlation(D_m, D_n) = exp(−Dis(D_m, D_n)² / (2σ²))

wherein D_m is the mth decision result; D_n is the nth decision result; Correlation(D_m, D_n) is the similarity calculation result of the mth and nth decision results; Dis(D_m, D_n) is the distance between the mth and nth decision results on the decision result expression heat map; X_m and X_n are respectively the calibration values of the mth and nth decision results in the X dimension of the decision result expression heat map; Y_m and Y_n are respectively the calibration values of the mth and nth decision results in the Y dimension of the decision result expression heat map; and σ is a preset similarity normalization coefficient;
and iteratively merging the decision results with the highest similarity to obtain the cluster tree.
Preferably, collecting training data according to a dialog modality includes:
acquiring a mode type of a dialogue mode;
determining a collection rule according to the mode type;
determining the preset collection template corresponding to the collection rule;
acquiring the dialogue scene of the initiated dialogue;
extracting dialogue scene characteristics of a dialogue scene based on a preset dialogue scene characteristic extraction rule;
determining feature setting parameters of the collection template according to dialogue scene features;
setting corresponding feature setting parameters of the collecting template to obtain a target template;
training data is collected based on the target template.
Preferably, step 4, conducting a multi-mode full duplex dialogue according to the dialogue semantics and the dialogue mode, comprises:
acquiring the dialogue requirement according to the dialogue semantics;
determining an output channel of a dialogue mode;
determining output content according to the dialogue requirement and the output channel;
and carrying out multi-mode full duplex dialogue according to the output content.
Preferably, determining output content according to the dialogue requirement and the output channel includes:
determining a first dialogue vector of dialogue requirements based on a preset dialogue vector model;
acquiring a corpus group corresponding to an output channel;
based on a preset sentence-breaking rule, determining a plurality of first sentence-breaking corpora according to the corpus group;
determining a second dialogue vector of the first sentence-breaking corpus based on the dialogue vector model;
aligning the vector starting points of the first dialogue vector and the second dialogue vector, and acquiring a first vector included angle between the first dialogue vector and the second dialogue vector after the vector starting points are aligned;
if the first vector included angle is smaller than a preset vector included angle threshold value, the corresponding second dialogue vector is used as a third dialogue vector;
determining a first vector included angle between the first dialogue vector and the third dialogue vector, and taking the first vector included angle as a second vector included angle;
rotating the third dialogue vector by a second vector included angle to obtain a fourth dialogue vector;
calculating a vector modulus difference value between the fourth dialogue vector and the first dialogue vector;
determining output content according to the second vector included angle and the vector module value difference value;
wherein calculating a vector modulus difference between the fourth dialog vector and the first dialog vector comprises:
calculating a dimension difference value of the same vector dimension of the fourth dialogue vector and the first dialogue vector;
acquiring the dimension weight of the vector dimension according to the vector dimension and a preset dimension weight library;
and determining a vector modulus value difference value according to the dimension difference value and the dimension weight based on a preset calculation rule.
Preferably, determining the output content according to the second vector included angle and the vector modulus difference value includes:
acquiring a first conversion rule corresponding to the vector included angle and a second conversion rule corresponding to the vector module value difference;
according to the first conversion rule, determining a first conversion value corresponding to the second vector included angle, and correlating with a corresponding third dialogue vector;
determining a second conversion value corresponding to the vector module value difference value according to a second conversion rule, and correlating with a corresponding third dialogue vector;
summing the first conversion value and the second conversion value associated with the third dialogue vector to obtain a statistical value;
taking the first sentence-breaking corpus corresponding to the third dialogue vector with the smallest statistic value as the second sentence-breaking corpus;
and determining a third sentence-breaking corpus according to the second sentence-breaking corpus and the first sentence-breaking corpus in the corpus group, and taking the third sentence-breaking corpus as output content.
Preferably, conducting the multi-mode full duplex dialogue according to the output content comprises:
acquiring a presentation requirement of output content;
determining a first support parameter according to the presentation requirement;
acquiring a second support parameter of the local server;
judging whether the output content can be presented on the user interface or not according to the first support parameter and the second support parameter;
if the output content can be presented on the user interface, presenting the output content through the user interface;
if the output content cannot be presented on the user interface, establishing a communication link, through the local server, with a target platform meeting the presentation requirement, and sending the presentation requirement to the target platform;
and acquiring the presentation information after the target platform receives the presentation requirement, and returning the presentation information in real time.
The embodiment of the application further provides a multi-mode full duplex dialogue system for semantic recognition, which comprises:
an initiated dialogue acquisition subsystem, used for acquiring an initiated dialogue between a dialogue user and a preset dialogue model;
a dialogue mode determination subsystem, used for determining the dialogue mode selected by the dialogue user;
a dialogue semantics acquisition subsystem, used for acquiring the dialogue semantics of the initiated dialogue according to the semantic recognition technology and the dialogue mode;
and a dialogue subsystem, used for conducting a multi-mode full duplex dialogue according to the dialogue semantics and the dialogue mode.
The beneficial effects of the application are as follows:
according to the method and the device, the speaking meaning is determined according to the acquired dialog initiating mode and the dialog user selecting mode, the multi-mode full duplex dialog is carried out according to the speaking meaning and the dialog mode, the user intention is expressed and understood more abundantly, more diversified responses are provided, and the interactivity is stronger.
Additional features and advantages of the application will be set forth in the description which follows, and in part will be obvious from the description, or may be learned by practice of the application. The objectives and other advantages of the application will be realized and attained by the structure particularly pointed out in the written description and claims thereof as well as the appended drawings.
The technical scheme of the application is further described in detail through the drawings and the embodiments.
Drawings
The accompanying drawings are included to provide a further understanding of the application and are incorporated in and constitute a part of this specification, illustrate the application and together with the embodiments of the application, serve to explain the application. In the drawings:
FIG. 1 is a schematic diagram of a multi-modal full duplex dialogue method for semantic recognition according to an embodiment of the present application;
FIG. 2 is a schematic diagram of a multi-mode full duplex dialogue system for semantic recognition according to an embodiment of the present application.
Detailed Description
The preferred embodiments of the present application will be described below with reference to the accompanying drawings, it being understood that the preferred embodiments described herein are for illustration and explanation of the present application only, and are not intended to limit the present application.
The embodiment of the application provides a multi-mode full duplex dialogue method for semantic recognition, which is shown in fig. 1 and comprises the following steps:
step 1: acquiring an initiated dialogue between a dialogue user and a preset dialogue model; wherein the dialogue user is a user of the multi-mode full duplex dialogue system for semantic recognition; the dialogue model is the multi-mode full duplex dialogue system; and the initiated dialogue is a dialogue initiated on the dialogue interface of the multi-mode full duplex dialogue system for semantic recognition;
step 2: determining the dialogue mode selected by the dialogue user; wherein the dialogue mode is, for example: text, voice, image, video, etc.;
step 3: acquiring the dialogue semantics of the initiated dialogue according to a semantic recognition technology and the dialogue mode; wherein the dialogue semantics is the meaning contained in the initiated dialogue;
step 4: conducting a multi-mode full duplex dialogue according to the dialogue semantics and the dialogue mode; wherein the multi-mode full duplex dialogue may take various forms, for example: dialogue through text, audio, images, and the like.
The working principle and the beneficial effects of the technical scheme are as follows:
according to the method and the device, the speaking meaning is determined according to the acquired dialog initiating mode and the dialog user selecting mode, the multi-mode full duplex dialog is carried out according to the speaking meaning and the dialog mode, the user intention is expressed and understood more abundantly, more diversified responses are provided, and the interactivity is stronger.
In specific application, a dialogue user initiates a dialogue in any mode supported by the server of the multi-mode full duplex dialogue system for semantic recognition, and after the system receives the initiated dialogue, it generates a reply in the mode selected by the dialogue user.
In one embodiment, step 1, acquiring an initiated dialogue between a dialogue user and a preset dialogue model, comprises:
determining the input port of the initiated dialogue; wherein the input port is, for example, a sound-receiving microphone or a file input box;
acquiring the input information of the dialogue user; wherein the input information is, for example: input text, input pictures, input voice, and the like;
determining the target port of the input information according to the information type of the input information; wherein the information type is, for example: text, image, audio, etc. When determining the target port, for example: when the input information is text or an image, an input box of a preset user interface is determined, the preset user interface being the dialogue interface presented to the dialogue user; the dialogue user types or drags the text or image to be input into the corresponding input box, completing the determination of the target port (that is, the input port corresponding to the input box of the user interface is the target port for text and images). As another example, the sound-receiving microphone receives the sound of the surrounding environment in real time, and when the input information is voice information, the target port is the port corresponding to the sound-receiving microphone;
determining the port information according to the input information based on the analysis rule corresponding to the target port; wherein the analysis rule is determined by the port device, and the port information is the dialogue information transmitted by the port in real time;
determining the initiated dialogue according to the port information based on real-time Web technology. Real-time Web technology belongs to the prior art and is not described in detail here.
The working principle and the beneficial effects of the technical scheme are as follows:
the application determines the target port of the input information according to the determined input port and the information type of the input information of the dialogue user, introduces the analysis rule of the target port and determines the port information according to the input information. The real-time Web technology is introduced to initiate the dialogue in real time according to the port information, so that timeliness of initiating the dialogue to acquire is improved, the dialogue efficiency is higher, and the user experience is better.
In one embodiment, step 2, determining the dialogue mode selected by the dialogue user, comprises:
acquiring a mode selection instruction input by the dialogue user, and determining the dialogue mode according to the mode selection instruction; wherein the mode selection instruction is the trigger instruction generated at the server by the mode change operation;
and/or,
acquiring the context information input by the dialogue user, determining the mode switching intention of the user according to the context information, and determining the dialogue mode according to the mode switching intention. Here, the context information is the historical dialogue of the dialogue user, and the mode switching intention indicates which mode the user wishes to switch to.
The working principle and the beneficial effects of the technical scheme are as follows:
the application introduces two modes to determine the dialogue mode, and the determination of the dialogue mode is more reasonable.
In one embodiment, step 3, acquiring the dialogue semantics of the initiated dialogue according to the semantic recognition technology and the dialogue mode, comprises:
collecting training data according to the dialogue mode; wherein the training data are records of recognizing the semantics of dialogues in the corresponding dialogue mode, for example: machine learning records of recognizing the semantics represented by text, or machine learning records of recognizing the semantics represented by speech;
determining a plurality of extraction samples from the training data based on a preset extraction rule; wherein the extraction rule is: sample the training data with replacement, extracting 1 sample at a time, until extraction samples of the preset sample count are formed;
training semantic recognition decision trees according to the extraction samples based on a random forest algorithm; wherein the random forest algorithm belongs to the prior art and its principle is not repeated; a semantic recognition decision tree is a decision tree that determines the dialogue semantics from the initiating sentence;
determining a plurality of decision results according to the initiated dialogue and the semantic recognition decision trees; wherein a decision result is the semantic decision result for the initiating sentence output by each semantic recognition decision tree;
acquiring a decision result expression heat map; wherein the decision result expression heat map is a heat map used to visualize the different decision results; heat map analysis belongs to the prior art;
carrying out hierarchical clustering on each decision result according to the decision result expression heat map to obtain a clustering tree;
determining the tree node with the largest volume in the cluster tree; wherein the tree node with the largest volume is the node into which the largest number of decision results is divided;
obtaining the central heat map value of the decision results corresponding to the tree node; wherein the central heat map value is the average heat map value of the decision results corresponding to the tree node;
determining the dialogue semantics according to the central heat map value; since each heat map value has a correspondingly characterized semantic meaning, the dialogue semantics can be determined;
wherein hierarchically clustering the decision results according to the decision result expression heat map to obtain the cluster tree comprises:
calculating the similarity between every two decision results, wherein the similarity is calculated as:

Dis(D_m, D_n) = √((X_m − X_n)² + (Y_m − Y_n)²)

Correlation(D_m, D_n) = exp(−Dis(D_m, D_n)² / (2σ²))

wherein D_m is the mth decision result; D_n is the nth decision result; Correlation(D_m, D_n) is the similarity calculation result of the mth and nth decision results; Dis(D_m, D_n) is the distance between the mth and nth decision results on the decision result expression heat map; X_m and X_n are respectively the calibration values of the mth and nth decision results in the X dimension of the decision result expression heat map; Y_m and Y_n are respectively the calibration values of the mth and nth decision results in the Y dimension of the decision result expression heat map; and σ is a preset similarity normalization coefficient;
and iteratively merging the decision results with the highest similarity to obtain the cluster tree. Iterative merging means: at each iteration, the decision results with the highest similarity in that iteration are merged, until the number of merged results reaches a manually preset value.
The working principle and the beneficial effects of the technical scheme are as follows:
the application collects training data corresponding to dialogue modes, and determines extraction samples for generating semantic recognition decision trees according to the training data. And introducing a random forest algorithm, training a language identification decision tree according to the extracted sample, and obtaining a decision result of the language identification decision tree on initiating the conversation. Taking the discreteness of the distribution of the decision results into consideration, introducing a decision result expression heat map, and clustering the hierarchy of each decision result to obtain a cluster tree; when hierarchical clustering is carried out, similarity among decision results is introduced, iterative merging is carried out, and when similarity is calculated, a similarity normalization coefficient is introduced, so that the rationality of hierarchical clustering is improved. And determining the central heat map value of the tree node with the maximum volume in the cluster tree to determine the meaning of the words, so that the determination efficiency of the meaning of the words is improved.
In one embodiment, collecting training data according to a dialog modality includes:
acquiring a mode type of a dialogue mode; among them, the modality types are, for example: text, images, and video, etc.;
determining a collection rule according to the modality type; wherein the collection rule specifies how to collect the training data corresponding to the modality type, for example: how to collect records of text semantic recognition, or how to collect records of speech semantic recognition;
determining the preset collection template corresponding to the collection rule; wherein the collection template is a template that conforms to the collection rule and constrains collection to only the corresponding records according to the collection rule;
acquiring a dialogue scene of initiating a dialogue; the dialogue scene is as follows: application scenarios of dialog, such as: a visual question-answering system, a chat robot, etc.;
extracting dialogue scene characteristics of a dialogue scene based on a preset dialogue scene characteristic extraction rule; the dialogue scene features are, for example: a dialogue of which topic is performed, a form of the dialogue, and the like;
determining feature setting parameters of the collection template according to the dialogue scene features; wherein the feature setting parameters are constraint parameters that restrict the collection behavior of the collection template, so that more accurate training data are acquired;
setting corresponding feature setting parameters of the collecting template to obtain a target template;
training data is collected based on the target template.
The working principle and the beneficial effects of the technical scheme are as follows:
In general, the semantic recognition process differs between dialogue modes, so the application introduces modality types, determines collection rules according to the modality type, and obtains collection templates according to the collection rules. In addition, the dialogue scene of the full duplex dialogue is introduced: dialogue scene features are extracted to determine the feature setting parameters of the collection template, the corresponding parameters are set, and training data are collected, so that the collected training data fit the scene better.
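A minimal sketch of the template mechanism follows; the rule table, field names, and scene features are illustrative placeholders, since the patent leaves the concrete collection rules and templates to manual presetting:

```python
from dataclasses import dataclass, field

@dataclass
class CollectionTemplate:
    modality: str                 # modality type the collection rule applies to
    record_kind: str              # which recognition records the rule collects
    feature_params: dict = field(default_factory=dict)  # scene-derived constraints

# Assumed collection rule per modality type.
COLLECTION_RULES = {
    "text":  CollectionTemplate("text",  "text semantic recognition records"),
    "voice": CollectionTemplate("voice", "speech semantic recognition records"),
}

def build_target_template(modality: str, scene_features: dict) -> CollectionTemplate:
    """Select the preset template for the modality's collection rule and set its
    feature parameters from the dialogue scene features."""
    template = COLLECTION_RULES[modality]
    template.feature_params.update(scene_features)
    return template

target = build_target_template("text", {"topic": "visual question answering",
                                        "form": "chatbot"})
print(target)
```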
In one embodiment, step 4, conducting a multi-mode full duplex dialogue according to the dialogue semantics and the dialogue mode, comprises:
acquiring the dialogue requirement according to the dialogue semantics; wherein the dialogue requirement is the requirement characterized by the dialogue semantics, for example: "please use AI to draw me a picture with the theme 'xx'";
determining the output channel of the dialogue mode; wherein the output channel is the form or medium through which the dialogue reply is delivered to the user;
determining the output content according to the dialogue requirement and the output channel; wherein the output content is the reply to the dialogue requirement, for example: an AI drawing titled "xx";
and carrying out multi-mode full duplex dialogue according to the output content.
The working principle and the beneficial effects of the technical scheme are as follows:
according to the application, the output content is determined according to the acquired dialogue requirement and the output channel of the dialogue mode, the multi-mode full duplex dialogue is performed according to the output content, more diversified responses are provided, and the interactivity is also stronger.
In one embodiment, determining output content based on dialog requirements and output channels includes:
determining a first dialogue vector of the dialogue requirement based on a preset dialogue vector model; wherein the dialogue vector model is, for example, a bag-of-words model; the first dialogue vector treats each word in the dialogue as an independent feature: a vocabulary is constructed, and each dialogue is represented by the occurrence count of each word (or by weights such as TF-IDF), finally yielding a fixed-length vector representation;
acquiring a corpus group corresponding to an output channel; wherein, the output channel corresponds corpus group and is: a dialogue record containing a dialogue mode corresponding to the output channel;
determining a plurality of first sentence-breaking corpora from the corpus group based on a preset sentence-breaking rule; wherein the sentence-breaking rule is, for example: for the questioner, a sentence is broken when the enter key is pressed; for the replier, a sentence is broken when nothing has been output for 3 s;
determining a second dialogue vector of the first sentence-breaking corpus based on the dialogue vector model; wherein, the construction rule of the second dialogue vector is the same as the first dialogue vector;
aligning the vector starting points of the first dialogue vector and the second dialogue vector, and acquiring a first vector included angle between the first dialogue vector and the second dialogue vector after the vector starting points are aligned; the first vector included angle is, for example: 15 degrees;
if the first vector included angle is smaller than a preset vector included angle threshold value, the corresponding second dialogue vector is used as a third dialogue vector; the vector angle threshold is preset manually, for example: 10 degrees;
determining a first vector included angle between the first dialogue vector and the third dialogue vector, and taking the first vector included angle as a second vector included angle;
rotating the third dialogue vector by a second vector included angle to obtain a fourth dialogue vector;
calculating a vector modulus difference value between the fourth dialogue vector and the first dialogue vector; the vector modulus difference is, for example: 0.2;
determining output content according to the second vector included angle and the vector module value difference value;
wherein calculating a vector modulus difference between the fourth dialog vector and the first dialog vector comprises:
calculating a dimension difference value of the same vector dimension of the fourth dialogue vector and the first dialogue vector; the dimension difference value is a numerical value difference value of vector elements of a fourth dialogue vector and a first dialogue vector of the same vector dimension;
acquiring the dimension weight of the vector dimension according to the vector dimension and a preset dimension weight library; the vector dimension and the corresponding dimension weight in the preset dimension weight library are input in advance by manpower;
and determining the vector modulus difference value according to the dimension difference values and the dimension weights based on a preset calculation rule. The preset calculation rule is: square each dimension difference value, multiply it by the corresponding dimension weight, sum to obtain a result value, and take the square root of the result value to obtain the vector modulus difference value.
The working principle and the beneficial effects of the technical scheme are as follows:
When replying to an initiated dialogue, a vector matching technique is adopted. However, because each vector dimension represents a different meaning, a dialogue reply determined by simply calculating the vector difference is not accurate enough. The application therefore introduces a dialogue vector model, determines the first dialogue vector of the dialogue requirement, breaks the corpus corresponding to the output channel into sentences according to the sentence-breaking rule to determine the first sentence-breaking corpora, and determines the second dialogue vectors of the first sentence-breaking corpora. The first vector included angle between the first dialogue vector and each second dialogue vector is calculated, and the vectors whose first vector included angle is smaller than the vector included angle threshold are screened to obtain the second vector included angles. The vector modulus difference value between the fourth dialogue vector and the first dialogue vector is then calculated; during this calculation, the vector dimensions and a dimension weight library are introduced to determine the dimension weights, and the vector modulus difference value is determined from the dimension difference values and the dimension weights. Finally, the output content is determined according to the second vector included angle and the vector modulus difference value, which further improves the accuracy of the output content acquisition process and makes the result more suitable.
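Under the stated assumptions (bag-of-words vectors, a manually preset angle threshold, a manually preset dimension weight library), the matching chain can be sketched in NumPy as below. Since rotating the third dialogue vector by the second vector included angle aligns it with the first dialogue vector's direction while preserving its modulus, the rotation is implemented here as a direction alignment:

```python
import numpy as np

def angle_deg(a: np.ndarray, b: np.ndarray) -> float:
    cos = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
    return float(np.degrees(np.arccos(np.clip(cos, -1.0, 1.0))))

def weighted_modulus_diff(v4: np.ndarray, v1: np.ndarray, w: np.ndarray) -> float:
    # Square each dimension difference, weight it, sum, then take the root.
    return float(np.sqrt(np.sum(w * (v4 - v1) ** 2)))

def match_candidates(first, candidates, weights, angle_threshold=10.0):
    """Return (second vector angle, modulus difference, index) per third vector."""
    results = []
    for i, second in enumerate(candidates):
        theta = angle_deg(first, second)          # first vector included angle
        if theta >= angle_threshold:
            continue                               # not taken as a third vector
        # Rotation by theta aligns the third vector with `first`:
        fourth = np.linalg.norm(second) * first / np.linalg.norm(first)
        results.append((theta, weighted_modulus_diff(fourth, first, weights), i))
    return results

# Toy bag-of-words vectors over a 4-word vocabulary.
first = np.array([2.0, 1.0, 0.0, 1.0])            # first dialogue vector
cands = [np.array([2.0, 1.2, 0.0, 0.9]),          # second dialogue vectors
         np.array([0.0, 0.0, 3.0, 1.0])]
weights = np.array([1.0, 0.5, 0.5, 1.0])          # preset dimension weight library
print(match_candidates(first, cands, weights))
```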
In one embodiment, determining output content based on the second vector included angle and the vector modulus difference value comprises:
acquiring a first conversion rule corresponding to the vector included angle and a second conversion rule corresponding to the vector module value difference; the first conversion rule is as follows: a rule for converting the vector angle into a first conversion value; the second conversion rule is: a rule for converting the vector modulus difference value into a second converted value;
according to the first conversion rule, determining a first conversion value corresponding to the second vector included angle, and correlating with a corresponding third dialogue vector; wherein the first conversion value is a numerical value;
determining a second conversion value corresponding to the vector module value difference value according to a second conversion rule, and correlating with a corresponding third dialogue vector; wherein the second conversion value is a numerical value;
summing the first conversion value and the second conversion value associated with the third dialogue vector to obtain a statistical value; wherein the larger the statistical value, the worse the first sentence-breaking corpus corresponding to that third dialogue vector matches the dialogue requirement;
taking the first sentence-breaking corpus corresponding to the third dialogue vector with the smallest statistic value as the second sentence-breaking corpus;
and determining a third sentence-breaking corpus according to the second sentence-breaking corpus and the first sentence-breaking corpora in the corpus group, and taking the third sentence-breaking corpus as the output content. The third sentence-breaking corpus is the reply sentence (reply content) to the second sentence-breaking corpus within the corpus group.
The working principle and the beneficial effects of the technical scheme are as follows:
the application introduces a first conversion rule and a second conversion rule, determines a first conversion value according to the first conversion rule and the second vector included angle, determines a second conversion value according to the vector module value difference value and the second conversion rule, calculates the sum of the first conversion value and the second conversion value associated with the third dialogue vector, obtains a statistic value, takes the response of the first sentence-breaking corpus corresponding to the third dialogue vector with the minimum statistic value as output content, and has more reasonable determination of the output content.
In one embodiment, conducting the multi-mode full duplex dialogue according to the output content comprises:
acquiring the presentation requirement of the output content; wherein the presentation requirement is, for example: an image with certain parameters, or a dynamic video presentation with certain parameters;
determining a first support parameter according to the presentation requirement; wherein the first support parameter includes: the platform supporting the presentation requirement, that platform's hosting server, and its setting parameters;
acquiring a second support parameter of the local server; wherein the second support parameters include: configuration parameters of the local server;
judging whether the output content can be presented on the user interface or not according to the first support parameter and the second support parameter; judging whether the output content can be presented on the user interface, and if the second support parameter of the local server can support the first support parameter, the output content can be presented on the user interface, otherwise, the output content cannot be presented;
if the output content can be presented on the user interface, presenting the output content through the user interface;
if the output content cannot be presented on the user interface, establishing a communication link, through the local server, with a target platform meeting the presentation requirement, and sending the presentation requirement to the target platform; the target platform is, for example, a three-dimensional modeling platform;
and acquiring the presentation information after the target platform receives the presentation requirement, and returning the presentation information in real time. The presentation information is, for example, a three-dimensional animation from SolidWorks-aided design.
The working principle and the beneficial effects of the technical scheme are as follows:
When a dialogue user conducts a dialogue, different requirements arise, for example drawing graphics with a large demand on computing power, for which the local server may lack the corresponding configuration. The application therefore determines the first support parameter according to the acquired presentation requirement of the output content and judges, from the first support parameter and the second support parameter of the local server, whether the output content can be presented on the user interface. If it can, the output content is output directly; otherwise, a communication link is established with a target platform meeting the presentation requirement, and the presentation information returned by the target platform after receiving the presentation requirement is relayed in real time. This expands the breadth of forms in which multi-mode dialogue content can be presented and improves the user experience.
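The capability check and remote fallback can be sketched as below; the parameter fields and the relay step are assumed placeholders, with the real communication link and the target platform's rendering stubbed out:

```python
from dataclasses import dataclass

@dataclass
class SupportParams:
    gpu_memory_gb: float      # assumed capability field
    renders_3d: bool          # assumed capability field

def can_present_locally(required: SupportParams, local: SupportParams) -> bool:
    return (local.gpu_memory_gb >= required.gpu_memory_gb
            and (local.renders_3d or not required.renders_3d))

def present(output: str, required: SupportParams, local: SupportParams) -> str:
    if can_present_locally(required, local):
        return f"[local user interface] {output}"
    # Fallback: send the presentation requirement to a capable target platform
    # over a communication link (stubbed) and relay its presentation info live.
    return f"[relayed from target platform] {output}"

required = SupportParams(gpu_memory_gb=8.0, renders_3d=True)
local = SupportParams(gpu_memory_gb=2.0, renders_3d=False)
print(present("3D animation of the requested design", required, local))
```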
The embodiment of the application further provides a multi-mode full duplex dialogue system for semantic recognition, as shown in FIG. 2, which comprises:
an initiated dialogue acquisition subsystem 1, configured to acquire an initiated dialogue between a dialogue user and a preset dialogue model;
a dialogue mode determination subsystem 2, configured to determine the dialogue mode selected by the dialogue user;
a dialogue semantics acquisition subsystem 3, configured to acquire the dialogue semantics of the initiated dialogue according to the semantic recognition technology and the dialogue mode;
and a dialogue subsystem 4, configured to conduct a multi-mode full duplex dialogue according to the dialogue semantics and the dialogue mode.
It will be apparent to those skilled in the art that various modifications and variations can be made to the present application without departing from the spirit or scope of the application. Thus, it is intended that the present application also include such modifications and alterations insofar as they come within the scope of the appended claims or the equivalents thereof.

Claims (10)

1. A multi-mode full duplex dialogue method for semantic recognition, comprising:
step 1: acquiring an initiated dialogue between a dialogue user and a preset dialogue model;
step 2: determining a dialogue mode selected by a dialogue user;
step 3: acquiring the dialogue semantics of the initiated dialogue according to a semantic recognition technology and the dialogue mode;
step 4: conducting a multi-mode full duplex dialogue according to the dialogue semantics and the dialogue mode.
2. The multi-mode full duplex dialogue method for semantic recognition as claimed in claim 1, wherein step 1, acquiring an initiated dialogue between a dialogue user and a preset dialogue model, comprises:
determining an input port for initiating a dialog;
acquiring input information of a dialogue user;
determining a target port of the input information according to the information type of the input information;
determining port information according to input information based on an analysis rule corresponding to a target port;
determining the initiated dialogue according to the port information based on real-time Web technology.
3. The multi-mode full duplex dialogue method for semantic recognition as claimed in claim 1, wherein step 2, determining the dialogue mode selected by the dialogue user, comprises:
acquiring a mode selection instruction input by a dialogue user, and determining a dialogue mode according to the mode selection instruction;
and/or,
and acquiring the context information input by the dialogue user, determining the mode switching intention of the user according to the context information, and determining the dialogue mode according to the mode switching intention.
4. The multi-mode full duplex dialogue method for semantic recognition as claimed in claim 1, wherein step 3, acquiring the dialogue semantics of the initiated dialogue according to the semantic recognition technology and the dialogue mode, comprises:
collecting training data according to a dialogue mode;
based on a preset extraction rule, determining a plurality of extraction samples according to training data;
training semantic recognition decision trees according to the extraction samples based on a random forest algorithm;
determining a plurality of decision results according to the initiated dialog and the semantic recognition decision tree;
acquiring a decision result expression heat map;
carrying out hierarchical clustering on each decision result according to the decision result expression heat map to obtain a clustering tree;
determining tree nodes with the maximum volume in the cluster tree;
obtaining a central heat map value of a decision result corresponding to the tree node;
determining a meaning of the utterance according to the central heat map value;
wherein hierarchically clustering the decision results according to the decision result expression heat map to obtain the cluster tree comprises:
calculating the similarity between every two decision results, wherein the similarity is calculated as:

Dis(D_m, D_n) = √((X_m − X_n)² + (Y_m − Y_n)²)

Correlation(D_m, D_n) = exp(−Dis(D_m, D_n)² / (2σ²))

wherein D_m is the mth decision result; D_n is the nth decision result; Correlation(D_m, D_n) is the similarity calculation result of the mth and nth decision results; Dis(D_m, D_n) is the distance between the mth and nth decision results on the decision result expression heat map; X_m and X_n are respectively the calibration values of the mth and nth decision results in the X dimension of the decision result expression heat map; Y_m and Y_n are respectively the calibration values of the mth and nth decision results in the Y dimension of the decision result expression heat map; and σ is a preset similarity normalization coefficient;
and iteratively merging the decision results with the highest similarity to obtain the cluster tree.
5. The multi-mode full duplex dialogue method for semantic recognition as claimed in claim 4, wherein collecting training data according to the dialogue mode comprises:
acquiring a mode type of a dialogue mode;
determining a collection rule according to the mode type;
determining the preset collection template corresponding to the collection rule;
acquiring the dialogue scene of the initiated dialogue;
extracting dialogue scene characteristics of a dialogue scene based on a preset dialogue scene characteristic extraction rule;
determining feature setting parameters of the collection template according to dialogue scene features;
setting corresponding feature setting parameters of the collecting template to obtain a target template;
training data is collected based on the target template.
6. The multi-mode full duplex dialogue method for semantic recognition as claimed in claim 1, wherein step 4, conducting a multi-mode full duplex dialogue according to the dialogue semantics and the dialogue mode, comprises:
acquiring the dialogue requirement according to the dialogue semantics;
determining an output channel of a dialogue mode;
determining output content according to the dialogue requirement and the output channel;
and carrying out multi-mode full duplex dialogue according to the output content.
7. The multi-mode full duplex dialogue method for semantic recognition as claimed in claim 6, wherein determining the output content according to the dialogue requirement and the output channel comprises:
determining a first dialogue vector of dialogue requirements based on a preset dialogue vector model;
acquiring a corpus group corresponding to an output channel;
based on a preset sentence-breaking rule, determining a plurality of first sentence-breaking corpora according to the corpus group;
determining a second dialogue vector of the first sentence-breaking corpus based on the dialogue vector model;
aligning the vector starting points of the first dialogue vector and the second dialogue vector, and acquiring a first vector included angle between the first dialogue vector and the second dialogue vector after the vector starting points are aligned;
if the first vector included angle is smaller than a preset vector included angle threshold value, the corresponding second dialogue vector is used as a third dialogue vector;
determining a first vector included angle between the first dialogue vector and the third dialogue vector, and taking the first vector included angle as a second vector included angle;
rotating the third dialogue vector by a second vector included angle to obtain a fourth dialogue vector;
calculating a vector modulus difference value between the fourth dialogue vector and the first dialogue vector;
determining output content according to the second vector included angle and the vector module value difference value;
wherein calculating a vector modulus difference between the fourth dialog vector and the first dialog vector comprises:
calculating a dimension difference value of the same vector dimension of the fourth dialogue vector and the first dialogue vector;
acquiring the dimension weight of the vector dimension according to the vector dimension and a preset dimension weight library;
and determining a vector modulus value difference value according to the dimension difference value and the dimension weight based on a preset calculation rule.
8. The multi-mode full duplex dialogue method for semantic recognition as claimed in claim 7, wherein determining the output content according to the second vector included angle and the vector modulus difference value comprises:
acquiring a first conversion rule corresponding to the vector included angle and a second conversion rule corresponding to the vector module value difference;
according to the first conversion rule, determining a first conversion value corresponding to the second vector included angle, and correlating with a corresponding third dialogue vector;
determining a second conversion value corresponding to the vector module value difference value according to a second conversion rule, and correlating with a corresponding third dialogue vector;
summing the first conversion value and the second conversion value associated with the third dialogue vector to obtain a statistical value;
taking the first sentence-breaking corpus corresponding to the third dialogue vector with the smallest statistic value as the second sentence-breaking corpus;
and determining a third sentence-breaking corpus according to the second sentence-breaking corpus and the first sentence-breaking corpus in the corpus group, and taking the third sentence-breaking corpus as output content.
9. The multi-mode full duplex dialogue method for semantic recognition as claimed in claim 6, wherein conducting the multi-mode full duplex dialogue according to the output content comprises:
acquiring a presentation requirement of output content;
determining a first support parameter according to the presentation requirement;
acquiring a second support parameter of the local server;
judging whether the output content can be presented on the user interface or not according to the first support parameter and the second support parameter;
if the output content can be presented on the user interface, presenting the output content through the user interface;
if the output content cannot be presented on the user interface, establishing a communication link, through the local server, with a target platform meeting the presentation requirement, and sending the presentation requirement to the target platform;
and acquiring the presentation information after the target platform receives the presentation requirement, and returning the presentation information in real time.
10. A multi-mode full duplex dialogue system for semantic recognition, comprising:
an initiated dialogue acquisition subsystem, used for acquiring an initiated dialogue between a dialogue user and a preset dialogue model;
a dialogue mode determination subsystem, used for determining the dialogue mode selected by the dialogue user;
a dialogue semantics acquisition subsystem, used for acquiring the dialogue semantics of the initiated dialogue according to the semantic recognition technology and the dialogue mode;
and a dialogue subsystem, used for conducting a multi-mode full duplex dialogue according to the dialogue semantics and the dialogue mode.
CN202311212596.7A, filed 2023-09-19 (priority 2023-09-19): Multi-mode full duplex dialogue method and system for semantic recognition. Status: Active. Granted publication: CN117153157B (en).

Priority Applications (1)

CN202311212596.7A (granted as CN117153157B), priority date 2023-09-19, filing date 2023-09-19: Multi-mode full duplex dialogue method and system for semantic recognition

Publications (2)

Publication Number Publication Date
CN117153157A true CN117153157A (en) 2023-12-01
CN117153157B CN117153157B (en) 2024-06-04

Family

ID=88900745

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311212596.7A Active CN117153157B (en) 2023-09-19 2023-09-19 Multi-mode full duplex dialogue method and system for semantic recognition

Country Status (1)

Country Link
CN (1) CN117153157B (en)

Patent Citations (16)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20060020917A1 (en) * 2004-07-07 2006-01-26 Alcatel Method for handling a multi-modal dialog
EP1615124A1 (en) * 2004-07-07 2006-01-11 Alcatel Alsthom Compagnie Generale D'electricite A method for handling a multi-modal dialog
US20060155546A1 (en) * 2005-01-11 2006-07-13 Gupta Anurag K Method and system for controlling input modalities in a multimodal dialog system
US20110302247A1 (en) * 2010-06-02 2011-12-08 Microsoft Corporation Contextual information dependent modality selection
CN106663129A (en) * 2016-06-29 2017-05-10 深圳狗尾草智能科技有限公司 A sensitive multi-round dialogue management system and method based on state machine context
WO2019046463A1 (en) * 2017-08-29 2019-03-07 Zhoa Tiancheng System and method for defining dialog intents and building zero-shot intent recognition models
US20190278292A1 (en) * 2018-03-06 2019-09-12 Zoox, Inc. Mesh Decimation Based on Semantic Information
CN110136713A (en) * 2019-05-14 2019-08-16 苏州思必驰信息科技有限公司 Dialogue method and system of the user in multi-modal interaction
CN112732340A (en) * 2019-10-14 2021-04-30 苏州思必驰信息科技有限公司 Man-machine conversation processing method and device
CN112101045A (en) * 2020-11-02 2020-12-18 北京淇瑀信息科技有限公司 Multi-mode semantic integrity recognition method and device and electronic equipment
CN112613534A (en) * 2020-12-07 2021-04-06 北京理工大学 Multi-mode information processing and interaction system
CN113792196A (en) * 2021-09-10 2021-12-14 北京京东尚科信息技术有限公司 Method and device for man-machine interaction based on multi-modal dialog state representation
CN114416934A (en) * 2021-12-24 2022-04-29 北京百度网讯科技有限公司 Multi-modal dialog generation model training method and device and electronic equipment
CN115905490A (en) * 2022-11-25 2023-04-04 北京百度网讯科技有限公司 Man-machine interaction dialogue method, device and equipment
CN115840841A (en) * 2023-02-01 2023-03-24 阿里巴巴达摩院(杭州)科技有限公司 Multi-modal dialog method, device, equipment and storage medium
CN116661603A (en) * 2023-06-02 2023-08-29 南京信息工程大学 Multi-mode fusion user intention recognition method under complex man-machine interaction scene

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
窦敏 (DOU Min): "Design and Implementation of a Video Semantic Analysis System Based on CNN and LSTM", China Master's Theses Full-text Database, Information Science and Technology Series, No. 02, 15 February 2019 (2019-02-15) *

Also Published As

Publication number Publication date
CN117153157B (en) 2024-06-04

Similar Documents

Publication Publication Date Title
TW202009749A (en) Human-machine dialog method, device, electronic apparatus and computer readable medium
JP2021021955A (en) Method and device for generating and registering voiceprint
CN109325091B (en) Method, device, equipment and medium for updating attribute information of interest points
CN116127045B (en) Training method for generating large language model and man-machine voice interaction method based on model
CN107452379B (en) Dialect language identification method and virtual reality teaching method and system
WO2016197767A2 (en) Method and device for inputting expression, terminal, and computer readable storage medium
US20240070397A1 (en) Human-computer interaction method, apparatus and system, electronic device and computer medium
CN111930792B (en) Labeling method and device for data resources, storage medium and electronic equipment
CN109256133A (en) A kind of voice interactive method, device, equipment and storage medium
CN104598644A (en) User fond label mining method and device
KR20200109239A (en) Image processing method, device, server and storage medium
EP4113357A1 (en) Method and apparatus for recognizing entity, electronic device and storage medium
JP2015055979A (en) Data conversion device and method
US20220392493A1 (en) Video generation method, apparatus, electronic device, storage medium and program product
CN115099239B (en) Resource identification method, device, equipment and storage medium
CN110880324A (en) Voice data processing method and device, storage medium and electronic equipment
CN112100357A (en) Method and device for generating guide language, electronic equipment and computer storage medium
CN112232066A (en) Teaching outline generation method and device, storage medium and electronic equipment
CN114064943A (en) Conference management method, conference management device, storage medium and electronic equipment
CN114065720A (en) Conference summary generation method and device, storage medium and electronic equipment
CN113393841A (en) Training method, device and equipment of speech recognition model and storage medium
CN117153157B (en) Multi-mode full duplex dialogue method and system for semantic recognition
CN114490967B (en) Training method of dialogue model, dialogue method and device of dialogue robot and electronic equipment
US20230267726A1 (en) Systems and methods for image processing using natural language
CN114528851A (en) Reply statement determination method and device, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant