CN113963724A

CN113963724A - Audio content metadata and generation method, electronic device and storage medium

Info

Publication number: CN113963724A
Application number: CN202111100818.7A
Authority: CN
Inventors: 吴健
Original assignee: Saiyinxin Micro Beijing Electronic Technology Co ltd
Current assignee: Saiyinxin Micro Beijing Electronic Technology Co ltd
Priority date: 2021-09-18
Filing date: 2021-09-18
Publication date: 2022-01-21

Abstract

The present disclosure relates to audio content metadata and a generation method thereof, an electronic device, and a storage medium. The audio content metadata comprises: a property area including an audio content identifier and an audio content name of the audio content, the audio content identifier representing identification information about the audio content, and the audio content name representing name information about the audio content; and a sub-element region including reference audio object information for referring to one or more audio object elements and associating the contents of the audio object elements with their formats. The audio content metadata describes the metadata and the format of the audio content, so that the content of an audio program can be provided during audio playback and the content and the format of one or more audio object elements can be associated, which improves the quality of the audio playback scene.

Description

Audio content metadata and generation method, electronic device and storage medium

Technical Field

The present disclosure relates to the field of audio processing technologies, and in particular to audio content metadata and a generation method thereof, an electronic device, and a storage medium.

Background

With the development of technology, audio has become more and more complex. Early single-channel audio gave way to stereo, and the focus of work shifted to the correct handling of the left and right channels. Processing became more complex once surround sound appeared. The surround 5.1 speaker system imposes an ordering constraint on a plurality of channels, and the surround 6.1 speaker system, the surround 7.1 speaker system, and the like further diversify audio processing: the correct signals must be delivered to the proper speakers so that the speakers work together to produce the intended effect. Thus, as sound becomes more immersive and interactive, the complexity of audio processing also increases greatly.

Sound channels (or audio channels) refer to audio signals that are independent of each other and that are captured or played back at different spatial locations when sound is recorded or played. The number of channels is the number of sound sources when recording or the number of corresponding speakers when playing back sound. For example, a surround 5.1 speaker system comprises audio signals at 6 different spatial positions, and each separate audio signal is used to drive a speaker at the corresponding spatial position; a surround 7.1 speaker system comprises audio signals at 8 different spatial positions, and each separate audio signal is used to drive a speaker at the corresponding spatial position.

Therefore, the effect achieved by current loudspeaker systems depends on the number and spatial positions of the loudspeakers. For example, a two-channel speaker system cannot achieve the effect of a surround 5.1 speaker system.

The present disclosure provides audio content metadata and a generation method thereof in order to provide metadata capable of solving the above technical problems.

Disclosure of Invention

The present disclosure is directed to an audio content metadata generation method, an electronic device, and a storage medium, so as to solve one of the above technical problems.

To achieve the above object, a first aspect of the present disclosure provides audio content metadata, including:

a property area including an audio content identifier of the audio content and an audio content name, the audio content identifier representing identification information about the audio content, the audio content name being related to name information of the audio content;

and a sub-element region including reference audio object information for referring to one or more audio object elements and associating contents of the audio object elements with formats thereof.

To achieve the above object, a second aspect of the present disclosure provides a method for generating audio content metadata, including:

generating audio content metadata as described in the first aspect.

To achieve the above object, a third aspect of the present disclosure provides an electronic device, including: a memory and one or more processors;

the memory for storing one or more programs;

the one or more programs, when executed by the one or more processors, cause the one or more processors to generate audio content metadata as described in the first aspect.

To achieve the above object, a fourth aspect of the present disclosure provides a storage medium containing computer-executable instructions which, when executed by a computer processor, generate audio content metadata as described in the first aspect.

From the above, the disclosed audio content metadata includes: a property area including an audio content identifier of the audio content and an audio content name, the audio content identifier representing identification information about the audio content, the audio content name being related to name information of the audio content; and a sub-element region including reference audio object information for referring to one or more audio object elements and associating contents of the audio object elements with formats thereof. The audio content metadata describes the metadata and the format of the audio content, so that the content of an audio program can be provided during audio playing, and the content and the format of one or more audio object elements can be associated, thereby improving the quality of an audio playing scene.

Drawings

Fig. 1 is a schematic diagram of a three-dimensional acoustic audio production model provided in embodiment 1 of the present disclosure;

Fig. 2 is a schematic structural diagram of audio content metadata provided in embodiment 1 of the present disclosure;

Fig. 3 is a flowchart of a method for generating audio content metadata provided in embodiment 2 of the present disclosure;

Fig. 4 is a schematic structural diagram of an electronic device provided in embodiment 3 of the present disclosure.

Detailed Description

The following examples are intended to illustrate the present disclosure, but are not intended to limit the scope of the present disclosure.

Metadata is information that describes the structural characteristics of data. The functions supported by metadata include indicating storage locations, historical data, resource lookups, and file records.

As shown in Fig. 1, the three-dimensional audio production model is composed of a set of production elements, each of which uses metadata to describe the structural characteristics of the data at the corresponding stage of audio production. The model includes a content production section and a format production section.

The production elements of the content production section include: an audio program element, an audio content element, an audio object element, and an audio track unique identification element.

The audio program includes narration, sound effects, and background music. The audio program references one or more audio contents, which are combined to construct a complete audio program. The audio program element is used to produce an audio program and to generate metadata describing the structural characteristics of the audio program.

The audio content describes the content of one component of an audio program, such as background music, and relates that content to its format by referencing one or more audio objects. The audio content element is used to produce audio content and to generate metadata describing the structural characteristics of the audio content.

The audio objects are used to link content, format, and other valuable information, and to determine the audio track unique identifications of the actual audio tracks. The audio object element is used to produce audio objects and to generate metadata describing the structural characteristics of the audio objects.

The audio track unique identification element is used to produce an audio track unique identification and to generate metadata describing the structural characteristics of the audio track unique identification.
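The reference chain of the content production section (program, content, object, audio track unique identification) can be sketched as a small data structure. This is only an illustrative sketch in Python: the class names, field names, and every identifier value below are hypothetical and are not defined by the disclosure.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class AudioObject:
    object_id: str
    # Audio track unique identifications referenced by this object (hypothetical values).
    track_uids: List[str] = field(default_factory=list)


@dataclass
class AudioContent:
    content_id: str
    name: str
    # An audio content references one or more audio objects.
    objects: List[AudioObject] = field(default_factory=list)


@dataclass
class AudioProgram:
    name: str
    # An audio program references one or more audio contents.
    contents: List[AudioContent] = field(default_factory=list)


def track_uids_of(program: AudioProgram) -> List[str]:
    """Walk program -> content -> object and collect every referenced track UID."""
    return [
        uid
        for content in program.contents
        for obj in content.objects
        for uid in obj.track_uids
    ]


music = AudioContent(
    "ACO-1001", "music", [AudioObject("AO_1001", ["ATU_00000001", "ATU_00000002"])]
)
narration = AudioContent(
    "ACO-1002", "narration", [AudioObject("AO_1002", ["ATU_00000003"])]
)
program = AudioProgram("documentary", [music, narration])
print(track_uids_of(program))
```

Walking the chain from the program down to the track identifications mirrors how each element only references the next level rather than embedding it.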

The production elements of the format production section include: an audio packet format element, an audio channel format element, an audio stream format element, and an audio track format element.

The audio packet format is the format adopted when the metadata of audio objects and the audio stream data are packed by channel group; an audio packet format may include nested audio packet formats. The audio packet format element is used to produce audio packet data. The audio packet data includes metadata in the audio packet format, which describes the structural characteristics of the audio packet format.

The audio channel format represents a single sequence of audio samples on which certain operations may be performed, such as moving a rendered object within a scene. An audio channel format may include nested audio channel formats. The audio channel format element is used to produce audio channel data. The audio channel data includes metadata in the audio channel format, which describes the structural characteristics of the audio channel format.

An audio stream is a combination of audio tracks needed to render a channel, an object, a higher-order ambisonics component, or a packet. The audio stream format is used to establish a relationship between a set of audio track formats and a set of audio channel formats or audio packet formats. The audio stream format element is used to produce audio stream data. The audio stream data includes metadata in the audio stream format, which describes the structural characteristics of the audio stream format.

The audio track format corresponds to a set of samples or data in a single audio track in the storage medium; it describes the format of the original audio data and of the signal decoded for the renderer. The audio track format is derived from the original audio data and is used to identify the combination of audio tracks required for successful decoding of the audio track data. The audio track format element is used to produce audio track data. The audio track data includes metadata in the audio track format, which describes the structural characteristics of the audio track format.

Each stage of the three-dimensional audio production model produces metadata that describes the characteristics of that stage.

After the audio channel data produced based on the three-dimensional audio production model is transmitted to a remote end, the remote end renders the audio channel data stage by stage based on the metadata and restores the produced sound scene.

Example 1

The present disclosure provides and describes in detail audio content metadata in a three-dimensional acoustic audio model.

Audio content (audioContent) is content used to describe a component of an audio program, such as background music in the audio program, and is also used to refer to one or more audio object elements and associate the content of the referred audio object elements with their formats to achieve smooth playing of the audio program.

As shown in fig. 2, the audio content metadata 100 includes a property area 110 and a sub-element area 120.

The property area 110 includes an audio content identifier 111 and an audio content name 112 of the audio content.

The audio content identifier 111 represents identification information about the audio content, and the audio content name 112 represents name information about the audio content.
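The two-part layout of the audio content metadata 100 can be sketched as follows. This is a minimal Python sketch for illustration only; the class and field names are hypothetical, and the reference numerals in the comments indicate which part of Fig. 2 each field is meant to stand for.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class PropertyArea:
    """Property area 110."""
    content_id: str    # audio content identifier 111
    content_name: str  # audio content name 112


@dataclass
class SubElementArea:
    """Sub-element area 120."""
    # Reference audio object information 121: identifiers of referenced audio objects.
    object_id_refs: List[str] = field(default_factory=list)


@dataclass
class AudioContentMetadata:
    """Audio content metadata 100 = property area + sub-element area."""
    properties: PropertyArea
    sub_elements: SubElementArea


meta = AudioContentMetadata(
    PropertyArea(content_id="ACO-1001", content_name="music"),
    SubElementArea(object_id_refs=["AO_1001", "AO_1002"]),
)
print(meta.properties.content_name, meta.sub_elements.object_id_refs)
```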

In the embodiment of the present disclosure, the audio content identifier 111 is a reference identifier of the audio content. For example, if the reference identifier of a certain audio content is "ACO-1001" and the name of that audio content is "music", the audio content named "music" can be obtained through the identifier "ACO-1001". The audio content identifier 111 can be described in a computer language as:

This indicates that the audio content identifier 111 of a certain audio content is "AO-1002", that is, the reference identifier through which the corresponding audio content can be obtained.

The audio content name 112 is the specific name of the audio content. For example, if the name of an audio content is "music" and its corresponding audio content identifier 111 is "ACO-1001", the audio content named "music" can be obtained through the audio content identifier 111 "ACO-1001".

The relationship between the audio content identifier 111 and the audio content name 112 can be described in a computer language as:
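As a minimal sketch of this relationship, the following Python example builds an XML element that carries both values and then looks the content up by its identifier. The element name `audioContent` and the attribute names `audioContentID` and `audioContentName` are assumptions borrowed from the Audio Definition Model of ITU-R BS.2076, not taken from this disclosure; the `catalog` dictionary is a hypothetical helper.

```python
import xml.etree.ElementTree as ET

# Assumed element and attribute names (ITU-R BS.2076 style), values from the example above.
element = ET.Element(
    "audioContent",
    {"audioContentID": "ACO-1001", "audioContentName": "music"},
)
xml_text = ET.tostring(element, encoding="unicode")
print(xml_text)

# Hypothetical catalog: index audio contents by their identifier.
catalog = {element.get("audioContentID"): element}

# The identifier acts as the reference through which the named content is obtained.
looked_up_name = catalog["ACO-1001"].get("audioContentName")
print(looked_up_name)
```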

Specifically, the property area 110 further includes:

language information 113 of the audio content, the language information 113 describing the language selected for the audio content. Specifically, the language information 113 selects the language of the audio content; for example, "en" represents English, "cn" represents Chinese, "jp" represents Japanese, and "kr" represents Korean. Once a piece of language information 113 is selected, the audio content is expressed in that language; if none is selected, the default option is "en".

The language information 113 may be described in a computer language as:

<audioContent audioContentLanguage="en">

This indicates that the language selected for the audio content is English and that the associated description of the audio content is expressed in English.
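The default behaviour described above (fall back to "en" when no language is selected) can be sketched in Python as follows. The attribute name `audioContentLanguage` is an assumption in the style of ITU-R BS.2076, and `content_language` and `LANGUAGE_NAMES` are hypothetical helpers introduced only for this example.

```python
import xml.etree.ElementTree as ET

# Language codes as listed in the description above.
LANGUAGE_NAMES = {"en": "English", "cn": "Chinese", "jp": "Japanese", "kr": "Korean"}
DEFAULT_LANGUAGE = "en"


def content_language(element: ET.Element) -> str:
    """Return the selected language code, or the default when none is selected."""
    return element.get("audioContentLanguage", DEFAULT_LANGUAGE)


with_language = ET.fromstring(
    '<audioContent audioContentID="ACO-1001" audioContentLanguage="cn"/>'
)
without_language = ET.fromstring('<audioContent audioContentID="ACO-1001"/>')

print(LANGUAGE_NAMES[content_language(with_language)])
print(LANGUAGE_NAMES[content_language(without_language)])
```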

The sub-element area 120 includes reference audio object information 121, and the reference audio object information 121 is used to reference one or more audio object elements and to associate the contents of the audio object elements with the formats thereof. When the audio content refers to a plurality of audio objects, since there may be an independent audio object format for each audio object, the reference audio object information 121 functions to associate the content of one audio object with its corresponding format, so that each audio object can be played smoothly.

The reference audio object information 121 may be described in a computer language as:

This indicates that the two audio objects with the audio object identifiers "AO_1001" and "AO_1002" are referenced, and that the contents of these two audio objects can be associated with their corresponding formats.
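A minimal Python sketch of such references is shown below. The child element name `audioObjectIDRef` is an assumption in the style of ITU-R BS.2076, and the `OBJECT_FORMATS` table with its format labels is entirely hypothetical; it only illustrates how each referenced object can be resolved to its own format.

```python
import xml.etree.ElementTree as ET
from typing import Dict, Iterable

# Hypothetical table: each audio object has its own, independent format.
OBJECT_FORMATS: Dict[str, str] = {
    "AO_1001": "stereo packet format",
    "AO_1002": "surround 5.1 packet format",
}


def add_object_refs(content: ET.Element, object_ids: Iterable[str]) -> ET.Element:
    """Append one reference child per audio object identifier."""
    for object_id in object_ids:
        ET.SubElement(content, "audioObjectIDRef").text = object_id
    return content


content = ET.Element(
    "audioContent", {"audioContentID": "ACO-1001", "audioContentName": "music"}
)
add_object_refs(content, ["AO_1001", "AO_1002"])

# Resolve every reference to the format of the object it points at.
refs = [ref.text for ref in content.findall("audioObjectIDRef")]
formats = {object_id: OBJECT_FORMATS[object_id] for object_id in refs}
print(formats)
```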

Specifically, in this embodiment of the present disclosure, the sub-element area further includes:

content measure loudness information 122 of the audio content, the content measure loudness information 122 describing the measured loudness of the audio content. The content measure loudness information 122 includes the rule used for measuring the content loudness and the particular standard selected under that rule. For example, if the rule used for measuring the content loudness is "ITU-R BS.1770" and the selected standard is "EBU R128", the corresponding description in a computer language is:

</loudnessMetadata>

This indicates that the rule used for measuring the content loudness is "ITU-R BS.1770", the standard selected under this rule is "EBU R128", and the gain is -23.0.
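One way such a loudness description might be assembled is sketched below in Python. The attribute names `loudnessMethod` and `loudnessRecType` and the child element `integratedLoudness` are assumptions borrowed from ITU-R BS.2076; in particular, placing the value -23.0 in `integratedLoudness` is an assumption, since the text above only calls it a gain. The helper `loudness_element` is hypothetical.

```python
import xml.etree.ElementTree as ET


def loudness_element(method: str, rec_type: str, value: float) -> ET.Element:
    """Build a loudnessMetadata element (assumed ITU-R BS.2076-style names)."""
    element = ET.Element(
        "loudnessMetadata",
        {"loudnessMethod": method, "loudnessRecType": rec_type},
    )
    # Assumption: the -23.0 value from the example is stored as integrated loudness.
    ET.SubElement(element, "integratedLoudness").text = f"{value:.1f}"
    return element


loudness = loudness_element("ITU-R BS.1770", "EBU R128", -23.0)
print(ET.tostring(loudness, encoding="unicode"))
```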

And/or dialogue element information 123, the dialogue element information 123 being used to characterize whether the audio program is dialogue, together with its corresponding attribute information. Specifically, for the audio content, the dialogue information may indicate non-dialogue, pure dialogue, or mixed dialogue. Non-dialogue means that the audio content contains no dialogue but may contain other audio such as background sound; pure dialogue means that the audio content contains only dialogue and no other audio such as background sound; mixed dialogue means that the audio content contains both dialogue and background sound.

Specifically, in this embodiment of the present disclosure, the dialogue element information includes:

non-dialogue information, in which case the dialogue value of the dialogue element information is 0, indicating that the audio content is not dialogue;

pure dialogue information, in which case the dialogue value of the dialogue element information is 1, indicating that the audio content is pure dialogue;

and mixed dialogue information, in which case the dialogue value of the dialogue element information is 2, indicating that the audio content is mixed dialogue.

Specifically, by setting the dialog value, a specific use manner of the dialog element information can be determined.
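The three dialogue values can be sketched as an enumeration in Python. The enumeration name `Dialogue` and the helper `classify` are hypothetical; only the numeric values 0, 1, and 2 and their meanings come from the description above.

```python
from enum import IntEnum


class Dialogue(IntEnum):
    NON_DIALOGUE = 0   # no dialogue; other audio such as background sound may exist
    PURE_DIALOGUE = 1  # dialogue only
    MIXED = 2          # dialogue together with background sound


def classify(has_dialogue: bool, has_other_audio: bool) -> Dialogue:
    """Choose the dialogue value from what the audio content contains."""
    if has_dialogue and has_other_audio:
        return Dialogue.MIXED
    if has_dialogue:
        return Dialogue.PURE_DIALOGUE
    return Dialogue.NON_DIALOGUE


print(classify(has_dialogue=False, has_other_audio=True).name)
print(int(classify(has_dialogue=True, has_other_audio=True)))
```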

Specifically, in this embodiment of the present disclosure, the values of the non-dialogue information respectively include:

when the value of the non-dialogue information is 0, the non-dialogue information is undefined non-dialogue information;

when the value of the non-dialogue information is 1, the non-dialogue information is music information;

when the value of the non-dialog information is 2, the non-dialog information is represented as effect information.

Specifically, the value of the non-dialogue information is a parameter value of the non-dialogue information, and by selecting different parameter values, the working parameters and the working mode of the non-dialogue information can be obtained.

Specifically, in this embodiment of the present disclosure, the values of the pure dialog information respectively include:

when the value of the pure dialogue information is 0, the pure dialogue information is undefined pure dialogue information;

when the value of the pure dialogue information is 1, the pure dialogue information is shown as pure dialogue information with a story plot;

when the value of the pure dialogue information is 2, the pure dialogue information is represented as voice-over information;

when the value of the pure dialogue information is 3, the pure dialogue information is represented as voice subtitle information;

when the value of the pure dialogue information is 4, the pure dialogue information is represented as audio description information or visual impairment information;

when the value of the pure dialogue information is 5, the pure dialogue information is expressed as comment information;

when the value of the pure dialogue information is 6, the pure dialogue information is represented as emergency information.

Specifically, the value of the pure dialogue information is a parameter value of the pure dialogue information, and by selecting different parameter values, the working parameters and the working mode of the pure dialogue information can be obtained.

Specifically, in this embodiment of the present disclosure, the values of the mixed dialogue information respectively include:

when the value of the mixed dialogue information is 0, the mixed dialogue information is undefined mixed dialogue information;

when the value of the mixed dialogue information is 1, the mixed dialogue information is represented as complete main line information;

when the value of the mixed dialogue information is 2, the mixed dialogue information is represented as mixed information;

when the value of the mixed dialogue information is 3, the mixed dialogue information is represented as hearing impairment information.

Specifically, the value of the mixed dialogue information is a parameter value of the mixed dialogue information, and the working parameters and the working mode of the mixed dialogue information can be obtained by selecting different parameter values.
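The value tables above can be sketched as one lookup keyed first by the dialogue value and then by the parameter value. This is an illustrative Python sketch; the table name `CONTENT_KINDS`, the helper `describe`, and the short English labels are hypothetical paraphrases of the listed meanings.

```python
from typing import Dict

# Outer key: dialogue value (0 = non-dialogue, 1 = pure dialogue, 2 = mixed dialogue).
# Inner key: parameter value of the corresponding information.
CONTENT_KINDS: Dict[int, Dict[int, str]] = {
    0: {0: "undefined", 1: "music", 2: "effect"},
    1: {
        0: "undefined",
        1: "dialogue with a story plot",
        2: "voice-over",
        3: "voice subtitle",
        4: "audio description or visual impairment",
        5: "comment",
        6: "emergency",
    },
    2: {0: "undefined", 1: "complete main line", 2: "mixed", 3: "hearing impairment"},
}


def describe(dialogue_value: int, kind_value: int) -> str:
    """Return the meaning of a parameter value, rejecting combinations not listed."""
    if dialogue_value not in CONTENT_KINDS:
        raise ValueError(f"unknown dialogue value: {dialogue_value}")
    kinds = CONTENT_KINDS[dialogue_value]
    if kind_value not in kinds:
        raise ValueError(
            f"value {kind_value} is not defined for dialogue value {dialogue_value}"
        )
    return kinds[kind_value]


print(describe(1, 2))
print(describe(2, 3))
```

Keeping the three tables separate matters because the same parameter value means different things under different dialogue values; for example, 1 is music for non-dialogue content but the complete main line for mixed dialogue.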

The embodiment of the present disclosure describes the metadata of the audio content and the format thereof through the audio content metadata 100, which can provide the content of the audio program during audio playing, and link the content of one or more audio object elements with the format, thereby improving the quality of the audio playing scene.

Example 2

The present disclosure also provides an embodiment of a method for generating audio content metadata. Terms that share a name with those in the above embodiment have the same meaning and the same technical effect as in the above embodiment, and are not described again here.

As shown in fig. 3, a method for generating audio content metadata includes the steps of:

step S210, generating audio content metadata, where the audio content metadata includes:

a property area including an audio content identifier of the audio content and an audio content name, the audio content identifier representing identification information about the audio content, the audio content name representing name information about the audio content;

and a sub-element region including reference audio object information for referring to one or more audio object elements and associating contents of the audio object elements with formats thereof.

Optionally, the property area further comprises: language information of the audio content, the language information describing a language selected for the audio content.

Optionally, the sub-element region further includes: content measure loudness information of the audio content, the content measure loudness information describing a measured loudness of the audio content; and/or dialogue element information characterizing whether the audio program is dialogue, together with its corresponding attribute information.

Optionally, the dialogue element information includes: non-dialogue information, in which case the dialogue value of the dialogue element information is 0, indicating that the audio content is not dialogue; pure dialogue information, in which case the dialogue value of the dialogue element information is 1, indicating that the audio content is pure dialogue; and mixed dialogue information, in which case the dialogue value of the dialogue element information is 2, indicating that the audio content is mixed dialogue.

Optionally, the values of the non-dialogue information respectively include: when the value of the non-dialogue information is 0, the non-dialogue information is undefined non-dialogue information; when the value of the non-dialogue information is 1, the non-dialogue information is music information; when the value of the non-dialogue information is 2, the non-dialogue information is represented as effect information.

Optionally, the values of the pure dialogue information respectively include: when the value of the pure dialogue information is 0, the pure dialogue information is undefined pure dialogue information; when the value of the pure dialogue information is 1, the pure dialogue information is represented as pure dialogue information with a story plot; when the value of the pure dialogue information is 2, the pure dialogue information is represented as voice-over information; when the value of the pure dialogue information is 3, the pure dialogue information is represented as voice subtitle information; when the value of the pure dialogue information is 4, the pure dialogue information is represented as audio description information or visual impairment information; when the value of the pure dialogue information is 5, the pure dialogue information is represented as comment information; when the value of the pure dialogue information is 6, the pure dialogue information is represented as emergency information.

Optionally, the values of the mixed dialogue information respectively include: when the value of the mixed dialogue information is 0, the mixed dialogue information is undefined mixed dialogue information; when the value of the mixed dialogue information is 1, the mixed dialogue information is represented as complete main line information; when the value of the mixed dialogue information is 2, the mixed dialogue information is represented as mixed information; when the value of the mixed dialogue information is 3, the mixed dialogue information is represented as hearing impairment information.

The embodiment of the disclosure generates the audio content metadata to describe the metadata and the format of the audio content, can provide the content of the audio program during audio playing, and can link the content of one or more audio object elements with the format, thereby improving the quality of the audio playing scene.
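Step S210 can be sketched as a single function that assembles the property area and the sub-element region into one XML element. This is a minimal Python sketch under stated assumptions: the element and attribute names (`audioContent`, `audioContentID`, `audioContentName`, `audioContentLanguage`, `audioObjectIDRef`, `dialogue`) follow ITU-R BS.2076 conventions rather than this disclosure, and the function `generate_audio_content_metadata` and its parameters are hypothetical.

```python
import xml.etree.ElementTree as ET
from typing import Optional, Sequence


def generate_audio_content_metadata(
    content_id: str,
    name: str,
    object_ids: Sequence[str],
    language: str = "en",
    dialogue: Optional[int] = None,
) -> ET.Element:
    """Assemble audio content metadata: property area plus sub-element region."""
    if not object_ids:
        raise ValueError("at least one audio object must be referenced")
    if dialogue is not None and dialogue not in (0, 1, 2):
        raise ValueError("dialogue value must be 0, 1, or 2")

    # Property area: identifier, name, and (optional) language, defaulting to "en".
    content = ET.Element(
        "audioContent",
        {
            "audioContentID": content_id,
            "audioContentName": name,
            "audioContentLanguage": language,
        },
    )
    # Sub-element region: one reference per audio object, then optional dialogue info.
    for object_id in object_ids:
        ET.SubElement(content, "audioObjectIDRef").text = object_id
    if dialogue is not None:
        ET.SubElement(content, "dialogue").text = str(dialogue)
    return content


element = generate_audio_content_metadata(
    "ACO-1001", "music", ["AO_1001", "AO_1002"], dialogue=0
)
print(ET.tostring(element, encoding="unicode"))
```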

Example 3

Fig. 4 is a schematic structural diagram of an electronic device provided in embodiment 3 of the present disclosure. As shown in Fig. 4, the electronic device includes: a processor 30, a memory 31, an input device 32, and an output device 33. The number of processors 30 in the electronic device may be one or more, and one processor 30 is taken as an example in Fig. 4. The number of memories 31 in the electronic device may be one or more, and one memory 31 is taken as an example in Fig. 4. The processor 30, the memory 31, the input device 32, and the output device 33 of the electronic device may be connected by a bus or by other means, and Fig. 4 takes connection by a bus as an example. The electronic device may be a computer, a server, or the like. The embodiment of the present disclosure is described in detail by taking a server as an example of the electronic device, and the server may be an independent server or a cluster server.

Memory 31 is provided as a computer-readable storage medium that may be used to store software programs, computer-executable programs, and modules, such as program instructions/modules for generating audio content metadata as described in any embodiment of the present disclosure. The memory 31 may mainly include a storage program area and a storage data area, wherein the storage program area may store an operating system, an application program required for at least one function; the storage data area may store data created according to use of the device, and the like. Further, the memory 31 may include high speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other non-volatile solid state storage device. In some examples, the memory 31 may further include memory located remotely from the processor 30, which may be connected to the device over a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.

The input device 32 may be used to receive input numeric or character information and to generate key signal inputs related to user settings and function controls of the electronic device; it may also include a camera for capturing images and a sound pickup device for capturing audio data. The output device 33 may include an audio device such as a speaker. It should be noted that the specific composition of the input device 32 and the output device 33 can be set according to actual conditions.

The processor 30 executes various functional applications of the device and data processing, i.e. generating audio content metadata, by running software programs, instructions and modules stored in the memory 31.

Example 4

Embodiment 4 of the present disclosure also provides a storage medium containing computer-executable instructions which, when executed by a computer processor, generate audio content metadata as described in embodiment 1.

Of course, in the storage medium provided by the embodiments of the present disclosure, the computer-executable instructions are not limited to the operations of the method described above, and may also perform related operations of the method provided by any embodiment of the present disclosure, with the corresponding functions and advantages.

From the above description of the embodiments, it is obvious to a person skilled in the art that the present disclosure can be implemented by software together with the necessary general-purpose hardware, and can certainly also be implemented by hardware alone, although in many cases the former is the better embodiment. Based on such understanding, the technical solutions of the present disclosure may be embodied in the form of a software product, which may be stored in a computer-readable storage medium, such as a floppy disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a FLASH Memory (FLASH), a hard disk, or an optical disk of a computer, and which includes several instructions for enabling a computer device (which may be a robot, a personal computer, a server, or a network device) to execute the method according to any embodiment of the present disclosure.

It should be noted that, in the electronic device, the units and modules included in the electronic device are merely divided according to functional logic, but are not limited to the above division as long as the corresponding functions can be realized; in addition, specific names of the functional units are only used for distinguishing one functional unit from another, and are not used for limiting the protection scope of the present disclosure.

It should be understood that portions of the present disclosure may be implemented in hardware, software, firmware, or a combination thereof. In the above embodiments, the various steps or methods may be implemented in software or firmware stored in memory and executed by a suitable instruction execution system. For example, if implemented in hardware, as in another embodiment, any one or combination of the following techniques, which are known in the art, may be used: a discrete logic circuit having a logic gate circuit for implementing a logic function on a data signal, an application specific integrated circuit having an appropriate combinational logic gate circuit, a Programmable Gate Array (PGA), a Field Programmable Gate Array (FPGA), or the like.

In the description herein, references to the description of the term "in an embodiment," "in yet another embodiment," "exemplary" or "in a particular embodiment," etc., mean that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the disclosure. In this specification, the schematic representations of the terms used above do not necessarily refer to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples.

Although the present disclosure has been described in detail hereinabove with respect to general description, specific embodiments and experiments, it will be apparent to those skilled in the art that some modifications or improvements may be made based on the present disclosure. Accordingly, such modifications and improvements are intended to be within the scope of this disclosure, as claimed.

Claims

1. Audio content metadata, comprising:

2. The audio content metadata according to claim 1, wherein the attribute section further comprises:

language information of the audio content, the language information describing a language selected by the audio content.

3. The audio content metadata according to claim 1, wherein the sub-element region further comprises:

content measure loudness information of the audio content, the content measure loudness information describing a measure loudness of the audio content;

and/or dialog element information characterizing whether the audio program is a dialog, and its corresponding attribute information.

4. The audio content metadata according to claim 3, wherein the dialog element information includes:

5. The audio content metadata according to claim 4, wherein the values of the dialog-free information respectively comprise:

6. The audio content metadata according to claim 4, wherein the values of the pure dialog information respectively comprise:

7. The audio content metadata according to claim 4, wherein the values of the mixed dialog information respectively include:

8. A method for generating audio content metadata, comprising:

generating the audio content metadata as claimed in any of claims 1-7.

9. An electronic device, comprising: a memory and one or more processors;

the memory for storing one or more programs;

the one or more programs, when executed by the one or more processors, cause the one or more processors to generate the audio content metadata of any of claims 1-7.

10. A storage medium containing computer-executable instructions which, when executed by a computer processor, generate audio content metadata as claimed in any one of claims 1 to 7.