Perceptual Objective Listening Quality Analysis: Difference between revisions

From Wikipedia, the free encyclopedia
Content deleted Content added
Yobot (talk | contribs)
m WP:CHECKWIKI error fixes using AWB (12095)
 
(26 intermediate revisions by 14 users not shown)
Line 1: Line 1:
'''Perceptual Objective Listening Quality Analysis''' ('''POLQA''') was the working title of an [[ITU-T]] standard that covers a model to predict speech quality by means of analyzing digital speech signals.<ref>{{Cite web|title=POLQA - The Next-Generation Mobile Voice Quality Testing Standard|url=http://www.polqa.info/|access-date=2021-04-11|website=www.polqa.info}}</ref> The model was standardized as Recommendation ITU-T P.863 (Perceptual objective listening quality assessment) in 2011. The second edition of the standard appeared in 2014, and the third, currently in-force edition was adopted in 2018 under the title Perceptual objective listening quality prediction.<ref name=":0">{{Cite web|title=P.863 : Perceptual objective listening quality prediction|url=https://www.itu.int/rec/T-REC-P.863/en|access-date=2021-04-11|website=www.itu.int}}</ref>
'''POLQA''' ''Perceptual Objective Listening Quality Assessment'', also known as ITU-T Rec. P.863<ref name="POLQA">http://www.itu.int/rec/T-REC-P.863/en ITU-T Recommendation P.863: Perceptual objective listening quality assessment</ref> is an ITU-T Standard that covers a model to predict speech quality by means of digital speech signal analysis.
----


== Measurement scope ==
== Measurement scope ==
POLQA covers a model to predict speech quality,<ref name="POLQA2">http://www.aes.org/e-lib/browse.cfm?elib=16829 Perceptual Objective Listening Quality Assessment (POLQA), The Third Generation ITU-T Standard for End-to-End Speech Quality Measurement Part I—Temporal Alignment</ref><ref name="POLQA3">http://www.aes.org/e-lib/browse.cfm?elib=16830 Perceptual Objective Listening Quality Assessment (POLQA), The Third Generation ITU-T Standard for End-to-End Speech Quality Measurement Part II—Perceptual Model</ref> by means of digital speech signal analysis. The predictions of those objective measures should come as close as possible to subjective quality scores as obtained in subjective listening tests. Usually, a Mean Opinion Score (MOS) is predicted. POLQA uses real speech as a test stimulus for assessing telephony networks.
POLQA covers a model to predict speech quality,<ref name="POLQA2">{{Cite journal |last=Beerends |first=John G. |last2=Schmidmer |first2=Christian |last3=Berger |first3=Jens |last4=Obermann |first4=Matthias |last5=Ullmann |first5=Raphael |last6=Pomy |first6=Joachim |last7=Keyhl |first7=Michael |date=2013-07-08 |title=Perceptual Objective Listening Quality Assessment (POLQA), The Third Generation ITU-T Standard for End-to-End Speech Quality Measurement Part I—Temporal Alignment |url=https://www.aes.org/e-lib/browse.cfm?elib=16829 |journal=Journal of the Audio Engineering Society |language=English |volume=61 |issue=6 |pages=366–384}}</ref><ref name="POLQA3">{{Cite journal |last=Beerends |first=John G. |last2=Schmidmer |first2=Christian |last3=Berger |first3=Jens |last4=Obermann |first4=Matthias |last5=Ullmann |first5=Raphael |last6=Pomy |first6=Joachim |last7=Keyhl |first7=Michael |date=2013-07-08 |title=Perceptual Objective Listening Quality Assessment (POLQA), The Third Generation ITU-T Standard for End-to-End Speech Quality Measurement Part II—Perceptual Model |url=https://www.aes.org/e-lib/browse.cfm?elib=16830 |journal=Journal of the Audio Engineering Society |language=English |volume=61 |issue=6 |pages=385–402}}</ref> by means of digital speech signal analysis. The predictions of those objective measures should come as close as possible to subjective quality scores as obtained in subjective listening tests. Usually, a Mean Opinion Score (MOS) is predicted. POLQA uses real speech as a test stimulus for assessing telephony networks.


== Technology capabilities ==
== Technology capabilities ==
POLQA is the successor of [[PESQ]] (Recommendation ITU-T P.862).<ref name=":1">{{Cite web|title=P.862 : Perceptual evaluation of speech quality (PESQ): An objective method for end-to-end speech quality assessment of narrow-band telephone networks and speech codecs|url=https://www.itu.int/rec/T-REC-P.862|access-date=2021-04-11|website=www.itu.int}}</ref> POLQA avoids weaknesses of the current P.862 model and is extended towards handling of higher bandwidth audio signals. Further improvements target the handling of time called signals and signals with many delay variations. Similarly to P.862, POLQA supports measurements in the common telephony band (300–3400&nbsp;Hz), but in addition it has a second operational mode for assessing HD-Voice in wideband and super-wideband speech signals (50–14000&nbsp;Hz). POLQA also targets the assessment of speech signals recorded acoustically by an artificial head with mouth and ear simulators.
POLQA is the successor of [[PESQ]] (ITU-T Rec. P.862). POLQA avoids weaknesses of the current P.862 model and is extended towards handling of higher bandwidth audio
signals. Further improvements target the handling of time called signals and signals with many delay variations. Similarly to P.862,<ref name="PESQ">http://www.itu.int/rec/T-REC-P.862/en ITU-T Recommendation P.862: Perceptual evaluation of speech quality (PESQ): An objective method for end-to-end speech quality assessment of narrow-band telephone networks and speech codecs</ref> POLQA supports measurements in the common telephony band (300–3400&nbsp;Hz), but in addition it has a second operational mode for assessing HD-Voice in wideband and super-wideband speech signals (50–14000&nbsp;Hz). POLQA also targets the assessment of speech signals recorded acoustically by an artificial head with mouth and ear simulators.


== Development history ==
== Development history ==
The POLQA activities started in ITU-T in early 2006 under the working title POLQA. In mid-2009 a competition was started to evaluate several candidate models. In May 2010 ITU-T selected candidate models from three companies, OPTICOM, SwissQual a [[Rohde & Schwarz]] company, and TNO (Netherlands Organisation for Applied Scientific Research), to form the future Recommendation P.863. The three companies were asked to merge their approaches to one single standardized model. The result is now standardized as POLQA / P.863.<ref name="POLQA"/>
The POLQA activities started in ITU-T in early 2006 under the working title P.OLQA. In mid-2009, a competition was started to evaluate several candidate models. In May 2010, ITU-T selected candidate models from three companies (OPTICOM, SwissQual / [[Rohde & Schwarz]] and TNO ([[Netherlands Organisation for Applied Scientific Research]])). The three companies merged their approaches to one single model, which was adopted as Recommendation ITU-T P.863.<ref name=":0" />


== Genealogy of related standards ==
== Genealogy of related standards ==
ITU-T’s family of full reference objective voice quality measurements started in 1997 with P.861 (PSQM), which was superseded by P.862 ([[PESQ]])<ref name="PESQ"/> in 2001. P.862 was later complemented with the recommendations P.862.1<ref name="P8621">http://www.itu.int/rec/T-REC-P.862.1/en ITU-T Recommendation P.862.1: Mapping function for transforming P.862 raw result scores to MOS-LQO</ref> (mapping of [[PESQ]] scores to a MOS scale), P.862.2<ref name="P8622">http://www.itu.int/rec/T-REC-P.862.2/en ITU-T Recommendation P.862.2: Wideband extension to Recommendation P.862 for the assessment of wideband telephone networks and speech codecs</ref> (wideband measurements) and P.862.3<ref name="P8623">http://www.itu.int/rec/T-REC-P.862.3/en ITU-T Recommendation P.862.3 Application guide for objective quality measurement based on Recommendations P.862, P.862.1 and P.862.2</ref> (application guide). Since 2011 P.863 (POLQA)<ref name="POLQA"/> is in force. Two additional implementer’s guides for P.863 have been consented by ITU-T Study Group 12 in November 2011. In addition to the above listed full reference methods, the list of ITU-T’s objective voice quality measurement standards also includes P.563<ref name="P563">http://www.itu.int/rec/T-REC-P.563/en ITU-T Recommendation P.563: Single-ended method for objective speech quality assessment in narrow-band telephony applications</ref> (no-reference algorithm).
ITU-T’s family of full reference objective voice quality measurements started in 1997 with Recommendation ITU-T P.861 (PSQM), which was superseded by ITU-T P.862 (PESQ)<ref name=":1" /> in 2001. P.862 was later complemented with Recommendations ITU-T P.862.1<ref>{{Cite web|title=P.862.1 : Mapping function for transforming P.862 raw result scores to MOS-LQO|url=https://www.itu.int/rec/T-REC-P.862.1|access-date=2021-04-11|website=www.itu.int}}</ref> (mapping of PESQ scores to a MOS scale), ITU-T P.862.2<ref>{{Cite web|title=P.862.2 : Wideband extension to Recommendation P.862 for the assessment of wideband telephone networks and speech codecs|url=https://www.itu.int/rec/T-REC-P.862.2|access-date=2021-04-11|website=www.itu.int}}</ref> (wideband measurements) and ITU-T P.862.3<ref>{{Cite web|title=P.862.3 : Application guide for objective quality measurement based on Recommendations P.862, P.862.1 and P.862.2|url=https://www.itu.int/rec/T-REC-P.862.3|access-date=2021-04-11|website=www.itu.int}}</ref> (application guide). The first edition of ITU-T P.863 (POLQA)<ref name=":0" /> entered into force in 2011. An Application guide for Recommendation ITU-T P.863 was approved in 2019 and published as ITU-T P.863.1.<ref>{{Cite web|title=P.863.1 : Application guide for Recommendation ITU-T P.863|url=https://www.itu.int/rec/T-REC-P.863.1|access-date=2021-04-11|website=www.itu.int}}</ref>

In addition to the above listed full reference methods, the list of ITU-T’s objective voice quality measurement standards also includes ITU-T P.563<ref>{{Cite web|title=P.563 : Single-ended method for objective speech quality assessment in narrow-band telephony applications|url=https://www.itu.int/rec/T-REC-P.563|access-date=2021-04-11|website=www.itu.int}}</ref> (no-reference algorithm).


== Testing typology ==
== Testing typology ==
POLQA, similar to P.862 [[PESQ]], is a Full Reference (FR) algorithm that rates a degraded or processed speech signal in relation to the original signal. It compares each sample of the reference signal (talker side) to each corresponding sample of the degraded signal (listener side). Perceptual differences between both signals are scored as differences. The perceptual psycho-acoustic model is based on similar models of human perception as MP3 or AAC. Basically, the signals are analysed in the frequency domain (in critical bands) after applying masking functions. Unmasked differences
POLQA, similar to P.862 PESQ, is a Full Reference (FR) algorithm that rates a degraded or processed speech signal in relation to the original signal. It compares each sample of the reference signal (talker side) to each corresponding sample of the degraded signal (listener side). Perceptual differences between both signals are scored as differences. The perceptual psycho-acoustic model is based on similar models of human perception as MP3 or AAC. Basically, the signals are analysed in the frequency domain (in critical bands) after applying masking functions. Unmasked differences between the two signal representations will be counted as distortions. Finally, the accumulated distortions in the speech file are mapped into a 1 to 5 quality scale as usual for MOS tests. FR measurements deliver the highest accuracy and repeatability but can only be applied for dedicated tests in live networks (e.g. drive test tools for mobile network benchmarks).
between the two signal representations will be counted as distortions. Finally, the accumulated distortions in the speech file are mapped into a 1 to 5 quality scale as usual for MOS tests. FR measurements deliver the highest accuracy and repeatability but can only be applied for dedicated tests in live networks (e.g. drive test tools for mobile network benchmarks).


POLQA is full-reference algorithm and analyzes the speech signal sample-by-sample after a temporal alignment of corresponding excerpts of reference and test signal. POLQA can be applied to provide an end-to-end (E2E) quality assessment for a network, or characterize individual network components.
POLQA is a full-reference algorithm and analyzes the speech signal sample-by-sample after a temporal alignment of corresponding excerpts of reference and test signal. POLQA can be applied to provide an end-to-end (E2E) quality assessment for a network, or characterize individual network components.


POLQA results principally model [[mean opinion score]]s (MOS) that cover a scale from 1 (bad) to 5 (excellent).
POLQA results principally model [[mean opinion score]]s (MOS) that cover a scale from 1 (bad) to 5 (excellent).


== Description of the POLQA Algorithm ==
== Description of the POLQA algorithm ==
The inputs to the algorithm are two waveforms represented by two data vectors containing 16 bit PCM samples. The first vector contains the samples of the (undistorted) reference signal, whereas the second vector contains the samples of the degraded signal. The POLQA algorithm consists of a temporal alignment block, a sample rate estimator of a sample rate converter, which is used to compensate for differences in the sample rate of the input signals, and the actual core model, which performs the MOS calculation. In a first step, the delay between the two input signals is determined and the sample rate of the two signals relative to each other is estimated. The sample rate estimation is based on the delay information calculated by the temporal alignment. If the sample rate differs by more than approximately 1%, the signal with the higher sample rate is down sampled. After each step, the results are stored together with an average delay reliability indicator, which is a measure for the quality of the delay estimation. The result from the re-sampling step, which yielded the highest overall reliability, is finally chosen. Once the correct delay is determined and the sample rate differences have been compensated, the signals and the delay information are passed on to the core model, which calculates the perceptibility as well as the annoyance of the distortions and maps them to a MOS scale.
The inputs to the algorithm are two waveforms represented by two data vectors containing 16 bit PCM samples. The first vector contains the samples of the (undistorted) reference signal, whereas the second vector contains the samples of the degraded signal. The POLQA algorithm consists of a temporal alignment block, a sample rate estimator of a sample rate converter, which is used to compensate for differences in the sample rate of the input signals, and the actual core model, which performs the MOS calculation. In a first step, the delay between the two input signals is determined and the sample rate of the two signals relative to each other is estimated. The sample rate estimation is based on the delay information calculated by the temporal alignment. If the sample rate differs by more than approximately 1%, the signal with the higher sample rate is down sampled. After each step, the results are stored together with an average delay reliability indicator, which is a measure for the quality of the delay estimation. The result from the re-sampling step, which yielded the highest overall reliability, is finally chosen. Once the correct delay is determined and the sample rate differences have been compensated, the signals and the delay information are passed on to the core model, which calculates the perceptibility as well as the annoyance of the distortions and maps them to a MOS scale.
A much more detailed and comprehensive description of the algorithm can be found in.<ref name="POLQA"/> The next few sections are only intended to give an overview on the basics of POLQA’s internal structure.
A much more detailed and comprehensive description of the algorithm can be found in.<ref name=":0" /> The next few sections are only intended to give an overview on the basics of POLQA’s internal structure.


=== The Core Model ===
=== The core model ===
The main element of the core model is the perceptual model which is calculated four times using different parameters in order to cope with different major distortion types. Those distortion types can be split into additive distortions and subtracted distortions. For both types a further distinction is made between very strong and weaker effects. The inputs to the perceptual models are waveforms and the delay information. The output is the Disturbance Density, which is a measure for the perceptibility of distortions in the signals. The perceptual model for the main branch also produces indicators for Frequency distortions, Noise and Reverberation distortions. A subsequent switch which is triggered by a detector for very strong distortions reduces the four Disturbance Density values down to two, one for added and one for subtracted distortions. So far the Disturbance Density is an indicator for the perceptibility of distortions only and cognitive effects are not yet taken into account. Cognitive aspects are however important when human beings are asked to score the quality of what they can perceive. Essentially they convert the perceptibility measure Disturbance Density into an annoyance measure. This conversion is performed by correcting the Disturbance Density values for situations with:
The main element of the core model is the perceptual model which is calculated four times using different parameters in order to cope with different major distortion types. Those distortion types can be split into additive distortions and subtracted distortions. For both types a further distinction is made between very strong and weaker effects. The inputs to the perceptual models are waveforms and the delay information. The output is the Disturbance Density, which is a measure for the perceptibility of distortions in the signals. The perceptual model for the main branch also produces indicators for Frequency distortions, Noise and Reverberation distortions. A subsequent switch which is triggered by a detector for very strong distortions reduces the four Disturbance Density values down to two, one for added and one for subtracted distortions. So far the Disturbance Density is an indicator for the perceptibility of distortions only and cognitive effects are not yet taken into account. Cognitive aspects are however important when human beings are asked to score the quality of what they can perceive. Essentially they convert the perceptibility measure Disturbance Density into an annoyance measure. This conversion is performed by correcting the Disturbance Density values for situations with:


Line 43: Line 42:
So far all operations were performed on frames with a duration of approximately 32 and 43ms duration (depending on the sample rate and using an overlap of 50%) and for each Bark band separately. In a final step all indicators are integrated over time and frequency in order to compute the final MOS LQO value.
So far all operations were performed on frames with a duration of approximately 32 and 43ms duration (depending on the sample rate and using an overlap of 50%) and for each Bark band separately. In a final step all indicators are integrated over time and frequency in order to compute the final MOS LQO value.


==== The Perceptual Model ====
=== The perceptual model ===
The key concept inside the perceptual model is Idealisation. The idea behind this is, that POLQA is supposed to simulate [[Absolute Category Rating]] (ACR) tests. In an ACR test however, subjects have no comparison to the actual reference signal when they score a speech signal. Instead, it is assumed that subjects have an understanding of what an ideal signal sounds like and they use this as their own reference. Consequently, if they are asked to score a reference signal which is not absolutely perfect (e.g. it has the wrong volume or contains too much timbre, noise or reverberation), it will be scored worse than perfect. In its idealization step POLQA therefore corrects small imperfections of the reference signals in order to derive the same ideal reference for the comparison to the degraded signal as human subjects would use in their minds. Similar to the idealization of the reference signal, some distortions present in the degraded signal which are hardly perceptible in an ACR test will be partially compensated (e.g. small pitch shifts, linear frequency distortions).
The key concept inside the perceptual model is Idealisation. The idea behind this is, that POLQA is supposed to simulate [[Absolute Category Rating]] (ACR) tests. In an ACR test however, subjects have no comparison to the actual reference signal when they score a speech signal. Instead, it is assumed that subjects have an understanding of what an ideal signal sounds like and they use this as their own reference. Consequently, if they are asked to score a reference signal which is not absolutely perfect (e.g. it has the wrong volume or contains too much timbre, noise or reverberation), it will be scored worse than perfect. In its idealization step POLQA therefore corrects small imperfections of the reference signals in order to derive the same ideal reference for the comparison to the degraded signal as human subjects would use in their minds. Similar to the idealization of the reference signal, some distortions present in the degraded signal which are hardly perceptible in an ACR test will be partially compensated (e.g. small pitch shifts, linear frequency distortions).
The perceptual model starts with scaling the reference signal to an ideal average active speech level of approximately -26dBov. No such scaling is performed on the degraded signal. It is assumed that any deviation of the level of the degraded signal from the ideal -26dBov is to be scored as a degradation of the signal.
The perceptual model starts with scaling the reference signal to an ideal average active speech level of approximately -26dBov. No such scaling is performed on the degraded signal. It is assumed that any deviation of the level of the degraded signal from the ideal -26dBov is to be scored as a degradation of the signal.
Line 56: Line 55:


== See also ==
== See also ==
* Perceptual Evaluation of Speech Quality ([[PESQ]])
* [[Hearing-Aid Speech Quality Index]] (HASQI)
* Perceptual Evaluation of Video Quality ([[PEVQ]])
* [[Perceptual Evaluation of Audio Quality]] (PEAQ)
* Perceptual Evaluation of Audio Quality ([[PEAQ]])
* [[Perceptual Evaluation of Video Quality]] (PEVQ)
* Hearing Aid Speech Quality Index ([[HASQI]])


== References ==
== References ==
{{Reflist}}
{{Reflist}}

== External links ==
* [http://www.polqa.info/ Official Website on POLQA]
* [http://www.pesq.org/ Official Website on PESQ]


{{DEFAULTSORT:Polqa}}
{{DEFAULTSORT:Polqa}}
[[Category:ITU-T recommendations]]
[[Category:ITU-T recommendations]]
[[Category:Speech codecs]]
[[Category:ITU-T P Series Recommendations]]
[[Category:International standards]]

[[Category:Telecommunications]]
<!--- Categories --->
<!--- Categories --->
[[Category:Articles created via the Article Wizard]]

Latest revision as of 14:00, 18 April 2024

Perceptual Objective Listening Quality Analysis (POLQA) was the working title of an ITU-T standard that covers a model to predict speech quality by means of analyzing digital speech signals.[1] The model was standardized as Recommendation ITU-T P.863 (Perceptual objective listening quality assessment) in 2011. The second edition of the standard appeared in 2014, and the third, currently in-force edition was adopted in 2018 under the title Perceptual objective listening quality prediction.[2]

Measurement scope

POLQA covers a model to predict speech quality[3][4] by means of digital speech signal analysis. The predictions of such an objective measure should come as close as possible to the quality scores obtained in subjective listening tests. Usually, a Mean Opinion Score (MOS) is predicted. POLQA uses real speech as a test stimulus for assessing telephony networks.

Technology capabilities

POLQA is the successor of PESQ (Recommendation ITU-T P.862).[5] POLQA avoids weaknesses of the P.862 model and is extended towards handling of higher-bandwidth audio signals. Further improvements target the handling of time-scaled signals and signals with many delay variations. Like P.862, POLQA supports measurements in the common telephony band (300–3400 Hz), but in addition it has a second operational mode for assessing HD-Voice in wideband and super-wideband speech signals (50–14000 Hz). POLQA also targets the assessment of speech signals recorded acoustically by an artificial head with mouth and ear simulators.

Development history

The POLQA activities started in ITU-T in early 2006 under the working title P.OLQA. In mid-2009, a competition was started to evaluate several candidate models. In May 2010, ITU-T selected candidate models from three companies: OPTICOM, SwissQual / Rohde & Schwarz, and TNO (Netherlands Organisation for Applied Scientific Research). The three companies merged their approaches into a single model, which was adopted as Recommendation ITU-T P.863.[2]

Genealogy of related standards

ITU-T’s family of full-reference objective voice quality measurements started in 1997 with Recommendation ITU-T P.861 (PSQM), which was superseded by ITU-T P.862 (PESQ)[5] in 2001. P.862 was later complemented with Recommendations ITU-T P.862.1[6] (mapping of PESQ scores to a MOS scale), ITU-T P.862.2[7] (wideband measurements) and ITU-T P.862.3[8] (application guide). The first edition of ITU-T P.863 (POLQA)[2] entered into force in 2011. An application guide for Recommendation ITU-T P.863 was approved in 2019 and published as ITU-T P.863.1.[9]

In addition to the full-reference methods listed above, ITU-T’s objective voice quality measurement standards also include ITU-T P.563[10] (a no-reference algorithm).

Testing typology

POLQA, like P.862 (PESQ), is a Full Reference (FR) algorithm that rates a degraded or processed speech signal in relation to the original signal. It compares each sample of the reference signal (talker side) to the corresponding sample of the degraded signal (listener side), and scores the perceptual differences between the two signals. The psycho-acoustic model is based on models of human perception similar to those used in MP3 or AAC. The signals are analysed in the frequency domain (in critical bands) after applying masking functions. Unmasked differences between the two signal representations are counted as distortions. Finally, the accumulated distortions in the speech file are mapped onto a 1 to 5 quality scale, as is usual for MOS tests. FR measurements deliver the highest accuracy and repeatability, but in live networks they can only be applied in dedicated tests (e.g. drive test tools for mobile network benchmarks).

POLQA is a full-reference algorithm and analyzes the speech signal sample-by-sample after a temporal alignment of corresponding excerpts of reference and test signal. POLQA can be applied to provide an end-to-end (E2E) quality assessment for a network, or characterize individual network components.

POLQA results principally model mean opinion scores (MOS) that cover a scale from 1 (bad) to 5 (excellent).

Description of the POLQA algorithm

The inputs to the algorithm are two waveforms represented by two data vectors containing 16-bit PCM samples. The first vector contains the samples of the (undistorted) reference signal, whereas the second vector contains the samples of the degraded signal. The POLQA algorithm consists of a temporal alignment block, a sample rate estimator with a sample rate converter, which is used to compensate for differences in the sample rate of the input signals, and the actual core model, which performs the MOS calculation. In a first step, the delay between the two input signals is determined and the sample rate of the two signals relative to each other is estimated. The sample rate estimation is based on the delay information calculated by the temporal alignment. If the sample rates differ by more than approximately 1%, the signal with the higher sample rate is downsampled. After each step, the results are stored together with an average delay reliability indicator, which is a measure of the quality of the delay estimation. The result of the re-sampling step that yielded the highest overall reliability is finally chosen. Once the correct delay is determined and the sample rate differences have been compensated, the signals and the delay information are passed on to the core model, which calculates the perceptibility as well as the annoyance of the distortions and maps them to a MOS scale. A much more detailed and comprehensive description of the algorithm can be found in the Recommendation itself.[2] The next few sections are only intended to give an overview of the basics of POLQA’s internal structure.
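
The first step described above (delay estimation, followed by a check of the relative sample rate) can be illustrated with a minimal sketch. This is not the temporal alignment of P.863, which is considerably more elaborate; it uses a plain brute-force cross-correlation, and the function names, the `max_lag` parameter and the drift-based ratio estimate are assumptions made for illustration only.

```python
def estimate_delay(reference, degraded, max_lag):
    """Return the lag (in samples) at which `degraded` best matches `reference`.

    Brute-force cross-correlation; a positive lag means the degraded
    signal arrives later than the reference.
    """
    best_lag, best_score = 0, float("-inf")
    for lag in range(-max_lag, max_lag + 1):
        score = 0.0
        for i, r in enumerate(reference):
            j = i + lag
            if 0 <= j < len(degraded):
                score += r * degraded[j]
        if score > best_score:
            best_lag, best_score = lag, score
    return best_lag


def estimate_rate_ratio(delay_start, delay_end, span_samples):
    """Estimate the relative sample rate from the drift of the delay.

    Hypothetical helper: if the delay grows by `delay_end - delay_start`
    samples over `span_samples` reference samples, the degraded signal
    is stretched by that fraction.
    """
    return 1.0 + (delay_end - delay_start) / span_samples


def needs_resampling(rate_ratio, tolerance=0.01):
    """Apply the 'more than approximately 1%' rule from the text."""
    return abs(rate_ratio - 1.0) > tolerance


reference = [0, 0, 1, 2, 3, 0, 0, 0]
degraded = [0, 0, 0, 0, 1, 2, 3, 0]   # same pulse, two samples later
lag = estimate_delay(reference, degraded, max_lag=4)
ratio = estimate_rate_ratio(delay_start=0, delay_end=160, span_samples=8000)
print(lag, ratio, needs_resampling(ratio))
```

In this toy case the delay drifts by 160 samples over 8000 samples, a 2% difference, so the sketch would flag the pair for sample rate conversion.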

The core model

The main element of the core model is the perceptual model which is calculated four times using different parameters in order to cope with different major distortion types. Those distortion types can be split into additive distortions and subtracted distortions. For both types a further distinction is made between very strong and weaker effects. The inputs to the perceptual models are waveforms and the delay information. The output is the Disturbance Density, which is a measure for the perceptibility of distortions in the signals. The perceptual model for the main branch also produces indicators for Frequency distortions, Noise and Reverberation distortions. A subsequent switch which is triggered by a detector for very strong distortions reduces the four Disturbance Density values down to two, one for added and one for subtracted distortions. So far the Disturbance Density is an indicator for the perceptibility of distortions only and cognitive effects are not yet taken into account. Cognitive aspects are however important when human beings are asked to score the quality of what they can perceive. Essentially they convert the perceptibility measure Disturbance Density into an annoyance measure. This conversion is performed by correcting the Disturbance Density values for situations with:

  • Significant level variations
  • Many frame repetitions
  • Strong timbre
  • Spectral flatness
  • Noise switching during speech pauses
  • Many delay variations
  • Strong variations of the Disturbance Density over time
  • Strong variations of the loudness of the signals

Two further indicators, one for spectral flatness and one for level variations are also calculated in this step.
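
The switch described above, which reduces the four Disturbance Density values to two, can be pictured with the following sketch. The data layout, the field names and the boolean detector flag are hypothetical; the Recommendation defines the actual detector and how the values are combined.

```python
from dataclasses import dataclass


@dataclass
class DisturbanceDensities:
    """Four outputs of the perceptual model (field names are illustrative)."""
    added_normal: float
    added_strong: float
    subtracted_normal: float
    subtracted_strong: float


def select_densities(densities, strong_distortion_detected):
    """Reduce four values to two: one for added, one for subtracted distortions.

    Assumption: the detector simply selects between the variants tuned
    for very strong distortions and those tuned for weaker ones.
    """
    if strong_distortion_detected:
        return densities.added_strong, densities.subtracted_strong
    return densities.added_normal, densities.subtracted_normal


d = DisturbanceDensities(0.8, 2.5, 0.3, 1.1)
print(select_densities(d, strong_distortion_detected=False))
print(select_densities(d, strong_distortion_detected=True))
```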

So far, all operations were performed on frames with a duration of between approximately 32 ms and 43 ms (depending on the sample rate, and using an overlap of 50%) and for each Bark band separately. In a final step, all indicators are integrated over time and frequency in order to compute the final MOS-LQO value.
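
The relation between sample rate, frame duration and 50% overlap can be sketched as follows. The frame sizes in `FRAME_SIZES` are assumed values chosen so that the durations fall in the 32 ms to 43 ms range quoted above; they are not taken from this article, and the helper names are illustrative.

```python
# Assumed frame sizes in samples per sample rate (illustrative values).
FRAME_SIZES = {8000: 256, 16000: 512, 48000: 2048}


def frame_duration_ms(sample_rate):
    """Duration of one analysis frame in milliseconds."""
    return 1000.0 * FRAME_SIZES[sample_rate] / sample_rate


def split_frames(samples, frame_size):
    """Split a signal into frames that overlap by 50%.

    The hop between frame starts is half a frame; trailing samples that
    do not fill a whole frame are dropped in this sketch.
    """
    hop = frame_size // 2
    last_start = len(samples) - frame_size
    return [samples[s:s + frame_size] for s in range(0, last_start + 1, hop)]


for rate in sorted(FRAME_SIZES):
    print(rate, round(frame_duration_ms(rate), 2))

print(split_frames(list(range(8)), frame_size=4))
```

With these assumed sizes, the 8 kHz and 16 kHz cases give 32 ms frames and the 48 kHz case gives roughly 42.7 ms.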

The perceptual model

The key concept inside the perceptual model is Idealisation. The idea behind this is, that POLQA is supposed to simulate Absolute Category Rating (ACR) tests. In an ACR test however, subjects have no comparison to the actual reference signal when they score a speech signal. Instead, it is assumed that subjects have an understanding of what an ideal signal sounds like and they use this as their own reference. Consequently, if they are asked to score a reference signal which is not absolutely perfect (e.g. it has the wrong volume or contains too much timbre, noise or reverberation), it will be scored worse than perfect. In its idealization step POLQA therefore corrects small imperfections of the reference signals in order to derive the same ideal reference for the comparison to the degraded signal as human subjects would use in their minds. Similar to the idealization of the reference signal, some distortions present in the degraded signal which are hardly perceptible in an ACR test will be partially compensated (e.g. small pitch shifts, linear frequency distortions). The perceptual model starts with scaling the reference signal to an ideal average active speech level of approximately -26dBov. No such scaling is performed on the degraded signal. It is assumed that any deviation of the level of the degraded signal from the ideal -26dBov is to be scored as a degradation of the signal. Next, the spectra of both signals are computed using an FFT with 50% overlapping frames with a duration of between 32ms and 43ms duration (depending on the sample rate). Subsequently small pitch shifts of the degraded signal will be eliminated (Frequency Dewarping). Now, the spectra will be transformed to a psychoacoustically motivated pitch scale, by combining individual spectral lines (FFT bins) to so-called critical bands. The pitch scale used is similar to the Bark scale with an average resolution of 0.3 Bark per band. The result is the Pitch Power Density. 
At this stage, the first three distortion indicators, for frequency response distortions, additive noise and room reverberation, are calculated. After this, the excitation of each band is derived, which includes modeling masking effects in both the frequency and the temporal domain. The result, for each frame of each signal, is a head-internal representation that indicates roughly how loud each frequency component would be perceived. A further idealization step of the reference signal then filters out excessive timbre and low-level stationary noise. At the same time, linear frequency distortions and stationary noise are partially removed from the degraded signal. Finally, subtracting the idealized excitations yields the distortion density, which is a measure of the audibility of distortions.
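
The band pooling and the final subtraction can be sketched in Python as follows. This is a heavily simplified illustration under several assumptions. Traunmüller's Bark approximation stands in for POLQA's own pitch scale, which is only described as similar to Bark. Masking, idealization and any band weighting are left out, and the difference is taken as degraded minus reference. All function names are hypothetical.

```python
def hz_to_bark(freq_hz):
    """Traunmueller's Bark approximation (a stand-in, not POLQA's pitch scale)."""
    return 26.81 * freq_hz / (1960.0 + freq_hz) - 0.53


def pool_into_bands(bin_powers, sample_rate, band_width_bark=0.3):
    """Sum FFT bin powers (bins 0..N/2 of one frame) into bands of fixed Bark width."""
    nyquist = sample_rate / 2.0
    last_bin = len(bin_powers) - 1
    num_bands = int(hz_to_bark(nyquist) / band_width_bark) + 1
    bands = [0.0] * num_bands
    for k, power in enumerate(bin_powers):
        freq = k * nyquist / last_bin
        # The approximation is slightly negative near 0 Hz, so clamp to band 0.
        band = max(0, int(hz_to_bark(freq) / band_width_bark))
        bands[band] += power
    return bands


def band_difference(reference_bands, degraded_bands):
    """Per-band signed difference, degraded minus reference (assumed sign)."""
    return [d - r for r, d in zip(reference_bands, degraded_bands)]
```

For a 256-point frame at 8 kHz (129 bins), `pool_into_bands` produces bands about 0.3 Bark wide. Because the FFT bins are spaced more coarsely than the bands at low frequencies, some low bands stay empty in this sketch.
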

POLQA in research

POLQA has been used to investigate the impact of tone language and non-native listening on speech quality measurement.[11]

References

  1. "POLQA - The Next-Generation Mobile Voice Quality Testing Standard". www.polqa.info. Retrieved 2021-04-11.
  2. "P.863 : Perceptual objective listening quality prediction". www.itu.int. Retrieved 2021-04-11.
  3. Beerends, John G.; Schmidmer, Christian; Berger, Jens; Obermann, Matthias; Ullmann, Raphael; Pomy, Joachim; Keyhl, Michael (2013-07-08). "Perceptual Objective Listening Quality Assessment (POLQA), The Third Generation ITU-T Standard for End-to-End Speech Quality Measurement Part I—Temporal Alignment". Journal of the Audio Engineering Society. 61 (6): 366–384.
  4. Beerends, John G.; Schmidmer, Christian; Berger, Jens; Obermann, Matthias; Ullmann, Raphael; Pomy, Joachim; Keyhl, Michael (2013-07-08). "Perceptual Objective Listening Quality Assessment (POLQA), The Third Generation ITU-T Standard for End-to-End Speech Quality Measurement Part II—Perceptual Model". Journal of the Audio Engineering Society. 61 (6): 385–402.
  5. "P.862 : Perceptual evaluation of speech quality (PESQ): An objective method for end-to-end speech quality assessment of narrow-band telephone networks and speech codecs". www.itu.int. Retrieved 2021-04-11.
  6. "P.862.1 : Mapping function for transforming P.862 raw result scores to MOS-LQO". www.itu.int. Retrieved 2021-04-11.
  7. "P.862.2 : Wideband extension to Recommendation P.862 for the assessment of wideband telephone networks and speech codecs". www.itu.int. Retrieved 2021-04-11.
  8. "P.862.3 : Application guide for objective quality measurement based on Recommendations P.862, P.862.1 and P.862.2". www.itu.int. Retrieved 2021-04-11.
  9. "P.863.1 : Application guide for Recommendation ITU-T P.863". www.itu.int. Retrieved 2021-04-11.
  10. "P.563 : Single-ended method for objective speech quality assessment in narrow-band telephony applications". www.itu.int. Retrieved 2021-04-11.
  11. D. Ebem (University of Nigeria); et al. (2011). "The Impact of Tone language and Non-Native Language Listening on Measuring Speech Quality" (PDF). Journal of the Audio Engineering Society. 59 (9, 2011 September): 9.