Mean opinion score: Difference between revisions

From Wikipedia, the free encyclopedia
Davidfdzp (talk | contribs)
Added reference to an analytical formula to estimate MOS
Adding local short description: "Measure of quality of a stimulus or system", overriding Wikidata description "measure"
 
(44 intermediate revisions by 35 users not shown)
Line 1: Line 1:
{{Short description|Measure of quality of a stimulus or system}}
The Mean Opinion Score (MOS) test has been used for decades in telephony networks to obtain the human user's view of the quality of the network. In multimedia (audio, voice telephony, or video) especially when [[codecs]] are used to compress the [[Bandwidth (computing)|bandwidth]] requirement (for example, of a digitized voice connection from the standard 64 kilobit/second [[Pulse-code modulation|PCM]] [[modulation]]), the '''mean opinion score (MOS)''' provides a numerical indication of the perceived quality from the users' perspective of received media after compression and/or transmission. The MOS is expressed as a single number in the range 1 to 5, where 1 is lowest perceived audio quality, and 5 is the highest perceived [[audio quality measurement]].
'''Mean opinion score''' (MOS) is a measure used in the domain of [[Quality of Experience]] and [[telecommunications engineering]], representing overall quality of a stimulus or system. It is the [[arithmetic mean]] over all individual "values on a predefined scale that a subject assigns to his opinion of the performance of a system quality".<ref>ITU-T Rec. P.10/G.100 (2017) Vocabulary for performance, quality of service and quality of experience.</ref> Such ratings are usually gathered in a [[Subjective video quality|subjective quality evaluation test]], but they can also be algorithmically estimated.


MOS is a commonly used measure for video, audio, and audiovisual quality evaluation, but not restricted to those modalities. [[ITU-T]] has defined several ways of referring to a MOS in Recommendation [https://www.itu.int/rec/T-REC-P.800.1 ITU-T P.800.1], depending on whether the score was obtained from audiovisual, conversational, listening, talking, or video quality tests.
MOS tests for voice are specified by [[ITU-T]] recommendation [http://www.itu.int/rec/T-REC-P.800-199608-I/en P.800]


==Rating scales and mathematical definition==
The MOS is generated by averaging the results of a set of standard, subjective tests where a number of listeners rate the heard [[audio quality]] of test sentences read aloud by both male and female speakers over the communications medium being tested. A listener is required to give each sentence a rating using the following rating scheme:


The MOS is expressed as a single rational number, typically in the range 1–5, where 1 is the lowest perceived quality and 5 is the highest perceived quality. Other MOS ranges are also possible, depending on the [[rating scale]] that has been used in the underlying test. The [[Absolute Category Rating]] scale is very commonly used; it maps ratings between ''Bad'' and ''Excellent'' to numbers between 1 and 5, as shown in the table below.
{| class="wikitable"
!Rating
!Label
|-
|5
|Excellent
|-
|4
|Good
|-
|3
|Fair
|-
|2
|Poor
|-
|1
|Bad
|}
{| class="wikitable" style="text-align:left"
|+'''Mean opinion score (MOS)'''
|-
! MOS !! Quality !! Impairment
|-
! 5
| Excellent || Imperceptible
|-
! 4
| Good || Perceptible but not annoying
|-
! 3
| Fair || Slightly annoying
|-
! 2
| Poor || Annoying
|-
! 1
| Bad || Very annoying
|}
Other standardized quality rating scales exist in [[:Category:ITU-T recommendations|ITU-T Recommendations]] (such as [http://www.itu.int/rec/T-REC-P.800/en ITU-T P.800] or [https://www.itu.int/rec/T-REC-P.910 ITU-T P.910]). For example, one could use a continuous scale ranging between 1–100. Which scale is used depends on the purpose of the test. In certain contexts there are no statistically significant differences between ratings for the same stimuli when they are obtained using different scales.<ref>{{Cite journal|last=Huynh-Thu|first=Q.|last2=Garcia|first2=M. N.|last3=Speranza|first3=F.|last4=Corriveau|first4=P.|last5=Raake|first5=A.|date=2011-03-01|title=Study of Rating Scales for Subjective Quality Assessment of High-Definition Video|journal=IEEE Transactions on Broadcasting|volume=57|issue=1|pages=1–14|doi=10.1109/TBC.2010.2086750|issn=0018-9316}}</ref>


The MOS is the [[arithmetic mean]] of all the individual scores, and can range from 1 (worst) to 5 (best).
The MOS is calculated as the [[arithmetic mean]] over single ratings performed by human subjects for a given stimulus in a [[Subjective video quality|subjective quality evaluation test]]. Thus:


:<math>MOS = \frac{\sum_{n=1}^N{R_n}}{N}</math>
Compressor/decompressor ([[codec]]) systems and digital signal processing ([[Digital signal processor|DSP]]) are commonly used in voice communications, and can be configured to conserve [[Bandwidth (computing)|bandwidth]], but there is a trade-off between voice quality and bandwidth conservation. The best codecs provide the most bandwidth conservation while producing the least degradation of voice quality. Bandwidth can be measured quantitatively, but voice quality requires human interpretation, although estimates of voice quality can be made by automatic test systems.


where {{nowrap|<math>R_n</math>}} are the individual ratings for a given stimulus by {{nowrap|<math>N</math>}} subjects.
A similar process can be used to evaluate [[subjective video quality]].


==Properties of the MOS==
As an example, the following are mean opinion scores for one implementation of different codecs [http://www.cisco.com/en/US/tech/tk1077/technologies_tech_note09186a00800b6710.shtml#mos]:


The MOS is subject to certain mathematical properties and biases. In general, there is an ongoing debate on the usefulness of the MOS to quantify Quality of Experience in a single scalar value.<ref>{{Cite journal|last=Hoßfeld|first=Tobias|last2=Heegaard|first2=Poul E.|last3=Varela|first3=Martín|last4=Möller|first4=Sebastian|author-link4=Sebastian Möller|date=2016-12-01|title=QoE beyond the MOS: an in-depth look at QoE via better metrics and their relation to MOS|journal=Quality and User Experience|language=en|volume=1|issue=1|pages=2|doi=10.1007/s41233-016-0002-1|issn=2366-0139|arxiv=1607.00321}}</ref>
{| class="wikitable sortable"
|-
!Codec
!Data rate <br /> [kbit/s]
!Mean opinion score <br /> (MOS)
|-
| [[G.711]] ([[ISDN]])
| 64
| 4.1
|-
| [[iLBC]]
| 15.2
| 4.14
|-
| [[Adaptive multi-rate compression|AMR]]
| 12.2
| 4.14
|-
| [[G.729]]
| 8
| 3.92
|-
| [[G.723.1]] r63
| 6.3
| 3.9
|-
| [[GSM]] [[Enhanced Full Rate|EFR]]
| 12.2
| 3.8
|-
| [[G.726|G.726 ADPCM]]
| 32
| 3.85
|-
| [[G.729a]]
| 8
| 3.7
|-
| [[G.723.1]] r53
| 5.3
| 3.65
|-
| [[G.728]]
| 16
| 3.61
|-
| [[GSM]] [[Full Rate|FR]]
| 12.2
| 3.5
|}


When the MOS is acquired using a categorical rating scale, it is based on an [[Level of measurement#Ordinal scale|ordinal scale]], similar to [[Likert scale]]s. In this case, the ranking of the scale items is known, but the intervals between them are not. Strictly speaking, it is therefore mathematically incorrect to calculate a mean over individual ratings in order to obtain the central tendency; the median should be used instead.<ref>Jamieson, Susan. "Likert scales: how to (ab) use them." Medical education 38.12 (2004): 1217-1218.</ref> However, in practice and in the definition of MOS, it is considered acceptable to calculate the arithmetic mean.
A drawback of obtaining MOS estimations is that doing so can be time-consuming and expensive, since it requires hiring experts to make the estimations. When a voice coding system is under development, or a developer has to test and compare several audio systems, it is important to be able to run a quick check.


It has been shown that for categorical rating scales (such as ACR), the individual items are not perceived as equidistant by subjects. For example, there may be a larger "gap" between ''Good'' and ''Fair'' than there is between ''Good'' and ''Excellent''. The perceived distance may also depend on the language into which the scale is translated.<ref>Streijl, Robert C., Stefan Winkler, and David S. Hands. "Mean opinion score (MOS) revisited: methods and applications, limitations and alternatives." Multimedia Systems 22.2 (2016): 213-227.</ref> However, some studies could not demonstrate a significant impact of scale translation on the obtained results.<ref>{{Cite journal|last=Pinson|first=M. H.|last2=Janowski|first2=L.|last3=Pepion|first3=R.|last4=Huynh-Thu|first4=Q.|last5=Schmidmer|first5=C.|last6=Corriveau|first6=P.|last7=Younkin|first7=A.|last8=Callet|first8=P. Le|last9=Barkowsky|first9=M.|date=October 2012|title=The Influence of Subjects and Environment on Audiovisual Subjective Tests: An International Study|journal=IEEE Journal of Selected Topics in Signal Processing|volume=6|issue=6|pages=640–651|doi=10.1109/jstsp.2012.2215306|issn=1932-4553|url=https://hal.archives-ouvertes.fr/hal-00725992/file/06286980.pdf}}</ref>
Some suitable English-language phrases used for determining a MOS as suggested by [[ITU-T]] recommendation [[P.800]] are:
*You will have to be very quiet.
*There was nothing to be seen.
*They worshipped wooden idols.
*I want a minute with the inspector.
*Did he need any money?


Several other biases are present in the way MOS ratings are typically acquired.<ref>Zielinski, Slawomir, Francis Rumsey, and Søren Bech. "On some biases encountered in modern audio quality listening tests-a review." Journal of the Audio Engineering Society 56.6 (2008): 427-451.</ref> In addition to the above-mentioned issues with scales that are perceived non-linearly, there is a so-called "range-equalization bias": subjects, over the course of a subjective experiment, tend to give scores that span the entire rating scale. This makes it impossible to compare two different subjective tests if the range of presented quality differs. In other words, the MOS is never an absolute measure of quality, but only relative to the test in which it has been acquired.
Some analytical formulas estimate the MOS from the packet loss in percent and the packet duration in milliseconds (see the paper referenced under External links):

Predicted MOS = 4.0 - 0.7 ln(%loss) - 0.1 ln(size_ms)
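
The formula above can be evaluated directly. The following Python sketch is illustrative only: the function name <code>predicted_mos</code> and its argument names are hypothetical, and it assumes that "ln" denotes the natural logarithm and that both inputs are strictly positive, since the logarithm is undefined at 0% loss.

```python
import math


def predicted_mos(loss_percent: float, size_ms: float) -> float:
    """Evaluate: Predicted MOS = 4.0 - 0.7 ln(%loss) - 0.1 ln(size_ms).

    Assumption: both arguments must be > 0, because ln(0) is undefined.
    """
    if loss_percent <= 0 or size_ms <= 0:
        raise ValueError("loss_percent and size_ms must be strictly positive")
    return 4.0 - 0.7 * math.log(loss_percent) - 0.1 * math.log(size_ms)


# Hypothetical input: 1% packet loss with 20 ms packets.
print(round(predicted_mos(1.0, 20.0), 2))  # 3.7
```

Because the logarithmic terms grow without bound, the result is not guaranteed to stay within the 1–5 range for extreme inputs; whether and how to clamp it is left open here.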
For the above reasons, and because of several other contextual factors that influence the perceived quality in a subjective test, a MOS value should only be reported if the context in which the values were collected is known and reported as well. MOS values gathered from different contexts and test designs therefore should not be directly compared. Recommendation [https://www.itu.int/rec/T-REC-P.800.2 ITU-T P.800.2] prescribes how MOS values should be reported. Specifically, P.800.2 says:<blockquote>it is not meaningful to directly compare MOS values produced from separate experiments, unless those experiments were explicitly designed to be compared, and even then the data should be statistically analysed to ensure that such a comparison is valid.</blockquote>

==MOS for speech and audio quality estimation==

MOS historically originates from [[subjectivity|subjective]] measurements in which listeners would sit in a "quiet room" and rate the quality of a telephone call as they perceived it. This kind of test methodology had been in use in the telephony industry for decades and was standardized in Recommendation [http://www.itu.int/rec/T-REC-P.800 ITU-T P.800]. It specifies that "the talker should be seated in a quiet room with volume between 30 and 120 m³ and a reverberation time less than 500 ms (preferably in the range 200–300 ms). The room noise level must be below 30 dBA with no dominant peaks in the spectrum." Requirements for other modalities were similarly specified in later ITU-T Recommendations.

== MOS estimation using quality models ==
Obtaining MOS ratings may be time-consuming and expensive, as it requires the recruitment of human assessors. For use cases such as codec development or service quality monitoring, where quality should be estimated repeatedly and automatically, MOS scores can also be predicted by [[Video quality#Objective video quality|objective quality models]], which typically have been developed and trained using human MOS ratings. A question that arises from using such models is whether the MOS differences they produce are noticeable to users. For example, when rating images on a five-point MOS scale, an image with a MOS equal to 5 is expected to be noticeably better in quality than one with a MOS equal to 1. By contrast, it is not evident whether an image with a MOS equal to 3.8 is noticeably better in quality than one with a MOS equal to 3.6. Research on the smallest MOS difference that is perceptible to users for digital photographs showed that a MOS difference of approximately 0.46 is required in order for 75% of the users to be able to detect the higher quality image.<ref name="interpretMOS">{{Cite journal|last=Katsigiannis|first=S.|last2=Scovell|first2=J. N.|last3=Ramzan|first3=N.|last4=Janowski|first4=L.|last5=Corriveau|first5=P.|last6=Saad|first6=M.|last7=Van Wallendael|first7=G.|date=2018-05-02|title=Interpreting MOS scores, when can users see a difference? Understanding user experience differences for photo quality|journal=Quality and User Experience|volume=3|issue=1|pages=6|doi=10.1007/s41233-018-0019-8|issn=2366-0139|hdl=1854/LU-8581457|hdl-access=free}}</ref> Nevertheless, expectations of image quality, and hence MOS values, change over time. As a result, minimum noticeable MOS differences determined using analytical methods such as in <ref name="interpretMOS"></ref> may change over time.


==See also==
*[[Absolute Category Rating]]
*[[Likert scale]]
*[[MUSHRA]] (Recommendation [[ITU-R]] BS.1534)
*[[Video quality#Objective video quality|Objective video quality]]
*[[Subjective video quality]]
*[[MUSHRA]] ITU BS.1534 Recommendation
*[[PSQM]] Perceptual Speech Quality Measure (ITU-T P.861 - withdrawn and replaced with [[PESQ]] ITU-T P.862)
*[[PESQ]] Perceptual Evaluation of Speech Quality, a mechanism for automated assessment of the speech quality experienced by the user of a telephone system. It is standardised as ITU-T recommendation P.862 (02/01).
*[[PEVQ]] Perceptual Evaluation of Video Quality, a measurement algorithm for the automated assessment of video quality.
*[[PEAQ]] Perceptual Evaluation of Audio Quality, a measurement algorithm for the automated assessment of audio quality.
*[[MNRU]]

==External links==
*[http://www.itu.int/rec/T-REC-P.800-199608-I/en ITU-T Recommendation P.800 Methods for subjective determination of transmission quality]
*[http://www.rhyshaden.com/voice.htm Voice Quality]
*[http://www.sevana.fi/voice_quality_testing_measurement_analysis.php AQuA - Audio Quality Analyzer]
*[http://www.sevana.fi/non-intrusive-voice-quality-testing-software.php NIQA- Non-Intrusive Quality Analyzer]
*[http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.21.5576 Impact of network performance parameters on the end-to-end perceived speech quality]
*[http://www.itu.int/ITU-T/studygroups/com12/emodelv1/index.htm E-Model (ITU-T G.107)]


==References==
[[Category:Audio codecs]]
{{Reflist}}
[[Category:Speech codecs]]
[[Category:Video codecs]]


[[Category:Multimedia]]
[[de:Mean Opinion Score]]
[[Category:Telecommunications]]
[[fr:Note d'opinion moyenne]]
[[ja:平均オピニオン評点]]
[[nl:Mean opinion score]]
[[pl:Mean Opinion Score]]
[[pt:Mean Opinion Score]]

Latest revision as of 05:28, 23 February 2024

Mean opinion score (MOS) is a measure used in the domain of Quality of Experience and telecommunications engineering, representing overall quality of a stimulus or system. It is the arithmetic mean over all individual "values on a predefined scale that a subject assigns to his opinion of the performance of a system quality".[1] Such ratings are usually gathered in a subjective quality evaluation test, but they can also be algorithmically estimated.

MOS is a commonly used measure for video, audio, and audiovisual quality evaluation, but not restricted to those modalities. ITU-T has defined several ways of referring to a MOS in Recommendation ITU-T P.800.1, depending on whether the score was obtained from audiovisual, conversational, listening, talking, or video quality tests.

Rating scales and mathematical definition

The MOS is expressed as a single rational number, typically in the range 1–5, where 1 is the lowest perceived quality and 5 is the highest perceived quality. Other MOS ranges are also possible, depending on the rating scale that has been used in the underlying test. The Absolute Category Rating scale is very commonly used; it maps ratings between Bad and Excellent to numbers between 1 and 5, as shown in the table below.

Rating Label
5 Excellent
4 Good
3 Fair
2 Poor
1 Bad

Other standardized quality rating scales exist in ITU-T Recommendations (such as ITU-T P.800 or ITU-T P.910). For example, one could use a continuous scale ranging between 1–100. Which scale is used depends on the purpose of the test. In certain contexts there are no statistically significant differences between ratings for the same stimuli when they are obtained using different scales.[2]

The MOS is calculated as the arithmetic mean over single ratings performed by human subjects for a given stimulus in a subjective quality evaluation test. Thus:

:<math>MOS = \frac{\sum_{n=1}^N{R_n}}{N}</math>

where <math>R_n</math> are the individual ratings for a given stimulus by <math>N</math> subjects.
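
The arithmetic mean described above can be sketched in a few lines of Python. This is a minimal illustration, not a standardized procedure: the function name <code>mean_opinion_score</code>, the default 1–5 scale bounds, and the sample ratings are assumptions made for the example.

```python
from typing import Sequence


def mean_opinion_score(ratings: Sequence[float],
                       low: float = 1, high: float = 5) -> float:
    """Return the arithmetic mean of the individual ratings R_1..R_N.

    Assumption: ratings outside [low, high] are treated as input errors.
    """
    if not ratings:
        raise ValueError("at least one rating is required")
    for r in ratings:
        if not low <= r <= high:
            raise ValueError(f"rating {r} is outside the scale [{low}, {high}]")
    return sum(ratings) / len(ratings)


# Hypothetical ratings from N = 8 subjects for one stimulus.
ratings = [4, 5, 3, 4, 4, 2, 5, 4]
print(mean_opinion_score(ratings))  # 31 / 8 = 3.875
```

The scale bounds are parameters because other ranges, such as a continuous 1–100 scale, are also possible.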

Properties of the MOS

The MOS is subject to certain mathematical properties and biases. In general, there is an ongoing debate on the usefulness of the MOS to quantify Quality of Experience in a single scalar value.[3]

When the MOS is acquired using a categorical rating scale, it is based on an ordinal scale, similar to Likert scales. In this case, the ranking of the scale items is known, but the intervals between them are not. Strictly speaking, it is therefore mathematically incorrect to calculate a mean over individual ratings in order to obtain the central tendency; the median should be used instead.[4] However, in practice and in the definition of MOS, it is considered acceptable to calculate the arithmetic mean.
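
To illustrate how the two measures of central tendency can diverge on the same ordinal ratings, here is a small Python sketch using the standard library <code>statistics</code> module. The helper name <code>central_tendency</code> and the ratings are made up for the example.

```python
from statistics import mean, median


def central_tendency(ratings):
    """Return (mean, median) of a list of ordinal ratings."""
    return mean(ratings), median(ratings)


# Hypothetical ACR ratings from five subjects for one stimulus.
ratings = [5, 5, 4, 2, 1]
mos, med = central_tendency(ratings)
print(mos)  # 3.4
print(med)  # 4
```

Here the mean falls between Fair and Good while the median lands on Good, which shows why the choice of statistic can matter for skewed rating distributions.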

It has been shown that for categorical rating scales (such as ACR), the individual items are not perceived as equidistant by subjects. For example, there may be a larger "gap" between Good and Fair than there is between Good and Excellent. The perceived distance may also depend on the language into which the scale is translated.[5] However, some studies could not demonstrate a significant impact of scale translation on the obtained results.[6]

Several other biases are present in the way MOS ratings are typically acquired.[7] In addition to the above-mentioned issues with scales that are perceived non-linearly, there is a so-called "range-equalization bias": subjects, over the course of a subjective experiment, tend to give scores that span the entire rating scale. This makes it impossible to compare two different subjective tests if the range of presented quality differs. In other words, the MOS is never an absolute measure of quality, but only relative to the test in which it has been acquired.

For the above reasons, and because of several other contextual factors that influence the perceived quality in a subjective test, a MOS value should only be reported if the context in which the values were collected is known and reported as well. MOS values gathered from different contexts and test designs therefore should not be directly compared. Recommendation ITU-T P.800.2 prescribes how MOS values should be reported. Specifically, P.800.2 says:

it is not meaningful to directly compare MOS values produced from separate experiments, unless those experiments were explicitly designed to be compared, and even then the data should be statistically analysed to ensure that such a comparison is valid.

MOS for speech and audio quality estimation

MOS historically originates from subjective measurements in which listeners would sit in a "quiet room" and rate the quality of a telephone call as they perceived it. This kind of test methodology had been in use in the telephony industry for decades and was standardized in Recommendation ITU-T P.800. It specifies that "the talker should be seated in a quiet room with volume between 30 and 120 m³ and a reverberation time less than 500 ms (preferably in the range 200–300 ms). The room noise level must be below 30 dBA with no dominant peaks in the spectrum." Requirements for other modalities were similarly specified in later ITU-T Recommendations.

MOS estimation using quality models

Obtaining MOS ratings may be time-consuming and expensive, as it requires the recruitment of human assessors. For use cases such as codec development or service quality monitoring, where quality should be estimated repeatedly and automatically, MOS scores can also be predicted by objective quality models, which typically have been developed and trained using human MOS ratings. A question that arises from using such models is whether the MOS differences they produce are noticeable to users. For example, when rating images on a five-point MOS scale, an image with a MOS equal to 5 is expected to be noticeably better in quality than one with a MOS equal to 1. By contrast, it is not evident whether an image with a MOS equal to 3.8 is noticeably better in quality than one with a MOS equal to 3.6. Research on the smallest MOS difference that is perceptible to users for digital photographs showed that a MOS difference of approximately 0.46 is required in order for 75% of the users to be able to detect the higher quality image.[8] Nevertheless, expectations of image quality, and hence MOS values, change over time. As a result, minimum noticeable MOS differences determined using analytical methods such as in [8] may change over time.
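
As a rough illustration of the comparison described above, the following Python sketch checks whether two predicted MOS values differ by at least the reported threshold. The function name <code>is_noticeable</code> is hypothetical, and the default of 0.46 is taken from the cited study on digital photographs (75% of users); it should not be assumed to hold for other media, test contexts, or points in time.

```python
# Threshold reported for digital photographs in the cited study;
# treated here as an assumption, not a universal constant.
PHOTO_MOS_THRESHOLD = 0.46


def is_noticeable(mos_a: float, mos_b: float,
                  threshold: float = PHOTO_MOS_THRESHOLD) -> bool:
    """Return True if the absolute MOS difference reaches the threshold."""
    return abs(mos_a - mos_b) >= threshold


print(is_noticeable(5.0, 1.0))  # True
print(is_noticeable(3.8, 3.6))  # False
```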

See also

References

  1. ^ ITU-T Rec. P.10/G.100 (2017) Vocabulary for performance, quality of service and quality of experience.
  2. ^ Huynh-Thu, Q.; Garcia, M. N.; Speranza, F.; Corriveau, P.; Raake, A. (2011-03-01). "Study of Rating Scales for Subjective Quality Assessment of High-Definition Video". IEEE Transactions on Broadcasting. 57 (1): 1–14. doi:10.1109/TBC.2010.2086750. ISSN 0018-9316.
  3. ^ Hoßfeld, Tobias; Heegaard, Poul E.; Varela, Martín; Möller, Sebastian (2016-12-01). "QoE beyond the MOS: an in-depth look at QoE via better metrics and their relation to MOS". Quality and User Experience. 1 (1): 2. arXiv:1607.00321. doi:10.1007/s41233-016-0002-1. ISSN 2366-0139.
  4. ^ Jamieson, Susan. "Likert scales: how to (ab) use them." Medical education 38.12 (2004): 1217-1218.
  5. ^ Streijl, Robert C., Stefan Winkler, and David S. Hands. "Mean opinion score (MOS) revisited: methods and applications, limitations and alternatives." Multimedia Systems 22.2 (2016): 213-227.
  6. ^ Pinson, M. H.; Janowski, L.; Pepion, R.; Huynh-Thu, Q.; Schmidmer, C.; Corriveau, P.; Younkin, A.; Callet, P. Le; Barkowsky, M. (October 2012). "The Influence of Subjects and Environment on Audiovisual Subjective Tests: An International Study" (PDF). IEEE Journal of Selected Topics in Signal Processing. 6 (6): 640–651. doi:10.1109/jstsp.2012.2215306. ISSN 1932-4553.
  7. ^ Zielinski, Slawomir, Francis Rumsey, and Søren Bech. "On some biases encountered in modern audio quality listening tests-a review." Journal of the Audio Engineering Society 56.6 (2008): 427-451.
  8. ^ a b Katsigiannis, S.; Scovell, J. N.; Ramzan, N.; Janowski, L.; Corriveau, P.; Saad, M.; Van Wallendael, G. (2018-05-02). "Interpreting MOS scores, when can users see a difference? Understanding user experience differences for photo quality". Quality and User Experience. 3 (1): 6. doi:10.1007/s41233-018-0019-8. hdl:1854/LU-8581457. ISSN 2366-0139.