In chapters 1 and 3 of this white paper, definitions are proposed to support discussions on audio quality and sound quality. Chapter 8 presented a selection of issues on operational quality. For discussions to make sense, in addition to definitions, information about the audio system is required - including judgement statements on the audio quality, sound quality and operational quality. Operational quality is generally covered by comparing the published operational specifications of the system devices and the system design to the operational requirements. For audio quality and sound quality however, things are a little more complicated. The assessment of both audio quality and sound quality is a subject of ongoing discussion in the professional audio market(*9A). Contributing to the discussions from a ‘Performance & Response’ viewpoint, this chapter presents two basic methods for the quality assessment of an audio system: analysis of electronic measurements, and listening.

9.1 Quality assessment through electronic measurements

To assess the audio quality of a networked audio system, the Performance of the system can be measured with electronic measurement equipment such as level meters, oscilloscopes, FFT analysers and impulse response analysers. Because the requirements for the Performance of a system are so clearly defined - see chapter 1 - the measurements can be analysed and interpreted using strict definitions and specifications. In cases where definitions and measurements for an identified audio quality problem don’t exist yet, a new measurement method has to be invented. But it is assumed that - after more than a century of worldwide research in the field of electronic sound reproduction systems - the majority of audio quality issues in audio systems have been defined, electronically measured and analysed. Most manufacturers of electronic equipment list the most relevant measurements in their specifications - although sometimes different definitions are used, making comparison between devices from different manufacturers difficult or even impossible.

To assess the sound quality of an audio system, the Response of the system can be measured using the same electronic measurements as used for the system’s Performance. But the result of an electronic measurement doesn’t say anything about sound quality by itself - it has to be translated. A complication is that the translation of electronic measurements to (individual) hearing experiences is not standardized. For example: a measured equalizer curve can be interpreted as ‘good sounding’ by one individual, and ‘bad sounding’ by the next. To obtain an ‘average audience’ translation matrix of physical sound characteristics to perceptual sound characteristics, clinical research is required using a large population of listeners. In contrast to electronic Performance measurements, research on the translation of electronic Response measurements to sound quality perception has not yet been conducted on a large scale(*9B). Instead, the sound quality (Response) of many audio devices (or DSP algorithms - ‘plug-ins’) is most commonly referenced to individual opinion leaders in the professional audio field, articles by respected journalists in professional audio magazines, or the overall sound quality image of the manufacturer. Of course, many individual listening sessions are also conducted to assess the sound quality of audio devices, but then there are many complications involving the translation of the hearing sensation to the device’s sound quality. This topic is presented in chapter 9.2.

Electronic measurement of analogue and digital systems is not difficult, as both analogue and digital measuring equipment is widely available. The acoustic parts of a system can be measured with the same equipment, but this requires a calibrated measurement microphone to transform acoustic signals into analogue signals. Often, individual parts of the system can be measured either by probing internal circuits in the device, or by bypassing parts of the system. A special case of bypassing is the measurement of the Performance of a system with the Response of the system bypassed - e.g. by switching off all processing. A big advantage of electronic measurements is that the system can be measured using a controlled test signal - making the measurement independent of the sound source, and also making it possible to reproduce the measurement at different times and locations for confirmation or to obtain a high statistical significance.
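As an illustration of measuring with a controlled test signal, the following is a minimal sketch in Python. It generates a sine test tone, passes it through a stand-in for the system under test, and reports the gain in dB. The function `device_under_test`, the 1 kHz tone, the 48 kHz sample rate and the 6 dB attenuation are assumptions chosen for the example; they are not taken from this white paper. In a real measurement the output samples would be captured from the actual system.

```python
import math

def sine(freq_hz, duration_s, sample_rate=48000, amplitude=0.5):
    """Generate a sine test tone as a list of float samples (full scale = 1.0)."""
    n = int(duration_s * sample_rate)
    return [amplitude * math.sin(2 * math.pi * freq_hz * i / sample_rate)
            for i in range(n)]

def rms(samples):
    """Root-mean-square value of a block of samples."""
    return math.sqrt(sum(s * s for s in samples) / len(samples))

def level_db(samples, reference=1.0):
    """RMS level in dB relative to `reference` (here: digital full scale)."""
    return 20 * math.log10(rms(samples) / reference)

def device_under_test(samples):
    """Hypothetical stand-in for the system: a plain 6 dB attenuation."""
    gain = 10 ** (-6 / 20)
    return [gain * s for s in samples]

# The same test signal can be reused at any time or place,
# which is what makes the measurement reproducible.
test_signal = sine(1000, 1.0)
output = device_under_test(test_signal)
gain_db = level_db(output) - level_db(test_signal)
print(f"measured gain: {gain_db:.2f} dB")
```

Because the input is known exactly, any difference between output and input can be attributed to the system rather than to the sound source.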

9.2 Quality assessment through listening tests

When it comes to the assessment of audio quality and sound quality of a networked audio system through listening (assessing the quality of the invoked hearing sensation), the human auditory system makes things extremely difficult: a hearing sensation is affected by the quality of the sound source, acoustic environment, listening position and angle, the individual’s hearing abilities, preferences, expectations, short term aural activity image memory, long term aural scene memory, and all other sensory inputs of the human body such as vision, taste, smell and touch. For this reason, quality judgements based on hearing sensations have to be analysed for all factors that play a role in the hearing experience before any quality statement about the audio system can be extracted. Figure 902 presents a selection of the most important factors that play a role in the result of a listening session, with selected factors discussed in the remainder of this chapter.

Aural scene memory

As the definition of quality is ‘conformance to requirements’, both the measurement - in the case of a listening session the hearing sensation - and the requirement are always needed to come to a quality assessment. The problem is that for hearing sensations, requirements do not really exist. Every individual has different preferences or expectations for a hearing sensation, so only an average hearing sensation can be assumed as the requirement - at this moment only available indirectly through the sales results of theatre tickets, CDs and music downloads. This data is not very reliable, as it is heavily biased by other factors not even included in figure 902, such as social peer pressure, culture and commercial factors.

The only compatible and relevant reference for quality assessment is the set of previous hearing sensations stored in the brain’s memory as aural scenes. However, these are only relevant if they were formed under exactly the same conditions as those of the listening session. Any deviation in the environment - in any of the factors in figure 902 - makes the memory invalid as a reference. A fundamental conclusion - based on the simplified auditory processing model presented in figure 412 in chapter 4.3 - is that the assessment of the quality of a sound system can never be achieved by a single listening session, because the aural scene memory is virtually always invalid. Listening tests always have to be performed in two or more sessions to create a reference and allow differential analysis to come to a quality assessment.

Aural activity image memory

In the auditory processing model, long term aural scene memory and short term aural activity image memory are presented. Because multiple listening sessions can only be conducted one-by-one, the reference used for differential analysis always uses memory. In case of long listening sessions, only the overall aural scene can be used for comparison - as it is stored in long term memory, allowing only aggregated quality assessments. If detailed quality assessments are required, the brain’s short term aural activity image memory of 20 seconds has to be used - which means that comparisons of hearing sensations have to take place within 20 seconds - using identical sound source signals. Listening sessions performed by switching between two situations while listening to an integral piece of music are not valid, as then the 20 seconds before and after the switch are never the same. The conclusion is that audio fragments used in listening sessions for detailed quality assessment have to be identical pieces of sound, shorter than 20 seconds.

Acoustic environment, listening position and listening angle

Listening sessions always take place in an acoustic environment. The environment can be a live space such as a concert hall, or a carefully designed, acoustically optimized control room in a music studio. Only listening sessions performed in ‘dry rooms’ in auditory research laboratories, or in a deserted open space without any wind, such as in a desert on a structure high above the ground, can cancel out the acoustic environment. Apart from the acoustic environment, the listening position and angle significantly affect the hearing sensation - a displacement between listening sessions of just several millimetres or degrees can change the result drastically. Two conclusions can be drawn: first, listening sessions should be conducted strictly in the sweet spot of a speaker system - facing the same direction for every session. In case of listening sessions to assess the quality of loudspeakers, a mechanical rotating system should be used to ensure that listening position and angle are always the same. Second, the result only has relevance for the listening position and angle in the acoustic environment the listening session was performed in. In any other acoustic environment the quality assessments obtained from the listening sessions are no longer valid. For live systems, the quality results are never valid for the audience as a whole, because the audience in most cases consists of more than one person, and these listeners cannot be located at the same listening position at the same time.

Anticipation

The anticipation of a result can cause test subjects to experience the result, even if there is none. Because of this placebo effect, clinical trials in the medical field commonly include two groups of test subjects, one receiving the drug under test, and the other receiving the placebo. The placebo effect also plays a role in listening tests - when a change is anticipated, a change might somehow be detected even if there is none. This is not a shortcoming of the listener: anticipation is simply one of the many factors that affect a hearing sensation. In fact, anticipation can significantly amplify the pleasantness of the hearing sensation - a concept used in many music compositions and performances. To assess the quality of a sound system however, anticipation should be avoided to prevent it from affecting the quality assessment. The simplest way to achieve this is to perform blind tests - with the test subject knowing that the audio fragment can be either different or the same.

Expectation

If listening tests are conducted sighted instead of blind, the test subjects can be influenced by non-auditory signals, along with previous experiences associated with those signals. For example, seeing the mechanical construction of test objects can create an expectation of the listening experience, e.g. large loudspeakers are expected to reproduce low frequencies very well. Of course, knowing the product brand and remembering the brand’s reputation also strongly affect the test results, rendering the outcome invalid(*9C).

Vision & other sensory organs

In many cases, the brain’s processing of audio signals is affected and sometimes overruled by other sensory inputs - a famous example is the McGurk Ba-Ga test described in chapter 4.3. If the non-audio sensory inputs differ between listening sessions, the outcome can be completely different. Even things often thought of as completely irrelevant for audio, such as the colour of cabinets and cables, or even the colour of the visual signals used to indicate sources in a listening test, can affect the quality assessment. For a series of relevant listening tests, the non-audio environment should be as constant as possible - e.g. constant colours and temperature. For the same reason, the consumption of food and drinks - which stimulates smell and taste - shortly before and during listening sessions should be avoided.

Sound source

All hearing sensations are affected by the quality of the sound source as described in chapter 1. To assess the quality of a system, a reference is required to compare results with the same sound source - thus ruling out the quality of the sound source. This is easy using pre-recorded materials - high resolution (e.g. 24 bit 96 kHz) audio recordings can be used. It is impossible to assess a system’s quality in multiple listening sessions using live musicians - as the musicians will never play the same piece of music exactly the same way twice. This causes the references in the aural memory to differ because of the sound source quality, and not the audio system quality. When using pre-recorded sound source materials, it is important to know if the test subject (listener) is familiar with the material, because that would allow the aural scene memory (with scenes most probably generated under different conditions) to affect the new hearing sensations.

Calibration

To allow differential analysis of a listening session comparing a single parameter or process in a system (for example comparing one signal chain with an equaliser applied and one without), all other processes in the signal chains have to be exactly the same. When two physically different analogue devices are used, this is never the case - the gain error alone can cause a level difference of up to 4 dB in case of a mixing console - significantly affecting the hearing sensation, as louder signals are most commonly perceived as better sounding. This can be partially eliminated by calibrating all signal chains in the listening test to produce the same output level. As the human auditory system is capable of detecting level differences down to 0.5 dB, listening test systems need to be calibrated to within 0.5 dB or better.
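The level comparison behind such a calibration can be sketched as follows, assuming that the RMS output level of each signal chain has been measured with the same test signal. The function names and the example RMS values are hypothetical; only the 0.5 dB tolerance comes from the text above.

```python
import math

def level_difference_db(rms_a, rms_b):
    """Level of chain A relative to chain B, in dB."""
    return 20 * math.log10(rms_a / rms_b)

def is_calibrated(rms_a, rms_b, tolerance_db=0.5):
    """True if both chains match within the tolerance (0.5 dB per the text)."""
    return abs(level_difference_db(rms_a, rms_b)) <= tolerance_db

# Hypothetical measured RMS output levels of two signal chains.
rms_a, rms_b = 1.00, 0.90

# Gain trim (in dB) to apply to chain B so that it matches chain A.
trim_b_db = level_difference_db(rms_a, rms_b)
rms_b_trimmed = rms_b * 10 ** (trim_b_db / 20)

print(f"mismatch before trim: {trim_b_db:.2f} dB, "
      f"calibrated: {is_calibrated(rms_a, rms_b)}")
print(f"calibrated after trim: {is_calibrated(rms_a, rms_b_trimmed)}")
```

A ratio of 1.00 to 0.90 corresponds to a mismatch of roughly 0.9 dB, which is outside the tolerance until the trim is applied.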

Assessment of preferences vs. assessment of detection thresholds

Listening tests can be performed to assess the preferences of test subjects when the differences between audio systems are large. With small differences however, it becomes increasingly difficult to assess preferences - in that case, an assessment of detection thresholds can be performed first. For this purpose, ABX testing is an accepted method - featuring blind listening to two situations A and B, then confronting the test subject with an unknown situation X, which can be either A or B, and asking which of the two it is. Repeating an ABX test multiple times allows a statistically significant statement on whether the difference could be detected or not.
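One common way to evaluate a series of ABX trials is a one-sided binomial test against guessing; this white paper does not prescribe a particular statistic, so the following is only a minimal sketch. The function name `abx_p_value` and the example of 16 trials are assumptions, and the 0.05 threshold is a conventional choice rather than a value from this paper.

```python
from math import comb

def abx_p_value(correct, trials):
    """Probability of getting at least `correct` answers right out of
    `trials` ABX trials by pure guessing (chance of success 0.5 per trial)."""
    favourable = sum(comb(trials, k) for k in range(correct, trials + 1))
    return favourable / 2 ** trials

# Hypothetical result: X identified correctly in 12 out of 16 trials.
p = abx_p_value(12, 16)
print(f"p = {p:.4f}")            # probability of this score (or better) by guessing
print("difference detected" if p < 0.05 else "no detection demonstrated")
```

A small p-value indicates that the score is unlikely to be the result of guessing. A large p-value does not prove that the difference is inaudible; it only means that detection was not demonstrated in this series.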

Training

Training strongly affects the result of listening tests. Trained listeners have learned to extract detailed information from the aural activity scene information and keep it in long term memory, not only remembering more details than untrained listeners, but also being able to report the results better - knowing the psycho-acoustic vocabulary.

Listening to audio sources as a form of short-term training in AAAB test sequences introduces a preference bias, as the test persons get accustomed to the A source and might perceive the B source as less preferred. This makes AAAB tests unsuited for preference tests. For detection tests however, AAAB tests can be suitable if the differences between the objects are extremely small.

9.3 Conducting listening tests

To achieve a relevant quality assessment about the audio quality and sound quality of an audio system, we propose the following conditions to be met for valid listening tests:

  1. Tests must be controlled: all factors other than the audio system must be either removed or kept constant:
    * sound source (live musicians cannot be used)
    * acoustic environment
    * listening position & angle
    * visible environment
    * temperature and humidity
    * smell and taste
  2. At least two listening sessions must be performed per listening test to allow differential analysis.
    * A single session referencing memory is not valid.
  3. Tests must be blind.
    * The test subjects must not know which reference they are listening to.
  4. Audio materials must be shorter than 20 seconds.
  5. If different signal chains are used, their total gain must be calibrated to within 0.5 dB.
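The conditions above can be written down as a simple checklist. The sketch below is one possible encoding, with a hypothetical `ListeningTestPlan` record and `violations` function; the field names are assumptions, while the thresholds (two sessions, 20 seconds, 0.5 dB) are the ones listed above.

```python
from dataclasses import dataclass

@dataclass
class ListeningTestPlan:
    controlled: bool          # all non-system factors removed or kept constant
    sessions: int             # number of listening sessions in the test
    blind: bool               # subjects do not know which reference they hear
    fragment_seconds: float   # length of the audio fragment used
    gain_mismatch_db: float   # level difference between the signal chains

def violations(plan):
    """Return a list of the proposed conditions that the plan does not meet."""
    problems = []
    if not plan.controlled:
        problems.append("factors other than the audio system are not controlled")
    if plan.sessions < 2:
        problems.append("fewer than two listening sessions")
    if not plan.blind:
        problems.append("test is not blind")
    if plan.fragment_seconds >= 20:
        problems.append("audio fragment is not shorter than 20 seconds")
    if abs(plan.gain_mismatch_db) > 0.5:
        problems.append("signal chains differ by more than 0.5 dB")
    return problems

# Hypothetical plan: sighted comparison using a 45 second fragment.
plan = ListeningTestPlan(controlled=True, sessions=2, blind=False,
                         fragment_seconds=45.0, gain_mismatch_db=0.2)
for problem in violations(plan):
    print("not valid:", problem)
```

An empty list means that the plan meets all five conditions; it says nothing about statistical significance, which is treated separately.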

Significance

The abilities and characteristics of the human auditory system differ strongly from individual to individual, but also over time. A single listening test (with multiple sessions) only provides a quality assessment of a system that is valid for the test subject at the time of the test. To achieve statistical significance in order to generate statements that are valid for an average audience at all times, listening tests and sessions can be performed multiple times with multiple test subjects, applying general scientific statistical principles (e.g. analysis of variance, χ² tests).
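As an example of such a statistical principle, the sketch below applies a χ² goodness-of-fit test to pooled detection results, comparing the observed numbers of correct and incorrect answers with the 50/50 split expected from guessing. The function name and the example figures (10 listeners, 16 trials each) are hypothetical. Pooling treats all trials as independent, which is a simplifying assumption; an analysis that accounts for differences between listeners would need a more elaborate model.

```python
import math

def chi_square_vs_chance(correct, incorrect):
    """Chi-square goodness-of-fit test against a 50/50 split (1 degree of
    freedom). Returns the test statistic and its p-value."""
    total = correct + incorrect
    expected = total / 2
    chi2 = ((correct - expected) ** 2 + (incorrect - expected) ** 2) / expected
    # For 1 degree of freedom, the upper tail probability of the
    # chi-square distribution equals erfc(sqrt(chi2 / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value

# Hypothetical pooled result: 10 listeners x 16 trials, 96 correct answers.
chi2, p_value = chi_square_vs_chance(96, 64)
print(f"chi-square = {chi2:.2f}, p = {p_value:.4f}")
```

For these figures the statistic is 6.4. Note that the test is two-sided: it flags any deviation from chance, including scores that are worse than guessing.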

Analysis - statements on audio quality and sound quality

The results from valid listening tests can identify audio quality issues in the Performance processes of a system, and sound quality issues in the Response processes of a system. However, there is no translation table available to translate hearing sensations to physical phenomena in a system’s circuits or software. Statements on physical phenomena cannot be made based only on listening test results. At best, electronic measurements can be proposed - based on listening test results - to find a possible physical cause of the perceived quality issue. Only if a physical cause can be confirmed can a valid statement be made correlating the hearing experience to the physical phenomenon. All of this assumes that the listening tests were ‘controlled’ - conducted under the conditions proposed in table 901.

In this white paper, we strongly advise against drawing direct conclusions about physical phenomena in networked audio systems based on listening tests. A valid conclusion can only be drawn after confirming a found cause for the listening test results - normally by conducting further listening tests varying the parameters of the found cause. We advise even more strongly against drawing any conclusion at all based on uncontrolled listening tests.