Sound in the Digital Domain

Digital systems (e.g. computers) and formats (e.g. CD) are now the most common means of storing and manipulating audio. Since the introduction of the compact disc in the early 1980s, digital formats have offered steadily increasing storage capacity along with the ability to store audio at an acceptable quality. Although analogue formats (vinyl, tape) still exist, they typically serve a niche audience, while digital systems are ubiquitous in modern music technology. This is not an argument that one domain, analogue or digital, is superior to the other; the following are simply some desirable features of working with audio in the digital domain.

Storage. Far more digital audio data can be stored on a modern hard drive than on a tape system. Furthermore, we can choose the quality of the captured audio data, which relates directly to file size and other factors.
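As a rough illustration of how quality relates to file size, the sketch below estimates the size of uncompressed audio. It assumes plain PCM data with no file header or compression, and it uses sample rate and bit depth, both of which are introduced later in this chapter. The function name `pcm_size_bytes` is hypothetical and chosen only for this example.

```python
def pcm_size_bytes(sample_rate, bit_depth, channels, seconds):
    """Size in bytes of uncompressed PCM audio (assumes no header, no compression)."""
    bits = sample_rate * bit_depth * channels * seconds
    return bits // 8

# One minute of CD-quality stereo audio (44100 Hz, 16 bits, 2 channels).
cd_minute = pcm_size_bytes(44100, 16, 2, 60)

# One minute of lower-quality mono audio (22050 Hz, 8 bits, 1 channel).
low_minute = pcm_size_bytes(22050, 8, 1, 60)

print(cd_minute)   # 10584000
print(low_minute)  # 1323000
```

Under these assumptions, halving the sample rate, the bit depth and the channel count each halves the amount of data, so the lower-quality recording is one eighth of the size.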

Control. By storing audio information in digital form, we can perform powerful and complex operations on the data that would be extremely difficult to realise otherwise.

Durability. Digital audio can be copied across devices without any loss of information. Furthermore, many systems employ error correction codes to compensate for wear and tear on a physical digital format such as a compact disc.

Digital <-> Analogue Conversion

Acoustic information (sound waves) is treated as a signal. As demonstrated in the previous chapter, we traditionally view these signals as amplitude varying over time. In analogue systems, this generally means that the amplitude is represented by a continuous voltage, but inside a digital system the signal must be stored as a stream of discrete values.

Figure 2.1. An overview of the digital <-> analogue conversion process.

Digital data stored in this way has no real physical meaning. One could describe a song on a computer as just an array of numbers, and these numbers are meaningless unless the system contains a process that can interpret each number in sequence appropriately. Fig. 2.1 shows an overview of the process of capturing analogue sound and converting it into a digital stream of numbers for storage and manipulation in such a system. The steps are as follows:

1. An input such as a microphone converts acoustic air pressure variations (sound waves) into variations in voltage.

2. An analogue to digital converter (ADC) converts the varying voltage into a stream of digital values by taking a 'snapshot' of the voltage at a point in time and assigning it a value depending on its amplitude. It typically takes these 'snapshots' thousands of times a second; the rate at which they are taken is known as the sample rate.

3. The numerical data is stored on the digital system and then subsequently manipulated or analysed by the user.

4. The numerical data is re-read and streamed out of the digital system.

5. A digital to analogue converter (DAC) converts the stream of digital values back to a varying voltage.

6. A loudspeaker converts the voltage to variations in air pressure (sound).

Although the signal at each stage comes in a different form (sound energy, digital values etc.), the information is analogous. However, due to the nature of the conversion process, this information may become altered and distorted. For instance, a low sample rate or other limiting factors at the ADC might mean that the continuous analogue signal is not represented in enough detail, and the information will be distorted as a result. There are also imperfections in physical devices such as microphones, which further "colour" the signal in some way. It is for this reason that musicians and engineers aim to use the highest-quality equipment and processes available in order to preserve the integrity of the original sound throughout the process. They must also consider what other processes their music will go through before consumption (radio transmission etc.).

Sampling

Sound waves in their natural acoustic form can be considered continuous; that is, their time-domain graphs are smooth lines at all zoom levels, without any breaks or jumps. Such breaks, or discontinuities, cannot occur because sound cannot switch instantaneously between two values. An example is an idealised waveform such as a square wave: on paper, it switches between amplitudes of 1 and -1 instantaneously. A loudspeaker, however, cannot by the laws of physics jump between two points in no time at all; the cone has to travel along a continuous path from one point to the next.

Figure 2.2. Discrete samples (red) of a continuous waveform (grey).

Sampling is the process of taking a continuous, acoustic waveform and converting it into a digital stream of discrete numbers. An ADC measures the amplitude of the input at a regular rate, creating a stream of values which represent the waveform in digital form. The output is then created by passing these values to the DAC, which drives a loudspeaker appropriately. By measuring the amplitude many thousands of times a second, we create a "picture" of the sound which is of sufficient quality for human ears. The higher the sample rate, the more accurately a waveform is represented and reproduced.
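The measuring step can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: the "analogue" input is modelled as a mathematical sine function rather than a real voltage, and the name `sample_sine` and its parameters are hypothetical.

```python
import math

def sample_sine(frequency, sample_rate, duration, amplitude=1.0):
    """Return the amplitudes of a sine wave measured at regular intervals."""
    n_samples = int(round(sample_rate * duration))
    # Sample n is taken at time n / sample_rate seconds.
    return [amplitude * math.sin(2 * math.pi * frequency * n / sample_rate)
            for n in range(n_samples)]

# 10 milliseconds of a 440 Hz tone measured 44100 times per second.
samples = sample_sine(frequency=440.0, sample_rate=44100, duration=0.01)

print(len(samples))  # 441
```

The resulting list is the "stream of discrete numbers" described above: each entry is one snapshot of the amplitude, and nothing is recorded about the waveform between snapshots.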

Nyquist-Shannon sampling theorem

The frequency of a signal has implications for its representation, especially at very high frequencies. As discussed in the previous chapter, the frequency of a sine wave is the number of cycles per second. If we have a sample rate of 20000 samples per second (20 kHz), it is clear that each cycle of a high-frequency sinusoid such as 9000 Hz is going to have fewer "snapshots" than each cycle of a sinusoid at 150 Hz. Eventually a point is reached where there are not enough sample points to record the cycle of a waveform, which leads us to the following important result:
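The number of snapshots available per cycle is simply the sample rate divided by the frequency. The short sketch below works this out for the figures used above; the helper name `samples_per_cycle` is hypothetical.

```python
def samples_per_cycle(frequency, sample_rate):
    """Average number of sample points that fall within one cycle."""
    return sample_rate / frequency

for f in (150, 9000, 10000):
    print(f, round(samples_per_cycle(f, 20000), 2))
# 150 133.33
# 9000 2.22
# 10000 2.0
```

At 20000 samples per second, a 150 Hz sinusoid gets over a hundred points per cycle, a 9000 Hz sinusoid barely more than two, and at 10000 Hz only two points per cycle remain.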

The sample rate of a system defines the maximum representable frequency, which is half the sample rate.

Why is this? The minimum number of sample points required to represent one cycle of a sine wave is two. It may seem that using just two points to represent a continuous curve such as a sinusoid would result in a crude approximation - a square wave. And, inside the digital system, this is true. However, both ADCs and DACs have low-pass filters set at half the sample rate (the highest representable frequency). What this means for input and output is that any frequency above the cut-off point is removed, and it follows that the crude sine representation - a square wave in theory - is filtered down to a single frequency (i.e. a sine wave). From this, we have two mathematical results:

$F_{s}\geq 2f_{max}$ and $F_{N}={\frac {F_{s}}{2}}$

Here $F_{s}$ is the sample rate and $f_{max}$ is the highest frequency in the signal. $F_{N}$ is the highest frequency that can be represented at sample rate $F_{s}$, and is known as the Nyquist frequency. Frequencies above the Nyquist frequency are not present in the sampled signal because filters block them; without such filtering, those components would fold over into the representable range, an effect known as aliasing.
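Foldover can be illustrated numerically. The sketch below assumes an idealised sampler with no anti-aliasing filter, and the function names `nyquist_frequency` and `alias_frequency` are hypothetical. It computes where a frequency would appear after sampling, and then checks that an 11000 Hz cosine and a 9000 Hz cosine produce the same sample values at a 20000 Hz sample rate.

```python
import math

def nyquist_frequency(sample_rate):
    """Highest representable frequency for a given sample rate."""
    return sample_rate / 2

def alias_frequency(frequency, sample_rate):
    """Apparent frequency after sampling with no anti-aliasing filter."""
    folded = frequency % sample_rate
    # Components above the Nyquist frequency are mirrored back below it.
    return min(folded, sample_rate - folded)

print(nyquist_frequency(44100))       # 22050.0
print(alias_frequency(11000, 20000))  # 9000
print(alias_frequency(9000, 20000))   # 9000

# The sampled values of the two cosines are indistinguishable.
fs = 20000
high = [math.cos(2 * math.pi * 11000 * n / fs) for n in range(100)]
low = [math.cos(2 * math.pi * 9000 * n / fs) for n in range(100)]
same = all(math.isclose(a, b, abs_tol=1e-9) for a, b in zip(high, low))
print(same)  # True
```

Once the samples have been taken there is no way to tell the two tones apart, which is why the filtering has to happen before the ADC rather than afterwards.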

Sampling accuracy and bit depth

It has been established that the higher the sample rate, the more accurate the representation of a waveform in a digital system. Although there are many reasons and arguments for higher sample rates, there are two general standards: 44100 samples per second and 48000 samples per second, with the former being the most commonplace. The main consideration is that the human hearing range extends, at most, to an approximate limit (which varies from person to person) of 20000 Hz. Frequencies above this are inaudible. Taking the example of 44.1 kHz, we find that the Nyquist frequency evaluates to 22050 Hz, which is more than the human hearing system is capable of perceiving. There are other reasons for this particular sample rate, but they are beyond the scope of this book.

Figure 2.3. Effects of increased sample rate and bit depth on representing a continuous analogue signal.

There is one more important factor to consider in the sampling process: bit depth. Bit depth represents the precision with which the amplitude is measured. In the same way that there is a limited number of samples per second in a conversion process, there is also a limited number of amplitude values available for each sample point, and the greater the number, the greater the accuracy. A common bit resolution found in most standard digital audio systems (Hi-Fi, Compact Disc) is 16 binary bits, which allows for 65536 ( $2^{16}$ ) individual amplitude values at a point in time. Lower bit depths result in greater distortion of the sound - a two-bit system ( $2^{2}$ ) only allows for four different amplitudes, which results in a massively inaccurate approximation of the input signal.
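The effect of bit depth can be sketched as rounding each sample to the nearest available amplitude level. The code below is a simplified model, not a description of any particular converter: it assumes samples lie between -1.0 and 1.0 and that the levels are evenly spaced across that range, whereas real systems usually store signed integer codes. The function name `quantise` is hypothetical.

```python
def quantise(sample, bit_depth):
    """Round a sample in [-1.0, 1.0] to the nearest of 2**bit_depth levels.

    Assumes evenly spaced levels from -1.0 to 1.0 inclusive (a simplification).
    """
    levels = 2 ** bit_depth
    step = 2.0 / (levels - 1)
    sample = max(-1.0, min(1.0, sample))  # clip out-of-range input
    index = round((sample + 1.0) / step)
    return -1.0 + index * step

print(2 ** 16)  # 65536
print(2 ** 2)   # 4

original = 0.2
error_2_bit = abs(quantise(original, 2) - original)
error_16_bit = abs(quantise(original, 16) - original)
print(error_2_bit > error_16_bit)  # True
```

With only four levels the sample at 0.2 is pushed to roughly 0.33, while with 65536 levels the stored value is practically identical to the input.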