Overview of codecs (coders) commonly used in Voice over IP (VOIP)
This article provides an overview of the codecs commonly used in Voice over IP (VOIP).
They are often called codecs, vocoders, or simply coders.
It first briefly introduces the main functions and the classification of speech coders, and then describes three coders used in VOIP:
- ITU-T G.723.1 voice coder
- ITU-T G.729 voice coder
- ITU-T G.728 voice coder

1. The function of the speech coder
The main function of the speech coder is to encode the PCM (pulse code modulation) samples of the user's voice into a small number of bits (frames).
This makes the voice robust against bit errors on the link, network jitter and bursty transmission. At the receiving end, the speech frames are first decoded into PCM speech samples and then converted into a speech waveform.
2. Classification of Speech Coders
Speech coders are divided into three types:
(a) waveform coder;
(b) vocoder;
(c) hybrid coder.
Waveform coders try to reproduce the analog waveform, including background noise, as faithfully as possible. Because a waveform coder operates on the whole input signal, it produces high-quality samples. However, waveform coders operate at high bit rates. For example, the ITU-T G.711 specification (PCM) uses a bit rate of 64 kbps.
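The 64 kbps figure follows from the sampling parameters: G.711 samples at 8000 Hz and spends 8 bits on each sample. A minimal sketch of that arithmetic (the function name is illustrative only):

```python
def pcm_bit_rate_bps(sample_rate_hz: int, bits_per_sample: int) -> int:
    """Bit rate of an uncompressed PCM stream in bits per second."""
    return sample_rate_hz * bits_per_sample


# G.711: 8000 samples per second, 8 bits per sample.
g711_bps = pcm_bit_rate_bps(8000, 8)

# Compression ratio of an 8 kbps coder relative to G.711.
ratio_8k = g711_bps / 8000
```

The same ratio makes clear why low-bit-rate coders matter for VOIP: an 8 kbps coder carries the same call in one eighth of the bits.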
A vocoder does not reproduce the original waveform. Instead, the encoder extracts a set of parameters and sends them to the receiving end, where they drive a speech production model.
Linear predictive coding (LPC) is used to obtain the parameters of a time-varying digital filter, which models the output of the speaker's vocal tract [WEST96]. The voice quality of a vocoder is not good enough for use in a telephone system.
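As a rough illustration of how LPC filter parameters can be estimated, the sketch below uses the autocorrelation method with the Levinson-Durbin recursion, a common textbook approach. The source does not say which estimation method the coders it describes use, and the function names are illustrative.

```python
def autocorrelation(signal, max_lag):
    """Autocorrelation r[0..max_lag] of a finite signal."""
    return [
        sum(signal[n] * signal[n - lag] for n in range(lag, len(signal)))
        for lag in range(max_lag + 1)
    ]


def levinson_durbin(r, order):
    """Solve for A(z) = 1 + a[1] z^-1 + ... + a[order] z^-order.

    Returns the coefficient list a (with a[0] == 1.0) and the final
    prediction error energy.
    """
    a = [1.0] + [0.0] * order
    error = r[0]
    for i in range(1, order + 1):
        acc = r[i] + sum(a[j] * r[i - j] for j in range(1, i))
        k = -acc / error                      # reflection coefficient
        updated = a[:]
        for j in range(1, i):
            updated[j] = a[j] + k * a[i - j]
        updated[i] = k
        a = updated
        error *= 1.0 - k * k
    return a, error


# Test signal: impulse response of the one-pole filter 1 / (1 - 0.9 z^-1).
test_signal = [0.9 ** n for n in range(200)]
r = autocorrelation(test_signal, 2)
lpc, residual_energy = levinson_durbin(r, 2)
```

For this test signal the recursion recovers a first coefficient close to -0.9 and a second close to zero, which is the filter that generated the signal.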
MOS score versus bit rate for low-bit-rate coders [WEST96]
The voice coder commonly used in VOIP is a hybrid coder, which combines the strengths of the waveform coder and the vocoder. Another of its features is that it works at a very low bit rate (4 to 16 kbps). Hybrid coders employ analysis by synthesis (AbS).
To illustrate, consider the speech produced by the human vocal tract: when a person speaks, the speech signal consists of voiced sounds (such as the phonemes pa, da, etc.) and unvoiced sounds (such as the phonemes sh, th).
The excitation signal is derived from the input speech signal in such a way that the synthesized speech differs as little as possible from the input speech.
The use of LPC, the generation of the excitation, and the error minimization of the analysis-by-synthesis (AbS) system are shown in Figure 4-1.
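The analysis-by-synthesis idea can be sketched as a closed loop: each candidate excitation is run through the LPC synthesis filter, and the candidate whose synthesized output is closest to the input speech is kept. The sketch below is a simplified assumption-laden version; it uses a plain squared error, whereas practical coders typically apply perceptual weighting, and all names are hypothetical.

```python
def synthesize(excitation, lpc):
    """All-pole synthesis filter 1/A(z), A(z) = 1 + sum(lpc[j] z^-j).

    Starts from zero filter state; lpc[0] is assumed to be 1.0.
    """
    out = []
    for n, e in enumerate(excitation):
        feedback = sum(
            lpc[j] * out[n - j] for j in range(1, len(lpc)) if n - j >= 0
        )
        out.append(e - feedback)
    return out


def abs_search(target, candidates, lpc):
    """Return (index, error) of the excitation that best reproduces target."""
    best_index, best_error = -1, float("inf")
    for index, excitation in enumerate(candidates):
        synthesized = synthesize(excitation, lpc)
        error = sum((t - s) ** 2 for t, s in zip(target, synthesized))
        if error < best_error:
            best_index, best_error = index, error
    return best_index, best_error


lpc = [1.0, -0.9]
candidates = [
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 1.0, 0.0, 0.0],
    [1.0, 0.0, -1.0, 0.0],
]
# Build a target that candidate 2 reproduces exactly.
target = synthesize(candidates[2], lpc)
best_index, best_error = abs_search(target, candidates, lpc)
```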
Toll-quality coders are easy to implement when the bit rate is higher than 8 kbps, as shown in Figure 4-2. For toll (long-distance call) quality, the voice Mean Opinion Score (MOS) must be 4.0 or above.
When the bit rate of traditional PCM voice falls below 32 kbps, the voice quality deteriorates seriously, so PCM will not be discussed here. Hybrid coders and vocoders achieve acceptable MOS scores at fairly low bit rates.
At this stage, most VOIP coders work in the range of 5.3 to 8 kbps.
Studies have shown that standard coders can provide acceptable MOS scores at a bit rate of 4 kbps, with some demultiplexing systems scoring an MOS of 3.8 at 4.8 kbps.
Vector quantization and code-excited linear prediction
A better method is to encode a vector representing the input speech signal using a codebook of stored optimal parameter vectors (code vectors).
This technique is called vector quantization (VQ). Combining VQ with AbS further improves coding performance.
AbS VQ is the technique that forms the basis of code-excited linear prediction (CELP). The main difference between VQ and AbS VQ lies in the definition of the quantization distortion measure used during the codebook search [WONG96].
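A plain VQ codebook search can be sketched as a nearest-neighbour lookup: only the index of the closest code vector is transmitted, and the decoder looks the vector up again. The example below assumes a squared-error distortion measured directly on the vectors and a tiny made-up codebook; in AbS VQ the distortion would instead be measured on the synthesized speech.

```python
def squared_error(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def vq_encode(vector, codebook):
    """Return the index of the code vector closest to the input vector."""
    return min(
        range(len(codebook)),
        key=lambda i: squared_error(vector, codebook[i]),
    )


def vq_decode(index, codebook):
    """Look up the code vector for a received index."""
    return codebook[index]


# Hypothetical 2-dimensional codebook with three entries.
codebook = [[0.0, 0.0], [1.0, 1.0], [2.0, 0.0]]
index = vq_encode([0.9, 1.2], codebook)
reconstructed = vq_decode(index, codebook)
```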
3. Linear predictive analysis-by-synthesis coders
The most commonly used speech coders with bit rates between 4.8 kbps and 16 kbps are model-based coders, and all of them use the linear predictive analysis-by-synthesis (LPAS) method.
To model a speech signal over time, a linear predictive speech production model must be excited with an appropriate signal.
At regular intervals (for example every 20 ms), the speech model parameters and excitation parameters must be estimated and updated, and used to control the speech model.
Two kinds of LPAS coders are introduced below: the forward-adaptive LPAS coder and the backward-adaptive LPAS coder.
3.1 Forward-adaptive LPAS coders: the 8 kbps G.729 coder and the 6.3 kbps and 5.3 kbps G.723.1 coder
In a forward-adaptive AbS coder, the coefficients and gains of the prediction filter are transmitted explicitly. To provide toll-quality speech, both coders rely on a source model. An excitation signal (in the form of pitch information) is also transmitted.
The model used by these coders fits speech signals relatively well, but it is not suitable for some kinds of noise or for music.
Therefore, with background noise or music, the quality of LPAS coders is somewhat worse than that of the G.726 and G.727 coders.
① G.723.1
The ITU-T G.723.1 coder provides toll-quality voice at 6.3 kbps.
G.723.1 also includes a lower-quality speech coder working at 5.3 kbps. G.723.1 was designed for low-bit-rate videophones.
In this application, the delay requirement is not very strict, because the video coding delay is usually greater than the speech coding delay. The G.723.1 coder has a frame length of 30 ms and a look-ahead of 7.5 ms.
Adding the processing delay of the coder, the total one-way delay of the coder is 67.5 ms. Other delays are caused by system buffers and the network.
The voice signal is first filtered to the traditional telephone bandwidth (per G.712), then sampled at the traditional 8000 Hz rate (per G.711) and converted into 16-bit linear PCM as input to the encoder. At the decoder output, the inverse operations are applied to reconstruct the speech signal.
The G.723.1 system uses the LPAS method to encode the voice signal into frames. The coder can generate voice traffic at two rates:
(a) 6.3 kbps for the high rate;
(b) 5.3 kbps for the low rate.
The high-rate coder uses Multipulse Maximum Likelihood Quantization (MP-MLQ), and the low-rate coder uses Algebraic Code-Excited Linear Prediction (ACELP).
Both the encoder and the decoder must support both rates, and they can switch between the two rates at any frame boundary.
The system can also compress and decompress music and other audio signals, but it is optimized for speech.
The encoder operates on frames of 240 samples each, sampled at 8000 Hz, which results in a packetization delay of 30 ms. After preprocessing (a high-pass filter removes the DC component), each frame is divided into 4 subframes of 60 samples each. Further operations include computing the unquantized LPC filter coefficients and the line spectral pair (LSP) coefficients. For each subframe, an LPC filter is computed from the unprocessed input signal.
The filter coefficients of the last subframe are quantized with a predictive split vector quantizer (PSVQ).
As mentioned earlier, the look-ahead takes 7.5 ms, so the total algorithmic delay of the coder is 37.5 ms. This delay is an important factor when evaluating coders, especially for voice over data networks: the smaller the encoding and decoding delays, the more freedom there is to cope with delay and jitter in the Internet.
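The framing numbers above can be checked with a short sketch that splits one 240-sample frame into four subframes and converts sample counts to milliseconds. The constant and function names are illustrative, not taken from the G.723.1 text.

```python
SAMPLE_RATE_HZ = 8000
FRAME_SAMPLES = 240
SUBFRAMES_PER_FRAME = 4
LOOKAHEAD_MS = 7.5


def split_into_subframes(frame):
    """Split one frame into equally sized subframes."""
    if len(frame) != FRAME_SAMPLES:
        raise ValueError("expected a frame of %d samples" % FRAME_SAMPLES)
    size = FRAME_SAMPLES // SUBFRAMES_PER_FRAME
    return [frame[i:i + size] for i in range(0, FRAME_SAMPLES, size)]


def duration_ms(sample_count):
    """Duration of a block of samples in milliseconds."""
    return 1000 * sample_count / SAMPLE_RATE_HZ


subframes = split_into_subframes(list(range(FRAME_SAMPLES)))
frame_ms = duration_ms(FRAME_SAMPLES)
subframe_ms = duration_ms(len(subframes[0]))
# Frame buffering plus look-ahead, before any processing time is added.
algorithmic_delay_ms = frame_ms + LOOKAHEAD_MS
```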
The decoder processes each frame as follows:
- Decode the quantization indices of the LPC.
- Construct the LPC synthesis filter.
- For each subframe, decode the adaptive codebook excitation and the fixed codebook excitation before they are fed to the synthesis filter.
- Process the excitation signal with the pitch post-filter, then send it to the synthesis filter.
- Feed the synthesized signal into a formant post-filter, which uses a gain-scaling unit to keep its output energy at its input level.
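The core of the decoder steps above can be sketched for one subframe: the excitation is built from an adaptive codebook contribution (past excitation repeated at the pitch lag) plus a fixed codebook contribution, and then passed through the LPC synthesis filter. This is a heavily simplified sketch, not the G.723.1 reference decoder: index decoding, the pitch and formant post-filters and gain scaling are omitted, and every name and parameter is hypothetical.

```python
def decode_subframe(past_excitation, pitch_lag, adaptive_gain,
                    fixed_vector, fixed_gain, lpc, synth_memory):
    """Rebuild one subframe of excitation and synthesized speech.

    past_excitation must hold at least pitch_lag samples, and
    synth_memory at least len(lpc) - 1 past output samples.
    lpc describes A(z) = 1 + sum(lpc[j] z^-j) with lpc[0] == 1.0.
    """
    history = list(past_excitation)
    excitation = []
    for pulse in fixed_vector:
        adaptive = history[-pitch_lag]        # sample one pitch period back
        sample = adaptive_gain * adaptive + fixed_gain * pulse
        excitation.append(sample)
        history.append(sample)

    memory = list(synth_memory)
    speech = []
    for e in excitation:
        y = e - sum(lpc[j] * memory[-j] for j in range(1, len(lpc)))
        speech.append(y)
        memory.append(y)
    return excitation, speech


excitation, speech = decode_subframe(
    past_excitation=[1.0, 0.0, 0.0, 0.0],
    pitch_lag=4,
    adaptive_gain=0.5,
    fixed_vector=[0.0, 1.0, 0.0, 0.0],
    fixed_gain=1.0,
    lpc=[1.0, -0.5],
    synth_memory=[0.0],
)
```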
Silence compression has been used for many years. It takes advantage of the fact that silence accounts for about 50% of the total session time.
The basic idea is to reduce the number of bits transmitted during periods of silence, thereby reducing the total number of bits that need to be transmitted.
In the telephone network, analog voice signals have been processed with Time-Assigned Speech Interpolation (TASI) for many years. This technique provides additional capacity.
TASI has since been applied to digital signals under new names, one example being Time Division Multiple Access (TDMA).
Briefly, TDMA divides the signal into small digitized segments (time slots). These time slots are time-multiplexed with other time slots on one channel.
G.723.1 employs silence compression with discontinuous transmission, which means that artificial noise is inserted into the bitstream during periods of silence.
In addition to conserving bandwidth, this technique keeps the transmitter's modem working continuously and avoids switching the carrier signal on and off.
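The saving from silence compression can be illustrated with a toy sketch. It classifies frames with a simple energy threshold, which is a stand-in and not the voice activity detector that G.723.1 actually specifies, and it assumes a small hypothetical frame size for silence frames. All names and numbers other than the 6.3 kbps rate and the 30 ms frame are assumptions.

```python
FRAME_MS = 30
SPEECH_BITS = 6300 * FRAME_MS // 1000   # bits per 30 ms frame at 6.3 kbps
SILENCE_BITS = 32                       # assumed size of a silence frame


def frame_energy(frame):
    """Mean squared amplitude of one frame."""
    return sum(s * s for s in frame) / len(frame)


def classify_frames(frames, threshold):
    """Label each frame 'speech' or 'silence' by comparing its energy."""
    return [
        "speech" if frame_energy(f) > threshold else "silence"
        for f in frames
    ]


def average_bit_rate_bps(decisions):
    """Average bit rate when silence frames are sent with fewer bits."""
    total_bits = sum(
        SPEECH_BITS if d == "speech" else SILENCE_BITS for d in decisions
    )
    return total_bits / (len(decisions) * FRAME_MS / 1000)


# Four synthetic frames: loud, quiet, loud, quiet.
frames = [[1000] * 240, [2] * 240, [900] * 240, [1] * 240]
decisions = classify_frames(frames, threshold=100.0)
average_bps = average_bit_rate_bps(decisions)
```

With half of the frames classified as silence, the average rate in this sketch drops well below the 6300 bit/s needed when every frame is sent as speech.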
② G.729
The G.729 coder is designed for low-delay applications. Its frame length is only 10 ms, and the processing delay is also 10 ms.
With an additional look-ahead of 5 ms, the delay contributed by G.729 is 25 ms. Its bit rate is 8 kbps.
These delay properties matter on the Internet, where any factor that reduces delay is valuable.
There are two versions of G.729: G.729 and G.729A. G.729 is simpler than G.723.1.
The two versions are compatible with each other, but their performance differs somewhat: the lower-complexity version (G.729A) performs worse.
Both coders provide concealment for frame erasures and packet loss, so they are good choices for transmitting speech over the Internet. However, Cox et al. [COX98] argue that G.729 performs poorly in the presence of random bit errors.
They do not recommend using this coder on channels with random bit errors unless channel coding (forward error correction and convolutional codes, discussed in the wireless section) is used to protect the most sensitive bits.
3.2 Backward-adaptive LPAS coding: 16 kbps G.728 low-delay code-excited linear prediction
G.728 is a hybrid of the low-bit-rate linear predictive analysis-by-synthesis coders (G.729 and G.723.1) and the backward-adaptive ADPCM coders. G.728 is a low-delay CELP (LD-CELP) coder that processes only 5 samples at a time.
CELP is a speech coding technique in which the excitation signal is selected by exhaustive search from a set of possible excitation signals. Lower-rate speech coders use forward adaptation for the prediction filter, whereas LD-CELP uses a backward-adaptive filter that is updated every 2.5 ms. The LD-CELP codebook contains a total of 1024 possible excitation vectors. These vectors can be decomposed into 4 possible gains, two signs (+ and -) and 128 shape vectors.
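The codebook numbers fit together arithmetically: 128 shapes, 4 gains and 2 signs give 1024 combinations, which is a 10-bit index, and one index per 5-sample vector at 8000 Hz yields 16 kbps. The sketch below shows this, plus one way a 10-bit index could be split into its parts. The bit ordering inside the index is an assumption made for illustration and is not taken from the G.728 specification.

```python
SHAPE_BITS = 7   # 128 shape vectors
GAIN_BITS = 2    # 4 gain magnitudes
SIGN_BITS = 1    # + or -
SAMPLE_RATE_HZ = 8000
SAMPLES_PER_VECTOR = 5


def split_index(index):
    """Split a 10-bit index into (shape, sign, gain).

    Assumed layout, high to low bits: 7 shape bits, 1 sign bit, 2 gain bits.
    """
    if not 0 <= index < 1024:
        raise ValueError("index must fit in 10 bits")
    shape = index >> (SIGN_BITS + GAIN_BITS)
    sign = -1 if (index >> GAIN_BITS) & 1 else 1
    gain = index & ((1 << GAIN_BITS) - 1)
    return shape, sign, gain


codebook_size = (1 << SHAPE_BITS) * (1 << GAIN_BITS) * (1 << SIGN_BITS)
bits_per_vector = SHAPE_BITS + GAIN_BITS + SIGN_BITS
vectors_per_second = SAMPLE_RATE_HZ // SAMPLES_PER_VECTOR
bit_rate_bps = vectors_per_second * bits_per_vector
# Filter update period: 2.5 ms corresponds to this many 5-sample vectors.
vectors_per_update = int(2.5 * SAMPLE_RATE_HZ / 1000) // SAMPLES_PER_VECTOR
```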
G.728 is a suggested speech coder for low-rate (56 to 128 kbps) Integrated Services Digital Network (ISDN) videophones. Because of its backward-adaptive nature, G.728 is a low-delay coder, but it is more complex than other coders because the 50th-order LPC analysis must be repeated in the decoder. G.728 also uses an adaptive post-filter to improve its performance.
4. Parametric speech coder: 2.4 kbps mixed excitation linear prediction (MELP)
A parametric coder adopts a speech model that simplifies the excitation signal, so it can work at the lowest bit rates.
All of the previously discussed speech coders can be described as waveform-tracking, in that the waveform and phase of their output signal are very similar to those of the input signal.
Parametric coders, on the other hand, do not track the waveform. This type of coder is based on an analysis-synthesis model and can represent the speech signal with relatively few parameters. These parameters are usually extracted from the speech signal and quantized every 20 to 40 ms.
At the receiving end, these parameters are used to generate a synthesized speech signal. Under ideal conditions, the synthesized speech sounds similar to the original speech.
In the presence of loud background noise, any parametric coder will fail, because the input signal is not well described by its built-in speech model. The US government has chosen 2.4 kbps MELP for secure telephony.
For multimedia applications, the research in [COX98] points out that parametric coders are a good choice when low bit rates are required.
For example, parametric coders are often used in simple user games, where they reduce the storage space required. For the same reason, parametric coders are also a good choice for certain multimedia messaging services.
The absolute speech quality of parametric coders is lower in all types of speech environments, especially noisy ones.
This shortcoming can be overcome if the voice files can be carefully edited in advance.
Currently, most parametric coders used in multimedia applications are not standardized; proprietary coders are used instead.
G.723.1 Variable Rate Coding for Wireless Communication
Annex C of G.723.1 specifies a channel coding scheme that can be used with the triple-rate speech coder.
This variable-bit-rate channel coder supports bit rates ranging from 0.7 kbps to 4.3 kbps.
It supports all three operating modes of the G.723.1 codec, namely high-rate mode, low-rate mode and discontinuous transmission mode.
The channel coder uses punctured convolutional codes, and its bit rate can be allocated across the different bit classes according to the subjective importance of each class of information bits.
This allocation algorithm is known to both the encoder and the decoder.
Whenever a system control signal changes the G.723.1 rate or the bit rate of the channel coder, this algorithm adapts the channel coder to the new speech service configuration.
If the rate available to the channel coder is low, the subjectively most sensitive bits must be protected first.
As the bit rate of the channel coder increases, the additional redundancy bits are first used to protect more information bits, and then to strengthen the protection of the classes already protected.
Before channel coding is applied, some speech parameters are modified in the channel adaptation layer to improve robustness to transmission errors.
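The allocation order described above (most sensitive bits first, then wider coverage, then stronger protection) can be illustrated with a toy greedy allocator. This is not the Annex C algorithm: the bit classes, their sizes and the "one redundancy bit per information bit" unit of protection are all hypothetical, chosen only to show the ordering.

```python
def allocate_protection(classes, budget_bits):
    """Distribute redundancy bits over bit classes.

    classes: list of (name, info_bits), most sensitive class first.
    Pass 1 gives each class one level of protection, in order of
    sensitivity, for as long as the budget allows.
    Pass 2 spends what is left on a second level for classes that are
    already protected, again most sensitive first.
    Returns (allocation dict, unused bits).
    """
    allocation = {name: 0 for name, _ in classes}
    remaining = budget_bits
    for name, info_bits in classes:
        if remaining < info_bits:
            break
        allocation[name] = info_bits
        remaining -= info_bits
    for name, info_bits in classes:
        if allocation[name] and remaining >= info_bits:
            allocation[name] += info_bits
            remaining -= info_bits
    return allocation, remaining


# Hypothetical bit classes, ordered from most to least sensitive.
classes = [("most_sensitive", 20), ("medium", 40), ("least_sensitive", 60)]
low_budget, low_left = allocate_protection(classes, 50)
high_budget, high_left = allocate_protection(classes, 150)
```

With the small budget only the most sensitive class is protected; with the larger budget all classes are covered and the surplus strengthens the most sensitive class.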
5. Coder Evaluation
There are several important factors to consider when evaluating the performance of a coder:
- Frame size: The duration of voice covered by one frame, also known as frame delay. Frames are discrete segments of the speech signal, and each frame is updated based on speech samples. The coders described in this article all process one frame at a time. Each frame is placed in a voice packet and sent to the receiving end.
- Processing delay: The time the coding algorithm needs to process one frame of speech. It is usually simply counted as equal to the frame delay. Processing delay is better known as algorithmic delay.
- Look-ahead delay: The coder examines a certain length of the next frame to assist in coding the current frame. This length is called the look-ahead delay. The idea of look-ahead is to exploit the close correlation between adjacent speech frames.
- Frame length: The number of bytes after encoding (excluding the frame header).
- Voice bit rate: The output rate of the codec when its input is a standard pulse code modulation voice stream (bit rate 64 kbit/s).
- DSP MIPS: The minimum speed of the DSP processor needed to support the specific codec. Note that DSP MIPS is not comparable to the MIPS rating of other processors. Unlike the general-purpose processors used in workstations and personal computers, these DSPs are purpose-built for specific tasks. A general-purpose processor therefore needs a higher MIPS rating than a dedicated DSP to run the same codec.
- RAM requirements: The amount of RAM required to support a particular coding process.
A key factor in evaluating encoder performance is the time required for the encoder to work.
This time refers to the buffering and processing time of the encoder, which is called the one-way system delay. Its value is equal to: frame size + processing delay + look-ahead delay.
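Applied to the figures quoted in this article, the formula can be sketched as follows. The function name is illustrative; the G.723.1 processing delay of 30 ms is taken to equal its frame size, which is what the 67.5 ms total quoted earlier implies.

```python
def one_way_system_delay_ms(frame_ms, processing_ms, lookahead_ms):
    """One-way system delay = frame size + processing delay + look-ahead."""
    return frame_ms + processing_ms + lookahead_ms


# G.723.1: 30 ms frame, 30 ms processing, 7.5 ms look-ahead.
g7231_delay_ms = one_way_system_delay_ms(30.0, 30.0, 7.5)

# G.729: 10 ms frame, 10 ms processing, 5 ms look-ahead.
g729_delay_ms = one_way_system_delay_ms(10.0, 10.0, 5.0)
```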
Obviously, decoding latency is also very important. In fact, the decoding delay is about half of the encoding delay.
6. Comparison of Speech Coders
To summarize the discussion of standard coders, Table 4-1 [RUDK97] compares their bit rate, MOS, complexity (relative to G.711) and delay (frame size and look-ahead time).
| Standard | Coding type | Bit rate (kbps) | MOS | Complexity | Latency (ms) |
|---|---|---|---|---|---|
| G.711 | PCM | 64 | 4.3 | 1 | 0.125 |
| G.726 | ADPCM | 32 | 4.0 | 10 | 0.125 |
| G.728 | LD-CELP | 16 | 4.0 | 50 | 0.625 |
| GSM | RPE-LTP | 13 | 3.7 | 5 | 20 |
| G.729 | CS-ACELP | 8 | 4.0 | 30 | 15 |
| G.729A | | | | | 15 |
| G.723.1 | ACELP | 5.3 | 3.8 | 25 | 37.5 |
| | MP-MLQ | 6.3 | | | |
| US DoD FS1015 | LPC-10 | 2.4 | synthesized speech | 10 | 22.5 |
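As a small worked example of reading the table, the sketch below stores the rows that have a numeric MOS and picks out the coders that reach a chosen quality threshold, ordered by bit rate. The 4.0 threshold is an assumption used here for toll quality, and the tuple layout and names are illustrative.

```python
# (standard, coding type, bit rate in kbps, MOS) for rows with a numeric MOS.
CODERS = [
    ("G.711", "PCM", 64.0, 4.3),
    ("G.726", "ADPCM", 32.0, 4.0),
    ("G.728", "LD-CELP", 16.0, 4.0),
    ("GSM", "RPE-LTP", 13.0, 3.7),
    ("G.729", "CS-ACELP", 8.0, 4.0),
    ("G.723.1", "MP-MLQ", 6.3, 3.8),   # MOS as listed for G.723.1
]


def coders_at_or_above(coders, min_mos=4.0):
    """Coders meeting the MOS threshold, lowest bit rate first."""
    qualifying = [c for c in coders if c[3] >= min_mos]
    return sorted(qualifying, key=lambda c: c[2])


toll_quality = coders_at_or_above(CODERS)
lowest_rate_toll_coder = toll_quality[0][0]
```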
7. Summary
The speech coder is the engine that builds and processes VOIP packets.
It runs on a DSP. The original DS0/TDM G.711 64 kbps coders will eventually be phased out by the industry and replaced by low-bit-rate coders.