The sound level is the same throughout the entire composition; there are several pauses.

Narrowing dynamic range

Narrowing the dynamic range, or more simply, compression, serves various purposes, the most common of which are:

1) Achieving a uniform volume level throughout the entire composition (or instrument part).

2) Achieving a uniform volume level for songs throughout the album/radio broadcast.

3) Increasing intelligibility, mainly when compressing an individual part (vocals, bass drum).

How does dynamic range narrowing occur?

The compressor analyzes the sound level at the input by comparing it to a user-specified Threshold value.

If the signal level is below the Threshold value, the compressor continues to analyze the sound without changing it. If the sound level exceeds the Threshold, the compressor begins its action. Since the role of the compressor is to narrow the dynamic range, it is logical to assume that it limits the largest and smallest amplitude values (signal levels). At the first stage, the largest values are limited: they are reduced with a certain strength, which is called the Ratio. Let's look at an example:

Green curves display the sound level; the greater the amplitude of their oscillations from the X axis, the greater the signal level.

The yellow line is the threshold (Threshold) at which the compressor acts. Raising the Threshold moves it away from the X axis; lowering it brings it closer to the X axis. Clearly, the lower the threshold, the more often the compressor will act; the higher it is, the less often. If the Ratio value is very high, then once the signal reaches the Threshold, the entire subsequent signal will be suppressed down to silence. If the Ratio value is very small, nothing will happen at all. The choice of Threshold and Ratio values will be discussed later. For now we should ask: what is the point of suppressing all subsequent sound? Indeed, there is none; we only need to get rid of the amplitude values (peaks) that exceed the Threshold (marked in red on the graph). This is exactly the problem the Release (attenuation) parameter solves: it sets the duration of the compression.
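To make the Threshold/Ratio arithmetic concrete, here is a minimal Python sketch of the static compression law described above (the function name and the dB convention are mine, purely illustrative): levels below the threshold pass through, and the excess above it is divided by the ratio.

```python
def compress_db(level_db, threshold_db, ratio):
    """Static compression curve: levels below the threshold pass
    unchanged; the excess above it is divided by the ratio."""
    if level_db <= threshold_db:
        return level_db
    return threshold_db + (level_db - threshold_db) / ratio

# A -6 dB peak against a -18 dB threshold at 4:1 comes out at -15 dB,
# while a -30 dB passage is left untouched:
print(compress_db(-6.0, -18.0, 4.0))    # -15.0
print(compress_db(-30.0, -18.0, 4.0))   # -30.0
```

Note how the two extremes from the text fall out of the formula: a huge Ratio pins everything to the threshold (limiting), and a Ratio of 1:1 changes nothing.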

The example shows that the first and second exceedances of the Threshold last for less time than the third. So, if the Release parameter is tuned to the first two peaks, the third may be left partly unprocessed (since it exceeds the Threshold for longer). If the Release parameter is tuned to the third peak, then processing the first and second peaks leaves an undesirable dip in signal level after them.

The same goes for the Ratio parameter. If Ratio is tuned to the first two peaks, the third will not be suppressed enough; if it is tuned to the third peak, the first two will be suppressed too much.

These problems can be solved in two ways:

1) Setting the Attack parameter - a partial solution.

2) Dynamic compression - a complete solution.

The Attack parameter sets the time after which the compressor starts operating once the Threshold has been exceeded. If the parameter is close to zero (equal to zero in the case of parallel compression; see the corresponding article), the compressor begins suppressing the signal immediately and works for the amount of time specified by the Release parameter. If the Attack value is large, the compressor begins its action only after that delay (this is needed to preserve clarity). In our case, we can tune the Threshold, Release, and Ratio parameters to the first two peaks and set the Attack close to zero. The compressor will then suppress the first two peaks, and when processing the third it will keep suppressing it for as long as the Threshold is exceeded. However, this does not guarantee high-quality processing and is close to limiting (a rough cut of all amplitude values; a compressor used this way is called a limiter).
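The interplay of Attack and Release can be sketched as a gain envelope that chases its target at two different speeds. This is a common one-pole design, not taken from any specific compressor; the 44.1 kHz rate and the time-constant formula are illustrative assumptions:

```python
import math

def envelope_step(current_gain, target_gain, attack_ms, release_ms,
                  sample_rate=44100):
    """Move the compressor's gain one sample toward its target.
    Falling gain (signal over threshold) uses the attack time;
    recovering gain uses the release time."""
    tau_ms = attack_ms if target_gain < current_gain else release_ms
    coeff = math.exp(-1000.0 / (sample_rate * tau_ms))
    return target_gain + (current_gain - target_gain) * coeff

# With a 10 ms attack the gain does not drop to its target instantly:
g = 1.0
for _ in range(100):                   # ~2.3 ms at 44.1 kHz
    g = envelope_step(g, 0.5, attack_ms=10.0, release_ms=150.0)
print(0.5 < g < 1.0)                   # True: still on its way down
```

A near-zero attack makes `coeff` tiny, so the gain snaps down immediately, which is exactly the limiter-like behavior described above.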

Let's look at the result of sound processing with a compressor:

The peaks disappeared. Note that the processing settings were quite gentle and we suppressed only the most prominent amplitude values. In practice, the dynamic range is narrowed much further, and this trend only progresses. Many composers believe they are making the music louder, but in practice they strip it of dynamics for those listeners who play it at home rather than on the radio.

We only have to consider the last compression parameter, Gain. Gain increases the amplitude of the entire composition and is, in effect, equivalent to another sound-editor tool, Normalize. Let's look at the final result:

In our case, compression was justified and improved the sound, since the prominent peak was more likely an accident than a deliberate result. Besides, the music is clearly rhythmic and therefore has a narrow dynamic range anyway. In cases where high amplitude values are intentional, compression may be a mistake.

Dynamic compression

The difference between dynamic compression and non-dynamic compression is that with the former, the degree of signal suppression (Ratio) depends on the level of the input signal. Dynamic compressors are found in all modern programs; the Ratio and Threshold parameters are controlled in a window where each parameter has its own axis:

There is no single standard for displaying the graph: in some programs the Y axis shows the level of the incoming signal, in others the signal level after compression; in some the point (0,0) is in the upper right corner, in others the lower left. In any case, when you move the mouse cursor over this field, the numbers corresponding to the Ratio and Threshold parameters change. I.e., you set the compression level for each Threshold value, which allows very flexible compression settings.
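The window described above is essentially a transfer curve: for every input level you choose an output level, so the ratio effectively varies with the signal. A hedged Python sketch of such a piecewise-linear curve (the point format is my own convention, not any particular plugin's):

```python
def transfer_curve(level_db, points):
    """Piecewise-linear input->output curve, as drawn in the editor
    window. `points` is a list of (in_dB, out_dB) knees sorted by
    input level."""
    if level_db <= points[0][0]:
        return points[0][1]
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        if level_db <= x1:
            t = (level_db - x0) / (x1 - x0)
            return y0 + t * (y1 - y0)
    return points[-1][1]

# Unity gain below -20 dB, then a 2:1 slope above it:
curve = [(-60.0, -60.0), (-20.0, -20.0), (0.0, -10.0)]
print(transfer_curve(-10.0, curve))   # -15.0
print(transfer_curve(-40.0, curve))   # -40.0
```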

Side Chain

A side-chain compressor analyzes the signal of one channel, and when its sound level exceeds the threshold, applies compression to another channel. Side-chaining is useful for instruments that occupy the same frequency region (the bass-kick combination is used actively), but instruments in different frequency regions are sometimes paired as well, which produces an interesting side-chain effect.
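A deliberately crude sketch of the idea, assuming linear amplitudes and ignoring attack/release smoothing: the kick channel's level decides when the bass channel is attenuated.

```python
def sidechain_duck(bass, kick, threshold, ratio):
    """Crude side-chain: while the kick's amplitude exceeds the
    threshold, the bass sample is attenuated by the ratio."""
    return [b / ratio if abs(k) > threshold else b
            for b, k in zip(bass, kick)]

bass = [0.5, 0.5, 0.5, 0.5]
kick = [0.0, 0.9, 0.9, 0.0]
# The bass "ducks" under each kick hit:
print(sidechain_duck(bass, kick, threshold=0.5, ratio=4.0))
# [0.5, 0.125, 0.125, 0.5]
```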

Part Two – Compression Stages

There are three stages of compression:

1) The first stage is compression of individual sounds (one-shots).

The timbre of any instrument has the following characteristics: Attack, Hold, Decay, Delay, Sustain, Release.

The stage of compression of individual sounds is divided into two parts:

1.1) Compression of individual sounds of rhythmic instruments

Often the components of a beat require separate compression to give them clarity. Many people process the bass drum separately from the other rhythmic instruments, both at the stage of compressing individual sounds and at the stage of compressing individual parts. This is because it sits in the low-frequency region, where usually only the bass accompanies it. Clarity in a bass drum means the presence of a characteristic click (the bass drum has very short attack and hold times). If there is no click, process it with a compressor, setting the threshold to zero and the attack time to 10-50 ms. The compressor's release must end before the next bass-drum hit. The latter can be calculated with the formula 60,000 / BPM, where BPM is the tempo of the composition. For example, 60,000 / 137 = 437.96 (the time in milliseconds until the next downbeat of a 4/4 composition).
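The 60,000 / BPM rule is easy to keep at hand:

```python
def beat_interval_ms(bpm):
    """Time between quarter-note beats, in milliseconds
    (60,000 ms per minute divided by beats per minute)."""
    return 60_000 / bpm

# At 137 BPM the release must finish within ~437.96 ms:
print(round(beat_interval_ms(137), 2))   # 437.96
print(beat_interval_ms(120))             # 500.0
```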

All of the above applies to other rhythmic instruments with a short attack time: they should have an accented click that must not be suppressed by the compressor at any compression stage.

1.2) Compression of individual sounds of harmonic instruments

Unlike rhythmic instruments, parts of harmonic instruments rarely consist of individual sounds. However, this does not mean they should not be processed at the individual-sound level. If you use a sample with a recorded part, that belongs to the second compression level; only synthesized harmonic instruments apply to this one. These can be samplers or synthesizers using various methods of sound synthesis (physical modeling, FM, additive, subtractive, etc.). As you have probably guessed, we are talking about programming the synthesizer's settings. Yes, this is also compression! Almost all synthesizers have a programmable envelope (ADSR). Using the envelope, you set the Attack, Decay, Sustain, and Release times. And if you tell me that this is not compression of each individual sound - you are my enemy for life!
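For illustration, a linear ADSR envelope of the kind such synthesizers let you program might be sketched like this (a simplified textbook shape with my own argument names; real synths often use exponential segments):

```python
def adsr(t, attack, decay, sustain, release, note_off):
    """Amplitude of a simple linear ADSR envelope at time t (seconds).
    `sustain` is a level in 0..1; the other arguments are durations;
    `note_off` is the moment the key is released."""
    if t < attack:                      # rise to full level
        return t / attack
    if t < attack + decay:              # fall to the sustain level
        return 1.0 - (1.0 - sustain) * (t - attack) / decay
    if t < note_off:                    # hold while the key is down
        return sustain
    if t < note_off + release:          # fade out after release
        return sustain * (1.0 - (t - note_off) / release)
    return 0.0

# Full level right after the attack, sustain level while held:
print(adsr(0.1, 0.1, 0.2, 0.6, 0.3, note_off=1.0))   # 1.0
print(adsr(0.5, 0.1, 0.2, 0.6, 0.3, note_off=1.0))   # 0.6
```

Shortening the attack or lowering the sustain here shapes each note's dynamics exactly the way a compressor would, which is the point being made above.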

2) Second stage – Compression of individual parts.

By compression of individual parts I mean narrowing the dynamic range of a number of combined individual sounds. This stage also includes recorded parts, including vocals, which require compression to gain clarity and intelligibility. When compressing parts, keep in mind that adding individual sounds together may produce unwanted peaks, which you need to remove at this stage; if you don't, the picture may worsen when the whole composition is mixed. At this stage you must also take into account the compression done at the individual-sound stage: if you have achieved clarity in the bass drum, careless re-processing at the second stage can ruin everything. It is not necessary to compress every part, just as it is not necessary to compress every individual sound. I advise installing an amplitude analyzer, just in case, to detect unwanted side effects of combining individual sounds. Besides compression, at this stage you should make sure that, where possible, the parts occupy different frequency ranges. It is also useful to remember that sound is subject to masking (psychoacoustics):

1) A quieter sound is masked by a louder one that precedes it.

2) A quieter sound at a low frequency is masked by a louder sound at a high frequency.

So, for example, if you have a synthesizer part, the notes often begin to play before the previous notes have finished sounding. Sometimes this is necessary (harmony, playing style, polyphony), but sometimes it is not; you can cut off their tails (Delay - Release) if they are audible in solo mode but not when all parts play together. The same applies to effects such as reverb: it should not last until the sound source starts again. By cutting out unnecessary signal you make the sound cleaner, and this too can be considered compression, since you are removing unnecessary waves.

3) The third stage – Compression of the composition.

When compressing an entire composition, remember that every part is a combination of many individual sounds. Therefore, when combining them and compressing the result, we must make sure that the final compression does not spoil what was achieved in the first two stages. You also need to distinguish compositions for which a wide dynamic range matters from those for which a narrow one does. When compressing compositions with a wide dynamic range, it is enough to install a compressor that crushes the short-term peaks formed when the parts are added together. When compressing a composition in which a narrow dynamic range is important, everything is much more complicated; here compressors have recently given way to maximizers. A maximizer is a plugin that combines a compressor, limiter, graphic equalizer, enhancer, and other sound-shaping tools, and it must also include sound-analysis tools. Maximizing, the final compressor processing, is largely needed to combat mistakes made at previous stages. The errors lie not so much in compression (although doing at the last stage what could have been done at the first is itself a mistake) as in the initial choice of good samples and instruments that do not interfere with each other (in terms of frequency ranges); this is exactly what frequency-response correction is for. It often happens that strong compression on the master forces you to change the compression and mixing parameters at earlier stages, since a strong narrowing of the dynamic range brings out quiet sounds that were previously masked, and the sound of individual components of the composition changes.

In these parts I deliberately avoided specific compression parameters. I considered it necessary to stress that during compression you must pay attention to all sounds and all parts at every stage of creating a composition. Only then will the final result be harmonious not only from the point of view of music theory, but also from the point of view of sound engineering.

The table below gives practical advice for processing individual parts. In compression, however, numbers and presets can only suggest the area in which to search; the ideal compression settings depend on each individual case. The Gain and Threshold parameters assume a normal sound level (sensible use of the entire range).

Part Three - Compression Parameters

Brief information:

Threshold – determines the input signal level at which the compressor starts working.

Attack – determines the time after which the compressor starts working once the threshold is exceeded.

Ratio – determines the degree of reduction of amplitude values (relative to the original amplitude value).

Release – determines the time after which the compressor stops working.

Gain – determines the level of amplification applied to the signal after compression.

Compression table:

| Instrument | Threshold | Attack | Ratio | Release | Gain | Description |
|---|---|---|---|---|---|---|
| Vocals | 0 dB | 1-2 ms; 2-5 ms; 10 ms; 0.1 ms; 0.1 ms | less than 4:1; 2.5:1; 4:1-12:1; 2:1-8:1 | 150 ms; 50-100 ms; 150 ms; 150 ms; 0.5 s | | Compression during recording should be minimal; mandatory processing at the mixing stage is required for clarity and intelligibility. |
| Wind instruments | | 1-5 ms | 6:1-15:1 | 0.3 s | | |
| Kick drum | | 10-50 ms; 10-100 ms | 4:1 and higher; 10:1 | 50-100 ms; 1 ms | | The lower the Threshold, the higher the Ratio, and the longer the Attack, the more pronounced the click at the start of the kick drum. |
| Synthesizers | | | | | | Depends on the wave type (ADSR envelopes). |
| Snare drum | | 10-40 ms; 1-5 ms | 5:1; 5:1-10:1 | 50 ms; 0.2 s | | |
| Hi-hat | | 20 ms | 10:1 | 1 ms | | |
| Overhead microphones | | 2-5 ms | 5:1 | 1-50 ms | | |
| Drums | | 5 ms | 5:1-8:1 | 10 ms | | |
| Bass guitar | | 100-200 ms; 4-10 ms | 5:1 | 1 ms; 10 ms | | |
| Strings | | 0-40 ms | 3:1 | 500 ms | | |
| Synth bass | | 4-10 ms | 4:1 | 10 ms | | Depends on the envelopes. |
| Percussion | | 0-20 ms | 10:1 | 50 ms | | |
| Acoustic guitar, piano | | 10-30 ms; 5-10 ms | 4:1; 5:1-10:1 | 50-100 ms; 0.5 s | | |
| Electric guitar | | 2-5 ms | 8:1 | 0.5 s | | |
| Final compression | | 0.1 ms; 0.1 ms | 2:1; 2:1-3:1 | 50 ms; 0.1 ms | 0 dB output | The attack time depends on the purpose: whether you need to remove peaks or make the track smoother. |
| Limiter after final compression | | 0 ms | 10:1 | 10-50 ms | 0 dB output | If you need a narrow dynamic range and a rough "cut" of the waves. |

Where a cell lists several values separated by semicolons, they are alternative recommendations from different sources.

The information was taken from various sources referenced by popular resources on the Internet. The difference in compression parameters is explained by different sound preferences and working with different materials.


Dynamic range, or the photographic latitude of a photographic material, is the ratio between the maximum and minimum exposure values that can be correctly captured in a photograph. Applied to digital photography, dynamic range is effectively the ratio of the maximum and minimum possible values of the useful electrical signal generated by the photosensor during exposure.

Dynamic range is measured in exposure stops (EV). Each stop corresponds to a doubling of the amount of light. So, for example, if a certain camera has a dynamic range of 8 EV, the maximum possible value of the useful signal of its sensor relates to the minimum as 2^8 : 1, which means that within one frame the camera can capture objects differing in brightness by no more than 256 times. More precisely, it can capture objects of any brightness, but objects whose brightness exceeds the maximum permissible value will come out dazzling white, and objects whose brightness is below the minimum will come out pitch black. Details and texture will be visible only on those objects whose brightness falls within the camera's dynamic range.
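The stop-to-ratio arithmetic is just a power of two:

```python
def brightness_ratio(dynamic_range_ev):
    """Each exposure stop doubles the light, so a range of n EV
    corresponds to a 2**n : 1 brightness ratio."""
    return 2 ** dynamic_range_ev

print(brightness_ratio(8))   # 256
print(brightness_ratio(5))   # 32
```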

To describe the relationship between the brightness of the lightest and darkest objects being photographed, the not entirely correct term “scene dynamic range” is often used. It would be more correct to talk about the brightness range or the contrast level, since dynamic range is usually a characteristic of the measuring device (in this case, the matrix of a digital camera).

Unfortunately, the brightness range of many beautiful scenes we encounter in real life can significantly exceed the dynamic range of a digital camera. In such cases, the photographer is forced to decide which objects should be worked out in full detail, and which can be left outside the dynamic range without compromising the creative intent. In order to make the most of your camera's dynamic range, you may sometimes need not so much a thorough understanding of how the photosensor works, but rather a developed artistic sense.

Factors limiting dynamic range

The lower limit of the dynamic range is set by the photosensor's self-noise. Even an unlit sensor generates a background electrical signal called dark noise. Interference also arises when charge is transferred to the analog-to-digital converter, and the ADC itself introduces a certain error into the digitized signal, the so-called sampling noise.

If you take a photo in complete darkness or with a lens cap on, the camera will only record this meaningless noise. If you allow a minimal amount of light to reach the sensor, the photodiodes will begin to accumulate electric charge. The magnitude of the charge, and hence the intensity of the useful signal, will be proportional to the number of captured photons. In order for any meaningful details to appear in the image, it is necessary that the level of the useful signal exceeds the level of background noise.

Thus, the lower limit of the dynamic range or, in other words, the sensor sensitivity threshold can be formally defined as the level of the output signal at which the signal-to-noise ratio is greater than unity.

The upper limit of the dynamic range is determined by the capacitance of an individual photodiode. If during exposure any photodiode accumulates an electric charge of its maximum value, then the image pixel corresponding to the overloaded photodiode will turn out completely white, and further irradiation will not affect its brightness in any way. This phenomenon is called clipping. The higher the overload capacity of a photodiode, the greater the output signal it can produce before it reaches saturation.

For greater clarity, let us turn to the characteristic curve, which is a graph of the output signal versus exposure. The horizontal axis represents the binary logarithm of the radiation received by the sensor, and the vertical axis represents the binary logarithm of the magnitude of the electrical signal generated by the sensor in response to this radiation. My drawing is largely conventional and serves purely illustrative purposes. The characteristic curve of a real photosensor has a slightly more complex shape, and the noise level is rarely so high.

The graph clearly shows two critical turning points: at the first, the level of the useful signal crosses the noise threshold, and at the second, the photodiodes reach saturation. The exposure values lying between these two points make up the dynamic range. In this abstract example it is equal, as is easy to see, to 5 EV, i.e. the camera can handle five doublings of exposure, which is equivalent to a 32-fold (2^5 = 32) difference in brightness.

The exposure zones that make up the dynamic range are not equal. The upper zones have a higher signal-to-noise ratio and therefore appear cleaner and more detailed than the lower ones. As a result, the upper limit of the dynamic range is abrupt and conspicuous - clipping cuts off the highlights at the slightest overexposure - while the lower limit drowns inconspicuously in noise, and the transition to black is not nearly as sharp as the transition to white.

The linear dependence of the signal on exposure, as well as the sharp rise to a plateau, are unique features of the digital photographic process. For comparison, take a look at the typical characteristic curve of traditional photographic film.

The shape of the curve and especially the angle of inclination strongly depend on the type of film and on the procedure for its development, but the main, striking difference between the film graph and the digital one remains unchanged - the nonlinear nature of the dependence of the optical density of the film on the exposure value.

The lower limit of the photographic latitude of negative film is determined by the density of the veil, and the upper limit is determined by the maximum achievable optical density of the photographic layer; for reversible films it is the other way around. Both in the shadows and in the highlights, smooth bends in the characteristic curve are observed, indicating a drop in contrast when approaching the boundaries of the dynamic range, because the slope of the curve is proportional to the contrast of the image. Thus, the exposure zones lying in the middle part of the graph have maximum contrast, while in the highlights and shadows the contrast is reduced. In practice, the difference between film and a digital matrix is ​​especially noticeable in the highlights: where in a digital image the highlights are burned out by clipping, on film the details are still visible, although low in contrast, and the transition to pure white looks smooth and natural.

In sensitometry, two independent terms are even used: photographic latitude proper, limited to the relatively linear portion of the characteristic curve, and useful photographic latitude, which, in addition to the linear section, also includes the toe and shoulder of the curve.

It is noteworthy that when processing digital photographs, as a rule, a more or less pronounced S-shaped curve is applied to them, increasing the contrast in midtones at the cost of reducing it in shadows and highlights, which gives the digital image a more natural and pleasing to the eye view.

Bit depth

Unlike a digital camera's sensor, human vision has, let's say, a logarithmic view of the world. Successive doublings of the amount of light are perceived by us as equal changes in brightness. Exposure stops can even be compared to musical octaves, because a doubling of sound frequency is perceived by ear as one and the same musical interval. Our other senses work on the same principle. This nonlinearity of perception greatly expands the range of human sensitivity to stimuli of varying intensity.

When a RAW file containing linear data is converted (whether by the camera or in a RAW converter), a so-called gamma curve is applied to it, designed to non-linearly increase the brightness of the digital image, bringing it into line with the characteristics of human vision.

With linear conversion, the image is too dark.

After gamma correction, the brightness returns to normal.

The gamma curve stretches dark tones and compresses light ones, making the distribution of gradations more uniform. The result is a natural-looking image, but noise and sampling artifacts in the shadows inevitably become more noticeable, which is only exacerbated by the small number of brightness levels in the lower zones.

Linear distribution of brightness gradations.
Uniform distribution after applying the gamma curve.
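The gamma mapping described above can be sketched in a few lines (the value 2.2 is the conventional display gamma; real converters use more elaborate tone curves):

```python
def gamma_correct(linear, gamma=2.2):
    """Map a linear sensor value (0..1) through a power-law gamma
    curve; dark tones are lifted far more than light ones."""
    return linear ** (1.0 / gamma)

# A deep shadow is brightened dramatically, a highlight barely:
print(round(gamma_correct(0.05), 2))   # 0.26
print(round(gamma_correct(0.90), 2))   # 0.95
```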

ISO and dynamic range

Despite the fact that digital photography uses the same concept of photosensitivity of photographic material as in film photography, it should be understood that this happens solely due to tradition, since approaches to changing photosensitivity in digital and film photography differ fundamentally.

Increasing ISO sensitivity in traditional photography means replacing one film with another with coarser grain, i.e. there is an objective change in the properties of the photographic material itself. In a digital camera, the light sensitivity of the sensor is strictly determined by its physical characteristics and cannot be changed in the literal sense. When increasing ISO, the camera does not change the actual sensitivity of the sensor, but only amplifies the electrical signal generated by the sensor in response to irradiation and adjusts the digitization algorithm for this signal accordingly.

An important consequence of this is that the effective dynamic range decreases in proportion to the increase in ISO, because along with the useful signal, the noise is also amplified. If at ISO 100 the entire range of signal values is digitized - from zero to the saturation point - then at ISO 200 only half the capacity of the photodiodes is taken as the maximum. With each doubling of ISO sensitivity, the top stop of the dynamic range is cut off, and the remaining stops are pulled up into its place. This is why using ultra-high ISO values makes no practical sense: you could just as well lighten the photo in a RAW converter and get a comparable noise level. The difference between raising the ISO and artificially brightening the image is that when the ISO is raised, the signal is amplified before it enters the ADC, so quantization noise is not amplified (unlike the sensor's own noise), whereas in a RAW converter the amplification is applied to everything, ADC errors included. In addition, reducing the sampling range means more accurate digitization of the remaining input signal values.
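The trade-off can be put into one line of arithmetic (a simplified model that tracks only the "one stop per ISO doubling" effect described above and ignores read-noise details):

```python
import math

def effective_dr_ev(base_dr_ev, iso, base_iso=100):
    """Each doubling of ISO amplifies noise along with the signal
    and trims roughly one stop off the top of the dynamic range."""
    return base_dr_ev - math.log2(iso / base_iso)

print(effective_dr_ev(12.0, 100))   # 12.0
print(effective_dr_ev(12.0, 800))   # 9.0 (three doublings, three stops)
```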

By the way, lowering the ISO below the base value (for example, to ISO 50), available on some devices, does not at all expand the dynamic range, but simply attenuates the signal by half, which is equivalent to darkening the image in the RAW converter. This function can even be considered harmful, since using a subminimum ISO value provokes the camera to increase the exposure, which, while the sensor saturation threshold remains unchanged, increases the risk of clipping in the highlights.

True Dynamic Range

There are a number of programs (DxO Analyzer, Imatest, RawDigger, etc.) that allow you to measure the dynamic range of a digital camera at home. In principle, this is not really necessary, since data for most cameras can be freely found on the Internet, for example, on the website DxOMark.com.

Should we trust the results of such tests? Quite. With the sole caveat that all these tests determine the effective or, so to speak, technical dynamic range, i.e. the ratio between the saturation level and the noise level of the sensor. For a photographer, what matters most is the useful dynamic range, i.e. the number of exposure zones that actually allow you to capture useful information.

As you remember, the lower threshold of the dynamic range is set by the photosensor's noise level. The problem is that in practice the lower zones, though technically already within the dynamic range, still contain too much noise to be of much use. Here a lot depends on personal tolerance: everyone determines an acceptable noise level for themselves.

My subjective opinion is that details in the shadows begin to look more or less decent when the signal-to-noise ratio is at least eight. On this basis, I define useful dynamic range as technical dynamic range minus about three stops.
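That rule of thumb, as a tiny calculation (the SNR-of-8 cutoff is, as said, a subjective criterion):

```python
import math

def useful_dr_ev(technical_dr_ev, min_snr=8.0):
    """Discard the bottom zones whose signal-to-noise ratio falls
    below `min_snr`; an SNR of 8 costs log2(8) = 3 stops."""
    return technical_dr_ev - math.log2(min_snr)

# A 13 EV sensor leaves about 10 EV of genuinely usable range:
print(useful_dr_ev(13.0))   # 10.0
```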

For example, if a DSLR camera, according to reliable tests, has a dynamic range of 13 EV, which is very good by today's standards, then its useful dynamic range will be about 10 EV, which, in general, is also quite good. Of course, we are talking about shooting in RAW, with minimum ISO and maximum bit depth. When shooting JPEG, dynamic range is highly dependent on contrast settings, but on average you should give up another two or three stops.

For comparison: color reversal films have a useful photographic latitude of 5-6 stops; black and white negative films give 9-10 stops with standard developing and printing procedures, and with certain manipulations - up to 16-18 stops.

To summarize the above, let's try to formulate a few simple rules, the observance of which will help you squeeze maximum performance out of your camera's sensor:

  • The dynamic range of a digital camera is only fully accessible when shooting in RAW.
  • Dynamic range decreases as light sensitivity increases, so avoid high ISO settings unless absolutely necessary.
  • Using a higher bit depth for RAW files does not increase true dynamic range, but it does improve tonal separation in the shadows due to more brightness levels.
  • Expose to the right. The upper exposure zones always contain the maximum useful information with a minimum of noise and should be used to the fullest. At the same time, do not forget the danger of clipping: pixels that have reached saturation are absolutely useless.

And most importantly: don't worry too much about the dynamic range of your camera. Its dynamic range is fine. Your ability to see light and manage exposure correctly is much more important. A good photographer will not complain about the lack of photographic latitude, but will try to wait for more comfortable lighting, or change the angle, or use the flash, in a word, will act in accordance with the circumstances. I'll tell you more: some scenes only benefit from the fact that they do not fit into the dynamic range of the camera. Often an unnecessary abundance of details simply needs to be hidden in a semi-abstract black silhouette, which makes the photo both more laconic and richer.

High contrast is not always a bad thing – you just need to know how to work with it. Learn to exploit the shortcomings of the equipment as well as its advantages, and you will be surprised how much your creative possibilities will expand.

Thank you for your attention!

Vasily A.

Post scriptum

If you found the article useful and informative, you can kindly support the project by making a contribution to its development. If you didn’t like the article, but you have thoughts on how to make it better, your criticism will be accepted with no less gratitude.

Please remember that this article is subject to copyright. Reprinting and quoting are permissible provided there is a valid link to the source, and the text used must not be distorted or modified in any way.

Let's think about the question: why do we need to turn up the volume? To hear quiet sounds that are inaudible in our conditions (for example, if you cannot listen loudly, or there is extraneous noise in the room, etc.). Is it possible to amplify quiet sounds while leaving loud ones alone? It turns out it is. This technique is called dynamic range compression (DRC). To do this, the current gain must change continuously: quiet sounds are amplified, loud ones are not. The simplest law of volume change is linear, i.e. the volume changes according to output_loudness = k * input_loudness, where k is the dynamic range compression ratio:

Figure 18. Dynamic range compression.

When k = 1, no changes are made (the output volume equals the input volume). When k < 1, the volume increases and the dynamic range narrows. Look at the graph (k = 1/2): a quiet sound that had a level of -50 dB becomes louder by 25 dB, which is significantly louder, while the level of the dialogue (-27 dB) rises by only 13.5 dB, and the level of the loudest sounds (0 dB) does not change at all. When k > 1, the volume decreases and the dynamic range widens.
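The linear law described above is easy to express directly; the sketch below reproduces the numbers from the graph discussion:

```python
def compress_db(level_db: float, k: float) -> float:
    """Linear dynamic range compression applied in the dB domain:
    output_level = k * input_level (0 dB is the maximum level)."""
    return k * level_db

# k = 1/2 halves the dynamic range
k = 0.5
for level in (-50.0, -27.0, 0.0):
    out = compress_db(level, k)
    print(f"in {level:6.1f} dB -> out {out:6.1f} dB (gain {out - level:+.1f} dB)")
```

The quiet -50 dB sound gains +25 dB, the -27 dB dialogue only +13.5 dB, and the loudest 0 dB sounds are untouched.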

Let's look at the loudness graphs (k = 1/2: the dynamic range is compressed by a factor of two):

Figure 19. Loudness graphs.

As you can see, the original contained both very quiet sounds, 30 dB below the dialogue level, and very loud ones, 30 dB above it; thus the dynamic range was 60 dB. After compression, loud sounds are only 15 dB above and quiet sounds only 15 dB below the dialogue (the dynamic range is now 30 dB). Loud sounds became significantly quieter, quiet sounds significantly louder, and no overflow occurs!

Now let's look at the histograms:

Figure 20. Compression example.

As you can clearly see, with amplification up to +30 dB the shape of the histogram is well preserved, which means that loud sounds remain well expressed (they do not hit the maximum and are not clipped, as happens with simple amplification). The same holds for quiet sounds. The histogram shows this poorly, but the difference is very noticeable by ear. The disadvantage of this method is the same volume jumps. However, the mechanism of their occurrence differs from the jumps that occur during clipping, and their character is different: they appear mainly with very strong amplification of quiet sounds (and not when loud sounds are cut off, as with regular amplification). An excessive level of compression flattens the sound picture: all sounds tend toward the same loudness and become inexpressive.

Excessive amplification of quiet sounds may cause recording noise to become audible. Therefore, the filter uses a slightly modified algorithm so that the noise level rises less:

Figure 21. Increasing volume without increasing noise.

That is, at a level of -50 dB the transfer function has an inflection, and the noise is amplified less (yellow line). Without such an inflection the noise would be much louder (gray line). This simple modification significantly reduces the amount of noise even at very high compression levels (1:5 compression in the picture). The "DRC" level in the filter sets the gain for quiet sounds (at -50 dB), i.e. the 1:5 compression shown in the figure corresponds to a +40 dB level in the filter settings.
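The inflected transfer function can be sketched as a piecewise-linear curve. This is a simplified model of the idea, not the filter's exact implementation: above the -50 dB knee the usual compression law applies, and below it the slope returns to 1 so that very quiet sounds (and recording noise) receive no extra boost.

```python
def drc_gain_db(in_db: float, k: float = 0.2, knee_db: float = -50.0) -> float:
    """Compression transfer function with an inflection at knee_db.

    Above the knee: out = k * in (1:5 compression for k = 0.2).
    Below the knee: unit slope through the knee point, so everything
    quieter than the knee receives the same, limited gain.
    """
    if in_db >= knee_db:
        return k * in_db
    return k * knee_db + (in_db - knee_db)

print(drc_gain_db(-50.0))  # -10.0: quiet sounds at -50 dB are boosted by +40 dB
print(drc_gain_db(-90.0))  # -50.0: noise also gets only +40 dB, not +72 dB
```

Without the inflection, -90 dB noise under pure 1:5 compression would come out at -18 dB, i.e. +72 dB of gain; with it, the noise gain is capped at the +40 dB set at the knee.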

The second part of the series is devoted to functions for optimizing the dynamic range of images. In it we will tell you why such solutions are needed, consider various options for their implementation, as well as their advantages and disadvantages.

Embrace the immensity

Ideally, a camera should capture an image of the surrounding world as a person perceives it. However, due to the fact that the mechanisms of “vision” of a camera and the human eye are significantly different, there are a number of restrictions that do not allow this condition to be met.

One of the problems that previously faced users of film cameras, and now faces owners of digital ones, is the inability to adequately capture scenes with large differences in illumination without special devices and/or special shooting techniques. The peculiarities of the human visual system allow us to perceive details of high-contrast scenes equally well in both brightly lit and dark areas. Unfortunately, the camera sensor is not always able to capture an image the way we see it.

The greater the difference in brightness in the photographed scene, the higher the likelihood of loss of detail in highlights and/or shadows. As a result, instead of a blue sky with lush clouds, the picture turns out to be only a whitish spot, and objects located in the shadows turn into indistinct dark silhouettes or completely merge with the surrounding environment.

In classical photography, the concept of photographic latitude is used (see the sidebar for details). Theoretically, the photographic latitude of digital cameras is determined by the bit depth of the analog-to-digital converter (ADC). For example, when using an 8-bit ADC, taking into account the quantization error, the theoretically achievable photographic latitude is 7 EV; for a 12-bit ADC it is 11 EV, and so on. However, in real devices the dynamic range of images turns out to be below the theoretical maximum due to the influence of various kinds of noise and other factors.

A large difference in brightness levels poses a serious problem when taking photographs. Here the capabilities of the camera proved insufficient to adequately render the brightest areas of the scene, and as a result, instead of an area of blue sky (marked with hatching), we get a white "patch"

The maximum brightness value that a light-sensitive sensor can record is determined by the saturation level of its cells. The minimum value depends on several factors, including the amount of thermal noise of the matrix, charge transfer noise and ADC error.

It is also worth noting that the photographic latitude of the same digital camera can vary depending on the sensitivity value set in the settings. The maximum dynamic range is achievable by setting the so-called basic sensitivity (corresponding to the minimum possible numerical value). As the value of this parameter increases, the dynamic range decreases due to the increasing noise level.

The photographic latitude of modern digital cameras equipped with large sensors and 14- or 16-bit ADCs ranges from 9 to 11 EV, which is significantly greater than the corresponding figure for 35 mm color negative film (on average 4 to 5 EV). Thus, even relatively inexpensive digital cameras have a photographic latitude sufficient to adequately convey most typical amateur scenes.

However, there is a problem of a different kind, associated with the limitations imposed by existing standards for recording digital images. Using the JPEG format with 8 bits per color channel (which has become the de facto standard for recording digital images in the computer and digital camera industry), it is not even theoretically possible to save a photograph with a photographic latitude greater than 8 EV.

Let's assume that the camera's ADC allows you to obtain an image with a bit depth of 12 or 14 bits, containing discernible details in both highlights and shadows. However, if the photographic latitude of this image exceeds 8 EV, then in the process of conversion to a standard 8-bit format without any additional actions (that is, simply by discarding “extra” bits), part of the information recorded by the photosensitive sensor will be lost.

Dynamic range and photographic latitude

To put it simply, dynamic range is defined as the ratio of the maximum brightness value of an image to its minimum value. In classical photography, the term photographic latitude is traditionally used, which essentially means the same thing.

The width of the dynamic range can be expressed as a ratio (for example, 1000:1, 2500:1, etc.), but most often this is done on a logarithmic scale. In this case the decimal logarithm of the ratio of the maximum brightness to the minimum is calculated, and the number is followed by the capital letter D (from density) or, less often, the abbreviation OD (optical density). For example, if the ratio of the maximum brightness value to the minimum for some device is 1000:1, then its dynamic range is 3.0 D, since lg(1000/1) = 3.0.

To measure photographic latitude, so-called exposure values are traditionally used, abbreviated EV (professionals often call them "stops" or "steps"). It is in these units that exposure compensation is usually set in camera settings. Increasing the photographic latitude by 1 EV is equivalent to doubling the difference between the maximum and minimum brightness levels. Thus, the EV scale is also logarithmic, but in this case the base-2 logarithm is used to calculate the numerical values. For example, if a device is capable of capturing images with a maximum-to-minimum brightness ratio of 256:1, then its photographic latitude is 8 EV, since log2(256/1) = 8.
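Both scales reduce to a one-line logarithm; the two examples from this sidebar:

```python
import math

def dynamic_range_d(contrast_ratio: float) -> float:
    """Dynamic range in density units D: decimal log of max/min brightness."""
    return math.log10(contrast_ratio)

def latitude_ev(contrast_ratio: float) -> float:
    """Photographic latitude in EV: base-2 log of max/min brightness."""
    return math.log2(contrast_ratio)

print(dynamic_range_d(1000))  # 3.0 -> "3.0 D"
print(latitude_ev(256))       # 8.0 -> "8 EV"
```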

Compression is a reasonable compromise

The most effective way to preserve the full amount of image information recorded by the camera's light-sensitive sensor is to save images in RAW format. However, not all cameras have such a function, and not every amateur photographer is ready to engage in the painstaking work of selecting individual settings for every photograph taken.

To reduce the likelihood of loss of detail in high-contrast images converted inside the camera into 8-bit JPEG, devices from many manufacturers (not only compact ones, but also DSLRs) have introduced special functions that make it possible to compress the dynamic range of saved images without user intervention. By reducing the overall contrast and losing a small part of the information in the original image, such solutions make it possible to preserve details in highlights and shadows recorded by the device’s light-sensitive sensor in 8-bit JPEG format, even if the dynamic range of the original image turned out to be wider than 8 EV.

One of the pioneers in this area was HP. Released in 2003, the HP Photosmart 945 digital camera featured the world's first HP Adaptive Lighting technology, which automatically compensates for low light levels in dark areas of photos and thus preserves shadow detail without the risk of overexposure (which is very important when shooting high-contrast scenes). The Adaptive Lighting algorithm is based on the principles set out by the American scientist Edwin Land in his Retinex theory of human visual perception.

HP Adaptive Lighting menu

How does Adaptive Lighting work? After a 12-bit image of the frame is obtained, an auxiliary monochrome image is extracted from it, which is essentially an illumination map. When the image is processed, this map is used as a mask that controls the strength of a rather complex digital filter. Thus, in areas corresponding to the darkest points of the map the effect on the future image is minimal, and vice versa. This approach reveals shadow detail by selectively brightening those areas and, accordingly, reducing the overall contrast of the resulting image.
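The general idea of mask-driven shadow correction can be illustrated with a toy sketch. This shows only the principle, not HP's actual algorithm, which in particular works with a smoothed illumination map rather than individual pixel values:

```python
def lift_shadows(luma_rows, strength=0.6):
    """Toy mask-based shadow lifting.

    luma_rows: 2-D list of luminance values in 0.0..1.0.
    A per-pixel mask (1 - luminance) stands in for the illumination map:
    dark pixels receive the largest gain, bright pixels are left untouched.
    """
    result = []
    for row in luma_rows:
        out_row = []
        for v in row:
            mask = 1.0 - v                 # dark area -> mask near 1
            gain = 1.0 + strength * mask   # stronger boost in shadows
            out_row.append(min(1.0, v * gain))
        result.append(out_row)
    return result

# a dark pixel is brightened noticeably, a bright one barely changes
print(lift_shadows([[0.1, 0.9]]))
```

Because highlights are left almost unchanged while shadows are lifted, the overall contrast of the result drops, exactly the trade-off described in the text.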

It should be noted that when Adaptive Lighting is enabled, the captured image is processed in the manner described above before the finished image is written to a file. All the described operations are performed automatically; the user can only select one of two Adaptive Lighting operating modes in the camera menu (low or high impact level) or disable the function.

Generally speaking, many specific functions of modern digital cameras (including the face recognition systems discussed in the previous article) are a kind of by-product of research originally carried out for military customers. When it comes to image dynamic range optimization, one of the best-known providers of such solutions is Apical. Algorithms created by its employees underlie, in particular, the SAT (Shadow Adjustment Technology) function implemented in a number of Olympus digital cameras. Briefly, the SAT function works as follows: based on the original image of the frame, a mask corresponding to the darkest areas is created, and then automatic exposure correction is applied to those areas.

Sony also acquired a license to use Apical's developments. Many compact cameras of the Cyber-shot series and SLR cameras of the alpha series implement the so-called Dynamic Range Optimizer (DRO) function.

Photos taken with the HP Photosmart R927 with the Adaptive Lighting function turned off (top) and activated (bottom)

Image correction when DRO is activated is performed during the primary processing of the image (that is, before the finished JPEG file is recorded). In the basic version, DRO has a two-stage setting (a standard or advanced mode of operation can be selected in the menu). In the standard mode, the exposure value is adjusted based on an analysis of the image, and then a tone curve is applied to even out the overall balance. The advanced mode uses a more complex algorithm that allows correction in both shadows and highlights.

Sony developers are constantly working to improve the DRO algorithm. For example, in the a700 SLR camera, when the advanced DRO mode is activated, it is possible to select one of five correction options. In addition, it is possible to save three versions of one image at once (a kind of bracketing) with different DRO settings.

Many Nikon digital camera models have a D-Lighting function, which is also based on Apical algorithms. True, in contrast to the solutions described above, D-Lighting is implemented as a filter for processing previously saved images using a tonal curve, the shape of which allows you to make shadows lighter, while keeping other areas of the image unchanged. But since in this case ready-made 8-bit images are processed (and not the original frame image, which has a higher bit depth and, accordingly, a wider dynamic range), the capabilities of D-Lighting are very limited. The user can get the same result by processing the image in a graphic editor.

When comparing enlarged fragments, it is clearly visible that the dark areas of the original image (left) became lighter when the Adaptive Lighting function was turned on

There are also a number of solutions based on other principles. Thus, many cameras of Panasonic's Lumix family (in particular, the DMC-FX35, DMC-TZ4, DMC-TZ5, DMC-FS20, DMC-FZ18, etc.) implement the Intelligent Exposure function, which is an integral part of the iA intelligent automatic shooting control system. The Intelligent Exposure function is based on automatic analysis of the frame image and correction of dark areas to avoid loss of shadow detail, as well as (if necessary) compression of the dynamic range of high-contrast scenes.

In some cases, the dynamic range optimization function involves not only processing of the original image but also correction of the shooting settings. For example, recent Fujifilm digital cameras (in particular, the FinePix S100FS) implement a dynamic range expansion function (Wide Dynamic Range, WDR), which, according to the developers, increases the photographic latitude by one or two stops (200 and 400% in the settings terminology).

When WDR is activated, the camera shoots with exposure compensation of -1 or -2 EV (depending on the selected setting). The frame image is thus underexposed; this is necessary to preserve maximum information about detail in the highlights. The resulting image is then processed using a tone curve that equalizes the overall balance and adjusts the black level. Finally, the image is converted to 8-bit format and recorded as a JPEG file.
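The described two-step scheme (underexposed capture, then a tone curve) can be sketched roughly as follows. The gamma-style curve here is an assumption for illustration; Fujifilm does not publish the actual curve it applies:

```python
def wdr_process(linear_values, stops_under=1):
    """Rough sketch of the WDR idea.

    linear_values: linear sensor values, with 1.0 = clipping point at
    normal exposure. Shooting stops_under EV darker halves the values
    per stop, so highlights up to 2**stops_under stay below clipping;
    a simple gamma-style tone curve then restores overall brightness.
    """
    captured = [v / (2 ** stops_under) for v in linear_values]
    return [min(1.0, v ** (1.0 / (1 + stops_under))) for v in captured]

# a highlight at 1.6x the clipping level survives instead of burning out
print(wdr_process([0.5, 1.0, 1.6], stops_under=1))
```

Note how the highlight that would have clipped at normal exposure keeps a distinct value below 1.0, at the cost of the reduced overall contrast discussed in the caption above.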

Dynamic range compression preserves more detail in the highlights and shadows, but an inevitable consequence is a decrease in overall contrast. In the bottom image the texture of the clouds is rendered much better, but due to the lower contrast this version of the photo looks less natural

A similar function, called Dynamic Range Enlargement, is implemented in a number of Pentax compact and SLR cameras (Optio S12, K200D, etc.). According to the manufacturer, the Dynamic Range Enlargement function increases the photographic latitude of images by 1 EV without losing detail in the highlights and shadows.

A similar function, called Highlight Tone Priority (HTP), is implemented in a number of Canon DSLRs (EOS 40D, EOS 450D, etc.). According to the user manual, activating HTP improves the rendering of detail in the highlights (specifically, in the range from 18% gray up to the brightest highlights).

Conclusion

Let's summarize. Built-in dynamic range compression allows a source image with a high dynamic range to be converted into an 8-bit JPEG file with minimal damage. When there is no option to save images in RAW format, the dynamic range compression mode lets photographers make fuller use of their camera's potential when shooting high-contrast scenes.

Of course, it is important to remember that dynamic range compression is not a miracle cure, but rather a compromise. Preserving detail in highlights and/or shadows comes at the cost of increasing the noise level in the dark areas of the image, reducing its contrast, and somewhat coarsening smooth tonal transitions.

Like any automatic function, the dynamic range compression algorithm is not a fully universal solution that allows you to improve absolutely any photo. Therefore, it makes sense to activate it only in cases where it is really necessary. For example, in order to shoot a silhouette with a well-designed background, the dynamic range compression function must be turned off - otherwise the spectacular scene will be hopelessly ruined.

Concluding our consideration of this topic, it should be noted that the use of dynamic range compression functions does not allow us to “pull out” details in the resulting image that were not captured by the camera sensor. To achieve satisfactory results when shooting high-contrast scenes, you need to use additional tools (for example, gradient filters for photographing landscapes) or special techniques (such as shooting several frames with exposure bracketing and then combining them into one image using Tone Mapping technology).

The next article will focus on the burst function.

To be continued

At a time when researchers were just beginning to solve the problem of creating a speech interface for computers, they often had to make their own equipment that would allow audio information to be input into the computer and also output it from the computer. Today, such devices may only be of historical interest, since modern computers can easily be equipped with audio input and output devices, such as sound adapters, microphones, headphones and speakers.

We will not go into the details of the internal design of these devices, but we will describe how they work and give some recommendations for choosing audio devices for working with speech recognition and synthesis systems.

As we already said in the previous chapter, sound is nothing more than air vibrations, the frequency of which lies in the range of frequencies perceived by humans. The exact boundaries of the audible frequency range may vary from person to person, but sound vibrations are believed to lie in the range of 16-20,000 Hz.

The job of a microphone is to convert sound vibrations into electrical oscillations, which can then be amplified, filtered to remove interference, and digitized so that the audio information can be input into the computer.

By their operating principle, the most common microphones are divided into carbon, electrodynamic, condenser and electret types. Some of them require an external current source to operate (for example, carbon and condenser microphones); others are capable of generating an alternating electrical voltage on their own under the influence of sound vibrations (these are electrodynamic and electret microphones).

You can also separate the microphones according to their purpose. There are studio microphones that can be held in your hand or mounted on a stand, there are radio microphones that can be clipped to clothing, and so on.

There are also microphones designed specifically for computers. Such microphones are usually mounted on a stand placed on the surface of a table. Computer microphones can be combined with headphones, as shown in Fig. 2-1.

Fig. 2-1. Headphones with microphone

How, out of all this variety, do you choose the microphone best suited for speech recognition systems?

In principle, you can experiment with any microphone you have, as long as it can be connected to your computer's audio adapter. However, developers of speech recognition systems recommend purchasing a microphone that, during operation, will be at a constant distance from the speaker’s mouth.

If the distance between the microphone and the mouth does not change, then the average level of the electrical signal coming from the microphone will not change too much either. This will have a positive impact on the performance of modern speech recognition systems.

What's the problem?

A person is able to successfully recognize speech whose volume varies over a very wide range. The human brain can filter quiet speech out of interference, such as the noise of cars passing in the street, other people's conversations and music.

As for modern speech recognition systems, their abilities in this area leave much to be desired. If the microphone is on a table, then when you turn your head or change your body position, the distance between your mouth and the microphone will change. This will change the microphone output level, which in turn will reduce the reliability of speech recognition.

Therefore, when working with speech recognition systems, the best results will be achieved if you use a microphone attached to headphones, as shown in Fig. 2-1. When using such a microphone, the distance between the mouth and the microphone will be constant.

We also draw your attention to the fact that all experiments with speech recognition systems are best carried out alone, in a quiet room. In this case the influence of interference will be minimal. Of course, if you need to select a speech recognition system capable of operating under strong interference, the tests should be conducted differently. However, as far as the authors of this book know, the noise immunity of speech recognition systems is still very, very low.

The microphone converts sound vibrations into oscillations of electric current. These oscillations can be seen on an oscilloscope screen, but do not rush to the store to purchase this expensive device. All oscillographic studies can be carried out using an ordinary computer equipped with a sound adapter, such as a Sound Blaster. Later we will tell you how to do this.

In Fig. 2-2 we show the oscillogram of the sound signal obtained by pronouncing a long "a" sound. The waveform was captured using the GoldWave program, which we will discuss later in this chapter, together with a Sound Blaster audio adapter and a microphone similar to the one shown in Fig. 2-1.

Fig. 2-2. Audio signal oscillogram

The GoldWave program allows you to stretch the oscillogram along the time axis, which makes it possible to see the smallest details. In Fig. 2-3 we show a stretched fragment of the above oscillogram of the "a" sound.

Fig. 2-3. Fragment of an audio signal oscillogram

Please note that the magnitude of the input signal coming from the microphone changes periodically and takes on both positive and negative values.

If there was only one frequency present in the input signal (that is, if the sound was “clean”), the waveform received from the microphone would be a sine wave. However, as we have already said, the spectrum of human speech sounds consists of a set of frequencies, as a result of which the shape of the oscillogram of the speech signal is far from sinusoidal.

We will call a signal whose magnitude changes continuously over time an analog signal. This is exactly the kind of signal that comes from the microphone. Unlike an analog signal, a digital signal is a set of numerical values that change discretely in time.

In order for a computer to process an audio signal, it must be converted from analogue to digital form, that is, presented as a set of numerical values. This process is called analog signal digitization.

Digitization of an audio (or any analog) signal is performed by a special device called an analog-to-digital converter, or ADC (Analog to Digital Converter). This device is located on the sound adapter board and looks like an ordinary microcircuit.

How does an analog-to-digital converter work?

It periodically measures the level of the input signal and outputs a numerical value of the measurement result. This process is illustrated in Fig. 2-4. Here, gray rectangles indicate input signal values ​​measured at some constant time interval. A set of such values ​​is a digitized representation of the input analog signal.

Fig. 2-4. Measurements of signal amplitude over time

In Fig. 2-5 we show the connection of an analog-to-digital converter to a microphone. The analog signal is fed to input x1, and the digital signal is taken from outputs u1-un.

Fig. 2-5. Analog-to-digital converter

Analog-to-digital converters are characterized by two important parameters: the conversion frequency and the number of quantization levels of the input signal. Choosing these parameters correctly is critical for an adequate digital representation of the analog signal.

How often do you need to measure the amplitude of the input analog signal so that information about changes in the input analog signal is not lost as a result of digitization?

It would seem that the answer is simple - the input signal needs to be measured as often as possible. Indeed, the more often an analog-to-digital converter makes such measurements, the better it will be able to track the slightest changes in the amplitude of the input analog signal.

However, excessively frequent measurements can lead to an unjustified increase in the flow of digital data and a waste of computer resources when processing the signal.

Fortunately, the correct choice of the conversion frequency (sampling frequency) is quite simple. It is enough to turn to Kotelnikov's theorem (known elsewhere as the Nyquist-Shannon sampling theorem), familiar to specialists in digital signal processing. The theorem states that the conversion frequency must be at least twice the maximum frequency in the spectrum of the converted signal. Therefore, to digitize an audio signal whose frequencies lie in the range of 16-20,000 Hz without loss of quality, a conversion frequency of no less than 40,000 Hz must be chosen.
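The rule is simple enough to express directly; the figures below match those in the text:

```python
def min_conversion_rate_hz(max_signal_hz: float) -> float:
    """Kotelnikov (Nyquist-Shannon) theorem: the sampling frequency must be
    at least twice the highest frequency present in the signal."""
    return 2 * max_signal_hz

print(min_conversion_rate_hz(20_000))  # 40000: full audible range, 16-20,000 Hz
print(min_conversion_rate_hz(4_000))   # 8000: the 300-4000 Hz speech band
```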

Note, however, that in professional audio equipment the conversion frequency is selected several times higher than the specified value. This is done to achieve very high quality digitized audio. This quality is not relevant for speech recognition systems, so we will not focus your attention on this choice.

What conversion frequency is needed to digitize the sound of human speech?

Since the sounds of human speech lie in the frequency range of 300-4000 Hz, the minimum required conversion frequency is 8000 Hz. However, many computer speech recognition programs use the standard conversion frequency of 44,100 Hz supported by conventional audio adapters. On the one hand, such a conversion frequency does not lead to an excessive increase in the digital data stream; on the other, it ensures speech digitization with sufficient quality.

Back in school we were taught that every measurement involves errors that cannot be completely eliminated. Such errors arise from the limited resolution of measuring instruments, and also because the measurement process itself can introduce changes into the measured quantity.

An analog-to-digital converter represents the input analog signal as a stream of numbers of limited bit depth. Conventional audio adapters contain 16-bit ADC blocks capable of representing the amplitude of the input signal as 2^16 = 65,536 different values. ADC devices in high-end audio equipment can be 20-bit, providing greater accuracy in representing the amplitude of the audio signal.
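A 16-bit ADC step can be modelled in a few lines (a simplified sketch: real converters differ in offset, dithering and clipping details):

```python
def quantize(sample: float, bits: int = 16) -> int:
    """Map an analog amplitude in -1.0..1.0 to one of 2**bits integer codes,
    clipping anything outside the representable range."""
    levels = 2 ** bits                 # 65,536 levels for 16 bits
    max_code = levels // 2 - 1         # 32767
    min_code = -(levels // 2)          # -32768
    code = round(sample * max_code)
    return max(min_code, min(max_code, code))

print(quantize(0.25))   # 8192
print(quantize(1.0))    # 32767
print(quantize(-2.0))   # -32768 (clipped)
```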

Modern speech recognition systems and programs were created for regular computers, equipped with conventional sound adapters. Therefore, to conduct experiments with speech recognition, you do not need to purchase a professional audio adapter. An adapter such as Sound Blaster is quite suitable for digitizing speech for the purpose of its further recognition.

Along with the useful signal, various noises usually enter the microphone - noise from the street, wind noise, extraneous conversations, etc. Noise has a negative impact on the performance of speech recognition systems, so it has to be dealt with. We have already mentioned one of the ways - today's speech recognition systems are best used in a quiet room, alone with the computer.

However, it is not always possible to create ideal conditions, so it is necessary to use special methods to get rid of interference. To reduce the noise level, special tricks are used when designing microphones and special filters that remove frequencies from the spectrum of the analog signal that do not carry useful information. In addition, a technique such as compression of the dynamic range of input signal levels is used.

Let's talk about all this in order.

A frequency filter is a device that transforms the frequency spectrum of an analog signal. In the process, oscillations of certain frequencies are passed through (or suppressed).

You can imagine this device as a kind of black box with one input and one output. In relation to our situation, a microphone will be connected to the input of the frequency filter, and an analog-to-digital converter will be connected to the output.

There are different kinds of frequency filters:

· low-pass filters;

· high-pass filters;

· band-pass filters;

· band-stop filters.

High-pass filters (high-pass filter) remove from the spectrum of the input signal all frequencies below a certain cutoff frequency, which depends on the filter setting.

Since audio signals lie in the range of 16-20,000 Hz, all frequencies below 16 Hz can be cut off without degrading the sound quality. For speech recognition the 300-4000 Hz range is what matters, so frequencies below 300 Hz can be cut out as well. In that case all interference whose spectrum lies below 300 Hz is removed from the input signal and will not hinder the speech recognition process.

Likewise, low-pass filters (low-pass filter) cut out from the spectrum of the input signal all frequencies above a certain cutoff frequency.

Humans cannot hear sounds with a frequency of 20,000 Hz and higher, so they can be cut out of the spectrum without noticeable deterioration in sound quality. As for speech recognition, here you can cut out all frequencies above 4000 Hz, which will lead to a significant reduction in the level of high-frequency interference.

A band-pass filter (band-pass filter) can be thought of as a combination of a low-pass and a high-pass filter. Such a filter rejects all frequencies below the so-called lower cutoff frequency and above the upper cutoff frequency.

Thus, a band-pass filter that rejects all frequencies except those in the range of 300-4000 Hz is convenient for a speech recognition system.

As for band-stop filters, they allow you to cut out all frequencies lying in a given range from the spectrum of the input signal. Such a filter is convenient, for example, for suppressing interference that occupies a certain continuous part of the signal spectrum.

In Fig. 2-6 we show the connection of a band-pass filter.

Fig. 2-6. Filtering the audio signal before digitization

It must be said that conventional sound adapters installed in a computer include a bandpass filter through which the analog signal passes before digitization. The passband of such a filter usually corresponds to the range of audio signals, namely 16-20,000 Hz (in different audio adapters, the values ​​of the upper and lower frequencies may vary within small limits).

But how can we achieve the narrower 300-4000 Hz bandwidth corresponding to the most informative part of the spectrum of human speech?

Of course, if you have a penchant for designing electronic equipment, you can make your own filter from an operational amplifier chip, resistors and capacitors. This is roughly what the first creators of speech recognition systems did.

However, industrial speech recognition systems must work on standard computer hardware, so the route of building a special band-pass filter is not suitable here.

Instead, modern speech processing systems use so-called digital frequency filters, implemented in software. This became possible once computer CPUs became powerful enough.

A digital frequency filter, implemented in software, converts an input digital signal into an output digital signal. During the conversion, the program processes in a special way the stream of numerical amplitude values coming from the analog-to-digital converter. The result of the transformation is also a stream of numbers, but one corresponding to the already filtered signal.
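As a minimal illustration of such a number-stream transformation (not taken from the book), here is the simplest possible digital filter, a moving average: each output number is the mean of the last few input numbers, which smooths the stream and acts as a crude low-pass.

```python
def moving_average(samples, n=5):
    """Simplest digital filter: each output value is the mean of the
    last n input values -- a crude low-pass smoothing of the stream."""
    buf = []
    out = []
    for s in samples:
        buf.append(s)
        if len(buf) > n:
            buf.pop(0)          # keep only the n most recent samples
        out.append(sum(buf) / len(buf))
    return out
```

A constant stream passes through unchanged, while a rapidly alternating (high-frequency) stream averages out toward zero, which is exactly the low-pass behavior described above.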

While talking about the analog-to-digital converter, we noted an important characteristic: the number of quantization levels. If a 16-bit analog-to-digital converter is installed in the sound adapter, then after digitization the audio signal levels can be represented by 2^16 = 65,536 different values.

If there are few quantization levels, so-called quantization noise appears. To reduce this noise, high-quality audio digitization systems should use analog-to-digital converters with the maximum available number of quantization levels.

However, there is another technique for reducing the impact of quantization noise on audio quality, used in digital audio recording systems. With this technique, the signal is passed through a nonlinear amplifier before digitization, which emphasizes low-amplitude signals: weak signals are amplified more strongly than strong ones.

This is illustrated by the graph of the output signal amplitude versus the input signal amplitude shown in Fig. 2-7.

Fig. 2-7. Nonlinear amplification before digitization

At the stage of inverse conversion of digitized sound back into analog form (we will discuss this stage later in this chapter), before being output to the speakers the analog signal is again passed through a nonlinear amplifier. This time a different amplifier is used, one that emphasizes high-amplitude signals and whose transfer characteristic (the dependence of output amplitude on input amplitude) is the inverse of the one used during digitization.
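The book does not name a specific transfer characteristic; one widely used pair of exactly this kind is mu-law companding from telephony, sketched below as an illustration. The forward function boosts weak samples before digitization, and the inverse function restores the original levels before playback.

```python
import math

MU = 255.0  # the mu-law constant used in North American telephony

def mu_compress(x):
    """Forward characteristic: amplifies weak samples more than strong ones (|x| <= 1)."""
    return math.copysign(math.log1p(MU * abs(x)) / math.log1p(MU), x)

def mu_expand(y):
    """Inverse characteristic, applied before output to the speakers."""
    return math.copysign(math.expm1(abs(y) * math.log1p(MU)) / MU, y)
```

A weak sample of 0.01 comes out of `mu_compress` at roughly 0.23, i.e. amplified about 23 times, while a full-scale sample of 1.0 stays at 1.0, and `mu_expand` undoes the mapping.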

How can all this help the creators of speech recognition systems?

As is well known, a person recognizes speech quite well whether it is spoken in a quiet whisper or in a fairly loud voice. We can say that for humans, the dynamic range of loudness levels over which speech is successfully recognized is quite wide.

Today's computer speech recognition systems, unfortunately, cannot yet boast of this. However, to expand this dynamic range somewhat, before digitizing you can pass the microphone signal through a nonlinear amplifier whose transfer characteristic is shown in Fig. 2-7. This will reduce the quantization noise level when digitizing weak signals.

Again, developers of speech recognition systems are forced to target mass-produced sound adapters, and these do not provide the nonlinear signal conversion described above.

However, it is possible to create a software equivalent of a nonlinear amplifier that converts the digitized signal before passing it on to the speech recognition module. Although such a software amplifier cannot reduce quantization noise, it can be used to emphasize the signal levels that carry the most speech information. For example, you can reduce the amplitude of weak signals, thereby ridding the signal of noise.
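One possible form of such a software amplifier is a downward expander, sketched here as an illustration (the threshold and ratio values are hypothetical, not from the book): samples whose magnitude is below a threshold are pushed toward silence, while louder speech samples pass through unchanged.

```python
import math

def downward_expand(samples, threshold=0.05, ratio=4.0):
    """Attenuate samples whose magnitude is below `threshold`.
    The higher `ratio`, the harder weak (noise) samples are pushed
    toward zero; samples at or above the threshold are unchanged."""
    out = []
    for s in samples:
        a = abs(s)
        if a < threshold:
            a = threshold * (a / threshold) ** ratio
        out.append(math.copysign(a, s))  # restore the original sign
    return out
```

With the defaults above, a low-level sample of 0.01 (likely noise) is reduced to well under a thousandth of full scale, while a speech-level sample of 0.5 is passed through untouched.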