Audio is any type of sound stored in a digital format. It can be speech, music, environmental sounds, and so on. Audio data is used in fields such as speech recognition, music analysis, and audio classification. In recent years, it has become an increasingly important data type for training models that perform audio-related tasks.
How exactly is sound recorded? For a computer to work with real-world audio, the sound must be converted to numbers, just as is done with images. A microphone records the audio, and the analog signal is converted to a digital one by sampling it at regular intervals of time. The number of samples taken per second is known as the sample rate. Sound quality improves as the sample rate increases, but after a certain point the human ear can no longer distinguish the difference. A common sample rate is 48 kHz, or 48,000 samples per second.
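The relationship between duration, sample rate, and sample count can be sketched in plain Python. This is only an illustration: the 440 Hz tone, the 8 kHz rate, and the helper name `sample_sine` are assumptions chosen for the example and do not come from the article's dataset.

```python
import math

def sample_sine(freq_hz, duration_s, sample_rate):
    """Sample a sine wave at regular intervals of 1 / sample_rate seconds."""
    num_samples = int(duration_s * sample_rate)
    return [math.sin(2 * math.pi * freq_hz * n / sample_rate)
            for n in range(num_samples)]

# Hypothetical values: a 440 Hz tone, half a second long, sampled at 8 kHz.
samples = sample_sine(freq_hz=440, duration_s=0.5, sample_rate=8000)

# The number of samples equals duration * sample rate.
print(len(samples))
```

Doubling the sample rate doubles the number of values stored for the same half second of sound.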
Audio data preparation and augmentation are important steps in machine learning workflows that involve audio data. Here are a few reasons why:
- To be used, audio data must be in a usable format such as WAV, MP3, or FLAC.
- Audio data can contain noise or background sounds, so data cleaning is necessary to build an effective model.
- Augmenting audio data can also improve the generalization of machine learning models by introducing more variation into the input data.
Audio Data Preparation
Let's get started with the implementation.
In this article, we use the "tensorflow-io" library, which provides APIs for efficient and scalable audio data preparation and augmentation. Here are the steps:
Step 1: Install the required packages
First, install the required packages and restart the kernel/runtime.
!pip install tensorflow-io
Step 2: Load the Dataset
Download the sample audio file: Link
This example uses an audio file in WAV format, but you can use other formats. Next, load the audio file by specifying its path and read it using 'tfio.audio.AudioIOTensor', a TensorFlow I/O class that reads an audio file into a lazily loaded, tensor-like object.
Python
#importing necessary libraries
import tensorflow as tf
import tensorflow_io as tfio
#loading and reading the audio file.
audio_data = tfio.audio.AudioIOTensor('audio.wav')
print(audio_data)
print(audio_data.shape)
print(audio_data.rate)
Output:
<AudioIOTensor: shape=[66150 1], dtype=<dtype: 'int16'>, rate=22050>
tf.Tensor([66150 1], shape=(2,), dtype=int64)
tf.Tensor(22050, shape=(), dtype=int32)
The shape, dtype, and sample rate of the AudioIOTensor are printed. The shape is represented as [samples, channels], where samples = length of the audio clip in seconds * sample rate. In the example above, the clip therefore has 66150 (3 * 22050) int16 samples and 1 channel, which means the audio is mono.
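The same arithmetic can be checked without TensorFlow. The sketch below uses Python's standard `wave` module as a stand-in for `AudioIOTensor`; the file name `silence.wav` and the clip parameters are hypothetical values chosen to mirror the output above.

```python
import os
import tempfile
import wave

# Hypothetical clip parameters chosen to mirror the output above.
sample_rate = 22050   # samples per second
duration_s = 3        # clip length in seconds
channels = 1          # mono
sample_width = 2      # bytes per sample, i.e. int16

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "silence.wav")

    # Write three seconds of silence as 16-bit mono PCM.
    with wave.open(path, "wb") as wav_out:
        wav_out.setnchannels(channels)
        wav_out.setsampwidth(sample_width)
        wav_out.setframerate(sample_rate)
        wav_out.writeframes(b"\x00\x00" * (sample_rate * duration_s))

    # Read the header back and recover [samples, channels] and the rate.
    with wave.open(path, "rb") as wav_in:
        num_frames = wav_in.getnframes()
        shape = (num_frames, wav_in.getnchannels())
        rate = wav_in.getframerate()

# Dividing the sample count by the rate gives the duration in seconds.
print(shape, rate, num_frames / rate)
```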
Step 3: Audio preprocessing
Next comes audio preprocessing with TensorFlow. Slicing is used when we want to work with only a specific part of the audio clip. Here, the slice [6:] drops the first 6 samples.
Python
#perform slicing
audio_sl = audio_data[6:]
print('Shape After Slicing :',audio_sl.shape)
#last dimension is ignored.
audio_t = tf.squeeze(audio_sl, axis=[-1])
print('Shape After squeezing:',audio_t.shape)
#to play the audio,
from IPython.display import Audio
Audio(audio_t.numpy(), rate=audio_data.rate.numpy())
Output:
Shape After Slicing : (66144, 1)
Shape After squeezing: (66144,)
We have used the tf.squeeze() function to remove the last dimension of the audio tensor because it simplifies the data representation and gives the one-dimensional waveform that the later steps operate on. This works because the audio is mono, so the channel axis (axis=-1) has size 1. tf.squeeze() can only remove axes of size 1, so for stereo audio (2 channels) you would instead select one channel or average the two.
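The difference between unwrapping a mono channel and downmixing several channels can be sketched with plain lists. The helper `to_mono` and the sample values are hypothetical, and averaging is only one possible downmix choice.

```python
def to_mono(frames):
    """Collapse [samples, channels] frames into a flat list of samples.

    A single channel is simply unwrapped, which is what squeezing does.
    Several channels are averaged, which is one common downmix choice.
    """
    return [sum(frame) / len(frame) if len(frame) > 1 else frame[0]
            for frame in frames]

mono_frames = [[3], [7], [5]]             # shape (3, 1)
stereo_frames = [[2, 4], [6, 8], [1, 3]]  # shape (3, 2)

print(to_mono(mono_frames))    # [3, 7, 5]
print(to_mono(stereo_frames))  # [3.0, 7.0, 2.0]
```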
Note: If your audio data is already a tensor of floating-point values, you can normalize it by its peak absolute value using the following code:
# Normalize audio between -1 and 1
normalized_audio = audio_t / tf.math.reduce_max(tf.math.abs(audio_t))
In our example, the audio data is a tensor of 16-bit signed integers, so the common normalization technique is to convert the integer values to floating-point values between -1 and 1. This is done by dividing the integer values by the maximum value the integer data type can represent, which is 32767 for 16-bit signed integers. This technique is sometimes referred to as "integer normalization" or "sample normalization".
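As a small worked example of this scaling, the sketch below divides a few hypothetical int16 values by 32767. The helper name `int16_to_float` and the sample values are assumptions made for illustration.

```python
INT16_MAX = 32767  # largest value a 16-bit signed integer can hold

def int16_to_float(samples):
    """Scale int16 samples to floats by dividing by 32767."""
    return [s / INT16_MAX for s in samples]

# Hypothetical raw samples, including the largest positive value.
raw = [0, 3, 16384, 32767, -32767]
scaled = int16_to_float(raw)
print(scaled)
```

Note that the most negative int16 value, -32768, maps to slightly below -1 with this divisor, so some pipelines divide by 32768 instead.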
We can then use the matplotlib.pyplot library to plot the audio data as a graph.
Python
# convert the tensor to float and scale it to the range [-1, 1]
tensor = tf.cast(audio_t, tf.float32) / 32767.0
print(tensor)
#plot the graph.
import matplotlib.pyplot as plt
plt.figure()
plt.plot(tensor.numpy())
plt.show()
Output:
tf.Tensor(
[ 9.15555283e-05 2.13629566e-04 1.52592547e-04 ... -3.66222113e-04
-1.30314035e-02 -6.92770164e-03], shape=(66144,), dtype=float32)
Audio clip in Graph
Step 4: Trimming
Trimming is another preprocessing technique. It removes the silence at the beginning and end of the audio, which reduces unnecessary processing and can improve accuracy. The "tfio.audio.trim()" function provides a convenient way to do this in TensorFlow: it returns the start and stop positions of the non-silent part, which are then used to slice the audio.
Syntax: tfio.audio.trim(input, axis, epsilon)
Parameters
- input--> A tensor representing the audio waveform.
- axis--> The axis along which to trim the audio.
- epsilon--> The threshold below which samples are considered silent. Leading and trailing samples that do not exceed this threshold are trimmed away.
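The idea behind threshold-based trimming can be sketched in plain Python. This is not the tfio implementation: the helper `find_trim_bounds` is hypothetical, and comparing the absolute value of each sample against epsilon is an assumption made for this illustration.

```python
def find_trim_bounds(samples, epsilon):
    """Return (start, stop) so that samples[start:stop] drops quiet edges.

    A sample counts as silent when its absolute value is <= epsilon.
    If every sample is silent, (len(samples), 0) is returned, which
    slices to an empty list.
    """
    loud = [i for i, s in enumerate(samples) if abs(s) > epsilon]
    if not loud:
        return len(samples), 0
    return loud[0], loud[-1] + 1

# Hypothetical signal with quiet samples at both ends.
signal = [0.0, 0.05, -0.1, 0.6, -0.4, 0.3, 0.1, 0.0]
demo_start, demo_stop = find_trim_bounds(signal, epsilon=0.2)
print(demo_start, demo_stop, signal[demo_start:demo_stop])
```

A larger epsilon trims more aggressively, so it is worth plotting the result before settling on a value.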
Python
# Find the start and stop positions of the non-silent part of the signal
position = tfio.audio.trim(tensor,
                           axis=0,
                           epsilon=0.2)
print('Trimmed :', position)
# Extract the starting and ending limits
start = position[0]
stop = position[1]
print("START:{},\nSTOP :{}".format(start.numpy(), stop.numpy()))
trimmed_audio = tensor[start:stop]
print('Trimmed Audio shape:',trimmed_audio.shape)
plt.figure()
plt.plot(trimmed_audio.numpy())
#to play the audio,
Audio(trimmed_audio.numpy(),
rate=audio_data.rate.numpy())
# This code is modified by Susobhan Akhuli
Output:
Trimmed : tf.Tensor([13840 59291], shape=(2,), dtype=int64)
START:13840,
STOP :59291
Trimmed Audio shape: (45451,)
Trimmed audio
Step 5: Fade in and out
Fading in and fading out are common audio augmentation techniques used to avoid abrupt transitions in audio signals.
To apply the fade-in and fade-out effects, you can use the tfio.audio.fade function in TensorFlow.
Syntax: tfio.audio.fade(input, fade_in, fade_out, mode)
Parameters:
- input--> A tensor containing the audio samples.
- fade_in--> An integer, the number of samples to fade in at the beginning.
- fade_out--> An integer, the number of samples to fade out at the end.
- mode--> A string specifying the type of fade, for example 'logarithmic' or 'linear'.
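To see what a fade does to the samples, here is a minimal plain-Python sketch. It is not the tfio implementation: the helpers `fade_gains` and `apply_fade` are hypothetical, the logarithmic curve shown is one plausible definition, and the sketch assumes the fade lengths do not exceed the signal length.

```python
import math

def fade_gains(length, mode):
    """Gain curve rising from 0 to 1 over `length` samples."""
    if length <= 1:
        return [1.0] * length
    ramp = [i / (length - 1) for i in range(length)]
    if mode == "linear":
        return ramp
    if mode == "logarithmic":
        return [math.log1p(x) / math.log1p(1.0) for x in ramp]
    raise ValueError(f"unsupported mode: {mode}")

def apply_fade(samples, fade_in, fade_out, mode="linear"):
    """Scale the first fade_in and last fade_out samples by a gain ramp."""
    out = list(samples)
    for i, gain in enumerate(fade_gains(fade_in, mode)):
        out[i] *= gain
    # The fade-out ramp is applied from the last sample backwards.
    for i, gain in enumerate(fade_gains(fade_out, mode)):
        out[len(out) - 1 - i] *= gain
    return out

# Hypothetical constant signal, so the gain curve is visible directly.
faded = apply_fade([1.0] * 10, fade_in=3, fade_out=5, mode="linear")
print(faded)
```

The logarithmic curve rises faster than the linear one at the start of the ramp, which tends to sound smoother for short fades.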
Here's an example of how to apply a fade-in and fade-out effect to an audio clip:
Python
# Apply fade-in and fade-out effects, with lengths given in samples
audio_fade = tfio.audio.fade(trimmed_audio,
                             fade_in=1000,
                             fade_out=2000,
                             mode="logarithmic")
print(audio_fade)
plt.figure()
# plot the faded audio
plt.plot(audio_fade.numpy())
# to play the audio
Audio(audio_fade.numpy(),
      rate=audio_data.rate.numpy())
# This code is modified by Susobhan Akhuli
Output:
tf.Tensor([0. 0.00027118 0.00048062 ... 0.000303 0.00016269 0. ],
shape=(45451,), dtype=float32)
Fade in and out
Step 6: Spectrogram
A spectrogram is useful in preprocessing because it shows which frequencies are present in the audio data over time.
It is mainly helpful in sound detection, information retrieval, etc. We will use the "tfio.audio.spectrogram" function to compute it.
"tfio.audio.spectrogram" takes the audio tensor as input, along with the following parameters:
- nfft--> The number of samples to use for the fast Fourier transform (FFT).
- window--> The size of the window used for the FFT.
- stride--> The distance, in samples, between consecutive windows.
The function returns a new tensor representing the spectrogram of the audio signal.
Spectrograms can be used in audio augmentation techniques such as time masking and frequency masking.
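How nfft, window, and stride shape the result can be sketched with a naive discrete Fourier transform in plain Python. This is a simplified illustration, not the tfio implementation: the helper `magnitude_spectrogram` is hypothetical, it uses a rectangular window, assumes window <= nfft, and drops any incomplete final frame, whereas a library may pad the signal instead.

```python
import cmath
import math

def magnitude_spectrogram(samples, nfft, window, stride):
    """Return one list of nfft // 2 + 1 magnitudes per complete frame."""
    frames = []
    for start in range(0, len(samples) - window + 1, stride):
        frame = samples[start:start + window]
        frame = frame + [0.0] * (nfft - window)  # zero-pad up to nfft
        bins = []
        for k in range(nfft // 2 + 1):
            total = sum(x * cmath.exp(-2j * math.pi * k * n / nfft)
                        for n, x in enumerate(frame))
            bins.append(abs(total))
        frames.append(bins)
    return frames

# Hypothetical test signal: an 8 Hz sine sampled at 64 Hz for one second.
rate = 64
tone = [math.sin(2 * math.pi * 8 * n / rate) for n in range(rate)]
spec = magnitude_spectrogram(tone, nfft=16, window=16, stride=16)

# Number of frames and number of frequency bins per frame.
print(len(spec), len(spec[0]))

# Index of the strongest bin in the first frame; bin k covers k * rate / nfft Hz.
peak_bin = max(range(len(spec[0])), key=lambda k: spec[0][k])
print(peak_bin, peak_bin * rate / 16)
```

A smaller stride produces more frames (finer time resolution), while a larger nfft produces more bins (finer frequency resolution).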
Python
# Convert to spectrogram
spectrogram = tfio.audio.spectrogram(audio_fade,
nfft=512,
window=512,
stride=512)
plt.figure()
#plotting spectrogram
plt.imshow(tf.math.log(spectrogram).numpy())
plt.show()
Output:
Spectrogram
Mel spectrogram
A mel spectrogram is a version of the spectrogram whose frequency axis is scaled to match the human auditory system. The mel scale is a non-linear transformation of frequency that is designed to better reflect how the human ear perceives sound.
"tfio.audio.melscale" is a TensorFlow I/O (TFIO) function that converts a spectrogram into a mel spectrogram.
"tfio.audio.melscale()" takes the following parameters:
- input--> A float tensor representing the input spectrogram, with frequency bins along the last axis. The input should be on a linear magnitude scale.
- rate--> The sample rate of the input audio signal.
- mels--> The number of mel bins.
- fmin--> The minimum frequency of the mel spectrogram.
- fmax--> The maximum frequency of the mel spectrogram.
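The non-linear spacing of the mel scale can be illustrated with a commonly used conversion formula, m = 2595 * log10(1 + f / 700). The sketch below is an assumption-based illustration rather than the exact computation inside tfio, and the helper names are hypothetical.

```python
import math

def hz_to_mel(freq_hz):
    """Convert a frequency in Hz to mels (a commonly used formula)."""
    return 2595.0 * math.log10(1.0 + freq_hz / 700.0)

def mel_to_hz(mel):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)

def mel_band_edges(fmin, fmax, mels):
    """Edges, in Hz, of `mels` bands spaced evenly on the mel scale."""
    lo, hi = hz_to_mel(fmin), hz_to_mel(fmax)
    step = (hi - lo) / (mels + 1)
    return [mel_to_hz(lo + i * step) for i in range(mels + 2)]

# Hypothetical setting: 4 mel bands between 0 Hz and 7000 Hz.
edges = mel_band_edges(fmin=0, fmax=7000, mels=4)
print([round(e) for e in edges])
```

The printed edges are close together at low frequencies and far apart at high frequencies, which is how the mel scale gives more resolution to the range where hearing is most sensitive.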
The "tfio.audio.dbscale" function converts a spectrogram or mel spectrogram into a decibel-scaled version of the same representation.
"tfio.audio.dbscale" takes:
- the spectrogram as input
- top_db--> The dynamic range, in decibels, to keep below the loudest value. Any value more than top_db below the peak is clipped to (peak - top_db).
The function returns a new tensor representing the decibel-scaled spectrogram of the audio signal.
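The decibel conversion and the effect of top_db can be sketched in a few lines of plain Python. The helper `to_db` is hypothetical and assumes strictly positive magnitudes; it illustrates the idea rather than reproducing the library code.

```python
import math

def to_db(magnitudes, top_db):
    """Convert magnitudes to decibels and limit the dynamic range.

    Values more than top_db below the loudest value are raised to
    (peak - top_db). Magnitudes must be positive.
    """
    db = [10.0 * math.log10(m * m) for m in magnitudes]
    floor = max(db) - top_db
    return [max(value, floor) for value in db]

# Hypothetical magnitudes spanning a 100 dB range.
db_values = to_db([1.0, 0.1, 0.001, 0.00001], top_db=70)
print(db_values)
```

Here the quietest value would be 100 dB below the peak, so it is clipped to 70 dB below the peak, while the others are left unchanged.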
Python
# Convert spectrogram to mel spectrogram with 250 mel bands
# rate is the sample rate of the loaded clip (22050 Hz)
spectrogram_mel = tfio.audio.melscale(spectrogram, rate=22050, mels=250, fmin=0, fmax=7000)
plt.figure()
plt.imshow(tf.math.log(spectrogram_mel).numpy())
# Convert mel spectrogram to decibel scale with top_db=70
spectrogram_dbscale = tfio.audio.dbscale(spectrogram_mel, top_db=70)
plt.figure()
plt.imshow(spectrogram_dbscale.numpy())
plt.show()
Output:
Mel spectrogram
Step 7: Frequency masking and Time masking
Frequency masking and Time masking are audio augmentation techniques used to improve the robustness and generalization of audio models. They are based on the idea of randomly masking certain frequency bands or time steps in the audio signal to simulate real-world variations and improve the model's ability to handle noisy or incomplete inputs.
Frequency masking
Python
# Apply frequency masking with
# a max mask width of 100 frequency bins
spec_freq_mask = tfio.audio.freq_mask(spectrogram_dbscale, param=100)
plt.figure()
plt.imshow(spec_freq_mask.numpy())
plt.show()
Output:
Frequency masking
Time masking
Python
# Apply time masking with
# a max mask width of 50 time steps
audio_time_masked = tfio.audio.time_mask(spectrogram_dbscale,
param=50)
plt.figure()
plt.imshow(audio_time_masked.numpy())
plt.show()
Output:
Time masking
tfio.audio.freq_mask() and tfio.audio.time_mask() apply masking along the frequency axis and the time axis of the input spectrogram, respectively. Each one sets a band of random width (limited by param) at a random position to zero.
Hiding random frequency bands and time steps in this way makes the model more robust and improves its generalization performance.
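The masking idea can be sketched on a small list-of-lists spectrogram in plain Python. The helper `mask_axis`, the fixed random seed, and the tiny 6 x 8 spectrogram are hypothetical choices for illustration, and the sketch assumes param is at least 1.

```python
import random

def mask_axis(spec, param, axis, rng):
    """Zero a random contiguous band of width < param along one axis.

    spec is a list of rows shaped [time][frequency]. axis=0 masks
    time steps (rows) and axis=1 masks frequency bins (columns).
    """
    size = len(spec) if axis == 0 else len(spec[0])
    width = rng.randrange(0, min(param, size))
    start = rng.randrange(0, size - width)
    masked = [row[:] for row in spec]  # copy so the input stays untouched
    for t, row in enumerate(masked):
        for f in range(len(row)):
            index = t if axis == 0 else f
            if start <= index < start + width:
                row[f] = 0.0
    return masked, start, width

rng = random.Random(0)                # fixed seed for repeatability
spec = [[1.0] * 8 for _ in range(6)]  # 6 time steps x 8 frequency bins
freq_masked, f0, f_width = mask_axis(spec, param=4, axis=1, rng=rng)
time_masked, t0, t_width = mask_axis(spec, param=3, axis=0, rng=rng)
print(f0, f_width, t0, t_width)
```

Because the band position and width are drawn at random, applying the function again during training yields a different masked copy of the same spectrogram each time.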
By following these steps, you can preprocess and augment audio data using TensorFlow in Python. You can experiment with different data augmentation techniques and evaluate their impact on the performance of your audio models.
You can download the source code from here: Audio Data Preparation and Augmentation