In this project, we developed an SDR system in a speaker-independent setting that can generalize to different speakers, as required by real-world ASR systems. We framed the task as sequence classification and used two different neural networks, a simple RNN and a Transformer-based model: given a short audio clip as input, each model classifies the digit that was spoken. First, the given dataset was explored and analyzed using mel spectrograms. The raw data was then fed to the neural network models to further improve inference results. We also explored two augmentation techniques, SpecAugment and WavAugment, to improve generalization when training on a single speaker.
The data used for training, validation, and testing is organized as follows:
/data/
- speech_data: contains raw wav files
The folders contain files that have first been MFCC-extracted and then downsampled. In this spectro-temporal representation, a speech sample can be seen as a sequence of MFCC feature vectors over time. Before implementing the baseline model, we first downsampled the spectrogram.
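As a rough illustration of this preprocessing, the sketch below extracts MFCCs with torchaudio and downsamples the time axis by average pooling; the sample rate, number of coefficients, and target frame count are illustrative assumptions, not necessarily the exact values of our pipeline.

```python
import torch
import torchaudio

# Illustrative parameters; the actual values used in our pipeline may differ.
SAMPLE_RATE = 8000
N_MFCC = 13

mfcc_transform = torchaudio.transforms.MFCC(
    sample_rate=SAMPLE_RATE,
    n_mfcc=N_MFCC,
    melkwargs={"n_fft": 400, "hop_length": 160, "n_mels": 40},
)

def preprocess(waveform: torch.Tensor, target_frames: int = 32) -> torch.Tensor:
    """Extract MFCCs and downsample the time axis to a fixed number of frames."""
    mfcc = mfcc_transform(waveform)                               # (channel, n_mfcc, time)
    # Adaptive average pooling downsamples variable-length clips to a fixed size,
    # giving the baseline classifier a fixed-dimensional input vector.
    pooled = torch.nn.functional.adaptive_avg_pool1d(mfcc, target_frames)
    return pooled.flatten()                                       # fixed-length feature vector

example = preprocess(torch.randn(1, 8000))    # placeholder 1-second clip at 8 kHz
print(example.shape)                          # torch.Size([416])
```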
We used a linear classifier trained with SGD and hinge loss, which is equivalent to a linear SVM.
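A minimal sketch of this baseline with scikit-learn's SGDClassifier (hinge loss with a linear model is equivalent to a linear SVM); the placeholder arrays stand in for the downsampled MFCC features, and the hyperparameters are illustrative:

```python
import numpy as np
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Placeholder features standing in for the downsampled MFCC vectors (13 * 32 = 416 dims).
rng = np.random.default_rng(0)
X_train, y_train = rng.normal(size=(500, 416)), rng.integers(0, 10, size=500)
X_dev, y_dev = rng.normal(size=(100, 416)), rng.integers(0, 10, size=100)

# Hinge loss + linear model trained with SGD is equivalent to a linear SVM.
svm_baseline = make_pipeline(
    StandardScaler(),
    SGDClassifier(loss="hinge", alpha=1e-4, max_iter=1000, random_state=0),
)
svm_baseline.fit(X_train, y_train)
print("dev accuracy:", svm_baseline.score(X_dev, y_dev))
```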
We used a simple RNN architecture consisting of a single LSTM layer and a linear layer followed by a softmax output. Each input sample was embedded to a size of 80. The model was trained for 30 epochs with early stopping (patience = 5) to select the best model without exhausting resources.
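A minimal PyTorch sketch of this architecture; the input dimension, hidden size, and sequence length are illustrative assumptions, while the embedding size of 80 matches the description above:

```python
import torch
import torch.nn as nn

class DigitRNN(nn.Module):
    """Single-LSTM-layer classifier: embed each frame, run an LSTM, classify from the last state."""
    def __init__(self, input_dim: int = 13, embed_dim: int = 80,
                 hidden_dim: int = 128, num_classes: int = 10):
        super().__init__()
        self.embed = nn.Linear(input_dim, embed_dim)   # project each frame to size 80
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.classifier = nn.Linear(hidden_dim, num_classes)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, time, input_dim)
        emb = self.embed(x)
        _, (h_n, _) = self.lstm(emb)                   # h_n: (1, batch, hidden_dim)
        logits = self.classifier(h_n[-1])              # (batch, num_classes)
        return logits                                  # softmax is applied in the loss / at inference

model = DigitRNN()
dummy = torch.randn(4, 50, 13)                         # batch of 4 clips, 50 frames each
print(model(dummy).shape)                              # torch.Size([4, 10])
```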
The Conformer model introduced by \citet{conformer} is an amalgam of convolutional neural networks and Transformers. It is based on the intuition that Transformers handle long-term dependencies well, while convolutional models capture local information well. The Conformer was originally proposed for speech recognition, i.e., predicting the spoken text.
The Conformer block contains a multi-headed self-attention module integrated with a relative sinusoidal positional encoding scheme; this module helps the model learn multiple long-term dependencies in a given sequence. The convolution module starts with a point-wise convolution and a GLU, followed by a 1-D depth-wise convolution layer. The feed-forward module consists of two linear transformations with a nonlinear activation between them; Swish activation is used, and dropout is applied for regularisation. In our implementation, we adapted this model for classification.
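One possible way to realize this adaptation, sketched with torchaudio's Conformer encoder followed by mean pooling over time and a linear classification head; the layer sizes are illustrative, and our actual implementation may differ in such details:

```python
import torch
import torch.nn as nn
from torchaudio.models import Conformer

class DigitConformer(nn.Module):
    """Conformer encoder followed by mean pooling and a linear classification head."""
    def __init__(self, input_dim: int = 13, num_classes: int = 10):
        super().__init__()
        self.encoder = Conformer(
            input_dim=input_dim,
            num_heads=4,
            ffn_dim=128,
            num_layers=4,
            depthwise_conv_kernel_size=31,
            dropout=0.1,
        )
        self.classifier = nn.Linear(input_dim, num_classes)

    def forward(self, x: torch.Tensor, lengths: torch.Tensor) -> torch.Tensor:
        # x: (batch, time, input_dim), lengths: (batch,) valid frames per clip
        encoded, _ = self.encoder(x, lengths)
        pooled = encoded.mean(dim=1)          # average over time for a clip-level representation
        return self.classifier(pooled)        # (batch, num_classes)

model = DigitConformer()
x = torch.randn(4, 50, 13)
lengths = torch.full((4,), 50)
print(model(x, lengths).shape)                # torch.Size([4, 10])
```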
For dimensionality reduction, we used the t-SNE algorithm to reduce the dimensions of the spectrogram so that we could plot the data and inspect how our classification algorithms separate those points. We reduced the spectrogram to two t-SNE components, which act as the axes of our 2D plots. The figures below show the t-SNE plots for the RNN and the Conformer model on both the dev and test sets. As the plots show, there are no clearly defined clusters per digit (based on the t-SNE components), so it is difficult to judge the classification accuracy directly from the plots. Nevertheless, we can still compare regions to get an idea of the classification accuracy and decision boundaries.
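A brief sketch of how such plots can be produced with scikit-learn's t-SNE and matplotlib; the feature matrix and predicted labels below are placeholders:

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

# Placeholder data standing in for flattened spectrogram features and model predictions.
rng = np.random.default_rng(0)
features = rng.normal(size=(300, 416))
predicted_digits = rng.integers(0, 10, size=300)

# Reduce the features to 2 t-SNE components that serve as the plot axes.
embedded = TSNE(n_components=2, perplexity=30, init="pca", random_state=0).fit_transform(features)

scatter = plt.scatter(embedded[:, 0], embedded[:, 1], c=predicted_digits, cmap="tab10", s=10)
plt.legend(*scatter.legend_elements(), title="digit")
plt.xlabel("t-SNE component 1")
plt.ylabel("t-SNE component 2")
plt.show()
```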
In the RNN plots, the classification seems better, with fewer misclassifications, as no single digit dominates the predictions. Still, there appears to be confusion between the digits 0 and 1, particularly in the dev set. In the test set, the misclassification is more pronounced between 3 and 8.
The Conformer's test accuracy is higher than the RNN's by almost 18 percentage points (80.52% vs. 62.62%), which suggests that the improvement might be statistically significant.
In this section, we discuss the results of training the Conformer model using data from only one speaker. When the Conformer was trained on one speaker (George), the test accuracy dropped to 26%, since the training set contained only 500 examples while evaluation was still performed on the original dev and test sets. This is expected: the model is trained on a very small dataset with few feature variations to learn from, so the highly flexible model overfits and consequently performs poorly on the test sets. To tackle this issue and introduce more variation into the training data without adding more speakers, we explored two different augmentation strategies.
The first data augmentation technique we used was SpecAugment, introduced by \citet{spec} as a data augmentation approach for audio. It augments the mel spectrogram directly. We applied it to George's audio data to generate 500 additional examples, for a total of 1000 training examples, and trained the Conformer model on this augmented data using the same methodology as before.
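A hedged sketch of this kind of spectrogram-level augmentation using torchaudio's masking transforms (the original SpecAugment also includes time warping); the mask widths are illustrative, not necessarily the values we used:

```python
import torch
import torchaudio

# Frequency and time masking as in SpecAugment; the parameters below are illustrative.
freq_mask = torchaudio.transforms.FrequencyMasking(freq_mask_param=8)
time_mask = torchaudio.transforms.TimeMasking(time_mask_param=20)

def spec_augment(mel_spec: torch.Tensor) -> torch.Tensor:
    """Apply one frequency mask and one time mask to a (channel, n_mels, time) spectrogram."""
    return time_mask(freq_mask(mel_spec))

# Each original clip from the single speaker yields one additional augmented example.
mel = torch.randn(1, 40, 100)      # placeholder mel spectrogram
augmented = spec_augment(mel)
```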
Second, we used WavAugment, introduced by \citet{wav}, which augments the raw waveform directly. In this case we can therefore work with the raw waveforms without extracting the mel-frequency cepstral coefficients from them. The amount of training data and the training process are the same as for SpecAugment. The results are summarized in Table \ref{tab:accents}.
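WavAugment applies sox-style effect chains to the raw waveform; the sketch below reproduces that idea with torchaudio's sox effects, where the specific effect chain (pitch and tempo) is an illustrative assumption rather than the exact chain we used:

```python
import torch
import torchaudio

def augment_waveform(waveform: torch.Tensor, sample_rate: int) -> torch.Tensor:
    """Apply pitch shift and tempo change directly to the raw waveform."""
    effects = [
        ["pitch", "100"],            # shift pitch up by 100 cents
        ["tempo", "0.9"],            # slow the clip down slightly
        ["rate", str(sample_rate)],  # resample back to the original rate after the pitch effect
    ]
    augmented, _ = torchaudio.sox_effects.apply_effects_tensor(waveform, sample_rate, effects)
    return augmented

waveform = torch.randn(1, 8000)      # placeholder 1-second clip at 8 kHz
augmented = augment_waveform(waveform, sample_rate=8000)
```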
In our case, SpecAugment performed better than WavAugment, so contrastive learning was implemented on top of this augmentation technique. In the supervised setting, the contrastive loss is added to the cross-entropy loss and the combined objective is backpropagated. Although \citet{wav} showed that contrastive learning improves results with WavAugment, we did not observe the same for SpecAugment: the model trained with plain cross-entropy loss performed 19.74% better (relative) than the one that also used the contrastive loss.
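A hedged sketch of how a supervised contrastive term can be combined with cross-entropy, using a simplified SupCon-style loss on the pooled encoder embeddings; the temperature and weighting are illustrative assumptions:

```python
import torch
import torch.nn.functional as F

def supervised_contrastive_loss(embeddings: torch.Tensor, labels: torch.Tensor,
                                temperature: float = 0.1) -> torch.Tensor:
    """Pull together embeddings of clips with the same digit label, push apart the rest."""
    z = F.normalize(embeddings, dim=1)
    sim = z @ z.T / temperature                               # (batch, batch) similarity matrix
    pos_mask = labels.unsqueeze(0) == labels.unsqueeze(1)     # positives share a label
    self_mask = torch.eye(len(labels), dtype=torch.bool, device=z.device)
    pos_mask = pos_mask & ~self_mask
    # Log-softmax over all other examples, averaged over each anchor's positives.
    logits = sim.masked_fill(self_mask, float("-inf"))
    log_prob = F.log_softmax(logits, dim=1)
    pos_counts = pos_mask.sum(dim=1).clamp(min=1)
    loss = -(log_prob * pos_mask.float()).sum(dim=1) / pos_counts
    return loss.mean()

def combined_loss(logits, embeddings, labels, contrastive_weight: float = 0.5):
    """Cross-entropy on the logits plus a weighted contrastive term on the embeddings."""
    ce = F.cross_entropy(logits, labels)
    return ce + contrastive_weight * supervised_contrastive_loss(embeddings, labels)

# Placeholder batch demonstrating the combined objective.
logits = torch.randn(8, 10, requires_grad=True)
embeddings = torch.randn(8, 64, requires_grad=True)
labels = torch.randint(0, 10, (8,))
print(combined_loss(logits, embeddings, labels))
```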
Test accuracy for the different models:

| Whole Dataset | Accuracy |
|---|---:|
| SVM | 36% |
| RNN | 62.62% |
| Conformer | 80.52% |
| Single-speaker Dataset | |
| Conformer | 27.63% |
| SpecAugment + Conformer | 42.35% |
| WavAugment + Conformer | 32.41% |
| SpecAugment + Conformer + CL | 33.99% |
