This project provides a comprehensive exploration of various neural network approaches for audio classification using TensorFlow and Keras. The implementation covers the complete machine learning pipeline from data loading and preprocessing to model deployment and inference.
The notebook demonstrates multiple state-of-the-art techniques for audio classification, comparing different architectural approaches and their performance on speech command recognition tasks.
- Audio Loading: Reading WAV files from compressed archives (.gz, .tar formats)
- Signal Processing: Resampling audio to consistent 16kHz sample rate using SciPy
- Normalization: Padding/trimming audio to fixed length (16,000 samples)
- Efficient Pipelines: Using `tf.data.Dataset` for optimized data loading with shuffling, batching, and prefetching
- Label Encoding: Converting string labels to integers using scikit-learn's LabelEncoder
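A minimal sketch of this pipeline, assuming in-memory waveforms and string labels (the variable names below are illustrative, not necessarily the notebook's):

```python
import numpy as np
import tensorflow as tf
from sklearn.preprocessing import LabelEncoder

# Placeholder data: fixed-length waveforms and their string labels
waveforms = np.zeros((2, 16000), dtype=np.float32)
string_labels = ["yes", "no"]

# Label encoding: convert string labels to integers with scikit-learn
label_encoder = LabelEncoder()
int_labels = label_encoder.fit_transform(string_labels)

# Efficient pipeline: shuffle, batch, and prefetch with tf.data
train_dataset = (tf.data.Dataset.from_tensor_slices((waveforms, int_labels))
                 .shuffle(buffer_size=1000)
                 .batch(32)
                 .prefetch(tf.data.AUTOTUNE))
```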
- Input: Raw audio waveforms (16,000 samples × 1 channel)
- Architecture:
- Conv1D (16 filters, kernel_size=3, ReLU activation)
- MaxPooling1D (pool_size=2)
- Flatten layer
- Dense (64 units, ReLU)
- Output (36 units, Softmax)
- Parameters: ~4.1 million trainable parameters
- Use Case: Direct learning from raw audio signals
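A minimal Keras sketch of this model as listed above; the notebook's exact layer settings (and therefore its ~4.1M parameter figure) may differ slightly:

```python
import tensorflow as tf
from tensorflow.keras import layers

# Time-domain 1D CNN over raw waveforms (16,000 samples × 1 channel)
model_time_domain = tf.keras.Sequential([
    layers.Input(shape=(16000, 1)),
    layers.Conv1D(16, kernel_size=3, activation='relu'),
    layers.MaxPooling1D(pool_size=2),
    layers.Flatten(),
    layers.Dense(64, activation='relu'),
    layers.Dense(36, activation='softmax'),  # one unit per audio class
])
```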
- Input: Spectrograms generated via Short-Time Fourier Transform (STFT)
- Processing: Converting time-domain signals to frequency-domain representations
- Architecture: 2D convolutional layers adapted for spectrogram input
- Enhancements: Custom normalization layers and attention mechanisms
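A short sketch of the frequency-domain preprocessing, reusing the STFT parameters from the usage example further below; `waveform_to_spectrogram` is an illustrative helper name:

```python
import tensorflow as tf

def waveform_to_spectrogram(waveform):
    # Short-Time Fourier Transform: time-domain signal -> time/frequency representation
    stft = tf.signal.stft(waveform, frame_length=255, frame_step=128)
    # Magnitude spectrogram with a channel axis, so 2D conv layers can treat it like an image
    return tf.abs(stft)[..., tf.newaxis]

# A 1-second, 16 kHz clip becomes a (124, 129, 1) "image" for the 2D CNN
print(waveform_to_spectrogram(tf.zeros(16000)).shape)
```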
- Custom Implementation: ChannelAttention layer for Keras
- Integration: Enhanced 2D CNN architecture with attention gates
- Benefits: Improved feature focus and model interpretability
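The notebook's exact ChannelAttention code is not reproduced here; the following is a minimal squeeze-and-excitation-style sketch of what such a custom Keras layer can look like:

```python
import tensorflow as tf
from tensorflow.keras import layers

class ChannelAttention(layers.Layer):
    """Illustrative channel-attention layer (squeeze-and-excitation style)."""
    def __init__(self, reduction=8, **kwargs):
        super().__init__(**kwargs)
        self.reduction = reduction

    def build(self, input_shape):
        channels = int(input_shape[-1])
        # Bottleneck MLP that learns a per-channel weighting from global context
        self.squeeze = layers.GlobalAveragePooling2D()
        self.dense1 = layers.Dense(max(channels // self.reduction, 1), activation='relu')
        self.dense2 = layers.Dense(channels, activation='sigmoid')

    def call(self, inputs):
        # Per-channel gates in [0, 1], broadcast back over the spatial dimensions
        weights = self.dense2(self.dense1(self.squeeze(inputs)))
        return inputs * weights[:, tf.newaxis, tf.newaxis, :]
```

Dropped between convolutional blocks (e.g. `x = ChannelAttention()(x)`), such a layer rescales feature maps so the model can emphasize the most informative channels.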
- Base Model: Pre-trained YAMNet audio event classification model from TensorFlow Hub
- Feature Extraction: Using YAMNet embeddings (1,024 dimensions)
- Custom Head: Training new classification layers on top of frozen embeddings
- Dataset Adaptation: Applied to ESC-50 environmental sound classification dataset
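A hedged sketch of that setup: the frozen YAMNet model (loaded as shown in the usage examples) yields one 1,024-dimensional embedding per audio frame, and only a small classification head is trained on top. The `extract_embedding` and `classifier_head` names are illustrative:

```python
import tensorflow as tf
import tensorflow_hub as hub

yamnet_model = hub.load('https://tfhub.dev/google/yamnet/1')

def extract_embedding(waveform):
    # YAMNet returns (scores, embeddings, log_mel_spectrogram); keep the 1,024-d embeddings
    _, embeddings, _ = yamnet_model(waveform)
    # Average over time frames to get one clip-level feature vector
    return tf.reduce_mean(embeddings, axis=0)

# Small trainable head on top of the frozen embeddings (50 units for ESC-50)
classifier_head = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(1024,)),
    tf.keras.layers.Dense(512, activation='relu'),
    tf.keras.layers.Dense(50, activation='softmax'),
])
```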
- 36 Audio Classes:
- Basic commands: "yes", "no", "stop", "go", "up", "down", "left", "right"
- Numbers: "zero" through "nine"
- Animals: "bird", "dog", "cat"
- Household: "bed", "house", "tree"
- Miscellaneous: "happy", "wow", "follow", "learn", "visual", etc.
- Background noise category
- Environmental Sound Classification dataset
- 50 classes of environmental recordings
- Used for YAMNet transfer learning experiments
```python
import tensorflow as tf

def load_and_process_audio(filename, max_length=16000):
    # Read and decode the WAV file (mono, float32 samples in [-1, 1])
    audio, sample_rate = tf.audio.decode_wav(tf.io.read_file(filename), desired_channels=1)
    audio = tf.squeeze(audio, axis=-1)
    # Resample to 16 kHz with scipy.signal.resample when the file's rate differs (omitted here)
    # Pad/trim to the fixed length and return the normalized tensor
    audio = audio[:max_length]
    return tf.pad(audio, [[0, max_length - tf.shape(audio)[0]]])
```

- Optimizer: Adam
- Loss Function: Sparse Categorical Crossentropy
- Metrics: Accuracy
- Batch Size: 32
- Validation Split: 20% stratified split
- Epochs: 10+, with early stopping as an option
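In code, this configuration amounts to a compile call along these lines (sparse integer labels, so no one-hot encoding is required):

```python
model_time_domain.compile(
    optimizer='adam',
    loss='sparse_categorical_crossentropy',
    metrics=['accuracy'],
)
```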
- Training/validation accuracy and loss tracking
- Visualization of learning curves
- Confusion matrix analysis
- Performance comparison across architectures
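A sketch of this monitoring, using the `history_time_domain` object from the training example below and assuming `val_dataset` yields `(audio, label)` batches:

```python
import matplotlib.pyplot as plt
import numpy as np
import tensorflow as tf

# Learning curves from the Keras History object
plt.plot(history_time_domain.history['accuracy'], label='train accuracy')
plt.plot(history_time_domain.history['val_accuracy'], label='val accuracy')
plt.xlabel('epoch')
plt.legend()
plt.show()

# Confusion matrix over the validation set
y_true = np.concatenate([labels.numpy() for _, labels in val_dataset])
y_pred = np.argmax(model_time_domain.predict(val_dataset), axis=1)
confusion = tf.math.confusion_matrix(y_true, y_pred)
```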
```python
# Time-domain model training (train_dataset is already batched, so batch_size is not passed)
history_time_domain = model_time_domain.fit(
    train_dataset,
    epochs=10,
    validation_data=val_dataset
)
```

```python
# Convert audio to spectrograms
spectrogram = tf.signal.stft(audio, frame_length=255, frame_step=128)
spectrogram = tf.abs(spectrogram)  # magnitude spectrogram for the 2D CNN
```

```python
# Load pre-trained YAMNet
import tensorflow_hub as hub
yamnet_model = hub.load('https://tfhub.dev/google/yamnet/1')
# Extract embeddings and train custom classifier
```

The project includes comprehensive evaluation of:
- Training Accuracy: Model performance on training data
- Validation Accuracy: Generalization capability
- Loss Curves: Training stability and convergence
- Inference Speed: Real-time classification potential
- Model Size: Parameter efficiency
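For the last two points, a quick way to check parameter count and rough single-clip latency (numbers vary with hardware, so treat these as sanity checks rather than benchmarks):

```python
import time
import tensorflow as tf

# Model size: total parameter count
print("Parameters:", model_time_domain.count_params())

# Inference speed: time one forward pass on a single clip
sample = tf.zeros((1, 16000, 1))
start = time.perf_counter()
model_time_domain(sample, training=False)
print(f"Latency: {(time.perf_counter() - start) * 1000:.1f} ms")
```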
```python
def predict_audio_class(model, audio_path):
    # Preprocess audio and add a batch dimension (reshape further if the model expects a channel axis)
    audio = tf.expand_dims(load_and_process_audio(audio_path), axis=0)
    # Run model inference
    probabilities = model.predict(audio)[0]
    # Return class probabilities and the predicted label index
    return probabilities, tf.argmax(probabilities).numpy()
```

```python
model_spectrogram.save('audio_classification_model.h5')
```

```
tensorflow>=2.8.0
numpy>=1.21.0
scipy>=1.7.0
scikit-learn>=1.0.0
matplotlib>=3.5.0
librosa>=0.9.0
tensorflow-hub  # For YAMNet transfer learning
ipython         # For notebook visualization
```
```
audio_classification/
├── Audio_Classification.ipynb    # Main notebook
├── data_audio/
│   └── dataset_commands.gz       # Compressed dataset
├── models/
│   ├── time_domain_model.h5      # Saved 1D CNN model
│   └── spectrogram_model.h5      # Saved 2D CNN model
└── utils/
    └── audio_processing.py       # Helper functions
```
audio-classification neural-networks tensorflow keras cnn spectrogram-analysis transfer-learning yamnet speech-recognition machine-learning deep-learning audio-processing python 1d-cnn 2d-cnn attention-mechanism signal-processing audio-ml environmental-sound-classification speech-commands esc-50 data-augmentation tf-data-pipeline
This project demonstrates practical implementations of:
- Multi-modal neural network architectures for audio
- Comparative analysis of time-domain vs frequency-domain approaches
- Effective transfer learning strategies for audio tasks
- Attention mechanisms for improved feature learning
- Production-ready data preprocessing pipelines
- Real-time audio classification
- Mobile deployment with TensorFlow Lite
- Multi-label audio classification
- Audio generation and style transfer
- Cross-modal learning (audio + text)
- TensorFlow Audio Recognition Tutorials
- YAMNet: Pre-trained audio event classifier
- Speech Commands Dataset (Google)
- ESC-50 Dataset for environmental sound classification
- Attention mechanisms in audio processing literature
This project serves as both an educational resource and a practical foundation for building production audio classification systems.