Intent classifier in PyTorch for short conversational utterances (e.g., booking, cancellations, product or account inquiries, support requests, and greetings), with training, evaluation, and inference for multi-class intent detection.


Intent Classifier PyTorch

A PyTorch intent classifier for routing LLM queries, built on DistilBERT and the CLINC150 dataset, with a clean, modular architecture following ML engineering best practices.

Python 3.9+ PyTorch Code style: black

🎯 Overview

Intent classification helps identify what a user wants to do from their text input. This is critical for routing LLM queries, reducing latency, and improving response accuracy. This implementation achieves 95%+ accuracy on CLINC150's 150 intent classes.

Key Features

Two-Stage Architecture: a lightweight DistilBERT classifier with confidence-based routing handles queries efficiently.

CLINC150 Dataset: trained on 23,700 examples spanning 150 intents across 10 domains, including banking, travel, and utilities.

Comprehensive Evaluation: macro and weighted F1 scores, confusion matrices, confidence analysis, and detailed per-class metrics.

Modular Design: clean separation between configuration management, dataset loading, training, and inference makes the code easy to understand and extend.

Type-Safe: full type hints throughout the codebase for better IDE support and early error detection.

📊 Performance Metrics

Here's what we achieved on the CLINC150 test set after 5 epochs with DistilBERT:

| Metric | Score |
| --- | --- |
| Accuracy | 95.4% |
| Macro F1 | 0.954 |
| Weighted F1 | 0.954 |
| Macro Precision | 0.956 |
| Macro Recall | 0.954 |

Training Time:

  • GPU (CUDA): around 10 to 15 minutes
  • CPU only: around 45 to 60 minutes

Confidence Analysis:

  • Average confidence: 0.952
  • High confidence predictions (>0.7): 95% of all predictions
  • Accuracy on high confidence predictions: 97.8%
  • Accuracy on low confidence predictions: 50.2%
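That threshold split can be reproduced with a small helper. The function and sample data below are illustrative, not taken from src/evaluate.py:

```python
THRESHOLD = 0.7  # same cutoff used for routing

def bucket_accuracy(predictions):
    """Split (confidence, is_correct) pairs at THRESHOLD; return accuracy per bucket."""
    high = [correct for conf, correct in predictions if conf > THRESHOLD]
    low = [correct for conf, correct in predictions if conf <= THRESHOLD]
    acc = lambda bucket: sum(bucket) / len(bucket) if bucket else 0.0
    return acc(high), acc(low)

# hypothetical evaluation results
preds = [(0.95, True), (0.91, True), (0.88, False), (0.55, True), (0.40, False)]
high_acc, low_acc = bucket_accuracy(preds)
```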

Training Progression

| Epoch | Train Loss | Val Loss | Val Accuracy |
| --- | --- | --- | --- |
| 1 | 2.545 | 0.780 | 89.7% |
| 2 | 0.475 | 0.308 | 94.3% |
| 3 | 0.165 | 0.213 | 95.8% |
| 4 | 0.082 | 0.194 | 95.9% |
| 5 | 0.055 | 0.190 | 96.2% |

πŸ“ Project Structure

intent-classifier-pytorch/
│
├── configs/
│   └── config.yaml              # Training and model configuration
│
├── src/
│   ├── __init__.py
│   ├── config.py                # Configuration dataclasses with YAML loading
│   ├── dataset.py               # CLINC150 loader and PyTorch Dataset wrapper
│   ├── model.py                 # DistilBERT classifier with verified architecture
│   ├── trainer.py               # Training loop with early stopping
│   ├── evaluate.py              # Metrics, confusion matrix, and confidence analysis
│   ├── inference.py             # Single and batch prediction with confidence scoring
│   └── utils.py                 # Device detection, checkpointing, and seed setting
│
├── examples/
│   └── quick_start.py           # Programmatic inference examples
│
├── tests/
│   └── test_model.py            # Unit tests for model architecture
│
├── train.py                     # Main training script
├── predict.py                   # CLI inference for interactive, single, or batch mode
├── requirements.txt             # Project dependencies
├── pyproject.toml               # Package metadata compatible with uv
└── README.md                    # This file

Generated during training:
├── models/                      # Model checkpoints
│   ├── best_model.pt            # Best model based on lowest validation loss
│   └── checkpoint_epoch_*.pt    # Checkpoints saved after each epoch
├── results/                     # Evaluation outputs
│   ├── classification_report.txt
│   ├── confusion_matrix.png
│   └── training_metadata.json   # Training history and label mappings
├── logs/                        # Training logs if configured
└── cache/                       # HuggingFace dataset cache

🚀 Quick Start

Prerequisites

You'll need Python 3.9 or higher and the uv package manager.

Installation

1. Install uv if you haven't already

Windows (PowerShell):

powershell -c "irm https://astral.sh/uv/install.ps1 | iex"

macOS/Linux:

curl -LsSf https://astral.sh/uv/install.sh | sh

2. Clone the repository

git clone https://github.com/BryanTheLai/intent-classifier-pytorch.git
cd intent-classifier-pytorch

3. Create and activate virtual environment with uv

uv venv

Activate the environment:

Windows (PowerShell):

.venv\Scripts\activate

macOS/Linux:

source .venv/bin/activate

4. Install dependencies

uv pip install -e .

For development dependencies:

uv pip install -e ".[dev]"

GPU Setup (Windows + uv)

If training shows "Using CPU device" but you have an NVIDIA GPU, you probably installed a CPU-only PyTorch wheel. Here's how to fix that by installing the CUDA build via uv.

Step 1: Check your current setup

.venv\Scripts\python -c "import torch; print('torch', torch.__version__, 'cuda?', torch.cuda.is_available(), 'built for', torch.version.cuda)"

Step 2: Install CUDA-enabled PyTorch

For CUDA 12.x, try cu124 first. If that doesn't work, fall back to cu121.

uv pip uninstall torch torchvision torchaudio

# Install CUDA wheels (cu124)
uv pip install --upgrade --index-url https://download.pytorch.org/whl/cu124 torch torchvision torchaudio

# If cu124 isn't available for your version, use cu121 instead:
# uv pip install --upgrade --index-url https://download.pytorch.org/whl/cu121 torch torchvision torchaudio

Step 3: Verify it worked

.venv\Scripts\python -c "import torch; print('cuda available?', torch.cuda.is_available(), 'runtime cuda', torch.version.cuda)"

Step 4: Try training again

python train.py

Running Inference

Interactive mode (recommended for testing)

python predict.py --interactive

Example session:

💬 Interactive mode (type 'quit' to exit)
===========================================================

Enter text: I want to check my account balance

✓ Predicted Intent: balance
    Confidence: 0.9543

    Top 5 predictions:
    1. balance: 0.9543
    2. transactions: 0.0234
    3. freeze_account: 0.0098
    4. pin_change: 0.0067
    5. routing: 0.0034

Single prediction

python predict.py --text "Transfer money to John"

Programmatic usage

from pathlib import Path
import torch
from src.inference import IntentPredictor
from src.utils import get_device
import json

device = get_device()

with open("results/training_metadata.json", "r") as f:
    metadata = json.load(f)
id2label = {int(k): v for k, v in metadata["label_mappings"]["id2label"].items()}

predictor = IntentPredictor.from_pretrained(
    model_path=Path("models/best_model.pt"),
    tokenizer_name="distilbert-base-uncased",
    id2label=id2label,
    device=device,
    confidence_threshold=0.7,
)

text = "Book a flight to Paris"
intent, confidence, is_high_conf = predictor.predict_single(text)
print(f"Intent: {intent}, Confidence: {confidence:.4f}")

batch_texts = ["Transfer $50", "What's the weather?", "Set an alarm"]
results = predictor.predict_batch(batch_texts)
for text, (intent, conf, _) in zip(batch_texts, results):
    print(f"{text} -> {intent} ({conf:.4f})")

Training the Model

Review configuration (optional)

You can edit configs/config.yaml to adjust hyperparameters:

model:
  name: "distilbert-base-uncased"  # You can use other BERT variants too
  dropout: 0.3                     # Dropout rate
  max_length: 128                  # Maximum sequence length

training:
  num_epochs: 5                    # Number of training epochs
  learning_rate: 2.0e-5           # Learning rate
  warmup_steps: 0                  # Learning rate warmup steps
  weight_decay: 0.01              # Weight decay for L2 regularization
  max_grad_norm: 1.0              # Gradient clipping threshold
  seed: 42                         # Random seed for reproducibility
  early_stopping_patience: 5       # Stop if validation loss doesn't improve
  early_stopping_min_delta: 0.001  # Minimum improvement to reset patience

data:
  dataset_name: "DeepPavlov/clinc150"
  batch_size: 8                    # Safe for 4 GB GPUs, increase if you have more memory
  num_workers: 2                   # On Windows, keeping this at 0 to 2 is more stable

Run training

python train.py --config configs/config.yaml

What happens during training:

The CLINC150 dataset downloads automatically from HuggingFace on your first run. Then the DistilBERT model loads its pre-trained weights and starts training for 5 epochs with validation after each one. The system saves checkpoints after every epoch and keeps track of the best model based on validation loss in models/best_model.pt. After training finishes, it runs a full evaluation on the test set with metrics and visualizations.

Expected output:

============================================================
INTENT CLASSIFICATION TRAINING
============================================================

📦 Loading dataset...
✓ Dataset loaded: 150 intents
  Train samples: 15000
  Val samples: 3000
  Test samples: 5700

🏗️ Building model...
✓ Model created
  Total parameters: 66,955,350
  Trainable parameters: 66,955,350

🚀 Starting training...

Epoch 1/5
============================================================
Training: 100%|████████████| 938/938 [05:23<00:00, 2.90it/s]
Validation: 100%|████████| 188/188 [00:31<00:00, 5.98it/s]

Epoch 1 Summary:
  Train Loss: 2.1234
  Val Loss: 1.2345
  Val Accuracy: 0.7856
  Val Macro F1: 0.7654
  ✓ New best model saved!
...

Training outputs:

  • models/best_model.pt contains the best model weights
  • models/checkpoint_epoch_*.pt contains checkpoints from each epoch
  • results/classification_report.txt has detailed per-class metrics
  • results/confusion_matrix.png shows the confusion matrix visualization
  • results/training_metadata.json stores training history and label mappings

🔧 Configuration

Model Configuration

model:
  name: "distilbert-base-uncased"  # You can swap this for other BERT variants
  dropout: 0.3                     # Dropout rate
  max_length: 128                  # Maximum sequence length

Training Configuration

training:
  num_epochs: 5                    # Number of training epochs
  learning_rate: 2.0e-5           # Learning rate
  warmup_steps: 0                  # Learning rate warmup steps
  weight_decay: 0.01              # Weight decay for L2 regularization
  max_grad_norm: 1.0              # Gradient clipping threshold
  seed: 42                         # Random seed
  early_stopping_patience: 5       # Stop training if validation loss doesn't improve
  early_stopping_min_delta: 0.001  # Minimum improvement needed to reset patience

Data Configuration

data:
  dataset_name: "DeepPavlov/clinc150"
  batch_size: 8                    # Works well on 4 GB GPUs, increase if you have more memory
  num_workers: 2                   # Keeping this between 0 and 2 works best on Windows

🧪 Testing

Run unit tests:

pytest tests/

Or run tests directly:

python tests/test_model.py

📈 Advanced Usage

Custom Dataset

To use your own dataset, you'll need to modify src/dataset.py:

import json
from pathlib import Path

class CustomDataLoader:
    """Minimal sketch of a custom loader; assumes one JSON file per split
    holding a list of {"text": ..., "intent": ...} records (example format)."""

    def __init__(self, data_path: str):
        self.data_path = Path(data_path)
        self.label2id = {}
        self.id2label = {}

    def load_dataset(self):
        # Build label2id and id2label mappings from the training split
        records = json.loads((self.data_path / "train.json").read_text())
        intents = sorted({r["intent"] for r in records})
        self.label2id = {intent: i for i, intent in enumerate(intents)}
        self.id2label = {i: intent for intent, i in self.label2id.items()}

    def get_split(self, split: str):
        # Return texts and integer labels for the specified split
        records = json.loads((self.data_path / f"{split}.json").read_text())
        texts = [r["text"] for r in records]
        labels = [self.label2id[r["intent"]] for r in records]
        return texts, labels

Fine-tuning on Custom Intents

First, prepare your data in CLINC150 format with text and intent pairs. Then update src/dataset.py to load your data and adjust num_intents in the configuration. After that, you can run training normally.
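As a sketch, the label mappings for a custom dataset might be built like this; the example pairs and variable names are hypothetical:

```python
# hypothetical (text, intent) pairs in place of a real data file
examples = [
    ("I want to reset my password", "password_reset"),
    ("Cancel my subscription", "cancel_subscription"),
    ("Reset my password please", "password_reset"),
]

# deterministic mapping: sort intent names, then assign sequential ids
intents = sorted({intent for _, intent in examples})
label2id = {intent: i for i, intent in enumerate(intents)}
id2label = {i: intent for intent, i in label2id.items()}
num_intents = len(intents)  # set num_intents in the config to this value
```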

Freezing BERT Layers

If you want faster training or have limited data, you can freeze the BERT layers:

from src.model import IntentClassifier

model = IntentClassifier(num_intents=150)
model.freeze_bert_encoder()  # This freezes BERT and only trains the classifier head

Two-Stage Training

# Stage 1: Train just the classifier head
model.freeze_bert_encoder()
trainer.train(evaluator)

# Stage 2: Fine-tune the entire model
model.unfreeze_bert_encoder()
trainer.learning_rate = 1e-5  # Use a lower learning rate for fine-tuning
trainer.train(evaluator)

πŸ—οΈ Deployment

Confidence-Based Routing

predictor = IntentPredictor.from_pretrained(...)

intent, confidence, is_high_conf = predictor.predict_single(user_query)

if is_high_conf:
    # Route to the appropriate handler
    handle_intent(intent, user_query)
else:
    # Fall back to LLM for ambiguous queries
    llm_response = llm.generate(user_query)

Model Optimization

For deployment, here are some options:

Quantization to reduce model size:

import torch
import torch.quantization as quantization

quantized_model = quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8  # quantize only the linear layers
)

ONNX Export for cross-platform deployment:

import torch.onnx

# Dummy inputs matching the tokenizer's output shape (batch of 1, 128 tokens)
dummy_input_ids = torch.randint(0, 30522, (1, 128))
dummy_attention_mask = torch.ones(1, 128, dtype=torch.long)
torch.onnx.export(model, (dummy_input_ids, dummy_attention_mask), "model.onnx")

TorchScript for faster inference:

# Scripting HuggingFace models often fails; tracing with example inputs is more reliable
example_inputs = (torch.randint(0, 30522, (1, 128)), torch.ones(1, 128, dtype=torch.long))
traced_model = torch.jit.trace(model, example_inputs)
traced_model.save("model_scripted.pt")

🔬 Research Extensions

Synthetic Data Generation

You can boost performance by generating training data with an LLM:

from openai import OpenAI

client = OpenAI()

def generate_intent_examples(intent_name: str, num_examples: int = 50):
    prompt = f"Generate {num_examples} diverse user queries for the intent '{intent_name}'"
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}]
    )
    return response.choices[0].message.content

Knowledge Distillation

You can improve the student model by using a teacher LLM:

import torch.nn.functional as F

teacher_model = LargeLanguageModel()  # placeholder teacher; any stronger model works
student_model = IntentClassifier(num_intents=150)
temperature = 2.0

for batch in dataloader:
    teacher_logits = teacher_model(batch)
    student_logits = student_model(batch)

    # KL divergence between temperature-softened distributions
    # (F.kl_div expects log-probabilities for the student input)
    distillation_loss = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * temperature ** 2

📚 Dataset Information

CLINC150

The CLINC150 dataset contains 23,700 examples split into 15,000 for training, 3,000 for validation, and 5,700 for testing. It covers 150 different intent classes across 10 domains and includes out-of-scope detection. You can find it on HuggingFace.

Domain Distribution:

Each of the 10 domains contains 15 intents: banking (account management, transfers, and card operations), credit cards (payments, rewards, and PIN changes), kitchen and dining (food queries and complaints), home (smart home control and automation), auto and commute (insurance, rental, and repair), travel (booking, flight status, and baggage), utility (bills and service activation or cancellation), work (PTO requests, meetings, and contracts), small talk (greetings and chitchat), and meta (help and general questions about the assistant). Out-of-scope queries form an additional oos class.

Example Intents:

  • balance for checking account balance
  • transfer for transferring money
  • book_flight for booking airline tickets
  • weather for weather queries
  • oos for out-of-scope queries that don't fit the 150 intents

Alternative Datasets

Banking77: Fine-grained banking intents

from datasets import load_dataset

dataset = load_dataset("PolyAI/banking77")

Custom Generation: See Research Extensions section

πŸ› οΈ Troubleshooting

CUDA Out of Memory

Try reducing the batch size in configs/config.yaml:

data:
  batch_size: 8  # Halve again if you still run out of memory

Slow Training

If data loading is the bottleneck, experiment with num_workers (higher values help on Linux; keep it low on Windows). You can also use mixed precision training on CUDA GPUs to speed things up:

from torch.cuda.amp import autocast, GradScaler

scaler = GradScaler()
for inputs, labels in dataloader:
    optimizer.zero_grad()
    with autocast():
        outputs = model(inputs)
        loss = criterion(outputs, labels)
    scaler.scale(loss).backward()
    scaler.step(optimizer)
    scaler.update()

Poor Accuracy

Try increasing num_epochs to 8 or 10. You can also experiment with the learning_rate by trying 3e-5 or 1e-5. Reducing dropout to 0.1 or 0.2 might help. Another option is trying different BERT variants like bert-base-uncased or roberta-base.

🔬 Technical Details

Architecture

The model architecture follows best practices verified against PyTorch and HuggingFace Transformers documentation.

Model Components:

The base model is DistilBERT (distilbert-base-uncased) with about 66 million parameters. For pooling, we use the [CLS] token representation, last_hidden_state[:, 0]. The classifier head applies dropout at 0.3, then a linear layer mapping the 768-dimensional representation to 150 output classes. The loss function is CrossEntropyLoss, which combines LogSoftmax and NLLLoss.

Implementation Verification:

Everything has been verified to follow best practices. We correctly use DistilBertModel.last_hidden_state[:, 0] for classification. Tokenization properly uses encode_plus with padding="max_length" and truncation=True. The code correctly handles map_location for CPU and GPU compatibility. Gradient clipping uses torch.nn.utils.clip_grad_norm_ after the backward pass. Inference mode properly combines model.eval() with torch.no_grad().
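A standalone sketch of the loading and inference-mode practices listed above, using a plain linear layer as a stand-in for the real classifier:

```python
import tempfile
from pathlib import Path
import torch

model = torch.nn.Linear(768, 150)  # stand-in for the real classifier

ckpt_path = Path(tempfile.mkdtemp()) / "best_model.pt"
torch.save({"model_state_dict": model.state_dict()}, ckpt_path)

# map_location keeps GPU-saved checkpoints loadable on CPU-only machines
checkpoint = torch.load(ckpt_path, map_location=torch.device("cpu"))
model.load_state_dict(checkpoint["model_state_dict"])  # key name is illustrative

# eval() disables dropout; no_grad() skips autograd bookkeeping during inference
model.eval()
with torch.no_grad():
    logits = model(torch.randn(1, 768))
```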

Training Details

The optimizer is AdamW with a learning rate of 2e-5 and weight decay of 0.01. We use a linear warmup scheduler with decay. Gradient clipping has a max_norm of 1.0. Early stopping kicks in with a patience of 5 epochs and a minimum delta of 0.001. For reproducibility, we set the seed to 42 and use deterministic mode.
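A sketch of that wiring (AdamW, linear warmup with decay, gradient clipping); the stand-in linear model is illustrative, and the step count comes from the logs above (5 epochs of 938 batches):

```python
import torch
from torch.optim import AdamW
from transformers import get_linear_schedule_with_warmup

model = torch.nn.Linear(768, 150)  # stand-in for the real classifier
total_steps = 5 * 938              # epochs * train batches per epoch

optimizer = AdamW(model.parameters(), lr=2e-5, weight_decay=0.01)
scheduler = get_linear_schedule_with_warmup(
    optimizer, num_warmup_steps=0, num_training_steps=total_steps
)

# one illustrative optimization step
logits = model(torch.randn(8, 768))
loss = torch.nn.functional.cross_entropy(logits, torch.randint(0, 150, (8,)))
loss.backward()
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)  # clip before stepping
optimizer.step()
scheduler.step()
optimizer.zero_grad()
```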

Data Pipeline

We use DistilBertTokenizer with a maximum length of 128 tokens. The reported run used a batch size of 16 (the sample config ships with 8 to fit 4 GB GPUs); adjust it based on your GPU memory. We don't use any data augmentation, just the standard CLINC150 splits. Preprocessing handles automatic label mapping and validation.

📖 References

Papers

Documentation

📄 License

MIT License. See the LICENSE file for details.

πŸ™ Acknowledgments

Thanks to the creators of the CLINC150 dataset (Larson et al.), the HuggingFace Transformers library, and the PyTorch team.
