A PyTorch Intent Classifier for LLMs using DistilBERT and the CLINC150 dataset. Built with clean, modular architecture following ML engineering best practices.
Intent classification helps identify what a user wants to do from their text input. This is critical for routing LLM queries, reducing latency, and improving response accuracy. This implementation achieves 95%+ accuracy on CLINC150's 150 intent classes.
- Two-Stage Architecture: Uses a two-stage approach with a lightweight BERT classifier and confidence-based routing to handle queries efficiently.
- CLINC150 Dataset: Trained on 23,700 examples spanning 150 intents across 10 domains, including banking, travel, and utilities.
- Comprehensive Evaluation: Includes macro and weighted F1 scores, confusion matrices, confidence analysis, and detailed per-class metrics.
- Modular Design: Clean separation between configuration management, dataset loading, training, and inference makes the code easy to understand and extend.
- Type-Safe: Full type hints throughout the codebase provide better IDE support and catch errors early.
Here's what we achieved on the CLINC150 test set after 5 epochs with DistilBERT:
| Metric | Score |
|---|---|
| Accuracy | 95.4% |
| Macro F1 | 0.954 |
| Weighted F1 | 0.954 |
| Macro Precision | 0.956 |
| Macro Recall | 0.954 |
Training Time:
- GPU (CUDA): around 10 to 15 minutes
- CPU only: around 45 to 60 minutes
Confidence Analysis:
- Average confidence: 0.952
- High confidence predictions (>0.7): 95% of all predictions
- Accuracy on high confidence predictions: 97.8%
- Accuracy on low confidence predictions: 50.2%
| Epoch | Train Loss | Val Loss | Val Accuracy |
|---|---|---|---|
| 1 | 2.545 | 0.780 | 89.7% |
| 2 | 0.475 | 0.308 | 94.3% |
| 3 | 0.165 | 0.213 | 95.8% |
| 4 | 0.082 | 0.194 | 95.9% |
| 5 | 0.055 | 0.190 | 96.2% |
```
intent-classifier-pytorch/
│
├── configs/
│   └── config.yaml            # Training and model configuration
│
├── src/
│   ├── __init__.py
│   ├── config.py              # Configuration dataclasses with YAML loading
│   ├── dataset.py             # CLINC150 loader and PyTorch Dataset wrapper
│   ├── model.py               # DistilBERT classifier with verified architecture
│   ├── trainer.py             # Training loop with early stopping
│   ├── evaluate.py            # Metrics, confusion matrix, and confidence analysis
│   ├── inference.py           # Single and batch prediction with confidence scoring
│   └── utils.py               # Device detection, checkpointing, and seed setting
│
├── examples/
│   └── quick_start.py         # Programmatic inference examples
│
├── tests/
│   └── test_model.py          # Unit tests for model architecture
│
├── train.py                   # Main training script
├── predict.py                 # CLI inference for interactive, single, or batch mode
├── requirements.txt           # Project dependencies
├── pyproject.toml             # Package metadata compatible with uv
└── README.md                  # This file
```
Generated during training:
```
models/                        # Model checkpoints
├── best_model.pt              # Best model based on lowest validation loss
└── checkpoint_epoch_*.pt      # Checkpoints saved after each epoch
results/                       # Evaluation outputs
├── classification_report.txt
├── confusion_matrix.png
└── training_metadata.json     # Training history and label mappings
logs/                          # Training logs if configured
cache/                         # HuggingFace dataset cache
```
You'll need Python 3.9 or higher and the uv package manager.
Windows (PowerShell):

```powershell
powershell -c "irm https://astral.sh/uv/install.ps1 | iex"
```

macOS/Linux:

```bash
curl -LsSf https://astral.sh/uv/install.sh | sh
```

Clone the repository and create a virtual environment:

```bash
git clone https://github.com/BryanTheLai/intent-classifier-pytorch.git
cd intent-classifier-pytorch
uv venv
```

Activate the environment:

Windows (PowerShell):

```powershell
.venv\Scripts\activate
```

macOS/Linux:

```bash
source .venv/bin/activate
```

Install the package:

```bash
uv pip install -e .
```

For development dependencies:

```bash
uv pip install -e ".[dev]"
```

If training shows "Using CPU device" but you have an NVIDIA GPU, you probably installed a CPU-only PyTorch wheel. Here's how to fix that by installing the CUDA build via uv.
Step 1: Check your current setup
```powershell
.venv\Scripts\python -c "import torch; print('torch', torch.__version__, 'cuda?', torch.cuda.is_available(), 'built for', torch.version.cuda)"
```

Step 2: Install CUDA-enabled PyTorch

For CUDA 12.x, try cu124 first. If that doesn't work, fall back to cu121.

```bash
uv pip uninstall torch torchvision torchaudio

# Install CUDA wheels (cu124)
uv pip install --upgrade --index-url https://download.pytorch.org/whl/cu124 torch torchvision torchaudio

# If cu124 isn't available for your version, use cu121 instead:
# uv pip install --upgrade --index-url https://download.pytorch.org/whl/cu121 torch torchvision torchaudio
```

Step 3: Verify it worked

```powershell
.venv\Scripts\python -c "import torch; print('cuda available?', torch.cuda.is_available(), 'runtime cuda', torch.version.cuda)"
```

Step 4: Try training again

```bash
python train.py
```

For interactive prediction, run:

```bash
python predict.py --interactive
```

Example session:
```text
💬 Interactive mode (type 'quit' to exit)
===========================================================
Enter text: I want to check my account balance

✓ Predicted Intent: balance
  Confidence: 0.9543

Top 5 predictions:
  1. balance: 0.9543
  2. transactions: 0.0234
  3. freeze_account: 0.0098
  4. pin_change: 0.0067
  5. routing: 0.0034
```
Single prediction from the command line:

```bash
python predict.py --text "Transfer money to John"
```

For programmatic use:

```python
import json
from pathlib import Path

from src.inference import IntentPredictor
from src.utils import get_device

device = get_device()

# Load the label mappings saved during training
with open("results/training_metadata.json", "r") as f:
    metadata = json.load(f)
id2label = {int(k): v for k, v in metadata["label_mappings"]["id2label"].items()}

predictor = IntentPredictor.from_pretrained(
    model_path=Path("models/best_model.pt"),
    tokenizer_name="distilbert-base-uncased",
    id2label=id2label,
    device=device,
    confidence_threshold=0.7,
)

# Single prediction
text = "Book a flight to Paris"
intent, confidence, is_high_conf = predictor.predict_single(text)
print(f"Intent: {intent}, Confidence: {confidence:.4f}")

# Batch prediction
batch_texts = ["Transfer $50", "What's the weather?", "Set an alarm"]
results = predictor.predict_batch(batch_texts)
for text, (intent, conf, _) in zip(batch_texts, results):
    print(f"{text} -> {intent} ({conf:.4f})")
```

You can edit configs/config.yaml to adjust hyperparameters:
```yaml
model:
  name: "distilbert-base-uncased"   # You can use other BERT variants too
  dropout: 0.3                      # Dropout rate
  max_length: 128                   # Maximum sequence length

training:
  num_epochs: 5                     # Number of training epochs
  learning_rate: 2.0e-5             # Learning rate
  warmup_steps: 0                   # Learning rate warmup steps
  weight_decay: 0.01                # Weight decay for L2 regularization
  max_grad_norm: 1.0                # Gradient clipping threshold
  seed: 42                          # Random seed for reproducibility
  early_stopping_patience: 5        # Stop if validation loss doesn't improve
  early_stopping_min_delta: 0.001   # Minimum improvement to reset patience

data:
  dataset_name: "DeepPavlov/clinc150"
  batch_size: 8                     # Safe for 4 GB GPUs, increase if you have more memory
  num_workers: 2                    # On Windows, keeping this at 0 to 2 is more stable
```
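src/config.py loads this file into typed dataclasses. As a rough sketch of that pattern (field names mirror the YAML above; the actual dataclasses in src/config.py may differ):

```python
# Hypothetical sketch of YAML-to-dataclass loading, similar in spirit to
# src/config.py. Field names mirror configs/config.yaml; the real classes
# in the repo may differ.
from dataclasses import dataclass
from pathlib import Path

import yaml


@dataclass
class TrainingConfig:
    num_epochs: int = 5
    learning_rate: float = 2.0e-5
    warmup_steps: int = 0
    weight_decay: float = 0.01
    max_grad_norm: float = 1.0
    seed: int = 42
    early_stopping_patience: int = 5
    early_stopping_min_delta: float = 0.001


def load_training_config(path: Path) -> TrainingConfig:
    """Read the training section of a YAML config into a typed dataclass."""
    with open(path, "r") as f:
        raw = yaml.safe_load(f)
    return TrainingConfig(**raw["training"])
```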
Then run training:

```bash
python train.py --config configs/config.yaml
```

What happens during training:

The CLINC150 dataset downloads automatically from HuggingFace on your first run. Then the DistilBERT model loads its pre-trained weights and starts training for 5 epochs with validation after each one. The system saves checkpoints after every epoch and keeps track of the best model based on validation loss in models/best_model.pt. After training finishes, it runs a full evaluation on the test set with metrics and visualizations.
Expected output:
```text
============================================================
INTENT CLASSIFICATION TRAINING
============================================================

📦 Loading dataset...
✓ Dataset loaded: 150 intents
  Train samples: 15000
  Val samples: 3000
  Test samples: 5700

🏗️ Building model...
✓ Model created
  Total parameters: 66,955,350
  Trainable parameters: 66,955,350

🚀 Starting training...

Epoch 1/5
============================================================
Training: 100%|████████████| 938/938 [05:23<00:00, 2.90it/s]
Validation: 100%|████████| 188/188 [00:31<00:00, 5.98it/s]

Epoch 1 Summary:
  Train Loss: 2.1234
  Val Loss: 1.2345
  Val Accuracy: 0.7856
  Val Macro F1: 0.7654
  ✓ New best model saved!

...
```
Training outputs:
- `models/best_model.pt` contains the best model weights
- `models/checkpoint_epoch_*.pt` contains checkpoints from each epoch
- `results/classification_report.txt` has detailed per-class metrics
- `results/confusion_matrix.png` shows the confusion matrix visualization
- `results/training_metadata.json` stores training history and label mappings
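If you want to inspect a checkpoint outside of `IntentPredictor`, a minimal sketch like the following works, assuming the file holds a state dict or a dict wrapping one (the exact checkpoint format is defined in src/utils.py and may include extra fields):

```python
# Hypothetical checkpoint inspection; assumes best_model.pt holds a state
# dict, possibly wrapped under a key such as "model_state_dict".
import torch

# map_location="cpu" loads the weights regardless of the training device
checkpoint = torch.load("models/best_model.pt", map_location="cpu")
state_dict = checkpoint.get("model_state_dict", checkpoint)
print(f"{len(state_dict)} tensors in checkpoint")
```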
For reference, the full configs/config.yaml:

```yaml
model:
  name: "distilbert-base-uncased"   # You can swap this for other BERT variants
  dropout: 0.3                      # Dropout rate
  max_length: 128                   # Maximum sequence length

training:
  num_epochs: 5                     # Number of training epochs
  learning_rate: 2.0e-5             # Learning rate
  warmup_steps: 0                   # Learning rate warmup steps
  weight_decay: 0.01                # Weight decay for L2 regularization
  max_grad_norm: 1.0                # Gradient clipping threshold
  seed: 42                          # Random seed
  early_stopping_patience: 5        # Stop training if validation loss doesn't improve
  early_stopping_min_delta: 0.001   # Minimum improvement needed to reset patience

data:
  dataset_name: "DeepPavlov/clinc150"
  batch_size: 8                     # Works well on 4 GB GPUs, increase if you have more memory
  num_workers: 2                    # Keeping this between 0 and 2 works best on Windows
```

Run unit tests:
```bash
pytest tests/
```

Or run tests directly:

```bash
python tests/test_model.py
```

To use your own dataset, you'll need to modify src/dataset.py:
```python
class CustomDataLoader:
    def __init__(self, data_path: str):
        self.data_path = data_path

    def load_dataset(self):
        # Load your custom data here
        # Create label2id and id2label mappings
        pass

    def get_split(self, split: str):
        # Return texts and labels for the specified split
        pass
```

First, prepare your data in CLINC150 format with text and intent pairs. Then update src/dataset.py to load your data and adjust num_intents in the configuration. After that, you can run training normally.
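As a concrete illustration, here is one way the skeleton could be filled in for JSONL files with `text` and `intent` fields. The file layout, field names, and train/val/test split naming are assumptions for this sketch, not a format the repo prescribes:

```python
# Hypothetical JSONL-backed loader; file layout and field names ("text",
# "intent") are assumptions for this sketch, not a format the repo prescribes.
import json
from pathlib import Path


class JsonlDataLoader:
    def __init__(self, data_path: str):
        self.data_path = Path(data_path)
        self.label2id: dict[str, int] = {}
        self.id2label: dict[int, str] = {}

    def load_dataset(self) -> None:
        # Build label mappings from the intents seen in the training split
        intents = set()
        with open(self.data_path / "train.jsonl") as f:
            for line in f:
                intents.add(json.loads(line)["intent"])
        self.label2id = {name: i for i, name in enumerate(sorted(intents))}
        self.id2label = {i: name for name, i in self.label2id.items()}

    def get_split(self, split: str) -> tuple[list[str], list[int]]:
        # Return parallel lists of texts and integer labels for train/val/test
        texts, labels = [], []
        with open(self.data_path / f"{split}.jsonl") as f:
            for line in f:
                row = json.loads(line)
                texts.append(row["text"])
                labels.append(self.label2id[row["intent"]])
        return texts, labels
```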
If you want faster training or have limited data, you can freeze the BERT layers:
```python
from src.model import IntentClassifier

model = IntentClassifier(num_intents=150)
model.freeze_bert_encoder()  # This freezes BERT and only trains the classifier head
```

For progressive unfreezing, train in two stages:

```python
# Stage 1: Train just the classifier head
model.freeze_bert_encoder()
trainer.train(evaluator)

# Stage 2: Fine-tune the entire model
model.unfreeze_bert_encoder()
trainer.learning_rate = 1e-5  # Use a lower learning rate for fine-tuning
trainer.train(evaluator)
```

To route queries between the classifier and an LLM based on prediction confidence:

```python
predictor = IntentPredictor.from_pretrained(...)
intent, confidence, is_high_conf = predictor.predict_single(user_query)

if is_high_conf:
    # Route to the appropriate handler
    handle_intent(intent, user_query)
else:
    # Fall back to LLM for ambiguous queries
    llm_response = llm.generate(user_query)
```

For deployment, here are some options:
Quantization to reduce model size:
```python
import torch
import torch.quantization as quantization

quantized_model = quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8
)
```

ONNX Export for cross-platform deployment:

```python
import torch
import torch.onnx

# Dummy inputs sized to the tokenizer's max_length of 128
dummy_input = (torch.zeros(1, 128, dtype=torch.long), torch.ones(1, 128, dtype=torch.long))
torch.onnx.export(model, dummy_input, "model.onnx")
```

TorchScript for faster inference:

```python
scripted_model = torch.jit.script(model)
scripted_model.save("model_scripted.pt")
```

You can boost performance by generating training data with an LLM:
```python
from openai import OpenAI

client = OpenAI()

def generate_intent_examples(intent_name: str, num_examples: int = 50):
    prompt = f"Generate {num_examples} diverse user queries for the intent '{intent_name}'"
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}]
    )
    return response.choices[0].message.content
```

You can improve the student model by using a teacher LLM:
```python
import torch.nn.functional as F

# Schematic distillation loop: teacher_model stands in for your LLM teacher
teacher_model = LargeLanguageModel()
student_model = IntentClassifier(num_intents=150)

temperature = 2.0
for batch in dataloader:
    teacher_logits = teacher_model(batch)
    student_logits = student_model(batch)
    # KL divergence between softened distributions, scaled by T^2
    distillation_loss = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * temperature**2
```

The CLINC150 dataset contains 23,700 examples split into 15,000 for training, 3,000 for validation, and 5,700 for testing. It covers 150 different intent classes across 10 domains and includes out-of-scope detection. You can find it on HuggingFace.
Domain Distribution:
Banking has 13 intents covering account management, transfers, and card operations. Credit cards include 11 intents for payments, rewards, and pin changes. Kitchen and dining cover 6 intents about food queries and complaints. Home has 9 intents for smart home control and automation. Auto and commute include 9 intents about insurance, rental, and repair.
Travel is the largest with 15 intents for booking, flight status, and baggage. Utility has 17 intents covering bills and service activation or cancellation. Work includes 11 intents for PTO requests, meetings, and contracts. Small talk has 10 intents for greetings and chitchat. Finally, Meta encompasses 50 intents for help, out-of-scope queries, and general questions.
Example Intents:
- `balance` for checking account balance
- `transfer` for transferring money
- `book_flight` for booking airline tickets
- `weather` for weather queries
- `oos` for out-of-scope queries that don't fit the 150 intents
Banking77: Fine-grained banking intents
```python
from datasets import load_dataset

dataset = load_dataset("PolyAI/banking77")
```

Custom Generation: See the Research Extensions section.
Try reducing the batch size in configs/config.yaml:
```yaml
data:
  batch_size: 8   # Reduce this further if you still hit out-of-memory errors
```

If your CPU is the bottleneck, try reducing num_workers. You can also use mixed precision training to speed things up:
```python
from torch.cuda.amp import GradScaler, autocast

scaler = GradScaler()
with autocast():
    outputs = model(inputs)
    loss = criterion(outputs, labels)
scaler.scale(loss).backward()  # scale the loss to avoid fp16 underflow
scaler.step(optimizer)
scaler.update()
```

If accuracy is lower than expected, try increasing num_epochs to 8 or 10. You can also experiment with the learning_rate by trying 3e-5 or 1e-5. Reducing dropout to 0.1 or 0.2 might help. Another option is trying different BERT variants like bert-base-uncased or roberta-base.
The model architecture follows best practices verified against PyTorch and HuggingFace Transformers documentation.
Model Components:
The base model is DistilBERT (distilbert-base-uncased) with 66 million parameters. For pooling, we use the [CLS] token representation from hidden_state[:, 0]. The classifier head applies dropout at 0.3 then passes through a linear layer mapping from 768 dimensions to 150 output classes. The loss function is CrossEntropyLoss which combines LogSoftmax and NLLLoss.
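Putting those pieces together, the forward pass could be sketched as follows. This mirrors the description above rather than reproducing src/model.py verbatim:

```python
# Minimal sketch of the architecture described above; the actual
# implementation lives in src/model.py and may differ in details.
import torch
import torch.nn as nn
from transformers import DistilBertModel


class IntentClassifierSketch(nn.Module):
    def __init__(self, num_intents: int = 150, dropout: float = 0.3):
        super().__init__()
        self.bert = DistilBertModel.from_pretrained("distilbert-base-uncased")
        self.dropout = nn.Dropout(dropout)
        self.classifier = nn.Linear(self.bert.config.dim, num_intents)  # 768 -> 150

    def forward(self, input_ids: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
        hidden_state = self.bert(input_ids=input_ids, attention_mask=attention_mask).last_hidden_state
        cls_embedding = hidden_state[:, 0]  # [CLS] token representation
        return self.classifier(self.dropout(cls_embedding))  # logits for CrossEntropyLoss
```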
Implementation Verification:
Everything has been verified to follow best practices. We correctly use DistilBertModel.last_hidden_state[:, 0] for classification. Tokenization properly uses encode_plus with padding="max_length" and truncation=True. The code correctly handles map_location for CPU and GPU compatibility. Gradient clipping uses torch.nn.utils.clip_grad_norm_ after the backward pass. Inference mode properly combines model.eval() with torch.no_grad().
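As a compact illustration of that tokenization and inference pattern (a sketch, not the repo's exact code; `model` stands in for the trained classifier):

```python
# Sketch of the verified tokenization and inference pattern; parameter
# values mirror the description above.
import torch
from transformers import DistilBertTokenizer

tokenizer = DistilBertTokenizer.from_pretrained("distilbert-base-uncased")


def predict_probs(model: torch.nn.Module, text: str) -> torch.Tensor:
    encoding = tokenizer.encode_plus(
        text,
        padding="max_length",
        truncation=True,
        max_length=128,
        return_tensors="pt",
    )
    model.eval()            # disable dropout
    with torch.no_grad():   # disable gradient tracking
        logits = model(encoding["input_ids"], encoding["attention_mask"])
    return torch.softmax(logits, dim=-1)
```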
The optimizer is AdamW with a learning rate of 2e-5 and weight decay of 0.01. We use a linear warmup scheduler with decay. Gradient clipping has a max_norm of 1.0. Early stopping kicks in with a patience of 5 epochs and a minimum delta of 0.001. For reproducibility, we set the seed to 42 and use deterministic mode.
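A sketch of that setup, using `get_linear_schedule_with_warmup` from HuggingFace Transformers (the function boundaries here are illustrative, not the repo's trainer API):

```python
# Sketch of the optimizer, scheduler, and clipping setup described above.
import torch
from torch import nn
from transformers import get_linear_schedule_with_warmup


def build_optimizer_and_scheduler(model: nn.Module, num_training_steps: int):
    # AdamW with lr=2e-5 and weight_decay=0.01, linear warmup with decay
    optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5, weight_decay=0.01)
    scheduler = get_linear_schedule_with_warmup(
        optimizer, num_warmup_steps=0, num_training_steps=num_training_steps
    )
    return optimizer, scheduler


def training_step(model, batch, criterion, optimizer, scheduler):
    logits = model(batch["input_ids"], batch["attention_mask"])
    loss = criterion(logits, batch["labels"])
    loss.backward()
    # Clip gradients after the backward pass, before the optimizer step
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
    optimizer.step()
    scheduler.step()
    optimizer.zero_grad()
    return loss.item()
```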
We use DistilBertTokenizer with a maximum length of 128 tokens. The batch size is set to 16, but you can adjust this based on your GPU memory. We don't use any data augmentation, just the standard CLINC150 splits. Preprocessing handles automatic label mapping and validation.
- CLINC150: An Evaluation Dataset for Intent Detection by Larson et al., 2019
- DistilBERT: A distilled version of BERT by Sanh et al., 2019
- HuggingFace Transformers for the DistilBERT implementation
- PyTorch Documentation for training best practices
- CLINC150 Dataset on HuggingFace Hub
MIT License. See the LICENSE file for details.
Thanks to the creators of the CLINC150 dataset (Larson et al.), the HuggingFace Transformers library, and the PyTorch team.