This repository is the official implementation of the NeurIPS 2025 paper *MaxSup: Overcoming Representation Collapse in Label Smoothing* (Oral Presentation).
Max Suppression (MaxSup) is a novel regularization technique that overcomes the shortcomings of traditional Label Smoothing (LS). While LS prevents overconfidence by softening one-hot labels, it inadvertently collapses intra-class feature diversity and can boost overconfident errors. In contrast, MaxSup applies a uniform smoothing penalty to the model’s top prediction—regardless of correctness—preserving richer per-sample information and improving both classification performance and downstream transfer.
- MaxSup: Overcoming Representation Collapse in Label Smoothing
- Table of Contents
- Overview
- Methodology: MaxSup vs. Label Smoothing
- Enhanced Feature Representation
- Training Vision Transformers with MaxSup
- Fine-grained Image Classification
- Pretrained Weights
- Training ConvNets with MaxSup
- Logit Characteristic Visualization
- More about Improved Feature Space
- References
Traditional Label Smoothing (LS) replaces one-hot labels with a smoothed version to reduce overconfidence. However, LS can over-tighten feature clusters within each class and may reinforce errors by making mispredictions overconfident. MaxSup tackles these issues by applying a smoothing penalty to the model's top-1 logit output regardless of whether the prediction is correct, thus preserving intra-class diversity and enhancing inter-class separation. The result is improved performance on both classification tasks and downstream applications such as linear transfer and image segmentation.
Label Smoothing softens the target distribution by blending the one-hot vector with a uniform distribution. Although effective at reducing overconfidence, LS inadvertently introduces two effects:
- A regularization term that limits the sharpness of predictions.
- An error-enhancement term that can cause overconfident wrong predictions.
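Concretely, writing $z_k$ for the logits, $y$ for the ground-truth class, $\bar{z} = \frac{1}{K}\sum_{k=1}^{K} z_k$ for the mean logit, and $\alpha$ for the smoothing coefficient (notation ours), the Label Smoothing objective can be rewritten as

$$
\mathcal{L}_{\text{LS}} \;=\; \mathcal{L}_{\text{CE}} \;+\; \underbrace{\alpha\,\bigl(z_{\max} - \bar{z}\bigr)}_{\text{regularization}} \;+\; \underbrace{\alpha\,\bigl(z_{y} - z_{\max}\bigr)}_{\text{error enhancement}} .
$$

The regularization term caps the sharpness of every prediction, while the error-enhancement term vanishes for correctly classified samples but, on misclassified ones, pushes the true-class logit $z_y$ further below the wrongly predicted top logit $z_{\max}$.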
MaxSup addresses this by uniformly penalizing the highest logit output, whether it corresponds to the true class or not. This approach enforces a consistent regularization effect across all samples. In formula form:
$$
\mathcal{L}_{\text{MaxSup}} \;=\; \mathcal{L}_{\text{CE}} \;+\; \alpha\left(z_{\max} - \frac{1}{K}\sum_{k=1}^{K} z_k\right),
$$

where $z_{\max}$ is the highest logit among the $K$ classes and $\alpha$ is the smoothing coefficient. This mechanism prevents the prediction distribution from becoming too peaky while preserving informative signals from non-target classes.
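For illustration, below is a minimal PyTorch sketch of this objective, assuming raw logits of shape `(batch, K)`; the function and argument names are ours, and it is not necessarily the exact implementation used in this repository.

```python
import torch
import torch.nn.functional as F

def maxsup_loss(logits: torch.Tensor, targets: torch.Tensor, alpha: float = 0.1) -> torch.Tensor:
    """Cross-entropy plus the MaxSup penalty on the top-1 logit.

    logits:  (batch, K) raw model outputs
    targets: (batch,) integer class labels
    alpha:   smoothing strength (analogous to the LS coefficient)
    """
    ce = F.cross_entropy(logits, targets)
    # Penalize the gap between the highest logit and the mean logit,
    # regardless of whether the top-1 prediction is correct.
    z_max = logits.max(dim=1).values
    z_mean = logits.mean(dim=1)
    return ce + alpha * (z_max - z_mean).mean()
```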
MaxSup-trained models display richer intra-class feature diversity compared to models trained with traditional LS. Feature embedding visualizations show that while LS forces features into tight clusters, MaxSup preserves finer-grained differences among samples. Grad-CAM analyses also demonstrate that MaxSup-trained models focus more precisely on relevant class-discriminative regions.

Figure 1: Feature representations. MaxSup maintains greater intra-class diversity and clear inter-class boundaries.

Figure 2: Grad-CAM visualizations. The MaxSup model (row 2) accurately highlights target objects, whereas the LS model (row 3) and Baseline (row 4) show more diffuse activations.
We evaluated feature representations on ResNet-50 trained on ImageNet-1K. Intra-class variation (reflecting the diversity within classes) and inter-class separability (indicating class distinctiveness) were measured. Additionally, a linear transfer learning task on CIFAR-10 was performed.
Table 1: Feature Representation Metrics (ResNet-50 on ImageNet-1K)
| Method | Intra-class Var. (Train) | Intra-class Var. (Val) | Inter-class Sep. (Train) | Inter-class Sep. (Val) |
|---|---|---|---|---|
| Baseline | 0.3114 | 0.3313 | 0.4025 | 0.4451 |
| Label Smoothing | 0.2632 | 0.2543 | 0.4690 | 0.4611 |
| Online LS | 0.2707 | 0.2820 | 0.5943 | 0.5708 |
| Zipf’s LS | 0.2611 | 0.2932 | 0.5522 | 0.4790 |
| MaxSup (ours) | 0.2926 | 0.2998 | 0.5188 | 0.4972 |
Higher intra-class variation indicates more preserved sample-specific details, while higher inter-class separability suggests better class discrimination.
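For intuition only, the sketch below shows one simple way such feature-space statistics can be computed from penultimate-layer embeddings (cosine distance to class centroids for intra-class variation, pairwise centroid distances for inter-class separability); the exact definitions used for Table 1 may differ.

```python
import torch
import torch.nn.functional as F

def feature_space_metrics(features: torch.Tensor, labels: torch.Tensor):
    """Illustrative intra-class variation and inter-class separability.

    features: (N, D) penultimate-layer embeddings
    labels:   (N,) integer class labels
    """
    feats = F.normalize(features, dim=1)
    classes = labels.unique()
    centroids = F.normalize(
        torch.stack([feats[labels == c].mean(dim=0) for c in classes]), dim=1)

    # Intra-class variation: mean cosine distance of samples to their class centroid.
    intra = torch.stack([
        (1 - feats[labels == c] @ centroids[i]).mean() for i, c in enumerate(classes)
    ]).mean()

    # Inter-class separability: mean pairwise cosine distance between class centroids.
    sim = centroids @ centroids.T
    off_diag = ~torch.eye(len(classes), dtype=torch.bool)
    inter = (1 - sim)[off_diag].mean()
    return intra.item(), inter.item()
```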
Table 2: Linear Transfer Accuracy on CIFAR-10
| Pretraining Method | Accuracy (%) |
|---|---|
| Baseline | 81.43 |
| Label Smoothing | 74.58 |
| MaxSup | 81.02 |
Label Smoothing degrades transfer accuracy due to its over-smoothing effect, whereas MaxSup nearly matches the baseline performance while still offering improved calibration.
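For reference, the linear transfer protocol amounts to freezing the pretrained backbone and training only a linear classifier on the new dataset. A rough sketch is given below; the checkpoint path and hyperparameters are placeholders, not the settings used for Table 2.

```python
import torch
import torchvision

# Hypothetical linear-probe setup: freeze a pretrained ResNet-50 backbone
# and train only the final linear layer on CIFAR-10.
model = torchvision.models.resnet50()
state = torch.load("maxsup_resnet50.pth", map_location="cpu")  # placeholder checkpoint path
model.load_state_dict(state)
for p in model.parameters():
    p.requires_grad = False
model.fc = torch.nn.Linear(model.fc.in_features, 10)  # new head, the only trainable part
optimizer = torch.optim.SGD(model.fc.parameters(), lr=0.1, momentum=0.9)
```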
We also conducted comprehensive experiments on an imbalanced dataset (long-tailed CIFAR-10 with imbalance ratios of 50 and 100) and an out-of-distribution dataset (CIFAR10-C).
Table 3: Comparison of overall accuracy (%) (jointly considering many-shot, median-shot, and low-shot top-1 accuracy) for different loss strategies (Focal Loss vs Label Smoothing (LS) vs MaxSup) across datasets, imbalance levels, and backbones. Best performances are in bold.
| Dataset | Split | Imbalance Ratio | Backbone | Method | Overall | Many | Medium | Low |
|---|---|---|---|---|---|---|---|---|
| Long-tailed CIFAR-10 | val | 50 | ResNet-32 | Focal Loss | 77.4 | 76.0 | 89.7 | 0.0 |
| | | | | LS | 81.2 | 81.6 | 77.0 | 0.0 |
| | | | | MaxSup | **82.1** | 82.5 | 78.1 | 0.0 |
| Long-tailed CIFAR-10 | test | 50 | ResNet-32 | Focal Loss | 76.8 | 75.3 | 90.4 | 0.0 |
| | | | | LS | 80.5 | 81.1 | 75.4 | 0.0 |
| | | | | MaxSup | **81.4** | 82.3 | 73.4 | 0.0 |
| Long-tailed CIFAR-10 | val | 100 | ResNet-32 | Focal Loss | 75.1 | 71.8 | 88.3 | 0.0 |
| | | | | LS | 76.6 | 80.6 | 60.7 | 0.0 |
| | | | | MaxSup | **77.1** | 80.1 | 65.1 | 0.0 |
| Long-tailed CIFAR-10 | test | 100 | ResNet-32 | Focal Loss | 74.7 | 71.6 | 87.2 | 0.0 |
| | | | | LS | **76.4** | 80.8 | 59.0 | 0.0 |
| | | | | MaxSup | **76.4** | 79.9 | 62.4 | 0.0 |
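As a reminder of what the imbalance ratio means, long-tailed CIFAR-10 is typically built by subsampling each class along an exponential profile; a small illustrative helper is shown below (the exact construction used here may follow a specific reference implementation).

```python
def long_tailed_counts(n_per_class: int = 5000, num_classes: int = 10, imbalance_ratio: float = 100.0):
    """Per-class sample counts under the common exponential long-tail profile."""
    return [int(n_per_class * imbalance_ratio ** (-i / (num_classes - 1)))
            for i in range(num_classes)]

# imbalance_ratio=100 keeps 5000 images of the head class and 50 of the tail class.
print(long_tailed_counts(imbalance_ratio=100))
```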
Table 4: Comparison of MaxSup and Label Smoothing (LS) on CIFAR10-C with a ResNet-50 backbone. For all metrics, lower values are better, and best performances are in bold.
| Metric | MaxSup | LS |
|---|---|---|
| Error (Corr) | **0.3951** | **0.3951** |
| NLL (Corr) | 1.8431 | **1.5730** |
| ECE (Corr) | **0.1479** | 0.1741 |
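For readers less familiar with the calibration metric, the sketch below shows the standard equal-width-bin estimator of the Expected Calibration Error (ECE); it is illustrative and not necessarily the exact implementation behind Table 4.

```python
import torch

def expected_calibration_error(logits: torch.Tensor, labels: torch.Tensor, n_bins: int = 15) -> float:
    """Standard ECE: weighted average gap between per-bin confidence and accuracy."""
    probs = logits.softmax(dim=1)
    conf, pred = probs.max(dim=1)
    correct = pred.eq(labels).float()
    bin_edges = torch.linspace(0, 1, n_bins + 1)
    ece = torch.zeros(())
    for lo, hi in zip(bin_edges[:-1], bin_edges[1:]):
        in_bin = (conf > lo) & (conf <= hi)
        if in_bin.any():
            ece += in_bin.float().mean() * (correct[in_bin].mean() - conf[in_bin].mean()).abs()
    return ece.item()
```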
We integrated MaxSup into the training pipeline for Vision Transformers using the DeiT framework.
```bash
cd Deit
bash train_with_MaxSup.sh
```

This script trains a DeiT-Small model on ImageNet-1K with MaxSup regularization.
For improved data loading efficiency on systems with slow I/O, a caching mechanism is provided. This feature compresses the ImageNet dataset into ZIP files and loads them into memory. Enable caching by adding the --cache flag to the training script.
- **Create ZIP archives.** In your ImageNet data directory, run:

  ```bash
  cd data/ImageNet
  zip -r train.zip train
  zip -r val.zip val
  ```

- **Add the mapping files.** Download `train_map.txt` and `val_map.txt` from our release assets and place them in the `data/ImageNet` directory. The directory should appear as follows:

  ```
  data/ImageNet/
  ├── train_map.txt   # Relative paths and labels for training images
  ├── val_map.txt     # Relative paths and labels for validation images
  ├── train.zip       # Compressed training images
  └── val.zip         # Compressed validation images
  ```

  - `train_map.txt`: each line should be in the format `<class_folder>/<image_filename>\t<label>`.
  - `val_map.txt`: each line should be in the format `<image_filename>\t<label>`.
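To make the layout concrete, the snippet below sketches how such a ZIP-backed dataset could be read. It follows the map-file format described above but is not the repository's actual caching implementation, and it assumes the archive keeps the `train/` (or `val/`) prefix produced by `zip -r`.

```python
import io
import zipfile

from PIL import Image
from torch.utils.data import Dataset

class ZipImageDataset(Dataset):
    """Illustrative ZIP-backed dataset for the train.zip / train_map.txt layout."""

    def __init__(self, zip_path, map_path, prefix="train/", transform=None):
        self.zip_path = zip_path
        self.prefix = prefix        # archive-internal prefix added by `zip -r train.zip train`
        self.transform = transform
        with open(map_path) as f:   # each line: <relative_path>\t<label>
            self.samples = [line.rstrip("\n").split("\t") for line in f if line.strip()]
        self._zf = None             # opened lazily so each DataLoader worker gets its own handle

    def __len__(self):
        return len(self.samples)

    def __getitem__(self, idx):
        if self._zf is None:
            self._zf = zipfile.ZipFile(self.zip_path, "r")
        rel_path, label = self.samples[idx]
        with self._zf.open(self.prefix + rel_path) as f:
            img = Image.open(io.BytesIO(f.read())).convert("RGB")
        if self.transform is not None:
            img = self.transform(img)
        return img, int(label)
```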
Beyond large-scale benchmarks such as ImageNet, we further evaluate MaxSup on two fine-grained visual recognition datasets: CUB-200-2011 and Stanford Cars. These datasets present subtle inter-class differences that often expose the limitations of standard regularization techniques.
As shown below, MaxSup achieves the best performance on both datasets, surpassing Label Smoothing (LS) and its recent variants. This indicates that MaxSup helps models learn more discriminative and semantically rich feature representations, effectively capturing fine-grained attributes such as textures and part-level details. The consistent gains across domains further demonstrate MaxSup’s strong generalization ability and robustness in recognition tasks requiring nuanced feature understanding.
| Method | CUB (Acc %) | Cars (Acc %) |
|---|---|---|
| Baseline | 80.88 | 90.27 |
| LS | 81.96 | 91.64 |
| OLS | 82.33 | 91.96 |
| Zipf-LS | 81.40 | 90.99 |
| MaxSup | 82.53 | 92.25 |
Pretrained checkpoints for both Vision Transformers and ConvNets are available:
- **Vision Transformer (DeiT-Small):** available in the release titled "checkpoint_deit", which includes the model `.pth` file and training logs.
- **ConvNet (ResNet-50):** pretrained weights can be downloaded from https://huggingface.co/zhouyuxuanyx/MaxSup-Regularized-ResNet50/tree/main. Baseline and LS variants are also provided for comparison.
These checkpoints can be used for direct evaluation or fine-tuning on downstream tasks.
The Conv/ directory provides scripts for training convolutional networks with MaxSup:
- `Conv/ffcv`: contains scripts to reproduce ImageNet results using FFCV for efficient data loading. See `Conv/ffcv/README.md` for details.
- `Conv/common_resnet`: contains additional experiments with ResNet architectures. Refer to `Conv/common_resnet/README.md` for further instructions.
The viz/ directory contains a toolkit to analyze the distribution of logits produced by models trained with LS versus MaxSup.
Run the following command to extract logits from your trained model:
```bash
python viz/logits.py \
    --checkpoint /path/to/model_checkpoint.pth \
    --output /path/to/save/logits_labels.pt
```

- `--checkpoint`: path to your model checkpoint.
- `--output`: destination file for the extracted logits and labels.
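Conceptually, the extraction step amounts to something like the following (an illustrative sketch, not the actual contents of `viz/logits.py`):

```python
import torch

@torch.no_grad()
def extract_logits(model, loader, device="cuda"):
    """Run the model over a dataloader and collect logits and labels."""
    model.eval().to(device)
    all_logits, all_labels = [], []
    for images, labels in loader:
        all_logits.append(model(images.to(device)).cpu())
        all_labels.append(labels)
    return {"logits": torch.cat(all_logits), "labels": torch.cat(all_labels)}

# torch.save(extract_logits(model, val_loader), "logits_labels.pt")
```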
After extraction, run:

```bash
python viz/analysis.py --input /path/to/save/logits_labels.pt --output /path/to/analysis_results/
```

This script generates:
- A histogram of near-zero logit proportions.
- A scatter plot comparing top-1 probabilities with near-zero proportions.
- Saved visualizations for side-by-side comparisons.

Figure 3: Logit distribution comparing LS and MaxSup.
Figure 4: Visualization of penultimate-layer activations from DeiT-Small (trained with CutMix and Mixup) on the ImageNet validation set.
- **DeiT (Vision Transformer):** Touvron et al., *Training Data-Efficient Image Transformers & Distillation through Attention*, ICML 2021. GitHub.
- **Grad-CAM:** Selvaraju et al., *Grad-CAM: Visual Explanations from Deep Networks via Gradient-Based Localization*, ICCV 2017.
- **Online Label Smoothing:** see paper for details.
- **Zipf's Label Smoothing:** see paper for details.
```bibtex
@article{zhou2025maxsup,
  title={Maxsup: Overcoming representation collapse in label smoothing},
  author={Zhou, Yuxuan and Li, Heng and Cheng, Zhi-Qi and Yan, Xudong and Dong, Yifei and Fritz, Mario and Keuper, Margret},
  journal={arXiv preprint arXiv:2502.15798},
  year={2025}
}
```

This repository provides the official implementation of MaxSup. Contributions and discussions are welcome. For any questions or issues, please open an issue on GitHub or contact the authors directly.
