This project investigates and compares the spectral properties of Convolutional Neural Networks (CNNs) and Vision Transformers (ViTs) using Fourier analysis. The goal is to understand how different architectures process frequency components of input data and how this affects robustness and generalization.
- Spatial-Frequency Representation: CNNs vs ViTs
- Summary: How ViT and CNN Learn Frequency
- Evaluation Results
Below are the feature maps and their corresponding FFTs for both models. For more details and the power spectral density (PSD) plots, refer to Notebook 2.
(Figures: example input image, followed by ResNet and ViT feature maps and their corresponding FFTs.)
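As a rough illustration of how such feature-map FFTs can be produced, here is a minimal sketch assuming torchvision's pretrained `resnet18`, a forward hook on an intermediate layer, and a hypothetical `example.jpg`; the layer and preprocessing choices are illustrative rather than the exact ones used in the notebooks.

```python
import torch
import torchvision.models as models
import torchvision.transforms as T
from PIL import Image

# Illustrative preprocessing; the notebooks may use different settings.
preprocess = T.Compose([
    T.Resize((224, 224)),
    T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

def feature_map_fft(model, layer, image):
    """Capture one intermediate feature map via a forward hook and return
    its centered log-magnitude 2D FFT."""
    feats = {}
    handle = layer.register_forward_hook(lambda m, i, o: feats.update(out=o.detach()))
    with torch.no_grad():
        model(preprocess(image).unsqueeze(0))
    handle.remove()
    fmap = feats["out"][0].mean(dim=0)          # average over channels -> (H, W)
    spectrum = torch.fft.fftshift(torch.fft.fft2(fmap))
    return torch.log1p(spectrum.abs())          # log scale for visualization

resnet = models.resnet18(weights="IMAGENET1K_V1").eval()
img = Image.open("example.jpg").convert("RGB")  # hypothetical example path
resnet_fft = feature_map_fft(resnet, resnet.layer2, img)
```

For ViT-B/16, the analogous step would reshape the 14x14 patch tokens (dropping the class token) back into a spatial grid before taking the FFT.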
CNNs and ViTs process spatial information differently, and this impacts how they encode frequency components:
- CNNs:
  - Use convolutional filters with local receptive fields.
  - Are naturally biased toward low-frequency structure (e.g., smooth regions and coarse shapes).
  - Their hierarchical structure progressively abstracts spatial-frequency content.
- ViTs:
  - Use global self-attention with positional encoding.
  - Tend to retain more global and high-frequency information.
  - Attention heads can capture sparse yet global patterns without a locality bias.
This makes ViTs generally more robust to the removal of low-frequency content, while CNNs depend more heavily on low frequencies being preserved.
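One way to make this bias concrete is a radially averaged power spectral density, which collapses a feature map's 2D spectrum into energy per spatial-frequency band: a strongly low-frequency-biased map concentrates most energy in the first few bins, while a flatter curve indicates more high-frequency content. Below is a minimal NumPy sketch; the integer-radius binning scheme is an assumption and not necessarily what Notebook 2 uses.

```python
import numpy as np

def radial_psd(feature_map: np.ndarray) -> np.ndarray:
    """Radially averaged power spectral density of a 2D feature map.
    Low indices correspond to low spatial frequencies, high indices to high ones."""
    h, w = feature_map.shape
    power = np.abs(np.fft.fftshift(np.fft.fft2(feature_map))) ** 2

    # Distance of every frequency bin from the center of the shifted spectrum.
    yy, xx = np.indices((h, w))
    r = np.sqrt((yy - h / 2) ** 2 + (xx - w / 2) ** 2).astype(int)

    # Average the power within each integer radius (i.e., frequency band).
    sums = np.bincount(r.ravel(), weights=power.ravel())
    counts = np.bincount(r.ravel())
    return sums / np.maximum(counts, 1)
```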
| Aspect | CNN (ResNet18) | ViT (ViT-B/16) |
|---|---|---|
| Frequency Bias | Strong low-frequency preference | More balanced frequency representation |
| Local vs Global | Local receptive fields | Global attention mechanism |
| Feature Map FFTs | Smooth, concentrated at low frequencies | Richer in high-frequency components |
We tested classification performance on original, lowpass-filtered, and highpass-filtered inputs (see Notebook 3 for details; a sketch of one way such filters can be built follows the metric list). Metrics used:
- Accuracy
- F1 Score
- Confidence Drop
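A minimal sketch of building such filtered inputs with an ideal circular mask in the Fourier domain; the cutoff radius, the ideal-mask shape, and the assumption that inputs lie in [0, 1] are illustrative, and Notebook 3 may use a different filter.

```python
import torch

def frequency_filter(image: torch.Tensor, radius: float, mode: str = "lowpass") -> torch.Tensor:
    """Apply an ideal circular frequency filter to a (C, H, W) image tensor in [0, 1].
    'lowpass' keeps frequencies within `radius` of the DC component; 'highpass' keeps the rest."""
    _, h, w = image.shape
    yy, xx = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    dist = torch.sqrt((yy - h / 2) ** 2 + (xx - w / 2) ** 2)
    mask = (dist <= radius) if mode == "lowpass" else (dist > radius)

    spectrum = torch.fft.fftshift(torch.fft.fft2(image), dim=(-2, -1))
    filtered = spectrum * mask                   # zero out the unwanted frequency band
    out = torch.fft.ifft2(torch.fft.ifftshift(filtered, dim=(-2, -1)))
    return out.real.clamp(0, 1)                  # drop the tiny imaginary residue

# Example: keep only frequencies within 20 bins of the DC component.
# lowpass_img = frequency_filter(img_tensor, radius=20, mode="lowpass")
```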
ResNet18:

| Metric | Value |
|---|---|
| Accuracy (original) | 0.977 |
| Accuracy (lowpass) | 0.779 |
| Accuracy (highpass) | 0.261 |
| F1 Score (original) | 0.977 |
| F1 Score (lowpass) | 0.778 |
| F1 Score (highpass) | 0.197 |
| Confidence Drop (highpass) | 0.255 |
| Confidence Drop (lowpass) | 0.052 |

ViT-B/16:

| Metric | Value |
|---|---|
| Accuracy (original) | 0.994 |
| Accuracy (lowpass) | 0.900 |
| Accuracy (highpass) | 0.467 |
| F1 Score (original) | 0.994 |
| F1 Score (lowpass) | 0.901 |
| F1 Score (highpass) | 0.452 |
| Confidence Drop (highpass) | 0.083 |
| Confidence Drop (lowpass) | 0.011 |
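For reference, here is a minimal sketch of how these metrics could be computed, assuming "confidence drop" means the decrease in mean softmax probability of the predicted class relative to the original inputs; that definition and the macro F1 averaging are assumptions, and Notebook 3 may differ.

```python
import torch
from sklearn.metrics import accuracy_score, f1_score

@torch.no_grad()
def evaluate(model, images: torch.Tensor, labels: torch.Tensor):
    """Return accuracy, F1, and mean top-class softmax confidence on a batch."""
    probs = torch.softmax(model(images), dim=1)
    conf, preds = probs.max(dim=1)
    acc = accuracy_score(labels.numpy(), preds.numpy())
    f1 = f1_score(labels.numpy(), preds.numpy(), average="macro")  # averaging choice is an assumption
    return acc, f1, conf.mean().item()

# Confidence drop = mean confidence on original inputs minus mean confidence
# on the corresponding filtered inputs (tensor names are hypothetical).
# _, _, conf_orig = evaluate(model, original_imgs, labels)
# _, _, conf_hp   = evaluate(model, highpass_imgs, labels)
# confidence_drop_hp = conf_orig - conf_hp
```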
- ViT retains performance significantly better under both lowpass and highpass filtering.
- ResNet relies heavily on low-frequency content and fails when only high-frequency content remains.
- The confidence drop is much smaller for ViT, indicating better generalization across the frequency spectrum.
This project is licensed under the MIT License.