CNN vs ViT Spectral Analysis

This project investigates and compares the spectral properties of Convolutional Neural Networks (CNNs) and Vision Transformers (ViTs) using Fourier analysis. The goal is to understand how different architectures process frequency components of input data and how this affects robustness and generalization.

Table of Contents

  • Spatial-Frequency Representation: CNNs vs ViTs
  • Summary: How ViT and CNN Learn Frequency
  • Evaluation Results
  • License

Spatial-Frequency Representation: CNNs vs ViTs

Below are the feature maps and their corresponding FFTs for both models. For more details and the power spectral density (PSD) plots, refer to Notebook 2.
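
As a reference for how such spectra can be obtained, here is a minimal sketch: it registers forward hooks on a pretrained ResNet18, channel-averages each captured feature map, and computes its centered log-magnitude 2D FFT. The layer choices and the file name example.jpg are illustrative assumptions, not the notebook's exact code.

```python
import numpy as np
import torch
from PIL import Image
from torchvision import models, transforms

# Pretrained ResNet18; the early/mid/late layer choices below are
# illustrative assumptions, not necessarily those used in Notebook 2.
model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT).eval()

features = {}

def capture(name):
    def hook(module, inputs, output):
        features[name] = output.detach()
    return hook

model.layer1.register_forward_hook(capture("early"))
model.layer2.register_forward_hook(capture("mid"))
model.layer4.register_forward_hook(capture("late"))

preprocess = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

img = preprocess(Image.open("example.jpg").convert("RGB")).unsqueeze(0)
with torch.no_grad():
    model(img)

# Channel-averaged feature map -> centered 2D FFT -> log magnitude.
fmap = features["early"][0].mean(dim=0).numpy()
log_mag = np.log1p(np.abs(np.fft.fftshift(np.fft.fft2(fmap))))
```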

Example Image

(Figure: example input image)

Early Layers

ResNet

(Figures: ResNet early-layer feature map and its FFT)

ViT

(Figures: ViT early-layer feature map and its FFT)

Mid Layers

ResNet

(Figures: ResNet mid-layer feature map and its FFT)

ViT

(Figures: ViT mid-layer feature map and its FFT)

Late Layers

ResNet

(Figures: ResNet late-layer feature map and its FFT)

ViT

(Figures: ViT late-layer feature map and its FFT)
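
For the PSD plots referenced above, a radially averaged power spectrum collapses the 2D FFT into a 1D curve of power versus spatial frequency, which makes the two architectures directly comparable. A minimal sketch follows; the integer-radius binning is an assumption, and Notebook 2 may normalize differently.

```python
import numpy as np

def radial_psd(fmap):
    """Radially averaged power spectral density of a 2D feature map.

    Averages spectral power over annuli of integer radius around the
    DC component. The binning scheme and lack of normalization are
    assumptions; Notebook 2 may differ.
    """
    h, w = fmap.shape
    power = np.abs(np.fft.fftshift(np.fft.fft2(fmap))) ** 2
    yy, xx = np.ogrid[:h, :w]
    radius = np.sqrt((yy - h // 2) ** 2 + (xx - w // 2) ** 2).astype(int)
    counts = np.bincount(radius.ravel())
    sums = np.bincount(radius.ravel(), weights=power.ravel())
    return (sums / counts)[: min(h, w) // 2]  # DC out to the Nyquist edge
```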

CNNs and ViTs process spatial information differently, and this impacts how they encode frequency components:

  • CNNs:

    • Use convolutional filters with local receptive fields.
    • Naturally biased toward low-frequency structures (e.g., smooth regions and coarse shapes).
    • Hierarchical structure causes progressive abstraction of spatial frequency.
  • ViTs:

    • Use global attention with positional encoding.
    • Tend to retain more global and high-frequency information.
    • Attention heads can capture sparse yet global patterns without locality bias.

This makes ViTs generally more robust to low-frequency removal, while CNNs depend more heavily on low-frequency content being preserved.
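
To probe this directly, inputs can be filtered in the Fourier domain before classification. A minimal sketch using an ideal circular mask follows; the hard cutoff and per-channel application are assumptions, and Notebook 3 may use a different cutoff radius or a soft (e.g., Gaussian) mask.

```python
import numpy as np

def fourier_filter(channel, radius, mode="lowpass"):
    """Ideal circular lowpass/highpass filter for one 2D image channel.

    radius is the cutoff distance (in pixels) from the DC component of
    the centered spectrum. The hard (ideal) mask is an assumption; a
    soft mask would avoid ringing artifacts.
    """
    h, w = channel.shape
    spectrum = np.fft.fftshift(np.fft.fft2(channel))
    yy, xx = np.ogrid[:h, :w]
    dist = np.sqrt((yy - h // 2) ** 2 + (xx - w // 2) ** 2)
    mask = dist <= radius if mode == "lowpass" else dist > radius
    return np.real(np.fft.ifft2(np.fft.ifftshift(spectrum * mask)))

# Applied per channel, e.g. for an HxWx3 image array `img`:
# lowpass = np.stack([fourier_filter(img[..., c], 30) for c in range(3)], -1)
```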

Summary: How ViT and CNN Learn Frequency

| Aspect | CNN (ResNet18) | ViT (ViT-B/16) |
| --- | --- | --- |
| Frequency bias | Strong low-frequency preference | More balanced frequency representation |
| Local vs. global | Local receptive fields | Global attention mechanism |
| Feature map FFTs | Smooth, concentrated at low frequencies | Richer in high-frequency components |

Evaluation Results

We evaluated classification performance under original, lowpass-filtered, and highpass-filtered inputs; for details, refer to Notebook 3. The metrics used are listed below, followed by a sketch of how they can be computed:

  • Accuracy
  • F1 Score
  • Confidence Drop
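
A generic evaluation sketch is below; the loop structure, the macro-averaged F1, and the confidence-drop definition (mean top-class confidence on clean inputs minus that on filtered inputs) are assumptions about what Notebook 3 does.

```python
import numpy as np
import torch
import torch.nn.functional as F
from sklearn.metrics import accuracy_score, f1_score

def evaluate(model, loader, device="cpu"):
    """Accuracy, F1, and mean top-class confidence over a data loader.

    Macro-averaged F1 is an assumption; Notebook 3 may average
    differently.
    """
    model.eval()
    preds, labels, confs = [], [], []
    with torch.no_grad():
        for x, y in loader:
            probs = F.softmax(model(x.to(device)), dim=1)
            conf, pred = probs.max(dim=1)
            preds += pred.cpu().tolist()
            labels += y.tolist()
            confs += conf.cpu().tolist()
    return {
        "accuracy": accuracy_score(labels, preds),
        "f1": f1_score(labels, preds, average="macro"),
        "mean_confidence": float(np.mean(confs)),
    }

# Confidence drop: clean mean confidence minus filtered mean confidence.
# drop_hp = (evaluate(model, clean_loader)["mean_confidence"]
#            - evaluate(model, highpass_loader)["mean_confidence"])
```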

🔵 ResNet18

| Metric | Value |
| --- | --- |
| Accuracy (original) | 0.977 |
| Accuracy (lowpass) | 0.779 |
| Accuracy (highpass) | 0.261 |
| F1 Score (original) | 0.977 |
| F1 Score (lowpass) | 0.778 |
| F1 Score (highpass) | 0.197 |
| Confidence Drop (highpass) | 0.255 |
| Confidence Drop (lowpass) | 0.052 |

🟣 ViT-B/16

| Metric | Value |
| --- | --- |
| Accuracy (original) | 0.994 |
| Accuracy (lowpass) | 0.900 |
| Accuracy (highpass) | 0.467 |
| F1 Score (original) | 0.994 |
| F1 Score (lowpass) | 0.901 |
| F1 Score (highpass) | 0.452 |
| Confidence Drop (highpass) | 0.083 |
| Confidence Drop (lowpass) | 0.011 |

(Figure: evaluation results under original, lowpass, and highpass inputs)

Summary

  • ViT retains performance significantly better under both lowpass and highpass filtering.
  • ResNet relies heavily on low-frequency content and fails when only high-frequency content remains.
  • The confidence drop is much smaller for ViT, indicating better spectral generalization.

License

This project is licensed under the MIT License.
