Evaluating multi-task, multi-modal deep learning models for semantic art classification on the SemArt dataset using image and text inputs with ResNet18, EfficientNet-B0, and ViT-Tiny-Patch16-224.
- Best overall trade-off: EfficientNet-B0 delivered the strongest balance of predictive performance and computational efficiency
- EfficientNet-B0 test accuracy: Type 81%, Medium 80%, School 79%, Timeframe 60%
- Training time per epoch: EfficientNet-B0 ~20 min, ResNet18 ~22 min, ViT-Tiny ~40 min
This project benchmarks multi-task, multi-modal deep learning models that jointly predict multiple attributes of European paintings using both image and text inputs. The main goal is to compare predictive performance and computational efficiency across architectures under a consistent training and evaluation setup.
How do ResNet18, EfficientNet-B0, and ViT-Tiny perform in multi-task classification for predicting painting attributes using image and text inputs?
Predicted tasks:
- Timeframe
- School
- Type
- Medium (extracted from the technique attribute)
SemArt dataset:
- 21,328 European paintings
- Available attributes include author, title, date, technique, type, school, timeframe, and description
- Medium labels are derived from the technique field
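The exact parsing rules live in the notebooks; below is a hypothetical sketch of how a coarse Medium label can be derived from the free-text technique field (e.g. "Oil on canvas, 81 x 66 cm"). The keyword map is illustrative, not the project's actual mapping.

```python
# Hypothetical keyword map from technique substrings to coarse medium labels;
# unmatched entries fall back to a broad "Other" group, mirroring the
# classes discussed in the results below.
MEDIUM_KEYWORDS = {
    "canvas": "Canvas",
    "panel": "Wood",
    "wood": "Wood",
    "copper": "Metal",
    "paper": "Paper",
    "fresco": "Wall",
    "wall": "Wall",
}

def extract_medium(technique: str) -> str:
    """Map a free-text technique description to a coarse medium label."""
    text = technique.lower()
    for keyword, label in MEDIUM_KEYWORDS.items():
        if keyword in text:
            return label
    return "Other"

print(extract_medium("Oil on canvas, 81 x 66 cm"))  # -> Canvas
print(extract_medium("Fresco"))                     # -> Wall
```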
Notes:
- The dataset images are not included in this repository due to size constraints
- Use the official SemArt source to obtain the data
Image preprocessing and augmentation:
- Resizing and normalization
- Random horizontal flipping
- Random rotation
- Color jitter
- Random resized cropping
- Random erasing
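A minimal torchvision sketch of this pipeline; the crop scale, rotation range, jitter strengths, and erasing probability are assumed values, not necessarily the notebooks' exact settings. RandomErasing operates on tensors, so it comes after ToTensor.

```python
from torchvision import transforms

IMAGENET_MEAN = [0.485, 0.456, 0.406]
IMAGENET_STD = [0.229, 0.224, 0.225]

# Training: augmentation + normalization (magnitudes are assumptions).
train_transform = transforms.Compose([
    transforms.RandomResizedCrop(224, scale=(0.8, 1.0)),
    transforms.RandomHorizontalFlip(p=0.5),
    transforms.RandomRotation(degrees=10),
    transforms.ColorJitter(brightness=0.2, contrast=0.2, saturation=0.2),
    transforms.ToTensor(),
    transforms.Normalize(IMAGENET_MEAN, IMAGENET_STD),
    transforms.RandomErasing(p=0.25),  # applied on the normalized tensor
])

# Validation/test: resize and normalize only.
eval_transform = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(IMAGENET_MEAN, IMAGENET_STD),
])
```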
Text processing:
- DistilBERT base uncased is used to extract features from painting descriptions
- Only the last layer is fine-tuned to obtain meaningful text embeddings
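A sketch with Hugging Face transformers of freezing all but the last DistilBERT layer; using the first ([CLS]) token as the description embedding and a 256-token truncation length are assumptions.

```python
from transformers import DistilBertModel, DistilBertTokenizerFast

tokenizer = DistilBertTokenizerFast.from_pretrained("distilbert-base-uncased")
text_encoder = DistilBertModel.from_pretrained("distilbert-base-uncased")

# Freeze everything, then unfreeze only the last of the six transformer
# layers so the embeddings can adapt to painting descriptions.
for param in text_encoder.parameters():
    param.requires_grad = False
for param in text_encoder.transformer.layer[-1].parameters():
    param.requires_grad = True

def encode_descriptions(descriptions):
    """Return one 768-dim embedding per description."""
    batch = tokenizer(descriptions, padding=True, truncation=True,
                      max_length=256, return_tensors="pt")
    out = text_encoder(**batch)
    return out.last_hidden_state[:, 0]  # embedding at the [CLS] position
```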
Multi-modal fusion:
- Image features and text embeddings are combined in a fusion layer for multi-modal learning
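One plausible realization is concatenation followed by a linear projection, sketched below; the feature dimensions, fused width, and dropout rate are illustrative.

```python
import torch
import torch.nn as nn

class FusionLayer(nn.Module):
    """Concatenate image and text features and project to a shared space."""

    def __init__(self, img_dim=1280, text_dim=768, fused_dim=512, p_drop=0.3):
        # img_dim=1280 matches EfficientNet-B0 features and text_dim=768
        # matches DistilBERT; both are assumptions about the wiring.
        super().__init__()
        self.fuse = nn.Sequential(
            nn.Linear(img_dim + text_dim, fused_dim),
            nn.ReLU(),
            nn.Dropout(p_drop),
        )

    def forward(self, img_feat, text_feat):
        return self.fuse(torch.cat([img_feat, text_feat], dim=1))
```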
ResNet18:
- Pre-trained CNN with 11.7M parameters
- Dropout used to reduce overfitting
- Early layers are frozen to stabilize fine-tuning
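A sketch of this setup with torchvision; treating the stem plus the first two residual stages as the frozen "early layers" is an assumption.

```python
import torch.nn as nn
from torchvision import models

resnet = models.resnet18(weights=models.ResNet18_Weights.IMAGENET1K_V1)

# Freeze the stem and the first two residual stages; layer3/layer4 stay trainable.
for module in (resnet.conv1, resnet.bn1, resnet.layer1, resnet.layer2):
    for param in module.parameters():
        param.requires_grad = False

# Replace the classifier with dropout so the model emits 512-dim features
# for the fusion layer instead of ImageNet logits.
resnet.fc = nn.Dropout(p=0.3)
```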
EfficientNet-B0:
- CNN with 5.3M parameters
- Dropout used to reduce overfitting
- First blocks are frozen during fine-tuning
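A comparable sketch for EfficientNet-B0; freezing the first four of torchvision's nine feature blocks is an assumed cut-off.

```python
import torch.nn as nn
from torchvision import models

effnet = models.efficientnet_b0(weights=models.EfficientNet_B0_Weights.IMAGENET1K_V1)

# Freeze the stem and early MBConv stages; later blocks stay trainable.
for block in effnet.features[:4]:
    for param in block.parameters():
        param.requires_grad = False

# Keep dropout but drop the ImageNet head, exposing 1280-dim features.
effnet.classifier = nn.Dropout(p=0.2)
```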
ViT-Tiny-Patch16-224:
- Compact Vision Transformer
- Higher compute demand and more data-hungry than the CNN baselines
- Early blocks are frozen during fine-tuning
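A sketch using timm's vit_tiny_patch16_224; freezing the patch embedding plus the first six of twelve blocks is an assumed split.

```python
import timm

# num_classes=0 makes the model return pooled 192-dim features.
vit = timm.create_model("vit_tiny_patch16_224", pretrained=True, num_classes=0)

# Freeze the patch embedding and the first half of the transformer blocks.
for param in vit.patch_embed.parameters():
    param.requires_grad = False
for block in vit.blocks[:6]:
    for param in block.parameters():
        param.requires_grad = False
```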
Optimization:
- Adam optimizer
- Learning rate scheduling
- Weight decay for regularization
- Trained for 10 epochs
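A compact sketch of this setup; the learning rate, weight decay, and step-decay schedule are assumed values.

```python
import torch

def build_optimizer(model):
    """Adam over only the unfrozen parameters, with weight decay and a
    step learning-rate schedule (hyperparameters are assumptions)."""
    trainable = [p for p in model.parameters() if p.requires_grad]
    optimizer = torch.optim.Adam(trainable, lr=1e-4, weight_decay=1e-4)
    scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=3, gamma=0.5)
    return optimizer, scheduler
```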
Multi-task learning:
- Timeframe, School, Type, and Medium are predicted simultaneously using a shared representation with task-specific heads
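A sketch of the shared-trunk-plus-heads pattern with an unweighted sum of per-task cross-entropy losses; the class counts are placeholders, not SemArt's exact label sets.

```python
import torch.nn as nn

class MultiTaskHeads(nn.Module):
    """One linear head per task on top of the shared fused representation."""

    def __init__(self, fused_dim=512, n_classes=None):
        super().__init__()
        # Placeholder class counts, not the exact SemArt label sets.
        n_classes = n_classes or {"timeframe": 18, "school": 20,
                                  "type": 10, "medium": 8}
        self.heads = nn.ModuleDict(
            {task: nn.Linear(fused_dim, n) for task, n in n_classes.items()}
        )

    def forward(self, fused):
        return {task: head(fused) for task, head in self.heads.items()}

criterion = nn.CrossEntropyLoss()

def multi_task_loss(logits, targets):
    # Unweighted sum over tasks (an assumption; per-task weights are possible).
    return sum(criterion(logits[t], targets[t]) for t in logits)
```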
Model selection:
- Best checkpoint chosen based on validation performance
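A sketch of validation-based checkpointing; train_one_epoch and evaluate are hypothetical callables standing in for the notebooks' training and validation loops.

```python
import torch

def fit(model, optimizer, scheduler, train_one_epoch, evaluate, epochs=10,
        path="best_checkpoint.pt"):
    """Keep only the checkpoint with the best validation score."""
    best = 0.0
    for _ in range(epochs):
        train_one_epoch(model, optimizer)
        scheduler.step()
        # e.g. mean validation accuracy across the four tasks (an assumption)
        val_score = evaluate(model)
        if val_score > best:
            best = val_score
            torch.save(model.state_dict(), path)
    return best
```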
Metrics:
- Accuracy
- F1-score
- Confusion matrix analysis on test data
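A sketch of the per-task test reporting with scikit-learn; weighted F1 averaging is an assumption (macro averaging is equally plausible).

```python
from sklearn.metrics import accuracy_score, confusion_matrix, f1_score

def report_task(task_name, y_true, y_pred):
    """Print accuracy and F1 for one task and return its confusion matrix."""
    acc = accuracy_score(y_true, y_pred)
    f1 = f1_score(y_true, y_pred, average="weighted")
    print(f"{task_name}: accuracy={acc:.1%}, F1={f1:.1%}")
    return confusion_matrix(y_true, y_pred)
```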
Key findings:
- EfficientNet-B0 provides the best balance of accuracy and computational efficiency
- ViT achieves competitive accuracy but at higher computational cost
- All models struggle most on the Timeframe task
EfficientNet-B0 per-task test results:
- Timeframe: 60% accuracy, 60% F1
- School: 79% accuracy, 77% F1
- Type: 81% accuracy, 80% F1
- Medium: 80% accuracy, 79% F1
Timeframe:
- Earliest timeframes show weaker performance
- Best-classified timeframe: 1851 to 1900, at 94% accuracy
- Misclassifications often occur between neighboring historical periods
School:
- Italian school achieves 94% accuracy
- Rare schools such as Scottish, Swiss, and Bohemian show more misclassification
Type:
- High accuracy for landscape, religious, portrait, and still-life types
- Rare categories show more confusion
Medium:
- Strong performance for Canvas and Wall media
- Misclassifications are often absorbed by broad groups such as Other, Metal, and Paper
Per-epoch training time comparison:
- EfficientNet-B0: ~20 min
- ResNet18: ~22 min
- ViT-Tiny: ~40 min
EfficientNet-B0 is the preferred model because it balances predictive performance and computational efficiency. ViT-Tiny can achieve strong results but is computationally costly and more sensitive to data limitations. ResNet18 is lightweight and effective but shows lower overall accuracy.
Limitations:
- Training cost and time for transformer-based models
- Vision Transformers require larger datasets to perform optimally
- Class imbalance and under-represented categories reduce performance
Future work:
- Train on a larger and more diverse dataset
- Apply imbalance mitigation strategies
- Build an interactive application for real-time use cases
If you want to understand the work end-to-end, review the notebooks in the following order.
- notebooks/ResNet18_WDescription_Final2.ipynb: Multi-task training with ResNet18 using image and text inputs
- notebooks/EfficiNetB0_WDescription_Final2.ipynb: Multi-task training with EfficientNet-B0 using transfer learning, augmentation, and regularization
- notebooks/ViT_WDescription_Final.ipynb: Multi-task training with ViT-Tiny, including fine-tuning with frozen early blocks
- notebooks/Evaluation_on_Test_Data_EfficientNet_Best.ipynb: Test-set evaluation, metrics reporting, and confusion matrix analysis for the selected best checkpoint
Garcia, N., Renoust, B., Nakashima, Y. Understanding art through multi-modal retrieval in paintings. ICCV Workshops, 2019.