Evaluating multi-task, multi-modal deep learning models for semantic art classification on the SemArt dataset using image and text inputs with ResNet18, EfficientNet-B0, and ViT-Tiny-Patch16-224.
- Best overall trade-off: EfficientNet-B0 delivered the strongest balance of predictive performance and computational efficiency
- EfficientNet-B0 test accuracy: Type 81%, Medium 80%, School 79%, Timeframe 60%
- Training time per epoch: EfficientNet-B0 ~20 min, ResNet18 ~22 min, ViT-Tiny ~40 min
This project benchmarks multi-task, multi-modal deep learning models that jointly predict multiple attributes of European paintings using both image and text inputs. The main goal is to compare predictive performance and computational efficiency across architectures under a consistent training and evaluation setup.
How do ResNet18, EfficientNet-B0, and ViT-Tiny perform in multi-task classification for predicting painting attributes using image and text inputs?
Predicted tasks:
- Timeframe
- School
- Type
- Medium (extracted from the technique attribute)
SemArt dataset:
- 21,328 European paintings
- Available attributes include author, title, date, technique, type, school, timeframe, and description
- Medium labels are derived from the technique field
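The exact parsing rules live in the notebooks; below is a hypothetical sketch of how a coarse Medium label can be derived from the free-text technique field (e.g. "Oil on canvas, 81 x 66 cm"). The keyword map is illustrative, not the project's actual mapping.

```python
# Hypothetical keyword map from technique substrings to coarse medium labels;
# unmatched entries fall back to a broad "Other" group, mirroring the
# classes discussed in the results below.
MEDIUM_KEYWORDS = {
    "canvas": "Canvas",
    "panel": "Wood",
    "wood": "Wood",
    "copper": "Metal",
    "paper": "Paper",
    "fresco": "Wall",
    "wall": "Wall",
}

def extract_medium(technique: str) -> str:
    """Map a free-text technique description to a coarse medium label."""
    text = technique.lower()
    for keyword, label in MEDIUM_KEYWORDS.items():
        if keyword in text:
            return label
    return "Other"

print(extract_medium("Oil on canvas, 81 x 66 cm"))  # -> Canvas
print(extract_medium("Fresco"))                     # -> Wall
```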
Notes:
- The dataset images are not included in this repository due to size constraints
- Use the official SemArt source to obtain the data
Image preprocessing and augmentation:
- Resizing and normalization
- Random horizontal flipping
- Random rotation
- Color jitter
- Random resized cropping
- Random erasing
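A minimal torchvision sketch of this pipeline; the crop scale, rotation range, jitter strengths, and erasing probability are assumed values, not necessarily the notebooks' exact settings. RandomErasing operates on tensors, so it comes after ToTensor.

```python
from torchvision import transforms

IMAGENET_MEAN = [0.485, 0.456, 0.406]
IMAGENET_STD = [0.229, 0.224, 0.225]

# Training: augmentation + normalization (magnitudes are assumptions).
train_transform = transforms.Compose([
    transforms.RandomResizedCrop(224, scale=(0.8, 1.0)),
    transforms.RandomHorizontalFlip(p=0.5),
    transforms.RandomRotation(degrees=10),
    transforms.ColorJitter(brightness=0.2, contrast=0.2, saturation=0.2),
    transforms.ToTensor(),
    transforms.Normalize(IMAGENET_MEAN, IMAGENET_STD),
    transforms.RandomErasing(p=0.25),  # applied on the normalized tensor
])

# Validation/test: resize and normalize only.
eval_transform = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(IMAGENET_MEAN, IMAGENET_STD),
])
```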
Text processing:
- DistilBERT base uncased is used to extract features from painting descriptions
- Only the last layer is fine-tuned to obtain meaningful text embeddings
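A sketch with Hugging Face transformers of freezing all but the last DistilBERT layer; using the first ([CLS]) token as the description embedding and a 256-token truncation length are assumptions.

```python
from transformers import DistilBertModel, DistilBertTokenizerFast

tokenizer = DistilBertTokenizerFast.from_pretrained("distilbert-base-uncased")
text_encoder = DistilBertModel.from_pretrained("distilbert-base-uncased")

# Freeze everything, then unfreeze only the last of the six transformer
# layers so the embeddings can adapt to painting descriptions.
for param in text_encoder.parameters():
    param.requires_grad = False
for param in text_encoder.transformer.layer[-1].parameters():
    param.requires_grad = True

def encode_descriptions(descriptions):
    """Return one 768-dim embedding per description."""
    batch = tokenizer(descriptions, padding=True, truncation=True,
                      max_length=256, return_tensors="pt")
    out = text_encoder(**batch)
    return out.last_hidden_state[:, 0]  # embedding at the [CLS] position
```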
Multi-modal fusion:
- Image features and text embeddings are combined in a fusion layer for multi-modal learning
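One plausible realization is concatenation followed by a linear projection, sketched below; the feature dimensions, fused width, and dropout rate are illustrative.

```python
import torch
import torch.nn as nn

class FusionLayer(nn.Module):
    """Concatenate image and text features and project to a shared space."""

    def __init__(self, img_dim=1280, text_dim=768, fused_dim=512, p_drop=0.3):
        # img_dim=1280 matches EfficientNet-B0 features and text_dim=768
        # matches DistilBERT; both are assumptions about the wiring.
        super().__init__()
        self.fuse = nn.Sequential(
            nn.Linear(img_dim + text_dim, fused_dim),
            nn.ReLU(),
            nn.Dropout(p_drop),
        )

    def forward(self, img_feat, text_feat):
        return self.fuse(torch.cat([img_feat, text_feat], dim=1))
```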
ResNet18:
- Pre-trained CNN with 11.7M parameters
- Dropout used to reduce overfitting
- Early layers are frozen to stabilize fine-tuning
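A sketch of this setup with torchvision; treating the stem plus the first two residual stages as the frozen "early layers" is an assumption.

```python
import torch.nn as nn
from torchvision import models

resnet = models.resnet18(weights=models.ResNet18_Weights.IMAGENET1K_V1)

# Freeze the stem and the first two residual stages; layer3/layer4 stay trainable.
for module in (resnet.conv1, resnet.bn1, resnet.layer1, resnet.layer2):
    for param in module.parameters():
        param.requires_grad = False

# Replace the classifier with dropout so the model emits 512-dim features
# for the fusion layer instead of ImageNet logits.
resnet.fc = nn.Dropout(p=0.3)
```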
EfficientNet-B0:
- CNN with 5.3M parameters
- Dropout used to reduce overfitting
- First blocks are frozen during fine-tuning
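A comparable sketch for EfficientNet-B0; freezing the first four of torchvision's nine feature blocks is an assumed cut-off.

```python
import torch.nn as nn
from torchvision import models

effnet = models.efficientnet_b0(weights=models.EfficientNet_B0_Weights.IMAGENET1K_V1)

# Freeze the stem and early MBConv stages; later blocks stay trainable.
for block in effnet.features[:4]:
    for param in block.parameters():
        param.requires_grad = False

# Keep dropout but drop the ImageNet head, exposing 1280-dim features.
effnet.classifier = nn.Dropout(p=0.2)
```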
ViT-Tiny-Patch16-224:
- Compact Vision Transformer
- Higher compute demand and more data-hungry than the CNN baselines
- Early blocks are frozen during fine-tuning
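A sketch using timm's vit_tiny_patch16_224; freezing the patch embedding plus the first six of twelve blocks is an assumed split.

```python
import timm

# num_classes=0 makes the model return pooled 192-dim features.
vit = timm.create_model("vit_tiny_patch16_224", pretrained=True, num_classes=0)

# Freeze the patch embedding and the first half of the transformer blocks.
for param in vit.patch_embed.parameters():
    param.requires_grad = False
for block in vit.blocks[:6]:
    for param in block.parameters():
        param.requires_grad = False
```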
Optimization:
- Adam optimizer
- Learning rate scheduling
- Weight decay for regularization
- Trained for 10 epochs
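A compact sketch of this setup; the learning rate, weight decay, and step-decay schedule are assumed values.

```python
import torch

def build_optimizer(model):
    """Adam over only the unfrozen parameters, with weight decay and a
    step learning-rate schedule (hyperparameters are assumptions)."""
    trainable = [p for p in model.parameters() if p.requires_grad]
    optimizer = torch.optim.Adam(trainable, lr=1e-4, weight_decay=1e-4)
    scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=3, gamma=0.5)
    return optimizer, scheduler
```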
Multi-task learning:
- Timeframe, School, Type, and Medium are predicted simultaneously using a shared representation with task-specific heads
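A sketch of the shared-trunk-plus-heads pattern with an unweighted sum of per-task cross-entropy losses; the class counts are placeholders, not SemArt's exact label sets.

```python
import torch.nn as nn

class MultiTaskHeads(nn.Module):
    """One linear head per task on top of the shared fused representation."""

    def __init__(self, fused_dim=512, n_classes=None):
        super().__init__()
        # Placeholder class counts, not the exact SemArt label sets.
        n_classes = n_classes or {"timeframe": 18, "school": 20,
                                  "type": 10, "medium": 8}
        self.heads = nn.ModuleDict(
            {task: nn.Linear(fused_dim, n) for task, n in n_classes.items()}
        )

    def forward(self, fused):
        return {task: head(fused) for task, head in self.heads.items()}

criterion = nn.CrossEntropyLoss()

def multi_task_loss(logits, targets):
    # Unweighted sum over tasks (an assumption; per-task weights are possible).
    return sum(criterion(logits[t], targets[t]) for t in logits)
```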
Model selection:
- Best checkpoint chosen based on validation performance
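A sketch of validation-based checkpointing; train_one_epoch and evaluate are hypothetical callables standing in for the notebooks' training and validation loops.

```python
import torch

def fit(model, optimizer, scheduler, train_one_epoch, evaluate, epochs=10,
        path="best_checkpoint.pt"):
    """Keep only the checkpoint with the best validation score."""
    best = 0.0
    for _ in range(epochs):
        train_one_epoch(model, optimizer)
        scheduler.step()
        # e.g. mean validation accuracy across the four tasks (an assumption)
        val_score = evaluate(model)
        if val_score > best:
            best = val_score
            torch.save(model.state_dict(), path)
    return best
```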
Metrics:
- Accuracy
- F1-score
- Confusion matrix analysis on test data
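A sketch of the per-task test reporting with scikit-learn; weighted F1 averaging is an assumption (macro averaging is equally plausible).

```python
from sklearn.metrics import accuracy_score, confusion_matrix, f1_score

def report_task(task_name, y_true, y_pred):
    """Print accuracy and F1 for one task and return its confusion matrix."""
    acc = accuracy_score(y_true, y_pred)
    f1 = f1_score(y_true, y_pred, average="weighted")
    print(f"{task_name}: accuracy={acc:.1%}, F1={f1:.1%}")
    return confusion_matrix(y_true, y_pred)
```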
Key findings:
- EfficientNet-B0 provides the best balance of accuracy and computational efficiency
- ViT achieves competitive accuracy but at higher computational cost
- All models struggle most on the Timeframe task
EfficientNet-B0 per-task test results:
- Timeframe: 60% accuracy, 60% F1
- School: 79% accuracy, 77% F1
- Type: 81% accuracy, 80% F1
- Medium: 80% accuracy, 79% F1
Timeframe:
- Earliest timeframes show weaker performance
- Best-classified timeframe: 1851 to 1900, at 94% accuracy
- Misclassifications often occur between neighboring historical periods
School:
- Italian school achieves 94% accuracy
- Rare schools such as Scottish, Swiss, and Bohemian show more misclassification
Type:
- High accuracy for landscape, religious, portrait, and still-life types
- Rare categories show more confusion
Medium:
- Strong performance for Canvas and Wall media
- Misclassifications are often absorbed by broad groups such as Other, Metal, and Paper
Per-epoch training time comparison:
- EfficientNet-B0: ~20 min
- ResNet18: ~22 min
- ViT-Tiny: ~40 min
EfficientNet-B0 is the preferred model because it balances predictive performance and computational efficiency. ViT-Tiny can achieve strong results but is computationally costly and more sensitive to data limitations. ResNet18 is lightweight and effective but shows lower overall accuracy.
Limitations:
- Training cost and time for transformer-based models
- Vision Transformers require larger datasets to perform optimally
- Class imbalance and under-represented categories reduce performance
Future work:
- Train on a larger and more diverse dataset
- Apply imbalance mitigation strategies
- Build an interactive application for real-time use cases
If you want to understand the work end-to-end, review the notebooks in the following order.
- notebooks/ResNet18_WDescription_Final2.ipynb: Multi-task training with ResNet18 using image and text inputs
- notebooks/EfficiNetB0_WDescription_Final2.ipynb: Multi-task training with EfficientNet-B0 using transfer learning, augmentation, and regularization
- notebooks/ViT_WDescription_Final.ipynb: Multi-task training with ViT-Tiny, including fine-tuning with frozen early blocks
- notebooks/Evaluation_on_Test_Data_EfficientNet_Best.ipynb: Test-set evaluation, metrics reporting, and confusion matrix analysis for the selected best checkpoint
Garcia, N., Renoust, B., Nakashima, Y. Understanding art through multi-modal retrieval in paintings. ICCV Workshops, 2019.