Multi-task deep learning in TensorFlow for art classification across type, school, and timeframe


BoraKucukarslan/Multi-Task-Learning-for-Art-Classification


Multi-Task Learning for Semantic Art Classification

Evaluation of multi-task, multi-modal deep learning models (ResNet18, EfficientNet-B0, and ViT-Tiny-Patch16-224) for semantic art classification on the SemArt dataset, using both image and text inputs.

Key Results

  • Best overall trade-off: EfficientNet-B0 delivered the strongest balance of predictive performance and computational efficiency
  • EfficientNet-B0 test accuracy: Type 81%, Medium 80%, School 79%, Timeframe 60%
  • Training time per epoch: EfficientNet-B0 ~20 min, ResNet18 ~22 min, ViT-Tiny ~40 min

Project Overview

This project benchmarks multi-task, multi-modal deep learning models that jointly predict multiple attributes of European paintings using both image and text inputs. The main goal is to compare predictive performance and computational efficiency across architectures under a consistent training and evaluation setup.

Research Question

How do ResNet18, EfficientNet-B0, and ViT-Tiny perform in multi-task classification for predicting painting attributes using image and text inputs?

Predicted tasks:

  • Timeframe
  • School
  • Type
  • Medium (extracted from the technique attribute)

Dataset

SemArt dataset:

  • 21,328 European paintings
  • Available attributes include author, title, date, technique, type, school, timeframe, and description
  • Medium labels are derived from the technique field

Notes:

  • The dataset images are not included in this repository due to size constraints
  • Use the official SemArt source to obtain the data
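Since Medium labels are derived from the free-text technique field, the derivation amounts to keyword matching against the technique string. The repository does not document its exact rules, so the patterns below are purely illustrative; a minimal pure-Python sketch:

```python
import re

# Hypothetical mapping from SemArt "technique" strings (e.g. "Oil on canvas")
# to coarse medium labels. These patterns are illustrative assumptions, not
# the repository's actual rules.
MEDIUM_PATTERNS = [
    (re.compile(r"\bcanvas\b", re.I), "Canvas"),
    (re.compile(r"\b(wood|panel|oak)\b", re.I), "Wood"),
    (re.compile(r"\b(fresco|wall)\b", re.I), "Wall"),
    (re.compile(r"\bpaper\b", re.I), "Paper"),
    (re.compile(r"\b(copper|bronze|metal)\b", re.I), "Metal"),
]

def extract_medium(technique: str) -> str:
    """Map a free-text technique description to a coarse medium label."""
    for pattern, label in MEDIUM_PATTERNS:
        if pattern.search(technique):
            return label
    return "Other"  # fallback bucket for unmatched techniques
```

Unmatched techniques fall into a broad Other bucket, which is consistent with the confusion-matrix observation below that misclassifications are often absorbed by broad groups.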

Methodology

Data Processing and Feature Engineering

Image preprocessing and augmentation:

  • Resizing and normalization
  • Random horizontal flipping
  • Random rotation
  • Color jitter
  • Random resized cropping
  • Random erasing

Text processing:

  • DistilBERT base uncased is used to extract features from painting descriptions
  • Only the last layer is fine-tuned to obtain meaningful text embeddings

Multi-modal fusion:

  • Image features and text embeddings are combined in a fusion layer for multi-modal learning
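A simple way to realize this fusion is to concatenate the backbone's pooled image features with the DistilBERT text embedding and project the result through a small MLP. The sketch below is a PyTorch assumption (512-d image features as from ResNet18, 768-d DistilBERT embeddings), not the repository's exact module:

```python
import torch
import torch.nn as nn

class FusionHead(nn.Module):
    """Concatenate image features with a text embedding, then project.
    Dimensions are illustrative: 512-d image features, 768-d text features."""

    def __init__(self, img_dim=512, txt_dim=768, fused_dim=512):
        super().__init__()
        self.fuse = nn.Sequential(
            nn.Linear(img_dim + txt_dim, fused_dim),
            nn.ReLU(),
            nn.Dropout(0.3),  # dropout rate is an assumption
        )

    def forward(self, img_feat, txt_feat):
        # (batch, img_dim) + (batch, txt_dim) -> (batch, fused_dim)
        return self.fuse(torch.cat([img_feat, txt_feat], dim=1))
```

The fused representation is then shared by all task-specific classification heads.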

Model Architectures

ResNet18:

  • Pre-trained CNN with 11.7M parameters
  • Dropout used to reduce overfitting
  • Early layers are frozen to stabilize fine-tuning

EfficientNet-B0:

  • CNN with 5.3M parameters
  • Dropout used to reduce overfitting
  • First blocks are frozen during fine-tuning

ViT-Tiny-Patch16-224:

  • Compact Vision Transformer
  • Highest compute demand and most data-hungry of the three models
  • Early blocks are frozen during fine-tuning

Training Setup

Optimization:

  • Adam optimizer
  • Learning rate scheduling
  • Weight decay for regularization
  • Trained for 10 epochs
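In PyTorch terms, this setup combines Adam with weight decay and a learning-rate scheduler over the 10 epochs. The scheduler type (StepLR) and all hyperparameter values below are assumptions for illustration:

```python
from torch import nn, optim

model = nn.Linear(512, 10)  # stand-in for the real multi-task network
optimizer = optim.Adam(model.parameters(), lr=1e-4, weight_decay=1e-4)
scheduler = optim.lr_scheduler.StepLR(optimizer, step_size=3, gamma=0.5)

for epoch in range(10):
    # ... forward/backward passes over training batches would go here ...
    scheduler.step()  # halve the learning rate every 3 epochs
```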

Multi-task learning:

  • Timeframe, School, Type, and Medium are predicted simultaneously using a shared representation with task-specific heads

Evaluation

Model selection:

  • Best checkpoint chosen based on validation performance

Metrics:

  • Accuracy
  • F1-score
  • Confusion matrix analysis on test data
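For reference, the two headline metrics can be computed in a few lines of plain Python (the notebooks likely use a library such as scikit-learn instead; macro averaging is assumed here):

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def macro_f1(y_true, y_pred):
    """Per-class F1, averaged with equal weight per class (macro average)."""
    labels = sorted(set(y_true) | set(y_pred))
    f1_scores = []
    for c in labels:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        f1_scores.append(2 * tp / (2 * tp + fp + fn) if tp else 0.0)
    return sum(f1_scores) / len(f1_scores)
```

Macro averaging weights every class equally, which is why rare schools and timeframes pull the F1 figures below the raw accuracies in the tables that follow.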

Results Summary

Overall Findings

  • EfficientNet-B0 provides the best balance of accuracy and computational efficiency
  • ViT achieves competitive accuracy but at higher computational cost
  • All models struggle most on the Timeframe task

EfficientNet-B0 Test Performance

  • Timeframe: 60% accuracy, 60% F1
  • School: 79% accuracy, 77% F1
  • Type: 81% accuracy, 80% F1
  • Medium: 80% accuracy, 79% F1

Selected Observations

Timeframe:

  • Earliest timeframes show weaker performance
  • Best-classified timeframe: 1851 to 1900, at 94% accuracy
  • Misclassifications often occur between neighboring historical periods

School:

  • Italian school achieves 94% accuracy
  • Rare schools such as Scottish, Swiss, and Bohemian show more misclassification

Type:

  • High accuracy for landscape, religious, portrait, and still-life types
  • Rare categories show more confusion

Medium:

  • Strong performance for Canvas and Wall media
  • Misclassifications are often absorbed by broad groups such as Other, Metal, and Paper

Computational Efficiency

Per-epoch training time comparison:

  • EfficientNet-B0: ~20 min
  • ResNet18: ~22 min
  • ViT-Tiny: ~40 min

Conclusion

EfficientNet-B0 is the preferred model because it balances predictive performance and computational efficiency. ViT-Tiny can achieve strong results but is computationally costly and more sensitive to data limitations. ResNet18 is lightweight and effective but shows lower overall accuracy.

Limitations and Future Work

Limitations:

  • Training cost and time for transformer-based models
  • Vision Transformers require larger datasets to perform optimally
  • Class imbalance and under-represented categories reduce performance

Future work:

  • Train on a larger and more diverse dataset
  • Apply imbalance mitigation strategies
  • Build an interactive application for real-time use cases

Notebook Guide

If you want to understand the work end-to-end, review the notebooks in the following order.

Training notebooks

  • notebooks/ResNet18_WDescription_Final2.ipynb
    Multi-task training with ResNet18 using image and text inputs

  • notebooks/EfficiNetB0_WDescription_Final2.ipynb
    Multi-task training with EfficientNet-B0 using transfer learning, augmentation, and regularization

  • notebooks/ViT_WDescription_Final.ipynb
    Multi-task training with ViT-Tiny, including fine-tuning with frozen early blocks

Evaluation notebook

  • notebooks/Evaluation_on_Test_Data_EfficientNet_Best.ipynb
    Test-set evaluation, metrics reporting, and confusion matrix analysis for the selected best checkpoint

Reference

Garcia, N., Renoust, B., Nakashima, Y. Understanding art through multi-modal retrieval in paintings. ICCV Workshops, 2019.
