Caption generation is a challenging artificial intelligence problem where a textual description must be generated for a given photograph.
It requires methods from computer vision to understand the content of the image, and a language model from the field of natural language processing to turn that understanding into words in the right order. Recently, deep learning methods have achieved state-of-the-art results on examples of this problem.
I would like to thank Jason Brownlee for his wonderful blog, which helped me learn how to build an Image Caption Generator.
This project needs a lot of RAM: 32 GB/64 GB. You can use either an AWS EC2 instance or Google Colaboratory [the one I used].
This project requires a lot of modules and packages. These can be installed from the requirements.txt file using the following command:
pip install -r requirements.txt for Python 2.x
pip3 install -r requirements.txt for Python 3.x
All the helper functions needed for this project are in the utility.py file.
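utility.py is not reproduced here, but to give a flavour of what such a helper might look like, below is a minimal sketch of a caption-loading function. The name `load_descriptions` and the tab-separated file format are assumptions for illustration, not the project's confirmed API:

```python
# Hypothetical example of a helper that could live in utility.py:
# parse a Flickr-style descriptions file into {image_id: [caption, ...]}.

def load_descriptions(text_path):
    """Map each image ID to its list of captions.

    Assumes each line looks like: '<image_id>.jpg#<n>\t<caption>'.
    """
    descriptions = {}
    with open(text_path, 'r') as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            image_token, caption = line.split('\t', 1)
            image_id = image_token.split('.')[0]  # drop '.jpg#n'
            descriptions.setdefault(image_id, []).append(caption)
    return descriptions
```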
The dataset can be downloaded from Kaggle from here. You can use the already created feature file [features extracted from images] located in the Features folder. It is compressed, so you need to unzip it first. A sketch of how such features can be extracted appears after the list below.
The dataset consists of 2 files:
- Images
- Description and Image IDs
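For reference, here is a hedged sketch of how a feature file like the one in the Features folder could be produced, assuming a VGG16 encoder with its classification layer removed. The directory name `Flickr8k_Dataset` and the output name `features.pkl` are illustrative; the actual extractor used for the provided file may differ:

```python
# Illustrative feature extraction: encode every image as the 4096-d
# output of VGG16's second-to-last (fc2) layer.
import os
import pickle

from keras.applications.vgg16 import VGG16, preprocess_input
from keras.models import Model
from keras.preprocessing.image import load_img, img_to_array

def extract_features(image_dir):
    # Drop the final classification layer; keep the 4096-d fc2 output.
    base = VGG16()
    model = Model(inputs=base.inputs, outputs=base.layers[-2].output)
    features = {}
    for name in os.listdir(image_dir):
        image = load_img(os.path.join(image_dir, name), target_size=(224, 224))
        array = preprocess_input(img_to_array(image).reshape((1, 224, 224, 3)))
        features[name.split('.')[0]] = model.predict(array, verbose=0)
    return features

features = extract_features('Flickr8k_Dataset')  # hypothetical folder name
pickle.dump(features, open('features.pkl', 'wb'))
```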
Now comes the training part. To train the model defined in model.py, run the train.py file.
Remember, training may take a long time to run depending on the configuration of the machine; each epoch takes around 15-20 minutes.
Training needs 4 arguments:
- textPath
- trainPath
- devPath
- features
python train.py --textPath /Path to Textfile/ --trainPath /Path to trainfile/ --devPath /Path to valimages/ --features /Path to features/
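train.py itself is not shown here; as an illustration, the four arguments above could be wired up with argparse roughly like this (a sketch, not the file's actual contents):

```python
# Hypothetical sketch of the argument parsing train.py would need
# to accept the four paths shown above.
import argparse

parser = argparse.ArgumentParser(description='Train the caption model.')
parser.add_argument('--textPath', required=True,
                    help='Path to the descriptions text file')
parser.add_argument('--trainPath', required=True,
                    help='Path to the file listing training image IDs')
parser.add_argument('--devPath', required=True,
                    help='Path to the file listing validation image IDs')
parser.add_argument('--features', required=True,
                    help='Path to the pickled image-feature file')
args = parser.parse_args()
```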
After training, we will evaluate our model on the test dataset. Run the following command:
python evaluate.py --testPath /Path to testfile/
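evaluate.py typically scores the generated captions against the reference captions with BLEU. Below is a minimal sketch using NLTK's corpus_bleu; the `generate_caption` helper is hypothetical and stands in for whatever caption-generation routine the project actually uses:

```python
# Illustrative BLEU scoring over a test set.
from nltk.translate.bleu_score import corpus_bleu

def evaluate_model(model, descriptions, features, tokenizer, max_length):
    actual, predicted = [], []
    for image_id, captions in descriptions.items():
        # generate_caption() is a hypothetical helper producing one caption string.
        yhat = generate_caption(model, tokenizer, features[image_id], max_length)
        actual.append([c.split() for c in captions])
        predicted.append(yhat.split())
    print('BLEU-1: %f' % corpus_bleu(actual, predicted, weights=(1.0, 0, 0, 0)))
    print('BLEU-4: %f' % corpus_bleu(actual, predicted, weights=(0.25, 0.25, 0.25, 0.25)))
```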
The Model Architecture:
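model.py is not reproduced in this README; as a hedged sketch, a common choice for this task is the "merge" architecture, which combines a dense projection of the image features with an LSTM over the partial caption. The layer sizes below are typical defaults, not necessarily the ones used in model.py:

```python
# Sketch of the common "merge" caption-generation architecture,
# assuming a 4096-d image feature vector (e.g. from VGG16 fc2).
from keras.layers import Input, Dense, Dropout, Embedding, LSTM, add
from keras.models import Model

def define_model(vocab_size, max_length):
    # Image feature branch.
    inputs1 = Input(shape=(4096,))
    fe1 = Dropout(0.5)(inputs1)
    fe2 = Dense(256, activation='relu')(fe1)
    # Caption sequence branch.
    inputs2 = Input(shape=(max_length,))
    se1 = Embedding(vocab_size, 256, mask_zero=True)(inputs2)
    se2 = Dropout(0.5)(se1)
    se3 = LSTM(256)(se2)
    # Merge the two branches and predict the next word.
    decoder1 = add([fe2, se3])
    decoder2 = Dense(256, activation='relu')(decoder1)
    outputs = Dense(vocab_size, activation='softmax')(decoder2)
    model = Model(inputs=[inputs1, inputs2], outputs=outputs)
    model.compile(loss='categorical_crossentropy', optimizer='adam')
    return model
```

The two branches are merged by element-wise addition, and the final softmax predicts the next word of the caption one step at a time.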
