- Create a new Anaconda environment with Python version 3.10:

  `$ conda create -n next-word-prediction python=3.10`
- Install all dependencies:

  `$ pip install -r requirements.txt`
- The training script is located in `train.py`.
- `Next Word Prediction (Approach 1).ipynb` is the same as `train.py`, but in Jupyter Notebook format.
- `next-word-prediction (Approach 2 - Google USE).ipynb` is the second approach, which uses the Google Universal Sentence Encoder to retrieve the embedding of the input text.
- `Next Word Prediction (word2vec).ipynb` is the third approach, which uses word2vec to retrieve the embedding.
- The most difficult part of this project was the overfitting issue. I tried some popular techniques to reduce overfitting, such as dropout and L2 regularization, but none of them produced a significant improvement.
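For reference, here is a minimal sketch of what dropout and an L2 penalty can look like on a Keras next-word classifier. The framework, layer sizes, and vocabulary size are assumptions for illustration, not the exact configuration in `train.py`:

```python
# Hypothetical sketch (not the repo's actual model): attaching Dropout and an
# L2 kernel penalty to a simple recurrent next-word classifier.
import tensorflow as tf

vocab_size = 10_000   # assumed vocabulary size
seq_len = 5           # assumed context window length

model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(seq_len, vocab_size)),   # one-hot encoded context words
    tf.keras.layers.LSTM(128),
    tf.keras.layers.Dropout(0.3),                          # randomly zero 30% of activations
    tf.keras.layers.Dense(
        vocab_size,
        activation="softmax",
        kernel_regularizer=tf.keras.regularizers.l2(1e-4),  # L2 weight penalty
    ),
])
model.compile(optimizer="adam", loss="categorical_crossentropy")
```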
- One major reason for the overfitting is that both my input data and output data are too sparse; in particular, the labels are one-hot vectors. This makes it hard for the model to learn and generalize.
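To make the sparsity concrete, here is a small illustration (the vocabulary size is a made-up placeholder) of how little information a one-hot label carries relative to its size:

```python
# Illustration only: with a large vocabulary, every one-hot target is a long
# vector that contains a single 1 and is otherwise all zeros.
import numpy as np
from tensorflow.keras.utils import to_categorical

vocab_size = 10_000                       # placeholder vocabulary size
next_word_ids = np.array([42, 7, 9981])   # integer ids of the true next words

y = to_categorical(next_word_ids, num_classes=vocab_size)
print(y.shape)        # (3, 10000) -- 30,000 stored values for just 3 labels
print(y.sum(axis=1))  # [1. 1. 1.] -- exactly one non-zero entry per row
```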
- What I can do instead is to replace the sparse input vectors with word embeddings. Embeddings are dense, low-dimensional representations of words that capture semantic similarity.
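As a sketch (dimensions are illustrative assumptions, not tuned values), the one-hot inputs could be swapped for an `Embedding` layer that maps integer word ids to dense vectors:

```python
# Sketch: feed integer word ids through a learned embedding layer instead of
# one-hot vectors; 100-dim dense vectors replace 10,000-dim sparse ones.
import tensorflow as tf

vocab_size = 10_000
embedding_dim = 100
seq_len = 5

model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(seq_len,)),               # integer word ids
    tf.keras.layers.Embedding(vocab_size, embedding_dim),  # dense representations
    tf.keras.layers.LSTM(128),
    tf.keras.layers.Dense(vocab_size, activation="softmax"),
])
```

The embedding weights could also be initialized from pretrained vectors such as word2vec, in the spirit of the third notebook.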
- Similarly, I should consider using embeddings as the target output as well. This can be achieved by using an embedding layer for the output and training with a cosine similarity loss.
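A minimal sketch of that idea, assuming a pretrained `(vocab_size, embedding_dim)` embedding matrix (e.g. from word2vec) is available; the names and sizes here are placeholders:

```python
# Sketch: predict a dense embedding for the next word and train against the
# true word's embedding with a cosine-similarity loss.
import numpy as np
import tensorflow as tf

vocab_size, embedding_dim, seq_len = 10_000, 100, 5
pretrained_embeddings = np.random.rand(vocab_size, embedding_dim).astype("float32")  # placeholder

model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(seq_len,)),
    tf.keras.layers.Embedding(vocab_size, embedding_dim),
    tf.keras.layers.LSTM(128),
    tf.keras.layers.Dense(embedding_dim),   # a dense embedding, not a vocab-sized distribution
])
# Keras' CosineSimilarity loss returns -1 when the prediction points in the
# same direction as the target, so minimizing it maximizes similarity.
model.compile(optimizer="adam", loss=tf.keras.losses.CosineSimilarity())

# Targets are the embeddings of the true next words.
next_word_ids = np.array([42, 7, 9981])
targets = pretrained_embeddings[next_word_ids]
```

At inference time, the predicted vector would be mapped back to a word by nearest-neighbour search over the embedding matrix.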
- I could also try a different model architecture, such as a transformer, to improve performance.
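A very rough sketch of what a single transformer-style block could look like in Keras (hyperparameters are placeholders, and positional embeddings are omitted for brevity):

```python
# Sketch: one self-attention block followed by a softmax over the vocabulary.
import tensorflow as tf

vocab_size, embedding_dim, seq_len, num_heads = 10_000, 128, 5, 4

inputs = tf.keras.Input(shape=(seq_len,))
x = tf.keras.layers.Embedding(vocab_size, embedding_dim)(inputs)
attn = tf.keras.layers.MultiHeadAttention(num_heads=num_heads, key_dim=embedding_dim)(x, x)
x = tf.keras.layers.LayerNormalization()(x + attn)               # residual + norm
ff = tf.keras.layers.Dense(embedding_dim, activation="relu")(x)  # position-wise feed-forward
x = tf.keras.layers.LayerNormalization()(x + ff)
x = tf.keras.layers.GlobalAveragePooling1D()(x)
outputs = tf.keras.layers.Dense(vocab_size, activation="softmax")(x)
model = tf.keras.Model(inputs, outputs)
```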
- If time permits, I also intend to dockerize everything into a container for easy deployment.