Project for the ALTEGRAD (Advanced AI for Texts and Graphs) class in the Master in Data Science at IP Paris. Created by Antoine Gilson, Paul Lemoine Vandermoere and Alexandre Zenou.
A graph neural network for retrieving text descriptions of molecular graphs.
```
pip install -r requirements.txt
```
The data comes from the "Molecular Graph Captioning" challenge on Kaggle (https://www.kaggle.com/competitions/molecular-graph-captioning/data).
Place your preprocessed graph data files in the data/ directory:
- train_graphs.pkl
- validation_graphs.pkl
- test_graphs.pkl
Run the following scripts in order:
Check the structure and contents of your graph files:
```
python inspect_graph_data.py
```
Create embeddings for molecular descriptions:
```
python generate_description_embeddings.py
```
This generates:
- data/train_embeddings.csv
- data/validation_embeddings.csv
Main updates compared to the baseline:
- Support for multiple pretrained text encoders (including scientific and retrieval-optimized models)
- Increased maximum sequence length to reduce truncation
- Mean pooling over non-padding tokens instead of CLS-only pooling
- L2 normalization of embeddings for stable cosine similarity
- Robust handling of empty or malformed descriptions
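The pooling and normalization steps listed above can be sketched as follows. This is a minimal illustration, not the exact code in `generate_description_embeddings.py`; the function name and tensor shapes are assumptions.

```python
import torch
import torch.nn.functional as F

def mean_pool_normalize(token_embeddings: torch.Tensor,
                        attention_mask: torch.Tensor) -> torch.Tensor:
    """Mean-pool over non-padding tokens, then L2-normalize.

    token_embeddings: (batch, seq_len, dim) encoder outputs
    attention_mask:   (batch, seq_len), 1 for real tokens, 0 for padding
    """
    mask = attention_mask.unsqueeze(-1).float()      # (batch, seq_len, 1)
    summed = (token_embeddings * mask).sum(dim=1)    # sum of real-token vectors
    counts = mask.sum(dim=1).clamp(min=1e-9)         # avoid division by zero
    pooled = summed / counts                         # mean over real tokens only
    return F.normalize(pooled, p=2, dim=1)           # unit-norm for cosine similarity
```

Because padding tokens are masked out before averaging, the embedding of a description does not depend on how much padding the batch adds, and the final L2 normalization makes dot products equal to cosine similarities.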
Train the graph neural network:
```
python train_gcn.py
```
Main updates compared to the baseline:
- Use of explicit node and edge features instead of feature-free graphs
- GINE-based message passing to incorporate edge attributes
- Multi-pooling readout (mean, max, sum) at the graph level
- Projection head with normalization and dropout
- Contrastive training objective (InfoNCE-style) instead of embedding regression
- L2-normalized graph embeddings for retrieval
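The InfoNCE-style contrastive objective listed above can be sketched as below. This is an illustrative version under standard assumptions (symmetric graph-to-text and text-to-graph terms, in-batch negatives, a temperature of 0.07); the actual loss in `train_gcn.py` may differ in details.

```python
import torch
import torch.nn.functional as F

def info_nce_loss(graph_emb: torch.Tensor,
                  text_emb: torch.Tensor,
                  temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE over a batch of paired (graph, text) embeddings.

    Both inputs are (batch, dim); matching pairs share a row index, and
    all other rows in the batch act as negatives.
    """
    g = F.normalize(graph_emb, dim=1)
    t = F.normalize(text_emb, dim=1)
    logits = g @ t.T / temperature                      # cosine similarity matrix
    targets = torch.arange(g.size(0), device=g.device)  # diagonal = positives
    # cross-entropy in both retrieval directions, averaged
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.T, targets))
```

Unlike embedding regression, this objective only asks that each graph be closer to its own description than to the other descriptions in the batch, which directly matches the retrieval task.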
This creates a model file `model_{ARCH}_{POOL}_{JK_MODE}_output.pt`.
Retrieve descriptions for test molecules:
```
python retrieval_answer.py
```
Main updates compared to the baseline:
- Replacement of hard nearest-neighbor retrieval with a weighted top-k strategy
- Aggregation of multiple close textual candidates using similarity-based weights
- Improved robustness to noise in the embedding space
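The weighted top-k strategy listed above can be sketched as follows. This is a simplified illustration, not the exact code in `retrieval_answer.py`; the function name, the value of k, and the softmax temperature are assumptions.

```python
import torch

def weighted_topk_retrieval(graph_emb: torch.Tensor,
                            text_embs: torch.Tensor,
                            k: int = 5,
                            temperature: float = 0.05):
    """Score all candidate descriptions for one graph and return the
    top-k candidates with similarity-based softmax weights.

    graph_emb: (dim,) L2-normalized graph embedding
    text_embs: (num_texts, dim) L2-normalized text embeddings
    """
    sims = text_embs @ graph_emb                 # cosine similarities, (num_texts,)
    top_sims, top_idx = sims.topk(k)             # k closest candidates
    weights = torch.softmax(top_sims / temperature, dim=0)
    return top_idx, weights
```

Hard nearest-neighbor retrieval is the special case k=1; with k>1, several close candidates contribute in proportion to their similarity, which dampens the effect of a single noisy neighbor in the embedding space.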
This generates `test_retrieved_descriptions.csv` with a retrieved description for each test molecule.
- `model_{ARCH}_{POOL}_{JK_MODE}_output.pt`: trained GCN model
- `test_retrieved_descriptions.csv`: retrieved descriptions for the test set
The proposed modifications consistently improved retrieval performance compared to the baseline pipeline, with gains primarily driven by stronger text embeddings and contrastive alignment between graph and text representations. Improvements in the graph encoder and the retrieval strategy further increased robustness, especially in cases where multiple semantically similar descriptions exist. Overall, the system achieves stable performance without extensive hyperparameter tuning, suggesting that results are driven by principled architectural choices rather than overfitting.
Without any use of external data, we achieved a performance of 0.644 on the hidden test set.