DarkAdin/ESM2_15B

ESM2_15B

This repository holds the downstream analysis of protein embeddings generated by the state-of-the-art ESM2-15B model. The embeddings are available at https://huggingface.co/datasets/Darkadin/ESM2_embeddings_Human_Mouse

The input data was obtained from the following UniProt link, as it stood in May 2025: https://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/complete/uniprot_sprot.dat.gz

More species might be added in the future. Check the original repository for more details.

The goal of this repository is to show different ways of analyzing the embeddings together with their associated metadata.

Classification results

A quick classification pass with a LightGBM classifier produced the metrics below for the 20 most common GO annotations, predicted on a test partition of the data. The table is ordered by F1 score, from highest to lowest. See the notebook for more details.
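To make "the 20 most common GO annotations" concrete, the following is a minimal sketch of how such labels could be selected and turned into one binary target per GO term. The `protein_go` mapping, the `top_go_labels` helper and the toy accessions are hypothetical illustrations, not taken from the notebook, which may prepare its labels differently.

```python
from collections import Counter


def top_go_labels(protein_go, k=20):
    """Pick the k most frequent GO terms and build a multi-hot label matrix.

    protein_go: dict mapping a protein identifier to its list of GO terms
                (hypothetical structure, assumed for this sketch).
    """
    # Count each GO term once per protein, even if it is listed twice.
    counts = Counter(term for terms in protein_go.values() for term in set(terms))
    top_terms = [term for term, _ in counts.most_common(k)]

    # One row per protein (sorted for reproducibility), one column per GO term.
    proteins = sorted(protein_go)
    labels = [
        [1 if term in protein_go[protein] else 0 for term in top_terms]
        for protein in proteins
    ]
    return top_terms, proteins, labels


# Toy data with made-up protein identifiers.
protein_go = {
    "PROT_A": ["GO:0005524", "GO:0005634"],
    "PROT_B": ["GO:0005524"],
    "PROT_C": ["GO:0005634", "GO:0005524", "GO:0008270"],
    "PROT_D": ["GO:0005829"],
}

top_terms, proteins, labels = top_go_labels(protein_go, k=2)
for protein, row in zip(proteins, labels):
    print(protein, dict(zip(top_terms, row)))
```

Each column of `labels` can then serve as the target of an independent binary classifier trained on the embeddings, which matches the per-label metrics reported here.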

Accuracy should generally be disregarded in this case. For any individual label, the vast majority of proteins in the dataset do not carry it (its value is 0), so predicting 0 everywhere already yields a very high accuracy.
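The effect is easy to reproduce with a small sketch. The label counts below (30 positives among 1000 proteins) are invented for illustration and do not come from the dataset, and `binary_metrics` is a hypothetical helper written for this example.

```python
def binary_metrics(y_true, y_pred):
    """Compute accuracy, precision, recall and F1 for binary labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)

    accuracy = (tp + tn) / len(pairs)
    # Guard against division by zero when nothing is predicted positive.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if (precision + recall)
        else 0.0
    )
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Made-up, imbalanced label: only 30 of 1000 proteins carry the annotation.
y_true = [1] * 30 + [0] * 970

# A "classifier" that never predicts the annotation.
all_zero = [0] * 1000

baseline = binary_metrics(y_true, all_zero)
print(baseline)  # accuracy is 0.97 while F1, precision and recall are all 0.0
```

The all-zero baseline reaches 97% accuracy without identifying a single annotated protein, which is why F1, precision and recall are the more informative columns.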

| Accuracy | F1 | Precision | Recall | GO term | GO description |
|---|---|---|---|---|---|
| 0.99 | 0.9 | 0.92 | 0.89 | GO:0005524 | ATP binding |
| 0.98 | 0.86 | 0.8 | 0.94 | GO:0000981 | DNA-binding transcription factor activity, RNA polymerase II-specific |
| 0.97 | 0.86 | 0.89 | 0.82 | GO:0008270 | zinc ion binding |
| 0.97 | 0.79 | 0.74 | 0.84 | GO:0006357 | regulation of transcription by RNA polymerase II |
| 0.87 | 0.77 | 0.75 | 0.79 | GO:0005886 | plasma membrane |
| 0.97 | 0.73 | 0.72 | 0.74 | GO:0000785 | chromatin |
| 0.83 | 0.72 | 0.7 | 0.75 | GO:0005634 | nucleus |
| 0.94 | 0.69 | 0.61 | 0.79 | GO:0005615 | extracellular space |
| 0.92 | 0.67 | 0.6 | 0.75 | GO:0005576 | extracellular region |
| 0.78 | 0.66 | 0.59 | 0.75 | GO:0005829 | cytosol |
| 0.94 | 0.63 | 0.63 | 0.63 | GO:0005739 | mitochondrion |
| 0.77 | 0.62 | 0.55 | 0.71 | GO:0005737 | cytoplasm |
| 0.94 | 0.62 | 0.57 | 0.68 | GO:0003723 | RNA binding |
| 0.83 | 0.61 | 0.56 | 0.68 | GO:0005654 | nucleoplasm |
| 0.95 | 0.6 | 0.57 | 0.64 | GO:0046872 | metal ion binding |
| 0.94 | 0.57 | 0.5 | 0.66 | GO:0005789 | endoplasmic reticulum membrane |
| 0.79 | 0.49 | 0.42 | 0.6 | GO:0016020 | membrane |
| 0.86 | 0.46 | 0.39 | 0.57 | GO:0070062 | extracellular exosome |
| 0.92 | 0.41 | 0.37 | 0.46 | GO:0005783 | endoplasmic reticulum |
| 0.85 | 0.35 | 0.3 | 0.42 | GO:0042802 | identical protein binding |
