Skip to content

LLM-based variant extraction from title and abstracts of biomedical publications. Search literature-derived co-associations between variants, cancers, and treatments

Notifications You must be signed in to change notification settings

hastingslab-org/Variantscape

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

51 Commits
 
 
 
 
 
 

Repository files navigation

Variantscape

Overview

Variantscape is a large-scale literature mining approach that combines LLM-driven entity extraction with co-association and network-based analysis to uncover variant–treatment–cancer relationships from biomedical abstracts.

image


Features

  • Retrieving biomedical literature from a defined search string and timeframe, including preprocessing and cleaning
  • ML- and LLM-enhanced extraction of genes, variants, treatments, and cancer types from titles and abstracts of biomedical literature
  • Co-occurrence analysis for identifying variant–treatment–cancer relationships
  • Study design–weighted scoring to prioritize stronger evidence
  • Graph-based representation of biomedical associations
  • Integration of external biomedical APIs (OpenAlex, CIViC, MONDO, etc.) for enrichment
  • Exploratory analysis of rare or under-characterized variants


Installation and setup

All required packages and model downloads are handled directly within each Jupyter notebook.

Datasets can either be generated by using 01_fetching_articles, or accessed in a Zenodo repository: https://zenodo.org/records/15268056

Technologies used

  • Python and Jupyter Notebook for execution
  • Named Entity Recognition (NER) models (e.g., SciSpaCy) for information extraction
  • BioBERT (Biomedical BERT model) for information extraction
  • Large Language Models (LLMs) (e.g., Llama 3.3) for variant extraction and classification
  • OpenAlex API (for fetching of biomedical publications)
  • CIViC API (for disease and treatment information)
  • MyCancerInfo API (for gene name augmentation)
  • MyDiseaseInfo API (for disease and cancer type synonyms)
  • MyTreatmentInfo API (for treatment synonyms and aliases)
  • MONDO API (for cancer ontology mapping and synonyms)
  • General classifier (for study design classification)

Note: This project uses several external APIs (OpenAlex, CIViC, MONDO, etc.) and NLP models (e.g., SciSpaCy, BioBERT).
Some notebooks use LLMs via DeepInfra (e.g., Llama 3.3). To use them, a DeepInfra account, API key, and credits are needed.


Usage

This project is structured as a modular pipeline, with Jupyter notebooks organized into sequential folders:

01_fetching_articles
02_cleaning_and_normalization
03_gene_extraction
04_categorization
05_LLM_variant_extraction
06_coassociation_and_network_analysis
    ...



How to run

  1. Start from 01_fetching_articles and move step-by-step through each numbered folder.
  2. Inside each folder, execute notebooks in order (e.g., 01.1, 02.1, etc.)
  3. Each notebook handles its own imports, installations, and API requests.
  4. Modify parameters (e.g., variant or cancer type) inside the notebook as needed.

Internet access is required for live API calls (e.g., OpenAlex, CIViC, MONDO).
All notebooks are designed to be executed sequentially for a complete analysis workflow.


Evaluation studies

Two dedicated evaluation modules are included to validate and assess extraction quality:

  • NLP_gene_evaluation/: Evaluates gene name extraction using traditional NLP techniques and biomedical NER models. Results are discussed in "Wosny, M. & Hastings, J., "Automated gene identification in oncology literature: A comparative evaluation of Natural Language Processing approaches."
  • LLM_variant_evaluation/: Evaluates the performance of large language model–driven variant extraction. Results and methodology are described in "Wosny, M. & Hastings, J., "Large Language Models for Detection of Genetic Variants in Biomedical Literature."

These studies provide insight into the reliability and limitations of automated extraction methods for genes and variants for downstream analyses.


Interactive Web Tool

To translate our findings into a usable resource, we developed a publicly accessible web tool that enables interactive exploration of the variant–treatment–cancer associations identified in this study.

The tool allows users to:

  • Search by variant or gene
  • Filter by cancer type
  • Visualize co-association strengths with treatments and other cancer types

This interface supports exploratory navigation of literature-derived associations and may serve as a starting point for hypothesis generation and knowledge discovery in research settings.

How to cite

If you use this work, please cite the accompanying manuscripts:

  • Wosny M., Boesch M., Peres T., Niederhauser T., Früh M., Rothermundt C., Hastings J (2025) "Variantscape: Using Large Language Models to Build a Comprehensive Landscape of Cancer Variants for Precision Oncology." (Preprint)
  • Wosny M, Hastings J (2025) Automated Gene Identification in Oncology Literature: A Comparative Evaluation of Natural Language Processing Approaches. In: Glob. Healthc. Transform. Era Artif. Intell. Inform. IOS Press, pp 61–65, https://ebooks.iospress.nl/doi/10.3233/SHTI250673
  • Wosny, M. & Hastings, J. (2025) "Large Language Models for Detection of Genetic Variants in Biomedical Literature." Studies in Health Technology and Informatics (Preprint, tp be published in August 2025)

About

LLM-based variant extraction from title and abstracts of biomedical publications. Search literature-derived co-associations between variants, cancers, and treatments

Topics

Resources

Stars

Watchers

Forks

Packages

No packages published