UTRpy: UTR extensions for annotations from protein orthology-based gene prediction using exons from reference-based transcriptome assembly

Archived: UTRpy will be continued as LUTR (https://github.com/SimonHegele/LUTR)

Lost in translation but meaningful: UnTranslated Regions (UTRs) have a variety of
important regulatory functions, and a genome annotation without them wouldn't really be
complete, right?

Protein ortholog-based gene prediction enables the transfer of detailed gene structure and
functional annotations across species by leveraging evolutionary conservation. However,
genome annotations from such methods lack UTRs. UTRpy adds them by extending predicted
transcripts with exons from reference-based transcriptome assemblies.

1 Installation

Requirements:

Python = 3.13

Pandas
Numpy

AGAT (https://github.com/NBISweden/AGAT)

conda create -n utrpy python=3.13
conda activate utrpy

mamba install -c conda-forge -c bioconda agat
git clone https://github.com/SimonHegele/UTRpy
cd UTRpy
pip install .

2 Usage

usage: utrpy [-h] [-m ] [-ks] [-me ] [-s ] [-k ] [-p ] [-pp ] [-tmp ] [-l ] prediction assembly outdir

UTR extension of transcript exons from protein orthology based gene prediction using exons from reference based assembly

positional arguments:
  prediction            Annotation from gene prediction (GFF/GTF)
  assembly              Annotation from transcriptome assembly (GFF/GTF)
  outdir                Output directory (Must not exist already)

options:
  -h, --help            show this help message and exit

Transcript matching:
  -m, --match           What exons of predicted transcripts to match [choices: ends, all] [default: all]
  -ks, --know_strand    Use only transcripts where the strand is known
  -me, --max_exon_length
                        Don't use assembled transcripts with exons longer than this [default: 20000]

UTR-variant selection:
  -s, --select          How to select UTR-variants if there are multiple [choices: shortest, longest, all] [default: all]
  -k, --keep            Keep the original transcript instead of deleting them

Others:
  -p, --processes       Number of parallel processes to use [Default:4]
  -pp, --pinky_promise
                        Pinky promise that prediction is correct (Will fix it otherwise)
  -tmp, --tmpdir        Temporary directory
  -l, --log_level       [default: info]
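
When UTRpy is called from a Python pipeline, the argument list can be assembled programmatically. The following is a minimal sketch, not part of UTRpy: the helper name build_utrpy_command and the file names in the usage note are hypothetical, and only the -m, -s and -p options from the help text above are used.

```python
from pathlib import Path


def build_utrpy_command(prediction, assembly, outdir,
                        match="all", select="all", processes=4):
    """Assemble an argument list for the utrpy command line (hypothetical helper)."""
    if match not in ("ends", "all"):
        raise ValueError("match must be 'ends' or 'all'")
    if select not in ("shortest", "longest", "all"):
        raise ValueError("select must be 'shortest', 'longest' or 'all'")
    if Path(outdir).exists():
        # The help text states that the output directory must not exist already.
        raise FileExistsError(f"output directory already exists: {outdir}")
    command = ["utrpy", "-m", match, "-s", select, "-p", str(processes)]
    # Positional arguments come last: prediction, assembly, outdir.
    command += [str(prediction), str(assembly), str(outdir)]
    return command
```

The returned list can be passed to subprocess.run(command, check=True), for example with build_utrpy_command("prediction.gff", "assembly.gtf", "utrpy_out", match="ends").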

3 UTRpy workflow

Preprocessing with AGAT
AGAT is used to fix inconsistencies in the input annotations. Most importantly, transcripts are added as explicit features for the assembly. For the prediction, preprocessing can be skipped with the -pp / --pinky_promise parameter if you are sure that your annotation is a correctly formatted GFF3 file.
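
To illustrate what "adding transcripts as explicit features" means, here is a minimal sketch that derives one transcript record per transcript ID from exon records. It is not AGAT's implementation; the function name and the dictionary keys are assumptions chosen for this example.

```python
from collections import defaultdict


def add_transcript_features(exons):
    """Derive one transcript record per transcript_id from its exon records.

    Each exon is a dict with the keys seqid, start, end, strand and
    transcript_id (1-based, inclusive coordinates as in GFF/GTF).
    """
    by_transcript = defaultdict(list)
    for exon in exons:
        by_transcript[exon["transcript_id"]].append(exon)

    transcripts = []
    for transcript_id, members in by_transcript.items():
        transcripts.append({
            "type": "transcript",
            "seqid": members[0]["seqid"],
            "strand": members[0]["strand"],
            # A transcript spans from its leftmost to its rightmost exon.
            "start": min(e["start"] for e in members),
            "end": max(e["end"] for e in members),
            "transcript_id": transcript_id,
        })
    return transcripts
```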

Transcript matching
Explicit representations of transcripts are built from the annotations: one for every predicted transcript, and one for each assembled transcript whose genomic span includes that of a predicted transcript.
The figure below shows a match between a predicted transcript (green) and an assembled one (blue).
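
The span condition for candidate assembled transcripts can be sketched as follows. This is an illustration under assumptions, not UTRpy's code: the Transcript class and both function names are hypothetical, the know_strand argument only mimics the idea behind -ks, and the exon-level matching controlled by -m is not covered.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Transcript:
    seqid: str
    strand: str  # "+", "-" or "." when unknown
    start: int   # 1-based, inclusive
    end: int


def spans_prediction(assembled, predicted, know_strand=False):
    """True if the assembled transcript's span includes the predicted one."""
    if assembled.seqid != predicted.seqid:
        return False
    if assembled.strand == ".":
        # Unknown strand: usable only when strand knowledge is not required.
        if know_strand:
            return False
    elif assembled.strand != predicted.strand:
        return False
    return assembled.start <= predicted.start and predicted.end <= assembled.end


def candidate_matches(predicted, assembled_transcripts, know_strand=False):
    """Return all assembled transcripts whose span includes the prediction."""
    return [a for a in assembled_transcripts
            if spans_prediction(a, predicted, know_strand)]
```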

UTR-variant construction
For matching pairs of transcripts, UTR-variants are created by combining the features of both transcripts (without duplicating exons). These variants replace the original predicted transcript in the annotation, and gene start and end positions are updated accordingly.
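
One way to picture the combination step is to merge the exon lists of both transcripts so that identical exons appear once and overlapping exons are fused, which extends the terminal exons of the prediction into the UTRs. The sketch below assumes this merging strategy; it is not taken from UTRpy's source, and both function names are hypothetical.

```python
def build_utr_variant(predicted_exons, assembled_exons):
    """Combine two exon lists, merging identical or overlapping exons.

    Exons are (start, end) tuples with 1-based, inclusive coordinates.
    """
    merged = []
    for start, end in sorted(set(predicted_exons) | set(assembled_exons)):
        if merged and start <= merged[-1][1]:
            # Overlaps the previous exon: extend it instead of adding a new one.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def updated_gene_span(gene_start, gene_end, variant_exons):
    """Widen the gene span so that it covers the UTR-variant's exons."""
    return (min(gene_start, variant_exons[0][0]),
            max(gene_end, variant_exons[-1][1]))
```

For a prediction with exons (200, 300) and (400, 500) and an assembled transcript with exons (100, 300) and (400, 650), the variant has exons (100, 300) and (400, 650), and a gene span of (200, 500) widens to (100, 650).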

Postprocessing with AGAT
AGAT is used to explicitly add UTRs as features to the annotation.
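
Explicit UTR features correspond to the exon parts that lie outside the coding region. The following sketch shows that relationship for a single transcript; it is not AGAT's implementation, the function name is hypothetical, and transcripts with an unknown strand are treated like plus-strand transcripts here.

```python
def utr_segments(exons, cds_start, cds_end, strand):
    """Split exon parts outside the CDS range into 5' and 3' UTR segments.

    exons: (start, end) tuples, 1-based and inclusive.
    cds_start, cds_end: outermost coding positions, cds_start <= cds_end.
    """
    upstream, downstream = [], []
    for start, end in sorted(exons):
        if start < cds_start:
            # Part of the exon left of the coding region.
            upstream.append((start, min(end, cds_start - 1)))
        if end > cds_end:
            # Part of the exon right of the coding region.
            downstream.append((max(start, cds_end + 1), end))
    if strand == "-":
        # On the minus strand the 5' end lies at the higher coordinates.
        return {"five_prime_UTR": downstream, "three_prime_UTR": upstream}
    return {"five_prime_UTR": upstream, "three_prime_UTR": downstream}
```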

4 Example

Screenshot from the IGV-genome browser

5 Limitations / Known issues

UTRpy does not address potential gene fusions, and AGAT might overlook them as well.

6 Future plans / ideas

Performance:

Pandas -> Polars
A "smart" way to split DataFrames

Limitations:

Addressing gene fusion
1. Identification of potential gene fusion -> Sufficient to guide manual curation.
2. Automatically merging fused genes.
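
As a starting point for step 1, potential fusions could be flagged wherever a single assembled transcript spans more than one predicted gene. This is a hypothetical heuristic written for illustration, not a planned UTRpy feature; the function name and the input layout are assumptions. Nested genes would also be flagged, which is why such output can only guide manual curation.

```python
def flag_potential_fusions(assembled, genes):
    """Map assembled transcript IDs to the predicted genes they fully span.

    assembled: {transcript_id: (seqid, start, end)}
    genes:     {gene_id: (seqid, start, end)}
    Only transcripts spanning more than one gene are reported.
    """
    flagged = {}
    for transcript_id, (seqid, start, end) in assembled.items():
        spanned = sorted(
            gene_id
            for gene_id, (g_seqid, g_start, g_end) in genes.items()
            if g_seqid == seqid and start <= g_start and g_end <= end
        )
        if len(spanned) > 1:
            flagged[transcript_id] = spanned
    return flagged
```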
