GitHub - ProjecticumDataScience/DeNovoGAP_lumbricus_terrestris: Pipeline for de novo genome assembly and annotation of the earthworm Lumbricus terrestris.

De novo genome annotation

Workflow for de novo genome annotation.

Table of Contents

About The Project
Getting Started
- Prerequisites
Usage
Control
Scripts
Issues
Statistics
Contributing

About The Project

This project covers the alignment of transcriptome reads, the training of AUGUSTUS gene models, and the analysis of the proteomes of Lumbricus terrestris and Lumbricus rubellus.

The goals of this project are:

De novo genome annotation of Lumbricus terrestris and Lumbricus rubellus, following protocols 1 and 2.
A database of the genes identified in these two earthworm species, following protocols 2 and 2.1.
A comparative proteomic analysis of the genes involved in motor functions, following protocols 2 and 2.1; the results are collected in a report (results->proteomische studie).

Getting Started

File structure:

'data': all files processed during the workflow.

This includes model development, RNA-seq alignment output, GenomeThreader protein alignment output, and AUGUSTUS models (species). The data are sorted by protocol.

'data_input': the reference genomic and transcriptomic data.

'results': a folder with the gene databases for Lumbricus terrestris and Lumbricus rubellus, the web app source code, and an analytical report on proteomics.

'archief': this collection of scripts replaces all Perl regex scripts that produce a bash syntax error within the "Removing Redundant Gene Structures" protocol.

TopHat is used for splice-aware alignment.

AUGUSTUS and GeneMark-ES/ET/EP+ ver 4.7 are used to build the model de novo.

GenomeThreader is used to align proteins.
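As a rough illustration of the splice-aware alignment step, the sketch below only builds the command lines for indexing a genome with Bowtie2 and aligning reads with TopHat; it does not run them. The file names, index name, and thread count are hypothetical placeholders, and the exact options used in this project are described in the protocols, not here.

```python
def bowtie2_build_cmd(genome_fasta: str, index_base: str) -> list[str]:
    """Command that builds the Bowtie2 index TopHat aligns against."""
    return ["bowtie2-build", genome_fasta, index_base]


def tophat_cmd(index_base: str, reads: list[str], out_dir: str,
               threads: int = 4) -> list[str]:
    """Command for a splice-aware TopHat alignment of one read set.

    `reads` holds one FASTQ file (single-end) or two (paired-end).
    """
    if not reads or len(reads) > 2:
        raise ValueError("expected one or two FASTQ files")
    # -p sets the number of threads, -o the output directory.
    return ["tophat2", "-p", str(threads), "-o", out_dir, index_base, *reads]
```

Each returned list can be passed to `subprocess.run(cmd, check=True)` once the tools are installed.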

Protocols:

Protocol1 - alignment of transcriptomic reads and model building based on the RNA-seq alignment.

Protocol2 - development of a model based on protein structures.

Protocol2.1 - creation of a database of the genes and gene structures identified in Lumbricus terrestris and Lumbricus rubellus.

The "Removing Redundant Gene Structures" protocol has been carried out, but to keep the scripts simple this part was moved to the archive folder ("scripts/archeief/remove_redudant").

Prerequisites

Required software:

GeneMark-ES/ET/EP+ ver 4.72_lic (requires Perl configuration, path setup, and dependencies); download from: https://exon.gatech.edu/GeneMark/license_download.cgi
TopHat, can be installed with conda
Bowtie2, can be installed with conda
GenomeThreader, can be installed with conda
Augustus, can be installed with conda
BRAKER, can be installed from a zip archive
Python3
Linux operating system
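Before starting, it can help to check which of the required tools are visible on the PATH. The sketch below is one way to do that. The executable names (tophat2, bowtie2, gth, augustus, gmes_petap.pl, braker.pl) are assumptions based on the names these tools usually install under, and may differ on your system.

```python
import shutil

# Assumed executable names; adjust them to match your installation.
REQUIRED_TOOLS = {
    "TopHat": "tophat2",
    "Bowtie2": "bowtie2",
    "GenomeThreader": "gth",
    "Augustus": "augustus",
    "GeneMark-ES/ET/EP+": "gmes_petap.pl",
    "BRAKER": "braker.pl",
    "Python3": "python3",
}


def find_tools(tools=REQUIRED_TOOLS):
    """Map each tool name to the path of its executable, or None if absent."""
    return {name: shutil.which(exe) for name, exe in tools.items()}


def missing_tools(tools=REQUIRED_TOOLS):
    """Return the sorted names of tools that were not found on the PATH."""
    return sorted(name for name, path in find_tools(tools).items() if path is None)
```

Calling `missing_tools()` returns an empty list when everything is found.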

Control

The positive control in this experiment is C. elegans. The test model was developed from proteins.

C. elegans positive control

Usage

To compare the genome structure of Lumbricus terrestris and Lumbricus rubellus, enter genome coordinates in the app:

app

Model Usage

To use the model, copy the regenworm folder into the species folder of your AUGUSTUS configuration directory.

Model: data/protocol2/model

sudo cp -r regenworm /usr/share/augustus/config/species/
cp -r regenworm anaconda3/envs/c/config/species/
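The same copy can be sketched in Python. This version reads the target from the AUGUSTUS_CONFIG_PATH environment variable when no directory is given. The function name and the fallback behaviour are illustrative assumptions, not part of the repository.

```python
import os
import shutil
from pathlib import Path


def install_species_model(model_dir, config_dir=None):
    """Copy an AUGUSTUS species folder into <config_dir>/species/.

    If config_dir is None, AUGUSTUS_CONFIG_PATH is used instead.
    Returns the path of the installed species folder.
    """
    model_dir = Path(model_dir)
    if config_dir is None:
        config_dir = os.environ.get("AUGUSTUS_CONFIG_PATH")
    if not config_dir:
        raise RuntimeError("set AUGUSTUS_CONFIG_PATH or pass config_dir")
    target = Path(config_dir) / "species" / model_dir.name
    # dirs_exist_ok lets a re-run overwrite an earlier copy (Python 3.8+).
    shutil.copytree(model_dir, target, dirs_exist_ok=True)
    return target
```

For a system-wide install such as /usr/share/augustus/config, the script needs the same elevated permissions as the sudo cp command above.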

Scripts

Scripts are organized by protocol.

protocol-2.1
scripts/protocol2.1/dbscript.py

This script creates a database of the identified genes for Lumbricus terrestris and Lumbricus rubellus. Usage from bash: python dbscript.py -i inputfile.xml -o database.txt

protocol-2
scripts/protocol2/get_uniprot.py

This script retrieves the proteomes of C. elegans, E. fetida, and Lumbricus from UniProt and writes a gzipped multi-FASTA file.

Usage from bash: python get_uniprot.py
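The following is a minimal sketch of how such a download could work, not the actual contents of get_uniprot.py. It assumes the UniProt REST streaming endpoint and its organism_id query field. The function names are hypothetical, and the taxonomy IDs must be supplied by the caller (6239 is C. elegans).

```python
import gzip
from urllib.parse import urlencode
from urllib.request import urlopen

UNIPROT_STREAM = "https://rest.uniprot.org/uniprotkb/stream"


def uniprot_fasta_url(organism_id: int) -> str:
    """URL that streams all UniProtKB entries of one organism as FASTA."""
    query = urlencode({"query": f"organism_id:{organism_id}", "format": "fasta"})
    return f"{UNIPROT_STREAM}?{query}"


def fetch_fasta(organism_id: int) -> str:
    """Download the FASTA text for one organism (needs network access)."""
    with urlopen(uniprot_fasta_url(organism_id)) as response:
        return response.read().decode("utf-8")


def write_multifasta_gz(fasta_chunks, out_path):
    """Concatenate FASTA texts into a single gzip-compressed file."""
    with gzip.open(out_path, "wt", encoding="utf-8") as handle:
        for chunk in fasta_chunks:
            if not chunk:
                continue  # skip organisms that returned no entries
            handle.write(chunk if chunk.endswith("\n") else chunk + "\n")
```

A typical call would be `write_multifasta_gz([fetch_fasta(6239)], "proteomes.fasta.gz")`, with further taxonomy IDs added to the list as needed.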

Statistics

Number of genes per chromosome*

Chromosome	L. terrestris	L. rubellus

chr1	4358	1665
chr2	3540	3178
chr3	3337	3045
chr4	3192	3238
chr5	2817	2748
chr6	3141	3118

* Gene counts were obtained with Helixer, a machine-learning gene prediction tool.
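For a quick comparison, the per-chromosome counts in the table can be summed per species. The snippet below only re-uses the numbers listed above, so the totals cover the six listed chromosomes and nothing else.

```python
# Gene counts copied from the table above: (L. terrestris, L. rubellus).
GENE_COUNTS = {
    "chr1": (4358, 1665),
    "chr2": (3540, 3178),
    "chr3": (3337, 3045),
    "chr4": (3192, 3238),
    "chr5": (2817, 2748),
    "chr6": (3141, 3118),
}

total_terrestris = sum(counts[0] for counts in GENE_COUNTS.values())
total_rubellus = sum(counts[1] for counts in GENE_COUNTS.values())

print("L. terrestris:", total_terrestris)  # 20385
print("L. rubellus:", total_rubellus)      # 16992
```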

Issues

- This package is fairly complex and requires Linux configuration, including the installation of GeneMark-ET.
- The 'startAlign.pl' script aborts when memory usage exceeds a certain limit. The console shows: ERROR in startAlign.pl at line 673. The message does not make the cause clear; the actual cause is that the FASTA file is too large. To work around it, reduce the size of the FASTA file, for example by splitting it into two parts and running this step once per part, or use the --pos option to restrict the position.
- Protocol1, bonafide error "non-unique identifiers": use scripts/protocol1/get_uniq.py. Like every Python script here, it can be run from bash: python get_uniq.py. Adjust the pattern so that it matches the line after LOCUS in bonafide.gb. The error arises because the tools match identifiers as plain text, so strings that contain an identifier may need to be reformatted before the software can process them.
- Protocol1: randomSplit.pl assigns 0 genes to the test or training set. In that case use split -n, or step through randomSplit.pl to find where the results are reset to zero.
- Protocol2: all Perl regex scripts produce a bash syntax error, so all commands were rewritten in a different way. The solution is described in docs/docs.pdf and scripts/archeief/remove_redudant. If you use my regex scripts, you must adjust the chromosome ID.
- Protocol 6: "filterGenesIn.pl nonred.loci.lst bonafide.gb > bonafide.f.gb" saves only the last locus of bonafide.gb, whereas the goal of this step is to keep all non-redundant loci in the set. The command was replaced by a loop, which can be found in the scripts archive under remove_redudant.
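The GenBank-related workarounds above (unique LOCUS identifiers, a train/test split, and keeping only listed loci) can be sketched in plain Python. This is an illustrative sketch, not the code in get_uniq.py or in the archive folder. It assumes that every record starts with a LOCUS line and ends with a "//" line, and all function names are hypothetical.

```python
import random
import re

LOCUS_RE = re.compile(r"^LOCUS\s+(\S+)", re.MULTILINE)


def split_records(text):
    """Split GenBank text into records; text after the last '//' is ignored."""
    records, current = [], []
    for line in text.splitlines(keepends=True):
        current.append(line)
        if line.strip() == "//":
            records.append("".join(current))
            current = []
    return records


def locus_id(record):
    """Return the identifier on the LOCUS line, or None if there is none."""
    match = LOCUS_RE.search(record)
    return match.group(1) if match else None


def make_ids_unique(records):
    """Append _1, _2, ... to repeated LOCUS identifiers.

    Column alignment of the LOCUS line is not preserved.
    """
    used, out = set(), []
    for record in records:
        match = LOCUS_RE.search(record)
        if match:
            name, n = match.group(1), 1
            new_name = name
            while new_name in used:
                new_name = f"{name}_{n}"
                n += 1
            used.add(new_name)
            record = record[:match.start(1)] + new_name + record[match.end(1):]
        out.append(record)
    return out


def keep_loci(records, wanted):
    """Keep every record whose LOCUS identifier is in `wanted`."""
    wanted = set(wanted)
    return [record for record in records if locus_id(record) in wanted]


def train_test_split(records, n_test, seed=0):
    """Shuffle reproducibly and return (training records, test records)."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    return shuffled[n_test:], shuffled[:n_test]
```

For example, `keep_loci(split_records(open("bonafide.gb").read()), open("nonred.loci.lst").read().split())` returns all listed loci, assuming nonred.loci.lst holds one identifier per line.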

Contributing

Contributions are what make the open source community such an amazing place to learn, inspire, and create. Any contributions you make are greatly appreciated.

If you have a suggestion that would make this better, please fork the repo and create a pull request. You can also simply open an issue with the tag "enhancement". Don't forget to give the project a star! Thanks again!

Fork the Project
Create your Feature Branch (git checkout -b feature/AmazingFeature)
Commit your Changes (git commit -m 'Add some AmazingFeature')
Push to the Branch (git push origin feature/AmazingFeature)
Open a Pull Request
