-
Notifications
You must be signed in to change notification settings - Fork 6
4. The reference builder
- Genome builds in aRNApipe
- Generating a new genome build
- Example cases
- Adding common genome assemblies
- Using GATK compatible builds
The reference builder is an independent aRNApipe module that can be used to generate all the reference and annotation files for a given genome build. The programs used for RNA-seq data processing often use different formats and standards for their corresponding input reference and annotation files. In order to solve the associated problems linked to this lack of consensus files, facilitate the inclusion of new genome assemblies in aRNApipe and provide a centralized repository, the reference builder is used to add new genome builds to aRNApipe creating all the required files from a few standardized and easy-to-obtain annotation files.
The genomes_processed directory contains the folders corresponding to each available genome build and the repository file:
$ ls -la genomes_processed/
-rw-r--r-- 1 [...] 155 Feb 5 01:40 installed_genomes.txt
drwxr-xr-x 6 [...] 65536 Jan 24 19:55 DM_BDGP6_E83
drwxr-xr-x 6 [...] 65536 Jan 24 22:06 MM_GRCm38_E83
drwxr-xr-x 6 [...] 65536 Jan 25 00:03 ZF_GRCz10_E83
drwxr-xr-x 6 [...] 65536 Jan 25 00:03 GRCh37_E75
drwxr-xr-x 6 [...] 65536 Feb 9 11:23 GRCh38_E83
drwxr-xr-x 9 [...] 65536 Feb 12 11:42 g1k_v37
$ cat genomes_processed/installed_genomes.txt
DM_BDGP6_E83 160124_194703
MM_GRCm38_E83 160124_214932
ZF_GRCz10_E83 160124_223816
GRCh37_E75 160124_233625
GRCh38_E83 160124_234239
g1k_v37 160205_014015
The available genome builds can be easily queried by using the aRNApipe mode 'genomes'. When calling aRNApipe using the ‘genomes’ mode, aRNApipe returns all available keys for the installed genome builds:
$ aRNApipe -m genomes
Available genome builds:
- Key: DM_BDGP6_E83 (installed 160124_194703)
- Key: MM_GRCm38_E83 (installed 160124_214932)
- Key: ZF_GRCz10_E83 (installed 160124_223816)
- Key: GRCh37_E75 (installed 160124_233625)
- Key: GRCh38_E83 (installed 160124_234239)
- Key: g1k_v37 (installed 160205_014015)
All Done!
The associated folder of each genome build contains the following files (programs used to generate them in brackets):
genome.fa Copy of the original DNA file provided by the user
transcriptome.fa Copy of the original cDNA file provided by the user
genesets.gtf Copy of the original GTF file provided by the user
genesets.exon.txt Exon annotation according to 'genesets.gtf'
genesets.gene.txt Gene annotation according to 'genesets.gtf'
genesets.transcript.txt Transcript annotation according to 'genesets.gtf'
genome.fa.fai Index file of 'genome.fa' used by GATK (samtools)
genome.dict Genome dictionary used by GATK (Picard/CreateSequenceDictionary)
kallisto.idx Transcriptome index used by Kallisto (Kallisto)
STAR_genome Folder with the genome version used by STAR (STAR)
log Folder with log files
refFlats Folder with refFlat and GTF versions stratified by feature type (aRNApipe library lib/refbuild.refflat() and UCSC gtfToGenePred)
lincRNA.[gtf/refFlat]
pre_miRNA.[gtf/refFlat]
protein_coding.[gtf/refFlat]
pseudogene.[gtf/refFlat]
rRNA.[gtf/refFlat]
snoRNA.[gtf/refFlat]
snRNA.[gtf/refFlat]
tRNA.[gtf/refFlat]
##### Generating a new genome build To generate a new genome build and make it available for aRNApipe, the user can choose between two implementations placed in the code directory of aRNApipe:
- aRNApipe/code/reference_builder.py is used to perform the action by submitting the job to an LSF cluster
- aRNApipe/code/lib/wr_refbuilder.py is used to perform the action in the current shell and can be performed on a single machine
The main arguments of both scripts are the same. The former already includes options for selecting the wall-time and the option for spanning a single node within the cluster.
Take into account that the generation of new genome builds require the user to have complete permissions within the genomes_processed folder.
Using aRNApipe/code/reference_builder.py:
$ python aRNApipe/code/refbuilder.py --help
Usage: refbuilder.py [options]
aRNApipe: Reference builder
Options:
-h, --help show this help message and exit
-L LABEL, --label=LABEL
[Required] Identifier label for the genome.
-p PATH, --path=PATH [Required] Absolute path to the aRNApipe installation
directory where the folder 'genomes_processed' is
located.
-f FASTA, --fasta=FASTA
Path to the uncompressed genome fasta file ('.fa' or
'.fasta').
-c CDNA, --cdna=CDNA Path to the cDNA fasta file (accepts '.gz').
-g GTF, --gtf=GTF Path to the GTF gene set file ('.gtf').
-n N, --ncpu=N Number of threads that STAR will use to generate the
reference genome (default=8).
-w WT, --wt=WT Wall time (default=200:00).
-s SPAN, --span=SPAN Span 1 host (yes/no, default no).
Using aRNApipe/code/lib/wr_refbuilder.py:
$ python aRNApipe/code/lib/wr_refbuilder.py --help
Usage: wr_refbuilder.py [options]
aRNApipe: Reference builder
Options:
-h, --help show this help message and exit
-L LABEL, --label=LABEL
[Required] Identifier label for the genome.
-p PATH, --path=PATH [Required] Absolute path to the aRNApipe installation
directory where the folder 'genomes_processed' is
located.
-f FASTA, --fasta=FASTA
Path to the uncompressed genome fasta file ('.fa' or
'.fasta').
-c CDNA, --cdna=CDNA Path to the cDNA fasta file (accepts '.gz').
-g GTF, --gtf=GTF Path to the GTF gene set file ('.gtf').
-n N, --ncpu=N Number of threads that STAR will use to generate the
reference genome (default=8).
The arguments of these scripts are described below:
- '-L LABEL, --label=LABEL': Identifier key that will be used for the current genome build being processed. This key must not be already present within the installed genome builds.
- '-p PATH, --path=PATH': Path to the aRNApipe basepath where the folders 'code' and 'genomes_processed' are found.
- '-f FASTA, --fasta=FASTA': Path to the FASTA file with the genome sequences (uncompressed).
- '-c CDNA, --cdna=CDNA': Path to the FASTA file with cDNA gene sequences (compressed or uncompressed).
- '-g GTF, --gtf=GTF': Path to the GTF file with gene annotations.
- '-n N, --ncpu=N': Number of threads that the job will use (only used for generation of the STAR genome).
- '-w WT, --wt=WT': Wall time for the associated job.
- '-s SPAN, --span=SPAN': If 'yes', all CPUs will span a single node.
If the user plans to run variant calling with the GATK pipeline it is recommended to use the genome assemblies provided by the Broad Institute 'bundles':
##### Example cases In this example, the script for sending the job to the LSF cluster is used:
$ python aRNApipe/code/refbuilder.py -L test_dm_2 -p /[..]/aRNApipe -f Drosophila_melanogaster.BDGP6.dna.toplevel.fa -c Drosophila_melanogaster.BDGP6.cdna.all.fa.gz -g Drosophila_melanogaster.BDGP6.83.gtf -n 16 -w 4:00 -s yes
Job <363375> is submitted to queue <normal>.
In this second example, genome generation is performed on a typical desktop computer:
$ python bedpipe/code/lib/wr_refbuilder.py -L test_dm -p /[...]/aRNApipe -f Danio_rerio.GRCz10.dna.toplevel.fa -c Danio_rerio.GRCz10.cdna.all.fa.gz -g Danio_rerio.GRCz10.83.gtf -n 4
Options:
- Genome label: test_dm
- Pipeline path: /[...]/aRNApipe
- Input GTF file: Danio_rerio.GRCz10.83.gtf
- Input FASTA file: Danio_rerio.GRCz10.dna.toplevel.fa
- Input cDNA file: Danio_rerio.GRCz10.cdna.all.fa.gz
> Checking processed genome directories...
Directory for processed genomes already exists.
> Checking if genome label is already installed...
> Checking input fasta/cdna/gtf files...
> Creating output directory for the current genome version in the processed genomes folder...
> Making a copy of the FASTA file to the processed genome directory: 'genome.fa'
> Making a copy of the cDNA file to the processed genome directory: 'transcriptome.fa'
> Making a copy of the GTF file to the processed genome directory: 'genesets.gtf'
> Generating exon, transcript and gene annotation from GTF...
- /[...]/aRNApipe/genomes_processed/test_dm/genesets.exon.txt
- /[...]/aRNApipe/genomes_processed/test_dm/genesets.gene.txt
- /[...]/aRNApipe/genomes_processed/test_dm/genesets.transcript.txt
> Generating index and dict files...
- genome.dict
- genome.fa.fai
> Creation of reference genome for STAR analysis...
- Path: /[...]/aRNApipe/genomes_processed/test_dm/STAR_genome
- Including GTF annotation
> Creation of transcriptome index for Kallisto analysis: 'kallisto.idx'
> Converting GTF file to refFlat format...
> Splitting GTF file by feature type and converting to refFlat format...
> Adding genome key to installed genomes...
##### Adding common genome assemblies Here we provide a list of sources that can be used for adding different genome assemblies to aRNApipe. * Human GRCh38: * Genome FASTA file: ftp://ftp.ensembl.org/pub/release-83/fasta/homo_sapiens/dna/Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz * Transcriptome FASTA file: ftp://ftp.ensembl.org/pub/release-83/fasta/homo_sapiens/cdna/Homo_sapiens.GRCh38.cdna.all.fa.gz * GTF: ftp://ftp.ensembl.org/pub/release-83/gtf/homo_sapiens/Homo_sapiens.GRCh38.83.gtf.gz * Human GRCh37: * Genome FASTA file: ftp://ftp.ensembl.org/pub/release-75/fasta/homo_sapiens/dna/Homo_sapiens.GRCh37.75.dna.primary_assembly.fa.gz * Transcriptome FASTA file: ftp://ftp.ensembl.org/pub/release-75/fasta/homo_sapiens/cdna/Homo_sapiens.GRCh37.75.cdna.all.fa.gz * GTF: ftp://ftp.ensembl.org/pub/release-75/gtf/homo_sapiens/Homo_sapiens.GRCh37.75.gtf.gz * Human G1Kv37 (GATK compatible): * Genome FASTA file: ftp://gsapubftp-anonymous@ftp.broadinstitute.org/bundle/2.8/b37/human_g1k_v37.fasta.gz * Transcriptome FASTA file: ftp://ftp.ensembl.org/pub/release-75/fasta/homo_sapiens/cdna/Homo_sapiens.GRCh37.75.cdna.all.fa.gz * GTF: ftp://ftp.ensembl.org/pub/release-75/gtf/homo_sapiens/Homo_sapiens.GRCh37.75.gtf.gz * dbSNP and indel annotation files from: ftp://ftp.broadinstitute.org/bundle/2.8/b37/
##### Using GATK compatible builds
The reference builder is not adapted to generate the files for the GATK variang calling pipeline. These files are provided by the Broad Institute as bundles of different genome builds. See the following links:
If the user is planning to use the GATK pipeline for variant calling, the following files must be placed in a folder called "gatk" in the corresponding genome build folder:
- The most recent dbSNP release
- VCF InDel annotation files
Once done, update the configuration file of aRNApipe (config.py) to enable GATK when using the corresponding genome build:
annots_gatk = {"genome_label": ["dbsnp_vcf_file", ["indel_vcf_file_1", "indel_vcf_file_2", ...]]}
Example:
annots_gatk = {"g1k_v37": ["dbsnp_138.b37.vcf", ["1000G_phase1.indels.b37.vcf", "Mills_and_1000G_gold_standard.indels.b37.vcf"]]}
##### Adding STAR-Fusion compatible indexes
To use STAR-Fusion some data resources provided by the Trinity Cancer Transcriptome Analysis Toolkit (CTAT) are required. These resources must be downloaded from the CTAT resources website or generated following the instructions provided by STAR-Fusion. Currently, the CTAT only provides pre-build indexes for human GRCh37 and GRCh38. If you want to use a different target species check the following instructions in STAR-Fusion website. The downloaded/generated indexes must be stored in a folder called star-fusion inside the specific genome reference folder. The following instructions show how to dowload and make the pre-build indexes available for GRCh37:
wget https://data.broadinstitute.org/Trinity/CTAT_RESOURCE_LIB/GRCh37_gencode_v19_CTAT_lib.tar.gz
tar xvf GRCh37_gencode_v19_CTAT_lib.tar.gz
rm GRCh37_gencode_v19_CTAT_lib.tar.gz
mv GRCh37_gencode_v19_CTAT_lib [PATH_TO_GENOMES_PROCESSED_FOLDER]/genomes_processed/g1k37/star-fusion
cd [PATH_TO_GENOMES_PROCESSED_FOLDER]/genomes_processed/g1k37/star-fusion/
$STAR_FUSION_HOME/FusionFilter/prep_genome_lib.pl --genome_fa ref_genome.fa --gtf ref_annot.gtf --blast_pairs blast_pairs.outfmt6.gz'
Note that in this example we use the genome label g1k37 and save the indexes in its corresponding folder inside the genomes_processed directory.
More detailed information about this process can be found in the STAR-Fusion wiki.