Skip to content

4. The reference builder

arnald edited this page Dec 11, 2016 · 8 revisions
  1. Genome builds in aRNApipe
  2. Generating a new genome build
  3. Example cases
  4. Adding common genome assemblies
  5. Using GATK compatible builds

The reference builder is an independent aRNApipe module that can be used to generate all the reference and annotation files for a given genome build. The programs used for RNA-seq data processing often use different formats and standards for their corresponding input reference and annotation files. In order to solve the associated problems linked to this lack of consensus files, facilitate the inclusion of new genome assemblies in aRNApipe and provide a centralized repository, the reference builder is used to add new genome builds to aRNApipe creating all the required files from a few standardized and easy-to-obtain annotation files.

##### Genome builds in aRNApipe For each aRNApipe project the user can choose the genome build that will be used. This setting is configured through the parameter *genome_build* within the project's configuration file. Each available assembly is identified by a unique key and all its corresponding annotation files are stored in a directory named *genomes_processed*. The path to this directory is set by the variable *path_db* in the configuration file *lib/config.py* This directory also contains the file *installed_genomes.txt* that tracks the currently available genome builds. The name of each genome build folder is the same as the corresponding key identifier.

The genomes_processed directory contains the folders corresponding to each available genome build and the repository file:

$ ls -la genomes_processed/
-rw-r--r-- 1 [...]   155 Feb  5 01:40 installed_genomes.txt
drwxr-xr-x 6 [...] 65536 Jan 24 19:55 DM_BDGP6_E83
drwxr-xr-x 6 [...] 65536 Jan 24 22:06 MM_GRCm38_E83
drwxr-xr-x 6 [...] 65536 Jan 25 00:03 ZF_GRCz10_E83
drwxr-xr-x 6 [...] 65536 Jan 25 00:03 GRCh37_E75
drwxr-xr-x 6 [...] 65536 Feb  9 11:23 GRCh38_E83
drwxr-xr-x 9 [...] 65536 Feb 12 11:42 g1k_v37
$ cat genomes_processed/installed_genomes.txt
DM_BDGP6_E83    160124_194703
MM_GRCm38_E83   160124_214932
ZF_GRCz10_E83   160124_223816
GRCh37_E75      160124_233625
GRCh38_E83      160124_234239
g1k_v37 160205_014015

The available genome builds can be easily queried by using the aRNApipe mode 'genomes'. When calling aRNApipe using the ‘genomes’ mode, aRNApipe returns all available keys for the installed genome builds:

$ aRNApipe -m genomes
Available genome builds:
- Key: DM_BDGP6_E83 (installed 160124_194703)
- Key: MM_GRCm38_E83 (installed 160124_214932)
- Key: ZF_GRCz10_E83 (installed 160124_223816)
- Key: GRCh37_E75 (installed 160124_233625)
- Key: GRCh38_E83 (installed 160124_234239)
- Key: g1k_v37 (installed 160205_014015)
All Done!

The associated folder of each genome build contains the following files (programs used to generate them in brackets):

genome.fa           Copy of the original DNA file provided by the user
transcriptome.fa    Copy of the original cDNA file provided by the user
genesets.gtf        Copy of the original GTF file provided by the user
genesets.exon.txt   Exon annotation according to 'genesets.gtf'
genesets.gene.txt   Gene annotation according to 'genesets.gtf'
genesets.transcript.txt  Transcript annotation according to 'genesets.gtf'
genome.fa.fai       Index file of 'genome.fa' used by GATK (samtools)
genome.dict         Genome dictionary used by GATK (Picard/CreateSequenceDictionary)
kallisto.idx        Transcriptome index used by Kallisto (Kallisto)
STAR_genome         Folder with the genome version used by STAR (STAR)
log                 Folder with log files
refFlats            Folder with refFlat and GTF versions stratified by feature type (aRNApipe library lib/refbuild.refflat() and UCSC gtfToGenePred)
    lincRNA.[gtf/refFlat]
    pre_miRNA.[gtf/refFlat]
    protein_coding.[gtf/refFlat]
    pseudogene.[gtf/refFlat]
    rRNA.[gtf/refFlat]
    snoRNA.[gtf/refFlat]
    snRNA.[gtf/refFlat]
    tRNA.[gtf/refFlat]

##### Generating a new genome build To generate a new genome build and make it available for aRNApipe, the user can choose between two implementations placed in the code directory of aRNApipe:
  • aRNApipe/code/reference_builder.py is used to perform the action by submitting the job to an LSF cluster
  • aRNApipe/code/lib/wr_refbuilder.py is used to perform the action in the current shell and can be performed on a single machine

The main arguments of both scripts are the same. The former already includes options for selecting the wall-time and the option for spanning a single node within the cluster.

Take into account that the generation of new genome builds require the user to have complete permissions within the genomes_processed folder.

Using aRNApipe/code/reference_builder.py:

$ python aRNApipe/code/refbuilder.py --help
Usage: refbuilder.py [options]

aRNApipe: Reference builder

Options:
  -h, --help            show this help message and exit
  -L LABEL, --label=LABEL
                        [Required] Identifier label for the genome.
  -p PATH, --path=PATH  [Required] Absolute path to the aRNApipe installation
                        directory where the folder 'genomes_processed' is
                        located.
  -f FASTA, --fasta=FASTA
                        Path to the uncompressed genome fasta file ('.fa' or
                        '.fasta').
  -c CDNA, --cdna=CDNA  Path to the cDNA fasta file (accepts '.gz').
  -g GTF, --gtf=GTF     Path to the GTF gene set file ('.gtf').
  -n N, --ncpu=N        Number of threads that STAR will use to generate the
                        reference genome (default=8).
  -w WT, --wt=WT        Wall time (default=200:00).
  -s SPAN, --span=SPAN  Span 1 host (yes/no, default no).

Using aRNApipe/code/lib/wr_refbuilder.py:

$ python aRNApipe/code/lib/wr_refbuilder.py --help
Usage: wr_refbuilder.py [options]

aRNApipe: Reference builder

Options:
  -h, --help            show this help message and exit
  -L LABEL, --label=LABEL
                        [Required] Identifier label for the genome.
  -p PATH, --path=PATH  [Required] Absolute path to the aRNApipe installation
                        directory where the folder 'genomes_processed' is
                        located.
  -f FASTA, --fasta=FASTA
                        Path to the uncompressed genome fasta file ('.fa' or
                        '.fasta').
  -c CDNA, --cdna=CDNA  Path to the cDNA fasta file (accepts '.gz').
  -g GTF, --gtf=GTF     Path to the GTF gene set file ('.gtf').
  -n N, --ncpu=N        Number of threads that STAR will use to generate the
                        reference genome (default=8).

The arguments of these scripts are described below:

  • '-L LABEL, --label=LABEL': Identifier key that will be used for the current genome build being processed. This key must not be already present within the installed genome builds.
  • '-p PATH, --path=PATH': Path to the aRNApipe basepath where the folders 'code' and 'genomes_processed' are found.
  • '-f FASTA, --fasta=FASTA': Path to the FASTA file with the genome sequences (uncompressed).
  • '-c CDNA, --cdna=CDNA': Path to the FASTA file with cDNA gene sequences (compressed or uncompressed).
  • '-g GTF, --gtf=GTF': Path to the GTF file with gene annotations.
  • '-n N, --ncpu=N': Number of threads that the job will use (only used for generation of the STAR genome).
  • '-w WT, --wt=WT': Wall time for the associated job.
  • '-s SPAN, --span=SPAN': If 'yes', all CPUs will span a single node.

If the user plans to run variant calling with the GATK pipeline it is recommended to use the genome assemblies provided by the Broad Institute 'bundles': 


##### Example cases In this example, the script for sending the job to the LSF cluster is used:
$ python aRNApipe/code/refbuilder.py -L test_dm_2 -p /[..]/aRNApipe -f Drosophila_melanogaster.BDGP6.dna.toplevel.fa -c Drosophila_melanogaster.BDGP6.cdna.all.fa.gz -g Drosophila_melanogaster.BDGP6.83.gtf -n 16 -w 4:00 -s yes
Job <363375> is submitted to queue <normal>.

In this second example, genome generation is performed on a typical desktop computer:

$ python bedpipe/code/lib/wr_refbuilder.py -L test_dm -p /[...]/aRNApipe -f Danio_rerio.GRCz10.dna.toplevel.fa -c Danio_rerio.GRCz10.cdna.all.fa.gz -g Danio_rerio.GRCz10.83.gtf -n 4

Options:
- Genome label:     test_dm
- Pipeline path:    /[...]/aRNApipe
- Input GTF file:   Danio_rerio.GRCz10.83.gtf
- Input FASTA file: Danio_rerio.GRCz10.dna.toplevel.fa
- Input cDNA file:  Danio_rerio.GRCz10.cdna.all.fa.gz

> Checking processed genome directories...
  Directory for processed genomes already exists.
> Checking if genome label is already installed...
> Checking input fasta/cdna/gtf files...
> Creating output directory for the current genome version in the processed genomes folder...
> Making a copy of the FASTA file to the processed genome directory: 'genome.fa'
> Making a copy of the cDNA file to the processed genome directory: 'transcriptome.fa'
> Making a copy of the GTF file to the processed genome directory: 'genesets.gtf'
> Generating exon, transcript and gene annotation from GTF...
  - /[...]/aRNApipe/genomes_processed/test_dm/genesets.exon.txt
  - /[...]/aRNApipe/genomes_processed/test_dm/genesets.gene.txt
  - /[...]/aRNApipe/genomes_processed/test_dm/genesets.transcript.txt
> Generating index and dict files...
  - genome.dict
  - genome.fa.fai
> Creation of reference genome for STAR analysis...
  - Path: /[...]/aRNApipe/genomes_processed/test_dm/STAR_genome
  - Including GTF annotation
> Creation of transcriptome index for Kallisto analysis: 'kallisto.idx'
> Converting GTF file to refFlat format...
> Splitting GTF file by feature type and converting to refFlat format...
> Adding genome key to installed genomes...

##### Adding common genome assemblies Here we provide a list of sources that can be used for adding different genome assemblies to aRNApipe. * Human GRCh38: * Genome FASTA file: ftp://ftp.ensembl.org/pub/release-83/fasta/homo_sapiens/dna/Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz * Transcriptome FASTA file: ftp://ftp.ensembl.org/pub/release-83/fasta/homo_sapiens/cdna/Homo_sapiens.GRCh38.cdna.all.fa.gz * GTF: ftp://ftp.ensembl.org/pub/release-83/gtf/homo_sapiens/Homo_sapiens.GRCh38.83.gtf.gz * Human GRCh37: * Genome FASTA file: ftp://ftp.ensembl.org/pub/release-75/fasta/homo_sapiens/dna/Homo_sapiens.GRCh37.75.dna.primary_assembly.fa.gz * Transcriptome FASTA file: ftp://ftp.ensembl.org/pub/release-75/fasta/homo_sapiens/cdna/Homo_sapiens.GRCh37.75.cdna.all.fa.gz * GTF: ftp://ftp.ensembl.org/pub/release-75/gtf/homo_sapiens/Homo_sapiens.GRCh37.75.gtf.gz * Human G1Kv37 (GATK compatible): * Genome FASTA file: ftp://gsapubftp-anonymous@ftp.broadinstitute.org/bundle/2.8/b37/human_g1k_v37.fasta.gz * Transcriptome FASTA file: ftp://ftp.ensembl.org/pub/release-75/fasta/homo_sapiens/cdna/Homo_sapiens.GRCh37.75.cdna.all.fa.gz * GTF: ftp://ftp.ensembl.org/pub/release-75/gtf/homo_sapiens/Homo_sapiens.GRCh37.75.gtf.gz * dbSNP and indel annotation files from: ftp://ftp.broadinstitute.org/bundle/2.8/b37/
##### Using GATK compatible builds

The reference builder is not adapted to generate the files for the GATK variang calling pipeline. These files are provided by the Broad Institute as bundles of different genome builds. See the following links:

If the user is planning to use the GATK pipeline for variant calling, the following files must be placed in a folder called "gatk" in the corresponding genome build folder:

  • The most recent dbSNP release
  • VCF InDel annotation files

Once done, update the configuration file of aRNApipe (config.py) to enable GATK when using the corresponding genome build:

annots_gatk = {"genome_label": ["dbsnp_vcf_file", ["indel_vcf_file_1", "indel_vcf_file_2", ...]]}
Example:
annots_gatk = {"g1k_v37": ["dbsnp_138.b37.vcf", ["1000G_phase1.indels.b37.vcf", "Mills_and_1000G_gold_standard.indels.b37.vcf"]]}

##### Adding STAR-Fusion compatible indexes

To use STAR-Fusion some data resources provided by the Trinity Cancer Transcriptome Analysis Toolkit (CTAT) are required. These resources must be downloaded from the CTAT resources website or generated following the instructions provided by STAR-Fusion. Currently, the CTAT only provides pre-build indexes for human GRCh37 and GRCh38. If you want to use a different target species check the following instructions in STAR-Fusion website. The downloaded/generated indexes must be stored in a folder called star-fusion inside the specific genome reference folder. The following instructions show how to dowload and make the pre-build indexes available for GRCh37:

wget https://data.broadinstitute.org/Trinity/CTAT_RESOURCE_LIB/GRCh37_gencode_v19_CTAT_lib.tar.gz
tar xvf GRCh37_gencode_v19_CTAT_lib.tar.gz
rm GRCh37_gencode_v19_CTAT_lib.tar.gz
mv GRCh37_gencode_v19_CTAT_lib [PATH_TO_GENOMES_PROCESSED_FOLDER]/genomes_processed/g1k37/star-fusion
cd [PATH_TO_GENOMES_PROCESSED_FOLDER]/genomes_processed/g1k37/star-fusion/
$STAR_FUSION_HOME/FusionFilter/prep_genome_lib.pl --genome_fa ref_genome.fa --gtf ref_annot.gtf --blast_pairs blast_pairs.outfmt6.gz'

Note that in this example we use the genome label g1k37 and save the indexes in its corresponding folder inside the genomes_processed directory.

More detailed information about this process can be found in the STAR-Fusion wiki.