3. The Spider: Interactive web reports
##### Summary

The Spider add-on works as an independent module and performs the following tasks:
- Parses generated results and gathers the most important information, including:
  - Analysis configuration and processed libraries
  - Log files generated during the analysis
  - Running times
  - QC metrics of the input data and the alignment for each library
  - Principal component analysis (PCA): Scores for the first four principal components are provided for all samples. The PCA is computed on centered, log-transformed counts/RPKMs, and the resulting scores are scaled to unit variance. Only features (genes, exons or transcripts) with at least 2 counts in at least one sample are used to compute the PCA.
  - Count distribution percentiles at the sample level.
- Generates count matrix files and their corresponding annotation files from the results of HTseq, STAR and Kallisto, including:
  - Raw counts per gene (STAR/HTseq), exon (HTseq) and transcript (Kallisto)
  - RPKMs per gene (STAR/HTseq) and exon (HTseq)
  - Annotation files with gene/exon/transcript information (feature identifier, feature length and associated gene identifier for exons and transcripts)
- Generates a web report with dynamic tables, figures and links.
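The PCA and percentile steps described in the list above can be sketched as follows. This is a minimal illustration with NumPy, not the Spider's own code: the function names `pca_scores` and `count_percentiles` are hypothetical, the log transform is assumed to be log(1 + counts), and the PCA is computed through a singular value decomposition of the centered matrix.

```python
import numpy as np


def pca_scores(counts, n_components=4, min_count=2):
    """Return per-sample PCA scores from a features x samples count matrix."""
    counts = np.asarray(counts, dtype=float)
    # Keep features with at least `min_count` counts in at least one sample.
    keep = (counts >= min_count).any(axis=1)
    logged = np.log1p(counts[keep])  # assumed transform: log(1 + counts)
    # Put samples in rows and center every feature across samples.
    x = logged.T
    x = x - x.mean(axis=0)
    # After centering, at most (n_samples - 1) components carry variance.
    k = min(n_components, x.shape[0] - 1)
    # PCA via SVD: the sample scores are the columns of U scaled by S.
    u, s, _ = np.linalg.svd(x, full_matrices=False)
    scores = u[:, :k] * s[:k]
    # Scale the scores of each component to unit variance.
    return scores / scores.std(axis=0, ddof=1)


def count_percentiles(counts, q=(5, 25, 50, 75, 95)):
    """Return a samples x percentiles array of the count distribution."""
    counts = np.asarray(counts, dtype=float)
    return np.percentile(counts, q, axis=0).T
```

With fewer than five samples, fewer than four components are returned, because centering removes one degree of freedom.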
##### Running the Spider

The Spider can be run on a project that is being or has already been analyzed by providing the project folder path:
$ spider --help
Usage: spider.py [options]
aRNApipe: SPIDER module
Options:
-h, --help show this help message and exit
-p PATH, --path=PATH Required: Path to the project folder
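When the Spider has to be launched from another script, the call can be wrapped as in the minimal sketch below. It assumes that the `spider` executable is available on the PATH, as in the help output above. The helper names `build_spider_command` and `run_spider` are hypothetical, and only the documented `-p` option is used.

```python
import subprocess
from pathlib import Path


def build_spider_command(project_path):
    """Return the argument list for a Spider run on `project_path`."""
    # -p/--path is the only required option listed by `spider --help`.
    return ["spider", "-p", str(Path(project_path))]


def run_spider(project_path):
    """Run the Spider; subprocess raises CalledProcessError on a non-zero exit."""
    # Assumption: the `spider` executable can be found on PATH.
    return subprocess.run(build_spider_command(project_path), check=True)
```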
##### Output files

The Spider creates two new folders within the project directory:

* **HTML**: Web reports of the project.
* **outputs**: Count/RPKM matrices, annotation files and statistics files.
The most important files generated in the outputs folder are listed below:
- Count data for each enabled module providing gene/exon counts (STAR, HTseq-Gene and HTseq-Exon):
  - [module_name]_counts.txt: Count matrix (features row-wise and samples column-wise).
  - [module_name]_rpkms.txt: RPKM matrix (features row-wise and samples column-wise).
  - [module_name]_annotation.txt: Feature annotation of the corresponding feature rows in the count and RPKM matrix files.
  - [module_name]_stats.txt: Quality control mapping statistics generated by each module. These files are used by the Spider to generate the quality control web reports.
  - [module_name]_pca.txt: PCA scores and distribution percentiles per sample. These files are used by the Spider to generate the web reports with PCA and expression distribution percentiles.

  Note: By default, STAR generates counts under the assumption of unstranded, stranded and reverse stranded RNA-seq data. The user should use the files that correspond to the type of data used for running the analysis (star_unstranded_, star_stranded_ or star_stranded-reverse_).
- Count data for transcript estimates generated by Kallisto:
  - kallisto_est_counts.txt: Reported transcript abundances in estimated counts.
  - kallisto_tpm.txt: Reported transcript abundances in transcripts per million.
  - kallisto_eff_length.txt: Transcript effective lengths.
- Aggregate of identified fusions (starfusion_aggregate.txt).
- Files with statistics used to generate the web reports for each module (stats_[module_name]).
- Summary of the computational resources used by each module (log_stats.txt).
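The relationship between the count, annotation and RPKM files can be illustrated with the standard RPKM definition (reads per kilobase of feature per million counted reads). This is a minimal sketch, not the Spider's own code: it assumes that the library size of a sample is the column sum of the count matrix, which may differ from the normalization the Spider applies, and the function name `rpkm` is hypothetical.

```python
import numpy as np


def rpkm(counts, lengths):
    """Compute RPKMs from a features x samples count matrix.

    `lengths` holds the feature lengths in base pairs, one per row, as
    provided by the annotation file.
    """
    counts = np.asarray(counts, dtype=float)
    lengths = np.asarray(lengths, dtype=float)
    # Assumption: library size = total counts of each sample in the matrix.
    library_size = counts.sum(axis=0)
    # 1e9 = 1e3 (per kilobase) * 1e6 (per million reads).
    return counts * 1e9 / (lengths[:, None] * library_size[None, :])
```

For example, a feature of 1000 bp with 10 counts in a sample of 40 total counts gets 10 * 1e9 / (1000 * 40) = 250000 RPKM.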
The web report is created in the HTML folder. It uses templates that are adapted to the data generated in the outputs folder. Several JavaScript libraries are used to provide additional functionality (jQuery v1.12.0, dataTables v1.10.10, amcharts v3.19.3 and lytebox v5.5). From the main report file (summary.html), users can access all the other generated reports through the links in the left panel. Reports are organized as follows:
- Summary (summary.html): Provides a summary of the project including the libraries analyzed, the obtained results of each sub-module, and the configuration file.
- HPC statistics (hpc.html): Provides a summary of the computational performance of each sub-module and processes used during analysis. It also provides a summary figure displaying the processing timeline that can be used to identify possible bottlenecks in the pipeline (so more resources can be assigned in subsequent analyses).
- TrimGalore/Cutadapt (trim.html): Provides information about the number of reads trimmed because of the presence of adapter sequence or bad base quality.
- FastQC (fastqc.html): Quality control results generated by FastQC. Each parameter has a dynamic link enabled that allows visualizing the corresponding figure generated by FastQC.
- Alignment quality control: Quality control metrics of the alignment process
  - Picard (picard.html): Results of the quality control measurements provided by Picard, such as the number of reads mapping to different genomic regions, mRNA biotypes and 3'-5' coverage bias information.
  - Picard Insert Size (picard-is.html): Provides statistics about the insert-size distribution of paired-end RNAseq samples.
  - STAR (star.html): Results of the quality control measurements provided by STAR according to the mapping properties of the genes.
  - HTseq (htseq-gene.html and htseq-exon.html): Results of the quality control measurements provided by HTseq according to the mapping properties of the genes.
- Count statistics:
  - DOWNLOADS (downloads.html): Links to count/RPKM matrices and their corresponding annotation files.
  - STAR/HTseq (star2.html, htseq-gene2.html and htseq-exon2.html): Sample PCA scores. Only features with more than 2 counts in at least one sample are taken into account when computing the PCA. The counts are log transformed (log(1+ncounts)). The 5%, 25%, 50%, 75% and 95% percentiles of the count/RPKM distributions are also provided.
- Variant calling:
  - VARSCAN/GATK (varscan.html and gatk.html): Summary of the number of calls made for each sample.
  - Star-Fusion (star-fusion.html): List of gene fusions identified by sample.
- Subdirectory "html": Javascript and CSS files required to display the contents of the web reports.
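Which report pages exist depends on the modules that were enabled, so it can be handy to check what the HTML folder actually contains. The sketch below uses the page names listed above. The helper name `available_reports` and the constant `REPORT_PAGES` are hypothetical and not part of the Spider.

```python
from pathlib import Path

# Report pages named in the list above, in the same order.
REPORT_PAGES = [
    "summary.html", "hpc.html", "trim.html", "fastqc.html",
    "picard.html", "picard-is.html", "star.html",
    "htseq-gene.html", "htseq-exon.html", "downloads.html",
    "star2.html", "htseq-gene2.html", "htseq-exon2.html",
    "varscan.html", "gatk.html", "star-fusion.html",
]


def available_reports(html_dir):
    """Return the known report pages that are present in `html_dir`."""
    html_dir = Path(html_dir)
    return [name for name in REPORT_PAGES if (html_dir / name).is_file()]
```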
##### Opening the web report

When aRNApipe is run on a cluster, the project folder and its HTML folder are not located on the local computer. A few steps are recommended to get full access to the web report from a local web browser.
The user can choose one of the following three options:
1. Mounting your cluster home folder on your local computer:
This option lets the user 'virtually' mount their cluster working directory on the local computer, so the web browser can open the generated reports with full functionality. In this demo, an empty folder called "morgan" in the local home path ("/home/user/morgan") is used to mount the cluster working directory (i.e. "/cluster/home/user"). Using an alias on the local computer (in the ".bashrc" file) is recommended. Note that the first alias below mounts the filesystem with full permissions (rwx), while the second mounts it in read-only mode (more secure):
- alias cluster_mount='sudo sshfs -o exec,allow_other user@cluster.cat:/cluster/home/user /home/user/morgan'
- alias cluster_mount='sudo sshfs -o ro,allow_other user@cluster.cat:/cluster/home/user /home/user/morgan'
Note that the sshfs command must be available on the local operating system. Follow the instructions below to install sshfs on macOS:
- Download and install the OSXFUSE library: https://osxfuse.github.io
- Download and install the SSHFS package from the same webpage: https://osxfuse.github.io
- Create the folder on your local computer where your cluster home folder will be mounted, e.g. $ mkdir /Users/username/morgan_home
- Mount your cluster working folder ("/cluster/home/user") on the local folder created in the previous step, e.g. $ sudo sshfs -o ro,allow_other [cluster_username]@[cluster_address]:/cluster/home/user /Users/username/morgan_home
- The shell will first ask for your local computer password and then for your morgan account password. Once mounted, you can access the files in your cluster working folder: cd /Users/username/morgan_home
- Then go to the HTML folder of your RNA-seq project and open the file summary.html in your web browser.
2. Copying the project folder to your local computer: You can copy the entire project folder to your local computer and open the summary report. Keep in mind, however, that the results generated by the programs can be large.
3. Copying the HTML/outputs folders to your local computer: This is the light version of the previous option. It is very fast, but the links to the extended FastQC html reports and the log files won't work.
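Whichever option is used, the last step is to open HTML/summary.html from a locally reachable copy of the project folder. The sketch below does this with Python's standard library. The helper names `summary_report_uri` and `open_summary_report` are hypothetical, and the project path passed in is whatever local path the mounted or copied project has.

```python
import webbrowser
from pathlib import Path


def summary_report_uri(project_dir):
    """Return a file:// URI for HTML/summary.html inside `project_dir`."""
    report = Path(project_dir).resolve() / "HTML" / "summary.html"
    if not report.is_file():
        raise FileNotFoundError(f"Report not found: {report}")
    return report.as_uri()


def open_summary_report(project_dir):
    """Open the main report in the default web browser."""
    return webbrowser.open(summary_report_uri(project_dir))
```

For example, with the mount point used above, the call would be `open_summary_report("/home/user/morgan/<project folder>")`, where the project folder name depends on your own analysis.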
##### Web report demo

The following demo provides an example of how to run the Spider on a processed dataset:
$ spider -p /gpfs/[...]/aRNApipe_demo/
> Checking project path...
> Building HTML and OUTPUT folders skeletons...
- Path: /gpfs/[...]/aRNApipe_demo
- Libs: /gpfs/[aRNApipe_path]/aRNApipe/code
> Parsing configuration file...
> Recovering LSF stats from: /gpfs/[...]/aRNApipe_demo/logs/
Recovered stats from 40 log files.
> Recovering stats from Kallisto and building count matrices: /gpfs/[...]/aRNApipe_demo/results_kallisto/
Data found for 4 of 4 samples
Creating annotation file...
Recovered annotation for 180253 of 180253 transcripts
> Recovering stats from STAR logs: /gpfs/[...]/aRNApipe_demo/results_star/
> Recovering stats from STAR counts and building count matrices: /gpfs/[...]/aRNApipe_demo/results_star/
Data found for 4 of 4 samples
Creating annotation file...
Recovered annotation for 57905 of 57905 genes
Computing RPKMs...
> Recovering stats from HTseq gene counts and building count matrices: /gpfs/[...]/aRNApipe_demo/results_htseq-gene/
Data found for 4 of 4 samples
Creating annotation file...
Recovered annotation for 63677 of 63677 genes
Computing RPKMs...
> Recovering stats from HTseq exon counts and building count matrices: /gpfs/[...]/aRNApipe_demo/results_htseq-exon/
Data found for 4 of 4 samples
Creating annotation file...
Recovered annotation for 738009 of 738009 genes
Computing RPKMs...
> Generating webpage with samples list and configuration...
- /gpfs/[...]/aRNApipe_demo/HTML/summary.html
> Generating webpage with download links...
- /gpfs/[...]/aRNApipe_demo/HTML/downloads.html
> Generating webpage with HPC and LOG statistics...
- /gpfs/[...]/aRNApipe_demo/HTML/hpc.html
> Generating webpage with TrimGalore/Cutadapt statistics...
- /gpfs/[...]/aRNApipe_demo/HTML/trim.html
> Generating webpage with fastqc statistics...
- /gpfs/[...]/aRNApipe_demo/HTML/fastqc.html
> Generating webpage with picard statistics...
- /gpfs/[...]/aRNApipe_demo/HTML/picard.html
> Generating webpage with Star-Fusion results...
- /gpfs/[...]/aRNApipe_demo/HTML/star-fusion.html
> Generating webpage with STAR statistics...
- /gpfs/[...]/aRNApipe_demo/HTML/star.html
> Generating webpage with htseq-gene statistics...
- /gpfs/[...]/aRNApipe_demo/HTML/htseq-gene.html
> Generating webpage with htseq-exon statistics...
- /gpfs/[...]/aRNApipe_demo/HTML/htseq-exon.html
> Generating webpage with VARSCAN statistics...
- /gpfs/[...]/aRNApipe_demo/HTML/varscan.html
> Generating webpage with GATK statistics...
- /gpfs/[...]/aRNApipe_demo/HTML/gatk.html
$ ls aRNApipe_demo/outputs/
htseq-exon_annotation.txt kallisto_annotation.txt starfusion_aggregate.txt star_unstranded_rpkms.txt
htseq-exon_counts.txt kallisto_eff_length.txt star_stats_log.txt star_unstranded_stats.txt
htseq-exon_pca.txt kallisto_est_counts.txt star_stranded_counts.txt stats_gatk.txt
htseq-exon_rpkms.txt kallisto_stats_eff_length.txt star_stranded-reverse_counts.txt stats_picard.txt
htseq-exon_stats.txt kallisto_stats_est_counts.txt star_stranded-reverse_rpkms.txt stats_trim_plot.txt
htseq-gene_annotation.txt kallisto_stats_tpm.txt star_stranded-reverse_stats.txt stats_trim.txt
htseq-gene_counts.txt kallisto_tpm.txt star_stranded_rpkms.txt stats_varscan.txt
htseq-gene_pca.txt log_stats_1.png star_stranded_stats.txt
htseq-gene_rpkms.txt log_stats.txt star_unstranded_counts.txt
htseq-gene_stats.txt star_annotation.txt star_unstranded_pca.txt
$ ls aRNApipe_demo/HTML/
downloads.html gatk.html html htseq-exon.html htseq-gene.html star2.html star.html trim.html
fastqc.html hpc.html htseq-exon2.html htseq-gene2.html picard.html star-fusion.html summary.html varscan.html
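Once the outputs folder is populated as in the listing above, the matrices can be loaded for downstream analysis. The exact file layout is not specified in this section, so the sketch below rests on assumptions that should be checked against the first lines of a real file: a tab-delimited text file, a header row whose first cell labels the feature column and whose remaining cells are sample names, and one feature per row. The function name `read_matrix` is hypothetical.

```python
import csv


def read_matrix(path, delimiter="\t"):
    """Read a features x samples matrix into (samples, features, values).

    Assumed layout (not confirmed by the documentation): header row with
    sample names after the first cell, then one row per feature with the
    feature identifier in the first column.
    """
    samples, features, values = [], [], []
    with open(path, newline="") as handle:
        reader = csv.reader(handle, delimiter=delimiter)
        header = next(reader)
        samples = header[1:]
        for row in reader:
            if not row:
                continue  # skip blank lines
            features.append(row[0])
            values.append([float(v) for v in row[1:]])
    return samples, features, values
```

A call such as `read_matrix("aRNApipe_demo/outputs/htseq-gene_counts.txt")` would then return the sample names, the feature identifiers and the numeric rows, provided the file follows the assumed layout.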