CamPype is a pipeline for the analysis of Illumina paired-end sequencing data and/or whole bacterial genomes. The development of the workflow is mainly intended for the analysis of Campylobacter jejuni/coli genomes, although any other bacterial genus can be analyzed as well. CamPype is specially designed for users without knowledge of bioinformatics or programming, so the ease of installation and execution are the fundamentals of its development. Moreover, CamPype is a user-customizable workflow that allows you to select the analysis and the tools you are interested in.
Here you will find the schema of CamPype. CamPype lets the user check the quality of the raw sequencing data beforehand, in an independent step, to optimize the read filtering analysis. Moreover, bacteria identification can be performed on the filtered fastq reads when raw reads are provided, or after genome assembly when contigs are used.
Software and databases are shown in boxes, while dashed boxes indicate tools that users can deactivate.
- Clone this repository:
git clone https://github.com/JoseBarbero/CamPype.git
- Go to CamPype's directory:
cd CamPype
- Create the environments with conda:
conda config --append channels conda-forge
conda config --append channels bioconda
conda env create -f campype_env.yml
conda env create -f campype_env_aux.yml
Creating the conda environments will take some minutes, so be patient.
- After the conda environments have been created, update the databases of AMRFinder, Prokka and ABRicate:
conda activate campype
amrfinder -u
prokka --setupdb
abricate --setupdb
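The installation steps above rely on git and conda being available. As a minimal sketch of a pre-flight check, the snippet below reports which of these commands are not on the PATH. The helper name `missing_tools` is hypothetical and not part of CamPype.

```shell
#!/usr/bin/env bash
# Pre-flight check (illustrative helper, not part of CamPype):
# print every command from the argument list that is not on PATH.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
    done
}

# The installation steps rely on git and conda.
missing=$(missing_tools git conda)
if [ -n "$missing" ]; then
    # $missing is left unquoted on purpose so that each tool gets its own line.
    printf 'Missing before installing CamPype: %s\n' $missing >&2
fi
```

If nothing is printed, both commands were found and the installation steps can be run as listed.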
Additionally, CamPype allows you to check for read contamination and determine bacteria taxonomy using Kraken2. Installing Kraken2 is optional because it relies on a large database that needs considerable free disk space; CamPype does not need it unless you are interested in this analysis. If you install this module, the MiniKraken_8GB_202003 database will occupy at least 8 GB. This database is enough for bacteria identification and CamPype performance, but if you have enough disk space, you can download and install the "Standard Kraken2 Databases" following the instructions here for better sensitivity. To install Kraken2 in CamPype, make sure you are in CamPype's directory and run:
./install_kraken.sh
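Since the MiniKraken database needs at least 8 GB, it may help to check the free space before running the installer. The following is a rough sketch that assumes the database is stored on the same filesystem as the current directory; the helper name `has_free_space` is hypothetical.

```shell
#!/usr/bin/env bash
# Illustrative helper (not part of CamPype): succeed only if the filesystem
# holding directory $1 has at least $2 kilobytes available.
has_free_space() {
    # df -P prints one line per filesystem in a portable layout;
    # with -k, column 4 is the available space in 1024-byte blocks.
    avail_kb=$(df -Pk "$1" | awk 'NR == 2 { print $4 }')
    [ "$avail_kb" -ge "$2" ]
}

# 8 GB expressed in KB; 8 GB is the minimum quoted for MiniKraken_8GB_202003.
required_kb=$((8 * 1024 * 1024))
if has_free_space . "$required_kb"; then
    echo "Enough free space for the MiniKraken database."
else
    echo "Less than 8 GB free here; free some space before running ./install_kraken.sh." >&2
fi
```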
For beginners and anyone using a host operating system other than Linux, we have created a virtual machine (VM) with CamPype installed that can be imported into VirtualBox so that you can use the workflow.
- First, you will need to download and install VirtualBox following the instructions you will find here.
- Download the VM here.
- Import the VM following the instructions you will find here.
The VM is ready for use. However, if you run out of space on the virtual disk, we suggest resizing it following these instructions.
- Make sure Docker is installed on your machine.
- Build the image (feel free to use any image tag you want):
docker build . -t campype:latest
- Run the container:
docker run -it campype:latest bash
- You are now inside the container, which contains an installed version of CamPype. You can follow the instructions for using CamPype on Linux.
- You can exit the container with the command exit. If you want to reuse the same container in the future (e.g. if there is data you have generated and want to access), use docker container ls -a to view old containers, and access the container by its ID with docker start -i CONTAINER_ID
CamPype can run in two modes depending on the input files. The FASTQ mode analyses raw reads in fastq format (compressed or uncompressed), while the FASTA mode analyses assembled genomes in fasta format.
- Before running CamPype, you must indicate the location of the input files in the CamPype/input_files.csv file (TAB as separator). For fastq files, indicate the path of each pair of reads in the Forward and Reverse columns. For fasta files, both the Forward and Reverse columns refer to the path of the assembled genome, that is, the content of both columns will be the same. We recommend indicating the Genus and Species of your genomes if you know this information beforehand (this taxonomy will be used even if you run the bacteria identification module). If you indicate the genus, please indicate the species as well. If you do not know the taxonomy, we strongly encourage you to activate the bacteria identification analysis (as explained below).

| Samples | Forward | Reverse | Genus | Species |
| --- | --- | --- | --- | --- |
| sample_ID | /path/to/your/forward/fastq1_file.fastq | /path/to/your/reverse/fastq1_file.fastq | YourStrainGenus | YourStrainSpecies |
| sample_ID | /path/to/your/forward/fastq2_file.fastq | /path/to/your/reverse/fastq2_file.fastq | YourStrainGenus | YourStrainSpecies |
| sample_ID | /path/to/your/forward/fastq3_file.fastq | /path/to/your/reverse/fastq3_file.fastq | YourStrainGenus | YourStrainSpecies |

This structure must be respected in that file ($\textcolor{red}{\textsf{tab as a delimiter}}$). Make sure the headers have not been changed and that sample IDs do not contain the dot symbol. $\textcolor{red}{\textsf{Be careful with typos!!!}}$
- Set the modules you want to run in the CamPype/campype_config.py file. There you can set your own running parameters for each tool and select your tools of interest where possible.
- Default settings are configured for Campylobacter jejuni/coli. If you want to analyse a different bacterium, we strongly recommend adapting the configuration of CamPype as previously explained. In particular, you must modify the reference_genome, deactivate the option include_cc, and use abricate to search for virulence genes and/or use your own virulence genes database with BLAST instead (indicate this accordingly in the CamPype/campype_config.py file). $\textcolor{red}{\textsf{Be careful if you want to analyse a mix of bacterial species}}$: in that case, we recommend removing the reference_genome and deactivating the option run_variant_calling.
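Because typos in input_files.csv are easy to make, a quick check of the file before running CamPype can help. The sketch below only tests rules stated above (the exact tab-separated header, no dot in the sample ID) plus one assumption of ours, that the Forward and Reverse columns are never empty. The function name `validate_input_file` is hypothetical and not part of CamPype.

```shell
#!/usr/bin/env bash
# Illustrative validator (not part of CamPype) for the tab-separated input file.
# Prints one message per problem and returns non-zero if any problem was found.
validate_input_file() {
    awk -F '\t' '
        BEGIN { bad = 0 }
        NR == 1 {
            # The header must match the documented column names exactly.
            if ($0 != "Samples\tForward\tReverse\tGenus\tSpecies") {
                print "line 1: unexpected header"; bad = 1
            }
            next
        }
        NF == 0 { next }   # ignore blank lines
        index($1, ".") { print "line " NR ": sample ID contains a dot: " $1; bad = 1 }
        # Assumption: both path columns must be filled in for every sample.
        $2 == "" || $3 == "" { print "line " NR ": Forward or Reverse path is empty"; bad = 1 }
        END { exit bad }
    ' "$1"
}
```

From CamPype's directory it could be called as `validate_input_file input_files.csv`; no output and a zero exit status mean that none of these checks failed.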
An optional test can be run to check the correct installation of CamPype.
- Activate the CamPype environment:
conda activate campype
- Go to CamPype's directory:
cd your/path/to/CamPype
- Run the CamPype test:
./campype_test.sh
An example of the HTML report you will get can be found here.
- Activate the CamPype environment:
conda activate campype
- Go to CamPype's directory:
cd your/path/to/CamPype
- If you want to run CamPype in the FASTQ mode, we encourage you to first perform a quality control step to check how good your raw reads are and to adjust the read quality filtering step accordingly (remember to include the path of the fastq files in the CamPype/input_files.csv file):
bash -i campype_qc
A quality control analysis will be performed on each fastq file, and a summary HTML report will be generated for quick inspection in the fastq_quality_control directory, located inside the CamPype output directory named as you indicated in the CamPype/campype_config.py file. An example of this report can be found here. We recommend watching this video to learn how to interpret these results.
- Once you have set the configuration, run CamPype:
bash -i campype
The results will be located in the CamPype output directory, named as you indicated in the CamPype/campype_config.py file. If you want to store these results in the same directory as the quality control analysis data, remember to indicate that directory in the configuration file.
- You can deactivate the environment when you are finished:
conda deactivate
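Before launching the quality control step or the full run, it can be useful to confirm that every file listed in input_files.csv actually exists. The following sketch assumes the tab-separated layout described earlier, with paths in the second and third columns; the helper name `missing_inputs` is hypothetical and not part of CamPype.

```shell
#!/usr/bin/env bash
# Illustrative helper (not part of CamPype): print every path listed in the
# Forward or Reverse column of the tab-separated input file that does not
# exist on disk. Prints nothing when all files are present.
missing_inputs() {
    # Skip the header, emit columns 2 and 3 one per line, and drop duplicates
    # (in FASTA mode both columns hold the same path).
    tail -n +2 "$1" | awk -F '\t' 'NF >= 3 { print $2; print $3 }' | sort -u |
    while IFS= read -r path; do
        [ -e "$path" ] || printf '%s\n' "$path"
    done
}
```

Running `missing_inputs input_files.csv` from CamPype's directory lists the paths that need fixing. Note that a leading `~` in a path is not expanded by this check.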
The results of CamPype are stored in detailed directories for each analysis, with separate subdirectories for each tool and isolate. Files are also generated to track the analysis in case of an execution error. An interactive HTML summary report is generated at the end of the analysis to simplify data visualization and interpretation, although the figures can also be found in the html_report_figures directory within the CamPype output directory. This HTML file can be opened in any web browser. Examples of reports can be found here:
- Analysis with 5 Campylobacter jejuni and 5 Campylobacter coli (input: raw reads in fastq format)
- Analysis with 44 Escherichia coli (input: assembled genomes in fasta format)
The datasets that were used as input can be found here.
After completion of CamPype analysis, the report can also be generated by executing the following commands in the Linux terminal:
- For raw fastq reads as input:
conda activate campypeR
Rscript -e "rmarkdown::render('CamPype_Report_long.Rmd', params = list(directory = '~/path/to/data'))"
- For assembled genomes as input:
conda activate campypeR
Rscript -e "rmarkdown::render('CamPype_Report_short.Rmd', params = list(directory = '~/path/to/data'))"
In both cases, replace '~/path/to/data' with the path of the CamPype output directory that contains the output files required to create the summary HTML report.
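The two commands above differ only in the R Markdown template, so they can be wrapped in one small function. This is a sketch under stated assumptions: the campypeR environment is already active, the templates are in the current directory, and the output path contains no single quotes. The names `report_template` and `render_report` are hypothetical.

```shell
#!/usr/bin/env bash
# Illustrative wrapper (not part of CamPype).
# Map the run mode to the R Markdown template named in this README.
report_template() {
    case "$1" in
        fastq) echo "CamPype_Report_long.Rmd" ;;
        fasta) echo "CamPype_Report_short.Rmd" ;;
        *)     echo "unknown mode: $1 (expected fastq or fasta)" >&2; return 1 ;;
    esac
}

# Render the report for mode $1 from the CamPype output directory $2.
render_report() {
    template=$(report_template "$1") || return 1
    Rscript -e "rmarkdown::render('$template', params = list(directory = '$2'))"
}
```

For example, `render_report fastq ~/path/to/data` would run the same Rscript call as the first command above.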
We recommend updating CamPype when newer versions are released:
- Make sure you are not in the CamPype conda environment:
conda deactivate
- Save a copy of your configuration files. Updating CamPype will overwrite your configuration files because the properties of these files may change with newer versions of CamPype.
- Run ./updatecampype:
./updatecampype
You should answer YES to the first question (Are you sure you want to remove your configuration files?) to update configuration files and YES to the second question (Proceed ([y]/n)?) to update all the packages and tools included in CamPype. This might take several minutes.
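Saving a copy of the configuration files before updating can be scripted. The sketch below assumes that the files worth keeping are campype_config.py and input_files.csv, the two user-edited files named in this README; the function name `backup_configs` and the backup folder name are hypothetical.

```shell
#!/usr/bin/env bash
# Illustrative helper (not part of CamPype): copy the user-edited files from
# CamPype directory $1 into a new timestamped folder under $2, and print the
# folder that was created.
backup_configs() {
    dest="$2/campype_config_backup_$(date +%Y%m%d_%H%M%S)"
    mkdir -p "$dest" || return 1
    for f in campype_config.py input_files.csv; do
        # Skip a file that is absent instead of aborting the whole backup.
        [ -f "$1/$f" ] && cp -p "$1/$f" "$dest/"
    done
    printf '%s\n' "$dest"
}
```

Calling `backup_configs your/path/to/CamPype "$HOME"` before running ./updatecampype would leave a dated copy that can be compared with the refreshed configuration files afterwards.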
The databases of AMRFinder and ABRicate can be updated without updating CamPype. To update the database of AMRFinder, run:
amrfinder -u
To update the databases of ABRicate, run:
abricate-get_db --db [resfinder | argannot | ncbi | ecoh | megares | card | ecoli_vf | plasmidfinder | vfdb] --force
Only one database can be updated at a time. In ABRicate 1.0.1, for certain databases you first have to manually edit the file abricate-get_db, which you will find in ./anaconda3/envs/campype/bin:
- To update the NCBI database, first edit line 249 before running the previous command. This line has to be exactly like this:
my $src = "$AFP/AMR_CDS.fa"
- To update the megares database, first edit line 350 before running the previous command. This line has to be exactly like this:
download(' https://www.meglab.org/downloads/megares_v3.00.zip', $zip);
You can always check the date of the ABRicate databases by running: abricate --list
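Since only one ABRicate database can be updated per call, a loop over the database names listed above saves some typing. This is a minimal sketch: `update_abricate_dbs` and the `ABRICATE_GET_DB` variable are hypothetical names introduced here for illustration, and in ABRicate 1.0.1 the manual edits for ncbi and megares described above are still needed first.

```shell
#!/usr/bin/env bash
# Illustrative loop (not part of CamPype): update every ABRicate database
# named in this README, one per call.
# ABRICATE_GET_DB is a hypothetical override used for dry runs;
# by default the real abricate-get_db command is called.
update_abricate_dbs() {
    for db in resfinder argannot ncbi ecoh megares card ecoli_vf plasmidfinder vfdb; do
        # Stop at the first database that fails to update.
        "${ABRICATE_GET_DB:-abricate-get_db}" --db "$db" --force || return 1
    done
}
```

With the campype environment active, `update_abricate_dbs` runs the nine updates in turn, while `ABRICATE_GET_DB=echo update_abricate_dbs` only prints the arguments that would be passed.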
- Go to CamPype's directory:
cd your/path/to/CamPype
- Make sure you are not in the CamPype conda environment:
conda deactivate
- Run ./uninstallcampype:
./uninstallcampype
- Prokka stops running with this error:
Could not run command: cat \/home\/CamPype_OUTPUT_20220511_131550\/Prokka_annotation\/NCTC11168\/NCTC11168\.IS\.tmp\.35844\.faa | parallel --gnu --plain -j 8 --block 313 --recstart '>' --pipe blastp -query - -db /home/instalador/anaconda3/envs/campype/db/kingdom/Bacteria/IS -evalue 1e-30 -qcov_hsp_perc 90 -num_threads 1 -num_descriptions 1 -num_alignments 1 -seg no > \/home\/CamPype_OUTPUT_20220511_131550\/Prokka_annotation\/NCTC11168\/NCTC11168\.IS\.tmp\.35844\.blast 2> /dev/null
Activate the CamPype environment with conda activate campype, run prokka --setupdb first, and execute CamPype again.
- ABRicate can't find any gene and this message appears:
BLAST Database error: Error pre-fetching sequence data
Activate the CamPype environment with conda activate campype, run abricate --setupdb first, and execute CamPype again.
- If you encounter missing font problems when running Mauve, you should install the required fonts:
sudo apt-get install ttf-dejavu
The following functionalities will be included in CamPype as soon as possible:
- Our in-house Campylobacter virulence-associated genes database will be expanded to up to 333 different gene sequences. Meanwhile, the database can be found in our latest publication.
- CamPype will support the assignment of Clonal Complexes to any MLST scheme included in the PubMLST database.
- Tools and packages will be updated to the latest version.
Please cite CamPype whenever you use it as:
Ortega-Sanz, I., Barbero-Aparicio, J.A., Canepa-Oneto, A. et al. CamPype: an open-source workflow for automated bacterial whole-genome sequencing analysis focused on Campylobacter. BMC Bioinformatics 24, 291 (2023). https://doi.org/10.1186/s12859-023-05414-w
For questions, bugs and suggestions, please open a new issue or contact us by email at irene.ortegasanz@ugent.be or jabarbero@ubu.es.