The goal of this project is to re-implement the methodology presented in the paper "CoLoRMap: Correcting Long Reads by Mapping short reads" by Ehsan Haghshenas, Faraz Hach, S. Cenk Sahinalp and Cedric Chauve.
- Entrez Direct: used for downloading the reference sequence NC_000913
  apt install ncbi-entrez-direct
- BWA: see https://github.com/lh3/bwa for installation. This code expects BWA to be in $PATH
- zlib.h
  apt-get install zlib1g-dev
- samtools
  apt install samtools
- Boost: C++ library used in the codebase to store graphs, find connected components, run Dijkstra's shortest path, etc.
  apt-get install libboost-graph-dev
- BLASR: used to align long reads to the reference genome
  apt install -y blasr
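Before running the pipeline it can help to confirm that the command-line tools listed above are reachable. The following is a minimal sketch and not part of the repository: the executable names (`efetch` for Entrez Direct, `bwa`, `samtools`, `blasr`) are assumed to be the ones these packages install, and `missing_tools` is a hypothetical helper name.

```python
import shutil

# ASSUMPTION: executable names provided by the packages listed above.
REQUIRED_TOOLS = ["efetch", "bwa", "samtools", "blasr"]


def missing_tools(tools=REQUIRED_TOOLS, which=shutil.which):
    """Return the entries of `tools` that cannot be found on $PATH."""
    return [tool for tool in tools if which(tool) is None]


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("Missing from $PATH: " + ", ".join(missing))
    else:
        print("All required tools were found.")
```

The `which` argument exists only so the lookup can be swapped out when testing; in normal use the default `shutil.which` is enough. The library dependencies (zlib and Boost) are not covered by this check, since they are headers rather than executables.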
To download the Illumina short reads, PacBio long reads and reference genome for the Escherichia coli str. K-12 substr. MG1655
do the following:
cd ecoli
bash init_ecoli.sh
The four main parameters for the Snakefile are:
- folder: the name of the target folder that contains the input files (e.g. test_data/)
- <short_reads_1>.fastq: the name of one of the FASTQ files in folder (e.g. ill_1.fastq)
- <short_reads_2>.fastq: the name of the other FASTQ file in folder (e.g. ill_2.fastq)
- <long_reads>.fasta: the name of the FASTA file in folder (e.g. pac.fasta)
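Since the three file names are all interpreted relative to the target folder, a quick check that they resolve to real files can catch typos before the pipeline starts. This is a minimal sketch under that assumption, not code from the Snakefile; `resolve_inputs` and its argument names are hypothetical, and the example values are the ones given above.

```python
from pathlib import Path


def resolve_inputs(folder, short_reads_1, short_reads_2, long_reads):
    """Join the three file names onto `folder` and list the ones that are missing."""
    base = Path(folder)
    paths = {
        "short_reads_1": base / short_reads_1,
        "short_reads_2": base / short_reads_2,
        "long_reads": base / long_reads,
    }
    # A parameter is reported as missing when its path is not an existing file.
    missing = [name for name, path in paths.items() if not path.is_file()]
    return paths, missing


if __name__ == "__main__":
    paths, missing = resolve_inputs(
        "test_data/", "ill_1.fastq", "ill_2.fastq", "pac.fasta"
    )
    for name, path in paths.items():
        print(name, "->", path)
    if missing:
        print("Not found:", ", ".join(missing))
```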
The other parameters are:
- test_name: the suffix of the file containing the corrected long reads. More specifically, the corrected long reads will be stored in <folder>/lr_corr_<test_name>.fasta. This is an ID intended to distinguish the output files produced as colormap.cpp is adjusted.
- correct_singletons: when set to "no", a short read $s$ that has been mapped to a long read $l$ and is not adjacent to any other short read mapped to $l$ will not be used to correct $l$. Otherwise, such short reads will be used to correct $l$.
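The effect of correct_singletons can be illustrated with a small sketch. It is not taken from colormap.cpp: the function name `drop_singletons` is hypothetical, each mapping is reduced to a half-open `(start, end)` interval on the long read, and two short reads are assumed to be "adjacent" when their intervals overlap. The actual adjacency criterion used by the code may differ.

```python
def drop_singletons(intervals, correct_singletons="no"):
    """Filter the short-read mappings on a single long read.

    `intervals` is a list of half-open (start, end) positions on the long read.
    ASSUMPTION: two short reads are adjacent when their intervals overlap.
    """
    if correct_singletons != "no":
        # Any value other than "no": singletons are kept and used for correction.
        return list(intervals)

    kept = []
    for i, (start, end) in enumerate(intervals):
        has_neighbour = any(
            start < other_end and other_start < end
            for j, (other_start, other_end) in enumerate(intervals)
            if j != i
        )
        if has_neighbour:
            kept.append((start, end))
    return kept


mappings = [(0, 100), (80, 180), (500, 600)]
print(drop_singletons(mappings, "no"))   # [(0, 100), (80, 180)]
print(drop_singletons(mappings, "yes"))  # all three mappings are kept
```

The pairwise comparison is quadratic in the number of mappings, which keeps the sketch short; sorting the intervals by start position would be the usual way to scale it.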
This pipeline produces the file
This file can be used directly to correct long reads. It takes 2 command line arguments:
- <long_reads>.fasta: the relative path to the long reads to be corrected
- a "raw alignment file"