Description
You write in the README.md that you support the following formats:
Input Type 5 Bismark coverage2cytosine format:
//Bismark coverage2cytosine format Example: chr1 762 763 + 17 64 CG CGA
Column1: chromosome, which is a string. Column2: nucleotide/start position, an unsigned integer [0,4294967295]. Column3: strand. Column4: methylated C count, an unsigned integer in [0,4294967295]. Column5: C count, an unsigned integer in [0,4294967295]. Column6: C-context, e.g. CG, CH, CHH. Column7: C-context, e.g. CGA, CGT, etc.
Input Type 6 Bismark coverage2cytosine format:
Example: chr1 762 763 0.265625 17 76
Column1: chromosome, which is a string. Column2: nucleotide/start position, an unsigned integer in [0,4294967295]. Column3: nucleotide/end position, an unsigned integer in [0,4294967295]. Column4: methylation percentage, which is calculated by Defiant. Column5: methylated C count, an unsigned integer in [0,4294967295]. Column6: C count, an unsigned integer in [0,4294967295].
However, these don’t entirely match what is described in the bismark_methylation_extractor help:
The genome-wide cytosine methylation output file is tab-delimited in the following format:
<chromosome> <position> <strand> <count methylated> <count non-methylated> <C-context> <trinucleotide context>
and
The coverage output looks like this (tab-delimited, 1-based genomic coords; zero-based half-open coordinates available with --zero_based):
<chromosome> <start position> <end position> <methylation percentage> <count methylated> <count non-methylated>
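To make the two layouts quoted from the Bismark help concrete, here is a minimal Python sketch that tells them apart by column count. The names `CytosineReportRow`, `CoverageRow` and `parse_bismark_line` are hypothetical and chosen only for illustration; this is not how Defiant or Bismark parse their files.

```python
from typing import NamedTuple, Union


class CytosineReportRow(NamedTuple):
    """Genome-wide cytosine report: 7 tab-separated columns."""
    chromosome: str
    position: int
    strand: str
    count_methylated: int
    count_unmethylated: int
    c_context: str
    trinucleotide_context: str


class CoverageRow(NamedTuple):
    """Coverage output: 6 tab-separated columns."""
    chromosome: str
    start: int
    end: int
    methylation_percentage: float
    count_methylated: int
    count_unmethylated: int


def parse_bismark_line(line: str) -> Union[CytosineReportRow, CoverageRow]:
    """Parse one line, choosing the layout from the number of columns."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) == 7:
        chrom, pos, strand, meth, unmeth, ctx, tri = fields
        return CytosineReportRow(chrom, int(pos), strand,
                                 int(meth), int(unmeth), ctx, tri)
    if len(fields) == 6:
        chrom, start, end, pct, meth, unmeth = fields
        return CoverageRow(chrom, int(start), int(end),
                           float(pct), int(meth), int(unmeth))
    raise ValueError(f"expected 6 or 7 columns, got {len(fields)}")
```

With this reading, the 8-column README example for Input Type 5 (`chr1 762 763 + 17 64 CG CGA`) matches neither layout and would be rejected.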
1. You call both "coverage2cytosine" format. The Bismark "coverage2cytosine" module can create a "genome-wide cytosine methylation output file" (which looks ALMOST like Input Type 5) from the coverage output (which looks ALMOST like your Input Type 6), but that file can also be created by bismark_methylation_extractor directly.
2. In the Input Type 5 example you show a start and an end position (8 columns in total), but the description below it lists only a start position and 7 columns in total. I assume it's just a typo in the example?
3. You write that the start/end positions are all in [0,4294967295]. Bismark uses 1-based coordinates by default; only when --zero_based is explicitly specified does the output become zero-based and half-open. So by default start position == end position, but your example says '762 763'. Should --zero_based indeed be specified?
4. Bismark clearly states "count methylated" and "count non-methylated" rather than "methylated C count" and "C count". "C count" sounds like a total count (methylated + non-methylated). What is actually expected here?
5. Input Type 6, "Column4: methylation percentage, which is calculated by Defiant." Why is this calculated by Defiant? And how? Shouldn't this be input to Defiant? It is part of the Bismark coverage output. However, your example is "chr1 762 763 0.265625 17 76".
How would you get to 0.265625? It's neither 17/76 nor 17/(76+17), depending on what you actually mean in point 4 (17/64 = 0.265625, assuming the 64 from your Input Type 5 example).
However, from a Bismark run, I got e.g. the following in the coverage output (test.deduplicated.bismark.cov.gz):
chr3 3008646 3008646 33.3333333333333 1 2
chr3 5620584 5620584 75 3 1
So, the methylation percentage is (100*col5/(col5+col6)) and not (col5/col6)
(Also, the start and end position are the same (as stated in 3), unless --zero_based is used, but then it would not be valid input to the coverage2cytosine script.)
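The arithmetic above can be checked with a short Python sketch. The function names are hypothetical, and `to_zero_based_half_open` encodes my assumption about what "zero-based half-open" means for a single cytosine (start = position - 1, end = position); it is not taken from the Bismark or Defiant source.

```python
def methylation_percentage(count_methylated: int, count_unmethylated: int) -> float:
    """Percentage as it appears in the coverage lines quoted above:
    100 * methylated / (methylated + non-methylated)."""
    total = count_methylated + count_unmethylated
    if total == 0:
        raise ValueError("no reads cover this position")
    return 100.0 * count_methylated / total


def to_zero_based_half_open(position: int) -> tuple:
    """Assumed conversion of a 1-based position to a zero-based,
    half-open (start, end) interval covering that single base."""
    return position - 1, position


# The two coverage lines from my Bismark run.
print(methylation_percentage(1, 2))
print(methylation_percentage(3, 1))

# The README value is only reproduced by 17/64, not by 17/76 or 17/(17+76).
print(17 / 64, 17 / 76, 17 / (17 + 76))

# Under the assumption above, '762 763' would correspond to 1-based position 763.
print(to_zero_based_half_open(763))
```

If Defiant really expects column 6 to be the total C count instead of the non-methylated count, the formula would have to change accordingly, which is exactly what needs clarifying.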
Please consider providing an example call for bismark_methylation_extractor that will produce files of the type that Defiant will read and process as expected.
Looking forward to testing the program once this is clarified.