
Add VCF support#107

Open
nvnieuwk wants to merge 5 commits into CenterForMedicalGeneticsGhent:master from nvnieuwk:add-vcf-support

@nvnieuwk

Adds #105

Adds 3 new options to the predict command:

  1. --vcf to request VCF output (written as a bgzipped VCF file)
  2. --fai to supply the FASTA index (.fai) used to create the contig lines in the VCF header
  3. --sample to set the sample name used in the VCF; defaults to the basename of the outid
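
As a rough illustration of how these three options behave together, here is a minimal argparse sketch. It is not the code in this PR: the parser name, the positional `outid` stand-in, the helper `resolve_sample_name`, and the choice to model `--vcf` as a boolean flag are all assumptions made for the example. Only the option names and the basename default come from the description above.

```python
import argparse
import os


def build_predict_parser():
    # Hypothetical stand-in for the predict sub-command parser.
    parser = argparse.ArgumentParser(prog="predict-sketch")
    # Assumption: outid is the output prefix already used by predict.
    parser.add_argument("outid", help="output prefix")
    # Assumption: --vcf is a plain on/off flag.
    parser.add_argument("--vcf", action="store_true",
                        help="also write a bgzipped VCF file")
    parser.add_argument("--fai", default=None,
                        help="FASTA index used for the ##contig header lines")
    parser.add_argument("--sample", default=None,
                        help="sample name for the VCF (default: basename of outid)")
    return parser


def resolve_sample_name(args):
    # Fall back to the basename of outid when --sample is not given.
    if args.sample is not None:
        return args.sample
    return os.path.basename(args.outid)


if __name__ == "__main__":
    args = build_predict_parser().parse_args(["results/sampleA", "--vcf"])
    print(args.vcf, resolve_sample_name(args))
```

With the arguments above, the sample name falls back to the last path component of the output prefix.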

The VCF header looks like this:

##fileformat=VCFv4.2
##FILTER=<ID=PASS,Description="All filters passed">
##contig=<ID=chr1,length=248956422>
...
##contig=<ID=chrUn_JTFH01001998v1_decoy,length=2001>
##ALT=<ID=CNV,Description="Copy number variant region">
##ALT=<ID=DEL,Description="Deletion relative to the reference">
##ALT=<ID=DUP,Description="Region of elevated copy number relative to the reference">
##INFO=<ID=SVLEN,Number=.,Type=Integer,Description="Difference in length between REF and ALT alleles">
##INFO=<ID=SVTYPE,Number=1,Type=String,Description="Type of structural variant">
##INFO=<ID=END,Number=1,Type=Integer,Description="End position of the variant described in this record">
##FILTER=<ID=cnvQual,Description="CNV with quality below 10">
##FILTER=<ID=cnvCopyRatio,Description="CNV with copy ratio within +/- 0.2 of 1.0">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=SM,Number=1,Type=Float,Description="Linear copy ratio of the segment mean">
##FORMAT=<ID=ZS,Number=1,Type=Float,Description="The z-score calculated for the current CNV">
#CHROM	POS	ID	REF	ALT	QUAL	FILTER	INFO	FORMAT	Sample
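
For reference, the `##contig` lines can be derived from a `.fai` file, whose first two tab-separated columns are the sequence name and its length. The sketch below shows one way to do that in plain Python. The function name `contig_header_lines` is hypothetical and the code is not taken from this PR. The offset and line-width columns in the example input are illustrative values only.

```python
def contig_header_lines(fai_lines):
    """Build ##contig header lines from the lines of a FASTA index (.fai).

    Only the first two columns (name, length) are used; the remaining
    columns of the index are ignored.
    """
    header = []
    for raw in fai_lines:
        raw = raw.rstrip("\n")
        if not raw:
            continue  # skip blank lines
        fields = raw.split("\t")
        name, length = fields[0], int(fields[1])
        header.append(f"##contig=<ID={name},length={length}>")
    return header


if __name__ == "__main__":
    # Illustrative index line; columns 3 to 5 are made-up values.
    example = ["chr1\t248956422\t112\t70\t71\n"]
    for line in contig_header_lines(example):
        print(line)
```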

And the variants themselves will look like this:

chr1	6260002	WisecondorX_DUP_4	N	<DUP>	.	.	END=6280000;SVTYPE=CNV;SVLEN=20000	GT:SM:ZS	./.:1.0289:6.62867
chr1	6515002	WisecondorX_DEL_1	N	<DEL>	.	.	END=6895000;SVTYPE=CNV;SVLEN=380000	GT:SM:ZS	./.:-0.2654:-5.26266
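
To make the record layout concrete, here is a small parsing sketch for one of these lines using only the standard library. The function `parse_record` and the dictionary keys it returns are hypothetical names chosen for the example. It assumes a single sample column and tab-separated fields, as in the records above, and it leaves INFO and FORMAT values as strings.

```python
def parse_record(line):
    """Split one single-sample VCF record into its fields.

    INFO becomes a dict (flags without '=' map to True) and the FORMAT
    keys are zipped with the sample values.
    """
    fields = line.rstrip("\n").split("\t")
    chrom, pos, vid, ref, alt, qual, flt, info, fmt, sample = fields[:10]

    info_dict = {}
    for item in info.split(";"):
        key, sep, value = item.partition("=")
        info_dict[key] = value if sep else True

    return {
        "chrom": chrom,
        "pos": int(pos),
        "id": vid,
        "ref": ref,
        "alt": alt,
        "qual": qual,
        "filter": flt,
        "info": info_dict,
        "sample": dict(zip(fmt.split(":"), sample.split(":"))),
    }


if __name__ == "__main__":
    record = parse_record(
        "chr1\t6260002\tWisecondorX_DUP_4\tN\t<DUP>\t.\t.\t"
        "END=6280000;SVTYPE=CNV;SVLEN=20000\tGT:SM:ZS\t./.:1.0289:6.62867"
    )
    print(record["id"], record["info"]["END"], record["sample"]["ZS"])
```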

@nvnieuwk nvnieuwk requested review from matthdsm and mvheetve June 28, 2023 11:27
@nvnieuwk
Author

I moved the fields to INFO, changed the output to segments, and added an ABB flag that is set if the variant is an aberration.

@JspSrs

JspSrs commented Jun 29, 2023

Hi Matthias and others, I do not have a copy of WisecondorX running, but I noticed the VCF output remark (relevant for another project).
For me the DUP_4 in the example is unclear, also in relation to the linear copy ratio in that example. Is the example just a mock-up, or should it reflect reality? If the latter, does the "_4" mean a copy number of 4 (CN=4, e.g. as with a homozygous tandem duplication)? DUP has a specific meaning: insertion of the exact sequence in tandem. GAIN is more neutral and is normally used in CNV analysis. AMP is often used for larger gains, i.e. CN>3 (4 and up).

@nvnieuwk
Author

Hi @JspSrs,
The number is just a counter. The snippet I posted consists of two variants taken from the test VCF, so this value has no real meaning other than making the identifiers unique.

CN currently isn't in the VCF because this isn't supported by WisecondorX at the moment (correct me if I'm wrong @matthdsm).

> DUP has a meaning, like insertion of the exact sequence in tandem. GAIN is more neutral and normally used in CNV analysis. AMP is often for any GAIN amounting more, i.e. CN>3, 4 and up

For this I followed the conventions on CNVs in VCFs with what I could derive from the data available in WisecondorX. I don't think using GAIN is such a good idea since GAIN isn't used in VCFs to specify CNVs.

The info available in the VCF is very limited at the moment. I would like to see it expanded in the future, but for now I can't tell you whether or when that will happen.

-Nicolas

@JspSrs

JspSrs commented Jun 29, 2023

@nvnieuwk, Hi Nicolas,
Thank you for the prompt response.
Regarding "For this I followed the conventions on CNVs in VCFs with what I could derive from the data available in WisecondorX. I don't think using GAIN is such a good idea since GAIN isn't used in VCFs to specify CNVs.": I think this shows that the VCF format needs a more precise definition, with input from cytogenomics specialists.
"GAIN" is more versatile, while DUP has a very specific implication in both DNA diagnostics and cytogenomics (i.e. a specific, identical sequence inserted next to the original, in either inverted or tandem orientation).

I mention it here because sometimes improvements must come from the bottom up. ;-)

Jasper Saris, Dept Clinical Genetics, Erasmus MC
