Skip to content

Use data.table package to parse vcf #31

@xiangzhu

Description

@xiangzhu

Currently ldshrink assumes the input genotype/haplotype data are stored in an n-by-p numerical matrix, which is convenient from statisticians' perspective. However, public genotype/haplotype data from 1000 Genomes are stored in vcf format.

In the past I first used vcftools to convert vcf data to IMPUTE2 format (which is indeed a p-by-n matrix), and then transpose IMPUTE2-formatted data in R. See https://github.com/stephenslab/rss/blob/master/misc/import_1000g_vcf.sh.

This two-step workflow is not so convenient (at least for statisticians): they have to learn a new program like vcftools before any LD-related operations in ldshrink.

It seems that now we can use data.table (https://cran.r-project.org/web/packages/data.table) to directly convert vcf data to the n-by-p matrix in R. Here is an example: https://gist.github.com/cfljam/bc762f1d7b412df594ebc4219bac2d2b.

Here is my own example.

> suppressPackageStartupMessages(library(data.table))
> my_dt <- data.table::fread(cmd="zcat B_CELL_NAIVE.vcf.gz")

|--------------------------------------------------|
|==================================================|
>
> dim(my_dt)
[1] 215708754         8
> head(my_dt)
   #CHROM       POS          ID REF ALT QUAL FILTER
1:  chr15 101094127  rs12904576   T   C    .   PASS
2:  chr15 101114640  rs36053285   T   C    .   PASS
3:  chr15  45685616 rs796721871  TA   T    .   PASS
4:  chr15  45670316   rs2114501   G   A    .   PASS
5:  chr15  45618730  rs59889118   G   A    .   PASS
6:  chr15  45620612   rs1980288   T   C    .   PASS
                                                                                                               INFO
1:   Gene=ENSG00000270127.1;GeneSymbol=ENSG00000270127.1;Pvalue=3.714e-27;Beta=-1.10;Statistic=-15.72;FDR=2.681e-20
2:   Gene=ENSG00000270127.1;GeneSymbol=ENSG00000270127.1;Pvalue=2.649e-24;Beta=-1.09;Statistic=-14.16;FDR=3.412e-18
3: Gene=ENSG00000171766.11;GeneSymbol=ENSG00000171766.11;Pvalue=2.614e-23;Beta=-1.10;Statistic=-13.63;FDR=3.412e-18
4:  Gene=ENSG00000171766.11;GeneSymbol=ENSG00000171766.11;Pvalue=3.007e-23;Beta=-1.09;Statistic=-13.6;FDR=3.412e-18
5:  Gene=ENSG00000171766.11;GeneSymbol=ENSG00000171766.11;Pvalue=3.64e-23;Beta=-1.09;Statistic=-13.56;FDR=3.412e-18
6:  Gene=ENSG00000171766.11;GeneSymbol=ENSG00000171766.11;Pvalue=3.64e-23;Beta=-1.09;Statistic=-13.56;FDR=3.412e-18

The benefit of using data.table here is two-fold: i) users don't have to leave R and use vcftools to get n-by-p genotype matrix from vcf data; ii) data.table is a well-maintained and constantly-upgraded package that can handle large datasets efficiently (at least based on my past experiences).

Hence, we can either add a wrapper that uses data.table to parse vcf for ldshrink users, or at minimum, we can simply provide a vignette showing how to use data.table to parse vcf.

Finally there exists a package vcfR (https://cran.r-project.org/web/packages/vcfR) that might be relevant (but I have not used it much).

Metadata

Metadata

Assignees

No one assigned

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions