Use `data.table` package to parse vcf

Currently `ldshrink` assumes the input genotype/haplotype data are stored in an [n-by-p numerical matrix](https://github.com/stephenslab/ldshrink/blob/901d911677c858590ae88374fe3323535a60550a/R/LDshrink.R#L8), which is convenient from statisticians' perspective. However, public genotype/haplotype data from 1000 Genomes are stored in vcf format.

In the past I first used `vcftools` to convert vcf data to `IMPUTE2` format (which is indeed a p-by-n matrix), and then transpose `IMPUTE2`-formatted data in R. See https://github.com/stephenslab/rss/blob/master/misc/import_1000g_vcf.sh.

This two-step workflow is not so convenient (at least for statisticians): they have to learn a new program like `vcftools` before any LD-related operations in `ldshrink`.     

It seems that now we can use `data.table` (https://cran.r-project.org/web/packages/data.table) to **directly** convert vcf data to the n-by-p matrix in R. Here is an example: https://gist.github.com/cfljam/bc762f1d7b412df594ebc4219bac2d2b.

Here is my own example.

```r
> suppressPackageStartupMessages(library(data.table))
> my_dt <- data.table::fread(cmd="zcat B_CELL_NAIVE.vcf.gz")

|--------------------------------------------------|
|==================================================|
>
> dim(my_dt)
[1] 215708754         8
> head(my_dt)
   #CHROM       POS          ID REF ALT QUAL FILTER
1:  chr15 101094127  rs12904576   T   C    .   PASS
2:  chr15 101114640  rs36053285   T   C    .   PASS
3:  chr15  45685616 rs796721871  TA   T    .   PASS
4:  chr15  45670316   rs2114501   G   A    .   PASS
5:  chr15  45618730  rs59889118   G   A    .   PASS
6:  chr15  45620612   rs1980288   T   C    .   PASS
                                                                                                               INFO
1:   Gene=ENSG00000270127.1;GeneSymbol=ENSG00000270127.1;Pvalue=3.714e-27;Beta=-1.10;Statistic=-15.72;FDR=2.681e-20
2:   Gene=ENSG00000270127.1;GeneSymbol=ENSG00000270127.1;Pvalue=2.649e-24;Beta=-1.09;Statistic=-14.16;FDR=3.412e-18
3: Gene=ENSG00000171766.11;GeneSymbol=ENSG00000171766.11;Pvalue=2.614e-23;Beta=-1.10;Statistic=-13.63;FDR=3.412e-18
4:  Gene=ENSG00000171766.11;GeneSymbol=ENSG00000171766.11;Pvalue=3.007e-23;Beta=-1.09;Statistic=-13.6;FDR=3.412e-18
5:  Gene=ENSG00000171766.11;GeneSymbol=ENSG00000171766.11;Pvalue=3.64e-23;Beta=-1.09;Statistic=-13.56;FDR=3.412e-18
6:  Gene=ENSG00000171766.11;GeneSymbol=ENSG00000171766.11;Pvalue=3.64e-23;Beta=-1.09;Statistic=-13.56;FDR=3.412e-18
```

The benefit of using `data.table` here is two-fold: i) users don't have to leave R and use `vcftools` to get n-by-p genotype matrix from vcf data; ii) `data.table` is a well-maintained and constantly-upgraded package that can handle large datasets efficiently (at least based on my past experiences).

Hence, we can either add a wrapper that uses `data.table` to parse vcf for `ldshrink` users, or at minimum, we can simply provide a vignette showing how to use `data.table` to parse vcf. 

Finally there exists a package `vcfR` (https://cran.r-project.org/web/packages/vcfR) that might be relevant (but I have not used it much).

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Use `data.table` package to parse vcf #31

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Use data.table package to parse vcf #31

Description

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions

Use `data.table` package to parse vcf #31