USER GUIDE
1. Github link
2. Running environment
3. How to run
4. Input
5. Output
6. Background
7. Limits
1. Github link
https://github.com/jerryctnbio/myAnnotator
2. Running environment
2.1 This tool works with Python 3. It was tested on Python 3.7.7 under CentOS 7.
2.2 It runs a local vep. It was tested with vep version 100.
2.3 The vep executable has to be in a directory listed in the $PATH environment variable.
2.4 vep is short for "Variant Effect Predictor". It is a tool from
Ensembl and can be obtained from the following address:
http://uswest.ensembl.org/info/docs/tools/vep/index.html
2.5 vep is run with local genome cache files, completely offline.
2.6 One can download the cache tarballs from the following address.
Follow the manual download instructions there. You need GRCh37 for
this challenge.
https://uswest.ensembl.org/info/docs/tools/vep/script/vep_cache.html
3. How to run
3.1 Clone the repo (section 1) into a local directory, say, 'try'
git clone https://github.com/jerryctnbio/myAnnotator try
cd try
3.2 Issue the following command
./python myAnnotator2.py -i x.vcf
3.3 A more complete usage is given below. One can also issue the following
command to see the usage.
./python myAnnotator2.py -h
Usage:
-i x.vcf
required.
the input vcf file.
-o x.annotations.tsv
optional.
the output tsv file. If not specified, the name is derived from the
input file. For example, input=x.vcf, output=x.annotations.tsv
-f False
optional.
whether to filter out variants that failed FILTER in the vcf file. If
not specified, it is set to False.
-s None
optional.
the variant severity level file. This is a tab-delimited, two-column
file with a **required** first header line. Comment
lines starting with '#' are allowed. The first
column is the variant consequence. The second column is the
corresponding severity level (an integer from 1 up). The lower
the level, the more severe the variant is. The levels are based
on the Ensembl consequence table at the address below.
https://uswest.ensembl.org/info/genome/variation/prediction/predicted_data.html#consequences
If not specified, it is **assumed** that a file named
"myAnnotator.severity.txt" is in the current working directory.
-g GRCh37
optional. Either 'GRCh37' or 'GRCh38', nothing else.
human genome version.
defaults to GRCh37 to match the challenge.
-c None
optional.
defaults to whatever the local vep points to.
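The severity file described under -s can be read with a few lines of Python. The following is only a minimal sketch, not the code used by myAnnotator2.py: the function name load_severity, the example levels, and the choice to treat the first non-comment line as the header are all assumptions made for illustration.

```python
import io


def load_severity(handle):
    """Read a tab-delimited consequence/severity table into a dict.

    Hypothetical helper: assumes the first non-comment line is the header.
    """
    levels = {}
    header_seen = False
    for line in handle:
        line = line.rstrip("\n")
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and '#' comment lines
        if not header_seen:
            header_seen = True  # required header line, not data
            continue
        consequence, level = line.split("\t")[:2]
        levels[consequence.strip()] = int(level)
    return levels


# Illustrative content only; the real levels come from the severity file.
example = io.StringIO(
    "# lower level means more severe\n"
    "consequence\tlevel\n"
    "stop_gained\t1\n"
    "missense_variant\t2\n"
    "intron_variant\t3\n"
)
severity = load_severity(example)
print(severity)
```

A dict keyed by consequence makes the later comparison of severity levels a simple lookup.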
4. Input
The input is a VCF file.
5. Output
The output is a tab delimited text file with the following columns.
5.1 The following five (5) columns are exactly the same as in the VCF
file.
column 1: Chr
column 2: Pos
column 3: ID
column 4: Ref
column 5: Alt
5.2 The next columns give the gene symbol, the depth and the reference allele
column 6: Gene symbol
column 7: Depth
column 8: Reference allele count
column 9: Reference allele percentage
5.3 The next columns are for all the alternate alleles. Within each
column, the values for all alternate alleles are grouped together,
separated by commas. For example, '18,34' denotes two alternate
alleles with counts of 18 and 34, respectively.
column 10: All alternate allele counts
column 11: All alternate allele percentages
5.4 The next columns are for the most severe allele
column 12: Variant class, such as 'snv'
column 13: Allele frequency from ExAC. If no entry in the database,
it is set to 'allele_freq_not_in_ExAC'.
column 14: The most deleterious effect of the variant
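As an illustration of how columns 7 to 11 relate to each other, the sketch below builds them from a depth and a set of allele counts. It assumes that a percentage is the allele count divided by the depth, times 100, and that values are printed with two decimals; neither detail is stated in this guide, and the function name allele_columns is hypothetical.

```python
def allele_columns(depth, ref_count, alt_counts):
    """Return columns 7-11 as strings (hypothetical layout helper)."""

    def pct(count):
        # Assumed definition: share of the total depth, in percent.
        return 100.0 * count / depth if depth else 0.0

    return [
        str(depth),                                    # column 7
        str(ref_count),                                # column 8
        f"{pct(ref_count):.2f}",                       # column 9
        ",".join(str(c) for c in alt_counts),          # column 10
        ",".join(f"{pct(c):.2f}" for c in alt_counts), # column 11
    ]


cols = allele_columns(depth=100, ref_count=48, alt_counts=[18, 34])
print("\t".join(cols))
```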
6. Background
This script annotates variants in a VCF file. Besides the depth and read
count information taken directly from the VCF file, it also queries the ExAC
database (http://exac.hms.harvard.edu) for the minor allele frequency. The
variant effect is obtained by running a local vep program (see section 2).
Only the most deleterious effect is kept in the output for each variant
in the VCF file. This applies whether the variant matches multiple
transcripts, has multiple alternate alleles, or both. See the
example below.
The variant '3 49397819 . GCAAAG GAAAAA,AAAAAA' is
a variant on chromosome 3 at position 49397819. The reference allele is
'GCAAAG'. It has two alternate alleles 'GAAAAA' and 'AAAAAA'.
The first alternate allele 'GAAAAA' matches five (5) transcripts,
each of which has a corresponding consequence, e.g., 'intron_variant'. The
other alternate allele 'AAAAAA' also matches five (5) transcripts with
their consequences.
All of the ten (10) consequences are compared and the most severe one
is picked. The detailed information about that alternate allele is output,
such as the frequency from ExAC, the consequence and the variant class.
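The selection step described above can be sketched as follows. This is an assumed illustration rather than the script's actual code: the tuple layout, the transcript names, the consequences and the severity levels are all made up, and only two transcripts per allele are shown to keep the example short.

```python
def most_severe(annotations, severity):
    """Pick the annotation whose consequence has the lowest severity level.

    annotations: list of (alt_allele, transcript, consequence) tuples.
    Consequences missing from the severity table are ranked last.
    """
    return min(annotations, key=lambda a: severity.get(a[2], float("inf")))


# Hypothetical severity levels (lower is more severe).
severity = {"missense_variant": 2, "intron_variant": 3, "upstream_gene_variant": 4}

# Hypothetical per-transcript consequences for the two alternate alleles.
annotations = [
    ("GAAAAA", "tx1", "intron_variant"),
    ("GAAAAA", "tx2", "upstream_gene_variant"),
    ("AAAAAA", "tx1", "intron_variant"),
    ("AAAAAA", "tx2", "missense_variant"),
]

best = most_severe(annotations, severity)
print(best)
```

Because all alleles and transcripts are pooled into one list before comparing, the same call covers the multi-transcript case, the multi-allele case, and both together.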
7. Limits
7.1 This tool queries the ExAC database one variant at a time. A better
way would be to query in bulk: first collect all the variants, divide
them into big chunks of, say, 1000 variants, and then query one chunk
at a time. Unfortunately, the ExAC database always returned a 405 error
for bulk queries, so in the end the query was implemented one variant
at a time.
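For reference, the chunking part of the bulk approach could look like the sketch below. It only splits a list of variants into chunks of 1000 and makes no network request; the function name chunked and the variant string format are assumptions for illustration, and no ExAC endpoint is implied.

```python
def chunked(items, size=1000):
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


# Hypothetical variant identifiers, used only to show the chunk sizes.
variants = [f"3-{49397819 + i}-G-A" for i in range(2500)]
sizes = [len(chunk) for chunk in chunked(variants)]
print(sizes)  # [1000, 1000, 500]
```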