+## NEW!!! Current Version: v1.4.0 (released May 10, 2024)
+- Added support for compressed (gz) hapcount/dosage and phenotype files.
+- Improved file reading efficiency by implementing fread in chunks, mitigating memory errors.
+- Implemented parallel processing for regression, resulting in significant speed improvements with multi-core systems.
+- Enhanced flexibility in organizing phenotype files:
+  - Users can specify sample ID column (`--sampleidcol`), phenotype ID column (`--phenocol`), and covariate column list (`--covarcollist`)
+- Updated output summary statistics to include SE and t-val, with column names adjusted to adhere to GWAS standards.

-**Current version: 1.1.0**
+# TRACTOR - Local Ancestry Aware GWAS

Tractor is a specialized tool designed to enhance Genome-Wide Association Studies (GWAS) for diverse cohorts by addressing challenges associated with analyzing admixed populations. Admixed populations are often excluded from genomic studies due to concerns about how to properly account for their complex ancestry.

@@ -11,20 +17,22 @@ Tractor facilitates the inclusion of admixed individuals in association studies

## Classic GWAS vs. TRACTOR GWAS

Unlike traditional GWAS methods, Tractor requires local ancestry estimates in its analyses. It employs a multi-step approach involving phasing, local ancestry inference, and regression analysis with ancestral allele dosages. This method aims to improve the accuracy of association analyses in cohorts with diverse ancestries, overcoming issues such as population stratification and variable linkage disequilibrium patterns.
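
For orientation, the regression step can be written out explicitly. The form below is a sketch of the two-way admixed model as described in the Tractor publication; the symbol names are notation assumed for this illustration and are not defined in this README.

$$
\operatorname{logit}[Y] = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_3 + \dots + \beta_k X_k
$$

Here $X_1$ is the number of haplotypes of the index ancestry at the locus (the local ancestry term), $X_2$ and $X_3$ are the numbers of risk-allele copies carried on haplotypes of the first and second ancestry, and $X_4, \dots, X_k$ are covariates. For a quantitative phenotype the left-hand side is simply $Y$.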
* [Steps for Running Tractor on Hail](#steps-for-running-tractor-on-hail)
-* [Output Files](#output-files)
* [License](#license)
* [Cite this article](#cite-this-article)

## Setup Conda environment

-We recommend creating a Conda environment to run Tractor locally.
+We recommend creating a Conda environment to run Tractor locally. This installs the Python 3 and R dependencies required by the scripts.
```bash
conda env create -f conda_py3_tractor.yml
conda activate py3_tractor
```
@@ -38,18 +46,20 @@ conda activate py3_tractor

All scripts described in the following steps are available in the [`scripts`](https://github.com/Atkinson-Lab/Tractor-New/tree/main/scripts) directory, and the Hail implementation is in the [`ipynbs`](https://github.com/Atkinson-Lab/Tractor-New/tree/main/ipynbs) directory.

-### Optional Step: Recovering Haplotypes Disrupted by Statistical Phasing
+### Step 0 [Optional]: Recovering Haplotypes Disrupted by Statistical Phasing

Statistical phasing can lead to switch errors as described in [Fig. 1](https://www.nature.com/articles/s41588-020-00766-y/figures/1) of the Tractor publication.
For this purpose, we have written two scripts, `unkink_2way_mspfile.py` and `unkink_2way_genofile.py`. These scripts recover disrupted tracts from the **MSP file and VCF file**, rectifying switch errors, and output an unkinked VCF file that can be used in subsequent steps. They are currently implemented for two-way admixed populations only.
- `unkink_2way_mspfile.py`
+
```
--msp Path stem to MSP file, not including ".msp.tsv". (Must end in .msp.tsv)
```
- **Output File:**
  - The output file \*.switches.txt includes information on windows from the MSP file that need to be switched.
  - This file will serve as an input to `unkink_2way_genofile.py`
- `unkink_2way_genofile.py`
+
```
--switches Path to *.switches.txt, which includes info on windows to be switched
--genofile Path stem to input VCF with phased genotypes, not including .vcf suffix
```
@@ -59,11 +69,11 @@ For this purpose, we have written two scripts, `unkink_2way_mspfile.py` and `unk

[Contents](#contents)

-### Extracting Tracts and Ancestry Dosages
+### Step 1: Extracting Tracts and Ancestry Dosages

Simultaneously extract risk allele and local ancestry information, a prerequisite for running Tractor GWAS. The scripts output risk allele by ancestry dosages and haplotype counts for the input VCF files. A file of each of these is generated for each ancestry component.
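
To make the two outputs concrete, here is a minimal Python sketch of how a haplotype count and an ancestry-specific dosage can be derived for one sample at one variant. It is an illustration only, not the code in `extract_tracts.py`; the function name and the tuple-based inputs are assumptions made for this example.

```python
def hapcount_and_dosage(alleles, ancestries, n_anc):
    """Per-ancestry haplotype counts and ALT-allele dosages for one sample at one variant.

    alleles    -- (a0, a1): phased alleles on the two haplotypes, 0 = REF, 1 = ALT
    ancestries -- (l0, l1): local ancestry label of each haplotype, 0 .. n_anc - 1
    """
    hapcount = [0] * n_anc  # how many of the sample's haplotypes come from each ancestry
    dosage = [0] * n_anc    # how many ALT alleles sit on haplotypes of each ancestry
    for allele, anc in zip(alleles, ancestries):
        hapcount[anc] += 1
        dosage[anc] += allele
    return hapcount, dosage


# A heterozygous "0|1" sample: first haplotype from anc0, second from anc1.
hc, ds = hapcount_and_dosage((0, 1), (0, 1), n_anc=2)
print(hc, ds)  # [1, 1] [0, 1]
```

Repeating this over every sample and variant gives one hapcount table and one dosage table per ancestry, which matches the "a file of each for each ancestry component" description above.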
- **Note that the input VCF file must be the phased file on which local ancestry was called.**
-- Running `extract_tracts.py` requires the **input MSP and VCF file**, and the number of ancestral populations within the VCF file. This script outputs the dosage and hapcount files required for running Tractor using `run_tractor.R`.
+- Running `extract_tracts.py` requires the **input MSP and VCF file**, and the number of ancestral populations within the VCF file. This script outputs the dosage and hapcount files required for running Tractor.
- `extract_tracts.py`:
```
--vcf Path to VCF file (*.vcf or *.vcf.gz)
```
@@ -110,63 +120,154 @@ Simultaneously extract risk allele and local ancestry information, a prerequisit

[Contents](#contents)

-### Running Tractor
+### Step 2: Running Tractor

-- The Tractor code runs in R, and all required library packages should be installed within the Conda environment.
-- Arguments:
-```
---hapdose Prefix of hapcount and dosage files generated
-          other columns will be treated as covariates. Missing data is allowed.
---method  "linear" or "logistic"
---out     Output file name for ancestry-specific summary statistics
-```
+The Tractor code runs in R, and to make sure the script works, you'll need to install the following libraries. Your conda environment should handle these installations by default.
+```
+install.packages('optparse')
+install.packages('data.table')
+install.packages('R.utils')
+install.packages('dplyr')
+install.packages('doParallel')
+```

-**Example run:**
-```
-${script_path}/run_tractor.R \
---hapdose dataset_qc_phased \
---phe dataset_qc_pheno_covars.txt \
---method logistic \
---out dataset_qc_phased_sumstats
-```
+**Arguments:**
+```
+--hapdose [Mandatory] Prefix for hapcount and dosage files.
+- In real-world scenarios, datasets may vary in size and default assumptions may not apply. Tractor accommodates these scenarios with optional arguments.
+- Assuming a phenotype file with columns: PC1, PC2, PC3, PC4, PC5, age, sex, pheno1, pheno2, pheno3, sample_id, users can perform GWAS across different phenotypes.
+- Below is an example to run Tractor GWAS for the **pheno1** phenotype:
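
Independently of the command line, the column roles can be illustrated in Python. The sketch below resolves a sample ID column, a phenotype column, and a covariate list against the header described above, mirroring what `--sampleidcol`, `--phenocol`, and `--covarcollist` express. The function `select_columns` is hypothetical and is not part of Tractor; the choice of covariates is arbitrary.

```python
def select_columns(header, sample_id_col, pheno_col, covar_cols):
    """Map each role to its column index, failing early if a name is missing."""
    index = {name: i for i, name in enumerate(header)}
    missing = [c for c in [sample_id_col, pheno_col, *covar_cols] if c not in index]
    if missing:
        raise KeyError(f"columns not found in phenotype file: {missing}")
    return {
        "sample_id": index[sample_id_col],
        "phenotype": index[pheno_col],
        "covariates": [index[c] for c in covar_cols],
    }


# Header taken from the phenotype file layout described in the text.
header = "PC1 PC2 PC3 PC4 PC5 age sex pheno1 pheno2 pheno3 sample_id".split()
roles = select_columns(header, "sample_id", "pheno1", ["age", "sex", "PC1", "PC2"])
print(roles)  # {'sample_id': 10, 'phenotype': 7, 'covariates': [5, 6, 0, 1]}
```

Because columns are chosen by name, the same file can be reused for `pheno2` or `pheno3` by changing only the phenotype column.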
+- Users can utilize multi-threading for improved performance, and control file reading with chunking to avoid memory errors with extremely large files.
+- Ensure a balance between chunk size and thread count to optimize performance without encountering memory issues.
+- Below is an example to run Tractor GWAS for the **pheno1** phenotype with multithreading (4 CPUs) and a larger chunk size (15000):
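
As a separate illustration of why chunking bounds memory use, the Python sketch below reads a gzipped table a fixed number of lines at a time. Tractor itself does this in R with `fread`; this is only an analogy, and the file name and column layout in the demo are hypothetical.

```python
import gzip
import os
import tempfile
from itertools import islice


def read_in_chunks(path, chunk_size):
    """Yield lists of up to chunk_size data lines from a gzipped text file."""
    with gzip.open(path, "rt") as handle:
        next(handle, None)  # skip the header line
        while True:
            chunk = list(islice(handle, chunk_size))
            if not chunk:
                return
            yield chunk  # only this chunk is held in memory at a time


# Demo on a throwaway file: one header line plus five variant lines.
with tempfile.TemporaryDirectory() as tmpdir:
    demo_path = os.path.join(tmpdir, "demo.anc0.dosage.txt.gz")  # hypothetical name
    with gzip.open(demo_path, "wt") as out:
        out.write("CHROM\tPOS\tID\tREF\tALT\tsample1\n")
        for pos in range(1, 6):
            out.write(f"1\t{pos}\trs{pos}\tA\tG\t0\n")
    sizes = [len(chunk) for chunk in read_in_chunks(demo_path, chunk_size=2)]

print(sizes)  # [2, 2, 1]
```

A larger chunk size means fewer read calls but more rows in memory at once, and each worker thread adds its own working set, which is the trade-off the bullets above describe.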
+Tractor generates ancestry-specific summary statistics. The number of columns in the output file depends on the number of ancestries in the input.
+
+All summary statistic files include:
+* **Variant Information:**
+  * CHR: Chromosome
+  * POS: Position
+  * ID: SNP ID
+  * REF: Reference allele
+  * ALT: Alternate allele
+* **Sample Size:**
+  * N: Total number of samples going into the model (after excluding NAs).
+    * Note that this number can differ from the number of samples in the hapcount/dosage files, because samples with NAs in the phenotype file are skipped.
+* **Allele Frequency (AF), Local Ancestry Proportion (LAprop), Effect Size (beta), Standard Error (se), p-value (pval), and t-value (tval):**
+  * With 'n' ancestry terms (anc), expect 'n' sets of these columns. For instance, if there are n=2 ancestries, expect 2 sets of columns for each of these parameters.
+* **Local Ancestry (LA) Related Columns:**
+  * LApval: p-value for the local ancestry term (X1 term in Tractor)
+  * LAeff: Effect size for the local ancestry term (X1 term in Tractor)
+  * For 'n' ancestry terms (anc), expect 'n-1' sets of these columns. For example, if there are n=2 ancestries, expect 1 set of columns for each of these parameters.
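
The counting rule above can be sketched as a short Python helper that lists the columns to expect for a given number of ancestries. Column names follow the `_anc0` pattern used in the README's example output; the ordering of columns within the file is an assumption of this sketch, and `expected_columns` is a hypothetical helper, not part of Tractor.

```python
def expected_columns(n_anc):
    """List summary-statistic column names for n_anc ancestries (order assumed)."""
    cols = ["CHR", "POS", "ID", "REF", "ALT", "N"]
    per_ancestry = ["AF", "LAprop", "beta", "se", "pval", "tval"]
    for i in range(n_anc):        # n sets of ancestry-specific statistics
        cols += [f"{stat}_anc{i}" for stat in per_ancestry]
    for i in range(n_anc - 1):    # n-1 sets of local ancestry terms
        cols += [f"LApval_anc{i}", f"LAeff_anc{i}"]
    return cols


print(len(expected_columns(2)))  # 20
print(len(expected_columns(3)))  # 28
```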
+
+**Example Output File Structure**
+```
+CHR             Chromosome
+POS             Position
+ID              SNP ID
+REF             Reference allele
+ALT             Alternate allele
+N               Total sample size
+AF_anc0         Allele frequency for anc0; sum(dosage)/sum(local ancestry)
+LAprop_anc0     Local ancestry proportion for anc0; sum(local ancestry)/(2 * sample size)
+beta_anc0       Effect size for alternate alleles inherited from anc0
+se_anc0         Standard error for effect size (beta_anc0)
+pval_anc0       p-value for alternate alleles inherited from anc0 (NOT -log10(pvalues))
+tval_anc0       t-value for anc0
+...
+LApval_anc0     p-value for the local ancestry term (X1 term in Tractor)
+LAeff_anc0      Effect size for the local ancestry term (X1 term in Tractor)
+...
+
+```
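
The two ratio definitions in the listing can be worked through on toy numbers. This Python sketch reads "local ancestry" as the per-sample haplotype count for the ancestry, and reads LAprop as the sum of those counts divided by (2 × sample size); both readings are assumptions of this example, and the function is hypothetical rather than Tractor code.

```python
def af_and_laprop(dosages, hapcounts):
    """Ancestry-specific allele frequency and local ancestry proportion at one variant.

    dosages   -- per-sample ALT allele counts on haplotypes of this ancestry
    hapcounts -- per-sample number of haplotypes (0, 1, or 2) of this ancestry
    """
    n_samples = len(hapcounts)
    total_haps = sum(hapcounts)
    # AF is undefined when no haplotype of this ancestry is present at the variant.
    af = sum(dosages) / total_haps if total_haps else float("nan")
    laprop = total_haps / (2 * n_samples)
    return af, laprop


# Three samples carrying 2, 1, and 2 haplotypes of anc0, with 1, 0, and 2 ALT alleles on them.
af, laprop = af_and_laprop([1, 0, 2], [2, 1, 2])
print(round(af, 3), round(laprop, 3))  # 0.6 0.833
```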

[Contents](#contents)

## Steps for Running Tractor on Hail
- The Hail implementation of the pipeline is described in [`hail_example_tractor_gwas.ipynb`](https://github.com/Atkinson-Lab/Tractor-New/blob/main/ipynbs/hail_example_tractor_gwas.ipynb).

-[Contents](#contents)
-
## License
The Tractor program is licensed under the MIT License. You may obtain a copy of the License [here](https://github.com/Atkinson-Lab/Tractor-New/blob/main/LICENSE).

-[Contents](#contents)
-
## Cite this article
-
The methodology and utility of Tractor are more fully described in our manuscript. If you use Tractor in your research, please cite the following article:

> Atkinson, E.G., Maihofer, A.X., Kanai, M. et al. Tractor uses local ancestry to enable the inclusion of admixed individuals in GWAS and to boost power. Nat Genet 53, 195–204 (2021). [Link](https://doi.org/10.1038/s41588-020-00766-y)