Commit ceff9e1

major v1.4.0 update
1 parent f57f614 commit ceff9e1

3 files changed

+541
-184
lines changed

.DS_Store

-6 KB
Binary file not shown.

README.md

Lines changed: 155 additions & 54 deletions
@@ -1,8 +1,14 @@
11
![](images/tractor_icon.png)
22

3-
# TRACTOR - Local Ancestry Aware GWAS
3+
## NEW!!! Current Version: v1.4.0 (released May 10, 2024)
4+
- Added support for compressed (gz) hapcount/dosage and phenotype files.
5+
- Improved file reading efficiency by implementing fread in chunks, mitigating memory errors.
6+
- Implemented parallel processing for regression, resulting in significant speed improvements on multi-core systems.
7+
- Enhanced flexibility in organizing phenotype files:
8+
- Users can specify sample ID column (`--sampleidcol`), phenotype ID column (`--phenocol`), and covariate column list (`--covarcollist`)
9+
- Updated output summary statistics to include SE and t-val, with column names adjusted to adhere to GWAS standards.
410

5-
**Current version: 1.1.0**
11+
# TRACTOR - Local Ancestry Aware GWAS
612

713
Tractor is a specialized tool designed to enhance Genome-Wide Association Studies (GWAS) for diverse cohorts by addressing challenges associated with analyzing admixed populations. Admixed populations are often excluded from genomic studies due to concerns about how to properly account for their complex ancestry.
814

@@ -11,20 +17,22 @@ Tractor facilitates the inclusion of admixed individuals in association studies
1117
## Classic GWAS vs. TRACTOR GWAS
1218
Unlike traditional GWAS methods, Tractor requires local ancestry estimates in its analyses. It employs a multi-step approach involving phasing, local ancestry inference, and regression analysis with ancestral allele dosages. This method aims to improve the accuracy of association analyses in cohorts with diverse ancestries, overcoming issues such as population stratification and variable linkage disequilibrium patterns.
1319

20+
21+
1422
## Contents
1523
* [Setup Conda environment](#setup-conda-environment)
1624
* [Steps for Running Tractor Locally](#steps-for-running-tractor-locally)
17-
* [Optional Step: Recovering Haplotypes Disrupted by Statistical Phasing](#optional-step-recovering-haplotypes-disrupted-by-statistical-phasing)
18-
* [Extracting Tracts and Ancestry Dosages](#extracting-tracts-and-ancestry-dosages)
19-
* [Running Tractor](#running-tractor)
25+
* [Step 0 \[Optional\]: Recovering Haplotypes Disrupted by Statistical Phasing](#step-0-optional-recovering-haplotypes-disrupted-by-statistical-phasing)
26+
* [Step 1: Extracting Tracts and Ancestry Dosages](#step-1-extracting-tracts-and-ancestry-dosages)
27+
* [Step 2: Running Tractor](#step-2-running-tractor) **(Tractor v1.4.0 released with additional functionalities)**
28+
* [Output Files (Running Tractor)](#output-files-running-tractor)
2029
* [Steps for Running Tractor on Hail](#steps-for-running-tractor-on-hail)
21-
* [Output Files](#output-files)
2230
* [License](#license)
2331
* [Cite this article](#cite-this-article)
2432

2533
## Setup Conda environment
2634

27-
We recommend creating a Conda environment to run Tractor locally.
35+
We recommend creating a Conda environment to run Tractor locally. This will install the necessary Python 3 and R dependencies required by the scripts.
2836
```bash
2937
conda env create -f conda_py3_tractor.yml
3038
conda activate py3_tractor
@@ -38,18 +46,20 @@ conda activate py3_tractor
3846

3947
All scripts described in the following steps are available in the [`scripts`](https://github.com/Atkinson-Lab/Tractor-New/tree/main/scripts) directory, and the Hail implementation is in the [`ipynbs`](https://github.com/Atkinson-Lab/Tractor-New/tree/main/ipynbs) directory.
4048

41-
### Optional Step: Recovering Haplotypes Disrupted by Statistical Phasing
49+
### Step 0 [Optional]: Recovering Haplotypes Disrupted by Statistical Phasing
4250

4351
Statistical phasing can lead to switch errors as described in [Fig. 1](https://www.nature.com/articles/s41588-020-00766-y/figures/1) of the Tractor publication.
4452
To correct these errors, we have written two scripts, `unkink_2way_mspfile.py` and `unkink_2way_genofile.py`. These scripts recover disrupted tracts from the **MSP file and VCF file** and output an unkinked VCF file that can be used in subsequent steps. Currently they are implemented for two-way admixed populations only.
4553
- `unkink_2way_mspfile.py`
54+
4655
```
4756
--msp Path stem to MSP file, not including ".msp.tsv" (the file name itself must end in .msp.tsv)
4857
```
4958
- **Output File:**
5059
- The output file \*.switches.txt lists the windows from the MSP file that need to be switched.
5160
- This file will serve as an input to `unkink_2way_genofile.py`
5261
- `unkink_2way_genofile.py`
62+
5363
```
5464
--switches Path to *.switches.txt, which includes info on windows to be switched
5565
--genofile Path stem to input VCF with phased genotypes, not including .vcf suffix
@@ -59,11 +69,11 @@ For this purpose, we have written two scripts, `unkink_2way_mspfile.py` and `unk
5969

6070
[Contents](#contents)
6171

62-
### Extracting Tracts and Ancestry Dosages
72+
### Step 1: Extracting Tracts and Ancestry Dosages
6373

6474
This step simultaneously extracts risk allele and local ancestry information, a prerequisite for running Tractor GWAS. The script outputs risk allele dosages by ancestry and haplotype counts for the input VCF file. One file of each type is generated for each ancestry component.
6575
- **Note that the input VCF file must be the phased file on which local ancestry was called.**
66-
- Running `extract_tracts.py` requires the **input MSP and VCF file**, and the number of ancestral populations within the VCF file. This script outputs the dosage and hapcount files required for running Tractor using `run_tractor.R`.
76+
- Running `extract_tracts.py` requires the **input MSP and VCF file**, and the number of ancestral populations within the VCF file. This script outputs the dosage and hapcount files required for running Tractor.
6777
- `extract_tracts.py`:
6878
```
6979
--vcf Path to VCF file (*.vcf or *.vcf.gz)
@@ -110,63 +120,154 @@ Simultaneously extract risk allele and local ancestry information, a prerequisit
110120
111121
[Contents](#contents)
112122
113-
### Running Tractor
123+
### Step 2: Running Tractor
114124
115-
- The Tractor code runs in R, and all required library packages should be installed within the Conda environment.
116-
- Arguments:
117-
```
118-
--hapdose Prefix of hapcount and dosage files generated
119-
--phe Phenotype file; 1st column sample ID, 2nd column phenotype,
120-
other columns will be treated as covariates. Missing data is allowed.
121-
--method "linear" or "logistic"
122-
--out Output file name for ancestry-specific summary statistics
123-
```
125+
The Tractor code runs in R and requires the following libraries. The Conda environment should install them by default; if it does not, install them manually:
126+
```
127+
install.packages('optparse')
128+
install.packages('data.table')
129+
install.packages('R.utils')
130+
install.packages('dplyr')
131+
install.packages('doParallel')
132+
```
124133
125-
**Example run:**
126-
```
127-
${script_path}/run_tractor.R \
128-
--hapdose dataset_qc_phased \
129-
--phe dataset_qc_pheno_covars.txt \
130-
--method logistic \
131-
--out dataset_qc_phased_sumstats
132-
```
134+
**Arguments:**
135+
```
136+
--hapdose [Mandatory] Prefix for hapcount and dosage files.
137+
E.g. If you have the following files:
138+
filename.anc0.dosage.txt filename.anc0.hapcount.txt
139+
filename.anc1.dosage.txt filename.anc1.hapcount.txt
140+
use "--hapdose filename".
141+
--phenofile [Mandatory] Path to the file containing phenotype and covariate data.
142+
Default assumptions: Sample ID column: "IID" or "#IID", Phenotype column: "y".
143+
If different column names are used, refer to --sampleidcol and --phenocol arguments.
144+
All covariates MUST be included using --covarcollist.
145+
--covarcollist [Mandatory] Specify column names of covariates in the --phenofile.
146+
Only listed columns will be included as covariates.
147+
Separate multiple covariates with commas.
148+
E.g. --covarcollist age,sex,PC1,PC2.
149+
To exclude covariates, specify "--covarcollist none".
150+
--method [Mandatory] Specify the method to be used: <linear> or <logistic>.
151+
--output [Mandatory] File name for summary statistics output.
152+
E.g. /path/to/file/output_sumstats.txt
153+
154+
--sampleidcol [Optional] Specify sample ID column name in the --phenofile.
155+
Default: "IID" or "#IID"
156+
--phenocol [Optional] Specify phenotype column name in the --phenofile.
157+
Default: "y"
158+
--chunksize [Optional] Number of rows to read at once from hapcount and dosage files.
159+
Use smaller values for lower memory usage.
160+
Note: Higher chunksize speeds up streaming but requires more memory.
161+
If out-of-memory errors occur, try increasing memory or
162+
reducing --chunksize or --nthreads.
163+
Default: 10000
164+
--nthreads [Optional] Specify number of threads to use.
165+
Increasing threads can speed up processing but may increase memory usage.
166+
Default: 1
167+
--totallines [Optional] Specify total number of lines in hapcount/dosage files (wc -l *.hapcount.txt).
168+
If not provided, it will be calculated internally (recommended).
169+
Exercise caution: if --totallines is smaller than the actual lines in the files,
170+
only a subset of data will be analyzed. If larger than the actual lines in the files,
171+
an error will occur. Both scenarios are discouraged.
172+
```
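The `--hapdose` naming convention described above can be checked before a run. The following is a minimal Python sketch, not part of Tractor: the helper names `expected_input_files` and `missing_input_files` are hypothetical, and the assumption that compressed inputs simply carry an extra `.gz` suffix is ours rather than something stated in the argument list.

```python
from __future__ import annotations

from pathlib import Path


def expected_input_files(hapdose_prefix: str, n_ancestries: int) -> list[str]:
    """List the dosage/hapcount file names implied by a --hapdose prefix."""
    files = []
    for anc in range(n_ancestries):
        for kind in ("dosage", "hapcount"):
            files.append(f"{hapdose_prefix}.anc{anc}.{kind}.txt")
    return files


def missing_input_files(hapdose_prefix: str, n_ancestries: int) -> list[str]:
    """Return expected files found neither as plain text nor with a .gz suffix."""
    missing = []
    for name in expected_input_files(hapdose_prefix, n_ancestries):
        # Assumption: a compressed input is named "<file>.txt.gz".
        if not Path(name).exists() and not Path(name + ".gz").exists():
            missing.append(name)
    return missing


if __name__ == "__main__":
    # For a two-way admixed dataset with prefix "filename":
    for name in expected_input_files("filename", 2):
        print(name)
```

Running `missing_input_files` on the prefix you intend to pass as `--hapdose` gives an early warning if a file for one ancestry component is absent.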
133173
134-
- **Output Files:**
135-
- Tractor is a local-ancestry aware GWAS that offers ancestry-specific summary statistics.
136-
- The number of columns would depend on the number of ancestries within the study. Here is a description of the columns of 2-way admixed dataset:
137-
```
138-
CHROM: Chromosome
139-
POS: Position
140-
ID: SNP ID
141-
REF: Reference allele
142-
ALT: Alternate allele
143-
AF_anc0: Allele frequency for anc0; sum(dosage)/sum(local ancestry)
144-
AF_anc1: Allele frequency for anc1; sum(dosage)/sum(local ancestry)
145-
LAprop_anc0: Local ancestry proportion for anc0; sum(local ancestry)/(2 * sample size)
146-
LAprop_anc1: Local ancestry proportion for anc1; sum(local ancestry)/(2 * sample size)
147-
LAeff_anc0: Effect size for the local ancestry term (X1 term in Tractor)
148-
LApval_anc0: p value for the local ancestry term (X1 term in Tractor)
149-
Geff_anc0: Effect size for alternate alleles that are inherited from anc0
150-
Geff_anc1: Effect size for alternate alleles that are inherited from anc1
151-
Gpval_anc0: p value for alternate alleles that are inherited from anc0
152-
Gpval_anc1: p value for alternate alleles that are inherited from anc1
153-
```
154-
- Gpval (Genotype p-value) columns can be used for generating ancestry-specific Manhattan plots.
174+
**Example Run (with Mandatory Arguments)**
175+
176+
- The latest Tractor v1.4.0 update introduces changes to default arguments, enhancing versatility and applicability.
177+
- To run Tractor with the default assumptions, only 5 arguments are required.
178+
- Ensure all covariates are specified using the `--covarcollist` flag.
179+
```
180+
run_tractor.R \
181+
--hapdose /path/to/file/tmp1 \
182+
--phenofile /path/to/file/dataset_qc_pheno_covars.txt \
183+
--covarcollist age,sex,PC1,PC2,PC3,PC4,PC5 \
184+
--method linear \
185+
--output /path/to/results/test1.txt
186+
```
187+
188+
**Example Run (with Optional Arguments)**
189+
190+
- In real-world scenarios, datasets may vary in size and default assumptions may not apply. Tractor accommodates these scenarios with optional arguments.
191+
- Assuming a phenotype file with columns: PC1, PC2, PC3, PC4, PC5, age, sex, pheno1, pheno2, pheno3, sample_id, users can perform GWAS across different phenotypes.
192+
- Below is an example to run Tractor GWAS for the **pheno1** phenotype:
193+
```
194+
run_tractor.R \
195+
--hapdose /path/to/file/tmp1 \
196+
--phenofile /path/to/file/dataset_qc_pheno_covars.txt \
197+
--covarcollist age,sex,PC1,PC2,PC3,PC4,PC5 \
198+
--method linear \
199+
--output /path/to/results/test1.txt \
200+
--sampleidcol sample_id \
201+
--phenocol pheno1
202+
```
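Because column names are now user-specified, a mismatch between the phenotype file header and the `--sampleidcol`, `--phenocol`, and `--covarcollist` values is an easy mistake to make. The sketch below is a hypothetical pre-flight check written in Python; it is not part of `run_tractor.R`, and the function name `check_pheno_header` is an assumption for illustration. It takes the header as a list of column names so that it makes no assumption about the file's delimiter.

```python
from __future__ import annotations


def check_pheno_header(
    header: list[str],
    covarcollist: str,
    sampleidcol: str | None = None,
    phenocol: str = "y",
) -> list[str]:
    """Return a list of problems found in a phenotype file header (empty if none)."""
    problems = []

    # Default sample ID column is "IID" or "#IID" unless --sampleidcol is given.
    id_candidates = [sampleidcol] if sampleidcol else ["IID", "#IID"]
    if not any(col in header for col in id_candidates):
        problems.append(f"sample ID column not found (looked for {id_candidates})")

    # Default phenotype column is "y" unless --phenocol is given.
    if phenocol not in header:
        problems.append(f"phenotype column '{phenocol}' not found")

    # "--covarcollist none" means no covariates; otherwise a comma-separated list.
    if covarcollist != "none":
        for cov in covarcollist.split(","):
            if cov not in header:
                problems.append(f"covariate column '{cov}' not found")

    return problems


if __name__ == "__main__":
    header = ["PC1", "PC2", "PC3", "PC4", "PC5", "age", "sex",
              "pheno1", "pheno2", "pheno3", "sample_id"]
    print(check_pheno_header(header, "age,sex,PC1,PC2,PC3,PC4,PC5",
                             sampleidcol="sample_id", phenocol="pheno1"))
```

With the example header above and the arguments from the **pheno1** run, the check reports no problems; dropping `--sampleidcol` and `--phenocol` would flag both the sample ID and phenotype columns.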
203+
204+
- Users can utilize multi-threading for improved performance, and control file reading with chunking to avoid memory errors with extremely large files.
205+
- Ensure a balance between chunk size and thread count to optimize performance without encountering memory issues.
206+
- Below is an example to run Tractor GWAS for the **pheno1** phenotype with multithreading (4 threads) and a larger chunk size (15000):
207+
```
208+
run_tractor.R \
209+
--hapdose /path/to/file/tmp1 \
210+
--phenofile /path/to/file/dataset_qc_pheno_covars.txt \
211+
--covarcollist age,sex,PC1,PC2,PC3,PC4,PC5 \
212+
--method linear \
213+
--output /path/to/results/test1.txt \
214+
--sampleidcol sample_id \
215+
--phenocol pheno1 \
216+
--chunksize 15000 \
217+
--nthreads 4
218+
```
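To illustrate why `--chunksize` bounds memory use, here is a small Python sketch of chunked reading from a plain or gzip-compressed text file. It only illustrates the general idea: Tractor itself reads files with `fread` in R, and the helper names `open_maybe_gz` and `iter_chunks`, as well as the assumption that the file starts with a single header line, are ours.

```python
import gzip
from itertools import islice


def open_maybe_gz(path):
    """Open a text file, transparently handling a .gz suffix."""
    if str(path).endswith(".gz"):
        return gzip.open(path, "rt")
    return open(path, "rt")


def iter_chunks(path, chunksize=10000):
    """Yield (header, rows) pairs with at most `chunksize` data rows each.

    Only one chunk is held in memory at a time, so peak memory scales
    with `chunksize` rather than with the total file size.
    """
    with open_maybe_gz(path) as handle:
        # Assumption: the first line is a header; the rest are variant rows.
        header = handle.readline().rstrip("\n")
        while True:
            rows = [line.rstrip("\n") for line in islice(handle, chunksize)]
            if not rows:
                break
            yield header, rows
```

A file with 25 data rows read with `chunksize=10` is delivered as three chunks of 10, 10, and 5 rows. When several threads each work on a chunk, the memory footprint grows accordingly, which is why chunk size and thread count need to be balanced.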
219+
220+
## Output Files (Running Tractor)
221+
222+
Tractor generates ancestry-specific summary statistics. The number of columns in the output file depends on the number of ancestries in the input.
223+
224+
All summary statistic files include:
225+
* **Variant Information:**
226+
* CHR: Chromosome
227+
* POS: Position
228+
* ID: SNP ID
229+
* REF: Reference allele
230+
* ALT: Alternate allele
231+
* **Sample Size:**
232+
* N: Total number of samples going into the model (after excluding NAs).
233+
* Note that this number can differ from the number of samples present in the hapcount/dosage files, because samples with NAs in the phenotype file are skipped.
234+
* **Allele Frequency (AF), Local Ancestry Proportion (LAprop), Effect Size (beta), p-value (pval), and t-value (tval):**
235+
* With 'n' ancestries there are 'n' sets of these columns, one per ancestry term (anc). For instance, if there are n=2 ancestries, expect 2 sets of columns for each of these parameters.
236+
* **Local Ancestry (LA) Related Columns:**
237+
* LApval: p-value for the local ancestry term (X1 term in Tractor)
238+
* LAeff: Effect size for the local ancestry term (X1 term in Tractor)
239+
* For 'n' ancestry terms (anc), expect 'n-1' sets of these columns. For example, if there are n=2 ancestries, expect 1 set of columns for each of these parameters.
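The column counts described above can be made concrete with a short Python sketch. The function name `sumstat_columns` is hypothetical, the `se` column is taken from the v1.4.0 release notes, and the order in which columns are grouped here is an assumption: only the set of names follows the description, so check the header of a real output file before relying on positions.

```python
def sumstat_columns(n_ancestries):
    """Build the column names expected for a given number of ancestries."""
    columns = ["CHR", "POS", "ID", "REF", "ALT", "N"]

    # One set of per-ancestry statistics for each of the n ancestries.
    for anc in range(n_ancestries):
        for stat in ("AF", "LAprop", "beta", "se", "pval", "tval"):
            columns.append(f"{stat}_anc{anc}")

    # Local ancestry terms: n-1 sets, since one ancestry acts as the reference.
    for anc in range(n_ancestries - 1):
        columns.append(f"LApval_anc{anc}")
        columns.append(f"LAeff_anc{anc}")

    return columns


if __name__ == "__main__":
    print(len(sumstat_columns(2)))  # 6 variant/sample columns + 12 + 2
    print(sumstat_columns(2))
```

For a two-way admixed dataset this gives 20 columns; for three ancestries, 28.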
240+
241+
**Example Output File Structure**
242+
```
243+
CHR Chromosome
244+
POS Position
245+
ID SNP ID
246+
REF Reference allele
247+
ALT Alternate allele
248+
N Total sample size
249+
AF_anc0 Allele frequency for anc0; sum(dosage)/sum(local ancestry)
250+
LAprop_anc0 Local ancestry proportion for anc0; sum(local ancestry)/(2 * sample size)
251+
beta_anc0 Effect size for alternate alleles inherited from anc0
252+
se_anc0 Standard error for effect size (beta_anc0)
253+
pval_anc0 p-value for alternate alleles inherited from anc0 (NOT -log10(pvalues))
254+
tval_anc0 t-value for anc0
255+
...
256+
LApval_anc0 p-value for the local ancestry term (X1 term in Tractor)
257+
LAeff_anc0 Effect size for the local ancestry term (X1 term in Tractor)
258+
...
259+
260+
```
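The `AF` and `LAprop` definitions above are simple ratios, and the `pval` columns hold raw p-values that need a `-log10` transform before plotting. The following Python sketch works through both on toy numbers. The function names are hypothetical, the toy vectors are made up for illustration, and the sketch assumes one hapcount and one dosage value per sample for a single variant and ancestry.

```python
import math


def allele_frequency(dosages, hapcounts):
    """AF for one ancestry: sum(dosage) / sum(local ancestry)."""
    total_la = sum(hapcounts)
    return sum(dosages) / total_la if total_la else float("nan")


def la_proportion(hapcounts):
    """LAprop for one ancestry: sum(local ancestry) / (2 * sample size)."""
    return sum(hapcounts) / (2 * len(hapcounts))


def neg_log10(pval):
    """Convert a raw p-value to the -log10 scale used in Manhattan plots."""
    return -math.log10(pval)


if __name__ == "__main__":
    # Toy data for 4 samples at one variant (illustrative values only):
    hapcounts = [2, 1, 0, 1]  # copies of anc0 haplotypes per sample
    dosages = [1, 0, 0, 1]    # alternate alleles carried on anc0 haplotypes
    print(allele_frequency(dosages, hapcounts))  # 2 / 4 = 0.5
    print(la_proportion(hapcounts))              # 4 / (2 * 4) = 0.5
```

Note that a sample contributes at most 2 to the hapcount sum, which is why the proportion is divided by twice the sample size.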
155261
156262
[Contents](#contents)
157263
158264
## Steps for Running Tractor on Hail
159265
- Hail implementation of the pipeline is described in [`hail_example_tractor_gwas.ipynb`](https://github.com/Atkinson-Lab/Tractor-New/blob/main/ipynbs/hail_example_tractor_gwas.ipynb).
160266
161-
[Contents](#contents)
162-
163267
## License
164268
The Tractor program is licensed under the MIT License. You may obtain a copy of the License [here](https://github.com/Atkinson-Lab/Tractor-New/blob/main/LICENSE).
165269
166-
[Contents](#contents)
167-
168270
## Cite this article
169-
170271
The methodology and utility of Tractor are more fully described in our manuscript. If you use Tractor in your research, please cite the following article:
171272
172273
> Atkinson, E.G., Maihofer, A.X., Kanai, M. et al. Tractor uses local ancestry to enable the inclusion of admixed individuals in GWAS and to boost power. Nat Genet 53, 195–204 (2021). [Link](https://doi.org/10.1038/s41588-020-00766-y)
