Atkinson-Lab
diff --git a/.DS_Store b/.DS_Store
-6 KB
diff --git a/README.md b/README.md
Lines changed: 155 additions & 54 deletions
@@ -1,8 +1,14 @@
 ![](images/tractor_icon.png)
 
-# TRACTOR - Local Ancestry Aware GWAS
+## NEW!!! Current Version: v1.4.0 (released May 10, 2024)
+- Added support for compressed (gz) hapcount/dosage and phenotype files.
+- Improved file reading efficiency by reading input with `fread` in chunks, mitigating memory errors.
+- Implemented parallel processing for regression, resulting in significant speed improvements on multi-core systems.
+- Enhanced flexibility in organizing phenotype files:
+   - Users can specify sample ID column (`--sampleidcol`), phenotype ID column (`--phenocol`), and covariate column list (`--covarcollist`)
+- Updated output summary statistics to include SE and t-val, with column names adjusted to adhere to GWAS standards.
 
-**Current version: 1.1.0**
+# TRACTOR - Local Ancestry Aware GWAS
 
 Tractor is a specialized tool designed to enhance Genome-Wide Association Studies (GWAS) for diverse cohorts by addressing challenges associated with analyzing admixed populations. Admixed populations are often excluded from genomic studies due to concerns about how to properly account for their complex ancestry.
 
@@ -11,20 +17,22 @@ Tractor facilitates the inclusion of admixed individuals in association studies
 ## Classic GWAS vs. TRACTOR GWAS
 Unlike traditional GWAS methods, Tractor requires local ancestry estimates in its analyses. It employs a multi-step approach involving phasing, local ancestry inference, and regression analysis with ancestral allele dosages. This method aims to improve the accuracy of association analyses in cohorts with diverse ancestries, overcoming issues such as population stratification and variable linkage disequilibrium patterns.
 
+
+
 ## Contents
 * [Setup Conda environment](#setup-conda-environment)
 * [Steps for Running Tractor Locally](#steps-for-running-tractor-locally)
-  * [Optional Step: Recovering Haplotypes Disrupted by Statistical Phasing](#optional-step-recovering-haplotypes-disrupted-by-statistical-phasing)
-  * [Extracting Tracts and Ancestry Dosages](#extracting-tracts-and-ancestry-dosages)
-  * [Running Tractor](#running-tractor)
+  * [Step 0 \[Optional\]: Recovering Haplotypes Disrupted by Statistical Phasing](#step-0-optional-recovering-haplotypes-disrupted-by-statistical-phasing)
+  * [Step 1: Extracting Tracts and Ancestry Dosages](#step-1-extracting-tracts-and-ancestry-dosages)
+  * [Step 2: Running Tractor](#step-2-running-tractor) **(Tractor v1.4.0 released with additional functionalities)**
+* [Output Files (Running Tractor)](#output-files-running-tractor)
 * [Steps for Running Tractor on Hail](#steps-for-running-tractor-on-hail)
-* [Output Files](#output-files)
 * [License](#license)
 * [Cite this article](#cite-this-article)
 
 ## Setup Conda environment
 
-We recommend creating a Conda environment to run Tractor locally.
+We recommend creating a Conda environment to run Tractor locally. This installs the Python 3 and R dependencies required by the scripts.
 ```bash
 conda env create -f conda_py3_tractor.yml
 conda activate py3_tractor
@@ -38,18 +46,20 @@ conda activate py3_tractor
 
 All scripts described in the following steps are available in the [`scripts`](https://github.com/Atkinson-Lab/Tractor-New/tree/main/scripts) directory, and the Hail implementation is in the [`ipynbs`](https://github.com/Atkinson-Lab/Tractor-New/tree/main/ipynbs) directory.
 
-### Optional Step: Recovering Haplotypes Disrupted by Statistical Phasing
+### Step 0 [Optional]: Recovering Haplotypes Disrupted by Statistical Phasing
 
 Statistical phasing can lead to switch errors as described in [Fig. 1](https://www.nature.com/articles/s41588-020-00766-y/figures/1) of the Tractor publication.
 For this purpose, we have written two scripts, `unkink_2way_mspfile.py` and `unkink_2way_genofile.py`. These scripts help recover disrupted tracts from the **MSP file and VCF file**, rectifying errors, and output an unkinked VCF file that can be used in subsequent steps. Currently they are implemented for two-way admixed populations only.
 - `unkink_2way_mspfile.py`
+  
   ```
   --msp              Path stem to MSP file, not including ".msp.tsv". (Must end in .msp.tsv)
   ```
 - **Output File:**
   - The output file \*.switches.txt includes information on windows from the MSP file that need to be switched.
   - This file will serve as an input to `unkink_2way_genofile.py`
 - `unkink_2way_genofile.py`
+  
   ```
   --switches         Path to *.switches.txt, which includes info on windows to be switched
   --genofile         Path stem to input VCF with phased genotypes, not including .vcf suffix
@@ -59,11 +69,11 @@ For this purpose, we have written two scripts, `unkink_2way_mspfile.py` and `unk
 
 [Contents](#contents)
 
-### Extracting Tracts and Ancestry Dosages
+### Step 1: Extracting Tracts and Ancestry Dosages
 
 Simultaneously extract risk allele and local ancestry information, a prerequisite for running Tractor GWAS. The scripts output risk allele by ancestry dosages and haplotype counts for the input VCF files. A file of each of these is generated for each ancestry component.
 - **Note that the input VCF file must be the phased file on which local ancestry was called.**
-- Running `extract_tracts.py` requires the **input MSP and VCF file**, and the number of ancestral populations within the VCF file. This script outputs the dosage and hapcount files required for running Tractor using `run_tractor.R`.
+- Running `extract_tracts.py` requires the **input MSP and VCF file**, and the number of ancestral populations within the VCF file. This script outputs the dosage and hapcount files required for running Tractor.
   - `extract_tracts.py`:
     ```
     --vcf              Path to VCF file (*.vcf or *.vcf.gz)
@@ -110,63 +120,154 @@ Simultaneously extract risk allele and local ancestry information, a prerequisit
 
 [Contents](#contents)
 
-### Running Tractor
+### Step 2: Running Tractor
 
-- The Tractor code runs in R, and all required library packages should be installed within the Conda environment.
-- Arguments:
-  ```
-  --hapdose         Prefix of hapcount and dosage files generated
-  --phe             Phenotype file; 1st column sample ID, 2nd column phenotype,
-                    other columns will be treated as covariates. Missing data is allowed.
-  --method          "linear" or "logistic"
-  --out             Output file name for ancestry-specific summary statistics
-  ```
+The Tractor code runs in R and requires the following libraries. The Conda environment should install them by default.
+```
+install.packages('optparse')
+install.packages('data.table')
+install.packages('R.utils')
+install.packages('dplyr')
+install.packages('doParallel')
+```
 
-  **Example run:**
-  ```
-  ${script_path}/run_tractor.R \
-  --hapdose dataset_qc_phased \
-  --phe dataset_qc_pheno_covars.txt \
-  --method logistic \
-  --out dataset_qc_phased_sumstats
-  ```
+**Arguments:**
+```
+--hapdose       [Mandatory] Prefix for hapcount and dosage files.
+                    E.g. If you have the following files:
+                         filename.anc0.dosage.txt filename.anc0.hapcount.txt
+                         filename.anc1.dosage.txt filename.anc1.hapcount.txt
+                    use "--hapdose filename".
+--phenofile     [Mandatory] Path to the file containing phenotype and covariate data. 
+                    Default assumptions: Sample ID column: "IID" or "#IID", Phenotype column: "y".
+                    If different column names are used, refer to --sampleidcol and --phenocol arguments.
+                    All covariates MUST be included using --covarcollist.
+--covarcollist  [Mandatory] Specify column names of covariates in the --phenofile.
+                    Only listed columns will be included as covariates.
+                    Separate multiple covariates with commas.
+                    E.g. --covarcollist age,sex,PC1,PC2.
+                    To exclude covariates, specify "--covarcollist none".
+--method        [Mandatory] Specify the method to be used: <linear> or <logistic>.
+--output        [Mandatory] File name for summary statistics output.
+                    E.g. /path/to/file/output_sumstats.txt
+
+--sampleidcol   [Optional] Specify sample ID column name in the --phenofile.
+                    Default: "IID" or "#IID"
+--phenocol      [Optional] Specify phenotype column name in the --phenofile.
+                    Default: "y"
+--chunksize     [Optional] Number of rows to read at once from hapcount and dosage files.
+                    Use smaller values for lower memory usage.
+                    Note: Higher chunksize speeds up streaming but requires more memory.
+                    If out-of-memory errors occur, try increasing memory or
+                    reducing --chunksize or --nthreads.
+                    Default: 10000
+--nthreads      [Optional] Specify number of threads to use.
+                    Increasing threads can speed up processing but may increase memory usage.
+                    Default: 1
+--totallines    [Optional] Specify total number of lines in hapcount/dosage files (wc -l *.hapcount.txt).
+                    If not provided, it will be calculated internally (recommended).
+                    Exercise caution: if --totallines is smaller than the actual lines in the files, 
+                    only a subset of data will be analyzed. If larger than the actual lines in the files,
+                    an error will occur. Both scenarios are discouraged.
+```
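
To make the interaction between `--chunksize` and `--totallines` more concrete, the following is a minimal Python sketch of how a file of known length can be split into row ranges. It is an illustration only and not Tractor's R implementation; `chunk_ranges` is a hypothetical helper, and header handling is left out.

```python
def chunk_ranges(total_rows, chunksize=10000):
    """Yield (first_row, n_rows) pairs that together cover total_rows data rows."""
    if chunksize <= 0:
        raise ValueError("chunksize must be positive")
    for first_row in range(0, total_rows, chunksize):
        # The last chunk may be shorter than chunksize.
        yield first_row, min(chunksize, total_rows - first_row)


# 25,000 rows read 10,000 at a time: two full chunks and one partial chunk.
for first_row, n_rows in chunk_ranges(25000, chunksize=10000):
    print(first_row, n_rows)
```

In this model, a row count smaller than the real one simply ends the iteration early, which is consistent with the documented warning that an understated `--totallines` analyzes only a subset of the data.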
 
-- **Output Files:**
-  - Tractor is a local-ancestry aware GWAS that offers ancestry-specific summary statistics.
-  - The number of columns would depend on the number of ancestries within the study. Here is a description of the columns of 2-way admixed dataset:
-    ```
-    CHROM:              Chromosome 
-    POS:                Position 
-    ID:                 SNP ID
-    REF:                Reference allele 
-    ALT:                Alternate allele 
-    AF_anc0:            Allele frequency for anc0; sum(dosage)/sum(local ancestry)
-    AF_anc1:            Allele frequency for anc1; sum(dosage)/sum(local ancestry)
-    LAprop_anc0:        Local ancestry proportion for anc0; sum(local ancestry)/2 * sample size
-    LAprop_anc1:        Local ancestry proportion for anc1; sum(local ancestry)/2 * sample size
-    LAeff_anc0:         Effect size for the local ancestry term (X1 term in Tractor)
-    LApval_anc0:        p value for the local ancestry term (X1 term in Tractor)
-    Geff_anc0:          Effect size for alternate alleles that are inherited from anc0
-    Geff_anc1:          Effect size for alternate alleles that are inherited from anc1
-    Gpval_anc0:         p value for alternate alleles that are inherited from anc0
-    Gpval_anc1:         p value for alternate alleles that are inherited from anc1
-    ```
-  - Gpval (Genotype p-value) columns can be used for generating ancestry-specific Manhattan plots.
+**Example Run (with Mandatory Arguments)**
+
+- The latest Tractor v1.4.0 update introduces changes to default arguments, enhancing versatility and applicability.
+- To run Tractor with the default assumptions, only 5 arguments are required.
+- Ensure all covariates are specified using the `--covarcollist` flag.
+```
+run_tractor.R \
+--hapdose /path/to/file/tmp1 \
+--phenofile /path/to/file/dataset_qc_pheno_covars.txt \
+--covarcollist age,sex,PC1,PC2,PC3,PC4,PC5 \
+--method linear \
+--output /path/to/results/test1.txt
+```
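
Before launching a run, it can help to check which files a given `--hapdose` prefix resolves to. The sketch below follows the `filename.ancN.dosage.txt` / `filename.ancN.hapcount.txt` naming from the argument description. The helper names are hypothetical, and the `.gz` suffix used for compressed inputs is an assumption, since the README does not spell out the compressed file names.

```python
from pathlib import Path


def expected_inputs(hapdose_prefix, n_ancestries):
    """List the dosage/hapcount paths implied by a --hapdose prefix."""
    paths = []
    for anc in range(n_ancestries):
        for kind in ("dosage", "hapcount"):
            paths.append(Path(f"{hapdose_prefix}.anc{anc}.{kind}.txt"))
    return paths


def missing_inputs(hapdose_prefix, n_ancestries):
    """Return expected files that exist neither plain nor gzip-compressed."""
    missing = []
    for path in expected_inputs(hapdose_prefix, n_ancestries):
        gz_path = Path(str(path) + ".gz")  # assumed name for compressed files
        if not path.exists() and not gz_path.exists():
            missing.append(path)
    return missing


for path in expected_inputs("/path/to/file/tmp1", n_ancestries=2):
    print(path)
```

An empty list from `missing_inputs` would indicate that every expected file is present in at least one of the two forms.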
+
+**Example Run (with Optional Arguments)**
+
+- In real-world scenarios, datasets may vary in size and default assumptions may not apply. Tractor accommodates these scenarios with optional arguments.
+- Assuming a phenotype file with columns: PC1, PC2, PC3, PC4, PC5, age, sex, pheno1, pheno2, pheno3, sample_id, users can perform GWAS across different phenotypes.
+- Below is an example to run Tractor GWAS for the **pheno1** phenotype:
+```
+run_tractor.R \
+--hapdose /path/to/file/tmp1 \
+--phenofile /path/to/file/dataset_qc_pheno_covars.txt \
+--covarcollist age,sex,PC1,PC2,PC3,PC4,PC5 \
+--method linear \
+--output /path/to/results/test1.txt \
+--sampleidcol sample_id \
+--phenocol pheno1
+```
+
+- Users can utilize multi-threading for improved performance, and control file reading with chunking to avoid memory errors with extremely large files.
+- Ensure a balance between chunk size and thread count to optimize performance without encountering memory issues.
+- Below is an example to run Tractor GWAS for the **pheno1** phenotype with multithreading (4 threads) and a larger chunk size (15000):
+```
+run_tractor.R \
+--hapdose /path/to/file/tmp1 \
+--phenofile /path/to/file/dataset_qc_pheno_covars.txt \
+--covarcollist age,sex,PC1,PC2,PC3,PC4,PC5 \
+--method linear \
+--output /path/to/results/test1.txt \
+--sampleidcol sample_id \
+--phenocol pheno1 \
+--chunksize 15000 \
+--nthreads 4
+```
+
+## Output Files (Running Tractor)
+
+Tractor generates ancestry-specific summary statistics. The number of columns in the output file depends on the number of ancestries in the input.
+
+All summary statistic files include:
+* **Variant Information:**
+  * CHR: Chromosome
+  * POS: Position
+  * ID: SNP ID
+  * REF: Reference allele
+  * ALT: Alternate allele
+* **Sample Size:**
+  * N: Total number of samples going into the model (after excluding NAs).
+    * Note that this number can differ from the number of samples present in the hapcount/dosage files, because samples with NAs in the phenotype file are skipped.
+* **Allele Frequency (AF), Local Ancestry Proportion (LAprop), Effect Size (beta), Standard Error (se), p-value (pval), and t-value (tval):**
+  * With 'n' ancestries, expect 'n' sets of these columns, one per ancestry term (anc). For instance, if there are n=2 ancestries, expect 2 sets of columns for each of these parameters.
+* **Local Ancestry (LA) Related Columns:**
+  * LApval: p-value for the local ancestry term (X1 term in Tractor)
+  * LAeff: Effect size for the local ancestry term (X1 term in Tractor)
+  * For 'n' ancestry terms (anc), expect 'n-1' sets of these columns. For example, if there are n=2 ancestries, expect 1 set of columns for each of these parameters.
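
The column counts described above can be sketched as a small Python helper. This is illustrative only: `expected_columns` is a hypothetical function, the column order is a guess based on the example structure, and labelling the n-1 local ancestry terms as anc0 through anc(n-2) is an assumption.

```python
PER_ANCESTRY_STATS = ("AF", "LAprop", "beta", "se", "pval", "tval")
LOCAL_ANCESTRY_STATS = ("LApval", "LAeff")


def expected_columns(n_ancestries):
    """Build the summary-statistics column names for n ancestries."""
    columns = ["CHR", "POS", "ID", "REF", "ALT", "N"]
    # One set of per-ancestry statistics for each of the n ancestries.
    for anc in range(n_ancestries):
        columns += [f"{stat}_anc{anc}" for stat in PER_ANCESTRY_STATS]
    # Assumption: the n-1 local ancestry terms are labelled anc0..anc(n-2).
    for anc in range(n_ancestries - 1):
        columns += [f"{stat}_anc{anc}" for stat in LOCAL_ANCESTRY_STATS]
    return columns


print(len(expected_columns(2)))  # 6 variant/sample columns + 2*6 + 1*2 = 20
```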
+
+**Example Output File Structure**
+```
+CHR             Chromosome 
+POS             Position 
+ID              SNP ID
+REF             Reference allele
+ALT             Alternate allele
+N               Total sample size
+AF_anc0         Allele frequency for anc0; sum(dosage)/sum(local ancestry)
+LAprop_anc0     Local ancestry proportion for anc0; sum(local ancestry)/(2 * sample size)
+beta_anc0       Effect size for alternate alleles inherited from anc0
+se_anc0         Standard error for effect size (beta_anc0)
+pval_anc0       p-value for alternate alleles inherited from anc0 (NOT -log10(pvalues))
+tval_anc0       t-value for anc0
+...
+LApval_anc0     p-value for the local ancestry term (X1 term in Tractor)
+LAeff_anc0      Effect size for the local ancestry term (X1 term in Tractor)
+...
+
+```
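
As a rough illustration of working with these columns, the Python sketch below selects variants whose ancestry-specific p-value falls under a threshold. Several things here are assumptions: the output is treated as tab-delimited, missing values are treated as empty or `NA`, the rows are made-up toy data, and `significant_ids` is a hypothetical helper. The 5e-8 default is the conventional genome-wide significance threshold, not something Tractor prescribes.

```python
import csv
import io

# Toy, made-up rows for illustration only; real files contain more columns.
toy_sumstats = (
    "CHR\tPOS\tID\tpval_anc0\tpval_anc1\n"
    "1\t1000\trs_a\t0.5\t1e-9\n"
    "1\t2000\trs_b\t3e-8\t0.2\n"
)


def significant_ids(handle, pval_column, threshold=5e-8):
    """Return variant IDs whose p-value in pval_column is below threshold."""
    reader = csv.DictReader(handle, delimiter="\t")  # assumed delimiter
    hits = []
    for row in reader:
        value = row[pval_column]
        if value in ("", "NA"):
            continue  # skip variants without a usable p-value
        if float(value) < threshold:
            hits.append(row["ID"])
    return hits


print(significant_ids(io.StringIO(toy_sumstats), "pval_anc0"))
print(significant_ids(io.StringIO(toy_sumstats), "pval_anc1"))
```

Because the `pval` columns hold raw p-values rather than -log10 values, the comparison is made directly against the threshold.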
 
 [Contents](#contents)
 
 ## Steps for Running Tractor on Hail
 - Hail implementation of the pipeline is described in [`hail_example_tractor_gwas.ipynb`](https://github.com/Atkinson-Lab/Tractor-New/blob/main/ipynbs/hail_example_tractor_gwas.ipynb).
 
-[Contents](#contents)
-
 ## License
 The Tractor program is licensed under the MIT License. You may obtain a copy of the License [here](https://github.com/Atkinson-Lab/Tractor-New/blob/main/LICENSE).
 
-[Contents](#contents)
-
 ## Cite this article
-
 The methodology and utility of Tractor are more fully described in our manuscript. If you use Tractor in your research, please cite the following article:
 
 > Atkinson, E.G., Maihofer, A.X., Kanai, M. et al. Tractor uses local ancestry to enable the inclusion of admixed individuals in GWAS and to boost power. Nat Genet 53, 195–204 (2021). [Link](https://doi.org/10.1038/s41588-020-00766-y)