Semi-automates the preprocessing and quality control of paired-end FASTQ data using FastQC, MultiQC, and fastp. This Bash script streamlines the entire workflow by creating the necessary directory structure, letting you select a Conda environment, logging read statistics, and plotting those statistics directly in the terminal.
## Table of Contents

- Features
- Prerequisites
- Installation
- Usage
- Pipeline Overview
- Configuration
- Output
- Troubleshooting
- Contributing
- Contact
## Features

- Automated Workflow: Seamlessly integrates FastQC, MultiQC, and fastp for comprehensive FASTQ data preprocessing and quality control.
- Environment Selection: Allows users to select from available Conda environments to ensure the correct tool versions and dependencies.
- Dynamic Configuration: Interactive prompts enable users to customize fastp parameters based on their specific requirements.
- Comprehensive Reporting: Generates detailed reports in both HTML and JSON formats, including aggregated MultiQC reports.
- Read Statistics Logging: Extracts and logs read statistics before and after filtering, providing insights into data quality and processing effectiveness.
- Visualization: Creates informative read statistics graphs using Python's `plotext` library, displayed directly in the terminal.
- Error Handling: Robust checks ensure all necessary tools are installed and valid user inputs are provided.
## Prerequisites

Before running the script, ensure that the following tools and dependencies are installed and accessible:
- Bash: Version 4 or higher.
- Conda: For environment management.
- FastQC: Quality control tool for high throughput sequence data.
- MultiQC: Aggregates results from multiple tools into a single report.
- fastp: Fast all-in-one FASTQ preprocessor.
- jq: Command-line JSON processor.
- Python 3: With the `plotext` library installed (the script will install it if missing).
## Installation

1. Clone the Repository:

   ```bash
   git clone https://github.com/yourusername/fastq-processing-pipeline.git
   cd fastq-processing-pipeline
   ```

2. Ensure Execution Permissions:

   ```bash
   chmod +x fastq_processing_pipeline.sh
   ```

3. Install Required Conda Environments:

   The script will prompt you to select a Conda environment. Ensure you have the necessary environments created with the required tools installed.

   Example to create a new environment:

   ```bash
   conda create -n fastq_env fastqc multiqc fastp jq plotext python=3.8
   conda activate fastq_env
   ```
## Usage

Execute the script from your terminal:

```bash
./fastq_processing_pipeline.sh
```

The script will guide you through several interactive prompts:
1. Conda Environment Selection:
   - Lists available Conda environments (excluding `base`).
   - Prompts you to select an environment by entering its corresponding number.
2. Directory Structure:
   - Asks if the necessary directories (`rawdata/` and `processed_data/`) already exist.
   - If not, it creates the required directory structure.
3. FASTQ File Organization:
   - Moves all `.fastq.gz` files from the current directory to the `rawdata/` directory.
4. fastp Configuration:
   - Interactive prompts to enable/disable and set parameters for:
     - Quality Filtering
     - Length Filtering
     - Low Complexity Filtering
     - Adapter Trimming
     - PolyG Tail Trimming
     - PolyX Tail Trimming
     - Deduplication
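The environment-selection step boils down to parsing `conda env list` and dropping the `base` environment. A minimal sketch, assuming output in conda's usual two-column format; the helper name `list_envs` is illustrative, not necessarily what the script uses:

```shell
#!/usr/bin/env bash
# Hypothetical sketch: filter `conda env list` output down to
# selectable environment names (skip comment lines and `base`).
list_envs() {
    awk '!/^#/ && NF && $1 != "base" { print $1 }'
}

# Usage (requires conda on PATH):
#   conda env list | list_envs
```

The script can then number these names and read the user's choice with `select` or a `read` prompt.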
## Pipeline Overview

1. Environment Setup:
   - Sources the appropriate Conda initialization script.
   - Activates the user-selected Conda environment.
2. Tool Verification:
   - Checks for the presence of required tools: `fastqc`, `multiqc`, `fastp`, and `jq`.
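Tool verification amounts to a loop over `command -v`. A minimal sketch (the function name `check_tools` is an assumption, not necessarily the script's own):

```shell
#!/usr/bin/env bash
# Report every missing tool rather than stopping at the first one.
check_tools() {
    local status=0
    local tool
    for tool in "$@"; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            echo "ERROR: required tool '$tool' not found in the active environment" >&2
            status=1
        fi
    done
    return "$status"
}

# In this pipeline the call would be:
#   check_tools fastqc multiqc fastp jq || exit 1
```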
3. Directory Structure Creation:
   - Sets up `rawdata/` for input FASTQ files and `processed_data/` for output files.
   - Creates subdirectories for reports generated by FastQC, MultiQC, and fastp.
4. FASTQ File Organization:
   - Moves all `.fastq.gz` files to the `rawdata/` directory.
5. Paired-End File Identification:
   - Identifies paired-end FASTQ files based on the naming pattern `*_plus_1_aaa.fastq.gz`.
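Under that naming convention, the R2 mate of each R1 file follows from a simple substitution. A sketch (the helper name `mate_of` is hypothetical):

```shell
#!/usr/bin/env bash
# Map an R1 filename to its R2 mate under the *_plus_1_aaa.fastq.gz scheme.
mate_of() {
    printf '%s\n' "$1" | sed 's/_plus_1_aaa/_plus_2_aaa/'
}

# Typical use when iterating over the raw data directory:
#   for r1 in rawdata/*_plus_1_aaa.fastq.gz; do
#       r2="$(mate_of "$r1")"
#       [ -f "$r2" ] || echo "WARNING: no mate found for $r1" >&2
#   done
```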
6. Quality Control with FastQC:
   - Runs FastQC on raw FASTQ files.
   - Aggregates FastQC reports using MultiQC.
7. Preprocessing with fastp:
   - Configures fastp parameters based on user input.
   - Executes fastp to perform filtering, trimming, and optional deduplication.
   - Logs read statistics before and after processing.
8. Post-Processing Quality Control:
   - Runs FastQC on processed FASTQ files.
   - Aggregates FastQC reports using MultiQC.
9. Statistics Extraction and Visualization:
   - Extracts read statistics from fastp's JSON reports using `jq`.
   - Logs statistics into `readLOG.txt`.
   - Generates a read statistics graph displayed in the terminal using Python's `plotext`.
10. Completion:
    - Provides a summary of the pipeline execution and points to the generated reports.
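The statistics-extraction step reads fastp's JSON report with `jq`. A minimal sketch, assuming fastp's standard JSON layout (`summary.before_filtering.total_reads` / `summary.after_filtering.total_reads`); the function name and report path are illustrative:

```shell
#!/usr/bin/env bash
# Print before-filtering reads, after-filtering reads, and the
# number of discarded reads from a fastp JSON report, tab-separated.
read_stats() {
    jq -r '[.summary.before_filtering.total_reads,
            .summary.after_filtering.total_reads,
            (.summary.before_filtering.total_reads
             - .summary.after_filtering.total_reads)] | @tsv' "$1"
}

# Appending one line per sample to readLOG.txt could then look like:
#   printf '%s\t%s\n' "$sample" "$(read_stats "$report_json")" >> readLOG.txt
```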
## Configuration

During the pipeline execution, you'll be prompted to configure various fastp parameters:
1. Quality Filtering:
   - Qualified Quality PHRED: Minimum PHRED score for a base to be considered high quality.
   - Unqualified Percent Limit: Maximum percentage of low-quality bases allowed per read.
2. Length Filtering:
   - Minimum Length Required: Reads shorter than this length will be discarded.
   - Maximum Length Limit: Reads longer than this length will be discarded (0 means no limit).
3. Low Complexity Filtering:
   - Complexity Threshold: Percentage of base diversity required for a read to pass.
4. Adapter Trimming:
   - Option to enable or disable adapter trimming.
5. PolyG and PolyX Tail Trimming:
   - PolyG Minimum Length: Minimum length to detect and trim PolyG tails.
   - PolyX Minimum Length: Minimum length to detect and trim PolyX tails.
6. Deduplication:
   - Option to enable deduplication to remove duplicated reads.
These configurations allow you to tailor the preprocessing steps to the specific requirements of your sequencing data.
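These settings map onto fastp's long command-line options. A sketch of how the script might assemble them — the flag names are real fastp options, but the function, variable names, and file paths are illustrative assumptions:

```shell
#!/usr/bin/env bash
# Assemble the quality- and length-filtering flags from configured values:
# $1 = qualified PHRED, $2 = unqualified %, $3 = min length, $4 = max length.
build_fastp_args() {
    printf '%s' "--qualified_quality_phred $1 --unqualified_percent_limit $2 --length_required $3 --length_limit $4"
}

# A full invocation for one sample could then look like:
#   fastp -i rawdata/S1_plus_1_aaa.fastq.gz -I rawdata/S1_plus_2_aaa.fastq.gz \
#         -o processed_data/S1_1.fastq.gz -O processed_data/S1_2.fastq.gz \
#         $(build_fastp_args 20 40 50 0) \
#         --low_complexity_filter --complexity_threshold 30 \
#         --trim_poly_g --poly_g_min_len 10 --trim_poly_x --poly_x_min_len 10 \
#         --dedup --json S1.json --html S1.html
```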
## Output

After execution, the following directory structure will be created:

```text
rawdata/
└── reports/
    ├── fastqc/
    └── multiqc/
processed_data/
└── reports/
    ├── fastqc/
    ├── multiqc/
    └── fastp/
```
- FastQC Reports:
  - Located in `rawdata/reports/fastqc/` for raw data.
  - Located in `processed_data/reports/fastqc/` for processed data.
- MultiQC Reports:
  - Aggregated reports in `rawdata/reports/multiqc/` for raw data.
  - Aggregated reports in `processed_data/reports/multiqc/` for processed data.
- fastp Reports:
  - JSON and HTML reports for each sample in `processed_data/reports/fastp/`.
  - Log files capturing the fastp execution details.
- readLOG.txt:
  - Located in `processed_data/reports/fastp/`.
  - Logs the number of reads before and after filtering, as well as the number of discarded reads for each sample.
  - Sample content:

    ```text
    Sample      Before_Reads  After_Reads  Discarded_Reads
    0_mM_NOD    1000000       950000       50000
    400_mM_NOD  1000000       960000       40000
    ```

- Read Statistics Graph:
  - A graph of the per-sample read statistics, displayed directly in the terminal at the end of the run.
## Troubleshooting

- Conda Activation Issues:
  - Ensure that Conda is correctly installed and that the script is sourcing the correct `conda.sh` path.
  - Verify that the selected Conda environment has all required tools installed.
- Missing Tools:
  - The script checks for `fastqc`, `multiqc`, `fastp`, and `jq`. Ensure these are installed within the selected Conda environment.
- Invalid FASTQ Files:
  - Ensure that your FASTQ files follow the naming conventions `*_plus_1_aaa.fastq.gz` and `*_plus_2_aaa.fastq.gz` for paired-end data.
- Python Dependencies:
  - The script uses `plotext` for generating graphs. If automatic installation fails, install it manually:

    ```bash
    pip install plotext
    ```
- Insufficient Permissions:
  - Ensure you have the necessary read/write permissions for the directories involved.
- Error Messages During Execution:
  - Review the corresponding log files in `processed_data/reports/fastp/` for detailed error information.
## Contributing

Contributions are welcome! If you have suggestions, bug reports, or enhancements, please open an issue or submit a pull request.
1. Fork the Repository

2. Create a Feature Branch:

   ```bash
   git checkout -b feature/YourFeature
   ```

3. Commit Your Changes:

   ```bash
   git commit -m "Add your feature"
   ```

4. Push to the Branch:

   ```bash
   git push origin feature/YourFeature
   ```

5. Open a Pull Request
## Contact

For any questions, issues, or suggestions, please contact Dan.
