lemmerelassal/ncbi-sra-fetcher

🧬 NCBI Fetcher (PGX)

A small Go application that downloads gzipped FASTA sequences from the NCBI Sequence Read Archive (SRA), decompresses them, and stores them in a PostgreSQL database.


📦 Requirements

Make sure you have the following installed:

  • Go (to build and run the app)
  • PostgreSQL (the installation steps use its createdb command)

🛠️ Installation

  1. Clone or unzip the project:

    unzip ncbi-fetcher-pgx-updated.zip
    cd ncbi-fetcher-pgx
  2. Initialize Go modules and download dependencies:

    go mod tidy
  3. Create a PostgreSQL database (e.g., dna_sequences):

    createdb dna_sequences
  4. Configure environment variables
    (these are used by the app to connect to PostgreSQL):

    export DB_HOST=localhost
    export DB_PORT=5432
    export DB_USER=postgres
    export DB_PASSWORD=yourpassword
    export DB_NAME=dna_sequences

    You can also edit config/config.go to set your defaults.
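
As a rough illustration of how the DB_* variables above could be turned into a connection string, here is a minimal sketch. The function name buildDSN and the fallback values are assumptions made for this example; the project's real defaults live in config/config.go and may differ. pgx accepts connection strings in this URL form.

```go
package main

import (
	"fmt"
	"net/url"
	"os"
)

// buildDSN assembles a PostgreSQL connection URL from the DB_* variables.
// lookup is usually os.Getenv; passing it in keeps the function easy to test.
// The fallback values below are illustrative assumptions, not the project's defaults.
func buildDSN(lookup func(string) string) string {
	get := func(key, fallback string) string {
		if v := lookup(key); v != "" {
			return v
		}
		return fallback
	}
	u := url.URL{
		Scheme: "postgres",
		User:   url.UserPassword(get("DB_USER", "postgres"), get("DB_PASSWORD", "")),
		Host:   get("DB_HOST", "localhost") + ":" + get("DB_PORT", "5432"),
		Path:   "/" + get("DB_NAME", "dna_sequences"),
	}
	return u.String()
}

func main() {
	// Prints the URL built from the current environment.
	fmt.Println(buildDSN(os.Getenv))
}
```

Building the URL with net/url rather than string concatenation means special characters in the password are escaped correctly.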


▶️ Running the App

Fetch sequences for a specific SRA accession number (e.g., SRR35830121):

go run main.go SRR35830121

The program will:

  1. Download the gzipped FASTA file from NCBI.
  2. Decompress it in memory.
  3. Parse all sequences.
  4. Create the sequences table (if it doesn’t exist).
  5. Insert all sequences into PostgreSQL.
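
Steps 2 and 3 can be sketched with the standard library alone. The Record type and the function names below are hypothetical and are not taken from the project's source; the download step is replaced by a small payload built in memory, since the NCBI endpoint is not specified here.

```go
package main

import (
	"bufio"
	"bytes"
	"compress/gzip"
	"fmt"
	"io"
	"strings"
)

// Record is a hypothetical in-memory form of one FASTA entry.
type Record struct {
	Header   string // header line without the leading ">"
	Sequence string // all sequence lines joined together
}

// parseGzippedFASTA decompresses r on the fly and splits the stream into records.
func parseGzippedFASTA(r io.Reader) ([]Record, error) {
	zr, err := gzip.NewReader(r)
	if err != nil {
		return nil, fmt.Errorf("gzip: %w", err)
	}
	defer zr.Close()

	var (
		records []Record
		seq     strings.Builder
		header  string
		open    bool // true once the first header has been seen
	)
	flush := func() {
		if open {
			records = append(records, Record{Header: header, Sequence: seq.String()})
		}
		seq.Reset()
	}

	sc := bufio.NewScanner(zr)
	sc.Buffer(make([]byte, 0, 64*1024), 16*1024*1024) // allow long sequence lines
	for sc.Scan() {
		line := strings.TrimSpace(sc.Text())
		switch {
		case line == "":
			continue
		case strings.HasPrefix(line, ">"):
			flush()
			header, open = line[1:], true
		case open:
			seq.WriteString(line)
		}
	}
	if err := sc.Err(); err != nil {
		return nil, err
	}
	flush() // emit the last record
	return records, nil
}

// gzipString compresses s in memory; it stands in for the downloaded file.
func gzipString(s string) io.Reader {
	var buf bytes.Buffer
	zw := gzip.NewWriter(&buf)
	zw.Write([]byte(s)) // writes to a bytes.Buffer do not fail
	zw.Close()
	return &buf
}

func main() {
	recs, err := parseGzippedFASTA(gzipString(">read1 demo\nACGT\nTTGA\n>read2\nGGCC\n"))
	if err != nil {
		panic(err)
	}
	for _, r := range recs {
		fmt.Printf("%s\t%d bases\n", r.Header, len(r.Sequence))
	}
}
```

Because the parser takes an io.Reader, it can read straight from an HTTP response body without writing the archive to disk.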

🧱 Database Schema

Column        Type        Description
id            SERIAL PK   Auto-incrementing ID
accession     TEXT        SRA accession number
header        TEXT        FASTA header line (no ">")
sequence      TEXT        Nucleotide sequence
source_file   TEXT        Original .gz file name
created_at    TIMESTAMP   Default NOW()
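
For reference, the columns above could correspond to statements like the following, held as Go constants. This is a sketch reconstructed from the table, not the statements the app actually runs; constraints such as NOT NULL are not documented, so none are assumed.

```go
package main

import "fmt"

// createTableSQL mirrors the documented columns. It is a reconstruction,
// so the real DDL in the project may differ in constraints or ordering.
const createTableSQL = `
CREATE TABLE IF NOT EXISTS sequences (
    id          SERIAL PRIMARY KEY,
    accession   TEXT,
    header      TEXT,
    sequence    TEXT,
    source_file TEXT,
    created_at  TIMESTAMP DEFAULT NOW()
);`

// insertSQL leaves id and created_at to their defaults.
const insertSQL = `INSERT INTO sequences (accession, header, sequence, source_file) VALUES ($1, $2, $3, $4)`

func main() {
	fmt.Println(createTableSQL)
	fmt.Println(insertSQL)
}
```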

🧩 Example Query

To verify data insertion:

SELECT accession, header, LENGTH(sequence) AS seq_len
FROM sequences
LIMIT 5;

🧰 Troubleshooting

  • failed to fetch data: 404
    → The accession ID might not exist or NCBI is temporarily unavailable.
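  • connection refused
    → Check that PostgreSQL is running and that DB_HOST and DB_PORT point to it. If the server is reachable but rejects the login, check DB_USER, DB_PASSWORD, and pg_hba.conf.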
  • gzip: invalid header
    → Ensure the endpoint returns a .gz file and not plain FASTA.
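
One way to diagnose the gzip error is to look at the first two bytes of the response: a gzip stream always starts with 0x1f 0x8b. The sketch below checks for that signature and falls back to reading the data as plain text. The fallback is a possible workaround, not something the app is known to do, and the function names are hypothetical.

```go
package main

import (
	"bufio"
	"compress/gzip"
	"fmt"
	"io"
	"strings"
)

// isGzip reports whether b starts with the two-byte gzip signature.
func isGzip(b []byte) bool {
	return len(b) >= 2 && b[0] == 0x1f && b[1] == 0x8b
}

// openMaybeGzip peeks at the stream and returns a decompressing reader
// only when the gzip signature is present; otherwise it returns the data as-is.
func openMaybeGzip(r io.Reader) (io.Reader, error) {
	br := bufio.NewReader(r)
	head, err := br.Peek(2) // Peek does not consume the bytes
	if err != nil && err != io.EOF {
		return nil, err
	}
	if !isGzip(head) {
		return br, nil
	}
	zr, err := gzip.NewReader(br)
	if err != nil {
		return nil, err
	}
	return zr, nil
}

func main() {
	// Plain FASTA input: no gzip signature, so it is passed through unchanged.
	r, err := openMaybeGzip(strings.NewReader(">read1\nACGT\n"))
	if err != nil {
		panic(err)
	}
	data, _ := io.ReadAll(r)
	fmt.Printf("%d bytes of FASTA\n", len(data))
}
```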

📘 Notes

  • You can modify the database connection defaults in config/config.go.
  • The app uses the pgx/v5 library for high-performance database access.
  • To improve speed for large datasets, you can later add batching or concurrent insert workers.
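
The batching idea in the last note could look roughly like this: split the parsed rows into fixed-size chunks and build one multi-row INSERT per chunk. The Row type and both function names are hypothetical, and the column list is taken from the schema table. The resulting statement and arguments can be passed to pgx as conn.Exec(ctx, sql, args...).

```go
package main

import (
	"fmt"
	"strings"
)

// Row is a hypothetical holder for the four columns the app fills in.
type Row struct {
	Accession, Header, Sequence, SourceFile string
}

// chunk splits rows into consecutive batches of at most size elements.
// It never returns an empty batch.
func chunk(rows []Row, size int) [][]Row {
	if size <= 0 {
		size = 1
	}
	var out [][]Row
	for len(rows) > size {
		out = append(out, rows[:size])
		rows = rows[size:]
	}
	if len(rows) > 0 {
		out = append(out, rows)
	}
	return out
}

// buildInsert returns one multi-row INSERT and its flattened arguments.
// The batch must not be empty.
func buildInsert(batch []Row) (string, []any) {
	var sb strings.Builder
	sb.WriteString("INSERT INTO sequences (accession, header, sequence, source_file) VALUES ")
	args := make([]any, 0, len(batch)*4)
	for i, r := range batch {
		if i > 0 {
			sb.WriteString(", ")
		}
		n := i * 4 // four placeholders per row
		fmt.Fprintf(&sb, "($%d, $%d, $%d, $%d)", n+1, n+2, n+3, n+4)
		args = append(args, r.Accession, r.Header, r.Sequence, r.SourceFile)
	}
	return sb.String(), args
}

func main() {
	rows := []Row{
		{"SRR35830121", "read1", "ACGT", "demo.gz"},
		{"SRR35830121", "read2", "GGCC", "demo.gz"},
		{"SRR35830121", "read3", "TTAA", "demo.gz"},
	}
	for _, b := range chunk(rows, 2) {
		sql, args := buildInsert(b)
		fmt.Println(sql, len(args))
	}
}
```

PostgreSQL allows at most 65535 bind parameters per statement, so with four columns the batch size has to stay below 16384 rows. For very large loads, pgx's CopyFrom is another option worth evaluating.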
