A small Go application that downloads gzipped FASTA sequences from the NCBI Sequence Read Archive (SRA), extracts them, and stores them in a PostgreSQL database.

Make sure you have the following installed:
- Go ≥ 1.20
- PostgreSQL ≥ 13
- Internet access to reach NCBI’s `trace.ncbi.nlm.nih.gov` API

- Clone or unzip the project:

      unzip ncbi-fetcher-pgx-updated.zip
      cd ncbi-fetcher-pgx

- Initialize Go modules and download dependencies:

      go mod tidy

- Create a PostgreSQL database (e.g., `dna_sequences`):

      createdb dna_sequences

- Configure environment variables (these are used by the app to connect to PostgreSQL):

      export DB_HOST=localhost
      export DB_PORT=5432
      export DB_USER=postgres
      export DB_PASSWORD=yourpassword
      export DB_NAME=dna_sequences

You can also edit `config/config.go` to set your defaults.
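
As an illustration of how these variables could be turned into a connection string, here is a minimal sketch. It is not the project's actual `config/config.go`: the `Config` type, the `envOr` helper, and the fallback values are assumptions made for this example. Only the `DB_*` variable names come from the steps above.

```go
package main

import (
	"fmt"
	"net"
	"net/url"
	"os"
)

// Config holds PostgreSQL connection settings (hypothetical type for this sketch).
type Config struct {
	Host, Port, User, Password, Name string
}

// envOr returns the value of key, or fallback when the variable is unset or empty.
func envOr(key, fallback string) string {
	if v := os.Getenv(key); v != "" {
		return v
	}
	return fallback
}

// LoadConfig reads the DB_* variables. The fallbacks are assumed defaults,
// not necessarily the ones shipped in config/config.go.
func LoadConfig() Config {
	return Config{
		Host:     envOr("DB_HOST", "localhost"),
		Port:     envOr("DB_PORT", "5432"),
		User:     envOr("DB_USER", "postgres"),
		Password: envOr("DB_PASSWORD", ""),
		Name:     envOr("DB_NAME", "dna_sequences"),
	}
}

// DSN builds a postgres:// URL; net/url escapes special characters
// in the user name and password.
func (c Config) DSN() string {
	u := url.URL{
		Scheme: "postgres",
		User:   url.UserPassword(c.User, c.Password),
		Host:   net.JoinHostPort(c.Host, c.Port),
		Path:   "/" + c.Name,
	}
	return u.String()
}

func main() {
	cfg := LoadConfig()
	// Avoid printing the DSN itself, since it contains the password.
	fmt.Printf("connecting to %s:%s/%s as %s\n", cfg.Host, cfg.Port, cfg.Name, cfg.User)
	_ = cfg.DSN() // pass this string to the database driver
}
```

Building the URL with `net/url` rather than string concatenation keeps passwords containing characters such as `@` or `/` from breaking the connection string.
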

Fetch sequences for a specific SRA accession number (e.g., SRR35830121):

    go run main.go SRR35830121

The program will:
- Download the gzipped FASTA file from NCBI.
- Decompress it in memory.
- Parse all sequences.
- Create the `sequences` table (if it doesn’t exist).
- Insert all sequences into PostgreSQL.
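
The decompress-and-parse steps above can be sketched with the standard library alone. This is an illustrative sketch, not the project's parser: `Record`, `ParseGzippedFASTA`, and `gzipString` are hypothetical names, and the download step is left out because the exact NCBI endpoint is not given here.

```go
package main

import (
	"bufio"
	"bytes"
	"compress/gzip"
	"fmt"
	"strings"
)

// Record is one FASTA entry: the header without ">" and the joined sequence lines.
type Record struct {
	Header   string
	Sequence string
}

// ParseGzippedFASTA decompresses data in memory and parses every FASTA record.
func ParseGzippedFASTA(data []byte) ([]Record, error) {
	zr, err := gzip.NewReader(bytes.NewReader(data))
	if err != nil {
		return nil, fmt.Errorf("open gzip stream: %w", err)
	}
	defer zr.Close()

	var (
		records  []Record
		header   string
		seq      strings.Builder
		inRecord bool
	)
	// flush appends the record collected so far, if any.
	flush := func() {
		if inRecord {
			records = append(records, Record{Header: header, Sequence: seq.String()})
			seq.Reset()
		}
	}

	sc := bufio.NewScanner(zr)
	sc.Buffer(make([]byte, 0, 64*1024), 16*1024*1024) // allow long sequence lines
	for sc.Scan() {
		line := strings.TrimSpace(sc.Text())
		switch {
		case line == "":
			continue
		case strings.HasPrefix(line, ">"):
			flush()
			header, inRecord = strings.TrimPrefix(line, ">"), true
		case inRecord:
			seq.WriteString(line)
		}
	}
	if err := sc.Err(); err != nil {
		return nil, fmt.Errorf("read FASTA: %w", err)
	}
	flush()
	return records, nil
}

// gzipString compresses s in memory; it only exists to feed the demo below.
func gzipString(s string) []byte {
	var buf bytes.Buffer
	zw := gzip.NewWriter(&buf)
	if _, err := zw.Write([]byte(s)); err != nil {
		panic(err)
	}
	if err := zw.Close(); err != nil {
		panic(err)
	}
	return buf.Bytes()
}

func main() {
	recs, err := ParseGzippedFASTA(gzipString(">read1 demo\nACGT\nTTGA\n>read2\nGGCC\n"))
	if err != nil {
		panic(err)
	}
	for _, r := range recs {
		fmt.Printf("%s\t%d\n", r.Header, len(r.Sequence))
	}
}
```

Multi-line sequences are concatenated into a single string per record, which matches storing one `sequence` value per header.
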

The `sequences` table has the following columns:

| Column | Type | Description |
|---|---|---|
| id | SERIAL PK | Auto-incrementing ID |
| accession | TEXT | SRA accession number |
| header | TEXT | FASTA header line (no ">") |
| sequence | TEXT | Nucleotide sequence |
| source_file | TEXT | Original .gz file name |
| created_at | TIMESTAMP | Default NOW() |
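
For reference, the table above could correspond to statements like the ones below, kept as Go string constants. The exact DDL and insert statement used by the app are not shown in this README, so both constants are assumptions derived from the column list.

```go
package main

import "fmt"

// createSequencesTable mirrors the documented columns. The real statement
// in the app may differ (constraints, indexes); this is an assumed form.
const createSequencesTable = `CREATE TABLE IF NOT EXISTS sequences (
    id          SERIAL PRIMARY KEY,
    accession   TEXT,
    header      TEXT,
    sequence    TEXT,
    source_file TEXT,
    created_at  TIMESTAMP DEFAULT NOW()
);`

// insertSequence uses PostgreSQL positional parameters ($1..$4);
// id and created_at are filled in by their defaults.
const insertSequence = `INSERT INTO sequences (accession, header, sequence, source_file) VALUES ($1, $2, $3, $4)`

func main() {
	fmt.Println(createSequencesTable)
	fmt.Println(insertSequence)
}
```
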

To verify data insertion:

    SELECT accession, header, LENGTH(sequence) AS seq_len
    FROM sequences
    LIMIT 5;

Common errors:
- `failed to fetch data: 404` → The accession ID might not exist or NCBI is temporarily unavailable.
- `connection refused` → Check your PostgreSQL connection parameters or `pg_hba.conf`.
- `gzip: invalid header` → Ensure the endpoint returns a `.gz` file and not plain FASTA.
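
To diagnose a `gzip: invalid header` error, it can help to check the first bytes of the response before decompressing. A gzip stream starts with the magic bytes `0x1f 0x8b`, while plain FASTA starts with `>`. The helper names below are hypothetical and not part of the app.

```go
package main

import "fmt"

// looksGzipped reports whether data starts with the gzip magic bytes 0x1f 0x8b.
func looksGzipped(data []byte) bool {
	return len(data) >= 2 && data[0] == 0x1f && data[1] == 0x8b
}

// describePayload gives a rough guess at what the server returned.
func describePayload(data []byte) string {
	switch {
	case looksGzipped(data):
		return "gzip"
	case len(data) > 0 && data[0] == '>':
		return "plain FASTA"
	default:
		return "unknown"
	}
}

func main() {
	fmt.Println(describePayload([]byte{0x1f, 0x8b, 0x08}))
	fmt.Println(describePayload([]byte(">read1\nACGT\n")))
	fmt.Println(describePayload([]byte("<html>error page</html>")))
}
```

An "unknown" result often means the server returned an error page rather than sequence data.
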

- You can modify the database connection defaults in `config/config.go`.
- The app uses the `pgx/v5` library for high-performance database access.
- To improve speed for large datasets, you can later add batching or concurrent insert workers.
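
The batching and worker idea from the last note could look roughly like this. It is a sketch under assumptions: `chunk` and `insertConcurrently` are hypothetical helpers, and the `insert` callback stands in for the real database write (for example a pgx `SendBatch` or `CopyFrom` call), which is not shown.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// chunk splits items into consecutive slices of at most size elements.
func chunk[T any](items []T, size int) [][]T {
	if size <= 0 {
		return nil
	}
	var out [][]T
	for start := 0; start < len(items); start += size {
		end := start + size
		if end > len(items) {
			end = len(items)
		}
		out = append(out, items[start:end])
	}
	return out
}

// insertConcurrently fans batches out to a fixed number of workers and
// returns one of the errors reported by insert, or nil if there were none.
func insertConcurrently[T any](batches [][]T, workers int, insert func([]T) error) error {
	if workers < 1 {
		workers = 1
	}
	jobs := make(chan []T)
	errs := make(chan error, workers)
	var wg sync.WaitGroup
	for i := 0; i < workers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for b := range jobs {
				if err := insert(b); err != nil {
					select {
					case errs <- err: // keep the error if there is room
					default: // otherwise drop it; one error is enough to report
					}
				}
			}
		}()
	}
	for _, b := range batches {
		jobs <- b
	}
	close(jobs)
	wg.Wait()
	close(errs)
	return <-errs // receiving from a closed, empty channel yields nil
}

func main() {
	rows := []string{"r1", "r2", "r3", "r4", "r5"}
	var inserted atomic.Int64
	err := insertConcurrently(chunk(rows, 2), 2, func(b []string) error {
		inserted.Add(int64(len(b))) // placeholder for the real insert
		return nil
	})
	fmt.Println(inserted.Load(), err)
}
```

With a real database, each worker would typically use its own connection from a pool, since a single connection cannot run statements concurrently.
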