Efficiently snapshot PostgreSQL tables to Parquet files in S3 — written in Go.
postgres2parquet-go is a lightweight, high-performance tool designed to export entire PostgreSQL tables into columnar Parquet format and store them directly in S3.
It was originally built to support reliable pre-replication snapshots for CDC workflows, especially for large tables where traditional snapshot mechanisms can be slow or resource-intensive.
Unlike heavier frameworks that require complex cluster setups, this tool runs as a simple binary with minimal dependencies—making it ideal for containerized environments, CI/CD pipelines, or one-off data exports.
## Why This Tool?
While solutions like Apache Spark (via JDBC) or Apache Flink CDC can achieve similar goals, they often come with operational overhead. postgres2parquet-go offers a focused, zero-friction alternative:
- Native Parquet output using efficient Go libraries (parquet-go)
- Direct streaming to S3 without intermediate disk storage
- Optimized for large tables with configurable chunking and concurrency (pgx)
- Minimal runtime footprint — no JVM, no cluster coordination
- Produces consistent, point-in-time snapshots suitable for initializing CDC pipelines
Whether you're bootstrapping a change data capture pipeline, archiving historical data, or preparing datasets for analytics, postgres2parquet-go delivers speed, simplicity, and reliability.
## Modes

| Mode | Description |
|---|---|
| `default` | Fetches a plain (non-partitioned) table. If `partition-column` is set, the table is split into chunks and each chunk is fetched concurrently; otherwise the whole table is fetched in a single `SELECT`. |
| `partitioned` | Fetches each partition of a partitioned table concurrently. If `partition-column` is also specified, each partition is additionally split into chunks, giving more granular control over parallelism (see the sketch below). |
| `hypertable` | Special mode for TimescaleDB hypertables. The logic mirrors `partitioned`; the only difference is how partitions are represented. Each partition can also be split into chunks if `partition-column` is specified. |
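For a partitioned table, an invocation could look like the following sketch; it assumes the flag set mirrors the default-mode example further down, and `public.events` / `created_at` are hypothetical names:

```sh
# Sketch of partitioned mode (flags assumed to mirror the default-mode
# example below). Each partition of public.events is fetched
# concurrently, and --partition-column additionally splits every
# partition into chunks.
PG2P_DB_PASSWORD=postgres PG2P_DB_USER=postgres \
PG2P_S3_ACCESS_KEY=admin PG2P_S3_SECRET_ACCESS_KEY=password \
./postgres2parquet-go snapshot partitioned \
  --schema public \
  --table events \
  --partition-column created_at \
  --parallelism 8 \
  --bucket warehouse
```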
## Usage

```
ENVS ./postgres2parquet-go {mode} {flags}
```

**Important**
Keep in mind that `partition-column` must be a numeric or temporal type.
The column should also be indexed, or be part of a composite index (in the case of `partitioned` and `hypertable` tables).
Otherwise every chunk query will trigger a full table scan, degrading performance and putting unnecessary load on the database.
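As a sketch, such an index can be created up front; the table and column names here come from the example below (`public.all_simple_types`, `primary_key`), so adjust them for your schema:

```sh
# Index the partition column before snapshotting so that chunk-range
# queries avoid full table scans. Names match the example below;
# adjust for your schema.
psql -h localhost -p 5432 -U postgres -d postgres -c \
  'CREATE INDEX IF NOT EXISTS all_simple_types_primary_key_idx
     ON public.all_simple_types (primary_key);'
```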
Example:

```sh
PG2P_DB_PASSWORD=postgres PG2P_DB_USER=postgres PG2P_S3_ACCESS_KEY=admin PG2P_S3_SECRET_ACCESS_KEY=password ./postgres2parquet-go snapshot default \
--schema public \
--table all_simple_types \
--partition-column primary_key \
--parallelism 200 \
--bucket warehouse \
--fetch-size 10000 \
--file-path /schema/table/ \
--file-size-threshold 52428800 \
--db-host localhost \
--db-port 5432 \
--db-name postgres \
--db-pool-size 20 \
--s3-endpoint localhost:9000 \
--s3-region none \
--s3-secure false
```

## Options

| Option | Required | Default | Description |
|---|---|---|---|
| db-host | false | localhost | DB host |
| db-port | false | 5432 | DB port |
| db-name | false | postgres | DB name |
| db-pool-size | false | 1 | DB connection pool size |
| s3-endpoint | false | localhost:9000 | S3-like storage endpoint |
| s3-region | false | none | S3 region |
| s3-secure | false | false | If true, HTTPS is used to connect to storage; otherwise HTTP |
| schema | true | | Schema of the snapshotted table |
| table | true | | Name of the snapshotted table |
| partition-column | false | | Column used to split the table into chunks; must be a numeric or timestamp type. If specified, the table is split into chunks that are fetched concurrently. If not specified, the whole table (default) or each partition (partitioned, hypertable) is fetched without additional splitting |
| parallelism | false | 1 | Number of chunks |
| fetch-size | false | 10000 | Rows per single fetch from each chunk |
| file-size-threshold | false | 1 MB | Soft limit on the Parquet file size |
| file-path | false | / | Target file path in S3 |
| bucket | true | | Target bucket |
| (env) PG2P_DB_PASSWORD | true | | DB password |
| (env) PG2P_DB_USER | true | | DB user |
| (env) PG2P_S3_ACCESS_KEY | true | | S3 access key |
| (env) PG2P_S3_SECRET_ACCESS_KEY | true | | S3 secret access key |
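Since most options have defaults, a minimal run only needs the required options and the credential environment variables; a sketch reusing the values from the example above:

```sh
# Minimal invocation relying on the defaults listed above: only the
# required options (schema, table, bucket) and the credential
# environment variables are set.
PG2P_DB_PASSWORD=postgres PG2P_DB_USER=postgres \
PG2P_S3_ACCESS_KEY=admin PG2P_S3_SECRET_ACCESS_KEY=password \
./postgres2parquet-go snapshot default \
  --schema public \
  --table all_simple_types \
  --bucket warehouse
```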
## Type Mapping

| Postgres | Parquet |
|---|---|
| smallint (int2) | Integer |
| integer, serial (int4) | Integer |
| bigint, bigserial (int8) | Bigint |
| text | Varchar |
| char | Varchar |
| varchar | Varchar |
| xml | Varchar |
| json | Varchar |
| jsonb | Varchar |
| inet | Varchar |
| macaddr | Varchar |
| decimal(p, s) | Decimal(p, s). Defaults if not specified: precision=18, scale=4 |
| numeric(p, s) | Decimal(p, s). Defaults if not specified: precision=18, scale=4 |
| real (float4) | Float |
| double precision (float8) | Double |
| bool | Boolean |
| timestamp without time zone | Timestamp (millis) |
| timestamp with time zone | TimestampTZ (millis) |
| time without time zone | Time (millis) |
| date | Date |
| uuid | UUID |
| custom type | Varchar |
| []T (except custom types) | []T |
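As an illustration (the table and column names are hypothetical), a table like the following would produce Integer, Decimal(10, 2), TimestampTZ (millis), and UUID columns in the output:

```sh
# Hypothetical table whose columns exercise several mappings above:
#   integer                  -> Integer
#   numeric(10, 2)           -> Decimal(10, 2)
#   timestamp with time zone -> TimestampTZ (millis)
#   uuid                     -> UUID
psql -h localhost -U postgres -d postgres <<'SQL'
CREATE TABLE public.mapping_demo (
    id         integer,
    amount     numeric(10, 2),
    created_at timestamp with time zone,
    ref        uuid
);
SQL
```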
All arrays are represented as:

```
<repeat type> group <column> (LIST) {
    repeated group list {
        required <type> element;
    }
}
```

(A concrete instance is sketched after the limitations below.)

## Limitations

- `TIMETZ` is not supported
- `HASH` partition type is not supported
- Arrays of custom types are not supported
- If an array contains `NULL` values, a default value is used instead (the zero value for primitives, `nil` for pointers)
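As referenced above, here is what the array layout looks like for a hypothetical nullable `tags integer[]` column, assuming `integer` maps to a 32-bit Parquet element as in the type table:

```
optional group tags (LIST) {
    repeated group list {
        required int32 element;
    }
}
```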
Some usage examples can be found in the `examples` directory: