Tool for point-in-time snapshot of postgres table (plain, partitioned, hypertable) into S3 and parquet format


postgres2parquet-go

Efficiently snapshot PostgreSQL tables to Parquet files in S3 — written in Go.
postgres2parquet-go is a lightweight, high-performance tool designed to export entire PostgreSQL tables into columnar Parquet format and store them directly in S3. It was originally built to support reliable pre-replication snapshots for CDC workflows, especially for large tables where traditional snapshot mechanisms can be slow or resource-intensive.

Unlike heavier frameworks that require complex cluster setups, this tool runs as a simple binary with minimal dependencies—making it ideal for containerized environments, CI/CD pipelines, or one-off data exports.

Why This Tool?
While solutions like Apache Spark (via JDBC) or Apache Flink CDC can achieve similar goals, they often come with operational overhead. postgres2parquet-go offers a focused, zero-friction alternative:

  • Native Parquet output using efficient Go libraries (parquet-go)
  • Direct streaming to S3 without intermediate disk storage
  • Optimized for large tables with configurable chunking and concurrency (pgx)
  • Minimal runtime footprint — no JVM, no cluster coordination
  • Produces consistent, point-in-time snapshots suitable for initializing CDC pipelines

Whether you're bootstrapping a change data capture pipeline, archiving historical data, or preparing datasets for analytics, postgres2parquet-go delivers speed, simplicity, and reliability.

Usage

| Mode | Description |
| --- | --- |
| `default` | Snapshots a plain Postgres table. If `partition-column` is set, the table is split into chunks and each chunk is fetched concurrently; otherwise the whole table is fetched in a single select. |
| `partitioned` | Fetches each partition of a partitioned table concurrently. If `partition-column` is specified, each partition is additionally split into chunks, giving more granular control over parallelism. |
| `hypertable` | Special mode for TimescaleDB hypertables. The logic is very similar to the partitioned mode; the only difference is how partitions are represented. Each partition can also be split into chunks if `partition-column` is specified. |

General invocation:

```shell
ENVS ./postgres2parquet-go {mode} {flags}
```

Important

Keep in mind that partition-column should be of a numeric or temporal type. The column should also be indexed, or be part of a composite index (in the case of partitioned and hypertable tables). Otherwise each query will cause a full table scan, which degrades performance and negatively affects the database.
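As a concrete illustration of that advice, the snippet below prints a B-tree index statement for a hypothetical table `public.events` chunked on `created_at` (the table and column names are not from this project; adapt them to your schema before running the SQL via `psql`):

```shell
# Illustrative only: an index on the column you plan to pass as --partition-column.
# Run the printed statement against the real database, e.g.:
#   psql -h localhost -U postgres -d postgres -c "$sql"
sql='CREATE INDEX IF NOT EXISTS events_created_at_idx ON public.events (created_at);'
echo "$sql"
```

With the index in place, each chunk query becomes an index range scan instead of a full table scan.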

Example:

```shell
PG2P_DB_PASSWORD=postgres PG2P_DB_USER=postgres PG2P_S3_ACCESS_KEY=admin PG2P_S3_SECRET_ACCESS_KEY=password ./postgres2parquet-go snapshot default \
  --schema public \
  --table all_simple_types \
  --partition-column primary_key \
  --parallelism 200 \
  --bucket warehouse \
  --fetch-size 10000 \
  --file-path /schema/table/ \
  --file-size-threshold 52428800 \
  --db-host localhost \
  --db-port 5432 \
  --db-name postgres \
  --db-pool-size 20 \
  --s3-endpoint localhost:9000 \
  --s3-region none \
  --s3-secure false
```
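The `partitioned` and `hypertable` modes take the same flags. As a sketch, a hypothetical partitioned-mode run over an illustrative table `public.events` (table name, partition column, and parallelism values are assumptions, not from the project docs) might look like this; the command is assembled into a variable and printed so it can be reviewed before running it against a real database and S3 endpoint:

```shell
# Hypothetical partitioned-mode invocation. Credentials go through the same
# environment variables as above (PG2P_DB_PASSWORD, PG2P_DB_USER,
# PG2P_S3_ACCESS_KEY, PG2P_S3_SECRET_ACCESS_KEY).
cmd="./postgres2parquet-go snapshot partitioned \
  --schema public \
  --table events \
  --partition-column created_at \
  --parallelism 8 \
  --fetch-size 10000 \
  --bucket warehouse \
  --file-path /public/events/ \
  --db-host localhost \
  --db-port 5432 \
  --db-name postgres \
  --db-pool-size 8 \
  --s3-endpoint localhost:9000 \
  --s3-region none \
  --s3-secure false"
echo "$cmd"
```

Because `--partition-column` is set here, each partition is additionally split into chunks, so total concurrency is governed by both `--parallelism` and `--db-pool-size`.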

Options

| Option | Required | Default | Description |
| --- | --- | --- | --- |
| db-host | false | localhost | DB host |
| db-port | false | 5432 | DB port |
| db-name | false | postgres | DB name |
| db-pool-size | false | 1 | DB connection pool size |
| s3-endpoint | false | localhost:9000 | S3-like storage endpoint |
| s3-region | false | none | S3 region |
| s3-secure | false | false | If true, https is used to connect to storage; otherwise http |
| schema | true | | Schema of the snapshotted table |
| table | true | | Name of the snapshotted table |
| partition-column | false | | Column used to split the table into chunks; must be of a numeric or timestamp type. If specified, the table is split into chunks that are fetched concurrently. If not specified, the whole table (`default`) or each partition (`partitioned`, `hypertable`) is fetched without additional splitting |
| parallelism | false | 1 | Number of chunks |
| fetch-size | false | 10000 | Rows per single fetch from each chunk |
| file-size-threshold | false | 1 MB | Soft limit on Parquet file size |
| file-path | false | / | Target file path location in S3 |
| bucket | true | | Target bucket |
| (env) PG2P_DB_PASSWORD | true | | DB password |
| (env) PG2P_DB_USER | true | | DB user |
| (env) PG2P_S3_ACCESS_KEY | true | | S3 access key |
| (env) PG2P_S3_SECRET_ACCESS_KEY | true | | S3 secret access key |

Types mappings

| Postgres | Parquet |
| --- | --- |
| smallint (int2) | Integer |
| integer, serial (int4) | Integer |
| bigint, bigserial (int8) | Bigint |
| text | Varchar |
| char | Varchar |
| varchar | Varchar |
| xml | Varchar |
| json | Varchar |
| jsonb | Varchar |
| inet | Varchar |
| macaddr | Varchar |
| decimal(p, s) | Decimal(p, s). Defaults if not specified: precision=18, scale=4 |
| numeric(p, s) | Decimal(p, s). Defaults if not specified: precision=18, scale=4 |
| real (float4) | Float |
| double precision (float8) | Double |
| bool | Boolean |
| timestamp without time zone | Timestamp (millis) |
| timestamp with time zone | TimestampTZ (millis) |
| time without time zone | Time (millis) |
| date | Date |
| uuid | UUID |
| custom type | Varchar |
| []T (except custom types) | []T |

All arrays are represented as:

```
<repeat type> group <column> (LIST) {
    repeated group list {
        required <type> element;
    }
}
```

Limitations

  • TIMETZ is not supported
  • HASH partition type is not supported
  • Arrays of custom types are not supported
  • If an array contains NULL values, a default value is used instead (the zero value for primitives and nil for pointers)

Examples

Some usage examples can be found in the `examples` directory.
