Commit 9238eec

Merge pull request #1633 from bruin-data/docs/introduce-lakehouses

Lakehouse Docs (Initial)

2 parents 64b9641 + 773ed20

File tree

5 files changed: +284 −1 lines changed

docs/.vitepress/config.mjs

Lines changed: 1 addition & 0 deletions

@@ -92,6 +92,7 @@ j=d.createElement(s),dl=l!='dataLayer'?'&l='+l:'';j.async=true;j.src=
       {text: "Environments", link: "/getting-started/devenv"},
       {text: "Variables", link: "/getting-started/pipeline-variables"},
       {text: "Bruin MCP", link: "/getting-started/bruin-mcp"},
+      {text: "Lakehouse Support", link: "/getting-started/lakehouse"},
     ]
   },
   {text: "Concepts", link: "/getting-started/concepts"},

docs/.vitepress/theme/custom.css

Lines changed: 10 additions & 1 deletion

@@ -56,4 +56,13 @@ html:not(.dark) .vp-code-dark {

 .VPSidebarItem .text {
   padding-top: 0.1rem !important;
 }
+
+.lh-check {
+  color: #22c55e;
+  font-weight: 600;
+}
+
+.lh-check::before {
+  content: "\2713";
+}

docs/getting-started/lakehouse.md

Lines changed: 102 additions & 0 deletions

# Lakehouse Support <Badge type="warning" text="beta" />

> [!WARNING]
> Lakehouse support is currently in **beta**. APIs and configuration may change in future releases.

Bring lakehouse tables directly into your Bruin pipelines. Query Iceberg and DuckLake data on cloud object storage with a catalog-backed metadata layer, all from the same workflows you already use. This page summarizes supported engines, catalogs, and storage backends.

## Engines and formats

Bruin supports two lakehouse engines: DuckDB and Trino. Each section below lists the lakehouse format and catalog/storage combinations Bruin supports; see [DuckDB](../platforms/duckdb.md#lakehouse-support) or [Trino](../platforms/trino.md#lakehouse-support) for the corresponding Bruin configuration.
### [DuckDB](../platforms/duckdb.md#lakehouse-support)

The [Iceberg](https://duckdb.org/docs/extensions/iceberg) and [DuckLake](https://duckdb.org/docs/extensions/ducklake) formats are natively supported in Bruin.

#### DuckLake

DuckLake uses a DuckDB, SQLite, or Postgres catalog. The table below shows the supported catalog + storage combinations. For more guidance, see DuckLake's [choosing a catalog database](https://ducklake.select/docs/stable/duckdb/usage/choosing_a_catalog_database).

| Catalog \ Storage | S3 |
|-------------------|----|
| DuckDB | <span class="lh-check" aria-label="supported"></span> |
| SQLite | <span class="lh-check" aria-label="supported"></span> |
| Postgres | <span class="lh-check" aria-label="supported"></span> |
| MySQL | Planned |

#### Iceberg

Iceberg uses the [AWS Glue Data Catalog](https://docs.aws.amazon.com/glue/latest/dg/aws-glue-data-catalog.html). The table below shows the supported catalog + storage combinations.

| Catalog \ Storage | S3 |
|-------------------|----|
| Glue | <span class="lh-check" aria-label="supported"></span> |
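For illustration, an Iceberg connection backed by Glue could be sketched as follows. The connection name, catalog ID, and credentials are placeholders; the field layout follows the [DuckDB platform page](../platforms/duckdb.md#lakehouse-support):

```yaml
connections:
  duckdb:
    - name: "analytics"               # placeholder connection name
      path: "./path/to/duckdb.db"
      lakehouse:
        format: iceberg
        catalog:
          type: glue
          catalog_id: "123456789012"  # placeholder AWS account ID
          region: "us-east-1"
          auth:
            access_key: "${AWS_ACCESS_KEY_ID}"
            secret_key: "${AWS_SECRET_ACCESS_KEY}"
```

Note that the `storage` block is optional for Iceberg (it is required for DuckLake).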
### [Trino](../platforms/trino.md#lakehouse-support)

Trino supports lakehouse access via the [Iceberg connector](https://trino.io/docs/current/connector/iceberg.html) with Glue and [Nessie](https://projectnessie.org/) catalogs. Detailed setup guides are coming soon; see [Trino](../platforms/trino.md#lakehouse-support) for the Bruin configuration.

| Catalog \ Storage | S3 |
|-------------------|----|
| Glue | <span class="lh-check" aria-label="supported"></span> |
| Nessie | <span class="lh-check" aria-label="supported"></span> |
## What is a Lakehouse?

A lakehouse combines the scalability of data lakes with the reliability of data warehouses. Data is stored in open formats on object storage (S3, GCS, Azure Blob) while metadata catalogs track schema, partitions, and table history.

<!-- Architecture -->

```mermaid
%%{init: {"flowchart": {"useMaxWidth": true, "nodeSpacing": 180, "rankSpacing": 80}}}%%
flowchart TB
    QE["Query Engine<br/>(DuckDB, Trino, ...)<br/>&nbsp;"]
    Catalog["**Catalog**<br/>(Glue, REST, ...)<br/><br/>Table metadata, Schema info, Partition info<br/>&nbsp;"]
    Storage["**Storage**<br/>(S3, GCS, ...)<br/>**Format** (Iceberg, DuckLake)<br/><br/>Parquet files, Manifest files, Data files<br/>&nbsp;"]

    QE --> Catalog
    QE --> Storage
```
## Quick Start

Let's add a DuckLake lakehouse configuration to your DuckDB connection (DuckDB catalog + S3 storage):

```yaml
connections:
  duckdb:
    - name: "analytics"
      path: "./path/to/duckdb.db"
      lakehouse:
        format: ducklake
        catalog:
          type: duckdb
          path: "metadata.ducklake"
        storage:
          type: s3
          path: "s3://my-ducklake-warehouse/path"
          region: "us-east-1"
          auth:
            access_key: "AKIA..."
            secret_key: "..."
```

Then query your DuckLake tables (unqualified names default to the `main` schema):

```bruin-sql
/* @bruin
name: lakehouse_users
type: duckdb.sql
connection: analytics
@bruin */

SELECT * FROM users;
```

See the engine-specific pages [DuckDB](../platforms/duckdb.md#lakehouse-support) or [Trino](../platforms/trino.md#lakehouse-support) for detailed configuration options.

docs/platforms/duckdb.md

Lines changed: 160 additions & 0 deletions

@@ -130,3 +130,163 @@ name,networking_through,position,contact_date
Y,LinkedIn,SDE,2024-01-01
B,LinkedIn,SDE 2,2024-01-01
```
## Lakehouse Support <Badge type="warning" text="beta" />

DuckDB can query [Iceberg](https://duckdb.org/docs/extensions/iceberg) and [DuckLake](https://duckdb.org/docs/extensions/ducklake) tables through its native extensions. DuckLake supports DuckDB, SQLite, or Postgres catalogs with S3-backed storage.

### Connection

Add the `lakehouse` block to your DuckDB connection in `.bruin.yml`:

```yaml
connections:
  duckdb:
    - name: "example-conn"
      path: "./path/to/duckdb.db"
      lakehouse:
        format: <iceberg|ducklake>
        catalog:
          type: <glue|postgres|duckdb|sqlite>
          auth: { ... } # optional
        storage:
          type: <s3>
          auth: { ... } # optional
```

<br>

| Field | Type | Required | Description |
|-------|------|----------|-------------|
| `format` | string | Yes | Table format: `iceberg` or `ducklake` |
| `catalog` | object | Yes | Catalog configuration (Glue for Iceberg; DuckDB, SQLite, or Postgres for DuckLake) |
| `storage` | object | No | Storage configuration (required for DuckLake) |
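Putting these fields together, a complete DuckLake connection using the SQLite catalog might look like this (connection name, paths, and bucket are illustrative):

```yaml
connections:
  duckdb:
    - name: "example-conn"
      path: "./path/to/duckdb.db"
      lakehouse:
        format: ducklake
        catalog:
          type: sqlite
          path: "metadata.sqlite"
        storage:
          type: s3
          path: "s3://my-ducklake-warehouse/path"
          region: "us-east-1"
          auth:
            access_key: "${AWS_ACCESS_KEY_ID}"
            secret_key: "${AWS_SECRET_ACCESS_KEY}"
```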
---

### Supported Lakehouse Formats

#### DuckLake

| Catalog \ Storage | S3 |
|-------------------|----|
| DuckDB | <span class="lh-check" aria-label="supported"></span> |
| SQLite | <span class="lh-check" aria-label="supported"></span> |
| Postgres | <span class="lh-check" aria-label="supported"></span> |
| MySQL | Planned |

#### Iceberg

| Catalog \ Storage | S3 |
|-------------------|----|
| Glue | <span class="lh-check" aria-label="supported"></span> |

For background, see DuckDB's [lakehouse format overview](https://duckdb.org/docs/stable/lakehouse_formats).

---

### Catalog Options

For guidance, see DuckLake's [choosing a catalog database](https://ducklake.select/docs/stable/duckdb/usage/choosing_a_catalog_database).
#### Glue

```yaml
catalog:
  type: glue
  catalog_id: "123456789012"
  region: "us-east-1"
  auth:
    access_key: "${AWS_ACCESS_KEY_ID}"
    secret_key: "${AWS_SECRET_ACCESS_KEY}"
    session_token: "${AWS_SESSION_TOKEN}" # optional
```

#### Postgres

```yaml
catalog:
  type: postgres
  host: "localhost"
  port: 5432 # optional - default: 5432
  database: "ducklake_catalog"
  auth:
    username: "ducklake_user"
    password: "ducklake_password"
```

#### DuckDB

```yaml
catalog:
  type: duckdb
  path: "metadata.ducklake"
```

`catalog.path` should point to the DuckLake metadata file.

Note that if you use DuckDB as your catalog database, you are limited to a single client.

#### SQLite

```yaml
catalog:
  type: sqlite
  path: "metadata.sqlite"
```
---

### Storage Options

#### S3

Bruin currently supports only explicit AWS credentials in the `auth` block. Session tokens are supported for temporary credentials (AWS STS).

```yaml
storage:
  type: s3
  path: "s3://my-ducklake-warehouse/path" # required for DuckLake, optional for Iceberg
  region: "us-east-1"
  auth:
    access_key: "${AWS_ACCESS_KEY_ID}"
    secret_key: "${AWS_SECRET_ACCESS_KEY}"
    session_token: "${AWS_SESSION_TOKEN}" # optional
```
### Usage

Bruin activates the lakehouse catalog for your session and ensures a default `main` schema is available. (For Iceberg on S3, Bruin cannot create schemas or tables, so they must already exist.) You can query tables with or without a schema qualifier:

```sql
SELECT * FROM my_table;
```

You can also use the fully qualified path:

```sql
SELECT * FROM iceberg_catalog.main.my_table;
```

> [!NOTE]
> Unqualified table names resolve to the `main` schema of the active catalog. Use `<schema>.<table>` to target non-main schemas.
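For instance, targeting a table outside `main` (the `sales` schema and `orders` table here are hypothetical names):

```sql
-- schema-qualified query; the schema must already exist in the catalog
SELECT * FROM sales.orders;
```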
#### Example Asset

```bruin-sql
/* @bruin
name: lakehouse_example
type: duckdb.sql
connection: example-conn
@bruin */

SELECT SUM(amount) AS total_sales
FROM orders;
```

docs/platforms/trino.md

Lines changed: 11 additions & 0 deletions

@@ -81,3 +81,14 @@ type: trino.sensor.query
parameters:
  query: select exists(select 1 from upstream_table where inserted_at > '{{ end_timestamp }}')
```

## Lakehouse Support <Badge type="warning" text="beta" />

> [!WARNING]
> Trino lakehouse support is currently in **beta**. Detailed setup guides are coming soon.

| Catalog \ Storage | S3 |
|-------------------|----|
| Glue | <span class="lh-check" aria-label="supported"></span> |
| Nessie | <span class="lh-check" aria-label="supported"></span> |
