README.md: 86 additions & 15 deletions
@@ -33,7 +33,7 @@ RAGent is a CLI tool for building a RAG (Retrieval-Augmented Generation) system
 
 ## Features
 
-- **Vectorization**: Convert markdown files to embeddings using Amazon Bedrock
+- **Vectorization**: Convert source files (markdown and CSV) to embeddings using Amazon Bedrock
 - **S3 Vector Integration**: Store generated vectors in Amazon S3 Vectors
 - **Hybrid Search**: Combined BM25 + vector search using OpenSearch
 - **Slack Search Integration**: Blend document results with Slack conversations via an iterative enrichment pipeline
@@ -313,18 +313,57 @@ flowchart TD
 
 ## Prerequisites
 
-### Prepare Markdown Documents
+### Prepare Source Documents
 
-Before using RAGent, you need to prepare markdown documents in a `markdown/` directory. These documents should contain the content you want to make searchable through the RAG system.
+Before using RAGent, you need to prepare source documents in a `source/` directory. These documents should contain the content you want to make searchable through the RAG system.
 
+**Supported file types:**
+- **Markdown (.md, .markdown)**: Each file becomes one document
+- **CSV (.csv)**: Each row becomes one document (header row required)
+
+```bash
+# Create source directory
+mkdir source
+
+# Place your files in this directory
+cp /path/to/your/documents/*.md source/
+cp /path/to/your/data/*.csv source/
+```
+
+For CSV files, you can optionally provide a configuration file to specify column mappings:
 ```bash
-# Create markdown directory
-mkdir markdown
+# Copy example configuration
+cp csv-config.yaml.example csv-config.yaml
+
+# Run with CSV configuration
+RAGent vectorize --csv-config csv-config.yaml
+```
+
+#### CSV Configuration Options
+
+The `csv-config.yaml` supports the following options:
+
+**header_row (Header Row Position):**
+
+Use this option when your CSV file has metadata or summary rows before the actual header row.
+When `header_row` is specified, that row is used as the column headers, and all preceding rows are skipped.
 
-# Place your markdown files in this directory
-cp /path/to/your/documents/*.md markdown/
+```yaml
+csv:
+  files:
+    - pattern: "sample.csv"
+      header_row: 7  # Row 7 is the header (1-indexed)
+                     # Rows 1-6 are skipped, data starts from row 8
+      content:
+        columns: ["task", "category"]
+      metadata:
+        title: "task"
+        category: "category"
 ```
 
+- If `header_row` is not specified, the default is `1` (first row is the header)
+- Row numbers are 1-indexed
+
 For exporting notes from Kibela, use the separate export tool available in the `export/` directory.
 
 ## Required Environment Variables
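
The `header_row` semantics documented in the hunk above (the header row is 1-indexed, rows before it are skipped, and each following row becomes one document) can be illustrated with a small sketch. This is not RAGent's implementation: the helper name `rows_to_documents`, the newline-joining of content columns, the skipping of blank data rows, and the sample data are all assumptions made for illustration.

```python
import csv
import io


def rows_to_documents(csv_text, header_row=1, content_columns=None, metadata_map=None):
    """Hypothetical helper: turn each data row of a CSV into one document dict."""
    if header_row < 1:
        raise ValueError("header_row is 1-indexed and must be >= 1")

    rows = list(csv.reader(io.StringIO(csv_text)))
    if len(rows) < header_row:
        raise ValueError("CSV has fewer rows than header_row")

    # Row N (1-indexed) is the header; rows 1..N-1 are ignored.
    header = rows[header_row - 1]
    columns = content_columns or header
    mapping = metadata_map or {}

    documents = []
    for raw in rows[header_row:]:  # data starts on the row after the header
        if not raw:
            continue  # assumption: blank data rows produce no document
        record = dict(zip(header, raw))
        # Assumption: content columns are joined with newlines.
        content = "\n".join(record.get(col, "") for col in columns)
        metadata = {key: record.get(col, "") for key, col in mapping.items()}
        documents.append({"content": content, "metadata": metadata})
    return documents


# Six preamble rows, header on row 7, data from row 8 (made-up sample data).
SAMPLE = (
    "Weekly export\n"
    "Owner,ops\n"
    "Period,2024-W01\n"
    "Total rows,2\n"
    "Source,tracker\n"
    "Notes,none\n"
    "task,category,owner\n"
    "Rotate keys,security,alice\n"
    "Update docs,docs,bob\n"
)

docs = rows_to_documents(
    SAMPLE,
    header_row=7,
    content_columns=["task", "category"],
    metadata_map={"title": "task", "category": "category"},
)
for doc in docs:
    print(doc)
```

With `header_row` left at its default of `1`, the same helper treats the first row as the header, which matches the default described in the README text.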
@@ -607,20 +646,41 @@ All entries should report `OK`. If a mismatch occurs, re-download the artifact.
 
 ### 1. vectorize - Vectorization and S3 Storage
 
-Read markdown files, extract metadata, generate embeddings using Amazon Bedrock, and store them in Amazon S3 Vectors.
+Read source files (markdown and CSV), extract metadata, generate embeddings using Amazon Bedrock, and store them in Amazon S3 Vectors.
 
 ```bash
 RAGent vectorize
 ```
 
 **Options:**
-- `-d, --directory`: Directory containing markdown files to process (default: `./markdown`)
+- `-d, --directory`: Directory containing source files to process (default: `./source`)
 - `--dry-run`: Display processing details without making actual API calls
 - `-c, --concurrency`: Number of concurrent processes (0 = use default value from config file)