Comprehensive Data Ingestion & New Format Support #227

tolgakaratas · 2026-01-25T12:26:23Z

This PR introduces significant improvements to LEANN's data ingestion capabilities, enabling structural extraction for Office documents, Mindmaps, and various personal data sources.

Key Features

Expanded Format Support:
- Mindmap (.mm): Added hierarchical text extraction for FreeMind and Freeplane files, preserving the parent-child node relationships.
- Office Suite: Dedicated extractors for Word (.docx), Excel (.xlsx), and PowerPoint (.pptx) that preserve tables, slide numbers, and sheet structures.
Robust PDF Processing Pipeline:
- Implemented a multi-layer fallback chain: PyMuPDF → pypdf → pdfplumber → Docling OCR.
- Ensures reliable text and layout extraction even for complex, image-heavy, or slightly corrupted PDFs.
Integrated CLI Commands:
- Centralized ingestion logic into the core CLI with dedicated commands:
  - leann index-email: Apple Mail indexing.
  - leann index-calendar: Apple Calendar indexing.
  - leann index-wechat / leann index-imessage: Chat history processing.
  - leann index-slack / leann index-twitter: MCP-based ingestion.
  - leann index-chatgpt / leann index-claude: AI export indexing.
Performance & Bug Fixes:
- Fixed a bug in base_rag_example.py where index existence checks incorrectly triggered on directory existence, preventing rebuilds in temporary folders.
- Optimized chunking defaults to handle deep directory structures (fixing metadata length issues).
- Added production-ready sync utilities in scripts/sync_utilities/ for managing large vaults with DiskANN.
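
For readers skimming the description, the PDF fallback chain listed above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the code in this PR: `extract_with_fallback`, the `min_chars` threshold, and the three stub extractors are hypothetical names. In practice each extractor would wrap one of PyMuPDF, pypdf, pdfplumber, or Docling.

```python
from typing import Callable, List, Optional, Tuple

# An extractor takes a file path and returns the extracted text.
Extractor = Callable[[str], str]


def extract_with_fallback(
    path: str,
    extractors: List[Tuple[str, Extractor]],
    min_chars: int = 1,
) -> Tuple[Optional[str], str]:
    """Try each extractor in order and return (name, text) from the first
    one that yields at least `min_chars` non-whitespace-trimmed characters.
    Returns (None, "") when every layer fails or returns too little text."""
    for name, extract in extractors:
        try:
            text = extract(path)
        except Exception:
            # A failing layer is skipped so the next one gets a chance.
            continue
        if text and len(text.strip()) >= min_chars:
            return name, text
    return None, ""


# Hypothetical stand-ins for the real library-backed extractors.
def broken_extractor(path: str) -> str:
    raise RuntimeError("cannot parse file")


def empty_extractor(path: str) -> str:
    return "   "  # e.g. an image-only PDF with no text layer


def working_extractor(path: str) -> str:
    return f"text from {path}"


chain = [
    ("pymupdf", broken_extractor),
    ("pypdf", empty_extractor),
    ("pdfplumber", working_extractor),
]
result = extract_with_fallback("report.pdf", chain)
print(result)
```

Ordering the chain from cheapest to most expensive keeps OCR as a last resort, which is presumably why Docling OCR sits at the end of the list.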

These changes significantly advance LEANN's vision of being a universal personal AI memory that can 'RAG Everything' with minimal overhead.

yichuan-w · 2026-01-27T23:55:54Z

That is a wonderful PR, @ASuresh0524. We should review this carefully.

ASuresh0524 · 2026-01-28T19:01:46Z

Agreed! Went through most of it, looks good to me

ASuresh0524 · 2026-01-29T15:19:43Z

@tolgakaratas something I noticed:

Bug: Missing embedding arguments in index-* command parsers

The index-* commands (index-email, index-calendar, index-wechat, index-imessage, index-slack, index-chatgpt, index-claude, index-browser) reference args.embedding_model and args.embedding_mode in their implementations, but these arguments are not added to their respective parsers.

Impact:
Running any of these commands will raise:

AttributeError: 'Namespace' object has no attribute 'embedding_model'
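
The failure can be reproduced in isolation with plain `argparse`. The snippet below is a minimal sketch, not the actual LEANN CLI: the `--index-name` flag and its default are hypothetical, and only the missing-argument pattern matches what is described above.

```python
import argparse

# A stripped-down CLI with one index-* subcommand that does NOT
# register --embedding-model (mirroring the reported bug).
parser = argparse.ArgumentParser(prog="leann")
subparsers = parser.add_subparsers(dest="command")
email_parser = subparsers.add_parser("index-email")
email_parser.add_argument("--index-name", default="mail_index")  # hypothetical flag

args = parser.parse_args(["index-email"])

# The command implementation then reads an attribute the parser never set.
try:
    _ = args.embedding_model
    error_message = None
except AttributeError as exc:
    error_message = str(exc)

print(error_message)
```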

How to fix:

Add the embedding arguments to each index-* parser. Two approaches:

Option 1: Add individually to each parser (explicit but verbose):

email_parser.add_argument(
    "--embedding-model",
    type=str,
    default="facebook/contriever",
    help="Embedding model (default: facebook/contriever)",
)
email_parser.add_argument(
    "--embedding-mode",
    type=str,
    default="sentence-transformers",
    choices=["sentence-transformers", "openai", "mlx", "ollama"],
    help="Embedding backend mode (default: sentence-transformers)",
)
# Repeat for all index-* parsers...

Option 2: Use a helper function (DRY approach):

def add_embedding_args(parser):
    """Add common embedding arguments to a parser."""
    parser.add_argument(
        "--embedding-model",
        type=str,
        default="facebook/contriever",
        help="Embedding model (default: facebook/contriever)",
    )
    parser.add_argument(
        "--embedding-mode",
        type=str,
        default="sentence-transformers",
        choices=["sentence-transformers", "openai", "mlx", "ollama"],
        help="Embedding backend mode (default: sentence-transformers)",
    )
    # Add other embedding-related args as needed (host, api-base, api-key, etc.)

# Then use it:
add_embedding_args(email_parser)
add_embedding_args(calendar_parser)
# ... etc for all index-* parsers

Affected commands:

index-email
index-calendar
index-wechat
index-imessage
index-slack
index-chatgpt
index-claude
index-browser
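
As a possible third variant (an assumption on my side, not something in this PR), `argparse` parent parsers can share the arguments across all of the commands listed above. `build_parser` and `INDEX_COMMANDS` are hypothetical names used only for this sketch; the defaults and choices are copied from the options above.

```python
import argparse

# The affected commands, as listed in this comment.
INDEX_COMMANDS = [
    "index-email",
    "index-calendar",
    "index-wechat",
    "index-imessage",
    "index-slack",
    "index-chatgpt",
    "index-claude",
    "index-browser",
]


def build_parser() -> argparse.ArgumentParser:
    # Parent parser holding the shared embedding arguments.
    # add_help=False avoids a duplicate -h/--help in each child.
    embedding_parent = argparse.ArgumentParser(add_help=False)
    embedding_parent.add_argument(
        "--embedding-model",
        type=str,
        default="facebook/contriever",
        help="Embedding model (default: facebook/contriever)",
    )
    embedding_parent.add_argument(
        "--embedding-mode",
        type=str,
        default="sentence-transformers",
        choices=["sentence-transformers", "openai", "mlx", "ollama"],
        help="Embedding backend mode (default: sentence-transformers)",
    )

    parser = argparse.ArgumentParser(prog="leann")
    subparsers = parser.add_subparsers(dest="command", required=True)
    for name in INDEX_COMMANDS:
        # Every index-* subcommand inherits the shared arguments.
        subparsers.add_parser(name, parents=[embedding_parent])
    return parser


parser = build_parser()
defaults = {name: parser.parse_args([name]) for name in INDEX_COMMANDS}
override = parser.parse_args(["index-slack", "--embedding-mode", "ollama"])
```

A loop like the one building `defaults` also doubles as a cheap regression test: if a new index-* command is added without the shared arguments, reading `embedding_model` on its namespace fails immediately.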

Location: packages/leann-core/src/leann/cli.py - around where the index-* parsers are defined (after line ~450 based on the diff).

yichuan-w · 2026-01-29T23:38:14Z

@ASuresh0524, next time you can comment on the code directly. Try to get familiar with GitHub code review, thanks!!

ASuresh0524 · 2026-02-05T20:35:29Z

@tolgakaratas please let us know when you can fix this.

yichuan-w · 2026-02-05T22:28:06Z

Thanks @ASuresh0524, if he cannot get to it, maybe you can push to this branch and co-author this PR with him.

ASuresh0524 · 2026-02-05T23:23:52Z

Yeah, will do by tonight/tomorrow if they are unable to.

tolgakaratas added 11 commits January 25, 2026 15:24

feat: add comprehensive ingestion support for Office, Mindmaps, Mail, Calendar, and enhanced PDF processing

00529d6

docs: add links to third-party tools and improve feature descriptions

57446d4

docs: fix docling link to point to current project URL

648a47a

perf: optimize list_indexes to avoid redundant scans and deep directory crawling

fc05933

docs: comprehensive update and perf: optimized list command with AST-aware recommendations

3014ed4

docs: fix broken and unstable links in documentation

786b6d4

docs: restructure all documentation into Diataxis methodology

2bbf738

docs: fix persistent 404 links for Apple Services in README

1b65ccf

docs: restore Portable feature and fix unstable Apple links in README

2abbde0

docs: restore value propositions and enhance feature list in README

e2b1a57

docs: restore core value propositions and Claude Code integration note in README

becb1dc

ASuresh0524 requested a review from yichuan-w January 26, 2026 09:04

fix: add embedding args for index commands

4cdc622

Ensure all index-* CLI commands accept embedding model/mode arguments to match their builder usage. Co-authored-by: Cursor <cursoragent@cursor.com>

ASuresh0524 merged commit 2c79621 into yichuan-w:main Feb 10, 2026
27 checks passed

yichuan-w mentioned this pull request Feb 10, 2026

Revert "Comprehensive Data Ingestion & New Format Support" #241

Merged
