@tolgakaratas
Contributor

This PR introduces significant improvements to LEANN's data ingestion capabilities, enabling structural extraction for Office documents, Mindmaps, and various personal data sources.

Key Features

  1. Expanded Format Support:
    • Mindmap (.mm): Added hierarchical text extraction for FreeMind and Freeplane files, preserving the parent-child node relationships.
    • Office Suite: Dedicated extractors for Word (.docx), Excel (.xlsx), and PowerPoint (.pptx) that preserve tables, slide numbers, and sheet structures.
  2. Robust PDF Processing Pipeline:
    • Implemented a multi-layer fallback chain: PyMuPDF → pypdf → pdfplumber → Docling OCR.
    • Ensures reliable text and layout extraction even for complex, image-heavy, or slightly corrupted PDFs.
  3. Integrated CLI Commands:
    • Centralized ingestion logic into the core CLI with dedicated commands:
      • leann index-email: Apple Mail indexing.
      • leann index-calendar: Apple Calendar indexing.
      • leann index-wechat / leann index-imessage: Chat history processing.
      • leann index-slack / leann index-twitter: MCP-based ingestion.
      • leann index-chatgpt / leann index-claude: AI export indexing.
  4. Performance & Bug Fixes:
    • Fixed a bug in base_rag_example.py where index existence checks incorrectly triggered on directory existence, preventing rebuilds in temporary folders.
    • Optimized chunking defaults to handle deep directory structures (fixing metadata length issues).
    • Added production-ready sync utilities in scripts/sync_utilities/ for managing large vaults with DiskANN.
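The fallback chain described under item 2 could look roughly like the sketch below. It is an illustration under stated assumptions, not the code from this PR: `extract_with_fallback`, the `min_chars` threshold, and the stub extractors are hypothetical names, and the real backends (PyMuPDF, pypdf, pdfplumber, Docling) are replaced by stand-in functions so the example runs without extra dependencies.

```python
from typing import Callable, Optional, Sequence, Tuple

# Hypothetical type: an extractor takes a file path and returns the extracted text.
Extractor = Callable[[str], str]


def extract_with_fallback(
    path: str,
    extractors: Sequence[Tuple[str, Extractor]],
    min_chars: int = 1,
) -> Tuple[Optional[str], str]:
    """Try each extractor in order and return (name, text) from the first usable result."""
    for name, extract in extractors:
        try:
            text = extract(path)
        except Exception:
            # This backend could not read the file; move on to the next one.
            continue
        if len(text.strip()) >= min_chars:
            return name, text
    # Every backend failed or returned only whitespace.
    return None, ""


# Stand-in backends (assumptions for the demo, not real library calls).
def broken_backend(path: str) -> str:
    raise ValueError("cannot parse file")


def empty_backend(path: str) -> str:
    return "   "


def working_backend(path: str) -> str:
    return "hello from page 1"


chain = [
    ("pymupdf", broken_backend),
    ("pypdf", empty_backend),
    ("pdfplumber", working_backend),
]
result = extract_with_fallback("example.pdf", chain)
print(result)
```

Treating whitespace-only output as a failure matters for image-heavy PDFs, where a text-layer extractor can succeed without raising yet return nothing useful, so the chain still reaches the OCR stage.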

These changes significantly advance LEANN's vision of being a universal personal AI memory that can 'RAG Everything' with minimal overhead.
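For the Mindmap support listed under item 1, hierarchical extraction could be sketched as follows. This is a minimal illustration, assuming the common FreeMind layout of nested `node` elements carrying a `TEXT` attribute; `mindmap_to_text` is a hypothetical name, not the extractor added in this PR, and nodes that store their label as rich content rather than `TEXT` are skipped here.

```python
import xml.etree.ElementTree as ET


def mindmap_to_text(xml_source: str, indent: str = "  ") -> str:
    """Render nested <node TEXT="..."> elements as an indented outline."""
    root = ET.fromstring(xml_source)
    lines = []

    def walk(node, depth):
        text = node.get("TEXT")
        if text:
            lines.append(f"{indent * depth}- {text}")
        # Direct children only, so each level keeps its own indentation.
        for child in node.findall("node"):
            walk(child, depth + 1)

    for top in root.findall("node"):
        walk(top, 0)
    return "\n".join(lines)


SAMPLE = """<map version="1.0.1">
  <node TEXT="Project">
    <node TEXT="Goals">
      <node TEXT="Ship v1"/>
    </node>
    <node TEXT="Risks"/>
  </node>
</map>"""

outline = mindmap_to_text(SAMPLE)
print(outline)
```

Keeping the indentation in the extracted text lets a downstream chunker see which child nodes belong to which parent.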

@ASuresh0524 ASuresh0524 requested a review from yichuan-w January 26, 2026 09:04
@yichuan-w
Owner

That is a wonderful PR, @ASuresh0524. We should review this carefully.

@ASuresh0524
Collaborator

Agreed! I went through most of it, and it looks good to me.

@ASuresh0524
Collaborator

@tolgakaratas, something I noticed:

Bug: Missing embedding arguments in index-* command parsers

The index-* commands (index-email, index-calendar, index-wechat, index-imessage, index-slack, index-chatgpt, index-claude, index-browser) reference args.embedding_model and args.embedding_mode in their implementations, but these arguments are not added to their respective parsers.

Impact:
Running any of these commands will raise:

AttributeError: 'Namespace' object has no attribute 'embedding_model'
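The failure can be reproduced in isolation with a small argparse sketch. The parser below is a hypothetical stand-in for one of the index-* parsers, not the actual code in cli.py:

```python
import argparse

parser = argparse.ArgumentParser(prog="leann")
subparsers = parser.add_subparsers(dest="command")
# Stand-in for an index-* parser that never declares --embedding-model.
subparsers.add_parser("index-email")

args = parser.parse_args(["index-email"])

try:
    args.embedding_model  # The handler reads an attribute the parser never defined.
    error_message = None
except AttributeError as exc:
    error_message = str(exc)

print(error_message)
```

argparse only sets attributes for arguments that were registered on the parser that handled the command line, so the lookup fails as soon as the handler touches `args.embedding_model`.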

How to fix:

Add the embedding arguments to each index-* parser. Two approaches:

Option 1: Add individually to each parser (explicit but verbose):

email_parser.add_argument(
    "--embedding-model",
    type=str,
    default="facebook/contriever",
    help="Embedding model (default: facebook/contriever)",
)
email_parser.add_argument(
    "--embedding-mode",
    type=str,
    default="sentence-transformers",
    choices=["sentence-transformers", "openai", "mlx", "ollama"],
    help="Embedding backend mode (default: sentence-transformers)",
)
# Repeat for all index-* parsers...

Option 2: Use a helper function (DRY approach):

def add_embedding_args(parser):
    """Add common embedding arguments to a parser."""
    parser.add_argument(
        "--embedding-model",
        type=str,
        default="facebook/contriever",
        help="Embedding model (default: facebook/contriever)",
    )
    parser.add_argument(
        "--embedding-mode",
        type=str,
        default="sentence-transformers",
        choices=["sentence-transformers", "openai", "mlx", "ollama"],
        help="Embedding backend mode (default: sentence-transformers)",
    )
    # Add other embedding-related args as needed (host, api-base, api-key, etc.)

# Then use it:
add_embedding_args(email_parser)
add_embedding_args(calendar_parser)
# ... etc for all index-* parsers

Affected commands:

  • index-email
  • index-calendar
  • index-wechat
  • index-imessage
  • index-slack
  • index-chatgpt
  • index-claude
  • index-browser

Location: packages/leann-core/src/leann/cli.py - around where the index-* parsers are defined (after line ~450 based on the diff).
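For reference, Option 2 applied to every affected command could look like the self-contained sketch below. `INDEX_COMMANDS` and `build_parser` are hypothetical names for illustration, and the real parsers in cli.py carry more options than shown here:

```python
import argparse

# Commands from the "Affected commands" list above.
INDEX_COMMANDS = [
    "index-email", "index-calendar", "index-wechat", "index-imessage",
    "index-slack", "index-chatgpt", "index-claude", "index-browser",
]


def add_embedding_args(parser):
    """Add common embedding arguments to a parser."""
    parser.add_argument(
        "--embedding-model",
        type=str,
        default="facebook/contriever",
        help="Embedding model (default: facebook/contriever)",
    )
    parser.add_argument(
        "--embedding-mode",
        type=str,
        default="sentence-transformers",
        choices=["sentence-transformers", "openai", "mlx", "ollama"],
        help="Embedding backend mode (default: sentence-transformers)",
    )


def build_parser():
    """Build a toy CLI where every index-* subcommand shares the embedding arguments."""
    parser = argparse.ArgumentParser(prog="leann")
    subparsers = parser.add_subparsers(dest="command", required=True)
    for name in INDEX_COMMANDS:
        add_embedding_args(subparsers.add_parser(name))
    return parser


cli = build_parser()
defaults = cli.parse_args(["index-email"])
custom = cli.parse_args(
    ["index-slack", "--embedding-mode", "openai",
     "--embedding-model", "text-embedding-3-small"]
)
print(defaults.embedding_model, defaults.embedding_mode)
print(custom.embedding_model, custom.embedding_mode)
```

Registering the arguments in a loop means a newly added index-* command cannot silently miss them, as long as it is included in the list.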


@yichuan-w
Owner

@ASuresh0524, next time you can comment on the code directly. Please try to get familiar with GitHub code review. Thanks!

@ASuresh0524
Collaborator

@tolgakaratas, please let us know when you can fix this.

@yichuan-w
Owner

Thanks, @ASuresh0524. If he cannot get to it, maybe you can push to this branch and co-author this PR with him.

@ASuresh0524
Collaborator

Yeah, will do by tonight or tomorrow if they are unable to.

Ensure all index-* CLI commands accept embedding model/mode
arguments to match their builder usage.

Co-authored-by: Cursor <cursoragent@cursor.com>
@ASuresh0524 ASuresh0524 merged commit 2c79621 into yichuan-w:main Feb 10, 2026
27 checks passed

3 participants