Comprehensive Data Ingestion & New Format Support #227
… Calendar, and enhanced PDF processing
…aware recommendations
|
That is a wonderful PR, @ASuresh0524. We should carefully review this |
|
Agreed! Went through most of it, looks good to me |
|
@tolgakaratas something I noticed:
Bug: Missing embedding arguments in
The Impact:
How to fix: Add the embedding arguments to each
Option 1: Add individually to each parser (explicit but verbose):
email_parser.add_argument(
    "--embedding-model",
    type=str,
    default="facebook/contriever",
    help="Embedding model (default: facebook/contriever)",
)
email_parser.add_argument(
    "--embedding-mode",
    type=str,
    default="sentence-transformers",
    choices=["sentence-transformers", "openai", "mlx", "ollama"],
    help="Embedding backend mode (default: sentence-transformers)",
)
# Repeat for all index-* parsers...
Option 2: Use a helper function (DRY approach):
def add_embedding_args(parser):
    """Add common embedding arguments to a parser."""
    parser.add_argument(
        "--embedding-model",
        type=str,
        default="facebook/contriever",
        help="Embedding model (default: facebook/contriever)",
    )
    parser.add_argument(
        "--embedding-mode",
        type=str,
        default="sentence-transformers",
        choices=["sentence-transformers", "openai", "mlx", "ollama"],
        help="Embedding backend mode (default: sentence-transformers)",
    )
    # Add other embedding-related args as needed (host, api-base, api-key, etc.)

# Then use it:
add_embedding_args(email_parser)
add_embedding_args(calendar_parser)
# ... etc for all index-* parsers
Affected commands:
Location:
|
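For anyone who wants to try Option 2 in isolation, here is a minimal, self-contained sketch of the helper wired into `argparse` subparsers. The `build_parser` function, the `EMBEDDING_MODES` constant, and the two-command subset are assumptions made for illustration; they are not taken from LEANN's actual CLI module, which may be organized differently.

```python
import argparse

# Choices copied from the suggestion above; the constant name is hypothetical.
EMBEDDING_MODES = ["sentence-transformers", "openai", "mlx", "ollama"]


def add_embedding_args(parser):
    """Add common embedding arguments to a parser."""
    parser.add_argument(
        "--embedding-model",
        type=str,
        default="facebook/contriever",
        help="Embedding model (default: facebook/contriever)",
    )
    parser.add_argument(
        "--embedding-mode",
        type=str,
        default="sentence-transformers",
        choices=EMBEDDING_MODES,
        help="Embedding backend mode (default: sentence-transformers)",
    )


def build_parser():
    """Hypothetical stand-in for the real CLI setup, with two index-* commands."""
    parser = argparse.ArgumentParser(prog="leann")
    subparsers = parser.add_subparsers(dest="command", required=True)
    for name in ("index-email", "index-calendar"):
        sub = subparsers.add_parser(name)
        add_embedding_args(sub)  # every index-* command gets the same flags
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args(["index-email", "--embedding-mode", "ollama"])
    print(args.command, args.embedding_model, args.embedding_mode)
```

Looping over the command names keeps the flags in one place, so a newly added index-* command cannot silently miss them as long as it goes through the same loop.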
|
@ASuresh0524, next time you can comment on the code directly. Try to get familiar with GitHub code review, thanks!! |
|
@tolgakaratas please let us know when you can fix |
|
Thanks @ASuresh0524, if he cannot make it, maybe you can push to this branch and co-author this PR with him |
|
Yeah, will do by tonight/tomorrow if they are unable to.
Ensure all index-* CLI commands accept embedding model/mode arguments to match their builder usage.
Co-authored-by: Cursor <cursoragent@cursor.com>
This PR introduces significant improvements to LEANN's data ingestion capabilities, enabling structural extraction for Office documents, Mindmaps, and various personal data sources.
Key Features
leann index-email: Apple Mail indexing.
leann index-calendar: Apple Calendar indexing.
leann index-wechat / leann index-imessage: Chat history processing.
leann index-slack / leann index-twitter: MCP-based ingestion.
leann index-chatgpt / leann index-claude: AI export indexing.
base_rag_example.py where index existence checks incorrectly triggered on directory existence, preventing rebuilds in temporary folders.
scripts/sync_utilities/ for managing large vaults with DiskANN.
These changes significantly advance LEANN's vision of being a universal personal AI memory that can 'RAG Everything' with minimal overhead.
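The index existence fix described above can be illustrated with a small sketch. It assumes the check should look for a concrete marker file instead of the directory itself; the `index_exists` function and the `index.meta.json` file name are hypothetical and may not match what `base_rag_example.py` actually does.

```python
import tempfile
from pathlib import Path


def index_exists(index_dir, marker_name="index.meta.json"):
    """Return True only if the marker file is present, not merely the directory.

    The marker file name is an assumption made for this example.
    """
    return (Path(index_dir) / marker_name).is_file()


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        # The temporary directory exists, but nothing has been built in it yet,
        # so a directory-only check would wrongly skip the build.
        print(Path(tmp).exists(), index_exists(tmp))

        # Simulate a finished build by writing the marker file.
        (Path(tmp) / "index.meta.json").write_text("{}")
        print(index_exists(tmp))
```

Checking for a file the build writes avoids treating an empty, pre-created temporary folder as a finished index.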