Description
When an LLM call fails during metadata extraction (e.g. Azure content safety false positive, rate limit, transient network error), the entire ingestion pipeline crashes. This happens because `BaseExtractor.aprocess_nodes()` calls `aextract()` with no error handling at all -- a single failed node kills the whole batch.
This is the scenario described in #20054. The reporter hits this about every 15,000 nodes with Azure OpenAI guardrails.
Root cause
- `aprocess_nodes()` calls `await self.aextract(new_nodes)` on line 129 of `interface.py` with no try/except
- `run_jobs()` in `async_utils.py` uses `asyncio.gather()` without `return_exceptions=True`, so one failed job kills the batch (see the sketch after this list)
- None of the standard extractors (Title, Keyword, QA, Summary) have per-node error handling
- Only `DocumentContextExtractor` has any resilience, but it's hardcoded to 5 retries with a 60s backoff and only catches rate limit errors
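To make the `asyncio.gather()` point concrete, here is a minimal, self-contained demonstration (not llama_index code; the function names are illustrative) of how one raised exception discards an entire batch when `return_exceptions=True` is not set:

```python
import asyncio

async def extract(i: int) -> int:
    # Stand-in for a per-node LLM call that succeeds.
    await asyncio.sleep(0.01)
    return i

async def flaky() -> int:
    # Stand-in for a single Azure content-safety / rate-limit failure.
    raise RuntimeError("content filter tripped")

async def main() -> None:
    try:
        # The first exception propagates out of gather() immediately;
        # the results of every other job in the batch are thrown away.
        results = await asyncio.gather(extract(1), flaky(), extract(2))
        print(results)
    except RuntimeError as exc:
        print(f"whole batch lost: {exc}")

asyncio.run(main())
```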
Proposed fix
Add three configurable fields to `BaseExtractor` that all extractors inherit automatically:

- `max_retries` (default 0 -- current behaviour, no retry)
- `retry_backoff` (default 1.0s, exponential backoff)
- `on_extraction_error` (`"raise"` or `"skip"` -- `"raise"` is current behaviour)
The retry logic lives in a single `_aextract_with_retry()` method called from `aprocess_nodes()`, sketched below. This is fully backwards compatible, since all defaults match existing behaviour.
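A minimal sketch of what `_aextract_with_retry()` could look like. The class here is an illustrative stand-in for `BaseExtractor` (plain attributes instead of pydantic fields); only the three field names and the method name come from the proposal, everything else is an assumption:

```python
import asyncio
import logging
from typing import Any, Dict, List, Sequence

logger = logging.getLogger(__name__)

class ExtractorSketch:
    """Illustrative stand-in for BaseExtractor; not the real class."""

    max_retries: int = 0                # 0 = no retry (current behaviour)
    retry_backoff: float = 1.0          # base delay, doubled per attempt
    on_extraction_error: str = "raise"  # or "skip"

    async def aextract(self, nodes: Sequence[Any]) -> List[Dict[str, Any]]:
        raise NotImplementedError  # implemented by concrete extractors

    async def _aextract_with_retry(
        self, nodes: Sequence[Any]
    ) -> List[Dict[str, Any]]:
        attempt = 0
        while True:
            try:
                return await self.aextract(nodes)
            except Exception as exc:
                if attempt >= self.max_retries:
                    if self.on_extraction_error == "skip":
                        # Log and fall back to empty metadata per node
                        # instead of killing the whole pipeline.
                        logger.warning(
                            "Extraction failed after %d retries: %s",
                            self.max_retries, exc,
                        )
                        return [{} for _ in nodes]
                    raise  # "raise" preserves the current crash behaviour
                # Exponential backoff: retry_backoff * 2**attempt seconds
                # (e.g. retry_backoff=2.0 gives 2s, 4s, 8s over 3 retries).
                await asyncio.sleep(self.retry_backoff * (2 ** attempt))
                attempt += 1
```

With the defaults (`max_retries=0`, `on_extraction_error="raise"`), the first failure raises immediately, matching today's behaviour.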
Example usage for someone hitting the Azure guardrail issue:
```python
from llama_index.core.extractors import TitleExtractor

extractor = TitleExtractor(
    llm=llm,
    max_retries=3,
    retry_backoff=2.0,
    on_extraction_error="skip",
)
```

This would retry up to 3 times with exponential backoff (2s, 4s, 8s), and if all retries fail, log a warning and continue with empty metadata instead of crashing.