BaseExtractor crashes entire pipeline on transient LLM errors #20692

@debu-sinha

Description

When an LLM call fails during metadata extraction (e.g. Azure content safety false positive, rate limit, transient network error), the entire ingestion pipeline crashes. This happens because BaseExtractor.aprocess_nodes() calls aextract() with no error handling at all -- a single failed node kills the whole batch.

This is the scenario described in #20054. The reporter hits this about every 15,000 nodes with Azure OpenAI guardrails.

Root cause

  1. aprocess_nodes() calls await self.aextract(new_nodes) on line 129 of interface.py with no try/except
  2. run_jobs() in async_utils.py uses asyncio.gather() without return_exceptions=True, so the first failed job raises out of the gather and the results of the rest of the batch are lost
  3. None of the standard extractors (Title, Keyword, QA, Summary) have per-node error handling
  4. Only DocumentContextExtractor has any resilience, but it's a hardcoded 5-retry with 60s backoff that only catches rate limit errors
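Point 2 can be reproduced outside llama_index. The following is a minimal sketch using only the standard library; extract() is a hypothetical stand-in for a per-node LLM call and is not part of any extractor.

```python
import asyncio


async def extract(node_id: int) -> dict:
    """Hypothetical stand-in for a per-node LLM call; node 2 always fails."""
    await asyncio.sleep(0)
    if node_id == 2:
        raise RuntimeError(f"content filter rejected node {node_id}")
    return {"node": node_id}


async def gather_default() -> str:
    # Default behaviour: the first exception propagates to the caller,
    # so the results of the successful nodes are never returned.
    try:
        await asyncio.gather(*(extract(i) for i in range(4)))
    except RuntimeError as exc:
        return f"batch lost: {exc}"
    return "ok"


async def gather_tolerant() -> list:
    # With return_exceptions=True, failures come back as values in the
    # result list, in the same position as the job that raised them.
    return await asyncio.gather(
        *(extract(i) for i in range(4)), return_exceptions=True
    )


default_outcome = asyncio.run(gather_default())
tolerant_outcome = asyncio.run(gather_tolerant())
```

In the tolerant variant the caller still has to check each entry with isinstance(result, Exception) before using it, which is why a per-extractor policy for what to do on failure is still needed.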

Proposed fix

Add three configurable fields to BaseExtractor that all extractors inherit automatically:

  • max_retries (default 0 -- current behaviour, no retry)
  • retry_backoff (default 1.0s, exponential backoff)
  • on_extraction_error ("raise" or "skip" -- "raise" is current behaviour)

The retry logic lives in a single _aextract_with_retry() method called from aprocess_nodes(). Fully backwards compatible since all defaults match existing behaviour.

Example usage for someone hitting the Azure guardrail issue:

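from llama_index.core.extractors import TitleExtractor

# llm is any already-configured LLM instance (e.g. an Azure OpenAI client).
# max_retries, retry_backoff and on_extraction_error are the proposed fields.
extractor = TitleExtractor(
    llm=llm,
    max_retries=3,
    retry_backoff=2.0,
    on_extraction_error="skip",
)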

This would retry up to 3 times with exponential backoff (2s, 4s, 8s), and if all retries fail, log a warning and continue with empty metadata instead of crashing.
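As a rough illustration of that behaviour, here is a standalone sketch of the retry loop. It is not the actual patch: aextract_with_retry, its num_nodes argument, and the injectable sleep parameter are hypothetical names chosen so the snippet runs without llama_index, and catching bare Exception is an assumption that a real implementation may want to narrow.

```python
import asyncio
import logging
from typing import Awaitable, Callable, Dict, List

logger = logging.getLogger(__name__)


async def aextract_with_retry(
    aextract: Callable[[], Awaitable[List[Dict]]],
    num_nodes: int,
    max_retries: int = 0,
    retry_backoff: float = 1.0,
    on_extraction_error: str = "raise",
    sleep: Callable[[float], Awaitable[None]] = asyncio.sleep,
) -> List[Dict]:
    """Hypothetical helper: retry `aextract`, then either raise or skip."""
    for attempt in range(max_retries + 1):
        try:
            return await aextract()
        except Exception as exc:  # assumption: any error is retryable
            if attempt < max_retries:
                # Delay doubles on each retry: backoff, 2*backoff, 4*backoff, ...
                await sleep(retry_backoff * (2 ** attempt))
                continue
            if on_extraction_error == "skip":
                logger.warning(
                    "Extraction failed after %d attempts: %s", attempt + 1, exc
                )
                # One empty metadata dict per node keeps list lengths aligned.
                return [{} for _ in range(num_nodes)]
            raise
    raise AssertionError("unreachable")
```

With the defaults (max_retries=0, on_extraction_error="raise") the helper makes a single attempt and re-raises, which matches the current behaviour.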
