AI Crawler Log Ingester is a streaming service that tails web access logs, extracts structured requests, and labels traffic generated by AI crawlers in real time. It keeps per-file offsets so restarts resume where they left off, enriches each record with crawler metadata, and forwards the resulting events to the sinks you enable.
- Supports Apache combined, Nginx combined, and custom regex log formats across multiple files.
- Hot-reloads AI crawler detection rules with optional IP verification and priority-based matching.
- Emits events to JSONL stdout, rolling files, or Elastic/OpenSearch via the bulk API.
- Exposes Prometheus metrics and a ready/health endpoint for integrations.
- Persists tail offsets using BoltDB to guarantee at-least-once delivery after restarts.
- Go 1.23+ (the repository uses the Go 1.24.8 toolchain by default).
- A YAML configuration file describing the logs to ingest and the desired sinks.
```shell
make build
# or
go build -o bin/aicrawl-ingester ./cmd/aicrawl-ingester
```

Start from the bundled detection rules in `rules/default.yaml` and create a configuration file like the example below:
```yaml
server_name: edge-1
logs:
  - path: /var/log/nginx/access.log
    format: nginx_combined
    server_label: edge-1
detection:
  rules_file: ./rules/default.yaml
  hot_reload_rules: true
sinks:
  stdout_jsonl:
    enabled: true
  file_jsonl:
    enabled: false
  elastic:
    enabled: false
runtime:
  offsets_path: ./data/offsets.db
  health_bind: 0.0.0.0:8090
metrics:
  namespace: aicrawl
  subsystem: ingester
```

Run the service with:

```shell
./bin/aicrawl-ingester --config config/ingester.yaml
```

Useful flags:
- `--rules` to override the rules file defined in the config.
- `--print-effective-config` to see the resolved configuration and exit.
- `--validate-config` to verify the configuration without starting the service.
- `--log-level` to set Zerolog’s global level (`debug`, `info`, `warn`, `error`).
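Flags of this kind are commonly wired with Go's standard `flag` package. The sketch below is hypothetical — the option struct and `parseFlags` helper are illustrative, not the binary's real implementation:

```go
package main

import (
	"flag"
	"fmt"
	"os"
)

// options mirrors the command-line flags listed above; the field names
// are assumptions about how the binary might hold them internally.
type options struct {
	config         string
	rules          string
	printEffective bool
	validateOnly   bool
	logLevel       string
}

// parseFlags builds its own FlagSet so parsing stays testable without
// touching the process-wide flag.CommandLine.
func parseFlags(args []string) (*options, error) {
	opts := &options{}
	fs := flag.NewFlagSet("aicrawl-ingester", flag.ContinueOnError)
	fs.StringVar(&opts.config, "config", "", "path to the YAML configuration file")
	fs.StringVar(&opts.rules, "rules", "", "override the rules file from the config")
	fs.BoolVar(&opts.printEffective, "print-effective-config", false, "print resolved config and exit")
	fs.BoolVar(&opts.validateOnly, "validate-config", false, "validate config without starting")
	fs.StringVar(&opts.logLevel, "log-level", "info", "log level: debug, info, warn, error")
	if err := fs.Parse(args); err != nil {
		return nil, err
	}
	return opts, nil
}

func main() {
	opts, err := parseFlags(os.Args[1:])
	if err != nil {
		os.Exit(2)
	}
	fmt.Printf("config=%s rules=%s log-level=%s\n", opts.config, opts.rules, opts.logLevel)
}
```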
- Events: Structured JSON events include crawler identification, the original request metadata, and the source log file offset.
- Prometheus metrics: When enabled, the metrics server exposes ingestion rates, sink latencies, and AI crawler hit counters.
- Health checks: The health server flips to ready once tailing, detection, and sinks are online.
Run the available checks before sending changes:
```shell
make test
make vet
```

The repository also ships sample access logs under `testdata/` for local experimentation with the parser and detection engine.
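For experimenting with those samples, a combined-format access-log line can be split with a regular expression. This is a simplified sketch — the capture-group names are illustrative and the real parser may handle more edge cases:

```go
package main

import (
	"fmt"
	"regexp"
)

// combinedRe is a simplified pattern for the Nginx/Apache "combined"
// log format: remote addr, timestamp, request line, status, bytes,
// referer, and user agent.
var combinedRe = regexp.MustCompile(
	`^(?P<remote>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] "(?P<method>\S+) (?P<path>\S+) \S+" (?P<status>\d{3}) (?P<bytes>\d+|-) "(?P<referer>[^"]*)" "(?P<agent>[^"]*)"$`,
)

// parseLine returns the named capture groups of one access-log line,
// or nil when the line does not match the combined format.
func parseLine(line string) map[string]string {
	m := combinedRe.FindStringSubmatch(line)
	if m == nil {
		return nil
	}
	fields := make(map[string]string)
	for i, name := range combinedRe.SubexpNames() {
		if name != "" {
			fields[name] = m[i]
		}
	}
	return fields
}

func main() {
	line := `203.0.113.7 - - [01/May/2024:12:00:00 +0000] "GET /robots.txt HTTP/1.1" 200 512 "-" "GPTBot/1.2"`
	f := parseLine(line)
	fmt.Println(f["remote"], f["status"], f["agent"]) // 203.0.113.7 200 GPTBot/1.2
}
```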