mateohysa/log-ingester

Real-time log ingester that tails web access logs, labels AI crawler traffic, and streams structured events to multiple sinks.

AI Crawler Log Ingester

Overview

AI Crawler Log Ingester is a streaming service that tails web access logs, extracts structured requests, and labels traffic generated by AI crawlers in real time. It keeps per-file offsets so restarts resume where they left off, enriches each record with crawler metadata, and forwards the resulting events to the sinks you enable.

Features

  • Supports Apache combined, Nginx combined, and custom regex log formats across multiple files.
  • Hot-reloads AI crawler detection rules with optional IP verification and priority-based matching.
  • Emits events to JSONL stdout, rolling files, or Elastic/OpenSearch via the bulk API.
  • Exposes Prometheus metrics and a ready/health endpoint for integrations.
  • Persists tail offsets using BoltDB to guarantee at-least-once delivery after restarts.

Quick Start

Prerequisites

  • Go 1.23+ (the repository uses the Go 1.24.8 toolchain by default).
  • A YAML configuration file describing the logs to ingest and the desired sinks.

Build

make build
# or
go build -o bin/aicrawl-ingester ./cmd/aicrawl-ingester

Configure

Start from the bundled detection rules in rules/default.yaml and create a configuration file like the example below:

server_name: edge-1
logs:
  - path: /var/log/nginx/access.log
    format: nginx_combined
    server_label: edge-1
detection:
  rules_file: ./rules/default.yaml
  hot_reload_rules: true
sinks:
  stdout_jsonl:
    enabled: true
  file_jsonl:
    enabled: false
  elastic:
    enabled: false
runtime:
  offsets_path: ./data/offsets.db
  health_bind: 0.0.0.0:8090
metrics:
  namespace: aicrawl
  subsystem: ingester
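The detection rules file the config points at (rules/default.yaml) pairs matching rules with priorities and optional IP verification. The field names below are illustrative only — consult the bundled rules/default.yaml for the actual schema:

```yaml
# Hypothetical rules-file shape; field names are assumptions, not the
# project's real schema.
rules:
  - name: gptbot
    user_agent_regex: "GPTBot"
    priority: 10
    verify_ip: true    # optionally confirm the source IP belongs to the crawler
  - name: claudebot
    user_agent_regex: "ClaudeBot"
    priority: 20
```

With hot_reload_rules enabled, edits to this file take effect without restarting the service.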

Run

./bin/aicrawl-ingester --config config/ingester.yaml

Useful flags:

  • --rules to override the rules file defined in the config.
  • --print-effective-config to see the resolved configuration and exit.
  • --validate-config to verify the configuration without starting the service.
  • --log-level to set Zerolog’s global level (debug, info, warn, error).

Observability & Outputs

  • Events: Structured JSON events include crawler identification, the original request metadata, and the source log file offset.
  • Prometheus metrics: When enabled, the metrics server exposes ingestion rates, sink latencies, and AI crawler hit counters.
  • Health checks: The health server flips to ready once tailing, detection, and sinks are online.
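An emitted event might look like the following. The exact field names are hypothetical — the README only guarantees crawler identification, request metadata, and the source file offset:

```json
{
  "timestamp": "2025-01-15T09:30:00Z",
  "server": "edge-1",
  "crawler": { "name": "gptbot", "verified": true },
  "request": {
    "remote_addr": "203.0.113.7",
    "method": "GET",
    "path": "/robots.txt",
    "status": 200,
    "user_agent": "GPTBot/1.0"
  },
  "source": { "file": "/var/log/nginx/access.log", "offset": 48213 }
}
```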

Development

Run the available checks before sending changes:

make test
make vet

The repository also ships sample access logs under testdata/ for local experimentation with the parser and detection engine.
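When experimenting with those sample logs, it helps to see how a combined-format line decomposes. Below is a sketch of parsing one — the regex and field names are illustrative assumptions, not the project's actual patterns:

```go
// Sketch of parsing an Apache/Nginx combined-format access log line.
package main

import (
	"fmt"
	"regexp"
)

// combinedRe approximates the combined format:
// remote_addr ident user [time] "request" status bytes "referer" "user_agent"
var combinedRe = regexp.MustCompile(
	`^(\S+) (\S+) (\S+) \[([^\]]+)\] "([^"]*)" (\d{3}) (\d+|-) "([^"]*)" "([^"]*)"$`)

// parseCombined returns named fields for a combined-format line,
// or ok=false if the line does not match.
func parseCombined(line string) (map[string]string, bool) {
	m := combinedRe.FindStringSubmatch(line)
	if m == nil {
		return nil, false
	}
	return map[string]string{
		"remote_addr": m[1],
		"time":        m[4],
		"request":     m[5],
		"status":      m[6],
		"bytes":       m[7],
		"referer":     m[8],
		"user_agent":  m[9],
	}, true
}

func main() {
	line := `203.0.113.7 - - [15/Jan/2025:09:30:00 +0000] "GET /robots.txt HTTP/1.1" 200 134 "-" "GPTBot/1.0"`
	if fields, ok := parseCombined(line); ok {
		fmt.Println(fields["remote_addr"], fields["status"], fields["user_agent"])
	}
}
```

The user_agent capture is where AI-crawler detection rules would be applied; the status and path fields feed the per-crawler metrics.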
