Skip to content

fw7th/redact

Repository files navigation

redact-logo

🔒 A fast, serverless, and scalable API service for automated OCR and PII redaction on document batches.


Release Build Status Coverage Status

Redact exposes a FastAPI HTTPs interface, persists state in Supabase (Postgres + Storage), and offloads compute-heavy redaction workloads to Modal

Features

  • RESTful API: Clear endpoints for submitting and retrieving redaction tasks.

  • Asynchronous Processing: Long-running redaction jobs are executed asynchronously via serverless compute.

  • Persistent Storage: Task metadata stored in Supabase Postgres via SQLAlchemy; documents stored in Supabase Storage.

  • Benchmarked: Includes load testing scripts using Locust and benchmarks for performance analysis.

  • API: FastAPI

  • Database: Supabase Postgres

  • Object Storage: Supabase Storage

  • Compute: Modal (serverless execution)

Random image Redacted image

Random image I found on the internet vs. Redacted.

Performance

API Benchmarking Table

Endpoint Operation Payload Size Concurrent Users Requests/sec Avg Latency (ms) P95 Latency (ms) Error Rate Notes
POST /predict Create 100KB image 2 0.67 34.95 56 0% Includes file validation, disk write, Inference offshoring
GET /download/{id} GET N/A 10 3.58 10.38 32 0% Retrieve processed document from server
GET /check/{id} Read N/A 10 3.52 9.94 28 0% Fetches job status from Supabase Postgres
DELETE /drop/{id} Delete N/A 10 3.47 10.43 30 Delete a batch and all related files from DB

Legend:

  • Payload Size: Size of file or JSON sent in the request.
  • Concurrent Users: Simulated users (e.g., in Locust).
  • Requests/sec: Throughput under load.
  • Latency: Time from request to response (P95 = 95th percentile).

Usage

To interact with the hosted service as a client, see the client/ directory for example request scripts and helpers.

⚠️ Local execution note
Running the API locally is intended for development and testing only.
Production workloads are designed to run on Modal’s serverless infrastructure.

Development Setup

Prerequisites

To run this project locally, you will need:

  1. A Supabase project

    • Supabase Postgres (used for task metadata)
    • Supabase Storage (used for document and redaction artifacts)
  2. A Modal account

    • Modal must be configured locally (one-time setup)
  3. Python 3.10+

Modal Authentication

Modal must be authenticated on your local machine.

If you have not configured Modal before, run:

modal token set

This will store your credentials locally (e.g. in ~/.modal.toml). Alternatively, you may provide credentials manually via environment variables:

MODAL_TOKEN_ID=...
MODAL_TOKEN_SECRET=...

Ensure the Modal app name configured in the project is unique within your Modal account.

Supabase credentials are always required. Modal credentials are optional if Modal is already configured locally. You only need to set your modal app name in the .env file.

Then from ~/redact/ proceed to run:

modal run redact/workers/modalapp.py 

For a persistent app on modal run:

modal deploy redact/workers/modalapp.py 

Quickstart

Clone the repo

git clone https://github.com/fw7th/redact.git
cd redact

Create and activate virtualenv (optional)

python -m venv venv
source venv/bin/activate

Install dependencies

pip install -r requirements.txt

Copy example env and configure

Please go over the Development Setup Prerequisites section.

cp .env.example .env

Start the app

uvicorn app.main:app --reload

Running tests

pytest tests/

Optionally benchmark

./benchmark/benchmark

This project does not use Docker for deployment; serverless execution is handled entirely by Modal.

API Documentation

When the server is running locally, visit:

These provide interactive documentation of all available endpoints with live testing.

License

This project is licensed under the MIT License — see the LICENSE file for details.

Citations

@misc{stepanov2024glinermultitaskgeneralistlightweight,
      title={GLiNER multi-task: Generalist Lightweight Model for Various Information Extraction Tasks}, 
      author={Ihor Stepanov and Mykhailo Shtopko},
      year={2024},
      eprint={2406.12925},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2406.12925}, 
}
@Manual{,
  title = {tesseract: Open Source OCR Engine},
  author = {Jeroen Ooms},
  year = {2026},
  note = {R package version 5.2.5},
  url = {https://docs.ropensci.org/tesseract/},
}