
Migrate KHI File Format to Protocol Buffers #425

@kyasbal


Problem

1. Scalability Limit (JSON Size)

The current KHI file format consists of a JSON header followed by a gzip-compressed text body. Browsers and JavaScript engines impose a hard limit on the size of a single string they can hold, and therefore on the JSON they can parse (typically around 500MB).
When an inspection report's metadata (the JSON header) exceeds this limit, the browser fails to load the file, and the application either crashes or shows an empty state. This puts a hard ceiling on the size and complexity of clusters KHI can inspect.

2. Type Synchronization Overhead

Currently, data models are defined separately in:

  • Backend (Go): validation and serialization logic.
  • Frontend (TypeScript): display and interaction logic.

This duplication has to be kept in sync by hand. Adding a field or changing a type requires edits in both places, which invites bugs and increases development effort.

Proposed Solution

Migrate the KHI file format and internal data models to Protocol Buffers (protobuf).

1. Unified Schema Definition

Define the KHI data model (inspection data, timelines, logs, etc.) in .proto files.

  • These files will serve as the single source of truth for the data structure.

2. Code Generation

Use protoc to generate:

  • Go structs for the backend (parsing, analysis, and API response).
  • TypeScript interfaces/classes for the frontend (view logic).

Because both sides are generated from the same .proto files, the frontend and backend share the same type definitions.

3. Binary File Format: Concatenated Containers

To stay within practical Protobuf message size limits (~50MB) and browser string limits, we will use a Concatenated Container format: the file is a sequence of binary blocks, each small enough to decode on its own.

Structure:

  1. Magic Bytes (3 bytes): KHI
  2. Metadata Size (4 bytes): UInt32 size of the following metadata protobuf.
  3. Container Metadata (N bytes): A Protobuf message describing the file structure.
    • Contains a list of Container descriptors.
    • Each descriptor specifies:
      • Size: Byte size of the container.
      • Type: e.g., "TIMELINE_DATA", "LOG_DATA".
      • Compression: e.g., "GZIP", "NONE".
      • Content: What logical data this container holds (e.g., "Timelines for cluster X").
  4. Container 1 (M bytes): Binary data (e.g., a serialized Protobuf message or raw GZIP stream).
  5. Container 2...
  6. Container N...
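
To make the layout above concrete, here is a minimal sketch of how the frontend might read the fixed-size prefix and locate the metadata block. The names `readKhiHeader` and `KhiHeader` are hypothetical, and the byte order of the UInt32 size field is an assumption (little-endian); the proposal does not specify it yet. Decoding the metadata bytes themselves would be done by the generated protobuf code, which is not shown.

```typescript
// Sketch only: names are hypothetical and not part of KHI.
// Assumption: the 4-byte metadata size is little-endian (not specified in the proposal).
const MAGIC = [0x4b, 0x48, 0x49]; // ASCII "KHI"
const MAGIC_LENGTH = 3;
const SIZE_FIELD_LENGTH = 4;

interface KhiHeader {
  // Still-encoded Container Metadata protobuf message.
  metadataBytes: Uint8Array;
  // Byte offset at which Container 1 begins.
  firstContainerOffset: number;
}

function readKhiHeader(file: Uint8Array): KhiHeader {
  if (file.length < MAGIC_LENGTH + SIZE_FIELD_LENGTH) {
    throw new Error("file too short to contain a KHI header");
  }
  for (let i = 0; i < MAGIC_LENGTH; i++) {
    if (file[i] !== MAGIC[i]) {
      throw new Error("missing KHI magic bytes");
    }
  }
  // Respect byteOffset so this also works on a subarray of a larger buffer.
  const view = new DataView(file.buffer, file.byteOffset, file.byteLength);
  const metadataSize = view.getUint32(MAGIC_LENGTH, true);
  const metadataStart = MAGIC_LENGTH + SIZE_FIELD_LENGTH;
  const metadataEnd = metadataStart + metadataSize;
  if (metadataEnd > file.length) {
    throw new Error("metadata size exceeds file length");
  }
  return {
    metadataBytes: file.subarray(metadataStart, metadataEnd),
    firstContainerOffset: metadataEnd,
  };
}
```

Because `subarray` returns a view rather than a copy, reading the header this way does not duplicate the file contents in memory.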

Key Features:

  • Lazy Loading: The frontend only decodes the "Container Metadata" first. It then seeks to and decompresses specific containers only when needed (e.g., decoding only the "Timelines" container for the initial view, and "Logs" on demand).
  • Segmentation: Large datasets are split into multiple containers, avoiding the 50MB protobuf limit and 500MB string limit.
  • Efficient Decompression: GZIP can be applied per-container, allowing the browser to decompress only relevant sections.
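
As a rough illustration of lazy loading, the sketch below turns the descriptor list into byte ranges and slices out only the containers of a requested type. `ContainerDescriptor` mirrors the fields listed in the structure above, but its exact shape, along with `locateContainers` and `sliceContainers`, is hypothetical; the real types would come from the generated protobuf code. It also assumes containers are stored back to back in descriptor order, immediately after the metadata block.

```typescript
// Sketch only: these types and functions are hypothetical stand-ins for generated code.
type Compression = "GZIP" | "NONE";

interface ContainerDescriptor {
  size: number; // byte size of the container
  type: string; // e.g. "TIMELINE_DATA", "LOG_DATA"
  compression: Compression;
  content: string; // human-readable description of the payload
}

interface ContainerRange {
  descriptor: ContainerDescriptor;
  start: number; // inclusive byte offset in the file
  end: number; // exclusive byte offset in the file
}

// Assumption: containers are concatenated in descriptor order with no padding.
function locateContainers(
  descriptors: ContainerDescriptor[],
  firstContainerOffset: number,
): ContainerRange[] {
  let offset = firstContainerOffset;
  return descriptors.map((descriptor) => {
    const range = { descriptor, start: offset, end: offset + descriptor.size };
    offset = range.end;
    return range;
  });
}

// Returns views over the raw (possibly still compressed) bytes of matching containers.
function sliceContainers(
  file: Uint8Array,
  ranges: ContainerRange[],
  type: string,
): Uint8Array[] {
  return ranges
    .filter((range) => range.descriptor.type === type)
    .map((range) => file.subarray(range.start, range.end));
}
```

Containers marked "GZIP" would still need to be decompressed before decoding, for example with the browser's `DecompressionStream("gzip")`. Only the containers selected for the current view pay that cost.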

Migration Strategy

  1. Define .proto schema: Map the existing Go structs for the file header and internal models to Protobuf messages.
  2. Set Up Code Generation: Configure protoc generation in the Makefile for both Go and TypeScript targets.
  3. Backend Refactor: Update the backend serializer to write the new binary format.
  4. Frontend Refactor: Update the frontend data loader to parse the binary format instead of JSON.
    • Note: Backward compatibility for legacy JSON-based files should be maintained if possible, or a conversion tool provided.
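
If backward compatibility is kept, the loader needs a way to tell the two formats apart. One possible approach, sketched below, is to check for the KHI magic bytes and treat anything else as a legacy file. `detectKhiFormat` is a hypothetical helper, and the sketch assumes legacy JSON-based files never begin with the bytes "KHI", which would need to be verified against the existing format.

```typescript
// Sketch only: hypothetical helper, not part of the current KHI codebase.
// Assumption: legacy JSON-based files never start with the ASCII bytes "KHI".
type KhiFileFormat = "container" | "legacy";

function detectKhiFormat(file: Uint8Array): KhiFileFormat {
  const magic = [0x4b, 0x48, 0x49]; // ASCII "KHI"
  const hasMagic =
    file.length >= magic.length &&
    magic.every((byte, i) => file[i] === byte);
  return hasMagic ? "container" : "legacy";
}
```

The frontend data loader could branch on this result and route legacy files to the existing JSON parser.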

Definition of Done

  • .proto files created for KHI data models.
  • Automated code generation setup for Go and TypeScript.
  • Backend produces valid protobuf-based files.
  • Frontend successfully loads and renders protobuf-based files.
  • 500MB+ metadata files can be loaded in the browser without crashing.
