
[Feature] Advanced Token Usage and Management System #289

@Avtrkrb

Description

Implement advanced token usage and management capabilities to improve Nanocoder's efficiency and resource use. Nanocoder currently has a solid foundation of provider-specific tokenization and basic context monitoring, but it lacks intelligent context management, detailed usage tracking, and multi-model optimization. To match industry-leading agentic CLI coding tools, the system needs tokenization support for more providers, intelligent context compression, comprehensive usage analytics, and context-aware caching.

The feature will implement:

  • Enhanced tokenization with support for more providers (Gemini, Mistral, Qwen)
  • Intelligent context compression with automatic summarization
  • Sliding window context management with pinning capabilities
  • Comprehensive usage tracking with session-based analytics
  • Context-aware token caching with intelligent invalidation
  • Multi-model tokenizer pooling for efficient resource management

Use Case

Current Problem:

  • Limited tokenization support for providers beyond OpenAI, Anthropic, and Llama
  • No automatic context pruning or compression when approaching limits
  • Basic token estimation without intelligent context management
  • Limited usage tracking and analytics capabilities
  • No multi-model tokenizer optimization

Target Scenarios:

  1. Enhanced Tokenization: Support for Gemini, Mistral, Qwen and other providers
  2. Intelligent Context Management: Automatic compression when approaching limits
  3. Sliding Window Context: Fixed-size context window with important message pinning
  4. Usage Analytics: Comprehensive session-based tracking and reporting
  5. Multi-model Optimization: Efficient tokenizer resource management

Proposed Solution

Phase 1: Enhanced Tokenization System (2-3 weeks)

  • Implement multi-provider tokenizer support (Gemini, Mistral, Qwen)
  • Create TokenizerPool for efficient multi-model resource management
  • Enhance fallback tokenizer with better estimation algorithms
  • Update tokenizer factory with new provider detection
  • Create EnhancedFallbackTokenizer with content-aware estimation

Phase 2: Intelligent Context Management (3-4 weeks)

  • Implement ContextCompressor with LLM-powered summarization
  • Add SlidingWindowContextManager for fixed-size context windows
  • Create context compression engine with automatic summarization
  • Add message pinning capabilities to preserve important context
  • Integrate with existing context monitoring system

Phase 3: Advanced Usage Tracking and Optimization (4-5 weeks)

  • Implement TokenUsageTracker for session-based analytics
  • Create ContextAwareTokenCache with intelligent invalidation
  • Add comprehensive usage reporting and analytics
  • Implement context-aware cache optimization
  • Add usage export and visualization capabilities

Technical Implementation

Core Components

// Enhanced tokenizer factory with more providers
export function createEnhancedTokenizer(
  providerName: string,
  modelId: string
): Tokenizer {
  const provider = detectEnhancedProvider(providerName, modelId);

  switch (provider) {
    case 'openai':
      return new OpenAITokenizer(modelId);
    case 'anthropic':
      return new AnthropicTokenizer(modelId);
    case 'llama':
      return new LlamaTokenizer(modelId);
    case 'gemini':
      return new GeminiTokenizer(modelId);
    case 'mistral':
      return new MistralTokenizer(modelId);
    case 'qwen':
      return new QwenTokenizer(modelId);
    case 'fallback':
    default:
      return new EnhancedFallbackTokenizer();
  }
}
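
The factory above relies on detectEnhancedProvider, which is not shown. A minimal sketch of what that helper could look like, assuming detection prefers the explicit provider name and falls back to model-ID patterns (the patterns below are illustrative assumptions, not the final detection rules):

// Hypothetical provider detection helper (assumed shape, not existing code).
type EnhancedProvider =
  | 'openai' | 'anthropic' | 'llama'
  | 'gemini' | 'mistral' | 'qwen' | 'fallback';

export function detectEnhancedProvider(
  providerName: string,
  modelId: string
): EnhancedProvider {
  const name = providerName.toLowerCase();
  const model = modelId.toLowerCase();

  // Prefer an explicit provider name when it matches a known provider.
  const known: EnhancedProvider[] = [
    'openai', 'anthropic', 'llama', 'gemini', 'mistral', 'qwen'
  ];
  const byName = known.find(provider => name.includes(provider));
  if (byName) return byName;

  // Otherwise fall back to common model-ID prefixes (illustrative only).
  if (/^(gpt-|o\d)/.test(model)) return 'openai';
  if (model.startsWith('claude')) return 'anthropic';
  if (model.includes('llama')) return 'llama';
  if (model.startsWith('gemini')) return 'gemini';
  if (model.includes('mistral') || model.includes('mixtral')) return 'mistral';
  if (model.startsWith('qwen')) return 'qwen';

  return 'fallback';
}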

// Enhanced fallback tokenizer with better estimation
class EnhancedFallbackTokenizer implements Tokenizer {
  countTokens(text: string): number {
    const charCount = text.length;

    // Adjust the character-based estimate using content characteristics
    const adjustmentFactor = this.calculateAdjustmentFactor(text);
    return Math.round((charCount / CHARS_PER_TOKEN_ESTIMATE) * adjustmentFactor);
  }

  private calculateAdjustmentFactor(text: string): number {
    // Analyze text characteristics for better estimation
    const codeRatio = this.estimateCodeRatio(text);
    const punctuationRatio = this.estimatePunctuationRatio(text);

    // Code-heavy text typically has higher token density
    if (codeRatio > 0.7) return 1.2;
    if (codeRatio > 0.4) return 1.1;

    // High punctuation might indicate more tokens
    if (punctuationRatio > 0.3) return 1.15;

    return 1.0;
  }

  // Rough share of lines that look like code (indentation, braces, keywords)
  private estimateCodeRatio(text: string): number {
    const lines = text.split('\n');
    const codeLines = lines.filter(line =>
      /^\s{2,}|[{};()]|=>|\bfunction\b|\bconst\b/.test(line)
    ).length;
    return lines.length === 0 ? 0 : codeLines / lines.length;
  }

  // Share of characters that are punctuation or symbols
  private estimatePunctuationRatio(text: string): number {
    if (text.length === 0) return 0;
    const punctuation = text.match(/[^\w\s]/g)?.length || 0;
    return punctuation / text.length;
  }
}
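
For example, a provider without a native tokenizer would route through this fallback path (hypothetical usage; the provider and model names are made up):

// Hypothetical usage: an unrecognized provider falls back to estimation.
const tokenizer = createEnhancedTokenizer('my-local-provider', 'custom-7b');
const estimate = tokenizer.countTokens(
  'function add(a: number, b: number) { return a + b; }'
);
// Code-heavy input triggers the higher adjustment factor, so the estimate skews upward.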

// Tokenizer pool for efficient multi-model support
export class TokenizerPool {
  private pool: Map<string, Tokenizer> = new Map();
  private usageCount: Map<string, number> = new Map();

  getTokenizer(provider: string, model: string): Tokenizer {
    const key = `${provider}:${model}`;

    if (this.pool.has(key)) {
      this.usageCount.set(key, (this.usageCount.get(key) || 0) + 1);
      return this.pool.get(key)!;
    }

    const tokenizer = createEnhancedTokenizer(provider, model);
    this.pool.set(key, tokenizer);
    this.usageCount.set(key, 1);

    return tokenizer;
  }

  releaseTokenizer(provider: string, model: string): void {
    const key = `${provider}:${model}`;
    const count = this.usageCount.get(key) || 0;

    // Decrement but keep the tokenizer warm for reuse;
    // cleanupUnused() frees instances once their count reaches zero.
    this.usageCount.set(key, Math.max(0, count - 1));
  }

  cleanupUnused(): void {
    for (const [key, count] of this.usageCount) {
      if (count === 0) {
        const tokenizer = this.pool.get(key);
        if (tokenizer?.free) {
          tokenizer.free();
        }
        this.pool.delete(key);
        this.usageCount.delete(key);
      }
    }
  }
}
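
A brief usage sketch for the pool (the call sites and the qwen model ID are hypothetical):

// Hypothetical usage: share one pool across model switches to reuse tokenizers.
const pool = new TokenizerPool();

const qwenTokenizer = pool.getTokenizer('qwen', 'qwen2.5-coder'); // created
const sameInstance = pool.getTokenizer('qwen', 'qwen2.5-coder');  // reused from the pool

pool.releaseTokenizer('qwen', 'qwen2.5-coder');
pool.releaseTokenizer('qwen', 'qwen2.5-coder');
pool.cleanupUnused(); // frees tokenizers whose usage count has dropped to zero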

// Context compressor with intelligent summarization
export class ContextCompressor {
  private summarizationModel: string;
  private compressionThreshold: number;

  constructor(options: {summarizationModel?: string; threshold?: number} = {}) {
    this.summarizationModel = options.summarizationModel || 'gpt-3.5-turbo';
    this.compressionThreshold = options.threshold || 0.8;
  }

  async compressContext(
    messages: Message[],
    currentTokenCount: number,
    contextLimit: number,
    tokenizer: Tokenizer
  ): Promise<Message[]> {
    const usageRatio = currentTokenCount / contextLimit;

    if (usageRatio < this.compressionThreshold) {
      return messages; // No compression needed
    }

    const compressibleMessages = this.findCompressibleMessages(messages);

    if (compressibleMessages.length === 0) {
      return messages;
    }

    const summary = await this.summarizeMessages(compressibleMessages);
    return this.replaceWithSummary(messages, compressibleMessages, summary);
  }

  private findCompressibleMessages(messages: Message[]): Message[] {
    const compressible: Message[] = [];

    for (let i = 0; i < messages.length; i++) {
      const message = messages[i];

      // Skip system messages and very recent messages
      if (message.role === 'system' || i >= messages.length - 3) {
        continue;
      }

      // Only compress user and assistant messages
      if (message.role === 'user' || message.role === 'assistant') {
        compressible.push(message);
      }
    }

    return compressible;
  }

  private async summarizeMessages(messages: Message[]): Promise<Message> {
    const summaryPrompt = this.createSummaryPrompt(messages);

    const summary = await callSummarizationModel(
      summaryPrompt,
      this.summarizationModel
    );

    return {
      role: 'system',
      content: `[Context Summary] ${summary}`,
      contextSummary: true
    };
  }
}
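
The createSummaryPrompt and replaceWithSummary helpers referenced above are not shown. One plausible shape, written here as free functions for brevity (in the class they would be private methods; the prompt wording is an assumption):

// Hypothetical helper shapes for the compressor above (assumed, not final API).
function createSummaryPrompt(messages: Message[]): string {
  const transcript = messages
    .map(msg => `${msg.role}: ${msg.content}`)
    .join('\n');
  return (
    'Summarize the following conversation excerpt, preserving file names, ' +
    'decisions, and open tasks:\n\n' + transcript
  );
}

function replaceWithSummary(
  messages: Message[],
  compressed: Message[],
  summary: Message
): Message[] {
  const compressedSet = new Set(compressed);
  // The first compressed message marks where the summary should be spliced in.
  const firstIndex = messages.findIndex(msg => compressedSet.has(msg));
  const remaining = messages.filter(msg => !compressedSet.has(msg));
  remaining.splice(firstIndex, 0, summary);
  return remaining;
}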

// Sliding window context manager
export class SlidingWindowContextManager {
  private window: Message[] = [];
  private maxTokens: number;
  private tokenizer: Tokenizer;

  constructor(maxTokens: number, tokenizer: Tokenizer) {
    this.maxTokens = maxTokens;
    this.tokenizer = tokenizer;
  }

  addMessage(message: Message): void {
    const messageTokens = this.tokenizer.countTokens(message.content || '');

    // Evict the oldest unpinned messages until the new message fits,
    // so pinned messages survive window pressure.
    while (
      this.getTotalTokens() + messageTokens > this.maxTokens &&
      this.window.some(msg => !msg.pinned)
    ) {
      const evictIndex = this.window.findIndex(msg => !msg.pinned);
      this.window.splice(evictIndex, 1);
    }

    this.window.push(message);
  }

  getMessages(): Message[] {
    return [...this.window];
  }

  getTotalTokens(): number {
    return this.window.reduce(
      (sum, msg) => sum + this.tokenizer.countTokens(msg.content || ''),
      0
    );
  }

  // Pin important messages that shouldn't be removed
  pinMessage(index: number): void {
    if (index >= 0 && index < this.window.length) {
      const message = this.window[index];
      message.pinned = true;
    }
  }
}
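
A short usage sketch showing pinning (hypothetical messages; the tokenizer would come from the pool above):

// Hypothetical usage: keep the system prompt pinned while the window slides.
const windowManager = new SlidingWindowContextManager(8192, tokenizer);
windowManager.addMessage({role: 'system', content: 'You are Nanocoder...'});
windowManager.pinMessage(0); // never evicted, even when the window is full

windowManager.addMessage({role: 'user', content: 'Refactor source/app.tsx'});
// ...as the conversation grows, older unpinned messages are evicted first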

// Token usage tracker for session-based analytics
export class TokenUsageTracker {
  private sessionHistory: UsageSession[] = [];
  private currentSession: UsageSession;
  private maxSessions: number;

  constructor(maxSessions: number = 100) {
    this.maxSessions = maxSessions;
    this.currentSession = this.createNewSession();
  }

  private createNewSession(): UsageSession {
    return {
      id: generateSessionId(),
      startTime: Date.now(),
      endTime: null,
      tokenBreakdown: {
        system: 0,
        userMessages: 0,
        assistantMessages: 0,
        toolDefinitions: 0,
        toolResults: 0,
        total: 0
      },
      messageCount: 0,
      toolUsage: new Map<string, number>(),
      modelInfo: null
    };
  }

  startNewSession(modelInfo?: ModelInfo): void {
    if (this.currentSession) {
      this.currentSession.endTime = Date.now();
      this.sessionHistory.unshift(this.currentSession);

      if (this.sessionHistory.length > this.maxSessions) {
        this.sessionHistory.pop();
      }
    }

    this.currentSession = this.createNewSession();
    if (modelInfo) {
      this.currentSession.modelInfo = modelInfo;
    }
  }

  trackMessageTokens(message: Message, tokens: number): void {
    this.currentSession.messageCount++;

    switch (message.role) {
      case 'system':
        this.currentSession.tokenBreakdown.system += tokens;
        break;
      case 'user':
        this.currentSession.tokenBreakdown.userMessages += tokens;
        break;
      case 'assistant':
        this.currentSession.tokenBreakdown.assistantMessages += tokens;
        break;
      case 'tool':
        this.currentSession.tokenBreakdown.toolResults += tokens;
        break;
    }

    this.currentSession.tokenBreakdown.total += tokens;
  }

  trackToolUsage(toolName: string, tokenCost: number): void {
    const currentCount = this.currentSession.toolUsage.get(toolName) || 0;
    this.currentSession.toolUsage.set(toolName, currentCount + 1);
    this.currentSession.tokenBreakdown.toolDefinitions += tokenCost;
    this.currentSession.tokenBreakdown.total += tokenCost;
  }

  getCurrentUsage(): TokenBreakdown {
    return {...this.currentSession.tokenBreakdown};
  }

  getSessionHistory(): UsageSession[] {
    return [...this.sessionHistory];
  }

  generateReport(): UsageReport {
    const totalTokens = this.sessionHistory.reduce(
      (sum, session) => sum + session.tokenBreakdown.total,
      0
    );

    const avgPerSession = this.sessionHistory.length > 0
      ? totalTokens / this.sessionHistory.length
      : 0;

    return {
      totalSessions: this.sessionHistory.length,
      totalTokens,
      averagePerSession: avgPerSession,
      breakdownByCategory: this.aggregateBreakdown(),
      topTools: this.getTopTools()
    };
  }
}
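
aggregateBreakdown and getTopTools are referenced in generateReport but not shown. One possible shape, written as free functions over the session history (assumptions, not a final API):

// Hypothetical helpers for generateReport (assumed shapes, not final API).
function aggregateBreakdown(sessions: UsageSession[]): TokenBreakdown {
  return sessions.reduce(
    (acc, session) => ({
      system: acc.system + session.tokenBreakdown.system,
      userMessages: acc.userMessages + session.tokenBreakdown.userMessages,
      assistantMessages: acc.assistantMessages + session.tokenBreakdown.assistantMessages,
      toolDefinitions: acc.toolDefinitions + session.tokenBreakdown.toolDefinitions,
      toolResults: acc.toolResults + session.tokenBreakdown.toolResults,
      total: acc.total + session.tokenBreakdown.total
    }),
    {system: 0, userMessages: 0, assistantMessages: 0, toolDefinitions: 0, toolResults: 0, total: 0}
  );
}

function getTopTools(sessions: UsageSession[], limit = 5): Array<[string, number]> {
  const counts = new Map<string, number>();
  for (const session of sessions) {
    for (const [tool, count] of session.toolUsage) {
      counts.set(tool, (counts.get(tool) || 0) + count);
    }
  }
  // Sort tools by call count, descending, and keep the top entries.
  return [...counts.entries()].sort((a, b) => b[1] - a[1]).slice(0, limit);
}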

// Context-aware token cache with intelligent invalidation
export class ContextAwareTokenCache {
  private cache: Map<string, number>;
  private contextHash: string = '';
  private maxSize: number;

  constructor(maxSize: number = 1000) {
    this.maxSize = maxSize;
    this.cache = new Map();
  }

  getCachedTokens(
    message: Message,
    tokenizer: Tokenizer,
    context: ConversationContext
  ): number {
    const currentContextHash = this.calculateContextHash(context);
    const cacheKey = this.getCacheKey(message, tokenizer);

    if (currentContextHash !== this.contextHash) {
      this.invalidateStaleEntries(currentContextHash);
      this.contextHash = currentContextHash;
    }

    if (this.cache.has(cacheKey)) {
      return this.cache.get(cacheKey)!;
    }

    const tokens = tokenizer.countTokens(message.content || '');

    // Evict the oldest entry (Map preserves insertion order) when the cache is full.
    if (this.cache.size >= this.maxSize) {
      const oldestKey = this.cache.keys().next().value;
      if (oldestKey !== undefined) {
        this.cache.delete(oldestKey);
      }
    }

    this.cache.set(cacheKey, tokens);
    return tokens;
  }

  private calculateContextHash(context: ConversationContext): string {
    const factors = [
      context.messagesBeforeToolExecution.length,
      context.systemMessage.content?.length || 0,
      context.assistantMsg.content?.length || 0
    ];

    return factors.join('|');
  }

  private getCacheKey(message: Message, tokenizer: Tokenizer): string {
    const tokenizerType = this.getTokenizerType(tokenizer);
    return `${tokenizerType}:${message.content}:${message.role}`;
  }

  private invalidateStaleEntries(newContextHash: string): void {
    const hashDiff = this.calculateHashDifference(this.contextHash, newContextHash);

    if (hashDiff > CONTEXT_CHANGE_THRESHOLD) {
      this.cache.clear();
    } else {
      this.pruneOldEntries(0.5); // Keep 50% of cache
    }
  }
}
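
calculateHashDifference and pruneOldEntries are referenced but not shown. A minimal sketch, written as free functions for brevity (CONTEXT_CHANGE_THRESHOLD would be a new constant in source/constants.ts; the exact heuristics are assumptions):

// Hypothetical invalidation helpers (assumed shapes, not final API).
function calculateHashDifference(oldHash: string, newHash: string): number {
  const oldParts = oldHash.split('|');
  const newParts = newHash.split('|');
  const length = Math.max(oldParts.length, newParts.length);

  let changed = 0;
  for (let i = 0; i < length; i++) {
    if (oldParts[i] !== newParts[i]) changed++;
  }
  // Fraction of context factors that changed since the last lookup.
  return length === 0 ? 0 : changed / length;
}

function pruneOldEntries(cache: Map<string, number>, keepRatio: number): void {
  // Map preserves insertion order, so the earliest keys are the oldest entries.
  const keys = [...cache.keys()];
  const dropCount = Math.floor(keys.length * (1 - keepRatio));
  for (let i = 0; i < dropCount; i++) {
    cache.delete(keys[i]);
  }
}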

Integration Points

  • Tokenizer Factory: Enhance source/tokenization/tokenizer-factory.ts with new providers
  • App State: Integrate with source/hooks/useAppState.tsx for token caching
  • Context Checker: Enhance source/hooks/chat-handler/utils/context-checker.tsx with compression
  • Usage Calculator: Update source/usage/calculator.ts with enhanced tracking
  • Constants: Update source/constants.ts with new thresholds and configurations
  • Chat Handler: Integrate with source/hooks/chat-handler/conversation/conversation-loop.tsx

Files to Modify/Create

  • source/tokenization/enhanced-tokenizer-factory.ts (new) - Enhanced provider support
  • source/tokenization/tokenizer-pool.ts (new) - Multi-model resource management
  • source/tokenization/enhanced-fallback-tokenizer.ts (new) - Better estimation
  • source/context/context-compressor.ts (new) - Intelligent compression
  • source/context/sliding-window-context-manager.ts (new) - Fixed-size windows
  • source/usage/token-usage-tracker.ts (new) - Session analytics
  • source/usage/context-aware-token-cache.ts (new) - Intelligent caching
  • source/tokenization/tokenizer-factory.ts (modify) - Enhanced provider detection
  • source/hooks/useAppState.tsx (enhance) - Token caching integration
  • source/hooks/chat-handler/utils/context-checker.tsx (enhance) - Compression integration
  • source/usage/calculator.ts (enhance) - Enhanced tracking
  • source/constants.ts (enhance) - New thresholds and configs
  • source/components/usage/usage-display.tsx (enhance) - Analytics visualization

Alternatives Considered

  1. Simple Tokenization Extension: Considered but rejected for limited context management
  2. Basic Context Pruning: Rejected for lack of intelligent summarization
  3. Static Token Limits: Rejected for inability to adapt to conversation complexity
  4. Monolithic Token System: Rejected for poor maintainability and scalability

Additional Context

  • I have searched existing issues to ensure this is not a duplicate
  • This feature aligns with the project's goals (local-first AI assistance)
  • The implementation considers local LLM performance constraints
  • Memory efficiency is prioritized for local usage

Performance Considerations

  • Efficient tokenizer pooling for multi-model scenarios
  • Memory-optimized context management algorithms
  • Incremental context compression to minimize memory usage
  • Optimized token calculation performance

Local LLM Adaptations

  • Memory-efficient tokenizer instances
  • Lightweight context compression algorithms
  • Resource-aware token estimation
  • Progressive enhancement for local model capabilities

Token Management Benefits

  • Enhanced tokenization support for multiple providers
  • Intelligent context compression with automatic summarization
  • Sliding window management with message pinning
  • Comprehensive usage tracking with session analytics
  • Context-aware caching with intelligent invalidation

Implementation Notes (optional)

Key Integration Points

  • Integrate with existing tokenizer factory system
  • Connect to context monitoring and checking
  • Enhance usage calculation and display
  • Add to chat handler for compression integration
  • Connect with UI components for analytics visualization

Testing Strategy

  • Unit tests for tokenization algorithms (see the sketch after this list)
  • Integration tests for context compression
  • Performance tests for token caching
  • Memory usage monitoring for tokenizer pools
  • Context management efficiency testing
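
As an example of the first bullet, a minimal unit test for the fallback estimation heuristic (hypothetical; it assumes a Vitest/Jest-style API, that the class is exported, and the file layout from the list above):

// Hypothetical unit test sketch for the fallback estimator.
import {describe, it, expect} from 'vitest';
import {EnhancedFallbackTokenizer} from '../source/tokenization/enhanced-fallback-tokenizer';

describe('EnhancedFallbackTokenizer', () => {
  it('estimates a higher per-character token density for code than for prose', () => {
    const tokenizer = new EnhancedFallbackTokenizer();
    const prose = 'a quiet walk through the park on a warm afternoon with friends';
    const code = 'const sum = (a: number, b: number) => { return a + b; }; // add';

    const proseTokens = tokenizer.countTokens(prose);
    const codeTokens = tokenizer.countTokens(code);

    // The code-aware adjustment factor should raise the per-character estimate.
    expect(codeTokens / code.length).toBeGreaterThan(proseTokens / prose.length);
  });
});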

Migration Path

  • All new features will be optional and backward compatible
  • Existing tokenization remains as fallback
  • Gradual rollout with feature flags (see the sketch after this list)
  • User preferences for token management features
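
A possible shape for those flags (the flag names are assumptions; per the integration list they would live in source/constants.ts, with every flag defaulting to the current behavior):

// Hypothetical feature flags (names are assumptions; defaults preserve current behavior).
export const TOKEN_MANAGEMENT_FLAGS = {
  enhancedTokenizers: false,   // Gemini/Mistral/Qwen tokenizer support
  contextCompression: false,   // LLM-powered summarization of older messages
  slidingWindowContext: false, // fixed-size window with message pinning
  usageAnalytics: false,       // session-based tracking and reporting
  contextAwareCache: false     // token cache with context-hash invalidation
} as const;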

Success Metrics

  • Tokenization Accuracy: <5% error rate for supported providers
  • Cache Hit Rate: 90%+ for typical usage patterns
  • Context Compression: 30-50% reduction when needed
  • Performance Impact: <10ms overhead per message
  • Memory Usage: Keep tokenizer pool under 10MB
  • User Satisfaction: Reduced manual context management
  • Safety: 95%+ fewer context limit errors
