Apple's Foundation Model (AFMTextV7) is a significant step in on-device language modeling, built around a dual-model architecture optimized for Apple Silicon deployment. The system pairs a 3.18 billion parameter base model for high-quality generation with a 48.77 million parameter draft model that accelerates inference through speculative decoding. Both models share design choices aimed at efficient on-device deployment, including built-in LoRA adapter support, extended context capability, and aggressive quantization.
| Specification | Base Model (3B) | Draft Model (48M) |
|---|---|---|
| Model Type | AFMTextV7 | AFMTextV7 |
| Total Parameters | 3.18 billion | 48.77 million |
| Trainable Parameters | 66.6 million | 48.77 million |
| Frozen Parameters | 3.11 billion | 0 (all trainable) |
| Architecture | Two-Segment Transformer | Uniform Transformer |
| Total Layers | 56 layers (35+21) | 12 layers |
| Hidden Dimension | 2,048 | 256 |
| Attention Heads | 16 per layer | 8 per layer |
| Feed-Forward Dimension | 6,656 | 768 |
| Vocabulary Size | 153,600 tokens | 153,600 tokens |
| Context Window (Training) | 4,096 tokens | 4,096 tokens |
| Context Window (Extended) | Up to 205K tokens | Up to 205K tokens |
| Context Window (Practical) | ~100K tokens | ~100K tokens |
| LoRA Configuration | Rank 32, Alpha 16 | N/A (full training) |
| Quantization | 2-bit weights, 4-bit embeddings | Full precision training |
| Memory Footprint (FP32) | ~12.1GB | ~186MB |
| Production Memory | ~1.0-1.1GB | ~186MB |
| Environment | Base Model | Draft Model | Details |
|---|---|---|---|
| Apple Silicon (Production) | ~1.0-1.1GB | ~186MB | ANE-optimized deployment |
| Host Training (FP16) | ~6.4GB | ~186MB | MPS/CUDA/CPU training |
| Host Training (FP32) | ~12.1GB | ~186MB | Full precision training |
| LoRA Adapters Only | 254MB | N/A | Trainable parameters |
- Speculative Decoding: Dual-model system delivering 2-4x faster inference
- Two-Segment Architecture: Hybrid design balances computational efficiency with capability
- Native LoRA Integration: Built-in adapter support for efficient fine-tuning
- Extended Context Capability: 50x context extension via RoPE scaling (4K→205K tokens)
- Apple Silicon Optimization: ANE acceleration with aggressive quantization
- Draft Model Acceleration: 48M parameter model enables ultra-fast candidate generation
- Quantization-Aware Training: 2-bit weights with maintained performance through QAT
The AFM system uses speculative decoding to deliver 2-4x faster inference while maintaining full base-model quality, one of the most significant levers for on-device AI acceleration in the toolkit.
graph LR
subgraph "Speculative Decoding Pipeline"
A[Input Sequence] --> B[Draft Model<br/>48M params<br/>⚡ Ultra Fast]
B --> C[Generate 4-8<br/>Candidate Tokens]
C --> D[Base Model<br/>3.18B params<br/>🎯 High Quality]
D --> E{Verify Candidates<br/>in Parallel}
E -->|Accept| F[Accepted Tokens<br/>✅ High Quality]
E -->|Reject| G[Reject & Regenerate<br/>❌ Low Quality]
G --> B
F --> H[Final Output]
end
subgraph "Performance Benefits"
I[🚀 2-4x Speedup<br/>vs Standard Decoding]
J[💾 Memory Efficient<br/>Only +186MB for Draft]
K[🎯 Quality Preserved<br/>Base Model Standards]
end
classDef draft fill:#e8f5e8,stroke:#388e3c,stroke-width:2px
classDef base fill:#e3f2fd,stroke:#1976d2,stroke-width:2px
classDef benefit fill:#fff3e0,stroke:#f57c00,stroke-width:2px
class B,C draft
class D,E base
class I,J,K benefit
Draft Model (48M Parameters):
- Purpose: Generate multiple token candidates rapidly
- Architecture: Uniform 12-layer transformer (256 hidden dim)
- Speed: ~10x faster than base model per token
- Memory: Only 186MB additional footprint
Base Model (3.18B Parameters):
- Purpose: Verify and validate draft candidates
- Verification: Process multiple candidates in parallel
- Quality Control: Maintains full model standards
- Decision: Accept high-quality tokens, reject poor ones
Acceleration Mechanism:
- Parallel Candidate Generation: Draft model produces 4-8 tokens simultaneously
- Batch Verification: Base model validates all candidates in one forward pass
- Intelligent Acceptance: Statistical acceptance rates of 60-80% typical
- Graceful Fallback: Rejected tokens trigger new draft generation
This shifts inference from strictly sequential token generation to speculative parallel processing, enabling real-time conversational AI on mobile devices.
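To make the mechanism concrete, the loop below is a minimal sketch of the draft/verify cycle under a simple greedy acceptance rule. `draft_step` and `base_logits` are hypothetical stand-ins for the 48M draft and 3.18B base models; Apple's actual acceptance criterion, sampling, and batching are not documented in the toolkit, so treat this as an illustration of the idea rather than the implementation.

```python
import torch

def speculative_decode(base_logits, draft_step, prompt, k=4, max_new=32):
    """Draft k candidate tokens cheaply, then verify them with the base model in one pass."""
    seq = prompt.clone()
    target_len = prompt.shape[-1] + max_new
    while seq.shape[-1] < target_len:
        n = seq.shape[-1]
        # 1) Draft model proposes k candidates autoregressively (cheap, ~48M params).
        draft = seq
        for _ in range(k):
            draft = torch.cat([draft, draft_step(draft)], dim=-1)
        # 2) Base model scores the whole drafted block in a single forward pass.
        preds = base_logits(draft).argmax(-1)          # base model's greedy choices
        # 3) Accept drafted tokens as long as they match the base model's own choice.
        accepted = 0
        for i in range(n, draft.shape[-1]):
            if draft[0, i] == preds[0, i - 1]:
                accepted += 1
            else:
                break
        # 4) Keep the accepted prefix and append one base-model token, so progress is
        #    guaranteed even when every candidate is rejected.
        seq = torch.cat([draft[:, :n + accepted],
                         preds[:, n + accepted - 1 : n + accepted]], dim=-1)
    return seq

# Toy stand-ins over the 153,600-token vocabulary (shapes only, random outputs).
vocab = 153_600
base_logits = lambda tokens: torch.randn(1, tokens.shape[-1], vocab)
draft_step = lambda tokens: torch.randint(vocab, (1, 1))
print(speculative_decode(base_logits, draft_step, torch.randint(vocab, (1, 8))).shape)
```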
The following diagram provides a comprehensive overview of the AFMTextV7 model's technical specifications and configuration details:
graph LR
%% Central Model Node
A[AFMTextV7<br/>3.18B Parameters<br/>Apple Foundation Model]
%% Model Statistics
subgraph Stats ["Model Statistics"]
B[Total Parameters<br/>3,178,001,792<br/>3.18 Billion]
C[Frozen Parameters<br/>3,111,368,064<br/>97.9% of total]
D[Trainable Parameters<br/>66,633,728<br/>2.1% of total]
E[Model Memory<br/>Base: 11.87GB<br/>Adapters: 254MB<br/>Total: 12.12GB]
end
%% LoRA Configuration
subgraph LoRA ["LoRA Configuration"]
F[Basic Settings<br/>Rank: 32<br/>Alpha: 16<br/>Scaling: 0.5<br/>Dropout: 0.0]
G[Component Count<br/>Total Components: 1,173<br/>LoRA Modules: 350<br/>Adapted Layers: 280<br/>Module Containers: 280]
H[Parameter Distribution<br/>Segment 0: 43.6M trainable<br/>Segment 1: 23.1M trainable<br/>Per Layer S0: 1.245M<br/>Per Layer S1: 1.098M]
end
%% Context Capabilities
subgraph Context ["Context Window Capabilities"]
I[Training Window<br/>4,096 tokens<br/>Optimal Quality]
J[Conservative Extension<br/>41,000 tokens<br/>High Quality]
K[Moderate Extension<br/>102,000 tokens<br/>Good Quality]
L[Theoretical Maximum<br/>205,000 tokens<br/>Degraded Quality]
M[RoPE Configuration<br/>Theta: 500,000<br/>50x Extension Factor]
end
%% Architecture Details
subgraph Arch ["Architecture Specifications"]
N[Layer Distribution<br/>Total: 56 Layers<br/>Segment 0: 35 Layers<br/>Segment 1: 21 Layers]
O[Attention Configuration<br/>Heads: 16 per layer<br/>Head Dimension: 128<br/>Hidden Dimension: 2,048]
P[Feed-Forward Network<br/>Expansion: 3.25x<br/>Hidden to FFN: 6,656<br/>Activation: SwiGLU]
Q[Vocabulary and Embedding<br/>Vocab Size: 153,600<br/>Embedding Dim: 2,048<br/>Tied Weights: Yes]
end
%% Quantization Details
subgraph Quant ["Quantization Features"]
R[KV Quantization<br/>FakeQuantize Method<br/>EMA Observers<br/>Memory Reduction]
S[Production Quantization<br/>2 bits per weight<br/>Quantization-Aware Training<br/>Maintained Performance]
end
%% Connections
A --> Stats
A --> LoRA
A --> Context
A --> Arch
A --> Quant
%% Internal connections for flow
B --> C
C --> D
D --> E
F --> G
G --> H
I --> J
J --> K
K --> L
M --> I
N --> O
O --> P
P --> Q
R --> S
%% Styling
classDef central fill:#ffeb3b,stroke:#f57f17,stroke-width:3px
classDef stats fill:#e3f2fd,stroke:#1976d2
classDef lora fill:#e8f5e8,stroke:#388e3c
classDef context fill:#fce4ec,stroke:#c2185b
classDef arch fill:#f3e5f5,stroke:#7b1fa2
classDef quant fill:#e0f2f1,stroke:#00695c
class A central
class B,C,D,E stats
class F,G,H lora
class I,J,K,L,M context
class N,O,P,Q arch
class R,S quant
This comprehensive technical analysis is based on examination of Apple's Foundation Model Adapter Toolkit, available at: https://developer.apple.com/apple-intelligence/foundation-models-adapter
The following findings are derived from analyzing the source code, configuration files, model architecture, and running detailed parameter analysis on the AFMTextV7 model.
- Model Type: AFMTextV7 (Apple Foundation Model)
- Total Parameters: 3,178,001,792 (~3.18B parameters)
- Hidden Dimension: 2,048
- Vocabulary Size: 153,600 tokens
- Total Layers: 56 transformer layers (35 + 21)
- Attention Heads: 16 heads per layer
- Head Dimension: 128 (2048 ÷ 16)
- Feed-Forward Dimension: 6,656 (3.25x expansion)
Segment 0 (Standard Transformer): 35 layers
- Layers: layer_0 through layer_34
- Attention: full QKV computation with TransformerAttention
- Trainable parameters per layer: 1,245,184
- Total Segment 0 trainable: 43,581,440 parameters
Segment 1 (KV-Reuse Transformer): 21 layers
- Layers: layer_0 through layer_20 (within segment_1)
- Attention: query-only computation with KVReuseTransformerAttention
- Trainable parameters per layer: 1,097,728
- Total Segment 1 trainable: 23,052,288 parameters
- Total Trainable Parameters: 66,633,728 (~66.6M)
- Trainable Ratio: 2.097% (extremely efficient)
- LoRA Adapters Found: 1,173 individual adapter components
- Rank: 32 (consistent across all adapters)
- Alpha: 16 (scaling factor α/r = 0.5)
- Total Parameters: 48,765,952 (48.77M)
- Architecture: Uniform Transformer (no segmentation)
- Layers: 12 identical TransformerLayers
- Hidden Dimension: 256
- Attention Heads: 8 per layer
- Head Dimension: 32 (256 ÷ 8)
- Feed-Forward Dimension: 768 (3.0x expansion)
- Vocabulary: 153,600 tokens (same as base model)
| Component | Parameters | Percentage | Details |
|---|---|---|---|
| Embeddings + Output | 39,321,600 | 80.6% | TiedWeightLinear |
| Transformer Layers (12x) | 9,444,096 | 19.4% | 787,008 per layer |
| Normalization | 7,936 | <0.1% | RMSNorm + QKNorm |
| Total | 48,765,952 | 100% | All trainable |
Per TransformerLayer (787,008 parameters):
- TransformerFeedForward: 590,080 parameters
- hidden_transform: 393,216 (256→768, 2x parallel)
- output_transform: 196,608 (768→256)
- TransformerAttention: 196,928 parameters
- qkv_transform: 131,072 (256→512)
- output_transform: 65,536 (256→256)
Component Distribution by Type:
- Linear layers: 60 instances, 9,437,184 parameters
- TransformerFeedForward: 12 instances, 7,080,960 parameters
- TransformerAttention: 12 instances, 2,363,136 parameters
- RMSNorm: 49 instances, 7,168 parameters
- QKNorm: 12 instances, 768 parameters
- Total Model Memory: 186.03 MB (0.18 GB FP32)
- Training Efficiency: 100% parameters trainable
- No LoRA Required: Full model training for specialization
- Deployment: Optimized for speculative decoding
Both models are configured with:
MAX_CONTEXT_LEN = 4096  # Maximum context length during training
Both models use Rotary Position Embedding (RoPE) with theta=500,000, providing theoretical context extension:
| Scenario | Context Length | Quality | Use Case |
|---|---|---|---|
| Training Window | 4,096 tokens | Optimal | Standard conversations |
| Conservative Extension | ~41K tokens | High quality | Long documents |
| Moderate Extension | ~102K tokens | Good quality | Books, large codebases |
| Theoretical Maximum | ~205K tokens | Degraded quality | Technical limit |
Scaling Factor: theta=500,000 is 50x the conventional RoPE base of 10,000, which is where the 50x extension figure comes from (see the sketch below)
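As a rough illustration (not code from the toolkit), the snippet below computes the per-dimension rotary frequencies for a 128-dim head under both bases and shows the base ratio behind the 50x figure:

```python
import math

def rope_inv_freq(head_dim: int, theta: float):
    """Per-dimension inverse frequencies used by rotary position embedding (RoPE)."""
    return [theta ** (-2.0 * i / head_dim) for i in range(head_dim // 2)]

standard = rope_inv_freq(128, 10_000.0)    # conventional RoPE base
extended = rope_inv_freq(128, 500_000.0)   # AFM's base, per the configuration above

# The base ratio gives the quoted 50x figure; the lowest-frequency dimension shows how
# far apart two positions can be before their rotary phases wrap around.
print(500_000 / 10_000)                    # 50.0
print(2 * math.pi / standard[-1])          # longest wavelength at theta = 10,000
print(2 * math.pi / extended[-1])          # longest wavelength at theta = 500,000 (~47x longer)
```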
The following diagram illustrates the comprehensive architectural flow of both the base model's two-segment design and how it integrates with the draft model for speculative decoding:
graph TD
%% Input Processing
A[Input Tokens<br/>153,600 vocab] --> B[Token Embedding<br/>314M params<br/>153,600 to 2,048]
A --> C[Token Metadata Logic<br/>Segment IDs and Positions]
%% Position Encoding
B --> D[RoPE Positional Encoding<br/>theta = 500,000<br/>50x Context Extension<br/>4K to 205K tokens]
C --> D
%% Draft Model Branch (ALWAYS FIRST)
A --> A1[Draft Model<br/>48M Parameters<br/>🚀 Always First Step]
subgraph DRAFT ["Draft Model - Speculative Decoding Pipeline"]
A1 --> A2[12 Uniform Layers<br/>256 Hidden Dim<br/>8 Attention Heads]
A2 --> A3[Fast Candidate Generation<br/>4-8 Tokens Simultaneously<br/>⚡ 10x Faster than Base]
A3 --> A4[Candidates Ready<br/>for Verification]
end
%% Segment 0: Standard Transformer
D --> E[Segment 0: Standard Transformer<br/>35 Layers - 43.6M trainable params]
subgraph S0 ["Segment 0 - Full Attention - Layers 0-34"]
E1[Layer Input<br/>Hidden: 2,048] --> E2[RMSNorm<br/>Pre-attention]
E2 --> E3[Multi-Head Attention<br/>16 heads x 128 dim<br/>Q: 2048 to 2048, K/V: 2048 to 256]
E3 --> E4[LoRA Adapters<br/>QKV: 278K params<br/>Output: 131K params]
E4 --> E5[KV Quantization<br/>FakeQuantize + EMA]
E5 --> E6[RoPE Transform<br/>Q and K projections]
E6 --> E7[Scaled Dot-Product<br/>Flash Attention]
E7 --> E8[Residual Connection]
E1 --> E8
E8 --> E9[RMSNorm<br/>Pre-FFN]
E9 --> E10[Feed-Forward Network<br/>2048 to 6656 to 2048<br/>SwiGLU Activation]
E10 --> E11[LoRA Adapters<br/>Linear0: 278K<br/>Linear1: 278K<br/>Output: 278K]
E11 --> E12[Residual Connection]
E8 --> E12
end
E --> E1
E12 --> F[Segment 1: KV-Reuse Transformer<br/>21 Layers - 23.1M trainable params]
%% Segment 1: KV-Reuse Transformer
subgraph S1 ["Segment 1 - KV Reuse - Layers 35-55"]
F1[Layer Input<br/>Hidden: 2,048] --> F2[RMSNorm<br/>Pre-attention]
F2 --> F3[Query-Only Attention<br/>Reuses K/V from Segment 0<br/>Q: 2048 to 2048 only]
F3 --> F4[LoRA Adapters<br/>Q-transform: 131K<br/>Output: 131K]
F4 --> F5[KV Reuse Logic<br/>Memory Efficient]
F5 --> F6[Attention Computation<br/>16 heads x 128 dim]
F6 --> F7[Residual Connection]
F1 --> F7
F7 --> F8[RMSNorm<br/>Pre-FFN]
F8 --> F9[Feed-Forward Network<br/>2048 to 6656 to 2048<br/>SwiGLU Activation]
F9 --> F10[LoRA Adapters<br/>Same as Segment 0<br/>835K params total]
F10 --> F11[Residual Connection]
F7 --> F11
end
F --> F1
F11 --> G[Final Layer Norm<br/>RMSNorm 2,048]
%% Output Processing and Speculative Verification
G --> H[Output Transform<br/>Tied Weights<br/>2,048 to 153,600]
H --> I[Base Model Logits<br/>Vocabulary Distribution]
%% Speculative Decoding Integration (CORRECTED FLOW)
A4 --> J[🔍 Parallel Verification<br/>Base Model Evaluates<br/>All Draft Candidates]
I --> J
J --> K{Quality Assessment<br/>Per Candidate Token}
K -->|High Quality<br/>60-80% typical| L[✅ Accept Token<br/>Add to Final Output]
K -->|Low Quality<br/>20-40% typical| M[❌ Reject Token<br/>Quality Below Threshold]
L --> N{More Tokens<br/>Needed?}
M --> O[🔄 Generate New<br/>Candidates]
N -->|Yes| O
N -->|No| P[Final Response<br/>🎉 Complete Sequence]
O --> A1
%% Performance Benefits Box
Q[📊 Performance Benefits<br/>🚀 2-4x Faster Inference<br/>💾 +186MB Memory Only<br/>🎯 Zero Quality Loss<br/>📈 60-80% Accept Rate]
%% Styling
classDef input fill:#e1f5fe
classDef draft fill:#e8f5e8,stroke:#388e3c,stroke-width:2px
classDef segment0 fill:#f3e5f5
classDef segment1 fill:#e8f5e8
classDef output fill:#fff3e0
classDef speculative fill:#ffebee,stroke:#d32f2f,stroke-width:2px
classDef decision fill:#fff3e0,stroke:#f57c00,stroke-width:2px
classDef performance fill:#e8f5e8,stroke:#388e3c,stroke-width:3px
class A,B,C,D input
class A1,A2,A3,A4,O draft
class E,E1,E2,E3,E4,E5,E6,E7,E8,E9,E10,E11,E12 segment0
class F,F1,F2,F3,F4,F5,F6,F7,F8,F9,F10,F11 segment1
class G,H,I output
class J,K,L,M,N speculative
class P decision
class Q performance
Embedding(153600, 2048, padding_idx=0)
- Vocabulary Size: 153,600 tokens
- Embedding Dimension: 2,048
- Special Tokens: Includes padding token at index 0
- Memory Footprint: ~314M parameters for embeddings alone
- Component: VanillaTokenMetadataLogic()
- Purpose: Handles token-level metadata processing
- Function: Manages segment IDs, positions, and token types
- Type: Causal attention mask
- Flash Attention: Auto-mode for optimized attention computation
- Implementation: Prevents information leakage from future tokens
- Type: Rotary Positional Embedding (RoPE)
- Dimension per Head: 128
- Theta Parameter: 500,000 (extended context capability)
- Context Extension: Enables up to 50x longer sequences than training window
The model uses a sophisticated two-segment architecture:
Layer Configuration:
- Count: 35 layers (layers 0-34)
- Type: TransformerLayer with full attention computation
- Hidden Dimension: 2,048
- Feed-Forward Dimension: 6,656 (3.25x expansion ratio)
- Attention Heads: 16
- Head Dimension: 128 (2048 ÷ 16)
Attention Mechanism:
- QKV Transform: Fused multi-output linear layer
- Query: 2048 → 2048
- Key: 2048 → 256
- Value: 2048 → 256
- Key-Value Quantization: FakeQuantize with EMA observers
- QK Normalization: RMSNorm on queries and keys (128 dimensions each)
- RoPE Transform: Applied to queries and keys
- Scaled Dot-Product Attention: 16 heads, auto implementation
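The projection sizes above can be turned into a shape-only sketch. One inference is worth flagging: with a 128-dim head, the 2048→256 K/V projection implies two shared K/V heads serving the 16 query heads (a GQA-style layout); that reading follows from the listed dimensions rather than from any explicit statement in the toolkit. QK RMSNorm, RoPE, and KV fake-quantization are omitted for brevity, and the module names are illustrative.

```python
import torch
import torch.nn.functional as F

d_model, n_q_heads, head_dim = 2048, 16, 128
n_kv_heads = 256 // head_dim                   # = 2, inferred from the K/V projection width

q_proj = torch.nn.Linear(d_model, n_q_heads * head_dim, bias=False)    # 2048 -> 2048
k_proj = torch.nn.Linear(d_model, n_kv_heads * head_dim, bias=False)   # 2048 -> 256
v_proj = torch.nn.Linear(d_model, n_kv_heads * head_dim, bias=False)   # 2048 -> 256
o_proj = torch.nn.Linear(n_q_heads * head_dim, d_model, bias=False)

x = torch.randn(1, 64, d_model)                                        # [batch, seq, hidden]
q = q_proj(x).view(1, 64, n_q_heads, head_dim).transpose(1, 2)
k = k_proj(x).view(1, 64, n_kv_heads, head_dim).transpose(1, 2)
v = v_proj(x).view(1, 64, n_kv_heads, head_dim).transpose(1, 2)

# Broadcast the 2 K/V heads across the 16 query heads, then apply causal attention.
k = k.repeat_interleave(n_q_heads // n_kv_heads, dim=1)
v = v.repeat_interleave(n_q_heads // n_kv_heads, dim=1)
attn = F.scaled_dot_product_attention(q, k, v, is_causal=True)
print(o_proj(attn.transpose(1, 2).reshape(1, 64, -1)).shape)           # torch.Size([1, 64, 2048])
```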
Layer Configuration:
- Count: 21 layers (layers 35-55, total 56 layers)
- Type: KVReuseTransformerAttention
- Optimization: Reuses key-value pairs for efficiency
- Query-only Updates: Only query vectors are computed per layer
Key Differences from Segment 0:
- Q-only Transform: Only query projection, reuses K/V from previous segments
- Memory Efficiency: Significant reduction in memory and computation
- Maintained Performance: Preserves model quality while reducing resources
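A minimal sketch of the KV-reuse idea, assuming a Segment 1 layer computes only new queries and attends over K/V tensors produced earlier (passed in here as plain tensors). The exact caching and reuse scheme is not documented, so this is illustrative only.

```python
import torch
import torch.nn.functional as F

d_model, n_heads, head_dim = 2048, 16, 128
q_proj = torch.nn.Linear(d_model, n_heads * head_dim, bias=False)   # only Q is projected
o_proj = torch.nn.Linear(n_heads * head_dim, d_model, bias=False)

def kv_reuse_attention(x, k_cache, v_cache):
    """x: [1, seq, 2048]; k_cache / v_cache: [1, heads, seq, 128] reused from Segment 0."""
    b, s, _ = x.shape
    q = q_proj(x).view(b, s, n_heads, head_dim).transpose(1, 2)
    attn = F.scaled_dot_product_attention(q, k_cache, v_cache, is_causal=True)
    return o_proj(attn.transpose(1, 2).reshape(b, s, -1))

x = torch.randn(1, 64, d_model)
k = torch.randn(1, n_heads, 64, head_dim)   # stand-ins for the cached keys/values
v = torch.randn(1, n_heads, 64, head_dim)
print(kv_reuse_attention(x, k, v).shape)    # torch.Size([1, 64, 2048])
```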
Both segments use identical feed-forward architectures:
TransformerFeedForward(
hidden_transform: [2048 → 6656, 2048 → 6656]
activation: SwiGLU
output_transform: 6656 → 2048
)
Key Features:
- Gated Linear Units: SwiGLU activation function
- Dual Linear Projections: Two parallel transformations for gating
- Expansion Ratio: 3.25x (6656/2048)
- Dropout: Configurable (currently 0.0)
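A compact PyTorch sketch of a SwiGLU feed-forward block with the dimensions above (2,048 → 6,656 → 2,048); class and attribute names are illustrative, not the toolkit's.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLUFeedForward(nn.Module):
    """Gated FFN: two parallel 2048->6656 projections, SiLU gating, 6656->2048 output."""
    def __init__(self, d_model=2048, d_ff=6656):
        super().__init__()
        self.gate = nn.Linear(d_model, d_ff, bias=False)   # first parallel projection
        self.up = nn.Linear(d_model, d_ff, bias=False)     # second parallel projection
        self.down = nn.Linear(d_ff, d_model, bias=False)   # output_transform

    def forward(self, x):
        return self.down(F.silu(self.gate(x)) * self.up(x))

ffn = SwiGLUFeedForward()
print(ffn(torch.randn(1, 64, 2048)).shape)        # torch.Size([1, 64, 2048])
print(sum(p.numel() for p in ffn.parameters()))   # 40,894,464 base weights per FFN block
```

The ~41.7M per-FFN figure reported later in this analysis presumably also counts the ~0.8M of LoRA adapter parameters attached to each feed-forward block.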
Configuration:
- Rank: 32
- Alpha: 16
- Dropout: 0.0
- Scaling Factor: α/r = 16/32 = 0.5
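A minimal LoRA adapter sketch using the settings above (rank 32, alpha 16, scaling α/r = 0.5): the base projection stays frozen and only the low-rank A/B factors are trained. Names are illustrative.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen base Linear plus a trainable rank-32 update scaled by alpha / rank = 0.5."""
    def __init__(self, in_features, out_features, rank=32, alpha=16):
        super().__init__()
        self.base = nn.Linear(in_features, out_features, bias=False)
        self.base.weight.requires_grad_(False)      # pretrained weight stays frozen
        self.lora_a = nn.Linear(in_features, rank, bias=False)
        self.lora_b = nn.Linear(rank, out_features, bias=False)
        nn.init.zeros_(self.lora_b.weight)          # adapter starts as a no-op
        self.scaling = alpha / rank                 # 16 / 32 = 0.5

    def forward(self, x):
        return self.base(x) + self.scaling * self.lora_b(self.lora_a(x))

layer = LoRALinear(2048, 2048)
print(sum(p.numel() for p in layer.parameters() if p.requires_grad))   # 131,072 trainable
```

The 131,072 trainable parameters match the per-adapter counts listed below for the 2048→2048 projections.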
The comprehensive analysis reveals the exact adapter distribution:
LoRA Component Breakdown:
- Total Individual LoRA Components: 1,173 (not 350 as previously estimated)
- LoRA Modules: 350 actual LoRA computation units
- ModuleDict Containers: 280 adapter management containers
- AdaptedLayer Wrappers: 280 wrapper layers
Segment 0 Attention Adapters (per layer):
LoRAFusedMultiOutputLinear: 278,528 parameters (QKV combined)
├─ lora_0: Query projection adapters
├─ lora_1: Key projection adapters
└─ lora_2: Value projection adapters
LoRA (output): 131,072 parameters (attention output)
Segment 1 Attention Adapters (per layer):
LoRA (q_transform): 131,072 parameters (query-only)
LoRA (output): 131,072 parameters (attention output)
Feed-Forward Adapters (both segments, per layer):
LoRA (linear_0): 278,528 parameters (input to hidden)
LoRA (linear_1): 278,528 parameters (gate projection)
LoRA (output): 278,528 parameters (hidden to output)
Per-Layer Parameter Distribution:
- Segment 0 layers: 1,245,184 trainable parameters each
- Segment 1 layers: 1,097,728 trainable parameters each
- Difference: 147,456 fewer parameters per Segment 1 layer due to KV-reuse optimization
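The per-layer totals can be reproduced from the adapter sizes listed above; a quick arithmetic check (plain Python, no toolkit code, assuming each rank-32 adapter contributes in_dim×32 + 32×out_dim parameters):

```python
# Rank-32 LoRA parameters for an in_dim -> out_dim projection: A (in x 32) + B (32 x out).
lora = lambda d_in, d_out, r=32: d_in * r + r * d_out

qkv  = lora(2048, 2048) + 2 * lora(2048, 256)       # fused Q/K/V adapters = 278,528
ffn  = 2 * lora(2048, 6656) + lora(6656, 2048)      # linear_0 + linear_1 + output = 835,584
seg0 = qkv + lora(2048, 2048) + ffn                 # + attention output adapter
seg1 = 2 * lora(2048, 2048) + ffn                   # q_transform + output + FFN adapters

print(seg0, seg1)              # 1245184 1097728   (per-layer trainable parameters)
print(35 * seg0 + 21 * seg1)   # 66633728          (total trainable parameters)
```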
Memory Efficiency:
- Trainable Parameters: Only LoRA adapters (~2.1% of total parameters)
- Base Model: Remains frozen during fine-tuning
- Adaptation Capability: Efficient task-specific customization
output_norm: RMSNorm(2048)
output_transform: TiedWeightLinear(2048 → 153600)
Features:
- Weight Tying: Output projection shares weights with input embeddings
- Normalization: Final RMSNorm before output projection
- Vocabulary: Projects to full 153,600 token vocabulary
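Weight tying means the output projection reuses the embedding matrix instead of storing a second 153,600 × 2,048 matrix (saving ~314M parameters). A minimal sketch of the idea, not the toolkit's TiedWeightLinear:

```python
import torch
import torch.nn.functional as F

vocab, d_model = 153_600, 2048
embedding = torch.nn.Embedding(vocab, d_model, padding_idx=0)   # ~1.2GB table in FP32

def tied_output(hidden):
    """Project hidden states to vocabulary logits using the embedding weights."""
    return F.linear(hidden, embedding.weight)                   # [.., 2048] -> [.., 153600]

tokens = torch.randint(vocab, (1, 64))
hidden = embedding(tokens)               # stand-in for the transformer stack's output
print(tied_output(hidden).shape)         # torch.Size([1, 64, 153600])
```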
- Input: 2,048 dimensions
- Intermediate: 6,656 dimensions
- Output: 2,048 dimensions
- Expansion Factor: 3.25x
- Architecture: SwiGLU with dual parallel projections
- Input: 256 dimensions
- Intermediate: 768 dimensions
- Output: 256 dimensions
- Expansion Factor: 3.0x
- Architecture: SwiGLU with dual parallel projections
Both models use identical feed-forward architectures scaled to their dimensions:
TransformerFeedForward(
hidden_transform: [input → intermediate, input → intermediate] # Two parallel projections
activation: SwiGLU # Gated activation
output_transform: intermediate → input # Final projection
)
KV Quantization:
- Purpose: Reduce memory footprint of key-value cache
- Implementation: FakeQuantize with EMA observers
- Benefit: Enables longer context processing on device
Hierarchical Layer Design:
- Segment 0: Full computation for early layers (35 layers)
- Segment 1: KV-reuse for later layers (21 layers)
- Rationale: Balances model capacity with computational efficiency
Native LoRA Integration:
- Built-in adapter support for efficient fine-tuning
- 1,173 adapter components throughout the architecture
- Only 2.1% of parameters require training
Uniform Architecture:
- Simplified 12-layer design for fast inference
- No KV quantization needed due to smaller size
- Full parameter training for task specialization
Optimized Dimensions:
- 8x smaller hidden dimension (256 vs 2,048)
- 2x fewer attention heads (8 vs 16)
- Maintains vocabulary compatibility
Speculative Decoding Ready:
- Designed for candidate token generation
- Ultra-low latency inference
- Seamless integration with base model
Base Model Optimized Memory:
- Core Model (2-bit + 4-bit embeddings): ~0.89GB
- ANE Acceleration Buffers: ~50-100MB
- Runtime Overhead: ~50MB
- Total Production Footprint: ~1.0-1.1GB
Draft Model Memory:
- Full Model (FP32): 186.03 MB
- Optimized Deployment: ~186MB (no additional quantization needed)
- Ultra-efficient: 65x smaller than base model
Base Model (Host Training):
- FP32: ~12.1GB total, 254MB trainable (LoRA only)
- FP16: ~6.4GB total, 254MB trainable (LoRA only)
- Supported Devices: MPS, CUDA, CPU
Draft Model (Host Training):
- FP32: 186.03 MB (all parameters trainable)
- FP16: ~93MB (all parameters trainable)
- Minimal Resource Requirements: Suitable for edge training
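The memory figures in this section follow directly from parameter counts times bytes per parameter (using binary MB/GB); a quick check:

```python
GiB, MiB = 1024 ** 3, 1024 ** 2

base_params  = 3_178_001_792
lora_params  = 66_633_728
draft_params = 48_765_952

print(base_params  * 4 / GiB)   # ~11.84 -> the ~11.9GB FP32 base-weights figure
print(lora_params  * 4 / MiB)   # ~254.2 -> the 254MB LoRA-adapter figure
print(base_params  * 2 / GiB)   # ~5.92  -> raw FP16 weights; ~6.4GB with runtime overhead (assumed)
print(draft_params * 4 / MiB)   # ~186.0 -> the 186.03 MB draft-model figure
```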
Apple's AFM ships a particularly thorough speculative decoding implementation:
- Dual-Model System: 48M draft + 3.18B base for optimal speed-quality balance
- Parallel Candidate Processing: 4-8 tokens generated and verified simultaneously
- Intelligent Acceptance Logic: Statistical models predict acceptance likelihood
- Zero Quality Degradation: Maintains full model standards while delivering 2-4x speedup
- Mobile-Optimized: Designed specifically for on-device deployment constraints
The two-segment layout is an unusual transformer design:
- Early Layers: Full transformer computation for foundational understanding (35 layers)
- Later Layers: KV-reuse for efficiency while maintaining performance (21 layers)
- Memory Savings: Segment 1 layers use ~12% fewer trainable parameters (1.10M vs 1.25M per layer) through KV sharing
- Computational Balance: Optimizes the compute-capability trade-off for on-device deployment
Unlike retrofitted LoRA implementations, AFM features built-in adaptation:
- Native Integration: LoRA adapters built into the architecture
- Comprehensive Coverage: Adapters on all linear transformations
- Optimized Training: Designed for efficient adapter-only fine-tuning
- KV Quantization: Reduces memory bottleneck in attention mechanism
- Observer-based: Uses EMA observers for dynamic quantization
- Deployment Ready: Optimized for on-device inference
- RoPE Scaling: theta=500K enables 50x context extension
- Practical Range: 40K-100K tokens with maintained quality
- Technical Limit: Up to ~205K tokens theoretically
- 2-bit Weights: Aggressive quantization for production deployment
- 4-bit Embeddings: Optimized vocabulary representation
- Performance Preservation: Maintains model quality through QAT methodology
- Apple Silicon Optimization: ANE-specific quantization schemes
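To show what quantization-aware training means mechanically, here is a generic fake-quantization sketch for 2-bit symmetric weights with a straight-through estimator. This is a textbook-style illustration, not Apple's ANE quantization scheme, and the per-tensor scaling rule is an assumption.

```python
import torch

def fake_quantize(w: torch.Tensor, bits: int = 2) -> torch.Tensor:
    """Round weights onto a low-bit grid in the forward pass while letting gradients
    flow to the full-precision weights (straight-through estimator)."""
    levels = 2 ** (bits - 1)                      # 2 for 2-bit: integer grid -2..1
    scale = w.abs().max() / levels + 1e-12        # assumed per-tensor symmetric scaling
    q = (w / scale).round().clamp(-levels, levels - 1)
    return w + (q * scale - w).detach()           # forward: quantized; backward: identity

w = torch.randn(4, 8, requires_grad=True)
wq = fake_quantize(w)
scale = w.detach().abs().max() / 2 + 1e-12
print((wq.detach() / scale).round().unique())     # integer levels drawn from {-2, -1, 0, 1}
wq.sum().backward()
print(bool((w.grad == 1).all()))                  # True: gradients reach the FP weights
```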
Apple's AFM implements a speculative decoding system purpose-built for on-device inference:
- Draft Generation: Ultra-fast 48M parameter model generates 4-8 candidate tokens in parallel
- Parallel Verification: 3.18B base model validates all candidates simultaneously in one forward pass
- Token Acceptance: High-quality candidates (60-80% typical acceptance rate) are immediately accepted
- Intelligent Rejection: Poor candidates trigger new draft generation, maintaining quality standards
- Performance Gain: Overall 2-4x speedup with zero quality degradation
This breakthrough enables real-time conversational AI on mobile devices while preserving the full capabilities of large language models.
| Aspect | Base Model | Draft Model |
|---|---|---|
| Purpose | High-quality generation | Fast candidate generation |
| Architecture | Two-segment hybrid | Uniform transformer |
| Layers | 56 (35+21) | 12 |
| Hidden Dimension | 2,048 | 256 |
| Parameters | 3.18B | 48.77M |
| Memory (Production) | ~1.0-1.1GB | ~186MB |
| Training Strategy | LoRA adapters | Full fine-tuning |
| Specialization | LoRA rank 32/alpha 16 | Direct parameter updates |
Base Model:
- Quality: Maximum capability for complex reasoning
- Context: Up to 205K tokens theoretical, 100K practical
- Customization: Efficient LoRA adapter training
- Memory: Aggressive quantization for on-device deployment
Draft Model:
- Speed: Ultra-fast inference for candidate generation
- Efficiency: 65x smaller memory footprint
- Simplicity: Uniform architecture, no quantization needed
- Training: Direct fine-tuning for task specialization
- TAMM Framework: Apple's proprietary ML framework
- Hardware Optimization: Tuned for Apple Silicon
- Metal Performance: GPU acceleration on Apple devices
- CoreML Compatible: Can be converted for iOS deployment
Base Model:
- Pre-trained and frozen base weights
- LoRA adapter training only
- ~97% reduction in trainable parameters
- Efficient task-specific customization
Draft Model:
- Full parameter training
- Task-specific specialization
- Minimal resource requirements
- Fast convergence due to smaller size
Dual Model System:
- Combined Memory: ~1.2-1.3GB total footprint
- Inference Speed: Significantly improved through speculative decoding
- Power Efficiency: Optimized for battery-powered devices
- Context Handling: Extended context support on both models
Individual Model Deployment:
- Base Model Only: Full capability, standard inference speed
- Draft Model Only: Limited capability, ultra-fast inference
- Recommended: Combined deployment for optimal performance
Base Model:
- Multiple LoRA adapter sets for different tasks
- Quick adapter switching without model reloading
- User-specific personalization
- Efficient storage of multiple specializations
Draft Model:
- Direct fine-tuning for specific domains
- Fast training convergence
- Task-specific optimization
- Minimal computational requirements
- Input Shape: [1, 64] (batch_size=1, sequence_length=64)
- Output Shape: [1, 64, 153600] (batch, sequence, vocabulary)
- Status: ✓ Forward pass successful
- Input Shape: [1, 64] (batch_size=1, sequence_length=64)
- Output Shape: [1, 64, 153600] (batch, sequence, vocabulary)
- Output Statistics: Mean=6.39, Std=3.95, Range=[-15.3, 26.6]
- Status: ✓ Forward pass successful
Both models successfully demonstrate functional operation with identical input/output specifications, ensuring compatibility for speculative decoding.
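The shape contract above is easy to check against any causal LM that exposes logits. Since the TAMM model classes are not reproduced here, the snippet below uses a stub that only implements the embedding and tied output head; it is a hypothetical stand-in, not the AFM model.

```python
import torch
import torch.nn.functional as F

vocab, d_model = 153_600, 2048

class StubLM(torch.nn.Module):
    """Embedding + tied output head only; the 56 transformer layers are omitted."""
    def __init__(self):
        super().__init__()
        self.embed = torch.nn.Embedding(vocab, d_model, padding_idx=0)  # ~1.2GB in FP32
    def forward(self, tokens):
        return F.linear(self.embed(tokens), self.embed.weight)

tokens = torch.randint(vocab, (1, 64))            # [batch=1, sequence=64]
logits = StubLM()(tokens)
print(logits.shape)                               # torch.Size([1, 64, 153600])
print(logits.mean().item(), logits.std().item())  # summary statistics, as reported above
```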
Based on the comprehensive module analysis:
| Component | Parameters | Percentage | Trainable |
|---|---|---|---|
| Total Model | 3,178,001,792 | 100% | 66,633,728 |
| Transformer Layers | 2,863,426,944 | 90.1% | 66,633,728 |
| Embeddings | 314,572,800 | 9.9% | 0 |
| Output Transform | 314,572,800 | 9.9% | 0 |
| Feed-Forward Networks | 2,336,997,376 | 73.5% | 46,792,704 |
| Attention Mechanisms | 526,429,568 | 16.6% | 19,841,024 |
| Normalization (RMSNorm) | 243,072 | <0.1% | 0 |
Most Parameter-Dense Components:
- TransformerLayer: 56 instances, 51.1M parameters each (segment 0) / 50.3M (segment 1)
- TransformerFeedForward: 56 instances, 41.7M parameters each
- Linear Transformations: 280 instances, various sizes
- MultiOutputLinear: 56 instances (SwiGLU implementation)
LoRA Adapter Distribution:
- LoRA Modules: 350 computational units
- AdaptedLayer Wrappers: 280 management layers
- ModuleDict Containers: 280 adapter collections
- Average LoRA Size: ~190K parameters per module
- FLOPs per Token: ~6B (estimated)
- Memory Bandwidth: Optimized through quantization and KV-reuse
- Parallelization: 16-head attention enables efficient GPU utilization
- Context Length: Extended through RoPE (theta=500K)
- Batch Processing: Efficient attention implementation
- Adapter Scaling: Easy addition of task-specific adapters
| Aspect | Standard Transformer | AFMTextV7 |
|---|---|---|
| Layer Uniformity | All layers identical | Two-segment design |
| Position Encoding | Absolute/Learned | RoPE (theta=500K) |
| Normalization | LayerNorm | RMSNorm |
| Adapter Support | External | Built-in LoRA |
| KV Caching | Standard | Quantized |
| Context Window | Fixed | Extensible (4K→205K) |
- More Efficient: KV-reuse and quantization reduce memory usage
- Adapter-Native: Built-in fine-tuning support
- Device-Optimized: Designed for on-device deployment
- Extended Context: 50x context capability vs standard models
- Apple Ecosystem: Integrated with Apple's ML stack
Apple introduced two primary foundation language models at WWDC 2025: an on-device model (~3 billion parameters) and a server-based model (larger, designed for Private Cloud Compute). These models power Apple Intelligence features across iOS 26, iPadOS 26, macOS Tahoe 26, and other platforms. Below are the technical details:
- On-Device Model:
  - Parameter Count: Approximately 3 billion parameters.
  - Compression: Uses 2 bits per weight (bpw) with Quantization-Aware Training (QAT) to reduce memory footprint and improve inference speed while maintaining performance.
  - Design: Optimized for efficiency on Apple Silicon, leveraging unified memory and hardware acceleration (e.g., Neural Engine). The model is compact, enabling fast inference for tasks like text summarization, entity extraction, and text refinement.
  - Capabilities: Excels at text-based tasks (e.g., summarization, text understanding, creative content generation) but is not designed as a general-purpose chatbot for world knowledge. It supports guided generation (constrained decoding via Swift's @Generable macro) and tool calling for app-specific tasks.
  - Performance: Human evaluations show it performs comparably to larger models like Gemma-3 (4B) and Qwen-3 (4B) on specific tasks, despite its smaller size.
- Quantization-Aware Training (QAT): Applied to the on-device model to achieve 2-bit weights, reducing memory usage and inference latency while preserving accuracy.
- Fine-Tuning: Both models are fine-tuned for Apple Intelligence tasks (e.g., text refinement, notification prioritization, image creation). Adapters are used to specialize models for specific user needs without full retraining.
- On-Device Efficiency:
  - The 3B model is optimized for Apple Silicon, using the Neural Engine and unified memory to minimize latency.
  - Supports offline operation, ensuring privacy and availability without internet connectivity.
- Benchmarks:
  - The on-device model outperforms Llama-3 (8B) in mathematical reasoning and matches Gemma-3/Qwen-3 (4B) in human evaluations for text tasks.
References:
- https://machinelearning.apple.com/research/introducing-apple-foundation-models
- https://machinelearning.apple.com/research/apple-foundation-models-2025-updates
The Apple Foundation Model (AFMTextV7) dual-architecture represents a sophisticated approach to on-device language modeling with several key innovations:
- Dual Model Innovation: Base + Draft architecture optimizes both quality and speed
- Hybrid Base Design: Two-segment architecture balances capability with efficiency
- Native Adapter Support: Built-in LoRA integration enables efficient customization
- Memory Optimizations: Advanced quantization and efficient architectures
- Extended Context: RoPE scaling provides 50x context window extension
- Device-First Design: Optimized for mobile and edge deployment scenarios
- Speculative Decoding: Advanced inference acceleration through candidate generation
- Quantization-Aware Training: 2-bit weights with preserved performance
This architecture demonstrates Apple's commitment to bringing large language model capabilities to consumer devices while maintaining privacy, performance, and efficiency standards. The dual-model approach represents a significant advancement in on-device AI, enabling both high-quality generation and fast inference within tight resource constraints.
The AFMTextV7 system establishes a strong foundation for the next generation of on-device AI applications, providing sophisticated language understanding and generation capabilities while respecting the unique constraints and opportunities of mobile device deployment.
This analysis is based on examination of Apple's Foundation Model Adapter Toolkit:
- Official Toolkit: https://developer.apple.com/apple-intelligence/foundation-models-adapter
- Analysis Date: June 2025
- Model Versions: AFMTextV7 (Base & Draft)
- Framework: TAMM (Apple's ML Framework)
This analysis compares three prominent 3B parameter language models: Apple's AFM (AFMTextV7), Meta's Llama 3.2 3B, and Alibaba's Qwen2.5 3B. Each represents different architectural philosophies and optimization strategies for on-device and edge deployment scenarios.
| Specification | AFM Base (3.18B) | Llama 3.2 3B | Qwen2.5 3B |
|---|---|---|---|
| Total Parameters | 3.18B | ~3.2B | 3.09B |
| Developer | Apple | Meta | Alibaba |
| Release Date | June 2025 (updated architecture) | September 2024 | September 2024 |
| Primary Use Case | On-device AI | Edge/Mobile | General Purpose |
| Architecture Type | Two-Segment Hybrid | Optimized Transformer | Standard Transformer |
| Vocabulary Size | 153,600 | ~128K (estimated) | 151,936 |
| Context Length | 4K training / 205K extended | 128K | 32K |
| Deployment Focus | Apple Silicon | Cross-platform edge | Cloud & edge |
| Component | AFM Base | Llama 3.2 3B | Qwen2.5 3B |
|---|---|---|---|
| Layers | 56 (35+21 hybrid) | ~32 (estimated) | 36 |
| Hidden Dimension | 2,048 | ~2,048 (estimated) | 2,048 |
| Attention Heads | 16 | ~16 (estimated) | 16 Query, 2 KV (GQA) |
| Head Dimension | 128 | 128 | 128 |
| FFN Intermediate | 6,656 | ~8,192 (estimated) | 11,008 |
| FFN Expansion Factor | 3.25x | ~4.0x | 5.38x |
| Activation Function | SwiGLU | SiLU | SwiGLU |
| Normalization | RMSNorm | RMSNorm | RMSNorm |
| Position Encoding | RoPE (θ=500K) | RoPE | RoPE |
| Model | Input | Intermediate | Output | Expansion | Activation |
|---|---|---|---|---|---|
| AFM Base | 2,048 | 6,656 | 2,048 | 3.25x | SwiGLU |
| Llama 3.2 3B | 2,048 | ~8,192 | 2,048 | ~4.0x | SiLU |
| Qwen2.5 3B | 2,048 | 11,008 | 2,048 | 5.38x | SwiGLU |
Key Observations:
- AFM: Most conservative FFN expansion (3.25x) optimized for efficiency
- Llama 3.2: Moderate expansion (~4.0x) balancing performance and efficiency
- Qwen2.5: Largest expansion (5.38x) prioritizing model capacity
| Feature | AFM Base | Llama 3.2 3B | Qwen2.5 3B |
|---|---|---|---|
| Attention Type | Multi-Head (Segment 0), Query-only (Segment 1) | Grouped Query (GQA) | Grouped Query (GQA) |
| Query Heads | 16 | ~16 | 16 |
| Key/Value Heads | 16 (S0), Reuse (S1) | ~2-4 (estimated) | 2 |
| QKV Projections | Q: 2048→2048; K,V: 2048→256 | Standard GQA | Standard GQA |
| Attention Optimization | KV Quantization + Reuse | Standard | Standard |
Unique Features:
- Two-Segment Architecture: 35 standard + 21 KV-reuse layers
- Native LoRA Integration: Built-in adapter support (1,173 components)
- Extended Context: RoPE θ=500K enables 50x context extension
- Apple Silicon Optimization: 2-bit quantization for production deployment
- Speculative Decoding: Dual model system (3B base + 48M draft)
Design Philosophy: Extreme on-device optimization with quality preservation
Unique Features:
- Pruning + Distillation: Derived from Llama 3.1 8B through knowledge transfer
- Mobile Optimization: Designed specifically for edge deployment
- Grouped Query Attention: Optimized KV cache for efficiency
- 128K Context: Long context support out of the box
- Multilingual: 8+ officially supported languages
Design Philosophy: Knowledge distillation for mobile-first deployment
Unique Features:
- Aggressive FFN Scaling: 5.38x expansion for maximum capacity
- Grouped Query Attention: 16 query heads, 2 KV heads for efficiency
- Large Vocabulary: 151,936 tokens for comprehensive coverage
- Sliding Window Attention: Optional for very long sequences
- Standardized Architecture: Compatible with broader Qwen ecosystem
Design Philosophy: Maximum capability within 3B parameter constraint
| Environment | AFM Base | Llama 3.2 3B | Qwen2.5 3B |
|---|---|---|---|
| Production | ~1.0-1.1GB | ~2.5-3GB | ~2.5-3GB |
| Training (FP16) | ~6.4GB | ~6-7GB | ~6-7GB |
| LoRA Training | 254MB | N/A | N/A |
| Metric | AFM | Llama 3.2 | Qwen2.5 |
|---|---|---|---|
| FLOPs/Token | ~6B (optimized) | ~6.5B | ~7.5B |
| Memory Efficiency | Highest (quantization) | High | Standard |
| Inference Speed | Fastest (speculative) | Fast | Standard |
All three models share fundamental transformer design patterns:
- RMSNorm: All use RMSNorm for better training stability
- RoPE: Rotary Position Embedding for position encoding
- GQA/Attention Optimization: Various forms of attention efficiency
- SwiGLU/SiLU: Gated activation functions for better expressiveness
- Hidden Dimension: All use 2,048 hidden dimensions
- Head Dimension: Consistent 128 dimensions per attention head
- Vocabulary Size: Large vocabularies (128K-153K range) for multilingual support
- Context Extension: All support extended context beyond training window
- Parameter Efficiency: All designed for <3.5B parameter constraints
- Memory Optimization: Various strategies to reduce inference memory
- Quantization Support: All models support some form of quantization
- AFM: Conservative 3.25x (efficiency-first)
- Llama 3.2: Moderate ~4.0x (balanced approach)
- Qwen2.5: Aggressive 5.38x (capacity-first)
- AFM: Most complex (two-segment hybrid + LoRA integration)
- Llama 3.2: Simplified (pruned from larger model)
- Qwen2.5: Standard (scaled transformer architecture)
- AFM: Hardware-specific (Apple Silicon ANE)
- Llama 3.2: Cross-platform mobile optimization
- Qwen2.5: General-purpose with optional optimizations
- AFM: Pre-trained base + LoRA adaptation
- Llama 3.2: Pruning + knowledge distillation
- Qwen2.5: Traditional scaled pre-training
Optimal For:
- Apple ecosystem applications
- Privacy-sensitive on-device AI
- Applications requiring fine-tuning (LoRA)
- Long-context processing (up to 205K tokens)
Optimal For:
- Cross-platform mobile applications
- Edge AI with standard hardware
- Multilingual dialogue systems
- General-purpose text generation
Optimal For:
- Applications requiring maximum 3B model capability
- Cloud-edge hybrid deployments
- Complex reasoning tasks
- Extensive multilingual support
- AFM: Two-segment hybrid design with native LoRA integration
- Llama 3.2: Advanced knowledge distillation from larger models
- Qwen2.5: Optimized standard architecture with aggressive scaling
- AFM: Extreme quantization (2-bit weights) + KV reuse
- Llama 3.2: Pruning-based size reduction with performance retention
- Qwen2.5: Grouped Query Attention for KV cache efficiency
- AFM: Longest theoretical context (205K via RoPE scaling)
- Llama 3.2: Long practical context (128K)
- Qwen2.5: Standard long context (32K) with sliding window option
AFM (Apple): The most architecturally distinctive of the three, with its two-segment design, native LoRA integration, and aggressive optimization for Apple Silicon. Prioritizes on-device efficiency while maintaining quality through speculative decoding.
Llama 3.2 3B (Meta): Exemplifies sophisticated knowledge transfer techniques, creating an efficient 3B model through pruning and distillation from larger models. Balances cross-platform compatibility with mobile optimization.
Qwen2.5 3B (Alibaba): Demonstrates aggressive scaling within parameter constraints, using a large FFN expansion ratio to maximize model capacity. Represents a more traditional but highly optimized transformer approach.
All three models demonstrate convergence on:
- RMSNorm + RoPE + GQA/Attention Optimization
- 2,048 hidden dimensions as optimal for 3B models
- Gated activations (SwiGLU/SiLU) for better expressiveness
- Large vocabularies for multilingual capability
- Mobile/edge deployment as primary target
The models diverge significantly in:
- FFN expansion strategies (3.25x vs 4.0x vs 5.38x)
- Architectural complexity (hybrid vs simplified vs standard)
- Optimization focus (hardware-specific vs cross-platform vs general)
- Training methodology (LoRA vs distillation vs traditional)
This analysis reveals how different organizations approach the same 3B parameter constraint with dramatically different architectural philosophies, each optimized for their specific ecosystem and use case requirements.
This report is based on analysis of the model architecture, configuration files, checkpoint specifications, and experimental verification. Some performance characteristics are estimated based on architectural analysis and industry standards.