@@ -4,7 +4,7 @@ AI-powered content intelligence for WordPress using Apple Foundation Models.

 ## Features

--  **Excerpt Generation** - Generate 3 excerpt variations in multiple languages and styles
+-  **Excerpt Generation** - Generate 3 excerpt variations in 8 languages with configurable length/style
 -  **Tag Suggestions** - AI-powered tag recommendations
 -  **Post Summaries** - Automatic content summarization

@@ -20,13 +20,16 @@ let generator = ExcerptGeneration(length: .medium, style: .engaging)
 let excerpts = try await generator.generate(for: postContent)
 ```

-**Languages**: English, Spanish, French, German, Japanese, Mandarin, Russian
+**Languages**: English, Spanish, French, German, Italian, Portuguese, Japanese, Chinese
 **Lengths**: Short (15-35 words), Medium (40-80 words), Long (90-130 words)
 **Styles**: Engaging, Professional, Conversational, Formal, Witty

 ## Testing

-### Standard Tests
+### Standard XCTest
+
+Run standard tests that verify language, length, and diversity:
+
 ```bash
 cd Modules
 xcodebuild test \
@@ -35,126 +38,106 @@ xcodebuild test \
   -only-testing:WordPressIntelligenceTests
 ```

-Tests excerpt generation for 8 languages with automatic language verification, length compliance, and performance measurement. Tests emit both formatted output and structured JSON markers for optional Claude-based evaluation.
+### Quality Evaluation

-### Quality Evaluation (Optional)
+Evaluate AI-generated content quality using Claude scoring. Requires [Claude CLI](https://github.com/anthropics/claude-cli).

-Evaluate AI-generated content quality with Claude CLI:
+**Location**: `Modules/Tests/WordPressIntelligenceTests/`

 ```bash
-# Setup (one-time)
-pip install claude-cli && claude configure
-
-# Run evaluations
+# Quick start
 cd Modules/Tests/WordPressIntelligenceTests
-
-# Run all excerpt tests (default)
-./evaluate-with-claude.sh
-
-# Run tag suggestion tests
-./evaluate-with-claude.sh --test-type tags
-
-# Run post summary tests
-./evaluate-with-claude.sh --test-type summary
-
-# Run a specific test
-./evaluate-with-claude.sh --only-testing "ExcerptGenerationTests/excerptGenerationEnglish(parameters:)"
-./evaluate-with-claude.sh --test-type tags --only-testing "IntelligenceSuggestedTagsTests/testEnglishPost()"
-./evaluate-with-claude.sh --test-type summary --only-testing "PostSummaryTests/testEnglishBlogPost()"
+make                             # Show all available commands
+make eval                        # Run full evaluation (all test types)
+make eval-quick                  # Run English excerpt evaluation
+make eval TESTS="excerpts"       # Run only excerpt tests
+make eval TESTS="excerpts tags"  # Run excerpt and tag tests
+make eval-tags                   # Evaluate tag suggestions
+make eval-summary                # Evaluate post summaries
+make open                        # Open latest HTML report
 ```

-#### Options
+**Common targets**:
+- `make eval` - Run full evaluation for all test types (excerpts, tags, summary)
+- `make eval TESTS="excerpts"` - Run only specific test types
+- `make eval-quick` - Fast evaluation (English excerpts only)
+- `make rebuild-improve` - Regenerate HTML with mock improvements (for UI development)
+- `make open` - Open latest evaluation report
+- `make help` - Show all available commands

-| Option | Values | Description |
-|--------|--------|-------------|
-| `--test-type` | `excerpts`, `tags`, `summary` | Test type to run (default: `excerpts`) |
-| `--model` | `sonnet`, `opus`, `haiku` | Claude model for evaluation (default: `sonnet`) |
-| `--simulator` | Simulator name | iOS simulator (default: `iPhone 16 Pro`) |
-| `--only-testing` | Test identifier | Run specific test only |
-| `--skip-tests` | - | Skip test execution, re-evaluate previous results |
+For advanced options and HTML report development, see:
+- `Modules/Tests/WordPressIntelligenceTests/Makefile`
+- `Modules/Tests/WordPressIntelligenceTests/lib/DEVELOPMENT.md`

-#### Examples
+### Evaluation Output

-```bash
-# Use Claude Opus for evaluation
-./evaluate-with-claude.sh --model opus
+Results are saved to `/tmp/WordPressIntelligence-Tests/evaluation-<timestamp>/`:

-# Use different simulator
-./evaluate-with-claude.sh --simulator "iPhone 15"
+- **`evaluation-report.html`** - Interactive report with filtering, sorting, and baseline comparison
+- **`evaluation-results.json`** - Machine-readable data for CI/CD
+- Console output with a quick summary

-# Re-evaluate existing results
-./evaluate-with-claude.sh --skip-tests
-```
+**HTML Report Features**:
+- Sortable columns (test name, status, score, duration)
+- Filter by language, status, or comparison results
+- Baseline comparison with delta indicators (↑ improved, ↓ regressed, = unchanged)
+- Click any test to see detailed scores, generated content, and Claude feedback
+- Score distribution dots (●●●) show pass/warn/fail for each excerpt

-#### Evaluation Output
+### Scoring

-Results saved to `/tmp/WordPressIntelligence-Tests/evaluation-<timestamp>/`:
+Quality scores use weighted criteria (1-10 scale):

-- **`evaluation-results.json`** - Machine-readable data for CI/CD integration
-- **`evaluation-report.html`** - Interactive report with filters, sorting, and baseline comparison
-- **Console output** - Quick summary with statistics and category averages
+**Excerpt Generation**:
+- Language Match (3.0×), Grammar (2.0×), Relevance (2.0×) - critical factors
+- Hook Quality (1.5×), Key Info (1.5×), Length, Style, Standalone, Engagement (1.0× each)
+- Diversity: structural, angle, length, lexical variation

-Open HTML report: `open /tmp/WordPressIntelligence-Tests/evaluation-*/evaluation-report.html`
+**Passed**: Overall ≥ 7.0 AND no critical failures
+**Needs Improvement**: 6.0-6.9 OR any score < 4.0
+**Failed**: Language < 8.0 OR Grammar < 6.0 OR Overall < 6.0

-#### HTML Report Features
+*Note: Tag and summary evaluations use different criteria optimized for their use cases.*
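+
+How the excerpt weights and status thresholds combine can be sketched in a few lines of Python (the criterion keys and function names below are illustrative, not the actual `lib/evaluators.py` API):
+
+```python
+# Illustrative sketch only; the real scoring logic lives in lib/evaluators.py.
+WEIGHTS = {
+    "language_match": 3.0, "grammar": 2.0, "relevance": 2.0,
+    "hook_quality": 1.5, "key_info": 1.5,
+    "length": 1.0, "style": 1.0, "standalone": 1.0, "engagement": 1.0,
+}
+
+def overall_score(scores: dict[str, float]) -> float:
+    """Weighted average of per-criterion scores (each on a 1-10 scale)."""
+    return sum(scores[k] * w for k, w in WEIGHTS.items()) / sum(WEIGHTS.values())
+
+def status(scores: dict[str, float]) -> str:
+    """Apply the documented Passed / Needs Improvement / Failed thresholds."""
+    overall = overall_score(scores)
+    if scores["language_match"] < 8.0 or scores["grammar"] < 6.0 or overall < 6.0:
+        return "failed"
+    if overall < 7.0 or any(s < 4.0 for s in scores.values()):
+        return "needs_improvement"
+    return "passed"
+```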

-The interactive HTML report displays evaluation results in a sortable table:
+## Extending Tests

-**Table Columns:**
-- **Test Name** - Test identifier and parameters
-- **Status** - Pass/fail badge with score distribution indicators
-  - For excerpt tests: colored dots (●●●) show individual excerpt scores
-    - Green (●): score ≥ 7.0 (passed)
-    - Yellow (●): 6.0 ≤ score < 7.0 (needs improvement)
-    - Red (●): score < 6.0 (failed)
-  - Hover over dots to see individual scores
-- **Score** - Average score across all excerpts/attempts
-- **Δ Baseline** - Score change vs. baseline (shown only when comparison enabled)
-  - Green (↑): improvement
-  - Red (↓): regression
-  - Gray (=): unchanged
-- **Duration** - Test execution time
+### Adding Test Cases

-**Baseline Comparison:**
-1. Load a baseline JSON file using the file picker
-2. Table automatically shows the Δ Baseline column with score deltas
-3. Click "Clear Comparison" to hide the baseline column and return to single-report view
+1. Add test data to `lib/config.py`:
+```python
+"new_test_case": TestConfig(
+    original_content="...",
+    language="english",
+    # ... other parameters
+)
+```

-## Evaluating Results
+2. Update the `Makefile` if adding a new test type:
+```makefile
+eval-newtype:
+	@./lib/evaluate-with-claude.sh --test-type newtype
+```

-**Standard Tests** - Automatic verification for excerpts:
-- ✅ Language matches input (via NLLanguageRecognizer)
-- ✅ Word count within target range (with warnings for minor deviations)
-- ✅ 3 diverse variations generated (Levenshtein distance ≥ 15%)
-- ✅ HTML properly stripped
+### Customizing Evaluation Criteria

-**Evaluation Tests** - Weighted quality scores (1-10, example for excerpts):
+Edit scoring logic in `lib/evaluators.py`. Each test type has its own evaluator class with weighted criteria and thresholds.
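+
+A new evaluator might follow the same shape as the existing ones (this is a hypothetical outline, not the actual base class in `lib/evaluators.py`):
+
+```python
+# Hypothetical outline of an evaluator for a new test type; consult the
+# existing classes in lib/evaluators.py for the real structure and hooks.
+class NewTypeEvaluator:
+    weights = {"relevance": 2.0, "grammar": 2.0, "clarity": 1.0}
+    pass_threshold = 7.0
+
+    def score(self, criterion_scores: dict[str, float]) -> float:
+        """Combine Claude's per-criterion scores into a weighted overall score."""
+        weighted = sum(criterion_scores[k] * w for k, w in self.weights.items())
+        return weighted / sum(self.weights.values())
+
+    def passed(self, criterion_scores: dict[str, float]) -> bool:
+        """Pass when the weighted overall score clears the threshold."""
+        return self.score(criterion_scores) >= self.pass_threshold
+```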

-Status thresholds:
-- ✅ **Passed**: Overall ≥ 7.0 AND no critical failures
-- ⚠️ **Needs Improvement**: 6.0 ≤ Overall < 7.0 OR any score < 4.0
-- ❌ **Failed**: Language < 8.0 OR Grammar < 6.0 OR Overall < 6.0
+### Developing HTML Report

-Weighted criteria (higher weight = more important):
-- **Language Match** (3.0×) - Correct language (critical)
-- **Grammar** (2.0×) - No grammatical errors (critical)
-- **Relevance** (2.0×) - Captures main message
-- **Hook Quality** (1.5×) - Enticing opening
-- **Key Info** (1.5×) - Preserves critical facts
-- Length, Style, Standalone, Engagement (1.0× each)
+For fast iteration on HTML report UI without re-running tests:

-Diversity evaluation:
-- Structural (opening styles)
-- Angle (different emphases)
-- Length (sentence variety)
-- Lexical (vocabulary variation)
+```bash
+make rebuild-improve   # Regenerate with mock improvements
+# Edit lib/evaluation-viewer.html
+make rebuild-improve   # Instant preview
+```

-*Note: Tags and summary evaluations use different criteria tailored to their specific requirements.*
+See `lib/DEVELOPMENT.md` for the complete HTML development workflow.

 ## Troubleshooting

 **Tests skipped**: Missing iOS 26 or Apple Intelligence support
-**Language issues**: Check prompt in `ExcerptGeneration.swift`
-**Evaluation fails**: Run `claude configure` to authenticate
+**Language issues**: Check prompt in `Sources/WordPressIntelligence/ExcerptGeneration.swift`
+**Evaluation fails**: Install and configure the Claude CLI: `pip install claude-cli && claude configure`

-See `CLAUDE.md` for project guidelines.
+See `CLAUDE.md` for project development guidelines.