Commit 31c3b1a

Merge pull request #1291 from PyThaiNLP/copilot/improve-corpus-test-speed
Optimize corpus tests: mock downloads, separate data validation, suppress CLI output
2 parents 5f02646 + 2a817fd commit 31c3b1a

10 files changed: +907 -155 lines changed
.github/workflows/corpus.yml

Lines changed: 68 additions & 0 deletions
@@ -0,0 +1,68 @@
+# SPDX-FileCopyrightText: 2026 PyThaiNLP Project
+# SPDX-FileType: SOURCE
+# SPDX-License-Identifier: Apache-2.0
+
+name: Corpus test
+
+on:
+  push:
+    branches:
+      - dev
+    paths:
+      - ".github/workflows/corpus.yml"
+      - "pythainlp/corpus/**"
+      - "tests/corpus/**"
+  pull_request:
+    branches:
+      - dev
+    paths:
+      - ".github/workflows/corpus.yml"
+      - "pythainlp/corpus/**"
+      - "tests/corpus/**"
+
+# Avoid duplicate runs for the same source branch and repository
+concurrency:
+  group: >-
+    ${{ github.workflow }}-${{
+    github.event.pull_request.head.repo.full_name || github.repository
+    }}-${{ github.head_ref || github.ref_name }}
+  cancel-in-progress: true
+
+jobs:
+  corpus:
+    runs-on: ubuntu-latest
+    permissions:
+      contents: read
+
+    steps:
+      - name: Checkout
+        uses: actions/checkout@v6
+
+      - name: Set up Python
+        uses: actions/setup-python@v6
+        with:
+          python-version: "3.13"
+          cache: "pip"
+
+      - name: Install dependencies
+        run: |
+          pip install --upgrade pip
+          pip install .
+
+      - name: Test corpus catalog
+        env:
+          PYTHONIOENCODING: utf-8
+        run: |
+          python -m unittest discover -s tests/corpus -p "test_catalog*.py" -v
+
+      - name: Test built-in corpus files
+        env:
+          PYTHONIOENCODING: utf-8
+        run: |
+          python -m unittest discover -s tests/corpus -p "test_builtin_*.py" -v
+
+      - name: Test downloadable corpus files
+        env:
+          PYTHONIOENCODING: utf-8
+        run: |
+          python -m unittest discover -s tests/corpus -p "test_downloadable_*.py" -v
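
The `concurrency.group` value in this workflow depends on the `||` operator in GitHub Actions expressions, which yields its right operand when the left one is empty. The Python sketch below only mimics how that key is composed; it is not part of the commit. The function name `concurrency_group`, its parameters, and the repository and branch values are illustrative assumptions.

```python
def concurrency_group(workflow, repository, ref_name, pr_head_repo="", head_ref=""):
    """Mimic the workflow's concurrency group expression (illustrative only).

    pr_head_repo stands in for github.event.pull_request.head.repo.full_name
    and head_ref for github.head_ref; both are empty on push events.
    """
    # Python's `or` falls back to the right operand for empty strings,
    # similar to `||` in a GitHub Actions expression.
    repo = pr_head_repo or repository
    branch = head_ref or ref_name
    return f"{workflow}-{repo}-{branch}"


# Push to dev: no pull request context, so the fallbacks are used.
push_key = concurrency_group("Corpus test", "example-org/example-repo", "dev")

# Pull request from a fork: the head repository and head branch win.
pr_key = concurrency_group(
    "Corpus test",
    "example-org/example-repo",
    "1291/merge",  # ref_name on pull_request events has the form <number>/merge
    pr_head_repo="example-user/example-repo",
    head_ref="fix-tests",
)

print(push_key)  # Corpus test-example-org/example-repo-dev
print(pr_key)    # Corpus test-example-user/example-repo-fix-tests
```

Because the key includes the source repository and branch, two runs share a group only when they come from the same branch of the same repository. With `cancel-in-progress: true`, the newer run then cancels the older one.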

tests/README.md

Lines changed: 26 additions & 1 deletion
@@ -1,3 +1,8 @@
+SPDX-FileCopyrightText: 2026 PyThaiNLP Project
+SPDX-FileType: DOCUMENTATION
+SPDX-License-Identifier: Apache-2.0
+---
+
 # Test suites and execution
 
 To run a test suite, run:
@@ -155,7 +160,6 @@ By separating tests by dependency group, we can:
 - Requires: Internet connection, may involve large downloads
 - Test case class suffix: `TestCaseN`
 
-
 ## Robustness tests (test_robustness.py)
 
 A comprehensive test suite within core tests that tests edge cases important
@@ -169,3 +173,24 @@ for real-world usage:
 - Thai-specific edge cases with combining characters and mixed scripts
 - Multi-engine robustness testing across all core tokenization engines
 - Very long strings that can cause performance issues (issue #893)
+
+## Corpus test (corpus/)
+
+A separate test suite that verifies the integrity, format, parseability,
+and catalog functionality of corpus in PyThaiNLP.
+
+These tests are separate from regular unit tests because they test actual
+file loading and parsing (not mocked), require network access, and
+can be resource intensive.
+
+For detailed information about corpus test, see:
+[tests/corpus/README.md](corpus/README.md)
+
+The corpus test is triggered automatically via GitHub Actions
+when changes are made to `pythainlp/corpus/**` or `tests/corpus/**`.
+
+Run corpus test:
+
+```shell
+python -m unittest tests.corpus
+```
