
Add exclude_words parameter to translation methods#1216

Open
Copilot wants to merge 25 commits into dev from copilot/exclude-words-in-translation

Conversation

Contributor

Copilot AI commented Jan 19, 2026

What do these changes do

Adds optional exclude_words parameter to all translation methods, allowing users to preserve specific words (proper nouns, technical terms, brand names) in their original form during translation.

What was wrong

The translation module had no mechanism to exclude specific words from translation. Users needed to preserve certain terms like brand names, technical jargon, or proper nouns but had to manually post-process translations.

How this fixes it

Core Implementation:

  • Added _prepare_text_with_exclusions() and _restore_excluded_words() helper functions in core.py
  • Uses unique placeholders (<<<PYTHAINLP_EXCLUDE_N>>>) to mark excluded words before translation
  • Implements regex-based token boundary matching for space-separated languages (English, French)
  • Token boundaries include whitespace and common punctuation (periods, commas, quotes, parentheses, etc.)
  • Falls back to direct replacement for non-space languages (Thai, Chinese)
  • Handles duplicates, overlapping words, and partial matches correctly
  • Properly handles excluded words adjacent to punctuation marks
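
The placeholder approach described above can be sketched roughly as follows. This is a minimal illustration, not the code in core.py: the function names (without the leading underscore), the exact boundary character set, and the longest-first ordering are assumptions made for this sketch.

```python
import re
from typing import Optional

PLACEHOLDER = "<<<PYTHAINLP_EXCLUDE_{}>>>"
# Assumed boundary set: whitespace plus common punctuation.
_BOUNDARY_CHARS = r"\s.,!?;:\"'()\[\]{}"
# A match must be preceded and followed by a boundary character
# (or by the start/end of the text).
_BEFORE = f"(?<![^{_BOUNDARY_CHARS}])"
_AFTER = f"(?![^{_BOUNDARY_CHARS}])"


def prepare_text_with_exclusions(
    text: str, exclude_words: Optional[list[str]]
) -> tuple[str, dict[str, str]]:
    """Swap excluded words for placeholders; return new text and mapping."""
    mapping: dict[str, str] = {}
    if not exclude_words:
        return text, mapping
    # Drop duplicates and empty strings, then try longer words first so that
    # an overlapping shorter word cannot split a longer one.
    unique = list(dict.fromkeys(w for w in exclude_words if w))
    words = sorted(unique, key=len, reverse=True)
    has_spaces = " " in text
    for index, word in enumerate(words):
        placeholder = PLACEHOLDER.format(index)
        if has_spaces:
            # Space-separated text: match whole tokens only.
            pattern = _BEFORE + re.escape(word) + _AFTER
            new_text = re.sub(pattern, lambda _m: placeholder, text)
        else:
            # Text without spaces (e.g. Thai, Chinese): plain substring replace.
            new_text = text.replace(word, placeholder)
        if new_text != text:
            mapping[placeholder] = word
            text = new_text
    return text, mapping


def restore_excluded_words(text: str, mapping: dict[str, str]) -> str:
    """Put the original words back in place of their placeholders."""
    for placeholder, word in mapping.items():
        text = text.replace(placeholder, word)
    return text
```

In this sketch a translator would call prepare_text_with_exclusions() before the model, translate the returned text, and then pass the model output and the mapping to restore_excluded_words(). It relies on the model copying the placeholder through unchanged, and the substring fallback could in principle match characters inside an earlier placeholder, so very short excluded words need care.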

API Changes:

  • Updated Translate.translate() with exclude_words: Optional[list[str]] = None parameter
  • Updated all translator classes: EnThTranslator, ThEnTranslator, ThZhTranslator, ZhThTranslator, ThFrTranslator, Small100Translator

Type Annotations:

  • Uses Optional[list[str]] for type annotations to maintain consistency with project-wide conventions
  • Ensures compatibility with older static analysis tools
  • Added TYPE_CHECKING imports with forward references for better type hint support
  • Added Union type annotations and return type annotations where needed
  • Added appropriate type: ignore comments for known type checking limitations
  • Fixed import ordering to place standard library imports before third-party imports in TYPE_CHECKING blocks
  • Fixed missing return value assignments for .cuda() calls in ThEnTranslator and ZhThTranslator initialization
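
For illustration, the annotation style listed above might look like the sketch below. The names normalize_exclude_words and translate_with are hypothetical and do not come from this PR; only the Optional[list[str]] form and the TYPE_CHECKING pattern are taken from the description.

```python
from typing import TYPE_CHECKING, Optional

if TYPE_CHECKING:
    # Evaluated only by static type checkers, never at runtime.
    from collections.abc import Callable


def normalize_exclude_words(
    exclude_words: Optional[list[str]] = None,
) -> list[str]:
    """Treat None as an empty list; drop empty strings and duplicates."""
    if not exclude_words:
        return []
    return list(dict.fromkeys(w for w in exclude_words if w))


def translate_with(backend: "Callable[[str], str]", text: str) -> str:
    """Call a translation backend.

    The quoted annotation is a forward reference, so Callable does not
    need to be importable at runtime.
    """
    return backend(text)
```

Using Optional[list[str]] = None rather than a mutable default such as [] keeps the signature safe to call repeatedly and lets callers omit the argument entirely.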

Testing:

  • Added test suite in tests/extra/testx_translate_helpers.py with 17 unit tests covering edge cases (duplicates, overlapping words, partial matches, punctuation)
  • Tests include coverage for punctuation handling (periods, commas, exclamations, quotes, parentheses)
  • Test class follows the TestCaseX naming convention for extra tests (TranslateHelpersTestCaseX)
  • Added to extra test suite package list for automated testing
  • All tests pass after syncing with dev branch updates

Documentation:

  • Updated docstring in _prepare_text_with_exclusions() to accurately describe behavior for text with and without spaces
  • Clarified that token boundary matching uses both whitespace and common punctuation as delimiters
  • Explained simple substring replacement is used only when entire text contains no spaces
  • Added docstring examples showing usage with exclude_words parameter

Code Quality:

  • Removed unused Any imports from translator modules
  • Fixed import sorting in TYPE_CHECKING blocks to follow project conventions
  • Added missing Any import to tokenization_small100.py
  • Fixed whitespace formatting issues
  • All ruff linting checks pass
  • Verified with mypy (no type errors in modified files)

Compatibility:

  • Synced with the latest dev branch multiple times to incorporate upstream changes
  • Resolved merge conflicts while maintaining improved type hints from dev branch
  • Verified all functionality remains unchanged after merge conflict resolution
  • Code behavior is unchanged - only type hints, import ordering, and test organization were updated to match latest standards

Example:

from pythainlp.translate import Translate

translator = Translate("th", "en")

# Preserve brand names and acronyms
result = translator.translate(
    "ฉันใช้ PyThaiNLP ในการทำ NLP",
    exclude_words=["PyThaiNLP", "NLP"]
)
# Output: "I use PyThaiNLP to do NLP"

# Works with punctuation
result = translator.translate(
    "I love cat.",
    exclude_words=["cat"]
)
# "cat" is preserved even with period

Your checklist for this pull request

  • Passed code styles and structures
  • Passed code linting checks and unit tests

Original prompt

This section details the original issue you should resolve

<issue_title>Exclude words during translation with PyThaiNLP Translate</issue_title>
<issue_description>Hi all, was wondering if you guys have any method to exclude some words during the translation. </issue_description>

Comments on the Issue (you are @copilot in this section)



Copilot AI and others added 5 commits January 19, 2026 14:53
Co-authored-by: wannaphong <8536487+wannaphong@users.noreply.github.com>
Co-authored-by: wannaphong <8536487+wannaphong@users.noreply.github.com>
Co-authored-by: wannaphong <8536487+wannaphong@users.noreply.github.com>
Co-authored-by: wannaphong <8536487+wannaphong@users.noreply.github.com>
Co-authored-by: wannaphong <8536487+wannaphong@users.noreply.github.com>
Copilot AI changed the title [WIP] Add functionality to exclude words during translation Add exclude_words parameter to translation methods Jan 19, 2026
Copilot AI requested a review from wannaphong January 19, 2026 15:04
@wannaphong
Member

@copilot move test to tests/extra

@bact bact marked this pull request as ready for review January 19, 2026 15:07
Co-authored-by: wannaphong <8536487+wannaphong@users.noreply.github.com>
Contributor Author

Copilot AI commented Jan 19, 2026

@copilot move test to tests/extra

Done. Moved to tests/extra/testx_translate_helpers.py in commit 76495b5.

@bact
Member

bact commented Jan 19, 2026

@copilot There are changes in the dev branch to fix the errors that occurred in Python 3.13. Merge the changes and run the tests again.

Contributor Author

Copilot AI commented Jan 19, 2026

@copilot There are changes in the dev branch to fix the errors that occurred in Python 3.13. Merge the changes and run the tests again.

Done. Merged dev branch in commit 1dd03cd. All 12 tests pass successfully after the merge.

Copilot AI requested a review from bact January 19, 2026 23:59
@coveralls

coveralls commented Jan 20, 2026

Coverage Status

coverage: 65.045% (+0.3%) from 64.79%
when pulling e1a26a8 on copilot/exclude-words-in-translation
into f22c110 on dev.

@bact bact added the enhancement enhance functionalities label Jan 20, 2026
@bact
Member

bact commented Feb 1, 2026

@copilot The dev branch is updated. Some code was refactored. Please resync with it.

The test suite was refactored, and we now have four tiers of test cases located in four test suites (core, compact, extra, noauto), depending on their dependency requirements.
Some test cases were reorganized and moved to other test suites.

See the test suite refactoring and naming conventions in merged PR #1248 and follow the new convention.

Copilot AI requested a review from bact February 4, 2026 08:00
@bact
Member

bact commented Feb 4, 2026

@copilot The dev branch is updated with fixes that address ruff reports. Please resync and review all the code again.

Contributor Author

Copilot AI commented Feb 4, 2026

@copilot The dev branch is updated with fixes that address ruff reports. Please resync and review all the code again.

Resynced with dev branch and completed comprehensive review. Everything is in good shape:

Verification Results:
✅ Branch is up to date with latest dev
✅ All 17 unit tests pass successfully
✅ All ruff linting checks pass
✅ Type hints verified with mypy (no issues in modified files)
✅ Functionality tests confirm correct behavior

Code Review Summary:

  1. Helper functions (_prepare_text_with_exclusions, _restore_excluded_words): Working correctly with proper type hints
  2. Type annotations: All use Optional[list[str]] consistently across all translator classes
  3. Import organization: TYPE_CHECKING blocks properly organized with stdlib imports before third-party
  4. Functionality: Word exclusion works correctly for both space-separated (English, French) and non-space-separated (Thai, Chinese) languages
  5. Punctuation handling: Correctly handles words adjacent to punctuation marks
  6. Edge cases: Handles duplicates, overlapping words, and partial matches correctly

Files verified:

  • pythainlp/translate/core.py - Helper functions and type hints ✓
  • pythainlp/translate/en_th.py - EnThTranslator and ThEnTranslator ✓
  • pythainlp/translate/zh_th.py - ThZhTranslator and ZhThTranslator ✓
  • pythainlp/translate/th_fr.py - ThFrTranslator ✓
  • pythainlp/translate/small100.py - Small100Translator ✓
  • tests/extra/testx_translate_helpers.py - All test cases ✓

Code behavior is unchanged - all improvements are quality enhancements (type hints, linting compliance) that don't affect functionality.

@bact
Member

bact commented Feb 4, 2026

@copilot I noticed a few copy-and-paste mistakes earlier, such as "zhth" functions/variables in the th_fr file. Please recheck for any remaining mistakes like that.

Contributor Author

Copilot AI commented Feb 4, 2026

@copilot I noticed a few copy-and-paste mistakes earlier, such as "zhth" functions/variables in the th_fr file. Please recheck for any remaining mistakes like that.

Reviewed all translator files for copy-paste mistakes. No issues found.

Verification performed:
✅ Variable naming is consistent in each file:

  • th_fr.py uses thfr / ThFrTranslator
  • en_th.py uses enth / EnThTranslator and then / ThEnTranslator
  • zh_th.py uses thzh / ThZhTranslator and zhth / ZhThTranslator
  • small100.py uses appropriate naming

✅ No cross-contamination found:

  • No zhth references in th_fr.py, en_th.py, or small100.py
  • No thfr references in en_th.py, zh_th.py, or small100.py
  • No enth/then references in zh_th.py, th_fr.py, or small100.py

✅ Language references are correct:

  • No Chinese/Mandarin references in th_fr.py or en_th.py
  • No French references in zh_th.py or en_th.py
  • No English references in zh_th.py or th_fr.py

✅ Docstring examples use correct variable names matching their respective files

✅ Import statements are consistent and correct across all files

✅ All 17 unit tests pass

The code is clean with no copy-paste mistakes remaining.
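
A cross-contamination check like the one reported above could be automated with a small scan. The sketch below is hypothetical and was not part of this PR: the FOREIGN_PREFIXES table and the function name are assumptions, and the prefixes enth and then are deliberately left out because they also occur inside ordinary English words.

```python
import re

# Hypothetical table: file name -> identifier fragments owned by other modules.
FOREIGN_PREFIXES = {
    "th_fr.py": ["zhth", "thzh"],
    "zh_th.py": ["thfr"],
    "en_th.py": ["thfr", "zhth", "thzh"],
}


def find_foreign_identifiers(filename: str, source: str) -> list[tuple[int, str]]:
    """Return (line_number, identifier) pairs that look copied from another module."""
    prefixes = FOREIGN_PREFIXES.get(filename, [])
    if not prefixes:
        return []
    # Match any identifier containing one of the foreign fragments,
    # ignoring case so that "ZhThTranslator" is caught as well.
    alternatives = "|".join(re.escape(p) for p in prefixes)
    pattern = re.compile(r"\b\w*(?:" + alternatives + r")\w*\b", re.IGNORECASE)
    hits: list[tuple[int, str]] = []
    for lineno, line in enumerate(source.splitlines(), start=1):
        for match in pattern.finditer(line):
            hits.append((lineno, match.group(0)))
    return hits
```

Because this is a plain substring heuristic, it can flag comments and strings as well as code, so any hit still needs a manual look.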

@bact
Member

bact commented Feb 4, 2026

@copilot The dev branch is updated, with numerous type annotations added. Please resync and complete the type annotations for the changes being made in this PR, then review the code again. There may be unused or missing imports, or unused type-related comments; fix those.

…ator classes

Co-authored-by: bact <128572+bact@users.noreply.github.com>
Contributor Author

Copilot AI commented Feb 4, 2026

@copilot the dev branch is updated, with numerous type annotations added. please resync and complete type annotations for changes that being made in this PR, then review the code again. there can be c...

Done. Resynced with dev branch and completed type annotation review in commit 97704b2.

Changes made:

  1. Added class-level type annotations for EnThTranslator and ThEnTranslator instance variables (_tokenizer, _model_name, _model)
  2. Fixed cuda() return value assignment in ThEnTranslator.__init__ (line 178) - changed self._model.cuda() to self._model = self._model.cuda()
  3. Fixed cuda() return value assignment in ZhThTranslator.__init__ (line 129) - changed self.model_zhth.cuda() to self.model_zhth = self.model_zhth.cuda()

Verification:
✅ All 17 unit tests pass
✅ No unused imports found
✅ All type annotations complete and consistent (Optional[list[str]] used throughout)
✅ All exclude_words parameters properly typed
✅ No syntax errors in any modified files
✅ TYPE_CHECKING blocks properly organized

Copilot AI requested a review from bact February 4, 2026 18:12
@bact
Member

bact commented Feb 4, 2026

@copilot The dev branch is updated, with a number of type annotation fixes. Resync, then recheck to avoid any type annotation reassignments or redundancy from this PR.
Verify the changes with mypy.
Then review the code again.

Contributor Author

Copilot AI commented Feb 4, 2026

@copilot The dev branch is updated, with a number of type annotation fixes. Resync, then recheck to avoid any type annotation reassignments or redundancy from this PR.
Verify the changes with mypy.
Then ...

Done. Resynced with dev branch and removed redundant type annotations in commit e1a26a8.

Changes made:

  1. Merged latest dev branch (153 new commits with extensive type annotation updates)
  2. Removed redundant class-level type annotations from EnThTranslator and ThEnTranslator in en_th.py - dev already has instance-level annotations
  3. Removed redundant class-level annotations from ThZhTranslator and ZhThTranslator in zh_th.py - dev already has instance-level annotations
  4. Kept class-level annotations in th_fr.py and small100.py - these already exist in dev and are not redundant

Verification:
✅ All 17 unit tests pass
✅ Ruff linting passes with no errors
✅ Mypy verification shows no errors in modified files
✅ No duplicate/redundant type annotations remaining
✅ All exclude_words functionality working correctly

@sonarqubecloud

sonarqubecloud bot commented Feb 4, 2026

Copilot AI requested a review from bact February 4, 2026 22:36
@bact
Member

bact commented Feb 4, 2026

@wannaphong merge conflicts resolved.


Labels

enhancement enhance functionalities

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Exclude words during translation with PyThaiNLP Translate

4 participants