feat: Unify UTF-8 file handling, introduce Unicode normalization. #245

Open
Marxyz wants to merge 2 commits into souzatharsis:main from Marxyz:bugfix/fix-encoding-issues-with-yaml-configs

Marxyz commented Feb 20, 2025

Inspired by Issue #229, this PR adds support for Unicode-based YAML configurations and introduces a basic sanitization step for UTF-8 text generation.

(I initially waited for the original reporter to address this issue; if they have any objections to my approach, I'm happy to step back.)

  • File I/O with UTF-8: all file reads and writes now pass encoding='utf-8', so Unicode characters in YAML configs or transcripts are handled correctly. This resolves errors when loading non-ASCII characters and improves compatibility with multilingual TTS services (e.g., ElevenLabs). It also allows podcast taglines from the config (in my case, in Polish) to be included in the podcast.

  • Added a __sanitize_unicode_text() method that applies NFKC normalization in generate_qa_content(). The current placement does not look right to me, and I would gladly move it wherever is suggested. It helps mitigate the unexpected 503 errors I encountered from Edge TTS by normalizing problematic hidden characters, without stripping important diacritics.

  • Replaced the test webpage (which changed over time and caused test failures) with example.com. RFC 2606 reserves this domain for documentation and examples, so it should remain stable enough to keep the tests reliable.
    (The original site pointed to the project author’s personal page, so I’m slightly uneasy about removing it, but it was necessary to change it to fix the failing test.)

  • Edited the client test case to include Unicode characters (Polish and Japanese) in MOCK_CONVERSATION_CONFIG, verifying that the system can load and process multilingual text correctly.
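
For reviewers, here is a minimal sketch of the explicit-encoding pattern described in the first bullet. It is not the code from this PR: the helper names write_text_utf8 and read_text_utf8, the file name, and the sample tagline are all assumptions made for illustration.

```python
import tempfile
from pathlib import Path


def write_text_utf8(path, text):
    """Write text with an explicit UTF-8 encoding (hypothetical helper)."""
    with open(path, "w", encoding="utf-8") as f:
        f.write(text)


def read_text_utf8(path):
    """Read text with an explicit UTF-8 encoding (hypothetical helper)."""
    with open(path, "r", encoding="utf-8") as f:
        return f.read()


# Sample tagline with Polish and Japanese characters (illustrative only).
tagline = "Zażółć gęślą jaźń / こんにちは"

with tempfile.TemporaryDirectory() as tmp:
    # Hypothetical config file name, not necessarily the one used in the repo.
    config_path = Path(tmp) / "conversation_config.yaml"
    write_text_utf8(config_path, f'podcast_tagline: "{tagline}"\n')
    loaded = read_text_utf8(config_path)

print(loaded)
```

Without the encoding argument, open() falls back to the platform's locale encoding, which is often not UTF-8 on Windows; that is the kind of environment-dependent failure the explicit argument avoids.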

…mple.com

- Explicitly set encoding="utf-8" in all open() calls to avoid encoding issues.
- Updated the URL under test from souzatharsis.com to example.com for more stable content.
- Adjusted expected content (website.md) to match the actual output from example.com.
- Ensures consistent behavior across different environments and avoids random test failures.
…ters

Introduce a private method __sanitize_unicode_text() to apply NFKC normalization
on the final response. Update the MOCK_CONVERSATION_CONFIG to include Polish and
Japanese characters in the podcast tagline. These changes ensure robust handling
of multilingual Unicode input and prevent TTS errors caused by hidden or
non-standard Unicode characters.
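
As a rough illustration of what NFKC normalization does to such input, the sketch below uses Python's standard unicodedata module. The function sanitize_unicode_text is a hypothetical stand-in for the private method in this PR, whose exact body is not shown here, and the sample string is invented.

```python
import unicodedata


def sanitize_unicode_text(text: str) -> str:
    """Apply NFKC normalization (hypothetical stand-in for the PR's method)."""
    return unicodedata.normalize("NFKC", text)


# Invented sample: "s" and "c" followed by combining acute accents,
# a no-break space (U+00A0), and fullwidth Latin letters A, B, C.
raw = "Czes\u0301c\u0301\u00a0\uff21\uff22\uff23"
clean = sanitize_unicode_text(raw)

# Precomposed Polish diacritics pass through unchanged.
polish = "Za\u017c\u00f3\u0142\u0107"
assert sanitize_unicode_text(polish) == polish

print(repr(raw))
print(repr(clean))
```

NFKC composes the combining accents into precomposed letters and replaces compatibility characters such as the no-break space and fullwidth letters with their plain equivalents. It does not delete every invisible character: a zero-width space (U+200B), for example, has no decomposition and passes through unchanged, so a separate filter would be needed if such characters also prove problematic.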