feat: Unify UTF-8 file handling, introduce Unicode normalization. by Marxyz · Pull Request #245 · souzatharsis/podcastfy

Marxyz · 2025-02-20T23:54:10Z

Inspired by Issue #229, this PR adds support for Unicode-based YAML configurations and introduces a basic sanitization step for UTF-8 text generation.

(I initially waited for the original reporter to address this issue; if they have any objections to my approach, I’m happy to hand it over.)

File I/O with UTF-8: all file reads and writes now use encoding='utf-8', so that Unicode characters in YAML configs or transcripts are handled correctly. This resolves errors when loading non-ASCII characters and adds compatibility with multilingual TTS services (e.g., ElevenLabs). It also allows podcast taglines from the config (in my case, in Polish) to be included in the podcast.
Added a __sanitize_unicode_text() method that applies NFKC normalization in generate_qa_content(). The current placement does not look right to me, and I would gladly move it wherever reviewers suggest, but it helps mitigate the unexpected 503 errors I encountered from Edge TTS by removing problematic hidden characters without stripping important diacritics.
Replaced the test webpage (which changed over time and caused test failures) with example.com. RFC 2606 reserves this domain for documentation and examples, so it should remain stable and keep the tests from breaking.
(The original site pointed to the project author’s personal page, so I’m slightly uneasy about removing it, but the change was necessary to fix the failing test.)
Edited the client test case to include Unicode characters (Polish and Japanese) in MOCK_CONVERSATION_CONFIG, verifying that the system can load and process multilingual text correctly.
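
The explicit-encoding pattern from the first item can be sketched roughly as follows. This is a minimal illustration, not code from this PR: the helper names, the sample tagline, and the temporary file are assumptions made for the example. In Python versions where open() falls back to the locale encoding, omitting encoding= can make non-ASCII config text fail to decode on systems whose locale is not UTF-8.

```python
import tempfile
from pathlib import Path


def write_text_utf8(path, text):
    """Write text with an explicit UTF-8 encoding (hypothetical helper)."""
    with open(path, "w", encoding="utf-8") as f:
        f.write(text)


def read_text_utf8(path):
    """Read text with an explicit UTF-8 encoding (hypothetical helper)."""
    with open(path, "r", encoding="utf-8") as f:
        return f.read()


# Illustrative config line with Polish and Japanese characters;
# the key name is an assumption for this sketch.
config_text = 'podcast_tagline: "Za\u017c\u00f3\u0142\u0107 g\u0119\u015bl\u0105 \u30dd\u30c3\u30c9\u30ad\u30e3\u30b9\u30c8"\n'

with tempfile.TemporaryDirectory() as tmp_dir:
    config_path = Path(tmp_dir) / "conversation_config.yaml"
    write_text_utf8(config_path, config_text)
    round_tripped = read_text_utf8(config_path)

# The text survives the round trip regardless of the platform's locale.
print(round_tripped == config_text)
```

Passing the encoding at every call site, rather than relying on environment settings, keeps behavior the same across developer machines and CI.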

…mple.com
- Explicitly set encoding="utf-8" in all open() calls to avoid encoding issues.
- Updated the URL under test from souzatharsis.com to example.com for more stable content.
- Adjusted expected content (website.md) to match the actual output from example.com.
- Ensures consistent behavior across different environments and avoids random test failures.

…ters
Introduce a private method __sanitize_unicode_text() to apply NFKC normalization on the final response. Update the MOCK_CONVERSATION_CONFIG to include Polish and Japanese characters in the podcast tagline. These changes ensure robust handling of multilingual Unicode input and prevent TTS errors caused by hidden or non-standard Unicode characters.
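
For readers unfamiliar with NFKC, the normalization step can be sketched with the standard library alone. The function below is a hypothetical stand-in for the PR's __sanitize_unicode_text(), whose actual body is not shown here; the sample strings are likewise assumptions. NFKC composes base letters with their combining marks and folds compatibility forms (non-breaking spaces, full-width letters, half-width katakana) into their standard equivalents, while leaving ordinary precomposed diacritics intact.

```python
import unicodedata


def sanitize_unicode_text(text: str) -> str:
    """Apply NFKC normalization (hypothetical stand-in for the PR's method)."""
    return unicodedata.normalize("NFKC", text)


# "s" followed by a combining acute accent composes into the single letter U+015B.
composed = sanitize_unicode_text("s\u0301")

# A non-breaking space (U+00A0) is folded into a regular space.
spaced = sanitize_unicode_text("a\u00a0b")

# Full-width Latin "A" (U+FF21) becomes plain ASCII "A".
narrowed = sanitize_unicode_text("\uff21")

# Half-width katakana KA (U+FF76) becomes standard katakana KA (U+30AB).
katakana = sanitize_unicode_text("\uff76")

# Precomposed Polish letters pass through unchanged.
polish = sanitize_unicode_text("za\u017c\u00f3\u0142\u0107")

print(composed == "\u015b", spaced == "a b", narrowed == "A")
```

NFKC folds compatibility forms rather than deleting arbitrary invisible characters, so a character such as a zero-width space would need separate handling if it turned out to matter for a given TTS backend.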

Marxyz added 2 commits February 20, 2025 22:46

Marxyz wants to merge 2 commits into souzatharsis:main from Marxyz:bugfix/fix-encoding-issues-with-yaml-configs
