Version 3.1.3: Fix ChatterBox character switching crashes with short text segments

diodiogod · diodiogod · commit ceba744ba8f2 · 2025-07-18T02:18:54.000-03:00

- Fixed ChatterBox sequential generation bug causing CUDA tensor indexing crashes
- Added dynamic space padding for short text segments in character switching mode
- Space padding preserves speech quality while giving the model enough tokens to avoid the crash
- Improved version bump script to prevent downgrade attempts
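
For readers who want to try the two behavioural changes listed above outside ComfyUI, here is a minimal standalone sketch. It mirrors the logic visible in the diff below, but the function names `pad_short_text` and `is_newer_version` are hypothetical and chosen only for illustration; the real code lives in a method on the TTS node and inline in the bump script's `main()`. The 35-character default is the threshold the commit reports as safe from testing, not a documented ChatterBox limit.

```python
def pad_short_text(text: str, min_length: int = 35) -> str:
    """Right-pad short text with spaces so it reaches min_length characters.

    Hypothetical standalone version of the padding helper added in this commit.
    """
    stripped = text.strip()
    if len(stripped) < min_length:
        # Short input: return the stripped text plus trailing spaces.
        return stripped + " " * (min_length - len(stripped))
    # Long enough: return the input unchanged (not stripped).
    return text


def is_newer_version(new_version: str, current_version: str) -> bool:
    """Return True only if new_version is strictly greater than current_version.

    Versions are compared as tuples of integers, so "3.1.10" > "3.1.9".
    Hypothetical helper; the bump script does this comparison inline.
    """
    new_parts = tuple(map(int, new_version.split(".")))
    current_parts = tuple(map(int, current_version.split(".")))
    return new_parts > current_parts


if __name__ == "__main__":
    padded = pad_short_text("Ok.")
    print(repr(padded), len(padded))
    print(is_newer_version("3.1.3", "3.1.2"))
    print(is_newer_version("3.1.2", "3.1.2"))
```

Comparing integer tuples rather than raw strings matters here: a plain string comparison would rank "3.1.9" above "3.1.10". A same-version or older target is rejected, which matches the new hard error in the bump script.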
diff --git a/CHANGELOG.md b/CHANGELOG.md
@@ -5,6 +5,13 @@ All notable changes to this project will be documented in this file.
 The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/),
 and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).
 
+## [3.1.3] - 2025-07-18
+
+### Fixed
+
+- ChatterBox character switching crashes with short text segments by implementing dynamic space padding
+- Sequential generation CUDA tensor indexing errors in character switching mode
+- Version bump script now prevents downgrade attempts
 ## [3.1.2] - 2025-07-17
 
 ### Added
diff --git a/README.md b/README.md
@@ -6,7 +6,7 @@
 [![Forks][forks-shield]][forks-url]
 [![Dynamic TOML Badge][version-shield]][version-url]
 
-# ComfyUI ChatterBox SRT Voice (diogod) v3.1.2
+# ComfyUI ChatterBox SRT Voice (diogod) v3.1.3
 
 *This is a refactored node, originally created by [ShmuelRonen](https://github.com/ShmuelRonen/ComfyUI_ChatterBox_Voice).*
 
diff --git a/chatterbox_srt/__init__.py b/chatterbox_srt/__init__.py
@@ -4,7 +4,7 @@
 """
 
 # Version info
-__version__ = "3.1.2"
+__version__ = "3.1.3"
 __author__ = "Diogod"
 
 # Import the new SRT modules
diff --git a/core/__init__.py b/core/__init__.py
@@ -4,7 +4,7 @@
 """
 
 # Version info
-__version__ = "3.1.2"
+__version__ = "3.1.3"
 __author__ = "Diogod"
 
 # Make imports available at package level
diff --git a/docs/test_cases.txt b/docs/test_cases.txt
@@ -0,0 +1,126 @@
+CHATTERBOX CHARACTER SWITCHING BUG TEST CASES
+============================================
+
+Test 1 fails, log:
+
+📦 Loading local ChatterBox models from: J:\stablediffusion1111s2\Data\Packages\ComfyUIPy129\ComfyUI\models\chatterbox
+input frame rate=25
+loaded PerthNet (Implicit) at step 250,000
+✅ Successfully loaded all local ChatterBox models
+🎭 ChatterBox: Character switching mode - found characters: narrator, female_01, male_01
+🔄 Using main voice for character 'narrator' (not found in voice folders)
+🎭 Using character voice for 'female_01'
+🎭 Using character voice for 'male_01'
+🎤 Generating ChatterBox segment 1/6 chunk 1/1 for 'narrator'...
+Sampling:   0%|                                                                               | 0/1000 [00:00<?, ?it/s]We detected that you are passing `past_key_values` as a tuple of tuples. This is deprecated and will be removed in v4.47. Please convert your cache or use an appropriate `Cache` class (https://huggingface.co/docs/transformers/kv_cache#legacy-cache-format)
+Sampling:   5%|███▋                                                                  | 52/1000 [00:01<00:32, 28.77it/s]
+🎤 Generating ChatterBox segment 2/6 chunk 1/1 for 'female_01'...
+Sampling:   2%|█▌                                                                    | 23/1000 [00:00<00:33, 28.86it/s]
+🎤 Generating ChatterBox segment 3/6 chunk 1/1 for 'male_01'...
+Reference mel length is not equal to 2 * reference token length.
+
+Sampling:   3%|█▊                                                                    | 26/1000 [00:00<00:33, 28.68it/s]
+🎤 Generating ChatterBox segment 4/6 chunk 1/1 for 'narrator'...
+Sampling:   2%|█▌                                                                    | 22/1000 [00:00<00:33, 29.33it/s]
+🎤 Generating ChatterBox segment 5/6 chunk 1/1 for 'female_01'...
+Sampling:   2%|█▎                                                                    | 18/1000 [00:00<00:33, 28.99it/s]
+🎤 Generating ChatterBox segment 6/6 chunk 1/1 for 'narrator'...
+Sampling:   4%|███                                                                   | 43/1000 [00:01<00:32, 29.13it/s]
+C:\actions-runner\_work\pytorch\pytorch\pytorch\aten\src\ATen\native\cuda\Indexing.cu:1553: block: [40,0,0], thread: [0,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
+C:\actions-runner\_work\pytorch\pytorch\pytorch\aten\src\ATen\native\cuda\Indexing.cu:1553: block: [40,0,0], thread: [1,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
+
+
+
+Test Case 2: Question Mark Focus (Target: punctuation)
+------------------------------------------------------
+This is a test.
+[Alice] Really?
+[Bob] Why not?
+What do you think?
+[Alice] Maybe?
+Final words.
+
+Test Case 3: Very Short Segments (Target: minimal text)
+-------------------------------------------------------
+Start.
+[Alice] Ok.
+[Bob] No.
+Yes?
+[Alice] Go.
+End.
+
+Test Case 4: Mixed Long/Short (Target: length variation)
+-------------------------------------------------------
+This is a longer introduction that should work fine without issues.
+[Alice] Short.
+[Bob] This is a much longer segment that might work better than short ones.
+Brief?
+[Alice] Another very long segment that contains multiple sentences and should be processed without the same issues.
+Done.
+
+Test Case 5: Exact Position Test (Target: 5th segment)
+------------------------------------------------------
+Segment one here.
+[Alice] Segment two here.
+[Bob] Segment three here.
+[Alice] Segment four here.
+This is segment five.
+[Bob] Segment six here.
+Final segment.
+
+Test Case 6: Character Switching Pattern (Target: same pattern as bug)
+----------------------------------------------------------------------
+Opening statement.
+[crestfallen_original] Character line.
+[Girl] Another character.
+[crestfallen_original] Second time.
+Back to narrator.
+[Bob] Different character.
+Closing statement.
+
+Test Case 7: Special Characters & Punctuation
+---------------------------------------------
+Hello there!
+[Alice] What's this?
+[Bob] It's... complicated.
+Really?!
+[Alice] Yes—exactly that.
+The end.
+
+Test Case 8: Empty/Whitespace Lines
+-----------------------------------
+First line.
+[Alice] Second line.
+
+[Bob] After empty line.
+Another gap coming.
+
+Final line.
+
+Test Case 9: Single Words (Target: minimal content)
+---------------------------------------------------
+Beginning.
+[Alice] Word.
+[Bob] Another.
+Question?
+[Alice] Answer.
+Conclusion.
+
+Test Case 10: Exact Recreation (Target: original crash)
+-------------------------------------------------------
+Hello! This is the first subtitle. I'll make it long on purpose.
+[crestfallen_original] This is Long?!
+
+[Girl]This is the second [crestfallen_original] subtitle with precise timing.
+Back to me?
+
+[Bob] The audio will match these exact timings.
+
+Back to me again? This looks like a meeees...
+
+INSTRUCTIONS:
+- Test each case separately
+- Note which segment number crashes (if any)
+- Record any "Reference mel length" warnings
+- Try with same characters: crestfallen_original, Girl (maps to female_01), Bob (maps to male_01)
+- Look for patterns in crashes (position, text length, punctuation, etc.)
diff --git a/nodes.py b/nodes.py
@@ -1,5 +1,5 @@
 # Version and constants
-VERSION = "3.1.2"
+VERSION = "3.1.3"
 IS_DEV = False  # Set to False for release builds
 VERSION_DISPLAY = f"v{VERSION}" + (" (dev)" if IS_DEV else "")
 SEPARATOR = "=" * 70
diff --git a/nodes/tts_node.py b/nodes/tts_node.py
@@ -99,6 +99,32 @@ def INPUT_TYPES(cls):
     def __init__(self):
         super().__init__()
         self.chunker = ImprovedChatterBoxChunker()
+    
+    def _pad_short_text_for_chatterbox(self, text: str, min_length: int = 35) -> str:
+        """
+        Pad short text with spaces to prevent ChatterBox sequential generation crashes.
+        
+        ChatterBox has a bug where short text segments cause CUDA tensor indexing errors
+        in sequential generation scenarios. Adding spaces provides sufficient tokens
+        without affecting the actual speech content.
+        
+        Based on testing:
+        - "word" (4 chars) crashes in sequential generation
+        - "word" + 26+ spaces works reliably
+        - Safe threshold appears to be 35+ characters
+        
+        Args:
+            text: Input text to check and pad if needed
+            min_length: Minimum text length threshold (default: 35 characters)
+            
+        Returns:
+            The original text if it is at least min_length characters long; otherwise the stripped text right-padded with spaces to min_length
+        """
+        stripped_text = text.strip()
+        if len(stripped_text) < min_length:
+            padding_needed = min_length - len(stripped_text)
+            return stripped_text + " " * padding_needed
+        return text
 
     def validate_inputs(self, **inputs) -> Dict[str, Any]:
         """Validate and normalize inputs."""
@@ -226,8 +252,12 @@ def _process():
                     for chunk_i, chunk_text in enumerate(segment_chunks):
                         print(f"🎤 Generating ChatterBox segment {i+1}/{len(character_segments)} chunk {chunk_i+1}/{len(segment_chunks)} for '{character}'...")
                         
+                        # BUGFIX: Pad short text with spaces to prevent ChatterBox sequential generation crashes
+                        # Only for ChatterBox (not F5TTS) and only when text is very short
+                        processed_chunk_text = self._pad_short_text_for_chatterbox(chunk_text)
+                        
                         chunk_audio = self.generate_tts_audio(
-                            chunk_text, char_audio_prompt, inputs["exaggeration"], 
+                            processed_chunk_text, char_audio_prompt, inputs["exaggeration"], 
                             inputs["temperature"], inputs["cfg_weight"]
                         )
                         audio_segments.append(chunk_audio)
diff --git a/pyproject.toml b/pyproject.toml
@@ -1,7 +1,7 @@
 [project]
 name = "chatterbox_srt_voice"
 description = "ChatterBox SRT Voice TTS Node is a fork of 'ChatterBox Voice' with additional developments and full F5-TTS implementation as well. I introduced a SRT node designed to help you synchronize your generated TTS audio with `.srt` subtitle files. Audio wave analyzer will help you find speech segments for f5 speech edit and much more!"
-version = "3.1.2"
+version = "3.1.3"
 license = {file = "LICENSE"}
 dependencies = ["s3tokenizer>=0.1.7", "resemble-perth", "librosa", "scipy", "omegaconf", "accelerate", "transformers==4.46.3", "# Additional dependencies for SRT support and audio processing", "conformer>=0.3.2", "torch", "torchaudio", "numpy", "einops", "phonemizer", "g2p-en", "unidecode", "# Audio processing and timing dependencies", "soundfile", "resampy", "webrtcvad", "# Optional but recommended for better performance", "numba"]
 
diff --git a/scripts/bump_version_enhanced.py b/scripts/bump_version_enhanced.py
@@ -109,13 +109,13 @@ def main():
         new_parts = list(map(int, args.version.split('.')))
         
         if tuple(new_parts) <= tuple(current_parts):
-            print(f"Warning: New version {args.version} is not newer than current {current_version}")
-            response = input("Continue anyway? (y/N): ")
-            if response.lower() != 'y':
-                print("Version bump cancelled")
-                sys.exit(0)
+            print(f"Error: New version {args.version} is not newer than current {current_version}")
+            print("Cannot bump to an older or same version number.")
+            print("Use a higher version number for the next release.")
+            sys.exit(1)
     except Exception as e:
         print(f"Warning: Could not compare versions: {e}")
+        print("Proceeding with caution...")
     
     # Create backup
     print("\nCreating backup of current files...")
diff --git a/voices_examples/crestfallen_original.mp3 b/voices_examples/crestfallen_original.mp3