Bad Punctuation Formatting Bugs in SRT Document Translation #58

@ThioJoe

Description

In every language I've tried (Spanish, Portuguese, Italian, Indonesian, French), the output SRT file has the correct timing, but there are weird punctuation formatting bugs spread throughout it.

I'm working with the API through the DeepL Python SDK, but I assume the problem isn't isolated to that SDK.

Examples

Commas and/or periods being placed at the beginning of a block instead of attached to the previous word:

117
00:10:23,280 --> 00:10:28,160
, de nuevo, si ves algo como un descuento 
del 10 % o más, seguro que es falso. 

Periods and commas getting spaces added in front of them (see first line):

266
00:23:21,840 --> 00:23:26,000
variante interesante . Podemos hablar de 
ello en los comentarios. Y, por supuesto, 
si te ha gustado el vídeo, 
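As a stopgap for the two symptoms above, the translated text can be patched after the fact. This is only a minimal sketch: `repair_cues` is a hypothetical helper (not part of the DeepL SDK), it takes the cue texts as a plain list of strings in order, and it assumes a stray leading comma or period always belongs to the end of the previous cue.

```python
import re

def repair_cues(cues):
    """Patch a list of cue texts (hypothetical helper, not a DeepL API).

    1. A comma/period at the very start of a cue is moved to the end of
       the previous cue (assumption: that is where it belongs).
    2. Spaces or tabs directly before a comma/period are removed.
    """
    fixed = []
    for text in cues:
        # Leading "," or "." that is not the start of an ellipsis ("...").
        m = re.match(r"\s*([,.])(?!\.)\s*", text)
        if m and fixed and fixed[-1].strip():
            fixed[-1] = fixed[-1].rstrip() + m.group(1)
            text = text[m.end():]
        # "interesante . Podemos" -> "interesante. Podemos"
        text = re.sub(r"[ \t]+([,.])(?=\s|$)", r"\1", text)
        fixed.append(text)
    return fixed
```

Only commas and periods are touched, since those are the characters affected in the examples; French, for instance, legitimately puts a space before `?`, `!`, `:` and `;`.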

Lots of text getting shoved into one block, leaving adjacent ones nearly empty. Notice that block 3 below shows a single word for over 4 seconds. Also notice the comma with a space inserted before it in block 4, line 3. Similar to issue #56, I've also seen it create entirely blank blocks, not just single-word ones like this. In those cases too, the adjacent blocks usually absorb the text and end up extra long.

3
00:00:10,240 --> 00:00:14,880
son 

4
00:00:14,880 --> 00:00:19,840
estafas que deben conocer y de las que 
deben cuidarse en 2026. Y, en realidad, 
para la mayoría de ellas , la mejor forma 
de defenderse es simplemente conocerlas. 
Así que, al final, deberían estar 
preparados. Soy ThioJoe, 

---------- ORIGINAL ----------
3
00:00:10,240 --> 00:00:14,880
want to be aware of and watch out for in 2026. And really for most of these, the best way to

4
00:00:14,880 --> 00:00:19,840
defend against them is to simply know about it. So by the end, you should be good. I'm ThioJoe,

None of these issues are present in the original SRT file.
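To gauge how widespread these symptoms are in a translated file, a small scanner along these lines could help. It is a rough sketch, not an exact SRT parser: `parse_srt` and `find_issues` are hypothetical helper names, the patterns only cover commas and periods (the characters affected in the examples above), and the `min_words` threshold for a "nearly empty" block is an arbitrary assumption.

```python
import re

# Comma/period at the start of a block (but not an ellipsis "...").
LEADING = re.compile(r"^[,.](?!\.)")
# Whitespace directly before a comma/period, e.g. "interesante ."
SPACE_BEFORE = re.compile(r"\s[,.](?=\s|$)")

def parse_srt(text):
    """Split SRT text into (index, timing, body_lines) tuples."""
    blocks = []
    for chunk in re.split(r"\n\s*\n", text.strip()):
        lines = chunk.splitlines()
        if len(lines) >= 2:
            body = [line.strip() for line in lines[2:]]
            blocks.append((lines[0].strip(), lines[1].strip(), body))
    return blocks

def find_issues(text, min_words=2):
    """Return (block_index, symptom) pairs for the symptoms described above."""
    issues = []
    for index, _timing, body in parse_srt(text):
        joined = " ".join(line for line in body if line)
        if LEADING.match(joined):
            issues.append((index, "leading punctuation"))
        if SPACE_BEFORE.search(joined):
            issues.append((index, "space before punctuation"))
        if len(joined.split()) < min_words:
            issues.append((index, "nearly empty block"))
    return issues
```

Running `find_issues` on both the original and the translated file and comparing the results makes it easy to confirm which problems were introduced by the translation.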

In another instance, I noticed it split a word (estafas) right across two blocks. It also seems to preserve the line breaks from the original SRT file unnecessarily when a block has multiple lines:

2
00:00:04,560 --> 00:00:10,240
un montón de estafas nuevas o incluso de 
variaciones nuevas de estafas antiguas. 
Todas ellas son estaf 

3
00:00:10,240 --> 00:00:14,880
as que deben conocer y de las que deben 
cuidarse en 2026. 
Y, en realidad, para la mayoría de ellas, 
la mejor manera 

---------- ORIGINAL ----------
2
00:00:04,560 --> 00:00:10,240
a bunch of brand new scams or even ones that are
new variations on older scams. All of which you'll

3
00:00:10,240 --> 00:00:14,880
want to be aware of and watch out for in 2026.
And really for most of these, the best way to

After that one I realized the issue with the newlines, and also realized that the SRT file I was using, downloaded from YouTube, includes non-breaking spaces at the end of the first lines. So I removed those and made each block one line, which seemed to get rid of the mid-word break, but the punctuation issues remain. I'm also not sure whether the mid-word break was actually fixed, or whether the new translation was just slightly different so the word landed in a different place.
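For anyone wanting to reproduce that cleanup step, it can be sketched roughly as follows. `flatten_srt` is a hypothetical helper name, and the sketch assumes a simple SRT layout (index line, timing line, then text lines, with blank lines between blocks).

```python
import re

NBSP = "\u00a0"  # non-breaking space, as found in the YouTube-exported SRT

def flatten_srt(text):
    """Replace non-breaking spaces and join each block's text into one line."""
    text = text.replace("\r\n", "\n").replace(NBSP, " ")
    out = []
    for chunk in re.split(r"\n[ \t]*\n", text.strip()):
        lines = [line.strip() for line in chunk.split("\n")]
        if len(lines) < 2:
            continue  # not a complete block (needs index + timing)
        index, timing, body = lines[0], lines[1], lines[2:]
        one_line = " ".join(part for part in body if part)
        parts = [index, timing] + ([one_line] if one_line else [])
        out.append("\n".join(parts))
    return "\n\n".join(out) + "\n"
```

The flattened text would then be written to a new `.srt` file and passed to `translate_document` in place of the original.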


Example Call:

    # We pass in the actual out_file object; the SDK writes the translated document to it
    with open(input_path, "rb") as in_file, open(output_path, "wb") as out_file:
        result: deepl.DocumentStatus = deepl_api.translate_document(
            input_document=in_file,
            output_document=out_file,
            target_lang="es",
            formality="prefer_less",
        )

Here in_file is an .srt file. I also tried explicitly setting output_format="srt", but it made no difference.
