Wrapping Bug for Spanish GNA #42

@tajmone

@thoni56, I've noticed an issue with how ALAN wraps transcripts.

For example, in ponibles_test.a3t the 's' of "puestos" is split onto the next line:

> x gemelos
Unos gemelos. Ambos llevan la misma ropa puesta. Los gemelos llevan puesto
s unos pantalones y unas botas.

This is the library code that prints "puestos":

  "llevas puest$$"
  If obj is femenina
    then "a"
    else "o"
  End If.
  If obj is plural then "$$s" End If.

The above is a typical example of how the Spanish library handles gender and number in various language constructs (adjectives, articles, verbs, etc.): it adds the 'a' or 'o' suffix depending on gender, and a final 's' if plural.
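To make the convention concrete, here is a rough Python model of how the fragments end up joined. It is not ALAN code and not how the interpreter is implemented; `join_fragments` and `puesto_fragments` are hypothetical names, and the sketch assumes that output fragments are normally separated by one space unless a `$$` marker sits at the start or end of a fragment.

```python
NO_SPACE = "$$"  # marker meaning "do not insert a space here"


def join_fragments(fragments):
    """Join fragments with single spaces, except across a $$ marker.

    Simplified model: $$ is only recognised at the very start or end
    of a non-empty fragment.
    """
    out = ""
    glue = False  # True when the previous fragment ended with $$
    for frag in fragments:
        leading = frag.startswith(NO_SPACE)
        if leading:
            frag = frag[len(NO_SPACE):]
        trailing = frag.endswith(NO_SPACE)
        if trailing:
            frag = frag[:-len(NO_SPACE)]
        if out and not (glue or leading):
            out += " "
        out += frag
        glue = trailing
    return out


def puesto_fragments(femenina, plural):
    """Mirror the library snippet: stem, gender vowel, optional plural 's'."""
    frags = ["llevas puest$$", "a" if femenina else "o"]
    if plural:
        frags.append("$$s")
    return frags


print(join_fragments(puesto_fragments(femenina=False, plural=True)))  # llevas puestos
print(join_fragments(puesto_fragments(femenina=True, plural=False)))  # llevas puesta
```

The point of the model is that the plural 's' arrives as a separate fragment, after the interpreter has already seen a complete-looking word ("puesto").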

The problem seems to be that when the sentence reaches "puesto" (column 75) ALAN decides it's time to wrap without checking whether the upcoming text contains a $$ (or punctuation) which might need to be joined with the current word (i.e. the one being parsed when ALAN decides to wrap).
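The suspected behaviour can be imitated with a small Python sketch. This is only my guess at the failure mode, not the interpreter's actual algorithm: `eager_wrap` is a hypothetical function that places each piece as soon as it arrives and breaks the line whenever the piece does not fit, even when the piece is glued to the previous one by `$$`. The width of 25 is chosen just to make the short example overflow at the right spot.

```python
def eager_wrap(fragments, width):
    """Place pieces one at a time, wrapping as soon as a piece does not fit.

    Deliberately naive: a piece glued to the previous word by $$ is still
    moved to a new line on its own, which splits the word.
    Simplified model: $$ is only recognised at the start or end of a
    non-empty fragment.
    """
    lines = [""]
    glue = False  # True when the previous fragment ended with $$
    for frag in fragments:
        leading = frag.startswith("$$")
        if leading:
            frag = frag[2:]
        trailing = frag.endswith("$$")
        if trailing:
            frag = frag[:-2]
        for i, word in enumerate(frag.split()):
            joined = i == 0 and (glue or leading)
            sep = "" if joined or not lines[-1] else " "
            if len(lines[-1]) + len(sep) + len(word) > width:
                lines.append(word)  # breaks even when `joined` is true
            else:
                lines[-1] += sep + word
        glue = trailing
    return lines


fragments = ["Los gemelos llevan puest$$", "o", "$$s", "unos pantalones"]
for line in eager_wrap(fragments, width=25):
    print(line)
# Los gemelos llevan puesto
# s unos pantalones
```

With these inputs the 's' lands alone at the start of the second line, which is the same shape as the broken transcript above.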

If I were to replace the above code with:

  "llevas puest$$"
  If obj is femenina
    then
      If obj is plural
        then "as"
        else "a"
      End If.
    else
      If obj is plural
        then "os"
        else "o"
      End If.
  End If.

the word wouldn't be broken prematurely. Apparently, ALAN sees the $$ and waits before wrapping. The problem is that this variation is more verbose than the code currently in use, where we only add the final 's' if the noun is plural (so there is no $$ after the preceding vowel, in case there's no need to add a plural 's').

Probably I should add a proper minimum viable ad hoc test in the alan-bugs-testbed, but I wanted to mention it right away when I discovered it, and begin by posting here on ALAN i18n, since this affects the Spanish library and we all need to be aware of the issue and decide if it's worth using the longer code to prevent breaking the word.

Also, I'm not sure why ALAN is wrapping at column 75, since I believe the default is 80 columns. I think this issue of incorrect wrapping came up before, and was due to miscounting the various special $ symbols in a way that affected the column bookkeeping used to decide when to wrap. But I thought that problem had already been solved.

In any case, this problem also affects punctuation, for I noticed in various transcripts that ALAN wraps lines just before a ., , or ) (or other punctuation marks), which doesn't look nice either. I'm not sure if this is due to the presence of a $$ in the previous token or preceding the punctuation mark, but definitely ALAN should do some lookahead scrutiny before wrapping, to check that the next string "token" is not something that needs to be adjoined with the current one.

From what I remember from peeking at the ALAN sources, the way output strings work in ALAN is a bit intricate: some strings are retrieved from disk (those within quotes in the source) while others are taken from memory (those stored as attributes), and the way these are handled is a bit complex due to Huffman compression. So the whole process is a very fragmented series of long jumps in C, where the various snippets that will form a string are retrieved as the AMachine munches code in real time.

I'm not sure where the part that handles wrapping falls in the process, but it looks like strings are broken as they are being "stitched together", i.e. there's no "paragraph buffer" where they are stored for later inspection and wrapping. I guess that adding some lookahead functionality to prevent cases like the above would probably require lots of code changes.
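One way to get the effect of lookahead without a full paragraph buffer might be to hold back only the last word until the next piece shows whether it is glued on. Below is a Python sketch of that idea, not a proposal for the actual C code: `wrap_with_pending_word` is a hypothetical name, and the sketch assumes `$$` appears only at the start or end of a non-empty fragment.

```python
def wrap_with_pending_word(fragments, width):
    """Wrap text while keeping the last word open for glued suffixes.

    A word is only committed to a line once the next piece turns out not
    to be attached to it, so the wrap decision sees the whole word.
    """
    lines = [""]
    pending = ""  # word that may still receive glued pieces
    glue = False  # True when the previous fragment ended with $$

    def commit():
        nonlocal pending
        if not pending:
            return
        sep = " " if lines[-1] else ""
        if len(lines[-1]) + len(sep) + len(pending) > width:
            lines.append(pending)  # whole word moves to the next line
        else:
            lines[-1] += sep + pending
        pending = ""

    for frag in fragments:
        leading = frag.startswith("$$")
        if leading:
            frag = frag[2:]
        trailing = frag.endswith("$$")
        if trailing:
            frag = frag[:-2]
        for i, word in enumerate(frag.split()):
            if i == 0 and (glue or leading):
                pending += word  # glued: extend the open word
            else:
                commit()
                pending = word
        glue = trailing
    commit()  # flush the last word at end of output
    return lines


fragments = ["Los gemelos llevan puest$$", "o", "$$s", "unos pantalones"]
for line in wrap_with_pending_word(fragments, width=25):
    print(line)
# Los gemelos llevan
# puestos unos pantalones
```

The same holding-back would keep punctuation attached whenever it arrives as a glued piece. The cost is that output is delayed by one word, so the pending word has to be flushed before anything that expects the text to be on screen, such as a prompt.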

Labels: 💀 bug (Something isn't working), 👅 ES (Spanish language)