No apostrophes allowed! #19

@cyisfor

Description

writeln(createDocument!(DOMCreateOptions.None)("It's").root.html);
=>
It&#39;s

:/

In addition to encoding text as entities when it is stored in the document, the library escapes those entities again when printing the document out. writeHTMLEscaped has no option to turn that off, so there's no way to stop characters from getting encoded as numeric entities. An apostrophe only has to be escaped when it's inside an attribute value that is itself delimited by apostrophes, but the escaping takes no account of that context. Plus there are smart quotes, and innumerable non-ASCII characters that are perfectly valid in a UTF-8 HTML document, but are forced into ugly numeric escapes instead, with this strategy.
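The general hazard of escaping once on store and again on output can be shown in a few lines. This is only a sketch: it uses Python's standard `html.escape` as a stand-in for both passes, and it is not the code of the library discussed in this issue.

```python
import html

# html.escape stands in for "escape on store" and "escape on print".
# It is Python's standard library, not the D library from this issue.
stored = html.escape("It's")    # first pass, when the text is stored
printed = html.escape(stored)   # second pass, when the document is printed

print(stored)    # It&#x27;s
print(printed)   # It&amp;#x27;s

# One round of unescaping only gets back to the stored form, not the input.
print(html.unescape(printed) == stored)   # True
```

The second pass escapes the `&` that the first pass introduced, which is why an escaper that cannot be switched off makes round-tripping text awkward.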

The terminology is confusing too. "Decode entities" seems to mean turning a named entity such as "&nbsp;" into a numeric one such as "&#160;". I don't think anyone would want that by default, and it's actually _trans_coding named entities into numeric entities, not decoding anything. "Decoding entities" should at least mean turning "&nbsp;" into the UTF-8 encoded character it stands for. Strictly speaking, that's transcoding too, since the result is still just a sequence of undecoded bytes, not a complex data structure. Decoding is where you take "&nbsp;" or "&#160;" and turn it into a call to onEntity() or something. It looks like the code tries to do that, but... doesn't really do it right.
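To make the distinction concrete, here is a small Python sketch of the two transcoding operations. `named_to_numeric` is a hypothetical helper written for this illustration; `html.unescape` and `html.entities.name2codepoint` are from Python's standard library. Nothing here is the library's own API.

```python
import html
import re
from html.entities import name2codepoint

def named_to_numeric(text):
    """Transcode named entities to decimal numeric entities (hypothetical helper)."""
    def repl(match):
        codepoint = name2codepoint.get(match.group(1))
        # Unknown names are left exactly as they were.
        return match.group(0) if codepoint is None else f"&#{codepoint};"
    return re.sub(r"&([A-Za-z][A-Za-z0-9]*);", repl, text)

# Named entity -> numeric entity: still an entity, nothing was "decoded".
print(named_to_numeric("a&nbsp;b"))        # a&#160;b

# Entity -> the character itself: still just text, so arguably transcoding too.
print(html.unescape("a&nbsp;b") == "a\u00a0b")   # True
```

Both calls return plain strings. Neither produces structured data about the entity, which is the sense of "decoding" argued for above.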

I like to adopt the strategy of "never decode until you are ready to display," which is a fancy way of writing "never decode." I only decode stuff when I need the meaningful data inside it; otherwise I run the risk of double decoding it, or of being unable to control which encoding everything gets forced into.

Really, entities are entirely separate from the structure of the HTML document. Entities aren't separate nodes in the DOM for any major web browser; they're just more bytes within the text node. There are also two entirely separate categories of entity that get conflated: 7-bit ASCII characters that would mess up the HTML if unescaped, and 8-bit (non-ASCII) characters. I'll want to escape some HTML by replacing < and >, and maybe ", with entities, but no other characters. Conversely, if someone wants to turn their HTML into proper English schoolteacher HTML, they will want all non-ASCII codepoints escaped, but < and > left intact.
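The two categories can be handled by two independent functions, so a caller picks one, both, or neither. A minimal Python sketch follows; `escape_markup` and `escape_non_ascii` are hypothetical names chosen for this example and are not part of any library.

```python
def escape_markup(text):
    """Escape only the characters that would be read as markup (hypothetical)."""
    # Follows the issue's wording: only < and > are touched. A production
    # version would probably also need to handle "&".
    return text.replace("<", "&lt;").replace(">", "&gt;")

def escape_non_ascii(text):
    """Escape only codepoints above 127, leaving markup intact (hypothetical)."""
    return "".join(
        ch if ord(ch) < 128 else f"&#x{ord(ch):X};"
        for ch in text
    )

sample = "It\u2019s <e/>"
print(escape_markup(sample))      # It’s &lt;e/&gt;
print(escape_non_ascii(sample))   # It&#x2019;s <e/>
```

Because neither function touches the other's category, applying them in either order gives the same result on this input.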

Here's how I think it should work:

writeln(createDocument("It’s").root.html);
=>
It’s

writeln(escapeEntities(createDocument("It’s <e/>").root.html));
=>
It&rsquo;s <e/>

writeln(escapeEntities!(named: false)(createDocument("It’s <e/>").root.html));
=>
It&#x2019;s <e/>

writeln(escapeHTML(createDocument("It’s \"<e/>\"").root.html));
=>
It’s "&lt;e/&gt;"

writeln(escapeAttribute("this attribute has a \" in it, as well as > and <"));
=>
this attribute has a &quot; in it, as well as &gt; and &lt;
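Attribute escaping depends on which quote character delimits the value, and that is the context the apostrophe complaint is about. A rough Python sketch under that assumption; `escape_attribute` and its `quote` parameter are hypothetical and only mirror the proposed name above.

```python
def escape_attribute(value, quote='"'):
    """Escape a string for use inside a quoted attribute value (hypothetical)."""
    # "&" goes first so the entities added below are not escaped again.
    out = value.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;")
    if quote == '"':
        # Inside double quotes an apostrophe is harmless and stays as-is.
        return out.replace('"', "&quot;")
    # Inside single quotes only the apostrophe needs escaping.
    return out.replace("'", "&#39;")

print(escape_attribute('this attribute has a " in it, as well as > and <'))
# this attribute has a &quot; in it, as well as &gt; and &lt;
print(escape_attribute("It's"))              # It's
print(escape_attribute("It's", quote="'"))   # It&#39;s
```

With this shape, the apostrophe is only ever escaped when the caller says the value is single-quoted.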

node.html("<p>");
writeln(node.html);

=>
<p>

node.html(escapeHTML("<p>"));
writeln(node.html);

=>
&lt;p&gt;

I'm honestly starting to think that an HTML parser should not deal with entities at all. They have to be dealt with using a separate character-by-character parser, since they exist at the level of characters and are not part of the document structure.
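A separate entity scanner of that kind could look roughly like the following Python sketch. It walks a text node's contents and reports plain runs and entity references through callbacks. `scan_entities`, `on_text` and `on_entity` are hypothetical names, and the regular expression only covers well-formed references that end in a semicolon.

```python
import re

# Matches &name;, &#123; and &#x1F; style references (well-formed only).
_ENTITY = re.compile(r"&(#[0-9]+|#[xX][0-9A-Fa-f]+|[A-Za-z][A-Za-z0-9]*);")

def scan_entities(text, on_text, on_entity):
    """Call on_text for plain runs and on_entity for each reference (hypothetical)."""
    pos = 0
    for match in _ENTITY.finditer(text):
        if match.start() > pos:
            on_text(text[pos:match.start()])
        on_entity(match.group(1))   # reference body without "&" and ";"
        pos = match.end()
    if pos < len(text):
        on_text(text[pos:])

events = []
scan_entities(
    "It&apos;s &#x2019; ok",
    on_text=lambda s: events.append(("text", s)),
    on_entity=lambda e: events.append(("entity", e)),
)
print(events)
```

The scanner never needs to know about tags or attributes, so it can run on text the HTML parser has left untouched, and only when the caller actually asks for decoded data.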
