No apostrophes allowed!

`writeln(createDocument!(DOMCreateOptions.None)("It's").root.html);`
=>
`It&#39;s`

:/

Besides encoding text as entities when it is stored in the document, the library escapes those entities again when printing the document out. `writeHTMLEscaped` doesn't have any support for turning that off, so there's no way to stop characters from getting encoded as numeric entities. You only have to escape the apostrophe if it's inside a (single-quoted) attribute value, but there's no accounting for that. Plus there are smart quotes, and innumerable UTF-8 characters that make perfectly valid HTML but are forced into ugly numeric escapes with this strategy.

The terminology is confusing too. "Decode entities" seems to mean turning `&nbsp;` into `&#160;`. I don't think anyone would want that by default, and it's actually *trans*coding named entities into numeric entities, not decoding anything. "Decoding entities" should at least mean turning the UTF-8 encoded no-break space (U+00A0) into `&nbsp;`. Strictly speaking, that's transcoding too, since `&nbsp;` is still just a sequence of undecoded bytes, not a complex data structure. Decoding is where you take the raw character or `&nbsp;` and turn it into a call to `onEntity()` or something. It looks like the code tries to do that, but... doesn't really do it right.
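To make that last definition concrete, here is a minimal sketch of decoding as "turn each entity into a callback" rather than rewriting bytes. This is not the library's code: `scanEntities`, `onText`, and `isEntityChar` are names made up for illustration, and `onEntity` is just the callback name floated above. It assumes an entity is `&`, then ASCII letters, digits, or `#`, then `;`, and treats anything else as plain text.

```d
import std.ascii : isAlphaNum;

/// Hypothetical helper: true for bytes that may appear between '&' and ';'.
bool isEntityChar(char c)
{
    return isAlphaNum(c) || c == '#';
}

/// Hypothetical scanner: reports plain-text runs and entity names separately
/// and never transcodes either of them.
void scanEntities(string text,
                  scope void delegate(string) onText,
                  scope void delegate(string) onEntity)
{
    size_t runStart = 0; // start of the current plain-text run
    size_t i = 0;
    while (i < text.length)
    {
        if (text[i] != '&')
        {
            ++i;
            continue;
        }
        // Scan the candidate entity name that follows the '&'.
        size_t end = i + 1;
        while (end < text.length && isEntityChar(text[end]))
            ++end;
        if (end > i + 1 && end < text.length && text[end] == ';')
        {
            if (i > runStart)
                onText(text[runStart .. i]);
            onEntity(text[i + 1 .. end]); // name without '&' and ';'
            i = end + 1;
            runStart = i;
        }
        else
        {
            ++i; // a bare '&' stays part of the plain text
        }
    }
    if (runStart < text.length)
        onText(text[runStart .. $]);
}

unittest
{
    string[] events;
    scanEntities("a&nbsp;b &amp; c & d",
        (string t) { events ~= "text:" ~ t; },
        (string e) { events ~= "entity:" ~ e; });
    assert(events == ["text:a", "entity:nbsp", "text:b ",
                      "entity:amp", "text: c & d"]);
}
```

Saved as, say, `example.d`, this can be checked with `dmd -unittest -main -run example.d`. Scanning byte by byte is safe on UTF-8 input because `&` and `;` never occur inside a multi-byte sequence.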

I like to adopt the strategy of "never decode until you are ready to display," which is a fancy way of writing "never decode." I only decode stuff when I need the meaningful data inside it; otherwise I run the risk of double decoding it, or of being unable to control what encoding everything gets forced into.

Really, entities are entirely separate from the structure of the HTML document. Entities aren't separate nodes in the DOM for any major web browser; they're just more bytes within the text node. There are also two entirely separate categories of entity that get conflated: 7-bit (ASCII) characters that would mess up the HTML if unescaped, and 8-bit (non-ASCII) characters. I'll want to escape some HTML by replacing < and >, and maybe ", with entities, but no other ones. Conversely, if someone wants to turn their HTML into proper English schoolteacher HTML, they will want all non-ASCII code points escaped but < and > left intact.
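A rough sketch of keeping those two categories apart, as two independent functions. Both names (`escapeMarkup`, `escapeNonAscii`) are hypothetical and not part of any existing library. Following the paragraph above, the first one touches only `<`, `>`, and `"`; leaving `&` alone is an assumption made here so that entities already in the text are not escaped a second time.

```d
import std.array : appender;
import std.conv : to;

/// Hypothetical helper: escapes only the ASCII characters that would be
/// read as markup. All other characters, including '&', pass through.
string escapeMarkup(string text)
{
    auto result = appender!string();
    foreach (dchar c; text)
    {
        switch (c)
        {
            case '<': result.put("&lt;"); break;
            case '>': result.put("&gt;"); break;
            case '"': result.put("&quot;"); break;
            default:  result.put(c); break;
        }
    }
    return result.data;
}

/// Hypothetical helper: replaces every non-ASCII code point with a decimal
/// numeric character reference and leaves all ASCII, markup included, intact.
string escapeNonAscii(string text)
{
    auto result = appender!string();
    foreach (dchar c; text)
    {
        if (c > 0x7F)
        {
            result.put("&#");
            result.put(to!string(cast(uint) c));
            result.put(";");
        }
        else
        {
            result.put(c);
        }
    }
    return result.data;
}

unittest
{
    // Markup is neutralised, the smart quote is left alone.
    assert(escapeMarkup("It’s \"<e/>\"") == "It’s &quot;&lt;e/&gt;&quot;");
    // The smart quote (U+2019, decimal 8217) is escaped, markup is left alone.
    assert(escapeNonAscii("It’s <e/>") == "It&#8217;s <e/>");
}
```

Because neither function knows about the other, a caller can apply one, both, or neither, which is the point of treating the categories separately.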

Here's how I think it should work:

`writeln(createDocument("It’s").root.html);`
=>
It’s

`writeln(escapeEntities(createDocument("It’s <e/>").root.html));`
=>
`It&rsquo;s <e/>`

`writeln(escapeEntities!(named: false)(createDocument("It’s <e/>").root.html));`
=>
`It&#x2019;s <e/>`

`writeln(escapeHTML(createDocument("It’s \"<e/>\"").root.html));`
=>
`It’s "&lt;e/&gt;"`

`writeln(escapeAttribute("this attribute has a \" in it, as well as > and <"));`
=>
`this attribute has a &quot; in it, as well as &gt; and &lt;`
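For what it's worth, an attribute escaper along these lines could also settle the apostrophe question by taking the quote character into account. The following is only a sketch of the proposed `escapeAttribute`, not an existing function; the `quote` parameter is an assumption added here to show that `'` needs escaping only when the attribute itself is delimited by `'`.

```d
import std.array : appender;

/// Hypothetical sketch: escapes a value for use inside an attribute that is
/// delimited by `quote`, which is expected to be either '"' or '\''.
string escapeAttribute(string value, char quote = '"')
{
    auto result = appender!string();
    foreach (dchar c; value)
    {
        if (c == quote)
            result.put(quote == '"' ? "&quot;" : "&#39;");
        else if (c == '<')
            result.put("&lt;");
        else if (c == '>')
            result.put("&gt;");
        else
            result.put(c); // everything else, smart quotes included, is kept
    }
    return result.data;
}

unittest
{
    assert(escapeAttribute("this attribute has a \" in it, as well as > and <")
        == "this attribute has a &quot; in it, as well as &gt; and &lt;");
    // An apostrophe is harmless inside a double-quoted attribute...
    assert(escapeAttribute("It's") == "It's");
    // ...and only needs escaping when the attribute is single-quoted.
    assert(escapeAttribute("It's", '\'') == "It&#39;s");
}
```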

``` D
node.html("<p>");
writeln(node.html);
```

=>
`<p>`

``` D
node.html(escapeHTML("<p>"));
writeln(node.html);
```

=>
`&lt;p&gt;`

I'm honestly starting to think that an HTML parser should not deal with entities at all. They have to be dealt with using a separate character-by-character parser, since they're on the level of characters and not part of the document structure.


No apostrophes allowed! #19
