
Invalid characters in XML files #4852

@Pizzabroodje

Describe the bug
Bridges can write characters that are invalid in XML into the generated feed, such as U+0003 (ETX) in my case.

To Reproduce
In my case, it happened with the MarktplaatsBridge.
Sometimes the feed gives an error in TT-RSS. The last two times this happened I debugged it and found an ETX character (U+0003) at the end of a string somewhere in the feed. In both cases, the text being pulled by the bridge had an apostrophe at that position.
Edit 10-01-2026:

  • For this to happen with the MarktplaatsBridge, the $jsonString needs to contain an ETX character somewhere. To simulate it, paste the character into an item string inside the foreach loop over $jsonObj->listings (in my case it was in a $listing->description), then import the feed in TT-RSS.
  • For the case described in the other edit at the end of this post, the $jsonString needs to contain '\ud83d'. json_decode fails with a Syntax Error if it contains that, and the feed comes back empty. To reproduce it, echo the $jsonString from a working feed, put '\ud83d' somewhere in it (for example in the description of a listing), and replace the $jsonString in the MarktplaatsBridge with that.
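The two reproduction cases above can be sketched outside the bridge. This is a minimal standalone script, not code from RSS-Bridge: the JSON payloads and variable names are made up for illustration, and the second half assumes the SimpleXML extension is available as a stand-in for the XML parser in a feed reader.

```php
<?php
// Case 1: an unpaired high-surrogate escape in the raw JSON text.
// Single quotes keep \ud83d as six literal characters, as in the feed.
$brokenJson = '{"listings":[{"description":"bijna nieuwe staat. \ud83d"}]}';
$decoded = json_decode($brokenJson);
$surrogateError = json_last_error();
var_dump($decoded);                    // decoding fails, so this is NULL
echo json_last_error_msg(), PHP_EOL;   // message reported by PHP

// Case 2: an ETX character (U+0003) that decodes fine but is not valid XML.
$etxJson = '{"listings":[{"description":"4 pizza\u0003s"}]}';
$listing = json_decode($etxJson)->listings[0];

// Wrap the decoded text in an element and hand it to an XML parser.
libxml_use_internal_errors(true);      // collect parse errors instead of warnings
$parsed = simplexml_load_string('<description>' . $listing->description . '</description>');
var_dump($parsed);                     // expected to be false: U+0003 is not allowed in XML 1.0
```

The first case fails at decode time, the second only when the generated feed is parsed, which matches the two different symptoms described above.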

Expected behavior
Check the $jsonString for invalid characters and isolated high surrogates, and remove any that are present. It is probably best to do this for every bridge, not just the MarktplaatsBridge.

Additional context
A regex seems the best solution for this. In my personal version of the bridge, I have done the following to solve it:
preg_replace('/[^\PC\s]/u', '', $string);
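This removes every character in the Unicode "Other" category (control, format, unassigned, private-use and surrogate code points) except whitespace, which covers a lot of characters that can break XML files.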

It might be overkill if it only breaks on this ETX character at the end of a string, but I think there could be more cases where invalid characters break feeds.
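To make the behaviour of that regex concrete, here is a small sketch that wraps it in a function and compares it with a whitelist built from the XML 1.0 Char production. Both function names are hypothetical, and the whitelist is an alternative swapped in for comparison, not something the bridge currently does.

```php
<?php
// The regex from this issue: drop "Other"-category characters unless they are whitespace.
function stripOtherCategory(string $text): string
{
    // preg_replace() returns null on failure (e.g. invalid UTF-8 with /u); fall back to the input.
    return preg_replace('/[^\PC\s]/u', '', $text) ?? $text;
}

// Alternative (assumption): keep only code points allowed by the XML 1.0 Char production.
function stripInvalidXmlChars(string $text): string
{
    return preg_replace(
        '/[^\x{9}\x{A}\x{D}\x{20}-\x{D7FF}\x{E000}-\x{FFFD}\x{10000}-\x{10FFFF}]/u',
        '',
        $text
    ) ?? $text;
}

$sample = "pizza\x03 \u{1F539}\nØ35";   // ETX after "pizza", then an emoji and a newline

// Both variants remove the ETX and keep the emoji, the newline and the Ø.
var_dump(stripOtherCategory($sample) === "pizza \u{1F539}\nØ35");
var_dump(stripInvalidXmlChars($sample) === "pizza \u{1F539}\nØ35");

// They differ on a form feed (U+000C): \s treats it as whitespace, XML 1.0 does not allow it.
var_dump(stripOtherCategory("a\x0Cb") === "a\x0Cb");
var_dump(stripInvalidXmlChars("a\x0Cb") === "ab");
```

The two approaches also differ in the other direction: the category-based regex strips format characters such as the zero-width joiner (U+200D), which XML allows and which some emoji sequences rely on. Which trade-off is right for the bridges is a judgment call.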

Edit 10-01-2026:
It broke today because there was an unpaired high surrogate in the JSON string pulled from Marktplaats.
The string where it broke: "categorySpecificDescription": "Te koop: professionele pizza oven – 2 x 4 pizza’s ø35 cm – made in italy wegens wijziging van plannen bied ik deze professionele italiaanse pizza oven te koop aan. De oven is slechts 2 maanden gebruikt en verkeert in zeer nette, bijna nieuwe staat. \ud83d..."

The description of the actual ad, including the next line:
`Te koop: Professionele Pizza Oven – 2 x 4 Pizza’s Ø35 cm – Made in Italy

Wegens wijziging van plannen bied ik deze professionele Italiaanse pizza oven te koop aan.
De oven is slechts 2 maanden gebruikt en verkeert in zeer nette, bijna nieuwe staat.

🔹 Capaciteit: 2 kamers – per kamer geschikt voor 4 pizza’s van Ø35 cm`

It appears to break because the string is cut off exactly between the high and low surrogates of the 🔹 character, which leaves only \ud83d behind and makes the JSON invalid.

Solved it by running preg_replace('/\\\uD[89AB][0-9A-F]{2}(?!\\\uD[CDEF][0-9A-F]{2})/i', '', $string); on the entire JSON string.
I don't know if this also solves the earlier issue with the ETX character, as I'm not familiar with regular expressions (I took these from forums).
Update: I did some testing, and it does not solve the issue with ETX characters, so both regexes have to be combined:
preg_replace('/\\\uD[89AB][0-9A-F]{2}(?!\\\uD[CDEF][0-9A-F]{2})|[^\PC\s]/ui', '', $string);
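As a rough sketch of how the combined regex could be applied before decoding, the helper below cleans the raw JSON string and then calls json_decode. The function name, constant name and sample payloads are hypothetical and not part of RSS-Bridge; only the pattern itself comes from this issue.

```php
<?php
// Combined pattern from this issue: an unpaired high-surrogate escape
// (not followed by a low-surrogate escape), or any "Other"-category
// character that is not whitespace.
const SANITIZE_PATTERN = '/\\\uD[89AB][0-9A-F]{2}(?!\\\uD[CDEF][0-9A-F]{2})|[^\PC\s]/ui';

// Hypothetical helper: sanitize the raw JSON text, then decode it.
function decodeSanitizedJson(string $jsonString)
{
    // preg_replace() returns null on failure (e.g. invalid UTF-8 with /u); fall back to the input.
    $clean = preg_replace(SANITIZE_PATTERN, '', $jsonString) ?? $jsonString;
    $decoded = json_decode($clean);
    if ($decoded === null && json_last_error() !== JSON_ERROR_NONE) {
        throw new RuntimeException('JSON decode failed: ' . json_last_error_msg());
    }
    return $decoded;
}

// Truncated surrogate pair: the lone \ud83d escape is removed.
$truncated = '{"description":"bijna nieuwe staat. \ud83d"}';
// Complete pair: left untouched and decoded to the emoji.
$paired = '{"description":"\ud83d\udd39 Capaciteit"}';
// Raw ETX byte inside the JSON text: removed by the character class.
$withEtx = '{"description":"4 pizza' . "\x03" . 's"}';

echo decodeSanitizedJson($truncated)->description, PHP_EOL;
echo decodeSanitizedJson($paired)->description, PHP_EOL;
echo decodeSanitizedJson($withEtx)->description, PHP_EOL;
```

One caveat with this sketch: the character class only sees literal characters in the raw string. If an ETX arrives as the escape sequence \u0003, it only becomes a real control character after decoding, so the decoded strings would need the same filter applied.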

Labels: Bug-Report (confirmed bug report)
