I think this sort of clean-up has some merit. Maybe we could only decide to clean out the |
|
@jtojnar done |
Pull Request Test Coverage Report for Build 2583510249
💛 - Coveralls |
Even though using h1 tags for sections inside an article is semantically
wrong, a lot of websites are doing it anyway. So the idea here is to
stop stripping headings, including h1, on Readability's side.
Fixes wallabag/wallabag#5805
Signed-off-by: Kevin Decherf <kevin@kdecherf.com>
|
I'm currently having second thoughts about this cleanup. Take this link: https://interestingengineering.com/innovation/china-plans-to-build-the-worlds-first-waterless-nuclear-reactor What should we do? Maybe we could remove the length condition and replace it with something like "if it is similar to the article's title"? |
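To make the "similar to the article's title" idea concrete, here is a minimal sketch in Python (not the project's own code). It assumes a character-level ratio from the standard library's `difflib`. The function names, the normalization, and the `0.75` threshold are all hypothetical placeholders, since the thread has not settled on a metric.

```python
import re
from difflib import SequenceMatcher


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so cosmetic differences are ignored."""
    return re.sub(r"\s+", " ", text).strip().lower()


def looks_like_title(heading: str, title: str, threshold: float = 0.75) -> bool:
    """Return True when the heading text is close to the article title.

    The threshold is an arbitrary placeholder, not a tuned value.
    """
    ratio = SequenceMatcher(None, normalize(heading), normalize(title)).ratio()
    return ratio >= threshold
```

With something like this, a heading would only be stripped when `looks_like_title(heading_text, article_title)` is true, regardless of its length. The sample strings in any real test would have to come from actual fixtures.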
|
Similarity would make sense, but then we would need to decide on the precise metric. Another possible heuristic would be to check whether the heading is the first element in the content. Then it would spuriously preserve the heading in the case of |
Just checking if the first child of the content is h*, right? |
|
Right, that is what I meant. |
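For discussion, the "first child is h*" check could look roughly like the Python sketch below. It is an illustration only, not Readability's implementation. It assumes the extracted content is available as an HTML fragment string, and `FirstElementFinder` and `starts_with_heading` are hypothetical names.

```python
import re
from html.parser import HTMLParser

HEADING_TAG = re.compile(r"^h[1-6]$")


class FirstElementFinder(HTMLParser):
    """Records the first element of a fragment and any text preceding it."""

    def __init__(self):
        super().__init__()
        self.first_tag = None
        self.leading_text = False

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag names before calling this method.
        if self.first_tag is None:
            self.first_tag = tag

    def handle_data(self, data):
        # Non-whitespace text before any element means the heading is not first.
        if self.first_tag is None and data.strip():
            self.leading_text = True


def starts_with_heading(content_html: str) -> bool:
    """Return True when the fragment's first node is an h1..h6 element."""
    finder = FirstElementFinder()
    finder.feed(content_html)
    finder.close()
    if finder.leading_text or finder.first_tag is None:
        return False
    return HEADING_TAG.match(finder.first_tag) is not None
```

This sketch only looks at the top of the fragment. If the content is wrapped in a container element, the check would need to descend into that wrapper first.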
|
I'm trying to resume work on this PR. I've made a small check on entries hosted on my instance, using a query like It seems that there are several cases where the content begins with a legitimate heading element, for example:
What should we do? |
Would not those articles still begin with The third one would not be matched by the first heuristic. Ideally, we would combine both. We should add a test suite for all these cases so that we can better discuss what is happening. |