Keep h1 and other headings by Kdecherf · Pull Request #75 · j0k3r/php-readability

Kdecherf · 2022-06-11T18:35:58Z

Even though using h1 tags for sections inside an article is semantically
wrong, a lot of websites are doing it anyway. So the idea here is to
stop stripping headings, including h1 on Readability's side.

Fixes wallabag/wallabag#5805

Kdecherf · 2022-06-11T18:46:24Z

Request for comment @j0k3r @jtojnar

jtojnar · 2022-06-11T22:56:30Z

I think this sort of clean-up has some merit. Maybe we could only decide to clean out the h1 if there is only a single one, like we do with h2. And only clean up h2 if there is no h1?
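
To make the proposed rule concrete, here is a minimal sketch in Python (not the library's PHP code). It assumes the rule reads as: strip the h1 only when it is the sole h1, and strip an h2 only when it is the sole h2 and there is no h1 at all. The names `HeadingCounter` and `heading_to_strip` are hypothetical and exist only for this illustration.

```python
from html.parser import HTMLParser


class HeadingCounter(HTMLParser):
    """Counts h1 and h2 start tags in an HTML fragment."""

    def __init__(self):
        super().__init__()
        self.counts = {"h1": 0, "h2": 0}

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag names before calling this hook.
        if tag in self.counts:
            self.counts[tag] += 1


def heading_to_strip(html):
    """Return "h1", "h2", or None under the rule assumed above."""
    counter = HeadingCounter()
    counter.feed(html)
    counter.close()
    if counter.counts["h1"] == 1:
        return "h1"  # a lone h1 is probably the article title
    if counter.counts["h1"] == 0 and counter.counts["h2"] == 1:
        return "h2"  # no h1 at all, so a lone h2 may play the title role
    return None  # several headings of a level: treat them as sections
```

Under this reading, an article with several h1 elements keeps all of them, which is the case the PR is trying to fix.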

Kdecherf · 2022-06-28T19:50:10Z

@jtojnar done

coveralls · 2022-06-28T19:50:37Z

Pull Request Test Coverage Report for Build 2583510249

4 of 4 (100.0%) changed or added relevant lines in 1 file are covered.
No unchanged relevant lines lost coverage.
Overall coverage increased (+0.2%) to 84.669%

Totals
Change from base Build 2486672970:	0.2%
Covered Lines:	602
Relevant Lines:	711

💛 - Coveralls

src/Readability.php

j0k3r

Looks great, you can squash 👍

Even though using h1 tags for sections inside an article is semantically wrong, a lot of websites are doing it anyway. So the idea here is to stop stripping headings, including h1 on Readability's side.

Fixes wallabag/wallabag#5805

Signed-off-by: Kevin Decherf <kevin@kdecherf.com>

Kdecherf · 2022-08-15T12:32:22Z

I'm currently having second thoughts about this cleanup.

Take this link: https://interestingengineering.com/innovation/china-plans-to-build-the-worlds-first-waterless-nuclear-reactor
The article contains only one h2 element, and its text is shorter than 100 characters (« China's plans to curb its carbon emissions »), so the routine removes it. However, this heading matters in this context.

What should we do? Maybe we could remove the length condition and replace it with something like "if it is similar to the article's title"?
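
As a rough illustration of the "similar to the article's title" idea, the sketch below compares a heading with the title using Python's standard `difflib.SequenceMatcher`. This is only one possible metric, and both the function name `resembles_title` and the 0.8 threshold are assumptions made for the example, not something agreed on in this thread.

```python
from difflib import SequenceMatcher


def normalize(text):
    """Lowercase and collapse whitespace so trivial differences are ignored."""
    return " ".join(text.lower().split())


def resembles_title(heading, title, threshold=0.8):
    """Return True when the heading text is close to the article title.

    The threshold is an arbitrary illustrative value; a real implementation
    would need to be tuned against actual articles.
    """
    ratio = SequenceMatcher(None, normalize(heading), normalize(title)).ratio()
    return ratio >= threshold
```

With a check like this, a short subtitle that differs from the title would be kept, while a heading that merely repeats the title would still be removed.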

jtojnar · 2022-08-15T13:40:56Z

Similarity would make sense but then we would need to decide on the precise metric.

Another possible heuristic would be checking if the heading is the first element in the content. Then it would spuriously preserve the heading in the case of <summary> <heading> <actual content> but that would resolve itself if we drop the summary (which is orthogonal to this). And under-removal is probably better than over-removal anyway.
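
A minimal sketch of the "first element" check, written in Python rather than the library's PHP. It assumes the extracted content is an HTML fragment without a wrapper element, so the first start tag encountered is the first child. `FirstElement` and `starts_with_heading` are hypothetical names used only here.

```python
import re
from html.parser import HTMLParser


class FirstElement(HTMLParser):
    """Records the name of the first start tag seen in a fragment."""

    def __init__(self):
        super().__init__()
        self.first_tag = None

    def handle_starttag(self, tag, attrs):
        if self.first_tag is None:
            self.first_tag = tag  # already lowercased by HTMLParser


def starts_with_heading(html):
    """Return True if the first element of the fragment is h1 through h6."""
    parser = FirstElement()
    parser.feed(html)
    parser.close()
    if parser.first_tag is None:
        return False
    return re.fullmatch(r"h[1-6]", parser.first_tag) is not None
```

In the `<summary> <heading> <actual content>` case described above, the check returns False because the summary comes first, so the heading would be preserved.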

j0k3r · 2022-09-29T12:20:45Z

> Another possible heuristic would be checking if the heading is the first element in the content

Just checking if the first child of the content is h*, right?

jtojnar · 2022-09-29T14:58:49Z

Right, that is what I meant.

Kdecherf · 2023-04-07T17:26:12Z

I'm trying to resume work on this PR.

I ran a quick check on the entries hosted on my instance, using a query like `select id, substring(content FOR 250) from wallabag_entry where substring(content FOR 15) ilike '<h_>%' limit 5;`.

It seems that there are several cases where the content begins with a legitimate heading entity, for example:

https://futurism.com/the-byte/china-ai-prosecutor-crimes (h2 as a subtitle)
https://www.theregister.com/2019/10/23/ai_dataset_imagenet_consent/ (h2 as a subtitle)
https://affordance.framasoft.org/2022/03/par-dela-like-colere/ (h2)

What should we do?

jtojnar · 2023-04-12T04:02:41Z

> It seems that there are several cases where the content begins with a legitimate heading entity, for example:

Would not those articles still begin with h1 pre-cleanup? In that case, the second heuristic would only remove the h1 since it is the first heading in the content.

The third one would not be matched by the first heuristic.

Ideally, we would combine both.

We should add a test suite for all these cases so that we can better discuss what is happening.

Kdecherf force-pushed the impr/headings branch from ada8ff0 to d3af559 Compare June 28, 2022 19:49

jtojnar reviewed Jun 28, 2022


j0k3r reviewed Jun 29, 2022


Kdecherf force-pushed the impr/headings branch from d6cc782 to 41ef592 Compare June 29, 2022 13:36

Kdecherf requested review from j0k3r and jtojnar and removed request for jtojnar August 15, 2022 12:33

j0k3r mentioned this pull request Jul 12, 2023

Article containing multiple h1 is not saved properly (missing said h1's) wallabag/wallabag#5095

Closed

Kdecherf wants to merge 1 commit into j0k3r:master from Kdecherf:impr/headings
