fix: Guard article.inner_text() against stale element errors in BigDeal scraper #39
Co-authored-by: jjlauterbach <1447549+jjlauterbach@users.noreply.github.com>
Deploying daily-custard with

| | |
|---|---|
| Latest commit: | 0d1dbe6 |
| Status: | ✅ Deploy successful! |
| Preview URL: | https://470bccd1.daily-custard.pages.dev |
| Branch Preview URL: | https://copilot-sub-pr-30-again.daily-custard.pages.dev |
Just as a heads up, I was blocked by some firewall rules while working on your feedback.
Pull request overview
This PR improves the resilience of the Big Deal Facebook Playwright scraper by preventing a single stale/detached post element from aborting an entire scrape, and updates unit tests to reflect the scraper’s nested-comment filtering behavior.
Changes:
- Guard `article.inner_text()` in the BigDeal scraper so Playwright DOM errors skip the bad post and continue scanning.
- Fix existing BigDeal scraper tests by setting `article.evaluate.return_value = False` so mocked posts aren't filtered out as nested comments.
- Add a regression test ensuring an `inner_text()` Playwright error on one post still allows a later valid post to be returned.
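The guard described in the first bullet can be sketched as follows. This is a minimal, self-contained illustration, not the actual `bigdeal.py` code: `FakeArticle`, `collect_post_texts`, and the stubbed `PlaywrightError` are stand-ins so the example runs without a browser (the real scraper catches Playwright's own error class).

```python
# Hypothetical sketch of the per-article guard; names are illustrative.
class PlaywrightError(Exception):
    """Stand-in for Playwright's DOM error (e.g. a detached element)."""


class FakeArticle:
    """Stand-in for a Playwright ElementHandle."""

    def __init__(self, text=None, error=None):
        self._text, self._error = text, error

    def inner_text(self):
        if self._error is not None:
            raise self._error
        return self._text


def collect_post_texts(articles):
    texts = []
    for article in articles:
        try:
            text = article.inner_text()
        except PlaywrightError:
            continue  # stale/detached element: skip this post, keep scanning
        except Exception:
            continue  # any other per-element failure is also non-fatal
        texts.append(text)
    return texts


posts = [
    FakeArticle(error=PlaywrightError("element is detached")),
    FakeArticle(text="Today's flavor is Vanilla!"),
]
assert collect_post_texts(posts) == ["Today's flavor is Vanilla!"]
```

The key point is that the `try/except` sits inside the per-article loop, so one bad element no longer aborts the whole scrape attempt.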
Reviewed changes
Copilot reviewed 1 out of 1 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| app/scrapers/bigdeal.py | Handles inner_text() Playwright errors per-article to avoid aborting the scrape attempt on a single bad element. |
| tests/test_bigdeal_scraper.py | Updates mocks to pass nested-article filtering and adds a test covering inner_text() PlaywrightError continuation behavior. |
```python
mock_article2.inner_text.return_value = "Today's flavor is Vanilla!"
mock_article2.evaluate.return_value = False

mock_page.query_selector_all.return_value = [mock_article1, mock_article2]
```
In this test, mock_page.query_selector_all.return_value is used for all selectors, but the scraper calls query_selector_all('text="See more"') before query_selector_all('[role="article"]'). With the current setup the article mocks can be treated as "See more" buttons (calling is_visible()/click()), which makes the test less representative and could hide regressions. Consider routing query_selector_all via side_effect (as done in test_scrape_facebook_page_expands_see_more_buttons) to return [] for the "See more" selector and the articles list for [role="article"].
Suggested change:

```diff
-mock_page.query_selector_all.return_value = [mock_article1, mock_article2]
+def query_selector_all_side_effect(selector: str):
+    if selector == 'text="See more"':
+        # No "See more" buttons in this scenario
+        return []
+    if selector == '[role="article"]':
+        # Return the mocked posts
+        return [mock_article1, mock_article2]
+    return []
+mock_page.query_selector_all.side_effect = query_selector_all_side_effect
```
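As a runnable illustration of this suggestion, `unittest.mock`'s `side_effect` can route each selector to its own result instead of one shared `return_value` (mock names mirror the test's; the selectors are the ones the review quotes):

```python
from unittest.mock import MagicMock

mock_page = MagicMock()
mock_article1, mock_article2 = MagicMock(), MagicMock()


def query_selector_all_side_effect(selector: str):
    # Per-selector routing: articles for '[role="article"]', nothing else.
    if selector == '[role="article"]':
        return [mock_article1, mock_article2]
    return []  # covers 'text="See more"' and any other selector


mock_page.query_selector_all.side_effect = query_selector_all_side_effect

assert mock_page.query_selector_all('text="See more"') == []
assert mock_page.query_selector_all('[role="article"]') == [mock_article1, mock_article2]
```

With this setup the article mocks can no longer be mistaken for "See more" buttons, so the test exercises the same call sequence the scraper actually makes.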
A detached/stale Playwright element during `article.inner_text()` would abort the entire scrape attempt with no retry, since the outer wrapper only retries on `PlaywrightTimeoutError`.

Changes

- `bigdeal.py`: Wrap `article.inner_text()` in `try/except PlaywrightError` + `except Exception` with `continue`, matching Leon's scraper pattern: a bad element skips to the next post instead of aborting.
- `tests/test_bigdeal_scraper.py`: Fixed 5 broken tests where mock articles lacked `evaluate.return_value = False`. The comment-filtering step calls `article.evaluate(...)`, and an unset `Mock()` returns a truthy value, causing all articles to be silently dropped as nested comments.
- `tests/test_bigdeal_scraper.py`: Added `test_scrape_facebook_inner_text_error_continues_to_next_post`, which verifies that a `PlaywrightError` on one article's `inner_text()` skips that element and returns the flavor from the next valid post.
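The truthy-Mock pitfall behind the five broken tests can be reproduced in isolation (a standalone snippet, not the test file itself):

```python
from unittest.mock import Mock

article = Mock()

# An unconfigured Mock returns a new Mock from any call, and Mock objects
# are truthy, so a filter like `if article.evaluate(...): continue`
# silently drops every mocked article as a nested comment.
assert bool(article.evaluate("...")) is True

# Setting an explicit return value lets the mocked post pass the filter.
article.evaluate.return_value = False
assert article.evaluate("...") is False
```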