Add Franklin County Board of Commissioners spider#9
Add Franklin County Board of Commissioners spider#91992tw wants to merge 2 commits intoCity-Bureau:mainfrom
Conversation
|
Important Review skippedAuto reviews are disabled on this repository. Please check the settings in the CodeRabbit UI or the You can disable this status message by setting the Use the checkbox below for a quick retry:
✨ Finishing Touches🧪 Generate unit tests (beta)
Tip Issue Planner is now in beta. Read the docs and try it out! Share your feedback on Discord. Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
Summary of ChangesHello @1992tw, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request introduces a new web scraping spider designed to collect and standardize meeting data for the Franklin County Board of Commissioners. The spider leverages a multi-stage API interaction to gather detailed meeting information, ensuring accurate capture of event specifics while intelligently filtering out irrelevant calendar entries. This enhancement expands the project's data coverage by integrating a new public body's meeting schedule. Highlights
Changelog
Activity
Using Gemini Code AssistThe full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips. Invoking Gemini You can request assistance from Gemini at any point by creating a comment using either
Customization To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a Limitations & Feedback Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for Github and other Google products, sign up here. You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension. Footnotes
|
There was a problem hiding this comment.
Code Review
This pull request introduces a new spider for the Franklin County Board of Commissioners. The spider implementation is well-structured, using a two-step API scraping process, and is accompanied by a comprehensive set of tests. However, I've identified a critical issue with the Pipfile.lock file, which appears to be corrupted with non-existent package versions, and this will prevent dependencies from being installed. Additionally, I've provided a couple of suggestions to improve the robustness of the spider's parsing logic, aligning it more closely with Scrapy best practices.
| desc = sel.css("div.meeting-container > p::text").get() | ||
| return desc.strip() if desc else "" |
There was a problem hiding this comment.
The current implementation using css(...).get() is a bit fragile. It will only extract the first text node from the first <p> tag. If the <p> tag contains other tags like <br> or <strong>, their text content will be missed. Using xpath('string(...)') is more robust as it concatenates all descendant text nodes of the selected element.
| desc = sel.css("div.meeting-container > p::text").get() | |
| return desc.strip() if desc else "" | |
| desc = sel.xpath("string(//div[@class='meeting-container']/p[1])").get() | |
| return desc.strip() if desc else "" |
| def _parse_links(self, sel): | ||
| """Extract document links (agendas, minutes) from meeting detail HTML.""" | ||
| links = [] | ||
| for doc_div in sel.css("div.meeting-document"): | ||
| title = doc_div.css("h3::text").get("").strip() | ||
| href = doc_div.css("div.alt-formats a::attr(href)").get() | ||
| if href: | ||
| if not href.startswith("http"): | ||
| href = "https://www.franklincountyohio.gov" + href | ||
| links.append({"href": href, "title": title or "Document"}) | ||
| return links |
There was a problem hiding this comment.
To make URL joining more robust, I suggest passing the response object to this method and using response.urljoin(). This is the standard Scrapy way to handle relative URLs and is more reliable than string concatenation.
You would also need to update the call in parse_detail at line 126 from self._parse_links(sel) to self._parse_links(response, sel).
| def _parse_links(self, sel): | |
| """Extract document links (agendas, minutes) from meeting detail HTML.""" | |
| links = [] | |
| for doc_div in sel.css("div.meeting-document"): | |
| title = doc_div.css("h3::text").get("").strip() | |
| href = doc_div.css("div.alt-formats a::attr(href)").get() | |
| if href: | |
| if not href.startswith("http"): | |
| href = "https://www.franklincountyohio.gov" + href | |
| links.append({"href": href, "title": title or "Document"}) | |
| return links | |
| def _parse_links(self, response, sel): | |
| """Extract document links (agendas, minutes) from meeting detail HTML.""" | |
| links = [] | |
| for doc_div in sel.css("div.meeting-document"): | |
| title = doc_div.css("h3::text").get("").strip() | |
| href = doc_div.css("div.alt-formats a::attr(href)").get() | |
| if href: | |
| href = response.urljoin(href) | |
| links.append({"href": href, "title": title or "Document"}) | |
| return links |
7381a93 to
abccfb2
Compare
msrezaie
left a comment
There was a problem hiding this comment.
Aside from tweaking the location parsing, it looks good.
What's this PR do?
This PR adds a new spider for Franklin County Board of Commissioners meetings that scrapes data from their calendar API. The spider uses a two-step pattern: POST to the calendar endpoint for meeting items, then GET each detail endpoint for full meeting info (location, agenda/minutes links, descriptions). It covers ~14 months of meetings (2 months back through 12 months forward).
Why are we doing this?
This addresses the need to track Franklin County Board of Commissioners meetings as part of the City Scrapers project. This resolves the ticket for Franklin County Board of Commissioners (Batch 3, Agency ID 1872).
Steps to manually test
test_output.csvto ensure the data looks valid (should contain General Sessions, Briefing Sessions, etc.).Are there any smells or added technical debt to note?
No major debt. There is a commented-out
MEETING_NAME_REfilter (lines 39, 84-85) that could be cleaned up or re-enabled if non-meeting calendar items (holidays, community events) need to be excluded. The spider setsROBOTSTXT_OBEY = Falsebecause the API endpoints are not crawlable otherwise. Test suite covers 24 cases including General Session, Briefing Session, invalid responses, and parametrized field checks.