January 2023 - May 2024
Little is known about the backgrounds of the people who run for U.S. state legislatures. We developed Python software that uses ChatGPT’s ability to summarize large amounts of text efficiently in order to compile biographical data on nearly 150,000 U.S. state legislative candidates who ran for office between 1967 and 2017. Specifically, we gathered each candidate's college major, undergraduate institution, highest degree and the institution that granted it, and work history. Prior to this effort, no comprehensive database of such biodata existed at this scale. Our methodology used the Google Custom Search JSON API to identify relevant webpages for each candidate, Beautiful Soup and PDF-scraping libraries to retrieve their content, and the OpenAI API to summarize that content and extract the target biodata. We found that approximately 40% of all U.S. state legislative candidates between 1967 and 2017 had biodata available online, with coverage generally increasing for more recent candidates. LLM hallucinations introduced some false positives and false negatives into the final dataset, and some results were lost to "prompt errors" (the webpage content was too long for the OpenAI API) and "parse errors" (the response was formatted incorrectly). This dataset offers unique insights into the background, experience, and influences of state legislative candidates and will be used as part of a larger analysis of lobbying laws and the revolving door in politics.
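The two failure categories named above can be illustrated with a minimal sketch. It is not the project's actual code: the field names, the character budget standing in for the model's token limit, the JSON reply format, and the helper names (`check_context`, `parse_response`, `process`) are all assumptions made for illustration. The model call is passed in as a plain function so the sketch runs without network access.

```python
import json

# Hypothetical field names; the source lists the biodata collected
# but not how the fields were named or formatted.
FIELDS = (
    "college_major",
    "undergrad_institution",
    "highest_degree",
    "highest_degree_institution",
    "work_history",
)

# Assumed character budget used as a rough stand-in for a token limit.
MAX_CONTEXT_CHARS = 12_000


def check_context(page_text, max_chars=MAX_CONTEXT_CHARS):
    """Return 'prompt_error' if the page text exceeds the budget, else None."""
    return "prompt_error" if len(page_text) > max_chars else None


def parse_response(raw):
    """Parse a reply assumed to be a JSON object keyed by FIELDS.

    Returns (record, None) on success or (None, 'parse_error') on failure.
    """
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return None, "parse_error"
    if not isinstance(data, dict) or any(f not in data for f in FIELDS):
        return None, "parse_error"
    return {f: data[f] for f in FIELDS}, None


def process(page_text, call_model):
    """Run one page through the length check, the model call, and the parser.

    `call_model` is any callable mapping page text to the raw reply string.
    """
    error = check_context(page_text)
    if error:
        return None, error
    return parse_response(call_model(page_text))
```

Tallying the error labels returned by `process` across all candidates would give the share of results lost to each category, separately from any hallucination-related errors, which this kind of check cannot detect.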
- Professor Jetson Leder-Luis (Boston University)
- Professor Ray Fisman (Boston University)
- Professor Silvia Vannutelli (Northwestern University)
August 2024 - May 2025
- Professor Jetson Leder-Luis (Boston University)
- Professor Colleen Carey (Cornell University)