Ensure processing is better for automatic LLM-based curinator.md to issue conversion #1893

parmsam · 2025-09-18T17:38:30Z

Changes made to rweekly scripts folder for curation team use.

This pull request updates the scripts/process_curinator.R script to enhance its data processing pipeline for extracting and summarizing R-related markdown links. The main improvements include reorganizing the order of library imports, expanding the set of libraries used, and significantly refactoring the data wrangling logic to produce grouped, formatted summaries.

Dependency management:

Moved the import of ellmer after dplyr and added new dependencies: tidyr and glue, to support improved data manipulation and string formatting.

Data processing and summarization:

Refactored the processing pipeline to:
- Store the initial results in result_raw with a new json_metadata column.
- Parse and expand the metadata, filter for R-related entries, format markdown links, group by type, and generate combined summaries for each group using glue.
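
The steps listed above can be sketched roughly as follows. This is a minimal illustration, not the code in scripts/process_curinator.R: the `title` and `url` columns, the sample rows, the `+ [title](url)` link format, and grouping on the `category` field are all assumptions made for the example.

```r
library(dplyr)
library(tidyr)
library(purrr)
library(jsonlite)
library(glue)

# Hypothetical stand-in for the model output. Only the json_metadata
# column name comes from the PR description; everything else is assumed.
result_raw <- tibble(
  title = c("Post A", "Post B", "Post C"),
  url = c("https://example.com/a", "https://example.com/b", "https://example.com/c"),
  json_metadata = c(
    '{"is_r_related": "yes", "category": "Tutorials"}',
    '{"is_r_related": "no", "category": "Insights"}',
    '{"is_r_related": "yes", "category": "Tutorials"}'
  )
)

result_wrangled <- result_raw |>
  # Parse each JSON string into a named list, then spread it into columns
  mutate(metadata = map(json_metadata, fromJSON)) |>
  unnest_wider(metadata) |>
  # Keep only entries the model flagged as R-related
  filter(is_r_related == "yes") |>
  # Format each entry as a markdown link (assumed bullet style)
  mutate(md_link = glue("+ [{title}]({url})")) |>
  # Combine the links for each category into one block of text
  group_by(category) |>
  summarise(summary = paste(md_link, collapse = "\n"), .groups = "drop")
```

Each row of `result_wrangled` then holds one category and the markdown list of links assigned to it.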

Limitations

Some URLs fail to read and may need manual review, but this approach can still automatically check whether an RSS post is R-related and put it in the right category.
I have limited the website content (after conversion to markdown) to 1000 characters to respect model token limits.
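
As a rough illustration of the character cap, a helper along these lines would do it. The function name and sample strings are hypothetical; only the 1000-character figure comes from the description above.

```r
# Hypothetical helper: cap page text at a fixed number of characters
# before it is sent to the model. substr() leaves shorter strings as-is.
truncate_content <- function(text, max_chars = 1000L) {
  substr(text, 1L, max_chars)
}

long_post <- strrep("R is great. ", 200)  # 2400 characters
short_post <- "A short post about dplyr."

truncated <- truncate_content(c(long_post, short_post))
nchar(truncated)
```

A character cap is only a proxy for a token limit, so the right value likely depends on the model and on how much of each post is needed to classify it.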

jonocarroll · 2025-09-19T12:48:01Z

Running this as a test after manually running curatinator today...

There's a post that errors (with an unusually long URL); I like that it's included in the manual_review_links but (if we proceed with this) a final 'report' which prints out the result_wrangled, manual_review_links, and also the filter(result_temp, is_r_related != "yes") results for inspection would be helpful.
I get an error here

> result_temp <- result_raw |>
+   mutate(metadata = map(json_metadata, ~ fromJSON(.x))) |>
+   unnest_wider(metadata) 
# Error in `mutate()`:
# ℹ In argument: `metadata = map(json_metadata, ~fromJSON(.x))`.
# Caused by error in `map()`:
# ℹ In index: 19.
# Caused by error:
# ! lexical error: invalid char in json text.
#                                        ```json {   "is_r_related": "ye
#                      (right here) ------^
# Run `rlang::last_trace()` to see where the error occurred.

because the output (from this post) looks like this

result_raw[19, ]$json_metadata
# [1] "```json\n{\n  \"is_r_related\": \"yes\",\n  \"category\": \"R in Organization\"\n}\n```"

It's not clear to me why the code fence is there, but it should probably be stripped prior to attempting fromJSON(). This seems to do the trick in this case

map(json_metadata, ~ fromJSON(gsub("```(json\n)*", "", .x)))
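
A slightly more defensive variant of the same idea, written as a standalone helper, might look like the sketch below. The helper name and sample inputs are hypothetical, and it uses base R only so it can be tried without the rest of the pipeline.

```r
# Hypothetical helper (not part of the PR): strip an optional markdown
# code fence from a model response so only the JSON payload remains.
strip_code_fence <- function(x) {
  x <- trimws(x)
  # Drop an opening fence plus an optional language tag such as "json"
  x <- sub("^`{3}[A-Za-z]*[[:space:]]*", "", x)
  # Drop a closing fence at the very end of the string
  x <- sub("[[:space:]]*`{3}$", "", x)
  trimws(x)
}

# Build a fenced response like the one reported for row 19
fence <- strrep("`", 3)
fenced <- paste0(
  fence, "json\n{\n  \"is_r_related\": \"yes\",\n",
  "  \"category\": \"R in Organization\"\n}\n", fence
)
plain <- "{\"is_r_related\": \"no\", \"category\": \"Insights\"}"

cleaned <- strip_code_fence(c(fenced, plain))
```

The cleaned strings can then be passed to fromJSON(). Wrapping that call in tryCatch() would let a single malformed response be routed to manual review instead of aborting the whole run.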

In this instance all of the R-related posts were assigned to 'Insights' including 'RcppSimdJson 0.1.14 on CRAN: New Upstream Major'. For me to give support to this (since it cost US$0.34 to perform) I'd like it to have a clear demonstration of doing better than assigning to the default category. I suspect we could achieve the R-related inference with some keyword matching, possibly even building our own classification model based on the existing issues.
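
To make the keyword-matching idea concrete, here is a minimal sketch. The keyword list is a hypothetical starting point and would need tuning against existing issues; the second sample title is made up.

```r
# Hypothetical keyword patterns that suggest a post is about R
r_keywords <- c(
  "\\bCRAN\\b", "\\btidyverse\\b", "\\bggplot2\\b", "\\bdplyr\\b",
  "\\bShiny\\b", "\\bRcpp", "\\bRStudio\\b", "\\bR package\\b", "library\\("
)

# TRUE when any pattern matches the post title or text
is_r_related <- function(text, patterns = r_keywords) {
  pattern <- paste(patterns, collapse = "|")
  grepl(pattern, text, ignore.case = TRUE, perl = TRUE)
}

posts <- c(
  "RcppSimdJson 0.1.14 on CRAN: New Upstream Major",
  "Ten tips for faster SQL queries"
)
flags <- is_r_related(posts)
```

A filter like this costs nothing per run, so it could serve as a baseline that any LLM-based classification has to beat.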

jonocarroll · 2025-09-19T12:52:14Z

It also misclassified this post as not R related - I can see why it had trouble with it, but misclassifications add work rather than reduce it.

parmsam · 2025-09-19T13:44:19Z

That's interesting. I recently switched over to a Claude Sonnet model for our use case. I'll experiment with different models and with changing our system prompt. Maybe I need to do a better job teaching it how to decide the post category. Right now, I'm just using our wiki page material, which can be confusing. We could also build an eval dataset based on one or more previous issues to measure accuracy. Also, the 1000-character limit may sometimes not be enough to determine whether a post is R-related.
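
One way such an eval could be scored is sketched below. All rows are made up for illustration; in practice `truth` would come from a previously curated issue and `predicted` from the model.

```r
# Hypothetical evaluation data: curated category vs. model prediction
eval_set <- data.frame(
  title = c("Post A", "Post B", "Post C", "Post D"),
  truth = c("Tutorials", "Insights", "R in Organization", "Tutorials"),
  predicted = c("Tutorials", "Insights", "Insights", "Insights"),
  stringsAsFactors = FALSE
)

# Share of posts where the model agrees with the curated category
accuracy <- mean(eval_set$truth == eval_set$predicted)

# Baseline: always predict the most common curated category
majority <- names(which.max(table(eval_set$truth)))
baseline_accuracy <- mean(eval_set$truth == majority)

# Cross-tabulation showing which categories get confused
confusion <- table(truth = eval_set$truth, predicted = eval_set$predicted)
```

Comparing `accuracy` against `baseline_accuracy` addresses the concern raised earlier in the thread: the model should clearly beat assigning everything to a default category.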

rpodcast · 2025-09-27T02:13:35Z

I just performed a run to kick off my issue curation for 2025-W40, and my experience was quite similar to @jonocarroll's. My observations:

- I also had a post with a long URL (from the same site) funneled to manual review.
- I hit the same issue with the code-fenced block appearing in one of the raw post contents, and implementing map(json_metadata, ~ fromJSON(gsub("```(json\n)*", "", .x))) did the trick.
- In my case, all of the RSS posts were classified as tutorial. I don't have a huge batch of links (17 RSS posts), but a handful of them definitely belong in other categories.
- At first I was perplexed about why {reticulate} was involved, until I read the {ragnar} documentation. Once I figured out how to add Python to my project's Nix configuration via {rix}, I was able to solve the errors. That's just a caution for other curators who try to use this in the future.

While I am very supportive of finding clever ways to reduce the manual effort in the process of curating an issue, it's pretty clear that more testing is needed to find the optimal combo of a prompt and appropriate LLM to make this practical for the rest of the team. But this is an excellent starting point.

parmsam · 2025-09-28T20:57:39Z

Sorry for the delay. I finally had some free time today to revisit this. I improved the system prompt, addressed the JSON code fence issue, and switched over to a Quarto report format. You'll see I'm now using a decision-tree style format that Claude recommended for the system prompt. I also increased the character limit to 1,500. The result seems to be much better now: process_curinator_report.html

@rpodcast, I'd love to know how the report compares to your grouping for this upcoming issue. We'll probably need to continue tuning the system prompt to meet our needs. It's getting better though.


parmsam added 2 commits September 18, 2025 13:33

ensure wrangling is more complete for processed curinator

394616a

add wrangled output for near usable solution

eaa9b16

parmsam requested review from jonocarroll and rpodcast September 18, 2025 17:50

parmsam added 3 commits September 18, 2025 13:55

switch over to anthropic model

cc6f211

add code for links requiring manual review

8670ab7

reduce dup code

ef75f1d

improve system prompt and switch to quarto report format

fe01d4b

parmsam added 4 commits September 28, 2025 17:06

refine classification criteria and enhance report formatting for better readability

c4576e7

add mention of R code into r-related step check

9cb44ea

increase characters to 2,500

0ca188e

add not about manually reviewing not R-related links

4258fce
