Use extracted text in WARC resource records #2

Thanks for this elegant example of how to do RAG with WARC data! I also very much appreciated how the [blog post](https://lil.law.harvard.edu/blog/2024/02/12/warc-gpt-an-open-source-tool-for-exploring-web-archives-with-ai/) highlighted limitations with citation (which is important for web archives).

I was wondering if it might be useful to use the text/plain WARC *resource* records that [browsertrix-crawler](https://github.com/webrecorder/browsertrix-crawler?tab=readme-ov-file#text-extraction) creates from the *rendered* page (rather than text scraped from the static HTML). This could matter for social media content, where the page is assembled dynamically.

I think it would mostly be a matter of adding some logic to [ingest.py](https://github.com/harvard-lil/warc-gpt/blob/main/warc_gpt/commands/ingest.py#L126) to look for records with `WARC-Type: resource`, and then using the `WARC-Target-URI` header to determine which URL to associate the text with.
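To make the idea concrete, here is a rough sketch of the check I have in mind. It is not based on how ingest.py is actually structured: `extracted_text_target` is a hypothetical helper, it takes the WARC headers as a plain mapping rather than a real record object, and the `urn:text:` / `urn:textFinal:` prefixes are assumed from what browsertrix-crawler writes.

```python
from typing import Mapping, Optional, Tuple

# Prefixes assumed from browsertrix-crawler's text extraction output.
TEXT_PREFIXES = ("urn:text:", "urn:textFinal:")


def extracted_text_target(headers: Mapping[str, str]) -> Optional[Tuple[str, str]]:
    """Return (kind, url) for an extracted-text resource record, else None.

    `headers` is a hypothetical plain mapping of WARC header names to values.
    """
    if headers.get("WARC-Type") != "resource":
        return None
    if not headers.get("Content-Type", "").startswith("text/plain"):
        return None
    target = headers.get("WARC-Target-URI", "")
    for prefix in TEXT_PREFIXES:
        if target.startswith(prefix):
            # "urn:text:" -> "text", "urn:textFinal:" -> "textFinal"
            kind = prefix[len("urn:"):-1]
            return kind, target[len(prefix):]
    return None
```

Anything that returns `None` would fall through to the existing handling, so response records would be unaffected.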

Here's an example for the text generated on the initial page render:

```
WARC/1.1
Content-Type: text/plain
WARC-Target-URI: urn:text:https://genart.social/tags/genuary
WARC-Date: 2024-02-18T16:58:12.661Z
WARC-Type: resource
WARC-Record-ID: <urn:uuid:1d657dd4-1b01-4e76-bba2-ea641d74c029>
WARC-Payload-Digest: sha256:7cd17ef9c0393fcc1f8fd1b956c0f43eab1a2851f01d06fe41692d2284a2905c
WARC-Block-Digest: sha256:7cd17ef9c0393fcc1f8fd1b956c0f43eab1a2851f01d06fe41692d2284a2905c
Content-Length: 897

Mastodon
Create account
Login
Recent searches
No recent searches
Search options
Not available on genart.social.
genart.social
is part of the decentralized social network powered by
Mastodon
.
...
```

The `WARC-Target-URI` could also look like `urn:textFinal:{url}`, which holds the text in the page after the behaviors have run. But maybe this would complicate the retrieval step if there are multiple records for the same resource?
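One possible way around the duplicate problem, sketched under my own assumptions: keep a single text per URL and prefer `textFinal` over `text` when both exist. `pick_one_text_per_url` and the `(kind, url, text)` tuples are hypothetical, and the preference order is only a guess at what would be most useful.

```python
from typing import Dict, Iterable, Tuple

# Higher rank wins; preferring the post-behavior text is an assumption.
PREFERENCE = {"text": 0, "textFinal": 1}


def pick_one_text_per_url(records: Iterable[Tuple[str, str, str]]) -> Dict[str, str]:
    """Collapse (kind, url, text) tuples into one text per URL."""
    best: Dict[str, Tuple[int, str]] = {}
    for kind, url, text in records:
        rank = PREFERENCE.get(kind, -1)
        # Keep the first record seen for a URL unless a higher-ranked one appears.
        if url not in best or rank > best[url][0]:
            best[url] = (rank, text)
    return {url: text for url, (_, text) in best.items()}
```

The alternative would be to index both records and store the kind as metadata, so the retrieval step can filter on it.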

