Open
Description
340 WARC files of the news crawl data set, covering 2020-09-12 through 2020-10-04, were captured using HTTP/2. A Java security upgrade added ALPN support and therefore made HTTP/2 available; the crawler started to use HTTP/2 after an automatic restart.
The affected WARC files may cause WARC readers (e.g., jwarc) to fail while parsing the HTTP headers:
- request
GET /2020/09/12/business/brexit-no-deal-uk-economy/index.html HTTP/2
...
- response
HTTP/2 200
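To illustrate why such lines can trip a parser, here is a minimal sketch of a strict HTTP/1.x start-line check. The regular expressions and function names are assumptions made for illustration, not the grammar jwarc or any other reader actually uses. A parser that expects a `major.minor` version (and, for responses, a space after the status code) would reject both lines shown above.

```python
import re

# Simplified HTTP/1.x status line: "HTTP/<major>.<minor> <3-digit code> <reason>".
# The space after the status code is mandatory; the reason phrase may be empty.
STATUS_LINE = re.compile(r"HTTP/\d\.\d \d{3} [^\r\n]*")

# Simplified HTTP/1.x request line: "<METHOD> <target> HTTP/<major>.<minor>".
# Assumption: methods are upper-case letters only (true for GET, POST, HEAD, ...).
REQUEST_LINE = re.compile(r"[A-Z]+ \S+ HTTP/\d\.\d")


def is_http1_status_line(line: str) -> bool:
    """Return True if the line matches the simplified HTTP/1.x status-line form."""
    return STATUS_LINE.fullmatch(line) is not None


def is_http1_request_line(line: str) -> bool:
    """Return True if the line matches the simplified HTTP/1.x request-line form."""
    return REQUEST_LINE.fullmatch(line) is not None
```

With this check, `HTTP/1.1 200 OK` is accepted, while `HTTP/2 200` is rejected for two reasons: the version has no minor part and there is no space after the status code.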
To address the issue:
- for now, block the usage of HTTP/2
- test which WARC parsers fail
- enable the WARC bolt to write files that parsers can read reliably when HTTP/2 is used (cf. iipc/warc-specifications#15, "WARC revision 1.1 (modification): support of HTTP 2.X protocol in WARC format", and iipc/warc-specifications#42, "WARC-Protocol field proposal")
- push fixes to the WARC parser libraries or rewrite the WARC files so that they're compatible
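One possible shape of the "rewrite" option is sketched below. It assumes that replacing the `HTTP/2` start line with an HTTP/1.1-style one is acceptable, which this issue leaves open. The function name `downgrade_start_line` is hypothetical, and the sketch only touches the HTTP header block of a single record: the caller would still have to recompute the WARC record's `Content-Length` and any block digest, since the block length changes.

```python
from http import HTTPStatus


def downgrade_start_line(http_block: bytes) -> bytes:
    """Rewrite an 'HTTP/2' start line into HTTP/1.1 form; leave other blocks untouched."""
    first, sep, rest = http_block.partition(b"\r\n")
    parts = first.split(b" ")
    if parts[0] == b"HTTP/2" and len(parts) >= 2:
        # Response: "HTTP/2 200" -> "HTTP/1.1 200 OK"
        code = parts[1]
        try:
            reason = HTTPStatus(int(code)).phrase.encode("ascii")
        except ValueError:
            # Unknown or non-numeric status code: keep the reason phrase empty.
            reason = b""
        first = b"HTTP/1.1 " + code + b" " + reason
    elif len(parts) == 3 and parts[2] == b"HTTP/2":
        # Request: "GET /path HTTP/2" -> "GET /path HTTP/1.1"
        first = b" ".join(parts[:2]) + b" HTTP/1.1"
    return first + sep + rest
```

Blocks that already start with an HTTP/1.x line pass through unchanged, so the function could be applied to every request and response record in an affected file.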
Affected files:
s3://commoncrawl/crawl-data/CC-NEWS/2020/09/CC-NEWS-20200912083952-00000.warc.gz
...
s3://commoncrawl/crawl-data/CC-NEWS/2020/10/CC-NEWS-20201004110027-00339.warc.gz
More than 80% of the records in these files were captured using HTTP/2.
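As a quick consistency check on the file range listed above, the following sketch parses the timestamp and serial number from the first and last file names. The name layout (`CC-NEWS-<14-digit timestamp>-<5-digit serial>.warc.gz`) and the assumption that serial numbers are contiguous are inferred from the two paths shown, not confirmed elsewhere; `parse_name` is a hypothetical helper.

```python
import re

# Assumed layout: CC-NEWS-<14-digit timestamp>-<5-digit serial>.warc.gz
NAME = re.compile(r"CC-NEWS-(\d{14})-(\d{5})\.warc\.gz$")


def parse_name(path: str):
    """Return (timestamp, serial) parsed from a CC-NEWS WARC path."""
    m = NAME.search(path)
    if m is None:
        raise ValueError(f"unexpected file name: {path}")
    return m.group(1), int(m.group(2))


first = "s3://commoncrawl/crawl-data/CC-NEWS/2020/09/CC-NEWS-20200912083952-00000.warc.gz"
last = "s3://commoncrawl/crawl-data/CC-NEWS/2020/10/CC-NEWS-20201004110027-00339.warc.gz"

# Assuming contiguous serial numbers, the range covers last - first + 1 files.
count = parse_name(last)[1] - parse_name(first)[1] + 1
```

Under these assumptions `count` comes out as 340, matching the number of files given in the description.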