Add support for source byte-range tracking for ByteRecord #286
AndreaOddo89 wants to merge 3 commits into BurntSushi:master from
There is a clear overlap between
Sorry, can you please elaborate a bit more on what you want to use this for? A code sample would be helpful I think. New APIs, and especially new types, really need to clear a big hurdle.
Makes perfect sense, I'll try to clarify the specific use case. I need to read some generally large CSV files and validate the values in each line for conformance with specific constraints. To report a failure, I'd need to seek into the original file and extract the source bytes of the record, but StringRecord/ByteRecord only provides access to the location of the first byte, not of the last, nor the length needed to calculate it. Any alternative strategy to identify the end of the source span requires reimplementing the CSV parsing logic, which defeats the purpose of using a CSV parser in the first place. Here is example code where this solution applies:

    use std::error::Error;
    use std::io::{Read, Seek, SeekFrom};

    use csv::{ReaderBuilder, StringRecord};

    pub fn process_file(src: impl Read + Seek) -> Result<(), Box<dyn Error>> {
        let mut csv_reader = ReaderBuilder::new().from_reader(src);
        let mut record_buffer = StringRecord::new();

        // Placeholder for the application-specific checks.
        fn validate_record(_record: &StringRecord) -> Result<(), String> {
            todo!()
        }

        while csv_reader.read_record(&mut record_buffer)? {
            if let Err(msg) = validate_record(&record_buffer) {
                let position = record_buffer.position().unwrap();
                let src_line = {
                    // `span()` is the method proposed in this PR.
                    let span = record_buffer.as_byte_record().span().unwrap();
                    // Take back the underlying reader; this is fine because we return below.
                    let mut raw_reader = csv_reader.into_inner();
                    raw_reader.seek(SeekFrom::Start(span.start()))?;
                    let mut raw_line_buffer = vec![0; span.len()];
                    raw_reader.read_exact(&mut raw_line_buffer)?;
                    String::from_utf8(raw_line_buffer)?
                };
                return Err(Box::from(format!(
                    "Validation error at record {}: {}. Source line: {}",
                    position.record(),
                    msg,
                    src_line
                )));
            }
        }
        Ok(())
    }

As the necessary state is already available in the reader, I assume the performance impact should be negligible.
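The point about negligible overhead can be illustrated with a toy reader. The sketch below is not the csv crate's parser: it splits on newlines only and ignores quoting, and the function name `record_ranges` is hypothetical. It only shows that a reader which already keeps a running byte offset for the start of each record can capture the end offset with one extra value per record.

```rust
use std::io::{self, BufRead};

/// Hypothetical helper: returns the (start, end) byte offsets of every
/// newline-terminated record in `src`. `end` is exclusive and includes
/// the line terminator. This is a toy splitter, not a CSV parser.
pub fn record_ranges(mut src: impl BufRead) -> io::Result<Vec<(u64, u64)>> {
    let mut ranges = Vec::new();
    let mut buf = Vec::new();
    // Running byte offset: the state a position-tracking reader already keeps.
    let mut pos: u64 = 0;
    loop {
        buf.clear();
        let n = src.read_until(b'\n', &mut buf)?;
        if n == 0 {
            break; // end of input
        }
        let start = pos;
        // Advancing the offset is needed to know where the next record starts.
        pos += n as u64;
        // The end of this record is simply the offset after it was read.
        ranges.push((start, pos));
    }
    Ok(ranges)
}
```

Calling `record_ranges(std::io::Cursor::new("a,b\nlonger,row\nx,y"))` yields the ranges 0..4, 4..15 and 15..18.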
… when using write_byte_record
ByteRecord (via ByteRecordInner) already exposes a position method to access the line number, record number and byte offset of the start of the record. This PR adds information to ByteRecord to track not only the byte offset of the start of the record, but also its end, via a span method returning a Span; this is useful to retrieve the original source bytes for a parsed record when e.g. reporting errors.
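To make the description concrete, here is a minimal sketch of what a span type and its use could look like. The `Span` struct below, its constructor and `end` method, and the `read_span` helper are assumptions made for illustration (the conversation only shows `span.start()` and `span.len()` being called); the actual definition in the PR may differ. It uses only the standard library, so any `Read + Seek` source, such as a `Cursor` over a byte vector, can stand in for the file.

```rust
use std::io::{self, Read, Seek, SeekFrom};

/// Hypothetical half-open byte range `[start, end)` of a record in its source.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub struct Span {
    start: u64,
    end: u64,
}

impl Span {
    /// Builds a span; returns `None` if `end` precedes `start`.
    pub fn new(start: u64, end: u64) -> Option<Span> {
        if end >= start {
            Some(Span { start, end })
        } else {
            None
        }
    }

    /// Byte offset of the first byte of the record.
    pub fn start(&self) -> u64 {
        self.start
    }

    /// Byte offset one past the last byte of the record.
    pub fn end(&self) -> u64 {
        self.end
    }

    /// Number of source bytes covered by the record.
    pub fn len(&self) -> usize {
        (self.end - self.start) as usize
    }
}

/// Seeks to the start of `span` and reads exactly `span.len()` bytes.
pub fn read_span(src: &mut (impl Read + Seek), span: Span) -> io::Result<Vec<u8>> {
    src.seek(SeekFrom::Start(span.start()))?;
    let mut buf = vec![0u8; span.len()];
    src.read_exact(&mut buf)?;
    Ok(buf)
}
```

A half-open range is chosen here so that `len` is a plain subtraction and adjacent records share a boundary; whether the PR uses an inclusive or exclusive end is not stated in the conversation.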