A small Rust utility that batch-converts XML files into cleaned JSON Q→A pairs.
It streams each XML file, extracts `<row ... />` attributes (`Id`, `PostId`/`ParentId`, `Text`/`Body`/`Content`, `Score`, `CreationDate`), cleans the text (HTML entities, tags, @mentions, whitespace), and writes one pretty-printed JSON file per input XML into an output directory.
- Streaming XML parsing with `quick-xml` — memory friendly for large files.
- Cleans HTML entities, strips tags and `@username` mentions, normalizes whitespace.
- Groups answers by `PostId`/`ParentId` and picks the top answer by `Score` (tie → oldest `CreationDate`).
- Writes one JSON file per input XML.
- `src/main.rs` — the Rust program (paste your Rust source here).
- `Cargo.toml` — project manifest with dependencies (example below).
- `datasets/unzipped/` — expected input folder (put your `.xml` files here).
- `cleaned_xml/` — output folder (created automatically).
1. Install Rust (via rustup)

```sh
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
source $HOME/.cargo/env
rustc --version
cargo --version
```
2. Create the Cargo project (or use your existing repo)
```sh
cargo new xml2json
cd xml2json
```
3. Replace `src/main.rs`
Overwrite `src/main.rs` with the Rust source you prepared.
4. Use this `Cargo.toml`
Replace or merge into your `Cargo.toml`:
```toml
[package]
name = "xml2json"
version = "0.1.0"
edition = "2021"

[dependencies]
quick-xml = "0.36"
serde_json = "1.0"
regex = "1.11"
html-escape = "0.2.13"
chrono = { version = "0.4", features = ["serde"] }
anyhow = "1.0"
lazy_static = "1.4"
walkdir = "2.3"
```
Note: crate names in `Cargo.toml` use hyphens (e.g. `html-escape`), but they are referenced in code with underscores (e.g. `html_escape`).
5. Prepare folders & a quick test XML
```sh
mkdir -p datasets/unzipped
cat > datasets/unzipped/test.xml <<'XML'
<root>
  <row Id="1" PostId="10" Text="&lt;b&gt;What is X?&lt;/b&gt;" Score="0" CreationDate="2020-01-01T00:00:00" />
  <row Id="2" PostId="10" Text="Answer 1" Score="5" CreationDate="2020-01-01T01:00:00" />
  <row Id="3" PostId="10" Text="Answer 2" Score="3" CreationDate="2020-01-01T02:00:00" />
</root>
XML
```

(Note: a literal `<` is not allowed inside XML attribute values, so the HTML tags in `Text` are entity-escaped; the cleaner unescapes them before stripping.)
6. Build and run (release mode recommended)
```sh
cargo build --release
./target/release/xml2json
# or run directly:
cargo run --release
```
Expected output:

```text
[*] Processing datasets/unzipped/test.xml -> cleaned_xml/test.json
[+] Wrote 1 Q->A pairs to cleaned_xml/test.json
[*] All XML files processed.
```
Then inspect:

```sh
cat cleaned_xml/test.json
```
How it works (short)
- `collect_comment_groups` streams the XML with `quick_xml::Reader`, finds `<row ...>` elements, and extracts their attributes.
- `clean_text` unescapes HTML entities (via `html-escape`), strips HTML tags (regex), removes @mentions, collapses whitespace, and trims edge quotes.
- `write_json` groups answers by `PostId` and, for each question, chooses the top answer by score descending, then creation date ascending; it writes the resulting map as pretty JSON.
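The cleaning and selection rules described above can be sketched with std-only Rust. This is an illustrative simplification, not the program's actual code: the real `clean_text` uses `html-escape` and `regex`, and real dates are parsed with `chrono`, whereas here tags are stripped with a tiny state machine and ISO-8601 date strings are compared lexicographically (which happens to match chronological order).

```rust
// Illustrative std-only sketch of the cleaning and top-answer logic.
#[derive(Debug, Clone)]
struct Answer {
    text: String,
    score: i64,
    creation_date: String, // ISO-8601; lexicographic order == chronological
}

/// Strip `<...>` tags, drop `@mention` tokens, collapse whitespace.
fn clean_text(raw: &str) -> String {
    let mut no_tags = String::new();
    let mut in_tag = false;
    for c in raw.chars() {
        match c {
            '<' => in_tag = true,
            '>' => in_tag = false,
            _ if !in_tag => no_tags.push(c),
            _ => {} // character inside a tag: skip it
        }
    }
    no_tags
        .split_whitespace()
        .filter(|w| !w.starts_with('@'))
        .collect::<Vec<_>>()
        .join(" ")
}

/// Top answer: highest score wins; ties go to the earliest creation date.
fn top_answer(answers: &[Answer]) -> Option<&Answer> {
    answers.iter().min_by(|a, b| {
        b.score
            .cmp(&a.score) // score descending
            .then_with(|| a.creation_date.cmp(&b.creation_date)) // date ascending
    })
}

fn main() {
    let cleaned = clean_text("<b>Thanks</b>   @alice for  the answer");
    assert_eq!(cleaned, "Thanks for the answer");

    let answers = vec![
        Answer { text: "Answer 1".into(), score: 5, creation_date: "2020-01-01T01:00:00".into() },
        Answer { text: "Answer 2".into(), score: 3, creation_date: "2020-01-01T02:00:00".into() },
    ];
    let best = top_answer(&answers).unwrap();
    assert_eq!(best.text, "Answer 1");
    println!("cleaned: {cleaned}\nbest: {}", best.text);
}
```

The comparator inverts the score comparison (`b.score.cmp(&a.score)`) so that `min_by` selects the highest score, while the tie-break compares dates in natural order to favor the oldest answer.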
Memory & performance notes
The code streams XML, so it does not build a full DOM — this helps with large files.
The program does keep two in-memory maps per file:

- `id_to_text` (`Id` → cleaned text)
- `groups` (`PostId` → `Vec<Answer>`)
If a single XML file contains hundreds of millions of rows, these maps can grow large and risk OOM. For extreme scale, consider:

- writing results incrementally (flush per post when possible),
- using a disk-backed key-value store (RocksDB/LevelDB),
- increasing host RAM or processing files in chunks.
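As an illustration of the first option, a per-post flush might look like the hypothetical sketch below. It assumes rows arrive grouped by `PostId` (true when the dump is sorted), and it writes hand-formatted JSON lines with std only; real code would use `serde_json` for proper escaping. `flush_post` is an invented name, not a function in the program.

```rust
use std::io::Write;

// Hypothetical incremental-flush sketch: emit each post's chosen answer
// as soon as that post's rows are complete, instead of holding every
// group in memory until the end of the file.
fn flush_post(out: &mut impl Write, post_id: u64, answer: &str) -> std::io::Result<()> {
    // NOTE: no JSON string escaping here; serde_json would handle that.
    writeln!(out, "{{\"post_id\": {post_id}, \"answer\": \"{answer}\"}}")
}

fn main() -> std::io::Result<()> {
    let mut out: Vec<u8> = Vec::new(); // stand-in for a file handle
    flush_post(&mut out, 10, "Answer 1")?;
    flush_post(&mut out, 11, "Another answer")?;
    let text = String::from_utf8(out).unwrap();
    assert_eq!(text.lines().count(), 2);
    print!("{text}");
    Ok(())
}
```

This trades the pretty-printed single JSON object for JSON Lines output, but keeps memory roughly constant regardless of file size.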
Always build with `--release` for best performance.
Troubleshooting
- Crate/version errors: ensure `Cargo.toml` versions match the snippet above; run `cargo update`.
- `quick-xml` API differences: this README targets `quick-xml = "0.36"`. If you use another version, API calls may differ.
- Lifetime/borrow errors: bind the result of `e.name()` to a variable before calling `.as_ref()` (the provided source handles this).
- Input path errors: verify `datasets/unzipped` exists and contains `.xml` files, or change the paths in `main()`.
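The lifetime note above can be illustrated with a std-only analogue: a method that returns an owned value must be bound to a variable before you borrow into it, or the temporary is dropped at the end of the statement. Here `make_name` is an invented stand-in for quick-xml's `e.name()`.

```rust
// Stand-in for an API call that returns an owned value by value.
fn make_name() -> Vec<u8> {
    b"row".to_vec()
}

fn main() {
    // `let bytes: &[u8] = make_name().as_ref();` would NOT compile:
    // the temporary Vec is dropped at the end of the statement while
    // `bytes` still borrows it (error E0716).
    let name = make_name(); // bind the owned value first
    let bytes: &[u8] = name.as_ref(); // now the borrow outlives its source
    assert_eq!(bytes, &b"row"[..]);
    println!("ok: {} bytes", bytes.len());
}
```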
Customization ideas