arditbe/rust-xml-to-json

xml2json

A small Rust utility that batch-converts XML files into cleaned JSON Q→A pairs.

It streams each XML file, extracts the attributes of every <row ... /> element (Id, PostId/ParentId, Text/Body/Content, Score, CreationDate), cleans the text (HTML entities, tags, @mentions, whitespace) and writes one pretty-printed JSON file per input XML file to an output directory.


Features

  • Streaming XML parsing with quick-xml — memory friendly for large files.
  • Cleans HTML entities, strips tags and @username mentions, normalizes whitespace.
  • Groups answers by PostId / ParentId and picks the top answer by Score (tie → oldest CreationDate).
  • Writes one JSON file per input XML.

Files

  • src/main.rs — the Rust program (paste your Rust source here).
  • Cargo.toml — project manifest with dependencies (example below).
  • datasets/unzipped/ — expected input folder (put your .xml files here).
  • cleaned_xml/ — output folder (created automatically).

Quick start (copy & paste)

1. Install Rust (if not installed)

curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
source "$HOME/.cargo/env"
rustc --version
cargo --version

2. Create the Cargo project (or use your existing repo)
cargo new xml2json
cd xml2json

3. Replace src/main.rs

Overwrite src/main.rs with the Rust source you prepared.

4. Use this Cargo.toml

Replace or merge into your Cargo.toml:

[package]
name = "xml2json"
version = "0.1.0"
edition = "2021"

[dependencies]
quick-xml = "0.36"
serde_json = "1.0"
regex = "1.11"
html-escape = "0.2.13"
chrono = { version = "0.4", features = ["serde"] }
anyhow = "1.0"
lazy_static = "1.4"
walkdir = "2.3"


Note: crate names in Cargo.toml may contain hyphens (e.g. html-escape, quick-xml), but Rust code refers to them with underscores instead (html_escape, quick_xml).

5. Prepare folders & a quick test XML
mkdir -p datasets/unzipped
cat > datasets/unzipped/test.xml <<'XML'
<root>
  <row Id="1" PostId="10" Text="&lt;b&gt;What is X?&lt;/b&gt;" Score="0" CreationDate="2020-01-01T00:00:00" />
  <row Id="2" PostId="10" Text="Answer 1" Score="5" CreationDate="2020-01-01T01:00:00" />
  <row Id="3" PostId="10" Text="Answer 2" Score="3" CreationDate="2020-01-01T02:00:00" />
</root>
XML

6. Build and run (release mode recommended)
cargo build --release
./target/release/xml2json
# or run directly:
cargo run --release


Expected output:

[*] Processing datasets/unzipped/test.xml -> cleaned_xml/test.json
[+] Wrote 1 Q->A pairs to cleaned_xml/test.json
[*] All XML files processed.


Then inspect:

cat cleaned_xml/test.json

How it works (short)

collect_comment_groups streams the XML with quick_xml::Reader, finds <row ... /> elements and extracts their attributes.

clean_text unescapes HTML entities (via html-escape), strips HTML tags (regex), removes @mentions, collapses whitespace and trims edge quotes.
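The cleaning steps above can be sketched with the standard library alone. This is not the program's actual clean_text: it swaps the html-escape crate for a tiny hand-written entity table and the regex tag stripper for a character scan, and the helper names decode_basic_entities and strip_tags are hypothetical. It only illustrates the order of the steps.

```rust
/// Decode a handful of common HTML entities.
/// Hypothetical stand-in for the html-escape crate used by the real program.
fn decode_basic_entities(s: &str) -> String {
    // `&amp;` is decoded last so that "&amp;lt;" becomes "&lt;", not "<".
    s.replace("&lt;", "<")
        .replace("&gt;", ">")
        .replace("&quot;", "\"")
        .replace("&#39;", "'")
        .replace("&amp;", "&")
}

/// Replace everything between '<' and '>' with a single space.
/// Hypothetical stand-in for the regex-based tag stripper; a literal '<'
/// in running text would also start a "tag" here, which is a known limitation.
fn strip_tags(s: &str) -> String {
    let mut out = String::with_capacity(s.len());
    let mut in_tag = false;
    for c in s.chars() {
        match c {
            '<' => in_tag = true,
            '>' if in_tag => {
                in_tag = false;
                out.push(' ');
            }
            _ if !in_tag => out.push(c),
            _ => {}
        }
    }
    out
}

/// Entities -> tags -> @mentions -> whitespace -> edge quotes.
fn clean_text(raw: &str) -> String {
    let decoded = decode_basic_entities(raw);
    let no_tags = strip_tags(&decoded);
    // split_whitespace collapses whitespace runs; tokens that start with '@'
    // are treated as mentions and dropped.
    let words: Vec<&str> = no_tags
        .split_whitespace()
        .filter(|w| !w.starts_with('@'))
        .collect();
    let joined = words.join(" ");
    joined
        .trim_matches(|c: char| c == '"' || c == '\'')
        .trim()
        .to_string()
}

fn main() {
    let raw = "&lt;b&gt;What is X?&lt;/b&gt;  @alice   please   help";
    println!("{}", clean_text(raw)); // prints: What is X? please help
}
```

Entities are decoded before tags are stripped, so escaped markup such as &lt;b&gt; is removed as well.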

write_json groups answers by PostId and, for each question, chooses the top answer by ordering on Score descending, then CreationDate ascending. It writes the resulting map as pretty-printed JSON.
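The selection rule (highest Score, ties broken by the oldest CreationDate) might look roughly like the sketch below. The Answer fields and the top_answer function are assumptions for illustration, not taken from the source. The sketch also compares timestamps as plain strings, which only works when every CreationDate uses the same ISO-8601 layout, as in the test XML above. The real program lists chrono as a dependency and may parse dates instead.

```rust
/// Hypothetical shape of one answer; the field names are assumptions.
#[derive(Debug, Clone, PartialEq)]
struct Answer {
    text: String,
    score: i64,
    /// ISO-8601 timestamp, e.g. "2020-01-01T01:00:00".
    created: String,
}

/// Pick the best answer: highest score first, then the oldest timestamp.
/// Returns None for an empty slice.
fn top_answer(answers: &[Answer]) -> Option<&Answer> {
    answers.iter().min_by(|a, b| {
        // Reversed score comparison makes the higher score the "minimum".
        b.score
            .cmp(&a.score)
            // On equal scores the lexicographically smaller (older) date wins.
            .then_with(|| a.created.cmp(&b.created))
    })
}

fn main() {
    let answers = vec![
        Answer { text: "Answer 1".into(), score: 5, created: "2020-01-01T01:00:00".into() },
        Answer { text: "Answer 2".into(), score: 3, created: "2020-01-01T02:00:00".into() },
    ];
    if let Some(best) = top_answer(&answers) {
        println!("{}", best.text); // prints: Answer 1
    }
}
```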

Memory & performance notes

The code streams XML, so it does not build a full DOM — this helps with large files.

The program does keep two in-memory maps per file:

  • id_to_text (Id → cleaned text)
  • groups (PostId → Vec<Answer>)

If a single XML file contains hundreds of millions of rows, these maps can grow large and risk running out of memory (OOM). For extreme scale consider:

  • writing results incrementally (flush per-post when possible),
  • using a disk-backed key-value store (RocksDB/LevelDB),
  • increasing host RAM or processing files in chunks.
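A further way to shrink the groups map, not taken from the original source, is to keep only the current best answer per PostId instead of a full Vec<Answer>. The sketch below assumes that only the top answer is ever written out. The Answer fields and the beats and offer helpers are hypothetical names, and timestamps are again compared as same-format ISO-8601 strings.

```rust
use std::collections::hash_map::Entry;
use std::collections::HashMap;

/// Hypothetical shape of one answer; the field names are assumptions.
#[derive(Debug, Clone)]
struct Answer {
    text: String,
    score: i64,
    created: String, // ISO-8601, same layout for every row
}

/// True if `candidate` should replace `current`
/// (higher score, or equal score and an older timestamp).
fn beats(candidate: &Answer, current: &Answer) -> bool {
    candidate.score > current.score
        || (candidate.score == current.score && candidate.created < current.created)
}

/// Offer one parsed row to the map; at most one Answer is stored per PostId,
/// so memory grows with the number of posts rather than the number of rows.
fn offer(best: &mut HashMap<u64, Answer>, post_id: u64, candidate: Answer) {
    match best.entry(post_id) {
        Entry::Occupied(mut slot) => {
            if beats(&candidate, slot.get()) {
                slot.insert(candidate);
            }
        }
        Entry::Vacant(slot) => {
            slot.insert(candidate);
        }
    }
}

fn main() {
    let mut best: HashMap<u64, Answer> = HashMap::new();
    offer(&mut best, 10, Answer { text: "Answer 2".into(), score: 3, created: "2020-01-01T02:00:00".into() });
    offer(&mut best, 10, Answer { text: "Answer 1".into(), score: 5, created: "2020-01-01T01:00:00".into() });
    println!("{}", best[&10].text); // prints: Answer 1
}
```

The trade-off is that the losing answers are discarded as soon as they are seen, so this only fits if nothing downstream needs them.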

Always build with --release for best performance.

Troubleshooting

Crate/version errors: make sure the versions in your Cargo.toml match the snippet above, then run cargo update to refresh Cargo.lock.

quick-xml API differences: this README targets quick-xml = "0.36". If you use another version, API calls may differ.

Lifetime/borrow errors: ensure e.name() results are bound to a variable before calling .as_ref() (the provided source handles this).

Input path errors: verify datasets/unzipped exists and contains .xml files or change the paths in main().

Customization ideas

About

Written in Rust and optimized for memory use.
