Discussion: PNG decoding performance improvement opportunities #416
Replies: 18 comments 83 replies
-
### General Optimization Tips

My first suggestion for benchmarking would be to collect a representative sample of a couple thousand PNGs to use as a corpus. In my experience, it is a much better use of time to measure many different images once than the same image many times. The corpus-bench was initially created to work on encoder performance and compression ratio, but with some tweaks it should be suitable for optimizing decoding (specifically, you'll want to measure the time decoding the original version of PNGs, rather than re-encoded versions). I've personally used the QOI corpus, but I suspect Chromium might already have -- or be able to collect -- its own corpora. Ideally, you'd measure both this crate and Chromium's existing PNG decoder against the same corpus, which should give an indication of how much potential for further improvement is left. (I'd caution you not to rely on other people's benchmarks here. As an example, the numbers quoted in the Wuffs blog post are quite out of date.) Having representative end-to-end benchmarks has two advantages: it lets you double-check that any optimizations are worthwhile, and profiling benchmark runs can help identify portions of the code worth optimizing. If you can compare traces between multiple decoders, that's even better for identifying optimization targets!
-
### Idea 5: Try to improve `expand_paletted`

The core problem here is that the vast majority of PNGs aren't paletted (though only a representative corpus will tell you the exact number...). There's probably a lot of low-hanging fruit here, so on the images that do have palettes you can make a big improvement. But optimizing a function that is never called for the vast majority of images isn't going to improve average performance very much, no matter how good a job you do!
-
### Idea 4: Try to improve decompression speed

The decoder already applies several of the usual tricks; another notable optimization that it uses is multi-byte decoding. This isn't to say there aren't further optimization opportunities. The decoder uses heuristics to decide how much time to spend building decoding tables versus performing decoding (bigger tables take longer to build but make decoding go faster). And there are surely other optimizations to be found. But this is all a long way of saying that there probably isn't a ton of easy performance wins to be found here.
-
### Idea 3: Minimize the number of allocations

There hasn't been a ton of investigation here. With focused attention, small gains are likely possible.
-
### Idea 2: Avoid copying data within the png crate

Profiling can be misleading here. Decoding an image requires pulling the bytes of the image from RAM into the CPU cache. The first time you do this will be quite slow. Subsequent copies of data that's already in the L1 cache will be extremely fast. If you avoid the first copy, you'll find that the next time the data is copied suddenly gets way slower. Avoiding copies is good for code clarity, but I personally don't think doing lots of copies is holding us back much.

### Idea 2.3: Avoid BufReader when unnecessary

I actually thought this was already removed. The convention elsewhere in image-rs is that decoders take an
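To illustrate the in-memory case: a byte slice already implements `BufRead`, so there is nothing for a `BufReader` to buffer, and wrapping one would only add an extra copy into its internal buffer. A minimal sketch with a hypothetical `first_byte` helper (not an actual crate API):

```rust
use std::io::{BufRead, Read};

// Hypothetical helper, just to show the API shape: it accepts anything
// that is BufRead, including a plain byte slice.
fn first_byte<R: BufRead>(mut input: R) -> std::io::Result<u8> {
    let mut byte = [0u8; 1];
    input.read_exact(&mut byte)?;
    Ok(byte[0])
}

fn main() {
    // The whole "file" is already in memory: no BufReader, no extra copy.
    let data: &[u8] = &[0x89, b'P', b'N', b'G'];
    assert_eq!(first_byte(data).unwrap(), 0x89);
    // Only a genuinely unbuffered source would be wrapped, e.g.
    // BufReader::new(File::open("image.png")?).
}
```

The design point is that `&[u8]` satisfies `BufRead` directly, so an API bounded on `BufRead` serves both the in-memory and the file case without a redundant buffering layer.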
-
### Idea 1: Explicit SIMD-ification of unfilter

I'm honestly thrilled this worked. Paeth with 3 bpp is probably the single most important case: 3 bytes per pixel is the most common pixel format and the Paeth filter gets the highest compression ratio*. The net result is that it gets used a lot. And despite those two factors, Paeth unfiltering was by far the slowest.

*An adaptive filter that picks among all filter methods wins by maybe one percentage point, but a large portion of the rows still get encoded with Paeth.
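For context, scalar Paeth unfiltering looks roughly like the sketch below (the predictor is as defined in the PNG specification; this is illustrative, not the crate's actual code). The byte-serial dependency on `curr[i - bpp]` is exactly what makes this filter the hardest of the four to SIMD-ify:

```rust
// Paeth predictor per the PNG spec: pick whichever of left (a), above (b),
// upper-left (c) is closest to the linear estimate p = a + b - c.
fn paeth_predictor(a: i16, b: i16, c: i16) -> u8 {
    let p = a + b - c;
    let (pa, pb, pc) = ((p - a).abs(), (p - b).abs(), (p - c).abs());
    if pa <= pb && pa <= pc {
        a as u8
    } else if pb <= pc {
        b as u8
    } else {
        c as u8
    }
}

// Scalar Paeth unfiltering of one row at `bpp` bytes per pixel.
// Note: curr[i] depends on the already-unfiltered curr[i - bpp],
// so iterations cannot be trivially vectorized.
fn unfilter_paeth(bpp: usize, prev: &[u8], curr: &mut [u8]) {
    for i in 0..curr.len() {
        let a = if i >= bpp { curr[i - bpp] as i16 } else { 0 };
        let b = prev[i] as i16;
        let c = if i >= bpp { prev[i - bpp] as i16 } else { 0 };
        curr[i] = curr[i].wrapping_add(paeth_predictor(a, b, c));
    }
}

fn main() {
    let mut curr = vec![5u8, 5u8];
    unfilter_paeth(1, &[10u8, 20u8], &mut curr);
    assert_eq!(curr, vec![15u8, 25u8]);
}
```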
-
### Benchmarking noise

Special thanks to @marshallpierce and @veluca93 for their tips on how to reduce benchmarking noise. In particular, the https://rigtorp.se/low-latency-guide/ link I got from @marshallpierce has been very helpful. For replicability, let me share the set of steps that I’ve followed when running the benchmarks reported below:
Other notes:
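For what it's worth, the host-tuning portion of such a setup can be sketched as shell commands like the following (illustrative only; flags come from the linked low-latency guide, and the core number and benchmark name are placeholders):

```shell
# Pin the CPU frequency so results don't vary with frequency scaling.
sudo cpupower frequency-set --governor performance

# Run the benchmark at high priority, pinned to a single core,
# to reduce interference from other processes and from migration.
nice -n -19 taskset -c 2 cargo bench --bench decoder
```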
### Benchmarking corpus

Thank you very much @fintelia for pointing out the QOI corpus of PNG images. That does indeed seem like a much better way to evaluate the performance of PNG decoding. I plan to switch to this corpus after completing (and measuring) the current batch of in-progress performance improvements.

### Idea 2: Avoid copying data within the `png` crate

I’ve tried combining ideas 2.1 and 2.2 into something that is both simpler and avoids even more copies (maybe let’s call this idea 2.5 when referring to it later?): 87b44a8

This seems to help: 1 image wasn’t affected by much, 1 image improved by more than 10%, and 4 images improved by 3-5%.

OTOH, I am a bit reluctant to pursue this direction further, because of idea 6 below…

### Idea 6: Avoiding the 32kB delay / L1 cache friendliness

Ideas 2.1, 2.2, and 2.5 above don’t change the fact that unfiltering happens with a delay of 32kB. And when discussing this with @veluca93 and @marshallpierce, they pointed out that such a delay means that things may drop out of the L1 cache most of the time (Skylake, for instance, has a 32K L1; and a bigger L1 cache doesn’t necessarily help outside of microbenchmarks, where other things put additional pressure on the cache). And avoiding the 32kB delay necessitates some copying (because we can't mutate the most recently decompressed 32kB), and therefore maybe idea 2.5 is not that great in the long run (i.e. avoiding the 32kB delay would require reverting parts of idea 2.5 and e.g. reintroducing some form of the

I think I’ll focus on this area in the next couple of days. Quick-and-dirty notes about what this may encompass:
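To make the 32kB window concrete: the inflater must keep the most recent 32kB of output addressable for LZ77 back-references, so any compaction of the output buffer has to preserve that tail. Below is a sketch of the deferred-compaction variant of idea 2.2 discussed earlier; all names here are made up for illustration and do not match the crate's actual fields:

```rust
// 32 kB LZ77 window that must stay addressable for back-references.
// (The real crate spells its constant CHUNCK_BUFFER_SIZE.)
const WINDOW: usize = 32 * 1024;

struct OutBuffer {
    buf: Vec<u8>,
    start: usize, // bytes before `start` were already handed to the caller
}

impl OutBuffer {
    // Hand `safe` finished bytes to the caller; compact only once the
    // consumed prefix exceeds 3 windows, so at most ~1 byte is memmoved
    // per 3 bytes returned (instead of compacting on every call).
    fn transfer(&mut self, safe: usize, sink: &mut Vec<u8>) {
        sink.extend_from_slice(&self.buf[self.start..self.start + safe]);
        self.start += safe;
        if self.start > WINDOW * 3 {
            // Keep the trailing 32 kB window; drop everything older.
            self.buf.drain(..self.start - WINDOW);
            self.start = WINDOW;
        }
    }
}

fn main() {
    let mut out = OutBuffer { buf: vec![7u8; 200_000], start: 0 };
    let mut sink = Vec::new();
    out.transfer(100_000, &mut sink); // crosses the 3-window threshold
    assert_eq!(sink.len(), 100_000);
    assert_eq!(out.start, WINDOW); // window preserved after compaction
}
```

The trade-off is visible in the constant: a larger multiplier means fewer memmoves but a bigger resident buffer, which is exactly the cache-pressure concern raised above.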
-
### More realistic benchmarking corpus

I have run some additional benchmarks on PNG images from the top 500 websites, and on the QOI corpus.

#### Fetching images from the top 500 websites

I have fetched images from the main pages of the top 500 most popular websites (according to https://moz.com/top500). The pages and their subresources were fetched using the following script:

#### Benchmarking process

I have measured performance using a WIP Chromium CL at https://crrev.com/c/4980955/5. After patching the CL (and the chain of 11 upstream WIP CLs) I have built

The benchmarking process has been kicked off against the top 500 websites and against the QOI corpus (recommended above) by running the following commands:

The benchmark tested 2 implementations of PNG decoding under //ui/gfx/codec:
Disclaimers and other notes:
#### Results

Spreadsheet with the results can be found here:

For each tested file, the benchmark has recorded the following information about the behavior of the PNG decoder under
The spreadsheet has additionally computed the following data:
#### Discussion of some of the results

It is interesting to note the distribution of various PNG properties - this should help guide future optimization work:
The Rust-vs-C++ speed difference is more pronounced for small images, but small images don’t contribute as much as big images to the total runtime (at least from Chromium's website-rendering perspective - the conclusion may be different in other scenarios that are more heavily skewed toward small images).

### Revisiting copy avoidance

#### Effects on the bigger test corpus

I have tried using the bigger testing corpus to test the performance impact of the copy avoidance approach (see the series of commits here). The numbers below show the ratio between the total, cumulative Rust runtime and the C/C++ runtime:
#### Effects on non-compressed images

I have also created bare-bones PNG images, where no

The images used for these benchmarks are somewhat artificial, but they do help to measure the performance effects of the code outside of decompression (by magnifying this effect and therefore avoiding noise). My initial assumption was that if a change has a positive performance effect on such images, then it should also have a positive (although smaller) effect on more realistic images. It seems that the results from the QOI test corpus have invalidated this assumption.

#### Next steps

I will break the copy avoidance commits into a set of a few smaller PRs. Hopefully proceeding in this way will shed some light on the source of the unexpected slowdown observed with the QOI test corpus (i.e. maybe I made a mistake in one of the commits and we can catch this with a collective review and/or with the new “noncompressed” benchmarks; one suspicious part is the somewhat arbitrary choice of constants - we probably should use less than 1MB for

Various notes and arguments for proceeding with merging the PRs:
-
I just wanted to share that I plan to try implementing and measuring the impact of the following on the noncompressed-8x8 benchmark. (But I promise that no cookies have been licked - please just give me a heads-up if you plan to work on these items yourself.) Hopefully today or tomorrow:
Hopefully later this week:
/cc @fintelia
-
EDIT 2024-01-12: WARNING: Conclusions of this comment/post are incorrect because of a mistake in measurement methodology (see here).

I just wanted to note another negative result and confirm that investing in

When profiling all 5 of these files via

And when trying the improvement ideas from the commit here, the resulting runtime gets worse:
-
There is a known performance issue on paletted images: #393. It is possible to do much better than the
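For context on the baseline being improved here, a naive per-pixel palette expansion looks roughly like this sketch (illustrative only, not the crate's actual code; #393 discusses faster approaches, e.g. precomputing wider lookup entries):

```rust
// Naive expansion of 8-bit palette indices into packed RGB bytes.
// One bounds-checked 3-byte copy per pixel: simple, but hard for the
// compiler to vectorize.
fn expand_palette_rgb(indices: &[u8], palette: &[u8], out: &mut Vec<u8>) {
    for &idx in indices {
        let base = idx as usize * 3;
        out.extend_from_slice(&palette[base..base + 3]);
    }
}

fn main() {
    let palette = [255u8, 0, 0, 0, 255, 0]; // entry 0: red, entry 1: green
    let mut out = Vec::new();
    expand_palette_rgb(&[1, 0], &palette, &mut out);
    assert_eq!(out, [0, 255, 0, 255, 0, 0]); // green pixel, then red pixel
}
```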
-
Hi, thought it would be nice to inform you that I ran some benchmarks on the latest.

The summary is that … but kudos to everyone involved!
-
A significant performance improvement can be obtained by switching the underlying Zlib implementation from

With zlib-rs and the
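For reference, on the flate2 side such a backend switch is a Cargo feature selection. A hypothetical sketch (feature names per flate2's documentation; how the `png`/`image` stack would actually expose this may differ):

```toml
# Hypothetical: select the zlib-rs backend for flate2
# (default features disabled so the default miniz_oxide backend is dropped).
[dependencies]
flate2 = { version = "1", default-features = false, features = ["zlib-rs"] }
```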
-
stb_image is really good. Fabian, one of the developers, has done interesting optimizations in the decoder; e.g. their Paeth implementation is https://github.com/nothings/stb/blob/5c205738c191bcb0abc65c4febfa9bd25ff35234/stb_image.h#L4657 which looks a bit weird, but boosted decoding perf for some Paeth-heavy images by 20% on some corpus I had when the implementation fell back to scalar, so I am not surprised it's doing competitively well.
On Thu, 28 Nov 2024 at 08:05, Jonathan Behrens wrote:

> I'm glad you're seeing similar numbers. One interesting reference is this 2021 blog post <https://nigeltao.github.io/blog/2021/fastest-safest-png-decoder.html> comparing wuffs' PNG decoder to others. The sample size is very limited, but at the time they measured wuffs being 2-3x faster than their C-based competitors and libpng being overall a bit slower than spng and stb_image. (They also measured this crate; it is incredible to see just how far we've come since then!)
>
> I'd totally believe that something strange is going on with the benchmark harness, but I haven't been able to find anything that would account for the size of differences we're seeing. Just minor differences in the decoding options used and how input/output buffers are managed. Though it is entirely possible that one or more of the libraries are being used incorrectly somehow.
>
> I also just added logic to write a measurements.csv file when the harness finishes. It includes the decoding time in milliseconds for each image across the different decoders to make it easier to investigate outliers.
-
Below I am sharing some data that may help estimate the impact of PGO (Profile-Guided Optimization) on PNG decoding runtime. (This is relevant for me, because by default Chromium's PGO data is generated using only one experiment arm.) The TL;DR is that we can probably expect a 1.3% - 5.9% improvement.

Summary of results:
My repro steps were based on https://doc.rust-lang.org/rustc/profile-guided-optimization.html:

Step 1: gather baseline measurements without PGO.

Step 2: gather PGO data and build an optimized binary.

Step 3: measure and compare PGO vs non-PGO (I am eliding results for artificial /
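The three steps map onto the rustc PGO workflow described on that page; roughly (paths and the benchmark binary name are placeholders, not my exact invocations):

```shell
# Step 2 expanded, per the rustc book's PGO chapter:
rustc -Cprofile-generate=/tmp/pgo-data -O main.rs        # instrumented build
./main                                                   # run a representative decoding workload
llvm-profdata merge -o /tmp/pgo-data/merged.profdata /tmp/pgo-data
rustc -Cprofile-use=/tmp/pgo-data/merged.profdata -O main.rs  # PGO-optimized build
```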
-
@anforowicz I am probably getting ahead of myself, but it seems that the required pieces for getting PNG out of a sandbox entirely are falling into place: there is now a pure-Rust color management system, https://github.com/awxkee/moxcms, and https://crates.io/crates/kamadak-exif for memory-safe Exif parsing. It is likely that they will require some modification to match the current Skia behavior, but at least the building blocks are there at last. Although I'm not convinced that pursuing this would be a better use of resources than e.g. shipping a memory-safe WebP decoder.
-
I've been doing some investigation into explicit SIMD-ification of the unfilter code.

### Theoretical speedups from micro-architecture simulation

Generally I see a pretty good speedup on most micro-architectures, but it looks like there's a little bit of a regression for up (3bpp) that I still need to work on:
This comes from a little test-bench program that looks like this:

```rust
// Microbenchmark for all the various filters
use png::benchable_apis::unfilter;
use png::Filter;
use std::hint::black_box;

const CURRENT_ROW: [u8; 771] = [
    // SNIP
];
const PREV_ROW: [u8; 771] = [
    // SNIP
];

fn main() {
    let bpps: [u8; 1] = [4];
    let filters: [Filter; 1] = [
        //Filter::Sub,
        //Filter::Up,
        //Filter::Avg,
        Filter::Paeth,
    ];
    for _ in 0..64 {
        for &filter in filters.iter() {
            for &bpp in bpps.iter() {
                let mut curr_row: Vec<u8> = CURRENT_ROW.to_vec();
                black_box(unfilter(filter, bpp, &PREV_ROW, curr_row.as_mut_slice()));
            }
        }
    }
}
```

I've compiled the test bench and run the filter code under a micro-architecture simulator (based on llvm-mca) to produce the above results.

### Results on a physical system on the exif corpus

The exif corpus is here: https://github.com/getlantern/exif-image-corpus

On a real system, libpng (baseline, blue line) tends to perform best on the smallest images (i.e. ones that are less than the 75th percentile of libpng decode time on a Cortex A520). On the Arm Cortex X4, image-rs starts to pull ahead at the 75th percentile with the auto-vectorized code (red line, highend_improved2). At the highest percentiles, the new filter code comfortably pulls ahead of the auto-vectorized code on the little core for RGB and the big core for RGBA.
Here's an idea of what the speedup looks like across a random subset of 512 images on the RGB corpus, broken down by the percentile of decode time on the little core (libpng baseline is implicitly 0%):
I think so far we're comfortably ahead for RGBA against the auto-vectorized baseline, and we're ahead for RGB on the little core, where the slowdown is likely to be most perceptible. There are potentially a few more refinements to do (such as switching off the 3bpp RGB path on Arm, which doesn't seem to perform well), but I think we're at the point where we can start a PR for the Paeth filter. @anforowicz WDYT?




Uh oh!
There was an error while loading. Please reload this page.
Uh oh!
There was an error while loading. Please reload this page.
-
Hello!
I wanted to share some of my ideas for performance opportunities in the `png` crate. I hope that we can discuss these ideas together to refine them - for example:
I hope that this issue is a good way to have those discussions, but please shout if you think there might be better ways to engage. For example - maybe I should instead try joining a discord server somewhere? Joining a mailing list or discussion group? Writing all of this in a google doc (and discussing in doc comments)?
I apologize in advance if you find the text below rather long. I tried to keep things structured and avoid rambling, but I also wanted to brainstorm (and avoid immediately rejecting) half-baked or even silly ideas. Please bear with me :-).
My ultimate goal is to bring the performance of the `png` crate to parity with Chromium’s C++ implementation. A description of my (quite possibly flawed) comparison/measurement methodology can be found in the (incomplete / work-in-progress / please-be-gentle) “Discussion of benchmarking methodology” section in a separate doc (the link should be viewable by anyone; I’d be happy to grant commenting access if needed).

## Performance improvement ideas
### Idea 1: Explicit SIMD-ification of `unfilter`

Why this idea seems promising:

- `perf record` of the `png` crate’s benchmarks says that `png::filter::unfilter` accounts for 18% of runtime
- `libpng` uses explicit SIMD intrinsics to SIMD-ify unfiltering (e.g. `png_read_filter_row_avg3_sse2` and `png_read_filter_row_avg4_sse2`; also `png_read_filter_row_sub3_sse2`, `png_read_filter_row_avg3_sse2`, and `png_read_filter_row_paeth3_sse2`)

Status: PR is under review:

- `unfilter` benchmarks landed in “Scaffolding for direct benchmarking of `crate::filter::unfilter`” #413
- “`std::simd` to speed up `unfilter` for `Paeth` for bpp=3 and bpp=6” #414 (this yields 20% improvements in `unfilter` benchmarks)

Discussion of next steps:

- After “`std::simd` to speed up `unfilter` for `Paeth` for bpp=3 and bpp=6” #414, `png::filter::unfilter` went down to 8.6% of runtime - this is the upper bound on possible further improvements.

### Idea 2: Avoid copying data within the `png` crate

Why this idea seems promising:

- This seems to make intuitive sense - avoiding unnecessary work (moving bytes around) should improve performance (if we can avoid somehow hurting performance as a side effect - via extra computations, or extra cache pressure…)
- `perf record` of the `png` crate’s benchmarks (after SIMD-ification of `unfilter`) shows:
  - `__memmove_avx_unaligned_erms`: 9.25% of runtime
  - `__memset_avx2_unaligned_erms`: 1.45% of runtime
  - `__memmove_avx_unaligned`: 0.98% of runtime

#### Idea 2.1: Reduce copying of raw rows

Goal: get rid of the copies done in the two places here:

- `self.prev[..rowlen].copy_from_slice(&row[..rowlen])`
- `self.current.drain(..self.scan_start).for_each(drop);` (copies all bytes from `self.scan_start` to the end of the `self.current` vector)

Status: in progress - actively working on this:

- Prototype: see here
- Problem: I need to convince myself that this really helps (see also the benchmarking write-up in a separate section below). Strategy:
  - Microbenchmark `RawRowsBuffer` while still replicating decoding/decompressing patterns (of, say, `kodim02.png`). This can be done by abstracting away `ReadDecoder` by hiding it behind an equivalent `AbstractReadDecoder` trait. The abstract trait can also be implemented as a pass-thru to capture/log the decoding/decompressing patterns (that should be replicated by the microbenchmark).
  - Craft an input that stresses `unfilter` and uses no Huffman encoding… This seems hard…

#### Idea 2.2: Reduce copying within `ZlibStream::transfer_finished_data`

There seem to be 2 performance improvement opportunities within `ZlibStream::transfer_finished_data`:

1. Avoid copying not-yet-decompressed/populated bytes in `self.out_buffer[self.out_pos..]`:
   - One option is to drop `fdeflate`’s assumption that `out_buffer` has been zero-ed out. In this case, these bytes can simply be ignored.
   - Another option is to keep `fdeflate`'s assumption and still avoid copying these bytes by relying on zeroing them out in `prepare_vec_for_appending`. In other words, we can replace a (slower?) `memcpy` with a (faster?) `memset`.
2. Reduce how often `out_buffer` is compacted:
   - Today, `transfer_finished_data` compacts the buffer every time (copying 32kB every time a `safe` number of bytes is returned to the client; possibly copying multiple bytes per single returned byte).
   - Instead, we could compact only when `safe > CHUNCK_BUFFER_SIZE * 3` (copying 32kB after returning at least 32*3kB of data; copying at most 1 byte per 3 returned bytes). The number 3 is somewhat arbitrary and there is a trade-off here: a higher constant means more memory/cache pressure (but less copying).

Status: started - plan to resume work on this soon:

- Microbenchmark `ZlibStream` while still replicating decoding/decompressing patterns (of, say, `kodim02.png`)

#### Idea 2.3: Avoid `BufReader` when unnecessary

Sometimes the input of the `png` crate has already been buffered into memory (Chromium’s `//ui/gfx/codec/png_codec.cc` takes a span of memory as input; the `png` crate’s benchmarks use `Vec<u8>`). In such scenarios, using a `BufReader` will needlessly copy the already-buffered/already-in-memory bytes into `BufReader`’s internal buffer.

Status: thinking how to best address this:

- Maybe the `png` and `image` crates should take `BufRead` instead of `Read` as input?
- The `png` (and `image`?) crate can be bifurcated (internally holding `Box<dyn BufRead>` instead of `BufReader<R>` in `ReadDecoder::reader`):
  - `new` can continue taking `Read` as input (maybe deprecate `new` + rename it into `from_unbuffered_input`)
  - A new constructor (`from_buf_read`? `from_buffered_input`?) can take `BufRead` as input. Alternatively we could also introduce a `from_slice` API (OTOH all slices are `BufRead` but not all `BufRead`s are slices, so this isn't a fully generic solution).

#### Idea 2.4: Other copy avoidance

Half-baked ideas:

- `ZlibStream::out_buffer` (extension of idea 2.2): maybe we can avoid compacting at all, with a data structure that acts like a `Vec` (i.e. elements occupy contiguous memory) but that supports efficient dropping of large prefixes (bypassing the allocator and returning whole memory pages to the OS?)
- (`fdeflate`...)
- `ZlibStream::in_buffer`:

### Idea 3: Minimize the number of allocations

Why this idea seems promising:

- Profiles show allocator frames such as `partition_alloc::internal::PartitionBucket::SlowPathAlloc`, `allocator_shim::internal::PartitionMalloc`, and `allocator_shim::internal::PartitionFree`
- Allocation hot spots (per `heaptrack` for the `png` crate's `decoder` benchmarks):
  - `fdeflate::decompress::Decompressor::build_tables`
  - `png::decoder::zlib::ZlibStream::new`
  - `png::decoder::stream::StreamingDecoder::parse_text`
  - `png::decoder::zlib::ZlibStream::transfer_finished_data`

Status: not started yet:

- The `image` crate and Chromium ignore text chunks - we should just call `png::Decoder::set_ignore_text_chunk`. (This should help with `parse_text`.)
- Maybe the `Box` in `ZlibStream` is not needed?

### Idea 4: Try to improve decompression speed

Why this idea seems promising:

- Chromium uses an optimized `zlib`. In particular, there are SIMD-related patches here (see also a slide here by ARM engineers)
- `perf record` of the `png` crate’s benchmarks says that time is spent in:
  - `fdeflate::decompress::Decompressor::read`
  - `fdeflate::decompress::Decompressor::build_tables`

Status: haven’t really started looking at the `fdeflate` source code yet:

- (maybe there are opportunities in `fdeflate`... I haven’t started reading the code yet…)

### Idea 5: Try to improve `expand_paletted`

Why this idea seemed promising:

- I remember seeing `expand_paletted` relatively high (11% I think?) in a profile (but I didn’t take sufficient notes and now I have trouble replicating this…)
- There may be SIMD opportunities (`std::simd::Simd::gather_or_default`? mimicking Chromium SIMD code that I don’t understand?)

Status: tentatively abandoned for now

- A prototype improved `expand_palette` microbenchmarks by 21% to 56%, but `expand_paletted` barely registers in a runtime profile of the `png` crate’s end-to-end benchmarks (even when using the original testcase proposed by ARM engineers in https://crbug.com/706134#c6; I tested after tweaking the `png` crate's benchmark to use the `EXPAND` transformation similarly to how the `png` crate is used from the `image` crate)

## Benchmarking is hard…
I find it challenging to evaluate commits/prototypes/changes to see whether they result in a performance improvement. Let me share some of my thoughts below (hoping to get feedback that will help me navigate this area of software engineering).

The best case is if a change has such a big positive impact that it clearly shows up on benchmarks - for example, the `unfilter` changes (idea 1 above) show a clear improvement in the `png` crate’s benchmarks (not only a 20% improvement on `unfilter` microbenchmarks for Paeth/bpp=3, but also a 5-8% improvement on some end-to-end `decoder` benchmarks and a “change within noise threshold” for other end-to-end benchmarks). OTOH, there are changes that seem beneficial (e.g. avoiding memory copies - idea 2 above) that have relatively modest gains (maybe 1% - 7% in end-to-end benchmarks) and that may sometimes (randomly - due to noise?) seem to be performance regressions. I struggle a bit figuring out how to cope with the latter scenario:

- Maybe run benchmarks longer, with `--sample-size=1000 --measurement-time=500`? (This doesn’t seem to help. I still get wild swings in results… :-/)
- Maybe use `cachegrind`-based, `iai`-based instruction-count-focused benchmarks? OTOH it seems to me that the estimated cycle count can be rather inaccurate. The `unfilter` change for example shows the following changes in estimated cycles: kodim02=-8.1% (-5.1% runtime), kodim07=-19.2% (-8.4% runtime), kodim17=-0.01% (+0.87% runtime), kodim23=-5.7% (-2.4% runtime).
- Maybe use microbenchmarks (rather than end-to-end `decoder` benchmarks of the `png` crate) for `RawRowsBuffer` (idea 2.1) and `ZlibStream` (idea 2.2)?
- Maybe reduce noise from other processes: `nice -19` is one way (sadly I didn’t think of using that initially…). `nice -19` helps to some extent, but I think that the benchmark may still yield to other processes (?) and/or interrupt handlers? OTOH, I am not sure if there is an easy way to do that: I don’t have the `csets` command on my Linux box; writing to /proc/irq/0/smp_affinity seems icky; I am not sure if I can control boot flags in my datacenter-based machine to use `isolcpus`.

So, what do you think? Does any of the above make sense? Am I wildly off anywhere? Any specific/actionable suggestions?
Feedback welcomed!