Discussion: PNG decoding performance improvement opportunities #416
Replies: 18 comments 83 replies
-
### General Optimization Tips

My first suggestion for benchmarking would be to collect a representative sample of a couple thousand PNGs to use as a corpus. In my experience, it is a much better use of time to measure many different images once than the same image many times. The corpus-bench was initially created to work on encoder performance and compression ratio, but with some tweaks it should be suitable for optimizing decoding (specifically, you'll want to measure the time decoding the original version of PNGs, rather than re-encoded versions). I've personally used the QOI corpus, but I suspect Chromium might already have -- or be able to collect -- its own corpora. Ideally, you'd measure both this crate and Chromium's existing PNG decoder against the same corpus, which should give an indication of how much potential for further improvement is left. (I'd caution you not to rely on other people's benchmarks here. As an example, the numbers quoted in the Wuffs blog post are quite out of date.) Having representative end-to-end benchmarks has two advantages: it lets you double-check that any optimizations are worthwhile, and profiling benchmark runs can help identify portions of the code worth optimizing. If you can compare traces between multiple decoders, that's even better for identifying optimization targets!
-
### Idea 5: Try to improve `expand_paletted`

The core problem here is that the vast majority of PNGs aren't paletted (though only a representative corpus will tell you the exact number...). There's probably a lot of low-hanging fruit here, so on the images that do have palettes you can make a big improvement. But optimizing a function that is never called for the vast majority of images isn't going to improve average performance very much, no matter how good a job you do!
-
### Idea 4: Try to improve decompression speed

The decoder already applies several of the usual tricks; another notable optimization that it uses is multi-byte decoding. This isn't to say there aren't further optimization opportunities. The decoder uses heuristics to decide how much time to spend building decoding tables versus performing decoding (bigger tables take longer to build but make decoding go faster). And there are surely other optimizations to be found. But this is all a long way of saying that there probably isn't a ton of easy performance wins to be found here.
-
### Idea 3: Minimize the number of allocations

There hasn't been a ton of investigation here. With focused attention, small gains are likely possible.
-
### Idea 2: Avoid copying data within the png crate

Profiling can be misleading here. Decoding an image requires pulling the bytes of the image from RAM into the CPU cache. The first time you do this will be quite slow. Subsequent copies of data that's already in the L1 cache will be extremely fast. If you avoid the first copy, you'll find that the next time the data is copied suddenly gets way slower. Avoiding copies is good for code clarity, but I personally don't think doing lots of copies is holding us back much.

### Idea 2.3: Avoid BufReader when unnecessary

I actually thought this was already removed. The convention elsewhere in image-rs is that decoders take an
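To illustrate the in-memory case: a byte slice already implements `BufRead`, so there is nothing for a `BufReader` to buffer, and wrapping one would only add an extra copy into its internal buffer. A minimal sketch with a hypothetical `first_byte` helper (not an actual crate API):

```rust
use std::io::{BufRead, Read};

// Hypothetical helper, just to show the API shape: it accepts anything
// that is BufRead, including a plain byte slice.
fn first_byte<R: BufRead>(mut input: R) -> std::io::Result<u8> {
    let mut byte = [0u8; 1];
    input.read_exact(&mut byte)?;
    Ok(byte[0])
}

fn main() {
    // The whole "file" is already in memory: no BufReader, no extra copy.
    let data: &[u8] = &[0x89, b'P', b'N', b'G'];
    assert_eq!(first_byte(data).unwrap(), 0x89);
    // Only a genuinely unbuffered source would be wrapped, e.g.
    // BufReader::new(File::open("image.png")?).
}
```

The design point is that `&[u8]` satisfies `BufRead` directly, so an API bounded on `BufRead` serves both the in-memory and the file case without a redundant buffering layer.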
-
### Idea 1: Explicit SIMD-ification of unfilter

I'm honestly thrilled this worked. Paeth with 3 bpp is probably the single most important case: 3 bytes per pixel is the most common pixel format and the Paeth filter gets the highest compression ratio*. The net result is that it gets used a lot. And despite those two factors, Paeth unfiltering was by far the slowest.

*An adaptive filter that picks among all filter methods wins by maybe one percentage point, but a large portion of the rows still get encoded with Paeth.
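For context, scalar Paeth unfiltering looks roughly like the sketch below (the predictor is as defined in the PNG specification; this is illustrative, not the crate's actual code). The byte-serial dependency on `curr[i - bpp]` is exactly what makes this filter the hardest of the four to SIMD-ify:

```rust
// Paeth predictor per the PNG spec: pick whichever of left (a), above (b),
// upper-left (c) is closest to the linear estimate p = a + b - c.
fn paeth_predictor(a: i16, b: i16, c: i16) -> u8 {
    let p = a + b - c;
    let (pa, pb, pc) = ((p - a).abs(), (p - b).abs(), (p - c).abs());
    if pa <= pb && pa <= pc {
        a as u8
    } else if pb <= pc {
        b as u8
    } else {
        c as u8
    }
}

// Scalar Paeth unfiltering of one row at `bpp` bytes per pixel.
// Note: curr[i] depends on the already-unfiltered curr[i - bpp],
// so iterations cannot be trivially vectorized.
fn unfilter_paeth(bpp: usize, prev: &[u8], curr: &mut [u8]) {
    for i in 0..curr.len() {
        let a = if i >= bpp { curr[i - bpp] as i16 } else { 0 };
        let b = prev[i] as i16;
        let c = if i >= bpp { prev[i - bpp] as i16 } else { 0 };
        curr[i] = curr[i].wrapping_add(paeth_predictor(a, b, c));
    }
}

fn main() {
    let mut curr = vec![5u8, 5u8];
    unfilter_paeth(1, &[10u8, 20u8], &mut curr);
    assert_eq!(curr, vec![15u8, 25u8]);
}
```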
-
### Benchmarking noise

Special thanks to @marshallpierce and @veluca93 for their tips on how to reduce benchmarking noise. In particular, the https://rigtorp.se/low-latency-guide/ link I got from @marshallpierce has been very helpful. For replicability, let me share the set of steps that I’ve followed when running the benchmarks reported below:
Other notes:
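For what it's worth, the host-tuning portion of such a setup can be sketched as shell commands like the following (illustrative only; flags come from the linked low-latency guide, and the core number and benchmark name are placeholders):

```shell
# Pin the CPU frequency so results don't vary with frequency scaling.
sudo cpupower frequency-set --governor performance

# Run the benchmark at high priority, pinned to a single core,
# to reduce interference from other processes and from migration.
nice -n -19 taskset -c 2 cargo bench --bench decoder
```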
### Benchmarking corpus

Thank you very much @fintelia for pointing out the QOI corpus of PNG images. That does indeed seem like a much better way to evaluate the performance of PNG decoding. I plan to switch to this corpus after completing (and measuring) the current batch of in-progress performance improvements.

### Idea 2: Avoid copying data within the `png` crate

I’ve tried combining ideas 2.1 and 2.2 into something that is both simpler and avoids even more copies (maybe let’s call this idea 2.5 when referring to it later?): 87b44a8

This seems to help: 1 image wasn’t affected by much, 1 image improved by more than 10%, and 4 images improved by 3-5%.

OTOH, I am a bit reluctant to pursue this direction further, because of idea 6 below…

### Idea 6: Avoiding the 32kB delay / L1 cache friendliness

Ideas 2.1, 2.2, and 2.5 above don’t change the fact that unfiltering happens with a delay of 32kB. And when discussing this with @veluca93 and @marshallpierce, they pointed out that such a delay means that things may drop out of the L1 cache most of the time (Skylake, for instance, has a 32K L1; and a bigger L1 cache doesn’t necessarily help outside of microbenchmarks, where other things put additional pressure on the cache). And avoiding the 32kB delay necessitates some copying (because we can't mutate the most recently decompressed 32kB), and therefore maybe idea 2.5 is not that great in the long run (i.e. avoiding the 32kB delay would require reverting parts of idea 2.5 and e.g. reintroducing some form of the

I think I’ll focus on this area in the next couple of days. Quick-and-dirty notes about what this may encompass:
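To make the 32kB window concrete: the inflater must keep the most recent 32kB of output addressable for LZ77 back-references, so any compaction of the output buffer has to preserve that tail. Below is a sketch of the deferred-compaction variant of idea 2.2 discussed earlier; all names here are made up for illustration and do not match the crate's actual fields:

```rust
// 32 kB LZ77 window that must stay addressable for back-references.
// (The real crate spells its constant CHUNCK_BUFFER_SIZE.)
const WINDOW: usize = 32 * 1024;

struct OutBuffer {
    buf: Vec<u8>,
    start: usize, // bytes before `start` were already handed to the caller
}

impl OutBuffer {
    // Hand `safe` finished bytes to the caller; compact only once the
    // consumed prefix exceeds 3 windows, so at most ~1 byte is memmoved
    // per 3 bytes returned (instead of compacting on every call).
    fn transfer(&mut self, safe: usize, sink: &mut Vec<u8>) {
        sink.extend_from_slice(&self.buf[self.start..self.start + safe]);
        self.start += safe;
        if self.start > WINDOW * 3 {
            // Keep the trailing 32 kB window; drop everything older.
            self.buf.drain(..self.start - WINDOW);
            self.start = WINDOW;
        }
    }
}

fn main() {
    let mut out = OutBuffer { buf: vec![7u8; 200_000], start: 0 };
    let mut sink = Vec::new();
    out.transfer(100_000, &mut sink); // crosses the 3-window threshold
    assert_eq!(sink.len(), 100_000);
    assert_eq!(out.start, WINDOW); // window preserved after compaction
}
```

The trade-off is visible in the constant: a larger multiplier means fewer memmoves but a bigger resident buffer, which is exactly the cache-pressure concern raised above.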
-
### More realistic benchmarking corpus

I have run some additional benchmarks on PNG images from the top 500 websites, and on the QOI corpus.

#### Fetching images from the top 500 websites

I have fetched images from the main pages of the top 500 most popular websites (according to https://moz.com/top500). The pages and their subresources were fetched using the following script:

#### Benchmarking process

I have measured performance using a WIP Chromium CL at https://crrev.com/c/4980955/5. After patching the CL (and the chain of 11 upstream WIP CLs) I have built

The benchmarking process has been kicked off against the top 500 websites and against the QOI corpus (recommended above) by running the following commands:

The benchmark tested 2 implementations of PNG decoding under //ui/gfx/codec:
Disclaimers and other notes:
#### Results

Spreadsheet with the results can be found here:

For each tested file, the benchmark has recorded the following information about the behavior of the PNG decoder under
The spreadsheet has additionally computed the following data:
#### Discussion of some of the results

It is interesting to note the distribution of various PNG properties - this should help guide future optimization work:
The Rust-vs-C++ speed difference is more pronounced for small images, but small images don’t contribute as much as big images to the total runtime (at least from Chromium's website-rendering perspective - the conclusion may be different in other scenarios that are more heavily skewed toward small images).

### Revisiting copy avoidance

#### Effects on the bigger test corpus

I have tried using the bigger testing corpus to test the performance impact of the copy avoidance approach (see the series of commits here). The numbers below show the ratio between the total, cumulative Rust runtime and the C/C++ runtime:
#### Effects on non-compressed images

I have also created bare-bones PNG images, where no

The images used for these benchmarks are somewhat artificial, but they do help to measure the performance effects of the code outside of decompression (by magnifying this effect and therefore avoiding noise). My initial assumption was that if a change has a positive performance effect on such images, then it should also have a positive (although smaller) effect on more realistic images. It seems that the results from the QOI test corpus have invalidated this assumption.

#### Next steps

I will break the copy avoidance commits into a set of a few smaller PRs. Hopefully proceeding in this way will shed some light on the source of the unexpected slowdown observed with the QOI test corpus (i.e. maybe I made a mistake in one of the commits and we can catch this with a collective review and/or with the new “noncompressed” benchmarks; one suspicious part is the somewhat arbitrary choice of constants - we probably should use less than 1MB for

Various notes and arguments for proceeding with merging the PRs:
-
I just wanted to share that I plan to try implementing and measuring the impact of the following on the noncompressed-8x8 benchmark. (But I promise that no cookies have been licked - please just give me a heads-up if you plan to work on these items yourself.) Hopefully today or tomorrow:
Hopefully later this week:
/cc @fintelia
-
EDIT 2024-01-12: WARNING: Conclusions of this comment/post are incorrect because of a mistake in measurement methodology (see here).

I just wanted to note another negative result and confirm that investing in

When profiling all 5 of these files via

And when trying the improvement ideas from the commit here, the resulting runtime gets worse:
-
There is a known performance issue on paletted images: #393. It is possible to do much better than the
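For context on the baseline being improved here, a naive per-pixel palette expansion looks roughly like this sketch (illustrative only, not the crate's actual code; #393 discusses faster approaches, e.g. precomputing wider lookup entries):

```rust
// Naive expansion of 8-bit palette indices into packed RGB bytes.
// One bounds-checked 3-byte copy per pixel: simple, but hard for the
// compiler to vectorize.
fn expand_palette_rgb(indices: &[u8], palette: &[u8], out: &mut Vec<u8>) {
    for &idx in indices {
        let base = idx as usize * 3;
        out.extend_from_slice(&palette[base..base + 3]);
    }
}

fn main() {
    let palette = [255u8, 0, 0, 0, 255, 0]; // entry 0: red, entry 1: green
    let mut out = Vec::new();
    expand_palette_rgb(&[1, 0], &palette, &mut out);
    assert_eq!(out, [0, 255, 0, 255, 0, 0]); // green pixel, then red pixel
}
```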
-
Hi, thought it would be nice to inform you that I ran some benchmarks on the latest.

The summary is that … but kudos to everyone involved!
-
A significant performance improvement can be obtained by switching the underlying Zlib implementation from

With zlib-rs and the
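For reference, on the flate2 side such a backend switch is a Cargo feature selection. A hypothetical sketch (feature names per flate2's documentation; how the `png`/`image` stack would actually expose this may differ):

```toml
# Hypothetical: select the zlib-rs backend for flate2
# (default features disabled so the default miniz_oxide backend is dropped).
[dependencies]
flate2 = { version = "1", default-features = false, features = ["zlib-rs"] }
```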
-
stb_image is really good. Fabian, one of the developers, has done interesting optimizations in the decoder; e.g. their Paeth implementation is https://github.com/nothings/stb/blob/5c205738c191bcb0abc65c4febfa9bd25ff35234/stb_image.h#L4657 which looks a bit weird, but boosted decoding perf for some Paeth-heavy images by 20% on some corpus I had when the implementation fell back to scalar, so I am not surprised it's doing competitively well.
On Thu, 28 Nov 2024 at 08:05, Jonathan Behrens wrote:

> I'm glad you're seeing similar numbers. One interesting reference is this 2021 blog post <https://nigeltao.github.io/blog/2021/fastest-safest-png-decoder.html> comparing wuffs' PNG decoder to others. The sample size is very limited, but at the time they measured wuffs being 2-3x faster than their C-based competitors and libpng being overall a bit slower than spng and stb_image. (They also measured this crate; it is incredible to see just how far we've come since then!)
>
> I'd totally believe that something strange is going on with the benchmark harness, but I haven't been able to find anything that would account for the size of differences we're seeing. Just minor differences in the decoding options used and how input/output buffers are managed. Though it is entirely possible that one or more of the libraries are being used incorrectly somehow.
>
> I also just added logic to write a measurements.csv file when the harness finishes. It includes the decoding time in milliseconds for each image across the different decoders to make it easier to investigate outliers.
-
Below I am sharing some data that may help estimate the impact of PGO (Profile-Guided Optimization) on PNG decoding runtime. (This is relevant for me, because by default Chromium's PGO data is generated using only one experiment arm.) The TL;DR is that we can probably expect a 1.3% - 5.9% improvement.

Summary of results:
My repro steps were based on https://doc.rust-lang.org/rustc/profile-guided-optimization.html:

Step 1: gather baseline measurements without PGO.

Step 2: gather PGO data and build an optimized binary.

Step 3: measure and compare PGO vs non-PGO (I am eliding results for artificial /
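The three steps map onto the rustc PGO workflow described on that page; roughly (paths and the benchmark binary name are placeholders, not my exact invocations):

```shell
# Step 2 expanded, per the rustc book's PGO chapter:
rustc -Cprofile-generate=/tmp/pgo-data -O main.rs        # instrumented build
./main                                                   # run a representative decoding workload
llvm-profdata merge -o /tmp/pgo-data/merged.profdata /tmp/pgo-data
rustc -Cprofile-use=/tmp/pgo-data/merged.profdata -O main.rs  # PGO-optimized build
```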
-
@anforowicz I am probably getting ahead of myself, but it seems that the required pieces for getting PNG out of a sandbox entirely are falling into place: there is now a pure-Rust color management system, https://github.com/awxkee/moxcms, and https://crates.io/crates/kamadak-exif for memory-safe Exif parsing. It is likely that they will require some modification to match the current Skia behavior, but at least the building blocks are there at last. Although I'm not convinced that pursuing this would be a better use of resources than e.g. shipping a memory-safe WebP decoder.
-
I've been doing some investigation into explicit SIMD-ification of the unfilter code.

### Theoretical speedups from micro-architecture simulation

Generally I see a pretty good speedup on most micro-architectures, but it looks like there's a little bit of a regression for up (3bpp) that I still need to work on:
This comes from a little test-bench program that looks like this:

```rust
// Microbenchmark for all the various filters
use png::benchable_apis::unfilter;
use png::Filter;
use std::hint::black_box;

const CURRENT_ROW: [u8; 771] = [
    // SNIP
];
const PREV_ROW: [u8; 771] = [
    // SNIP
];

fn main() {
    let bpps: [u8; 1] = [4];
    let filters: [Filter; 1] = [
        //Filter::Sub,
        //Filter::Up,
        //Filter::Avg,
        Filter::Paeth,
    ];
    for _ in 0..64 {
        for &filter in filters.iter() {
            for &bpp in bpps.iter() {
                let mut curr_row: Vec<u8> = CURRENT_ROW.to_vec();
                black_box(unfilter(filter, bpp, &PREV_ROW, curr_row.as_mut_slice()));
            }
        }
    }
}
```

I've compiled the test bench and run the filter code under a micro-architecture simulator (based on llvm-mca) to produce the above results.

### Results on a physical system on the exif corpus

The exif corpus is here: https://github.com/getlantern/exif-image-corpus

On a real system, libpng (baseline, blue line) tends to perform best on the smallest images (i.e. ones that are less than the 75th percentile of libpng decode time on a Cortex A520). On the Arm Cortex X4, image-rs starts to pull ahead at the 75th percentile with the auto-vectorized code (red line, highend_improved2). At the highest percentiles, the new filter code comfortably pulls ahead of the auto-vectorized code on the little core for RGB and the big core for RGBA.
Here's an idea of what the speedup looks like across a random subset of 512 images on the RGB corpus, broken down by the percentile of decode time on the little core (libpng baseline is implicitly 0%):
I think so far we're comfortably ahead for RGBA against the auto-vectorized baseline, and we're ahead for RGB on the little core, where the slowdown is likely to be most perceptible. There are potentially a few more refinements to do (such as switching off the 3bpp RGB path on Arm, which doesn't seem to perform well), but I think we're at the point where we can start a PR for the Paeth filter. @anforowicz WDYT?




Uh oh!
There was an error while loading. Please reload this page.
Uh oh!
There was an error while loading. Please reload this page.
-
Hello!
I wanted to share some of my ideas for performance opportunities in the `png` crate. I hope that we can discuss these ideas together to refine them - for example:
I hope that this issue is a good way to have those discussions, but please shout if you think there might be better ways to engage. For example - maybe I should instead try joining a discord server somewhere? Joining a mailing list or discussion group? Writing all of this in a google doc (and discussing in doc comments)?
I apologize in advance if you find the text below rather long. I tried to keep things structured and avoid rambling, but I also wanted to brainstorm (and avoid immediately rejecting) half-baked or even silly ideas. Please bear with me :-).
My ultimate goal is to bring the performance of the `png` crate to parity with Chromium’s C++ implementation. A description of my (quite possibly flawed) comparison/measurement methodology can be found in the (incomplete / work-in-progress / please-be-gentle) “Discussion of benchmarking methodology” section in a separate doc (the link should be viewable by anyone; I’d be happy to grant commenting access if needed).

## Performance improvement ideas
### Idea 1: Explicit SIMD-ification of `unfilter`

Why this idea seems promising:

- `perf record` of the `png` crate’s benchmarks says that `png::filter::unfilter` accounts for 18% of runtime
- `libpng` uses explicit SIMD intrinsics to SIMD-ify unfiltering (e.g. `png_read_filter_row_avg3_sse2` and `png_read_filter_row_avg4_sse2`; also `png_read_filter_row_sub3_sse2`, `png_read_filter_row_avg3_sse2`, and `png_read_filter_row_paeth3_sse2`)

Status: PR is under review:

- `unfilter` benchmarks landed in “Scaffolding for direct benchmarking of `crate::filter::unfilter`” #413
- “`std::simd` to speed up `unfilter` for `Paeth` for bpp=3 and bpp=6” #414 (this yields 20% improvements in `unfilter` benchmarks)

Discussion of next steps:

- After “`std::simd` to speed up `unfilter` for `Paeth` for bpp=3 and bpp=6” #414, `png::filter::unfilter` went down to 8.6% of runtime - this is the upper bound on possible further improvements.

### Idea 2: Avoid copying data within the `png` crate

Why this idea seems promising:

- This seems to make intuitive sense - avoiding unnecessary work (moving bytes around) should improve performance (if we can avoid somehow hurting performance as a side effect - via extra computations, or extra cache pressure…)
- `perf record` of the `png` crate’s benchmarks (after SIMD-ification of `unfilter`) shows:
  - `__memmove_avx_unaligned_erms`: 9.25% of runtime
  - `__memset_avx2_unaligned_erms`: 1.45% of runtime
  - `__memmove_avx_unaligned`: 0.98% of runtime

#### Idea 2.1: Reduce copying of raw rows

Goal: get rid of the copies done in the two places here:

- `self.prev[..rowlen].copy_from_slice(&row[..rowlen])`
- `self.current.drain(..self.scan_start).for_each(drop);` (copies all bytes from `self.scan_start` to the end of the `self.current` vector)

Status: in progress - actively working on this:

- Prototype: see here
- Problem: I need to convince myself that this really helps (see also the benchmarking write-up in a separate section below). Strategy:
  - Microbenchmark `RawRowsBuffer` while still replicating decoding/decompressing patterns (of, say, `kodim02.png`). This can be done by abstracting away `ReadDecoder` by hiding it behind an equivalent `AbstractReadDecoder` trait. The abstract trait can also be implemented as a pass-thru to capture/log the decoding/decompressing patterns (that should be replicated by the microbenchmark).
  - Craft an input that stresses `unfilter` and uses no Huffman encoding… This seems hard…

#### Idea 2.2: Reduce copying within `ZlibStream::transfer_finished_data`

There seem to be 2 performance improvement opportunities within `ZlibStream::transfer_finished_data`:

1. Avoid copying not-yet-decompressed/populated bytes in `self.out_buffer[self.out_pos..]`:
   - One option is to drop `fdeflate`’s assumption that `out_buffer` has been zero-ed out. In this case, these bytes can simply be ignored.
   - Another option is to keep `fdeflate`'s assumption and still avoid copying these bytes by relying on zeroing them out in `prepare_vec_for_appending`. In other words, we can replace a (slower?) `memcpy` with a (faster?) `memset`.
2. Reduce how often `out_buffer` is compacted:
   - Today, `transfer_finished_data` compacts the buffer every time (copying 32kB every time a `safe` number of bytes is returned to the client; possibly copying multiple bytes per single returned byte).
   - Instead, we could compact only when `safe > CHUNCK_BUFFER_SIZE * 3` (copying 32kB after returning at least 32*3kB of data; copying at most 1 byte per 3 returned bytes). The number 3 is somewhat arbitrary and there is a trade-off here: a higher constant means more memory/cache pressure (but less copying).

Status: started - plan to resume work on this soon:

- Microbenchmark `ZlibStream` while still replicating decoding/decompressing patterns (of, say, `kodim02.png`)

#### Idea 2.3: Avoid `BufReader` when unnecessary

Sometimes the input of the `png` crate has already been buffered into memory (Chromium’s `//ui/gfx/codec/png_codec.cc` takes a span of memory as input; the `png` crate’s benchmarks use `Vec<u8>`). In such scenarios, using a `BufReader` will needlessly copy the already-buffered/already-in-memory bytes into `BufReader`’s internal buffer.

Status: thinking how to best address this:

- Maybe the `png` and `image` crates should take `BufRead` instead of `Read` as input?
- The `png` (and `image`?) crate can be bifurcated (internally holding `Box<dyn BufRead>` instead of `BufReader<R>` in `ReadDecoder::reader`):
  - `new` can continue taking `Read` as input (maybe deprecate `new` + rename it into `from_unbuffered_input`)
  - A new constructor (`from_buf_read`? `from_buffered_input`?) can take `BufRead` as input. Alternatively we could also introduce a `from_slice` API (OTOH all slices are `BufRead` but not all `BufRead`s are slices, so this isn't a fully generic solution).

#### Idea 2.4: Other copy avoidance

Half-baked ideas:

- `ZlibStream::out_buffer` (extension of idea 2.2): maybe we can avoid compacting at all, with a data structure that acts like a `Vec` (i.e. elements occupy contiguous memory) but that supports efficient dropping of large prefixes (bypassing the allocator and returning whole memory pages to the OS?)
- (`fdeflate`...)
- `ZlibStream::in_buffer`:

### Idea 3: Minimize the number of allocations

Why this idea seems promising:

- Profiles show allocator frames such as `partition_alloc::internal::PartitionBucket::SlowPathAlloc`, `allocator_shim::internal::PartitionMalloc`, and `allocator_shim::internal::PartitionFree`
- Allocation hot spots (per `heaptrack` for the `png` crate's `decoder` benchmarks):
  - `fdeflate::decompress::Decompressor::build_tables`
  - `png::decoder::zlib::ZlibStream::new`
  - `png::decoder::stream::StreamingDecoder::parse_text`
  - `png::decoder::zlib::ZlibStream::transfer_finished_data`

Status: not started yet:

- The `image` crate and Chromium ignore text chunks - we should just call `png::Decoder::set_ignore_text_chunk`. (This should help with `parse_text`.)
- Maybe the `Box` in `ZlibStream` is not needed?

### Idea 4: Try to improve decompression speed

Why this idea seems promising:

- Chromium uses an optimized `zlib`. In particular, there are SIMD-related patches here (see also a slide here by ARM engineers)
- `perf record` of the `png` crate’s benchmarks says that time is spent in:
  - `fdeflate::decompress::Decompressor::read`
  - `fdeflate::decompress::Decompressor::build_tables`

Status: haven’t really started looking at the `fdeflate` source code yet:

- (maybe there are opportunities in `fdeflate`... I haven’t started reading the code yet…)

### Idea 5: Try to improve `expand_paletted`

Why this idea seemed promising:

- I remember seeing `expand_paletted` relatively high (11% I think?) in a profile (but I didn’t take sufficient notes and now I have trouble replicating this…)
- There may be SIMD opportunities (`std::simd::Simd::gather_or_default`? mimicking Chromium SIMD code that I don’t understand?)

Status: tentatively abandoned for now

- A prototype improved `expand_palette` microbenchmarks by 21% to 56%, but `expand_paletted` barely registers in a runtime profile of the `png` crate’s end-to-end benchmarks (even when using the original testcase proposed by ARM engineers in https://crbug.com/706134#c6; I tested after tweaking the `png` crate's benchmark to use the `EXPAND` transformation similarly to how the `png` crate is used from the `image` crate)

## Benchmarking is hard…
I find it challenging to evaluate commits/prototypes/changes to see whether they result in a performance improvement. Let me share some of my thoughts below (hoping to get feedback that will help me navigate this area of software engineering).

The best case is if a change has such a big positive impact that it clearly shows up on benchmarks - for example, the `unfilter` changes (idea 1 above) show a clear improvement in the `png` crate’s benchmarks (not only a 20% improvement on `unfilter` microbenchmarks for Paeth/bpp=3, but also a 5-8% improvement on some end-to-end `decoder` benchmarks and a “change within noise threshold” for other end-to-end benchmarks). OTOH, there are changes that seem beneficial (e.g. avoiding memory copies - idea 2 above) that have relatively modest gains (maybe 1% - 7% in end-to-end benchmarks) and that may sometimes (randomly - due to noise?) seem to be performance regressions. I struggle a bit figuring out how to cope with the latter scenario:

- Maybe run benchmarks longer, with `--sample-size=1000 --measurement-time=500`? (This doesn’t seem to help. I still get wild swings in results… :-/)
- Maybe use `cachegrind`-based, `iai`-based instruction-count-focused benchmarks? OTOH it seems to me that the estimated cycle count can be rather inaccurate. The `unfilter` change for example shows the following changes in estimated cycles: kodim02=-8.1% (-5.1% runtime), kodim07=-19.2% (-8.4% runtime), kodim17=-0.01% (+0.87% runtime), kodim23=-5.7% (-2.4% runtime).
- Maybe use microbenchmarks (rather than end-to-end `decoder` benchmarks of the `png` crate) for `RawRowsBuffer` (idea 2.1) and `ZlibStream` (idea 2.2)?
- Maybe reduce noise from other processes: `nice -19` is one way (sadly I didn’t think of using that initially…). `nice -19` helps to some extent, but I think that the benchmark may still yield to other processes (?) and/or interrupt handlers? OTOH, I am not sure if there is an easy way to do that: I don’t have the `csets` command on my Linux box; writing to /proc/irq/0/smp_affinity seems icky; I am not sure if I can control boot flags in my datacenter-based machine to use `isolcpus`.

So, what do you think? Does any of the above make sense? Am I wildly off anywhere? Any specific/actionable suggestions?
Feedback welcomed!