Spend less wasted time zeroing out ImageBuf #3754

lgritz · 2023-01-18T02:37:46Z

Examining some timings and function call counts, I saw that a typical oiiotool command was spending a surprising amount of time in ImageBufAlgo::zero(), not just more runtime overall, but also calling it many more times than I thought it should. For example,

oiiotool -create 4096x4096 3 -colorconvert srgb acescg

called ImageBufAlgo::zero()... 4 times, totaling as much time as the colorconvert operation.

The idiom used in IBA::op(dst, src, ...) which initializes the dst IB if it's not already initialized will call the IB constructor that zeros out its contents (so you don't have every new IB full of garbage pixel values). But for IBA functions, this is almost never necessary, since the IBA function itself will tend to overwrite all pixel values. Not always, of course -- since IBA functions let you restrict the spatial area or channel range -- but in the cases where the destination starts out uninitialized and is allocated to exactly mtch the ROI, I think you can count on the calling IBA function filling in all those values.

So I changed IBAPrep() to call the IB constructor that says not to zero out the pixels.

Just in case we ever need the ability, I made a new flag value that lets you call IBAPrep and tell it, yes, do zero out any newly allocated buffers. But we pass the testsuite without needing that flag anywhere, so I think we're ok.

Also, the --create command itself would make new buffers (with the implied zeroing), and then explicitly zero it again!

So after these changes, the oiiotool example command line above only calls IBA::zero() ONCE (correctly, in the initial new image --create makes), eliminating three redundant calls.

I also hypothesized that each call to zero was taking too much time, because underneath it was calling IBA::fill which laboriously walked through the IB with iterators and copying a fixed pixel value that happened to contain 0 values. So I rewrote a special case where if the buffer had contiguous strides, it would just use memset instead. Surprisingly, it was not appreciably faster, and I'm guessing that the memory bandwidth necessary to touch every pixel in a 4k buffer is the limiting factor, and the other pixel-by-pixel iterator diddling was inconsequential compared to waiting for the memory bus. I'm still going to leave this "optimization" in, even though it was barely measurable for the case I was trying.

Examining some timings and function call counts, I saw that a typical oiiotool command was spending a surprising amount of time in ImageBufAlgo::zero(), not just more runtime overall, but also calling it many more times than I thought it should. For example, oiiotool -create 4096x4096 3 -colorconvert srgb acescg called ImageBufAlgo::zero()... 4 times, totaling as much time as the colorconvert operation. The idiom used in `IBA::op(dst, src, ...)` which initializes the `dst` IB if it's not already initialized will call the IB constructor that zeros out its contents (so you don't have every new IB full of garbage pixel values). But for IBA functions, this is almost never necessary, since the IBA function itself will tend to overwrite all pixel values. Not always, of course -- since IBA functions let you restrict the spatial area or channel range -- but in the cases where the destination starts out uninitialized and is allocated to exactly mtch the ROI, I *think* you can count on the calling IBA function filling in all those values. So I changed IBAPrep() to call the IB constructor that says not to zero out the pixels. Just in case we ever need the ability, I made a new flag value that lets you call IBAPrep and tell it, yes, do zero out any newly allocated buffers. But we pass the testsuite without needing that flag anywhere, so I think we're ok. Also, the --create command itself would make new buffers (with the implied zeroing), and then explicitly zero it again! So after these changes, the oiiotool example command line above only calls IBA::zero() ONCE (correctly, in the initial new image --create makes), eliminating three redundant calls. I also hypothesized that each call to zero was taking too much time, because underneath it was calling IBA::fill which laboriously walked through the IB with iterators and copying a fixed pixel value that happened to contain 0 values. So I rewrote a special case where if the buffer had contiguous strides, it would just use memset instead. Surprisingly, it was not appreciably faster, and I'm guessing that the memory bandwidth necessary to touch every pixel in a 4k buffer is the limiting factor, and the other pixel-by-pixel iterator diddling was inconsequential compared to waiting for the memory bus. I'm still going to leave this "optimization" in, even though it was barely measurable for the case I was trying.

lgritz merged commit 59f54c8 into AcademySoftwareFoundation:master Jan 18, 2023

lgritz deleted the lg-zero branch January 18, 2023 07:27

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Spend less wasted time zeroing out ImageBuf #3754

Spend less wasted time zeroing out ImageBuf #3754

Spend less wasted time zeroing out ImageBuf #3754

Spend less wasted time zeroing out ImageBuf #3754

Conversation