Published August 1, 2020 · 9 minute read

How do you average (or sum) a lot of numbers, quickly?

More precisely, how do you do this on a webpage, preferably in real-time, when your values are stored as pixels in a large image? In general-purpose applications, we have access to the massive parallelization capabilities of GPUs with CUDA or OpenCL — but on the web, we’re stuck with WebGL and fragment shaders.

One possibility is to write a fragment shader that repeatedly downscales the image by a factor of 2, where each output pixel is the average of the four pixels in its preimage. Assuming each render takes constant time due to GPU parallelization,[1] this technique can average an $n \times n$ texture in $\log_2 n$ time.
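As a quick illustration, here's how the downscaling pass might look on the CPU in NumPy (an illustrative sketch only; the real implementation would be a fragment shader rendering into successively smaller textures):

```python
import numpy as np

def average_by_downscaling(img):
    """Average a square image with power-of-two sidelength by repeatedly
    halving it; each output pixel is the mean of a 2x2 input block."""
    while img.shape[0] > 1:
        img = 0.25 * (img[0::2, 0::2] + img[0::2, 1::2]
                      + img[1::2, 0::2] + img[1::2, 1::2])
    return img[0, 0]  # log2(n) halvings leave a single pixel: the average

rng = np.random.default_rng(0)
img = rng.random((256, 256))  # agrees with img.mean() up to float error
```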

[Figure: the image downscaled 0 through 4 times]

This works, but has a few disadvantages:

  • It’s a pain to deal with images whose sidelength is not a power of 2. It can be done, but I wouldn’t want to do it.[2]
  • Each iteration requires a texture of a different size. Instantiating and swapping around all of these textures (or managing glViewport) is a pain.
  • Perhaps most importantly, this technique clearly works and is thus mathematically uninteresting.

An alternative technique that “resolves” some of these issues is to just blur the image repeatedly until it becomes a solid color, then read an arbitrary pixel.

[Figure: the image blurred 0 through 3 times]

This technique:

  • Works equally well on images of any size or shape.
  • Uses textures of all the same size, so we only need to instantiate two textures and render back and forth between them.
  • And, as we’re about to see, is very interesting.

To obtain an accurate average and avoid edge-effects, we treat the texture as a torus. What the blur operation is really doing, then, is circularly convolving the input image with a Gaussian kernel:




[Figure: the Gaussian kernel]



[Figure: the image blurred once]

$$I_1 = I_0 * K$$
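By the convolution theorem, one blur step can be prototyped in a couple of lines of NumPy; this sketch stands in for the fragment shader:

```python
import numpy as np

def circular_convolve(image, kernel):
    """Circular (toroidal) convolution via the DFT: convolution in
    space is pointwise multiplication in frequency."""
    return np.real(np.fft.ifft2(np.fft.fft2(image) * np.fft.fft2(kernel)))

# One blur step with a crude 3x3 box kernel, wrapped around the torus.
n = 64
rng = np.random.default_rng(0)
img = rng.random((n, n))
kernel = np.zeros((n, n))
kernel[:3, :3] = 1.0 / 9.0  # entries sum to 1, preserving the average
blurred = circular_convolve(img, kernel)
```

Because the kernel's entries sum to one, each step preserves the image's mean exactly.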

As you may have guessed from the above images, however, repeated Gaussian blurring is really bad for this purpose. The issue is that the Gaussian kernel is very local, so information doesn’t spread very fast throughout the image.

How to Judge a Kernel

Perhaps a kernel that’s more spread out would do better — indeed, there’s no reason for the kernel to even be contiguous!

But first, we need a way to quantify the “quality” of a kernel $K$. Let $I_0$ denote the original image, and let $I_n$ denote the image after convolving with $K$ a total of $n$ times. Then $I_n = I_{n-1} * K$, and taking the 2D discrete Fourier transform yields $\hat{I}_n = \hat{I}_{n-1} \cdot \hat{K}$.[3] From here, we easily obtain $\hat{I}_n = \hat{I}_0 \cdot \hat{K}^n$.

The image $I_\infty$ we want to eventually converge to has all pixels equal to the average value $\langle I_0 \rangle$ of $I_0$, so that $\hat{I}_\infty(0, 0) = \hat{I}_0(0, 0)$ is the only non-zero coefficient in $\hat{I}_\infty$. Intuitively, the “quality” of the kernel $K$ should quantify how quickly $I_n \to I_\infty$, say in $\ell^2$ norm. By Parseval’s theorem, it is equivalent to study the convergence of $\hat{I}_n \to \hat{I}_\infty$ in $\ell^2$. To do this, we’ll first need to deduce a bit more about $\hat{K}$.

Let $e_{i, j}$ for $0 \leq i, j < n$ denote the image with value 1 at pixel $(i, j)$ and 0 elsewhere. Then every coefficient in $\hat{e}_{i, j}$ has magnitude 1, by definition.[4] If we assume that the kernel $K$ is a convex linear combination of the $e_{i, j}$,[5] then every coefficient in its Fourier transform $\hat{K}$ is a convex linear combination of norm-1 complex numbers, and thus has norm at most 1. Thus all components of $\hat{I}_n$ either decay exponentially to zero or remain fixed as $n \to \infty$.
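This fact is easy to verify numerically; a standalone NumPy check:

```python
import numpy as np

# The DFT of a single-pixel image e_{i,j} is a pure phase
# exp(-2*pi*1j*(u*i + v*j)/n), so every coefficient has magnitude 1.
n = 16
e = np.zeros((n, n))
e[3, 5] = 1.0  # the basis image e_{3,5}
mags = np.abs(np.fft.fft2(e))
```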

In fact, since all $\hat{e}_{i, j}(0, 0) = 1$, we have $\hat{K}(0, 0) = 1$, and so all $\hat{I}_n(0, 0) = \hat{I}_\infty(0, 0)$. Letting $\hat{K}'$ denote $\hat{K}$ with the entry $(0, 0)$ set to zero, we finally define the quality of $K$ to be

$$Q(K) := \frac{1}{\ell^\infty(\hat{K}')}.$$

Put simply, we look at the largest coefficient in $\hat{K}$ by magnitude, excluding $\hat{K}(0, 0)$. Then the rate of convergence of $\hat{I}_n \to \hat{I}_\infty$ is on the order of $Q(K)^{-n}$: our average becomes roughly $Q(K)$ times better on each iteration.
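Translated to code, the quality measure is only a few lines. Here's a sketch; `kernel_quality` is an illustrative name, not code from the post's demo:

```python
import numpy as np

def kernel_quality(kernel):
    """Q(K): reciprocal of the largest-magnitude DFT coefficient,
    excluding the DC term (which is always 1 for a normalized kernel)."""
    k_hat = np.abs(np.fft.fft2(kernel))
    k_hat[0, 0] = 0.0
    return 1.0 / k_hat.max()

# Sanity check: a kernel confined to one row can never mix rows, so some
# Fourier coefficient stays at magnitude 1 and Q(K) = 1 (no convergence).
row_kernel = np.zeros((8, 8))
row_kernel[0, 0] = row_kernel[0, 1] = 0.5
```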

Sidenote: We can now understand precisely why the Gaussian kernel is so bad: the magnitude of the Fourier transform of a Gaussian is a Gaussian, which has a very “flat” peak near $(0, 0)$. Thus the Fourier coefficients near $(0, 0)$ have magnitudes almost exactly equal to 1.

The Search Begins

We now have a clearly defined task: find a kernel $K$ of maximal quality $Q(K)$.

Clearly, the optimal quality is $\infty$, attained when all coefficients of $\hat{K}$ other than $\hat{K}(0, 0)$ are zero. This corresponds to a filter with equal weight on every pixel of the image, so we converge in one step. End of post!

Of course, in practice this would be horrendously slow, even on the GPU. The performance of our fragment shader is related to the number of input pixels we access per output pixel, i.e., the number of non-zero entries of $K$. So, our task is really:

Find a kernel $K$ of maximal quality $Q(K)$, subject to $\ell^0(K) \leq N$.

There’s just one issue: this task is horrendously difficult.[6]

This forces us to turn to brute-force Monte-Carlo methods. The procedure I used is as follows:

  1. Randomly choose $N$ pixels without replacement from the $n \times n$ input image. These locations represent precisely the non-zero entries of $K$.
  2. Given this restriction, maximize $Q(K)$ using the method of your choice. This is an $N$-dimensional optimization problem where the variables must be non-negative and sum to one.[7] I used scipy.optimize.
  3. Repeat.
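Here's a simplified sketch of this procedure that skips the optimization in Step 2 and just assigns equal weights (names like `random_kernel` are mine; the real search used scipy.optimize for Step 2):

```python
import numpy as np

def kernel_quality(kernel):
    k_hat = np.abs(np.fft.fft2(kernel))
    k_hat[0, 0] = 0.0  # ignore the fixed DC coefficient
    return 1.0 / k_hat.max()

def random_kernel(n, N, rng):
    """N equal weights (summing to 1) at distinct random pixels."""
    flat = np.zeros(n * n)
    flat[rng.choice(n * n, size=N, replace=False)] = 1.0 / N
    return flat.reshape(n, n)

# Steps 1 and 3: propose random supports, keep the best kernel seen.
rng = np.random.default_rng(0)
n, N = 128, 64
best = max((random_kernel(n, N, rng) for _ in range(20)), key=kernel_quality)
```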

Keeping track of the maximum-quality kernel during this procedure should give a good estimate of the true solution. Here are some experimental results for a square image of sidelength $n = 128$:

[Figure: experimental results]

As expected, the quality increases as the sparsity budget $N$ is increased. For a given $N$, the median kernel quality, in the grand scheme of things, isn’t that much worse than that of the best observed candidate.

So perhaps we don’t need to find the absolute best kernel for a given sparsity; using randomly chosen pixel locations yields comparable quality, and we can just increase $N$ by a bit if it’s not good enough.

Theoretical Insight

How can we understand this result theoretically? The optimization step (Step 2) is quite difficult to think about, so let’s assume that all $N$ randomly chosen pixels end up with equal weights.[8]

Consider an arbitrary coefficient $\hat{K}(u, v)$, excluding $(u, v) = (0, 0)$. This coefficient will be the average of $\hat{e}_{i, j}(u, v)$ for $N$ randomly chosen pairs $(i, j)$. Since $\hat{e}_{i, j}(u, v)$ has unit norm and essentially random phase, its real part has a mean of zero and a variance of

$$\frac{1}{2\pi}\int_0^{2\pi} \cos^2{\theta} \, d\theta = \frac{1}{2}.$$

Assuming the $(i, j)$ are roughly independent and $N$ is relatively large, we have $\mathrm{Re}(\hat{K}(u, v)) \sim \mathcal{N}(0, 1/(2N))$ by the Central Limit Theorem. The same holds for the imaginary component, so $|\hat{K}(u, v)|$ has the Rayleigh distribution

$$|\hat{K}(u, v)| \sim \frac{x}{\sigma^2} e^{-x^2/(2 \sigma^2)}, \quad x \geq 0,$$

where $\sigma^2 = 1/(2N)$.

Since the kernel $K$ is real-valued, half of its Fourier coefficients will be redundant, so we may treat $M := \ell^\infty(\hat{K}')$ as the maximum of $n^2/2$ independent[9] Rayleigh-distributed random variables. The CDF of each Rayleigh-distributed variable is $F(x) = 1 - e^{-x^2 / (2 \sigma^2)}$, so the expected value of the maximum is

$$\mathbb{E}(M) = \int_0^\infty \left(1 - F(x)^{n^2/2}\right) \, dx.$$

Assuming $N$ and $n^2$ are both fairly large, the integrand can be well-approximated as a step function that transitions from $1$ to $0$ at the $x$-value satisfying $F(x)^{n^2/2} = \frac{1}{2}$. Solving, we obtain

$$\begin{aligned} \mathbb{E}(M) &\approx \sigma \sqrt{-2 \log\left(1 - \sqrt[n^2/2]{1/2}\right)} \\ &\approx \sqrt{\frac{2\log{n} - \log{2} - \log{\log{2}}}{N}}. \end{aligned}$$
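As a sanity check on the step-function approximation, we can simulate the maximum of $n^2/2$ Rayleigh variables directly; this is a standalone sketch, not part of the original experiments:

```python
import numpy as np

n, N = 128, 64
sigma = (1.0 / (2 * N)) ** 0.5

# Empirical E(M): mean over many trials of the max of n^2/2 Rayleighs.
rng = np.random.default_rng(0)
samples = rng.rayleigh(scale=sigma, size=(200, n * n // 2)).max(axis=1)

# The closed-form approximation derived above.
predicted = np.sqrt((2 * np.log(n) - np.log(2) - np.log(np.log(2))) / N)
```

For these parameters the simulated mean and the approximation agree to within a few percent.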

The median quality $Q(K)$ should be roughly the inverse of this. With fingers crossed, we compare the theory to experiment:

[Figure: comparison of theory and experiment]

Not bad, but not great. The actual median $Q(K)$ is a bit higher, presumably due to the optimization step (Step 2) that we overlooked. Indeed, if we disable the optimization step and weight each kernel entry equally, the results match quite closely:

[Figure: comparison of theory and experiment, optimization step disabled]

Thus we can expect our estimate to provide a reliable lower bound for the optimized median. This is actually great news: even randomly choosing pixels and setting equal weights, to achieve a target quality $Q$, we need only choose a filter with a size of

$$N \approx 2 Q^2 \log{n},$$

which grows only logarithmically with the image size! Thus, even for fairly large images, the kernel remains sparse enough for near-constant-time evaluation on the GPU.
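Concretely, with a hypothetical helper `kernel_size` implementing $N \approx 2Q^2 \log n$:

```python
import numpy as np

def kernel_size(quality, n):
    """Sparsity budget N ~ 2 Q^2 log(n) for a target quality Q
    on an n-by-n image."""
    return int(np.ceil(2 * quality**2 * np.log(n)))
```

For $Q = 4$ and $n = 256$ this gives $N = 178$; doubling the sidelength to $n = 512$ raises the budget only to $N = 200$.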

Demo Time!

The example image I’ve been using has sidelength $n = 256$. Aiming for a quality of $Q = 4$, our formula tells us we need to choose a random kernel with $N = 178$ entries. Here are the results with no kernel optimization (equal weights):[11]

[Figure: the random, unoptimized kernel applied 0 through 3 times]

Here are the results after optimizing the kernel (Step 2 in the procedure described before):

[Figure: the random, optimized kernel applied 0 through 3 times]

Yep. Even with the unoptimized kernel, after two iterations the average is resolved to a resolution of at least $256 / Q^2 = 16$ in each channel, which is barely perceptible. For comparison, the Gaussian kernel I used at the beginning of the post had $101 \times 101 = 10{,}201$ entries.[10]

To achieve the same result with the scaling-based algorithm outlined at the start of the post, you would need to scale down by $16\times$ at each step, so each output pixel would need to look at $256$ input pixels. Admittedly, this isn’t a direct comparison, since the scaling method provides an exact solution, but for many applications a rapidly converging approximate solution does the job.

And finally, for fun, a demo that calculates the value of $\pi$ by averaging an image of a circle! The estimate shown is derived from a single pixel in the displayed image.[12]

[Interactive demo: estimate $\pi$ by averaging an image of a circle, with a slider for the kernel sparsity (default $N = 50$; texture size $300 \times 300$; estimated quality $Q(K) = 2.094$). Requires Javascript and WebGL.]


  1. Assuming we’re hitting up against the 60 FPS frame cap, this is a reasonable assumption.
  2. Incidentally, the glGenerateMipmap function executes essentially this procedure, and is restricted to power-of-two sidelengths only. WebGL 2 supports non-power-of-two textures, but there’s no guarantee the edge effects will be treated so as to give an accurate average...
  3. Here, we use $\cdot$ to denote component-wise multiplication.
  4. This also defines the convention we’re using for our Fourier transforms.
  5. This is a reasonable assumption: the components should sum to one to preserve the average on each iteration. You can convince yourself (fun exercise!) that negative components will lead to exponential blowups in certain situations.
  6. To the best of my knowledge, anyway. If anyone manages to find a reference, I’d love to hear about it!
  7. So really, there are $N-1$ degrees of freedom.
  8. There is some theoretical basis to this assumption: choosing all weights equal minimizes the $\ell^2$ norm of $K$, which (by Parseval’s theorem) minimizes the $\ell^2$ norm of $\hat{K}'$. The $\ell^2$ norm is an upper bound for the $\ell^\infty$ norm we really care about.
  9. As with many of the assumptions here, this isn’t strictly true, but works well enough.
  10. Since the Gaussian kernel can be factored, this really only needs 2 passes, one horizontal and one vertical, with a kernel of size 101. But even then, the results with the Gaussian are objectively much worse.
  11. For fun, you can try to figure out what kernel I used by downloading the first two images, dividing their 2D discrete Fourier transforms, and taking an inverse DFT. I haven't tried this and have no idea if it will work (since the pixel colors are quantized).
  12. Through some clever WebGL hacks, I am using all four color channels internally to increase the precision. There also appears to be a bug on mobile (possibly due to low precision?) that causes a small bit of drift in the final average.