Question

Python - RGBA image non-zero pixels extracting - numpy mask speed up

I need to extract the non-zero pixels from an RGBA image. The code below works, but because I have to deal with really huge images, some speed-up would be welcome. Computing "f_mask" is the longest task. Is it possible to make this faster somehow? How can I delete rows whose values are all zero ([0, 0, 0, 0]) more quickly?

import numpy as np
import time


img_size = (10000, 10000)
img = np.zeros((*img_size, 4), float)  # Make RGBA image

# Put some values for pixels [float, float, float, int]
img[0][1] = [1.1, 2.2, 3.3, 4]
img[1][0] = [0, 0, 0, 10]
img[1][2] = [6.1, 7.1, 8.1, 0]


def f_img_to_pts(f_img):  # Get non-zero rows with values from whole img array
    f_shp = f_img.shape
    f_newshape = (f_shp[0]*f_shp[1], f_shp[2])
    f_pts = np.reshape(f_img, f_newshape)
    f_mask = ~np.all(f_pts == 0, axis=1)
    f_pts = f_pts[f_mask]
    return f_pts


t1 = time.time()
pxs = f_img_to_pts(img)
t2 = time.time()
print('PIXELS EXTRACTING TIME: ', t2 - t1)
print(pxs)

Solution


Analysis

~np.all(f_pts == 0, axis=1) is sub-optimal for several reasons:

First of all, the operation is memory-bound (typically by the RAM bandwidth) because the image is huge: 3.2 GB in memory as 64-bit floats (1.6 GB as 32-bit ones), and even the boolean temporaries are large (400 MB for f_pts == 0, 100 MB for the final mask). Moreover, Numpy creates multiple temporary arrays: one for the result of f_pts == 0, one for np.all(...) and even one for ~np.all(...). Each of them is fully stored in RAM and read back, which is inefficient. Even worse, newly allocated arrays are typically not directly mapped in physical memory due to the way virtual memory works, so the expensive overhead of page faults has to be paid for each temporary array. Last but not least, Numpy is not optimized for arrays whose main dimension (i.e. generally the last one) is tiny, as here (4 items). This can be fixed by using a better memory layout. See AoS vs SoA for more information.
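
For illustration, here is a minimal sketch of the SoA idea (a hypothetical layout, not a drop-in replacement for the code above): with one contiguous array per channel, each comparison streams sequential memory, and the mask can be accumulated in place to avoid most temporaries:

import numpy as np

# Hypothetical SoA layout: one contiguous array per channel instead of
# one (N, 4) AoS array. Each comparison then reads sequential memory.
n = 10000 * 10000
r, g, b, a = (np.zeros(n, np.float32) for _ in range(4))

# Equivalent of ~np.all(pts == 0, axis=1), accumulated in place:
mask = r != 0
np.logical_or(mask, g != 0, out=mask)  # reuse mask, no extra mask temporary
np.logical_or(mask, b != 0, out=mask)
np.logical_or(mask, a != 0, out=mask)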

Note that np.zeros reserves zeroed space in virtual memory but does not physically map it. Thus, the first call to f_img_to_pts is slower because of page faults. In real-world applications this is generally not an issue, because writing zeros to img or reading it once causes the whole array to be physically mapped (i.e. the page faults are paid up front). If that is not the case, then you should certainly use sparse matrices instead of dense ones. Let's assume img is already mapped in physical memory (this can be done by running the code twice, or simply by calling img.fill(0) after np.zeros).
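
A quick way to observe this effect (a rough sketch; absolute timings depend on the machine and OS):

import numpy as np
import time

a = np.zeros((10000, 10000, 4), np.float32)  # reserved, not yet physically mapped

t0 = time.time()
s1 = a.sum()   # first touch: pays the page faults
t1 = time.time()
s2 = a.sum()   # pages already mapped: noticeably faster
t2 = time.time()
print(f'first read: {t1 - t0:.3f} s, second read: {t2 - t1:.3f} s')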

Also please note that the dtype float is actually a 64-bit float, not a 32-bit one. 64-bit floats are generally more expensive, especially in memory-bound code, since they take twice as much space. You should certainly use 32-bit floats for image computations, since they almost never require such high precision. Let's assume the input array is a 32-bit one from now on.
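
For instance (a trivial sketch showing the footprint difference):

import numpy as np

img64 = np.zeros((1000, 1000, 4))     # dtype defaults to float64
img32 = img64.astype(np.float32)      # halves the memory traffic
print(img64.nbytes // 2**20, 'MiB vs', img32.nbytes // 2**20, 'MiB')  # 30 vs 15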


Solutions

One way to reduce the overheads is to iterate row by row (loops are not as bad as they seem here; quite the opposite, actually, when done correctly). While this should be faster, it is still sub-optimal, and there is no way to make it much faster than that in pure Numpy because of the way Numpy is designed.
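
Here is a sketch of that idea in pure NumPy (the chunk size is an arbitrary choice; processing blocks of rows keeps the temporaries small enough to stay in cache):

import numpy as np

def f_img_to_pts_chunked(f_img, rows_per_chunk=64):
    n, m, c = f_img.shape
    f_pts = f_img.reshape(n * m, c)
    parts = []
    step = rows_per_chunk * m
    for start in range(0, n * m, step):         # one block of image rows at a time
        block = f_pts[start:start + step]
        keep = np.any(block != 0, axis=1)       # small, cache-resident temporaries
        if keep.any():
            parts.append(block[keep])
    return np.concatenate(parts) if parts else f_pts[:0]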

To make this even faster, one can use modules compiling Python functions to native ones like Numba or Cython. With them, we can easily avoid the creation of temporary arrays, specialize the code for 4-channel images and even use multiple threads. Here is the resulting code:

import numpy as np
import time
import numba as nb

img_size = (10000, 10000)
img = np.zeros((*img_size, 4), np.float32)  # Make RGBA image
img.fill(0)  # Physically map the pages up front (see the analysis above)

# Put some values for pixels [float, float, float, int]
img[0][1] = [1.1, 2.2, 3.3, 4]
img[1][0] = [0, 0, 0, 10]
img[1][2] = [6.1, 7.1, 8.1, 0]

# Compute ~np.all(f_pts==0,axis=1) and assume most items are 0
@nb.njit('(float32[:,:,::1],)', parallel=True)
def f_img_to_pts_optim(f_img):  # Get non-zero rows with values from whole img array
    n, m, c = f_img.shape
    assert c == 4 # For sake of performance
    f_mask = np.zeros(n * m, dtype=np.bool_)
    f_pts = np.reshape(f_img, (n*m, c))
    for ij in nb.prange(n * m):
        f_mask[ij] = (f_pts[ij, 0] != 0) | (f_pts[ij, 1] != 0) | (f_pts[ij, 2] != 0) | (f_pts[ij, 3] != 0)
    f_pts = f_pts[f_mask]
    return f_pts

t1 = time.time()
pxs = f_img_to_pts_optim(img)
t2 = time.time()
print('PIXELS EXTRACTING TIME: ', t2 - t1)
print(pxs)

Results

Here are timings on a 10-core Skylake Xeon CPU:

Initial code (64-bit):          1720 ms
Initial code (32-bit):          1630 ms
Optimized code (sequential):     323 ms
Optimized code (parallel):       182 ms   <----------

The optimized code is 9 times faster than the initial code (on the same input type). Note that it does not scale with the number of cores, since reading the input is a bottleneck, combined with the sequential f_pts[f_mask] operation. One could build f_pts[f_mask] using multiple threads (which should be about 2 times faster), but this is rather complicated to do, especially with Numba.
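
For reference, here is one way such a parallel compaction could look: a sketch of the classic two-pass count/prefix-sum/scatter scheme (not benchmarked here; the chunk count is an arbitrary choice):

import numpy as np
import numba as nb

@nb.njit('(float32[:,:,::1],)', parallel=True)
def f_img_to_pts_compact(f_img):
    n, m, c = f_img.shape
    assert c == 4
    f_pts = np.reshape(f_img, (n * m, c))
    nchunks = 256  # arbitrary; should be >= the number of threads
    csize = (n * m + nchunks - 1) // nchunks
    counts = np.zeros(nchunks + 1, dtype=np.int64)

    # Pass 1: count the non-zero rows of each chunk (parallel)
    for k in nb.prange(nchunks):
        cnt = 0
        for ij in range(k * csize, min((k + 1) * csize, n * m)):
            if f_pts[ij, 0] != 0 or f_pts[ij, 1] != 0 or f_pts[ij, 2] != 0 or f_pts[ij, 3] != 0:
                cnt += 1
        counts[k + 1] = cnt

    offsets = np.cumsum(counts)  # per-chunk write offsets (tiny, sequential)

    # Pass 2: each chunk writes its rows at its own offset (parallel, order-preserving)
    out = np.empty((offsets[-1], c), dtype=np.float32)
    for k in nb.prange(nchunks):
        pos = offsets[k]
        for ij in range(k * csize, min((k + 1) * csize, n * m)):
            if f_pts[ij, 0] != 0 or f_pts[ij, 1] != 0 or f_pts[ij, 2] != 0 or f_pts[ij, 3] != 0:
                out[pos, 0] = f_pts[ij, 0]
                out[pos, 1] = f_pts[ij, 1]
                out[pos, 2] = f_pts[ij, 2]
                out[pos, 3] = f_pts[ij, 3]
                pos += 1
    return out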

2024-07-12
Jérôme Richard

Solution


I typically "resort" to C++ when I need to speed-up Python code. @Jérôme Richard's answer impressed me, though, particularly the use of numba, which I wasn't familiar with. As an alternative, here's a parallel solution using OpenMP, which achieves a 5x speedup compared to the JIT-based approach (on i7 11700K). It leverages the __int128 type available in GCC to perform zero-tests on four floats simultaneously.

Python side:

    #!/usr/bin/env python3
    
    import numpy as np
    from ctypes import *
    from time import time
    import numba as nb
    
    from numpy.ctypeslib import ndpointer
    from math import prod
    
    clib = CDLL('./libfilter.so')
    clib.FindNZ_IndexesF32x4.restype  = c_uint
    clib.FindNZ_IndexesF32x4.argtypes = (c_void_p, # inp: int128 (see f_img_to_pts_cpp(..))
                                         ndpointer(c_uint,
                                                   flags="C_CONTIGUOUS"), # indexes
                                         c_uint) # size
    
    
    def f_img_to_pts_cpp(inp):
        cpp   = 4 # components per pixel
        assert inp.shape[-1] == cpp
    
        flat_size = prod(inp.shape)
        inp       = inp.reshape(flat_size)
    
        # The ctypes module doesn't have a 128-bit type, so we manually assert
        # the preconditions for safe casting
        # (except alignment, but it works as it is, at least on x86_64)
        assert inp.flags['C_CONTIGUOUS']
        assert inp.dtype == np.float32
    
        inp     = inp.view(np.complex128) # cast to 128-bit elements
        indexes = np.empty(flat_size, dtype=np.uint32) # uninitialized (at most flat_size//cpp entries are used)
        size    = clib.FindNZ_IndexesF32x4(inp.ctypes.data, 
                                           indexes,
                                           flat_size//cpp)
        indexes = indexes[:size]
        out = inp[indexes].view(np.float32) # cast back to float
        return out.reshape((size, cpp))
    
    
    @nb.njit('(float32[:,:,::1],)', parallel=True)
    def f_img_to_pts_optim(f_img):  # Get non-zero rows with values from whole img array
        n, m, c = f_img.shape
        assert c == 4 # For sake of performance
        f_mask = np.zeros(n*m, dtype=np.bool_)
        f_pts = np.reshape(f_img, (n*m, c))
        for ij in nb.prange(n*m):
            f_mask[ij] = (f_pts[ij, 0] != 0) | (f_pts[ij, 1] != 0) | \
                (f_pts[ij, 2] != 0) | (f_pts[ij, 3] != 0)
        f_pts = f_pts[f_mask]
        return f_pts
    
    
    def py_opt(img):
        print('warming up...') # "spin up" the processor 
        for i in range(4):
            f_img_to_pts_optim(img)
    
        tstart = time()
        ret = f_img_to_pts_optim(img)
        runtime = time() - tstart
        print(f'numba runtime {runtime:.4f}')
        return ret
    
    def cpp_opt(img):
        tstart = time()
        pxs = f_img_to_pts_cpp(img)
        runtime = time() - tstart
        print(f'C++ runtime {runtime:.4f}')
        return pxs
    
    img = np.zeros((10000, 10000, 4), np.float32)
    
    img[0][1] = [1.1, 2.2, 3.3, 4]
    img[1][0] = [0,   0,   0,   10]
    img[1][2] = [6.1, 7.1, 8.1, 0]
    
    py_pxs  = py_opt(img)
    cpp_pxs = cpp_opt(img)
    
    print(py_pxs)
    assert (py_pxs == cpp_pxs).all()

C++ side:

#include <cstdint>
#include <atomic>

using u32    = uint32_t;
using int128 = __int128;

extern "C"
u32 FindNZ_IndexesF32x4(const int128 *  inp,
                        u32          *  indexes,
                        const u32 size) noexcept {
  std::atomic_uint l(0);

  #pragma omp parallel for
  for(u32 i = 0; i < size; i++) {
    if(inp[i]) {
      u32 j = l.fetch_add(1);
      indexes[j] = i;
    }
  }

  return l;
}

To build the source above:

g++ -DNDEBUG -fPIC -Ofast -mtune=native -fopenmp -std=gnu++17 -shared -o libfilter.so filter.cpp

Output:

warming up...
numba runtime 0.1169
C++ runtime 0.0163
[[ 1.1  2.2  3.3  4. ]
 [ 0.   0.   0.  10. ]
 [ 6.1  7.1  8.1  0. ]]

EDIT

As @Jérôme Richard correctly pointed out, the non-deterministic nature of threading means that the order of the indexes (and thus the order of the result) may differ from the JIT variant. However, since the number of non-zero elements is extremely small, sorting the indexes is almost cost-free.

Alignment: according to Why is malloc 16 byte aligned?, addresses returned by malloc on an x86_64 system are 16-byte aligned unless the requested size is less than 15 bytes, which is not the case in this context.
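
If one wants to be defensive anyway, the check is cheap on the Python side (a sketch; 16 matches the __int128 access width):

import numpy as np

a = np.zeros((10000, 10000, 4), np.float32)
# NumPy's allocator returns buffers that are at least 16-byte aligned on
# x86_64 in practice, so this assertion documents the __int128 precondition
# rather than guarding against a likely failure.
assert a.ctypes.data % 16 == 0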

#include <cstdint>
#include <atomic>
#include <algorithm>

using u32    = uint32_t;
using int128 = __int128;

extern "C"
u32 FindNZ_IndexesF32x4(const int128 *  inp,
                        u32          *  indexes,
                        const u32 size) noexcept {
  std::atomic_uint l(0);

  // It's always aligned on a x86_64 system
  // const bool is_misaligned  = (size_t(inp) & 15) != 0;

  // Regarding -0.0:
  int128    mask;
  {
    u32 * u = (u32*)&mask;
    u[0] = (1u << 31) - 1;
    u[1] = u[0];
    u[2] = u[0];
    u[3] = u[0];
  }

  #pragma omp parallel for
  for(u32 i = 0;  i < size;  i++) {
    if(inp[i] & mask) {
      u32 j = l.fetch_add(1);
      indexes[j] = i;
    }
  }

  std::sort(indexes, indexes + l);

  return l;
}

Output (with masking and sorting, i7 11700K):

warming up...
numba runtime 0.142
C++ runtime 0.0171
[[ 1.1  2.2  3.3  4. ]
 [ 0.   0.   0.  10. ]
 [ 6.1  7.1  8.1  0. ]]

But I would agree that my solution is somewhat overfitted. The JIT function is more versatile and will perform well even with a substantial number of non-zero elements.

UPDATE

There was another observation made by @Jérôme Richard: the effective memory bandwidth in my results appears too good to be true. It took me some testing and googling to understand what's going on. Consider this function:

...
extern "C"
u64 CPP_Reduce(const u64 * inp, const u32 size) noexcept {
  u64 acc = 0;
  #pragma omp parallel for reduction(+: acc)
  for(u32 i = 0;  i < size;  i++)
    acc += inp[i];
  return acc;
}
...

Let's pass it the img from Python:

    clib.CPP_Reduce.argtypes = (ndpointer(c_uint64,
                                          flags="C_CONTIGUOUS"),
                                c_uint) # size
    clib.CPP_Reduce.restype = c_uint64
...
    as_vec64 = img.reshape(prod(img.shape)).view(dtype=np.uint64)
    print(f'reduction input: size {as_vec64.nbytes}  shape {as_vec64.shape}')

    n_iters = 10
    tstart = time()
    for i in range(n_iters):
        clib.CPP_Reduce(as_vec64, as_vec64.shape[0])
    dt = (time() - tstart) / n_iters
    print(f'dt: {dt:.4f} s')
    bandwidth = as_vec64.nbytes / dt / 2**30
    print(f'bandwidth: {bandwidth:.4f} GB/s')
...

Output:

reduction input: size 1600000000  shape (200000000,)
dt: 0.0123 s
bandwidth: 120.8112 GB/s

This is on an i5 1335U with RAM clocked at 4.267 GHz. The total bus width is 128 bits. Its peak bandwidth, therefore, is 4.267e9 * 16 / 2**30 = 63.58 GB/s. At first glance, it may seem that the reduction example is incorrect, but it's not (I hope). Let's run our reduction function with a different input:

img = np.arange(0, 10000*10000*4, 1, np.float32).reshape((10000, 10000, 4))

Output:

reduction input: size 1600000000  shape (200000000,)
dt: 0.0458 s
bandwidth: 32.5458 GB/s

This result is a lot less surprising (and probably not a great one, but that's off-topic). You can read an explanation here: Linux kernel: Role of zero page allocation at paging_init time. Essentially, due to paging tricks, img may effectively fit completely in the CPU cache if it is mostly zeros, but it "behaves as expected" otherwise.
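
A quick way to reproduce the effect from Python (a sketch; absolute numbers depend on the machine):

import numpy as np
from time import time

def read_bandwidth(a, iters=10):
    t0 = time()
    for _ in range(iters):
        a.sum()  # a pure read over the whole buffer
    dt = (time() - t0) / iters
    return a.nbytes / dt / 2**30  # GB/s

zeros = np.zeros(100_000_000, np.uint64)          # untouched pages all alias the kernel zero page
dense = np.arange(100_000_000, dtype=np.uint64)   # physically distinct pages

print(f'zero-page input: {read_bandwidth(zeros):6.1f} GB/s')
print(f'dense input:     {read_bandwidth(dense):6.1f} GB/s')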

2024-07-16
BadEnglish