I typically "resort" to C++ when I need to speed up Python code. @Jérôme Richard's answer impressed me, though, particularly the use of numba, which I wasn't familiar with. As an alternative, here's a parallel solution using OpenMP, which runs roughly 7x faster than the JIT-based approach in the benchmark below (on an i7-11700K). It leverages the __int128 type available in GCC to perform zero-tests on four floats simultaneously.
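To illustrate the core trick in isolation: four contiguous float32 components occupy exactly 16 bytes, so a pixel carries data if and only if the corresponding 128-bit word has any bit set. Here is a minimal NumPy sketch of that idea (not part of the benchmark below; NumPy has no 128-bit integer, so it uses two uint64 halves instead of a single __int128):

import numpy as np

# Four float32 components = 16 bytes; check "any bit set?" via two uint64 halves.
px = np.zeros(4, dtype=np.float32)
print(bool((px.view(np.uint64) != 0).any()))   # False -> all-zero pixel
px[3] = 10.0
print(bool((px.view(np.uint64) != 0).any()))   # True  -> pixel has data
# Caveat: -0.0 sets the sign bit while comparing equal to 0.0; the EDIT below
# handles that corner case with a mask.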
Python side:
#!/usr/bin/env python3
import numpy as np
from ctypes import *
from time import time
import numba as nb
from numpy.ctypeslib import ndpointer
from math import prod
clib = CDLL('./libfilter.so')
clib.FindNZ_IndexesF32x4.restype = c_uint
clib.FindNZ_IndexesF32x4.argtypes = (c_void_p,               # inp: int128 (see f_img_to_pts_cpp(..))
                                     ndpointer(c_uint,
                                               flags="C_CONTIGUOUS"),  # indexes
                                     c_uint)                 # size
def f_img_to_pts_cpp(inp):
    cpp = 4  # components per pixel
    assert inp.shape[-1] == cpp
    flat_size = prod(inp.shape)
    inp = inp.reshape(flat_size)
    # The ctypes module doesn't have a 128-bit type, so we manually assert
    # the preconditions for safe casting
    # (except alignment, but it works as it is, at least on x86_64)
    assert inp.flags['C_CONTIGUOUS']
    assert inp.dtype == np.float32
    inp = inp.view(np.complex128)  # cast to 128-bit elements
    indexes = np.empty(flat_size, dtype=np.uint32)  # uninitialized array
    size = clib.FindNZ_IndexesF32x4(inp.ctypes.data,
                                    indexes,
                                    flat_size // cpp)
    indexes = indexes[:size]
    out = inp[indexes].view(np.float32)  # cast back to float
    return out.reshape((size, cpp))
@nb.njit('(float32[:,:,::1],)', parallel=True)
def f_img_to_pts_optim(f_img):  # Get non-zero rows with values from whole img array
    n, m, c = f_img.shape
    assert c == 4  # For sake of performance
    f_mask = np.zeros(n*m, dtype=np.bool_)
    f_pts = np.reshape(f_img, (n*m, c))
    for ij in nb.prange(n*m):
        f_mask[ij] = (f_pts[ij, 0] != 0) | (f_pts[ij, 1] != 0) | \
                     (f_pts[ij, 2] != 0) | (f_pts[ij, 3] != 0)
    f_pts = f_pts[f_mask]
    return f_pts
def py_opt(img):
    print('warming up...')  # "spin up" the processor
    for i in range(4):
        f_img_to_pts_optim(img)
    tstart = time()
    ret = f_img_to_pts_optim(img)
    runtime = time() - tstart
    print(f'numba runtime {runtime:.4f}')
    return ret

def cpp_opt(img):
    tstart = time()
    pxs = f_img_to_pts_cpp(img)
    runtime = time() - tstart
    print(f'C++ runtime {runtime:.4f}')
    return pxs
img = np.zeros((10000, 10000, 4), np.float32)
img[0][1] = [1.1, 2.2, 3.3, 4]
img[1][0] = [0, 0, 0, 10]
img[1][2] = [6.1, 7.1, 8.1, 0]
py_pxs = py_opt(img)
cpp_pxs = cpp_opt(img)
print(py_pxs)
assert (py_pxs == cpp_pxs).all()
C++ side:
#include <cstdint>
#include <atomic>

using u32 = uint32_t;
using int128 = __int128;

extern "C"
u32 FindNZ_IndexesF32x4(const int128 * inp,
                        u32 * indexes,
                        const u32 size) noexcept {
    std::atomic_uint l(0);
    #pragma omp parallel for
    for(u32 i = 0; i < size; i++) {
        if(inp[i]) {
            u32 j = l.fetch_add(1);
            indexes[j] = i;
        }
    }
    return l;
}
To build the source above:
g++ -DNDEBUG -fPIC -Ofast -mtune=native -fopenmp -std=gnu++17 -shared -o libfilter.so filter.cpp
Output:
warming up...
numba runtime 0.1169
C++ runtime 0.0163
[[ 1.1 2.2 3.3 4. ]
[ 0. 0. 0. 10. ]
[ 6.1 7.1 8.1 0. ]]
EDIT
As @Jérôme Richard correctly pointed out, the non-deterministic nature of threading means that the order of the indexes (and thus the order of the result) may differ from the JIT variant. However, since the number of non-zero elements is extremely small, sorting the indexes is almost cost-free.
Alignment: according to Why is malloc 16 byte aligned?, malloc-allocated addresses on an x86_64 system are 16-byte aligned unless the requested size is less than 15 bytes, which is not the case here.
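For completeness, here is a small, hypothetical check (not present in the code above) that could be dropped into f_img_to_pts_cpp to make the 16-byte-alignment assumption explicit on the Python side:

import numpy as np

# Hypothetical helper: verify that a NumPy buffer is 16-byte aligned before
# handing it to the C++ function, which reads it as 128-bit words.
def assert_16_byte_aligned(arr):
    assert arr.ctypes.data % 16 == 0, "buffer is not 16-byte aligned"

assert_16_byte_aligned(np.zeros((10000, 10000, 4), np.float32))

The revised C++ side, with masking and sorting: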
#include <cstdint>
#include <atomic>
#include <algorithm>

using u32 = uint32_t;
using int128 = __int128;

extern "C"
u32 FindNZ_IndexesF32x4(const int128 * inp,
                        u32 * indexes,
                        const u32 size) noexcept {
    std::atomic_uint l(0);
    // The buffer is always 16-byte aligned on an x86_64 system
    // const bool is_misaligned = (size_t(inp) & 15) != 0;
    // Mask that clears the sign bit of every float32 lane, so that -0.0
    // is treated as zero:
    int128 mask;
    {
        u32 * u = (u32*)&mask;
        u[0] = (1u << 31) - 1;
        u[1] = u[0];
        u[2] = u[0];
        u[3] = u[0];
    }
    #pragma omp parallel for
    for(u32 i = 0; i < size; i++) {
        if(inp[i] & mask) {
            u32 j = l.fetch_add(1);
            indexes[j] = i;
        }
    }
    // Threads append indexes in a non-deterministic order; restore it here.
    std::sort(indexes, indexes + l);
    return l;
}
Output (with masking and sorting, i7 11700K):
warming up...
numba runtime 0.142
C++ runtime 0.0171
[[ 1.1 2.2 3.3 4. ]
[ 0. 0. 0. 10. ]
[ 6.1 7.1 8.1 0. ]]
But I would agree that my solution is somewhat overfitted. The JIT function is more versatile and will perform well even with a substantial number of non-zero elements.
UPDATE
There was another observation made by @Jérôme Richard: the effective memory bandwidth in my results appears too good to be true. It took me some testing and googling to understand what's going on. Consider this function:
...
extern "C"
u64 CPP_Reduce(const u64 * inp, const u32 size) noexcept {
    u64 acc = 0;
    #pragma omp parallel for reduction(+: acc)
    for(u32 i = 0; i < size; i++)
        acc += inp[i];
    return acc;
}
...
Let's pass it the img from Python:
clib.CPP_Reduce.argtypes = (ndpointer(c_uint64,
                                      flags="C_CONTIGUOUS"),
                            c_uint)  # size
clib.CPP_Reduce.restype = c_uint64
...
as_vec64 = img.reshape(prod(img.shape)).view(dtype=np.uint64)
print(f'reduction input: size {as_vec64.nbytes} shape {as_vec64.shape}')
n_iters = 10
tstart = time()
for i in range(n_iters):
    clib.CPP_Reduce(as_vec64, as_vec64.shape[0])
dt = (time() - tstart) / n_iters
print(f'dt: {dt:.4f} s')
bandwidth = as_vec64.nbytes / dt / 2**30
print(f'bandwidth: {bandwidth:.4f} GB/s')
...
Output:
reduction input: size 1600000000 shape (200000000,)
dt: 0.0123 s
bandwidth: 120.8112 GB/s
This is on an i5-1335U with RAM running at 4.267 GT/s. The total bus width is 128 bits, so the theoretical peak bandwidth is 4.267e9 * 16 / 2**30 ≈ 63.58 GB/s. At first glance, it may seem that the reduction example is incorrect, but it's not (I hope). Let's run our reduction function with a different input:
img = np.arange(0, 10000*10000*4, 1, np.float32).reshape((10000, 10000, 4))
Output:
reduction input: size 1600000000 shape (200000000,)
dt: 0.0458 s
bandwidth: 32.5458 GB/s
This result is a lot less surprising (and probably not a great one, but that's off-topic). You can read an explanation here: Linux kernel: Role of zero page allocation at paging_init time
Essentially, due to the kernel's shared zero page, all the untouched pages of img map to the same physical page, so an array that is mostly zeros (and was mostly never written) may effectively fit in the CPU cache completely, while a fully populated array "behaves as expected".
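If you want to observe this effect without the C++ helper, here is a rough, self-contained sketch (assuming Linux, a couple of GB of free RAM, and single-threaded NumPy; the gap is far smaller than with the multi-threaded CPP_Reduce above, since np.sum itself becomes the bottleneck, but the untouched np.zeros array should still sum noticeably faster):

import numpy as np
from time import time

n = 100_000_000                      # ~800 MB per array, well above any CPU cache
lazy = np.zeros(n, np.uint64)        # pages never written -> backed by the zero page
touched = np.zeros(n, np.uint64)
touched[:] = 0                       # writing faults in real, distinct physical pages

for name, a in (('lazy', lazy), ('touched', touched)):
    a.sum()                          # warm-up: settle page faults and mappings
    t0 = time()
    a.sum()
    dt = time() - t0
    print(f'{name:8s} {dt:.4f} s  {a.nbytes / dt / 2**30:.1f} GB/s')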