We accelerate 4-bit product quantization (PQ) on the ARM architecture.
The high performance of conventional 4-bit PQ relies heavily on x64-specific
SIMD registers, such as those of AVX2; hence, comparable performance has not
yet been achieved on ARM. To fill this gap, we first bundle two 128-bit
registers as one 256-bit component. We then apply a shuffle operation to each
register using ARM-specific NEON instructions. With this simple but critical
modification, we achieve a dramatic speedup for 4-bit PQ on the ARM
architecture. Experiments show that the proposed method consistently achieves a
10x speedup over naive PQ at the same accuracy.
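
The bundling idea can be illustrated with a minimal pure-Python sketch that emulates the register semantics only; it is not the authors' implementation and uses no actual NEON intrinsics. All names here (`shuffle128`, `shuffle256`, `scan_32_codes`, `scan_naive`) and the toy data are hypothetical. The sketch assumes that each sub-space has a 16-entry table of 8-bit quantized distances, that one 128-bit shuffle looks up 16 four-bit codes at once, and that a 256-bit unit is handled as two independent 128-bit halves.

```python
def shuffle128(table, idx):
    """Emulate one 128-bit byte shuffle: 16 lanes, each a 4-bit table index."""
    assert len(table) == 16 and len(idx) == 16
    return [table[i & 0x0F] for i in idx]


def shuffle256(table_pair, idx_pair):
    """Treat two 128-bit registers as one 256-bit unit.

    Each half is shuffled independently, so 32 lookups are produced
    from two 16-lane shuffles.
    """
    return (shuffle128(table_pair[0], idx_pair[0]),
            shuffle128(table_pair[1], idx_pair[1]))


def scan_32_codes(dist_tables, codes):
    """Accumulate table distances for 32 database vectors at once.

    dist_tables[m] -- 16 quantized distances for sub-space m
    codes[n][m]    -- 4-bit code of vector n in sub-space m
    """
    assert len(codes) == 32
    acc = [0] * 32  # Python ints; real SIMD code must handle 8-bit overflow
    for m, table in enumerate(dist_tables):
        lo = [codes[n][m] for n in range(16)]      # first 128-bit half
        hi = [codes[n][m] for n in range(16, 32)]  # second 128-bit half
        out_lo, out_hi = shuffle256((table, table), (lo, hi))
        for n, d in enumerate(out_lo + out_hi):
            acc[n] += d
    return acc


def scan_naive(dist_tables, codes):
    """Reference: one scalar table lookup per code, per vector."""
    return [sum(dist_tables[m][c] for m, c in enumerate(row)) for row in codes]


# Toy data (assumed): 4 sub-spaces, 32 vectors, deterministic values.
dist_tables = [[(7 * k + 3 * m) % 256 for k in range(16)] for m in range(4)]
codes = [[(n + 5 * m) % 16 for m in range(4)] for n in range(32)]

fast = scan_32_codes(dist_tables, codes)
slow = scan_naive(dist_tables, codes)
print(fast == slow)
```

The two scans return identical distances, which mirrors the claim that the speedup comes without a change in accuracy. In an actual implementation, each call to `shuffle128` would correspond to a single hardware table-lookup instruction rather than a Python loop.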