Journal of Real-Time Image Processing manuscript No.
(will be inserted by the editor)
Su-Shin Ang · George Constantinides · Wayne Luk · Peter Cheung
Custom Parallel Caching Schemes for Hardware-accelerated Image
Compression
Received: date / Revised: date
Abstract In an effort to achieve lower bandwidth requirements, video compression algorithms have become increasingly complex. Consequently, deploying these algorithms on Field Programmable Gate Arrays (FPGAs) is increasingly desirable, because of the computational parallelism these platforms offer as well as the flexibility afforded to designers. Typically, video data is stored in large, slow external memory arrays, but the impact of the memory access bottleneck may be reduced by buffering frequently used data in fast on-chip memories. The order of memory accesses resulting from many compression algorithms is dependent on the input data [18].
These data-dependent memory accesses complicate the exploitation of data re-use and consequently limit the extent to which an application may be accelerated. In this paper, we present a hybrid memory sub-system which captures data re-use effectively in spite of data-dependent memory accesses. This memory sub-system is made up of a custom parallel cache and a scratchpad memory. Further,
the framework is capable of exploiting two dimensional spa-
tial locality, which is frequently exhibited in the access pat-
terns of image processing applications. In a case study involving the Quadtree Structured Difference Pulse Code Modulation (QSDPCM) application, the impact of data dependence on
memory accesses is shown to be significant. In comparison with an implementation that employs only a scratchpad memory (SPM), performance improvements of up to 1.7× and 1.4× are observed through actual implementations on two modern FPGA
platforms. These performance improvements are more pro-
nounced for image sequences exhibiting greater inter-frame
movements. In addition, reductions of on-chip memory re-
s