/ projects

NVIDIA CUDA, GPU Computing, and Parallelized Computer Vision Algorithms

For our final course project for Rice University's ELEC 301--Signals and Systems, Emilio Del Vechhio (@em0980), Senthil Natarajan (@senth1s), and I set out to explore parallelized GPU computation with NVIDIA CUDA for fast, real-time two-dimensional signal processing. We implemented an adaptation of the Sobel edge detection filter and an edge difference density-based motion detection algorithm that allowed us to estimate in real-time, with a configurable sensitivity threshold, regions of an image that exhbitied motion. We furthermore explored the GPU speedup of HAAR feature-based facial recognition with the existing facial recognition OpenCV library.


More formally, our full abstract:

GPUs provide a powerful platform for parallel computations on large data inputs, namely images. In this paper, we explore a GPU-based implementation of a simplified adaptation of existing edge detection algorithms fast enough to operate on frames of a continuous video stream in real-time. We also demonstrate a practical application of edge detection--an edge-based method for motion detection estimation. Additionally, we explore the GPU-CPU speedup of existing OpenCV GPU computation libraries, namely, for facial recognition algorithms. Finally, we demonstrate speedups as high as 10x we achieve with GPU parallelism, as compared to a reference serial CPU-based implementation.

Our full paper is available here. It's also published on OpenStax CNX.

What follows is a hopefully concise, but still sufficiently technically detailed, summary of the theory and implementation of the components of our project.

Edge Detection

Separable Convolution

To begin, it's worth noting that all of our two-dimensional convolution algorithms are implemented separably, e.g. that a standard two-dimensional convolution is implemented as two instances of one-dimensional convolution. This is a result of the fact that all of our 2D convolution kernels can be implemented as the product of two 1D kernels.

$$y[m, n] = x[m, n] * h[m, n] = \sum_{i = -\infty}^{\infty} \sum_{j = -\infty}^{\infty} h[i, j] x[m - i, n - j]$$

$$y[m, n] = h_1[n] * (h_2[m] * x[m, n])$$

$$y[m, n] = \sum_{i = -\infty}^{\infty}h_1[i] \sum_{j = -\infty}^{\infty}h_2[j] x[m - i, n - j]$$

This cuts down the computational complexity, and in practice, runs about twice as fast.

Gaussian Filtering

We first reduce the amount of noise, like that introduced by an image taken at a high ISO, so to minimize the number of false edge classifications. This is done by convolving a discrete Gaussian kernel with the input image (implemented separably) that effectively functions as a two-dimensional low-pass filter, suppressing high-frequency noise.

FFT of a noisy image:

Unfiltered noisy image

FFT of the same image convolved with a Gaussian kernel:

Gaussian-filtered noisy image

Sobel Edge Detection

Edges in the image correspond to locations at which the gradient (e.g. two-dimensional spatial derivative of the grayscale intensity) hits a maximum. We approximate the gradient of the image with the Sobel operator (implemented separably).

$$\boldsymbol{G_x} =
-1 & 0 & 1 \\
-2 & 0 & 2 \\
-1 & 0 & 1
\end{bmatrix} * \boldsymbol{A}$$
$$\boldsymbol{G_y} =
-1 & -2 & -1 \\
0 & 0 & 0 \\
1 & 2 & 1
\end{bmatrix} * \boldsymbol{A}$$

Original image:

Input image to Sobel filter

The magnitude of the gradient at each pixel after applying a Sobel filter in both the horizontal and vertical directions:

Sobel-filtered image

Visualization of the same magnitude, as a three-dimensional mesh:

Mesh of gradient magnitude

Non-maximum Suppression and Selective Thresholding

The procedure we present here is an adaption of the final stage, hysteresis, of the popular Canny edge detection algorithm. We classify edges by thresholding the output of the Sobel filter and examining the gradient magnitude of each pixel relative to that of the pixels in the gradient direction, then attempt to search around each pixel in a limited apron in an attempt to catch pixels that are falsely labeled as non-edges in the initial thresholding procedure.

The final output is a binary edge/non-edge classification of each pixel, resulting in something like this.

PCB edges

CUDA Implementation

A core part of our GPU CUDA code is our implementation of separable convolution: this is the module that drives all of our image filtering tasks.

__global__ void horizontal_convolve(int *d_out, int *x, int *h, int x_width, int x_height, int h_width, int h_height) {
    const int r = blockIdx.y * blockDim.y + threadIdx.y;
    const int c = blockIdx.x * blockDim.x + threadIdx.x;
    const int i = r * (x_width + h_width - 1) + c;
    int sum = 0;
    for (int j = 0; j < h_width; j++) {
        int p = x_width*r + c - j;
        if (c - j >= 0 && c - j < x_width) {
            sum += h[j] * x[p];
    d_out[i] = sum;

__global__ void vertical_convolve(int *d_out, int *x, int *h, int x_width, int x_height, int h_width, int h_height, double constant_scalar) {
	const int r = blockIdx.y * blockDim.y + threadIdx.y;
	const int c = blockIdx.x * blockDim.x + threadIdx.x;
	const int i = r * x_width + c;

    int sum = 0;
    for (int j = 0; j < h_height; j++) {
        int p = x_width*(r - j) + c;
        if (r - j >= 0 && r - j < x_height) {
            sum += h[j] * x[p];
    d_out[i] = (int)(constant_scalar * (double)sum);

Externally, we launch the device kernel on a convolution kernel of our choice with 16 threads per block in both directions.

#define TX 16
#define TY 16

dim3 block_size(TX, TY);
int bx_horizontal = horizontal_convolution_width/block_size.x;
int by_horizontal = horizontal_convolution_height/block_size.y;
dim3 grid_size_horizontal = dim3(bx_horizontal, by_horizontal);
int bx_vertical = vertical_convolution_width/block_size.x;
int by_vertical = vertical_convolution_height/block_size.y;
dim3 grid_size_vertical = dim3(bx_vertical, by_vertical);

horizontal_convolve<<<grid_size_horizontal, block_size>>>(...);
vertical_convolve<<<grid_size_vertical, block_size>>>(...);

The non-maximum suppression and selective thresholding procedue is also parallelized on a per-pixel basis. Please refer to our full paper if you're interested in the code.

Motion Detection

Difference Matrix

Our motion detection algorithm builds a difference image out of the edges of two frames in a continuous video stream. Then, we consider a limited apron around each pixel, selectively eliminating difference pixels that might be caused by innocent, unintentional camera movement or random noise. The output of this procedure is a reasonably accurate depiction of the pixels in the image that differ from one frame to the next.

Motion detection input 1

Motion detection input 2

Motion detection difference

Differnce Density Matrix

We then analyze the difference matrix above and build a limited-size density matrix, representing the concentration of difference pixels at each region in the image. It is from this matrix that we determine, with configurable sensitivity, the regions of the image motion exists.

Difference density mesh
Motion area estimation

CUDA Implementation

For our full code, please see the full paper.

Below is a snippet of how we calculate the difference density matrix given a difference matrix.

__global__ void spatial_difference_density_map(
	double *density_map,
	int *difference,
	int width,
	int height,
	int horizontal_divisions,
	int vertical_divisions
) {
	int r = blockIdx.y * blockDim.y + threadIdx.y;
	int c = blockIdx.x * blockDim.x + threadIdx.x;
	int i = r * width + c;
	int horizontal_block_size = width/horizontal_divisions;
	int vertical_block_size = height/vertical_divisions;
	int block_size = horizontal_block_size * vertical_block_size;
	const int scaling_factor = 1000;
	if (difference[i] != 0) {
		int i = (int)(vertical_divisions * r/(double)height);
		int j = (int)(horizontal_divisions * c/(double)width);
		density_map[i * horizontal_divisions + j] += scaling_factor/(double)block_size;

Facial Recognition

The core of our facial recognition implementation is borrowed from the OpenCV facial recognition library. We first train the HAAR cascade classifier on a large sample of positive and negative images, then, for each input image, we extract features that are used as input to our binary face/not-face classifier.

Cascade classifer framework

In practice, we demoed overlaying a Daft Punk mask image over every face in the image, in real-time.

Daft punk masks

GPU Speedup Results

We ran our benchmarks on a system with an Intel Core i5 4200U @ 2.6 GHz and an NVIDIA GeForce GTX 860M (1152 CUDA cores, 1029 MHz core clock, 2500 MHz memory clock, 2 GB GDDR5 video memory).

GPU speedup results


We wouldn't have been able to do this project without the support of CJ Barberan, Rice University ECE PhD student.
We would also like to thank Richard Baraniuk, professor of electrical and computer engineering in signals and systems, for his Fall 2015 instruction of the course for which we created this project.