1. abstract.
For this lab, we wrote video processing routines that modify a 8-bit grayscale image feed on the fly. We implemented 5 different algorithms both in plain C and in in-line assembly using MMX SIMD capabilities. We compared the performances of using plain C and using MMX instructions. As expected MMX algorithms performed better, yet the performance difference was not reflected on frame rate. However, a significant performance improment was observable once we started clocking the filters themselves.
2. routines.
The difference routine was already provided to us in MMX assembly form. All other routines were implemented in both C and MMX.
2.1 Edge Detect
Outline:
Finds edges by taking the gradient of the image in and x and y directions. A
threshold is applied to the result and can be used to increase or decrease the
sensitivity of the filter.
Basic algorithm:
Current pixel = (|left pixel-right pixel|+|pix above-pix below|)-threshold
MMX and C differences:
This routine was implemented with the same algorithm on both platforms. For
the MMX version we had to create two different pointers for accessing right
and left pixels since the double pointer was moving 8 pixels at a time. This
was causing undesired overlapping images to appear.
Image:
2.2 Brightness
Outline:
Changes the overall brightness of an image by adding a fixed signed value to
all pixels twice. Since the signed char value had half the maximum absolute
value unsigned char value had, we added it twice to allow the extreme
brightness outcomes.
Basic algorithm:
Current pixel = current pixel + (brightness change*2)
MMX and C differences:
Both versions use the same basic algorithm. Since we could not combine signed
and unsigned values in assembly we needed to use the absolute value of the
brightness change variable and have two different algorithms, one adding and
one subtracting pixels.
Images:
2.3 Contrast
Outline:
Changes the deviation of pixels from the average brightness value (assumed
128). Increased contrast means increased deviation, so brighter pixels
become brighter and darker pixels becoming darker and vice versa. A double
precision floating point number signifies the change in contrast. The
deviation of each pixel is augmented or diminished by that value.
Basic algorithm:
Current pixel = 128 + (|current pixel-128|*contrast change)
MMX and C differences:
The C algorithm was very straight forward, the deviation of each pixel was
found, applied change, recombined with average brightness. The MMX algorithm
was a lot more complex. We generated a "mask" for differentiating between
pixels brighter than average and darker than average. We also had two
different assembly instruction sets based on weather contrast was going to be
increased or decreased. To increase contrast, brighter/darker pixels were
made brighter/darker by a fixed value using the aforementioned mask, and vice
versa. The mask was inverted by subtracting from an array of all 1s, between
applying to brighter and darker pixels. The fixed number was generated
based on the range the contrast change variable was in. The if statements
required to make this determination turned out to be significantly faster than
doing any math with double precision floats.
Images:
2.4 Difference
Outline:
This algorithm was provided as a sample in the set of files provided for this
lab. We also implemented it in plain C. It takes the absolute value of the
differences between pixels of current frame and a reference frame. The
reference frame in this case is the previous frame. A threshold value is
subtracted from the result and can be used to modify the sensitivity of the
filter.
Basic algorithm:
Current pixel = |current pixel - previous pixel|-threshold
MMX and C differences:
Both versions use the same algorithm. We had to implement the absolute value
using the "?" operator in C. In MMX saturated math was relied on, the
saturated differences x-y and y-x were added together.
Image:
2.5 Motion detect
Outline:
Detects motion based on the difference algorithm. Pixels in motion are white
and pixels not in motion are black. A pixel is taken to be in motion if its
difference with previous pixel is greater than a given threshold.
Basic algorithm:
If difference(current pixel) > threshold
current pixel = 255
else
current pixel = 0
MMX and C differences:
The C algorithm determined if a given pixel was in motion by checking if
difference - threshold is greater than zero. The MMX algorithm generated a mask
using the packed byte greater than comparator. This comparator assumed signed
bytes, but since out threshold value was small and its sign bit was zero there
were no problems and the algorithm ran like the C algorithm. This mask was
outputted to the screen.
Images:
2.6 Motion Overlay
Outline:
Built on the motion detect filter. Mixes pixels in motion with pixels from
previous frame, as a result objects in motion fade.
Basic algorithm:
If pixel is moving
current pixel = (current pixel+previous pixel)/2
else
current pixel = current pixel
MMX and C differences:
The C algorithm mixes the pixels by mathematically adding their values
together and dividing by two. The MMX algorithm replaces moving pixels with
pixels from previous frame.
Images:
3. about the application.
The application we generated was mostly based on the files provided for the
lab. It can apply edge detect, brightness and contrast filters all at the same
time. But only one of difference, motion detect, motion overlay filters can
run at a time, since they all rely on the same reference previous image.
It alternatingly uses registers B and C as source and target, which allows the
cumulative filter behavior.
Modified files: mmx/xmain.c, mmx/mmxLib.c, include/mmxLib.h
Created files: mmx/cOperLib.c, include/cOperLib.h
3.1 Key bindings
Control:
Capture screen: c
Freeze screen: f
Start feed: s
End feed: e
End application: q
Filters:
Brightness filter:
Toggle: b
Increase: .
Decrease: /
Contrast filter:
Toggle: o
Increase: [
Decrease: ]
Difference filter (was already implemented):
Toggle: d
Edge detect filter:
Toggle: g
Motion detect filter
Toggle: n
Motion overlay filter
Toggle: v
MMX/C Controls:
Switch between using plain C and MMX assembly instructions: m
Display time to apply filters: r
performance comparison.
We could observe no uniform change in frame rate when filters were on and off.
So we measured the time time it took to apply all filters by putting a timer
in the while loop that refreshes the image. The process of moving the output
to the xim1 was left outside timers. It turned out that the process of
updating the reference frame with the current frame at the end took a
significant portion of the total time, 1.4*10^-4s of a total of 1.6*10^-4 in
some cases.
Theoretically, since 8 pixels were processed simultaneously with MMX
instructions and two instructions can run simultaneosly, if our instuctions
are paralellizable 50% of the time we should get an average speed up of around
12x. We did not expect this to be actual case since there usually were other
significant differences between our plain C and MMX implementations. For
instance the contrast filter's implementaion the plain C had double precision
floating point multiplication, which slowed it down significantly, and
resulted in our greatest speedup value, ~13x. In all MMX cases we tried to
minimize mathematical computations outside of the assembly block, and magnify
the performance impact of using MMX instuctions itself. The plain C also used
branching much more extensively than MMX code which is another reason which
slowed it down. Still in all cases a significant speedup was observed with MMX
instructions barely taking longer than the base case.
FILTER |
MMX (x 10-4 s) |
C (x 10-4 s) |
speedup |
motion detect |
2.0 |
6.2 |
3.1 |
motion overlay |
2.1 |
6.1 |
2.9 |
difference |
1.9 |
9.6 |
5.1 |
contrast |
1.6 |
21.2 |
13.3 |
brightness |
1.5 |
3.7 |
2.5 |
edge detect |
1.5 |
8.8 |
5.9 |
Table 1. Performance comparison and speedup for different filters. ( base case, time with all filters off: 1.4E-4 s
)
extensions.
Apart from implementing extra routines, we modified the application to allow
for filters to have cumulative effect. We also optimized our assembly code to
make most out of the two available pipelines. This had no recordable effect
which was probably because PIV's instruction horizon saw far enough into
our instructions to avoid dependency conflicts whenever possible. We also
improved the application to allow for cumulative filter effect (ex. decreased
contrast, increased brightness).
conclusions/what we learned.
Using MMX through in-line assembly is a powerful tool. Yet debugging assembly
and coming up with methods of differentiation in absence of branching is a
significant challenge. We estimate a ratio of 5 to 1 in time spent working on
MMX to time spent working on plain C. With a 4x speed up on average, MMX
does not have that good of effort-to-payoff ratio.
Regardless, we learned a lot about using in-line assembly, SIMD operations on
packed values and saturated math.
|