Motion Estimation

```Motion Estimation
ECE 569 – Spring 2010
Toan Nguyen
Outline
• What is new with motion estimation
• Four Step Search and Hexagon Search
Algorithms
• Parallelization strategies
• Results and discussions
What is new with motion estimation?
• The familiar way – Full search
• Full search is not so efficient
• Some of the most popular fast search algorithms:
 Diamond search
 Hexagon search
 Three-step search
 Four-step search
 Orthogonal search
 And many more
So what is the best?
• There is a trade-off between the run time and the
accuracy.
• Full search will be most accurate because of
exhaustive search, but will require more time
• Fast search is faster but the accuracy will be
reduced because of estimation algorithms.
• We implemented two of the most popular fast
search algorithms for comparison:
Four Step Search
Hexagon Search
Four Step Search Algorithm
•
Step 1: A minimum BDM point is found from a nine-checking points pattern on a 5
x 5 window located at the center of the 15 x 15 searching area. If the minimum
BDM point is found at the center of the search window, go to Step 4; otherwise go
to Step 2.
• Step 2: The search window size is maintained in 5 x 5. However, the search pattern
will depend on the position of the previous minimum BDM point.
 If the previous minimum BDM point is located at the corner of the previous
search window, five additional checking points as shown in Fig. 2(b) are used.
 If the previous minimum BDM point is located at the middle of horizontal or
vertical axis of the previous search window, three additional checking points as
shown in Fig. 2(c) are used.
 If the minimum BDM point is found at the center of the search window, go to
Step 4; otherwise go to Step 3.
• Step 3: The searching pattern strategy is the same as Step 2, but finally it will go to
Step 4.
• Step 4: The search window is reduced to 3 x 3 as shown in Fig. 2(d) and the
direction of the overall motion vector is considered as the minimum BDM point
among these nine searching points.
Four Step Search Example
Hexagon Search Algorithm
• Step 1: The large hexagon with seven checking points is centered at,
the center of a predefined search window in the motion field. If the
MBD point is found to be at the center of the hexagon, proceed to
Step 3; otherwise, proceed to Step 2.
• Step 2: With the MBD point in the previous search step as the
center, a new large hexagon is formed. Three new candidate points
are checked, and the MBD point is again identified. If the MBD
point is still the center point of the newly formed hexagon, then go
to Step 3; otherwise, repeat this step continuously.
• Step 3: Switch the search pattern from the large to the small size of
the hexagon. The four points covered by the small hexagon are
evaluated to compare with the current MBD point. The new MBD
point is the final solution of the motion vector.
Hexagon Search Example
Design Implementation
• Parallelization is possible by dividing the
image into small sub-image partitions.
• Each thread will work on a sub-image
independently using a designed algorithm ( i.e
Four step search or Hexagon Search).
• At the end, the minimum SAD of each subimage is compared to get the final minimum
Implementation Notes
• Since the number of threads we use is multiple of
2’s, if the number of sub-image is not multiple of
rows and columns and we ignore the results from
those extra sub-images.
• We excluded the time it takes to read a text file
and store data into the window and image arrays
when we compare the runtime for performance
analysis.
Simulation Results
• First we varied the number of threads per block to find the
maximal configuration that gives the best run time.
Runtime (seconds)
6
Runtime of full search on various threads/block
5
4
3
32
64
2
128
1
256
512
0
Image Size
 256 threads/block give the best performance.
Simulation Results (cont.)
• The runtime of the serial versions and the parallel versions of different
algorithms are collected and compare to see what kind of performance
improvement we achieved.
Full Search serial vs. parallel
Hexagon Search serial vs. parallel
3.5
20
15
FSS_Serial
10
FSS_Parallel
5
Runtime (seconds)
25
3
2.5
2
1.5
Hexagon_Serial
1
Hexagon_parallel
0.5
0
0
Image Size
Image Size
Four Step Search serial vs. parallel
Runtime (seconds)
Runtime (seconds)
30
4
3.5
3
2.5
2
1.5
1
0.5
0
• We only see the performance
improvement when the image size is
4SS_Serial
256x256 or bigger. Any image of size
4SS_Parallel
smaller than this will actually decrease
the performance.
Image Size
Simulation Results (cont.)
• So how much speed up do we get and which
algorithm is better, Full Search, Four Step
Search, or Hexagon Search?
Parallel vs. serial versions speedup
35
30
Speed up
25
20
15
Speed_UP_FS
10
Speed_UP_4SS
Speed_UP_Hexagon
5
0
Image size
Simulation Results (cont.)
• Overall performance
16X16
32X32
64X64
Full_Serial Full_Parallel 4SS_Serial 4SS_Parallel Hexagon_Serial Hexagon_parallel
0
0
0
0.016
0
0.078
0
0.016
0
0.015
0
0.047
0.01
0.016
0.01
0.015
0.01
0.062
128X128
0.02
0.016
0.01
0.015
0.01
0.062
256X256
0.09
0.031
0.02
0.016
0.02
0.047
512X512
0.41
0.078
0.06
0.016
0.06
0.063
1024X1024
1.64
0.265
0.236
0.032
0.22
0.062
2048X2048
6.56
0.922
0.87
0.047
0.85
0.078
26.29
3.719
3.38
0.11
3.3
0.157
4096X4096
30
25
20
15
FSS_Serial
FSS_Parallel
4SS_Serial
10
4SS_Parallel
Hexagon_Serial
5
0
Hexagon_parallel
Simulation Results (cont.)
• Performance comparison between NVIDIA
8400 GS and 9800 GT GPUs.
NVIDIA 8400 GS vs. 9800 GT performance
4.5
4
3.5
Speed up
3
2.5
2
Speed-up_FSS
1.5
Speed-up_4SS
1
Speed-up_Hexagon
0.5
0
Image Size
Simulation Results (cont.)
• Distortion measurement (motion estimation
quality).
Fast Search Distortion
Min. SAD returned by different algorithms
600
1400
500
1200
1000
300
4SS distortion
Hexagon distortion
200
Distortion
400
800
600
Full-Step
4SS
400
Hexagon
100
200
0
0
Image size
Image size
Result Analysis Summary
1. Motion estimation parallel versions performance only
improve when image is large (256x256). Smaller image will
reduce performance.
 Larger image ~ greater speedup
2. Fast search algorithms outperform full search algorithm,
hence “fast”.
3. Parallelization on Four Step Search gives a slightly edge
improvement over Hexagon Search.
4. The distortion we see on the two fast search algorithms are
similar.
Result Conclusions
• Based on the data collected from different
algorithms, Four Step Search gives a slightly
better performance than Hexagon Search,
while the distortion is very similar.
Hence, Four Step Search is a better fast search
algorithm than Hexagon Search.
Only perform motion estimation algorithms on
GPU if image size is larger than 256x256. Smaller
image size should be ran serially on CPU.
Limitations
• Image and window files are random.
• Not make use of shared memory
Other parallelization strategy
• After each step, the SAD of the new checking
points will be computed. We can parallelize by
in the sub-image.
• Then after each step complete and the SAD for
the new checking points needed, we already have
them computed by the threads in previous step.
• Drawback of this strategy:
Not getting a considerable amount of speedup
Lots of data transfer between host and device
More complicated implementation
References
• Deepak Turaga , Mohamed Alkanhal . "Search Algorithms for BlockMatching in Motion Estimation". ECE - CMU. March 06, 2010
<http://www.ece.cmu.edu/~ee899/project/deepak_mid.htm>.
• Lai-Man Po, Wing-Chung Ma. A Novel Four-Step Search Algorithm for Fast
Block Motion Estimation. JUNE 1996
• Xuan Jing, Lap-Pui Chau. "An Efficient Three-Step Search Algorithm for
Block Motion Estimation". IEEE TRANSACTIONS ON MULTIMEDIA JUNE
2004: 435-437.
• Chen Lu, Wang. "Diamond Search Algorithm". ECE, U of Texas. March 06,
2010
<http://users.ece.utexas.edu/~bevans/courses/ee381k/projects/fall98/ch
en-lu-wang/presentation/sld012.htm>.
Questions?
```