### Slides - DAIICT Intranet

```A High-Quality Video Denoising
Algorithm based on Reliable Motion
Estimation
Ce Liu (MSR)
William T. Freeman (MSR, MIT)
Initial Introduction
• To remove real, structured noise introduced by low-end camcorders and
digital cameras.
• Structured noise: The noise in real cameras can have strong spatial
correlations.
• Sparsity in Image: For any image patch, there will be a similar ones in
other locations of the image. This property is used in NLM method used
for denoising.
• Sparsity in video: A new frame can be well predicted from previous
frames. (Temporally consistent or temporal coherence)
Chicken and Egg Problem…
• Motion Estimation techniques (e.g. Block matching) are mainly based on
temporal coherence across the frames.
• Structured noise can mislead the search for similar patches and then
breaks the temporal coherence criterion in video de-noising.
• Motion should be estimated from the underlying signal after denoising,
and denoising relies on the temporal correspondence from motion
estimation.
• Solution: Use of robust optical flow with spatial regularization to establish
reliable temporal correspondence despite noise.
A temporally Coherent Video
Denoising Framework
• For every patch in a video, find a set of supporting patches from this frame
and temporal adjacent frames that are similar to this patch.
• Find spatially neighboring and temporally corresponding pixels.
Algorithms
Approximate KNN
Optical flow
Approximate K-nearest Neighbors
(AKNN) for a single frame

•
•
•
•
•
•
{I1, I2 , … , IT} – input noisy image.
z = (x,y,t) – index to a space time volume.
P(z) or P(x,y,t) – a patch at location z.
N (q)  P (q ) ; a set of approximate K-nearest neighbors for q = (x,y).
vi = qi – q ; offset of the found patch from the current patch.
Take a priority queue and store these K-nearest neighbors in it such that:
K
i
i 1
D( P (q), P (q i ))  D (P (q), P (q j )), 1  i  j  K
Where D(·,·) is the sum of square distance (SSD) over two patches,
defined as:
D ( P (q ), P (q )) 
( I (q  u )  I (q  u ))


i

u [  s, s ]  [  s, s ]
2
i
Three phases of AKNN
•
Initialization: The K-nearest neighbors are initialized by randomization
v i   sn i
where ni is a 2d standard normal random variable, and
 controls the radius (=w/3 initially).
• Propagation:
s
•
– The idea is to improve
approximate K-nearest neighbor set based on the fact that
 tothe
neighboring pixels tend
have similar AKNN structures (offsets).

– In the scanline order, we attempt to improve AKNN {vi(x,y)} using neighbor {vi(x−1,y) and
{vi(x,y,−1)}. In the reverse scanline order, we attempt to improve {vi(x,y)} using neighbor
{vi(x+1,y)} and {vi(x,y+1)}.
Random Search:
After the propagation step, we allow every patch to randomly match other
patches in the image for M times using the following mechanism
vi = σsαini , i = 1,··· ,M
where ni is a standard 2d normal random variable, α = 1 and
M = min(log2σs, K).
Non-local means with temporal
coherence
• Temporal coherence is vital for denoising.
• Algorithm that produces more temporal coherent results is preferred.
• Since, for real sequences, it can be difficult to distinguish high-intensity
structured noise from image signal. Therefore, it is important to establish
temporal correspondence between adjacent frames and require
corresponding pixels to be similar.
• For this optical flow algorithm is used.
• Since, optical flow is not invertible, we estimate forward flow wf (z) = [vx,
vy, 1] from frame It to It+1, and backward flow wb(z)=[vx,vy,−1] from frame It
to It−1, in order to establish bidirectional correspondence.
Optical flow
• Method to estimate the motion between two image frames taken at time
t and t+Δt at every pixel position.
• Let a pixel I(x,y,t) is moved in next frame to I(x+Δx,y+Δy,t+Δt), then
I(x,y,t) = I(x+Δx,y+Δy,t+Δt)
• By Taylor Series expansion
I( x  x, y  y, t  t )  I( x, y, t ) 
I
x 
x
I
y
y 
I
t
t  H .O .T .
which gives: Ixu+Iyv+It=0
• Horn-Schunck algorithm considers a global constraint of smoothness for
aperture problem and gives an energy functional:

E 

 [( I
2
u  Iyv  It )   (  u
x
2
2
2
  v )] dxdy
Optical flow (cont.)
• This functional can be minimized by Euler-Lagrange equations as:
L
u
L
v
• Which gives –


 L
x ux
 L
x v x
 L

0
y uy
 L

0
y v y
Ix ( Ix u  Iyv  It )    u  0
2
Iy ( Ix u  Iyv  It )    v  0
2

where Δ is laplacian operator, Δu(x,y) = u(x,y) – u(x,y).
• The above equations are linear in u and v, which can be iteratively solved

as:
I (I u  I u  I )
k
u
k 1
u 
k
x
k
x
y
2
x
v
v 
k
2
y
Iy ( Ix v  Iy v  It )
k
k 1
t
 I I
2
k
  Ix  Iy
2
2
2
Non-local means with temporal
coherence (cont.)
• So, pixel z corresponds to z + wf (z) in next frame and to z + wb(z) in
previous frame across |H| frames.
• Modified AKNN’s , which forms the supporting patches for P(z) is Ni =
{P(zij)}, j=1,…,K. for ith frame.
where, zij = (xij, yij,,i)
Non-local means with temporal
coherence (cont.)
•
The non local means estimate for pixel z can be written as:
I( z) 
1
Z
tH

 D w (( P ( z), P (z ij )) 

 I( z ij ) exp 
2
2



t
j 1
K
|i t|
i t  H
where Z is the normalization factor:
tH
Z 


i t  H
 D w (( P (z), P ( z ij )) 

 exp 
2
2 t


j 1
K
|i t|
and Dw (·,·) is a weighted SSD function, summed over spatial, but not
temporal, offsets:

 u 2 

D w (( P (z), P ( z ij )) 
(P (z1  u)  P ( z 2  u)) exp 

2
Z ' u[  s,s ] [  s,s ] 0
2


p 
1
2
where σp = s/2 , and Z’ is a normalization constant, γ = 0.9 to control temporal
decay and σt is related to the noise level in the video sequence.

Non-local means with temporal
coherence (cont.)
• For a fixed number of iterations, the complexity of this denoising
algorithm for a frame is O(NHKlogK), where N is the number of pixels per
frame, H is the temporal window size, and K is the number of approximate
K-nearest neighbors.
• This is a significant reduction compared to O(N2) of the original NLM
algorithm, since K << N (typically K = 10 and N = 640×480).
removal
• Set σt small when noise level is low to avoid over smoothing, and when
the noise level is high, set σt large to smooth out noise.
• Proposed noise model based on optical flow:
– Ideally, the difference between the warped frame and It should be the difference of
independent noise, but due to unreliable motion estimation an outlier in noise
estimation is used:
f
I t ( z )  I t 1 ( z  w ( z ))   z n z  (1   z ) u z
here, nz is a pixel-wise Gaussian random variable: E(nz)=0, E(nz2)= σn, and uz ∼ U[−1,1] is
a pixel-wise uniform random variable. These two random variables are balanced by
weight αz.Let Jt(z) = It(z) − It+1(z + wf (z)).
removal (cont.)
– We use an expectation-maximization (EM) algorithm to estimate parameters:
1. Initialize σn = 20. Loop between step 2 and 3 until convergence.
2. (E-Step) Evaluate
 J ( z) 
exp  t 2 
 2  n 
z 
 J ( z )  1
exp  t 2 
2  n
 2  n  2
1. (M-step) Estimate
 z J t ( z)  z
2

n 
z z
– This estimation is performed for each of R, G and B channels independently.
– The relationship between
the noise level σn and scaling parameter σt in Eqn. (5) depends

on K and H.
– Empirically, it has been found that when K = 11 and H = 5, σt = σn generates visually
pleasing results.
Experimental Results (given by
Authors)
• Implementation Details:
–
–
–
–
–
Used Patch size – 7X7
K = 11
H=5
4 Iterations of random K-nearest neighbor matching for each frame.
The EM algorithm for noise estimation converges in about 10 iterations.
• Tennis sequence:
– added artificially generated AWGN (σ = 20).
– Compared with BM3D which gave PSNR 30.22db in first run and 31.20db in second run.
– This algorithm gave PSNR 30.21db.
• Real video sequence room captured by a Canon S90.
• Denoising quality depends on the quality of motion estimation.
Experimental Results (given by
Authors)
• Evaluation of motion estimation:
– Applied block matching and optical flow
methods on the video.
– Block matching failed due to structured
noise, but optical flow gave good smooth
results.
• For room sequence, two parameters
are used σ = 20 and σ = 40.
outperforms BM3D.
Experimental Results (given by
Authors)
• Temporal coherence as a quality measure:
– Used the human-assisted motion annotation tool [27] to annotate the ground-truth
motion of the room sequence.
Experimental Results (given by
Authors)
Std. Dev.
Input data
BM3D (σ = 20)
BM3D (σ = 20)
Proposed
Algorithm
RED
4.20
2.22
1.52
1.23
BLUE
3.81
1.78
1.45
1.13
GREEN
9.55
5.77
2.81
2.91
The average standard deviation along motion paths is measured for different
algorithms at different RGB channels. Proposed system has overall the least temporal
fluctuation.
```