• Objective
• Standardize algorithms for audiovisual coding in
multimedia applications allowing for
High compression
Scalability of audio and video content
Support for natural and synthetic audio and video
• The Idea
• An audiovisual scene is a coded representation of
audiovisual objects related in space and time
MPEG-4: Scenario
• A/V object
A video object within a scene
The background
An instrument or voice
Each object is coded independently
• A/V scene
Mixture of natural and synthetic objects
Individual bitstreams are multiplexed and transmitted
over one or more channels
Each channel may have its own quality of service
MPEG-4: Video Object Plane
• Video frame = composition of segmented regions of
arbitrary shape, called video object planes (VOPs)
• Shape, motion, and texture information of VOPs
belonging to the same video object is encoded into
a video object layer (VOL)
• Encode
• VOL identifiers
• Composition information
• Overlapping configuration of VOPs
MPEG-4: Coding
• Shape coding
Shape information is carried in alpha planes
Transparency of the shape is encoded
Both inter and intra shape coding modes
After shape coding, each VOP in a VO is partitioned
into non-overlapping macroblocks
• Motion coding
• Shift parameter with respect to the reference window
• Standard macroblock
• Contour macroblock
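The partition into standard and contour macroblocks above can be sketched as follows. This is an illustrative Python sketch, not the standard's normative procedure; function and label names are assumptions.

```python
def classify_macroblocks(alpha, mb_size=16):
    """Classify macroblocks of a binary alpha plane.
    alpha: 2D list of 0/1 (0 = transparent, 1 = inside the VOP shape).
    Returns a grid of labels: 'transparent' (outside the shape),
    'opaque' (standard macroblock, fully inside), or 'contour'
    (boundary macroblock, partially inside)."""
    h, w = len(alpha), len(alpha[0])
    labels = []
    for y in range(0, h, mb_size):
        row = []
        for x in range(0, w, mb_size):
            vals = [alpha[i][j]
                    for i in range(y, min(y + mb_size, h))
                    for j in range(x, min(x + mb_size, w))]
            if all(v == 0 for v in vals):
                row.append('transparent')   # no shape pixels: skip coding
            elif all(v == 1 for v in vals):
                row.append('opaque')        # standard macroblock
            else:
                row.append('contour')       # boundary: shape + texture coded
        labels.append(row)
    return labels
```

Transparent macroblocks need no texture data at all, which is where arbitrary-shape coding gains over frame-based coding.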
MPEG-4: Coding
• Texture coding
• In intra-VOPs and for residual errors from motion compensation,
texture is DCT coded as in MPEG-1
• 4 luminance and 2 chrominance blocks in a macroblock
• In P-VOPs, prediction-error blocks may extend beyond the VOP shape
• Pixels outside the active area are set to a constant value
• Standard compression
• Efficient prediction of DC and AC coefficients from intra and inter
coded blocks
• Multiplexing
• Shape  motion  texture coded data
• Motion and DCT coefficients can be coded jointly (as in H.263) or individually
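The constant-value padding of pixels outside the active area can be sketched in a few lines. This is a minimal illustration, assuming a mid-gray fill of 128; the fill value and names are not taken from the standard.

```python
def pad_block(block, alpha, fill=128):
    """Replace pixels outside the VOP (alpha == 0) with a constant
    before DCT coding, so that out-of-shape pixels do not inflate
    high-frequency coefficients. block and alpha are 2D lists of
    equal size; fill=128 is an illustrative mid-gray value."""
    return [[p if a else fill for p, a in zip(prow, arow)]
            for prow, arow in zip(block, alpha)]
```

After padding, the block is a full rectangle and the standard 8x8 DCT applies unchanged.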
MPEG-4 Video Object
• Construct a video object
• User selects a start frame and outlines a polygon designating the rough object boundary
• Refine boundary using snake algorithm, if needed
• Compute a k-pixel bounding box around the object
• Within bounding box compute
• Edge map: a bit plane obtained by thresholding the output of a convolution (edge-detection) kernel
• Color map: compute luminance and chrominance, quantize by k-means clustering, keep the quantization table
• Motion field: block-based motion vectors
• Segment into regions with no significant edges, smooth color, and
smooth motion
• Intersect the segments with the initial object boundary to determine
foreground and background regions
• Estimate the motion of regions in the next frame with an affine
motion model
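The last step above, fitting an affine motion model to a region's block motion vectors, reduces to two small least-squares problems (one for each vector component). The sketch below is an assumed formulation, vx = a1*x + a2*y + a3 and vy = a4*x + a5*y + a6, solved via the normal equations; names are illustrative.

```python
def fit_affine(points, vectors):
    """Fit a 6-parameter affine motion model to block motion vectors
    by least squares. points: [(x, y)] block centers in the region;
    vectors: [(vx, vy)] their motion vectors. Returns (a1, a2, a3)
    for vx and (a4, a5, a6) for vy."""
    def solve3(M, b):
        # Gaussian elimination with partial pivoting on a 3x3 system.
        n = 3
        A = [row[:] + [bi] for row, bi in zip(M, b)]
        for c in range(n):
            piv = max(range(c, n), key=lambda r: abs(A[r][c]))
            A[c], A[piv] = A[piv], A[c]
            for r in range(c + 1, n):
                f = A[r][c] / A[c][c]
                for k in range(c, n + 1):
                    A[r][k] -= f * A[c][k]
        x = [0.0] * n
        for r in range(n - 1, -1, -1):
            x[r] = (A[r][n] - sum(A[r][k] * x[k]
                                  for k in range(r + 1, n))) / A[r][r]
        return x

    # Normal equations (X^T X) a = X^T v, with design rows [x, y, 1].
    XtX = [[0.0] * 3 for _ in range(3)]
    Xtvx, Xtvy = [0.0] * 3, [0.0] * 3
    for (x, y), (vx, vy) in zip(points, vectors):
        row = (x, y, 1.0)
        for i in range(3):
            for j in range(3):
                XtX[i][j] += row[i] * row[j]
            Xtvx[i] += row[i] * vx
            Xtvy[i] += row[i] * vy
    return solve3(XtX, Xtvx), solve3(XtX, Xtvy)
```

At least three non-collinear block centers are needed; with more, the fit averages out noisy individual motion vectors.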
MPEG-4 Video Object
• Track object
• Locate estimated position of foreground and background regions from
previous frame. Call this the object mask.
Generate the same three feature maps using the stored quantization table;
requantize if the error is large
Classify regions into foreground/background and new regions
Compute each region's intersection ratio r with the object mask
For foreground regions, if r > 80% or the region lies within the mask, mark as
foreground; label the part outside the mask (foreground - mask) as new
For new regions, if r < 30% mark as new; if r > 80% mark as
foreground; else find the nearest motion-similar neighbor. If it is in the
foreground, apply the previous step; else keep the region as new
Iterate until stable
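The intersection-ratio test that drives the classification above can be sketched as follows, with regions and the mask represented as sets of pixel coordinates. The 80%/30% thresholds come from the slide; the nearest-motion-similar-neighbor tie-break is left out, and names are illustrative.

```python
def classify_region(region, mask, fg_thresh=0.8, new_thresh=0.3):
    """Classify a tracked region by its intersection ratio r with the
    object mask. region, mask: sets of (row, col) pixel coordinates.
    Returns 'foreground' (r > 80%), 'new' (r < 30%), or 'undecided'
    (resolved elsewhere via the nearest motion-similar neighbor)."""
    r = len(region & mask) / len(region)
    if r > fg_thresh:
        return 'foreground'
    if r < new_thresh:
        return 'new'
    return 'undecided'
```

Iterating this classification until the labels stop changing gives the stable foreground/background partition the tracker carries to the next frame.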