Computer Vision Systems for the Blind and Visually Disabled.

What is Vision?
Why is it Hard?
Alan L. Yuille.
The Purpose of Vision.
“To Know What is Where by Looking”.
Aristotle. (384-322 BC).
 Information Processing: receive a signal by
light rays and decode its information.
 Vision appears deceptively simple, but this
is highly misleading.
What can humans see?
Rich description: fox, tree trunk, grass and
background twigs.
 Also: shape of fox’s legs and head, its fur, what
it is doing? old or young? and more.
 Local regions (B,C,D) are highly ambiguous.
Local regions are ambiguous
Why can Humans see so well?
Humans appear understand images effortlessly.
But this is only because of the enormous amount
of our brains that we devote to this task.
 It is estimated that 40-50% of neurons in the
cortex are involved in doing vision.
 Humans are very visual. We get much more
information from our eyes than other animals.
Vision and the Brain
Half the Cortex does Vision
But human vision is not perfect
Human perception is often incorrect.
 Visual illusions suggest that perception is
often a reconstruction, or even a controlled
 Here are some illusions.
Ames Room
Perspective Cues
Are these lines horizontal?
Which square is brighter?
Visual Illusions
The perception of brightness of a surface,
 or the length of a line depends on context.
 Not on basic measurements like:
 the no. of photons that reach the eye
 or the length of line in the image.
Perception is Inference
Helmholtz. 1821-1894.
 “Perception as Unconscious Inference”.
So why is vision hard?
Vision is an inverse inference problem.
There is a highly complex process that
generates the image.
 This process is studied in Computer
Graphics. It models objects, light sources,
and how they interact.
 Vision must invert this image formation
process to estimate the “causal factors” –
objects, lighting, and so on.
Vision as Inverse Inference:
Inverting image formation.
Inverse Problems are often hard
There are an infinite number of ways that
images can be formed.
 Which ways are most plausible?
Vision and Bayes
Bayes’ Theorem gives a procedure to solve
inverse inference problems.
 It states that we can infer the
state S of the world from the
observed image I by,
P(S|I) = P(I|S)P(S)/P(I).
 But doesn’t tell us how to
 Specify these distributions.
The Fundamental Problem?
The set of all images is almost infinite
(Kersten 1987).
 Estimate that only a tiny fraction of all
possible images have been seen by humans
(even small images 10x10).
 The no. of objects is large – 30,000?
 The no. of scene types is large – 1,000?
 Hence I and S are extremely complex.
Variability and Ambiguity
Variability: Images of similar objects
(bicycle) can be very different (left, center).
 Ambiguity: Images of different objects can
be identical (right).
Complexity, Variability, Ambiguity
What is the space of all possible images?
 It is only now becoming possible to explore
this questions – (technological advances).
 Large datasets – 20,000 images (Pascal),
1,000,000 (ImageNet).
 But are these datasets big enough for results
on them to apply to other images?
Datasets: Promise and Peril
Vision evaluates performance of algorithms
on benchmarked datasets.
 But results on these datasets may fail to
generalize to other images.
 Datasets need to be representative of the
complexity of the real world.
 Tyranny of Datasets. Unrepresentative
datasets. Limited visual tasks.
Dataset Examples
Pascal (top). 20,000 images. 20 objects.
ImageNet (bottom) 1,000,000 images. 1,000 objects.
Brief History of Vision
The difficulty of vision became clear in the
1960’s,70’s when Artificial Intelligence
researchers tried to get computer programs to
interpret images.
 Initially AI researchers thought vision was easy.
“Solve vision in a summer”.
 Researchers gradually started realizing that vision
was extremely hard. Much harder than Chess.
Big Picture Theories of Vision.
Work in the 1980’s in vision was largely
influenced by big picture theories.
e.g., David Marr’s Vision (1982).
 The theories were interesting, but practical
results were limited.
 Progress started being made by breaking up
vision into subproblems. Inventing or
borrowing mathematical tools for these
Techniques used in Vision
Multi-Disciplinary set of tools.
 Linear and Non-linear filtering.
 Variational Methods.
 Probabilities on Graphs.
 Perspective Geometry.
 Differential Geometry.
 And many more. We will introduce many of
these at this summer school.
Vision as a Bag of Tricks?
Dividing vision up into sub-problems
enabled progress to be made.
 But lead to a situation where vision was
seen as an enormous bag of tricks.
 “What papers should I read in computer
vision? There are so many, and they are so
different.” Chinese Graduate student.
 Attempts are being made – summer schools,
Wikivision project – to give a core and
foundation for vision,
The mathematical richness of
The complexity and ambiguity of vision is a
 Vision problems (broadly defined) have
attracted the attention of learning
mathematicians (e.g., Field’s medalists).
 David Mumford, Shing-Tung Yau, Maxim
Kontsevich, Stephen Smale, Terrence Tao.
Relationships to other
Computer vision relates closely to
 Image Processing
 Machine Learning.
 Partially covered in this school.
 Artificial Intelligence.
 Robotics.
 Study of Biological Vision Systems –
Psychology and Neuroscience.
Imaging Devices
In this school we will be dealing with
“natural images” of everyday scenes.
 But the methods used can be applied to
other imaging modalities.
 Medical images: fMRI, EEC,
 Astronomial Images:…
 Ever-increasing number of imaging devices.
Need a science of visual images.
Structure of the School
Vision can be broken down into low-, mid-,
and high-level vision (very roughly).
 Low-level vision – local image operations
which have limited knowledge of the world.
 Mid-level vision – non-local operations
which know about surfaces and geometry.
 High-level vision – operations which know
about objects and scene structures.
Low-level vision
Image processing.
 Filtering, denoising,
 enhancement.
 Edge detection.
 Image segmentation.
 Right: ideal segmentation
 (followed by labeling).
Mid-Level Vision
Estimation of 3D surfaces:
 Binocular stereo,
 structure from motion,
 other depth cues
 (e.g., perspective, shading,
 texture, focus).
 Fig: Image, Depth (blue to red),
 Segmentation.
Mid-Level Vision: Grouping
Kanisza. Humans have a strong tendency to
group image structures as surfaces.
High-Level Vision
Object detection and Scene Understanding.
 Example: detect objects in Pascal.
High-Level Vision: parts
Detect parts of objects:
High-Level Vision – AI.
Rich Explicit Representations enable:
 Understanding of objects, scenes, and events.
 Reasoning about functions and roles of objects, goals
and intentions of agents, predicting the outcomes of
events. SC Zhu – MURI.
Visual Turing tests
Vision algorithms that have similar
properties to those of humans:
 (i) flexible, adaptive, robust
 (ii) capable of learning from limited data,
ability to transfer,
 (iii) able to perform multiple tasks,
 (iv) closely coupled to reasoning, language,
and other cognitive abilities.
Grammars, Compositionality,
and Pattern Theory.
How to put everything together?
 How to address the complexity problem of
 Pattern theory offers suggestions:
 Mumford and Desolneux 2010.
 Describe the class of patterns that can
happen in images using grammars.
Grammars and Pattern Theory
Parse images by decomposing them into
elementary image patterns.
 Inverse inference on a grammar for
generating images.
The next few lectures
The next few lectures will concentrate on
basic low-level vision tasks:
 Filtering – linear and non-linear.
 Segmentation. Variational and probabilistic
Vision is difficult and challenging. Humans
devote half of their cortex to doing it.
It is difficult because it is inverse inference in
an extremely complex domain.
It involves a great variety of mathematical
and computational techniques.
Low-, Mid-, High-Level taxonomy.
Pattern Theory and Grammars.

similar documents