FINST_RU2 - Center for Cognitive Science

A neglected problem in the
Representational Theory of Mind
Object Tracking and the
Mind-World Connection
Before I begin I would like you to see a
‘video game’ to which I will refer later
The demonstration shows a task called
“Multiple Object Tracking”
Track the initially-distinct (flashing) items
through the trial (here 10 secs) and indicate
at the end which items are the “targets”
After each example I’d like you to ask
yourself, “How do I do it?”
If you are like most of our subjects you will
have no idea, or a false idea…
Keep track of the objects that flash
• How did you do it?
• What properties of individual objects did you use in
order to track them?
• Did you use some grouping or chunking heuristic?
• Does your introspection reveal how you tracked the
• Does your introspection ever reveal what processes
go on in your mind?
Going behind occluding surfaces does not disrupt tracking
Scholl, B. J., & Pylyshyn, Z. W. (1999). Tracking multiple items through occlusion: Clues to visual
objecthood. Cognitive Psychology, 38(2), 259-290.
Not all well-defined features can be tracked:
Track endpoints of these lines
Endpoints move exactly as the squares did!
The basic problem of cognitive science
• What determines our behavior is not how the
world is, but how we represent it as being
 As Chomsky pointed out in his review of Skinner, if we
describe behavior in relation to the objective properties
of the world, we would have to conclude that behavior
is essentially stimulus-independent
 Every naturally-occurring behavioral regularity is
cognitively penetrable
 Any information that changes beliefs can
systematically and rationally change behavior
Representation and Mind
Why representations are essential
Do representations only come into play in “higher
level” mental activities, such as reasoning?
Even at early stages of perception many of the
states that must be postulated are representations
(i.e. their content or what they are about plays a
role in explanations).
Examples from vision (1): Intrapercept constraints
Epstein, W. (1982). Percept-percept couplings. Perception, 11, 75-83.
Another example of a classical representation
Other forms of representation….
Lines FG, BC are parallel and equal.
Lines EH, AD are parallel and equal.
Lines FB, GC are parallel and equal.
Lines EA, HD are parallel and equal.
Vertices EF, HG, DC and AB are joined....
Part-Of{Cube, Top-Face(EFGH), BottomFace(ABCD), Front-Face(FGCB), BackFace(EHDA)}
Part-Of{Top-Face(Front-Edge(FG), BackEdge(EH), Left-Edge(EF), Right-Edge(HG)},…
What’s wrong with these
What’s wrong is that the CTM is incomplete — it does
not address a number of fundamental questions
It fails to specify how representations connect with what
they represent – it’s not enough to use English words in
the representation (that’s been a common confusion in AI)
or to draw pictures (a common confusion in theories of
mental imagery)
 English labels and pictures may help the theorist recall which
objects are being referred to … but
 What makes it the case that a particular mental symbol refers
to one thing rather than another? Or,
 How are concepts grounded? (Symbol Grounding Problem)
Another way to look at what the
Computational Theory of Mind lacks
The missing function in the CTM is a mechanism that
allows perception to refer to individual things in the
visual field directly without appealing to their
properties – i.e., nonconceptually:
 Not as “whatever has properties P1, P2, P3, ...”, but as a
singular term that refers directly to an individual and does
not appeal to a representation of the individual’s properties.
 Such a reference is like a proper name, or like a
demonstrative term (like this or that) in natural language or
like a pointer in a computer data structure.
 There is more to come on the mechanism of visual indexing
An example from personal history: Why
we need to pick out individual things
without referring to their properties
We wanted to develop a computer system that would
reason about geometry by actually drawing a diagram
and noticing adventitious properties of the diagram
from which it would conjecture lemmas to prove
We wanted the system to be as psychologically
realistic as possible so we assumed that it had a narrow
field of view and noticed only limited, spatially restricted information as it examined the drawing
This immediately raised the problem of coordinating
noticings and led us to the idea of visual indexes to
keep track of previously encoded parts of the diagram.
Begin by drawing a line….
Now draw a second line….
And draw a third line….
What do we have so far?
We know there are three lines, but we don’t
know the spatial relations between them.
That requires:
1. Seeing several of them together (at least in pairs)
2. Knowing which object seen at time t+1 corresponds to
a particular object that was seen at time t.
Establishing (2) requires solving one form of the
correspondence problem. This problem is
ubiquitous in perception. Solving it over time is
called tracking.
For example, suppose you recall noticing
two intersecting lines such as these:
You know that there is an intersection of two lines…
But which of the two lines you drew earlier are they?
There is no way to indicate which individual things are seen
again without a way to refer to individual token things
Look around some more to see what is there ….
 Here is another intersection of two lines…
 Is it the same intersection as the one seen earlier?
 Without a special way to keep track of individuals the only
way to tell would be to encode unique properties of each
of the lines. Which properties should you encode?
In examining a geometrical figure one only
gets to see a sequence of local glimpses
A note about the use of labels in this example
There are two purposes for figure labels. One is to specify
what type of individual it is (line, vertex,..). The other is to
specify which individual it is in order to keep track of it
and in order to bind it to the argument of a predicate.
 The second of these is what I am concerned with because
indicating which individual it is is essential in vision.
 Many people (e.g., Marr, Yantis) have suggested that individuals
may be marked by tags, but that won’t do since one cannot
literally place a tag on an object and even if we could it would not
obviate the need to individuate and index just as labels don’t help.
Labeling things in the world is not enough because to refer
to the line labeled L1 you would have to be able to think
“this is line L1” and you could not think that unless you
had a way of first picking out the referent of this.
The correspondence problem
A frequent task in perception is to establish a
correspondence between proximal tokens that arise
from the same distal token.
Apparent Motion. Tokens at different times may
correspond to the same object that has moved.
Constructing a representation over time (and over eye
fixations) requires determining the correspondence between
tokens at different stages in constructing the representation.
Tracking token individuals over time/space. To distinguish
“here it is again” from “here is another one” and so to
maintain the identity of objects.
Stereo Vision requires establishing a correspondence
between two proximal (retinal) tokens – one in each eye
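As a minimal sketch of what solving a correspondence problem involves computationally, the Python function below pairs tokens seen at time t with tokens seen at time t+1. The greedy nearest-distance criterion and the name `match_tokens` are assumptions made for illustration only; which criterion the visual system actually prefers is an empirical question.

```python
import math

def match_tokens(prev, curr):
    """Greedily pair each token at time t with the closest unclaimed token at t+1.

    prev, curr: lists of (x, y) positions.
    Returns a dict mapping index-in-prev -> index-in-curr.
    The nearest-distance criterion is an illustrative assumption.
    """
    # All candidate pairings, cheapest (shortest distance) first.
    pairs = sorted(
        (math.dist(p, c), i, j)
        for i, p in enumerate(prev)
        for j, c in enumerate(curr)
    )
    assignment, used = {}, set()
    for _, i, j in pairs:
        if i not in assignment and j not in used:
            assignment[i] = j
            used.add(j)
    return assignment

prev = [(0.0, 0.0), (10.0, 0.0)]
curr = [(9.0, 1.0), (1.0, 0.5)]   # the same two objects, listed in a different order
print(match_tokens(prev, curr))
```

The result says only which token goes with which. It distinguishes "here it is again" from "here is another one" without describing either object.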
Apparent Motion solves a correspondence problem
Dawson Configuration (Dawson & Pylyshyn, 1988)
Linear trajectory?
Curved trajectory?
Which criterion does the visual module prefer?
Dawson Configuration (animated)
Apparent Motion solves a correspondence problem
Dawson Configuration (Dawson & Pylyshyn, 1988)
Nearest mean distance?
Nearest vector distance?
Nearest configural distance?
Which criterion does the visual module prefer?
Dawson Configuration (animated)
Colors & shapes are ignored
Dawson Configuration Different properties Ignored
Yantis’s use of the “Ternus Configuration” to
demonstrate the early visual effect of objecthood
Short time delays result in “element motion”
(the middle object persists as the “same object”
so it does not appear to move)
Long time delays result in “group motion” because
the middle object does not persist but is perceived
as a new object each time it reappears
Relevance to the present theme
These different examples illustrate the need to keep
track of objects’ numerical identity (or their same-‘individuality’) in a primitive non-conceptual way (and
of putting their token representations in correspondence)
In each case the correspondence is computed without
any conscious awareness by the early vision module
The examples (apparent motion, stereovision,
incremental construction of representations, and keeping
track of individuality over time/space) are on different
time scales so it is an empirical matter whether they
involve the same mechanism, but they do address the
same problem – tracking individuals without using their
unique properties.
The difference between a direct (demonstrative) and a
descriptive way of picking something out has produced
many “You are here” cartoons.
It is also illustrated in this recent New Yorker cartoon…
The difference between descriptive and
demonstrative ways of picking something out
(illustrated in this New Yorker cartoon by Sipress)
‘Picking out’
Picking out entails individuating, in the sense of separating
something from a background (what Gestalt psychologists
called a figure-ground distinction)
This sort of picking out has been studied in psychology under
the heading of focal or selective attention.
 Focal attention appears to pick out and adhere to objects rather than places
In addition to a unitary focal attention there is also evidence
for a mechanism of multiple references (about 4 or 5), that I
have called a visual index or a FINST
 Indexes are different from focal attention in many ways that we have
studied in our laboratory (I will mention a few later)
 A visual index is like a pointer in a computer data structure – it allows
access but does not itself tell you anything about what is being pointed to.
Note that the English word pointer is misleading because it suggests that
vision picks out objects by pointing to their location.
The requirements for picking out and keeping
track of several individual things reminded me of
an early comic book character called Plastic Man
Imagine being able to place several of your fingers on things
in the world without recognizing their properties while doing
so. You could then refer to those things (e.g. ‘what finger #2 is
touching’) and could move your attention to them. You would
then be said to possess FINgers of INSTantiation (FINSTs)
FINST Theory postulates a limited number of pointers in early
vision that are elicited by certain events in the visual field and that
index the objects associated with the event. These enable vision to
refer to those objects without doing so under a concept/description
This idea is intriguing but it is missing one
or two details as well as some distinctions
We need to distinguish the mechanisms of early vision (inside the
vision module) from those of general cognition
We need to distinguish different types of information in different
parts of vision (e.g., representations vs physical states, conceptual
vs nonconceptual, as well as personal vs subpersonal).
Closely related to these, we need to distinguish the
processes of vision from those of belief fixation.
Finally, we need to provide a motivated proposal for what the
modular (subpersonal?) part of vision hands off to the rest of the
cognitive mind. This is a difficult problem and will occupy some
of our time in the rest of this class.
Returning to the FINST Theory
First Approximation: FINSTs and Object Files and
the link between the world and its conceptualization
[Figure: FINSTs and Object Files. Labels: Information (causal) link; FINST demonstrative reference link]
Summarizing the theory so far
A FINST index is a primitive mechanism of reference that
refers to individual visible objects in the world. There is a
small number (~4-5) of indexes available at any one time.
Indexes refer to individual objects without referring to them
under conceptual categories, so they provide nonconceptual
reference. Q: Is this a case of seeing without seeing as?
Indexing objects is prior to encoding any of their properties.
So objects are picked out and referred to without using any
encoding of their properties.
 This does not mean that object properties are irrelevant to the
grabbing of indexes or the subsequent process of tracking
 The claim that we initially refer to objects without having encoded
their location is surprising to many people (why?)
 What may be even more surprising is that we can index and refer to
objects without knowing what they are!
Summarizing the theory so far
An important function of these indexes is to bind arguments of
visual predicates to things in the world to which they refer. Only
predicates with bound arguments can be evaluated. Since predicates
are quintessential concepts, an index serves as a bridge from objects
to conceptual representations.
Indexes can also bind arguments of motor commands, including the
command to move focal attention or gaze to the indexed object: e.g.,
Some hard problems that Fodor and I will discuss at a later lecture
Getting information about a particular object into its Object File
How and when does this happen?
Who can use the information in an object file?
Can it be used to track objects by checking whether a candidate
object has the same properties as a particular previous object?
Some hard problems and some
open empirical questions
To be discussed at various later lectures
How and when does information about a particular
object get into its Object File?
Who can use the information in an object file?
Can the information in the file be used to determine the
correspondence between objects by checking whether
they have the same properties?
Is the Object File inside the vision module or outside?
Is this how tracking is accomplished?
Is information in the Object File used to solve the
many-properties binding problem?
Is this done during tracking?
Part 2
Some notes on how indexes
might be implemented
A thought experiment: How might one implement an indexing
system? The attempt might clarify how it is possible to index an
object without having explicit access to the coordinates or other
properties of objects.
 I will sketch a network model but will only describe how it looks
functionally to a user who pushes buttons and notices which lights
come on.
 The model takes as input an activation map (on the proximal
stimulus) with a set of sensors at each point (each pixel). Based on
the relative activity at each point it indexes a number of active
objects and illuminates a light for each. The user chooses one of the
illuminated objects (by name – nobody knows where they are) by
pressing a button beside one of the lights.
Some notes on how indexes
might be implemented
The person then presses a button on a property detection panel
marked with a property name. If the light beside the button
illuminates then we know that the object indexed in panel 2 has
the property indicated on panel 3.
The way this model is wired up is simple. The first panel feeds a
Winner-Take-All network which inhibits every input unit but the
most active one (a classical Darwinian or capitalist world). This
enables a circuit from the button next to the illuminated light to the
input unit which led to the light being on (that’s the ‘index’).
Pressing the button sends a unit of activity to that input unit which
now has a property transducer and an activity selector on (two out of
the required 3 before it sends out a general tremor of activity).
Now you press the button by a property inquiry (panel 3) which
activates all P detectors. If the selected input unit, the property
transducer, and the property inquiry signal are all on, that input fires.
It is trivial to design a circuit that allows one to
check whether a particular place on a proximal
stimulus that has grabbed an index, has a
particular property. All it takes are some
threshold units and some AND and OR units.
Although the simple black box I showed you can
only detect one static input place at a time, it can
inquire about several properties. Extending this to
moving objects is easy using the same ideas – you
partially activate regions near each selected input
unit, thus increasing the likelihood that it will be
selected at the next cycle and so on.
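The button-and-light box can be sketched functionally in a few lines of Python. This is an illustration under stated assumptions, not the network itself: the class name `IndexBox`, its methods, and the use of a simple maximum in place of a real Winner-Take-All circuit are all hypothetical. What it aims to show is that the user never sees coordinates, only a light and a yes/no answer to a property inquiry.

```python
class IndexBox:
    """Functional sketch of the box as seen by a user (hypothetical names)."""

    def __init__(self, activations, properties):
        # activations: one number per input unit (pixel) of the proximal stimulus.
        # properties: one set of property names per input unit (its transducers).
        self._activations = list(activations)
        self._properties = [set(p) for p in properties]
        self._selected = None  # which unit won; never reported to the user

    def winner_take_all(self):
        """Stand-in for the WTA network: keep only the most active input unit."""
        self._selected = max(range(len(self._activations)),
                             key=self._activations.__getitem__)
        return True  # a light comes on; no location is given

    def inquire(self, prop):
        """A unit fires only when all three signals are on:
        it is the selected unit, its transducer for `prop` is on,
        and the property inquiry signal is on."""
        for unit, props in enumerate(self._properties):
            signals = (unit == self._selected) + (prop in props) + 1  # +1: inquiry
            if signals >= 3:
                return True
        return False


box = IndexBox(
    activations=[0.1, 0.9, 0.4],
    properties=[{"red"}, {"green", "square"}, {"red"}],
)
box.winner_take_all()
print(box.inquire("green"), box.inquire("red"))
```

The caller learns whether the indexed object has a property without ever being handed its position, which is the sense in which the index gives access without description.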
Some evidence for indexes and Object Files
The correspondence problem
The binding problem
Evaluating multi-place visual predicates
Operating over several visual elements at once
without having to search for them first
Recognizing shapes by their part-whole relations
Subset search
Multiple-Object Tracking
Imagining space without requiring a spatial display
in the head {This is a large topic beyond the scope of this class, but
see Things and Places, Chapter 5}
A quick tour of some evidence for FINSTs
The correspondence problem (mentioned earlier)
The binding problem
Evaluating multi-place visual predicates
(recognizing multi-element patterns)
Operating over several visual elements at
once without having to search for them first
• Subitizing
• Subset selection
• Multiple-Object Tracking
• Imagining space without requiring a spatial
display in the head
An early architecture for
vision, called Pandemonium,
was proposed by Oliver
Selfridge in 1959. This idea
continues to be at the heart of
many psychological models,
including ones implemented
in contemporary connectionist
or neural net models. It is
also the basic idea in what are
called Blackboard Architectures in AI (e.g., Hearsay
speech recognition systems).
These architectures have no
way to represent that some of
the features detected actually
belong with other features
Vertical lines
Horizontal lines
Oblique lines
Right angles
Acute angles
Introduction to the Binding Problem:
Encoding conjunctions of properties
Experiments show the special difficulty that vision
has in detecting conjunctions of several properties
It seems that items have to be attended (i.e.,
individuated and selected) in order for their
property-conjunction to be encoded
 When a display is not attended, conjunction
errors are frequent
Read the vertical line of digits in this display
What were the letters and their colors?
This is what you saw briefly …
Under these conditions Conjunction Errors are very frequent
Encoding conjunctions requires attention
One source of evidence is from search experiments:
 Single feature search is fast and appears to be
independent of the number of items searched through
(suggesting it is automatic and ‘pre-attentive’)
 Conjunction search is slower and the time increases
with the number of items searched through (suggesting
it requires serial scanning of attention)
Rapid visual search (Treisman)
Find the following simple figure in the next slide:
This case is easy – and the time is independent of
how many nontargets there are – because there is
only one red item. This is called a ‘popout’ search
This case is also easy – and the time is independent of
how many nontargets there are – because there is only
one right-leaning item. This is also a ‘popout’ search.
Rapid visual search
Find the following simple figure in the next slide:
Constraints on nonconceptual representation
of visual information (and the binding problem)
Because early (nonconceptual) vision must not fuse the conjunctive
grouping of properties, visual properties can’t just be represented as
being present in the scene – because then the binding problem could
not be solved!
What else is required?
 The most common answer is that each property must be
represented as being at a particular location
 According to Peter Strawson and Austin Clark, the basic unit of
sensory representation is
 This is the so-called feature placing proposal.
This proposal fails for interesting empirical reasons
 But if feature placing is not the answer, what is?
The role of attention to location in Treisman’s Feature Integration Theory
Conjunction detected
Color maps
Shape maps
Orientation maps
Master location map
Original Input
Attention “beam”
Individual objects and the binding problem
We can distinguish scenes that differ by conjunctions of
properties, so early vision must somehow keep track of
how properties co-occur – conjunction must not be
obscured. How to do this is called the binding problem.
The most common proposal is that vision keeps track of
properties according to their location and binds together
colocated properties.
The proposal of binding conjunctions by the location
of conjuncts does not work when feature location is
not punctate and becomes even more problematic if
they are colocated – e.g., if their relation is “inside”
Binding as object-based
The proposal that properties are conjoined by virtue of their
common location has many problems
 In order to assign a location to a property you need to know its
boundaries, which requires distinguishing the object that has those
properties from its background (figure-ground individuation)
 Properties are properties of objects, not of locations – which is why
properties move when objects move. Empty locations have no
causal properties.
The alternative to conjoining-by-location is conjoining by
object. According to this view, solving the binding problem
requires first selecting individual objects and then keeping
track of each object’s properties (in its object file or OF)
 If only properties of selected objects are encoded and if those
properties are recorded in each object’s OFs, then all conjoined
properties will be recorded in the same object file, thus solving the
binding problem
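The contrast can be made concrete with a small Python sketch. The two encodings below are illustrative assumptions (the function names and the toy scenes are not from the experiments): one records only which features are present in the scene, the other records features per selected object in an object file.

```python
def feature_presence(scene):
    """Record only which features occur somewhere in the scene."""
    return {feature for obj in scene for feature in obj}

def object_files(scene):
    """Record features per selected object: one file per index."""
    return {index: frozenset(obj) for index, obj in enumerate(scene, start=1)}

# Two scenes built from the same features, conjoined differently.
scene_a = [{"red", "square"}, {"green", "circle"}]
scene_b = [{"red", "circle"}, {"green", "square"}]

# Feature presence cannot tell the scenes apart ...
print(feature_presence(scene_a) == feature_presence(scene_b))
# ... but the object files can.
print(set(object_files(scene_a).values()) == set(object_files(scene_b).values()))
```

No location is stored anywhere in the second encoding. The conjunction is preserved simply because co-occurring properties end up in the same file.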
A quick tour of some evidence for FINSTs
The correspondence problem (mentioned earlier)
The binding problem
Evaluating multi-place visual predicates
(recognizing multi-element patterns)
Operating over several visual elements at once
without having to search for them first
 Subitizing
 Subset selection
Multiple-Object Tracking
Cognizing space without requiring a spatial display
in the head
Being able to refer to individual objects or
object-parts is essential for recognizing patterns
Encoding relational predicates; e.g., Collinear (x,y,z,..);
Inside (x, C); Above (x,y); Square (w,x,y,z), requires
simultaneously binding the arguments of
n-place predicates to n elements* in the visual scene
 Evaluating such visual predicates requires
individuating and referring to the objects over
which the predicate is evaluated: i.e., the
arguments in the predicate must be bound to
individual elements in the scene.
*Note: “elements” is used to refer to objects that serve as parts of other objects
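A rough Python sketch of argument binding follows. Everything named here (`SceneObject`, `IndexPool`, `collinear`, and the capacity of 4) is a hypothetical illustration. The predicate takes indexes as arguments and can only be evaluated once each index is bound to an individual element.

```python
from dataclasses import dataclass

@dataclass
class SceneObject:
    # Properties belong to the object in the world, not to the index.
    x: float
    y: float

class IndexPool:
    """A small pool of indexes (capacity of 4 is an assumption)."""

    def __init__(self, capacity=4):
        self.capacity = capacity
        self.bound = {}  # index number -> the object it refers to

    def grab(self, obj):
        if len(self.bound) >= self.capacity:
            raise RuntimeError("no free index")
        k = len(self.bound) + 1
        self.bound[k] = obj
        return k

    def deref(self, k):
        return self.bound[k]

def collinear(pool, i, j, k, tol=1e-9):
    """Evaluate Collinear(i, j, k), where the arguments are bound indexes."""
    a, b, c = (pool.deref(n) for n in (i, j, k))
    # Twice the signed triangle area; zero when the points lie on one line.
    area2 = (b.x - a.x) * (c.y - a.y) - (b.y - a.y) * (c.x - a.x)
    return abs(area2) <= tol

pool = IndexPool()
i1 = pool.grab(SceneObject(0, 0))
i2 = pool.grab(SceneObject(1, 1))
i3 = pool.grab(SceneObject(2, 2))
i4 = pool.grab(SceneObject(2, 0))
print(collinear(pool, i1, i2, i3), collinear(pool, i1, i2, i4))
```

The properties are consulted only when the predicate is evaluated, after the arguments have been bound. Picking out the objects did not require them.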
Several objects must be picked out at
once in making relational judgments
When we judge that certain objects are collinear, we must first
pick out the relevant objects while ignoring their properties
Several objects must be picked out at
once in making relational judgments
The same is true for other relational judgments like inside or on-the-same-contour… etc. We must pick out the relevant individual
objects first. Are dots Inside-same contour? On-same contour?
*Note: Ullman (1984) has shown that some patterns cannot be recognized without
doing so in a serial manner, where the serial elements must be indexed first.
And that is yet another reason why Connectionist architectures cannot work!
A quick tour of some evidence for FINSTs
The correspondence problem
 The binding problem
 Evaluating multi-place visual predicates
(recognizing multi-element patterns)
Operating over several visual elements at
once without first having to search for them
 Subitizing
 Subset selection
Multiple-Object Tracking
Cognizing space without requiring a spatial
display in the head
More functions of FINSTs
Further experimental explorations
Recognizing the cardinality of small sets of things:
Subitizing vs counting (Trick, 1994)
 Searching through subsets – selecting items to search
through (Burkell, 1997)
 Selecting subsets and maintaining the selection during a
saccade (Currie, 2002)
Application of FINST index theory to infant
cardinality studies (Carey, Spelke, Leslie, Uller, etc)
 Indexes may explain how children are able to
acquire words for objects by ostension without
suffering Quine’s Gavagai problem.
Signature subitizing phenomena only appear when
objects are automatically individuated and indexed
Trick, L. M., & Pylyshyn, Z. W. (1994). Why are small and large numbers enumerated differently? A
limited capacity preattentive stage in vision. Psychological Review, 101(1), 80-102.
Subitizing results
There is evidence that a different mechanism is involved in
enumerating small (n<4) and large (n>4) numbers of items (even
different brain mechanisms – Dehaene & Cohen, 1994)
Rapid small-number enumeration (subitizing) only occurs when
items are first (automatically) individuated*
Unlike counting, subitizing is not enhanced by precuing location*
Subitizing is insensitive to distance among items*
 Our account for what is special about subitizing is that once
FINST indexes are assigned to n< 4 individual objects, the objects
can be enumerated without first searching for them. In fact they
might be enumerated simply by counting active indexes which is
fast and accurate because it does not require visual scanning
* Trick, L. M., & Pylyshyn, Z. W. (1994). Why are small and large numbers enumerated
differently? A limited capacity preattentive stage in vision. Psychological Review, 101(1), 80-102.
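The proposed account can be caricatured in a few lines of Python. This is a toy sketch, not a model of the data: the function name, the capacity of 4, and the idea of counting "scan steps" are assumptions introduced for illustration. It shows only the logical difference between reading off active indexes and visiting items one at a time.

```python
def enumerate_items(n_items, capacity=4):
    """Return (count, scan_steps) for a display of n_items.

    capacity is an assumed limit on the number of available indexes.
    """
    if n_items <= capacity:
        # Every item already holds an index: count the active indexes,
        # with no visual scanning needed.
        return n_items, 0
    # Too many items to index at once: visit them one at a time.
    count = 0
    scan_steps = 0
    for _ in range(n_items):
        scan_steps += 1
        count += 1
    return count, scan_steps

print(enumerate_items(3))
print(enumerate_items(7))
```

On this sketch the small-number case involves no scanning at all, which is one way to read the findings that subitizing is insensitive to the distance among items and is not helped by precuing location.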
Subset selection for search
Target =
Burkell, J., & Pylyshyn, Z. W. (1997). Searching through subsets: A test of the visual indexing hypothesis. Spatial
Vision, 11(2), 225-258.
Subset search results:
Only properties of the subset matter – but note that
properties of the entire subset must be taken into
account simultaneously (since that is what distinguishes a feature search from a conjunction search)
 If the subset is a single-feature search it is fast and the
slope (RT vs number of items) is shallow
 If the subset is a conjunction search set, it takes longer
and is more error prone and more sensitive to the set size
As with subitizing, the distance between targets
does not matter, so observers don’t seem to be
scanning the display looking for the target
The stability of the visual world entails the capacity
to track some individuals after a saccade
There is no problem about how the tactile sense can
provide a stable world when you move around while
keeping your fingers on the same objects – because in
that case retaining individual identity is automatic
But with FINSTs the same can be true in vision – at
least for a small number of visual objects
 This is compatible with the fact that it appears that one
retains the relative location of only about 4 elements
during saccadic eye movements (Irwin, 1996)
[Irwin, D. E. (1996). Integrating information across saccadic eye
movements. Current Directions in Psychological Science, 5(3), 94-100.]
The selective search experiment with a saccade induced
between the late onset cues and start of search
Onset of new objects
grabs indexes
A saccade
Target =
Even with a saccade between selection and access, items can be accessed efficiently
A quick tour of some evidence for FINSTs
The correspondence problem (mentioned earlier)
The binding problem
Evaluating multi-place visual predicates
(recognizing multi-element patterns)
Operating over several visual elements at once
without having to search for them first
 Subitizing
 Subset selection
Multiple-Object Tracking
Imagining space without requiring a spatial
display in the head
Demonstrating the function of FINSTs with
Multiple Object Tracking (MOT)
In a typical experiment, 8 simple identical objects are
presented on a screen and 4 of them are briefly
distinguished in some visual manner – usually by flashing
them on and off.
After these 4 targets are briefly identified, all objects
resume their identical appearance and move randomly.
The observers’ task is to keep track of the ones that had
been designated as targets at the start
After a period of 5-10 seconds the motion stops and
observers must indicate, using a mouse, which objects are
the targets
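The trial structure just described can be sketched in Python as follows. This generates a display and scores a response. It is not a model of how observers track. The display size, step size, frame count, and function names are assumptions for illustration.

```python
import random

def make_trial(n_objects=8, n_targets=4, n_frames=100, step=2.0, seed=0):
    """Generate one MOT trial: per-frame positions plus the set of target ids."""
    rng = random.Random(seed)
    positions = [(rng.uniform(0, 400), rng.uniform(0, 400))
                 for _ in range(n_objects)]
    # Targets are the objects that would be flashed at the start of the trial.
    targets = set(rng.sample(range(n_objects), n_targets))
    frames = [positions]
    for _ in range(n_frames):
        # All objects look identical and move randomly and independently.
        positions = [(x + rng.uniform(-step, step), y + rng.uniform(-step, step))
                     for x, y in positions]
        frames.append(positions)
    return frames, targets

def score(response, targets):
    """Proportion of targets correctly indicated at the end of the trial."""
    return len(set(response) & targets) / len(targets)

frames, targets = make_trial()
print(len(frames), len(frames[0]), len(targets))
print(score(targets, targets))
```

Because the objects are identical once the flashing stops, nothing in a single frame distinguishes targets from nontargets. Only continuity over the trial does.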
Another example of MOT: With self occlusion
Self occlusion does not seriously impair tracking
Some findings with Multiple Object Tracking
Basic finding: Most people can track at least 4 targets
that move randomly among identical non-target objects
(even some 5 year old children can track 3 objects)
Object properties do not appear to be recorded during
tracking and tracking is not improved if no two objects
have the same color, shape or size (asynch vs synch changes)
How is tracking done?
We showed that it is unlikely that the tracking is done by
keeping a record of the targets’ locations and updating them by
serially visiting the objects (Pylyshyn & Storm, 1988)
Other strategies may be employed (e.g., tracking a single
deforming pattern), but they do not explain tracking
Hypothesis: FINST Indexes are grabbed by blinking targets. At
the end of the trial these indexes can be used to move attention
to the targets and hence to select them in making the response
What role do visual properties play in MOT?
Certain properties must be present in order for an index to be
grabbed, and certain properties (probably different
properties) must be present in order for the index to keep
track of the object, but this does not mean that such
properties are encoded, stored, or used in tracking.
Is there something special about location? Do we record and
track properties-at-locations?
 Location in time & space may be essential for individuating or
clustering objects, but metrical coordinates need not be encoded or
made cognitively available
 The fact that an object is actually at some location or other does not
mean that it is represented as such. Representing property ‘P’ (where
P happens to be at location L) ≠ Representing property ‘P-at-L’.
A way of viewing what goes on in MOT
An object file may contain information about the object to
which it is bound. But according to FINST Theory, keeping
track of the object’s identity does not require the use of this
information. The evidence suggests that in MOT, little or
nothing is stored in the object file. Occasionally some
information may get encoded and entered in the Object File
(e.g., when an object appears or disappears) but this is not used
in the tracking process itself.*
* We will see later that this has to be stated with care since location may be stored in the
object file and used in a certain sense when the usual continuous tracking does not work.
Another way of viewing MOT
What makes something the same object over time is that it remains
connected to the same object-file by the same Index. Thus, for
something to be the same enduring object no appeal to properties
or concepts is needed. The only requirement is that it be trackable.
Another view of tracking is that it is the basis of objecthood: An
object is something that can be perceptually tracked (Fodor).
There seems to be growing evidence that tracking is a reflex -- it
proceeds without interference from other attentive tasks.*
Franconeri et al.** showed that the apparent sensitivity of tracking
performance to such properties as speed is due to a confound of
speed with object density. Distance between objects is critical to
MOT performance, which is predicted by parallel tracking models.
* Although tracking feels effortful, many secondary tasks do not interfere with tracking (search)
** Franconeri, S., Lin, J., Pylyshyn, Z., Fisher, B., & Enns, J. (2008). Evidence against a speed limit
in multiple-object tracking. Psychonomic Bulletin & Review, 15(4), 802-808.
Why is this relevant to foundational
questions in the philosophy of mind?
According to Quine, Strawson, and most philosophers, you
cannot pick out or track individuals without concepts (sortals)
But you also cannot pick out individuals with only concepts
 Sooner or later you have to pick out individuals using nonconceptual causal connections between things and thoughts.
The present proposal is that FINSTs provide the needed nonconceptual mechanism for individuating objects and for tracking
their (numerical) identity, which works most of the time in our
kind of world. It relies on some natural constraints (Marr).
FINST indexes provide the right sort of connection to allow the
arguments of predicates to be bound to objects prior to the
predicates being evaluated.
 They may also be the basis for learning nouns by ostension.
But there must be some properties
that cause indexes to be grabbed!
Of course there are properties that are causally
responsible for indexes being grabbed, and also
properties (probably different ones) that make it
possible for objects to be tracked;
But these properties need not be represented
(encoded) and used in tracking
The distinction between properties that cause
indexes to be grabbed and those that are represented
(in Object Files) is similar to Kripke’s distinction
between properties that are needed to name an object
(by baptism) and those that constitute its meaning
Effect of target properties on MOT
Changes of object properties are not noticed during MOT
Keeping all targets at different color, size, or shape does not
improve tracking
Observers do not use target speed or direction in tracking
(e.g., they do not track by anticipating where the targets will
be when they reappear after occlusion)
 Targets can go behind an opaque screen and come out the other
side transformed in: color, shape, speed or direction of motion
(up to 60° from pre-occlusion direction), without affecting
tracking, but also without observers noticing the change!
 What affects tracking is the distance travelled while behind the
occluding screen. The closer the reappearance to the point of
disappearance the better the tracking – even if the closer
location happens to be in the middle of the occluding screen!
Some open questions
We have arrived at the view that only properties of selected
(indexed) objects enter into subsequent conceptualization and
perception-based thought (i.e., only information in object files
is made available to cognition)
So what happens to the rest of the visual information?
Visual information seems rich and fine-grained while this
theory says that properties of only 4 or 5 objects are encoded!
 The present view also leaves no room for representations whose
content corresponds to the content of conscious experience
 According to the present view, the only content that modular
nonconceptual representations have is the demonstrative content
of indexes that refer to perceptual objects
 Question: Why do we need any more than that?
An intriguing possibility….
Maybe the theoretically relevant information we take in is
less than (or at least different from) what we experience
 This possibility has received attention recently with the discovery
of various “blindnesses” (e.g., change-blindness, inattentional
blindness, blindsight…) as well as the discovery of independent vision systems (e.g., recognition and motor control)
 The qualitative content of conscious experience may not play a role
in explanations of cognitive processes
 Even if detailed quantitative information enters into causal process
(e.g., motor control) it may not be represented – not even as
nonconceptual representation
 For something to be a representation its content must figure in explanations
– it must capture generalizations. It must have truth conditions and therefore
allow for misrepresentation. It is an empirical question whether current
proposals do (e.g., primal sketch, scenarios). cf Devitt: Pylyshyn’s Razor
An alternative view of
reference by Indexes
This provisional revised theory responds to Fodor’s argument
that there is no seeing without seeing-as
According to Fodor, the visual module must do more than the
current theory assumes, because its output must provide the
basis for induction over what something is seen as.
 This is not the traditional argument that percepts have a finer grain
than most theories provide for – especially theories that assume a
symbolic output like this one. That argument relies too much on our
phenomenology which more often than not leads us astray.
So the vision module must contain more than object files. It
must be able to classify objects by their visual properties
alone, or to compute for each object a particular appearance class to which it belongs (see black swan example).
An alternative view of
reference by Indexes
Since the vision module is encapsulated it must have a
mechanism for assigning each object x to an equivalence
class based solely on what x looks like. It must do this for a
large number of such classes, based both on its innate
mechanisms and its visual experience [Look of x = L (x)].
L (x) is thus an equivalence class induced by the sensorium
which includes the current token x. The L (x) associated
with each token x must be sufficiently distinctive to allow
the cognitive system to recognize x unambiguously as a
token of something it knows about (e.g., L (x) => looks like
a cow & this is a farm => x is likely a cow). The sequence
from x to recognition must be correct most of the time in
our kind of world (so it must embody a natural constraint).
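As a rough illustration of the division of labor described here, the sketch below separates a toy appearance function from the later inference that uses it. Everything in it is an assumption made for illustration: the function names `look` and `recognize`, the two features used to define the class, and the one-entry knowledge table are hypothetical and are not part of the theory.

```python
def look(token):
    """Hypothetical L(x): map a token to an appearance class using only
    sensory features, so tokens that look alike fall in the same class."""
    return (token["outline"], token["texture"])


# Assumed toy knowledge held by cognition, outside the module:
# (appearance class, context) -> likely category.
BACKGROUND_KNOWLEDGE = {
    (("four-legged", "patchy"), "farm"): "cow",
}


def recognize(token, context):
    """Cognition combines L(x) with context; look() itself never sees context."""
    return BACKGROUND_KNOWLEDGE.get((look(token), context), "unknown")


# Two distinct tokens with the same look but different locations.
x = {"outline": "four-legged", "texture": "patchy", "position": (3, 4)}
y = {"outline": "four-legged", "texture": "patchy", "position": (9, 1)}

same_class = look(x) == look(y)      # both tokens fall in one appearance class
on_farm = recognize(x, "farm")       # "looks like a cow & this is a farm"
in_city = recognize(x, "city")       # same look, no supporting context
```

The point of the split is that `look` has access only to the token's appearance, while the step from appearance class to "likely a cow" draws on knowledge held outside the module.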
An alternative view of
reference by Indexes
This idea of an appearance class L (x) has been explored in
computational vision, where a number of different functions
have been proposed, many of them based on mathematical
compression or encoding functions.
 An early idea which has implications for the present discussion, is a
proposal by David Marr called a Multiple-View proposal. He wrote:
“The Multiple View representation is based on the insight that if one chooses
one’s primitives correctly, the number of qualitatively different views of an object
may be quite small” and Marr cites Minsky as speculating that the representation
of a 3D shape might consist of a catalog of different appearances of that shape,
and that catalog may not need to be very large. (Marr & Nishihara, 1976)
 The search for the most general form of representation has yielded
many proposals, many of which have been tested in Psychology Labs.
E.g., generalized cylinders and part-decomposition:
Biederman, I. (1987). Recognition-by-components: A theory of human image
interpretation. Psychological Review, 94, 115-148.
Seeing without Seeing As?
It’s true that instances of visual encounters deliver an
equivalence-class to which the object belongs by virtue of its
appearance as mapped by the function L (x). It is an
appearance class because it can only use information from the
sensorium and the “natural constraints” built into the modular
vision system. So in that respect one might say that seeing is
always a seeing as where the relevant category is L (x).
But this is unlikely to be the category under which the object
enters into thought. So the seeing-as category L (x) is
not the same category as the one under which the object is
contemplated in thought, where its category would depend on
background knowledge and personal history. The appearance
L (x) is now replaced by familiar categories of thought (e.g.
card table, Ford car, Coca Cola bottle, Warhol Brillo Box, and
so on, categories rich in their interconnections).
More on the structure of
the Visual Module
In order to compute L (x), the vision module must
possess enough machinery to map a token object x onto
an equivalence class designated by L(x) using only
sensory information and module-specific processes and
representations, without appealing to general knowledge.
The module must also have some 4-5 Object Files,
because it needs those to solve the binding problem as
well as to bind predicate arguments to objects (and also
to use the proposed Recognition-By-Parts process for
recognizing complex objects).
Alternative view of what’s in the module
The alternative view of what goes on inside the visual module would
furnish it with more processes, to catalog and look up object shape-types L (x). Our assumptions would seem to require that this
augmented machinery also be barred from accessing cognitive
memories and general inference capacity. Does this conflict with
Fodor’s requirement that the output be right for belief fixation?
Which functions are
in the visual module?
[Figure: the modular vision computer. Input is sensory information; output is a standard form for the appearance of objects, L (x). Three candidate contents of the module: Minimal (just indexes), Original (indexes and files), Maximal (computing L (x)).]
Summary of the current FINST model
Up to 5 indexes can be grabbed based on local properties
Active indexes bind objects to object files (initially empty)
Bound objects can then be queried* and salient properties
encoded in their Object File
* Does this require voluntary attention?
Indexes stay bound to the objects that grabbed them even as the
objects change any of their properties, including briefly
disappearing behind an occluding screen.
When the objects change their location, the result is tracking
which is automatic / reflexive
 We also have evidence that objects can be tracked through other
continuously changing properties (Blaser, Pylyshyn & Holcombe 2000)
The only factor that impairs tracking performance is spacing:
too close yields item-ambiguity and tracking errors
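The bookkeeping in this summary can be pictured with a small data-structure sketch. It is only a sketch under stated assumptions: the names `IndexPool`, `ObjectFile`, `grab` and `query` are hypothetical, the capacity of 5 simply follows the "up to 5 indexes" claim, and nothing here models how indexes are actually grabbed or maintained by the visual system.

```python
from dataclasses import dataclass, field

MAX_INDEXES = 5  # assumed capacity, following the "up to 5 indexes" claim


@dataclass
class ObjectFile:
    """Starts empty; properties are added only when the object is queried."""
    properties: dict = field(default_factory=dict)


class IndexPool:
    """Hypothetical pool of FINST-like indexes bound to object tokens."""

    def __init__(self, capacity=MAX_INDEXES):
        self.capacity = capacity
        self.bindings = {}  # index number -> (object token, ObjectFile)

    def grab(self, obj):
        """Bind a free index to obj; return its number, or None if none is free."""
        if len(self.bindings) >= self.capacity:
            return None
        index = next(i for i in range(self.capacity) if i not in self.bindings)
        self.bindings[index] = (obj, ObjectFile())
        return index

    def query(self, index, prop):
        """Encode one property of the bound object into its object file."""
        obj, obj_file = self.bindings[index]
        obj_file.properties[prop] = obj[prop]
        return obj_file.properties[prop]


pool = IndexPool()
square = {"color": "red", "shape": "square"}
i = pool.grab(square)
square["color"] = "green"  # the object changes; the binding is untouched
bound_obj, bound_file = pool.bindings[i]
```

In this sketch the binding is to the object token itself, not to a description of it, so the index still picks out `square` after its color changes, and the object file stays empty until `query` is called.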
Tracking and spatial proximity
Many experiments show that the only factor that affects
tracking performance is inter-item spacing: when items are too
close there is item-ambiguity resulting in tracking errors
Other factors that allegedly impair tracking (e.g., speed) do so
only because they affect average spacing.
The very process of tracking, which requires something like
smooth continuous movement, makes use of proximity. So does
the process of Gestalt individuation which must collect nearby
pixels and features (regardless of type).
We have many results showing that when objects disappear
their only recalled property is where they were at the time and
the only thing that determines how well they continue to be
tracked when they reappear is how far away they have moved.
Franconeri, S., Pylyshyn, Z. W., & Scholl, B. J. (2012). A simple proximity heuristic allows
tracking of multiple objects through occlusion. Attention, Perception, & Psychophysics, 74(4).
How is location stored and used?
It is possible that location is stored in object files since it is one
of the more important properties of moving objects.
Object location is a property that must be used in tracking since
to track smoothly moving objects just is to solve the
correspondence problem by taking the nearest object
Many experiments show that the correspondence problem in
this case does not involve choosing the most similar object or
the one moving with the same speed or in the same direction…
but the closest one to the locus of disappearance.
Does this mean that object location is stored and used in
tracking, contrary to my earlier claim? Maybe, but …
That depends on whether location is in this case a conceptual
property and whether tracking is a process involving conceptual
representations; there is evidence that neither is the case.
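The nearest-item solution to the correspondence problem described on this slide can be sketched in a few lines. This is a minimal illustration, not the model used in the cited experiments: the function name `nearest_item_update`, the coordinate format and the example positions are all assumptions.

```python
import math


def nearest_item_update(last_positions, items):
    """Re-bind each index to the visible item closest to its last known location.

    last_positions: dict mapping index label -> (x, y) last known location
    items: list of (x, y) positions of all visible items in the new frame
    Only location is consulted; color, shape, speed and heading are ignored.
    """
    updated = {}
    for label, last in last_positions.items():
        updated[label] = min(items, key=lambda p: math.dist(last, p))
    return updated


# Two targets (A, B) and one distractor; all items have moved slightly.
last_seen = {"A": (0.0, 0.0), "B": (10.0, 0.0)}
new_frame = [(1.0, 0.5), (9.0, -0.5), (5.0, 6.0)]
tracked = nearest_item_update(last_seen, new_frame)

# When items come too close, two indexes can land on the same item,
# which is one way to picture the item-ambiguity errors mentioned earlier.
crowded = nearest_item_update({"A": (0.0, 0.0), "B": (1.0, 0.0)},
                              [(0.5, 0.0), (8.0, 8.0)])
```

Note that the update uses a stored location only as a number fed to a distance comparison; nothing in it requires the location to be categorized or described, which is the sense in which location might be used without being a conceptual property.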
Is location a conceptual property?
Is location in this case a conceptual property and is tracking a
process involving conceptual representations?
 Computing correspondence and tracking are prototypical automatic and
cognitively impenetrable processes, likely computed by local parallel
processes, which suggests that they are subpersonal, modular and
nonconceptual, since most automatic processes are nonconceptual.
 Location plays a critical part in all motor control and there is reason to
believe that it plays this role in a different way than the way conceptual
information does. It typically involves a different visual system, the dorsal
pathway. A great deal of evidence is now available showing that only the
ventral pathway contributes to object recognition while the dorsal pathway
is specialized for motor control (Milner & Goodale, 1995; 2004)
 All in all it seems more likely that location is used in MOT and other
visual processes but that it is not a conceptual property at all.
 If you accept that location is conceptual, you pay a high price: you lose the
goal of finding a nonconceptual link between cognition and the world!
Summary of augmented FINST model
So far the only visual information that is available to the
mind is contained in the Object Files in the visual module.
The index mechanism discussed so far also makes it possible to use
additional currently perceived information (see Things & Places, Chapt 5)
Information in the module is in a symbolic form very similar
to the subsequent conceptual representation, except:
 It is encoded in the vocabulary of modular (subpersonal) categories
(that many would call nonconceptual), not in person-level conceptual
 Construction of the intramodular representation cannot use general
knowledge, so all relevant representations must reside in the module
 The intramodular representation uses information in Object Files and
preserves its bindings. The Object Files are the only mechanisms for
dealing with the General Binding Problem, as well as the problem of
binding predicate arguments to objects in the world.
Open Questions about the
augmented FINST model
The modular processes must somehow recover the relations
between objects, and these may or may not be encoded in OFs.
Since information in the module may serve a number of
subsequent functions – including visual-motor coordination
and multimodal perceptual integration – it will have to
represent metrical information, very likely in a nonconceptual
form. The question of representing metrical information is one
we leave for the future since little is known about how
analogue representation might function in cognition
We now arrive at a central question of considerable importance
to the view we are promoting: What form is the visual
representation in when it is handed on to Cognition?
For a copy of these slides see:
Or MIT Press
You are now here
But you are also here
Additional examples of MOT
MOT with occlusion
MOT with virtual occluders
MOT with matched nonoccluding disappearance
Track endpoints of lines
Track rubber-band linked boxes
Track and remember ID by location
Track and remember ID by name (number)
Track while everything briefly disappears (½ sec) and
goes on moving while invisible
Track while everything briefly disappears and reappears
where it was when it disappeared
