Big Data needs Big Classification

Andy Carnahan
Customer and Information Services Manager
Wingecarribee Shire Council
BIG DATA NEEDS BIG CLASSIFICATION
Andy Carnahan 13 November 2014
Context
• The “records” battle is lost
• The federated states of search
• A tale of two indexes
• Why big classification will happen
J Brew – Cathedral Termite Mound
Cc Wikimedia Commons
The extinct ECM model
Louisa Anne Meredith, 'Tasmanian Tiger', 1880 (Tasmanian Library, SLT)
CONTEXT
• Manual Record Keeping
• Central registry
• Human Classification
CONTENT
• Physical Content
• Manageable Growth
• Single instance
The current ECM model
CONTEXT
• Manual Record Keeping
• Single instance ECM
• Human Classification
CONTENT
• Technology created
• Huge volumes
• Indexing speed/access
Popular Science Monthly/Volume 8/February 1876
The new ECM model
Aviceda Wikimedia Commons: Peregrine Falcon, Qld, Apr 2007
CONTEXT
• Synthetic RK
• Semantic Indexing
• Machine Classification
CONTENT
• Technology created
• Federated Search
• Indexing speed
Big Data
• Velocity
• Volume
• Variety
Big Classification
• Classification must be performed at the speed of information creation, at a quality equal to a human's
• Data does not (necessarily) reside in the RKMS
• So long as the information is digitally available, it can be, and will be, accessible in the future
What is big classification?
Human-competitive automatic topic indexing
Alyona Medelyan, PhD thesis in Computer Science, University of Waikato, July 2009
How’s your math?
• Our inputs:
– Central registry: 250 mail items/day, 100 classified/workflowed (40%) by 2 records staff
– 200 white collar staff: 5,000 emails/day
• The Question: How many records staff do I need to manage the 5,000 daily emails?
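One rough way to work the question through. This is a sketch only: it assumes every one of the 5,000 emails needs classifying and that an email takes about as long to classify as a mail item, neither of which the figures above state. The function name `records_staff_needed` is illustrative.

```python
import math

def records_staff_needed(items_per_day, classified_per_day, staff):
    """Estimate head count from the observed throughput per person."""
    per_person = classified_per_day / staff  # items one person classifies per day
    return math.ceil(items_per_day / per_person)

# Figures from the slide: 2 records staff classify 100 mail items a day,
# and 5,000 emails arrive each day.
needed = records_staff_needed(items_per_day=5000, classified_per_day=100, staff=2)
print(needed)
```

Under those assumptions the answer is 100 records staff, or one for every two of the 200 white collar staff.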
Current Human situation
• General staff won’t do it
– Don’t understand metadata/Classification schemas
– Not their job/too busy
– Maintain their own data sets for own productivity
• Records staff can’t do it
– Sheer quantity of information
– Limited access to information domain
– Not their job/too busy
Fields, haystacks and needles
Claude Monet Wheatstacks (End of Summer),
1890-91, The Art Institute of Chicago
The (old) centre of the Universe
Records
THE Central Registry
A small part of a bigger Universe
Enterprise Information Domain (Content)
• Teamsites (SharePoint)
• Documents
• Intranet
• Databases
• Extranet
• Corporate Intranet
• Multimedia
• Application Data
• Social Media
Email
ECMS
The inadvertent ECMS
The deliberate ECMS
www.youragency.gov.org.com.net.au/about
Our mission is to organise the agency’s information and make it appropriately accessible and useful.
What is an ECMS?
ECMS
The two components
CONTEXT: Metadata Database
CONTENT: Document Store
Context is King
Extending to the domain
Crawling the domain
Exploring Context
Concordance -> content
Concordance Index (location)
Semantic -> context
Semantic Index (Classification, Taxonomy, Thesaurus, Context)
Concordance Index (location, Content)
The big imbalance
Index (disambiguation)
The CONTENT index
• Concordance (location)
• Left brain
• Literal/Algorithmic
• Single path
• Speed
The CONTEXT index
• Semantic (meaning)
• Right brain
• Concepts/thesaurus
• Back of Book
• Precision
Concordance Indexing
Concordance History
• First recorded instance 1230 AD
• Hugh of St Cher
• 500 monks created index to location of every word in the Versio Vulgata (common Bible)
• Same methodology used by Google to index the web, with a URL instead of book/chapter/verse
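A minimal sketch of a concordance index, to make the idea concrete: every word maps to every location where it occurs. The sample documents and the name `build_concordance` are assumptions for illustration; this is not how any particular search engine is implemented.

```python
import re
from collections import defaultdict

def build_concordance(documents):
    """Map each word to every (location, word position) where it occurs."""
    index = defaultdict(list)
    for location, text in documents.items():
        words = re.findall(r"[a-z']+", text.lower())
        for position, word in enumerate(words):
            index[word].append((location, position))
    return index

# Hypothetical content: the keys play the role of book/chapter/verse or a URL.
documents = {
    "Genesis 1:1": "In the beginning God created the heaven and the earth",
    "http://example.org/about": "The council created a central registry",
}
concordance = build_concordance(documents)
print(concordance["created"])
```

The index is built with no human input, which is why it can run at machine speed; it records where a word is, not what the document is about.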
Concordance Features
• Needle-in-haystack searching
• Index requires no human assistance to build
• Index is now built at machine speed
• Access to results is at machine speed
• Mature, widely adopted
• Can find the needle, but we still need the haystacks
Semantic Indexing
Semantic Indexing
• Injects meaning into search
• Search on concepts
• Enables multiple taxonomies in virtual views (pivot taxonomy)
• Disambiguates
• Emerging in research and commercial software
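A toy illustration of searching on concepts rather than literal words. The thesaurus, the sample emails and the function names are all hypothetical, and simple term matching stands in for the much richer analysis that real semantic indexing software performs.

```python
# Hypothetical thesaurus: each preferred concept lists terms that express it.
THESAURUS = {
    "waste management": {"garbage", "rubbish", "bin", "recycling"},
    "roads": {"pothole", "footpath", "kerb"},
}

def concepts_for(text):
    """Return the concepts whose terms appear in the text."""
    words = set(text.lower().split())
    return {concept for concept, terms in THESAURUS.items() if words & terms}

def search_by_concept(concept, documents):
    """Find documents about a concept, whichever term they happen to use."""
    return [name for name, text in documents.items() if concept in concepts_for(text)]

documents = {
    "email-1": "My bin was not collected on Tuesday",
    "email-2": "Large pothole outside the library",
    "email-3": "Where can I take recycling",
}
results = search_by_concept("waste management", documents)
print(results)
```

Neither matching email contains the words "waste" or "management"; the thesaurus supplies the context that a concordance index alone lacks.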
Manual Semantic Indexing
• Record keepers continuously classifying incoming mail
• Confident classification threshold?
– Y: add to indexes
– N: get expert help; RK remembers how to classify that exception
Automatic Semantic Indexing
• “Robot” continuously crawls content domain
• Confident classification threshold?
– Y: add to indexes
– N: get human help; robot remembers how to classify that exception
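The loop in the automatic flow can be sketched as follows. This is a sketch under stated assumptions: the term-overlap `classify` function is a crude stand-in for a real classifier, and the 0.2 threshold, the sample items and the `ask_human` callback are hypothetical.

```python
def classify(text, known_terms):
    """Return (label, confidence) from simple term overlap."""
    words = set(text.lower().split())
    best_label, best_score = None, 0.0
    for label, terms in known_terms.items():
        score = len(words & terms) / len(words) if words else 0.0
        if score > best_score:
            best_label, best_score = label, score
    return best_label, best_score

def index_item(text, known_terms, index, ask_human, threshold=0.2):
    """Classify one item, falling back to a person below the threshold."""
    label, confidence = classify(text, known_terms)
    if confidence < threshold:
        label = ask_human(text)  # N: get human help
        # The robot remembers how to classify that exception.
        known_terms.setdefault(label, set()).update(text.lower().split())
    index.setdefault(label, []).append(text)  # add to indexes
    return label

known_terms = {"roads": {"pothole", "footpath"}}
index, asked = {}, []

def ask_human(text):
    asked.append(text)  # record each time a person was needed
    return "waste management"

first = index_item("pothole on main street", known_terms, index, ask_human)
second = index_item("missed bin collection", known_terms, index, ask_human)
third = index_item("bin collection missed again", known_terms, index, ask_human)
print(first, second, third, len(asked))
```

The human is asked once, for the second item; the third item is then classified without help because the exception was remembered.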
Semantic Indexing
Denise Bedford AIIM2013
The big balance
Big classification will happen
Owen Carnahan - used with permission
2008 Predictions
2014 three suggestions
• Enterprise content management can only occur if context management (classification) is performed at machine speed.
• Machine classification must be performed at a quality similar to that of a records officer.
• The entire enterprise information domain is the responsibility of Records and Information Management Professionals.
The e-context future is coming…
• IBM’s Watson
• Wolfram Alpha (Siri)
• StoredIQ
• Recommind
• Pingar
• Cyc/Wikipedia
• Smartlogic
• TopQuadrant
Record Context keepers
• Most important people in the organisation
• “Training” the synthetic context agent
• Refining and enhancing the context engine
• ICT is the maintainer of the content engine and content store
• Records and Information Management role is much more rewarding
Record keeping now
“small” data
Manual Process
• Record keepers continuously classifying incoming mail
• Confident classification threshold?
– Y: add to indexes (ECMS)
– N: get expert help; RK remembers how to classify that exception
Automatic Semantic Indexing
• “Robot” continuously crawls content domain
• Confident classification threshold?
– Y: add to indexes
– N: get expert help; robot remembers how to classify that exception
Manual Process
Automated Process
Big classification
1. Federated search based on concordance indexing
2. Semantic search/classification based on context engine (machine speed/human quality)
Thank you!
Keep in touch
– LinkedIn: Andy Carnahan
– [email protected]
– [email protected]
– RIMPA list (www.rimpa.com.au)
– LG IT lists
• [email protected]
• [email protected]
• [email protected]
Sometimes small data is good