Linguamatics I2E - London Info International

Report
Connecting Knowledge Silos using
Federated Text Mining
Guy Singh
Senior Manager, Product & Strategic Alliances
©2014 Linguamatics Ltd

Data Silos
• Structured, semi-structured or unstructured content
• Separate interfaces to access content
• Cannot query across the silos, or exchange content
Internal Content
External Content

Possible Approaches
Connecting Silos
• Workflow Integration
• Linked Data (RDF)
• Data Warehousing
• Federated Text Mining

Workflow Integration using Tools
• If each data source has an API, can link together using specific
tools for each data source
• Can program particular workflows pulling information
together from different data sources
• Advantages
– Can perform complex data manipulation
– Can exploit structure in data sources, or use I2E to transform the
unstructured data
• Disadvantages
– Workflows are fixed: can’t easily navigate and explore connections
between data

Connecting via Linked Data
• Transform databases to RDF or provide a conversion layer
• Advantages
– Standardizes data format
– Can exploit structure in structured data sources
– Can use I2E to transform unstructured data into RDF
– Can reason with the RDF
• Disadvantages
– Transformations are fixed
– Have to predict what information you need from the unstructured
text
• typically pull out a small proportion of the original information
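
The fixed nature of such a transformation can be illustrated with a minimal sketch. It uses plain Python tuples as subject-predicate-object triples rather than a real RDF library, and every name in it (row_to_triples, objects_of, the "drug:1" identifier, the column names) is a hypothetical chosen for illustration, not part of I2E or any RDF toolkit.

```python
from typing import Dict, List, Tuple

# A triple is (subject, predicate, object), kept as plain strings here.
Triple = Tuple[str, str, str]


def row_to_triples(subject: str, row: Dict[str, str]) -> List[Triple]:
    """Emit one triple per non-empty column of a structured record."""
    return [(subject, predicate, value) for predicate, value in row.items() if value]


def objects_of(triples: List[Triple], subject: str, predicate: str) -> List[str]:
    """Look up the objects stored for a given subject and predicate."""
    return [o for s, p, o in triples if s == subject and p == predicate]


# Hypothetical record from a structured data source.
row = {"name": "aspirin", "treats": "headache", "notes": ""}
triples = row_to_triples("drug:1", row)
treated = objects_of(triples, "drug:1", "treats")
```

Only what the conversion was written to emit ends up in the triples; anything else in the source would need a new transformation.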

Using Data Warehouses
• Integrate the data together into a data warehouse
– Extract, Transform and Load each data source into a new database
• Advantages
– Allows users to perform a single query across all the content
– Can use I2E to pull information out of unstructured text
– Can combine with human curation so warehouse contains checked
content
• Disadvantages
– ETL can be a time-consuming and expensive process
– Lose information
• have to predict what information you need from the unstructured text
– typically pull out a small proportion of the original information
• transformation of discrete fields can lose finer distinctions
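
A small sketch of how a transform step can lose finer distinctions, under the assumption that the warehouse schema stores a coarser grade than the source. The field names, the grade values, and the mapping are all hypothetical.

```python
from typing import Dict

# Hypothetical mapping applied when loading a source field into the warehouse.
GRADE_TO_WAREHOUSE: Dict[str, str] = {
    "mild": "low",
    "moderate": "low",
    "severe": "high",
    "critical": "high",
}


def transform(record: Dict[str, str]) -> Dict[str, str]:
    """Keep only the fields the warehouse schema expects, with a coarser grade."""
    return {
        "patient": record["patient"],
        "grade": GRADE_TO_WAREHOUSE[record["grade"]],
    }


# Hypothetical source records, including a free-text note the schema drops.
source = [
    {"patient": "p1", "grade": "mild", "note": "resolved without treatment"},
    {"patient": "p2", "grade": "moderate", "note": "required follow-up"},
]
loaded = [transform(r) for r in source]

distinct_before = {r["grade"] for r in source}
distinct_after = {r["grade"] for r in loaded}
```

After loading, the two source grades are indistinguishable and the free-text note is gone, which is the kind of loss the slide describes.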

Federated Text Mining for Data Silos
• Use I2E to make data available for search, navigation, linking
– Keep data in original format without any data loss
– I2E queries become the conversion layer, dynamically transforming data
into the format we want when we need it
– Ontologies convert between different identifiers, or different languages
– Configurable: just change the queries
• Use other methods when their strengths are required
– RDF for reasoning with results
– Workflow tools for complex data analysis and manipulation
– Data warehouses for curated data
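
The idea of a query acting as the conversion layer can be sketched as follows. This is not I2E code or its API: the ontology, the two toy silos, and the functions find_concepts and query are hypothetical stand-ins, meant only to show data staying in its original form while a shared identifier is resolved at query time.

```python
import re
from typing import Dict, List

# Hypothetical ontology: different surface forms map to one identifier.
ONTOLOGY: Dict[str, str] = {
    "aspirin": "DRUG:0001",
    "acetylsalicylic acid": "DRUG:0001",
    "ibuprofen": "DRUG:0002",
}

# Two silos kept in their original form: free text and structured records.
ABSTRACTS: List[str] = [
    "Acetylsalicylic acid reduced fever in the trial.",
    "Ibuprofen was well tolerated.",
]
LABELS: List[Dict[str, str]] = [
    {"product": "Aspirin", "section": "Warnings"},
]


def find_concepts(text: str) -> List[str]:
    """Return identifiers of the ontology terms that occur in the text."""
    lowered = text.lower()
    found: List[str] = []
    for term, concept_id in ONTOLOGY.items():
        pattern = r"\b" + re.escape(term) + r"\b"
        if re.search(pattern, lowered) and concept_id not in found:
            found.append(concept_id)
    return found


def query(concept_id: str) -> List[Dict[str, str]]:
    """Run at request time: nothing is converted or stored in advance."""
    hits: List[Dict[str, str]] = []
    for text in ABSTRACTS:
        if concept_id in find_concepts(text):
            hits.append({"source": "abstracts", "evidence": text})
    for record in LABELS:
        if concept_id in find_concepts(record["product"]):
            hits.append({"source": "labels", "evidence": record["product"]})
    return hits


aspirin_hits = query("DRUG:0001")
```

Changing what is extracted means changing the query function only; the silos themselves are untouched.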

Road to Federated Text Mining
• Data Normalization
• Link the Content Servers
• Merge Results
• Federated Text Mining

Data Normalisation - Virtual Indexes
• Pathology Reports Index
• Journal Abstracts Index
• Virtual Index

Data Normalisation - Document Structure
• Journal Abstracts
• Pathology Reports
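
One way to picture document-structure normalisation is a per-source mapping from native region names to shared ones, so that a single query can address both sources. The sketch below assumes this interpretation; the region names and the normalize function are hypothetical and do not reflect how I2E indexes are actually configured.

```python
from typing import Dict

# Hypothetical mappings from each source's region names to shared names.
REGION_MAP: Dict[str, Dict[str, str]] = {
    "journal_abstracts": {"ArticleTitle": "title", "AbstractText": "body"},
    "pathology_reports": {"ReportHeader": "title", "Findings": "body"},
}


def normalize(source: str, document: Dict[str, str]) -> Dict[str, str]:
    """Rename known regions to shared names; keep unknown regions unchanged."""
    mapping = REGION_MAP[source]
    return {mapping.get(name, name): text for name, text in document.items()}


abstract = normalize(
    "journal_abstracts",
    {"ArticleTitle": "A trial", "AbstractText": "Results were positive."},
)
report = normalize(
    "pathology_reports",
    {"ReportHeader": "Biopsy", "Findings": "Benign", "Stain": "H&E"},
)
```

Both documents now expose "title" and "body", while source-specific regions such as the hypothetical "Stain" are preserved rather than dropped.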

Data Normalisation - Entities
• Journal Abstracts
• Pathology Reports
• Combined (Normalized)

I2E 4.1/4.2: Single Client, Multiple Results
• Internal network
• External network
• I2E Server 1
• I2E Server 2
• Linked server
• Internal Documents
• FDA Drug Labels

Each Server supplying separate set of results
• Content Server 1
• Content Server 2
• Content Server 3
• Content Server 4
Merge into a single set of results

I2E Federated Text Mining
Extract and connect data in any format, wherever it resides
Connected Knowledge