IMPACT is supported by the European Community under the FP7 ICT Work Programme. The project is coordinated by the National Library of the Netherlands.
An Experimental Workflow Development Platform for Historical Document Digitisation and Analysis
Clemens Neudecker, KB National Library of the Netherlands
Research Meeting, Amsterdam 3 November 2011
Background
• More than 20 individual software components for specific challenges
• Prototyping new algorithms, improving commercial solutions
• Different languages (C, C++, Java, etc.) and platforms (Windows/Linux)
• Extensible with 3rd party applications
• IMPACT Interoperability Framework (IIF)
Main requirements
Behavioural:
• Minimize integration effort
• Minimize deployment effort
• Maximize usability
• Maximize scalability
Functional:
• Modular
• Transparent
• Expandable
• Open source
• Platform independent
Architecture
• Java
• Web Services
• Apache
• Taverna
Open source, available at https://github.com/impactcentre
Free hackathon, 14/15 November, University of Manchester
http://impact-mygrid-taverna-hackathon.wikispaces.com/
Integration
• Only requirement: command line executable
• Generic command line wrapper produces web service
• Web service exposed as workflow module with documentation
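The idea of wrapping a command line executable as a web service can be sketched in a few lines. This is not the IIF wrapper itself (which is not shown in these slides), only a minimal illustration under stated assumptions: the tool reads its input on stdin and writes its result to stdout, and the names `wrap_command`, `as_wsgi_app`, `post` and `UPPERCASE_TOOL` are hypothetical.

```python
import io
import subprocess
import sys


def wrap_command(argv):
    """Return a function that runs the executable `argv`, feeding the
    input bytes on stdin and returning whatever it writes to stdout."""
    def run(data: bytes) -> bytes:
        completed = subprocess.run(
            argv, input=data,
            stdout=subprocess.PIPE, stderr=subprocess.PIPE,
            check=True,  # a non-zero exit code raises CalledProcessError
        )
        return completed.stdout
    return run


def as_wsgi_app(run):
    """Expose a wrapped tool as a minimal WSGI application:
    the request body goes in, the tool output comes back."""
    def app(environ, start_response):
        length = int(environ.get("CONTENT_LENGTH") or 0)
        body = environ["wsgi.input"].read(length)
        try:
            result = run(body)
        except subprocess.CalledProcessError as err:
            start_response("500 Internal Server Error",
                           [("Content-Type", "text/plain")])
            return [err.stderr or b"tool failed"]
        start_response("200 OK",
                       [("Content-Type", "application/octet-stream")])
        return [result]
    return app


def post(app, body: bytes) -> bytes:
    """Call the WSGI app in-process, as a web server would for a POST."""
    environ = {"CONTENT_LENGTH": str(len(body)),
               "wsgi.input": io.BytesIO(body)}
    return b"".join(app(environ, lambda status, headers: None))


# Hypothetical stand-in for a real tool: upper-cases its input.
UPPERCASE_TOOL = [sys.executable, "-c",
                  "import sys; sys.stdout.write(sys.stdin.read().upper())"]
```

To serve the wrapped tool over HTTP, the application returned by `as_wsgi_app(wrap_command(UPPERCASE_TOOL))` could be passed to `wsgiref.simple_server.make_server` from the Python standard library.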
Generic Web Service Wrapper
• Easy integration: developers can focus on their application and worry less about integration, which leads to higher quality software components
Workflows
• OCR workflow = data pipeline
• Building blocks = processing modules (nodes)
• Integration = interaction between nodes (mashups)
• Collaboration with
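The view of a workflow as a data pipeline built from processing nodes can be illustrated with a small sketch. The node functions `binarise` and `segment_lines`, the dictionary layout and the threshold value are assumptions made for this example; they are not IMPACT components.

```python
from functools import reduce


def pipeline(nodes):
    """Chain processing nodes: the output of each node is the input of the next."""
    def run(document):
        return reduce(lambda doc, node: node(doc), nodes, document)
    return run


def binarise(doc, threshold=128):
    """Hypothetical node: turn grey values into ink (1) / background (0)."""
    binary = [[1 if px < threshold else 0 for px in row] for row in doc["image"]]
    return {**doc, "binary": binary}


def segment_lines(doc):
    """Hypothetical node: group consecutive rows containing ink
    into (first_row, last_row) text line ranges."""
    lines, start = [], None
    for y, row in enumerate(doc["binary"]):
        if any(row) and start is None:
            start = y
        elif not any(row) and start is not None:
            lines.append((start, y - 1))
            start = None
    if start is not None:
        lines.append((start, len(doc["binary"]) - 1))
    return {**doc, "lines": lines}


# A tiny 5x3 "page": rows 1-2 and row 4 contain dark pixels.
page = {"image": [[255, 255, 255],
                  [0, 255, 255],
                  [10, 20, 255],
                  [255, 255, 255],
                  [255, 0, 255]]}

workflow = pipeline([binarise, segment_lines])
result = workflow(page)
```

Because every node takes and returns the same kind of document, nodes can be reordered, replaced or extended without touching the others.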
Workflow management
• Web 2.0 style registry: myExperiment
• Local client: Taverna Workbench
• Web client: Project website
Local client: Taverna Workbench
• Background: BioSciences
• Developed and maintained by myGrid, UK
• Open source
• GUI for design and execution of web services & workflows
Remote client: Portal
• SOAP/REST API
• Remote execution of web services & workflows
Community
• Web 2.0 style workflow registry
• Community of experts
• Sharing of resources
• Knowledge exchange
• A central meeting point for users and researchers
Scalability
• Central ESB proxy manages multiple service copies
• Process parallelization, load distribution, fail over, security
• Served >2M requests
• Throughput improvements of 94% with every additional instance
• Tested on Dutch Cloud (“Enlighten Your Research”)
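Load distribution and fail over across several copies of a service can be sketched as a round-robin dispatcher. This is a simplified stand-in, not the ESB proxy used in the project: the class `RoundRobinProxy` and the use of `ConnectionError` to signal an unavailable instance are assumptions for illustration.

```python
import itertools


class RoundRobinProxy:
    """Distribute requests over service copies in turn and
    skip to the next copy when one is unavailable."""

    def __init__(self, instances):
        self.instances = list(instances)
        self._turn = itertools.cycle(range(len(self.instances)))

    def call(self, request):
        # Try each copy at most once per request, starting with the next in turn.
        for _ in range(len(self.instances)):
            instance = self.instances[next(self._turn)]
            try:
                return instance(request)
            except ConnectionError:
                continue  # fail over to the next copy
        raise ConnectionError(f"all {len(self.instances)} instances failed")


# Hypothetical service copies: two healthy ones and one that is down.
def copy_a(request):
    return f"a:{request}"


def copy_b(request):
    raise ConnectionError("copy b is down")


def copy_c(request):
    return f"c:{request}"


proxy = RoundRobinProxy([copy_a, copy_b, copy_c])
answers = [proxy.call(n) for n in range(1, 5)]
```

In this run the requests alternate between the two healthy copies, because every request that lands on the failing copy is passed on to the next one.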
Dataset
• Access to a representative and annotated dataset of significant size, with metadata, ground truth and search facilities
Evaluation features
• Text based comparison of result with ground truth, using the Levenshtein distance
• Layout based comparison of result with ground truth, using the Page Analysis and Ground Truth Elements (PAGE) Framework
• Example:
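As an illustration of the text based comparison, the Levenshtein distance counts the minimum number of character insertions, deletions and substitutions needed to turn the OCR result into the ground truth. The sketch below uses the standard dynamic programming formulation; the derived `character_accuracy` formula (one minus distance divided by ground truth length) is a common convention assumed here, not necessarily the measure used in the project.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance between a and b (insert, delete, substitute cost 1 each)."""
    previous = list(range(len(b) + 1))  # distances from "" to prefixes of b
    for i, ca in enumerate(a, 1):
        current = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, 1):
            current.append(min(
                previous[j] + 1,               # delete ca
                current[j - 1] + 1,            # insert cb
                previous[j - 1] + (ca != cb),  # substitute, or match at no cost
            ))
        previous = current
    return previous[-1]


def character_accuracy(ocr: str, truth: str) -> float:
    """Assumed convention: 1 - distance / len(truth), clipped at 0."""
    if not truth:
        return 1.0 if not ocr else 0.0
    return max(0.0, 1.0 - levenshtein(ocr, truth) / len(truth))


# One substituted character ("h" read as "b") in a 9-character ground truth.
accuracy = character_accuracy("Tbe quick", "The quick")
```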
The PAGE Format Framework
Two-level architecture:
  • root structure
  • task specific sub-formats
• Separate XML Schema definitions
• Format identification via namespaces
• Mapping of
  • dependencies
  • process chains
  • alternative processing steps
• Linking via IDs
[Diagram: PAGE root (XML); PAGE gts (XML), shown three times; processing results or ground truth (e.g. binarisation, dewarping, page content)]
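The combination of a root structure, namespace-based format identification and linking via IDs can be sketched with the Python standard library. All element names, attribute names and namespace URIs below are invented for illustration and are not the actual PAGE schemas.

```python
import xml.etree.ElementTree as ET

# Hypothetical namespace URIs, NOT the real PAGE schema namespaces.
NS_ROOT = "urn:example:page:root"
NS_BINARISATION = "urn:example:page:binarisation"
NS_CONTENT = "urn:example:page:content"

# Hypothetical root document: each step points to a sub-document by ID
# and may name the step it depends on.
ROOT_XML = f"""<root xmlns="{NS_ROOT}">
  <step id="s1" ref="bin1"/>
  <step id="s2" ref="content1" dependsOn="s1"/>
</root>"""

# Hypothetical task specific sub-documents, keyed by their ID.
SUB_DOCS = {
    "bin1": f'<binarisation xmlns="{NS_BINARISATION}" id="bin1"/>',
    "content1": f'<content xmlns="{NS_CONTENT}" id="content1"/>',
}


def namespace_of(element):
    """ElementTree stores qualified tags as '{namespace}localname'."""
    if element.tag.startswith("{"):
        return element.tag[1:].split("}", 1)[0]
    return ""


def resolve(root_xml, sub_docs):
    """Follow the ID links of the root document and identify the format
    of each linked sub-document by its namespace."""
    root = ET.fromstring(root_xml)
    chain = []
    for step in root.findall(f"{{{NS_ROOT}}}step"):
        sub = ET.fromstring(sub_docs[step.get("ref")])
        chain.append((step.get("id"), step.get("dependsOn"), namespace_of(sub)))
    return chain


chain = resolve(ROOT_XML, SUB_DOCS)
```

Each entry of `chain` holds the step ID, the ID of the step it depends on (or `None`) and the namespace that identifies the sub-format.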
Ground-Truthing Tools
• Aletheia
• FineReader PAGE Exporter
• GT Validator
• GT Normalizer
Profile ‘Full Text Recognition’

Evaluation for general text recognition
Measure Weights              Region Type Weights
Merge              1.5       Text         1.0
Allowable Merge    0.5       Image        0.0
Split              1.0       Graphic      0.0
Allowable Split    0.5       Chart        0.0
Miss               2.0       Table        0.0
Partial Miss       2.0       Separator    0.0
Misclassification  1.0       Maths        0.0
False Detection    1.0       Noise        0.0
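How such a profile could feed into a single error figure can be sketched as follows. The weights are taken from the profile above; everything else is an assumption made for illustration, in particular that each error is weighted by the product of its measure weight and its region type weight and scaled by the affected area. The slides do not state how the evaluation tool actually combines them.

```python
# Weights from the 'Full Text Recognition' profile.
MEASURE_WEIGHTS = {
    "Merge": 1.5, "Allowable Merge": 0.5,
    "Split": 1.0, "Allowable Split": 0.5,
    "Miss": 2.0, "Partial Miss": 2.0,
    "Misclassification": 1.0, "False Detection": 1.0,
}
REGION_TYPE_WEIGHTS = {
    "Text": 1.0, "Image": 0.0, "Graphic": 0.0, "Chart": 0.0,
    "Table": 0.0, "Separator": 0.0, "Maths": 0.0, "Noise": 0.0,
}


def weighted_error(errors):
    """Sum of measure weight x region type weight x affected area.

    `errors` is an iterable of (measure, region_type, area) tuples;
    the multiplication rule and the area unit are assumptions.
    """
    return sum(
        MEASURE_WEIGHTS[measure] * REGION_TYPE_WEIGHTS[region_type] * area
        for measure, region_type, area in errors
    )


# Hypothetical errors found on one page (area in arbitrary units).
page_errors = [
    ("Merge", "Text", 100),           # counts 1.5 * 1.0 * 100
    ("Miss", "Image", 500),           # ignored: images have weight 0.0
    ("Allowable Split", "Text", 40),  # counts 0.5 * 1.0 * 40
]
total = weighted_error(page_errors)
```

With this profile only errors on text regions contribute, which matches its purpose of evaluating general text recognition.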
Measures – Segmentation Errors
[Figure: ground truth regions (caption, paragraph) compared with the segmentation result; labelled error types: miss, partial miss, misclassification, merge, split]
OCR Accuracy
Outlook
• Online service for testing/evaluation
• Specification & Guidelines
• Extending the scope:
  • Workflows for linguistic analysis: CLARIN
  • Workflows for preservation: SCAPE
• Even better scalability: Map/Reduce
• Supported by a community of developers & practitioners
“Anyway, the thing about progress is
that it always seems greater than it really is.”
Ludwig Wittgenstein,
Philosophical Investigations
(quoting Johann Nestroy)
