z - Malloc Inc.

Report
What is ETL?
• ETL: Extract Transformation and Load
• The term is used to describe data migration or
the data conversion process
• ETL may be part of the regularly repeated
business process
• The amount and complexity of data grew
dramatically over the past decade so the ETL
processes became more complex and
demanding
1
Phases of ETL Project
1.
2.
3.
4.
5.
6.
7.
8.
Requirements
Analysis
Design
Proof of Concept
Development
Testing
Execution
Verification
2
Requirements
• Scope of the data migration - what data is required in the
target system and how is it going to be used?
• Execution requirements – has to be within certain
timeframe, sequence, geographic location, repeatability,
acceptable system down time, etc.
• Source data retention period, backup and restore
requirements
• Requirements should be made with this in mind:
– Data is the company’s most valuable asset.
– Consequences of corrupt data are usually very costly.
3
Analysis
• Understanding the source data
– Data Dictionary that is used for designing ETL process
has to be created
– Mission critical task
– Frequently underestimated (importance and time)
• All available resource should be used to do
analysis properly:
– Available system documentation including Data Model
and Data Dictionary
– People
– Reverse engineering
4
Design
•
•
•
•
•
Choice of methodology
Choice of technology
Design Target Database
Design ETL process
Data Mapping Document
– Map source data to the target database
– Specifies transformation rules
– Specifies generated data (not from source)
• Design ETL verification process
• Ensures that all requirements are addressed
5
Proof of Concept
• Helps to determine or estimate:
– Feasibility of the concept
– Development time
– Performance, capacity and execution time
– Requirements to be met
• Gain knowledge about the technology
• Code produced in this phase usually can be reused during the development phase
6
Development
• Includes:
– Produce code and processes as per Design and Data
Mapping Document
– Data verification scripts or programs as per Test Plan
– Execution scripts as per Execution Plan
– Unit testing – performed and documented by
developers.
• Typical Challenges:
– Inadequate requirements and design documents
– Developers unfamiliar with technology
7
Testing
 Ensures that requirements are met
 Test Plan is highly recommended
 Types of testing:
 Functional, stress, load, integration, connectivity,
regression
 Challenges:
 Automation and repeatability (testing and
verification scripts)
 Creation of the Test Data
 Extracting small data sets from large data volumes
 Confidential data may not be made available for testing
8
Execution
• Execution plan should include:
–
–
–
–
–
Sequence of tasks
Time of execution and expected duration of execution
Checkpoints and success criteria
Back out plan and continuation of business
Resources involved
• For mission, critical system down time could be
limited or even entirely unacceptable
• Execution should be controlled and verified
9
Verification
• Confirms that the data migration was
successful
• Determined during the design phase
• Various methodologies and technologies could
be used
• Automated verifications are highly
recommended (driven by requirements)
10
Why ETL Projects Fail?
•
•
•
•
Underestimate complexity of the project
Overlook or neglect phases of the project
Wrong choice of technology
Common misconceptions like
– Expensive ETL tools will solve all problems
– No or very little programming will be required
– We don't need or we don't have time for plans,
but we know exactly what we need to do
11
Problems with Mainstream ETL Tools
•
•
•
•
•
•
•
•
•
Maintaining license and consultants is very expensive
Significant time required to learn
Usually require dedicated hardware
Cannot take advantage of the database vendor
proprietary technologies that are optimized for the
fastest data migration
Complex tasks very often require integration with
other technologies
Very limited performance
Only small amount of provided functionality is actually
required for ETL project
Very limited application for Data Analysis
Huge discrepancy between marketing promises and
actual performance
12
Malloc Inc.
• During the 15 years in the IT Consulting business, a
proprietary ETL methodology and technology was
developed
• Consists of two major modules:
– Database Analyzer
– G-DAO Framework
• Major advantages:
– Inexpensive, easier to learn and performs better than
mainstream ETL Software
– Any Java developer can master it and start using it within several
days
– It is proven and it works
13
Database Analyzer
• Produces ETL Data Analysis Reports in various
formats
• Major usage:
– To analyze and understand source data and the
database attributes
– Create data mapping and transformation documents
– Create a data dictionary
– Suggests ways to improve database design
– Valuable source of information for Business Analysis,
Data Architects, Developers, and Database
Administrators.
14
Database Analyzer Reports
• Intuitive, descriptive and easy to read
• In HTML format
• Can be imported and edited in major document editors such as MS
Word
15
Graphical User Interface
• User friendly GUI Interface
• It can also run in Batch mode for lengthy analysis (large data
sources)
16
G-DAO Framework
• Java Code Generator
• Eliminates huge legwork to develop the code required
for the ETL Process
• Uses analysis performed by Database Analyzer to
produce the code optimized for a particular database
• Code may be used for purposes other than ETL (any
kind of database access and data manipulation)
• Uses the advantage of an almost unlimited world of
java libraries (no proprietary language and interfaces)
17
G-DAO Framework Advantages
• Data can flow directly from source to target
(no need for intermittent storage into files)
• XA transactions are supported for all major
databases
• Functionality is limited only by the limitation
of the JDBC driver and Java language
• Easy to learn and implement
• No dedicated hardware required
• Provides a platform for any kind of business
application that requires data access
18
Malloc Inc
 Incorporated in 1993 in Toronto, Canada
 Provided IT Consulting to:








Oracle Corporation
General Electric
Citibank
Royal Bank
Bank of Montreal
The Prudential
Standard Life
and many more
 Most recent implementations of Database Analyzer and G-DAO
Framework
 Citibank
 Royal Bank of Canada
 http://www.mallocinc.com
19

similar documents