Rapid Development of Data Generators Using Meta Generators in PDGF

Middleware Systems Research Group (MSRG.ORG)
Tilmann Rabl, Meikel Poess, Manuel Danisch, Hans-Arno Jacobsen
DBTest 2013, June 24, New York City

DBMS Benchmarking is Increasingly Complex
• Data volumes are skyrocketing
  - Enterprise data warehouses double in size every three years
  - Many enterprise data warehouses are petabytes in size
• Systems are becoming increasingly complex
  - Large number of processor cores
    - Single systems (SMP) with a high number of cores (80 on commodity hardware, 2048 on specialized hardware)
    - Multi-node systems (the sky is the limit)
  - Large memory
    - Dell released a TPC-H benchmark with 15 TB of main memory on 64 systems
• How to challenge these systems?

Benchmarks are increasingly complex
[Figure: bar chart of #Tables and #Columns per benchmark. TPC-A: 4 tables, 10 columns; TPC-C: 9 tables, 92 columns; TPC-E: 33 tables, 188 columns; TPC-DS: 24 tables, 430 columns]
• More tables, columns
• More relationships, dependencies, data types, …
• How to build these benchmarks?
• Parallel Data Generation Framework to the rescue!

Parallel Data Generation Framework
• Generic data generation framework
• Relational model
  - Schema specified in configuration file
  - Post-processing stage for alternative representations
• Repeatable computation
  - Based on XORSHIFT random number generators
  - Hierarchical seeding strategy
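
The repeatable-computation idea on this slide can be illustrated with a minimal Python sketch. It assumes a 64-bit xorshift step with the common shift triple (13, 7, 17) and a hypothetical `derive_seed` helper that derives table, column, and row seeds from a project seed. Neither the helper names nor the exact mixing scheme are taken from PDGF; they only show how a seed hierarchy makes every field value computable on its own.

```python
MASK64 = (1 << 64) - 1  # keep all arithmetic within 64 bits


def xorshift64(state: int) -> int:
    """One step of a 64-bit xorshift generator (shift triple 13, 7, 17)."""
    state &= MASK64
    state ^= (state << 13) & MASK64
    state ^= state >> 7
    state ^= (state << 17) & MASK64
    return state


def derive_seed(parent_seed: int, index: int) -> int:
    """Derive a child seed from a parent seed and a position (hypothetical scheme)."""
    # Mix the index into the parent seed; "or 1" avoids the all-zero state,
    # which xorshift would map to itself.
    state = ((parent_seed ^ (index * 0x9E3779B97F4A7C15)) & MASK64) or 1
    for _ in range(3):
        state = xorshift64(state)
    return state


def field_value_seed(project_seed: int, table_id: int, column_id: int, row: int) -> int:
    """Walk the hierarchy project -> table -> column -> row."""
    table_seed = derive_seed(project_seed, table_id)
    column_seed = derive_seed(table_seed, column_id)
    return derive_seed(column_seed, row)


# The value for row 1000 can be computed without generating rows 0..999 first.
print(field_value_seed(42, table_id=0, column_id=1, row=1000) % 121)
```

Because each seed depends only on its position in the hierarchy, any worker can regenerate any value in any order and obtain the same result.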
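
Repeatable Data Generation
• Data generation is based on random numbers
    rng(seed) = r
• More specifically, on parallel random number generation
    rng(seed, n) = rng(seed + n) = r_n
• Generation of numbers within a range (e.g., age)
    age(n) = r_n % 121
    value(n, size, min) = r_n % size + min
• What if we want NULL values?
    value(n, size, min, p) = NULL               if r_n % 100 < p
                             r_n % size + min   otherwise
• Repeat that logic in every generator?
    other(n, a, b, p) = NULL             if r_n % 100 < p
                        other(n, a, b)   otherwise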
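
The formulas on this slide can be tried out with a short Python sketch. It swaps in a SplitMix64-style mixing function as a stand-in for the random number generator (the slides name XORSHIFT; the stand-in is an assumption made for brevity), and the names `rng`, `value`, and `generate_rows` are hypothetical. The point is that a value depends only on the seed and the row number, so row ranges generated independently match a single sequential run.

```python
MASK64 = (1 << 64) - 1


def mix64(x: int) -> int:
    """SplitMix64-style finalizer, used here as a stand-in random function."""
    x = (x + 0x9E3779B97F4A7C15) & MASK64
    x = ((x ^ (x >> 30)) * 0xBF58476D1CE4E5B9) & MASK64
    x = ((x ^ (x >> 27)) * 0x94D049BB133111EB) & MASK64
    return x ^ (x >> 31)


def rng(seed: int, n: int) -> int:
    """Random number for row n: a pure function of seed + n."""
    return mix64((seed + n) & MASK64)


def value(seed: int, n: int, size: int, minimum: int) -> int:
    """Map the random number for row n into [minimum, minimum + size - 1]."""
    return rng(seed, n) % size + minimum


def generate_rows(seed: int, start: int, stop: int) -> list:
    """Generate the age column (0..120) for rows in [start, stop)."""
    return [value(seed, n, 121, 0) for n in range(start, stop)]


sequential = generate_rows(7, 0, 1000)
# Two "workers" generating disjoint row ranges produce the same data.
partitioned = generate_rows(7, 0, 400) + generate_rows(7, 400, 1000)
print(sequential == partitioned)
```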

PDGF Architecture
• Controller
  - Initialization
• Meta Scheduler
  - Inter-node scheduling
• Scheduler
  - Inter-thread scheduling
• Worker
  - Blockwise data generation
• Update Black Box
  - Coordination of data updates
• Seeding System
  - Random sequence adaptation
• Generators
  - Value generation
• Output system
  - Data formatting
To generate data for a schema, the user defines:
• Schema XML file
  - Defines the relational schema
• Generation XML file
  - Defines the output format (CSV, XML, merging tables)
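
One plausible way to picture the division of labor between meta scheduler (nodes), scheduler (threads), and workers (blocks) is a static split of a table's row range. The Python sketch below is only an illustration under that assumption; `split_range` and `plan` are hypothetical names, and the slides do not specify how PDGF actually partitions work.

```python
def split_range(start: int, stop: int, parts: int) -> list:
    """Split [start, stop) into `parts` contiguous, nearly equal sub-ranges."""
    base, extra = divmod(stop - start, parts)
    ranges = []
    lo = start
    for i in range(parts):
        hi = lo + base + (1 if i < extra else 0)
        ranges.append((lo, hi))
        lo = hi
    return ranges


def plan(table_size: int, nodes: int, threads_per_node: int, block_size: int) -> dict:
    """Map (node, thread) to the list of row blocks that worker would generate."""
    work = {}
    # Inter-node split (the meta scheduler's role in this sketch).
    for node_id, (n_lo, n_hi) in enumerate(split_range(0, table_size, nodes)):
        # Inter-thread split (the scheduler's role in this sketch).
        for thread_id, (t_lo, t_hi) in enumerate(split_range(n_lo, n_hi, threads_per_node)):
            # Blockwise generation (the worker's role in this sketch).
            blocks = [(b, min(b + block_size, t_hi)) for b in range(t_lo, t_hi, block_size)]
            work[(node_id, thread_id)] = blocks
    return work


for worker, blocks in plan(10, nodes=2, threads_per_node=2, block_size=2).items():
    print(worker, blocks)
```

Since every value is a function of its seed and row number, the blocks can be generated in any order and on any machine.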

Configuring PDGF
• Schema configuration
  - Data model
• Relational model
  - Tables, fields
• Properties
  - Table size, characters, …
• Generators
  - Base generators
  - Meta generators
• Update definition
  - Insert, update, delete
  - Generated as change data capture
<table name="SUPPLIER">
  <size>${S}</size>
  <field name="S_SUPPKEY" size="" type="NUMERIC" primary="true" unique="true">
    <gen_IdGenerator />
  </field>
  <field name="S_NAME" size="25" type="VARCHAR">
    <gen_PrePostfixGenerator>
      <gen_PaddingGenerator>
        <gen_OtherFieldValueGenerator>
          <reference field="S_SUPPKEY" />
        </gen_OtherFieldValueGenerator>
        <character>0</character>
        <padToLeft>true</padToLeft>
        <size>9</size>
      </gen_PaddingGenerator>
      <prefix>Supplier </prefix>
    </gen_PrePostfixGenerator>
  </field>
  [..]
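
The nesting of generators in the S_NAME field can be read inside out: take the value of S_SUPPKEY, pad it with zeros on the left to nine characters, then add the prefix. The Python sketch below mirrors that reading. The function names are hypothetical, and it assumes that the ID generator counts from 1, which the slide does not state.

```python
def id_generator(row: int) -> int:
    """Assumption: keys start at 1 for the first row (row index 0)."""
    return row + 1


def pad_left(text: str, char: str, size: int) -> str:
    """Pad `text` on the left with `char` up to `size` characters."""
    return text.rjust(size, char)


def s_name(row: int) -> str:
    """Compose the S_NAME value from the S_SUPPKEY value of the same row."""
    suppkey = id_generator(row)              # value referenced from S_SUPPKEY
    padded = pad_left(str(suppkey), "0", 9)  # padding: character 0, size 9
    return "Supplier " + padded              # prefix "Supplier "


print(s_name(0))   # Supplier 000000001
print(s_name(41))  # Supplier 000000042
```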

Base Generators in PDGF
• DictList generator
  - Random line from file
• Long generator
  - Random long in interval
• Others
  - StaticValue
  - Double
  - Date
  - String
  - Text
  - …
<table name="users">
  <size>10000</size>
  <fields>
    <field name="name">
      <type>java.sql.types.VARCHAR</type>
      <size>100</size>
      <gen_DictList>
        <file>dicts/names.dict</file>
      </gen_DictList>
    </field>
    <field name="age">
      <type>java.sql.types.NUMERIC</type>
      <gen_LongGenerator>
        <min>0</min>
        <max>120</max>
      </gen_LongGenerator>
    </field>
  </fields>
</table>
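
To make the link between such a configuration and the generated rows concrete, here is a small Python sketch that reads a trimmed-down copy of the users table and produces rows from it. It is not how PDGF itself is implemented: the dictionary file is replaced by an in-memory list (the `DICTS` mapping is hypothetical), and values are drawn from Python's `random.Random` instead of PDGF's seeding system.

```python
import random
import xml.etree.ElementTree as ET

CONFIG = """
<table name="users">
  <size>5</size>
  <fields>
    <field name="name">
      <gen_DictList><file>dicts/names.dict</file></gen_DictList>
    </field>
    <field name="age">
      <gen_LongGenerator><min>0</min><max>120</max></gen_LongGenerator>
    </field>
  </fields>
</table>
"""

# Stand-in for the dictionary file referenced in the config (assumption).
DICTS = {"dicts/names.dict": ["Ada", "Grace", "Edsger"]}


def build_generator(gen_element):
    """Turn one gen_* element into a function of a random.Random instance."""
    if gen_element.tag == "gen_DictList":
        lines = DICTS[gen_element.findtext("file")]
        return lambda rnd: rnd.choice(lines)  # random line from the dictionary
    if gen_element.tag == "gen_LongGenerator":
        lo = int(gen_element.findtext("min"))
        hi = int(gen_element.findtext("max"))
        return lambda rnd: rnd.randint(lo, hi)  # random long in [min, max]
    raise ValueError("unsupported generator: " + gen_element.tag)


def generate_table(xml_text: str, seed: int) -> list:
    """Generate all rows of the table described by xml_text."""
    table = ET.fromstring(xml_text.strip())
    size = int(table.findtext("size"))
    fields = []
    for field in table.find("fields"):
        gen_element = next(child for child in field if child.tag.startswith("gen_"))
        fields.append((field.get("name"), build_generator(gen_element)))
    rnd = random.Random(seed)
    return [{name: gen(rnd) for name, gen in fields} for _ in range(size)]


for row in generate_table(CONFIG, seed=1):
    print(row)
```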

Null Generator
• Add NULL logic to every generator?
  - Could easily be implemented in a superclass
  - Adds to the configuration file
  - Reduces performance (the check runs every time)
• Higher-order generator: NullGenerator
  - Only used if added to the schema
  - Can be added to any generator
<field name="age">
  <type>java.sql.types.NUMERIC</type>
  <gen_NullGenerator>
    <probability>0.05</probability>
    <gen_LongGenerator>
      <min>0</min>
      <max>120</max>
    </gen_LongGenerator>
  </gen_NullGenerator>
</field>
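
The higher-order idea can be sketched in a few lines of Python: the NULL logic lives in one wrapper that takes any other generator as its argument, so the base generators stay free of it. The names `long_generator` and `null_generator` are hypothetical, and the sketch uses Python's `random.Random` where PDGF would use its own seeding.

```python
import random


def long_generator(minimum: int, maximum: int):
    """Base generator: random integer in [minimum, maximum]."""
    def generate(rnd: random.Random):
        return rnd.randint(minimum, maximum)
    return generate


def null_generator(probability: float, inner):
    """Higher-order generator: returns None with `probability`, else delegates."""
    def generate(rnd: random.Random):
        if rnd.random() < probability:
            return None        # NULL value; the inner generator is not invoked
        return inner(rnd)      # otherwise use the wrapped generator
    return generate


# Mirrors the config above: 5% NULLs around a long generator for 0..120.
age = null_generator(0.05, long_generator(0, 120))
rnd = random.Random(2013)
print([age(rnd) for _ in range(10)])
```

Fields that never contain NULLs simply omit the wrapper and pay no cost for the check.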

Meta Generators
• Control flow and post-processing generators
  - Null generator controls flow
• Post-processing
  - FormattedNumberGenerator
  - PaddingGenerator
  - UpperLowerCaseGenerator
  - PrePostfixGenerator
  - FormulaGenerator
• Flow control
  - ProbabilityGenerator
  - SequentialGenerator
  - IfGenerator
  - SwitchGenerator
  - ReferenceGenerator

Post-Processing Example
• Phone number for users
  - Tens of representations
  - PhoneNumberGenerator was too inflexible
• Formatted long number
  - Long numbers between 10010001 and 9999999999
  - Number formatting: (%d%d%d) %d%d%d-%d%d%d%d
<field name="phonenumber">
  <type>java.sql.types.VARCHAR</type>
  <size>30</size>
  <generator name="FormattedNumberGenerator">
    <generator name="LongGenerator">
      <min>10010001</min>
      <max>9999999999</max>
    </generator>
    <format>(%d%d%d) %d%d%d-%d%d%d%d</format>
  </generator>
</field>
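
A digit-by-digit format like the one above can be emulated in Python as follows. This is a sketch of the idea only: `format_digits` is a hypothetical name, and padding shorter numbers with leading zeros is an assumption, since the slide does not say how FormattedNumberGenerator treats numbers with fewer digits than the pattern has slots.

```python
def format_digits(number: int, pattern: str) -> str:
    """Fill each %d in `pattern` with the next decimal digit of `number`."""
    slots = pattern.count("%d")
    digits = str(number).rjust(slots, "0")  # assumption: left-pad with zeros
    if len(digits) > slots:
        raise ValueError("number has more digits than the pattern has slots")
    result = pattern
    for digit in digits:
        result = result.replace("%d", digit, 1)  # fill slots left to right
    return result


PHONE = "(%d%d%d) %d%d%d-%d%d%d%d"
print(format_digits(2125550123, PHONE))  # (212) 555-0123
print(format_digits(10010001, PHONE))    # (001) 001-0001
```

Changing the representation then only requires a different pattern string, not a new generator.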

Flow Control Example
• More elaborate name field
  - Name male or female
  - 50% chance
  - All upper case
  - Padded to 100 characters
• Sequential generator
  - Probability generator
  - DictList generator
  - UpperLowerCase generator
  - Padding generator
<field name="name">
  <type>java.sql.types.VARCHAR</type>
  <size>100</size>
  <generator name="SequentialGenerator">
    <generator name="ProbabilityGenerator">
      <probability value="0.5">
        <generator name="DictList">
          <file>dicts/female.dict</file>
        </generator>
      </probability>
      <probability value="0.5">
        <generator name="DictList">
          <file>dicts/male.dict</file>
        </generator>
      </probability>
    </generator>
    <generator name="UpperLowerCaseGenerator">
      <mode>uppercase</mode>
    </generator>
    <generator name="PaddingGenerator">
      <character> </character>
      <padToLeft>true</padToLeft>
    </generator>
  </generator>
</field>
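
The name field above can be read as a pipeline: pick a dictionary by probability, draw a name, convert it to upper case, and pad it. The Python sketch below models that reading. It assumes that each stage of the sequential generator receives the value produced by the previous stage, and it uses in-memory word lists in place of the dictionary files; all function names are hypothetical.

```python
import random


def dict_list(lines):
    """Draw a random entry; ignores the incoming value."""
    return lambda rnd, value: rnd.choice(lines)


def probability(choices):
    """choices: list of (probability, generator); probabilities should sum to 1."""
    def generate(rnd, value):
        roll = rnd.random()
        cumulative = 0.0
        for prob, gen in choices:
            cumulative += prob
            if roll < cumulative:
                return gen(rnd, value)
        return choices[-1][1](rnd, value)  # guard against rounding error
    return generate


def upper_case():
    return lambda rnd, value: value.upper()


def padding(char, size, pad_to_left):
    def generate(rnd, value):
        return value.rjust(size, char) if pad_to_left else value.ljust(size, char)
    return generate


def sequential(*stages):
    """Run the stages in order, feeding each the previous result (assumption)."""
    def generate(rnd, value=None):
        for stage in stages:
            value = stage(rnd, value)
        return value
    return generate


name = sequential(
    probability([(0.5, dict_list(["anna", "maria"])),
                 (0.5, dict_list(["john", "paul"]))]),
    upper_case(),
    padding(" ", 100, True),
)
print(repr(name(random.Random(3))))
```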

Core Performance
[Figure: bar chart, y-axis from 0 to 250. Bars: Static Value (no Cache); Null Generator (100% NULL); Null Generator (0% NULL). Series: Base Time, Generator, Base Time Sub, Sub Generator]
• Test environment: single-core laptop, no I/O
• Base time for framework ~55 ns (Base Time)
  - Seeding, method invocation, setting a value
• Computation time for generator 50+ ns (Gen Time)
• Cache update if referenced ~50 ns (Cache Update)
• Cache lookup if intra-row reference ~50 ns (Cache Lookup)
• Sub-generator invocation ~50 ns
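
The per-value figures on this slide can be combined into a rough back-of-the-envelope estimate. The additive model below is an assumption of this sketch, not a formula given by the authors, and `estimate_ns` is a hypothetical name; it uses the slide's approximate numbers (55 ns base time, about 50 ns per sub-generator invocation) and ignores cache updates and lookups.

```python
BASE_NS = 55            # framework base time per value (from the slide)
SUB_INVOCATION_NS = 50  # approximate cost of invoking a sub-generator (from the slide)


def estimate_ns(generator_costs_ns: list) -> int:
    """Estimate the time per value for a chain of nested generators.

    generator_costs_ns lists the computation cost of each generator,
    from the outermost to the innermost one.
    """
    nested = len(generator_costs_ns) - 1  # number of sub-generator invocations
    return BASE_NS + sum(generator_costs_ns) + nested * SUB_INVOCATION_NS


# A single generator at the 50 ns lower bound.
print(estimate_ns([50]))
# A meta generator wrapping one sub-generator, both at the lower bound.
print(estimate_ns([50, 50]))
```

Under these assumptions, one level of nesting roughly doubles the per-value cost of a trivial generator, while for generators that already take several hundred nanoseconds the added overhead is comparatively small.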

Performance Basic Generators
[Figure: bar chart, y-axis from 0 to 600. Bars: DictList, LongGenerator, DoubleGenerator, DateGenerator, RandomString]
• Basic generators without formatting
  - 120 ns to 510 ns

Performance Formatted Values
[Figure: bar chart, y-axis from 0 to 2000. Bars: DictList, SimpleFormat Number Generator, DateGenerator (formatted), DoubleGenerator (4 places)]
• Basic generators with formatting
  - Usually > 1000 ns

Performance Meta Generators
[Figure: bar chart, y-axis from 0 to 1600. Bars: Null Generator (100% Null), Null Generator (0% Null), PrePostFix, Sequential (exec 2), Sequential (concat 2), Sequential (2 formatted + long)]
• Meta generator overhead:
  - Base overhead ~50 ns
  - Generator overhead starts from 50 ns
  - Sub-generator invocation ~50 ns
• Often negligible due to lazy formatting

Use Cases
• TPC-H / SSB
  - 8 tables, 61 columns (first non-trivial example)
  - Without meta-FVGs: 26 custom FVGs
  - After 2 h of editing: 10 custom FVGs
  - After 1 day of reimplementation: 0 custom FVGs, i.e., no coding
  - SSB variations
    - Skews on dimension attributes, fact measures, references
• TPC-DI (in progress)
  - 20 tables, 200 columns
  - 19 custom FVGs (mainly for performance in corner cases)
  - 56x NullGenerator
  - 32x ProbabilityGenerator
  - 3000 lines of config (XML import for multiple files)

Conclusion & Future Work
• Meta generators
  - Improve usability and expressiveness
  - Speed up schema definition
  - Remove necessity for coding
  - Enlarge configuration files
• Used in TPC benchmark(s)
• Performance overhead is small, often negligible
• Future work
  - GUI and SQL export
  - SQL import and data extraction

Thanks
• Questions?
• Contact: [email protected]
• Download and try PDGF: http://www.paralleldatageneration.org
• Some big data info in our BigBench presentation
  - Tuesday, 4pm, Industry 3
