What is Parallel Data Warehouse (PDW) and where does it fit?
Mike Lampa
Director – Business Analytic Solutions
Agenda
• What is Big Data and Where does PDW Fit?
• PDW Architecture on Dell hardware
• MPP and Shared Nothing concepts
• Data distribution and re-distribution
• EDW Reference Architecture
• Data Modeling Guidelines
• ETL Guidelines
• Resource Skilling Considerations
Feel free to ask questions as we move along – goal is to
make this presentation as interactive as possible!
Confidential
Global Marketing
What is Big Data
The Data Explosion
• 988 Exabytes of information in 2010.
• Every day, 2.5 quintillion bytes of data are
created.
• Volume of data across the enterprise is
doubling every 3 years
• 80% of enterprise data is unstructured
DELL CONFIDENTIAL
Storm Rising in Data Analytics
Major technology developments are driving three key mega-trends
that create new opportunities for the industry and our customers
[Figure: technology landscape. Labels shown: Scalable Database Architectures; Real-time BI and Analytics; Self-Service; NoSQL/NewSQL; “Big Data”; Unstructured; Structured; In-Memory; Hadoop; Pig/Hive; Columnar/MPP; Visualization; Socianalytics; Cloud BI]
Financial Customer Advisory Council
February 1-2 | New York City
Big Data complements Analytics (DW & BI)
[Figure: Big Data alongside the DW and BI stack. Column headings: Data Sources; Processing Infrastructure; Knowledge Capture; Business Value. Data sources: Sensors, Devices, Crawlers, Bots, ERP, CRM. Data types: Un-Structured, Structured. Other labels: Shared Infrastructure; Exploratory Analytics; New IP Creation; Models and Production Analytic Applications; Apps; Well-defined processing; DW; BI Tools; Data-enriched tools; PDW Data Mgmt Svcs. Roles: IT Professional; App Developer; Domain Specialist; BI User]
PDW Appliance: Architecture & MPP Concepts
What is the PDW Appliance?
Microsoft Parallel Data Warehouse
• Microsoft software running on Dell hardware
• High-end data warehouse, scales to hundreds of terabytes
• Massively Parallel Processing (MPP) for high performance
• Architected with redundancy throughout the system
• Based on proven MS SQL Server 2008 R2 platform
• Low cost of ownership with industry standard Dell
hardware
• Sold as an appliance with software preloaded
• Extensive consulting and application services available
• Microsoft and Dell representatives work together to serve
the customer
Microsoft PDW Architecture
Scales for Resilience and High Performance, with a Low Cost of Entry
Control Rack
• Control Nodes (R710), Active / Passive: Client Drivers
• Management Servers (R610): Data Center Monitoring
• Landing Zone (R510): ETL Load Interface
• Backup Node (R710 and MD3600f w/MD1200’s): Corporate Backup Solution
Data Racks (up to 4)
• Database Servers (PowerEdge R610)
• Storage Nodes (MD3620f)
• Spare Database Server
• Scale by adding data rack(s)
Networks
• Dual Infiniband
• Dual Fiber Channel
• Private Network
• Corporate Network
Rack Configuration for Dell MD Appliance
[Figure: rack layout showing the Control Rack and Data Racks 1 through 4]
PDW Core Concepts
• Distributed Relational Database
– 10 DBMS servers per Data Rack
– Data distributed across multiple DBMS instances
• Massively parallel processing
– Multiple concurrent resources resolve SQL set operations against distributed data
› Compute Node architecture supports 10 parallel instances of DBMS per Data Rack.
› Each DBMS instance works in parallel on its own “distribution” of a single user query.
• Shared nothing computing
– Resource and data independence are maintained within each DBMS instance
› Each Compute Node reserves its shared resources (CPU, Memory, Disk) for only its distribution of system data
– Managed by MPP server (Control node)
– Converting schema & metadata from shared nothing to a common logical view
• Configured for high redundancy
Data Distribution
• Distributed: A table structure with evenly distributed records across
multiple shared nothing databases.
–Distribution Key: A single column in a Distributed table that is used
for hash distribution of records across multiple shared nothing databases.
• Replicated: A table structure that exists as a full copy on each shared
nothing database.
• Ultra Shared Nothing: Design database schema with a mix of
replicated and distributed tables to minimize data movement between
nodes.
– Dimensions are replicated
– Facts are distributed
– Redistribute rows at run time when distribution incompatibility is
encountered in SQL set operation.
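The difference between the two table types can be sketched in a few lines of Python. This is a minimal illustration, not PDW's implementation: the SHA-256-based hash, the count of 8 distributions, and the sample `sales_fact` and `store_dim` rows are all assumptions made for the example.

```python
import hashlib
from collections import defaultdict

NUM_DISTRIBUTIONS = 8  # assumed value, for the illustration only


def distribution_for(key, num_distributions=NUM_DISTRIBUTIONS):
    """Map a distribution-key value to a distribution number.

    SHA-256 stands in for the appliance's internal hash function.
    """
    digest = hashlib.sha256(repr(key).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_distributions


def distribute(rows, key_column, num_distributions=NUM_DISTRIBUTIONS):
    """Distributed table: each row lands in exactly one distribution."""
    placement = defaultdict(list)
    for row in rows:
        target = distribution_for(row[key_column], num_distributions)
        placement[target].append(row)
    return placement


def replicate(rows, num_distributions=NUM_DISTRIBUTIONS):
    """Replicated table: every distribution holds a full copy."""
    return {d: list(rows) for d in range(num_distributions)}


# Hypothetical sample data: a fact table and a small dimension.
sales_fact = [
    {"store_id": 1, "qty_sold": 5},
    {"store_id": 2, "qty_sold": 3},
    {"store_id": 1, "qty_sold": 7},
    {"store_id": 3, "qty_sold": 1},
]
store_dim = [
    {"store_id": 1, "name": "North"},
    {"store_id": 2, "name": "South"},
    {"store_id": 3, "name": "East"},
]

fact_by_distribution = distribute(sales_fact, "store_id")
dim_by_distribution = replicate(store_dim)
```

Because every distribution holds the full `store_dim`, a join from `sales_fact` to `store_dim` on `store_id` can be resolved inside each distribution without moving rows.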
Ultra Shared Nothing Example
[Figure: four shared-nothing nodes. Each node holds the dimension tables TD, PD, CD, and MD plus one slice of the fact table SF]
Redistribution
• Redistribution: The movement of data between shared nothing
database instances to answer distribution-incompatible SQL
queries within a PDW Appliance.
– Shuffle: A redistribution technique that leverages the Infiniband™
network to create temporary distribution-compatible data sets.
› At least one table in the query plan uses a Distribution Key in its
join criteria.
› Any table that is not joined on its Distribution Key is targeted for
Shuffle first. The leftmost table is chosen if multiple tables meet
this criterion.
– Replication: A redistribution technique that is used to create a
temporary full copy of a data set.
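A shuffle can be pictured as re-hashing one table on the join column so that matching rows end up in the same distribution. The sketch below is an illustration under assumptions, not the DMS implementation: the SHA-256 hash, the 4 distributions, and the `orders` and `customers` tables are invented for the example.

```python
import hashlib
from collections import defaultdict


def dist_of(value, n):
    """Stand-in hash: map a key value to one of n distributions."""
    digest = hashlib.sha256(repr(value).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % n


def hash_distribute(rows, column, n):
    """Place rows into distributions by hashing `column`."""
    out = defaultdict(list)
    for row in rows:
        out[dist_of(row[column], n)].append(row)
    return out


def shuffle(distributed, column, n):
    """Re-hash every row on `column` into a temporary data set."""
    rows = [row for part in distributed.values() for row in part]
    return hash_distribute(rows, column, n)


def local_join(left, right, column, n):
    """Join distribution by distribution; no row crosses distributions."""
    result = []
    for d in range(n):
        lookup = defaultdict(list)
        for r in right.get(d, []):
            lookup[r[column]].append(r)
        for l in left.get(d, []):
            for r in lookup.get(l[column], []):
                result.append({**l, **r})
    return result


N = 4  # assumed number of distributions
customer_rows = [{"customer_id": c, "region": "R%d" % (c % 3)} for c in range(1, 11)]
order_rows = [{"order_id": 100 + i, "customer_id": 1 + i % 10} for i in range(30)]

customers = hash_distribute(customer_rows, "customer_id", N)  # joined on its key
orders = hash_distribute(order_rows, "order_id", N)           # not joined on its key

# orders is the table not joined on its Distribution Key, so it is shuffled.
orders_on_customer = shuffle(orders, "customer_id", N)
joined = local_join(orders_on_customer, customers, "customer_id", N)
```

After the shuffle both inputs are hashed on `customer_id`, so each distribution can complete its part of the join independently.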
SMP vs MPP
SMP                            MPP
• Shared Resources             • Dedicated resources
• Limited scaling              • Scales to PB
• Applicable < 20 TB           • Applicable > 20 TB
• HA must be architected in    • Built in HA and redundancy
• High Concurrency for complex workloads
EDW Architecture
EDW Logical Architecture
EDW Information Layers
[Figure: EDW information layers. Labels shown: Data Rack; Landing Zone]
Data Flow
• Integration Layer: Replication, Source, Data In, ETL, Stage, LRF
• Base & Package Layer: Data Store PDWPRD01 (Base & Package, 100 TB)
– PDW Load Scripts
– Load Scheduling
– Package Copy for Presentation
• Data Presentation Layer: Data Store PDWPRD03 (Presentation, 60 TB)
• Consumption Layer: Data out; SAS, BO, MSAS, BI Tools
• Networks: Dell network, Infiniband
Enabling Packages - Hub and Spoke
• Physical Data Marts (Packages) may make sense from a consumption
perspective.
• Primary Considerations:
– Business Function
– Type of BI Workload
• Secondary Considerations:
– User Size, Data Volumes & Performance
– Security & Sensitivity
Data Modeling Guidelines
PDW Table “Geometries”
• Replicated: A table structure that exists as a full copy within each
PDW Data Node
• Distributed: A table structure that is hashed and distributed as evenly
as possible across all PDW Data Nodes on the appliance
PDW Table Geometry Example
[Figure: star schema from a Source System mapped onto PDW Compute Nodes and Storage Nodes. Each node shows the dimension tables DD, SD, ID, and PD alongside one of the Sales Fact slices SF 1 through SF 5]
Tables in the example:
• Date Dim (DD): Date Dim ID, Calendar Year, Calendar Qtr, Calendar Mo, Calendar Day
• Item Dim (ID): Prod Dim ID, Prod Category, Prod Sub Cat, Prod Desc
• Store Dim (SD): Store Dim ID, Store Name, Store Mgr, Store Size
• Promo Dim (PD): Mktg Camp ID, Camp Name, Camp Mgr, Camp Start, Camp End
• Sales Fact (SF): Date Dim ID, Store Dim ID, Prod Dim ID, Mktg Camp Id, Qty Sold, Dollars Sold
Distribute or Replicate?
• Use distributed tables when:
–The table is large – generally > 5GB
–For fact/detail tables
–Full table scans do not provide acceptable performance
• Use replicated tables when:
–The table is small – generally < 5GB
–For dimension/lookup tables
–Multiple foreign keys exist and foreign key joins are
common
Partitioning Distributed Tables
• Distributed tables are already segmented by hashed distributions
• Partitioning further divides the rows within a distribution, based on a
partition function (e.g., Time_Dim Quarter or Year)
• Allows for operations efficiency when adding, loading, dropping,
and switching partitions
• Good for fast loading of an unused partition and then switching it
in after loading
• Partition for manageability
– Typically on a date key (or integer surrogate)
– Typically same as clustered index key
– SWITCH partitions OUT for fast delete of history or IN to modify or add a specific
historical slice
Colocation
Tables must be designed for performance from the beginning. Performance
optimization is not just a DBA thing after development is complete!
• Colocation: Within a PDW appliance, two individual records with identical
keys will always belong to the same Distribution.
– Single GREATEST performance consideration
– Beneficial for Join and Aggregation performance (eg Parent & Child join)
– Distribution Compatibility
• Choosing the right Distribution Key
– Identify Commonly used join keys and/or aggregations
– Choose a single column that limits skew to < 40%
(High Domain Cardinality & low Distribution SKEW)
– Distribution Key should be the first column declared for Distributed Table
DDL
– Consider Surrogate Keys when “business key” is compound
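Candidate distribution keys can be compared before any DDL is written by hashing sample values and looking at how evenly they spread. The slide does not define how skew is measured; the sketch below assumes one possible reading, namely how far the fullest distribution sits above the average row count. The SHA-256 hash, the 8 distributions, and the sample columns are likewise assumptions.

```python
import hashlib
from collections import Counter


def distribution_counts(values, n):
    """Count how many of the sampled key values hash to each distribution."""
    counts = Counter(
        int.from_bytes(hashlib.sha256(repr(v).encode("utf-8")).digest()[:8], "big") % n
        for v in values
    )
    return [counts.get(d, 0) for d in range(n)]


def skew(counts):
    """Assumed definition: (largest count - mean count) / mean count."""
    mean = sum(counts) / len(counts)
    return (max(counts) - mean) / mean if mean else 0.0


N = 8  # assumed number of distributions

# High domain cardinality: a hypothetical surrogate key with 10,000 values.
order_ids = range(10_000)
# Low domain cardinality: a hypothetical status flag with two values.
status_flags = ["A", "N"] * 500

order_id_skew = skew(distribution_counts(order_ids, N))
status_skew = skew(distribution_counts(status_flags, N))
```

With only two distinct values, `status_flags` can occupy at most two of the eight distributions, so its skew under this definition is at least 3.0 (300%), far above the 40% guideline. The high-cardinality key spreads close to evenly.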
Colocation Example
• Ensure that all compatible Distribution Keys are identical data
types.
– [customer.customer_id integer] = [customer_hist.customer_id integer]
– [customer.customer_id integer] <> [customer_hist.customer_id char(10)]
• Colocation and Distribution Key Example
TABLE:       MFG_SO     DELL_MFG_ORD_CNFG    DELL_MFG_ORD_STAT
PK COLUMNS:  SO_ID      SO_ID                SO_ID
             BU_ID      BU_ID                BU_ID
             ORD_NUM    ORD_NUM              ORD_NUM
Additional PK columns shown: ORD_TIE_NUM, MFG_STAT_CD, MFG_WIP_STAT_DTM
Distribution Key: marked on the shared PK columns
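One way to see why compatible Distribution Keys need identical data types: if the hash is computed over the stored representation of a value, an integer 42 and a char(10) '42' are different inputs and generally land in different distributions. The sketch below is an illustration under that assumption only. The byte encodings, the SHA-256 hash, and the 8 distributions are invented for the example and are not a description of PDW internals.

```python
import hashlib
import struct


def bucket(data: bytes, n: int) -> int:
    """Stand-in hash over a value's stored bytes."""
    digest = hashlib.sha256(data).digest()
    return int.from_bytes(digest[:8], "big") % n


def hash_integer(value: int, n: int) -> int:
    """Assumed encoding of an integer column: 4 bytes, little-endian."""
    return bucket(struct.pack("<i", value), n)


def hash_char10(value: int, n: int) -> int:
    """Assumed encoding of a char(10) column: digits padded with spaces."""
    return bucket(str(value).ljust(10).encode("ascii"), n)


NUM_DISTRIBUTIONS = 8  # assumed value
customer_ids = range(1, 1001)

# Count the customer_id values that would land in different distributions
# in customer (integer key) and customer_hist (char(10) key).
mismatches = sum(
    1
    for c in customer_ids
    if hash_integer(c, NUM_DISTRIBUTIONS) != hash_char10(c, NUM_DISTRIBUTIONS)
)
```

Every mismatch is a pair of rows with the same logical key that are not colocated, so a join between the two tables would need a redistribution step.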
Handling Large Dimensions
Large Dimensions (>5GB uncompressed)
• Distribute/Normalize
• Distribute:
– If possible, distribute dimension on same key as fact surrogate key.
– If distribution compatibility not possible, “shuffle” dimension data on the
fly (at Query time)
• Normalize:
– Normalize large dimension into smaller tables and replicate the core
dimension (more manageable replication size)
– Look at usage pattern: if only a few columns from the dimension are used
in most of the major queries, separate high-use columns from low-use
columns into separate Dim_tables. Both have the same surrogate key.
› Core dimension is replicated which is joined locally to fact table
› Create a view to combine data from core and outrigger to insulate complexity
from users (CAUTION: check the performance)
Multi Level Partitioning
• Partitioning
– Partitioning is a method of distributing a
table’s rows among a number of subtables (partitions).
– Partitioning is applied within each
Distribution.
• Multi-Level Partition Support
– Any combination of up to four total
Range, Hash, Or List partition schemes.
– Each new partition level generates new
partitions at a multiple of the previous
level.
– Partition Values: [Range 4 x List 6] will
generate 24 partition files per
Distribution.
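The partition arithmetic above is a simple product of the level sizes. A minimal sketch follows; the function names and the distribution count used in the second helper are hypothetical.

```python
from math import prod

MAX_LEVELS = 4  # "up to four total" partition schemes


def partitions_per_distribution(level_sizes):
    """Multiply the partition counts of each level, outermost first."""
    if not 1 <= len(level_sizes) <= MAX_LEVELS:
        raise ValueError("between 1 and %d partition levels expected" % MAX_LEVELS)
    return prod(level_sizes)


def total_partition_files(level_sizes, num_distributions):
    """Partitioning is applied within each distribution, so counts multiply."""
    return partitions_per_distribution(level_sizes) * num_distributions


range_by_list = partitions_per_distribution([4, 6])       # Range 4 x List 6
three_levels = partitions_per_distribution([6, 5, 3])     # Range 6 x Hash 5 x List 3
across_appliance = total_partition_files([4, 6], 8)       # 8 distributions assumed
```

The first call reproduces the 24 partition files per distribution from the slide. Adding levels grows the file count quickly, which is worth checking before settling on a scheme.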
Benefits of Partitioning
• Reduce Table Scans
– This is the most common use case for partitioning.
– Relies on query restrictions aligned with: Range, Hash, or List qualifiers.
– Practice: Partition on commonly restricted fields (query based).
• Minimize Memory Utilization
– Join Operations
› Reduces memory requirements per join.
› Reduces disk spill if session or operation limits are reached.
– Aggregation
› Reduces memory requirements to build result set.
› Reduces disk spill if session or operation limits are reached.
– Practice: Hash Partition on join key.
Multi Level Partitioning Example
DDL of Multi-Level Partitioned Table
CREATE TABLE member (
    memberID BIGINT NOT NULL,
    memberType SMALLINT NOT NULL,
    lastName VARCHAR(50) NOT NULL,
    activeStatus CHAR(1) NOT NULL,
    salesTotal FLOAT,
    lastLogin DATE NOT NULL)
WITH distribute_on (memberID),
    text compressed,
    IIpartition=((range on lastLogin
        partition p01 values < '2007_01_01',
        partition p02 values < '2007_04_01',
        partition p03 values < '2007_07_01',
        partition p04 values < '2007_10_01',
        partition p05 values < '2008_01_01',
        partition p06 values >= '2008_01_01')
    SUBpartition (hash on memberID 5 partitions)
    SUBpartition (list on activeStatus
        PARTITION p101 VALUES ('n'),
        PARTITION p102 VALUES ('a'),
        PARTITION p103 VALUES (default)));
ETL Guidelines
PDW Data Loading Design Goals
• Load data efficiently and non-obtrusively, respecting
concurrent queries and loads
• Reduce table fragmentation as much as possible
• Provide system recovery capabilities in the event of data load
failure that have minimal impact on concurrent queries
• Provide multiple load/ETL options for PDW customers
Data Movement Service (DMS) with PDW
• Runs on the following nodes as a Windows service:
› Control
› Compute
› Landing Zone
• Used to quickly move data in parallel between nodes by using
Infiniband network in PDW
• Uses ADO.NET
› Uses SqlClient namespace to select data from SQL Server
› Uses SqlBulkCopy to insert data into Compute nodes
• Two protocols/networks used by DMS:
– Data transfer network to move data between nodes
– Message network to send command and status messages to nodes from Manager
• DMS closely interacts with the primary PDW Engine Service
• DMS is used for both loading and querying data
ETL Loading Options in PDW
• DWLoader Utility
• SQL Server Integration Services (SSIS)
• CREATE TABLE AS SELECT (CTAS)
• Standard SQL DML statements: INSERT/SELECT
PDW Distributed Table Load – Step 1
Components shown:
• Control Rack, Control Node: PDW Engine, SQL Server, DMS Manager, Load Manager
• Data Rack, Compute Nodes: DMS (Converter, Sender, Receiver, Writer); Storage Nodes
• Landing Zone: Load Client, Load File/SSIS, SSIS API, DMS (Distributor)
• Network: Infiniband
Load steps:
(1) DWLoader invoked/SSIS
(2) Load Manager creates staging tables
(3) DMS reads load data and buffers records to send to Compute Nodes round-robin
(4) Each row is converted for bulk insert and hashed based on the distribution column
(5) Hashed row is sent to appropriate node receiver for loading
(6) Row is bulk inserted into staging table
PDW Distributed Table Load – Step 2
STEP 1: DWloader creates topology-equivalent staging tables and moves
data from the LZ file into the Staging DB using DMS
STEP 2: DWloader uses SQL commands to move data from the staging
tables to the destination tables in the Destination DB
NOTE: distributions of a table are written in parallel when the
multi-transactions option is set to true.
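The row movement in load steps (3) through (6) of Step 1 can be sketched as a small simulation. This is a simplified model under assumptions, not DMS itself: the SHA-256 hash, the buffer size, the node count, and the sample rows are invented, and the data type conversion in step (4) is reduced to a copy.

```python
import hashlib
from itertools import cycle


def owning_node(key, num_nodes):
    """Stand-in for hashing a row on its distribution column."""
    digest = hashlib.sha256(repr(key).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_nodes


def load(rows, key_column, num_nodes, buffer_size=2):
    """Simulate a distributed-table load into per-node staging tables."""
    # Step 2: one (empty) staging table per Compute Node.
    staging = {node: [] for node in range(num_nodes)}
    received = {node: [] for node in range(num_nodes)}

    # Step 3: buffer records and send the buffers to nodes round-robin.
    next_node = cycle(range(num_nodes))
    for start in range(0, len(rows), buffer_size):
        received[next(next_node)].extend(rows[start:start + buffer_size])

    # Steps 4-6: each node converts and hashes the rows it received,
    # sends each row to the node that owns its hash value, and that
    # node bulk inserts the row into its staging table.
    for batch in received.values():
        for row in batch:
            converted = dict(row)  # placeholder for type conversion
            staging[owning_node(converted[key_column], num_nodes)].append(converted)
    return staging


# Hypothetical load file contents.
load_rows = [{"so_id": i, "qty": i % 4} for i in range(1, 21)]
staging_tables = load(load_rows, "so_id", num_nodes=4)
```

The round-robin pass only spreads the conversion and hashing work across nodes; the final placement of every row is decided by the hash of its distribution column.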
Data Loading – DWloader
• Command-line utility invoked on the Landing Zone
• Integrated with DMS
– Streamlines I/O and minimizes data-loading times through powerful
parallel loading functionality against a single text file
– Optimize data load speeds while maintaining a performance balance so
as not to seriously degrade concurrently running queries
• Characteristics of DWloader
– Accommodate initial data loads of large files over 300 GB
– Achieve data load speeds of up to 2 TB per hour
– Accommodate multiple and concurrent incremental loads
– Has settings for canceling and showing status of loads
– Input file must reside on the Landing Zone
– Maximum concurrency of 10; subsequent loads are queued
Data Loading – SSIS
• The SQL Server PDW Destination is an SSIS component that lets you
load data into SQL Server PDW by using an SSIS .dtsx package.
• In the package workflow for SQL Server PDW, you can load and
merge data from multiple sources and load data to multiple
destinations.
• The loads occur in parallel, both within a package and among
multiple packages running concurrently
• SQL Server 2008 R2 SSIS includes:
– SQL Server Parallel Data Warehouse Connection Manager
– SQL Server Parallel Data Warehouse Destination
• Similar to dwloader, SSIS leverages DMS for parallel load operations.
• SSIS can run either on the Landing Zone or on a server outside the
PDW appliance.
SSIS and PDW Data Types
When using SSIS to load data from a data source to a SQL Server
PDW database:
–Data is first mapped from the source data to SSIS data
types.
–This allows data from multiple data sources to map to a
common set of data types.
–Then the data is mapped from SSIS to SQL Server PDW data
types.
Leading Practices – Data Loading
• Minimize page breaks (fragmentation) by designing “partition-friendly” loads.
• If necessary, drop non-clustered indexes before loading and
re-index after all loads are complete.
• There is no benefit to sorting data before hitting the Landing
Zone.
Leading Practices – Staging Databases
• Historic PDW load jobs tend to be the largest. The staging database
may be reduced in size for subsequent incremental loads.
• When creating the staging database, use the following guidelines:
– Replicated table size should be the estimated size per Compute
Node of all the replicated tables that will load concurrently.
– Distributed table size should be the estimated size per appliance
of all the distributed tables that will load concurrently.
– Log size is typically similar to the replicated table size.
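The sizing guidelines above reduce to a few sums. A minimal sketch follows, assuming sizes are estimated in GB; the function name, the result keys, and the sample table sizes are hypothetical.

```python
def staging_database_size(replicated_gb_per_node, distributed_gb_per_appliance):
    """Estimate staging database sizes from concurrently loading tables.

    replicated_gb_per_node: estimated size per Compute Node of each
        replicated table that will load concurrently.
    distributed_gb_per_appliance: estimated size per appliance of each
        distributed table that will load concurrently.
    """
    replicated = sum(replicated_gb_per_node)
    distributed = sum(distributed_gb_per_appliance)
    return {
        "replicated_gb": replicated,
        "distributed_gb": distributed,
        # Guideline: log size is typically similar to the replicated size.
        "log_gb": replicated,
    }


# Hypothetical concurrent load: two dimensions and two fact tables.
estimate = staging_database_size(
    replicated_gb_per_node=[2.0, 1.5],
    distributed_gb_per_appliance=[400.0, 250.0],
)
```

Running the same calculation for the initial historic load and for a typical incremental load shows how much the staging database can shrink once the history is in place.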
Leading Practices - SSIS
• For good PDW loading throughput, it is important to keep a
steady stream with minimal starts and stops.
• PDW connections and queries are very costly to initiate. Use
fewer to do more.
• Data type conversion in the PDW destination adapter is very
expensive. Be sure the input types match the destination
types, especially for strings and decimals.
• Consider performing data transformations after loading into
the staging database (ELT instead of ETL).
ETL Guidelines
• Grouping: Determine the largest set of data that is
distribution compatible within the query. This will break
queries into multiple compatible steps.
ETL Guidelines
• Joining two distribution
incompatible tables
– Scenario where changing the
structure of either table is not
possible
– Usually encountered while
populating fact tables from
underlying BASE tables
– Create a temporary table with
required columns from driver table
distributed on a key which makes
distribution compatibility possible
– More controlled “Shuffle”
– Temp table or Join output can be
reused by multiple queries
– ETL BEST PRACTICE: BREAK
YOUR WORKLOAD INTO MULTIPLE,
MANAGEABLE, DISTRIBUTION
COMPATIBLE SETS OF QUERIES
Resource Skilling Considerations
Skills Consideration
• Platform Skills - Moving from SQL to PDW
– Retain much of your SQL skills (SQL Server Data Architecture & DBA,
SSIS, etc)
– Heterogeneous platform as you scale from GB to PB!
• Design Skills:
– MPP Data Architecture differs from SMP
› Think in Terms of Distribution Keys vs Primary_Key and Foreign_Key
› Think in Terms of Distribution Compatibility vs Indexes for Performance
› Surrogate Keys lend themselves to Distribution Keys
– MPP ETL Architecture differs from SMP
› More use of Load Ready Files with Upsert Logic vs Dynamic Lookup
› Staging environment strategies simulate CDC Key Lookups
› Surrogate Keys add complexity to CDC Lookups and Key Generation
Summary
• PDW is an MPP appliance from Microsoft on Dell Hardware
• Keep in mind MPP and Shared Nothing concepts while
designing your EDW on PDW.
• Traditional SMP concepts are neither sufficient nor applicable.
• Break your workload into manageable distribution compatible
chunks.
• PDW supports both Normalized and Star schemas.
• Consider grouping data in logical information layers.
• Use a combination of DWloader & SSIS depending on unit of work
• Retain your core technology platform skills, augment your DW
design skills
Q&A