MAS TRM Guidelines n Virtualisation - Technical Analysis

MAS TRM Guidelines & Virtualisation
Singapore, July 2013
Purpose of this document:
• To analyse the impact of virtualisation on
MAS TRM Guidelines. It is largely positive as
virtualisation enables FIs to lower risk and
increase availability.
Target audience:
• Infrastructure Architect, not CIO or Head of
Infrastructure. Good level of knowledge on
vCloud Suite is assumed to keep this
document short.
© 2010 VMware Inc. All rights reserved
Iwan Rahabok VCAP-DCD, TOGAF Certified, vExpert
Arseny Chernov
Staff SE, VMware
Enterprise Architect, FSI, Dell
[email protected] |
[email protected] |
 Reviewers:
• Tan Wee Kiong, VCAP-DCD, VCAP-DCA, vExpert, VMware
• Fraser Hall, VCP, Solutions Architect, VMware
• Michael Webster, VCDX, Strategic Architect, VMware
• Heera Singh, VCAP-DCD, VCAP-DCA, UBS Bank
• Kenneth Chan, VCP, Principle Engineer, Trusted Source
• Benjamin Troch, VCAP-DCD, VMUG Singapore Leader, vExpert
• Ivan Chee, VCP, DBS Bank
• Amitabh Dey, VCP, vExpert
MAS: TRM Guidelines
 Published on June 2013
 Combines previous documents into 1:
• IT outsourcing
• Endpoint security and data protection,
• Information systems reliability,
• Resiliency and recoverability
 Focuses on availability & security & performance
• Covers the risk only, not the benefit. Technology can lower risk & improve security compared to pen and
paper. Usage of certain capability of technology should in fact be encouraged. Cars have brake so we can
drive faster. Security and Availability should not slow down the speed of business.
 Written for FI (Financial Institution)
• Both global FI and local FI, which operates in Singapore, follow this closely.
• MAS conducts regular audit. A challenging period for the FI being audited as it’s difficult to fully comply to this
while being cost effective. This is where virtualisation can help. Virtualisation enables FI to comply better
while lowering the cost of business.
Content’s Analysis
 Analysis (from Virtualisation’s centric
• Does not mention virtualisation at all.
• Puzzling as virtualisation is the single largest
disruption to IT architecture and operation.
• Classic physical world architecture &
operation. By itself, physical separation is not
true security.
• Virtualisation also changes the risk as physical
separation is removed.
• Thinking is influenced by TOGAF and ITIL.
• Process and Committee oriented. No Agile
and rapid innovation.
• Rather high level; subject to
• Resulted in inconsistent interpretation by FI.
• FI tend to think Active/Active datacenter is
Content’s Analysis
 Overall Analysis:
• Trust no employee
• Sys Admin must be tracked.
• Guard for internal attack
• Internal sabotage is considered a real risk.
• More dangerous than external attack due
to intimate knowledge
• Broad stroke conservative.
• All social media sites, cloud-based storage,
web-based emails are classified as “unsafe
internet services”. No technical fact given
to support why they they are all insecure.
• DR, not DA, is a must
• DA is not even mentioned.
• Guard the end point
• Smart phone, tablet, laptop
• IT audit must be independent
• In some FI, they are no longer reporting to
Detailed Analysis
DR and Availability
Network & Security
People and Process
End User Computing
Additional Explanation is provided in the speaker notes
DR: Clauses
DR: Clauses
Worst case is not just
when entire datacenter
goes down. It could
also be no expertise
around (e.g. on block
leave or sick leave)
An outdated
Documentation can be
more risky than no
documentation. An
wrong procedure can
cause serious outage.
Manually updating the
DR server is not
Replication implies
DR: Analysis
The changes must be replicated & migrated, not manually implemented in DR Site
• In the physical world, the DR server has a different host name and IP, and is a separate, independent instance. Production is not equal DR. Changes must
be manually implemented.
• Virtualisation solves this by replicating the entire server, not just the data drive.
• Replication can be done at the hypervisor-layer and IP. Not limited to SAN and FC. Replication can also be rolled back to previous snapshot
A Test shall not require Production to be down.
• If a test requires production to be down, then it is a flaw in the principle.
• Classic DR architecture needs Production to be down as it can’t isolate the network. A VM is much easier to control than a physical machine.
A Test should be fully automated and assume the worst case scenario
• You cannot assume all the technical people are available. It must be executed with minimal IT knowledge.
• Most FI do not have ability to test full-datacenter scenario as it’s too manual and complex in the physical world
• DR solution that relies on MS Excel or MS Word gets outdated easily. Manual DR results in “adhoc” as it’s impossible to guarantee human will execute to
the dot
• With Site Recovery Manager (SRM), the entire run book is no longer manually maintained. There is no more “procedures” to be “evaluated and updated”
as the run book is no longer a separate, detached entity. It is defining the recovery plan. The plan and the documentation are one, and there is no longer a
need to manually ensure they are in-sync.
• SRM provides built-in audit as it records every steps in the run book. Hence it also serves as the repository of tests history. Every steps is recorded, and not
typed manually in MS Word. Manual solution such as MS Word means we cannot prove that what actually happened is what is documented/reported.
• If we do regular & frequent Failover, that requires Failback to be automated too.
Annual test is too infrequent due to too many changes
• Changes in users (new staff, new roles) and IT personnel. Changes in applications and infrastructure
Application dependancy can be automatically discovered and shown by the hypervisor
• The dependancy can be integrated with the DR plan.
• There is no need to deploy an agent as it’s provided by the hypervisor
Storage availability
What if we have a new
This requirement
comes out because
there is only 1 central
array. What if we have
many, just like Server?
Merely bringing up the
Storage subsystem
alone is not suffcient.
Server, Network,
Security, Management
must be up too.
Storage availability
 Storage virtualisation addresses the physical array being a SPOF
• Instead of 1 monolithic, complex storage, the storage systems is distributed. Every ESXi node become a
storage subsystem too. It becomes harder to have a major outage as the storage is now distributed.
• Not just the storage is distributed, the data can be duplicated too. A critical VM might need to have quadredundancy and its data is duplicated 4x (so it exists in 4 different ESXi host)
 Storage vMotion allows data to be migrated live from 1 physical array to another.
• With a 10 GE connection, the data migration can be completed relatively quick.
• This replication is independent of underlying technology. It can replicate from FC array to iSCSI array, or from
 Having the data replicated is not enough
• Simply having the storage subsystem replicated does not mean the entire system is serving the users. Access
to this data (storage) is via Server, and access to this Server is via Network, and this Network needs to have
Security, and the entire System needs to be Managed. So we need to replicate entire datacenter, not just
 Synchronous replication means “bad” data is replicated right away. No reaction time from IT.
• Replication should enable multiple snapshot. This allows IT to roll back to point in time.
RTO is practically a lot less than 4 hours
 RTO starts the moment the system is down, not the time CIO approves DR activation.
• Basing on cloud 8.2.4, it starts “from the point of disruption”. So if critical service gets disrupted at 3 am, and
CIO declares disaster at 6 am, there is only 1 hour left to bring up everything.
• It’s also an “application level” definition, not infrastructure level. So if a multi-tier apps needs 1 hour after OS
is up, then it has to be considered. Getting just the OS up and running is not good enough.
RTO is practically a lot less than 4 hours
 The generally industry-accepted definition of RTO is shown below
 MAS definition of RTO is actually MTO.
• In reality, especially for multi-tier applications, you may only have minutes, not 4 hours.
 Annual exercise is way too long to be able to react within that 4 hours MTO.
• People forget, procedures change.
 Manual procedure will not be able meet 4 hours MTO
• The entire DR response has to be automated. The only thing manual is the decision.
Active/Active datacenter, or A/A application?
 A lot of FI are implementing Active/Active Datacenter
• In Active/Active datacenter, both arrays are serving Production workload. This requires very careful operation. They become 1
logical datacenter, with 1 network spanning across both physical sites. A failure could bring down both datacenters as it can
propagate from 1 DC to another. This is especially on the part that is connected (e.g. stretched network, replicated storage,
stretched cluster or single vCenter)
• Active/Active datacenter is not a DR solution.
• Stretched Cluster is often mistaken as a DR solution. It is a DA solution. For a technical explanation, refers to
 Consider a Hybrid Datacenter instead
• Site 2 is running Production workload on a separate storage, which can be patched independantly. This allows FI to patch the
non-production array to see the impact first.
 Virtualisation makes going beyond 2 hardware boxes easier
• In physical world, HA is typically achieved by having 2 hardwares (acting as a pair). This protects when 1 box is down. But it can
not withstand 2 failures. In the virtual world, we can withstand more than 2 node failures, both at the compute node and
storage node.
Complex inter-dependancy
 Network virtualisation can help removing the SPOF of physical network
• While the physical network has redundant hardwares and path, it is still 1 network. Stretching the network to
another datacenter actually extends the risk, as a network outage or misconfiguration can bring down both
datacenter as they are 1 network.
• Network virtualisation is an emerging technology in 2013. It decouples the system’s network from the
underlying datacenter network. Changing the datacenter network will be like changing the hypervisor; it will
have no impact to the VM or network above it.
• Great explanation by Ivan Pepelnjak on why FI should think deeper before implementing a stretched network,
creating 1 logical Datacenter from 2 physical DC.
• “Interconnected things tend to fail at the same time.”
Security protection
Need to protect
against Internal attack.
So having IDS/IPS
and FW are edge is
non compliant.
Hypervisor-based protection is superior to agent-based or network-based protection
• The hypervisor can see everything, but it cannot be seen by the attacker/malware
• It makes security an integral component. Adding a compute node means adding a security node.
• The attacker will not be able to remove the agent as it is agentless. The AV is not visible to the malware or rogue sys admin. A sys admin has local privilege
to the Guest OS and can uninstall or alter any settings, so having an agent is not a tamper proof solution.
• It ensures that AV is deployed. Instead of deploying for 1000 servers and 10,000 desktops, we just need to deploy per ESXi host.
• It is easier to manage. A typical VDI has 80:1 consolidation ratio, cutting the amount of nodes to manage by 8000%
• It has less impact on performance. Running a full scan on Windows can render the OS practically useless for other purpose.
• For Firewall, it is not limited to TCP/IP. Hence, the rules are also more aligned with IT policy, and easier to understand and review. For example, a rule can
state that all virtual desktops, regardless of IP address or applications, in the contractor folder cannot access Internet. This results in both simpler firewall
rules, and a lot less rules.
• Physical FW relies on TCP/IP. If the server is compromised and IP address is changed, the firewall rule will no longer apply. Also, it has many complex rules,
making it difficult to understand the complete rules. It is hence easy to make mistake.
• As the hypervisor provides security services, IDS/IPS can be deployed on every juncture, not just limited to “critical juncture”.
• It can quarantine a new server until it is approved, as it controls the network connection.
• It is easier on the network & storage. Updating 1000 Windows can take serious toll on the physical resource.
Security by Physical separation is good, but…
• It can’t be the only security solution. Physical separation is in fact not a solution as systems are now interconnected in a complex network. While it is
physically separated, they are still connected via network. The security must remains in place when the physical separation no longer exist (e.g. connection
is bridged accidentally). The use of VLAN in networking means there is no physical isolation.
• Isolation can be fully achieved even in the same cluster. Separating the cluster can in fact gives a false assurance that there is security if the physical
isolation is the only “solution” in place.
• With multi-tenancy (even within 1 FI), a new set of tools have to be deployed. These tools are built for virtualisation.
 Penetration Test for virtual world is different
• There are now 2 distinct layer (Consumer and Supplier) as virtualisation decouples a physical machine into 2 layers. The penetration test at the VM layer
does not change. The penetration test at the vSphere layer change.
Virtualisation can retain the separation
 It is cost prohibitive to fully comply with the TRM Guideline as it requires a lot of “separation”
and “segregation”. Virtualisation lowers the cost.
 But needs to take into account Operational discipline and guards againts human error.
• The reason is it is much easier to make changes in virtual environment.
Technology Risk Compliance Checks
 VMware Security Advisories
• Cover risks identified through the breadth of customer install base, as long as through support logs analysis, and
provides detailed impact and remediation information for each case;
• Available at
 VMware Health, Profiles, Security Hardening Checks
• An established framework of tools, developed by VMware is available to perform scheduled overhauls for compliance
Response to Incident
 Response Plan for Security Incidents Can be Scripted
• With vSphere network virtualisation and extensive API set, framework of response scripts that would separate
environment into isolated zones / shut down breached services / isolate users could be in possession of
Global Information Security team as the “Red Button”
• Near real-time audit of changes and ability to extract unusual entitlements (i.e. matching roles and users back
to objects / VMware resource pools) could help prevent internal sabotage or detect it a-posteriori.
• Complete integration with Active Directory or LDAP Services will decrease risk of permission delegation that
could be outside any central regular governance and audit;
 Knowledge of virtualisation is required within Security team
Root cause analysis
 Full Logging is available Out of the Box
• vCenter database is the single resilient archive for all events and tasks submitted by users;
• vSphere comes with an out-of –the-box syslog collector for full low-level logging that would enable aposteriori compliance;
• vShield logging allows audit of all changes introduced to virtualised networks;
• Centralisation of many workloads into one logs rollup enables convenient logs scanning by IPS / IDS;
 Root Cause Analysis is not covered under the normal Production Support
• It’s under Premier Support (business critical or mission critical)
• Source:
• On-site Support is only available in Mission Critical Support
Application: Clauses
Common apps do have
Reference Design on
Complete cloning helps
Separate environment for
• Production
• System Test
• Integration Test
• Development
Some may even need a
Production Fix
Roll back and snaphot is
Application: analysis
 From clause 6.2.5 and 7.2.2, virtualisation is allowed
• Logical separation is allowed. The “or” here means it is acceptable to use shared pool of compute and storage with total
virtualisation separation through the use of VLANs, VXLANs, Storage Pools, Resource Pools and virtual firewalls.
 Virtualisation makes environment cheap and more manageable.
• In the physical world, it is common practice for FI to mix non production, as it does not make sense (business sense, common
sense) to buy 1 server for each environment. Proper testing becomes complex as the various phases (design, development, test)
often use the same copy.
• Cloning a physical server is a long process. Restoring is also difficult with back up is the most common way. But restoring entire
OS is a long process. Virtualisation makes snapshot and rollback easier. It can even be done on live production VM running
• With virtualisation, environment becomes cheap. 1 Production VM can easily have 10 Non Production VM. It is easy to clone,
snapshot, destroy and replicate. Unit Test, Integration Test, System Test, UAT Test, DR Test, etc can now be created.
• It is even possible to create the entire environment per user.
 Virtualisation allows live cloning, while the VMs are running.
• With virtual networking such as vShield Edge, the non-production clone can be placed inside a private, bubble network. This
helps identifying problem quickly in production, as they are identical copy.
• VMware Tools uses VSS to quiese Windows file systems. Apps such as MS SQL and Oracle also use VSS.
• With VAAI enabled, the cloning is offloaded to physical storage. This drastically reduced the load on ESXi host, storage fabric,
storage front end ports. It also speeds up the cloning.
 Addressing the security patch test issue
• The only way to test is to patch it, then run application or regression testing. This is time consuming and risky.
• Virtual Patch enables IT to buy time. Production is patched virtually. Non Production is patched “physically”.
• vSphere, via its API, enables virtual patch. Partners like TrendMicro has leveraged this API to patch/close vulnerability without
tampering with the guest OS. This enables FI to patch production while testing in non production.
 Common applications have reference architecture on VMware
• Oracle, Exchange, SAP, SQL Server, SharePoint, unified messaging, etc.
Inventory & Configuration Management
This is drastically
simplified in “softwaredefined” architecture
Continuous Compliance
with automatic remediation
is possible.
Inventory & Configuration Management
 Inventory management is drastically simplified in virtual world.
• In the physical world, the management tool and the items being managed are 2 separate, independent entity. They
can be out of sync, which is why a stock take is needed from time to time.
• In the virtual world, the “items” are defined in software. It can’t exist unless it is defined.
• For server, a VM cannot exist unless it’s defined in vSphere. Hence inventory has become embedded.
• For network, instead of many physical switches, you will have just a few distributed virtual switch spanning the entire datacenter.
• Physical server (the ESXi host) inventory becomes less relevant as it has no data.
• In fact, the entire hypervisor can also be streamed (not installed), making the entire box is just a physical box with no identity when
powered off.
• Host Profiles provides a mechanism to ensure each ESXi match the master image
 vCloud Suite provides a built-in inventory of all Servers, Network and Storage
• vSphere maintains the inventory. VCM tracks the changes in this inventory.
• SRM maintains the inventory + with its associated DR scripts.
• VIN maintains the map of application dependancy among VM. This list is integrated to SRM to enable a services-based
• vShield Manager defines all the firewall rules.
• vCNS Service Insertion Framework ties functionalities that are otherwise not integrated. While this does not directly
answer the inventory management, the inventory is not completely independent as a result.
• Plug-in from hardware vendors provide single-pane of glass from vSphere UI.
• vCenter Operations provide a continuous compliance via its Compliance Badge and VCM module.
Technology Refresh Plan
 vCloud Suite normally follow a yearly update, typically after VMworld.
• This is more frequent than traditional softwares, which goes for 2-4 year upgrade cycle.
 Hypervisor is a low level software, directly interfacing with hardware.
• Upgrading hypervisor is best done with new hardware purchase. This turns Upgrade to Migrate, significantly
reduce the risk.
 A virtual DC needs to have a formal refresh strategy
• Unlike physical DC, a virtual DC changes more frequently.
• For a large farm (>1000s VMs per datacenter), how many vCloud Suite versions do you allow
• A good practice is to decouple the hypervisor and the management component.
• ESXi should get upgraded together with hardware. So it tends to have 3-4 year upgrade cycle.
• vCenter, and its associated plugins/extension, should get upgraded yearly. This is because it cannot manage a newer ESXi
vSphere life cycle
 New hardware support window: 1.5 years.
 Non-critical bug fixes: 2 years (from a Major release, not Minor)
Change Management
Change Management
 Change Management needs to recognise that virtual world has different changes
• Changes happen much faster and more frequent in the virtual world. This is because the datacenter itself is
defined in software. This is good, as that means IT can support business better. Business normally see IT has a
barrier and road block, slowing business instead of powering it.
• Making it rigid with a blindly applied CM can defeat the purpose of virtualisation.
• ESXi host change is not the same as physical server change as there is no downtime to the business.
• Storage upgrade is very different if the storage is virtualised (read: distributed)
• The storage “array” is now distributed to >1 box, in many cases it extends beyond 8 boxes.
• vMotion, DRS, Storage DRS should not be considered a change request,
• This needs to be discussed in greater depth. Do we allow a 1 TB storage vMotion as it will have impact on the Storage
Network and Storage IOPS, not to mention performance to the up-stream application.
 Virtualisation can reduce the risk as the change itself can be virtual.
• E.g. patching Windows can be done safely if it’s done virtually. This is because it does not change the Guest OS
at all. Virtual patch is technically closing the vulnerability via the hypervisor as the hypervisor sees everything.
Capacity Management
 Capacity management changes in the virtual world
• This is due to the sharing nature and dynamic movement
• As datacenter services move toward software, they are taking up resources like any other workload. This new
type of workload needs to be accounted for in capacity management.
 Example of datacenter services that run on every ESXi host
• Security Services
• Availability Services
 Management softwares should be separated into a cluster by itself
• Hence it is not part of the Capacity available to business workload
Capacity Management at Infra-Level: IaaS “workload”
VM Snapshot, cloning, hibernate
Virtual Storage
(e.g. VMware VSA, Distributed Storage, Nutanix)
vSphere Replication
Edge of Network services (e.g. vShield Edge, F5)
DR Test or Actual (with SRM)
HA and cluster maintenance
Fault Tolerant (VM)
vMotion, Storage vMotion
Backup (e.g. VDP Advance, VADP) with dedupe
- Back up vendor can provide the info
- For VDP-Advance, you can see the actual in vSphere or
vCenter Operations
Hypervisor-based AV, IDS, IPS, FW
Proper Certification
 Virtualisation require deep expertise
• The inter-connected nature of virtualised datacenter means the skills have to be both deep and broad.
• Most of the advanced certifications from VMware cover much broader, non-virtualisation related aspects of
design, imlementation, decision impact analysis, application interdependencies.
• Considering having advanced certification (e.g. VMware VCAP-DCD, VCAP-DCA) in your team
• Once you are >75% virtualised, it is safe to say that the virtual platform is the largest platform in the datacenter. It is a
mission critical platform.
• >95% of support cases handled by Mission Critical Support team in VMware Global Support is not about
VMware. It is about storage, network, hardware, drivers, etc.
End User Computing
End User Computing
 Virtualisation enables control
• It’s hard to lock a physical laptop. Much easier to control a VM living on the physical laptop. With 4 GB RAM and SSD
storage, running a VM on top of the base OS no longer impacts productivity.
• This also address BYOD. Employee can bring their Mac or Sony or Ubuntu, and running OS that is not supported by IT.
IT only supports the Corporate VM, which can be controlled centrally. The separation of personal and work
environment can in fact improve productivity as users want to have complete freedom when they are on their nonwork environment. They do not want that their surfing Internet on their private time is monitored by corporate
• For desktop, it’s easier to control a VDI.
• The Windows images will not deviate from the standard build as there is only 1 image actually deployed.
 The Wireless LAN should not be part of corporate intranet.
• Typical solution is to deploy SSL VPN. This connects the remote device to the corporate network, hence exposing the
corporate network.
• VMware Horizon Suite helps by not using SSL or VPN, and not linking the end device to the corporate network. Access
from device will be limited to View Security Server, which only speaks PCoIP and HTTPS protocol.
• PCoIP does not carry data. It purely sends pixels. And it’s encrypted.
• All devices have View client or just HTML 5 capable browser
• All other applications in the end device continue to be separated from corporate network. Only the View client is connected, and it only
receive pixel.
 Every Internet services is considered insecure.
• DropBox, LinkedIn, FaceBook, and other on-line collaboration improves employees productivity, and they are
becoming necessary as their own customers demand that.
• VMware Horizon Suite provides an IT-controlled dropbox-like file sharing.
Thank You
Next few slides are optional, just-in-case slides for discussion.
© 2010 VMware Inc. All rights reserved
TRM framework needs to be virtualisation-aware
 The TRM framework should be aware of virtualisation
• “System security, reliability, resiliency and recoverability” changed in the virtual world.
• E.g. the typical HA cluster solution the physical world does not apply in the virtual world. The typical DR solution does not
apply too.
• Roles and Responsibility changes are required when operating a virtual datacenter, as Infrastructure is
transformed from “system builder” to “service provider”. Virtualisation Center of Excellence has to be
established and their scope include the TRM framework.
Scope for PCI Audits Now Includes Virtual Infrastructure
 Virtual machines
 Virtual switches/routers
 Virtual appliances
 Virtual applications/desktops
 Hypervisors
The PCI DSS security requirements apply to all system components. In the context of PCI DSS, “system components” are
defined as any network component, server, or application that is included in or connected to the cardholder data environment.
“System components” also include any virtualization components such as virtual machines, virtual switches/routers, virtual
appliances, virtual applications/desktops, and hypervisors.
PCI DSS v2.0 p10
A sample Reference Architecture for Retail PCI
 Inflexible
 Longer lead times to
deploy and scale
Physical device sprawl
Manual provisioning
Fragmented management
Not extensible
Under utilization of
Layered virtual firewalls provides segmentation and isolation
Store Zone 1
Store Zone 2
Distributed Firewall Benefits
Internet /
MPLS / Etc
Non-PCI Zone
PCI Zone
Avoid perimeter chokepoint using
distributed firewall
Eliminate traffic U-turns (trombone),
which impact latency.
Simplify network design using a flat

similar documents