High Availability and Disaster Recovery Solutions


1 High Availability and Disaster Recovery Solutions
Jose Mª Quesada, Presales Consultant, Symantec

2 Concepts
(Timeline graphic: weeks, days, hours, minutes and seconds on either side of the outage, marking the recovery point and the recovery time.) Recovery Point Objective (RPO): the point to which data must be restored; the "acceptable" data loss. Recovery Time Objective (RTO): the time within which the service must be back up; the "acceptable" downtime.
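As a rough illustration of the two metrics, the small Python sketch below derives RPO and RTO from hypothetical timestamps; the values are purely illustrative, not taken from any real incident.

    from datetime import datetime

    # Hypothetical timeline for a failure scenario (all values are illustrative).
    last_consistent_copy = datetime(2005, 2, 13, 3, 0)   # last replicated / backed-up point
    failure_time         = datetime(2005, 2, 13, 3, 45)  # moment the service goes down
    service_restored     = datetime(2005, 2, 13, 5, 15)  # moment the service is back online

    rpo = failure_time - last_consistent_copy   # data written in this window is lost
    rto = service_restored - failure_time       # how long users were without the service

    print(f"RPO (data loss window): {rpo}")     # 0:45:00 -> must stay below the 'acceptable' loss
    print(f"RTO (downtime):         {rto}")     # 1:30:00 -> must stay below the 'acceptable' outage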

3 Technologies for RPO and RTO
(Timeline graphic, from weeks down to seconds: on the recovery-point side, tape backup, periodic replication, asynchronous replication and synchronous replication; on the recovery-time side, restore from tape, manual migration and extended clusters.) How much data can you afford to lose? How much downtime can you tolerate? This slide looks specifically at the RPO, or data loss. There are various strategies and tactics to meet the recovery point objective: tape backup/restore sits at one end of the spectrum, with full synchronous data replication at the other end. The specific technologies (replication, backup, ...) are not the focus of this slide; they are listed only to frame recovery point in context. Stress: you still need tape backups.

4 The high availability curve
(Diagram: availability versus investment, from low-level to high-level SLAs. Storage tier: backup and restore, vaulting, bare-metal restore, online volume management, storage checkpoints, point-in-time copies. The "high availability zone": local/LAN clustering, highly available databases, replication and remote mirroring (asynchronous and synchronous), global/WAN clustering.) The availability index assists customers in achieving availability from their storage and servers, as well as from their applications. As customers move up the availability index, additional technologies can be used to achieve higher levels of availability. The areas that we will focus on in this module, within the high availability zone, are clustering, highly available databases, replication and remote mirroring, and global clustering.

5 Madrid, February 13, 2005. Blackout notes:
Bellagio blackout, April 11-13, 2004. Cost: at least $8M (according to the University of Nevada, Las Vegas). MGM Mirage stock (MGG) down $1.25/share, or 2.7%, to $45.00; with shares outstanding in the millions, the market-value loss would have been that figure multiplied by $1.25. New York City blackout, August 14, 2003. Cost of the New York City blackout: $1.1BN (according to New York City Comptroller William Thompson): $800M in productivity, $250M in perishable goods, $40M in tax revenue, $10M in overtime to police and city workers. A blackout can have more effects than just systems being down: damaged reputation, lost customers, decrease in stock price, dissatisfied customers, lost productivity, lost brand equity.

6 VERITAS solutions portfolio: delivering solutions across the entire curve
(Diagram: the same availability-versus-investment curve, from low-level to high-level SLAs, mapped to products. Backup and vaulting: VERITAS NetBackup™, VERITAS Backup Exec™; bare-metal restore; online volume management and storage checkpoints: VERITAS Storage Foundation™; LAN clustering: VERITAS Cluster Server™; highly available databases: VERITAS Storage Foundation™ for Oracle RAC; synchronous and asynchronous replication: VERITAS Storage Foundation™, Volume Replicator, VERITAS Storage Replicator™; WAN clustering: VERITAS Cluster Server Global Cluster Option™.) The key VERITAS products include VERITAS Cluster Server, VERITAS Storage Foundation for Oracle RAC, VERITAS Volume Manager, Volume Replicator, VERITAS Storage Replicator and, finally, VERITAS Global Cluster Manager.

7 Some perceptions about high availability solutions
It is expensive. It is complex. It is difficult to measure. It is not easy to test. Note: this slide has hyperlinks to the different areas so that you can jump around and focus on the areas the customer wants to hear about. To get back to this slide, simply click on the VERITAS logo on the bottom right-hand side. This can be very useful if you only have a few minutes to give the presentation, because it allows you to skip directly to the one or two issues the customer is most interested in. INTRODUCTION: when availability is discussed within many organizations, several thoughts seem to go through our customers' minds. Most misconceptions can be boiled down to these four areas. PROBLEM. It's expensive: you have to have idle hardware, double the amount of server capacity per application, duplicate systems, duplicate sites, the same operating system, and so on. This means that achieving high availability is expensive. It's complex: achieving HA could itself cause downtime; setting up HA is difficult and complex, and every time you add a new application, OS or server it is like starting from scratch. Not to mention that most competitors require weeks' worth of consulting dollars to get a single two-node cluster up and running. It's difficult to measure: once you have HA up and running, you have no idea whether it is really working or really meeting the SLAs you have established for the business. It's hard to test: there is no way to test the environment without stopping production completely. Now we will address these common misconceptions one by one and then talk about how HA relates to an overall DR strategy. Transition: the first misconception that we should address is "It's expensive".

8 Solutions: a view of the different alternatives and how they fit into different environments

9 What is a cluster? A collection of multiple independent systems working together under a single management framework to increase service availability. (Diagram: application, nodes, storage, inter-node connections.) Can my application be clustered? The application must be crash tolerant: it can restart to a known point after a single server crash, all data needed to restart is stored on disk, and it does not require a clean shutdown to come back up properly. There is no need for "cluster-aware" applications, simply well-behaved applications. Most enterprise applications are completely cluster capable!
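A minimal sketch of the start/stop/monitor contract that makes an application "well behaved" enough to cluster. The class name, paths and probe commands are hypothetical placeholders, not the actual VCS agent interface.

    import subprocess

    class AppAgent:
        """Generic agent: the cluster framework only needs start, stop and monitor."""
        def __init__(self, start_cmd, stop_cmd, probe_cmd):
            self.start_cmd, self.stop_cmd, self.probe_cmd = start_cmd, stop_cmd, probe_cmd

        def start(self):
            # The application must restart to a known point using only data stored on disk.
            subprocess.run(self.start_cmd, shell=True, check=True)

        def stop(self):
            subprocess.run(self.stop_cmd, shell=True, check=True)

        def monitor(self) -> bool:
            # True while the service answers its health probe.
            return subprocess.run(self.probe_cmd, shell=True).returncode == 0

    # Hypothetical example: any well-behaved application fits this contract.
    web = AppAgent("/opt/myapp/bin/start", "/opt/myapp/bin/stop",
                   "/opt/myapp/bin/probe")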

10 Why VERITAS Cluster?
Designed for anything from one to 32 nodes in a cluster. Advanced management of load distribution and applications. Identical capabilities and management across different platforms.

11 What others do, and why it is not recommended
A two-node cluster = twice the systems = 100% cost redundancy. VERITAS provides application availability. Cost of availability = spares / servers (for example, one spare protecting eight servers is 12.5%).

12 Review of High Availability Configurations
N+1 failover: one dedicated spare. Cost: low. Performance: high. Complexity: low. Availability: high. Advanced capabilities: hot software upgrades, hardware upgrades, simplified management (GUI/Web/CLI), cluster zones to isolate development from production. N-to-N failover: spare capacity spread across the cluster. Cost: low. Performance: high. Complexity: medium. Availability: high. Simple toolset. Rolling server upgrades. Native large-cluster support with policy-based control. Add or remove cluster nodes on the fly. Simplified cluster management (Web/Java/CLI). Extensive application support: applications do not need to be cluster aware, and application control is simple to develop. Identical product across all common operating systems, so the same operator skills and training build cost-effective HA on Unix, Windows and Linux.

13 Clustering: Application Failover
The application must be "clusterable". Identify the fault, isolate the problem, recover quickly: perform a controlled shutdown of the resources on the failed system (if possible), transfer ownership to the secondary server, and bring the resources up on the secondary server. (Diagram: SAP on Cluster Server / Storage Foundation failing over between nodes.) The application must be "clusterable": it must have a defined start/stop procedure, a method of being monitored and the capability to restart to a known state, and it must write all of its data to disk. Colorado company: an engineering group had 10 applications on one system, and the main application was having trouble failing over. We asked the manager how the application was licensed: with a dongle. Out of luck; it can't be clustered! Simple failover example (build slide). Basic failover requirements: at least two servers, network connections, mirrored (or RAID) disks, application portability.
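The failover sequence above can be summarized in a short Python sketch; the node and service objects and their methods are hypothetical placeholders, not VERITAS interfaces.

    def failover(service, primary, secondary):
        """Sketch of the slide's failover steps; 'service' is an AppAgent-like object."""
        # 1. Identify the fault and isolate the problem.
        if service.monitor():
            return  # application is healthy, nothing to do

        # 2. Controlled shutdown of the resources on the failed node (if still possible).
        try:
            primary.offline(service)          # stop the app, unmount storage, release the IP
        except Exception:
            pass                              # the node may already be dead

        # 3. Transfer ownership of shared storage and network identity to the secondary.
        secondary.import_diskgroup(service)
        secondary.bring_up_ip(service)

        # 4. Bring the resources online on the secondary server, in dependency order.
        secondary.online(service)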

14 Symantec Consulting can cluster "any" application
Application agents: Apache, BEA Tuxedo, BEA WebLogic, Cisco CTM, EMC SRDF, FileNet Process Engine, Hitachi TrueCopy, IBM DB2, IBM HTTP Server, IBM Lotus Notes, IBM WebSphere MQ, IBM WebSphere Suite, Informix, Microsoft Exchange, Microsoft IIS (Internet Information Server), Microsoft SQL Server, MySQL, NFS Mount, Nameswitch, NetApp, Oracle, Oracle Applications 11i, PeopleSoft, SAP R/3, Secureway Directory Server, ServPoint NAS for Oracle, Siebel, Sun iPlanet, Sybase, VERITAS NetBackup, VERITAS Traffic Director, VERITAS Volume Replicator (VVR), Voyant ReadiVoice.

15 Architectures: a view of the different alternatives and how they fit into different environments

16 Local Cluster: Environment, Advantages, Disadvantages
Environment: a cluster of n nodes located in a single data center; redundant servers, networks and storage for the applications. Advantages: minimizes application downtime; removes single points of failure (SPOFs) through redundancy; allows application migration. Disadvantage: the data center itself can be a single point of failure. Local clustering from VERITAS is the first architectural clustering step to high availability. Local clustering protects against hardware, application or database faults (unplanned downtime) while also providing availability during server maintenance and application or database upgrades. VERITAS provides the unique ability to scale from 2 to 32 nodes in a single cluster, which also includes failover to and from domains within a server. This architecture can be applied to any application or database that you [customer] feel can't afford any downtime, and it is applicable to all 5 major operating systems: Solaris, HP-UX, AIX, Windows and Linux. Clearly, the disadvantage is that there is no protection against a site failure such as a fire or flood, but this would affect a company regardless of whether a cluster is deployed or not; all services would be unavailable without protection beyond local high availability. Note to speaker: there is a very good description of the environment below if you require more information on this architecture. This includes failover behavior, when to position this architecture, the Recovery Point Objective and Recovery Time Objective, the VERITAS products that provide this solution, and a customer scenario. Definition: a single VCS cluster consists of multiple systems connected in various combinations to shared storage devices. VCS monitors and controls applications running in the cluster, and can restart applications in response to a variety of hardware or software faults. A cluster is defined as all systems with the same cluster ID, connected via a set of redundant heartbeat networks. This solution provides local recovery of Windows servers in the event of application, OS or hardware failure at a single site. It also minimizes planned and unplanned application downtime. Planned downtime: if an application, database or server requires upgrades or maintenance, clustering is essential to maintain high availability to the users during these periods. Unplanned downtime: in the event of an application, database or server fault, the services running on the server are failed over to another server to avoid long periods of downtime and maintain a quick recovery time objective. Local clustering, also known as shared-storage clustering, is considered second-generation clustering and is today the most prevalent way of providing HA through application and database failover. Environment · A redundant server, network and storage architecture for application and data availability through the linking of multiple servers with shared storage · Systems are linked with private heartbeats, usually Ethernet, which they use to communicate state; VCS uses a fast proprietary protocol, GAB/LLT, to communicate status · Each system in the cluster can access the storage of any other system · There is no replication or mirroring of data, as opposed to a shared-nothing or stretch cluster · A SAN facilitates larger clusters (more than 2 nodes) and is typically present in all clusters, i.e., switches or hubs are used · All cluster components (servers, SAN infrastructure, storage) are co-located on a single site.
· All servers in the cluster are in a single location (single datacenter) Advantages · Applications recover using data on shared storage (zero data loss) · Minimal risk of split brain (mistaken server failure due to heartbeat link failure) · Minimal downtime for applications and databases (automated failover) · Optimal for server consolidation (N+1 failover scenarios) · Quick recovery time objective to meet stringent service level agreements and high availability Disadvantages · The data center or site can be a single point of failure in a disaster Failover Behavior In the event of an application, database or server fault, VCS will bring down the faulted application or database in dependency order and bring those services up on another server in the cluster in dependency order. All servers are local (within the same datacenter) and share storage. Since the servers are accessing the same storage, the data that the faulted server was accessing is available to the server that restored the services. When do I position Local Clustering? 1. Do you have clustering implemented already in your datacenter? 2. Do you have specific applications that require high availability and cannot have long periods of downtime? 3. Are you frustrated with long application or database outages? 4. Would you like to avoid application or database downtime during planned outages? 5. Are you unable to meet your recovery time objective upon an application, database or server fault? What you are listening for is long downtime due to both planned and unplanned application and database outages. IT administrators may have specific applications that require high availability and are looking for a solution that is simple and contained within a single building (datacenter). It is important to note that the entire datacenter may not need high availability, so be sure to ask which applications require minimal downtime. VERITAS products for this solution · VERITAS Volume Manager · VERITAS Cluster Server You can also position NBU or BE. RPO/RTO Facts Recovery Point Objective: at what point can the data be restored? In this configuration the cluster is using shared storage and therefore, upon a failure, the other servers in the cluster will always have access to the data instantaneously. Recovery Time Objective: clustering reduces the recovery time objective because it removes the human intervention of first detecting a fault and then taking the appropriate action of bringing the application down and bringing it up on another server. Automating this process ensures that in the event of a fault, the move of an application or database to another server is quick and accurate. Cost Comparison Implementing a local HA solution involving any clustering technology does carry cost, in that the IT administrator will have to learn a new technology and afford another server to fail over to. In general, though, the cost of downtime can far outweigh the cost of implementing a new technology that can significantly reduce both planned and unplanned downtime. Customer Scenario (this is a Windows environment, but this architecture is applicable to any major operating system: Solaris, HP-UX, Linux, Windows and AIX) ICON Clinical ICON Clinical is the leader in providing the pharmaceutical and biotechnology industries with exceptional clinical research and biometric services worldwide.
These services include: · Clinical trial support for all phases of drug delivery · Documenting all patient/doctor/drug information for the trial with their custom application system. Patients involved in the trial disclose information such as symptoms and daily health either through the ICON Clinical phone system or through a web interface. This information is a critical component for passing Phase 3 clinical trials. Datacenter Information · The data center supports 23 offices in 14 countries on 5 continents, with over 1500 employees. Revenue Information Net revenues increased 36% on a year-to-date basis. $67 million of net new business was awarded to ICON during Q2FY03. Problem Statement · Needed to achieve 99.9% system availability for the Microsoft SQL 2000 database and custom application. · Deliver 24x7 service to doctors and patients involved in the clinical trials. · Required a solution that would provide high availability and disaster recovery for hardware and software already in place. Description of the Local HA Environment Operating System: Windows. Servers: 2 Microsoft servers, 1 cluster. Storage: Compaq MSA1000. Applications: Microsoft SQL 2000, custom application. Total storage: 300 GB. VERITAS Products · NetBackup · Volume Manager · Cluster Server · Volume Replicator · Global Cluster Manager Success Clustering exceeded their expectations: · Required a local high availability solution in the event of an application or server outage; met their 99.9% system availability requirements · Needed a solution that could use their current hardware and software investment (Windows Server and Microsoft SQL 2000 Standard Edition) · Easy to deploy and manage with a GUI interface.

17 Clustering as part of disaster recovery
Clustering technology automates bringing services up on a node: storage, application, network. The failover concept, extended to multiple locations, becomes an automated disaster recovery solution. The same tasks performed in the local data center are performed at the remote site in an automated way. It then becomes purely a problem of having the same data at both the local and the remote site.

18 Metropolitan Area Disaster Recovery with Remote Mirroring
Environment: a single cluster, with servers spread across multiple data centers forming one cluster; distance limited by the fibre infrastructure (storage performance). Advantages: protection against local disasters; servers can be distributed across several data centers; fast recovery; reuse of existing infrastructure; synchronous mirroring with the VxVM platform; roughly a 5% performance loss versus a local mirror (80 km). Disadvantages: cost (requires an extended SAN infrastructure); distance limitation. It is common for companies to have fibre between buildings. Fibre may be implemented to provide telephone or emergency services between sites but not necessarily utilized for IT. This architecture takes full advantage of your million-dollar investment in fibre and provides a level of disaster recovery at minimal expense. Inherent within VERITAS Volume Manager, and at no extra expense, you can mirror the data at the primary site to the secondary site. Since this is synchronous, there is no data loss. There is no requirement for a replication technology, nor is there a requirement to mirror between specific hardware arrays. Upon a server, application or database fault, VERITAS Cluster Server will first attempt to fail over the application locally. If the servers located at the primary site are not acknowledging the transmission, all services at the primary site will fail over to the secondary site. The configuration is flexible, so you can deploy a given number of servers at the primary site and more or fewer servers at the secondary site. Additionally, VCS provides the flexibility to run in an active/active or an active/passive state; the choice is left with you! Deploying metropolitan area disaster recovery is sufficient for most companies, as 95% of disasters are localized (fire, flood, power outages, etc.). The New York Board of Trade, during the disaster on September 11, 2001 (considered one of the most horrific disasters in US history), was up and running the next day with its secondary datacenter located just 5 miles from the World Trade Center. Clearly the disadvantage is distance and the fibre infrastructure. The distance limitation is purely due to mirroring data between the two sites synchronously; this is not a limitation of VERITAS software but a consequence of the inherent nature of running synchronously. Note to speaker: performance results will be published on VAN and VNET. If you have any questions, please contact Michelle Mol. You should already know whether the customer has fibre or not; if you don't know, ask before speaking to this slide. Also, there is a very good description of the environment below if you require more information on campus clustering. This includes failover behavior, when to position this architecture, what questions to ask when positioning it, the Recovery Point Objective and Recovery Time Objective, the VERITAS products that provide this solution, and a customer scenario. Definition: stretch/campus clustering is a single cluster that stretches over two sites, using fibre connectivity for data mirroring and cluster communication. This architecture typically gets deployed when customers want DR over short distances and have a SAN infrastructure in place. Many VERITAS customers in the Wall Street area have set up campus clusters with VM mirroring to separate their data centers over several miles, thus providing DR against such disasters as terrorist attacks.
This would not provide long-distance DR against a natural disaster such as an earthquake. Characteristics include: · A single VCS cluster spanning multiple locations · Can have multiple VCS nodes at each site (2 sites maximum) · Uses VxVM to create a mirror with plexes in two locations · No host or array replication involved · With new data switches using DWDM, support for distances of up to 100 km has been claimed; VCS is testing with some of these · Requires Professional Services to implement; the separation range depends on the infrastructure provider. Environment · The cluster stretches over multiple buildings, data centers or sites connected via Fibre Channel (SAN), with up to 32 nodes per cluster · Local storage is mirrored between cluster nodes at each location · One cluster: servers located in multiple sites are part of the same cluster Advantages · Protection against disasters local to a building, data center or site. An example of this is the NY Board of Trade: they had one site in the WTC on 9/11 and another site 10 miles away, and because they were stretching their datacenter they were up and running the next day. · Cost-effective, simple solution: no need for replication (zero data loss thanks to remote mirroring, which synchronously copies data to both sides) · Minimal downtime for applications and databases (automated or manual failover) · Leverages existing SAN infrastructure · Included with Volume Manager and Cluster Server at no additional cost Disadvantages · Cost: requires SAN infrastructure (fibre) · Distance limitations based on storage: the ability to mirror storage with adequate performance · Limited to 2 sites Failover Behavior Example: if 3 servers are located in Building A and another 2 servers are located in Building B, upon the failure of one server in Building A, VCS will attempt to fail over the application to another server in Building A. If Building A is down, all services will be failed over to Building B. The data is already at Building B, since Volume Manager is using remote mirroring to get the data to the other site. When do I position Campus Clustering? Ask your customers the following questions: 1. Do you have fibre? 2. What is the distance between the two sites? 3. Would you like to maximize your million-dollar investment in fibre by gaining a level of disaster recovery? What you are listening for is to be sure the customer has fibre before you talk about this solution. The customer is most likely spending money on fibre for other reasons, such as phone and alarm services between buildings. By using VERITAS Volume Manager, FlashSnap and VERITAS Cluster Server (at no extra cost; this is what the product provides today), the customer can have a level of disaster recovery for their systems without spending more money on software or hardware. Data points on this can be found on VNET under Cluster Server and Volume Manager. RTO/RPO Facts Recovery Point Objective: at what point can the data be restored in a campus-clustering configuration? The data is synchronously mirrored between the two sites using VERITAS Volume Manager and therefore, upon a site failure at the primary location, an exact copy of the data can be found at the secondary site, which should be no more than a limited distance away. Recovery Time Objective: at what point can the application or database be back online?
What makes this architecture so appealing is that the recovery time objective is quick and it provides a level of protection against local disasters. In the event of a local disaster (a single-building fire, flood, etc.), all of the services (the applications, databases and data) are failed over from that site to the other building, which is not experiencing the disaster. If a customer deploys only local clustering and experiences a building disaster, the recovery time objective could go from seconds/minutes when deploying a campus cluster to days/weeks when deploying only a local cluster. Consider the time necessary to deploy new servers, upload the services and mount the data, coupled with the expense of idle users and lost productivity. Clearly there is an advantage to deploying campus clusters if the infrastructure is in place. Cost Comparison It is unlikely that a customer will purchase fibre just to implement a campus clustering architecture; the target customers will already have fibre in place. Implementing this solution can maximize the investment made in the networking infrastructure while providing a level of disaster recovery. The cost of recovering from a local disaster can be unaffordable, even resulting in lost business, because of the amount of time it would take to set up the configuration at another site. Since this architecture involves just deploying VERITAS Cluster Server and VERITAS Volume Manager, it is a relatively inexpensive disaster recovery solution that meets most disaster recovery needs. Customer Scenario The Wellcome Trust The Wellcome Trust is an independent research-funding charity, established under the will of Sir Henry Wellcome. It is funded from a private endowment, which is managed with long-term stability and growth in mind. Its mission is 'to foster and promote research with the aim of improving human and animal health'. To this end, it supports 'blue skies' research and applied clinical research. It also encourages the exploitation of research findings for medical benefit. Problem Statement · The Wellcome Trust manages £4bn in funds, so the company needed business continuity even in the event of a building failure. · The company wanted to maximize its infrastructure investment and deploy a level of disaster recovery utilizing its current network investment and company properties. Description of the Campus Clustering Environment · Number of servers in the cluster: 6 nodes; the datacenters contain 50+ servers (2 datacenters, one at each site) · Servers: Compaq 100 · Storage: Compaq MSA1000. Applications: Microsoft SQL, Microsoft Exchange, custom application. Distance · ~500 meters VERITAS Products · Volume Manager · Cluster Server · Cluster Server agent for Microsoft SQL · Cluster Server agent for Microsoft Exchange Success Utilizing Cluster Server and the remote mirroring features bundled in VERITAS Volume Manager, the customer was able to achieve local high availability as well as a level of disaster recovery without added costs. The customer had already invested in a fibre infrastructure for reasons outside the scope of the datacenter needs and maximized that networking investment by deploying a level of disaster recovery. This configuration was especially useful when preparing for power outages.

19 …and does it work? Test environment
UNIX servers, SAN storage, TPC-C test package, Brocade switches, Ciena ONLINE Edge CWDM equipment, fiber spools in 20 km increments. Test methodology: test local, non-mirrored storage with various SGA sizes; identical runs with mirrored storage; identical runs with mirrored storage at various distances (20/40/80 km). Configurations compared: local, local mirror, 80 km extended mirror. Performance testing completed December 2003.

20 Metro DR with Remote Mirroring Proof
Remote mirroring has a very minimal performance impact: about 2% to 6% compared to local mirroring. And within our distance range, the degradation does not grow linearly with distance, as might be expected.

21 5.0 Campus Cluster – NEW!! Cross-site growth
Previous versions would allow a plex to be grown across arrays (think: sites), effectively reducing availability and creating unnecessary overhead in the SAN infrastructure. This regularly happens at large customers and usually results in unplanned downtime! The read policy needs tweaking to read from the "local" array to prevent ISL overload. Detaching/reattaching is done on a per-disk basis, and the decision is not made from an application-dataset standpoint.

22 5.0 Prevents “Crossed-Grows”
(Diagram: Site 1 mirror and Site 2 mirror, each built from that site's array and LUNs.) With 5.0, VxVM can ensure each mirror stays on one site when a mirrored volume is grown: storage is assigned to a site, and VxVM is instructed to mirror across sites. Can an administrator do the same thing by hand? Sure. Consistently? Maybe not. Symantec has seen two large customers do this within the past eighteen months, one in the Americas and one in Europe. When the time came for a site failover, their failover datacenter couldn't recover. You're spending millions of dollars for a second datacenter to achieve availability; shouldn't you use tools that align with that goal?

23 5.0 Prevents “Crossed-Grows”
(Diagram only: Site 1 and Site 2 mirrors with their respective arrays and LUNs.)

24 Preferred Plex reduces Bandwidth
Write to all mirrors; read from the local mirror. 5.0: mirrors are assigned to sites, so failover is automatic. 5.0: multiple preferred mirrors. The default behavior of VxVM is to read from all mirrors, in order to spread reads across more spindles. Administrators can specify an alternate behavior where reads come from a specific mirror, to compensate for performance (or bandwidth cost) differences among mirrors. For consistency, writes must go to all mirrors. (Note that in the animation only data flows are shown, not the acknowledgements or the read requests.) But reads can be optimized: reads can come from only the local mirror. That saves bandwidth and, more importantly, bandwidth costs. (The customer's cost structure matters here; if they have pulled their own cable, the cost is sunk.) In 5.0 there are two enhancements. First, each mirror is assigned to a site, so when a server or application fails over to the remote site, the local mirrors become the preferred plexes. Second, whereas today you can have only one preferred mirror, 5.0 lets you read from more than one local mirror, improving performance. (Diagram: Site 1 mirror, Site 2 mirror.)
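A toy Python model of the policy described here: every write goes to all mirrors, while reads are served only from the mirror local to the requesting site. The site names and sizes are illustrative only.

    class MirroredVolume:
        def __init__(self, mirrors):
            # mirrors: dict mapping a site name to that site's copy (a bytearray here)
            self.mirrors = mirrors

        def write(self, offset, data):
            # For consistency, every write must reach the mirrors on both sites.
            for copy in self.mirrors.values():
                copy[offset:offset + len(data)] = data

        def read(self, offset, length, local_site):
            # Preferred-plex behaviour: read only from the local mirror,
            # so no read traffic crosses the inter-site links.
            return bytes(self.mirrors[local_site][offset:offset + length])

    vol = MirroredVolume({"site1": bytearray(1024), "site2": bytearray(1024)})
    vol.write(0, b"payroll batch")
    assert vol.read(0, 13, local_site="site2") == b"payroll batch"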

25 VERITAS Storage Foundation
Cost savings: supports ANY storage (EMC, HDS, Sun, IBM, …); mirrors data over Fibre Channel; no specialized network gear is required. Protection: no data-corruption window; full database support (in all modes). Management: storage management and mirroring with the same technology; available on Solaris, HP-UX, AIX, Linux and Windows. Supports ANY storage (EMC, HDS, Sun, MTI, STK, IBM…): while most VVR customers are using EMC hardware, we actually have several customers replicating from, for example, Sun storage to an IBM Shark array. Shared network support: VVR can replicate over a shared network; as long as it is an IP network with sufficient bandwidth to handle the traffic, we can replicate over it. No specialized network gear required: VVR simply replicates over an IP network using either TCP or UDP (user selectable). Data consistency in async: we do not use a "track copy" asynchronous mode, which does not preserve data consistency during replication. Instead, VVR offers real-time, write-ordered asynchronous replication. This delivers near real-time data (usually within milliseconds) in a fully consistent manner. Supports ANY DBMS or FS in both sync and async modes: we support Oracle, Sybase, DB2, … anything that VM would support. After all, VVR literally is VM. No distance limitations: nothing in our documentation requires extra hardware for long-distance replication. As long as there is sufficient bandwidth on an IP network, we can reliably replicate over it. Customer example: Northern California to Singapore. For that distance, hardware replication would require channel extenders, which would cost $250,000 each (two needed), and it would have to run in synchronous mode, where you would suffer a performance hit as well. Maintains ALL VxVM online management: adding VVR does not impede the capabilities of VM; all VxVM online management can be performed on replicated volumes. Any storage layout (with VxVM): primary and secondary storage layouts do not need to match. Even the physical layout and configuration of the storage hardware need not match; as long as we have enough storage at the secondary, we can replicate to it. FastFailback: tracks changes between primary and secondary so that failover and failback can occur even in graceful and non-graceful migration scenarios. This eliminates the need to perform a complete re-synchronization once the initial baseline between primary and secondary has been established. Initialization options: we can send data over the wire, in band, to "initialize" the secondary, or simply take a backup of the primary and ship the tape to the secondary; VVR then quickly synchronizes the delta between the backup and the current primary. This tape-assisted initialization is not available with SRDF. Online replication mode switch: for customers that choose synchronous mode, the default behavior is for the product to automatically fall back into async mode if the network (or the secondary host) goes down. Then, when the network comes back up, VVR consistently drains the log and snaps back into full synchronous mode once fully drained. This is a nice feature which, again, takes place without the need for user intervention. We see this as essential for doing synchronous replication over WAN environments (competitors do not offer this and actually require redundant links to ensure maximum network reliability).
Others… Replicate among similar or dissimilar SAN architectures: VVR enables SAN-to-SAN replication over IP. Can replicate volumes that span storage arrays: if the database spans a storage array, VVR can reliably replicate it (unlike an array-based hardware replication solution). Scales to 32 locations: VVR supports many primaries or many secondaries, many-to-one and one-to-many; hardware replication only supports one-to-one (albeit from an array-to-array perspective).

26 Metropolitan Area Disaster Recovery with Replication
Environment: a single cluster, with servers spread across multiple data centers forming one cluster; local storage replicated (VERITAS Volume Replicator, or hardware-based replication); synchronous replication. Advantages: cost (no specialized infrastructure required); protection against local disasters. Disadvantages: distance limitations*; lower performance than a mirror; resynchronization after an outage. THIS IS DRAMATICALLY DIFFERENT FROM version 3.5! It is a much better solution with 4.0. To provide metropolitan area clustering to companies that don't have fibre, VERITAS has introduced an architecture called Metro DR with Replication (previously referred to as a Replicated Data Cluster, or RDC). Rather than using remote mirroring to copy the data from one site to the other, this architecture uses replication technology over IP. Replication in this configuration is always synchronous. Upon a server, application or database fault, the services running at the primary site fail over to the secondary site automatically. This is an ideal configuration if your datacenter requires a level of disaster recovery but the infrastructure is limited to IP and another building within 100 miles. This architecture, like the other architectures previously described, is extremely flexible, supporting dissimilar storage arrays (replicating from, for example, an EMC SRDF array to a Hitachi TrueCopy array) as well as disparate servers (such as a Sun 4800 and a Sun E10K). Note to speaker: you should already know whether the customer has fibre or not; if you don't know, ask before speaking to this slide. Also, there is a very good description of the environment below if you require more information on this architecture. This includes failover behavior, when to position this architecture, what questions to ask when positioning it, the Recovery Point Objective and Recovery Time Objective, the VERITAS products that provide this solution, and a customer scenario.

27 Bunker Replication: RPO of Zero over any Distance
Traditional approach: 5X storage requirement, storage hardware lock-in, cascaded (more dependencies), heavyweight bandwidth requirements. (Diagram: primary site, bunker site, secondary site.) Veritas Bunker Replication approach: reduces storage requirements, reduced bandwidth requirements, zero RPO over any distance, little or no application impact. Bunker Replication is unique in the industry. It gives you a Recovery Point Objective of zero regardless of distance: you effectively replicate your application synchronously over any distance. Controller-based bunker replication requires 5x the storage; this approach gives you the ability to choose hardware. It replicates to the bunker synchronously, and then asynchronously to the secondary site; write-order fidelity is maintained. In an outage, all bunker changes are sent to the secondary site and then that site is brought live.
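A simplified sketch of the bunker flow: writes are acknowledged only after reaching the nearby bunker log (synchronous), the far site is fed asynchronously, and after an outage the bunker backlog is drained before the secondary goes live. The class and the lag threshold are illustrative, not product behavior.

    from collections import deque

    class BunkerReplication:
        def __init__(self):
            self.bunker_log = deque()   # synchronous copy at short distance, log only
            self.secondary = []         # asynchronous full copy, any distance

        def write(self, block):
            self.bunker_log.append(block)    # primary waits for this ack (sync, RPO = 0)
            # Asynchronously, blocks drain to the secondary in write order.
            if len(self.bunker_log) > 4:     # illustrative replication lag
                self.secondary.append(self.bunker_log.popleft())

        def recover(self):
            # Primary lost: push the remaining bunker backlog, then bring the site live.
            while self.bunker_log:
                self.secondary.append(self.bunker_log.popleft())
            return self.secondary            # complete, write-ordered data set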

28 Wide Area/Global Disaster Recovery
(Diagram: replication over TCP/IP between clusters.) Environment: high availability and disaster recovery needs across geographically dispersed sites. Advantages: can support any distance using the IP network; multiple replication solutions; multiple clusters; local failover before remote failover; single-point monitoring of all cluster environments. Disadvantages: cost of the remote sites; cost of bandwidth. It is important to properly assess your environment to choose the architecture that best fits your needs based on budgets, mission-critical services, service level agreements, and industry requirements/government regulations. Wide Area Disaster Recovery provides the best protection against site failures that affect the surrounding areas; natural incidents that can cause such a disaster are earthquakes, tornadoes, hurricanes, etc. In this environment, the health of applications, databases and services is monitored through a VERITAS protocol called Global Atomic Broadcast (GAB). To replicate data from one site to another, VERITAS provides the flexibility to use VERITAS Volume Replicator, EMC SRDF or Hitachi TrueCopy. There are several benefits to using VERITAS Volume Replicator: you can choose to replicate in either synchronous or asynchronous mode and to replicate from one hardware array to a different one. VERITAS Volume Replicator guarantees write-order fidelity as well as replication over IP without distance limitations. Upon a server, application or database fault, VCS first attempts to fail over services to another node in the cluster at the primary site. If the primary site is not responding, all services, including DNS, are migrated over to the secondary site and the data is mounted; services for the end users resume. The disadvantage of this is the large cost implication of setting up another hot site with hardware and network infrastructure. That cost, however, can be outweighed by the advantages of full protection: a long period of downtime due to a disaster that affects a metropolitan area can certainly be costly and can even be the demise of the business. Additional speaker information about Wide Area Disaster Recovery below. Environment · A local cluster deployed at the primary site and a local cluster deployed at the secondary site(s) · Data is replicated between clusters at each location, with up to 32 nodes per cluster and up to 64 clusters per site Advantages · All the advantages of local clustering · Unlimited distance in both synchronous and asynchronous mode · Protection against disasters local to a building, data center or site · Supports any distance, using IP for cluster-to-cluster communication · Application failover is automated locally, with manual recovery to a remote site · Support for VERITAS and 3rd-party replication solutions · Single point of monitoring and administration; all this can take place with a single command or mouse click. Disadvantages · Cost of servers and storage at the remote site(s) · Potential data loss if a failure occurs while running in asynchronous mode. Customer successes, to name a few (these references can be found on VNET):
BlueStar (Solaris), ICON Clinical (Windows). Other miscellaneous information… Wide-area availability (0 to tens of thousands of km) over a WAN/MAN. Global application objects. Horizontal application scaling and data sharing. Strategic platform coverage. Wide-area failover. Site-wide configuration. Global availability management. Replication integration, management and monitoring, tied to the local HA platform (service groups).

29 Volume Replicator: supports ANY storage (EMC, HDS, Sun, LSI, MTI, STK, IBM…); supports ANY DBMS or FS in sync and async modes; replicates over TCP/IP with NO distance limitations; data consistency in async mode; does not require a dedicated network; scales up to 32 sites; initialization options; online management. VVR's flexibility is also unmatched. Most of these bullets are self-explanatory, but here is some background on each… Maintains ALL VxVM online management: adding VVR does not impede the capabilities of VM; all VxVM online management can be performed on replicated volumes. Any storage layout (with VxVM): primary and secondary storage layouts do not need to match. Even the physical layout and configuration of the storage hardware need not match; as long as we have enough storage at the secondary, we can replicate to it. Supports ANY storage (EMC, HDS, Sun, MTI, STK, IBM…): while most VVR customers are using EMC hardware, we actually have several customers replicating from, for example, Sun storage to an IBM Shark array. Replicate among similar or dissimilar SAN architectures: VVR enables SAN-to-SAN replication over IP. Supports ANY DBMS or FS in both sync and async modes: we support Oracle, Sybase, DB2, … anything that VM would support. After all, VVR literally is VM. NO distance limitations: nothing in our documentation requires extra hardware for long-distance replication. As long as there is sufficient bandwidth on an IP network, we can reliably replicate over it. Data consistency in async: we do not use a "track copy" asynchronous mode, which does not preserve data consistency during replication. Instead, VVR offers real-time, write-ordered asynchronous replication. This delivers near real-time data (usually within milliseconds) in a fully consistent manner. Async override option: for customers that choose synchronous mode, the default behavior is for the product to automatically fall back into async mode if the network (or the secondary host) goes down. Then, when the network comes back up, VVR consistently drains the log and snaps back into full synchronous mode once fully drained. This is a nice feature which, again, takes place without the need for user intervention. We see this as essential for doing synchronous replication over WAN environments (competitors do not offer this and actually require redundant links to ensure maximum network reliability). Initialization options: we can send data over the wire, in band, to "initialize" the secondary, or simply take a backup of the primary and ship the tape to the secondary; VVR then quickly synchronizes the delta between the backup and the current primary. This tape-assisted initialization is not available with SRDF. Can replicate volumes that span storage arrays: if the database spans a storage array, VVR can reliably replicate it (unlike an array-based hardware replication solution). Shared network support: VVR can replicate over a shared network; as long as it is an IP network with sufficient bandwidth to handle the traffic, we can replicate over it. Scales to 32 locations: VVR supports many primaries or many secondaries, many-to-one and one-to-many; hardware replication only supports one-to-one (albeit from an array-to-array perspective).

30 Global Cluster Global Service Group
(Diagram: replication over TCP/IP between two clusters.) A Global Service Group is a service group that spans 2 or more clusters. Each group contains a list of both locally and remotely available nodes. Service group attributes support local failover as the priority model, with wide-area failover as low priority or a custom setup. Replication management framework.
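The priority model just described (fail over locally first, treat the remote cluster as a lower-priority or manual choice) can be sketched as follows; the dictionary layout and field names are hypothetical.

    def choose_failover_target(group, failed_node):
        """group: dict with 'local_nodes', 'remote_nodes' and 'remote_policy'."""
        # Prefer any healthy node in the same cluster.
        for node in group["local_nodes"]:
            if node["name"] != failed_node and node["healthy"]:
                return node, "automatic local failover"
        # Otherwise fall back to the remote cluster according to policy.
        if group["remote_policy"] == "auto":
            for node in group["remote_nodes"]:
                if node["healthy"]:
                    return node, "automatic wide-area failover"
        return None, "wait for operator decision"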

31 VERITAS Cluster Server: Fire Drill
Secondary site: the test starts (via cron on the secondary system). (Diagram resources: Oracle, IP, NIC, mount points, VVR RVG, disk group, cache, space-optimized snapshot; FireDrill = 1; the VVR primary keeps replicating.) A space-optimized snapshot of the replicated data is taken. Volumes and file systems are mounted. The application is started. If the application starts correctly, a notification is sent to the administrator that the test succeeded. If the application fails, the administrator is notified so the problem can be fixed.

32 VERITAS Cluster Server: Fire Drill
Secondary site: the test starts (via cron on the secondary system). A space-optimized snapshot is taken, volumes and file systems are mounted, and the application is started. If the application starts correctly, the administrator is notified that the test succeeded; if it fails, the administrator is notified so the problem can be fixed. Finally, the test is ended and the snapshot is discarded.
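The steps on these two slides amount to the following procedure, sketched in Python with hypothetical snapshot, mount and application objects; it is meant to mirror the cron-driven test on the secondary, not to be VCS code.

    def fire_drill(snapshot, mounts, app, notify):
        """Sketch: test DR readiness on the secondary without touching production."""
        snapshot.create_space_optimized()        # point-in-time copy of the replicated data
        try:
            for m in mounts:
                m.mount()                        # mount volumes and file systems
            app.start()                          # bring the application up on the snapshot
            if app.monitor():
                notify("Fire drill succeeded")
            else:
                notify("Fire drill FAILED: application did not come up; please investigate")
        finally:
            for m in reversed(mounts):
                m.unmount()                      # end the test...
            snapshot.discard()                   # ...and discard the snapshot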

33 Options Summary. Metropolitan Cluster with Remote Mirroring
Requires full SAN/network connectivity. Full automation of the high availability and disaster recovery environment. Ideal for metropolitan areas with dark fiber. Metropolitan Cluster with Replication: ideal where there is no SAN connection between nodes; automates replication control. Wide Area/Global Cluster: full local/metro/wide-area disaster recovery capability; multiple failover choices, with differing priorities. META Group believes the distance between the main data center and the primary and secondary backup data centers should be at least 100 miles. Options: dedicated hot site, dual data centers, hosting, DR service provider. A hardware vendor's "best efforts" are not always sufficient; recovery could take weeks, or longer. Don't overpay unnecessarily… and you still need people.

34 Veritas Cluster Server
Local Clustering, Metropolitan Disaster Recovery (remote mirroring or replication), Wide Area Disaster Recovery. VERITAS sees the benefits of clustering not just as providing local high availability but as going beyond a single site, extending availability based on your [the customer's] infrastructure and requirements. For example, the infrastructure may include fibre between buildings within a metropolitan area; if that infrastructure exists, VERITAS enables a customer to maximize that fibre investment and gain a level of disaster recovery by deploying a campus cluster. The flexibility doesn't stop there: if [customer] doesn't have the luxury of fibre between sites, providing disaster recovery over IP is completely supported and encouraged. Additionally, the architecture is supported regardless of platform and regardless of the application running. An IT administrator may deem a few (or just one) applications mission critical and implement a Metropolitan or Wide Area architecture specifically to keep that environment highly available; VERITAS certainly doesn't require the entire environment to be deployed the same way. Let's examine each of the infrastructures in greater detail and look at the advantages and disadvantages of each. [CLICK to highlight local clustering] Let's look at local high availability. (Diagram labels: remote mirror, SAN attached, fibre; replication, IP, DWDM, ESCON; VERITAS Cluster Server; VERITAS Storage Foundation + Volume Replicator or 3rd-party replication + HA/DR.)

35 VCS 5.0 for VMware 3.0

36 VCS for VMware ESX Current version VCS 2.2 VMware 2.5.2/2.5.3
Limited capability: fails VMs over between ESX console servers; headed for end-of-life. VCS for VMware ESX 5.0 with VMware 3.0: still in development (release early November); adds visibility into the virtual machine; adds DR/GCO support; CMC (hopefully).

37 VCS for VMware Today… Common architecture used today:
VCS to cluster ESX console servers. Run one-node VCS inside VMs for application management. The current customer base is largely Windows based.

38 What VMware 3.0 Doesn’t Have…
No protection against application failure Questions around split-brain protection w/ VMware HA No DR Support

39 Our Value Add… VCS brings proven, enterprise-class HA/DR to the VMware ESX platform. Strong protection from split-brain scenarios. Protection against OS failures (blue screens) and application failures. Monitoring of applications inside the guest without traditional clustering in the guest. Automated Fire Drill DR testing with no impact on production applications. Disaster recovery combining replication and multi-data-center clustering. Centralized, web-based management and reporting for all VCS clusters, across both physical and virtual environments. A standard solution across all hardware and virtual environments.

40 Overview of VCS for VMware 5.0
VCS runs on the ESX console server. A lightweight agent inside the VMs replaces the one-node clusters in the VM. Monitoring of the virtual switch and virtual disks. Virtual FireDrill. Last known good copy (boot images). DR support (replication support initially just EMC MirrorView). Lots of wizards: ease of use is a top priority.

41 VMware Infrastructure 3.0
Key components: ESX Server, VirtualCenter, HA, DRS, Consolidated Backup. VirtualCenter rapidly provisions virtual machines and monitors the performance of physical servers and virtual machines. VMware HA clustering (VMotion for planned maintenance, VMware HA for unplanned outages). VMware DRS (Dynamic Resource Scheduling): workload balancing between ESX Servers. Backup with Consolidated Backup (MS proxy server). Very easy to use.

42 Replication Technologies

43 VERITAS Replication Technologies
(Diagram: remote office data protection; MAN/WAN disaster recovery.) This is the VERITAS replication positioning slide. VSR is to be used to protect remote office data; it should be positioned as an extension to backup. VM and VVR are the tools for replicating data for disaster recovery, because they are extremely robust technologies that will protect all mission-critical data. VM: remote mirroring over Fibre Channel (synchronous only). VVR: replication over IP (synchronous and asynchronous). Remote mirroring over Fibre Channel: Storage Foundation. Backup centralization: CDP and PureDisk. Replication over IP: Volume Replicator.

44 VERITAS Storage Foundation: synchronous mirroring over Fibre Channel
Primary site, secondary site. DEFINITION: remote mirroring is a way of making a copy of the data synchronously over a SAN, or a fibre connection, at a limited distance. (Diagram: application, Volume Manager, SAN fabric.) Let's look at the first of these technologies: logical volume management (as with VM). SAN and other storage networking technologies now allow geographic separation of mirrors, which makes this a viable DR option for some organizations. Volume mirroring over SAN: using storage virtualization software like VM that has mirroring capabilities, you can create redundant images up to the distance limitation of fibre infrastructures (generally 10 km, although this may vary widely depending on who you talk to; I believe Brocade and others are now claiming figures like 100 km with some extra gear). People have been using logical volume management software (mostly from VERITAS, given our 97% market share with Volume Manager on Solaris and 79% market share in the storage virtualization space) to create redundancy at the data or disk layer for nearly 10 years. In the past this was limited by the distance limitations of copper SCSI, but with the advent of SANs those distance limitations have increased dramatically. In W2K, LVM is VERITAS Volume Manager. In fact, synchronous replication using nothing but VM over fibre achieves the exact same result as the complex and expensive hardware-based synchronous replication solutions within a SAN radius [drill this home, give an example]. Many IT people don't think about this, but VM is likely a far superior solution (and much cheaper) than hardware-based replication within a SAN radius. Limited to one strand of single-mode fibre: 10 km. Our recommendation is to stay within one campus. When you start stretching it out you can get some latency, depending on distance, especially if it is a write-intensive application.

45 VERITAS Storage Foundation™
Cost savings: supports ANY storage (EMC, HDS, Sun, IBM, …); mirrors data over Fibre Channel; no specialized network gear is required. Protection: no data-corruption window; full database support (in all modes). Management: storage management and mirroring with the same technology; available on Solaris, HP-UX, AIX, Linux and Windows. Supports ANY storage (EMC, HDS, Sun, MTI, STK, IBM…): while most VVR customers are using EMC hardware, we actually have several customers replicating from, for example, Sun storage to an IBM Shark array. Shared network support: VVR can replicate over a shared network; as long as it is an IP network with sufficient bandwidth to handle the traffic, we can replicate over it. No specialized network gear required: VVR simply replicates over an IP network using either TCP or UDP (user selectable). Data consistency in async: we do not use a "track copy" asynchronous mode, which does not preserve data consistency during replication. Instead, VVR offers real-time, write-ordered asynchronous replication. This delivers near real-time data (usually within milliseconds) in a fully consistent manner. Supports ANY DBMS or FS in both sync and async modes: we support Oracle, Sybase, DB2, … anything that VM would support. After all, VVR literally is VM. No distance limitations: nothing in our documentation requires extra hardware for long-distance replication. As long as there is sufficient bandwidth on an IP network, we can reliably replicate over it. Customer example: Northern California to Singapore. For that distance, hardware replication would require channel extenders, which would cost $250,000 each (two needed), and it would have to run in synchronous mode, where you would suffer a performance hit as well. Maintains ALL VxVM online management: adding VVR does not impede the capabilities of VM; all VxVM online management can be performed on replicated volumes. Any storage layout (with VxVM): primary and secondary storage layouts do not need to match. Even the physical layout and configuration of the storage hardware need not match; as long as we have enough storage at the secondary, we can replicate to it. FastFailback: tracks changes between primary and secondary so that failover and failback can occur even in graceful and non-graceful migration scenarios. This eliminates the need to perform a complete re-synchronization once the initial baseline between primary and secondary has been established. Initialization options: we can send data over the wire, in band, to "initialize" the secondary, or simply take a backup of the primary and ship the tape to the secondary; VVR then quickly synchronizes the delta between the backup and the current primary. This tape-assisted initialization is not available with SRDF. Online replication mode switch: for customers that choose synchronous mode, the default behavior is for the product to automatically fall back into async mode if the network (or the secondary host) goes down. Then, when the network comes back up, VVR consistently drains the log and snaps back into full synchronous mode once fully drained. This is a nice feature which, again, takes place without the need for user intervention. We see this as essential for doing synchronous replication over WAN environments (competitors do not offer this and actually require redundant links to ensure maximum network reliability).
Others… Replicate among similar or dissimilar SAN architectures: VVR enables SAN-to-SAN replication over IP. Can replicate volumes that span storage arrays: if the database spans a storage array, VVR can reliably replicate it (unlike an array-based hardware replication solution). Scales to 32 locations: VVR supports many primaries or many secondaries, many-to-one and one-to-many; hardware replication only supports one-to-one (albeit from an array-to-array perspective).

46 VERITAS Volume Replicator: Replication over IP Networks
Primary Site / Secondary Site, IP network. Application, Volume Manager, RLink, Volume Manager, Volume Replicator. Replication Links (RLinks): up to 32 supported; each one can be configured for synchronous or asynchronous replication. Storage Replicator Log (SRL): controls write ordering to guarantee consistency. Replicated Volume Group (RVG): the set of volumes to be replicated to one or more systems. No distance limitations. RVG, SRL.
VVR extends the concept of Volume Manager over distance. You use the same management technique whether mirroring with VM or mirroring over distance with VVR, so no additional replication expertise is needed, and you can take advantage of the processing power VM provides. VVR is most likely already on your system; you just need to call your sales rep to add a license key. By adding VVR to your environment you are really only adding three components. RLink (replication link): defines the relationship between one host and another; you can create up to 32 RLinks in an environment, and each RLink can use a different replication mode, so you can replicate NY to NJ synchronously and, from the same host, NY to SF asynchronously. RVG (replicated volume group): groups related volumes together so that write order is preserved across them; for example, Oracle tends to span volumes, so this groups all Oracle writes together to make sure they reach the secondary site consistently. SRL (Storage Replicator Log): tracks writes in asynchronous mode to guarantee write-order fidelity. Key feature, initialization: you can get data to the secondary site either by sending it over the wire or by doing tape-based initialization. No distance limitations: you can replicate over any distance without any specialized network gear.
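The three components above can be pictured with a small Python sketch. This is purely conceptual — the class names and behaviour are my own simplification, not the VVR implementation — but it shows the key property: every write is logged before it is acknowledged, and each secondary receives writes in the original order whether its link is synchronous or asynchronous.

    from collections import deque

    class RLink:
        """One replication link to a secondary site (VVR supports up to 32)."""
        def __init__(self, name, synchronous=False):
            self.name, self.synchronous = name, synchronous
            self.applied = []          # writes applied at this secondary

    class ReplicatedVolumeGroup:
        def __init__(self, rlinks):
            self.srl = deque()         # stand-in for the Storage Replicator Log
            self.rlinks = rlinks

        def write(self, block, data):
            self.srl.append((block, data))              # logged before the ack
            for link in self.rlinks:
                if link.synchronous:
                    link.applied.append((block, data))  # shipped before the ack
            return "ack"                                # application continues

        def drain(self):
            # background shipping of logged writes, in order, to async links
            while self.srl:
                entry = self.srl.popleft()
                for link in self.rlinks:
                    if not link.synchronous:
                        link.applied.append(entry)

    rvg = ReplicatedVolumeGroup([RLink("to-site-B", synchronous=True),
                                 RLink("to-site-C")])
    rvg.write(1, "commit A")
    rvg.write(2, "commit B")
    rvg.drain()
    print([link.applied for link in rvg.rlinks])   # both secondaries see writes in order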

47 Synchronous Replication
Primary Site / Secondary Site. Advantages: the data is identical on primary and secondary. Weaknesses: significant performance impact; writes are penalized by latency; latency affects commit time.
Synchronous replication is more or less mirroring over some sort of network. The data is not acknowledged as written (committed) at the primary until it has been sent to the secondary, acknowledged as having arrived, and then posted as completed at the primary. The strength of synchronous is that the data is fully current (up to date) at the secondary: there is NO chance of ANY data loss, not even a few milliseconds' worth. The ultimate RPO. The big downside, however, is that network latency may directly impact application performance at the primary. "The speed of light is not just a good idea, it's the law." That said, this performance impact is highly application dependent; we have seen people replicate over fairly long distances with relatively low application impact. For write-intensive or very long-distance replication scenarios, synchronous is probably not the optimal choice. At VERITAS, we see about 50% of the customers that deploy replication solutions going with synchronous.
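A quick way to see why latency lands on the commit path is to bound per-thread commit throughput by the round trip. The figures below are illustrative assumptions, not measurements.

    # Per-thread commit rate under synchronous replication: each commit waits
    # for the local write plus the network round trip to the secondary.
    def max_commits_per_sec(local_write_ms: float, round_trip_ms: float) -> float:
        return 1000.0 / (local_write_ms + round_trip_ms)

    print(f"{max_commits_per_sec(0.5, 0.0):.0f}/s with no replication")
    print(f"{max_commits_per_sec(0.5, 1.0):.0f}/s at metro distance (1 ms RTT)")
    print(f"{max_commits_per_sec(0.5, 40.0):.0f}/s over a long-haul WAN (40 ms RTT)")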

48 Asynchronous Replication
Primary Site / Secondary Site. Advantages: high performance; minimal impact on the application. Drawbacks: the secondary can lag behind the primary; possibility of corruption (not with VERITAS).
Asynchronous replication differs from synchronous in that the data is written and committed at the primary without waiting for the acknowledgement from the secondary that the data has arrived. That acknowledgement must still come, but it is outside the critical path of the primary application's write request. The big benefit of not waiting for the network round trip is application performance at the primary: we have seen volumes replicated asynchronously deliver optimal performance (i.e., identical throughput to a non-replicated system under the same load). As with synchronous (or even periodic) replication, the data is available immediately at the secondary; in other words, it is on disk, attached to a host, and ready to stand in immediately for a failed primary. The weakness of asynchronous replication is that the secondary may get slightly behind the primary (lag, be less current, less fresh, not as up to date…), so there can be some RPO exposure. Generally this is measured in milliseconds, sometimes in seconds, and in extreme cases (where a network failure precedes a site disaster) in minutes; we are still generally talking about very small amounts of data loss, which is why asynchronous, like synchronous, is considered a "real-time" replication mode. Another potential weakness of replication products running in asynchronous mode is the potential (even probability) that they corrupt data at the secondary; this is especially common among hardware (array-based) replication solutions. As the last bullet states, it is not a concern for VERITAS products: VERITAS replication strictly maintains write-order fidelity in asynchronous mode, so there is no chance of database corruption even when replicating asynchronously. The only way for a secondary to become corrupt with a VERITAS replication solution is to have had the exact same level of exposure at the primary and then a site disaster at that exact instant; if you feel comfortable that your storage management solution at the primary will not corrupt the data, you should have the same confidence in the integrity of the data at the secondary. Again, though, it may lag slightly behind the primary in asynchronous mode. One more comment on asynchronous mode: it is also very useful as a backup mode for synchronous. Essentially every customer (that I know of) that chooses to run in synchronous mode configures the product to switch dynamically, on the fly, to asynchronous mode if there happens to be a network (or some other) failure; they would rather switch to asynchronous and queue changes than endure downtime at the primary. When the failure has been rectified, the asynchronous log drains and the product snaps back into synchronous mode. This concept of a log is the fundamental enabler of asynchronous replication, and VERITAS is unique in its ability to migrate gracefully between modes on the fly, even without user intervention.
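The mode switch described in those notes can be sketched in a few lines of Python. This is my own illustration under simplified assumptions (one secondary, an unbounded in-memory log), not VERITAS code, but it captures the behaviour: queue while the link is down, drain in order when it returns, then resume synchronous operation.

    from collections import deque

    class Replicator:
        def __init__(self):
            self.mode, self.log, self.secondary = "sync", deque(), []

        def link_down(self):
            self.mode = "async"

        def link_up(self):
            while self.log:                      # drain the log in write order
                self.secondary.append(self.log.popleft())
            self.mode = "sync"                   # snap back once fully drained

        def write(self, data):
            if self.mode == "sync":
                self.secondary.append(data)      # ack only after the remote apply
            else:
                self.log.append(data)            # ack immediately; the lag (RPO) grows
            return "ack"

    r = Replicator()
    r.write("w1")
    r.link_down()                                # network failure: fall back to async
    r.write("w2"); r.write("w3")                 # primary keeps running
    r.link_up()                                  # log drains, back to synchronous
    assert r.secondary == ["w1", "w2", "w3"] and r.mode == "sync"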

49 Volume Replicator. Supports ANY storage (EMC, HDS, Sun, LSI, MTI, STK, IBM…). Supports ANY DBMS or FS in sync and async modes. Replicates over TCP/IP with NO distance limitations. Data consistency in async. No dedicated network required. Scales to 32 sites. Initialization options. Online management.
VVR's flexibility is also unmatched. Most of these bullets are self-explanatory, but here is some background on each. Maintains ALL VxVM online management – adding VVR does not impede the capabilities of VM at all; all VxVM online management can be performed on replicated volumes. Any storage layout (with VxVM) – primary and secondary storage layouts do not need to match; even the physical layout and configuration of the storage hardware need not match, as long as there is enough storage at the secondary. Supports ANY storage (EMC, HDS, Sun, MTI, STK, IBM…) – while most VVR customers use EMC hardware, several customers replicate from, for example, Sun storage to an IBM Shark array. Replicate among similar or dissimilar SAN architectures – VVR enables SAN-to-SAN replication over IP. Supports ANY DBMS or FS in both sync and async modes – we support Oracle, Sybase, DB2… anything VM would support; after all, VVR literally is VM. No distance limitations – nothing in our documentation requires extra hardware for long-distance replication; as long as there is sufficient bandwidth on an IP network, we can reliably replicate over it. Data consistency in async – we do not use a "track copy" asynchronous mode, which does not preserve data consistency during replication; VVR offers real-time, write-ordered asynchronous replication that delivers near real-time data (usually within milliseconds) in a fully consistent manner. Async override option – for customers that choose synchronous mode, the default behavior is to fall back automatically into async mode if the network (or the secondary host) goes down; when the network comes back up, VVR consistently drains the log and then snaps back into full synchronous mode once fully drained. This takes place without user intervention, and we see it as essential for synchronous over WAN environments (competitors do not offer it and actually require redundant links to ensure maximum network reliability). Initialization options – data can be sent over the wire to initialize the secondary, or a backup of the primary can be shipped on tape to the secondary and VVR then synchronizes only the delta between the backup and the current primary; this tape-assisted initialization is not available with SRDF. Can replicate volumes that span storage arrays – if the database spans a storage array, VVR can reliably replicate it (unlike an array-based hardware replication solution). Shared network support – VVR can replicate over a shared network, as long as it is an IP network with sufficient bandwidth to handle the traffic. Scales to 32 locations – VVR supports many primaries or many secondaries (many-to-one, one-to-many…); hardware replication only supports one-to-one (albeit from an array-to-array perspective).
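Since the notes keep coming back to "sufficient bandwidth on an IP network", here is a rough sizing sketch. The 30% protocol/headroom factor and the example figures are assumptions for illustration, not VERITAS sizing guidance.

    # Does the replication link keep up with the application write rate,
    # and how long does a backlog take to drain once the link is restored?
    def link_ok(write_mb_per_s, link_mbit_per_s, headroom=1.3):
        needed_mbit = write_mb_per_s * 8 * headroom        # MB/s -> Mbit/s plus headroom
        return link_mbit_per_s >= needed_mbit

    def catch_up_minutes(backlog_gb, write_mb_per_s, link_mbit_per_s):
        spare_mb_per_s = link_mbit_per_s / 8 - write_mb_per_s
        if spare_mb_per_s <= 0:
            return float("inf")                            # the secondary never catches up
        return backlog_gb * 1024 / spare_mb_per_s / 60

    print(link_ok(5, 100))                                 # 5 MB/s of writes, 100 Mbit/s link
    print(f"{catch_up_minutes(2, 5, 100):.0f} min to drain a 2 GB backlog")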

50 VERITAS Storage Foundation for Oracle RAC

51 Oracle9i RAC Challenges
How do I protect my database from corruption in a split brain? How do I build an infrastructure that scales to absorb my growth? How do I reduce management costs? How do I maximize availability and performance?

52 RAC needs the right infrastructure
Cluster membership service: who is in the cluster? who is joining the cluster? who has left the cluster? Shared access to storage: Cluster Volume Manager, Cluster File System, Oracle Disk Manager support. High-performance inter-node interconnect: inter-instance messaging, cluster state, shared storage management. Scalability. Ease of management.
The purpose of this slide is to position the types of things you should think about when implementing RAC. Bullet #1: you don't want to sacrifice manageability in your effort to create an available, scalable environment; you need to be sure that the availability and scalability benefits you achieve through RAC are not outweighed by increased complexity. Bullet #2: you want to be able to manage HA generically, not just in terms of RAC; with VERITAS you can have other applications in the same cluster and fail over anything you want. Bullet #3: you want to be able to distribute your workload across systems; your service providers (VERITAS, Oracle) will help you set up these practices, and VERITAS solutions (Traffic Director, VCS) help automate workload management as well. Bullet #4: you need a business contingency plan/DR strategy for your RAC configuration, as much of your most critical data will be stored on a RAC cluster.

53 RAC "needs" VERITAS
Delivers maximum performance for Oracle RAC. Uses the most widely deployed cluster in the market. Supports multiple arrays. Scalable: supports up to 32 nodes. Supports multiple platforms. Avoids any data-corruption window. Makes RAC easy to manage thanks to the CLUSTER FILE SYSTEM.

54 VERITAS Storage Foundation for RAC
VERITAS Storage Foundation for Oracle RAC is the complete solution for RAC: based on key VERITAS technologies, optimized for RAC, tested with RAC, certified by Oracle. Oracle9i RAC, VCS RAC Extensions, Cluster Server (VCS), VERITAS Storage Foundation for Oracle RAC, Database Accelerator (QIO/ODM), Cluster File System, Cluster Volume Manager.
Examples of cluster services: cluster transport, I/O fencing. FYI – we also support raw, so if someone doesn't want to run CFS, DBE/AC can be implemented on raw. Hardware.

55 VERITAS Storage Foundation for RAC
VERITAS Storage Foundation for Oracle RAC is the complete solution for RAC: based on key VERITAS technologies, optimized for RAC, tested with RAC, certified by Oracle. Cluster Volume Manager, Cluster File System, Database Accelerator (ODM), Cluster Server, RAC Extensions, Hardware, Oracle9i RAC.
DBE/AC is a complete solution for RAC. Within DBE/AC there are several components based on core VERITAS technology: Cluster Volume Manager – uses technology from our leading Volume Manager product; Cluster File System – uses technology from our leading File System product; Database Accelerator (ODM) – a library we created specifically for DBE/AC that interfaces with Oracle's ODM API (explained later); Cluster Server – uses technology from our leading Cluster Server product; RAC Extensions – new technology we created specifically for RAC (explained later).

56 Cluster Volume Manager
Concurrent access to volumes from multiple nodes. Supports multiple arrays: Sun, HDS, EMC, CLARiiON, HP XP, IBM ESS, NetApp, … Striping across LUNs and mirroring across arrays. Dynamic Multipathing at no extra cost. Free third-mirror copies for off-host processing. Eliminates the corruption window through I/O fencing. Oracle9i RAC, RAC Extensions, Cluster Server, Database Accelerator (ODM), Cluster File System, Cluster Volume Manager.
#3, increased reliability: make sure that when there are failures you don't corrupt your data; this is a benefit of using CFS, and VM actually does this. We do allow RAC on CVM. Hardware.
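As an illustration of what "striping across LUNs and mirroring across arrays" means for data placement, the toy function below maps a logical block to a stripe column and keeps one copy of that column on each array. The array names and the column count are invented for the example; this is not how VxVM actually lays out plexes.

    # Each logical block lands in one stripe column (a LUN) and is written to
    # that column on every array, so the volume survives a whole-array failure.
    def place_block(block, stripe_columns, arrays=("array-A", "array-B")):
        column = block % stripe_columns                        # striping across LUNs
        return [(arr, f"lun-{column}") for arr in arrays]      # mirroring across arrays

    for blk in range(4):
        print(blk, place_block(blk, stripe_columns=4))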

57 Cluster Volume Manager
I/O Fencing: understanding split brain. A split brain can corrupt the database. When a failure occurs, one node must survive and the other must die. VERITAS I/O Fencing handles every scenario, eliminating the possibility of split brain.
Why do you want to protect against a split-brain scenario? Because an unprotected split brain will corrupt the Oracle RAC database and force you to restore from backups. Split brain occurs when the nodes can't talk to each other (all the inter-node links fail) and each thinks it is the only surviving member of the cluster. If the nodes in the cluster have uncoordinated access to storage, they will overwrite each other's data (data corruption occurs). To prevent this corruption, one node must shut down. The next two slides explain VERITAS I/O fencing with an example. Network failure or system failure? System failure or temporary system overload?

58 Cluster Volume Manager
I/O Fencing. A, B. Node 0, with key A, registers on disk 1 down all paths. Node 1, with key B, registers on disk 1 down all paths.

59 Cluster Volume Manager
I/O Fencing: in a split brain. B, A. Node 0, with key A, ejects key B from disk 1 using a SCSI-3 command; a single eject command removes write permission down all paths. Node 1 can no longer write because its key has been ejected: the disk refuses the write.
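A toy model of this register-then-preempt sequence is sketched below in Python. It illustrates the idea only — it is not the SCSI-3 persistent reservation protocol nor the VERITAS fencing driver, and the class and method names are invented for the example.

    # Each node registers a key on the coordinator disk; in a split brain the
    # surviving node preempts the other key, and the ejected node's writes are
    # refused at the disk, so it cannot corrupt the shared database.
    class CoordinatorDisk:
        def __init__(self):
            self.keys = set()
            self.data = {}

        def register(self, key):
            self.keys.add(key)

        def preempt(self, winner_key, victim_key):
            # a single command revokes the victim's access on every path
            self.keys.discard(victim_key)

        def write(self, key, block, value):
            if key not in self.keys:
                raise PermissionError(f"key {key} has been ejected, write refused")
            self.data[block] = value

    disk = CoordinatorDisk()
    disk.register("A")            # node 0
    disk.register("B")            # node 1
    disk.preempt("A", "B")        # split brain: node 0 wins the race
    disk.write("A", 7, "ok")      # the survivor keeps writing
    try:
        disk.write("B", 7, "corrupt")
    except PermissionError as e:
        print(e)                  # node 1 is fenced off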

60 VERITAS Cluster File System
Oracle9i RAC. VERITAS is the only solution on Solaris certified by Oracle for running RAC on a file system. Concurrent access to the same file systems from multiple nodes: all nodes can read and write directly. Integrates with Oracle Disk Manager to maximize performance. Preserves online management. Lower administration costs. RAC Extensions, Cluster Server, Database Accelerator (ODM), Cluster File System, Cluster Volume Manager.
Key points for this slide: to manage your RAC environment you need a true multi-node file system where every node can read and write directly, without funnelling I/O through another node; a cluster file system that is guaranteed not to give you data corruption if a node fails; a resilient system where the failure of any one node will not take down access to the data from the other nodes; and a solution that is certified by Oracle. On locking of data: you don't want RAC using its intelligence and your CFS using different intelligence; you want a CFS that gets out of Oracle's way. VERITAS's goal is not to get in the way of RAC processing; we ensure that VERITAS does not create a bottleneck, and we have seen little to no degradation from adding VERITAS CFS to a RAC environment – typical performance is within the margin of error. First benefit of CFS for RAC (crucial slide): increased manageability, easy creation and expansion of files. Without a file system you have to give Oracle fixed-size partitions from the very beginning; you have to plan ahead and pre-format your partitions at the sizes you expect to need in the future, whereas with CFS they grow dynamically (this benefit is really raw vs. CFS; in single-instance setups you have choices other than VERITAS). Less prone to user error: file systems are something people are experienced with and know how to manage; with raw, most work is manual, raw partitions are not "visible", and people have mistakenly put file systems on top of raw – nothing in Oracle prevents those mistakes. At the data-center level: with raw partitions you end up with a RAC-specific backup strategy, whereas with CFS you can extend your data-center-wide strategy. Hardware.

61 Benefits of using a Cluster File System
Simpler installation and configuration. Option to install a single ORACLE_HOME. Archive logs are accessible to all nodes at all times, with no NFS. Simpler backup. Faster, simpler Oracle recovery! Easy creation, migration, expansion and shrinking of file systems. Lower probability of human error. Scalable: each datafile is just a file. With top performance.
#1: you don't have to have as many partitions as you have files – instead, a few big partitions and many small files. #2: a raw partition does not allow this; with a file system you can migrate, shrink and grow.

62 Cluster File System: ODM Integration
Oracle Disk Manager (ODM): all Oracle processes use a single library for I/O (ODM); it is an I/O API defined by Oracle. VERITAS provides its own library for the Oracle processes, integrated with that API. There is no performance penalty for using a file system instead of a raw device, and there is less operating-system overhead. Supports Oracle Managed Files (OMF) and auto-extend of tablespaces, simplifying file administration. Oracle9i RAC, RAC Enhancements, Cluster Server, Database Accelerator (ODM), Cluster File System, Cluster Volume Manager.
DBE/AC includes an I/O library that improves the performance of RAC. When our library is installed, we use the Oracle Disk Manager (ODM) API provided by Oracle. This allows our CFS to act like a raw device: file-system locking is turned off and no file-system buffering is done in the kernel, so our CFS delivers the same performance as raw. In some cases an ODM system can even be faster than raw because fewer file descriptors are used. Additionally, our CFS makes it easier for Oracle DBAs to administer the database because the ODM interface lets us support Oracle Managed Files (OMF); OMF allows auto-extend tablespaces, which lets tables within an Oracle database grow without admin intervention. With the ODM interface, creation and deletion of datafiles is synchronized with creation and deletion of the OS files. Hardware.
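The kernel asynchronous I/O point can be illustrated with a loose analogy in Python. This is not the ODM API and not database code — the 1 ms per write and the batch size are arbitrary assumptions — but it shows the difference between waiting for each write before issuing the next one and issuing a batch in parallel, then reaping the completions together.

    import concurrent.futures, time

    def fake_write(block):                 # stands in for one datafile write (~1 ms)
        time.sleep(0.001)
        return block

    blocks = range(64)

    start = time.perf_counter()
    for b in blocks:                       # serial: each write waits for the previous ack
        fake_write(b)
    serial_s = time.perf_counter() - start

    start = time.perf_counter()
    with concurrent.futures.ThreadPoolExecutor(max_workers=16) as pool:
        list(pool.map(fake_write, blocks)) # overlapped writes, completions reaped together
    overlapped_s = time.perf_counter() - start

    print(f"serial {serial_s*1000:.0f} ms vs overlapped {overlapped_s*1000:.0f} ms")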

63 Cluster File System: CFS vs. Raw Throughput
Ease of management with performance on par with raw: in internal VERITAS tests, CFS throughput (transactions per minute, higher is better) stayed between 3% below and 2% above raw across the SGA sizes tested. Net result: our CFS does not inject a performance penalty; you get all the CFS benefits without any performance trade-off.
An identical hardware configuration was used for two test cases, one with raw storage devices and one with Cluster File System, and the performance results of the two runs were compared. Test configuration: a pair of Sun SunFire 6800 servers, each with 8 UltraSPARC II 750 MHz processors (8 MB Level 2 cache per processor), 8 GB RAM, 2 QLogic 2200 HBAs, 1 Quad Fast Ethernet card and 2 Gigabit Ethernet cards used for the cluster interconnect, running Solaris 8 Update 5 (64-bit). Storage: a 3PARdata InServ array with 8 995 MHz Pentium III CPUs, 8 GB mirrored data cache (4 GB usable) and 116 18 GB Hitachi drives used to create 18 LUNs – hardware RAID 5 within the array for 16 LUNs, with a VERITAS CVM RAID 0 stripe across those 16 LUNs for the datafiles, and hardware RAID 1 within the array for the 2 LUNs used for the Oracle log files. Database: Oracle9i Release 2 (64-bit), 150 GB, 89 datafiles, 100 concurrent users.

64 Cluster File System: Online Reorganization
Online reorganization of the database structure: I/O starts; database activity exceeds the forecasts; the DBA decides to move a heavily accessed index; a new volume is added to the file system; the index is moved online; performance improves. Index, file system, new volume.
QoSS can also be used with databases, for example as in the scenario on this slide.

65 Database Accelerator (ODM) Cluster Volume Manager
Mission-critical availability + high performance on the private network. Uses the market-leading cluster technology. Choice of HA architecture. Lets you manage high availability for RAC and for the rest of the business applications. Oracle9i RAC, RAC Extensions, Cluster Server, Database Accelerator (ODM), Cluster File System, Cluster Volume Manager.
ODM is Oracle's preferred way of doing I/O. If Oracle is not in the audience: jointly developed; if Oracle is in the room: we were the first to implement it. Implementing ODM in CFS is a requirement from Oracle to support RAC. Message: interoperability, working closely with Oracle, first to do it; we did this before with Quick I/O and now with ODM (this product only does ODM, no more Quick I/O). Kernel asynchronous I/O: instead of synchronous I/O, where processing stops until you get an acknowledgement, you keep writing until confirmation. Eliminates traditional UNIX file-system overhead, allows parallel updates to database files for increased throughput, takes advantage of kernel async I/O, lets Oracle handle locking for data integrity, means fewer system calls and context switches, reduced CPU utilization, increased I/O parallelism, and efficient file creation and disk allocation. Hardware.

66 VERITAS Cluster Server and Oracle RAC: Local SAN Clustering
Using a SAN to build a local cluster of up to 8 nodes

67 VERITAS Cluster Server and Oracle RAC: Campus Clustering

68 VERITAS Cluster Server and Oracle RAC: WAN Clustering
Primary RAC, Secondary RAC, IP network. Managing the sites: replication management, service automation, IP addressing management, centralized graphical management.
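The four management tasks above are essentially the steps a site failover runs through. The sketch below is a hypothetical orchestration outline in Python — the step names are illustrative and it is not VCS or Global Cluster Option syntax — showing the order in which a global cluster would typically drive them.

    # Hypothetical site-failover sequence of the kind a global cluster automates.
    def fail_over_to_secondary(run):
        steps = [
            "demote old primary / stop inbound replication",     # replication management
            "promote the replicated volume group at the secondary",
            "bring the database service group online on the secondary nodes",  # service automation
            "repoint the virtual IP / DNS to the secondary site",              # IP address management
        ]
        for step in steps:
            run(step)

    fail_over_to_secondary(lambda s: print("->", s))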

69 Database Accelerator (ODM) Cluster Volume Manager
Mission-critical availability + high performance on the private network. Maximum performance for inter-node communications: Global Atomic Broadcast (GAB), Low Latency Transport (LLT), LMX (multiplexed LLT). No TCP/IP-associated overhead. Maximum performance over Fast Ethernet or Gigabit (cost savings and ease of management without sacrificing performance). Oracle9i RAC, RAC Extensions, Cluster Server, Database Accelerator (ODM), Cluster File System, Cluster Volume Manager. Hardware.

70 Manage Complexity - Datacenter Foundation
Cost reduction. Risk reduction. Alignment with the business.

