High Availability and Scalability Solutions with SQL Server 2000

Diego Casali, Systems Engineer, Microsoft Argentina, Córdoba and NOA Region. TechEd 2002

Agenda
- What do we mean by high availability (HA)?
- HA technologies
- Managing for HA
- Designing a solution for HA

High Availability: What It Is and What It Is Not
It is:
- A combination of design, people, processes, and technology
It is not:
- Only a technology solution
- A synonym for scalability or manageability
- An IT decision made without knowledge of the business
- A business decision made in isolation from the cost of downtime

What Is a Nine? Five What?
99.999% uptime: the nines are counted 1 2 3 4 5
A = (F - R) / F
where A = availability, F = MTBF (mean time between failures), R = MTTR (mean time to repair)
Equations for % uptime:
- Per year: (8760 - HoursDownPerYear) / 8760
- Per month: ((24 * NumDaysThatMonth) - HoursDownPerMonth) / (24 * NumDaysThatMonth)
- Per week: (168 - HoursDownThatWeek) / 168
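The equations on this slide can be sketched in a few lines of Python. This is only an illustration: the function names and the sample figures are assumptions made for the example, not part of the original deck.

```python
HOURS_PER_YEAR = 8760  # 365 days * 24 hours, as used on the slide
HOURS_PER_WEEK = 168   # 7 days * 24 hours


def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """A = (F - R) / F, with F = MTBF and R = MTTR, both in hours."""
    if mtbf_hours <= 0:
        raise ValueError("MTBF must be positive")
    return (mtbf_hours - mttr_hours) / mtbf_hours


def uptime(period_hours: float, hours_down: float) -> float:
    """Fraction of the period during which the system was available."""
    return (period_hours - hours_down) / period_hours


def uptime_per_month(days_in_month: int, hours_down: float) -> float:
    """Monthly form of the equation: the period is 24 * days in that month."""
    return uptime(24 * days_in_month, hours_down)


if __name__ == "__main__":
    # Hypothetical figures: one hour of repair per 1000 hours between failures.
    print(f"MTBF/MTTR availability: {availability(1000, 1):.3%}")
    print(f"Uptime this year:       {uptime(HOURS_PER_YEAR, 8.76):.3%}")
    print(f"Uptime this week:       {uptime(HOURS_PER_WEEK, 0.5):.3%}")
```

The yearly, monthly, and weekly equations are the same calculation with a different period length, which is why one helper covers all three.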

Stepping Down the Nines
Percentage   Downtime (per year)
100%         None
99.999%      under 5.26 minutes
99.99%       5.26 to 52 minutes
99.9%        52 min to 8 h 45 min
99%          8 h 45 min to 87 h 36 min
90%          788 h 24 min to 875 h 54 min
Each further 1% of downtime adds 87 h 36 min per year: 175 h 12 min, 262 h 48 min, 350 h 24 min, 438 h, 525 h 36 min, 613 h 12 min, 700 h 48 min, 788 h 24 min.
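The thresholds in this table follow directly from the 8760-hour year. A minimal sketch that recomputes them, assuming a helper name and a display format chosen only for this example:

```python
HOURS_PER_YEAR = 8760


def downtime_minutes_per_year(availability_percent: float) -> float:
    """Minutes of downtime per year allowed at a given availability."""
    return (100 - availability_percent) / 100 * HOURS_PER_YEAR * 60


def format_duration(minutes: float) -> str:
    """Show short durations in minutes, longer ones as whole hours and minutes."""
    if minutes < 60:
        return f"{minutes:.2f} min"
    # Round away floating-point noise, then drop the fractional minute.
    whole_minutes = int(round(minutes, 6))
    hours, rest = divmod(whole_minutes, 60)
    return f"{hours} h {rest} min"


if __name__ == "__main__":
    for pct in (99.999, 99.99, 99.9, 99, 90):
        print(f"{pct}% -> {format_duration(downtime_minutes_per_year(pct))}")
```

Each extra nine divides the allowed downtime by ten, which is one way to see why the fourth and fifth nines are so much harder to reach than the third.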

What Does Unavailable Mean?
Total unavailability:
- Planned server maintenance
- Unplanned outages (attacks, hardware failures, etc.)
- Time to switch over to the available technology
- Database restores
- What else?

What Does Unavailable Mean? (cont.)
Perceived unavailability:
- Web site, application, etc. down while SQL Server is running
- Network problems
- Deploying a new version of the application
- User or application errors
- Poorly trained staff
- Etc.

How to Achieve Higher Availability
- Redundant, quality hardware
- Properly managed (software and hardware)
- Processes that work, including change control
- Appropriate mitigation plans (disaster recovery, etc.) that have been tested
- Excellence in operations, planning, and design
- Trained and qualified staff, which holds in any discipline

How to Achieve Higher Availability (cont.)
... but what are your barriers to high availability? People? Processes? Money? Time? Technology? ...

The Cost of HA
Cost is not just hardware; it is an investment in people/staff, process, hardware, technology, etc. Getting to three nines is fairly easy, but that fourth or fifth nine will cost you.

Calculating the Time and Cost of HA
TCO factors:
1. Hardware and software
2. Administration and procedure costs
3. Support
4. Development
5. Telecommunications costs
6. Outages
7. End-user costs

Two Levels, Used in Combination
- SQL Server technology: failover clustering, log shipping, replication
- Windows technology: Windows Clustering, NLB

How Failover Clustering Works
[Diagram: client PCs; Node A and Node B, each with SQL Server; a heartbeat between the nodes; a shared disk array.]

Log Shipping
Log shipping has been around for a while, and the concept is simple: take a point-in-time backup, restore it on a secondary, then restore subsequent transaction logs and bring the secondary online if necessary. In certain cases the secondary database can also be used for reporting.
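The sequence described above can be illustrated with a toy in-memory model. This is not SQL Server's implementation: the classes, the key/value "database", and the sequence numbers are all assumptions made so the order of operations is visible.

```python
from dataclasses import dataclass, field


@dataclass
class LogBackup:
    sequence: int   # position in the log chain (hypothetical numbering)
    changes: list   # (key, value) pairs in commit order


@dataclass
class Secondary:
    rows: dict = field(default_factory=dict)
    online: bool = False
    next_sequence: int = 1

    @classmethod
    def from_full_backup(cls, full_backup: dict) -> "Secondary":
        """Step 1: restore a point-in-time copy of the primary, left offline."""
        return cls(rows=dict(full_backup))

    def restore_log(self, backup: LogBackup) -> None:
        """Step 2: apply log backups strictly in order while still offline."""
        if self.online:
            raise RuntimeError("secondary is online; no more logs can be restored")
        if backup.sequence != self.next_sequence:
            raise ValueError(
                f"expected log {self.next_sequence}, got {backup.sequence}"
            )
        for key, value in backup.changes:
            self.rows[key] = value
        self.next_sequence += 1

    def bring_online(self) -> None:
        """Step 3: the role change, done only when the primary is lost."""
        self.online = True


if __name__ == "__main__":
    standby = Secondary.from_full_backup({"order-1": "new"})
    standby.restore_log(LogBackup(1, [("order-2", "new")]))
    standby.restore_log(LogBackup(2, [("order-1", "shipped")]))
    standby.bring_online()
    print(standby.rows)
```

The model makes one property explicit: the standby is only as current as the last log it restored, so anything committed on the primary after that log backup is not on the secondary.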

Uses of Log Shipping
Uses for HA:
- Eases the upgrade from 7.0 to 2000
- Secondary HA method for a failover cluster (solves the distance problem)
- Performing maintenance on the primary server
- Database health checks
Also for reporting and queries (not HA)

Evaluating Replication as an HA Solution
- Consider it if failover clustering and log shipping have been ruled out
- Failure detection and failover are not automatic
- Suitable when warm-standby functionality is acceptable
- The standby server is not identical to the primary:
  - Some user schema and some system data are not replicated
  - The data may not be current
  - Merge replication is not transactionally consistent
- There are some benefits:
  - Data can be partitioned on the standby server (parts of tables can be replicated)
  - Data access for reporting

Comparing Hot and Warm Standby Solutions
Definitions: hot standby, warm standby, failover support
Hot standby:
- Requires failover clustering
- Failure detection and failover are automatic
Warm standby solutions:
- Log shipping: transfers backups from a primary server to a secondary one
- Replication: provides simultaneous access to data on another node, plus partitioning of objects and data

HA Technology Comparison in SQL Server
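Features compared across Failover Clustering, Log Shipping, and Transactional Replication:
- Failure detection: Failover Clustering: automatic. Log Shipping and Transactional Replication: not automatic.
- Automatic switch to secondary: Failover Clustering: yes. Log Shipping and Transactional Replication: manual.
- Protects against failed server process: Yes, but ...
- Protects against failed disk: Failover Clustering: no (shared-disk clustering).
- Metadata support: Failover Clustering: all system and user schema and data for all databases. Log Shipping: some system, all user schema and data for select databases. Transactional Replication: some user schema and data.
- Transactionally consistent
- Transactionally current: Log Shipping: no, since last transaction log backup. Transactional Replication: no, since last replicated transaction.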

HA Technology Comparison in SQL Server
Features compared across Failover Clustering, Log Shipping, and Transactional Replication:
- Performance impact: Failover Clustering: none. Log Shipping: minimal (file copying on primary). Transactional Replication: log reader continually running.
- Time to switch: seconds to minutes, depending on database recovery time; seconds, more to recover more thoroughly.
- Locations: Failover Clustering: close (unless using distance clusters on the HCL); not location bound.
- Additional complexity: some; more.
- Maximum number of servers: Failover Clustering: 4. Log Shipping: 32 with NLB, otherwise no limit. Transactional Replication: no limit.
- Standby available for reporting, etc.: Failover Clustering: N/A, not a warm standby solution. Log Shipping: yes, read-only access is possible when logs are not being loaded. Transactional Replication: yes.
- Partitioning of data to standby: No

Backup/Restore
A good strategy is always needed, but for HA it should be the last resort.
Pros:
- You know it and love it!
Cons:
- Media failures, such as tape
- The time it takes to carry out
- It does not create redundancy
- In practice you need more than user data: system databases, OS, etc.
This is the method most people are familiar with. However, it is not the best method, as it can create extended downtime, especially with large databases. How well tested are your backups? And how long do they take, not only to make but to restore? Restoring a 20 GB database is not trivial, and what if it doesn't work? What if the tape was erased by a magnet? The only possible exception is the split mirror (snapshot) backup, which is great for VLDBs but also very expensive. It is a quick process, and for VLDBs it should be the preferred backup/restore method, but you still want at least one or two other methods on top of it.

Network Load Balancing (NLB)
- Generally used for scalability, not for SQL Server
- Can be used with databases to obtain HA, but only in the right situations:
  - Redundant read-only data servers (e.g. catalog information)
  - Front-end switch for the role change in log shipping
  - Standby server for Analysis Services (BI/DW)
This is probably the best method for HA and Analysis Services, since it is good for read-only data.

Disaster Prevention
- Risk management
- Strategies for disaster prevention

Risk Management
[Diagram: a five-step cycle: 1 Identify, 2 Analyze, 3 Plan, 4 Track, 5 Control. Artifacts shown: risk document, top n risks, list of retired risks.]
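One way to get from the risk document to a "top n" list is to rank risks by exposure. The sketch below assumes exposure = probability x impact, a common convention that the slide itself does not spell out; the `Risk` class and the sample risks are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Risk:
    name: str
    probability: float   # chance of occurring in the period, 0.0 to 1.0
    impact_hours: float  # estimated downtime if it does occur

    @property
    def exposure(self) -> float:
        # Assumed scoring rule: expected hours of downtime.
        return self.probability * self.impact_hours


def top_n(risks, n):
    """Return the n risks with the highest exposure, highest first."""
    return sorted(risks, key=lambda r: r.exposure, reverse=True)[:n]


if __name__ == "__main__":
    register = [
        Risk("tape media failure", 0.2, 24),
        Risk("single disk failure", 0.1, 8),
        Risk("operator error", 0.5, 4),
    ]
    for risk in top_n(register, 2):
        print(f"{risk.name}: {risk.exposure:.1f} expected hours down")
```

Risks that fall out of the top n are not forgotten; in the cycle on the slide they stay in the risk document and are re-ranked each time the Analyze step runs.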

Strategies for Disaster Prevention
- Determine potential causes of outages
- Create effective operational processes
- Prevent outages automatically:
  - Redundant hardware
  - Automatic failover to a standby server .... DDR and replication, or log shipping with NLB

Establish Operational Excellence
- Data center principles
- Change control
- Staff
- Disaster recovery plan
- Run book
Data center principles
- Security: Control access and track people entering and leaving. Use credentials (badge, visitor pass, etc.). Lock down use of servers: do not use them directly unless you have to. The server room should have restricted entry of some type. Security requirements vary internationally; there should be enough security to protect the servers while still allowing people to do their jobs.
- Service level agreements: Without a support contract, you may be up the creek without a paddle. Cover software and hardware, and buy what you need: if you need 2-hour response, get that; if 1-day turnaround is OK, get that.
- Facility:
  - Raised floors: provide space for cabling and cooling (push cool air under the floor and direct it toward the servers).
  - Fire suppression: make sure that in the event of a fire you can suppress it without extensive damage.
  - Temperature and humidity control: computer equipment needs to stay cool, and large amounts of equipment generate a lot of heat. High humidity can cause condensation; low humidity, static electricity. Both can damage circuitry, so aim for 40 to 45% relative humidity. Also, is the HVAC on the generator?
  - Redundant power: power outages should be taken into account. Make sure backup power can handle all operations, including AC. This will also allow graceful shutdowns.
  - Data connectivity: voice and high-speed data connections. Communications systems should be redundant and enter from separate locations. Be as close to the external Internet hub as possible.
  - Cable infrastructure: cable infrastructure is no joke. Make sure cables are clearly labeled, carefully managed, tied off, and organized. Loose or unlabeled wires are a hazard and a threat to availability, and they are also more prone to wear and tear.
  - Space: have enough room to store all of your equipment.
- Pick a provider wisely if you don't have your own data center (think of the janitor accidentally pulling out a plug to clean). Make sure they meet your need for access to servers, etc.
Change control
- Starts in development: use source control for all code, including SQL
- Establish acceptance criteria
- Clear handoff points
- Run plans: as Kenny Rogers sang in "The Gambler", "Know when to hold 'em, know when to fold 'em." Have a complete backout strategy/contingency plan, with clean uninstalls.
Staffing
- Choose the right people for each role. Don't put your brand new, junior DBA in charge of your disaster recovery scenario; have a seasoned veteran with real production experience handle the high-pressure situations. Have shifts (if necessary) that are well established, so you know who to expect and are not caught with your pants down.
- Establish a chain of command. Know who does what and what the escalation points are. This eliminates confusion, and it should be documented.
- Have contact information on hand (cell phone, extension, pager). Contact info speeds up the response in an emergency.
- Document schedules. You shouldn't have to guess who is on duty and who is on call.
- Proper training. Without training, you can never properly manage your environment. Make sure it covers process as well as technology.
Disaster recovery plan
- One of the keys to HA; without it, you might as well not do HA
- Document all steps, who to contact, etc.
- Test the plan: recurring, documented
Run book
- A centralized collection of documents for easy reference; make sure all aspects are kept up to date
- Backup file information: primary, secondary, and tertiary locations. Know where your backups are (don't guess in the heat of the moment) and record whether they were good.
- Contact information
- Location of software, licensing keys, and support numbers with customer info. If you need to rebuild a server, you had better have access to the software and keys.
- System configuration (OS, SQL Server, service packs, registry settings, disk configuration, mapped drives)
- Database schemas, jobs, specific database setups
- Etc.

Monitoring for HA
Two theories:
- All counters, 100% of the time
- Only what is needed
Don't forget Profiler
Coordinate with event logs, SQL logs, IIS logs, etc.
- Watch for time differences between servers
HA is a total solution, not just SQL
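Coordinating logs from several servers only works once their clocks are reconciled. A minimal sketch, assuming the clock offset of each server against a reference clock has already been measured; the function name, the data layout, and the server names are illustrative only.

```python
from datetime import datetime, timedelta


def merge_logs(logs, offsets):
    """Merge per-server log entries into one timeline on a common clock.

    logs:    {server: [(local_timestamp, message), ...]}
    offsets: {server: how far that server's clock runs AHEAD of the reference}
    """
    merged = []
    for server, entries in logs.items():
        skew = offsets.get(server, timedelta(0))
        for local_time, message in entries:
            # Subtract the skew so every entry is expressed in reference time.
            merged.append((local_time - skew, server, message))
    return sorted(merged)


if __name__ == "__main__":
    noon = datetime(2002, 6, 1, 12, 0, 0)
    logs = {
        "web01": [(noon + timedelta(minutes=2, seconds=5), "HTTP 500 returned")],
        "sql01": [(noon + timedelta(seconds=10), "blocking detected")],
    }
    # Hypothetical measurement: web01's clock runs two minutes fast.
    for when, server, message in merge_logs(logs, {"web01": timedelta(minutes=2)}):
        print(when.time(), server, message)
```

Without the correction the SQL Server entry would appear to come first, which could point the investigation at the wrong tier.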

Backup and Restore
Develop a backup strategy:
- Full database backups
- File/filegroup backups
- Transaction log backups
- OS disk image
Test backups on another server
Rotate tapes off-site
- Use professional services
- Make sure they follow good data center principles

Backup and Restore (cont.)
Test the recovery plans:
- Locate the tapes
- Test using the graphical interface
- Test using scripts only
- Test with different people on every team
How long does it take? You will get to know the CEO/CFO/CIO when an important server is down: "Is it ready yet? Is it ready yet?"

Designing a Disaster Recovery Plan
- One of the keys to HA
- Without it, pray that everything works
- Different plans:
  - Site down
  - Server down
  - Data loss
- Document the plan (keep it up to date) and test, test, TEST!
- Record results and lessons learned
- Store backups off-site, including the operations manual (run book)
Pivotal: this is the execution plan for downtime (generally unplanned). Keep it up to date, make sure it is well tested, and pray you never have to use it. If it is tested, you should have the confidence that it will work.

Site Down
Disaster caused by nature or by people
- A good case for geoclusters/log shipping
- Every minute may be crucial, so having a hot/warm/cold standby is crucial
- Without a hot/warm/cold standby, be prepared to rebuild, and pray that you have good, recent backups

Server Down
This tends to be a hardware/software failure or a human error
- Another good argument for geoclusters/log shipping and redundancy
- If you need to rebuild, have on hand:
  - System configuration (run book)
  - Tapes/CDs available
  - Software, CD keys
  - Support numbers

Data Loss
Human error? Hardware failure?
- Can you roll back or fix the problem (e.g. undo via the application, a SQL statement, or a tool such as Lumigent)?
- These are the cases where a real, proven, tested backup/restore plan will save you

Basic Questions
- Is it mission critical? What happens when it is down? (Lost money? Lost lives?)
- How much does not having HA impact the business? Calculate what being out of service would cost
- What industry?
- Is it OLTP or DSS?
- What is the budget?
- What availability and performance does the end user expect?
Start here at the most fundamental level and forget systems (i.e. hardware and software). Take it to the 10,000-foot level. Know the end users (internal, external, or both) and their requirements in terms of availability and performance. This comes into play in the DBA's day-to-day role: is a possible 10 to 20% performance hit on query execution acceptable if it means greater uptime? The type of activity affects how the system is architected from an application and hardware perspective. Never try to retrofit, because it often doesn't work. Money is always a factor. You may only be able to do the best you can with a limited budget; some HA is better than none. This is important: people tend to focus on how much systems cost, but what is the actual cost of downtime? In an e-commerce environment, each second means money lost.
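"Calculate what being out of service would cost" can start as simple arithmetic. The sketch below assumes downtime cost is lost revenue per hour times hours down, which ignores harder-to-measure costs such as reputation; the function names and all figures are hypothetical.

```python
HOURS_PER_YEAR = 8760


def annual_downtime_cost(availability_percent: float, revenue_per_hour: float) -> float:
    """Estimated yearly cost of downtime at a given availability level."""
    hours_down = (100 - availability_percent) / 100 * HOURS_PER_YEAR
    return hours_down * revenue_per_hour


def ha_pays_off(current_percent: float, target_percent: float,
                revenue_per_hour: float, annual_ha_cost: float) -> bool:
    """True if the downtime cost avoided exceeds what the HA solution costs per year."""
    saved = (annual_downtime_cost(current_percent, revenue_per_hour)
             - annual_downtime_cost(target_percent, revenue_per_hour))
    return saved > annual_ha_cost


if __name__ == "__main__":
    # Hypothetical shop losing 1,000 per hour of outage, moving from 99% to 99.9%.
    print(f"Cost at 99%:   {annual_downtime_cost(99, 1000):,.0f}")
    print(f"Cost at 99.9%: {annual_downtime_cost(99.9, 1000):,.0f}")
    print("Worth 50,000 a year?", ha_pays_off(99, 99.9, 1000, 50_000))
```

Putting a number on downtime, even a rough one, turns the budget question into a comparison rather than a guess.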

Selecting the Right Technology
- No guide fits every situation 100%
- Make sure the technology is supportable in your environment
- Just because something is fashionable does not make it right; e.g. although failover clustering is the best option in most situations, it is not always the right choice

Invest in Your App?
- Contrary to common belief, no matter how reliable the hardware is, bad app = low availability
- Invest heavily in the development of your application
- New project? Best-case scenario: do it right from the start

Invest in Your App? (cont.)
- Existing environment/app?
  - Evaluate the infrastructure
  - Need a new maintenance strategy? New hardware? Data migration? App redesign?
- Growth: capacity, scalability
  - Tune indexes, schema, maintenance; possible design changes
  - What if the hardware cannot scale any further? Minimize the downtime for the upgrade

Designing Your Database Application for HA
- Involve your developers from the start
- Use versioning and source control for all code, including SQL
- Handle deployment with care: build programs with clean installers and uninstallers
- Use appropriate technologies in the code
1st bullet: HA is a mindset, so if your developers are not thinking about it, it won't be built in. The cost of retrofitting it into code is usually big (as are any wholesale changes).
2nd bullet: keeps paths separate. Don't mix and mingle, because you'll never have the "true" build that gets into production.
3rd bullet: it is important to version things and keep master copies to compare against in a production environment if something goes awry.
4th bullet: critical for IT.
5th bullet: for example, if you use failover clustering, use the clustering APIs to make your application cluster aware, or code graceful error messages and the like to handle events. The user experience is key.

Designing Your Database Application for HA (cont.)
- Set up development and testing environments, plus one that exactly matches production (with current production data)
- Plan for outages and how to handle them; don't leave it to IT
- Read-only data should be cached to improve application performance (XML)
1st bullet: this is key if you can do it. Dev should be different from test, which should be different from staging, so that different groups can do their jobs without stepping on each other's toes. There should also be one pristine copy that is an exact replica of the production box to assist in troubleshooting, so you know you are comparing apples to apples.
2nd bullet: design HA in from the start; don't make it an afterthought.
3rd bullet: reduces I/O contention and also increases app performance.
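The caching bullet can be illustrated with a small time-based cache. The slide mentions XML as the cache format; this sketch keeps the data in memory instead and assumes a loader function and a refresh interval, both of which are choices made for the example.

```python
import time


class ReadOnlyCache:
    """Caches the result of `loader` and reloads it after `ttl_seconds`."""

    def __init__(self, loader, ttl_seconds=300.0, clock=time.monotonic):
        self._loader = loader        # hypothetical function that queries the database
        self._ttl = ttl_seconds
        self._clock = clock          # injectable so the cache can be tested
        self._value = None
        self._loaded_at = None       # None means nothing has been loaded yet

    def get(self):
        now = self._clock()
        if self._loaded_at is None or now - self._loaded_at >= self._ttl:
            self._value = self._loader()  # the only call that reaches the database
            self._loaded_at = now
        return self._value


if __name__ == "__main__":
    def load_catalog():
        print("querying the database...")
        return {"sku-1": "Widget", "sku-2": "Gadget"}

    catalog = ReadOnlyCache(load_catalog, ttl_seconds=60)
    catalog.get()   # first call loads the data
    catalog.get()   # served from memory, no database I/O
```

Every request answered from the cache is one less query competing for I/O on the database server.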

- Resolve any application problem that directly affects SQL Server
  - Locking/blocking
  - Optimize queries/indexing
  - Don't do heavy processing in the database (cursors, large stored procedures)
  - Use stored procedures for insert/update/delete (IUD)
  - Make sure statistics are up to date
1st bullet: HA goes down to the code level; using Access to build your queries may not be the optimal way. Have your DBAs or qualified T-SQL experts write them, and have the developers build them in. Locking and blocking can bring down a SQL Server; remove any long-running queries and transactions (they cause perceived unavailability). Make sure all queries are optimized, and use the right indexes to reduce execution time (which reduces I/O). Extended stored procedures are like any other piece of C/VB code; treat them as such. Auto update statistics may not work for everyone, which can lead to long-running queries, so use DBCC commands or drop/rebuild as part of normal maintenance.

- Where possible, write stateless applications; if state is kept, choose an appropriate way to keep it
- The user experience must be positive
- Security
- Make your app compete and coexist with the other apps it has to live with
- Don't code for a specific service pack/version (OS/SQL)
- Let the requirements define the technology
1st bullet: it is best not to have to worry about state in an application. If you do, use middleware (e.g. MS DTC, MSMQ, MTS) if necessary, or something else such as XML, but then you need to worry about security (XML) or about HA for the middleware components. Using Component Services, available in Microsoft Windows 2000, with a COM+ object to achieve a two-phase commit is also possible. Remember, however, that two-phase commits may affect performance.
2nd bullet: the user experience is key. Provide graceful error messages and handle events such as failovers gracefully, especially if the app connects to a SQL Server back end that uses failover clustering and is not cluster aware. Use timeout values effectively; otherwise, in an ASP app for example, you may cause connections to be spawned, and with too long a timeout you may see things like ASP queuing. Do you have clients reconnect and tell them so, or do you build a retry into the application? Know what users expect, and make sure there are no duplicate transactions. If the back-end servers must be transparent to the client and no client configuration can be done, that fact will dictate which technology or technologies can be used (such as clustering or log shipping). This should be considered at the design stage.
3rd bullet: use integrated security; try not to use standard security. If special cases arise where other users need access to the application (such as UNIX or Macintosh users), create logic in the COM layer to handle it. Using the standard SQL Server system administrator (sa) account in the application and its related tasks, packages, scripts, and so on is not recommended. Also, know the security implications and limitations of the technology being deployed; they may impact the solution.
4th bullet: benchmark the app's requirements to know whether it can coexist, and think about how it works so you don't put two conflicting workloads on one server. This also helps to size servers. It is crucial for log shipping, where one SQL Server may house different workloads and databases. Coding for a specific release may be a bad idea, especially if log shipping is involved: your failover plan may not work.
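The retry and duplicate-transaction points in the notes above can be sketched together. This is an illustration under assumptions, not a prescribed design: `TransientError` stands in for whatever connection error the data access layer raises during a failover, and `OrderService` is a hypothetical back end that remembers request ids.

```python
import time


class TransientError(Exception):
    """Stand-in for a connection failure seen while a failover is in progress."""


def run_with_retry(operation, attempts=3, base_delay=0.5, sleep=time.sleep):
    """Call `operation`, retrying on TransientError with exponential backoff."""
    for attempt in range(1, attempts + 1):
        try:
            return operation()
        except TransientError:
            if attempt == attempts:
                raise  # out of retries: surface a graceful error to the user
            sleep(base_delay * 2 ** (attempt - 1))


class OrderService:
    """Hypothetical back end that applies each request id at most once."""

    def __init__(self):
        self.applied = {}

    def submit(self, request_id, amount):
        # A retried request carries the same id, so it cannot be applied twice.
        if request_id not in self.applied:
            self.applied[request_id] = amount
        return self.applied[request_id]


if __name__ == "__main__":
    service = OrderService()
    state = {"calls": 0}

    def place_order():
        state["calls"] += 1
        if state["calls"] == 1:
            raise TransientError("server unavailable, failover in progress")
        return service.submit("req-42", 100)

    print(run_with_retry(place_order, base_delay=0.1))
```

The bounded number of attempts is the timeout decision from the notes in code form: retry long enough to ride out a failover, but not so long that requests queue up behind a dead connection.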

- Use fully qualified names for tables and stored procedures
- Put all objects in user databases, not system databases
- Don't hard-code server names, instance names, or IP addresses in the application
- Reuse database connections (connection pooling)
Fully qualified names: besides being a best practice and improving cache hit ratios, this eliminates confusion about which object to execute if two databases have a procedure with the same name.
Objects in user databases: this eliminates confusion, and if you accidentally forget to create the object on the standby server in the case of log shipping, you may be in trouble.
Hard-coded names: these make an application inflexible and probably incompatible with any backup plan; you may never be able to reach the new server. Instead, make the connection through a COM+ object, which gives you not only a more flexible application but also a more flexible disaster recovery plan. Or talk to an ODBC/ADO/OLE DB provider and just change the underlying DSN.
Connection pooling: closing and opening new connections can be expensive.
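A minimal sketch of keeping server names out of the code, assuming the settings come from environment variables. The variable names `APP_DB_SERVER` and `APP_DB_NAME` are hypothetical, and the string is built in the common ODBC keyword=value style; check the keywords against the driver actually in use.

```python
import os


def connection_string(env=os.environ) -> str:
    """Build the connection string from configuration instead of literals."""
    server = env.get("APP_DB_SERVER")    # hypothetical setting name
    database = env.get("APP_DB_NAME")    # hypothetical setting name
    if not server or not database:
        raise RuntimeError("APP_DB_SERVER and APP_DB_NAME must be set")
    # Integrated security, in line with the security advice in this deck.
    return (
        f"DRIVER={{SQL Server}};SERVER={server};"
        f"DATABASE={database};Trusted_Connection=yes"
    )


if __name__ == "__main__":
    # After a role change, only the configuration changes; the code does not.
    primary = {"APP_DB_SERVER": "sqlprod01", "APP_DB_NAME": "Sales"}
    standby = {"APP_DB_SERVER": "sqlstandby01", "APP_DB_NAME": "Sales"}
    print(connection_string(primary))
    print(connection_string(standby))
```

This plays the same role as the DSN mentioned in the notes: one place to repoint the application when the standby takes over.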

- Create custom SQL Server errors if necessary, but make sure they don't conflict with the app's own custom errors
- Review database collations
- Unique user names and logins
- Make sure one app's jobs don't conflict with those of other apps
- Small transactions (rollback in failover clustering and log shipping)
Custom errors: when coding custom error messages, if the backup/failover server is going to host another application or database, make sure that one does not conflict with the other. Also make sure you do not accidentally replace system error messages. This is especially important if you are going to log ship more than one database to a warm standby server.
Collations: since Microsoft SQL Server 2000 can support collations at a more granular level, and not just at the server level, make sure the backup database plan takes the proper collation into account.
Users and logins: make sure application users and their corresponding logins are unique. With certain high availability solutions, conflicts may occur if two applications share the same user name with different rights and responsibilities (e.g. a user may get more rights than they should). This is especially important if you are going to log ship more than one database to a warm standby server.
Jobs: if the application requires batch, scheduled, or other jobs that run at various times, mainly in off hours, make sure they will not interfere with other applications that may be part of the disaster recovery plan. You don't want to cause potential unavailability because SQL is doing something.
Small transactions: logical units of work and atomic transactions matter most for failover clustering, but also for log shipping. Since failover clustering goes through the rollback/roll-forward process, the smaller the transactions and the shorter the unit of work, the faster things will go. Log shipping is affected in a similar way.
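The "small transactions" advice can be shown with a purge job that commits in batches instead of in one large transaction. The sketch uses Python's built-in `sqlite3` purely as a runnable stand-in for SQL Server; the `events` table and the batch size are hypothetical.

```python
import sqlite3


def purge_in_batches(conn: sqlite3.Connection, batch_size: int = 1000) -> int:
    """Delete processed rows in many short transactions; return rows deleted."""
    total = 0
    while True:
        with conn:  # commits on success, rolls back if the block raises
            cursor = conn.execute(
                "DELETE FROM events WHERE id IN "
                "(SELECT id FROM events WHERE processed = 1 LIMIT ?)",
                (batch_size,),
            )
        if cursor.rowcount == 0:
            return total
        total += cursor.rowcount


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE events (id INTEGER PRIMARY KEY, processed INTEGER)")
    conn.executemany(
        "INSERT INTO events (processed) VALUES (?)", [(1,)] * 2500 + [(0,)] * 10
    )
    conn.commit()
    print("rows purged:", purge_in_batches(conn))
```

If a failover interrupts this job, at most one batch has to be rolled back, rather than the whole purge.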

In Summary
- Once the server is running, leave it alone...
- The four pillars of HA: design, people, processes, technology (buying a new cluster and installing SS2KEE is not enough)
One thing is certain: buying a new cluster and installing SS2KEE is NOT enough. The hardware must be reliable. The software must be properly configured. Both must be properly supported under enterprise-level agreements.

In Summary (cont.)
- 9: Almost unmanaged
- 9: Well-managed hardware
  - Can tolerate some hardware failures
- 9: Good management and planning
  - Can tolerate most hardware failures
  - Can tolerate normal operational tasks (e.g. software upgrades)
  - Can tolerate some software failures
- 9: Excellence in operations, design, and planning
  - Can withstand most outages, planned or unplanned
  - Can tolerate some operational failures
This slide shows graphically how difficult it is to get to five nines, wrapping up the past few slides.

For More Information...
- SQL Server Technical Information: lots of info and links, including:
  - SQL Server 2000 Operations Guide
  - SQL Server 2000 Resource Kit (info only; you need the printed book for the CD-ROM)
  - SQL Server 2000 Failover Clustering Whitepaper
- Capacity Planning: Microsoft SQL Server 2000 Administrator's Companion
- MS Support Homepage (Q Articles)
