Threaded Programming Methodology

Presentation transcript:

Threaded Programming Methodology (Intel Software College)

Objectives: by the end of this module, you will be able to prototype and estimate the effort required to thread time-consuming regions.

Purpose of the slide: states the objectives of this module. Details: this course module walks through the process of migrating a serial program to a parallel (threaded) one, using the OpenMP model and the Intel tools VTune (to identify the code section most profitably threaded), Thread Checker (to identify any coding issues specific to threading), and Thread Profiler (to identify performance issues specific to threading). It can stand alone, be used as the sole threading session in a more general course, or as part of an overall threading course. There are 9 lab activities in the sequence. Note: most of the slides use complex builds, so be sure to become familiar with them.

Agenda: a generic development cycle; case study: prime number generation; some common performance issues.

Purpose of the slide: outlines the topics addressed to achieve the module objective. Details: the "primes" code, which finds all prime numbers up to a specified upper bound, is the example used throughout. Be aware that the prime-finding algorithm employed in this case study is deliberately unsophisticated (it scales as O(N^(3/2))) so that it can be quickly understood by the students; better approaches exist, but they are not pertinent to the matters addressed here.

What is parallelism? Two or more processes or threads execute at the same time. Parallelism on multi-core architectures takes two forms: multiple processes, which communicate through IPC (inter-process communication), or a single process with multiple threads, which communicate through shared memory.

Purpose of the slide: to frame the discussion; the parallel model used in this session is the highlighted one: single process, multiple threads, shared memory.

Serial code limits speedup: Amdahl's law. It describes the upper bound on speedup from parallel execution. With n processors and a parallelizable fraction P of the serial run time:

Tparallel = ((1 - P) + P/n) * Tserial
Speedup = Tserial / Tparallel

For example, with P = 0.5: on n = 2 processors, Tparallel = (0.5 + 0.25) * Tserial, so the speedup is 1.0/0.75 = 1.33; as n approaches infinity, Tparallel approaches 0.5 * Tserial, so the speedup approaches 1.0/0.5 = 2.0.

Purpose of the slide: explains and illustrates Gene Amdahl's observation about the maximum theoretical speedup, from parallelization, for a given algorithm. Details: the build starts with a serial code taking time Tserial, composed of a parallelizable portion P and the rest, 1 - P, in this example in equal proportions. If P is perfectly divided into two parallel components (on two cores, for example), the overall time Tparallel is reduced to 75%. In the limit of very large, perfect parallelization of P, the overall time approaches 50% of the original Tserial. Questions to ask students: does overhead play a role? (Yes; this primes the next slide, where threads are more efficient than processes.) Are unwarranted assumptions built in about scaling, that is, do the serial and parallel portions increase at the same rate with increasing problem size? This can lead to a brief aside about the complementary point of view, Gustafson's law, which assumes the parallel portion grows more quickly than the serial one.
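To make the arithmetic concrete, here is a minimal sketch (added here, not part of the course material) that evaluates Amdahl's formula for a given parallel fraction and processor count; the function name amdahl_speedup is an illustrative choice.

#include <cstdio>

// Amdahl's law: speedup = 1 / ((1 - P) + P / n)
// P = parallelizable fraction of the serial run time, n = number of processors.
static double amdahl_speedup(double P, double n) {
    return 1.0 / ((1.0 - P) + P / n);
}

int main() {
    printf("P=0.50, n=2   -> %.2fx\n", amdahl_speedup(0.50, 2));    // 1.33x
    printf("P=0.50, n=inf -> %.2fx\n", amdahl_speedup(0.50, 1e9));  // ~2.00x
    printf("P=0.96, n=2   -> %.2fx\n", amdahl_speedup(0.96, 2));    // ~1.92x, used later for primes
    return 0;
}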

Processes and threads. Modern operating systems load programs as processes; a process is both a resource holder and a unit of execution. A process starts by executing its entry point (main) as a thread, and threads can create other threads within the process. Each thread gets its own stack; all threads within a process share the code and data segments.

Speaker's notes. Purpose of this slide: show the difference between processes (which students should have heard about) and threads (which the students may not know). Details: operating system theory assigns processes the two roles listed here. Resource holder refers to the job of the process to "hold" memory, file pointers, and other system resources that have been assigned to it. Execution is the thread within the process that processes the instructions of the code and uses the resources held. When the process terminates, all resources are returned to the system; any active threads still running are terminated and the resources assigned to them (stack and other local storage, etc.) are returned to the system. Background: there are ways to have threads continue to execute after the parent process has terminated, but this topic will not be covered.

Threads: benefits and risks. Benefits: higher performance and better resource utilization, even on single-processor systems, to hide latency and improve responsiveness; communication through shared memory is more efficient than between processes. Risks: increased application complexity; harder to debug (race conditions, deadlocks, etc.).

Purpose of the slide: list some of the benefits and risks (costs) of threaded applications. Details: the first benefit refers primarily to task parallelism (discussed in detail in the ISC module "Multi-core Programming: Basic Concepts"). The second benefit is relative to processes (since threads share data, there is minimal overhead for "inter-process communication"). Debugging is difficult because the bugs are non-deterministic: they may not occur on every test run, and a QA process designed for serial code will very likely miss bugs in threaded code.

Common questions when threading an application: Where to thread? How long would it take to thread? How much re-design/effort is required? Is it worth threading a selected region? What should the expected speedup be? Will the performance meet expectations? Will it scale as more threads/data are added? Which threading model should be used?

Purpose of the slide: states key considerations for developers beginning to thread an application. Details: Where to thread? => where in the application; the hotspots. How long would it take to thread? => in developer time (cost estimate). How much re-design/effort is required? => a factor in developer time (refining the cost estimate). Is it worth threading a selected region? => estimating the benefit. What should the expected speedup be? => quantitative; we want to approach the Amdahl's-law limit. Will the performance meet expectations? => if that limit is achieved, is the effort worthwhile? Will it scale as more threads/data are added? => very important: future platforms are expected to have additional cores. Which threading model to use? => for compiled apps, this is typically a choice between native models and OpenMP.

Prime number generation. The serial code:

bool TestForPrime(int val)
{   // let's start checking from 3
    int limit, factor = 3;
    limit = (long)(sqrtf((float)val)+0.5f);
    while( (factor <= limit) && (val % factor) )
        factor++;
    return (factor > limit);
}

void FindPrimes(int start, int end)
{
    int range = end - start + 1;
    for( int i = start; i <= end; i += 2 ) {
        if( TestForPrime(i) )
            globalPrimes[gPrimesFound++] = i;
        ShowProgress(i, range);
    }
}

Purpose of the slide: explain the prime number algorithm and code to be used for the 9 lab activities. Details: each build step illustrates one pass of the while loop, trying successive odd factors (3, 5, 7, ...) against each candidate i (61, 63, 65, ..., 79 in the animation). When a factor divides i evenly (as with 9/3 or 15/3), the slide colours that i and gPrimesFound is not incremented. At the end of the loop the slide shows 7 uncoloured numbers, and the final popup shows the console output, where the program reports finding 8 primes in total between 1 and 20.
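For readers following along without the lab files, here is a minimal, self-contained serial sketch that mirrors the slide code; the global arrays, the simplified ShowProgress stub, and main are assumptions added here to make it compile, not part of the original PrimeSingle project.

#include <cstdio>
#include <cstdlib>
#include <cmath>

static int globalPrimes[1000000];   // assumed capacity, large enough for the lab ranges
static int gPrimesFound = 0;
static int gProgress = 0;

bool TestForPrime(int val)
{   // let's start checking from 3
    int limit, factor = 3;
    limit = (long)(sqrtf((float)val) + 0.5f);
    while ((factor <= limit) && (val % factor))
        factor++;
    return (factor > limit);
}

void ShowProgress(int val, int range)
{   // simplified progress stub; the real lab version prints a percentage
    gProgress++;
}

void FindPrimes(int start, int end)
{   // start is always odd
    int range = end - start + 1;
    for (int i = start; i <= end; i += 2) {
        if (TestForPrime(i))
            globalPrimes[gPrimesFound++] = i;
        ShowProgress(i, range);
    }
}

int main(int argc, char *argv[])
{   // usage: primes <start> <end>, e.g. primes 1 1000000
    int start = (argc > 1) ? atoi(argv[1]) : 1;
    int end   = (argc > 2) ? atoi(argv[2]) : 20;
    if (start % 2 == 0) start++;    // keep start odd, as the loop assumes
    FindPrimes(start, end);
    printf("%d primes found between %d and %d\n", gPrimesFound, start, end);
    return 0;
}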

Activity 1: run the serial version of the primes code. Locate the PrimeSingle directory, build with the Intel compiler, and run it a few times with different ranges.

Purpose of the slide: refers students to the 1st lab activity, whose purpose is to build the initial, serial version of the application. Details: detailed instructions are provided in the student lab manual. Background: this exercise assumes that the student is familiar with building applications within Visual Studio and can invoke the Intel compiler (make sure this is at least approximately true). Though no coding is required at this stage, it is a good break from the lecture and prepares the way for further work.

Development methodology. Analysis: find the code where compute-intensive work is done. Design (introduce threads): determine how to implement a threaded solution. Debug: detect any problems that result from using threads. Tune for performance: achieve the best parallel performance.

Purpose of the slide: define the methodology to use when migrating a serial application to a threaded one. Details: don't rush this slide; each of these four steps has one or more associated lab activities using the primes code.

Development cycle. Analysis: VTune™ Performance Analyzer. Design (introduce threads): Intel® Performance Libraries (IPP and MKL), OpenMP* (Intel® Compiler), explicit thread creation (Win32*, Pthreads*). Debugging: Intel® Thread Checker, Intel Debugger. Tuning for performance: Intel® Thread Profiler.

Purpose of the slide: assigns details to, and visually reinforces, the points made on the previous slide; specific tools and threading models are inserted into the general outline, and the iterative nature of both debugging and the overall development cycle is pointed out. Details: each of these steps will be addressed in detail during this session.

Analysis (sampling): identify the time-consuming regions. Use VTune sampling to find the hotspots in the application. We will use the PrimeSingle project for the analysis; usage: ./PrimeSingle 1 1000000. Sampling points at TestForPrime as the hotspot, and the corresponding source (the TestForPrime/FindPrimes code shown earlier) is displayed alongside.

Purpose of the slide: introduce and explain further the role of VTune sampling. Details: the slide build first states the workload (find all primes between 1 and 1000000), then shows an extract from the VTune user interface highlighting the function TestForPrime, then shows the corresponding source code fragment.

Analysis (function call graph): used to find the right level in the call tree at which to thread. This is the level in the call tree where we need to thread: FindPrimes.

Purpose of the slide: introduce and explain the role of VTune Call Graph. Details: the initial view is an excerpt from the call graph user interface (bold red arrows show the busiest branch), followed by the assertion that FindPrimes is the right level to thread. Background: coarse-grained parallelism is generally more effective (thread one level higher than the hotspot). Question to ask students: why is this the right level; why not inside TestForPrime? (Be sure you know the answer yourself: look at the code and imagine the thread call inside TestForPrime.)

Analysis. Where to thread? FindPrimes(). Is it worth threading the selected region? It appears to have minimal dependencies, it appears to be data-parallel, and it consumes over 95% of the execution time. Take a baseline measurement.

Purpose of the slide: further analysis of the assertion made on the previous slide; also introduces the baseline measurement. Details: the bullet points illustrate key considerations for the threading decision. Note that the final build on the slide, showing a baseline timing, is sudden (really a non sequitur; the sequencing could be better), so don't be surprised. The baseline timing is part of the overall analysis, necessary to measure the impact of any threading effort; now is as good a time as any to introduce it.

Activity 2: run the code with the range '1 5000000' to get a baseline measurement, and note it for future reference. Then run the VTune sampling analysis on the serial code. Which function takes most of the time?

Purpose of the slide: refers students to the 2nd lab activity, whose purpose (as stated on the slide) is to generate a baseline serial-code measurement and run the VTune sampling analysis. Details: detailed instructions are provided in the student lab manual.

Foster's design methodology, from "Designing and Building Parallel Programs" by Ian Foster. Four steps: Partitioning (divide computation and data), Communication (share data between computations), Agglomeration (group tasks to improve performance), Mapping (assign tasks to processors/threads).

Purpose of the slide: introduce Foster's design methodology for parallel programming. Details: this somewhat long digression, 8 slides of parallel design points and examples, is intended to prepare the design discussion for our own primes example. Ian Foster's 1994 book is well known to practitioners of this dark art, and it is available online (free!) at http://www-unix.mcs.anl.gov/dbpp/.

Designing parallel programs. Starting from the problem: partition (divide the problem into initial tasks), communicate (determine the amount and pattern of communication between tasks), agglomerate (combine tasks), and map (assign the agglomerated tasks to the threads created), yielding the final program.

Purpose of the slide: to graphically illustrate the 4 steps in Foster's design methodology.

Parallel programming models. Functional decomposition: task parallelism; divide the computation and associate data with it; independent tasks of the same problem. Data decomposition: the same operation executed on different data; divide the data into pieces and associate computation with them. A minimal sketch of each style follows below.

Purpose of the slide: introduce the primary conceptual partitions in parallel programming: task and data. Details: task parallelism has traditionally been used in threaded desktop apps (partitioning among screen update, disk read, print, etc.), data parallelism in HPC apps; both may be appropriate in different sections of an app.
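As an illustration added here (not from the slides), the two styles look like this in OpenMP; update_display and read_disk are hypothetical stand-ins for independent tasks.

#include <omp.h>
#include <cstdio>

// Hypothetical independent tasks standing in for real work.
static void update_display() { printf("updating display\n"); }
static void read_disk()      { printf("reading from disk\n"); }

void task_parallel_example()
{   // functional decomposition: different tasks run on different threads
    #pragma omp parallel sections
    {
        #pragma omp section
        update_display();
        #pragma omp section
        read_disk();
    }
}

void data_parallel_example(float *a, int n)
{   // data decomposition: the same operation applied to different pieces of the data
    #pragma omp parallel for
    for (int i = 0; i < n; i++)
        a[i] *= 2.0f;
}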

Decomposition methods, illustrated with a climate code composed of an atmosphere model, an ocean model, a land-surface model, and a hydrology model. Functional decomposition: focusing on the computation can reveal structure in a problem. Domain decomposition: focus on the largest or most frequently accessed data structure; data parallelism, the same operation applied to all the data. (Grid reprinted with permission of Dr. Phu V. Luong, Coastal and Hydraulics Laboratory, ERDC.)

Purpose of the slide: illustrates, by example, the task and data decompositions in one application (weather modeling). Details: each domain (atmosphere, hydrology, etc.) can be treated independently, leading to a task-parallel design; within each domain, data parallelism may be applied as appropriate.

Pipeline decomposition: the computation is done in independent stages. Functional decomposition: each thread is assigned one stage to compute (an automobile assembly line). Data decomposition: each thread processes all the stages of a single instance (one worker builds an entire car).

Purpose of the slide: introduce pipelined decomposition, which can apply to either task (called "functional" on this slide) or data decomposition. Details: this is the first of 3 slides on the topic; the next two illustrate the concept with an example.

The LAME encoder. LAME MP3 encoder: an open-source project and educational tool. The goals of the project are to improve the quality and the speed of MP3 encoding.

Purpose of the slide: introduce a particular application, the LAME audio encoder, to set up the next slide showing LAME in a pipelined decomposition. Details: this slide serves as a quick backgrounder on the LAME code (not all students will have heard of it). The "LAME MT" project (full description, with source code) is available online at: http://softlab.technion.ac.il/project/LAME/html/lame.html

LAME pipeline strategy. Processing each frame goes through four stages: Prelude (fetch the next frame, characterize it, set encoder parameters), Acoustics (FFT, long/short block analysis, assemble and apply filters, noise shaping), Encoding (quantize and count bits), and Other (add the frame header, verify correctness, write to disk). In the pipelined schedule, thread T1 runs Prelude for frames N, N+1, N+2, ..., while T2 runs Acoustics, T3 Encoding, and T4 Other on earlier frames, with a hierarchical barrier keeping the stages in order.

Purpose of the slide: show how the LAME compute sequence maps to a pipelined threading approach. Details: each thread (T1, ..., T4 on the slide) "specializes" in one operation, using results prepared by another thread.

Design: rapid prototyping with OpenMP. What is the expected benefit? How do we achieve it with the least effort? How long will it take to thread? How much redesign effort is required? With roughly 96% of the run time in the parallel region, Amdahl's law gives Speedup(2P) = 100/(96/2 + 4) = ~1.92X.

Purpose of the slide: return to the primes example to approach the design stage, and introduce OpenMP as a "prototyping" thread model. Details: although OpenMP is introduced here for prototyping, it may of course prove efficient enough to be the thread model of choice for this example. Question to ask students: where does this 2P speedup claim come from?

OpenMP. Fork-join parallelism: the master thread spawns a team of threads as needed, and parallel regions alternate with serial execution by the master thread. Parallelism is added incrementally: a sequential program evolves into a parallel program one region at a time.

Purpose of the slide: a conceptual introduction to OpenMP. Details: the key point (on the slide) is that threading can be introduced one region at a time, which is not generally true of native threading models. Background: OpenMP was launched as a standard in 1997. Industry collaborators included Intel but not Microsoft (who were invited but not interested at the time); Microsoft now (2006) supports OpenMP in its compilers.

OpenMP design. The pragma creates threads for this parallel region and divides the iterations of the for loop among them:

#pragma omp parallel for
for( int i = start; i <= end; i += 2 ) {
    if( TestForPrime(i) )
        globalPrimes[gPrimesFound++] = i;
    ShowProgress(i, range);
}

Purpose of the slide: show a specific OpenMP syntax applied to the primes code. Details: a key point is that because this is introduced by pragmas, the original source code is not touched (native threading methods require changes to the serial sources). Note that the parallel region, in this case, is the for loop. The final slide build shows the results (number of primes and total time) for the image created with this pragma.
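As a sketch of how the pragma drops into the lab code (the surrounding FindPrimes body is reproduced from the earlier slide, and the extern declarations are added here only to make the fragment self-contained), note that at this point the shared updates are still unsynchronized, which is exactly what the debugging step below uncovers. Built with an OpenMP-capable compiler, e.g. the Intel compiler's /Qopenmp switch mentioned later in this module.

#include <omp.h>

extern int  globalPrimes[];
extern int  gPrimesFound;
extern bool TestForPrime(int val);
extern void ShowProgress(int val, int range);

void FindPrimes(int start, int end)
{   // start is always odd
    int range = end - start + 1;
    // OpenMP splits the loop iterations across the threads in the team;
    // gPrimesFound++ and the counters inside ShowProgress are still shared,
    // unsynchronized updates at this stage (the race found by Thread Checker).
    #pragma omp parallel for
    for (int i = start; i <= end; i += 2) {
        if (TestForPrime(i))
            globalPrimes[gPrimesFound++] = i;
        ShowProgress(i, range);
    }
}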

Activity 3: run the OpenMP version of the code. Locate the PrimeOpenMP directory and solution, build the code, and run with '1 5000000' for comparison. What is the speedup?

Purpose of the slide: refers students to the 3rd lab activity, whose purpose is to build and run an OpenMP version of primes. Details: detailed instructions are provided in the student lab manual. No programming is required for this lab.

Design, revisited. What is the expected benefit? How do we achieve it with the least effort? How long did it take to thread? How much redesign effort is required? Is this the best possible speedup? The measured speedup is 1.40X, lower than the predicted 1.92X.

Purpose of the slide: discuss the results obtained in the previous lab activity. Details: the speedup was lower than expected; now what? Transition quote: "But inefficient speedup is not our first concern..."

Debugging. Is the threaded implementation correct? No: the results are different on every run.

Purpose of the slide: introduce and stress the importance of correctness. Details: in the example shown, each run produces a different number; the bug is non-deterministic. On some platforms the answer may be correct 9 times out of 10 and slip through QA. Students can test their own implementation (from the previous lab) over multiple runs.

Debugging with Intel® Thread Checker. Intel® Thread Checker flags notorious threading errors such as race conditions, stalls, and deadlocks. The workflow: Primes.exe goes through binary instrumentation; the instrumented Primes.exe plus instrumented DLLs run under the runtime data collector inside the VTune™ Performance Analyzer, and the results are written to threadchecker.thr (the results file).

Purpose of the slide: introduce Thread Checker as a tool to address threading correctness, and outline its implementation. Details: the code is typically instrumented at the binary level, though source instrumentation is also available. From the product FAQ: the Thread Checker library calls record information about threads, including memory accesses and APIs used, in order to produce threading diagnostics, including errors. Binary instrumentation is added at run time to an already built (made) binary module, including applications and dynamic or shared libraries. The instrumentation code is automatically inserted when you run an Intel® Thread Checker activity in the VTune™ environment or the Microsoft .NET* Development Environment. Both Microsoft Windows* and Linux* executables can be instrumented for IA-32 processors, but not for Itanium® processors. Binary instrumentation can be used for software compiled with any of the supported compilers. The final build shows the UI page, moving ahead to the next slide. Background: be ready to briefly explain the bugs mentioned: data races, stalls, deadlocks.

Thread Checker (screenshot of the diagnostics view showing the detected race conditions).

Activity 4: use Thread Checker to analyze the threaded application. Create a "Thread Checker activity", run the application, and see whether any errors are reported.

Purpose of the slide: refers students to the 4th lab activity, whose purpose is to run the Thread Checker analysis illustrated on the previous slide. Details: detailed instructions are provided in the student lab manual. Students should see results (race conditions detected) similar to those on the previous slide.

Debugging. How much redesign effort is required? How long will it take to thread? Thread Checker reported only 2 dependencies, so the effort needed should be low.

Purpose of the slide: to address the question of how much effort (cost) will be required to successfully thread this application. Details: as asserted on the slide, with only the two dependencies (gPrimesFound and gProgress), the debugging effort should be manageable.

Debugging: protecting the shared updates. One critical section is created for the reference to gPrimesFound inside the loop, and another for both references inside the progress update:

#pragma omp parallel for
for( int i = start; i <= end; i += 2 ) {
    if( TestForPrime(i) )
        #pragma omp critical
        globalPrimes[gPrimesFound++] = i;
    ShowProgress(i, range);
}

#pragma omp critical
{
    gProgress++;
    percentDone = (int)((float)gProgress/(float)range*200.0f+0.5f);
}

Purpose of the slide: show one way to correct the race conditions on gPrimesFound and gProgress. Details: a critical section can only be entered by one thread at a time, so this solution corrects the race condition. Note the key point, "one thread at a time": the critical section is, by design, no longer parallel.
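One detail worth noting (an addition here, not from the slides): unnamed OpenMP critical sections all map to the same global lock, so the two regions above also exclude each other. Giving them names creates independent locks, as in this sketch; the names primes_list and progress_counter, and the extern declarations, are illustrative assumptions.

#include <omp.h>
#include <cstdio>

extern int  globalPrimes[];
extern int  gPrimesFound;
extern int  gProgress;
extern bool TestForPrime(int val);

void ShowProgress(int val, int range)
{
    int percentDone;
    // Named critical section: independent of the one protecting globalPrimes.
    #pragma omp critical(progress_counter)
    {
        gProgress++;
        percentDone = (int)((float)gProgress / (float)range * 200.0f + 0.5f);
    }
    if (percentDone % 10 == 0)
        printf("\b\b\b\b%3d%%", percentDone);
}

void FindPrimes(int start, int end)
{   // start is always odd
    int range = end - start + 1;
    #pragma omp parallel for
    for (int i = start; i <= end; i += 2) {
        if (TestForPrime(i)) {
            // Named critical section protecting the shared list and counter.
            #pragma omp critical(primes_list)
            globalPrimes[gPrimesFound++] = i;
        }
        ShowProgress(i, range);
    }
}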

Activity 5: modify and run the OpenMP version of the code. Add the critical-region pragmas to the code, build it, and run it inside Thread Checker. If errors remain, make the appropriate corrections and run it in Thread Checker again. Run with '1 5000000' for comparison, then build and run outside Thread Checker. What is the speedup?

Purpose of the slide: refers students to the 5th lab activity, whose purpose (as stated on the slide) is to correct the race conditions discovered in the primes code; the resulting image is then checked for results and performance. Details: detailed instructions are provided in the student lab manual. Students will use the critical-section technique described on the previous slide.

Debugging, results. The answer is now correct, but performance dropped to ~1.33X. Is this the best we can expect from this algorithm? No: according to Amdahl's law, we can expect a speedup close to 1.9X.

Purpose of the slide: show that the critical-section fix removed the bug, but the performance is lower than expected. Details: the slide shows a correct answer, but remind the students that this does not guarantee there is no bug (race conditions, if present, may show up only rarely). To be more rigorous, one would re-run Thread Checker.

Common performance issues. Parallel overhead: caused by thread creation, scheduling, and so on. Synchronization: excessive global data, contention for the same synchronization objects. Load imbalance: uneven distribution of the parallel work. Granularity: not enough parallel work per thread.

Purpose of the slide: to list some common performance issues (follow-up to the previous slide, which showed poor threading performance). Details: each item listed is linked to a complete slide (in the backup set) with additional detail; it is recommended to link to each one in turn. We will see examples of two of these in the remaining labs. This slide and the previous one set up the final section of this module, performance tuning, which begins on the next slide.

Tuning for performance. Thread Profiler pinpoints bottlenecks in threaded applications. The workflow: Primes.c is compiled with source instrumentation using /Qopenmp_profile, producing Primes.exe; alternatively, Primes.exe can be binary-instrumented. The instrumented executable plus instrumented DLLs run under the runtime data collector inside the VTune™ Performance Analyzer, producing Bistro.tp/guide.gvs (the results file).

Purpose of the slide: introduce Thread Profiler as a tool to address threading performance, and outline its implementation. Details: the slide build shows the build-and-link stage for primes using the flag /Qopenmp_profile; this flag replaces /Qopenmp and is required. From the user guide: before you begin, you need to link and instrument your application with calls to the OpenMP* statistics-gathering runtime engine; the runtime engine's calls are required because they collect performance data and write it to a file. 1. Compile your application using an Intel® compiler. 2. Link your application to the OpenMP* runtime engine using the -Qopenmp_profile option. The slide then shows "binary instrumentation", but we will not use that feature during this module (binary instrumentation would be used to investigate the underlying native threads of an OpenMP application). The resulting run, and then a GUI snapshot, are shown in detail on the next slide.

Thread Profiler for OpenMP (screenshot).

Thread Profiler for OpenMP: the speedup graph estimates the achieved threaded speedup and the potential speedup. It is based on Amdahl's law and gives lower and upper bounds.

Thread Profiler for OpenMP (view distinguishing the serial and parallel portions of the run).

Thread Profiler for OpenMP (screenshot).

Thread Profiler (for explicit threads) (screenshot).

Thread Profiler (for explicit threads): why so many transitions?

Performance: back to the design stage. This implementation has implicit synchronization calls, which limits performance scaling because of the resulting context switches.

Purpose of the slide: gives additional analysis of the timeline and source views shown on the previous slide, and identifies a significant bottleneck in the code. Details: the slide build highlights the key portions of the timeline and source views. Question to ask students: why do we call this a synchronization, and what is implicit about it?

Activity 6: use Thread Profiler to analyze the threaded application. Use /Qopenmp_profile to compile and link, create a "Thread Profiler Activity (for explicit threads)", run the application under Thread Profiler, and find the line in the source code that is causing threads to sit idle.

Purpose of the slide: refers students to the 6th lab activity, whose purpose is to run a Thread Profiler analysis on the primes code. Details: detailed instructions are provided in the student lab manual. This lab exercise repeats the steps demonstrated in the preceding slides; students should expect to see similar results.

Performance. Is this much contention expected? The algorithm performs far more progress updates than the 10 needed to display progress. Original version:

void ShowProgress( int val, int range )
{
    int percentDone;
    gProgress++;
    percentDone = (int)((float)gProgress/(float)range*200.0f+0.5f);
    if( percentDone % 10 == 0 )
        printf("\b\b\b\b%3d%%", percentDone);
}

Modified version, which only prints when a new 10% step is reached:

void ShowProgress( int val, int range )
{
    int percentDone;
    static int lastPercentDone = 0;
    #pragma omp critical
    {
        gProgress++;
        percentDone = (int)((float)gProgress/(float)range*200.0f+0.5f);
    }
    if( percentDone % 10 == 0 && lastPercentDone < percentDone / 10 ){
        printf("\b\b\b\b%3d%%", percentDone);
        lastPercentDone++;
    }
}

This change should fix the contention problem.

Purpose of the slide: to address the performance problem identified in the preceding slides and lab. Details: the test if (percentDone % 10 == 0) does NOT cause printing to be done only every 10th step, but much more often; the slide build introduces a fix. Questions to ask students: why does the original algorithm not print as infrequently as intended? Why does the fix correct this problem? Invite and encourage the students to walk through the code with you so these questions are understood.
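To see why the original test fires so often, here is a small standalone count (an illustration added here, not lab code) comparing how many times each condition triggers over a simulated run:

#include <cstdio>

int main()
{
    const int range = 1000;          // simulated candidate range
    int naive = 0, fixed = 0, lastPercentDone = 0;
    for (int gProgress = 1; gProgress <= range / 2; gProgress++) {
        int percentDone = (int)((float)gProgress / (float)range * 200.0f + 0.5f);
        if (percentDone % 10 == 0)
            naive++;                 // original test: true for long runs of iterations
        if (percentDone % 10 == 0 && lastPercentDone < percentDone / 10) {
            fixed++;                 // fixed test: true once per 10% step
            lastPercentDone++;
        }
    }
    printf("naive prints: %d, fixed prints: %d\n", naive, fixed);
    return 0;
}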

Design goals: eliminate the implicit contention due to synchronization. The measured speedup is now 2.32X; can that be right?

Purpose of the slide: shows a result of the primes code incorporating the correction from the previous slide, and an apparent anomaly in the resulting timing. Details: the answer is correct, but a speedup of 2.32 on 2 cores cannot be right. Encourage the students to speculate about the causes (you may hear words like superscalar, cache, etc.; all red herrings).

Performance. Our baseline measurement was "tainted" by the flawed progress-update algorithm; with the serial version corrected too, the actual speedup is 1.40X (well below 1.9X). Is this the best we can expect from this algorithm?

Purpose of the slide: show the corrected baseline timing, resolving the apparent anomaly of the previous slide. Details: the timing shown is a new baseline, with the contention correction added to the serial version of primes (note the directory name in the command window). The original baseline timing was 11.73s; this version shows 7.09s, giving the new speedup ratio of 1.40, significantly lower than the 1.9 predicted by Amdahl's law.

Activity 7: modify the ShowProgress function (serial and OpenMP versions) so that it produces only the output that is needed, then rebuild and run the code, making sure no instrumentation flags are used. What is the speedup relative to the serial version?

if( percentDone % 10 == 0 && lastPercentDone < percentDone / 10 ){
    printf("\b\b\b\b%3d%%", percentDone);
    lastPercentDone++;
}

Purpose of the slide: refers students to the 7th lab activity, whose purpose is to introduce the performance fix outlined in the preceding slides. Details: detailed instructions are provided in the student lab manual. Students will implement the code shown on the previous slides, measure new timings, and derive a new speedup number. It is unlikely to match the 1.40X speedup shown on the slides exactly (platforms used for this class will vary), but it should be similar.

Performance, revisited: 62% of the execution time is still spent in locks and synchronization.

Performance, revisited: consider the OpenMP locks. The critical section protecting globalPrimes sits inside a loop:

void FindPrimes(int start, int end)
{   // start is always odd
    int range = end - start + 1;
    #pragma omp parallel for
    for( int i = start; i <= end; i += 2 ) {
        if( TestForPrime(i) )
            #pragma omp critical
            globalPrimes[gPrimesFound++] = i;
        ShowProgress(i, range);
    }
}

Replacing it with an atomic increment:

void FindPrimes(int start, int end)
{   // start is always odd
    int range = end - start + 1;
    #pragma omp parallel for
    for( int i = start; i <= end; i += 2 ) {
        if( TestForPrime(i) )
            globalPrimes[InterlockedIncrement(&gPrimesFound)] = i;
        ShowProgress(i, range);
    }
}

Purpose of the slide: examine the lock protecting gPrimesFound to understand its performance impact. Details: as stated on the slide, the real issue is putting the critical section (lock) inside a loop. The fix uses the Windows threading function InterlockedIncrement, defined as LONG InterlockedIncrement( LONG volatile* Addend ), where Addend [in, out] is a pointer to the variable to be incremented. This is an atomic operation, less disruptive than a critical section. The final build shows a timing result from a primes image incorporating this fix. A key point: it is possible, and sometimes desirable, to mix OpenMP and native threading calls.

Performance, revisited: consider the second lock, which is also called inside a loop. Before:

void ShowProgress( int val, int range )
{
    int percentDone;
    static int lastPercentDone = 0;
    #pragma omp critical
    {
        gProgress++;
        percentDone = (int)((float)gProgress/(float)range*200.0f+0.5f);
    }
    if( percentDone % 10 == 0 && lastPercentDone < percentDone / 10 ){
        printf("\b\b\b\b%3d%%", percentDone);
        lastPercentDone++;
    }
}

After, using an atomic increment:

void ShowProgress( int val, int range )
{
    long percentDone, localProgress;
    static int lastPercentDone = 0;
    localProgress = InterlockedIncrement(&gProgress);
    percentDone = (int)((float)localProgress/(float)range*200.0f+0.5f);
    if( percentDone % 10 == 0 && lastPercentDone < percentDone / 10 ){
        printf("\b\b\b\b%3d%%", percentDone);
        lastPercentDone++;
    }
}

Purpose of the slide: examine the lock protecting gProgress to understand its performance impact. Details: the same fix, InterlockedIncrement, is used for this critical section. The slide build points out that the critical section is in a loop (question: which loop?), then introduces the alternative solution using the Windows API; note that 3 lines of code need to be modified. The final build shows a timing result from a primes image incorporating this fix.
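For readers outside Windows, a portable way to get the same effect is OpenMP's atomic construct with capture; this variant is an alternative added here, not the one used in the lab, and it requires OpenMP 3.1 or later, which post-dates the compilers this course was written for.

#include <cstdio>

extern long gProgress;

// Portable alternative to InterlockedIncrement using an OpenMP atomic.
// 'capture' stores the updated value in a private variable, playing the
// same role as the return value of InterlockedIncrement.
void ShowProgress(int val, int range)
{
    long percentDone, localProgress;
    static int lastPercentDone = 0;
    #pragma omp atomic capture
    localProgress = ++gProgress;
    percentDone = (long)((float)localProgress / (float)range * 200.0f + 0.5f);
    if (percentDone % 10 == 0 && lastPercentDone < percentDone / 10) {
        printf("\b\b\b\b%3ld%%", percentDone);
        lastPercentDone++;
    }
}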

Activity 8: modify the OpenMP critical regions, replacing them with InterlockedIncrement. Rebuild and run the code. What is the speedup relative to the serial version?

Purpose of the slide: refers students to the 8th lab activity, whose purpose is to introduce the code change cited and measure its impact. Details: detailed instructions are provided in the student lab manual.

Thread Profiler for OpenMP: load imbalance. With the range 1 to 1,000,000 split evenly across 4 threads (250,000 candidates each), the work per thread is still uneven, because testing a "middle" candidate from each block requires a different number of trial factors: thread 0, 342 factors to test 116747; thread 1, 612 factors to test 373553; thread 2, 789 factors to test 623759; thread 3, 934 factors to test 873913.

Purpose of the slide: to examine the causes of the load imbalance observed in the profile of the primes code. Details: using 4 threads makes the imbalance more obvious. The slide build overlays the threads view with a "stair step" drawn to show that each successive thread takes additional time, and a bar to show that the iterations were divided among the threads in equal amounts. "Didn't we divide the iterations evenly? Let's look at the work being done for a 'middle' prime in each group." Boxes stating the precise workload for each thread show explicitly that more steps are required as the algorithm searches for primes among larger numbers; a triangle illustrates (conceptually, not precisely) how the workload grows with the size of the numbers tested.

Fixing the load imbalance: distribute the work more evenly by interleaving the iterations among the threads.

void FindPrimes(int start, int end)
{   // start is always odd
    int range = end - start + 1;
    #pragma omp parallel for schedule(static, 8)
    for( int i = start; i <= end; i += 2 ) {
        if( TestForPrime(i) )
            globalPrimes[InterlockedIncrement(&gPrimesFound)] = i;
        ShowProgress(i, range);
    }
}

The speedup achieved is 1.68X.

Purpose of the slide: introduce a method to address the load imbalance inherent in the primes algorithm. Details: the slide build redraws the triangle from the previous slide, illustrating the different "sizes" of work assigned to each thread, then shows a new triangle with the workload interleaved to achieve a more even distribution. An OpenMP schedule clause is added to the pragma, which achieves that interleaving (no other code change is required). A sample run is shown for this approach, with the time now 4.22s, a speedup of 1.68.
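As an aside added here (not part of the lab), a tiny program like the following makes the chunked round-robin assignment of schedule(static, 8) visible, and can also be used to compare dynamic or guided scheduling by editing the clause:

#include <omp.h>
#include <cstdio>

int main()
{
    int owner[64];                       // record which thread ran each iteration
    #pragma omp parallel for schedule(static, 8)
    for (int i = 0; i < 64; i++)
        owner[i] = omp_get_thread_num();

    // Chunks of 8 consecutive iterations are handed out round-robin: with 2 threads,
    // iterations 0-7 go to thread 0, 8-15 to thread 1, 16-23 to thread 0, and so on.
    for (int i = 0; i < 64; i++)
        printf("%d%s", owner[i], (i % 8 == 7) ? "\n" : " ");
    return 0;
}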

Activity 9: modify the code to improve the load balancing. Add the schedule(static, 8) clause to the OpenMP parallel for pragma, then rebuild and run the code. What is the speedup relative to the serial code?

Purpose of the slide: refers students to the 9th lab activity, whose purpose is to introduce static OpenMP scheduling and measure its impact. Details: detailed instructions are provided in the student lab manual. As before, the results achieved should be similar to (though probably not exactly the same as) those shown on the slides.

Final Thread Profiler run: the speedup achieved is 1.80X.

Purpose of the slide: show the performance profile of the final version of primes, with all corrections and load balancing implemented. Details: note that the speedup, 1.80X, is higher than the 1.68X cited on a preceding slide; the difference is that this final run is the "Release" build, free of the overhead of the "Debug" build shown previously (note the directories shown: here it is c:\classfiles\PrimeOpenMP\Release, previously ...\Debug).

Comparative analysis: threaded applications require several passes through the software development cycle.

Purpose of the slide: summarizes the results at each step of the performance tuning process and emphasizes the iterative nature of the process.

What was covered: the four steps of the development cycle for writing threaded applications starting from serial code, and the Intel® tools that support each step: analysis, design (introduce threads), debugging for correctness, and tuning for performance. Threaded applications require multiple iterations of design, debugging, and performance tuning; use the tools to improve productivity.

Purpose of the slide: summarizes the key points covered in this module.


Backup slides

Parallel overhead. Thread creation has a cost, and the overhead increases as the number of active threads increases. Solution: use reusable threads and thread pools; this amortizes the cost of creating threads and keeps the number of active threads relatively constant. A minimal sketch follows below.
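As a sketch added here (not from the slides), one simple way to express the "reuse the same workers" idea in OpenMP is to open a single parallel region and hand out several batches of work inside it, rather than spawning workers per batch; process_item is a hypothetical placeholder for real work.

#include <omp.h>
#include <cstdio>

static void process_item(int item)     // hypothetical per-item work
{
    printf("thread %d handled item %d\n", omp_get_thread_num(), item);
}

int main()
{
    const int num_items = 100;
    // One parallel region: the team of threads acts like a small pool that is
    // reused for both batches instead of being created separately for each one.
    #pragma omp parallel
    {
        #pragma omp for schedule(dynamic)
        for (int i = 0; i < num_items; i++)
            process_item(i);

        // Second batch, handled by the same team of threads.
        #pragma omp for schedule(dynamic)
        for (int i = 0; i < num_items; i++)
            process_item(i + num_items);
    }
    return 0;
}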

Synchronization. Contention from dynamic memory allocation: heap allocation causes implicit synchronization; allocate on the stack or use thread-local storage instead. Atomic updates versus critical sections: some updates to global data can use atomic operations (the Interlocked family); use atomic updates whenever possible. Critical sections versus mutexes: CRITICAL_SECTION objects live in user space; use CRITICAL_SECTION objects when visibility beyond process boundaries is not required; they introduce less overhead and have a spin-wait variant that is useful for some applications. A short comparison sketch follows below.
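To make the atomic-versus-critical-section point concrete in OpenMP terms (an illustration added here; the Windows Interlocked and CRITICAL_SECTION APIs behave analogously), compare:

#include <omp.h>

long counter = 0;

void count_with_critical(int n)
{
    #pragma omp parallel for
    for (int i = 0; i < n; i++) {
        // General-purpose lock: correct, but every increment takes the same lock.
        #pragma omp critical
        counter++;
    }
}

void count_with_atomic(int n)
{
    #pragma omp parallel for
    for (int i = 0; i < n; i++) {
        // Atomic update: limited to simple operations, but cheaper, and
        // comparable in spirit to InterlockedIncrement on Windows.
        #pragma omp atomic
        counter++;
    }
}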

Unbalanced workloads: uneven work per thread leads to idle threads and wasted time (the timeline shows busy versus idle time for each thread).

Granularity. In the illustration, a coarse-grained split of the parallelizable portion scales about 2.5X to 3X, while a fine-grained split, with serial work interleaved between many small parallel chunks, scales only about 1.05X to 1.10X.