Parallel Programming Methodology

Intel Software College

Threaded Programming Methodology
Objectives
At the end of this module, you will be able to build a rapid prototype and estimate the effort required to thread time-consuming regions.
Purpose of the Slide: States the objectives of this module.
Details: This course module walks through the process of migrating a serial program to a parallel (threaded) one, using the OpenMP model and the Intel tools VTune (to identify the code sections most profitably threaded), Thread Checker (to identify coding issues specific to threading), and Thread Profiler (to identify performance issues specific to threading). It can stand alone, be used as the sole threading session in a more general course, or form part of an overall threading course. There are 9 lab activities in the sequence. Note: most of the slides use complex builds; be sure to become familiar with them.

Agenda
- A generic development cycle
- Case study: prime number generation
- Some common performance problems
Purpose of the Slide: Outlines the topics addressed to achieve the module objective.
Details: The "primes" code, which finds all prime numbers up to a specified upper bound, is the example used throughout. Be aware that the prime-finding algorithm employed in this case study is deliberately unsophisticated (it scales as O(N^(3/2))) so that students can understand it quickly; better approaches exist, but are not pertinent to the matters addressed here.

What Is Parallelism?
- Two or more processes or threads execute at the same time
- Parallelism for multi-core architectures:
  - Multiple processes: communication through IPC (inter-process communication)
  - A single process with multiple threads: communication through shared memory
Purpose of the Slide: To frame the discussion. The parallel model used in this session is the one highlighted: single process, multiple threads, shared memory.

Amdahl's Law
Describes the upper bound on speedup from parallel execution. With P the parallelizable fraction of the serial run time and n the number of processors:
  $T_{parallel} = \{(1-P) + P/n\}\, T_{serial}$
  $Speedup = T_{serial} / T_{parallel}$
Example from the diagram (P = 0.5):
  n = 2: $T_{parallel} = 0.75\, T_{serial}$, speedup = 1.0/0.75 = 1.33
  n = ∞: $T_{parallel} = 0.5\, T_{serial}$, speedup = 1.0/0.5 = 2.0
Serial code limits the speedup.
Purpose of the Slide: Explains and illustrates Gene Amdahl's observation about the maximum theoretical speedup, from parallelization, for a given algorithm.
Details: The build starts with a serial code taking time T(serial), composed of a parallelizable portion P and the rest, 1-P, in this example in equal proportions. If P is perfectly divided into two parallel components (on two cores, for example), then the overall time T(parallel) is reduced to 75% of T(serial). In the limit of very large n, with perfect parallelization of P, the overall time approaches 50% of the original T(serial).
Questions to Ask Students: Does overhead play a role? (Yes; this primes the next slide, where threads are more efficient than processes.) Are unwarranted assumptions about scaling built in? (That is: do the serial and parallel portions increase at the same rate with increasing problem size?) This can lead to a brief aside about the complementary point of view, Gustafson's law, which assumes the parallel portion grows more quickly than the serial portion.
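Sketch (not part of the original slides): the two formulas above can be checked numerically with a few lines of C. The function names `amdahl_speedup`, `amdahl_limit`, and `print_amdahl_table` are hypothetical and chosen only for this illustration.

```c
#include <assert.h>
#include <stdio.h>

/* Amdahl's law: speedup for parallel fraction p (0..1) on n processors.
   Mirrors the slide: T_parallel = ((1 - p) + p / n) * T_serial. */
double amdahl_speedup(double p, int n)
{
    assert(p >= 0.0 && p <= 1.0 && n >= 1);
    return 1.0 / ((1.0 - p) + p / (double)n);
}

/* Limit of the speedup as n grows without bound: 1 / (1 - p); needs p < 1. */
double amdahl_limit(double p)
{
    assert(p >= 0.0 && p < 1.0);
    return 1.0 / (1.0 - p);
}

/* Print speedups for 1, 2, 4, 8, and 16 processors, then the limit. */
void print_amdahl_table(double p)
{
    for (int n = 1; n <= 16; n *= 2)
        printf("n = %2d  speedup = %.2f\n", n, amdahl_speedup(p, n));
    printf("n -> inf  speedup = %.2f\n", amdahl_limit(p));
}
```

Calling `print_amdahl_table(0.5)` reproduces the slide's two data points: 1.33 for n = 2 and 2.00 in the limit.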

Processes and Threads
- Modern operating systems load programs as processes, which have two roles:
  - Resource holder
  - Execution
- A process starts executing at its entry point as a thread
- Threads can create other threads within the process
  - Each thread gets its own stack
- All threads within a process share the code and data segments
[Diagram: a process whose code segment and data segment are shared by main() and the threads it creates; each thread has its own stack]
Purpose of the Slide: Show the difference between processes (which students should have heard about) and threads (which the students may not know).
Details: Operating system theory categorizes processes as having the two roles listed here. Resource holder refers to the job of the process to "hold" memory, file pointers, and other system resources that have been assigned to the process. Execution is the thread within the process that executes the instructions of the code and uses the resources held. When the process terminates, all resources are returned to the system. Any active threads that might be running are also terminated, and the resources assigned to them (stack and other local storage) are returned to the system.
Background: There are ways to have threads continue to execute after the parent process has terminated, but this topic will not be covered.

Threads: Benefits and Risks
Benefits
- Higher performance and better resource utilization
  - Even on single-processor systems, to hide latency and improve response time
- Communication through shared memory is more efficient than inter-process communication
Risks
- Increases the complexity of the application
- Difficult to debug (race conditions, deadlocks, etc.)
Purpose of the Slide: List some of the benefits and risks (costs) of threaded applications.
Details: Benefit 1 refers primarily to task parallelism (discussed in detail in the ISC module "Multi-core Programming: Basic Concepts"). Benefit 2 is relative to processes: since threads share data, the overhead of "inter-process communication" is minimal. Debugging is difficult because the bugs are non-deterministic, that is, they may not occur during each test, and a QA process designed for serial code will very likely miss bugs in threaded code.

Common Questions When Threading Applications
- Where to thread?
- How long would it take to thread?
- How much redesign effort is required?
- Is it worth threading a selected region?
- What should the expected speedup be?
- Will the performance meet expectations?
- Will it scale as more threads or data are added?
- Which threading model to use?
Purpose of the Slide: States key considerations for developers beginning to thread an application.
Details:
- Where to thread? Where in the application: the hotspots.
- How long would it take to thread? In developer time (a cost estimate).
- How much redesign effort is required? A factor in developer time (refining the cost estimate).
- Is it worth threading a selected region? Estimating the benefit.
- What should the expected speedup be? Quantitative; we want to approach the Amdahl's law limit.
- Will the performance meet expectations? If that limit is achieved, is the effort worthwhile?
- Will it scale as more threads or data are added? This is very important: future platforms are expected to have additional cores.
- Which threading model to use? For compiled applications, this is typically a choice between native models and OpenMP.

Prime Number Generation
    bool TestForPrime(int val)
    {
        // let's start checking from 3
        int limit, factor = 3;
        limit = (long)(sqrtf((float)val) + 0.5f);
        while( (factor <= limit) && (val % factor) )
            factor++;
        return (factor > limit);
    }
    void FindPrimes(int start, int end)
    {
        int range = end - start + 1;
        for( int i = start; i <= end; i += 2 )
        {
            if( TestForPrime(i) )
                globalPrimes[gPrimesFound++] = i;
            ShowProgress(i, range);
        }
    }
[Animation: the values of i and factor at each step of the while loop]
Purpose of the Slide: Explain the prime number algorithm and code to be used for the 9 lab activities.
Details: Each build step illustrates a step in the while loop. When "i" divides evenly by a factor (as with 9/3 and 15/3), the slide colours "i" (and gPrimesFound is not incremented). At the end of the loop, the slide shows a total of 7 uncoloured numbers. The final popup shows the console output, where the program reports finding 8 total primes between 1 and 20.

Activity 1
Run the serial version of the primes code
- Locate the PrimeSingle directory
- Compile with the Intel compiler
- Run a few times with different ranges
Purpose of the Slide: Refers students to the 1st lab activity, whose purpose is to build the initial, serial version of the application.
Details: Detailed instructions are provided in the student lab manual.
Background: This exercise assumes that students are familiar with building applications within Visual Studio and can invoke the Intel compiler (make sure this is at least approximately true). Though no coding is required for this stage, it is a good break from the lecture and prepares the way for further work.

Development Methodology
- Analysis: find the code where intensive computation takes place
- Design (introduce threads): determine how to implement a threaded solution
- Debug: detect any problems that result from using threads
- Tune for performance: achieve the best parallel performance
Purpose of the Slide: Define the methodology to use when migrating a serial application to a threaded one.
Details: Don't rush this slide. Each of these four steps will have one or more associated lab activities using the primes code.

Development Cycle
- Analysis: VTune™ Performance Analyzer
- Design (introduce threads): Intel® Performance Libraries (IPP and MKL); OpenMP* (Intel® Compiler); explicit threading (Win32*, Pthreads*)
- Debug: Intel® Thread Checker; Intel Debugger
- Tune for performance: Intel® Thread Profiler
Purpose of the Slide: Assigns details to, and visually reinforces, the points made on the previous slide: specific tools and threading models are inserted into the general outline. It also points out the iterative nature of both debugging and the overall development cycle.
Details: Each of these steps will be addressed in detail during this session.

Analysis: Sampling
Identify the time-consuming regions.
- Use VTune sampling to find hotspots in the application
- We will use the PrimeSingle project for the analysis
- Usage: ./PrimeSingle <start> <end>
    bool TestForPrime(int val)
    {
        // let's start checking from 3
        int limit, factor = 3;
        limit = (long)(sqrtf((float)val) + 0.5f);
        while( (factor <= limit) && (val % factor) )
            factor++;
        return (factor > limit);
    }
    void FindPrimes(int start, int end)
    {
        // start is always odd
        int range = end - start + 1;
        for( int i = start; i <= end; i += 2 ){
            if( TestForPrime(i) )
                globalPrimes[gPrimesFound++] = i;
            ShowProgress(i, range);
        }
    }
Purpose of the Slide: Introduce and explain further the role of VTune sampling.
Details: The slide build initially states the workload (find all primes between 1 and the specified upper bound), then shows an extract from the VTune user interface highlighting the function TestForPrime, and finally shows the corresponding source code fragment.

Analysis: Call Graph
Used to find the appropriate level in the call tree at which to thread.
[Screenshot callout: this is the level in the call tree where we need to thread]
Purpose of the Slide: Introduce and explain the role of VTune Call Graph.
Details: The slide build starts with an excerpt from the call graph user interface (bold red arrows show the busiest branch), then asserts that FindPrimes is the right level to thread.
Background: Coarse-grained parallelism is generally more effective (thread one level higher than the hotspot).
Questions to Ask Students: Why is this the right level, and why not thread in TestForPrime? (Be sure you know the answer yourself: look at the code and imagine the thread call in TestForPrime.)

Analysis
- Where to thread? FindPrimes()
- Is it worth threading a selected region?
  - It appears to have minimal dependencies
  - It appears to be data-parallel
  - It consumes over 95% of the run time
- Baseline measurement
Purpose of the Slide: Further analysis of the assertion made in the previous slide. Also introduces the baseline measurement.
Details: The bullet points illustrate key considerations for the threading decision. Note that the final build on the slide, showing a baseline timing, is sudden (really a non sequitur; this sequencing could be better), so don't be surprised. Baseline timing is part of the overall analysis and is necessary to measure the impact of any threading effort; now is as good a time as any to introduce it.

Activity 2
- Run the code with the range ‘ ’ to obtain the baseline measurement
  - Make a note of it for future reference
- Run the VTune analysis tool on the serial code
  - Which function takes up most of the time?
Purpose of the Slide: Refers students to the 2nd lab activity, whose purpose (as stated on the slide) is to generate a baseline serial-code measurement and run the VTune sampling analysis.
Details: Detailed instructions are provided in the student lab manual.

Foster's Design Methodology
From "Designing and Building Parallel Programs" by Ian Foster. Four steps:
- Partitioning: divide the computation and the data
- Communication: exchange data between computations
- Agglomeration: group tasks to improve performance
- Mapping: assign tasks to processors/threads
Purpose of the Slide: Introduce Foster's design methodology for parallel programming.
Details: This somewhat long digression in the presentation, 8 slides of parallel design points and examples, is intended to prepare the design discussion for our own primes example. Ian Foster's 1994 book is well known to practitioners of this dark art and is available online for free.

Designing Parallel Programs
- Partition: divide the problem into tasks
- Communicate: determine the amount and pattern of communication
- Agglomerate: combine tasks
- Map: assign the agglomerated tasks to the threads created
[Diagram: problem -> initial tasks -> communication -> combined tasks -> final program]
Purpose of the Slide: To graphically illustrate the 4 steps in Foster's design methodology.

Parallel Programming Models
- Functional decomposition
  - Task parallelism
  - Divide the computation, then associate the data
  - Independent tasks from the same problem
- Data decomposition
  - The same operation performed on different data
  - Divide the data into pieces, then associate the computation
Purpose of the Slide: Introduce the primary conceptual partitions in parallel programming: task and data.
Details: Task parallelism has traditionally been used in threaded desktop applications (partitioning among screen update, disk read, print, and so on), and data parallelism in HPC applications; both may be appropriate in different sections of an application.
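Sketch (not part of the original slides): the two decompositions can be contrasted in OpenMP, assuming `parallel for` as the data-parallel construct and `parallel sections` as one possible way to express independent tasks. The functions `scale_all` and `min_and_sum` are hypothetical examples; if the code is compiled without OpenMP support, the pragmas are ignored and it runs serially with the same results.

```c
#include <assert.h>
#include <stddef.h>

/* Data decomposition: the same operation applied to different elements;
   the loop iterations are divided among the threads. */
void scale_all(double *data, int n, double factor)
{
    assert(data != NULL && n >= 0);
    #pragma omp parallel for
    for (int i = 0; i < n; i++)
        data[i] *= factor;
}

/* Functional decomposition: two independent tasks on the same input,
   each of which may run on its own thread. */
void min_and_sum(const double *data, int n, double *min_out, double *sum_out)
{
    assert(data != NULL && n > 0 && min_out != NULL && sum_out != NULL);
    #pragma omp parallel sections
    {
        #pragma omp section
        {
            double m = data[0];
            for (int i = 1; i < n; i++)
                if (data[i] < m)
                    m = data[i];
            *min_out = m;           /* written only by this section */
        }
        #pragma omp section
        {
            double s = 0.0;
            for (int i = 0; i < n; i++)
                s += data[i];
            *sum_out = s;           /* written only by this section */
        }
    }
}
```

Neither function needs synchronization: in `scale_all` each iteration touches a distinct element, and in `min_and_sum` the sections only read the shared input and write to separate outputs.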

Decomposition Methods
- Functional decomposition
  - Focusing on the computation can reveal structure in a problem
  - Example: atmosphere model, ocean model, land surface model, hydrology model
- Domain decomposition
  - Focus on the largest or most frequently accessed data structure
  - Data parallelism: the same operation applied to all the data
Grid reprinted with permission of Dr. Phu V. Luong, Coastal and Hydraulics Laboratory, ERDC
Purpose of the Slide: Illustrates, by example, the task and data decompositions in one application (weather modeling).
Details: Each domain (atmosphere, hydrology, etc.) can be treated independently, leading to a task-parallel design; within the domains, data parallelism may be applied as appropriate.

Pipelined Decomposition
- The computation is done in independent stages
- Functional decomposition
  - Threads are assigned to a stage to compute
  - An automobile assembly line
- Data decomposition
  - Threads process all the stages of a single instance
  - One worker builds an entire car
Purpose of the Slide: Introduce pipelined decomposition, which can apply to either task (called "functional" on this slide) or data decomposition.
Details: This is the first of 3 slides on the topic; the next two illustrate the concept with an example.

LAME Encoder Strategy
- LAME MP3 encoder
  - Open source project
  - Educational tool
- The goals of this project are to
  - Improve the quality
  - Improve the speed of MP3 encoding
Purpose of the Slide: Introduce a particular application, the LAME audio encoder, to set up the next slide showing LAME in a pipelined decomposition.
Details: This slide serves as a quick backgrounder on the LAME code (not all students will have heard of it). The "LAME MT" project (full description, with source code) is available online.

LAME Pipeline Strategy
Stages applied to each frame:
- Prelude: fetch the next frame; characterize the frame; set the encoder parameters
- Acoustics: FFT long/short analysis; assemble the filter
- Encoding: apply filters; suppress noise; quantize and count bits
- Other: add the frame header; check correctness; write to disk
[Timeline diagram: threads T1 to T4 each run one stage; at any moment the threads work on different frames (N, N+1, N+2, N+3), with the stages synchronized by a hierarchical barrier]
Purpose of the Slide: Show how the LAME compute sequence maps to a pipelined threading approach.
Details: Each thread (T1, ..., T4 on the slide) "specializes" in an operation, using results prepared by another thread.

Design
- What is the expected benefit? Speedup(2P) = 100/(96/2+4) = ~1.92X
- How do we achieve this with the least effort? Rapid prototyping with OpenMP
- How long does it take to thread?
- How much redesign effort is required?
Purpose of the Slide: Return us to the primes example to approach the design stage. Introduce OpenMP as a "prototyping" thread model.
Details: Although OpenMP is introduced for prototyping, it may (of course) prove efficient enough to be the thread model of choice for this example.
Questions to Ask Students: Where does this 2P speedup claim come from?
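Worked arithmetic (not part of the original slides): the 1.92X estimate is Amdahl's law with n = 2, assuming that the 96 and 4 in the slide's expression are the parallelizable and serial percentages of the run time (P = 0.96).

```latex
\[
S(n) = \frac{1}{(1-P) + P/n},
\qquad
S(2) = \frac{1}{0.04 + 0.96/2} = \frac{1}{0.52} \approx 1.92
\]
```

This is the same number as the slide's 100/(96/2 + 4) = 100/52.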

OpenMP
- Fork-join parallelism:
  - The master thread forks a team of threads as needed
  - Parallelism is added incrementally
  - A sequential program evolves into a parallel program
[Diagram: a master thread forking into parallel regions and joining again]
Purpose of the Slide: A conceptual introduction to OpenMP.
Details: The key point (on the slide): threading can be introduced one region at a time, which is not generally true of native threading models.
Background: OpenMP was launched as an industry standard. Collaborators included Intel, but not Microsoft (who were invited but not interested at the time); Microsoft now (2006) supports OpenMP in its compilers.

Design: OpenMP
    #pragma omp parallel for
    for( int i = start; i <= end; i += 2 ){
        if( TestForPrime(i) )
            globalPrimes[gPrimesFound++] = i;
        ShowProgress(i, range);
    }
- "#pragma omp parallel" creates threads here for this parallel region
- "for" divides the iterations of the for loop among the threads
Purpose of the Slide: Show a specific OpenMP syntax applied to the primes code.
Details: A key point: because threading is introduced by pragmas, the original source code is not touched. (Native thread methods require changes to the serial sources.) Note that the parallel region, in this case, is the "for" loop. The final slide build shows results (number of primes and total time) for the image created with this pragma.

Activity 3
Run the OpenMP version of the code
- Locate the PrimeOpenMP directory and solution
- Compile the code
- Run with ‘ ’ for comparison
- What is the speedup?
Purpose of the Slide: Refers students to the 3rd lab activity, whose purpose is to build and run an OpenMP version of primes.
Details: Detailed instructions are provided in the student lab manual. No programming is required for this lab.

Design
- What is the expected benefit?
- How do you achieve this with the least effort?
- How long did it take to thread?
- How much redesign effort is required?
- Is this the best possible speedup?
Speedup of 1.40X (lower than 1.92X)
Purpose of the Slide: Discuss the results obtained in the previous lab activity.
Details: Speedup was lower than expected. Now what?
Transition Quote: But inefficient speedup is not our first concern...

Debugging
Is this a correct parallel implementation? No! The results are different on every run...
Purpose of the Slide: Introduce and stress the importance of correctness.
Details: In the example shown, each run produces a different number: the bug is non-deterministic. On some platforms, the answer may be correct 9 times out of 10 and slip through QA. Students can test their own implementation (previous lab) on multiple runs.

Debugging: Intel® Thread Checker
Intel® Thread Checker pinpoints notorious threading errors such as race conditions, stalls, and deadlocks.
[Diagram: Primes.exe -> binary instrumentation -> Primes.exe + DLLs (instrumented) -> runtime data collector -> threadchecker.thr (result file), viewed with Intel® Thread Checker inside the VTune™ Performance Analyzer]
Purpose of the Slide: Introduce Thread Checker as a tool to address threading correctness; outline its implementation.
Details: The code is typically instrumented at the binary level, though source instrumentation is also available. From the product FAQ: The Thread Checker library calls record information about threads, including memory accesses and APIs used, in order to find threading diagnostics including errors. Binary instrumentation is added at run-time to an already built (made) binary module, including applications and dynamic or shared libraries. The instrumentation code is automatically inserted when you run an Intel® Thread Checker activity in the VTune™ environment or the Microsoft .NET* Development Environment. Both Microsoft Windows* and Linux* executables can be instrumented for IA-32 processors, but not for Itanium® processors. Binary instrumentation can be used for software compiled with any of the supported compilers. The final build shows the UI page moving ahead to the next slide.
Background: Be ready to briefly explain the bugs mentioned: data races, stalls, deadlocks.

Thread Checker
[Screenshot: Thread Checker results for the primes code]

Activity 4
Use Thread Checker to analyze the threaded application
- Create a "Thread Checker activity"
- Run the application
- Are any errors reported?
Purpose of the Slide: Refers students to the 4th lab activity, whose purpose is to run the Thread Checker analysis illustrated on the previous slide.
Details: Detailed instructions are provided in the student lab manual. Students should see results (race conditions detected) similar to those on the previous slide.

Debugging
- How much redesign effort is required?
- How long will it take to thread?
Thread Checker reported only 2 dependencies, so the effort required should be low.
Purpose of the Slide: To address the question of how much effort (cost) will be required to successfully thread this application.
Details: As asserted in the slide, with only the two dependencies (gPrimesFound and gProgress), the debugging effort should be manageable.

Debugging
In FindPrimes (the pragma creates a critical section for this one reference):
    #pragma omp parallel for
    for( int i = start; i <= end; i += 2 ){
        if( TestForPrime(i) ) {
            #pragma omp critical
            globalPrimes[gPrimesFound++] = i;
        }
        ShowProgress(i, range);
    }
In ShowProgress (the pragma creates a critical section for both references):
    #pragma omp critical
    {
        gProgress++;
        percentDone = (int)((float)gProgress/(float)range * 200.0f + 0.5f);
    }
Purpose of the Slide: To show one way to correct the race conditions on gPrimesFound and gProgress.
Details: A critical section can be executed by only one thread at a time, so this solution should correct the race conditions. Note the key point, "one thread at a time": the critical section is, by design, no longer parallel.
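Sketch (not part of the original slides): a self-contained version of the critical-section fix that can be compiled and checked on its own. Several details are assumptions made for this illustration: the array capacity `MAX_PRIMES`, the names `IsOddPrime` and `FindOddPrimes`, the omission of ShowProgress, and the integer test `factor * factor <= val`, which is swapped in for the slide's `sqrtf` limit so that the example needs no math library.

```c
#include <assert.h>
#include <stdbool.h>

#define MAX_PRIMES 10000            /* hypothetical capacity for this sketch */

static int globalPrimes[MAX_PRIMES];
static int gPrimesFound = 0;

/* Trial division by 3, 4, 5, ... for odd val >= 3. The integer limit test
   replaces the slide's sqrtf(); fine for val well below INT_MAX. */
static bool IsOddPrime(int val)
{
    int factor = 3;
    while (factor * factor <= val && (val % factor))
        factor++;
    return factor * factor > val;
}

/* Stores and counts the odd primes in [start, end]; start must be odd. */
int FindOddPrimes(int start, int end)
{
    assert(start >= 3 && start % 2 == 1);
    gPrimesFound = 0;
    #pragma omp parallel for
    for (int i = start; i <= end; i += 2) {
        if (IsOddPrime(i)) {
            /* Only one thread at a time may read-modify-write the counter
               and store into the shared array. */
            #pragma omp critical
            {
                if (gPrimesFound < MAX_PRIMES)
                    globalPrimes[gPrimesFound++] = i;
            }
        }
    }
    return gPrimesFound;
}
```

With the critical section in place, `FindOddPrimes(3, 19)` returns 7 on every run, matching the 7 uncoloured numbers in the earlier animation; the order of the entries in `globalPrimes` can still vary between runs when several threads are used.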

Activity 5
Modify and run the OpenMP version of the code
- Add critical-region pragmas to the code
- Compile the code
- Run within Thread Checker
  - If there are still errors, make the appropriate corrections to the code and run it again in Thread Checker
- Run with ‘ ’ for comparison purposes
  - Compile and run outside Thread Checker
  - What is the speedup?
Purpose of the Slide: Refers students to the 5th lab activity, whose purpose (as stated on the slide) is to correct the race conditions discovered in the primes code. The resulting image is then checked for results and performance.
Details: Detailed instructions are provided in the student lab manual. Students will use the critical sections technique described on the previous slide.

Debugging
Correct answer, but performance dropped to ~1.33X.
Is this the best we can expect from this algorithm? No! According to Amdahl's law, we can expect a speedup close to 1.9X.
Purpose of the Slide: Show that the critical sections method fixed the bug, but the performance is lower than expected.
Details: The slide shows a correct answer, but remind the students that this does not guarantee there is no bug (race conditions, if present, may show up only rarely). To be more rigorous, one would re-run Thread Checker.

Common Performance Problems
- Parallel overhead: caused by thread creation, scheduling, and so on
- Synchronization: excessive global data, contention for the same synchronization objects
- Load imbalance: improper distribution of the parallel work
- Granularity: not enough parallel work
Purpose of the Slide: To list some common performance issues (a follow-up to the previous slide, which showed poor threading performance).
Details: Each item listed is linked to a complete slide (in this set) that shows additional detail; we recommend following each link, one by one. We will see examples of two of these in the remaining labs. This slide and the previous one set us up for the final section of this module, performance tuning, which begins on the next slide.

Tuning for Performance
Thread Profiler pinpoints bottlenecks in threaded applications.
[Diagram: Primes.c -> compiler with source instrumentation (/Qopenmp_profile) -> Primes.exe (instrumented); alternatively, Primes.exe -> binary instrumentation -> Primes.exe + DLLs (instrumented); then -> runtime data collector -> Bistro.tp / guide.gvs (result files), viewed with Thread Profiler inside the VTune™ Performance Analyzer]
Purpose of the Slide: Introduce Thread Profiler as a tool to address threading performance; outline its implementation.
Details: The slide build shows the build-and-link stage for primes, using the flag /Qopenmp_profile. This flag replaces /Qopenmp and is required. From the user guide: Before you begin, you need to link and instrument your application with calls to the OpenMP* statistics gathering Runtime Engine. The Runtime Engine's calls are required because they collect performance data and write it to a file. 1. Compile your application using an Intel(R) Compiler. 2. Link your application to the OpenMP* Runtime Engine using the -Qopenmp_profile option. The slide then shows "Binary Instrumentation", but we will not be using that feature during this module. (Binary instrumentation would be used to investigate the underlying native threads of an OpenMP application.) The resulting run, and then a GUI snapshot, are shown in detail on the next slide.

Thread Profiler for OpenMP

Speedup Graph
- Estimates the threading speedup and the potential speedup
  - Based on Amdahl's law
  - Gives lower and upper bounds

[Figure labels: serial, parallel]

Thread Profiler (for Explicit Threads)
Why so many transitions?

Performance
- This implementation has implicit synchronization calls
- This limits performance scaling because of the resulting context switches
Back to the design stage.
Purpose of the Slide: Gives additional analysis regarding the Timeline and source views shown in the previous slide; identifies a significant bottleneck in the code.
Details: The slide build highlights the key portions of the Timeline and source view.
Questions to Ask Students: Why do we call this a synchronization, and what is implicit about it?

Activity 6
Use Thread Profiler to analyze a threaded application
- Use /Qopenmp_profile to compile and link
- Create a "Thread Profiler Activity (for explicit threads)"
- Run the application in Thread Profiler
- Find the source line that is causing the threads to be inactive
Purpose of the Slide: Refers students to the 6th lab activity, whose purpose is to run a Thread Profiler analysis on the primes code.
Details: Detailed instructions are provided in the student lab manual. This lab exercise repeats the steps demonstrated in the preceding slides; students should expect to see similar results.

Performance
Is this much contention expected? The algorithm performs many more updates than the 10 needed to show progress.
Original version:
    void ShowProgress( int val, int range )
    {
        int percentDone;
        #pragma omp critical
        {
            gProgress++;
            percentDone = (int)((float)gProgress/(float)range*200.0f+0.5f);
        }
        if( percentDone % 10 == 0 )
            printf("\b\b\b\b%3d%%", percentDone);
    }
Modified version:
    void ShowProgress( int val, int range )
    {
        int percentDone;
        static int lastPercentDone = 0;
        #pragma omp critical
        {
            gProgress++;
            percentDone = (int)((float)gProgress/(float)range*200.0f+0.5f);
        }
        if( percentDone % 10 == 0 && lastPercentDone < percentDone / 10){
            printf("\b\b\b\b%3d%%", percentDone);
            lastPercentDone++;
        }
    }
This change should fix the contention problem.
Purpose of the Slide: To address the performance problem identified in the preceding slides and lab.
Details: The test, if (percentDone % 10 == 0), does NOT cause printing to be done every 10th step, but much more often. The slide build introduces a fix.
Questions to Ask Students: Why is the original algorithm not printing as infrequently as intended? Why does the fix correct this problem? Invite and encourage the students to walk through the code with you, so that these questions are understood.
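Sketch (not part of the original slides): one way to answer the two questions above is to count, serially, how many times each version of the test would call printf. The helper names are hypothetical, and the loop bound `range / 2` is an assumption based on the loop testing only odd candidates (which would also explain the factor 200 in the percentage formula).

```c
#include <assert.h>

/* Same rounding as the slide's percentDone expression. */
static int percent_done(int progress, int range)
{
    return (int)((float)progress / (float)range * 200.0f + 0.5f);
}

/* Number of printf calls made by the original test (percentDone % 10 == 0). */
int count_prints_original(int range)
{
    assert(range >= 2);
    int prints = 0;
    for (int progress = 1; progress <= range / 2; progress++)
        if (percent_done(progress, range) % 10 == 0)
            prints++;
    return prints;
}

/* Number of printf calls once lastPercentDone filters out the repeats. */
int count_prints_fixed(int range)
{
    assert(range >= 2);
    int prints = 0, lastPercentDone = 0;
    for (int progress = 1; progress <= range / 2; progress++) {
        int percentDone = percent_done(progress, range);
        if (percentDone % 10 == 0 && lastPercentDone < percentDone / 10) {
            prints++;
            lastPercentDone++;
        }
    }
    return prints;
}
```

For a range of 100000, the fixed test prints exactly 10 times (once each for 10%, 20%, ..., 100%). The original test fires on every call for which the rounded percentage is a multiple of 10, which is roughly one call in ten, so it prints thousands of times.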

Design
- Goal: eliminate the implicit contention caused by synchronization
Speedup is 2.32X! Is that right?
Purpose of the Slide: Shows a result of the primes code that implements the correction shown on the previous slide, and shows an apparent anomaly in the resulting timing.
Details: The answer is correct, but the speedup of 2.32 for 2 cores cannot be right. Encourage the students to speculate as to the causes of this (you may hear words like superscalar, cache, and so on; all are red herrings).

Performance
- Our original baseline measurement included the "flawed" progress-update algorithm
- Is this the best we can expect from this algorithm?
The actual speedup is 1.40X (<<1.9X)!
Purpose of the Slide: Show the corrected baseline timing; resolves the apparent anomaly of the previous slide.
Details: The timing shown is a new baseline timing, with the contention correction added to the serial version of primes (note the directory name in the command window). The original baseline timing was 11.73s; this version shows 7.09s, giving a new speedup ratio of about 1.40X, the figure shown on the slide. This is significantly lower than the speedup of 1.9 predicted by Amdahl's law.
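Worked arithmetic (not part of the original slides): the notes do not state the parallel run time, so the value below is an inference, assuming the 2.32X figure was computed against the original 11.73 s baseline.

```latex
\[
T_{parallel} \approx \frac{11.73\ \text{s}}{2.32} \approx 5.06\ \text{s},
\qquad
S_{corrected} = \frac{7.09\ \text{s}}{5.06\ \text{s}} \approx 1.40
\]
```

Under that assumption, the same parallel timing gives 2.32X against the flawed baseline and 1.40X against the corrected one, which is consistent with the two slide titles.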

Activity 7
Modify the ShowProgress function (serial and OpenMP versions) so that it prints only the output that is needed.
Recompile and run the code. Make sure no instrumentation flags are used.
What is the speedup relative to the serial version?
    if( percentDone % 10 == 0 && lastPercentDone < percentDone / 10){
        printf("\b\b\b\b%3d%%", percentDone);
        lastPercentDone++;
    }
Purpose of the Slide: Refers students to the 7th lab activity, whose purpose is to introduce the performance fix outlined in the preceding slides.
Details: Detailed instructions are provided in the student lab manual. Students will implement the code shown on previous slides, measure new timings, and derive a new speedup number. It is unlikely to match the 1.40X speedup shown on the slides precisely (since the platforms used for this class will vary), but it should be similar.

Reviewing Performance
Still 62% of the execution time is spent in locks and synchronization.

Let's look at the OpenMP locks...
The lock is inside a loop.
Original version, with a critical section:
    void FindPrimes(int start, int end)
    {
        // start is always odd
        int range = end - start + 1;
    #pragma omp parallel for
        for( int i = start; i <= end; i += 2 ) {
            if( TestForPrime(i) )
    #pragma omp critical
                globalPrimes[gPrimesFound++] = i;
            ShowProgress(i, range);
        }
    }
Revised version, with InterlockedIncrement:
    void FindPrimes(int start, int end)
    {
        // start is always odd
        int range = end - start + 1;
    #pragma omp parallel for
        for( int i = start; i <= end; i += 2 ) {
            if( TestForPrime(i) )
                globalPrimes[InterlockedIncrement(&gPrimesFound)] = i;
            ShowProgress(i, range);
        }
    }
Purpose of the Slide: Examine the lock protecting gPrimesFound, to understand the performance impact.
Details: As stated on the slide, the real issue is that the critical section (lock) sits inside a loop. A fix is proposed in the slide build sequence. The slide build:
  Points out the lock within a loop.
  Introduces a fix, using the Windows threading function InterlockedIncrement. This function is declared as LONG InterlockedIncrement( LONG volatile* Addend ); where Addend [in, out] is a pointer to the variable to be incremented. This is an atomic operation, less disruptive than a critical section.
  The final build shows a timing result from a primes image incorporating this fix.
A key point: it is possible, and sometimes desirable, to mix OpenMP and native threading calls!
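
For readers who are not on Windows, the same idea can be sketched with OpenMP's own atomic construct instead of InterlockedIncrement. This is a swapped-in alternative, not the approach used in the slides; it assumes a compiler that supports OpenMP 3.1 or later (where atomic capture was introduced). The names global_primes, primes_found, MAX_PRIMES and test_for_prime are hypothetical stand-ins for the lab code. Note also that InterlockedIncrement returns the incremented value, whereas the post-increment below yields the previous value, so the two variants index the array differently.

```c
#include <stdbool.h>

#define MAX_PRIMES 1000            /* hypothetical capacity for this sketch */

static int global_primes[MAX_PRIMES];
static int primes_found = 0;

/* Hypothetical stand-in for TestForPrime: naive trial division. */
static bool test_for_prime(int n)
{
    if (n < 2)
        return false;
    for (int d = 2; (long)d * d <= n; d++)
        if (n % d == 0)
            return false;
    return true;
}

/* start must be odd. Returns the number of primes found in [start, end]. */
int find_primes(int start, int end)
{
    primes_found = 0;
    #pragma omp parallel for
    for (int i = start; i <= end; i += 2) {
        if (test_for_prime(i)) {
            int slot;
            /* Atomically fetch the old count and increment it, so each
               thread receives a unique slot without a critical section. */
            #pragma omp atomic capture
            slot = primes_found++;
            if (slot < MAX_PRIMES)
                global_primes[slot] = i;
        }
    }
    return primes_found;
}
```

The code also compiles without OpenMP enabled, in which case the pragmas are ignored and the loop simply runs serially. The order of the stored primes is not deterministic when several threads run.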

Let's look at the second lock
This lock is also being called inside a loop.
Original version, with a critical section:
    void ShowProgress( int val, int range )
    {
        int percentDone;
        static int lastPercentDone = 0;
    #pragma omp critical
        {
            gProgress++;
            percentDone = (int)((float)gProgress/(float)range*200.0f+0.5f);
        }
        if( percentDone % 10 == 0 && lastPercentDone < percentDone / 10){
            printf("\b\b\b\b%3d%%", percentDone);
            lastPercentDone++;
        }
    }
Revised version, with InterlockedIncrement:
    void ShowProgress( int val, int range )
    {
        long percentDone, localProgress;
        static int lastPercentDone = 0;
        // Work on a local copy of the counter returned by the atomic increment.
        localProgress = InterlockedIncrement(&gProgress);
        percentDone = (int)((float)localProgress/(float)range*200.0f+0.5f);
        if( percentDone % 10 == 0 && lastPercentDone < percentDone / 10){
            printf("\b\b\b\b%3d%%", percentDone);
            lastPercentDone++;
        }
    }
Purpose of the Slide: Examine the lock protecting gProgress, to understand the performance impact.
Details: The same fix, InterlockedIncrement, is used for this critical section. The slide build:
  Points out that the critical section is in a loop (Question: what loop?).
  Introduces an alternative to the critical section, using the Windows API. Note that 3 lines of code need to be modified.
  The final build shows a timing result from a primes image incorporating this fix.
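
As a portable variation on this fix, the sketch below replaces InterlockedIncrement with an OpenMP atomic capture and also avoids the shared lastPercentDone variable, which in the slide code appears to be read and written by several threads without synchronization. Instead, a call reports progress only when its own counter value crosses a 10 percent boundary. This is a different technique from the one in the slides; the names g_progress, advance_progress and count_reports are hypothetical.

```c
static long g_progress = 0;

/* Same percentage formula as the slide code. */
static int percent_of(long count, long range)
{
    return (int)((float)count / (float)range * 200.0f + 0.5f);
}

/* Returns the percentage to print, or -1 if this call should stay silent.
   Each call gets a unique counter value, so exactly one call observes
   each crossing of a 10 percent boundary. */
int advance_progress(long range)
{
    long local;
    #pragma omp atomic capture
    local = ++g_progress;

    int now = percent_of(local, range) / 10;
    int before = percent_of(local - 1, range) / 10;
    return now > before ? now * 10 : -1;
}

/* Serial usage example: counts how many calls would print. */
int count_reports(long range)
{
    int reports = 0;
    g_progress = 0;
    for (long k = 0; k < range / 2; k++)
        if (advance_progress(range) >= 0)
            reports++;
    return reports;
}
```

A caller would print the returned value whenever it is not -1. With several threads the reports may reach the console slightly out of order, which the slide version does not guarantee either.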

Activity 8
Modify the OpenMP critical regions, replacing them with InterlockedIncrement.
Recompile and run the code.
What is the speedup relative to the serial version?
Purpose of the Slide: Refers students to the 8th lab activity, whose purpose is to introduce the code change cited, and measure its impact.
Details: Detailed instructions are provided in the student lab manual.

Thread 0: 342 factors to test
Thread 1: 612 factors to test
Thread 2: 789 factors to test
Thread 3: 934 factors to test
Range boundaries marked on the slide: 250000, 500000, 750000
Purpose of the Slide: To examine the causes of the load imbalance observed in the profile of the primes code.
Details: Using 4 threads makes the imbalance more obvious. The slide build:
  Overlaying the Threads view, a "stair step" is drawn to illustrate that each successive thread takes additional time.
  A bar is drawn to illustrate that the iterations were divided among the threads in equal amounts. "Didn't we divide the iterations evenly? Let's look at the work being done for a 'middle' prime in each group."
  Boxes with the precise workload stated for each thread appear, showing explicitly that more steps are required as the algorithm searches for primes among larger numbers.
  A triangle is drawn to illustrate (conceptually, not precisely) the nature of the workload, which increases with increasing number range.
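
The growth in work can be approximated with a small model. It assumes, as the figures on the slide suggest, that testing a prime n requires trying about sqrt(n) candidate factors; for instance the integer square root of 375000, the midpoint of the second quarter, is 612. The function names and the equal-block split of [1, limit] are illustrative assumptions, and the totals are an upper bound because composite numbers exit the test early.

```c
/* Largest d with d*d <= n: the highest trial divisor a naive test needs. */
long factors_to_test(long n)
{
    long d = 0;
    while ((d + 1) * (d + 1) <= n)
        d++;
    return d;
}

/* Upper bound on the trial divisions for the odd candidates in one
   thread's contiguous block when [1, limit] is split into nthreads
   equal blocks. */
long block_work(int thread, int nthreads, long limit)
{
    long lo = limit / nthreads * thread + 1;
    long hi = limit / nthreads * (thread + 1);
    long total = 0;
    for (long n = lo | 1; n <= hi; n += 2)   /* lo | 1 rounds up to odd */
        total += factors_to_test(n);
    return total;
}
```

Calling block_work(t, 4, limit) for t = 0..3 gives a strictly increasing sequence, which is the stair step seen in the Threads view.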

Fixing the Load Imbalance
Distribute the work more evenly.
    void FindPrimes(int start, int end)
    {
        // start is always odd
        int range = end - start + 1;
        // Hand out chunks of 8 iterations to the threads in round-robin order.
    #pragma omp parallel for schedule(static, 8)
        for( int i = start; i <= end; i += 2 ) {
            if( TestForPrime(i) )
                globalPrimes[InterlockedIncrement(&gPrimesFound)] = i;
            ShowProgress(i, range);
        }
    }
Speedup achieved is 1.68X.
Purpose of the Slide: Introduce a method to address the load imbalance inherent in the primes algorithm.
Details: The slide build:
  The triangle from the previous slide is redrawn, illustrating the different "sizes" of work for each thread.
  A new triangle is shown, with the workload interleaved to achieve a more even distribution.
  An OpenMP schedule clause is added to the pragma, which achieves that interleaving (no other code change is required).
  A sample run is shown for this approach, with the time now 4.22s, a speedup of 1.68.
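
The effect of the chunked schedule can be estimated without running any threads. With schedule(static, chunk), OpenMP hands out chunks of iterations to the threads in round-robin order, so iteration k belongs to thread (k / chunk) % nthreads. The sketch below compares that mapping with one contiguous block per thread, using an assumed cost of about sqrt(n) per candidate; the function names, the cost model and the limit of 64 threads are illustrative assumptions.

```c
/* Thread that runs logical iteration iter under schedule(static, chunk). */
int static_owner(long long iter, long long chunk, int nthreads)
{
    return (int)((iter / chunk) % nthreads);
}

/* Assumed cost model: work for candidate n grows like sqrt(n). */
static long long cost(long long n)
{
    long long d = 0;
    while ((d + 1) * (d + 1) <= n)
        d++;
    return d;
}

/* Work of the busiest thread as a percentage of the average, for the odd
   candidates 3, 5, ... up to limit. chunk == 0 models one contiguous
   block per thread. Assumes nthreads <= 64. */
long long busiest_percent(long long limit, long long chunk, int nthreads)
{
    long long work[64] = {0};
    long long iters = (limit - 3) / 2 + 1;
    long long per_thread = (iters + nthreads - 1) / nthreads;
    long long total = 0, max = 0;

    for (long long k = 0; k < iters; k++) {
        int t = chunk > 0 ? static_owner(k, chunk, nthreads)
                          : (int)(k / per_thread);
        long long c = cost(3 + 2 * k);
        work[t] += c;
        total += c;
    }
    for (int t = 0; t < nthreads; t++)
        if (work[t] > max)
            max = work[t];
    return 100 * max * nthreads / total;
}
```

Under this model, busiest_percent(100000, 0, 4) is well above 100, while busiest_percent(100000, 8, 4) stays close to 100, meaning every thread receives nearly the same amount of work. A larger chunk size reduces scheduling overhead but interleaves the work more coarsely.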

Activity 9
Modify the code to improve the load balance: add the schedule(static, 8) clause to the OpenMP parallel for pragma.
Recompile and run the code.
What is the speedup relative to the serial code?
Purpose of the Slide: Refers students to the 9th lab activity, whose purpose is to introduce static OpenMP scheduling, and measure its impact.
Details: Detailed instructions are provided in the student lab manual. As before, results achieved should be similar to (though probably not exactly the same as) those shown in the slides.

Final Thread Profiler Run
Speedup achieved is 1.80X.
Purpose of the Slide: Show the performance profile of the final version of primes, with all corrections and load balancing implemented.
Details: Note that the speedup, 1.80X, is higher than the 1.68X cited on a preceding slide; the difference is that this final run uses the "Release" build, free of the overhead of the "Debug" build shown previously (note the directories shown: here it is c:\classfiles\PrimeOpenMP\Release, previously it was …\Debug).

Comparative Analysis
Parallel applications require several iterations through the software development cycle.
Purpose of the Slide: Summarizes the results at each step in the performance tuning process; emphasizes the iterative nature of the process.

Parallel Programming Methodology: What Was Covered
Four steps of the development cycle for writing parallel applications from serial code, and the Intel® tools that support each step:
  Analysis
  Design (introduce threads)
  Debug for correctness
  Tune for performance
Parallel applications require multiple iterations of design, debugging, and performance tuning.
Use the tools to improve productivity.
Purpose of the Slide: Summarizes the key points covered in this module.

Additional Slides

Parallel Overhead
Thread creation overhead: the overhead increases as the number of active threads increases.
Solution: use reusable threads and thread pools. This amortizes the cost of creating threads and keeps the number of active threads relatively constant.

Synchronization
Contention from dynamic memory allocation: dynamic memory allocation causes implicit synchronization. Allocate on the stack to use thread-local storage.
Atomic updates versus critical sections: some updates of global data can use atomic operations (the Interlocked family). Use atomic updates whenever possible.
Critical sections versus mutexes: CRITICAL_SECTION objects reside in user space. Use CRITICAL_SECTION objects when visibility across process boundaries is not required. They introduce less overhead and have a spin-wait variant that is useful for some applications.
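
The first two recommendations can be combined in one short OpenMP sketch: each thread accumulates into a variable on its own stack and publishes its result with a single atomic update, instead of synchronizing on every iteration. The function names and the naive primality test are hypothetical and serve only as a workload.

```c
#include <stdbool.h>

/* Hypothetical workload: naive trial division. */
static bool is_prime_naive(int n)
{
    if (n < 2)
        return false;
    for (int d = 2; (long)d * d <= n; d++)
        if (n % d == 0)
            return false;
    return true;
}

/* Counts primes among start, start + 2, ... up to end (start must be odd). */
int count_primes_local(int start, int end)
{
    int total = 0;
    #pragma omp parallel
    {
        int local = 0;              /* lives on this thread's stack */

        #pragma omp for nowait
        for (int i = start; i <= end; i += 2)
            if (is_prime_naive(i))
                local++;

        /* One atomic update per thread rather than one per prime. */
        #pragma omp atomic
        total += local;
    }
    return total;
}
```

OpenMP's reduction(+:total) clause expresses the same pattern more compactly; the explicit form is shown here only to make the thread-local accumulation visible.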

Unbalanced Work
Unequal workloads lead to idle threads and wasted time.
[Figure: per-thread timeline over time; legend: Busy, Idle]

Granularity
[Figure: coarse-grain versus fine-grain decomposition, showing the serial and parallelizable portions of the work; scaling values shown: ~2.5X, ~3X, ~1.05X, ~1.10X]
