Programming with OpenMP*


1 Programming with OpenMP*
Intel Software College

2 Multi-core Programming: Programming with OpenMP Speaker’s Notes
Objectives Upon completion of this module the student will be able to Implement data parallelism Implement task parallelism Multi-core Programming: Programming with OpenMP Speaker’s Notes Purpose of the Slide Objectives of the module.

3 Multi-core Programming: Programming with OpenMP Speaker’s Notes
Agenda What is OpenMP? Parallel Regions Worksharing Data Environment Synchronization Optional Advanced Topics Multi-core Programming: Programming with OpenMP Speaker’s Notes Purpose of the Slide Give a preview of the topics that are going to be covered in this module.

4 The current specification is OpenMP 3.0 (combines C/C++ and Fortran)
What is OpenMP? A portable, shared-memory parallel API Fortran, C, and C++ Support from many vendors for Linux and Windows Standardizes task and loop-level parallelism Supports coarse-grained parallelism Combines serial and parallel code in a single source Standardizes ~20 years of compiler-directed parallelization experience The current specification is OpenMP 3.0, 318 pages (combines C/C++ and Fortran) Script: What is OpenMP? OpenMP is a portable (OpenMP codes can be moved between Linux and Windows, for example), shared-memory threading API that standardizes task and loop-level parallelism. Because OpenMP clauses have both lexical and dynamic extent, it is possible to support broad, multi-file, coarse-grained parallelism. Often the best technique is to parallelize at the coarsest grain possible, frequently parallelizing tasks or loops from within the main driver itself, as this gives the most bang for the buck (the most computation for the necessary threading overhead costs). Another key benefit is that OpenMP lets a developer parallelize an application incrementally. Since OpenMP is primarily a pragma- or directive-based approach, we can easily combine serial and parallel code in a single source. By simply compiling with or without the /openmp compiler flag we can turn OpenMP on or off. Code compiled without the /openmp flag simply ignores the OpenMP pragmas, which allows easy access back to the original serial application. OpenMP also standardizes about 20 years of compiler-directed threading experience. For more information, or to review the latest OpenMP specification (currently OpenMP 3.0), go to the OpenMP web site.

5 Programming Model
The master thread forks a team of threads as needed Parallelism is added incrementally: the sequential program evolves into a parallel program Master thread Multi-core Programming: Programming with OpenMP Speaker’s Notes Purpose of the Slide Give a more detailed definition and example of fork-join parallelism within OpenMP. Details The application starts execution in serial with the master thread. At each parallel region encountered, threads are forked off, execute concurrently, and then join together at the end of the region. Background There is an API call to change the number of threads that will execute within a parallel region. This will be covered briefly later, if asked. How to choose the number of threads for a parallel region is covered in 3 slides. Parallel regions

6 Syntax details to get started
Most OpenMP constructs are compiler directives or pragmas For C and C++, the pragmas take the form: #pragma omp construct [clause [clause]…] For Fortran, the directives take one of the forms: C$OMP construct [clause [clause]…] !$OMP construct [clause [clause]…] *$OMP construct [clause [clause]…] Header file or Fortran 90 module #include "omp.h" use omp_lib Script: Most of the constructs we will be looking at in OpenMP are compiler directives or pragmas. The C/C++ and Fortran versions of these directives are shown here. For C and C++, the pragmas take the form: #pragma omp construct [clause [clause]…], where clauses are optional modifiers. Be sure to include "omp.h" if you intend to use any routines from the OpenMP library. Next we will look at some environment variables that control OpenMP behavior. Note to Fortran users: for Fortran you can also use other sentinels (!$OMP), but these must line up exactly on the required columns; column 6 must be blank or contain a + indicating that the line is a continuation of the previous line.
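A minimal sketch of the C form (assuming a GCC/Clang-style -fopenmp flag or the /openmp flag mentioned in the notes; the standard _OPENMP macro guards the library calls so the same source also builds serially):

#include <stdio.h>
#ifdef _OPENMP
#include <omp.h>                 /* OpenMP runtime routines */
#endif

int main(void)
{
    /* clauses such as num_threads() are optional modifiers of the construct */
    #pragma omp parallel num_threads(4)
    {
#ifdef _OPENMP
        printf("Thread %d of %d\n",
               omp_get_thread_num(), omp_get_num_threads());
#else
        printf("Compiled without OpenMP: running serially\n");
#endif
    }
    return 0;
}

Compiled without the OpenMP flag, the pragma is simply ignored and the block runs once, which is exactly the incremental on/off behavior the notes describe.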

7 Multi-core Programming: Programming with OpenMP Speaker’s Notes
Agenda What is OpenMP? Parallel Regions Worksharing Data Environment Synchronization Optional Advanced Topics Multi-core Programming: Programming with OpenMP Speaker’s Notes Purpose of the Slide Give a preview of the topics that are going to be covered in this module.

8 Parallel Region and Structured Blocks (C/C++)
Most OpenMP constructs apply to structured blocks Structured block: a block with one point of entry at the top and one point of exit at the bottom The only permitted "branches" are STOP statements in Fortran and exit() in C/C++

9 Parallel Region and Structured Blocks (C/C++)
#pragma omp parallel { int id = omp_get_thread_num(); more: res[id] = do_big_job (id); if (conv (res[id])) goto more; } printf ("All done\n"); if (go_now()) goto more; #pragma omp parallel { int id = omp_get_thread_num(); more: res[id] = do_big_job(id); if (conv (res[id])) goto done; goto more; } done: if (!really_done()) goto more; Script: A parallel region is created using the #pragma omp parallel construct. A master thread creates a pool of worker threads once the master thread crosses this pragma. On this foil, the creation of the parallel region is highlighted in yellow and includes the pragma and the left curly brace "{". The parallel region extends from the left curly brace to the highlighted yellow right curly brace "}". There is an implicit barrier at the right curly brace, and that is the point at which the other worker threads complete execution and either go to sleep, spin, or otherwise idle. Parallel constructs form the foundation of OpenMP parallel execution. Each time an executing thread enters a parallel region, it creates a team of threads and becomes master of that team. This allows parallel execution to take place within that construct by the threads in that team. The only directive necessary for a parallel region is: #pragma omp parallel. A parallel region consists of a structured block of code. On the left we see a good example of a structured block: there is a single point of entry at the top of the block, one exit at the bottom, and no branches out of the block. Question to class – can someone spot some reasons why the other block is unstructured? On the right we see a bad example, an unstructured block of code. Here we have two entries into the block – one from the top of the block and one from the goto more statement, which jumps into the block at the label "more:". Additionally, the bad example has multiple exits from the block: one from the bottom of the block and one from the goto done statement. A structured block An unstructured block

10 Activity 1: Hello Worlds
Modify the serial "Hello, Worlds" code so that it runs in parallel with OpenMP* Script: Take about 5 minutes to build and run the hello world lab. In this example we will print "hello world" from several threads. Run the code several times. Do you see any issues with the code – do you always get the expected results? Does anyone in the class see odd behavior in the sequence of the words that are printed out? Since printf is a function of state – it can only print one thing to one screen at a time – some students are likely to see "race conditions" where one printf in a thread is writing over the results of another printf in another thread.
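One possible solution sketch (assuming the serial lab code simply prints the greeting from main; the actual file and function names in the lab document may differ):

#include <stdio.h>
#include <omp.h>

int main(void)
{
    #pragma omp parallel
    {
        /* every thread in the team executes this block once;
           the order of the printed lines changes from run to run */
        printf("Hello, Worlds from thread %d\n", omp_get_thread_num());
    }
    return 0;
}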

11 Multi-core Programming: Programming with OpenMP Speaker’s Notes
Agenda What is OpenMP? Parallel Regions Worksharing Data Environment Synchronization Optional Advanced Topics Multi-core Programming: Programming with OpenMP Speaker’s Notes Purpose of the Slide Give a preview of the topics that are going to be covered in this module.

12 Automatically divides work among threads
Worksharing Worksharing is the general term used in OpenMP to describe the distribution of work among threads. Three examples of worksharing in OpenMP are: the omp for construct the omp sections construct the omp task construct Script: Worksharing is the general term used in OpenMP to describe distribution of work across threads. There are three primary categories of worksharing in OpenMP. The three examples are: the omp for construct, which automatically divides a for loop’s work and distributes it across threads; the omp sections directive, which distributes work among threads bound to a defined parallel region and is good for function-level parallelism where the tasks or functions are well defined and known at compile time; and the omp task pragma, which can be used to explicitly define a task. Now we will look more closely at the omp for construct. Automatically divides work among threads

13 Multi-core Programming: Programming with OpenMP Speaker’s Notes
The omp for construct #pragma omp parallel #pragma omp for (Diagram: iterations i = 0 through i = 11 divided among the threads, with an implicit barrier at the end.) // assume N=12 #pragma omp parallel #pragma omp for for(i = 1; i < N+1; i++) c[i] = a[i] + b[i]; Threads are assigned sets of independent iterations Threads must wait at the end of the worksharing construct Multi-core Programming: Programming with OpenMP Speaker’s Notes Purpose of the Slide Demonstrate use and execution of the “for” work-sharing pragma. Details Diagram shows a static division of iterations based on the number of threads. Note implicit barrier at end of construct. Questions to Ask Students Q: Why is there a barrier at the end of the work-sharing construct? A: Code following the for-loop may rely on the results of the computations within the for-loop. In serial code, the for-loop completes before proceeding on to the next computation. Thus, to remain serially consistent, the barrier at the end of the construct is enforced.
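A self-contained version of the snippet above (a sketch; the array contents are arbitrary, and the loop here runs 0..N-1 rather than 1..N):

#include <stdio.h>
#define N 12

int main(void)
{
    float a[N], b[N], c[N];
    int i;

    for (i = 0; i < N; i++) { a[i] = (float)i; b[i] = 2.0f * i; }

    #pragma omp parallel
    #pragma omp for
    for (i = 0; i < N; i++)      /* iterations are divided among the threads */
        c[i] = a[i] + b[i];      /* implicit barrier at the end of the for   */

    for (i = 0; i < N; i++)
        printf("c[%d] = %g\n", i, c[i]);
    return 0;
}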

14 Combining pragmas These two code fragments are equivalent #pragma omp parallel
{ #pragma omp for for (i=0;i< MAX; i++) { res[i] = huge(); } } #pragma omp parallel for for (i=0;i< MAX; i++) { res[i] = huge(); } Script: In these equivalent code snippets we see that OpenMP constructs can be combined into a single statement. Here the #pragma omp parallel and the nested #pragma omp for from the first example are combined into the single, tidier construct #pragma omp parallel for. Most often in this course we will use the more abbreviated combined version. There can be occasions, however, when it is useful to separate the constructs – such as when the parallel region has other work to do in addition to the for, maybe an additional omp sections directive. Now we’ll turn to lab activity 2. Background: Side note – not shown, but also useful: it is not just omp for that can be merged with omp parallel. It is also allowed to merge omp sections with omp parallel -> #pragma omp parallel sections.

15 The private clause Replicates the variable for each thread
Variables are not initialized; in C++ the object is default constructed Any value external to the parallel region is undefined void* work(float* c, int N) { float x, y; int i; #pragma omp parallel for private(x,y) for(i=0; i<N; i++) { x = a[i]; y = b[i]; c[i] = x + y; } } Script: We have just about wrapped up our coverage of the omp parallel for construct – before we move on to the lab, however, we need to introduce one more concept: private variables. For reasons that we will go into later in this module, it is important to be able to give some variables copies that are private to each thread. The private clause accomplishes this. Declaring a variable private means that each thread has its own copy of that variable, which can be modified without affecting any other thread’s copy of the similarly named variable. In the example above, each thread has private copies of variables x and y. This means that thread 1, thread 2, thread 3, etc. all have a variable named x and a variable named y, but thread 1’s x can contain a different value than thread 2’s x. This allows each thread to proceed without affecting the computation of the other threads. We will talk more about private variables later in the context of race conditions. Next foil Background Private clause The for-loop iteration variable is PRIVATE by default.
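A compilable version of the sketch above (the global arrays a and b are an assumption, since the slide’s work() reads them without declaring them):

#include <stdio.h>
#define N 8

float a[N], b[N];                /* assumed globals, as the slide's work() implies */

void work(float *c, int n)
{
    float x, y;
    int i;
    /* each thread gets its own x and y, so one thread's temporaries
       never overwrite another thread's */
    #pragma omp parallel for private(x, y)
    for (i = 0; i < n; i++) {
        x = a[i];
        y = b[i];
        c[i] = x + y;
    }
}

int main(void)
{
    float c[N];
    int i;
    for (i = 0; i < N; i++) { a[i] = (float)i; b[i] = 10.0f * i; }
    work(c, N);
    for (i = 0; i < N; i++) printf("c[%d] = %g\n", i, c[i]);
    return 0;
}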

16 Activity 2 – Parallel Mandelbrot
Objective: create a parallel version of Mandelbrot. Modify the code to add OpenMP clauses that parallelize the Mandelbrot computation. Follow the Mandelbrot activity called Mandelbrot in the lab document Script: In this exercise we will use the combined #pragma omp parallel for construct to parallelize a specific loop in the Mandelbrot application. Then we will use wall-clock time to compare the performance of the parallel version against the serial version.

17 The schedule clause The schedule clause affects how loop iterations are mapped onto threads schedule(static [,chunk]) Blocks of iterations of size "chunk" are assigned to threads Round-robin distribution Low overhead, may cause load imbalance schedule(dynamic [,chunk]) Threads grab a chunk of iterations When a thread finishes its iterations, it requests the next chunk Somewhat more overhead, can reduce load-imbalance problems schedule(guided [,chunk]) Dynamic scheduling starting from the largest block Block sizes shrink, but never below "chunk" Script: The omp for loop can be modified with a schedule clause that affects how the loop iterations are mapped onto threads. This mapping can dramatically improve performance by eliminating load imbalance or by reducing threading overhead. We are going to dissect the schedule clause, look at several options we can use in it, and see how OpenMP loop behavior changes for each option. The first option is schedule(static). Let’s advance to the first sub-animation on the slide. For the sake of argument, assume the loop in question has N iterations and that we have 4 threads in the thread pool, just to make the concepts a little more tangible. To talk about scheduling we first need to define what a chunk is: a chunk is a contiguous range of iterations – iterations 0 through 99, for example, would be considered a chunk. What schedule(static) does is break the for loop into chunks of iterations; each thread in the team gets one chunk. If we have N total iterations, schedule(static) assigns a chunk of N/(number of threads) iterations to each thread for execution. schedule(static, chunk) – let’s assume the chunk size is 8. Then schedule(static, 8) interleaves the allocation of chunks of size 8 to threads: thread 1 gets 8 iterations, then thread 2 gets another 8, and so on. The chunks of 8 are doled out in round-robin fashion to whichever threads are free for execution. Increasing the chunk size reduces overhead and may increase the cache hit rate; decreasing the chunk size allows finer balancing of workloads. Next animation: schedule(dynamic) takes more overhead, but it effectively assigns iterations to threads one at a time, dynamically. This is great for loops where iteration 1 takes a vastly different amount of computation than iteration N, for example. Using dynamic scheduling can greatly improve threading load balance in such cases. Threading load balance is the ideal state where all threads have an equal amount of work to do and can all finish in roughly the same amount of time. schedule(dynamic, chunk) is similar to dynamic but assigns threads one chunk at a time rather than one iteration at a time. Next animation: schedule(guided) is a specific case of dynamic scheduling for computations known to take less time for early iterations and significantly longer for later iterations – for example, finding prime numbers with a sieve. The larger candidates take more time to test than the smaller ones, so in a large loop the early iterations compute quickly but the later iterations take a long time; schedule(guided) may be a great choice for this situation. schedule(guided, chunk) – dynamic allocation of chunks to tasks using a guided self-scheduling heuristic.
Initial chunks are bigger, later chunks are smaller – the minimum chunk size is "chunk". Let’s look at recommended uses: use static scheduling for predictable and similar work per iteration; use dynamic scheduling for unpredictable, highly variable work per iteration; use guided scheduling as a special case of dynamic to reduce scheduling overhead when the computation gets progressively more time consuming; use auto scheduling to let the compiler or runtime environment variables choose. Let’s look at an example now – see the next foil. Background: See Dr. Michael Quinn’s book, Parallel Programming in C with MPI and OpenMP. Here are some quick notes regarding each of the options on the screen: schedule(static) – block allocation of N/threads contiguous iterations to each thread. schedule(static, S) – interleaved allocation of chunks of size S to threads. schedule(dynamic) – dynamic one-at-a-time allocation of iterations to threads. schedule(dynamic, S) – dynamic allocation of S iterations at a time to threads. schedule(guided) – guided self-scheduling; minimum chunk size is 1. schedule(guided, S) – dynamic allocation of chunks to tasks using a guided self-scheduling heuristic; initial chunks are bigger, later chunks are smaller, and the minimum chunk size is S. Note: When schedule(static, chunk_size) is specified, iterations are divided into chunks of size chunk_size, and the chunks are assigned to the threads in the team in a round-robin fashion in the order of the thread number. When no chunk_size is specified, the iteration space is divided into chunks that are approximately equal in size, and at most one chunk is distributed to each thread; the size of the chunks is unspecified in this case. A compliant implementation of the static schedule must ensure that the same assignment of logical iteration numbers to threads will be used in two loop regions if the following conditions are satisfied: 1) both loop regions have the same number of loop iterations, 2) both loop regions have the same value of chunk_size specified, or both loop regions have no chunk_size specified, and 3) both loop regions bind to the same parallel region. A data dependence between the same logical iterations in two such loops is guaranteed to be satisfied, allowing safe use of the nowait clause (see Section A.9 on page 170 of the specification for examples). dynamic: When schedule(dynamic, chunk_size) is specified, the iterations are distributed to threads in the team in chunks as the threads request them. Each thread executes a chunk of iterations, then requests another chunk, until no chunks remain to be distributed. Each chunk contains chunk_size iterations, except for the last chunk to be distributed, which may have fewer iterations. When no chunk_size is specified, it defaults to 1. guided: When schedule(guided, chunk_size) is specified, the iterations are assigned to threads in the team in chunks as the executing threads request them. Each thread executes a chunk of iterations, then requests another chunk, until no chunks remain to be assigned. For a chunk_size of 1, the size of each chunk is proportional to the number of unassigned iterations divided by the number of threads in the team, decreasing to 1. For a chunk_size with value k (greater than 1), the size of each chunk is determined in the same way, with the restriction that the chunks do not contain fewer than k iterations (except for the last chunk to be assigned, which may have fewer than k iterations).
auto When schedule(auto) is specified, the decision regarding scheduling is delegated to the compiler and/or runtime system. The programmer gives the implementation the freedom to choose any possible mapping of iterations to threads in the team. runtime When schedule(runtime) is specified, the decision regarding scheduling is deferred until run time, and the schedule and chunk size are taken from the run-sched-var ICV. If the ICV is set to auto, the schedule is implementation defined.
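A sketch showing how the different schedule clauses are written; which one is fastest depends on the work per iteration, so the choices below are illustrative only:

void schedule_examples(int n, double *x)
{
    int i;

    /* contiguous blocks of roughly n/nthreads iterations per thread */
    #pragma omp parallel for schedule(static)
    for (i = 0; i < n; i++) x[i] += 1.0;

    /* chunks of 8 iterations handed out round-robin */
    #pragma omp parallel for schedule(static, 8)
    for (i = 0; i < n; i++) x[i] += 1.0;

    /* threads grab chunks of 4 iterations on demand (more overhead,
       better load balance for irregular work) */
    #pragma omp parallel for schedule(dynamic, 4)
    for (i = 0; i < n; i++) x[i] += 1.0;

    /* dynamic with shrinking chunk sizes, never smaller than 2 */
    #pragma omp parallel for schedule(guided, 2)
    for (i = 0; i < n; i++) x[i] += 1.0;

    /* decision deferred to the OMP_SCHEDULE environment variable / runtime */
    #pragma omp parallel for schedule(runtime)
    for (i = 0; i < n; i++) x[i] += 1.0;
}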

18 Schedule clause example
#pragma omp parallel for schedule (static, 8) for( int i = start; i <= end; i += 2 ) { if ( TestForPrime(i) ) gPrimesFound++; } Iterations are divided into chunks of 8 If start = 3, the first chunk is i = {3,5,7,9,11,13,15,17} Script: This example tests numbers for primality. It uses an omp parallel for construct modified with a schedule(static, 8) clause. The example has an increasing amount of work as the iteration counter gets larger, because testing numbers for primality with a brute-force method takes longer for larger numbers, since there are more numbers to test. This C example uses STATIC scheduling: the set of iterations is divided into chunks of size 8 and distributed to threads in round-robin fashion. Let’s compare plain STATIC scheduling with STATIC,8 scheduling, and assume we have 4 threads in the team and that start = 1 and end = 1001. With simple STATIC scheduling and 4 threads, each thread would be assigned a chunk of 250 iterations: the first 250 iterations would be assigned to thread 1, the next 250 to thread 2, and so on. The last thread – thread 4 – would get the 250 most difficult iterations and would take far longer than the others. On the other hand, when we use STATIC,8 the runtime groups 8 iterations into each chunk. As we approach the end of the loop, we have 4 threads all computing the more difficult calculations. Arguably the thread computing the second-to-last chunk, iterations 985–993, is almost as challenged as the thread computing the last chunk, iterations 993–1001. So we see that load balancing has improved. It may be even better to try dynamic or guided here and see how the performance compares. Now it’s time for a lab.

19 Activity 2b – Mandelbrot scheduling
Objective: create a parallel version of Mandelbrot that uses OpenMP dynamic scheduling Follow the Mandelbrot activity called Mandelbrot Scheduling in the lab document Script: In this activity you will experiment with the OpenMP scheduling clause to empirically determine the best scheduling method for Mandelbrot. Static scheduling is not the best choice because the workload in the middle of the image is complicated and each row of pixels there takes a long time to compute, while a row of pixels near the top of the screen is very quick to compute. We will improve the Mandelbrot application by adding a scheduling clause and observing the wall-clock execution time. Ask the students which scheduling method seemed to work best.
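A sketch of the kind of change the activity asks for; compute_row() below is only a stand-in for the per-row Mandelbrot work in the lab code, and omp_get_wtime() provides the wall-clock measurement mentioned in the notes:

#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

#define HEIGHT 1200

static double result[HEIGHT];

/* stand-in for the real per-row Mandelbrot work: rows near the middle of
   the image do far more iterations than rows near the edges */
static double compute_row(int row)
{
    long iters = 1000L + 2000L * (HEIGHT / 2 - labs((long)(row - HEIGHT / 2)));
    double s = 0.0;
    for (long k = 0; k < iters; k++)
        s += 1.0 / (k + 1.0);
    return s;
}

int main(void)
{
    double t0 = omp_get_wtime();

    /* try schedule(static), schedule(dynamic, 1) and schedule(guided) here
       and compare the elapsed times, as the activity suggests */
    #pragma omp parallel for schedule(dynamic, 1)
    for (int row = 0; row < HEIGHT; row++)
        result[row] = compute_row(row);

    printf("result[0] = %g, elapsed = %f s\n", result[0], omp_get_wtime() - t0);
    return 0;
}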

20 Multi-core Programming: Programming with OpenMP Speaker’s Notes
Agenda What is OpenMP? Parallel Regions Worksharing Data Environment Synchronization Optional Advanced Topics Multi-core Programming: Programming with OpenMP Speaker’s Notes Purpose of the Slide Give a preview of the topics that are going to be covered in this module.

21 Task decomposition
a = alice(); b = bob(); s = boss(a, b); c = cy(); printf ("%6.2f\n", bigboss(s,c)); alice bob boss cy Script: We will now look at ways to take advantage of task decomposition – also known as function-level parallelism. In this example we have the functions alice, bob, boss, cy, and bigboss. There are various dependencies among these functions, as seen in the directed-edge graph to the right. boss can’t complete until the alice and bob functions are complete. bigboss can’t complete until boss and cy are complete. How do we parallelize such a function or task dependency? We’ll see one approach in a couple of slides. For now, let’s identify the functions that can be computed independently (in parallel): alice, bob, and cy can all be computed at the same time, since there are no dependencies among them. Let’s see if we can put this to good advantage later. Background: Get more info in Michael Quinn’s excellent book, Parallel Programming in C with MPI and OpenMP. bigboss alice, bob, and cy can be performed in parallel

22 omp sections #pragma omp sections
Must appear inside a parallel region Precedes a code block containing N blocks of code that may be executed concurrently by N threads Encompasses each omp section Script: The omp sections directive distributes work among threads bound to a defined parallel region. The omp sections construct (note the "s" in sections) indicates that there will be two or more omp section constructs ahead that can be executed in parallel. The omp sections construct must either be inside a parallel region or be part of a combined omp parallel sections construct. When program execution reaches an omp sections directive, the program segments defined by the following omp section directives are distributed for parallel execution among the available threads. The parallelism comes from executing each omp section in parallel. Let’s look at the previous example to see how to apply this to our boss/bigboss example. Background: Get more info in Michael Quinn’s excellent book, Parallel Programming in C with MPI and OpenMP; see the OpenMP spec, or visit publib.boulder.ibm.com/infocenter for the definition and usage. Parameters: clause is any of the following: private (list) Declares the scope of the data variables in list to be private to each thread. Data variables in list are separated by commas. firstprivate (list) Declares the scope of the data variables in list to be private to each thread. Each new private object is initialized as if there were an implied declaration within the statement block. Data variables in list are separated by commas. lastprivate (list) Declares the scope of the data variables in list to be private to each thread. The final value of each variable in list, if assigned, will be the value assigned to that variable in the last section. Variables not assigned a value will have an indeterminate value. Data variables in list are separated by commas. reduction (operator: list) Performs a reduction on all scalar variables in list using the specified operator. Reduction variables in list are separated by commas. A private copy of each variable in list is created for each thread. At the end of the statement block, the final values of all private copies of the reduction variable are combined in a manner appropriate to the operator, and the result is placed back into the original value of the shared reduction variable. Variables specified in the reduction clause must be of a type appropriate to the operator, must be shared in the enclosing context, must not be const-qualified, and must not have pointer type. nowait Use this clause to avoid the implied barrier at the end of the sections directive. This is useful if you have multiple independent work-sharing sections within a given parallel region. Only one nowait clause can appear on a given sections directive. Usage: The omp section directive is optional for the first program code segment inside the omp sections directive. Following segments must be preceded by an omp section directive. All omp section directives must appear within the lexical construct of the program source code segment associated with the omp sections directive. When program execution reaches an omp sections directive, the program segments defined by the following omp section directives are distributed for parallel execution among the available threads. A barrier is implicitly defined at the end of the larger program region associated with the omp sections directive unless the nowait clause is specified.

23 omp section #pragma omp section
Precedes each block of code within the encompassing block described above May be omitted for the first parallel section after the parallel sections pragma The enclosed program segments are distributed for parallel execution among the available threads

24 Function-level parallelism with sections
#pragma omp parallel sections { #pragma omp section /* Optional */ a = alice(); #pragma omp section b = bob(); #pragma omp section c = cy(); } s = boss(a, b); printf ("%6.2f\n", bigboss(s,c)); Script: Here we have enclosed the omp sections in the omp parallel construct. We placed the code of interest inside the parallel sections construct’s code block, then added omp section constructs in front of the tasks that can be executed in parallel – namely alice, bob, and cy. When all of the threads executing the parallel sections reach the implicit barrier (the right curly brace "}" at the end of the parallel sections), the master thread continues on, executing the boss function and later the bigboss function. Another possible approach that Quinn points out – compute alice and bob together, then compute boss and cy, then compute bigboss: #pragma omp parallel sections { #pragma omp section /* Optional */ a = alice(); #pragma omp section b = bob(); } c = cy(); s = boss(a, b); printf ("%6.2f\n", bigboss(s,c)); Background: Get more info in Michael Quinn’s excellent book, Parallel Programming in C with MPI and OpenMP
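A compilable sketch of the first version above; the function bodies are placeholders, since the slide only specifies the call structure:

#include <stdio.h>

/* placeholder workloads; only the dependence structure matters here */
static float alice(void)               { return 1.0f; }
static float bob(void)                 { return 2.0f; }
static float cy(void)                  { return 3.0f; }
static float boss(float a, float b)    { return a + b; }
static float bigboss(float s, float c) { return s * c; }

int main(void)
{
    float a, b, c, s;

    /* alice, bob and cy have no mutual dependences, so each section
       can run on a different thread */
    #pragma omp parallel sections
    {
        #pragma omp section        /* optional for the first section */
        a = alice();
        #pragma omp section
        b = bob();
        #pragma omp section
        c = cy();
    }                              /* implicit barrier: a, b, c are ready */

    s = boss(a, b);
    printf("%6.2f\n", bigboss(s, c));
    return 0;
}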

25 Advantages of parallel sections
Independent sections of code can execute concurrently #pragma omp parallel sections { #pragma omp section phase1(); #pragma omp section phase2(); #pragma omp section phase3(); } Script: In this example, phase1, phase2, and phase3 represent completely independent tasks. Sections are distributed among the threads in the parallel team. Each section is executed only once and each thread may execute zero or more sections. It is not possible to determine whether one section will execute before another; therefore, the output of one section should not serve as the input to another concurrent section. Notice the overall parallelism achievable in the serial/parallel flow diagram. Now we will begin our exploration of omp tasks. Serial Parallel

26 Multi-core Programming: Programming with OpenMP Speaker’s Notes
Agenda What is OpenMP? Parallel Regions Worksharing Data Environment Synchronization Optional Advanced Topics Multi-core Programming: Programming with OpenMP Speaker’s Notes Purpose of the Slide Give a preview of the topics that are going to be covered in this module.

27 New OpenMP support
Tasks – the main addition in OpenMP 3.0 Allow parallelization of irregular problems Unbounded loops Recursive algorithms Producer/consumer Script: Tasks are a powerful addition to OpenMP as of the OpenMP 3.0 spec. Tasks allow parallelization of irregular problems that were impossible or very difficult to parallelize in OpenMP before 3.0. Now it is possible to parallelize unbounded loops (such as while loops), recursive algorithms, and producer/consumer patterns. Let’s explore what tasks actually are.

28 What are tasks? Serial Parallel
Tasks are independent units of work Threads are assigned to execute the work of each task Tasks may be deferred Tasks may be executed immediately The runtime system decides which of the above A task is composed of: code to execute a data environment internal control variables (ICVs) Script: First of all, tasks are independent units of work that get threads assigned to them in order to do some calculation. The assigned threads might start executing immediately, or their execution might be deferred, depending on decisions made by the OS and runtime. Tasks are composed of three components: 1) code to execute – the literal code in your program enclosed by the task directive; 2) a data environment – the shared and private data manipulated by the task; 3) internal control variables – thread scheduling and environment-variable type controls. A task is a specific instance of executable code and its data environment, generated when a thread encounters a task construct or a parallel construct. Background: New concept in OpenMP 3.0: explicit task – we have simply added a way to create a task explicitly for a team of threads to execute. Key concept: all parallel execution is done in the context of a parallel region – a thread encountering a parallel construct packages up a set of N implicit tasks, one per thread; a team of N threads is created; each thread begins execution of a separate implicit task immediately. New concept: explicit task – OpenMP has simply added a way to create a task explicitly for the team to execute. Every part of an OpenMP program is part of one task or another! Serial Parallel

29 Task example #pragma omp parallel // assume 8 threads {
#pragma omp single private(p) while (p) { #pragma omp task processwork(p); p = p->next; } } A team of 8 threads is created One thread gets to execute the while loop The thread executing the while loop creates a task for each instance of processwork() Script: In this example we are chasing pointers through a linked list. We create a parallel region using the #pragma omp parallel construct, and we assume, for the sake of the illustration, that 8 threads are created once the master thread crosses into the parallel region. At this point we have a team of 8 threads. Let’s also assume that the linked list contains ~1000 nodes. We immediately limit the number of threads that operate the while loop: we only want one while loop running. Without the single construct we would have 8 identical copies of the while loop, all trying to process work and getting in each other’s way. The omp task construct copies the code, data, and internal control variables to a new task – let’s call it task01 – and gives that task a thread from the team to execute the task’s instance of the code and data. Since the omp task construct is called from within the while loop, and since the while loop traverses all 1000 nodes, at some point 1000 tasks will be generated. It is unlikely that all 1000 tasks will be generated at the same time. Since we only have 8 threads to service the 1000 tasks, and the master thread is busy controlling the while loop, we effectively have 7 threads to do the actual processwork. Say the master thread keeps generating tasks and the 7 worker threads can’t consume them quickly enough. Then eventually the master thread may "task switch": it may suspend the work of controlling the while loop and creating tasks, and the runtime may decide that the master should begin servicing tasks just like the rest of the threads. When the task pool drains enough thanks to the extra help, the master thread may switch back to executing the while loop and begin generating new tasks once more. This process is at the heart of OpenMP tasks. We’ll see some animations demonstrating this in the next few foils.

30 Task – Explicit task view
A team of threads is created at the omp parallel A single thread is chosen to execute the while loop – we’ll call this thread "L" Thread L operates the while loop, creates tasks, and fetches the next pointer Each time L crosses the omp task construct it generates a new task that gets a thread assigned to it Each task runs in its own thread All tasks complete at the barrier at the end of the parallel region #pragma omp parallel { #pragma omp single { // block 1 node * p = head; while (p) { //block 2 #pragma omp task firstprivate(p) process(p); p = p->next; //block 3 } } } Script: This foil is one way to look at pointer chasing in a linked list, where the list must be traversed and each node has to be processed by the function "process". Here we see an overview of the flow of this code snippet: A team of threads is created at the omp parallel construct. A single thread is chosen to execute the while loop – let’s call this thread "L". Thread L operates the while loop, creates tasks, and fetches next pointers. Each time L crosses the omp task construct it generates a new task and has a thread assigned to it. Each task runs in its own thread. All tasks complete at the barrier at the end of the parallel region. The next foil gives more insight into the parallelism advantage of this approach.

31 Why are tasks useful? They have the potential to parallelize irregular patterns and recursive function calls (Diagram: a single-thread timeline runs block 1, block 2 task 1, block 2 task 2, block 2 task 3, and block 3 sequentially; a four-thread timeline runs block 1 and block 3 on one thread while the other threads execute block 2 tasks 1–3 concurrently, showing the time saved and the wait at the end.) #pragma omp parallel { #pragma omp single { // block 1 node * p = head; while (p) { //block 2 #pragma omp task process(p); p = p->next; //block 3 } } } Script: Here is another look at the same example, emphasizing the potential performance payoff of the pipelined approach we get from using tasks. First, observe that in single-threaded mode all the work is done sequentially: block 1 is executed (node p is assigned to head), then block 2 (pointer p is processed by process(p)), then block 3 (read the next pointer in the linked list), and then blocks 2 and 3 repeat, and so on. First animation: now consider the same code executed in parallel. First, the master thread crosses the omp parallel construct and creates a team of threads. Next, one of those threads is chosen to execute the while loop – let’s call it thread L. Thread L encounters an omp task construct at block 2, which copies the code and data for process(p) to a new task – we’ll call it task 1. Thread L then increments the pointer p, grabbing a new node from the list, and loops back to the top of the while loop. Thread L again encounters the omp task construct at block 2 and creates task 2, and on the next pass task 3. So thread L’s job is simply to assign work to threads and traverse the linked list. The parallelism comes from the fact that thread L does not have to wait for the results of any task before generating a new task. If the system has sufficient resources (enough cores, registers, memory, etc.), then task 1, task 2, and task 3 can all be computed in parallel. Roughly speaking, the execution time will be about the duration of the longest-running task (task 2 in this case) plus some administration time for thread L. The time saved can be significant compared to the serial execution of the same code. Obviously, the more parallel resources supplied the better the parallelism, up to the point where the longest serial task begins to dominate the total execution time. Now it’s time for a lab activity.

32 Activity 3 – Linked list using tasks
Objective: modify the linked list code to use tasks to parallelize the application Follow the linked-list task activity called LinkedListTask in the lab document Script: This lab may have to be skipped for time constraints. However, it does show fair speedup from using tasks to parallelize a pointer-chasing while loop. This is a fairly simple lab in which you start with a serial version of the application, add a few OpenMP pragmas, and build and run the application. Let’s now look at when and where tasks are guaranteed to be complete. while(p != NULL){ do_work(p->data); p = p->next; }
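A runnable sketch of the pattern the activity is built around; the node type and do_work() below are simplified stand-ins for the types and processing in the lab code:

#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

typedef struct node {
    int data;
    struct node *next;
} node;

/* stand-in for the real per-node computation in the lab */
static void do_work(int data)
{
    printf("thread %d processed node %d\n", omp_get_thread_num(), data);
}

int main(void)
{
    /* build a small list 0 -> 1 -> ... -> 9 */
    node *head = NULL;
    for (int i = 9; i >= 0; i--) {
        node *n = malloc(sizeof *n);
        n->data = i;
        n->next = head;
        head = n;
    }

    #pragma omp parallel
    #pragma omp single                         /* one thread walks the list ...  */
    {
        for (node *p = head; p != NULL; p = p->next) {
            #pragma omp task firstprivate(p)   /* ... and spawns one task per node */
            do_work(p->data);
        }
    }                                          /* implicit barriers: all tasks done */

    while (head) { node *n = head; head = head->next; free(n); }
    return 0;
}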

33 When are tasks guaranteed to be complete?
At thread or task barriers At the directive: #pragma omp barrier At the directive: #pragma omp taskwait Script: Now we are going to quickly explore when tasks are guaranteed to be complete. For computations with dependent tasks, where task B relies on the completion of task A, it is sometimes necessary for developers to know explicitly when or where a task can be guaranteed to be complete. This foil addresses that concern. Tasks are guaranteed to be complete at the following three locations: at thread or task barriers, such as the end of a parallel region, the end of a single region, or the end of a parallel "for" region (we’ll talk more about implicit thread or task barriers later); at the directive #pragma omp barrier; and at the directive #pragma omp taskwait. In the case of #pragma omp taskwait, the encountering task suspends at the point of the directive until all its children (all of the encountering task’s child tasks created up to this point) are complete. Only direct children – not descendants! Similarly, a thread barrier (implicit or explicit) includes an implicit taskwait. This pretty much wraps up the discussion of tasks – now we are going to look at data-environment topics such as data scoping.

34 Task completion example
#pragma omp parallel { #pragma omp task foo(); #pragma omp barrier #pragma omp single { #pragma omp task bar(); } } Multiple foo tasks are created here – one per thread All foo tasks are guaranteed to be complete here A single bar task is created here Script: Let’s take a look at an example that demonstrates where tasks are guaranteed to be complete. In this example, the master thread crosses the parallel construct and a team of N threads is created. Each thread in the team creates a foo task, so there are now N foo tasks. The exit of the omp barrier construct is where we are guaranteed that all N foo tasks are complete. Next, a single thread crosses the omp task construct and a single task is created to execute the bar function. The bar task is guaranteed to be complete at the exit of the single construct’s code block – the right curly brace that marks the end of the single region. Now let’s move on to the next item in the agenda. The bar task is guaranteed to be complete here

35 Multi-core Programming: Programming with OpenMP Speaker’s Notes
Agenda What is OpenMP? Parallel Regions Worksharing Data Environment Synchronization Optional Advanced Topics Multi-core Programming: Programming with OpenMP Speaker’s Notes Purpose of the Slide Give a preview of the topics that are going to be covered in this module.

36 Data scoping – What is shared?
OpenMP uses a shared-memory programming model Shared variable – a variable that can be read or written by multiple threads The shared clause can be used to make items explicitly shared Global variables are shared by default among tasks File-scope variables, namespace-scope variables, static variables, and const-qualified variables with no mutable member are shared Static variables declared in a scope inside the construct are shared Script: Data scoping – what’s shared? OpenMP uses a shared-memory programming model rather than a message-passing model. As such, the way threads communicate results to the master thread or to each other is through shared memory and shared variables. Consider two threads, each computing a partial sum, with a master thread waiting to add the partial sums to get a grand total. If each thread only has its own local copy of data and can only manipulate its own data, there would be no way to communicate a partial sum to the master thread. So shared variables play a very important role on a shared-memory system. A shared variable is a variable whose name provides access to the same block of storage for each task region. Which variables are shared by default? Global variables, variables with file scope, variables with namespace scope, and static variables are all shared by default. A variable can be made shared explicitly by adding the shared(list,…) clause to an omp construct. For Fortran users: common blocks, SAVE variables, and module variables are all shared by default. Now that we know what’s shared, let’s look at what’s private. Background: predetermined data-sharing attributes: Variables with automatic storage duration that are declared in a scope inside the construct are private. Variables with heap-allocated storage are shared. Static data members are shared. Variables with const-qualified type having no mutable member are shared. C/C++: static variables which are declared in a scope inside the construct are shared.

37 Data scoping – What is private?
Not everything is shared... Examples of implicitly determined private variables: Local (stack) variables in functions called from parallel regions are PRIVATE Automatic variables within a statement block are PRIVATE Loop iteration variables are private Variables implicitly determined private within tasks are treated as firstprivate The firstprivate clause declares one or more items to be private to a task and initializes each of them with the value of the original item Script: While shared variables may be essential, they also have a serious drawback that we will explore in later foils. Shared variables open up the possibility of data races, or race conditions. A race condition is a situation in which multiple threads of execution all update a shared variable in an unsynchronized fashion, causing indeterminate results – which means that the same program running on the same data may arrive at different answers in multiple trials of the program execution. This is a fancy way of saying that you could write a program that adds 1 plus 3 and sometimes gets 4, other times gets 1, and other times gets 3. One way to combat data races is to make use of copies of data in private variables for every thread. In order to use private variables, we ought to know a little about them. First, it is important to know that some variables are implicitly considered private. Other variables can be made private by explicitly declaring them so with the private clause. Some examples of implicitly determined private variables are: stack variables in functions called from parallel regions, automatic variables within a statement block, and loop iteration variables. Second, all variables that are implicitly determined private within a task are treated as though they are firstprivate, meaning they are given an initial value from their associated original variable. Let’s look at some examples to see what this means. Background: Variables appearing in threadprivate directives are threadprivate. Variables with automatic storage duration that are declared in a scope inside the construct are private. Variables with heap-allocated storage are shared. Static data members are shared. The loop iteration variable(s) in the associated for-loop(s) of a for or parallel for construct is (are) private. Variables with const-qualified type having no mutable member are shared. C/C++: static variables which are declared in a scope inside the construct are shared. private (list) declares the scope of the data variables in list to be private to each thread; data variables in list are separated by commas. Predetermined data-sharing attributes: for each private variable referenced in the structured block, a new version of the original variable (of the same type and size) is created in memory for each task that contains code associated with the directive. References to a private variable in the structured block refer to the current task’s private version of the original variable. A private variable in a task region that eventually generates an inner nested parallel region is permitted to be made shared by implicit tasks in the inner parallel region. A private variable in a task region can be shared by an explicit task region generated during its execution.
However, it is the programmer’s responsibility to ensure through synchronization that the lifetime of the variable does not end before completion of the explicit task region sharing it. Any other access by one task to the private variables of another task results in unspecified behavior.

38 A data environment example
float A[10]; main () { int index[10]; #pragma omp parallel Work (index); printf ("%d\n", index[1]); } extern float A[10]; void Work (int *index) { float temp[10]; static int count; <...> } temp A, index, count Script: Let me ask the class – from what you have learned already, tell me which variables are shared and which are private. A[] is shared – it is a global variable. index is also an array – we pass the pointer to its first element into Work, so all threads share this array. count is also shared – why? Because it is declared static, so all tasks share the same storage. temp is private – it is created inside Work; each task has its own code and data for Work(), and temp is a local variable within the function Work(). Let’s look at some more examples. Which variables are shared and which are private? A, index, and count are shared by all threads, but temp is local to each thread

39 Data scoping problem – fib example
int fib ( int n ) { int x,y; if ( n < 2 ) return n; #pragma omp task x = fib(n-1); #pragma omp task y = fib(n-2); #pragma omp taskwait return x+y; } n is private in both tasks x is a private variable y is a private variable Script: We assume that the parallel region exists outside of fib, and that fib and the tasks inside it are in the dynamic extent of a parallel region. n is firstprivate in both tasks – reason: stack variables in functions called from a parallel region are implicitly determined private, which means that within both task directives they are treated as firstprivate. Do you see any issues here? First animation: What about x and y? They are definitely private within the tasks – BUT we want to use their values OUTSIDE the tasks. We need to share the values of x and y somehow. The problem with assigning values to x and y (which by default are private) is that the values are needed outside the task constructs – after the taskwait – and the private copies are not defined there. So, to have any meaning, we have to provide a mechanism to communicate the values of these variables to the statement after the taskwait – we have several strategies available to make this work, as we shall see on the following foils. What is wrong? Private variables cannot be used outside the tasks

40 Data scoping example – fib example
int fib ( int n ) { int x,y; if ( n < 2 ) return n; #pragma omp task shared(x) x = fib(n-1); #pragma omp task shared(y) y = fib(n-2); #pragma omp taskwait return x+y; } n is private in both tasks Script: Good solution. In this case we share the values of x and y so that they are available outside each task construct – after the taskwait. x & y are shared Good solution: we need both values to compute the sum
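The notes assume fib() is called from inside an existing parallel region; a minimal sketch of that enclosing code (the names here are illustrative, and the cutoff-free recursion is exponential, so keep n small):

#include <stdio.h>
#include <omp.h>

int fib(int n)
{
    int x, y;
    if (n < 2) return n;

    #pragma omp task shared(x)
    x = fib(n - 1);
    #pragma omp task shared(y)
    y = fib(n - 2);

    #pragma omp taskwait         /* both child tasks are complete past this point */
    return x + y;
}

int main(void)
{
    int result;
    #pragma omp parallel
    #pragma omp single           /* one thread starts the recursion; tasks fan out */
    result = fib(20);
    printf("fib(20) = %d\n", result);
    return 0;
}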

41 Data scoping problem – List traversal
List ml; //my_list Element *e; #pragma omp parallel #pragma omp single { for(e=ml->first;e;e=e->next) #pragma omp task process(e); } What is wrong here? Script: e is assumed shared here because, even though it looks like a local variable, it is defined outside the parallel region – so it is treated as shared by default by each task in the parallel region. Since e is shared in the task region, we have a race condition: each task (i.e., each process() call in this case) accesses e while the loop keeps updating it. What we want is for each task to have its own private copy of e – we shall see strategies for doing this on the following foils. Possible race condition! The shared variable e is updated by multiple tasks

42 Ejemplo de alcance de datos – Recorrido de listas
List *ml; //my_list
Element *e;
#pragma omp parallel
#pragma omp single
{
  for(e=ml->first;e;e=e->next)
    #pragma omp task firstprivate(e)
      process(e);
}
Buena solución – e es firstprivate Script: Here – we made e explicitly firstprivate – overriding the default rules that made it shared on the previous foil. Now that it is firstprivate, each task has its own copy of e

43 Ejemplo de alcance de datos – Recorrido de listas
List *ml; //my_list
Element *e;
#pragma omp parallel
#pragma omp single private(e)
{
  for(e=ml->first;e;e=e->next)
    #pragma omp task
      process(e);
}
Buena solución – e es privada Script: This is another possible solution – by making e private within the parallel region, it will be treated as private by default in the tasks within the parallel region

44 Ejemplo de alcance de datos – Recorrido de listas
List *ml; //my_list
#pragma omp parallel
{
  Element *e;
  for(e=ml->first;e;e=e->next)
    #pragma omp task
      process(e);
}
Script: In this case, we have another good solution – e is declared within the parallel region – it is a local variable defined within the parallel region, which makes it a private variable within the parallel region, and that makes it a private variable within the tasks within the parallel region. Now we are going to move on to look at synchronization constructs Buena solución – e es privada

45 Multi-core Programming: Programming with OpenMP Speaker’s Notes
Agenda ¿Qué es OpenMP? Regiones Paralelas Worksharing Ambiente de datos Sincronización Tópicos avanzados opcionales Multi-core Programming: Programming with OpenMP Speaker’s Notes Purpose of the Slide Give a preview of the topics that are going to be covered in this module.

46 Ejemplo: Producto Punto
float dot_prod(float* a, float* b, int N)
{
  float sum = 0.0;
  #pragma omp parallel for shared(sum)
  for(int i=0; i<N; i++) {
    sum += a[i] * b[i];
  }
  return sum;
}
Script: Here we see a simple-minded dot product. We have used a parallel for construct with a shared clause. So what is the problem? Answer: Multiple threads modifying sum with no protection – this is a race condition! Let's talk about race conditions some more on the next slide ¿Qué está mal?

47 Condiciones de Concurso
Una condición de concurso es un comportamiento no determinístico causado cuando dos o más hilos acceden a una misma variable compartida sin sincronización (y al menos uno de ellos la escribe), de modo que el resultado depende del orden de los accesos. Por ejemplo, supón que el hilo A y el hilo B están ejecutando area += 4.0 / (1.0 + x*x); Script: A race condition is nondeterministic behavior caused by the times at which two or more threads access a shared variable. Let's look at the code snippet below: area = area + 4/(1+x*x) If variable "area" is private, then the individual subtotals will be lost when the loop is exited. But if variable "area" is shared, then we could run into a race condition. So – we have a quandary. On the one hand I want area to be shared because I want to combine partial sums from thread A & thread B – on the other hand – I don't want a data race. The next few slides will illustrate the problem.

48 Dos ejemplos
Ejemplo 1 (sin concurso): area vale 11.667; el Hilo A lee 11.667, suma 3.765 y escribe 15.432; después el Hilo B lee 15.432, suma 3.563 y escribe 18.995 (total correcto). Ejemplo 2 (con concurso): area vale 11.667; el Hilo A lee 11.667 y suma 3.765, pero antes de que escriba el resultado, el Hilo B también lee 11.667 y suma 3.563; el Hilo A escribe 15.432 y enseguida el Hilo B escribe 15.230, sobrescribiendo el valor y perdiendo la suma del Hilo A. Script: If thread A references "area", adds to its value, and assigns the new value to "area" before thread B references "area", then everything is okay. However, if thread B accesses the old value of "area" before thread A writes back the updated value, then variable "area" ends up with the wrong total. The value added by thread A is lost. We see that in a data race condition, the order of execution of each thread can change the resulting calculations. El orden de ejecución causa un comportamiento no determinante en una situación de concurso

49 Proteger Datos Compartidos
Se debe proteger el acceso a los datos compartidos modificables
float dot_prod(float* a, float* b, int N)
{
  float sum = 0.0;
  #pragma omp parallel for shared(sum)
  for(int i=0; i<N; i++) {
    #pragma omp critical
    sum += a[i] * b[i];
  }
  return sum;
}
Script: To resolve this issue, let's consider using a pragma omp critical – also called a critical section. We'll go over the critical section in more detail on the next slide, but for now let's get a feel for how it works. The critical section allows only one thread to enter it at a given time. A critical section "protects" the next immediate statement or code block – in this case – sum += a[i] * b[i]; Whichever thread, A or B, gets to the critical section first, that thread is guaranteed exclusive access to that protected code region (called a critical region). Once the thread leaves the critical section, the other thread is allowed to enter. This ability of the critical section to force threads to "take turns" is what prevents the race condition. Let's look a little closer at the anatomy of an OpenMP* Critical Construct

50 Bloque de construcción Critical de OpenMP*
#pragma omp critical [(lock_name)]
Define una región crítica en un bloque estructurado
float RES;
#pragma omp parallel
{
  float B;
  #pragma omp for
  for(int i=0; i<niters; i++){
    B = big_job(i);
    #pragma omp critical (RES_lock)
    consum (B, RES);
  }
}
Los hilos esperan su turno – en un momento dado, solo uno llama a consum(), protegiendo así RES de condiciones de concurso. Nombrar la sección crítica RES_lock es opcional Script: The OpenMP* Critical Construct simply defines a critical region on a structured code block. Threads wait their turn – at a time, only one calls consum(), thereby protecting RES from race conditions. Naming the critical construct RES_lock is optional. With named critical regions, a thread waits at the start of a critical region identified by a given name until no other thread in the program is executing a critical region with that same name. Critical sections not specifically named by an omp critical directive invocation are mapped to the same unspecified name. Now let's talk about another kind of synchronization called a reduction Buena Práctica – Nombrar todas las secciones críticas

51 Cláusula de reducción OpenMP*
reduction (op : list) Las variables en “list” deben ser compartidas en la región paralela que las engloba. Dentro de un bloque parallel o work-sharing: Se crea una copia PRIVADA de cada variable de la lista y se inicializa de acuerdo a la “op”. Estas copias se actualizan localmente por los hilos. Al final del bloque, las copias locales se combinan a través de la “op” en un solo valor, que a su vez se combina con el valor que tenía la variable original COMPARTIDA Script: The OpenMP* Reduction Clause is used to combine an array of values into a single combined scalar value based on the “op” operation passed in the parameter list. The reduction clause is used for very common math operations on large quantities of data – called a reduction. A reduction “reduces” the data in a list or array down to a representative single value – a scalar value. For example, if I want to compute the sum of a list of numbers, I can “reduce” the list to the “sum” of the list. We’ll see a code example on the next foil. Before jumping to the next foil let's take care of some business. The variables in “list” must be shared in the enclosing parallel region. The reduction clause must be inside a parallel or work-sharing construct. The way it works internally is that: A PRIVATE copy of each list variable is created and initialized depending on the “op”. These copies are updated locally by threads. At the end of the construct, local copies are combined through “op” into a single value and combined with the value in the original SHARED variable. Now let's look at the example on the next foil

52 Ejemplo de reducción
#pragma omp parallel for reduction(+:sum)
for(i=0; i<N; i++) {
  sum += a[i] * b[i];
}
Copia local de sum para cada hilo. Todas las copias locales de sum se suman y se guardan en la variable “global” Script: In this example, we are computing the sum of the product of two vectors. This is a reduction operation because I am taking an array or list or vector full of numbers and boiling the information down to a scalar. To use the reduction clause, I note that the basic reduction operation is an addition and that the reduction variable is sum. I add the following pragma: #pragma omp parallel for reduction(+:sum) The reduction will now apply a reduction to the variable sum based on the + operation. Internally, each thread sort of has its own copy of sum to add up partial sums. After all partial sums have been computed, all the local copies of sum are added together and stored in a “global” variable called sum, accessible to the other threads but without a race condition. Following are some of the valid math operations that reductions are designed for

53 Ejemplo de Integración Numérica
∫₀¹ 4.0/(1+x²) dx = π, con f(x) = 4.0/(1+x²)
static long num_steps=100000;
double step, pi;
void main()
{
  int i;
  double x, sum = 0.0;
  step = 1.0/(double) num_steps;
  for (i=0; i< num_steps; i++){
    x = (i+0.5)*step;
    sum = sum + 4.0/(1.0 + x*x);
  }
  pi = step * sum;
  printf("Pi = %f\n",pi);
}
(Figura: gráfica de f(x) = 4.0/(1+x²) sobre el intervalo [0.0, 1.0] del eje X, con valores de f entre 2.0 y 4.0; el área bajo la curva es π)
Script: Next we will examine a numerical integration example, where the goal is to compute the value of Pi. We'll do this using the code for an integration. Animation 1 For the Math geeks out there - In this example we are looking at the code to compute the value of Pi. It basically computes the integral from 0 to 1 of the function 4/(1+x^2). Some of you may remember from calculus that this integral evaluates to 4 * arctangent(x). The 4 * arctangent of x evaluated on the range 0-1 yields a value of Pi. Animation 2 To approximate the area under the curve – which approximates the integral – we will have small areas (many = num_steps) that we add up. Each area will be "step" wide. The height of the function will be approximated by 4/(1+x*x). Animation 3 The area will just be the sum of all these small areas. For the rest of us – we will trust that the math is right and just dig into the code. Here we have a loop that runs from 0 to num_steps. Inside the loop – we calculate the approximate location on the x axis of the small area: x = (i+0.5)*step. We calculate the height of the function at this value of x: 4/(1+x*x). Then we add the small area we just calculated to a running total of all the area we have computed up to this point. When the loop is done – we print the value of pi. So let's ask the big question – are there any race conditions in play here? Which variables should be marked private? Which variables should be marked shared? That is the subject of our next lab
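Como referencia (siguiendo la observación de las notas sobre la arcotangente; es solo un desarrollo de apoyo, no parte de la lámina original), la integral puede evaluarse analíticamente:

\[ \int_0^1 \frac{4}{1+x^2}\,dx = 4\,\arctan(x)\Big|_0^1 = 4\left(\frac{\pi}{4}-0\right) = \pi \]

El código de arriba aproxima esta integral con una suma de num_steps rectángulos de ancho step, evaluando la función en el punto medio de cada uno (regla del punto medio).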

54 C/C++ Operaciones de reducción
Un rango de operadores asociativos y conmutativos puede usarse con la reducción. Los valores iniciales son aquellos que tienen sentido para cada operador. Operador → Valor inicial: + → 0, * → 1, - → 0, ^ → 0, & → ~0, | → 0, && → 1, || → 0 Multi-core Programming: Programming with OpenMP Speaker’s Notes Purpose of the Slide Provide a list of the legal operators for C/C++ and their associated initial values. Background It is assumed that the operation performed within the scope of a reduction clause will be the same as the operator in the clause. Thus, the partial results of a subtraction reduction will be added together (having been subtracted from the initial zero value). There is no check to ensure that the computation within the reduction matches the reduction operator.
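Bosquejo mínimo (no tomado de las láminas; los nombres y los datos del arreglo son supuestos) que ilustra reducciones con otros operadores de la tabla, en este caso * y &&:

#include <stdio.h>

int main()
{
    double a[8] = {1.5, 2.0, 0.5, 3.0, 1.0, 2.5, 4.0, 0.25};
    double prod = 1.0;     // matches the * initial value from the table
    int all_positive = 1;  // matches the && initial value from the table

    #pragma omp parallel for reduction(*:prod) reduction(&&:all_positive)
    for (int i = 0; i < 8; i++) {
        prod = prod * a[i];                           // each thread multiplies into its private copy
        all_positive = all_positive && (a[i] > 0.0);  // each thread ANDs into its private copy
    }

    printf("producto = %f, todos positivos = %d\n", prod, all_positive);
    return 0;
}

Al final del ciclo, las copias privadas se combinan con el valor original de cada variable compartida usando el operador correspondiente, igual que en el ejemplo de la suma.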

55 Actividad 4 - Calcular Pi
static long num_steps=100000; double step, pi; void main() { int i; double x, sum = 0.0; step = 1.0/(double) num_steps; for (i=0; i< num_steps; i++){ x = (i+0.5)*step; sum = sum + 4.0/(1.0 + x*x); } pi = step * sum; printf(“Pi = %f\n”,pi); Paraleliza el código de integración numérica usando OpenMP ¿Qué variables pueden compartirse? ¿Qué variables necesitan ser privadas? ¿Qué variables pueden usarse en reducciones? Script: Parallelize the numerical integration code using OpenMP What variables can be shared? What variables need to be private? What variables should be set up for reductions? Please spend about 20 minutes doing lab activity 6 from the student workbook Instructor Note: This is a serial version of the source code. It does not use a “sum” variable that could give a clue to an efficient solution (i.e., local partial sum variable that is updated each loop iteration). This code is small and efficient in serial, but will challenge the students to come up with an efficient solution. Of course, efficiency is not one of the goals with such a short module. Getting the answer correct with multiple threads is enough of a goal for this. One answer: static long num_steps=100000; double step, pi; void main() { int i; double x, sum = 0.0; step = 1.0/(double) num_steps; #pragma omp parallel for private(i, x) reduction(+:sum) for (i=0; i< num_steps; i++){ x = (i+0.5)*step; sum = sum + 4.0/(1.0 + x*x); } pi = step * sum; printf(“Pi = %f\n”,pi);
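Una posible solución (tomada de la nota del instructor incluida arriba; es un bosquejo de referencia, no la única respuesta válida; se agregan aquí el #include y la llave de cierre para que compile): se paraleliza el ciclo con reduction(+:sum) y se hacen privadas i y x:

#include <stdio.h>

static long num_steps=100000;
double step, pi;
void main()
{
  int i;
  double x, sum = 0.0;
  step = 1.0/(double) num_steps;
  #pragma omp parallel for private(i, x) reduction(+:sum)
  for (i=0; i< num_steps; i++){
    x = (i+0.5)*step;
    sum = sum + 4.0/(1.0 + x*x);
  }
  pi = step * sum;
  printf("Pi = %f\n",pi);
}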

56 Bloque de construcción Single
Denota un bloque de código que será ejecutado por un solo hilo El hilo seleccionado es dependiente de la implementación Barrera implícita al final
#pragma omp parallel
{
  DoManyThings();
  #pragma omp single
  {
    ExchangeBoundaries();
  } // threads wait here for single
  DoManyMoreThings();
}
Multi-core Programming: Programming with OpenMP Speaker’s Notes Purpose of the Slide Describe the OpenMP single construct. Details Used when only one thread should execute a portion of code. This could be used when two parallel regions have very little serial code in between. Combine the two regions into a single region (less overhead for fork-join) and add the “single” construct for the serial portion.

57 Bloque de construcción Master
Denota bloques de código que serán ejecutados solo por el hilo maestro No hay barrera implícita al final
#pragma omp parallel
{
  DoManyThings();
  #pragma omp master
  { // if not master skip to next stmt
    ExchangeBoundaries();
  }
  DoManyMoreThings();
}
Multi-core Programming: Programming with OpenMP Speaker’s Notes Purpose of the Slide Describe the OpenMP master construct. Details Similar to “single” but the master thread is chosen. No barrier at end of construct.

58 Barreras implícitas Varias sentencias de OpenMP* tienen barreras implícitas: parallel – barrera necesaria – no puede removerse; for – barrera opcional; single – barrera opcional. Las barreras innecesarias afectan el rendimiento y pueden removerse con la cláusula nowait. La cláusula nowait puede aplicarse a: la construcción for y la construcción single Script: Several OpenMP* constructs have implicit barriers Parallel – necessary barrier – cannot be removed For – optional barrier Single – optional barrier Unnecessary barriers hurt performance and can be removed with the nowait clause The nowait clause is applicable to: For clause Single clause

59 Multi-core Programming: Programming with OpenMP Speaker’s Notes
Cláusula Nowait
#pragma omp for nowait
for(...) {...};
#pragma omp single nowait
{ [...] }
Cuando los hilos esperarían entre cómputos independientes
#pragma omp for schedule(dynamic,1) nowait
for(int i=0; i<n; i++)
  a[i] = bigFunc1(i);
#pragma omp for schedule(dynamic,1)
for(int j=0; j<m; j++)
  b[j] = bigFunc2(j);
Multi-core Programming: Programming with OpenMP Speaker’s Notes Purpose of the Slide Give examples of syntax and usage for nowait clause. Details Schedule in each loop is (dynamic, 1) and computations in each loop are independent of each other (one loop updates a[], other loop updates b[]). Without nowait clause, threads would pause until all work is done in first loop; with nowait clause, when work is exhausted from first loop, threads can begin executing work in second loop. (It is possible that the second loop can complete all work before the work in the first loop is done.)

60 Multi-core Programming: Programming with OpenMP Speaker’s Notes
Barreras Sincronización explícita de barreras Cada hilo espera hasta que todos lleguen #pragma omp parallel shared (A, B, C) { DoSomeWork(A,B); // Processed A into B #pragma omp barrier DoSomeWork(B,C); // Processed B into C } Multi-core Programming: Programming with OpenMP Speaker’s Notes Purpose of the Slide Describe the OpenMP barrier construct. Details Example code likely uses A to update B in first call, and B to update C in second call. Thus, to ensure correct execution, all processing from first call must be completed before starting second call.

61 Multi-core Programming: Programming with OpenMP Speaker’s Notes
Operaciones Atómicas Caso especial de una sección crítica Aplica solo para la actualización de una posición de memoria #pragma omp parallel for shared(x, y, index, n) for (i = 0; i < n; i++) { #pragma omp atomic x[index[i]] += work1(i); y[i] += work2(i); } Multi-core Programming: Programming with OpenMP Speaker’s Notes Purpose of the Slide Describe the OpenMP atomic construct. Details Only a small set of instruction can be used within the atomic pragma. Since index[i] can be the same for different i values, the update to x must be protected. In this case, the update to an element of x[] is atomic. The other computations (call to work1() and the computation of the value in index[i]) will not be done atomically. Use of a critical section would serialize updates to x. Atomic protects individual elements of x array, so that if multiple, concurrent instances of index[i] are different, updates can still be done in parallel. Background The operations allowed within atomic are: x <binop>= <expr> x++ ++x x— --x where x is “an lvalue expression of scalar type.” Questions to Ask Students (Include any specific questions (and answers) that the instructor could/should ask to enhance understanding.) <Remove these lines and start typing here; retain the sub-heading.> Transition Quote (Suggested dialog that can be used by the instructor to segue into the next slide.) Atomic Construct Since index[i] can be the same for different I values, the update to x must be protected. Use of a critical section would serialize updates to x. Atomic protects individual elements of x array, so that if multiple, concurrent instances of index[i] are different, updates can still be done in parallel.

62 Multi-core Programming: Programming with OpenMP Speaker’s Notes
Agenda ¿Qué es OpenMP? Regiones Paralelas Worksharing Ambiente de datos Sincronización Tópicos avanzados opcionales Multi-core Programming: Programming with OpenMP Speaker’s Notes Purpose of the Slide Give a preview of the topics that are going to be covered in this module.

63 Conceptos Avanzados

64 Bloque de Construcción Parallel – Vista de tareas explícita
#pragma omp parallel Las tareas se crean en OpenMP incluso sin una directiva task explícita. Veremos cómo las tareas se crean implícitamente para el fragmento de código que está abajo. El hilo que encuentra la sentencia parallel empaca un conjunto de tareas implícitas. Se crea un equipo de hilos. Cada hilo del equipo está asignado a una de las tareas (y vinculado a ella). La barrera retiene al hilo maestro original hasta que todas las tareas implícitas terminan. (Diagrama: los hilos 1, 2 y 3 ejecutan cada uno su tarea implícita { int mydata; code… } y se reúnen en la barrera) Script: In this animation – we will see tasks are created implicitly by the omp parallel statement – without any explicit tasks involved. First animation First the master thread will cross the omp parallel construct – creating a pool or team of threads. The encountering thread (for this example I’ll say the Master thread) packages up a set of implicit tasks – containing code, data & ICVs. 2nd animation Then the runtime assigns threads to each task. 3rd animation Each thread gets tied to a particular task. The threads begin executing their assigned tasks. As each thread completes its task, it can be recycled by being tied to a new task. 4th animation The Master thread meanwhile, is held at the barrier (end of parallel region “}”) until all the implicit tasks are finished. Now we will look at the anatomy of the omp task construct #pragma omp parallel { int mydata; code… }

65 Bloque de construcción Task
#pragma omp task [clause[[,]clause] ...] bloque estructurado Donde la claúsula puede ser un: if (expresion) untied shared (lista) private (lista) firstprivate (lista) default( shared | none ) Script: The omp task construct should be placed inside a parallel region and the task should encapsulate a structured block. The syntax is #pragma omp task [clause[[,]clause] ...] Explicit tasks are created in OpenMP following the same steps just described. Thread encountering parallel construct creates a team of threads at the omp task construct - a thread in team is assigned to one of the explicit tasks (and tied to it). if the task construct is enclosed inside a while loop or other loop structure then each time the omp task construct is crossed a new instance of the task is created and assigned a thread – which can be initially differed or can be initially executed immediately At the end of the parallel region their is an implied barrier - this barrier holds original master thread until all explicit tasks are finished. The syntax is #pragma omp task [clause[[,]clause] ...] Where clause can be one of the following clauses if (expression) - a user directed optimization that weighs the cost of deferring the task versus executing the task code immediately. It can be used to control cache and memory affinity Untied - specifies that the task created is untied. An untied task has no long term association with any given thread. Any thread not otherwise occupied is free to execute an untied task. The thread assigned to execute an untied task may only change at a "task scheduling point". shared (list) a variable whose name provides access to the same block of storage for each task region private (list) a variable whose name provides access to a different block of storage firstprivate (list) variable whose name provides access to a different block of storage for each task region and whose value is initialized with the value of the original variable default( shared | none ) - specifies the default data scoping rules fr the task Now lets look at untied versus ted tasks Background Info see the openmp spec at or visit publib.boulder.ibm.com/infocenter at the url below to get at the definition and usage The permissible clauses include: When the if clause argument is false The task is executed immediately by the encountering thread. The data environment is still local to the new task... ...and it’s still a different task with respect to synchronization. Its used to execute immediately (when exp is false) when the cost of deferring the task is too great compared to the cost of executing the task code – this can aid with cache and memory affinity Untied – specifies that the task created is untied. An untied task has no long term association with any given thread. Any thread not otherwise occupied is free to execute an untied task. The thread assigned to execute an untied task may only change at a "task scheduling point". As opposed to a tied task which is a task that, when its task region is suspended, can be resumed only by the same thread that suspended it; that is, the task is tied to that thread. shared (list) with respect to a given set of task regions that bind to the same parallel region, a variable whose name provides access to the same block of storage private (list) with respect to a given set of task regions that bind to the same parallel region, a variable whose name provides access to a different block of storage for each task region. 
firstprivate (list) - a variable whose name provides access to a different block of storage for each task region declares the scope of the data variables in list to be private to each thread. Each new private object is initialized with the value of the original variable as if there was an implied declaration within the statement block. Data variables in list are separated by commas. default( shared | none ) – specifies the default data scoping rules fr the task Description tied task A task that, when its task region is suspended, can be resumed only by the untied A task that, when its task region is suspended, can be resumed by any thread in the team; that is, the task is not tied to any thread. firstprivate vars are firstprivate unless shared in the enclosing Context - Specifies that each task should have its own instance of a variable, and that the value of each instance should be initialized to the value of the variable as it existed prior to the parallel directive private With respect to a given set of task regions that bind to the same parallel shared With respect to a given set of task regions that bind to the same parallel A variable which is part of another variable (as an array or structure element) cannot be shared independently of the other components, except for static data members of C++ classes. When a thread encounters a task construct, a task is generated from the code for the associated structured block. The data environment of the task is created according to the data-sharing attribute clauses on the task construct and any defaults that apply. The encountering thread may immediately execute the task, or defer its execution. In the latter case, any thread in the team may be assigned the task. Completion of the task can be guaranteed using task synchronization constructs. A task construct may be nested inside an outer task, but the task region of the inner task is not a part of the task region of the outer task. When an if clause is present on a task construct and the if clause expression evaluates to false, the encountering thread must suspend the current task region and begin execution of the generated task immediately, and the suspended task region may not be resumed until the generated task is completed. The task still behaves as a distinct task region with respect to data environment, lock ownership, and synchronization constructs. Note that the use of a variable in an if clause expression of a task construct causes an implicit reference to the variable in all enclosing constructs. A thread that encounters a task scheduling point within the task region may temporarily suspend the task region. By default, a task is tied and its suspended task region can only be resumed by the thread that started its execution. If the untied clause is present on a task construct, any thread in the team can resume the task region after a suspension. The task construct includes a task scheduling point in the task region of its generating task, immediately following the generation of the explicit task. Each explicit task region includes a task scheduling point at its point of completion. An implementation may add task scheduling points anywhere in untied task regions.
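Bosquejo mínimo (supuesto, no parte de las láminas; process() es una función hipotética) que combina varias de las cláusulas de la construcción task enumeradas arriba: firstprivate para que cada tarea reciba su propia copia del índice, y una cláusula if para ejecutar de inmediato las tareas cuando el trabajo es pequeño:

void process(int item);   // hypothetical work function

void walk(int *items, int n)
{
  #pragma omp parallel
  #pragma omp single
  {
    for (int i = 0; i < n; i++) {
      // each task gets its own copy of i; if n is small, the if clause is false
      // and the encountering thread executes the task immediately instead of deferring it
      #pragma omp task firstprivate(i) if (n > 100)
      process(items[i]);
    }
  }
}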

66 Tareas vinculadas y tareas no vinculadas
A una tarea vinculada se le asigna un hilo en su primera ejecución y este mismo hilo le da servicio a la tarea durante todo su tiempo de vida. Un hilo que ejecuta una tarea vinculada puede suspenderla y ser enviado a ejecutar otra tarea, pero eventualmente el mismo hilo regresará a continuar la ejecución de su tarea vinculada original. Las tareas son vinculadas mientras no se declaren explícitamente como no vinculadas (untied) Script: By default, a task is created as a tied task. Meaning that it gets a thread assigned to the task for the life of the task. A tied task’s thread is the only thread that can service the task – however – since there can be many fewer threads than tasks – the runtime may suspend the assigned task (let's call it task Z) and assign its thread to go off for new duties (such as being assigned to new tasks – let's say task Y). When task Y’s execution reaches a scheduling point and when the runtime decides that it is time – the thread can be unassigned from task Y and given back task Z to resume computation. So a thread may service multiple tasks – but each tied task can be serviced only by the thread originally assigned to it. By contrast – an untied task has no long term association with any given thread. Any thread not otherwise occupied is free to execute an untied task. The thread assigned to execute an untied task may only change at a "task scheduling point". There can be performance benefits from untied tasks in that the untied task is more likely to get serviced by some idle thread sooner rather than later. On the other hand, especially on NUMA architectures, untied tasks could have a negative impact on performance if the random thread assigned to service my task is running on a remote processor with remote cache. It is recommended to avoid using untied tasks unless the developer is willing to explore these performance subtleties (actual performance difference may NOT be subtle but the concept underlying the issue may be subtle) Now let's have a look at explicit tasks

67 Tareas vinculadas y tareas no vinculadas
Una tarea no vinculada no tiene ninguna asociación a largo plazo con ningún hilo. Cualquier hilo que no esté ocupado puede ejecutar una tarea no vinculada. El hilo asignado para ejecutar una tarea no vinculada solo puede cambiar en un “punto de planificación de tareas”. Una tarea no vinculada se crea agregando la cláusula “untied” a la directiva task Ejemplo: #pragma omp task untied Script: By default, a task is created as a tied task. Meaning that it gets a thread assigned to the task for the life of the task. A tied task’s thread is the only thread that can service the task – however – since there can be many fewer threads than tasks – the runtime may suspend the assigned task (let's call it task Z) and assign its thread to go off for new duties (such as being assigned to new tasks – let's say task Y). When task Y’s execution reaches a scheduling point and when the runtime decides that it is time – the thread can be unassigned from task Y and given back task Z to resume computation. So a thread may service multiple tasks – but each tied task can be serviced only by the thread originally assigned to it. By contrast – an untied task has no long term association with any given thread. Any thread not otherwise occupied is free to execute an untied task. The thread assigned to execute an untied task may only change at a "task scheduling point". There can be performance benefits from untied tasks in that the untied task is more likely to get serviced by some idle thread sooner rather than later. On the other hand, especially on NUMA architectures, untied tasks could have a negative impact on performance if the random thread assigned to service my task is running on a remote processor with remote cache. It is recommended to avoid using untied tasks unless the developer is willing to explore these performance subtleties (actual performance difference may NOT be subtle but the concept underlying the issue may be subtle) Now let's have a look at explicit tasks

68 Cambio de tareas Cambio de tareas El acto de un hilo en cambiar de la ejecución de una tarea a otra tarea. El propósito de cambiar la tarea es distribuir hilos a lo largo de las tareas no asignadas en el equipo para evitar que se acumulen colas largas de tareas no asignadas Script: The speaker notes have a lot more detail on task switching which we covered lightly in class already Next foil Background In untied task regions, task scheduling points may occur at implementation defined points anywhere in the region. In tied task regions, task scheduling points may occur only in task, taskwait, explicit or implicit barrier constructs, and at the completion point of the task. From the OpenMP 3.0 Spec The following example demonstrates a way to generate a large number of tasks with one thread and execute them with the threads in the parallel team. While generating these tasks, the implementation may reach its limit on unassigned tasks. If it does, the implementation is allowed to cause the thread executing the task generating loop to suspend its task at the task scheduling point in the task directive, and start executing unassigned tasks. Once the number of unassigned tasks is sufficiently low, the thread may resume execution of the task generating loop. Example A.13.5c #define LARGE_NUMBER double item[LARGE_NUMBER]; extern void process(double); int main() { #pragma omp parallel { #pragma omp single int i; for (i=0; i<LARGE_NUMBER; i++) #pragma omp task // i is firstprivate, item is shared process(item[i]); } C/C++ Task Scheduling Whenever a thread reaches a task scheduling point, the implementation may cause it to perform a task switch, beginning or resuming execution of a different task bound to the current team. Task scheduling points are implied at the following locations: • the point immediately following the generation of an explicit task • after the last instruction of a task region • in taskwait regions • in implicit and explicit barrier regions. In addition, implementations may insert task scheduling points in untied tasks anywhere that they are not specifically prohibited in this specification. When a thread encounters a task scheduling point it may do one of the following, subject to the Task Scheduling Constraints (below): • begin execution of a tied task bound to the current team. • resume any suspended task region, bound to the current team, to which it is tied. • begin execution of an untied task bound to the current team. • resume any suspended untied task region bound to the current team. If more than one of the above choices is available, it is unspecified as to which will be chosen. Task Scheduling Constraints 1. An explicit task whose construct contained an if clause whose if clause expression evaluated to false is executed immediately after generation of the task. 2. Other scheduling of new tied tasks is constrained by the set of task regions that are currently tied to the thread, and that are not suspended in a barrier region. If this set is empty, any new tied task may be scheduled. Otherwise, a new tied task may be scheduled only if it is a descendant of every task in the set. A program relying on any other assumption about task scheduling is non-conforming. Note – Task scheduling points dynamically divide task regions into parts. Each part is executed uninterruptedly from start to end. Different parts of the same task region are executed in the order in which they are encountered. 
In the absence of task synchronization constructs, the order in which a thread executes parts of different schedulable tasks is unspecified. A correct program must behave correctly and consistently with all conceivable scheduling sequences that are compatible with the rules above.

69 Cambio de tareas El cambio de tareas, para tareas vinculadas, solo puede ocurrir en puntos de planificación de tareas localizados dentro de los siguientes bloques de construcción Se encuentran sentencias task Se encuentran sentencias taskwait Se encuentran directivas barrier Regiones barrier implícitas Al final de una región de tarea vinculada Las tareas no vinculadas tienen implementación dependiendo de los puntos de planificación Script: The speaker notes have a lot more detail on task switching which we covered lightly in class already Next foil Background In untied task regions, task scheduling points may occur at implementation defined points anywhere in the region. In tied task regions, task scheduling points may occur only in task, taskwait, explicit or implicit barrier constructs, and at the completion point of the task. From the OpenMP 3.0 Spec The following example demonstrates a way to generate a large number of tasks with one thread and execute them with the threads in the parallel team. While generating these tasks, the implementation may reach its limit on unassigned tasks. If it does, the implementation is allowed to cause the thread executing the task generating loop to suspend its task at the task scheduling point in the task directive, and start executing unassigned tasks. Once the number of unassigned tasks is sufficiently low, the thread may resume execution of the task generating loop. Example A.13.5c #define LARGE_NUMBER double item[LARGE_NUMBER]; extern void process(double); int main() { #pragma omp parallel { #pragma omp single int i; for (i=0; i<LARGE_NUMBER; i++) #pragma omp task // i is firstprivate, item is shared process(item[i]); } C/C++ Task Scheduling Whenever a thread reaches a task scheduling point, the implementation may cause it to perform a task switch, beginning or resuming execution of a different task bound to the current team. Task scheduling points are implied at the following locations: • the point immediately following the generation of an explicit task • after the last instruction of a task region • in taskwait regions • in implicit and explicit barrier regions. In addition, implementations may insert task scheduling points in untied tasks anywhere that they are not specifically prohibited in this specification. When a thread encounters a task scheduling point it may do one of the following, subject to the Task Scheduling Constraints (below): • begin execution of a tied task bound to the current team. • resume any suspended task region, bound to the current team, to which it is tied. • begin execution of an untied task bound to the current team. • resume any suspended untied task region bound to the current team. If more than one of the above choices is available, it is unspecified as to which will be chosen. Task Scheduling Constraints 1. An explicit task whose construct contained an if clause whose if clause expression evaluated to false is executed immediately after generation of the task. 2. Other scheduling of new tied tasks is constrained by the set of task regions that are currently tied to the thread, and that are not suspended in a barrier region. If this set is empty, any new tied task may be scheduled. Otherwise, a new tied task may be scheduled only if it is a descendant of every task in the set. A program relying on any other assumption about task scheduling is non-conforming. Note – Task scheduling points dynamically divide task regions into parts. Each part is executed uninterruptedly from start to end. 
Different parts of the same task region are executed in the order in which they are encountered. In the absence of task synchronization constructs, the order in which a thread executes parts of different schedulable tasks is unspecified. A correct program must behave correctly and consistently with all conceivable scheduling sequences that are compatible with the rules above.

70 Ejemplo de cambio de tareas
El hilo que ejecuta el “ciclo for”, que es el generador de tareas, genera muchas tareas en poco tiempo, tal que... El hilo que ejecuta el SINGLE está generando tareas y tendrá que suspenderse por un momento cuando el “conjunto de tareas” se llene. El intercambio de tareas se invoca para iniciar el vaciado del “conjunto de tareas”. Cuando el “conjunto de tareas” se ha vaciado lo suficiente, la tarea en el bloque single puede seguir generando más tareas Script: Here’s a task switching example Next foil Background The thread executing the SINGLE will have to suspend generating tasks at some point, because the "task pool" will fill up. At that point, the SINGLE thread will have to stop generating tasks. It is allowed to start executing some of the tasks in the task pool in order to "drain" it. Once it has drained the pool enough, it may return to generating tasks. Lots of tasks generated in a short time so The SINGLE generating task will have to suspend for a while when “task pool” fills up Task switching is invoked to start draining the “pool” The thread executing SINGLE starts executing the queued up tasks When “pool” is sufficiently drained – then the single task can begin generating more tasks again The thread executing SINGLE starts generating more tasks again #pragma omp single { for (i=0; i<ONEZILLION; i++) #pragma omp task process(item[i]); }

71 Opciones adicionales – API de OpenMP*
Obtener el número de hilo dentro de un grupo int omp_get_thread_num(void); Obtener el número de hilos dentro de un grupo int omp_get_num_threads(void); Usualmente no necesario en los códigos de OpenMP Puede hacer que el código no sea serialmente consistente Tiene usos específicos (debugging) Se debe incluir el archivo de cabecera #include <omp.h> Script: Here’s an API foil showing detail of how to use omp_get_thread_num Next foil Background
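Bosquejo mínimo (supuesto, no parte de las láminas) del uso de estas dos rutinas dentro de una región paralela:

#include <stdio.h>
#include <omp.h>

int main()
{
  #pragma omp parallel
  {
    int id = omp_get_thread_num();        // thread id within the team
    int nthreads = omp_get_num_threads(); // size of the team
    printf("Hilo %d de %d\n", id, nthreads);
  }
  return 0;
}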

72 Opciones adicionales - Monte Carlo Pi
Área del cuarto de círculo / área del cuadrado = (π r² / 4) / r² = π / 4 ≈ (# de dardos que caen dentro del círculo) / (# de dardos lanzados al cuadrado), por lo que π ≈ 4 × hits / MAX
loop 1 to MAX
  x.coor = (random#)
  y.coor = (random#)
  dist = sqrt(x^2 + y^2)
  if (dist <= 1) hits = hits + 1
pi = 4 * hits/MAX
Script: Here’s a very cool optional lab that uses Math Kernel Library (MKL) to do a monte carlo approximation for Pi Next foil Background

73 Opcional – Haciendo Monte Carlo en Paralelo
hits = 0
call SEED48(1)
DO I = 1, max
  x = DRAND48()
  y = DRAND48()
  IF (SQRT(x*x + y*y) .LT. 1) THEN
    hits = hits+1
  ENDIF
END DO
pi = REAL(hits)/REAL(max) * 4.0
Script: The big take away is that Rand() is not thread safe, so Monte Carlos using rand cannot be parallelized as written Next foil Background Making Monte Carlo’s Parallel The Random Number generator maintains a static variable (the Seed). The only way to make this parallel is to do it in a critical section for each call of the DRAND(), which is a lot of overhead. ¿Cuál es el reto aquí?
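El reto es que el generador de números aleatorios mantiene una semilla estática compartida, así que llamarlo desde varios hilos crea una condición de concurso o exige una sección crítica muy costosa. Bosquejo mínimo en C (supuesto; usa rand_r de POSIX con una semilla por hilo, y no es la solución con Intel MKL VSL que pide la actividad siguiente) de cómo podría evitarse ese estado compartido:

#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

int main()
{
  const long MAX = 1000000;
  long hits = 0;

  #pragma omp parallel reduction(+:hits)
  {
    unsigned int seed = 1234u + omp_get_thread_num();  // per-thread seed: no shared RNG state
    #pragma omp for
    for (long i = 0; i < MAX; i++) {
      double x = (double) rand_r(&seed) / RAND_MAX;
      double y = (double) rand_r(&seed) / RAND_MAX;
      if (x*x + y*y <= 1.0) hits = hits + 1;
    }
  }

  printf("pi ~= %f\n", 4.0 * hits / MAX);
  return 0;
}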

74 Actividad Opcional 5: Calcular Pi
Use la librería Intel® Math Kernel Library (Intel® MKL) VSL: Intel MKL’s VSL (Vector Statistics Libraries) VSL crea un arreglo, en vez de un solo número aleatorio VSL puede tener varias semillas (una para cada hilo) Objetivo: Usar lo básico de OpenMP para hacer el código de Pi paralelo Escoge el mejor código para dividir las tareas Categoriza propiamente todas las variables Script: Here’s a lab to use Intel MKL’s VSL (Vector Statistics Libraries) to create a vector of random numbers all in parallel – rather than one at a time as was done with DRAND() Next foil Background

75 Cláusula Firstprivate
Variables inicializadas a partir de la variable compartida Los objetos de C++ se construyen a partir de una copia
incr=0;
#pragma omp parallel for firstprivate(incr)
for (I=0;I<=MAX;I++) {
  if ((I%2)==0) incr++;
  A[I]=incr;
}

76 Multi-core Programming: Basic Concepts Speaker’s Notes
Cláusula Lastprivate Las variables actualizan la variable compartida usando el valor de la última iteración Los objetos de C++ se actualizan por asignación
void sq2(int n, double *lastterm)
{
  double x;
  int i;
  #pragma omp parallel
  #pragma omp for lastprivate(x)
  for (i = 0; i < n; i++){
    x = a[i]*a[i] + b[i]*b[i];
    b[i] = sqrt(x);
  }
  *lastterm = x;
}

77 Cláusula Threadprivate
Preserva el alcance global en el almacenamiento por hilo Usa copyin para inicializar a partir del hilo maestro
struct Astruct A;
#pragma omp threadprivate(A)
#pragma omp parallel copyin(A)
  do_something_to(&A);
#pragma omp parallel
  do_something_else_to(&A);
Las copias privadas de “A” persisten entre regiones

78 20+ Funciones de librería
Rutinas de ambiente en tiempo de ejecución: Modifica/revisa el número de hilos omp_[set|get]_num_threads() omp_get_thread_num() omp_get_max_threads() ¿Estamos en una región paralela? omp_in_parallel() ¿Cuántos procesadores hay en el sistema? omp_get_num_procs() Locks explícitos omp_[set|unset]_lock() Y muchas más... Script: More detail on the API info Next foil Background

79 Funciones de librería Para fijar el número de hilos usados en el programa Establecer el número de hilos Almacena el número obtenido Solicita tantos hilos como haya procesadores disponibles.
#include <omp.h>
void main ()
{
  int num_threads;
  omp_set_num_threads (omp_get_num_procs ());
  #pragma omp parallel
  {
    int id = omp_get_thread_num ();
    #pragma omp single
    num_threads = omp_get_num_threads ();
    do_lots_of_stuff (id);
  }
}
Protege esta operación porque los almacenamientos en memoria no son atómicos Script: More detail on the API info Next foil Background


81 BACKUP

