A look at the Intel® Core 2 architecture and software development tools. June 2009.


1 A look at the Intel® Core 2 architecture and software development tools
June 2009

2 A look at the architecture and the tools
We will discuss:
- What materials are available
- What labs are available
- Which courses the materials can be applied in
- High-level discussions of the technology

3 Objectives
Upon completing this module, you will be able to:
- Be aware of, and have access to, several hours of material on multi-core (MC) topics, including architecture, compiler technology, characterization technology, OpenMP, and cache effects
- Create exercises on how to avoid common parallelization pitfalls associated with some MC systems, such as poor cache utilization, false sharing, and load imbalance
- Create exercises on how to use compiler directives and switches to improve performance on each core
- Create exercises on how to use the tools to quickly identify load-imbalance, poor cache-reuse, and false-sharing problems

4 Agenda
- Multi-core motivation
- A look at the tools
- Exploiting multi-core features
- Exploiting the parallelism within each core (SSEx)
- Avoiding memory/cache effects

5 Why is the industry moving to multi-core technology?
To improve performance and reduce power consumption: it is more efficient to run several cores at a lower frequency than a single core at a higher frequency.
Before we get into the architecture or tools that pertain to multi-core or cache issues, we ought to consider why the industry is moving in the direction of multiple cores rather than simply scaling frequency to achieve faster performance. The move to multi-core has ramifications for software design. So is the move worth it? Why do software developers have to take on an additional burden they never had to before? The answer lies in the fact that power consumption is proportional to the square of the frequency: even small increments in bin frequency cause more and more power loss.

6 Lower frequency leaves room for a second core
Power and Frequency
[Figure: Power (W) vs. Frequency (GHz) curve for single-core architectures. A small drop in frequency yields a much larger drop in power; the lower frequency leaves room for a second core.]
Previously, chip manufacturers were able to increase CPU performance by raising the clock frequency significantly each year. Those frequency increases were a "free lunch" for software developers, who could simply await faster chips to get faster performance on a platform. Those days of the free lunch are over. The technology that allowed that continual frequency scaling has reached a point where too much heat is generated, due to leakage currents at high frequencies. From the curve above you can see that as frequency has increased, the power required to light these CPUs has risen exponentially. It is no longer economical for end users to pay for that power consumption, as well as the power to cool these chips. Instead, if designers stay in the lower, flatter portion of the curve by fixing the frequency, the lower power consumption in that region of the curve makes it possible to power 2, 4, 8 or more cores within the same power budget as a single CPU designed and operated at the higher frequency range.
Consider the following equation as one way to measure computing power, the number of instructions that can be executed in a unit of time:
    instructions/second = instructions/cycle * cycles/second
If we hold frequency (cycles/second) constant, computing power correlates with instructions per cycle. How can we increase instructions per cycle? By increasing parallelism, which is to say by increasing the number of transistors in our budget that can do parallel work simultaneously. Consider two scenarios for doubling computing power:
- Old approach: just double the frequency. Doubling frequency does double our computational capability, but since thermal power is proportional to the square of frequency, doubling the frequency means quadrupling the consumed power.
- Multi-core approach: hold frequency roughly constant, but double the number of cores. Doubling the number of cores (if we can keep both cores busy) doubles instructions/cycle, which means instructions/second is also doubled. Power consumed is also doubled: one unit of power for one core, another unit for the second core.
Summary: the multi-core approach can yield the same doubling of computing power while consuming only half the power of the frequency-doubling approach.
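As a compact restatement of the note's rule of thumb (P ∝ f², illustrative numbers only): pushing a single core from f to 2f doubles throughput but roughly quadruples power (2² = 4), while two cores held at f also double throughput, assuming both stay busy, for only about twice the power.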

7 Agenda
- Multi-core motivation
- A look at the tools
- Exploiting multi-core features
- Exploiting the parallelism within each core (SSEx)
- Avoiding memory/cache effects
We will look at tools because the move to multi-threading can be difficult, and a good tool suite can ease the transition to taking advantage of multi-core in a consistent way.

8 Processor-independent optimizations
/Od   Optimizations disabled
/O1   Optimizes for binary size and speed: server code
/O2   Optimizes for speed (default): vectorization on Intel 64
/O3   Optimizes for the data cache: loopy floating-point code
/Zi   Creates debug symbols
/Ob0  Turns off inlining, which helps the analysis tools do a better job
General optimizations: these are coarse-grained switches that govern the behavior of both single-threaded and multi-threaded code. They are included here because they are used in labs that faculty/students will do in a few slides, and some introduction of these basic switches is needed before the lab.
Note to instructor: no need to cover each topic in detail. Just introduce each switch, give a brief (1-4 second) description of what it does, and move on. If faculty want more detail, point them to the compiler-switches slides.
These are what I call the "coarse grain" switches; each one has a large effect on what the compiler does. /O2 (-O2) is the default, unless you specify /Zi (-g) for debugging, in which case the default becomes /Od (-O0). There are also many "fine grain" switches, which give you more precise control; for example, /Oi turns on (or off) inline expansion of intrinsic functions (see icl /help). This class covers the coarse-grain switches. Each coarse-grain switch represents a set of the fine-grain switches and improves the performance of most applications. There is no guarantee that any optimization switch will speed up every application.
HPO (High-Performance Parallel Optimizer) reports: -opt-report-phase hpo (Linux* and Mac OS*); /Qopt-report-phase hpo (Windows*)
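As a minimal illustration of combining these switches on the Windows icl driver (the source file name here is hypothetical), an analysis-friendly build might look like:

    icl /O2 /Zi /Ob0 matmul.c /Fematmul.exe

/Zi keeps symbols for the analysis tools, and /Ob0 suppresses inlining so profiles attribute time to the original functions.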

9 Vectorization optimizations
/QaxSSE2       Intel Pentium 4 and compatible Intel processors
/QaxSSE3       Intel(R) Core(TM) processor family with SSE3 (Streaming SIMD Extensions 3) support
/QaxSSE3_ATOM  May generate MOVBE instructions for Intel processors, and may optimize for the Intel® Atom™ processor and Intel® Centrino® Atom™ technology
/QaxSSSE3      Intel(R) Core(TM)2 processors with SSSE3
/QaxSSE4.1     Intel(R) 45nm Hi-k next-generation Intel Core(TM) microarchitecture, with support for SSE4 vectorizing instructions and media accelerators
/QaxSSE4.2     Can generate efficient Intel(R) SSE4 instructions to accelerate string and text processing, supported by Intel(R) Core(TM) i7 processors. Can also generate the Intel(R) SSE4 vectorizing and media-accelerator instructions, the Intel(R) SSSE3, SSE3, SSE2, and SSE instructions, and can optimize for the Intel(R) Core(TM) processor family.
Intel has a long history of providing auto-vectorization switches, together with support for new processor instructions and backward support for older ones. Developers should keep an eye on new developments to harness the power of the latest processors.
This 3.5-hour presentation focuses on the Intel compiler only; however, GNU gcc also has auto-vectorization, and Microsoft supports SIMD/SSEx via intrinsics, including the SSE4 instructions. Microsoft apparently does NOT support auto-vectorization: VC has an option /arch:[SSE|SSE2] to generate SSE code, but it does not auto-vectorize.
These switches govern the parallelism within a single core, commonly referred to as SIMD or auto-vectorization. The labs to follow primarily use the axT switch, but the choice really depends on the hardware commonly available to the school and/or students. Nothing in the labs prevents faculty from using them with more advanced vectorization switches if they have access to the associated hardware.
Note to instructor: no need to cover each topic in detail. Just introduce each switch, give a brief description, and move on. If faculty want more detail, point them to the compiler-switches slides.
/Qax<codes> generates code specialized for the processors specified by <codes> while also generating generic IA-32 instructions. /Qx<codes> generates specialized code to run exclusively on the processors indicated by <codes>, as described below:
K  Intel Pentium III and compatible Intel processors
W  Intel Pentium 4 and compatible Intel processors
N  Intel Pentium 4 and compatible Intel processors; enables new optimizations in addition to Intel processor-specific optimizations
P  Intel(R) Core(TM) processor family with Streaming SIMD Extensions 3 (SSE3) instruction support
T  Intel(R) Core(TM)2 processor family with SSSE3
O  Intel(R) Core(TM) processor family; code is expected to run properly on any processor that supports the SSE3, SSE2 and SSE instruction sets
S  Future Intel processors supporting the SSE4 vectorizing-compiler and media-accelerator instructions
/arch:{SSE|SSE2}
Optimization reports: /Qvec-report[n] controls the amount of vectorizer diagnostic information:
n=0  no diagnostic information
n=1  indicate vectorized loops (DEFAULT)
n=2  indicate vectorized/non-vectorized loops
n=3  indicate vectorized/non-vectorized loops and prohibiting data-dependence information
n=4  indicate non-vectorized loops
n=5  indicate non-vectorized loops and prohibiting data-dependence information
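For example, to target SSE3 hardware and see which loops the vectorizer handled, combining the /Qax and /Qvec-report switches listed above (the file name is hypothetical):

    icl /QaxSSE3 /Qvec-report2 vectorsum.c

Report level 2 lists both the vectorized and the non-vectorized loops, which makes it easy to spot loops that still need restructuring.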

10 More advanced optimizations
/Qipo      Interprocedural optimization performs a topological analysis of the application, including all source files. With /Qipo (-ipo) the analysis spans all of the source files; in other words, code generation in module A can be improved by what is happening in module B. It can also enable other optimizations, such as auto-parallelization and auto-vectorization.
/Qparallel Enables the auto-parallelizer to generate multithreaded code for loops that can safely execute in parallel
/Qopenmp   Enables the compiler to generate multithreaded code based on OpenMP* directives
SPEAKER NOTES
Note to instructor: no need to cover each topic in detail. Just introduce each switch, give a brief description, and move on. If faculty want more detail, point them to the compiler-switches slides.
These more advanced optimizations show what modern compilers can do, and demonstrate what is possible. It is up to the faculty of compiler courses to challenge the current state and to advance still better technology. One omission from the list in these faculty materials is any discussion of profile-guided optimization (PGO). This is another recent compiler technology that could be expanded on by faculty of compiler courses. PGO is covered in some depth in the compiler-switches materials (see the link on the previous foils); there is just not enough time for it in face-to-face faculty training.
Interprocedural Optimization (IPO): IPO performs a static, topological analysis of your application. With /Qip, the analysis is limited to within each source file. With /Qipo (-ipo), the analysis spans all of your source files built with /Qipo (-ipo). In other words, code generation in module A can be improved by what is happening in module B. For example, information from other procedures helps the compiler make better optimization decisions; propagating loop counts can help the compiler decide whether to vectorize a loop.
Multi-pass optimization: interprocedural optimization works on the entire program, across procedure and file boundaries. Enabled optimizations include procedure inlining (reduced function-call overhead), procedure reordering, interprocedural dead-code elimination, and constant propagation. Expected winners: programs with many small functions. IPO can be quite expensive in compilation time and disk space, but it can dramatically improve auto-vectorization and auto-parallelization in some cases.
Auto-parallelization: /Qpar_threshold[n] (Windows), -par_threshold[n] (Linux) = the percentage probability of a profitable speedup; 0 = parallelize regardless, 100 = parallelize only if profitable parallel execution is almost certain. Auto-parallelization is enabled by the /Qparallel switch. The compiler will do its best to thread the loops in the code, but for best results, since you have much more information about the application than the compiler does, use the OpenMP* directives.
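A minimal sketch of the kind of loop these switches target (the array names and sizes are hypothetical, not from the course labs): /Qparallel may thread the loop on its own, while /Qopenmp honors the explicit directive.

    #include <stdio.h>

    #define N 1000000
    static double a[N], b[N], c[N];

    int main(void)
    {
        for (int i = 0; i < N; i++) { a[i] = i; b[i] = 2.0 * i; }

        /* Independent iterations: safe to split across cores.
           With /Qparallel the compiler can discover this itself;
           the pragma makes the same parallelism explicit for /Qopenmp. */
        #pragma omp parallel for
        for (int i = 0; i < N; i++)
            c[i] = a[i] + b[i];

        printf("c[N-1] = %f\n", c[N - 1]);
        return 0;
    }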

11 Activity 1 - Auto-parallelization
Objective: use auto-parallelization on a simple code to gain experience with the compiler's auto-parallelization feature.
- Follow the VectorSum activity in the lab workbook
- Try the auto-parallel build in the lab called VectorSum
- Extra credit: parallelize it by hand and see by how much you can beat the auto-parallel option; see the OpenMP building blocks to do this test
The goal of the lab is to show how to use auto-parallelization. Not all apps (even other labs in this class) can be auto-parallelized, but it is good to show a few simple examples of where it can be used. The challenge to faculty and students is to find ways to EXPAND the capability of auto-parallelization. The lab can be made more difficult, if a higher level of difficulty is needed, by challenging faculty to find other apps that do not auto-parallelize and then articulating ways to make auto-parallelization of such apps possible.
This admittedly simple application can be parallelized with /Qparallel (-parallel on Linux). Note that later labs that successfully auto-parallelize will have to use helper switches such as /O3 and /Qipo for auto-parallelization to be effective.
The essential steps for this lab: compile Sum8192Vec using /QxT; compile Sum8192VecParallel using /QxT /Qparallel; compare execution times.
Instructor note: these projects can be built with "nmake all" and run very quickly, if you prefer to demo this rather than run it as a lab.
Some codes might require an additional switch to help the compiler decide whether it can auto-parallelize. These helper switches are /Qipo and /O3, which perform interprocedural optimization and loop transformations. While not required in this example, they will be helpful in an upcoming matrix-multiply lab.
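Spelled out as command lines (Windows icl syntax; the .c file names are an assumption about how the lab sources are laid out):

    icl /QxT Sum8192Vec.c
    icl /QxT /Qparallel Sum8192VecParallel.c

Run each executable and compare the reported times; the second build lets the compiler thread any loops it can prove safe.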

12 Parallel Studio to find where to parallelize
- We will use Parallel Studio in several labs to find the right places to parallelize the code
- Parallel Amplifier will be used specifically to find hotspots: where the application code spends the most CPU time
- Parallel Amplifier does not require instrumenting the code to find hotspots; compiling with symbol information (/Zi) is recommended
- Compiling with /Ob0 turns off inlining, which sometimes improves the analysis in Parallel Studio

13 Parallel Amplifier hotspots
Open a .sln file in Visual Studio and, if Parallel Studio is installed, you will see something like the screenshot above. To find hotspot information, click the "Hotspot - where in my code…" selection next to the Profile icon.

14 What does hotspot analysis show?
Hotspot analysis shows the breakdown of which functions, threads, or modules in an application consume the most time. Here we see, from a Mandelbrot code, that CalcMandelbrot takes the most time.

15 What's in the details?
Double-clicking the CalcMandelbrot function on the previous screen lets the user view the time spent on individual lines in the application. In this case we don't find any obvious places to exploit parallelism, so we will use the call-stack view (see the next foil) to find the next higher-level function in the call chain.

16 The call stack
The call stack shows the caller/callee relationships between functions in the code. Here we see that CalcMandelbrot is called by Mandelbrot, so we should explore that higher-level function for evidence of a place to parallelize the application. We are looking for large loops, or nested loops, to which to apply parallelism.

17 Finding potential parallelism
Here we see the code for Mandelbrot(). We have finally found some nested loops in the calling path of the hot code; parallelizing this code could have a big impact on performance.

18 Activity 2 - Mandelbrot hotspot analysis
Objective: use sampling to find some parallelism in the Mandelbrot application.
- Follow the lab called Mandelbrot Sampling in the lab workbook
- Identify loops that can be parallelized
The goal here is to take what we have learned about Parallel Studio and apply the knowledge to find where in the code we can parallelize our version of Mandelbrot. This version of the code will be unfamiliar to faculty and students, and so can be an effective motivator for why tools such as Parallel Studio are useful: they help identify where to focus parallelization efforts. Parallel Studio Amplifier or Inspector can be used to analyze the application, but the lab focuses on hotspots in Amplifier. The basic idea is to find the leaf function or loop that is consuming the most time, then look at that code to see whether there is an obvious loop or function call within the targeted area that can be made parallel. If not, we look at the caller of the leaf for potential parallel loops or function calls; if none are found, we move to the next higher caller, and so on.

19 Agenda
- Multi-core motivation
- A look at the tools
- Exploiting multi-core features
  - High-level view: Intel® Core architecture
- Exploiting the parallelism within each core (SSEx)
- Avoiding memory/cache effects

20 Intel® Core 2 architecture
A snapshot in time during Penryn, Yorkfield, and Harpertown. Software developers need to know how many cores there are, the cache-line size, and the cache sizes in order to deal with cache effects.
[Figure: die/cache diagrams - 2-core parts with 4M or 6M L2; 4-core parts with 2X3M or 2X6M (12M) L2]
This slide is a snapshot in time of some of the hardware characteristics of Core platforms. Future implementations of the Core architecture will likely have differing cache sizes, latencies, bandwidths, names, thermal design points, etc. It is intended to give a flavor of the metrics that can be of use to a software developer. For example, developers might need to know the cache size or cache-line size to implement a cache-blocking scheme; they may need to know the cache-line size to pad data to avoid false sharing; they may want to know the number of cores to estimate the scaling potential of an application.
For information about cache sizes of various platforms and specific processors, see Intel® Core™ Microarchitecture: Dual-Core Xeon. List: List of Intel Xeon microprocessors, "Paxville DP" (90 nm). The first dual-core CPU branded Xeon, codenamed Paxville DP, product code 80551, was released by Intel on 10 October 2005. Paxville DP had the NetBurst architecture and was a dual-core equivalent of the single-core Irwindale (related to the Pentium D branded "Smithfield") with 4 MB of L2 cache (2 MB per core). The only Paxville DP model released ran at 2.8 GHz, featured an 800 MT/s front-side bus, and was produced on a 90 nm process.
7000-series "Paxville MP". List: List of Intel Xeon microprocessors, "Paxville MP" (90 nm). An MP-capable version of Paxville DP, codenamed Paxville MP, product code 80560, was released on 1 November 2005. There are two versions: one with 2 MB of L2 cache (1 MB per core) and one with 4 MB of L2 (2 MB per core). Paxville MP, called the dual-core Xeon 7000 series, was produced on a 90 nm process. Paxville MP clocks range between 2.67 GHz and 3.0 GHz, with some models having a 667 MT/s FSB and others an 800 MT/s FSB.
Mobile-optimized platform: 1-4 cores; L2 cache sizes 3/6 MB; L2 cache line 64 bytes; 64-bit
Desktop-optimized platform: 2-4 cores; L2 cache sizes 2X3, 2X6 MB; cache line 64 bytes; 64-bit
Server-optimized platform: 4 cores; L2 caches 2X6 MB; L2 cache line 64 bytes; DP/MP support; 64-bit
**Feature names TBD

21 Memory hierarchy
CPU -> L1 cache (~1 cycle) -> L2 cache (~1-10 cycles) -> main memory (~100s of cycles) -> magnetic disk (~1000s of cycles)
Slide depiction (fig 3.1) inspired by "Software Optimization for High Performance Computing," HP Press, by Wadleigh & Crawford. The thickness of the connecting lines depicts bandwidth.
Key takeaway for faculty/students: latency for the L1 cache is on the order of 1 cycle; for L2, on the order of 1 to 10 cycles; an L2 miss forces a read from main memory at ~100s of cycles; and disk access is ~1000s of cycles. This is useful for explaining why effective cache utilization can be MORE important than utilizing multiple cores. But in many cases we can get good cache use AND use multiple cores, for huge performance gains (5-100X in aggregate for an 8-core system).

22 High-level view of the architecture
Intel® Core™ Microarchitecture - memory subsystem
[Figure: Intel Core 2 Duo processor vs. Intel Core 2 Quad processor; each core shows A (architectural state), E (execution engine), C (level-2 cache), and B (bus interface), connected through 64B cache lines to memory]
The dual core has a shared cache; the quad core has both shared and separate caches.
A = architectural state; E = execution engine and interrupts; C = level-2 cache; B = bus interface
The main point to cover here is that false sharing is an issue for platforms with separated caches; we can alleviate false sharing by restructuring the data layout and data-access patterns. This is why faculty of either algorithms courses or data-structures courses should be interested in this material.
Simplified from the Intel® 64 and IA-32 Architectures Software Developer's Manual, Volume 1: Basic Architecture, Fig 2.6. A = architectural state refers to the contents or state of the XMM, MXCSR, x87 FPU, and MMX registers. E = execution engine refers to functional units such as FP, ALU, SIMD, etc. C = second-level cache: memory with a latency of a few cycles, compared to 100s of clocks for main memory. B = the system-bus interface that connects to main memory and I/O.
The cache line is the smallest unit of data that can be transferred to or from memory. Cache lines typically contain several pieces of data: a 64-byte cache line could hold 8 doubles (8 bytes each), 16 floats (4 bytes each), 32 shorts (2 bytes each), or 64 chars (1 byte each). When a single data element is requested by a program, say one variable of type float read from memory, that float and its 15 nearest neighbors in memory (16 floats in total = 64 bytes, the same cache line) are brought into the faster cache memory for use by the processor.

23 With separate caches
Intel® Core™ Microarchitecture - memory subsystem
[Figure: CPU1 and CPU2 with separate caches shipping a cache line across the Front Side Bus (FSB) to memory. Moving the L2 cache line costs roughly half a memory access.]
There is often an effect called ping-ponging, or tennis, where one processor writes to a cache line and then another processor writes to the same cache line but a different data element. In a separate-socket, separate last-level-cache environment, each core takes a HITM (HIT Modified) on the cache line, causing it to be shipped across the FSB. This increases FSB traffic and, even under good conditions, costs about half as much as a memory access.

24 Advantages of a shared cache - using Intel® Advanced Smart Cache technology
Intel® Core™ Microarchitecture - memory subsystem
Memory, Front Side Bus (FSB)
The L2 is shared: there is no need to move the cache line. With a shared L2, the cache line simply goes from exclusive to shared for the other core to read.

25 False sharing
A performance problem in programs where cores write to different memory addresses BUT in the same cache line. Known as the ping-pong effect: the cache line bounces between cores.
[Figure: timeline of Core 0 writing X[0] = 1, then X[0] = 2, while Core 1 writes X[1] = 1, each write forcing the shared cache line back and forth]
False sharing is not a problem with shared caches; it is a problem with separate caches.
From the Intel book "Multi-Core Programming: Increasing Performance Through Software Multi-threading" by Shameem Akhter and Jason Roberts:
False sharing. The smallest unit of memory that two processors interchange is a cache line or cache sector. Two separate caches can share a cache line when they both need to read it, but if the line is written in one cache and read in another, it must be shipped between caches, even if the locations of interest are disjoint. Like two people writing in different parts of a log book, the writes are independent, but unless the book can be ripped apart, the writers must pass the book back and forth. In the same way, two hardware threads writing to different locations contend for a cache sector to the point where it becomes a ping-pong game. In this ping-pong game there are two threads, each running on a different core. Each thread increments a different location belonging to the same cache line; but because the locations belong to the same cache line, the cores must pass the line back and forth across the memory bus.
To avoid false sharing, we need to alter either the algorithm or the data structure. We can add some padding to a data structure or array (generally less than a cache-line size) so that threads access data in different cache lines, or we can adjust the implementation of the algorithm (the loop stride) so that each thread accesses data in a different cache line.
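A minimal sketch of the padding fix the note describes (the structure and thread counts are illustrative, not from the lab sources):

    #include <omp.h>

    #define NUM_THREADS 2
    #define CACHE_LINE  64   /* bytes, per the slides */

    /* Unpadded, counts[0] and counts[1] would share one cache line and
       ping-pong between cores on every increment. Padding each counter
       out to a full line keeps each core's writes in its own line. */
    struct padded_count {
        long value;
        char pad[CACHE_LINE - sizeof(long)];
    };

    static struct padded_count counts[NUM_THREADS];

    void count_in_parallel(long iters)
    {
        #pragma omp parallel num_threads(NUM_THREADS)
        {
            int id = omp_get_thread_num();
            for (long i = 0; i < iters; i++)
                counts[id].value++;   /* hot write, now in a private line */
        }
    }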

26 Agenda
- Multi-core motivation
- A look at the tools
- Exploiting multi-core features
- Exploiting the parallelism within each core (SSEx)
- Avoiding memory/cache effects

27 Superscalar execution
Several operations execute in a single core at the same time. Multiple execution units allow SIMD parallelism, and many instructions can retire in one clock cycle.
[Figure: FP, SIMD, and INT execution units within one core]
The point of this slide is to show that within each core there are "engines," called functional units, that can perform work in parallel. For example, three ALUs can do memory loads or integer math while a SIMD multiply is going on and while an FP unit executes an FMUL or FDIV. There are three execution ports (0, 1, and 5): scalar integer on all ports, SIMD integer on all ports, FP execution on two ports, and moves on all ports. This is the essence of superscalar architectures, where multiple operations can occur at the same time. In case anyone asks: the reservation station is a 32-entry buffer that holds decoded instructions ready to be executed by some functional unit, such as an ALU, FPU, or SIMD unit.

28 History of the SSE instructions
A long history of new instructions; most require pack and unpack instructions.
Intel SSE    (1999): 70 instructions - single-precision vectors, streaming operations
Intel SSE2   (2000): 144 instructions - double-precision vectors; 8/16/32/64/128-bit integer vectors
Intel SSE3   (2004): 13 instructions - complex data
Intel SSSE3  (2006): 32 instructions - decode
Intel SSE4.1 (2007): 47 instructions - video accelerators, building blocks for graphics, advanced vector instructions
The compiler will keep abreast of these for you for the most part, but there are always situations where given instructions could be hugely beneficial and the compiler just isn't generating them for your application. These instructions have a long shelf life once introduced, so it is worth the investment to learn them in case you have occasion to use them.
Intel has added a number of new instructions to Nehalem and has sped up others. The 4.2 version of Intel's SSE vector extensions takes the x86 ISA back to the future a bit by adding new string-manipulation instructions. I say "back to the future" because ISA-level support for string processing is a hallmark of CISC architectures that was actively deprecated in the post-RISC years; typically, when a writer wants an example of crufty old corners of the x86 ISA that have caused pain for chip architects, string-manipulation instructions are what he or she reaches for. But the new SSE4.2 string instructions are aimed at accelerating XML processing, which makes them Web-friendly and therefore modern (i.e., not crufty).
To be continued with Intel SSE4.2 (XML processing, late 2008).

29 SSE data types and potential speedup
A 128-bit XMM register packs:
- 16x bytes
- 8x 16-bit shorts
- 4x 32-bit integers
- 2x 64-bit integers
- 1x 128-bit integer
- 4x floats
- 2x doubles
The potential speedup (in the targeted loop) is approximately the same as the packing factor; e.g., for floats the speedup is ~4X. Example: the SSE scalar-to-vector speedup is 128 bits divided by the size of the data type: 128b/32b = a possible 4X speedup.
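Applying that same formula to the other element widths listed above (simple arithmetic on the 128-bit register width): 128/64 = 2X for doubles and 64-bit integers, 128/32 = 4X for floats and 32-bit integers, 128/16 = 8X for shorts, and 128/8 = 16X for bytes. These are upper bounds, assuming the loop vectorizes fully.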

30 Goal of SSE(x)
Scalar processing (the traditional mode): one instruction produces one result (X + Y = one sum).
SIMD processing with SSE(2,3,4): one instruction produces multiple results. A single packed add of X and Y computes x3+y3, x2+y2, x1+y1, and x0+y0 at once.
- Uses the full width of the XMM registers
- Many functional units
- A choice of several instructions
- Not all loops can be vectorized
- Most function calls cannot be vectorized
The goal is to take scalar code (on the left) and turn it into vector code (on the right). This is at the heart of using as much SIMD hardware as is available.
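A minimal sketch of that scalar-to-vector transformation using the SSE single-precision intrinsics (assuming 16-byte-aligned arrays and n a multiple of 4; in practice the auto-vectorizer generates equivalent code for you):

    #include <xmmintrin.h>   /* SSE intrinsics, 128-bit single precision */

    /* Scalar: one float add per instruction. */
    void add_scalar(const float *x, const float *y, float *r, int n)
    {
        for (int i = 0; i < n; i++)
            r[i] = x[i] + y[i];
    }

    /* Vector: four packed float adds per instruction. */
    void add_sse(const float *x, const float *y, float *r, int n)
    {
        for (int i = 0; i < n; i += 4) {
            __m128 vx = _mm_load_ps(&x[i]);           /* load 4 floats */
            __m128 vy = _mm_load_ps(&y[i]);
            _mm_store_ps(&r[i], _mm_add_ps(vx, vy));  /* 4 sums at once */
        }
    }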

31 Activity 3 - IPO-assisted vectorization
Objective: explore how inlining a function can dramatically improve performance by allowing the vectorization of a loop that contains a function call.
- Open the SquareChargeCVectorizationIPO folder and use "nmake all" to build the project from the command line
- To add switches to the make environment, use: nmake all CF="/QxSSE3"
Experiment with the vectorization switches and IPO, and see how surprising performance gains are sometimes possible by combining switches.
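The shape of the problem, as a hedged two-file sketch (these function and file names are stand-ins, not the lab's actual sources): a call into another translation unit normally blocks vectorization of the loop, but /Qipo lets the compiler inline across files and then vectorize.

    /* scale.c */
    float scale(float v) { return 2.0f * v + 1.0f; }

    /* apply.c */
    float scale(float v);            /* defined in scale.c */

    void apply(float *a, int n)
    {
        for (int i = 0; i < n; i++)
            a[i] = scale(a[i]);      /* vectorizable once scale() is inlined */
    }

Building with nmake all CF="/QxSSE3 /Qipo" (per the switch-passing convention above) would enable the cross-file inlining; adding /Qvec-report2 shows whether the loop actually vectorized.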

32 Agenda
- Multi-core motivation
- A look at the tools
- Exploiting multi-core features
- Exploiting the parallelism within each core (SSEx)
- Avoiding memory/cache effects

33 Cache effects
Cache effects can change an application's speed by as much as 10x, or even 100x. To take advantage of the machine's cache hierarchy, reuse the data in the cache as much as possible. Avoid accessing memory at non-contiguous addresses, especially in loops. Consider loop interchange to access data more efficiently.

34 Loop interchange
    for (i = 0; i < NUM; i++)
        for (j = 0; j < NUM; j++)
            for (k = 0; k < NUM; k++)
                c[i][j] = c[i][j] + a[i][k] * b[k][j];
The fast loop index is k. Jumping around in memory can cause cache misses, particularly for arrays with sizes of 2^n. Array sizes of 2^n, like 1024, typically evict cache lines more readily because the same small set of cache lines gets mapped to the same cache addresses. In matrix multiply, a[0][0] will likely map to the same cache entry as b[0][0], and so will c[0][0], so cache lines are constantly being written to and read from memory. Then the index is incremented and the thrashing starts all over for a[0][1] versus b[0][1], which share the same cache entries, and so on. Many theoretical papers and engineers argue that loop interchange is the transformation with the most potential for adding performance.
Note: the c[i][j] term is constant in the inner loop, and b[k][j] is a non-unit-stride memory access. Interchange the loops for unit-stride access. Loop interchange is used here to make all the array accesses unit stride. The programmer should manually compute a few iterations to verify that the algorithm is unaffected by the interchange.
DEMO: icl /O3 /QxP /Oa /Qvec-report matrix1.c multiply_d_lx.c
After loop interchange, all arrays are accessed with unit stride. Note that to compute the F/M ratio we look at the number of memory ops and the number of FMA ops. Code inspection shows one FMA op issued per iteration. It also shows one write to memory, c[i][j], and two reads from memory, c[i][j] and b[k][j], for a total of 3 memory operations. Why isn't a[i][k] taken into consideration when counting memory operations? Because a[i][k] is independent of the innermost loop, so the compiler puts this value into a register and does not access it from main memory (or cache) for most of the iterations of the innermost loop, j. Very important for vectorization.
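The interchanged version the notes describe: swapping the j and k loops so the inner loop walks both c and b with unit stride, while a[i][k] becomes loop-invariant and can live in a register.

    for (i = 0; i < NUM; i++)
        for (k = 0; k < NUM; k++)
            for (j = 0; j < NUM; j++)
                c[i][j] = c[i][j] + a[i][k] * b[k][j];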

35 Unit-stride memory access (C/C++)
[Figure: row-major layout of arrays a and b. For a[i][k], the faster index k walks consecutive memory; for b[k][j], after the interchange the fastest index j walks consecutive memory.]
Just a graphic to show the new memory-access patterns. After loop interchange, all the arrays in the example are accessed with unit stride. The memory-access pattern for b before loop interchange ran down a column; after the interchange, b's access pattern runs along a row, which is unit stride. For a, the row access shown is before the interchange; after it, a[i][k] is constant in the inner loop. c[i][j] (not illustrated) now varies, but is unit stride, like b.

36 Poor cache utilization - with eggs
The analogy: a carton represents a cache line, the refrigerator represents main memory, the table represents the cache, and a frying pan is ready to fry eggs. The user requests a specific egg; requesting an egg that is not on the table brings a whole new carton from the refrigerator, but the user fries only one egg from each carton. The user requests a specific second egg, then a third, and a carton is evicted: when the table fills up, old cartons are evicted and their eggs are wasted.
The inefficient procedure: (a) bring an entire egg carton from the refrigerator and put it on the table; (b) take one egg from the carton and fry it; (c) repeat step (a), even after the table has filled up with nearly full cartons and they start falling off the table onto the floor.
Note: as the slide indicates, the carton represents a cache line, which is typically 64 bytes; the table represents a cache, say an L2 cache 2 MB in size; the refrigerator represents main memory, say 2 GB in size. The analogy can be stretched even further by letting a grocery store represent data on disk or on the network.

37 Good cache utilization - with eggs
Requesting an egg still brings a new carton from the refrigerator, but now the user specifically requests eggs from the carton already on the table, and fries all the eggs in a carton before requesting an egg from the next one. The user requests eggs 1 through 8, then eggs 9 through 16, and eventually requests all the eggs. The previous user has already used up all the eggs on the table, so carton eviction does not hurt: we already fried all the eggs in the cartons on the table, just like the previous user.
The more efficient procedure: (a) bring an entire egg carton from the refrigerator; (b) take all the eggs from the carton and fry them; (c) repeat step (a). Note that as we bring full cartons from the refrigerator, they displace empty cartons that have already been used up, so cache eviction does not hurt us.
Note: as the slide indicates, the carton represents a cache line (typically 64 bytes), the table represents a cache (say an L2 cache 2 MB in size), and the refrigerator represents main memory (say 2 GB in size). The analogy can be stretched even further by letting a grocery store represent data on disk or on the network.

38 Activity 4 - Cache effects in matrix multiplication
Objective: use Parallel Studio to explore the performance impact of poor cache usage, and see how to manipulate the loops to achieve significantly better cache use and performance.


40 BACKUP

