Sequence Alignments

Why do we need sequence comparison?
Biological basis: many genes and proteins are members of families that share a similar biochemical function or a common evolutionary origin. Sequence comparison is used to identify evolutionary relationships, to identify conserved patterns and, for sequences of unknown function, to find domains that are similar to those of other proteins, which suggests a similar function. With sequence comparison we can find out whether sequences are related (homologous), and which parts of the sequences are alike and which are different.

Sequence alignment
Key issues: 1. which type of alignment to consider; 2. which scoring system to use to rank alignments; 3. which algorithm to use to find the optimal (or a good) solution; 4. which statistical methods are needed to evaluate the significance of an alignment score. The question being asked is whether two sequences are related. The easiest way to answer it is to align the whole sequences, or fragments of them, and then decide whether the alignment is significant or could have arisen by chance.

Types of sequence comparison
There are three types of sequence comparison:
- pairwise alignment: comparison of two sequences
- multiple alignment: comparison of three or more sequences
- database search: a homology search to find out whether a sequence, or a part of a sequence, is related to any sequence in a database of sequences, such as EMBL or SWISS-PROT.

Pairwise Sequence Alignment
Principles of pairwise sequence comparison: global / local alignments, scoring systems, gap penalties. Methods of pairwise sequence alignment: window-based approaches and dynamic programming. These two methods are generally used to obtain alignments, and they serve as a basis for many other operations in computational biology. For homology searches in databases both methods are combined; we will come to this later in the database search session.

Global alignments and local alignments. Comparing two sequences: depending on whether we are interested in sequences that are similar overall or in fragments of interest within a given sequence, we have to choose between different types of alignment.

Global Alignment For sequences that are closely related
This example shows the alignment of two closely related sequences. The program GAP creates a global, end-to-end alignment over the entire length of the sequences using the Needleman & Wunsch algorithm, which we will discuss later.

Global Alignment Two sequences with several regions of similarity
1 AGGATTGGAATGCTCAGAAGCAGCTAAAGCGTGTATGCAGGATTGGAATTAAAGAGGAGGTAGACCG
|||||||||||||| | | | |||| || | | | ||
1 AGGATTGGAATGCTAGGCTTGATTGCCTACCTGTAGCCACATCAGAAGCACTAAAGCGTCAGCGAGACCG 70
This example shows a global alignment of two sequences that are not closely related but share several regions of high similarity. The other regions, like the one marked with the red line, are not detected. Again: GAP computes an end-to-end alignment. With a local alignment only a very low similarity would be obtained (blue fragment).

Local Alignment
14 TCAGAAGCAGCTAAAGCGT 32
   ||||||||| |||||||||
42 TCAGAAGCA.CTAAAGCGT 59

 1 AGGATTGGAATGCT 14
   ||||||||||||||

39 AGGATTGGAAT 49
   |||||||||||
 1 AGGATTGGAAT 11

62 AGACCG 67
   ||||||
66 AGACCG 71
A local alignment finds the region with the best local similarity.

Human alpha-globin compared with beta-globin, leghemoglobin and a nematode glutathione S-transferase. (a) A biologically significant alignment. The central line represents the similarities and differences: a symbol marks identical residues, + marks similar residues (pairs that still have a positive score in the substitution matrices discussed later), and a blank marks no similarity. (b) An alignment that is also biologically meaningful: there are not many identities, but many positions are conserved. The two proteins have the same three-dimensional structure and the same oxygen-binding function. Note the presence of gaps. (c) A chance alignment. How do we tell significant alignments from those that are not?

Parameters to consider in sequence alignment
Scoring systems: most programs calculate the quality of an alignment using a scoring system in which each symbol pairing is assigned a numerical value taken from a symbol comparison table. Gap penalties: many programs also use gap penalties. The introduction of a gap is biologically more important than its actual length, so there are separate opening and extension penalties. The opening penalty is the cost of introducing a gap; the extension penalty is the cost of extending it. The scores help us to judge the quality of the different possible alignments, but they depend strictly on the parameters used. These parameter values are not optimized by alignment programs, so scores obtained with different parameters should not be used to compare alignments!

Scoring systems for nucleotide sequences
Sequence 1: actaccagttcatttgatacttctcaaa
Sequence 2: taccattaccgtgttaactgaaaggacttaaagact
[Matrix figure: rows and columns A, G, C, T.]
A typical scoring system gives a value of 1 to matches and 0 to positions that do not coincide (mismatches). The total score is calculated by adding up the value of each pair observed in the alignment. This scheme favours long alignments regardless of how many mismatches they contain. Match: 1 Mismatch: 0 Score = 5

Scoring systems for nucleotide sequences
Sequence 1: actaccagttcatttgatacttctcaaa
Sequence 2: taccattaccgtgttaactgaaaggacttaaagact
[Matrix figure: rows and columns A, T, C, G, with negative values penalizing mismatches.]
Other scoring systems penalize mismatches with negative scoring values. Penalizing mismatches forces regions of similarity to be longer in order to reach a significant alignment score. This is a widely used DNA substitution matrix; FASTA, for example, uses it. Other models and matrices exist. Nucleic acid matrices are usually unitary, i.e. a T-A mismatch (or substitution) is not weighted differently from any other possible substitution. Note that a unitary matrix is perhaps only valid in a global sense: there are local positional constraints on DNA evolution that result from selective pressure on the encoded proteins and from codon usage.
Matches: 5 Mismatches: 19 Score: $5 \times 5 + 19 \times (-4) = -51$
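The tally on this slide can be reproduced with a few lines of Python. This is a minimal sketch: the function name `score_ungapped` and the two test strings are made up for illustration, and the values +5 for a match and -4 for a mismatch are assumed from the slide's arithmetic.

```python
def score_ungapped(seq1, seq2, match=5, mismatch=-4):
    """Sum match/mismatch scores over the aligned positions of two sequences."""
    if len(seq1) != len(seq2):
        raise ValueError("ungapped scoring needs sequences of equal length")
    return sum(match if a == b else mismatch for a, b in zip(seq1, seq2))


# Illustrative strings with 5 matches and 19 mismatches, as in the slide's tally
top = "ACGTA" + "A" * 19
bottom = "ACGTA" + "C" * 19
print(score_ungapped(top, bottom))                       # 5*5 + 19*(-4) = -51
print(score_ungapped(top, bottom, match=1, mismatch=0))  # unit scheme: 5
```

With the 1/0 scheme the mismatches cost nothing, which is why that scheme favours long alignments; with negative mismatch scores the same alignment drops below zero.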

Scoring systems for protein sequences
Sequence 1: PTHPLASKTQILPEDLASEDLTI
Sequence 2: PTHPLAGERAIGLARLAEEDFGM
Scoring matrix (only a part is shown; rows and columns C, S, T, P, A, G, N, D, ...): C:C = 9, S:C = -1, S:S = 4, T:G = -2, T:T = 5. Score = 48
Protein comparison is much more sophisticated because biological information is taken into account. Programs use a scoring matrix with customized scores for each particular amino acid pair, 210 score values in all (20 possible match scores and 190 possible mismatch scores).

Protein Scoring Systems
Amino acids have different biochemical and physical properties that influence how readily they are replaced during evolution. They differ in size, in their affinity to water molecules, etc. Amino acids with similar properties are more likely to be substituted for one another than amino acids with different properties (e.g. size), because evolution tends to preserve most properties of the protein involved.
[Figure: Venn diagram grouping the amino acids by property: tiny, small, aliphatic, hydrophobic, aromatic, polar, charged, positive.]

Protein Scoring Systems
Scoring matrices reflect the probabilities of mutual substitution and the probability of occurrence of each amino acid. Amino acids with similar properties usually have high pairwise scores, and unrelated amino acids have lower scores. The probability of occurrence is the overall frequency of an amino acid in a large protein set: for example, the pair alanine - alanine is assigned a low value because alanine is a very common amino acid in proteins. The factors that influence the probability of mutual substitution are numerous, so scoring matrices were derived from direct observation of actual substitution rates. The most widely used classes of scoring matrices are PAM and BLOSUM.

PAM (Percent Accepted Mutations) matrices
Derived from global alignments of protein families. Family members share at least 85% identity (Dayhoff et al., 1978). Construction of a phylogenetic tree and ancestral sequences for each protein family. Computation of the number of replacements for each pair of amino acids. Margaret Dayhoff and her co-workers developed the PAM matrices from global alignments of complete proteins. They analysed very similar sequences that had at least 85% identity.

PAM (Percent Accepted Mutations) matrices
The numbers of replacements were used to compute a so-called PAM-1 matrix. PAM 1 means 1% accepted mutations, i.e. this matrix would be used when about 1% of the residues are expected to have been substituted. PAM matrices for larger evolutionary distances can be extrapolated from this matrix. PAM250 = 250 mutations per 100 residues; the higher the number, the greater the evolutionary distance. PAM250 is an often used matrix. At this evolutionary distance, 48% of the tryptophans, 41% of the cysteines and 20% of the histidines would remain unchanged, but only 7% of the serines. So before doing anything else, the user has to choose an evolutionary distance at which to compare the sequences.
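The extrapolation to larger distances is done by multiplying the PAM-1 mutation probability matrix by itself. The sketch below uses a hypothetical three-letter alphabet with made-up probabilities, not Dayhoff's data, and the row convention (entry [i][j] is the probability that residue i is replaced by j) is an assumption of this example.

```python
def mat_mul(a, b):
    """Multiply two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def mat_pow(m, power):
    """Raise a square matrix to a positive integer power by repeated multiplication."""
    result = m
    for _ in range(power - 1):
        result = mat_mul(result, m)
    return result


# Hypothetical PAM-1-like matrix: each row sums to 1, and the diagonal is the
# probability that a residue stays unchanged over one PAM unit
toy_pam1 = [
    [0.990, 0.008, 0.002],
    [0.008, 0.990, 0.002],
    [0.002, 0.002, 0.996],
]
toy_pam250 = mat_pow(toy_pam1, 250)
for row in toy_pam250:
    print([round(p, 3) for p in row])
```

The diagonal of the result shrinks as the power grows, which is the toy counterpart of the statement that only a fraction of each residue type remains unchanged at PAM250.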

PAM 250
[Matrix figure: PAM250 substitution matrix with rows and columns A R N D C Q E G H I L K M F P S T W Y V B Z; highlighted entries W-C = -8 and W-W = 17.]
This is an example of a PAM matrix. The matrix has positive and negative values. The average of all scores has to be negative in order to be able to detect weak local similarities. Extreme values: the lowest score is tryptophan - cysteine, a substitution that was rarely observed by Dayhoff and co-workers; the highest score is tryptophan - tryptophan, because tryptophan is a highly conserved residue. The score of a pair of identical amino acids reflects the probability that this amino acid remains unchanged.

BLOSUM (Blocks Substitution Matrix)
Derived from alignments of domains belonging to evolutionarily distant proteins (Henikoff & Henikoff, 1992). Henikoff and Henikoff used a different approach from Dayhoff's: local alignments of sequences that are very distant from one another but still share certain regions. In contrast to PAM matrices, BLOSUM matrices therefore represent more distant relationships between proteins. More than 2000 blocks of aligned sequences were used, taken from the protein families then available in PROSITE and SWISS-PROT. They counted the occurrences of each amino acid pair in each column of each block, and the counts obtained from all blocks were used to calculate the BLOSUM matrices. Example of pair counting in one block column containing the residues A, C and E: A-C = 4, A-E = 2, C-E = 2, A-A = 1, C-C = 1. BLOSUM62 is the most widely used variant. The idea is that highly conserved sequence segments from otherwise highly diverged protein sequences lead to substitution scores that more effectively encourage local alignment algorithms to produce alignments highlighting biologically important similarities. BLOSUM62 means that the matrix was derived from blocks in which sequences at least 62% identical were clustered and counted as one.
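The pair counts quoted for the example column are what one gets from a column holding two A, two C and one E. That column content is inferred from the counts, not stated on the slide, and `count_pairs` is a name made up for this sketch.

```python
from collections import Counter
from itertools import combinations


def count_pairs(column):
    """Count unordered residue pairs among all sequences in one block column."""
    counts = Counter()
    for a, b in combinations(column, 2):
        counts[tuple(sorted((a, b)))] += 1
    return counts


# Assumed column: A, A, C, C, E (five sequences give 10 pairs in total)
pairs = count_pairs("AACCE")
for pair, n in sorted(pairs.items()):
    print(pair, n)
```

Summing such counts over every column of every block gives the observed pair frequencies from which the matrix scores are later computed.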

BLOSUM (Blocks Substitution Matrix)
Within a block, sequences are clustered according to their degree of identity, and each cluster is counted as a single sequence. BLOSUM matrices differ in the percentage of sequence identity used for the clustering. The number of the matrix (e.g. 62 in BLOSUM62) refers to the maximum percentage of identity between the sequences used to create the matrix. Higher numbers correspond to smaller evolutionary distances.

Substitution matrices: Log-odds Ratio
Given a pair of aligned sequences, we want to assign a score that measures the relative likelihood that the sequences are related as opposed to being unrelated.
Notation: x, y = the two sequences; x_i, y_i = amino acids (A, C, ..., Y) at position i; i = 1...n (n = sequence length); P = likelihood; q = probability (frequency) of a single residue.
Random model R (unrelated): it assumes that a residue a occurs independently with some frequency q_a, so the probability of the two sequences is just the product of the probabilities of each amino acid:
$P(x,y \mid R) = \prod_i q_{x_i} \prod_i q_{y_i}$
Match model M (related): aligned pairs of residues occur with a joint probability p_{ab}. p_{ab} can be thought of as the probability that the residues a and b have each independently been derived from some unknown original residue c in their common ancestor (c might be the same as a and/or b):
$P(x,y \mid M) = \prod_i p_{x_i y_i}$
Odds ratio (related / unrelated):
$\frac{P(x,y \mid M)}{P(x,y \mid R)} = \prod_i \frac{p_{x_i y_i}}{q_{x_i} q_{y_i}}$
In order to arrive at an additive scoring system, we take the logarithm of this ratio, known as the log-odds ratio:
$S = \sum_i s(x_i, y_i)$, where $s(a,b) = \log \frac{p_{ab}}{q_a q_b}$
s(a,b) is the log-likelihood ratio of the residue pair (a,b) occurring as an aligned pair, as opposed to an unaligned pair. The total score S is the sum of the individual scores s(a,b) for each pair of aligned residues. The s(a,b) scores can be arranged in a matrix; for proteins they form a 20 x 20 matrix, known as a score matrix or substitution matrix.
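The log-odds definition can be tried out numerically. This is a sketch under stated assumptions: the two-letter alphabet, the probabilities and the function names are all hypothetical, and base-2 logarithms are one possible choice of unit.

```python
import math


def log_odds(p_ab, q_a, q_b, base=2.0):
    """s(a,b) = log( p_ab / (q_a * q_b) ), in units set by the log base."""
    return math.log(p_ab / (q_a * q_b), base)


def alignment_score(x, y, p, q):
    """Sum s(x_i, y_i) over an ungapped alignment of equal-length sequences."""
    return sum(log_odds(p[frozenset((a, b))], q[a], q[b]) for a, b in zip(x, y))


# Hypothetical background frequencies and joint probabilities of aligned pairs
q = {"A": 0.5, "B": 0.5}
p = {
    frozenset("A"): 0.4,    # pair A/A
    frozenset("B"): 0.4,    # pair B/B
    frozenset("AB"): 0.1,   # each of the ordered pairs A/B and B/A
}

print(round(log_odds(0.4, 0.5, 0.5), 3))   # identical pair: positive score
print(round(log_odds(0.1, 0.5, 0.5), 3))   # mismatched pair: negative score
print(round(alignment_score("AAB", "ABB", p, q), 3))
```

A pair that occurs more often in related sequences than expected by chance gets a positive score, and a pair that occurs less often gets a negative one; a pair with p_ab = q_a q_b scores exactly zero.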

How to choose the right matrix
Generally, BLOSUM matrices perform better than PAM matrices for local similarity searches (Henikoff & Henikoff, 1993). When comparing closely related proteins one should use lower PAM or higher BLOSUM matrices; for distantly related proteins, higher PAM or lower BLOSUM matrices. Choosing a specific scoring matrix only makes sense for a pairwise comparison or a multiple alignment, in other words when the evolutionary relationship of the sequences involved is known. For database searches, where we do not know the relationship between the query sequence and the database sequences, the commonly used "average" matrix is BLOSUM62.

How to score insertions and deletions
Without a gap:
ATGTAATGCA
TATGTGGAATGA
With an insertion / deletion:
ATGT--AATGCA
TATGTGGAATGA
We now know how to score matches and mismatches. The next question is how to score insertions and deletions. In this example the insertion of a gap leads to more matching pairs. However, most programs penalize the creation of a gap with a negative score value. Which one is the better alignment?

Gap Penalties An optimal alignment maximizes the number of matches and minimizes the number of gaps. Allowing the arbitrary insertion of many gaps can produce high scores between non-homologous sequences. Penalizing gaps forces alignments to meet the optimality criteria.

Gap Penalties
Linear gap penalty score: $\gamma(g) = -g d$
Affine gap penalty score: $\gamma(g) = -d - (g - 1) e$
$\gamma(g)$ = gap penalty score of a gap of length g, d = gap opening penalty, e = gap extension penalty, g = gap length
The linear gap penalty function assigns the same negative score to every position in the gap. However, since the introduction of a gap is biologically more important than its length, the extension of a gap is penalized differently from its opening. This is represented in the affine gap penalty score.

Scoring Insertions and Deletions
Scoring: match = 1, mismatch = 0.
Without a gap:
 ATGTTATAC
TATGTGCGTATA
Total score: 4
With an insertion / deletion:
 ATGT---TATAC
TATGTGCGTATA
Gap parameters: d = 3 (gap opening), e = 0.1 (gap extension), g = 3 (gap length)
$\gamma(g) = -d - (g - 1) e = -3 - (3 - 1) \times 0.1 = -3.2$
Total score: 8 - 3.2 = 4.8
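The gap arithmetic on this slide can be checked with a short sketch. The function names are made up for illustration, the aligned rows are a reconstruction of the slide's example with the unaligned end overhangs left out, and the scorer treats any run of consecutive gap columns as a single gap.

```python
def linear_gap(g, d):
    """Linear penalty: every gap position costs d."""
    return -g * d


def affine_gap(g, d, e):
    """Affine penalty: d for opening the gap, e for each further position."""
    return -d - (g - 1) * e


def score_alignment(row1, row2, d, e, match=1, mismatch=0):
    """Score two equal-length gapped rows ('-' marks a gap) with affine gap costs."""
    total, gap_len = 0.0, 0
    for a, b in zip(row1, row2):
        if a == "-" or b == "-":
            gap_len += 1          # still inside a gap
            continue
        if gap_len:               # a gap just ended: charge it once
            total += affine_gap(gap_len, d, e)
            gap_len = 0
        total += match if a == b else mismatch
    if gap_len:                   # gap running to the end of the rows
        total += affine_gap(gap_len, d, e)
    return total


print(round(affine_gap(3, d=3, e=0.1), 2))                                  # -3.2
print(round(score_alignment("ATGT---TATA", "ATGTGCGTATA", d=3, e=0.1), 2))  # 4.8
print(linear_gap(3, d=3))                                                   # -9
```

With a linear penalty of 3 per position the same gap would cost 9, and the gapped alignment would score lower than the ungapped one; the affine scheme is what makes the gapped alignment win here.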

Dotplot: a dotplot gives an overview of the alignment
[Dotplot figure of the two example sequences, TACATTACGTAC and ATACACTTA, labelled Sequence 1 and Sequence 2 on the axes.]
In the following I will often use dotplots and alignment matrices to explain alignment algorithms. The dotplot allows a visual inspection of all possible alignments. The two sequences are placed along the columns and the rows, written from bottom to top and from left to right. This is the alignment matrix. A dot is placed in the matrix wherever the symbols at that position are identical.

Dotplot: each diagonal in the plot corresponds to one possible ungapped alignment of the two sequences
A dotplot gives an overview of all possible alignments of two sequences. Each diagonal represents one possible alignment. One possible alignment:
TACATTACGTAC
ATACACTTA
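A dotplot of this kind takes only a few lines to compute. In this minimal sketch the function names are made up, and the plot is printed from top to bottom rather than bottom to top as on the slide.

```python
def dotplot(seq1, seq2):
    """Return the set of (i, j) positions where seq1[i] == seq2[j]."""
    return {(i, j)
            for i, a in enumerate(seq1)
            for j, b in enumerate(seq2)
            if a == b}


def print_dotplot(seq1, seq2):
    """Print seq2 across the top and seq1 down the side, with '*' for identities."""
    dots = dotplot(seq1, seq2)
    print("  " + " ".join(seq2))
    for i, a in enumerate(seq1):
        cells = ("*" if (i, j) in dots else "." for j in range(len(seq2)))
        print(a + " " + " ".join(cells))


print_dotplot("ATACACTTA", "TACATTACGTAC")
```

Runs of `*` along a diagonal are the ungapped alignments the slide refers to; isolated dots are chance identities.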

Window-based Approaches
Window-based approaches are quick methods used for database searches. There are two different approaches:
- word size algorithm: searching for short identities
- window / stringency: searching for short similar regions, without gaps
Neither of the two methods uses gap penalties!

Word Size Algorithm
[Dotplot figure: TACGGTATG against ACAGTATC, word size = 3.]
With a word size of 3, a dot is placed in the plot only if all three nucleotides of the word coincide. There can be no gaps within a word. The algorithm is not very sensitive and does not detect subtle homologies.

Window / Stringency
[Dotplot figure: TACGGTATG against TCAGTATC, window = 5 / stringency = 4.]
The sensitivity problem can be overcome by allowing mismatches within a word. This is the stringency: a dot is placed when at least that many positions within the window match. Dotplots generated this way are more sensitive.

Considerations The window/stringency method is more sensitive than the wordsize method (ambiguities are permitted). The smaller the window, the larger the weight of statistical (unspecific) matches. With large windows the sensitivity for short sequences is reduced. Insertions/deletions are not treated explicitly.
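Both window-based methods can be expressed with one function, since the word size method is the special case in which the stringency equals the window length. The name `window_hits` and the 0-based start positions are choices made for this sketch; the sequences and parameters are the ones from the two example slides.

```python
def window_hits(seq1, seq2, window, stringency):
    """Return (i, j) start positions whose windows share at least `stringency` identities."""
    hits = []
    for i in range(len(seq1) - window + 1):
        for j in range(len(seq2) - window + 1):
            identities = sum(a == b for a, b in
                             zip(seq1[i:i + window], seq2[j:j + window]))
            if identities >= stringency:
                hits.append((i, j))
    return hits


# Word size 3: all three positions must match
print(window_hits("TACGGTATG", "ACAGTATC", window=3, stringency=3))
# Window 5 / stringency 4: one mismatch per window is tolerated
print(window_hits("TACGGTATG", "TCAGTATC", window=5, stringency=4))
```

Neither call ever opens a gap: each hit is a pair of equally long, ungapped windows.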

Insertions / Deletions in a Dotplot
[Dotplot figure: Sequence 1 against Sequence 2.]
This alignment contains one gap. In the corresponding dotplot the diagonals of the alignment are drawn and then they are shifted by one position.
TACTG-TCAT
||||| ||||
TACTGTTCAT

Dotplot (Window = 30 / Stringency = 9)
Hemoglobin -chain With the programs Compare and dotplot you can create a visual alignment. If you run Compare with the default parameters aligning very similar sequences the dotplot gets very crowded. You can filter these results either by reducing the windowsize or by increasing the stringency. Hemoglobin -chain

Dotplot (Window = 18 / Stringency = 10)
Hemoglobin -chain Here we changed the size of the window from 30 to 18 and we changed the stringency from 9 to 10. Hemoglobin -chain

Principles of pairwise sequence comparison: global / local alignments, scoring systems, gap penalties. Methods of pairwise sequence alignment: window-based approaches; dynamic programming approaches (Needleman and Wunsch, Smith and Waterman). Window-based approaches are quick methods for the identification of sequence similarities. However, for computing an optimal alignment of two sequences one has to use another approach: dynamic programming.

Dynamic Programming An automatic procedure that finds the best alignment, i.e. the one with the optimal score for the chosen parameters. The solutions are recursive: small problems are solved first, and their solutions are then used to solve larger problems. Intermediate solutions are stored in tabular matrices. The Needleman & Wunsch algorithm aligns a pair of sequences over their entire lengths, while the Smith-Waterman algorithm finds the best matching regions in the same pair of sequences. Global algorithms are often not effective for highly diverged sequences and do not reflect the biological reality that two sequences may share only limited regions of conserved sequence. Very often two sequences share only a single functional domain.

Principios básicos de la programación dinámica
The basic principles of dynamic programming. Basically there are three steps:
- Initialization of the alignment matrix (the scoring model) and creation of an alignment path matrix
- Stepwise calculation of score values
- Backtracking: evaluation of the optimal path

Initialization of Matrix (BLOSUM 50)
[Matrix figure: BLOSUM50 scores for every residue pair of the two example sequences, HEAGAWGHEE along the top and PAWHEAE down the side; positive scores in bold.]
The score matrix for the two example sequences, showing the BLOSUM50 value for each aligned residue pair. Positive scores are in bold. The final goal is to incorporate as many of these positive scores as possible into the alignment while minimizing the cost of non-conserved residues, gaps, etc.

Needleman and Wunsch (global alignment)
Sequence 1: H E A G A W G H E E Sequence 2: P A W H E A E Scoring parameters: BLOSUM50 matrix Gap penalty: linear gap penalty of 8. The first problem is to obtain an optimal alignment between two sequences. To do so we will take a closer look at the Needleman-Wunsch algorithm and align these two simple sequences. Because we introduced the scoring scheme as a log-odds ratio, the scores are additive and better alignments will have higher scores. For simplicity, we will use a linear gap penalty.

Creation of an alignment path matrix
Idea: build an optimal global alignment using previously computed solutions for the optimal alignments of smaller subsequences. Construct a matrix F indexed by i and j (one index for each sequence). F(i,j) is the score of the best alignment between the initial segment x_1...x_i of x and the initial segment y_1...y_j of y. F(i,j) is built recursively, starting with F(0,0) = 0.
Optimal global alignment:
HEAGAWGHE-E
--P-AW-HEAE

$F(i,j) = \max \{\, F(i-1,j-1) + s(x_i,y_j),\; F(i-1,j) - d,\; F(i,j-1) - d \,\}$
[Figure: cell F(i,j) is computed from its three neighbours F(i-1,j-1), F(i-1,j) and F(i,j-1), with increments s(x_i,y_j), -d and -d.]
We start by initializing the matrix and then fill it from the top left to the bottom right. If we know the three neighbouring cells we can calculate F(i,j). There are three ways to reach the best alignment: x_i and y_j are aligned to each other; x_i is aligned to a gap; y_j is aligned to a gap.
HEAGAWGHE-E
--P-AW-HEAE

If F(i-1,j-1), F(i-1,j) and F(i,j-1) are known we can calculate F(i,j) Three possibilities: xi and yj are aligned, F(i,j) = F(i-1,j-1) + s(xi ,yj) xi is aligned to a gap, F(i,j) = F(i-1,j) - d yj is aligned to a gap, F(i,j) = F(i,j-1) - d The best score up to (i,j) will be the largest of the three options

Boundary conditions
[Matrix figure: HEAGAWGHEE along the top, PAWHEAE down the side; the boundary cells next to P, A, W, H, E, A, E are -8, -16, -24, -32, -40, -48, -56.]
To fill the top row and the left column we need boundary conditions. Top row: j = 0, so F(i,j-1) and F(i-1,j-1) do not exist. F(i,0) represents an alignment of x_1...x_i to gaps, so we define $F(i,0) = -i d$. Left column: i = 0, so $F(0,j) = -j d$. While filling the matrix we keep, for every cell, a pointer back to the cell from which F(i,j) was derived.

Stepwise calculation of score values
Filling the alignment path matrix step by step.
$F(i,j) = \max \{\, F(i-1,j-1) + s(x_i,y_j),\; F(i-1,j) - d,\; F(i,j-1) - d \,\}$
Scores used: s(H,P) = -2, s(E,P) = -1, s(H,A) = -2, s(E,A) = -1; d = 8.
$F(1,1) = \max \{\, F(0,0) + s(H,P) = -2,\; F(0,1) - d = -8 - 8 = -16,\; F(1,0) - d = -8 - 8 = -16 \,\} = -2$
$F(2,1) = \max \{\, F(1,0) + s(E,P) = -9,\; F(1,1) - d = -10,\; F(2,0) - d = -24 \,\} = -9$
$F(1,2) = \max \{\, F(0,1) + s(H,A) = -10,\; F(0,2) - d = -24,\; F(1,1) - d = -10 \,\} = -10$
$F(2,2) = \max \{\, F(1,1) + s(E,A) = -3,\; F(1,2) - d = -18,\; F(2,1) - d = -17 \,\} = -3$

Backtracking
[Matrix figure: the completed alignment path matrix for HEAGAWGHEE and PAWHEAE; the cells on the optimal path include -8, -16, -17, -25, -20, -5, -13, -3, 3, -5 and 1.]
The alignment path matrix is now filled completely. The value of the final cell of the matrix, F(10,7) at the bottom right corner, is by definition the best score for the global alignment of our two sequences. To find the alignment itself we must find the path of choices that leads to this final value. The procedure to do this is called backtracking.
- Build the alignment in reverse, starting from the final cell and following the pointers that we stored when building the matrix.
- At each step we add a pair of symbols to the front end of the alignment.
Optimal global alignment:
HEAGAWGHE-E
--P-AW-HEAE
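The three steps (boundary conditions, stepwise filling, backtracking) can be put together in a short sketch. Several things here are assumptions of the example rather than part of the slides: the function and table names, the substitution values (taken to be the BLOSUM50 entries for the residues involved, to be checked against a published table), and the tie-breaking rule in the traceback, which prefers the diagonal and recomputes the choice instead of storing pointers.

```python
# Assumed BLOSUM50 entries for the residue pairs that occur in the example;
# verify against a published BLOSUM50 table before relying on them.
BLOSUM50_SUBSET = {
    "AA": 5, "AE": -1, "AG": 0, "AH": -2, "AP": -1, "AW": -3,
    "EE": 6, "EG": -3, "EH": 0, "EP": -1, "EW": -3,
    "GH": -2, "GP": -2, "GW": -3,
    "HH": 10, "HP": -2, "HW": -3,
    "PW": -4, "WW": 15,
}


def s(a, b):
    """Substitution score for residues a and b (symmetric lookup)."""
    return BLOSUM50_SUBSET["".join(sorted(a + b))]


def needleman_wunsch(x, y, d=8):
    """Global alignment with linear gap penalty d; returns (score, row_x, row_y)."""
    n, m = len(x), len(y)
    F = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):          # boundary: x_1..x_i aligned to gaps
        F[i][0] = -i * d
    for j in range(1, m + 1):          # boundary: y_1..y_j aligned to gaps
        F[0][j] = -j * d
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            F[i][j] = max(F[i - 1][j - 1] + s(x[i - 1], y[j - 1]),
                          F[i - 1][j] - d,
                          F[i][j - 1] - d)
    # Backtracking from the bottom right cell, preferring the diagonal on ties
    row_x, row_y, i, j = [], [], n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and F[i][j] == F[i - 1][j - 1] + s(x[i - 1], y[j - 1]):
            row_x.append(x[i - 1])
            row_y.append(y[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and F[i][j] == F[i - 1][j] - d:
            row_x.append(x[i - 1])     # x_i aligned to a gap
            row_y.append("-")
            i -= 1
        else:
            row_x.append("-")          # y_j aligned to a gap
            row_y.append(y[j - 1])
            j -= 1
    return F[n][m], "".join(reversed(row_x)), "".join(reversed(row_y))


score, row_x, row_y = needleman_wunsch("HEAGAWGHEE", "PAWHEAE")
print(score)
print(row_x)
print(row_y)
```

Other tie-breaking rules can return a different alignment with the same score, since several paths through the matrix may be equally good.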

Smith and Waterman (local alignment)
$F(i,j) = \max \{\, 0,\; F(i-1,j-1) + s(x_i,y_j),\; F(i-1,j) - d,\; F(i,j-1) - d \,\}$
With the Smith-Waterman algorithm we can look for the best alignment between subsequences of sequence x and sequence y. This situation arises, for example, when we suspect that two sequences share a common domain or when we compare extended stretches of genomic DNA. It is also the more sensitive method for detecting similarity between highly diverged sequences. There are two differences from the Needleman and Wunsch algorithm:
1. An extra possibility of 0 is added to the equation. Taking the value 0 corresponds to starting a new alignment. As a consequence the top row and the left column are filled with 0.
2. An alignment can end anywhere in the matrix. So we look for the highest value over the whole matrix and start the backtracking from there. The traceback ends when a cell with value 0 is reached, which corresponds to the start of the alignment.
Example: Sequence 1: H E A G A W G H E E Sequence 2: P A W H E A E Scoring parameters: log-odds ratios Gap penalty: linear gap penalty of 8

Smith Waterman alignment
[Matrix figure: Smith-Waterman matrix for HEAGAWGHEE and PAWHEAE; the cells on the optimal path are 5, 20, 12, 22 and 28.]
Our example sequences aligned with the Smith-Waterman algorithm.
Optimal local alignment:
AWGHE
AW-HE
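A local version of the same dynamic programming scheme differs only in the extra 0 and in where the traceback starts and stops. As before this is a sketch under assumptions: the names are made up, the substitution values are taken to be the BLOSUM50 entries for these residues, and ties in the traceback are resolved in favour of the diagonal.

```python
# Assumed BLOSUM50 entries for the residue pairs that occur in the example;
# verify against a published BLOSUM50 table before relying on them.
BLOSUM50_SUBSET = {
    "AA": 5, "AE": -1, "AG": 0, "AH": -2, "AP": -1, "AW": -3,
    "EE": 6, "EG": -3, "EH": 0, "EP": -1, "EW": -3,
    "GH": -2, "GP": -2, "GW": -3,
    "HH": 10, "HP": -2, "HW": -3,
    "PW": -4, "WW": 15,
}


def s(a, b):
    """Substitution score for residues a and b (symmetric lookup)."""
    return BLOSUM50_SUBSET["".join(sorted(a + b))]


def smith_waterman(x, y, d=8):
    """Best local alignment with linear gap penalty d; returns (score, row_x, row_y)."""
    n, m = len(x), len(y)
    F = [[0] * (m + 1) for _ in range(n + 1)]   # top row and left column stay 0
    best, best_pos = 0, (0, 0)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            F[i][j] = max(0,
                          F[i - 1][j - 1] + s(x[i - 1], y[j - 1]),
                          F[i - 1][j] - d,
                          F[i][j - 1] - d)
            if F[i][j] > best:                  # remember the highest cell
                best, best_pos = F[i][j], (i, j)
    # Traceback from the highest cell until a cell with value 0 is reached
    row_x, row_y = [], []
    i, j = best_pos
    while F[i][j] > 0:
        if F[i][j] == F[i - 1][j - 1] + s(x[i - 1], y[j - 1]):
            row_x.append(x[i - 1])
            row_y.append(y[j - 1])
            i, j = i - 1, j - 1
        elif F[i][j] == F[i - 1][j] - d:
            row_x.append(x[i - 1])              # x_i aligned to a gap
            row_y.append("-")
            i -= 1
        else:
            row_x.append("-")                   # y_j aligned to a gap
            row_y.append(y[j - 1])
            j -= 1
    return best, "".join(reversed(row_x)), "".join(reversed(row_y))


score, row_x, row_y = smith_waterman("HEAGAWGHEE", "PAWHEAE")
print(score)
print(row_x)
print(row_y)
```

Because negative running totals are reset to 0, the poorly matching ends of the sequences never drag down the score of the conserved core.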

Extended Smith & Waterman
To get multiple local alignments: delete regions around best path repeat backtracking

[Matrix figure: the Smith-Waterman matrix of the example sequences after the region around the best path has been deleted; the cells on the next best path are 10, 16 and 21.]
Second best local alignment:
HEA
HEA

Further Extensions of Dynamic Programming
Overlap matches. Alignment with affine gap scores. The dynamic programming algorithms can be extended to deal with overlap matches, e.g. when comparing genomic DNA fragments to one another, and they can include affine gap penalties. Basically these are variations on the same theme. Anyone who wants to know more about them can dive into the literature.

Pairwise sequence comparison global / local alignments parameters scoring systems insertions / deletions Methods of pairwise sequence alignment dotplot windows-based methods dynamic programming algorithm complexity

Methods of Pairwise Comparison
Progressive alignment, step 1: methods of pairwise comparison. Programs perform global alignments: Needleman & Wunsch (Pileup, Tree, Clustal); word size method (Clustal); X. Huang, modified Needleman & Wunsch (MAlign).

Construction of a Guide Tree
Progressive alignment, step 2: construction of a guide tree. [Figure: similarity matrix for sequences 1 to 5.] The similarity matrix displays the scores of all sequence pairs. The similarity matrix is transformed into a distance matrix.

Construction of a Guide Tree
Progressive alignment, step 2: construction of a guide tree. [Figure: the distance matrix is converted into a guide tree with leaves 1, 5, 2, 3, 4.] The guide tree is built from the distance matrix with the Neighbour-Joining method or with UPGMA (unweighted pair group method with arithmetic averages).
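UPGMA, one of the two tree-building methods named here, repeatedly joins the two closest clusters and averages their distances to everything else. The sketch below is an illustration, not the implementation of any of the programs mentioned: the function name, the nested-tuple tree representation and the four-sequence distance matrix are all hypothetical.

```python
from itertools import combinations


def upgma(names, distance):
    """Build a guide tree as nested tuples from pairwise distances.

    distance maps frozenset({a, b}) to the distance between a and b
    for every pair of names.
    """
    d = dict(distance)
    size = {name: 1 for name in names}   # number of leaves in each cluster
    clusters = list(names)
    while len(clusters) > 1:
        # Join the two closest clusters
        a, b = min(combinations(clusters, 2), key=lambda pair: d[frozenset(pair)])
        merged = (a, b)
        for c in clusters:
            if c in (a, b):
                continue
            # Size-weighted average of the distances to the two merged clusters
            d[frozenset((merged, c))] = (
                size[a] * d[frozenset((a, c))] + size[b] * d[frozenset((b, c))]
            ) / (size[a] + size[b])
        size[merged] = size[a] + size[b]
        clusters = [c for c in clusters if c not in (a, b)] + [merged]
    return clusters[0]


# Hypothetical distance matrix for four sequences
names = ["1", "2", "3", "4"]
distance = {
    frozenset(("1", "2")): 2, frozenset(("1", "3")): 6, frozenset(("1", "4")): 10,
    frozenset(("2", "3")): 6, frozenset(("2", "4")): 10, frozenset(("3", "4")): 10,
}
tree = upgma(names, distance)
print(tree)
```

The nesting of the tuples gives the order in which a progressive aligner would combine the sequences: the most similar pair first, the most distant sequence last.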

Progressive alignment, step 3: multiple alignment. [Figure: the sequences are aligned in the order given by the guide tree with leaves 1, 5, 2, 3, 4.]

Columns - once aligned - are never changed
Progressive alignment, step 3: multiple alignment. Columns, once aligned, are never changed, and new gaps are inserted.
Pairwise alignment:
GTCCG-CAGG
TT-CGCC-GG
After a third sequence is added:
GTCCG--CAGG
TT-CGC-C-GG
TTACTTCCAGG
Second pairwise alignment:
ATCT--CAAT
CTGTCCCTAG
After the two groups are merged:
GTCCG--CAGG
TT-CGC-C-GG
TTACTTCCAGG
ATC-T--CAAT
CTG-TCCCTAG

Sub-sequence alignments

A K-means like clustering problem

Clustering resulting model

Clustering predictions

Assignments
- Describe a pairwise alignment with a different gap penalization.
- Provide an example and perform a multiple global alignment. Describe the recipe.
- Provide an example and perform a multiple alignment of subsequences. Describe the recipe.
Algorithms: order (polynomial, exponential, NP)

Algorithmic Complexity
It is useful to know how an algorithm's performance in CPU time and required memory storage scales with the size of the problem. The Needleman and Wunsch algorithm stores (n+1) x (m+1) numbers. Each number costs a constant number of calculations to compute (three sums and a max). The algorithm therefore takes O(nm) memory and O(nm) time; since n and m are usually comparable, this is O(n^2). This is called the big O notation: the algorithm is of the order nm. With biological sequences and standard computers, O(n^2) algorithms are feasible but a little slow, while O(n^3) algorithms are only feasible for very short sequences.

Thank you for your attention…
