Sequence comparison (Comparación de secuencias)


Objective
• Exploit functional and/or structural information by identifying homology between sequences.
• Distinguish homology from identity.
• Two sequences are considered homologous when they share the same evolutionary origin; as a rule, they also have similar function and structure.

• Homologous sequences: sequences that share a common evolutionary ancestry
• Similar sequences: sequences that have a high percentage of aligned residues with similar physicochemical properties (e.g., size, hydrophobicity, charge)
IMPORTANT:
• Sequence homology:
  • An inference about a common ancestral relationship, drawn when two sequences share a high enough degree of sequence similarity
  • Homology is qualitative
• Sequence similarity:
  • The direct result of observation from a sequence alignment
  • Similarity is quantitative; it can be described using percentages

Exercise

How many proteins of 50 amino acids are possible? Our proteins are a minority.
Two example sequences of 50 residues:
MALRTGGPAL VVLLAFWVAL GPCHLQGTDP GASADAEGPQ CPVACTCSHD
MRCAPTAGAA LVLCAATAGL LSAQGRPAQP EPPRFASWDE MNLLAHGLLQ
• Possible proteins of 50 amino acids: 20^50, about 1.1 × 10^65 (20 choices at each of 50 positions)
• Distinct proteins that exist in nature: about 200,000
• Percentage of real over possible: about 2 × 10^-58 % (in other words, practically nothing)
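A rough check of this count can be sketched as follows. It assumes only the 20 standard amino acids and independent positions, so a chain of length 50 has 20^50 possible sequences. The figure of 200,000 natural proteins is the one quoted on the slide; the variable names are illustrative.

```python
# Number of possible sequences of a given length over the 20 standard amino acids.
ALPHABET_SIZE = 20          # assumption: standard amino acids only
LENGTH = 50
KNOWN_PROTEINS = 200_000    # figure quoted on the slide

possible = ALPHABET_SIZE ** LENGTH      # exact integer arithmetic
fraction = KNOWN_PROTEINS / possible    # share of sequence space actually used

print(f"possible sequences: {possible:.3e}")
print(f"real / possible:    {fraction * 100:.1e} %")
```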

More definitions
• Orthologs: sequences that correspond to exactly the same function/structure in different organisms.
• Paralogs: sequences produced by duplications within the same organism. They normally imply changes of function.

Orthologs and paralogs in the β-globin locus

Homology and prediction
• Very divergent protein sequences may support similar structures
• Similar protein structures will probably have related or similar functions

3D structure versus sequence
Sequence alignment between human myoglobin and the α and β globins from hemoglobin

Comparison of 3D structures of human myoglobin and the α and β globins from hemoglobin
[Figure panels: myoglobin, α-globin, β-globin]

Homology and prediction
• Sequence comparison is the simplest method for detecting homology.
• Identity > 30% in protein implies homology.
• Identity > 80-90% is normal in orthologs from closely related species.
• Identity 10-30%: if homology exists, it cannot be detected from sequence alone ("twilight zone").

DNA or protein?
• Both provide information about homology.
• DNA: only identity between bases is relevant.
• Protein: there is functional equivalence between amino acids.

Canonical (Watson-Crick) base pairings
Only identity is relevant.

Mismatch costs are not usually used in aligning DNA or RNA sequences, because no substitution is "better" than any other (in general)

Degeneracy in the third position
[Table: the genetic code, indexed by position 1, position 2 (U, C, A, G) and position 3]
• Codons per amino acid: Trp, Met (1); Leu, Ser, Arg (6); the remaining amino acids (2 to 4)
• Initiation: AUG
• Stop: 3 codons
• Degeneracy in the third position: XYC = XYU; XYA ~ XYG
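The third-position rules can be checked against the standard genetic code. The sketch below builds the codon table from the usual compact 64-letter encoding (bases ordered T, C, A, G, with T standing in for U) and counts codons per amino acid; the variable names are illustrative.

```python
from collections import Counter
from itertools import product

BASES = "TCAG"  # DNA alphabet; T stands in for U
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ..., GGG
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"

CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

# How many codons encode each amino acid ('*' marks a stop codon).
codons_per_aa = Counter(CODON_TABLE.values())

# First two positions (XY) for which the third-position rule fails.
prefixes = ["".join(xy) for xy in product(BASES, repeat=2)]
pyrimidine_exceptions = [
    xy for xy in prefixes if CODON_TABLE[xy + "C"] != CODON_TABLE[xy + "T"]
]
purine_exceptions = [
    xy for xy in prefixes if CODON_TABLE[xy + "A"] != CODON_TABLE[xy + "G"]
]

print(dict(codons_per_aa))
print("XYC != XYU for:", pyrimidine_exceptions)
print("XYA != XYG for:", purine_exceptions)
```

XYC and XYU always encode the same amino acid, while XYA and XYG differ only for the AU and UG prefixes (Ile/Met and Stop/Trp), which is why the slide writes "=" for the first rule and "~" for the second.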

"Equivalent" amino acids
• Hydrophobic: Ala (A), Val (V), Met (M), Leu (L), Ile (I), Phe (F), Trp (W), Tyr (Y)
• Small: Gly (G), Ala (A), Ser (S)
• Polar: Ser (S), Thr (T), Asn (N), Gln (Q), Tyr (Y)
• Charged: Asp (D), Glu (E) / Lys (K), Arg (R)
• Hard to substitute: Gly (G), Pro (P), Cys (C), His (H)
On the protein surface, polar and charged residues are equivalent.

3D visualization of some conserved residues in the globin family (myoglobin structure)
• Proline in a turn
• Histidine for the heme coordination bonds
• 2 conserved glycines in 2 separate helices that cross each other

The DNA sequence diverges faster
• Mutation or recombination alters the DNA, but function/structure must be preserved.
• Comparing proteins makes it possible to locate more distant homologies.

Sequence alignment
Measuring the similarity between sequences, from which homology is inferred, requires an "alignment".
High similarity:
AWTRRATVHDGLMEDEFAA
AWTRRATVHDGLCEDEFAA
Low similarity:
AWTKLATAVVVFEGLCEDEWGG
AWTRRAT---VHDGLMEDEFAA
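As a small illustration of how an alignment is turned into a number, the sketch below computes percent identity for the two alignments on this slide. It assumes identity is counted over all alignment columns, gaps included; other conventions (for example, dividing by the number of ungapped columns) give different percentages. The function name is hypothetical.

```python
def identity(row_a: str, row_b: str) -> tuple[int, float]:
    """Return (identical columns, percent identity) for two aligned rows.

    Assumption: the percentage is taken over the full alignment length,
    so gap columns count as non-identical.
    """
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have the same length")
    matches = sum(1 for a, b in zip(row_a, row_b) if a == b and a != "-")
    return matches, 100.0 * matches / len(row_a)


high = identity("AWTRRATVHDGLMEDEFAA", "AWTRRATVHDGLCEDEFAA")
low = identity("AWTKLATAVVVFEGLCEDEWGG", "AWTRRAT---VHDGLMEDEFAA")
print(f"high: {high[0]} identical columns, {high[1]:.1f}%")
print(f"low:  {low[0]} identical columns, {low[1]:.1f}%")
```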

Types of alignment
• Pairwise: two sequences
• Multiple: more than two sequences
• Global: the whole sequence is considered
• Local: only similar regions are aligned

Strategies (they depend on the objective)
• Sequence comparison
  • Objective: measure homology, identify equivalent amino acids
  • Global, pairwise/multiple
• Database search
  • Objective: identify homologs in a large set of sequences
  • Local, pairwise

Manual protein alignment
• Requires expertise
  • Knowing the properties of the amino acids
  • Knowing the protein
• Allows additional information to be incorporated
  • Functional amino acids
  • Amino acids needed to maintain the structure
  • …
• It is slow and poorly reproducible

Automatic alignment (an optimization problem)
• Requires
  • an objective method for comparing amino acids or bases in order to "score" the alignment (comparison matrices)
  • an algorithm to find the alignment with the maximum score
• It is reproducible and fast
• In general, it does not allow additional information to be introduced
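The slides do not name a specific algorithm here. One standard choice for finding the maximum-scoring global alignment is Needleman-Wunsch dynamic programming, sketched below. The scoring values (match +1, mismatch -1, gap -2) are illustrative assumptions that stand in for a real comparison matrix, and the function only returns the optimal score, not the alignment itself.

```python
def global_score(a: str, b: str, match: int = 1, mismatch: int = -1, gap: int = -2) -> int:
    """Optimal global alignment score (Needleman-Wunsch) with a linear gap cost.

    Only two rows of the dynamic-programming table are kept, since each
    cell depends on its left, upper and upper-left neighbours.
    """
    prev = [j * gap for j in range(len(b) + 1)]  # aligning "" against b[:j]
    for i, ca in enumerate(a, start=1):
        curr = [i * gap]                         # aligning a[:i] against ""
        for j, cb in enumerate(b, start=1):
            diagonal = prev[j - 1] + (match if ca == cb else mismatch)
            curr.append(max(diagonal, prev[j] + gap, curr[j - 1] + gap))
        prev = curr
    return prev[-1]


# Usage with the two protein fragments shown earlier in the slides.
print(global_score("AWTKLATAVVVFEGLCEDEWGG", "AWTRRATVHDGLMEDEFAA"))
```

Because the same scoring scheme and the same algorithm always produce the same score, the result is reproducible, which is the point the slide makes against manual alignment.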

Types of matrices
• Identity
• Physicochemical properties
• Genetic (codon substitution)
• Evolutionary

PAM
• Applying the PAM matrix successively makes it possible to simulate several generations: PAM 40, PAM 100, PAM 250
• The evolutionary distance considered is constant
• The bigger the number, the bigger the divergence and the lower the stringency
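The idea of successive application can be sketched with repeated matrix multiplication. The example below is a toy model, not the real PAM data: it assumes a hypothetical two-letter alphabet in which each letter is kept with probability 0.99 per step (mirroring the 1% accepted change of PAM 1) and raises that one-step matrix to the 40th, 100th and 250th power.

```python
def mat_mul(x, y):
    """Multiply two square matrices given as lists of rows."""
    size = len(x)
    return [
        [sum(x[i][k] * y[k][j] for k in range(size)) for j in range(size)]
        for i in range(size)
    ]


def mat_pow(m, n):
    """Apply the one-step mutation matrix n times in succession."""
    size = len(m)
    result = [[1.0 if i == j else 0.0 for j in range(size)] for i in range(size)]
    for _ in range(n):
        result = mat_mul(result, m)
    return result


# Hypothetical one-step matrix over a toy 2-letter alphabet (not real PAM values):
# entry [i][j] is the probability that letter i is replaced by letter j.
STEP = [[0.99, 0.01],
        [0.01, 0.99]]

conserved = {n: mat_pow(STEP, n)[0][0] for n in (40, 100, 250)}
for n, p in conserved.items():
    print(f"toy PAM {n}: probability a residue is unchanged = {p:.3f}")
```

The probability of finding the original letter drops as the exponent grows, which matches the slide's reading that a bigger PAM number models bigger divergence.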

BLOSUM
• The evolutionary distances considered are variable
• More modern than PAM, but gives similar results
• The smaller n is, the bigger the divergence and the lower the stringency

BLOSUM62
• Small positive score for changes between similar amino acids
• Common amino acids have a low score; infrequent amino acids have a high score
• High penalty for changes between very different amino acids

Which matrix to use?
• There is no clear answer
• All matrices evaluate the functional equivalence between amino acids in the light of evolution and conservation

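Choice of a matrix
[Figure: matrices ordered from closely related sequences (e.g. rat versus mouse protein) to divergent sequences (e.g. rat versus bacterial protein)]
• BLOSUM90 / PAM30
• BLOSUM80 / PAM120
• BLOSUM62 / PAM180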

Query length, substitution matrix and gap costs
• <35: PAM-30, gap costs (9,1)
• 35-50: gap costs (10,1)
• 50-85: BLOSUM-80
• >85: BLOSUM-62
PAM: Point Accepted Mutation

Gaps (insertions/deletions)
Normally located in loops
AWTKLATAVVVFEGLCEDEWGG
AWTRRAT---VHDGLMEDEFAA

Gaps (insertions/deletions)
Scoring schemes:
• Dependent on secondary structure
• Constant value
• Linear function: g_o + n · g_l (an opening cost g_o plus a per-residue cost g_l for a gap of length n)
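The linear scheme can be sketched as a small function applied to each run of gap characters in an aligned row. The cost values used in the example (opening 10, per-residue 1) are illustrative assumptions, and the function names are hypothetical.

```python
import re


def gap_penalty(n: int, gap_open: int, gap_extend: int) -> int:
    """Cost of a single gap of length n under the linear scheme g_o + n * g_l."""
    if n <= 0:
        return 0
    return gap_open + n * gap_extend


def gap_lengths(aligned_row: str) -> list[int]:
    """Lengths of the consecutive runs of '-' in one row of an alignment."""
    return [len(run) for run in re.findall(r"-+", aligned_row)]


row = "AWTRRAT---VHDGLMEDEFAA"
lengths = gap_lengths(row)
total = sum(gap_penalty(n, gap_open=10, gap_extend=1) for n in lengths)
print(lengths, total)
```

Charging an opening cost once per gap makes one long gap cheaper than several short ones, which fits the observation that insertions and deletions tend to cluster in loops.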

Global versus local alignment
• Global alignment
  • Finds the best possible alignment across the entire length of 2 sequences
  • Aligned sequences are assumed to be generally similar over their entire length
• Local alignment
  • Finds local regions with the highest similarity between 2 sequences
  • Aligns these without regard for the rest of the sequence
  • Sequences are not assumed to be similar over their entire length

Global or local?
1. Searching for conserved motifs in DNA or protein sequences?
2. Aligning two closely related sequences with similar lengths?
3. Aligning highly divergent sequences?
4. Generating an extended alignment of closely related sequences?
5. Generating an extended alignment of closely related sequences with very different lengths?

Local vs. global alignment (cont'd)
Local alignment: the better alignment for finding a conserved segment.
Global alignment:
--T--CC-C-AGT--TATGT-CAGGGGACACG--A-GCATGCAGA-GAC
AATTGCCGCC-GTCGT-T-TTCAG----CA-GTTATG--T-CAGAT--C
Local alignment:
                  tccCAGTTATGTCAGgggacacgagcatgcagagac
                     ||||||||||||
aattgccgccgtcgttttcagCAGTTATGTCAGatc
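A local alignment of this kind can be computed with Smith-Waterman dynamic programming, which the slides do not name explicitly. The sketch below assumes a deliberately strict, illustrative scoring scheme (match +1, mismatch -3, gap -5) so that the result favours exact conserved segments; the function name is hypothetical.

```python
def local_align(a: str, b: str, match: int = 1, mismatch: int = -3, gap: int = -5):
    """Best local alignment (Smith-Waterman): returns (score, aligned_a, aligned_b)."""
    a, b = a.upper(), b.upper()
    rows, cols = len(a) + 1, len(b) + 1
    h = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)
    for i in range(1, rows):
        for j in range(1, cols):
            s = match if a[i - 1] == b[j - 1] else mismatch
            # A cell never drops below 0: a local alignment may start anywhere.
            h[i][j] = max(0, h[i - 1][j - 1] + s, h[i - 1][j] + gap, h[i][j - 1] + gap)
            if h[i][j] > best:
                best, best_pos = h[i][j], (i, j)
    # Trace back from the best cell until the score returns to 0.
    i, j = best_pos
    out_a, out_b = [], []
    while i > 0 and j > 0 and h[i][j] > 0:
        s = match if a[i - 1] == b[j - 1] else mismatch
        if h[i][j] == h[i - 1][j - 1] + s:
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif h[i][j] == h[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-"); i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1]); j -= 1
    return best, "".join(reversed(out_a)), "".join(reversed(out_b))


# Usage with the two DNA sequences from the slide.
seq1 = "tccCAGTTATGTCAGgggacacgagcatgcagagac"
seq2 = "aattgccgccgtcgttttcagCAGTTATGTCAGatc"
score, seg1, seg2 = local_align(seq1, seq2)
print(score)
print(seg1)
print(seg2)
```

Unlike the global version, the table is clamped at zero and the traceback starts at the highest-scoring cell, so the flanking regions that do not match are simply left out of the result.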

Comparing sequences against databases
• Sequence database: AGLM...WTKR, TCGGLMN..HICG, WRKCPGL, ...
• Query (unknown) sequence: ATTVG...LMN
• Requires very fast comparison algorithms

Disadvantages of global alignment
• Slow
• Scores the whole sequence
• Does not recognize multidomain proteins
[Figure: global alignment server; domain labels A B C A C' B D]

Local alignment
• 10-100x faster
• Recognizes individual domains
• Does not necessarily provide the best alignment!
• BLAST, FASTA

BLAST: Basic Local Alignment Search Tool (NCBI)
The Basic Local Alignment Search Tool (BLAST) finds regions of local similarity between sequences. The program compares nucleotide or protein sequences to sequence databases and calculates the statistical significance of matches. BLAST can be used to infer functional and evolutionary relationships between sequences as well as help identify members of gene families.

Input formats

E parameter (Expect threshold)
The Expect value (E) is a parameter that describes the number of hits one can "expect" to see just by chance when searching a database of a particular size. It decreases exponentially with the Score (S) that is assigned to a match between two sequences. Essentially, the E value describes the random background noise that exists for matches between sequences. For example, an E value of 1 assigned to a hit can be interpreted as meaning that in a database of the current size one might expect to see 1 match with a similar score simply by chance. This means that the lower the E value, or the closer it is to 0, the more "significant" the match is.
However, keep in mind that matches to short sequences can be virtually identical and still have a relatively high E value. This is because the calculation of the E value also takes into account the length of the query sequence: shorter sequences have a high probability of occurring in the database purely by chance.

E value (Expect)
Warning: lowering the E threshold increases the number of false negatives.
Expect: This setting specifies the statistical significance threshold for reporting matches against database sequences. The default value (10) means that 10 such matches are expected to be found merely by chance, according to the stochastic model of Karlin and Altschul (1990). If the statistical significance ascribed to a match is greater than the EXPECT threshold, the match will not be reported. Lower EXPECT thresholds are more stringent, leading to fewer chance matches being reported.
$E = K \cdot m \cdot n \cdot e^{-\lambda S}$
• S: score
• K, λ: normalization factors
• m: number of letters in the query
• n: number of letters in the database
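The behaviour of the formula E = K·m·n·e^(-λS) can be sketched numerically. The values of K and λ below are hypothetical placeholders (in practice they depend on the substitution matrix and gap costs), as are the query and database sizes; the function name is illustrative.

```python
import math


def expect_value(score: float, query_len: int, db_len: int, k: float, lam: float) -> float:
    """E = K * m * n * exp(-lambda * S), as given on the slide."""
    return k * query_len * db_len * math.exp(-lam * score)


K, LAMBDA = 0.1, 0.3            # hypothetical normalization factors
QUERY_LEN, DB_LEN = 250, 10**8  # hypothetical sizes in letters

for s in (30, 60, 90, 120):
    print(f"S = {s:3d}  ->  E = {expect_value(s, QUERY_LEN, DB_LEN, K, LAMBDA):.2e}")
```

E falls exponentially as the score rises and grows linearly with both the query length and the database size, so the same score is less significant in a larger database.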

Statistics
• Reference index: E, the expected number of false positives
• Occasional searches: 0.01 to 0.001
• Massive searches (genome annotation): 10^-6

BLAST programs
• blastp: amino acid query sequence vs. protein sequence database
• blastn: nucleotide query sequence vs. nucleotide sequence database
• blastx: nucleotide query sequence translated in all reading frames vs. protein sequence database
• tblastn: protein query sequence vs. a nucleotide sequence database translated in all reading frames
• tblastx: six-frame translations of a nucleotide query vs. the six-frame translations of a nucleotide sequence database
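The phrase "translated in all reading frames" can be illustrated with a short six-frame translation. This is a minimal sketch of the idea only, not how BLAST itself is implemented: it assumes the standard genetic code, an input made only of A, C, G and T, and it drops incomplete trailing codons. The function names are hypothetical.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ..., GGG ('*' = stop)
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna: str) -> str:
    """Sequence of the opposite strand, read 5' to 3'."""
    return dna.translate(COMPLEMENT)[::-1]


def translate(dna: str) -> str:
    """Translate complete codons from the start of dna; a trailing partial codon is ignored."""
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, len(dna) - 2, 3))


def six_frames(dna: str) -> dict[str, str]:
    """Translations of the three forward (+1..+3) and three reverse (-1..-3) frames."""
    dna = dna.upper()
    frames = {}
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq[offset:])
    return frames


for name, protein in six_frames("ATGGCC").items():
    print(name, protein)
```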

Which program to use?
• Comparing at the protein level broadens the scope of the search (even when we are comparing DNA!)
  • blastn → blastx, tblastx
  • blastp → tblastn
• Reasons:
  • Degeneracy of the genetic code
  • Functional equivalence between amino acids

BLAST substitution matrices
A key element in evaluating the quality of a pairwise sequence alignment is the "substitution matrix", which assigns a score for aligning any possible pair of residues. The theory of amino acid substitution matrices is described in [1], and applied to DNA sequence comparison in [2]. In general, different substitution matrices are tailored to detecting similarities among sequences that are diverged by differing degrees [1-3]. A single matrix may nevertheless be reasonably efficient over a relatively broad range of evolutionary change [1-3]. Experimentation has shown that the BLOSUM-62 matrix [4] is among the best for detecting most weak protein similarities. For particularly long and weak alignments, the BLOSUM-45 matrix may prove superior.
A detailed statistical theory for gapped alignments has not been developed, and the best gap costs to use with a given substitution matrix are determined empirically. Short alignments need to be relatively strong (i.e. have a higher percentage of matching residues) to rise above background noise. Such short but strong alignments are more easily detected using a matrix with a higher "relative entropy" [1] than that of BLOSUM-62. In particular, short query sequences can only produce short alignments, and therefore database searches with short queries should use an appropriately tailored matrix. The BLOSUM series does not include any matrices with relative entropies suitable for the shortest queries, so the older PAM matrices [5,6] may be used instead.
For proteins, a provisional table of recommended substitution matrices and gap costs for various query lengths is: