Modeling and analysis of biological sequences

DNA modeling
Simplified view of DNA: sequences of characters from the finite alphabet {A, C, G, T}. These sequences are structured into coding regions (exons separated by introns, together with short start or termination fragments) and non-coding regions. 21/11/2018

Modeling the regions
Each region has distinct statistical properties. If we can capture those properties in a suitable model, we can develop tests to decide whether or not a region is coding. The modeling is based on training data sets whose structure is known.
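One simple way to turn the idea above into a test is a log-likelihood ratio between two composition models, one estimated from known coding sequences and one from known non-coding sequences. The sketch below assumes the simplest possible model (independent bases) and uses invented toy training sets; the function names and the pseudocount are illustrative choices, not part of the original slides.

```python
import math
from collections import Counter

ALPHABET = "ACGT"

def base_frequencies(sequences, pseudocount=1.0):
    """Estimate base probabilities from training sequences (independence model)."""
    counts = Counter()
    for seq in sequences:
        counts.update(seq)
    total = sum(counts[b] for b in ALPHABET) + pseudocount * len(ALPHABET)
    return {b: (counts[b] + pseudocount) / total for b in ALPHABET}

def log_likelihood_ratio(seq, coding_freqs, noncoding_freqs):
    """Sum of log(P_coding(b) / P_noncoding(b)) over the bases of seq."""
    return sum(math.log(coding_freqs[b] / noncoding_freqs[b]) for b in seq)

# Hypothetical toy training sets, invented for illustration only.
coding_train = ["ATGGCCGCGGGC", "ATGGGCGCCTGC"]
noncoding_train = ["ATTATATAATTA", "TTAAATATTTAT"]

coding_freqs = base_frequencies(coding_train)
noncoding_freqs = base_frequencies(noncoding_train)

# A positive score favors the coding model, a negative one the non-coding model.
score_gc_rich = log_likelihood_ratio("GCGGCC", coding_freqs, noncoding_freqs)
score_at_rich = log_likelihood_ratio("ATTATA", coding_freqs, noncoding_freqs)
```

In practice the decision threshold and the model class would be chosen from real training data; the pseudocount only prevents a zero probability when a base is absent from a training set.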

Model hierarchies
As usual, we start with simple models and move on to successively more complex ones: independence; Markov dependence of order 1 and of order > 1; non-homogeneous Markov chains; maximal dependence decomposition; …
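As an illustration of the step from independence to first-order Markov dependence, the following sketch estimates a transition matrix from adjacent base pairs and scores a sequence with it. It assumes a homogeneous chain, non-empty sequences over {A, C, G, T}, a uniform initial distribution, and invented training strings; all names are hypothetical.

```python
import math

ALPHABET = "ACGT"

def estimate_transitions(sequences, pseudocount=1.0):
    """Estimate P(next base | current base) from adjacent pairs in the training data."""
    counts = {a: {b: pseudocount for b in ALPHABET} for a in ALPHABET}
    for seq in sequences:
        for prev, nxt in zip(seq, seq[1:]):
            counts[prev][nxt] += 1
    return {
        a: {b: counts[a][b] / sum(counts[a].values()) for b in ALPHABET}
        for a in ALPHABET
    }

def markov_log_prob(seq, initial, transitions):
    """Log-probability of a non-empty seq under a homogeneous first-order Markov chain."""
    logp = math.log(initial[seq[0]])
    for prev, nxt in zip(seq, seq[1:]):
        logp += math.log(transitions[prev][nxt])
    return logp

# Hypothetical training sequences, invented for illustration only.
train = ["ACGCGCGT", "CGCGCGAA"]
transitions = estimate_transitions(train)
uniform_initial = {b: 0.25 for b in ALPHABET}

logp_cg = markov_log_prob("CGCG", uniform_initial, transitions)
logp_ca = markov_log_prob("CACA", uniform_initial, transitions)
```

A higher-order chain would condition on the previous two or more bases instead of one, at the cost of many more parameters to estimate.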

Models for signals
Along DNA sequences there are short signals that indicate, for example, where an exon or intron begins or ends, or where an RNA polymerase binds. These signals can be used to detect where a gene begins or ends.

Example data
To build models of the signals we need example data. The site is a database of Drosophila promoters.

Transcription is centrally involved in an array of biological processes, which include growth, development, and response to external stimuli. In eukaryotes, protein-coding genes are transcribed by the RNA polymerase II transcriptional machinery, which comprises RNA polymerase II and other factors that are required for basal and regulated transcription. Transcription by RNA polymerase II is directed by cis-acting DNA sequences that typically consist of a core promoter along with regulatory elements, such as enhancers, that contain binding sites for sequence-specific transcriptional activator and/or repressor proteins. Thus, the study of both the trans-acting protein factors and the cis-acting DNA elements is necessary to gain a better understanding of the fundamental mechanisms by which genes are transcribed (for recent reviews, see Björkland and Kim 1996; Burley and Roeder 1996; Orphanides et al; Roeder 1996; Verrijzer and Tjian 1996; Ptashne and Gann 1997; Sauer and Tjian 1997; Smale 1997; Tansey and Herr 1997).

The key DNA element that is essential for transcription by RNA polymerase II is the core promoter: the DNA sequences that encompass the transcription start site (within about -40 to +40 relative to the +1 start site) and are sufficient to direct the accurate initiation of transcription. Two important core promoter motifs are the TATA box and the initiator (Inr) (Fig. 1). The TATA box is an A/T-rich sequence that is located ~25-30 nucleotides upstream of the RNA start site of many, but not all, promoters. It is recognized by the TATA box-binding polypeptide (TBP), which is a component of the multisubunit TFIID complex. The Inr encompasses the RNA start site, and like the TATA box, it is also present in many, but not all, core promoters (Smale and Baltimore 1989; Smale 1994, 1997).
Inr elements have been characterized in various TATA-less and TATA-containing promoters, and the Inr consensus sequence is Py-Py-A+1-N-T/A-Py-Py (where A+1 is the transcription start site) in mammalian genes (Smale and Baltimore 1989; Bucher 1990; Javahery et al) and T-C-A+1-G/T-T-T/C in Drosophila genes (Hultmark et al; Purnell et al; Arkhipova 1995).

Figure 1. The TATA box, Inr, and DPE are core promoter elements. The consensus sequences and locations of the TATA box, Inr, and DPE motifs are indicated. The TATA box and DPE appear to be functionally redundant, and promoters generally do not contain both elements.

Many promoters contain functionally important sequences that are downstream of the transcription start site. Such downstream promoter sequences have been found in TATA-containing promoters (see, e.g., Lewis and Manley 1985; Nakatani et al; Lee et al; Emanuel and Gilmour 1993; Purnell and Gilmour 1993), as well as in TATA-less promoters (see, e.g., Biggin and Tjian 1988; Perkins et al; Soeller et al; Smale and Baltimore 1989; Jarrell and Meselson 1991; Contursi et al; Minchiotti et al). It appears that many of these downstream promoter sequences are involved in basal transcription, but it is also important to consider that some downstream promoter sequences might be binding sites for sequence-specific transcriptional activators.
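The two Inr consensus sequences quoted above can be written as regular expressions and used for a crude exact-match scan. The sketch below is only an illustration: the pattern strings are a direct transcription of the consensus (Py read as C or T, N as any base), while the function name, the offset convention, and the example sequence are assumptions made here.

```python
import re

# Consensus patterns transcribed from the text; Py = pyrimidine (C or T), N = any base.
INR_DROSOPHILA = "TCA[GT]T[TC]"              # T-C-A+1-G/T-T-T/C
INR_MAMMALIAN = "[CT][CT]A[ACGT][AT][CT][CT]"  # Py-Py-A+1-N-T/A-Py-Py

def find_inr_sites(seq, pattern, a_offset=2):
    """Return 0-based indices of the candidate +1 adenine for every match.

    A lookahead is used so that overlapping matches are also reported.
    In both patterns the +1 adenine is the third base, hence a_offset=2.
    """
    lookahead = re.compile("(?=" + pattern + ")")
    return [m.start() + a_offset for m in lookahead.finditer(seq.upper())]

# Hypothetical sequence, invented for illustration only.
example = "GGTCAGTTCCTCATTCAA"
drosophila_hits = find_inr_sites(example, INR_DROSOPHILA)
mammalian_hits = find_inr_sites(example, INR_MAMMALIAN)
```

An exact consensus match ignores how often each base actually occurs at each position, which is the motivation for probabilistic signal models.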

Basic statistical model
The basic statistical model for the members of a family of signals is Pr(sequence of n bases | Member) = f(sequence of n bases). The goal is to obtain a model for the members of the family, based on a training data set, that can be used to classify new sequences.

Weight matrices
A simple model would be that the nucleotide at each position within the signal is independent of the nucleotides at other positions. The model for the signal would be $f(b_1 \ldots b_n) = \prod_{k=1}^{n} f_k(b_k)$, where $f(\cdot)$ is the probability that a putative sequence, $b_1 \ldots b_n$, could be generated by the signal family; $b_k$ is the base at position $k$ in the sequence; and $f_k(b)$ is the probability of finding base $b$ at position $k$. The product ($\prod$) arises from the independence assumption.
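The product formula above can be sketched in a few lines: estimate f_k(b) as the (smoothed) frequency of base b at position k in a set of aligned training sites, then multiply the per-position probabilities. The training sites, the pseudocount, and the uniform background used for the log-odds score are assumptions made for illustration, not data from the slides.

```python
import math

ALPHABET = "ACGT"

def position_frequencies(aligned_sites, pseudocount=1.0):
    """Estimate f_k(b) for each position k from equal-length aligned training sites."""
    n = len(aligned_sites[0])
    if any(len(site) != n for site in aligned_sites):
        raise ValueError("all training sites must have the same length")
    freqs = []
    for k in range(n):
        counts = {b: pseudocount for b in ALPHABET}
        for site in aligned_sites:
            counts[site[k]] += 1
        total = sum(counts.values())
        freqs.append({b: counts[b] / total for b in ALPHABET})
    return freqs

def signal_probability(seq, freqs):
    """f(b_1 ... b_n) = product over k of f_k(b_k), under the independence assumption."""
    return math.prod(freqs[k][b] for k, b in enumerate(seq))

def log_odds_score(seq, freqs, background=0.25):
    """Sum of log(f_k(b_k) / background); an assumed uniform background of 0.25 per base."""
    return sum(math.log(freqs[k][b] / background) for k, b in enumerate(seq))

# Hypothetical aligned sites, invented for illustration only.
sites = ["TATAAA", "TATATA", "TATAAG", "TTTAAA"]
freqs = position_frequencies(sites)

p_signal_like = signal_probability("TATAAA", freqs)
p_unlike = signal_probability("GCGCGC", freqs)
score = log_odds_score("TATAAA", freqs)
```

The list `freqs` is the weight matrix itself, one column of four probabilities per position. Working with the log-odds sum rather than the raw product avoids numerical underflow for longer signals.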

NOTE
This topic is temporarily incomplete. It is based on chapter 5, sections 5.2 and 5.3, of the book Statistical Methods in Bioinformatics. Notes on this material can be found in the Basic Signal Analysis notes from Steve Kachman's course.

References
Chapter 5: The Analysis of One DNA Sequence
