Introduction to Bioinformatics. Roderic Guigó i Serra. Bioinformatics, UPF, Course 2012-2013.

1 Introduction to Bioinformatics Roderic Guigó i Serra roderic.guigo@crg.cat Bioinformatics, UPF, Course 2012-2013

2

3 Van Leeuwenhoek In 1676 his credibility was questioned when he sent the Royal Society a copy of his first observations of microscopic single celled organisms. Heretofore, the existence of single celled organisms was entirely unknown … The Royal Society arranged to send an English vicar, as well as a team of respected jurists and doctors to Delft, Holland to determine whether it was in fact Van Leeuwenhoek's ability to observe and reason clearly (wikipedia)

4

5

6

7

8

9

10 ACTCAGCCCCAGCGGAGGTGAAGGACGTCCTTCCCCAGGAGCCGGTGAGAAGCGCAGTCGGGGGCACGGGGATG AGCTCAGGGGCCTCTAGAAAGATGTAGCTGGGACCTCGGGAAGCCCTGGCCTCCAGGTAGTCTCAGGAGAGCTAC TCAGGGTCGGGCTTGGGGAGAGGAGGAGCGGGGGTGAGGCCAGCAGCAGGGGACTGGACCTGGGAAGGGCTGG GCAGCAGAGACGACCCGACCCGCTAGAAGGTGGGGTGGGGAGAGCATGTGGACTAGGAGCTAAGCCACAGCAGG ACCCCCACGAGTTGTCACTGTCATTTATCGAGCACCTACTGGGTGTCCCCAGTGTCCTCAGATCTCCATAACTGGGA AGCCAGGGGCAGCGACACGGTAGCTAGCCGTCGATTGGAGAACTTTAAAATGAGGACTGAATTAGCTCATAAATGG AAAACGGCGCTTAAATGTGAGGTTAGAGCTTAGAATGTGAAGGGAGAATGAGGAATGCGAGACTGGGACTGAGATG GAACCGGCGGTGGGGAGGGGGAGGGGGTGTGGAATTTGAACCCCGGGAGAGAAAGATGGAATTTTGGCTATGGA GGCCGACCTGGGGATGGGGAAATAAGAGAAGACCAGGAGGGAGTTAAATAGGGAATGGGTTGGGGGCGGCTTGGT AACTGTTTGTGCTGGGATTAGGCTGTTGCAGATAATGGAGCAAGGCTTGGAAGGCTAACCTGGGGTGGGGCCGGGT TGGGGTCGGGCTGGGGGCGGGAGGAGTCCTCACTGGCGGTTGATTGACAGTTTCTCCTTCCCCAGACTGGCCAATC ACAGGCAGGAAGATGAAGGTTCTGTGGGCTGCGTTGCTGGTCACATTCCTGGCAGGTATGGGGCGGGGCTTGCTCG GTTTTCCCCGCTTCTCCCCCTCTCATCCTCACCTCAACCTCCTGGCCCCATTCAAGCACACCCTGGGCCCCCTCTTC TTCTGCTGGTCTGTCCCCTGAGGGGAAAGCCCAGGTCTGAGGCTTCTATGCTGCTTTCTGGCTCAGAACAGCGATTT GACGCTCTGTGAGCCTCGGTTCCTCCCCCGCTTTTTTTTTTTCAGCCAGAGTCTCACTCTGTCGCCCAGGCTGGAGT GCAGTGGCGCAATCTCAGCTCACTGCAAGCTCCGCCTCCCGGGTTCACGCTATTCTCCCGCCTCAGCCTCCCGAGT AGCTGGGACTACAGGCGCCCGCCACCATGCCCGGCTAATTTTTTGTACTTTGAGTAGGGAAGGGGTTTCACTGTATT ATCCAGGATGGTCTCTATCTCCTGACCTCGTGATCTGCCCGCCTGGCCTCCCAAAGTGCTGGAATTACAGGCGTGAG CCTCCGCGCCCGGCCTCCCCATCCTTAATATAGGAGTTAGAAGTTTTTGTTTGTTTGTTTTGTTTTGTTTTTGTTTTGTT TTGAGATGAAGTCCCTCTGTCGCCCAGGCTGGAGTGCAGTGGCTCCCAGGCTGGAGTTCAGTGGCTGGATCTCGGC TCACTGCAAGCTCCGCCTCCCAGGTTCACGCCATTCTCCTGCCTCAGCCTCCGGAGTAGCTGGGACTACAGGAACA TGCCACCACACCCGACTAACTTTTTTTGTATTTTTAGTAGAGACGGGGTTTCACCATGTTGGCCAGGCTGGTCTGGAA CTCCTGACCTCAGGTGATCTGCCTGCTTCAACCTCCCAAAGTGCTGGGATTACAGACGTGGGCCACCGCGCCCGGC TGGGAGTTAAGAGGTTTCTAATGCATTGCATTAGAATACCAGACACGGGACAGCTGTGATCTTTATTCTCCATCACCC CACACAGCCCTGCCTGGGGCACACAAGGACACTCAATACACGCTTTTCGGGCGCGGTGGCTCAAGCTGTAATCCCA 
GCACTTTGGGAGGCTGAGGCGGGTGGTACATGAGGTCAGGAGATCGAGACCATCCTGGCTAACATGGTGAAACCC CGTCTCTACTAAAAATACAAAAAACTAGCCCGGGCGTGGTGGCGGGCGCCTGTAGTCCCAGCTACTCGGAGGCTGA GGCAGGAGAATGGCGTGAACCTGGGAGGCGGAGCTTGCAGTGAGCCGAGATCGCGCCACTGCACTCCAGCCTGG GTGACACAGCGCGAGACTCCGTCTCAAAAAAAAAAAAAAAAAAAAAAAAAAAAAATACACGCTTTTCCGCTAGGCA CGGTGGCTCACCCCTGTAATCCCAGCATTTTGGGAGGCCAAGGTGGGAGGATCACTTGAGCCCAGGAGTTCAACAC CAGACTCAGCAACATAGTGAGACTCTCTCTACTAAAAATACAAAAATTAGCCAGGCCTGGTGCCACACACCTGTGGT CCCAGCTACTCAGAAGGCTAAGGCAGGAGGATCGCTTAAGCCCAGAAGGTCAAGGTTGCAGTGAACCACGTTCAG GCCACTGCAGTCCAGCCTGGGTGACAGAGCAAGACCCTGTCTGTAAATAAATAACGCTTTTCAAGTGATTAAACAGA CTCCCCCCTCACCCTGCCCACCATGGCTCCAAAGCAGCATTTGTGGAGCACCTTCTGTGTGCCCCTAGGTACTAGCT GCCTGGACGGGGTCAGAAGGAACCTGAACCACCTTCAACTTGTTCCACACAGGATGCCAGGCCAAGGTGGAGCAA CCGGTGGAGCCAGAGACAGAACCCGACGTTCGCCAGCAGGCTGAGTGGCAGAGCGGCCAGCCCTGGGAGCTGG CACTGGGTCGCTTTTGGGATTACCTGCGCTGGGTGCAGACACTGTCTGAGCAGGTGCAGGAGGAGCTGCTCAGCC CCCAGGTCACCCAGGAACTGACGTGAGTGTCCCCATCCCGGCCCTTGACCCTCCTGGTGGGCGGCTATACCTCCCC AGGTCCAGGTTTCATTCTGCCCCTGCCACTAAGTCTTGGGGGCCTGGGTCTCTGCTGGTTCTAGCTTCCTCTTCCCAT TTCTGACTCCTGGCTTTAGCTCTCTGGAATTCTCTCTCTCAGTTCTGTTTCTCCCTCTTCCCTTCTGACTCAGCCTGTC ACACTCGTCCTGGCGCTGTCTCTGTCCTTCACTAGCTCTTTTATATAGAGACAGAGAGATGGGGTCTCACTGTGTTGC CCAGGCTGGTCTTGAACTTCTGGGCTCAAGCGATCCTCCCACCTCGCCTCCCAAAGTGCTGGGAATAGAGACATGA GCCACCTTGCTCGGCCTCCTAGCTCTTTCTTCGTCTCTGCCTCTGCTCTCTGCGTCTGTCTTTGTCTCCTCTCTGCCTC TGTCCCGTTCCTTCTCTCTTGGTTCACTGCCCTTCTGTCTCTCCCTGTTCTCCTTAGGAGACTCTCCTCTCTTCCTTCT CGAGTCTCTCTGGCTGATCCCCATCTCACCCACACCTATCC

11

12

13

14

15 Chromosomal matter is “an aperiodic crystal”, made up of a succession of a small number of isomeric elements*, whose specific sequence is responsible for its function. (*) “the number of atoms in such a structure need not be very large to produce an almost unlimited number of possible arrangements. For illustration, think of the Morse code…” 1943: Schrödinger, “What is life?”

16 ENIAC Late 40s: first digital computers

17

18 MALWTRLRPLLALLALWPPPPARAFVNQHLCGS HLVEALYLVCGERGFFYTPKARREVEGPQVGAL ELAGGPGAGGLEGPPQKRGIVEQCCASVCSLYQ LENYCN Amino acid sequence of bovine insulin

19 http://www.ict-science-to-society.org/ Early 60s: the genetic code

20 GAGTTTTATCGCTTCCATGACGCAGAAGTTAACACTTTCGGATATTTCTGATGAGT CGAAAAATTATCTTGATAAAGCAGGAATTACTACTGCTTGTTTACGAATTAAATCG AAGTGGACTGCTGGCGGAAAATGAGAAAATTCGACCTATCCTTGCGCAGCTCGA GAAGCTCTTACTTTGCGACCTTTCGCCATCAACTAACGATTCTGTCAAAAACTGA CGCGTTGGATGAGGAGAAGTGGCTTAATATGCTTGGCACGTTCGTCAAGGACTG GTTTAGATATGAGTCACATTTTGTTCATGGTAGAGATTCTCTTGT MALWTRLRPLLALLALWPPPPARAFVNQHLCGSHLVEALYLVCGERGFFY TPKARREVEGPQVGALELAGGPGAGGLEGPPQKRGIVEQCCASVCSLYQ LENYCN
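A minimal sketch of how the genetic code maps a DNA reading frame onto amino acids, assuming the standard codon table. The function name `translate` is hypothetical, and the first twelve bases of the DNA sequence above are used only as sample input, not as a claim about its real reading frame.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order (third base varies fastest).
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(dna):
    """Translate a DNA string codon by codon, starting at its first base.

    '*' marks a stop codon; trailing bases that do not fill a codon are ignored.
    """
    dna = dna.upper()
    usable = len(dna) - len(dna) % 3
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, usable, 3))


print(translate("GAGTTTTATCGC"))
```

A lookup table keyed by codon keeps the translation a single pass over the sequence.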

21 1957: invention of the programming language FORTRAN

22 60s: Transistors and integrated circuits. Computers become smaller and therefore faster and cheaper. During the 60s computers are introduced into banks, financial institutions, universities and research centers.

23

24

25 Sequence alignment and comparison

26 substitution matrices

27 Sequence alignment. The substitution matrices provided a model under which the concept of optimal alignment could be formalized and computed. The optimal alignment between two sequences is the alignment that maximizes the sum of the amino acid substitution values at each aligned position.
Alignment 1:
  A  R  N  D  C  Q
  S  K  -  E  A  E
 +1 +3 -1 +3 -2 +2 = 6
Alignment 2:
  A  R  N  D  C  Q
  -  S  K  E  A  E
 -1 +0 +1 +3 -2 +2 = 3
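The scoring rule on this slide can be sketched as follows. The pair values and the gap value of -1 are read off the slide's own arithmetic; the dictionary is a hypothetical excerpt, not a complete substitution matrix, and the function names are illustrative.

```python
# Pair scores taken from the slide's worked sums (excerpt only, an assumption).
SLIDE_SCORES = {
    ("A", "S"): 1, ("R", "K"): 3, ("D", "E"): 3, ("C", "A"): -2,
    ("Q", "E"): 2, ("R", "S"): 0, ("N", "K"): 1,
}
GAP = -1  # value the slide assigns to a residue aligned with '-'


def pair_score(x, y):
    """Score one aligned column; raises KeyError for pairs outside the excerpt."""
    if x == "-" or y == "-":
        return GAP
    if (x, y) in SLIDE_SCORES:
        return SLIDE_SCORES[(x, y)]
    return SLIDE_SCORES[(y, x)]


def alignment_score(top, bottom):
    """Sum the column scores of two equal-length gapped sequences."""
    if len(top) != len(bottom):
        raise ValueError("aligned sequences must have the same length")
    return sum(pair_score(x, y) for x, y in zip(top, bottom))


print(alignment_score("ARNDCQ", "SK-EAE"))
print(alignment_score("ARNDCQ", "-SKEAE"))
```

Under this model the first alignment is preferred because its column sum is higher.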

28 70s: Optimal sequence alignment. DYNAMIC PROGRAMMING: Needleman and Wunsch, 1970; Smith and Waterman, 1981. The total number of possible alignments between two sequences of length 100 is approximately 10^200. With DP the number of operations required to obtain the optimal alignment is approximately 3 x 100^2.
Query: 25  IPREVIERLARSQIHSIRDLQRLLEIDSVGSEDSLDTSLRAHGVHATKHVPEKRPLPIRR  84
           IP E+ + L+   I S  DLQRLL+ DS G ED  +  L     H+   +        R
Sbjct: 10  IPEELYKMLSGHSIRSFDDLQRLLQGDS-GKEDGAELDLNMTRSHSGGELESLA----RG  64
Query: 85  KRSI------EEAVPAVCKTRTVIYEIPRSQVDPTSANFLIWPPCVEVKRCTGCCNTSSV  138
           KRS+      E A+ A CKTRT ++EI R  +D T+ANFL+WPPCVEV+RC+GCCN  +V
Sbjct: 65  KRSLGSLSVAEPAMIAECKTRTEVFEISRRLIDRTNANFLVWPPCVEVQRCSGCCNNRNV  124
Query: 139 KCQPSRVHHRSVKVAKVEYVRKKPKLKEVQVRLEEHLECAC  179
           +C+P++V  R V+V K+E VRKKP  K+  V LE+HL C C
Sbjct: 125 QCRPTQVQLRPVQVRKIEIVRKKPIFKKATVTLEDHLACKC  165
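A minimal sketch of the dynamic programming idea behind global alignment in the style of Needleman and Wunsch. The match, mismatch and gap values are illustrative assumptions rather than values from the slides, and the function name is hypothetical. Each cell takes the best of three candidates, which is where an operation count of roughly 3 x n^2 for two sequences of length n comes from.

```python
def global_alignment_score(a, b, match=1, mismatch=-1, gap=-1):
    """Return the optimal global alignment score of strings a and b.

    dp[i][j] holds the best score for aligning a[:i] with b[:j].
    """
    rows, cols = len(a) + 1, len(b) + 1
    dp = [[0] * cols for _ in range(rows)]

    # Aligning a prefix against the empty string costs one gap per residue.
    for i in range(1, rows):
        dp[i][0] = i * gap
    for j in range(1, cols):
        dp[0][j] = j * gap

    for i in range(1, rows):
        for j in range(1, cols):
            substitution = match if a[i - 1] == b[j - 1] else mismatch
            dp[i][j] = max(
                dp[i - 1][j - 1] + substitution,  # align a[i-1] with b[j-1]
                dp[i - 1][j] + gap,               # a[i-1] against a gap
                dp[i][j - 1] + gap,               # b[j-1] against a gap
            )
    return dp[-1][-1]


print(global_alignment_score("ACGT", "ACGT"))
```

Replacing the match/mismatch rule with a lookup in a substitution matrix would give the protein version; this sketch returns only the score, not the alignment itself.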

29 Mid-70s: DNA sequencing. Sanger (Cambridge); Maxam and Gilbert (Harvard). By the end of the sixties, hundreds of proteins had been sequenced, but the sequencing of nucleic acids remained elusive.

30 70s: Internet. Advanced Research Projects Agency

31 gagttttatcgcttccatgacgcagaagttaacactttcggatatttctgatgagtcgaaaaattatcttgataaagcaggaattactactgcttgtttacgaattaaat cgaagtggactgctggcggaaaatgagaaaattcgacctatccttgcgcagctcgagaagctcttactttgcgacctttcgccatcaactaacgattctgtcaaaaactg acgcgttggatgaggagaagtggcttaatatgcttggcacgttcgtcaaggactggtttagatatgagtcacattttgttcatggtagagattctcttgttgacatttta aaagagcgtggattactatctgagtccgatgctgttcaaccactaataggtaagaaatcatgagtcaagttactgaacaatccgtacgtttccagaccgctttggcctct attaagctcattcaggcttctgccgttttggatttaaccgaagatgatttcgattttctgacgagtaacaaagtttggattgctactgaccgctctcgtgctcgtcgctg cgttgaggcttgcgtttatggtacgctggactttgtgggataccctcgctttcctgctcctgttgagtttattgctgccgtcattgcttattatgttcatcccgtcaaca ttcaaacggcctgtctcatcatggaaggcgctgaatttacggaaaacattattaatggcgtcgagcgtccggttaaagccgctgaattgttcgcgtttaccttgcgtgta cgcgcaggaaacactgacgttcttactgacgcagaagaaaacgtgcgtcaaaaattacgtgcggaaggagtgatgtaatgtctaaaggtaaaaaacgttctggcgctcgc cctggtcgtccgcagccgttgcgaggtactaaaggcaagcgtaaaggcgctcgtctttggtatgtaggtggtcaacaattttaattgcaggggcttcggccccttacttg aggataaattatgtctaatattcaaactggcgccgagcgtatgccgcatgacctttcccatcttggcttccttgctggtcagattggtcgtcttattaccatttcaacta ctccggttatcgctggcgactccttcgagatggacgccgttggcgctctccgtctttctccattgcgtcgtggccttgctattgactctactgtagacatttttactttt tatgtccctcatcgtcacgtttatggtgaacagtggattaagttcatgaaggatggtgttaatgccactcctctcccgactgttaacactactggttatattgaccatgc cgcttttcttggcacgattaaccctgataccaataaaatccctaagcatttgtttcagggttatttgaatatctataacaactattttaaagcgccgtggatgcctgacc gtaccgaggctaaccctaatgagcttaatcaagatgatgctcgttatggtttccgttgctgccatctcaaaaacatttggactgctccgcttcctcctgagactgagctt tctcgccaaatgacgacttctaccacatctattgacattatgggtctgcaagctgcttatgctaatttgcatactgaccaagaacgtgattacttcatgcagcgttacca tgatgttatttcttcatttggaggtaaaacctcttatgacgctgacaaccgtcctttacttgtcatgcgctctaatctctgggcatctggctatgatgttgatggaactg accaaacgtcgttaggccagttttctggtcgtgttcaacagacctataaacattctgtgccgcgtttctttgttcctgagcatggcactatgtttactcttgcgcttgtt 
cgttttccgcctactgcgactaaagagattcagtaccttaacgctaaaggtgctttgacttataccgatattgctggcgaccctgttttgtatggcaacttgccgccgcg tgaaatttctatgaaggatgttttccgttctggtgattcgtctaagaagtttaagattgctgagggtcagtggtatcgttatgcgccttcgtatgtttctcctgcttatc accttcttgaaggcttcccattcattcaggaaccgccttctggtgatttgcaagaacgcgtacttattcgccaccatgattatgaccagtgtttccagtccgttcagttg ttgcagtggaatagtcaggttaaatttaatgtgaccgtttatcgcaatctgccgaccactcgcgattcaatcatgacttcgtgataaaagattgagtgtgaggttataac gccgaagcggtaaaaattttaatttttgccgctgaggggttgaccaagcgaagcgcggtaggttttctgcttaggagtttaatcatgtttcagacttttatttctcgcca taattcaaactttttttctgataagctggttctcacttctgttactccagcttcttcggcacctgttttacagacacctaaagctacatcgtcaacgttatattttgata gtttgacggttaatgctggtaatggtggttttcttcattgcattcagatggatacatctgtcaacgccgctaatcaggttgtttctgttggtgctgatattgcttttgat gccgaccctaaattttttgcctgtttggttcgctttgagtcttcttcggttccgactaccctcccgactgcctatgatgtttatcctttgaatggtcgccatgatggtgg ttattataccgtcaaggactgtgtgactattgacgtccttccccgtacgccgggcaataacgtttatgttggtttcatggtttggtctaactttaccgctactaaatgcc gcggattggtttcgctgaatcaggttattaaagagattatttgtctccagccacttaagtgaggtgatttatgtttggtgctattgctggcggtattgcttctgctcttg ctggtggcgccatgtctaaattgtttggaggcggtcaaaaagccgcctccggtggcattcaaggtgatgtgcttgctaccgataacaatactgtaggcatgggtgatgct ggtattaaatctgccattcaaggctctaatgttcctaaccctgatgaggccgcccctagttttgtttctggtgctatggctaaagctggtaaaggacttcttgaaggtac gttgcaggctggcacttctgccgtttctgataagttgcttgatttggttggacttggtggcaagtctgccgctgataaaggaaaggatactcgtgattatcttgctgctg catttcctgagcttaatgcttgggagcgtgctggtgctgatgcttcctctgctggtatggttgacgccggatttgagaatcaaaaagagcttactaaaatgcaactggac aatcagaaagagattgccgagatgcaaaatgagactcaaaaagagattgctggcattcagtcggcgacttcacgccagaatacgaaagaccaggtatatgcacaaaatga gatgcttgcttatcaacagaaggagtctactgctcgcgttgcgtctattatggaaaacaccaatcttcccaagcaacagcaggtttccgagattatgcgccaaatgctta ctcaagctcaaacggctggtcagtattttaccaatgaccaaatcaaagaaatgactcgcaaggttagtgctgaggttgacttagttcatcagcaaacgcagaatcagcgg tatggctcttctcatattggcgctactgcaaaggatatttctaatgtcgtcactgatgctgcttctggtgtggttgatatttttcatggtattgataaagctgttgccga 
tacttggaacaatttctggaaagacggtaaagctgatggtattggctctaatttgtctaggaaataaccgtcaggattgacaccctcccaattgtatgttttcatgcctc caaatcttggaggcttttttatggttcgttcttattacccttctgaatgtcacgctgattattttgactttgag 1977: φX174 virus genome

32 1982: the first electronic databases

33 Accelerating database searches: hash methods. FASTA, 1982: Wilbur and Lipman; 1985: Lipman and Pearson. BLAST, 1990: Altschul, Gish, Miller, Myers and Lipman.
Query sequence:
Position  1  2  3  4  5  6  7  8  9 10 11 12 13
Residue   W  A  T  S  N  A  N  D  C  R  I  C  K
Hash table (K=1):
A: 2, 6
C: 9, 12
D: 8
I: 11
K: 13
N: 5, 7
R: 10
S: 4
T: 3
W: 1
http://www.ccl.rutgers.edu/~ouyang/5020/FASTA-BLAST.ppt
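The hash table on this slide can be rebuilt with a few lines of code. This is a sketch of the indexing step only, assuming 1-based positions as on the slide; the function names are hypothetical and the target string in the usage line is made up for illustration.

```python
from collections import defaultdict


def build_kmer_index(seq, k=1):
    """Map every k-letter word in seq to the 1-based positions where it starts."""
    index = defaultdict(list)
    for start in range(len(seq) - k + 1):
        index[seq[start:start + k]].append(start + 1)
    return dict(index)


def find_word_hits(index, target, k=1):
    """List (query_position, target_position) pairs that share a k-letter word."""
    hits = []
    for start in range(len(target) - k + 1):
        word = target[start:start + k]
        for query_pos in index.get(word, []):
            hits.append((query_pos, start + 1))
    return hits


index = build_kmer_index("WATSNANDCRICK", k=1)
print(index["A"], index["C"], index["N"])
print(find_word_hits(index, "CRICK", k=1))
```

Looking words up in the precomputed index avoids comparing every query position with every database position, which is the speed-up these heuristic methods rely on.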

34 Search of the Platelet Derived Growth Factor sequence 1982, Doolittle: relationship between oncogenes and growth factors

35

36 1990: The Human Genome Project. THE HUMAN GENOME PROGRAM (HGP) is producing large quantities of complex map and DNA sequence data. Informatics projects in algorithms, software, and databases are crucial in accumulating and interpreting these data in a robust and automated fashion at genome and sequencing centers. Computer systems play essential roles in all aspects of genome research, from data acquisition and analysis to data management. Without powerful computers and appropriately designed data-management systems, high-volume genome research cannot proceed.

37 1990: WWW at CERN. "This proposal concerns the management of general information about accelerators and experiments at CERN. It discusses the problems of loss of information about complex evolving systems and derives a solution based on a distributed hypertext system." (Tim Berners-Lee)

38 Human Genome Project Milestones

39 2001: the culmination of the project

40

45 bioinformatics: Medline articles with keyword Bioinformatics.
year        # articles
To 1990     0
1990-1994   15
1995-1999   823
2000-2004   7827
2005-2008   18822

46 Bioinformatics, Genomics, Systems Biology in Medline

47 "What's past is prologue." W. Shakespeare, The Tempest

48 Mid-70s: DNA sequencing. Sanger (Cambridge); Maxam and Gilbert (Harvard). By the end of the sixties, hundreds of proteins had been sequenced, but the sequencing of nucleic acids remained elusive.

49 ABI PRISM 3700 DNA Analyzer

50 Further Evolution of Large-scale Genome Sequencing. Data unit of approximately 10x coverage of human:
2000: Human genome working drafts; took 10 years and cost about $3 billion
2008: Major genome centers can sequence the same number of base pairs every 4 days. 1000 Genomes Project launched. World-wide capacity dramatically increasing
2009: Every 4 hours ($25,000)
2010: Every 14 minutes ($5,000). Illumina HiSeq2000 machine produces 200 gigabases per 8 day run (BGI have ordered 128)
Slide from Paul Flicek. EBI Bioinformatics Advisory Council
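To put the figures on this slide side by side, the following sketch computes how much faster and cheaper one data unit became between 2000 and 2010. It uses only the numbers quoted on the slide and assumes a 365-day year; the variable names are illustrative.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # assumption: 365-day year

# Time and cost for one data unit (about 10x coverage of a human genome), per the slide.
minutes_2000 = 10 * MINUTES_PER_YEAR  # "10 years"
cost_2000 = 3_000_000_000             # "about $3 billion"
minutes_2010 = 14                     # "every 14 minutes"
cost_2010 = 5_000                     # "$5,000"

speedup = minutes_2000 / minutes_2010
cost_reduction = cost_2000 / cost_2010

print(f"time per data unit fell by a factor of about {speedup:,.0f}")
print(f"cost per data unit fell by a factor of about {cost_reduction:,.0f}")
```

Both ratios come out in the hundreds of thousands, so time and cost fell by a similar order of magnitude over the decade.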

51

53 ENIAC, 1950s: 2.4 x 0.9 x 30 (m) -> 385 operations/second, 10^-6 operations/second/cm^3. MAC AIR, 2010s: ~1 x 32.5 x 22.7 (cm) -> 133,656,056 operations/second, 10^5 operations/second/cm^3
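The orders of magnitude on this slide can be checked with a short calculation. This is a sketch that takes the dimensions and operation rates exactly as quoted and treats each machine as a rectangular box, which is an assumption; the function name is hypothetical.

```python
def ops_per_second_per_cm3(ops_per_second, dims_cm):
    """Compute density: operations per second divided by bounding-box volume in cm^3."""
    width, depth, height = dims_cm
    return ops_per_second / (width * depth * height)


# ENIAC: 2.4 x 0.9 x 30 m, converted to centimetres.
eniac = ops_per_second_per_cm3(385, (240, 90, 3000))
# Mac Air: about 1 x 32.5 x 22.7 cm.
mac_air = ops_per_second_per_cm3(133_656_056, (1, 32.5, 22.7))

print(f"ENIAC:   {eniac:.1e} operations/second/cm^3")
print(f"Mac Air: {mac_air:.1e} operations/second/cm^3")
print(f"ratio:   {mac_air / eniac:.1e}")
```

The exact values land within one order of magnitude of the rounded powers of ten shown on the slide.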

55 CELERA GENOMICS, year 2000: 1,000 m^2, 2 yr -> 3 Gb at 10x, 5x10^-6 Gb/day/m^3. HISEQ 2500, year 2012: 119 x 94 x 76 (cm), 1 day -> 120 Gb, 10^2 Gb/day/m^3
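The same kind of check works for sequencing density. The sketch below recomputes only the HiSeq 2500 figure, because the slide gives a floor area but no height for the Celera facility; the Celera value is therefore taken as quoted rather than recomputed. The function name is hypothetical.

```python
def gigabases_per_day_per_m3(gigabases, days, dims_m):
    """Sequencing density: output per day divided by bounding-box volume in m^3."""
    width, depth, height = dims_m
    return gigabases / days / (width * depth * height)


# HiSeq 2500: 119 x 94 x 76 cm, 120 Gb in 1 day (figures from the slide).
hiseq = gigabases_per_day_per_m3(120, 1, (1.19, 0.94, 0.76))

# Celera value as quoted on the slide (not recomputed: no facility height is given).
CELERA_QUOTED = 5e-6

print(f"HiSeq 2500: {hiseq:.0f} Gb/day/m^3")
print(f"ratio to quoted Celera figure: {hiseq / CELERA_QUOTED:.1e}")
```

The computed HiSeq density is on the order of 10^2 Gb/day/m^3, in line with the slide.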

56 Moore’s Law

57

58

