An Introduction to Pathway Bioinformatics.ppt

An Introduction to Pathway Bioinformatics

Yuanhua Tom Tang, Ph.D.
Bioinformatics R & D
Hyseq Pharmaceuticals, Inc.
Sunnyvale, CA, USA

Singapore National University
January, 2002

Definition of Bioinformatics

Theoretical: The essence of life is information. Bioinformatics is the study of the information content of life.
Practical: The essential tool is the computer. Bioinformatics is computer-based information abstraction and processing of biological knowledge.

Pathways

A schematic diagram of a protein-protein or protein-molecule interaction pathway. A circle indicates a protein or a non-protein biomolecule. An arrow indicates the direction of protein-protein interaction or protein-molecule interaction.

Pathway Database - Increasing Level of Complexity

The genome: 4 bases; 3 billion bp total; 3 billion bp/cell, identical
The proteome: 20 amino acids; 60K genes, 200K proteins; 10K proteins/cell; different cells/conditions, different expressions
The pathome: 200K reactions; 20K pathways; 1K pathways/cell; different cells/conditions, different expressions

The Need for Pathway Informatics

A good angle for data integration and representation.
A research tool for scientists.
A learning tool for students.
Pharmaceutical drug discovery efforts would benefit from comprehensive pathway databases and tools.
A challenge for the post-genomic era.

List of Pathway Databases/Tools

Name: KEGG (Kyoto Encyclopedia of Genes and Genomes)
Web: http://www.genome.ad.jp/kegg/
Owner: Institute for Chemical Research, Kyoto University
Description: KEGG is an effort to computerize current knowledge of molecular and cellular biology in terms of the information pathways that consist of interacting molecules or genes and to provide links from the gene catalogs produced by genome sequencing projects. The KEGG project is undertaken in the
Bioinformatics Center, Institute for Chemical Research, Kyoto Univ.

Name: PathDB
Web: http://www.ncgr.org/pathdb/index.html
Owner: National Center for Genomic Resources
Description: PathDB is a functional prototype research tool for biochemistry and functional genomics. One of the key underlying philosophies of the project is to capture discrete metabolic steps. This allows its developers to build tools that construct metabolic networks de novo from a set of defined steps. PathDB is not simply a data repository but a system around which tools can be created for building, visualizing, and comparing metabolic networks.

List of Pathway Database/Tools (cont.)

Name: GenMAPP (Gene MicroArray Pathway Profiler)
Owner: Gladstone Institute, UCSF
Description: GenMAPP is a computer application designed to visualize gene expression data on maps representing biological pathways and groupings of genes. The first release, GenMAPP 1.0 beta, is available with over 50 mouse and human pathways. Its developers also provide hundreds of functional groupings of genes derived from the Gene Ontology Project for the human, mouse, Drosophila, C. elegans, and yeast genomes. GenMAPP seeks collaborators in the biological community to assist in the development of a library of pathways that will encompass all known genes in the major model organisms.

Name: SPAD (Signaling Pathway Database)
Owner: Graduate School of Genetic Resources Technology, Kyushu University
Description: There are multiple signal transduction pathways: cascades of information from the plasma membrane to
the nucleus in response to an extracellular stimulus in living organisms. An extracellular signal molecule binds a specific intracellular receptor and initiates the signaling pathway. A large amount of information is now available about the signaling pathways that control gene expression and cellular proliferation. The developers built SPAD as an integrated database to give an overview of signal transduction. SPAD is divided into four categories based on the extracellular signal molecules (Growth factor, Cytokine, and Hormone) and stress that initiate the intracellular signaling pathway. SPAD is compiled to describe protein-protein and protein-DNA interactions as well as DNA and protein sequences.

Specific Pathway Databases

Cytokine Signaling Pathway DB
Dept. of Biochemistry, Kumamoto Univ.
The database contains information on signaling pathways of
cytokines. It is designed for researchers who work with cytokines and their receptors, and provides biochemical data and references about signaling molecules as well as ligand-receptor relationships.

EcoCyc and MetaCyc
Stanford Research Institute
The EcoCyc database describes the genome and the biochemical machinery of E. coli. The database contains up-to-date annotations of all E. coli genes. EcoCyc describes all known pathways of E. coli small-molecule metabolism. Each pathway and its component reactions and enzymes are annotated in rich detail, with extensive references to the biomedical literature. The Pathway Tools software provides query and visualization services.

BIND (Biomolecular Interaction Network Database)
UBC, Univ. of Toronto
BIND is a database designed to store full descriptions of interactions, molecular complexes and pathways, including interactions between any two molecules composed of proteins, nucleic acids and small molecules. Chemical reactions, photochemical activation and conformational changes can also be described. Abstraction is made in such a way that graph theory methods may be applied for data mining. The database can be used to study networks of interactions, to map pathways across taxonomic branches and to generate information for kinetic simulations.

Industrial Companies in Path Informatics

Protein Pathways, Los Angeles, USA
Genmetrics, Inc., Silicon Valley, USA
Biobase, Braunschweig, Germany
InforMax, Bethesda, MD and AxCell Bioscience, Newtown, PA
Myriad Proteomics, Salt Lake City, Utah
CuraGen Corporation, New Haven, CT, USA

Objectives of the KEGG Project

Pathway Database: Computerize current knowledge of molecular and cellular biology in terms of the pathways of interacting molecules or genes.
Genes Database: Maintain gene catalogs of all sequenced organisms and link each gene product to a pathway component.
Ligand Database: Organize a database of all chemical compounds in living cells and link each compound to a pathway component.
Pathway Tools: Develop new bioinformatics technologies for functional genomics, such as pathway comparison, pathway reconstruction, and pathway design.
Professor M. Kanehisa is the leading scientist on the project.

Data Representation in KEGG

Entity: a molecule or a gene
Binary relation: a relation between two entities
Network: a graph formed from a set of related entities
Pathway: metabolic pathway or
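
The entity / binary relation / network vocabulary of the KEGG data representation maps naturally onto a graph data structure. The following is a minimal sketch of that idea, not KEGG's own implementation: it assumes each binary relation is a directed pair of entity names, and the entity names used are hypothetical placeholders rather than KEGG identifiers.

```python
from collections import defaultdict, deque


def build_network(relations):
    """Build a network (adjacency map) from binary relations between entities."""
    network = defaultdict(set)
    for source, target in relations:
        network[source].add(target)
        network[target]  # register the target as a node even without outgoing edges
    return dict(network)


def reachable(network, start):
    """Return every entity reachable from `start` by following relations."""
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for nxt in network.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen


# Hypothetical entities and relations, for illustration only.
relations = [("geneA", "compoundX"), ("compoundX", "geneB"), ("geneC", "geneB")]
network = build_network(relations)
```

Calling `reachable(network, "geneA")` walks the relations starting from one entity; in this picture a pathway is a named subgraph of the network.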

H:dme00193
2.4 Carbon fixation PATH:dme00710
2.5 Reductive carboxylate cycle (CO2 fixation) PATH:dme00720
2.6 Methane metabolism PATH:dme00680
2.7 Nitrogen metabolism PATH:dme00910
2.8 Sulfur metabolism PATH:dme00920
Lipid Metabolism
Nucleotide Metabolism
Amino Acid Metabolism
Metabolism of Other Amino Acids
Metabolism of Complex Carbohydrates
Metabolism of Complex Lipids
Metabolism of Cofactors and Vitamins

Introduction to GenMAPP

Gene MicroArray Pathway Profiler, by Bruce Conklin at the Gladstone Institute, UCSF.
GenMAPP is a free computer application designed to visualize gene expression data on maps representing biological pathways and groupings of genes. The main features underlying GenMAPP version 1.0 are:
Draw pathways with easy-to-use graphics tools
Multiple species gene databases
Color genes on MAPP files based on user-imported gene expression data

Part II. Path Metrics

Software Tools
for Developing Pathway Database, Performing Pathway Comparison, and Making Pathway Prediction

Topics to Cover

SLIPPIR standard for pathway database model
Gene, pathway, and tissue expression tools
Pathway search engine
Ortholog pathway prediction
Pathway prediction user interface

SLIPPIR standard for pathway curation

SLIPPIR stands for Standard for LInear Protein-Protein Interaction Representation. For linear comparison (homology), 2-D diagrams of pathways are converted to a 1-D format. We call the 2-D diagrams graph pathways, and the corresponding 1-D pathways linear pathways. One graph pathway may be transformed into multiple linear pathways. The generation of graph pathways and the corresponding linear pathways from scientific literature is called pathway curation. Pathways are curated by trained scientists with expertise in the relevant pathways. In addition to generating the graph pathway and linear pathways, curators also have to generate a pathway description file for each pathway they curate (pathway annotation), and a protein file that contains all the proteins in the pathway.

Mode Symbol Specifications

A mode is usually specified by two non-character ASCII symbols.
- Direct interaction with direction. Used when there is a known direct interaction between two nodes (reverse orientation:
Clear interaction, but no direction of information flow (notice, no space within, no letters either). This could happen when more than two proteins are involved in forming a large complex.
* Bifurcating members (usually appears only at the beginning or end of a pathway; it can occur in the middle of a pathway only when the pathway bifurcates and immediately folds back, e.g. A-B*C*E-F). If a pathway starts to bifurcate in the middle or at the end, one can use a *path_name to record this event, e.g. A-B-(xx)-C-D*New_path_1-E*New_path_2.
( ) Symbol for non-protein nodes. If the small molecule is uncertain, it can be omitted. If the small molecule is known, its name should be inserted in between, e.g. -(Ca), or (cAMP). All the small molecules should be included inside a set of parentheses, e.g. A1-(Ca)-A1-(Cytidine_Diphosphate_Choline).
Symbol for another pathway. The path_id should be within the bracket. When linked to other pathways, the path_ids should be put inside a bracket, e.g. A1-Ca_triggered_path1, A1-Gs_pathway.
When an ID is given without a ( ) or a bracket, it is a protein node.

SLIPPIR Format for
Pathway Entries

The format is based on a common sequence representation format, FASTA. The pathway is keyed in FASTA format, with the top line being the annotation line. E.g.
PW_ID PW_name PW_annotation Source Curator Date Species
Pr1-Pr2-(Ca)-Pr3=Pr4*Pr5*PATH_XX
PW_ID: ID for the pathway
PW_name: A name
PW_annotation: a brief description of the pathway
Source: where this pathway is taken from: article, KEGG, GenMAPP, etc.
Curator: the person who inputs the pathway
Date: date of curation

Pathway Database Model (cont.)

FASTA format protein-node representation
Seq_id Annotation
ABCDELMEN
Comparison Matrix: percent_identity, percent_positive (PAM/BLOSUM)

FASTA format non-protein node representation
Mol_id Annotation
Molecular structure
Comparison Matrix: identity mapping, structural similarity, evolutionary relationship

SCOM matrix (similarity coefficient of modes)
A matrix of numbers, positive and negative values.
Comparison Matrix: identity mapping, matrix of positive/negative numbers

Pathway Database in Simplest Format

A SLIPPIR format pathway file
A FASTA format protein sequence file
A FASTA format non-protein molecule file
Flat file tools to do basic database manipulations:
Index: generate index file
Retrieval: log N scale speed of component access
Insertion: cat to the end, new index
Deletion: delete, and new index
Updating: deletion, cat to the end, new index

Relational Database Implementation - an example with only protein nodes

Expression and Expression Comparison

Gene expression
Gene expression comparison
Pathway expression
Pathway expression comparison
Tissue expression
Tissue expression comparison

2. PMsearch Documentation

PMsearch is a pathway comparison program. After a user specifies a query pathway and a search database, PMsearch compares the query pathway with each entry in the pathway database. The query pathway is specified by two input files: query.pw, the pathway file, and query.aa, the protein file. query.pw contains the pathway information, in FASTA format, and query.aa contains the involved proteins, in FASTA format. The pathway database is also composed of two files, db.pw and db.aa, except that the database files contain more than one entry. Once a job is submitted, the search engine (pm_search) performs the job and reports back all the homologous pathways that score above a user-specified threshold. The user can also specify other parameters, which are given in the user manual.

Given two lists of letters, UIPQWEFOIUFJLK and PQEFOIABCDFJ, a good alignment might be:
UIPQWXEFOI-UFJLK
| | |
PQ-EFOIABCDFJQRS

Specifics for pathway alignment:
Each letter can represent a node or a mode.
Nodes do not have to be identical in order
to match; they just have to be homologous.
The distance between nodes and modes, and between protein nodes and non-protein nodes, is infinite: elements of different types cannot be aligned.

In the simplest case, consider a pathway with only protein nodes. Given an alignment z, the score is given by

$$S(z) = \sum_{(x,y) \in z} s(x,y) - \alpha\, n_{gap} - \beta\, l_{gap}$$

where $s(x,y)$ is the similarity of protein x and protein y, $n_{gap}$ is the number of gaps in z, $l_{gap}$ is the total length of the gaps, $\alpha$ is a parameter called the "gap opening" penalty, and $\beta$ is a second parameter called the "gap extension" penalty. There are many possible alignments for two pathways, and different alignments may have different scores. PMsearch uses a dynamic programming algorithm to find the alignment with the highest score.

How Alignments Are Determined And Scored

For the alignment to get to (m,n), it must go through one of:
(m-1, n-1) ($a_m$ and $b_n$ are a match),
(m-1, n) (meaning (m,n) is in a gap in sequence 2),
(m, n-1) (meaning (m,n) is in a gap in sequence 1).

Recursion:
For i = 1 to m
For j = 1 to n
$$H(i,j) = \max\{\, H(i-1,j-1) + s(i,j),\ H_h(i,j),\ H_v(i,j) \,\}$$
where
$$H_h(i,j) = \max\{\, H_h(i,j-1) - \beta,\ H(i,j-1) - \alpha - \beta \,\}$$
$$H_v(i,j) = \max\{\, H_v(i-1,j) - \beta,\ H(i-1,j) - \alpha - \beta \,\}$$
End
End

PMsearch sample output: list of hits

PMsearch 0.1 PathMetrics 20-Sep-2001
Build linux x-86 30-Jul-1998
Reference: US Patent Pending, "Methods for Establishing Pathway Database and Performing Pathway Searches." Y. Yang, C. Piercy. February 20, 2001. Application number 60/269,711.
Query= hsa00625 (5 proteins)
PW Database= keggall
4,881 pathways; 71,600 total proteins.

Pathways with above-threshold alignments: Score
hsa00625 Tetrachloroethene degradation 100
hsa00360 Phenylalanine metabolism 59
hsa00120 Bile acid biosynthesis 58
hsa00627 1,4-Dichlorobenzene degradation 40
hsa00100 Sterol biosynthesis 40
hsa00940 Flavonoids, stilbene and lignin biosynthesis 40
hsa00680 Methane metabolism 40
hsa00950 Alkaloid biosynthesis I 40
hsa00150 Androgen and estrogen metabolism 40
hsa00643 Styrene degradation 40
hsa00380 Tryptophan metabolism 40
hsa00130 Ubiquinone biosynthesis 40
hsa00350 Tyrosine metabolism 40
hsa00340 Histidine metabolism 40
hsa00053 Ascorbate and aldarate metabolism 28
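
The dynamic programming recursion under "How Alignments Are Determined And Scored" can be sketched in a few lines. This is an illustrative sketch, not the PMsearch implementation: it assumes a global alignment with penalized leading and trailing gaps, a gap of length k costing `alpha + k * beta` (matching a score of the form similarity sum minus gap-opening and gap-extension terms), and a caller-supplied similarity function. The names `alignment_score`, `identity_sim`, and the default penalty values are hypothetical.

```python
NEG_INF = float("-inf")


def alignment_score(a, b, sim, alpha=2.0, beta=1.0):
    """Best global alignment score of sequences a and b with affine gaps.

    sim(x, y) is the similarity of two elements; a gap of length k
    costs alpha + k * beta (opening penalty plus per-position extension).
    """
    m, n = len(a), len(b)
    # H: best score overall; Hh: alignment ends with b[j-1] opposite a gap;
    # Hv: alignment ends with a[i-1] opposite a gap.
    H = [[NEG_INF] * (n + 1) for _ in range(m + 1)]
    Hh = [[NEG_INF] * (n + 1) for _ in range(m + 1)]
    Hv = [[NEG_INF] * (n + 1) for _ in range(m + 1)]

    # Boundary conditions (an assumption): leading gaps are penalized.
    H[0][0] = 0.0
    for j in range(1, n + 1):
        Hh[0][j] = -alpha - beta * j
        H[0][j] = Hh[0][j]
    for i in range(1, m + 1):
        Hv[i][0] = -alpha - beta * i
        H[i][0] = Hv[i][0]

    for i in range(1, m + 1):
        for j in range(1, n + 1):
            # Extend an existing gap, or open a new one from H.
            Hh[i][j] = max(Hh[i][j - 1] - beta, H[i][j - 1] - alpha - beta)
            Hv[i][j] = max(Hv[i - 1][j] - beta, H[i - 1][j] - alpha - beta)
            H[i][j] = max(
                H[i - 1][j - 1] + sim(a[i - 1], b[j - 1]),
                Hh[i][j],
                Hv[i][j],
            )
    return H[m][n]


def identity_sim(x, y):
    """Toy similarity: +1 for identical elements, -1 otherwise."""
    return 1.0 if x == y else -1.0
```

For pathway alignment, `sim` would return a protein similarity for two protein nodes and negative infinity for elements of different types (node versus mode, protein versus non-protein node), which enforces the rule that different element types cannot be aligned.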

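
The SLIPPIR entry format described earlier (a FASTA-style annotation line followed by a linear pathway string such as `Pr1-Pr2-(Ca)-Pr3=Pr4*Pr5*PATH_XX`) lends itself to a small tokenizer. The sketch below rests on several assumptions that the slides do not confirm: the annotation line may start with `>` as in FASTA, its first whitespace-separated field is the pathway ID, and the single characters `-`, `=`, and `*` are treated as mode symbols. Names in parentheses are read as non-protein nodes, and every other ID as a protein node. The function names and the sample entry are hypothetical.

```python
import re

# One token is: a parenthesized non-protein node, a bare ID, or a mode symbol.
TOKEN = re.compile(r"\(([^()]*)\)|([A-Za-z0-9_.]+)|([-=*])")


def tokenize_linear_pathway(text):
    """Split a linear pathway string into (kind, value) tokens."""
    text = text.strip()
    tokens = []
    pos = 0
    while pos < len(text):
        match = TOKEN.match(text, pos)
        if match is None:
            raise ValueError(f"unexpected character {text[pos]!r} at offset {pos}")
        molecule, protein, mode = match.groups()
        if molecule is not None:
            tokens.append(("molecule", molecule))  # "" means an unnamed small molecule
        elif protein is not None:
            tokens.append(("protein", protein))
        else:
            tokens.append(("mode", mode))
        pos = match.end()
    return tokens


def parse_entry(entry):
    """Parse one entry: an annotation line followed by the pathway line(s)."""
    lines = [line.strip() for line in entry.strip().splitlines() if line.strip()]
    header = lines[0].lstrip(">").strip()
    pw_id, _, annotation = header.partition(" ")
    return {
        "id": pw_id,
        "annotation": annotation.strip(),
        "tokens": tokenize_linear_pathway("".join(lines[1:])),
    }


# Hypothetical entry, for illustration only.
entry = ">PW001 example pathway\nPr1-Pr2-(Ca)-Pr3=Pr4*Pr5*PATH_XX"
parsed = parse_entry(entry)
```

Keeping nodes and modes as separate token kinds makes it straightforward to refuse alignments between elements of different types later on.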