A Probabilistic Approach to High Throughput Drug Discovery

- Introduction and Motivation
- Probability Modeling in Drug Discovery
- Representation of Chemical Structures (Descriptors)
- Focused Combinatorial Library Design
- Summary and Outlook

High Throughput Screening

- High throughput screening (HTS): large-scale automation of biological assays
  - Uses robotics to perform 10,000 to 100,000 screens per day
  - Brute-force approach to drug discovery: “rapidly screen all compounds”
- Noteworthy drawbacks of HTS
  - Economics: $1-$5 per assay (provided large collections are assayed)
  - Logistics: compound formatting, inventory systems and other overhead
  - Precision loss: effectively a “binary” measurement, active/inactive (pass/fail)
  - High error rate: assay failure, synthesis failure, sample degradation, registration
- Resulting effects
  - Quality-for-quantity tradeoff: lots of low-quality data
  - High level of noise (error) in the data makes interpretation very difficult
- HTS has gained acceptance and is routinely used to generate lead compounds for drug discovery projects

Sources of Compounds for HTS

- Initial screening libraries (the first libraries used in a project)
  - Historical “in-house” collection of compounds, augmented with compounds purchased from external suppliers
  - With 1 million+ compounds available, the initial screening library must be designed (diversity retained using fewer compounds)
  - Receptor-biased initial screening libraries are a possibility
- Follow-up libraries
  - Parallel synthesis / combinatorial chemistry is an excellent source of large numbers of (new) compounds
  - Synthesis of “all” analogs around a lead structure exhibits poor diversity but is very good for “local” exploration and lead follow-up
- External screening compound purchasing and in-house combinatorial chemistry efforts have gained acceptance and are routinely used in lead generation and follow-up

High Throughput Discovery Cycle

- Brute-force HTS is not practical
  - At least 10 trillion stable drug candidates
  - At 1 billion screens per day, 27 years are needed to screen all 10 trillion
- A discovery cycle can be used to reduce the total number of screens
  - Use HTS data to affect the selection of
    compounds to screen next
  - Scale-up of the traditional experimental discovery cycle

Required Technology for HTD Cycle

- High Throughput Screening facility
- Parallel synthesis and combinatorial chemistry capabilities
- Methodology for automatically analyzing HTS data
  - Humans find it difficult to interpret large amounts of noisy data
  - Automatic HTS QSAR technology is necessary for the HTD cycle
- Methodology for designing focused combinatorial libraries
  - HTS QSAR results are used to bias a combinatorial library towards activity
  - ADME properties and other design criteria should be taken into account
- Meaningful representation of compounds
  - Collection of molecular descriptors that are meaningful across projects (avoids time-consuming variable selection procedures)
  - Definition of a “chemistry space” for diversity studies (design of initial screening libraries)

Probability Modeling in Drug Discovery

Probabilistic Formalism (Bayesian Inference)

- Step 1: Write all observables as a joint probability density; e.g., Pr(A,B,C)
- Step 2: Decompose the density using probability theory and Bayes' theorem until the components are measurable; e.g., Pr(A,B,C) = Pr(B | A,C) Pr(C | A) Pr(A)
- Step 3: Model each component in the product from
  a database or experimental data set
- Step 4: Make predictions or estimates using the computed model of Pr(A,B,C)

Probabilities in Speech Recognition

- Successful speech recognizers select (predict) an output word sequence from an input waveform by maximizing the joint likelihood Pr(WAVE, WORDS)
  - This is used (in part) to solve the isophonetic word sequence problem; e.g., “imadam” can be “I'm Adam” or “I'm a dam” or “eye mad am”
- Pr(WAVE, WORDS) = Pr(WAVE | WORDS) Pr(WORDS)
  - Pr(WORDS) is the prior probability of a word sequence (utterance)
  - Pr(WAVE | WORDS) is used to score the waveform under the assumption or hypothesis that the word sequence is WORDS
- Build a model of Pr(WORDS) by training on, say, 500,000,000 words of newspaper text (the prior knowledge)
- Pr(WORDS) effectively depresses the importance of unlikely utterances in favor of more plausible statements (real phrases)

Probabilities in Drug Discovery

- Notation: Y = active (0/1), D = drugable (0/1), S = structure
- Decompose: Pr(D, Y | S) = Pr(D | Y, S) Pr(Y | S)
  - Pr(D | Y, S): drugable given an active structure (approximated by “is drug-like” efforts)
  - Pr(Y | S): activity assuming the structure (probabilistic QSAR efforts)
- The product of probabilities balances competing goals
  - Classification alone (e.g., recursive partitioning, RP) is not enough: weighted outcomes are needed
  - Methodology is similar to “soft” classification problems or fuzzy logic
- Any method of probability modeling is valid (e.g., histogram, analytic)
- Approximations introduced can be clearly identified
  - e.g., Pr(D | Y, S) ≈ Pr(D | S): drugability is independent of activity (!?)
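The product rule above can be illustrated with a minimal Python sketch. It assumes the approximation Pr(D | Y, S) ≈ Pr(D | S) from the slide, so the design score is simply Pr(D | S) Pr(Y | S). The `Candidate` class, the compound names, the probability values and the 0.5 cutoff are all hypothetical and exist only for illustration; they are not taken from the presentation.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    name: str          # hypothetical compound label
    p_active: float    # model estimate of Pr(Y = 1 | S)
    p_drugable: float  # model estimate of Pr(D = 1 | S)


def design_score(c: Candidate) -> float:
    """Pr(D, Y | S) under the approximation Pr(D | Y, S) ~ Pr(D | S)."""
    return c.p_drugable * c.p_active


def passes_hard_filters(c: Candidate, cutoff: float = 0.5) -> bool:
    """Hard classification: keep a compound only if both estimates clear the cutoff."""
    return c.p_active >= cutoff and c.p_drugable >= cutoff


# Illustrative values only.
candidates = [
    Candidate("cpd-A", p_active=0.90, p_drugable=0.10),
    Candidate("cpd-B", p_active=0.60, p_drugable=0.70),
    Candidate("cpd-C", p_active=0.45, p_drugable=0.95),
]

# Rank by the product of probabilities (the weighted outcome).
ranked = sorted(candidates, key=design_score, reverse=True)
for c in ranked:
    print(f"{c.name}: score={design_score(c):.4f} hard_pass={passes_hard_filters(c)}")
```

With these made-up numbers, cpd-C ranks first by the product even though a pair of hard cutoffs would discard it, which is one way to read the remark that classification alone is not enough.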
Pr(Y|X) via Binary QSAR

- If Y is “binary activity” and X is a descriptor vector, then by Bayes' theorem
  Pr(Y=1 | X=x) = Pr(X=x | Y=1) Pr(Y=1) / [ Pr(X=x | Y=1) Pr(Y=1) + Pr(X=x | Y=0) Pr(Y=0) ]
- The pathological behavior of Binary QSAR is reasonable
  - If a new structure is outside the training set, then Pr(Y=1), the hit rate, is used to make predictions (no other information is available)

[Figure: Bayes' theorem diagram. Training samples X1..Xk and Xk+1..Xn, labeled active/inactive, give Pr(Y) and Pr(X|Y); these are combined into Pr(X) and Pr(Y|X).]

Distribution Estimates

- The four distributions in the formula are of two types
  - Pr(Y=0), Pr(Y=1): prior probability of inactive/active
  - Pr(X=x | Y=0), Pr(X=x | Y=1): probability of the ligand assuming inactive/active
- Modeling assumption: uncorrelated descriptors are treated as independent (!)
  - Decompose the multi-dimensional distribution into a product
  - Estimate 2n+2 distributions instead of the original four
- Binary QSAR algorithm
  - Compute descriptor vectors d_i
  - De-correlate the descriptors: x_i = Q(d_i - u)
  - Estimate the distributions Pr(X = x | Y = y) from (x_i, y_i)
  - Assemble p(x) ≈ Pr(Y = 1 | X = x)
  - Predict for new descriptors d: p(Q(d - u))

Experience with Binary QSAR

- Fundamental methodology publication (robustness study)
  - Biocomputing: Proceedings of the 1999 Pacific Symposium; World Scientific Publishing, Singapore, 1999
- Example literature data sets (non-HTS data)
  - Estrogen receptor (Gao et al.; J. Chem. Inf. Comput. Sci., 1999, 39)
  - O-acyltransferase (ACAT) (Labute et al.; in press)
- Example industrial data sets (HTS assays)
  - ArQule: 24,000 cpds., 200 active; 93% on inactives, 60% on actives
  - Pharmacopeia: 24,000 cpds.; 90% on inactives, 90% on actives
  - SmithKline Beecham: 80,000 cpds.,
    100 active; 90% on actives
- Best success story: Pharmacia & Upjohn
  - Binary QSAR model used to select building blocks in a combi-chem library
  - Improved activity from µM to nM (factor of 1000)

Combined Design Model for HTD Cycle

- Use the Binary QSAR method twice: once for the activity model and once for the drugability model
- Train the drugability model Pr(D | X) on WDI/ACD for drug-like/non-drug-like, or on specific data sets (e.g., blood-brain barrier permeability)
- The complete model of activity and drugability is the product Pr(D | X) Pr(Y | X), which approximates Pr(D, Y | S)

[Figure: design-model flow diagram with boxes labeled BioAssay, HTS Data, Binary QSAR, Activity Model, Drugability Data (e.g., BBB or drug-like), Binary QSAR, ADME Model, Design Model, Library Design, Combinatorial Library.]

Representation of Chemical Structures (Descriptors)

A Brief History of QSAR

- Original philosophy (Hansch & Leo): use a fixed set of meaningful molecular properties to describe a wide variety of biological phenomena
  - Linear regression used to determine SAR
  - The determination of linear relationships is basic science
  - Statistical regression framework used to assess the significance of SAR
- Proliferation of descriptors
  - Early successes led to the introduction of a vast array of descriptors
  - In principle, any number calculable from a chemical structure can be used as a molecular descriptor for SAR determination
- Over-determination of SAR
  - The multitude of descriptors leads to a need for variable elimination schemes
  - 3D methods treat each grid point in the field representation as a descriptor

Fundamental Notions

- Use a fixed set of descriptors for diversity and QSAR/QSPR
  - A meaningful chemistry space should not require customization
  - In QSAR/QSPR, automatic variable selection can be dangerous
  - Make direct use of Hansch & Leo thinking (build on their experience)
- Model 3D properties from 2D (connectivity) information
  - 3D information from 2D connectivity = 2½D descriptors
  - HTS QSAR and large-scale diversity require fast calculation times
  - 2D topological descriptors are too weak; 3D descriptors are too expensive
- Use approximate atomic surface areas as the fundamental representation
  - Complement substructure keys (stay property-based for class-hopping)
- Intended applications
  - QSAR/QSPR models: linear and nonlinear, early and late in a project
  - Chemistry space for library design

Exposed Van der Waals Surface Area (VSA)

- Calculate the exposed Van der Waals surface area for each atom by subtracting off the surface
  area inside neighbors
- Correction factors to the sphere formula depend on atomic radii and inter-atomic distances

[Figure: atom of radius r with neighbors A and B. Exposed area is 4πr² for an isolated atom, 4πr² - C_A with neighbor A, and 4πr² - C_A - C_B with neighbors A and B.]

Connection Table VSA Calculation

- Neglect
  - Non-bonded neighbors (small molecules have little non-bonded contact)
  - Interaction between angles (1-3 interactions)
  - Stretch of bond lengths (use ideal bond lengths)
- Parameters
  - Radii: Van der Waals (or solvation)
  - Inter-atomic distances: ideal bond lengths
- Define V_i to be the exposed VSA of atom i

[Figure: geometry of the overlap calculation; labels r, s, d, A.]

Quality of Approximate VSA Calculation

- Data set of 1,947 conformations
  - MOE 2D-to-3D converter, MMFF94 force field, 0.01 RMS gradient
  - Molecular weights in the [300, 1600] range
  - VDW surface area: 3D dot calculation
- Accuracy
  - r = 0.9856, r² = 0.9666
  - 10% error
- Largest errors on steroids and other fused ring systems

Subdivision of VSA by Properties

- Given an atomic property value P_i for each atom i
  - e.g., O2: 1.2, C3: 4.5, C4: 5.9, N7: 0.2
- Bin P_i by ranges and sum the V_i values:

  P_i range:    [0,1)  [1,2)  [2,3)  [3,4)  [4,5)  [5,6)
  Descriptors:  D1     D2     D3     D4     D5     D6

[Figure: V_i values assigned to bins and summed within each bin; labels V1, V2, V3, V7, V4 + V5, V6 + V8.]

8 Molar Refractivity Descriptors

- Wildman & Crippen SMR model of molar refractivity
  - Specific attention paid to the calculation of atomic contributions
  - Protonation state
    taken as-is from the structure (specific species)
- Property bins derived from 50,000 structures
- 8 descriptors result: SMR_VSAk
  - Each bin is approximately equally populated over the training set

Wildman, S.A.; Crippen, G.M. Prediction of Physicochemical Parameters by Atomic Contributions. J. Chem. Inf. Comput. Sci., 39(5), 868-873 (1999).

10 LogP (octanol/water) Descriptors

- Wildman & Crippen SlogP model of LogP
  - Specific attention paid to the calculation of atomic contributions
  - Protonation state taken as-is from the structure (specific species)
- Property bins derived from 50,000 structures
- 10 descriptors: SlogP_VSAk
  - Each bin is approximately equally populated over the training set

Reference: Wildman & Crippen (1999), cited above.

SMR_VSA and SlogP_VSA Inter-correlation

- Correlation analysis
  - SMR_VSA and SlogP_VSA descriptors are weakly correlated
  - Test made on 2,000 small molecules not used in the definition of the descriptors
  - Displayed values are r values (not r²)
- Descriptors encode “orthogonal” molecular properties

14 Partial Charge Descriptors

- Gasteiger (PEOE) partial charge model
  - Approximation to local pKa
  - Electrostatic interactions
  - Similar to Jurs descriptors
- 14 descriptors result from uniform interval boundaries
  - Weak correlation

Stanton, D.; Jurs, P. Anal. Chem. 62, 2323 (1990).
Gasteiger, J.; Marsili, M. Iterative Partial Equalization of Orbital Electronegativity - A Rapid Access to Atomic Charges. Tetrahedron, Vol. 36, p. 3219 (1980).

Encoding of Traditional Descriptors

- Traditional descriptors modeled with VSA descriptors
  - 1,932 small organic molecules with weights in (28, 800)
  - SlogP_VSA, SMR_VSA and PEOE_VSA descriptors calculated
  - Principal components regression models for 64 traditional descriptors

  Regression results (descriptor, reported value):

  chi0      0.99    chi0v_C  0.97    b_ar      0.89    b_1rotN   0.78
  Kier1     0.99    KierA1   0.97    Kier2     0.89    b_double  0.77
  vdw_area  0.99    a_hyd    0.96    vsa_pol   0.89    b_rotN    0.77
  vdw_vol   0.99    a_nC     0.96    vsa_acc   0.88    a_ICM     0.73
  vsa_hyd   0.99    a_nH     0.96    diameter  0.87    vsa_don   0.73
  a_count   0.98    a_nO     0.95    VadjEq    0.87    KierFlex  0.69
  a_heavy   0.98    b_heavy  0.95    a_nN      0.86    balabanJ  0.61
  a_IC      0.98    chi1_C   0.95    KierA2    0.86    a_nP      0.60
  apol      0.98    chi1v_C  0.95    radius    0.86    Kier3     0.57
  b_count   0.98    SlogP    0.95    VdistMa   0.86    a_nCl     0.56
  chi0v     0.98    a_acc    0.94    wienPath  0.85    KierA3    0.55
  chi1      0.98    chi1v    0.94    wienPol   0.84    a_nS      0.53
  SMR       0.98    Weight   0.93    VadjMa    0.82    b_1rotR   0.50
  b_single  0.97    a_aro    0.91    VdistEq   0.82    density   0.49
  bpol      0.97    a_don    0.91    vsa_oth   0.82    b_rotR    0.48
  chi0_C    0.97    zagreb   0.91    a_nF      0.80    b_triple  0.46

Boiling Point

- Data set
  - Exp. boiling point (K)
  - 298 small molecules
- 18 descriptors: SlogP_VSA (10), SMR_VSA (8)
- PCA regression: r² = 0.96, RMSE = 15.53
- Leave-one-out: r² = 0.94, RMSE = 21.37
- Random leave-100-out: r² = 0.94

Free Energy of Solvation in Water

- Data set
  - Exp. ΔGs (kcal/mol)
  - 291 small molecules
- 12 descriptors: PEOE_VSA (3), SlogP_VSA (7), SMR_VSA (2)
- PCA regression: r² = 0.90, RMSE = 0.78
- Leave-one-out: r² = 0.89, RMSE = 0.82
- Random leave-100-out: r² = 0.88

Viswanadhan, V.N.; Ghose, A.K.; Singh, U.C.; Wendoloski, J.J. Prediction of Solvation Free Energies of Small Organic Molecules: Additive-Constitutive Models Based on Molecular Fingerprints and Atomic Constants. J. Chem. Inf. Comput. Sci., 39, 405-412 (1999).

Thermodynamic Solubility in Water

- Data set
  - Exp. logW at 25 °C
  - 1,438 small molecules
- 32 descriptors: SlogP_VSA (10), SMR_VSA (8), PEOE_VSA (14)
- PCA regression: r² = 0.75, RMSE = 2.4
- Leave-one-out: r² = 0.74, RMSE = 2.5

Data source: Syracuse Research Corporation, 6225 Running Ridge Road, North Syracuse, NY 13212. URL: http:/.

Vapor Pressure

- Data set
  - Exp. vapor pressure at 25 °C
  - 1,771 small molecules
- 32 descriptors: SlogP_VSA (10), SMR_VSA (8), PEOE_VSA (14)
- PCA regression: r² = 0.88, RMSE = 2.1
- Leave-one-out: r² = 0.87, RMSE = 2.2

Data source: Syracuse Research Corporation (address as above).
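The property models above are all built from binned VSA descriptors, so a small sketch of how one such descriptor vector could be computed may help. It follows the recipe on the VSA slides: start from the sphere area 4πr², subtract one spherical cap per bonded neighbor (ignoring non-bonded contacts and cap-cap overlap), then sum the exposed areas of atoms whose property value falls in each bin. The cap area used here is the standard geometric result for two intersecting spheres; whether the original method uses exactly this correction term is an assumption. The function names, radii, bond length and per-atom areas are hypothetical; only the property values 1.2, 4.5, 5.9 and 0.2 come from the slide example.

```python
import math


def exposed_vsa(radius, neighbors):
    """Approximate exposed surface area of one atom.

    neighbors: list of (neighbor_radius, ideal_bond_length) pairs.
    Each bonded neighbor removes one spherical cap. Overlap between caps
    (1-3 interactions) and non-bonded contacts are neglected.
    """
    area = 4.0 * math.pi * radius ** 2
    for s, d in neighbors:
        if d >= radius + s:
            continue  # spheres do not intersect, nothing to subtract
        # Area of the cap of this atom's sphere that lies inside the neighbor.
        cap = math.pi * radius * (s ** 2 - (radius - d) ** 2) / d
        area -= max(cap, 0.0)
    return max(area, 0.0)


def binned_vsa(areas, props, edges):
    """Sum V_i over atoms whose property P_i falls in [edges[k], edges[k+1])."""
    desc = [0.0] * (len(edges) - 1)
    for v, p in zip(areas, props):
        for k in range(len(desc)):
            if edges[k] <= p < edges[k + 1]:
                desc[k] += v
                break
    return desc


# One atom with a single bonded neighbor (illustrative radii and bond length).
full = 4.0 * math.pi * 1.7 ** 2
one_neighbor = exposed_vsa(1.7, [(1.7, 1.5)])

# Property values from the slide example; the areas are made up.
props = [1.2, 4.5, 5.9, 0.2]      # P_i for atoms O2, C3, C4, N7
areas = [10.0, 5.0, 6.0, 8.0]     # hypothetical V_i values
edges = [0, 1, 2, 3, 4, 5, 6]     # bins [0,1), [1,2), ..., [5,6)
descriptors = binned_vsa(areas, props, edges)
print(descriptors)
```

Because the calculation needs only radii, ideal bond lengths and a connection table, no 3D conformation is generated, which is the speed argument made on the Fundamental Notions slide.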
Compound Classification with Binary QSAR

- Can Binary QSAR separate inhibitor classes using SlogP_VSAk and SMR_VSAk descriptors?
- Data: 455 compounds active against one of 7 targets
- Results (classification accuracy)

  Class  Accuracy  p      Target
  1      98.7%     0.003  Serotonin receptor ligands
  2      96.7%     0.043  Benzodiazepine receptor ligands
  3      96.5%     0.290  Carbonic anhydrase II inhibitors
  4      98.7%     0.001  Cyclooxygenase-2 (Cox-2) inhibitors
  5      98.7%     0.014  H3 antagonists
  6      98.7%     0.012  HIV protease inhibitors
  7      99.1%     0.002  Tyrosine kinase inhibitors

Labute, P. Binary QSAR: A New Method for Quantitative Structure Activity Relationships. Proceedings of the 1999 Pacific Symposium; World Scientific Publishing, Singapore (1999).

Compound Classification with CART

- Learning set for CART (recursive partitioning)
  - 455 compounds active against one of 7 targets
  - 1,942 “random” organic compounds
  - SlogP_VSA, SMR_VSA descriptors
- Classification accuracy (32 node tree, depth 5)

  Class  Accuracy  p     Target
  1      84.5%     0.07  Serotonin receptor ligands
  2      49.1%     0.30  Benzodiazepine receptor ligands
  3      92.5%     0.27  Carbonic anhydrase II inhibitors
  4      96.8%     0.01  Cyclooxygenase-2 (Cox-2) inhibitors
  5      82.7%     0.03  H3 antagonists
  6      85.4%     0.02  HIV protease inhibitors
  7      91.4%     0.01  Tyrosine kinase inhibitors
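For readers who want to see the shape of the Binary QSAR calculation behind these classification results, here is a deliberately simplified Python sketch. It estimates Pr(Y) and one histogram of Pr(X_j | Y) per descriptor dimension, then applies Bayes' theorem to obtain Pr(Y = 1 | X = x). Two simplifications are assumptions of this sketch and not part of the method as presented: the de-correlation step x = Q(d - u) is skipped (descriptors are treated as already de-correlated), and plain histograms with add-one smoothing stand in for the density estimates. The function name, bin count, value range and toy data are hypothetical.

```python
import math
from collections import Counter


def fit_binary_qsar(X, y, n_bins=4, lo=0.0, hi=1.0):
    """Return a function x -> Pr(Y = 1 | X = x).

    Assumes descriptors are already de-correlated and scaled to [lo, hi],
    and that both classes (0 and 1) occur in y.
    """
    n_dim = len(X[0])
    width = (hi - lo) / n_bins

    def bin_of(v):
        # Clamp values outside [lo, hi] into the first or last bin.
        return min(n_bins - 1, max(0, int((v - lo) / width)))

    # One histogram per class and per descriptor dimension: 2n histograms,
    # plus the two priors, gives the 2n+2 distributions of the slide.
    counts = {c: [Counter() for _ in range(n_dim)] for c in (0, 1)}
    totals = Counter(y)
    for row, label in zip(X, y):
        for j, v in enumerate(row):
            counts[label][j][bin_of(v)] += 1

    def predict(x):
        log_joint = {}
        for c in (0, 1):
            lp = math.log(totals[c] / len(y))  # prior Pr(Y = c)
            for j, v in enumerate(x):
                # Add-one smoothed estimate of Pr(X_j in bin | Y = c).
                lp += math.log((counts[c][j][bin_of(v)] + 1) / (totals[c] + n_bins))
            log_joint[c] = lp
        m = max(log_joint.values())
        w0, w1 = (math.exp(log_joint[c] - m) for c in (0, 1))
        return w1 / (w0 + w1)  # Bayes' theorem

    return predict


# Toy training set: three "actives" at high descriptor values, five "inactives" at low.
X = [[0.90, 0.80], [0.80, 0.90], [0.85, 0.70],
     [0.10, 0.20], [0.20, 0.10], [0.15, 0.30], [0.30, 0.20], [0.25, 0.15]]
y = [1, 1, 1, 0, 0, 0, 0, 0]

model = fit_binary_qsar(X, y)
p_high = model([0.88, 0.82])
p_low = model([0.12, 0.18])
print(f"Pr(active) near the actives:   {p_high:.3f}")
print(f"Pr(active) near the inactives: {p_low:.3f}")
```

The output is a probability, not a class label, so it can be multiplied directly with a drugability estimate as in the combined design model.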