Analysis of gene expression data (Nominal explanatory variables)

Shyamal D. Peddada
Biostatistics Branch, National Institute of Environmental Health Sciences (NIH), Research Triangle Park, NC

Outline of the talk
- Two types of explanatory variables ("experimental conditions")
- Some scientific questions of interest
- A brief discussion on false discovery rate (FDR) analysis
- Some existing statistical methods for analyzing microarray data

Types of explanatory variables ("experimental conditions")
- Nominal variables: no intrinsic order among the levels of the explanatory variable(s).
- No loss of information if the labels of the conditions are permuted. E.g., comparison of gene expression in samples from "normal" tissue with samples from "tumor" tissue.
- Ordinal/interval variables: levels of the explanatory variables are ordered. E.g., comparison of gene expression in samples from different stages of lesion severity, such as "normal", "hyperplasia", "adenoma" and "carcinoma" (categorically ordered); time-course or dose-response experiments (numerically ordered).

Focus of this talk: nominal explanatory variables

Types of microarray data
- Independent samples. E.g., comparison of gene expression in independent samples drawn from normal patients versus independent samples from tumor patients.
- Dependent samples. E.g., comparison of gene expression in samples drawn from normal tissues and tumor tissues from the same patient.

Possible questions of interest
- Identify significantly "up/down" regulated genes for a given "condition" relative to another "condition" (adjusted for other covariates).
- Identify genes that discriminate between various "conditions" and predict the "class/condition" of a future observation.
- Cluster genes according to patterns of expression over "conditions".
- Other questions?

Challenges
- Small sample size but a large number of genes.
- Multiple testing: since each microarray has thousands of genes/probes, several thousand hypotheses are being tested. This affects the overall Type I error rates.
- Complex dependence structure between genes and possibly among samples. It is difficult to model and/or account for the underlying dependence structures among genes.

Multiple Testing: Type I Errors - False Discovery Rates

The Decision Table
[Table not recovered from the slides; its annotation reads "The only observable values".]

Strong and weak control of type I error rates
- Strong control: control the type I error rate under any combination of true and false null hypotheses.
- Weak control: control the type I error rate only when all null hypotheses are true.
- Since we do not know a priori which hypotheses are true, we will focus on strong control of the type I error rate.

Consequences of multiple testing
- Suppose we test each hypothesis at the 5% level of significance.
- Suppose n = 10 independent tests are performed. Then, if all 10 null hypotheses are true, the probability of declaring at least 1 of the 10 tests significant is $1 - 0.95^{10} = 0.401$.
- If 50,000 independent tests are performed, as with Affymetrix microarray data, then you should expect 2500 false positives!

Types of errors in the context of multiple testing
(Here V is the number of false rejections, R the total number of rejections, and m the number of hypotheses tested.)
- Per-Family Error "Rate" (PFER): $E(V)$. Expected number of false rejections.
- Per-Comparison Error Rate (PCER): $E(V)/m$. Expected proportion of false rejections among all m hypotheses.
- Family-Wise Error Rate (FWER): $P(V > 0)$. Probability of at least one false rejection among all m hypotheses.
- False Discovery Rate (FDR): expected proportion of Type I errors among all rejected hypotheses.
  - Benjamini-Hochberg (BH): set V/R = 0 if R = 0.
  - Storey: only interested in the case R > 0 (positive FDR, pFDR).

Some useful inequalities
[Formulas not recovered from the slides.]

Conclusion
- It is conservative to control FWER rather than FDR!
- It is conservative to control pFDR rather than FDR!

Some useful inequalities (continued)
[Formulas not recovered from the slides.]
- However, in most applications such as microarrays, one expects [condition not recovered].
- In general, there is no proof of the statement [statement not recovered].

q-values versus p-values
- Suppose [hypotheses not recovered] and suppose we are interested in a one-sided test.
- Suppose [symbol not recovered] is the value of the test statistic for a given data set.
- The pFDR can be rewritten as [formula not recovered].
- Then the q-value is the posterior-Bayesian p-value [formula not recovered].

Some popular Type I error controlling procedures
- Let $p_{(1)} \le \dots \le p_{(m)}$ denote the ordered p-values for the m tests that are being performed.
- Let $\alpha_1 \le \dots \le \alpha_m$ denote the ordered levels of significance used for testing the m null hypotheses, respectively.
- Step-down procedure: start with the smallest p-value and reject hypotheses in order for as long as $p_{(i)} \le \alpha_i$; stop at the first i for which this fails.
- Step-up procedure: start with the largest p-value, find the largest i with $p_{(i)} \le \alpha_i$, and reject the hypotheses with the i smallest p-values.
- Single-step procedure: a stepwise procedure with the same critical constant for all m hypotheses.

Some typical stepwise procedures: FWER controlling procedures
- Bonferroni: a single-step procedure with $\alpha_i = \alpha/m$.
- Sidak: a single-step procedure with $\alpha_i = 1 - (1 - \alpha)^{1/m}$.
- Holm: a step-down procedure with $\alpha_i = \alpha/(m - i + 1)$.
- Hochberg: a step-up procedure with $\alpha_i = \alpha/(m - i + 1)$.
- minP method: a resampling-based single-step procedure with $\alpha_i = c_\alpha$, where $c_\alpha$ is the $\alpha$ quantile of the distribution of the minimum p-value.

Comments on the methods
- Bonferroni: very general, but can be too conservative for a large number of hypotheses.
- Sidak: more powerful than Bonferroni, but applicable only when the test statistics are independent or have certain types of positive dependence.
- Holm: more powerful than Bonferroni, and applicable for any type of dependence structure between test statistics.
- Hochberg: more powerful than Holm's procedure, but the test statistics should be either independent or have the MTP2 property.
- Multivariate Total Positivity of Order 2 (MTP2): [definition not recovered from the slides].

Some typical stepwise procedures: FDR controlling procedure
- Benjamini-Hochberg: a step-up procedure with $\alpha_i = i\alpha/m$.

An Illustration
- Lobenhofer et al. (2002) data: breast cancer cells exposed to estradiol for 1 hour or for 12, 24 or 36 hours.
- Number of genes on the cDNA 2-spot array: 1900.
- Number of samples per time point: 8.
- Compare 1 hour with (12, 24 and 36 hours) using a two-sided bootstrap t-test.

Some Popular Methods of Analysis

1. Fold-change in gene expression
- For gene "g", compute the fold change between two conditions (e.g. treatment and control).
- Gene "g" is declared "up-regulated" or "down-regulated" by comparing its fold change with pre-defined constants [thresholds not recovered from the slides].
- Strengths: simple to implement; biologists find it very easy to interpret; it is widely used.
- Drawbacks: ignores variability in mean gene expression. Genes with subtle changes in expression can be overlooked, i.e. potentially high false negative rates. Conversely, high false positive rates are also possible.

2. t-test type procedures

2.1 Permutation t-test
- For each gene "g", compute the standard two-sample t-statistic $t_g = (\bar{X}_{1g} - \bar{X}_{2g}) / (S_g \sqrt{1/n_1 + 1/n_2})$, where $\bar{X}_{1g}$ and $\bar{X}_{2g}$ are the sample means, $S_g$ is the pooled sample standard deviation, and $n_1$, $n_2$ are the group sample sizes.
- Statistical significance of a gene is determined by computing the null distribution of $t_g$ using either a permutation or a bootstrap procedure.
- Strengths: simple to implement; biologists find it very easy to interpret; it is widely used.
- Drawback: for some genes the pooled sample standard deviation could be very small, which may result in inflated Type I errors and inflated false discovery rates.

2.2 SAM procedure (Significance Analysis of Microarrays) (Tusher et al., PNAS 2001)
- For each gene "g", modify the standard two-sample t-statistic as $d_g = (\bar{X}_{1g} - \bar{X}_{2g}) / (s_0 + S_g \sqrt{1/n_1 + 1/n_2})$.
- The "fudge" factor $s_0$ is obtained such that the coefficient of variation in the above test statistic is minimized.

3. F-test and its variations for more than 2 nominal conditions
- Usual F-test: the p-values can be obtained by a suitable permutation procedure.
- Regularized F-test: generalization of the Baldi and Long methodology to multiple groups. It controls the false discovery rate better, and its power is comparable to that of the F-test.
- Cui and Churchill (2003) is a good review paper.

4. Linear fixed effects models
Effects:
- Array (A): sample
- Dye (D)
- Variety (V): test groups
- Genes (G)
- Expression (Y)
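
As a hedged illustration of how the effects listed above are usually combined, one commonly cited form of the microarray ANOVA model of Kerr, Martin and Churchill (2000) is shown below. The index names (i for array, j for dye, k for variety, g for gene) and the exact set of interaction terms are assumptions made here, not taken from the slides.

```latex
\log(Y_{ijkg}) = \mu + A_i + D_j + V_k + G_g + (AG)_{ig} + (VG)_{kg} + \varepsilon_{ijkg}
```

In a model of this form, the variety-by-gene interactions $(VG)_{kg}$ are the terms that describe how the expression of gene g differs between test groups.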

4. Linear fixed effects models (Kerr, Martin, and Churchill, 2000)
- Linear fixed effects model: [equation not recovered from the slides].
- All effects are assumed to be fixed effects.
- Main drawback: all genes have the same variance!

5. Linear mixed effects models (Wolfinger et al. 2001)
- Stage 1 (global normalization model): [equation not recovered from the slides].
- Stage 2 (gene-specific model): [equation not recovered from the slides].
- Assumptions: [not recovered from the slides].
- Perform inferences on the interaction term.

A popular graphical representation: the volcano plot
- A scatter plot of $-\log_{10}(\text{p-value})$ vs $\log_2(\text{fold change})$.
- Genes with large fold change will lie outside a pair of vertical "threshold" lines. Further, genes which are highly significant with large fold change will lie in either the upper right-hand or the upper left-hand corner.

A useful review article
- Cui, X. and Churchill, G. (2003), Genome Biology.

Software
- R package: statistics for microarray analysis. http://www.stat.berkeley.edu/users/terry/zarray/Software/smacode.html
- SAM: Significance Analysis of Microarrays. http://www-stat.stanford.edu/%7Etibs/SAM

Supervised classification algorithms

Discriminant analysis based methods

A. Linear and quadratic discriminant analysis based methods
- Strength: well studied in the classical statistics literature.
- Limitations:
  - Based on normality.
  - Imposes constraints on the covariance matrices; one needs to be concerned about the singularity issue.
  - No convenient strategy has been proposed in the literature to select the "best" discriminating subset of genes.

B. Nonparametric classification using Genetic Algorithm and K-nearest neighbors, Li et al. (Bioinformatics, 2001)
- Strengths: entirely nonparametric; takes into account the underlying dependence structure among genes; does not require the estimation of a covariance matrix.
- Weakness: computationally very intensive.

GA/KNN methodology: very brief description
- Computes the Euclidean distance between all pairs of samples based on a sub-vector of, say, 50 genes.
- Classifies each sample into a treatment group (i.e. condition) based on its K nearest neighbors.
- Computes a fitness score for each subset of genes based on how many samples are correctly classified. This is the objective function.
- The objective function is optimized using a genetic algorithm.

[Figures: scatter plots of expression levels of gene 1 vs. expression levels of gene 2, illustrating K-nearest neighbors classification (k=3), with a point labeled "X".]
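
The KNN step described above can be sketched in a few lines. This is a minimal illustration, not the GA/KNN implementation of Li et al.: the function names (`knn_predict`, `loo_fitness`), the toy data, and the simple majority vote are all assumptions made for the example. Each sample is classified by its k = 3 nearest neighbors in Euclidean distance over a chosen gene subset, and the fitness is the number of samples classified correctly when each one is left out in turn.

```python
import math
from collections import Counter


def euclidean(a, b, genes):
    """Euclidean distance between samples a and b using only the chosen genes."""
    return math.sqrt(sum((a[g] - b[g]) ** 2 for g in genes))


def knn_predict(query, samples, labels, genes, k=3):
    """Predict the class of `query` by majority vote among its k nearest samples."""
    order = sorted(range(len(samples)),
                   key=lambda i: euclidean(query, samples[i], genes))
    votes = Counter(labels[i] for i in order[:k])
    return votes.most_common(1)[0][0]


def loo_fitness(samples, labels, genes, k=3):
    """Number of samples classified correctly when each one is left out in turn."""
    correct = 0
    for i, (x, y) in enumerate(zip(samples, labels)):
        rest = samples[:i] + samples[i + 1:]
        rest_labels = labels[:i] + labels[i + 1:]
        if knn_predict(x, rest, rest_labels, genes, k) == y:
            correct += 1
    return correct


# Toy data (made up): 6 samples x 3 genes. Genes 0 and 1 separate the
# two classes; gene 2 is unrelated to the class labels.
samples = [
    [1.0, 1.2, 5.0], [1.1, 0.9, 0.3], [0.9, 1.0, 9.1],   # "normal"
    [3.0, 3.1, 4.8], [3.2, 2.9, 0.1], [2.9, 3.0, 9.3],   # "tumor"
]
labels = ["normal"] * 3 + ["tumor"] * 3

print(knn_predict([1.0, 1.0, 0.0], samples, labels, genes=[0, 1]))  # normal
print(loo_fitness(samples, labels, genes=[0, 1]))                   # 6
print(loo_fitness(samples, labels, genes=[2]))                      # 0
```

On this toy data the informative gene pair classifies all six samples correctly, while the unrelated gene classifies none, which is the kind of contrast a fitness score over gene subsets is meant to expose.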

[Figure: Subcategories within a class.]

Advantages of KNN approach
- Simple; performs as well as or better than more complex methods.
- Free from assumptions such as normality of the distribution of expression levels.
- Multivariate: takes account of dependence in expression levels.
- Accommodates or even identifies distinct subtypes within a class.

Expression data: many genes and few samples
- There may be many subsets of genes that can statistically discriminate between the treated and untreated.
- There are too many possible subsets to look at. With 3,000 genes, there are about $10^{72}$ ways to make subsets of size 30.

The genetic algorithm
- Computer algorithm (John Holland) that works by mimicking Darwin's natural selection.
- Has been applied to many optimization problems, ranging from engine design to protein folding and sequence alignment.
- Effective in searching high-dimensional spaces.

GA works by mimicking evolution
- Randomly select sets ("chromosomes") of 30 genes from all the genes on the chip.
- Evaluate the "fitness" of each "chromosome": how well can it separate the treated from the untreated?
- Pass "chromosomes" randomly to the next generation, with preference for the fittest.

Summary
- Pay attention to the multiple testing problem. Use FDR over FWER for large data sets such as gene expression microarrays.
- Linear mixed effects models may be used for comparing expression data between groups.
- For classification problems, one may want to consider the GA/KNN approach.
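
As a hedged illustration of the summary's first point, the sketch below applies the textbook forms of four procedures named in this talk (Bonferroni, Holm, Hochberg, Benjamini-Hochberg) to one set of p-values, and repeats the multiple-testing arithmetic quoted in the talk. The function names and the example p-values are made up for the illustration, and the critical constants are the standard ones for each procedure rather than values copied from the slides.

```python
def count_rejections(pvalues, constants, direction):
    """Number of ordered hypotheses rejected by a stepwise procedure.

    constants[i] is the critical constant compared with the (i+1)-th
    smallest p-value; direction is "down" (step-down) or "up" (step-up).
    """
    p = sorted(pvalues)
    m = len(p)
    if direction == "down":
        # Step-down: start at the smallest p-value, stop at the first failure.
        k = 0
        while k < m and p[k] <= constants[k]:
            k += 1
        return k
    # Step-up: find the largest i with p_(i) <= alpha_i, reject H_(1)..H_(i).
    for i in range(m, 0, -1):
        if p[i - 1] <= constants[i - 1]:
            return i
    return 0


def bonferroni(pvalues, alpha=0.05):
    m = len(pvalues)
    # Single-step: the same constant alpha/m for every hypothesis.
    return count_rejections(pvalues, [alpha / m] * m, "down")


def holm(pvalues, alpha=0.05):
    m = len(pvalues)
    return count_rejections(pvalues, [alpha / (m - i) for i in range(m)], "down")


def hochberg(pvalues, alpha=0.05):
    m = len(pvalues)
    return count_rejections(pvalues, [alpha / (m - i) for i in range(m)], "up")


def benjamini_hochberg(pvalues, alpha=0.05):
    m = len(pvalues)
    return count_rejections(pvalues, [(i + 1) * alpha / m for i in range(m)], "up")


# Arithmetic quoted in the talk.
fwer_10 = 1 - 0.95 ** 10      # chance of at least one false positive in 10 tests
expected_fp = 50_000 * 0.05   # expected false positives among 50,000 true nulls

# Made-up p-values for m = 10 hypotheses.
pvals = [0.001, 0.008, 0.012, 0.019, 0.03, 0.04, 0.2, 0.5, 0.7, 0.9]

print(round(fwer_10, 3), expected_fp)
for name, procedure in [("Bonferroni", bonferroni), ("Holm", holm),
                        ("Hochberg", hochberg),
                        ("Benjamini-Hochberg", benjamini_hochberg)]:
    print(name, procedure(pvals))
```

With these p-values the three FWER procedures each reject one hypothesis while Benjamini-Hochberg rejects four, which illustrates why FDR control is the less conservative choice when many hypotheses are tested.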

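
The three steps listed under "GA works by mimicking evolution" can be sketched as a toy search over gene subsets. Everything specific here is an assumption made for illustration: the synthetic data, the 1-nearest-neighbor leave-one-out fitness, the population size, and the selection and mutation rules. It is not the GA/KNN software of Li et al. The first lines also check the subset count quoted in the talk.

```python
import math
import random

# Subset count quoted in the talk: choosing 30 genes out of 3,000.
n_subsets = math.comb(3000, 30)


def fitness(chromosome, samples, labels):
    """Leave-one-out count of samples whose nearest neighbor shares their label."""
    correct = 0
    for i, x in enumerate(samples):
        nearest = min(
            (j for j in range(len(samples)) if j != i),
            key=lambda j: sum((x[g] - samples[j][g]) ** 2 for g in chromosome),
        )
        correct += labels[nearest] == labels[i]
    return correct


def mutate(chromosome, n_genes, rng):
    """Replace one randomly chosen gene by a gene not yet in the chromosome."""
    genes = list(chromosome)
    unused = [g for g in range(n_genes) if g not in chromosome]
    genes[rng.randrange(len(genes))] = rng.choice(unused)
    return tuple(sorted(genes))


def run_ga(samples, labels, subset_size=2, pop_size=8, generations=20, seed=0):
    """Toy GA; assumes the number of genes exceeds subset_size."""
    rng = random.Random(seed)
    n_genes = len(samples[0])
    # Step 1: random "chromosomes" (gene subsets).
    pop = [tuple(sorted(rng.sample(range(n_genes), subset_size)))
           for _ in range(pop_size)]
    for _ in range(generations):
        # Step 2: evaluate the fitness of each chromosome.
        ranked = sorted(pop, key=lambda c: fitness(c, samples, labels),
                        reverse=True)
        # Step 3: keep the fittest as is; fill the rest with mutated copies
        # of parents drawn at random, with preference for fitter ranks.
        weights = [pop_size - rank for rank in range(pop_size)]
        parents = rng.choices(ranked, weights=weights, k=pop_size - 1)
        pop = [ranked[0]] + [mutate(p, n_genes, rng) for p in parents]
    best = max(pop, key=lambda c: fitness(c, samples, labels))
    return best, fitness(best, samples, labels)


# Synthetic data (made up): 8 samples x 6 genes. Genes 0 and 1 differ
# between groups; genes 2 to 5 are noise.
data_rng = random.Random(1)
labels = ["treated"] * 4 + ["untreated"] * 4
samples = [
    [(3.0 if lab == "treated" else 0.0) + data_rng.gauss(0, 0.3),
     (0.0 if lab == "treated" else 3.0) + data_rng.gauss(0, 0.3)]
    + [data_rng.gauss(0, 1) for _ in range(4)]
    for lab in labels
]

best, score = run_ga(samples, labels)
print(f"{n_subsets:.2e}")
print(best, score, "of", len(samples))
```

Keeping the single fittest chromosome unchanged in every generation is a design choice of this sketch; it guarantees that the best fitness found never decreases from one generation to the next.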