1、,The Problem of Detecting Differentially Expressed Genes,Class 1,Class 2,Fold Change is the Simplest MethodCalculate the log ratio between the two classes and consider all genes that differ by more than an arbitrary cutoff value to be differentially expressed. A two-fold difference is often chosen.F
2、old change is not a statistical test.,(1) For gene consider the null hypothesis of no association between its expression level and its class membership.,Test of a Single Hypothesis,(3) Perform a test (e.g Students t-test) for each gene.,(2) Decide on level of significance (commonly 5%).,(4) Obtain P
3、-value corresponding to that test statistic.,(5) Compare P-value with the significance level. Then either reject or retain the null hypothesis.,Two-Sample t-Statistic,Students t-statistic:,Two-Sample t-Statistic,Pooled form of the Students t-statistic, assumed common variance in the two classes:,Two
4、-Sample t-Statistic,Modified t-statistic of Tusher et al. (2001):,TRUE,PREDICTED,Types of Errors in Hypothesis Testing,Multiplicity Problem,Further: Genes are co-regulated, subsequently there is correlation between the test statistics.,When many hypotheses are tested, the probability of a false posi
5、tive increases sharply with the number of hypotheses.,Suppose we measure the expression of 10,000 genes in a microarray experiment.,If all 10,000 genes were not differentially expressed, then we would expect for:P= 0.05 for each test, 500 false positives.P= 0.05/10,000 for each test, .05 false posit
6、ives.,Example,Controlling the Error Rate,Methods for controlling false positives e.g. Bonferroni are too strict for microarray analyses Use the False Discovery Rate instead (FDR)(Benjamini and Hochberg 1995),Methods for dealing with the Multiplicity Problem,The Bonferroni Method controls the family
7、wise error rate (FWER) i.e. the probability that at least one false positive error will be made,The False Discovery Rate (FDR) emphasizes the proportion of false positives among the identified differentially expressed genes.,Too strict for gene expression data, tries to make it unlikely that even on
8、e false rejection of the null is made, may lead to missed findings,Good for gene expression data says something about the chosen genes,The FDR is essentially the expectation of the proportion of false positives among the identified differentially expressed genes.,False Discovery Rate Benjamini and H
9、ochberg (1995),Possible Outcomes for N Hypothesis Tests,where,Positive FDR,Lindsay, Kettenring, and Siegmund (2004).A Report on the Future of Statistics.Statist. Sci. 19.,Key papers on controlling the FDR,Genovese and Wasserman (2002) Storey (2002, 2003)Storey and Tibshirani (2003a, 2003b)Storey, Ta
10、ylor and Siegmund (2004)Black (2004)Cox and Wong (2004),Controlling FDR,Benjamini and Hochberg (1995),Benjamini-Hochberg (BH) Procedure,Controls the FDR at level a when the P-values following the null distribution are independent and uniformly distributed.,(1) Let be the observed P-values.,(2) Calcu
11、late .,(3) If exists then reject null hypotheses corresponding to. Otherwise, reject nothing.,Example: Bonferroni and BH Tests,Suppose that 10 independent hypothesis tests are carried out leading to the following ordered P-values:,0.00017 0.00448 0.00671 0.00907 0.01220 0.33626 0.39341 0.53882 0.581
12、25 0.98617,(a) With a = 0.05, the Bonferroni test rejects any hypothesis whose P-value is less than a / 10 = 0.005.,Thus only the first two hypotheses are rejected.,(b) For the BH test, we find the largest k such that P(k) ka / N.,Here k = 5, thus we reject the first five hypotheses.,q-VALUE,q-value
13、 of a gene j is expected proportion of false positives when calling that gene significant.P-value is the probability under the null hypothesis of obtaining a value of the test statistic as or more extreme than its observed value. The q-value for an observed test statistic can be viewed as the expect
14、ed proportion of false positives among all genes with their test statistics as or more extreme than the observed value.,LIST OF SIGNIFICANT GENES,Call all genes significant if pj 0.05 or Call all genes significant if qj 0.05 to produce a set of significant genes so that a proportion of them (0.05) i
15、s expected to be false (at least for a large no. of genes not necessarily independent),BRCA1 versus BRCA2-mutation positive tumours (Hedenfalk et al., 2001),BRCA1 (7) versus BRCA2-mutation (8) positive tumours, p=3226 genesP=.001 gave 51 genes differentially expressedP=0.0001 gave 9-11 genes,Using q
16、0.05, gives 160 genes are taken to be significant.It means that approx. 8 of these 160 genes are expected to be false positives.Also, it is estimated that 33% of the genes are differentially expressed.,Permutation Method,The null distribution has a resolution on the order of the number of permutatio
17、ns. If we perform B permutations, then the P-value will be estimated with a resolution of 1/B. If we assume that each gene has the same null distribution and combine the permutations, then the resolution will be 1/(NB) for the pooled null distribution.,Null Distribution of the Test Statistic,Using j
18、ust the B permutations of the class labels for the gene-specific statistic Wj , the P-value for Wj = wj is assessed as:,where w(b)0j is the null version of wj after the bth permutation of the class labels.,If we pool over all N genes, then:,Class 1 Class 2,Gene 1 A1(1) A2(1) A3(1) B4(1) B5(1) B6(1),
19、Gene 2 A1(2) A2(2) A3(2) B4(2) B5(2) B6(2),Suppose we have two classes of tissue samples, with three samples from each class. Consider the expressions of two genes, Gene 1 and Gene 2.,Null Distribution of the Test Statistic: Example,Class 1 Class 2,Gene 1 A1(1) A2(1) A3(1) B4(1) B5(1) B6(1),Gene 2 A
20、1(2) A2(2) A3(2) B4(2) B5(2) B6(2),Gene 1 A1(1) A2(1) A3(1) A4(1) A5(1) A6(1),To find the null distribution of the test statistic for Gene 1, we proceed under the assumption that there is no difference between the classes (for Gene 1) so that:,Perm. 1 A1(1) A2(1) A4(1) A3(1) A5(1) A6(1) . There are
21、10 distinct permutations.,And permute the class labels:,Ten Permutations of Gene 1,A1(1) A2(1) A3(1) A4(1) A5(1) A6(1)A1(1) A2(1) A4(1) A3(1) A5(1) A6(1)A1(1) A2(1) A5(1) A3(1) A4(1) A6(1)A1(1) A2(1) A6(1) A3(1) A4(1) A5(1)A1(1) A3(1) A4(1) A2(1) A5(1) A6(1)A1(1) A3(1) A5(1) A2(1) A4(1) A6(1)A1(1) A
22、3(1) A6(1) A2(1) A4(1) A5(1)A1(1) A4(1) A5(1) A2(1) A3(1) A6(1)A1(1) A4(1) A6(1) A2(1) A3(1) A5(1)A1(1) A5(1) A6(1) A2(1) A3(1) A4(1),As there are only 10 distinct permutations here, the null distribution based on these permutations is too granular. Hence consideration is given to permuting the labe
23、ls of each of the other genes and estimating the null distribution of a gene based on the pooled permutations so obtained. But there is a problem with this method in that the null values of the test statistic for each gene does not necessarily have the theoretical null distribution that we are tryin
24、g to estimate.,Suppose we were to use Gene 2 also to estimate the null distribution of Gene 1. Suppose that Gene 2 is differentially expressed, then the null values of the test statistic for Gene 2 will have a mixed distribution.,Class 1 Class 2,Gene 1 A1(1) A2(1) A3(1) B4(1) B5(1) B6(1),Gene 2 A1(2
25、) A2(2) A3(2) B4(2) B5(2) B6(2),Gene 2 A1(2) A2(2) A3(2) B4(2) B5(2) B6(2),Perm. 1 A1(2) A2(2) B4(2) A3(2) B5(2) B6(2).,Permute the class labels:,Example of a null case: with 7 N(0,1) points and 8 N(0,1) points; histogram of the pooled two-sample t-statistic under 1000 permutations of the class labe
26、ls with t13 density superimposed.,ty,Example of a null case: with 7 N(0,1) points and 8 N(10,9) points; histogram of the pooled two-sample t-statistic under 1000 permutations of the class labels with t13 density superimposed.,ty,The SAM Method,Use the permutation method to calculate the null distrib
27、ution of the modified t-statistic (Tusher et al., 2001).,The order statistics t(1), . , t(N) are plotted against their null expectations above. A good test in situations where there are more genes being over-expressed than under-expressed, or vice-versa.,TRUE,PREDICTED,FDR ,R,FNDR ,The FDR and other
28、 error rates,TRUE,PREDICTED,FDR ,R,FNR =,The FDR and other error rates,FDR = B / (B + D) = 475 / 875 = 54%,FNDR = C / (A + C) = 100 / 9125 = 1%,Toy Example with 10,000 Genes,Two-component mixture model,is the proportion of genes that are not,differentially expressed, and,Use of the P-Value as a Summ
29、ary Statistic (Allison et al., 2002),Instead of using the pooled form of the t-statistic, we can work with the value pj, which is the P-value associated with tj in the test of the null hypothesis of no difference in expression between the two classes.,The distribution of the P-value is modelled by t
30、he h-component mixture model,where a11 = a12 = 1.,Use of the P-Value as a Summary Statistic,Under the null hypothesis of no difference in expression for the jth gene, pj will have a uniform distribution on the unit interval; ie the b1,1 distribution.,The ba1,a2 density is given by,where,Efron B, Tib
31、shirani R, Storey JD, Tusher V (2001) Empirical Bayes analysis of a microarray experiment. JASA 96,1151-1160. Efron B (2004) Large-scale simultaenous hypothesis testing: the choice of a null hypothesis. JASA 99, 96-104. Efron B (2004) Selection and Estimation for Large-Scale Simultaneous Inference.
32、Efron B (2005) Local False Discovery Rates. Efron B (2006) Correlation and Large-Scale Simultaneous Significance Testing.,McLachlan GJ, Bean RW, Ben-Tovim Jones L, Zhu JX. Using mixture models to detect differentially expressed genes. Australian Journal of Experimental Agriculture 45, 859-866.McLach
33、lan GJ, Bean RW, Ben-Tovim Jones L. A simple implentation of a normal mixture approach to differential gene expression in multiclass microarrays. Bioinformatics 26. To appear.,Two component mixture model,Using Bayes Theorem, we calculate the posterior probability that gene j is not differentially ex
34、pressed:,0 is the proportion of genes that are not differentially expressed. The two-component mixture model is:,0(wj) c0,then this decision minimizes the (estimated) Bayes risk,If we conclude that gene j is differentially expressed if:,where,where e01 is the probability of a false positive and e10
35、is the probability of a false negative.,Estimated FDR,where,Similarly, the false positive rate is given by,and the false non-discovery rate and false negative rate by:,F0: N(0,1), 0=0.9 F1: N(1,1), 1=0.1Reject H0 if z20(2) = 0.99972 but FDR=0.17,F0: N(0,1), 0=0.6 F1: N(1,1), 1=0.4Reject H0 if z20(2)
36、 = 0.251 but FDR=0.177,Glonek and Solomon (2003),Gene Statistics: Two-Sample t-Statistic,Students t-statistic,Pooled form of the Students t-statistic, assumed common variance in the two classes,Modified t-statistic of Tusher et al. (2001),1. Obtain the z-score for each of the genes,The Procedure,2.
37、Rank the genes on the basis of the z-scores, starting with the largest ones (the same ordering as with the P-values, pj).,3. The posterior probability of non-differential expression of gene j, is given by 0(zj).,4. Conclude gene j to be differentially expressed if0(zj) c0,0(zj) c,If 0/1 9,then,e.g.
38、c = 0.2,Suppose,0/ 1 0.9,then,0(zj) c,0(zj) 0.2 if,Much stronger level of evidence against the null than in standard one-at-a-time testing.,e.g.,Zj N(,1),H0 : = 0 vs H1 : = 2.8,Rejecting H0 if |Zj| 1.96 yields a two-sided test of size 0.05 and power 0.80.f1(1.96)/f0(1.96) = 4.8.,Z-scores, null case,
39、Z-scores, +1,Z-scores, +2,Z-scores, +3,zj,j,Use of Wilson-Hilferty Transformation as in Broet et al. (2004),Plot of approximate z-scores obtained by Wilson-Hilferty transformation of simulated values of F-statistic with 1 and 8 degrees of freedom.,EMMIX-FDR,A program has been written in R which inte
40、rfaces with EMMIX to implement the algorithm described in McLachlan et al (2006).We fit a mixture of two normal componentsto test statistics calculated from the gene expression data.,When we equate the sample mean and variance of the mixture to their population counterparts, we obtain:,When we are w
41、orking with the theoretical null, we can easily estimate the mean and variance of the non-null component with the following formulae.,Following the approach of Storey and Tibshirani (2003) we can obtain an initial estimate for 0 as follows:,This estimate of 0 is used as input to EMMIX as part of the
42、 initial parameter estimates. Thus no random or k-means starts are required, as is usually the case.There are two different versions of EMMIX used, the standard version for the empirical null and a modified version for the theoretical null which fixes the mean and variance of the null component to b
43、e 0 and 1 respectively.,Theoretical and empirical nulls,Efron (2004) suggested the use of two kinds of null component: the theoretical and the empirical null. In the theoretical case the null component has mean 0 and variance 1 and the empirical null has unrestricted mean and variance.,Examples,We e
44、xamined the performance of EMMIX-FDR on three well-known data setsfrom the literature.Alon colon cancer data (1999) Hedenfalk breast cancer data (2001) vant Wout HIV data (2003),Hedenfalk Breast Cancer Data,Hedenfalk et al. (2001) used cDNA arrays to obtain gene expression profiles of tumours from c
45、arriers of either the BRCA1 or BRCA2 mutation (hereditary breast cancers), as well as sporadic breast cancer.We consider their data set of M = 15 patients, comprising two patient groups: BRCA1 (7) versus BRCA2 - mutation positive (8), with N = 3,226 genes.,The problem is to find genes which are diff
46、erentially expressed between the BRCA1 and BRCA2 patients.,Hedenfalk et al. (2001) NEJM, 344, 539-547,Fitto the N values of wj (based on pooled two-sample t-statistic),j th gene is taken to be differentially expressed if:,Two component model for the Breast Cancer Data,Fitting two component mixture m
47、odel to Hedenfalk data,Null,NonNull (DE genes),Estimates of 0 for Hedenfalk data,0.52 (Broet, 2004) 0.64 (Gottardo, 2006) 0.61 (Ploner et al, 2006) 0.47 (Storey, 2002) Using a theoretical null, we estimated 0 to be 0.65.,We used pj = 2 F13(-|wj|)Similar results where pj obtained by permutation metho
48、ds.,c0 = 0.1,Ranking and Selecting the Genes,FDR= Sum/R = 0.06,Proportion of False Negatives = 1 Sum1/ R1,Local FDR,Estimated FDR for various levels of c0,Estimated FDR and other error rates for various levels of threshold c0 applied to the posterior probability of nondifferential expression for the
49、 breast cancer data (Nr=number of selected genes),Storey and Tibshirani (2003) PNAS, 100, 9440-9445,Comparison of identified DE genes,Our method (143),Hedenfalk (175),Storey and Tibshirani (160),101,6,12,8,39,29,24,Uniquely Identified Genes: Differentially Expressed between BRCA1 and BRCA2,1. Obtain the z-score for each of the genes,