1、Ch 12. continuous latent variables Pattern Recognition and Machine Learning, C. M. Bishop, 2006.,Summarized by Soo-Jin KimBiointelligence Laboratory, Seoul National University http:/bi.snu.ac.kr/,2,(C) 2007, SNU Biointelligence Lab, http:/bi.snu.ac.kr/,Contents,12.1 Principle Component Analysis 12.1
2、.1 Maximum variance formulation 12.1.2 Minimum-error formulation 12.1.3 Application of PCA 12.1.4 PCA for high-dimensional data,3,(C) 2007, SNU Biointelligence Lab, http:/bi.snu.ac.kr/,12.1 Principal Component Analysis,Principal Component Analysis (PCA) PCA is used for applications such as dimension
3、ality reduction, lossy data compression, feature extraction and data visualization. Also known as Karhunen-Loeve transform PCA can be defined as the orthogonal projection of the data onto a lower dimensional linear space, known as the principal subspace, such that the variance of the projected data
4、is maximized.,Principal subspace,Orthogonal projection of the data points,The subspace maximizes the variance of the projected points,Minimizing the sum-of-squares of the projection errors,4,(C) 2007, SNU Biointelligence Lab, http:/bi.snu.ac.kr/,12.1.1 Maximum variance formulation (1/2),Consider dat
5、a set of observations xn where n = 1,N, and xn is a Euclidean variable with dimensionality D. To project the data onto a space having dimensionality MD while maximizing the variance of the projected data. One-dimensional space (M=1) Define the direction of this space using a D-dimensional vector u1
6、The mean of the projected data is where is the sample set mean given byThe variance of the projected data is given by,Maximize the projected variance,S is the data covariance matrix,5,(C) 2007, SNU Biointelligence Lab, http:/bi.snu.ac.kr/,12.1.1 Maximum variance formulation (2/2),Lagrange multiplier
7、 Make an unconstrained maximization of (u1 must be an eigenvector of S) The variance will be maximum when we set u1 equal to the eigenvector having the largest eigenvalue 1.PCA involves evaluating the mean x and the covariance matrix S of the data set and then finding the M eigenvectors of S corresp
8、onding to the M largest eigenvalues. The cost of computing the full eigenvector decomposition for a matrix of size D x D is O(D3),(this eigenvector is the first principal component),6,(C) 2007, SNU Biointelligence Lab, http:/bi.snu.ac.kr/,12.1.2 Minimum-error formulation (1/4),Based on projection er
9、ror minimization A complete orthogonal set of D-dimensional basis vectors ui where i = 1,D that satisfy Each data point can be represented exactly by a linear combination of the basis vectorsTo approximate this data point using a representation involving a restricted number MD of variables correspon
10、ding to projection onto a lower-dimensional subspace.,coefficient,(Without loss of generality),7,(C) 2007, SNU Biointelligence Lab, http:/bi.snu.ac.kr/,12.1.2 Minimum-error formulation (2/4),We approximate each data point xn byDistortion measure the squared distance between the original data point x
11、n and its approximation , so that goal is to minimizethe minimization with respect to the quantities znj where j = 1, M and similarly, where j = M+1, , D.,Depend on particular data point,Constants that are the same for all data points,to choose the ui, the zni, and the bj so as to minimize the disto
12、rtion,The displacement vector from xn to lies in the space orthogonal to the principal subspace (Fig. 12.2.).,8,(C) 2007, SNU Biointelligence Lab, http:/bi.snu.ac.kr/,12.1.2 Minimum-error formulation (3/4),Therefore, distortion measure JThe case of a 2-dimensional data space D=2 and 1-dimensional pr
13、incipal subspace M=1 To choose a direction u2 so as to minimize Using a Lagrange multiplier 2 to enforce the constraintThe minimum value of J by choosing u2 to be the eigenvector corresponding to the smaller of the two eigenvalueschoose the principal subspace to be aligned with the eigenvector havin
14、g the larger eigenvalues (in order to minimize the average squared projection distance),9,(C) 2007, SNU Biointelligence Lab, http:/bi.snu.ac.kr/,12.1.2 Minimum-error formulation (4/4),General solution to the minimization of J for arbitrary M D The distortion measure is simply the sum of the eigenval
15、ues of those eigenvectors that are orthogonal to the principal subspace.Canonical correlation analysis (CCA) linear dimensionality reduction technique called CCA PCA works with a single random variable, CCA consider two (or more) variables,10,(C) 2007, SNU Biointelligence Lab, http:/bi.snu.ac.kr/,12
16、.1.3 Applications of PCA (1/4),Data compression The first five eigenvectors, along with the corresponding eigenvalues,The distortion measure J introduced by projecting the data onto a principal component subspace of dimensionality M,the eigenvalue spectrum,11,(C) 2007, SNU Biointelligence Lab, http:
17、/bi.snu.ac.kr/,12.1.3 Applications of PCA (2/4),PCA approximation to a data vector xn in the formPCA reconstruction of data points by M principal components,12,(C) 2007, SNU Biointelligence Lab, http:/bi.snu.ac.kr/,12.1.3 Applications of PCA (3/4),Data pre-processing The transformation of a data set
18、 in order to standardize certain of its properties (allowing subsequent pattern recognition algorithm) PCA makes a more substantial normalization of the data to give it zero mean and unit covariance,L, DxD diagonal matrix with elements i U, DxD orthogonal matrix with columns given by ui,13,(C) 2007,
19、 SNU Biointelligence Lab, http:/bi.snu.ac.kr/,12.1.3 Applications of PCA (4/4),Compare to PCA with the Fisher linear discriminant for linear dimensionality reduction,PCA chooses the direction of maximum variance,Fisher linear discriminant takes account of the class labels information,14,(C) 2007, SN
20、U Biointelligence Lab, http:/bi.snu.ac.kr/,12.1.4 PCA for high-dimensional data,computationally infeasible (O(D3) ) Resolve this problem Let us define X to be the (NxD)-dimensional centered data matrix, whose nth row is given by,This is an eigenvector of S with eigenvalue i,(vi=Xui),Solve the eigenvector problem in spaces of lower dimensionality with computational cost O(N3) instead of O(D3),