Designation: E3080-16

An American National Standard

Standard Practice for
Regression Analysis¹

This standard is issued under the fixed designation E3080; the number immediately following the designation indicates the year of original adoption or, in the case of revision, the year of last revision. A number in parentheses indicates the year of last reapproval. A superscript epsilon ($\varepsilon$) indicates an editorial change since the last revision or reapproval.

1. Scope

1.1 This practice covers regression analysis methodology for estimating, evaluating, and using the simple linear regression model to define the relationship between two numerical variables.

1.2 The system of units for this practice is not specified. Dimensional quantities in the practice are presented only as illustrations of calculation methods. The examples are not binding on products or test methods treated.

1.3 This standard does not purport to address all of the safety concerns, if any, associated with its use. It is the responsibility of the user of this standard to establish appropriate safety and health practices and determine the applicability of regulatory limitations prior to use.

2. Referenced Documents

2.1 ASTM Standards:²

E456
Terminology Relating to Quality and Statistics

E2282 Guide for Defining the Test Result of a Test Method

E2586 Practice for Calculating and Using Basic Statistics

3. Terminology

3.1 Definitions: Unless otherwise noted, terms relating to quality and statistics are as defined in Terminology E456.

3.1.1 characteristic, n: a property of items in a sample or population which, when measured, counted, or otherwise observed, helps to distinguish among the items. E2282

3.1.2 coefficient of determination, $r^2$, n: square of the correlation coefficient.

3.1.3 confidence interval, n: an interval estimate [L, U] with the statistics L and U as limits for the parameter $\theta$ and with confidence level $1 - \alpha$, where $\Pr(L \le \theta \le U) \ge 1 - \alpha$. E2586

3.1.3.1 Discussion: The confidence level, $1 - \alpha$, reflects the proportion of cases that the confidence interval [L, U] would contain or cover the true parameter value in a series of repeated random samples under identical conditions. Once L and U are given values, the resulting confidence interval either does or does not contain it. In this sense “confidence” applies not to the particular interval but only to the long run proportion of cases when repeating the procedure many times.

3.1.4 confidence level, n: the value, $1 - \alpha$, of the probability associated with a confidence interval, often expressed as a percentage. E2586

3.1.4.1 Discussion: $\alpha$ is generally a small number. Confidence level is often 95 % or 99 %.

3.1.5 correlation coefficient, n: for a population, $\rho$, a dimensionless measure of association between two variables X and Y, equal to the covariance divided by the product of $\sigma_X$ times $\sigma_Y$.

3.1.6 correlation coefficient, n: for a sample, r, the estimate of the parameter $\rho$ from the data.

3.1.7 covariance, n: of a population, cov(X, Y), for two variables, X and Y, the expected value of $(X - \mu_X)(Y - \mu_Y)$.

3.1.8 covariance, n: of a sample; the estimate of the parameter cov(X, Y) from the data.

3.1.9 dependent variable, n: a variable to be predicted using an equation.

3.1.10 degrees of freedom, n: the number of independent data points minus the number of parameters that have to be estimated before calculating the variance. E2586

3.1.11 deviation, d, n: the
difference of an observed value from its mean.

3.1.12 estimate, n: sample statistic used to approximate a population parameter. E2586

3.1.13 independent variable, n: a variable used to predict another using an equation.

3.1.14 mean, n: of a population, $\mu$, average or expected value of a characteristic in a population; of a sample, $\bar{X}$, sum of the observed values in the sample divided by the sample size. E2586

3.1.15 parameter, n: see population parameter. E2586

3.1.16 population, n: the totality of items or units of material under consideration. E2586

¹ This practice is under the jurisdiction of ASTM Committee E11 on Quality and Statistics and is the direct responsibility of Subcommittee E11.10 on Sampling / Statistics. Current edition approved Nov. 1, 2016. Published November 2016. DOI: 10.1520/E3080-16.

² For referenced ASTM standards, visit the ASTM website, www.astm.org, or contact ASTM Customer Service at service@astm.org. For Annual Book of ASTM Standards volume information, refer to the standard's Document Summary page on the ASTM website.

Copyright ASTM International, 100 Barr Harbor Drive, PO Box C700, West Conshohocken, PA 19428-2959. United States

3.1.17 population parameter, n: summary measure of the values of some characteristic of a population. E2586

3.1.18 prediction interval, n: an interval for a future value or set of values, constructed from a current set of data, in a way that has a specified probability for the inclusion of the future value. E2586

3.1.19 regression, n: the process of estimating parameter(s) of an equation using a set of data.

3.1.20 residual, n: observed value minus fitted value, when a model is used.

3.1.21 statistic, n: see sample statistic. E2586

3.1.22 quantile, n: value such that a fraction f of the sample or population is less than or equal to that value. E2586

3.1.23 sample, n: a group of observations or test results, taken from a larger collection of observations or test results, which serves to provide information that may be used as a basis for making a decision concerning the larger collection. E2586

3.1.24 sample size, n, n: number of observed values in the sample. E2586

3.1.25 sample statistic, n: summary measure of the observed values of a sample. E2586

3.1.26 standard error: standard deviation of the population of values of a sample statistic in repeated sampling, or an estimate of it. E2586

3.1.26.1 Discussion: If the standard error of a statistic is estimated, it will itself be a statistic
with some variance that depends on the sample size.

3.1.27 standard deviation: of a population, $\sigma$, the square root of the average or expected value of the squared deviation of a variable from its mean; of a sample, s, the square root of the sum of the squared deviations of the observed values in the sample from their mean divided by the sample size minus 1. E2586

3.1.28 variance, $\sigma^2$, $s^2$, n: square of the standard deviation of the population or sample. E2586

3.1.28.1 Discussion: For a finite population, $\sigma^2$ is calculated as the sum of squared deviations of values from the mean, divided by n. For a continuous population, $\sigma^2$ is calculated by integrating $(x - \mu)^2$ with respect to the density function. For a sample, $s^2$ is calculated as the sum of the squared deviations of observed values from their average divided by one less than the sample size.

4. Significance and Use

4.1 Regression analysis is a statistical procedure that studies the relations between two or more numerical variables and utilizes existing data to determine a model equation for prediction of one variable from another. In this standard, a simple linear regression model, that is, a straight line relationship between two variables, is considered (1, 2).³

5. Straight Line Regression and Correlation

5.1 Two Variables: The data set includes two variables, X and Y, measured over a collection of sampling units, experimental units or other type of observational units. Each variable occurs the same number of times and the two variables are paired one to one. Data of
this type constitute a set of n ordered pairs of the form $(x_i, y_i)$, where the index variable (i) runs from 1 through n.

5.1.1 Y is always to be treated as a random variable. X may be either a random variable sampled from a population with an error that is negligible compared to the error of Y, or values chosen as in the design of an experiment where the values represent levels that are fixed and without error. We refer to X as the independent variable and Y as the dependent variable.

5.1.2 The practitioner typically wants to see if a relationship exists between X and Y. In theory, many different types
of relationships can occur between X and Y. The most common is a simple linear relationship of the form $Y = \alpha + \beta X + \varepsilon$, where $\alpha$ and $\beta$ are model coefficients and $\varepsilon$ is a random error term representing variation in the observed value of Y at given X, and is assumed to have a mean of 0 and some unknown standard deviation $\sigma$. A statistical analysis that seeks to determine a linear relationship between a dependent variable, Y, and a single independent variable, X, is called simple linear regression. In this type of analysis it is assumed that the error structure is normally distributed with mean 0 and some unknown variance $\sigma^2$ throughout the range of X and Y. Further, the errors are uncorrelated with each other. This will be assumed throughout the remainder of this section.⁴

5.1.3 The regression problem is to determine estimates of the coefficients $\alpha$ and $\beta$ that “best” fit the data and allow estimation of $\sigma$. An additional measure of association, the correlation coefficient, $\rho$, can also be estimated from this type of data which indicates the strength of the linear relationship between X and Y. The sample correlation coefficient, r, is the estimate of $\rho$. The square of the correlation coefficient, $r^2$, is called the coefficient of determination and has additional meaning for the linear relationship between X and Y.

5.1.4 When a suitable model is found, it may be used to estimate the mean response at a given value of X or to predict the range of future Y values from a given X.

5.2 Method of Least Squares: The methodology considered in this
standard and used to estimate the model parameters $\alpha$ and $\beta$ is called the method of least squares. The form of the best fitting line will be denoted as Y = a + bX, where a and b are the estimates of $\alpha$ and $\beta$ respectively. The ith observed values of X and Y are denoted as $x_i$ and $y_i$. The estimate of Y at $X = x_i$ is written $\hat{y}_i = a + b x_i$. The “hat” notation over the $y_i$ variable denotes that this is the estimated mean or predicted value of Y for a given x.

³ The boldface numbers in parentheses refer to a list of references at the end of this standard.

⁴ The normal distribution of the error structure is not required to fit the linear model to the data but is required for performing standard model analysis such as residual analysis, confidence and prediction intervals and statistical inference on the model parameters.

5.2.1 The least squares best fitting line is one that minimizes the sum of the squared deviations from the line to the observed $y_i$ values. Note that these are vertical distances. Analytically, this sum of squared deviations is of the form:

$$S(a, b) = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 = \sum_{i=1}^{n} (y_i - a - b x_i)^2 \qquad (1)$$

5.2.2 The sum of squares, S, is written as a function of a and b. Minimizing this function involves taking partial derivatives of S with respect to a and b. This will result in two linear equations that are then solved simultaneously for a and b. The resulting solutions are functions of the $(x_i, y_i)$ paired data.

5.2.3 Several algebraically equivalent formulas for the least squares solutions are found in the literature. The following describes one convenient form of the solution. First define sums of squares $S_{XX}$ and $S_{YY}$ and the sum of cross products $S_{XY}$ as follows:

$$S_{XX} = (n - 1) s_x^2 = \sum_{i=1}^{n} (x_i - \bar{x})^2 \qquad (2)$$

$$S_{YY} = (n - 1) s_y^2 = \sum_{i=1}^{n} (y_i - \bar{y})^2 \qquad (3)$$

$$S_{XY} = \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y}) = \sum_{i=1}^{n} (x_i - \bar{x}) y_i \qquad (4)$$

Note that in Eq 2 and Eq 3, $s_x$ and $s_y$ are the ordinary sample standard deviations of the X and Y data respectively. The last expression in Eq 4 follows from the middle expression because $\sum_{i=1}^{n} (x_i - \bar{x}) \bar{y} = 0$.

From the least squares solution, the slope estimate is calculated as:

$$b = \frac{\sum_{i=1}^{n} (x_i - \bar{x}) y_i}{\sum_{i=1}^{n} (x_i - \bar{x})^2} = \frac{S_{XY}}{S_{XX}} \qquad (5)$$

Once b is determined, the intercept term is calculated from:

$$a = \bar{y} - b \bar{x} \qquad (6)$$

5.3 Example: An example for this kind of data and the associated basic calculations is shown in Table 1. This data is taken from Duncan (3), and shows the relationship between the measurement of shear strength, Y, and weld diameter, X, for 10 random specimens. Values for the estimated slope and intercept are b = 6.898 and a = −569.468. Fig. 2 shows the scatter plot and associated least squares linear fit.

In Eq 5, the slope estimate b is seen as a weighted average of the $y_i$ where the weights, $w_i$, are defined as:

$$w_i = \frac{x_i - \bar{x}}{S_{XX}} \qquad (7)$$

Values of $x_i$ furthest from the average will have the greatest impact on the associated weight applied to observation $y_i$ and on the numerical determination of the slope b.

5.4 Correlation Coefficient: The population correlation coefficient, or Pearson Product Moment Correlation Coefficient, $\rho$, is a dimensionless parameter intended to measure the strength of a linear relationship
between two variables. The estimated sample correlation coefficient, r, for a set of paired data $(x_i, y_i)$ is calculated as:

$$r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{(n - 1) s_x s_y} = \frac{\sum_{i=1}^{n} (x_i - \bar{x}) y_i}{(n - 1) s_x s_y} \qquad (8)$$

In Eq 8, the quantity $\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y}) / (n - 1)$ is referred to as the sample covariance. Here again, the mean of y disappears from the right side of Eq 8, because $\sum_{i=1}^{n} (x_i - \bar{x}) \bar{y} = 0$.

5.4.1 An alternative formula for r uses the standard deviation of the paired differences ($d_i = y_i - x_i$). Note that it does not matter in what order we calculate these differences. Either $d_i = y_i - x_i$ or $d_i = x_i - y_i$ will give the same result:

$$r = \frac{s_x^2 + s_y^2 - s_d^2}{2 s_x s_y} \qquad (9)$$

TABLE 1 Weld Diameter (x) and Shear Strength (y)

| i | $x_i$ | $y_i$ | $d_i = y_i - x_i$ | $x_i - \bar{x}$ | $(x_i - \bar{x}) y_i$ |
|---|---|---|---|---|---|
| 1 | 190 | 680 | 490.0 | −33.9 | −23,052.0 |
| 2 | 200 | 800 | 600.0 | −23.9 | −19,120.0 |
| 3 | 209 | 780 | 571.0 | −14.9 | −11,622.0 |
| 4 | 215 | 885 | 670.0 | −8.9 | −7,876.5 |
| 5 | 215 | 975 | 760.0 | −8.9 | −8,677.5 |
| 6 | 215 | 1025 | 810.0 | −8.9 | −9,122.5 |
| 7 | 230 | 1100 | 870.0 | 6.1 | 6,710.0 |
| 8 | 250 | 1030 | 780.0 | 26.1 | 26,883.0 |
| 9 | 265 | 1175 | 910.0 | 41.1 | 48,292.5 |
| 10 | 250 | 1300 | 1050.0 | 26.1 | 33,930.0 |
| average | 223.9 | 975.0 | | | |
| stdev (S) | 24.196 | 191.645 | 170.987 | | |
| $S^2$ | 585.433 | 36,727.778 | 29,236.544 | | |

| parameter estimates | |
|---|---|
| b | 6.898 |
| a | −569.468 |
| $S_{XX}$ | 5,268.900 |
| $S_{YY}$ | 330,550.000 |
| $S_{XY}$ | 36,345.000 |

The correlation coefficients for the data in Table 1 using Eq 8 and Eq 9 are:

$$r = \frac{36{,}345}{(10 - 1)(24.196)(191.645)} = 0.871$$

$$r = \frac{24.196^2 + 191.645^2 - 170.987^2}{2(24.196)(191.645)} = 0.871$$

5.4.2 The value of the correlation coefficient is always between −1 and +1. If r is negative (y decreases as x increases) then a line fit to the data will have a negative slope; similarly, positive values of r (y increases as x increases) are associated with a positive slope. Values of r near 0 indicate no linear relationship so that a line fit to the data will have a slope near 0. In cases where the (x, y) data have an r = −1 or r = +1, the relationship between x and y is perfectly linear. An r value near +1 or −1 indicates that a line may provide an adequate fit to the data but does not “prove” that the relationship is linear since other models may provide a better fit (for example, a quadratic model). As values of r become closer to the extremes (−1 and +1) a line provides a stronger explanation of the relationship.
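
The Table 1 calculations can be checked numerically. The following is a minimal sketch, not part of this practice, assuming Python with only its standard library. The variable names (s_xx, s_xy, r_eq8, r_eq9) are illustrative choices. It applies Eq 2, Eq 4, Eq 5, Eq 6, Eq 8, and Eq 9 to the weld diameter and shear strength data.

```python
from statistics import mean, stdev

# Table 1 data: weld diameter (x) and shear strength (y) for 10 specimens.
x = [190, 200, 209, 215, 215, 215, 230, 250, 265, 250]
y = [680, 800, 780, 885, 975, 1025, 1100, 1030, 1175, 1300]

n = len(x)
x_bar, y_bar = mean(x), mean(y)

# Eq 2 and Eq 4: sum of squares of x and sum of cross products.
s_xx = sum((xi - x_bar) ** 2 for xi in x)
s_xy = sum((xi - x_bar) * yi for xi, yi in zip(x, y))

# Eq 5 and Eq 6: least squares slope and intercept.
b = s_xy / s_xx
a = y_bar - b * x_bar

# Eq 8: sample correlation from the sum of cross products.
s_x, s_y = stdev(x), stdev(y)  # ordinary sample standard deviations (n - 1)
r_eq8 = s_xy / ((n - 1) * s_x * s_y)

# Eq 9: sample correlation from the standard deviation of paired differences.
d = [yi - xi for xi, yi in zip(x, y)]
s_d = stdev(d)
r_eq9 = (s_x**2 + s_y**2 - s_d**2) / (2 * s_x * s_y)

print(f"b = {b:.3f}, a = {a:.3f}")                           # b = 6.898, a = -569.468
print(f"r (Eq 8) = {r_eq8:.3f}, r (Eq 9) = {r_eq9:.3f}")     # both 0.871
```

Both formulas for r agree to within floating-point rounding, and reversing the sign of the differences (d = x − y) leaves s_d, and therefore Eq 9, unchanged.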
Fig. 2 shows examples of what correlated data look like for several values of r.

5.4.3 An alternative formula for the estimated slope b as a function of the correlation coefficient, r, and standard deviations of the variables X and Y is:

$$b = r \frac{s_y}{s_x} \qquad (10)$$

5.5 Residuals: For any specified $x_i$ in the data set, the residual at $x_i$ is the difference $e_i = y_i - \hat{y}_i = y_i - (a + b x_i)$, the difference between