Chapter 1 Introduction

- What are longitudinal and panel data?
- Benefits and drawbacks of longitudinal data
- Longitudinal data models
- Historical notes

1.1 What are longitudinal and panel data?

With regression data, we collect a cross-section of subjects. The interest is in comparing characteristics of the subjects, that is, investigating relationships among the variables. In contrast, with time series data, we identify one or more subjects and observe them over time. This allows us to study relationships over time, the so-called dynamic aspect of a problem.

Longitudinal/panel data represent a marriage
of regression and time series data. As with regression, we collect a cross-section of subjects. With panel data, we observe each subject over time. The descriptor “panel data” comes from surveys of individuals; a panel is a group of individuals surveyed repeatedly over time.

Example 1.1 - Divorce rates

Figure 1.1 shows the 1965 divorce rates versus AFDC (Aid to Families with Dependent Children) for the fifty states. The correlation is -0.37. Counter-intuitive? We might expect a positive relationship between welfare payments (AFDC) and divorce rates.

A similar figure shows a negative relationship for 1975 (the correlation is -0.425). Figure 1.2 shows both 1965 and 1975 data, with a line connecting each state. The line represents a change over time (dynamic), not a cross-sectional relationship. Each line displays a positive relationship: as welfare payments increase, so do divorce rates. This is not to argue for a causal relationship between welfare payments and divorce rates. The data are still observational. The dynamic relationship between divorce and AFDC is different from the cross-sectional relationship.

Figure 1.2. 1965 and 1975 divorce rates versus AFDC

Some notation

Longitudinal/panel data are regression data with “double subscripts.”
- Let \(y_{it}\) be the response for the \(i\)th subject during the \(t\)th time period.
- We observe the \(i\)th subject over \(t = 1, \ldots, T_i\) time periods, for each of \(i = 1, \ldots, n\) subjects.
- First subject: \((y_{11}, y_{12}, \ldots, y_{1T_1})\)
- Second subject:
  \((y_{21}, y_{22}, \ldots, y_{2T_2})\)
- ...
- The \(n\)th subject: \((y_{n1}, y_{n2}, \ldots, y_{nT_n})\)

Prevalence of panel data analysis

Importance in the literature:
- Panel data are also known as “cross-section time series” data in the social sciences.
- They are referred to as “longitudinal data analysis” in the biological sciences.
- ABI/INFORM: 326 articles in 2002 and 2003.
- The ISI Web of Science: 879 articles in 2002 and 2003.

Important panel databases:
- Historically, we have:
  - Panel Study of Income Dynamics (PSID)
  - National Longitudinal Survey of Labor Market Experience (NLS)
- Financial and accounting: Compustat, CRSP, NAIC
- Market scanner databases
- See Appendix F

Appendix F. Selected Longitudinal and Panel Data Sets

- Table F.1: 20 international household panel studies
- Table F.2: 5 studies focused on youth and education
- Table F.3: 4 studies focused on the elderly and retirement
- Table F.4: 7 miscellaneous studies, including election data, manufacturing data, medical expenditure data and insurance company data

1.2 Benefits and drawbacks of longitudinal data

Longitudinal data have several advantages compared to data that are either purely cross-sectional (regression) or purely time series. Having longitudinal data allows us to:
- Study dynamic relationships
- Study heterogeneity
- Reduce omitted variable bias

With longitudinal data, one can also argue that:
- Estimators are more efficient
- The data address the causal nature of relationships

Main drawback: attrition

Dynamic relationships

Static versus dynamic relationships:
- Figure 1.1 showed a cross-sectional (static) relationship. We estimate a decrease of 0.95% in divorce rates for each $100 increase in AFDC payments.
- Figure 1.2 showed a temporal (dynamic) relationship. We estimate an increase of 2.9% in divorce rates for each $100 increase in AFDC payments.
- From 1965 to 1975, AFDC payments increased an average of $59 and divorce rates increased 2.5%.

Historical approach

In early panel data studies, pooled cross-sectional data were analyzed by estimating cross-sectional parameters using regression and then using time series methods to model the regression parameter estimates, treating the estimates as known with certainty. Theil and Goldberger (1961) provide an early discussion of the advantages of estimating these two aspects simultaneously.

Dynamic relationships and time series analysis

- When studying dynamic relationships, univariate time series methods are the most well-developed. However, these methods do not account for relationships among different subjects.
- Multivariate time series methods account for relationships among a limited number of different subjects.
- Time series methods require a fair number of observations (generally, at least 30) to make reliable inferences.

Panel data as repeated time series

With panel data, we observe several (repeated) subjects for each time period. By taking averages over subjects, our statistics are more reliable, so we require fewer time series observations to estimate dynamic patterns.

For repeated subjects, the model is

\[ y_{it} = \alpha + \varepsilon_{it}, \quad t = 1, \ldots, T_i, \quad i = 1, \ldots, n. \]

Here, \(\alpha\) is the overall mean and \(\varepsilon_{it}\) represents subject-specific dynamic patterns.

“Unfortunately,” we don't get identical repeated looks. We hope to control for differences among subjects by introducing explanatory variables, or covariates. A basic model is \(y_{it} = \alpha + x_{it}\beta + \varepsilon_{it}\), where \(x_{it}\) is
the explanatory variable. Introducing explanatory variables leaves us with only subject-specific dynamic patterns, that is, \(y_{it} - (\alpha + x_{it}\beta) = \varepsilon_{it}\).

Heterogeneity

Subjects are unique. In cross-sectional analysis, we use \(y_{it} = \alpha + x_{it}\beta + \varepsilon_{it}\) and ascribe the uniqueness to “\(\varepsilon_{it}\)”. In panel data, we have an opportunity to model this uniqueness.

The model \(y_{it} = \alpha_i + x_{it}\beta + \varepsilon_{it}\) is unidentifiable in cross-sectional regression. In panel data, we can estimate \(\beta\) and \(\alpha_1, \ldots, \alpha_n\). Subject-specific parameters, such as \(\alpha_i\), provide an important mechanism for controlling heterogeneity of individuals.

Vocabulary:
- When the \(\alpha_i\) are fixed, unknown parameters to be estimated, we call this a fixed effects model.
- When the \(\alpha_i\) are drawn from an unknown population, that is, random variables, we call this a model with random effects.

Heterogeneity bias

Suppose that a data analyst mistakenly uses the model \(y_{it} = \alpha + x_{it}\beta + \varepsilon_{it}\) when \(y_{it} = \alpha_i + x_{it}\beta + \varepsilon_{it}\) is
the true model. This is an example of heterogeneity bias, or a problem with aggregation of data. Similarly, one could have different (heterogeneous) slopes, \(y_{it} = \alpha + x_{it}\beta_i + \varepsilon_{it}\), or different intercepts and slopes, \(y_{it} = \alpha_i + x_{it}\beta_i + \varepsilon_{it}\).

Omitted variables

Panel data serve to reduce omitted variable bias. When omitted variables are time constant, we can still get reliable estimates. Consider the “true” model \(y_{it} = \alpha + x_{it}\beta + z_i\gamma + \varepsilon_{it}\). Unfortunately, we cannot measure (or have not thought to measure) \(z_i\). It is “lurking” or “latent.” By considering the changes

\[
\begin{aligned}
y_{it}^* = y_{it} - y_{i,t-1}
&= (\alpha + x_{it}\beta + z_i\gamma + \varepsilon_{it}) - (\alpha + x_{i,t-1}\beta + z_i\gamma + \varepsilon_{i,t-1}) \\
&= (x_{it} - x_{i,t-1})\beta + (\varepsilon_{it} - \varepsilon_{i,t-1}) \\
&= x_{it}^*\beta + \varepsilon_{it}^*,
\end{aligned}
\]

we do not need to worry about the bias that ordinarily arises from the latent variable, \(z_i\). Introducing the subject-specific variable \(\alpha_i\) accounts for the presence of many types of latent variables.

Efficiency of Estimators

- Subject-specific variables \(\alpha_i\) also account for a large portion of the variability in many data sets. This reduces the mean square error and increases the efficiency (or reduces the standard errors) of our parameter estimators.
- With panel data, we generally have more observations than with time series or regression.
- A longitudinal data design may yield more efficient estimators than estimators based on a comparable amount of data from alternative designs.

Suppose that the interest is in assessing the average change in a response over time, such as the divorce rate. Write \(\bar{y}_1\) and \(\bar{y}_2\) for the average response in each of two time periods. A repeated cross-section, with independent samples in the two periods, yields

\[ \operatorname{Var}(\bar{y}_2 - \bar{y}_1) = \operatorname{Var}(\bar{y}_2) + \operatorname{Var}(\bar{y}_1). \]

A longitudinal data design, which follows the same subjects, yields

\[ \operatorname{Var}(\bar{y}_2 - \bar{y}_1) = \operatorname{Var}(\bar{y}_2) - 2\operatorname{Cov}(\bar{y}_2, \bar{y}_1) + \operatorname{Var}(\bar{y}_1), \]

which is smaller whenever responses from the same subject are positively correlated over time.

Causality and correlation

Three ingredients are necessary for establishing causality, taken from the sociology literature:
- A statistically significant relationship is required.
- The association between two variables must not be due to another, omitted, variable.
- The “causal” variable must precede the other variable in time.

Longitudinal data are based on measurements taken over time and thus address the third requirement of a temporal ordering of events. Moreover, longitudinal data models provide additional strategies for accommodating omitted variables that are not available in purely cross-sectional data.

Drawbacks: Sampling Design (attrition)

- Selection bias may occur when a rule other than simple random sampling is used to select observational units. Example: “endogenous” decisions by agents to join a labor pool or participate in a social program.
- Missing data: because we follow the
  same subjects over time, nonresponse typically increases through time. Example: US Panel Study of Income Dynamics (PSID). In the first year (1968), the nonresponse rate was 24%. By 1985, the nonresponse rate was about 50%.

1.3 Longitudinal data models

Types of inference:
- Primary. We are interested in the effect that an (exogenous) explanatory variable has on a response, controlling for other variables (including omitted variables).
- Forecasting. We would like to predict future values of the response from a specific subject.
- Conditional means. We would like to predict the expected value of a future response from a specific subject. Here, the conditioning is on latent (unobserved) characteristics associated with the subject.

Types of applications: many

Social science statistical modeling

A model based on data characteristics is known as a sampling based model. The model arises from a data generating process. In contrast, a structural model is a statistical model that represents causal relationships, as opposed to relationships that simply capture statistical associations.

Why bother with an extra layer of theory when considering statistical models? Manski (1992) offers:
- Interpretation: the primary purpose of many statistical analyses is to assess relationships generated by theory from a scientific field.
- Structural models utilize additional information from an underlying functional field. If this information is utilized correctly, then in some sense the structural model should
  provide a better representation than a model without this information (explanation).
- Particularly for public policy analysis, the goal of a statistical analysis is to infer the likely behavior of data outside of those realized (extrapolation).

Modeling issues

- With subject-specific parameters, there can be many parameters that describe the model
- “Fixed” versus “random” effects models
- Incorporating dynamic structure is important
- Econometric “dynamic” models (lagged endogenous) versus the serial correlation approach
- Linear versus nonlinear (generalized linear) models
- Marginal versus hierarchical estimation approaches
- Parametric versus semiparametric models

We wish to separate the effects of the mean, the cross-sectional variance, and the serial correlation structure.

1.4 Historical notes

The term panel study was coined in a marketing context when Lazarsfeld and Fiske (1938) considered the effect of radio advertising on product sales. People who buy a product would be more likely to hear the advertisement, or vice versa. They proposed repeatedly interviewing a set of people (the panel) to clarify the issue.

Econometrics: early economics applications include Kuh (1959), Johnson (1960), Mundlak (1961) and Hoch (1962).

Biostatistics: Wishart (1938), Rao (1959, 1965), and Potthoff and Roy (1964) used multivariate analysis to consider the problem of polynomial growth curves of serial measurements from a single group of subjects. Grizzle and Allen (1969) introduced covariates.
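
Worked illustration (omitted variables, Section 1.2)

The following is a minimal simulation sketch of the omitted-variables argument from Section 1.2, not code from the original slides. It assumes a scalar explanatory variable, normally distributed errors, and a latent variable that is correlated with the explanatory variable. The function names, sample sizes, and parameter values (beta = 2, gamma = -6) are hypothetical choices made only for illustration. The sketch compares a pooled cross-sectional slope, which ignores the latent variable, with the slope from first differences, in which the time-constant latent term cancels.

```python
import numpy as np


def simulate_panel(n=200, T=5, beta=2.0, gamma=-6.0, seed=0):
    """Simulate y_it = alpha + x_it*beta + z_i*gamma + eps_it (z_i is latent)."""
    rng = np.random.default_rng(seed)
    z = rng.normal(size=n)                     # time-constant latent variable z_i
    x = z[:, None] + rng.normal(size=(n, T))   # x_it is correlated with z_i
    eps = rng.normal(scale=0.5, size=(n, T))   # idiosyncratic errors
    y = 1.0 + beta * x + gamma * z[:, None] + eps
    return x, y


def pooled_slope(x, y):
    """Least-squares slope of y on x, pooling all (i, t) pairs and ignoring z_i."""
    xc = x.ravel() - x.mean()
    yc = y.ravel() - y.mean()
    return float(xc @ yc / (xc @ xc))


def first_difference_slope(x, y):
    """Slope from regressing y_it - y_i,t-1 on x_it - x_i,t-1 (no intercept)."""
    dx = np.diff(x, axis=1).ravel()            # x_it* = x_it - x_i,t-1
    dy = np.diff(y, axis=1).ravel()            # y_it* = y_it - y_i,t-1
    return float(dx @ dy / (dx @ dx))


x, y = simulate_panel()
print("pooled slope:", round(pooled_slope(x, y), 2))
print("first-difference slope:", round(first_difference_slope(x, y), 2))
```

Under these assumed settings the pooled slope comes out negative even though the true beta is positive, while the first-difference slope is close to beta. This mirrors the contrast between the cross-sectional and dynamic relationships in Example 1.1, without implying that the divorce data were generated this way.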