Panel Data Using Stata: Fixed Effects and Random Effects
Panel data (also known as longitudinal or cross-sectional time-series data) is a dataset in which the behavior of each individual or entity (e.g., country, state, company, industry) is observed at multiple points in time.
entity time y x1 x2 x3
Angola 2018 13 6 0.5 26
Angola 2020 12 7 0.9 18
Brazil 2018 16 5 0.4 16
Brazil 2019 11 3 0.5 19
Brazil 2020 14 4 0.7 21
China 2018 11 8 0.8 14
China 2019 18 2 0.6 17
China 2020 10 5 0.2 21
In the panel dataset above, we have data for variables y, x1, x2, and x3 for each entity (i.e., the countries Angola, Brazil, and China) at multiple points in time (i.e., the years 2018, 2019, and 2020).
When all entities are observed across all times, we call it a balanced panel.
When some entities are not observed in some years, we call it an unbalanced panel. The example above is unbalanced because Angola has no observation for 2019.
Panel data enables us to control for individual heterogeneity. That is, it allows us to control for variables we cannot observe or measure, such as cultural factors or differences in business practices across companies, as well as variables that change over time but not across entities (e.g., national policies, federal regulations, international agreements).
2. Setting Data as Panel in Stata
When we work with panel data in Stata, we need to set the data as a panel first.
We will use an example dataset throughout this tutorial. To get the example dataset, type the following command in the Stata command window:
use https://dss.princeton.edu/training/Panel101_new.dta
To set the data as a panel, type:
xtset country year
Stata will give us the following message:
. xtset country year
Panel variable: country (strongly balanced)
Time variable: year, 2011 to 2020
Delta: 1 unit
The term “(strongly balanced)” means that all countries have data for all years. If, for example, a country is missing data for one or more years, then the data is unbalanced. Ideally, you would want a balanced dataset, but this is not always possible. You can still estimate the models in this guide with unbalanced data.
NOTE: If you get the following error after using xtset:
string variables not allowed in varlist;
country is a string variable
You need to convert ‘country’ to numeric. To do this, type:
encode country, gen(country1)
Now you have to use ‘country1’ instead of ‘country’ in the xtset declaration. That means you have to type:
xtset country1 year
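As a minimal sketch of the unbalanced case, the small Angola/Brazil/China table from the introduction can be typed in by hand, encoded, and declared as a panel. The variable name id is an assumption made for this illustration. Because Angola has no 2019 row, xtset should report the panel as unbalanced rather than strongly balanced.

```stata
* Enter the small example table by hand (values copied from the introduction)
clear
input str6 entity int time y x1 x2 x3
"Angola" 2018 13 6 0.5 26
"Angola" 2020 12 7 0.9 18
"Brazil" 2018 16 5 0.4 16
"Brazil" 2019 11 3 0.5 19
"Brazil" 2020 14 4 0.7 21
"China"  2018 11 8 0.8 14
"China"  2019 18 2 0.6 17
"China"  2020 10 5 0.2 21
end

* entity is a string, so create a numeric id (hypothetical name) for xtset
encode entity, gen(id)
xtset id time

* Show which years are observed for each entity
xtdescribe
```

The xtdescribe table lists the pattern of observed years per entity, which is a quick way to see where the gaps are.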
Data inspection
After setting the data as a panel, you can use the xtline command to visualize your variables. For instance, if you want to check how the dependent variable (y) varies over the years across entities (e.g., country A, B, C, etc.), type:
xtline y
Stata will draw a separate line plot of y over time for each country.
We usually use the command sum to get summary statistics. But for panel data, we can use xtsum, which has some advantages. For instance, it reports (i) the variation between different cross-sectional units (e.g., countries) and (ii) the variation within each cross-sectional unit over time. To get the summary statistics for selected variables in our dataset, type:
xtsum y x1
Stata will give us the following table:
Variable | Mean Std. dev. Min Max | Observations
-----------------+--------------------------------------------+----------------
y overall | 1.85e+09 3.02e+09 -7.86e+09 8.94e+09 | N = 70
between | 1.31e+09 2.14e+08 3.64e+09 | n = 7
within | 2.75e+09 -6.74e+09 7.23e+09 | T = 10
| |
x1 overall | .6480006 .46807 -.5675749 1.446412 | N = 70
between | .3815777 .1929673 1.239071 | n = 7
within | .3041045 -.5450825 1.50364 | T = 10
Interpretation for the variable x1:
Overall: The average value of x1 across the entire dataset is 0.648, the standard deviation is 0.468, and the values of x1 range from -0.568 to 1.446 across all observations.
Between: The standard deviation of 0.382 shows how much x1 varies between the cross-sectional units.
Within: The standard deviation of 0.304 indicates the variation of x1 values within each country over the 10 time periods.
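To make the between/within split concrete, the sketch below rebuilds the two pieces by hand for x1. The variable names (x1_cmean, x1_gmean, x1_within, one_per_country) are made up for this illustration. It assumes the example dataset is balanced with no missing x1, as the xtsum output suggests. Under that assumption the standard deviations reported by summarize should come out close to the between and within values in the xtsum table.

```stata
use https://dss.princeton.edu/training/Panel101_new.dta, clear

* Country means and the overall (grand) mean of x1
bysort country: egen double x1_cmean = mean(x1)
egen double x1_gmean = mean(x1)

* Between variation: spread of the country means, one value per country
egen byte one_per_country = tag(country)
summarize x1_cmean if one_per_country

* Within variation: deviations from the country mean,
* re-centered on the grand mean so the values stay on the scale of x1
generate double x1_within = x1 - x1_cmean + x1_gmean
summarize x1_within
```

Re-centering on the grand mean is why the within minimum and maximum in xtsum are on the same scale as x1 rather than centered on zero.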
3. Estimating Panel Data Models in Stata
This guide discusses two basic methods we commonly use to analyze panel data:
- Fixed Effects Method
- Random Effects Method
3.1. Estimating the Fixed Effects Model in Stata
When using FE, we assume that something within the individual or entity may affect or bias the predictor or outcome variables, and we need to control for this. The FE model removes the effect of each entity's time-invariant characteristics so that we can assess the net effect of the predictors on the outcome variable.
The FE regression model has n different intercepts, one for each entity. These intercepts can be represented by a set of binary variables, and these binary variables absorb the influences of all omitted variables that differ from one entity to the next but are constant over time.
Estimation
(i) Accounting for Entity Fixed Effects
Entity fixed effects account for unobserved heterogeneity across entities (e.g., individuals, firms, countries) that is constant over time but varies between entities. This is the most frequently used model in panel data analysis. Follow the steps below to estimate an entity specific fixed effects model in Stata.
- First, get the example data (ignore this step if you have already opened the dataset in the previous section)
use https://dss.princeton.edu/training/Panel101_new.dta, clear
- Set the dataset as a panel using xtset (ignore this step if you have already set the dataset as a panel)
xtset country year
- Use the following command to estimate your entity specific fixed effects model
xtreg y x1 x2, fe
Note: the fe option tells Stata to estimate a fixed effects model.
Stata will give us the following results:
Fixed-effects (within) regression Number of obs = 70
Group variable: country Number of groups = 7
R-squared: Obs per group:
Within = 0.0903 min = 10
Between = 0.0546 avg = 10.0
Overall = 0.0000 max = 10
F(2,61) = 3.03
corr(u_i, Xb) = -0.8561 Prob > F = 0.0557
------------------------------------------------------------------------------
y | Coefficient Std. err. t P>|t| [95% conf. interval]
-------------+----------------------------------------------------------------
x1 | 2.23e+09 1.13e+09 1.97 0.053 -2.86e+07 4.50e+09
x2 | 2.05e+09 2.00e+09 1.02 0.310 -1.95e+09 6.06e+09
_cons | 1.23e+08 7.99e+08 0.15 0.878 -1.48e+09 1.72e+09
-------------+----------------------------------------------------------------
sigma_u | 3.070e+09
sigma_e | 2.794e+09
rho | .54680874 (fraction of variance due to u_i)
------------------------------------------------------------------------------
F test that all u_i=0: F(6, 61) = 3.14 Prob > F = 0.0095
- The coefficient of x1 indicates how much y changes, on average within a country over time, when x1 increases by one unit, holding x2 constant. Because the p-value (0.053) is not less than 0.05, the effect is not statistically significant at the 5% level.
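The word "within" in the output header refers to how the estimator works: each variable is expressed as a deviation from its own country mean, which removes anything constant within a country. The sketch below does this by hand under the assumption that the example dataset is loaded as above. The names ending in _bar and _dm are made up for this illustration. The slope coefficients from the hand-demeaned regression should match those from xtreg, fe. The standard errors differ somewhat because regress does not account for the seven country means that were estimated.

```stata
use https://dss.princeton.edu/training/Panel101_new.dta, clear
xtset country year

* Subtract each country's own mean from y, x1, and x2 (the within transformation)
foreach v of varlist y x1 x2 {
    bysort country: egen double `v'_bar = mean(`v')
    generate double `v'_dm = `v' - `v'_bar
}

* OLS on the demeaned data; no constant because demeaning removes it
regress y_dm x1_dm x2_dm, noconstant

* For comparison: the built-in fixed effects estimator
xtreg y x1 x2, fe
```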
(ii) Accounting for both Entity and Time Fixed Effects
Including both entity (e.g., individuals, firms, states, countries, etc.) and time fixed effects controls for both entity-specific unobserved heterogeneity and common time-specific shocks or trends. Estimate the following model when you want to account for both entity and time specific fixed effects.
xtreg y x1 x2 i.year, fe
Note: combining i.year with the fe option accounts for both entity and time fixed effects.
Stata will give us the following results:
Fixed-effects (within) regression Number of obs = 70
Group variable: country Number of groups = 7
R-squared: Obs per group:
Within = 0.2435 min = 10
Between = 0.0333 avg = 10.0
Overall = 0.0304 max = 10
F(11, 52) = 1.52
corr(u_i, Xb) = -0.7611 Prob > F = 0.1520
------------------------------------------------------------------------------
y | Coefficient Std. err. t P>|t| [95% conf. interval]
-------------+----------------------------------------------------------------
x1 | 1.19e+09 1.34e+09 0.88 0.381 -1.51e+09 3.88e+09
x2 | 1.81e+09 2.06e+09 0.88 0.383 -2.32e+09 5.95e+09
|
year |
2012 | 1.77e+08 1.51e+09 0.12 0.908 -2.86e+09 3.21e+09
2013 | 1.42e+08 1.55e+09 0.09 0.927 -2.97e+09 3.25e+09
2014 | 2.92e+09 1.51e+09 1.93 0.059 -1.10e+08 5.94e+09
2015 | 2.67e+09 1.68e+09 1.59 0.118 -7.02e+08 6.03e+09
2016 | 1.03e+09 1.57e+09 0.66 0.514 -2.12e+09 4.19e+09
2017 | 1.67e+09 1.64e+09 1.02 0.312 -1.61e+09 4.95e+09
2018 | 3.01e+09 1.63e+09 1.85 0.071 -2.61e+08 6.28e+09
2019 | 5.02e+08 1.60e+09 0.31 0.755 -2.71e+09 3.71e+09
2020 | 1.24e+09 1.52e+09 0.82 0.419 -1.81e+09 4.28e+09
|
_cons | -5.02e+08 1.12e+09 -0.45 0.655 -2.74e+09 1.74e+09
-------------+----------------------------------------------------------------
sigma_u | 2.899e+09
sigma_e | 2.760e+09
rho | .52454861 (fraction of variance due to u_i)
------------------------------------------------------------------------------
F test that all u_i=0: F(6, 52) = 2.53 Prob > F = 0.0318
- The coefficient of x1 indicates how much y changes, on average within a country over time, when x1 increases by one unit, holding all other variables constant. As the p-value (0.381) is not less than 0.05, the effect is not statistically significant.
- The coefficients for the year dummies show the effect of each year relative to the omitted reference year (2011). For instance, in 2014, y was 2.92e+09 units higher than in the reference year, holding x1 and x2 constant. With a p-value of 0.059, this difference is significant at the 10% level but not at the 5% level.
Test whether We Need to Include Time Fixed Effects
We can check whether we need to include time fixed effects in our model by using the command testparm after estimation. The command performs a joint F-test of whether the coefficients on all year dummies are equal to zero. Type the following commands:
xtreg y x1 x2 i.year, fe
testparm i.year
Stata will give us the following outputs:
( 1) 2012.year = 0
( 2) 2013.year = 0
( 3) 2014.year = 0
( 4) 2015.year = 0
( 5) 2016.year = 0
( 6) 2017.year = 0
( 7) 2018.year = 0
( 8) 2019.year = 0
( 9) 2020.year = 0
F( 9, 52) = 1.17
Prob > F = 0.3333
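Because Prob > F (0.3333) is greater than 0.05, we fail to reject the null hypothesis that the coefficients on all year dummies are jointly equal to zero, so time fixed effects are not needed in this model.
For more information, type help testparm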
(iii) Estimating Fixed Effects using the Least Squares Dummy Variable (LSDV) Approach
When there is only a small number of fixed effects to estimate, it is convenient to run a dummy variable regression for an FE model.
- Use the following dataset (ignore this step if you have already opened the dataset for the previous section)
use https://dss.princeton.edu/training/Panel101_new.dta, clear
- Declare the dataset as a panel using xtset (ignore this step if you have already declared the dataset as a panel)
xtset country year
- Use the following command to estimate your fixed effects model
reg y x1 x2 i.country
Stata will give us the following results:
Source | SS df MS Number of obs = 70
-------------+---------------------------------- F(8, 61) = 2.42
Model | 1.5096e+20 8 1.8870e+19 Prob > F = 0.0245
Residual | 4.7634e+20 61 7.8088e+18 R-squared = 0.2406
-------------+---------------------------------- Adj R-squared = 0.1411
Total | 6.2729e+20 69 9.0912e+18 Root MSE = 2.8e+09
------------------------------------------------------------------------------
y | Coefficient Std. err. t P>|t| [95% conf. interval]
-------------+----------------------------------------------------------------
x1 | 2.23e+09 1.13e+09 1.97 0.053 -2.86e+07 4.50e+09
x2 | 2.05e+09 2.00e+09 1.02 0.310 -1.95e+09 6.06e+09
|
country |
B | -6.77e+09 4.88e+09 -1.39 0.171 -1.65e+10 2.99e+09
C | -1.44e+09 1.96e+09 -0.74 0.464 -5.36e+09 2.47e+09
D | -2.93e+09 5.24e+09 -0.56 0.578 -1.34e+10 7.55e+09
E | -6.54e+09 5.10e+09 -1.28 0.204 -1.67e+10 3.65e+09
F | 6.14e+08 1.38e+09 0.44 0.659 -2.15e+09 3.38e+09
G | -3.32e+08 2.12e+09 -0.16 0.876 -4.56e+09 3.90e+09
|
_cons | 2.61e+09 1.94e+09 1.34 0.184 -1.27e+09 6.49e+09
------------------------------------------------------------------------------
Notice that the estimated coefficients for x1 and x2 are the same for both the "Entity Fixed Effects" method and the "Least Squares Dummy Variable (LSDV)" method.
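One way to see this equivalence side by side is to store each set of estimates and tabulate them together. The sketch below also adds areg, Stata's built-in command for absorbing a single categorical variable, as a third route to the same slopes. The stored names (fe_xtreg, fe_lsdv, fe_areg) are made up for this illustration.

```stata
use https://dss.princeton.edu/training/Panel101_new.dta, clear
xtset country year

* Entity fixed effects via the within estimator
quietly xtreg y x1 x2, fe
estimates store fe_xtreg

* Same model with explicit country dummies (LSDV)
quietly regress y x1 x2 i.country
estimates store fe_lsdv

* Same model with the country dummies absorbed rather than displayed
quietly areg y x1 x2, absorb(country)
estimates store fe_areg

* Coefficients and standard errors for x1 and x2 in one table
estimates table fe_xtreg fe_lsdv fe_areg, keep(x1 x2) b(%10.3e) se(%10.3e)
```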
(iv) Estimating Entity Fixed Effects Models using ' reghdfe '
If you have a large number of fixed effects relative to the number of observations, use the reghdfe command as it is computationally more efficient. The reghdfe command is also useful if you need to account for multiple fixed effects in your models.
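Note that reghdfe is a user-written command and does not ship with Stata. A minimal sketch for installing it from SSC, together with the ftools package it depends on, is shown below. It assumes an internet connection and only installs when the command is not already found.

```stata
* Install reghdfe and its dependency ftools from SSC if not already present
capture which reghdfe
if _rc {
    ssc install ftools, replace
    ssc install reghdfe, replace
}
```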
- Use the following dataset provided by Stata
webuse nlswork, clear
- Use the following command to estimate a model with fixed effects for the entity identifier idcode
reghdfe ln_wage age ttl_exp tenure not_smsa south, absorb(idcode)
Stata will give us the following results:
(dropped 552 singleton observations)
(MWFE estimator converged in 1 iterations)
HDFE Linear regression Number of obs = 27,541
Absorbing 1 HDFE group F( 5, 23389) = 819.94
Prob > F = 0.0000
R-squared = 0.6741
Adj R-squared = 0.6162
Within R-sq. = 0.1491
Root MSE = 0.2948
------------------------------------------------------------------------------
ln_wage | Coefficient Std. err. t P>|t| [95% conf. interval]
-------------+----------------------------------------------------------------
age | -.0026787 .000863 -3.10 0.002 -.0043703 -.0009871
ttl_exp | .0287709 .0014474 19.88 0.000 .0259339 .0316079
tenure | .0114355 .0009229 12.39 0.000 .0096265 .0132445
not_smsa | -.0921689 .0096641 -9.54 0.000 -.1111112 -.0732266
south | -.0633396 .0110819 -5.72 0.000 -.0850608 -.0416184
_cons | 1.592607 .0186681 85.31 0.000 1.556016 1.629198
------------------------------------------------------------------------------
Absorbed degrees of freedom:
-----------------------------------------------------+
Absorbed FE | Categories - Redundant = Num. Coefs |
-------------+---------------------------------------|
idcode | 4147 0 4147 |
-----------------------------------------------------+
- The negative coefficient for age indicates that, holding other variables constant, an increase in age is associated with a decrease in the natural log of wages (ln_wage). The p-value of 0.002 suggests that the effect is statistically significant at the 1% level.
- The positive coefficient for ttl_exp suggests that an increase in total work experience is associated with an increase in ln_wage. The effect is statistically significant at the 1% level.
- The negative coefficient for south indicates that being in the South region is associated with a decrease in ln_wage. The effect is statistically significant at the 1% level.
- Absorbed FE: the idcode row (4147 categories) indicates that the model accounts for individual-specific effects for 4,147 different idcode categories, controlling for their impact on the dependent variable.
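As a cross-check, the same entity fixed effects model can be estimated with the built-in xtreg command. This is a sketch assuming the nlswork data from above. The slope coefficients should match the reghdfe output because singleton groups contribute no within-group variation. The number of observations and the standard errors can differ slightly because xtreg keeps the 552 singleton observations that reghdfe drops.

```stata
webuse nlswork, clear
xtset idcode year

* Entity fixed effects with the built-in within estimator
xtreg ln_wage age ttl_exp tenure not_smsa south, fe
```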
(v) Estimating Entity and Time Fixed Effects Models using ' reghdfe '
- If you have a large number of fixed effects relative to the number of observations, and want to account for both entity and time fixed effects, use the following command:
reghdfe ln_wage age ttl_exp tenure not_smsa south, absorb(idcode year)
Stata will give us the following results:
(dropped 552 singleton observations)
(MWFE estimator converged in 8 iterations)
HDFE Linear regression Number of obs = 27,541
Absorbing 2 HDFE groups F( 5, 23375) = 261.99
Prob > F = 0.0000
R-squared = 0.6762
Adj R-squared = 0.6185
Within R-sq. = 0.0531
Root MSE = 0.2939
------------------------------------------------------------------------------
ln_wage | Coefficient Std. err. t P>|t| [95% conf. interval]
-------------+----------------------------------------------------------------
age | .0114497 .0099824 1.15 0.251 -.0081165 .0310159
ttl_exp | .0323758 .0015046 21.52 0.000 .0294266 .035325
tenure | .0104689 .0009264 11.30 0.000 .0086531 .0122847
not_smsa | -.0914148 .0096386 -9.48 0.000 -.1103071 -.0725225
south | -.0640471 .0110539 -5.79 0.000 -.0857134 -.0423808
_cons | 1.161841 .290702 4.00 0.000 .5920463 1.731636
------------------------------------------------------------------------------
Absorbed degrees of freedom:
-----------------------------------------------------+
Absorbed FE | Categories - Redundant = Num. Coefs |
-------------+---------------------------------------|
idcode | 4147 0 4147 |
year | 15 1 14 |
-----------------------------------------------------+
- The positive coefficient for age indicates that, holding other variables constant, an increase in age is associated with an increase in the natural log of wages (ln_wage). However, the effect is not statistically significant, as the p-value (0.251) is not less than 0.05.
- The positive coefficient for ttl_exp suggests that an increase in total work experience is associated with an increase in ln_wage. The effect is statistically significant at the 1% level.
- The negative coefficient for south indicates that being in the South region is associated with a decrease in ln_wage. The effect is statistically significant at the 1% level.
Absorbed FE
- The idcode row (4147 categories) indicates that the model accounts for individual-specific effects for 4,147 different idcode categories, controlling for their impact on the dependent variable.
- The year row indicates that 15 year categories were included, with 1 redundant (likely due to collinearity with other fixed effects), leaving 14 year fixed effects in the model.
Notes
Including a lagged dependent variable as a regressor in a fixed effects model can introduce bias, a problem often referred to as the "Nickell bias" or "dynamic panel bias." The bias arises because, after the within transformation, the demeaned lagged dependent variable is correlated with the demeaned error term, violating the strict exogeneity assumption required for consistent fixed effects estimation. The problem is most severe when the number of time periods is small. In this case, dynamic panel estimators based on the generalized method of moments (GMM), such as the Arellano-Bond estimator, can provide consistent estimates.
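For orientation only, here is a minimal sketch of an Arellano-Bond estimation using Stata's built-in xtabond command. It uses the abdata example dataset that ships with Stata rather than the tutorial dataset, and the choice of regressors is an assumption made purely for illustration, not a recommended specification.

```stata
* Illustration on Stata's Arellano-Bond example data (not the tutorial dataset)
webuse abdata, clear
xtset id year

* n = log employment, w = log wage, k = log capital
* lags(1) includes one lag of the dependent variable as a regressor
xtabond n w k, lags(1) vce(robust)
```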
3.2. Estimating the Random Effects Model in Stata
If individual or entity-specific effects are strictly uncorrelated with the regressors, it may be appropriate to model the individual or entity-specific constant terms as randomly distributed across cross-sectional units. This view would be appropriate if we believe that sampled cross-sectional units were drawn from a large population.
An advantage of using the random effects method is that you can include time-invariant variables (e.g., geographical contiguity, distance between states) in your model. In the fixed effects model, these variables are absorbed by the entity-specific intercepts, so their coefficients cannot be estimated.
Estimation
- Use the following dataset
use https://dss.princeton.edu/training/Panel101_new.dta, clear
- Declare the dataset as a panel using xtset
xtset country year
- Use the following command to estimate your random effects model
xtreg y x1 x2, re
Note: the use of re option indicates that we are estimating a random effects model.
Stata will give us the following results:
. xtreg y x1 x2, re
Random-effects GLS regression Number of obs = 70
Group variable: country Number of groups = 7
R-squared: Obs per group:
Within = 0.0803 min = 10
Between = 0.2333 avg = 10.0
Overall = 0.0055 max = 10
Wald chi2(2) = 2.24
corr(u_i, X) = 0 (assumed) Prob > chi2 = 0.3261
------------------------------------------------------------------------------
y | Coefficient Std. err. z P>|z| [95% conf. interval]
-------------+----------------------------------------------------------------
x1 | 1.46e+09 9.78e+08 1.50 0.134 -4.53e+08 3.38e+09
x2 | 2.44e+08 4.20e+08 0.58 0.562 -5.80e+08 1.07e+09
_cons | 8.64e+08 8.48e+08 1.02 0.308 -7.98e+08 2.53e+09
-------------+----------------------------------------------------------------
sigma_u | 1.070e+09
sigma_e | 2.794e+09
rho | .12789303 (fraction of variance due to u_i)
------------------------------------------------------------------------------
The coefficient of x1 indicates the average effect of a one-unit increase in x1 on y, holding x2 constant. Unlike the fixed effects estimate, it draws on both the variation within countries over time and the variation between countries. As the p-value (0.134) is not less than 0.05, the effect is not statistically significant.
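The point about time-invariant variables can be illustrated with a made-up regressor. In the sketch below, cid and groupA are hypothetical variables invented for this example (groupA simply flags the first three countries) and have no substantive meaning. Under fixed effects Stata should omit groupA because it is collinear with the country effects, while under random effects it receives its own coefficient.

```stata
use https://dss.princeton.edu/training/Panel101_new.dta, clear
xtset country year

* Hypothetical time-invariant indicator, invented purely for illustration
egen int cid = group(country)
generate byte groupA = (cid <= 3)

* Fixed effects: groupA does not vary within a country, so it is omitted
xtreg y x1 x2 groupA, fe

* Random effects: groupA is kept and gets a coefficient
xtreg y x1 x2 groupA, re
```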
4. Fixed Effects or Random Effects?
Hausman Test
- Use the Hausman test to decide whether to use a fixed effects or random effects model.
- Run a fixed effects model and save the estimates
- Run a random effects model and save the estimates
- Perform the Hausman test
- Use the following Stata commands
xtreg y x1 x2, fe
estimates store fixed
xtreg y x1 x2, re
estimates store random
hausman fixed random
Stata will give us the following results:
. hausman fixed random
---- Coefficients ----
| (b) (B) (b-B) sqrt(diag(V_b-V_B))
| fixed random Difference Std. err.
-------------+----------------------------------------------------------------
x1 | 2.23e+09 1.46e+09 7.69e+08 5.68e+08
x2 | 2.05e+09 2.44e+08 1.81e+09 1.96e+09
------------------------------------------------------------------------------
b = Consistent under H0 and Ha; obtained from xtreg.
B = Inconsistent under Ha, efficient under H0; obtained from xtreg.
Test of H0: Difference in coefficients not systematic
chi2(2) = (b-B)'[(V_b-V_B)^(-1)](b-B)
= 5.99
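Prob > chi2 = 0.0500
The null hypothesis is that the difference in coefficients is not systematic, in which case the random effects model is preferred. If Prob > chi2 is less than 0.05, we reject the null and use the fixed effects model. Here the p-value of 0.0500 sits right at the 5% threshold, so the result is borderline.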
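The test statistic and p-value are also available as stored results after hausman, which is convenient in do-files. The sketch below assumes the same example dataset and the 5% threshold used throughout this guide. The messages it prints are made up for this illustration.

```stata
use https://dss.princeton.edu/training/Panel101_new.dta, clear
xtset country year

quietly xtreg y x1 x2, fe
estimates store fixed
quietly xtreg y x1 x2, re
estimates store random

* hausman leaves the statistic, degrees of freedom, and p-value in r()
quietly hausman fixed random
display "chi2(" r(df) ") = " %6.2f r(chi2) ",  Prob > chi2 = " %6.4f r(p)

if r(p) < 0.05 {
    display "Difference is systematic at the 5% level: prefer fixed effects"
}
else {
    display "No systematic difference at the 5% level: random effects is acceptable"
}
```

With a p-value this close to the threshold, it is worth reporting both sets of estimates rather than relying on the printed message alone.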