Understanding the Fixed Effects Regression Model

And a Python tutorial on how to build and train a Fixed Effects model for a real-world panel data set

By Sachin Date, published in Towards Data Science, Feb 14, 2022

The Fixed Effects regression model is used to estimate the effect of intrinsic characteristics of individuals in a panel data set. Examples of such intrinsic characteristics are genetics, acumen and cultural factors. Such factors are not directly observable or measurable but one needs to find a way to estimate their effects since leaving them out leads to a sub-optimally trained regression model. The Fixed Effects model is designed to address this problem.

This article is PART 2 of the following three part series on Panel Data Analysis:

1. How to Build A Pooled OLS Regression Model For Panel Data Sets
2. Understanding the Fixed Effects Regression Model
3. The No-Nonsense Guide to the Random Effects Regression Model

A primer on panel data

A panel data set contains data that is collected over a certain number of time periods for one or more uniquely identifiable “units”. Examples of units are animals, persons, trees, lakes, corporations and countries. A data panel is called a balanced or an unbalanced panel depending on whether or not all units are tracked for the same number of time periods. If the same set of units is tracked throughout the study, it’s called a fixed panel but if the units change during the study, it’s called a rotating panel.

Panel data sets usually arise out of longitudinal studies. The Framingham Heart Study is possibly the most well known example of a longitudinal study that has been running since 1948.

In this article, we’ll look at a real world panel data set containing the Year-over-Year % growth in per capita GDP of seven countries measured from 1992 through 2014.
Along with GDP growth data, the panel also contains Y-o-Y % growth in Gross Capital Formation in each country:

A panel data set (Source: World Development Indicators data under CC BY 4.0 license) (Image by Author)

In the above data set, the unit is a country, the time frame is 1992 through 2014 (23 time periods), and the panel data is fixed and balanced.

The set of data points pertaining to one unit (one country) is called a group. In the above data panel, there are seven groups.

Suppose we wish to investigate the influence of Y-o-Y % growth in gross capital formation on Y-o-Y % growth in GDP.

Our dependent or response variable y is Y-o-Y % growth in per capita GDP. The independent or explanatory variable X is Y-o-Y % growth in gross capital formation.

In notation form, the Y-o-Y % growth in per capita GDP can be expressed as a function of Y-o-Y % growth in gross capital formation as follows:

GDP growth of country i at time period t as a function of Gross Capital Formation growth in country i at time period t

In the above regression equation, ϵ_i_t is the error term of the regression and it captures the variance in Y-o-Y growth in per capita GDP of country i during year t that the model isn’t able to “explain”.

Let’s create a scatter plot of y versus X to see what the data looks like.

We’ll start by importing all the required Python packages including ones we would use later on to construct the Fixed Effects model.

import pandas as pd
import scipy.stats as st
import statsmodels.api as sm
import statsmodels.formula.api as smf
from matplotlib import pyplot as plt
import seaborn as sns

Let’s load the data set into a Pandas Dataframe.
The data set is available for download over here.

df_panel = pd.read_csv('wb_data_panel_2ind_7units_1992_2014.csv', header=0)

We’ll use Seaborn to plot per capita GDP growth across all time periods and across all countries versus gross capital formation growth in each country:

colors = ['blue', 'red', 'orange', 'lime', 'yellow', 'cyan', 'violet']
sns.scatterplot(x=df_panel['GCF_GWTH_PCNT'], y=df_panel['GDP_PCAP_GWTH_PCNT'], hue=df_panel['COUNTRY'], palette=colors).set(title='Y-o-Y % Change in per-capita GDP versus Y-o-Y % Change in Gross capital formation')
plt.show()

We see the following plot:

Country-wise scatter plot of Y-o-Y % growth in GDP versus Y-o-Y % growth in gross capital formation (Image by Author)

The Y-o-Y % growth in per capita GDP appears to be linearly related to the Y-o-Y % growth in gross capital formation, so we’ll assume the following linear functional form for our regression model for each unit (country) i:

y_i = X_i β_i + ϵ_i

A linear model for country i (Image by Author)

In the above equation, all variables are matrices of a certain dimension. Assuming n units, k regression variables per unit, and T time periods per unit, the dimensions of each matrix variable in the above equation are as follows:

y_i is the response variable (per capita GDP growth) for unit i. It is a column vector of size [T x 1].
X_i is the regression variables matrix of size [T x k].
β_i is the coefficients matrix of size [k x 1] containing the population value of the coefficients for the k regression variables in X_i.
ϵ_i is a column vector of size [T x 1] containing the error terms, one error for each of the T time periods.

Following is the matrix form of the above equation for unit i:

The matrix form of the linear regression model for country i (Image by Author)

In our example, T=23, k=1 and n=7.

Let’s focus our attention on the error terms of the model, ϵ_i. The following are the important sources of errors:

1. Errors are introduced due to random environmental noise, or by the measuring apparatus.
Measurement errors are also introduced when the experimenter uses the measuring apparatus incorrectly.

2. Errors are introduced due to the omission of explanatory variables which were observable and measurable. These variables would have been able to ‘explain’ some of the variance in the response variable y, and therefore their omission from the X matrix causes the unexplained variance to ‘leak’ into the error term of the regression model.

3. Errors are introduced due to an incorrect functional form or missing variable transformations for some of the regression variables or for the response variable. For example, suppose we need to regress the logarithm of GDP change on the gross capital formation change but we fail to log transform the response variable.

4. There is always the possibility that our choice of regression model is wrong. For example, if the correct model happens to be the Nonlinear Least Squares model but instead we use the OLS linear regression model, it would lead to additional regression errors.

5. Finally, there will be errors introduced due to the omission of variables that are not measurable. Such variables represent qualities that are intrinsic to the unit being measured. For our countries data panel where the unit is the country, examples of unit-specific variables could be the socioeconomic fabric of the country that fuels or inhibits GDP growth under different environmental circumstances, and cultural aspects of decision-making in business and government that have evolved over hundreds of years in that country. All such factors impact the Y-o-Y % change in GDP but they cannot be directly measured.
However, the omission of such factors from the regression matrix X has the same effect as in (2), that is, their effect leaks into additional variance observed in the error term.

Keeping the above commentary in mind, we can express the general form of the linear regression model for country i as follows:

y_i = X_i β_i + Z_i γ_i + ε_i

The general form of the linear model for country i (Image by Author)

In the above equation:

y_i is a matrix of size [T x 1] containing the T observations for country i.
X_i is a matrix of size [T x k] containing the values of k regression variables all of which are observable and relevant.
β_i is a matrix of size [k x 1] containing the population (true) values of regression coefficients for the k regression variables.
Z_i is a matrix of size [T x m] containing the (theoretical) values of all the variables (m in number) and effects that cannot be directly observed.
γ_i is a matrix of size [m x 1] containing the (theoretical) population values of regression coefficients for the m unobservable variables.
ε_i is a matrix of size [T x 1] containing the errors corresponding to the T observations for country i.

Here is what the matrix multiplications and additions look like:

The general form of the linear model for country i in matrix format (Image by Author)

All unit-specific effects are assumed to be introduced by the term Z_iγ_i. The matrix Z_i and its coefficients vector γ_i are purely theoretical terms since what they represent cannot in reality be observed and measured.

Our objective is to find a way to estimate the impact of all unobservable effects contained in Z_i on y, i.e. we need to estimate the impact of the Z_iγ_i term of the regression equation on y_i.

To simplify the estimation, we’ll combine the effect of all country-specific unobservable effects into one variable which we will call z_i for country i.
z_i is a matrix of size [T x 1] since it contains only one variable z_i and it has T rows corresponding to T “measurements” of z_i for T time periods.

Since z_i is not directly observable, in order to measure the effects of z_i, we need to formalize the effect of leaving out z_i. Fortunately, there is a well-studied concept in statistics called the omitted-variable bias which we can use for this purpose.

Omitted variable bias

While training the model on the panel data set, if we leave out z_i from the model, it will cause what is known as the omitted variable bias. It can be shown that if the regression model is estimated without considering z_i, then the estimated values β_cap_i of the coefficients β_i will be biased as follows:

Omitted variable bias: The bias introduced in the estimate of β_i due to the omission of the variable z_i (Image by Author)

One can see that the bias introduced in the estimated value β_cap_i is proportional to the covariance between the omitted variable z_i and the explanatory variables X_i.

The above equation suggests an approach for constructing two kinds of models, the Fixed Effects model and the Random Effects model, depending on whether or not the Covariance term in the above equation is zero, i.e. whether or not the unobservable effects z_i are correlated with the regression variables.

In the rest of this article, we’ll focus on the Fixed Effects model, while in the next article of this series, I’ll explain how to build and train the Random Effects model.

The Fixed Effects Regression Model

In this model, we assume that the unobservable individual effects z_i are correlated with the regression variables. In effect, it means that the Covariance(X_i, z_i) in the above equation is non-zero.

In many panel data studies, this assumption about correlation is a reasonable one to make. For example, in a stock trading scenario, a trader’s trading acumen or “knack” for making a profit is unmeasurable and unique to that individual.
This acumen or knack can be presumed to vary with measurable factors such as age and education level. One may propose (rightly or wrongly) that the process of getting an advanced degree boosts one’s intrinsic acumen or knack at performing some task.

In the Fixed Effects model, we also assume that the bias introduced due to the omission of the unit-specific factors is group-specific.

To compensate for this bias, we will introduce a group-specific intercept called c_i into the model. c_i is assumed to act in a direction that is opposite (in a vector sense) to the effect of the omitted-variable bias.

With these two assumptions in place, we will express the Fixed Effects regression model’s equation as follows:

y_i = X_i β_i + c_i + ϵ_i

The Fixed Effects regression model (Image by Author)

Here is the matrix form:

The Fixed Effects regression model (Image by Author)

Notice that we have replaced the z_iγ_i term in the earlier equation, which represented the effect of the unobservable factors, with c_i which is a unit-specific matrix of size [T x 1]. For a given unit i, each element of this matrix has the same value c_i and c_i is assumed to be constant across all time periods.

For a particular time period t, the Fixed Effects model’s equation can be expressed as follows:

y_i_t = x_i_t β_i + c_i + ϵ_i_t

The Fixed Effects regression model for unit i at time period t (Image by Author)

Here’s the matrix form:

The Fixed Effects regression model for unit i at time period t (Image by Author)

In this form, y_i_t, c_i and ϵ_i_t are scalars as they pertain to a specific observation at time t, and x_i_t is the t-th row vector of size [1 x k] in the X_i matrix. We assume there are k regression variables represented in the X matrix.

Estimates c_cap_i of unit-specific effects c_i are random variables

Notice that c_i does not carry the time subscript t as it is the same for a given country for all T time periods.
Having said that, the estimated value c_cap_i of the country-specific effect c_i is just as much a random variable as any coefficient in the estimated coefficients matrix β_cap_i. To see why, imagine that the fixed effects model is trained hundreds of times, each time on a different, randomly chosen (but continuous) sub-set of the panel data set. After each training run, all estimated coefficients β_cap_i and the estimated unit-specific effect c_cap_i will attain a somewhat different set of values. If we plot all these estimated values of c_i from different training runs, their frequency distribution will have some shape having a certain mean value and some variance. For example, we may theorize that they are normally distributed around the true population level values of the respective coefficients in β_i and c_i. Thus, the estimated unit-specific effect c_cap_i behaves like a random variable having some probability distribution.

In the Fixed Effects model, we assume that the estimated values of all unit-specific effects have the same constant variance σ². It is also convenient (although not necessary) to assume a normally distributed c_cap_i. Thus, we have:

c_cap_i ~ N(c_i, σ²)

The following figure illustrates the probability distributions of c_i for three units in a hypothetical panel data set:

Probability distributions of the unit-specific effect c_i for three different units in a panel data set (Image by Author)

What if there are also some observable variables that are omitted?

In practice, the X matrix is often incomplete. One may have omitted one or more observable variables from the model for a variety of reasons. Perhaps the cost of measuring a variable w.r.t. its presumed effect on y is prohibitive. Perhaps there are moral reasons for not measuring some variable.
Or a variable may have been left out of X just out of plain oversight on the part of the experimenter.

In such a case, their omission will bias all the parameter estimates of the fitted model including the estimated value of the unit-specific factor c_i for all units.

Estimating the Fixed Effects regression model

Estimation of a Fixed Effects model involves estimating the coefficients β_i and the unit-specific effect c_i for each unit i.

In practice, we pool together the models of all units into one common regression model by adding unit-specific dummy variables d_1, d_2,…,d_n corresponding to the n units or groups as follows:

y_i_t = x_i_t β_i + d_i_t c_i + ϵ_i_t

The Fixed Effects model containing dummy variables (Image by Author)

In the above equation:

y_i_t is a scalar containing a specific observation for unit (country) i at time t.
x_i_t is a row vector of size [1 x k] containing the values of all k regression variables for unit i at time t.
β_i is a column vector of size [k x 1] containing the population (true) values of regression coefficients for the k regression variables.
d_i_t is a row vector of size [1 x n] containing one-hot-encoded dummy variables d_i_j_t, where j goes from 1 through n, i.e. one dummy variable for each of the n units in the data panel. For example: d = [0 1 0 0 0 0 0] is the dummies vector for unit #2. The idea is that the j-th element of the dummies vector should be 1 when j=i and 0 otherwise.
c_i is a column vector of size [n x 1] containing the population values of unit-specific effects associated with the n units.
ϵ_i_t is a scalar containing the error term of regression for unit i at time t.

The Fixed Effects model expressed in matrix notation (Image by Author)

The above model is a linear model and can be easily estimated using the OLS regression technique.
This type of linear regression model with dummy variables is called Least Squares with Dummy Variables (LSDV for short).

Model training involves doing the following:

1. Pool together the unit-specific matrices y_i, X_i, β_i, d_i, c_i and ϵ_i for all n units into one model.
2. Train the pooled model to generate estimates for the coefficients vector β of size [k x 1] corresponding to the k regression variables, and also the estimates for the unit-specific effects vector c of size [n x 1] for the n units contained in the data panel.

The common coefficients assumption

In the pooled model, we are making the implicit and important assumption that the estimated coefficients β_cap are common for all n units. The Chow test can be used to test this assumption (although we’ll not go into it here).

For the World Bank countries data panel, what the poolability assumption means is that the population value of the slope (β) of the gross capital formation change (GCF_GWTH_PCNT) for each country is the same. In other words, a unit change in GCF_GWTH_PCNT is expected to translate into the same amount of change in the % GDP for each country. Therefore, it is the country-specific effect c_i and the error term ϵ_i_t that are likely to cause the total % GDP change to vary across different countries for each unit change in GCF_GWTH_PCNT.

This behavior is a direct outcome of the common coefficients assumption and it happens to be an important but not immediately obvious characteristic of Fixed Effects models.

Here is the final thing to remember about the FE model before we dive into the tutorial section of the article:

The estimates generated from training the Fixed Effects regression model apply to only the units that are in the panel data set. The estimates from the Fixed Effects model do not generalize to other units of the same nature in the population.

What this means for the countries data panel is that the estimates of β and c_i apply to only the 7 countries in the data panel.
One should not generalize the country-specific effect c_cap_i that is estimated by training the FE model on the data set to represent in any way the country-specific effect for any country that is not represented in the data set.

If we want the unit-specific effects to carry through to the population of similar units, the Random Effects model (covered in the next article of this series) may be more suitable.

How to build a Fixed Effects regression model using Python and Statsmodels

Let us build and train a Fixed Effects model for the World Bank data panel.

We’ll continue using the Pandas Dataframe that we loaded at the beginning of the article. We will build and train the FE model on the flattened out version of the panel data set which looks like this:

Flattened panel data (Image by Author)

Notice that in this flattened version, there is a column for the unit (country) and one for the time period (year).

Printing out the Pandas Dataframe reveals this structure:

The Pandas Dataframe showing the first 30 rows of the World Bank data panel (Image by Author)

Let’s create the country-specific dummy variables:

unit_col_name = 'COUNTRY'
time_period_col_name = 'YEAR'
# Create the dummy variables, one for each country
df_dummies = pd.get_dummies(df_panel[unit_col_name])

Join the dummies Dataframe to the panel data set:

df_panel_with_dummies = df_panel.join(df_dummies)

Here’s what the data panel with dummies looks like:

Data panel with country-specif… truncated (12,214 more characters in archive)