Chapter 6. Panel data

Table of Contents
Dummy variables
Using lagged values with panel data
Pooled estimation
Illustration: the Penn World Table

Panel data (pooled cross-section and time-series) require special care. Here are some pointers.

Consider a data set composed of observations on each of n cross-sectional units (countries, states, persons or whatever) in each of T periods. Let each observation comprise the values of m variables of interest. The data set then contains mnT values.

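The data should be arranged "by observation": each row represents an observation; each column contains the values of a particular variable. The data matrix then has nT rows and m columns. That leaves open the matter of how the rows should be arranged. There are two possibilities.[1]

  1. Rows grouped by unit. The data matrix consists of n blocks of T rows each: the first block holds the observations on cross-sectional unit 1 for each period, the next block holds the observations on unit 2, and so on.

  2. Rows grouped by period. The data matrix consists of T blocks of n rows each: the first block holds the observations on every cross-sectional unit in period 1, the next block holds the observations for period 2, and so on.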

You may use whichever arrangement is more convenient. The first is perhaps easier to keep straight. If you use the second then of course you must ensure that the cross-sectional units appear in the same order in each of the period data blocks.
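
The two groupings differ only in which index varies fastest from row to row. As a rough illustration outside gretl, the following Python sketch lists the (unit, period) pair held by each row under either grouping. The function names and the toy dimensions n = 3, T = 2 are assumptions made for this example; nothing here is gretl code.

```python
def rows_grouped_by_unit(n, T):
    """Row order when all T periods of unit 1 come first, then unit 2, ..."""
    return [(unit, period)
            for unit in range(1, n + 1)
            for period in range(1, T + 1)]


def rows_grouped_by_period(n, T):
    """Row order when all n units of period 1 come first, then period 2, ..."""
    return [(unit, period)
            for period in range(1, T + 1)
            for unit in range(1, n + 1)]


n, T = 3, 2  # toy dimensions, for illustration only
by_unit = rows_grouped_by_unit(n, T)
by_period = rows_grouped_by_period(n, T)

print(by_unit)    # unit index changes slowly, period index quickly
print(by_period)  # period index changes slowly, unit index quickly
```

Both lists contain the same nT pairs; only the order differs. With grouping by period, the units appear in the same order inside every period block, which is the consistency requirement mentioned above.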

In either case you can use the frequency field in the observations line of the data header file (see Chapter 5) to make life a little easier: set the frequency to T if the rows are grouped by unit, or to n if they are grouped by period.

If you construct a panel data set in a spreadsheet program and then bring it into gretl as a CSV import, the program will probably not at first recognize the special nature of the data. You can fix this with the command setobs (see Chapter 10) or the GUI menu item "Sample, Set frequency, startobs…".

Dummy variables

In a panel study you may wish to construct dummy variables of one or both of the following sorts: (a) dummies as unique identifiers for the cross-sectional units, and (b) dummies as unique identifiers of the time periods. The former may be used to allow the intercept of the regression to differ across the units, the latter to allow the intercept to differ across periods.

You can use two special functions to create such dummies. These are found under the "Data, Add variables" menu in the GUI, or under the genr command in script mode or gretlcli.

  1. "periodic dummies" (script command genr dummy). The common use for this command is to create a set of periodic dummy variables up to the data frequency in a time-series study (for instance a set of quarterly dummies for use in seasonal adjustment). But it also works with panel data. Note that the interpretation of the dummies created by this command differs depending on whether the data rows are grouped by unit or by period. If the grouping is by unit (frequency T) the resulting variables are period dummies and there will be T of them. For instance dummy_2 will have value 1 in each data row corresponding to a period 2 observation, 0 otherwise. If the grouping is by period (frequency n) then n unit dummies will be generated: dummy_2 will have value 1 in each data row associated with cross-sectional unit 2, 0 otherwise.

  2. "panel dummies" (script command genr paneldum). This creates all the dummies, unit and period, at a stroke. The default presumption is that the data rows are grouped by unit. The unit dummies are named du_1, du_2 and so on, while the period dummies are named dt_1, dt_2, etc. The u (for unit) and t (for time) in these names will be wrong if the data rows are grouped by period: to get them right in that setting use genr paneldum -o (script mode only).
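
To make the two kinds of dummies concrete, here is a small Python sketch of the logic described above. It is not gretl's implementation: the function names, the 0-based row counter and the toy dimensions n = 3, T = 2 are assumptions for illustration, and only the variable names dummy_j, du_i and dt_t are taken from the text.

```python
def periodic_dummies(n_rows, freq):
    """Columns dummy_1..dummy_freq; dummy_j is 1 on every row whose
    position within its block of `freq` consecutive rows is j."""
    return {
        f"dummy_{j}": [1 if row % freq == j - 1 else 0 for row in range(n_rows)]
        for j in range(1, freq + 1)
    }


def panel_dummies(n, T):
    """Unit (du_*) and period (dt_*) dummies, assuming rows grouped by unit."""
    rows = range(n * T)
    dums = {f"du_{i}": [1 if row // T == i - 1 else 0 for row in rows]
            for i in range(1, n + 1)}
    dums.update({f"dt_{t}": [1 if row % T == t - 1 else 0 for row in rows]
                 for t in range(1, T + 1)})
    return dums


n, T = 3, 2  # toy dimensions, for illustration only

# Rows grouped by unit: frequency T, so the periodic dummies mark periods.
period_dums = periodic_dummies(n * T, T)
# Rows grouped by period: frequency n, so the same rule marks units.
unit_dums = periodic_dummies(n * T, n)
both = panel_dummies(n, T)

print(period_dums["dummy_2"])
print(unit_dums["dummy_2"])
print(both["du_2"], both["dt_1"])
```

The same modular rule yields period dummies in one layout and unit dummies in the other, which is why the interpretation of the periodic dummies depends on how the rows are grouped.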

If a panel data set has the YEAR of the observation entered as one of the variables you can create a periodic dummy to pick out a particular year, e.g. genr dum = (YEAR=1960). You can also create periodic dummy variables using the modulus operator, %. For instance, to create a dummy with value 1 for the first observation and every thirtieth observation thereafter, 0 otherwise, do


	genr index
	genr dum = ((index-1)%30) = 0
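
The arithmetic behind the modulus trick can be checked with a short Python sketch. It assumes, as the example above implies, that the index variable numbers the observations 1, 2, 3, and so on; the sample length of 90 is hypothetical.

```python
n_obs = 90  # hypothetical number of observations

# Stand-in for the index variable: 1, 2, ..., n_obs.
index = list(range(1, n_obs + 1))

# 1 where (index - 1) is an exact multiple of 30, 0 otherwise.
dum = [1 if (i - 1) % 30 == 0 else 0 for i in index]

# Observation numbers at which the dummy equals 1.
flagged = [i for i, d in zip(index, dum) if d == 1]
print(flagged)
```

Subtracting 1 before taking the remainder is what makes the first observation, rather than the thirtieth, the start of each cycle.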

Notes

[1] If you don't intend to make any conceptual or statistical distinction between cross-sectional and temporal variation in the data you can arrange the rows arbitrarily, but this is probably wasteful of information.