Multilevel Analysis


The datasets used in the book are briefly described below. To download the datasets in different file formats, along with some analysis outputs, please go to the following GitHub repository.

The popularity data in popular2.* are simulated data for 2000 pupils in 100 schools. The purpose is to offer a very simple example for multilevel regression analysis. The main outcome variable is the pupil popularity, a popularity rating on a scale of 1–10 derived by a sociometric procedure. Typically, a sociometric procedure asks all pupils in a class to rate all the other pupils, and then assigns the average received popularity rating to each pupil. Because of the sociometric procedure, group effects as apparent from higher-level variance components are rather strong. There is a second outcome variable: pupil popularity as rated by their teacher, on a scale from 1 to 10. The explanatory variables are pupil gender (boy = 0, girl = 1), pupil extraversion (10-point scale), and teacher experience in years.

The files nurses.* contain three-level simulated data from a hypothetical study on stress in hospitals. The data are from nurses working in wards nested within hospitals. It is a cluster-randomized experiment. In each of 25 hospitals, four wards are selected and randomly assigned to an experimental or a control condition. In the experimental condition, a training program is offered to all nurses to cope with job-related stress. After the program is completed, a sample of about 10 nurses from each ward is given a test that measures job-related stress. Additional variables are: nurse age (years), nurse experience (years), nurse gender (0 = male, 1 = female), type of ward (0 = general care, 1 = special care), and hospital size (0 = small, 1 = medium, 2 = large).

The GPA data are a longitudinal data set, where 200 college students have been followed for six consecutive semesters. The data are simulated. In this data set there are GPA measures taken on six consecutive occasions, with a job status variable (how many hours worked) for the same six occasions. There are two student-level explanatory variables: the gender (0 = male, 1 = female) and the high school GPA. There is also a dichotomous student-level outcome variable, which indicates whether a student has been admitted to the university of their choice. Since not every student applies to a university, this variable has many missing values. The outcome variable ‘admitted’ is not used in any of the examples in this book. These data come in several varieties. The basic data file is gpa2. In this file, the six measurement occasions are represented by separate variables. Some software packages (e.g., Prelis) use this format. Other multilevel software packages (HLM, MLwiN, MixReg, SAS) require that the separate measurement occasions are different data records. The GPA data arranged in this ‘long’ data format are in the data file gpa2long. A second data set based on the GPA data involves a process of panel attrition being simulated. Students were simulated to drop out, partly based on having a low GPA in the previous semester. This dropout process leads to data that are missing at random (MAR). A naive analysis of the incomplete data gives biased results. A sophisticated analysis using multilevel longitudinal modeling or SEM with the modern raw data likelihood (available in AMOS, Mplus and MX, and in recent versions of LISREL) should give unbiased results. Comparing analyses on the complete and the incomplete data sets gives an impression of the amount of bias. The incomplete data are in files gpa2miss and gpa2mislong.
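The difference between the two layouts can be sketched in a few lines of Python. This is only an illustration: the column names (student, sex, gpa1 … gpa3, job1 … job3) and the values are hypothetical and do not come from gpa2 or gpa2long, and only three occasions are shown instead of six.

```python
# Hypothetical wide-format records: one row per student, one column per occasion.
# Names and values are illustrative assumptions, not taken from the gpa2 file.
wide_rows = [
    {"student": 1, "sex": 1, "gpa1": 2.3, "gpa2": 2.1, "gpa3": 3.0,
     "job1": 2, "job2": 2, "job3": 2},
    {"student": 2, "sex": 0, "gpa1": 2.2, "gpa2": 2.5, "gpa3": 2.6,
     "job1": 2, "job2": 3, "job3": 2},
]


def wide_to_long(rows, n_occasions, varying=("gpa", "job"), fixed=("student", "sex")):
    """Return one record per student per measurement occasion."""
    long_rows = []
    for row in rows:
        for occasion in range(1, n_occasions + 1):
            # Student-level variables are repeated on every occasion record.
            record = {name: row[name] for name in fixed}
            record["occasion"] = occasion
            # Occasion-level variables are read from the numbered columns.
            for stem in varying:
                record[stem] = row[f"{stem}{occasion}"]
            long_rows.append(record)
    return long_rows


long_rows = wide_to_long(wide_rows, n_occasions=3)
for record in long_rows:
    print(record)
```

Each student contributes as many records to the long layout as there are occasions, with the student-level variables copied onto every record.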

The data in the SPSS file(s) curran*.sav are a data set constructed by Patrick Curran for a symposium ‘Comparing Three Modern Approaches to Longitudinal Data Analysis: An Examination of a Single Developmental Sample’ conducted at the 1997 Biennial Meeting of the Society for Research in Child Development. In this symposium, several different approaches to longitudinal modeling (latent growth curves, multilevel analysis, and mixture modeling) were compared and contrasted by letting experts analyze a single shared data set. This data set, hereafter called the curran data, was compiled by Patrick Curran from a large longitudinal data set. Supporting documentation and the original data files are available on the Internet.

The data are a sample of 405 children who were within the first two years of entry to elementary school. The data consist of four repeated measures of both the child’s antisocial behavior and the child’s reading recognition skills. In addition, on the first measurement occasion, measures were collected of emotional support and cognitive stimulation provided by the mother. The data were collected using face-to-face interviews of both the child and the mother at two-year intervals between 1986 and 1992.

The Thailand education data in file thaieduc are one of the example data sets that are included with the software HLM (also in the student version of HLM). They are discussed at length in the HLM user’s manual. They stem from a large survey of primary education in Thailand (Raudenbush & Bhumirat, 1992). The outcome variable is dichotomous, an indicator whether a pupil has ever repeated a class (0 = no, 1 = yes). The explanatory variables are pupil gender (0 = girl, 1 = boy), pupil pre-primary education (0 = no, 1 = yes), and the school’s mean SES. The example in Chapter 6 of this book uses only pupil gender as explanatory variable. There are 8582 cases in the file thaieduc, but school mean SES is missing in some cases; there are 7516 pupils with complete data. Note that these missing data have to be dealt with before the data are transported to a multilevel program. In the analysis in Chapter 6 they are simply removed using listwise deletion. However, the percentage of pupils with incomplete data is 12.4 percent, which is too large to be simply ignored in a real analysis.
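The 12.4 percent figure follows directly from the two case counts given above. The short Python sketch below redoes that arithmetic and shows what listwise deletion amounts to; the helper function, the variable names, and the three demonstration records are hypothetical and are not taken from the thaieduc file.

```python
total_cases = 8582      # cases in thaieduc, as stated above
complete_cases = 7516   # pupils with complete data, as stated above

removed = total_cases - complete_cases
percent_removed = 100 * removed / total_cases
print(removed, round(percent_removed, 1))  # 1066 12.4


def listwise_delete(records, required):
    """Keep only records in which none of the required variables is missing (None)."""
    return [r for r in records if all(r.get(name) is not None for name in required)]


# Hypothetical records; None marks a missing school mean SES.
demo = [
    {"repeat": 0, "sex": 1, "msesc": 0.12},
    {"repeat": 1, "sex": 0, "msesc": None},
    {"repeat": 0, "sex": 0, "msesc": -0.30},
]
kept = listwise_delete(demo, required=("repeat", "sex", "msesc"))
print(len(kept))  # 2
```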

The survey response data used to analyze proportions in Chapter 6 are from a meta-analysis by Hox and de Leeuw (1994). The basic data file is metaresp. This file contains an identification variable for each study located in the meta-analysis. A mode identification indicates the data collection mode (face-to-face, telephone, mail). The main response variable is the proportion of sampled respondents who participate. Different studies report different types of response proportions: we have the completion rate (the proportion of participants from the total initial sample) and the response rate (the proportion of participants from the sample without ineligible respondents, such as those who moved, are deceased, or have a nonexistent address). Obviously, the response rate is usually higher than the completion rate. The explanatory variables are the year of publication and the (estimated) saliency of the survey’s main topic. The file also contains the denominators for the completion rate and the response rate, if known. Since most studies report only one of the response figures, the variables ‘comp’ and ‘resp’ and the denominators have many missing values. Some software (e.g., MLwiN) expects the proportion of ‘successes’ and the denominator on which it is based; other software (e.g., HLM) expects the number of ‘successes’ and the corresponding denominator. The file contains the proportion only; the number of successes must be computed from the proportion if the software needs that.
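Recovering the number of successes is a matter of multiplying the proportion by its denominator and rounding to a whole number. A minimal Python sketch follows; the function name and the example values are hypothetical, and missing values are assumed to be represented as None.

```python
def successes_from_proportion(proportion, denominator):
    """Return the count of 'successes', or None if either input is missing."""
    if proportion is None or denominator is None:
        return None
    # Rounding absorbs the small error introduced when the proportion was stored.
    return round(proportion * denominator)


# Hypothetical study: a response rate of 0.75 based on 200 eligible sample members.
print(successes_from_proportion(0.75, 200))   # 150
# A study that did not report this rate yields a missing count.
print(successes_from_proportion(None, 200))   # None
```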

A sample of 100 streets is selected, and on each street a random sample of 10 persons is asked how often they feel unsafe while walking that street. The question about feeling unsafe is asked using three answer categories: 1 = never, 2 = sometimes, 3 = often. Person-level predictor variables are age and gender; street characteristics are an economic index (standardized Z-score) and a rating of the crowdedness of the street (7-point scale). File: Safety. Used in Chapter 7 on ordinal data.

The epilepsy data come from a study by Leppik et al. (1987). They have been analyzed by many authors, including Skrondal and Rabe-Hesketh (2004). The data come from a randomized controlled study on the effect of an anti-epileptic drug versus a placebo. It is a longitudinal design. For each patient the number of seizures was measured for a two-week baseline. Next, patients were randomized to the drug or the placebo condition. For four consecutive visits the clinic collected counts of epileptic seizures in the two weeks before the visit. The data set contains the following variables: count of seizures, treatment indicator, visit number, dummy for visit #4, log of age, log of baseline count. All predictors are grand mean centered. The data come from the GLLAMM homepage and are used in Chapter 7 on count data.
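Grand mean centering subtracts the overall mean of a predictor, computed across all observations, from every value of that predictor. The Python sketch below illustrates this for a logged age variable; the ages are hypothetical and are not taken from the epilepsy file.

```python
import math
from statistics import fmean


def grand_mean_center(values):
    """Subtract the overall (grand) mean from every value."""
    grand_mean = fmean(values)
    return [v - grand_mean for v in values]


ages = [31, 30, 25, 36]                    # hypothetical patient ages in years
log_age = [math.log(a) for a in ages]      # natural log, as for the predictors above
centered = grand_mean_center(log_age)

# After centering, the predictor has a mean of zero (up to rounding error).
print(abs(fmean(centered)) < 1e-12)
```

With centered predictors, the intercept refers to a case with average values on all predictors rather than to a case with a value of zero on each.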

This is a data set from Singer and Willett’s book on longitudinal data analysis (2003), from a study by Capaldi, Crosby and Stoolmiller (1996). A sample of 180 middle-school boys were tracked from the 7th through the 12th grade, with the outcome measure being when they had sex for the first time. At the end, 54 boys (30 percent) were still virgins. These observations are censored. File firstsex is used as an example of (single-level) survival data in Chapter 8. There is one dichotomous predictor variable, which is whether there has been a parental transition (0 if the boy lived with his biological parents before the data collection began).

This involves multilevel survival data analyzed by Dronkers and Hox (2006). The data are from the National Social Science Family Survey of Australia of 1989–1990. In this survey detailed information was collected, including the educational attainment of respondents, their social and economic background, such as parental education and occupational status of the father, parental family size and family form, and other relevant characteristics of 4513 men and women in Australia. The respondent also answered all these questions about his or her parents and siblings. The respondents gave information about at most three siblings, even if there were more siblings in the family. All sibling variables were coded in the same way as the respondents, and all data were combined in a file with respondents or siblings as the unit of analysis. In that new file, respondents and siblings from the same family had the same values for their parental characteristics, but had different values for their child characteristics. The data file contains only those respondents or siblings that were married or had been married, and gave no missing values. File: sibdiv.

This data file is used to demonstrate the cross-classified data with pupils nested within both primary and secondary schools. These are simulated data, where 1000 pupils attended 100 primary and subsequently 30 secondary schools. There is no complete nesting structure; the pupils are nested within the cross-classification of primary and secondary schools. The file pupcross2 contains the secondary school achievement score, which is the outcome variable, and the explanatory pupil-level variables, gender (0 = boy, 1 = girl) and SES. School-level explanatory variables are the denomination of the primary and the secondary school (0 = no, 1 = yes).

The school manager data are from an educational research study (Krüger, 1994). In this study, male and female school managers from 98 schools were rated by 854 pupils. The data are in file manager. These data are used to demonstrate the use of multilevel regression modeling for measuring context characteristics (here, the school manager’s management style). The questions about the school manager are questions 5, 9, 12, 16, 21, and 25; in Chapter 10 of the book these are renumbered 1 … 6. The data set also contains the pupils’ and school manager’s gender (1 = female, 2 = male), which is not used in the example. The remaining questions in the data set are all about various aspects of the school environment; a full multilevel exploratory factor analysis is a useful approach to these data.

The social skills meta-analysis data in file meta20 contain the coded outcomes of 20 studies that investigate the effect of social skills training on social anxiety. All studies use an experimental group/control group design. Explanatory variables are the duration of the training in weeks, the reliability of the social anxiety measure used in each study (two values, taken from the official test manual), and the studies’ sample size. The data are simulated.

The asthma and LRD data are from Nam, Mengersen and Garthwaite (2003). The data are from a set of 59 studies that investigate the relationship between children’s environmental exposure to smoking (ETS) and the child health outcomes of asthma and lower respiratory disease (LRD). There are two effect sizes: the logged odds ratio (LOR) for asthma and the LOR for LRD, each with its standard error. Only a few studies report both. Study-level variables are the average age of subjects, publication year, smoking (0 = parents, 1 = other in household), and covariate adjustment used (0 = no, 1 = yes). Datafile: AstLrd.

The estrone data are 16 independent measurements of the estrone level of five post-menopausal women (Fears et al., 1996). The data file estronex contains the data in the usual format; the file estrlong contains the data in the format used for multilevel analysis. Although the data structure suggests a temporal order in the measurements, there is none. Before the analysis, the estrone levels are transformed by taking the natural logarithm of the measurements. The estrone data are used in Chapter 13 to illustrate the use of advanced estimation and testing methods on difficult data. The difficulty of the estrone data lies in the extremely small sample size and the small value of the variance components.

The file good89 (from Good, 1999, p. 89) contains the very small data set used to demonstrate the principles of bootstrapping.

The family IQ data are patterned to follow the results from a study of intelligence in large families (van Peet, 1992). They are the scores on six subscales from an intelligence test and are used in Chapter 14 to illustrate multilevel factor analysis. The file FamilyIQ contains the data from 275 children in 50 families. The data file contains the additional variables gender and parental IQs, which are not used in the analyses in this book. Datafile: FamIQ.

The GALO data in file galo are from an educational study by Schijf and Dronkers (1991). They are data from 1377 pupils within 58 schools. We have the following pupil-level variables: father’s occupational status, focc; father’s education, feduc; mother’s education, meduc; pupil sex, sex; the result of GALO school achievement test, GALO; and the teacher’s advice about secondary education, advice. At the school level we have only one variable: the school’s denomination, denom. Denomination is coded 1 = Protestant, 2 = nondenominational, 3 = Catholic (categories based on optimal scaling). The data file galo contains both complete and incomplete cases, and an indicator variable that specifies whether a specific case in the data file is complete or not.