Mathematical epidemiology of infectious diseases usually involves describing the flow of individuals between mutually exclusive infection states. One of the key parameters describing the transition from the susceptible to the infected class is the hazard of infection, often referred to as the force of infection.
The force of infection reflects the degree of contact with potential for transmission between infected and susceptible individuals. The mathematical relation between the force of infection and effective contact patterns is generally assumed to follow the mass action principle, which yields the necessary information to estimate the basic reproduction number, another key parameter in infectious disease epidemiology.
This book is focused on the application of modern statistical methods and models to estimate infectious disease parameters. We want to provide readers with software guidance, such as R packages, and with data, as far as they can be made publicly available. For an evidence-based and responsible communication of infectious disease topics that avoids misunderstandings and overreaction by the public, we need solid scientific knowledge and an understanding of all aspects of infectious diseases and their control.
The aim of our book is to present the reader with the general picture and the main ideas of the subject. The book introduces the reader to methodological aspects of epidemiology that are specific for infectious diseases and provides insight into the epidemiology of some classes of infectious diseases characterized by their main modes of transmission. This choice of topics bridges the gap between scientific research on the clinical, biological, mathematical, social and economic aspects of infectious diseases and their applications in public health.
The book will help the reader to understand the impact of infectious diseases on modern society and the instruments that policy makers have at their disposal to deal with these challenges.
It is written for students of the health sciences, both of curative medicine and public health, and for experts who are active in these and related domains, and it may be of interest to the educated layman since the technical level is kept relatively low.
Advances in computers and biotechnology have had a profound impact on biomedical research, and as a result complex data sets can now be generated to address extremely complex biological questions. Correspondingly, advances in the statistical methods necessary to analyze such data are following closely behind the advances in data generation methods. The main biological topics treated include sequence analysis, BLAST, microarray analysis, gene finding, and the analysis of evolutionary processes. The main statistical techniques covered include hypothesis testing and estimation, Poisson processes, Markov models and Hidden Markov models, and multiple testing methods.
The second edition features new chapters on microarray analysis and on statistical inference, including a discussion of ANOVA, and discussions of the statistical theory of motifs and methods based on the hypergeometric distribution.
Much material has been clarified and reorganized. The book is written so as to appeal to biologists and computer scientists who wish to know more about the statistical methods of the field, as well as to trained statisticians who wish to become involved with bioinformatics.
The earlier chapters introduce the concepts of probability and statistics at an elementary level, but with an emphasis on material relevant to later chapters and often not covered in standard introductory texts. Later chapters should be immediately accessible to the trained statistician.
Sufficient mathematical background consists of introductory courses in calculus and linear algebra. The basic biological concepts that are used are explained, or can be understood from the context, and standard mathematical concepts are summarized in an Appendix.
Problems are provided at the end of each chapter allowing the reader to develop aspects of the theory outlined in the main text. Warren J. Ewens holds the Christopher H. Brown Distinguished Professorship at the University of Pennsylvania.
Gregory R. Grant is a senior bioinformatics researcher in the University of Pennsylvania Computational Biology and Informatics Laboratory. He obtained his Ph.D.
Chapter 7 covers censoring and other types of missing data in greater depth, and also presents more comprehensive methods of analysis for survival data, including the multipredictor Cox proportional hazards regression model.
In standard problems such as linear regression, the sampling distribution of the regression coefficient estimates is well known on theoretical grounds, provided the data meet underlying assumptions. Bootstrap procedures approximate the sampling distribution of statistics of interest by a resampling procedure. Bootstrap samples of the same size as the actual sample — a key determinant of precision — are obtained by resampling with replacement, so that in a given bootstrap sample some observations appear more than once, some once, and some not at all.
We use the sample to represent the population and hence resampling from the actual data mimics drawing repeated samples from the source population. Then, from each of a large number of bootstrap samples, the statistics of interest are computed. The bootstrap SD is a relatively stable estimate of the standard error, since it is based on the complete set of bootstrap samples, so a relatively small number of bootstrap samples may suffice.
However, we often resort to the bootstrap precisely because the sampling distribution of the statistic of interest is unlikely to be normal, particularly in the tails. Because the extreme percentiles of a sample are very noisy estimates of the corresponding percentiles of a population distribution, a much larger number of bootstrap samples is required. Again, a relatively large number of bootstrap samples is required.
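As a concrete illustration of the resampling idea, here is a minimal sketch in Python with NumPy; the data, the choice of the median as the statistic of interest, and the number of bootstrap samples are all hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
sample = rng.lognormal(mean=0.0, sigma=1.0, size=200)   # hypothetical right-skewed sample

def statistic(x):
    return np.median(x)                                 # statistic of interest

B = 2000                                                # percentile intervals need a large B
boot_stats = np.array([
    statistic(rng.choice(sample, size=sample.size, replace=True))
    for _ in range(B)
])

print(boot_stats.std(ddof=1))                  # bootstrap SE: relatively stable even for modest B
print(np.percentile(boot_stats, [2.5, 97.5]))  # percentile CI: relies on noisy tails, needs large B
```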
See Sects. Consult these for more complete coverage of basic statistical inference, analysis of variance, and linear regression. Good references on methods for the analysis of contingency tables include Fleiss et al. Two applied survival analysis texts with a biomedical orientation are Miller et al. Finally, for a review of bootstrap methods, see Efron and Tibshirani. Explain how this might reduce sensitivity to outliers.
Problem 3. Similarly, the standard deviation of age10 is changed by the same factor: that is, the SD of age is 6. How do we compute the new variable and what is its SD? Using 3. The correlation coefficient is a measure of linear association. In this case the correlation of x and y is zero, even though there is clearly a systematic relationship.
What does this suggest about the need to test model assumptions? Using a statistical package, generate a random sample of values of x uniformly distributed on [-10, 10], compute E[y|x] for each value of x, add randomly generated standard normal errors to get the values of y, and check the sample correlation of x and y.
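One possible way to carry out this exercise in Python; the text does not specify the form of E[y|x], so a symmetric quadratic relationship is assumed here for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(-10, 10, size=10_000)
y = x ** 2 + rng.standard_normal(x.size)   # E[y|x] = x^2: systematic but not linear

print(np.corrcoef(x, y)[0, 1])             # near zero: r measures only linear association
```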
Verify the estimates for the excess risk, relative risk, and odds ratio for the HIV example presented in Table 3. The data presented below are from a case-control study of esophageal cancer. The study and data are described in more detail in Sect. The columns represent a binary indicator of reported consumption of more than ten grams of tobacco per day. Next, compute the odds ratio comparing the proportion of individuals reporting higher levels of consumption among cases with that among the controls.
Why are we unable to estimate mean survival from the Kaplan–Meier result when the largest follow-up time is censored? To gain insight, contrast the survival curves for the 6-MP and placebo groups in Fig. In the leukemia study, the probability of being relapse-free at 20 weeks, conditional on being relapse-free at 10 weeks, can be estimated by the Kaplan–Meier estimate for 20 weeks, divided by the corresponding estimate for 10 weeks.
In the placebo group, those estimates are 0. Verify that the estimated conditional probability of remission at week 20, conditional on being in remission at week 10, is 0. In the 6-MP group, estimated probabilities of remaining in remission are 0. Use these values to estimate the probabilities of remaining in remission at 20 and 30 weeks, conditional on being in remission at 10 weeks. Be familiar with the t-test (including versions for paired and unequal-variance data), one-way ANOVA, the correlation coefficient r, and some nonparametric alternatives.
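A sketch of the conditional-probability arithmetic described above, using made-up Kaplan–Meier values rather than the study's actual estimates.

```python
# Illustrative (not the study's) unconditional Kaplan-Meier estimates at 10, 20, 30 weeks
S10, S20, S30 = 0.81, 0.63, 0.45

print(S20 / S10)   # P(still in remission at 20 wk | in remission at 10 wk), about 0.78
print(S30 / S10)   # P(still in remission at 30 wk | in remission at 10 wk), about 0.56
```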
Describe the assumptions and mechanics of the simple linear model for continuous outcomes, and interpret the results. Interpret Kaplan–Meier survival and cumulative incidence curves. Calculate median survival from an estimated survival curve. Interpret the results of a logrank test. But they also tend to be older, frailer, and heavier, which may explain the association between exercise and BMD.
People whose diet is high in fat on average have higher low-density lipoprotein (LDL) cholesterol, a risk factor for coronary heart disease (CHD). But they are also more likely to smoke and be overweight, factors which are also strongly associated with CHD risk. Increasing body mass index (BMI) predicts higher levels of hemoglobin (HbA1c), a marker for poor control of glucose levels; however, older age and ethnic background also predict higher HbA1c.
These are all examples of potentially complex relationships in observational data where a continuous outcome of interest, such as BMD, SBP, and HbA1c, is related to a risk factor in analyses that do not take account of other factors. But in each case the risk factor of interest is associated with a number of other factors, or potential confounders, which also predict the outcome. So the simple association we observe between the factor of interest and the outcome may be explained by the other factors.
Similarly, in experiments, including clinical trials, factors other than treatment may need to be taken into account. If the randomization is properly implemented, treatment assignment is on average not associated with any prognostic variable, so confounding is usually not an issue.
And with continuous outcomes, stratifying on a strong predictor in both design and analysis can account for a substantial proportion of outcome variability, increasing the efficiency of the study. For example, the association of lipoprotein(a) levels with risk of CHD events appears to vary by ethnicity. The problem of sorting out complex relationships is not restricted to continuous outcomes; the same issues arise with the binary outcomes covered in Chapter 6, survival times in Chapter 7, and repeated measures in Chapter 8.
A general statistical approach to these problems is needed. We begin by illustrating some basic ideas in a simple example Sect. Then in Sect. These themes recur in Sects. In Chapter 5 we discuss the difficult problem of which variables and how many to include in a multipredictor model. As a result, research questions like this are often initially looked at using observational data.
Table 4. Unadjusted Regression of Glucose on Exercise.
Furthermore, glucose levels are far more variable among diabetics, a violation of the assumption of homoscedasticity, as we show in Sect. The coefficient estimate Coef. However, women who exercise are slightly younger, a little more likely to use alcohol, and in particular have lower average body mass index (BMI), all factors associated with glucose levels.
From Table 4. The multipredictor model also shows that average glucose levels are about 0. Average levels also increase by about 0.
Adjusted Regression of Glucose on Exercise.
Many basic elements of the multiple linear model carry over from the simple linear model, which was reviewed in Sect. In Sects.
The right-hand side of model 4. Analogous linear combinations of predictors and coefficients, often referred to as the linear predictor, are used in all the other regression models covered in this book. Despite the simple form of 4.
Interpretation of Adjusted Regression Coefficients
In 4. The resulting sums of squares and variance estimators introduced in Sect.
In the glucose example, the residual standard deviation, shown as Root MSE, declines from 9. However, inclusion of other predictors, especially powerful ones, also tends to decrease s^2_{y|x}, the residual or unexplained variance of the outcome.
In the glucose example, the standard error of the coefficient estimate for exercise declines slightly, from 0. However, in Sect. In the glucose example, the adjusted coefficient estimate for exercise is considerably smaller than the unadjusted estimate. As a result the t-statistic is reduced in magnitude from -3. Moreover, we pointed out that the scale-free correlation coefficient makes it easier to compare the strength of association between the outcome and various predictors across single-predictor models.
In the context of a multipredictor model, standardized regression coefficients play this role. Thus they give the change in standard deviation units in the average value of y per standard deviation increase in the predictor. However, predictors in both simple and multipredictor regression models can be binary, categorical, or discrete numeric, as well as continuous numeric. A good way to code such a variable is as an indicator or dummy variable, taking the value 1 for the group with the characteristic of interest, and 0 for the group without the characteristic.
With this coding, the regression coefficient corresponding to this variable has a straightforward interpretation as the increase or decrease in average outcome levels in the group with the characteristic, with respect to the reference group. In fact this unadjusted model is equivalent to a t-test comparing glucose levels in women who do and do not exercise.
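A small simulated sketch in Python with statsmodels and SciPy illustrating this equivalence; the variable names echo the glucose example, but the data and effect size are invented.

```python
import numpy as np
import statsmodels.api as sm
from scipy import stats

rng = np.random.default_rng(2)
exercise = rng.integers(0, 2, size=500)                        # 0/1 indicator variable
glucose = 98 - 1.5 * exercise + rng.normal(0, 9.7, size=500)   # hypothetical effect size

fit = sm.OLS(glucose, sm.add_constant(exercise)).fit()
print(fit.params[1], fit.tvalues[1])       # coefficient = difference in group means

t, p = stats.ttest_ind(glucose[exercise == 1], glucose[exercise == 0])
print(t, p)                                # equal-variance t-test yields the same t statistic
```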
Examples include ethnicity, marital status, occupation, and geographic region. With nominal variables it is even clearer that the numeric codes often used to represent the variable in the database cannot be treated like the values of a numeric variable such as glucose. Categories are usually set up to be mutually exclusive and exhaustive, so that every member of the population falls into one and only one category. Both types of categorical variables are easily accommodated in multipredictor linear and other regression models, using indicator or dummy variables.
Suppose level 1 is chosen as the baseline level. Following the Stata convention for the naming of the four indicator variables, Table 4. Four other points are to be made from 4. Also the model is said to be saturated and the population group means would be estimated under model 4. All regression packages make it straightforward to estimate and test hypotheses about these linear contrasts. This implies that choice of reference group is in some sense arbitrary.
While a particular choice may be best for ease of presentation, possibly because contrasts with the selected reference group are of primary interest, alternative reference groups result in essentially the same model Problem 4.
In contrast, if physact were treated as a score with integer values 1 through 5, the estimated means would be constrained to lie on a straight regression line. Using 4. For example, the second lincom result in Table 4. The last two results in the table are explained below. The testparm result in Table 4.
Regression of Glucose on Physical Activity.
All levels of the categorical predictor should still be retained in the analysis, however, because residual variance can be reduced, sometimes substantially, by splitting out the remaining groups.
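The following sketch, in Python with statsmodels rather than Stata, mimics this kind of analysis on simulated data: treatment (dummy) coding of a five-level factor, an overall F-test of the factor (an analogue of testparm), and a contrast between two non-reference levels (an analogue of lincom). The data and coefficients are hypothetical.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(3)
df = pd.DataFrame({"physact": rng.integers(1, 6, size=400)})            # five activity levels
df["glucose"] = 100 - 1.2 * df["physact"] + rng.normal(0, 9, size=400)  # hypothetical trend

# Treatment (dummy) coding with level 1 as the reference group;
# coefficient order is [Intercept, T.2, T.3, T.4, T.5]
fit = smf.ols("glucose ~ C(physact)", data=df).fit()
print(fit.params)

# Joint F-test that all four indicator coefficients are zero (testparm analogue)
print(fit.f_test(np.eye(5)[1:]))

# Contrast between levels 5 and 3, neither of which is the reference (lincom analogue)
print(fit.t_test(np.array([0, 0, -1, 0, 1])))
```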
For this case, various methods are available for controlling the experiment-wise type-I error rate (EER) for the wider set of comparisons. The Sidak correction is slightly more liberal for small values of k, but otherwise equivalent.
A special case arises when only comparisons with a single reference group are of interest, as might arise in a clinical trial with multiple treatments and a single placebo control.
It also illustrates the general principle that controlling the EER for a smaller number of contrasts is less costly in terms of power, so that it makes sense to control only for the contrasts of interest. The previous alternatives provide simultaneous inference on all the pairwise comparisons considered. The Duncan and Student-Newman-Keuls procedures fall in this class. However, neither protects the EER under partial null hypotheses. Thus using these methods in examining estimates provided by a multipredictor linear model may require help from a statistician.
Tests for linear trend across the values of physact are best performed using a linear contrast in the coefficients corresponding to the various levels of the categorical predictor.
These contrasts can be motivated as the slope coefficients from a regression in which the group means are modeled as linear in the sequential numeric codes for the categorical variable. These formulas are valid for all the other models in this book. In the physact example, shown in Table 4.
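Continuing the simulated categorical-predictor sketch above (same fitted model), a linear-trend contrast can be computed directly from the dummy-coded coefficients; the scores are an assumption for illustration.

```python
# Linear-trend contrast with equally spaced, centered scores (-2, -1, 0, 1, 2) for the
# five levels. With level 1 as the reference, the scores translate into weights on the
# coefficients [Intercept, T.2, T.3, T.4, T.5]; the intercept drops out because the
# scores sum to zero, leaving (-1, 0, 1, 2) on the four indicators.
print(fit.t_test(np.array([0, -1, 0, 1, 2])))
```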
The pattern in average glucose across the levels of a categorical variable could be characterized by both a linear trend and a departure from trend.
Model Assessing Departures from Linear Trend.
In Table 4. It is important to note that in Table 4. The test for trend must be carried out using the linear contrast described earlier. In other words, the analysis does not take account of confounding of the association we see. Although the unadjusted contrast may be useful for describing subgroups, it would be risky to infer any causal connection between exercise and glucose on this basis.
In contrast, the adjusted coefficient for exercise in Table 4. Because we can never really turn back the clock, one of the two experimental outcomes for every individual is an unobservable counterfactual. It might even be the case that exposure increases outcome levels for some members of the population and decreases them for others, yet the population means under the two conditions are equal.
In other words, all other causal determinants of outcome levels are perfectly balanced in the exposed and unexposed populations. Subtracting 4. Now consider comparing the counterfactual population means. Equation 4. The outcome is generally observable for each individual under only one of the two conditions. In place of a counterfactual experiment, we usually have to compare mean values of the outcome in two distinct populations, one composed of exposed individuals and the other of unexposed.
Note that this inequality would mean that X1 and X2 are correlated. Then, using 4. In the glucose example, this would imply that exercising or not does not depend in any way on what glucose levels would be under either condition. This is known as the randomization assumption.
In general this assumption is met in randomized experiments, since in that setting, exposure — that is, treatment — is determined by a random process and does not depend on future outcomes. But in the setting of observational data where multipredictor regression models are most useful, this assumption clearly cannot be assumed to hold. In the HERS cohort, the randomization assumption holds for assignment to hormone therapy. Essentially this is because the other factors captured by X2 are causal determinants of glucose levels or proxies for such determinants and correlated with exercise.
Mediation is discussed in more detail below in Sect. Finally, bi-directional causal pathways between X1 and X2 would require more complex methods beyond the scope of this book.
This is easiest to see in our example where all the causal determinants of the outcome Y other than X1 are captured by the binary covariate X2.
The results are shown in Table 4. Furthermore, these arguments for the potential to control confounding using the multipredictor linear model can be extended, with messier algebra, to settings where there is more than one causal co-determinant of the outcome, where any or all of the predictor variables are continuous, counts, or multi-level categories, rather than binary, and where the outcome is binary or a survival time, as discussed in later chapters.
We now consider a small hypothetical example where x1, the predictor of primary interest, is binary and coded 0 and 1, and the potential confounder, x2, is continuous. In the upper left panel of Fig. The lower panels of Fig. In short, the association between x1 and y is unmasked by adjustment for x2. The example shown in the lower panels of Fig.
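A simulated analogue of this hypothetical example (not the book's actual figure), showing how adjustment for x2 unmasks the association between x1 and y; the effect sizes are invented.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(4)
n = 1000
x1 = rng.integers(0, 2, size=n)                      # binary predictor of primary interest
x2 = 5 - 2.0 * x1 + rng.standard_normal(n)           # continuous confounder, correlated with x1
y = 1.0 * x1 + 1.0 * x2 + rng.standard_normal(n)     # both have real effects on y

unadj = sm.OLS(y, sm.add_constant(x1)).fit()
adj = sm.OLS(y, sm.add_constant(np.column_stack([x1, x2]))).fit()
print(unadj.params[1])   # near -1: the x1 effect is masked, even reversed in sign
print(adj.params[1])     # near +1: adjusting for x2 recovers the causal coefficient
```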
Randomized experiments provide the best approximation to these conditions, since the randomization assumption holds in that context. However, many epidemiologic questions about the causes of disease cannot be answered by experiments. Note that this change could be in either direction, and may even involve change in sign; attenuation is the most common pattern, but increases in the absolute value of the coefficient are consistent with negative confounding.
We assumed in that case that all causal determinants of Y other than X1 were completely captured in the binary covariate X2 — a substantial idealization. Of course, the multipredictor linear model 4. Logically, of course, it is not possible to show that all confounders have been measured, and in some cases it may be clear that they have not.
Furthermore, the hypothetical causal framework may be uncertain, especially in the early stages of investigating a research question. This implies that unadjusted parameter estimates are always biased and adjusted estimates less so. But there is a sense in which this is misleading. Thus it should not be expected to have the same value as the causal parameter. The unadjusted estimate shows that average LDL increases. However, age, ethnicity (nonwhite), smoking, and alcohol use (drinkany) may confound this unadjusted association.
After adjustment for these four demographic and lifestyle factors, the estimated increase in average LDL is 0. In addition, average LDL is estimated to be 5. In this example, smoking is a negative confounder, because women with higher BMI are less likely to smoke, but both are associated with higher LDL.
Negative confounding is further evidenced by the fact that the adjusted coefficient for BMI is larger 0. The covariates in the adjusted model shown in Table 4. For example, LDL is 5. Recommendations for inclusion of potential confounders in multipredictor regression models are given in Chapter 5.
Two comments about Fig. Accordingly, the lines are separated by a vertical distance of 5. Testing the no-interaction assumption will be examined in Sect. The causal pathway from increased abdominal fat to development of diabetes and heart disease may operate through — that is, be mediated by — chemical messengers made by fat cells. A new approach to estimation of PTE has been developed by Li et al. If all three elements of this pattern are present, then the data are consistent with the mediation hypothesis.
However, as discussed in Sect. Since standard statistical packages generally do not provide them, this would require the analyst to carry out computations requiring moderately advanced programming skills. An alternative is provided by bootstrap procedures, which were introduced in Sect. In treating BMI as a confounder of exercise, we implicitly assumed that higher BMI makes women less likely to exercise: in short, BMI is a causal determinant of exercise.
Of course, exercise might also be a determinant of BMI, which would considerably complicate the picture. Thus the potential causal pathway from exercise to decreased BMI appears negligible in this population.
In implementing the series of models set out in Sect. However, the coefficient for BMI is only slightly attenuated when exercise is added to the model, from 0. As shown in Table 4. Of course, the qualitative interpretation would be unchanged. However, this may not hold. Suppose both assignment to hormone therapy and use of statins at baseline are coded using indicator variables. Then the product term for assessing interaction is also an indicator, in this case with value 1 only for the subgroup of women who reported using statins at baseline and were randomly assigned to hormone therapy.
Interaction of Hormone Therapy and Statin Use.
However, treatment with statins may modify this association, possibly by interrupting the causal pathway between higher BMI and increased LDL. This would imply that BMI is less strongly associated with increased average LDL among statin users than among non-users. In examining this interaction, centering the continuous predictor variable BMI about its mean value aids interpretation. That is, the increase in average LDL associated with increases in BMI is much less rapid among women who use statins.
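A simulated sketch of such an interaction model in Python with statsmodels; the variable names follow the example, but the data and coefficients are invented.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(5)
n = 800
df = pd.DataFrame({
    "BMI": rng.normal(28, 5, size=n),
    "statins": rng.integers(0, 2, size=n),
})
df["BMIc"] = df["BMI"] - df["BMI"].mean()            # center BMI at its sample mean
df["LDL"] = (140 + 0.9 * df["BMIc"]                  # hypothetical BMI slope among non-users
             - 15 * df["statins"]                    # lower average LDL among statin users
             - 0.6 * df["BMIc"] * df["statins"]      # weaker BMI slope among statin users
             + rng.normal(0, 30, size=n))

fit = smf.ols("LDL ~ BMIc * statins", data=df).fit()
print(fit.params)   # BMIc = slope for non-users; BMIc:statins = difference in slopes
```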
The estimate of the interaction term is negative. It is important to recognize, however, that the need for interaction terms is dependent on the scale on which the outcome is measured or, in the models discussed in later chapters, the scale on which its mean is modeled. Similarly, in the analysis of before-and-after measurements of a response to treatment, we have the option of modeling percent rather than absolute change from baseline.
For example, in logistic regression Chap. In these cases, the default model is additive on a multiplicative scale, as explained in Chapters 6, 7, and 9.
The need to model interaction depends on outcome scale because the simpler additive model can only hold exactly on one such scale, and may be an acceptable approximation on some scales but not others. This is in contrast to confounding; if X2 confounds X1, then it does so on every outcome scale. In the case of the linear model, the dependence of interaction on scale means that transformation of the outcome will sometimes succeed in eliminating an interaction.
Note that baseline LDL is centered in this model in order to make the coefficient for hormone therapy (HT) easier to interpret.
This turns out to be the case, as shown in Table 4. Note that the coefficient for HT now estimates the average percent change in LDL due to treatment, among women at the average baseline level. Simple computation of product terms involving a categorical predictor will almost always give mistaken results.
This was nearly the case in the interaction of BMI and statin use. For example, it would be difficult to assess the interaction between two types of exposure if they occurred together either little or most of the time. However, in an observational cohort it might be much less common for women to report use of both medications. In that case, oversampling of dual users might be used if the interaction were of sufficient interest.
We have also implicitly assumed that model results are not unduly driven by any small subset of observations. We also discuss assessments of normality, how to transform the outcome in order to make this assumption approximately hold, and discuss conditions under which it may be relaxed. We then discuss departures from the assumption of constant variance and methods for addressing them. All these procedures rely heavily on the transformations of both predictor and outcome that were introduced in Chapter 2.
Throughout, we emphasize the severity of departures, since model assumptions rarely hold exactly, and small departures are often benign, especially in large data sets. Nonetheless, careful attention to meeting model assumptions can prevent us from being seriously misled, and sometimes increase the efficiency of our analysis into the bargain. However, this may not be an adequate representation of the true relationship.
This smoother approximates the regression line under the weaker assumption that it is smooth but not necessarily linear, with the degree of smoothness under our control, via the bandwidth. Moreover, nonparametric smoothers work less well in higher dimensions. Fortunately, the residuals from a regression model make it possible to examine the linearity of the adjusted association between a given predictor and the outcome, after taking account of the other predictors in the model.
The basic idea is to plot the residuals versus each continuous predictor in the model; then a nonparametric smoother is used to detect departures from a linear trend in the average value of the residuals across the values of the predictor.
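A rough illustration of this idea on simulated data, using statsmodels' LOWESS smoother as a stand-in for the Stata tools described next; the predictor name and the curved relationship are invented.

```python
import numpy as np
import statsmodels.api as sm
import matplotlib.pyplot as plt

rng = np.random.default_rng(6)
bmi = rng.normal(28, 5, size=500)
glucose = 80 + 0.05 * (bmi - 28) ** 2 + rng.normal(0, 5, size=500)   # curved true relationship

fit = sm.OLS(glucose, sm.add_constant(bmi)).fit()                    # linear working model
smooth = sm.nonparametric.lowess(fit.resid, bmi, frac=0.5)           # smoothed mean residual

plt.scatter(bmi, fit.resid, s=8, alpha=0.4)
plt.plot(smooth[:, 0], smooth[:, 1], color="red")                    # departure from a flat line
plt.axhline(0, linestyle="--")
plt.xlabel("BMI"); plt.ylabel("Residual")
plt.show()
```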
This is a residual versus predictor (RVP) plot, obtained in Stata using the rvpplot command. This solution is often useful when the regression line estimated by the LOWESS smooth is convex or concave, and especially if the line becomes steeper at either side of the CPR plot.
Linearizing Predictor Transformations
However, other transformations of the predictor may sometimes be more successful and should be considered.
The upper left panel shows the typical curvature captured by adding a quadratic term in the predictor to the model. On the upper right, both quadratic and cubic terms have been included; in general such higher order polynomial transformations are useful for S-shapes. Each of these transformations would work just as well for modeling the mirror image of the nonlinear shape, reversed top-to-bottom.
Categorizing the Predictor
Another transformation useful in exploratory analysis is to categorize the continuous predictor, either at cutpoints selected a priori or at percentiles that ensure adequate representation in each category. Then the model is estimated using indicators for all but the reference category of the transformed predictor, as in the physact example in Sect. Clearly the transformed variable is ordinal in this case.
This method models the association between the ordinal categories and the outcome as a step function Fig. In contrast, smooth transformations, in particular polynomials, are harder to motivate, present, and interpret. In both cases, however, we can check whether R² improves substantially with the transformation.
As with the t-test reviewed in Sect. The point here is that the residuals may be normally distributed when y is not, and conversely. In the upper panels the histogram and boxplot both suggest a somewhat long tail on the right. The lower left panel presents a nonparametric estimate of the distribution of the residuals obtained using the kdensity, normal command in Stata.
For comparison, the solid line in that panel shows the normal distribution with the same mean and standard deviation. Comparing these two curves suggests some skewing to the right, with a long right and short left tail; but overall the shapes are quite close.
Finally, as explained in Chapter 2, the upward curvature of the normal quantile-quantile Q-Q plot on the lower right is also diagnostic of right-skewness. Interpretation of the results shown in Fig. Given such a large data set, the distribution of the parameter estimates is likely to be well approximated by the normal, despite the mild departure from normality in the residuals. However, in a small data set, say, with 50 or fewer observations, the long right tail might be reason for concern, in part because it could make parameter estimates less precise and tests less powerful.
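A sketch of these two graphical checks in Python, applied to hypothetical right-skewed residuals rather than the residuals from the example.

```python
import numpy as np
from scipy import stats
import matplotlib.pyplot as plt

rng = np.random.default_rng(8)
resid = rng.gamma(shape=4.0, scale=2.0, size=500) - 8.0   # hypothetical right-skewed residuals

fig, axes = plt.subplots(1, 2, figsize=(9, 4))

grid = np.linspace(resid.min(), resid.max(), 200)
axes[0].plot(grid, stats.gaussian_kde(resid)(grid), label="kernel density estimate")
axes[0].plot(grid, stats.norm.pdf(grid, resid.mean(), resid.std()), label="matching normal")
axes[0].legend()

stats.probplot(resid, dist="norm", plot=axes[1])          # normal Q-Q plot; curvature = skewness
plt.show()
```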
Testing for Departures From Normality
Various statistical tests are available for assessing the normality of the residuals, but have the drawback of being sensitive to sample size, often failing to reject the null hypothesis of normality in small samples where meeting this assumption is most important, and conversely rejecting it even for small violations in large data sets where inferences are relatively robust to departures from normality.
For this reason, we do not recommend use of these tests; instead, the graphical methods just described should be used to judge the potential seriousness of the violation in the light of the sample size.
Log, Power, and Other Transformations of the Outcome
Transforming the outcome is often successful for reducing the skewness of residuals.
One such transformation is to replace the outcome y with log y. A constant can be added to an outcome variable with negative or zero values, so that all values are positive, though this may complicate interpretation. The log transformation is now conventionally used to analyze viral load in studies of HIV and hepatitis infections, triglyceride levels in studies of cardiovascular disease, and in many other contexts.
It should also be noted that there is no qualitative change in inferences for BMI. In this case, y is replaced by y^k. Adding a constant so that all values of the outcome are non-negative will sometimes be necessary in this case too. The ladder command in Stata systematically searches for the power transformation of the outcome which is closest to normality.
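A rough analogue of this search in Python: SciPy's Box-Cox routine estimates the power transformation of a positive outcome that brings it closest to normality, illustrated here on simulated right-skewed data.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
y = rng.lognormal(mean=4.0, sigma=0.5, size=500)   # right-skewed, strictly positive outcome

y_transformed, lam = stats.boxcox(y)               # lam near 0 suggests the log scale
print(lam)
```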
In this case one solution is the rank transformation, in which each outcome is replaced by its rank in the ordering of all the outcomes, as in the computation of the Spearman correlation coefficient Sect.
Generalized Linear Models (GLMs)
Some outcome variables cannot be satisfactorily transformed, or there may be compelling reasons to analyze them on the original scale.
A good alternative is provided by the generalized linear models (GLMs) discussed in Chapter 9. Furthermore, in contrast to violations of the assumption that the residuals are normally distributed, heteroscedasticity is no less a problem in large samples than in small ones. Finally, while violations do not make the coefficient estimates biased, some precision can be lost.
Since the residuals of the LDL analysis gave no evidence of trouble, we examined the residuals from the companion model for HDL, which was shown in Sect.
Sub-Sample Variances
Constancy of variance across levels of a categorical predictor can be checked by comparing the sample variance of the residuals for each category.
If they had been included, the variance of the residuals would have varied between this group of women and the remainder of the HERS cohort by a factor of 26 2, vs.
Testing for Departures From Constant Variance
Statistical methods available for testing the assumption of homoscedasticity share the sensitivity to sample size described earlier for tests of normality.
The resulting potential for giving false reassurance in small samples leads us to recommend against the use of these formal tests.
Instead, we need to examine the severity of the violation. In that case, non-constant variance can sometimes be addressed using a variance-stabilizing transformation of the outcome, including the log and square root transformations. As shown in Fig. However, in this case our qualitative conclusions would be unchanged by log transformation of HDL. However, this has now been largely supplanted by GLMs such as the Poisson and negative binomial regression models Chap.
As in other GLMs, including the logistic model Chap. GLMs represent the primary alternative when transformation of the outcome fails to rectify substantial violations of the assumption of constant variance. In this section we consider high-leverage points, which could be described as x-outliers, since they tend to have extreme values of one or more predictors, or represent an unusual combination of predictor values.
This can happen when a high-leverage point also has a large residual. We would have good reason to mistrust substantive conclusions that were dependent on a few observations in this way. Similarly, in regression models oriented to prediction of future outcomes Sect. The sample shown on the upper left includes an outlier with a very large positive residual. However, the leverage of the outlier is minimal, because it is in the center of the distribution of x.
In linear regression, these statistics are exact; for logistic and Cox models, accurate approximations are available. DFBETAs often have a very small inter-quartile range, so that a substantial set of observations may lie beyond the whiskers of the plot. The changes are mostly minor, in particular for BMI, the predictor of primary interest. Unfortunately, user-friendly diagnostics for checking sensitivity to omission of sets of observations have not been developed, in part because the computational burden is too great.
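A small simulated sketch of leverage and DFBETA diagnostics using statsmodels' influence measures; the single unusual observation is planted deliberately, and the data are hypothetical.

```python
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import OLSInfluence

rng = np.random.default_rng(9)
x = rng.normal(size=200)
x[0] = 8.0                                   # one high-leverage point (an x-outlier)
y = 1 + 0.5 * x + rng.normal(size=200)
y[0] += 10                                   # ...which also has a large residual

fit = sm.OLS(y, sm.add_constant(x)).fit()
infl = OLSInfluence(fit)
print(infl.hat_matrix_diag[0])               # leverage of the unusual observation
print(infl.dfbetas[0])                       # change in each coefficient (in SE units) if it is dropped
```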
In Fig. For both predictors and outcomes, log transformation changes the focus from absolute to relative or percentage change.
Log Transformation of the Predictor
First consider log transformation of the predictor. In this case, the regression coefficient multiplied by log 1. This is valid whether we use the natural log or logarithms with other bases.
In a linear model using the natural log (ln) transformation of weight to predict systolic blood pressure (SBP), the estimated coefficient for ln(weight) is 3. Thus we estimate that average SBP increases 3. Within limits, we can approximate these results without using a calculator. This follows because ln 1. But this shortcut is not valid for logarithms with other bases, and analogous calculations for larger percentage increases in the predictor get progressively less accurate and should not be attempted by this means.
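The arithmetic behind this interpretation, using an illustrative coefficient of 3.0 for ln(weight) since the exact estimate is not shown here.

```python
import math

coef = 3.0                      # illustrative coefficient for ln(weight)
print(coef * math.log(1.05))    # change in average SBP per 5% increase in weight: about 0.146
print(coef * 0.05)              # quick mental approximation, since ln(1.05) is close to 0.05: 0.15
```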
Again, we can approximate these results without a calculator under some circumstances. Furthermore, addressing these violations will in many cases involve using transformations of predictors or outcomes that may make the results harder to interpret.
If not, it may be reasonable not to use the transformations. Our example using BMI and diabetes to predict HDL is probably a case in point: while log transformation of HDL corrected departures from both normality and constant variance, the conclusions were unchanged.
Inclusion of multiple predictors in the model makes it possible to adjust for confounding variables, examine mediation, check for and model interactions, and increase efficiency, especially in experiments, by accounting for design factors. It is important to check the assumptions of the linear model and to use transformations of predictor and outcome variables as necessary to meet them more closely, especially in small samples.
It is also important to recognize common data types where linear regression is not appropriate; these include binary, time-to-event, count, and repeated measures or clustered outcomes, and are addressed in subsequent chapters.
A cutting-edge book in this area, unfortunately of considerable difficulty, is van der Laan and Robins. A standard book on regression diagnostics is Belsley et al.
Splines and Generalized Additive Models
The Stata package implements a convenient and often more biologically plausible alternative to the categorical transformations presented in Sect.