Linear Regression on the SENIC Project
- Griffen Herrera
- Feb 23, 2023
The SENIC project records characteristics of the hospitals participating in the study. This analysis uses two of its variables. The response variable (Y) is LOS (length of stay), the average length of stay in the hospital, in days. The explanatory variable (X) is INFRISK (infection risk), the average estimated probability of acquiring an infection in the hospital, in percent.
We found the estimated linear regression function to be Y = 6.3367865 + 0.7604209(X). The positive slope shows that the fitted length of stay rises by about 0.76 days for each one-percentage-point increase in infection risk. When infection risk equals zero, the fitted length of stay is about 6.34 days. The linear regression plot is shown below:

Based on the R-squared of 27.81%, we can say that 27.81% of the variability in length of stay is explained by infection risk. The regression model therefore fits the observed data only modestly, leaving most of the variability unexplained.
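As a rough illustration of where an intercept, slope, and R-squared come from, here is a minimal Python sketch of simple least squares. The analysis itself was run in R; the function name `fit_simple_ols` and the toy numbers are hypothetical and are not the SENIC data.

```python
from statistics import mean

def fit_simple_ols(x, y):
    """Return (intercept, slope, r_squared) for the least-squares line y = b0 + b1*x."""
    x_bar, y_bar = mean(x), mean(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    # R-squared = 1 - SSE / SST
    sse = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    sst = sum((yi - y_bar) ** 2 for yi in y)
    return intercept, slope, 1 - sse / sst

# Toy data (hypothetical, not SENIC)
toy_x = [1, 2, 3, 4, 5]
toy_y = [2, 4, 5, 4, 5]
b0, b1, r2 = fit_simple_ols(toy_x, toy_y)
print(f"intercept={b0:.2f}, slope={b1:.2f}, R^2={r2:.2f}")
```

With the real data, `x` would hold the infection risk values and `y` the lengths of stay.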
The p-value of the F-statistic is 1.177 x 10^-9, which is much smaller than α = 0.05, so we reject the null hypothesis that Beta_1 = 0 and conclude that there is a linear relationship between infection risk and length of stay. The 95% confidence interval for the slope is [0.5336442, 0.9871976]; that is, we are 95% confident that the slope parameter falls between 0.5336442 and 0.9871976. Because this interval lies entirely above zero, we can conclude that length of stay increases with infection risk.
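A quick sanity check on the reported interval, sketched in Python: a t-based confidence interval for the slope is symmetric around the point estimate, so its midpoint should reproduce the estimated slope, and an interval that excludes zero agrees with rejecting Beta_1 = 0 at the same level. The variable names are illustrative only.

```python
# Reported 95% confidence interval and point estimate for the slope
ci_lower, ci_upper = 0.5336442, 0.9871976
slope_estimate = 0.7604209

center = (ci_lower + ci_upper) / 2       # should match the slope estimate
half_width = (ci_upper - ci_lower) / 2   # t critical value times the slope's standard error
excludes_zero = ci_lower > 0 or ci_upper < 0

print(f"center={center:.7f}, half_width={half_width:.7f}, excludes_zero={excludes_zero}")
```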
Using the estimated linear regression function outlined in the graph above, it is possible to calculate fitted values of length of stay (Y) for particular values of infection risk (X). When X=5, for instance, the fitted value for the response variable is equal to Y=6.3368+0.7604(5)=10.14 days.
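The same arithmetic can be written as a small Python helper. The coefficients are the ones reported above; the function name `fitted_length_of_stay` is a hypothetical label chosen for this sketch.

```python
# Coefficients of the estimated regression function reported in the text
INTERCEPT = 6.3367865
SLOPE = 0.7604209

def fitted_length_of_stay(infection_risk):
    """Fitted mean length of stay (days) for a given infection risk (percent)."""
    return INTERCEPT + SLOPE * infection_risk

print(round(fitted_length_of_stay(5), 2))  # 10.14
```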
This value forms the center of the confidence and prediction intervals for X=5, which were obtained using R. The 95% confidence interval, (9.80, 10.48), indicates that if many samples are taken using the same method, 95% of the resulting confidence intervals would contain the mean length of stay at X=5. Similarly, the 95% prediction interval, (6.90, 13.37), indicates that if many samples are taken using the same method, 95% of the resulting prediction intervals would contain the length of stay of a new hospital with X=5. The prediction interval is wider than the confidence interval because it targets a single new observation rather than the mean of many.
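The reason the prediction interval is wider can be seen from the standard textbook formulas for simple linear regression, sketched below. The symbols n (sample size), MSE (mean squared error of the fit), X-bar (mean infection risk), and X_h (the value of interest, here 5) are notation introduced for this sketch and do not appear in the R output quoted above.

```latex
\begin{aligned}
\text{CI for the mean response:}\quad
  & \hat{Y}_h \pm t_{1-\alpha/2;\,n-2}\,
    \sqrt{MSE\left[\frac{1}{n} + \frac{(X_h-\bar{X})^2}{\sum_{i=1}^{n}(X_i-\bar{X})^2}\right]} \\
\text{PI for a new observation:}\quad
  & \hat{Y}_h \pm t_{1-\alpha/2;\,n-2}\,
    \sqrt{MSE\left[1 + \frac{1}{n} + \frac{(X_h-\bar{X})^2}{\sum_{i=1}^{n}(X_i-\bar{X})^2}\right]}
\end{aligned}
```

The extra 1 inside the bracket is the variance of an individual observation around its mean, which is why the prediction interval is always the wider of the two.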
In order to determine whether the assumptions for this linear regression model are warranted, we must perform diagnostic tests. Visual diagnostics are often the first step in this process: for this dataset, appropriate residual plots were employed to check the normality and equal variance (homoscedasticity) assumptions.
To check the normality assumption, three graphical representations were used: the boxplot and histogram of residuals, and the normal probability plot (i.e., Q-Q plot).

The box itself looks fairly symmetric; however, there are three outliers, two of which lie far from all other values (even after studentization). This indicates positive skew, which suggests the normality assumption may not be met.

Similar to the boxplot, the histogram is positively skewed, due to the two outliers mentioned previously. The remaining observations form a fairly normal distribution.

The normal probability plot deviates slightly from the line in a way that reflects positive skew, corroborating the findings of our previous graphs. It provides us with the information that observations 47 and 112 (and 76, to a lesser extent) are significant outliers that hinder our ability to accept the normality assumption.
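For readers who want to see how the points of a normal probability plot are built, here is a minimal Python sketch. It pairs each ordered residual with an approximate expected normal quantile and reports the correlation between the two; a squared correlation of this kind is the idea behind Shapiro-Francia-type statistics. The plotting positions (i - 3/8)/(n + 1/4) are one common convention and an assumption here, and the function names and toy residuals are hypothetical.

```python
from math import sqrt
from statistics import NormalDist, mean

def qq_points(residuals):
    """Pair each ordered residual with an approximate expected normal quantile."""
    n = len(residuals)
    ordered = sorted(residuals)
    std_normal = NormalDist()
    # Assumed plotting positions: (i - 3/8) / (n + 1/4)
    theoretical = [std_normal.inv_cdf((i - 0.375) / (n + 0.25)) for i in range(1, n + 1)]
    return list(zip(theoretical, ordered))

def qq_correlation(residuals):
    """Pearson correlation between theoretical quantiles and ordered residuals."""
    points = qq_points(residuals)
    t = [p[0] for p in points]
    r = [p[1] for p in points]
    t_bar, r_bar = mean(t), mean(r)
    cov = sum((a - t_bar) * (b - r_bar) for a, b in points)
    return cov / sqrt(sum((a - t_bar) ** 2 for a in t) * sum((b - r_bar) ** 2 for b in r))

# Toy residuals (hypothetical): a symmetric sample and one with a large positive outlier
symmetric = [-2, -1, 0, 1, 2]
right_skewed = [-1, -1, 0, 0, 10]
print(qq_correlation(symmetric), qq_correlation(right_skewed))
```

A correlation close to 1 means the points hug the reference line; a large positive outlier pulls the correlation down, much like the departures described above.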
To perform a preliminary check of the homoscedasticity assumption, three plots were used: residuals vs. predictor variable (i.e., infection risk), residuals vs. fitted values, and square root of the absolute value of residuals vs. predictor variable.

There does not seem to be any pattern in the spread of the residuals, which supports the equal variance assumption. At the same time, the outliers seen in the normality plots persist.
The Shapiro-Wilk (shapiro.test), Shapiro-Francia (sf.test in nortest package) and Anderson-Darling (ad.test in nortest package) tests were used to check the normality assumption. The null hypothesis of these tests is that the data come from a normal distribution, while the alternative hypothesis states that the data do not come from a normal distribution. Hence, for the normality assumption to be supported, we would need to fail to reject the null hypothesis.
When the three tests are run using R, we obtain the following p-values: 1.699 x 10^-8 (Shapiro-Wilk), 7.85 x 10^-8 (Shapiro-Francia), 3.823 x 10^-5 (Anderson-Darling). Even at very small values of α, the null hypothesis is rejected in all three tests; this means that the normality assumption is not warranted.
The modified Levene and Breusch-Pagan tests were used to check the equal variance assumption.
The modified Levene test splits the residuals into two groups according to some threshold criterion and compares, between the groups, the mean absolute deviation of the residuals from their group median. In our case, we used the median of the predictor variable to form the two groups. The null hypothesis states that the mean absolute deviations in both groups are equal, lending support to the homoscedasticity assumption; the alternative hypothesis is that they are not equal, suggesting heteroscedasticity. The p-value for this test is 0.1732, which is not low enough to reject the null hypothesis. This means the equal variance assumption may be warranted.
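The mechanics of the modified Levene test can be sketched in a few lines of Python. This is a minimal illustration, not the R code used for the analysis: the function name `modified_levene_t` and the toy data are hypothetical, and putting observations equal to the median into the lower group is an assumption. The sketch returns the pooled two-sample t statistic and its degrees of freedom; the p-value would then come from a t distribution, which is not computed here.

```python
from math import sqrt
from statistics import mean, median

def modified_levene_t(x, residuals):
    """Two-sample t statistic on absolute deviations of residuals from their group median."""
    cut = median(x)
    # Assumption: observations with x equal to the median go to the lower group
    low = [e for xi, e in zip(x, residuals) if xi <= cut]
    high = [e for xi, e in zip(x, residuals) if xi > cut]
    med_low, med_high = median(low), median(high)
    d_low = [abs(e - med_low) for e in low]
    d_high = [abs(e - med_high) for e in high]
    n1, n2 = len(d_low), len(d_high)
    m1, m2 = mean(d_low), mean(d_high)
    # Pooled variance of the absolute deviations
    pooled = (sum((d - m1) ** 2 for d in d_low) + sum((d - m2) ** 2 for d in d_high)) / (n1 + n2 - 2)
    t_stat = (m1 - m2) / sqrt(pooled * (1 / n1 + 1 / n2))
    return t_stat, n1 + n2 - 2

# Toy data (hypothetical): the residual spread is larger in the upper half of x
toy_x = [1, 2, 3, 4, 5, 6]
toy_residuals = [-1, 0, 1, -3, 0, 3]
t_stat, df = modified_levene_t(toy_x, toy_residuals)
print(f"t = {t_stat:.4f} on {df} degrees of freedom")
```

A t statistic far from zero would point to unequal spread between the two groups.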
The Breusch-Pagan test uses an auxiliary regression model to test a similar idea: we obtain p-values of 0.0269 and 1.291 x 10^-6 for the studentized and regular tests, respectively. Both are below 0.05, which suggests that the equal variance assumption is not warranted; that said, this test assumes normality, which is not satisfied. Therefore, the application of this test may not be appropriate.
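To make the auxiliary regression idea concrete, here is a minimal Python sketch. It assumes the studentized variant is Koenker's form, in which the squared residuals are regressed on the predictor and the statistic is n times the R-squared of that regression, compared with a chi-square distribution with 1 degree of freedom. The function name `studentized_bp` and the toy data are hypothetical.

```python
from math import erfc, sqrt
from statistics import mean

def studentized_bp(x, residuals):
    """Return (statistic, p_value) for an n * R^2 test with a single predictor."""
    n = len(x)
    sq = [e * e for e in residuals]  # response of the auxiliary regression
    x_bar, s_bar = mean(x), mean(sq)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (si - s_bar) for xi, si in zip(x, sq))
    slope = sxy / sxx
    intercept = s_bar - slope * x_bar
    sse = sum((si - (intercept + slope * xi)) ** 2 for xi, si in zip(x, sq))
    sst = sum((si - s_bar) ** 2 for si in sq)
    stat = max(n * (1 - sse / sst), 0.0)  # guard against tiny negative rounding error
    # Upper tail of a chi-square with 1 degree of freedom: P(chi2 > s) = erfc(sqrt(s / 2))
    return stat, erfc(sqrt(stat / 2))

# Toy data (hypothetical): squared residuals tend to grow with x
toy_x = [1, 2, 3, 4, 5]
toy_residuals = [sqrt(2), -2, sqrt(5), -2, sqrt(5)]
stat, p_value = studentized_bp(toy_x, toy_residuals)
print(f"statistic = {stat:.3f}, p-value = {p_value:.4f}")
```

A small p-value indicates that the squared residuals vary systematically with the predictor, which is evidence of heteroscedasticity.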
To test whether age is a potentially omitted variable, we must run the original regression model (length of stay vs. infection risk) and plot the residuals of this model against our potentially omitted variable (i.e., age).
It can be observed from the plot that there is a weak positive linear relationship between age and the residuals of the original model. That being said, whether this relationship is significant enough to add age as a predictor variable cannot be determined with visual diagnostics alone.
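The omitted-variable check described here can be sketched numerically as well as visually: fit the original model, take its residuals, and measure how strongly they move with the candidate variable. The Python below is a minimal illustration on toy numbers; the function names and the data are hypothetical, and the toy response was constructed so that the residuals depend entirely on the candidate variable.

```python
from math import sqrt
from statistics import mean

def ols_residuals(x, y):
    """Residuals from the least-squares line of y on x."""
    x_bar, y_bar = mean(x), mean(y)
    slope = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sum((xi - x_bar) ** 2 for xi in x)
    intercept = y_bar - slope * x_bar
    return [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]

def pearson_r(a, b):
    """Pearson correlation between two equal-length sequences."""
    a_bar, b_bar = mean(a), mean(b)
    cov = sum((ai - a_bar) * (bi - b_bar) for ai, bi in zip(a, b))
    return cov / sqrt(sum((ai - a_bar) ** 2 for ai in a) * sum((bi - b_bar) ** 2 for bi in b))

# Toy data (hypothetical): response built as 2 * predictor + 0.2 * candidate
predictor = [1, 2, 3, 4]
candidate = [55, 50, 50, 55]   # stands in for a variable such as age
response = [13, 14, 16, 19]

residuals = ols_residuals(predictor, response)
print(pearson_r(residuals, candidate))
```

A correlation near zero would suggest the candidate adds little beyond the existing predictor; a clearly non-zero value is a reason to test the variable formally in an expanded model.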
