This post is an example of using inferential statistics to understand an aspect of the business. It demonstrates a structured, disciplined approach to answering a question about the business. Because guesswork is removed, one can be more confident in one's knowledge and understanding of the business. The post walks through the common steps of an analysis, including statements of the null and alternative hypotheses, exploratory data analysis, tests and results, and conclusions. The analysis was conducted in R.
Introduction
The analysis attempted to determine whether there was an interviewer effect in the results of a lab safety survey. Interviewer effect here is defined as variance or error in the data caused by the behavior of the interviewer.
The questions I sought to answer were:
- Are the average survey scores across all the interviewers the same, i.e., do the scores come from the same population?
- If the scores are not the same, are there any interviewers with scores that are similar?
- Further, if the scores are not the same, what actionable steps can be taken to manage the effect when analyzing additional survey results in the future?
The remainder of this report discusses the materials and methods, results, and discussion related to my analysis.
Materials and methods
I started with a data set of survey responses that had already been munged and cleaned. The population was 375 responses. The assumption of i.i.d. holds: the assignment of an interviewer to conduct a survey in a lab was either random or pseudo-random.
The survey response data were binary. Each interview consisted of approximately 85 questions. Responses were coded 0 for no, 1 for yes, and 2 for not applicable. Responses coded not applicable were treated as missing data during the analysis.
These responses were previously cleaned and saved. In the code included below, the names of the interviewers were already sanitized. Sections of code that required further sanitizing are still included, both for completeness and to ensure the names of the interviewers were not incidentally disclosed.
As a data type, the interviewers form a factor, or group, with each interviewer being a level of that factor, i.e., a different group. In this report, where the word group is used, it refers to the interviewers. Identifiable information for the interviewers was removed; their names were recoded to the values A, B, C, and D.
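For illustration, a minimal sketch of such a recoding (the actual sanitization was done upstream, and the column name is an assumption):

# Hypothetical sketch: treat interviewer as a factor and relabel its levels
dt$interviewer <- factor(dt$interviewer)
levels(dt$interviewer) <- c("A", "B", "C", "D")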
I formed a null hypothesis that the survey scores were identical for all four of the interviewers, i.e., that the results come from the same population. The alternative hypothesis is that they do not. The statements of the null and alternative hypotheses were:
\( H_0: \mu_A = \mu_B = \mu_C = \mu_D \)
\( H_a: \mu_i \neq \mu_j \text{ for at least one pair } i \neq j \)
That is, under the null hypothesis, the survey results for each member of the group are drawn from the same population. If at the end of the analysis I rejected the null hypothesis, I would have evidence that a statistically significant effect exists, i.e., that there is an interviewer effect in the survey results.
My first step was to perform exploratory data analysis, starting with the measures of central tendency and spread, the mean and variance, respectively:
# Mean score by interviewer
(means <- tapply(dt$score, dt$interviewer, mean))
## A B C D
## 0.8145 0.8786 0.9319 0.8589
# Variance of the scores by interviewer
(var <- tapply(dt$score, dt$interviewer, var))
## A B C D
## 0.007354 0.006856 0.002337 0.006176
The mean scores varied somewhat. Group C had the highest mean score and also the tightest spread. Group A had the lowest mean score with a spread about equal to the spread for two of the other groups, B and D.
Included next are two plots showing the pattern of the responses.
(NB: The boxplot in ggplot2 doesn't support variable-width boxes. When using the boxplot() function in base R graphics, I set varwidth = TRUE as a matter of good practice.)
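A minimal sketch of the base R version, assuming dt as above:

# Boxplot of scores by interviewer, box widths proportional to group size
boxplot(score ~ interviewer, data = dt, varwidth = TRUE,
        xlab = "Interviewer", ylab = "Survey score")
# Overlay the group means as white diamonds (pch = 23)
points(seq_along(levels(dt$interviewer)),
       tapply(dt$score, dt$interviewer, mean), pch = 23, bg = "white")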
I next reviewed the boxplots showing the distribution of the data by interviewer, with box widths reflecting the counts. The white diamond mark included for each box is the mean score for that interviewer. The boxplot shows that the mean score is somewhat lower than the median score for each of the four groups. The group sizes are not the same because the boxplot was drawn from the full population, before the larger groups were downsampled.
The mean score being less than the median indicates that there were probably low survey scores trailing out to the left, giving the distribution of the scores a somewhat left-skewed shape. The histograms confirm this pattern.
The below figure shows the histogram for each of the four interviewers, or groups. (The red line indicates the mean score for the group; the blue line indicates the median.) Each histogram reflects the actual number of surveys for the group, i.e., the group sizes before sampling. The smallest group was group B, with a count of 37.
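A sketch of how the faceted histograms could be produced with ggplot2; the reference-line data frame is a hypothetical reconstruction:

library(ggplot2)
# Per-group mean and median for the red and blue reference lines
grp <- data.frame(interviewer = levels(dt$interviewer),
                  mean   = tapply(dt$score, dt$interviewer, mean),
                  median = tapply(dt$score, dt$interviewer, median))
ggplot(dt, aes(x = score)) +
  geom_histogram(bins = 20) +
  geom_vline(data = grp, aes(xintercept = mean), colour = "red") +
  geom_vline(data = grp, aes(xintercept = median), colour = "blue") +
  facet_wrap(~ interviewer)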

Notably, none of the groups appears to have a normal distribution.
Results
At least one of the tests, the ANOVA, assumes that the group sizes are the same. In the original survey data set, the group sizes were not the same. To meet the assumption, I determined the N of the smallest group (N = 37) and then sampled 37 observations from each of the other three groups. In the ANOVA and the other tests below that assume equal group sizes, the group size used is 37. Where equal group size is not an assumption, the full data set of 375 observations was used.
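A sketch of that downsampling step, assuming dt holds the full 375 observations with columns score and interviewer (the seed is a hypothetical choice for reproducibility):

# Determine the size of the smallest group
n_min <- min(table(dt$interviewer))
# Draw n_min observations from each group so the group sizes are equal
set.seed(42)  # hypothetical seed
A <- dt[sample(which(dt$interviewer == "A"), n_min), ]
B <- dt[sample(which(dt$interviewer == "B"), n_min), ]
C <- dt[sample(which(dt$interviewer == "C"), n_min), ]
D <- dt[sample(which(dt$interviewer == "D"), n_min), ]
cat("size of smallest group is:", n_min)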
## size of smallest group is: 37
I performed a one-way ANOVA to test whether the samples were drawn from the same population. This test corrects for nonhomogeneity of variance, so the informal observation I made above about the variance of group C being dissimilar from the other groups is taken into account.
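Judging from the output header and data line, the call was R's oneway.test with its default var.equal = FALSE:

# Welch's one-way analysis of means on the equal-size samples
oneway.test(rbind(A, B, C, D)$score ~ rbind(A, B, C, D)$interviewer)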
##
## One-way analysis of means (not assuming equal variances)
##
## data: rbind(A, B, C, D)$score and rbind(A, B, C, D)$interviewer
## F = 28.57, num df = 3.00, denom df = 77.19, p-value = 1.577e-12
At the 0.05 significance level, the one-way ANOVA returned an F of approximately 29 and a p-value that is effectively zero (1.6e-12). The test provides evidence to reject the null hypothesis.
I wanted to perform Bartlett's test for equal variances. However, that test is sensitive to departures from normality, so I instead performed Levene's test of the null hypothesis that the groups have equal variances.
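The wording of the output below matches levene.test from the lawstat package with location = "mean"; a sketch under that assumption:

# Levene's test of equal variances across the interviewer groups
library(lawstat)
levene.test(dt$score, dt$interviewer, location = "mean")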
##
## classical Levene's test based on the absolute deviations from the
## mean ( none not applied because the location is not set to median
## )
##
## data: dt$score
## Test Statistic = 14.52, p-value = 5.778e-09
At the 0.05 significance level, Levene's test returned a test statistic of approximately 15 and a p-value that is effectively zero (5.8e-09). The null hypothesis that the groups have equal variances is rejected.
The next test I used was the Kruskal-Wallis test. The purpose of this test was to decide whether the distributions of the scores by group were identical, without assuming that the scores were drawn from a normal distribution. The test also accommodates unequal group sizes.
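A sketch of the call, run on the full data set since the test tolerates unequal group sizes:

# Kruskal-Wallis rank sum test on all 375 observations
kruskal.test(dt$score ~ dt$interviewer)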
##
## Kruskal-Wallis rank sum test
##
## data: dt$score by dt$interviewer
## Kruskal-Wallis chi-squared = 113, df = 3, p-value < 2.2e-16
At the 0.05 significance level, the Kruskal-Wallis test returned a chi-squared value of 113 and a p-value below 2.2e-16. We reject the null hypothesis that the groups are drawn from the same population.
Next, I used the classical one-way ANOVA. The difference between the ANOVA and the Kruskal-Wallis test is that Kruskal-Wallis is non-parametric while the ANOVA is parametric. Using both, then, I was able to compare results from a parametric and a non-parametric test.
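The output below indicates the model was fit with aov on the equal-size samples; a sketch:

# Parametric one-way ANOVA on the equal-size samples
summary(aov(rbind(A, B, C, D)$score ~ rbind(A, B, C, D)$interviewer))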
## Df Sum Sq Mean Sq F value Pr(>F)
## rbind(A, B, C, D)$interviewer 3 0.356 0.1185 21.2 1.9e-11 ***
## Residuals 144 0.804 0.0056
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
At the 0.05 significance level, the F value was approximately 21 and the p-value was effectively zero (1.9e-11). We reject the null hypothesis that the means of the groups are equal.
Finally, I used the Tukey HSD test to find means that are significantly different from each other.
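The fit line in the output records the call; a sketch:

# Tukey honest significant differences on the fitted one-way ANOVA
fit <- aov(rbind(A, B, C, D)$score ~ rbind(A, B, C, D)$interviewer)
TukeyHSD(fit)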
## Tukey multiple comparisons of means
## 95% family-wise confidence level
##
## Fit: aov(formula = rbind(A, B, C, D)$score ~ rbind(A, B, C, D)$interviewer)
##
## $`rbind(A, B, C, D)$interviewer`
## diff lwr upr p adj
## B-A 0.07886 0.033708 0.124022 0.0001
## C-A 0.13227 0.087113 0.177428 0.0000
## D-A 0.03868 -0.006482 0.083833 0.1211
## C-B 0.05341 0.008248 0.098563 0.0133
## D-B -0.04019 -0.085346 0.004968 0.0997
## D-C -0.09359 -0.138752 -0.048437 0.0000
Because this test compares all possible pairs of means, the output is a matrix in which each row represents a pair of means and the columns hold the results of that comparison.
At the 0.05 significance level, the output from the Tukey HSD test shows that the means of groups D and A, and of groups D and B, are not significantly different. The p-values for the four other pairs fall below our significance level, so we reject the null hypothesis that those means are equal.
At this point in the analysis, I knew that a significant effect had been found; I had answered the first question. The next step was to answer the second of the three questions I set out to answer. To gain a further understanding of which groups might be similar, I could have applied Welch's t-test to each pair of groups. More conservatively, the significance level could have been adjusted for the number of pairs, changing it from 0.05 to 0.05/6, or about 0.008. Welch's t-test is used when the sample variances are not assumed equal, and it also allows unequal sample sizes. Instead, I applied pairwise.t.test, both unadjusted and adjusted, as sketched below.
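A sketch of the three calls, with column names assumed as before:

# Pairwise t tests with pooled SD, varying the p-value adjustment method
pairwise.t.test(dt$score, dt$interviewer, p.adjust.method = "none")
pairwise.t.test(dt$score, dt$interviewer, p.adjust.method = "bonferroni")
pairwise.t.test(dt$score, dt$interviewer, p.adjust.method = "holm")

Results for the test with the unadjusted p-values were: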
##
## Pairwise comparisons using t tests with pooled SD
##
## data: dt$score and dt$interviewer
##
## A B C
## B 1.0e-05 - -
## C < 2e-16 4.3e-05 -
## D 3.6e-05 0.14 4.9e-16
##
## P value adjustment method: none
Results using the Bonferroni adjustment to the p-value:
##
## Pairwise comparisons using t tests with pooled SD
##
## data: dt$score and dt$interviewer
##
## A B C
## B 6.0e-05 - -
## C < 2e-16 0.00026 -
## D 0.00022 0.81851 2.9e-15
##
## P value adjustment method: bonferroni
Results using the Holm adjustment to the p-value:
##
## Pairwise comparisons using t tests with pooled SD
##
## data: dt$score and dt$interviewer
##
## A B C
## B 4.0e-05 - -
## C < 2e-16 0.00011 -
## D 0.00011 0.13642 2.4e-15
##
## P value adjustment method: holm
At the 0.05 significance level, the only pair whose means could not be distinguished was the B-D pair; its p-values under both adjustments are above the significance level. For every other pair, there was no evidence that the means were the same.
Discussion
I found compelling evidence to reject the null hypothesis at the 0.05 significance level: an interviewer effect did exist in the results for the 375 surveys. Only one pair of groups, B and D, was similar.
Even though the assignment was randomized, some interviewers were more generous than others. It is impossible to know whether the labs surveyed by C were already safer, or whether C was simply that much more generous. To control for the interviewer effect, the same assignments could be made next year and the results compared with this year's. Alternatively, a control group could be created in which half of the labs are assigned to the same interviewer next year while the other half are randomly assigned.
From a safety point of view, how bad are the four low-score outliers for C? These labs might have scored much worse had A received the assignment, suggesting that the risk is actually greater than what is measured by the scores they received from C.
Several next steps could further the understanding of the interviewer effect. A two-way ANOVA using interviewer and school should be performed. Also, a chi-square test using each interviewer's share of interviews per school would confirm randomized assignment.
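A sketch of those next steps; the school column is an assumption, as it does not appear in the data shown above:

# Two-way ANOVA with interviewer and school as crossed factors
# (the 'school' column is hypothetical)
summary(aov(score ~ interviewer * school, data = dt))
# Chi-square test of independence between interviewer and school,
# as a check on randomized assignment
chisq.test(table(dt$interviewer, dt$school))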
Appendix
I conducted the analysis in the R programming language. All the code used in this analysis is freely available, posted on gist.github.com as gist 9012633.