Missing data in a data set is a technical and complicated topic. To keep this post focused, I’m going to use the Titanic data set associated with the Kaggle competition on predicting who survived the Titanic disaster.
In an earlier post, I explored the problem of identifying the type of missingness in the Age column of this data set. That post was built around a Tableau visualization. The visualization did not yield any insights, and I now consider it unsuccessful.
After exploring the data again this week, I found that there is a pattern to the missingness. I’m going to update my opinion on the type of missing data based on this new information. The data supports an NMAR (not missing at random) pattern. I’ll look at t-tests, probabilities, and plots to show the pattern.
First, I’d like to acknowledge the excellent book by Robert Kabacoff, R in Action, 1st edition. Information about the forthcoming second edition is available at Manning Press. Chapter 15 of the 1st ed. covers missing data in great detail. I keep returning to p. 354 and its discussion of the classification system for missing data.
Use of t-tests on the means
This data set does not lend itself to using t-tests to look for patterns in the missing data. There is only one continuous variable, Age, and it’s the very variable whose missingness pattern we want to determine. Exploring t-test statistics for data of this type is beyond my skill and the scope of this post.
Probability of missingness
The discrete data lend themselves very nicely to using concatenation to create a new variable.
This approach tries to determine whether P(M | Pclass, Sex, SibSp, Parch) = P(M), where M represents a missing Age value in a record and each of the other four variables represents its value in that record. The takeaway is that computing and comparing these probabilities can help evaluate whether Age is NMAR.
If one or more of the variables were continuous, the first step would be to discretize them: compute quantiles and use R’s cut() function with its breaks argument to bin the values into a new variable. Since the four variables Pclass, Sex, SibSp, and Parch are already discrete, I can concatenate them without determining quantiles or defining breaks.
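As a rough illustration of the idea, here is a minimal sketch in base R. The data frame is a made-up toy, not the Titanic data, the 0/1 coding of Sex is an assumption, and contVar is a hypothetical continuous column included only to show the binning step.

```r
# Toy records standing in for the real data (all values are made up).
titanic <- data.frame(
  Pclass  = c(1, 3, 3, 2, 3, 1, 3, 3),
  Sex     = c(0, 0, 1, 0, 0, 1, 0, 1),   # assumed 0/1 coding
  SibSp   = c(0, 0, 0, 0, 0, 1, 0, 0),
  Parch   = c(0, 0, 0, 0, 0, 0, 0, 0),
  Age     = c(NA, NA, 22, 35, NA, 38, 40, NA),
  contVar = c(80, 7.25, 7.9, 13, 8.05, 71.3, 7.75, 8.5)  # hypothetical
)

# Concatenate the four discrete variables into one key, e.g. "3000".
titanic$combinedVar <- paste0(titanic$Pclass, titanic$Sex,
                              titanic$SibSp, titanic$Parch)

# M is TRUE where Age is missing.
titanic$M <- is.na(titanic$Age)

# P(M): overall share of records with missing Age.
p_missing <- mean(titanic$M)

# P(M | Pclass, Sex, SibSp, Parch): share missing within each combination.
p_missing_by_combo <- tapply(titanic$M, titanic$combinedVar, mean)

# A continuous variable would first be binned at quantile breaks with cut().
titanic$contBin <- cut(titanic$contVar,
                       breaks = quantile(titanic$contVar, probs = c(0, 0.5, 1)),
                       include.lowest = TRUE,
                       labels = c("low", "high"))

print(p_missing)
print(p_missing_by_combo)
```

If the per-combination values differ markedly from the overall value, missingness is not spread evenly across the combinations.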
The first time I saw the raw data in the R console, I knew I had found a pattern. It’s a beautiful thing to look at a plot, chart, or graph and immediately spot a pattern. That’s insight.
Here’s the frequency of missing Age records for each combination of the concatenated variables that has missing Age values in the data set.
There are 11 such values of the concatenated variable in the data.
I arbitrarily selected a threshold of 0.05: a combination counts as meaningful if it holds at least 5% of the 177 records with missing Age (the Probability column is Count divided by Total_Count). Just four combinations of Pclass, Sex, SibSp, and Parch account for about 73% of the missing Age values. This is a pattern supporting the claim that the missing Age values are NMAR.
       Count combinedVar Total_Count Probability
    1     21        1000         177  0.11864407
    19     9        2000         177  0.05084746
    37    77        3000         177  0.43502825
    53    22        3100         177  0.12429379
    > sum(usefulProbFreq$Probability)
     0.7288136
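The arithmetic behind the table can be checked with a short sketch. It uses only the four rows shown above; the counts for the other combinations are not listed in this post, so they are left out. The decoding step assumes the key is built in the order Pclass, Sex, SibSp, Parch, one digit each.

```r
# The four combinations from the table above, with their missing-Age counts.
usefulProbFreq <- data.frame(
  Count       = c(21, 9, 77, 22),
  combinedVar = c("1000", "2000", "3000", "3100"),
  Total_Count = 177
)

# Share of the 177 missing-Age records that falls in each combination.
usefulProbFreq$Probability <- usefulProbFreq$Count / usefulProbFreq$Total_Count

# Every row shown clears the 0.05 threshold.
aboveThreshold <- usefulProbFreq$Probability >= 0.05

# Combined share of the four combinations, and the records left over.
coveredShare     <- sum(usefulProbFreq$Probability)
remainingRecords <- 177 - sum(usefulProbFreq$Count)

# Split the key back into its parts (assumed order: Pclass, Sex, SibSp, Parch).
usefulProbFreq$Pclass <- as.integer(substr(usefulProbFreq$combinedVar, 1, 1))

print(usefulProbFreq)
print(coveredShare)
print(remainingRecords)
```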
Modelling the missing Age values
There are different approaches to handling and imputing missing values. The methods range from simple and easy to understand to sophisticated techniques that require specialized skills. For this data set, once I determined that the missingness was NMAR, I imputed the median Age within each of the four sets of records in the probability table above.
For the remaining 48 records, I used the median Age calculated from the complete cases, which is 28 years. The mean and median are close to each other for the whole data set and for each of the respective subsets, so I don’t think the choice makes much difference in this instance, but I took the median.
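One way this two-step imputation might look in base R is sketched below. The toy ages are made up, the column name AgeImputed is hypothetical, and the overall median here is simply taken over the non-missing Age values, which may differ slightly from a median over fully complete records.

```r
# Toy data: combination key plus Age, with some ages missing (made-up values).
titanic <- data.frame(
  combinedVar = c("3000", "3000", "3000", "3000",
                  "1000", "1000", "1000", "2110", "2110"),
  Age         = c(20, 30, NA, 40, 50, NA, 60, 8, NA)
)

# The four combinations that cleared the 0.05 threshold.
usefulCombos <- c("1000", "2000", "3000", "3100")

# Fallback: median of all observed ages.
overallMedian <- median(titanic$Age, na.rm = TRUE)

# Median of observed ages within each combination, repeated per row.
groupMedian <- ave(titanic$Age, titanic$combinedVar,
                   FUN = function(a) median(a, na.rm = TRUE))

missingAge <- is.na(titanic$Age)
useGroup   <- missingAge & titanic$combinedVar %in% usefulCombos &
              !is.na(groupMedian)

# Group median for the four selected combinations, overall median otherwise.
titanic$AgeImputed <- titanic$Age
titanic$AgeImputed[useGroup] <- groupMedian[useGroup]
titanic$AgeImputed[missingAge & !useGroup] <- overallMedian

print(titanic)
```

Keeping the imputed values in a separate column leaves the original Age untouched, so the two can be compared afterwards.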
Use plots to identify a pattern
R is feature rich in packages that help you handle missing data. A particularly useful package is VIM, a package described for use in “Visualization and Imputation of Missing Values.” Included in the package are the
The areas of the plot in red indicate missing values.
Here is an example of a matrixplot created using
An interactive matrixplot is available via the TKRmatrixplot() function, which supports sorting on any one column at a time. When exploring this data set with TKRmatrixplot(), a pattern is revealed: a disproportionate number of passenger records in third class (Pclass = 3) also have missing Age values. This corroborates the probability table above, where the two third-class combinations (3000 and 3100) together account for about 56% of the missing Age values.
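The pattern seen when sorting on Pclass can also be checked numerically. The sketch below uses a made-up toy data frame. The plotting call assumes the VIM package is installed and that its matrixplot() function accepts the sortby and interactive arguments as documented, and it only runs in an interactive session.

```r
# Toy data (made-up values): passenger class and Age with gaps.
titanic <- data.frame(
  Pclass = c(1, 1, 2, 2, 3, 3, 3, 3),
  Age    = c(38, NA, 35, 27, NA, NA, 22, NA)
)

# Share of records with missing Age within each class.
missingByClass <- tapply(is.na(titanic$Age), titanic$Pclass, mean)

# Share of all missing-Age records that belongs to each class.
missingRows    <- is.na(titanic$Age)
shareOfMissing <- table(titanic$Pclass[missingRows]) / sum(missingRows)

print(missingByClass)
print(shareOfMissing)

# Static matrix plot sorted by Pclass, drawn only if VIM is available.
if (interactive() && requireNamespace("VIM", quietly = TRUE)) {
  VIM::matrixplot(titanic, sortby = "Pclass", interactive = FALSE)
}
```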
These three plot functions are available from CRAN.
I’ll prepare the code and post it to a repo on GitHub.