The iframe tag and the WordPress iframe plugin are how to include HTML widgets in a WordPress blog post. The iframe tag is included in HTML5. Using iframes in a WordPress blog post is how you get the JavaScript associated with a widget to run in your post.

Here’s an example of how to create an HTML widget in R using an .R file. This could also be done in an .Rmd and knitted. Since I didn’t want to create an .html page from within RStudio using knitr, I just sourced the .R file:

library(htmlwidgets)
library(DT)
a <- datatable(iris)
saveWidget(a, "datatable-iris-example.html")

Next, import the .html file to your media library. Then, add the shortcode to your post. Here's how to encode it in the page when editing the blog post:

iframe seamless src="http://www.phillipburger.net/wordpress/wp-content/uploads/2015/05/datatable-iris-example.html" width="100%" height="500"

Note that I had to remove the open [ and closing ] brackets to avoid the code from being executed. When you add your shortcode to your post, add back the [ as the first character and the ] as the last character in the shortcode. This is how simple it is. Here's how it looks:

I am not aware of any other way to get R-created HTML widgets (JavaScript) to execute in a blog post. If you have any suggestions, please post a comment!

This post show how to do a website path analysis with a Sankey Diagram. Complex interactions that visitors have with a website can be presented in a visual form using the Sankey diagram. This post includes tips on how to simplify your data and how to make the diagram appealing. An example of a diagram is included.

Understanding the behavior of customers can lead to improved marketing campaigns, and to improved sales and revenue. This post encapsulated lessons learned from a project at work that studied the pathing of customers on a website. I needed a way to explore and visualize customer behavior and click-paths for a month’s worth of data. A premise of the project was that visualization had to be easily interpretable and effective; it had to be a way to gain insights about the customer journey during a session. The flow visualization example shown below can help to better understand hotspots, highlight common trends, and find insights on individual user and aggregate behavior.

In path analysis you are typically trying to determine a sequence of events in a particular time window. For example you pay attention to paths more frequently used than others in order to understand what path prospects take before they become new customers. Path analysis works best with linear event streams such as customer lifecycle (1. prospect, 2. trial subscription, 3. customer, 4. product upgrade, etc.) but is also commonly used for web usage analysis.

I originally wrote the code and did the analysis for a project at work. I had to sanitize the data and did so by substituting in page group names for a hypothetical, non-profit foundation. I also generated synthetic data for the weights of the links. Notwithstanding these changes, the concepts and discussion apply a across domains.

## Why is website path analysis important?

There are three reasons why this is important.

1. The client has indicated that a pathing analysis would be of interest.
2. Important questions can be answered including:
• Are customers taking the optimal path?
• Can a problem in site design be identified?
• Can customers be segmented, e.g., how is customer behavior different when browsing on a desktop compared to when browsing via a mobile device.
• All else being equal, is customer behavior on one site different than customer behavior on a different site?
• All else being equal, does customer behavior change over time?
• What trips customers up when visiting a site? Where do customers abandon/timeout? What is the last page when a timeout happens?
3. A well-designed visualization leverages the natural abilities of the human cognitive system to discern and interpret information.

## Approach

This project used a Sankey diagram for each of three, different sites to show customer pathing through a website. The code used to obtain the source data was SQL developed by a product developer and shared during a technical brown bag lunches. The data sets is rich and the pathing results shown here represent just a portion of the possible ways to consider pathing.

There are many ways customers can navigate on a website. Even if I could determine the optimal path, it’s very likely that just a very minor number of users will actually take the optimal path. This means you must pay special attention to the path analysis results in order to gain the right insights. You may for example compare the least and most used paths in terms of sequence count or the number of drop off’s (e.g. customers who left a session and therefore didn’t complete an order).

It can be useful to apply segmentation to path analysis (more details below), as this will greatly reduce the number of steps in a path and may represent a better aggregated view about paths taken. In most cases, you are after the number of people taking the optimal path to reach your goal such as purchasing a product. Once you are understanding common paths you can try to influence the customer behavior by redesigning the web page or starting a marketing campaign, for example.

## Website path

The JavaScript library used for this project was Google Charts. Data manipulation and the API to Google Charts output was completed using R 3.1.1 and the package googleVis 0.5.5.

The terms used in a Sankey diagram are nodes, links, and weight. From the Google Developers site for the Sankey diagram:

A Sankey diagram is a visualization used to depict a flow from one set of values to another. The things being connected are called nodes and the connections are called links. Sankeys are best used when you want to show a many-to-many mapping between two domains (e.g., universities and majors) or multiple paths through a set of stages (for instance, Google Analytics uses sankeys to show how traffic flows from pages to other pages on your web site).

In these diagrams, the nodes are the page groups, not the actual web page name. The compressed categories are used. Compressed categories remove duplicate page group names that occur in sequence. For example, the compress categories of the array {Home, Foundations, Foundations, Foundation Research, Foundation Research, Foundations} is {Home, Foundations, Foundation Research, Foundations}. Because the path loops and the graphs are cyclical, the second and subsequent page group names in a compressed categories are suffixed with an underscore (_) character. For example, Home__, Home with two underscores as suffix indicates that the landing on Home group shown occurred on the third click.

These diagrams show the landing page plus four clicks, for a total of five nodes and four links. The nodes are the page group name. The links are the click from one page group name to another page group name. Customer clicks through time are read from left to right. The charts are interactive. Activating a node in the middle of the diagram shows the flow into the node and the flow out of the node. The thickness of the links is the weight of that link relative to other links connecting the two nodes.

## Statistical approaches to comparing paths

The Null Hypothesis is that the click patterns from the landing page (node 1) to the fourth click (node 5) are the same for customers navigating through one website as for customers navigating through a second website. Likewise with tests for segmentation based on device, the Null Hypothesis is that customer behavior when browsing using a desktop is the same as customer behavior when browsing from a mobile device. Unfortunately, browsing behavior when represented in a graph, or diagram, is an example of a cyclic graph. In cyclic graphs the starting node on a path is traversed a second time; the result is a loop. A rich set of modelling techniques including Log-Linear, Bayesian, Gaussian are applicable to directed acyclic graphs (DAG). These techniques do not apply to cyclic graphs nor am I aware of any models or test that can be applied to cyclic graphs. Testing for segmentation using graph models is not an option for this data at this time.

For the project, visual inspection of three website did show that customer pathing is different for each site compared to each of the other two sites. However, this is anecdotal. I am not aware of any techniques to measure and compare the differences. Please post a comment if you have any suggestions on how to quantify path differences of cyclic graphs.

## Lessons learned and recommendations

The key takeaway is that the Sankey diagram is an effective way to visualize visitor flow on a web site.

Google Analytics does a good job in implementing Sankey diagrams for visitor flow. Their implementation is rich and provides a good example of what is possible. An example is discussed here.

There are many different paths through a site. It’s easy to show too many paths on the Sankey diagram. This is not unique to Sankey diagrams. Remove the clutter from the diagrams by filtering for weights that exceed a threshold that makes sense for your charts and diagrams. For the diagrams listed in this post, only links that have a weight greater than or equal to 0.02 out of a total weight of 1.0 are shown. This removes the less frequently traveled paths.

Customer behavior is cyclic while Sankey diagrams are acyclic. The diagram won’t render if a path closes on itself. In this study, I used the workaround discussed, i.e., adding an underscore (_) character as a suffix. It’s possible that other JavaScript libraries provide an elegant way to handle this. A hack around the problem using the same JavaScript library would be to pad the node name with one or more whitespace characters.

Other open source tools are available and should be evaluated. These include a combination of the R packages rCharts and igraph, and the D3.js library. One such example is demonstrated here.

It’d be nice to show a funnel that narrows as customer journey’s advance across time (through clicks) from left to right. Here, the bounces are included in the height of the fifth node. It’d be better if they were included in the fifth node at all.

It’d be nice to be able to highlight a path four links and show it from beginning to end. In the diagram above, when a node is highlighted, what’s shown are the nodes that customers arrived from and the nodes that they go to next. Very helpful, but a natural ask is to want to see an entire path across the four links. It might be too complicated to visualize as there might be so many choices.

There are too many page name layouts (page names) on a typical site to use this level of detail to create the data used to generate the diagram. Use page group names instead.

Four links are used in the above diagram. I think this is probably enough to understand customer behavior on the site.

There’s a warning on the Google Developers site that they are working on the Sankey diagram and that changes to the API should be expected.

## Reproducibility and code

The hard work to create this diagram is, of course, writing the SQL to extract it, munging the array returned from the SQL into R vectors to create the nodes, the R code to create the links, and the R code to calculate the weights. But, the payoff is how easy it is to use googleVis to create a visually appealing Sankey diagram. I’m including just the code used to make the Sankey diagram in this gist:

This post explores the probability density function, empirical cumulative density function, and the expected value, for a discrete random variable. The variable is the Average Customer Review score on Amazon.com of the Wolfram book A New Kind of Science. Why this book? I was thinking of selling my copy to Amazon.com for trade-in credit and applying the credit toward the purchase of Seamless R and C++ Integration with Rcpp (Use R!).

When I saw how little credit they give, I went to the reviews for some explanation. Here’s a table of the ratings as of May 18, 2014:

Rating value, Count of rating
5 stars, 100
4 stars, 40
3 stars, 56
2 stars, 62
1 star, 103

From reading some of the 361 reviews, the book is either a contribution of new thought toward an as yet realized science, or it’s too long, boring, and not much of a contribution toward anything. A bar chart better shows the distinction in the counts of those who like the book and those who do not like the book. The counts are high at the end values and asymmetric in the middle-range of values from two to four:

The probability density function of the discrete random variable is . The density for each of the five ratings is shown in this chart:

The cumulative density function uses a rather than to denote the function. I’m using the empirical CDF here. (The probabilities used are taken from what is actually observed rather than obtained from a model.) The general form of the function is:

The plot of the empirical cumulative density function is shown here:

What is the expected, or the average, value for the customer ratings? The notation for the expected value is :

And the calculation is:

This explains exactly the 3-star sprite and the <span class="swSprite s_star_3_0 " title="2.9 out of 5 stars"> code in that part of the Amazon page containing the average customer review; Amazon uses the weighted mean to calculate the value of the average customer review.

#### Wrap

The key takeaway is that the average and the variability are two different concepts and have to be used together to understand a distribution. With just a quick inspection of the average rating, it’d be fair to say that customers think the book is ok. Rather, customer opinions is very divided; the two, extreme values make up %56 of all ratings.

This is my first use of plot.ly visualizations in a post. It took a pretty large chunk of time spread over several sessions and weeks to figure it out. But, the time was worth it and I plan to use the service for visualizations in future posts.

The purpose of this data visualization is to show the density of a measure over time. The measure is an occurrence of a criminal or non-criminal activity or incident reported by the the University of Washington Police Department. The density is the frequency of the measure at a geocoded location over a period of time. What is shown here is, essentially, where officers were called to during the summer of 2013.

This is an example of a CartoDB Torque map. Used alone or with other maps or information, new patterns of human activity are shown and possibly understood for the first time. This map, used in conjunction with a previously completed Tableau viz, are an example of this. The patterns and information easily discernible on these maps without any narrative or explanation would be impossible to discern looking at the same data in tabular format. (The Tableau visualization of this same data is available here on my Tableau Public web site. Feel free to check it out!)

I think it would be interesting to organize the data by day of the week over the period of a year. What are the busiest days of the week, on average, and what types of incidents occur most frequently during different days of the week?

This post is an example of using inferential statistics to understand an aspect of the business. It demonstrates the use of a structured, disciplined approach to answer a question about the business. Because guess work is removed, one can be more confident of their knowledge and understanding of the business. The post shows the common steps of an analysis including statements of the null and alternative hypotheses, exploratory data analysis, tests and results, and conclusion. The analysis was conducted using R.

## Introduction

The analysis attempted to determine if there was an interviewer effect in the results of a lab safety survey. Interviewer effect here is defined as the a variance or error in the data that is caused by the behavior of the interviewer.

The questions I sought to answer were:

• Are the average survey scores across all the interviewers the same, i.e., do the scores come from the same population?
• If the scores are not the same, are there any interviewers with scores that are similar?
• Further, if the scores are not the ame, what actionable steps can be taken to manage the effect when analyzing additional survey results in the future?

The remainder of this report discusses the Material and methods, Results, and Discussion related to my analysis.

## Materials and methods

I started with a data set of already munged and cleaned survey responses. The population was 375. The assumption of i.i.d. does hold, i.e., the assignment of an interviewer to conduct a survey in a lab was either random or psuedo random.

The survey response data was binary. The interviews consist of approximately 85 questions. Responses were 0 for no, 1 for yes, and 2 for not applicable. Responses coded not applicable were treated as missing data during analysis.

These responses were previously cleaned and saved. In the code I included below, the names of the interviewers were already sanitized. In those sections where the code is further sanitized, it was included for purposes of completeness and to ensure that the names of the interviewers were not incidentally disclosed.

As a data type, interviewers are a factor, or group. Each interviewer is a level of a factor, or a different group. In this report, where the word group is used, it referred to the interviewers. Identifiable information for the interviewers was removed; their names were recoded to values A, B, C, and D.

I formed a null hypothesis that the survey scores were identical for all four of the interviewers, or across members of the group, i.e, that the results come from the same population. The alternative hypothesis is that they do not come from the same population. The statement of the null hypothesis and the alternate hypothesis were:

$$H_0: \mu_A = \mu_B = \mu_B = \mu_D$$

$$H_a: \mu_A \neq \mu_B \neq \mu_C \neq \mu_D$$

That is, that the survey results for each member of the group are drawn from the same population. If at the end of the analysis I rejected the null hypothesis, I would have evidence that statistically significant effect exists, that there is an interviewer effect in the survey results.

My first step was to perform exploratory data analysis. The first look at the data was to review the measures of central tendency and spread, the mean and variance, respectively:

(means <- tapply(dt$score, dt$interviewer, mean))

##      A      B      C      D
## 0.8145 0.8786 0.9319 0.8589

(var <- tapply(dt$score, dt$interviewer, var))

##        A        B        C        D
## 0.007354 0.006856 0.002337 0.006176


The mean scores varied somewhat. Group C had the highest mean score and also the tightest spread. Group A had the lowest mean score with a spread about equal to the spread for two of the other groups, B and D.

Included next are two plots showing the pattern of the responses.

(NB: The boxplot in ggplot2 doesn't have variable width capability of the boxes. when using the boxplot() function in base R graphics, I use the varwidth = TRUE as a matter of good practice.)

The boxplots showing the distribution of the data by interviewer and the distribution by count were next reviewed. The white diamond mark included for each box is the mean score for that interviewer. The boxplot shows that the mean scores are somewhat lower than the median scores for each of the four groups. The group sizes are not the same as the boxplot was applied to the population prior to sampling the larger groups.

The mean score being less than the median indicates that it was probably the case that there were low survey scores that trailed out to the left, providing somwewhat of a left skew to the distribution of the scores. The histogram did reveal this pattern.

The below figure shows the histogram for each group. (The red line indicates the mean score for the group, the blue indicates the median for the group.) Each histogram shows the actual number of surveys for each group; the group sizes are the actual number of surveys for each group. The smallest group was for group B with a count of 37.

The below figure shows the histogram for each of the four interviewers, or groups. (The red line indicates the mean score for the group, the blue indicates the median for the group.) Each histogram shows the actual number of surveys for each group; the group sizes are the actual number of surveys for each group. The smallest group was for group B with a count of 37.

An aspect of each group was that none of them appear to have a normal distribution.

### Results

At least one of the tests, the ANOVA, assumes that the group sizes are the same. In the original survey data set, the groups sizes were not the same. What I did to meet the assumption was determine the N for the smallest group (N = 37), and then sample 37 observations from each of the other three groups. In ANOVA and other test below that assume equal group size, the group size used is 27. When equal group size not an assumption, the full data containing the 375 observations was used.

## size of smallest group is:  37


Do a one-way ANOVA to test whether or not the samples were drawn from the same population. Results show a F of 22 and a p-value of 0. This test does correct for nonhomogeneity in the variances. So, the non-programmatic observation I made about the variance of the C group being dissimlar from the other groups is taken into account with this test.

##
##  One-way analysis of means (not assuming equal variances)
##
## data:  rbind(A, B, C, D)$score and rbind(A, B, C, D)$interviewer
## F = 28.57, num df = 3.00, denom df = 77.19, p-value = 1.577e-12


At the 0.05 significance level, one-way ANOVA test returned an F of 55 and a p-value equal to 0. This test did not provide evidence to support the null hypothesis.

I wanted to perform a Barlett's test for equal variance. However, the test is sensitive to departures from normality. Instead, I performed a Levene's test to test the null hypothesis of equal variances for the groups.

##
##  classical Levene's test based on the absolute deviations from the
##  mean ( none not applied because the location is not set to median
##  )
##
## data:  dt$score ## Test Statistic = 14.52, p-value = 5.778e-09  At the 0.05 significance level, results of the Levene's test were test statisitc of approximately 15 and a p-value of 0. The null hypothesis of the test that the groups have equal variances is rejected. The next test I used was the Kruskal test. The purpose of this test was to decide whether the distributions of the scores by group were identical without assuming that the distributions were from the normal distribution. This test is useful for unequal group sizes. ## ## Kruskal-Wallis rank sum test ## ## data: dt$score by dt$interviewer ## Kruskal-Wallis chi-squared = 113, df = 3, p-value < 2.2e-16  At the 0.05 significance level, the Kruskal test returns a Kruskal-Wallis chi-squared value of 113 and a p-value equal to zero. We reject the null hypothesis of the test that the groups are drawn from the same population. Next, I used the ANOVA. The difference between ANOVA and the Kruskal test is that Kruskal is non-parametric and ANOVA is parametric. Using both tests, then, I was able to compare results from a parametric and a non-paramtric test. ## Df Sum Sq Mean Sq F value Pr(>F) ## rbind(A, B, C, D)$interviewer   3  0.356  0.1185    21.2 1.9e-11 ***
## Residuals                     144  0.804  0.0056
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1


At the 0.05 significance level, the F value was approximately 15 and the p-value was zero. We reject the null hypothesis of the test that the means of the groups are equal.

Finally, I used the Tukey HSD test to find means that are signficantly different than each other.

##   Tukey multiple comparisons of means
##     95% family-wise confidence level
##
## Fit: aov(formula = rbind(A, B, C, D)$score ~ rbind(A, B, C, D)$interviewer)
##
## $rbind(A, B, C, D)$interviewer
##         diff       lwr       upr  p adj
## B-A  0.07886  0.033708  0.124022 0.0001
## C-A  0.13227  0.087113  0.177428 0.0000
## D-A  0.03868 -0.006482  0.083833 0.1211
## C-B  0.05341  0.008248  0.098563 0.0133
## D-B -0.04019 -0.085346  0.004968 0.0997
## D-C -0.09359 -0.138752 -0.048437 0.0000


Because this test compares all possible means, a matrix is produced with each row representing a pair of means and the columns representing the results for the comparison.

At 0.05 significance level, the output from the Tukey HSD test provides evidence that the means of groups D and A, and the means of groups D and B are not significantly different. The p-value for the four other mean groups is below are significance level and we reject the null hypothesis of the test that these means are equal.

At this point in the analysis, I knew that a significant effect had been found. I had answered the first question. The next step was to answer the second of the three questions I set out to answer. To gain a further understanding of the groups that might be similar, I could have applied Welch's t-test on each pair of groups. More conservatively, this test could have been applied with the significance level adjusted for the number of pairs, changing the significance level from 0.05 to 0.05/6, or 0.008. Welch's t-test is used when sample variances are not assumed to be equal. This test also allows for unequal sample sizes. Instead, I applied the pairwise.t.test, both unadjusted and adjusted. Result for the test with the unadjusted p-value were:

##
##  Pairwise comparisons using t tests with pooled SD
##
## data:  dt$score and dt$interviewer
##
##   A       B       C
## B 1.0e-05 -       -
## C < 2e-16 4.3e-05 -
## D 3.6e-05 0.14    4.9e-16
##
## P value adjustment method: none


Results using the Bonferroni adjustment to the p-value:

##
##  Pairwise comparisons using t tests with pooled SD
##
## data:  dt$score and dt$interviewer
##
##   A       B       C
## B 6.0e-05 -       -
## C < 2e-16 0.00026 -
## D 0.00022 0.81851 2.9e-15
##
## P value adjustment method: bonferroni


Results using the Holm adjustment to the p-value:

##
##  Pairwise comparisons using t tests with pooled SD
##
## data:  dt$score and dt$interviewer
##
##   A       B       C
## B 4.0e-05 -       -
## C < 2e-16 0.00011 -
## D 0.00011 0.13642 2.4e-15
##
## P value adjustment method: holm


At the 0.05 significance level, the only pair where evidence existed that they are the same was the B-D pair. The p-value from both adjustments are above the significance level. There was no evidence that the other pairs were the same.

## Discussion

I found that there was compelling evidence to reject the null hypothesis at the 0.05 significance level. Interviewer effect did exist in the results for the 375 surveys. Only one pair of groups are similar, the B and D pair.

Even though there was randomized assignment, some interviewers were more generous. It's impossible to know if this was a result of the labs being surveyed by C were more safe already, or if C is that much more generous. To control for interviewer effect, the same assignments could be made next year and the results from next year compared to the results from this year. Alternatively, a control group could be created where half of the labs are assigned by the same interviewer next year and randomized assignment is made for the other half.

From a safety point of view, how bad are the four low-score outliers for C? These labs might have scored much worse had A received the assignment, indicating that the risk is actually greater than what is measured here by the scores they received from C.

Next steps are available in further understanding the interviewer effect. A two-way ANOVA using interviewer and school should be performed. Also, a chi-square test using the share of interviews per school for each of the interviewers would confirm randomized assignment.

## Appendix

I conducted the analysis in the R programming language. All the code used in this analysis was made freely available and was posted on gist.github.com as gist 9012633.