Revolution Analytics announced on January 23, 2015, that it was being acquired by Microsoft. This is bad news for Linux stack customers.

First, the positives. This could spur the development of an open source stack that provides functionality equivalent to Revolution Analytics. Better yet, Microsoft could make the code it purchased publicly available so it can be forked immediately to bootstrap that open-source effort.

Revolution Analytics flingshot monkey

And the negatives…it’s bad news for Linux stack customers.

Customer first? No way. Are prices going to decrease? Doubtful. Is the technical support model going to stay efficient and friendly? Doubtful. Pity the poor contract managers at Linux-stack companies who now have to negotiate with Microsoft. The timing is bad for Big Data for the Enterprise in R.

Possibly the worst outcome is that Microsoft will spike the Linux distributions of Revolution Analytics software: stop developing on Linux, focus development solely on Windows distributions, and maybe sell off the code eventually. It would be a competitive advantage for them to do so. But there is time. Revolution Analytics software is well developed in terms of what it can do and the value it adds. It works and is solid. Customers can use the releases already in the pipeline to take a wait-and-see approach and to start thinking about a transition to other software.

A big winner in this acquisition is the Microsoft stack customer base, who are sure to see the code integrated into their stack. Microsoft missed the first few years of the shift to analytics, and this acquisition is one way to meet the expectations and needs of their customers. The Python community will benefit as Big Data for the Enterprise in R starts to consider alternatives. (I don't think the open source R community will notice or be impacted.)

This is Microsoft's to lose. They have a culture of software that doesn't play well with others and that gets dysfunctional over time. They have a history of hating Linux, which, admittedly, they are in the process of trying to change. I just don't see how they can do a good job with Linux-based analytics software. The best thing Microsoft can do is treat the Linux distributions of Revolution Analytics software as first-class citizens; then maybe this can work. They have to be fair to Linux stack customers. To continue to attract companies to Big Data in the Enterprise with R, they have to be transparent and open about their intentions.

The paper Strong Regularities in World Wide Web Surfing is 17 years old yet retains relevance. I read the paper as part of the research I'm doing for a patent idea. One reason I picked the article for my literature review is that it's cited 812 times; classics are cited about 100 to 400 times, depending on the number of researchers in the field. So it's maybe not a seminal paper, but it did make an important contribution to the field. I decided to write a post on it because it includes a perfect example of statistics and math applied to model human behavior. And I found the paper inspiring.

I'm hoping to make this a two-part post, with this article discussing the paper and its takeaways. In Part Two, I hope to apply the theory to one or more recent data sets to check whether it still applies.

Law of Surfing

The takeaway from the paper is that regularities exist in surfing behavior. Using economic utility theory, a probability for how many links a user will click on a site is derived, and the probability distribution function for it is described. Humans continue to click links on a website while the cost of continuing is less than the perceived, discounted value of the information still to be found, and this behavior can be modeled with a Zipf's Law-like distribution. The probability distribution the authors use to model the behavior is the inverse Gaussian distribution.

The value of the information on a page, or link, is equal to the value of the information on the prior page plus or minus a random term:

V_{L} = V_{L-1} + \epsilon_{L}
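Under the paper's stopping rule (keep clicking while the expected value of continuing exceeds the cost), the depth of a surfing session is the first time this random walk falls below a threshold. Here is a minimal simulation sketch of that idea in R; the starting value, drift, noise scale, and threshold are my own illustrative assumptions, not values from the paper.

set.seed(42)

# Simulate one surfing session: keep clicking while the perceived value of
# continuing stays above the stopping threshold, and count the clicks.
simulate_depth <- function(start_value = 3, drift = -0.3, sd_eps = 1,
                           threshold = 0, max_clicks = 1000) {
  value <- start_value
  clicks <- 0
  while (value > threshold && clicks < max_clicks) {
    value <- value + rnorm(1, mean = drift, sd = sd_eps)  # V_L = V_{L-1} + eps_L
    clicks <- clicks + 1
  }
  clicks
}

depths <- replicate(10000, simulate_depth())
summary(depths)   # long right tail: the mean sits well above the median
hist(depths, breaks = 50, main = "Simulated surfing depths")

The long tail and the gap between mean and median are exactly the properties the inverse Gaussian is meant to capture.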

The general formula for the inverse Gaussian distribution is shown below, as is the image from the Wikipedia article on the distribution. It's crazy. Fortunately, there are four R packages that implement the density, cumulative probability, quantile, and random generation functions for the distribution. The link to the CRAN task view on distributions is here.

f(x;\mu,\lambda)= \left[\frac{\lambda}{2 \pi x^3}\right]^{1/2} \exp\left\{\frac{-\lambda (x-\mu)^2}{2 \mu^2 x}\right\}

Here \lambda is the shape parameter. As applied to modelling this behavior, it would be adjusted for different user community populations. No guidelines are given in the paper on how to derive the shape parameter for different populations; the authors defer this to future research.
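To stay package-agnostic, here is a short sketch that codes the density directly from the formula above and fits \mu and \lambda to a sample of depths using the closed-form maximum-likelihood estimates. The estimator formulas are standard textbook results for the inverse Gaussian, not something taken from the paper.

# Inverse Gaussian density, transcribed from the formula above.
dinv_gauss <- function(x, mu, lambda) {
  sqrt(lambda / (2 * pi * x^3)) * exp(-lambda * (x - mu)^2 / (2 * mu^2 * x))
}

# Closed-form MLEs: mu_hat is the sample mean and
# 1 / lambda_hat = mean(1 / x - 1 / mu_hat).
fit_inv_gauss <- function(x) {
  mu_hat <- mean(x)
  lambda_hat <- length(x) / sum(1 / x - 1 / mu_hat)
  c(mu = mu_hat, lambda = lambda_hat)
}

# Example: fit the depths simulated in the earlier sketch, then overlay the
# fitted density on their histogram.
fit <- fit_inv_gauss(depths)
hist(depths, breaks = 50, freq = FALSE)
curve(dinv_gauss(x, fit["mu"], fit["lambda"]), from = 1, to = max(depths), add = TRUE)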

Wikipedia image for inverse Gaussian distribution

The characteristics of the distribution in the context of surfing behavior are best explained by Huberman et al.:

First, it has a very long tail, which extends further than that of a normal distribution with comparable mean and variance. This implies a finite probability for events that would be unlikely if described by a normal distribution. Consequently, large deviations from the average number of user clicks computed at a site will be observed. Second, because of the asymmetry of the distribution function, the typical behavior of users will not be the same as their average behavior. Thus, since the mode is lower than the mean, care has to be exercised with available data on the average number of clicks, as it overestimates the typical depth being surfed.

The authors used data from three different sites contemporary with their research. The population size in each of the three data sets is sufficiently large, and they show that the distribution fits each of them.

Another takeaway is the impact of the PageRank algorithm on page relevance. Reading the discussion on page relevance, it's clear that The PageRank Citation Ranking: Bringing Order to the Web had an enormous impact on the World Wide Web. This is well known; the paper has 7,758 citations. Here is why I was impressed:

A common way of finding information on the WWW is through query-based search engines, which allow for quick access to information that is often not the most relevant. This lack of relevance is partly due to the impossibility of cataloguing (sic) an exponentially growing amount of information in ways that anticipate users' needs.

Both papers were published close in time to each other. Huberman et al. were not aware of the PageRank algorithm.

Since the paper was published, the degree to which websites have matured has led to an endlessly diverse ecosystem that users can click through. Some websites, such as Wikipedia, are meant to have a very high bounce rate. Online retailers seek to create an experience that makes users want to browse the retailer's catalog and make purchases on the site. I'm curious whether the law of surfing applies to user behavior on these two kinds of sites.

Citations

Huberman, Bernardo A., Peter L. T. Pirolli, James E. Pitkow, and Rajan M. Lukose. "Strong regularities in World Wide Web surfing." Science 280, no. 5360 (1998): 95-97.

Page, Lawrence, Sergey Brin, Rajeev Motwani, and Terry Winograd. “The PageRank citation ranking: Bringing order to the web.” (1999).


This post is a review of the book Advanced R by Hadley Wickham. The text covers material about R that is not available in one place in any other book. Many books cover statistical techniques and imperative programming in R, applying various statistical or machine learning algorithms to data sets. This book is about how the language works. It's a great text for someone who does not know much about the internals of R but comes from other programming languages, such as Java, C, or Python, and wants to know more.

Advanced R

What distinguishes this book from "The Art of R Programming" (N. Matloff), "The R Book" (M. Crawley), and methods books such as "An Introduction to Statistical Learning" (G. James, D. Witten, T. Hastie, R. Tibshirani) is the explicit intent of the author to share knowledge about the language. It is not a book that implements tasks in R. This book is for readers who want to become better R programmers or write packages.

After reading Chapter 7, OO field guide, I now understand why object-oriented programming (OOP) in R seems so difficult: there are three OO systems in R. In my own work, I will use the reference class OO system. I prefer reference semantics, and the R reference class (RC) system most closely matches the semantics of the mainstream OO programming languages.
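Here is a minimal sketch of what reference semantics look like with base R's setRefClass(); the class and field names are my own illustration, not an example from the book.

# Define a reference class with one field and one mutating method.
Counter <- setRefClass(
  "Counter",
  fields = list(count = "numeric"),
  methods = list(
    increment = function() {
      count <<- count + 1   # modifies the field of this object in place
      invisible(.self)
    }
  )
)

a <- Counter$new(count = 0)
b <- a          # b is another reference to the same object, not a copy
b$increment()
a$count         # 1: the change made through b is visible through a

With ordinary copy-on-modify semantics, a would still hold 0 here; that difference is why the RC system feels familiar coming from languages like Java or Python.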

Someone reading the book does not need any domain knowledge in statistics or machine learning. Knowing another language already is essential. Having some knowledge of computer science is helpful for understanding how R differs from other commonly used programming languages.

Throughout the book, the reader is introduced to concepts of the R internals. Two of the most useful topics for getting the most out of the book in just a few minutes of reading are environments and closures. The diagrams in the chapter on environments are elegant and easy to understand.
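A tiny toy example (mine, not the book's) shows how the two ideas interact: a closure is a function plus the environment it was created in, and that environment is where its state lives.

# make_counter() returns a function; the variable total lives in the
# environment created by the call to make_counter(), not in the global one.
make_counter <- function() {
  total <- 0
  function() {
    total <<- total + 1   # <<- assigns into the enclosing environment
    total
  }
}

counter <- make_counter()
counter()                    # 1
counter()                    # 2
environment(counter)$total   # 2: peek directly into the enclosing environment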

A re-reading of the chapters in Section 2, functional programming, is essential for readers who want to go deep into this topic. But reading the sections of the book sequentially is not required; this section can be scanned, and the reader can move on to other sections and come back to it later.

Section 4 is dedicated to performance. A key insight is in Chapter 17, on optimizing code. A practice that many readers are sure to agree with in principle, but may fail to execute in practice, is not to spend too much time optimizing code. Over-optimizing is not a problem unique to R but one common to the craft of programming in general; saving a few seconds of CPU time is not worth the minutes or hours it may take to meet an arbitrary threshold of optimization. It's not worth it. But if you must, this section tells you how.
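As an illustration of the trade-off (my own example, not one from the book), the vectorized version below is dramatically faster than the explicit loop, yet for code that runs once, either finishes quickly enough that rewriting it is rarely worth the effort.

x <- runif(1e6)

# Explicit loop: accumulates the sum one element at a time.
system.time({
  total <- 0
  for (v in x) total <- total + v
})

# Vectorized built-in: the same result in a single call.
system.time(sum(x))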

Conclusion

I recommend this book. It’s not for R beginners nor for any readers new to programming. It is for the reader who wants to advance their skills and who already has command of subsetting, vectorization, and R data structures.

This review is based on the paper copy of the book. The book is freely available here.