In a team environment, most coding tasks performed in R use the same set of packages. At the company where I work, there are approximately 50 R packages we use. This post discusses how all the analysts and data scientists on a team can share access to the same library of packages while also eliminating the work of keeping packages current.

What is the problem?

Loading and maintaining packages takes time, is sometimes inconvenient because of dependencies, and leads to different package versions across development workflows. Another problem is that the package versions used during development might differ from the versions used when the code is put into operation or scheduled to run under a service account on cron.

A Solution

Add to your path the set of packages that are installed and maintained under the service account (service_account). If you want a package that isn’t in the service_account library, say packages for a GIS or mapping-related project, install it in your user library first. If your code is later put into operation, the package can be added to the service_account library, making it available to your code and subsequently to the code of every other developer on the team.

How to use the service_account library of packages

Put the service_account library into your path. There are two ways.

1) The manual way, which you have to repeat each time you start RStudio Server –

Execute the commands shown in the image below. The first and last calls to .libPaths() are for demonstration only; you only need to enter the second call, the one with the vector argument. Do this each time you start a session. You can also do this in the middle of a coding session.

[Image: .libPaths() commands that add the service_account library]

I just happen to be using an environment where we have Revolution Analytics Enterprise version 7.4 installed. If you’re using CRAN R 3.1.3 (or any other CRAN R version), the path name in element [3] will be different.
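For reference, here is a minimal sketch of those commands; the service_account library path below is a placeholder and will differ in your environment:

.libPaths()                                                    # show the current library search path
.libPaths(c("/home/service_account/R/library", .libPaths()))   # prepend the shared service_account library
.libPaths()                                                    # confirm the new search order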

2) The automatic way, by adding the .libPaths() call to your .Rprofile file –

[Image: .Rprofile file with the .libPaths() call added]

The above image shows my .Rprofile file. The file lives in /home/{user-alias}. It is not created automatically by the R installation process or any other process; you have to create it yourself. To create it, run nano .Rprofile from your home directory, add the .libPaths() command as shown, save the file, exit the editor, and restart your R session.
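Here is a sketch of what the added line might look like; again, the service_account library path is a placeholder for whatever is used in your environment:

# .Rprofile in /home/{user-alias} -- runs at the start of every R session
.libPaths(c("/home/service_account/R/library", .libPaths()))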

Note that nano is an editor. You can also use vi/vim. The nano editor is available to all users and I recommend it over vi/vim.

What to watch out for

If you install packages in your user library that are also installed in the service_account path, those packages will show up twice in your list of installed packages. If you load packages manually through the RStudio Server GUI, it is easy to select the version of the package you don’t want. But you can decide what to do. You can keep two different versions installed; a great use case for this is testing changes in packages that your application might be sensitive to. In interactive mode it’s easy to load and unload packages and test code execution this way.

R loads a function from the first library in the search path where it finds it. In this post, the service_account library sits at the front of the path. You could change the order so that packages are searched in your user library first, then service_account, then the R system library.
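As a sketch (the paths are again placeholders), that alternate ordering would look like:

# Search the user library first, then service_account; .libPaths() always
# keeps the R system library at the end of the search path
.libPaths(c("/home/{user-alias}/R/library",
            "/home/service_account/R/library"))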

Key takeaway

Having analysts and data scientists use the library of packages installed for the service account is an easy, low-cost way to improve team workflow and productivity. Doing so also eliminates the possibility of introducing package version-related bugs when code is put into operation.


The iframe tag and the WordPress iframe plugin are how I include HTML widgets in a WordPress blog post. The iframe tag is part of HTML5, and using an iframe is how you get the JavaScript associated with a widget to run in your post.

Here’s an example of how to create an HTML widget in R using an .R file. This could also be done in an .Rmd and knitted. Since I didn’t want to create an .html page from within RStudio using knitr, I just sourced the .R file:

library(htmlwidgets)                            # saveWidget()
library(DT)                                     # datatable()
a <- datatable(iris)                            # build an interactive table widget from the iris data
saveWidget(a, "datatable-iris-example.html")    # write the widget to a self-contained .html file

Next, import the .html file to your media library. Then, add the shortcode to your post. Here's how to encode it in the page when editing the blog post:

iframe seamless src="http://www.phillipburger.net/wordpress/wp-content/uploads/2015/05/datatable-iris-example.html" width="100%" height="500"

Note that I had to remove the opening [ and closing ] brackets to keep the shortcode from being executed here. When you add the shortcode to your own post, add back the [ as the first character and the ] as the last character. That's how simple it is. Here's how it looks:

I am not aware of any other way to get R-created HTML widgets (JavaScript) to execute in a blog post. If you have any suggestions, please post a comment!

The functions in the R readr package load text data into R much faster than the read functions included in the base utils package. Adopting the functions in this package will improve the workflow and productivity of the solo analyst and of whole teams of analysts. This post provides an example of reading .csv files with a function from the package, compares the time cost of that function to a customary way of reading text files, and shows an example of the warnings and problems the package’s functions produce.

(Edit: I just re-ran the code in the below gist again while finishing up this post. The user time for this execution of read_delim() was only 17 seconds.)

Introduction

The data used in this comparison was distributed over 24 files. The total number of data records was 3,020,761, and the total size of the data was 1.2GB. A customary way to read data into an R data structure is to use read.table():

[Image: A customary method of reading files using read.table()]

This reads the data into a data frame. The approach and the system time are shown in the adjacent image. The user time to read these 24 files using read.table() was 267 seconds. This was the baseline that the read function in the readr package was compared against.
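The original code was shown only as an image, but a sketch of that kind of baseline might look like this (the data directory is a placeholder):

# Hypothetical baseline: read all 24 .csv files with read.table() and combine them
files <- list.files("data", pattern = "\\.csv$", full.names = TRUE)
system.time({
  baseline <- do.call(rbind, lapply(files, read.table, header = TRUE,
                                    sep = ",", stringsAsFactors = FALSE))
})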

R readr package code example and results

The readr package has several functions. (The link to the RStudio blog post announcing the package is included at the end of this post. Check out the announcement for a more complete discussion of the package’s functions and features.) I used just one of them in my example. Here’s R code that uses read_delim(). This code can be used to read any number of .csv files in a directory; the regular expression causes all files ending in .csv to be included in the processing:
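The gist isn’t reproduced here, but a sketch along the same lines might look like this (the data directory is again a placeholder):

library(readr)

# Read every .csv file in the directory into a list of data frames;
# the regular expression matches file names ending in .csv
files <- list.files("data", pattern = "\\.csv$", full.names = TRUE)
system.time({
  dat_list <- lapply(files, read_delim, delim = ",")
})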

The user time to read this same set of data using the function read_delim() was about 29 seconds:

[Image: Reading .csv files using read_delim()]

Warnings and problems

It’s a talkative package. An aspect of using the package that I did not expect was its use of warnings. This section displays some of the messages and warnings I encountered while developing the solution above and how I investigated them. Here is the first indication that a possible problem was encountered:

[Image: Example of a warning message when reading a file with the readr package]

Typing the statement warnings() at the console produced this listing:

[Image: Warning messages from an execution of read_delim()]

Here are some of the specifics. Notice the syntax: I loaded my data into a list of data frames, so to look at the problems for just one of the files, call problems() with that object passed in as the argument:

[Image: The first 25 problems displayed to the console]
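A sketch of that inspection, assuming the files were read into a list of data frames named dat_list as in the earlier sketch:

# Review the parsing warnings raised during the read
warnings()

# Inspect the parsing problems recorded for the first file in the list
problems(dat_list[[1]])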

I don’t recommend ignoring the warnings and problems. If you have operationalized code whose process repeats each day with fresh data, and you are consuming data that is curated and saved in an EDW, it’s probably OK not to raise an alert and fully research every problem. But it depends. For one-off and ad hoc data analysis, it’s worth spending the time to review the warnings and messages. I mention this in comparison to the customary method of reading data using read.table() or similar functions. Analysts now have three tools to make sure their data is properly loaded: visual inspection, summaries and explorations, and these warnings and problem listings.

Key takeaway

Hardly any training time is required to be able to use and understand the functions in the readr package. In the example above, using a function from the package to read a set of files was approximately an order of magnitude faster than reading the same set of files using a customary method. This is a big win for workflow and analyst productivity.

The announcement for the package and more information on how to use the other functions and features is located on the RStudio blog.

Data Day Seattle is on Saturday, June 27, 2015. The event is being organized by Lynn Bender. I’m attending and encourage readers in the Puget Sound area, or nearby regions such as Portland or Vancouver, to take a look at the lineup and judge whether it’d be a good use of a precious Saturday. Event site: http://datadayseattle.com/

Why attend: This event is about data. It’s an opportunity to gain exposure to Silicon Valley/San Francisco-quality speakers, authors, and guests, without the inconvenience and cost of travel.

Who is attending: Data geeks in Seattle.

Serendipity is possible: If you have ever attended a tech conference or event, or received training in Silicon Valley or San Francisco, you know you’re visiting the mother ship of tech and data. This is an opportunity to be open to something special: gaining insight into a new metric that is just waiting to be developed, identifying new personal or career goals and ways to realize them, or maybe meeting someone who has the same idea about something that you do.

Sample of speakers and guests:
Eric Sammer – Hadoop. He’s the author of an O’Reilly title on Hadoop operations.
Wes McKinney – Author of the popular O’Reilly title Python for Data Analysis. Decent speaker.
Ted Dunning – An engaging speaker. Currently Chief Applications Architect at MapR. This is the guy I get my line from: the frequency of anomalies in your data should be the number of times a month you want to get woken up in the middle of the night by your NOC; adjust the threshold measures on your algorithms accordingly.
Michael Berthold – Founder of KNIME. KNIME is a GUI-based, open-source data mining tool. (Hint: The desktop version is freely available here.) Michael is very smart and freely shares his knowledge. He’s a pretty good speaker. An application using KNIME is on my list of sandbox use cases. KNIME could explode in adoption and usage at any time.

Tickets: The retail cost of $450 might seem kind of expensive. I’ve periodically seen discounts offered. My cost was sub-$200. In checking the ticket site right now, I found that tickets are currently $230 through the rest of the day today. Ticket site: http://www.eventbrite.com/e/data-day-seattle-tickets-15586425418

Rather than hoping for bad weather that day, I hope instead that it’s a great day that meets or exceeds my expectations!

This post describes how to use the R dplyr package to calculate percentages. A data set from the U.S. Census Bureau was used. Three tests check that the calculation using dplyr was accurate. The code incorporates the pipe operator %>% that was introduced into R in 2014 via the magrittr package.

The general question is: what percentage does the measure value of a single instance within a sub group represent of the sum of the measure values of all instances in that same sub group? That statement is accurate but not easy to understand; an example in SQL makes it clear. This statement computes each individual’s salary as a share of the total salary within his or her department:

Select depname, empno, salary, salary/sum(salary) over (partition by depname) from empsalary

The dplyr calculation is the same as a window function in SQL.

Calculation in dplyr

Here’s the code:
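The gist isn’t reproduced here, but a sketch of the calculation looks like this; the file name and the column names (REGION, NAME, POPEST18PLUS) are hypothetical stand-ins for the ones in the Census file:

library(dplyr)

pop <- read.table("us-census-state-population.csv",       # placeholder for the Census Bureau URL
                  header = TRUE, sep = ",", stringsAsFactors = FALSE)

result <- pop %>%
  filter(NAME != "Puerto Rico") %>%                        # drop Puerto Rico before calculating
  group_by(REGION) %>%                                     # one group per Census region
  mutate(pct18Plus = POPEST18PLUS / sum(POPEST18PLUS))     # each state's share of its region's 18+ population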

Census data set

The data set is publicly available; the URL is listed in the read.table() call on line 10 of the gist. I selected this data set because it has useful characteristics: it contains data that can be grouped, and it already contains a calculated value I could compare my results against. Additionally, I didn’t want Puerto Rico in the results, so I used filter() to remove it before running the calculation. The aim was to make sure I knew how to use dplyr.

Results

Here’s a screen shot of the results for region 4, the West:

[Image: Region 4 (West) of the result data set]

The field I calculated was pct18Plus, the right-most column in the table above. (This is a screen shot of the result, and the column headers aren’t included.)

The result for Wyoming is interesting. The calculated percentage is so small because the population of Wyoming is small compared to the population of California, not because the 18+ population in Wyoming is a small percentage of Wyoming’s population.

Key takeaway

The R dplyr package just works. It’s easier than using base R to complete the same tasks. It’s not arcane. It’s elegant to read and use, and it’s decreased my development time.