The functions in the R readr package load text data into R much faster than the base read functions in the utils package. Adopting them will improve workflow and productivity, whether you are a solo analyst or part of a team. This post shows how to read .csv files using a function from the package, compares its time cost to a customary way of reading text files, and walks through the warnings and problems the package's functions produce.
(Edit: I re-ran the code in the gist below while finishing up this post. The user time for that execution of read_delim() was only 17 seconds.)
The data used in this comparison was distributed over 24 files, containing 3,020,761 records in total, about 1.2 GB of data. A customary way to read data into an R data structure is to use read.table(), which reads the data into a data frame. This approach and its system time are shown in the adjacent image. The user time to read these 24 files using read.table() was 267 seconds. That was the baseline against which the readr read function was compared.
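The baseline can be sketched roughly as follows. This is not the exact code from the original gist; the directory name `data` and the read.table() arguments are assumptions:

```r
# Customary baseline: read.table() over a directory of .csv files.
# "data" is an assumed directory name; adjust to your own layout.
files <- list.files("data", pattern = "\\.csv$", full.names = TRUE)

timing <- system.time(
  data_list <- lapply(files, read.table,
                      header = TRUE, sep = ",",
                      stringsAsFactors = FALSE)
)
print(timing)  # user/system/elapsed seconds for the whole read
```

Each element of `data_list` is one file's data frame; the files could then be combined with rbind() or similar if a single data frame is needed.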
R readr package code example and results
The readr package provides several read functions. (A link to the RStudio blog post announcing the package is included at the end of this post; see their announcement for a more complete discussion of the package's functions and features.) I used just one of them in my example. Here's R code that uses read_delim(). This code can read any number of .csv files in a directory; the regular expression matches every file ending in .csv:
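A minimal sketch of that approach, assuming the same hypothetical `data` directory as above rather than the exact paths in the original gist:

```r
library(readr)

# Match every file in the directory whose name ends in .csv
files <- list.files("data", pattern = "\\.csv$", full.names = TRUE)

# read_delim() with delim = "," parses each .csv file into a data frame;
# lapply() collects the results into a list of data frames
data_list <- lapply(files, read_delim, delim = ",")
```

Wrapping the lapply() call in system.time() reproduces the timing comparison against read.table().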
The user time to read the same set of data using read_delim() was about 29 seconds:
Warnings and problems
It’s a talkative package. One aspect of using it that I did not expect was its use of warnings. This section shows some of the messages and warnings I encountered while developing the solution above, and how I investigated them. Here is the first indication that a possible problem was encountered:
Typing warnings() at the console produced this listing:
Here are some of the specifics; notice the syntax. I loaded my data into a list of data frames. To look at the problems for just one of the files, call problems() with that object passed as the argument:
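For example, with the data loaded into a list of data frames as above (the list name `data_list` is an assumption, not the name from the original gist):

```r
# Inspect parsing problems recorded for the first file in the list.
# problems() returns a data frame describing each parsing issue,
# including the row and column where it occurred and the expected
# versus actual values.
probs <- problems(data_list[[1]])

head(probs)   # peek at the first few recorded problems
nrow(probs)   # total number of problems for this file
```

Looping problems() over the whole list is a quick way to see which of the 24 files need closer inspection.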
I don’t recommend ignoring the warnings and problems, but it depends on the situation. If you have operationalized code that runs each day against fresh data, perhaps consuming curated data saved in an EDW, it is probably OK not to raise an alert and fully research every problem. For one-off and ad hoc data analysis, though, it is worth spending the time to review the warnings and messages. I mention this in comparison to the customary method of reading data with read.table() and similar functions: analysts now have three tools to make sure their data loaded properly: visual inspection, summaries and explorations, and now these warnings and problem listings.
Hardly any training time is required to use and understand the functions in the readr package. In the example above, reading a set of files with a function from the package was roughly an order of magnitude faster than reading the same files with the customary method. This is a big win for workflow and analyst productivity.
The announcement for the package and more information on how to use the other functions and features is located on the RStudio blog.