This post shows how to include a job number with Perforce changelists submitted from the Bash prompt. Here’s an example of the error:

[user-name@server-name ~/scripts]$ p4 submit -d "ABC-1234 Testing Scala connection to Oracle"
Submitting change 2011453.
Locking 1 files ...
Submit validation failed -- fix problems then use 'p4 submit -c 2011453'.
'bugattached' validation failed: Submissions to this branch must have a bug attached

The string ABC-1234, the changelist number 2011453, and the description are placeholder values used throughout this post to illustrate the problem and the solution.

If you’re receiving this error at the command line and end up going back to the Perforce visual client, read on; there is a way to submit from the command line. The terms bug number and job are equivalent to the JIRA ticket number I use here.

Including the JIRA ticket number

The above error occurs when a validation rule is in place requiring that a bug be associated with the changelist. The fix is a three-command process:

[user-name@server-name ~/scripts]$ p4 submit -d "ABC-1234 Testing Scala connection to Oracle"
[user-name@server-name ~/scripts]$ p4 fix -c 2011453 'ABC-1234'
[user-name@server-name ~/scripts]$ p4 submit -c 2011453

Note: System responses that are displayed after each of the commands have been removed.

It’s a three-step process, and it works. Reading the Perforce documentation, it looks like there is a second way to submit at the command line using just one command. This second way uses the Jobs field of the Perforce change specification form. I couldn’t find a working example, and spending time on it myself, I couldn’t figure it out. If you know how to use the form field, please write in with a comment!
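
Based on the documentation, my best guess at the form-based route is the untested sketch below: run p4 submit without -d so the change specification form opens in your editor, then list the job in a Jobs: field alongside the description. The job name ABC-1234 is the same placeholder as above, and whether this satisfies the bugattached validation will depend on how your server is configured.

[user-name@server-name ~/scripts]$ p4 submit

Description:
        ABC-1234 Testing Scala connection to Oracle

Jobs:
        ABC-1234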

I wasn’t sorry to spend my Saturday at Data Day Seattle 2015. Kudos to Lynn Bender for doing a fantastic job of organizing and programming the event. In this post, I’ll mention some overall takeaways from the event and then go deeper into one key takeaway, technical debt in analytics and machine learning (ML) applications.

Data Day Seattle 2015 Takeaways

The event focused on technologies, techniques, and how to solve business problems with them. Smartly, the role of vendors and vendor products was secondary. The variety of topics spanned many different fields, and it was only a one-day event. While not Strata, where it is easy to gain a sense of trends and important topics, this event still conveyed what is important right now: the emergence of IoT. On the technology side, Spark and streaming had the mind share.

One disappointment? The KNIME table didn’t receive more attention than it did. With KNIME’s GUI, product architecture, integration with other tools, and native ability to handle out-of-RAM data sets, it seems inevitable that it will explode in acceptance and popularity…but that might still be some time away.

Technical Debt in Analytics and Machine Learning Applications

Why would technical debt in analytics and ML applications be different than technical debt in traditional software projects? Turns out, it is different. It’s worse. Why? Speaker Matthew Kirk discussed the problem. His talk was organized around the paper Machine Learning: The High Interest Credit Card of Technical Debt, freely available at Google Research here.

(Image: Fix this and close your ticket. Will it take more time to rewrite it or to try to figure it out and refactor it?)

The problem

Companies are responding to the market more and more quickly. The way to make a better company is to be more responsive, i.e., to use data to drive the business and make decisions. It’s a golden opportunity. The downside is that everything becomes harder and harder to do. This is what led to the paper.

Discussion

The authors’ point is that there’s a lot to be careful of. The problem goes beyond the code complexity of regular software engineering. There are four major categories of pitfalls:

  1. Boundary erosion
  2. Data dependencies
  3. Spaghetti code
  4. The real world

Matthew closely followed the article’s organization. I’ll touch on those debts and problems that are probably the most common.

Boundary erosion – Things become really entangled, and the line between the data and the code is blurred. When writing data science software, we can’t write loosely coupled code. Entanglement is the idea that we have features, things we have computed using functions, and so on, and when we add to them, things change. The governing principle is change anything, change everything (CACE): whenever you change the data set, everything else changes. To overcome entanglement, isolate your models as best you can, for example with regularization.

Who consumes the output? Visibility debt is different from entanglement. An example of visibility debt is a model of click-through rate (CTR). If Finance uses it to calculate the lifetime value of a customer, and then the CTR model changes, Finance’s modeling is now off. The debt is not knowing who across the company is consuming, or utilizing, your data. The solution is to keep an API and a list of user names, and not to share user names.

Data dependencies – Input signals are constantly changing; think slang or emoticons. The solution is simple…version your data. Google will version text data, such as the corpus used to train a model. Version your data sets.
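
As a trivial illustration of the idea in R (the object name training_corpus is made up for this sketch):

# Stamp each saved data set with a date so a model can be traced back to its inputs
version_tag <- format(Sys.Date(), "%Y%m%d")
saveRDS(training_corpus, file = paste0("corpus_", version_tag, ".rds"))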

Underutilized data adds dimensions and brings the curse of dimensionality. Image processing problems work with underutilized data: most of the information in a picture is not important, and vector distances get further and further apart as you add new dimensions. The solution is feature selection. Fortunately, this topic receives a lot of attention, and there are many methods available to the engineer or researcher. It’s up to us to pull out the features that are important; doing so leads to a stable model.
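
As one small illustration of the simplest kind of filter (X is assumed to be a numeric feature matrix; a real project would likely reach for more principled methods):

# Drop near-constant columns: features with almost no variance rarely help a model
variances <- apply(X, 2, var, na.rm = TRUE)
X_reduced <- X[, variances > 1e-4, drop = FALSE]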

Spaghetti code – There’s a lot of spaghetti code in data science. Researchers and some engineers come from academia and don’t have experience writing production-quality code. Glue code is common in data science: there are so many libraries to tie together, and we tie them together in interesting ways, i.e., we use whatever library and language we know to get it to work. When you have to ship something, you make it work.

The suggested solution is to write your own implementation of some things, such as reimplementing an algorithm in the language and framework you’re using for your system. I don’t agree. Testing new code, especially code with non-deterministic outcomes, is hard. A better solution is for team members to lean in…if a piece of code is essential to a team, take ownership of it and maintain it.

Experimental code paths…there is always a piece of code that says it is going to do this or that, but you’re pretty sure it doesn’t. Still, you’re not certain, so you don’t take it out. A solution is to tombstone it: write a log statement that says you think this method should go away. After a while, look at the log and see whether the method you want to delete is ready to go. It’s an opportunity to use data analysis on your own logs.
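
A minimal sketch of the idea in R (the function name and log file are made up):

suspect_adjustment <- function(x) {
  # TOMBSTONE: believed unused; log each call and check the log before deleting
  cat(format(Sys.time()), "TOMBSTONE: suspect_adjustment() called\n",
      file = "tombstones.log", append = TRUE)
  x  # existing behavior left unchanged while evidence accumulates
}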

Conclusion

We are in a blissful land of opportunity. Thinking about ways to avoid technical debt in our applications helps us get closer to deploying code that users use and that drives the company forward.

References

D. Sculley, Gary Holt, Daniel Golovin, Eugene Davydov, Todd Phillips, Dietmar Ebner, Vinay Chaudhary, and Michael Young. Machine Learning: The High Interest Credit Card of Technical Debt. SE4ML: Software Engineering for Machine Learning (NIPS 2014 Workshop), 2014.

The iframe tag and the WordPress iframe plugin are how to include HTML widgets in a WordPress blog post. The iframe tag is part of HTML5, and using an iframe in a WordPress post is how you get the JavaScript associated with a widget to run.

Here’s an example of how to create an HTML widget in R using an .R file. This could also be done in an .Rmd and knitted. Since I didn’t want to create an .html page from within RStudio using knitr, I just sourced the .R file:

library(htmlwidgets)
library(DT)
# Build an interactive DataTables widget from the built-in iris data set
a <- datatable(iris)
# Write the widget out as a standalone .html file
saveWidget(a, "datatable-iris-example.html")

Next, import the .html file to your media library. Then, add the shortcode to your post. Here's how to encode it in the page when editing the blog post:

iframe seamless src="http://www.phillipburger.net/wordpress/wp-content/uploads/2015/05/datatable-iris-example.html" width="100%" height="500"

Note that I had to remove the opening [ and closing ] brackets to keep the shortcode from being executed here. When you add the shortcode to your post, add back the [ as the first character and the ] as the last character. That's how simple it is. Here's how it looks:

[iframe seamless src="http://www.phillipburger.net/wordpress/wp-content/uploads/2015/05/datatable-iris-example.html" height="535"]

I am not aware of any other way to get R-created HTML widgets (JavaScript) to execute in a blog post. If you have any suggestions, please post a comment!

The functions in the R readr package load text data into R much faster than the load functions included in the R utils package. Adopting the functions in this package will improve the workflow and productivity of the solo analyst and of whole teams of analysts. This post provides an example of how to read .csv files using a function from the package, compares the time cost of that function to a customary way of reading text files, and shows an example of the warnings and problems the package’s functions produce.

(Edit: I re-ran the code in the gist below while finishing up this post. The user time for that execution of read_delim() was only 17 seconds.)

Introduction

The data used in this comparison was distributed over 24 files. The total number of data records was 3,020,761, and the total size of the data was 1.2GB. A customary way to read data into an R data structure is to use read.table():

(Image: A customary method of reading files using read.table())
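
For reference, here is a minimal sketch of this baseline approach; the data/ directory, the comma delimiter, and the header setting are assumptions for illustration rather than the original code:

# Gather the input files
files <- list.files("data", pattern = "\\.csv$", full.names = TRUE)
# Read each file with read.table() and stack the results into one data frame
system.time(
  baseline_df <- do.call(rbind, lapply(files, read.table,
                                       header = TRUE, sep = ",",
                                       stringsAsFactors = FALSE))
)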

This reads the data into a data frame. The approach and the system time are shown in the image above. The user time to read these 24 files using read.table() was 267 seconds. This was the baseline that the read function in the readr package was compared to.

R readr package code example and results

The readr package has several functions. (A link to the RStudio blog post announcing the package is included at the end of this post; check out the announcement for a more complete discussion of the package’s functions and features.) I used just one of them in my example. Here’s R code that uses read_delim(). This code can be used to read any number of .csv files in a directory; the regular expression causes all files ending in .csv to be included in the processing:
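
Roughly, the approach looks like the sketch below; the data/ directory and the name csv_list are assumptions for illustration:

library(readr)
# The regular expression selects every file ending in .csv in the directory
files <- list.files("data", pattern = "\\.csv$", full.names = TRUE)
# Read each file with read_delim(); the result is a list with one data frame per file
system.time(
  csv_list <- lapply(files, read_delim, delim = ",")
)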

The user time to read this same set of data using the function read_delim() was about 29 seconds:

(Image: Reading .csv files using read_delim())

Warnings and problems

It’s a talkative package. An aspect of using the package that I did not expect was its use of warnings. This section shows some of the messages and warnings I encountered while developing the solution above and how I investigated them. Here is the first indication that a possible problem was encountered:

(Image: Example of a warning message when reading a file with the R readr package)

Typing the statement warnings() at the console produced this listing:

(Image: Warning messages from an execution of read_delim())

Here are some of the specifics; notice the syntax. I loaded my data into a list of data frames. To look at the problems for just one of the files, call problems() with that object passed in as the parameter:

(Image: The first 25 problems displayed to the console)
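
For example, assuming the list of data frames is named csv_list as in the sketch above, the problems recorded for the first file can be pulled out like this:

# problems() returns the row, column, and description of each parsing issue
problems(csv_list[[1]])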

I don’t recommend ignoring the warnings and problems. If you have to operationalize code, and your process repeats each day with fresh data, and perhaps you are consuming data that is curated and saved in an EDW, it’s probably okay not to raise an alert and fully research every problem. But it depends. For one-off and ad hoc data analysis, it’s worth spending the time to review the warnings and messages. I mention this in comparison to the customary method of reading data using read.table() or similar functions. Analysts now have three tools to make sure their data is properly loaded…visual inspection, summaries and explorations, and these warning and problem listings.

Key takeaway

Hardly any training time is required to be able to use and understand the functions in the readr package. In the example above, using a function from the package to read a set of files was approximately an order of magnitude faster than reading the same set of files using a customary method. This is a big win for workflow and analyst productivity.

The announcement for the package and more information on how to use the other functions and features is located on the RStudio blog.

This post describes how to use the R dplyr package to calculate percentages. A data set from the U.S. Census Bureau was used. Three tests check that the calculation using dplyr was accurate. The code incorporates the pipe operator %>%, which was introduced into R in 2014 via the magrittr package.

The general question is: what percentage does a measure value for one instance in a subgroup represent of the sum of the measure values of all instances in the same subgroup? That statement is accurate but not easy to parse; an example in SQL makes it clear. This statement computes each individual’s salary as a share of the total salary within his or her department:

SELECT depname, empno, salary, salary / SUM(salary) OVER (PARTITION BY depname) FROM empsalary

The goal is to do the same calculation as this SQL window function, but with dplyr.

Calculation in dplyr

Here’s the code:
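
A minimal sketch of the calculation looks like this; the object and column names (census_df, state, region, pop18plus) are made-up placeholders rather than the names in the actual data set:

library(dplyr)

# Drop Puerto Rico, group by region, then express each state's 18+ population
# as a share of its region's total
pct_by_region <- census_df %>%
  filter(state != "Puerto Rico") %>%
  group_by(region) %>%
  mutate(pct18Plus = pop18plus / sum(pop18plus)) %>%
  ungroup()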

Census data set

The data set is publicly available; the URL is listed in the read.table() call on line 10 of the gist. I selected this data set because it has characteristics that made it useful: it contains data that can be grouped, and it includes an already-computed value that I could compare my results against. Additionally, I didn’t want Puerto Rico in the results, so I used the filter() feature to remove Puerto Rico prior to executing the calculation. The aim was to make sure I knew how to use dplyr.

Results

Here’s a screen shot of the results for region 4, the West:

(Image: Region 4 of the result data set)

The field I calculated was pct18Plus, the right-most column in the table above. (This is a screen shot of the result, and the column headers aren’t included.)

The result for Wyoming is interesting. The percentage calculated is so small because the population of Wyoming is small compared to the population of California, not because the 18+ population in Wyoming is a small percentage of Wyoming’s population.

Key takeaway

The R dplyr package just works. It’s easier than using base R to complete the same tasks. It’s not arcane. It’s elegant to read and use, and it’s decreased my development time.