This post shows how to include a job number with Perforce changelists submitted from the Bash prompt. Here’s an example of the error you see when a job is required but not attached:

[user-name@server-name ~/scripts]$ p4 submit -d "ABC-1234 Testing Scala connection to Oracle"
Submitting change 2011453.
Locking 1 files ...
Submit validation failed -- fix problems then use 'p4 submit -c 2011453'.
'bugattached' validation failed: Submissions to this branch must have a bug attached

The string ABC-1234, the changelist number 2011453, and the description are placeholder values used in this post to illustrate the problem and the solution.

If you’re receiving this error at the command line and end up going back to the Perforce visual client, read on: there is a way to submit at the command line. The terms bug number and job are used here interchangeably with the JIRA ticket number.

Including the JIRA ticket

The above error occurs when a validation rule is in place requiring that a bug be associated with the changelist. The fix is a three-command process:

[user-name@server-name ~/scripts]$ p4 submit -d "ABC-1234 Testing Scala connection to Oracle"
[user-name@server-name ~/scripts]$ p4 fix -c 2011453 'ABC-1234'
[user-name@server-name ~/scripts]$ p4 submit -c 2011453

Note: The system responses displayed after each command have been omitted.

It’s a three-step process, but it works. Reading the Perforce documentation, it looks like there is a second way to submit at the command line using just one command: the Jobs field of the changelist form. I couldn’t find a working example, and spending time on it myself, I couldn’t figure it out. If you know how to use the form field this way, please write in with a comment! In the meantime, the three-step flow is easy to script.
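Here’s a minimal, untested Bash sketch of such a wrapper. The function name is mine, and the sed pattern assumes the failed submit prints the "use 'p4 submit -c NNNNNNN'" hint shown above; adjust it to whatever your server actually returns.

# Sketch: wrap the submit / fix / submit sequence in one Bash function.
# Assumes the failed submit prints "use 'p4 submit -c <changelist>'" as in the error above.
p4_submit_with_job() {
  local job="$1"; shift
  local desc="$*"
  local out change
  # First submit attempt; capture the output so the changelist number can be parsed.
  out=$(p4 submit -d "$job $desc" 2>&1)
  printf '%s\n' "$out"
  # Pull the changelist number out of the "p4 submit -c NNNNNNN" hint.
  change=$(printf '%s\n' "$out" | sed -n 's/.*submit -c \([0-9][0-9]*\).*/\1/p' | head -1)
  if [ -n "$change" ]; then
    p4 fix -c "$change" "$job"      # attach the job (JIRA ticket) to the changelist
    p4 submit -c "$change"          # resubmit now that the bug is attached
  fi
}

Usage would look something like this (placeholder ticket and description):

p4_submit_with_job ABC-1234 "Testing Scala connection to Oracle"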

Edit: Paula Poundstone’s keynote address is now available here.

Strata + Hadoop World San Jose 2016 did not fail attendees. The best presentation was Paula Poundstone‘s stand-up comedy routine, which ran two minutes over schedule. She lambasted technology and the industry. Brilliant. I will have to post the YouTube video as an update. In the meantime, here’s a commentary by Paula about the problem of flat-screen addiction:

Expo hall gets it right

The Expo Hall was an integral part of the conference experience. The density of vendors could not have been more satisfying. It was a feast of interactivity between attendees and the really smart people staffing the booths. It also gave the hall a city-like feel: the vendor booths were crammed together like apartments along a city street. Kudos to the organizers.

My favorite

What was the killer software I have to have? DataRobot. It automates the task of building models in a data space that’s too large for a human alone to ever fully explore. The business problem for most is the need to monetize data, the need to innovate and create data products. DataRobot can help do this. It’s the first, I won’t say set-and-forget, but rather tune-set-and-forget, machine-learning AI bot that’s commercially available. And, amazingly, DataRobot is not too expensive: an install on a four-core box would set me back about $20,000 (more, obviously, for an install on a cluster).

The DataRobot company will get bought out. Why? It can be set loose on data to run near continuously, with only periodic human interaction, to model the hell out of your data. On a Mac, with the bot running 24 hours a day, seven days a week, it would crush. It includes a catalog of models that I have never even heard of, and it’s easily 100 times faster at modeling data than a human could ever be. It is feature rich, including the ability to clone models, provide parameter settings (while also using sensible defaults), evaluate model performance with various metrics, and automatically generate plots in a data exploration step. In a final stroke of a-ha, the bot can be configured to take the top models returned from a run on a dataset and create an ensemble from them, improving results (probably!) over the single best model. Models can then be deployed. DataRobot is a no-brainer investment for companies serious about monetizing their data.

Themes

Overall themes and key observations, in roughly time-sequence order from the beginning of the conference at the top to the end at the bottom:

  1. Streaming
  2. Real-time analytics
  3. Spark
  4. Tay as an example of AI gone awry; an embarrassment to Tay’s parents
  5. Peak BI
  6. In-memory
  7. Streaming
  8. Spark
  9. Kafka
  10. Use of Notebooks to share results and for narration
  11. Visualizations from the keynote speakers were not all that flashy; there were numerous examples of simple line graphs and simple charts, and that was OK.
  12. I actually saw statistics creeping into Data Science…confidence intervals were included on a bar chart!
  13. A greater percentage of attendees were women compared to the 2013 edition of the same conference
  14. noSQL means noMarketShare (no common API and no way to exchange data means no market share)
  15. Streaming
  16. Docker; it is worth the time to learn it
  17. Spark
  18. Streaming
  19. Speaker demos that were executed in Python were in Python3 (RIP Python2)
  20. No language wars encountered
  21. Nothing about the Julia programming language

Please feel free to leave comments.

I wasn’t sorry to spend my Saturday at Data Day Seattle 2015. Kudos to Lynn Bender for doing a fantastic job of organizing and programming the event. In this post, I’ll mention some overall takeaways from the event and then go deeper into one key takeaway, technical debt in analytics and machine learning (ML) applications.

Data Day Seattle 2015 Takeaways

The event focused on technologies, techniques, and how to solve business problems using them. Smartly, the role of vendors and vendor products was secondary. The variety of topics spanned many different fields, impressive for a one-day event. While it isn’t Strata, where it is easy to gain a sense of trends and important topics, this event still conveyed what is important right now: the emergence of IoT, with Spark and streaming holding the technology mind share.

Disappointing? That the KNIME table didn’t receive more attention than it did. With KNIME’s GUI, product architecture, integration with other tools, and native ability to handle out-of-RAM data sets, it’s inevitable that it will explode in acceptance and popularity…but this might still be some time away.

Technical Debt in Analytics and Machine Learning Applications

Why would technical debt in analytics and ML applications be different from technical debt in traditional software projects? It turns out that it is different. It’s worse. Why? Speaker Matthew Kirk discussed the problem. His talk was organized around the paper Machine Learning: The High Interest Credit Card of Technical Debt, freely available from Google Research here.

Fix this and close your ticket. Will it take more time to rewrite it or to try to figure it out and refactor it?

The problem

Companies are responding to the market quicker and quicker. The way to make a better company is to be more responsive, i.e., use data to drive the business and make decisions. It’s a golden opportunity. The downside is that everything becomes harder and harder to do. This is what led to the paper.

Discussion

The authors’ point is that there’s a lot we have to be careful of. The problem goes beyond the code complexity of regular software engineering. There are four major categories of pitfalls:

  1. Boundary erosion
  2. Data dependencies
  3. Spaghetti code
  4. The real world

Matthew closely followed the article’s organization. I’ll touch on those debts and problems that are probably the most common.

Boundary erosion – Things become really entangled. The line between the data and the code is blurred. When writing data science software, we can’t write loosely coupled code. Entanglement is the idea that we have features, things we have computed using functions, and so on; if we add to this, things change. A principle is change anything, change everything (CACE): whenever you change the data set, everything else changes. To overcome entanglement, isolate your models as best you can and use regularization.

Who consumes the output? Visibility debt is different from entanglement. An example of visibility debt is a model of click-through rate (CTR). If Finance uses it to calculate the lifetime value of a customer, and then the CTR model changes, Finance’s modeling is now off. The debt is not knowing who across the company is consuming, or utilizing, your data. The solution is to keep an API and a list of user names, and not to let user names be shared.

Data dependencies – Input signals are constantly changing. Think slang or emoticons. The solution is simple…version your data. Google will version text data, such as the corpus used to train a model. Version your data sets.
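To make the idea concrete, here is a small R sketch of snapshotting a data set under a version tag before training; the directory layout, file names, and version string are my own placeholders, not anything from the talk.

# Sketch: freeze a tagged snapshot of a training data set before modeling.
# The base directory and the version tag are hypothetical placeholders.
version_dataset <- function(df, name, version, base_dir = "data_versions") {
  dir_path  <- file.path(base_dir, name, version)
  file_path <- file.path(dir_path, paste0(name, ".rds"))
  if (file.exists(file_path)) stop("This version already exists; bump the version tag.")
  dir.create(dir_path, recursive = TRUE, showWarnings = FALSE)
  saveRDS(df, file_path)      # models should be trained only from frozen snapshots
  file_path
}

# comments <- read.csv("slang_comments.csv")                      # hypothetical input
# version_dataset(comments, "slang_comments", "v2015-07-11")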

Underutilized data adds dimensions and is a curse. Image processing problems work with underutilized data: most of the information in a picture is not important, and vector distances get further and further apart as you add new dimensions. The solution is feature selection. Fortunately, this topic receives a lot of attention, and there are many methods available to the engineer or researcher. It’s up to us to pull out the features that are important; doing so leads to a more stable model.
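A minimal R sketch of one crude feature-selection pass: drop near-constant columns, then keep the k numeric features most correlated with the target. The data frame, the target column, and the thresholds are hypothetical, and real projects would reach for more principled methods.

# Sketch: drop near-constant columns, then keep the k features with the
# highest absolute correlation to a numeric target. Illustration only.
select_features <- function(df, target, k = 10, var_floor = 1e-6) {
  x <- df[, setdiff(names(df), target), drop = FALSE]
  x <- x[, vapply(x, is.numeric, logical(1)), drop = FALSE]            # numeric features only
  keep <- vapply(x, function(col) isTRUE(var(col, na.rm = TRUE) > var_floor), logical(1))
  x <- x[, keep, drop = FALSE]                                         # drop near-constant columns
  cors <- vapply(x, function(col) abs(cor(col, df[[target]], use = "complete.obs")),
                 numeric(1))
  ranked <- sort(cors, decreasing = TRUE)                              # NAs drop out here
  names(ranked)[seq_len(min(k, length(ranked)))]
}

# kept <- select_features(training_df, target = "clicked", k = 25)     # hypothetical names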

Spaghetti code – There’s a lot of spaghetti code in data science. Researchers and some engineers come from academia and don’t have experience writing good code. Glue code is common in data science: there are so many libraries to tie together, and we tie them together in interesting ways, i.e., use whatever library and language you know to get it to work. When you have to ship something, you make it work.

The suggested solution is to write your own implementation of some things, such as an algorithm, in the language and framework you’re using for your system. I don’t agree. Testing new code, especially code with non-deterministic outcomes, is hard. A better solution is for team members to lean in…if a piece of code is essential to a team, take ownership of it and maintain it.

Experimental paths…there is always a piece of code that claims to do this or that, but you’re pretty sure it isn’t needed. Still, you’re not certain, so you don’t take it out. A solution is to tombstone it: write a log statement that says you think this method should go away. After a while, look at the log and see whether the method you want to delete is ready to go. It’s an opportunity to use data analysis for log analysis.
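Here is a small R sketch of the tombstone idea: wrap the suspect function so that every call appends a line to a log, then mine that log later to see whether the function is ever hit. The function and log-file names are placeholders of my own.

# Sketch: tombstone a function suspected of being dead code. Each call appends a
# timestamped line to a log; if the log stays quiet, the code is safe to delete.
tombstone <- function(fn, fn_name, log_file = "tombstone.log") {
  force(fn)
  function(...) {
    cat(format(Sys.time()), "TOMBSTONE", fn_name, "\n",
        file = log_file, append = TRUE)
    fn(...)
  }
}

# legacy_cleanup <- tombstone(legacy_cleanup, "legacy_cleanup")      # hypothetical function
# Later, count hits per function name before deciding what to delete:
# table(read.table("tombstone.log", stringsAsFactors = FALSE)$V4)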

Conclusion

We are in the blissful land of opportunity. Thinking about ways to avoid technical debt in our applications helps us get closer to deploying code that users use and that drives the company forward.

References

D. Sculley, Gary Holt, Daniel Golovin, Eugene Davydov, Todd Phillips, Dietmar Ebner, Vinay Chaudhary, and Michael Young. Machine Learning: The High Interest Credit Card of Technical Debt. SE4ML: Software Engineering for Machine Learning (NIPS 2014 Workshop), 2014.

The functions in the R readr package load text data into R much faster than the load functions included in the R utils package. Adopting the functions in this package will improve the workflow and productivity of the solo analyst as well as whole teams of analysts. This post provides an example of how to read .csv files using a function from the package, compares the time cost of that function to a customary way of reading text files, and shows an example of the warnings and problems the package’s functions produce.

(Edit: I re-ran the code in the gist below while finishing up this post. The user time for that execution of read_delim() was only 17 seconds.)

Introduction

The data used in this comparison was distributed over 24 files. The total number of data records was 3,020,761, and the total size of the data was 1.2GB. A customary way to read data into an R data structure is to use read.table():

A customary method of reading files using read.table()

This reads the data into a data frame. The approach and the system time are shown in the adjacent image. The user time to read these 24 files using read.table() was 267 seconds. This was the baseline that the read function from the readr package was compared to.

R readr package code example and results

The readr package has several functions. (The link to the RStudio blog post announcing the package is included at the end of this post. Check out their announcement for a more complete discussion of the package’s functions and features.) I used just one of these in my example. Here’s R code that uses read_delim(). This code can be used to read any number of .csv files in a directory; the regular expression causes all files ending in .csv to be included in the processing:
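In outline, the approach looks like the minimal sketch below; the directory path and object names are placeholder choices of mine, and column types are left for read_delim() to guess.

# Sketch: read every .csv file in a directory with readr::read_delim().
# "data_dir" and "data_list" are placeholder names.
library(readr)

data_dir  <- "path/to/csv/files"                                     # placeholder path
csv_files <- list.files(data_dir, pattern = "\\.csv$", full.names = TRUE)

# One data frame per file, collected in a list; readr reports its guessed
# column specification and accumulates warnings/problems as it parses.
data_list <- lapply(csv_files, read_delim, delim = ",")

# Optionally stack the files into a single data frame.
combined <- do.call(rbind, data_list)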

The user time to read this same set of data using the function read_delim() was about 29 seconds:

Reading .csv files using read_delim()

Warnings and problems

It’s a talkative package. An aspect of using the package that I did not expect was its use of warnings. This section displays some of the messages and warnings I encountered during the development of my solution above and how I investigated the warnings. Here is the first indication that a possible problem was encountered:

Example of a warning message when reading a file with R package readr

Typing the statement warnings() at the console produced this listing:

Warning messages from an execution of read_delim()

Here are some of the specifics. Notice the syntax: I loaded my data into a list of data frames. To look at the problems for just one of the files, call problems() with that object as the parameter:

The first 25 possible problems displayed to the console
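Assuming the list of data frames is named data_list, as in the sketch above, the call itself looks like this:

# Inspect the parsing problems recorded for the first file's data frame.
problems(data_list[[1]])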

I don’t recommend ignoring the warnings and problems. If you have operationalized code and your process repeats each day with fresh data, and perhaps you are consuming data that is curated and saved in an EDW, it’s probably OK not to raise an alert and fully research every problem. But it depends. For one-off and ad hoc data analysis, it’s worth spending the time to review the warnings and messages. I mention this in comparison to the customary method of reading data with read.table() or similar functions. Analysts now have three tools to make sure their data is properly loaded…visual inspection, summaries and explorations, and now these warnings and problem listings.

Key takeaway

Hardly any training time is required to be able to use and understand the functions in the readr package. In the example above, using a function from the package to read a set of files was approximately an order of magnitude faster than reading the same set of files using a customary method. This is a big win for workflow and analyst productivity.

The announcement for the package and more information on how to use the other functions and features is located on the RStudio blog.

The inaugural meeting of the ACM SIGKDD Seattle chapter was held on Tuesday, October 28, 2014, at the Allen Institute for Artificial Intelligence (AI2). The purpose of this first meetup was twofold. First, it was a forum to introduce the chapter to the public, discuss the founding, introduce some of the founders, and talk about goals, fundraising, volunteering, membership, and where they see opportunity to contribute to the area’s data science community. Second was to host a talk. For this meetup, the speaker was Oren Etzioni, professor of computer science at the UW and serial entrepreneur. Oren is the CEO of AI2 and was also the host of the meeting.

(Update: I didn’t hear back from AI2 re Da Vinci but did find a copy of an information sheet that was available to attendees. Included below is one side of the sheet, the side that discusses their mission and projects. The second attachment is the paragraph describing Da Vinci. The project was referred to as Da Vinci but the public-facing project name is Semantic Scholar.)

About the chapter

The chapter was established in March 2014 with the goal of bringing together the data mining, data science, and analytics communities. A further aim is to give the Seattle area more momentum. In the first five days after the announcement was published, 100 people joined the group.

Badge from ACM SIGKDD Seattle Chapter inaugural Meetup
AI2 sponsored the location and Microsoft sponsored the food. They would also like to be able to fly distinguished speakers in to Seattle for meetups. Another community-focused activity they mentioned was maker labs…every couple of weeks, grab some open source data, open source tools, do some coding, and commit it to a shared repository.

Fundraising is a concern. To contribute, write a check to “The Seattle ACM SIGKDD Chapter,” ask for a matching contribution from your employer, and add “The Seattle ACM SIGKDD Chapter” to your employer’s giving system. And, they need volunteers.

The speaker at the next meetup is going to be Carlos Guestrin, CEO of GraphLab and professor at the UW. Additional future speakers include Roger Barga of Microsoft Research, Joseph Sirosh, Cornell professor of Computer Science Johannes Gehrke, Raghu Ramakrishnan, and Ronny Kohavi.

The Future of Data Mining

No post about a meetup is complete without providing a summary of the talk.

The title of Oren’s talk was The Future of Data Mining. I watched his KDD 2014 talk a few days prior; that talk is not embeddable but is freely available to view here. The focus was on these themes:

  1. Today’s knowledge bases (KBs) are fact rich but knowledge poor
  2. Reading the web, and theory formation
  3. AI2

“Big data” has had impressive achievements, but there are limits to the paradigm; it only takes us so far. The problem with Big Data is the lack of reasoning, of explanation, for the output of models or results. Classifications or recommendations are based on distance in a vector space. Big Data doesn’t deliver reasoned explanations. What’s next in this wave of innovation are solutions and processes that are constructive, multi-layered, and recursive. Oren’s claim is that a key challenge we face is going beyond the yes/no questions of current Big Data classification systems. The future is about more challenging classifications.

The next step is to read the web. There are two projects at AI2 that are attempting to do this, Aristo and Da Vinci.

Aristo – What is the knowledge necessary to pass a 4th-grade science test? This is state of the art…in no small measure because no one else has tried it! Textbooks are used as needed to augment the system’s “reading” and to come up with theories. The goal for Aristo is to take unseen tests and score 80%. Recently, the team achieved 66% accuracy, and they aim to get to 80% next year. An interesting problem is the difficulty of diagrammatic testing.

Da Vinci – What are the scholarly documents that one might find helpful? Currently, there are about 114 million English-language scholarly documents, including papers, books, and technical reports, that can be found on the Web. The notion of a Renaissance man or woman is a thing of the past. The demo runs on 20,000 papers.

(Note to readers: I haven’t been able to find any links to Da Vinci. I’ll contact AI2 and see if they have any publicly available content to share, and I’ll update the post with any links.)

The focus of AI2 is data mining for the common good. A quote attributed to Jeff Hammerbacher is: The best minds of my generation are thinking about how to make people click ads…that sucks. This is AI2’s point of departure from enterprises that have programs focused on the good (profit) of the enterprise. At AI2, they are ambitious, open (their code and data sets are published), and collaborative. They measure results in one to three years.

Readers wanting a deeper dive should head over to the AI2 site.

Takeaway

Seattle is rich in talent. ACM SIGKDD Seattle could help organize this social capital in new and interesting ways. Stay tuned. For further information about the chapter, head over to their site.

Here is the further information referred to in the update above:
1. A PDF of one side of the information sheet is available here.

Semantic Scholar project description
2. The paragraph from the sheet that discusses Semantic Scholar is shown above.