I wasn’t sorry to spend my Saturday at Data Day Seattle 2015. Kudos to Lynn Bender for doing a fantastic job of organizing and programming the event. In this post, I’ll mention some overall takeaways from the event and then go deeper into one key takeaway, technical debt in analytics and machine learning (ML) applications.
Data Day Seattle 2015 Takeaways
The event focused on technologies, techniques, and how to solve business problems using these. Smartly, the role of vendors and vendor products was secondary. The variety of topics spanned many different fields, and it was only a one-day event. While not Strata, where it is easy to gain a sense of trends and important topics, this event still exuded what is important right now. These are the emergence of IoT. For technologies, it was Spark and streaming that had mind share.
Disappointing? The KNIME table didn’t receive more attention than it did. With KNIME’s GUI, product architecture, integration with other tools, and native ability handle out-of-RAM data sets, it’s inevitable that it will explode in acceptance and popularity…but this might still be some time away.
Technical Debt in Analytics and Machine Learning Applications
Why would technical debt in analytics and ML applications be different than technical debt in traditional software projects? Turns out, it is different. It’s worse. Why? Speaker Matthew Kirk discussed the problem. His talk was organized around the paper Machine Learning: The High Interest Credit Card of Technical Debt, freely available at Google Research here.
Companies are responding to the market quicker and quicker. The way to make a better company is to be more responsive, i.e., use data to drive the business and make decisions. It’s a golden opportunity. The downside is that everything becomes harder and harder to do. This is what lead to the paper.
Their point is that there’s a lot we have to be careful of. The problem goes beyond the code complexity in regular software engineering. There are four major categories of pitfalls:
- Boundary erosion
- Data dependencies
- Spaghetti code
- The real world
Matthew closely followed the article’s organization. I’ll touch on those debts and problems that are probably the most common.
Boundary erosion – Things become really entangled. The line between the data and the code is blurred. When writing data science software, we can’t write loosely coupled code. Entanglement is the idea that we have features, things we have computed using functions, etc.. If we’re adding to this, things change. A principal is change any thing, change everything (CACE). To overcome entanglement, isolate your models the best you can, regularization. Whenever you change the data set, everything else changes.
Who consumes the output? Visibility debt is different from entanglement. An example of visibility debt is a model of CTR. If finance uses it to calculate lifetime value of a customer, and then the CTR changes, Finance’s modeling is now off. The debt is not knowing who across the company is consuming, or utilizing, your data, across the company. The solution is to keep an API, and a list of user names, and don’t share user names.
Data dependencies – Input signals are constantly changing. Think slang or emoticons. The solution is simple…version your data. Google will version text data, such as the corpus used to train a models. Version your data sets.
Underutilized data adds dimensions and is a curse. Image processing problems work with underutilized data. Most of the information in a picture is not important. Vector distances get further and further apart as you add new dimensions. The solution is feature selection. Fortunately, this topic receives a lot of attention and there are many methods available to the engineer or researcher. It’s up to us to pull out features that are important; it leads to a stable model.
Spaghetti code – There’s a lot of spaghetti code in data science. Researchers and some engineers come from academia. They don’t have experience writing good code. Glue code is common in data science. So many libraries to tie together and we try to tie them together in interesting ways, i.e., use what library and language you know to get it to work. When you have to ship something, you make it work.
The solution is to write your own implementation of some things, such as the implementation of an algorithm in the language framework your using for your system. I don’t agree. Testing new code, especially non-deterministic outcomes, is hard. A better solution is for team members to lean in…if a piece of code is essential to a team, take ownership of it and maintain it.
Experimental paths…there is always a piece of code that says it is going to do something this or that but you’re pretty sure it doesn’t. But, you’re not sure and don’t take it out. A solution is to tombstone, write a log statement that says you think this method should go away. After a while, look at the log and see if the method you think you want to delete is ready to go. It’s an opportunity to use data analysis for log analysis.
We are in the blissful land of opportunity. Thinking about ways to avoid technical debt in our applications helps us get closer to being able to deploy code that users use and that drive the company forward.
SE4ML: Software Engineering for Machine Learning (NIPS 2014 Workshop). D. Sculley and Gary Holt and Daniel Golovin and Eugene Davydov and Todd Phillips and Dietmar Ebner and Vinay Chaudhary and Michael Young. Machine Learning: The High Interest Credit Card of Technical Debt. 2014.