The first course for the inaugural cohort of the Coursera UW Machine Learning (ML) specialization just wrapped up. A key message from this course was that if you put in the work, learn the tools, keep learning, and get in some iterations, you can become successful at implementing ML. What does success in implementing ML look like? It means you can build intelligent applications at work, make your company money, and increase your value in the job market. If you reach further, you can invent your own data product and maybe start your own company.

http://xkcd.com/1287/

How was it?

Fantastic! This is the best online course I’ve ever taken, and I’m looking forward to the remaining courses in the specialization. The course uses a case-study approach, which is great for practitioners.

I’m looking for an introduction video from the instructors that I can include in the post. So far, I can’t find one. Maybe that’s because the course is new, or because they haven’t shared one on YouTube yet.

Use of GraphLab Create

Somewhat controversial at the start of the class was the software provided to complete the assignments: GraphLab Create, a commercial toolkit (from Dato.com) with its own API. It is Python, but it’s not scikit-learn, and pandas isn’t used in the course either. Students are free to use scikit-learn and pandas instead. Because I found GraphLab Create cool and fun to use, because it supports out-of-core data sets, and because it has the potential to explode onto the scene with widespread adoption, I decided to use it rather than the open-source Python tools. An academic-use license for GraphLab Create is free. Here’s a video introduction to GraphLab Create:

[Video: https://www.youtube.com/embed/491LmZZwBkE]

Ok. I haven’t really stated a key reason why it was controversial: the CEO of the company is one of the instructors for the course. I don’t think it matters, but the fussy types didn’t like it. If you take the class, it’d be pretty easy to follow along in scikit-learn and pandas if you know Python.

My recommendation on GraphLab Create is to use it for the course. I was able to pick up its API without getting bogged down.
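
If you do want to follow along in the open-source stack, here’s a minimal sketch of loading a data set and fitting a simple regression both ways; the file name and column names are hypothetical placeholders, not the course’s actual data.

```python
# GraphLab Create version (SFrame handles out-of-core data sets).
import graphlab

sales = graphlab.SFrame('home_data.csv')  # hypothetical file
gl_model = graphlab.linear_regression.create(
    sales, target='price', features=['sqft_living', 'bedrooms'])

# Equivalent sketch with pandas + scikit-learn.
import pandas as pd
from sklearn.linear_model import LinearRegression

sales_df = pd.read_csv('home_data.csv')
X = sales_df[['sqft_living', 'bedrooms']]
y = sales_df['price']
sk_model = LinearRegression().fit(X, y)
```

The main practical difference is that the SFrame is disk-backed, so the first version keeps working when the CSV no longer fits in RAM.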

What about the pace and the amount of work required?

The material is definitely at an intermediate level; this is not a course for beginning programmers, nor is it a course for people who are new to ML. If you are an R programmer, a statistician, a data engineer, a data analyst, or similar, you’ll have the background you need, but you should have programming experience. Plan on the five to eight hours per week they recommend in the specialization overview.

The intro page is here.

Do they meet the market?

Yes. I applied knowledge I gained in this first course toward trying to find a pattern in data using a clustering algorithm. The material is immediately applicable to any number of data problems. I was already familiar with many of the concepts; this course will help you concretize concepts and fundamentals, and give you the confidence you need to apply them to new problems.
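
As a concrete illustration of that kind of application, here’s a minimal clustering sketch in scikit-learn; the feature matrix and the choice of three clusters are made up for the example.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Hypothetical feature matrix: rows are observations, columns are measurements.
X = np.random.rand(200, 4)

# Scale features so no single measurement dominates the distance calculation.
X_scaled = StandardScaler().fit_transform(X)

# Fit k-means with a guessed number of clusters; inspect labels and centers.
kmeans = KMeans(n_clusters=3, random_state=0).fit(X_scaled)
print(kmeans.labels_[:10])
print(kmeans.cluster_centers_)
```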

I like the humor in Laetitia Van Cauwenberge’s post below. Sure, the items in the list are in random order, but most eyeballs are going to land on what’s easiest to read, and that’s the top of a list.

Fit on a team is important. When I have the time, I’m going to extend Drew Conway’s famous Venn diagram of the data scientist to include a third dimension: fit. The Venn diagram is about skills and knowledge; the third dimension is about people and culture. For data scientists to be effective, they first have to have sufficient acumen to even recognize that there is a culture, with its own codes and customs. Call it emotional intelligence. Assuming the data scientist has this acumen, the next threshold to meet is honoring the culture and norms, blending in. If the data scientist disrupts the team they are hired into, something’s wrong.


The list, which applies to software engineers and data analysts as well, in random order:

  1. Not being able to work well in a team
  2. Being elitist
  3. Using jargon that stakeholders don’t understand
  4. Being a perfectionist: in the business world, perfection is always associated with negative ROI. 20% of your time spent on a project yields 80% of the value; the remaining 80% yields the remaining 20% (this is also known as the law of diminishing returns)
  5. Not spending enough time documenting your analyses, spreadsheets, and code (documenting should take about 25% of your time and be done on the fly, not after completing a project). Without proper documentation, nobody, not even you, will know how to replicate, extract value from, and understand what you’ve done six months later
  6. Not spending enough time prioritizing and updating to-do lists: to fix this, talk to your stakeholders (but don’t overwhelm them with a bunch of tiny requests) and spend 30 minutes per day on prioritizing (use calendars and project management tools)
  7. Not respecting or knowing the lifecycle of data science projects
  8. Not creating re-usable procedures or code. Instead spending too much time on tons of one-time analyses
  9. Using old techniques on big data, for instance time-consuming clustering algorithms when an automated tagging or indexation algorithm would work millions of times faster to create taxonomies
  10. Too much or too little computer science – learn how to deploy distributed algorithms, and to optimize algorithms for speed and memory usage, but talk to your IT team before going too deep into this
  11. Create local, internal data marts for your analyses: your sys-admin will hate you. Discuss this topic with her first, before using these IT resources
  12. Behave like a startup guy in a big company, or the other way around. Big companies like highly specialized experts (something dangerous for your career); startups like jacks of all trades who are masters of some of them
  13. Produce poor charts
  14. Focus on tools rather than business problems
  15. Planning communication last
  16. Data analysis without a question / plan
  17. Fail to simplify
  18. Don’t sell well
  19. Your project has no measurable yield – talk to stakeholders to know what the success metrics are, don’t make a guess about them
  20. Identify the right people to talk to in your organization, and outside of it
  21. Avoid silos, and data silos. Be proactive about finding data sources, participate in database design conversations
  22. Failure to automate tasks such as exploratory data analysis

Source: 22 easy-to-fix worst mistakes for data scientists – Data Science Central

Yesterday’s post started out including both the key takeaway insights from the talks and a strategy for how a non-statistician can get the most out of attending meetings for academic statisticians. The ideas were too different, so I broke the meeting-attendance strategy out into this follow-up post.

I want to acknowledge up front that these meetings aren’t just about the talks and sessions. A lot of important housekeeping and governing-body work takes place in addition to the sessions.

Here’s a strategy. The aim is to feel afterwards that the time and money were worth it:

  • Don’t attend the whole event – Save money. Cherry pick. Review the program. Identify the one day or the two days that have sessions that look the most interesting to you.
  • Keep it simple – Simple here means non-theoretical. You want to be engaged. Attend sessions that are in a practical track, a business track, or a track on a topic that you are familiar with.
  • Ask questions – In some presentations in industry, the speakers are rock stars and are mobbed by attendees after their presentations. Here, the speakers are not mobbed and you can have as much of their time and attention at the end of the session as you want.
  • Retrieve publisher discount codes – I love books. The expo had sponsored booths from publishers. I didn’t stop at the SAS booth, but did stop at the Springer, CRC, and Wiley booths. They were stuffed with books and included display copies of forthcoming titles. Each of the publishers had discount codes that could be applied to on-line orders placed by dates in September.
  • Hands-on sessions – There were hands-on sessions, but I wasn’t able to attend any of them. They looked pretty technical rather than practical (as best as I could tell). Don’t sign up for one unless you know for sure it’ll be helpful.
  • Move along – Sessions usually have three or more speakers. Don’t stay in a session that is too technical. If you are in a session that is too technical, leave and go to another session. Leave only between the speakers’ talks, not during. This is a difference from industry conferences where it’s acceptable to get up and leave in the middle of a presentation.
  • It’s not Strata – Don’t expect Strata-like technology. For example, the badges did not have QR codes, so vendors can’t scan your name badge to automatically put you on their mailing lists. Bring enough business cards.
  • Be the engineer – When asking questions or discussing topics with the speaker at the end of a session, tell the speaker that you’re an engineer. They’ll immediately get that you don’t have advanced training in statistics. The interaction will be more productive for you.

About this last point, another idea I picked up from the conference is that statisticians who work in industry think engineers can implement models and use statistics effectively. The problem they see when working with engineers is that the engineers don’t have confidence in their results. I suffer from this exact problem.

The ASA JSM annual meeting, held this year in Seattle, wrapped yesterday. I purchased registration for the last day and a half of the event. I’m not a statistician. Some of the talks I attended were a stretch. But, it was an opportunity to attend the preeminent national meeting for a field that I’m interested in. This post discusses some of the key takeaway insights from the event.

The killer insight relates to attribution of advertising in digital marketing. There is almost no research on the effectiveness of the different attribution models in measuring advertising performance. This applies specifically to web-based marketing. The talk by Stephanie Sapp on Google’s work to develop simulations that model user behavior was the highlight of my time at the meeting. The work they’re doing may finally make some progress toward measuring the effectiveness of different attribution models. Stay tuned…Google Research will be publishing a paper soon.

I also found a different way to look at a question I’ve been thinking about. Do business executives make better decisions using information presented in tabular format and poorly constructed data visualizations, or using data visualizations that leverage the current understanding of human cognition and perception? Decision quality is hard to quantify. A different way to frame the question is: do they make decisions faster using one method or the other? Time can be quantified.
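
To make the “faster” framing concrete, one could time how long executives take to answer the same question from each presentation style and compare the two samples. A minimal sketch with made-up timing data:

```python
import numpy as np
from scipy import stats

# Hypothetical decision times in seconds for the same question,
# answered from a table vs. a well-designed chart.
times_table = np.array([42.0, 55.0, 38.0, 61.0, 47.0, 52.0])
times_chart = np.array([29.0, 33.0, 41.0, 27.0, 35.0, 30.0])

# Welch's two-sample t-test on the timing difference.
t_stat, p_value = stats.ttest_ind(times_table, times_chart, equal_var=False)
print("mean (table): %.1f s, mean (chart): %.1f s, p = %.3f"
      % (times_table.mean(), times_chart.mean(), p_value))
```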

Another takeaway related to forecasting business metrics, e.g., dollars of revenue or units of demand. It’s not uncommon to rely on just one model to forecast business metrics (and for good reasons). The insight is to use an ensemble of forecasts instead, as might be done for a classification problem in machine learning.
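
Here’s a minimal sketch of the idea, assuming a hypothetical monthly revenue series: generate forecasts from a few simple models and average them rather than trusting any single one.

```python
import numpy as np

# Hypothetical monthly revenue history (in $000s).
history = np.array([120, 125, 131, 128, 140, 138,
                    145, 150, 149, 158, 162, 165], dtype=float)

# Forecast 1: naive -- next month equals the last observed month.
naive = history[-1]

# Forecast 2: mean of the last three months.
moving_avg = history[-3:].mean()

# Forecast 3: extrapolate a simple linear trend one step ahead.
t = np.arange(len(history))
slope, intercept = np.polyfit(t, history, 1)
trend = slope * len(history) + intercept

# Ensemble: equally weighted average of the three forecasts.
ensemble = np.mean([naive, moving_avg, trend])
print(naive, moving_avg, trend, ensemble)
```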

Finally, there is no question that R is taking over the software space of statistical research. In the talks I attended where an attendee asked the speaker what software was used to produce the graphics or perform the analysis, the answer was R. Joseph Rickert talks about this transformation in a post he put up yesterday. Ironically, Minitab sponsored the lanyards worn by conference attendees. I used Minitab in quant classes as an undergrad; they still have a product in the marketplace after all these years. I wonder what their competitive advantage is.

Statistics is a complex, sprawling field dominated by academic statisticians. Yet, there was a sense at the conference that the statistics community needs to become more relevant. The moniker for the conference, Statistics: Making Better Decisions, shows this intent. To further their aim to make statistics more relevant, I suggest that ASA join up with O’Reilly to create a new track at Strata for recent applications of advanced statistics in industry.

I wasn’t sorry to spend my Saturday at Data Day Seattle 2015. Kudos to Lynn Bender for doing a fantastic job of organizing and programming the event. In this post, I’ll mention some overall takeaways from the event and then go deeper into one key takeaway, technical debt in analytics and machine learning (ML) applications.

Data Day Seattle 2015 Takeaways

The event focused on technologies, techniques, and how to solve business problems using them. Smartly, the role of vendors and vendor products was secondary. The variety of topics spanned many different fields, and it was only a one-day event. While it isn’t Strata, where it is easy to gain a sense of trends and important topics, this event still conveyed what is important right now: the emergence of IoT and, on the technology side, Spark and streaming, which had the mind share.

Disappointing? That the KNIME table didn’t receive more attention than it did. With KNIME’s GUI, product architecture, integration with other tools, and native ability to handle out-of-RAM data sets, it’s inevitable that it will explode in acceptance and popularity…but this might still be some time away.

Technical Debt in Analytics and Machine Learning Applications

Why would technical debt in analytics and ML applications be different from technical debt in traditional software projects? It turns out it is different: it’s worse. Why? Speaker Matthew Kirk discussed the problem. His talk was organized around the paper Machine Learning: The High Interest Credit Card of Technical Debt, freely available from Google Research here.

[Image: technical debt. Fix this and close your ticket. Will it take more time to rewrite it or to try to figure it out and refactor it?]

The problem

Companies are responding to the market more and more quickly. The way to make a better company is to be more responsive, i.e., use data to drive the business and make decisions. It’s a golden opportunity. The downside is that everything becomes harder and harder to do. This is what led to the paper.

Discussion

Their point is that there’s a lot we have to be careful of. The problem goes beyond the code complexity in regular software engineering. There are four major categories of pitfalls:

  1. Boundary erosion
  2. Data dependencies
  3. Spaghetti code
  4. The real world

Matthew closely followed the article’s organization. I’ll touch on those debts and problems that are probably the most common.

Boundary erosion – Things become really entangled. The line between the data and the code is blurred: when writing data science software, we can’t write loosely coupled code. Entanglement is the idea that the features, the quantities we compute from them, and so on all depend on one another; if we add to them, things change. A principle here is “change anything, change everything” (CACE): whenever you change the data set, everything else changes. To overcome entanglement, isolate your models as best you can, and use regularization.
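
As a minimal sketch of the regularization point (on a made-up design matrix): an L1 penalty shrinks the weights of weak or redundant features toward zero, which makes the model less sensitive when one of the entangled inputs shifts.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.RandomState(0)

# Hypothetical design matrix with 10 features, only 3 of which matter.
X = rng.randn(500, 10)
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + 0.5 * X[:, 2] + rng.randn(500) * 0.1

# L1-regularized fit: coefficients on uninformative features shrink to ~0.
model = Lasso(alpha=0.1).fit(X, y)
print(np.round(model.coef_, 2))
```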

Who consumes the output? Visibility debt is different from entanglement. An example of visibility debt is a model of CTR: if Finance uses it to calculate the lifetime value of a customer, and then the CTR model changes, Finance’s modeling is now off. The debt is not knowing who across the company is consuming, or utilizing, your data. The solution is to keep the output behind an API with per-consumer user names, and don’t let consumers share user names.
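
A minimal sketch of that idea, with hypothetical names: put the model’s output behind a small access function that requires a per-consumer user name and logs it, so there is always a record of who depends on the score.

```python
import logging

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("ctr_model_api")

# Hypothetical registry of known consumers of the CTR model's output.
KNOWN_CONSUMERS = {"finance_ltv", "ads_bidding", "email_targeting"}

def get_ctr_score(consumer, features):
    """Return the CTR prediction, recording which team asked for it."""
    if consumer not in KNOWN_CONSUMERS:
        raise ValueError("Unknown consumer %r -- register before using the model" % consumer)
    log.info("CTR score requested by %s", consumer)
    return _predict_ctr(features)

def _predict_ctr(features):
    # Stand-in for the real model; returns a dummy score.
    return 0.05
```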

Data dependencies – Input signals are constantly changing. Think slang or emoticons. The solution is simple: version your data. Google will version text data, such as the corpus used to train a model. Version your data sets.
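
A minimal sketch of data versioning, assuming the training data lives in a local file: record a content hash and a timestamp alongside every training run so you know exactly which snapshot of the data produced a model. File names here are hypothetical.

```python
import hashlib
import json
import time

def dataset_version(path):
    """Return a short content hash of the data file at `path`."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()[:12]

def record_training_run(data_path, model_name, registry="training_runs.jsonl"):
    """Append the data version used for this model to a simple registry file."""
    entry = {
        "model": model_name,
        "data_file": data_path,
        "data_version": dataset_version(data_path),
        "timestamp": time.strftime("%Y-%m-%dT%H:%M:%S"),
    }
    with open(registry, "a") as f:
        f.write(json.dumps(entry) + "\n")
    return entry
```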

Underutilized data adds dimensions and is a curse. Image-processing problems work with underutilized data: most of the information in a picture is not important. Vector distances get farther and farther apart as you add new dimensions. The solution is feature selection. Fortunately, this topic receives a lot of attention and there are many methods available to the engineer or researcher. It’s up to us to pull out the features that are important; doing so leads to a stable model.
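
A minimal feature-selection sketch with scikit-learn, on a synthetic wide data set: keep only the columns that carry signal before fitting the final model.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel

# Synthetic data: 50 features, only 5 informative.
X, y = make_classification(n_samples=1000, n_features=50,
                           n_informative=5, random_state=0)

# Keep features whose importance in a forest exceeds the default threshold.
selector = SelectFromModel(RandomForestClassifier(n_estimators=100, random_state=0))
X_reduced = selector.fit_transform(X, y)
print(X.shape, "->", X_reduced.shape)  # far fewer columns survive
```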

Spaghetti code – There’s a lot of spaghetti code in data science. Researchers and some engineers come from academia and don’t have experience writing good code. Glue code is common in data science: there are so many libraries to tie together, and we tie them together in interesting ways, i.e., use whatever library and language you know to get it to work. When you have to ship something, you make it work.

The suggested solution is to write your own implementation of some things, such as an algorithm implemented in the language and framework you’re using for your system. I don’t agree. Testing new code, especially code with non-deterministic outcomes, is hard. A better solution is for team members to lean in: if a piece of code is essential to a team, take ownership of it and maintain it.
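
On the testing point, one way to make a stochastic implementation testable is to pin the random seed and assert against a tolerance rather than an exact value. A minimal sketch with a hypothetical training routine:

```python
import numpy as np

def train_model(X, y, seed=0):
    """Hypothetical stochastic training routine: noisy least-squares fit."""
    rng = np.random.RandomState(seed)
    noise = rng.randn(X.shape[1]) * 0.01
    weights, *_ = np.linalg.lstsq(X, y, rcond=None)
    return weights + noise

def test_train_model_is_reproducible_and_accurate():
    rng = np.random.RandomState(42)
    X = rng.randn(200, 3)
    true_w = np.array([1.0, -2.0, 0.5])
    y = X @ true_w

    w1 = train_model(X, y, seed=0)
    w2 = train_model(X, y, seed=0)
    assert np.allclose(w1, w2)                 # same seed, same result
    assert np.allclose(w1, true_w, atol=0.05)  # close enough, not exact

test_train_model_is_reproducible_and_accurate()
```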

Experimental paths – There is always a piece of code that says it is going to do this or that, but you’re pretty sure it doesn’t; still, you’re not certain, so you don’t take it out. A solution is to tombstone it: write a log statement that says you think this method should go away. After a while, look at the log and see if the method you think you want to delete is ready to go. It’s an opportunity to use data analysis for log analysis.
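
A minimal sketch of tombstoning in Python, with hypothetical names: wrap the suspect function in a decorator that logs every call, then mine the logs later to see whether it’s safe to delete.

```python
import functools
import logging

logging.basicConfig(level=logging.WARNING)
log = logging.getLogger("tombstone")

def tombstone(func):
    """Mark a function we believe is dead code; log if anything still calls it."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        log.warning("TOMBSTONE: %s was called -- it is not dead code yet", func.__name__)
        return func(*args, **kwargs)
    return wrapper

@tombstone
def legacy_feature_cleanup(rows):
    # Hypothetical experimental code path we suspect is unused.
    return [r for r in rows if r is not None]
```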

Conclusion

We are in a blissful land of opportunity. Thinking about ways to avoid technical debt in our applications gets us closer to being able to deploy code that users use and that drives the company forward.

References

D. Sculley, Gary Holt, Daniel Golovin, Eugene Davydov, Todd Phillips, Dietmar Ebner, Vinay Chaudhary, and Michael Young. Machine Learning: The High Interest Credit Card of Technical Debt. SE4ML: Software Engineering for Machine Learning (NIPS 2014 Workshop), 2014.