Edit: Paula Poundstone’s keynote address is now available here.

Strata + Hadoop World San Jose 2016 did not fail attendees. The best presentation was Paula Poundstone‘s stand-up comedy routine, which ran two minutes over schedule. She lambasted technology and the industry. Brilliant. I will have to post the YouTube video as an update. In the meantime, here’s a commentary by Paula about the problem of flat-screen addiction:

Expo hall gets it right

The Expo Hall was an integral part of the conference experience. The density of vendors could not have been more satisfying. It was a feast of interactivity between attendees and the really smart people staffing the booths. It also gave the hall a city-like feel: the vendor booths were crammed together like apartments along a city street. Kudos to the organizers.

My favorite

What was the killer software I have to have? DataRobot. It automates the task of building models in a data space that’s too large for a human alone to ever fully explore. The business problem for most is the need to monetize data, the need to innovate and create data products. DataRobot can help do this. It’s the first, I won’t say set-and-forget, but rather tune-set-and-forget, machine learning AI bot that’s commercially available. And, amazingly, DataRobot is not too expensive. An install on a four-core box would set me back about $20,000 (more, obviously, for an install on a cluster). The DataRobot company will get bought out. Why? It can be set loose on data to run near continuously, with only periodic human interaction, to model the hell out of your data. On a Mac, with the bot running 24 hours a day, seven days a week, it’d crush. It includes a catalog of models I have never even heard of. It’s easily 100x faster at modeling data than a human could ever be. It’s feature rich, including the ability to clone models, to provide parameter settings (with sensible defaults), to evaluate model performance with various metrics, and to automatically generate plots in a data exploration step. In a final stroke of ah-ha, the bot can be configured to take the top models returned from a run on a dataset and create an ensemble from them, improving results (probably!) over the single best model. Models can then be deployed. DataRobot is a no-brainer investment for companies serious about monetizing their data.
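To make that last ensembling step concrete: here is a minimal sketch of the "score many candidate models, keep the top few, ensemble them" idea, written with scikit-learn. This is my own illustration of the technique, not DataRobot's actual API, and the candidate list is a made-up stand-in for its much larger model catalog.

```python
# Sketch of the "ensemble the top models" idea that DataRobot automates;
# scikit-learn here is an illustration, not DataRobot's actual API.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# A tiny "catalog" of candidate models; a real run would try far more.
candidates = {
    "logreg": LogisticRegression(max_iter=5000),
    "knn": KNeighborsClassifier(),
    "tree": DecisionTreeClassifier(random_state=0),
    "nb": GaussianNB(),
}

# Score every candidate with cross-validation and keep the top three...
scores = {name: cross_val_score(m, X_train, y_train, cv=5).mean()
          for name, m in candidates.items()}
top3 = sorted(scores, key=scores.get, reverse=True)[:3]

# ...then combine the winners into a soft-voting ensemble.
ensemble = VotingClassifier(
    estimators=[(name, candidates[name]) for name in top3],
    voting="soft")
ensemble.fit(X_train, y_train)
print(round(ensemble.score(X_test, y_test), 3))
```

The ensemble's held-out accuracy is typically at least as good as any single candidate's, which is exactly the payoff the bot is chasing.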


Overall themes and key observations, in roughly chronological order from the beginning of the conference to the end:

  1. Streaming
  2. Real-time analytics
  3. Spark
  4. Tay as an example of AI gone awry; an embarrassment to Tay’s parents
  5. Peak BI
  6. In-memory
  7. Streaming
  8. Spark
  9. Kafka
  10. Use of Notebooks to share results and for narration
  11. Visualizations from the keynote speakers were not all that flashy; there were numerous examples of simple line graphs and simple charts, and that was OK.
  12. I actually saw statistics creeping into Data Science…confidence intervals were included on a bar chart!
  13. A greater percentage of attendees were women compared to the 2013 edition of the conference
  14. noSQL means noMarketShare (no common API and no way to exchange data means no market share)
  15. Streaming
  16. Docker; it is worth the time to learn it
  17. Spark
  18. Streaming
  19. Speaker demos executed in Python were in Python 3 (RIP Python 2)
  20. No language wars encountered
  21. Nothing about the Julia programming language
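On item 12: the confidence intervals I spotted on that bar chart are cheap to produce. Here is a sketch of the arithmetic behind a 95% error bar using the normal approximation; the measurements are made-up numbers, not data from the talk.

```python
# Sketch: the 95% confidence interval behind an error bar on a bar
# chart, via the normal approximation. The numbers are made up.
import math

measurements = [4.1, 3.8, 4.4, 4.0, 3.9, 4.3, 4.2, 3.7]
n = len(measurements)
mean = sum(measurements) / n
# Sample standard deviation (n - 1 in the denominator).
sd = math.sqrt(sum((x - mean) ** 2 for x in measurements) / (n - 1))
# 95% interval: mean +/- 1.96 standard errors.
half_width = 1.96 * sd / math.sqrt(n)
print(f"{mean:.2f} ± {half_width:.2f}")  # → 4.05 ± 0.17
```

That one extra line of arithmetic per bar is all it takes to move a chart from Data Science decoration toward statistics.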

Please feel free to leave comments.

The first course for cohort zero in the Coursera UW Machine Learning (ML) specialization just completed. A key message from this course was that if you put in the work, learn the tools, keep learning, and get some iterations in, you can become successful in implementing ML. What does success in implementing ML look like? It means you can make intelligent applications at work, make your company money, and increase your value in the job market. If you reach, you can invent your own data product and maybe start your own company.


How was it?

Fantastic! This is the best online course I’ve ever taken. I’m looking forward to the remaining courses in the specialization. The course uses a case-study approach, which is great for practitioners.

I’m looking for an introduction video from the instructors that I can include in this post. So far, I can’t find one, maybe because the course is new or because they haven’t shared it yet on YouTube.

Use of GraphLab Create

Somewhat controversial at the start of the class was the software provided to complete assignments. The software is GraphLab Create, a commercial toolkit (from Dato.com) that has its own API. It is Python. But it’s not scikit-learn, nor is pandas used in the course. Students are free to use scikit-learn and pandas. Because I found GraphLab Create cool and fun to use, because it has the potential to explode onto the scene with widespread adoption, and because it supports out-of-core data sets, I decided to use it rather than the open-source Python tools. An academic-use license of GraphLab Create is free. Here’s a video introduction to GraphLab Create:

Ok. I haven’t really stated a key reason why it was controversial…the CEO of the company is one of the instructors for the course. I don’t think it matters, but the fussy types didn’t like it. If you take the class, it’d be pretty easy to follow along in scikit-learn and pandas if you know Python.
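To show what following along with the open-source stack looks like, here is a minimal pandas + scikit-learn sketch of a regression workflow. This is my own example with made-up house data, not course material, and the GraphLab call in the comment is only an approximate analogy.

```python
# A minimal pandas + scikit-learn stand-in for a GraphLab Create
# regression workflow; the housing data here is made up.
import pandas as pd
from sklearn.linear_model import LinearRegression

# GraphLab's SFrame roughly maps to a pandas DataFrame.
sales = pd.DataFrame({
    "sqft": [1000, 1500, 1800, 2400, 3000],
    "bedrooms": [2, 3, 3, 4, 5],
    "price": [200000, 290000, 350000, 459000, 590000],
})

# Something like graphlab.linear_regression.create(sales, target="price")
# becomes an explicit fit on the feature columns:
model = LinearRegression()
model.fit(sales[["sqft", "bedrooms"]], sales["price"])

# Predict for a hypothetical 2,000 sqft, 3-bedroom house.
pred = model.predict(pd.DataFrame({"sqft": [2000], "bedrooms": [3]}))
print(round(float(pred[0])))
```

The translation is mostly mechanical: SFrame to DataFrame, and one `create` call to a `fit`/`predict` pair.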

My recommendation on GraphLab Create is to use it for the course. I was able to pick up its API without getting bogged down.

What about the pace and the amount of work required?

The material is definitely at an intermediate level; this is not a course for beginning programmers, nor is it a course for people who are new to ML. If you are an R programmer, a statistician, a data engineer, a data analyst, or similar, you’ll have the background you need. But you should have programming experience. Plan on the five to eight hours per week they recommend in the specialization overview.

The intro page is here.

Do they meet the market?

Yes. I applied knowledge gained in this first course toward trying to find a pattern in data using a clustering algorithm. The material is immediately applicable to any number of data problems. I was already familiar with many of the concepts. This course will help you concretize concepts and fundamentals, and give you the confidence you need to apply them to new problems.
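As a sketch of the kind of pattern-hunting I mean, here is k-means clustering from scikit-learn on synthetic data. The two blobs are made-up stand-ins for real data, not the dataset I actually worked with.

```python
# Sketch: hunting for a pattern in data with k-means clustering.
# The two blobs here are synthetic stand-ins for real data.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Two well-separated groups of 50 two-dimensional points each.
group_a = rng.normal(loc=[0, 0], scale=0.5, size=(50, 2))
group_b = rng.normal(loc=[5, 5], scale=0.5, size=(50, 2))
X = np.vstack([group_a, group_b])

# Ask k-means for two clusters and inspect the assignments.
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(np.bincount(km.labels_))  # → [50 50]
```

With groups this cleanly separated, k-means recovers them exactly; on real data you would also vary `n_clusters` and inspect the inertia to pick a reasonable number of clusters.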