Computing clusters are going to get larger. And it’s not just because Facebook needs room to store photos drunk people take of each other at wedding parties. Why do we need larger clusters?

An article published in The Atlantic discusses how many orders of magnitude faster today’s cameras are than the very first cameras. The difference in speed, as measured by shutter speed, is 10 billion trillion times; in R’s default scientific notation, that’s 1e+22. And the difference is going to get larger, now that we are on the brink of attosecond photography. An attosecond is the length of time it takes light to travel the length of three hydrogen atoms.

A computing cluster today stores roughly a trillion times more bytes than, say, a computing device of approximately 70 years ago. Today’s edge case for cluster capacity is in the petabytes (PB); a petabyte is one quadrillion bytes. Exabyte (EB) clusters are next. An exabyte is 1,000 times a petabyte.
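The arithmetic behind the prefixes is simple: each step in the sequence is 1,000 times the previous one. A quick sketch (plain Python, just for illustration; the chart below uses R) makes it concrete:

```python
# Metric byte prefixes: each step is 1,000x (10^3 times) the previous one.
prefixes = ["kB", "MB", "GB", "TB", "PB", "EB", "ZB", "YB"]

for i, p in enumerate(prefixes, start=1):
    print(f"1 {p} = 10^{3 * i} bytes = {1000 ** i:,} B")

# A petabyte is 10^15 bytes; an exabyte is 1,000 petabytes.
assert 1000 ** 5 == 10 ** 15
assert 1000 ** 6 == 1000 * 1000 ** 5
```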


An interesting fact about this chart is that the y-axis has to be transformed to a logarithmic scale. Otherwise, the largest metric shown, whether gigabytes, terabytes, or anything else, dwarfs all the smaller metrics, since each metric is 1,000 times larger than the previous one. On a linear axis this repeats at every step, so only the largest value would be visible.

Why are we going to need larger clusters? One application is the storage and processing of images captured from femto cameras. Watch this TED Talk video from YouTube and start thinking about the next dimension in imaging: femto cameras.

It’s inevitable that EB-scale clusters will emerge. A use case? Dr. Raskar suggests that smartphones will have a femto camera that can be used to tell the freshness of a fruit at the grocery store. It’s not the camera that will tell the freshness but the processing of the images the camera takes. It’s a classification problem: we are going to build classification systems that determine whether a tomato at the food store is fresh or not. The series of images captured by the femto camera will be run through a fruit-freshness recognizer app. The algorithm will compare the piece you imaged to known pieces. Instead of a classification system for handwritten digit recognition that uses 28×28 pixel training images, imagine multi-frame, 200×200 pixel images with 200 pieces of fruit in a training set. Half of these, 100, will be fresh fruit that are ripe and should be purchased. The other 100 pieces will be not ripe; maybe some are under-ripe and others are over-ripe.
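To make the classification idea concrete, here is a minimal sketch of a nearest-centroid classifier. Everything in it is hypothetical: the feature vectors are synthetic stand-ins for what would really be extracted from femto-camera frames, and a real recognizer would be far more sophisticated. It does show the shape of the problem, including how much bigger the inputs get when you move from 28×28 digits to 200×200 multi-frame images:

```python
import random

# Input sizes: MNIST-style digits vs. the hypothetical femto-camera setup.
digit_features = 28 * 28    # 784 values per training image
fruit_features = 200 * 200  # 40,000 values per frame, times many frames

random.seed(0)

def synthetic_fruit(ripe, n_features=16):
    """Stand-in feature vector; real features would come from femto images."""
    center = 0.8 if ripe else 0.2
    return [random.gauss(center, 0.1) for _ in range(n_features)]

# A 200-piece training set: 100 ripe, 100 not ripe (under- or over-ripe).
train = [(synthetic_fruit(True), "fresh") for _ in range(100)] + \
        [(synthetic_fruit(False), "not ripe") for _ in range(100)]

def centroid(vectors):
    # Average each feature across all vectors in a class.
    return [sum(col) / len(col) for col in zip(*vectors)]

fresh_mean = centroid([x for x, y in train if y == "fresh"])
stale_mean = centroid([x for x, y in train if y == "not ripe"])

def classify(x):
    # Label a new piece of fruit by whichever class mean is closer.
    d_fresh = sum((a - b) ** 2 for a, b in zip(x, fresh_mean))
    d_stale = sum((a - b) ** 2 for a, b in zip(x, stale_mean))
    return "fresh" if d_fresh < d_stale else "not ripe"

print(classify(synthetic_fruit(True)))   # classify a new ripe sample
```

The point for cluster sizing is the training data, not the classifier: every fruit, every frame, every store adds images that have to live somewhere.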

Now, imagine all the different kinds of fruits and vegetables in the grocery store, and everything else that a photon pulse can be shot through and used to build a classification system for. This is going to be a large amount of data, and it’s going to require computer clusters that are larger than today’s clusters. Emerging technologies such as attosecond photography will themselves produce large quantities of data, but it’s the applications we invent that use these technologies that will create even more data. Exabyte computing clusters seem inevitable.

For the curious, here’s the R code used to produce the above chart:

require(ggplot2)  # plotting
require(scales)   # for y-scale transform
label <- c("kB", "MB", "GB", "TB", "PB", "EB", "ZB", "YB")
value <- c(3, 6, 9, 12, 15, 18, 21, 24)
number <- c(1000, 1000^2, 1000^3, 1000^4, 1000^5, 1000^6, 1000^7, 1000^8)
numberLabel <- c(
	"1 000 B", 
	"1 000 000 B", 
	"1 000 000 000 B", 
	"1 000 000 000 000 B", 
	"1 000 000 000 000 000 B", 
	"1 000 000 000 000 000 000 B",
	"1 000 000 000 000 000 000 000 B",
	"1 000 000 000 000 000 000 000 000 B")
df <- data.frame(label = label, value = value, number = number)
p <- ggplot(df, aes(x = value, y = number))
color <- c("#1b9e77", "#1b9e77", "#1b9e77", "#1b9e77", "#2a2a2a", "#1b9e77", 
	"#1b9e77", "#1b9e77")
p + geom_point(colour = color, size = 4, shape = 18) +
	labs(x = "Metric",
		title = "Current edge case for size of\ncomputing cluster is petabyte scale") +
	scale_x_continuous(breaks = value,
		labels = label) +
	scale_y_continuous(trans = log10_trans(),
		breaks = number, 
		labels = numberLabel) +
	annotate("text",
		x = 18.3,  # will vary depending on aspect ratio
		y = 1000^5 * 0.1,
		label = "today's edge case",
		size = 4.5,
		color = "#2a2a2a",
		fontface = 3) +
	theme(plot.title = element_text(size = rel(1.3),
			face = 'bold'),
		axis.title.y = element_blank(),
		axis.title.x = element_text(face = 'bold', 
			size = 12),
		panel.grid.minor = element_blank(),
		legend.position = "none")

Big Data Bellevue’s (BDB) third meetup included a talk by Magdalena Balazinska, professor in the UW Department of Computer Science and Engineering. Her talk covered topics from a paper she and her co-authors published in the March 2013 issue of the IEEE Bulletin of the Technical Committee on Data Engineering*. The paper’s title and topic is managing skew in Hadoop.

In an MR (MapReduce) job, skew is a significant load imbalance; it can be caused by hardware malfunction or by uneven partitioning of data across tasks. The latter cause is called data skew. Skew from either cause is bad because it results in execution times that can be several factors longer than for a job without load imbalances. The opportunity is that if skew can be reduced or effectively eliminated from a job, more time is available on the cluster to execute other jobs, and the value stream of analysis can be completed faster, resulting in data getting to the business in less time.

Data skew can occur for both keys and values in either mappers or reducers. It occurs more often for reducers because mappers mostly read same-size input blocks, while a reducer’s input depends on how the key values hash-partition across tasks. Within data skew, there are two cases. One is uneven data allocation: the number of key values assigned to one task is significantly larger than the number assigned to the other partitions, causing an imbalance. The second is uneven processing times: one task processes a much larger number of values than the other tasks do.
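A tiny sketch shows how a skewed key distribution lands on reducers under hash partitioning. The key names and counts are made up for illustration; the `hash(key) % reducers` rule mirrors the default-partitioner idea rather than Hadoop's actual Java implementation:

```python
from collections import Counter

NUM_REDUCERS = 4

# A skewed key distribution: one "hot" key dominates the map output,
# as often happens in real data (e.g. one very popular item).
keys = ["hot_key"] * 900 + [f"key_{i}" for i in range(100)]

# Default-partitioner idea: hash(key) mod number of reducers.
partition_sizes = Counter(hash(k) % NUM_REDUCERS for k in keys)

for p in range(NUM_REDUCERS):
    print(f"reducer {p}: {partition_sizes[p]} records")
# Whichever partition "hot_key" hashes into receives at least 900 of the
# 1,000 records, while the others split the rest: that reducer is skewed.
```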

An aspect of their approach that I like is that it’s easy to understand the metric for when there is skew. They call a long-running task a straggler: a task that takes at least twice the time to process its data as the average of the other tasks. If there is at least one straggler, there is skew.
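That rule is simple enough to sketch directly. The task names and runtimes below are invented for illustration; only the twice-the-average-of-the-others test comes from the paper:

```python
# Straggler rule: a task is a straggler if it takes at least twice the
# average runtime of the other tasks. Runtimes (seconds) are made up.
runtimes = {"task_0": 40, "task_1": 45, "task_2": 42, "task_3": 130}

def stragglers(times):
    found = []
    for task, t in times.items():
        others = [v for k, v in times.items() if k != task]
        if t >= 2 * (sum(others) / len(others)):
            found.append(task)
    return found

slow = stragglers(runtimes)
print(slow)        # -> ['task_3']: 130s vs. a ~42s average for the others
print(bool(slow))  # -> True: at least one straggler, so the job has skew
```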

Their solution to managing skew is to optimize, using both static and dynamic optimization. The system they built for static optimization is SkewReduce. In SkewReduce, the aim is to build a data-partition plan that minimizes job execution time. The optimization is applied before the job starts; once the job is executing, there is no way to change the optimized data-partition plan. A human must choose the values for some of the parameters used to develop the plan. A sample of the input data is taken and a cost function is produced. Their results found that even for cost plans that are not accurate, runtimes still improved. For better plans, SkewReduce can achieve 4X to 6X improvements in execution time.

SkewReduce cannot handle conditions that arise at runtime and negatively impact performance. So they built a change into Hadoop that they call SkewTune. SkewTune monitors the progress of tasks on nodes for the duration of the job, dynamically removing bottlenecks by transferring work from long-running tasks to other nodes while the job runs. Their results show that 4X improvements in performance can be achieved. What I find notable is that for jobs where there is no skew, the overhead of SkewTune is negligible.

Their code is open source and freely available at their project’s site. A next step might be for a Hadoop contributor or vendor to pick up where they left off, extend the work, and then try to bring SkewTune and SkewReduce into the Hadoop trunk.



* Citation: YongChul Kwon, Kai Ren, Magdalena Balazinska, and Bill Howe. “Managing Skew in Hadoop.” IEEE Data Engineering Bulletin, Vol. 36, No. 1, March 2013.

If you’re building a multi-node cluster and run into problems or errors with the installation of Hortonworks Data Platform (HDP), simplify your environment. Stop what you’re doing, spin up a one-node cluster, and install HDP on the one-node cluster instead. Flush out and understand the problem. Solve the problem. Then return to the multi-node installation and apply the solution. Trying an install on a single node immediately eliminates any complexity relating to parallel, distributed computing.

A couple of weeks ago, I was encountering an error during the Ambari server setup step of an HDP installation on a four-node AWS cluster. Here’s the message:

 ERROR: Unable to start PostgreSQL server. Exiting

First of all, I hadn’t looked closely enough at Ambari to know that PostgreSQL was installed anywhere on a cluster running HDP. It turns out that PostgreSQL is a component of Ambari; it is used to store configuration information about the cluster.

The solution to the above problem is to install version 8 of PostgreSQL. Sean Mikha at How2Hadoop has a shout-out on the workaround for this problem. There’s also a post on the Hortonworks community forum.

If you run into a problem or error installing HDP on a multi-node cluster, simplify. Going to a single-node cluster removes the complexity of a parallel, distributed system. Solve the problem on the single-node cluster, then apply the solution to your multi-node cluster.

This year’s Strata, held in Santa Clara, included some great talks and discussions.

One of the surprises from the conference was the news from Google about F1/Spanner, an RDBMS that scales like NoSQL and supports ACID transactions. Tim O’Brien from O’Reilly used some great hyperbole, comparing this to another walk on water by Google. F1/Spanner supports ANSI SQL queries. It was hot. Look for more at upcoming conferences and in papers. F1/Spanner is intended for business data, supports general-purpose transactions, and is not designed for the use cases that MapReduce does best. The original Spanner paper can be found here, with slides here, and video from OSDI 2012 here.

Also hot at the conference was the Berkeley Data Analytics Stack (BDAS) from the UC Berkeley AMPLab. Tons of buzz on Tuesday and Wednesday. BDAS is an open source software stack for processing Big Data. They have released three components: Shark, Spark, and Mesos. Shark is Hive on Spark. Spark is a computing system that aims to keep data in memory. Mesos is a resource manager that can run both Spark and Hadoop. The point with BDAS is that it is both fast, two orders of magnitude faster in some tests(!), and compatible with existing open source Big Data components. This paper is dense; it describes the Computer Science behind their in-memory processing. Videos, slides, and links to exercises that you can execute on AWS are located here.

Strata 2013 job board

Hadoop is not going to be eclipsed soon. My impression is that it’s just now starting to see broader adoption in business, as evidenced by Hadoop developers being the second most sought-after skill on the two job boards. It is perfectly suited to the workloads it was designed for, and vendors are adding analytics layers on top of MR. It’ll be interesting to see if the Berkeley Data Analytics Stack (BDAS) influences development and extensions of Hadoop. What is pushing Hadoop MR right now is the need for real-time analytics and the need for speed. BDAS benchmarks show a 100x performance edge over Hadoop MR.

I can think of six Hadoop MR distributions, but surely there are more. Two distributions announced either immediately before or at the conference were from Intel and EMC Greenplum.

The job boards were loaded with openings primarily for data scientists, followed by Hadoop MR developers. Almost all of the postings for Data Scientists were from companies, while many of the postings for Hadoop MR were from recruiters. For Data Science right now, companies are looking to build recommendation systems, and the skills they want involve clustering algorithms. Here’s just a section of one of the job boards:

A surprise for me from the conference is that SQL is alive and well. Moreover, it looks like SQL will emerge as a language for use in analytics at scale. The new motivation to incorporate SQL is driven by two things. First, the years since 2006 included the adoption, with mixed success, of several databases. Environments grew complex, each system with its own API. Second, it’s time to simplify and consolidate the number of databases and APIs, and use something that will not require developers to be retrained. Ensuring the ongoing life of SQL will aid the transition of developers from existing RDBMS to NoSQL and Hadoop. The original paper doesn’t say anything about the underlying representation of the data. So, is it really a surprise that SQL will emerge as a language for analytics at scale?

I’m looking forward to the upcoming Hadoop Summit.