The purpose of this post is to assess the relative performance of the 64-bit Python Tableau TDE API on Linux (TDE API). Included below is a discussion of a benchmark exercise that was conducted from February 18 to February 23, 2015. The aim was to gather facts and make a recommendation about whether or not the TDE API is suitable for use in a production environment.

Why is this important?

A typical business problem is that a company has to be able to meet SLAs. The existing Windows Server hosting the Tableau Server sustains high levels of usage during peak times. As business grows and the demand for additional processing is placed on this resource, it’s possible that SLAs will not be met, thus jeopardizing the mission.

During this same period each day, Linux servers that host analytics workflows may remain mostly idle. The opportunity, then, is to determine whether these servers can help meet peak demand for the creation of TDEs. Can we use distributed computing to ensure that we are able to continue to meet the mission? An aim of this benchmark is to provide facts for use in deciding whether or not to use the TDE API to help solve the business problem.

A number of trials were run and the results of these are reported here. Also listed are lessons learned, challenges, recommendations, and future work.

Key takeaway insights:

  1. Execution time is linearly related to the number of rows and the number of columns of the input file.
  2. The compression achieved in the .tde output files is impressive.
  3. The TDE API does use parallel computing to improve performance.
  4. The TDE API can ingest large-scale source files.
  5. The TDE API is not yet ready for deployment to a production environment.

Background

At my day job, we were maybe the first Tableau Software customer to investigate the use of the Tableau 64-bit Python Linux TDE API to meet peak-demand processing loads. How do I know? The first build we tried to use was hard set to work only on IPv6 configured networks. Based on our feedback, the code was changed and Tableau provided us a preview build that checks for network configuration and works on both IPv4 and IPv6 configured networks. It’s this preview build that I used in the benchmark.

There was no prior experience in the user community to rely on for performance information, nor was there existing in-house expertise. Before I could use the TDE API beyond a sandbox environment, I had to understand it better. Learning came from sharing experiences on the job, from Tableau support cases, and from a systematic approach to learning how the API behaves. This benchmark was a series of trials that used simulated business data and approximate business processing loads.

Data set used in the benchmark

To facilitate sharing with the user community, and to facilitate cases with Tableau Software support without disclosing business data, a publicly available data set was used. The data set was the Airlines Data Set, the same data set used in the 2009 American Statistical Association challenge but with additional years of data added through 2012. The source of the data set was a public-facing page on the Revolution Analytics website that contains data sets. The link to that page is here. The source format was the .xdf version of this file, AirOnTime87to12.xdf, but CSV versions are available.

This data set is described here and also below.

There are 46 columns and 148,619,655 rows. This data set is far too large for the open source R data frame. To use the data, functions from the Revolution Analytics RevoScaleR package were used to read and write the data set. Overview of the data set showing number of records and columns:

> rxGetInfo("AirOnTime87to12.xdf")
Number of observations: 148619655
Number of variables: 46
Number of blocks: 721
Compression type: zlib


This post provides a working example of how to keep the denominator in a calculation fixed in Tableau desktop while filtering the data. The aim is to provide a non-variable percentage value for each of the dimensions in the pane. The post at the excellent The Information Lab blog provides the thought leadership and solution. Be sure to read their original post.
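As a rough illustration of the idea outside Tableau, the Python sketch below computes each category’s share against a denominator that is fixed before any filter is applied, and contrasts it with a denominator that shrinks with the filter. The category names and sales figures are made up for illustration, and this is not the approach from The Information Lab post, only the arithmetic behind the goal.

```python
# Hypothetical sales by category; the values are illustrative only.
sales = {"Furniture": 200.0, "Office Supplies": 300.0, "Technology": 500.0}


def percent_of_fixed_total(data, visible):
    """Return each visible category's share of the total over ALL categories."""
    fixed_total = sum(data.values())  # denominator ignores the filter
    return {name: 100.0 * data[name] / fixed_total for name in visible}


def percent_of_filtered_total(data, visible):
    """Return each visible category's share of the total over the filtered rows."""
    filtered_total = sum(data[name] for name in visible)  # denominator shrinks
    return {name: 100.0 * data[name] / filtered_total for name in visible}


visible = ["Furniture", "Technology"]  # the filter hides Office Supplies
fixed = percent_of_fixed_total(sales, visible)
variable = percent_of_filtered_total(sales, visible)
```

With the fixed denominator, the percentages for the visible categories stay the same no matter which categories are filtered out. With the filtered denominator, they are rescaled so that they always sum to 100.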

An extension of this would be to use the X-axis to represent the total sales amount in dollars. I don’t know how to do this and welcome readers to leave comments and suggestions on how to display percentage marks while using the total sales amount for the measure on the X-axis.

This post discusses three tests I did using Boolean expressions in Tableau that evaluated 64-bit operands for equality. I interpret the test results and show the largest integer that Tableau appears to support.

I posted on January 29th about data corruption in R when dealing with 64-bit integers. The takeaway from that post was that analysts have to know the limitations of the tool that they are using. Well, it turns out that Tableau is also a tool that has limitations when dealing with 64-bit integers. The nature of the problem is somewhat similar to the limitation in R, but not as bad. This post discusses the limitation, three tests I did using Boolean expressions in Tableau, and how to interpret the results.

A user on the Tableau Community Forum posted about Tableau handling integers with a length of 150+ digits. Wow. Ok. I found this interesting paper on super long integers in cryptography.

That above post in the Tableau forum and this one got me thinking about whether or not Tableau supports 64-bit integers.

Test 1 This calculation compares two variables that are assigned the 64-bit integer operand values of 100000000000000000 and 100000000000000001 respectively. The Boolean expression [100000000000000000] == [100000000000000001] correctly evaluates to False.

For readers unfamiliar with Tableau, variables are called calculated fields. Enclosing the variables in square brackets is Tableau syntax.

Test 2 This calculation compares two calculated 64-bit integer operands for equality. The test (2^63)-1 == (2^63)-2 incorrectly evaluates to True.

Test 3 This calculation compares two 64-bit integer operands for equality. The Boolean expression 100000000000000000 == 100000000000000001 correctly evaluates to False.

Here is a dashboard that summarizes the results from my tests:

How do I interpret these results? If you can get your data into Tableau as a 64-bit integer, it looks like you might be ok. But, I need to test this using other data sources. For sure, any data source that relies on the Microsoft Jet engine is going to result in data corruption, as Jet is limited to 32-bit only. If you use a calculated field in Tableau to create an integer constant, and later use that constant in other expressions, it looks like you’ll be ok. Test 1 confirms this. If you compare two integer constants directly, you’ll be ok. Test 3 confirms this. But, if you compare two calculated values, the result is not correct. It seems that these calculated values are stored as floats. Comparison of two floats, when one or both represent values that exceed the precision of the float, is always going to result in data corruption.
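The difference between the constants in Tests 1 and 3 and the calculated values in Test 2 can be reproduced outside Tableau. The Python sketch below compares the same two constants first as integers and then as 64-bit floats. It assumes, as the results suggest but Tableau has not confirmed, that calculated values are stored as IEEE 754 doubles; the variable names are mine.

```python
a = 100000000000000000
b = 100000000000000001

# Compared as integers, the two constants are distinct (as in Tests 1 and 3).
equal_as_ints = (a == b)

# Converted to 64-bit floats, both round to the same value: a double has a
# 53-bit significand, and adjacent doubles near 1e17 are 16 apart.
equal_as_floats = (float(a) == float(b))

# Up to this magnitude every integer has an exact double representation.
max_exact = 2 ** 53
```

Under that assumption, integers with a magnitude of up to 2^53 survive a round trip through a float unchanged, and anything larger may silently collapse onto a neighboring value.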

How big can you go in Tableau? To be safe, stay within the range of a 32-bit signed integer, at least until Tableau releases technical information about the precision limitations of its Number data type for integers and floating point numbers. Why do I say 32-bit? Because it’s a safe bet. Here are some other tests:

Test 4 The expression (2^54)-1 == (2^54)-2 evaluates to False. This is correct.
Test 5 The expression (2^55)-1 == (2^55)-2 incorrectly evaluates to True.
Test 6 For negative numbers, the expression -(2^53) == -(2^53)+1 evaluates to False, the correct result.
Test 7 But, the expression -(2^54) == -(2^54) + 1 evaluates to True, an incorrect result.

But, then:

Test 8 The calculated fields abs(-(2^53)) and (2^53)-1 return the same float value, and, when formatted as integers, the same integer value (as expected). Yet, the expression abs(-(2^53)) == (2^53)-1 returns False, the correct result. If the floats have insufficient precision, how is it that the result of this expression is correct? Is Tableau doing a conversion to their String data type?
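One way to probe the float hypothesis is to evaluate the same expressions with IEEE 754 doubles and see whether the outcomes line up. The sketch below does this in Python, whose float type is a 64-bit double. It assumes that Tableau’s ^ operator yields a double, which is my guess and not something Tableau has documented.

```python
# 2.0 ** n stands in for a Tableau calculated value, assuming doubles.
results = {
    "Test 2": (2.0 ** 63) - 1 == (2.0 ** 63) - 2,
    "Test 4": (2.0 ** 54) - 1 == (2.0 ** 54) - 2,
    "Test 5": (2.0 ** 55) - 1 == (2.0 ** 55) - 2,
    "Test 6": -(2.0 ** 53) == -(2.0 ** 53) + 1,
    "Test 7": -(2.0 ** 54) == -(2.0 ** 54) + 1,
    "Test 8": abs(-(2.0 ** 53)) == (2.0 ** 53) - 1,
}

for name in sorted(results):
    print(name, results[name])
```

All six outcomes match what Tableau returned, including Test 8: both 2^53 and 2^53 - 1 are exactly representable as doubles, so the comparison can be correct without any string conversion. The identical values shown for the two calculated fields may then be a matter of display precision. Matching results are consistent with the double hypothesis but do not prove it.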

I’m not sure why Tableau doesn’t support 64-bit signed integers. They do have just the one, numeric data type, Number, and are probably using the bits to store the length of the mantissa.

These tests were done using Tableau Desktop 8.1.4 (8100.14.0213.2024) 64-bit on a system running Windows 7 Enterprise SP1 64-bit.

Wrap

Same as my previous post…be careful. Know that Tableau doesn’t throw an overflow message or give you a warning. Know the limitations of the tool you’re using. Identify and experiment with each workaround and decide which one is best for the data problem you are trying to solve.

The purpose of this data visualization is to show the density of a measure over time. The measure is an occurrence of a criminal or non-criminal activity or incident reported by the University of Washington Police Department. The density is the frequency of the measure at a geocoded location over a period of time. What is shown here is, essentially, where officers were called to during the summer of 2013.

[Embedded CartoDB Torque map: http://sculpturearts.cartodb.com/viz/f4c5c014-936b-11e3-8d9b-0edd25b1ac90/embed_map?title=true&description=true&search=false&shareable=true&cartodb_logo=true&layer_selector=false&legends=true&scrollwheel=true&sublayer_options=1&sql=&sw_lat=47.640&sw_lon=-122.335&ne_lat=47.66613879599216&ne_lon=-122.281]

This is an example of a CartoDB Torque map. Used alone or with other maps or information, maps like this can show new patterns of human activity, possibly making them understood for the first time. This map, used in conjunction with a previously completed Tableau viz, is an example of this. The patterns and information easily discernible on these maps without any narrative or explanation would be impossible to discern looking at the same data in tabular format. (The Tableau visualization of this same data is available here on my Tableau Public web site. Feel free to check it out!)

I think it would be interesting to organize the data by day of the week over the period of a year. What are the busiest days of the week, on average, and what types of incidents occur most frequently during different days of the week?

Here are tips that I can guarantee 😉 will help you learn Tableau Desktop faster. The biggest time saver is knowing when the data isn’t in the right form or the Tableau approach you are using to solve a problem isn’t working. If you are trying to create a stacked bar chart for the first time and struggling, and you will struggle, watch a demo or read a tutorial if it takes you more than 10 to 15 minutes.

To help set your frame of mind, the real strength of Tableau as a tool is to quickly whip out different views and stories from the data. This much is true and is not marketing hype. Here we go:

  • Watch the free, on demand videos that Tableau has on their site. This is the best source of learning.
  • I thought there was a forum specifically for newbies. I can’t find it. The forums appear now to be organized by topic. (Maybe they always were?) The search engine for finding questions is pretty good. Use it.
  • They have a calculation reference library that I just discovered a few minutes ago. It’s probably a place to turn to when you think you might be reinventing the wheel.
  • Join and attend a local Tableau user group. I’ve attended maybe five Seattle TUG meetings. I’d say that about half the people in attendance are newbies. The more experienced users at these groups are supportive. And it’s great networking, both for relationships and for finding out what the good places to work are. Start a discussion on the TUG board for your area and see what response you get.
  • Create a Tableau Public account right away. It’s free. Publish workbooks to it (see the cautionary statement near the end). You’ll get a sensation of what it’s like to publish. You can share the URLs with colleagues, etc. It’s free hosting. It’s the place to put your Tableau portfolio. Any hiring manager looking for Tableau experience in a candidate will go to Public. Complete the simplest stacked-bar chart, make it as nice as you can, publish, and leave it out there.
  • Tableau Public has some sample data sets. Download the sample data sets and try to recreate the published viz that uses that data set.
  • When you have some chops, create a free, 60-day Tableau Online account. You will get even more of the sensation of what it’s like to publish and share. You will be able to invite people via email. You can set up projects and manage security. Wait to use this until you have some chops so you don’t waste your trial. (PS – I just went to the site and they might no longer be offering trials. Cost is otherwise $500 per person per year with a one-year minimum the first year.)
  • All downloadable workbooks can be pulled apart and studied. Prowl Tableau Public and download a workbook that has a feature or design that you want to learn. The data will download with it. This is one of the best ways to learn new stuff. I underuse this advice.
  • To motivate learning, use data that you either know or are interested in. For data you know, use standard business data to demonstrate line charts and the basic, conservative kind of stuff. Use a data set you are interested in to do more exploratory stuff like maps, different kinds of charts, background images, etc. Bounce between the two as you are waiting for answers to questions on message boards.
  • If you spend more than an hour trying to figure something out, your approach is probably wrong or there is some limitation in Tableau on being able to use the data in the way you want. I’ve spent way too many hours on approaches that were wrong, like, for example, trying to aggregate already aggregated fields in a calculated field.
  • When developing and learning, always import the data and use the extract, saving your workbooks as .twbx files.
  • Do some type of pseudo version control. At certain points, stop working on a workbook, number it, copy it, and start working on the new copy. Don’t be afraid to delete a worksheet just because there is something cool on it that you want to save. Tableau 8 has functionality to copy worksheets from one workbook and paste them into another.
  • If you have a workbook open and are staring at the worksheet with no idea how to do what you want to do, just do something. It doesn’t matter. Make movement. Click the undo button. If you find yourself clicking on the “Show Me” chart tool helper and nothing happens, you are probably on the wrong track. Get some fresh ideas from a forum or sample worksheet or just put it aside and do something else for a while.

Take the blue pill or is it the green pill? The Tableau GUI has its own visual vocabulary. Tableau and the Tableau GUI are not extensions of Excel. Don’t fight it. Think about it like learning a new language. Learn the alphabet, the words, the grammar and syntax, make sentences, make paragraphs (worksheets), an essay (dashboards), then publish what you write. For some reason, one of the big blockers at the early stage of learning Tableau is the difference between the blue and green fields.

Blend, join, neither, or both. This took me months to figure out. Always push the data preparation and data manipulation down to the database whenever you can. First choice always is to create a view in the database if your data comes from more than one table. If you don’t have the privileges or cooperation to create a view in your DBMS, then join in Tableau. A blend is the last resort and you’ll know you have to use it when joining doesn’t work. A task that can be difficult or impossible in Tableau is applying functions and calculating aggregations on a data field that is already an aggregate. If you can create the aggregate in the database and associate it with the data that you import or connect to, your workbook will be simpler and your job easier.

Sharing a viz with “restricted” data. Anything published on Tableau Public is available to anybody with a Tableau Public account. But, people have to go out there and look for it. They have to find your account, find your workbook, and download it. Set up your Tableau Public profile so that the workbooks you publish are hidden by default. If you need a place to share a viz for a job interview, for example, publish the workbook, go to the interview, discuss what you did, close the browser, then delete the workbook via the mobile app as soon as you leave the office. Don’t send the link to the hiring manager or friends because they might download the workbook. The probability of losing restricted data this way is low. Always hide unused fields.

Sharing a viz with “confidential” data. Don’t use Public. Use only a Tableau Online account or a local Tableau server.

Wrap

Always upgrade to the latest version of Tableau. You can check for the latest release on the download site. Stay on the latest release unless you have a reason not to, such as the server you deploy to at work being a version behind. Read the release notes while you’re there.

You’ll soon have some skills and will want to use your new chops to help you land another job. Think of the Tableau sites as social media. Create a history on the Tableau sites using your personal email address. This way, your history will follow you should you find yourself no longer using a work email address.

And, for readers who have gotten this far, the thank you is this final tip. This could save you from many hours of frustration. Be prepared to use a schema.ini file when using text files as the data source. If the data in your file looks good in an editor but wacky when you view it via the “View Data” tool within Tableau, that’s the clue that you might need to use a schema.ini file.
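For readers who have not seen one, a schema.ini file is a small text file that sits in the same folder as the data file and tells the Jet text driver how to read each column. The Python sketch below generates one. The file name, column names, and types are hypothetical, the helper function is my own, and it assumes column names without spaces.

```python
def build_schema_ini(file_name, columns, has_header=True):
    """Return schema.ini text describing one comma-delimited text file.

    columns is a list of (name, jet_type) pairs, e.g. ("order_id", "Text").
    """
    lines = [
        "[{}]".format(file_name),  # section name is the data file's name
        "Format=CSVDelimited",
        "ColNameHeader={}".format("True" if has_header else "False"),
    ]
    # Columns are numbered from 1 in the order they appear in the file.
    for position, (name, jet_type) in enumerate(columns, start=1):
        lines.append("Col{}={} {}".format(position, name, jet_type))
    return "\r\n".join(lines) + "\r\n"


# Hypothetical file and columns, for illustration only.
schema_text = build_schema_ini(
    "orders.csv",
    [("order_id", "Text"), ("order_date", "DateTime"), ("amount", "Double")],
)
```

Writing schema_text to a file named schema.ini next to orders.csv would, under these assumptions, force order_id to be read as text rather than having its type guessed from the first rows.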