The purpose of this post is to assess the performance of the Python version of the 64-bit Tableau TDE API on Linux (hereafter, the TDE API). Included below is a discussion of a benchmark exercise conducted from February 18 to February 23, 2015. The aim was to gather facts and make a recommendation about whether the TDE API is suitable for use in a production environment.
Why is this important?
A typical business problem is that a company must be able to meet SLAs. The existing Windows Server hosting Tableau Server sustains high levels of usage during peak times. As business grows and demand for additional processing is placed on this resource, SLAs may not be met, jeopardizing the mission.
During this same period each day, Linux servers that host analytics workflows may remain mostly idle. The opportunity, then, is to determine whether those servers could help meet peak demand for the creation of TDEs. Can we use distributed computing to ensure that we continue to meet the mission? An aim of this benchmark is to provide facts for deciding whether the TDE API can help solve this business problem.
A number of trials were run and the results of these are reported here. Also listed are lessons learned, challenges, recommendations, and future work.
Key takeaway insights:
- Execution time is linearly related to the number of rows and the number of columns of the input file.
- The compression achieved in the .tde output files is impressive.
- The TDE API does use parallel computing to improve performance.
- The TDE API can ingest large-scale source files.
- The TDE API is not yet ready for deployment to a production environment.
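To make the first takeaway concrete: if execution time grows linearly with the cell count (rows × columns), a simple least-squares fit recovers a per-cell cost. The timings below are invented for illustration, not measurements from this benchmark:

```python
# Hypothetical illustration of linear scaling: fit execution time as a
# function of cell count (rows x cols) with ordinary least squares.
# The timing samples are made up, not from the benchmark.

def fit_linear(xs, ys):
    """Ordinary least squares for y = a*x + b."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    a = cov / var
    b = mean_y - a * mean_x
    return a, b

# (rows, cols, seconds) -- invented numbers following a linear trend
samples = [(1_000_000, 10, 12.0), (2_000_000, 10, 24.0),
           (1_000_000, 20, 24.0), (4_000_000, 46, 220.8)]
cells = [r * c for r, c, _ in samples]
secs = [t for _, _, t in samples]
a, b = fit_linear(cells, secs)
print(f"~{a * 1e6:.1f} s per million cells")  # → ~1.2 s per million cells
```

With a fit like this in hand, one can extrapolate roughly how long a given source file should take and flag trials that deviate from the linear trend.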
At my day job, we may have been the first Tableau Software customer to investigate the use of the Tableau 64-bit Python Linux TDE API to meet peak-demand processing loads. How do I know? The first build we tried to use was hard-coded to work only on IPv6-configured networks. Based on our feedback, the code was changed and Tableau provided us a preview build that checks the network configuration and works on both IPv4 and IPv6 networks. It’s this preview build that I used in the benchmark.
There was no prior experience in the user community to rely on for performance information, nor was there existing in-house expertise. Before I could use the TDE API beyond a sandbox environment, I had to understand it better. Learning occurred through sharing experiences on the job, through Tableau support cases, and through a systematic approach to studying how the API behaves. This benchmark was a series of trials that used simulated business data and approximate business processing loads.
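For readers who have not seen the TDE API, each trial boils down to defining a table schema and feeding rows through it into a .tde file. The sketch below shows the general call shape as I recall it from the legacy dataextract module; the column names and sample values are my own illustrations, and the API surface may differ slightly across preview builds:

```python
# Sketch of building a .tde with the legacy Tableau "dataextract" module.
# Column names and sample rows are illustrative only.
try:
    import dataextract as tde
except ImportError:
    tde = None  # SDK not installed; the function below still shows the call shape

def build_extract(path, rows):
    extract = tde.Extract(path)
    table_def = tde.TableDefinition()
    table_def.addColumn('Year', tde.Type.INTEGER)
    table_def.addColumn('ArrDelay', tde.Type.DOUBLE)
    # In this SDK generation an extract holds a single table,
    # conventionally named 'Extract'.
    table = extract.addTable('Extract', table_def)
    for year, delay in rows:
        row = tde.Row(table_def)
        row.setInteger(0, year)
        row.setDouble(1, delay)
        table.insert(row)
    extract.close()

if tde is not None:
    build_extract('flights.tde', [(2012, 5.0), (2012, -3.0)])
```

Scaling this loop from a handful of rows to hundreds of millions is exactly what the trials below exercised.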
Data set used in the benchmark
To facilitate sharing with the user community, and to support cases with Tableau Software without disclosing business data, a publicly available data set was used. The data set was the Airlines Data Set, the same data set used in the 2009 American Statistical Association challenge but with additional years of data through 2012. The source was a public-facing page on the Revolution Analytics website that contains data sets. The link to that page is here. The source format was the .xdf version of this file, AirOnTime87to12.xdf, but CSV versions are available.
This data set is described here and also below.
There are 46 columns and 148,619,655 rows. This data set is far too large for an open-source R data frame. To read and write the data set, functions from the Revolution Analytics RevoScaleR package were used. Overview of the data set showing the number of records and columns:
```
> rxGetInfo("AirOnTime87to12.xdf")
Number of observations: 148619655
Number of variables: 46
Number of blocks: 721
Compression type: zlib
```
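A quick back-of-the-envelope calculation shows why this data set will not fit in an R data frame on typical hardware. Assuming every value were held as an 8-byte double (R's numeric type), the fully materialized table would need roughly:

```python
# Upper-bound in-memory size estimate for the Airlines data set,
# assuming 8 bytes per value (R's numeric type). Actual memory use
# varies by column type; this is a rough sketch, not a measurement.
rows = 148_619_655
cols = 46
bytes_per_value = 8

total_bytes = rows * cols * bytes_per_value
print(f"{total_bytes / 1024**3:.1f} GiB")  # → 50.9 GiB
```

Chunked readers such as those in RevoScaleR sidestep this by never materializing the full table at once.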