Eat Your Data Faster with Apteco.DataBuild
09 Juni 2015 | von Tim Heron
Improvements in data load times The team at Apteco has long acknowledged that the development of multi-threaded algorithms are the means of driving performance around data compression in modern high performance software development and as a result have seen significant improvements in data load times with each new version of FastStats Designer.
It has been just over 10 years since the publication of Herb Sutter’s article “The Free Lunch Is Over”. In his article Herb described the changing performance characteristics of modern processors and the consequences this has for modern high performance software development.
In the Apteco High Performance Team we have long recognised that improvements in query and data load performance will only come from implementing multi-threaded algorithms instead of continual micro optimisations to sequential code.
With FastStats Designer we have worked over the last few years on developing multi-threaded versions of our sequential data load algorithms. We have worked a component at a time replacing the sequential processes with new multi-threaded equivalents and in doing so we achieved a 25-30% speed improvement over the course of 2014.
For the 2015 Q2 version of FastStats Designer we are releasing for beta use our new multi-threaded data load and compression component “Apteco.DataBuild.dll”.
This routine still produces our standard FastStats optimised, compressed binary data file formats but does so with entirely new code that has been written from scratch using the latest parallelisation technology. The technology we decided to base our component on is a Microsoft library called TPL Dataflow that extends the multi-threaded features of the .Net Framework to support data processing pipelines. We re-implemented our compression routines as components of this pipeline to make full use of the multi-threaded capabilities of modern processors.
Previously with the Nov 14 sequential load process we were able to build a TPC-H reference test system with 1.3Bn records (1Bn Line Item records) in 4 hours. The sequential data load portion of this build took nearly 2 hours. The new Apteco.DataBuild component performs the same processing in under 50 minutes dropping the total load time (including auto discovery, sort and compression) to 2 hours 50 minutes.
Starting with the 2015 Q2 release users of FastStats Designer will have the option to build selected tables with this new component and the improvements will benefit any FastStats Designer data builds that execute on multi-processor machines.
Apteco.DataBuild giving a powerful 32 core AWS EC2 instance something to think about.
The ability to access and analyse big data is key to driving business forward and creating target marketing campaigns that give you and your organisation the edge.
Sutter, H. 2005. "The free lunch is over: A fundamental turn toward concurrency in software," Dr. Dobb's Journal. http://www.gotw.ca/publications/concurrency-ddj.htm