As a database developer, I often hear noise about a number of ETL tools on the marketplace. Having probably drunk more-than-my-fair-share of the proverbial Microsoft Kool-Aid, I am a little biased towards their products. However (at least where I work), an ETL tool called Open Studio by Talend has also gained some traction and interest around our water cooler discussions and in management meetings.
Since we are a Microsoft shop (and hold a number of expensive SQL licenses), I am a firm believer that SSIS is the way to go. It’s really good at working with SQL Server instances and, in my opinion, has an intuitive interface and is overall pretty easy to install, deploy and manage. Let’s get something straight: if you’re a Microsoft shop, or you’re loading data to MS SQL – SSIS is the way to go. You would never let a plumber do your taxes, would you?
In an imperfect world, “the powers that be” don’t care about the toolset being used as long as it fits a need and doesn’t cost too much. Enter Talend’s Open Studio.
As a skeptic, I decided to install Open Studio and play around with it a bit. If you’re an SSIS person, it will take some getting used-to and configuration alone might drive you pretty crazy. If you absolutely abhor Java, do not bother even installing Open Studio as it will take years off of your life. Open Studio does have some interesting features and connectors that could be useful for a non-Microsoft shop and it only touts itself as an ETL tool (not a BI suite). Curiously, I wondered could Open Studio outperform SSIS in some “simple data transformation tasks”?
The “Test”
In testing and comparing the two ETL tools, I figured it would be unfair to load data to or from Microsoft SQL since SSIS is a Microsoft product. I also didn’t want to compare some of the custom connections that come with the “out of the box” version of Open Studio since there would be some overhead in recreating those connectors in SSIS. The simplest comparison is testing overall ability of both tools to load 1 delimited flat file to another delimited flat file on the same server. The file that will be loaded through both toolsets is a comma separated file with 20M records and approximately 73 bytes per record - making the overall file size about 1.5GB.
The environment this test will be performed on is below:
- Server: IBM – System x3650 M3
- Processor: Intel Xeon X5680 @ 3.33GHz (2 processors)
- RAM: 76GB
- OS: Windows Server 2008 R2 - 64 bit
- Talend v5.3.1 Open Studio
- JDK 1.7.0
- JRE 7
- SQL Server 2008 – SP1
NOTE: When installing Talend, you will need some patience in getting your environment set up. I found this (http://www.talend.com/prerequisites) to be a good read that helped move things along fairly quickly. I would advise downloading the most recent JDE/JDK’s from java and setting up your environment variables before installing Talend.
Below are the results from loading data from one delimited file to another to test throughput (20M records). What is important to note is that I had some initial out-of-the-box failures. After some quick reading and research, I was able to tune some of the memory settings to allow the job to merely complete. I did not think the size of the file was too large for (nearly 1.5GB on disk) for any ETL tool to undertake, but I underestimated Talend’s Open Studio here. I spoke with our “resident Talend expert”, and was told that “all ETL tools should be tuned out of the box”. Not sure I agree. I tuned some of the ETL processes per his instructions to get some dedicated memory on the last 3 runs which showed some significant gains. The trial below shows my results from the Open Studio “out-of-the-box install”. The first attempt failed with a run time of 27.2 seconds and did not finish with an “out of memory” issue.
Trial | Records | Talend's Time | Result | Notes |
0 | 5000945 | 27.2 | DNF (out of memory error) | Out of the box install |
Once I was able to get the Talend job to finish, I could then compare the results of my first test case scenario to the results rendered by SSIS. To get a large enough sampling, I ran the first test 3 times (trials 1-3) and then tuned each job to run “optimally” for another 3 trials (4-5). The average time of completion for Talend was 67.69 seconds while SSIS completed the same job with an average time of 39.8 seconds.
Results
Trial | Records | Talend's Time | Notes | SSIS Time | Notes |
1 | 20000000 | 92.28 | Adjusted XMX to -20000M | 49.71 | Out of the box install |
2 | 20000000 | 92.99 | Adjusted XMX to -20000M | 50.2 | Out of the box install |
3 | 20000000 | 86.69 | Adjusted XMX to -20000M | 51.83 | Out of the box install |
4 | 20000000 | 45.6 | Increased initial heap memory size for JVM | 29.39 | Tune DFT (BufferSize / Row Width) |
5 | 20000000 | 44.49 | Increased initial heap memory size for JVM | 29.17 | Tune DFT (BufferSize / Row Width) |
6 | 20000000 | 44.08 | Increased initial heap memory size for JVM | 28.48 | Tune DFT (BufferSize / Row Width) |
Average | 67.69 | 39.8 |
A screen grab of the test completing in SSIS
Below is an image to show what the subsequent test looks like in Talend Open Studio.
A Second “Test”
Since the first test case was fairly simplistic, I wanted to see how Talend’s Open Studio would compare to SSIS in loading the same file with some simple “real world” data transformations. In this second test, I added in a filter to exclude any records with a quantity < 0 (it accounted in about 2.5% of my sample set being filtered out) and then a second transformation for multiply the quantity by 100. This test used the same data input and same test environment. Here we see an increase in overall execution time – as expected with the additional processing steps. Once again SSIS out performed Talend with SSIS having an average completion time of 40.84 and Talend having an average completion time of 78.91.
Trial | Records | Talend's Time | Talend Adjustments | SSIS Time | SSIS Adjustments |
1 | 20000000 | 112.06 | Adjusted XMX to -20000M | 51.32 | None |
2 | 20000000 | 99.44 | Adjusted XMX to -20000M | 53.8 | None |
3 | 20000000 | 106.99 | Adjusted XMX to -20000M | 51.45 | None |
4 | 20000000 | 51.83 | Increased initial heap memory size for JVM | 29.32 | Tune DFT (BufferSize / Row Width) |
5 | 20000000 | 50.23 | Increased initial heap memory size for JVM | 29.67 | Tune DFT (BufferSize / Row Width) |
6 | 20000000 | 52.89 | Increased initial heap memory size for JVM | 29.46 | Tune DFT (BufferSize / Row Width) |
Average | 78.91 | 40.84 |
Here is a screenshot of the SSIS test
Below is what the Talend equivalent of the above package looks like
Some things I liked about Talend
When I made changes to a mapping or transformation I was asked “Would you like to propagate these changes”. Clicking yes pushes the metadata changes out to the rest of your affected transformation elements. Those who have experience with SSIS know that metadata issues can cause some frustrations.
I also really liked the “Debug Run”/Traces debug. It is a lot like a data reader in a Data Flow task but you don’t have to add/remove every time you want to see some particular data elements. In Open Studio when you do a Debug Run > Traces Debug, the debug behaves just as a data reader in SSIS and shows data flowing through the process row-by-row.
A screen grab of the data being displayed in the debug run.
Some Talend Nuisances
There are some things that take a very “inquiring mind” when working with Talend. A few of the errors I got were very vague (SSIS has been known to do that too) and I had to go do some research to find out how to fix the issue. The first test run was a very glaring issue in my eyes. Some other things like knowing ASCII representations help in configuring – but it’s nothing a little trial, error and Google couldn’t help with.
Conclusions
SSIS outperformed Talend on some fairly simple transformations. This doesn’t mean SSIS will outperform Talend all of the time across all environments against all data formats – just in these two test cases it did. This test size was small and I would be willing to bet that Talend can do a number of things more efficiently that SSIS. Plus it is open source.
How is the data development community using Talend Open Studio? I would be especially interested to know how Microsoft shops are using the product and how it helps their respective organizations.
References:
- http://www.talendforge.org/forum/viewtopic.php?pid=79465
- http://www.talendbyexample.com/my-first-talend-job.html
- http://www.talendforge.org/forum/viewtopic.php?id=9878
- http://www.talendforge.org/forum/viewtopic.php?id=11489
- http://www.talendforge.org/forum/viewtopic.php?id=7542
- http://sqlsolace.blogspot.com/2010/06/ssis-tuning-buffer-size.html
- http://stackoverflow.com/questions/1043817/speed-tradeoff-of-javas-xms-and-xmx-options