February 12, 2015 at 9:58 am
Hi all, I have a concept question for the DWs out there.
I work with large file extracts, a mix of flat and delimited formats.
Our DW takes these extracts and loads them into a single table. Normally this isn't an issue, but there are several gargantuan extracts that have 250+ fields and 7-10 million rows. These files are submitted monthly for valuation.
Setting aside the possibility that we absolutely, positively have to build a single table to stay compliant with audits, SOX, etc., is it bad practice to do it this way? I'm having some serious performance issues getting the data I need for a proof of concept, because I need about 150 of those 250 fields.
Thanks in advance.
Frank
February 21, 2015 at 12:52 am
I'm far from an expert, as I'm still pretty new to the DW world myself, but I did design a system similar to yours, just with fewer fields. I approached my design using the Kimball methodology and kept things very, very simple.
I import a few files that have millions of records in a single flat file, sometimes more than that. It's very easy for me to load a single file with over 10 million records into a single staging table using BULK INSERT. However, I'm not dealing with the number of fields you're facing, either.
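Just to illustrate what I mean, a plain BULK INSERT looks something like this. The table name, file path, and delimiters here are made up, so adjust them for your extract:

-- Rough sketch of loading a big flat file into one staging table.
-- dbo.StagingExtract, the path, and the delimiters are placeholders.
BULK INSERT dbo.StagingExtract
FROM 'D:\Extracts\monthly_extract_201501.txt'
WITH (
    FIELDTERMINATOR = '|',      -- whatever delimiter your extract uses
    ROWTERMINATOR   = '\n',
    FIRSTROW        = 2,        -- skip the header row, if there is one
    TABLOCK,                    -- helps get minimally logged loads on 10M+ rows
    BATCHSIZE       = 100000    -- commit in batches instead of one huge transaction
);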
One method you could try, and one I've put into practice myself, is transforming the data at the disk level, before it ever reaches the database. Kimball suggests this a few times as a way to take the pressure off other systems, rather than having your RDBMS do all the work for you. Using SQL Server Integration Services (SSIS) to parse the flat files into smaller chunks on disk could ease the pain for you, much as it has for me with rather large flat files. Processing smaller chunks is going to be a lot easier than processing one big chunk.
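My own splitting happens in SSIS before the load, but if you wanted to rough out the same chunking idea purely in T-SQL, BULK INSERT's FIRSTROW/LASTROW options can load one big file in slices. This is only a sketch with made-up names and row counts, and each pass still re-reads the file from the top, so splitting on disk is usually cheaper:

-- Sketch only: load a huge flat file in 1-million-row slices instead of one pass.
-- Assumes the dbo.StagingExtract table and file path from the earlier example,
-- and roughly 10 million data rows after a single header row.
DECLARE @first BIGINT = 2,
        @chunk BIGINT = 1000000,
        @last  BIGINT,
        @sql   NVARCHAR(MAX);

WHILE @first <= 10000001
BEGIN
    SET @last = @first + @chunk - 1;

    -- FIRSTROW/LASTROW don't take variables directly, hence the dynamic SQL.
    SET @sql =
        N'BULK INSERT dbo.StagingExtract
          FROM ''D:\Extracts\monthly_extract_201501.txt''
          WITH (FIELDTERMINATOR = ''|'', ROWTERMINATOR = ''\n'',
                FIRSTROW = ' + CAST(@first AS NVARCHAR(20)) + N',
                LASTROW  = ' + CAST(@last  AS NVARCHAR(20)) + N',
                TABLOCK);';
    EXEC sys.sp_executesql @sql;

    SET @first = @last + 1;
END;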
After that, if the number of fields is still the bottleneck, and not necessarily the number of rows, then you might want to consider splitting those fields across multiple staging tables and/or fact tables (see the rough sketch below). That's really the only way you're going to get around this issue, unless you do all your auditing, transformation, and so forth at the disk level before SQL Server even loads the data into the DW.
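Purely hypothetical table and column names, but the splitting idea would look something like this, with a shared key so the pieces can be stitched back together when you need them:

-- Hypothetical sketch: split a 250-field extract across two narrower staging
-- tables, tied together by the row number from the source file.
CREATE TABLE dbo.StagingExtract_Core (
    SourceRowID  BIGINT      NOT NULL PRIMARY KEY,  -- line number from the flat file
    PolicyNumber VARCHAR(30) NULL,
    ValuationDt  DATE        NULL
    -- ... the ~150 fields the proof of concept actually needs
);

CREATE TABLE dbo.StagingExtract_Overflow (
    SourceRowID  BIGINT      NOT NULL PRIMARY KEY
        REFERENCES dbo.StagingExtract_Core (SourceRowID),
    MiscCode1    VARCHAR(10) NULL,
    MiscCode2    VARCHAR(10) NULL
    -- ... the remaining ~100 fields you rarely touch
);

-- The proof-of-concept queries then hit only the narrower core table and
-- join to the overflow table only when those extra fields are needed.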
Another possibility, and one we are considering in the future for non-relational, chaotic data, is a NoSQL solution. The Hadoop framework in particular has a couple of engines that are said to excel with very wide data, which covers your scenario of 100+ fields. I can't for the life of me remember the engine names, but I do know they exist and have been mentioned as performing well regardless of how many fields your data has.
But that's an entirely different ballgame and a much larger solution than you likely need. I only mention it as food for thought.