August 13, 2012 at 8:39 am
Hi All,
We have a requirement to create a file for each unique record in the database, and we have nearly a million records to process. This package will be scheduled and will re-create a file whenever any field in its record changes.
I could use a Foreach Loop Container with a Data Flow task to create a file for each record, but that hits the DB every time I pass a unique ID into the Data Flow task.
What is the best strategy to go for?
Please share your suggestions.
August 13, 2012 at 8:42 am
I think a For Each loop is what you'll need. Can't think of any other way to do it.
Might want to split them up into smaller chunks in multiple directories. A million files in one directory will create problems if you ever need to view the contents of that folder.
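Purely as an illustration, here's one cheap way to spread them out. The 1000-bucket scheme and the PathFor helper are invented for the example, not anything from the OP's system:

using System;
using System.IO;

class BucketDemo
{
    // Hypothetical helper: spread a million files across 1000 subfolders
    // (roughly 1000 files each) instead of one giant directory.
    static string PathFor(int productId, string root)
    {
        string bucket = (productId % 1000).ToString("D4");
        return Path.Combine(root, bucket, productId + ".txt");
    }

    static void Main()
    {
        Console.WriteLine(PathFor(123456, @"C:\ProductFiles"));
        // -> C:\ProductFiles\0456\123456.txt
    }
}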
- Gus "GSquared", RSVP, OODA, MAP, NMVP, FAQ, SAT, SQL, DNA, RNA, UOI, IOU, AM, PM, AD, BC, BCE, USA, UN, CF, ROFL, LOL, ETC
Property of The Thread
"Nobody knows the age of the human race, but everyone agrees it's old enough to know better." - Anon
August 13, 2012 at 8:48 am
This is quite an unusual requirement - do you mind expanding a little on the reasons behind it?
The absence of evidence is not evidence of absence
- Martin Rees
The absence of consumable DDL, sample data and desired results is, however, evidence of the absence of my response
- Phil Parkin
August 13, 2012 at 8:50 am
Phil Parkin (8/13/2012)
This is quite an unusual requirement - do you mind expanding a little on the reasons behind it?
I feel like maybe OP misunderstood the requirement? Maybe the request was "we need a file with all the records in table x"? Will let the OP expound further but I agree, this sounds bizarre.
August 13, 2012 at 9:00 am
Hey Phil,
Thanks for the reply.
Our project involves storing every detail of a product (for example). When users search for a product from the UI, they get this file, which has all the info. We are using an indexing strategy to retrieve the file quickly.
Now, I need to create a file for each product using ETL.
I think we will split the files into multiple folders based on the product category.
Looking for the best possible method to go for.
Thanks
August 13, 2012 at 9:03 am
sradez_DBA (8/13/2012)
Hey Phil, thanks for the reply.
Our project involves storing every detail of a product (for example). When users search for a product from the UI, they get this file, which has all the info. We are using an indexing strategy to retrieve the file quickly.
Now, I need to create a file for each product using ETL.
I think we will split the files into multiple folders based on the product category.
Looking for the best possible method to go for.
Thanks
Isn't this what databases were built for? 🙂
The absence of evidence is not evidence of absence
- Martin Rees
The absence of consumable DDL, sample data and desired results is, however, evidence of the absence of my response
- Phil Parkin
August 13, 2012 at 9:04 am
sradez_DBA (8/13/2012)
Hey Phil, thanks for the reply.
Our project involves storing every detail of a product (for example). When users search for a product from the UI, they get this file, which has all the info. We are using an indexing strategy to retrieve the file quickly.
Now, I need to create a file for each product using ETL.
I think we will split the files into multiple folders based on the product category.
Looking for the best possible method to go for.
Thanks
When the users search for the product from the UI, they're searching your database. So why not generate the "file which has all the info" as the output from this search?
For fast, accurate and documented assistance in answering your questions, please read this article.
Understanding and using APPLY, (I) and (II) Paul White
Hidden RBAR: Triangular Joins / The "Numbers" or "Tally" Table: What it is and how it replaces a loop Jeff Moden
August 13, 2012 at 9:19 am
The DB comes into the picture only for creating the files and maintaining the ETL log.
The rest is all done by the application, which I can't talk about. This is an interesting project; if you search for Microsoft Index Search you will find out more.
Anyway, my question is about the best ETL method for creating the files, one where I don't have to hit the database a million times.
August 13, 2012 at 9:22 am
They are not searching the database from the UI. The search is on the files.
August 13, 2012 at 9:30 am
sradez_DBA (8/13/2012)
They are not searching the database from the UI. The search is on the files.
Granted I don't know all the facts, but I really don't like the sound of what you are building here!
Your only obvious option is an RBAR solution. You may be able to inject some parallelism by creating multiple foreach loops to process different categories at the same time - I'd start with four and see how it performs.
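If you end up driving the export from code rather than from multiple Foreach Loop containers, the same partition-and-parallelise idea looks roughly like this. The category list and the ProcessCategory body are placeholders, not anything from the actual system:

using System;
using System.Threading.Tasks;

class ParallelExport
{
    // Placeholder: whatever generates the files for one category.
    static void ProcessCategory(string category)
    {
        Console.WriteLine("Exporting {0}...", category);
        // ... per-category file generation goes here ...
    }

    static void Main()
    {
        // Hypothetical list; in practice you'd pull the categories from the DB.
        string[] categories = { "Books", "Music", "Toys", "Garden" };

        // Four workers to start with, as suggested above; measure and tune.
        var options = new ParallelOptions { MaxDegreeOfParallelism = 4 };
        Parallel.ForEach(categories, options, ProcessCategory);
    }
}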
The absence of evidence is not evidence of absence
- Martin Rees
The absence of consumable DDL, sample data and desired results is, however, evidence of the absence of my response
- Phil Parkin
August 13, 2012 at 9:32 am
Sounds like a serious step backwards in data management technology.
Don't call the database for each row of data. Use a script object to write each row to a file.
You can do this inside a CLR proc in the database (you'd have to set it to unsafe, if I'm not mistaken), or you can do this in a script object in SSIS. VB.NET can write text files directly: http://msdn.microsoft.com/en-us/library/1t4kyezf(v=VS.80).aspx. Pretty sure C# can too.
Pull the dataset, use that for your For Each loop, either in SSIS or in a CLR proc that steps through the data. Write the data from the dataset directly, not using a Data Flow task. One call to the database (a really big call, which may take a while to return a dataset that's a million or so rows long), lots of loops through a datawriter script or DLL.
Of course, breaking it up into smaller database calls, one per row, will probably be faster, easier to write, easier to manage, easier to debug, and won't put anywhere near the strain on your server that a million-row dataset will. But if you don't want to do it that way, do it this way and you can make just one call.
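To make that concrete, here's a rough standalone C# sketch of the one-big-call version. The connection string and the table/column names (dbo.Products, ProductID, Category, ProductInfo) are invented for the example; in SSIS the equivalent would live inside a Script Task:

using System;
using System.Data.SqlClient;
using System.IO;

class ProductFileWriter
{
    static void Main()
    {
        // Invented for the example: point these at your own server and table.
        const string connStr = "Server=.;Database=MyDb;Integrated Security=SSPI;";
        const string outputRoot = @"C:\ProductFiles";

        using (var conn = new SqlConnection(connStr))
        using (var cmd = new SqlCommand(
            "SELECT ProductID, Category, ProductInfo FROM dbo.Products", conn))
        {
            conn.Open();

            // ExecuteReader streams rows one at a time, so the million-row
            // result set is never held in memory all at once.
            using (var reader = cmd.ExecuteReader())
            {
                while (reader.Read())
                {
                    int id = reader.GetInt32(0);
                    string category = reader.GetString(1);

                    // One subfolder per category, per the earlier suggestion,
                    // so no single directory accumulates a million files.
                    string dir = Path.Combine(outputRoot, category);
                    Directory.CreateDirectory(dir);

                    File.WriteAllText(
                        Path.Combine(dir, id + ".txt"),
                        reader.GetString(2));
                }
            }
        }
    }
}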
- Gus "GSquared", RSVP, OODA, MAP, NMVP, FAQ, SAT, SQL, DNA, RNA, UOI, IOU, AM, PM, AD, BC, BCE, USA, UN, CF, ROFL, LOL, ETC
Property of The Thread
"Nobody knows the age of the human race, but everyone agrees it's old enough to know better." - Anon
August 13, 2012 at 9:35 am
sradez_DBA (8/13/2012)
They are not searching the Database from the UI.The search is on the files.
It will be very interesting to see how this approach compares with the more conventional one, of searching a database and generating output from the database content. Superficially it's quite similar - interrogate an index for the address of a bit of data.
For fast, accurate and documented assistance in answering your questions, please read this article.
Understanding and using APPLY, (I) and (II) Paul White
Hidden RBAR: Triangular Joins / The "Numbers" or "Tally" Table: What it is and how it replaces a loop Jeff Moden
August 13, 2012 at 9:35 am
GSquared (8/13/2012)
Sounds like a serious step backwards in data management technology. Don't call the database for each row of data. Use a script object to write each row to a file.
You can do this inside a CLR proc in the database (you'd have to set it to unsafe, if I'm not mistaken), or you can do this in a script object in SSIS. VB.NET can write text files directly: http://msdn.microsoft.com/en-us/library/1t4kyezf(v=VS.80).aspx. Pretty sure C# can too.
Pull the dataset, use that for your For Each loop, either in SSIS or in a CLR proc that steps through the data. Write the data from the dataset directly, not using a Data Flow task. One call to the database (a really big call, which may take a while to return a dataset that's a million or so rows long), lots of loops through a datawriter script or DLL.
Of course, breaking it up into smaller database calls, one per row, will probably be faster, easier to write, easier to manage, easier to debug, and won't put anywhere near the strain on your server that a million-row dataset will. But if you don't want to do it that way, do it this way and you can make just one call.
+1, this is good advice.
The absence of evidence is not evidence of absence
- Martin Rees
The absence of consumable DDL, sample data and desired results is, however, evidence of the absence of my response
- Phil Parkin
August 13, 2012 at 9:47 am
Thank You All for sharing your thoughts.
As I said, the project is interesting. What makes it interesting is that there are 9 different companies who want to share their products; each has its own file system and DB, but they share a common UI.
August 13, 2012 at 10:02 am
As Phil says, I will have to try an SSIS-with-parallelism approach and do some testing to see how it performs.