August 13, 2012 at 8:39 am
Hi All,
We have a requirement to create a file for each unique record in the database, and we have nearly a million records to process. This package will be scheduled and will re-create a file whenever any field in its record changes.
I could use a Foreach Loop Container with a Data Flow task to create a file for each record, but that hits the DB every time I pass a unique ID into the Data Flow task.
What is the best strategy to go for?
Please share your suggestions.
August 13, 2012 at 8:42 am
I think a For Each loop is what you'll need. Can't think of any other way to do it.
Might want to split them up into smaller chunks in multiple directories. A million files in one directory will create problems if you ever need to view the contents of that folder.
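Purely as an illustration, here's one cheap way to spread them out. The 1000-bucket scheme and the PathFor helper are invented for the example, not anything from the OP's system:

using System;
using System.IO;

class BucketDemo
{
    // Hypothetical helper: spread a million files across 1000 subfolders
    // (roughly 1000 files each) instead of one giant directory.
    static string PathFor(int productId, string root)
    {
        string bucket = (productId % 1000).ToString("D4");
        return Path.Combine(root, bucket, productId + ".txt");
    }

    static void Main()
    {
        Console.WriteLine(PathFor(123456, @"C:\ProductFiles"));
        // -> C:\ProductFiles\0456\123456.txt
    }
}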
- Gus "GSquared", RSVP, OODA, MAP, NMVP, FAQ, SAT, SQL, DNA, RNA, UOI, IOU, AM, PM, AD, BC, BCE, USA, UN, CF, ROFL, LOL, ETC
Property of The Thread
"Nobody knows the age of the human race, but everyone agrees it's old enough to know better." - Anon
August 13, 2012 at 8:48 am
This is quite an unusual requirement - do you mind expanding a little on the reasons behind it?
The absence of evidence is not evidence of absence
- Martin Rees
The absence of consumable DDL, sample data and desired results is, however, evidence of the absence of my response
- Phil Parkin
August 13, 2012 at 8:50 am
Phil Parkin (8/13/2012)
This is quite an unusual requirement - do you mind expanding a little on the reasons behind it?
I feel like maybe OP misunderstood the requirement? Maybe the request was "we need a file with all the records in table x"? Will let the OP expound further but I agree, this sounds bizarre.
August 13, 2012 at 9:00 am
Hey Phil,
Thanks for the reply.
Our project involves storing every detail of a product (for example). When users search for a product from the UI, they get this file, which has all the info. We are using an indexing strategy to retrieve the file quickly.
Now, I need to create a file for each product using ETL.
I think we will split the files into multiple folders based on the product category.
Looking for the best possible method to go for.
Thanks
August 13, 2012 at 9:03 am
sradez_DBA (8/13/2012)
Hey Phil, thanks for the reply.
Our project involves storing every detail of a product (for example). When users search for a product from the UI, they get this file, which has all the info. We are using an indexing strategy to retrieve the file quickly.
Now, I need to create a file for each product using ETL.
I think we will split the files into multiple folders based on the product category.
Looking for the best possible method to go for.
Thanks
Isn't this what databases were built for? 🙂
The absence of evidence is not evidence of absence
- Martin Rees
The absence of consumable DDL, sample data and desired results is, however, evidence of the absence of my response
- Phil Parkin
August 13, 2012 at 9:04 am
sradez_DBA (8/13/2012)
Hey Phil, thanks for the reply.
Our project involves storing every detail of a product (for example). When users search for a product from the UI, they get this file, which has all the info. We are using an indexing strategy to retrieve the file quickly.
Now, I need to create a file for each product using ETL.
I think we will split the files into multiple folders based on the product category.
Looking for the best possible method to go for.
Thanks
When the users search for the product from the UI, they're searching your database. So why not generate the "file which has all the info" as the output from this search?
For fast, accurate and documented assistance in answering your questions, please read this article.
Understanding and using APPLY, (I) and (II) Paul White
Hidden RBAR: Triangular Joins / The "Numbers" or "Tally" Table: What it is and how it replaces a loop Jeff Moden
August 13, 2012 at 9:19 am
The DB comes into the picture only for creating the files and maintaining the ETL log.
The rest is all done by the application, which I can't talk about. This is an interesting project; if you search for Microsoft Index Search you will find out more.
Anyway, my question is about the best ETL method for creating the files, one where I don't have to hit the database a million times.
August 13, 2012 at 9:22 am
They are not searching the database from the UI. The search is on the files.
August 13, 2012 at 9:30 am
sradez_DBA (8/13/2012)
They are not searching the database from the UI. The search is on the files.
Granted I don't know all the facts, but I really don't like the sound of what you are building here!
Your only obvious option is an RBAR solution. You may be able to inject some parallelism by creating multiple foreach loops to process different categories at the same time - I'd start with four and see how it performs.
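If you end up driving the export from code rather than from multiple Foreach Loop containers, the same partition-and-parallelise idea looks roughly like this. The category list and the ProcessCategory body are placeholders, not anything from the actual system:

using System;
using System.Threading.Tasks;

class ParallelExport
{
    // Placeholder: whatever generates the files for one category.
    static void ProcessCategory(string category)
    {
        Console.WriteLine("Exporting {0}...", category);
        // ... per-category file generation goes here ...
    }

    static void Main()
    {
        // Hypothetical list; in practice you'd pull the categories from the DB.
        string[] categories = { "Books", "Music", "Toys", "Garden" };

        // Four workers to start with, as suggested above; measure and tune.
        var options = new ParallelOptions { MaxDegreeOfParallelism = 4 };
        Parallel.ForEach(categories, options, ProcessCategory);
    }
}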
The absence of evidence is not evidence of absence
- Martin Rees
The absence of consumable DDL, sample data and desired results is, however, evidence of the absence of my response
- Phil Parkin
August 13, 2012 at 9:32 am
Sounds like a serious step backwards in data management technology.
Don't call the database for each row of data. Use a script object to write each row to a file.
You can do this inside a CLR proc in the database (you'd have to set it to unsafe, if I'm not mistaken), or you can do this in a script object in SSIS. VB.NET can write text files directly: http://msdn.microsoft.com/en-us/library/1t4kyezf(v=VS.80).aspx. Pretty sure C# can too.
Pull the dataset, use that for your For Each loop, either in SSIS or in a CLR proc that steps through the data. Write the data from the dataset directly, not using a Data Flow task. One call to the database (a really big call, which may take a while to return a dataset that's a million or so rows long), lots of loops through a datawriter script or DLL.
Of course, breaking it up into smaller database calls, one per row, will probably be faster, easier to write, easier to manage, easier to debug, and won't put anywhere near the strain on your server that a million-row dataset will. But if you don't want to do it that way, do it this way and you can make just one call.
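To make that concrete, here's a rough standalone C# sketch of the one-big-call version. The connection string and the table/column names (dbo.Products, ProductID, Category, ProductInfo) are invented for the example; in SSIS the equivalent would live inside a Script Task:

using System;
using System.Data.SqlClient;
using System.IO;

class ProductFileWriter
{
    static void Main()
    {
        // Invented for the example: point these at your own server and table.
        const string connStr = "Server=.;Database=MyDb;Integrated Security=SSPI;";
        const string outputRoot = @"C:\ProductFiles";

        using (var conn = new SqlConnection(connStr))
        using (var cmd = new SqlCommand(
            "SELECT ProductID, Category, ProductInfo FROM dbo.Products", conn))
        {
            conn.Open();

            // ExecuteReader streams rows one at a time, so the million-row
            // result set is never held in memory all at once.
            using (var reader = cmd.ExecuteReader())
            {
                while (reader.Read())
                {
                    int id = reader.GetInt32(0);
                    string category = reader.GetString(1);

                    // One subfolder per category, per the earlier suggestion,
                    // so no single directory accumulates a million files.
                    string dir = Path.Combine(outputRoot, category);
                    Directory.CreateDirectory(dir);

                    File.WriteAllText(
                        Path.Combine(dir, id + ".txt"),
                        reader.GetString(2));
                }
            }
        }
    }
}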
- Gus "GSquared", RSVP, OODA, MAP, NMVP, FAQ, SAT, SQL, DNA, RNA, UOI, IOU, AM, PM, AD, BC, BCE, USA, UN, CF, ROFL, LOL, ETC
Property of The Thread
"Nobody knows the age of the human race, but everyone agrees it's old enough to know better." - Anon
August 13, 2012 at 9:35 am
sradez_DBA (8/13/2012)
They are not searching the Database from the UI.The search is on the files.
It will be very interesting to see how this approach compares with the more conventional one, of searching a database and generating output from the database content. Superficially it's quite similar - interrogate an index for the address of a bit of data.
For fast, accurate and documented assistance in answering your questions, please read this article.
Understanding and using APPLY, (I) and (II) Paul White
Hidden RBAR: Triangular Joins / The "Numbers" or "Tally" Table: What it is and how it replaces a loop Jeff Moden
August 13, 2012 at 9:35 am
GSquared (8/13/2012)
Sounds like a serious step backwards in data management technology. Don't call the database for each row of data. Use a script object to write each row to a file.
You can do this inside a CLR proc in the database (you'd have to set it to unsafe, if I'm not mistaken), or you can do this in a script object in SSIS. VB.NET can write text files directly: http://msdn.microsoft.com/en-us/library/1t4kyezf(v=VS.80).aspx. Pretty sure C# can too.
Pull the dataset, use that for your For Each loop, either in SSIS or in a CLR proc that steps through the data. Write the data from the dataset directly, not using a Data Flow task. One call to the database (a really big call, which may take a while to return a dataset that's a million or so rows long), lots of loops through a datawriter script or DLL.
Of course, breaking it up into smaller database calls, one per row, will probably be faster, easier to write, easier to manage, easier to debug, and won't put anywhere near the strain on your server that a million-row dataset will. But if you don't want to do it that way, do it this way and you can make just one call.
+1, this is good advice.
The absence of evidence is not evidence of absence
- Martin Rees
The absence of consumable DDL, sample data and desired results is, however, evidence of the absence of my response
- Phil Parkin
August 13, 2012 at 9:47 am
Thank You All for sharing your thoughts.
As I said, the project is interesting. What makes it interesting is that there are 9 different companies who want to share their products; each has its own file system and DB, but they share a common UI.
August 13, 2012 at 10:02 am
As Phil says, I will have to try an SSIS-with-parallelism approach and do some testing to see how it performs.