February 10, 2009 at 3:05 pm
[font="Verdana"]Yeah, I get that. It's much the same issue (I think) that we face with loading bulk data in our data warehouse (which is running on SQL Server 2005).
Here's how we get around it.
1. bulk load the files into load tables
2. insert all of the parent records, using the OUTPUT clause of the insert to capture the generated identity values into a table variable
3. insert all of the child records, joining to the table variable from step 2 to get the identities.
You lose the reduction in logging, because the data gets inserted twice (the second insert into the real tables is fully logged). But the bulk inserts are still quicker.
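Something like this for steps 2 and 3 (table and column names invented for illustration, and it assumes the load tables carry some key you can join on):
[code]
-- step 2: insert the parent rows from the load table, capturing the
-- generated identity values with the OUTPUT clause
DECLARE @NewIDs TABLE (ParentID int PRIMARY KEY, BusinessKey varchar(50));

INSERT dbo.Parent (BusinessKey, SomeColumn)
OUTPUT inserted.ParentID, inserted.BusinessKey
INTO @NewIDs (ParentID, BusinessKey)
SELECT BusinessKey, SomeColumn
FROM dbo.ParentLoad;

-- step 3: insert the child rows, joining to the table variable for the identities
INSERT dbo.Child (ParentID, ChildColumn)
SELECT n.ParentID, c.ChildColumn
FROM dbo.ChildLoad AS c
JOIN @NewIDs AS n ON n.BusinessKey = c.BusinessKey;
[/code]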
If you have an alternate key available, you could bulk insert into the parent table, bulk insert the child records into a loading table, and then insert into the child table by selecting from the loading table joined back to the parent table to pick up the identity key values.
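For example (again with invented names, AltKey being the alternate key):
[code]
BULK INSERT dbo.Parent    FROM 'C:\load\parent.dat' WITH (TABLOCK);
BULK INSERT dbo.ChildLoad FROM 'C:\load\child.dat'  WITH (TABLOCK);

-- join back to the parent on the alternate key to pick up the identity
INSERT dbo.Child (ParentID, ChildColumn)
SELECT p.ParentID, c.ChildColumn
FROM dbo.ChildLoad AS c
JOIN dbo.Parent AS p ON p.AltKey = c.AltKey;
[/code]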
Another option:
1. generate the file for the parent table
2. lock the parent table for inserts
3. bulk insert the parent table data
4. get the last identity value loaded (note that scope_identity() isn't reliably populated by a bulk insert, so the sketch below reads the top value while the table is still locked). You can then backtrack from this to get the earlier identity values.
5. release the lock on the parent table
6. generate the file for the child table, including the correct identity values
7. bulk insert the child table data
This requires a smarter client process, and the locking of the parent table could be an issue, unless your application is the only thing that inserts into that table.
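Here's a rough sketch of steps 2 through 5 (paths and names are made up, and it assumes an identity increment of 1):
[code]
DECLARE @rows int, @last int, @first int;

BEGIN TRAN;
-- step 2: take an exclusive lock so nobody else can grab identity values
SELECT TOP (0) ParentID FROM dbo.Parent WITH (TABLOCKX, HOLDLOCK);

-- step 3: bulk insert the parent data
BULK INSERT dbo.Parent FROM 'C:\load\parent.dat';
SELECT @rows = @@ROWCOUNT;

-- step 4: last identity loaded, then backtrack to the first one
SELECT @last = MAX(ParentID) FROM dbo.Parent;
SET @first = @last - @rows + 1;

-- step 5: release the lock
COMMIT;
[/code]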
How do other people solve this issue?
[/font]
February 10, 2009 at 3:08 pm
Bruce, check out the OUTPUT clause of an INSERT. You can trap ALL the recently generated IDs from the mass INSERT, not just the last one, into a temp table.
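Something like this (names invented):
[code]
CREATE TABLE #NewIDs (ID int NOT NULL);

INSERT dbo.Parent (SampleDate, SamplePerson)
OUTPUT inserted.ParentID INTO #NewIDs (ID)
SELECT SampleDate, SamplePerson
FROM dbo.LoadTable;
[/code]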
__________________________________________________
Against stupidity the gods themselves contend in vain. -- Friedrich Schiller
Stop, children, what's that sound? Everybody look what's going down. -- Stephen Stills
February 10, 2009 at 3:23 pm
Bob Hovious (2/10/2009)
Bruce, check out the OUTPUT clause of an INSERT. You can trap ALL the recently generated IDs from the mass INSERT, not just the last one, into a temp table.
[font="Verdana"]Heh. See my earlier post about the output clause. 🙂
The issue arises where you don't have any way of linking the parent and child records any more. For example, say your data comes in like this:
Sample Date | Who Took Sample | Sample 1 | Sample 2 | Sample 3
And you want to store it as:
SampleHeader (SampleID, SampleDate, SamplePerson)
SampleDetail (SampleID, SampleNumber, SampleResult)
After you load up your list of sample headers, how do you then match the IDs you generated to the details? You need some way (usually a row number) of joining back, so you can match up the row numbers and pick up the corresponding IDs.
It's a non-trivial problem, and it's usually solved by doing one parent at a time: insert the parent, get the scope_identity(), insert the children for that parent, then move on to the next one. Of course, as we all know, that causes serious performance issues.
[/font]
February 10, 2009 at 3:56 pm
Bruce:
Yeah, I see the problem. You need a row number or some other unique identifier(s) to tie them together, and if you don't have that in your file prior to the DB load, you have to generate it by loading a temp table that is an image of the file's data, and take the logging penalty when you insert from there into the other tables.
My first thought is that you could generate a row number with the ROW_NUMBER() function in a CTE that reads your file with "FROM OPENROWSET(BULK". But unless you are certain there are no duplicate records in the input file, there's a chance you might get different row number assignments between the first and second runs.
Edited to add:
I just wanted to clarify that this presupposes you are going to store both the generated row number from the load table and the generated key from the parent table via your OUTPUT statement.
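Roughly this shape (paths and column names are placeholders):
[code]
WITH src AS (
    SELECT ROW_NUMBER() OVER (ORDER BY SampleDate, SamplePerson) AS rn,
           SampleDate, SamplePerson
    FROM OPENROWSET(BULK 'C:\load\samples.dat',
                    FORMATFILE = 'C:\load\samples.xml') AS f
)
SELECT rn, SampleDate, SamplePerson
INTO #Load
FROM src;
-- if the file can contain duplicates, this ORDER BY won't give
-- stable row numbers between runs (the caveat above)
[/code]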
Questions... do you have a unique identifier (even if it involves multiple columns) of any sort prior to the load? If so, are those values being stored in the parent table?
If so... couldn't you do this?
1. Bulk insert to parent table, storing unique identifiers and new ID value in your table variable or temp table.
2. Bulk insert ["FROM OPENROWSET(BULK" ] to the child table, joining to the table variable on the pre-existing unique identifiers to get the new ID.
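A sketch of those two steps, with BusinessKey standing in for whatever your pre-existing identifier is (all names invented):
[code]
-- step 1: load the parent, capturing business key -> new identity
DECLARE @Map TABLE (BusinessKey varchar(50) PRIMARY KEY, ParentID int);

INSERT dbo.SampleHeader (BusinessKey, SampleDate, SamplePerson)
OUTPUT inserted.BusinessKey, inserted.SampleID INTO @Map (BusinessKey, ParentID)
SELECT BusinessKey, SampleDate, SamplePerson
FROM OPENROWSET(BULK 'C:\load\parent.dat',
                FORMATFILE = 'C:\load\parent.xml') AS p;

-- step 2: load the child straight from its file, joining on the business key
INSERT dbo.SampleDetail (SampleID, SampleNumber, SampleResult)
SELECT m.ParentID, c.SampleNumber, c.SampleResult
FROM OPENROWSET(BULK 'C:\load\child.dat',
                FORMATFILE = 'C:\load\child.xml') AS c
JOIN @Map AS m ON m.BusinessKey = c.BusinessKey;
[/code]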
__________________________________________________
Against stupidity the gods themselves contend in vain. -- Friedrich Schiller
Stop, children, what's that sound? Everybody look what's going down. -- Stephen Stills
February 10, 2009 at 4:04 pm
[font="Verdana"]You also have to hope that SQL Server doesn't reorder the data...
I think that by using the OUTPUT clause of the insert, it should be possible to avoid having to lock the parent table. The application could read the generated IDs back in, match them with the row numbers, and write out the correct file to load the child records with the IDs already in place.
Here's another idea: create a stored procedure that takes all of the records as a chunk of XML and splits them out into the parent and child records. I'm not familiar enough with SQL Server's XML handling to say whether it would simplify the process, but I suspect it would.
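Something like this, maybe (the XML shape is invented, and it assumes the client numbers each Sample element itself, which neatly sidesteps the matching problem):
[code]
CREATE PROCEDURE dbo.LoadSamples
    @data xml
AS
BEGIN
    -- shred the parents
    SELECT  s.x.value('@rn',   'int')         AS rn,
            s.x.value('@date', 'datetime')    AS SampleDate,
            s.x.value('@who',  'varchar(50)') AS SamplePerson
    INTO    #ParentLoad
    FROM    @data.nodes('/Samples/Sample') AS s(x);

    -- shred the children, carrying the parent's row number with them
    SELECT  s.x.value('@rn', 'int')           AS rn,
            r.x.value('@n',  'int')           AS SampleNumber,
            r.x.value('.',   'varchar(50)')   AS SampleResult
    INTO    #ChildLoad
    FROM    @data.nodes('/Samples/Sample') AS s(x)
    CROSS APPLY s.x.nodes('Result') AS r(x);

    -- from here, the parent/child inserts proceed as in the earlier examples
END
[/code]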
[/font]
February 10, 2009 at 4:16 pm
I just finished editing an earlier post. Tell me what you think.
__________________________________________________
Against stupidity the gods themselves contend in vain. -- Friedrich Schiller
Stop, children, what's that sound? Everybody look what's going down. -- Stephen Stills
February 10, 2009 at 4:20 pm
[font="Verdana"]In some cases, you do have some sort of identifier (usually a business key of some sort). And yes, this makes it a lot easier.
In other cases, such as transaction dumps, you don't.
So it just depends.
I haven't looked into OPENROWSET in this fashion, so I can't comment. It will be worth a try though.
[/font]
February 10, 2009 at 4:24 pm
OPENROWSET lets you SELECT from a file as if it were a table. It requires a BCP format file to be associated with it, but essentially you are doing a query, and can join to existing tables or table variables.
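For example (paths invented):
[code]
SELECT f.*
FROM OPENROWSET(BULK 'C:\load\samples.dat',
                FORMATFILE = 'C:\load\samples.xml') AS f
JOIN dbo.SampleHeader AS h ON h.SamplePerson = f.SamplePerson;
[/code]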
It's interesting to find someone else who thought of table variables as a way of avoiding logging. Usually I hear only that it is much quicker to use temp tables for large volumes.
__________________________________________________
Against stupidity the gods themselves contend in vain. -- Friedrich Schiller
Stop, children, what's that sound? Everybody look what's going down. -- Stephen Stills
February 10, 2009 at 4:32 pm
Bob Hovious (2/10/2009)
It's interesting to find someone else who thought of table variables as a way of avoiding logging. Usually I hear only that it is much quicker to use temp tables for large volumes.
[font="Verdana"]It would be.
Actually, I was referring to avoiding having to double-load the data, by using a form of bulk insert for the final load to avoid the logging that the data insertion incurs. I wasn't thinking about the logging associated with the output. 🙂
In this particular case, all we are talking about is storing several thousand (row number, id) pairs. So a table variable could be fine. Useful within the context of a stored procedure, but completely useless for getting that data back to the application if it's needed there, so back to a temp table. 😀
I will have a play around with OPENROWSET. It may solve some fun issues. Does it always have to have a bcp format file?
Edited to add: from reading up, yes, a format file is necessary, unless you're using it to load the entire file as a single object. I also didn't know about the XML version of the format file.
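For reference, the XML flavour looks something like this (a two-column character file, column names invented):
[code]
<?xml version="1.0"?>
<BCPFORMAT xmlns="http://schemas.microsoft.com/sqlserver/2004/bulkload/format"
           xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
  <RECORD>
    <FIELD ID="1" xsi:type="CharTerm" TERMINATOR=","    MAX_LENGTH="50"/>
    <FIELD ID="2" xsi:type="CharTerm" TERMINATOR="\r\n" MAX_LENGTH="50"/>
  </RECORD>
  <ROW>
    <COLUMN SOURCE="1" NAME="SampleDate"   xsi:type="SQLVARYCHAR"/>
    <COLUMN SOURCE="2" NAME="SamplePerson" xsi:type="SQLVARYCHAR"/>
  </ROW>
</BCPFORMAT>
[/code]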
[/font]
February 11, 2009 at 10:12 am
I've always done it with bcp: build a table that looks like what I want, bcp it out, then create a format file. I'm not sure if there are other alternatives.
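For what it's worth, bcp will generate the format file for you from an existing table; the -x switch produces the XML flavour (server and table names invented):
[code]
bcp MyDb.dbo.SampleHeader format nul -c -x -f SampleHeader.xml -T -S MyServer
[/code]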
Let me know if you find out 😉
We've sort of hijacked this thread.
__________________________________________________
Against stupidity the gods themselves contend in vain. -- Friedrich Schiller
Stop, children, what's that sound? Everybody look what's going down. -- Stephen Stills