August 23, 2016 at 7:26 am
Jeff Moden (8/22/2016)
Rowan-283474 (8/22/2016)
...the source is line of business type transactional records and there is no way to link up rows from my table back to the source table, thus the need for a fresh pull of data.
That may be part of the key to all of this. How are you loading all that data from the remote source into your table? Is it something like a CSV or tab-delimited file, or ???
Also, how big is the table in GB?
It gets loaded from another SQL Server server.
Size of the table (compressed) is about 15 GB.
August 23, 2016 at 7:32 am
Sergiy (8/22/2016)
Do you really have the case when the source table does not have any kind of time stamp or an auto-incremental ID?
Can you answer this?
_____________
Code for TallyGenerator
August 23, 2016 at 7:35 am
Sergiy (8/23/2016)
Sergiy (8/22/2016)
Do you really have the case when the source table does not have any kind of time stamp or an auto-incremental ID?
Can you answer this?
Yes unfortunately I have nothing like the above mentioned to work with
August 23, 2016 at 7:37 am
Rowan-283474 (8/23/2016)
Sergiy (8/23/2016)
Sergiy (8/22/2016)
Do you really have the case when the source table does not have any kind of time stamp or an auto-incremental ID?
Can you answer this?
Yes unfortunately I have nothing like the above mentioned to work with
I do actually have a timestamp that I can work with, but this doesn't help me when it comes to identifying rows that I need to update on my side without some form of an ID or key column.
August 23, 2016 at 9:37 am
Rowan-283474 (8/23/2016)
Jeff Moden (8/22/2016)
Rowan-283474 (8/22/2016)
...the source is line of business type transactional records and there is no way to link up rows from my table back to the source table, thus the need for a fresh pull of data.
That may be part of the key to all of this. How are you loading all that data from the remote source into your table? Is it something like a CSV or tab-delimited file, or ???
Also, how big is the table in GB?
It gets loaded from another SQL Server server.
Size of the table (compressed) is about 15 GB.
I figured that. 😉 What method are you using to do the load? Linked Server, SSIS, or ????
--Jeff Moden
Change is inevitable... Change for the better is not.
August 23, 2016 at 9:43 am
Jeff Moden (8/23/2016)
Rowan-283474 (8/23/2016)
Jeff Moden (8/22/2016)
Rowan-283474 (8/22/2016)
...the source is line of business type transactional records and there is no way to link up rows from my table back to the source table, thus the need for a fresh pull of data.
That may be part of the key to all of this. How are you loading all that data from the remote source into your table? Is it something like a CSV or tab-delimited file, or ???
Also, how big is the table in GB?
It gets loaded from another SQL Server server.
Size of the table (compressed) is about 15 GB.
I figured that. 😉 What method are you using to do the load? Linked Server, SSIS, or ????
I am using SSIS...
August 23, 2016 at 3:09 pm
This is one thought that I have:
Assuming there is a reliable PK field in the table:
Bring in the 70 million rows to a staging area locally on the destination server, index that staging table by the PK field, and finally perform a MERGE operation into the destination table by way of that PK field. If there are really not that many additions and updates (I am ignoring deletes, but maybe that factors in too), the statistics you have are likely still usable.
You could do all of this in SSIS. Of course, I would call a stored proc or have the MERGE statement directly in a T-SQL task. I don't promise this will be fast, but it might be done in the desired time frame and eliminate the need to rebuild indexes each time.
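A minimal T-SQL sketch of that staging-plus-MERGE pattern. All names here are hypothetical: dbo.LOB_Staging for the staging table, dbo.LOB for the destination, a PK column LOBKey, and a single SomeCol standing in for the payload columns.

```sql
-- All names hypothetical. Step 1: after SSIS bulk-loads the 70M rows into
-- the staging table, index it on the PK so the MERGE can seek instead of scan.
CREATE UNIQUE CLUSTERED INDEX IXC_LOB_Staging_LOBKey
    ON dbo.LOB_Staging (LOBKey);

-- Step 2: merge staging into the destination on that PK.
MERGE dbo.LOB AS tgt
USING dbo.LOB_Staging AS src
    ON tgt.LOBKey = src.LOBKey
WHEN MATCHED AND tgt.SomeCol <> src.SomeCol THEN   -- skip rows that didn't change
    UPDATE SET tgt.SomeCol = src.SomeCol
WHEN NOT MATCHED BY TARGET THEN                    -- new rows at the source
    INSERT (LOBKey, SomeCol) VALUES (src.LOBKey, src.SomeCol)
WHEN NOT MATCHED BY SOURCE THEN                    -- optional: handle source deletes
    DELETE;
```

The WHEN MATCHED predicate is what keeps the update from rewriting rows that haven't actually changed, which matters at 70 million rows.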
----------------------------------------------------
August 23, 2016 at 3:27 pm
Rowan-283474 (8/23/2016)
Rowan-283474 (8/23/2016)
Sergiy (8/23/2016)
Sergiy (8/22/2016)
Do you really have the case when the source table does not have any kind of time stamp or an auto-incremental ID?
Can you answer this?
Yes unfortunately I have nothing like the above mentioned to work with
I do actually have a timestamp that I can work with, but this doesn't help me when it comes to identifying rows that I need to update on my side without some form of an ID or key column.
If that time stamp indicates when a row was last affected by an insert/update, then you need to copy only those rows whose time stamp is later than the MAX(time stamp) in your reporting data set from yesterday.
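A rough sketch of that incremental pull, assuming a hypothetical LastModified column on the source and a local copy dbo.ReportTable; since the load runs through SSIS, the same WHERE filter would simply go into the SSIS source query.

```sql
-- All names hypothetical: LinkedSrc.SourceDB.dbo.SourceTable for the remote
-- table, dbo.ReportTable for the local reporting copy.
DECLARE @HighWater datetime;

-- Highest time stamp already present locally, i.e. yesterday's high-water mark.
SELECT @HighWater = MAX(LastModified)
FROM   dbo.ReportTable;

-- Pull only the rows stamped after that mark.
INSERT INTO dbo.ReportTable (Col1, Col2, LastModified)
SELECT Col1, Col2, LastModified
FROM   LinkedSrc.SourceDB.dbo.SourceTable
WHERE  LastModified > @HighWater;
```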
_____________
Code for TallyGenerator
August 23, 2016 at 3:50 pm
Sergiy (8/23/2016)
Rowan-283474 (8/23/2016)
Rowan-283474 (8/23/2016)
Sergiy (8/23/2016)
Sergiy (8/22/2016)
Do you really have the case when the source table does not have any kind of time stamp or an auto-incremental ID?
Can you answer this?
Yes unfortunately I have nothing like the above mentioned to work with
I do actually have a timestamp that I can work with, but this doesn't help me when it comes to identifying rows that I need to update on my side without some form of an ID or key column.
If that time stamp indicates when a row was last affected by an insert/update, then you need to copy only those rows whose time stamp is later than the MAX(time stamp) in your reporting data set from yesterday.
You can do that, and I have previously, but then you potentially end up with two versions of the 'same' row, so to speak. Because there is no way to link a row from yesterday with a row from 'today', you cannot merge today's updated data with yesterday's; you can only insert, and thus the potential for different versions of the 'same' row.
August 23, 2016 at 4:00 pm
Ok, there is another option.
Count records grouped by, say, a week.
And reload only the rows which belong to weeks where the counts are different.
If a record was updated, you'll possibly reload two weeks: the week the old version was in, and its current week.
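A rough illustration of the count comparison, bucketing on the same LastModified time stamp discussed above (table names hypothetical, as before); weeks whose counts disagree, or that exist on only one side, are the ones to delete locally and re-pull.

```sql
-- Per-week row counts on both sides, bucketed on the LastModified time stamp.
WITH src AS (
    SELECT DATEPART(year, LastModified) AS Yr, DATEPART(week, LastModified) AS Wk,
           COUNT(*) AS Cnt
    FROM   LinkedSrc.SourceDB.dbo.SourceTable
    GROUP BY DATEPART(year, LastModified), DATEPART(week, LastModified)
), loc AS (
    SELECT DATEPART(year, LastModified) AS Yr, DATEPART(week, LastModified) AS Wk,
           COUNT(*) AS Cnt
    FROM   dbo.ReportTable
    GROUP BY DATEPART(year, LastModified), DATEPART(week, LastModified)
)
SELECT COALESCE(s.Yr, l.Yr) AS Yr, COALESCE(s.Wk, l.Wk) AS Wk   -- weeks to reload
FROM      src AS s
FULL JOIN loc AS l ON l.Yr = s.Yr AND l.Wk = s.Wk
WHERE  s.Cnt IS NULL OR l.Cnt IS NULL OR s.Cnt <> l.Cnt;
```

Since an update moves a row's time stamp into the current week, both the week the old version sat in and the current week show a count difference, which is exactly what the FULL JOIN picks up.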
_____________
Code for TallyGenerator
August 23, 2016 at 4:19 pm
Sergiy (8/23/2016)
Ok, there is another option.
Count records grouped by, say, a week.
And reload only the rows which belong to weeks where the counts are different.
If a record was updated, you'll possibly reload two weeks: the week the old version was in, and its current week.
Cool, thanks, will look into this...
August 23, 2016 at 7:52 pm
Rowan-283474 (8/23/2016)
Sergiy (8/23/2016)
Rowan-283474 (8/23/2016)
Rowan-283474 (8/23/2016)
Sergiy (8/23/2016)
Sergiy (8/22/2016)
Do you really have the case when the source table does not have any kind of time stamp or an auto-incremental ID?
Can you answer this?
Yes unfortunately I have nothing like the above mentioned to work with
I do actually have a timestamp that I can work with, but this doesn't help me when it comes to identifying rows that I need to update on my side without some form of an ID or key column.
If that time stamp indicates when a row was last affected by an insert/update, then you need to copy only those rows whose time stamp is later than the MAX(time stamp) in your reporting data set from yesterday.
You can do that, and I have previously, but then you potentially end up with two versions of the 'same' row, so to speak. Because there is no way to link a row from yesterday with a row from 'today', you cannot merge today's updated data with yesterday's; you can only insert, and thus the potential for different versions of the 'same' row.
I'm sure that I'm preaching to the choir, but this all has a really rotten smell to it. The data is coming from an SQL Server, and I'm reasonably sure that they have some way to uniquely and consistently identify each row between the systems, even if they have to use several columns to do it (as in some form of manufactured PK based on the data, even if it comes from more than one table). They need to provide that information, and this will all become a cakewalk (I suggest using EXCEPT between the new data and the old to quickly identify changes and additions; deletes will be a simple lookup).
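A sketch of the EXCEPT comparison, with hypothetical names (dbo.LOB_Staging for the fresh snapshot, dbo.LOB for the current copy) and assuming both tables share the same column list:

```sql
-- Rows that are new or changed: present in the fresh snapshot but not
-- byte-for-byte identical to any row already held locally.
SELECT *
FROM   dbo.LOB_Staging
EXCEPT
SELECT *
FROM   dbo.LOB;

-- Deletes are the simple lookup: keys held locally with no match in the
-- fresh snapshot (once a consistent PK such as the hypothetical LOBKey exists).
SELECT old.LOBKey
FROM   dbo.LOB AS old
WHERE  NOT EXISTS (SELECT 1
                   FROM   dbo.LOB_Staging AS s
                   WHERE  s.LOBKey = old.LOBKey);
```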
Since you're using SSIS for this, are you saying that the company you work for owns that other server? If so, I'd bring guns to bear on the problem and have someone explain the problem they're causing and have someone fix it. If the server is owned by a 3rd party and there's a contract for this data, make them provide the data with some form of consistent PK so that you can do a decent merge without having to reload the whole snapshot.
Otherwise, you're pretty much stuck with what you have. You can seriously decrease the amount of time that data won't be available (to milliseconds) by doing the swap/rename that I previously suggested but you won't be able to decrease the amount of time the server has to work.
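The swap/rename itself can be as simple as a pair of renames inside a transaction (names hypothetical; the new table must already have its indexes and constraints in place), so readers are blocked only for the instant the renames hold their schema locks:

```sql
-- Load the fresh snapshot into dbo.LOB_New offline, then swap it in.
BEGIN TRAN;
    EXEC sp_rename 'dbo.LOB',     'LOB_Old';
    EXEC sp_rename 'dbo.LOB_New', 'LOB';
COMMIT;

DROP TABLE dbo.LOB_Old;   -- reclaim the space once the swap is committed
```

A common variant points a synonym at the new table or uses ALTER SCHEMA TRANSFER instead of sp_rename; any of them gets the visible outage down to milliseconds.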
Again, not telling you anything you don't know, but I have to say it out loud... unless they provide a consistent unique key in the data, you're pretty much stuck with things the way they are. You can't even bank on dates for a comparison right now.
And, sorry for the rant. I just hate it when people like "them" do things like this to people like "you". It's just not necessary and it's so very easy for them to fix it on their side.
--Jeff Moden
Change is inevitable... Change for the better is not.
August 23, 2016 at 8:31 pm
Jeff, just...
Please...
Don't go to "their" campus with a shotgun.
Please.
You're not gonna have enough bullets for "them" all anyway.
🙂
_____________
Code for TallyGenerator
August 24, 2016 at 12:37 am
Jeff Moden (8/23/2016)
Rowan-283474 (8/23/2016)
Sergiy (8/23/2016)
Rowan-283474 (8/23/2016)
Rowan-283474 (8/23/2016)
Sergiy (8/23/2016)
Sergiy (8/22/2016)
Do you really have the case when the source table does not have any kind of time stamp or an auto-incremental ID?
Can you answer this?
Yes unfortunately I have nothing like the above mentioned to work with
I do actually have a timestamp that I can work with, but this doesn't help me when it comes to identifying rows that I need to update on my side without some form of an ID or key column.
If that time stamp indicates when a row was last affected by an insert/update, then you need to copy only those rows whose time stamp is later than the MAX(time stamp) in your reporting data set from yesterday.
You can do that, and I have previously, but then you potentially end up with two versions of the 'same' row, so to speak. Because there is no way to link a row from yesterday with a row from 'today', you cannot merge today's updated data with yesterday's; you can only insert, and thus the potential for different versions of the 'same' row.
I'm sure that I'm preaching to the choir, but this all has a really rotten smell to it. The data is coming from an SQL Server, and I'm reasonably sure that they have some way to uniquely and consistently identify each row between the systems, even if they have to use several columns to do it (as in some form of manufactured PK based on the data, even if it comes from more than one table). They need to provide that information, and this will all become a cakewalk (I suggest using EXCEPT between the new data and the old to quickly identify changes and additions; deletes will be a simple lookup).
Since you're using SSIS for this, are you saying that the company you work for owns that other server? If so, I'd bring guns to bear on the problem and have someone explain the problem they're causing and have someone fix it. If the server is owned by a 3rd party and there's a contract for this data, make them provide the data with some form of consistent PK so that you can do a decent merge without having to reload the whole snapshot.
Otherwise, you're pretty much stuck with what you have. You can seriously decrease the amount of time that data won't be available (to milliseconds) by doing the swap/rename that I previously suggested but you won't be able to decrease the amount of time the server has to work.
Again, not telling you anything you don't know, but I have to say it out loud... unless they provide a consistent unique key in the data, you're pretty much stuck with things the way they are. You can't even bank on dates for a comparison right now.
And, sorry for the rant. I just hate it when people like "them" do things like this to people like "you". It's just not necessary and it's so very easy for them to fix it on their side.
Hi Jeff
For sure, this is all a little bit rotten, no doubt.
I have managed to identify a few fields that together make a unique 'key', so I will get some form of confirmation as to whether these fields can be used in this way.
Thank you very much for the input.
Regards