March 28, 2016 at 11:14 am
Short intro to the question. All of our internal IDs are Guids, with predictable results. We’re trying to optimize the system to be able to handle bulk inserts without having to refactor the entire middle tier. The area of the system I’m looking at has two primary Guid IDs, ParticipantID and FeedID, and every table in this area has these two columns. If they are not the PK themselves, they are always part of the natural key. All access to tables by our middle tier is mediated through views. Another thing of note is that all of our queries always return every column in the table. I can explain why that is if you’re interested.
I’m doing a POC with a couple tables that changes some things around In order to increase performance on batch inserts. I’m switching these table to use a monotonically increasing bigint InstanceID that is the combination of the two IDs above. We already have a table, lets call it ParticipantInstance, that stored only ParticipantID and FeedID. I added the InstanceID to that table, clustered on it, and then added a non clustered index on ParticipantID, FeedID. That way I only pay the Guid insert penalty once. I’m hiding the change from the business layer by refactoring the views for the tables I'm working on to join to ParticipantInstance on InstanceID and pull the ParticipantID and FeedID columns from there. I’m seeing very little performance difference on the get. For any tables where it is appropriate, I made InstanceID the clustered PK, and I’ve seen tremendous improvements on the inserts. Down from about 5 minutes for big feeds to about 7-9 seconds.
My question surrounds two tables that can’t use InstanceID as the PK that are in a parent/child relationship. They also both have over 1 billion records (and again, queries will always have to return all fields). The parent table uses a surrogate key (ModelID), and the child table’s pk is ModelID, ExternalID. This is the one place that I’m thinking that we convert the guid to a bigint and refactor these two objects in the middle tier. Queries to the tables will always be through the view and filtered by ParticipantID and/or FeedID. My thinking is that we make ModelID (bigint) on the parent a non-clustered PK, and make a unique clustered key on InstanceID, ModelID since InstanceID alone is not unique. I think (but maybe I’m wrong) the penalty of making SQL uniqueify the rows in the clustered index would be worse than just adding the PK to the clustered index since there will just about always be duplicates. I also think that we want InstanceID as the first column on the clustered index since queries will always need to join by that column and return all columns. In the child table, keep the PK non-clustered as well and cluster on InstanceID, ModelID, and ExternalID for the same reason. That way the clustered indexes are unique and the views can use them for the join without an RID lookup penalty. Does that seem like a good compromise between ideal and the facts on the ground?
///Edit
To the point below about it being hard to visualize, I've included some scripts to set up the relevant tables in the existing and proposed formats.
/* Existing Setup */
CREATE TABLE ParticipantInstance(
ParticipantID UNIQUEIDENTIFIER NOT NULL,
FeedID UNIQUEIDENTIFIER NOT NULL,
CONSTRAINT pkParticipantInstance
PRIMARY KEY CLUSTERED(
ParticipantID,
FeedID
)
);
CREATE INDEX xnParticipantInstance_FeedID_ParticipantID
ON ParticipantInstance(
FeedID,
ParticipantID
);
CREATE TABLE Model(
ModelID UNIQUEIDENTIFIER NOT NULL,
ParticipantID UNIQUEIDENTIFIER NOT NULL,
FeedID UNIQUEIDENTIFIER NOT NULL,
--[Other fields],
CONSTRAINT pkModel
PRIMARY KEY NONCLUSTERED(
ModelID
)
/* TABLE has one nc index*/
);
CREATE TABLE Child(
ModelID UNIQUEIDENTIFIER NOT NULL,
ExternalID VARCHAR(20) NOT NULL,
ParticipantID UNIQUEIDENTIFIER NOT NULL,
FeedID UNIQUEIDENTIFIER NOT NULL,
--[Other fields],
CONSTRAINT pkChild
PRIMARY KEY CLUSTERED(
ModelID,
ExternalID
)
/* TABLE has one nc index*/
);
/*
DROP TABLE ParticipantInstance;
DROP TABLE Model;
DROP TABLE Child;
*/
/* Proposed setup */
CREATE TABLE ParticipantInstance(
InstanceID BIGINT NOT NULL,
ParticipantID UNIQUEIDENTIFIER NOT NULL,
FeedID UNIQUEIDENTIFIER NOT NULL,
CONSTRAINT pkParticipantInstance
PRIMARY KEY CLUSTERED(
InstanceID
)
);
CREATE NONCLUSTERED INDEX xnParticipantInstance_ParticipantID_FeedID
ON ParticipantInstance(
ParticipantID,
FeedID
);
CREATE NONCLUSTERED INDEX xnParticipantInstance_FeedID_ParticipantID
ON ParticipantInstance(
FeedID,
ParticipantID
);
CREATE TABLE Model(
ModelID BIGINT NOT NULL,
InstanceID BIGINT NOT NULL
--[Other fields],
CONSTRAINT pkModel
PRIMARY KEY NONCLUSTERED(
ModelID
)
);
CREATE CLUSTERED INDEX xcModel
ON Model(
InstanceID,
ModelID
);
CREATE TABLE Child(
ModelID BIGINT NOT NULL,
ExternalID VARCHAR(20) NOT NULL,
InstanceID BIGINT NOT NULL,
--[Other fields],
CONSTRAINT pkChild
PRIMARY KEY NONCLUSTERED(
ModelID,
ExternalID
)
);
CREATE CLUSTERED INDEX xcChild
ON Child(
InstanceID,
ModelID,
ExternalID
)
/*
DROP TABLE ParticipantInstance;
DROP TABLE Model;
DROP TABLE Child;
*/
March 28, 2016 at 12:01 pm
That is a tough setup to visualize. I feel your pain with guids, and trying to work with indexes. I have found from past experience that adding an Bigint into a table with a cluster will be faster than clustering or indexing a guid. You just can't get performance out of guids. Adding the bigInt will create a table scan in your setup it sounds. It will probably be faster than the guid seek and smaller index on disc. I would say go for the compromise, but your band aid will not last without rethinking the guid/int model.
I wish I would have removed indexes on guids a long time ago. I would have shown much earlier that the model does not work.
Viewing 2 posts - 1 through 1 (of 1 total)
You must be logged in to reply to this topic. Login to reply