Seeking input on Best Method to Join on First or Last Row in a 1:M Join

Question

Seeking input on Best Method to Join on First or Last Row in a 1:M Join

SSC-Insane

Points: 21661

June 18, 2021 at 3:29 pm

I have included the DDL to re-create the 2 tables in this examples. I can't post the actual DDL so these are abbreviated versions of the 2 tables I'm working with.

The relationship between PERSON and PERSON_HISTORY is 1:M. What I seek is the most effective/best method for joining these 2 where I get only the First or Lastrow of data from PERSON_HISTORY. Normally the My (Primary key) column would be the method for determining the first (lowest value for the key) row or the last (highest value for the key) however its possible due to user error that the First or last row in this case would not be based on the value in the My column but by the newest or oldest date/time value in OccurredDate.

I know the below query will work but only if First and Last can be derived by lowest/highest value in the key column My.

Q1: Assuming that the highest value for My would be the latest row and the lowest value for My is the first row is there a better/cleaner way of doing what the below does which is to show the data from PERSON along with the most recently created row of data in PERSON_HISTORY?

Q2: How would that answer differ if Min(DateOcurred) and Max(DateOccurred) would be the way to derive the first and last row of data from PERSON_HISTORY?

Q3: Is there a commonly used term or name to refer to this, a term/phrase used within-sql devs that describes this?

/*Return most recent row of data from PERSON_HISTORY for each row in PERSON:*/
SELECT P.Code, LastName, P.Firstname,
       PH.*

FROM PERSON P LEFT OUTER JOIN PERSON_HISTORY PH ON P.MyPerson = PH.hMyPerson AND PH.My = (SELECT Max(PHX.hMy) FROM PERSON_HISTORY PHX WHERE PH.hMyPerson = PHX.hMyPerson)

WHERE 1 = 1 

ORDER BY P.Code, PH.My

DDL:

CREATE TABLE dbo.PERSON( MyPerson    NUMERIC(18, 0) NOT NULL,
                         Code        VARCHAR(8)  NOT NULL,
                         LastName    VARCHAR(256)  NULL,
                         FirstName   VARCHAR(256)  NULL,
                         Addr1       VARCHAR(50)  NULL,
                         Addr2       VARCHAR(50)  NULL,
                         City        VARCHAR(40)  NULL,
                         State       VARCHAR(5)  NULL,
                         Zipcode     VARCHAR(30)  NULL
                       ) ON [PRIMARY]
GO

CREATE TABLE dbo.PERSON_HISTORY( My          NUMERIC (18, 0) NOT NULL IDENTITY(1, 1),
                                 hMyPerson    NUMERIC (18, 0) NULL,
                                 Event        VARCHAR (75)  NULL,
                                 OccurredDate DATETIME NULL,
                                 FromDate     DATETIME NULL,
                                 ToDate       DATETIME NULL
                               ) ON [PRIMARY] 
GO

Kindest Regards,

Just say No to Facebook!

Viewing 15 posts - 1 through 15 (of 27 total)

You must be logged in to reply to this topic. Login to reply

ScottPletcher SSC Guru Points: 101096 More actions · Answer 1

First, the history table should be clustered leading on OccurredDate. That's generally true for log tables. [Forget the stupid myth that "by default, every table should be clustered on IDENTITY". ] This particular log table is not standard, but I'd still use the same clus key approach. OccurredDate should also be datetime2 to reduce key collisions. Unfortunately, even then I guess you'll still need an IDENTITY as a guarantee of a unique key. For efficiency, make that ident a bigint, not numeric.

You'll also need a nonclus index to directly support the "latest row for each person" lookup.

CREATE UNIQUE CLUSTERED INDEX dbo.PERSON_HISTORY__CL 
    ON dbo.PERSON_HISTORY ( OccurredDate, My ) 
    WITH ( DATA_COMPRESSION = NONE /*or PAGE, as you deem fit*/, FILLFACTOR = 99/*or whatever value you want, just be sure to explicitly specify it*/ ) ON [<filegroup_name>]

CREATE UNIQUE NONCLUSTERED INDEX dbo.PERSON_HISTORY__IX_MyPerson 
    ON dbo.PERSON_HISTORY ( hMyPerson, OccurredDate, My ) 
    WITH ( DATA_COMPRESSION = ROW, FILLFACTOR = 95/*or whatever value you want, just be sure to explicitly specify it*/ ) ON [<filegroup_name>]

SELECT P.Code, LastName, P.Firstname,
       PH.*

FROM PERSON P OUTER APPLY (
       SELECT TOP (1) *
       FROM PERSON_HISTORY PH 
       WHERE P.MyPerson = PH.hMyPerson
       ORDER BY OccurredDate DESC, My DESC
       ) AS PH

WHERE 1 = 1 

ORDER BY P.Code, PH.My

SQL DBA,SQL Server MVP(07, 08, 09) "It's a dog-eat-dog world, and I'm wearing Milk-Bone underwear." "Norm", on "Cheers". Also from "Cheers", from "Carla": "You need to know 3 things about Tortelli men: Tortelli men draw women like flies; Tortelli men treat women like flies; Tortelli men's brains are in their flies".

Mr. Brian Gale SSC-Insane Points: 24800 More actions · Answer 2

I see a few optimizations that can probably take place.

First, that WHERE 1=1 isn't needed as it isn't doing anything. Not sure it is a performance improvement by removing it, but it is wasted code and uses up some space in the plan cache, so might as well remove it.

Next, indexes will help. Looking at your DDL, I don't see any indexes, so adding those would help.

Now about "cleaner" ways to write the code, I find that code formatters help a lot with that. A quick 2 button press thing turns your code into this if you use RedGate SQL Prompt (note, I do not work for RedGate, I just like that tool. There are others):

/*Return most recent row of data from PERSON_HISTORY for each row in PERSON:*/
SELECT
[P].[Code]
, [LastName]
, [P].[Firstname]
, [PH].*
 FROM[PERSON] AS [P]
 LEFT OUTER JOIN[PERSON_HISTORY] AS [PH]
 ON [P].[MyPerson] = [PH].[hMyPerson]
AND [PH].[My]  =
(
SELECT
MAX([PHX].[hMy])
 FROM[PERSON_HISTORY] AS [PHX]
 WHERE[PH].[hMyPerson] = [PHX].[hMyPerson]
)
 WHERE1 = 1
 ORDER BY[P].[Code]
, [PH].[My];

My opinion, this is much easier to read. On top of that, I dislike having a * in my "ready for production" code. During development or for one-off queries, I may have a * in the code, but outside of those 2 use cases, I try not to have a *.

As for "better" code, it depends on what you define as better. Faster execution times? Maybe, but I would be looking at indexes to help with that first in this scenario as your code looks pretty simple as is. There does appear to be a typo in the nested select though as I don't think a column named "hMy" exists, but that should be caught when you go to execute the code. Now if you mean better as in "easier to test" or "easier to debug", I think that the code is pretty easy to follow what it happening as is.

The best way to work on making the code "faster" is to check the execution plan. It may be that tweaks you do to make the code perform better don't actually help performance due to how the indexes are. Or it may make performance worse.

Now for question number 2 - I would have the same approach. Indexes and execution plans. See where things are slow and address that if necessary. If the query performance is acceptable by end users and they are OK with the performance degrading over time (as data volume increases, the performance will degrade as it appears you have no indexes so you are scanning the whole table with each run), then I would spend more time on queries that are known slow and end users are complaining about performance.

As for question 3 - I am not aware of any specific term for getting the first or last row from a 1:M table.

One thing you could do to improve performance of the SELECT would be to add a column to the Person_History table of a BIT datatype with a name like "LatestEntry". Then when you do the insert on PERSON_HISTORY, the stored procedure would set "LatestEntry" to 0 for the hMyPerson value and the actual INSERT would set it to 1. Then when you do your JOIN from PERSON to PERSON_HISTORY, if you care about the latest value, you just grab where LatestEntry = 1. You could do the same thing with FirstEntry except that on the INSERT, it would look if the hMyPerson exists in the table. If it does, then the INSET will do FirstEntry = 0, otherwise FirstEntery = 1. This would be a bit of work to get into place for historical data, but would improve the SELECT performance with only a mild hit to INSERT performance. Next, you could add an index on those columns and your select performance would be pretty darn quick and your INSERT performance would only be mildly slower. BUT if you are doing billions of inserts per day, that mildly slower MAY result in end users being grumpy.

The above is all just my opinion on what you should do.
As with all advice you find on a random internet forum - you shouldn't blindly follow it. Always test on a test server to see if there is negative side effects before making changes to live!
I recommend you NEVER run "random code" you found online on any system you care about UNLESS you understand and can verify the code OR you don't care if the code trashes your system.

Jeff Moden SSC Guru Points: 1004324 More actions · Answer 3

Just thinking out of a different box here...

Questions:

Are you always going to lookup all rows in the PERSON table every time?
Is it correct or not that every person in the PERSONHISTORY table will ALLWAYS have a related row in the PERSON table?
If a person has both a MIN and a MAX FromDate, which do you want to return or do you want to return both rows?
If a person only has one row in the PERSONHISTORY table, should it be listed as the MIN or the MAX row or do you want it to show up as both if the answer to question 3 is also "both"?
Let's stop guessing... what is/are the DRI Constraint(s) that currently exist on the two tables?

As for the DATEOCCURRED, I wouldn't use it all. It's like reporting that scientists discovered another asteroid event that occurred after the "big one" and the discovery was made in 2021. The later fact is info only and has nothing to do with the asteroid itself. Similarly, only either the FromDate or the ToDate matters. AND, since they're both nullable, I have to ask a few more questions.

6. Is the FromDate ever NULL or some artificial value that's supposed to be treated the same as a NULL?

7. Same question for the ToDate.

8. How often will this report be executed?

9. Will the combination of the hMyPerson and the FromDate be unique in the history table?

--Jeff Moden

RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
First step towards the paradigm shift of writing Set Based code:
________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

Change is inevitable... Change for the better is not.

Helpful Links:
How to post code problems
How to Post Performance Problems
Create a Tally Function (fnTally)

Jonathan AC Roberts SSCoach Points: 18354 More actions · Answer 4

ScottPletcher wrote:

First, the history table should be clustered leading on OccurredDate. That's generally true for log tables. [Forget the stupid myth that "by default, every table should be clustered on IDENTITY". ] This particular log table is not standard, but I'd still use the same clus key approach. OccurredDate should also be datetime2 to reduce key collisions. Unfortunately, even then I guess you'll still need an IDENTITY as a guarantee of a unique key. For efficiency, make that ident a bigint, not numeric.
You'll also need a nonclus index to directly support the "latest row for each person" lookup.
CREATE UNIQUE CLUSTERED INDEX dbo.PERSON_HISTORY__CL 
    ON dbo.PERSON_HISTORY ( OccurredDate, My ) 
    WITH ( DATA_COMPRESSION = NONE /*or PAGE, as you deem fit*/, FILLFACTOR = 99/*or whatever value you want, just be sure to explicitly specify it*/ ) ON [<filegroup_name>]

CREATE UNIQUE NONCLUSTERED INDEX dbo.PERSON_HISTORY__IX_MyPerson 
    ON dbo.PERSON_HISTORY ( hMyPerson, OccurredDate, My ) 
    WITH ( DATA_COMPRESSION = ROW, FILLFACTOR = 95/*or whatever value you want, just be sure to explicitly specify it*/ ) ON [<filegroup_name>]

SELECT P.Code, LastName, P.Firstname,
       PH.*

FROM PERSON P OUTER APPLY (
       SELECT TOP (1) *
       FROM PERSON_HISTORY PH 
       WHERE P.MyPerson = PH.hMyPerson
       ORDER BY OccurredDate DESC, My DESC
       ) AS PH

WHERE 1 = 1 

ORDER BY P.Code, PH.My

The indexes:

CREATE UNIQUE CLUSTERED INDEX tmpPERSON_HISTORY__CL 
    ON dbo.tmpPERSON_HISTORY (My, OccurredDate) 
    WITH ( DATA_COMPRESSION = NONE /*or PAGE, as you deem fit*/, FILLFACTOR = 99/*or whatever value you want, just be sure to explicitly specify it*/ )

CREATE UNIQUE NONCLUSTERED INDEX tmpPERSON_HISTORY__IX_MyPerson 
    ON dbo.tmpPERSON_HISTORY (hMyPerson, OccurredDate, My ) INCLUDE (Event,FromDate,ToDate)
 WITH ( DATA_COMPRESSION = ROW, FILLFACTOR = 95/*or whatever value you want, just be sure to explicitly specify it*/ )

Index tmpPERSON_HISTORY__IX_MyPerson should be the clustered index so it an access all the other columns it needs in the query, either that or keep is as noclustered but include all the other columns in the table too. i.e:

CREATE UNIQUE CLUSTERED INDEX tmpPERSON_HISTORY__IX_MyPerson 
    ON dbo.tmpPERSON_HISTORY (hMyPerson, OccurredDate, My ) 
 WITH ( DATA_COMPRESSION = ROW, FILLFACTOR = 95/*or whatever value you want, just be sure to explicitly specify it*/ ) 

CREATE UNIQUE NONCLUSTERED INDEX tmpPERSON_HISTORY__CL 
    ON dbo.tmpPERSON_HISTORY (My, OccurredDate) 
    WITH ( DATA_COMPRESSION = NONE /*or PAGE, as you deem fit*/, FILLFACTOR = 99/*or whatever value you want, just be sure to explicitly specify it*/ )

ScottPletcher SSC Guru Points: 101096 More actions · Answer 5

Jonathan AC Roberts wrote:

ScottPletcher wrote:

First, the history table should be clustered leading on OccurredDate. That's generally true for log tables. [Forget the stupid myth that "by default, every table should be clustered on IDENTITY". ] This particular log table is not standard, but I'd still use the same clus key approach. OccurredDate should also be datetime2 to reduce key collisions. Unfortunately, even then I guess you'll still need an IDENTITY as a guarantee of a unique key. For efficiency, make that ident a bigint, not numeric.
You'll also need a nonclus index to directly support the "latest row for each person" lookup.
CREATE UNIQUE CLUSTERED INDEX dbo.PERSON_HISTORY__CL 
    ON dbo.PERSON_HISTORY ( OccurredDate, My ) 
    WITH ( DATA_COMPRESSION = NONE /*or PAGE, as you deem fit*/, FILLFACTOR = 99/*or whatever value you want, just be sure to explicitly specify it*/ ) ON [<filegroup_name>]

CREATE UNIQUE NONCLUSTERED INDEX dbo.PERSON_HISTORY__IX_MyPerson 
    ON dbo.PERSON_HISTORY ( hMyPerson, OccurredDate, My ) 
    WITH ( DATA_COMPRESSION = ROW, FILLFACTOR = 95/*or whatever value you want, just be sure to explicitly specify it*/ ) ON [<filegroup_name>]

SELECT P.Code, LastName, P.Firstname,
       PH.*

FROM PERSON P OUTER APPLY (
       SELECT TOP (1) *
       FROM PERSON_HISTORY PH 
       WHERE P.MyPerson = PH.hMyPerson
       ORDER BY OccurredDate DESC, My DESC
       ) AS PH

WHERE 1 = 1 

ORDER BY P.Code, PH.My

The indexes:

CREATE UNIQUE CLUSTERED INDEX tmpPERSON_HISTORY__CL 
    ON dbo.tmpPERSON_HISTORY (My, OccurredDate) 
    WITH ( DATA_COMPRESSION = NONE /*or PAGE, as you deem fit*/, FILLFACTOR = 99/*or whatever value you want, just be sure to explicitly specify it*/ )

CREATE UNIQUE NONCLUSTERED INDEX tmpPERSON_HISTORY__IX_MyPerson 
    ON dbo.tmpPERSON_HISTORY (hMyPerson, OccurredDate, My ) INCLUDE (Event,FromDate,ToDate)
 WITH ( DATA_COMPRESSION = ROW, FILLFACTOR = 95/*or whatever value you want, just be sure to explicitly specify it*/ )

Index tmpPERSON_HISTORY__IX_MyPerson should be the clustered index so it an access all the other columns it needs in the query, either that or keep is as noclustered but include all the other columns in the table too. i.e:

CREATE UNIQUE CLUSTERED INDEX tmpPERSON_HISTORY__IX_MyPerson 
    ON dbo.tmpPERSON_HISTORY (hMyPerson, OccurredDate, My ) 
 WITH ( DATA_COMPRESSION = ROW, FILLFACTOR = 95/*or whatever value you want, just be sure to explicitly specify it*/ ) 

CREATE UNIQUE NONCLUSTERED INDEX tmpPERSON_HISTORY__CL 
    ON dbo.tmpPERSON_HISTORY (My, OccurredDate) 
    WITH ( DATA_COMPRESSION = NONE /*or PAGE, as you deem fit*/, FILLFACTOR = 99/*or whatever value you want, just be sure to explicitly specify it*/ )

The problem with that being the clus index is that it will severely fragment the table. Clustering first on $IDENTITY is, yes, sequential and thus less fragmenting, but is useless for satisfying the query given.

OccurredDate is also sequential, and thus also with less fragmentation, but also directly matches the query conditions. Thus, it's the best clustered index lead columns. My ($IDENTITY) is added just to absolutely ensure a unique key since SQL gains efficiency from having an inherently unique key value rather than SQL having to add a "uniquifier".

SQL DBA,SQL Server MVP(07, 08, 09) "It's a dog-eat-dog world, and I'm wearing Milk-Bone underwear." "Norm", on "Cheers". Also from "Cheers", from "Carla": "You need to know 3 things about Tortelli men: Tortelli men draw women like flies; Tortelli men treat women like flies; Tortelli men's brains are in their flies".

Jonathan AC Roberts SSCoach Points: 18354 More actions · Answer 6

ScottPletcher wrote:

Jonathan AC Roberts wrote:
ScottPletcher wrote:
First, the history table should be clustered leading on OccurredDate. That's generally true for log tables. [Forget the stupid myth that "by default, every table should be clustered on IDENTITY". ] This particular log table is not standard, but I'd still use the same clus key approach. OccurredDate should also be datetime2 to reduce key collisions. Unfortunately, even then I guess you'll still need an IDENTITY as a guarantee of a unique key. For efficiency, make that ident a bigint, not numeric.
You'll also need a nonclus index to directly support the "latest row for each person" lookup.
CREATE UNIQUE CLUSTERED INDEX dbo.PERSON_HISTORY__CL 
    ON dbo.PERSON_HISTORY ( OccurredDate, My ) 
    WITH ( DATA_COMPRESSION = NONE /*or PAGE, as you deem fit*/, FILLFACTOR = 99/*or whatever value you want, just be sure to explicitly specify it*/ ) ON [<filegroup_name>]

CREATE UNIQUE NONCLUSTERED INDEX dbo.PERSON_HISTORY__IX_MyPerson 
    ON dbo.PERSON_HISTORY ( hMyPerson, OccurredDate, My ) 
    WITH ( DATA_COMPRESSION = ROW, FILLFACTOR = 95/*or whatever value you want, just be sure to explicitly specify it*/ ) ON [<filegroup_name>]

SELECT P.Code, LastName, P.Firstname,
       PH.*

FROM PERSON P OUTER APPLY (
       SELECT TOP (1) *
       FROM PERSON_HISTORY PH 
       WHERE P.MyPerson = PH.hMyPerson
       ORDER BY OccurredDate DESC, My DESC
       ) AS PH

WHERE 1 = 1 

ORDER BY P.Code, PH.My
The indexes:
CREATE UNIQUE CLUSTERED INDEX tmpPERSON_HISTORY__CL 
    ON dbo.tmpPERSON_HISTORY (My, OccurredDate) 
    WITH ( DATA_COMPRESSION = NONE /*or PAGE, as you deem fit*/, FILLFACTOR = 99/*or whatever value you want, just be sure to explicitly specify it*/ )

CREATE UNIQUE NONCLUSTERED INDEX tmpPERSON_HISTORY__IX_MyPerson 
    ON dbo.tmpPERSON_HISTORY (hMyPerson, OccurredDate, My ) INCLUDE (Event,FromDate,ToDate)
 WITH ( DATA_COMPRESSION = ROW, FILLFACTOR = 95/*or whatever value you want, just be sure to explicitly specify it*/ )
Index tmpPERSON_HISTORY__IX_MyPerson should be the clustered index so it an access all the other columns it needs in the query, either that or keep is as noclustered but include all the other columns in the table too. i.e:
CREATE UNIQUE CLUSTERED INDEX tmpPERSON_HISTORY__IX_MyPerson 
    ON dbo.tmpPERSON_HISTORY (hMyPerson, OccurredDate, My ) 
 WITH ( DATA_COMPRESSION = ROW, FILLFACTOR = 95/*or whatever value you want, just be sure to explicitly specify it*/ ) 

CREATE UNIQUE NONCLUSTERED INDEX tmpPERSON_HISTORY__CL 
    ON dbo.tmpPERSON_HISTORY (My, OccurredDate) 
    WITH ( DATA_COMPRESSION = NONE /*or PAGE, as you deem fit*/, FILLFACTOR = 99/*or whatever value you want, just be sure to explicitly specify it*/ )
The problem with that being the clus index is that it will severely fragment the table. Clustering first on $IDENTITY is, yes, sequential and thus less fragmenting, but is useless for satisfying the query given.
OccurredDate is also sequential, and thus also with less fragmentation, but also directly matches the query conditions. Thus, it's the best clustered index lead columns. My ($IDENTITY) is added just to absolutely ensure a unique key since SQL gains efficiency from having an inherently unique key value rather than SQL having to add a "uniquifier".

The query and indexes you have given actually give worse performance than the OPs original query:

set statistics io, time off
go

if object_id('dbo.tmpPERSON_HISTORY','u') is not null begin
    drop table dbo.tmpPERSON_HISTORY
end
go

if object_id('dbo.tmpPERSON','u') is not null begin
    drop table dbo.tmpPERSON
end
go
CREATE TABLE dbo.tmpPERSON
( 
    MyPerson    bigint NOT   NULL IDENTITY(1, 1),
    Code        VARCHAR(8)   NOT NULL,
    LastName    VARCHAR(256) NULL,
    FirstName   VARCHAR(256) NULL,
    Addr1       VARCHAR(50)  NULL,
    Addr2       VARCHAR(50)  NULL,
    City        VARCHAR(40)  NULL,
    State       VARCHAR(5)   NULL,
    Zipcode     VARCHAR(30)  NULL
)
GO
ALTER TABLE dbo.tmpPERSON
ADD CONSTRAINT pk_tmpPERSON PRIMARY KEY CLUSTERED (MyPerson ASC)
                       
GO
CREATE TABLE dbo.tmpPERSON_HISTORY
(
    My           bigint       NOT NULL IDENTITY(1, 1),
    hMyPerson    bigint       NULL,
    Event        VARCHAR(75)  NULL,
    OccurredDate DATETIME     NULL,
    FromDate     DATETIME     NULL,
    ToDate       DATETIME     NULL
) 
GO
ALTER TABLE dbo.tmpPERSON_HISTORY
ADD CONSTRAINT pk_tmpPERSON_HISTORY PRIMARY KEY nonclustered (My ASC)
GO
--ALTER TABLE dbo.tmpPERSON_HISTORY
--ADD CONSTRAINT PERSON_HISTORY_PERSON FOREIGN KEY (hMyPerson) REFERENCES tmpPERSON(MyPerson)
GO
INSERT INTO dbo.tmpPERSON
(
    [Code],
    [LastName],
    [FirstName],
    [Addr1],
    [Addr2],
    [City],
    [State],
    [Zipcode]
)
SELECT left(convert(varchar(36),NEWID()),8) [Code],
       left(convert(varchar(36),NEWID()),12) [LastName], 
       left(convert(varchar(36),NEWID()),12) [FirstName], 
       'Addr1 sd;flgjsd;flgjksd;flgjsd', 
       'Addr2 sd;flgjsd;flgjksd;flgjsd', 
       'my city', 
       'state', 
       'zip 1234'
  from dbo.fnTally(1,2000000)



INSERT INTO dbo.tmpPERSON_HISTORY
(
    [hMyPerson],
    [Event],
    [OccurredDate],
    [FromDate],
    [ToDate]
)
SELECT p.MyPerson,
       t.Event,
       t.OccurredDate,
       t.FromDate,
       t.ToDate
from dbo.tmpPERSON p
cross apply (values ('event name1 here',    '20210101',    '20210101',    '20210201'),
                    ('event name2 here',    '20210201',    '20210201',    '20210301'),
                    ('event name3 here',    '20210301',    '20210301',    '20210401'),
                    ('event name4 here',    '20210401',    '20210401',    '20210501'),
                    ('event name5 here',    '20210501',    '20210501',    '20210601'),
                    ('event name6 here',    '20210601',    '20210601',    '20210701')) T([Event],[OccurredDate],[FromDate],[ToDate])
go

--drop index tmpPERSON_HISTORY__CL  on dbo.tmpPERSON_HISTORY 
--drop index tmpPERSON_HISTORY__IX_MyPerson  on dbo.tmpPERSON_HISTORY 
--go
--CREATE UNIQUE CLUSTERED INDEX tmpPERSON_HISTORY__IX_MyPerson 
--    ON dbo.tmpPERSON_HISTORY (hMyPerson, OccurredDate, My ) 
-- WITH ( DATA_COMPRESSION = ROW, FILLFACTOR = 95/*or whatever value you want, just be sure to explicitly specify it*/ ) 

--CREATE UNIQUE NONCLUSTERED INDEX tmpPERSON_HISTORY__CL 
--    ON dbo.tmpPERSON_HISTORY (My, OccurredDate) 
--    WITH ( DATA_COMPRESSION = NONE /*or PAGE, as you deem fit*/, FILLFACTOR = 99/*or whatever value you want, just be sure to explicitly specify it*/ )


CREATE UNIQUE CLUSTERED INDEX tmpPERSON_HISTORY__CL 
    ON dbo.tmpPERSON_HISTORY (My, OccurredDate) 
    WITH ( DATA_COMPRESSION = NONE /*or PAGE, as you deem fit*/, FILLFACTOR = 99/*or whatever value you want, just be sure to explicitly specify it*/ )

CREATE UNIQUE NONCLUSTERED INDEX tmpPERSON_HISTORY__IX_MyPerson 
    ON dbo.tmpPERSON_HISTORY (hMyPerson, OccurredDate, My ) INCLUDE (Event,FromDate,ToDate)
 WITH ( DATA_COMPRESSION = ROW, FILLFACTOR = 95/*or whatever value you want, just be sure to explicitly specify it*/ ) 
 

if object_id('tempdb..#t1','u') is not null begin
    drop table #t1
end
if object_id('tempdb..#t2','u') is not null begin
    drop table #t2
end
set statistics io, time on

SELECT P.Code, 
       P.LastName, 
       P.Firstname,
       PH.*
  into #t1
  FROM dbo.tmpPERSON P 
  LEFT JOIN dbo.tmpPERSON_HISTORY PH 
         ON P.MyPerson = PH.hMyPerson 
        AND PH.My = (SELECT Max(PHX.hMyPerson) 
                       FROM dbo.tmpPERSON_HISTORY PHX 
                      WHERE PH.hMyPerson = PHX.hMyPerson)
 ORDER BY P.Code, PH.My

GO
SELECT P.Code, 
       P.LastName, 
       P.Firstname,
       PH.*
  into #t2
  FROM dbo.tmpPERSON P 
 OUTER APPLY (SELECT TOP (1) *
                FROM dbo.tmpPERSON_HISTORY PH 
               WHERE P.MyPerson = PH.hMyPerson
               ORDER BY OccurredDate DESC, My DESC) AS PH
 ORDER BY P.Code, PH.My

set statistics io, time off

fnTally

ScottPletcher SSC Guru Points: 101096 More actions · Answer 7

Hmm, I remember a WHERE condition on the OccurredDate; maybe I'm mixing this query up with another one.

At any rate, one last time, as I stated earlier:

the history table should be clustered leading on OccurredDate.
And, again, as part of that, OccurredDate should be made datetime2 and not just datetime.

Your (Johnathan) clustering:

CREATE UNIQUE CLUSTERED INDEX tmpPERSON_HISTORY__CL

ON dbo.tmpPERSON_HISTORY (My, OccurredDate)

makes no sense, because My is unique by itself, you would not need to follow it with OccurredDate.

SQL DBA,SQL Server MVP(07, 08, 09) "It's a dog-eat-dog world, and I'm wearing Milk-Bone underwear." "Norm", on "Cheers". Also from "Cheers", from "Carla": "You need to know 3 things about Tortelli men: Tortelli men draw women like flies; Tortelli men treat women like flies; Tortelli men's brains are in their flies".

ScottPletcher SSC Guru Points: 101096 More actions · Answer 8

A better option overall is to just have the last history row id stored in the header table (yes, denormalize it).

A trigger on the PERSON_HISTORY table can do that very accurately and easily and you wouldn't have to search the history at all for which row is last.

SQL DBA,SQL Server MVP(07, 08, 09) "It's a dog-eat-dog world, and I'm wearing Milk-Bone underwear." "Norm", on "Cheers". Also from "Cheers", from "Carla": "You need to know 3 things about Tortelli men: Tortelli men draw women like flies; Tortelli men treat women like flies; Tortelli men's brains are in their flies".

Jonathan AC Roberts SSCoach Points: 18354 More actions · Answer 9

ScottPletcher wrote:

Hmm, I remember a WHERE condition on the OccurredDate; maybe I'm mixing this query up with another one.
At any rate, one last time, as I stated earlier:
the history table should be clustered leading on OccurredDate.
And, again, as part of that, OccurredDate should be made datetime2 and not just datetime.
Your (Johnathan) clustering:
CREATE UNIQUE CLUSTERED INDEX tmpPERSON_HISTORY__CL ON dbo.tmpPERSON_HISTORY (My, OccurredDate)
makes no sense, because My is unique by itself, you would not need to follow it with OccurredDate.

Sorry, my fault. I'd accidently put the columns on your clustered index in reverse order. I see what you mean, having a clustered index on (OccurredDate, My) was just removing the need for an identity column on the PERSON_HISTORY table.

Jonathan AC Roberts SSCoach Points: 18354 More actions · Answer 10

I can't see the point in the clustered index :

CREATE UNIQUE CLUSTERED INDEX dbo.PERSON_HISTORY__CL 
    ON dbo.PERSON_HISTORY ( OccurredDate, My ) 
    WITH ( DATA_COMPRESSION = NONE /*or PAGE, as you deem fit*/, FILLFACTOR = 99/*or whatever value you want, just be sure to explicitly specify it*/ ) ON [<filegroup_name>]

It does nothing for performance. Why not just stick with a clustered primary key on column My?

To make the query perform I would just add an index:

 CREATE UNIQUE NONCLUSTERED INDEX IX_PERSON_HISTORY_1
     ON dbo.tmpPERSON_HISTORY(hMyPerson, OccurredDate DESC, my DESC) 
INCLUDE ([Event],[FromDate],[ToDate])

and change the query to Scott's:

SELECT P.Code, 
       P.LastName, 
       P.Firstname,
       PH.*
  FROM dbo.tmpPERSON P 
 OUTER APPLY (SELECT TOP(1) *
                FROM dbo.tmpPERSON_HISTORY PH 
               WHERE P.MyPerson = PH.hMyPerson
               ORDER BY OccurredDate DESC, My DESC) AS PH
 ORDER BY P.Code, PH.My

Scott,

Adding DATA_COMPRESSION = ROW to the index ON dbo.tmpPERSON_HISTORY(hMyPerson, OccurredDate DESC, my DESC) just seems to slow the query down rather than speed it up.

ScottPletcher SSC Guru Points: 101096 More actions · Answer 11

Jonathan AC Roberts wrote:

I can't see the point in the clustered index :

CREATE UNIQUE CLUSTERED INDEX dbo.PERSON_HISTORY__CL 
    ON dbo.PERSON_HISTORY ( OccurredDate, My ) 
    WITH ( DATA_COMPRESSION = NONE /*or PAGE, as you deem fit*/, FILLFACTOR = 99/*or whatever value you want, just be sure to explicitly specify it*/ ) ON [<filegroup_name>]

It does nothing for performance. Why not just stick with a clustered primary key on column My?

To make the query perform I would just add an index:

 CREATE UNIQUE NONCLUSTERED INDEX IX_PERSON_HISTORY_1
     ON dbo.tmpPERSON_HISTORY(hMyPerson, OccurredDate DESC, my DESC) 
INCLUDE ([Event],[FromDate],[ToDate])

and change the query to Scott's:

SELECT P.Code, 
       P.LastName, 
       P.Firstname,
       PH.*
  FROM dbo.tmpPERSON P 
 OUTER APPLY (SELECT TOP(1) *
                FROM dbo.tmpPERSON_HISTORY PH 
               WHERE P.MyPerson = PH.hMyPerson
               ORDER BY OccurredDate DESC, My DESC) AS PH
 ORDER BY P.Code, PH.My

Scott,

Adding DATA_COMPRESSION = ROW to the index ON dbo.tmpPERSON_HISTORY(hMyPerson, OccurredDate DESC, my DESC) just seems to slow the query down rather than speed it up.

(1) $IDENTITY is the wrong clus key on log tables. And, in particular, the original post had conditions on OccurredDate, which apparently were removed later.

I suggest that exact index -- without the DESC, which isn't required -- in my first post.

ROW compression can't typically slow down a SELECT query enough to notice, and in fact usually speeds it up (unless all the data is already in memory). Something else must be going on there.

SQL DBA,SQL Server MVP(07, 08, 09) "It's a dog-eat-dog world, and I'm wearing Milk-Bone underwear." "Norm", on "Cheers". Also from "Cheers", from "Carla": "You need to know 3 things about Tortelli men: Tortelli men draw women like flies; Tortelli men treat women like flies; Tortelli men's brains are in their flies".

Jonathan AC Roberts SSCoach Points: 18354 More actions · Answer 12

ScottPletcher wrote:

Jonathan AC Roberts wrote:
I can't see the point in the clustered index :
CREATE UNIQUE CLUSTERED INDEX dbo.PERSON_HISTORY__CL 
    ON dbo.PERSON_HISTORY ( OccurredDate, My ) 
    WITH ( DATA_COMPRESSION = NONE /*or PAGE, as you deem fit*/, FILLFACTOR = 99/*or whatever value you want, just be sure to explicitly specify it*/ ) ON [<filegroup_name>]
It does nothing for performance. Why not just stick with a clustered primary key on column My?
To make the query perform I would just add an index:
 CREATE UNIQUE NONCLUSTERED INDEX IX_PERSON_HISTORY_1
     ON dbo.tmpPERSON_HISTORY(hMyPerson, OccurredDate DESC, my DESC) 
INCLUDE ([Event],[FromDate],[ToDate])
and change the query to Scott's:
SELECT P.Code, 
       P.LastName, 
       P.Firstname,
       PH.*
  FROM dbo.tmpPERSON P 
 OUTER APPLY (SELECT TOP(1) *
                FROM dbo.tmpPERSON_HISTORY PH 
               WHERE P.MyPerson = PH.hMyPerson
               ORDER BY OccurredDate DESC, My DESC) AS PH
 ORDER BY P.Code, PH.My
Scott,
Adding DATA_COMPRESSION = ROW to the index ON dbo.tmpPERSON_HISTORY(hMyPerson, OccurredDate DESC, my DESC) just seems to slow the query down rather than speed it up.
(1) $IDENTITY is the wrong clus key on log tables. And, in particular, the original post had conditions on OccurredDate, which apparently were removed later.
I suggest that exact index -- without the DESC, which isn't required -- in my first post.
ROW compression can't typically slow down a SELECT query enough to notice, and in fact usually speeds it up (unless all the data is already in memory). Something else must be going on there.

I take your point about clustering log tables on a date column as date would often be part of a query criteria.

The main difference in the index I put is that it included all the remaining columns, this means the query would only have to look in the index and not go to the table. Adding DESC didn't seem to make a difference on this query but I've seen situations where it does make it faster if it's in the same direction as the order by.

I guess the performance of compressed indexes depends on the time it takes to decompress data in compressed indexes compared to the increased IO with uncompressed indexes.

ScottPletcher SSC Guru Points: 101096 More actions · Answer 13

Jonathan AC Roberts wrote:

ScottPletcher wrote:
Jonathan AC Roberts wrote:
I can't see the point in the clustered index :
CREATE UNIQUE CLUSTERED INDEX dbo.PERSON_HISTORY__CL 
    ON dbo.PERSON_HISTORY ( OccurredDate, My ) 
    WITH ( DATA_COMPRESSION = NONE /*or PAGE, as you deem fit*/, FILLFACTOR = 99/*or whatever value you want, just be sure to explicitly specify it*/ ) ON [<filegroup_name>]
It does nothing for performance. Why not just stick with a clustered primary key on column My?
To make the query perform I would just add an index:
 CREATE UNIQUE NONCLUSTERED INDEX IX_PERSON_HISTORY_1
     ON dbo.tmpPERSON_HISTORY(hMyPerson, OccurredDate DESC, my DESC) 
INCLUDE ([Event],[FromDate],[ToDate])
and change the query to Scott's:
SELECT P.Code, 
       P.LastName, 
       P.Firstname,
       PH.*
  FROM dbo.tmpPERSON P 
 OUTER APPLY (SELECT TOP(1) *
                FROM dbo.tmpPERSON_HISTORY PH 
               WHERE P.MyPerson = PH.hMyPerson
               ORDER BY OccurredDate DESC, My DESC) AS PH
 ORDER BY P.Code, PH.My
Scott,
Adding DATA_COMPRESSION = ROW to the index ON dbo.tmpPERSON_HISTORY(hMyPerson, OccurredDate DESC, my DESC) just seems to slow the query down rather than speed it up.
(1) $IDENTITY is the wrong clus key on log tables. And, in particular, the original post had conditions on OccurredDate, which apparently were removed later.
I suggest that exact index -- without the DESC, which isn't required -- in my first post.
ROW compression can't typically slow down a SELECT query enough to notice, and in fact usually speeds it up (unless all the data is already in memory). Something else must be going on there.
I take your point about clustering log tables on a date column as date would often be part of a query criteria.
The main difference in the index I put is that it included all the remaining columns, this means the query would only have to look in the index and not go to the table. Adding DESC didn't seem to make a difference on this query but I've seen situations where it does make it faster if it's in the same direction as the order by.
I guess the performance of compressed indexes depends on the time it takes to decompress data in compressed indexes compared to the increased IO with uncompressed indexes.

The original query had "WHERE 1 = 1 AND OccurredDate >= ... AND OccurredDate <condition> <expression>" which really made it clear to use OccurredDate as the leading clus key. But using the datetime2 is true generally anyway for log tables. When you do need to read a log table, it's almost always to look for a certain time period.

If you have to include (almost) the entire table in the nonclus index, then you've gained nothing from having that index.

I guess it's possible that DESC order might be affected more for row compression, which would be interesting, especially if it accounted for some performance difference. If, however, the performance difference was some mild CPU difference, then I'd just discount that.

Again, I've never seen a net loss from ROW compression when used properly, only net gain. That said, my system is vastly more I/O bound than CPU bound (most of the time we have CPU to burn). But that is the general rule for most shops, I/O is nearly always more of an issue than CPU.

SQL DBA,SQL Server MVP(07, 08, 09) "It's a dog-eat-dog world, and I'm wearing Milk-Bone underwear." "Norm", on "Cheers". Also from "Cheers", from "Carla": "You need to know 3 things about Tortelli men: Tortelli men draw women like flies; Tortelli men treat women like flies; Tortelli men's brains are in their flies".

YSLGuru SSC-Insane Points: 21661 More actions · Answer 14

Thank you all for taking time to provide feedback. I haven't been able to go thru these yet because I'm out sick but I did want to take a minute to reply back to say thanks and I will go thru these once my head clears up.

Thanks

Kindest Regards,

Just say No to Facebook!