May 29, 2008 at 2:38 pm
One of the basis relational operators is PROJECT, which Dr Codd defined in his seminal paper "A Relational Model of Large Shared Data Banks" as
2.1.2. Projection. Suppose now we select certain columns of a relation (striking out the others) and then remove from the resulting array any duplication in the rows. The final array represents a relation which is said to be a projection of the given relation.
PROJECT is implemented in SQL by specifying the columns in the SELECT section of a statement such as:
Create table dbo.Person
(PersonIdinteger not null
,CityNamevarchar(50) not null
,CountryNamevarchar(50) not null
,constraint Person_P primary key (PersonId))
An example SQL statement for a PROJECT is:
SELECT PersonId
, CountryName
from Person
However, when the relation is made temporal, so that a history is stored, PRJECT becomes more complex. Given the below table, the desired results is the Person, Country, and a start/end when they continuously resided in that country.
An English description of the solution is: For the each person and country,
1. Get possible time periods - this is a triangle join
2. Exclude where the person resided in a different country during the time period
3. For the end, use the latest end.
4. For the lastest end, use the earliest start
Does anyone have a simplier or more effective solution that can be included as the definition of a view?
Create table dbo.PersonAddress
(PersonIdint not null
,EffectiveTsdatetimenot null
,ExpirationTsdatetimenot null
,CityNamevarchar(50)not null
,CountryNamevarchar(50)not null
,constraint PersonAddress_P primary key (PersonId , EffectiveTs)
)
The SQL is:
Create view PersonCountry AS
SELECT PersonCountry.PersonId
,PersonCountry.CountryName
,MIN(PersonCountry.EffectiveTs)as EffectiveTs
,PersonCountry.ExpirationTs
FROM(SELECT F.PersonId
,F.CountryName
,F.EffectiveTs
,MAX(L.ExpirationTs) as ExpirationTs
FROM PersonAddress F
JOINPersonAddress L
ONL.PersonId= F.PersonId
ANDL.CountryName= F.CountryName
ANDL.EffectiveTs< L.ExpirationTs
ANDNOT EXISTS
--Did not reside in a different country within the time period
(SELECT * FROM PersonAddress M
WHEREM.PersonId= F.PersonId
ANDM.CountryName<> F.CountryName
ANDM.EffectiveTs>F.EffectiveTs
ANDM.EffectiveTs< L.ExpirationTs
)
group by F.PersonId
,F.CountryName
,F.EffectiveTs
) as PersonCountry
GROUP BY PersonCountry.PersonId
,PersonCountry.CountryName
,PersonCountry.ExpirationTs
ORDER BY PersonCountry.PersonId
,EffectiveTs
Here is test data:
truncate table dbo.PersonAddress
insert into dbo.PersonAddress
(PersonId , EffectiveTs, ExpirationTs ,CityName,CountryName)
select 1, '2000-01-01','9999-12-31T23:59:59.997','Chicago','USA' union all
select 2, '2000-01-01','2000-12-31T23:59:59.997','Chicago','USA' union all
select 2, '2001-01-01','9999-12-31T23:59:59.997','Detroit','USA' union all
select 3, '2000-01-01','2000-12-31T23:59:59.997','Chicago','USA' union all
select 3, '2001-01-01','2001-12-31T23:59:59.997','Detroit','USA' union all
select 3, '2002-01-01','9999-12-31T23:59:59.997','Stratford','CAN' union all
select 4, '2000-01-01','2000-12-31T23:59:59.997','Chicago','USA' union all
select 4, '2001-01-01','2001-12-31T23:59:59.997','Detroit','USA' union all
select 4, '2002-01-01','2002-12-31T23:59:59.997','Stratford','CAN' union all
select 4, '2003-01-01','9999-12-31T23:59:59.997','Toronto','CAN' union all
select 5, '2000-01-01','2000-12-31T23:59:59.997','Chicago','USA' union all
select 5, '2001-01-01','2001-12-31T23:59:59.997','Detroit','USA' union all
select 5, '2002-01-01','2002-12-31T23:59:59.997','Stratford','CAN' union all
select 5, '2003-01-01','2003-12-31T23:59:59.997','Toronto','CAN' union all
select 5, '2004-01-01','9999-12-31T23:59:59.997','Columbus','USA'
-- Expected Results
selectPersonId,CountryName
,CAST( EffectiveTs as datetime) as EffectiveTs
,CAST( ExpirationTs as datetime) as ExpirationTs
from(
select 1,'USA','2000-01-01T00:00:00.000','9999-12-31T23:59:59.997' union all
select 2,'USA','2000-01-01T00:00:00.000','9999-12-31T23:59:59.997' union all
select 3,'USA','2000-01-01T00:00:00.000','2001-12-31T23:59:59.997' union all
select 3,'CAN','2002-01-01T00:00:00.000','9999-12-31T23:59:59.997'union all
select 4,'USA','2000-01-01T00:00:00.000','2002-12-31T23:59:59.997' union all
select 4,'CAN','2003-01-01T00:00:00.000','9999-12-31T23:59:59.997' union all
select 5,'USA','2000-01-01T00:00:00.000','2001-12-31T23:59:59.997' union all
select 5,'CAN','2002-01-01T00:00:00.000','2003-12-31T23:59:59.997' union all
select 5,'USA','2004-01-01T00:00:00.000','9999-12-31T23:59:59.997'
) as PersonCountry (PersonId,CountryName,EffectiveTs,ExpirationTs)
SQL = Scarcely Qualifies as a Language
May 30, 2008 at 2:47 am
Hello Carl,
ït took some time to think of a different approach, but in the end I came up with this. It is based on the idea, that if someone changed address inside of a country at certain point, there will be the same number of future addresses in different countries. So, all rows with the same count of future abroad addresses can be merged into one period of residence in a country.
SELECT Q.PersonId, Q.CountryName, MIN(Q.EffectiveTs), MAX(Q.ExpirationTs)
FROM
/*count how many entries in future in other country?*/
(SELECT PA1.PersonId, PA1.CountryName, PA1.EffectiveTs, PA1.ExpirationTs , COUNT(C.PersonId) as counter
FROM PersonAddress PA1
LEFT JOIN PersonAddress C ON C.PersonId=PA1.PersonId
AND C.EffectiveTs > PA1.EffectiveTs
AND C.CountryName <> PA1.CountryName
GROUP BY PA1.PersonId, PA1.CountryName, PA1.EffectiveTs, PA1.ExpirationTs) AS Q
GROUP BY Q.PersonId, Q.CountryName, Q.counter /*those with equal counter can be merged, because the country didn't change*/
ORDER BY Q.PersonId, MIN(Q.EffectiveTs)
I don't know whether there is some effect on performance, I just saw that your execution plan does 2 index scans and 1 index seek, while mine only 2 index scans... so there is a chance that my solution will perform the same as yours or a little better. But you have to test it, of course.
BTW, check your posted desired output... there is a mistake for PersonId 4. This person moved to Canada in 2002, not 2003.
June 1, 2008 at 1:23 pm
Read your post and an running tests including scalability tests.
The testing is being performed versus a table containing database file io statistics that have been gathered each hourly for about 150 SQL Servers for the last 60 days. The table has about 2 million rows.
SQL = Scarcely Qualifies as a Language
June 10, 2008 at 12:45 pm
For a Temporal Project, where the project includes all of the primary key columns, benchmarks of three SQL Server solutions were performed and documented below. The three different solutions were 1) a cursor solution 2) Vladan's solution, 3) solution from "Developing Time-Oriented Database Applications in SQL" by Dr. Richard T. Snodgrass of the University of Arizona.
Below is the table used for the benchmarks, is populated based on running the system function fn_virtualfilestats hourly versus 86 SQL Server 2000 instances, and the information is retained for 60 days. The table has a total of 2.2 million rows. To eliminate network IO as a factor, the results were written to a permanent table, which was then used to validate that all three solutions produced identical results.
For an single chosen instance, here are the various cardinalities:
Table Projected
Database Name 17 17
Database Files 34 34
Total Rows 49,054 21,575
Average Rows Per Database File 1,442 634
Maximum Rows Per Database File 1,447 980
Minimum Rows Per Database File 1,354 1
The benchmarks for a single SQL Server instance that consists of thirty four database files are:
Scans – Cursor is 1, Vladan's is 8 and Literature is 473,825
Logical I/O – Cursor is 104,571 , Vladan's is 2,118 and Literature is 473,825
Elapsed Seconds – Cursor is 24, Vladan's is 109 and Literature is 3,105.
The benchmarks for all 86 SQL Server instance are (note that the benchmark for the Literature was cancelled after running for 12 hours !)
Scans – Cursor is 1, Vladan's is 8 and Literature is 473,825
Logical I/O – Cursor is 6,581,739 and Vladan's is 38,274,525
Elapsed Seconds – Cursor is 62 and Vladan's is 6,700
Attached are four files:
InstanceFileIO.txt is the table DDL.
TemporalProject.txt is the SQL solution from the literature.
TemporalProject_Cursor.txt is a cursor based solution.
TemporalProject_Vladan.txt is Vladan's solution.
SQL = Scarcely Qualifies as a Language
June 11, 2008 at 12:52 am
Thanks for posting the results, Carl... very interesting. I knew that triangular joins don't perform well on large amounts of data, but anyway I was surprised about that much better performance of a cursor.
June 15, 2008 at 12:12 pm
Carl: can you post your data so that we can replicate your test?
Easiest way is probably to BCP the table and then include as an attachment in a .ZIP file. The BCP command would be something like:
BCP {dbname}.dbo.InstancefileIO OUT D:\{path}\InstancefileIO.nat -n -T -S {ServerName}
[font="Times New Roman"]-- RBarryYoung[/font], [font="Times New Roman"] (302)375-0451[/font] blog: MovingSQL.com, Twitter: @RBarryYoung[font="Arial Black"]
Proactive Performance Solutions, Inc. [/font][font="Verdana"] "Performance is our middle name."[/font]
June 16, 2008 at 9:51 am
As I used live production data and sp_spaceused reports about one and half million rows and 256Mb of space used, posting that much data is not possible.
SQL = Scarcely Qualifies as a Language
Viewing 7 posts - 1 through 6 (of 6 total)
You must be logged in to reply to this topic. Login to reply