Full outer join question

  • I have two large tables that I need to compare side-by-side across many of their columns. However, I only need to compare a subset of rows, though still a large one. I want to take full advantage of the indexes on both tables, and that's where my question comes in. Here's an example:

    declare @T1 table
    (
        IndexID int identity(1,1),
        PID int,
        ID int,
        Name varchar(20),
        ParameterValue int,
        primary key ( Name, IndexID )
    )

    declare @T2 table
    (
        IndexID int identity(1,1),
        Name varchar(30),
        ParameterValue int,
        Process bit,
        primary key ( Name, IndexID )
    )

    insert @T1 ( PID, ID, Name, ParameterValue )
    values  ( 1, 1, 'ABC', 10 ),
            ( 1, 2, 'XYZ', 20 ),
            ( 1, 3, 'PQR', 30 ),
            ( 2, 1, 'ABC', 10 ),
            ( 2, 2, 'XYZ', 20 ),
            ( 2, 3, 'PQR', 30 )

    insert @T2 ( Name, ParameterValue, Process )
    values  ( 'ABC', 10, 0 ),
            ( 'XYZ', 25, 0 ),
            ( 'JKL', 50, 0 ),
            ( 'MNO', 60, 0 ),
            ( 'ABC', 10, 1 ),
            ( 'XYZ', 25, 1 ),
            ( 'JKL', 50, 1 )

    I want to compare the ParameterValue column in both tables side by side. For table @T1, I only want to use the rows with PID = 2, and in table @T2, I only want to use the rows with Process = 1.

    I can write the query:

    select  OldIndexID = T1.IndexID,
            NewIndexID = T2.IndexID,
            PID        = T1.PID,
            ID         = T1.ID,
            Name       = isnull(T2.Name, T1.Name),
            OldParam   = T1.ParameterValue,
            NewParam   = T2.ParameterValue,
            IsNew      = case
                             when
                             (
                                 T2.IndexID is not null and
                                 (
                                     T1.IndexID is null or
                                     (
                                         T1.IndexID is not null and
                                         isnull(T2.ParameterValue, -1) <> isnull(T1.ParameterValue, -1)
                                     )
                                 )
                             )
                             then 1
                             else 0
                         end,
            IsOld      = case
                             when
                             (
                                 T1.IndexID is not null and
                                 (
                                     T2.IndexID is null or
                                     (
                                         T2.IndexID is not null and
                                         isnull(T2.ParameterValue, -1) <> isnull(T1.ParameterValue, -1)
                                     )
                                 )
                             )
                             then 1
                             else 0
                         end
    from
    (
        select *
        from @T1
        where PID = 2
    ) T1
    full outer join
    (
        select *
        from @T2
        where Process = 1
    ) T2
        on T2.Name = T1.Name

    which gives me the result:

    OldIndexID  NewIndexID  PID   ID    Name  OldParam  NewParam  IsNew  IsOld
    ----------  ----------  ----  ----  ----  --------  --------  -----  -----
    4           5           2     1     ABC   10        10        0      0
    5           6           2     2     XYZ   20        25        1      1
    6           NULL        2     3     PQR   30        NULL      0      1
    NULL        7           NULL  NULL  JKL   NULL      50        1      0

    While this result is correct and I can identify what hasn't changed, what changed, what is new, and what needs to be removed, I am afraid that for very large data sets in @T1 and @T2 (permanent tables in my case) the query will not take full advantage of the indexes I have on Name. In fact, the execution plan tells me exactly that.

    I need to rebuild this query so that it filters out records that are of no interest (PID = 1 in table @T1 and Process = 0 in table @T2) while the full outer join still uses the index on Name in both tables.

    Any help is greatly appreciated.

  • A couple of things.

    First and foremost, you're not going to know how a SQL call scales to "very large" table sizes with table variables. Table variables by nature have no statistics built on them, and are optimized somewhat differently than persisted tables.

    The other problem with using table variables is that you can't really index them in any sort of meaningful way.

    The first thing I'd suggest is running this operation at "scale" and observing the execution plan looking for costly operations like table scans, which you'll probably see given all the stuff that's going on here.

    Building some supporting indices on the tables may give you better performance. You could also try building a scaled down version using temp tables. Also, what a "very large" table is to one person may be trivial to a modern SQL Server. I work with tables in the 1-2 billion row range, and that's what I'd consider "pretty big". Even so, there are people who might think those are small. People who don't regularly work with SQL Server often think thousands of rows are "big". My point is, unless you have some statistics showing that what you're trying to do is untenable (and why), you might as well just see what happens.
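    If it helps, here's a minimal sketch of the "scaled down version using temp tables" idea, reusing the columns from the table-variable example above (names are illustrative). Unlike table variables, the temp tables get statistics and can take additional nonclustered indexes later:

    create table #T1
    (
        IndexID int identity(1,1),
        PID int,
        ID int,
        Name varchar(20),
        ParameterValue int,
        primary key ( Name, IndexID )
    )

    create table #T2
    (
        IndexID int identity(1,1),
        Name varchar(30),
        ParameterValue int,
        Process bit,
        primary key ( Name, IndexID )
    )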

    The other thing I'd recommend, which has been invaluable to me working with large data sets, is to break the operation down into individual components. For example, a full outer join is basically a left join and a right join done at once. While it's often desirable to want to write one really awesome query to do everything you want all at once, sometimes SQL has trouble figuring out the best way to do things when you have lots of case statements, null checks or joins on unindexed columns. You can often "hold its hand" through what you're trying to do by breaking out one big complex statement into multiple smaller, optimized statements.
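    As an illustration of that idea against the sample tables above (a sketch only; the IsNew/IsOld logic is left out to keep it short), the full outer join can be split into a left join plus an anti-join, each of which can be satisfied with a seek on Name:

    -- rows driven from the old side: matches plus rows that disappeared
    select  OldIndexID = T1.IndexID,
            NewIndexID = T2.IndexID,
            Name       = isnull(T2.Name, T1.Name),
            OldParam   = T1.ParameterValue,
            NewParam   = T2.ParameterValue
    from @T1 T1
    left join @T2 T2
        on  T2.Name    = T1.Name
        and T2.Process = 1
    where T1.PID = 2

    union all

    -- rows that exist only on the new side
    select  null, T2.IndexID, T2.Name, null, T2.ParameterValue
    from @T2 T2
    where T2.Process = 1
      and not exists ( select 1
                       from @T1 T1
                       where T1.PID  = 2
                         and T1.Name = T2.Name )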

    Executive Junior Cowboy Developer, Esq.

  • N_Muller (10/27/2014)


    I need to rebuild this query such that it filters out records that are of no interest (PID = 1 in table @T1 and Process = 0 in table @T2) and the full outer join still uses the index on Name on both tables.

    Any help is greatly appreciated.

    Can we see the sqlplan of the query you've actually built against your tables, and the schema (with indexes) for the two subtables?

    My guess is this needs to start with a new index for the tables in question, leading with the actual restrictions from the where clause before [Name] even gets involved, if such indexes don't exist already. Whether that's even worthwhile will depend on the data distribution, though.
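    Purely as an illustration of that kind of index (the table names below are placeholders, since we haven't seen the real schema yet), leading with the filter column and only then [Name]:

    create index IX_Main_PID_Name
        on dbo.MainTable ( PID, Name )
        include ( ID, ParameterValue )

    create index IX_Staging_Process_Name
        on dbo.StagingTable ( Process, Name )
        include ( ParameterValue )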


    - Craig Farrell

    Never stop learning, even if it hurts. Ego bruises are practically mandatory as you learn unless you've never risked enough to make a mistake.

    For better assistance in answering your questions | Forum Netiquette
    For index/tuning help, follow these directions. | Tally Tables

    Twitter: @AnyWayDBA

  • Thanks Craig and JeeTee for the responses.

    The table variables were just an example. The real tables are permanent tables. One table stores the current values - it has on the order of 100 million rows and it is partitioned (PID in my example). The other is also a permanent table, but it is a staging table storing the output of reading an XML file of about 400 MB, and it contains about 10 million rows. There is some validation done on the XML, and that's the purpose of the Process column in my table variable.

    The purpose of the query is to update the permanent table. There are new rows to be added, rows to be deleted, and rows where attribute values changed. Some attributes can be null.

    The query as I have it does a full outer join on two derived tables which, as such, have no index. I'd like to process these tables in a way where I don't use derived tables and can make full use of the indexes (partition and name on the main table, and process and name on the staging table). The name column is a varchar(50) column. One thing that could speed this up is to generate a checksum on the name in the two derived tables - I may even add these columns to the permanent tables as computed, persisted columns. The checksum doesn't guarantee uniqueness, but at least provides a first-level filter on a 4-byte integer instead of a 50-byte character key.
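    For what it's worth, a rough sketch of that checksum idea (the table names are placeholders for the real permanent and staging tables):

    -- persisted computed columns holding a 4-byte hash of the 50-byte key;
    -- CHECKSUM is deterministic, so the column can be persisted and indexed
    alter table dbo.MainTable
        add NameChecksum as checksum(Name) persisted

    alter table dbo.StagingTable
        add NameChecksum as checksum(Name) persisted

    -- the join would compare NameChecksum first and the full Name second,
    -- since CHECKSUM does not guarantee uniqueness:
    --     on T2.NameChecksum = T1.NameChecksum and T2.Name = T1.Name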

    I apologize if I wasn't very clear. Any help is still welcome.

  • I wish there were a precise answer I could give you, but there are a lot of unknowns (including what the actual table structure, existing indices and data look like), and some things you'll probably have to play around with for yourself. The first thing I would be doing is looking at the indexes of both your presentation table (pres), the table into which you're essentially merging, and your raw table (raw), the intermediate table which houses the shredded xml values.

    FWIW, I spent some time messing around with the raw data you posted and your query, and even your existing, unaltered full outer join code, while it had to scan each table, didn't perform all that poorly. This raises the question: what are you trying to optimize towards? Reducing system resources? Making it run as fast as possible regardless of how many intermediate objects and indices you need to build? I applaud you for wanting to optimize your code, but I wouldn't lose too much sleep over it if it takes a few minutes to complete. Even the most optimized code is going to take a while to move millions of records around. There are some incredibly elaborate ways you could probably cut down on how many reads it has to make at run time, but my takeaways would be these:

    1) Establish what it is you're trying to optimize towards. Is it taking hours and needs to be taking minutes? Is it taking minutes and needs to be taking seconds?

    2) Look into indexing the tables which your derived tables pull from. Consider filtered indexes if you want (see the sketch after this list).

    3) Set statistics io on, and look at the graphical execution plan.

    4) Consider breaking out the statements into smaller steps so that any problem statements are run using the optimal indexes
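    For point 2, a filtered index would look something like this (using the "raw" table and the sample column names as stand-ins; the real definition depends on your schema), and point 3 is just the SET options below:

    -- a filtered index keeps only the rows the derived table actually needs
    create index IX_raw_Name_Process1
        on dbo.raw ( Name )
        include ( ParameterValue )
        where Process = 1

    -- for point 3
    set statistics io on
    set statistics time on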

    Executive Junior Cowboy Developer, Esq.
