Comparing data between fairly large tables (800 million vs. 1 billion rows)

  • Hi,

     

    We are exploring the least time-consuming approach for comparing data between two fairly large tables.

    The table to be compared (table1 in this case) has about 800 million rows, and the table to be compared against (table2 in this case) will have at least 1 billion rows. The data in table2 is being populated by importing about 250 flat files provided to us, which means we have the option of importing the 250 flat files into several smaller tables instead of one single table.

     

    Both tables (1 and 2) have only a single column, which is the one to be compared. The column to be compared will be indexed in both tables, and its data type is BIGINT. This will, however, be just a one-time process and doesn't need to be scheduled.

     

    Following are a couple of approaches we have thought of:

     

    1.       Import the 250 flat files into one single table (table2). Compare table1 with table2, which will be a comparison of close to 1 billion rows against 1 billion single-column rows.

    2.       Import the 250 flat files into multiple smaller tables, let's say 20 tables. Write a script that loops through and compares table1 with one of the 20 tables at a time; if it finds matching data in any of the tables, it marks that value as matched, exits, and does no further checks. We hope that on average this might reduce the comparisons that would otherwise be done in the first approach (a rough sketch follows below).
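    Something like this is what we have in mind for approach 2 (the smaller table names table2_01 through table2_20, the column name col1, and the matched flag on table1 are all placeholders, not our actual schema):

    -- loop over the 20 smaller tables and flag table1 rows
    -- as soon as a match is found in any of them
    DECLARE @i INT, @sql NVARCHAR(MAX)
    SET @i = 1
    WHILE @i <= 20
    BEGIN
        SET @sql = N'UPDATE t1 SET matched = 1
                     FROM table1 t1
                     WHERE t1.matched = 0
                       AND EXISTS (SELECT 1 FROM table2_'
                 + RIGHT('0' + CAST(@i AS VARCHAR(2)), 2)
                 + N' t2 WHERE t2.col1 = t1.col1)'
        EXEC (@sql)
        SET @i = @i + 1
    END
    -- rows still having matched = 0 were not found in any of the 20 tables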

     

    Any feedback, or an alternative less time-consuming approach, would be appreciated. Please let me know if further detail or clarification is needed.

     

    Thanks

  • If you're looking for matches, just index them and do an INNER JOIN between the two tables on that single column. If that's not what you're looking for, then more info would be helpful, such as are these numbers unique in each table, what exactly is the comparison methodology, etc.
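    For matches, a minimal sketch (assuming the compared column is called col1 in both tables) would be something like:

    SELECT t1.col1
    FROM   table1 t1
    INNER JOIN table2 t2
            ON t1.col1 = t2.col1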

  • Thanks for the response. I just need to get the values (rows) from table1 that do not exist in table2. The values in table1 will be unique; however, in table2 there might be some duplicate values.

    Thanks

  • SELECT     t1.column

    FROM       table1 t1

    LEFT JOIN  table2 t2

    ON         t1.column = t2.column

    WHERE      t2.column IS NULL

    ..or..

    SELECT t1.column

    FROM   table1 t1

    WHERE NOT EXISTS ( SELECT * FROM table2 t2 WHERE t1.column = t2.column )

    Both ways return t1.column where that value doesn't exist in table2.

    Which one is most effective depends. Sometimes it's the first, sometimes the other.

    You have to test it on the particular tables involved to know for sure.

    /Kenneth

     

  • The only thing I can add is that, in my experience, the first way - with LEFT JOIN - is better in most situations (not always).

    So, regarding your original question, I would try the first method, and just two tables - no splitting. It's true that I've never worked with such huge datasets (the biggest table in our DB had about 100 million rows), but LEFT JOIN performs well with large tables as far as I know.

  • Actually, the left join will return all the rows that have matching records in the second table, not the missing records. You pretty much have to go with the NOT EXISTS in the WHERE clause.

     

    James.

  • Oops! Sorry, I didn't pay attention to the WHERE clause. Joins are better than sub-selects under most circumstances. That's what I get for not really looking at something before shooting my mouth off. My apologies.

    James.

  • It happens

    As to which would be best in this case, one has to try both, and also preferably look at the plans produced, in order to choose. Other than that, which one to prefer is largely a personal preference.

    /Kenneth

  • Are you sure that you have no duplicates in these tables?  If you do, you could end up with the mother of all cross joins using a join.

     

    This is a safe approach if you don’t know for sure if you have duplicates.

     

    select

          a.MY_COLUMN

    from

          (

          select MY_COLUMN, T1 = 1, T2 = 0 from TABLE_1 group by MY_COLUMN

          union all

          select MY_COLUMN, T1 = 0, T2 = 1 from TABLE_2 group by MY_COLUMN

          ) a

    group by

          a.MY_COLUMN

    having

          max(T1) = 1 and max(T2) = 0

    order by

          a.MY_COLUMN

     

     

  • Trying to follow the logic in my head just made it hurt, so I did it in Management Studio. I then compared the actual execution plans using ten rows of test data in each table. The first method, with the left join, results in a nested loop (for the left outer join) and an intermediate result set of 100 rows (table 1 rows times table 2 rows). The second method, with the derived table, looks like a more efficient solution because it never deals with more than 20 rows (table 1 rows plus table 2 rows), though the cost of the second query (with a test of 10 rows per table) is actually higher relative to the batch (of the two queries). The sorting for the "Distinct" appears to be the time killer on query two.

    I wonder, as the total number of rows grows, whether there is a break-even point where avoiding the Cartesian product of query 1 makes query 2 the better choice. Anyone have any thoughts?

     

    James.

     

  • LOL... are you sure that JOIN is female? Maybe it would be a father 🙂

  • “…I wonder as the total number of rows grows is there a break even point where the lack of the Cartesian product of query 1 makes query 2 the better choice?...”

     

    You could just try method 1.  If it runs for hours until tempdb fills the disk, then you might want to switch to method 2.  It really depends on the data.  I just wanted to offer an alternative solution.

     

    It would also be easier to implement method 2 if you import the data into multiple tables. Just include all the tables in the union, as in the sketch below.
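    For example, with the imports split across several tables (table names here are just placeholders), the union version might look like:

    select
          a.MY_COLUMN
    from
          (
          select MY_COLUMN, T1 = 1, T2 = 0 from TABLE_1 group by MY_COLUMN
          union all
          select MY_COLUMN, T1 = 0, T2 = 1 from TABLE_2_PART01 group by MY_COLUMN
          union all
          select MY_COLUMN, T1 = 0, T2 = 1 from TABLE_2_PART02 group by MY_COLUMN
          -- ...one branch per imported table
          ) a
    group by
          a.MY_COLUMN
    having
          max(T1) = 1 and max(T2) = 0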


  • I wasn't criticizing, it was a neat solution.  I have now run all three methods against two tables with 1,000,000 records of random data in each table in a single batch with the following results:

    Query 1: (Sub-Select) 22% of batch.

    Query 2: (Left Join) 63% of batch.

    Query 3: (Derived Table) 15% of batch

    Query 3 looks like it's the most efficient, though time-wise they were all within milliseconds of each other. It should be noted that the random data generated dups in table 1, which results in multiple records with the same value in the compared column being retrieved by queries 1 and 2; however, since query 3 uses a GROUP BY, it gets only one record per distinct value, which could make a difference depending on how the result is used.

    I like query 3, though it took me some time to figure out why it worked, so on top of everything I learned a new technique.

    James.

    Here is the code I used for the test in case someone is interested:

    if object_id('tab1_') is not null drop table tab1_

    if object_id('tab2_') is not null drop table tab2_

    go

    create table tab1_ (id_ int identity,col1_ int)

    create table tab2_ (id_ int identity,col1_ int)

    go

    create index i_tab1_ on tab1_ (col1_)

    create index i_tab2_ on tab2_ (col1_)

    go
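    -- populate both test tables with 1,000,000 random integers in roughly the 1 to 100,000 range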

    insert into tab1_ (col1_)

     SELECT TOP 1000000

            SomeInt    = CAST(RAND(CAST(NEWID() AS VARBINARY))*100000+1 AS INT)

       FROM Master.dbo.SysColumns sc1,

            Master.dbo.SysColumns sc2

    go

    insert into tab2_ (col1_)

     SELECT TOP 1000000

            SomeInt    = CAST(RAND(CAST(NEWID() AS VARBINARY))*100000+1 AS INT)

       FROM Master.dbo.SysColumns sc1,

            Master.dbo.SysColumns sc2

    go

    /*

    select top 1000 * from tab1_

    select top 1000 * from tab2_

    */

    --now test the different methods to find all rows in table 1 that do not exist in table 2

    select *

       from tab1_ t1

       where not exists (select 1 from tab2_ t2 where t2.col1_ = t1.col1_)

    go

    Select t1.*

       from tab1_ t1 left join tab2_ t2 on (t1.col1_  = t2.col1_)

       where t2.col1_ is null

    go

    select

          a.col1_

       from

          (

          select col1_, T1 = 1, T2 = 0 from tab1_ group by col1_

          union all

          select col1_, T1 = 0, T2 = 1 from tab2_ group by col1_

          ) a

       group by

          a.col1_

       having

          max(T1) = 1 and max(T2) = 0

       order by

          a.col1_

    go

     

  • Hi guys,

    Thanks for your prompt feedback and solutions. All the solutions seem pretty feasible. The left outer join, I guess, might just work fine. I was reading about the data partitioning enhancements in SQL 2005. Do you think partitioning the t2 table would be the least time-consuming solution? Has anybody worked with table partitioning for large data before?
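    This is roughly the kind of setup I was reading about (the boundary values and names below are just placeholders, not our real schema):

    -- partition table2 on the BIGINT column into ranges
    CREATE PARTITION FUNCTION pf_col1 (BIGINT)
    AS RANGE RIGHT FOR VALUES (250000000, 500000000, 750000000)

    CREATE PARTITION SCHEME ps_col1
    AS PARTITION pf_col1 ALL TO ([PRIMARY])

    CREATE TABLE table2_partitioned (col1 BIGINT) ON ps_col1 (col1)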

    Thanks

  • The problem posed stated that the table1.column to be compared against was indeed unique.

    It also said that the size was about 800 million.

    Table2 is around a billion rows (roughly 25% larger), and with 'some' dupes in table2.column.

    Now, since t2.col may contain dupes, each dupe would produce 'extra rows', as in t1.col matches times t2.col matches.

    Since t1.col was unique, this is the same as 1 * the number of t2 dupe values (not a Cartesian product).

    When testing, it's important to create test data with the same characteristics, or else the tests may show false results. The major unknown here is just how many dupes 'some' actually means.

    Perhaps it'd be more efficient to first pull out the distinct t2.column values into their own temp table, then do the left join or exists query against that. Now both sides are ensured to be unique, and any excessive intermediate results, Cartesian funky-joins and such shouldn't be a problem. This is, however, extra overhead, so the same thing applies: you have to test it on the full load in order to see whether the extra 'dupe cleansing, preparation and indexing of the temp table' will be beneficial overall or not.
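    Something along these lines (the temp table name and the column name col1 are just placeholders):

    -- materialize the distinct values of the compared column once, index them,
    -- then do the anti-join against the deduped set
    SELECT DISTINCT t2.col1
    INTO   #t2_distinct
    FROM   table2 t2

    CREATE UNIQUE CLUSTERED INDEX ix_t2_distinct ON #t2_distinct (col1)

    SELECT t1.col1
    FROM   table1 t1
    LEFT JOIN #t2_distinct d ON d.col1 = t1.col1
    WHERE  d.col1 IS NULL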

    Also, be aware when testing the different variants that using data volumes that are too small compared to the actual intended volume may not yield a truthful result. The volume itself is a very important factor when it comes to plans and performance in the end.

    /Kenneth

     
