Correlated Subquery - Stream Aggregate

  • Hi All

    Some DDL

    --Table 1

    CREATE TABLE [dbo].[Indexing2](

    [Col1] [int] IDENTITY(1,1) NOT NULL,

    [Col2] [int] NULL,

    [Col3] [int] NULL,

    [Col4] [int] NULL

    )

    --Indexes

    CREATE UNIQUE CLUSTERED INDEX [CX] ON [dbo].[Indexing2] ([Col1])

    CREATE NONCLUSTERED INDEX [NCX] ON [dbo].[Indexing2] ([Col4])

    --Table2

    CREATE TABLE [dbo].[Indexing](

    [ID] [int] IDENTITY(1,1) NOT NULL,

    [Col2] [int] NULL

    )

    --Indexes

    CREATE UNIQUE CLUSTERED INDEX [CX] ON [dbo].[Indexing] ([ID])

    CREATE NONCLUSTERED INDEX [NCIX] ON [dbo].[Indexing] ([Col2])

    And consider the following selects

    select Col4 from Indexing2

    inner join Indexing

    on Indexing2.Col4 = Indexing.Col2

    where Col4 between '200' and '250'

    select Col4 from Indexing2

    where exists (select * from Indexing where Indexing.Col2 = Indexing2.Col4)

    and Col4 between '200' and '250'

    Essentially, the 2 queries are the same. Why does the 1st query return duplicates and the 2nd query doesn't?

    Also, the the execution plans are identical besides the 2nd query having a stream aggregate. Why is this being added?

    Thanks

  • Sample data would be good as well

    Using the following sample data

    insert into Indexing values (199),(200),(210),(220),(230),(240),(250),(251)

    insert into Indexing2 (Col4) values (199),(200),(210),(220),(230),(240),(250),(251)

    I get no duplicates and now stream aggregate, the only difference is a left semi join on the exists vs a inner join.

    Also post your execution plans in a SQLPLAN format so we can take a look.

  • Essentially, the 2 queries are the same.

    Fraid not ,

    if a value is used multiple time in Indexing for Col2 then you will get "duplicate" rows back

    select col2,count(*)

    from indexing

    group by col2

    having count(*) >1

    Will tell you which values are used multiple times.

    the exists clause only cares that there is at least one matching row.



    Clear Sky SQL
    My Blog[/url]

  • ...

    And consider the following selects

    select Col4 from Indexing2

    inner join Indexing

    on Indexing2.Col4 = Indexing.Col2

    where Col4 between '200' and '250'

    select Col4 from Indexing2

    where exists (select * from Indexing where Indexing.Col2 = Indexing2.Col4)

    and Col4 between '200' and '250'

    Essentially, the 2 queries are the same. Why does the 1st query return duplicates and the 2nd query doesn't?

    Also, the the execution plans are identical besides the 2nd query having a stream aggregate. Why is this being added?

    Thanks

    Your two query are essentially very different!

    Having the same filter in the WHERE clause, still your first query returns Col4 from Indexing2 for every matching record from Indexing, but the second query only returns one row from Indexing2, regardless how many matching records found in Indexing.

    If you want query one to behave same way as your query two you need to use DISTINCT in select.

    _____________________________________________
    "The only true wisdom is in knowing you know nothing"
    "O skol'ko nam otkrytiy chudnyh prevnosit microsofta duh!":-D
    (So many miracle inventions provided by MS to us...)

    How to post your question to get the best and quick help[/url]

  • anthony.green (12/19/2012)


    Sample data would be good as well

    Using the following sample data

    insert into Indexing values (199),(200),(210),(220),(230),(240),(250),(251)

    insert into Indexing2 (Col4) values (199),(200),(210),(220),(230),(240),(250),(251)

    I get no duplicates and now stream aggregate, the only difference is a left semi join on the exists vs a inner join.

    Also post your execution plans in a SQLPLAN format so we can take a look.

    Thanks

    I have 30Mil rows in this table

  • anthony.green (12/19/2012)


    Sample data would be good as well

    Using the following sample data

    insert into Indexing values (199),(200),(210),(220),(230),(240),(250),(251)

    insert into Indexing2 (Col4) values (199),(200),(210),(220),(230),(240),(250),(251)

    I get no duplicates and now stream aggregate, the only difference is a left semi join on the exists vs a inner join.

    Also post your execution plans in a SQLPLAN format so we can take a look.

    When I say duplicate, I mean that for Col4 there is more that one row with the same value. The inner join query is returning all those rows while the correlated query doesn't.

  • SQLSACT (12/19/2012)


    Hi All

    Some DDL

    --Table 1

    CREATE TABLE [dbo].[Indexing2](

    [Col1] [int] IDENTITY(1,1) NOT NULL,

    [Col2] [int] NULL,

    [Col3] [int] NULL,

    [Col4] [int] NULL

    )

    --Indexes

    CREATE UNIQUE CLUSTERED INDEX [CX] ON [dbo].[Indexing2] ([Col1])

    CREATE NONCLUSTERED INDEX [NCX] ON [dbo].[Indexing2] ([Col4])

    --Table2

    CREATE TABLE [dbo].[Indexing](

    [ID] [int] IDENTITY(1,1) NOT NULL,

    [Col2] [int] NULL

    )

    --Indexes

    CREATE UNIQUE CLUSTERED INDEX [CX] ON [dbo].[Indexing] ([ID])

    CREATE NONCLUSTERED INDEX [NCIX] ON [dbo].[Indexing] ([Col2])

    And consider the following selects

    select Col4 from Indexing2

    inner join Indexing

    on Indexing2.Col4 = Indexing.Col2

    where Col4 between '200' and '250'

    select Col4 from Indexing2

    where exists (select * from Indexing where Indexing.Col2 = Indexing2.Col4)

    and Col4 between '200' and '250'

    Essentially, the 2 queries are the same. Why does the 1st query return duplicates and the 2nd query doesn't?

    Also, the the execution plans are identical besides the 2nd query having a stream aggregate. Why is this being added?

    Thanks

    Plan Attached

  • SQLSACT (12/19/2012)

    Why does the 1st query return duplicates and the 2nd query doesn't?

    ... the 2nd query having a stream aggregate. Why is this being added?

    Thanks

    The first query does a JOIN; joins return all matching rows, whether duplicate matches or not.

    In the second query, you used EXISTS() rather than join. EXISTS() just verifies whether or not a row exists. Since SQL has chosen to use a JOIN to generate the final results, SQL uses the stream aggregate to eliminate duplicate join values, thus insuring that only a single match occurs for each join to the other table.

    SQL DBA,SQL Server MVP(07, 08, 09) "It's a dog-eat-dog world, and I'm wearing Milk-Bone underwear." "Norm", on "Cheers". Also from "Cheers", from "Carla": "You need to know 3 things about Tortelli men: Tortelli men draw women like flies; Tortelli men treat women like flies; Tortelli men's brains are in their flies".

  • You have two very different queries there (different in meaning), so it shouldn't be much of a surprise that they return different results.

    See http://sqlinthewild.co.za/index.php/2010/01/12/in-vs-inner-join/ (IN and EXISTS use the same mechanism)

    Gail Shaw
    Microsoft Certified Master: SQL Server, MVP, M.Sc (Comp Sci)
    SQL In The Wild: Discussions on DB performance with occasional diversions into recoverability

    We walk in the dark places no others will enter
    We stand on the bridge and no one may pass

Viewing 9 posts - 1 through 8 (of 8 total)

You must be logged in to reply to this topic. Login to reply