Let's Talk Merge Join

  • This topic arose out of another topic I posted last week. I'm posting in a new thread, though, so I can get people who aren't reading the other thread to take a gander at this one.

    So, I'm new to Merge Join. I know that it accepts two inputs and both have to be sorted. I also know that using a UNION ALL before a Merge Join doesn't work because UNION kills the sort. So you have to add a SORT transformation between the two.

    Here are my questions. First, when you join your inputs to Merge Join, it asks if the input is Right or Left. Could I liken this to a T-SQL Join statement where the Left input is the table right after the FROM and the right input is the table right after the JOIN? Would this be a correct assumption?

    Secondly, when editing the Merge Join, do you have to select the join key as a pass-through column if you don't need it after this task?

    Thirdly, if I want to do the equivalent of the T-SQL code below, how do you code this into a Merge Join?

    SELECT t1.Col1, t1.Col2
    FROM MyFirstTable t1
    LEFT OUTER JOIN MySecondTable t2
        ON t1.ID = t2.ID
    WHERE t2.ID IS NULL

    I saw a post where someone used a LOOKUP to do this and used the Failure thread of the LOOKUP to populate a new table. But LOOKUP in this case has been causing me serious performance issues. Does anyone else have any other thoughts? Could Merge Join be used in the situation I'm describing?

    Thanks in advance for any thoughts,

    Brandie Tarvin, MCITP Database Administrator. LiveJournal Blog: http://brandietarvin.livejournal.com/. On LinkedIn!, Google+, and Twitter. Freelance Writer: Shadowrun; Latchkeys: Nevermore, Latchkeys: The Bootleg War, and Latchkeys: Roscoes in the Night are now available on Nook and Kindle.

  • It was me that prompted you to post again, so I hope you get some additional feedback, but I will give you some detail here.

    First question: yup, LEFT, RIGHT, and OUTER all mean the same things they would in a T-SQL join. Remember when you connected the data flow to it and it asked whether the first connection was the right or left input?

    Second question, does the join key need to be included? Nope. I had to check on this one, but the join keys don't even default to being included; they can be excluded from the output.

    Third question:

    1) Create an OLE DB Source for "SELECT ID, Col1, Col2 FROM MyFirstTable ORDER BY 1". Manually go into the Advanced Editor for this component, set the output as sorted, and set the sort key position for ID to 1. If ID is a string, you may want to UPPER and RTRIM it here as well.

    2) Create an OLE DB Source for "SELECT ID FROM MySecondTable ORDER BY 1". Manually go into the Advanced Editor for this component, set the output as sorted, and set the sort key position for ID to 1.

    3) Drag out a MERGE JOIN.

    4) Connect the output from the first OLE DB Source component to the MERGE JOIN and choose "Left" as the input.

    5) Connect the output from the second OLE DB Source component to the MERGE JOIN; "Right" is the only input remaining, so it will be taken automatically.

    6) Go into the MERGE JOIN and change the join type to "Left outer join".

    7) Check off the two fields you want to include in the output.

    A MERGE JOIN will outperform a LOOKUP if you can have the inputs pre-sorted and the "right" side of your situation is more than a couple thousand records.

    If you need to use the SORT component, you lose a lot of efficiency, but as soon as a LOOKUP has to cache a lot of records, they get really inefficient. Since both inputs on a MERGE JOIN are sorted, it can pretty efficiently find the next joining record - even with a FULL OUTER JOIN.

  • Actually, in step "7", you also need to include the second table's ID and then use a Conditional Split afterwards to keep only the rows where that ID is NULL.
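For what it's worth, the combined effect of the sorted MERGE JOIN (left outer) plus that Conditional Split can be sketched outside of SSIS. This is a minimal Python sketch of the idea, with the column layout assumed from the T-SQL earlier in the thread, not anything SSIS actually runs:

```python
def merge_left_anti_join(left_rows, right_ids):
    """Stream a left outer join over two ID-sorted inputs and keep only
    the left rows with no right-side match (the "NULL ID" rows the
    Conditional Split would let through)."""
    it = iter(right_ids)
    r = next(it, None)
    for row in left_rows:            # row = (ID, Col1, Col2), sorted by ID
        while r is not None and r < row[0]:
            r = next(it, None)       # advance the right side
        if r is None or r > row[0]:  # no match on the right => row survives
            yield row

left = [(1, "a", "x"), (2, "b", "y"), (4, "c", "z")]
right = [2, 3]
print(list(merge_left_anti_join(left, right)))  # IDs 1 and 4 survive
```

Note how each input is read exactly once, which is why pre-sorted inputs make the merge so cheap.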

  • I am with Michael on Merge Joins; they outperform Lookups if handled correctly.

    A little footnote on this: what you may want to do is delay the join slightly to avoid buffer overruns (still a bit too complex in my mind to explain to anyone).

    Alberto Ferrari uses a similar technique in TableDifference, which really makes sure it doesn't use too much memory:

    http://www.sqlbi.eu/Home/tabid/36/ctl/Details/mid/374/ItemID/0/Default.aspx (look at the Delay section)

    It's just an assumption on my part that something similar will assist with Merge Joins on huge result sets; maybe you want to try it and post the performance differences (my gut feel is that there will be quite a significant difference).

    Good luck!

    ~PD

  • My situation here is that I'm joining on no less than 6 tables to get my result set, with the final two tables being left outer joins that check for existing records. Yes, I know it's a little much, but the DB source and DW destination designs require it. (BTW, the Conditional Split was something I hadn't considered. Thanks, Michael!).

    So I'm playing with the Data Flow, and something in the MERGE JOIN says that the join columns and data types have to be the same... Which means, if I'm reading correctly, I have to add a SORT after each MERGE JOIN transformation on the columns I'm about to join on (since they change from table to table). Am I correct in this, or am I misunderstanding something?

    Anyway, I've got an initial ETL flow to test, which I'm going to start right now. I'll see if it takes less time than the 6 hours 36 minutes the proc-based data flow task was taking.


  • Yes, the join fields must be the same data type and the data sets must be in the same sort order, so you may have to do some sorting.

    This component:

    http://www.sqlbi.com/Projects/FlowSync/tabid/76/language/en-US/Default.aspx

    Will help with a MERGE JOIN running away with memory. The problem is that if you have two data flows that produce data at drastically different speeds, one side of your join has to queue up data in its buffer while it waits for the other side, which can cause memory to spike. The component at that link basically causes the faster data flow to slow down.
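That backpressure idea can be illustrated with a bounded queue. This is only a sketch of the principle (the buffer size and names are made up), not how the FlowSync component is actually implemented:

```python
import queue
import threading

# A bounded queue forces the faster side to wait: put() blocks once
# 100 items are in flight, so memory stays flat instead of spiking.
buf = queue.Queue(maxsize=100)

def fast_producer(n):
    for i in range(n):
        buf.put(i)      # blocks while the consumer lags behind
    buf.put(None)       # sentinel: end of stream

out = []
t = threading.Thread(target=fast_producer, args=(1000,))
t.start()
while (item := buf.get()) is not None:   # slow side drains at its own pace
    out.append(item)
t.join()
print(len(out))  # 1000
```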

    At that same site, there is an interesting TableDifference component that you may also be interested in. It is basically a MERGE JOIN and a Conditional Split combined into a single component.

    As far as your complicated join, you may find that the best performance you can get is to stage the data from all of the tables into a single database and use T-SQL to do your join. If you do this on the same server your data mart is located, you could do the LEFT JOIN at the same time and even insert your records directly from a T-SQL script or stored procedure.

  • Unfortunately, doing a separate staging DB is out of the question. We don't have the space for it. @=(

    Interesting site. Have you used these components before? How well do they integrate?


  • I just started using TableDifference yesterday, and it rocks all the way...

    Just watch out for sorting, trimming, and case sensitivity in your data.

    Didn't get time to test FlowSync, though...

    ~PD

  • It just happens I needed to use a merge join to exclude "Already Processed" records today. I have attached some screen prints of what I built.

    It is pre-release so I have noted a few places in which I will make some changes, but this may help you.

  • Well, my test Data Flow just completed. Even with 7 Sorts, 5 Merge Joins, 4 Conditional Splits and 2 Unions, it finished in 2 hours 32+ minutes, approximately 1/3 of the time that my stored procedure ran in.

    Wow.

    This is without the tools you recommended, and without tweaking any of the engine thread & buffer settings. Plain vanilla default package with one task, the Data Flow.

    I've attached an image for anyone who wants to see the mess. @=) And if you have any recommendations, I'd love to hear them. LO stands for Left Outer Join and IJ stands for Inner Join.

    And I'll be taking a look at your flow in a moment, Michael. Just saw the notification.


  • I guess I prefer to lay out data flows in a top-down tree format and you prefer this sort of everything-goes-everywhere havoc approach...

    The data flow looks pretty good to me. I don't think there is much you can do.

    One tip. In many cases, if you review a graphical execution plan for a query doing the same thing your data flow is, you want a MERGE JOIN where you would see a hash match and a LOOKUP where you would see a loop join. So, if the "right" side of any of your MERGE JOINS is a small recordset, you may be able to mix and match and get a bit better performance.
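In other words, a LOOKUP behaves like a hash probe against a cached small side. A rough Python sketch of why it wins when the right side is tiny (the names here are made up for illustration):

```python
def lookup_anti_join(left_rows, small_right_ids):
    """Cache the small right side once, then probe it per left row,
    like a LOOKUP redirecting non-matches to its no-match output."""
    cache = set(small_right_ids)   # one-time cache of the small side
    return [row for row in left_rows if row[0] not in cache]

left = [(1, "a"), (2, "b"), (3, "c")]
print(lookup_anti_join(left, [2]))  # [(1, 'a'), (3, 'c')]
```

Unlike the merge, this needs no sorting on either side, which is exactly the trade-off the execution-plan analogy points at.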

  • It's great to hear an actual success story.

    Very often people post their problems, and after someone posts the solution they don't come back and say whether it worked...

    Now for that TableDifference component, as well as getting the buffers evened out...

    ~PD

  • I actually had my data flow designed top-down, but I couldn't get it all to fit in one screen shot, so I moved everything left-to-right to get the capture. @=)

    Now I'm playing with putting this on other data flow tasks, but I'm not sure it's working as effectively. I'm definitely going to play with the Lookup for the smaller recordsets, I think. And see if maybe I can use a SQL query for my key matchups on the Producer & Time tables with the CertInfo table. That might work a little bit faster since all three tables are in the same db.

    Thanks for the advice, guys. I appreciate it.


  • I am going offline from SSC for about a month; I finally got some actual production source files to work with.

    Keep on posting if you find any new techniques, though. This is mighty interesting for performance.

    ~PD
