Data Transfer - data not transfered in order

  • hi there,

    I have a table with 371 751 rows.

    I have a standard SSIS package that transfers data from table A to table B. There are no fancy data manipulation. Yet when I trnasfer the data it doesn't end up in the database table in the same order?

    Example the first entry in the table A is ID "4930". Yet in table B the first ID is "142701". Is there a reason it doesn't transfer in the correct order.

    It's not the end of the world as all the data is transferring, it's just I don't understand the logic.

    Thanks

  • Hi,

    How are you identifying the "different order". There is no default "order by" in SQL Server. If you've got a clustered index on the table then the data is logically ordered by that within the table. Simply doing a select top x from blah with no explicit specification for the order by will select top x random rows.

    Thanks,

    Simon



    MCSE: Data Platform
    MCSE: Business Intelligence
    Follow me on Twitter: @WazzTheBadger
    LinkedIn Profile: Simon Osborne

  • Order isn't important, that's something for the visualization layer 🙂

    The order of the rows can depend on indexes on source and/or destination table and the query written to retrieve the rows.

    Need an answer? No, you need a question
    My blog at https://sqlkover.com.
    MCSE Business Intelligence - Microsoft Data Platform MVP

  • Thanks guys. It was more a reason I was after which you have both supplied. I won't concern myself with it, but will play about with some indexes in development.

    Thanks again

  • Ryan Keast (7/18/2013)


    Thanks guys. It was more a reason I was after which you have both supplied. I won't concern myself with it, but will play about with some indexes in development.

    Thanks again

    TL/DR: If it is important that the rows in a result set are in a certain order, the query MUST include an ORDER BY clause specifying that order, regardless of what indexes exist on the underlying tables.

    Hi Ryan,

    I'm concerned that maybe the previous answers to your question didn't flesh out the issue enough to keep you from wandering into problematic territory. The relational model defines a relation as an unordered set of tuples. In SQL terminology, a relation is represented as a table (and tuples as rows). The relation:table equivalency is not perfect, however, especially because a relational database has to store actual tables with actual rows instead of conceptual relations comprised of conceptual tuples, so it has to store the rows in some kind of order. A clustered index on the table will govern the logical order in which rows are stored, while tables with no clustered indexes, called heaps, will logically order rows in the order in which they were inserted.

    *However*, clustered indexes ONLY provide a logical order for storing rows and a means to seek for rows when the searching on the clustered index key. The existence of a clustered index does NOT in any way guarantee the order in which rows will be retrieved or returned from the table. In SQL Server, the query optimizer generates an execution plan that will be guaranteed to return all rows that meet the search conditions (and only those rows), but the rows will be returned in the order they leave the last operator in the execution plan. Since some execution plans require sorts of intermediate result sets, the final order of the rows may or may not correspond to the order of rows in a clustered index.

    Accordingly, if it is important that the rows in a result set are in a certain order, the query MUST include an ORDER BY clause specifying that order, regardless of what indexes exist on the underlying tables. That is the ONLY way to guarantee that rows in a result set are ordered in a specific manner. Some people are led astray by the fact that SQL Server often returns rows from a table in order of the clustered index key even in the absence of an ORDER BY clause, but it is merely coincidence that the rows are in that order when they leave the final operator in the execution plan. It can't be said too often - the only way to guarantee the order of rows in a result set is to include an ORDER BY clause in the query.

    Here endeth the lesson. 😀

    Jason

    Jason Wolfkill

  • I have some questions about your post wolfkillj.

    heaps, will logically order rows in the order in which they were inserted

    What do you mean by 'logically' in the above?

    *However*, clustered indexes ONLY provide a logical order for storing rows

    Are you sure? No physical order whatsoever?

    ...but it is merely coincidence that the rows are in that order when they leave the final operator in the execution plan

    Not coincidence. Under some circumstances (often select * from

    ), rows are returned in the order defined by the clustered PK, just because the SQL query optimiser decided that that was the best way to do it. It's not guaranteed, but it's not coincidence either.

    Note that I am not suggesting that ORDER BY should not be used, just picking up on the details in your message.

    The absence of evidence is not evidence of absence.
    Martin Rees

    You can lead a horse to water, but a pencil must be lead.
    Stan Laurel

  • Phil Parkin (7/22/2013)


    I have some questions about your post wolfkillj.

    heaps, will logically order rows in the order in which they were inserted

    What do you mean by 'logically' in the above?

    *However*, clustered indexes ONLY provide a logical order for storing rows

    Are you sure? No physical order whatsoever?

    ...but it is merely coincidence that the rows are in that order when they leave the final operator in the execution plan

    Not coincidence. Under some circumstances (often select * from

    ), rows are returned in the order defined by the clustered PK, just because the SQL query optimiser decided that that was the best way to do it. It's not guaranteed, but it's not coincidence either.

    Note that I am not suggesting that ORDER BY should not be used, just picking up on the details in your message.

    Hi Phil,

    I probably oversimplified my answers you quote to avoid getting *too* technical, as I only wanted to provide a high-level view of why the "rows from a clustered index will always be returned in key order" myth is incorrect.

    I did bungle the first bit about how rows are managed in heaps. There is no order, logical or otherwise, to rows in a heap, of course. What I should have said is that new rows will be inserted in the last allocated page of the heap and a "SELECT *" from a heap will return rows in page allocation order. As Paul Randal explains, this will not necessarily be the same order in which the rows were inserted, apparently because of the possibility that an updated row will be moved to a new page and a forwarding record left on the original page.

    The reason that a clustered index does not guarantee that rows will be in a "physical" order is similarly esoteric. Each data page includes a "row offset array" that gives the starting location of each row in the page, with each row designated to a particular "slot". While the rows on a page are assigned to slots in key order, the actual locations of the data for those rows on the page may not be in key order. This blog post[/url] demonstrates how this works. So strictly speaking, the data for rows in a clustered index will not always be physically stored in key order, although the data can only be out of order within a page.

    I used the word "coincidence" in an effort to drive home the point that a query may return rows in order of the clustered index key in the absence of an "ORDER BY" clause, not because the query syntax demands it, but because the optimizer chooses an execution plan that delivers the results in that order, as you point out. The optimizer is not constrained to return rows in order of the clustered index, it just happens to do so in this case. I suppose it may be more accurate to say that it's just a coincidence (from the user's point of view) that Microsoft designed the optimizer to behave like this. And, Microsoft makes no guarantees that it will maintain this behavior in future versions of SQL Server.

    We seem to agree that it would be an abominable practice to rely on this behavior in any code where the order of results actually matters. I was just trying to emphasize to those who may be mistakenly expecting rows to always be returned in order of the clustered index key that they should not rely on this behavior. What responsible developer wants to write queries that will *probably* return the results in the desired/required order?

    Thanks for your comments - I enjoy the discussion.

    Jason

    Jason Wolfkill

  • That's more like it, thanks for taking the time to clarify Jason.

    I understand that you were trying not to get too technical with your first answer & just wanted to make sure that we clarified a couple of the areas which I thought could have been misleading to those less-experienced SQL devs out there.

    The absence of evidence is not evidence of absence.
    Martin Rees

    You can lead a horse to water, but a pencil must be lead.
    Stan Laurel

  • wolfkillj (7/22/2013)


    So strictly speaking, the data for rows in a clustered index will not always be physically stored in key order, although the data can only be out of order within a page.

    The pages can be out of physical order too, it's called fragmentation (logical order of pages doesn't match physical order of pages)

    Gail Shaw
    Microsoft Certified Master: SQL Server, MVP, M.Sc (Comp Sci)
    SQL In The Wild: Discussions on DB performance with occasional diversions into recoverability

    We walk in the dark places no others will enter
    We stand on the bridge and no one may pass
  • GilaMonster (7/22/2013)


    wolfkillj (7/22/2013)


    So strictly speaking, the data for rows in a clustered index will not always be physically stored in key order, although the data can only be out of order within a page.

    The pages can be out of physical order too, it's called fragmentation (logical order of pages doesn't match physical order of pages)

    Indeed. The leaf node pages of a clustered index are linked in a doubly-linked list according to the clustering key, but the physical location of the pages allocated to the index may not follow that order.

    Would it be fair to say that this discussion yields at least two general principles?

    1. The organization of rows in pages, leaf node pages in doubly-linked lists, etc., and the allocation and management of pages for such purposes are functions of SQL Server's storage engine.

    2. A developer should write queries that return the desired results using the documented features of the T-SQL language and SQL Server's relational engine, not counting on the workings of the storage engine to affect the results in any way.

    Jason Wolfkill

  • Thank you all for your replys. I've now moved away from "heaping" my data and at least placed a CLUSTERED KEY on the id - which has now sorted it.

Viewing 11 posts - 1 through 10 (of 10 total)

You must be logged in to reply to this topic. Login to reply