Why bcp import a file out of order in SQL 2008?

  • Hi guys

    There is an application in my company which imports files inside several tables on database. However when there are simultaneous files, a lot of them import registers out of order.

    i guess there is some configuration to avoid importing files wrong order in bcp.

    How import a file exactly the original file in bcp?

    best regards.

  • Two things to consider:

    - there's no such thing as a "physical order" in a DB. if you want to be able to track the ordering bacsed on when things were inserted, you will likely need an identity (auto-numbered) column.

    - The default setting in BCP is to presume that the data file is unordered. In short that means that the physical order itself has no meaning during the import and the import can import the stuff in any order is most efficient.

    You might care to check out some of the BCP/BULK INSERT options to help with that. One of those would be the ORDER clause for BCP (which would help keep your auto-numbering consistent during the import):

    http://msdn.microsoft.com/en-us/library/ms177468.aspx

    ----------------------------------------------------------------------------------
    Your lack of planning does not constitute an emergency on my part...unless you're my manager...or a director and above...or a really loud-spoken end-user..All right - what was my emergency again?

  • Matt Miller (#4) (11/1/2011)


    Two things to consider:

    - there's no such thing as a "physical order" in a DB. if you want to be able to track the ordering bacsed on when things were inserted, you will likely need an identity (auto-numbered) column.

    - The default setting in BCP is to presume that the data file is unordered. In short that means that the physical order itself has no meaning during the import and the import can import the stuff in any order is most efficient.

    You might care to check out some of the BCP/BULK INSERT options to help with that. One of those would be the ORDER clause for BCP (which would help keep your auto-numbering consistent during the import):

    http://msdn.microsoft.com/en-us/library/ms177468.aspx

    I'm not sure that that is necessarily true. There is in fact a "physical order", however when you use the SELECT statement, it is not guaranteed to return the results in that order without an order by clause. If you want the order of the results when you query the data to be specific you must ALWAYS use ORDER BY. Technically, if there is a clustered index the data will be "physically" on a page in that order. Again, that does not mean that the results will always return in that order if the query optimizer determines that it is faster to return them in a different order.

    Thanks,

    Jared

    Jared
    CE - Microsoft

  • jared-709193 (11/1/2011)


    Matt Miller (#4) (11/1/2011)


    Two things to consider:

    - there's no such thing as a "physical order" in a DB. if you want to be able to track the ordering bacsed on when things were inserted, you will likely need an identity (auto-numbered) column.

    - The default setting in BCP is to presume that the data file is unordered. In short that means that the physical order itself has no meaning during the import and the import can import the stuff in any order is most efficient.

    You might care to check out some of the BCP/BULK INSERT options to help with that. One of those would be the ORDER clause for BCP (which would help keep your auto-numbering consistent during the import):

    http://msdn.microsoft.com/en-us/library/ms177468.aspx

    I'm not sure that that is necessarily true. There is in fact a "physical order", however when you use the SELECT statement, it is not guaranteed to return the results in that order without an order by clause. If you want the order of the results when you query the data to be specific you must ALWAYS use ORDER BY. Technically, if there is a clustered index the data will be "physically" on a page in that order. Again, that does not mean that the results will always return in that order if the query optimizer determines that it is faster to return them in a different order.

    Thanks,

    Jared

    The data inside of a data page isn't physically ordered. Since you always have to pull a full page, there's no reason to do that. The pages themselves aren't ordered either: they could be spread all over the physical disks (or RAM structures) you're using.

    The IAM pages keep track of which data pages are allocated/assigned to which tables: they maintain the "logical order". Again - no attempt to physically order the pages on disk, either.

    Gail writes much better about such things. She has a good description of how it works over here (it's not the main topic of discussion in the post, but the description is still there):

    http://sqlinthewild.co.za/index.php/2011/02/01/is-a-clustered-index-best-for-range-queries/[/url]

    Most importantly (on this specific topic) the physical order of the data within a BCP source may or may not be the actual order in which the data is entered into the table. Even with an ORDER clause, the only thing you're guaranteeing is that the numbers assigned by an IDENTITY column will match the sort from the ORDER clause; it makes no claim as to physical order.

    This is Conor's post on the INSERT order (Conor was one of the primary devs on this aspect of SQL).

    http://blogs.msdn.com/b/sqltips/archive/2005/07/20/441053.aspx

    ----------------------------------------------------------------------------------
    Your lack of planning does not constitute an emergency on my part...unless you're my manager...or a director and above...or a really loud-spoken end-user..All right - what was my emergency again?

  • Matt Miller (#4) (11/1/2011)


    jared-709193 (11/1/2011)


    Matt Miller (#4) (11/1/2011)


    Two things to consider:

    - there's no such thing as a "physical order" in a DB. if you want to be able to track the ordering bacsed on when things were inserted, you will likely need an identity (auto-numbered) column.

    - The default setting in BCP is to presume that the data file is unordered. In short that means that the physical order itself has no meaning during the import and the import can import the stuff in any order is most efficient.

    You might care to check out some of the BCP/BULK INSERT options to help with that. One of those would be the ORDER clause for BCP (which would help keep your auto-numbering consistent during the import):

    http://msdn.microsoft.com/en-us/library/ms177468.aspx

    I'm not sure that that is necessarily true. There is in fact a "physical order", however when you use the SELECT statement, it is not guaranteed to return the results in that order without an order by clause. If you want the order of the results when you query the data to be specific you must ALWAYS use ORDER BY. Technically, if there is a clustered index the data will be "physically" on a page in that order. Again, that does not mean that the results will always return in that order if the query optimizer determines that it is faster to return them in a different order.

    Thanks,

    Jared

    The data inside of a data page isn't physically ordered. Since you always have to pull a full page, there's no reason to do that. The pages themselves aren't ordered either: they could be spread all over the physical disks (or RAM structures) you're using.

    The IAM pages keep track of which data pages are allocated/assigned to which tables: they maintain the "logical order". Again - no attempt to physically order the pages on disk, either.

    Gail writes much better about such things. She has a good description of how it works over here (it's not the main topic of discussion in the post, but the description is still there):

    http://sqlinthewild.co.za/index.php/2011/02/01/is-a-clustered-index-best-for-range-queries/[/url]

    Most importantly (on this specific topic) the physical order of the data within a BCP source may or may not be the actual order in which the data is entered into the table. Even with an ORDER clause, the only thing you're guaranteeing is that the numbers assigned by an IDENTITY column will match the sort from the ORDER clause; it makes no claim as to physical order.

    I understand that the pages are not in physical order, but "logical" based on the pointers. However, it is my limited understanding based on http://msdn.microsoft.com/en-us/library/ms189051.aspx that "The data rows are stored in order based on the clustered index key." The key words here being "stored in order." Maybe you can point me to a better understanding?

    My question to the OP: How did you determine the "order" of your stored/inserted data? Are you inserting a DATETIME or an identity? Or are you simply doing a SELECT * FROM and assuming that the order of the stored data is that order?

    My assumption is that you are importing the data from the files 'as is' and not adding any columns during the bcp. This then leads me to believe that you are guessing that the order of your results is in fact the order of the data. Is this correct?

    Thanks,

    Jared

    Jared
    CE - Microsoft

  • I think Gail had posted that they are logically stored in the clustered index order, and not necessarily physically stored once you look inside a page.; there was a post that took a heap, that added a clustered index that was technically a different order; the result was it was so small a table( i think it fit in a page or two) that the physical rows were in a different order than the clustered index, which once again reinforced the reminder that an ORDER BY statement is required...

    i even changed my signature to remind myself...you cannot depend or assume the clustered index will return the order expected.

    Lowell


    --help us help you! If you post a question, make sure you include a CREATE TABLE... statement and INSERT INTO... statement into that table to give the volunteers here representative data. with your description of the problem, we can provide a tested, verifiable solution to your question! asking the question the right way gets you a tested answer the fastest way possible!

  • Lowell (11/1/2011)


    I think Gail had posted that they are logically stored in the clustered index order, and not necessarily physically stored once you look inside a page.; there was a post that took a heap, that added a clustered index that was technically a different order; the result was it was so small a table( i think it fit in a page or two) that the physical rows were in a different order than the clustered index, which once again reinforced the reminder that an ORDER BY statement is required...

    i even changed my signature to remind myself...you cannot depend or assume the clustered index will return the order expected.

    So, if the rows are not in a physical order on a page for a clustered index then what is used as an "index" on that page? For example, if rows are out of order on page 1, page 2, etc. Then it seems to me that the entire database/index is out of order. I suppose I don't understand how the logical ordering of pages makes any difference if the ordering on a page is off. Also, if there is a logical order to the rows on a page, that would have to be stored somewhere as well, which again defeats the point of a clustered index. What am I missing?

    Thanks,

    Jared

    EDIT: Removed a reference to offset

    Jared
    CE - Microsoft

  • i may be misremembering the thread; i'm searching thru my posts now to find it; i was fiddling with trying to make an example, but the clustred index i'm creating is always putting it in order, even with page splits. so far;

    my workspace for reference:

    SELECT 2 AS ID,1 As ORDERBY INTO MyHEAP UNION ALL

    SELECT 7 AS ID,2 As ORDERBY UNION ALL

    SELECT 3 AS ID,3 As ORDERBY UNION ALL

    SELECT 8 AS ID,4 As ORDERBY UNION ALL

    SELECT 4 AS ID,5 As ORDERBY

    --SQL2008

    SELECT %%physloc%%,ID,ORDERBY FROM MyHEAP

    --SQL2005

    SELECT%%LockRes%%,ID,ORDERBY FROM MyHEAP

    /*

    (No column name)IDORDERBY

    0xE20000000100000021

    0xE20000000100010072

    0xE20000000100020033

    0xE20000000100030084

    0xE20000000100040045

    */

    CREATE CLUSTERED INDEX IX_CLUSTER_ID ON MyHEAP(ID)

    /*

    (No column name)IDORDERBY

    0xE30000000100000021

    0xE30000000100010033

    0xE30000000100020045

    0xE30000000100030072

    0xE30000000100040084

    */

    insert into MyHEAP

    SELECT 2 AS ID,6 As ORDERBY

    insert into MyHEAP

    SELECT 2 AS ID,7 As ORDERBY

    Lowell


    --help us help you! If you post a question, make sure you include a CREATE TABLE... statement and INSERT INTO... statement into that table to give the volunteers here representative data. with your description of the problem, we can provide a tested, verifiable solution to your question! asking the question the right way gets you a tested answer the fastest way possible!

  • jared-709193 (11/1/2011)


    Lowell (11/1/2011)


    I think Gail had posted that they are logically stored in the clustered index order, and not necessarily physically stored once you look inside a page.; there was a post that took a heap, that added a clustered index that was technically a different order; the result was it was so small a table( i think it fit in a page or two) that the physical rows were in a different order than the clustered index, which once again reinforced the reminder that an ORDER BY statement is required...

    i even changed my signature to remind myself...you cannot depend or assume the clustered index will return the order expected.

    So, if the rows are not in a physical order on a page for a clustered index then what is used as an "index" on that page? For example, if rows are out of order on page 1, page 2, etc. Then it seems to me that the entire database/index is out of order. I suppose I don't understand how the logical ordering of pages makes any difference if the ordering on a page is off. Also, if there is a logical order to the rows on a page, that would have to be stored somewhere as well, which again defeats the point of a clustered index. What am I missing?

    Thanks,

    Jared

    EDIT: Removed a reference to offset

    Let's try an example to better illustrate it:

    your key values are from 1 to 10, and each page can hold 3 records (assuming no fragmentation):

    - logical page 1 contains the records 1-3 in some order

    - logical page 2 contains records 4-6 in some order

    etc...

    Why? because those rows might have been inserted or updated on some random order. since you have to pull the entire 8K page during a read, all three will be in memory, so the reordering is lightning fast and isn't necessary until the ORDER BY is required.

    The same thing applies in some way to pages: as you enter records that fit somewhere in the middle of a table's key values, a page split would occur to allow the clustered index to maintain the order. if the pages were assigned in a purely physical order it would create a HUGE ripple effect on the storage engine, just to add a single row.

    Sorry if it's taking a while, but Paul Randal used to have a wild description of this (Paul ran the team that built thge storage engine). I am trying to put my hands on the link.

    ----------------------------------------------------------------------------------
    Your lack of planning does not constitute an emergency on my part...unless you're my manager...or a director and above...or a really loud-spoken end-user..All right - what was my emergency again?

  • found it! using the wrong key words will get you lost each time.

    What's a page?

    ----------------------------------------------------------------------------------
    Your lack of planning does not constitute an emergency on my part...unless you're my manager...or a director and above...or a really loud-spoken end-user..All right - what was my emergency again?

  • Matt Miller (#4) (11/1/2011)


    jared-709193 (11/1/2011)


    Lowell (11/1/2011)


    I think Gail had posted that they are logically stored in the clustered index order, and not necessarily physically stored once you look inside a page.; there was a post that took a heap, that added a clustered index that was technically a different order; the result was it was so small a table( i think it fit in a page or two) that the physical rows were in a different order than the clustered index, which once again reinforced the reminder that an ORDER BY statement is required...

    i even changed my signature to remind myself...you cannot depend or assume the clustered index will return the order expected.

    So, if the rows are not in a physical order on a page for a clustered index then what is used as an "index" on that page? For example, if rows are out of order on page 1, page 2, etc. Then it seems to me that the entire database/index is out of order. I suppose I don't understand how the logical ordering of pages makes any difference if the ordering on a page is off. Also, if there is a logical order to the rows on a page, that would have to be stored somewhere as well, which again defeats the point of a clustered index. What am I missing?

    Thanks,

    Jared

    EDIT: Removed a reference to offset

    Let's try an example to better illustrate it:

    your key values are from 1 to 10, and each page can hold 3 records (assuming no fragmentation):

    - logical page 1 contains the records 1-3 in some order

    - logical page 2 contains records 4-6 in some order

    etc...

    Why? because those rows might have been inserted or updated on some random order. since you have to pull the entire 8K page during a read, all three will be in memory, so the reordering is lightning fast and isn't necessary until the ORDER BY is required.

    The same thing applies in some way to pages: as you enter records that fit somewhere in the middle of a table's key values, a page split would occur to allow the clustered index to maintain the order. if the pages were assigned in a purely physical order it would create a HUGE ripple effect on the storage engine, just to add a single row.

    Sorry if it's taking a while, but Paul Randal used to have a wild description of this (Paul ran the team that built thge storage engine). I am trying to put my hands on the link.

    I'll have to wait for the article because it still doesn't make sense to me 🙂 f your pages of data can hold substantially more data 3 rows and they are all out of order I would think this would be a Nightmare. Must be some info that I am missing. Thanks for your help with helping me to understand!

    Jared

    EDIT: Late to respond... I will read this when I get home. Thanks!

    Jared
    CE - Microsoft

  • Matt Miller (#4) (11/1/2011)


    found it! using the wrong key words will get you lost each time.

    What's a page?

    Ok, so as I understand from this article... The slot array is basically an index for the page. Is this correct? If so, Is this slot array the same that Microsoft is now referring to as "row offset" at the end of a page? So let's say we have 3 rows with key 1, 4, 6 and out of order (assuming order is asc) and with space between. Each row is a byte for argument sake. You have key 1 in slot 1 (according to the article the 1st in order on the page IS always in slot 1), a 2 byte gap, key 6, a 1 byte gap, and key 4. Sow the row offset would be (from right to left) 1, 6, 4 where these are not the keys, but the number of bytes from the start of the page for each row?

    Am I getting close?

    Thanks,

    Jared

    Jared
    CE - Microsoft

  • jared-709193 (11/2/2011)


    Matt Miller (#4) (11/1/2011)


    found it! using the wrong key words will get you lost each time.

    What's a page?

    Ok, so as I understand from this article... The slot array is basically an index for the page. Is this correct? If so, Is this slot array the same that Microsoft is now referring to as "row offset" at the end of a page? So let's say we have 3 rows with key 1, 4, 6 and out of order (assuming order is asc) and with space between. Each row is a byte for argument sake. You have key 1 in slot 1 (according to the article the 1st in order on the page IS always in slot 1), a 2 byte gap, key 6, a 1 byte gap, and key 4. Sow the row offset would be (from right to left) 1, 6, 4 where these are not the keys, but the number of bytes from the start of the page for each row?

    Am I getting close?

    Thanks,

    Jared

    Yes - that's how I understand it. The slot array is the index for the page content: it is essentially an ordered directory structure (so it points to exactly where stuff is within the page but slots are assigned in order of the clustering key).

    ----------------------------------------------------------------------------------
    Your lack of planning does not constitute an emergency on my part...unless you're my manager...or a director and above...or a really loud-spoken end-user..All right - what was my emergency again?

  • Matt Miller (#4) (11/2/2011)


    jared-709193 (11/2/2011)


    Matt Miller (#4) (11/1/2011)


    found it! using the wrong key words will get you lost each time.

    What's a page?

    Ok, so as I understand from this article... The slot array is basically an index for the page. Is this correct? If so, Is this slot array the same that Microsoft is now referring to as "row offset" at the end of a page? So let's say we have 3 rows with key 1, 4, 6 and out of order (assuming order is asc) and with space between. Each row is a byte for argument sake. You have key 1 in slot 1 (according to the article the 1st in order on the page IS always in slot 1), a 2 byte gap, key 6, a 1 byte gap, and key 4. Sow the row offset would be (from right to left) 1, 6, 4 where these are not the keys, but the number of bytes from the start of the page for each row?

    Am I getting close?

    Thanks,

    Jared

    Yes - that's how I understand it. The slot array is the index for the page content: it is essentially an ordered directory structure (so it points to exactly where stuff is within the page but slots are assigned in order of the clustering key).

    Thanks so much for bringing me to this understanding. I find that the more I understand the real workings of SQL Server, the better DBA I can be 🙂

    Thanks,

    Jared

    Jared
    CE - Microsoft

  • ulisseslourenco (11/1/2011)


    Hi guys

    There is an application in my company which imports files inside several tables on database. However when there are simultaneous files, a lot of them import registers out of order.

    i guess there is some configuration to avoid importing files wrong order in bcp.

    How import a file exactly the original file in bcp?

    best regards.

    You can do it by setting batch size to 1. Use the -b option for bcp.exe. Use the BATCHSIZE option for BULK INSERT. Here is a thread that describes the issue in detail:

    http://www.sqlservercentral.com/Forums/FindPost1127805.aspx

    There are no special teachers of virtue, because virtue is taught by the whole community.
    --Plato

Viewing 15 posts - 1 through 15 (of 17 total)

You must be logged in to reply to this topic. Login to reply