When is too many columns too many columns?

  • Sergiy - Tuesday, September 4, 2018 9:04 PM

    RonKyle - Tuesday, September 4, 2018 8:22 PM

    Well, maybe transactional support and data integrity control could be named as advantages, but I bet you can find (NOLOCK) in every query, which eliminates those advantages.

    If you're talking about my designs, you just lost your bet.  I do not allow NOLOCK except in very controlled ETL situations involving staging tables.

    I have no idea about your designs.
    I can only discuss the design posted by the OP.

    When they update one of 100 statuses in a record, they re-write the whole row.
    Which means the row is exclusively locked.
    900 columns are not likely to fit into 1 page.
    So, the lock goes beyond a single page lock.
    With so many columns they need quite a number of indexes to have their queries completed before the end of the day.
    Pages occupied by those indexes will also be locked.
    Updating a record of 900 columns takes time.
    Considering the overhead caused by rows exceeding a single page, it is rather a long time.
    Every record has 100 statuses, which means it has to be updated at least 100 times.
    Concurrent updates will increase the locking time exponentially.
    And if SELECT queries have applied shared locks, the updates have to stop and wait until the shared locks are released, holding other pages under X locks in the meantime.
    Which blocks other SELECT queries, as they cannot apply their S locks to the already X-locked pages.

    It all leads to an avalanche of locking and deadlocking, with the only way out being (NOLOCK) all over the place.

    Well, there is another way.
    Accumulate all updates into scheduled batches and run them all in a single transaction with TABLOCKX, whatever time it takes to complete.
    Then for the rest of the day the table is effectively read-only, so (NOLOCK) is not required.
    But I doubt it's the case here.
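    (A rough sketch of that scheduled-batch approach, using hypothetical names such as dbo.BigFact and dbo.StatusUpdates_Staging, just to show the shape of it:)

    -- Hypothetical sketch: all accumulated status changes are applied in one
    -- transaction holding an exclusive table lock, so daytime readers never
    -- compete with piecemeal row-by-row updates.
    BEGIN TRANSACTION;

    UPDATE bf
    SET    bf.Status01 = su.NewStatus
    FROM   dbo.BigFact AS bf WITH (TABLOCKX)
    JOIN   dbo.StatusUpdates_Staging AS su
           ON su.FactID = bf.FactID;

    -- ...similar statements for the other status columns...

    COMMIT TRANSACTION;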

    900 columns are not likely to fit into 1 page.
    So, the lock goes beyond a single page lock.

    All SQL rows must fit on one (in-row) page, period.  Overflow pages can be used for max cols and certain varchars, but only if the single main page has room to store all the other columns plus the pointer(s) to the overflow page.

    SQL DBA,SQL Server MVP(07, 08, 09) "It's a dog-eat-dog world, and I'm wearing Milk-Bone underwear." "Norm", on "Cheers". Also from "Cheers", from "Carla": "You need to know 3 things about Tortelli men: Tortelli men draw women like flies; Tortelli men treat women like flies; Tortelli men's brains are in their flies".

  • ScottPletcher - Wednesday, September 5, 2018 11:34 AM

    All SQL rows must fit on one (in-row) page, period.  Overflow pages can be used for max cols and certain varchars, but only if the single main page has room to store all the other columns plus the pointer(s) to the overflow page.

    That's exactly what Sergiy was talking about, though.  In such cases, the lock goes beyond a single page.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Jeff Moden - Wednesday, September 5, 2018 11:38 AM

    All SQL rows must fit on one (in-row) page, period.  Overflow pages can be used for max cols and certain varchars, but only if the single main page has room to store all the other columns plus the pointer(s) to the overflow page.

    That's exactly what Sergiy was talking about, though.  In such cases, the lock goes beyond a single page.

    I don't think so.  I think he was talking about the main row crossing pages.  But in SQL Server, that's just not allowed.

    The datatypes of the columns were stated as ints, datetimes, etc., which cannot be pushed to overflow pages, at least as I understand how SQL works.

    SQL DBA,SQL Server MVP(07, 08, 09) "It's a dog-eat-dog world, and I'm wearing Milk-Bone underwear." "Norm", on "Cheers". Also from "Cheers", from "Carla": "You need to know 3 things about Tortelli men: Tortelli men draw women like flies; Tortelli men treat women like flies; Tortelli men's brains are in their flies".

  • ScottPletcher - Wednesday, September 5, 2018 12:04 PM

    I don't think so.  I think he was talking about the main row crossing pages.  But in SQL Server, that's just not allowed.

    The datatypes of the columns were stated as ints, datetimes, etc., which cannot be pushed to overflow pages, at least as I understand how SQL works.

    I absolutely agree with that second sentence.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Jeff Moden - Wednesday, September 5, 2018 12:21 PM

    The datatypes of the columns were stated as ints, datetimes, etc., which cannot be pushed to overflow pages, at least as I understand how SQL works.

    I absolutely agree with that second sentence.

    If you create a table where there is a possibility that a row won't fit, because the values could exceed SQL Server's limit, you get a warning.
    I'm not sure what happens if you try to create a table, using only column types that can't be shifted off-row, that won't fit on a page?
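    A quick way to see how close a table definition sits to that limit is to total the declared maximum column widths.  A sketch, with dbo.YourWideTable as a placeholder name:

    -- Sum the declared maximum in-row widths of a table's columns to see how
    -- close the definition is to the 8,060-byte row limit.  max_length = -1
    -- means a (MAX) type, which is stored off-row; it is counted as 16 bytes
    -- here as a rough approximation of its in-row pointer.
    SELECT  SUM(CASE WHEN c.max_length = -1 THEN 16 ELSE c.max_length END) AS declared_max_bytes
    FROM    sys.columns AS c
    WHERE   c.object_id = OBJECT_ID(N'dbo.YourWideTable');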

  • Jonathan AC Roberts - Wednesday, September 5, 2018 12:49 PM

    If you create a table where there is a possibility that a row won't fit, because the values could exceed SQL Server's limit, you get a warning.
    I'm not sure what happens if you try to create a table, using only column types that can't be shifted off-row, that won't fit on a page?

    As I understand it, you will get an error and SQL will not create the table.  I believe that's the point of the warning SQL gave earlier.

    SQL DBA,SQL Server MVP(07, 08, 09) "It's a dog-eat-dog world, and I'm wearing Milk-Bone underwear." "Norm", on "Cheers". Also from "Cheers", from "Carla": "You need to know 3 things about Tortelli men: Tortelli men draw women like flies; Tortelli men treat women like flies; Tortelli men's brains are in their flies".

  • Jonathan AC Roberts - Wednesday, September 5, 2018 12:49 PM

    If you create a table where there is a possibility that a row won't fit, because the values could exceed SQL Server's limit, you get a warning.
    I'm not sure what happens if you try to create a table, using only column types that can't be shifted off-row, that won't fit on a page?

    Just tried it; SQL wouldn't create the table at all:
    Msg 1701, Level 16, State 1, Line 2
    Creating or altering table 'TEST_LENGTH' failed because the minimum row size would be 9528, including 76 bytes of internal overhead. This exceeds the maximum allowable table row size of 8060 bytes.
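    For anyone who wants to reproduce that class of failure, a minimal sketch (the exact byte counts in the message will differ from the ones above):

    -- Two fixed-width CHAR(8000) columns guarantee a minimum row size well over
    -- 8,060 bytes, so the CREATE TABLE itself fails with Msg 1701.
    CREATE TABLE dbo.TEST_LENGTH
    (
        c1 CHAR(8000) NOT NULL,
        c2 CHAR(8000) NOT NULL
    );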

  • ZZartin - Wednesday, September 5, 2018 1:52 PM

    Jonathan AC Roberts - Wednesday, September 5, 2018 12:49 PM

    Just tried it; SQL wouldn't create the table at all:
    Msg 1701, Level 16, State 1, Line 2
    Creating or altering table 'TEST_LENGTH' failed because the minimum row size would be 9528, including 76 bytes of internal overhead. This exceeds the maximum allowable table row size of 8060 bytes.

    If it's not possible the row could ever fit on a page, then SQL won't even attempt to create the table.  But there are cases where it depends on the length of the data actually inserted; then SQL issues a warning, and it only fails if you actually try to insert too much data to fit.
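    For the data-dependent case, one way to see how big the rows actually get, as opposed to how big the definition allows them to be, is the record-size columns of sys.dm_db_index_physical_stats.  A sketch, with dbo.YourWideTable as a placeholder:

    -- Largest and average row actually stored; the record-size columns are
    -- only populated in SAMPLED or DETAILED mode.
    SELECT  index_id,
            alloc_unit_type_desc,
            max_record_size_in_bytes,
            avg_record_size_in_bytes
    FROM    sys.dm_db_index_physical_stats
            (DB_ID(), OBJECT_ID(N'dbo.YourWideTable'), NULL, NULL, 'DETAILED');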

    SQL DBA,SQL Server MVP(07, 08, 09) "It's a dog-eat-dog world, and I'm wearing Milk-Bone underwear." "Norm", on "Cheers". Also from "Cheers", from "Carla": "You need to know 3 things about Tortelli men: Tortelli men draw women like flies; Tortelli men treat women like flies; Tortelli men's brains are in their flies".

  • I was talking about this:

    https://docs.microsoft.com/en-us/previous-versions/sql/sql-server-2008-r2/ms186981(v=sql.105)

    A table can contain a maximum of 8,060 bytes per row. In SQL Server 2008, this restriction is relaxed for tables that contain varchar, nvarchar, varbinary, sql_variant, or CLR user-defined type columns. The length of each one of these columns must still fall within the limit of 8,000 bytes; however, their combined widths can exceed the 8,060-byte limit. This applies to varchar, nvarchar, varbinary, sql_variant, or CLR user-defined type columns when they are created and modified, and also to when data is updated or inserted.

    Row-Overflow Considerations

    When you combine varchar, nvarchar, varbinary, sql_variant, or CLR user-defined type columns that exceed 8,060 bytes per row, consider the following:

    Surpassing the 8,060-byte row-size limit might affect performance because SQL Server still maintains a limit of 8 KB per page. When a combination of varchar, nvarchar, varbinary, sql_variant, or CLR user-defined type columns exceeds this limit, the SQL Server Database Engine moves the record column with the largest width to another page in the ROW_OVERFLOW_DATA allocation unit, while maintaining a 24-byte pointer on the original page. Moving large records to another page occurs dynamically as records are lengthened based on update operations. Update operations that shorten records may cause records to be moved back to the original page in the IN_ROW_DATA allocation unit. Also, querying and performing other select operations, such as sorts or joins on large records that contain row-overflow data slows processing time, because these records are processed synchronously instead of asynchronously.

    Therefore, when you design a table with multiple varchar, nvarchar, varbinary, sql_variant, or CLR user-defined type columns, consider the percentage of rows that are likely to flow over and the frequency with which this overflow data is likely to be queried. If there are likely to be frequent queries on many rows of row-overflow data, consider normalizing the table so that some columns are moved to another table. This can then be queried in an asynchronous JOIN operation.

    The length of individual columns must still fall within the limit of 8,000 bytes for varchar, nvarchar, varbinary, sql_variant, and CLR user-defined type columns. Only their combined lengths can exceed the 8,060-byte row limit of a table.
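    A sketch of how to check whether a given table has actually pushed anything into a ROW_OVERFLOW_DATA allocation unit (dbo.YourWideTable is a placeholder; LOB_DATA units map through partition_id rather than hobt_id, so they are not covered here):

    -- Page counts per allocation unit for one table; a non-zero used_pages
    -- figure for ROW_OVERFLOW_DATA means rows have spilled off the main page.
    SELECT  au.type_desc,
            SUM(au.total_pages) AS total_pages,
            SUM(au.used_pages)  AS used_pages
    FROM    sys.allocation_units AS au
    JOIN    sys.partitions AS p
            ON p.hobt_id = au.container_id
    WHERE   p.object_id = OBJECT_ID(N'dbo.YourWideTable')
    GROUP BY au.type_desc;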

    _____________
    Code for TallyGenerator

  • The datatypes of the columns were stated as ints, datetimes, etc., which cannot be pushed to overflow pages, at least as I understand how SQL works

    We know that for 200 columns of statuses.

    But there are 700 other columns we cannot be sure about.

    I simply assumed there must be string-type columns, more than one of them, which could cause rows to exceed the 8,060-byte limitation.

    Of course, I could not be certain about it; that's why I used the word "likely".

    Let's ask the OP if my guess was correct.

    _____________
    Code for TallyGenerator

    Without having it right in front of me, I'm willing to say that almost all of the columns are either int, datetime, numeric, float or money.  The design pushes everything into a different dimension ... and I mean everything.  So instead of storing a Y or N, it stores a -1, 0 or 1, which you have to join with a specific dimension to convert -1, 0, 1 to NULL, N or Y.  Dates are stored as int in yyyymmdd format, and time is stored as seconds elapsed from midnight (if no time is required, it shows 0, indicating NULL).  Some dimensions work because they are repeatedly used (i.e., -1, 0, 1, 2 translating to NA, Black, Red, Blue) but some dimensions are 1:1 because of custom freeform notes.  As in, NoteID 178 ties to another dimension which has NoteID 178 and the actual Note.  There is no default Note because notes are free-form, so you have thousands of NoteIDs which have an actual Note of (blank) and/or whitespace ... not NULL.

    Almost every column has a dimension to go with it... unless it is an actual factual numeric value.  Factual text gets converted to a dimension even in the smallest of cases such as the Y or N.
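    A tiny sketch of the decode pattern described above, with hypothetical table and column names:

    -- A Y/N flag stored as -1 / 0 / 1 in the fact table has to be joined to a
    -- dimension just to get back NULL / 'N' / 'Y' for display.
    SELECT  f.SomeFactID,
            yn.DisplayValue                  -- NULL, 'N' or 'Y'
    FROM    dbo.WideFact AS f
    JOIN    dbo.DimYesNo  AS yn
            ON yn.YesNoKey = f.SomeFlagKey;  -- -1, 0 or 1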

  • Kevlarmpowered - Wednesday, September 5, 2018 6:51 PM

    This isn't a shot at you personally.  It's a shot at the whole misbegotten concept and you're definitely not the only one to do this type of thing.

    I guess I don't understand why people do things like this.  For example, the fact that you're storing a -1 means that it's at least a SMALLINT, which is 2 bytes instead of just one for a CHAR(1) holding Y, N, or NULL, not to mention that you're doing an otherwise unnecessary translation for display purposes.

    You're also storing dates as INTs which means that you've lost all temporal calculation capabilities unless you first do the calculation to convert it back to a temporal datatype.  You've done the same with the time by storing it as seconds instead of a temporal datatype.  And, both columns combined into a single temporal datatype take the same number of bytes (or less) as the two INTs and have the ability to quickly and easily be converted to multiple display types without destroying the ability to do temporal math when needed without having to do a shedload of conversions.

    IMHO, data warehouses aren't supposed to be "this is the way I want to display it" warehouses nor should they be made inflexible by morphing temporal datatypes into INTs, which is then all they can be until you do some awful conversion or some lookup in a Dimension table that has been equally crippled.

    I'll just never understand why people think this type of thing is efficient, clever, or even useful.  It's like they're compelled to throw themselves back in time to before temporal data types existed.  Worse yet, it's almost always a duplication of data that is already and actually efficient, doesn't need to be clever (it's damned clever in its own right already), and is incredibly useful as it is.
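    A small sketch of the difference, with hypothetical table and column names.  With a native DATETIME column the temporal math is direct; with the yyyymmdd INT plus seconds-from-midnight design, the same question needs a conversion first:

    -- Native DATETIME columns: temporal math is a single expression.
    SELECT DATEDIFF(HOUR, f.OrderedAt, f.ShippedAt) AS HoursToShip
    FROM   dbo.Orders AS f;

    -- INT-encoded date (yyyymmdd) plus seconds-from-midnight: the datetime has
    -- to be rebuilt before any temporal math can happen.
    SELECT DATEDIFF(HOUR,
               DATEADD(SECOND, f.OrderedTimeSec,
                   CONVERT(DATETIME, CONVERT(CHAR(8), f.OrderedDateInt), 112)),
               DATEADD(SECOND, f.ShippedTimeSec,
                   CONVERT(DATETIME, CONVERT(CHAR(8), f.ShippedDateInt), 112))) AS HoursToShip
    FROM   dbo.WideFact AS f;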

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

    I'd never do it with date and time values; it just becomes impossible to query in a useful fashion, as you say, and I can't see any situation that would benefit.  Similarly for "simple" fields where just storing the value directly in the table would take less storage than a key to an alternate lookup table.

    I have done something similar with a dimension containing just dates that are then referenced elsewhere using int values. I do however store the date within that dimension as an actual DateTime (it's still on 2005, yes there is a check constraint to keep it date only!) so temporal queries aren't crippled and the dimension table also contains a bunch of additional columns for First Working Day of the Week/Month, Public Holidays, etc to simplify the logic for such things (as well as making them indexable). The int values are arbitrary though (well strictly they aren't, but to all intents and purposes they are treated as such) to avoid people trying to do weird maths on them - the dimension table exists for a reason.
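    A sketch of that kind of dimension, with hypothetical names; since SQL Server 2005 has no DATE type, the date-only rule is enforced by checking that the value sits exactly at midnight:

    CREATE TABLE dbo.DimDate
    (
        DateKey                INT      NOT NULL PRIMARY KEY,  -- arbitrary surrogate
        CalendarDate           DATETIME NOT NULL,
        IsPublicHoliday        BIT      NOT NULL,
        FirstWorkingDayOfWeek  DATETIME NOT NULL,
        FirstWorkingDayOfMonth DATETIME NOT NULL,
        -- Keeps CalendarDate date-only by forcing the time portion to midnight.
        CONSTRAINT CK_DimDate_DateOnly
            CHECK (CalendarDate = DATEADD(DAY, DATEDIFF(DAY, 0, CalendarDate), 0))
    );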

  • andycadley - Wednesday, September 5, 2018 10:19 PM

    If I correctly understand what you're saying, I've worked on systems such as you describe, Andy.  The ones that use an INT to reference a Date Dimension.  I've never understood why the INT abstraction of dates in fact tables is necessary.  I can see why it may have been used back on older 32-bit systems (memory and disk size limitations) but I can't see it anymore.  If you're troubleshooting directly in the data, you can't tell what the dates are just by looking at them.  You always have to use a function or a join to figure out the dates, and you certainly can't use the INTs directly for temporal math.  You always have to refer to the Date Dimension table.

    And, if it's a requirement that cannot be avoided (and I don't see why it can't), rather than using a sequence of INTs starting at (usually) 1, the least they could do is use INTs that are based on the underlying date serial number so that you can use a simple function to view the date in fact tables rather than having to do a join.

    I do totally agree that the separation of date/times into separate columns leads to many unnecessary programming chores that could easily be avoided.  I can sometimes see a pair of persisted computed columns based on the task at hand but I never destroy the original combination column.  I also never cripple such a column by using any datatype other than DATETIME.  The meager byte savings just isn't worth giving up the ability to do the simple direct date math that can be done with that data-type.  DATETIME2, DATE, and TIME data-types have been seriously computationally crippled compared to DATETIME.
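    A sketch of the serial-number idea, assuming the DATETIME epoch of 1900-01-01: the key is just the day count, so a one-line expression recovers the date without touching the dimension table:

    -- Day count since 1900-01-01 as the surrogate key...
    DECLARE @DateKey INT = DATEDIFF(DAY, 0, CONVERT(DATETIME, '20180905', 112));   -- 43346

    -- ...which converts straight back to the date, no join required.
    SELECT DATEADD(DAY, @DateKey, CAST(0 AS DATETIME)) AS RecoveredDate;           -- 2018-09-05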

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Jeff Moden - Wednesday, September 5, 2018 7:47 PM

    Oh, I agree with you... I didn't build this DWH.  I have built a few DWHs and I have worked in others that I have purchased, and never have I ever seen one like this.  I came asking questions here because the team who built it promises me it was built the best way possible, so much so that they had me doubting my experience.  So I travel the internet trying to learn more and to see and ask the opinions of others.  I am trying to understand and walk in their shoes, if you will... I have already learned that some people are OK with these super-wide tables, where I have never been a fan.  Super-wide tables do serve a purpose, but I think not in this way.  I'd rather build out many tables and provide super-wide views if they were really necessary.

    This current design would have you joining the date dimension and time dimension 140 times to convert INTs to date/time.  I find it easier/faster just to try_convert(date, try_convert(varchar, nullif(int, 19000101))), because the added overhead of the joins is slower in the tests I have done.  Plus, if the thing only has one status, you essentially have 139 copies of 19000101.  You have to use try_convert because you will find ints that don't make logical dates... how this happens I do not know, e.g., 20180231 (February 31st?), and this is all before you can use the actual temporal calculations that are natively built in...

    I pulled the datatypes this morning... I was right on the money in my previous guess with the exception of many more money columns than I would have thought.
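    Spelled out, the inline decode described above looks like this (column names are hypothetical; 19000101 is the "no date" sentinel from the design above):

    -- TRY_CONVERT returns NULL for impossible values such as 20180231 instead
    -- of raising a conversion error, and NULLIF removes the 19000101 sentinel.
    SELECT  f.SomeFactID,
            TRY_CONVERT(DATE,
                TRY_CONVERT(VARCHAR(8), NULLIF(f.SomeDateInt, 19000101))) AS SomeDate
    FROM    dbo.WideFact AS f;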
