T-SQL rant

  • Greg Snidow (3/19/2009)


    Sergiy, the quote button is taking me to the wrong post, but you said

Except that varchar as a PK adds extreme overhead for data processing.

    in your last post. Do you know of any good references to learn more about what you are saying? I have a VARCHAR(20) column as the PK for many of my tables, and I never thought twice about it.

    Ohh!!

    I've known it since I don't remember when. I probably have that book from my relational database studies somewhere - but probably not; I've moved so many times, not to mention moving to the other side of the world...

    I'm gonna need to google for it.

    But you can get there by logic.

    Varchar has 2 bytes of overhead per value - length metadata.

    When you process any data, the computer needs to allocate memory for it first.

    For an int it's always 4 bytes.

    For a varchar, the CPU needs to read the metadata first and then allocate the corresponding amount of memory.

    It has to read the metadata and perform a dynamic memory allocation FOR EACH VALUE!

    When you join on a varchar (and joins are what a PK is about, right?) you double this overhead.

    The CPU compares varchar values sequentially, character by character. That requires an internal loop that runs until the point of difference is reached. Compare that to 4-byte int values, which a 32-bit CPU matches in a single operation.

    And, to finish the picture, varchar values need to be compared according to collation definitions. So before you even start, you need to read the collation rules and then convert every character according to those definitions.

    I don't think that's the complete picture, but for me it's quite enough.
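
    A quick sketch to see the variable length in action (note that DATALENGTH returns the data bytes only; the 2-byte length prefix sits on top of that inside the row):

    DECLARE @i int;
    DECLARE @v varchar(20);
    SET @i = 12345;
    SET @v = '12345';

    SELECT DATALENGTH(@i) AS int_bytes,     -- always 4
           DATALENGTH(@v) AS varchar_bytes; -- 5 here; differs per value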

    _____________
    Code for TallyGenerator

  • foxjazz (3/19/2009)


    When our users populate this data, they need to put it in a sequence.

    Actually, I am going to side with Foxjazz on this. There are a number of user-based applications where the business practices require that the user be able to impose an implicit and arbitrary order on the items within a collection.

    Examples include both "Editable" invoices and purchase orders. In these scenarios the document producer must be empowered to take the items to be included in the business document and reorder them as they see fit. Typically, such items are then assigned a "LineNo" (line number) to keep track of their position within the document.

    I have worked on both the client and server sides of this problem, and even on both simultaneously. The conclusion that I came to many years ago, and have never had reason to doubt, is that the problem of assigning and maintaining the line numbers should be implemented entirely by the same side that specifies the implicit order: the client side. The client side has to do this anyway (though it may be implicit) in order to interact correctly with the user, and it is a task that the client side is well suited to.

    However, I most often see developers fall into an ill-advised trap for the sake of performance. They try to get away with specifying the line number of only a single row at a time, on the assumption that the server should be able to make the same implicit corrections to all of the other line numbers in the group.

    IMHO, this is a mistake. It tends to be buggy and error-prone, and the performance advantage is close to nothing.

    Instead, the client should just figure out what all of the new line numbers should be at the end of the user's page-commit and then tell the server to change them all at once, explicitly passing the new line numbers from the client to the server.
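
    A minimal sketch of that approach on SQL Server 2008 (all the object names here are invented for illustration): the client computes every new line number and ships the whole set in one call via a table-valued parameter, and the server applies it as a single set-based UPDATE.

    CREATE TYPE dbo.LineNoList AS TABLE
    (
        OrderItemID int NOT NULL PRIMARY KEY,
        NewLineNo   int NOT NULL
    );
    GO
    CREATE PROCEDURE dbo.usp_RenumberOrderItems
        @OrderID int,
        @Lines   dbo.LineNoList READONLY
    AS
    BEGIN
        SET NOCOUNT ON;

        -- One set-based statement; no row-at-a-time shuffling on the server.
        UPDATE oi
        SET    oi.LineNo = l.NewLineNo
        FROM   dbo.OrderItem AS oi
        JOIN   @Lines        AS l ON l.OrderItemID = oi.OrderItemID
        WHERE  oi.OrderID = @OrderID;
    END

    On SQL Server 2005 and earlier, a staging table filled by the client does the same job.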

    [font="Times New Roman"]-- RBarryYoung[/font], [font="Times New Roman"] (302)375-0451[/font] blog: MovingSQL.com, Twitter: @RBarryYoung[font="Arial Black"]
    Proactive Performance Solutions, Inc.
    [/font]
    [font="Verdana"] "Performance is our middle name."[/font]

  • The batch solution is what I ended up doing with something like this several years ago. Worked quite well.

    - Gus "GSquared", RSVP, OODA, MAP, NMVP, FAQ, SAT, SQL, DNA, RNA, UOI, IOU, AM, PM, AD, BC, BCE, USA, UN, CF, ROFL, LOL, ETC
    Property of The Thread

    "Nobody knows the age of the human race, but everyone agrees it's old enough to know better." - Anon

  • Greg Snidow (3/19/2009)


    Do you know of any good references to learn more about what you are saying? I have a VARCHAR(20) column as the PK for many of my tables, and I never thought twice about it.

    [font="Verdana"]I can't say I know of a "good reference". However, when you think about it...

    For varchars, SQL Server does string comparisons. Those involve a whole heap of rules. For example, a letter may be considered equivalent whether it is in upper case, in lower case or has an accent. Also, depending on the language, the sort order and so on (all encoded in the COLLATION, by the way), the rules for how one string orders against another may change.

    Whereas if you are using numbers, not only are they more dense (for example, 4 bytes for an integer versus up to 22 bytes for your varchar(20), counting its 2-byte length prefix), which reduces the amount of data you have to copy between tables for relationships, but they also compare as a simple binary operation. So working with them (joining, sorting, looking up) is much quicker.

    Also, the use of numbers helps encourage using synthetic keys (rather than natural or business keys). Why is this important? Because natural and business keys change. Seriously. If you use a natural key across multiple tables, then suddenly you have to change all of those tables. Ouch. Whereas if you isolate the natural key in one table (for example, an invoice number) and use some sort of synthetic invoice ID everywhere else, then you only have to change an errant invoice number in one place.

    In SQL Server, synthetic keys also work well for clustered indexes. They direct all inserts into a table to the end of the clustered index, meaning SQL Server doesn't have to do lots of work reordering the data already in the table (which can involve a lot of I/O). It's less critical for Oracle (for example), which uses a very different space allocation scheme from SQL Server's, but if you want to get the most out of SQL Server (or Sybase!) then clustered indexes over synthetic ascending numeric keys (identities are good examples) are generally the way to go.

    Also remember that the more dense your data, the more effective the database cache is, and the less I/O your database server has to do. And it's I/O that is the real measure of how fast your database server runs. These days this is less of an issue, as memory is far less constrained. But even so, it's an easy win to use a dense form of key.

    It shouldn't be hard to construct a test case with comparing times for joins and lookups for one versus the other if you are particularly interested.
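
    For example, something along these lines (a rough sketch; all the object names are made up): build the same data keyed by int and by varchar(20), then compare the join timings.

    -- 100,000 rows of test data
    SELECT TOP (100000)
           n = ROW_NUMBER() OVER (ORDER BY (SELECT NULL))
    INTO   #Numbers
    FROM   sys.all_columns AS a CROSS JOIN sys.all_columns AS b;

    SELECT CONVERT(int, n)         AS ID INTO #ParentInt  FROM #Numbers;
    SELECT CONVERT(int, n)         AS ID INTO #ChildInt   FROM #Numbers;
    SELECT CONVERT(varchar(20), n) AS ID INTO #ParentChar FROM #Numbers;
    SELECT CONVERT(varchar(20), n) AS ID INTO #ChildChar  FROM #Numbers;

    CREATE UNIQUE CLUSTERED INDEX cx ON #ParentInt  (ID);
    CREATE UNIQUE CLUSTERED INDEX cx ON #ChildInt   (ID);
    CREATE UNIQUE CLUSTERED INDEX cx ON #ParentChar (ID);
    CREATE UNIQUE CLUSTERED INDEX cx ON #ChildChar  (ID);

    -- Compare the CPU and elapsed times reported for the two joins
    SET STATISTICS TIME ON;
    SELECT COUNT(*) FROM #ParentInt  AS p JOIN #ChildInt  AS c ON c.ID = p.ID;
    SELECT COUNT(*) FROM #ParentChar AS p JOIN #ChildChar AS c ON c.ID = p.ID;
    SET STATISTICS TIME OFF;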


  • Bruce W Cassidy (3/19/2009)


    [font="Verdana"]One common business case for generating sequences is for invoice numbers. For some reason, customers like to see their invoice numbers being counted just for them. So invoice numbers are often in the form: [XXXX]-[YYYYY] where X is related to the customer number and Y is the number of the invoice counting from one for that customer.


    Invoice numbers, parts, organizations, human names, etc. are all from another world; they are not part of the data model.

    They originate from a specific source and are built and verified according to the rules of the human world.

    The database should take them and store them as they are.

    No internal processing should be bound to them.

    _____________
    Code for TallyGenerator

  • Sergiy, as far as the bands go: how could I use identity and keep to the rule of source bands?

    Source 1 data must have PK numbers between 1 and 10 million; if that ever changed, it would break many of the processes in our database. We also have a table that defines those bands, and processes that look to those tables and use them depending on the data source those processes target.

    As for the varchar(50) PK numbers, there is no other way to do it.

    We do it by computer name for a reason: our sync processes depend on that data being uploaded to our server and synced back down to all the other clients. Our sync architecture requires it to be this way.

  • Sergiy (3/19/2009)


    Invoice numbers, parts, organizations, human names, etc. are all from another world; they are not part of the data model.

    They originate from a specific source and are built and verified according to the rules of the human world.

    The database should take them and store them as they are.

    No internal processing should be bound to them.

    [font="Verdana"]Um... purist versus pragmatist.

    There's often a need to bulk-generate invoices. And to bulk-calculate invoices. And to bulk this and bulk that. Database-level logic is a good place to do that. Whereas if you move it off the database, suddenly you have to retrieve a lot of data, deliver it to the application, let it churn through it, and then deliver a lot of data back to the database. Inefficient, to say the least.

    The architecture of where to place business logic should vary depending on the type of work being done. You want to leverage the strengths of each platform. And for bulk operations, the database level is an ideal candidate.

    For example, billing databases usually do their work in bulk at the database level. It just makes sense. And they are creating invoices all the time.
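
    For example (a sketch only; the table and column names here are invented), a night's batch of invoices can be numbered in that [XXXX]-[YYYYY] style in one set-based statement:

    WITH NewInvoices AS
    (
        SELECT  i.InvoiceID, i.CustomerID, i.InvoiceNo,
                Seq = ROW_NUMBER() OVER (PARTITION BY i.CustomerID
                                         ORDER BY i.InvoiceID)
        FROM    dbo.Invoice AS i
        WHERE   i.InvoiceNo IS NULL     -- the batch being numbered
    )
    UPDATE  n
    SET     n.InvoiceNo =
                RIGHT('0000'  + CONVERT(varchar(10), n.CustomerID), 4) + '-'
              + RIGHT('00000' + CONVERT(varchar(10), n.Seq + ISNULL(m.MaxSeq, 0)), 5)
    FROM    NewInvoices AS n
    LEFT JOIN
           (SELECT CustomerID,
                   MaxSeq = MAX(CONVERT(int, RIGHT(InvoiceNo, 5)))
            FROM   dbo.Invoice
            WHERE  InvoiceNo IS NOT NULL
            GROUP BY CustomerID) AS m
            ON m.CustomerID = n.CustomerID;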

    Real world.


  • foxjazz (3/19/2009)


    As for the varchar(50) PK numbers, there is no other way to do it.

    We do it by computer name for a reason: our sync processes depend on that data being uploaded to our server and synced back down to all the other clients. Our sync architecture requires it to be this way.

    [font="Verdana"]Have you considered using GUIDs? (Uniqueidentifier in SQL Server parlance).[/font]

  • foxjazz (3/19/2009)


    Sergiy, as far as the bands go: how could I use identity and keep to the rule of source bands?

    Source 1 data must have PK numbers between 1 and 10 million

    WHY???

    If that ever changed, it would break many of the processes in our database. We also have a table that defines those bands, and processes that look to those tables and use them depending on the data source those processes target.

    Because YOU built the processes this way.

    I would never do it like that.

    I have also built systems where processing rules depend on the data source.

    But why should that affect the PK value?

    _____________
    Code for TallyGenerator

  • I think this is the query I arrived at for getting the next PK value.

    -- Highest numeric suffix so far for keys generated on this server
    -- (key format: <servername><number>)
    SELECT @currentval = MAX(CONVERT(int,
               SUBSTRING(GenLevelAddFieldID, LEN(@@SERVERNAME) + 1, 10)))
    FROM   tblLevelAddFieldRel
    WHERE  GenLevelAddFieldID LIKE @@SERVERNAME + '%'

  • Because when we compare the primary data source with the data we distribute, it's compared completely within those limits. And for the other data sources, we only need to add them and rarely have to modify them. The primary data source gets compared every night, and any changes to that table are reflected by the primary key comparison.

    The primary key is related to other tables needed to follow other process comparisons.

  • Foxjazz,

    If you need to use "bands" of IDENTITY values, you can work around it by adding a Band column to your table and letting SQL Server deal with the IDENTITY key. And as was mentioned before, a LineNo is the normal way of forcing a specific order. Using the script already posted to make the LineNo values consecutive lets you do it all in one efficient fell swoop.
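
    A sketch of that idea (all names here are hypothetical): the band becomes a plain attribute with its own lookup table, and the IDENTITY key stays a meaningless surrogate.

    CREATE TABLE dbo.Band
    (
        BandID tinyint     NOT NULL CONSTRAINT PK_Band PRIMARY KEY,
        Descr  varchar(50) NOT NULL
    );

    CREATE TABLE dbo.SourcedData
    (
        RowID   int IDENTITY(1,1) NOT NULL
            CONSTRAINT PK_SourcedData PRIMARY KEY CLUSTERED,
        BandID  tinyint NOT NULL
            CONSTRAINT FK_SourcedData_Band REFERENCES dbo.Band (BandID),
        Payload varchar(100) NULL
    );

    -- Processes that used to test the key range, e.g.
    --     WHERE PK BETWEEN 1 AND 10000000
    -- filter on the attribute instead:
    --     WHERE BandID = 1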

  • J, thanks for the reply. I don't think it's safe to go changing production tables to just have an identity PK.

  • Foxjazz:

    If you have millions of records, I think it would be well worth it. To allow testing without risking the production database, you could make a copy of the database and work on that development copy. Having a development database should ease what are obviously LEGITIMATE concerns about monkeying with a production database.

    With millions of records, an automatically generated int or bigint primary key covered by a clustered index is an essential step in ensuring performance. Since SQL Server does not care about the physical sequence of records within a table, there is no value in trying to fabricate a sequence intended only for display or printing. Besides, if you are still on SQL 2000, you WILL, sooner or later, have to upgrade to SS2K5 and then SS2K8, and that includes the production database.
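
    On the development copy, adding such a key can be as simple as this sketch (it reuses the table name from the earlier post and assumes the existing varchar primary key is nonclustered, or has been rebuilt as nonclustered first):

    ALTER TABLE dbo.tblLevelAddFieldRel
        ADD RowID int IDENTITY(1,1) NOT NULL;

    CREATE UNIQUE CLUSTERED INDEX CX_tblLevelAddFieldRel_RowID
        ON dbo.tblLevelAddFieldRel (RowID);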

    Hey, Forum! Anyone out there with experience in performing surgery on a live production database?

    Also, you should use GUIDs only as a last resort. I saw something about them being more prone to index fragmentation and requiring the comparison of 38 characters instead of a 4-byte int. And GUIDs, by design, can never be consecutive. The one instance where I use GUIDs is when several users share the same table. The new records are added, and the tables have an OwnerID varchar(38) column which allows users to execute stored procedures on the tables without mixing their data with someone else's. As an example, take a table intended to print several invoices at once. The application allows the user to make a multi-select, and the InvoiceID along with a GUID is used by the stored procedure which gathers the data for the Crystal Report. Once the report is printed, the records with the specified OwnerID can be deleted without preventing other users from printing a different batch of invoices.
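
    A sketch of that pattern (object names invented): each print run stamps its rows with the caller's GUID, so concurrent users never touch each other's batches.

    CREATE TABLE dbo.InvoicePrintBatch
    (
        OwnerID   uniqueidentifier NOT NULL,
        InvoiceID int              NOT NULL,
        CONSTRAINT PK_InvoicePrintBatch PRIMARY KEY (OwnerID, InvoiceID)
    );

    -- 1. The application generates one GUID per print run and inserts the
    --    user's multi-selected invoices with it.
    -- 2. The report's stored procedure reads only
    --        WHERE OwnerID = @OwnerID
    -- 3. Cleanup after printing:
    --        DELETE dbo.InvoicePrintBatch WHERE OwnerID = @OwnerID;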

  • Hey, Forum! Anyone out there with experience in performing surgery on a live production database?

    Absolutely!!! The operation was successful, but the patient died!
