How to fix a non-indexed huge table?

  • Hi all,

I was given a database with a table that has 60,000,000 rows; the total .mdf file size is 16 GB.

The first thing I have to fix is that this huge table has no ID column, no primary key, and no index.

I tried to add an ID column, make it the primary key, and index it, but unfortunately it wasn't successful.

    What should I do?

Thanks a lot.

You should be able to add an integer column with the Identity property set to true on this table. Setting the identity property will not only generate the arbitrary IDs, it will add a unique clustered index and set it as the primary key as well (provided there is not already a PK or clustered index on the table).

    If it failed when you tried, it was most likely due to an Enterprise Manager timeout. It will take quite a bit of time to create this column and index. Perhaps 30-40 minutes depending on the speed of your system.

    If your EM times out, use an ALTER TABLE statement from Query Analyzer to do the job.
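Something along these lines would do it (a minimal sketch; the table, column, and constraint names are placeholders for your actual ones):

    -- names are placeholders; run from Query Analyzer so nothing times out
    ALTER TABLE dbo.YourBigTable
        ADD RowID int IDENTITY(1, 1)
        CONSTRAINT PK_YourBigTable PRIMARY KEY CLUSTERED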

  • A few questions:

    - Can you post the schema?

    - What did you try? and what was the error?

You need an index and a PK. Probably more than one index. Look at which columns are queried, particularly in WHERE clauses, and which might be joined to other tables.
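For illustration only (the right columns depend entirely on your actual query patterns; the names here are hypothetical):

    -- example: a nonclustered index on a column that shows up in WHERE clauses
    CREATE NONCLUSTERED INDEX IX_BigTable_SomeDate ON dbo.YourBigTable (SomeDateColumn)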

  • Thanks for reply. Here is the table:

    if exists (select * from dbo.sysobjects where id = object_id(N'[dbo].[IISlog]') and OBJECTPROPERTY(id, N'IsUserTable') = 1)

    drop table [dbo].[IISlog]

    GO

    CREATE TABLE [dbo].[IISlog] (

     [date] [datetime] NULL ,

     [time] [datetime] NULL ,

     [c-ip] [varchar] (255) COLLATE SQL_Latin1_General_CP1_CI_AS NULL ,

     [cs-username] [varchar] (255) COLLATE SQL_Latin1_General_CP1_CI_AS NULL ,

     [s-ip] [varchar] (255) COLLATE SQL_Latin1_General_CP1_CI_AS NULL ,

     [s-port] [smallint] NULL ,

     [cs-method] [varchar] (255) COLLATE SQL_Latin1_General_CP1_CI_AS NULL ,

     [cs-uri-stem] [varchar] (255) COLLATE SQL_Latin1_General_CP1_CI_AS NULL ,

     [cs-uri-query] [varchar] (4096) COLLATE SQL_Latin1_General_CP1_CI_AS NULL ,

     [sc-status] [smallint] NULL ,

     [cs(User-Agent)] [varchar] (4096) COLLATE SQL_Latin1_General_CP1_CI_AS NULL ,

     [cs(Referer)] [varchar] (4096) COLLATE SQL_Latin1_General_CP1_CI_AS NULL

    ) ON [PRIMARY]

    GO

  • "If it failed when you tried, it was most likely due to an Enterprise Manager timeout. It will take quite a bit of time to create this column and index. Perhaps 30-40 minutes depending on the speed of your system.

    If your EM times out, use an ALTER TABLE statement from Query Analyzer to do the job."

     

The table has over 63 million rows; it took hours to add a new column and still failed. A time-out? I don't know.

What do you mean by "an ALTER TABLE statement from Query Analyzer"? Does that add a new column immediately? I doubt it, because either way the system needs to modify over 63 million records.

Sorry, I didn't keep the error message. I was trying to export data from the old table to a new table with an indexed primary ID column; it took hours and still failed. The new table was created but has 0 rows in it.

Regardless of how I will use the table, the first thing I need to do to fix it is, I believe, to add an index and a PK. Basically the table needs a new ID column, which doesn't exist in the current schema. (How s** it is!)

This could take hours depending on your hardware. It will add an int column, set its identity property to true, make it a primary key, and create a unique clustered index.

    ALTER TABLE IISLog ADD IISLog_id INT IDENTITY CONSTRAINT column_IISLog_id_pk PRIMARY KEY

If you create a unique clustered index on this you will have a serious impact on the server, and probably gain nothing. The identity value you put on will have no bearing on any information in there beyond being a row number. I would personally avoid it, and certainly not cluster it; someone may infer that the row number has some relationship to the order of the data or something silly like that. You will also find that your transaction logs will grow dramatically, especially if you cluster it...

     

What kind of queries are you running against this table?

After two hours, it finished and returned:

    Warning: The table 'IISLog' has been created but its maximum row size (13625) exceeds the maximum number of bytes per row (8060). INSERT or UPDATE of a row in this table will fail if the resulting row length exceeds 8060 bytes.

    (63008970 row(s) affected)
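That 13,625-byte maximum follows directly from the declared column widths: roughly 5 × varchar(255) + 3 × varchar(4096) + 2 × datetime at 8 bytes + 2 × smallint at 2 bytes + the new 4-byte int ID = 13,587 bytes, plus a few dozen bytes of per-row overhead. That is well past the 8,060 bytes that fit on one data page, so SQL Server warns that it can't guarantee every row will fit.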

Original file sizes (KB):

mdf: 16,860,096

ldf: 1,008,000

New file sizes (KB):

mdf: 37,697,856

ldf: 28,333,632

Executing the following query against the indexed 63 million rows took 8 minutes. Is that OK or not good? The hardware is an Intel Xeon 2.4 GHz with 1 GB RAM (Task Manager / Performance shows four CPU Usage History windows; does that mean it has four CPUs?)

    select max(len([cs-uri-query])) as [max-cs-uri-query], max(len([cs(User-Agent)])) as [max-cs-User-Agent], max(len([cs(Referer)])) as [max-cs-Referer]  from iislog

     

    If the performance is not good, what should I do to optimize it?

    Thanks.

  • Here's the schema you posted...

    CREATE TABLE [dbo].[IISlog] (

     [date] [datetime] NULL ,

     [time] [datetime] NULL ,

     [c-ip] [varchar] (255) COLLATE SQL_Latin1_General_CP1_CI_AS NULL ,

     [cs-username] [varchar] (255) COLLATE SQL_Latin1_General_CP1_CI_AS NULL ,

     [s-ip] [varchar] (255) COLLATE SQL_Latin1_General_CP1_CI_AS NULL ,

     [s-port] [smallint] NULL ,

     [cs-method] [varchar] (255) COLLATE SQL_Latin1_General_CP1_CI_AS NULL ,

     [cs-uri-stem] [varchar] (255) COLLATE SQL_Latin1_General_CP1_CI_AS NULL ,

     [cs-uri-query] [varchar] (4096) COLLATE SQL_Latin1_General_CP1_CI_AS NULL ,

     [sc-status] [smallint] NULL ,

     [cs(User-Agent)] [varchar] (4096) COLLATE SQL_Latin1_General_CP1_CI_AS NULL ,

     [cs(Referer)] [varchar] (4096) COLLATE SQL_Latin1_General_CP1_CI_AS NULL

    ) ON [PRIMARY]

    GO

    Let's talk optimization...

    1. Why separate columns for Date and Time?  Total waste of time to do it this way.  If you really want to split them up, create a single date/time column and a couple of calculated columns to use just the date or just the time.
2. Throughout the table, I see "IP" columns... if these are for IP addresses, you can really decrease the size from 255 to 15.
    3. Same comment for UserName, Method, Stem columns.  You should be able to reduce them quite a bit.
    4. What is the Query column used for?  I'm really hoping it's not being used to store queries...
    5. Why on earth do you need 4096 characters to store the User-Agent and Referer????

Your goal should be to get the row size down so that a whole number of rows fits cleanly within 8K bytes (the other ~60 bytes per row get used for nullability markers, etc.). This will help a bit with speed when indexes are used and will certainly help with storage requirements. You just did a MAX on several columns; pick a slightly higher even number and make the columns that wide.
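To make that concrete, here is one possible reworked layout. Every width below is an assumption to be checked against the MAX(LEN(...)) results before you commit to it, and the COLLATE clauses are omitted (default database collation assumed):

    CREATE TABLE [dbo].[IISlog_new] (
     [log-datetime] [datetime] NULL ,                          -- replaces separate [date] and [time]
     [log-date] AS (CONVERT(char(10), [log-datetime], 120)) ,  -- computed: date part only
     [log-time] AS (CONVERT(char(8), [log-datetime], 108)) ,   -- computed: time part only
     [c-ip] [varchar] (15) NULL ,                              -- dotted IPv4 needs at most 15 chars
     [cs-username] [varchar] (64) NULL ,
     [s-ip] [varchar] (15) NULL ,
     [s-port] [smallint] NULL ,
     [cs-method] [varchar] (16) NULL ,
     [cs-uri-stem] [varchar] (255) NULL ,
     [cs-uri-query] [varchar] (1024) NULL ,
     [sc-status] [smallint] NULL ,
     [cs(User-Agent)] [varchar] (512) NULL ,
     [cs(Referer)] [varchar] (512) NULL
    ) ON [PRIMARY]
    GO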

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Hi Jeff,

I totally agree with you that the table is not well designed at all. I am just trying to fix the problem rather than find out who designed the database.

But my question is: what should I do before I shrink the database? Considering the table has millions of rows, I really want to make sure that any change to the database won't lose data and can finish successfully.

    Thanks.

  • Ok, I can tell you're just a little nervous about this whole thing and I don't blame you one little bit... so, here's the plan and why I asked the questions I did...

The first problem to fix towards increasing the performance is the design, so far as column widths go.  You've started to take the first step to ensure no data is lost: determining the max width of the data currently in the table.  You need to post those maximum widths for all the variable-length columns before we take another step.
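Something like this would get all of them in one pass (it just extends the MAX(LEN(...)) query you already ran to every varchar column in the posted schema):

    SELECT MAX(LEN([c-ip]))           AS [max-c-ip],
           MAX(LEN([cs-username]))    AS [max-cs-username],
           MAX(LEN([s-ip]))           AS [max-s-ip],
           MAX(LEN([cs-method]))      AS [max-cs-method],
           MAX(LEN([cs-uri-stem]))    AS [max-cs-uri-stem],
           MAX(LEN([cs-uri-query]))   AS [max-cs-uri-query],
           MAX(LEN([cs(User-Agent)])) AS [max-cs-User-Agent],
           MAX(LEN([cs(Referer)]))    AS [max-cs-Referer]
    FROM dbo.IISlog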

The second step would be to try those widths (plus a bit of a growth margin) by copying the table to another using SELECT/INTO... we would add the IDENTITY column at the same time.  If it fails or truncates something... no biggie, because we've done absolutely nothing to the original data.  We'll simply drop the new table and try again with larger column widths (it won't fail the first time if we do our homework).
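In sketch form it would look like this. The target name IISlog_new and every CONVERT width are assumptions; CONVERT to a shorter varchar truncates silently, so the widths must come from the max-length check above first:

    SELECT IDENTITY(int, 1, 1)                      AS [IISlog_id],
           [date], [time],
           CONVERT(varchar(15),   [c-ip])           AS [c-ip],
           CONVERT(varchar(64),   [cs-username])    AS [cs-username],
           CONVERT(varchar(15),   [s-ip])           AS [s-ip],
           [s-port],
           CONVERT(varchar(16),   [cs-method])      AS [cs-method],
           CONVERT(varchar(255),  [cs-uri-stem])    AS [cs-uri-stem],
           CONVERT(varchar(1024), [cs-uri-query])   AS [cs-uri-query],
           [sc-status],
           CONVERT(varchar(512),  [cs(User-Agent)]) AS [cs(User-Agent)],
           CONVERT(varchar(512),  [cs(Referer)])    AS [cs(Referer)]
    INTO   dbo.IISlog_new
    FROM   dbo.IISlog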

Once we get a new table fully populated, complete with an IDENTITY column, properly sized datatypes, and properly indexed, THEN we can put the new table in service in about 65 milliseconds by doing a snap rename of the old and new tables.  We will NEVER overwrite the original table structure or data until we're 2000% sure that the new table is correct.  Any data that may have been added to the old table between the time we copied it and the time we did the snap rename can easily be added to the new table from the old one... there won't be much.
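The snap rename itself is just two metadata operations (assuming the new table was built as IISlog_new, per the sketch above):

    EXEC sp_rename 'dbo.IISlog',     'IISlog_old'
    EXEC sp_rename 'dbo.IISlog_new', 'IISlog'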

    And THEN, once we're sure the new table is correct and we've validated that all rows and columns are present, accounted for, and correct, you're gonna get your DBA to do a full backup 'cause, right after that, we're gonna delete the old table.
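That last step, in sketch form (the database name and backup path are placeholders for whatever your DBA uses):

    BACKUP DATABASE YourLogDB TO DISK = 'D:\Backup\YourLogDB_Full.bak'
    DROP TABLE dbo.IISlog_old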

    And, I'm going to help you write the code to do it.  That's what I really had in mind... but I was trying to get some info from you, first.

    Now, about those questions I asked...

    "We're all in this together and I'm pullin' for ya!" --Red Green

     

    --Jeff Moden



  • Oh yeah... almost forgot... you need a complete list and script for any triggers that may exist on the table as well as a complete list of views that use the table as a source.  We'll need to reinstantiate the triggers and recompile all the views... EM normally takes care of all that but we're being super careful... I'd do that even if I were using EM to do this all.
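On this vintage of SQL Server (EM and Query Analyzer suggest 2000), something like this would produce those lists. Note that sp_depends reads sysdepends, which can miss objects created before the table, so treat it as a starting point rather than the final word:

    -- triggers defined on the table
    SELECT name FROM dbo.sysobjects
    WHERE  type = 'TR' AND parent_obj = OBJECT_ID(N'[dbo].[IISlog]')

    -- views and procedures that reference the table
    EXEC sp_depends 'dbo.IISlog'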

    --Jeff Moden



Jeff's got great advice here and I agree with him. If this server is used by only a few people it might be OK, but 1 GB of RAM is pretty low. Even if I were the only one who needed to access it, I'd be looking at 2 GB to get any good performance out of things.
