Index or not? Database with millions of rows

  • We're building a database that includes the Federal Do Not Call telephone number list. This particular table will end up having millions of rows, each one containing only a unique telephone number (one column, BIGINT). Most of the use of the table will be to look up a single number and determine if it is or is not in the table.

    Should the table be defined with the phone number as Primary Key? Will the increase in storage required be offset by the performance (let's assume a real-time online web app that checks the table for a specific number). Are other indexing options more appropriate?

    I've only worked with relatively small tables in the past so Primary Key indexes have been fine since the extra storage required to maintain the Clustered index hasn't been an issue. In this case it may be, in addition to the overhead of loading the entire table periodically from the Fed's updated list and subsequently (or in parallel) having to rebuild the index.

    I'd appreciate any feedback from those of you who have worked with very large tables, particularly where the table structure itself is so simple, it's only the volume of data that creates the question.

    Thanks in advance.

     

  • You need index on the table for both speed and data integrity reasons.

  • I understand that an index is necessary for performance, but does it need to be a Primary Key index (therefore clustered) or will a non-clustered index work as well? If I understand the difference between the two a clustered index requires significantly more disk space since the data is actually stored in the index as well as in the original table, while a non-clustered index doesn't copy the data to the index. Do I have that right?

    How does indexing improve data integrity in what is essentially a read-only table? (I know I didn't include that info in my original post... The table is completely reloaded monthly from a CSV file provided by the FTC. Users only read the table to locate a specific phone number. We'd write a stored procedure that does the lookup and returns a return code indicating whether or not the number is present in the table.)

     

     

  • You are confusing CONSTRAINTS with INDEXES.

    First of all a PRIMARY KEY is a CONSTRAINT that sits on top of a UNIQUE INDEX.  It is used in conjunction with FOERIGN KEY constraints to allow SQL Server maintain referencial integerty between tables.

    What you need to do is to create a UNIQUE CUSTERED INDEX which will do 2 things.

    First of all, because it's clustered it will order the data in the ascending order of the Index Value.  Second it will create a b-tree structure above the data (Root, Non-Leafed and Leafed blocks) with the actual data row in the Leafed Blocks.  This means that the data can be found very fast.

    If you look up PRIMARY KEY CONSTRAINTS you will find that you can create a PRIMARY KEY after the table has been created and indexed.  Lookup ALTER TABLE for the specifics.  You'll see the FOREIGN KEYS are set up through ALTER TABLE too.

    Of course you can create both PRIMARY KEY and FOREIGN KEY CONSTRAINTS in the CREATE TABLE statment too.

    I hope this helps.

  • Since a PK creates a Unique Clustered Index I generally use that as a shortcut in EM when I have a table with no relationships to other tables. I understand that a PK is by definition a constraint, but it does create the appropriate index and if there is no requirement for joins or relationships with other tables, why not just right click on the column and set it as a PK?

    That aside, I'm still concerned about the size of the database (OK, so it's millions with an "M", not billions with a "B") but I have to reload this table every month from scratch so the index has to be rebuilt from scratch. This is a lot of extra overhead in the load process as well as the additional storage to keep the actual data row in the index (Does this mean the actual data row RESIDES in the index, or is a COPY of the data row in the index?)

    What are the performance and storage implications if I create the table with a non-clustered unique index vs. a clustered unique index? Access to the table will always be the following: "SELECT phone from FEDERAL WHERE phone = nnnnnnnnnn", looking for a specific phone number. Basically I just need to know if the number exists in the table or not. Will a clustered index provide a significant enough improvement in performance to justify the extra storage vs. a non-clustered index?

     

  • Clustered indexes are the way to go when queries will return a single row.  Non-clustered indexes are better if retrieving a range of values.  I am pretty sure the clustered index actually stores the DATA in the index. 

  • If this is a single field table with an index on that field you have an interesting issue.

    I believe if you index that field (non clustered) and search only on it, you will have a covering index. So it will never really touch the data !!!!!

    If a query can be satisfied from the index, it is refered to as a covering index, and no access to the data is required.  Sometimes it can be beneficial to include an extra field or 2 on an index if that will allow it to satisfy a heavily used query.

    I would think this would improve the load time significantly.  Although I would have to see if it was quicker to load with the index, or load and then create the index ????

    I may try this over the weekend.

    How many million, 50, 70, ????


    KlK

  • I would defenetly have an index. Probably go with a non-clusterd index if and only if the file you will be importing is allready orderd in asendenig order. (it really wont matter if its PK or not, but keep in mind that PK is by default clusterd, but not nesesaraly).

    Also think on disabling the index during the batch delete, and upload process) you certanly multiply serveral time you upload process.

  • I would index this as a single clustered index. It will have a high degree of selectivity and perform quite well. A previous post is correct clustered index leaf contains the actual data page. If you just put a non-clustered index on this table you will incur a bookmark lookup everytime. It will be fast but it is still an additional step to retreve the data.

    When bulk loading or deleting from this table drop the index do your bulk jobs and then create the index again. It will serve to speed up the load process by not logging the insert if you have your reovery mode set to bulk-logged and rebuilding the index makes sure it is clean and has zero fragmentation issues from the insert or delete job.

    Wes

  • The data is already in ascending order when it's being loaded. The database is 1.7GB and the single column is only one BIGINT, so if you do the math that's a lot of rows!!

    I ran a Select for a single number with the clustered index and without:

       With Clustered Index:     <1 second
       Without Clustered Index: 9:06 (yes, that's 9 minutes and 6 seconds)

    So I guess it goes without saying I'm going to need SOME kind of index!!

    I'm going to try adding a non-clustered index just to see what the performance is like for that. I'll post the results.

     

  • The space Overhead for the clustered index is so small, that it is never a factor. It depends on the number of fields in the key,  their lengh etc, but is usually less than 5 %. In my case I have 600 million, 50 GB table that has 4 fields in the clustered index (total lengh of the fields: 20 bytes) and overhead from the clustered index is less than 1 GB. I read in your mail that you were concerned about where the data is saved with respective to the clustered index. The answer is in the clustered index itself. The overhead we are talking about is coming from the B tree index structure. As already stated this has to be negligible due to the nature of the B tree structure. There is no duplicate storage involved. On the other side, non clustered index is ALWAYS extra storage and is CERTAINLY more than the overhead from the clustered index. This is because all your non clustered indexes also contain the all the fields from clustered index (in the case of my table those 4 fields with 20 bytes). In fact, the space considerations for the clustered index should never be a factor, I just follow the simple rule, A table should HAVE a clustered index (of course temp tables may be exempted).

    Now after all that babble, back to your question, your table would have the clustered index on the phone number. The caveat on the space is that, when your table gets fragmented, you may need to drop and recreate the clustered index, and during this process, the space requirement is about 125 % free space. So, for my 50 GB table, I would need about 65 GB free space to reorganize the table. Of course, the 65 GB free space (and more) would be returned back to the DB when the index is created. So, include that in your estimates.

  • 55 million rows to be exact.....

    So it appears that the response for a single select with a non-clustered unique index is somewhere in the area of <= 2 seconds. The response time for a clustered unique index is <= 1 second. The database size is pretty much the same regardless of which type of index I build, probably because the table only has one column and that's the column that's indexed regardless of whether it's clustered or not so the overhead remains essentially the same.

    I guess I'll go with clustered unique to get optimal performance.

    Thanks to all for your comments and suggestions.

     

     

Viewing 12 posts - 1 through 11 (of 11 total)

You must be logged in to reply to this topic. Login to reply