October 28, 2007 at 3:59 pm
I have begun a new project involving setting up a new database. This new database basically consists of one main flat dataset with ~500 million rows.
I have split the main table into two tables (vertical partitioning), with the frequently used columns in TableA and the least used columns in TableB.
TableA
------
ColA int NOT NULL identity (PK)
ColB nchar(2) USState NOT NULL
ColC nchar(9) NULL
ColD datetime NULL
ColE nvarchar(50) NULL
ColF nvarchar(50) NULL
ColG nvarchar(50) NULL
ColH nvarchar(150) NULL
ColI nvarchar(100) NULL
TableB
------
ColA int NOT NULL (PK) (FK to TableA.ColA)
ColB nvarchar(50) NULL
ColC nvarchar(50) NULL
ColD nvarchar(50) NULL
ColE nvarchar(50) NULL
ColF nvarchar(50) NULL
ColG nvarchar(50) NULL
ColH nvarchar(50) NULL
...
ColZ nvarchar(50) NULL
ColC (nchar(9) Column) in TableA will need an Index on it
ColD (Date Column) in TableA will need an Index on it
ColE, ColF, ColG, ColH, ColI in TableA will need a Full-Text Index on them
The data can be separated by US State Code
Inserts/Updates will happen in large chunks at off-peak hours
The database is mainly used for Searching (backend of a custom application)
The USState Column (TableA.ColB) will be part of the Search sometimes but not always (probably less than 20% of the time)
The most common fields to be searched will be those in the Full-Text Index as well as the Date and nchar Columns needing Indexes (TableA: ColC - ColI)
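To make the schema and index requirements above a bit more concrete, this is roughly the DDL I have in mind (constraint, index, and catalog names are placeholders at this point):
CREATE TABLE TableA (
    ColA int IDENTITY(1,1) NOT NULL,
    ColB nchar(2) NOT NULL,        -- US State code (the USState type in the listing above)
    ColC nchar(9) NULL,
    ColD datetime NULL,
    ColE nvarchar(50) NULL,
    ColF nvarchar(50) NULL,
    ColG nvarchar(50) NULL,
    ColH nvarchar(150) NULL,
    ColI nvarchar(100) NULL,
    CONSTRAINT PK_TableA PRIMARY KEY CLUSTERED (ColA)
);

-- regular indexes on the two point-search columns
CREATE NONCLUSTERED INDEX IX_TableA_ColC ON TableA (ColC);
CREATE NONCLUSTERED INDEX IX_TableA_ColD ON TableA (ColD);

-- full-text index over the text columns, keyed on the primary key
CREATE FULLTEXT CATALOG ftMainCatalog;
CREATE FULLTEXT INDEX ON TableA (ColE, ColF, ColG, ColH, ColI)
    KEY INDEX PK_TableA ON ftMainCatalog;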
From a performance perspective, what is the best way to physically set up this database and the full-text catalog/index?
There will be a C Drive (single drive) for the OS and application files
There will be a D Drive (RAID - Striped Set) for the Data files
There could be a 2nd Drive (RAID - Striped Set) for additional Data files
There will be an E Drive for the Log files
I can split the table into 50 partitions (spread over 50 file groups) on the USState column.
I am looking for some advice on setting up this database, the filegroups, the full-text catalog/index, and the regular indexes.
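If I do go the partitioning route, I picture something along these lines (only a few of the boundary values are shown and the filegroup names are placeholders):
-- assumes one filegroup per state (FG_AK, FG_AL, ... FG_WY) has already been added to the database
CREATE PARTITION FUNCTION pfUSState (nchar(2))
AS RANGE RIGHT FOR VALUES (N'AL', N'AR', N'AZ', /* ...remaining state codes... */ N'WY');

CREATE PARTITION SCHEME psUSState
AS PARTITION pfUSState
TO (FG_AK, FG_AL, FG_AR, FG_AZ, /* ...one filegroup per partition... */ FG_WY);

-- TableA (and its indexes) would then be created ON psUSState(ColB)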
Mike
October 29, 2007 at 5:19 am
:w00t::sick::unsure:
Presuming the field names have been changed to protect the innocent, how about posting something a little less generic and we might be able to steer you towards a better design.
--------------------
Colt 45 - the original point and click interface
October 29, 2007 at 5:43 am
Honestly, one table... Why put it into a relational database management system? Assuming at least a good primary key, this can be accessed faster from a file than from the database system. Then you get all the overhead of maintenance, etc. Just put it in a file & be done. Not everything needs to be in the database.
"The credit belongs to the man who is actually in the arena, whose face is marred by dust and sweat and blood"
- Theodore Roosevelt
Author of:
SQL Server Execution Plans
SQL Server Query Performance Tuning
October 29, 2007 at 9:41 am
philcart (10/29/2007)
:w00t::sick::unsure:
Presuming the field names have been changed to protect the innocent, how about posting something a little less generic and we might be able to steer you towards a better design.
Not sure how actual field names are going to change the idea of the design. I don't even know what all the actual field names are going to be at this point. I know there are a small number of fields that will be the primary search candidates (those are in TableA). I will join to TableB and retrieve complete records as the last step, once I have found the small result set from the main searches. I was hoping for some feedback from people who have set up large tables like this as to what they have found to be the best performance. Is it better to keep the table in one large filegroup? Is it better to split it out amongst a number of filegroups (based on USState) even though USState is not actually used in the search conditions that often?
I was hoping that this would generate some good discussion about the preferred ways (best practices) of setting up a database with these characteristics. I have to imagine that, as we continue to develop monitoring tools and such, databases with large tables that need to be searched on a small number of key fields will become more commonplace.
Mike
October 29, 2007 at 4:05 pm
With more relevant field names we can get an idea of what sort of data is stored in them.
One thing I see a lot of is duplicate address fields for different address types, e.g. residential address, postal address, billing address, alternate address. Even if we just have the basic street, suburb, and postal code fields, that's 12 fields in the table. The address data can be carved off into a child table, leaving just a key value in the master table and removing 11 fields.
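For instance, something roughly like this (field names invented just to illustrate):
CREATE TABLE Address (
    AddressID   int IDENTITY(1,1) NOT NULL PRIMARY KEY,
    RecordID    int NOT NULL,        -- key value pointing back to the master table
    AddressType tinyint NOT NULL,    -- 1 = residential, 2 = postal, 3 = billing, 4 = alternate
    Street      nvarchar(50) NULL,
    Suburb      nvarchar(50) NULL,
    PostalCode  nchar(10) NULL
);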
--------------------
Colt 45 - the original point and click interface
October 29, 2007 at 9:45 pm
Grant Fritchey (10/29/2007)
Honestly, one table... Why put it into a relational database management system? Assuming at least a good primary key, this can be accessed faster from a file than from the database system. Then you get all the overhead of maintenance, etc. Just put it in a file & be done. Not everything needs to be in the database.
That was one of my initial thoughts as well but I think there are a number of issues that prevent using a file:
* Concurrency Problems - there will be many users of the front-end application accessing this system at the same time
* Full-Text Search Requirements - Not Sure a File can be used for Full-Text Searching requirements such as Inflection and Thesaurus
* Updates - Automated updates will be run frequently as well as concurrently
* Consistency - Transactions are difficult and ineffective with Files
* Corruption - Large files are much more prone to corruption than SQL Server Databases
* Other Tables - There are other tables (small lookup tables and a few others used by the application), but not linked to the main large table
* There is no Unique Primary Key field on the data - (that is, no column of data is guaranteed to be present, let alone unique) - another reason to use the Database Table is I can generate an Identity Key to be used as a Primary Key
I believe this data does belong in a database. Once again, I was hoping to get some good discussions about setting up large tables such as this.
Mike
October 29, 2007 at 9:53 pm
philcart (10/29/2007)
With more relevant field names we can get an idea of what sort of data is stored in them. One thing I see a lot of is duplicate address fields for different address types, e.g. residential address, postal address, billing address, alternate address. Even if we just have the basic street, suburb, and postal code fields, that's 12 fields in the table. The address data can be carved off into a child table, leaving just a key value in the master table and removing 11 fields.
At this point there is nothing really that can be pulled out into a child table. If there were, those columns would be in TableB, which would not be searched against but would simply return the additional properties once a record is found to meet the initial criteria based on the columns in TableA. With the exception of the USState column, there is no guarantee of data being in any of the other columns. There will be data in at least one of the searchable columns in TableA, but possibly only in one of those columns or perhaps in all of them. I am aggregating data from a number of different external sources, all of which have their own format and completeness standards.
Thanks for the suggestion though.
Mike
October 30, 2007 at 4:32 am
Wow. 500+ million records? That's a LARGE table.
I can tell you from experience that the more columns you have in a table that size, the harder it is to search, not easier. We have a 6+ million record table with only 69 columns and it's already a bear that has been causing us a lot of issues.
As contrary as this may sound, try to break your table into more than just 2 tables. Group the columns in all the secondary tables by their relationship to each other, e.g. financial columns in one table, personal statistics in another, etc. Keep the most commonly used columns in the main table. Then either have an ID column in the secondary tables that points to the primary key in the main table, or add a composite ID column in the main table that can be parsed out to find the appropriate secondary table. ONLY pull secondary table data when it is specifically requested, and only pull the necessary secondary table instead of all of them.
JOINS do actually work pretty fast in a scenario like the one I described above.
Anyway, stick the main table on your first RAID array drive. Stick the secondary tables on the second RAID array. If you only have one RAID set available, then you're not going to get much of a performance increase by simply separating the tables, because when you need all the data you'll still be getting a performance hit. Also, I would avoid a Striped Set of RAID if possible. Go with RAID 5 or RAID 1+0. Striped has no fault tolerance so while it's fast to write to disk, you'll lose everything if you lose even one drive.
Lastly, transaction logs. If this table is only going to be searchable and not writable, then you can put the transaction logs on a regular drive and set the database to SIMPLE mode. (Like Ronco. "Set it and forget it!" @=) But if you're going to be writing new data to this db, I'd use either Bulk-Logged or FULL recovery mode and set the transaction log files on another fault tolerant RAID array (separate from the data files).
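As a rough sketch of what I mean (secondary table and column names are made up):
-- secondary table groups related, rarely requested columns
CREATE TABLE PersonalStats (
    RecordID  int NOT NULL PRIMARY KEY,   -- same value as the main table's primary key
    Height    nvarchar(50) NULL,
    EyeColor  nvarchar(50) NULL
);

-- pull the secondary table only when those columns are specifically requested
SELECT m.ColA, m.ColC, m.ColD, p.Height, p.EyeColor
FROM TableA m
JOIN PersonalStats p ON p.RecordID = m.ColA
WHERE m.ColA = @RecordID;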
October 30, 2007 at 5:58 am
Michael Fried (10/29/2007)
That was one of my initial thoughts as well but I think there are a number of issues that prevent using a file:
* Concurrency Problems - there will be many users of the front-end application accessing this system at the same time
* Full-Text Search Requirements - Not Sure a File can be used for Full-Text Searching requirements such as Inflection and Thesaurus
* Updates - Automated updates will be run frequently as well as concurrently
* Consistency - Transactions are difficult and ineffective with Files
* Corruption - Large files are much more prone to corruption than SQL Server Databases
* Other Tables - There are other tables (small lookup tables and a few others used by the application), but not linked to the main large table
* There is no Unique Primary Key field on the data - (that is, no column of data is guaranteed to be present, let alone unique) - another reason to use the Database Table is I can generate an Identity Key to be used as a Primary Key
I believe this data does belong in a database. Once again, I was hoping to get some good discussions about setting up large tables such as this.
Mike
Full text search, consistency and creating an artificial key are all things you can overcome through code. However, concurrency, corruption and the other tables are not. But the other tables don't link to the main table, so they're technically off the list. That only leaves concurrency and corruption. We had a large scale system working primarily from an XML file. Corruption was the one problem that we couldn't readily solve. Fortunately, when we moved it into the database system, we were able to normalize the table. Why must your table remain in a denormalized state?
As to storage of something large like this... Before you get to partitioning, I'd be sure to separate every piece of storage you can into separate filegroups and then, depending on the size, divide the filegroups into multiple files. So, assuming the PK is clustered (may or may not be a good choice, but it's a default starting point), you have one storage point for the data & PK. If you're doing full text searches, I'm assuming a full-text index; that's a second storage point. Any other indexes should also be separated into individual storage points, and designate another storage point for any LOB columns. If it is this big I'd put, just as a swag, say, three files per group.
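Just as a sketch of the sort of layout I'm describing (database name, file sizes and paths are made up):
-- separate filegroups for data and indexes, each with multiple files
ALTER DATABASE BigSearchDB ADD FILEGROUP FG_Data;
ALTER DATABASE BigSearchDB ADD FILE
    (NAME = N'Data1', FILENAME = N'D:\SQLData\Data1.ndf', SIZE = 50GB),
    (NAME = N'Data2', FILENAME = N'D:\SQLData\Data2.ndf', SIZE = 50GB),
    (NAME = N'Data3', FILENAME = N'D:\SQLData\Data3.ndf', SIZE = 50GB)
TO FILEGROUP FG_Data;

ALTER DATABASE BigSearchDB ADD FILEGROUP FG_Index;
-- ...add files to FG_Index (and to a FG_FullText filegroup) the same way...

-- the table and clustered PK go on FG_Data; other indexes get their own storage, e.g.:
CREATE NONCLUSTERED INDEX IX_TableA_ColC ON TableA (ColC) ON FG_Index;

-- and the full-text catalog gets its own storage location (SQL Server 2005 syntax)
CREATE FULLTEXT CATALOG ftBigSearch ON FILEGROUP FG_FullText IN PATH N'F:\FTData';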
How's that?
"The credit belongs to the man who is actually in the arena, whose face is marred by dust and sweat and blood"
- Theodore Roosevelt
Author of:
SQL Server Execution Plans
SQL Server Query Performance Tuning
October 31, 2007 at 12:29 pm
Your design, as others have suggested, cries out to be normalized. This from Books Online:
"Normalization
The logical design of the database, including the tables and the relationships between them, is the core of an optimized relational database. A good logical database design can lay the foundation for optimal database and application performance. A poor logical database design can impair the performance of the entire system.
Normalizing a logical database design involves using formal methods to separate the data into multiple, related tables. A greater number of narrow tables (with fewer columns) is characteristic of a normalized database. A few wide tables (with more columns) is characteristic of a nonnormalized database.
Reasonable normalization often improves performance. When useful indexes are available, the Microsoft® SQL Server™ 2000 query optimizer is efficient at selecting rapid, efficient joins between tables.
Some of the benefits of normalization include:
Faster sorting and index creation.
A larger number of clustered indexes. For more information, see Clustered Indexes.
Narrower and more compact indexes.
Fewer indexes per table, which improves the performance of INSERT, UPDATE, and DELETE statements.
...
Sometimes the logical database design is already fixed and total redesign is not feasible. Even then, however, it might be possible to normalize a large table selectively into several smaller tables. If the database is accessed through stored procedures, this schema change could take place without affecting applications. If not, it might be possible to create a view that hides the schema change from the applications."
Take a serious look at normalizing your database before you get too far along. BOL has some steps laid out.
Steve
November 2, 2007 at 5:32 am
I think you've missed the point about normalization. If any of the fields of TableA can be NULL (as you say most of them can be) then it makes sense to put them into child tables. For example (in practice you may not need to go this far...):
TableAB
------
ColA int NOT NULL identity (PK)
ColB nchar(2) USState NOT NULL
TableAC
------
ColA int NOT NULL (PK, FK to TableAB)
ColC nchar(9) NOT NULL
TableAD
------
ColA int NOT NULL (PK, FK to TableAB)
ColD datetime NOT NULL
TableAE
------
ColA int NOT NULL (PK, FK to TableAB)
ColE nvarchar(50) NOT NULL
TableAF
------
ColA int NOT NULL (PK, FK to TableAB)
ColF nvarchar(50) NOT NULL
TableAG
------
ColA int NOT NULL (PK, FK to TableAB)
ColG nvarchar(50) NOT NULL
TableAH
------
ColA int NOT NULL (PK, FK to TableAB)
ColH nvarchar(150) NOT NULL
TableAI
------
ColA int NOT NULL (PK, FK to TableAB)
ColI nvarchar(100) NOT NULL
TableBB
------
ColA int NOT NULL (PK, FK to TableAB)
ColB nvarchar(50) NOT NULL
TableBC
------
ColA int NOT NULL (PK, FK to TableAB)
ColC nvarchar(50) NOT NULL
TableBD
------
ColA int NOT NULL (PK, FK to TableAB)
ColD nvarchar(50) NOT NULL
TableBE
------
ColA int NOT NULL (PK, FK to TableAB)
ColE nvarchar(50) NOT NULL
TableBF
------
ColA int NOT NULL (PK, FK to TableAB)
ColF nvarchar(50) NOT NULL
TableBG
------
ColA int NOT NULL (PK, FK to TableAB)
ColG nvarchar(50) NOT NULL
TableBH
------
ColA int NOT NULL (PK, FK to TableAB)
ColH nvarchar(50) NOT NULL
...
TableBZ
------
ColA int NOT NULL (PK, FK to TableAB)
ColZ nvarchar(50) NOT NULL
Now the search code can be written to only search that combination of tables for which values are entered...e.g.
[font="Courier New"]-- must have 2 conditions...
SELECT TableAD.ColA FROM TableAD JOIN TableAH ON TableAD.ColA=TableAH.ColA
WHERE (ColD BETWEEN @date1 AND @date2) AND (ColH LIKE '%'+@sometext+'%')
-- may have either of 2...
SELECT ColA FROM (
SELECT ColA FROM TableAD WHERE (ColD BETWEEN @date1 AND @date2)
) d UNION (
SELECT ColA FROM TableAH WHERE (ColH LIKE '%'+@sometext+'%')
)[/font]
Without knowing more about the data and how it's going to be queried, it's impossible to know if you'd need to use dynamic SQL to build the queries. In any event there should be significant savings in searching since you are (hopefully) looking at much smaller tables.
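If it did come down to dynamic SQL, it would look roughly like this for the "all supplied conditions must match" case (assuming @date1, @date2 and @sometext arrive as parameters and are NULL when not supplied):
DECLARE @sql nvarchar(max), @where nvarchar(max)
SET @sql = N'SELECT a.ColA FROM TableAB a'
SET @where = N''

IF @date1 IS NOT NULL
BEGIN
    SET @sql = @sql + N' JOIN TableAD d ON d.ColA = a.ColA'
    SET @where = @where + N' AND d.ColD BETWEEN @date1 AND @date2'
END

IF @sometext IS NOT NULL
BEGIN
    SET @sql = @sql + N' JOIN TableAH h ON h.ColA = a.ColA'
    SET @where = @where + N' AND h.ColH LIKE ''%'' + @sometext + ''%'''
END

IF LEN(@where) > 0
    SET @sql = @sql + N' WHERE' + STUFF(@where, 1, 4, N'')  -- strip the leading " AND"

EXEC sp_executesql @sql,
    N'@date1 datetime, @date2 datetime, @sometext nvarchar(50)',
    @date1, @date2, @sometext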
Of course, the data loading is going to be more complex, but that should only occur once, while searches will presumably happen all the time.
Derek
November 3, 2007 at 11:22 am
Michael,
Is there any particular reason why you opted for nchar/nvarchar and datetime data types?
Does your data allow you to use char/varchar/smalldatetime instead? If yes, you can easily reduce the row length, and thus the size of your table, by 30 to 50%.
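Just to illustrate the per-row difference (sample values only):
-- nvarchar stores 2 bytes per character, varchar stores 1;
-- likewise datetime is 8 bytes versus 4 for smalldatetime
DECLARE @n nvarchar(50), @v varchar(50)
SET @n = N'Springfield'
SET @v = 'Springfield'
SELECT DATALENGTH(@n) AS nvarchar_bytes,  -- 22
       DATALENGTH(@v) AS varchar_bytes    -- 11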
Regards,
Boris