100 million rows

  • I'm working on a project that will have somewhere in the range of 100 million rows in the table that gets the most rows. I'll have somewhere near 10 million rows loaded very shortly on a development system.

    The development system is running 2005, the very top version of 2005. The development system hardware is rather small and simple with two processors and two disks. All the data is presently on one disk.

    The production server is much bigger, with eight processors and a RAID array of 15k RPM disks.

    The database design we have at present is relational. I'm worried about the 100 million rows; I'm not sure we can get by with just a relational design and some very fast hardware. With very little adjustment, the existing relational design could be pushed toward a star schema. I also wonder about partitioning the table that will have 100 million rows, and I've given some thought to how those rows should be broken up into partitions.

    Does anyone have a rule of thumb for when data warehousing technologies need to be put to use?  Is it at 10 million rows?  50 million?

    Andrew Wilkes


  • Like anything else, "It Depends".  What is the table being used for?  Does it have a clustered PK or clustered index?  Is anything "time sensitive" in the table?  How frequently are "old rows" accessed?  Are there any columns with NULLs?  Can any of the data be relegated to an archive table?  Is the table properly normalized?

    I know... not much help, so far... perhaps an example would serve better...

    We have several tables in the range you speak of... for example, our Invoice table contains 26,982,387 rows and the InvoiceDetail table contains 230,096,380 rows.  Every night, we have a job that identifies which Invoice rows are older than 120 days and we move those rows, along with the associated InvoiceDetail rows, to a separate "archive" data base.  That keeps the size of the normally accessed rows (usually within the last 120 days) down to a very manageable 3,638,014 Invoice rows and 21,554,023 InvoiceDetail rows.  The tables suffer a lot of Inserts and virtually no Updates.
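
    The nightly move itself is nothing fancy... in rough outline it looks something like this (the database, table, and column names here are purely for illustration, not our actual job, and a real job would move the rows in batches):

    -- Copy invoices older than 120 days (and their detail rows) to the archive
    -- database, then remove them from the active tables.  Assumes the archive
    -- tables have identical structure.  ArchiveDB, Invoice, InvoiceDetail,
    -- InvoiceID, and InvoiceDate are illustrative names.
    DECLARE @Cutoff datetime;
    SET @Cutoff = DATEADD(dd, -120, GETDATE());

    BEGIN TRANSACTION;

    INSERT INTO ArchiveDB.dbo.Invoice
    SELECT i.*
      FROM dbo.Invoice i
     WHERE i.InvoiceDate < @Cutoff;

    INSERT INTO ArchiveDB.dbo.InvoiceDetail
    SELECT d.*
      FROM dbo.InvoiceDetail d
      JOIN dbo.Invoice i ON i.InvoiceID = d.InvoiceID
     WHERE i.InvoiceDate < @Cutoff;

    DELETE d
      FROM dbo.InvoiceDetail d
      JOIN dbo.Invoice i ON i.InvoiceID = d.InvoiceID
     WHERE i.InvoiceDate < @Cutoff;

    DELETE FROM dbo.Invoice
     WHERE InvoiceDate < @Cutoff;

    COMMIT TRANSACTION;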

    The powers that be wanted combined access to both the in-service tables and the archive tables, so we have a bunch of time-sensitive views (last 6 months, last 9 months, last year, all, for example) that allow such combined access.  They just didn't go far enough, in my opinion... they haven't "warehoused" the data...
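
    The views themselves are nothing more exotic than a UNION ALL across the two databases... something along these lines (the column and database names are illustrative only):

    -- "Last year" view spanning the active and archive copies of Invoice.
    -- The current database holds roughly the last 120 days; ArchiveDB holds the rest.
    CREATE VIEW dbo.Invoice_LastYear
    AS
    SELECT InvoiceID, CustomerID, InvoiceDate, InvoiceTotal
      FROM dbo.Invoice
     WHERE InvoiceDate >= DATEADD(yy, -1, GETDATE())
    UNION ALL
    SELECT InvoiceID, CustomerID, InvoiceDate, InvoiceTotal
      FROM ArchiveDB.dbo.Invoice
     WHERE InvoiceDate >= DATEADD(yy, -1, GETDATE());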

    To me and in this instance, that would mean summarizing the old data and putting the summaries into tables or perhaps cubes.  It's a rare thing when we need to look at detail rows that are more than a year old... but we do frequently calculate and compare the customer base, revenue, and revenue from various product types.  So, why not have a job that does the summarization table build, say, once a week?
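
    That weekly job could be as simple as a truncate-and-rebuild of a summary table... a rough sketch, with all the names made up for illustration:

    -- Rebuild a monthly revenue summary from the archived detail.
    -- RevenueSummary, ProductType, ExtendedPrice, etc. are illustrative names.
    TRUNCATE TABLE dbo.RevenueSummary;

    INSERT INTO dbo.RevenueSummary (SummaryMonth, ProductType, CustomerCount, Revenue)
    SELECT DATEADD(mm, DATEDIFF(mm, 0, i.InvoiceDate), 0) AS SummaryMonth,
           d.ProductType,
           COUNT(DISTINCT i.CustomerID) AS CustomerCount,
           SUM(d.ExtendedPrice)         AS Revenue
      FROM ArchiveDB.dbo.Invoice i
      JOIN ArchiveDB.dbo.InvoiceDetail d
        ON d.InvoiceID = i.InvoiceID
     GROUP BY DATEADD(mm, DATEDIFF(mm, 0, i.InvoiceDate), 0), d.ProductType;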

    SQL Server is a powerful tool and 100 million rows just isn't that big in the face of good table design, good procedure design, and proper indexing.  Sure, partitioning by something like date will help until you need to put it all back together again.  Even that's not so bad, but it's still relatively slow compared to just looking at active data.  I'd say your two partitions should be rows that you expect to be frequently accessed (last 120 days, for example) and those you do not (archive).  Then add the technique of pre-aggregating (summarizing) data in another table (or cube), and I'd say you have a winner.  Do NOT use views to aggregate the data on the fly instead of dedicating the space to summary tables... the aggregated data should be in a properly indexed table if you want any decent performance.
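
    Since you're on 2005, the partitioning setup itself is small... something like this (the names, the boundary date, and the filegroup are purely illustrative; note that the boundary has to be a fixed date, so a "last 120 days" partition really means moving the boundary forward on a schedule):

    -- Two partitions on a RANGE RIGHT function: rows before the boundary
    -- date (the "archive" half) and rows on or after it (the "active" half).
    CREATE PARTITION FUNCTION pfByDate (datetime)
    AS RANGE RIGHT FOR VALUES ('20080101');

    CREATE PARTITION SCHEME psByDate
    AS PARTITION pfByDate ALL TO ([PRIMARY]);

    -- The partitioning column must be part of the clustered key.
    CREATE TABLE dbo.BigTable
    (
        BigTableID int      NOT NULL,
        EventDate  datetime NOT NULL,
        -- ...other columns...
        CONSTRAINT PK_BigTable PRIMARY KEY CLUSTERED (EventDate, BigTableID)
    ) ON psByDate (EventDate);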

    Anyway, that's just my 2 cents on the subject.  Write back if you need something else...

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

    I agree with Jeff.  100M rows really isn't that much.  I have a client that uses a set of quarterly state tables to track keys representing the state of every row on each day.  Each table winds up with between 150 and 200M rows.  This is on SQL 2000 on a pizza-box server (2 CPUs, 2 GB RAM) and performance is poor but acceptable.  Note that they have 5-10 different databases in the same situation on each pizza-box they run, too!

    Again, Jeff is right in that a rolling partition containing the latest "important" set of data would result in optimal performance.  Setting up and managing this can be a bit tedious initially, but could pay off if you need top performance.
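
    The tedious part is the recurring maintenance: rolling the window forward is mostly a matter of SWITCH, MERGE, and SPLIT once per period.  Very roughly, with all names and dates being illustrative (this assumes a RANGE RIGHT partition function/scheme like the one sketched above, plus an empty staging table with identical structure and indexes on the same filegroup as the oldest partition):

    -- 1. Switch the oldest partition out to the staging table (a metadata-only operation).
    ALTER TABLE dbo.BigTable SWITCH PARTITION 1 TO dbo.BigTable_Stage;

    -- 2. Remove the now-empty boundary at the old end of the range.
    ALTER PARTITION FUNCTION pfByDate() MERGE RANGE ('20080101');

    -- 3. Add a new boundary at the new end for the next period's rows.
    ALTER PARTITION SCHEME psByDate NEXT USED [PRIMARY];
    ALTER PARTITION FUNCTION pfByDate() SPLIT RANGE ('20080701');

    -- The rows now sitting in dbo.BigTable_Stage can be archived or dropped.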

    Of course a star-schema for your reporting (if properly designed/indexed/accessed) may well lead to the best performance of all.
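
    Just to illustrate the shape of it, a minimal star might look like the following (all names are made up; a real design would have more dimensions and attributes):

    -- One fact table keyed to date and product dimensions.
    CREATE TABLE dbo.DimDate
    (
        DateKey       int      NOT NULL PRIMARY KEY, -- e.g. 20080115
        CalendarDate  datetime NOT NULL,
        CalendarMonth int      NOT NULL,
        CalendarYear  int      NOT NULL
    );

    CREATE TABLE dbo.DimProduct
    (
        ProductKey  int         NOT NULL PRIMARY KEY,
        ProductType varchar(50) NOT NULL
    );

    CREATE TABLE dbo.FactRevenue
    (
        DateKey    int   NOT NULL REFERENCES dbo.DimDate (DateKey),
        ProductKey int   NOT NULL REFERENCES dbo.DimProduct (ProductKey),
        Quantity   int   NOT NULL,
        Revenue    money NOT NULL
    );

    -- Cluster the fact table on the keys the reports slice by.
    CREATE CLUSTERED INDEX IX_FactRevenue ON dbo.FactRevenue (DateKey, ProductKey);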

    Best,
    Kevin G. Boles
    SQL Server Consultant
    SQL MVP 2007-2012
    TheSQLGuru on googles mail service

  • Thanks guys for the information.

    I'm remembering a conversation I had with a friend in 1991.

    He had started working on data warehouses. He was all excited about denormalizing the data.

    I had just got my head around normalization and was successfully running a medium sized relational database.

    I suggested to my friend that he throw hardware at the problem. He said, 'I have 1 million rows'.

    I said 'oh'.

    Andrew

  • Andrew... this is what I'm referring to as "summary tables" and others refer to as "Star Schema" or "OLAP"...

    http://www.ciobriefings.com/whitepapers/StarSchema.asp


    --Jeff Moden


