October 19, 2015 at 8:15 pm
Hi All.
I was asked to create a report on very large data. Below are the requirements. Any help would be greatly appreciated. Thanks.
We have a very large fact table with 100 million records. I will have to create a report on that table. I just created a query and it is taking more than an hour to retrieve a month's worth of data. As this report will be used by our customers, it should not take longer than 15 seconds. Would creating tables with precalculated values help improve the performance?
Thanks.
October 19, 2015 at 8:36 pm
Make sure you have a clustered index on the "record date time" column.
Then selecting a monthly range of records will be really fast, and the rest of the conditions will apply only to that small subset, so whatever they are, they should not hurt the performance of the query.
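As a minimal sketch of what that looks like, assuming a table and column literally named dbo.FactTable and RecordDateTime (the real names will differ):

-- Illustration only: table and column names are assumptions
CREATE CLUSTERED INDEX CIX_Fact_RecordDateTime
    ON dbo.FactTable (RecordDateTime);

-- A month's worth of rows then resolves to a single clustered index range seek
SELECT COUNT(*)
FROM dbo.FactTable
WHERE RecordDateTime >= '20150901'
  AND RecordDateTime <  '20151001';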
_____________
Code for TallyGenerator
October 20, 2015 at 12:50 am
Sergiy (10/19/2015)
Make sure you have a clustered index on "record date time" column.
I would only create that clustered index if there isn't one already - any existing clustered index may well be critical to something else.
If a clustered index already exists then I would try creating a non-clustered index on "record date time" to improve the report.
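A minimal sketch, again with assumed names; the INCLUDE list would be whichever columns the report actually selects:

-- Illustration only: table, column, and included column names are assumptions
CREATE NONCLUSTERED INDEX IX_Fact_RecordDateTime
    ON dbo.FactTable (RecordDateTime)
    INCLUDE (SomeReportColumn1, SomeReportColumn2);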
October 20, 2015 at 2:43 am
Table definitions, index definitions, query and actual execution plan please.
Gail Shaw
Microsoft Certified Master: SQL Server, MVP, M.Sc (Comp Sci)
SQL In The Wild: Discussions on DB performance with occasional diversions into recoverability
October 20, 2015 at 5:44 am
With 100 million rows you're better off using a columnstore index or moving the table to SSAS. If the report requires data on a monthly basis, partitioning it by month could help improve the performance. What you need to understand in this case is that most of the performance issue you're facing is likely caused by the massive IO of fetching data from disk, or by aggregating large data sets (a CPU issue). Indexes help seek to and eliminate unwanted rows, but if you still have millions of rows after that, you will still face performance issues that require a suitable technology.
Summary tables are a good solution, but you will lose granularity and might need to create many different summaries at different groupings for different reports, and this can become unmanageable over time.
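For the columnstore suggestion above, a rough sketch (table and column names are assumptions; a nonclustered columnstore index needs SQL Server 2012 or later, and before 2016 it makes the underlying table read-only):

-- Sketch only: assumed names
CREATE NONCLUSTERED COLUMNSTORE INDEX NCCI_FactTable
    ON dbo.FactTable (TransactionDate, StoreItemID, CustomerID);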
October 20, 2015 at 6:58 am
A fact table with 100 million rows is not that large compared to the tables I had at my previous company. However, good indexing is important. We can't give any advice without the information Gail asked for. For a more detailed explanation of what we're asking for, read the following article: http://www.sqlservercentral.com/articles/SQLServerCentral/66909/
October 20, 2015 at 7:48 am
Thanks all for your suggestions. Here is more information on the table design etc:
I will need to create a report on the tables below.
The Order Facts table has the following columns:
ID bigint Clustered, Date(datetime), StoreItemID int, CustomerID bigint
We also have a non-clustered index on Date that includes (StoreItemID, CustomerID).
Store Items table:
ItemID int Clustered, ItemName nvarchar(100), Item Category nvarchar(100)
The same customer may purchase multiple times on the same day, plus we have about 100 million active users. As I said, the fact table is pretty big, with several million transactions each day.
The report should show customers by item purchased in a month and customers by item category in a month.
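For illustration, the two report queries described might look roughly like this (the join from StoreItemID to the Store Items table's ItemID is an assumption, as are the table names and date boundaries):

-- Sketch only: customers per item for one month
SELECT si.ItemName,
       COUNT(DISTINCT of1.CustomerID) AS CustomersInMonth
FROM dbo.OrderFacts AS of1
JOIN dbo.StoreItems AS si
    ON si.ItemID = of1.StoreItemID
WHERE of1.[Date] >= '20150901'
  AND of1.[Date] <  '20151001'
GROUP BY si.ItemName;

-- Same idea, customers per item category for one month
SELECT si.[Item Category],
       COUNT(DISTINCT of1.CustomerID) AS CustomersInMonth
FROM dbo.OrderFacts AS of1
JOIN dbo.StoreItems AS si
    ON si.ItemID = of1.StoreItemID
WHERE of1.[Date] >= '20150901'
  AND of1.[Date] <  '20151001'
GROUP BY si.[Item Category];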
October 20, 2015 at 7:53 am
GilaMonster (10/20/2015)
Table definitions, index definitions, query and actual execution plan please.
Table and indexes as their CREATE statements please, it makes things easier to read
Gail Shaw
Microsoft Certified Master: SQL Server, MVP, M.Sc (Comp Sci)
SQL In The Wild: Discussions on DB performance with occasional diversions into recoverability
October 20, 2015 at 7:53 am
sreeya (10/20/2015)
Thanks all for your suggestions. Here is more information on the table design etc: I will need to create a report on the tables below.
The Order Facts table has the following columns:
ID bigint Clustered, Date(datetime), StoreItemID int, CustomerID bigint
We also have a non-clustered index on Date that includes (StoreItemID, CustomerID).
Store Items table:
ItemID int Clustered, ItemName nvarchar(100), Item Category nvarchar(100)
The same customer may purchase multiple times on the same day, plus we have about 100 million active users. As I said, the fact table is pretty big, with several million transactions each day.
The report should show customers by item purchased in a month and customers by item category in a month.
GilaMonster (10/20/2015)
Table definitions, index definitions, query and actual execution plan please.
The details you have posted are insufficient. Gail's requests represent the minimum requirement for a decent chance of offering you any help.
For fast, accurate and documented assistance in answering your questions, please read this article.
Understanding and using APPLY, (I) and (II) Paul White
Hidden RBAR: Triangular Joins / The "Numbers" or "Tally" Table: What it is and how it replaces a loop Jeff Moden
October 20, 2015 at 7:59 am
Hi,
Creating an index on the date column and partitioning the table would be good options for reporting on a big table.
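A rough sketch of monthly partitioning on the date column (function/scheme names, boundary values, and filegroup are assumptions only):

-- Sketch only: boundaries would be added for each month
CREATE PARTITION FUNCTION pf_MonthlyByDate (datetime)
    AS RANGE RIGHT FOR VALUES ('20150801', '20150901', '20151001');

CREATE PARTITION SCHEME ps_MonthlyByDate
    AS PARTITION pf_MonthlyByDate ALL TO ([PRIMARY]);

-- The fact table or its indexes can then be created ON ps_MonthlyByDate([Date])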
October 20, 2015 at 8:42 am
sreeya (10/20/2015)
The same Customer may purchase multiple times on the same day plus we have about 100 millions of active users. As I said facts table is pretty big with more than several millions of transactions each day.
That's a lot of sales a day... just being curious, what industry are you involved in?
________________________________________________________________
you can lead a user to data....but you cannot make them think
and remember....every day is a school day
October 20, 2015 at 6:43 pm
Kristen-173977 (10/20/2015)
Sergiy (10/19/2015)
Make sure you have a clustered index on the "record date time" column.
I would only create that clustered index if there isn't one already - any existing clustered index may well be critical to something else.
If a clustered index already exists then I would try creating a non-clustered index on "record date time" to improve the report.
As we can see from the following post, there is nothing important about the existing clustered index. Just the common junior mistake of putting a clustered index on an identity column.
But I would not readily suggest recreating/changing a clustered index on a table with 100 million rows. It might be quite a long exercise.
There is still the option of using the existing clustered index:
Select MaxID = MAX(ID), MinID = MIN(ID)
From ... Where [Date] between @MinDate and @MaxDate
And then use those MaxID and MinID to limit the number of rows processed by all the other conditions in the WHERE clause.
Still need to apply the filter against the transaction date.
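A rough sketch of the two-step pattern being described, using the table and columns posted earlier in the thread (the date boundaries are just placeholders):

-- Sketch only; assumes the OrderFacts table described above
DECLARE @MinDate datetime = '20150901',
        @MaxDate datetime = '20150930';
DECLARE @MinID bigint, @MaxID bigint;

-- Step 1: the nonclustered index on [Date] finds the ID boundaries cheaply
SELECT @MinID = MIN(ID), @MaxID = MAX(ID)
FROM dbo.OrderFacts
WHERE [Date] BETWEEN @MinDate AND @MaxDate;

-- Step 2: the clustered index on ID limits the rows scanned,
--         while the date filter is still applied as noted above
SELECT CustomerID, StoreItemID, [Date]
FROM dbo.OrderFacts
WHERE ID BETWEEN @MinID AND @MaxID
  AND [Date] BETWEEN @MinDate AND @MaxDate;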
_____________
Code for TallyGenerator
October 21, 2015 at 8:24 am
Sergiy (10/20/2015)
Kristen-173977 (10/20/2015)
Sergiy (10/19/2015)
Make sure you have a clustered index on the "record date time" column.
I would only create that clustered index if there isn't one already - any existing clustered index may well be critical to something else.
If a clustered index already exists then I would try creating a non-clustered index on "record date time" to improve the report.
As we can see from the following post, there is nothing important about the existing clustered index. Just the common junior mistake of putting a clustered index on an identity column.
But I would not readily suggest recreating/changing a clustered index on a table with 100 million rows. It might be quite a long exercise.
There is still the option of using the existing clustered index:
Select MaxID = MAX(ID), MinID = MIN(ID)
From ... Where [Date] between @MinDate and @MaxDate
And then use those MaxID and MinID to limit the number of rows processed by all the other conditions in the WHERE clause.
Still need to apply the filter against the transaction date.
Why not index Date in one table, Item in the other?
Assuming that anyone can see all Customers and Stores.
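As a sketch (index names are assumptions, and the Date index on the fact table may already exist per the earlier post):

-- Sketch only
CREATE NONCLUSTERED INDEX IX_OrderFacts_Date
    ON dbo.OrderFacts ([Date])
    INCLUDE (StoreItemID, CustomerID);

CREATE NONCLUSTERED INDEX IX_StoreItems_Item
    ON dbo.StoreItems (ItemName, [Item Category]);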
It would be interesting to see the requested information for the table definitions, and the query plan from Profiler.
That would be very helpful for getting better feedback.
October 21, 2015 at 10:13 am
Jayanth_Kurup (10/20/2015)
With 100 million rows you're better off using a columnstore index or moving the table to SSAS. If the report requires data on a monthly basis, partitioning it by month could help improve the performance. What you need to understand in this case is that most of the performance issue you're facing is likely caused by the massive IO of fetching data from disk, or by aggregating large data sets (a CPU issue). Indexes help seek to and eliminate unwanted rows, but if you still have millions of rows after that, you will still face performance issues that require a suitable technology. Summary tables are a good solution, but you will lose granularity and might need to create many different summaries at different groupings for different reports, and this can become unmanageable over time.
Careful now... Partitioning it by month won't help any more than having the proper index on a monolithic table and a proper query to take advantage of it. All else being equal, partitioning can actually slow down queries thanks to the multiple B-Trees involved behind the scenes.
Not sure how you think moving a table to SSAS will actually improve performance.
--Jeff Moden
Change is inevitable... Change for the better is not.
October 21, 2015 at 11:33 am
Jeff Moden (10/21/2015)
Jayanth_Kurup (10/20/2015)
With 100 million rows you're better off using a columnstore index or moving the table to SSAS. If the report requires data on a monthly basis, partitioning it by month could help improve the performance. What you need to understand in this case is that most of the performance issue you're facing is likely caused by the massive IO of fetching data from disk, or by aggregating large data sets (a CPU issue). Indexes help seek to and eliminate unwanted rows, but if you still have millions of rows after that, you will still face performance issues that require a suitable technology. Summary tables are a good solution, but you will lose granularity and might need to create many different summaries at different groupings for different reports, and this can become unmanageable over time.
Careful now... Partitioning it by month won't help any more than having the proper index on a monolithic table and a proper query to take advantage of it. All else being equal, partitioning can actually slow down queries thanks to the multiple B-Trees involved behind the scenes.
Not sure how you think moving a table to SSAS will actually improve performance.
Partitioning is more for having sliding windows of data for archiving. Surprising how many think it will improve performance.
If users are dumping details for all customers and all order lines anyway, you would be missing the whole point of SSAS.
Although just to get a cube set up and running, they would have to fix the core performance issue.
100 million rows in order table
100 million users who can have multiple orders per day
Millions of rows per day
Without a cross join between the tables, this just doesn't make sense to me
Hope the poster comes back with some details