November 11, 2008 at 9:22 am
Hi,
I am implementing something akin to an online store which has a rather large catalogue of products (700,000+) within a 5 level catalogue hierarchy of approximately 500 nodes.
One of the requirements of the application is to allow the user to traverse the catalogue hierarchy and see the number of products at each level, 'rolled up' from the levels below. After some investigation, I have implemented the hierarchy using 'Nested Sets' (looked at CTE but performance was poor) and been able to build a static table of product counts at each level which works nicely.
However, I'm now considering how I might show the number of products at each level of the hierarchy when some filter is applied to the contents. For example, if the user were to search within our catalogue and I wanted to show the number of products at each level of hierarchy from the results of their search, I can't use any pre-compiled counts as this would be inaccurate due to the search filter in operation. I also cannot see how I could calculate this on-the-fly given the number of records we have and the number of searches which will take place (approx 10,000 per hour).
Has anyone else looked at this problem?
Many Thanks
James
November 11, 2008 at 9:49 am
Given an appropriate table design and indexing, I do not see why you would not be able to query the table and get results quickly enough.
Perhaps you can post the table structure and any index information you have - along with the query you have already written.
November 12, 2008 at 1:42 am
Thanks for the reply. OK, so for table structures we have CAT_Product which is the product record, CAT_CatalogueSection which is the hierarchy table (with LeftExtent and RightExtent columns to make nested sets) and a CAT_CatalogueProduct which maps a product onto a CAT_CatalogueSection. Here's the DDL
CREATE TABLE [dbo].[CAT_Product](
[ProductID] [int] NOT NULL,
[Description] [varchar](512) NULL,
CONSTRAINT [PK_CAT_Product] PRIMARY KEY CLUSTERED
(
[ProductID] ASC
)WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY]
) ON [PRIMARY]
CREATE TABLE [dbo].[CAT_CatalogueSection](
[CatalogueID] [int] NOT NULL,
[SectionID] [int] NOT NULL,
[ParentID] [int] NULL,
[SectionName] [varchar](255) NOT NULL,
[DisplayOrder] [int] NOT NULL,
[LeftExtent] [int] NULL,
[RightExtent] [int] NULL,
CONSTRAINT [PK_CAT_CatalogueSection] PRIMARY KEY CLUSTERED
(
[CatalogueID] ASC,
[SectionID] ASC
)WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY]
) ON [PRIMARY]
CREATE TABLE [dbo].[CAT_CatalogueProduct](
[CatalogueProductID] [int] NOT NULL,
[CatalogueID] [int] NOT NULL,
[SectionID] [int] NOT NULL,
[ProductID] [int] NOT NULL,
CONSTRAINT [PK_CAT_CatalogueProduct] PRIMARY KEY CLUSTERED
(
[CatalogueProductID] ASC
)WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY]
) ON [PRIMARY]
CREATE NONCLUSTERED INDEX [IDX_SectionID] ON [dbo].[CAT_CatalogueProduct]
(
[SectionID] ASC
)WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, SORT_IN_TEMPDB = OFF, IGNORE_DUP_KEY = OFF, DROP_EXISTING = OFF, ONLINE = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY]
So, lets say for example the user searches for the term 'mouse', the SQL to get the number of matches in each catalogue section would be:-
SELECT
CS1.SectionName,
COUNT(DISTINCT CS2.SectionID) As SubSectionCount,
COUNT(DISTINCT P.ProductID) As ProductCount
FROM
CAT_CatalogueSection CS1
INNER JOIN
CAT_CatalogueSection CS2
ON
(CS2.LeftExtent BETWEEN CS1.LeftExtent AND CS1.RightExtent AND CS2.LeftExtent <> CS1.LeftExtent)
INNER JOIN
CAT_CatalogueProduct CP
ON
CP.SectionID = CS2.SectionID
INNER JOIN
CAT_Product P
ON
P.ProductID = CP.ProductID
WHERE
P.Description LIKE '%mouse%'
GROUP BY
CS1.SectionName
ORDER BY
ProductCount DESC
(Note that we will be using Full Text indexing for the search but the principal is the same).
Running this query takes about 10 seconds on my machine here (Dual Xeon, 4GB RAM, SQL Server 2005) which is far too slow. I need less than 0.5s response time ideally.
Any advice appreciated
James
November 12, 2008 at 4:56 am
I need you to post something that will generate test data. The Left/Right Extent design does not make any sense to me and I do not want to waste more time trying to figure it out.
You certainly have some missing indexes.
FullText searching is going to have a significant impact if you have a lot of products. At the moment, you are searching for a product name with criteria starting with a wildcard. Even if you had an index on the product description, the wildcard at the beginning makes the index useless. FullText searching would make the criteria a word and it would be able to use the indexing to find it.
November 12, 2008 at 5:08 am
Hi Michael,
The LeftExtent/RightExtent is from Joe Celko's Nested Set pattern (http://www.ibase.ru/devinfo/DBMSTrees/sqltrees.html) which allows fast searching of leafs and nodes in a hierarchy.
I take your point about the Full Text Index but this is a simplified example of what we actually because I didn't want to confuse the issue. What I wanted was a general discussion on how this kind of thing might be implemented in a scalable way rather than a discussion of the implementation details.
Regards
James
November 12, 2008 at 5:14 am
Take a good look at your query. The columns used in the joins are all candidates for needing an index. Without some kind of realistic data, I cannot tell you which ones need to be indexed.
My first guess would be all of the "SectionID" fields used in the joins, the "ProductID" fields used in the joins, and a composite index on LeftExtent/RightExtent. The data will make a big difference.
November 18, 2008 at 9:43 am
I am not an expert in this subject but my guess would be how about creating covering index.
November 18, 2008 at 10:38 pm
James,
Are all the mice in the same downline? I believe that your code is resolving the whole tree trying to find %mouse% and it may actually be resolving the whole tree more than once. I'm not sure because I don't have the data to play with, but I think you need to give it the extent coordinates for the top most node that is capable of having mice and then search the downline for %mouse% even if that means searching the whole tree rather than doing a self join to the tree.
--Jeff Moden
Change is inevitable... Change for the better is not.
Viewing 8 posts - 1 through 7 (of 7 total)
You must be logged in to reply to this topic. Login to reply