SQL DISTINCT, UNPIVOT and TOP commands in the same query - will it work?

  • In the query below, I would like to limit the result to 100 unique records.

    If I place the 'TOP 100' command before the DISTINCT, I get an error, and after the DISTINCT it takes forever, so something is obviously wrong!

    It does work as SELECT TOP 100 email1, ...... but that returns only 69 records (which is accurate).

    Can I make it return 100 records, or is that not possible with this query?

    Thanks

    SELECT DISTINCT Email
    FROM (SELECT Email1, Email2, Email3, Email4, Email5,
                 Email6, Email7, Email8, Email9, Email10
          FROM outputresume3
          WHERE CONTAINS(originalresume, '"livelink" and "j2ee"')) AS p
    UNPIVOT (Email FOR Emails IN
             (Email1, Email2, Email3, Email4, Email5,
              Email6, Email7, Email8, Email9, Email10)) AS unpvt
    WHERE Email LIKE '%@%'
    ORDER BY Email
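
    For reference, T-SQL does accept DISTINCT and TOP in the same SELECT, in that order: DISTINCT is applied first, then TOP, so the query returns up to 100 unique addresses. A hedged sketch against the same table (whether it runs any faster depends on the execution plan):

    ```sql
    -- Sketch only: same table and columns as the query above.
    -- SELECT DISTINCT TOP (100) is valid syntax; SELECT TOP (100) DISTINCT is not.
    SELECT DISTINCT TOP (100) Email
    FROM (SELECT Email1, Email2, Email3, Email4, Email5,
                 Email6, Email7, Email8, Email9, Email10
          FROM outputresume3
          WHERE CONTAINS(originalresume, '"livelink" and "j2ee"')) AS p
    UNPIVOT (Email FOR Emails IN
             (Email1, Email2, Email3, Email4, Email5,
              Email6, Email7, Email8, Email9, Email10)) AS unpvt
    WHERE Email LIKE '%@%'
    ORDER BY Email;
    ```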

  • Without any sample data to work with, we have nothing to go on. I have to presume that the query that takes a long time is at least syntactically correct. However, as we have no information on the contents or size of the table being searched with a full-text-search method, all we can do is assume the worst: that there is a very large number of records, or that the number of records times the number of characters in this field ends up being a large number. Pinning "large" down to a specific range depends on what kind of hardware the system is running on, as well as ALL the details of how it's configured. So can you at least provide some details on what the tables contain? We really don't have enough information to be much help...

    Steve

    (aka smunson)

    :-):-):-)

    Steve (aka sgmunson) 🙂 🙂 🙂
    Rent Servers for Income (picks and shovels strategy)

  • Steve,

    Thanks for your reply!

    The reason I did not provide data is that it would take too much data to make the query run slowly enough to demonstrate the performance issue.

    You are correct, I have a lot of data: 40 GB.

    Please see below for the table details. OriginalResume is the full-text-indexed (FTI) column.

    Ian

    USE [microsoft525]
    GO

    /****** Object: Table [dbo].[OutputResume3] Script Date: 09/08/2009 11:10:10 ******/
    SET ANSI_NULLS ON
    GO
    SET QUOTED_IDENTIFIER ON
    GO

    CREATE TABLE [dbo].[OutputResume3](
        [ID3] [int] IDENTITY(1,1) NOT NULL,
        [OriginalResume] [nvarchar](max) NULL,
        [Email1] [nvarchar](1000) NULL,
        [Email2] [nvarchar](1000) NULL,
        [Email3] [nvarchar](1000) NULL,
        [Email4] [nvarchar](1000) NULL,
        [Email5] [nvarchar](1000) NULL,
        [Email6] [nvarchar](1000) NULL,
        [Email7] [nvarchar](1000) NULL,
        [Email8] [nvarchar](1000) NULL,
        [Email9] [nvarchar](1000) NULL,
        [Email10] [nvarchar](1000) NULL,
        [currentdateout] [nvarchar](50) NULL,
        [resumeId] [int] NULL,
        CONSTRAINT [PK_ID3] PRIMARY KEY CLUSTERED
        (
            [ID3] ASC
        ) WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF,
                ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON, FILLFACTOR = 90) ON [PRIMARY]
    ) ON [PRIMARY]
    GO

  • Wow... 1000 characters for an e-mail address, and up to 10 of them??? Obviously, from your previous posts, these are keyword fields, but I have to wonder why you have so many. Keywords can add up, but why not put them in just one field, and put the actual e-mail addresses in much smaller fields (surely you won't need more than two or three per applicant)? I also wonder how you can reasonably need 10,000 characters for keywords. This sounds like a serious case of overkill, and it does explain why you have so much data.

    Full-text indexing on a 40 GB database IS going to take a long time, even with the very top end in hardware. Perhaps you need to run a one-time update and move all e-mail addresses to new fields sized appropriately for the task of storing them. I don't see how you can expect to get the results you're looking for in a reasonable timeframe given this kind of design. There's just nothing in the data to help you out EXCEPT the full-text index, and that's not a good thing at 10K characters per record.

    Of course, you COULD go out and buy the most advanced x86 server you can get your hands on, ideally one that can do some serious parallel processing with a dual- or even quad-socket motherboard supporting the latest Core i7 CPUs from Intel, install SQL Server 2008 Enterprise Edition on it, give it a good 128 GB of RAM, and see where you stand; but my guess is that might not be realistic from a fiscal perspective.

    One way to find out where you're at is to let that query run until it completes, starting it at 5pm on a Friday night when Monday is a holiday, and see how long it takes. From there, you'll at least have some idea of just how staggering your task is.

    Steve

  • While you're at it, you might want to post the execution plan for the query that runs so long, so we can see where the cost is going.
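
    One way to capture a plan in a form that can be posted, assuming SSMS access (a sketch; the SSMS menu route, Query > Include Actual Execution Plan, works as well):

    ```sql
    -- Emits one row per plan operator, with estimated and actual row
    -- counts, alongside the results of whatever query runs next.
    SET STATISTICS PROFILE ON;
    GO
    -- ... run the slow UNPIVOT query here ...
    GO
    SET STATISTICS PROFILE OFF;
    GO
    ```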

    Steve

  • Steve,

    You do make some valid points regarding overkill! I guess I got carried away 🙂

    I do get good performance: I can search 1.65 million records in about 3-4 seconds and display the TOP 12000. It just shows only 9000 results, because it is based on Email1.

    Thanks

    Ian

  • Rookie Question,

    I did a length check and found the longest email to be 83 characters.

    Could I not just right-click each of the Email1-Email10 columns and do a Modify from 1000 chars to 100 chars, or is that a poor way to make this change?

    Thanks

  • If the only contents of the Email fields are e-mail addresses, then by all means change the size downwards, but only to 100 chars, to allow for some growth. (If you have an 83-char e-mail address, someone is bound to eventually give you a longer one. Then again, folks sometimes just play around with your system and provide bogus information, so it would be good to verify any supplied e-mail addresses.)
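
    The same change can be scripted rather than done through the table designer; a hedged sketch, assuming nothing blocks the change (an index on an Email column would have to be dropped and recreated around it):

    ```sql
    -- Shrinks the declared maximum only; every existing value must
    -- already fit in 100 characters or the ALTER will fail.
    ALTER TABLE dbo.OutputResume3 ALTER COLUMN Email1 nvarchar(100) NULL;
    ALTER TABLE dbo.OutputResume3 ALTER COLUMN Email2 nvarchar(100) NULL;
    -- ... repeat for Email3 through Email10 ...
    ```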

    However, I remember seeing a search for keywords like J2EE somewhere in your code, so if these fields hold anything OTHER than e-mail addresses, you'd need to move that other content somewhere else before reducing the field size. Also, consider rebuilding the full-text index, as well as all regular indexes, once you make the changes. And do you really need 10 e-mail addresses? Most folks don't use more than 2, maybe 3, when dealing with their resume, so I might consider lopping off Email4 through Email10 altogether.

    Finally, if the e-mail address fields truly hold e-mail addresses, a constraint requiring an @ sign followed by at least two characters and then at least one period might be practical to avoid corruption of your data. This, of course, then begs the question of whether you need full-text indexing on any field other than the resume itself. Why would you need FTI on e-mail addresses?
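
    A minimal sketch of such a constraint on one column (the constraint name is illustrative, and the LIKE pattern is only a rough shape test, not full address validation):

    ```sql
    -- Requires an @ followed by at least two characters and then a period.
    -- WITH NOCHECK skips validating rows already in the table.
    ALTER TABLE dbo.OutputResume3 WITH NOCHECK
        ADD CONSTRAINT CK_OutputResume3_Email1_Shape
        CHECK (Email1 IS NULL OR Email1 LIKE '%@__%.%');
    ```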

    Steve

  • Oh yeah, I almost forgot: why store an empty e-mail address field at all? The ideal design is to have a primary key in your resume table, joined to an e-mail address table by that primary key, with the e-mail address table having its own PK (a simple integer identity field would be adequate).

    That kind of design avoids much of the wasted space. Also, I might verify that 83-character e-mail address; it seems highly unrealistic to me. Who would want to have to remember anything that long? I'd seriously consider telling that one person that you can't continue to support e-mail addresses of that length, and if you lose that one person, it shouldn't be all that big a deal. Of course, verify that the address actually works; you may find it doesn't, and then you can delete it and look for the next longest one, etc., etc... As a practical guide, I'd limit e-mail addresses to perhaps 40 to 50 characters, at most.
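
    A sketch of that design (the new table and column names here are illustrative, not from the original system):

    ```sql
    -- One row per address; resumes with no address simply have no rows,
    -- so no empty 100-char slots are stored.
    CREATE TABLE dbo.ResumeEmail (
        EmailID int IDENTITY(1,1) NOT NULL
            CONSTRAINT PK_ResumeEmail PRIMARY KEY CLUSTERED,
        ID3     int NOT NULL
            CONSTRAINT FK_ResumeEmail_OutputResume3
            REFERENCES dbo.OutputResume3 (ID3),
        Email   nvarchar(100) NOT NULL
    );
    ```

    Listing a resume's addresses then becomes a plain join on ID3, with no UNPIVOT needed.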

    Steve

  • Steve,

    Thanks for your suggestions !

    As a test using SSMS, I created a copy of the database, reduced the size of the Email fields to 100 chars, rebuilt the indexes, and then ran my PIVOT query as a test against 225k emails.

    Both the old DB with ten 1000-char Email columns and the copy with ten 100-char columns took 12 seconds to run.

    I was surprised by the lack of any performance improvement. I will have to do some more performance testing :hehe:

    Ian

  • When you do performance testing like that, it helps tremendously to clear the cache before each test run, so that caching has no impact on the results; otherwise you are likely NOT to get accurate numbers. Of course, if the system is in use for other things at the time, you'll be hurting everyone, so that may not be practical at any time other than off-hours.
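
    The usual commands for that, to be run only on a test server since they hurt everything else on the instance (a sketch):

    ```sql
    CHECKPOINT;              -- flush dirty pages so they can be dropped
    DBCC DROPCLEANBUFFERS;   -- empty the buffer pool, forcing cold reads
    DBCC FREEPROCCACHE;      -- discard all cached query plans
    ```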

    It's also important to know what you did or didn't do with the full-text index. I'm not that familiar with its details, but if possible I'd keep it off the e-mail columns and give those columns regular indexes instead. You also didn't say whether you removed any of those columns. What I'm getting at is that you need to be almost despicably anal about specifying EXACTLY what you did, at the deepest "in the weeds" level: what other load was running at the time, whether the drive holding the database copy was different from the original's, and so on. Even the characteristics of your I/O subsystem may be critical to understanding the test results. The ideal test scenario is a freshly rebooted machine, as that eliminates the caching effects as well as most other potential sources of grief.

    Steve

  • Steve,

    The only full-text-indexed column is OriginalResume. The Email1 column has a regular index.

    I decided to do things one step at a time and just reduce the size of the email fields first. Following your valuable suggestion to create a dedicated Email table would involve rewriting major parts of the whole system, so that will take time.

    I will do more testing after a reboot 🙂

    Thanks

    Ian

    BTW, I did create the copy DB on the same RAID array as the original!

  • Another best practice to add to your list: the liberal use of views and/or stored procedures, so as to isolate your application code from the database details. It can avoid major app rewrites, and in many cases limit the changes to as little as a single stored procedure or view.
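
    A minimal illustration of the idea (the view name and column choice are illustrative): the application selects from the view, so the underlying columns can later be renamed, split, or moved into a separate table with only the view definition changing.

    ```sql
    CREATE VIEW dbo.vResumeEmails
    AS
    SELECT ID3, Email1, Email2, Email3
    FROM dbo.OutputResume3;
    ```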

    Steve

  • I reduced the Email1-Email10 size from 1000 chars to 100 chars, saving 9,000 characters per row.

    9,000 chars × 1,650,000 records = 1.485e+10 characters

    So why, when I do a backup, is my DB still 40 GB in size? That's also the size of the original DB!

    :crazy:

  • Remember when I referred to being "almost despicably anal" about describing EXACTLY what you did? You've done quite a number of things, but while you say what you achieved, you've provided almost no detail about HOW you did it (and I mean right down to the click or the T-SQL script). You made a copy of the database, but there are countless ways to do that once you consider the different ways of setting the initial size and the growth increment. Also, you haven't specified what kind of disks are in your RAID setup, the number of disks, or the RAID level. There are just too many unknowns here, and you're not providing the kind of detail one needs in order to make useful recommendations.

    However, you can always try to shrink the copy database and see how that goes. I'm not sure how a shrink operation is affected by the initial size and growth parameters, so you'll have to look the shrink command up in Books Online (aka BOL). Don't be afraid to discuss this at the deep-in-the-weeds level, because that's where you need to be.
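
    A sketch of that sequence (the copy's database name below is illustrative; substitute your own). Worth noting: nvarchar columns are variable-length, so reducing the declared maximum does not by itself change the bytes stored for values that already fit, and any space that is freed still sits inside the file until a shrink releases it.

    ```sql
    DBCC SHRINKDATABASE (N'microsoft525_copy');
    GO
    -- Shrinking fragments indexes badly, so rebuild them afterwards.
    ALTER INDEX ALL ON dbo.OutputResume3 REBUILD;
    GO
    ```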

    Also, I may have assumed more knowledge on your part than you actually have; it's hard to tell at times what level someone is at, so my apologies for that. If you wish to reach me privately, send me a private message on the forum and I can provide a phone number and the hours during which I'll be available.

    Steve
