distinct keyword

  • Hi Experts,

    I want to avoid distinct keyword to the below result.

    Table A

    Eno

    1

    1

    2

    2

    3

    4

    5

    6

    7

    8

    9

    10

    I would like to display the unique Eno. How it is possible without using distinct keyword ?

    karthik

  • ...Use GROUP BY clause

    But why you dont want to use DISTINCT?:cool: Either way the plans generated would going to be similar...

    --Ramesh


  • I'm thinking "interview", "test", or "homework" question 😉

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Select DISTINCT interview.test where homework=0....

    ----------------------------------------------------------------------------------
    Your lack of planning does not constitute an emergency on my part...unless you're my manager...or a director and above...or a really loud-spoken end-user..All right - what was my emergency again?

  • No No this is not an interview question.....

    karthik

  • If we take any performance related article or document, definetely it highlight the below point ' don't use distinct keyword in your query.It decrease your performance'.

    Thats why i posted this forum. Suppose if we want to filter the above table ,then how can we avoid the distinct keyword for the above output ?

    karthik

  • Well - it is true that Select * from table1 will usually run faster than select distinct * from table1 - but if you have duplicates, the first one isn't going to give you what you need.

    If you're refusing to use DISTINCT, then Group By will also do it, or you could use OVER() (in 2005) to get there as well. However, the operation will return the same performance hit either way (well - a performance hit will happen either way).

    Actually - if you didn't "care" about performance - you could even torture a UNION query to do the SAME thing.

    ----------------------------------------------------------------------------------
    Your lack of planning does not constitute an emergency on my part...unless you're my manager...or a director and above...or a really loud-spoken end-user..All right - what was my emergency again?

  • karthikeyan (11/6/2007)


    If we take any performance related article or document, definetely it highlight the below point ' don't use distinct keyword in your query.It decrease your performance'.

    Thats why i posted this forum. Suppose if we want to filter the above table ,then how can we avoid the distinct keyword for the above output ?

    I think that's a frequent mis-interpretation... it should be "If you don't need it, don't use DISTINCT because it will cause the query to take longer even if there are no dupes." The best way to avoid the slight performance hit of DISTINCT or GROUP BY is to write code that doesn't need it 😛

    And, GROUP BY is no better or worse than DISTINCT on simple queries... of course, don't take my word for it... take the word of the code for it... they produce identical execution plans and run in nearly identical times with each beating the other on occasion...

    DROP TABLE jbmTest

    --===== Create and populate a 1,000,000 row test table.

    -- Column "RowNum" has a range of 1 to 1,000,000 unique numbers

    -- Column "SomeInt" has a range of 1 to 50,000 non-unique numbers

    -- Column "SomeLetters2" has a range of "AA" to "ZZ" non-unique 2 character strings

    -- Column "SomeMoney has a range of 0.0000 to 99.9999 non-unique numbers

    -- Column "SomeDate" has a range of >=01/01/2000 and <01/01/2010 non-unique date/times

    -- Column "SomeCSV" contains 'Part01,Part02,Part03,Part04,Part05,Part06,Part07,Part08,Part09,Part10'

    -- for all rows.

    -- Column "SomeHex12" contains 12 random hex characters (ie, 0-9,A-F)

    SELECT TOP 1000000

    RowNum = IDENTITY(INT,1,1),

    SomeInt = ABS(CHECKSUM(NEWID()))%50000+1,

    SomeLetters2 = CHAR(ABS(CHECKSUM(NEWID()))%26+65)

    + CHAR(ABS(CHECKSUM(NEWID()))%26+65),

    SomeCSV = CAST('Part01,Part02,Part03,Part04,Part05,Part06,Part07,Part08,Part09,Part10' AS VARCHAR(80)),

    SomeMoney = CAST(ABS(CHECKSUM(NEWID()))%10000 /100.0 AS MONEY),

    SomeDate = CAST(RAND(CHECKSUM(NEWID()))*3653.0+36524.0 AS DATETIME),

    SomeHex12 = RIGHT(NEWID(),12)

    INTO dbo.JBMTest

    FROM Master.dbo.SysColumns t1,

    Master.dbo.SysColumns t2 --Lack of join criteria makes this a CROSS-JOIN

    --===== A table is not properly formed unless a Primary Key has been assigned

    -- Takes about 12 seconds to execute.

    ALTER TABLE dbo.JBMTest

    ADD PRIMARY KEY CLUSTERED (RowNum)

    GO

    --===== Declare local variables

    DECLARE @StartTime DATETIME --To measure simple duration

    DECLARE @Bitbucket INT --To take display time out of the picture

    --===== Test the DISTINCT Method

    PRINT 'Distinct method...'

    SET @StartTime = GETDATE()

    SELECT DISTINCT

    @Bitbucket = SomeINT

    FROM dbo.jbmTest

    PRINT CONVERT(CHAR(12),GETDATE()-@StartTime,114) + ' Duration (hh:mi:ss:mmm)'

    PRINT REPLICATE('=',78)

    --===== Test the GROUP BY Method

    PRINT 'GROUP BY method...'

    SET @StartTime = GETDATE()

    SELECT @Bitbucket = SomeINT

    FROM dbo.jbmTest

    GROUP BY SomeINT

    PRINT CONVERT(CHAR(12),GETDATE()-@StartTime,114) + ' Duration (hh:mi:ss:mmm)'

    PRINT REPLICATE('=',78)

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Jeff,

    Before executing this code blindly , can you explain me the following sql query ?

    -------------------------------------------

    SELECT TOP 1000000 RowNum = IDENTITY(INT,1,1), SomeInt = ABS(CHECKSUM(NEWID()))%50000+1, SomeLetters2 = CHAR(ABS(CHECKSUM(NEWID()))%26+65) + CHAR(ABS(CHECKSUM(NEWID()))%26+65), SomeCSV = CAST('Part01,Part02,Part03,Part04,Part05,Part06,Part07,Part08,Part09,Part10' AS VARCHAR(80)), SomeMoney = CAST(ABS(CHECKSUM(NEWID()))%10000 /100.0 AS MONEY), SomeDate = CAST(RAND(CHECKSUM(NEWID()))*3653.0+36524.0 AS DATETIME), SomeHex12 = RIGHT(NEWID(),12) INTO dbo.JBMTest FROM Master.dbo.SysColumns t1, Master.dbo.SysColumns t2 --Lack of join criteria makes this a CROSS-JOIN

    ---------------------------------------------

    My Questions :

    1. Whether the numbers used in this query are constant or not ?

    2. Can you explain me the logic of this number ?

    3. Can i use any numbers(for ex : %26+65) in this place ?

    4. What is eman by part01,part02...part0n ?

    5. Why are you using Newid() ? Newid() - random number generator - am i correct ?

    karthik

  • My Questions :

    1. Whether the numbers used in this query are constant or not ?

    2. Can you explain me the logic of this number ?

    3. Can i use any numbers(for ex : %26+65) in this place ?

    4. What is eman by part01,part02...part0n ?

    5. Why are you using Newid() ? Newid() - random number generator - am i correct ?

    1. The numbers "used" in this query may be changed to suite your needs. For example, the number 3653.0 is the number of days in 10 years (if you start at 01/01/2000, Leap Year thingy) and is used as the "range extent" of the random numbers. The number 36524.0 is date serial number for the date of 01/01/2000 and is used as the "range offset". The numbers in the "result" set are very nearly true random numbers and they have a varying distribution that you would expect for true random numbers over the range created for each column of numbers.

    The basis for all of the calculations is simple range generation using a random source. For RAND, that would be X*Range+Offset where X is a typical random number >=0 and less than 1. The CHECKSUM generation of random numbers is a bit different and boils down to using MODULO(X,N)+Offset where X is any whole random number and N is the desired range.

    Off course, the Cross-Join on a table known to have at least 4,000 rows in it at inception allows me to gen up to 16,000,000 million rows just by changing the value in the TOP clause and, it too, runs nasty fast.

    2. The logic of which number? If you're speaking in general, perhaps more detail than in 1 above would help... NEWID() produces a random result... CHECKSUM is used to very quickly convert NEWID() to a positive, negative, or 0 value which is also random in nature because of the operand being used. ABS is used to convert any negative values to postive values. Modulo (%) is used to control the range extent of the numbers returned. Simple addition is used to offset that range from 0.

    Sometimes, as in the case of datetimes, it was easier to use RAND instead of CHECKSUM because CHECKSUM returns a whole number... RAND allows a FLOAT which is the basis of datetime (albeit a predefined "fixed length" float). As you know, RAND returns values of "x" as 0 <= x < 1. With simple multiplication by the desired extent of the range and adding an offset, you can create a random range of FLOAT numbers that are suitably used for generating random datetimes. If you want "whole" dates whose time is always "midnight", use the CHECKSUM method instead.

    3. Yes but then you destroy the intent of that particular calculation... CHAR(65) is the letter "A" and there are 26 letters in the English alphabet... of course, if you have different language requirements, then you must absolutely understand the ramifications in so far as the underlying character set is concerned. In other words, you better have a very good understanding of how character sets and ASCII numbering works.

    4. Not sure what your question actually is on this one so I tell you what I use the column for... I frequently test "split" functions, functions to return only alpha characters, functions to return only numeric characters, functions to remove only special characters, etc, etc... that particular column allows me to do all of that testing quite easily.

    5. Yes... you are correct. NEWID() is the only thing in SQL that will produce a non-repeatable value over a broad range of values within a given query which makes it perfect for generating nearly true random numbers. As you know, RAND will produce the same number no matter how many times it's "called" when used within the same query unless it's provided a random seed. Makes RAND almost totally useless in set-based queries as random number generators in SQL Server goes.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Thanks Jeff ! Really i felt i have gathered some good idea.

    karthik

Viewing 11 posts - 1 through 10 (of 10 total)

You must be logged in to reply to this topic. Login to reply