Different Methods of JOINing to table with 6.4 Million Records

  • Hello,

    I have been working on a query that joins a table (A) of IP Addresses to a table of 6.4 Million IP to Domain lookups (B). The IPs in Table B are stored in BIGINT for the quickest lookups. Before joining A to B, the IPs in A are converted to BIGINT format. This process goes extremely quick. The problem is when A is joined to B.

    If I use a temp table (#), the query may run in 2-4 minutes. If I use a table variable (@), it can sometimes take over 10 minutes. If I simply join and avoid temp tables, the process takes over 10 minutes. The problem with the first method is that it is not usable in Reporting Services since when you run two or more reports simultaneously, the 2nd+ instance will return an error (because it tries to create a PK with the same name and fails). I can leave out the explicit creation of the PK and create it when the temp table is created, or leave it out entirely, but in either case, it just prolongs the query more than the 2-4 minutes. I can't find a way to create a system-generated PK name once the table has already been created. Using ALTER TABLE ADD CONSTRAINT without the name returns an error (despite BOL's syntax that leaving it blank will use a sys generated name).

    This is problem 1. Problem 2 is that running any of these queries pegs the server to 100% usage for an extended period of time. This is obviously no good either. I do not have the resources for partitioning, etc., although not sure how that would help anyways seeing as the query is searching the entire table of 6.4 million records. It's not by date range, for instance. It is, however, using a Clustered Index Seek which is good.

    Sample Code for Method 1 (Temp Table):

    -- ** This section goes very quickly

    IF OBJECT_ID('TempDB..#mtbl_isps') IS NOT NULL DROP TABLE #mtbl_isps

    CREATE TABLE #mtbl_isps (IPAddr varchar(15) NOT NULL, LongIP bigint NOT NULL, [Views] int)

    INSERT INTO #mtbl_isps

    SELECT IPAddr, LongIP, SUM([Views]) As [Views]

    FROM (

    SELECT blah blah FROM blah...

    ) dTbl

    GROUP BY IPAddr, LongIP

    ALTER TABLE #mtbl_isps

    ADD CONSTRAINT PK_MTbl_ISPs_LongIP PRIMARY KEY CLUSTERED (LongIP) WITH FILLFACTOR = 100

    -- ** This part is slow

    SELECT ICI.ipISP AS Referrer, SUM(M.[Views]) AS [Views]

    FROM dbo.IPCITYISP ICI WITH (NOLOCK) JOIN

    #mtbl_isps M WITH (NOLOCK) ON M.LongIP BETWEEN ICI.ipFROM AND ICI.ipTO

    GROUP BY ICI.ipISP

    ORDER BY ICI.ipISP

    DROP TABLE #mtbl_isps

    Any suggestions?

    [font="Tahoma"]Zycon - The Search Engine for Manufacturers[/url]. Helping Engineers and Technical Buyers Locate Aluminum Extrusions, Metal Stampers, Plastic Fabricators, Valves, and More.[/font]

  • Would you please post the DDL for the base tables including indexes, some sample data for both tables (in a readily consumable format that can be cut, paste, and run in SSMS), and the expected results from query. I realize that your sample data will only be a few records each, but it at least provides something to test against.

    Also, could you also post the actual execution plan of your queries?

  • If you don't use the "constraint" keyword, that command works.

    create table #T (ID int);

    alter table #T add primary key (ID);

    Just tested that and it worked.

    - Gus "GSquared", RSVP, OODA, MAP, NMVP, FAQ, SAT, SQL, DNA, RNA, UOI, IOU, AM, PM, AD, BC, BCE, USA, UN, CF, ROFL, LOL, ETC
    Property of The Thread

    "Nobody knows the age of the human race, but everyone agrees it's old enough to know better." - Anon

  • Lynn Pettis (4/20/2009)


    Would you please post the DDL for the base tables including indexes, some sample data for both tables (in a readily consumable format that can be cut, paste, and run in SSMS), and the expected results from query. I realize that your sample data will only be a few records each, but it at least provides something to test against.

    Also, could you also post the actual execution plan of your queries?

    -- This is the table that contains the 6M+ IP to Domain lookups

    CREATE TABLE [dbo].[IPCITYISP](

    [ipFROM] [bigint] NOT NULL,

    [ipTO] [bigint] NOT NULL,

    [countrySHORT] [char](2) NOT NULL,

    [countryLONG] [varchar](64) NOT NULL,

    [ipREGION] [varchar](128) NOT NULL,

    [ipCITY] [varchar](128) NOT NULL,

    [ipISP] [varchar](256) NOT NULL,

    CONSTRAINT [PK_IPCITYISP] PRIMARY KEY CLUSTERED

    (

    [ipFROM] ASC,

    [ipTO] ASC

    )WITH FILLFACTOR = 100 ON [PRIMARY]

    ) ON [PRIMARY]

    -- Sample statistics table (others similar; they are just unioned in the initial query)

    CREATE TABLE [dbo].[BackgroundViews](

    [ClickDT] [datetime] NOT NULL,

    [CustID] [int] NOT NULL,

    [CompID] [int] NOT NULL,

    [CatID] [int] NULL,

    [SubcatID] [int] NULL,

    [IPAddr] [varchar](15) NULL,

    CONSTRAINT [PK_CoBackgroundViews] PRIMARY KEY CLUSTERED

    (

    [ClickDT] ASC,

    [CustID] ASC,

    [CompID] ASC

    )WITH FILLFACTOR = 80 ON [PRIMARY]

    ) ON [PRIMARY]

    -- Scalar function called to convert IPAddr to LongIP in Stats Tables

    CREATE FUNCTION [dbo].[fnDot2LongIP]

    (@ipaddr varchar(15))

    RETURNS bigint

    AS

    BEGIN

    DECLARE @longip bigint, @i tinyint, @pos tinyint, @ppos tinyint

    SET @i = 1

    SET @ppos = 0

    SET @longip = 0

    WHILE @i <= 4

    BEGIN

    SET @pos = CHARINDEX(''.'', @ipaddr, @ppos + 1)

    IF @i = 4 SET @pos = LEN(@ipaddr) + 1

    SET @longip = @longip + ((CAST(SUBSTRING(@ipaddr, @ppos + 1, @pos - @ppos - 1) AS bigint) % 256) * (POWER(256,(4 - @i))))

    SET @ppos = @pos

    SET @i = @i + 1

    END

    RETURN @longip

    -- Sample Data

    INSERT INTO dbo.IPCITYISP

    SELECT 50334720, 50334975, 'US', 'UNITED STATES', 'MISSOURI', 'CHILLICOTHE', 'GENERAL ELECTRIC COMPANY'

    UNION ALL

    SELECT 55297536, 55297791, 'US', 'UNITED STATES', 'MASSACHUSETTS', 'BEVERLY', 'GENERAL ELECTRIC COMPANY'

    UNION ALL

    SELECT 68774912, 68775167, 'US', 'UNITED STATES', 'NEW YORK', 'STATEN ISLAND', 'LEVEL 3 COMMUNICATIONS INC'

    UNION ALL

    SELECT 70603920, 70603935, 'US', 'UNITED STATES', 'NEW YORK', 'NEW YORK', 'TOP SPOT BRANDS INC'

    UNION ALL

    SELECT 72320544, 72320551, 'US', 'UNITED STATES', 'FLORIDA', 'SARASOTA', 'AMK INVESTMENTS'

    INSERT INTO dbo.BackgroundViews

    SELECT '2009-01-07 16:40:59.623', 46636, 225433, 3862, NULL, '65.217.70.18'

    UNION ALL

    SELECT '2009-01-07 16:41:41.507', 46636, 224672, NULL, NULL, '216.123.183.132'

    UNION ALL

    SELECT '2009-01-07 17:04:31.987', 46636, 228385, NULL, NULL, '72.19.30.58'

    UNION ALL

    SELECT '2009-01-08 09:18:36.097', 46636, 220596, 2097, 2044, '24.63.5.118'

    UNION ALL

    SELECT '2009-01-08 09:49:15.810', 46636, 220655, NULL, NULL, '207.228.217.2'

    UNION ALL

    SELECT '2009-01-08 10:01:58.227', 46636, 221654, 2740, NULL, '66.0.90.162'

    As for an example of the problem I am faced with, I ran one query that used no temp tables or table variables, and it took 4:04 to return 37 records. During that time, the CPU had high usage. Another example using a temp table returned 250 rows in 1:49. The CPU was at 60-80% during the time elapsed.

    The expected results are something like:

    ReferrerViews

    3G MOBILE SERVICE PROVIDER2

    3M (S) PTE LTD2

    3M COMPANY3

    754TH ELECTRONIC SYSTEMS GROUP3

    A2ZFX INC1

    ABB POWER GENERATION1

    [font="Tahoma"]Zycon - The Search Engine for Manufacturers[/url]. Helping Engineers and Technical Buyers Locate Aluminum Extrusions, Metal Stampers, Plastic Fabricators, Valves, and More.[/font]

  • Could you post your code as well? I wouldn't be surprised if the problem with performance is actually related to your scalar function for converting the IP address.

  • Lynn Pettis (4/20/2009)


    Could you post your code as well? I wouldn't be surprised if the problem with performance is actually related to your scalar function for converting the IP address.

    I don't see how it could be that. If I just run that initial query (to convert an IP to LongIP), it'll run in seconds. That scalar converted 2377 IPs to a LongIP (bigint) in under 1 second. As soon as I join those LongIPs to the lookup table, the query and server come to a crawl.

    The IPs in the 6.4 Million record table are already converted to bigint and no conversion is done there.

    [font="Tahoma"]Zycon - The Search Engine for Manufacturers[/url]. Helping Engineers and Technical Buyers Locate Aluminum Extrusions, Metal Stampers, Plastic Fabricators, Valves, and More.[/font]

  • FYI, the sample data for dbo.BackgroundViews does not have any data that matches the sample data for dbo.IPCITYISP.

  • Lynn Pettis (4/20/2009)


    FYI, the sample data for dbo.BackgroundViews does not have any data that matches the sample data for dbo.IPCITYISP.

    I realize that, but I'd have to modify the queries to pull out some of the LongIPs that the query returns (the end user never sees them). Regardless, didn't think it's relevant to troubleshooting this. I guess you could just change the records in IPCITYISP to those from the actual query, or vice versa, if it really matters. I am not seeing anything that has been added to SQL 2008 that would speed up this query (CTEs included), but just trying to see what comments people have. Surely I can't be the only one with this 'data warehousing' problem of matching up millions of records.

    [font="Tahoma"]Zycon - The Search Engine for Manufacturers[/url]. Helping Engineers and Technical Buyers Locate Aluminum Extrusions, Metal Stampers, Plastic Fabricators, Valves, and More.[/font]

  • If I am going to try and see what I can do, having sample data and expected results based on that sample data is important. How do I know I have a proper query if I have nothing to compare to?

  • I still need to see your query as well. I need to see how the tables are being joined together.

    Also, are there any other indexes on the tables?

  • I'll try to get better sample data tomorrow if that's what you require. Also, all the queries and DDL are there (some code is in the very first post). The only thing you're missing is the "select blah blah blah" query from the first post. That portion is something like:

    SELECT IPAddr, dbo.fnDot2LongIP(IPAddr) AS LongIP, COUNT(IPAddr)

    FROM dbo.BackgroundViews CBV WITH (NOLOCK)

    WHERE (CBV.ClickDT BETWEEN @begDate AND @endDate + ' 23:59:59.997') AND (CBV.CompID = @compid)

    GROUP BY IPAddr, dbo.fnDot2LongIP(IPAddr)

    UNION ALL

    SELECT IPAddr, dbo.fnDot2LongIP(IPAddr), COUNT(IPAddr)

    FROM OtherStatsTable1

    WHERE (DateRange...)

    UNION ALL

    SELECT IPAddr, dbo.fnDot2LongIP(IPAddr), COUNT(IPAddr)

    FROM OtherStatsTable2

    WHERE (DateRange...)

    This portion of retrieving the appropriate records based on the date range, and converting the IP Address to a LongIP for joining to the 6.4 M table is done very quickly and generally ranges anywhere from 1-10 seconds. No other indexes other than what I already provided.

    [font="Tahoma"]Zycon - The Search Engine for Manufacturers[/url]. Helping Engineers and Technical Buyers Locate Aluminum Extrusions, Metal Stampers, Plastic Fabricators, Valves, and More.[/font]

  • Looking at the code and such, I think I know what is going on. Can you post the actual execution plan for the query? I suspect that your query has to scan the entire 6.4 million records to determine the result set as you are looking values between two columns within each record, not a range within a single column.

  • GSquared (4/20/2009)


    If you don't use the "constraint" keyword, that command works.

    create table #T (ID int);

    alter table #T add primary key (ID);

    Just tested that and it worked.

    Ah yes, it sure does. Thanks. That solves the concurrency problem then, assuming I have to stick with temp tables. Hopefully I can avoid that though.

    [font="Tahoma"]Zycon - The Search Engine for Manufacturers[/url]. Helping Engineers and Technical Buyers Locate Aluminum Extrusions, Metal Stampers, Plastic Fabricators, Valves, and More.[/font]

  • Lynn Pettis (4/20/2009)


    Looking at the code and such, I think I know what is going on. Can you post the actual execution plan for the query? I suspect that your query has to scan the entire 6.4 million records to determine the result set as you are looking values between two columns within each record, not a range within a single column.

    I'm sure that's the problem. It's pulling a CI Seek from the temp table (containing the aggregated stats and IPs), but a CI Scan from the 6.4 Million table. I just attached the execution plan to this message (rename extension to .sqlplan). If you need it in text format, let me know. I find that hard to read, though. Also, I sent you a PM yesterday but it never shows as being sent. I resent it as an email in hopes you'll get it.

    [font="Tahoma"]Zycon - The Search Engine for Manufacturers[/url]. Helping Engineers and Technical Buyers Locate Aluminum Extrusions, Metal Stampers, Plastic Fabricators, Valves, and More.[/font]

  • I need related sample data between the two tables. The data previously provided results in an empty result set. I'm not sure what values to actually change or to what.

    I may have a solution, but without data that returns a result set, I'm not sure.

Viewing 15 posts - 1 through 15 (of 31 total)

You must be logged in to reply to this topic. Login to reply