Threw an exception... out of memory

  • It's been a while since I've posted much, but I'm back to functional hardware again, so here goes...

    I recently investigated, for learning purposes, zip-code databases with latitude and longitude values for each 5-digit zip code. I then wanted to see just how well my Vista Ultimate 64-bit system with SQL Server 2005 Developer Edition would handle this. There are 42,192 records for 5-digit zip codes in the free MS Access database I found using a Google search. I then copied the data up to the server using SQL Server Migration Assistant, after changing the latitude and longitude fields to decimal(8,5) and doing a few other small clean-up tasks.

    I did a self-join on this table as a CROSS JOIN, just to try a SELECT, and used a user-defined function to calculate the distance for each zip code pair. It got to roughly 35 million records, plus a few hundred thousand, and then SSMS threw an out-of-memory exception. SSMS didn't crash, and the query kept running, but it stopped producing any additional output. My guess is that SSMS is NOT a 64-bit program? I was watching Task Manager while this ran, and total memory usage never got past 3.78 GB on this 8 GB machine, so is the 2 GB 32-bit process limit what I ran into with Studio? Will I be able to do a SELECT INTO if I want to actually create the table?
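    In case it helps anyone reproduce this, here's a rough sketch of how I'd keep the result set on the server instead of streaming hundreds of millions of rows into the SSMS results grid (which is what blew up). The table and column names match the code I post further down; DISTANCES_TRIAL is just a throwaway name for a test run:

        -- Sanity check: how many pairs would the cross join produce?
        -- COUNT_BIG returns bigint; the full cross join is about 1.78 billion rows, close to the int limit.
        SELECT COUNT_BIG(*)
        FROM dbo.[ZIP Codes] AS Z
        CROSS JOIN dbo.[ZIP Codes] AS Y;

        -- Materialize on the server rather than returning rows to the client grid;
        -- TOP is only there to keep a first trial run small.
        SELECT TOP (1000000) Z.[ZIP Code] AS ZIP1, Y.[ZIP Code] AS ZIP2
        INTO dbo.DISTANCES_TRIAL      -- throwaway table, just for the test
        FROM dbo.[ZIP Codes] AS Z
        CROSS JOIN dbo.[ZIP Codes] AS Y;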

    Just figured I'd see what anyone might know about this kind of scenario...

    Steve

    (aka smunson)

    :-):-):-)

    Steve (aka sgmunson) 🙂 🙂 🙂
    Rent Servers for Income (picks and shovels strategy)

  • SSMS isn't 64-bit as far as I know.

    I built a VLDB in SQL 2008 which takes 17 billion rows per month (all numeric values), using good old-fashioned INSERT INTO ... SELECT. We then massaged it into the cube source table with a PIVOT or two.
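    Shape-wise, the PIVOT step looked something like the sketch below; the table and column names here are made up purely for illustration, not the real schema:

        -- Illustrative only: roll interval readings up into one row per meter
        -- so the cube source table is wide rather than tall.
        SELECT MeterId, [1] AS Interval01, [2] AS Interval02, [3] AS Interval03
        FROM (
            SELECT MeterId, IntervalNo, ReadingValue
            FROM dbo.StagingReadings          -- hypothetical staging table
        ) AS src
        PIVOT ( SUM(ReadingValue) FOR IntervalNo IN ([1], [2], [3]) ) AS p;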

    I've never tried SELECT ... INTO with that many rows, but I can't see why it wouldn't work.
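    One thing worth knowing before you try it: SELECT ... INTO is a minimally logged operation under the SIMPLE or BULK_LOGGED recovery models, so it should be far kinder to the log than a fully logged INSERT. A rough sketch, untested at this scale (substitute your own database name for ZIP_CODES, and plug in your distance calculation where the placeholder sits):

        -- Minimal logging only applies under SIMPLE or BULK_LOGGED recovery.
        ALTER DATABASE ZIP_CODES SET RECOVERY BULK_LOGGED;

        SELECT Z.[ZIP Code] AS ZIP1, Y.[ZIP Code] AS ZIP2,
               CAST(NULL AS decimal(7,2)) AS DISTANCE   -- placeholder; real distance calc goes here
        INTO   dbo.DISTANCES
        FROM   dbo.[ZIP Codes] AS Z
        JOIN   dbo.[ZIP Codes] AS Y
          ON   Z.[ZIP Code] < Y.[ZIP Code];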

  • Well folks, things have definitely gotten interesting... I know that SQL Server has some level of overhead in a table, but I'm completely lost as to how the following code managed to expand the database the new table is located in to more than 360 GIG!!! This new table's records are 15 bytes each for just the data, and on that basis alone, with the number of records it SHOULD generate being 890,061,336, one would conclude there's 12 GIG of data +/- a few hundred meg. How on earth did this generate 360+ GIG??? I had to hit cancel on the query to stop it from running disk space down to zero when I was down to just 22 MB of free space, and from what I can see, even if logging of the inserts means 200% more data, that's only another 24 GIG, for a total of 36, and I had over 360!!! By the way, tempdb never changed in size... at least not that I could see. Anyone have any ideas?

    Here's the code:

    DECLARE @START DateTime, @END DateTime, @DURATION decimal(12,3)

    SET @START = GETDATE()

    CREATE TABLE dbo.DISTANCES (
        ZIP1 char(5),
        ZIP2 char(5),
        DISTANCE decimal(7,2)
    )

    INSERT INTO dbo.DISTANCES (ZIP1, ZIP2, DISTANCE)
    SELECT Z.[ZIP Code] AS ZIP1, Y.[ZIP Code] AS ZIP2,
        CASE
            WHEN (Z.Latitude = Y.Latitude) AND (Z.Longitude = Y.Longitude) THEN
                CAST(0 AS decimal(7,2))
            ELSE ROUND(69.1 * 180 / PI() * ACOS(
                SIN(RADIANS(Z.Latitude)) * SIN(RADIANS(Y.Latitude)) +
                COS(RADIANS(Z.Latitude)) * COS(RADIANS(Y.Latitude)) *
                COS(RADIANS(Y.Longitude) - RADIANS(Z.Longitude))), 2)
        END AS DISTANCE
    FROM dbo.[ZIP Codes] AS Z, dbo.[ZIP Codes] AS Y
    WHERE Z.[ZIP Code] < Y.[ZIP Code]

    SET @END = GETDATE()
    SET @DURATION = CAST(DATEDIFF(ms, @START, @END) AS decimal(12,3)) / 1000.
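    For anyone following along, a quick way to watch where the space is going in another window while something like this runs (these are all standard commands, nothing exotic):

        EXEC sp_spaceused 'dbo.DISTANCES';   -- reserved / data / index sizes for the new table
        DBCC SQLPERF(LOGSPACE);              -- log file size and percent used, per database
        EXEC sp_helpfile;                    -- physical data and log file sizes for the current database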

    Steve

    (aka smunson)

    :w00t::w00t::w00t:


  • smunson (7/2/2009)


    Well folks, things have definitely gotten interesting... I know that SQL Server has some level of overhead in a table, but I'm completely lost as to how the following code managed to expand the database the new table is located in to more than 360 GIG!!! This new table's records are 15 bytes each for just the data, and on that basis alone, with the number of records it SHOULD generate being 890,061,336, one would conclude there's 12 GIG of data +/- a few hundred meg. How on earth did this generate 360+ GIG??? I had to hit cancel on the query to stop it from running disk space down to zero when I was down to just 22 MB of free space, and from what I can see, even if logging of the inserts means 200% more data, that's only another 24 GIG, for a total of 36, and I had over 360!!! By the way, tempdb never changed in size... at least not that I could see. Anyone have any ideas?

    It's entirely possible that it grew to more than 10 times your expected size. I've seen that sort of growth before, especially in full recovery mode.

    I did some calculations, and you could fit roughly 546 rows per page based on a row size of 15 bytes. That excludes the indexes if you have any (as it turns out, quite an important factor). If you multiply that out, you'll have about 1,630,149 pages, which is your 12 GB, give or take.

    The transaction log has to keep enough information to undo everything until the transaction completes, to ensure ACID compliance.

    You mention you had to stop the query before it finished - I'd be interested to find out (given sufficient space) what the final result would be, i.e. how much of the database is actual data versus free space and log file, and what the fragmentation would be on the indexes.

    Regarding indexing, if you have a clustered index on one or more of the columns, the inserts would cause page splits if they are not in key order. This would completely destroy (y)our theory of 12 GB. Something to think about.
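    To refine that arithmetic a little (rough numbers, not measurements): the 15 bytes is only the fixed column data. A heap row also carries a record header, column count and null bitmap (about 7 bytes for a table like this) plus a 2-byte slot-array entry on the page, so the realistic figure is closer to 24 bytes per row and roughly 340 rows per page, which puts the table nearer 20 GB than 12 GB before any indexes:

        -- Back-of-the-envelope only; overhead figures are rules of thumb, not measured values.
        -- ~15 bytes data + ~7 bytes record header/null bitmap + 2 bytes slot entry = ~24 bytes/row
        -- ~8096 usable bytes per 8 KB page / 24 = ~337 rows per page
        SELECT 890061336 / 337 AS est_pages,
               CAST(890061336 / 337 AS bigint) * 8192 / 1073741824.0 AS est_size_GB;  -- comes out around 20 GB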

  • I'd agree easily that indexes could certainly cause that kind of difference, but I didn't create any, nor did I even create a primary key, and that was all intentional, so as to avoid anything but data space and basic table overhead. All that I ran is exactly the code I posted, and the database size prior to that wasn't more than 50 MB of actual data. Unfortunately, I do not know how big the entire database was, which would have included indexes on the existing data. I have a hard time believing that the existing tables would get a before/after snapshot for every insert into a brand-new table. If it's of any value, the new table had gotten to roughly 790,000,000 records as of the point at which I killed it.

    In any case, since I didn't have ANY indexes or keys in this new table, I can't see what could possibly generate 360 GIG where only 12 GIG is needed. Any other ideas?

    Since I don't have to preserve what I ended up with, I'm going to drop that database, delete its data and log files, re-create it using simple recovery, re-migrate the data and then rerun my code and see what that does for me. Unless I see this thing move a lot faster this time, I'll kill it a lot sooner, so I don't have to mess with nasty long wait times for the query to cancel.
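    The recovery model switch and log cap I have in mind look roughly like this (sketch only; ZIP_CODES and ZIP_CODES_log are the database and log file names I'm using):

        -- Simple recovery: committed log space becomes reusable after each checkpoint,
        -- so the log shouldn't grow without bound during the reload.
        ALTER DATABASE ZIP_CODES SET RECOVERY SIMPLE;

        -- Optionally cap the log so a runaway load can't run the disk down to zero again.
        ALTER DATABASE ZIP_CODES
        MODIFY FILE (NAME = ZIP_CODES_log, MAXSIZE = 1GB);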

    Steve

    (aka smunson)

    :-):-):-)


  • UPDATE:

    I'd still like to know what was going on with such huge logging, and it was definitely the log file growth that was the problem. As this database doesn't need to be logged during creation, and because I can recreate it rather easily, I've adopted a bit of log maintenance to solve the problem. Clearly, this isn't exactly applicable to typical production databases, but it will get me past the problem until such time as I have a LOT more disk space readily available. I've also intentionally limited the log size to just 1 GIG, as there's just no need for anything beyond that. For technique, I broke up the inserts into groups based on the first four digits of the zip code. I then just used a WHILE loop and a loop counter (with its character equivalent plus leading zeros), and added a CHECKPOINT, a log truncation, and a DBCC SHRINKFILE on the log file to each pass. This way, I effectively eliminate any lasting effect of logging, and still get my desired table.

    I'd love to hear any alternative solutions that would have allowed me to not have to go through these gyrations that don't involve additional hardware (I have 2 new 1TB drives on order, but I needed this done today). Learning is always a good thing.

    Here's my code:

    CREATE TABLE dbo.DISTANCES (
        ZIP1 char(5),
        ZIP2 char(5),
        DISTANCE decimal(7,2)
    )

    DECLARE @START DateTime, @END DateTime, @DURATION decimal(12,3), @LOOPER int, @LOOPCHAR varchar(5)

    SET @LOOPER = 0
    SET @START = GETDATE()
    PRINT 'STARTING DATE/TIME: ' + CAST(@START AS varchar(30))

    WHILE (@LOOPER < 10000)
    BEGIN
        SET @LOOPCHAR = RIGHT('000' + CAST(@LOOPER AS varchar(4)), 4) + '%'
        PRINT 'LOOP START FOR #' + @LOOPCHAR

        INSERT INTO dbo.DISTANCES (ZIP1, ZIP2, DISTANCE)
        SELECT Z.[ZIP Code] AS ZIP1, Y.[ZIP Code] AS ZIP2,
            CASE
                WHEN (Z.Latitude = Y.Latitude) AND (Z.Longitude = Y.Longitude) THEN
                    CAST(0 AS decimal(7,2))
                ELSE ROUND(69.1 * 180 / PI() * ACOS(
                    SIN(RADIANS(Z.Latitude)) * SIN(RADIANS(Y.Latitude)) +
                    COS(RADIANS(Z.Latitude)) * COS(RADIANS(Y.Latitude)) *
                    COS(RADIANS(Y.Longitude) - RADIANS(Z.Longitude))), 2)
            END AS DISTANCE
        FROM dbo.[ZIP Codes] AS Z, dbo.[ZIP Codes] AS Y
        WHERE Z.[ZIP Code] < Y.[ZIP Code] AND
              Z.[ZIP Code] LIKE @LOOPCHAR

        SET @LOOPER = @LOOPER + 1
        PRINT 'LOOP PASS #' + @LOOPCHAR

        CHECKPOINT
        BACKUP LOG ZIP_CODES WITH TRUNCATE_ONLY
        DBCC SHRINKFILE(ZIP_CODES_log)
        PRINT 'LOG MAINTENANCE COMPLETED'
    END

    SET @END = GETDATE()
    SET @DURATION = CAST(DATEDIFF(ms, @START, @END) AS decimal(12,3)) / 1000.
    PRINT 'ENDING DATE/TIME: ' + CAST(@END AS varchar(30))
    PRINT 'TOTAL DURATION WAS: ' + CAST(@DURATION AS varchar(30)) + ' SECONDS'

    Steve

    (aka smunson)

    :-):-):-)


  • Simple recovery mode and batch inserts will probably be enough, without shrinking the logs, but as you say, this isn't production.
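    By that I mean something like the sketch below (untested at this scale, and with the distance calculation left out for brevity): under SIMPLE recovery the log space of each committed batch becomes reusable at the next checkpoint, so a modest, fixed-size log is usually enough without any BACKUP LOG or SHRINKFILE in the loop.

        -- Sketch: 100 batches keyed off the first two digits of ZIP1, SIMPLE recovery assumed.
        DECLARE @i int
        SET @i = 0
        WHILE @i < 100
        BEGIN
            INSERT INTO dbo.DISTANCES (ZIP1, ZIP2, DISTANCE)
            SELECT Z.[ZIP Code], Y.[ZIP Code],
                   CAST(NULL AS decimal(7,2))          -- distance calc omitted; same as Steve's
            FROM dbo.[ZIP Codes] AS Z
            JOIN dbo.[ZIP Codes] AS Y
              ON Z.[ZIP Code] < Y.[ZIP Code]
            WHERE Z.[ZIP Code] LIKE RIGHT('0' + CAST(@i AS varchar(2)), 2) + '%'

            SET @i = @i + 1
        END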

    It's always interesting to do this sort of thing. My 17 billion row per month database was done with batch inserts per customer, at 1000 rows per customer, and we used minimal logging anyway, so we didn't run into this issue. Well, I probably designed around it without realising it.

    Thanks for doing all the heavy lifting 🙂

  • Whew.... it ran successfully last night!!! Thus I now have 890,061,336 records that occupy 20.6 GIG at the table level (still an awful lot of overhead at 66% over the raw data), and a database that's only 20.8 GIG. Total run time was 4 hrs, 37 minutes, 50.61 seconds on a Vista Ultimate 64-bit system with SQL Server 2005 Developer Edition 64-bit, on a system with an ASUS motherboard (P5Q3 Deluxe WiFi AP@n) with 8 GIG of RAM, a quad-core Intel Q9550 CPU at 2.83 GHz, and with max server memory set to 5000 MB. The SQL Server service account does have the privilege to lock pages in memory. SQL Server has SP2 installed, and is at 9.0.3050.

    I still need to know more about what all that overhead is. I watched a good portion of that process and saw all 4 cores doing work for a combined max CPU usage of ~30%. That's rather low, given I would have thought this thing would be CPU-bound based on the distance calculation involving so many trig functions. Am I right in guessing it was disk I/O it was waiting for? That's not entirely supported by the evidence I had in watching the hard drive light. That was pegged during the periods when CPU would spike up into the 20s or 30s, and then the CPU would nearly idle for a while, while the disk light would only flash fairly regularly. That was a clear and repeated pattern, and I'd also love to know what that was all about. Ultimately, I'm asking these questions because I have what I perceive to be a rather heavy-duty piece of hardware, and I really thought that the process of creating that table was going to take maybe half an hour to an hour given my hardware. Any thoughts? What else can I learn here?
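    Rather than guess, here's how I'm planning to check the waits (both DMVs exist in SQL 2005; the numbers are cumulative since the last service restart, so I'll compare before and after a run):

        -- Heavy PAGEIOLATCH_* and WRITELOG waits near the top would point to disk I/O.
        SELECT TOP (10) wait_type, waiting_tasks_count, wait_time_ms, signal_wait_time_ms
        FROM sys.dm_os_wait_stats
        ORDER BY wait_time_ms DESC;

        -- Per-file I/O stall times, to see whether the data file or the log file is the bottleneck.
        SELECT DB_NAME(vfs.database_id) AS db_name, mf.name AS file_name,
               vfs.io_stall_read_ms, vfs.io_stall_write_ms
        FROM sys.dm_io_virtual_file_stats(NULL, NULL) AS vfs
        JOIN sys.master_files AS mf
          ON mf.database_id = vfs.database_id AND mf.file_id = vfs.file_id;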

    Now for the next step: creating the indexes on the two ZIP fields and then establishing a new field that will categorize the distances. I'm wondering what kind of logging I'll see, and if I'll have a repeat of the previous experience with running out of disk space.

    If anyone has any advice on the logging, please let me know. The more I know, the better prepared I'll be to make it work. The plan for the indexes is to provide coverage for a search where a single zip code could be in either ZIP1 or ZIP2. I don't think I need a primary key here, and I'm looking at creating two indexes, non-clustered, one on each of those fields. Advice is welcome on the indexes as well.
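    Roughly what I have in mind (index names are just placeholders; the INCLUDE columns would make each index covering for the lookup, at the cost of extra space):

        CREATE NONCLUSTERED INDEX IX_DISTANCES_ZIP1
            ON dbo.DISTANCES (ZIP1) INCLUDE (ZIP2, DISTANCE);

        CREATE NONCLUSTERED INDEX IX_DISTANCES_ZIP2
            ON dbo.DISTANCES (ZIP2) INCLUDE (ZIP1, DISTANCE);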

    Steve

    (aka smunson)

    :-):-):-)


  • First off, non-clustered indexes will be better. I don't even want to think how long it would take to build a clustered index.

    I think you would have seen better performance using table partitioning.
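    A rough sketch of what I mean (partitioning needs Developer or Enterprise Edition; the boundary values and the single-filegroup scheme here are placeholders, not a recommendation):

        -- Split DISTANCES into ranges of ZIP1 so each batch mostly touches one partition.
        CREATE PARTITION FUNCTION pfZip1 (char(5))
        AS RANGE LEFT FOR VALUES ('19999', '39999', '59999', '79999');

        CREATE PARTITION SCHEME psZip1
        AS PARTITION pfZip1 ALL TO ([PRIMARY]);

        -- The table and/or its indexes would then be created ON psZip1(ZIP1).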

    Regarding the CPU, you can expect ~30% CPU usage with I/O of this magnitude. The calcs you're doing are tiny to the point of being negligible compared to the disk thrashing.

    I think that's enough for now. Your box needs a nap 🙂

  • Well, at least I seem to be on the right track, but I'm not about to give my box a nap... it's a quad-core, for crying out loud - it's SUPPOSED to kick you know what.... and if I'd been able to afford it, I would have gone with a dual-socket setup with a quad-core in each socket. I just want to know if I can MAKE that system perform like a speed skater instead of a steam train.

    Steve

    (aka smunson)

    :-):-):-)


  • UPDATE:

    I've created my two indexes on the ZIP Code fields, which didn't take all that long, and expanded the database from 16 GIG to roughly 50 GIG. Given the previous experience, where I was running out of disk space due to extraordinary log file growth far beyond what I thought were reasonable expectations (23 times the actual final data size, when I expected perhaps 3 to 5 times), I decided to use a similar methodology for updating the distance categorization field, again breaking the task down into 10,000 manageable pieces using a nearly identical loop and log-maintenance code. It took almost 14 hours to run, and I suspect that the logging of updates is somehow significantly smaller than for inserts, because most of that time appears to have been occupied by the repeated log maintenance. Geez Louise, maybe 10,000 pieces in this instance was more than a few too many, but how was I to know?

    I then created an index on this last new field, and I now have a 70 GIG database. My next task will be to see how long it takes to query it for a given zip code and have it find all zip codes within a given distance. This should be interesting.
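    The lookup I plan to time looks roughly like this (sketch only; since each pair is stored once, with ZIP1 < ZIP2, the search has to check both columns):

        DECLARE @ZIP char(5), @MILES decimal(7,2)
        SET @ZIP = '10001'        -- sample zip code
        SET @MILES = 25.00        -- sample radius in miles

        SELECT D.ZIP2 AS NearbyZip, D.DISTANCE
        FROM dbo.DISTANCES AS D
        WHERE D.ZIP1 = @ZIP AND D.DISTANCE <= @MILES
        UNION ALL
        SELECT D.ZIP1, D.DISTANCE
        FROM dbo.DISTANCES AS D
        WHERE D.ZIP2 = @ZIP AND D.DISTANCE <= @MILES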

    Of course, I'm still baffled by the logging, and any technical detail anyone has on exactly what gets logged, and under what circumstances, would be most welcome, so I can more accurately determine how to survive limited disk space scenarios.
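    One thing I may try, rather than guessing, is measuring the log generated by a representative batch (sketch only; the DMV below exists in SQL 2005 and reports the log bytes used by the current transaction, and the sample update is just a stand-in):

        BEGIN TRAN
            -- Any representative batch would do; this one just re-rounds a slice of rows.
            UPDATE TOP (100000) dbo.DISTANCES
            SET DISTANCE = ROUND(DISTANCE, 1)

            SELECT database_transaction_log_bytes_used,
                   database_transaction_log_record_count
            FROM sys.dm_tran_database_transactions
            WHERE database_id = DB_ID()
        ROLLBACK TRAN   -- throw the test change away; only the numbers matter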

    Steve

    (aka smunson)

    :-):-):-)


  • ANOTHER UPDATE:

    I hate when I do this kind of thing - I messed up my initial update to the distance category field. I forgot to change out a constant I'd been using in my calculation tests and replace it with the actual distance field. Dumb, de dumb dumb - DOOOOHHHHH !!!!

    Anyway, I'm in the process of re-running, and that gave me a chance to try it in 1,000 pieces instead of 10,000, and that's going to cut the run-time to a bit less than half of what it was before. It's been running for 2 1/4 hours so far, and is in the 433xx's at the moment, so I'm guessing about 5 to 6 hours vs. the nearly 14 last time.

    The index on the new field will take an additional hour, if past history is any indication.

    Steve

    (aka smunson)

    :-):-):-)


  • UPDATE #3:

    My re-run of the distance categories update ran in 3:23:09.29. Wow, what a difference! Index creation now in progress.

    Steve

    (aka smunson)

    :-):-):-)


  • Steve, I'm not sure I can help, but I'm certainly interested in this problem. Would you provide a link to the free database you mentioned so I can "play" along with you on this problem? Who knows? I might even be able to come up with a high-speed alternative. Thanks.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Never mind, Steve... I found one that's a CSV file at

    http://www.boutell.com/zipcodes/

    --Jeff Moden


