December 10, 2007 at 2:40 pm
Currently I am converting 10 years' worth of customer and order data from an old FileMaker database into MS SQL 2000, and the best way we have found to do this is with cursors in T-SQL. I have it all working now, but hit a few bumps along the way.
With SQL 2000 it seems there is major memory paging going on when a large table is read from within a cursor. I have done quite a bit of research on this over the last couple of weeks but didn't find much, as everyone seems to hate cursors LOL
Anyway, I imported a 1,000,000+ line CSV file into a temp table in SQL 2000. From this I built a cursor to read it one row at a time, run one SELECT to find a product ID in the new product table, insert the data into the new table, then mark the row as imported. Pretty basic, but this cursor was taking days to run!
I did some testing and ran the cursor from a smaller 100,000-row table, and there was a huge performance gain. Where I was processing around 200 rows per minute on my older and overworked SQL Server, I was suddenly getting 4,000+ rows a minute!
So I did some more testing. I tried to make smaller cursors with a WHILE loop that only grabbed 10,000 rows at a time from my 1,000,000-row table, but that only got me up to 400 rows a minute. Then I expanded the WHILE loop to read 10k rows, create a new table and populate it with them, build the cursor over that temp table, and drop it when done. Rinse and repeat. This got me up to 5,000+ rows a minute, which I am quite happy with given my server.
Now I ask you all: does SQL 2005 have these same types of issues with cursors on large tables?
I just didn't find any info like this on any site in the last 3 weeks of searching, so I thought it would be good to throw it out there as it might help others speed up their cursors.
Perhaps this way of breaking down a cursor will help someone...
December 10, 2007 at 2:57 pm
I think that you'll find the same problems in SQL Server 2005 as well. SQL Server is designed to work on sets of data; row-by-row processing is terribly inefficient, slow, and resource intensive. For the example you gave, why do you feel you need to insert the rows one at a time? If you've already imported the CSV into a table, why can't you do your INSERT as a set-based transaction?
December 10, 2007 at 3:13 pm
For all of the cursors I have written for this data conversion, I need to do one or more separate selects or inserts besides the row I'm working with. Sometimes I need to do some math with the data before it is inserted; sometimes I need to break an address or phone number out to be inserted into a different table.
The above example is the simplest cursor I have done so far; it is for order line items and does not write to more than one table. So yes, this one might be able to be done another way.
I just have not found another way to do this other than a cursor.
I'm not quite sure what you mean by "INSERT as a set-based transaction".
I'm still kinda new to MS SQL 2000. I have 6 years of experience with MySQL and PHP on smaller e-commerce databases, but this large data conversion stuff is new to me... Still got a lot to learn.
December 10, 2007 at 3:30 pm
Set-based means that you are working on 'sets' of data instead of rows. Any time you can apply logic or business rules across a set of common data, you'll way outperform cursoring through each row and working one at a time. Why don't you give us some more detail on your setup and schema, and then post your cursors? If you provide enough background and setup info, you'll have several experts jump in and help you. What you are trying to do sounds like a pretty basic ETL (Extract-Transform-Load) operation, and I've yet to find one that actually requires a cursor.
So in your case, you are using a cursor because you have to take different actions depending on how the data looks, right? This tells me you have some sort of business rules that determine what type of activity you need to perform on that data. Instead of cursoring through each row and deciding what to do with it based on those rules, why not run one operation for each grouping or set of rows? I may not be coming across real clear, so take this example. Say you have a table that looks like this:
MyTable
RowNum   RowValue
1        A
2        B
3        C
4        B
5        B
6        A
Let's say that the RowValues define which set of business rules you are going to use to load this data. The cursor approach says to process all rows one at a time, starting with row 1. As soon as the cursor pulls RowNum 1, it says 'Ah ha, this is a type A. I'll run the business logic for this type of row.' It processes RowNum 1 and then moves on to RowNum 2 to determine what to do there. This continues until there are no more rows. Pretty simple, right? Well, with set-based logic, you apply your business logic to each set of data. In this case, your code would process RowNums 1 and 6 at the same time, rows 2, 4, and 5 at the same time, and then row 3.
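To make that concrete, here is a minimal sketch of the set-based version of that example, assuming the MyTable layout above; the TargetA/TargetB/TargetC tables are made up just to stand in for "run the type A/B/C business rules":
[Code]
-- Each statement handles an entire set of rows in one pass
-- (TargetA/TargetB/TargetC are hypothetical tables)
INSERT INTO TargetA (RowNum, RowValue)
SELECT RowNum, RowValue
FROM MyTable
WHERE RowValue = 'A'   -- RowNums 1 and 6 in one go

INSERT INTO TargetB (RowNum, RowValue)
SELECT RowNum, RowValue
FROM MyTable
WHERE RowValue = 'B'   -- RowNums 2, 4 and 5 in one go

INSERT INTO TargetC (RowNum, RowValue)
SELECT RowNum, RowValue
FROM MyTable
WHERE RowValue = 'C'   -- RowNum 3
[/Code]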
Post your cursors and you'll get some help. If you really are open to learning, having a cursor rewritten as set-based for you will get you going down the right path.
December 10, 2007 at 4:29 pm
OK, here is the cursor I am using for the above example. This is the simplest cursor I have used so far so it is probably the easiest to work with.
The 'posted_order_items_from_filemaker2' table is 1,027,000 rows and has all of the data from the old database. The 'posted_order_items_from_filemaker' table is being used as a temp table for each pass of the cursor. I am marking each row as 'done' in the 'posted_order_items_from_filemaker2' table in case the cursor needs to be stopped (or stops on its own); I had Query Analyzer freeze up when I was importing the main order table and breaking it down with a much more complex cursor. The 'imported' field is '0' when not imported, then changed to '1' after the new row is written to the new 'OrderItems' table.
By breaking this cursor down into smaller ones, it took 2 hours 51 minutes to finish. When it was one large cursor it took something like 3 days! I consider this a big improvement, but if there is a better way then I'm all ears! Or is that eyes? hmmm
DECLARE @comments varchar(255)
DECLARE @customer_key varchar(255)
DECLARE @discount varchar(255)
DECLARE @item_code varchar(255)
DECLARE @price_per_item money
DECLARE @order_key varchar(255)
DECLARE @quantity_received varchar(255)
DECLARE @quantity_to_ship varchar(255)
DECLARE @cost money
DECLARE @price_per_item_after_discount money
DECLARE @line_id int
DECLARE @customer_id int
DECLARE @product_id int
DECLARE @quantity int
DECLARE @start_num int
DECLARE @end_num int
DECLARE @ImportCursor CURSOR
SET @start_num = 1
SET @end_num = 10000
-- Work through the source table in 10,000-row batches
WHILE @start_num < 1030000
BEGIN
-- Copy the current batch of unimported rows into a working table
SELECT * INTO posted_order_items_from_filemaker FROM posted_order_items_from_filemaker2
WHERE imported = '0'
AND id >= @start_num
AND id <= @end_num
-- Build a fast-forward cursor over just this batch
SET @ImportCursor = CURSOR FAST_FORWARD
FOR
Select comments, customer_key, discount, item_code, price_per_item, order_key, quantity_received,
quantity_to_ship, cost, price_per_item_after_discount, id
From posted_order_items_from_filemaker
OPEN @ImportCursor
FETCH NEXT FROM @ImportCursor
INTO @comments, @customer_key, @discount, @item_code, @price_per_item, @order_key,
@quantity_received, @quantity_to_ship, @cost, @price_per_item_after_discount, @line_id
WHILE @@FETCH_STATUS = 0
BEGIN
-- Look up the new product id for this item code (stays 0 if there is no match)
SET @product_id = 0
SELECT @product_id = id
FROM Products
WHERE ProductCode = @item_code
-- Convert before adding, otherwise the two varchar values would just be concatenated
SET @quantity = CAST(@quantity_received AS int) + CAST(@quantity_to_ship AS int)
SET NOCOUNT ON
INSERT INTO OrderItems (
OrderID,
ItemID,
Qty,
QtyDelivered,
QtyToSend,
item_price,
item_cost,
comments,
filemaker_discount,
filemaker_price_per_item_after_discount,
import_item,
filemaker_id
) VALUES (
@order_key,
@product_id,
@quantity,
@quantity_received,
@quantity_to_ship,
@price_per_item,
@cost,
@comments,
@discount,
@price_per_item_after_discount,
'1',
@line_id
)
-- Mark the source row as imported so the run can be safely stopped and restarted
UPDATE posted_order_items_from_filemaker2
SET imported = '1'
WHERE id = @line_id
FETCH NEXT FROM @ImportCursor
INTO @comments, @customer_key, @discount, @item_code, @price_per_item, @order_key,
@quantity_received, @quantity_to_ship, @cost, @price_per_item_after_discount, @line_id
END
CLOSE @ImportCursor
DEALLOCATE @ImportCursor
SET @start_num = @start_num + 10000
SET @end_num = @end_num + 10000
-- Drop the working table so the next batch can recreate it
DROP TABLE posted_order_items_from_filemaker
END
December 10, 2007 at 4:59 pm
This will do everything:
[Code]
INSERT INTO OrderItems (
OrderID,
ItemID,
Qty,
QtyDelivered,
QtyToSend,
item_price,
item_cost,
comments,
filemaker_discount,
filemaker_price_per_item_after_discount,
import_item,
filemaker_id
)
SELECT
F.order_key,
P.id,
F.quantity_received + F.quantity_to_ship,
F.quantity_received,
F.quantity_to_ship,
F.price_per_item,
F.cost,
F.comments,
F.discount,
F.price_per_item_after_discount,
'1',
F.id
From dbo.posted_order_items_from_filemaker2 F
INNER JOIN dbo.Products P ON P.ProductCode = F.item_code
[/Code]
_____________
Code for TallyGenerator
December 10, 2007 at 5:32 pm
Thanks! That definitely sheds some light on how to avoid a cursor.
I tried to do an INNER JOIN in my cursor, but unfortunately some of the input rows have a blank (null) item_code or one that is in error (not in the Products table), so those ended up getting skipped completely. That is why I broke that part out into a separate select statement, so as to allow it to be blank and still get inserted into the new table. It's sloppy, but I need to get all of the incomplete rows in to convert all the old orders into the new system.
I have a lot of 'bad' data that has to be allowed for 🙁
Also, in my more complex cursors I need to check whether an address is already in the address table; if it is, I need to pull its address_id, otherwise I need to create a new address and then use its address_id. How would I do something like this? I have several other more complex cursors which are all doing selects and then using IF statements to determine the next step, i.e. INSERT or SELECT or maybe even an UPDATE.
December 10, 2007 at 5:46 pm
I have a lot of 'bad' data that has to be allowed for 🙁
A LEFT JOIN to dbo.Products will do just that.
But can you explain what that "bad" data actually means?
Bad data should not be allowed into a relational database.
Also, in my more complex cursors I need to check whether an address is already in the address table; if it is, I need to pull its address_id, otherwise I need to create a new address and then use its address_id. How would I do something like this? I have several other more complex cursors which are all doing selects and then using IF statements to determine the next step, i.e. INSERT or SELECT or maybe even an UPDATE.
INSERT INTO dbo.Address
SELECT ....
WHERE NOT EXISTS (select 1 from dbo.Address
where {check unique key existence})
Then just join to dbo.Address in the final insert.
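For example, here is a sketch of that pattern; the street/city/zip columns, the posted_orders_from_filemaker source table, and the dbo.Orders target are assumptions, not your actual schema:
[Code]
-- Step 1: add only the addresses that are not already in dbo.Address
-- (all column and table names here are hypothetical)
INSERT INTO dbo.Address (Street, City, Zip)
SELECT DISTINCT F.street, F.city, F.zip
FROM dbo.posted_orders_from_filemaker F
WHERE NOT EXISTS (SELECT 1 FROM dbo.Address A
                  WHERE A.Street = F.street
                    AND A.City = F.city
                    AND A.Zip = F.zip)

-- Step 2: the final insert joins back to dbo.Address to pick up the address_id
INSERT INTO dbo.Orders (order_key, address_id)
SELECT F.order_key, A.address_id
FROM dbo.posted_orders_from_filemaker F
INNER JOIN dbo.Address A
        ON A.Street = F.street
       AND A.City = F.city
       AND A.Zip = F.zip
[/Code]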
_____________
Code for TallyGenerator
December 10, 2007 at 6:01 pm
By 'bad data' I mean data that is not in the correct format. For instance, in the above line-item import, I ran across 3 rows which had '.5' for the quantity. As this is a float value and not an integer, it made the cursor stop each time. I altered the original data to '1' and restarted the cursor. I am also running into a lot of data that is blank/null when it should not be, or that has invalid values.
The data was pulled from a FileMaker Pro 4.0 database, which seems to have extremely simplistic data types. A 'Number' field can be an integer or a float, for example, and a Text field can be a varchar of any length. FileMaker does not make you specify data lengths anywhere, it seems, so I'm having to find out through trial and error where I need to tweak the lengths of my varchars.
If I'm processing this garbage data one row at a time and marking each row as 'done', then when it stops I can quickly find where it stopped and look at the data in that row to see what is happening.
I'll try implementing the suggestions above to see if I can allow for all the variables I'm dealing with. Thanks for the input! 🙂
December 10, 2007 at 6:16 pm
Steve Stout (12/10/2007)
By 'bad data' I mean data that is not in the correct format. For instance, in the above line-item import, I ran across 3 rows which had '.5' for the quantity. As this is a float value and not an integer, it made the cursor stop each time. I altered the original data to '1' and restarted the cursor. I am also running into a lot of data that is blank/null when it should not be, or that has invalid values. The data was pulled from a FileMaker Pro 4.0 database, which seems to have extremely simplistic data types. A 'Number' field can be an integer or a float, for example, and a Text field can be a varchar of any length. FileMaker does not make you specify data lengths anywhere, it seems, so I'm having to find out through trial and error where I need to tweak the lengths of my varchars.
If I'm processing this garbage data one row at a time and marking each row as 'done', then when it stops I can quickly find where it stopped and look at the data in that row to see what is happening.
I'll try implementing the suggestions above to see if I can allow for all the variables I'm dealing with. Thanks for the input! 🙂
You have your data types, and you have your business rules for the imported data.
Right?
So, just put the checks for those rules in the WHERE clause:
WHERE Quantity NOT LIKE '%[^0-9]%' -- this will leave only rows with integer quantity
AND EXISTS (select 1 from Products P WHERE P.ProductCode = F.item_code)
AND {whatever else you need to check}
If you want to look at the rejected lines, store them in a "Failed_FileMaker" table:
INSERT INTO dbo.Failed_FileMaker (...)
SELECT ...
WHERE Quantity LIKE '%[^0-9]%' OR Quantity = '' OR Quantity IS NULL
OR NOT EXISTS (select 1 from Products P WHERE P.ProductCode = F.item_code)
OR .... {whatever else you need to check}
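Fleshed out against the line-item table from earlier in the thread, that might look something like this (the Failed_FileMaker column list and the fail_reason text are assumptions):
[Code]
-- Sketch only: divert rows with a non-integer quantity or an unknown product code
INSERT INTO dbo.Failed_FileMaker (filemaker_id, item_code, quantity_received, fail_reason)
SELECT F.id, F.item_code, F.quantity_received, 'bad quantity or unknown product code'
FROM dbo.posted_order_items_from_filemaker2 F
WHERE F.quantity_received LIKE '%[^0-9]%'
   OR F.quantity_received = ''
   OR F.quantity_received IS NULL
   OR NOT EXISTS (SELECT 1 FROM dbo.Products P
                  WHERE P.ProductCode = F.item_code)
[/Code]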
_____________
Code for TallyGenerator
December 11, 2007 at 7:44 am
If you want to learn SSIS, and it is worth the investment in time, you could do all of this there and implement complex rules for handling it.
However Sergiy and John have given you some great advice for working with this in T-SQL.
December 11, 2007 at 8:49 am
Wow, you guys were working late yesterday. Thanks for jumping in, Sergiy. Steve (Stout), the example Sergiy gave you is exactly what I was describing. He was able to replace all of the cursor logic and variable usage with one simple INSERT INTO...SELECT statement. Feel free to post some of your more complex cursors here and we'll see what can be done. Also, Steve (Jones) is correct that all of this can be done in DTS or SSIS if you have the time to play around with it. I would say that your best bet is to do this in straight T-SQL just so you can start to develop the right coding habits. The faster you are able to change your thinking from writing procedural T-SQL code to set-based T-SQL code, the better it will be for you in the long run.
December 12, 2007 at 8:12 am
This is an example of what you are trying to do:
DECLARE @start_num int
DECLARE @end_num int
SET @start_num = 1
SET @end_num = 10000
SET NOCOUNT ON
INSERT INTO OrderItems
( OrderID
, ItemID
, Qty
, QtyDelivered
, QtyToSend
, item_price
, item_cost
, comments
, filemaker_discount
, filemaker_price_per_item_after_discount
, import_item
, filemaker_id
)
SELECT F.order_key
, P.id AS product_id
, (F.quantity_received + F.quantity_to_ship) AS quantity
, F.quantity_received
, F.quantity_to_ship
, F.price_per_item
, F.cost
, F.comments
, F.discount
, F.price_per_item_after_discount
, '1'
, F.id
FROM posted_order_items_from_filemaker2 F
LEFT JOIN Products P
ON P.ProductCode = F.item_code
WHERE F.imported = '0'
AND F.id >= @start_num
AND F.id <= @end_num
-- Then mark the same batch as imported
UPDATE posted_order_items_from_filemaker2
SET imported = '1'
FROM posted_order_items_from_filemaker2
WHERE imported = '0'
AND id >= @start_num
AND id <= @end_num
December 12, 2007 at 10:39 am
I have done these types of conversion projects and almost always use a staging table (a permanent table used temporarily) with a cursor and RBAR.
Why? Because you can go back and easily query which rows did not load. Yes, you can write a query with the opposite WHERE clause, but that has to be maintained too and is subject to human error.
In my situations, I often added several control columns to the staging table, such as the incoming filename that was the source of the data, a batch number (defined by the user) for backwards tracking of where the data came from (only important in multi-file or multi-source situations), and most importantly a column to store why the row failed to load.
Those error rows can be easily queried and presented to the data owners for a decision on how to correct or skip the data.
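For instance, a rough sketch of that kind of staging table (all names here are made up):
[Code]
-- Hypothetical staging table with the control columns described above
CREATE TABLE dbo.Staging_OrderItems (
    staging_id   int IDENTITY(1,1) PRIMARY KEY,
    item_code    varchar(255),
    quantity     varchar(255),
    source_file  varchar(255),  -- which incoming file the row came from
    batch_number int,           -- user-defined batch for backwards tracking
    load_error   varchar(500)   -- why the row failed to load (NULL = loaded fine)
)

-- The rejects are then a simple query to hand back to the data owners
SELECT staging_id, item_code, quantity, source_file, batch_number, load_error
FROM dbo.Staging_OrderItems
WHERE load_error IS NOT NULL
[/Code]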
Besides foreign key issues (items not found, etc.), the most common problem in my experience is input data that contains combined columns that have to be split apart, such as names (prefix, first, mi, last, suffix) and address blocks. And it gets worse when the original system allowed free-form entry and the new system has separate columns and proper business rules.
just my two cents,
joe
December 12, 2007 at 2:37 pm
If you have a staging table, why not use that instead of a cursor to scrub/update/correlate/fill in your data? Even with the extra insert, it's going to be 20 to 30 times faster, consume fewer resources, free up the table faster, etc...
If you need to know what didn't get updated, then hold onto the staging table and compare it to the master table (using the staging PK). Anything in staging that didn't get ported over needs to be manually reviewed.
Everything else you mention (tracking, including the file name and date) is very easily implemented with something set-based.
I just did something like that today - using a staging table - the 1M or so records took 13 seconds to finish. You'll be lucky if you can get a cursor solution to finish in 10 minutes (more like an hour).
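For instance, using the tables from earlier in this thread (a sketch that assumes OrderItems carries the staging key in filemaker_id, as in the code above):
[Code]
-- Anything still in staging that never made it into OrderItems needs a manual look
SELECT S.*
FROM dbo.posted_order_items_from_filemaker2 S
WHERE NOT EXISTS (SELECT 1
                  FROM dbo.OrderItems O
                  WHERE O.filemaker_id = S.id)
[/Code]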
----------------------------------------------------------------------------------
Your lack of planning does not constitute an emergency on my part...unless you're my manager...or a director and above...or a really loud-spoken end-user..All right - what was my emergency again?