October 8, 2008 at 8:07 am
Interesting point. I'd guess that most of us use a tally table with a million numbers in it. Having found that reducing the rowcount (or maximum number, in other words) to fit requirements makes little discernible difference to appropriately-coded tally-number queries, I leave mine at a million numbers. I don't get many strings as long as a million characters.
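(To show why the extra rows cost nothing - a minimal sketch of the usual tally-split pattern, assuming a dbo.Tally table keyed on N. Nothing past LEN(@s) is ever read, so the table size beyond that doesn't matter:)
[font="Courier New"]DECLARE @s varchar(8000)
SET @s = 'a,bb,ccc'
-- EACH ELEMENT STARTS WHERE THE COMMA-PREFIXED COPY OF THE STRING HAS A COMMA;
-- THE SEEK ON N <= LEN(@s) IS WHY A MILLION-ROW TABLE COSTS NOTHING EXTRA
SELECT SUBSTRING(@s, N, CHARINDEX(',', @s + ',', N) - N) AS Item
FROM dbo.Tally
WHERE N <= LEN(@s)
AND SUBSTRING(',' + @s, N, 1) = ','[/font]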
Cheers
ChrisM
For fast, accurate and documented assistance in answering your questions, please read this article.
Understanding and using APPLY, (I) and (II) Paul White
Hidden RBAR: Triangular Joins / The "Numbers" or "Tally" Table: What it is and how it replaces a loop Jeff Moden
October 8, 2008 at 8:08 am
rbarryyoung (10/7/2008)
Jeff Moden (10/7/2008)
BWAA-HAA! Did you leave any red marks on your cheeks from the slingshot kickback? :hehe:
Is that the tough chewy part?
Nah that's the target :-P
For fast, accurate and documented assistance in answering your questions, please read this article.
Understanding and using APPLY, (I) and (II) Paul White
Hidden RBAR: Triangular Joins / The "Numbers" or "Tally" Table: What it is and how it replaces a loop Jeff Moden
October 8, 2008 at 9:12 am
As suggested by Jeff:
[font="Courier New"]CREATE PROCEDURE stp_CONVERSION_INIT_UPC_Table AS
-- --------------------------------------------------------------------------------------------
-- THE CCCCCC-NNNNN-P UNIVERSAL PRODUCT CODE IS AS FOLLOWS:
-- CCCCCC: 6 DIGIT IDENTIFICATION OF A COMPANY
-- NNNNN : 5 DIGIT SEQUENTIAL NUMBER (FOR ITEM IDENTIFICATION)
-- P      : PARITY / CHECKSUM CALCULATED BY MODULO-10 ALGORITHM
--
-- AT FIRST, UPC NUMBERS WERE CREATED WITH NNNNN BEING USED FOR BOTH
-- THE ITEM ID ** AND ** A SUPPLIER ID. THIS CREATED GAPS IN THE SEQUENCE.
-- OVER THE COURSE OF TIME, SUPPLIERS CHANGED....
--
-- INSTEAD OF STARTING FROM THE HIGHEST NUMBER, KEEP A LIST OF AVAILABLE
-- NUMBERS. CREATE THE WHOLE SEQUENCE AND DELETE NUMBERS ALREADY IN USE.
-- WHEN IMPLEMENTED AS A WHILE LOOP INSERTING 199,998 ITEMS ONE AT A TIME, RUN TIME WAS 29 MINUTES
-- WHEN USING A TALLY TABLE TO INSERT 199,998 ITEMS ALL AT ONCE, RUN TIME WAS 48 SECONDS
--
-- REFERENCE: JEFF MODEN http://www.sqlservercentral.com/articles/TSQL/62867/
--
-- CREATE TALLY TABLE - WILL SUPPLY A SET OF INTEGERS FROM 1 TO 99,999
--
-- SELECT TOP 99999
-- IDENTITY(INT,1,1) AS N
-- INTO dbo._Aux_Tally
-- FROM Master.dbo.SysColumns sc1, --JUST USE THE CARTESIAN PRODUCT TO GENERATE
-- Master.dbo.SysColumns sc2, --A LARGE NUMBER OF RECORDS
-- Master.dbo.SysColumns sc3,
-- Master.dbo.SysColumns sc4
--
-- ALTER TABLE dbo._Aux_Tally
-- ADD CONSTRAINT PK_Tally_N PRIMARY KEY CLUSTERED (N) WITH FILLFACTOR = 100
--
-- GRANT SELECT, REFERENCES ON dbo._Aux_Tally TO PUBLIC
-- --------------------------------------------------------------------------------------------
SET NOCOUNT ON
DECLARE @li_RetCode int
SET @li_RetCode = -1
TRUNCATE TABLE Next_Available_UPC
IF @@ERROR <> 0 SET @li_RetCode = -1 ELSE SET @li_RetCode = 0
IF @li_RetCode <> 0 RETURN @li_RetCode
-- GENERATE FIRST SET OF UPC-12 NUMBERS
INSERT INTO Next_Available_UPC (UPCNo)
SELECT '012345-' + RIGHT('0000' + CONVERT(varchar(5), N), 5) + '-'
FROM _Aux_Tally
WHERE N < 100000
ORDER BY N
IF @@ERROR <> 0 SET @li_RetCode = -2
IF @li_RetCode <> 0 RETURN @li_RetCode
-- GENERATE SECOND SET
INSERT INTO Next_Available_UPC (UPCNo)
SELECT '234567-' + RIGHT('0000' + CONVERT(varchar(5), N), 5) + '-'
FROM _Aux_Tally
WHERE N < 100000
ORDER BY N
IF @@ERROR <> 0 SET @li_RetCode = -3
IF @li_RetCode <> 0 RETURN @li_RetCode
-- CALCULATE FINAL CHECK DIGIT
UPDATE Next_Available_UPC
SET UPCNo = UPCNo + [dbo].fnComputeUPCNoCheckDigit(UPCNo)
IF @@ERROR <> 0 SET @li_RetCode = -4
IF @li_RetCode <> 0 RETURN @li_RetCode
-- REMOVE NUMBERS THAT ARE ALREADY IN USE
DELETE FROM Next_Available_UPC
WHERE UPCNo IN ( SELECT CodeUPC FROM Product)
IF @@ERROR <> 0 SET @li_RetCode = -5
RETURN @li_RetCode
GO[/font]
[font="Courier New"]
CREATE FUNCTION fnComputeUPCNoCheckDigit
(
@ps_UPCNo varchar(20)
)
RETURNS varchar(1) AS
BEGIN
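-- INPUT IS EXPECTED AS 'CCCCCC-NNNNN-' WITH DASHES AT POSITIONS 7 AND 13,
-- SO THE ODD DIGIT POSITIONS ARE 1,3,5,8,10,12 AND THE EVEN ONES ARE 2,4,6,9,11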
DECLARE @li_CharValue int
DECLARE @li_SumOdd int
SET @li_SumOdd = 0
SET @li_CharValue = CONVERT(int, SUBSTRING(@ps_UPCNo, 1, 1))
SET @li_SumOdd = @li_CharValue
SET @li_CharValue = CONVERT(int, SUBSTRING(@ps_UPCNo, 3, 1))
SET @li_SumOdd = @li_CharValue + @li_SumOdd
SET @li_CharValue = CONVERT(int, SUBSTRING(@ps_UPCNo, 5, 1))
SET @li_SumOdd = @li_CharValue + @li_SumOdd
SET @li_CharValue = CONVERT(int, SUBSTRING(@ps_UPCNo, 8, 1))
SET @li_SumOdd = @li_CharValue + @li_SumOdd
SET @li_CharValue = CONVERT(int, SUBSTRING(@ps_UPCNo, 10, 1))
SET @li_SumOdd = @li_CharValue + @li_SumOdd
SET @li_CharValue = CONVERT(int, SUBSTRING(@ps_UPCNo, 12, 1))
SET @li_SumOdd = @li_CharValue + @li_SumOdd
DECLARE @li_SumEven int
SET @li_SumEven = 0
SET @li_CharValue = CONVERT(int, SUBSTRING(@ps_UPCNo, 2, 1))
SET @li_SumEven = @li_CharValue
SET @li_CharValue = CONVERT(int, SUBSTRING(@ps_UPCNo, 4, 1))
SET @li_SumEven = @li_CharValue + @li_SumEven
SET @li_CharValue = CONVERT(int, SUBSTRING(@ps_UPCNo, 6, 1))
SET @li_SumEven = @li_CharValue + @li_SumEven
SET @li_CharValue = CONVERT(int, SUBSTRING(@ps_UPCNo, 9, 1))
SET @li_SumEven = @li_CharValue + @li_SumEven
SET @li_CharValue = CONVERT(int, SUBSTRING(@ps_UPCNo, 11, 1))
SET @li_SumEven = @li_CharValue + @li_SumEven
DECLARE @li_Sums int
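-- STANDARD UPC MODULO-10 CHECK: THE CHECK DIGIT BRINGS (3 * ODD SUM + EVEN SUM)
-- UP TO A MULTIPLE OF 10; 210 IS A MULTIPLE OF 10 LARGE ENOUGH THAT THE
-- SUBTRACTION NEVER GOES NEGATIVE (THE MAXIMUM TOTAL HERE IS 207)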
SET @li_Sums = (210 - (@li_SumEven + (3 * @li_SumOdd)) ) % 10
DECLARE @ls_Parity varchar(1)
SET @ls_Parity = CONVERT(varchar(1), @li_Sums)
RETURN @ls_Parity
END[/font]
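In the spirit of the thread, the same check digit can also be computed set-based. A minimal sketch (untested, assuming the dbo._Aux_Tally table created above and the same 'CCCCCC-NNNNN-' input format) that replaces the eleven hand-written SUBSTRINGs:
[font="Courier New"]CREATE FUNCTION fnComputeUPCNoCheckDigit_Tally
(
@ps_UPCNo varchar(20)
)
RETURNS varchar(1) AS
BEGIN
DECLARE @li_Total int
-- WEIGHT THE ODD DIGIT POSITIONS (1,3,5,8,10,12) BY 3 AND THE REST BY 1,
-- SKIPPING THE DASHES; SAME (210 - TOTAL) % 10 CHECK AS THE FUNCTION ABOVE
SELECT @li_Total = SUM(CONVERT(int, SUBSTRING(@ps_UPCNo, N, 1))
* CASE WHEN N IN (1, 3, 5, 8, 10, 12) THEN 3 ELSE 1 END)
FROM dbo._Aux_Tally
WHERE N <= 12
AND SUBSTRING(@ps_UPCNo, N, 1) <> '-'
RETURN CONVERT(varchar(1), (210 - @li_Total) % 10)
END[/font]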
October 8, 2008 at 6:01 pm
ggraber (10/8/2008)
An article on this topic would be really cool, Jeff. I just presented this concept at work using your split CSV example.
I was just trying to think what other examples I could use for future presentations.
One question that came up, though. If your string is longer than the tally table the split doesn't work.
It would seem that this is a risk factor in using the tally table which you don't have when doing the split with a while loop.
How would you answer that question?
Since VARCHAR(8000) is the largest datatype (save the really unwieldy Text datatype) in SQL Server 2000, an 11,000-row Tally table (sized for mortgage calcs) always suffices.
For SQL Server 2005, I use a Tally "CTE" which runs just as fast and will easily create 121 million numbers much faster than any WHILE loop can iterate, with the limit set by the length of whatever is being split.
Phil Factor and I did run a test on his older box, though. Same test on his machine (parsing the entire Moby Dick novel) and the WHILE loop won. On the 4 machines I tested on with identical code, the Tally table always won quite handily.
Bottom line is, you always need to test in situ for maximum performance benefits.
Lemme know if you don't know what I mean by a "Tally CTE"... most folks fail performance-wise when they try to make one, yet potloads of bad examples have been posted on the internet.
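(For reference, a minimal sketch of one widely used cascading-CTE pattern - an illustration only, not necessarily Jeff's exact code. The TOP clause is what lets the same query produce 2 rows or millions of rows; one more cross-join level covers the 121-million mark:)
[font="Courier New"]DECLARE @MaxN int
SET @MaxN = 1000000 -- OR LEN() OF WHATEVER IS BEING SPLIT
;WITH
E1(N) AS (SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1
UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1), -- 10 ROWS
E2(N) AS (SELECT 1 FROM E1 a, E1 b), -- 100 ROWS
E4(N) AS (SELECT 1 FROM E2 a, E2 b), -- 10,000 ROWS
E8(N) AS (SELECT 1 FROM E4 a, E4 b)  -- 100,000,000 ROWS, NEVER MATERIALIZED
SELECT TOP (@MaxN) N = ROW_NUMBER() OVER (ORDER BY (SELECT NULL))
FROM E8[/font]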
--Jeff Moden
Change is inevitable... Change for the better is not.
October 8, 2008 at 6:08 pm
ggraber (10/8/2008)
I guess the question is that a while loop has a dynamic number of iterations based on the length of the parameter. You don't have that kind of flexibility with a tally table. If the tally table you have isn't big enough, you will be missing some elements at the end of the CSV.
That's why a lot of folks maintain a million-row Tally table in 2k5. Like I said, in 2k, most folks never make it past VARCHAR(8000) because the Text data type isn't real friendly.
That would also be a good question... what's the largest string that anyone has split? Phil Factor and I have recently been testing with the entire Moby Dick novel. The file I used had 1,253,122 characters including spaces. Not sure why anyone would need more than that, but I suppose it's possible.
--Jeff Moden
Change is inevitable... Change for the better is not.
October 8, 2008 at 6:11 pm
Chris Morris (10/8/2008)
Interesting point. I'd guess that most of us use a tally table with a million numbers in it. Having found that reducing the rowcount (or maximum number, in other words) to fit requirements makes little discernible difference to appropriately-coded tally-number queries, I leave mine at a million numbers. I don't get many strings as long as a million characters.
Cheers
ChrisM
Most people never get past the 8k limit in 2k. Like I said above, in 2k5, a properly written Tally "CTE" can easily hit the 121 million row mark and just as easily hit the 2 row mark... all auto-magically.
--Jeff Moden
Change is inevitable... Change for the better is not.
October 8, 2008 at 6:55 pm
Jeff Moden (10/8/2008)
ggraber (10/8/2008)
That would also be a good question... what's the largest string that anyone has split?
My procedure splits ntext strings up to 2 GB long.
It's in SQL2k.
Tested with a several-MB-long string - it works.
But it uses a loop - you know why.
It needs to take care of quotes, and I have not seen a Tally approach that handles this.
So far. :-P
_____________
Code for TallyGenerator
October 8, 2008 at 7:16 pm
I remember... it's a pretty tough task because it's not just quotes that Sergiy has to worry about with the data he showed me. It's data from a "properly" formatted CSV where strings are in quotes (and may contain commas and doubled-up quotes), proper numerics are not, and every column is also separated with a comma. Might sound like a "simple" BCP solution, but it's not. The number of columns at run time is unknown. The delimiters (hence, the columns themselves) may appear in any order.
The idea is to split the whole thing using a Tally table... without using a function, because of still other requirements for performance and other limitations, etc. On top of that, it has to be done in SQL Server 2k.
I've been working on a couple of ideas, but no success, yet.
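(One possible building block, as a sketch only - not Jeff's method, and it assumes a dbo.Tally table: a comma is a true delimiter only when the count of double-quotes before it is even, which also survives doubled-up quotes inside strings. It does nothing about the unknown column count or the other constraints, though:)
[font="Courier New"]DECLARE @s varchar(8000)
SET @s = '123,"Smith, John",456'
-- A COMMA AT POSITION N IS A REAL DELIMITER ONLY IF THE NUMBER OF
-- DOUBLE-QUOTES IN THE FIRST N CHARACTERS IS EVEN
SELECT N AS DelimiterPos -- RETURNS 4 AND 18, NOT THE COMMA INSIDE THE QUOTES
FROM dbo.Tally
WHERE N <= LEN(@s)
AND SUBSTRING(@s, N, 1) = ','
AND (N - DATALENGTH(REPLACE(LEFT(@s, N), '"', ''))) % 2 = 0[/font]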
--Jeff Moden
Change is inevitable... Change for the better is not.
October 8, 2008 at 8:44 pm
Jeff Moden (10/8/2008)
I remember... it's a pretty tough task because it's not just quotes that Sergiy has to worry about with the data he showed me. It's data from a "properly" formatted CSV where strings are in quotes (and may contain commas and doubled-up quotes), proper numerics are not, and every column is also separated with a comma. Might sound like a "simple" BCP solution, but it's not. The number of columns at run time is unknown. The delimiters (hence, the columns themselves) may appear in any order. The idea is to split the whole thing using a Tally table... without using a function, because of still other requirements for performance and other limitations, etc. On top of that, it has to be done in SQL Server 2k.
I've been working on a couple of ideas, but no success, yet.
I would have said CLR, but then again - Sergiy might start foaming at the mouth....:)
Just build a .NET scrubber to clean up the file into a format that CAN be used, then feed it in using bcp once the format works... Why use a sledgehammer to get a screw into the wall?
----------------------------------------------------------------------------------
Your lack of planning does not constitute an emergency on my part...unless you're my manager...or a director and above...or a really loud-spoken end-user..All right - what was my emergency again?
October 9, 2008 at 1:11 am
Matt Miller (10/8/2008)
I would have said CLR, but then again - Sergiy might start foaming at the mouth....:)
Why should I?
Just build a .NET scrubber to clean up the file into a format that CAN be used, then feed it in using bcp once the format works.
I'd happily pick the idea if you'd indicate what kind of format CAN be used.
_____________
Code for TallyGenerator
October 9, 2008 at 7:54 am
Jeff Moden (10/8/2008)
I remember... it's a pretty tough task because it's not just quotes that Sergiy has to worry about with the data he showed me. It's data from a "properly" formatted CSV where strings are in quotes (and may contain commas and doubled up quotes), proper numerics are not, and every column is also separated with a comma. Might sound like a "simple" BCP solution, but it's not. The number of columns at run time are unknown. The order of the delimiters (hence, the columns themselves) may be in any order.
A couple of weeks ago I wrote a command-line utility (like BCP) that uses client code to load a CSV file. It's pretty good in that it can handle the stuff that BCP cannot, like commas and line-breaks within quotes, quotes based on the cell contents rather than the column type, ragged-right, etc. (i.e., the type of CSV file that Excel makes as output), but it's still got a couple of problems:
1) In terms of maximum row/buffer size, I have it fixed at 100,000 right now. The problem with fixing it at 2 GB is that that is an awful lot of extra memory and buffer-initialization overhead if it is not going to be that big. I could have a command-line switch to control this, but then the user would have to know ahead of time when they needed to use it. I *might* be able to do some kind of adaptive thing, where it checks for overflows and then doubles the buffer size every time that happens, but I am concerned that it could be very slow.
2) Performance-wise it is only about 1/3 the speed of BCP, and I cannot figure out how to improve it, as almost all of its time is being spent not in my client code but in the .Net SQL data-type conversion routines. I suppose that I could do the data-type conversions myself, but it is hard for me to believe that my stuff would be significantly faster than the .Net SQL libraries.
You could, as Matt suggested, write a .Net "cleaner" (actually I would call it a "conditioner", since the real problem is that BCP cannot handle perfectly "clean" CSVs), but I have always felt that if you are already writing custom code to condition the data, you might as well load it then too.
[font="Times New Roman"]-- RBarryYoung[/font], [font="Times New Roman"] (302)375-0451[/font] blog: MovingSQL.com, Twitter: @RBarryYoung[font="Arial Black"]
Proactive Performance Solutions, Inc. [/font][font="Verdana"] "Performance is our middle name."[/font]
October 9, 2008 at 8:10 am
Sergiy (10/9/2008)
Matt Miller (10/8/2008)
I would have said CLR, but then again - Sergiy might start foaming at the mouth....:)
Why should I?
(just poking fun a little - I know you're not a big fan of it within SQL Server.) In 2005, I would probably integrate this in as a stored proc, depending on what it yields.
Just build a .NET scrubber to clean up the file into a format that CAN be used, then feed it in using bcp once the format works.
I'd happily pick the idea if you'd indicate what kind of format CAN be used.
I don't have all of the facts (I must have missed the initial thread), but it sounds like the two primary issues with the file are a) a variable number of fields, and b) the multiple-delimiters issue. A somewhat similar scrubber I've had to deal with would build out each row, supplementing the source file with any "missing columns" it might not have included (doing this row by row if need be). In addition, to address the delimiters issue, you need to know what delimiter could be used to replace what is there (since comma is clearly not enough, or not used well). You can use regex to set up a replacement based on "ignore commas that are between quotes, but use the ones outside of the quotes". I've recently had to use compound-character delimiters (so delimiters became a character sequence instead of a single char), which worked pretty well.
In other words - "fix" the format. What it gets replaced with would depend on what's appropriate in the file.
If you have a link back to the previous thread where this was described - I will see if I can supply more details.
----------------------------------------------------------------------------------
Your lack of planning does not constitute an emergency on my part...unless you're my manager...or a director and above...or a really loud-spoken end-user..All right - what was my emergency again?
October 9, 2008 at 1:15 pm
Dang it... I forgot about RegEx. That might work... change all the delimiters to a "thorn" and go for it.
--Jeff Moden
Change is inevitable... Change for the better is not.
October 9, 2008 at 2:10 pm
Jeff Moden (10/9/2008)
change all the delimiters to a "thorn" and go for it.
There is no guarantee "thorn" is not or will not be used within the field for other purposes.
Barry, this is what, I guess, I will do as well. As soon as I get a little bit of time for it.
For now - it works with a loop, and it's OK.
By the time performance becomes a problem, something will have to be invented.
_____________
Code for TallyGenerator
October 9, 2008 at 5:56 pm
Sergiy (10/9/2008)
Jeff Moden (10/9/2008)
change all the delimiters to a "thorn" and go for it. There is no guarantee "thorn" is not or will not be used within the field for other purposes.
Barry, this is what, I guess, I will do as well. As soon as I get a little bit of time for it.
For now - it works with a loop, and it's OK.
By the time performance becomes a problem, something will have to be invented.
True enough... but have you ever seen it used for anything other than a delimiter? :-P
--Jeff Moden
Change is inevitable... Change for the better is not.