April 28, 2009 at 9:07 am
Paul White (4/28/2009)
@mister.magoo,I think the function you refer to was one I posted originally, and improved by Flo. Not that I'm territorial!
I do apologise - it's such a long thread and function names keep changing!
The point about the scalability is not so much about how much memory a single call would allocate, rather: consider many concurrent users executing the function...!
Now you really are scaring me - people are going to put this kind of unprocessed data into a database and then allow multiple users access to it at the same time?
Please excuse me while I book myself on a woodwork course :hehe:
P.S. What about the question of whether passing large chunks of data to a CLR function is really good practice to start with? 😉
MM
April 28, 2009 at 9:12 am
Now you really are scaring me - people are going to put this kind of unprocessed data into a database and then allow multiple users access to it at the same time?
The primary use case I see is apps that need to send collections of identifiers to the database for one reason or another. This can be done using a chatty interface (i.e. multiple calls to a stored procedure that only accepts a single identifier) and that tends not to scale at all. So more informed developers (or consultants who come in to clean up the mess :-D) will instead figure out some way to batch the calls, usually via lists. I've seen some extremely active apps using this pattern -- tens of thousands of batch requests per second -- so the method must be able to scale up. The T-SQL methods actually work pretty well in this case, but I've come to believe that a finely-tuned CLR method could do even better.
--
Adam Machanic
whoisactive
April 28, 2009 at 9:41 am
how to make it ignore consecutive delimiters
Ignore in which way?
I split data with consecutive delimiters, i.e. a column/field with no data
Far away is close at hand in the images of elsewhere.
Anon.
April 28, 2009 at 9:44 am
David Burrows (4/28/2009)
how to make it ignore consecutive delimiters
Ignore in which way?
I split data with consecutive delimiters, ie a column/field with no data
In most of the use cases I'm interested in, "a,,b,,c" with a delimiter of "," should return 3 rows:
a
b
c
likewise ",a,b,c" or "a,b,c," or ",,a,b,c,," or any number of other slightly malformed combinations.
--
Adam Machanic
whoisactive
April 28, 2009 at 9:48 am
In most of the use cases I'm interested in, "a,,b,,c" with a delimiter of "," should return 3 rows:
Fair enough, another one of those 'It Depends' features 😀
I suppose the 'trade-off' in time is between processing during the split vs post-split
Far away is close at hand in the images of elsewhere.
Anon.
April 28, 2009 at 10:30 am
OK, here's my latest. A few bug fixes, and it seems to properly ignore consecutive delimiters (you don't want to know how many times I recompiled this fighting off-by-one errors...)
Enjoy! By the way, if you do performance tests, make sure to compile in Release mode, not Debug mode. I just did a quick comparison and it does seem to make a significant difference.
using System;
using System.Collections;
using System.Data;
using System.Data.SqlClient;
using System.Data.SqlTypes;
using Microsoft.SqlServer.Server;

public partial class UserDefinedFunctions
{
    [Microsoft.SqlServer.Server.SqlFunction(
        FillRowMethodName = "FillRow_Multi",
        TableDefinition = "item nvarchar(4000)"
        )
    ]
    public static IEnumerator SplitString_Multi(
        [SqlFacet(MaxSize = -1)]
        SqlChars Input,
        [SqlFacet(MaxSize = 255)]
        SqlChars Delimiter
        )
    {
        return (
            (Input.IsNull || Delimiter.IsNull) ?
            new SplitStringMulti(new char[0], new char[0]) :
            new SplitStringMulti(Input.Value, Delimiter.Value));
    }

    public static void FillRow_Multi(object obj, out SqlString item)
    {
        item = new SqlString((string)obj);
    }

    public class SplitStringMulti : IEnumerator
    {
        public SplitStringMulti(char[] TheString, char[] Delimiter)
        {
            theString = TheString;
            stringLen = TheString.Length;
            delimiter = Delimiter;
            delimiterLen = (byte)(Delimiter.Length);
            isSingleCharDelim = (delimiterLen == 1);

            lastPos = 0;
            nextPos = delimiterLen * -1;
        }

        #region IEnumerator Members

        public object Current
        {
            get
            {
                return new string(theString, lastPos, nextPos - lastPos);
            }
        }

        public bool MoveNext()
        {
            if (nextPos >= stringLen)
                return false;
            else
            {
                lastPos = nextPos + delimiterLen;

                for (int i = lastPos; i < stringLen; i++)
                {
                    bool matches = true;

                    //Optimize for single-character delimiters
                    if (isSingleCharDelim)
                    {
                        if (theString[i] != delimiter[0])
                            matches = false;
                    }
                    else
                    {
                        for (byte j = 0; j < delimiterLen; j++)
                        {
                            if (((i + j) >= stringLen) || (theString[i + j] != delimiter[j]))
                            {
                                matches = false;
                                break;
                            }
                        }
                    }

                    if (matches)
                    {
                        nextPos = i;

                        //Deal with consecutive delimiters
                        if ((nextPos - lastPos) > 0)
                            return true;
                        else
                        {
                            i += (delimiterLen - 1);
                            lastPos += delimiterLen;
                        }
                    }
                }

                lastPos = nextPos + delimiterLen;
                nextPos = stringLen;

                if ((nextPos - lastPos) > 0)
                    return true;
                else
                    return false;
            }
        }

        public void Reset()
        {
            lastPos = 0;
            nextPos = delimiterLen * -1;
        }

        #endregion

        private int lastPos;
        private int nextPos;

        private readonly char[] theString;
        private readonly char[] delimiter;
        private readonly int stringLen;
        private readonly byte delimiterLen;
        private readonly bool isSingleCharDelim;
    }
};
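For anyone following the scan logic in MoveNext, here is a rough behavior-only translation into Python (just the linear scan with consecutive-delimiter skipping, not the SQLCLR plumbing):

```python
def split_string_multi(text, delim):
    """Linear scan that yields non-empty items, skipping over
    leading, trailing, and consecutive delimiters."""
    items = []
    last = 0                                # start of the current item
    i = 0
    dlen = len(delim)
    while i < len(text):
        if text[i:i + dlen] == delim:       # delimiter match at position i
            if i > last:                    # non-empty item -> emit it
                items.append(text[last:i])
            i += dlen                       # skip past the delimiter
            last = i
        else:
            i += 1
    if last < len(text):                    # trailing item after last delimiter
        items.append(text[last:])
    return items
```

Like the C# version, a multi-character delimiter is handled by comparing a window at each position, and empty items between back-to-back delimiters are swallowed rather than returned.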
--
Adam Machanic
whoisactive
April 28, 2009 at 1:25 pm
Adam Machanic (4/28/2009)
We're not done yet -- if we're going to create the "best" method it should do as much as possible in terms of functionality we would normally want from string splitting routines. So I'm currently in the process of modifying Flo's modification to my character-based approach, and trying to figure out how to make it ignore consecutive delimiters ... Any other ideas? I often add a DISTINCT to my T-SQL efforts but I don't think that's a possibility in the SQLCLR routine.
Hi Adam
Correct. A complete, well-done split function should be able to handle delimiters within the text being split. I usually avoid this problem by splitting on CHAR(0), which should never appear in any text. Anyway, the requirement stays absolutely valid. The question is only "how should it be handled?". As usual in CSV, this is done with quotation marks; if a quotation mark itself appears within the text, it is escaped by a second quotation mark. Would this be okay?
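For reference, that doubled-quotation-mark convention is exactly what standard CSV parsers implement; a quick Python illustration of the expected behavior (Python used here only because it ships a CSV parser):

```python
import csv
import io

# One CSV line whose middle field contains both the delimiter
# and an embedded quote, escaped by doubling it.
line = 'a,"b,""c""",d'
row = next(csv.reader(io.StringIO(line)))
print(row)  # the quoted field comes back as: b,"c"
```

So a split routine adopting this convention would treat a delimiter inside quotes as data, and collapse "" inside a quoted field to a single ".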
Greets
Flo
April 28, 2009 at 1:28 pm
Paul White (4/28/2009)
Ah. A post from Adam. Must be later than I realized. :w00t: This is probably feeping creaturism, but it would be great if it could handle more complex splitting tasks.
For example, strings like:
~Parent~child, child, child, child~Another Parent~its child~...
or
~Fish~colour=gold, size=M, memory=0~...
and yes, this is a real example (fish used to protect the guilty).
Also, I would still like to see a CLR TVF stream on the input as well as the output. I don't think this is going to fly though - not without an external class library, the unsafe permission set, and maybe (gasp) storing to a static field - a horrible idea, which I am trying hard to forget.
Time for bed.
Paul
edit: P.S. Most reasonable requirements could be handled I guess by allowing a regular expression to be passed in, but that will break the nice split-on-the-fly thing. Wouldn't it?
Hi Paul!
Hierarchical data to be split? Well. This becomes really freaky... It would require either recursive methods or a much more complex RegEx, which also becomes quite slow.
Give me some hours to think about 😉
Greets
Flo
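For what it's worth, Paul's ~Parent~child-list~ example doesn't necessarily need recursion if the structure really is strictly alternating parent and child-list tokens; a Python sketch under that assumption (the function name and the alternating-token assumption are mine, not from the thread):

```python
def parse_parent_children(s, sep="~", child_sep=","):
    # Assumes tokens alternate: parent, child-list, parent, child-list, ...
    tokens = [t for t in s.split(sep) if t]
    return {
        parent: [c.strip() for c in kids.split(child_sep)]
        for parent, kids in zip(tokens[0::2], tokens[1::2])
    }

print(parse_parent_children("~Fish~colour=gold, size=M, memory=0~"))
```

Anything more deeply nested than two levels would indeed push this toward recursion or a real grammar.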
April 28, 2009 at 1:39 pm
Hi mister.magoo
As first thanks a lot for your invested time and your test results!!
mister.magoo (4/28/2009)
Realistically though - how often would you encounter a single chunk of text large enough for that to be an actual problem? (See end of post for more thoughts on this...)
This splitting requirement is very common at my company. Many of our projects are based on more than one database, and those databases are on different servers. For queries that span more than one database, the data have to be moved from one database to another. It is much faster to transfer the data as text to the server and just let it be split than to create temporary tables.
Another business case here is my own project. It is a distributed system; the communication works over XML, and the data are imported by C# processes. The data have a local Id, which is different on every server, and a global Id (a VARCHAR), which is unique worldwide. The importer parses the XML file and then has to select 20,000 objects from the database to merge them with the new data.
Greets
Flo
April 28, 2009 at 1:41 pm
Paul White (4/28/2009)The point about the scalability is not so much about how much memory a single call would allocate, rather: consider many concurrent users executing the function...!
I'm currently doing my first load tests, but so far I don't have enough information for it to make sense to share 😉
April 28, 2009 at 1:49 pm
Adam Machanic (4/28/2009)
OK, here's my latest. A few bug fixes, and it seems to properly ignore consecutive delimiters (you don't want to know how many times I recompiled this fighting off-by-one errors...)Enjoy! By the way, if you do performance tests, make sure to compile in Release mode, not Debug mode. I just did a quick comparison and it does seem to make a significant difference.
Hi Adam!
Thanks for your new version :-)! I will include it into my other tests and give you feedback as soon as possible!! I just have to finish the setup of a load-test environment.
Sure, all test results are done in release build 😉
I keep you informed!
Flo
April 28, 2009 at 3:54 pm
[font="Verdana"]You know? I am really looking forward to Flo's final article. :-D[/font]
April 28, 2009 at 4:45 pm
Bruce W Cassidy (4/28/2009)
[font="Verdana"]You know? I am really looking forward to Flo's final article. :-D[/font]
Howdy Bruce!
Hold your horses 😛
I just don't think I can put all this information into one article...
Currently I have over 400 test cases for single threading and just setting up the load testing environment.
Best wishes!
Flo
April 28, 2009 at 4:55 pm
Florian Reischl (4/28/2009)
I just don't think I can put all this information into one article...
[font="Verdana"]Hey Flo,
A series of articles won't hurt. Then you can write separate articles on the performance of RegEx, best-of-breed CLR, best-of-breed T-SQL, etc.
And I think having one article on dealing with hierarchical string splitting would be a good one to add. 😀
[/font]
April 28, 2009 at 5:01 pm
A series of articles won't hurt. Then you can write separate articles on the performance of RegEx, best-of-breed CLR, best-of-breed T-SQL, etc.
Yes, I think it will be a series, especially for different requirements. One huge split, different data, multi users...
And I think having one article on dealing with hierarchical string splitting would be a good one to add. 😀
:sick:!!
It's late now...
CU tomorrow!
Flo