Performance issue with tally solution

  • UMG,

    We covered the bulk import idea about eleventy gazillion posts back - IIRC we were splitting the text of Moby Dick into words. The basic technique was to export the data from the database table to a text-format file, and then to bulk import it again into a new table, specifying the space character as the column/row terminator.

    It's a very fast method - the only disappointment for me was that it is limited to processing text files. Text file bulk import is fast (much faster than any splitting code - even the CLR stuff) but it would be astonishingly fast if the trick could be performed with native-format files.
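
    For anyone who skipped that part of the thread, the shape of it is roughly this - a sketch only, with invented file, table, and column names:

    -- Export the text to a character-format file first, for example:
    --   bcp "SELECT BookText FROM MyDb.dbo.MobyDick" queryout C:\Temp\words.txt -c -T
    CREATE TABLE dbo.Words (Word varchar(100) NOT NULL);

    -- Import it, treating every space as the end of a "row" so that each
    -- word lands in its own row of the destination table.
    BULK INSERT dbo.Words
    FROM 'C:\Temp\words.txt'
    WITH (ROWTERMINATOR = ' ',   -- the space character does the splitting
          TABLOCK);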

    Paul

  • Paul White (5/10/2009)


    We covered the bulk import idea about eleventy gazillion posts back - IIRC we were splitting the text of Moby Dick into words. The basic technique was to export the data from the database table to a text-format file, and then to bulk import it again into a new table, specifying the space character as the column/row terminator.

    I'm not a bulk import expert because pretty much all of my data starts in a DB. I took a quick look at BCP, and it didn't seem like it could do what I need without pre-formatting and post-processing of the file. In what you covered before, you weren't keeping an ID number with each of the words you split from a source row. Is that something it can do, or would you have to craft the text file just right, import it with the ID number coming in as its own row, and then run an UPDATE or SELECT INTO query to attach the correct ID and line number to each of the split lines? (When you import with bulk import, is the row order guaranteed so that could work?)

  • I believe row order is guaranteed using any of the available bulk import mechanisms (see http://msdn.microsoft.com/en-us/library/ms188728(SQL.90).aspx). I will try to find something authoritative to back that up.

    One solution would be to include an identity column in the destination table definition. One would need to do the import with a format file in this case, to skip the identity column on insert.

    See http://msdn.microsoft.com/en-us/library/ms179250(SQL.90).aspx for details on how to create such a format file.

    BTW, when going this route be sure to follow the rules for minimally-logged bulk loads. See http://msdn.microsoft.com/en-us/library/ms190422(SQL.90).aspx.
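
    For illustration only - the file names and single-column layout are made up - the pieces look something like this. The non-XML format file maps the one field in the data file to table column 2, so the identity in column 1 is skipped:

    -- words.fmt (non-XML format file):
    --   9.0
    --   1
    --   1  SQLCHAR  0  100  " "  2  Word  SQL_Latin1_General_CP1_CI_AS

    CREATE TABLE dbo.Words (WordId int IDENTITY(1,1) NOT NULL, Word varchar(100) NOT NULL);

    BULK INSERT dbo.Words
    FROM 'C:\Temp\words.txt'
    WITH (FORMATFILE = 'C:\Temp\words.fmt',
          TABLOCK);   -- TABLOCK (on an empty heap, with the right recovery model)
                      -- is part of qualifying for a minimally-logged load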

    Cheers,

    Paul

  • Well, I can't find anything to say that a bcp operation reads the rows in order from the source file and inserts them into the destination table in the same order. Of course some people will just say that tables without a clustering key are inherently unordered (yes, very clever) - but the issue here is whether an IDENTITY column will assign values in file order.

    I did find a Connect item from Erland asking for an explicit PRESERVE_ORDER option on bulk loads. The item is not brilliantly answered by Microsoft, so I am unsure what to make of it. The fact that Connect doesn't allow anyone apart from the author or Microsoft to see any attached scripts means that I can't see if Erland provided an example.

    On the other hand, I haven't been able to get a bulk import from a text-format file to do anything other than import in order!

    I created a 10M row table with one row for each of the numbers from 1 to 10,000,000. I used bcp out with the -c option to get that into a text file. A visual check confirmed that the rows were in order at this point. I suppose one could always use a queryout with an explicit ORDER BY to enforce this.

    Next I created a table with an identity column (no index or key) and a bigint column, and created a format file to skip the identity column on load.

    Loading the 10M rows into the new table produced the desired effect - each row had an identity value which matched the row value.
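
    Roughly, the check at the end amounts to this (names invented for the sketch):

    -- Destination: identity (skipped on load via the format file) plus the value column
    CREATE TABLE dbo.OrderTest (RowId int IDENTITY(1,1) NOT NULL, Val bigint NOT NULL);

    -- ...BULK INSERT with the format file and TABLOCK goes here...

    -- Zero rows back means every identity value matched the value read from the file
    SELECT COUNT(*) AS Mismatches
    FROM dbo.OrderTest
    WHERE RowId <> Val;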

    This test was bulk-logged and performed on a machine with two cores. I am inclined to conclude that character-format files are read in order, and identity values for the incoming rows calculated in the same order.

    In the absence of documentation to that effect though, I guess you would be taking a risk if you chose to use this method, fast as it may be.

    Cheers,

    Paul

  • UMG Developer (5/11/2009)


    I'm not a bulk import expert because pretty much all of my data starts in a DB. I took a quick look at BCP, and it didn't seem like it could do what I need without pre-formatting and post-processing of the file. In what you covered before, you weren't keeping an ID number with each of the words you split from a source row. Is that something it can do, or would you have to craft the text file just right, import it with the ID number coming in as its own row, and then run an UPDATE or SELECT INTO query to attach the correct ID and line number to each of the split lines? (When you import with bulk import, is the row order guaranteed so that could work?)

    Like Paul said, there's nothing in writing in the documentation that guarantees the order. If such a written guarantee is necessary, then don't use it. But, like Paul, I've never had it do anything out of order, and an IDENTITY column in the destination table can certainly be used to preserve that order.

    You don't need a format file to skip the identity column. You can create a pass-through view which doesn't include that column.
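
    Something along these lines - names are just for illustration:

    CREATE TABLE dbo.Words (WordId int IDENTITY(1,1) NOT NULL, Word varchar(100) NOT NULL);
    GO
    CREATE VIEW dbo.WordsLoad AS SELECT Word FROM dbo.Words;
    GO
    -- Bulk insert through the view; the identity column never has to be mentioned
    BULK INSERT dbo.WordsLoad
    FROM 'C:\Temp\words.txt'
    WITH (ROWTERMINATOR = ' ', TABLOCK);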

    If you use "|" as a row terminator and comma as the field terminator, it'll import AND SPLIT all of your data all at once, all in the correct columns. The only leftover headache is that you will have to do a REPLACE to ensure there are no CR or LF characters left. But this will be quite fast, very easy, and very maintainable. Instead of it taking 15 minutes, it should take about 60 seconds or maybe less.
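
    A sketch of that, again with made-up names:

    CREATE TABLE dbo.Import (Col1 varchar(8000), Col2 varchar(8000), Col3 varchar(8000));

    BULK INSERT dbo.Import
    FROM 'C:\Temp\data.txt'
    WITH (FIELDTERMINATOR = ',',   -- splits each row into columns
          ROWTERMINATOR   = '|',   -- splits the file into rows
          TABLOCK);

    -- Clean out any stray CR/LF characters that came along for the ride
    UPDATE dbo.Import
    SET Col1 = REPLACE(REPLACE(Col1, CHAR(13), ''), CHAR(10), ''),
        Col2 = REPLACE(REPLACE(Col2, CHAR(13), ''), CHAR(10), ''),
        Col3 = REPLACE(REPLACE(Col3, CHAR(13), ''), CHAR(10), '');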

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Paul White (5/11/2009)


    Well, I can't find anything to say that a bcp operation reads the rows in order from the source file and inserts them into the destination table in the same order...

    One possibility here would be to use SMO's SqlBulkCopy to "roll your own" import routine. Because you, as the developer, have much more control over the reading and writing operations, you should be able to control/ensure the order of insertion better (I say "should" because there is some asynchrony going on).

    In my own use of it, I have been able to get it to run about as fast as BULK INSERT and close to what BCP can do at its fastest (I just can't match the speed of BCP's ancient (pre-.NET, pre-COM even) datatype conversion routines with the .NET conversions - too much wrapper overhead, I suspect).

    -- RBarryYoung, (302)375-0451, blog: MovingSQL.com, Twitter: @RBarryYoung
    Proactive Performance Solutions, Inc.
    "Performance is our middle name."

  • Paul White (5/10/2009)


    ...

    It's a very fast method - the only disappointment for me was that it is limited to processing text files. Text file bulk import is fast (much faster than any splitting code - even the CLR stuff) but it would be astonishingly fast if the trick could be performed with native-format files.

    Hmmm, not sure if it applies or not, but if you use SMO's SQLBulkCopy, you can implement your own IDataReader and effectively read from anything.

    -- RBarryYoung, (302)375-0451, blog: MovingSQL.com, Twitter: @RBarryYoung
    Proactive Performance Solutions, Inc.
    "Performance is our middle name."

  • RBarryYoung (5/11/2009)


    Paul White (5/10/2009)


    ...

    It's a very fast method - the only disappointment for me was that it is limited to processing text files. Text file bulk import is fast (much faster than any splitting code - even the CLR stuff) but it would be astonishingly fast if the trick could be performed with native-format files.

    Hmmm, not sure if it applies or not, but if you use SMO's SQLBulkCopy, you can implement your own IDataReader and effectively read from anything.

    Here's a rough C# implementation using SqlBulkCopy and Adam Machanic's SplitStringMulti

    using System;
    using System.Collections;
    using System.Data;
    using System.Data.Common;
    using System.Data.SqlClient;
    using System.Data.SqlTypes;
    using Microsoft.SqlServer.Server;

    public partial class StoredProcedures
    {
        [Microsoft.SqlServer.Server.SqlProcedure]
        public static void SplitStringBulkImport(SqlChars theString, SqlChars delimiter,
            SqlString DatabaseName, SqlString TableName, SqlString colName)
        {
            using (SqlConnection conn = new SqlConnection(@"Server=(local);Integrated Security=True;Connection Timeout=0;"))
            {
                conn.Open();
                using (SqlBulkCopy bulkCopy = new SqlBulkCopy(conn, SqlBulkCopyOptions.TableLock, null))
                using (IDataReader reader = new SplitStringDataReader(theString.Value, delimiter.Value, colName.Value))
                {
                    bulkCopy.DestinationTableName = DatabaseName.Value + "." + TableName.Value;
                    bulkCopy.ColumnMappings.Add(colName.Value, colName.Value);
                    bulkCopy.BulkCopyTimeout = 0;
                    bulkCopy.BatchSize = 1000;
                    bulkCopy.WriteToServer(reader);
                }
            }
        }

        public class SplitStringDataReader : IDataReader
        {
            private readonly IEnumerator _splitFunction;
            private readonly string _colName;
            private readonly DataTable _schemaTable;

            public SplitStringDataReader(char[] theString, char[] delimiter, string colName)
            {
                // SplitStringMulti from here ==> http://www.sqlservercentral.com/Forums/Topic695508-338-31.aspx#bm706065
                _splitFunction = new UserDefinedFunctions.SplitStringMulti(theString, delimiter);
                _colName = colName;

                _schemaTable = new DataTable("SchemaTable");
                _schemaTable.MinimumCapacity = 1;
                _schemaTable.Columns.Add(SchemaTableColumn.AllowDBNull, typeof(bool)).ReadOnly = true;
                _schemaTable.Columns.Add(SchemaTableColumn.BaseColumnName, typeof(string)).ReadOnly = true;
                _schemaTable.Columns.Add(SchemaTableColumn.BaseSchemaName, typeof(string)).ReadOnly = true;
                _schemaTable.Columns.Add(SchemaTableColumn.BaseTableName, typeof(string)).ReadOnly = true;
                _schemaTable.Columns.Add(SchemaTableColumn.ColumnName, typeof(string)).ReadOnly = true;
                _schemaTable.Columns.Add(SchemaTableColumn.ColumnOrdinal, typeof(int)).ReadOnly = true;
                _schemaTable.Columns.Add(SchemaTableColumn.ColumnSize, typeof(int)).ReadOnly = true;
                _schemaTable.Columns.Add(SchemaTableColumn.DataType, typeof(object)).ReadOnly = true;
                _schemaTable.Columns.Add(SchemaTableColumn.IsAliased, typeof(bool)).ReadOnly = true;
                _schemaTable.Columns.Add(SchemaTableColumn.IsExpression, typeof(bool)).ReadOnly = true;
                _schemaTable.Columns.Add(SchemaTableColumn.IsKey, typeof(bool)).ReadOnly = true;
                _schemaTable.Columns.Add(SchemaTableColumn.IsLong, typeof(bool)).ReadOnly = true;
                _schemaTable.Columns.Add(SchemaTableColumn.IsUnique, typeof(bool)).ReadOnly = true;
                _schemaTable.Columns.Add(SchemaTableColumn.NumericPrecision, typeof(short)).ReadOnly = true;
                _schemaTable.Columns.Add(SchemaTableColumn.NumericScale, typeof(short)).ReadOnly = true;
                _schemaTable.Columns.Add(SchemaTableColumn.ProviderType, typeof(int)).ReadOnly = true;
                _schemaTable.Columns.Add(SchemaTableOptionalColumn.BaseCatalogName, typeof(string)).ReadOnly = true;
                _schemaTable.Columns.Add(SchemaTableOptionalColumn.BaseServerName, typeof(string)).ReadOnly = true;
                _schemaTable.Columns.Add(SchemaTableOptionalColumn.IsAutoIncrement, typeof(bool)).ReadOnly = true;
                _schemaTable.Columns.Add(SchemaTableOptionalColumn.IsHidden, typeof(bool)).ReadOnly = true;
                _schemaTable.Columns.Add(SchemaTableOptionalColumn.IsReadOnly, typeof(bool)).ReadOnly = true;
                _schemaTable.Columns.Add(SchemaTableOptionalColumn.IsRowVersion, typeof(bool)).ReadOnly = true;

                // One schema row describing the single output column, in the same
                // order the schema columns were added above.
                object[] schemaRow = new object[] {
                    true,                   // AllowDBNull
                    colName,                // BaseColumnName
                    string.Empty,           // BaseSchemaName
                    string.Empty,           // BaseTableName
                    colName,                // ColumnName
                    1,                      // ColumnOrdinal
                    int.MaxValue,           // ColumnSize
                    typeof(string),         // DataType
                    false,                  // IsAliased
                    false,                  // IsExpression
                    false,                  // IsKey
                    false,                  // IsLong
                    false,                  // IsUnique
                    DBNull.Value,           // NumericPrecision
                    DBNull.Value,           // NumericScale
                    (int)DbType.String,     // ProviderType
                    string.Empty,           // BaseCatalogName
                    string.Empty,           // BaseServerName
                    false,                  // IsAutoIncrement
                    false,                  // IsHidden
                    true,                   // IsReadOnly
                    false                   // IsRowVersion
                };
                _schemaTable.Rows.Add(schemaRow);
            }

            #region IDataReader Members

            public void Close() { }
            public int Depth { get { return 0; } }
            public DataTable GetSchemaTable() { return _schemaTable; }
            public bool IsClosed { get { return false; } }
            public bool NextResult() { return false; }
            public bool Read() { return _splitFunction.MoveNext(); }
            public int RecordsAffected { get { return -1; } }

            #endregion

            #region IDisposable Members

            public void Dispose() { }

            #endregion

            #region IDataRecord Members

            public int FieldCount { get { return 1; } }
            public bool GetBoolean(int i) { throw new Exception("GetBoolean not implemented."); }
            public byte GetByte(int i) { throw new Exception("GetByte not implemented."); }
            public long GetBytes(int i, long fieldOffset, byte[] buffer, int bufferoffset, int length) { throw new Exception("GetBytes not implemented."); }
            public char GetChar(int i) { throw new Exception("GetChar not implemented."); }
            public long GetChars(int i, long fieldoffset, char[] buffer, int bufferoffset, int length) { throw new Exception("GetChars not implemented."); }
            public IDataReader GetData(int i) { throw new Exception("GetData not implemented."); }
            public string GetDataTypeName(int i) { throw new Exception("GetDataTypeName not implemented."); }
            public DateTime GetDateTime(int i) { throw new Exception("GetDateTime not implemented."); }
            public decimal GetDecimal(int i) { throw new Exception("GetDecimal not implemented."); }
            public double GetDouble(int i) { throw new Exception("GetDouble not implemented."); }
            public Type GetFieldType(int i) { throw new Exception("GetFieldType not implemented."); }
            public float GetFloat(int i) { throw new Exception("GetFloat not implemented."); }
            public Guid GetGuid(int i) { throw new Exception("GetGuid not implemented."); }
            public short GetInt16(int i) { throw new Exception("GetInt16 not implemented."); }
            public int GetInt32(int i) { throw new Exception("GetInt32 not implemented."); }
            public long GetInt64(int i) { throw new Exception("GetInt64 not implemented."); }
            public string GetName(int i) { return _colName; }
            public int GetOrdinal(string name) { return 1; }
            public string GetString(int i) { return GetValue(i).ToString(); }
            public object GetValue(int i) { return _splitFunction.Current; }
            public int GetValues(object[] values) { throw new Exception("GetValues not implemented."); }
            public bool IsDBNull(int i) { return false; }
            public object this[string name] { get { throw new Exception("this[string] not implemented."); } }
            public object this[int i] { get { throw new Exception("this[int] not implemented."); } }

            #endregion
        }
    }

    Run using this:

    create table dbo.TestBulkImport(i int identity(1,1), s varchar(100))

    declare @STR nvarchar(max)
    set @STR='a,b,c,dd,ee,fff,....'

    exec dbo.SplitStringBulkImport @STR, ',' , 'MyTestDB' , 'dbo.TestBulkImport', 's'

    ____________________________________________________

    Deja View - The strange feeling that somewhere, sometime you've optimised this query before

    How to get the best help on a forum

    http://www.sqlservercentral.com/articles/Best+Practices/61537
  • Jeff Moden (5/11/2009)


    UMG Developer (5/11/2009)


    If you use "|" as a row terminator and comma as the field terminator, it'll import AND SPLIT all of your data all at once all in the correct columns. The only left over headache is that you will have to do a REPLACE to ensure there are no Cr or Lf characters left. But this will be quite fast, very easy, and very maintainable. Instead of it taking 15 minutes, it should take about 60 seconds or maybe less.

    I can't use a comma as a field terminator, as there can be commas in the comment text. I may be missing something, but how do I get the ID number from the source row to carry down on all the rows split from it?

    i.e. this:

    1Line1|Line2|Line3

    should turn into:

    ID  Comment Line
    --  ------------
    1   Line 1
    1   Line 2
    1   Line 3

    If I preformat the file a little like maybe:

    1|Line1|Line2|Line3

    and import it with an identity field, I could see running a query to copy the ID down and then delete the ID rows. I'll have to play with it some tonight and see what I can come up with.
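
    If that works, the "copy the ID down" step could look something like this - purely a sketch, and it assumes an ID row can be recognised because it contains nothing but digits (which may not hold for real comment text):

    -- Staging table after a "|"-terminated bulk load; seq is the IDENTITY that
    -- (as discussed above) preserves the file order.
    CREATE TABLE dbo.Stage (seq int IDENTITY(1,1) NOT NULL, txt varchar(8000) NOT NULL);

    -- Carry each ID row down onto the comment rows that follow it,
    -- then discard the ID-only rows.
    SELECT CAST(i.txt AS int)                                    AS SourceId,
           ROW_NUMBER() OVER (PARTITION BY i.seq ORDER BY s.seq) AS LineNo,
           s.txt                                                 AS CommentLine
    FROM dbo.Stage s
    CROSS APPLY (SELECT TOP (1) s2.seq, s2.txt
                 FROM dbo.Stage s2
                 WHERE s2.seq <= s.seq
                   AND s2.txt NOT LIKE '%[^0-9]%'   -- ID-only row (all digits)
                 ORDER BY s2.seq DESC) i
    WHERE s.txt LIKE '%[^0-9]%'                     -- keep only the comment rows
    ORDER BY SourceId, LineNo;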

    I can't use a comma as a field terminator, as there can be commas in the comment text. I may be missing something, but how do I get the ID number from the source row to carry down on all the rows split from it?

    i.e. this:

    1Line1|Line2|Line3

    If I understand the problem correctly, then BCP will not help you, as you have a variable number of lines per row. Only the basic split method will enable you to replicate the ID for each split row.

    This thread is trying to tune the split to gain the fastest solution.

    If the overall performance of the basic split is OK for you then I see no reason not to use it.

    Far away is close at hand in the images of elsewhere.
    Anon.

  • UMG Developer (5/10/2009)


    Flo,

    Yes, the data is all text. I'll have to take a look at your proposed solution, but it would need some work to handle all the possible crap data cases. If you use this for the source, you lose a whole set of the Ds:

    Well, this brings up some more "fun"!

    I made some changes to my previous version. Give it a try:

    ---==================================
    -- Some source data
    DECLARE @Source TABLE (Id INT NOT NULL IDENTITY, Txt VARCHAR(100))
    INSERT INTO @Source
                 SELECT 'aaa,bb'
       UNION ALL SELECT 'b,ccc,ddd'
       UNION ALL SELECT 'ddd'
       UNION ALL SELECT 'ddd,e'
       UNION ALL SELECT 'eee'
       UNION ALL SELECT ',fff'

    ---==================================
    -- Destination table
    DECLARE @Result TABLE (Id INT NOT NULL IDENTITY, SourceId INT, Txt VARCHAR(100), Sequence INT)

    ---==================================
    -- Delimiter
    DECLARE @Delimiter VARCHAR(20)
    DECLARE @DelimiterLen INT
    SELECT @Delimiter = ','
    SELECT @DelimiterLen = LEN(@Delimiter)

    ---==================================
    -- Split items without a required leading/trailing delimiter
    --
    -- IMPORTANT:
    -- If the source row was determined correct (with the delimiter) this will return an empty row
    INSERT INTO @Result (
             SourceId,
             Txt,
             Sequence
             )
       SELECT
             s.Id,
             l.Item,
             1
          FROM @Source s
             CROSS APPLY
             (
                SELECT
                      SUBSTRING(s.Txt, 1, ISNULL(NULLIF(CHARINDEX(@Delimiter, s.Txt, 1) - 1, -1), LEN(s.Txt))) Item,
                      1 Sorting
                UNION ALL
                SELECT TOP 100 PERCENT
                      SUBSTRING(s.Txt, t.N + @DelimiterLen, ISNULL(NULLIF(CHARINDEX(@Delimiter, s.Txt, t.N + 1) - t.N - 1, -t.N - 1), LEN(s.Txt) - t.N)) Item,
                      2 Sorting
                   FROM Tally t
                   WHERE t.N <= LEN(s.Txt)
                      AND SUBSTRING(s.Txt, t.N, @DelimiterLen) = @Delimiter
                ORDER BY Sorting, t.N
             ) l

    ---==================================
    -- Move single fragments up
    WHILE (1 = 1)
    BEGIN
       ; WITH
       single_fragment (Id, Txt) AS
       (
          SELECT MIN(Id), MIN(Txt) FROM @Result GROUP BY SourceId HAVING COUNT(*) = 1
       )
       UPDATE r SET
             r.Txt = r.Txt + sf.Txt,
             r.Sequence = r.Sequence + 1
          FROM @Result r
             JOIN single_fragment sf ON r.Id + r.Sequence = sf.Id
       IF (@@ROWCOUNT = 0)
          BREAK
    END

    ---==================================
    -- Move last fragments up
    ; WITH
    last_item (Id) AS
    (
       SELECT MAX(Id) FROM @Result GROUP BY SourceId
    )
    UPDATE r1 SET
          Txt = r1.Txt + r2.Txt,
          Sequence = r1.Sequence + 1
       FROM last_item li
          JOIN @Result r1 ON li.Id = r1.Id
          JOIN @Result r2 ON r1.Id + r1.Sequence = r2.Id
       WHERE
          r1.Txt != ''

    ---==================================
    -- Delete the rows which have been joined up to the previous row
    DELETE r1
       FROM @Result r1
          JOIN @Result r2 ON r1.Id > r2.Id AND r1.Id < r2.Id + r2.Sequence

    ---==================================
    -- Delete the empty last rows which resulted from the initial tally split
    ; WITH
    last_item AS ( SELECT MAX(Id) Id FROM @Result GROUP BY SourceId )
    DELETE r
       FROM last_item li
          JOIN @Result r ON li.Id = r.Id
       WHERE r.Txt = ''

    ---==================================
    -- Result
    SELECT * FROM @Source
    SELECT * FROM @Result

    This is why I thought it would just be easier to roll the parts all up and then split it. (But maybe the resulting really long strings make the splitting go slower?)

    A little hint for your current solution: since your data are (N)VARCHAR(MAX), you should try some cursor-based solutions. They work better with large data in one row and large result items.
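
    To be clear about what "cursor-based" means here, a sketch only (dbo.Source and dbo.Result stand in for the real tables): walk the source rows with a cursor and split each MAX string with a CHARINDEX loop instead of joining it to the Tally table.

    DECLARE @Id INT, @Txt VARCHAR(MAX), @Pos INT, @Next INT;

    DECLARE src CURSOR LOCAL FAST_FORWARD FOR
        SELECT Id, Txt FROM dbo.Source;

    OPEN src;
    FETCH NEXT FROM src INTO @Id, @Txt;

    WHILE @@FETCH_STATUS = 0
    BEGIN
        SELECT @Pos = 1;
        WHILE @Pos > 0
        BEGIN
            -- Find the next delimiter and insert the item in front of it
            SELECT @Next = CHARINDEX(',', @Txt, @Pos);
            INSERT dbo.Result (SourceId, Item)
            SELECT @Id,
                   SUBSTRING(@Txt, @Pos,
                             CASE WHEN @Next = 0 THEN LEN(@Txt) + 1 ELSE @Next END - @Pos);
            SELECT @Pos = CASE WHEN @Next = 0 THEN 0 ELSE @Next + 1 END;
        END;
        FETCH NEXT FROM src INTO @Id, @Txt;
    END;

    CLOSE src;
    DEALLOCATE src;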

    Greets

    Flo

  • @jeff: good point about the view! A good option - I do dislike editing format files!

    @Barry, Mark: That's cool, and always good to see good examples of code for this sort of thing. Very nice. I guess the code-averse could use an SSIS package - SSIS would handle the row id requirement nicely too, while preserving the ability to bulk load in minimally logged mode.

    @Flo: Not sure about the recommendation to use a cursor-like approach? Even with MAX datatypes, the CLR string-splits work fastest IIRC...?

    Cheers,

    Paul

  • Hi Paul

    I know this sounds strange, but all my tests showed that the set-based solutions using a Tally table run into problems with huge text and large result items.

    Remember Phil's test to split the book Moby Dick? I included this in my tests, splitting the text of the book by commas, with the following results:

    CLR: 00:00.867

    Cursor: 00:01.310

    Set-Based: 00:06.260

    The book has about 1.2 million characters and the text specified by UMG has more than 4.3 million characters.

    Greets

    Flo

  • Hey Flo,

    Yes that was my recollection - I was just making the point that the CLR function still wins with MAX datatypes.

    I had temporarily forgotten that UMG Developer was not ready for the world of .NET yet :rolleyes:

    Cursors beating set operations eh? MAX datatypes have a lot to answer for!!

    Paul

    I have to say the cursors don't just win the MAX data-type contest: the set-based solutions have a general problem with large items within the source text. Have a look at the attached statistics from my tests.

    Greets

    Flo
