Intellectual Puzzle for massive cartesian project

  • Here are the real-world limitations:

    1. The Data Dictionary is 7 years old and parts are obsolete.

    2. There are 1623 tables, which 934 being empty of rows.

    3. All GUIDs (here called OIDS) are Varchar(32)

    4. Referential integrity is loosely enforced, with most tables not having FK relationships described.

    5. There are scores of many-to-many tables, small with only 4 columns, called 'ThisTable_ToThatTable_Helper' or 'ThisTable_ToThatTable_Collection'. They have incoming and outgoing OIDS.

    6. The varchar(32) fields are called OID, OID_Link, or {field_name}Moniker, about 70% of the time.

    Here are my actions so far:

    1. I selected all table.column object, twice, into two table variables, then multiplied for a cartesian product.

    2. I used the output, 'tblDataCartesian' to build dynamic SQL strings.

    3. No table is looking at itself

    4. All rows of the cartesian, which I am calling "Efforts", which came from an empty source table were removed.

    5. There are over 31.4 Million 'Efforts'.

    6. Each test for a Top 1 Inner join between SourceTable, TargetTable on SourceColA = TargetColB takes about 1 second. Any efforts with '1' row are posted to 'tblDataDictionary' [intDictionaryID],[vcrSourceTable],[vcrOnField],vcrTargetTable],[bolEffortDone]

    7. Uninterrupted, it will take 363 days to perform all 31.4 million efforts. (Efforts marked [bolEffortDone] = 1 are not re-tried.)

    So... you see my problem.

    Goal: To have a 3 column data dictionary that is queryable via a recursive CTE to perform traveling-salesman type questions like, 'Starting at ThisTable.ThisColumn, what is the |shortest|longest|must-go-through-TheOtherTable| path to ThatTable.Column, where all OIDS successfully map?'

    Your Task:

    Looking at this SQL scrip, how can we modify it to be more efficient that 1 INNER JOIN per second?

    ===========

    [font="Courier New"]

    SET NOCOUNT ON;

    Declare @tblShort TABLE (tblNameA varchar(255),colNameA varchar(255));

    Declare @tblLong TABLE (tblNameB varchar(255),colNameB varchar(255));

    Declare @tblCartesian TABLE(tblNameA varchar(255),colNameA varchar(255),tblNameB varchar(255),colNameB varchar(255))

    Declare @strSQLs NVarchar(4000);

    Declare @strSQLc NVarchar(4000);

    Declare @strSQLu NVarchar(4000);

    Declare @bigCartProduct BIGINT;

    Declare @bigCounter BIGINT;

    Declare @tblCounts TABLE (rowCnt BIGINT);

    Declare @tblDictionary table (SourceTable Varchar(255), OnField Varchar(255), TargetTable Varchar(255));

    Declare @tblSrcA Varchar(255)

    Declare @colSrcA Varchar(255)

    Declare @tblSrcB Varchar(255)

    Declare @colSrcB Varchar(255)

    Declare @startTime date

    Declare @endTime date

    Declare @minsElap INT

    Set @startTime = GetDate()

    SET @bigCounter = 0;

    SET @strSQLs = '';

    SET @strSQLc = '';

    SET @strSQLu = '';

    /* -- REM: Disable re-build of the 'tblDataCartesian', having already built it once.

    -- Make CTE1 & 2 for the cartesian join

    WITH CTE1 (TableNameA,ColumnNameA) as (

    SELECT

    OBJECT_NAME(c.OBJECT_ID) as TableNameA

    , c.name as ColumnNameA

    FROM sys.columns AS c

    INNER JOIN sys.types AS t ON c.user_type_id=t.user_type_id

    INNER JOIN sys.objects AS o ON C.Object_ID = O.Object_ID

    WHERE O.[Type] = 'U'

    AND t.name = 'Varchar'

    AND C.Max_Length = 32

    AND C.Name is not null

    AND CHARINDEX('-',OBJECT_NAME(c.OBJECT_ID),1) = 0

    )

    INSERT INTO @tblShort

    SELECT * FROM CTE1

    ORDER BY TableNameA, ColumnNameA;

    WITH CTE2 (TableNameB,ColumnNameB) as (

    SELECT

    OBJECT_NAME(c.OBJECT_ID) TableNameB

    , c.name ColumnNameB

    FROM sys.columns AS c

    INNER JOIN sys.types AS t ON c.user_type_id=t.user_type_id

    INNER JOIN sys.objects AS o ON C.Object_ID = O.Object_ID

    WHERE O.[Type] = 'U'

    AND t.name = 'Varchar'

    AND C.Max_Length = 32

    AND C.Name is not null

    AND CHARINDEX('-',OBJECT_NAME(c.OBJECT_ID),1) = 0

    )

    INSERT INTO @tblLong

    SELECT * FROM CTE2;

    INSERT INTO tblDataCartesian (vcrSourceTable,vcrSourceCol,vcrTargetTable,vcrTargetCol,bolEffortDone)

    SELECT tblNameA, colNameA, tblNameB, colNameB,'0'

    FROM @tblShort, @tblLong

    WHERE tblNameA <> tblNameB

    SELECT @bigCartProduct = COUNT(*) FROM tblDataCartesian WHERE bolEffortDone = 0

    PRINT 'Total Cartesian Records Remaining: ' + cast(@bigCartProduct as varchar(100))

    -- End of Disabling

    */

    Declare myCursor Cursor READ_ONLY FOR

    (SELECT vcrSourceTable, vcrSourceCol, vcrTargetTable, vcrTargetCol

    FROM tblDataCartesian

    WHERE bolEffortDone = 0

    AND vcrSourceTable = 'Client')

    OPEN myCursor

    FETCH NEXT FROM myCursor INTO @tblSrcA, @colSrcA, @tblSrcB, @colSrcB

    WHILE @@FETCH_STATUS = 0

    BEGIN

    UPDATE @tblCounts SET rowcnt = 0

    SET @bigCounter = @bigCounter + 1

    SET @strSQLs = ''

    SET @strSQLc = ''

    SET @strSQLu = ''

    SET @strSQLs = 'SELECT TOP 1 ''[' + @tblSrcA + ']'' AS SourceTable, '

    SET @strSQLS = @strSQLs + '''' + @colSrcA + ''' as OnField, '

    SET @strSQLS = @strSQLs + '''[' + @tblSrcB + ']'' AS TargetTable '

    SET @strSQLS = @strSQLs + 'FROM ' + @tblSrcA + ' X (nolock) '

    SET @strSQLS = @strSQLs + 'INNER JOIN [' + @tblSrcB + '] Y (nolock) ON X.' + @colSrcA + ' = Y.' + @colSrcB + ' '

    SET @strSQLc = N'SELECT COUNT(*) as RowCnt FROM (' + @strSQLs + ') Z;'

    --print @strSQLC

    --INSERT INTO @tblCounts

    --exec SP_ExecuteSQL @strSQLc

    --IF (SELECT * from @tblCounts WHERE rowcnt > 0) > 0

    Begin

    INSERT INTO @tblDictionary

    exec SP_ExecuteSQL @strSQLs;

    INSERT INTO tblDataDictionary

    SELECT *, '1' FROM @tblDictionary;

    end

    -- Don't reuse your last success !

    DELETE FROM @tblDictionary;

    -- Now, having done the quad, don't do it again

    SET @strSQLu = N'UPDATE tblDataCartesian SET bolEffortDone = 1 '

    SET @strSQLu = @strSQLu + 'WHERE vcrSourceTable = ''' + @tblSrcA + ''' '

    SET @strSQLu = @strSQLu + 'AND vcrSourceCol = ''' + @colSrcA + ''' '

    SET @strSQLu = @strSQLu + 'AND vcrTargetTable = ''' + @tblSrcB + ''' '

    SET @strSQLu = @strSQLu + 'AND vcrTargetCol = ''' + @colSrcB + ''';'

    Exec SP_ExecuteSQL @strSQLu

    -- Tell 'Em Watcha Done.

    IF (@bigCounter % 100) = 0

    Begin

    set @minsELAP = Cast(datediff(minute,@startTime,GetDate()) as Varchar(100))

    print 'At Mins: ' + Cast(@minsELAP as Varchar(100)) + ' _ ' + 'Effort # ' + Cast(@bigCounter as varchar(100)) + ' Checking: ' + @tblSrcA + ', ' + @colSrcA + ', ' + @tblSrcB + ', ' + @colSrcB

    end

    FETCH NEXT FROM myCursor INTO @tblSrcA, @colSrcA, @tblSrcB, @colSrcB

    IF @bigCounter = 100000 BREAK

    END-- End of WHILE LOOP

    SET NOCOUNT OFF;

    Set @EndTime = GetDate()

    set @minsELAP = Cast(datediff(minute,@startTime,@endTime) as Varchar(100))

    print 'Minutes Elapsed: ' + Cast(@minsELAP as Varchar(100))

    Print 'Records Per Minute: ' + Cast(Cast(@bigCounter as Float)/ Cast(@minsElap as Float) as Varchar(100))

    SET @bigCounter = 0

    Close myCursor

    Deallocate myCursor

    --================== TESTING SCRIPTS ====================

    /*

    SELECT

    Case boleffortDone

    When 0 then 'To Do'

    When 1 Then 'Done'

    End as StatusIs

    , vcrSourceTable

    , Count(*)

    FROM tblDataCartesian (nolock)

    WHERE vcrSourceTable = 'SupportType'

    Group By

    Case boleffortDone

    When 0 then 'To Do'

    When 1 Then 'Done'

    End ,

    vcrSourceTable

    ORDER BY Case boleffortDone

    When 0 then 'To Do'

    When 1 Then 'Done'

    End

    , vcrSourceTable

    , Count(*) DESC

    GO

    SELECT * FROM tblDataCartesian (nolock)

    WHERE vcrSourceTable = 'Client'

    SELECT * FROM tblDataDictionary (nolock)

    */

    =============[/font]

  • Some points:

    1. A recursive CTE is likely to be slower, not faster.

    2. It really is not at all clear what you are trying to do here. You may want to post a simplified example, with some sample data and table definitions.

    3. It is even less clear why you would want to do something like this. If you could explain that, we may be able to suggest a better way to approach it.

    [font="Times New Roman"]-- RBarryYoung[/font], [font="Times New Roman"] (302)375-0451[/font] blog: MovingSQL.com, Twitter: @RBarryYoung[font="Arial Black"]
    Proactive Performance Solutions, Inc.
    [/font]
    [font="Verdana"] "Performance is our middle name."[/font]

Viewing 2 posts - 1 through 1 (of 1 total)

You must be logged in to reply to this topic. Login to reply