Problem with a denormalizing proc to generate a textual index.

  • I have an issue with a complicated stored proc.

    The proc is used to generate a recordset that is then used by DTSearch to create a textual index for a jobs board.

    The proc basically denormalizes the data, creating one flat record per job. Each column could contain one or many values (comma separated).

    An example of the output format is below (a selection of columns from one job record, shown one column per line):

    nJobPK / nClientFK:  49739
    dLiveDate:           2006-11-30 00:00:00.000
    sSortKey:            03904900000
    sJobDescription:     ASP SQL
    sJobTitle / sJobRef: SELTEST1010101
    nIndustryVal:        60, 61, 62, 63
    sIndustryDesc:       Aerospace / Automotive, Banks / Financial Institutions, Beverages & Tobacco, Chemicals
    nCompetenceVal:      81, 82, 83
    sCompetenceDesc:     US GAAP, IFRS, Sarbanes Oxley
    nSkillsVal:          86
    sSkillsDesc:         SAP

    I have been through many iterations of this index generation process.

    The first version used cursors to build up a temporary table and then selected all the results from that table, but as it involved dynamic SQL I had to use a global temporary table, which caused issues when the same proc was called at the same time.

    Problems also arose when lots of category values were returned and the row size exceeded the limit. Also, as I was looping through thousands of jobs in this cursor, building the index was a slow process.

    The second version removed the temporary table and just built up the SQL in a string within the cursor, executing it at the end of each loop, but this still took too much time.

    The version I am using now has removed the cursors; I use nested user-defined functions and a COALESCE to return one SQL string that I execute to return my recordset. The crux of it is below.

    --build up all standard common fields first

    --now add to that string any custom fields that I need
    SELECT @sqls = COALESCE(@sqls + ', ', '')
                   + dbo.udf_MAKE_INDEX_COL(cast(c.DataType as varchar(3)), 'J',
                                            cast(c.CategoryTypePK as varchar(8)),
                                            cast(c.CategoryType as varchar(50)), @DispCats)
    FROM tbl_CATEGORIES_TYPES as c
    JOIN tbl_PAGE_CATEGORIES as p ON c.CategoryTypePK = p.CategoryTypeFK
    JOIN tbl_PAGES as pa ON pa.PagePK = p.PageFK
    JOIN tbl_FIELD_DATA_TYPE as f ON c.dataType = f.FieldDataTypePK
    WHERE c.SiteFK = @SitePK
      AND pa.PagePK = 5
      AND f.Indexable = cast(1 as bit)
      AND ((c.SearchType <> 0)
           OR PATINDEX('%#' + cast(CategoryTypePK as varchar(10)) + '#%', @DispCats) > 0)
      AND pa.DataTypeFK = 'J'
    ORDER BY p.pageOrder

    set @sql = @sqls + @sql2

    exec(@sql)

    The SQL that I get back from this statement, and actually execute on the last line, is this:

    select
        JobPK as nJobPK, ClientFK as nClientFK, LiveDate as dLiveDate,
        dbo.udf_CREATE_INDEX_SORT(LiveDate, j.FeatureScore, j.FeaturedAdvert) as sSortKey,
        jobDescription as sJobDescription, jobTitle as sJobTitle, jobRef as sJobRef,
        dbo.udf_GET_INDEX_CUSTOM_DATA(0, j.JobPk, 'J', 775, 1) as nIndustryVal,
        dbo.udf_GET_INDEX_CUSTOM_DATA(0, j.JobPk, 'J', 775, 2) as sIndustryDesc,
        dbo.udf_GET_INDEX_CUSTOM_DATA(0, j.JobPk, 'J', 776, 1) as nCompetenceVal,
        dbo.udf_GET_INDEX_CUSTOM_DATA(0, j.JobPk, 'J', 776, 2) as sCompetenceDesc,
        dbo.udf_GET_INDEX_CUSTOM_DATA(0, j.JobPk, 'J', 777, 1) as nSkillsVal,
        dbo.udf_GET_INDEX_CUSTOM_DATA(0, j.JobPk, 'J', 777, 2) as nSkillsDesc
    from tbl_JOBS as j with (nolock)
    inner join tbl_CLIENTS as c with (nolock) on j.ClientFK = c.ClientPK
    inner join tbl_SITES as s on s.SitePK = c.SiteFK
    where j.SiteFK = 51
      and j.live = cast(1 as bit)

    The user-defined function udf_GET_INDEX_CUSTOM_DATA returns the CSV list of values or descriptions for each custom field against the job's primary key.

    The main part of the function is this statement, which builds up the values/descriptions string:

    SELECT @retVal = COALESCE(@retVal + ', ', '') + CAST(c.CategoryPK as varchar(100))
    FROM tbl_DATA_CATEGORIES_VALUES as a
    JOIN tbl_CATEGORIES as c ON c.CategoryPK = a.CategoryFK
    JOIN tbl_CATEGORIES_TYPES as b ON a.CategoryTypeFK = b.CategoryTypePK
    WHERE a.IDFK = @JobPK
      AND a.DataTypeFK = @DataTypeFK
      AND b.CategoryTypePK = @CatTypePK

    return @retVal
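
    For context, that statement sits inside a scalar UDF along these lines (a simplified shell only, not the exact definition; the parameter names here are indicative rather than the real ones, inferred from the call in the generated SQL above):

    CREATE FUNCTION dbo.udf_GET_INDEX_CUSTOM_DATA
    (
        @SomeFlag   int,        -- passed as 0 in the generated SQL; its purpose is not shown here
        @JobPK      int,        -- the job's primary key
        @DataTypeFK char(1),    -- 'J' for jobs
        @CatTypePK  int,        -- the CategoryTypePK of the custom field
        @ValOrDesc  int         -- 1 returns the value list, 2 the description list (assumed)
    )
    RETURNS varchar(8000)
    AS
    BEGIN
        DECLARE @retVal varchar(8000)

        -- the COALESCE concatenation statement shown above goes here

        RETURN @retVal
    END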

    This is all well and dandy and has been working a treat for half a year now, but I have been tasked with modifying this code.

    The problem is that I have to extend this code so that one site can access another linked site's jobs data. This means that all the CategoryPKs need to be mapped to the site that I am indexing, otherwise any search would not return any results.

    Issue 1.

    In the main proc I was hoping to create a table variable / temp table (MainID int, RelatedID int) that held this mapping. Then I was hoping to access this from the function so that instead of one site's ID I used the other's. This is a no-no, as I can't access table variables or temp tables from a UDF.

    Any ideas on a workaround? I don't want to have to re-create a table in each call of this function, so a split-like function or XML won't work.

    Issue 2.

    If I could somehow replace the COALESCE statement I am using in the UDF, I could re-write the proc so that, in a loop, it builds up a SQL string that connects all the tables I need (the same tables many times with different aliases and different join clauses) to accomplish the same thing. The problem is I don't know how to build up a CSV value without a local variable to append each new value to.
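
    One thing I have wondered about is whether a correlated FOR XML PATH('') subquery could build the CSV inline without any local variable (assuming we are on SQL Server 2005 or later, which I have not confirmed). A rough sketch against the same tables, hard-coding one site and one category type for illustration:

    SELECT STUFF(
           (SELECT ', ' + CAST(c.CategoryPK as varchar(100))
            FROM tbl_DATA_CATEGORIES_VALUES as a
            JOIN tbl_CATEGORIES as c ON c.CategoryPK = a.CategoryFK
            JOIN tbl_CATEGORIES_TYPES as b ON a.CategoryTypeFK = b.CategoryTypePK
            WHERE a.IDFK = j.JobPK
              AND a.DataTypeFK = 'J'
              AND b.CategoryTypePK = 775
            FOR XML PATH('')), 1, 2, '') as nIndustryVal
    FROM tbl_JOBS as j
    WHERE j.SiteFK = 51

    Because that runs as part of the outer query rather than inside a UDF, it could presumably also join to whatever mapping table the calling proc has access to.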

    I could revert to a cursor to loop through all the fields I need rather than using nested UDFs, and then my main proc would have access to the mapping table variable. But at the moment my proc is fast and the execution plan seems solid, and I want to avoid cursors and looping through 10,000 jobs at a time if I possibly can. Each job record could have up to 15 or more custom fields (sector, job type, location, etc.), and each of them could have numerous values/descriptions that need building up.

    Any ideas would be much appreciated. Thanks in advance.

  • Rob,

    Is there any reason that you cannot duplicate the procedure at the other sites and then implement a method to determine which procedure is called, either local or remote?

    I don't know where you have the room to adjust your system, but at first glance I would start there. I have seen many developers cause huge performance issues with queries that don't work well over a linked server connection. There may be many reasons that this is not a concern in your environment, but we don't really get to see everything that is happening there.

    I look forward to hearing more about this one...

    John

    First of all, you can use local temp tables in dynamic SQL.

    Secondly, if you really want to use a global temp table, add a column called SPID, insert @@SPID into it, and when you select from it, add "SPID = @@SPID" to the where clause. SPID is the internal ID assigned to your process, and @@SPID is a function that returns your current SPID.
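
    A rough sketch of what I mean (the table and column names here are just for illustration):

    -- create the global temp table once, with a SPID column
    IF OBJECT_ID('tempdb..##JobIndex') IS NULL
        CREATE TABLE ##JobIndex (SPID int, JobPK int, IndexData varchar(8000))

    -- every insert stamps the rows with the current process ID
    INSERT INTO ##JobIndex (SPID, JobPK, IndexData)
    SELECT @@SPID, JobPK, JobTitle FROM tbl_JOBS

    -- every read and cleanup filters on it, so concurrent calls don't collide
    SELECT JobPK, IndexData FROM ##JobIndex WHERE SPID = @@SPID
    DELETE FROM ##JobIndex WHERE SPID = @@SPID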



  • Thanks for the responses.

    There is only 1 DB on one server and no linked servers. A site is just a record in tbl_SITES, not a different DB.

    The problem with the global temporary table was the following.

    The first loop built up the definition of the table with the relevant columns for that site's chosen categories.

    Then there was an iteration through each job on that site that built up a string containing a select statement, which was a mixture of columns and values (from all the different COALESCE statements), so I would end up with something like this:

    select JobPk, JobTitle, JobDesc, '12,13,14' as nSectorVals, 'IT,Sales,Management' as sSectorDesc, '345,346,347' as nLanguagesVals, 'English,French,German' as sLanguagesDesc from tbl_JOBS where JobPk = 4567

    Then at the end of each iteration I would use dynamic SQL to insert this select into the temp table. As there were many different pieces of dynamic SQL, the table had to be visible across each of them, so it had to be global.

    I am coming round to the idea of actually having a permanent table, tbl_MAPPING (SiteFK int, MainID int, RelatedID int), which would hold all the mappings. Then have a trigger on tbl_CATEGORIES that, on insert of a new category, would look to see if the site it belonged to was involved in this new related job link and insert the relevant mapping record into this table. As the admin people are manually entering new categories into the table at the moment, I can't see another way of avoiding a trigger apart from building a front end for them. Then at least this table would be visible to my functions, and I can keep the code pretty similar to the way it is now and avoid cursors and temporary tables altogether.
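
    Roughly what I have in mind is below (the join direction and the extra site parameter on the function are still up in the air, so this is a sketch rather than final code):

    CREATE TABLE tbl_MAPPING
    (
        SiteFK    int NOT NULL,  -- the site being indexed
        MainID    int NOT NULL,  -- CategoryPK as the indexed site knows it
        RelatedID int NOT NULL   -- the equivalent CategoryPK on the linked site
    )

    -- Because tbl_MAPPING is a permanent table, the UDF can join to it directly,
    -- e.g. (assuming a @SitePK parameter gets added to the function):
    SELECT @retVal = COALESCE(@retVal + ', ', '')
           + CAST(ISNULL(m.MainID, c.CategoryPK) as varchar(100))
    FROM tbl_DATA_CATEGORIES_VALUES as a
    JOIN tbl_CATEGORIES as c ON c.CategoryPK = a.CategoryFK
    JOIN tbl_CATEGORIES_TYPES as b ON a.CategoryTypeFK = b.CategoryTypePK
    LEFT JOIN tbl_MAPPING as m ON m.RelatedID = c.CategoryPK AND m.SiteFK = @SitePK
    WHERE a.IDFK = @JobPK
      AND a.DataTypeFK = @DataTypeFK
      AND b.CategoryTypePK = @CatTypePK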

    Does this solution sound feasible to you? Thanks for the advice.

  • Yes, it sounds like a good idea to me.


