February 11, 2015 at 9:20 am
By the way, I created a basic, limited workaround.
I wrote a query that self-joins the table nine times, giving me up to 10 generations of duplicates in tree form (with much of the data in most generations being NULL). I then stored that result in a table and can quickly rip through it to count/list the duplicate bugs for any given bug. However, it is not dynamic and is limited to 10 generations. While we have not gone beyond that yet, it has come close, and I would like something dynamic that doesn't have such a limitation.
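Roughly, the query takes this shape (cut down to three generations here, and written against the dbo.Dimduplicates test setup that appears later in this thread rather than our real table):
--===== Cut-down sketch of the static self-join approach (3 generations instead of 10).
SELECT g1.dupe_key AS Gen1
,g2.dupe_key AS Gen2 --NULL when the Gen1 bug has no duplicates of its own
,g3.dupe_key AS Gen3 --NULL when the Gen2 bug has no duplicates of its own
FROM dbo.Dimduplicates g1
LEFT JOIN dbo.Dimduplicates g2 ON g2.dupeof_key = g1.dupe_key
LEFT JOIN dbo.Dimduplicates g3 ON g3.dupeof_key = g2.dupe_key
;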
Thanks.
February 11, 2015 at 1:11 pm
mark.worthen (2/11/2015)
By the way, I created a basic, limited workaround. {snip} I would like to have something that is dynamic and doesn't have such limitations.
Not to worry. You won't have that problem when we're done. I just can't get to it during normal working hours. Have to do it later. I'll be back.
--Jeff Moden
Change is inevitable... Change for the better is not.
February 11, 2015 at 1:15 pm
Jeff Moden (2/11/2015)
Not to worry. You won't have that problem when we're done. I just can't get to it during normal working hours. Have to do it later. I'll be back.
Thanks Jeff. Much appreciated.
February 11, 2015 at 2:13 pm
CELKO (2/11/2015)
Please follow basic Netiquette and post the DDL we need to answer this. Follow industry and ANSI/ISO standards in your data. You should follow ISO-11179 rules for naming data elements. You should follow ISO-8601 rules for displaying temporal data. We need to know the data types, keys and constraints on the table. Avoid dialect in favor of ANSI/ISO Standard SQL. And you need to read and download the PDF for:
https://www.simple-talk.com/books/sql-books/119-sql-code-smells/
>>I'm working with a development bug database where if someone puts in a bug number, they want to get the parent bug. No problem. <<
Lots of problems. The use of “_key” violates ISO-11179 rules about naming with meta data, the terms parent and child are not part of RDBMS, etc. You are not writing declarative code at all! A loop?!! We do not use loops in correct SQL; that was FORTRAN, Basic, Pascal, etc. We do not use local variables in declarative programming.
A dimension has to have a scale. But what measurement is used for a “<something>_duplicate”? What is a set of generic duplicates?
You talk about the bugs as a hierarchy. Since we have no DDL, I will guess that you are using an adjacency list model instead of a nested set model for it. Want to tell us, or do we keep guessing?
Have you seen Dr. Zoidberg on FUTURAMA? In one episode he tells Fry that he knows all about humans and that he wants Fry to lift up his gills so he can check for parasites. It is scary when the basic terms are wrong.
>> However, now they want to be able to do the same thing in reverse. Put in a bug and determine how many bugs are duplicates (either directly - children, or indirectly - grandchildren/great-grandchildren, etc.) <<
How do you determine a duplicate? That would seem to be a human determination.
>> If there were a set number of levels, self-joining would work to create a generational matrix. However, as there is no limit on how deep the generations can go, I need something dynamic to traverse the tree and generate the list of bugs/count of bugs. <<
That is a nested sets model.
Thank you for the tips on proper posting. As I mentioned, I have been a front-end guy (BI) for most of my career and am now being asked to create some back-end functionality (i.e., a function in the database that others can use in their queries). I did not design or build the database, and since it is controlled by a different group, I have no say in that aspect of it - I am only tasked with enabling others to obtain the data from it in the desired manner.
February 11, 2015 at 3:59 pm
CELKO (2/11/2015)
Please follow basic Netiquette and post the DDL we need to answer this. Follow industry and ANSI/ISO standards in your data.{snip}
You talk about the bugs as a hierarchy. Since we have no DDL, I will guess that you are using an adjacency list model instead of a nested set model for it. Want to tell us, or do we keep guessing?
BWAA-HAAA-HAAAA!!!! RTFS, Joe! He already did that just three posts above yours! 😉 And, we're working on a fix. And I've seen your "Nested Set" conversion code... It not only has a While Loop in it but it uses a bloody push stack, '60s style, so you don't have room to criticize others on anything having to do with a While Loop.
Unless you have some code that actually works for this specific problem that you'd like to post, please come down off your ANSI/ISO tower or go away. If you decide to stay, mind your manners. At least check your own gills first. 😛
--Jeff Moden
Change is inevitable... Change for the better is not.
February 11, 2015 at 4:08 pm
mark.worthen (2/11/2015)
CELKO (2/11/2015)
Please follow basic Netiquette and post the DDL we need to answer this. Follow industry and ANSI/ISO standards in your data. {snip}
Thank you for the tips on proper posting. As I mentioned, I have been a front-end guy (BI) for most of my career and am now being asked to create some back-end functionality (i.e., a function in the database that others can use in their queries). I did not design or build the database, and since it is controlled by a different group, I have no say in that aspect of it - I am only tasked with enabling others to obtain the data from it in the desired manner.
You're alright. Just ignore him and he'll go away. 😛
--Jeff Moden
Change is inevitable... Change for the better is not.
February 12, 2015 at 1:22 am
Ok. First, some more test data. I borrowed some numbers from the hierarchy article that was linked earlier in this thread. One "tree" looks like this but, of course, I changed the EmployeeID to be dupe_key and ManagerID to be dupeof_key along with a few other things to make it work with your table. And, no, I didn't create a Nested Set for this although that would definitely be the way to go. The reason I didn't is that I'm trying to hit the "gotta have two functions" requirement they laid on you.
Here's what one tree would look like before the changes I mentioned. The NULL at the first level for the "ManagerID" (dupeof_key) has been changed to simulate a dupeof_key in the "other" table.
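In text form, that tree looks like this (node 26 is the root of the tree and its dupeof_key of 1 points at the "other" table):
26
|-- 2
|   '-- 17
|       '-- 6
|-- 3
|   |-- 7
|   |-- 8
|   |   |-- 12
|   |   '-- 14
|   '-- 20
|-- 39
|   |-- 18
|   '-- 21
'-- 40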
Here's the test data I made. I created the original tree using the diagram above and then duplicated it several times. I also threw in a "stump", which is a tree with a single node. Obviously, you should run this in a test database on a test box... not in production.
--===== This creates the test table.
CREATE TABLE dbo.Dimduplicates
(
duplicates_key INT IDENTITY(1,1) NOT NULL
,dupeof_key INT NOT NULL --parent
,dupe_key INT NOT NULL --child
,start_ts DATETIME NULL --temporarily changed this to null
,end_ts DATETIME NULL
CONSTRAINT PK_Dimduplicates_key PRIMARY KEY CLUSTERED (duplicates_key ASC)
)
;
--===== This populates the first tree in the test table
INSERT INTO dbo.Dimduplicates
(dupe_key, dupeof_key)
SELECT 26, 1 UNION ALL
SELECT 2, 26 UNION ALL
SELECT 3, 26 UNION ALL
SELECT 6, 17 UNION ALL
SELECT 8, 3 UNION ALL
SELECT 7, 3 UNION ALL
SELECT 12, 8 UNION ALL
SELECT 14, 8 UNION ALL
SELECT 17, 2 UNION ALL
SELECT 18, 39 UNION ALL
SELECT 20, 3 UNION ALL
SELECT 21, 39 UNION ALL
SELECT 39, 26 UNION ALL
SELECT 40, 26
;
--===== This creates several other trees with different IDs than the original.
-- The "spt_values" table is kind of like a built-in Tally table.
INSERT INTO dbo.Dimduplicates
(dupe_key, dupeof_key)
SELECT dupe_key = dd.dupe_key + sv.number
,dupeof_key = dd.dupeof_key + sv.number
FROM dbo.Dimduplicates dd
CROSS JOIN master.dbo.spt_values sv
WHERE dd.dupe_key < 100
AND sv.type = 'P'
AND sv.number%100 = 0
AND sv.number > 0
;
--===== This creates a single node "stump"
INSERT INTO dbo.Dimduplicates
(dupe_key, dupeof_key)
SELECT dupe_key = 999999999
,dupeof_key = 999999998
;
--===== Add a unique index on the parent/child pair to support the traversals that follow.
CREATE UNIQUE INDEX IX_Dimduplicates_ParentChild
ON dbo.Dimduplicates (dupeof_key,dupe_key)
;
Up next, you need to create the following utility function. Details are in the code. The comments are longer than the code.
CREATE FUNCTION dbo.HTally
/**********************************************************************************************************************
Purpose:
Returns a table of numbers starting at 1 and incrementing by 4 (1,5,9,13 for example) for a count of @HLevel values.
Usage:
--===== Simple Syntax Example
SELECT N FROM dbo.HTally(@HLevel)
;
--===== Example to return 4 levels of numbers (1,5,9,13 for example)
SELECT N FROM dbo.HTally(4)
;
Revision History:
Rev 00 - 06 Oct 2012 - Jeff Moden
- Initial creation for the following article.
http://www.sqlservercentral.com/articles/Hierarchy/94040/
Rev 01 - 12 Feb 2015 - Jeff Moden
- Conversion to "readless" using modified methods of those first published by Itzek Ben-Gan and modified
over time by multiple contributors on SQLServerCentral.com
**********************************************************************************************************************/
--===== Define the I/O for this function
(@HLevel INT)
RETURNS TABLE AS
RETURN WITH
E1(N) AS (SELECT 1 FROM (VALUES (1),(1),(1),(1),(1),(1),(1),(1),(1),(1))v(N))
,E3(N) AS (SELECT 1 FROM E1 a, E1 b, E1 c)
,Tally(N) AS (SELECT 0 UNION ALL SELECT TOP (@HLevel-1) ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) FROM E3)
SELECT N=N*4+1 FROM Tally
;
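A quick sanity check, straight from the usage example in the header:
SELECT N FROM dbo.HTally(4)
; --Returns 1, 5, 9, 13... the starting byte position of each 4-byte dupe_key packed into SortPath below.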
Using techniques similar to those in the hierarchy articles, here's one of the two functions you asked for.
CREATE FUNCTION dbo.FindDownline
/**********************************************************************************************************************
Purpose:
Given a "dupeof_key", find all nodes in the downline and produce a count of decsendents for each node not including
the node itself. Leaf Nodes will have a downline count of "0". Other columns are also returned if you need them. The
output is sorted but remember that there is no guarantee that it will continue to display in a sorted fashion unless
you add an order by to the outer calling query.
Usage:
--===== Simple Syntax
SELECT * FROM dbo.FindDownline(@ParentBugID);
Revision History:
Rev 00 - 12 Feb 2015 - Jeff Moden - Initial creation for the following thread.
http://www.sqlservercentral.com/Forums/Topic1658959-3077-1.aspx
**********************************************************************************************************************/
--===== Declare the I/O for this function
(@ParentBugID INT)
RETURNS TABLE AS
RETURN
--===== Return the downline as a table
WITH cteBuildPath AS
( --=== This is the "anchor" part of the recursive CTE.
-- The only thing it does is load the Root Node.
SELECT anchor.dupe_key
,anchor.dupeof_key
,HLevel = 1
,SortPath = CAST(
CAST(anchor.dupe_key AS BINARY(4))
AS VARBINARY(4000)) --Up to 1000 levels deep.
FROM dbo.Dimduplicates AS anchor
WHERE dupeof_key = @ParentBugID
UNION ALL
--==== This is the "recursive" part of the CTE that adds 1 for each level
-- and concatenates each level of dupe_key's to the SortPath column.
SELECT recur.dupe_key
,recur.dupeof_key
,HLevel = cte.HLevel + 1
,SortPath = CAST( --This does the concatenation to build SortPath
cte.SortPath + CAST(Recur.dupe_key AS BINARY(4))
AS VARBINARY(4000))
FROM dbo.Dimduplicates AS recur WITH (TABLOCK)
INNER JOIN cteBuildPath AS cte
ON cte.dupe_key = recur.dupeof_key
) --=== This splits the SortPath and counts the parts to get the downline count
-- for each dupe_key
SELECT TOP 2000000000
dupe_key = CAST(SUBSTRING(sorted.SortPath,t.N,4) AS INT)
,HLevel = MIN(sorted.HLevel)
,SortPath = MIN(sorted.SortPath)
,DownlineCount = COUNT(*)-1 --Doesn't include the current node
FROM cteBuildPath AS sorted
CROSS APPLY dbo.HTally(sorted.HLevel) AS t
GROUP BY SUBSTRING(sorted.SortPath,t.N,4)
ORDER BY SortPath
;
If you run the following code...
SELECT * FROM dbo.FindDownline(1)
;
You'll get a downline tree that looks like this...
dupe_key HLevel SortPath DownlineCount
26 1 0x0000001A 13
2 2 0x0000001A00000002 2
17 3 0x0000001A0000000200000011 1
6 4 0x0000001A000000020000001100000006 0
3 2 0x0000001A00000003 5
7 3 0x0000001A0000000300000007 0
8 3 0x0000001A0000000300000008 2
12 4 0x0000001A00000003000000080000000C 0
14 4 0x0000001A00000003000000080000000E 0
20 3 0x0000001A0000000300000014 0
39 2 0x0000001A00000027 2
18 3 0x0000001A0000002700000012 0
21 3 0x0000001A0000002700000015 0
40 2 0x0000001A00000028 0
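And, to answer the original "how many bugs are duplicates of this bug" question directly, a simple count over the function should do it. For bug 26 in the test data, this should return 13, which matches the DownlineCount on the root row above:
--===== Total number of bugs that are duplicates of bug 26, directly or indirectly.
SELECT DuplicateCount = COUNT(*)
FROM dbo.FindDownline(26)
;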
For your first function to find the top level node, you can simplify it a bit (and I'll let you do the conversion to an actual function... you've gotta have some of the fun :-D).
--===== Getting rid of BEGIN/END does speed WHILE loops up a bit.
-- A recursive CTE could be used here but the WHILE loop will be faster
-- for these one-off lookups. The variables below would become the function's input
-- parameter and return value; they're declared here so the snippet runs on its own.
DECLARE @parent_bug_id INT
,@cur_parent INT = 6 --The bug to look up. 6 is a leaf in the test data, so this should return 1.
;
WHILE @cur_parent IS NOT NULL
SELECT @parent_bug_id = @cur_parent
,@cur_parent = (SELECT dupeof_key
FROM dbo.Dimduplicates d
WHERE d.dupe_key = @cur_parent
AND d.end_ts IS NULL)
;
SELECT @parent_bug_id
;
Again, even though he's obnoxious as hell, Celko's right about one thing. The whole shebang should be recreated as a separate table that contains Nested Sets. For the small amount of items you have, though, it might not be worth it.
--Jeff Moden
Change is inevitable... Change for the better is not.
February 12, 2015 at 7:58 am
Jeff Moden (2/12/2015)
Again, even though he's obnoxious as hell, Celko's right about one thing. The whole shebang should be recreated as a separate table that contains Nested Sets. For the small amount of items you have, though, it might not be worth it.
Thanks Jeff! I'll take this and work with it and learn from it today. I don't know if I'll get permission to create a separate table but will look into that as well.
Thanks!
February 12, 2015 at 9:19 am
mark.worthen (2/12/2015)
Jeff Moden (2/12/2015)
Again, even though he's obnoxious as hell, Celko's right about one thing. The whole shebang should be recreated as a separate table that contains Nested Sets. For the small amount of items you have, though, it might not be worth it.
Thanks Jeff! I'll take this and work with it and learn from it today. I don't know if I'll get permission to create a separate table but will look into that as well.
Thanks!
Just to be clear on the "new table" thing. The existing table could easily be modified to include Nested Sets to do all the magical things that Nested Sets offers WITHOUT changing the Parent/Child method of the existing code that feeds the table (which is MUCH easier to maintain than doing a full conversion). If the GUI code can actually withstand the addition of columns, then there wouldn't need to be any changes to the code.
If the GUI can't withstand the addition of columns, then we could overlay the table with a view that would allow it, and still no changes to the GUI.
You say there are only about 10K dupes or so. When someone makes a change to the table, we could mark it as "dirty" and have a job that rebuilds the whole Nested Set instead of trying to do piecemeal changes that take nearly as long. While that sounds daunting, it's actually incredibly fast. For 10K rows, it would take something like half a second, and for 100K rows, something less than 4 seconds. The job could be run once every 5 or 10 minutes or could be triggered to run when there's a change.
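Just to make the "add columns" idea concrete, it might look something like the following. The column and flag-table names here are only placeholders I made up for the example; the article linked below covers the real conversion and the rebuild code.
--===== Hypothetical sketch only... the names are placeholders, not from the article.
ALTER TABLE dbo.Dimduplicates
ADD LeftBound INT NULL --Nested Set left boundary
,RightBound INT NULL --Nested Set right boundary
,NodeLevel INT NULL --Depth in the tree, handy for indented reporting
;
--===== A one-row flag that a scheduled job can poll to decide whether to rebuild the Nested Sets.
CREATE TABLE dbo.DimduplicatesDirtyFlag
(IsDirty BIT NOT NULL CONSTRAINT DF_DimduplicatesDirtyFlag_IsDirty DEFAULT (0))
;
INSERT INTO dbo.DimduplicatesDirtyFlag (IsDirty) VALUES (0)
;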
The article I wrote pretty much explains it all, including the timing I just gave. Here's the link to that again.
http://www.sqlservercentral.com/articles/Hierarchy/94040/
But, again... it doesn't sound like the kind of lookups we did on this thread are done very often. It may be that the code that we've developed on this thread is simply "good enough".
--Jeff Moden
Change is inevitable... Change for the better is not.
February 12, 2015 at 9:31 am
Jeff Moden (2/12/2015)
Just to be clear on the "new table" thing. The existing table could easily be modified to include Nested Sets to do all the magical things that Nested Sets offers WITHOUT changing the Parent/Child method of the existing code that feeds the table (which is MUCH easier to maintain than doing a full conversion). {snip} It may be that the code that we've developed on this thread is simply "good enough".
Based on how this fits into their overall goals for the kind of information they want to get out of the data, I'm guessing the cost (time) versus benefit of making such a change probably doesn't justify the effort.
This is a minor part of their overall goal but was one of those, "hey it would be great if we could" things followed by "and you are responsible for enabling that functionality".
Still, I believe what I am learning through this process will likely provide benefits in the future.
Much thanks for your time and assistance. It lets me provide the desired functionality and learn at the same time.
February 12, 2015 at 3:48 pm
CELKO (2/12/2015)
I did not design or build the database, and since it is controlled by a different group, I have no say in that aspect of it - I am only tasked with enabling others to obtain the data from it in the desired manner.
Fireman's Lament
I work here on this railroad train,
I shovel coal all day.
It's what I do that makes the train,
Continue on its way.
I fill the boiler full with coal,
The fire gets roaring hot.
No time to rest as on we go,
I stay tired quite a lot.
I cannot pull the whistle cord,
I cannot ring the bell,
But let this damned thing jump the track,
And see who catches hell.
I do love that old poem. Thanks for posting it.
--Jeff Moden
Change is inevitable... Change for the better is not.