February 7, 2013 at 11:15 am
Hi,
I am trying to build something that combines all values of all variables using SQL Server. For Example:
A B C D
A1 B1 C1 D1
A2 B2 C2
A3 C3
A4 C4
C5
A1
B1
C1
D1
A1 B1
A1 B2
A1 B1 C1 D1 and so on..
Basically making all possible filter combinations on variables. I have been a little bit successful but the problem is with storage because the combinations can be so many that they are running me out of storage space.
I wanted your help in trying to make a SQL procedure to make all these filter combinations but more importantly comparing each combination as soon as they are generated with another table on a criteria to check performance before deciding to save or delete them.
This is my code so far:
DECLARE @i INT
SET @i = 0
WHILE (@i < 15)
BEGIN
SET @i = @i + 1
EXEC ('CREATE TABLE VAR' + @i + '(VAR' + @i + ' VARCHAR(50))')
END
INSERT INTO VAR1 (VAR1) SELECT DISTINCT VAR1 FROM [Training Data]
DECLARE @i INT
SET @i = 0
WHILE (@i < 15)
BEGIN
SET @i = @i + 1
EXEC('INSERT INTO VAR' + @i + ' (VAR' + @i + ')' + ' SELECT DISTINCT VAR' + @i + ' FROM [Training Data]')
END
CREATE TABLE ALL_COMB (VAR1 VARCHAR(50), VAR2 VARCHAR(50), VAR3 VARCHAR(50), VAR4 VARCHAR(50), VAR5 VARCHAR(50), VAR6 VARCHAR(50), VAR7 VARCHAR(50), VAR8 VARCHAR(50), VAR9 VARCHAR(50), VAR10 VARCHAR(50), VAR11 VARCHAR(50), VAR12 VARCHAR(50), VAR13 VARCHAR(50), VAR14 VARCHAR(50), VAR15 VARCHAR(50))
INSERT INTO ALL_COMB (VAR1, VAR2, VAR3, VAR4, VAR5, VAR6, VAR7, VAR8, VAR9, VAR10, VAR11, VAR12, VAR13, VAR14, VAR15) SELECT * FROM VAR1, VAR2, VAR3, VAR4, VAR5, VAR6, VAR7, VAR8, VAR9, VAR10, VAR11, VAR12, VAR13, VAR14, VAR15
ORDER BY VAR1, VAR2, VAR3, VAR4, VAR5, VAR6, VAR7, VAR8, VAR9, VAR10, VAR11, VAR12, VAR13, VAR14, VAR15;
OR CAN I ATLEAST LIMIT THE OBSERVATIONS GOING INTO ALL_COMB BY STARTING OBSERVATION AND ENDING OBSERVATION.
Thanks in advance for the help.
-Akber.
February 7, 2013 at 12:04 pm
Hi and welcome to SSC! It is really hard to figure out what you are trying to do here based on your description. Can you please try to reword what you are trying to do. It would probably be a good idea to post ddl and sample data. You can read about best practices for posting questions by reading the article found at the first link in my signature.
_______________________________________________________________
Need help? Help us help you.
Read the article at http://www.sqlservercentral.com/articles/Best+Practices/61537/ for best practices on asking questions.
Need to split a string? Try Jeff Modens splitter http://www.sqlservercentral.com/articles/Tally+Table/72993/.
Cross Tabs and Pivots, Part 1 – Converting Rows to Columns - http://www.sqlservercentral.com/articles/T-SQL/63681/
Cross Tabs and Pivots, Part 2 - Dynamic Cross Tabs - http://www.sqlservercentral.com/articles/Crosstab/65048/
Understanding and Using APPLY (Part 1) - http://www.sqlservercentral.com/articles/APPLY/69953/
Understanding and Using APPLY (Part 2) - http://www.sqlservercentral.com/articles/APPLY/69954/
February 7, 2013 at 12:16 pm
Ok.
What I am trying to do is to filter a dataset on all possible combinations.
My code does create combinations in a separate dataset but I am unable to completely run it because of the size which will run up to 800 TB and I dont have that space.
Is there a way to slow SQL down and generate these combination one by one with a condition that checks performance of each combination against a criteria specified?
Thanks for trying to help!
February 7, 2013 at 12:23 pm
akberali67 (2/7/2013)
Ok.What I am trying to do is to filter a dataset on all possible combinations.
My code does create combinations in a separate dataset but I am unable to completely run it because of the size which will run up to 800 TB and I dont have that space.
Is there a way to slow SQL down and generate these combination one by one with a condition that checks performance of each combination against a criteria specified?
Thanks for trying to help!
You are familiar with your project but I am not. I can see your code that is creating some tables and then inserting data into them. I get what you are saying about disc space but I don't understand what you mean about slow sql down??? I shudder to think that you are creating 800TB of data using a loop. Does this take somewhere around 3 weeks to run? If you can clarify what you are trying to do I am sure there is a way to do whatever it is without needing an additional 800TB of disc space!!!
_______________________________________________________________
Need help? Help us help you.
Read the article at http://www.sqlservercentral.com/articles/Best+Practices/61537/ for best practices on asking questions.
Need to split a string? Try Jeff Modens splitter http://www.sqlservercentral.com/articles/Tally+Table/72993/.
Cross Tabs and Pivots, Part 1 – Converting Rows to Columns - http://www.sqlservercentral.com/articles/T-SQL/63681/
Cross Tabs and Pivots, Part 2 - Dynamic Cross Tabs - http://www.sqlservercentral.com/articles/Crosstab/65048/
Understanding and Using APPLY (Part 1) - http://www.sqlservercentral.com/articles/APPLY/69953/
Understanding and Using APPLY (Part 2) - http://www.sqlservercentral.com/articles/APPLY/69954/
February 7, 2013 at 12:33 pm
The code ran for about a day on my 3 TB partition and I got an error. My dataset is really small but the combinations increase the size. This is an example:
A B C D Outcome
A1 B1 C1 D1 1
A2 B2 C2 0
A3 C3 1
A4 C4 1
C5 0
So, like in excel I filter the data and filter on the variable A on A1, then another combination to filter would be on variable A on A2, B on B2 assuming B2 occurs multiple times. So, I am trying to make all filter combinations and check how the outcome is for them.
I hope I have explained my problem well enough.
February 7, 2013 at 12:37 pm
A,B,C,D,Outcome
A1,B1,C1,D1,1
A2,B2,C2,,0
A3,,C3,,1
A4,,C4,,1
,,C5,,0
February 7, 2013 at 12:41 pm
akberali67 (2/7/2013)
The code ran for about a day on my 3 TB partition and I got an error. My dataset is really small but the combinations increase the size. This is an example:A B C D Outcome
A1 B1 C1 D1 1
A2 B2 C2 0
A3 C3 1
A4 C4 1
C5 0
So, like in excel I filter the data and filter on the variable A on A1, then another combination to filter would be on variable A on A2, B on B2 assuming B2 occurs multiple times. So, I am trying to make all filter combinations and check how the outcome is for them.
I hope I have explained my problem well enough.
Not to be rude but you haven't explained your problem at all. If you look back at what you posted you will notice there are no details to someone who is not familiar with your project.
What does the base table look like and why do you need every single combination of each column?
I would recommend using a cross join instead of looping. And if you filled up a 3TB partition that should be self explanatory that whatever you are trying to do here is not the best approach.
_______________________________________________________________
Need help? Help us help you.
Read the article at http://www.sqlservercentral.com/articles/Best+Practices/61537/ for best practices on asking questions.
Need to split a string? Try Jeff Modens splitter http://www.sqlservercentral.com/articles/Tally+Table/72993/.
Cross Tabs and Pivots, Part 1 – Converting Rows to Columns - http://www.sqlservercentral.com/articles/T-SQL/63681/
Cross Tabs and Pivots, Part 2 - Dynamic Cross Tabs - http://www.sqlservercentral.com/articles/Crosstab/65048/
Understanding and Using APPLY (Part 1) - http://www.sqlservercentral.com/articles/APPLY/69953/
Understanding and Using APPLY (Part 2) - http://www.sqlservercentral.com/articles/APPLY/69954/
February 7, 2013 at 12:43 pm
akberali67 (2/7/2013)
A,B,C,D,OutcomeA1,B1,C1,D1,1
A2,B2,C2,,0
A3,,C3,,1
A4,,C4,,1
,,C5,,0
deleted....didnt read the OP correctly !
________________________________________________________________
you can lead a user to data....but you cannot make them think
and remember....every day is a school day
February 7, 2013 at 5:30 pm
I think this article will help you:
Generating n-Tuples with SQL[/url]
There's a performance improved version of the UNIQUEnTuples rCTE (the one I think you'll need) late in the discussion thread here:
http://www.sqlservercentral.com/Forums/Topic1301485-3122-5.aspx (last post on that page).
I would have worked up an example for you but I am unclear how the input data is structured (i.e., no DDL and consumable sample data to work with).
My thought question: Have you ever been told that your query runs too fast?
My advice:
INDEXing a poor-performing query is like putting sugar on cat food. Yeah, it probably tastes better but are you sure you want to eat it?
The path of least resistance can be a slippery slope. Take care that fixing your fixes of fixes doesn't snowball and end up costing you more than fixing the root cause would have in the first place.
Need to UNPIVOT? Why not CROSS APPLY VALUES instead?[/url]
Since random numbers are too important to be left to chance, let's generate some![/url]
Learn to understand recursive CTEs by example.[/url]
[url url=http://www.sqlservercentral.com/articles/St
February 7, 2013 at 9:10 pm
Hi,
This is great but the problem is that I have my data in variables (columns) and need it outputted in variables back, so that I can use them to filter another dataset. I have tried SAS, R, MATLAB over the past month but I couldnt work it out. I was close with SQL but It was very inefficient in terms of time and size. Let me share some actual data with you, though its a bit confidential, I guess random data wont hurt.
https://docs.google.com/spreadsheet/ccc?key=0Aok4OTJtBDbrdFc2WFZ5MUpHTF91NFlWeTlwZ0RLRmc&usp=sharing
So, I am trying to filter this dataset on all possible filter combinations from var1 to var15 and then check Outcome for performance in terms of %.
select sum(Outcome)/count(Outcome) as Perc, count(Outcome) as Lines from Sample_Data where var1 = 1
var1 = 1 is one of the filter combination possible.
I hope this helps understand better. Thanks for trying to help.
-Akber.
February 7, 2013 at 9:19 pm
akberali67 (2/7/2013)
Hi,This is so amazing, I mean I ran it on a smaller dataset but this works great. I have tried SAS, R, MATLAB over the past month but I couldnt work it out. I was close with SQL but I think this works. I will try it properly on the entire dataset before I am sure but I think this works. I cant thank you enough. 🙂
-Akber.
You're welcome (assuming you were referring to my article).
The approach will most definitely choke with large datasets unless you can find ways to limit the row sets that are produced. Many problems allow this, for example I've used essentially the same script for some bin packing problems. In those cases, analyzing the desired results allows you to cull unneeded combinations from the sets and curb the explosive growth of the resulting row sets.
Have fun with it.
My thought question: Have you ever been told that your query runs too fast?
My advice:
INDEXing a poor-performing query is like putting sugar on cat food. Yeah, it probably tastes better but are you sure you want to eat it?
The path of least resistance can be a slippery slope. Take care that fixing your fixes of fixes doesn't snowball and end up costing you more than fixing the root cause would have in the first place.
Need to UNPIVOT? Why not CROSS APPLY VALUES instead?[/url]
Since random numbers are too important to be left to chance, let's generate some![/url]
Learn to understand recursive CTEs by example.[/url]
[url url=http://www.sqlservercentral.com/articles/St
February 7, 2013 at 9:47 pm
Hi,
This is great but the problem is that I have my data in variables (columns) and need it outputted in variables back, so that I can use them to filter another dataset. I have tried SAS, R, MATLAB over the past month but I couldnt work it out. I was close with SQL but It was very inefficient in terms of time and size. Let me share some actual data with you, though its a bit confidential, I guess random data wont hurt.
https://docs.google.com/spreadsheet/ccc?key=0Aok4OTJtBDbrdFc2WFZ5MUpHTF91NFlWeTlwZ0RLRmc&usp=sharing
So, I am trying to filter this dataset on all possible filter combinations from var1 to var15 and then check Outcome for performance in terms of %.
select sum(Outcome)/count(Outcome) as Perc, count(Outcome) as Lines from Sample_Data where var1 = 1
var1 = 1 is one of the filter combination possible.
I hope this helps understand better. Thanks for trying to help.
-Akber.
February 7, 2013 at 10:12 pm
That's a lot of sample data! Let's look at just the first 4 rows:
;WITH SampleData
([Customer ID], Var1, Var2, Var3, Var4, Var5, Var6, Var7, Var8, Var9, Var10, Var11, Var12, Var13, Var14, Var15, [Outcome Label])
AS (
SELECT '123',3,0,5,0.5,0.885889264,0.6,0.023603498,0,1,0.89,1,0.654,0,6,1,1
UNION ALL SELECT '288',1,0,4,0.066666667,0.994711653,0.8,0.366666667,0,1,0.53,0,0.400040004,4,1,1,1
UNION ALL SELECT '227',1,1,4,1,0.846637933,0.6,0.333333333,0,0,0.95,1,0,1,18,1,1
UNION ALL SELECT '7',1,0,4,1,0.846637933,0.6,0.333333333,0,0,0.95,1,0,5,3,4,0
)
SELECT *
FROM SampleData
With just these 4 rows, can you show me (not describe) the expected results?
My thought question: Have you ever been told that your query runs too fast?
My advice:
INDEXing a poor-performing query is like putting sugar on cat food. Yeah, it probably tastes better but are you sure you want to eat it?
The path of least resistance can be a slippery slope. Take care that fixing your fixes of fixes doesn't snowball and end up costing you more than fixing the root cause would have in the first place.
Need to UNPIVOT? Why not CROSS APPLY VALUES instead?[/url]
Since random numbers are too important to be left to chance, let's generate some![/url]
Learn to understand recursive CTEs by example.[/url]
[url url=http://www.sqlservercentral.com/articles/St
February 8, 2013 at 12:10 am
So, what I am trying to do is:
;WITH SampleData
([Customer ID], Var1, Var2, Var3, Var4, Var5, Var6, Var7, Var8, Var9, Var10, Var11, Var12, Var13, Var14, Var15, [Outcome Label])
AS (
SELECT '123',3,0,5,0.5,0.885889264,0.6,0.023603498,0,1,0.89,1,0.654,0,6,1,1
UNION ALL SELECT '288',1,0,4,0.066666667,0.994711653,0.8,0.366666667,0,1,0.53,0,0.400040004,4,1,1,1
UNION ALL SELECT '227',1,1,4,1,0.846637933,0.6,0.333333333,0,0,0.95,1,0,1,18,1,1
UNION ALL SELECT '7',1,0,4,1,0.846637933,0.6,0.333333333,0,0,0.95,1,0,5,3,4,0
)
SELECT SUM([Outcome Label])/COUNT([Outcome Label]) AS PERC, COUNT([Outcome Label]) AS LINES FROM SampleData WHERE VAR1 = 3 AND VAR2 = 0 AND VAR3 = 5
So, this 'where statement' is one kind of combination of filter and there are zillions of others which can be calculated by calculating the distinct count of each variable and factorializing them, this is to just get the count but I want the actual combinations. Which will be in zillions right but what I want is to run each automatically determined combination as a where statement and then check the performance using the select statement to scrap anything that has Perc < 0.8.
I hope this makes sense.
February 8, 2013 at 12:46 am
Hi,
I understood the issue, it is with the concept. Mathematically, it needs more theories (Permutations, Combinations and Selections) implemented within code, I will have to buy books and read to be able to figure it out. Thanks for the help.
Viewing 15 posts - 1 through 15 (of 36 total)
You must be logged in to reply to this topic. Login to reply