October 1, 2013 at 12:04 am
Comments posted to this topic are about the item Data Mining Introduction Part 7: Microsoft Association
October 1, 2013 at 12:03 pm
Awesome article. I love the association algorithm. However, I've run into an issue now that I've moved away from the typical textbook setup, where the nested table is the input, key, and predict column all at once. It seems that in DMX queries PredictAssociation returns everything and only then does the join to filter. Passing 50 to PredictAssociation() limited the output to the top fifty results and sped things up, but the query still took 20 seconds (down from 30ish). So it's almost as if it has to return the entire nested table before it applies the input join. I have models with more data that use the textbook nested input/predict/key style, and their queries run in milliseconds.
This particular model uses different columns for input than for output. For instance, instead of a simple market basket analysis, where buying one product predicts buying another product in the same session, you would use categories browsed to predict products purchased. Or products browsed can predict the search terms you might use. After pulling my hair out, I tried other ways of structuring my models, but they would always run out of memory or the queries wouldn't work correctly.
Example model structure:
Columns:
    user_session => key
    product_browsed => input
    vw_data_for_nested => PredictOnly
        Columns:
            search_term_used => key
Input Data:
user_session | search_term | product_browsed
90234 | white | 23AX0DZ
90234 | white | 039POOZ
34333 | light | 23AX0DZ
Sample query (the data is great!):
SELECT FLATTENED
    (SELECT
        [search term]
        ,$PROBABILITY AS [Probability]
        ,$ADJUSTEDPROBABILITY AS [AdjustedProbability]
        ,$SUPPORT AS [Support]
     FROM PREDICT([vw For Product Predicts Search], 50, INCLUDE_NODE_ID, INCLUDE_STATISTICS)
     WHERE $NODEID <> '')
FROM [mdlProductPredictsSearch]
PREDICTION JOIN
    (SELECT '23AX0DZ' AS [product browsed]) AS t
ON [mdlProductPredictsSearch].[product browsed] = t.[product browsed]
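For anyone trying to picture the asymmetric setup (products browsed as input, search terms as output), here is a rough Python sketch using the three sample rows above. It is not DMX and not how the Microsoft Association algorithm works internally; it only counts, per session, how often each search term co-occurs with a browsed product. The function name `term_confidence` and the simple share-of-sessions measure are assumptions made for illustration.

```python
from collections import defaultdict

# Sample rows from the post: (user_session, search_term, product_browsed)
ROWS = [
    (90234, "white", "23AX0DZ"),
    (90234, "white", "039POOZ"),
    (34333, "light", "23AX0DZ"),
]

def term_confidence(rows, product):
    """Return {search_term: share of sessions that browsed `product` and used the term}."""
    sessions_with_product = set()
    terms_by_session = defaultdict(set)
    for session, term, browsed in rows:
        terms_by_session[session].add(term)
        if browsed == product:
            sessions_with_product.add(session)
    if not sessions_with_product:
        return {}
    counts = defaultdict(int)
    for session in sessions_with_product:
        for term in terms_by_session[session]:
            counts[term] += 1
    total = len(sessions_with_product)
    return {term: n / total for term, n in counts.items()}
```

Calling `term_confidence(ROWS, "23AX0DZ")` looks only at the two sessions that browsed that product, which mirrors the singleton input in the query above.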
October 1, 2013 at 3:14 pm
Yeah, it is in the Microsoft documentation that this algorithm has some performance problems.
I am copying the TechNet documentation here:
Performance
The process of creating itemsets and counting correlations can be time-consuming. Although the Microsoft Association Rules algorithm uses optimization techniques to save space and make processing faster, you should know that performance issues might occur under conditions such as the following:
Data set is large with many individual items.
Minimum itemset size is set too low.
To minimize processing time and reduce the complexity of the itemsets, you might try grouping related items by categories before you analyze the data.
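The grouping advice can be tried on the source data before it ever reaches the model. A minimal Python sketch, assuming a hypothetical `ITEM_TO_CATEGORY` lookup (the category names and the ID `77QK1LM` are invented for illustration; a real mapping would come from your own catalog):

```python
# Hypothetical item -> category lookup, invented for this example.
ITEM_TO_CATEGORY = {
    "23AX0DZ": "lamps",
    "039POOZ": "lamps",
    "77QK1LM": "rugs",
}

def roll_up(baskets, mapping, default="other"):
    """Replace each item with its category and drop duplicates within a basket."""
    return [sorted({mapping.get(item, default) for item in basket}) for basket in baskets]

def distinct_items(baskets):
    """Count the distinct items that appear across all baskets."""
    return len({item for basket in baskets for item in basket})
```

Comparing `distinct_items` before and after `roll_up` gives a quick sense of how much the item space shrinks, which is the lever the documentation is pointing at.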
October 1, 2013 at 5:01 pm
Actually, the processing isn't the problem; it's the queries. I haven't seen any other documentation on query latency, and there just don't seem to be many people using DMX in the wild to provide feedback.
October 6, 2013 at 11:10 am
Good article, it helped me a lot.