Data Mining Introduction Part 7: Microsoft Association

Question

Daniel Calbimonte

SSCertifiable

Points: 6418
October 1, 2013 at 12:04 am

#277740

Comments posted to this topic are about the item Data Mining Introduction Part 7: Microsoft Association

jtreher SSC Rookie Points: 36 · Answer 1

Awesome article. I love the association algorithm. However, I've run into an issue after moving away from the typical textbook setup, where a single nested table serves as input, key, and predict. It seems that a DMX query using PredictAssociation returns everything first and only then does the join to filter. Passing 50 to PredictAssociation() limited the output to the top fifty results and sped things up, but the query still took 20 seconds (down from about 30). So it's almost as if the entire nested table has to be returned before the input join is applied. I have other models, larger in data size, that use the textbook nested input/predict/key style, and their queries run in milliseconds.

This particular model uses different input than output. A simple market basket analysis asks which products are purchased in the same session, so a purchased product predicts another purchased product. Here, by contrast, you would use categories browsed to predict products purchased, or products browsed to predict the search terms you might use. After pulling my hair out, I tried other ways of structuring my models, but they would always run out of memory or the queries wouldn't work correctly.

Example model structure:

Columns:
  user_session => key
  product_browsed => input
  vw_data_for_nested => PredictOnly
    Columns:
      search_term_used => key

Input Data:
user_session | search_term | product_browsed
90234        | white       | 23AX0DZ
90234        | white       | 039POOZ
34333        | light       | 23AX0DZ
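
To make the shape of this data concrete, here is a minimal Python sketch (not DMX, and not how the Microsoft Association algorithm works internally) that ranks search terms for a browsed product using the three sample rows above. The function name `search_terms_for_product` and the output field names are made up for illustration, and the probability is a plain conditional frequency, which is not the same thing as $AdjustedProbability.

```python
from collections import defaultdict

# Rows mirror the sample input above: (user_session, search_term, product_browsed).
rows = [
    (90234, "white", "23AX0DZ"),
    (90234, "white", "039POOZ"),
    (34333, "light", "23AX0DZ"),
]


def search_terms_for_product(rows, product):
    """Rank search terms by how often they share a session with `product`.

    Hypothetical helper for illustration only.
    """
    products_by_session = defaultdict(set)
    terms_by_session = defaultdict(set)
    for session, term, browsed in rows:
        products_by_session[session].add(browsed)
        terms_by_session[session].add(term)

    # Sessions in which the product was browsed (the input side).
    matching = [s for s, items in products_by_session.items() if product in items]

    # Support: number of matching sessions that also used each search term.
    support = defaultdict(int)
    for session in matching:
        for term in terms_by_session[session]:
            support[term] += 1

    # Probability: sessions with product AND term / sessions with product.
    results = [
        {"search_term": term, "support": count, "probability": count / len(matching)}
        for term, count in support.items()
    ]
    # Highest probability first; ties broken alphabetically for stable output.
    return sorted(results, key=lambda r: (-r["probability"], r["search_term"]))


predictions = search_terms_for_product(rows, "23AX0DZ")
```

The point of the sketch is only that the input column (product browsed) and the predicted column (search term) are different attributes of the same session, unlike a basket model where both sides are products.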

Sample query (the data is great!):

SELECT FLATTENED
  (SELECT
      [search term],
      $PROBABILITY AS [Probability],
      $ADJUSTEDPROBABILITY AS [AdjustedProbability],
      $SUPPORT AS [Support]
   FROM PREDICT([vw For Product Predicts Search], 50, INCLUDE_NODE_ID, INCLUDE_STATISTICS)
   WHERE $NODEID <> '')
FROM [mdlProductPredictsSearch]
PREDICTION JOIN
  (SELECT '23AX0DZ' AS [product browsed]) AS t
ON [mdlProductPredictsSearch].[product browsed] = t.[product browsed]

Daniel Calbimonte SSCertifiable Points: 6418 · Answer 2

Yes, the Microsoft documentation does mention that this algorithm can have performance problems.

Here is the relevant section from the TechNet documentation:

Performance

The process of creating itemsets and counting correlations can be time-consuming. Although the Microsoft Association Rules algorithm uses optimization techniques to save space and make processing faster, you should know that performance issues might occur under conditions such as the following:

Data set is large with many individual items.

Minimum itemset size is set too low.

To minimize processing time and reduce the complexity of the itemsets, you might try grouping related items by categories before you analyze the data.
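
As a rough illustration of the grouping advice above, here is a minimal Python sketch that replaces individual items with categories before any analysis, which shrinks the number of distinct items the algorithm has to count. The category names, the lookup table `CATEGORY_BY_PRODUCT`, and the product code 77QQ1LM are all made up for the example; in practice the mapping would come from your own product data.

```python
# Hypothetical item-to-category lookup (assumed, not from the documentation).
CATEGORY_BY_PRODUCT = {
    "23AX0DZ": "lamps",
    "039POOZ": "lamps",
    "77QQ1LM": "rugs",
}


def group_by_category(session_items, category_by_product, default="other"):
    """Replace each item with its category and drop duplicates per session."""
    grouped = {}
    for session, items in session_items.items():
        categories = {category_by_product.get(item, default) for item in items}
        grouped[session] = sorted(categories)
    return grouped


# Items browsed per session, keyed by session id (illustrative data).
session_items = {
    90234: ["23AX0DZ", "039POOZ"],
    34333: ["23AX0DZ", "77QQ1LM"],
}

grouped = group_by_category(session_items, CATEGORY_BY_PRODUCT)

# Count distinct values before and after grouping.
distinct_before = len({i for items in session_items.values() for i in items})
distinct_after = len({c for cats in grouped.values() for c in cats})
```

With this toy data the three distinct products collapse into two categories; on a real catalog the reduction would depend entirely on how coarse the categories are.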

jtreher SSC Rookie Points: 36 · Answer 3

Actually, processing isn't the problem; the queries are. I haven't seen any other documentation on query latency, and there just don't seem to be many people using DMX in the wild to provide feedback.

hcprasadv Ten Centuries Points: 1215 · Answer 4

October 6, 2013 at 11:10 am

#1656167

Good article. It helped me a lot.