Better option?

  • So basically the scenario is let's say I have 100,000 'items' and 1,000,000 users.

    Each user can rate each item 0 or 1 times. Let's assume the rating itself is just a 1,2,3,4,5.

    What's the best way to store the information on the user/item ratings such that

    - I can quickly determine whether a user has already rated an item

    - I can to whatever extent possible minimize the size of the rating table involved

    I guess what I'm hoping for is something more elegant than what I have now which is a simple table with columns

    USERID ITEMID RATING

    Ultimately I'd like to avoid having this table have 100B rows if possible. Would it make sense to somehow use a binary field in the user table where each item is represented by a corresponding bit (so for the 1000th item if the user has rated it the bit is a 1 otherwise a 0)

  • A reference table with UserId and ItemsId (both integer values) with a clustered index on UserId and ItemsId should be easy enough to query and should perform pretty good (assuming you're usually query for a specific user and the correponding item and rating).

    A million users is asking for quite some traffic tho... You might need to look into horzontal table partitioning in such a case.



    Lutz
    A pessimist is an optimist with experience.

    How to get fast answers to your question[/url]
    How to post performance related questions[/url]
    Links for Tally Table [/url] , Cross Tabs [/url] and Dynamic Cross Tabs [/url], Delimited Split Function[/url]

  • baal32 (1/2/2012)


    So basically the scenario is let's say I have 100,000 'items' and 1,000,000 users.

    Each user can rate each item 0 or 1 times. Let's assume the rating itself is just a 1,2,3,4,5.

    What's the best way to store the information on the user/item ratings such that

    - I can quickly determine whether a user has already rated an item

    - I can to whatever extent possible minimize the size of the rating table involved

    I guess what I'm hoping for is something more elegant than what I have now which is a simple table with columns

    USERID ITEMID RATING

    Does the rating depend on some user/item relationship? e.g. must a user have purchased the item (or done some other action) as to be allowed to rate it? If so then you can trim your table size significantly by linking the rating to an order_line_id (assuming 4 bytes) instead of to a user_id & item_id (4 bytes + 4 bytes). In the example the order_line_id relates a user to a purchase of a specific item...this may become problematic if you must guarantee a user can only rate an item once.

    What about the item depending on the user, are all users allowed to rate all items? If there is some relationship between a user and the items they can rate that could impact your design as well.

    What other requirements are there that may be relevant?

    Ultimately I'd like to avoid having this table have 100B rows if possible. Would it make sense to somehow use a binary field in the user table where each item is represented by a corresponding bit (so for the 1000th item if the user has rated it the bit is a 1 otherwise a 0)

    Whether you stick with the model you have or go with another normalized design I agree with Lutz in that you should handle optimization in your physical implementation, e.g. with partitioning, not via a creative, de-normalized, data model. I predict that unless you are the only one that does, or will ever, manage this database that storing ratings in a binary column will become unmanageable in a hurry.

    There are no special teachers of virtue, because virtue is taught by the whole community.
    --Plato

  • Does the rating depend on some user/item relationship? e.g. must a user have purchased the item (or done some other action) as to be allowed to rate it? If so then you can trim your table size significantly by linking the rating to an order_line_id (assuming 4 bytes) instead of to a user_id & item_id (4 bytes + 4 bytes). In the example the order_line_id relates a user to a purchase of a specific item...this may become problematic if you must guarantee a user can only rate an item once.

    What about the item depending on the user, are all users allowed to rate all items? If there is some relationship between a user and the items they can rate that could impact your design as well.

    What other requirements are there that may be relevant?

    There are no other uesr/item relationships, all users can rate all items once (and only once). The other requirement (not really a requirement, but more of a consideration) is that I don't really care what the user rates an item as long as the rating itself ends up rolled up into the running item rating average.

    So if Joe rates item Blue Widget a 5 out of 5, I will roll the 5 into another table that contains the item_id, total_votes, and vote_average, and I only need to maintain the informatoin that Joe rated Blue Widget (and not what his rating was) to prevent him from rating it again.

    To be honest the question is somewhat academic, but I guess a better example would be if instead of 'items' I had, say, mutliple daily polls which all usrs could vote in but could not edit (or even retrieve) their rating once they had voted.

    I suppose some part of me just balks at the idea of having 100 rows for Joe, each of which consists of the same 4 bytes of user id, 2 bytes for item id, and 1 byte for rating - 700B, of which 400 is exactly the same. Particularly given that I will maintain another table of aggregate data (the vote_average table above) for performance reasons - with the aggregate table the only thing the detail table is doing is telling me which items Joe rated. (yes, I suppose I don't need to maintain the 1 byte for rating under this model as I will roll it up into the aggregate table, but still seems like a lot of overhead) Probably nothing to be done about it, but I got curious about using a single binary field or some other novel approach to this problem.

    I predict that unless you are the only one that does, or will ever, manage this database that storing ratings in a binary column will become unmanageable in a hurry.

    Job security 🙂 - just kidding

    Thanks for your guidance

  • baal32 (1/2/2012)


    I guess what I'm hoping for is something more elegant than what I have now which is a simple table with columns

    USERID ITEMID RATING

    This is the 3NF solution - which I personally find absolutely elegant.

    baal32 (1/2/2012)


    Ultimately I'd like to avoid having this table have 100B rows if possible. Would it make sense to somehow use a binary field in the user table where each item is represented by a corresponding bit (so for the 1000th item if the user has rated it the bit is a 1 otherwise a 0)

    This is kind of a 1NF solution - what would happen if you have to add 50,000 items? or get rid of half the old ones?

    I wouldn't go this way.

    _____________________________________
    Pablo (Paul) Berzukov

    Author of Understanding Database Administration available at Amazon and other bookstores.

    Disclaimer: Advice is provided to the best of my knowledge but no implicit or explicit warranties are provided. Since the advisor explicitly encourages testing any and all suggestions on a test non-production environment advisor should not held liable or responsible for any actions taken based on the given advice.

Viewing 5 posts - 1 through 4 (of 4 total)

You must be logged in to reply to this topic. Login to reply