Privacy risks of collaborative filtering Yuval Madar, June 2012 Based on a paper by J.A. Calandrino, A. Kilzer, A. Narayanan, E. W. Felten & V. Shmatikov Help suggesting users items and other users to their liking by deducing them from previous purchases. Numerous examples ◦ Amazon, iTunes, CNN, Last.fm, Pandora, Netflix, Youtube, Hunch, Hulu, LibraryThing, IMDb and many others. User to item – “You might also like…” Item to item – Similar items list User to user – Another customer with common interests Content Based filtering ◦ Based on A-priori similarity between items, and recommendations are derived from a user’s own history. ◦ Doesn’t pose a privacy threat. Collaborative filtering ◦ Based on correlations between other uses purchases. ◦ Our attacks will target this type of systems. Hybrid ◦ A system employing both filtering techniques The data the recommendation system uses is modeled as a matrix, where the rows correspond to users, and columns correspond to items. Some auxiliary information on the target user is available (A subset of a target user’s transaction history) An attack is successful, if it allows the attacker to learn transactions not part of its auxiliary information. User public rating and comments on products Shared transactions (Via facebook, or other mediums) Discussions in 3rd party sites Favorite books in facebook profile Non-online interactions (With friends, neighbors, coworkers, etc.) Other sources… Input: Observe the related items list of A, until an item in T appears, or moves up. If a target item appears in enough related items lists in the same time, the attacker may infer it was bought by the target user. Note 1 – Scoring may be far more complex, since different items in A are correlated. (Books which belong to a single series, bundle discounts, etc.) Note 2 – It is preferable that A consist of obscure and uncommon items, to improve the effect of the target user’s choices on its related items lists. ◦ a set of target items T and a set of auxiliary items A In some sites, the covariance matrix, describing the correlation between items in the site, is exposed to the users. (Hunch is one such website) Similarly, the attacker is required to watch for improvement in the correlation between the auxiliary items and the target items. Note 1 – Asynchronous updates to different matrix cells. Note2 – inference probability improves if the auxiliary items are user-unique. (No other user bought all auxiliary items) More likely if some of them are unpopular, or if there are enough of them. System model Active Attack Create k dummy users, each buying all known auxiliary items. ◦ For each user, the system finds the k users most similar to it, and ranks items purchased by them by total number of sales. With high probability, the k dummy users and the target user will be clustered together. (Given auxiliary items list of size logarithmic in the total number of users. In practice, 8 items were found to be enough for most sites) In that case, the recommendations to the dummy users will consist of transactions of the target user previously unknown to the attacker. Note – The attack is more feasible in a system where user interactions with items does not involve spending money. The main parameters for evaluation of an inference attack are: ◦ Yield – How many inferences are produced. ◦ Accuracy – How likely is each inference. Yield-accuracy tradeoff - stricter accuracy algorithms reject less probable inferences. The paper further discusses specific attacks performed against: ◦ ◦ ◦ ◦ Hunch LibraryThing Last.fm Amazon And measures the accuracy and yield of these attacks, arriving in some instances to impressive tradeoff figures. (Such as 70% accuracy for 100% yield in Hunch) Not discussed in the paper. Achieved in other papers for static recommendation databases. Remains an open problem for dynamic systems. (Which all real world examples are) Limited-length related items list – The first elements of such lists have low sensitivity to single purchases. Factoring item popularity into update frequency – less popular items are more sensitive to single purchases. Batching their purchases together will decrease the information leak. Limit data access rate – Preventing largescale privacy attacks, though lowering utility and may be circumvented using a botnet. User opt-out – A privacy conscious user may decide to opt-out of recommender systems entirely. (At clear cost of utility) A passive attack on recommender systems using auxiliary information on a certain user’s purchases, allowing the attacker to infer undisclosed private transactions. Increased user awareness is required Suggested several methods to decrease the information leaked by these systems Questions?