% THIS IS SIGPROC-SP.TEX - VERSION 3.1
% WORKS WITH V3.2SP OF ACM_PROC_ARTICLE-SP.CLS
% APRIL 2009
%
% It is an example file showing how to use the 'acm_proc_article-sp.cls' V3.2SP
% LaTeX2e document class file for Conference Proceedings submissions.
% ----------------------------------------------------------------------------------------------------------------
% This .tex file (and associated .cls V3.2SP) *DOES NOT* produce:
% 1) The Permission Statement
% 2) The Conference (location) Info information
% 3) The Copyright Line with ACM data
% 4) Page numbering
% ---------------------------------------------------------------------------------------------------------------
% It is an example which *does* use the .bib file (from which the .bbl file
% is produced).
% REMEMBER HOWEVER: After having produced the .bbl file,
% and prior to final submission,
% you need to 'insert' your .bbl file into your source .tex file so as to provide
% ONE 'self-contained' source file.
%
% Questions regarding SIGS should be sent to
% Adrienne Griscti ---> griscti@acm.org
%
% Questions/suggestions regarding the guidelines, .tex and .cls files, etc. to
% Gerald Murray ---> murray@hq.acm.org
%
% For tracking purposes - this is V3.1SP - APRIL 2009
\documentclass{acm_proc_article-sp}
\usepackage{graphicx}
\usepackage{subcaption}
\usepackage{booktabs}
\usepackage{color, colortbl}
\usepackage[utf8]{inputenc}
\usepackage{multirow}
\usepackage[usenames,dvipsnames]{xcolor}
\begin{document}
\title{Quantifying the Level of Personalization in a Recommender System: A User-Centric Point of View}
% You need the command \numberofauthors to handle the 'placement
% and alignment' of the authors beneath the title.
%
% For aesthetic reasons, we recommend 'three authors at a time'
% i.e. three 'name/affiliation blocks' be placed beneath the title.
%
% NOTE: You are NOT restricted in how many 'rows' of
% "name/affiliations" may appear. We just ask that you restrict
% the number of 'columns' to three.
%
% Because of the available 'opening page real-estate'
% we ask you to refrain from putting more than six authors
% (two rows with three columns) beneath the article title.
% More than six makes the first-page appear very cluttered indeed.
%
% Use the \alignauthor commands to handle the names
% and affiliations for an 'aesthetic maximum' of six authors.
% Add names, affiliations, addresses for
% the seventh etc. author(s) as the argument for the
% \additionalauthors command.
% These 'additional authors' will be output/set for you
% without further effort on your part as the last section in
% the body of your article BEFORE References or any Appendices.
% \numberofauthors{8} % in this sample file, there are a *total*
% of EIGHT authors. SIX appear on the 'first-page' (for formatting
% reasons) and the remaining two appear in the \additionalauthors section.
%
\author{
% You can go ahead and credit any number of authors here,
% e.g. one 'row of three' or two rows (consisting of one row of three
% and a second row of one, two or three).
%
% The command \alignauthor (no curly braces needed) should
% precede each author name, affiliation/snail-mail address and
% e-mail address. Additionally, tag each line of
% affiliation/address with \affaddr, and tag the
% e-mail address with \email.
%
% 1st. author
% \alignauthor
% Ben Trovato\titlenote{Dr.~Trovato insisted his name be first.}\\
% \affaddr{Institute for Clarity in Documentation}\\
% \affaddr{1932 Wallamaloo Lane}\\
% \affaddr{Wallamaloo, New Zealand}\\
% \email{trovato@corporation.com}
% % 2nd. author
% \alignauthor
% G.K.M. Tobin\titlenote{The secretary disavows
% any knowledge of this author's actions.}\\
% \affaddr{Institute for Clarity in Documentation}\\
% \affaddr{P.O. Box 1212}\\
% \affaddr{Dublin, Ohio 43017-6221}\\
% \email{webmaster@marysville-ohio.com}
% % 3rd. author
% \alignauthor Lars Th{\o}rv{\"a}ld\titlenote{This author is the
% one who did all the really hard work.}\\
% \affaddr{The Th{\o}rv{\"a}ld Group}\\
% \affaddr{1 Th{\o}rv{\"a}ld Circle}\\
% \affaddr{Hekla, Iceland}\\
% \email{larst@affiliation.org}
% \and % use '\and' if you need 'another row' of author names
% % 4th. author
% \alignauthor Lawrence P. Leipuner\\
% \affaddr{Brookhaven Laboratories}\\
% \affaddr{Brookhaven National Lab}\\
% \affaddr{P.O. Box 5000}\\
% \email{lleipuner@researchlabs.org}
% % 5th. author
% \alignauthor Sean Fogarty\\
% \affaddr{NASA Ames Research Center}\\
% \affaddr{Moffett Field}\\
% \affaddr{California 94035}\\
% \email{fogartys@amesres.org}
% % 6th. author
% \alignauthor Charles Palmer\\
% \affaddr{Palmer Research Laboratories}\\
% \affaddr{8600 Datapoint Drive}\\
% \affaddr{San Antonio, Texas 78229}\\
% \email{cpalmer@prl.com}
}
% There's nothing stopping you putting the seventh, eighth, etc.
% author on the opening page (as the 'third row') but we ask,
% for aesthetic reasons that you place these 'additional authors'
% in the \additional authors block, viz.
% \additionalauthors{Additional authors: John Smith (The Th{\o}rv{\"a}ld Group,
% email: {\texttt{jsmith@affiliation.org}}) and Julius P.~Kumquat
% (The Kumquat Consortium, email: {\texttt{jpkumquat@consortium.net}}).}
% \date{30 July 1999}
% Just remember to make sure that the TOTAL number of authors
% is the number that will appear on the first page PLUS the
% number that will appear in the \additionalauthors section.
\maketitle
\begin{abstract}
Recommendation is the selection of useful information items from an overwhelming number of available items for the user to consume. The main operationalization of recommendation is personalization, which is the tailoring of recommendations to the tastes and preferences of the user.
% The two extreme ends of recommendation are 1) No-difference in the information that is made available to different users resulting in information overload, and 2) extreme personalization resulting in filter bubble. It is in the interest of the user and the information service provider (company) to alleviate information overload, but there is a concern that recommendation is creating a filter bubble effect, a phenomenon where society is balkanized along real or percieved lines of difference. The filter bubble is the phenomenon that a user is isolated from content that an algorithm decides as being not relevant to them.
The fundamental assumption in personalization is that there are differences between users' information interests and preferences. Personalization is deemed to be in the interest of the user; otherwise the user has to cope with all the available content, resulting in information overload. There is, however, a concern that personalized recommendation can result in filter bubble effects, where people end up being isolated from certain types of content outside of their interest sphere. An interesting question when examining a specific recommender system is whether the difference in content served to distinct groups of users amplifies or dampens the potential bubble formed by the groups' distinct interests.
%%%% << We here explore whether we can quantify to which extent the content served difference between the distance between user groups in terms of their interests
%
%The below is not what you're actually doing - I would posit this in a different way, as I've done above.
%%For an informed criticism of personalized recommendation, it would be useful to be able to quantify it. How can we measure whether a system is doing extreme personalization resulting in filter bubble or no personalization resulting in information overload.
In this study, we view personalization as having two components: 1) the separation of available content according to user interests, a kind of drawing of imprecise borders between users' preferences, and 2) the ranking of a particular user's information items. %As the second is the classical ranking in information retrieval,
We focus on the separation component of personalization and propose a method for measuring personalization as the ability to separate information items according to user preferences.
The proposed method views personalization as the ability to maintain the same distances (similarities) between the vectors of users' personalized recommendations as there is between the vectors of users' engagement histories. We apply the method to two recommendation datasets, and show how it can be used to suggest improvements to a system that does personalized recommendation.
\end{abstract}
\section{Introduction}
Personalized recommendation is ubiquitous on the internet. Recommender systems, e-commerce sites, and social media platforms use personalization to provide users with content and products that are tailored to their preferences \cite{xiao2007commerce}. Big search engines such as Google and Bing have also implemented personalization of search results \cite{hannak2013measuring}. Personalized recommendation is a core part of content consumption and of companies' revenue. For example, in 2012 Netflix reported that 75\% of what users watched came from recommendations, and Amazon reported that 35\% of its sales came from recommendations \cite{pu2011user}. Beyond sales, personalized recommendations have been found to influence users more than recommendations from experts and peers do \cite{senecal2004influence}. Personalized recommendations lower users' decision effort and increase users' decision quality \cite{senecal2004influence}. %Personalization is increasingly part of search engines, online content provision
The proliferation of recommender systems is a response to the ever-increasing amount of available information - they are the supposed solution to information overload. Recommendation is the selection of useful information from a vast amount of available information for the user to consume. Recommendation can be implemented in many different ways; for example, it can recommend popular items or the most recent items. The main operationalization of recommendation is, however, personalization. For the recommended items to be relevant, the user's preferences must be modeled from the user's history and the recommended items selected on the basis of the modeled preferences; that is, recommendations must be personalized and tailored to the interests of the user.
A number of approaches can be applied to online content provision. No (personalized) filtering at all leads to information overload, or necessitates a random selection of content. Full editorial curation leads to a limited set of content with a specific point of view. Individual personalization is more likely to lead to increased content engagement, user satisfaction, retention and revenue for wider audiences. For the user, it means less cognitive overload, increased efficiency and satisfaction.
However, on the extreme end, this could arguably lead to filter bubbles.
A filter bubble might not necessarily be a problem from a user engagement standpoint; content may still be consumed as it fits the individual user's interests. However, it might be a problem from the point of view of the user and of society. User interests can evolve and expand over time, and over-personalization becomes a problem for users when they miss relevant information simply because the algorithm decides it is not relevant to them.
The filter bubble is an interesting problem from the point of view of society as a whole. It could be argued that it is in the interest of the common good that people are exposed to diverse views. Exposure to different views can, it is believed, increase tolerance, social integration and stability. This would mean it is in the interest of society for individuals to be exposed to different views and for the effect of a potential filter bubble to be reduced.
%One can argue that the filter bubble is the cost that the user pays for the benefits of efficiency, reduced cognitive load, and better satisfaction.
We can debate whether the concept of filter bubbles is right or wrong in theory, but recommendation is a fact of life today. The question is whether we can strike a balance between under- and over-personalization. Whichever direction this balance should tip in, it would be in the interest of everybody to be able to quantify the level of personalization in a recommender system.
In this study we propose a novel method for quantifying personalization, by comparing the response of different users to personalized recommendations.
%We call the method a PullPush to indicate whether users want to be kept apart or brought closer to each other from how the current level of personalization maintains.
% The proposed method sees personalization fundamentally as the separation of items according to users' information preferences. There are two perspectives to measure the personalization level in a recommender system. One is from the perspective of no-change, where overlap between the information items recommended to different users is desirable. The second perspective is about (automatically) optimizing for user engagement, capturing user preferences in such a way that the content is tailored towards engagements. In this study, we approach the quantification of personalization from the perspectivepara of assessing 'good' (or 'perfect') personalization from each of these perspectives.
The contributions of our work are: 1) we refine the conceptualization of personalization, 2) we propose a method for quantifying personalization using the response of users to personalized recommendation, and 3) we show how the method can be applied and used to suggest improvements (from different perspectives) to a system that does personalized recommendation. The rest of the paper is organized as follows. In Section \ref{rel}, we review related literature and in Section \ref{mot} we present background and motivation for the work. In Section \ref{pro} we discuss our proposed method, followed by a discussion of datasets and experiments in Section \ref{exp}. In Section \ref{result}, we present the results, and we conclude with a discussion in Section \ref{conc}.
\section{Related Work} \label{rel}
There have so far been a few attempts to measure personalization in online information systems that apply personalized recommendation. One study attempted to quantify personalization in Google Search \cite{hannak2013measuring}. The study recruited 200 Amazon Mechanical Turk users with Gmail accounts to perform search tasks in Google Search. The study also used newly created accounts, which, having no history, are considered unsuitable for personalization, to generate search results that served as a baseline. The first page of the returned results was compared against the baseline results and also against the other users' results. They used Jaccard similarity and edit distance as metrics. The study reports that about $\mathit{11.7\%}$ of the results showed variation due to personalization. %The study also investigated the factors that caused variation in in personalization and it reports that loggingin and geography were were the factors.
% Why recommender systems can have the risk of filter bubble, but they do offer an irresistible appeal in that they offer content targeted to one’s interest.
Another study \cite{nguyen2014exploring} examined the effect of a recommender system on the diversity of recommended and consumed items. The study was done on the MovieLens dataset\footnote{http://grouplens.org/datasets/movielens/}. The authors separated recommendation-takers from recommendation-ignorers and examined the content diversity in those two groups. Two movies are compared using their attributes (tag genome data), using Euclidean distance. The distances between groups are measured as the average of the distances between the groups' movies.
The finding is that recommender systems indeed create a condition where recommended items and items rated by users become narrower (less diverse) over time. However, when comparing recommendation takers and non-takers, the study found that recommendation takers had more diverse consumed items than non-takers. The study was conducted on a recommendation system that uses item-to-item collaborative filtering, and as such its conclusions should be taken with that caveat.
% Personalization is ubiquitous on the internet. Recommender systems, e-commerce sites, search engines and social media platforms use personalization to better provide users with content and products that are tailored to their needs. Big search engines such as google [1] and Bing[2] have also implemented personalization on on search results. Personalization in recommendation is a core part of company revenues. For example in 2012 NetFlix reported that 75% of what users watched came from recommendation[3], and Amazon reported that 35% of its sales came from recommendations[4]. Beyong sales, personalized recommendations have been found to have greater influence on users than peers and experts[5]. They lower users’ decision effort and increase users’ decision quality[6]
% The ubiquity,importance and influence of personalization, it makes sense to want to quantify it. There are so far some attempts to measure it in an information system. One study has attempted to quantify personalization level in Google Search engine. The study recruited 200 Amazon Mechanical Turk users with gmail accounts to participate in searching in google. The study also used newly created accounts to use them as a baseline. The first page of the returned results were compared against the baseline results and against each other too. They used jaccard similarity and edit distance as metrics. The study reports that about 11.7% of the results showed variation due to personalization. The study also investigated the factors that caused variation in in personalization and it reports that loggingin and geography were were the factors.
%
% Why are recommender systems can have the risk of filter bubble, but they do offer an irresistible appeal in that they offer content targeted to one’s interest.
% Another study examined the effect of recommender system on the diversity of recommended and consumed items. the study was done on movieLens dataset. The separated recommendation-takers and ignorers and examines the content diversity in those two groups. Two movies are compared using their attributes (tag genome data), by a Euclidean distance. The distances between groups are measured as the average distance between the groups’ movies. The finding is that indeed recommender systems create a condition where recommended items become narrower over time.
The potential-for-personalization study \cite{teevan2010potential} investigated how much improvement would be obtained if search engines personalized their results for individuals, as opposed to providing results that are the same for everyone or for a group. It used three kinds of relevance judgments: explicit relevance judgments, behavioral relevance judgments (clicks on items) and content-based relevance judgments. The work showed that there was a large potential for personalization under all three kinds of relevance judgments.
The work used DCG to measure the quality of a search result ranking. The best ranking for a user is one that ranks the items in the order the user rated them in the case of explicit judgments, clicked them in the case of behavioral judgments, or in order of their similarity to previously consumed content in the case of content-based judgments. The ideal normalized DCG score for a perfect system is therefore $\mathit{1}$. When we attempt the best ranking for two or more people (a group), the normalized DCG score will be less than $\mathit{1}$, as the members have different ideal rankings. The difference between this group DCG score and the individual score is what they call the potential for personalization. The research reports that there was a potential to improve the ranking by 70\%. In terms of the relevance judgments, it reports that the content-based relevance judgments show the highest potential, followed by explicit judgments and click-based judgments, respectively.
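For reference, in one common formulation (the exact discount used in \cite{teevan2010potential} may differ slightly), the DCG of a ranked list of $n$ items with relevance $rel_{i}$ at rank $i$, and its normalized form, are
\begin{displaymath}
DCG = rel_{1} + \sum_{i=2}^{n} \frac{rel_{i}}{\log_{2} i}, \qquad nDCG = \frac{DCG}{IDCG},
\end{displaymath}
where $IDCG$ is the DCG of the ideal ordering, so that a ranking tailored perfectly to one user has an $nDCG$ of $\mathit{1}$.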
The paper concludes that behavior-based relevance judgments can serve as a good proxy for explicit judgments in operational search engines.
\section{Background and Motivation} \label{mot}
There are two seemingly opposing perspectives regarding personalized recommendation. The first perspective is that personalized recommendation is beneficial, for users and publishers. This perspective has resulted in the advent and proliferation of personalized recommendation systems, both as integral parts of bigger information systems and as standalone systems. Today, in the vast majority of online information provision, recommendation is a fact of life. In some cases, people may even visit particular sites because they feel the sites are adapted to their interests. The concern of a publisher applying personalized recommendation is not whether they are doing personalization per se, but whether personalization increases user engagement.
The second perspective, a backlash against personalized recommendation, is that personalized recommendation is disadvantageous for society. The proponents of this view argue that personalized recommendation balkanizes society along interest lines and that it is creating a filter bubble effect, a phenomenon where a user is isolated from content that an algorithm decides is not relevant to them. The proponents of this view argue that personalized recommendation should be condemned and stopped.
\begin{figure} [t]
\centering
\includegraphics[scale=0.4]{img/recommendation_flow.pdf}
\caption{The recommendation flowchart from available items to clicks.}
\label{fig:flow}
\end{figure}
The two perspectives call for two perspectives on how personalization could be measured. In both perspectives, what is being scrutinized, the target to be measured, is the set of personalized items that are shown to the user. But in the perspective that views the separation of shown content caused by recommendation as less desirable, the shown items are compared against the available items; the available items become the reference point. In the extreme case, the position of the opponents of personalized recommendation can be construed as saying that the information that is presented must be the same as the available relevant information. This extreme position is untenable and unfeasible, because it also implies that presenting items in some ranked order is wrong. In practice, however, they are opposing the difference in the results that are presented to two people who supposedly have the same information need, as for example when they query a search engine with the same query terms. This milder position has been measured by comparing one user's recommended items against another's.
In the perspective that sees recommendation as beneficial, the user's recommended items are compared against the perfect information interest and preferences of the users. % For example, for one opposing the filter bubble, the reference frame is the unchanging of the information from user to user.
This view strives to increase engagement and to overcome the information overload problem; as such, the reference frame against which the personalized items are compared is the (perfect) interest of each user, that is, how close the personalization comes to the users' actual information preferences.
These two different perspectives on measuring personalized recommendations are similar to the two perspectives on recommender system evaluation: system-centric and user-centric. In a system-centric evaluation, there is a ground truth and the goal is to find out how well the recommender system is able to predict the ground truth. System-centric measures are neat and replicable. However, since system-centric evaluations do not always correlate with user satisfaction \cite{mcnee2006being}, there are alternative, user-centric evaluations of recommender systems. Measures such as click-through rate (CTR), dwell time and engagement have been proposed and used as user-centric metrics.
\subsection{The Recommendation Pipeline}
Figure \ref{fig:flow} shows the process of recommendation from beginning to end. The process in the figure applies to both query-based and query-less personalization. The part surrounded by the yellow dotted rectangle shows what a measure of personalization targets when personalization is seen from the perspective of those opposing personalized recommendation. The extreme version of that position is that the items shown to users must be exactly the same as the available relevant items. In practice, however, only the view box is considered and the measures compare the items shown to users against each other. This means the measure's aim is to quantify the difference between recommendations.
%The items that are not shown play no role in the computation of personalization from the perspective of no-change.
For the proponents of personalized recommendation, the part of the recommendation pipeline relevant to measuring personalization is the rectangle surrounded by the green dotted line. The objective here is to measure personalization by comparing how close the recommended items are to the actual user interests. From this perspective, a personalized recommender system's usefulness is measured by how well it serves the user's information interests and preferences. In this study, we measure personalization from this perspective, and as such it is a user-centric point of view. %This naturally calls for the recommended items to be compared against the users' actual interest. % As such, it calls for a measure that compares the personalization against the user's actual interest.
%Measuring the personalization recommendation against this perspective targets the the part of the flowchart that is bound by the red dotted-line.
% Since it is not possible to find the perfect interests of the information consuming unit, it is usually approximated by the information that the information consuming unit has previously consumed. This perspective and measure is more difficult and tricky than the no-change perspective and the measure that it calls for. The reasons are 1) it means we have many different interests to compare with 2) the clicks are influenced by what the system serves. The problem is that it is hard to know the perfect items that satisfy the user interest because the user interest is not a fixed one, but a shifting one in time and with many other factors. This situation is complicated even further by the temporarity of the items in the sense that the items to be recommended are usually ephemeral making it hard to learn and improve. That calls for personalization at the level of meta features of the items rather than at the item level.
% There have been attempts to quantify the level of personalization in recommendation systems from the perspective of no-change. One study examined the level of personalization in search engines by recruiting mechanical Turk users with mail accounts to search in google for a set of keywords. The study compares the results using Jacquard similarity, and edit distance. The study found that about 17\% of the queries were affected by personalization. Another study tried to quantify the change in diversity of recommended items over time with a view to measure the filter bubble. In both cases the perspective is system-centric.
% The opposite of unchanging information is the perfect need of the user. However, since it is impossible to find the perfect interest of the user, the closest we can come to this perfect information is what the user consumes, the clicks. We can measure the level of personalization against these reference frame as opposed to measuring it against the unchanging information reference frame. When we measure the level of personalization against this, we are measuring how good the system is in delivering the items that the user is interested in. These reference frame has one very serious limitation and that is that the clicks are influenced by what is served and this is a serious limitation.
% Can these two measures be combined to obtain a holistic level of a recommender system's personalization? No, they do not compliment and their combination does not make much sense. They have to measured for two different ends.
% Personalization affects the views, the items that are served to different information consuming units. There are two perspectives in measuring personalization. One perspective the sameness. It assumes that information provided should not vary. This views is deeply held by the progenitor and anti-proponents of filter bubble. The second perspective is how good a personalization satisfies the user interest. Here the reference frame is the information consuming information need. The better that the personalization matches the information consuming units information need, the better the personalization is.
When measuring personalization from the user-centric point of view, we propose that personalization be viewed as having two components. One component is the ranking of the selected items to be recommended. In other words, this component is about ranking the items in order of their relevance to the user.
%This component should also deal with the number times of items that items should be shown to a user. So for example, it should be able to for example recommend more of item on an certain event and less items on another event.
Before the ranking component can rank a user's items, the items of interest must be selected from the available items. We consider this the most important component and call it the separation component: the separation of content along user interest lines. A holistic measure of personalization should take into account both the ranking and the separation components.
% While this seems, to us, the fundamental definition of personalization, today it also includes other aspects. For example, not only should a personalization system be able to separate content to different user interests, but also it should be able to know the quantity of and rank of the items in a user interest. So for example, it should be able to for example more of item 1 than item 2 for a user. 2) it should give, one it serves, in the correct order, that is that item 1 should come before item 2.
% How do we measure the personalization? We believe that it has to account for this two aspects.
% The separation: Imagine we have the perfect interests of the different information consuming units. That means we can measure the differences. If we treat them as sets, we can for example use Jacquard index to measure their similarity (or dissimilarity). We can average the similarities(dissimilarities) to obtain an aggregate score of similarity (dissimilarity).
We argue that the response to personalization can be measured only in a comparative manner. The reason is that there cannot be a true ground truth in a dynamic system where the items and user interests change constantly and the situation of users is always subject to outside factors. In such a system, the best measure is a comparative one conducted over a dataset collected over a long time. We also argue that the response to personalization can only be measured in a reactive manner, that is, through how users react to it.
%\subsection{The Ranking Component}
\subsection{The Separation Component}
As the ranking component of a recommender system amounts to classical information retrieval ranking, here we focus on the separation component.
Personalization is first and foremost the separation of content according to users' interests. Fundamentally, personalization is based on the assumption that there is a difference between the information interests of users.
A good recommender system must then be able to deliver recommendations that maintain this difference between the different users' information preferences. This means the aggregate similarity (dissimilarity) of the different sets of recommendations to users must be as close as possible to the aggregate similarity (dissimilarity) between their actual preferences. This measure of effectiveness of a recommender system can be called the degree of personalization, and we can define it, mathematically, as the difference between the recommendation similarity (dissimilarity) and the actual similarity (dissimilarity) (see Section \ref{pro}).
%
% Personalization = Actual Aggregate - recommendation aggregate.
% Assume we have n information consuming units. The aggregate similarity between them is the similarity between the all pairs (combinations).
% Summation (combination)similarity between pairs.
%
% We apply the same formula for the recommendations. Rearranging, we get the difference between the actual similarity and the recommendation similarity. So the overall formula basically boils down to the sum of the differences between actual distance and recommendation distance.
\subsection{Relations Between the Two Perspectives}
How are measures based on the two perspectives related to each other? Can measures designed from the perspective that personalization causes over-filtering (bad) tell us anything about measures designed from the perspective that personalization is beneficial (good)? We maintain that there is no direct relationship. A recommender system might serve different items without meeting the users' interests, thus clearly doing personalization in the sense of the perspective that views recommendation as bad, while failing the users.
% A bad recommender system might do personalization without meeting the user's information preferences. It might even make it worse than no-recommendation.
Conversely, the user perspective and the measures it calls for cannot tell us whether the system is doing personalization from the no-change perspective. For example, a recommender system might achieve a good separation of content according to users' interests, but that does not say much about how similar or different the recommended items are from the available relevant items, or whether they are similar or different to each other. So, we maintain that the system-centric and user-centric measures serve different ends.
% these measures are measuring completely different perspectives and different interested, one to minimize information overload, another to keep diversity of available information.
\subsection{Clicks as an (Imperfect) Proxy of Actual User Interest}
Now the question is: how can we obtain actual user interests and preferences? True user preferences are hardly possible to come by in a real recommendation system: 1) user interests change, 2) items change, and 3) it is not possible to rerun the same recommendation scenario with the same items and users. So the best approximation of actual user interest is the user's history of items engaged with. For the sake of simplicity, we here use users' click history as an example (despite its limitations as a measure compared to, for example, dwell time \cite{yi2014dwell}).
There is, however, one big problem with using users' click history as actual user preferences: clicks depend on what has been recommended. This means that if we were to provide different recommendations, the click history would turn out differently too. However, we think the approximation of actual user interest with clicks is reasonable for the following reasons.
We are interested in relative similarity, not absolute similarity. So, yes, the similarity between the clicks of two users will be affected by what the system recommends. However, we assume the relative similarity (distance) between them remains more or less the same, since recommendation quality affects both of them, unless of course the system is somehow tweaked to be good to some and bad to others. Simply put, it is reasonable to assume that the distance between clicks is affected by the recommendations proportionally to the distance between the recommendations, that is, that the difference between the click distance and the recommendation distance remains reasonably stable.
\section{The Proposed Method} \label{pro}
%When a system recommends items to users, we can construct two vectors: the view vector and the click vector.
% To overcome with limitations of CTR as a measure of the level of personalization, we propose a new metric that addresses the limitation. we propose to use a distance metric. One distance metric is the distance between the View and click vectors for a certain geographic region. We define the distance between the View and Click vectors as revolt. We think of the views as what the system thinks is relevant to the geographic unit in question, and the click is what the geographic region actually clicks. The distance between these two is revolt, the amount buy which the geographical unit tries to reject the view.
%
%
% The dimensions of the two vectors can be either items served, meta data about the items, or entities found in the items. The vectors can also be built at different levels of granularity such as the user level, a demographic group, or a geographic unit. when personalization is done at the user-level, it is fair to assume that, at this granularity, it is not possible to capture higher level similarity of interest on the basis of demographic and/or geographic reasons. It is, however, intuitive to think that people in one city may have much more common interest between them than people of one city with people of another city.
%
%
%
%
% The revolt score is a measure of dissatisfaction, it could be due to over -or under-customization and anything in between. This makes it a useless measure as far as our concern is to measure the level of personalization and sub subsequently suggest ways to improve the system. Specifically, should we personalize more or less? This means we have to invent another measure that can differentiate overspecialization from internationalization.
%\subsection{Method}
\begin{figure} [t]
\centering
\includegraphics[scale=0.4]{img/view_click.pdf}
\caption{The difference in similarity between views and clicks.}
\label{fig:viewclick}
\end{figure}
We propose a method to measure the degree of personalization in a recommender system as the ability to maintain the same distance between the personalized recommendations as there is between the users' engagement (e.g., click or dwell) histories - here represented by click histories. The method is by nature comparative and reactive. It is comparative because personalization is fundamentally about maintaining a difference between users, and as such the measure needs to compare the personalized recommendations of users that are assumed to have differences in preferences. Thus the measure is comparative 1) in that it compares items both among personalized recommendations and among click histories, and 2) in that it compares the aggregate similarity of personalized recommendations against the aggregate similarity of click histories. %In other words the nature of personalization calls for a comparative measure.
The measure is reactive because the users' interests, represented by clicks, are a reaction to the recommendations. In other words, there is an inherent dependence of what the user consumes on what is shown to the user. The relative aspect of measuring personalization means that there is no fixed reference frame against which we can compare. Therefore, we can measure clicks against recommendations, but that is only reactive, that is, in response to the recommendations; the moment the recommendations change, the clicks would be different too. So the assumptions we are making are 1) clicks are dependent on recommendations in the sense that a change in recommended items would alter clicked items, and 2) despite this dependence, the similarities/differences between the clicks remain more or less proportional to the similarities/differences between the personalized recommendations.
In a recommender system, there are the items that the system recommends to the user, and there are the items that the user chooses to consume. If an item is not shown, it can not be clicked. This does not, however, mean that the user has no choice whatsoever. The user can choose in two ways: 1) in absolute sense, that is clicking on some and not on others and 2) in quantitative sense, that is consuming content about, for example, some entity more often than about another entity. Thus the user has some freedom to choose, but the freedom is constrained by what is recommended.
The more the click vector differs from the view vector, the more the system is failing to capture the information consumption needs of the user. However, the relationship between a view vector and a click vector for a single user does not show whether a system is doing over- or under-personalization. It could not, for example, distinguish personalized recommendation from non-personalized recommendation. For this reason, we conceive of personalization as the ability of a personalized recommender system to separate information items according to user preferences.
Figure \ref{fig:viewclick} shows the relationship between views and clicks for two users, $\mathit{user1}$ and $\mathit{user2}$. User1 is served with $\mathit{View1}$ and has consumed $\mathit{Click1}$. Similarly, User2 is served with $\mathit{View2}$ and has consumed $\mathit{Click2}$. The arrows from views to clicks show the direction of influence. If there is any difference between the views, that difference is the result of personalization. Given these views, the users will click on some items and not on others, and thus will have click vectors that differ from the respective view vectors. The difference between the click vectors is the actual difference in the consumption of the two users, given the personalization. The gap between the view difference and the click difference is a measure of the tendency of the users in response to personalization. We call this gap the PullPush score. %This difference is $mathi{d_{1}}$ and $\mathit{d_{2}}$.
There is a fundamental dependence of what the user consumes on what is served.
To measure the level of personalization, we first compute the similarity between the views themselves and between the clicks themselves, and then we compare the aggregate recommendation similarity with the aggregate click similarity. From now on, we refer to the recommended items as views and the consumed items as clicks. %the user's recommended itesm (views) first and click vectors of two geographical units.
We define $\mathit{ViewDistance}$ as the distance between the view vectors (Equation \ref{eq:view}) and $\mathit{ClickDistance}$ as the distance between the click vectors (Equation \ref{eq:click}).
The way we view the relationship between views and clicks is: given the views, how much does the click vector differ from the view vector? Another way of putting it is: what is the users' tendency away from the current system; do the users tend to `want' more personalization or less?
% This question can be answered only in a comparative and relativist way. The comparative is by comparing one user against another. Relative because it can only be measured given what is shown. It is not possible to have an absolute measure of this.
%
% There is a very fundamental dependence of what the user consumes on what is served.
% To measure the level of personalization, we first compute similarity between the views with in themselves, and the clicks between themselves, and then we compare the aggregate recommendation similarity with the aggregate click similarity. From now on, the we refer to the recommendation as views and the consumed items as clicks. %the user's recommended itesm (views) first and click vectors of two geographical units.
%
% To give you some intuition, imagine you make a guess of the positions in a two dimensional space of two points A and B as (a,b) and (c,d) respectively. From the positions, we can compute how far apart you think they are using, for example, Euclidean distance and call this GuessDistance. Then, the true positions are revealed to be (e,f) and (g,h) respectively, and let's call the true distance between them TrueDistance. The ViewDistance is your GuessDistance and your ClickDistance is the TrueDistance. The difference in this scores is what I called PullPush. If it is zero, the GuessDistance and the TrueDistance are the same and if they are negative, the true distance was larger than the guess distance.
%A way to think about the PullPush score is as follows. The reference frame is the distance between the two click vectors, the distance between the actual vectors. The distance between the view vectors is compared agans this reference frame.
\begin{equation}\label{eq:view}
ViewDistance = d(View_{city1}, View_{city2})
\end{equation}
\begin{equation}\label{eq:click}
ClickDistance = d(Click_{city1}, Click_{city2})
\end{equation}
The $\mathit{ViewDistance}$ is the distance that the system maintains between the two users. If two users are served exactly the same content, the system thinks they are exactly the same in terms of their need for content and hence serves them the same content. The more different content the system serves to the two users, the more different the system thinks they are in terms of their information consumption needs. The difference or similarity in the content served is achieved by personalization which is usually implemented at the level of the user.
However, from the served content, a user has the freedom to choose what to read. For example, a certain content platform may serve content about Obama $\mathit{1000}$ times and content about Donald about $\mathit{100}$ times to a geographical unit. If the content about Obama is clicked only $\mathit{100}$ times while the content about Donald is also clicked $\mathit{100}$ times (a click-through rate of roughly $\mathit{1.0}$ versus $\mathit{0.1}$), the geographical unit is showing its preference for more content on Donald, despite the system thinking that more content on Obama is of interest to it. There is also the option that the user does not click on an item at all.
The $\mathit{ClickDistance}$ is the measure of how different the users are in terms of the content they choose to consume given the constraints of what they are served by the system. The smaller the distance, the more similar the users are in terms of what they choose to consume.
\begin{equation}\label{pullpush}
PullPush = ViewDistance - ClickDistance
\end{equation}
Using both $\mathit{ViewDistance}$ and $\mathit{ClickDistance}$, we define a metric which we call PullPush (Equation \ref{pullpush}) as the difference between the view distance and the click distance. To obtain the aggregate PullPush score for a recommender system, we compute the average PullPush score over all pairs of users, as in Equation \ref{pullpushaggr}.
\begin{equation}\label{pullpushaggr}
PullPush_{aggr} = \frac{1}{|pairs|} \sum_{i \in pairs} PullPush_{i}
\end{equation}
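As a purely illustrative example with hypothetical numbers: if for a pair of users $\mathit{ViewDistance} = 0.6$ and $\mathit{ClickDistance} = 0.4$, then $\mathit{PullPush} = 0.6 - 0.4 = 0.2$; the views are kept further apart than the clicks turn out to be. If instead $\mathit{ClickDistance} = 0.8$, then $\mathit{PullPush} = -0.2$, and the clicks are further apart than the views. How these signs should be interpreted is discussed below.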
\subsection{Selection of a Distance Metric}
An advantage of the proposed method is that it is independent of any specific distance metric. One can use different distance metrics depending on one's objectives and tastes.
In our case, we used the Jensen-Shannon divergence (JSD), a distance metric based on the KL divergence. JSD is defined in Equation \ref{eq:jsd} and KL in Equation \ref{eq:kl}. Before applying JSD to the views and clicks, we first turn the view and click vectors into conditional probabilities, $\mathit{P(View|State)}$ and $\mathit{P(Click|State)}$.
\begin{equation}\label{eq:jsd}
JSD(X,Y) = \sqrt{\frac{1}{2} KL(X, \frac{(X+Y)}{2}) + \frac{1}{2} KL(Y, \frac{(X+Y)}{2})}
\end{equation}
\begin{equation}\label{eq:kl}
KL(X,Y)=\sum\limits_{i} x_{i}\ln{\frac{x_{i}}{y_{i}}}
\end{equation}
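As an illustration only (not the exact pipeline used in our experiments), the computation described above can be sketched as follows, assuming that each user (or city) is represented by raw view and click count vectors over entities; the function names are for illustration.
\begin{verbatim}
import numpy as np
from itertools import combinations

def kl(p, q):
    # KL(p || q) for probability vectors; terms with p_i = 0 contribute
    # nothing. Safe here because q is a mixture that contains p.
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

def jsd(p, q):
    # Jensen-Shannon distance between two probability vectors.
    m = (p + q) / 2.0
    return np.sqrt(0.5 * kl(p, m) + 0.5 * kl(q, m))

def pullpush(view1, click1, view2, click2):
    # PullPush = ViewDistance - ClickDistance, after normalizing each
    # count vector into a probability distribution P(entity | user).
    norm = lambda v: np.asarray(v, dtype=float) / np.sum(v)
    return (jsd(norm(view1), norm(view2))
            - jsd(norm(click1), norm(click2)))

def pullpush_aggregate(views, clicks):
    # Average PullPush over all pairs of users (or cities).
    scores = [pullpush(views[i], clicks[i], views[j], clicks[j])
              for i, j in combinations(range(len(views)), 2)]
    return sum(scores) / len(scores)
\end{verbatim}
Any other distance over the normalized vectors could be substituted for JSD without changing the rest of the computation.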
\subsection{Selection of Users}
% Items are ephemeral. Entities are relatively perennial. Traditional philosophy claims that an object that has no permanence is not a reliable object for study. Also given the fact that over customization or under-customization can only be studied in a comparative sense over a longer time, it makes sense that we focus on objects that have longer life expectancy. Items do not provide us that, but entities can. It is for these reasons that we use entities here as our features rather than the items.
We have chosen to measure personalization at the geographical level. This has two advantages. The first is that we do not suffer from data sparsity. The second is that it lets us show that personalization can be measured at any level of granularity, and that there is potential for personalization at levels other than the oft-used user level. As personalization is usually done at the user level, measuring personalization at the geographical level can reveal the commonality (and therefore the potential for personalization) that exists outside the user level.
\subsection{Interpreting the Method}
%The biggest challange in a recommendation is that we can know if a recommendation is interesting or not only after the fact - after it has been recommened. Only then can we know on the basis of the user's reaction to the recommendation.
The PullPush score is a measure of the tendency of two geographical units to drift away from or get closer to each other relative to how the current personalization treats them. By definition, this measure is relative in the sense that we can only measure it against the current personalization; it is not possible to measure it in an absolute sense. If the PullPush score is $\mathit{0}$, then the right amount of distance is maintained by the personalization used in the system. It means that the personalized recommendations are capturing the user preferences, as a result of which the difference between the $\mathit{ViewDistance}$ and the $\mathit{ClickDistance}$ is $\mathit{0}$. This is the score that good personalization should strive for.
If the PullPush score is negative, then the system is overloading the users with content that they are not interested in, and this is reflected in the fact that the distance between the click vectors is greater than the distance between the view vectors. It shows that the users want to be treated in a way that better captures their interests than the current personalization does. In other words, the system treats the users more similarly than the users want. The users want to drift away from how the system's personalization treats them. The larger the negative score, the greater the need for (more) personalization.
If the PullPush score is positive, then the users want to get closer to each other than the system's personalization allows; the system is performing over-personalization. A positive score is an indication that the filter bubble might be a serious issue. A larger positive score between two users indicates a larger potential for depersonalization, that is, for less separation of the content according to their preferences.
Recommendation is, at bottom, a dynamic system. As such, the choice of click history as the representation of user interest is problematic. To make matters worse, the clicks are affected by the recommendations. This raises the question of to what extent the users can deviate from the recommended items. In other words, is it possible for the click vectors to be similar enough to each other to make the PullPush measure positive? In absolute terms, it cannot be: when one uses, for example, Euclidean distance on raw counts, the components of a click vector always have values smaller than the corresponding values in the view vector. To avoid that problem we normalize the vectors by the sum of all views and the sum of all clicks, respectively.
The next question is whether normalization eliminates the tendency for the PullPush measure to be negative. We argue that it does. Table \ref{synth} shows synthetic view and click vectors; using those vectors, we obtained a positive PullPush score, an indication that it is possible for the PullPush score to be positive in practice.
\begin{table}
\centering
\caption{A table with synthetic data showing that the PullPush score can be positive.}
\begin{tabular}{|l|l|l|l|l|}
\hline
\multirow{2}{*}{Entities} &
\multicolumn{2}{c|}{City1} &
\multicolumn{2}{c|}{City2}\\
& View & Click & View & Click \\
\hline
Entity1 & 20 & 5 & 40 & 112 \\
\hline
Entity2 & 15 & 7 & 30 & 20 \\
\hline
Entity3 & 10 & 8 & 20 & 18 \\
\hline
Entity4 & 5 & 1 & 10 & 10 \\
\hline
\end{tabular}
\label{synth}
\end{table}
%Given the dependence of the clicks on the recommendations, the only thing we can measure is how much the clicks deviate from what the recommendations are. This means that the we can only measure the clicks relative to the recommendations.
% If one uses them as, then one is purely studying the separation story. However, if one treats them as probabilities, one is incorporating the number of times an item must be recommended to a user.
In the potential-for-personalization study \cite{teevan2010potential}, the authors investigated the potential for personalization as the difference between optimizing for an individual and optimizing for a group. In doing so, their assumption was that either the user interests are known from explicit relevance judgments, or the clicks are taken as the true representation of interest. In a dynamic, operational personalization system, explicit relevance judgments are hard to come by, and they do not address the question of how the system is doing in its current state of personalization.
The use of clicks as the true representation of user interest is reasonable, but that work does not address the coupling of recommendations and clicks, that is, the bias inherent in the clicks as a result of the recommendations. Its assumption was that no personalization was involved. In our case, the presence of recommendations changes the dynamics completely.
The PullPush score can be used as a measure of the potential for personalization. When the score is positive, we can consider that that is the size of the potential for personalization, and when the score is negative it can be considred the potential for depersonalization. In the potential for personalization, they used DCG as a metric. The reason for using distance metric as a measure of potential for personalization as opposed to using CDG as in the Potential for customization are: 1) First and formost, we view personalization as the ability to maintain correct difference between personalized recommendations to users. 2) we do not have ranking information in our logs 3) We do not have explicit ground truth, and 4) Ground truth in a dynamic system is problematic to say the least.
\subsection{Comparisons to Other Measures}
One might wonder how the proposed measure differs from other known measures such as CTR. CTR overlaps with some aspects of ranking measures and of the proposed measure. It overlaps with ranking measures in the sense that a higher CTR is indicative of a better ranking. It can also relate to the separation component in the sense that a higher CTR might indicate a good recommendation, which indirectly depends on a good separation between different information preferences. CTR is, however, by design not suited to be a measure of the separation component: it cannot take the recommendations and clicks of two users as input, only the clicks and views of a single user. One can compare the CTRs of two users, but in that case what one measures is the pattern of the ratios, not the personalization.
There are, moreover, aspects that CTR does not capture even when applied to a single user. For example, a system might achieve a good CTR simply by recommending very few items. CTR also cannot distinguish items that are recommended and consumed a large number of times from items that are recommended and consumed only a few times; as a plain ratio, it is not suited to capturing that difference.
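To make the limitation concrete, the following toy example (Python; the numbers are invented for illustration only) shows two users with identical CTRs whose clicks nevertheless fall on entirely different entities, which is exactly the difference a separation-oriented measure must detect and CTR cannot.
\begin{verbatim}
import numpy as np

# Invented counts over three entities; user B mirrors user A.
views_a, clicks_a = np.array([1000, 100, 10]), np.array([100, 10, 1])
views_b, clicks_b = np.array([10, 100, 1000]), np.array([1, 10, 100])

print(clicks_a.sum() / views_a.sum())   # 0.1
print(clicks_b.sum() / views_b.sum())   # 0.1  -> identical CTRs

# ... yet the normalized click distributions are far apart:
pa, pb = clicks_a / clicks_a.sum(), clicks_b / clicks_b.sum()
print(np.linalg.norm(pa - pb))          # ~1.26: different interests
\end{verbatim}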
% A good way to demonstrate that is to look at the dataset in Table \ref{input} which shows the view and click vectors for two geographical units (cities, in our specific case) and the entities they consume. A view is a recommendation, the item that a user from a certain geographical unit is exposed to when they visit a website. A click is the view that was actually clicked by a user. In the table, the views and clicks are aggregated by entities over all users from the city in question. Note that the CTR in the table is the same for both cities, illustrating that the CTR metric can hide the difference in the entities that these cities are served and actually read (click).
% The success of the recommender system can be measured by how similar the two vectors are to each other. A perfect recommender system is one where the view and the click vectors are identical. We can quantify the relationship between the view and the click vectors in many different ways. One measure is to use Click-Through RATE (CTR). However, a CTR measure can misleadingly indicate that the click vector is the same as the view vector even if they are different.
%
%
% \begin{table}
% \begin{tabular}{|l|l|l|l|l|l|}
% \hline
% \multirow{2}{*}{Entities} &
% \multicolumn{2}{c|}{City1} &
% \multicolumn{2}{c}{City2} &
% \multirow{2}{*}{CTR} \\
% & View & Click & View & Click \\
% \hline
% Entity1 & 10000 & 1000 &10&1 &0.1\\
% \hline
% Entity2 & 5000 & 500 &100&10&0.1 \\
% \hline
% Entity3 & 1000 & 100 &500&50&0.1 \\
% \hline
% Entity4 & 500 & 50 &1000&100&0.1 \\
% \hline
% Entity5 & 100 & 10 &5000&500&0.1 \\
% \hline
% Entity6 & 10 & 1 &100&10 &0.1\\
% \hline
%
%
% \end{tabular}
% \label{input}
% \end{table}
%
%
% This can happen, for example, when the ratio is the same, but the actual values are different such as 10/100 and 100/1000. In personalization, this difference matters because this shows that if I am interested in Item and and another is is interested in Item, even if the ration is the same, it does not mean we are interested in the same items. CTR is agnostic on the difference at the level of the dimensions of the vector. For example, it would not differentiate between a geographic region which have been served 100 items about Obama and consumed
% 10, 200 times on Donald and consumed 10. If we exchanged the quantities on Obama and Donald, the CTR score would remain the same. Another problem with CTR is that is that it does not take into account what is not served (withheld).
%
%
\section{Datasets and Experiments} \label{exp}
We experimented on two datasets of recommendations and clicks collected from Plista.
% The second dataset is Yahoo! stream dataset. In this dataset, the dimension of the vectors are entities. Entities are relatively perennial, as compared to items.
%Or maybe we should consider some cities and Tagesspiegel? I think that seems to make more sense. With the second case, we considered a total of 14 cities, seven of them around the Bay Area and 7 of them around New York city. The results both are multidimensional scaled as shown in Figure \ref{}.
Plista is a recommendation service provider\footnote{http://orp.plista.com/documentation} that offers the Open Recommendation Platform (ORP), a framework that brings together online content publishers in need of recommendation services and news recommendation providers that supply recommendations by plugging their algorithms into the platform. When a user starts reading a news item, a recommendation request is sent to one of the recommendation providers, while the other participants receive the impression information, that is, information about the user visiting the item. When a recommended item is clicked by the user, all participants receive the click information.
Every participant has access to all user-news item interaction information; it is, however, not possible to find out who recommended the clicked item.
For this reason, we collected all the clicks regardless of who produced the recommendation. Another way to look at it is to think of all the recommenders as one ensemble recommender. We also assumed that the participants' recommenders did not employ personalization, that is, that they recommended the same items to all users; most of them were variations of a recency algorithm. For our analysis we chose two German news and opinion portals: Tagesspiegel\footnote{www.tagesspiegel.de} and Ksta\footnote{http://www.ksta.de}. For users, we chose the 16 states of Germany.
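A minimal sketch of how the collected stream can be aggregated into per-state view and click vectors is given below (Python); the event format and field names are placeholders of our own, not the actual ORP message schema.
\begin{verbatim}
from collections import defaultdict

def aggregate(events):
    # events: iterable of (kind, state, entity) tuples, where kind
    # is 'view' for an impression/recommendation and 'click' for a
    # click. Returns two nested dicts: {state: {entity: count}}.
    views = defaultdict(lambda: defaultdict(int))
    clicks = defaultdict(lambda: defaultdict(int))
    for kind, state, entity in events:
        if kind == 'view':
            views[state][entity] += 1
        elif kind == 'click':
            clicks[state][entity] += 1
    return views, clicks

# Example with invented events from two states:
views, clicks = aggregate([('view', 'Berlin', 'e1'),
                           ('click', 'Berlin', 'e1'),
                           ('view', 'Bavaria', 'e2')])
\end{verbatim}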
%\subsection{Yahoo! Dataset}
\section{Results and Analysis} \label{result}
There are two levels of results we can obtain with the proposed method. One level is the aggregate personalization score for the overall personalization of a recommender system. The scores for the two publishers, Tagesspiegel and Ksta, are $\mathit{-0.224}$ and $\mathit{-0.213}$ respectively. The larger absolute score for Tagesspiegel can be explained by the fact that Ksta is a more geographically local newspaper, whereas Tagesspiegel is seen as a national newspaper. A more local readership means less diversity in the information needs of the users; Tagesspiegel has a larger and more diversified readership, which means a greater need to cater to different needs. The smaller absolute PullPush score for Ksta therefore means there is less need for personalization in the case of Ksta than in the case of Tagesspiegel.
The second-level scores are the individual results for pairs of users, which show how the system's personalization performs for each individual pair. The results for a selection of $\mathit{11}$ German states are presented in Table \ref{tage-ksta}. The upper triangle shows the PullPush scores for Tagesspiegel and the lower triangle those for Ksta. The first thing we observe is that all the scores are negative, indicating that there was an overall potential for personalization for both publishers. Another observation is that, when we compare the results for corresponding pairs of states, in most cases the scores for Tagesspiegel have greater absolute values than those for Ksta. This is again an indication of the national character of Tagesspiegel, and thus of a more diverse audience and a greater need for personalization to meet that diversity.
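The two levels are related in a straightforward way. The sketch below (Python) assumes, purely for illustration, that the aggregate publisher score is the mean of the pairwise scores over all distinct pairs of states; the exact aggregation used may differ.
\begin{verbatim}
import numpy as np

def aggregate_score(pairwise_scores):
    # pairwise_scores: PullPush score for every distinct pair of
    # states. Treating the aggregate as the plain mean is an
    # assumption made here for illustration only.
    return float(np.mean(pairwise_scores))

# Invented pairwise scores for three states:
print(aggregate_score([-0.22, -0.20, -0.19]))
\end{verbatim}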
Looking at specific scores, we find that Westphalia has the largest absolute PullPush score against all other states in the case of Ksta, while for Tagesspiegel, Berlin followed by Brandenburg have the largest absolute scores. These scores can also be seen in the multidimensional visualizations in Figure \ref{fig:tage} and Figure \ref{fig:ksta}. The Ksta result can be explained by the fact that Ksta is based in Cologne and most of its readers come from the state in which Cologne is located, Westphalia: there is more need for personalization for audiences from Westphalia than for audiences from other states. Similarly,
%Just looking at the multidimensionally scaled results, we can see differences between the two publishers. In Tagesspiegel (Figure \ref{fig:tage}), we observe that Berlin followed by Brandenburg are further apart from the rest of the states.
personalized recommendations are much more needed in Berlin, and then in Brandenburg, than in other states for Tagesspiegel. This has to do with the fact that Tagesspiegel, which is based in Berlin, is seen primarily as the local portal for Berlin and Brandenburg. Since most people from these states consume news from Tagesspiegel, distinguishing between users by means of personalization makes much more sense here than in states with fewer readers.
% The next observation we see is that the distances among the other states is a bit far apart
%
% In the case of Ksta (Figure \ref{fig:ksta}), we observe that Westphalia stands out as the furthest states from the others. The other states are much more closer to each other than how they are in the case of Tagesspiegel. This is an indication that
% there is more need for personalization in the Westphalia than in other states. The same is true for Berlin and Brandenburg.
%A third level score is to choose a pair that for inspection and look at the items that are doing worse/better. In \ref{} are the results for aggregate performance levels for the recommendations on the two Plista datasets (Tagesspiegel and Ksta). %and on the yahoo dataset.
Figures \ref{fig:tage} and \ref{fig:ksta} show the pairwise distances between the states, multidimensionally scaled to two dimensions for viewing.
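The projections in the figures can be reproduced with standard multidimensional scaling. The sketch below (Python, scikit-learn) assumes that the absolute values of the pairwise PullPush scores are used as dissimilarities; that choice is our reading of the figures rather than a documented detail.
\begin{verbatim}
import numpy as np
from sklearn.manifold import MDS

# Absolute PullPush scores for Baden, Bavaria and Berlin
# (Tagesspiegel), taken from the upper triangle of the table.
D = np.array([[0.000, 0.127, 0.221],
              [0.127, 0.000, 0.221],
              [0.221, 0.221, 0.000]])

mds = MDS(n_components=2, dissimilarity='precomputed',
          random_state=0)
coords = mds.fit_transform(D)  # one 2-D point per state, to plot
print(coords)
\end{verbatim}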
\begin{table*}
\caption{Matrix of pairwise PullPush scores between selected German states. The upper triangle of the matrix is for Tagesspiegel and the lower triangle for Ksta.}
\begin{tabular}{|l|l|l|l|l|l|l|l|l|l|l|l|}
\hline
&Baden&Bavaria&Berlin&Bremen&Hamburg&Hessen&MeckPom&Saarland&Saxony&Thuringia&Westphalia\\
\hline
Baden&0&-0.127&-0.221&-0.302&-0.198&-0.144&-0.302&-0.344&-0.192&-0.26&-0.133\\
Bavaria&-0.178&0&-0.221&-0.307&-0.201&-0.146&-0.306&-0.349&-0.196&-0.263&-0.129\\
Berlin&-0.198&-0.187&0&-0.366&-0.264&-0.233&-0.359&-0.404&-0.256&-0.321&-0.204\\
Bremen&-0.291&-0.271&-0.204&0&-0.206&-0.269&-0.161&-0.156&-0.231&-0.174&-0.334\\
Hamburg&-0.227&-0.215&-0.158&-0.149&0&-0.176&-0.212&-0.245&-0.166&-0.185&-0.221\\
Hessen&-0.168&-0.167&-0.164&-0.247&-0.184&0&-0.271&-0.31&-0.177&-0.227&-0.156\\
MeckPom&-0.279&-0.267&-0.192&-0.087&-0.134&-0.235&0&-0.165&-0.228&-0.173&-0.334\\
Saarland&-0.281&-0.267&-0.194&-0.087&-0.137&-0.235&-0.086&0&-0.265&-0.194&-0.377\\
Saxony&-0.233&-0.218&-0.158&-0.138&-0.126&-0.188&-0.125&-0.125&0&-0.195&-0.21\\
Thuringia&-0.268&-0.257&-0.185&-0.098&-0.127&-0.227&-0.083&-0.091&-0.121&0&-0.288\\
Westphalia&-0.382&-0.389&-0.452&-0.569&-0.502&-0.401&-0.561&-0.56&-0.509&-0.554&0\\
\hline
\end{tabular}
\label{tage-ksta}
\end{table*}
\begin{figure} [t]
\centering
\includegraphics[scale=0.5]{img/mds_tage.pdf}
\caption{The multidimensionally scaled distances between the states of Germany for the Tagesspiegel.}
\label{fig:tage}
\end{figure}
\begin{figure} [t]
\centering
\includegraphics[scale=0.5]{img/mds_ksta.pdf}
\caption{The multidimensionally scaled distances between the states of Germany for the Ksta.}
\label{fig:ksta}
\end{figure}
Applying the proposed method to the 16 German states, we found that they all wanted to drift away from each other, suggesting that the recommendation algorithms were overloading the states with information that users did not engage with.
%The next natural question was which entities should be served more, which entities should be served less, or which entities should be left unchanged and for which cities.
% % So for each entity shared between two users, we have 9 combinations of increases, decreases and do-not-changes(do nothing). Note that each increase or decrease is to be based on the view and click scores.
% The goal of recommending increase, decrease or do-not-changes for each entity is in order to achieve an equilibrium, that is $\mathit{PullPush=0}$ between the two geographical units under consideration.
% %This part of the work is not done, but we feel, if solved, can be a good addition to and a completion of the above method we proposed.
\section{Suggesting Improvement to a Recommender System}
%The method is either to recommend a decrease or increase in the recommendation of some entities.
% On what basis can we recommend? We want a metric that can take into account the difference between an item that has been recommended 100 times, but was clicked 100 time and one that has been served once and clicked once. The way to do this is to devise a measure that can quantify that the fact that one has been recommended many times is and clicked many times is an evidence that the article is important, not just potentially important, but it has proved that it is important.
How can the measure be used to suggest improvements to a personalized recommender system? There are several ways. One is to use it as a measure of the overall personalization of a recommender system and, on the basis of the score, either reduce or increase personalization, for example by introducing features in addition to the ones already used.
Another use is to apply it at different levels of personalization, for example demographic or geographic, and use it as a measure of the potential for personalization at that granularity, much as the potential-for-personalization work did with individual users. This gives a quick estimate of the potential and can be used as a target to be met. It can also be used to identify where we need to work more and introduce more personalization: for example, if two cities appear to be diverging more and more, we can zoom in on them and personalize further.
A third level is to look at specific entities and items to identify those performing better or worse. One use is to start from unpersonalized recommendations and progressively decrease the recommendation of items that are not of interest to a given user, which would yield a better recommender system in the long run. The best way to use the method to improve a personalized recommender system, however, is to use it as an objective function and experiment with different features with the aim of driving the PullPush score to $\mathit{0}$. This is a much more useful objective function than CTR or any other measure of personalization proposed so far.
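A sketch of that last use is given below (Python). The personalization knob and the feedback loop are hypothetical stand-ins for whatever parameters and evaluation cycle a concrete system exposes; the point is only the direction of the update.
\begin{verbatim}
def tune(system, pairs, steps=20, rate=0.5, tol=1e-3):
    # Hypothetical loop driving the mean PullPush score toward 0.
    # `system.strength` is an assumed scalar personalization knob;
    # `system.evaluate(pairs)` is assumed to serve recommendations,
    # collect clicks, and return the mean PullPush over the pairs.
    for _ in range(steps):
        score = system.evaluate(pairs)
        if abs(score) < tol:   # close enough to equilibrium
            break
        # Negative score: users drift apart -> personalize more.
        # Positive score: over-personalized -> personalize less.
        system.strength -= rate * score
    return system.strength
\end{verbatim}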
Finally, another perspective on this measure is that it can be used to avoid the `maximum filter bubble' that would organically occur if user cohorts' interests are indeed pushing the served content apart. Arguably, the system should always push back on this drive to be apart, and as such the `maximum personalization score' should never be reached.
%
% \section{On measuring content consumption patterns}
% Can we also measure content consumption patterns across cities? Can we use our method to characterize content consumption patterns? That, to me, seems to need a way of measuring how important an item is for a city using the views and clicks. Can the PullPush score be used as a measure of similarity of content between two different cities?
% Can we measure content consumption differences across geographical units? the answer for this would be normally yes, why not? But what complicates the measurement of content consumption patterns across geographical units is the recommendation bias, the phenomenon that users are exposed only to what the recommender systems shows them. In the absence of a situation where users are exposed to all content available (that is that there is always a recommendation bias in practically deployed recommender systems), it is not possible to measure the absolute interest of a certain information consuming unit.
%
% The question then is whether we can measure information consumption patterns given the recommendation bias, some sort of biased information consumption pattern. In order to do that, I think we need the click vector of the the information consuming unit, but also a way to discount the effect of the recommendation bias. To discount the recommendation bias, we use the view vector. so how can we go about this? There can be two measurements we can do in this regard. One is measuring the information consumption interest of a certain geographical area. The other is the relative and comparative approach where an information consumption unit's information consumption is measured relative to others. How would we go around doing this? we measure the similarity between their click vectors and subtract the similarity between their view vectors. The resulting score is the objective similarity between the two information consumption units. This score is the same as the negative of the PulPush score.
%
% But would that really be a true information measure of similarity or dissimilarity of the information consumption of two geographical units? I doubt it. I should think more about it.
% \section{Discovering uniquely Local Content}
% How can we identify the most interesting items for a geographical unit? One way to do so would be to computer the distribution of an entity across the geographical regions and then use this distribution to select the geographical with the highest score as the regions main readership area. Then we aggregate by geographical units and all the entities that belong to the geographical unit are then its uniquely interesting entities. There are two problems with this method. One is a big-city problem. Big cities by virtue of their size will be able to amass most of the entities as their unique entities. A second problem is that how do we differentiate the importance of of one entity from another entity In a given geographical unit? So maybe we use the TF-IDF concept here? Why not? It seems the way forward.
%
% Another issues I would like to include is how one can compare two geographical areas and where the difference occurs. One way to do so is to to measure similarity between two geographical units by removing entities at different cutoff frequencies. The point of doing this is to identify entities that affect the similarity scores. This can be done int two ways, one is by starting with all entities and removing the most frequent entities at different levels and computing the similarity metric. Another way is by removing the least frequent entities.
%
% \section{Random thoughts}
%
% There are two sides to online information provision. One is to personalize so much that a filter bubble becomes a serious problem. Second is to do nothing and live the consequences of information overload. How can you measure information overload? Can we find the amount of information that a user consumes, for example, in terms of the number of entities and then use that information to estimate the size of information that a geographical unit, a city, consumes? If this can be used as a reasonable estimate of the size of information need, then the discrepancy with what is supplied becomes a measure of information overload. The discrepancy with what is clicked becomes the actual information overload.
%
%
%
% Another idea is can we measure the information that is never showed to a certain area? And what the difference between that and what is shown is?
%
%
% A question?
\section{Discussion and Conclusion} \label{conc}
In this work, we attempted to quantify personalization in a recommender system. We approached it from the perspective of those who seek perfect personalization, not from the perspective of those who seek no personalization, and we showed that these two perspectives call for different ways of measuring personalization. We used the click history as an approximate measure of user preference and compared the recommendations against it. We proposed a comparative and reactive metric, which we call PullPush, that measures the users' tendency to drift away from or come closer to each other in response to personalized recommendations. We also showed why the proposed metric is preferable to metrics such as CTR, which are by design not personalization metrics.
For users, we abstracted away from the actual user and considered geographical units as users. This provided us with two advantages: reduced data sparsity and the chance to examine personalization at a higher level. We also argued that the use of entities, as opposed to items, is a good choice because it allows us to use an extended time window, again avoiding the sparsity problem. Using the proposed metric, we showed that in the Plista data there was a need for more personalization. As the analysis was done at the state level, the scores can be considered the potential for personalization at that level.
\bibliographystyle{abbrv}
\bibliography{ref} % sigproc.bib is the name of the Bibliography in this case
\end{document}
|