{"id":375,"date":"2024-12-05T07:03:06","date_gmt":"2024-12-05T07:03:06","guid":{"rendered":"https:\/\/mailitics.com\/index.php\/2024\/12\/05\/introducing-univariate-exemplar-recommenders-how-to-profile-customer-behavior-in-a-single-vector-c90c9943fe7d\/"},"modified":"2024-12-05T07:03:06","modified_gmt":"2024-12-05T07:03:06","slug":"introducing-univariate-exemplar-recommenders-how-to-profile-customer-behavior-in-a-single-vector-c90c9943fe7d","status":"publish","type":"post","link":"https:\/\/mailitics.com\/index.php\/2024\/12\/05\/introducing-univariate-exemplar-recommenders-how-to-profile-customer-behavior-in-a-single-vector-c90c9943fe7d\/","title":{"rendered":"Introducing Univariate Exemplar Recommenders: how to profile Customer Behavior in a single vector"},"content":{"rendered":"<div>\n<h4>Customer Profiling<\/h4>\n<h4>Surveying and improving the current methodologies for customer profiling<\/h4>\n<blockquote><p>***To understand this article, knowledge of <strong>embeddings, clustering, and recommendation systems <\/strong>is required. The implementation of this algorithm has been released on <a href=\"https:\/\/github.com\/atlantis-nova\/univariate-sequential-recommender\">GitHub<\/a> and is fully open-source. <strong>I am open to criticism<\/strong> and <strong>welcome any feedback.<\/strong>\n<\/p><\/blockquote>\n<p>Most platforms nowadays understand that tailoring recommendations to each customer leads to increased user engagement. 
Because of this,<strong> the field of recommender systems has been constantly evolving<\/strong>, witnessing the birth of new algorithms every\u00a0year.<\/p>\n<figure><img data-recalc-dims=\"1\" decoding=\"async\" alt=\"\" src=\"https:\/\/i0.wp.com\/cdn-images-1.medium.com\/max\/1024\/1%2A8YjeQJ0IkEZiWDfyXjjFIA.png?ssl=1\"><figcaption>hierarchical clustering, <strong>image by\u00a0Author<\/strong><\/figcaption><\/figure>\n<p>Unfortunately, <strong>no existing taxonomy is keeping track<\/strong> of all algorithms in this domain. While many recommendation algorithms, from matrix factorization to neural networks, learn a model that makes recommendations based on a list of choices, in this article I will focus on the ones that <strong>employ a vector-based architecture to keep track of user preferences.<\/strong><\/p>\n<h3>Exemplar Recommenders<\/h3>\n<p>Thanks to the simplicity of embeddings, each sample that can be recommended (e.g., products, content\u2026) is converted into a vector using a pre-trained model (for example, a neural network or matrix factorization): we can then use knn to make recommendations of similar products\/customers. The algorithms following this paradigm are known as <strong>vector-based recommender systems. <\/strong>However, when these models take into consideration the previous user choices,<strong> they add a sequential layer<\/strong> to their base architecture and become technically known as <strong>vector-based<\/strong> <strong>sequential recommenders<\/strong>. 
Because these names are becoming increasingly difficult (to both remember and pronounce), I am calling them <strong>exemplar recommenders<\/strong>: they extract a set of representative vectors from an initial set of choices to represent a <strong>user\u00a0vector<\/strong>.<\/p>\n<figure><img data-recalc-dims=\"1\" decoding=\"async\" alt=\"\" src=\"https:\/\/i0.wp.com\/cdn-images-1.medium.com\/max\/657\/1%2ABRbcUi6LmEnjpDKyrKQDOQ.png?ssl=1\"><figcaption>subdivision of recommender systems, <strong>image by\u00a0Author<\/strong><\/figcaption><\/figure>\n<p>One of the first systems built on top of this architecture is <strong>Pinterest<\/strong>, which is running on top of <a href=\"https:\/\/medium.com\/pinterest-engineering\/pinnersage-multi-modal-user-embedding-framework-for-recommendations-at-pinterest-bfd116b49475\">its Pinnersage Recommendation engine<\/a>: this engine, built to manage over 2 billion pins, runs its own specific architecture and <strong>performs clustering on the choices<\/strong> of each individual user. As we can imagine, this represents a computational challenge when scaled. Building on <strong>covariate encoding<\/strong>, I would like to introduce four complementary architectures (two of which give this article its name) that can <strong>relieve the stress of clustering algorithms<\/strong> when trying to profile each customer. You can refer to the following diagram to differentiate between\u00a0them.<\/p>\n<figure><img data-recalc-dims=\"1\" decoding=\"async\" alt=\"\" src=\"https:\/\/i0.wp.com\/cdn-images-1.medium.com\/max\/1024\/1%2AYNqke1bkyfjBl4AXixuPlQ.png?ssl=1\"><figcaption>summary of exemplar recommenders, <strong>image by\u00a0Author<\/strong><\/figcaption><\/figure>\n<p>Note that all the above approaches are classified as content-based filtering, and <strong>not collaborative filtering<\/strong>. 
With regard to the exemplar architecture, we can identify <strong>two main defining parameters<\/strong>: <strong>in-stack clustering implementation<\/strong> (we either perform clustering on the sample embedding or directly on the user embedding), and<strong> the number of vectors<\/strong> used to store user preferences over\u00a0time.<\/p>\n<h3>In-Stack Clustering implementation<\/h3>\n<p>Using once again Pinnersage as an example, we can see how it performs <strong>a new clustering iteration for each user<\/strong>. However advantageous from an accuracy perspective, this is computationally very\u00a0heavy.<\/p>\n<h4>Post-Clustering<\/h4>\n<p>When clustering is used on top of the user embeddings, we can refer to this approach (in this specific stack) as <strong>post-clustering<\/strong>. However inefficient this may look, the alternative of applying a non-parametric clustering algorithm to billions of samples is borderline impossible, and probably not the best\u00a0option.<\/p>\n<h4>Pre-Clustering<\/h4>\n<p>There might be some use cases where applying clustering on top of the sample data could be advantageous: we can refer to this approach (in this specific stack) as <strong>pre-clustering.<\/strong> For example, a retail store may need to track the history of millions of users, requiring the same computational resources as the Pinnersage architecture.<\/p>\n<p>However, the number of samples of a retail store <strong>should not exceed 10,000<\/strong>, against the staggering <strong>2 billion<\/strong> of the Pinterest platform. With such a small number of samples, performing clustering on the sample embedding <strong>is very efficient<\/strong>, and will remove the need to use it on the user embedding, if utilized properly.<\/p>\n<h3>Introducing the Univariate Architecture<\/h3>\n<p>As mentioned, the biggest challenge when creating these architectures is scalability. 
Each user amounts to <strong>hundreds of past choices held on record <\/strong>that need to be processed for <strong>exemplar extraction<\/strong>.<\/p>\n<h4>Multivariate architecture<\/h4>\n<p>The most common way of building a vector-based recommender is to pin every user choice to an existing pre-computed vector. However, even if we resort to decay functions to minimize the number of vectors to take into account for our calculation, we still need to <strong>fill the cache with all the vectors at the time of our computation<\/strong>. In addition, at the time of retrieval, the vectors cannot be stored on the machine that performs the calculation, but need to be queried from a database: this sets an additional challenge for scalability.<\/p>\n<p>The flaw of this approach is the limited variance in recommendations. The recommended samples will be spatially very close to each other (the sample variance is minimized) and will only belong to the same category (unless more complex logic is in place to define this interaction).<\/p>\n<figure><img data-recalc-dims=\"1\" decoding=\"async\" alt=\"\" src=\"https:\/\/i0.wp.com\/cdn-images-1.medium.com\/max\/887\/1%2ADiN7aDMNIXfEyq6jkHj4qw.png?ssl=1\"><figcaption>multivariate exemplar recommendation, <strong>image by\u00a0Author<\/strong><\/figcaption><\/figure>\n<p>WHEN TO USE: This approach (I am only taking into account the behavior of the model, not its computational needs) is suited for applications where <strong>we can recommend a batch of samples all from the same category. <\/strong>Art or social media applications are one\u00a0example.<\/p>\n<h4>Univariate architecture<\/h4>\n<p>With this novel approach, we can store all user choices in a single vector that keeps updating over time. 
This should prove to be a remarkable improvement in scalability, minimizing the computational stress derived from both <strong>knn <\/strong>and <strong>retrieval<\/strong>.<\/p>\n<p>To make it even more complicated, there are two indexes where we can perform clustering. We can either cluster the <strong>items <\/strong>or the <strong>categories <\/strong>(both labeled using tags). There is no superior approach: we have to choose one depending on our use\u00a0case.<\/p>\n<h4>&gt; category-based<\/h4>\n<p>This article is entirely based on the construction of a category-based model. After tagging our data we can perform <strong>clustering to group our data into a hierarchy of categories <\/strong>(in case our data is already organized into categories, there is no need to apply hierarchical clustering).<\/p>\n<p>The main advantage of this approach is that the exemplar indicating the user preferences will be linked to similar categories (increasing product variance).<\/p>\n<figure><img data-recalc-dims=\"1\" decoding=\"async\" alt=\"\" src=\"https:\/\/i0.wp.com\/cdn-images-1.medium.com\/max\/896\/1%2A01jyXjugizOaesFAk6oymw.png?ssl=1\"><figcaption>univariate category-based exemplar recommendation, <strong>image by\u00a0Author<\/strong><\/figcaption><\/figure>\n<p>WHEN TO USE: Sometimes, we want to focus on recommending an entire category to our customers, rather than individual products. For example, if our user enjoys buying shirts (and by chance the exemplar is located in the latent region of <strong>red shirts<\/strong>), we would benefit more from recommending them the entire clothing category, rather than <strong>only red shirts<\/strong>. This approach is best suited for retail and fashion companies.<\/p>\n<h4>&gt; item-based<\/h4>\n<p>With an item-based approach, we are performing clustering on top of our samples. 
This will allow us to capture more granular information on the data, rather than focusing on separate categories: we want to expand beyond the limitations of the product categorization and recommend items across existing categories.<\/p>\n<figure><img data-recalc-dims=\"1\" decoding=\"async\" alt=\"\" src=\"https:\/\/i0.wp.com\/cdn-images-1.medium.com\/max\/914\/1%2As5DPmgzLEJFYK5OOr9i6Dw.png?ssl=1\"><figcaption>univariate item-based exemplar recommendation, <strong>image by\u00a0Author<\/strong><\/figcaption><\/figure>\n<p>WHEN TO USE: The companies that can make the best use of this approach are human resources platforms and retailers with cross-categorical products (e.g., video games).<\/p>\n<h3>Univariate Exemplar Recommenders<\/h3>\n<p>Finally, we can explain in depth the architecture behind the category-based approach. This algorithm will perform exemplar extraction <strong>by only storing a single vector<\/strong> over time: the only technology capable of managing it is <strong>covariate encoding<\/strong>, hence <strong>we will use tags <\/strong>on top of the data. Because it uses <strong>pre-clustering<\/strong>, it is ideal for use cases with a manageable number of samples, but an unlimited number of\u00a0users.<\/p>\n<p>For this example, I will be using the open-source collection of the <strong>Steam game library<\/strong> (<a href=\"https:\/\/www.kaggle.com\/datasets\/fronkongames\/steam-games-dataset\">downloadable from Kaggle<\/a>\u200a\u2014\u200a<a href=\"https:\/\/www.mit.edu\/~amini\/LICENSE.md\">MIT License<\/a>), which is a perfect use case for this recommender at scale: Steam uses no more than 450 tags; the number can occasionally increase over time, yet<strong> it remains manageable<\/strong>. This set of tags can be clustered very easily, and <strong>can even allow for manual intervention<\/strong> if we question the cluster assignment. 
Lastly, it serves millions of users, proving to be <strong>a realistic use case<\/strong> for our recommender.<\/p>\n<figure><img data-recalc-dims=\"1\" decoding=\"async\" alt=\"\" src=\"https:\/\/i0.wp.com\/cdn-images-1.medium.com\/max\/1024\/1%2AnSkaO6QeduiosqTVD60uiQ.png?ssl=1\"><figcaption>Sample of the Steam game dataset, <strong>image by\u00a0Author<\/strong><\/figcaption><\/figure>\n<p>Its architecture can be articulated into the following phases:<br \/>***Note that when creating the sample code of this architecture I am using LLMs to make the entire process <strong>free from any human supervision<\/strong>. However, <strong>LLMs remain optional<\/strong>, and while they may improve the level of this recommender system, they are not an essential part of\u00a0it.<\/p>\n<ol>\n<li>Sample Labeling<br \/>We need to make sure to assign tags to each of our samples. Because of semantic tag filtering, we do not need to resort to zero-shots, but we can let an LLM manage this process without any supervision.<\/li>\n<li>Pre-Clustering<br \/>We are going to divide the tag embedding into different clusters. For a higher level of accuracy, we are going to use <strong>hierarchical clustering <\/strong>with a depth of\u00a03.<\/li>\n<li>Cluster labeling<br \/>Once we have defined our cluster tree, we need to label each generated supercluster. We can still use an LLM for this purpose. If you decide to avoid using LLMs, note that clusters can remain in a numerical form (this may only alter the user perception of the recommender).<\/li>\n<li>Balance non-uniform tag frequency<br \/>The first challenge in picking from a list of tags is that the tags that appear the most (and are assigned to one cluster) heavily skew the recommender <strong>to propose that very cluster<\/strong>. We need to make sure that each cluster has the same probability of being recommended. 
We can achieve this by adding a custom multiplier that equalizes the probability of each cluster being recommended.<\/li>\n<li>Univariate sequential encoding<br \/>Now that our encoding weights have been defined, we can encode the user history in a vector, but with the possibility of updating it over time (using a decay function to get rid of old user preferences).<\/li>\n<li>Account for scalability: pruning mechanism<br \/>Because the dimensions of our vector are equivalent to the number of tags, we need to find a way to limit the size of the vector over time. PCA is a valid option, but because of the sum operations on the vector, feature pruning has proved to be more efficient.<\/li>\n<li>Exemplar estimation<br \/>This is where the innovation lies. We can encode the user profile <strong>as a single exemplar<\/strong> and <strong>still obtain separate cluster recommendations <\/strong>without the information loss that would arise if we were to average multiple exemplars. This means that each of the previous multivariate methods would be incompatible with this architecture.<\/li>\n<\/ol>\n<p>Let us begin with the full explanation behind the Univariate Exemplar Recommender:<\/p>\n<h4>1. Sample\u00a0Labeling<\/h4>\n<p>In our reference dataset all samples have already been labeled using tags. If by any chance we are working with unlabeled data, we can easily label it using an LLM, <strong>prompting a request for a list of tags<\/strong> for each sample. As explained in my article on semantic tag filtering, we do not need to use zero-shots to guide the choice of labels, and the process can be completely unsupervised.<\/p>\n<figure><img data-recalc-dims=\"1\" decoding=\"async\" alt=\"\" src=\"https:\/\/i0.wp.com\/cdn-images-1.medium.com\/max\/1024\/1%2AQdxQImGZSTuBbBQvwob32Q.png?ssl=1\"><figcaption>Screenshot of our sample data, each sample labeled with tags, <strong>image by\u00a0Author<\/strong><\/figcaption><\/figure>\n<h4>2. 
Pre-Clustering<\/h4>\n<p>As mentioned, the idea behind this recommender is to first organize the data into clusters, and then identify the most common clusters (exemplars) that define the preferences of every single user. Because the data is ideally very small (thousands of tags against billions of samples), clustering is no longer a burden and can be done on the tag embedding, <strong>rather than on the millions of user embeddings<\/strong>.<\/p>\n<p>The more the number of tags increases, the more it makes sense to use a hierarchical structure to manage its complexity. Ideally, I would want not only to keep track of the main interests of each user but also <strong>their sub-interests<\/strong>, and make recommendations accordingly. By using a dendrogram, we can define the different levels of clusters by <strong>using a threshold level<\/strong>.<\/p>\n<p>The first superclusters (level 1) will be the result of using a threshold of 11.4, resulting in the first 81 clusters. We can also see how their distribution is non-uniform (some clusters are bigger than others) but, all things considered, it is not excessively skewed.<\/p>\n<figure><img data-recalc-dims=\"1\" decoding=\"async\" alt=\"\" src=\"https:\/\/i0.wp.com\/cdn-images-1.medium.com\/max\/1024\/1%2AJpeWinvkTEYlYiou0Bh5Uw.png?ssl=1\"><figcaption>hierarchical clustering, level 1, threshold=11.4, <strong>image by\u00a0Author<\/strong><\/figcaption><\/figure>\n<figure><img data-recalc-dims=\"1\" decoding=\"async\" alt=\"\" src=\"https:\/\/i0.wp.com\/cdn-images-1.medium.com\/max\/1024\/1%2AayjQpPU1e-AHAKd6l_SkqA.png?ssl=1\"><figcaption>all the cluster sizes of level 1 clustering, <strong>image by\u00a0Author<\/strong><\/figcaption><\/figure>\n<p>The next clustering level will be defined by a smaller threshold (9), which organizes the data in 181 clusters. 
As with the first level of clustering, the size distribution is uneven, but there are only two big clusters, so it should not be that big of an\u00a0issue.<\/p>\n<figure><img data-recalc-dims=\"1\" decoding=\"async\" alt=\"\" src=\"https:\/\/i0.wp.com\/cdn-images-1.medium.com\/max\/1024\/1%2AI4cWVMnaA2qsagObJ5jMIA.png?ssl=1\"><figcaption>hierarchical clustering, level 2, threshold=9, <strong>image by\u00a0Author<\/strong><\/figcaption><\/figure>\n<figure><img data-recalc-dims=\"1\" decoding=\"async\" alt=\"\" src=\"https:\/\/i0.wp.com\/cdn-images-1.medium.com\/max\/1024\/1%2AFLVp7NvZoQ3POo43rZvUcQ.png?ssl=1\"><figcaption>all the cluster sizes of level 2 clustering, <strong>image by\u00a0Author<\/strong><\/figcaption><\/figure>\n<p>These thresholds have been arbitrarily chosen. Although <strong>there are non-parametric clustering algorithms<\/strong> that can perform the clustering process without any human input, they are quite challenging to manage, especially at scale, and show side effects such as the <strong>non-uniform distribution of cluster sizes<\/strong>. If among our clusters there are some that are too big (e.g., one single cluster may even account for 20% of the overall data), then they may absorb most recommendations without good\u00a0reason.<\/p>\n<p>Our priority when executing clustering is to <strong>obtain the most uniform distribution while maximizing the number of clusters<\/strong> so that the data can be split and differently represented as much as possible.<\/p>\n<h4>3. Cluster\u00a0labeling<\/h4>\n<p>Because we have chosen to perform clustering on two levels of depth on top of our existing data, we have reached a total of 3 layers. The last layer is made of individual tags and is the only labeled layer. 
The other two, instead, only hold the cluster number without proper\u00a0naming.<\/p>\n<p>To solve this problem (note that this supercluster labeling step is not mandatory, but can improve how the user interacts with our recommender) we can use an LLM on top of the superclusters. <br \/>Let us try to automatically label all our clusters by feeding the tags inside of each\u00a0group:<\/p>\n<figure><img data-recalc-dims=\"1\" decoding=\"async\" alt=\"\" src=\"https:\/\/i0.wp.com\/cdn-images-1.medium.com\/max\/770\/1%2AAnLcSZB7y1hMjWhS7CkiNA.png?ssl=1\"><figcaption>labeling for clusters at different depths, <strong>image by\u00a0Author<\/strong><\/figcaption><\/figure>\n<p>Now that our clusters have also been labeled correctly, we can start building the foundation of our sequential recommender.<\/p>\n<h4>4. Balance non-uniform tag frequency<\/h4>\n<p>So far, we have completed the easy part. Now that we have all our elements ready to create a recommender, we still need to adjust the imbalances. It would be much more intuitive to showcase this step after the recommender is done but, unfortunately, it is part of its base structure, so please bear with\u00a0me.<\/p>\n<h4>4.1 What if we skip balancing?<\/h4>\n<p>Let us, for a moment, skip ahead and show the capabilities of our finished recommender by simply <strong>skipping this essential step<\/strong>. By assigning a score of 1 to each tag, there will be some tags that are so common that they will heavily skew the recommendation scores.<\/p>\n<p>The following is a Monte Carlo simulation <strong>of 5000 random tag choices from the dataset<\/strong>. What we are looking at is the distribution of clusters that end up being chosen randomly after summing the scores. 
As we can see, the distribution is highly skewed and it will certainly break the recommender in favor of the clusters with the highest\u00a0score.<\/p>\n<figure><img data-recalc-dims=\"1\" decoding=\"async\" alt=\"\" src=\"https:\/\/i0.wp.com\/cdn-images-1.medium.com\/max\/1024\/1%2AZVOCiUt7DJ0-E00F8RVFbg.png?ssl=1\"><figcaption>recommended cluster frequency over 10k simulations, <strong>image by\u00a0Author<\/strong><\/figcaption><\/figure>\n<p>For example, the cluster <strong>\u201cDark Norse Realms\u201d<\/strong> contains the tag <strong>Indie<\/strong>, which appears in 64% of all samples (it is almost impossible not to pick it repeatedly).<\/p>\n<figure><img data-recalc-dims=\"1\" decoding=\"async\" alt=\"\" src=\"https:\/\/i0.wp.com\/cdn-images-1.medium.com\/max\/526\/1%2A2C5bR6FlekKhLcw1duwOcQ.png?ssl=1\"><figcaption>example of recommended clusters, <strong>image by\u00a0Author<\/strong><\/figcaption><\/figure>\n<p>To be even more precise, let us simulate 100 different random sessions, each one picking <strong>the top 3 clusters from the session<\/strong> (the main user preferences we keep track of). Simulating entire user sessions makes the data more complete. 
It is normal, especially when using a decay function, for the distribution to be non-uniform and to keep shifting over\u00a0time.<\/p>\n<figure><img data-recalc-dims=\"1\" decoding=\"async\" alt=\"\" src=\"https:\/\/i0.wp.com\/cdn-images-1.medium.com\/max\/1024\/1%2Ah2Cm3QIWjqsNZ-pcsyHFlQ.png?ssl=1\"><figcaption>recommended cluster frequency over 10k simulations, <strong>image by\u00a0Author<\/strong><\/figcaption><\/figure>\n<p>However, if the skewness is excessive, the result is that the majority of users will be recommended <strong>the top 5% of the clusters 95% of the time<\/strong> (these are not precise numbers, just an illustration of my\u00a0point).<\/p>\n<h4>4.2 Balancing probability distribution<\/h4>\n<p><strong>Instead<\/strong>, let us use a proper formula for frequency adjustment. Because the probability for each cluster is different, we want to assign a score that, when used to balance the weights of our user vector, <strong>will balance cluster retrieval:<\/strong><\/p>\n<figure><img data-recalc-dims=\"1\" decoding=\"async\" alt=\"\" src=\"https:\/\/i0.wp.com\/cdn-images-1.medium.com\/max\/1024\/1%2AGIpR-ApIjbn1Y7-SwzM9LA.png?ssl=1\"><figcaption>scoring function to balance probability non-uniformity, <strong>image by\u00a0Author<\/strong><\/figcaption><\/figure>\n<p>Let us look at the score assigned to each tag for <strong>4 different random clusters<\/strong>:<\/p>\n<figure><img data-recalc-dims=\"1\" decoding=\"async\" alt=\"\" src=\"https:\/\/i0.wp.com\/cdn-images-1.medium.com\/max\/695\/1%2A6X3iztcztVnm-n-izQfxfw.png?ssl=1\"><figcaption>example of recommended clusters, <strong>image by\u00a0Author<\/strong><\/figcaption><\/figure>\n<p>If we apply the score to the random pick (5000 picks, counting the frequency adjusted by the aforementioned <strong>weight<\/strong>), we can see how the tag distribution is now balanced (the outlier, \u201cAdrenaline Rush\u201d, is caused by a duplicate name):<\/p>\n<figure><img data-recalc-dims=\"1\" decoding=\"async\" alt=\"\" 
src=\"https:\/\/i0.wp.com\/cdn-images-1.medium.com\/max\/1024\/1%2Ab-YsRFcXwm-NF0QoSSlLmQ.png?ssl=1\"><figcaption>cluster probability over 10k simulations, <strong>image by\u00a0Author<\/strong><\/figcaption><\/figure>\n<p>In fact, by looking at the normal distribution of the fluctuations, we see that the standard deviation for picking any cluster is approx. 0.1, <strong>which is extremely low<\/strong> (especially compared to\u00a0before).<\/p>\n<figure><img data-recalc-dims=\"1\" decoding=\"async\" alt=\"\" src=\"https:\/\/i0.wp.com\/cdn-images-1.medium.com\/max\/544\/1%2Aynw5jL56fpWHuRqMA5nb3Q.png?ssl=1\"><figcaption>fluctuation distribution over 10k simulations, <strong>image by\u00a0Author<\/strong><\/figcaption><\/figure>\n<p>By replicating 100 sessions, we see how, even with a pseudo-uniform probability distribution, the clusters amass over time following the Pareto principle.<\/p>\n<figure><img data-recalc-dims=\"1\" decoding=\"async\" alt=\"\" src=\"https:\/\/i0.wp.com\/cdn-images-1.medium.com\/max\/1024\/1%2AEog3gh-3btnrdQClfKotYQ.png?ssl=1\"><figcaption>recommended cluster frequency over 10k simulations, <strong>image by\u00a0Author<\/strong><\/figcaption><\/figure>\n<h4><strong>5. Univariate sequential encoding<\/strong><\/h4>\n<p>It is time to build the sequential mechanism to keep track of user choices over time. The mechanism I devised <strong>works on two separate vectors<\/strong> (that after the process end up being one, hence univariate), a <strong>historical vector<\/strong> and a <strong>caching\u00a0vector<\/strong>.<\/p>\n<p>The <strong>historical vector<\/strong> is the one that is used to perform knn on the existing clusters. Once a session is concluded, we update the historical vector with the new user choices. At the same time, we adjust existing values with a decay function that diminishes the existing weights over time. 
By doing so, we make sure to keep up with the customer trends and <strong>give more weight to new choices, rather than older\u00a0ones<\/strong>.<\/p>\n<p>Rather than updating the vector each time the user makes a choice (which is not computationally efficient; in addition, we risk letting older choices decay too quickly, as every user interaction will trigger the decay mechanism), <strong>we can store a temporary vector <\/strong>that is only valid for the current session. Each user interaction, converted into a vector <strong>using the tag frequency as a one-hot weight<\/strong>, will be summed to the existing cached\u00a0vector.<\/p>\n<figure><img data-recalc-dims=\"1\" decoding=\"async\" alt=\"\" src=\"https:\/\/i0.wp.com\/cdn-images-1.medium.com\/max\/1024\/1%2A64mX4qp-fpoMgHSaDBa37A.png?ssl=1\"><figcaption>vector sum workflow, <strong>image by\u00a0Author<\/strong><\/figcaption><\/figure>\n<p>Once the session is closed, we will retrieve the historical vector from the database, merge it with the cached vector, and <strong>apply the adjustment mechanisms<\/strong> (such as the decay function and pruning, as we will see later). After the historical vector has been updated, it will be stored in the database, replacing the old\u00a0one.<\/p>\n<figure><img data-recalc-dims=\"1\" decoding=\"async\" alt=\"\" src=\"https:\/\/i0.wp.com\/cdn-images-1.medium.com\/max\/732\/1%2AyEvcBQAoJ-uQzDOAo0fQqA.png?ssl=1\"><figcaption>session recommender workflow, <strong>image by\u00a0Author<\/strong><\/figcaption><\/figure>\n<p>The two reasons to follow this approach are to minimize the weight difference between older and newer interactions and to make the entire process scalable and computationally efficient.<\/p>\n<h4>6. Pruning Mechanism<\/h4>\n<p>The system has been completed. 
However, there is an additional problem. Covariate encoding has one flaw: its base vector <strong>is scaled proportionally to the number of encoded tags.<\/strong> For example, if our database were to reach 100k tags, the vector would have an equivalent number of dimensions.<\/p>\n<p>The original covariate encoding architecture already takes this problem into account, proposing a PCA compression mechanism as a solution. However, applied to our recommender, PCA causes issues when iteratively summing vectors, resulting in information loss. Because every user choice will cause a summation of existing vectors with a new one, this solution is not advisable.<\/p>\n<p>However, if we cannot compress the vector, we can prune the dimensions with the lowest scores. The system will execute a knn based on the most relevant scores of the vector; this direct method of feature engineering won\u2019t negatively (or, better yet, not excessively) affect the results of the final recommendation.<\/p>\n<figure><img data-recalc-dims=\"1\" decoding=\"async\" alt=\"\" src=\"https:\/\/i0.wp.com\/cdn-images-1.medium.com\/max\/1024\/1%2AFBjNYHo7CRrfnA4cUjI1-w.png?ssl=1\"><figcaption>pruning mechanism, <strong>image by\u00a0Author<\/strong><\/figcaption><\/figure>\n<p>By pruning our vector, we can arbitrarily set a maximum number of dimensions for our vectors. Without altering the tag indexes, we can start operating on sparse vectors rather than dense ones: a data structure that only saves the active indexes of our vectors and can therefore scale indefinitely. We can compare the recommendations obtained from a full vector (dense vector) against a sparse vector (pruned\u00a0vector).<\/p>\n<figure><img data-recalc-dims=\"1\" decoding=\"async\" alt=\"\" src=\"https:\/\/i0.wp.com\/cdn-images-1.medium.com\/max\/1024\/1%2As7gzaC-49NBE9dlI5GJKwA.png?ssl=1\"><figcaption>recommendation of the same user vector using a <strong>dense <\/strong>vs. 
<strong>sparse <\/strong>vector, <strong>image by\u00a0Author<\/strong><\/figcaption><\/figure>\n<p>As we can see, we can spot minor differences, but the overall integrity of the vector has been maintained <strong>in exchange for scalability<\/strong>. A very intuitive alternative to this process is to perform clustering at the tag level, keeping the vector size fixed. In this case, a new tag will need to be assigned to the semantically closest existing tag, and will not occupy its own dedicated dimension.<\/p>\n<h4>7. Exemplar estimation<\/h4>\n<p>Now that you have fully grasped the theory behind this new approach, we can compare the two approaches more clearly. In a multivariate approach, the first step was to identify the top user preferences using clustering. As we can see, this process required us to store as many vectors as exemplars found.<\/p>\n<figure><img data-recalc-dims=\"1\" decoding=\"async\" alt=\"\" src=\"https:\/\/i0.wp.com\/cdn-images-1.medium.com\/max\/859\/1%2AU-6uWsBXJ0B4V0GAVBX3jg.png?ssl=1\"><figcaption>Exemplar extraction, <strong>image by\u00a0Author<\/strong><\/figcaption><\/figure>\n<p>However, in a univariate approach, <strong>because covariate encoding works on a transposed version of the encoded data<\/strong>, we can <strong>use sections of our historical vector<\/strong> to store user preferences, hence only using a single vector for the entire process. We use <strong>the historical vector as a query <\/strong>to search through the encoded tags: its <strong>top-k results from a knn search<\/strong> will be equivalent to the top-k preferential clusters.<\/p>\n<figure><img data-recalc-dims=\"1\" decoding=\"async\" alt=\"\" src=\"https:\/\/i0.wp.com\/cdn-images-1.medium.com\/max\/855\/1%2AWymC-puyCXNAzUgv13I2Hg.png?ssl=1\"><figcaption>difference between multivariate and univariate sets of vectors, <strong>image by\u00a0Author<\/strong><\/figcaption><\/figure>\n<h4>8. 
Recommendation approaches<\/h4>\n<p>Now that we have captured more than one preference, how do we plan to recommend items? This is the major difference between the two systems. The traditional multivariate recommender will use the exemplar to <strong>recommend k items<\/strong> to a user. However, our system has assigned our customer one supercluster and the top subclusters under it (depending on our level of tag segmentation, we can increase the number of levels). We will not recommend the top k items, <strong>but the top k subclusters<\/strong>.<\/p>\n<h4>Using groupby instead of vector\u00a0search<\/h4>\n<p>So far, we have been using a vector to store data, but that <strong>does not mean we need to rely on vector search<\/strong> to perform recommendations, because vector search will be much slower than a SQL operation. Note that obtaining the exact same results using vector search on the user array is indeed possible.<\/p>\n<p>If you are wondering why we would switch from a vector-based system to a count-based system, it is a legitimate question. The simple answer is that <strong>this is the most faithful replica of the multivariate system<\/strong> (as portrayed in the reference images), while being much more scalable (it can reach up to 3000 recommendations\/s on 16 CPU cores using pandas). Originally, the univariate recommender was designed to employ vector search, but, as showcased, there are simpler and better search algorithms.<\/p>\n<h3>Simulation<\/h3>\n<p>Let us run a full test that we can monitor. 
We can use the code from the sample notebook: for our simple example, the user selects at least one game <strong>labeled with corresponding tags<\/strong>.<\/p>\n<pre># if no vector exists, the first choices are the historical vector<br>historical_vector = user_choices(5, tag_lists=[['Shooter', 'Fantasy']], tag_frequency=tag_frequency, display_tags=False)<br><br># day1<br>cached_vector = user_choices(3, tag_lists=[['Puzzle-Platformer'], ['Dark Fantasy'], ['Fantasy']], tag_frequency=tag_frequency, display_tags=False)<br>historical_vector = update_vector(historical_vector, cached_vector, 1, 0.8)<br><br># day2<br>cached_vector = user_choices(3, tag_lists=[['Puzzle'], ['Puzzle-Platformer']], tag_frequency=tag_frequency, display_tags=False)<br>historical_vector = update_vector(historical_vector, cached_vector, 1, 0.8)<br><br># day3<br>cached_vector = user_choices(3, tag_lists=[['Adventure'], ['2D', 'Turn-Based']], tag_frequency=tag_frequency, display_tags=False)<br>historical_vector = update_vector(historical_vector, cached_vector, 1, 0.8)<br><br>compute_recommendation(historical_vector, label_1_max=3)<\/pre>\n<p>At the end of 3 sessions, these are the top 3 exemplars (label_1) <strong>extracted from our recommender<\/strong>:<\/p>\n<figure><img data-recalc-dims=\"1\" decoding=\"async\" alt=\"\" src=\"https:\/\/i0.wp.com\/cdn-images-1.medium.com\/max\/566\/1%2A408DO3nyW9Jd-tZ8HAfiFA.png?ssl=1\"><figcaption>recommendation after 3 sessions, <strong>image by\u00a0Author<\/strong><\/figcaption><\/figure>\n<p>In the notebook, you will find the option to perform Monte Carlo simulations, but there would be no easy way to validate them (mostly because team games are not tagged with the highest accuracy, and I noticed that most small games list too many unrelated or common\u00a0tags).<\/p>\n<h3>Conclusion<\/h3>\n<p>The architectures of the most popular recommender systems still do not take into account session history, but with the development of new algorithms and the increase 
in computing power, it is now possible to tackle a higher level of complexity.<\/p>\n<p>This new approach should offer a comprehensive alternative to the <strong>sequential recommender systems<\/strong> available on the market, but I am convinced that there is always room for improvement. To further enhance this architecture it would be possible to switch from a <strong>clustering-based<\/strong> to a <strong>network-based<\/strong> approach.<\/p>\n<p>It is important to note that this recommender system can only excel when applied to a limited number of domains but has the potential to shine in conditions of scarce computational resources or extremely high\u00a0demand.<\/p>\n<hr>\n<p><a href=\"https:\/\/towardsdatascience.com\/introducing-univariate-exemplar-recommenders-how-to-profile-customer-behavior-in-a-single-vector-c90c9943fe7d\">Introducing Univariate Exemplar Recommenders: how to profile Customer Behavior in a single vector<\/a> was originally published in <a href=\"https:\/\/towardsdatascience.com\/\">Towards Data Science<\/a> on Medium, where people are continuing the conversation by highlighting and responding to this story.<\/p>\n<\/div>\n<p> \t<BR><br \/>\n <BR><\/BR><br \/>\n    Michelangiolo Mazzeschi<br \/>\n \t<BR><br \/>\n<BR><\/BR><br \/>\n<a href=\"https:\/\/medium.com\/m\/global-identity-2?redirectUrl=https%3A%2F%2Ftowardsdatascience.com%2Fintroducing-univariate-exemplar-recommenders-how-to-profile-customer-behavior-in-a-single-vector-c90c9943fe7d\">Go to original source<\/a><br \/>\n \t<BR><br \/>\n <BR><\/BR><\/p>\n","protected":false},"excerpt":{"rendered":"<p>Introducing Univariate Exemplar Recommenders: how to profile Customer Behavior in a single vector Customer Profiling Surveying and improving the current methodologies for customer 
profiling ***To understand this article, knowledge of embeddings, clustering, and recommendation systems is required. The implementation of this algorithm has been released on GitHub and is fully open-source. I am open to [&hellip;]<\/p>\n","protected":false},"author":2,"featured_media":0,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[62,434,67,435,436,433],"tags":[439,438,437],"class_list":["post-375","post","type-post","status-publish","format-standard","hentry","category-aimldsaimlds","category-customer-behavior-ai","category-deep-dives","category-recommender-systems","category-retail-recommendations","category-vector-recommender","tag-customer","tag-recommenders","tag-vector"],"_links":{"self":[{"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/posts\/375"}],"collection":[{"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/users\/2"}],"replies":[{"embeddable":true,"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/comments?post=375"}],"version-history":[{"count":0,"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/posts\/375\/revisions"}],"wp:attachment":[{"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/media?parent=375"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/categories?post=375"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/tags?post=375"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}