{"id":1585,"date":"2025-02-01T07:03:24","date_gmt":"2025-02-01T07:03:24","guid":{"rendered":"https:\/\/mailitics.com\/index.php\/2025\/02\/01\/inequality-in-practice-e-commerce-portfolio-analysis-adc3d0876acd\/"},"modified":"2025-02-01T07:03:24","modified_gmt":"2025-02-01T07:03:24","slug":"inequality-in-practice-e-commerce-portfolio-analysis-adc3d0876acd","status":"publish","type":"post","link":"https:\/\/mailitics.com\/index.php\/2025\/02\/01\/inequality-in-practice-e-commerce-portfolio-analysis-adc3d0876acd\/","title":{"rendered":"Inequality in Practice: E-commerce Portfolio Analysis"},"content":{"rendered":"<div>\n<h4>From Mathematical Theory to Actionable Insights: A 6-Year Shopify Case\u00a0Study<\/h4>\n<figure><img data-recalc-dims=\"1\" decoding=\"async\" alt=\"\" src=\"https:\/\/i0.wp.com\/cdn-images-1.medium.com\/max\/1024\/1%2Afc3aEYB3InUBz1IgfXGFtA.png?ssl=1\"><figcaption>Image generated by DALL-E, based on author\u2019s prompt, inspired by \u201cThe Bremen Town Musicians\u201d<\/figcaption><\/figure>\n<p>Are your top-selling products making or breaking your business?<\/p>\n<p>It\u2019s terrifying to think your entire revenue might collapse if one or two products fall out of favor. Yet spreading too thin across hundreds of products often leads to mediocre results and brutal price\u00a0wars.<\/p>\n<p>Discover how a 6-year Shopify case study uncovered the perfect balance between focus and diversification.<\/p>\n<h4>Why bother?<\/h4>\n<p>Understanding concentration in your product portfolio is more than simply an intellectual exercise; it has a direct impact on crucial business choices. 
From inventory planning to marketing spend, understanding how your revenue is distributed among products shapes your approach.<\/p>\n<p>This post walks through practical strategies for monitoring concentration, explaining what these measurements actually mean and how to get useful insights from your\u00a0data.<\/p>\n<p>I\u2019ll take you through fundamental metrics and advanced analysis, including interactive visualisations that bring the data to\u00a0life.<\/p>\n<p>I am also sharing chunks of R code used in this analysis. Use it directly or adapt the logic to your preferred programming language.<\/p>\n<h3>The Concentration Question<\/h3>\n<p>Looking at market analysis or investment theory, we often focus on concentration\u200a\u2014\u200ahow value is distributed across different elements. In e-commerce, this translates into a fundamental question: How much of your revenue should come from your top products?<\/p>\n<p>Is it better to have several strong sellers or a broad product range? This isn\u2019t just a theoretical question\u00a0\u2026<\/p>\n<p>Having most of your revenue tied to a few products means your operations are streamlined and focused. But what happens when market preferences shift? Conversely, spreading revenue across hundreds of products might seem safer, but it often means you lack any real competitive advantage.<\/p>\n<p>So where\u2019s the optimal point? Or rather, what is the optimal range, and how do various ratios describe\u00a0it?<\/p>\n<p>What makes this analysis particularly valuable is that it is based on real data from a business that kept expanding its product range over\u00a0time.<\/p>\n<h3>Getting the Data\u00a0Right<\/h3>\n<h4>On Datasets<\/h4>\n<p>This analysis was done for a real US-based e-commerce store\u200a\u2014\u200aone of our clients who kindly agreed to share their data for this article. 
The data spans six years of their growth, giving us a rich view of how product concentration evolves as the business\u00a0matures.<\/p>\n<p>While working with actual business data gives us genuine insights, I\u2019ve also created a synthetic dataset in one of the later sections. This small, artificial dataset helps illustrate the relationships between various ratios in a more controlled setting\u200a\u2014\u200ashowing patterns you can \u201ccount on your fingers\u201d.<\/p>\n<p>To be clear: this synthetic data was created entirely from scratch and only loosely mimics general patterns seen in real e-commerce\u200a\u2014\u200ait has no direct connection to our client\u2019s actual data. This is different from my previous article, where I generated synthetic data based on real patterns using Snowflake functionality.<\/p>\n<h4>Data Export<\/h4>\n<p>The main analysis draws from real data, but that small artificial dataset serves an important purpose\u200a\u2014\u200ait helps explain relationships between various ratios in a way that\u2019s easy to grasp. And trust me, having such a micro dataset with clear visuals comes in really handy when explaining complex dependencies to stakeholders\u00a0\ud83d\ude09<\/p>\n<p>The raw transaction export from Shopify contains everything we require, but we must arrange it properly for concentration analysis. The data contains all of the products for each transaction, but the date is only in one row per transaction, so we must propagate it to all products while retaining the transaction id. It is probably not needed for the first iteration of the study, but if we want to fine-tune it, we should consider how to handle discounts, returns, and so on. If the store sells abroad, run both a global and a country-specific analysis.<\/p>\n<p>We have a product name and an SKU, both of which should adhere to some naming convention and logic when dealing with variants. If we have a master catalogue with all of these descriptions and codes, we are very fortunate. 
If you have it, use it, but compare it with the \u2018ground truth\u2019 of actual transaction data.<\/p>\n<h4>Product Variants<\/h4>\n<p>In my case, the product names were structured with a base name and a variant separated by a dash. That makes it simple to split them into the main product and its variants. Exceptions? Of course, they are always present, especially when dealing with 6 years of highly successful ecommerce data :). For instance, some names (e.g. \u201cAll-purpose\u201d) included a dash, while others did not. Then, some did have variants, while some did not. So expect some tweaks here, but this is a critical\u00a0stage.<\/p>\n<figure><img data-recalc-dims=\"1\" decoding=\"async\" alt=\"\" src=\"https:\/\/i0.wp.com\/cdn-images-1.medium.com\/max\/1024\/1%2Ay43-dSaq6CDyS199Y7sDaw.png?ssl=1\"><figcaption>Number of unique products, with and without variants\u200a\u2014\u200aall charts rendered by author, with own R\u00a0code<\/figcaption><\/figure>\n<p>If you\u2019re wondering why we need to exclude variants from concentration analysis, the figure above illustrates it clearly. The values are considerably different, and we would expect radically different results if we analysed concentration with variants.<\/p>\n<p>The analysis is based on transactions, counting the number of products with and without variants in a given month. But if we have a large number of variants, not all of them will be present in one-month transactions. Yes, that is correct\u200a\u2014\u200aso let us consider a larger time range, one\u00a0year.<\/p>\n<figure><img data-recalc-dims=\"1\" decoding=\"async\" alt=\"\" src=\"https:\/\/i0.wp.com\/cdn-images-1.medium.com\/max\/1024\/1%2Ak1b8DIQG-h76T5mvsyuXsQ.png?ssl=1\"><figcaption>Products, their variants, by transaction date,\u00a0yearly<\/figcaption><\/figure>\n<p>I calculated the number of variants per base product in a calendar year based on what we have in transactions. The number of variants per base product is divided into several bins. 
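<\/p>\n<p>As a rough illustration of this counting step, here is a minimal sketch on made-up product names; it is not the code used for the client. It assumes the variant separator is a dash surrounded by spaces, and the table tx_dt, the columns product_name, base and variant, and the bin edges are all hypothetical.<\/p>\n<pre>library(data.table)<br><br>#-- made-up product names; the \" - \" separator is an assumption<br>tx_dt &lt;- data.table(<br>  year = 2024L,<br>  product_name = c(\"Mug - Blue\", \"Mug - Red\", \"All-purpose Cleaner\",<br>                   \"T-shirt - S\", \"T-shirt - M\", \"T-shirt - L\")<br>)<br><br>#-- split on the first \" - \" only, so dashes inside names survive<br>tx_dt[, base := sub(\" - .*$\", \"\", product_name)]<br>tx_dt[, variant := ifelse(grepl(\" - \", product_name, fixed = TRUE),<br>                          sub(\"^.*? - \", \"\", product_name, perl = TRUE),<br>                          NA_character_)]<br><br>#-- variants per base product and year; a product without variants counts as one version<br>var_dt &lt;- tx_dt[, .(n_variants = max(1L, uniqueN(variant, na.rm = TRUE))), by = .(year, base)]<br><br>#-- bin the counts (illustrative bin edges)<br>var_dt[, bin := cut(n_variants, breaks = c(0, 1, 5, 20, 99, Inf),<br>                    labels = c(\"1\", \"2-5\", \"6-20\", \"21-99\", \"100+\"))]<\/pre>\n<p>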
Let\u2019s take the year 2024. The plot shows that we have somewhere around 170 base items, with fewer than half having only one variant (light green bar). However, the rest had more than one version, and what is noteworthy (and, I believe, non-obvious, unless you work in apparel ecommerce) is that we have products with a really large number of versions. The black bin contains items that come in 100 or more different variants.<\/p>\n<p>If you guessed that they were increasing their offerings by introducing new products while keeping old ones available, you are correct. But wouldn\u2019t it be interesting to know whether the differences stem from legacy or new products? What if we just included products introduced in the current year? We can check it by using the date of product introduction rather than transactions. Because our only dataset is a transaction dump, the first transaction for each product is taken as the introduction date. And for each product, we take all versions that appeared in transactions, with no time constraints (from product introduction to the most recent\u00a0record).<\/p>\n<figure><img data-recalc-dims=\"1\" decoding=\"async\" alt=\"\" src=\"https:\/\/i0.wp.com\/cdn-images-1.medium.com\/max\/1024\/1%2Ach7oF2keDcCoQ6gzE0c3Bw.png?ssl=1\"><figcaption>Products, their variants, by product introduction date,\u00a0yearly<\/figcaption><\/figure>\n<p>Now let\u2019s have these two plots side by side for easy comparison. Taking transaction dates, we have more products in each year, and the difference grows\u200a\u2014\u200asince there are also transactions with products introduced previously. No surprises here. If you were wondering why the data for 2019 differ\u200a\u2014\u200anice catch. In fact, the shop started operating in 2018, but I removed these few initial months; still, it is their impact that makes the difference in\u00a02019.<\/p>\n<p>Product variants and their impact on revenue are not our focus in this article. 
But as is often the case in real analysis, \u2018branching\u2019 options appear as we progress, even in the initial phase. We haven\u2019t even finished data preparation, and it is already getting interesting.<\/p>\n<figure><img data-recalc-dims=\"1\" decoding=\"async\" alt=\"\" src=\"https:\/\/i0.wp.com\/cdn-images-1.medium.com\/max\/1024\/1%2AnArp6wNLllnrf2S1y5p-gA.png?ssl=1\"><figcaption>Products, their variants, by product introduction and transaction, yearly<\/figcaption><\/figure>\n<figure><img data-recalc-dims=\"1\" decoding=\"async\" alt=\"\" src=\"https:\/\/i0.wp.com\/cdn-images-1.medium.com\/max\/1024\/1%2AITy37GAbGtmB_HDyPYBAIw.png?ssl=1\"><figcaption>Same data as above, facet by\u00a0bin<\/figcaption><\/figure>\n<p>Understanding the product structure is critical for conducting meaningful concentration analyses. Now that our data is appropriately formatted, we can examine actual concentration measurements and what they reveal about ideal portfolio structure. In the following part, we\u2019ll look at these measurements and what they mean for e-commerce businesses.<\/p>\n<h3>Measuring Concentration\u200a\u2014\u200atheory meets\u00a0practice<\/h3>\n<p>When it comes to determining concentration, economists and market analysts have done the heavy lifting for us. Over decades of research into markets, competitiveness, and inequality, they\u2019ve produced powerful analytical methods that have proven useful in a variety of sectors. Rather than developing novel metrics for e-commerce portfolio analysis, we can use existing time-tested methods.<\/p>\n<p>Let\u2019s see how theoretical frameworks can shed light on practical e-commerce questions.<\/p>\n<h4>Herfindahl-Hirschman Index<\/h4>\n<p>HHI (Herfindahl-Hirschman Index) is probably the most common way to measure concentration. Regulators use it to check whether a market is becoming too concentrated\u200a\u2014\u200athey take each company\u2019s market share as a percentage, square these values, and add them up. 
Simple as that. The result can be anywhere from nearly 0 (many small players) to 10,000 (one company takes it\u00a0all).<\/p>\n<p>Why use HHI for e-commerce portfolio analysis? The logic is straightforward\u200a\u2014\u200ainstead of companies competing in a market, we have products competing for revenue. The math works exactly the same way\u200a\u2014\u200awe take each product\u2019s share of total revenue, square it, and sum the results. High HHI means revenue depends on a few products, while low HHI shows revenue is spread across many products. This gives us a single number to track portfolio concentration over\u00a0time.<\/p>\n<figure><img data-recalc-dims=\"1\" decoding=\"async\" alt=\"\" src=\"https:\/\/i0.wp.com\/cdn-images-1.medium.com\/max\/340\/1%2ApSKS6UCeEO4qgcx4WsBbZA.png?ssl=1\"><\/figure>\n<figure><img data-recalc-dims=\"1\" decoding=\"async\" alt=\"\" src=\"https:\/\/i0.wp.com\/cdn-images-1.medium.com\/max\/1024\/1%2AvEwXiTaYv3FZfZpNBwKymQ.png?ssl=1\"><figcaption>HHI, and Products for\u00a0context<\/figcaption><\/figure>\n<h4>Pareto<\/h4>\n<p>Who has not heard of Pareto\u2019s rule? In 1896, Italian economist Vilfredo Pareto observed that 20% of the population held 80% of Italy\u2019s land. Since then, this pattern has been found in a variety of fields, including wealth distribution and retail\u00a0sales.<\/p>\n<p>While popularly referred to as the \u201c80\/20 rule,\u201d the Pareto principle is not limited to these figures. We can use any x-axis criterion (for example, the top 30% of products) to determine the corresponding y value (revenue contribution). 
The Lorenz curve, formed by linking these points, provides a complete picture of concentration.<\/p>\n<figure><img data-recalc-dims=\"1\" decoding=\"async\" alt=\"\" src=\"https:\/\/i0.wp.com\/cdn-images-1.medium.com\/max\/1024\/1%2AvVqu8_U3-8wr3ffTSEcUZg.png?ssl=1\"><figcaption>Pareto lines for different revenue share thresholds<\/figcaption><\/figure>\n<p>The chart above shows how many products we need to achieve a certain share of the monthly revenue. I took arbitrary cuts at\u00a0.2,\u00a0.3,\u00a0.5,\u00a0.8,\u00a0.95, and of course also 1\u200a\u2014\u200awhich means the total number of products, contributing to 100% of revenue in a given\u00a0month.<\/p>\n<h4>Lorenz curve<\/h4>\n<p>If we sort products by their revenue contribution and chart the cumulative line, we get the Lorenz curve. On both axes we have percentages: of products and of their cumulative revenue share. In the case of a perfectly uniform revenue distribution, we\u2019d have a straight line, while in the case of \u201cperfect concentration\u201d we\u2019d have a very steep curve, climbing close to 100% of revenue and then turning sharply right to include some residual revenue from other products.<\/p>\n<figure><img data-recalc-dims=\"1\" decoding=\"async\" alt=\"\" src=\"https:\/\/i0.wp.com\/cdn-images-1.medium.com\/max\/1024\/1%2AOeEkn91odOdGXEp6zaeHNg.png?ssl=1\"><figcaption>Lorenz curve<\/figcaption><\/figure>\n<p>It is interesting to see that line, but in most cases it will look quite similar, like a \u201cbent stick\u201d. So let us now compare these lines for a few previous months, and also a few years back (sticking to October). The monthly lines are quite similar, and if you think\u200a\u2014\u200ait would be good to have some interactivity in this plot, you are absolutely right. 
The yearly comparison shows more differences (we still have monthly data, taking October in each year), and this is understandable, since these measurements are more distant in\u00a0time.<\/p>\n<figure><img data-recalc-dims=\"1\" decoding=\"async\" alt=\"\" src=\"https:\/\/i0.wp.com\/cdn-images-1.medium.com\/max\/1024\/1%2AdYR5EHAK-bGtkXtzCe4Q2w.png?ssl=1\"><figcaption>Lorenz curve\u200a\u2014\u200acomparing periods<\/figcaption><\/figure>\n<p>So we do see differences between the lines, but can\u2019t we quantify them somehow, so as not to rely entirely on visual similarity? Definitely, and there is a ratio for that\u200a\u2014\u200athe Gini ratio. And by the way, we will have quite a lot of ratios in the next sections.<\/p>\n<h4>Gini Ratio<\/h4>\n<p>To translate the shape of the Lorenz curve into a numeric value, we can use the Gini ratio\u200a\u2014\u200adefined as the area between the Lorenz curve and the equality line, divided by the total area on that side of the equality line. On the plot below, it is the ratio between the dark and light blue\u00a0areas.<\/p>\n<figure><img data-recalc-dims=\"1\" decoding=\"async\" alt=\"\" src=\"https:\/\/i0.wp.com\/cdn-images-1.medium.com\/max\/1024\/1%2Aba_RTnDMuR85j4b2CsstHQ.png?ssl=1\"><figcaption>Gini ratio visualization<\/figcaption><\/figure>\n<p>Let us then visualize it for two periods\u200a\u2014\u200aOctober 2019 and October 2024, the exact same periods as on one of the plots\u00a0before.<\/p>\n<figure><img data-recalc-dims=\"1\" decoding=\"async\" alt=\"\" src=\"https:\/\/i0.wp.com\/cdn-images-1.medium.com\/max\/1024\/1%2AReycwUGFFZ-_8KLWaohERw.png?ssl=1\"><figcaption>Gini ratio, comparing two\u00a0periods<\/figcaption><\/figure>\n<p>Now that we have a good visual understanding of how the Gini ratio is calculated, let\u2019s plot it over the whole\u00a0period.<\/p>\n<p>I use R for analysis, so I have the Gini ratio easily available (as well as other ratios, which I will show later). The initial data table (x3a_dt) contains revenue per product, per month. 
The resulting one has the Gini ratio per\u00a0month.<\/p>\n<pre>#-- calculate Gini ratio, monthly<br>library(data.table)<br>library(ineq)<br>x3a_ineq_dt &lt;- x3a_dt[, .(gini = ineq::ineq(revenue, type = \"Gini\")), by = month]<\/pre>\n<p>It is good to have all these packages for the heavy lifting. The math behind them is not super complicated, but our time is precious.<\/p>\n<figure><img data-recalc-dims=\"1\" decoding=\"async\" alt=\"\" src=\"https:\/\/i0.wp.com\/cdn-images-1.medium.com\/max\/252\/1%2ALKhVbIWbaoGXl8sJjyxNkw.png?ssl=1\"><\/figure>\n<p>The plot below shows the result of the calculations.<\/p>\n<figure><img data-recalc-dims=\"1\" decoding=\"async\" alt=\"\" src=\"https:\/\/i0.wp.com\/cdn-images-1.medium.com\/max\/1024\/1%2A9CimWZmU2eqdNGRNW89oJQ.png?ssl=1\"><figcaption>Gini over\u00a0time<\/figcaption><\/figure>\n<p>I haven\u2019t included a smoothing line with its confidence interval band, since these are not raw measurement points but results of the Gini calculation, each with its own error distribution. To be very strict and precise on the math, we\u2019d need to calculate the confidence interval and plot the smoothed line based on that. The results are\u00a0below.<\/p>\n<figure><img data-recalc-dims=\"1\" decoding=\"async\" alt=\"\" src=\"https:\/\/i0.wp.com\/cdn-images-1.medium.com\/max\/1024\/1%2A95EgMTxRiFtVDWeAhbUjSw.png?ssl=1\"><figcaption>Gini over time, with trend\u00a0line<\/figcaption><\/figure>\n<p>Since we do not directly use the statistical significance of the calculated ratio, this super strict approach is a bit of an overkill. I haven\u2019t done it while charting the trend line for HHI, nor will I do it in the next plots. But it is good to be aware of this\u00a0nuance.<\/p>\n<p>So far we have seen two ratios\u200a\u2014\u200aHHI and Gini, and they are far from being identical. A Lorenz curve closer to the diagonal indicates a more uniform distribution, which is what we have for October 2019, but the HHI is higher than for 2024, indicating more concentration in 2019. 
Maybe I made a mistake in the calculations or, even worse, early on during data preparation? That would be really unfortunate. Or is the data ok, but we are struggling with proper interpretation?<\/p>\n<p>I quite often have moments of such doubt, especially when moving through the analysis really quickly. So how do we cope with that, tightening our grip on the data and our understanding of dependencies? Bear in mind that whatever analysis you do, there is always a first time. And quite often we do not have the luxury of \u2018leisure\u2019 research; more often it is already work for a Client (or a superior, stakeholder, whoever requested it, even ourselves, if it is our initiative).<\/p>\n<h3>Tightening grip<\/h3>\n<p>We need to have a good understanding of how to interpret all these ratios, including dependencies between them. If you plan to present your results to others, questions here are guaranteed, so it is better to be well prepared. We can work with an existing dataset, or we can generate a small set, where it will be easier to catch dependencies. 
Let us follow the latter approach.<\/p>\n<p>Let us start by creating a small\u00a0dataset,<\/p>\n<pre>library(data.table)<br><br>#-- Create sample revenue data<br>revenue &lt;- list(<br>  \"2021\" = rep(15, 10),                    # 10 values of 15<br>  \"2022\" = c(rep(100, 5), rep(10, 25)),    # 5 values of 100, 25 values of 10<br>  \"2023\" = rep(25, 50),                    # 50 values of 25<br>  \"2024\" = c(rep(100, 30), rep(10, 70))    # 30 values of 100, 70 values of 10<br>)<\/pre>\n<p>and combining it into a data.table.<\/p>\n<pre>#-- Convert to data.table in one step<br>x_dt &lt;- data.table(<br>  year = rep(names(revenue), sapply(revenue, length)),<br>  revenue = unlist(revenue)<br>)<\/pre>\n<p>A quick overview of the\u00a0data.<\/p>\n<figure><img data-recalc-dims=\"1\" decoding=\"async\" alt=\"\" src=\"https:\/\/i0.wp.com\/cdn-images-1.medium.com\/max\/1024\/1%2A1yGLMQ3zbEYErqdfYthQ9g.png?ssl=1\"><figcaption>Example dataset<\/figcaption><\/figure>\n<p>It seems we have what we needed\u200a\u2014\u200aa simple dataset, but still quite realistic. Now we proceed with calculations and charts, similar to those for the real dataset\u00a0before.<\/p>\n<pre>#-- HHI, Gini<br>xh_dt &lt;- x_dt[, .(hhi = ineq::Herfindahl(revenue), <br>                  gini = ineq::Gini(revenue)), year]<\/pre>\n<pre>#-- Lorenz<br>xl_dt &lt;- x_dt[order(-revenue), .(<br>  cum_prod_pct = seq_len(.N)\/.N, <br>  cum_rev_pct = cumsum(revenue)\/sum(revenue)), year]<\/pre>\n<p>And rendering plots.<\/p>\n<figure><img data-recalc-dims=\"1\" decoding=\"async\" alt=\"\" src=\"https:\/\/i0.wp.com\/cdn-images-1.medium.com\/max\/1024\/1%2A1oLSyYXq5Eh9-bCDeOesqQ.png?ssl=1\"><figcaption>Ratios comparison<\/figcaption><\/figure>\n<p>These charts help a lot in understanding the ratios, how they relate to each other, and how they relate to the data. 
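<\/p>\n<p>The same numbers can be checked by hand. The sketch below recomputes HHI and Gini for three of the synthetic years without the ineq package; hhi_manual() and gini_manual() are helper names made up for this illustration, and the Gini here uses the standard mean-absolute-difference formula.<\/p>\n<pre>#-- HHI: sum of squared revenue shares (0-1 scale; multiply by 10,000 for the percentage-based scale)<br>hhi_manual &lt;- function(x) sum((x \/ sum(x))^2)<br><br>#-- Gini: mean absolute difference over all pairs, divided by twice the mean<br>gini_manual &lt;- function(x) sum(abs(outer(x, x, \"-\"))) \/ (2 * length(x)^2 * mean(x))<br><br>rev_2021 &lt;- rep(15, 10)                   # 10 equal products<br>rev_2022 &lt;- c(rep(100, 5), rep(10, 25))   # 5 strong, 25 weak products<br>rev_2023 &lt;- rep(25, 50)                   # 50 equal products<br><br>hhi_manual(rev_2021)    # 0.1, i.e. 1\/10<br>hhi_manual(rev_2023)    # 0.02, i.e. 1\/50<br>hhi_manual(rev_2022)    # about 0.093<br>gini_manual(rev_2021)   # 0, perfectly equal<br>gini_manual(rev_2022)   # 0.5<\/pre>\n<p>With equal revenues, HHI is simply 1\/N, so it depends only on the number of products, while Gini stays at zero. That is why 2021 shows a higher HHI than 2022 even though 2022 is far more unequal.<\/p>\n<p>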
It is always a good idea to have such a micro analysis, for ourselves and for stakeholders\u200a\u2014\u200aas \u2018back pocket\u2019 slides, or even sharing them\u00a0upfront.<\/p>\n<p>Nerdy detail\u200a\u2014\u200ahow to slightly shift the line, so it doesn\u2019t overlap, and add labels within a plot? Render a plot, and then fine-tune it manually, expecting several iterations.<\/p>\n<pre>#-- shift the line<br>xl_dt[year == \"2021\", `:=` (cum_rev_pct = cum_rev_pct - .01)]<\/pre>\n<p>For labelling I use ggrepel, but by default it will label all the points, while we need only one per line. In addition, we need to decide which one, for a good-looking\u00a0chart.<\/p>\n<pre>#-- decide which points to label<br>labs_key2_dt &lt;- data.table(<br>  year = c(\"2021\", \"2022\", \"2023\", \"2024\"), position = c(4, 5, 25, 30))<br><br>#-- set keys<br>list(xl_dt, labs_key2_dt) |&gt; lapply(setkey, year)<br><br>#-- join<br>label_positions2 &lt;- xl_dt[<br>  labs_key2_dt, on = .(year), # join on 'year' <br>  .SD[get('position')],       # Use get('position') to reference the position from labs_key2_dt<br>  by = .EACHI]                # for each year<\/pre>\n<p>Render the\u00a0plot.<\/p>\n<pre>#-- render plot<br>library(ggplot2)<br>plot_22b &lt;- xl_dt |&gt;<br>  ggplot(aes(cum_prod_pct, cum_rev_pct, color = year, group = year, label = year)) +<br>  geom_line(linewidth = .2) +<br>  geom_point(alpha = .8, shape = 21) +<br>  theme_bw() +<br>  scale_color_viridis_d(option = \"H\", begin = 0, end = 1) +<br>  ggrepel::geom_label_repel(<br>    data = label_positions2, force = 10, <br>    box.padding = 2.5, point.padding = .3, <br>    seed = 3, direction = \"x\") +<br>... additional styling <\/pre>\n<h3>More Ratios<\/h3>\n<p>I began with HHI, the Lorenz curve, and the accompanying Gini ratios, as they appeared to be good starting points for concentration and inequality measurements. However, there are numerous other ratios used to describe distributions, whether for inequality or in general. 
It\u2019s unlikely that we\u2019d employ all of them at once, so select the subset that provides the most insights for your specific challenge.<\/p>\n<p>With a properly structured dataset, it is quite straightforward to calculate them. I am sharing code snippets, with several ratios calculated monthly. We use the dataset we already have\u200a\u2014\u200amonthly revenue per product (base products, excluding variants).<\/p>\n<figure><img data-recalc-dims=\"1\" decoding=\"async\" alt=\"\" src=\"https:\/\/i0.wp.com\/cdn-images-1.medium.com\/max\/1024\/1%2A9zcGxwJX7S25viHBFKbg4A.png?ssl=1\"><\/figure>\n<p>Starting with ratios from the ineq\u00a0package.<\/p>\n<pre>#---- inequality ----<br>x3_ineq_dt &lt;- x3a_dt[, .(<br>  # Classical inequality\/concentration measures<br>  gini     = ineq::ineq(revenue, type = \"Gini\"),      # Gini coefficient<br>  hhi      = ineq::Herfindahl(revenue),               # Herfindahl-Hirschman Index<br>  hhi_f    = sum((rev_pct*100)^2),                    # HHI - formula<br>  atkinson = ineq::ineq(revenue, type = \"Atkinson\"),  # Atkinson index<br>  theil    = ineq::ineq(revenue, type = \"Theil\"),     # Theil entropy index<br>  kolm     = ineq::ineq(revenue, type = \"Kolm\"),      # Kolm index<br>  rs       = ineq::ineq(revenue, type = \"RS\"),        # Ricci-Schutz index<br>  entropy  = ineq::entropy(revenue),                  # Entropy measure<br>  hoover   = mean(abs(revenue - mean(revenue)))\/(2 * mean(revenue)), # Hoover (Robin Hood) index<\/pre>\n<p>Distribution shape and top\/bottom shares and\u00a0ratios.<\/p>\n<pre> # Distribution shape measures<br>  cv       = sd(revenue)\/mean(revenue),               # Coefficient of Variation<br>  skewness = moments::skewness(revenue),              # Skewness<br>  kurtosis = moments::kurtosis(revenue),              # Kurtosis<br><br>  # Ratio measures<br>  p90p10   = quantile(revenue, 0.9)\/quantile(revenue, 0.1),   # P90\/P10 ratio<br>  p75p25   = quantile(revenue, 
0.75)\/quantile(revenue, 0.25), # Interquartile ratio<br>  palma    = sum(head(sort(rev_pct, decreasing = TRUE), floor(.N*.1)))\/sum(head(sort(rev_pct), floor(.N*.4))), # Palma ratio: top 10% share \/ bottom 40% share<\/pre>\n<pre>  # Concentration ratios and shares<br>  top1_share  = max(rev_pct),                         # Share of top product<br>  top3_share  = sum(head(sort(rev_pct, decreasing = TRUE), 3)),  # CR3<br>  top5_share  = sum(head(sort(rev_pct, decreasing = TRUE), 5)),  # CR5<br>  top10_share = sum(head(sort(rev_pct, decreasing = TRUE), 10)), # CR10<br>  top20_share = sum(head(sort(rev_pct, decreasing = TRUE), floor(.N*.2))),  # Top 20% share<br>  mid40_share = sum(sort(rev_pct, decreasing = TRUE)[(floor(.N*.2) + 1):floor(.N*.6)]), # Middle 40% share<br>  bottom40_share = sum(head(sort(rev_pct), floor(.N*.4))),      # Bottom 40% share (ascending sort, so head() takes the smallest)<br>  bottom20_share = sum(head(sort(rev_pct), floor(.N*.2))),      # Bottom 20% share<\/pre>\n<p>Basic statistics, quantiles.<\/p>\n<pre> # Basic statistics<br>  unique_products = .N,                              # Number of unique products<br>  revenue_total = sum(revenue),                      # Total revenue<br>  mean_revenue = mean(revenue),                      # Mean revenue per product<br>  median_revenue = median(revenue),                  # Median revenue<br>  revenue_sd = sd(revenue),                          # Revenue standard deviation<br><br>  # Quantile values<br>  q20 = quantile(revenue, 0.2),                      # 20th percentile<br>  q40 = quantile(revenue, 0.4),                      # 40th percentile<br>  q60 = quantile(revenue, 0.6),                      # 60th percentile<br>  q80 = quantile(revenue, 0.8),                      # 80th percentile<\/pre>\n<p>Count measures.<\/p>\n<pre> # Count measures<br>  above_mean_n = sum(revenue &gt; mean(revenue)),        # Number of products above mean<br>  above_2mean_n = sum(revenue &gt; 2*mean(revenue)),     # Number of products above 2x mean<br>  top_quartile_n = sum(revenue &gt; quantile(revenue, 0.75)), # Number of products in top 
quartile<br>  zero_revenue_n = sum(revenue == 0),                 # Number of products with zero revenue<br>  within_1sd_n = sum(abs(revenue - mean(revenue)) &lt;= sd(revenue)),    # Products within 1 SD<br>  within_2sd_n = sum(abs(revenue - mean(revenue)) &lt;= 2*sd(revenue)),  # Products within 2 SD<\/pre>\n<p>Revenue above (or below) the threshold.<\/p>\n<pre>  # Revenue above threshold<br>  rev_above_mean = sum(revenue[revenue &gt; mean(revenue)])  # Revenue from products above mean<br>), month]<\/pre>\n<p>The resulting table has 40 columns, and 72 rows (months).<\/p>\n<figure><img data-recalc-dims=\"1\" decoding=\"async\" alt=\"\" src=\"https:\/\/i0.wp.com\/cdn-images-1.medium.com\/max\/566\/1%2AInr7l_VcHVZ06golm7617w.png?ssl=1\"><\/figure>\n<p>As mentioned earlier, it is hard to imagine anyone working with 40 ratios at once, so I am showing how to calculate them, and you should pick the relevant ones. As always, it is good to visualize them and see how they relate to each\u00a0other.<\/p>\n<figure><img data-recalc-dims=\"1\" decoding=\"async\" alt=\"\" src=\"https:\/\/i0.wp.com\/cdn-images-1.medium.com\/max\/1024\/1%2AtmHk8wkHqUxRy7hqyKzn0A.png?ssl=1\"><figcaption>Selected ratios over\u00a0time<\/figcaption><\/figure>\n<p>We can calculate a correlation matrix between all ratios, or a selected\u00a0subset.<\/p>\n<pre># Select key metrics for a clearer visualization<br>key_metrics &lt;- c(\"gini\", \"hhi\", \"atkinson\", \"theil\", \"entropy\", \"hoover\", <br>                 \"top1_share\", \"top3_share\", \"top5_share\", \"unique_products\")<br><br>cor_matrix &lt;- x3_ineq_dt[, .SD, .SDcols = key_metrics] |&gt; cor()<\/pre>\n<p>Change column names to more friendly\u00a0names.<\/p>\n<pre># Make variable names more readable<br># (top1\/3\/5_share are shares of the top 1, 3 and 5 products, not percentiles)<br>pretty_names &lt;- c(<br>  \"Gini\", \"HHI\", \"Atkinson\", \"Theil\", \"Entropy\", \"Hoover\",<br>  \"Top 1\", \"Top 3\", \"Top 5\", \"Products\"<br>)<br>colnames(cor_matrix) &lt;- rownames(cor_matrix) &lt;- 
pretty_names<\/pre>\n<p>And render the\u00a0plot.<\/p>\n<pre>corrplot::corrplot(cor_matrix, <br>         type = \"upper\",<br>         method = \"color\",<br>         tl.col = \"black\",<br>         tl.srt = 45,<br>         diag = FALSE,<br>         order = \"AOE\") <\/pre>\n<figure><img data-recalc-dims=\"1\" decoding=\"async\" alt=\"\" src=\"https:\/\/i0.wp.com\/cdn-images-1.medium.com\/max\/800\/1%2AiyvX_gR04Kfv7t1ahCc6KA.jpeg?ssl=1\"><figcaption>Correlation matrix, selected\u00a0ratios<\/figcaption><\/figure>\n<p>And then we can plot some interesting pairs. Of course, some of them have positive or negative correlation <em>by definition<\/em>, while in other cases it is not that\u00a0obvious.<\/p>\n<figure><img data-recalc-dims=\"1\" decoding=\"async\" alt=\"\" src=\"https:\/\/i0.wp.com\/cdn-images-1.medium.com\/max\/1024\/1%2As1ctrG2MEWkMYWyRtjOtHA.png?ssl=1\"><figcaption>Selected ratios, with negative correlation<\/figcaption><\/figure>\n<h3>Show me the\u00a0money<\/h3>\n<p>We started the analysis with ratios and the Lorenz curve as a top-down overview. It is a good start, but there are two complications\u200a\u2014\u200athe ratios have a relatively broad range when the business is doing ok, and there is hardly any connection to actionable insights. Even if we notice that the ratio is on the edge, or outside of the safe range, it is unclear what we should do. And instructions like \u201cdecrease concentration\u201d are a little ambiguous.<\/p>\n<p>E-commerce talks and breathes products, so to make the analysis relatable, we need to refer to particular products. People would also like to understand which products constitute the core 50% or 80% of revenue and, equally important, whether these products stay consistently among the top contributors.<\/p>\n<p>Let us take one month, August 2024, and see which products contributed to 50% of revenue in that month. Then, we check revenue from these exact products in other months. 
There are 5 products generating (at least) 50% of revenue in\u00a0August.<\/p>\n<figure><img data-recalc-dims=\"1\" decoding=\"async\" alt=\"\" src=\"https:\/\/i0.wp.com\/cdn-images-1.medium.com\/max\/1024\/1%2Ae8ZHqjLaijjb899rKiqRWw.png?ssl=1\"><figcaption>Products revenue, facets by\u00a0product<\/figcaption><\/figure>\n<p>We can also render a more visually appealing plot with a streamgraph. Both plots show the exact same dataset, but they complement each other nicely\u200a\u2014\u200abar plots for precision, and the streamgraph for a\u00a0story.<\/p>\n<figure><img data-recalc-dims=\"1\" decoding=\"async\" alt=\"\" src=\"https:\/\/i0.wp.com\/cdn-images-1.medium.com\/max\/1024\/1%2Ao235vagWj4DgV5wAU22JEQ.png?ssl=1\"><figcaption>Products revenue, stream\u00a0graph<\/figcaption><\/figure>\n<p>The red line indicates the selected month. If you feel an \u201citch\u201d to shift that line, like on an old-fashioned radio, you are absolutely right\u200a\u2014\u200athat should be an interactive chart, and actually it is, along with a slider for revenue share percentage (we produced it for a\u00a0Client).<\/p>\n<p>So what if we shift that red \u2018tuning line\u2019 a little bit backwards, maybe to 2020? The logic in data preparation is very similar\u200a\u2014\u200aget products contributing to a certain revenue share threshold, and check the revenue from these products in other\u00a0months.<\/p>\n<figure><img data-recalc-dims=\"1\" decoding=\"async\" alt=\"\" src=\"https:\/\/i0.wp.com\/cdn-images-1.medium.com\/max\/1024\/1%2A9d3PGlyU3c7BGbwpkXTGKw.png?ssl=1\"><figcaption>Products revenue, stream\u00a0graph<\/figcaption><\/figure>\n<p>With interactivity on two elements\u200a\u2014\u200arevenue contribution percentage and the date, one can learn a lot about the business, and this is exactly the point of these charts. 
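<\/p>\n<p>That data preparation step can be sketched as follows. This is a minimal illustration on toy numbers, not the client code: toy_dt, the column product, and the helper top_products() are hypothetical names, and the table is assumed to hold one row per product and month, like x3a_dt.<\/p>\n<pre>library(data.table)<br><br>#-- toy data, made up for illustration only<br>toy_dt &lt;- data.table(<br>  month   = rep(c(\"2024-07\", \"2024-08\"), each = 4),<br>  product = rep(c(\"A\", \"B\", \"C\", \"D\"), times = 2),<br>  revenue = c(50, 30, 15, 5,  40, 35, 20, 5)<br>)<br><br>#-- hypothetical helper: smallest set of top products reaching the threshold<br>top_products &lt;- function(dt, sel_month, threshold = 0.5) {<br>  m_dt &lt;- dt[month == sel_month][order(-revenue)]<br>  m_dt[, cum_share := cumsum(revenue) \/ sum(revenue)]<br>  #-- keep products up to and including the one that crosses the threshold<br>  n_keep &lt;- which(m_dt$cum_share &gt;= threshold)[1]<br>  m_dt$product[seq_len(n_keep)]<br>}<br><br>core &lt;- top_products(toy_dt, \"2024-08\", threshold = 0.5)<br><br>#-- revenue of these exact products in every month<br>core_dt &lt;- toy_dt[product %in% core]<\/pre>\n<p>Changing sel_month or threshold in this sketch corresponds to moving the red line or the revenue share slider in the interactive version.<\/p>\n<p>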
One can look from different angles:<\/p>\n<ul>\n<li>concentration, how many products do we need for certain revenue threshold,<\/li>\n<li>products themselves, do they stay in certain revenue contribution bin, or do they change and why? Is it seasonality, a valid replacement, lost supplier or something else?<\/li>\n<li>time window, whether we look at one month or a whole\u00a0year,<\/li>\n<li>seasonality, comparing similar time of a year with previous\u00a0periods.<\/li>\n<\/ul>\n<h3>Summary<\/h3>\n<h4>What the Data Tells\u00a0Us<\/h4>\n<p>Our 6-year dataset reveals the evolution of an e-commerce business from high concentration to balanced growth. Here are the key patterns and\u00a0lessons:<\/p>\n<p>With 6 years of data, I had a unique chance to watch concentration metrics evolve as the business grew. Starting with just a handful of products, I saw exactly what you\u2019d expect\u200a\u2014\u200asky-high concentration. But as new products entered the mix, things got more interesting. The business found its rhythm with a dozen or so top performers, and the HHI settled into a comfortable 700\u2013800\u00a0range.<\/p>\n<p>Here\u2019s something fascinating I discovered: concentration and inequality might sound like twins, but they\u2019re more like distant cousins. I noticed this while comparing HHI against Lorenz curves and their Gini ratios. Trust me, you\u2019ll want to get comfortable with the math before explaining these patterns to stakeholders\u200a\u2014\u200athey\u2019ll smell uncertainty from a mile\u00a0away.<\/p>\n<p>Want to really understand these metrics? Do what I did: create a dummy dataset so simple it\u2019s almost embarrassing. I\u2019m talking basic patterns that a fifth-grader could grasp. Sounds like overkill? Maybe, but it saved me countless hours of head-scratching and misinterpretation. Keep these examples in your back pocket\u200a\u2014\u200aor better yet, share them upfront. 
Nothing builds confidence like showing you\u2019ve done your homework.<\/p>\n<p>Look, calculating these ratios isn\u2019t rocket science. The real magic happens when you dig into how each product contributes to your revenue. That\u2019s why I added the \u201cshow me the money\u201d section\u200a\u2014\u200aI don\u2019t believe in quick fixes or magic formulas. It\u2019s about rolling up your sleeves and understanding how each product really\u00a0behaves.<\/p>\n<p>As you\u2019ve probably noticed yourself, these streamgraphs I showed you are practically begging for interactivity. And boy, does that add value! Once you\u2019ve got your keys and joins sorted out, it\u2019s not even that complicated. Give your users an interactive tool, and suddenly you\u2019re not drowning in one-off questions anymore\u200a\u2014\u200athey\u2019re finding insights themselves.<\/p>\n<p>Here\u2019s a pro tip: use this concentration analysis as your foot in the door with stakeholders. Show your product teams that streamgraph, and I guarantee their eyes will light up. When they start asking for interactive versions, you\u2019ve got them hooked. The best part? They\u2019ll think it was their idea all along. That\u2019s how you get real adoption\u200a\u2014\u200aby letting them discover the value themselves.<\/p>\n<h4>Data Engineering Takeaways<\/h4>\n<p>While quite often we generally know what to expect in a dataset, it is almost guaranteed that there will be some nuances, exceptions, or maybe even surprises. It\u2019s good to spend some time reviewing datasets, using dedicated functions (like str, glimpse in R), looking for empty fields, outliers, but also simply scrolling through to understand the data. 
I like comparisons, and in this case, I\u2019d compare it to smelling fish at a market before jumping to prepare sushi\u00a0\ud83d\ude42<\/p>\n<p>Then, if we work with a raw data export, quite likely there will be many columns in the data dump; after all, if we click \u2018export all\u2019, wouldn\u2019t we expect exactly that? For most analyses we will need only a subset of these columns, so it\u2019s good to trim and keep only what we need. I assume we work with a script, so if it turns out we need more, it is not an issue: just add the missing column and rerun that\u00a0chunk.<\/p>\n<p>In the dataset dump the timestamp appeared in only one row per transaction, while we needed it for each product. Hence some light data wrangling to propagate these timestamps to all the products.<\/p>\n<p>After cleaning the dataset, it\u2019s important to consider the context of analysis, including the questions to be answered and the necessary changes to the data. This \u201ccontextual cleaning\/wrangling\u201d is critical since it determines whether the analysis succeeds or fails. In our situation, the goal was to analyse product concentration, therefore filtering out variants (size, colour, etc.) was essential. If we had skipped that, the outcome would have been radically different.<\/p>\n<p>Quite often we can expect some \u201ctraps\u201d, where initially it seems we can apply a simple approach, while actually we should add a bit of sophistication. One example is the Lorenz curve, where we need to calculate how many products we need to reach a certain revenue threshold. This is where I use rolling joins, which fit here perfectly.<\/p>\n<p>The core logic to produce streamgraphs is to find the products which constitute a certain revenue percentage in a given month, then \u201cfreeze\u201d them and get their revenue in other months. 
The toolset I used was an extra column with a product number, added after sorting within each month, and then playing with keys and\u00a0joins.<\/p>\n<p>An important element of this analysis was adding interactivity, allowing users to play with some parameters. That raises the bar, as we need all these operations to be performed lightning fast. The ingredients we need are the right data structure, additional columns, and proper keys and joins. Precalculate as much as possible in a data warehouse, so the dashboarding tool is not overloaded, and take caching into\u00a0account.<\/p>\n<h4>How to\u00a0Start?<\/h4>\n<p>Strike a balance between delivering what stakeholders request and exploring potentially valuable insights they haven\u2019t asked for yet. The analysis I presented follows this pattern\u200a\u2014\u200agetting initial concentration ratios is straightforward, while building an interactive streamgraph optimized for lightning-fast operation requires significant effort.<\/p>\n<p>Start small and engage others. Share basic findings, discuss what you could learn together, and only then proceed with more labor-intensive analysis once you\u2019ve secured genuine interest. And always maintain a solid grip on your raw data\u200a\u2014\u200ait\u2019s invaluable for answering those inevitable ad-hoc questions quickly.<\/p>\n<p>Building a prototype before full production allows for validation of interest and feedback without devoting too much time. 
In my case, such simple concentration ratios sparked debates that eventually led to the more advanced interactive studies on which stakeholders rely\u00a0today.<\/p>\n<figure><img data-recalc-dims=\"1\" decoding=\"async\" alt=\"\" src=\"https:\/\/i0.wp.com\/cdn-images-1.medium.com\/max\/1024\/1%2ANY0rj41RpxDXdtECXz-Ykw.png?ssl=1\"><figcaption>Start small, secure genuine interest\u00a0\u2026\u00a0:-)) \/ image generated by DALL-E, based on author\u2019s\u00a0prompt.<\/figcaption><\/figure>\n<h3>Appendix\u200a\u2014\u200adata preparation &amp; wrangling<\/h3>\n<p>I\u2019ll show you how I prepared the data at each step of this analysis. Since I used R, I\u2019ll include the actual code snippets\u200a\u2014\u200athey\u2019ll help you get started faster, even if you\u2019re working in a different language. This is the code I used for the study, though you\u2019ll probably need to adapt it to your specific needs rather than just copying it over. I decided to keep the code separate from the main analysis, to make it more streamlined and readable for both technical and business\u00a0users.<\/p>\n<p>While I am presenting analysis based on Shopify export, there is no limitation for a particular platform, we just need transactions data.<\/p>\n<h4>Shopify export<\/h4>\n<p>Let\u2019s start with getting our data from Shopify. The raw export needs some work before we can dive into concentration analysis\u200a\u2014\u200ahere\u2019s what I had to deal with\u00a0first.<\/p>\n<p>We start with export of raw transactions data from Shopify. It might take some time, and when ready, we get an email with links to download.<\/p>\n<pre>#-- 0. 
libs<br>pacman::p_load(data.table)<br><br>#-- 1.1 load data; the csv files are what we get as a full export from Shopify<br>xs1_dt &lt;- fread(file = \"shopify_raw\/orders_export_1.csv\")<br>xs2_dt &lt;- fread(file = \"shopify_raw\/orders_export_2.csv\")<br>xs3_dt &lt;- fread(file = \"shopify_raw\/orders_export_3.csv\")<\/pre>\n<p>Once we have the data, we need to combine these files into one dataset, trim columns and perform some cleansing.<\/p>\n<pre>#-- 1.2 check all columns, limit them to essential (for this analysis) and bind into one data.table<br>xs1_dt |&gt; colnames()<br># there are 79 columns in full export, <br># so we select a subset, relevant for this analysis<br>sel_cols &lt;- c(\"Name\", \"Email\", \"Paid at\", \"Fulfillment Status\", \"Accepts Marketing\", \"Currency\", \"Subtotal\",<br>              \"Lineitem quantity\", \"Lineitem name\", \"Lineitem price\", \"Lineitem sku\", \"Discount Amount\",<br>              \"Billing Province\", \"Billing Country\")<br><br>#-- combine into one data.table, with a subset of columns<br>#-- (chained with [ ] so that only data.table needs to be loaded)<br>xs_dt &lt;- data.table::rbindlist(l = list(xs1_dt, xs2_dt, xs3_dt), <br>    use.names = T, fill = T, idcol = T)[, ..sel_cols]<\/pre>\n<p>Some data preparation.<\/p>\n<pre>#-- 2. 
data prep<br>#-- 2.1 replace spaces in column names, for easier handling<br>sel_cols_new &lt;- sel_cols |&gt; stringr::str_replace(pattern = \" \", replacement = \"_\")<br>setnames(xs_dt, old = sel_cols, new = sel_cols_new)<br><br>#-- 2.2 transaction as integer<br>xs_dt[, `:=` (Transaction_id = stringr::str_remove(Name, pattern = \"#\") |&gt; as.integer())]<\/pre>\n<p>Anonymize emails, as we don\u2019t need\/want to deal with real emails during analysis.<\/p>\n<pre>#-- 2.3 anonymize email <br>new_cols &lt;- c(\"Email_hash\")<br>xs_dt[, (new_cols) := .(digest::digest(Email, algo = \"md5\")), .I]<\/pre>\n<p>Change column types; this depends on personal preferences.<\/p>\n<pre>#-- 2.4 change Accepts_Marketing to logical column<br>xs_dt[, `:=` (Accepts_Marketing_lgcl = fcase(<br>    Accepts_Marketing == \"yes\", TRUE, <br>    Accepts_Marketing == \"no\", FALSE, <br>    default = NA))]<\/pre>\n<p>Now we focus on transactions dataset. In the export files, the transaction amount and timestamp is in only one row per all items in the basket. We need to get these timestamps and propagate to all\u00a0items.<\/p>\n<pre>#-- 3 transactions dataset<br>#-- 3.1 subset transactions<br>#-- limit columns to essential for transaction only<br>trans_sel_cols &lt;- c(\"Transaction_id\", \"Email_hash\", \"Paid_at\", <br>  \"Subtotal\", \"Currency\", \"Billing_Province\", \"Billing_Country\")<br><br>#-- get transactions table based on requirement of non-null payment - as payment (date, amount) is not for all products, it is only once per basket<br>xst_dt &lt;- xs_dt[!is.na(Paid_at) &amp; !is.na(Transaction_id), ..trans_sel_cols]<\/pre>\n<pre>#-- date columns<br>xst_dt[, `:=` (date = as.Date(`Paid_at`))]<br>xst_dt[, `:=` (month = lubridate::floor_date(date, unit = \"months\"))]<\/pre>\n<p>Some extra information, as I call them, <em>derivatives<\/em>.<\/p>\n<pre>#-- 3.2 is user returning? 
their n-th transaction<br>setkey(xst_dt, Paid_at)<br>xst_dt[, `:=` (tr_n = 1)][, `:=` (tr_n = cumsum(tr_n)), Email_hash]<br><br>xst_dt[, `:=` (returning = fcase(tr_n == 1, FALSE, default = TRUE))]<\/pre>\n<p>Do we have any NA\u2019s in the\u00a0dataset?<\/p>\n<pre>xst_dt[!complete.cases(xst_dt), ]<\/pre>\n<p>Products dataset.<\/p>\n<pre>#-- 4 products dataset<br>#-- 4.1 subset of columns<br>sel_prod_cols &lt;- c(\"Transaction_id\", \"Lineitem_quantity\", \"Lineitem_name\", <br>  \"Lineitem_price\", \"Lineitem_sku\", \"Discount_Amount\")<\/pre>\n<p>Now we join these two datasets, to have transaction characteristics (trans_sel_cols) for all the products.<\/p>\n<pre>#-- 5 join two datasets<br>list(xs_dt, xst_dt) |&gt; lapply(setkey, Transaction_id)<br>x3_dt &lt;- xs_dt[, ..sel_prod_cols][xst_dt]<\/pre>\n<p>Let\u2019s check which columns we have in x3_dt\u00a0dataset.<\/p>\n<figure><img data-recalc-dims=\"1\" decoding=\"async\" alt=\"\" src=\"https:\/\/i0.wp.com\/cdn-images-1.medium.com\/max\/1024\/1%2Azy7rfApUuDEUhyX9UGfxtQ.png?ssl=1\"><\/figure>\n<p>And it is also a moment to inspect the\u00a0dataset.<\/p>\n<pre>x3_dt |&gt; str()<br>x3_dt |&gt; dplyr::glimpse()<br>x3_dt |&gt; head()<\/pre>\n<p>Time for data cleaning. First up: splitting the Lineitem_name into base products and their variants. In theory, these are separated by a dash (\u201c-\u201d). Simple, right? Not quite\u200a\u2014\u200asome product names, like \u2018All-Purpose\u2019, contain dashes as part of their name. So we need to handle these special cases first, temporarily replacing problematic dashes, doing the split, and then restoring the original product\u00a0names.<\/p>\n<pre>#-- 6. 
cleaning, aggregation on product names<br>#-- 6.1 split product name into base and variants<br>product_cols &lt;- c(\"base_product\", \"variants\")<br>#-- with special treatment for 'all-purpose'<br>x3_dt[stringr::str_detect(string = Lineitem_name, pattern = \"All-Purpose\"), <br>  (product_cols) := {<br>    tmp = stringr::str_replace(Lineitem_name, \"All-Purpose\", \"AllPurpose\")<br>    s = stringr::str_split_fixed(tmp, pattern = \"[-\/]\", n = 2)<br>    s = stringr::str_replace(s, \"AllPurpose\", \"All-Purpose\")<br>    .(s[1], s[2])<br>  }, .I]<\/pre>\n<p>It is good to validate after each\u00a0step.<\/p>\n<pre># validation<br>x3_dt[stringr::str_detect(<br>  string = Lineitem_name, pattern = \"All-Purpose\"), .SD, <br>  .SDcols = c(\"Transaction_id\", \"Lineitem_name\", product_cols)]<\/pre>\n<p>We keep moving with data cleaning\u200a\u2014\u200athe exact steps depend of course on the particular dataset, but I share my flow as an\u00a0example.<\/p>\n<pre>#-- two scenarios, to cope with `(32-ounce)` in prod name; we don't want that hyphen to cut the name<br>x3_dt[stringr::str_detect(string = `Lineitem_name`, pattern = \"ounce\", negate = T) &amp; <br>  stringr::str_detect(string = `Lineitem_name`, pattern = \"All-Purpose\", negate = T), <br>  (product_cols) := {<br>    s = stringr::str_split_fixed(string = `Lineitem_name`, pattern = \"[-\/]\", n = 2); .(s[1], s[2])<br>  }, .I]<br><br>#-- note the doubled backslash: the regex needs an escaped \")\"<br>x3_dt[stringr::str_detect(string = `Lineitem_name`, pattern = \"ounce\", negate = F) &amp; <br>  stringr::str_detect(string = `Lineitem_name`, pattern = \"All-Purpose\", negate = T), <br>  (product_cols) := {<br>    s = stringr::str_split_fixed(string = `Lineitem_name`, pattern = \"\\\\) - \", n = 2); .(paste0(s[1], \")\"), s[2])<br>  }, .I]<br><br>#-- small patch for exceptions<br>x3_dt[stringr::str_detect(string = base_product, pattern = \"\\\\)\\\\)$\", negate = F), <br>  base_product := stringr::str_replace(string = base_product, 
pattern = \"\\\\)\\\\)$\", replacement = \")\")]<\/pre>\n<p>Validation.<\/p>\n<pre># validation<br>x3_dt[stringr::str_detect(string = `Lineitem_name`, pattern = \"ounce\")<br>  ][, .SD, .SDcols = c(eval(sel_cols[6]), product_cols)<br>  ][, .N, c(eval(sel_cols[6]), product_cols)]<br><br>x3_dt[stringr::str_detect(string = `Lineitem_name`, pattern = \"All\")<br>  ][, .SD, .SDcols = c(eval(sel_cols[6]), product_cols)<br>  ][, .N, c(eval(sel_cols[6]), product_cols)]<br><br>x3_dt[stringr::str_detect(string = base_product, pattern = \"All\")]<\/pre>\n<p>We use eval(sel_cols[6]) to get the name of the column stored in sel_cols[6], which is Currency.<\/p>\n<p>We also need to deal with NA\u2019s, but with an understanding of the dataset\u200a\u2014\u200awhere we could legitimately have NA\u2019s and where they are not supposed to be, indicating an issue. In some columns, like `Discount_Amount`, we have values (actual discount), zeros, but also sometimes NA\u2019s. Checking the final price, we conclude they are\u00a0zeros.<\/p>\n<pre>#-- deal with NA's - replace them with 0<br>sel_na_cols &lt;- c(\"Discount_Amount\")<br>x3_dt[, (sel_na_cols) := lapply(.SD, fcoalesce, 0), .SDcols = sel_na_cols]<\/pre>\n<p>For consistency and convenience, we change all column names to lowercase.<\/p>\n<pre>setnames(x3_dt, tolower(names(x3_dt)))<\/pre>\n<p>And verification.<\/p>\n<figure><img data-recalc-dims=\"1\" decoding=\"async\" alt=\"\" src=\"https:\/\/i0.wp.com\/cdn-images-1.medium.com\/max\/1024\/1%2AaPOuZvk3rrOR9j89VLpiAw.png?ssl=1\"><\/figure>\n<p>Of course, review the dataset with some test aggregations, and also by just printing it\u00a0out.<\/p>\n<p>Save the dataset as both Rds (native R format) and\u00a0csv.<\/p>\n<pre>x3_dt |&gt; fwrite(file  = \"data\/products.csv\")<br>x3_dt |&gt; saveRDS(file = \"data\/x3_dt.Rds\")<\/pre>\n<p>After the steps above we should have a clean dataset for further analysis. 
The code should serve as a guideline, but also can be used directly, if you work in\u00a0R.<\/p>\n<h4>Versions<\/h4>\n<p>As a first glimpse, we will check number of products per month, both base_product, and including all versions.<\/p>\n<p>As a small cleaning, I take only complete\u00a0months.<\/p>\n<pre>month_last &lt;- x3_dt[, max(month)] - months(1)<\/pre>\n<p>Then we count monthly numbers, storing in temporary table, which are then\u00a0joined.<\/p>\n<pre>x3_a_dt &lt;- x3_dt[month &lt;= month_last, .N, .(base_product, month)<br>  ][, .(base_products = .N), keyby = month]<br><br>x3_b_dt &lt;- x3_dt[month &lt;= month_last, .N, .(lineitem_name, month)<br>  ][, .(products = .N), keyby = month]<br><br>x3_c_dt &lt;- x3_a_dt[x3_b_dt]<\/pre>\n<figure><img data-recalc-dims=\"1\" decoding=\"async\" alt=\"\" src=\"https:\/\/i0.wp.com\/cdn-images-1.medium.com\/max\/1024\/1%2A5GCjLL37Bjv5oZVT3sfpoA.png?ssl=1\"><\/figure>\n<p>Some data wrangling.<\/p>\n<pre>#-- names, as we want them on plot<br>setnames(x3_c_dt, old = c(\"base_products\", \"products\"), new = c(\"base\", \"all, with variants\"))<br><br>#-- long form<br>x3_d_dt &lt;- x3_c_dt[, melt.data.table(.SD, id.vars = \"month\", variable.name = \"Products\")]<br><br>#-- reverse factors, so they appear on plot in a proper order<br>x3_d_dt[, `:=` (Products = forcats::fct_rev(Products))]<\/pre>\n<figure><img data-recalc-dims=\"1\" decoding=\"async\" alt=\"\" src=\"https:\/\/i0.wp.com\/cdn-images-1.medium.com\/max\/1024\/1%2AaX9j3L4UjJbQLrZyb7HeZQ.png?ssl=1\"><\/figure>\n<p>We are ready to plot the\u00a0dataset.<\/p>\n<pre>plot_01_w &lt;- x3_d_dt |&gt;<br>  ggplot(aes(month, value, color = Products, fill = Products)) +<br>  geom_line(show.legend = FALSE) +<br>  geom_area(alpha = .8, position = position_dodge()) +<br>  theme_bw() +<br>  scale_fill_viridis_d(direction = -1, option = \"G\", begin = 0.3, end = .7) +<br>  scale_color_viridis_d(direction = -1, option = \"G\", begin = 0.3, end = .7) +<br>  labs(x = \"\", y = 
\"Products\", <br>    title = \"Unique products, monthly\", subtitle = \"Impact of aggregation\") +<br>  theme(... additional styling)<\/pre>\n<p>The next plot shows the number of variants grouped into bins. This gives us a chance to talk about chaining operations in R, particularly with the data.table package. In data.table, we can chain operations by opening a new bracket right after closing one\u200a\u2014\u200aresulting in ][ syntax. It creates a compact, readable chain that\u2019s still easy to debug since you can execute it piece by piece. I prefer succinct code, but that\u2019s just my style\u200a\u2014\u200ause whatever approach works best for you. We can write code in one line, or multi-line, with logical\u00a0steps.<\/p>\n<p>On one of the plots we look at a date, when each product was first seen. To get that date, we set a key on date, and then take the first occurrence date[1] per each base_product.<\/p>\n<pre>#-- versions per year, product, with a date, when it was 1st seen<br>x3c_dt &lt;- x3_dt[, .N, .(base_product, variants)<br>  ][, .(variants = .N), base_product][order(-variants)]<br><br>x3_dt |&gt; setkey(date)<br>x3d_dt &lt;- x3_dt[, .(date = date[1]), base_product]<\/pre>\n<pre>list(x3c_dt, x3d_dt) |&gt; lapply(setkey, base_product)<br><br>x3e_dt &lt;- x3c_dt[x3d_dt][order(variants)<br>  ][, `:=` (year = year(date) |&gt; as.factor())][year != 2018<br>  ][, .(products = .N), .(variants, year)][order(-variants)<br>  ][, `:=` (<br>    variant_bin = cut(<br>      variants,<br>      breaks = c(0, 1, 2, 5, 10, 20, 100, Inf),<br>      include.lowest = TRUE,<br>      right = FALSE<br>    ))<br>  ][, .(total_products = sum(products)), .(variant_bin, year)<br>  ][order(variant_bin)<br>  ][, `:=` (year_group = fcase(<br>    year %in% c(2019, 2020, 2021), \"2019-2021\",<br>    year %in% c(2022, 2023, 2024), \"2022-2024\"<br>  ))<br>  ][, `:=` (variant_bin = forcats::fct_rev(variant_bin))]<\/pre>\n<p>The resulting table is exactly as we need it for 
charting.<\/p>\n<figure><img data-recalc-dims=\"1\" decoding=\"async\" alt=\"\" src=\"https:\/\/i0.wp.com\/cdn-images-1.medium.com\/max\/1024\/1%2AWlKAv8CwioGfl_q1yf4fTw.png?ssl=1\"><\/figure>\n<p>The second plot uses transaction date, so the data wrangling is similar, but without date[1]\u00a0step.<\/p>\n<p>If we want to have a couple of plots combined, we can produce them separately, and combine using for example ggpubr::ggarrange() or we can blend tables into one dataset and then use faceting functionality. The former is when plots are of completely different nature, while latter is useful, when we can naturally have combined\u00a0dataset.<\/p>\n<p>As an example, few more lines from my\u00a0script.<\/p>\n<pre>x3h_dt &lt;- data.table::rbindlist(<br>  l = list(<br>    introduction = x3e_dt[, `:=` (year = as.numeric(as.character(year)))], <br>    transaction  = x3g_dt), <br>  use.names = T, fill = T, idcol = T)<\/pre>\n<figure><img data-recalc-dims=\"1\" decoding=\"async\" alt=\"\" src=\"https:\/\/i0.wp.com\/cdn-images-1.medium.com\/max\/1024\/1%2AGbIeipanxC15SQGKoRWVVg.png?ssl=1\"><\/figure>\n<p>And a plot\u00a0code.<\/p>\n<pre>plot_04_w &lt;- x3h_dt |&gt;<br>  ggplot(aes(year, total_products, <br>    color = variant_bin, fill = variant_bin, group = .id)) +<br>  geom_col(alpha = .8) +<br>  theme_bw() +<br>  scale_fill_viridis_d(direction = 1, option = \"G\") +<br>  scale_color_viridis_d(direction = 1, option = \"G\") +<br>  labs(x = \"\", y = \"Base Products\", <br>    title = \"Products, and their variants\", <br>    subtitle = \"Yearly\",<br>    fill = \"Variants\",    <br>    color = \"Variants\") +<br>  facet_wrap(\".id\", ncol = 2) +<br>  theme(... other styling options)<\/pre>\n<p>Faceting has massive advantage, because we operate on one table, which helps a lot in assuring data consistency.<\/p>\n<h4>Pareto<\/h4>\n<p>The essence of Pareto calculation is to find how many products do we need to achieve certain revenue percentage. 
We need to prepare the dataset, in a couple of\u00a0steps.<\/p>\n<pre>#-- calculate quantity and revenue per base_product, monthly<br>x3a_dt &lt;- x3_dt[, {<br>  items = sum(lineitem_quantity, na.rm = T); <br>  revenue = sum(lineitem_quantity * lineitem_price); <br>  .(items, revenue)}, keyby = .(month, base_product)<br>  ][, `:=` (i = 1)][order(-revenue)][revenue &gt; 0, ]<br><br>#-- calculate percentage share, and cumulative percentage<br>x3a_dt[, `:=` (<br>  rev_pct = revenue \/ sum(revenue), <br>  cum_rev_pct = cumsum(revenue) \/ sum(revenue), prod_n = cumsum(i)), month]<\/pre>\n<p>In case we\u2019d need to mask exact product names, let us create a new variable.<\/p>\n<pre>#-- products name masking<br>x3a_dt[, masked_name := paste(\"Product\", .GRP), by = base_product]<\/pre>\n<figure><img data-recalc-dims=\"1\" decoding=\"async\" alt=\"\" src=\"https:\/\/i0.wp.com\/cdn-images-1.medium.com\/max\/1024\/1%2A2_SeZMFX0b-L1H6Q6ELmEA.png?ssl=1\"><\/figure>\n<p>And dataset printout, with a subset of\u00a0columns.<\/p>\n<figure><img data-recalc-dims=\"1\" decoding=\"async\" alt=\"\" src=\"https:\/\/i0.wp.com\/cdn-images-1.medium.com\/max\/1024\/1%2AxMvF8pIgpnzTp6iLnkkmhg.png?ssl=1\"><\/figure>\n<p>And filtered for one month, showing few lines from top and from the\u00a0bottom.<\/p>\n<figure><img data-recalc-dims=\"1\" decoding=\"async\" alt=\"\" src=\"https:\/\/i0.wp.com\/cdn-images-1.medium.com\/max\/1024\/1%2AOCNYsMDvAs0NxFmN4b43Ng.png?ssl=1\"><\/figure>\n<p>The essential column is cum_rev_pct, which indicates cumulative percentage revenue from products 1-n. We need to find which prod_n covers revenue percentage threshold, as in the pct_thresholds_dt table.<\/p>\n<figure><img data-recalc-dims=\"1\" decoding=\"async\" alt=\"\" src=\"https:\/\/i0.wp.com\/cdn-images-1.medium.com\/max\/1024\/1%2AhiMmEfB5ulXitoyqiFyqRA.png?ssl=1\"><\/figure>\n<p>So we are ready for actual Pareto calculation. 
The code below, with comments.<\/p>\n<pre>#-- pareto<br>#-- set percentage thresholds<br>pct_thresholds_dt &lt;- data.table(cum_rev_pct = c(0, .2, .3, .5, .8, .95, 1))<br><br>#-- set key for join<br>list(x3a_dt, pct_thresholds_dt) |&gt; lapply(setkey, cum_rev_pct)<br><br>#-- subset columns (optional)<br>sel_cols &lt;- c(\"month\", \"cum_rev_pct\", \"prod_n\")<br><br>#-- perform a rolling join - crucial step!<br>x3b_dt &lt;- x3a_dt[, .SD[pct_thresholds_dt, roll = -Inf], month][, ..sel_cols]<\/pre>\n<p>Why do we perform a rolling join? We need to find the first cum_rev_pct to cover each threshold.<\/p>\n<figure><img data-recalc-dims=\"1\" decoding=\"async\" alt=\"\" src=\"https:\/\/i0.wp.com\/cdn-images-1.medium.com\/max\/1024\/1%2AlYeG09CtMzUeVqHhsaqNfw.png?ssl=1\"><\/figure>\n<p>We need 2 products for 20% revenue, 4 products for 30% and so on. And to have 100% revenue, of course we need contribution from all 72 products.<\/p>\n<p>And a\u00a0plot.<\/p>\n<pre>#-- data prep<br>x3b1_dt &lt;- x3b_dt[month &lt; month_max, <br>  .(month, cum_rev_pct = as.factor(cum_rev_pct) |&gt; forcats::fct_rev(), prod_n)]<br><br>#-- charting<br>plot_07_w &lt;- x3b1_dt |&gt;<br>  ggplot(aes(month, prod_n, color = cum_rev_pct, fill = cum_rev_pct)) +<br>  geom_line() +<br>  theme_bw() +<br>  geom_area(alpha = .2, show.legend  = F, position = position_dodge(width = 0)) +<br>  scale_fill_viridis_d(direction = -1, option = \"G\", begin = 0.2, end = .9) +<br>  scale_color_viridis_d(direction = -1, option = \"G\", begin = 0.2, end = .9,<br>                        labels = function(x) scales::percent(as.numeric(as.character(x)))  # Convert factor to numeric first<br>  ) +<br>  ... 
other styling options ...<\/pre>\n<h4>Lorenz curve<\/h4>\n<p>To plot the Lorenz curve, we need to sort products by their contribution to total revenue, and normalize both the number of products and\u00a0revenue.<\/p>\n<p>Before the main code, a handy method to pick the n-th month from the dataset, counting from the beginning or from the\u00a0end.<\/p>\n<pre>month_sel &lt;- x3a_dt$month |&gt; unique() |&gt; sort(decreasing = T) |&gt; dplyr::nth(2)<\/pre>\n<p>And the\u00a0code.<\/p>\n<pre>xl_oct24_dt &lt;- x3a_dt[month == month_sel, <br>  ][order(-revenue), .(<br>    cum_prod_pct = seq_len(.N)\/.N, <br>    cum_rev_pct = cumsum(revenue)\/sum(revenue))]<\/pre>\n<p>To chart a separate line for each time period, we need to modify the code accordingly.<\/p>\n<pre>#-- Lorenz curve, yearly aggregation<br>xl_dt &lt;- x3a_dt[order(-revenue), .(<br>    cum_prod_pct = seq_len(.N)\/.N, <br>    cum_rev_pct = cumsum(revenue)\/sum(revenue)), month]<\/pre>\n<figure><img data-recalc-dims=\"1\" decoding=\"async\" alt=\"\" src=\"https:\/\/i0.wp.com\/cdn-images-1.medium.com\/max\/1024\/1%2Arg-afsXUM77JY4tVMspyvg.png?ssl=1\"><\/figure>\n<p>The xl_dt dataset is ready for charting.<\/p>\n<h4>Indices, ratios<\/h4>\n<p>The code is straightforward here, assuming sufficient prior data preparation. The logic and some snippets are in the main body of this\u00a0article.<\/p>\n<h4>Streamgraph<\/h4>\n<p>The streamgraph shown earlier is an example of a chart that may appear difficult to render, especially when interactivity is required. One of the reasons I included it in this blog is to show how we can simplify tasks with keys, joins, and data.table syntax in particular. Using keys, we can achieve very effective filtering for interactivity. 
Once we have a handle on the data, we\u2019re virtually done; all that remains are some settings to fine-tune the\u00a0plot.<\/p>\n<p>We start with thresholds table.<\/p>\n<pre>#-- set percentage thresholds<br>pct_thresholds_dt &lt;- data.table(cum_rev_pct = c(0, .2, .3, .5, .8, .95, 1))<\/pre>\n<p>Since we want joins performed monthly, it is good to create a data subset covering one month, to test the logic, before extending for a full\u00a0dataset.<\/p>\n<pre>#-- test logic for one month<br>month_sel &lt;- as.Date(\"2020-01-01\")<br>sel_a_cols &lt;- c(\"month\", \"rev_pct\", \"cum_rev_pct\", \"prod_n\", \"masked_name\")<br>x3a1_dt &lt;- x3a_dt[month == month_sel, ..sel_a_cols]<\/pre>\n<figure><img data-recalc-dims=\"1\" decoding=\"async\" alt=\"\" src=\"https:\/\/i0.wp.com\/cdn-images-1.medium.com\/max\/1024\/1%2AxJO-Q3hCcC92WDgRWNAmJg.png?ssl=1\"><\/figure>\n<p>We have 23 products in January 2020, sorted by revenue percentage, and we also have cumulative revenue, reaching 100% with the last, 23rd\u00a0product.<\/p>\n<p>Now we need to create an intermediate table, telling us how many products do we need to achieve each revenue threshold.<\/p>\n<pre>#-- set key for join<br>list(x3a1_dt, pct_thresholds_dt) |&gt; lapply(setkey, cum_rev_pct)<br><br>#-- perform a rolling join - crucial step!<br>sel_b_cols &lt;- c(\"month\", \"cum_rev_pct\", \"prod_n\")<br>x3b1_dt &lt;- x3a1_dt[, .SD[pct_thresholds_dt, roll = -Inf], month][, ..sel_b_cols]<\/pre>\n<figure><img data-recalc-dims=\"1\" decoding=\"async\" alt=\"\" src=\"https:\/\/i0.wp.com\/cdn-images-1.medium.com\/max\/1024\/1%2AqM7tovjciF1EH1MVwEUG2A.png?ssl=1\"><\/figure>\n<p>Because we work with a one-month data subset (and picking month with not that many products), it is very easy to check the outcome\u200a\u2014\u200acomparing x3a1_dt and x3b1_dt\u00a0tables.<\/p>\n<p>And now we need to get products names, for selected threshold.<\/p>\n<pre>#-- get products<br>#-- set keys<br>list(x3a1_dt, x3b1_dt) |&gt; 
lapply(setkey, month, prod_n)<br><br>#-- specify threshold<br>x3b1_dt[cum_rev_pct == .8][x3a1_dt, roll = -Inf, nomatch = 0]<br><br>#-- or, an equivalent, specify table's row<br>x3b1_dt[5, ][x3a1_dt, roll = -Inf, nomatch = 0]<\/pre>\n<figure><img data-recalc-dims=\"1\" decoding=\"async\" alt=\"\" src=\"https:\/\/i0.wp.com\/cdn-images-1.medium.com\/max\/1024\/1%2A0P1WNBScfSivJ7aQf9vdcg.png?ssl=1\"><\/figure>\n<p>To achieve 80% revenue, we need 7 products, and from the join above, we get their\u00a0names.<\/p>\n<p>I think you already see why we use rolling joins and can\u2019t use simple &lt; or &gt;\u00a0logic.<\/p>\n<p>Now, we need to extend the logic to all\u00a0months.<\/p>\n<pre>#-- extend for all months<br><br>#-- set key for join<br>list(x3a_dt, pct_thresholds_dt) |&gt; lapply(setkey, cum_rev_pct)<br><br>#-- subset columns (optional)<br>sel_cols &lt;- c(\"month\", \"cum_rev_pct\", \"prod_n\")<br><br>#-- perform a rolling join - crucial step!<br>x3b_dt &lt;- x3a_dt[, .SD[pct_thresholds_dt, roll = -Inf], month][, ..sel_cols]<\/pre>\n<p>Get the products.<\/p>\n<pre>#-- set keys, join<br>list(x3a_dt, x3b_dt) |&gt; lapply(setkey, month, prod_n)<br>x3b6_dt &lt;- x3b_dt[cum_rev_pct == .8][x3a_dt, roll = -Inf, nomatch = 0][, ..sel_a_cols]<\/pre>\n<p>And verify, for the same month as in the test data\u00a0subset.<\/p>\n<figure><img data-recalc-dims=\"1\" decoding=\"async\" alt=\"\" src=\"https:\/\/i0.wp.com\/cdn-images-1.medium.com\/max\/1024\/1%2AQqXWWKuGlz02iDeE0stYcQ.png?ssl=1\"><\/figure>\n<p>If we want to freeze products for a certain month, and see revenue from them over the whole period (which is what the second streamgraph shows), we can set a key on product name and perform a\u00a0join.<\/p>\n<pre>#-- freeze products<br>x3b6_key_dt &lt;- x3b6_dt[month == month_sel, .(masked_name)]<br>list(x3a_dt, x3b6_key_dt) |&gt; lapply(setkey, masked_name)<br><br>sel_b2_cols &lt;- c(\"month\", \"revenue\", \"masked_name\")<br>x3a6_dt &lt;- x3a_dt[x3b6_key_dt][, ..sel_b2_cols]<\/pre>\n<p>And 
we get exactly what we\u00a0needed.<\/p>\n<figure><img data-recalc-dims=\"1\" decoding=\"async\" alt=\"\" src=\"https:\/\/i0.wp.com\/cdn-images-1.medium.com\/max\/1024\/1%2AOWFvf02hi4_YY9Q5OneNaw.png?ssl=1\"><\/figure>\n<p>Using joins, including rolling joins, and deciding what can be precalculated in a warehouse and what is left for dynamic filtering in a dashboard does require some practice, but it definitely pays\u00a0off.<\/p>\n<figure><img data-recalc-dims=\"1\" decoding=\"async\" alt=\"\" src=\"https:\/\/i0.wp.com\/cdn-images-1.medium.com\/max\/1022\/1%2AvC4zfvenDCAZ18zh7KtacQ.png?ssl=1\"><figcaption>Image generated by DALL-E, based on author\u2019s prompt, inspired by \u201cThe Bremen Town Musicians\u201d<\/figcaption><\/figure>\n<hr>\n<p><a href=\"https:\/\/medium.com\/towards-data-science\/inequality-in-practice-e-commerce-portfolio-analysis-adc3d0876acd\">Inequality in Practice: E-commerce Portfolio Analysis<\/a> was originally published in <a href=\"https:\/\/towardsdatascience.com\/\">Towards Data Science<\/a> on Medium, where people are continuing the conversation by highlighting and responding to this story.<\/p>\n<\/div>\n<p> \t<BR><br \/>\n <BR><\/BR><br \/>\n    Piotr Gruszecki<br \/>\n \t<BR><br \/>\n<BR><\/BR><br \/>\n<a href=\"https:\/\/medium.com\/towards-data-science\/inequality-in-practice-e-commerce-portfolio-analysis-adc3d0876acd\">Go to original source<\/a><br \/>\n \t<BR><br \/>\n <BR><\/BR><\/p>\n","protected":false},"excerpt":{"rendered":"<p>Inequality in Practice: E-commerce Portfolio Analysis From Mathematical Theory to Actionable Insights: A 6-Year Shopify Case\u00a0Study Image generated by DALL-E, based on author\u2019s prompt, inspired by \u201cThe Bremen Town Musicians\u201d Are your top-selling products making or breaking your business? 
It\u2019s terrifying to think your entire revenue might collapse if one or two products fall out [&hellip;]<\/p>\n","protected":false},"author":2,"featured_media":0,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[62,1579,67,1580,1577,1578],"tags":[232,1581,163],"class_list":["post-1585","post","type-post","status-publish","format-standard","hentry","category-aimldsaimlds","category-concentration","category-deep-dives","category-ecommerce","category-rstats","category-shopify","tag-analysis","tag-products","tag-your"],"_links":{"self":[{"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/posts\/1585"}],"collection":[{"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/users\/2"}],"replies":[{"embeddable":true,"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/comments?post=1585"}],"version-history":[{"count":0,"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/posts\/1585\/revisions"}],"wp:attachment":[{"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/media?parent=1585"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/categories?post=1585"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/tags?post=1585"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}