{"id":428,"date":"2024-12-07T07:01:16","date_gmt":"2024-12-07T07:01:16","guid":{"rendered":"https:\/\/mailitics.com\/index.php\/2024\/12\/07\/modeling-dau-with-markov-chain-640ea4fddeb4\/"},"modified":"2024-12-07T07:01:16","modified_gmt":"2024-12-07T07:01:16","slug":"modeling-dau-with-markov-chain-640ea4fddeb4","status":"publish","type":"post","link":"https:\/\/mailitics.com\/index.php\/2024\/12\/07\/modeling-dau-with-markov-chain-640ea4fddeb4\/","title":{"rendered":"Modeling DAU with Markov Chain"},"content":{"rendered":"<div>\n<h4>How to predict DAU using Duolingo\u2019s growth model and control the prediction<\/h4>\n<h3>1. Introduction<\/h3>\n<p>DAU, WAU, and MAU (daily, weekly, and monthly <a href=\"https:\/\/en.wikipedia.org\/wiki\/Active_users\">active users<\/a>) are critical business metrics. An article <a href=\"https:\/\/www.lennysnewsletter.com\/p\/how-duolingo-reignited-user-growth\">\u201cHow Duolingo reignited user growth\u201d<\/a> by <a href=\"https:\/\/www.linkedin.com\/in\/jorgemazal\/\">Jorge Mazal<\/a>, former CPO of Duolingo, is #1 in the Growth section of Lenny\u2019s Newsletter blog. In that article, Jorge paid special attention to the methodology Duolingo used to model the DAU metric (see another article <a href=\"https:\/\/blog.duolingo.com\/growth-model-duolingo\/\">\u201cMeaningful metrics: how data sharpened the focus of product teams\u201d<\/a> by <a href=\"https:\/\/blog.duolingo.com\/author\/erin\/\">Erin Gustafson<\/a>). This methodology has multiple strengths, but I\u2019d like to focus on how one can use this approach for DAU forecasting.<\/p>\n<p>The new year is coming soon, so many companies are now planning their budgets for the next year. Cost estimations often require DAU forecasts. 
In this article, I\u2019ll show how you can get this prediction using Duolingo\u2019s growth model. I\u2019ll explain why this approach is better compared to standard time-series forecasting methods and how you can adjust the prediction according to your teams\u2019 plans (e.g., marketing, activation, product\u00a0teams).<\/p>\n<p>The article text goes along with the code, and a simulated dataset is attached so the research is fully reproducible. The Jupyter notebook version is available <a href=\"https:\/\/github.com\/wowone\/wowone.github.io\/blob\/master\/posts\/2024-12-02_dau_prediction\/dau_prediction.ipynb\">here<\/a>. In the end, I\u2019ll share a DAU \u201ccalculator\u201d designed in Google Spreadsheet format.<\/p>\n<p>I\u2019ll be narrating on behalf of the collective \u201cwe\u201d as if we\u2019re talking together.<\/p>\n<h3>2. Methodology<\/h3>\n<p>A quick recap on how the <a href=\"https:\/\/blog.duolingo.com\/growth-model-duolingo\/\">Duolingo\u2019s growth model<\/a> works. At day d (d = 1, 2,\u00a0\u2026 ) of a user\u2019s lifetime, the user can be in one of the following 7 (mutually-exclusive) states: new, current, reactivated, resurrected, at_risk_wau, at_risk_mau, dormant. The states are defined according to indicators of whether a user was active today, in the last 7 days, or in the last 30 days. The definition summary is given in the table\u00a0below:<\/p>\n<figure><img data-recalc-dims=\"1\" decoding=\"async\" alt=\"\" src=\"https:\/\/i0.wp.com\/cdn-images-1.medium.com\/max\/1024\/1%2AtAPSb4Syw4hFteuao0T7PA.png?ssl=1\"><\/figure>\n<p>Having these states defined (as a set S), we can consider user behavior as a <a href=\"https:\/\/en.wikipedia.org\/wiki\/Markov_chain\">Markov chain<\/a>. Here\u2019s an example of a user\u2019s trajectory: new\u2192 current\u2192 current\u2192 at_risk_wau\u2192&#8230;\u2192 at_risk_mau\u2192&#8230;\u2192 dormant. 
Let M be the transition matrix associated with this Markov process: m_{i, j} = P(s_j | s_i) is the probability that a user moves to state s_j right after being in state s_i, where s_i, s_j \u2208 S. The matrix is inferred from the historical data.<\/p>\n<p>If we assume that user behavior is stationary (independent of time), the matrix M fully describes the states of all users in the future. Suppose that the vector u_0 of length 7 contains the number of users in each state on a given day, denoted as day 0. According to the Markov model, on the next day (day 1) we expect the following state counts\u00a0u_1:<\/p>\n<figure><img data-recalc-dims=\"1\" decoding=\"async\" alt=\"\" src=\"https:\/\/i0.wp.com\/cdn-images-1.medium.com\/max\/1024\/1%2ACv3xYnXplDXsDIno6MPrFw.png?ssl=1\"><\/figure>\n<p>Applying this formula recursively, we derive the number of users in each state on any day t &gt; 0 in the\u00a0future.<\/p>\n<p>Besides the initial distribution u_0, we need to provide the number of new users that will appear in the product each day in the future. 
We\u2019ll treat this as a general time-series forecasting problem.<\/p>\n<p>Now, having u_t calculated, we can determine DAU values on day\u00a0t:<\/p>\n<p>DAU_t = #New_t + #Current_t + #Reactivated_t + #Resurrected_t<\/p>\n<p>Additionally, we can easily calculate WAU and MAU\u00a0metrics:<\/p>\n<p>WAU_t = DAU_t + #AtRiskWau_t,<br \/>MAU_t = DAU_t + #AtRiskWau_t + #AtRiskMau_t.<\/p>\n<p>Finally, here\u2019s the algorithm outline:<\/p>\n<ol>\n<li>For each prediction day t = 1,\u00a0\u2026, T, calculate the expected number of new users #New_1,\u00a0\u2026,\u00a0#New_T.<\/li>\n<li>For each lifetime day of each user, assign one of the 7\u00a0states.<\/li>\n<li>Calculate the transition matrix M from the historical data.<\/li>\n<li>Calculate initial state counts u_0 corresponding to day\u00a0t=0.<\/li>\n<li>Recursively calculate u_{t+1} = M^T *\u00a0u_t, setting the new component of u_{t+1} to\u00a0#New_{t+1}.<\/li>\n<li>Calculate DAU, WAU, and MAU for each prediction day t = 1,\u00a0\u2026,\u00a0T.<\/li>\n<\/ol>\n<h3>3. Implementation<\/h3>\n<p>This section is devoted to technical aspects of the implementation. If you\u2019re interested in studying the model properties rather than code, you may skip this section and go to <a href=\"https:\/\/towardsdatascience.com\/#1375\">Section\u00a04<\/a>.<\/p>\n<h4>3.1 Dataset<\/h4>\n<p>We use a simulated dataset based on historical data of a SaaS app. The data is stored in the <a href=\"https:\/\/drive.google.com\/file\/d\/16kd8rJBvcgmw95jY42MedRfIxcO4LpPd\/view?usp=sharing\">dau_data.csv.gz<\/a> file and contains three columns: user_id, date, and registration_date. Each record indicates a day when a user was active. The dataset includes activity indicators for 51480 users from 2020-11-01 to 2023-10-31. 
Additionally, data from October 2020 is included to calculate user states properly, as the at_risk_mau and dormant states require data from one month\u00a0prior.<\/p>\n<pre>import pandas as pd<br><br>df = pd.read_csv('dau_data.csv.gz', compression='gzip')<br>df['date'] = pd.to_datetime(df['date'])<br>df['registration_date'] = pd.to_datetime(df['registration_date'])<br><br># double-quoted f-strings so the inner single quotes work before Python 3.12<br>print(f'Shape: {df.shape}')<br>print(f\"Total users: {df['user_id'].nunique()}\")<br>print(f\"Data range: [{df['date'].min()}, {df['date'].max()}]\")<br>df.head()<\/pre>\n<pre>Shape: (667236, 3)<br>Total users: 51480<br>Data range: [2020-10-01 00:00:00, 2023-10-31 00:00:00]<\/pre>\n<figure><img data-recalc-dims=\"1\" decoding=\"async\" alt=\"\" src=\"https:\/\/i0.wp.com\/cdn-images-1.medium.com\/max\/1024\/1%2AboptDIHcT3KVLhNdbNNnQw.png?ssl=1\"><\/figure>\n<p>This is what the DAU time series looks\u00a0like.<\/p>\n<pre># one record per user per active day, so the row count per date is DAU<br>(<br>    df.groupby('date').size()<br>    .plot(title='DAU, historical')<br>)<\/pre>\n<figure><img data-recalc-dims=\"1\" decoding=\"async\" alt=\"\" src=\"https:\/\/i0.wp.com\/cdn-images-1.medium.com\/max\/1024\/1%2AB88Nd_1uVTA3SYy4Xh2xgg.png?ssl=1\"><\/figure>\n<p>Suppose that today is 2023-10-31 and we want to predict the DAU metric through the end of 2024. We define a couple of global constants, PREDICTION_START and PREDICTION_END, which bound the prediction period.<\/p>\n<pre>PREDICTION_START = '2023-11-01'<br>PREDICTION_END = '2024-12-31'<\/pre>\n<h4>3.2 Predicting the number of new\u00a0users<\/h4>\n<p>Let\u2019s start with the new users prediction. We use the <a href=\"http:\/\/facebook.github.io\/prophet\/\">prophet<\/a> library as one of the easiest ways to forecast time-series data. The new_users Series contains such data. 
We extract it from the original df dataset by selecting the rows where the registration date is equal to the\u00a0date.<\/p>\n<pre>new_users = (<br>    df[df['date'] == df['registration_date']]<br>    .groupby('date').size()<br>)<br>new_users.head()<\/pre>\n<pre>date<br>2020-10-01    4<br>2020-10-02    4<br>2020-10-03    3<br>2020-10-04    4<br>2020-10-05    8<br>dtype: int64<\/pre>\n<p>prophet requires a time-series as a DataFrame containing two columns ds and y, so we reformat the new_users Series to the new_users_prophet DataFrame. We also need to create the future variable containing the days to predict: from prediction_start to prediction_end. This logic is implemented in the predict_new_users function. The plot below illustrates predictions for both past and future\u00a0periods.<\/p>\n<pre>import logging<br>import matplotlib.pyplot as plt<br>from prophet import Prophet<br><br># suppress prophet logs<br>logging.getLogger('prophet').setLevel(logging.WARNING)<br>logging.getLogger('cmdstanpy').disabled=True<br><br>def predict_new_users(prediction_start, prediction_end, new_users_train, show_plot=True):<br>    \"\"\"<br>    Forecasts a time-series for new users<br><br>    Parameters<br>    ----------<br>    prediction_start : str<br>        Date in YYYY-MM-DD format.<br>    prediction_end : str<br>        Date in YYYY-MM-DD format.<br>    new_users_train : pandas.Series<br>        Historical data for the time-series preceding the prediction period.<br>    show_plot : boolean, default=True<br>        If True, a chart with the train and predicted time-series values is displayed.<br>    Returns<br>    -------<br>    pandas.Series<br>        Series containing the predicted values.<br>    \"\"\"<br>    m = Prophet()<br><br>    # keep only the data preceding the prediction period<br>    new_users_train = new_users_train.loc[new_users_train.index &lt; prediction_start]<br>    new_users_prophet = pd.DataFrame({<br>        'ds': new_users_train.index,<br>        'y': new_users_train.values<br>    
})<br><br>    m.fit(new_users_prophet)<br><br>    periods = len(pd.date_range(prediction_start, prediction_end))<br>    future = m.make_future_dataframe(periods=periods)<br>    new_users_pred = m.predict(future)<br>    if show_plot:<br>        m.plot(new_users_pred)<br>        plt.title('New users prediction');<br><br>    # keep the point forecast as a non-negative integer Series indexed by date<br>    new_users_pred = (<br>        new_users_pred<br>        .assign(yhat=lambda _df: _df['yhat'].astype(int))<br>        .rename(columns={'ds': 'date', 'yhat': 'count'})<br>        .set_index('date')['count']<br>        .clip(lower=0)<br>    )<br><br>    return new_users_pred<\/pre>\n<pre>new_users_pred = predict_new_users(PREDICTION_START, PREDICTION_END, new_users)<\/pre>\n<figure><img data-recalc-dims=\"1\" decoding=\"async\" alt=\"\" src=\"https:\/\/i0.wp.com\/cdn-images-1.medium.com\/max\/1024\/1%2APrxCj-BJmr7rgc7VyNkIlw.png?ssl=1\"><\/figure>\n<p>The new_users_pred Series stores the predicted number of new\u00a0users.<\/p>\n<pre>new_users_pred.tail(5)<\/pre>\n<pre>date<br>2024-12-27    52<br>2024-12-28    56<br>2024-12-29    71<br>2024-12-30    79<br>2024-12-31    74<br>Name: count, dtype: int64<\/pre>\n<h4>3.3 Getting the\u00a0states<\/h4>\n<p>In practice, it\u2019s reasonable to run most of the calculations as SQL queries against the database where the data is stored. Hereafter, we simulate such querying using the <a href=\"https:\/\/duckdb.org\/\">duckdb<\/a>\u00a0library.<\/p>\n<p>We want to assign one of the 7 states to each day of a user\u2019s lifetime within the app. According to the definition, for each day, we need to consider at least the past 30 days. This is where SQL window functions come in. However, since the df data contains only records of <em>active days<\/em>, we need to explicitly extend them and include the days when a user was not active. 
In other words, instead of this list of\u00a0records:<\/p>\n<pre>user_id    date          registration_date<br>1234567    2023-01-01    2023-01-01<br>1234567    2023-01-03    2023-01-01<\/pre>\n<p>we\u2019d like to get a list like\u00a0this:<\/p>\n<pre>user_id    date          is_active    registration_date<br>1234567    2023-01-01    TRUE         2023-01-01<br>1234567    2023-01-02    FALSE        2023-01-01<br>1234567    2023-01-03    TRUE         2023-01-01<br>1234567    2023-01-04    FALSE        2023-01-01<br>1234567    2023-01-05    FALSE        2023-01-01<br>...        ...           ...          ...<br>1234567    2023-10-31    FALSE        2023-01-01<\/pre>\n<p>For readability purposes we split the following SQL query into multiple subqueries.<\/p>\n<ul>\n<li>full_range: Create a full sequence of dates for each\u00a0user.<\/li>\n<li>dau_full: Get the full list of both active and inactive\u00a0records.<\/li>\n<li>states: Assign one of the 7 states for each day of a user&#8217;s lifetime.<\/li>\n<\/ul>\n<pre>import duckdb<br><br>DATASET_START = '2020-11-01'<br>DATASET_END = '2023-10-31'<br>OBSERVATION_START = '2020-10-01'<br><br>query = f\"\"\"<br>WITH<br>full_range AS (<br>    SELECT<br>        user_id, UNNEST(generate_series(greatest(registration_date, '{OBSERVATION_START}'), date '{DATASET_END}', INTERVAL 1 DAY))::date AS date<br>    FROM (<br>        SELECT DISTINCT user_id, registration_date FROM df<br>    )<br>),<br>dau_full AS (<br>    SELECT<br>        fr.user_id,<br>        fr.date,<br>        df.date IS NOT NULL AS is_active,<br>        registration_date<br>    FROM full_range AS fr<br>    LEFT JOIN df USING(user_id, date)<br>),<br>states AS (<br>    SELECT<br>        user_id,<br>        date,<br>        is_active,<br>        first_value(registration_date IGNORE NULLS) OVER (PARTITION BY user_id ORDER BY date) AS registration_date,<br>        SUM(is_active::int) OVER (PARTITION BY user_id ORDER BY date ROWS BETWEEN 6 PRECEDING and 1 PRECEDING) AS 
active_days_back_6d,<br>        SUM(is_active::int) OVER (PARTITION BY user_id ORDER BY date ROWS BETWEEN 29 PRECEDING and 1 PRECEDING) AS active_days_back_29d,<br>        CASE<br>            WHEN date = registration_date THEN 'new'<br>            WHEN is_active = TRUE AND active_days_back_6d BETWEEN 1 and 6 THEN 'current'<br>            WHEN is_active = TRUE AND active_days_back_6d = 0 AND IFNULL(active_days_back_29d, 0) &gt; 0 THEN 'reactivated'<br>            WHEN is_active = TRUE AND active_days_back_6d = 0 AND IFNULL(active_days_back_29d, 0) = 0 THEN 'resurrected'<br>            WHEN is_active = FALSE AND active_days_back_6d &gt; 0 THEN 'at_risk_wau'<br>            WHEN is_active = FALSE AND active_days_back_6d = 0 AND ifnull(active_days_back_29d, 0) &gt; 0 THEN 'at_risk_mau'<br>            ELSE 'dormant'<br>        END AS state<br>    FROM dau_full<br>)<br>SELECT user_id, date, state FROM states<br>WHERE date BETWEEN '{DATASET_START}' AND '{DATASET_END}'<br>ORDER BY user_id, date<br>\"\"\"<br>states = duckdb.sql(query).df()<\/pre>\n<p>The query results are kept in the states DataFrame:<\/p>\n<figure><img data-recalc-dims=\"1\" decoding=\"async\" alt=\"\" src=\"https:\/\/i0.wp.com\/cdn-images-1.medium.com\/max\/1024\/1%2ACKrex2_FcRmfmFMOtT-_yw.png?ssl=1\"><\/figure>\n<h4>3.4 Calculating the transition matrix<\/h4>\n<p>Having obtained these states, we can calculate state transition frequencies. In the <a href=\"https:\/\/towardsdatascience.com\/#0637\">Section 4.3<\/a> we\u2019ll study how the prediction depends on a period in which transitions are considered, so it\u2019s reasonable to pre-aggregate this data on daily basis. The resulting transitions DataFrame contains date, state_from, state_to, and cnt\u00a0columns.<\/p>\n<p>Now, we can calculate the transition matrix M. 
We implement the get_transition_matrix function, which accepts the transitions DataFrame and a pair of dates that encompass the transitions period to be considered.<\/p>\n<p>As a baseline, let\u2019s calculate the transition matrix for the whole year from 2022-11-01 to 2023-10-31.<\/p>\n<pre>M = get_transition_matrix(transitions, '2022-11-01', '2023-10-31')<br>M<\/pre>\n<figure><img data-recalc-dims=\"1\" decoding=\"async\" alt=\"\" src=\"https:\/\/i0.wp.com\/cdn-images-1.medium.com\/max\/1024\/1%2ACEK1ieeHcLT4-0XtXUAiHg.png?ssl=1\"><\/figure>\n<p>The sum of each row of any transition matrix equals 1 since it represents the probabilities of moving from one state to any other\u00a0state.<\/p>\n<h4>3.5 Getting the initial state\u00a0counts<\/h4>\n<p>An initial state is retrieved from the states DataFrame by the get_state0 function and the corresponding SQL query. The only argument of the function is the date for which we want to get the initial state. We assign the result to the state0 variable.<\/p>\n<pre>def get_state0(date):<br>    query = f\"\"\"<br>    SELECT state, count(*) AS cnt<br>    FROM states<br>    WHERE date = '{date}'<br>    GROUP BY state<br>    \"\"\"<br><br>    state0 = duckdb.sql(query).df()<br>    state0 = state0.set_index('state').reindex(states_order)['cnt']<br>    <br>    return state0<\/pre>\n<pre>state0 = get_state0(DATASET_END)<br>state0<\/pre>\n<pre>state<br>new               20<br>current          475<br>reactivated       15<br>resurrected       19<br>at_risk_wau      404<br>at_risk_mau     1024<br>dormant        49523<br>Name: cnt, dtype: int64<\/pre>\n<h4>3.6 Predicting DAU<\/h4>\n<p>The predict_dau function below accepts all the previous variables required for the DAU prediction and makes this prediction for a date range defined by the start_date and end_date arguments.<\/p>\n<pre>def predict_dau(M, state0, start_date, end_date, new_users):<br>    \"\"\"<br>    Predicts DAU over a given date range.<br><br>    Parameters<br>    
----------<br>    M : pandas.DataFrame<br>        Transition matrix representing user state changes.<br>    state0 : pandas.Series<br>        counts of initial state of users.<br>    start_date : str<br>        Start date of the prediction period in 'YYYY-MM-DD' format.<br>    end_date : str<br>        End date of the prediction period in 'YYYY-MM-DD' format.<br>    new_users : int or pandas.Series<br>        The expected amount of new users for each day between `start_date` and `end_date`.<br>        If a Series, it should have dates as the index.<br>        If an int, the same number is used for each day.<br>        <br>    Returns<br>    -------<br>    pandas.DataFrame<br>        DataFrame containing the predicted DAU, WAU, and MAU for each day in the date range,<br>        with columns for different user states and tot.<br>    \"\"\"<br>    <br>    dates = pd.date_range(start_date, end_date)<br>    dates.name = 'date'<br>    dau_pred = []<br>    new_dau = state0.copy()<br>    for date in dates:<br>        new_dau = (M.transpose() @ new_dau).astype(int)<br>        if isinstance(new_users, int):<br>            new_users_today = new_users<br>        else:<br>            new_users_today = new_users.astype(int).loc[date] <br>        new_dau.loc['new'] = new_users_today<br>        dau_pred.append(new_dau.tolist())<br><br>    dau_pred = pd.DataFrame(dau_pred, index=dates, columns=states_order)<br>    dau_pred['dau'] = dau_pred['new'] + dau_pred['current'] + dau_pred['reactivated'] + dau_pred['resurrected']<br>    dau_pred['wau'] = dau_pred['dau'] + dau_pred['at_risk_wau']<br>    dau_pred['mau'] = dau_pred['dau'] + dau_pred['at_risk_wau'] + dau_pred['at_risk_mau']<br>    <br>    return dau_pred<\/pre>\n<pre>dau_pred = predict_dau(M, state0, PREDICTION_START, PREDICTION_END, new_users_pred)<br>dau_pred<\/pre>\n<figure><img data-recalc-dims=\"1\" decoding=\"async\" alt=\"\" 
src=\"https:\/\/i0.wp.com\/cdn-images-1.medium.com\/max\/1024\/1%2AonivsmmuFBTDt0qZhlrQXA.png?ssl=1\"><\/figure>\n<p>This is what the DAU prediction dau_pred looks like for the PREDICTION_START &#8211; PREDICTION_END period. Besides the expected dau, wau, and mau columns, the output contains the number of users in each state for each prediction date.<\/p>\n<p>Finally, we calculate the ground-truth values of DAU, WAU, and MAU (along with the user state counts), keep them in the dau_true DataFrame, and plot the predicted and true values together.<\/p>\n<pre>query = f\"\"\"<br>SELECT date, state, COUNT(*) AS cnt<br>FROM states<br>GROUP BY date, state<br>ORDER BY date, state;<br>\"\"\"<br><br>dau_true = duckdb.sql(query).df()<br>dau_true['date'] = pd.to_datetime(dau_true['date'])<br>dau_true = dau_true.pivot(index='date', columns='state', values='cnt')<br>dau_true['dau'] = dau_true['new'] + dau_true['current'] + dau_true['reactivated'] + dau_true['resurrected']<br>dau_true['wau'] = dau_true['dau'] + dau_true['at_risk_wau']<br>dau_true['mau'] = dau_true['dau'] + dau_true['at_risk_wau'] + dau_true['at_risk_mau']<\/pre>\n<pre>dau_true.head()<\/pre>\n<figure><img data-recalc-dims=\"1\" decoding=\"async\" alt=\"\" src=\"https:\/\/i0.wp.com\/cdn-images-1.medium.com\/max\/1024\/1%2AToJZNPI8Bd4oxJm5tMts5w.png?ssl=1\"><\/figure>\n<pre>(<br>    pd.concat([dau_true['dau'], dau_pred['dau']])<br>    .plot(title='DAU, historical &amp; predicted')<br>)<br>plt.axvline(PREDICTION_START, color='k', linestyle='--');<\/pre>\n<figure><img data-recalc-dims=\"1\" decoding=\"async\" alt=\"\" src=\"https:\/\/i0.wp.com\/cdn-images-1.medium.com\/max\/1024\/1%2Ab9opg5Z4MkcNvx0RvWJEjA.png?ssl=1\"><\/figure>\n<p>We\u2019ve obtained the prediction, but so far it\u2019s not clear whether it\u2019s accurate. In the next section, we\u2019ll evaluate the\u00a0model.<\/p>\n<h3>4. 
Model evaluation<\/h3>\n<h4>4.1 Baseline\u00a0model<\/h4>\n<p>First of all, let\u2019s check whether we really need to build a complex model to predict DAU. Wouldn\u2019t it be better to predict DAU as a general time-series using the prophet library mentioned above? The function predict_dau_prophet below implements this. We use some tweaks available in the library to make the prediction more accurate. In particular:<\/p>\n<ul>\n<li>we use a logistic growth model instead of a linear one to avoid negative\u00a0values;<\/li>\n<li>we explicitly add monthly and yearly seasonality;<\/li>\n<li>we remove the outliers;<\/li>\n<li>we explicitly define a peak period in January and February as \u201cholidays\u201d.<\/li>\n<\/ul>\n<pre>def predict_dau_prophet(prediction_start, prediction_end, dau_true, show_plot=True):<br>    # assigning peak days for the new year<br>    holidays = pd.DataFrame({<br>        'holiday': 'january_spike',<br>        'ds': pd.date_range('2022-01-01', '2022-01-31', freq='D').tolist() + <br>              pd.date_range('2023-01-01', '2023-01-31', freq='D').tolist(),<br>        'lower_window': 0,<br>        'upper_window': 40<br>    })<br><br>    m = Prophet(growth='logistic', holidays=holidays)<br>    m.add_seasonality(name='monthly', period=30.5, fourier_order=3)<br>    m.add_seasonality(name='yearly', period=365, fourier_order=3)<br><br>    train = dau_true.loc[(dau_true.index &lt; prediction_start) &amp; (dau_true.index &gt;= '2021-08-01')]<br>    train_prophet = pd.DataFrame({'ds': train.index, 'y': train.values})<br>    # removing outliers<br>    train_prophet.loc[train_prophet['ds'].between('2022-06-07', '2022-06-09'), 'y'] = None<br>    # flag the new year peak (both bounds inclusive)<br>    train_prophet['new_year_peak'] = train_prophet['ds'].between('2022-01-01', '2022-02-14')<br>    m.add_regressor('new_year_peak')<br>    # setting logistic upper and lower bounds<br>    train_prophet['cap'] = dau_true.max() * 1.1<br>    
train_prophet['floor'] = 0<br><br>    m.fit(train_prophet)<br><br>    periods = len(pd.date_range(prediction_start, prediction_end))<br>    future = m.make_future_dataframe(periods=periods)<br>    future['new_year_peak'] = (future['ds'] &gt;= '2022-01-01') &amp; (future['ds'] &lt;= '2022-02-14')<br>    future['cap'] = dau_true.max() * 1.1<br>    future['floor'] = 0<br>    pred = m.predict(future)<br><br>    if show_plot:<br>        m.plot(pred);<br><br>    # converting the predictions to an appropriate format<br>    pred = (<br>        pred<br>        .assign(yhat=lambda _df: _df['yhat'].astype(int))<br>        .rename(columns={'ds': 'date', 'yhat': 'count'})<br>        .set_index('date')['count']<br>        .clip(lower=0)<br>        .loc[lambda s: (s.index &gt;= prediction_start) &amp; (s.index &lt;= prediction_end)]<br>    )<br><br>    return pred<\/pre>\n<p>The fact that the code turns out to be quite involved indicates that one can\u2019t simply apply prophet to the DAU time-series.<\/p>\n<p>Below we test the prediction for multiple horizons: 3, 6, and 12 months. 
As a result, we get 3 test\u00a0sets:<\/p>\n<ul>\n<li>3-month horizon: 2023-08-01 &#8211; 2023-10-31,<\/li>\n<li>6-month horizon: 2023-05-01 &#8211; 2023-10-31,<\/li>\n<li>1-year horizon: 2022-11-01 &#8211; 2023-10-31.<\/li>\n<\/ul>\n<p>For each test set we calculate the <a href=\"https:\/\/en.wikipedia.org\/wiki\/Mean_absolute_percentage_error\">MAPE<\/a> metric.<\/p>\n<pre>from sklearn.metrics import mean_absolute_percentage_error<br><br>mapes = []<br>prediction_end = '2023-10-31'<br>prediction_horizon = [3, 6, 12]<br><br>for offset in prediction_horizon:<br>    # first day of the month that starts the test period<br>    prediction_start = pd.to_datetime(prediction_end) - pd.DateOffset(months=offset - 1)<br>    prediction_start = prediction_start.replace(day=1)<br>    pred = predict_dau_prophet(prediction_start, prediction_end, dau_true['dau'], show_plot=False)<br>    mape = mean_absolute_percentage_error(dau_true['dau'].reindex(pred.index), pred)<br>    mapes.append(mape)<br><br>mapes = pd.DataFrame({'horizon': prediction_horizon, 'MAPE': mapes})<br>mapes<\/pre>\n<figure><img data-recalc-dims=\"1\" decoding=\"async\" alt=\"\" src=\"https:\/\/i0.wp.com\/cdn-images-1.medium.com\/max\/1024\/1%2AIDKv5izzXgCIIEO2BNRO6Q.png?ssl=1\"><\/figure>\n<p>The MAPE turns out to be high: 18% to 35%. The fact that the shortest horizon has the highest error suggests that the model is tuned for long-term predictions. This is another inconvenience of such an approach: we have to tune the model for each prediction horizon. Anyway, this is our baseline. In the next section we\u2019ll compare it with more advanced\u00a0models.<\/p>\n<h4>4.2 General evaluation<\/h4>\n<p>In this section we evaluate the model implemented in <a href=\"https:\/\/towardsdatascience.com\/#5dc8\">Section 3.6<\/a>. So far we set the transition period as 1 year before the prediction start. 
We\u2019ll study how the prediction depends on the transition period in <a href=\"https:\/\/towardsdatascience.com\/#0637\">Section 4.3<\/a>. As for the new users, we run the model using two options: the real values and the predicted ones. Similarly, we fix the same 3 prediction horizons and test the model on\u00a0them.<\/p>\n<p>The make_prediction helper function below implements the described options. It accepts the prediction_start and prediction_end arguments defining the prediction period for a given horizon, new_users_mode, which can be either true or predict, and transition_period. The options for the last argument are explained later.<\/p>\n<pre>import re<br><br><br>def make_prediction(prediction_start, prediction_end, new_users_mode='predict', transition_period='last_30d'):<br>    prediction_start_minus_1d = pd.to_datetime(prediction_start) - pd.Timedelta('1d')<br>    state0 = get_state0(prediction_start_minus_1d)<br>    <br>    if new_users_mode == 'predict':<br>        new_users_pred = predict_new_users(prediction_start, prediction_end, new_users, show_plot=False)<br>    elif new_users_mode == 'true':<br>        new_users_pred = new_users.copy()<br><br>    if transition_period.startswith('last_'):<br>        # parse the number of days from a value like 'last_365d'<br>        shift = int(re.search(r'last_(\\d+)d', transition_period).group(1))<br>        transitions_start = pd.to_datetime(prediction_start) - pd.Timedelta(shift, 'd')<br>        M = get_transition_matrix(transitions, transitions_start, prediction_start_minus_1d)<br>        dau_pred = predict_dau(M, state0, prediction_start, prediction_end, new_users_pred)<br>    else:<br>        transitions_start = pd.to_datetime(prediction_start) - pd.Timedelta(240, 'd')<br>        M_base = get_transition_matrix(transitions, transitions_start, prediction_start_minus_1d)<br>        dau_pred = pd.DataFrame()<br><br>        month_starts = pd.date_range(prediction_start, prediction_end, freq='1MS')<br>        N = len(month_starts)<br><br>        for i, prediction_month_start in 
enumerate(month_starts):<br>            prediction_month_end = pd.offsets.MonthEnd().rollforward(prediction_month_start)<br>            transitions_month_start = prediction_month_start - pd.Timedelta('365D')<br>            transitions_month_end = prediction_month_end - pd.Timedelta('365D')<br><br>            M_seasonal = get_transition_matrix(transitions, transitions_month_start, transitions_month_end)<br>            if transition_period == 'smoothing':<br>                i = min(i, 12)<br>                M = M_seasonal * i \/ (N - 1)  + (1 - i \/ (N - 1)) * M_base<br>            elif transition_period.startswith('seasonal_'):<br>                # parse the coefficient from a value like 'seasonal_0.3'<br>                seasonal_coef = float(re.search(r'seasonal_(0\\.\\d+)', transition_period).group(1))<br>                M = seasonal_coef * M_seasonal + (1 - seasonal_coef) * M_base<br>            <br>            dau_tmp = predict_dau(M, state0, prediction_month_start, prediction_month_end, new_users_pred)<br>            dau_pred = pd.concat([dau_pred, dau_tmp])<br><br>            state0 = dau_tmp.loc[prediction_month_end][states_order]<br><br>    return dau_pred<br><br>def prediction_details(dau_true, dau_pred, show_plot=True, ax=None):<br>    y_true = dau_true.reindex(dau_pred.index)['dau']<br>    y_pred = dau_pred['dau']<br>    mape = mean_absolute_percentage_error(y_true, y_pred) <br><br>    if show_plot:<br>        prediction_start = str(y_true.index.min().date())<br>        prediction_end = str(y_true.index.max().date())<br>        if ax is None:<br>            y_true.plot(label='DAU true')<br>            y_pred.plot(label='DAU pred')<br>            plt.title(f'DAU prediction, {prediction_start} - {prediction_end}')<br>            plt.legend()<br>        else:<br>            y_true.plot(label='DAU true', ax=ax)<br>            y_pred.plot(label='DAU pred', ax=ax)<br>            ax.set_title(f'DAU prediction, {prediction_start} - {prediction_end}')<br>            ax.legend()<br>    return mape<\/pre>\n<p>In total, we have 6 prediction 
scenarios: 2 options for new users and 3 prediction horizons. The diagram below illustrates the results. The charts on the left relate to the new_users_mode = &#8216;predict&#8217; option, while the right ones relate to the new_users_mode = &#8216;true&#8217;\u00a0option.<\/p>\n<pre>fig, axs = plt.subplots(3, 2, figsize=(15, 6))<br>mapes = []<br>prediction_end = '2023-10-31'<br>prediction_horizon = [3, 6, 12]<br><br>for i, offset in enumerate(prediction_horizon):<br>    prediction_start = pd.to_datetime(prediction_end) - pd.DateOffset(months=offset - 1)<br>    prediction_start = prediction_start.replace(day=1)<br>    args = {<br>        'prediction_start': prediction_start,<br>        'prediction_end': prediction_end,<br>        'transition_period': 'last_365d'<br>    }<br>    for j, new_users_mode in enumerate(['predict', 'true']):<br>        args['new_users_mode'] = new_users_mode<br>        dau_pred = make_prediction(**args)<br>        mape = prediction_details(dau_true, dau_pred, ax=axs[i, j])<br>        mapes.append([offset, new_users_mode, mape])<br><br>mapes = pd.DataFrame(mapes, columns=['horizon', 'new_users', 'MAPE'])<br>plt.tight_layout()<\/pre>\n<figure><img data-recalc-dims=\"1\" decoding=\"async\" alt=\"\" src=\"https:\/\/i0.wp.com\/cdn-images-1.medium.com\/max\/1024\/1%2A2tlgRVrGc_aDjRNdYii_9g.png?ssl=1\"><\/figure>\n<p>And here are the MAPE values summarizing the prediction quality:<\/p>\n<pre>mapes.pivot(index='horizon', columns='new_users', values='MAPE')<\/pre>\n<figure><img data-recalc-dims=\"1\" decoding=\"async\" alt=\"\" src=\"https:\/\/i0.wp.com\/cdn-images-1.medium.com\/max\/1024\/1%2AGX0FfctDtEPiQvBey-IPaQ.png?ssl=1\"><\/figure>\n<p>We notice multiple\u00a0things.<\/p>\n<ul>\n<li>In general, the model demonstrates much better results than the baseline. 
Indeed, the baseline relies on the historical DAU data only, while the model also uses the user state information.<\/li>\n<li>However, for the 1-year horizon with new_users_mode=&#8217;predict&#8217; the MAPE is huge: 65%. This is 3 times higher than the corresponding baseline error (21%). On the other hand, the new_users_mode=&#8217;true&#8217; option gives a much better result: 8%. This means that the new users prediction has a huge impact on the model, especially for long-term predictions. For the shorter horizons the difference is less dramatic. The main reason for the gap is that the 1-year period includes Christmas with its extreme values. As a result, i) it&#8217;s hard to predict such high new user values, and ii) the period heavily affects user behavior, the transition matrix and, consequently, the DAU values. Hence, we strongly recommend implementing the new users prediction carefully. The baseline model was specially tuned for this Christmas period, so it&#8217;s not surprising that it outperforms the Markov\u00a0model here.<\/li>\n<li>When the new users prediction is accurate, the model captures trends well. This means that using the last 365 days for the transition matrix calculation is a reasonable choice.<\/li>\n<li>Interestingly, the true new users data provides worse results for the 3-month prediction. This is nothing but a coincidence: the inaccurate new users prediction in October 2023 reversed the predicted DAU trend and made the MAPE a bit\u00a0lower.<\/li>\n<\/ul>\n<p>Now, let\u2019s decompose the prediction error and see which states contribute the most. By error we mean the dau_pred &#8211; dau_true values, and by relative error the (dau_pred &#8211; dau_true) \/ dau_true values; see the left and right diagrams below respectively. 
To focus on this aspect, we&#8217;ll narrow the configuration to the 3-month prediction horizon and the new_users_mode=&#8217;true&#8217; option.<\/p>\n<pre>dau_component_cols = ['new', 'current', 'reactivated', 'resurrected']<br><br>dau_pred = make_prediction('2023-08-01', '2023-10-31', new_users_mode='true', transition_period='last_365d')<br>figure, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 4))<br><br># parentheses let each method chain span several lines<br>(<br>    dau_pred[dau_component_cols]<br>    .subtract(dau_true[dau_component_cols])<br>    .reindex(dau_pred.index)<br>    .plot(title='Prediction error by state', ax=ax1)<br>)<br><br>(<br>    dau_pred[['current']]<br>    .subtract(dau_true[['current']])<br>    .div(dau_true[['current']])<br>    .reindex(dau_pred.index)<br>    .plot(title='Relative prediction error (current state)', ax=ax2)<br>);<\/pre>\n<figure><img data-recalc-dims=\"1\" decoding=\"async\" alt=\"\" src=\"https:\/\/i0.wp.com\/cdn-images-1.medium.com\/max\/1024\/1%2AjGx4NxaaMzY8cW27xCuehg.png?ssl=1\"><\/figure>\n<p>From the left chart we see that the error comes mostly from the current state. This is not surprising since this state contributes to DAU the most. The error for the reactivated and resurrected states is quite low. Another interesting observation is that the error is mostly negative for the current state and mostly positive for the resurrected state. The former might be explained by the fact that the new users who appeared in the prediction period are more engaged than the users from the past. The latter indicates that the resurrected users in reality contribute to DAU less than the transition matrix expects, so the dormant\u2192 resurrected conversion rate is overestimated.<\/p>\n<p>As for the relative error, it makes sense to analyze it for the current state only. This is because the daily numbers of reactivated and resurrected users are low, so their relative error is high and noisy. The relative error for the current state varies between -25% and 4%, which is quite high. 
And since we use the true new users values here, this error is explained by the transition matrix inaccuracy only. In particular, the current\u2192 current conversion rate is roughly 0.8, which is high, so it contributes to the error a lot. So if we want to improve the prediction, this is the conversion rate we should consider tuning first.<\/p>\n<h4>4.3 Transitions period\u00a0impact<\/h4>\n<p>In the previous section we kept the transitions period fixed: 1 year before the prediction start. Now we\u2019re going to study how long this period should be to get a more accurate prediction. We consider the same prediction horizons of 3, 6, and 12 months. In order to mitigate the noise from the new users prediction, we use the real numbers of new users: new_users_mode=&#8217;true&#8217;.<\/p>\n<p>This time we vary the transition_period argument. Its values follow the last_&lt;N&gt;d pattern, where N stands for the number of days in a transitions period. For each prediction horizon we try 12 different transition periods of 1, 2,\u00a0&#8230;, 12 months (each month is approximated as 30 days). 
Then we calculate the MAPE for each of the options and plot the\u00a0results.<\/p>\n<pre>result = []<br><br>for prediction_offset in prediction_horizon:<br>    prediction_start = pd.to_datetime(prediction_end) - pd.DateOffset(months=prediction_offset - 1)<br>    prediction_start = prediction_start.replace(day=1)<br><br>    for transition_offset in range(1, 13):<br>        dau_pred = make_prediction(<br>            prediction_start, prediction_end, new_users_mode='true',<br>            transition_period=f'last_{transition_offset*30}d'<br>        )<br>        mape = prediction_details(dau_true, dau_pred, show_plot=False)<br>        result.append([prediction_offset, transition_offset, mape])<br>result = pd.DataFrame(result, columns=['prediction_period', 'transition_period', 'mape'])<br><br>(<br>    result<br>    .pivot(index='transition_period', columns='prediction_period', values='mape')<br>    .plot(title='MAPE by prediction and transition period')<br>);<\/pre>\n<figure><img data-recalc-dims=\"1\" decoding=\"async\" alt=\"\" src=\"https:\/\/i0.wp.com\/cdn-images-1.medium.com\/max\/1024\/1%2AqeC8RXnYtfF74NpBIUfUOQ.png?ssl=1\"><\/figure>\n<p>It turns out that the optimal transitions period depends on the prediction horizon. Shorter horizons require shorter transitions periods: the minimal MAPE is achieved at transitions periods of 1, 4, and 8 months for the 3-, 6-, and 12-month horizons respectively. Apparently, this is because the longer horizons contain some seasonal effects that can be captured only by the longer transitions periods. Also, it seems that for the longer prediction horizons the MAPE curve is U-shaped, meaning that both too long and too short transitions periods hurt the prediction. We\u2019ll develop this idea in the next\u00a0section.<\/p>\n<h4>4.4 Obsolescence and seasonality<\/h4>\n<p>Nevertheless, fixing a single transition matrix for predicting the whole year ahead doesn\u2019t seem to be a good idea: such a model would be too rigid. 
Usually, user behavior varies depending on the season. For example, users who appear after Christmas might behave somewhat differently. Another typical situation is when users change their behavior in summer. In this section, we\u2019ll try to take these seasonal\u00a0effects into account.<\/p>\n<p>So we want to predict DAU for 1 year ahead starting from November 2022. Following the results of the previous subsection, the base transition matrix M_base is calculated over the last 8 months before the prediction start (labeled as the last_240d option below). Instead of using M_base alone, we&#8217;ll consider a mixture of this matrix and a seasonal one, M_seasonal. The latter is calculated on a monthly basis, lagging 1 year behind. For example, to predict DAU for November 2022 we define M_seasonal as the transition matrix for November 2021. Then we shift the prediction horizon to December 2022 and calculate M_seasonal for December 2021,\u00a0etc.<\/p>\n<p>In order to mix M_base and M_seasonal we define the following two\u00a0options.<\/p>\n<ul>\n<li>seasonal_0.3: M = 0.3 * M_seasonal + 0.7 * M_base. The weight 0.3 was chosen as a local minimum after some experiments.<\/li>\n<li>smoothing: M = i\/(N &#8211; 1) * M_seasonal + (1 &#8211; i\/(N &#8211; 1)) * M_base, where N is the number of months within the prediction period and i = 0,\u00a0\u2026, N &#8211; 1 is the month index. 
The idea of this configuration is to gradually switch from the most recent transition matrix M_base to the seasonal ones as the prediction month moves away from the prediction start.<\/li>\n<\/ul>\n<pre>result = pd.DataFrame()<br>for transition_period in ['last_240d', 'seasonal_0.3', 'smoothing']:<br>    result[transition_period] = make_prediction(<br>        '2022-11-01', '2023-10-31',<br>        'true',<br>        transition_period<br>    )['dau']<br>result['true'] = dau_true['dau']<br>result['true'] = result['true'].astype(int)<br>result.plot(title='DAU prediction by different transition matrices');<\/pre>\n<figure><img data-recalc-dims=\"1\" decoding=\"async\" alt=\"\" src=\"https:\/\/i0.wp.com\/cdn-images-1.medium.com\/max\/1024\/1%2AZrRv71gr3_36ssoFlWhE3g.png?ssl=1\"><\/figure>\n<pre>mape = pd.DataFrame()<br>for col in result.columns:<br>    if col != 'true':<br>        mape.loc[col, 'mape'] = mean_absolute_percentage_error(result['true'], result[col])<br>mape<\/pre>\n<figure><img data-recalc-dims=\"1\" decoding=\"async\" alt=\"\" src=\"https:\/\/i0.wp.com\/cdn-images-1.medium.com\/max\/1024\/1%2A_cwTRpOtAlKpqDMr75IlOw.png?ssl=1\"><\/figure>\n<p>According to the MAPE values, the seasonal_0.3 configuration provides the best results. Interestingly, the smoothing approach turned out to be even worse than last_240d. From the diagram above we see that all three models start to underestimate the DAU values in July 2023, especially the smoothing model. It seems that the new users who started appearing in July 2023 are more engaged than the users from 2022. Probably, the app was improved significantly or the marketing team did a great job. As a result, the smoothing model, which relies heavily on the outdated transitions data from July 2022 &#8211; October 2022, fails more than the other\u00a0models.<\/p>\n<h4>4.5 Final\u00a0solution<\/h4>\n<p>To sum things up, let\u2019s make a final prediction for 2024. 
We use the seasonal_0.3 configuration and the predicted values for new\u00a0users.<\/p>\n<pre>dau_pred = make_prediction(<br>    PREDICTION_START, PREDICTION_END,<br>    new_users_mode='predict',<br>    transition_period='seasonal_0.3'<br>)<br>dau_true['dau'].plot(label='true')<br>dau_pred['dau'].plot(label='seasonal_0.3')<br>plt.title('DAU, historical &amp; predicted')<br>plt.axvline(PREDICTION_START, color='k', linestyle='--')<br>plt.legend();<\/pre>\n<figure><img data-recalc-dims=\"1\" decoding=\"async\" alt=\"\" src=\"https:\/\/i0.wp.com\/cdn-images-1.medium.com\/max\/1024\/1%2ABZLm5RMB7E7qN_3UWlJZJw.png?ssl=1\"><\/figure>\n<h3>5. Discussion<\/h3>\n<p>In <a href=\"https:\/\/towardsdatascience.com\/#1375\">Section 4<\/a> we studied the model performance from the prediction accuracy perspective. Now let\u2019s discuss the model from the practical point of\u00a0view.<\/p>\n<p>Besides its poor accuracy, predicting DAU as a time series (see <a href=\"https:\/\/towardsdatascience.com\/#0e38\">Section 4.1<\/a>) is a very rigid approach. Essentially, it makes a prediction that fits the <em>historical<\/em> data best. In practice, when making plans for the next year we usually have certain expectations about the future. For\u00a0example,<\/p>\n<ul>\n<li>the marketing team is going to launch some new, more effective campaigns,<\/li>\n<li>the activation team is planning to improve the onboarding process,<\/li>\n<li>the product team will release some new features that would engage and retain users\u00a0more.<\/li>\n<\/ul>\n<p>Our model can take such expectations into account. For the examples above we can adjust the new users prediction, the new\u2192 current and the current\u2192 current conversion rates respectively. As a result, we can get a prediction that doesn&#8217;t match the historical data but is nevertheless more realistic. This makes the model not only flexible but also interpretable. 
You can easily discuss all these adjustments with the stakeholders, and they can understand how the prediction works.<\/p>\n<p>Another advantage of the model is that it doesn\u2019t require predicting whether a certain user will be active on a certain day. Sometimes binary classifiers are used for this purpose. The downside of this approach is that we need to apply such a classifier to each user, including all the dormant users, and to each day of the prediction horizon. This comes at a tremendous computational cost. In contrast, the Markov model requires only the initial number of users in each state (state0). Moreover, such classifiers are often black-box models: they are poorly interpretable and hard to\u00a0adjust.<\/p>\n<p>The Markov model also has some limitations. As we have already seen, it\u2019s sensitive to the new users prediction. It\u2019s easy to totally ruin the prediction with a wrong number of new users. Another problem is that the Markov model is memoryless, meaning that it doesn\u2019t take the user\u2019s history into account. For example, it doesn\u2019t distinguish whether a current user is a newbie, an experienced user, or a reactivated\/resurrected one. The retention rates of these user types are likely to differ. Also, as we discussed earlier, user behavior might differ depending on the season, marketing sources, countries, etc. So far our model is not able to capture these differences. However, this might be a subject for further research: we could extend the model by fitting separate transition matrices for different user segments.<\/p>\n<p>Finally, as we promised in the introduction, we provide a <a href=\"https:\/\/docs.google.com\/spreadsheets\/d\/1DxbjrkSy_wvU1lzlNWhrEfWO-Kq1tJrQw0izEHu5ULU\/edit?usp=sharing\">DAU spreadsheet calculator<\/a>. In the Prediction sheet you&#8217;ll need to fill in the initial states distribution row (marked in blue) and the new users prediction column (marked in purple). 
In the Conversions sheet you can adjust the transition matrix values. Remember that the sum of each row of the matrix should be equal to\u00a01.<\/p>\n<figure><img data-recalc-dims=\"1\" decoding=\"async\" alt=\"\" src=\"https:\/\/i0.wp.com\/cdn-images-1.medium.com\/max\/1024\/0%2AnH670_3JC8_4uul1.png?ssl=1\"><\/figure>\n<p>That\u2019s all for now. I hope that this article was useful for you. In case of any questions or suggestions, feel free to ask in the comments below or contact me directly on <a href=\"https:\/\/www.linkedin.com\/in\/vladimir-kukushkin-95b6487\/\">LinkedIn<\/a>.<\/p>\n<p>All the images in the post are generated by the\u00a0author.<\/p>\n<hr>\n<p><a href=\"https:\/\/towardsdatascience.com\/modeling-dau-with-markov-chain-640ea4fddeb4\">Modeling DAU with Markov Chain<\/a> was originally published in <a href=\"https:\/\/towardsdatascience.com\/\">Towards Data Science<\/a> on Medium, where people are continuing the conversation by highlighting and responding to this story.<\/p>\n<\/div>\n<p> \t<BR><br \/>\n <BR><\/BR><br \/>\n    Vladimir Kukushkin<br \/>\n \t<BR><br \/>\n<BR><\/BR><br \/>\n<a href=\"https:\/\/medium.com\/m\/global-identity-2?redirectUrl=https%3A%2F%2Ftowardsdatascience.com%2Fmodeling-dau-with-markov-chain-640ea4fddeb4\">Go to original source<\/a><br \/>\n \t<BR><br \/>\n <BR><\/BR><\/p>\n","protected":false},"excerpt":{"rendered":"<p>Modeling DAU with Markov Chain How to predict DAU using Duolingo\u2019s growth model and control the prediction 1. Introduction Doubtlessly, DAU, WAU, and MAU\u200a\u2014\u200adaily, weekly, and monthly active users\u200a\u2014\u200aare critical business metrics. 
An article \u201cHow Duolingo reignited user growth\u201d by Jorge Mazal, former CPO of Duolingo, is #1 in the Growth section of Lenny\u2019s Newsletter [&hellip;]<\/p>\n","protected":false},"author":2,"featured_media":0,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[62,240,518,520,519,517],"tags":[521,522,7],"class_list":["post-428","post","type-post","status-publish","format-standard","hentry","category-aimldsaimlds","category-editors-pick","category-predictive-analytics","category-predictive-modeling","category-product-analytics","category-quant-uxr","tag-dau","tag-duolingo","tag-how"],"_links":{"self":[{"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/posts\/428"}],"collection":[{"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/users\/2"}],"replies":[{"embeddable":true,"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/comments?post=428"}],"version-history":[{"count":0,"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/posts\/428\/revisions"}],"wp:attachment":[{"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/media?parent=428"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/categories?post=428"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/tags?post=428"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}