{"id":1479,"date":"2025-01-28T07:03:27","date_gmt":"2025-01-28T07:03:27","guid":{"rendered":"https:\/\/mailitics.com\/index.php\/2025\/01\/28\/building-a-regression-model-delivery-duration-prediction-ab1435952419\/"},"modified":"2025-01-28T07:03:27","modified_gmt":"2025-01-28T07:03:27","slug":"building-a-regression-model-delivery-duration-prediction-ab1435952419","status":"publish","type":"post","link":"https:\/\/mailitics.com\/index.php\/2025\/01\/28\/building-a-regression-model-delivery-duration-prediction-ab1435952419\/","title":{"rendered":"Building a Regression Model: Delivery Duration Prediction"},"content":{"rendered":"<div>\n<h3>Building a Regression Model to Predict Delivery Durations: A Practical Guide<\/h3>\n<h4>E2E walkthrough for approaching a regression modeling\u00a0task<\/h4>\n<p>In this article, we\u2019re going to walk through the process of building a regression model\u200a\u2014\u200afrom dataset cleaning &amp; preparation, to model training &amp; evaluation. 
The specific task we will model is predicting the expected delivery time for a food delivery\u00a0service.<\/p>\n<p>Comprehensive information about the project &amp; the dataset involved can be found\u00a0<a href=\"https:\/\/platform.stratascratch.com\/data-projects\/delivery-duration-prediction\">here<\/a>.<\/p>\n<p>If you want to follow along with the analysis, you can find the original notebooks in the GitHub repository\u00a0<a href=\"https:\/\/github.com\/jimin-kang\/Delivery-Duration\">here<\/a>.<\/p>\n<h3>Contents<\/h3>\n<ul>\n<li><strong>Background<\/strong><\/li>\n<li><strong>Dataset Info<\/strong><\/li>\n<li><strong>High-Level Approach<\/strong><\/li>\n<li><strong>Data Preparation &amp; Exploratory Analysis<\/strong><\/li>\n<li><strong>Building Models<\/strong><\/li>\n<li><strong>Final Model Evaluation<\/strong><\/li>\n<li><strong>Conclusion<\/strong><\/li>\n<li><strong>Sources<\/strong><\/li>\n<\/ul>\n<h4>Background<\/h4>\n<p>To provide some motivation for this task, let\u2019s assume we are working for some food delivery service, such as DoorDash.<\/p>\n<p>When a customer places an order on DoorDash, DoorDash must display the estimated amount of time it will take until the customer receives their order. Predicting this arrival time accurately is crucial for DoorDash\u2019s customer satisfaction:<\/p>\n<ul>\n<li>If DoorDash cannot track a customer\u2019s order status precisely, naturally customers will not want to order using the app as they will have no idea when they will receive their\u00a0order.<\/li>\n<li>Thus, our goal is to build a regression model that can predict this delivery time accurately. 
Precisely, the delivery time is defined as the time elapsed between the moment a customer places an order and the moment they receive it. To accomplish this goal, we are provided some historical orders data, which we describe\u00a0below.<\/li>\n<\/ul>\n<h4>Dataset Info<\/h4>\n<ul>\n<li>The data we are working with contains historical DoorDash orders data from early 2015 in a subset of the cities in which they\u00a0operate.<\/li>\n<li>The data includes features related to the following: order details (price, number of items, etc.), market conditions at the time the order was placed (e.g. # of available drivers), and the timestamps needed to compute our prediction target (when the order was placed &amp; when it was delivered).<\/li>\n<li>The full data dictionary can be found in the project\u00a0<a href=\"https:\/\/platform.stratascratch.com\/data-projects\/delivery-duration-prediction?utm_source=youtube&amp;utm_medium=click&amp;utm_campaign=YT+classical+ML+doordash+project\">link<\/a>.<\/li>\n<\/ul>\n<h4>High-Level Approach<\/h4>\n<p>Building a regression model from scratch is an extensive process that requires some trial &amp; error. 
Essentially, the general approach we will take looks something like\u00a0this:<\/p>\n<ul>\n<li>Data prep &amp; exploration (data exploration, cleaning, feature engineering).<\/li>\n<li>Model building\/experimentation (train, tune, and compare several different regression algorithms).<\/li>\n<li>Final model selection &amp; evaluation (choose the \u201cbest\u201d model from our experiments, and evaluate it on our holdout\u00a0set).<\/li>\n<\/ul>\n<p>Additionally, data preparation &amp; exploration can be broken down further into the following steps:<\/p>\n<ul>\n<li>Minimal dataset exploration\/cleaning (just enough to check whether any features should be dropped or any observations should be removed, create prediction target).<\/li>\n<li>Split into train\/test (we do not want to explore the feature distributions of the specific observations that we will evaluate our final model\u00a0upon).<\/li>\n<li>More dataset exploration\/cleaning (check feature distributions &amp; their relations with the prediction target, decide how to deal with missing values accordingly, explore relationships between features).<\/li>\n<li>Feature engineering (derive new features from the existing feature set that may help predict delivery duration).<\/li>\n<\/ul>\n<p>In practice, there may be some back &amp; forth between these steps (features highlighted as important during model training may motivate additional feature engineering related to those features, or poor model performance may motivate the need to gather more data). 
For now, we\u2019ll walk through this task following the process\u00a0above.<\/p>\n<h4>Data Preparation &amp; Exploratory Analysis<\/h4>\n<p>Now that we\u2019ve outlined our approach, let\u2019s take a look at our data and what kind of features we\u2019re working\u00a0with.<\/p>\n<figure><img data-recalc-dims=\"1\" decoding=\"async\" alt=\"\" src=\"https:\/\/i0.wp.com\/cdn-images-1.medium.com\/max\/735\/1%2AXHCI8UtBNOcDv5aYbdJN_A.png?ssl=1\"><\/figure>\n<p>From the above, we see our data contains ~197,000 deliveries, with a variety of numeric &amp; non-numeric features. None of the features are missing a large percentage of values (lowest non-null count ~181,000), so we likely won\u2019t have to worry about dropping any features entirely.<\/p>\n<p>Let\u2019s check if our data contains any duplicated deliveries, and if there are any observations that we cannot compute delivery time\u00a0for.<\/p>\n<pre># count fully duplicated rows<br>print(f\"Number of duplicates: {df.duplicated().sum()}\")<br><br># count missing values in the two timestamps needed to compute delivery time<br>print(pd.DataFrame({'Missing Count': df[['created_at', 'actual_delivery_time']].isna().sum()}))<\/pre>\n<figure><img data-recalc-dims=\"1\" decoding=\"async\" alt=\"\" src=\"https:\/\/i0.wp.com\/cdn-images-1.medium.com\/max\/347\/1%2Al9ZyXToDNvLkiwDBhkJCJg.png?ssl=1\"><\/figure>\n<p>We see that all the deliveries are unique. However, there are 7 deliveries that are missing a value for \u2018actual_delivery_time\u2019, which means we won\u2019t be able to compute the delivery duration for these orders. Since there are only a handful of these, we\u2019ll remove these observations from our\u00a0data.<\/p>\n<p>Now, let\u2019s create our prediction target. 
We want to predict the delivery duration (in seconds), which is the elapsed time between when the customer placed the order (\u2018created_at\u2019) and when they received the order (\u2018actual_delivery_time\u2019).<\/p>\n<pre># convert columns to datetime <br>df['created_at'] = pd.to_datetime(df['created_at'], utc=True)<br>df['actual_delivery_time'] = pd.to_datetime(df['actual_delivery_time'], utc=True)<br><br># create prediction target<br>df['seconds_to_delivery'] = (df['actual_delivery_time'] - df['created_at']).dt.total_seconds()<\/pre>\n<p>The last thing we\u2019ll do before splitting our data into train\/test is check for missing values. We already viewed the non-null counts for each feature above, but let\u2019s view the proportions to get a better\u00a0picture.<\/p>\n<figure><img data-recalc-dims=\"1\" decoding=\"async\" alt=\"\" src=\"https:\/\/i0.wp.com\/cdn-images-1.medium.com\/max\/553\/1%2A16zJU1CivJuGK-L05MBlcQ.png?ssl=1\"><\/figure>\n<p>We see that the market features (\u2018total_onshift_dashers\u2019, \u2018total_busy_dashers\u2019, \u2018total_outstanding_orders\u2019) have the highest percentage of missing values (~8% missing). The feature with the second-highest missing data rate is \u2018store_primary_category\u2019 (~2%). All other features have &lt; 1%\u00a0missing.<\/p>\n<p>Since none of the features have a high missing count, we won\u2019t remove any of them. Later on, we will look at the feature distributions to help us decide how to appropriately deal with missing observations for each\u00a0feature.<\/p>\n<p>But first, let\u2019s split our data into train\/test. 
We will proceed with an 80\/20 split, and we\u2019ll write this test data to a separate file which we won\u2019t touch until evaluating our final\u00a0model.<\/p>\n<pre>from sklearn.model_selection import train_test_split<br>import os<br><br># shuffle<br>df = df.sample(frac=1, random_state=42)<br>df = df.reset_index(drop=True)<br><br># split<br>train_df, test_df = train_test_split(df, test_size=0.2, random_state=42)<br><br># write test data to separate file<br>directory = 'datasets'<br>file_name = 'test_data.csv'<br>file_path = os.path.join(directory, file_name)<br>os.makedirs(directory, exist_ok=True)<br>test_df.to_csv(file_path, index=False)<\/pre>\n<p>Now, let\u2019s dive into the specifics of our train data. We\u2019ll establish our numeric &amp; categorical features, to make it clear which columns are being referenced in later exploratory steps.<\/p>\n<pre>categorical_feats = [<br>    'market_id',<br>    'store_id',<br>    'store_primary_category',<br>    'order_protocol'<br>]<br><br>numeric_feats = [<br>    'total_items',<br>    'subtotal',<br>    'num_distinct_items',<br>    'min_item_price',<br>    'max_item_price',<br>    'total_onshift_dashers',<br>    'total_busy_dashers',<br>    'total_outstanding_orders', <br>    'estimated_order_place_duration',<br>    'estimated_store_to_consumer_driving_duration'<br>]<\/pre>\n<p>Let\u2019s revisit the categorical features with missing values (\u2018market_id\u2019, \u2018store_primary_category\u2019, \u2018order_protocol\u2019). Since there was little missing data among those features (&lt; 3%), we will simply impute those missing values with an \u201cunknown\u201d category.<\/p>\n<ul>\n<li>This way, we won\u2019t have to remove data from other features.<\/li>\n<li>Perhaps the absence of feature values holds some predictive power for delivery duration i.e. 
these features are not <a href=\"https:\/\/stefvanbuuren.name\/fimd\/sec-MCAR.html\">missing at\u00a0random<\/a>.<\/li>\n<li>Additionally, we will add this imputation step to our preprocessing pipeline during modeling, so that we won\u2019t have to manually duplicate this work on our test\u00a0set.<\/li>\n<\/ul>\n<pre>missing_cols_categorical = ['market_id', 'store_primary_category', 'order_protocol']<br><br>train_df[missing_cols_categorical] = train_df[missing_cols_categorical].fillna(\"unknown\")<\/pre>\n<p>Let\u2019s look at our categorical features.<\/p>\n<pre>pd.DataFrame({'Cardinality': train_df[categorical_feats].nunique()}).rename_axis('Feature')<\/pre>\n<figure><img data-recalc-dims=\"1\" decoding=\"async\" alt=\"\" src=\"https:\/\/i0.wp.com\/cdn-images-1.medium.com\/max\/305\/1%2A7fq8oo9RydsQxzLWNoR8BA.png?ssl=1\"><\/figure>\n<p>Since \u2018market_id\u2019 &amp; \u2018order_protocol\u2019 have low cardinality, we can visualize their distributions easily. On the other hand, \u2018store_id\u2019 &amp; \u2018store_primary_category\u2019 are high cardinality features. 
We\u2019ll take a deeper look at those\u00a0later.<\/p>\n<pre>import seaborn as sns<br>import matplotlib.pyplot as plt<br><br>categorical_feats_subset = [<br>    'market_id',<br>    'order_protocol'<br>]<br><br># Set up the grid<br>fig, axes = plt.subplots(1, len(categorical_feats_subset), figsize=(13, 5), sharey=True)<br><br># Create barplots for each variable<br>for i, col in enumerate(categorical_feats_subset):<br>    sns.countplot(x=col, data=train_df, ax=axes[i])<br>    axes[i].set_title(f\"Frequencies: {col}\")<br><br># Adjust layout<br>plt.tight_layout()<br>plt.show()<\/pre>\n<figure><img data-recalc-dims=\"1\" decoding=\"async\" alt=\"\" src=\"https:\/\/i0.wp.com\/cdn-images-1.medium.com\/max\/1024\/1%2Aq2jJDB87lUKza9or_H_bjQ.png?ssl=1\"><\/figure>\n<p>Some key things to\u00a0note:<\/p>\n<ul>\n<li>~70% of orders placed have \u2018market_id\u2019 of 1, 2,\u00a04<\/li>\n<li>&lt; 1% of orders have \u2018order_protocol\u2019 of 6 or\u00a07<\/li>\n<\/ul>\n<p>Unfortunately, we don\u2019t have any additional information about these variables, such as which \u2018market_id\u2019 values are associated with which cities\/locations, and what each \u2018order_protocol\u2019 number represents. At this point, asking for additional data concerning this information may be a good idea, as it may help for investigating trends in delivery duration across broader region\/location categorizations.<\/p>\n<p>Let\u2019s look at our higher cardinality categorical features. Perhaps each \u2018store_primary_category\u2019 has an associated \u2018store_id\u2019 range? 
If so, we may not need \u2018store_id\u2019, as \u2018store_primary_category\u2019 would already encapsulate a lot of the information about the store being ordered\u00a0from.<\/p>\n<pre>store_info = train_df[['store_id', 'store_primary_category']]<br><br>store_info.groupby('store_primary_category')['store_id'].agg(['min', 'max'])<\/pre>\n<figure><img data-recalc-dims=\"1\" decoding=\"async\" alt=\"\" src=\"https:\/\/i0.wp.com\/cdn-images-1.medium.com\/max\/322\/1%2A1kO3-C3aQCW-gA5O8kLCLQ.png?ssl=1\"><\/figure>\n<p>Clearly not the case: we see that \u2018store_id\u2019 ranges overlap across levels of \u2018store_primary_category\u2019.<\/p>\n<p>A quick look at the distinct values and associated frequencies for \u2018store_id\u2019 &amp; \u2018store_primary_category\u2019 shows that these features have <a href=\"https:\/\/docs.honeycomb.io\/get-started\/basics\/observability\/concepts\/high-cardinality\/#:~:text=High%20cardinality%20refers%20to%20a,of%20thousands%20of%20distinct%20values.\">high cardinality<\/a> and are sparsely distributed. In general, high cardinality categorical features may be problematic in regression tasks, particularly for regression algorithms that require solely numeric data. When these high cardinality features are encoded, they may enlarge the feature space drastically, making the available data sparse and decreasing the model\u2019s ability to generalize to new observations in that feature space. 
For a more rigorous explanation of the phenomenon, you can read more about it\u00a0<a href=\"https:\/\/en.wikipedia.org\/wiki\/Curse_of_dimensionality\">here<\/a>.<\/p>\n<p>Let\u2019s get a sense of how sparsely distributed these features\u00a0are.<\/p>\n<pre>store_id_values = train_df['store_id'].value_counts()<br><br># Plot a bar chart of order counts per store<br>plt.figure(figsize=(8, 5))<br>plt.bar(store_id_values.index, store_id_values.values, color='skyblue')<br><br># Add titles and labels<br>plt.title('Value Counts: store_id', fontsize=14)<br>plt.xlabel('store_id', fontsize=12)<br>plt.ylabel('Frequency', fontsize=12)<br>plt.xticks(rotation=45)  # Rotate x-axis labels for better readability<br>plt.tight_layout()<br>plt.show()<\/pre>\n<figure><img data-recalc-dims=\"1\" decoding=\"async\" alt=\"\" src=\"https:\/\/i0.wp.com\/cdn-images-1.medium.com\/max\/733\/1%2Ask7RT5uyobxkRhijZY8ESw.png?ssl=1\"><\/figure>\n<p>We see that there are a handful of stores that have hundreds of orders, but the majority of them have far fewer than\u00a0100.<\/p>\n<p>To handle the high cardinality of \u2018store_id\u2019, we\u2019ll create another feature, \u2018store_id_freq\u2019, that groups the \u2018store_id\u2019 values by frequency.<\/p>\n<ul>\n<li>We\u2019ll group the \u2018store_id\u2019 values into five different percentile bins shown\u00a0below.<\/li>\n<li>\u2018store_id_freq\u2019 will have much lower cardinality than \u2018store_id\u2019, but will retain relevant information regarding the popularity of the store the delivery was ordered\u00a0from.<\/li>\n<li>For more inspiration behind this logic, check out this\u00a0<a href=\"https:\/\/www.linkedin.com\/advice\/0\/how-do-you-deal-categorical-features-high-cardinality#:~:text=One%20way%20to%20reduce%20the,Tashi%20Tamang\">thread<\/a>.<\/li>\n<\/ul>\n<pre>import numpy as np<br><br>def encode_frequency(freq, percentiles) -&gt; str:<br>    if freq &lt; percentiles[0]:<br>        return '[0-50)'<br>    elif freq &lt; percentiles[1]:<br>        return 
'[50-75)'<br>    elif freq &lt; percentiles[2]:<br>        return '[75-90)'<br>    elif freq &lt; percentiles[3]:<br>        return '[90-99)'<br>    else:<br>        return '99+'<br><br># number of orders per store, and the percentile cut-offs of those counts<br>value_counts = train_df['store_id'].value_counts()<br>percentiles = np.percentile(value_counts, [50, 75, 90, 99]) <br><br># apply encode_frequency to each store_id based on their number of orders<br>train_df['store_id_freq'] = train_df['store_id'].apply(lambda x: encode_frequency(value_counts[x], percentiles))<br><br>pd.DataFrame({'Count':train_df['store_id_freq'].value_counts()}).rename_axis('Frequency Bin')<\/pre>\n<figure><img data-recalc-dims=\"1\" decoding=\"async\" alt=\"\" src=\"https:\/\/i0.wp.com\/cdn-images-1.medium.com\/max\/164\/1%2A9gW0KsNrNjIin5j9VSYBww.png?ssl=1\"><\/figure>\n<p>Our encoding shows us that ~60,000 deliveries were ordered from stores categorized in the 90\u201399th percentile in terms of popularity, whereas ~12,000 deliveries were ordered from stores that were in the 0\u201350th percentile in popularity.<\/p>\n<p>Now that we\u2019ve attempted to capture relevant \u2018store_id\u2019 information in a lower dimension, let\u2019s try to do something similar with \u2018store_primary_category\u2019.<\/p>\n<p>Let\u2019s look at the most popular \u2018store_primary_category\u2019 levels.<\/p>\n<figure><img data-recalc-dims=\"1\" decoding=\"async\" alt=\"\" src=\"https:\/\/i0.wp.com\/cdn-images-1.medium.com\/max\/261\/1%2ATaUSijHy72EEP8nkkP0Glw.png?ssl=1\"><\/figure>\n<p>A quick look shows us that many of these \u2018store_primary_category\u2019 levels are not mutually exclusive (e.g. \u2018american\u2019 &amp; \u2018burger\u2019). 
Further investigation shows many more examples of this kind of\u00a0overlap.<\/p>\n<p>So, let\u2019s try to map these distinct store categories into a few basic, all-encompassing groups.<\/p>\n<pre>store_category_map = {<br>    'american': ['american', 'burger', 'sandwich', 'barbeque'],<br>    'asian': ['asian', 'chinese', 'japanese', 'indian', 'thai', 'vietnamese', 'dim-sum', 'korean', <br>              'sushi', 'bubble-tea', 'malaysian', 'singaporean', 'indonesian', 'russian'],<br>    'mexican': ['mexican'],<br>    'italian': ['italian', 'pizza'],<br>}<br><br>def map_to_category_type(category: str) -&gt; str:<br>    for category_type, categories in store_category_map.items():<br>        if category in categories:<br>            return category_type<br>    return \"other\"<br><br>train_df['store_category_type'] = train_df['store_primary_category'].apply(lambda x: map_to_category_type(x))<br><br>value_counts = train_df['store_category_type'].value_counts()<br><br># Plot pie chart<br>plt.figure(figsize=(6, 6))<br>value_counts.plot.pie(autopct='%1.1f%%', startangle=90, cmap='viridis', labels=value_counts.index)<br>plt.title('Category Distribution')<br>plt.ylabel('')  # Hide y-axis label for aesthetics<br>plt.show()<\/pre>\n<figure><img data-recalc-dims=\"1\" decoding=\"async\" alt=\"\" src=\"https:\/\/i0.wp.com\/cdn-images-1.medium.com\/max\/452\/1%2Ati8R3V11c8gGewWsDODUrg.png?ssl=1\"><\/figure>\n<p>This grouping is probably brutally simple, and there may very well be a better way to group these store categories. Let\u2019s proceed with it for now for simplicity.<\/p>\n<p>We\u2019ve done a good deal of investigation into our categorical features. 
Let\u2019s look at the distributions for our numeric features.<\/p>\n<pre># Create grid for boxplots<br>fig, axes = plt.subplots(nrows=5, ncols=2, figsize=(12, 15))  # Adjust figure size<br>axes = axes.flatten()  # Flatten the 5x2 axes into a 1D array for easier iteration<br><br># Generate boxplots for each numeric feature<br>for i, column in enumerate(numeric_feats):<br>    sns.boxplot(y=train_df[column], ax=axes[i])<br>    axes[i].set_title(f\"Boxplot for {column}\")<br>    axes[i].set_ylabel(column)<br><br># Remove any unused subplots (if any)<br>for i in range(len(numeric_feats), len(axes)):<br>    fig.delaxes(axes[i])<br><br># Adjust layout for better spacing<br>plt.tight_layout()<br>plt.show()<\/pre>\n<figure><img data-recalc-dims=\"1\" decoding=\"async\" alt=\"\" src=\"https:\/\/i0.wp.com\/cdn-images-1.medium.com\/max\/1024\/1%2AOfzdBdsT1n7EUBVGkN6Rtg.png?ssl=1\"><figcaption>Boxplots for a subset of our numeric\u00a0features<\/figcaption><\/figure>\n<p>Many of the distributions appear more right-skewed than they really are because of the presence of outliers.<\/p>\n<p>In particular, there seems to be an order with 400+ items. This seems strange as the next largest order has fewer than 100\u00a0items.<\/p>\n<p>Let\u2019s look more into that 400+ item\u00a0order.<\/p>\n<pre>train_df[train_df['total_items']==train_df['total_items'].max()]<\/pre>\n<figure><img data-recalc-dims=\"1\" decoding=\"async\" alt=\"\" src=\"https:\/\/i0.wp.com\/cdn-images-1.medium.com\/max\/1024\/1%2AEBnOKPbtcEoCoad96_SVbg.png?ssl=1\"><figcaption>Fast food order w\/ 411 total items, 5 distinct items, and total cost of\u00a0$31.15<\/figcaption><\/figure>\n<p>That order was a fast-food order that consisted of 411 total items, but only 5 distinct items. 
It was also placed late at night (midnight-1 am).<\/p>\n<p>While it\u2019s not impossible for somebody to place an order like that (<a href=\"https:\/\/www.reddit.com\/r\/McDonaldsEmployees\/comments\/1bn7ona\/400_order_today_usa\/\">McDonalds late night run??<\/a>), we\u2019ll proceed to remove this observation from our data since it\u2019s highly unlikely that our model will have to make delivery duration predictions for orders with more than 100\u00a0items.<\/p>\n<p>A couple of additional feature engineering steps:<\/p>\n<ul>\n<li>For each order, we are provided the price of the cheapest (\u2018min_item_price\u2019) &amp; most expensive (\u2018max_item_price\u2019) item. We can consolidate this price range information into a single feature, which we\u2019ll call \u2018item_price_range\u2019.<\/li>\n<li>Time of day when the order was placed seems like relevant information for predicting delivery duration\u200a\u2014\u200aorders placed at times of day when the market is busier are likely to take longer to deliver. So, let\u2019s extract the hour information from the order creation timestamp (\u2018created_at\u2019). We\u2019ll call this \u2018hour_of_day\u2019.<\/li>\n<\/ul>\n<pre># extract price range<br>train_df['item_price_range'] = train_df['max_item_price'] - train_df['min_item_price']<br><br># extract hour of day<br>time_info = train_df['created_at'].astype(str).str.split().str[1]<br>train_df['hour_of_day'] = time_info.str.split(\":\").str[0]<\/pre>\n<p>Let\u2019s revisit our numeric features with missing\u00a0data.<\/p>\n<p>From the boxplots above, it seemed like most of our features tended to skew right. 
Thus, we\u2019ll impute missing values for our numeric features with their median, as the median is likely more representative of the center of a skewed distribution than the\u00a0mean.<\/p>\n<pre># median of each numeric column that has missing values<br>values = {<br>    'total_onshift_dashers': train_df['total_onshift_dashers'].median(), <br>    'total_busy_dashers': train_df['total_busy_dashers'].median(),<br>    'total_outstanding_orders': train_df['total_outstanding_orders'].median(),<br>    'estimated_store_to_consumer_driving_duration': train_df['estimated_store_to_consumer_driving_duration'].median()<br>}<br># passing a dict fills each listed column with its own median<br>train_df = train_df.fillna(value=values)<\/pre>\n<p>There are certainly other <a href=\"https:\/\/scikit-learn.org\/1.6\/modules\/impute.html\">imputation strategies<\/a> to consider here. Other approaches include KNN imputation (imputing missing values with the average of the K most \u201csimilar\u201d observations), or <a href=\"https:\/\/www.kaggle.com\/code\/residentmario\/simple-techniques-for-missing-data-imputation?scriptVersionId=3398028&amp;cellId=13\">building another regression model to predict those missing values<\/a> from features thought to be potentially relevant. For now, we\u2019ll proceed with median imputation for simplicity.<\/p>\n<p>The last thing we\u2019ll do before moving on to modeling is check correlations among our numeric features. Some of the regression algorithms we will experiment with work best when the features are not strongly correlated with each other, i.e. 
no collinearity.<\/p>\n<figure><img data-recalc-dims=\"1\" decoding=\"async\" alt=\"\" src=\"https:\/\/i0.wp.com\/cdn-images-1.medium.com\/max\/1024\/1%2AEKp2RmplnzczOTqozgsucA.png?ssl=1\"><\/figure>\n<p>There is some high collinearity present:<\/p>\n<ul>\n<li>\u2018total_items\u2019 &amp; \u2018num_distinct_items\u2019 (r &gt;\u00a00.8)<\/li>\n<li>\u2018total_onshift_dashers\u2019, \u2018total_busy_dashers\u2019, \u2018total_outstanding_orders\u2019 (r &gt;\u00a00.9)<\/li>\n<\/ul>\n<p>Among sets of highly collinear features, we will keep only one feature from each set in the model\u2019s final feature set. We will add this step as part of our preprocessing pipeline for our\u00a0models.<\/p>\n<p>Interestingly, each of our numeric features shows fairly weak correlation (r &lt; 0.2) with our prediction target (\u2018seconds_to_delivery\u2019). This may give us a clue that a linear model will not effectively capture the relationship between our feature set &amp; delivery duration (if there is any relationship).<\/p>\n<p>Now that we\u2019ve prepped &amp; explored our data, let\u2019s build some\u00a0models.<\/p>\n<h4>Building Models<\/h4>\n<p>The first things to consider for modeling are what algorithms we want to build models with, and how to <a href=\"https:\/\/scikit-learn.org\/1.6\/modules\/model_evaluation.html#\">measure model performance<\/a>.<\/p>\n<p>We will experiment with the following algorithms, all of which are commonly used for regression tasks. 
They include a mix of algorithms that can capture linear &amp; non-linear relationships between the feature space &amp; prediction target.<\/p>\n<ul>\n<li><a href=\"https:\/\/scikit-learn.org\/1.6\/modules\/generated\/sklearn.linear_model.Lasso.html#sklearn.linear_model.Lasso\">Lasso Regression<\/a><\/li>\n<li><a href=\"https:\/\/scikit-learn.org\/stable\/modules\/generated\/sklearn.svm.SVR.html\">Support Vector Regression<\/a><\/li>\n<li><a href=\"https:\/\/scikit-learn.org\/stable\/modules\/generated\/sklearn.ensemble.RandomForestRegressor.html\">Random Forest Regression<\/a><\/li>\n<li><a href=\"https:\/\/scikit-learn.org\/stable\/modules\/generated\/sklearn.ensemble.GradientBoostingRegressor.html\">Gradient Boosted Regression<\/a><\/li>\n<\/ul>\n<p>We\u2019ll carry out <a href=\"https:\/\/scikit-learn.org\/1.6\/modules\/grid_search.html\">hyperparameter tuning<\/a> for each of the algorithms above to maximize the performance of each algorithm.<\/p>\n<p>We\u2019ll evaluate our models using <a href=\"https:\/\/scikit-learn.org\/1.6\/modules\/generated\/sklearn.metrics.root_mean_squared_error.html#sklearn.metrics.root_mean_squared_error\">Root Mean Squared Error (RMSE)<\/a>, as that will allow us to discuss the errors of each model in seconds, which makes sense in the context of delivery duration.<\/p>\n<p>Although we prepped\/cleaned our data, there are still some <a href=\"https:\/\/scikit-learn.org\/stable\/data_transforms.html\">data transformations<\/a> that need to be done to get the data compatible for each of the regression algorithms above. 
Specifically, the regression algorithms will require some or all of the following preprocessing steps prior to prediction:<\/p>\n<ul>\n<li>Imputing missing feature\u00a0values<\/li>\n<li>\n<a href=\"https:\/\/scikit-learn.org\/stable\/auto_examples\/preprocessing\/plot_scaling_importance.html\">Feature scaling<\/a> (for Lasso &amp; SVR\u00a0only)<\/li>\n<li>Dropping highly correlated features<\/li>\n<li>One-hot encoding categorical features<\/li>\n<\/ul>\n<p>These transformations may need to be applied to different subsets of our feature space, and some of these transformations may need to be applied in a particular order. Fortunately, we can use scikit-learn\u2019s <a href=\"https:\/\/scikit-learn.org\/stable\/modules\/compose.html#column-transformer\">ColumnTransformer<\/a> to specify which features to apply specific data transformations to, and scikit-learn\u2019s <a href=\"https:\/\/scikit-learn.org\/stable\/modules\/compose.html#\">Pipeline<\/a> to define a series of data transformations to apply in sequence.<\/p>\n<p>Note that we already took care of some of these preprocessing steps (such as imputing missing values) manually in our data preparation. However, defining this pipeline will come in handy when we want to apply this preprocessing workflow to new data, as these steps will already be saved as part of our model artifact.<\/p>\n<p>Additionally, Lasso Regression &amp; Support Vector Regression require <a href=\"https:\/\/scikit-learn.org\/stable\/auto_examples\/preprocessing\/plot_scaling_importance.html\">features to be on consistent scales<\/a> to maximize their effectiveness, since Lasso\u2019s penalty and SVR\u2019s kernel computations are both sensitive to the scale of each feature. In contrast, Random Forest &amp; Gradient Boosted Trees are built on top of <a href=\"https:\/\/scikit-learn.org\/stable\/modules\/tree.html\">Decision Trees<\/a>, which split on per-feature thresholds and so are unaffected by feature scale. 
Therefore, we\u2019ll define two preprocessing pipelines, one with scaling and one\u00a0without.<\/p>\n<pre>from sklearn.compose import ColumnTransformer<br>from sklearn.impute import SimpleImputer<br>from sklearn.pipeline import Pipeline<br>from sklearn.preprocessing import OneHotEncoder, StandardScaler<br><br># NOTE: DropHighlyCorrelatedFeatures is a custom transformer, not part of scikit-learn;<br># it must be defined or imported before running this block<br><br># sequence of transformations to apply to categorical features<br>categorical_pipeline = Pipeline(<br>    steps=[<br>        (\"categorical_imputer\", SimpleImputer(strategy=\"constant\", fill_value=\"unknown\")),<br>        (\"encoder\", OneHotEncoder(handle_unknown=\"ignore\", drop='first', sparse_output=False)),<br>    ]<br>)<br><br># sequence of transformations to apply to numeric features<br>numeric_pipeline_scaled = Pipeline(<br>    steps=[<br>        (\"numeric_imputer\", SimpleImputer(strategy=\"median\")), <br>        (\"scaler\", StandardScaler()),<br>        (\"drop_correlated_feats\", DropHighlyCorrelatedFeatures()),<br>    ]<br>)<br><br>numeric_pipeline = Pipeline(<br>    steps=[<br>        (\"numeric_imputer\", SimpleImputer(strategy=\"median\")), <br>        (\"drop_correlated_feats\", DropHighlyCorrelatedFeatures()),<br>    ]<br>)<br><br># combine numeric &amp; categorical feature transformations into single ColumnTransformer() object<br>preprocessor_w_scaling = ColumnTransformer(<br>    transformers=[<br>        ('numeric', numeric_pipeline_scaled, numeric_feats),<br>        ('cat', categorical_pipeline, categorical_feats),<br>    ],<br>)<br><br>preprocessor_wo_scaling = ColumnTransformer(<br>    transformers=[<br>        ('numeric', numeric_pipeline, numeric_feats),<br>        ('cat', categorical_pipeline, categorical_feats),<br>    ]<br>)<\/pre>\n<p><strong>Let\u2019s build our first model using <\/strong><a href=\"https:\/\/scikit-learn.org\/stable\/modules\/generated\/sklearn.linear_model.LassoCV.html#sklearn.linear_model.LassoCV\"><strong>Lasso regression<\/strong><\/a><strong>.<\/strong><\/p>\n<p>Lasso regression models a linear relationship between a scalar target and one or more explanatory variables. 
It adds an L1 penalty term to the linear regression objective function, which can shrink the coefficients of features with little to no relationship with the prediction target to exactly zero.<\/p>\n<p>Preprocessing requirements include:<\/p>\n<ul>\n<li>Feature scaling (fair regularization requires features to be on similar scales, and scaling may speed up the linear regression optimization process)<\/li>\n<li>One-hot encoding categorical features<\/li>\n<\/ul>\n<pre>from sklearn.linear_model import Lasso<br>from sklearn.model_selection import GridSearchCV<br><br># preprocessing + lasso regression pipeline<br>lasso_reg = Pipeline(<br>    steps=[<br>        (\"preprocessor\", preprocessor_w_scaling), <br>        (\"regression\", Lasso(random_state=13)),<br>    ]<br>)<br><br># lasso regression hyperparameter space<br>lasso_param_grid = {<br>    \"regression__alpha\": np.logspace(-3, -2, 10),<br>}<br><br># scores are negated RMSE, so higher (closer to 0) is better<br>lasso_search_cv = GridSearchCV(lasso_reg, lasso_param_grid, scoring='neg_root_mean_squared_error', verbose=4)<br><br># fit model, performing exhaustive search over hyperparameter space<br>lasso_search_cv.fit(df_X, df_y)<br><br># retrieve results of hyperparameter tuning<br>lasso_cv_results_df = pd.DataFrame(lasso_search_cv.cv_results_)<br>lasso_cv_results_df.sort_values(by='mean_test_score', ascending=False).drop(columns=['params','split0_test_score','split1_test_score','split2_test_score','split3_test_score','split4_test_score'])<\/pre>\n<figure><img data-recalc-dims=\"1\" decoding=\"async\" alt=\"\" src=\"https:\/\/i0.wp.com\/cdn-images-1.medium.com\/max\/527\/1%2Ad9kuOftTZW4iBbm-I24jzQ.png?ssl=1\"><figcaption>Hyperparameter tuning results: Lasso Regression<\/figcaption><\/figure>\n<p>Lasso Regression Results:<\/p>\n<ul>\n<li>The best performing model had a root mean squared error of ~1080 seconds, or ~18 minutes. 
However, all models across different values of alpha had practically identical performance.<\/li>\n<li>On average, our predictions for delivery duration are ~18 minutes off from the true delivery duration. So, not super effective for predicting delivery duration.<\/li>\n<\/ul>\n<p><strong>Let\u2019s move to <\/strong><a href=\"https:\/\/scikit-learn.org\/stable\/modules\/generated\/sklearn.svm.SVR.html\"><strong>Support Vector Regression<\/strong><\/a><strong>, a popular regression algorithm derived from <\/strong><a href=\"https:\/\/scikit-learn.org\/stable\/modules\/svm.html\"><strong>Support Vector Machines<\/strong><\/a><strong>.<\/strong><\/p>\n<p>Preprocessing requirements include:<\/p>\n<ul>\n<li>Feature scaling (SVR computes dot products between observations in the feature space, so certain features may dominate the dot product computations if they are on widely different scales)<\/li>\n<li>One-hot encoding categorical features<\/li>\n<\/ul>\n<pre>from sklearn.svm import SVR<br>from sklearn.model_selection import GridSearchCV<br><br>svr_reg = Pipeline(<br>    steps=[<br>        (\"preprocessor\", preprocessor_w_scaling),<br>        (\"regression\", SVR(kernel='poly', max_iter=1000)), # proceeding with polynomial kernel based on sub-par linear regression performance<br>    ]<br>)<br><br>svr_param_grid = {<br>    \"regression__C\": np.logspace(-3, 1, 4),<br>    'regression__epsilon': np.logspace(-3, 1, 4),<br>    # 'regression__kernel': ['linear', 'poly', 'rbf']<br><br>}<br><br>svr_search_cv = GridSearchCV(svr_reg, svr_param_grid, verbose=4, scoring='neg_root_mean_squared_error')<br><br>svr_search_cv.fit(df_X, df_y)<br><br>svr_cv_results_df = pd.DataFrame(svr_search_cv.cv_results_)<br>svr_cv_results_df.sort_values(\"mean_test_score\", ascending=False).drop(columns=['params','split0_test_score','split1_test_score','split2_test_score','split3_test_score','split4_test_score'])<\/pre>\n<figure><img data-recalc-dims=\"1\" decoding=\"async\" alt=\"\" 
src=\"https:\/\/i0.wp.com\/cdn-images-1.medium.com\/max\/687\/1%2AA-CKZNIJ9O2UDAy2Ra0A9Q.png?ssl=1\"><figcaption>Hyperparameter tuning results: Support Vector Regression<\/figcaption><\/figure>\n<p>SVR results:<\/p>\n<ul>\n<li>Best performing model had a root mean squared error of ~1335 seconds (~22 minutes). These errors are ~4 minutes larger than our best Lasso Regression model.<\/li>\n<\/ul>\n<p><strong>Our next algorithm is <\/strong><a href=\"https:\/\/scikit-learn.org\/stable\/modules\/generated\/sklearn.ensemble.RandomForestRegressor.html#sklearn.ensemble.RandomForestRegressor\"><strong>Random Forest Regression<\/strong><\/a><strong>.<\/strong><\/p>\n<p>Random Forest Regression is an <a href=\"https:\/\/en.wikipedia.org\/wiki\/Ensemble_learning#:~:text=Ensemble%20learning%20trains%20two%20or,%22weak%20learners%22%20in%20literature.\">ensemble learning method<\/a> that combines the predictions of multiple decision trees, where each tree is built from a bootstrap sample drawn from the training\u00a0set.<\/p>\n<pre>from sklearn.ensemble import RandomForestRegressor<br>from sklearn.model_selection import GridSearchCV, RandomizedSearchCV<br><br>rfr_clf = Pipeline(<br>    steps=[<br>        (\"preprocessor\", preprocessor_wo_scaling), <br>        (\"regressor\", RandomForestRegressor(random_state=13))<br>    ]<br>)<br><br>rfr_param_grid = {<br>    \"regressor__n_estimators\": [10, 100, 200],<br>    # 'regressor__max_depth': [None, 50],<br>    'regressor__max_features': ['sqrt', 'log2', None],<br>}<br><br>rfr_search_cv = GridSearchCV(rfr_clf, rfr_param_grid, verbose=4, scoring='neg_root_mean_squared_error')<br><br>rfr_search_cv.fit(df_X, df_y)<br><br>rfr_cv_results = pd.DataFrame(rfr_search_cv.cv_results_)<br>rfr_cv_results = rfr_cv_results.sort_values(\"mean_test_score\", ascending=False).drop(columns=['params','split0_test_score','split1_test_score','split2_test_score','split3_test_score','split4_test_score'])<\/pre>\n<figure><img data-recalc-dims=\"1\" 
decoding=\"async\" alt=\"\" src=\"https:\/\/i0.wp.com\/cdn-images-1.medium.com\/max\/790\/1%2AeHNvA67Ro9H7DEoZxImAAg.png?ssl=1\"><figcaption>Results of Random Forest Regression hyperparameter tuning<\/figcaption><\/figure>\n<p>Random Forest\u00a0results:<\/p>\n<ul>\n<li>Best performing model had a root mean squared error of ~1050 seconds (~17.5 minutes). So far, this is our best performing model, but not by much (errors are ~30 seconds lower than that of Lasso Regression).<\/li>\n<\/ul>\n<p><strong>Our last algorithm is <\/strong><a href=\"https:\/\/scikit-learn.org\/stable\/modules\/generated\/sklearn.ensemble.GradientBoostingRegressor.html#sklearn.ensemble.GradientBoostingRegressor\"><strong>Gradient Boosted Regression<\/strong><\/a><strong>.<\/strong><\/p>\n<p>Gradient Boosted Regression is also an ensemble method like Random Forest Regression, but unlike the former, the learners in Gradient Boosted Regression are not independent\u200a\u2014\u200aeach base learner attempts to correct the errors of the previous.<\/p>\n<pre>from sklearn.ensemble import GradientBoostingRegressor<br>from sklearn.model_selection import GridSearchCV, RandomizedSearchCV<br><br>gbr_clf = Pipeline(<br>    steps=[<br>        (\"preprocessor\", preprocessor_wo_scaling), <br>        (\"regressor\", GradientBoostingRegressor(random_state=13))<br>    ]<br>)<br><br>gbr_param_grid = {<br>    \"regressor__n_estimators\": [10, 100, 200],<br>    'regressor__learning_rate': np.logspace(-3, 0, 4),<br>}<br><br>gbr_search_cv = GridSearchCV(gbr_clf, gbr_param_grid, verbose=4, scoring='neg_root_mean_squared_error')<br><br>gbr_search_cv.fit(df_X, df_y)<br><br>gbr_cv_results = pd.DataFrame(gbr_search_cv.cv_results_)<br>gbr_cv_results = gbr_cv_results.sort_values(\"mean_test_score\", ascending=False).drop(columns=['params','split0_test_score','split1_test_score','split2_test_score','split3_test_score','split4_test_score'])<\/pre>\n<figure><img data-recalc-dims=\"1\" decoding=\"async\" alt=\"\" 
src=\"https:\/\/i0.wp.com\/cdn-images-1.medium.com\/max\/787\/1%2A6IPLuUxjTwpc6pFChYfueg.png?ssl=1\"><figcaption>Results of Gradient Boosted Regression hyperparameter tuning<\/figcaption><\/figure>\n<p>Gradient Boosted Regression results:<\/p>\n<ul>\n<li>Best performing model had a root mean squared error of ~1046 seconds (~17.5 minutes), which is practically identical to the performance we were getting from Random Forest Regression.<\/li>\n<\/ul>\n<p><strong>Let\u2019s compare the RMSE scores from the best models we achieved with each regression algorithm.<\/strong><\/p>\n<figure><img data-recalc-dims=\"1\" decoding=\"async\" alt=\"\" src=\"https:\/\/i0.wp.com\/cdn-images-1.medium.com\/max\/742\/1%2AIz085yoVDOL3DLQErA28aw.png?ssl=1\"><\/figure>\n<p>Our Gradient Boosted Regression model had the lowest RMSE, but outside of Support Vector Regression, all models had comparable performance. The RMSEs of Lasso, RFR, &amp; GBR were all within 40 seconds of each other, and considering the prediction errors were already &gt; 17 minutes off on average, this doesn\u2019t seem like\u00a0much.<\/p>\n<p><strong>That being said, we\u2019ll proceed to evaluate our trained GBR model on the test\u00a0set.<\/strong><\/p>\n<ul>\n<li>From the weak correlations (r &lt; 0.2) we saw between the features &amp; delivery duration in our EDA, the nature of the relationship between the features &amp; target is probably best captured in a non-linear fashion, so we\u2019ll rule out Lasso for this reason.<\/li>\n<li>Between <a href=\"https:\/\/stats.stackexchange.com\/questions\/173390\/gradient-boosting-tree-vs-random-forest\">Gradient Boosting &amp; Random Forests<\/a>, we\u2019ll choose Gradient Boosting for its potential to capture subtle, complex relationships, since boosting primarily works to reduce bias. 
When model performance is this close, <a href=\"https:\/\/www.linkedin.com\/advice\/1\/how-do-you-select-most-appropriate-machine-learning-xpy8e\">other factors to consider<\/a> could include interpretability, robustness to outliers, and reliance on hyperparameter tuning.<\/li>\n<\/ul>\n<p>We\u2019ll write out the hyperparameter-tuned GBR model artifact. This <a href=\"https:\/\/scikit-learn.org\/stable\/model_persistence.html#\">model serialization<\/a> will contain the necessary data transformations prior to prediction, as well as the model hyperparameters that led to optimal GBR performance.<\/p>\n<pre># Use joblib for efficient model serialization<br>import joblib  <br><br># Save trained GBR model<br>joblib.dump(gbr_search_cv.best_estimator_, '..\/models\/best_model.pkl')<br>print(\"Best model saved to 'best_model.pkl'\")<\/pre>\n<h4>Final Model Evaluation<\/h4>\n<p>Now, we\u2019ll evaluate our chosen model on our holdout test data. Before we generate predictions on the test data, we\u2019ll have to repeat some feature engineering steps that we did in our initial data prep. 
Specifically, we\u2019ll need to create the following features in our test\u00a0data:<\/p>\n<ul>\n<li>\u2018store_id_freq\u2019<\/li>\n<li>\u2018store_category_type\u2019<\/li>\n<li>\u2018item_price_range\u2019<\/li>\n<li>\u2018hour_of_day\u2019<\/li>\n<\/ul>\n<p>The rest of the data cleaning\/preprocessing steps (imputing missing values, feature scaling, dropping correlated features) will be taken care of in the pipeline defined in the model artifact.<\/p>\n<pre>from feature_eng_utils import encode_frequency, map_to_category_type<br><br># create store_id_freq<br>value_counts = test_df['store_id'].value_counts()<br>percentiles = np.percentile(value_counts, [50, 75, 90, 99]) <br>test_df['store_id_freq'] = test_df['store_id'].apply(lambda x: encode_frequency(value_counts[x], percentiles))<br><br># create store_category_type<br>test_df['store_category_type'] = test_df['store_primary_category'].apply(lambda x: map_to_category_type(x))<br><br># create item_price_range<br>test_df['item_price_range'] = test_df['max_item_price'] - test_df['min_item_price']<br><br># create hour_of_day<br>time_info = test_df['created_at'].astype(str).str.split().str[1]<br>test_df['hour_of_day'] = time_info.str.split(\":\").str[0]<\/pre>\n<p>Let\u2019s evaluate our model on the test\u00a0data.<\/p>\n<pre>import joblib<br>from custom_transformers import DropHighlyCorrelatedFeatures<br>from sklearn.metrics import root_mean_squared_error<br><br># Load the saved model<br>loaded_model = joblib.load('..\/models\/best_model.pkl')<br><br># Make predictions<br>y_pred = loaded_model.predict(test_df_X)<br><br># Compute root mean squared error<br>test_rmse = root_mean_squared_error(test_df_y, y_pred)<br>print(\"Test RMSE:\", test_rmse)<\/pre>\n<p>Our final test RMSE comes out to be ~1080\u00a0seconds.<\/p>\n<p>Thus, on average, our predictions for delivery duration are ~18 minutes off from the true delivery duration. So overall, pretty bad. 
I\u2019m not sure if DoorDash models delivery duration from these features only, but I certainly would not recommend using this model as an endpoint for returning delivery duration estimates.<\/p>\n<p>Some ideas for further work\/investigation could include the following:<\/p>\n<ul>\n<li>\n<a href=\"https:\/\/scikit-learn.org\/stable\/auto_examples\/ensemble\/plot_gradient_boosting_regression.html#plot-feature-importance\">Investigating important features<\/a> that came up during prediction &amp; gathering more data related to those features.<\/li>\n<li>Some of the original features were ID values without any context (\u2018market_id\u2019, \u2018order_protocol\u2019). Retrieving more info on what those features mean may help improve some of the feature engineering decisions we\u00a0made.<\/li>\n<li>Our approach for imputing missing values was fairly basic. Additional investigation could be done to determine whether there was <a href=\"https:\/\/stefvanbuuren.name\/fimd\/sec-MCAR.html\">any pattern in those missing values<\/a>. If so, our imputing approach may have been inappropriate.<\/li>\n<\/ul>\n<h4>Conclusion<\/h4>\n<p>If you made it this far, thanks! There were a lot of data investigation &amp; modeling decisions that we went through, many of which had no clear-cut answer. However, I did my best to lay out all the relevant decisions made during the investigation process, as well as the reasoning behind those decisions. If there\u2019s any part of the process that you would\u2019ve done differently, please let me know in the comments; I\u2019d love to hear\u00a0it!<\/p>\n<p>I read through many great resources while doing this investigation. 
I did my best to include most of them below\u200a\u2014\u200aI highly recommend checking at least some of them\u00a0out!<\/p>\n<h4>Sources<\/h4>\n<p>Missing data:<\/p>\n<ul>\n<li><a href=\"https:\/\/stefvanbuuren.name\/fimd\/sec-MCAR.html\">https:\/\/stefvanbuuren.name\/fimd\/sec-MCAR.html<\/a><\/li>\n<\/ul>\n<p>Handling high cardinality features:<\/p>\n<ul>\n<li><a href=\"https:\/\/www.linkedin.com\/advice\/0\/how-do-you-deal-categorical-features-high-cardinality#\">https:\/\/www.linkedin.com\/advice\/0\/how-do-you-deal-categorical-features-high-cardinality#<\/a><\/li>\n<\/ul>\n<p>Pipelines &amp; Data Transformers:<\/p>\n<ul>\n<li><a href=\"https:\/\/scikit-learn.org\/1.6\/data_transforms.html\">https:\/\/scikit-learn.org\/1.6\/data_transforms.html<\/a><\/li>\n<li><a href=\"https:\/\/scikit-learn.org\/stable\/auto_examples\/compose\/index.html\">https:\/\/scikit-learn.org\/stable\/auto_examples\/compose\/index.html<\/a><\/li>\n<\/ul>\n<p>Scikit-learn Regression APIs:<\/p>\n<ul>\n<li><a href=\"https:\/\/scikit-learn.org\/stable\/modules\/generated\/sklearn.linear_model.LassoCV.html\">https:\/\/scikit-learn.org\/stable\/modules\/generated\/sklearn.linear_model.LassoCV.html<\/a><\/li>\n<li><a href=\"https:\/\/scikit-learn.org\/stable\/modules\/generated\/sklearn.svm.SVR.html\">https:\/\/scikit-learn.org\/stable\/modules\/generated\/sklearn.svm.SVR.html<\/a><\/li>\n<li><a href=\"https:\/\/scikit-learn.org\/stable\/modules\/generated\/sklearn.ensemble.RandomForestRegressor.html\">https:\/\/scikit-learn.org\/stable\/modules\/generated\/sklearn.ensemble.RandomForestRegressor.html<\/a><\/li>\n<li><a href=\"https:\/\/scikit-learn.org\/1.5\/modules\/generated\/sklearn.ensemble.GradientBoostingRegressor.html\">https:\/\/scikit-learn.org\/1.5\/modules\/generated\/sklearn.ensemble.GradientBoostingRegressor.html<\/a><\/li>\n<\/ul>\n<p>Hyperparameter tuning:<\/p>\n<ul>\n<li><a 
href=\"https:\/\/scikit-learn.org\/1.6\/modules\/grid_search.html#\">https:\/\/scikit-learn.org\/1.6\/modules\/grid_search.html#<\/a><\/li>\n<li><a href=\"https:\/\/www.kaggle.com\/code\/kenjee\/exhaustive-regression-parameter-tuning#ML-Algorithms---Regression-Example\">https:\/\/www.kaggle.com\/code\/kenjee\/exhaustive-regression-parameter-tuning#ML-Algorithms---Regression-Example<\/a><\/li>\n<li><a href=\"https:\/\/www.reddit.com\/r\/datascience\/comments\/mwl2zj\/do_you_often_find_hyperparam_tuning_does_very\/\">https:\/\/www.reddit.com\/r\/datascience\/comments\/mwl2zj\/do_you_often_find_hyperparam_tuning_does_very\/<\/a><\/li>\n<li><a href=\"https:\/\/www.linkedin.com\/advice\/3\/when-should-you-stop-tuning-your-hyperparameters-8j9pf\">https:\/\/www.linkedin.com\/advice\/3\/when-should-you-stop-tuning-your-hyperparameters-8j9pf<\/a><\/li>\n<\/ul>\n<p>Model selection:<\/p>\n<ul>\n<li><a href=\"https:\/\/www.linkedin.com\/advice\/1\/how-do-you-select-most-appropriate-machine-learning-xpy8e\">https:\/\/www.linkedin.com\/advice\/1\/how-do-you-select-most-appropriate-machine-learning-xpy8e<\/a><\/li>\n<\/ul>\n<p>Most of all, the entirety of the <a href=\"https:\/\/scikit-learn.org\/stable\/user_guide.html\">scikit-learn user guide<\/a> &amp; <a href=\"https:\/\/scikit-learn.org\/stable\/auto_examples\/index.html\">examples<\/a>.<\/p>\n<p><em>The author has created all the images in this\u00a0article.<\/em><\/p>\n<hr>\n<p><a href=\"https:\/\/towardsdatascience.com\/building-a-regression-model-delivery-duration-prediction-ab1435952419\">Building a Regression Model: Delivery Duration Prediction<\/a> was originally published in <a href=\"https:\/\/towardsdatascience.com\/\">Towards Data Science<\/a> on Medium, where people are continuing the conversation by highlighting 
and responding to this story.<\/p>\n<\/div>\n<p> \t<BR><br \/>\n <BR><\/BR><br \/>\n    Jimin Kang<br \/>\n \t<BR><br \/>\n<BR><\/BR><br \/>\n<a href=\"https:\/\/medium.com\/m\/global-identity-2?redirectUrl=https%3A%2F%2Ftowardsdatascience.com%2Fbuilding-a-regression-model-delivery-duration-prediction-ab1435952419\">Go to original source<\/a><br \/>\n \t<BR><br \/>\n <BR><\/BR><\/p>\n","protected":false},"excerpt":{"rendered":"<p>Building a Regression Model: Delivery Duration Prediction Building a Regression Model to Predict Delivery Durations: A Practical Guide E2E walkthrough for approaching a regression modeling\u00a0task In this article, we\u2019re going to walk through the process of building a regression model\u200a\u2014\u200afrom dataset cleaning &amp; preparation, to model training &amp; evaluation. The specific regression task we will [&hellip;]<\/p>\n","protected":false},"author":2,"featured_media":0,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[62,166,70,1500,1498,1114],"tags":[1309,103,1424],"class_list":["post-1479","post","type-post","status-publish","format-standard","hentry","category-aimldsaimlds","category-hands-on-tutorials","category-machine-learning","category-model-evaluation","category-model-training","category-regression","tag-delivery","tag-model","tag-order"],"_links":{"self":[{"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/posts\/1479"}],"collection":[{"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/users\/2"}],"replies":[{"embeddable":true,"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/comments?post=1479"}],"version-history":[{"count":0,"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/posts\/1479\/revisions"}],"wp:a
ttachment":[{"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/media?parent=1479"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/categories?post=1479"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/tags?post=1479"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}