{"id":1127,"date":"2025-01-12T07:02:33","date_gmt":"2025-01-12T07:02:33","guid":{"rendered":"https:\/\/mailitics.com\/index.php\/2025\/01\/12\/exploring-new-hyperparameter-dimensions-with-laplace-approximated-bayesian-optimization-588a0891ca5b\/"},"modified":"2025-01-12T07:02:33","modified_gmt":"2025-01-12T07:02:33","slug":"exploring-new-hyperparameter-dimensions-with-laplace-approximated-bayesian-optimization-588a0891ca5b","status":"publish","type":"post","link":"https:\/\/mailitics.com\/index.php\/2025\/01\/12\/exploring-new-hyperparameter-dimensions-with-laplace-approximated-bayesian-optimization-588a0891ca5b\/","title":{"rendered":"Exploring New Hyperparameter Dimensions with Laplace Approximated Bayesian Optimization"},"content":{"rendered":"<p>    Exploring New Hyperparameter Dimensions with Laplace Approximated Bayesian Optimization<br \/>\n \t<BR><br \/>\n<BR><\/BR><br \/>\n    <!-- no image --><br \/>\n \t<BR><br \/>\n<BR><\/BR><\/p>\n<div>\n<h4>Is it better than grid\u00a0search?<\/h4>\n<figure><img data-recalc-dims=\"1\" decoding=\"async\" alt=\"\" src=\"https:\/\/i0.wp.com\/cdn-images-1.medium.com\/max\/1024\/1%2ALnZjq8iJ2Cm2oj1YxeyfUQ.png?ssl=1\"><figcaption>Image by author from\u00a0canva<\/figcaption><\/figure>\n<p><strong>When I notice my model is overfitting<\/strong>, I often think, \u201c<strong>It is time to regularize<\/strong>\u201d. But how do I decide which regularization method to use (L1, L2) and what parameters to choose? Typically, I perform hyperparameter optimization by means of a grid search to select the settings. However, what happens if the independent variables have different scales or varying levels of influence? Can I design a hyperparameter grid with different regularization coefficients for each variable? Is this type of optimization feasible in high-dimensional spaces? And are there alternative ways to design regularization? Let\u2019s explore this with a hypothetical example.<\/p>\n<h3>Use case<\/h3>\n<p>My fictional example is a binary classification use case with 3 explanatory variables. Each of these variables is categorical and has 7 different categories. My reproducible use case is in this <a href=\"https:\/\/github.com\/kapytaine\/bayesian_optimisation\/blob\/master\/notebooks\/bayesian_optimization_instead_of_hyperoptimization_artificial_dataset.ipynb\">notebook<\/a>. The function that generates the dataset is the following:<\/p>\n<pre>import numpy as np<br>import pandas as pd<br><br>def get_classification_dataset():<br>    n_samples = 200<br>    cats = [\"a\", \"b\", \"c\", \"d\", \"e\", \"f\"]<br>    X = pd.DataFrame(<br>        data={<br>            \"col1\": np.random.choice(cats, size=n_samples),<br>            \"col2\": np.random.choice(cats, size=n_samples),<br>            \"col3\": np.random.choice(cats, size=n_samples),<br>        }<br>    )<br>    X_preprocessed = pd.get_dummies(X)<br><br>    theta = np.random.multivariate_normal(<br>        np.zeros(len(cats) * X.shape[1]),<br>        np.diag(np.array([1e-1] * len(cats) + [1] * len(cats) + [1e1] * len(cats))),<br>    )<br><br>    y = pd.Series(<br>        data=np.random.binomial(1, expit(np.dot(X_preprocessed.to_numpy(), theta))),<br>        index=X_preprocessed.index,<br>    )<br>    return X_preprocessed, y<\/pre>\n<p>For information, I deliberately chose 3 different values for the theta covariance matrix to showcase the benefit of the Laplace approximated bayesian optimization method. If the values were somehow similar, the interest would be\u00a0minor.<\/p>\n<h3>Benchmark<\/h3>\n<p>Along with <strong>a simple baseline model that predicts the mean observed value<\/strong> on the training dataset (used for comparison purposes), I opted to design a slightly more complex model. I decided to one-hot encode the three independent variables and apply <strong>a logistic regression model<\/strong> on top of this basic preprocessing. For regularization, I chose an L2 design and aimed to find the optimal regularization coefficient using two techniques: <strong>grid search<\/strong> and <strong>Laplace approximated bayesian optimization<\/strong>, as you may have anticipated by now. Finally, I evaluated the model on a test dataset using two metrics (arbitrarily selected): log loss and AUC\u00a0ROC.<\/p>\n<p>Before presenting the results, let\u2019s first take a closer look at the bayesian model and how we optimize\u00a0it.<\/p>\n<h3>Bayesian model<\/h3>\n<p>In the bayesian framework, the parameters are no longer fixed constants, but random variables. Instead of maximizing the likelihood to estimate these unknown parameters, we now optimize the posterior distribution of the random parameters, given the observed data. This requires us to choose, often somewhat arbitrarily, the design and parameters of the prior. However, it is also possible to treat the parameters of the prior as random variables themselves\u200a\u2014\u200alike in <em>Inception<\/em>, where the layers of uncertainty keep stacking on top of each\u00a0other\u2026<\/p>\n<p>In this study, I have chosen the following model:<\/p>\n<figure><img data-recalc-dims=\"1\" decoding=\"async\" alt=\"\" src=\"https:\/\/i0.wp.com\/cdn-images-1.medium.com\/max\/376\/1%2A630IptT6oc2D1EElZCuGmw.png?ssl=1\"><\/figure>\n<p>I have logically chosen a bernouilli model for Y_i | \u03b8, a normal centered prior corresponding to a L2 regularization for \u03b8 | \u03a3 and finally for \u03a3_i^{-1}, I chose a Gamma model. I chose to model the precision matrix instead of the covariance matrix as it is traditional in the literature, like in scikit learn user guide for the Bayesian linear regression [2].<\/p>\n<p>In addition to this written model, I assumed the Y_i and Y_j are conditionally (on \u03b8) independent as well as Y_i and\u00a0\u03a3.<\/p>\n<h3>Calibration<\/h3>\n<h4>Likelihood<\/h4>\n<p>According to the model, the likelihood can consequently be\u00a0written:<\/p>\n<figure><img data-recalc-dims=\"1\" decoding=\"async\" alt=\"\" src=\"https:\/\/i0.wp.com\/cdn-images-1.medium.com\/max\/801\/0%2AcoVQ4RMmvb8Z3k8a.png?ssl=1\"><\/figure>\n<p>In order to optimize, we need to evaluate nearly all of the terms, with the exception of P(Y=y). The terms in the numerators can be evaluated using the chosen model. However, the remaining term in the denominator cannot. This is where the <strong>Laplace approximation<\/strong> comes into\u00a0play.<\/p>\n<h4>Laplace approximation<\/h4>\n<p>In order to evaluate the first term of the denominator, we can leverage the Laplace approximation. We approximate the distribution of \u03b8 | Y, \u03a3\u00a0by:<\/p>\n<figure><img data-recalc-dims=\"1\" decoding=\"async\" alt=\"\" src=\"https:\/\/i0.wp.com\/cdn-images-1.medium.com\/max\/358\/1%2Aw-R9BC7ytnjg5u_8KHs95Q.png?ssl=1\"><\/figure>\n<p>with \u03b8* being the mode of the mode the density distribution of \u03b8 | Y,\u00a0\u03a3.<\/p>\n<p>Even though we do not know the density function, we can evaluate the Hessian part thanks to the following decomposition:<\/p>\n<figure><img data-recalc-dims=\"1\" decoding=\"async\" alt=\"\" src=\"https:\/\/i0.wp.com\/cdn-images-1.medium.com\/max\/843\/1%2AniclhfddKstEWsrEoVgHQw.png?ssl=1\"><\/figure>\n<p>We only need to know the first two terms of the numerator to evaluate the Hessian which we\u00a0do.<\/p>\n<p>For those interested in further explanation, I advise part 4.4, \u201cThe Laplace Approximation\u201d, from Pattern Recognition and Machine Learning from Christopher M. Bishop [1]. It helped me a lot to understand the approximation.<\/p>\n<h4>Laplace approximated likelihood<\/h4>\n<p>Finally the Laplace approximated likelihood to optimize\u00a0is:<\/p>\n<figure><img data-recalc-dims=\"1\" decoding=\"async\" alt=\"\" src=\"https:\/\/i0.wp.com\/cdn-images-1.medium.com\/max\/875\/0%2AXGHFKwRlPqvPAl-T.png?ssl=1\"><\/figure>\n<p>Once we approximate the density function of \u03b8 | Y, \u03a3, we could finally evaluate the likelihood at whatever \u03b8 we want if the approximation was accurate everywhere. For the sake of simplicity and because the approximation is accurate only close to the mode, we evaluate approximated likelihood at\u00a0\u03b8*.<\/p>\n<p>Here after is a function that evaluates this loss for a given (scalar) \u03c3\u00b2=1\/p (in addition to the given observed, X and y, and design values, \u03b1 and\u00a0\u03b2).<\/p>\n<pre>import numpy as np<br>from scipy.stats import gamma<br><br>from module.bayesian_model import BayesianLogisticRegression<br><br>def loss(p, X, y, alpha, beta):<br>    # computation of the loss for given values:<br>    # - 1\/sigma\u00b2 (named p for precision here)<br>    # - X: matrix of features<br>    # - y: vector of observations<br>    # - alpha: prior Gamma distribution alpha parameter over 1\/sigma\u00b2<br>    # - beta: prior Gamma distribution beta parameter over 1\/sigma\u00b2<br><br>    n_feat = X.shape[1]<br>    m_vec = np.array([0] * n_feat)<br>    p_vec = np.array(p * n_feat)<br><br>    # computation of theta*<br>    res = minimize(<br>        BayesianLogisticRegression()._loss,<br>        np.array([0] * n_feat),<br>        args=(X, y, m_vec, p_vec),<br>        method=\"BFGS\",<br>        jac=BayesianLogisticRegression()._jac,<br>    )<br>    theta_star = res.x<br>    <br>    # computation the Hessian for the Laplace approximation<br>    H = BayesianLogisticRegression()._hess(theta_star, X, y, m_vec, p_vec)<br>    <br>    # loss<br>    loss = 0<br>    ## first two terms: the log loss and the regularization term<br>    loss += baysian_model._loss(theta_star, X, y, m_vec, p_vec)<br>    ## third term: prior distribution over sigma, written p here<br>    out -= gamma.logpdf(p, a = alpha, scale = 1 \/ beta)<br>    ## fourth term: Laplace approximated last term<br>    out += 0.5 * np.linalg.slogdet(H)[1] - 0.5 * n_feat * np.log(2 * np.pi)<br><br>    return out<\/pre>\n<p>In my use case, I have chosen to optimize it by means of <strong>Adam optimizer, <\/strong>which code has been taken from this\u00a0<a href=\"https:\/\/gist.github.com\/jcmgray\/e0ab3458a252114beecb1f4b631e19ab\">repo<\/a>.<\/p>\n<pre>def adam(<br>    fun,<br>    x0,<br>    jac,<br>    args=(),<br>    learning_rate=0.001,<br>    beta1=0.9,<br>    beta2=0.999,<br>    eps=1e-8,<br>    startiter=0,<br>    maxiter=1000,<br>    callback=None,<br>    **kwargs<br>):<br>    \"\"\"``scipy.optimize.minimize`` compatible implementation of ADAM -<br>    [http:\/\/arxiv.org\/pdf\/1412.6980.pdf].<br>    Adapted from ``autograd\/misc\/optimizers.py``.<br>    \"\"\"<br>    x = x0<br>    m = np.zeros_like(x)<br>    v = np.zeros_like(x)<br><br>    for i in range(startiter, startiter + maxiter):<br>        g = jac(x, *args)<br><br>        if callback and callback(x):<br>            break<br><br>        m = (1 - beta1) * g + beta1 * m  # first  moment estimate.<br>        v = (1 - beta2) * (g**2) + beta2 * v  # second moment estimate.<br>        mhat = m \/ (1 - beta1**(i + 1))  # bias correction.<br>        vhat = v \/ (1 - beta2**(i + 1))<br>        x = x - learning_rate * mhat \/ (np.sqrt(vhat) + eps)<br><br>    i += 1<br>    return OptimizeResult(x=x, fun=fun(x, *args), jac=g, nit=i, nfev=i, success=True)<\/pre>\n<p>For this optimization we need the derivative of the previous loss. We cannot have an analytical form so I decided to use a numerical approximation of the derivative.<\/p>\n<h3>Inference<\/h3>\n<p>Once the model is trained on the training dataset, it is necessary to make predictions on the evaluation dataset to assess its performance and compare different models. However, it is not possible to directly calculate the actual distribution of a new point, as the computation is intractable.<\/p>\n<figure><img data-recalc-dims=\"1\" decoding=\"async\" alt=\"\" src=\"https:\/\/i0.wp.com\/cdn-images-1.medium.com\/max\/1024\/1%2AL1m9M2jrVeODvpxBYBLtiw.png?ssl=1\"><\/figure>\n<p>It is possible to approximate the results\u00a0with:<\/p>\n<figure><img data-recalc-dims=\"1\" decoding=\"async\" alt=\"\" src=\"https:\/\/i0.wp.com\/cdn-images-1.medium.com\/max\/875\/0%2AhK4ggNi3sVSzYYbo.png?ssl=1\"><\/figure>\n<p>considering:<\/p>\n<figure><img data-recalc-dims=\"1\" decoding=\"async\" alt=\"\" src=\"https:\/\/i0.wp.com\/cdn-images-1.medium.com\/max\/831\/0%2AP61m4I3Xb4kd-fnb.png?ssl=1\"><\/figure>\n<h3>Results<\/h3>\n<p>I chose an uninformative prior over the precision random variable. The naive model performs poorly, with a log loss of 0.60 and an AUC ROC of 0.50. The second model performs better, with a log loss of 0.44 and an AUC ROC of 0.83, both when hyperoptimized using grid search and bayesian optimization. This indicates that the logistic regression model, which incorporates the dependent variables, outperforms the naive model. However, there is no advantage to using bayesian optimization over grid search, so I\u2019ll continue with grid search for now. Thanks for\u00a0reading.<\/p>\n<p>\u2026 But wait, I am thinking. Why are my parameters regularized with the same coefficient? Shouldn\u2019t my prior depend on the underlying dependent variables? Perhaps the parameters for the first dependent variable could take higher values, while those for the second dependent variable, with its smaller influence, should be closer to zero. Let\u2019s explore these new dimensions.<\/p>\n<h3>Benchmark 2<\/h3>\n<p>So far we have considered two techniques, the grid search and the bayesian optimization. We can use these same techniques in higher dimensions.<\/p>\n<p>Considering new dimensions could dramatically increase the number of nodes of my grid. This is why the bayesian optimization makes sense in higher dimensions to get the best regularization coefficients. In the considered use case, I have supposed there are 3 regularization parameters, one for each independent variable. After encoding a single variable, I assumed the generated new variables all shared the same regularization parameter. Hence the total regularization parameters of 3, even if there are more than 3 columns as inputs of the logistic regression.<\/p>\n<p>I updated the previous loss function with the following code:<\/p>\n<pre>import numpy as np<br>from scipy.stats import gamma<br><br>from module.bayesian_model import BayesianLogisticRegression<br><br>def loss(p, X, y, alpha, beta, X_columns, col_to_p_id):<br>    # computation of the loss for given values:<br>    # - 1\/sigma\u00b2 vector (named p for precision here)<br>    # - X: matrix of features<br>    # - y: vector of observations<br>    # - alpha: prior Gamma distribution alpha parameter over 1\/sigma\u00b2<br>    # - beta: prior Gamma distribution beta parameter over 1\/sigma\u00b2<br>    # - X_columns: list of names of X columns<br>    # - col_to_p_id: dictionnary mapping a column name to a p index<br>    # because many column names can share the same p value<br><br>    n_feat = X.shape[1]<br>    m_vec = np.array([0] * n_feat)<br>    p_list = []<br>    for col in X_columns:<br>        p_list.append(p[col_to_p_id[col]])<br>    p_vec = np.array(p_list)<br><br>    # computation of theta*<br>    res = minimize(<br>        BayesianLogisticRegression()._loss,<br>        np.array([0] * n_feat),<br>        args=(X, y, m_vec, p_vec),<br>        method=\"BFGS\",<br>        jac=BayesianLogisticRegression()._jac,<br>    )<br>    theta_star = res.x<br>    <br>    # computation the Hessian for the Laplace approximation<br>    H = BayesianLogisticRegression()._hess(theta_star, X, y, m_vec, p_vec)<br>    <br>    # loss<br>    loss = 0<br>    ## first two terms: the log loss and the regularization term<br>    loss += baysian_model._loss(theta_star, X, y, m_vec, p_vec)<br>    ## third term: prior distribution over 1\/sigma\u00b2 written p here<br>    ## there is now a sum as p is now a vector<br>    out -= np.sum(gamma.logpdf(p, a = alpha, scale = 1 \/ beta))<br>    ## fourth term: Laplace approximated last term<br>    out += 0.5 * np.linalg.slogdet(H)[1] - 0.5 * n_feat * np.log(2 * np.pi)<br><br>    return out<\/pre>\n<p>With this approach, the metrics evaluated on the test dataset are the following: 0.39 and 0.88, which are better than the initial model optimized by means of a grid search and a bayesian approach with only a single prior for all the independent variables.<\/p>\n<figure><img data-recalc-dims=\"1\" decoding=\"async\" alt=\"\" src=\"https:\/\/i0.wp.com\/cdn-images-1.medium.com\/max\/803\/0%2ApHlVqBjt-qs1Yetd.png?ssl=1\"><figcaption>Metrics achieved with the different methods on my use\u00a0case.<\/figcaption><\/figure>\n<p>The use case can be reproduced with this <a href=\"https:\/\/github.com\/kapytaine\/bayesian_optimisation\/blob\/master\/notebooks\/bayesian_optimization_instead_of_hyperoptimization_artificial_dataset.ipynb\">notebook<\/a>.<\/p>\n<h3>Limits<\/h3>\n<p>I have created an example to illustrate the usefulness of the technique. However, I have not been able to find a suitable real-world dataset to fully demonstrate its potential. While I was working with an actual dataset, I could not derive any significant benefits from applying this technique. If you come across one, please let me know\u200a\u2014\u200aI would be excited to see a real-world application of this regularization method.<\/p>\n<h3>Conclusion<\/h3>\n<p>In conclusion, using bayesian optimization (with Laplace approximation if needed) to determine the best regularization parameters may be a good alternative to traditional hyperparameter tuning methods. By leveraging probabilistic models, bayesian optimization not only reduces the computational cost but also enhances the likelihood of finding optimal regularization values, especially in high dimension.<\/p>\n<h3>References<\/h3>\n<ol>\n<li>Christopher M. Bishop. (2006). Pattern Recognition and Machine Learning. Springer.<\/li>\n<li>Bayesian Ridge Regression scikit-learn user guide: <a href=\"https:\/\/scikit-learn.org\/1.5\/modules\/linear_model.html#bayesian-ridge-regression\">https:\/\/scikit-learn.org\/1.5\/modules\/linear_model.html#bayesian-ridge-regression<\/a>\n<\/li>\n<\/ol>\n<p><img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/medium.com\/_\/stat?event=post.clientViewed&amp;referrerSource=full_rss&amp;postId=588a0891ca5b\" width=\"1\" height=\"1\" alt=\"\"><\/p>\n<hr>\n<p><a href=\"https:\/\/towardsdatascience.com\/exploring-new-hyperparameter-dimensions-with-laplace-approximated-bayesian-optimization-588a0891ca5b\">Exploring New Hyperparameter Dimensions with Laplace Approximated Bayesian Optimization<\/a> was originally published in <a href=\"https:\/\/towardsdatascience.com\/\">Towards Data Science<\/a> on Medium, where people are continuing the conversation by highlighting and responding to this story.<\/p>\n<\/div>\n<p> \t<BR><br \/>\n <BR><\/BR><br \/>\n    Arnaud Capitaine<br \/>\n \t<BR><br \/>\n<BR><\/BR><br \/>\n<a href=\"https:\/\/medium.com\/m\/global-identity-2?redirectUrl=https%3A%2F%2Ftowardsdatascience.com%2Fexploring-new-hyperparameter-dimensions-with-laplace-approximated-bayesian-optimization-588a0891ca5b\">Go to original source<\/a><br \/>\n \t<BR><br \/>\n <BR><\/BR><\/p>\n","protected":false},"excerpt":{"rendered":"<p>Exploring New Hyperparameter Dimensions with Laplace Approximated Bayesian Optimization Is it better than grid\u00a0search? Image by author from\u00a0canva When I notice my model is overfitting, I often think, \u201cIt is time to regularize\u201d. But how do I decide which regularization method to use (L1, L2) and what parameters to choose? Typically, I perform hyperparameter optimization [&hellip;]<\/p>\n","protected":false},"author":2,"featured_media":0,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[62,1238,177,166,1240,1239],"tags":[1242,1241,483],"class_list":["post-1127","post","type-post","status-publish","format-standard","hentry","category-aimldsaimlds","category-bayesian-optimization","category-bayesian-statistics","category-hands-on-tutorials","category-hyperparameter-tuning","category-laplace-approximation","tag-cats","tag-np","tag-optimization"],"_links":{"self":[{"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/posts\/1127"}],"collection":[{"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/users\/2"}],"replies":[{"embeddable":true,"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/comments?post=1127"}],"version-history":[{"count":0,"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/posts\/1127\/revisions"}],"wp:attachment":[{"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/media?parent=1127"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/categories?post=1127"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/tags?post=1127"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}