Minimax Optimal Convergence of Gradient Descent in Logistic Regression via Large and Adaptive Stepsizes
arXiv:2504.04105v1 Announce Type: new
Abstract: We study $textit{gradient descent}$ (GD) for logistic regression on linearly separable data with stepsizes that adapt to the current risk, scaled by a constant hyperparameter $eta$. We show that after at most $1/gamma^2$ burn-in steps, GD achieves a risk upper bounded by $exp(-Theta(eta))$, where $gamma$ is the margin of the dataset. As $eta$ can be arbitrarily large, GD attains an arbitrarily small risk $textit{immediately after the burn-in steps}$, though the risk evolution may be $textit{non-monotonic}$.
We further construct hard datasets with margin $gamma$, where any batch or online first-order method requires $Omega(1/gamma^2)$ steps to find a linear separator. Thus, GD with large, adaptive stepsizes is $textit{minimax optimal}$ among first-order batch methods. Notably, the classical $textit{Perceptron}$ (Novikoff, 1962), a first-order online method, also achieves a step complexity of $1/gamma^2$, matching GD even in constants.
Finally, our GD analysis extends to a broad class of loss functions and certain two-layer networks.
Abstract: We study $textit{gradient descent}$ (GD) for logistic regression on linearly separable data with stepsizes that adapt to the current risk, scaled by a constant hyperparameter $eta$. We show that after at most $1/gamma^2$ burn-in steps, GD achieves a risk upper bounded by $exp(-Theta(eta))$, where $gamma$ is the margin of the dataset. As $eta$ can be arbitrarily large, GD attains an arbitrarily small risk $textit{immediately after the burn-in steps}$, though the risk evolution may be $textit{non-monotonic}$.
We further construct hard datasets with margin $gamma$, where any batch or online first-order method requires $Omega(1/gamma^2)$ steps to find a linear separator. Thus, GD with large, adaptive stepsizes is $textit{minimax optimal}$ among first-order batch methods. Notably, the classical $textit{Perceptron}$ (Novikoff, 1962), a first-order online method, also achieves a step complexity of $1/gamma^2$, matching GD even in constants.
Finally, our GD analysis extends to a broad class of loss functions and certain two-layer networks.
Ruiqi Zhang, Jingfeng Wu, Licong Lin, Peter L. Bartlett
Go to original source