The Implicit Biases of Optimization Algorithms in Deep Learning: A Survey
Uncovering the Hidden Preferences: How Optimization Algorithms Shape Neural Networks in Deep Learning
Introduction:
The success of deep learning has been largely attributed to the power of overparameterized neural networks and the effectiveness of optimization algorithms in navigating their complex loss landscapes. Recent research has shed light on the crucial role played by the implicit biases of these algorithms in shaping the generalization properties of the trained models (Neyshabur et al., 2015b; Vardi, 2023). This survey article aims to provide an overview of the current understanding of implicit biases in deep learning optimization, focusing on the family of steepest descent algorithms and their connections to margin maximization and generalization.
Steepest Descent Algorithms and Their Geometry:
Steepest descent algorithms, which include gradient descent, coordinate descent, and sign gradient descent as special cases, are the workhorses of deep learning optimization. These methods update the model parameters iteratively in the direction of steepest descent with respect to a chosen norm, and this choice of norm induces a specific geometry on the optimization landscape (Gunasekar et al., 2018; Ji and Telgarsky, 2020). The choice of norm, in turn, largely determines the implicit bias of the algorithm. Gradient descent, which is steepest descent in the ℓ2 norm, favors solutions with low ℓ2 norm and, on linearly separable data, converges in direction to the maximum-ℓ2-margin solution (Soudry et al., 2018). Coordinate descent, steepest descent in the ℓ1 norm, promotes sparsity by implicitly minimizing the ℓ1 norm (Gunasekar et al., 2018). Sign gradient descent, steepest descent in the ℓ∞ norm and closely related to the popular Adam optimizer (Kingma and Ba, 2015; Kunstner et al., 2023), is biased toward solutions with small ℓ∞ norm, i.e., toward the maximum-margin solution with respect to the ℓ∞ norm (Zhang et al., 2024).
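To make the correspondence between norms and update rules concrete, the sketch below writes the three updates for a generic gradient. This is a minimal NumPy illustration of our own; the function names are not taken from the cited works, and the steps are shown without the optional dual-norm scaling factors.

```python
import numpy as np

def gd_step(w, grad, lr):
    """Steepest descent w.r.t. the l2 norm: the ordinary gradient step."""
    return w - lr * grad

def coordinate_step(w, grad, lr):
    """Steepest descent w.r.t. the l1 norm: greedy coordinate descent.
    Only the coordinate with the largest-magnitude partial derivative is updated."""
    step = np.zeros_like(w)
    i = np.argmax(np.abs(grad))
    step[i] = grad[i]
    return w - lr * step

def sign_gd_step(w, grad, lr):
    """Steepest descent w.r.t. the l-infinity norm: sign gradient descent."""
    return w - lr * np.sign(grad)

w = np.zeros(4)
grad = np.array([0.2, -1.5, 0.4, 0.0])
print(gd_step(w, grad, 0.1))          # every coordinate moves in proportion to its gradient
print(coordinate_step(w, grad, 0.1))  # only the largest-magnitude coordinate moves
print(sign_gd_step(w, grad, 0.1))     # every nonzero-gradient coordinate moves by the same amount
```

The ℓ1 geometry updates only the coordinate with the largest-magnitude partial derivative, which is why coordinate descent tends toward sparse iterates, whereas the ℓ∞ geometry moves every coordinate by the same amount, yielding the sign update.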
Implicit Bias and Margin Maximization:
A key insight into the implicit biases of steepest descent algorithms is their connection to margin maximization. The margin of a classifier measures how far the training examples lie from its decision boundary, and larger margins are classically associated with stronger generalization guarantees (Vapnik, 1998). Recent works have shown that steepest descent algorithms implicitly maximize a margin-based quantity even when no margin term appears in the training objective (Ji and Telgarsky, 2020; Lyu and Li, 2020; Nacson et al., 2019).
Lyu and Li (2020) proved that, for homogeneous neural networks trained with exponential-type losses, gradient flow and gradient descent keep a smoothed version of the normalized ℓ2 margin non-decreasing and converge in direction to a KKT point of the corresponding maximum-margin problem. Nacson et al. (2019) studied both homogeneous and non-homogeneous models and related margin maximization to the limit of infinitesimal regularization. Ji and Telgarsky (2020) established directional convergence and alignment of the parameters for a broad class of homogeneous networks, further generalizing the analysis.
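To fix notation for these results (the symbols below are ours, with precise assumptions deferred to the cited papers): for an L-homogeneous network f(θ; x), meaning f(cθ; x) = c^L f(θ; x) for all c > 0, and binary labels y_i ∈ {±1}, the normalized ℓ2 margin and the associated maximum-margin problem are

\[
\gamma(\theta) \;=\; \frac{\min_{i}\, y_i\, f(\theta; x_i)}{\|\theta\|_2^{L}},
\qquad\qquad
\min_{\theta}\ \tfrac{1}{2}\|\theta\|_2^{2}
\quad \text{s.t.}\quad y_i\, f(\theta; x_i) \,\ge\, 1 \ \ \text{for all } i.
\]

Margin maximization in this sense means that a smoothed version of γ(θ_t) does not decrease along the trajectory and that the limiting direction of θ_t satisfies the KKT conditions of the constrained problem on the right.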
The Interplay of Optimization and Generalization:
The study of implicit biases has shed light on the interplay between optimization algorithms and generalization in deep learning. Empirical studies have demonstrated that different steepest descent variants can lead to solutions with distinct generalization properties, even when trained on the same dataset (Wilson et al., 2017).
Adaptive gradient methods such as Adam (Kingma and Ba, 2015) have gained popularity for their robustness to sparse gradients and their fast convergence in practice. However, their implicit biases and generalization properties are less well understood than those of non-adaptive methods. Recent theoretical analyses have shown that Adam's implicit bias depends on its hyperparameter configuration, with certain settings producing behavior close to that of sign gradient descent (Zhang et al., 2024; Wang et al., 2021, 2022).
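The relation to sign gradient descent can be read off Adam's update rule: with the moving-average coefficients set to β1 = β2 = 0, the step becomes g / (|g| + ε), which approaches sign(g) as ε → 0. The following sketch (our own illustration, not code from the cited analyses) makes this limit explicit.

```python
import numpy as np

def adam_step(w, grad, state, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One bias-corrected Adam step; `state` holds the moment estimates and step count."""
    m, v, t = state
    t += 1
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad ** 2
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    w_new = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    return w_new, (m, v, t)

# With beta1 = beta2 = 0 and a tiny eps, the step direction is essentially sign(grad):
w = np.zeros(3)
grad = np.array([0.3, -2.0, 0.01])
state = (np.zeros(3), np.zeros(3), 0)
w_sign_like, _ = adam_step(w, grad, state, lr=0.1, beta1=0.0, beta2=0.0, eps=1e-12)
print(w_sign_like)            # ~ -0.1 * sign(grad)
print(-0.1 * np.sign(grad))   # sign-descent step, for comparison
```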
The study of implicit biases has also led to optimization methods that exploit geometry to improve generalization. For instance, Path-SGD (Neyshabur et al., 2015a) performs approximate steepest descent with respect to a path-wise norm, and Shampoo (Gupta et al., 2018) applies structured, geometry-aware preconditioning; both aim to improve the stability and generalization of deep learning optimization.
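As a rough illustration of geometry-aware preconditioning, the sketch below applies a Shampoo-style update to a single matrix-shaped parameter: left and right second-moment statistics are accumulated and their inverse fourth roots precondition the gradient. This is a simplified, unoptimized rendering of the update described by Gupta et al. (2018); the damping constant and function names are our own.

```python
import numpy as np

def matrix_power(sym, p, damping=1e-4):
    """Power of a symmetric PSD matrix via eigendecomposition, with damping for stability."""
    vals, vecs = np.linalg.eigh(sym)
    vals = np.maximum(vals, 0.0) + damping
    return (vecs * vals ** p) @ vecs.T

def shampoo_step(W, grad, L, R, lr=1e-2):
    """One Shampoo-style step for a matrix parameter W with gradient `grad`.
    L and R accumulate second-moment statistics of the rows and columns."""
    L = L + grad @ grad.T
    R = R + grad.T @ grad
    precond_grad = matrix_power(L, -0.25) @ grad @ matrix_power(R, -0.25)
    return W - lr * precond_grad, L, R

# Usage: W is m x n; the preconditioners start at a small multiple of the identity.
m, n = 4, 3
W = np.random.randn(m, n)
L, R = np.eye(m) * 1e-4, np.eye(n) * 1e-4
grad = np.random.randn(m, n)
W, L, R = shampoo_step(W, grad, L, R)
```

In the full method the preconditioners are updated with stochastic gradients at every step and the construction extends to tensors of arbitrary order; the sketch above only conveys the basic geometry-aware rescaling.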
Future Directions and Open Questions:
Despite significant progress in understanding the implicit biases of optimization algorithms, several open questions remain. One direction is to extend the analysis to more complex architectures, such as deep residual networks and transformers, where the interplay between optimization and generalization is less well-understood (Kunin et al., 2023). Another important avenue is to explore the connections between implicit biases and other desirable properties, such as robustness to adversarial perturbations (Tsilivis et al., 2024) and the ability to recover training data from trained models (Haim et al., 2022). Understanding these relationships could lead to the design of optimization algorithms that promote both generalization and robustness. Finally, a deeper understanding of the implicit biases of adaptive gradient methods, such as Adam and its variants (Loshchilov and Hutter, 2019), is crucial given their widespread use in practice. Recent works have made progress in this direction (Zhang et al., 2024; Wang et al., 2021, 2022; Xie and Li, 2024), but more research is needed to fully characterize their behavior and develop principled guidelines for their use.
Conclusion:
The study of implicit biases in deep learning optimization has provided valuable insights into the mechanisms underlying the success of overparameterized neural networks. By understanding the geometry induced by different steepest descent algorithms and their connections to margin maximization, researchers have shed light on the interplay between optimization and generalization. As the field continues to evolve, a deeper understanding of implicit biases will be crucial for designing optimization algorithms that lead to robust, generalizable, and interpretable models. This survey has highlighted key advancements in this area and identified promising directions for future research, emphasizing the importance of implicit biases as a lens through which to understand and improve deep learning optimization.
References:
Gunasekar, S., Lee, J. D., Soudry, D., & Srebro, N. (2018). Characterizing Implicit Bias in Terms of Optimization Geometry. In International Conference on Machine Learning (pp. 1827-1836). PMLR.
Gupta, V., Koren, T., & Singer, Y. (2018). Shampoo: Preconditioned Stochastic Tensor Optimization. In International Conference on Machine Learning (pp. 1837-1845). PMLR.
Haim, N., Vardi, G., Yehudai, G., Shamir, O., & Irani, M. (2022). Reconstructing Training Data From Trained Neural Networks. In Advances in Neural Information Processing Systems (pp. 20958-20969).
Ji, Z., & Telgarsky, M. (2020). Directional convergence and alignment in deep learning. In Advances in Neural Information Processing Systems (pp. 18666-18677).
Kingma, D. P., & Ba, J. (2015). Adam: A Method for Stochastic Optimization. In International Conference on Learning Representations.
Kunstner, F., Chen, J., Lavington, J. W., & Schmidt, M. (2023). Noise Is Not the Main Factor Behind the Gap Between Sgd and Adam on Transformers, But Sign Descent Might Be. In International Conference on Learning Representations.
Kunin, D., Yamamura, A., Ma, C., & Ganguli, S. (2023). The Asymmetric Maximum Margin Bias of Quasi-Homogeneous Neural Networks. In International Conference on Learning Representations.
Loshchilov, I., & Hutter, F. (2019). Decoupled Weight Decay Regularization. In International Conference on Learning Representations.
Lyu, K., & Li, J. (2020). Gradient Descent Maximizes the Margin of Homogeneous Neural Networks. In International Conference on Learning Representations.
Nacson, M. S., Gunasekar, S., Lee, J. D., Srebro, N., & Soudry, D. (2019). Lexicographic and Depth-Sensitive Margins in Homogeneous and Non-Homogeneous Deep Models. In International Conference on Machine Learning (pp. 4683-4692). PMLR.
Neyshabur, B., Salakhutdinov, R., & Srebro, N. (2015a). Path-SGD: Path-Normalized Optimization in Deep Neural Networks. In Advances in Neural Information Processing Systems (pp. 2422-2430).
Neyshabur, B., Tomioka, R., & Srebro, N. (2015b). In Search of the Real Inductive Bias: On the Role of Implicit Regularization in Deep Learning. In International Conference on Learning Representations, Workshop Track.
Soudry, D., Hoffer, E., Nacson, M. S., Gunasekar, S., & Srebro, N. (2018). The Implicit Bias of Gradient Descent on Separable Data. Journal of Machine Learning Research, 19(1), 2822-2878.
Tsilivis, N., Frank, N., Srebro, N., & Kempe, J. (2024). The Price of Implicit Bias in Adversarially Robust Generalization. arXiv preprint arXiv:2406.04981.
Vapnik, V. (1998). Statistical learning theory. Wiley.
Vardi, G. (2023). On the Implicit Bias in Deep-Learning Algorithms. Communications of the ACM, 66(6), 86-93.
Wang, B., Meng, Q., Chen, W., & Liu, T. (2021). The Implicit Bias for Adaptive Optimization Algorithms on Homogeneous Neural Networks. In International Conference on Machine Learning (pp. 10849-10858). PMLR.
Wang, B., Meng, Q., Zhang, H., Sun, R., Chen, W., Ma, Z., & Liu, T. (2022). Does Momentum Change the Implicit Regularization on Separable Data? In Advances in Neural Information Processing Systems.
Wilson, A. C., Roelofs, R., Stern, M., Srebro, N., & Recht, B. (2017). The Marginal Value of Adaptive Gradient Methods in Machine Learning. In Advances in Neural Information Processing Systems (pp. 4148-4158).
Xie, S., & Li, Z. (2024). Implicit Bias of AdamW: ℓ∞-Norm Constrained Optimization. In International Conference on Machine Learning.
Zhang, C., Zou, D., & Cao, Y. (2024). The Implicit Bias of Adam on Separable Data. arXiv preprint arXiv:2406.10650.