
What does back propagation compute?

By Edouard Pauwels

Appears in collection: Optimization for Machine Learning / Optimisation pour l’apprentissage automatique

We are interested in the nonsmooth analysis of backpropagation as implemented in modern machine learning libraries such as TensorFlow or PyTorch. First I will illustrate how the blind application of differential calculus to nonsmooth objects can be problematic, which calls for a proper mathematical model. Then I will introduce a weak notion of generalized derivative, named conservativity, and illustrate how it complies with calculus and optimization for well-structured objects. We provide stability results for empirical risk minimization, similar to those in the smooth setting, for the combination of nonsmooth automatic differentiation, minibatch stochastic approximation, and first-order optimization. This is joint work with Jérôme Bolte.
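To make the first point concrete, here is a minimal sketch (my own illustration, not code from the talk) of the kind of pathology the abstract alludes to, assuming PyTorch's convention relu'(0) = 0. The two programs below compute the same function, the identity x ↦ x, yet backpropagation returns different values at 0, so the output of autodiff applied to a nonsmooth program need not be a derivative in any classical sense.

    # Two programs for the identity function whose autodiff outputs disagree
    # at 0; PyTorch (like TensorFlow) uses the convention relu'(0) = 0.
    import torch

    def f(x):
        return torch.relu(x) - torch.relu(-x)  # equals x for every real x

    def g(x):
        return 1.0 * x  # the same function, written without nonsmooth primitives

    x = torch.tensor(0.0, requires_grad=True)
    f(x).backward()
    print(x.grad)  # tensor(0.): backprop through f returns 0 at x = 0

    y = torch.tensor(0.0, requires_grad=True)
    g(y).backward()
    print(y.grad)  # tensor(1.): the true derivative of the identity

Conservativity is the weak notion of derivative meant to model this behaviour. Roughly, and hedging on the details, a set-valued map D : R^p ⇉ R^p with closed graph and nonempty compact values is a conservative field for a locally Lipschitz f : R^p → R when, for every absolutely continuous curve γ : [0,1] → R^p,

    \frac{\mathrm{d}}{\mathrm{d}t} f(\gamma(t)) = \langle v, \dot{\gamma}(t) \rangle
    \qquad \text{for all } v \in D(\gamma(t)), \ \text{for almost every } t \in [0,1].

A conservative field coincides with the usual gradient almost everywhere, so spurious outputs such as the 0 above are confined to a negligible set, and a chain rule holds along curves; this is what lets the stability results for minibatch stochastic first-order methods mentioned in the abstract go through.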

Information about the video

Citation data

  • DOI 10.24350/CIRM.V.19623303
  • Cite this video: Pauwels, Edouard (09/03/2020). What does back propagation compute? CIRM. Audiovisual resource. DOI: 10.24350/CIRM.V.19623303
  • URL https://dx.doi.org/10.24350/CIRM.V.19623303

Bibliography

  • ANIL, Cem, LUCAS, James, et GROSSE, Roger. Sorting out Lipschitz function approximation. arXiv preprint arXiv:1811.05381, 2018. - https://arxiv.org/abs/1811.05381
  • DAVIS, Damek, DRUSVYATSKIY, Dmitriy, KAKADE, Sham, et al. Stochastic subgradient method converges on tame functions. Foundations of Computational Mathematics, 2018, p. 1-36. - https://doi.org/10.1007/s10208-018-09409-5
  • ABADI, Martín, BARHAM, Paul, CHEN, Jianmin, et al. Tensorflow: A system for large-scale machine learning. In : 12th {USENIX} Symposium on Operating Systems Design and Implementation ({OSDI} 16). 2016. p. 265-283. - https://www.usenix.org/conference/osdi16/technical-sessions/presentation/abadi
  • ALIPRANTIS, Charalambos D. et BORDER, Kim C. Infinite dimensional analysis: a hitchhiker’s guide. Springer, 2006.
  • ATTOUCH, Hedy, GOUDOU, Xavier, et REDONT, Patrick. The heavy ball with friction method, I. The continuous dynamical system: global exploration of the local minima of a real-valued function by asymptotic analysis of a dissipative dynamical system. Communications in Contemporary Mathematics, 2000, vol. 2, no 01, p. 1-34. - https://doi.org/10.1142/S0219199700000025
  • AUBIN, Jean-Pierre et CELLINA, Arrigo. Differential inclusions: set-valued maps and viability theory. Springer, 1984.
  • AUBIN, Jean-Pierre et FRANKOWSKA, Hélène. Set-valued analysis. Springer Science & Business Media, 2009.
  • BENAÏM, Michel. Dynamics of stochastic approximation algorithms. In : Seminaire de probabilites XXXIII. Springer, Berlin, Heidelberg, 1999. p. 1-68. - http://dx.doi.org/10.1007/BFb0096509
  • BENAÏM, Michel, HOFBAUER, Josef, et SORIN, Sylvain. Stochastic approximations and differential inclusions. SIAM Journal on Control and Optimization, 2005, vol. 44, no 1, p. 328-348. - https://doi.org/10.1137/S0363012904439301
  • BOLTE, Jérôme, DANIILIDIS, Aris, LEWIS, Adrian, et al. Clarke subgradients of stratifiable functions. SIAM Journal on Optimization, 2007, vol. 18, no 2, p. 556-572. - https://doi.org/10.1137/060670080
  • BOLTE, Jérôme, SABACH, Shoham, et TEBOULLE, Marc. Proximal alternating linearized minimization for nonconvex and nonsmooth problems. Mathematical Programming, 2014, vol. 146, no 1-2, p. 459-494. - https://doi.org/10.1007/s10107-013-0701-9
  • BORKAR, Vivek S. Stochastic approximation: a dynamical systems viewpoint. Springer, 2009.
  • BORWEIN, Jonathan et LEWIS, Adrian S. Convex analysis and nonlinear optimization: theory and examples. Springer Science & Business Media, 2010.
  • BORWEIN, Jonathan M. et MOORS, Warren B. Essentially smooth Lipschitz functions. Journal of Functional Analysis, 1997, vol. 149, no 2, p. 305-351. - https://doi.org/10.1006/jfan.1997.3101
  • BORWEIN, Jonathan M. et MOORS, Warren B. A chain rule for essentially smooth Lipschitz functions. SIAM Journal on Optimization, 1998, vol. 8, no 2, p. 300-308. - https://doi.org/10.1137/S1052623496297838
  • BORWEIN, Jonathan, MOORS, Warren, et WANG, Xianfu. Generalized subdifferentials: a Baire categorical approach. Transactions of the American Mathematical Society, 2001, vol. 353, no 10, p. 3875-3893. - http://dx.doi.org/10.1090/S0002-9947-01-02820-3
  • BOTTOU, Léon et BOUSQUET, Olivier. The tradeoffs of large scale learning. In : Advances in neural information processing systems. 2008. p. 161-168. - http://papers.nips.cc/paper/3323-the-tradeoffs-of-large-scale-learning
  • BOTTOU, Léon, CURTIS, Frank E., et NOCEDAL, Jorge. Optimization methods for large-scale machine learning. SIAM Review, 2018, vol. 60, no 2, p. 223-311. - https://doi.org/10.1137/16M1080173
  • CASTERA, Camille, BOLTE, Jérôme, FÉVOTTE, Cédric, et al. An Inertial Newton Algorithm for Deep Learning. arXiv preprint arXiv:1905.12278, 2019. - https://arxiv.org/abs/1905.12278
  • CLARKE, Frank H. Optimization and nonsmooth analysis. SIAM, 1990.
  • CHIZAT, Lenaic et BACH, Francis. On the global convergence of gradient descent for over-parameterized models using optimal transport. In : Advances in neural information processing systems. 2018. p. 3036-3046. - http://papers.nips.cc/paper/7567-on-the-global-convergence-of-gradient-descent-for-over-parameterized-models-using-optimal-transport
  • CORLISS, George, FAURE, Christele, GRIEWANK, Andreas, et al. (ed.). Automatic differentiation of algorithms: from simulation to optimization. Springer Science & Business Media, 2002.
  • CORREA, Rafael et JOFRE, A. Tangentially continuous directional derivatives in nonsmooth analysis. Journal of Optimization Theory and Applications, 1989, vol. 61, no 1, p. 1-21. - https://doi.org/10.1007/BF00940840
  • COSTE, M. An introduction to o-minimal geometry, Inst. Rech. Math., Univ. de Rennes, 1999.
  • VAN DEN DRIES, Lou et MILLER, Chris. Geometric categories and o-minimal structures. Duke Math. J., 1996, vol. 84, no 2, p. 497-540.
  • EVANS, L. C. et GARIEPY, R. F. Measure Theory and Fine Properties of Functions, Revised edn. Textbooks in Mathematics. CRC Press, Boca Raton, 2015.
  • GLOROT, Xavier, BORDES, Antoine, et BENGIO, Yoshua. Deep sparse rectifier neural networks. In : Proceedings of the fourteenth international conference on artificial intelligence and statistics. 2011. p. 315-323. - http://proceedings.mlr.press/v15/glorot11a/glorot11a.pdf
  • GRIEWANK, Andreas et WALTHER, Andrea. Evaluating derivatives: principles and techniques of algorithmic differentiation. SIAM, 2008.
  • GRIEWANK, Andreas. On stable piecewise linearization and generalized algorithmic differentiation. Optimization Methods and Software, 2013, vol. 28, no 6, p. 1139-1178. - https://doi.org/10.1080/10556788.2013.796683
  • GRIEWANK, Andreas, WALTHER, Andrea, FIEGE, Sabrina, et BOSSE, Torsten. On Lipschitz optimization based on gray-box piecewise linearization. Mathematical Programming, 2016, vol. 158, no 1-2, p. 383-415. - http://dx.doi.org/10.1007/s10107-015-0934-x
  • IOFFE, A. D. Nonsmooth analysis: differential calculus of nondifferentiable mappings. Transactions of the American Mathematical Society, 1981, vol. 266, no 1, p. 1-56. - http://dx.doi.org/10.1090/S0002-9947-1981-0613784-7
  • IOFFE, Alexander D. Variational analysis of regular mappings. Springer Monographs in Mathematics. Springer, Cham, 2017. - http://dx.doi.org/10.1007/978-3-319-64277-2
  • KAKADE, Sham M. et LEE, Jason D. Provably correct automatic sub-differentiation for qualified programs. In : Advances in neural information processing systems. 2018. p. 7125-7135. - http://papers.nips.cc/paper/7943-provably-correct-automatic-sub-differentiation-for-qualified-programs
  • KURDYKA, Krzysztof. On gradients of functions definable in o-minimal structures. In : Annales de l'institut Fourier. 1998. p. 769-783. - https://doi.org/10.5802/aif.1638
  • KURDYKA, Krzysztof, MOSTOWSKI, Tadeusz, et PARUSINSKI, Adam. Proof of the gradient conjecture of R. Thom. Annals of Mathematics, 2000, vol. 152, no 3, p. 763-792. - http://www.jstor.org/stable/2661354
  • KUSHNER, Harold et YIN, G. George. Stochastic approximation and recursive algorithms and applications. Springer Science & Business Media, 2003.
  • LECUN, Yann, BENGIO, Yoshua, et HINTON, Geoffrey. Deep learning. Nature, 2015, vol. 521, no 7553, p. 436-444. - https://doi.org/10.1038/nature14539
  • LJUNG, Lennart. Analysis of recursive stochastic algorithms. IEEE Transactions on Automatic Control, 1977, vol. 22, no 4, p. 551-575. - https://doi.org/10.1109/TAC.1977.1101561
  • MAJEWSKI, Szymon, MIASOJEDOW, Błażej, et MOULINES, Eric. Analysis of nonsmooth stochastic approximation: the differential inclusion approach. arXiv preprint arXiv:1805.01916, 2018. - https://arxiv.org/abs/1805.01916
  • MOHAMMADI, Bijan et PIRONNEAU, Olivier. Applied shape optimization for fluids. Oxford University Press, 2010.
  • MOULINES, Eric et BACH, Francis R. Non-asymptotic analysis of stochastic approximation algorithms for machine learning. In : Advances in Neural Information Processing Systems. 2011. p. 451-459. - http://papers.nips.cc/paper/4316-non-asymptotic-analysis-of-stochastic-approximation-algorithms-for-machine
  • MOREAU, Jean Jacques. Fonctionnelles sous-différentiables. 1963.
  • MORDUKHOVICH, Boris S. Variational analysis and generalized differentiation I: Basic theory. Springer Science & Business Media, 2006.
  • PASZKE, Adam, GROSS, Sam, CHINTALA, Soumith, et al. Automatic differentiation in PyTorch. 2017. - https://openreview.net/forum?id=BJJsrmfCZ
  • ROBBINS, Herbert et MONRO, Sutton. A stochastic approximation method. The Annals of Mathematical Statistics, 1951, p. 400-407. - https://www.jstor.org/stable/2236626
  • ROCKAFELLAR, Ralph Tyrrell. Convex functions and dual extremum problems. PhD thesis, Harvard University, 1963.
  • ROCKAFELLAR, Ralph. On the maximal monotonicity of subdifferential mappings. Pacific Journal of Mathematics, 1970, vol. 33, no 1, p. 209-216. - http://dx.doi.org/10.2140/pjm.1970.33.209
  • ROCKAFELLAR, R. T. et WETS, R. J. B. Variational Analysis. Springer, 1998.
  • RUMELHART, David E., HINTON, Geoffrey E., et WILLIAMS, Ronald J. Learning representations by back-propagating errors. Nature, 1986, vol. 323, no 6088, p. 533-536. - https://doi.org/10.1038/323533a0
  • SPEELPENNING, Bert. Compiling fast partial derivatives of functions given by algorithms. Illinois Univ., Urbana (USA). Dept. of Computer Science, 1980. - https://doi.org/10.2172/5254402
  • THIBAULT, Lionel. On generalized differentials and subdifferentials of Lipschitz vector-valued functions. Nonlinear Analysis: Theory, Methods & Applications, 1982, vol. 6, no 10, p. 1037-1053. - https://doi.org/10.1016/0362-546X(82)90074-8
  • THIBAULT, Lionel et ZAGRODNY, Dariusz. Integration of subdifferentials of lower semicontinuous functions on Banach spaces. Journal of Mathematical Analysis and Applications, 1995, vol. 189, no 1, p. 33-58. - https://doi.org/10.1006/jmaa.1995.1003
  • THIBAULT, Lionel et ZLATEVA, Nadia. Integrability of subdifferentials of directionally Lipschitz functions. Proceedings of the American Mathematical Society, 2005, p. 2939-2948. - https://www.jstor.org/stable/4097908
  • VALADIER, Michel. Entraînement unilatéral, lignes de descente, fonctions lipschitziennes non pathologiques. C. R. Acad. Sci. Paris, 1989, vol. 308, p. 241-244.
  • WANG, Xianfu. Pathological Lipschitz function in R^n. PhD thesis, Simon Fraser University, Dept. of Mathematics and Statistics, 1995.
