
What does back propagation compute?

By Edouard Pauwels

Appears in collection: Optimization for Machine Learning / Optimisation pour l’apprentissage automatique

We are interested in the nonsmooth analysis of backpropagation as implemented in modern machine learning libraries such as TensorFlow or PyTorch. First I will illustrate how the blind application of differential calculus to nonsmooth objects can be problematic, which calls for a proper mathematical model. Then I will introduce a weak notion of generalized derivative, named conservativity, and illustrate how it complies with calculus and optimization for well-structured objects. We provide stability results for empirical risk minimization, similar to those in the smooth setting, for the combination of nonsmooth automatic differentiation, minibatch stochastic approximation, and first-order optimization. This is joint work with Jérôme Bolte.
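To make the first point concrete, here is a minimal sketch (my own illustration, not code from the talk) of the kind of pathology the abstract alludes to, assuming PyTorch's convention relu'(0) = 0. The two programs below compute the same function, the identity x ↦ x, yet backpropagation returns different values at 0, so the output of autodiff applied to a nonsmooth program need not be a derivative in any classical sense.

    # Two programs for the identity function whose autodiff outputs disagree
    # at 0; PyTorch (like TensorFlow) uses the convention relu'(0) = 0.
    import torch

    def f(x):
        return torch.relu(x) - torch.relu(-x)  # equals x for every real x

    def g(x):
        return 1.0 * x  # the same function, written without nonsmooth primitives

    x = torch.tensor(0.0, requires_grad=True)
    f(x).backward()
    print(x.grad)  # tensor(0.): backprop through f returns 0 at x = 0

    y = torch.tensor(0.0, requires_grad=True)
    g(y).backward()
    print(y.grad)  # tensor(1.): the true derivative of the identity

Conservativity is the weak notion of derivative meant to model this behaviour. Roughly, and hedging on the details, a set-valued map D : R^p ⇉ R^p with closed graph and nonempty compact values is a conservative field for a locally Lipschitz f : R^p → R when, for every absolutely continuous curve γ : [0,1] → R^p,

    \frac{\mathrm{d}}{\mathrm{d}t} f(\gamma(t)) = \langle v, \dot{\gamma}(t) \rangle
    \qquad \text{for all } v \in D(\gamma(t)), \ \text{for almost every } t \in [0,1].

A conservative field coincides with the usual gradient almost everywhere, so spurious outputs such as the 0 above are confined to a negligible set, and a chain rule holds along curves; this is what lets the stability results for minibatch stochastic first-order methods mentioned in the abstract go through.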

Information about the video

Citation data

  • DOI 10.24350/CIRM.V.19623303
  • Cite this video: Pauwels, Edouard (09/03/2020). What does back propagation compute? CIRM. Audiovisual resource. DOI: 10.24350/CIRM.V.19623303
  • URL https://dx.doi.org/10.24350/CIRM.V.19623303

Bibliography

  • ANIL, Cem, LUCAS, James, et GROSSE, Roger. Sorting out Lipschitz function approximation. arXiv preprint arXiv:1811.05381, 2018. - https://arxiv.org/abs/1811.05381
  • DAVIS, Damek, DRUSVYATSKIY, Dmitriy, KAKADE, Sham, et al. Stochastic subgradient method converges on tame functions. Foundations of Computational Mathematics, 2018, p. 1-36. - https://doi.org/10.1007/s10208-018-09409-5
  • ABADI, Martín, BARHAM, Paul, CHEN, Jianmin, et al. Tensorflow: A system for large-scale machine learning. In : 12th {USENIX} Symposium on Operating Systems Design and Implementation ({OSDI} 16). 2016. p. 265-283. - https://www.usenix.org/conference/osdi16/technical-sessions/presentation/abadi
  • ALIPRANTIS, Charalambos D. et BORDER, Kim C. Infinite dimensional analysis: a hitchhiker’s guide. Springer, 2006.
  • ATTOUCH, Hedy, GOUDOU, Xavier, et REDONT, Patrick. The heavy ball with friction method, I. The continuous dynamical system: global exploration of the local minima of a real-valued function by asymptotic analysis of a dissipative dynamical system. Communications in Contemporary Mathematics, 2000, vol. 2, no 01, p. 1-34. - https://doi.org/10.1142/S0219199700000025
  • AUBIN, Jean-Pierre et CELLINA, Arrigo. Differential inclusions: set-valued maps and viability theory. Springer, 1984.
  • AUBIN, Jean-Pierre et FRANKOWSKA, Hélène. Set-valued analysis. Springer Science & Business Media, 2009.
  • BENAÏM, Michel. Dynamics of stochastic approximation algorithms. In : Seminaire de probabilites XXXIII. Springer, Berlin, Heidelberg, 1999. p. 1-68. - http://dx.doi.org/10.1007/BFb0096509
  • BENAÏM, Michel, HOFBAUER, Josef, et SORIN, Sylvain. Stochastic approximations and differential inclusions. SIAM Journal on Control and Optimization, 2005, vol. 44, no 1, p. 328-348. - https://doi.org/10.1137/S0363012904439301
  • BOLTE, Jérôme, DANIILIDIS, Aris, LEWIS, Adrian, et al. Clarke subgradients of stratifiable functions. SIAM Journal on Optimization, 2007, vol. 18, no 2, p. 556-572. - https://doi.org/10.1137/060670080
  • BOLTE, Jérôme, SABACH, Shoham, et TEBOULLE, Marc. Proximal alternating linearized minimization for nonconvex and nonsmooth problems. Mathematical Programming, 2014, vol. 146, no 1-2, p. 459-494. - https://doi.org/10.1007/s10107-013-0701-9
  • BORKAR, Vivek S. Stochastic approximation: a dynamical systems viewpoint. Springer, 2009.
  • BORWEIN, Jonathan et LEWIS, Adrian S. Convex analysis and nonlinear optimization: theory and examples. Springer Science & Business Media, 2010.
  • BORWEIN, Jonathan M. et MOORS, Warren B. Essentially smooth Lipschitz functions. Journal of Functional Analysis, 1997, vol. 149, no 2, p. 305-351. - https://doi.org/10.1006/jfan.1997.3101
  • BORWEIN, Jonathan M. et MOORS, Warren B. A chain rule for essentially smooth Lipschitz functions. SIAM Journal on Optimization, 1998, vol. 8, no 2, p. 300-308. - https://doi.org/10.1137/S1052623496297838
  • BORWEIN, Jonathan, MOORS, Warren, et WANG, Xianfu. Generalized subdifferentials: a Baire categorical approach. Transactions of the American Mathematical Society, 2001, vol. 353, no 10, p. 3875-3893. - http://dx.doi.org/10.1090/S0002-9947-01-02820-3
  • BOTTOU, Léon et BOUSQUET, Olivier. The tradeoffs of large scale learning. In : Advances in neural information processing systems. 2008. p. 161-168. - http://papers.nips.cc/paper/3323-the-tradeoffs-of-large-scale-learning
  • BOTTOU, Léon, CURTIS, Frank E., et NOCEDAL, Jorge. Optimization methods for large-scale machine learning. SIAM Review, 2018, vol. 60, no 2, p. 223-311. - https://doi.org/10.1137/16M1080173
  • CASTERA, Camille, BOLTE, Jérôme, FÉVOTTE, Cédric, et al. An Inertial Newton Algorithm for Deep Learning. arXiv preprint arXiv:1905.12278, 2019. - https://arxiv.org/abs/1905.12278
  • CLARKE, Frank H. Optimization and nonsmooth analysis. SIAM, 1990.
  • CHIZAT, Lenaic et BACH, Francis. On the global convergence of gradient descent for over-parameterized models using optimal transport. In : Advances in neural information processing systems. 2018. p. 3036-3046. - http://papers.nips.cc/paper/7567-on-the-global-convergence-of-gradient-descent-for-over-parameterized-models-using-optimal-transport
  • CORLISS, George, FAURE, Christele, GRIEWANK, Andreas, et al. (ed.). Automatic differentiation of algorithms: from simulation to optimization. Springer Science & Business Media, 2002.
  • CORREA, Rafael et JOFRE, A. Tangentially continuous directional derivatives in nonsmooth analysis. Journal of Optimization Theory and Applications, 1989, vol. 61, no 1, p. 1-21. - https://doi.org/10.1007/BF00940840
  • COSTE, M. An introduction to o-minimal geometry, Inst. Rech. Math., Univ. de Rennes, 1999.
  • VAN DEN DRIES, Lou et MILLER, Chris. Geometric categories and o-minimal structures. Duke Math. J., 1996, vol. 84, no 2, p. 497-540.
  • EVANS, L. C. et GARIEPY, R. F. Measure Theory and Fine Properties of Functions, Revised edn. Textbooks in Mathematics. CRC Press, Boca Raton, 2015.
  • GLOROT, Xavier, BORDES, Antoine, et BENGIO, Yoshua. Deep sparse rectifier neural networks. In : Proceedings of the fourteenth international conference on artificial intelligence and statistics. 2011. p. 315-323. - http://proceedings.mlr.press/v15/glorot11a/glorot11a.pdf
  • GRIEWANK, Andreas et WALTHER, Andrea. Evaluating derivatives: principles and techniques of algorithmic differentiation. SIAM, 2008.
  • GRIEWANK, Andreas. On stable piecewise linearization and generalized algorithmic differentiation. Optimization Methods and Software, 2013, vol. 28, no 6, p. 1139-1178. - https://doi.org/10.1080/10556788.2013.796683
  • GRIEWANK, Andreas, WALTHER, Andrea, FIEGE, Sabrina, et BOSSE, Torsten. On Lipschitz optimization based on gray-box piecewise linearization. Mathematical Programming, 2016, vol. 158, no 1-2, p. 383-415. - http://dx.doi.org/10.1007/s10107-015-0934-x
  • IOFFE, A. D. Nonsmooth analysis: differential calculus of nondifferentiable mappings. Transactions of the American Mathematical Society, 1981, vol. 266, no 1, p. 1-56. - http://dx.doi.org/10.1090/S0002-9947-1981-0613784-7
  • IOFFE, Alexander D. Variational analysis of regular mappings. Springer Monographs in Mathematics. Springer, Cham, 2017. - http://dx.doi.org/10.1007/978-3-319-64277-2
  • KAKADE, Sham M. et LEE, Jason D. Provably correct automatic sub-differentiation for qualified programs. In : Advances in neural information processing systems. 2018. p. 7125-7135. - http://papers.nips.cc/paper/7943-provably-correct-automatic-sub-differentiation-for-qualified-programs
  • KURDYKA, Krzysztof. On gradients of functions definable in o-minimal structures. In : Annales de l'institut Fourier. 1998. p. 769-783. - https://doi.org/10.5802/aif.1638
  • KURDYKA, Krzysztof, MOSTOWSKI, Tadeusz, et PARUSINSKI, Adam. Proof of the gradient conjecture of R. Thom. Annals of Mathematics, 2000, vol. 152, no 3, p. 763-792. - http://www.jstor.org/stable/2661354
  • KUSHNER, Harold et YIN, G. George. Stochastic approximation and recursive algorithms and applications. Springer Science & Business Media, 2003.
  • LECUN, Yann, BENGIO, Yoshua, et HINTON, Geoffrey. Deep learning. Nature, 2015, vol. 521, no 7553, p. 436-444. - https://doi.org/10.1038/nature14539
  • LJUNG, Lennart. Analysis of recursive stochastic algorithms. IEEE Transactions on Automatic Control, 1977, vol. 22, no 4, p. 551-575. - https://doi.org/10.1109/TAC.1977.1101561
  • MAJEWSKI, Szymon, MIASOJEDOW, Błażej, et MOULINES, Eric. Analysis of nonsmooth stochastic approximation: the differential inclusion approach. arXiv preprint arXiv:1805.01916, 2018. - https://arxiv.org/abs/1805.01916
  • MOHAMMADI, Bijan et PIRONNEAU, Olivier. Applied shape optimization for fluids. Oxford University Press, 2010.
  • MOULINES, Eric et BACH, Francis R. Non-asymptotic analysis of stochastic approximation algorithms for machine learning. In : Advances in Neural Information Processing Systems. 2011. p. 451-459. - http://papers.nips.cc/paper/4316-non-asymptotic-analysis-of-stochastic-approximation-algorithms-for-machine
  • MOREAU, Jean Jacques. Fonctionnelles sous-différentiables. 1963.
  • MORDUKHOVICH, Boris S. Variational analysis and generalized differentiation I: Basic theory. Springer Science & Business Media, 2006.
  • PASZKE, Adam, GROSS, Sam, CHINTALA, Soumith, et al. Automatic differentiation in PyTorch. 2017. - https://openreview.net/forum?id=BJJsrmfCZ
  • ROBBINS, Herbert et MONRO, Sutton. A stochastic approximation method. The Annals of Mathematical Statistics, 1951, p. 400-407. - https://www.jstor.org/stable/2236626
  • ROCKAFELLAR, Ralph Tyrrell. Convex functions and dual extremum problems. PhD thesis, Harvard University, 1963.
  • ROCKAFELLAR, Ralph. On the maximal monotonicity of subdifferential mappings. Pacific Journal of Mathematics, 1970, vol. 33, no 1, p. 209-216. - http://dx.doi.org/10.2140/pjm.1970.33.209
  • ROCKAFELLAR, R. T. et WETS, R. J. B. Variational Analysis. Springer, 1998.
  • RUMELHART, David E., HINTON, Geoffrey E., et WILLIAMS, Ronald J. Learning representations by back-propagating errors. Nature, 1986, vol. 323, no 6088, p. 533-536. - https://doi.org/10.1038/323533a0
  • SPEELPENNING, Bert. Compiling fast partial derivatives of functions given by algorithms. Illinois Univ., Urbana (USA). Dept. of Computer Science, 1980. - https://doi.org/10.2172/5254402
  • THIBAULT, Lionel. On generalized differentials and subdifferentials of Lipschitz vector-valued functions. Nonlinear Analysis: Theory, Methods & Applications, 1982, vol. 6, no 10, p. 1037-1053. - https://doi.org/10.1016/0362-546X(82)90074-8
  • THIBAULT, Lionel et ZAGRODNY, Dariusz. Integration of subdifferentials of lower semicontinuous functions on Banach spaces. Journal of Mathematical Analysis and Applications, 1995, vol. 189, no 1, p. 33-58. - https://doi.org/10.1006/jmaa.1995.1003
  • THIBAULT, Lionel et ZLATEVA, Nadia. Integrability of subdifferentials of directionally Lipschitz functions. Proceedings of the American Mathematical Society, 2005, p. 2939-2948. - https://www.jstor.org/stable/4097908
  • VALADIER, Michel. Entraînement unilatéral, lignes de descente, fonctions lipschitziennes non pathologiques. C. R. Acad. Sci. Paris, 1989, vol. 308, p. 241-244.
  • WANG, Xianfu. Pathological Lipschitz function in R^n. PhD thesis, Simon Fraser University, Dept. of Mathematics and Statistics, 1995.
