I was curious how well the simple momentum step-size approach shown in the first interactive example compares to alternative methods. The function that example minimises is named bananaf (a scaled Rosenbrock "banana" function), defined as
var s = 3
var x = xy[0]; var y = xy[1]*s
var fx = (1-x)*(1-x) + 20*(y - x*x )*(y - x*x )
var dfx = [-2*(1-x) - 80*x*(-x*x + y), s*40*(-x*x + y)]
The interactive example uses an initial guess of [-1.21, 0.853] and a fixed 150 iterations, with no convergence test.
From manually fiddling with the (step-size) alpha & (momentum) beta parameters, and editing the code to specify a smaller number of iterations, it seems quite difficult to tune this momentum-based approach so that it gets near the minimum and stays there, without bouncing away, in 50 iterations or fewer.
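For anyone who wants to fiddle outside the browser, here's a Python sketch of that momentum loop. The heavy-ball update form and the parameter values below are my choices, not the demo's defaults, but the function and initial guess match the example:

```python
import numpy as np

S = 3.0  # scaling applied to the second coordinate, as in the demo's code

def fx(xy):
    x, y = xy[0], S * xy[1]
    return (1 - x) ** 2 + 20 * (y - x * x) ** 2

def dfx(xy):
    x, y = xy[0], S * xy[1]
    return np.array([-2 * (1 - x) - 80 * x * (y - x * x),
                     S * 40 * (y - x * x)])

def momentum_descent(x0, alpha, beta, iters):
    """Gradient descent with (heavy-ball) momentum and a fixed step size."""
    x = np.asarray(x0, dtype=float)
    v = np.zeros_like(x)
    for _ in range(iters):
        v = beta * v - alpha * dfx(x)
        x = x + v
    return x

# With a deliberately small step it should creep to the minimum near [1, 1/3],
# but it needs far more than the demo's 150 iterations to settle there.
x = momentum_descent([-1.21, 0.853], alpha=0.001, beta=0.9, iters=5000)
```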
Out of curiosity, I compared minimising this bananaf function with scipy.optimize.minimize, using the same initial guess.
If we force scipy.optimize.minimize to use method='cg', leaving all other parameters as defaults, it converges to the optimal solution of [1.0, 1./3.] requiring 43 evaluations of fx and dfx.
If we allow scipy.optimize.minimize to use all defaults -- including the default method='bfgs', it converges to the optimal solution after only 34 evaluations of fx and dfx.
Under the hood, scipy's method='cg' and method='bfgs' solvers do not use a fixed step size or momentum to determine the step size, but instead solve a line search problem. The line search problem is to identify a step size that satisfies a sufficient decrease condition and a curvature condition - see Wolfe conditions [1]. Scipy's default line search method -- used for cg and bfgs -- is a python port [2] of the dcsrch routine from MINPACK2. A good reference covering line search methods & BFGS is Nocedal & Wright's 2006 book Numerical Optimization.
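For reference, the comparison can be reproduced with something like the sketch below (exact evaluation counts will vary a little across scipy versions, so don't expect the 43/34 figures to match to the digit):

```python
import numpy as np
from scipy.optimize import minimize

S = 3.0  # same coordinate scaling as the interactive demo

def fx(xy):
    x, y = xy[0], S * xy[1]
    return (1 - x) ** 2 + 20 * (y - x * x) ** 2

def dfx(xy):
    x, y = xy[0], S * xy[1]
    return np.array([-2 * (1 - x) - 80 * x * (y - x * x),
                     S * 40 * (y - x * x)])

x0 = [-1.21, 0.853]  # same initial guess as the demo

res_cg = minimize(fx, x0, jac=dfx, method='CG')      # forced conjugate gradient
res_bfgs = minimize(fx, x0, jac=dfx, method='BFGS')  # scipy's default for smooth problems

# Both should land on the optimum [1, 1/3]; res.nfev gives the evaluation count.
print(res_cg.x, res_cg.nfev)
print(res_bfgs.x, res_bfgs.nfev)
```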
Now try the same experiment in 1 billion dimensions.
shoo 10 hours ago [-]
It's unclear if increasing the dimensionality is in itself a challenge, provided that the objective function is still convex with a unique global minimum -- like these somewhat problematic Rosenbrock test objective functions used in the article's examples.
On the other hand, if the objective function is highly multimodal with many "red herring" local minima, perhaps an optimiser that is very good at finding the nearest local minimum might do worse in practice at global optimisation than an optimiser that sometimes "barrels" out of the basin of a local minimum and accidentally falls into a neighbouring basin around a lower minimum.
I ran a few numerical experiments using scipy's "rosen" test function [1] as the objective, in D=10,000 dimensions. This function has a unique global minimum of 0, attained at x* = 1_D. I set the initial guess as x0 := x* + eps, where each element eps_i, i=1,...,D, is noise sampled from N(0, 0.05).
Repeating this over 100 trial problems, using the same initial guess x0 across all methods within each trial, the average number of gradient evaluations required for convergence was:
'cg': 248
'l-bfgs-b': 40
'm-001-99': 3337
All methods converged in 100 / 100 trials.
m-001-99 is gradient descent with momentum using alpha=0.001 and beta=0.99. Setting alpha=0.002 or higher causes momentum to fail to converge. The other two methods are scipy's cg & l-bfgs-b methods with default parameters (again, under the hood these two rely on a port of MINPACK2's dcsrch to determine the step size along the descent direction at each iteration; they're not using momentum updates or a fixed step size). I used l-bfgs-b instead of bfgs to avoid maintaining the dense D*D matrix for the approximate inverse Hessian.
One point in momentum's favour was robustness to higher noise levels used to generate the initial guess -- if the noise level used to define the initial guess x0 is increased to N(0, 1) then I see the cg & l-bfgs-b methods fail to converge in around 20% of trial problems, while momentum fails a lower fraction of the time provided the fixed step size is set small enough, but still requires a very large number of gradient evaluations to converge.
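For the record, the momentum baseline in these experiments looks something like the sketch below. It's scaled down to D=50 here so it runs quickly; rosen/rosen_der are scipy's test function and its gradient, and the gradient-norm convergence test is my choice, not necessarily what the original experiments used:

```python
import numpy as np
from scipy.optimize import rosen, rosen_der

def momentum_minimize(x0, alpha=0.001, beta=0.99, max_iter=20000, gtol=1e-6):
    """Fixed-step gradient descent with momentum on the Rosenbrock function."""
    x = np.asarray(x0, dtype=float)
    v = np.zeros_like(x)
    for k in range(max_iter):
        g = rosen_der(x)
        if np.linalg.norm(g, ord=np.inf) < gtol:
            return x, k  # converged: gradient is numerically zero
        v = beta * v - alpha * g
        x = x + v
    return x, max_iter

D = 50
rng = np.random.default_rng(0)
x_star = np.ones(D)                            # the unique global minimiser
x0 = x_star + rng.normal(0.0, 0.05, size=D)    # perturbed initial guess

x_hat, n_grad_evals = momentum_minimize(x0)
```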
Perhaps you took my comment too literally. Try it on a real neural network; it doesn't work.
Lerc 5 hours ago [-]
While I'm interested in the topic of the post and have seen plenty of visualisations of balls rolling around hills, I was a little disappointed that it didn't cover the thing that has been bugging me for years.
Momentum, or specifically inertia, in physics, what the hell is it? There's a Feynman tale where he asked his father why the ball rolled to the back of a trolley when he pulled the trolley. The answer he received was the usual description of inertia, but also the rarely given insight that describing something and giving it a name is completely different from knowing why it happens.
It's one of those things that I lie in bed thinking about. The other one is position: I can grasp the notion of spacetime and the idea of movement and speed as changes in position in space relative to position in time. I really don't have a grasp of what position is, though. I know the name, and I can attach the numbers to it, but that doesn't really cover what the numbers are of.
somethingsome 2 hours ago [-]
A rough idea and simplification:
Suppose you have a bunch of 2D points, without coordinates. They exist because you say so; they can represent anything you want.
But you can't do a lot with those points; you may be interested in knowing their distances. To do that, you create some reference system, i.e. two non-parallel axes, and you set a unit on each. For example, one could use centimetres and the other metres.
Now, by placing the reference system on one particular point, you can 'identify' every other point on that scale.
With this correspondence, you can uniquely map each point to a coordinate and each coordinate to a point in space, which allows you to measure distances, for example.
Notice that the chosen origin didn't really matter, nor the direction of the axes. But since the rest of the 2D world can be mapped to them, everything is coherent.
Now if you create a new axis system with another initial point, and with 1cm on both axes, you can find a transformation that maps your first system into the second; this transformation lets you express any point in the new coordinate system.
So what is a position exactly? I would say it's the identification of some object by an arbitrarily chosen reference system.
What are the numbers? They correspond to an arbitrarily chosen unit of measure.
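To make that concrete, here's a toy numpy sketch (the particular frames and points are made up): two reference systems for the same 2D points, differing by a rotated set of axes and a shifted origin. The coordinate tuples change between frames, but the distance between points doesn't.

```python
import numpy as np

# Frame B differs from frame A by a rotation of its axes and a shifted origin.
theta = np.deg2rad(30)
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])  # B's axis directions, seen from A
origin_B = np.array([2.0, -1.0])                 # B's origin, in A's coordinates

def a_to_b(p_a):
    """Re-express a point's frame-A coordinates in frame B."""
    return R.T @ (p_a - origin_B)

p = np.array([1.0, 3.0])
q = np.array([4.0, 7.0])

p_in_B = a_to_b(p)                               # different numbers, same point
d_A = np.linalg.norm(p - q)                      # distance measured in frame A
d_B = np.linalg.norm(a_to_b(p) - a_to_b(q))      # distance measured in frame B
```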
I hope this gives you more to think about :)
chermi 2 hours ago [-]
What specifically do you feel you don't grok about inertia? I'll admit the use of "inertia" to explain phenomena historically bothered me, as it seemed like just an extra word for something already covered. Inertia/momentum describes what an object will do in the next instant if nothing else "happens" to it. Force describes deviation from this according to dp/dt = F. Of course this is in the classical sense.
I'm not sure about position. It's a hard one to think about. What's important is that position's numbers (coordinates) are defined according to a coordinate system. But the actual physical "position" doesn't care about the coordinate system. So things like distance, or the time it takes to get from one point to another (in some units), are invariant under coordinate changes.
Lerc 13 minutes ago [-]
>What specifically do you feel you don't grok about inertia?
It's the why of it. Why do objects stay in motion? Why does it take force to change that? It's hard to imagine a universe where this is not so.
Observation has given us a description that makes good predictions, but that's the what, not the why.
I think it might be similar to the position problem in the sense of what does it mean to say a property is a property of something.
kenjackson 5 hours ago [-]
What’s hard to understand about position? Isn’t it just a specific coordinate in some space?
fnordpiglet 2 hours ago [-]
Except there is no absolute coordinate system. Position is relative to something, and it’s not space. Likewise movement is only meaningful if defined with respect to something else.
Momentum is simply established movement with respect to something else. Acceleration is the only really meaningful thing, as it involves force and a transfer of energy. Momentum is a measure of how much acceleration it would require to change relative speeds with respect to something else. If you removed the something else from the system or substituted it for something with the same “momentum,” everything would cease to have momentum and would not be moving.
The insight that there is no global coordinate system in space is a key insight of relativity. Momentum is a measurement of the state of a system in an inertial frame and can be seen as a measure of the energy required to effectively get all the mass to have that velocity, or the energy required to bring the entire frame to relative rest. It’s a conserved measure, and a bunch of other useful things - but it itself isn’t “a thing,” it’s a measure.
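A toy calculation makes the frame-dependence concrete (the numbers here are made up): velocity and momentum change when you change inertial frame, while the velocity change produced by a force does not.

```python
# Galilean frame change: velocities (and momenta) shift, accelerations don't.
m = 2.0           # mass in kg (hypothetical)
v_ground = 10.0   # object's velocity in the ground frame, m/s
u = 10.0          # velocity of a co-moving frame relative to the ground, m/s

p_ground = m * v_ground        # momentum measured from the ground
p_moving = m * (v_ground - u)  # momentum measured from the co-moving frame: zero

# A force changes velocity identically in both frames:
F, dt = 4.0, 0.5
dv = (F / m) * dt              # same dv in every inertial frame
```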
kenjackson 18 minutes ago [-]
"Except there is no absolute coordinate system. Position is relative to something"
Yes, just pick something. I think I get what you're saying, but I'm unclear why it's not straightforward. I must be missing something.
Perhaps it is an elementary doubt, but does it all apply to rotational motion? Does a wheel rotating about its own axis continue to rotate in perpetuity, in the absence of friction, air resistance, etc.?
porridgeraisin 21 hours ago [-]
Distill.pub has such high quality content consistently. It's a shame they don't seem to be active anymore.
imvg 21 hours ago [-]
I agree. Just the usage of animations for explanations was a huge step forward. I wonder why the flagship ML/AI conferences have not adopted the distill digital template for papers yet. I think that would be the first step. The quality would follow
gwern 20 hours ago [-]
The quality would not follow because Distill.pub publications take literally hundreds of man-hours for the Distill part. Highly skilled man-hours too, to make any of this work on the Web reliably. (Source: I once asked Olah basically the same question: "How many man-hours of work by Google-tier web developers could those visualizations possibly be, Michael? 10?")
michael_nielsen 19 hours ago [-]
I've been wondering at what point AI assistants will reduce that to a manageable level. It's unfortunately not obvious what the main bottlenecks are, though Chris and Shan might have a good sense.
gwern 17 hours ago [-]
It might be doable soon, you're right. But there seems to be a substantial weakness in vision-language-models where they have a bad time with anything involving screenshots, tables, schematics, or visualizations, compared to real-world photographs. (This is also, I'd guess, partially why Claude/Gemini do so badly on Pokemon screenshots without a lot of hand-engineering. Abstract pixel art in a structured UI may be a sort of worst-case scenario for whatever it is they do.) So that makes it hard to do any kind of feedback, never mind letting them try to code interactive visualization stuff autonomously.
colah3 3 hours ago [-]
A few comments on this thread:
Gwern is correct in his prior quote of how long these articles took. I think 50-200 hours is a pretty good range.
I expect AI assistants could help quite a bit with implementing the interactive diagrams, which was a significant fraction of this time. This is especially true for authors without a background in web development.
However, a huge amount of the editorial time went into other things. This article was a best case scenario for an article not written by the editors themselves. Gabriel is phenomenal and was a delight to work with. The editors didn't write any code for this article that I remember. But we still spent many tens of hours giving feedback on the text and diagrams. You can see some of this in github - e.g. https://github.com/distillpub/post--momentum/issues?q=is%3Ai...
More broadly, we struggled a lot with procedural issues. (We wrote a bit about this here: https://distill.pub/2021/distill-hiatus/ ) In retrospect, I deeply regret trying to run Distill with the expectations of a scientific journal, rather than the freedom of a blog, or wish I'd pushed back more on process. Not only did it occupy enormous amounts of time and energy, but it was just very de-energizing. I wanted to spend my time writing great articles and helping people write great articles.
(I was recently reading Thompson & Klein's Abundance, and kept thinking back to my experiences with Distill.)
muro 14 hours ago [-]
Only skimmed through the article for now, but I have to give props to the author - it's beautifully made.
[1] https://en.wikipedia.org/wiki/Wolfe_conditions [2] https://github.com/scipy/scipy/blob/main/scipy/optimize/_dcs...
[1] https://docs.scipy.org/doc/scipy/reference/generated/scipy.o...
Why Momentum Works - https://news.ycombinator.com/item?id=14034426 - April 2017 (95 comments)