I was reading Baydin et al., Automatic Differentiation in Machine Learning: a Survey, 2018 (arXiv), which distinguishes between *symbolic differentiation* and *automatic differentiation (AD)*. It then says:

AD Is Not Symbolic Differentiation. Symbolic differentiation is the automatic manipulation of [symbolic] expressions. AD can be thought of as performing a non-standard interpretation of a computer program where this interpretation involves augmenting the standard computation with the calculation of various derivatives.

Evaluation traces form the basis of the AD techniques. [A computational graph (Bauer, 1974) visualizes dependency relations of (input, working, output) variables in evaluation traces.]

It then goes on to describe how to compute the derivative with AD (in forward or reverse mode). The description is basically a transformation of the evaluation trace / computational graph.
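To check my understanding of "augmenting the standard computation", here is a minimal forward-mode sketch using dual numbers (my own illustration, not code from the survey): each primal value carries a tangent value, and every operation in the trace updates both.

```python
# Minimal forward-mode AD via dual numbers (illustrative sketch).
class Dual:
    def __init__(self, val, dot):
        self.val = val  # primal value
        self.dot = dot  # tangent (derivative w.r.t. the seeded input)

    def __add__(self, other):
        return Dual(self.val + other.val, self.dot + other.dot)

    def __mul__(self, other):
        # product rule, applied per operation in the trace
        return Dual(self.val * other.val,
                    self.dot * other.val + self.val * other.dot)

# f(x) = x*x + x, evaluated together with f'(x) at x = 3 in one pass
x = Dual(3.0, 1.0)   # seed dx/dx = 1
y = x * x + x        # y.val = 12.0, y.dot = 7.0  (f' = 2x + 1)
```

The derivative emerges purely from numeric propagation along the evaluation trace; no derivative *expression* is ever built.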

Autograd, Chainer, and PyTorch provide general-purpose reverse mode AD.
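As I understand it, "general-purpose reverse mode" means something like the following tape-based sketch (my own simplification, not PyTorch internals): record each operation during the forward pass, then replay the tape backwards to accumulate adjoints.

```python
# Tiny tape-based reverse-mode AD sketch (illustrative, not PyTorch internals).
tape = []  # records one backward closure per executed operation

class Var:
    def __init__(self, val):
        self.val = val
        self.grad = 0.0

def mul(a, b):
    out = Var(a.val * b.val)
    def backward():
        a.grad += b.val * out.grad
        b.grad += a.val * out.grad
    tape.append(backward)
    return out

def add(a, b):
    out = Var(a.val + b.val)
    def backward():
        a.grad += out.grad
        b.grad += out.grad
    tape.append(backward)
    return out

# f(x) = x*x + x at x = 3
x = Var(3.0)
y = add(mul(x, x), x)
y.grad = 1.0                    # seed dy/dy = 1
for backward in reversed(tape):  # reverse accumulation over the trace
    backward()
# x.grad == 7.0, i.e. f'(3) = 2*3 + 1
```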

It also discusses Theano, TensorFlow, and others, but it basically compares *define-and-run* / *static computational graph* (Theano, TF) vs *define-by-run* / *dynamic computational graph* (PyTorch, TF Eager).
(In my understanding, this is orthogonal to the question of how AD is performed; it would mostly change how AD is implemented, not the concept of AD itself.)

Theano is a computational graph optimizer and compiler [...] and it currently handles derivatives in a highly optimized form of symbolic differentiation. The result can be interpreted as a hybrid of symbolic differentiation and reverse mode AD, but Theano does not use the general-purpose reverse accumulation as we describe in this paper. (Personal communication with the authors.)

I'm not sure whether the authors imply that Theano/TF do not provide general-purpose reverse mode AD (which would be wrong in my understanding).

I don't exactly understand in what sense Theano does not use general-purpose reverse accumulation.

Also, I don't understand how symbolic differentiation is different from AD, given this definition.

Or: How are symbolic expressions different from computational graphs?
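To make my confusion concrete, here is a toy symbolic differentiator over nested tuples (my own sketch, not from the paper). It exhibits the "expression swell" that the AD literature attributes to symbolic differentiation: the derivative *expression* grows much faster than the original, whereas AD would just make one extra numeric pass over the same trace and never materialize a derivative expression at all.

```python
# Toy symbolic differentiation over expressions like ('mul', a, b).
def diff(e):
    if e == 'x':
        return 1
    if isinstance(e, (int, float)):
        return 0
    op, a, b = e
    if op == 'add':
        return ('add', diff(a), diff(b))
    if op == 'mul':
        # product rule produces two copies of the subexpressions
        return ('add', ('mul', diff(a), b), ('mul', a, diff(b)))

def size(e):
    """Number of nodes in the expression tree."""
    return 1 if not isinstance(e, tuple) else 1 + size(e[1]) + size(e[2])

f = 'x'
for _ in range(5):
    f = ('mul', f, f)   # repeated squaring: a deeply nested expression
df = diff(f)

print(size(f), size(df))  # the derivative expression is far larger than f
```

So my tentative answer to my own question would be that a computational graph can share subexpressions and is only ever *evaluated*, while naive symbolic differentiation rewrites expressions without sharing; but then Theano's graph-based derivatives still look like AD to me.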

Also related is *differentiable programming*:

differentiable directed graphs assembled from functional blocks

where I again do not see the difference to a computational graph.

And *backpropagation (BP)*:

The resulting algorithm is essentially equivalent to transforming the network evaluation function composed with the objective function under reverse mode AD, which, as we shall see, actually generalizes the backpropagation idea.

I don't see how reverse mode AD is more general than backpropagation. Is it? How?
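My best guess at the claimed generalization: backpropagation is usually stated for a fixed layered network, while reverse mode AD applies to arbitrary programs, including data-dependent control flow. A self-contained sketch of what I mean (my own illustration, again a minimal tape; not code from either paper):

```python
# Reverse-mode AD through data-dependent control flow
# (a loop whose length depends on the input value).
tape = []

class Var:
    def __init__(self, val):
        self.val = val
        self.grad = 0.0

def mul(a, b):
    out = Var(a.val * b.val)
    def backward():
        a.grad += b.val * out.grad
        b.grad += a.val * out.grad
    tape.append(backward)
    return out

def f(x):
    y = Var(1.0)
    while y.val < 10.0:   # not a fixed "network topology"
        y = mul(y, x)
    return y

x = Var(2.0)
y = f(x)                  # loop runs 4 times: y = x**4 = 16
y.grad = 1.0
for backward in reversed(tape):
    backward()
# x.grad == 32.0, i.e. d(x**4)/dx = 4 * x**3 at x = 2
```

If that reading is right, "BP is reverse mode AD" (Schmidhuber, below) and "reverse mode AD generalizes BP" (Baydin et al.) would both be true, just with different scopes for "BP". But I'd like confirmation.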

Schmidhuber, Deep Learning in Neural Networks: An Overview, 2014 (section 5.5) also states:

BP is also known as the reverse mode of automatic differentiation (Griewank, 2012).