Deep Integrator 1: What's so Hard about Integration?

Alex McKenzie
5 min read, 828 words

Why is symbolic integration hard? It's certainly a hell of a lot harder than symbolic differentiation. Consider this list of rules: ($c$, $n$ are constants)

$$ \begin{align*} c' &= 0 \\ (x^n)' &= nx^{n-1} \\ \left(e^x\right)' &= e^x \\ \left( \log(x) \right)' &= \frac{1}{x} \\ (f + g)' &= f' + g' \\ (fg)' &= f' \cdot g + f \cdot g' \\ (f \circ g)' &= (f' \circ g) \cdot g' \\ \left( f^{-1} \right)' &= \frac{1}{f' \circ f^{-1}} \end{align*} $$

Except for perhaps the Inverse Rule, these are all standard-issue rules for calculating derivatives that most people who studied Maths at 16 or 17 will recognise. These rules are also sufficient to define a complete algorithm for differentiating any elementary function.

To see that these rules are enough, we first have to note that almost any elementary function is already in the form of one of the inputs of these rules, and all other elementary functions can be easily rewritten into the right form:

  • $\sqrt[n]{x} = x^{1/n}$ which is in the form of the Power Rule
  • $\cos(x) = \sin\left(x + \frac{\pi}{2}\right)$ which is in the form of the Chain Rule
  • etc...

Next, we can see that after applying any of these rules, the differand [^1] becomes smaller [^2]. So, by structural induction, applying these rules recursively gives us a total algorithm for differentiation.
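
To make this concrete, here's a toy sketch of that recursive algorithm in Python. The tuple encoding and the helper name `d` are my own choices for illustration, not code from any real computer algebra system:

```python
# A toy sketch (my own illustration, not production code) of the recursive
# differentiation algorithm described above. Expressions are nested tuples:
# ('add', f, g), ('mul', f, g), ('pow', f, n), ('exp', f), ('log', f),
# the variable 'x', or a numeric constant.

def d(e):
    """Differentiate e with respect to x by recursive rule matching."""
    if isinstance(e, (int, float)):  # Constant Rule
        return 0
    if e == 'x':
        return 1
    op = e[0]
    if op == 'add':                  # Sum Rule
        return ('add', d(e[1]), d(e[2]))
    if op == 'mul':                  # Product Rule
        return ('add', ('mul', d(e[1]), e[2]), ('mul', e[1], d(e[2])))
    if op == 'pow':                  # Power Rule, composed with the Chain Rule
        f, n = e[1], e[2]
        return ('mul', ('mul', n, ('pow', f, n - 1)), d(f))
    if op == 'exp':                  # Exponential Rule + Chain Rule
        return ('mul', e, d(e[1]))
    if op == 'log':                  # Log Rule + Chain Rule
        return ('mul', ('pow', e[1], -1), d(e[1]))
    raise ValueError(f"no rule for {op!r}")

# Every recursive call works on a strictly smaller subexpression, so this
# always terminates. The output is correct but unsimplified:
print(d(('add', ('pow', 'x', 2), ('exp', 'x'))))
# ('add', ('mul', ('mul', 2, ('pow', 'x', 1)), 1), ('mul', ('exp', 'x'), 1))
```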

Is there a similar algorithm for integration? Certainly not such a simple one. Consider the two main integration heuristics we are taught at school:

$$\int f' \cdot g = f \cdot g - \int f \cdot g'$$

$$ \int (f \circ g) \cdot g' = \left( \int f \right) \circ g $$

For Integration By Parts, there's no guarantee that the integrand gets smaller. In fact, for many functions (e.g. $\sin(x) \cdot e^{x^2}$) the integrand gets bigger and more complex! For Integration By Substitution, the integrand is guaranteed to get smaller, but to match the left-hand side the integrand has to be in a very particular form. Together, these rules (plus the rules for base terms such as polynomials and exponentials) fall far short of a complete algorithm.
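
You can see how narrow the substitution pattern is with a quick SymPy experiment (my own illustration of the general point): nudge an integrand out of the $f(g(x)) \cdot g'(x)$ shape and the elementary antiderivative can vanish entirely.

```python
import sympy as sp

x = sp.Symbol('x')

# Exactly matches the substitution pattern f(g(x)) * g'(x) with g(x) = x**2:
print(sp.integrate(2 * x * sp.cos(x**2), x))   # sin(x**2)

# Drop the g'(x) factor and there is no elementary antiderivative at all;
# SymPy has to answer with a Fresnel special function instead.
print(sp.integrate(sp.cos(x**2), x))
```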

This asymmetry between the two directions of calculus was evident in the codebase of Chegg Math when I worked there. While Differentiation (affectionately called Derivation in our team after an early mistake by a well-meaning coder) took only around 200 lines, Integration sprawled over several thousand. As the number of special cases we uncovered increased, so too did the difficulty of structuring such a complex network of rules and exceptions. In order to cope with the complexity, we invented new machinery to traverse the expression tree, applying layer upon layer of heuristics to decide which computationally expensive rule to try next.

Our approach worked, mostly. We were able to integrate the vast majority of functions that users entered into our solver, and the code retained some semblance of structure. Maintaining this edifice, however, was an enormous effort.

I always wondered if our extremely complicated heuristic selection algorithm could be replaced with a simple machine learning system. I still don't know if that's the case, but I am more confident now that I've spent some time with Guillaume Lample and François Charton's 2019 paper "Deep Learning for Symbolic Mathematics"[^3]. This paper is clever, if mostly for its boldness: the paper's neural network learns to integrate by simply observing several million expression-antiderivative pairs. The real insights of the paper are threefold:

  1. The authors realised that mathematical expressions, when written in prefix notation, are somewhat similar in structure to human language (see the sketch after this list)
  2. They decided to use a standard state-of-the-art neural network architecture for Natural Language Processing, but feed it pairs of mathematical expressions (in prefix notation)
  3. They generated pairs of expressions in three ways: using existing software to integrate expressions, differentiating expressions and swapping the pair, and using Integration By Parts to get a hybrid of the two approaches
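
As a flavour of points 1 and 3, here is a minimal sketch using SymPy. It's my own illustration rather than the authors' code: `to_prefix` is a hypothetical helper that flattens an expression tree into prefix tokens, and the "backward" training pair comes from differentiating an expression and swapping the two sides.

```python
import sympy as sp

x = sp.Symbol('x')

def to_prefix(expr):
    """Flatten a SymPy expression tree into a prefix-notation token list."""
    if not expr.args:                   # leaves: symbols and numbers
        return [str(expr)]
    tokens = [type(expr).__name__]      # the operator comes first...
    for arg in expr.args:               # ...followed by each operand, recursively
        tokens += to_prefix(arg)
    return tokens

# "Backward" data generation: pick an expression F, differentiate it,
# and record (F', F) as an (integrand, antiderivative) training pair.
F = x * sp.sin(x)
integrand, antiderivative = sp.diff(F, x), F

print(to_prefix(integrand))        # e.g. ['Add', 'Mul', 'x', 'cos', 'x', 'sin', 'x']
print(integrand, antiderivative)   # x*cos(x) + sin(x)   x*sin(x)
```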

The conclusion: it works! According to the paper, it's both faster and more accurate than commercial computer algebra systems such as Mathematica.

This is the first blog post in a series on the Deep Integrator (my grandiose name for Lample and Charton's system). In the next posts, I'll:

  • Explain in more depth how it works
  • Show you a demo that I built, and talk through some caveats to Lample and Charton's results
  • Put it through its paces using the 70,000-strong symbolic integration test suite from Rubi

Stay tuned!

[^1]: I agree, "differand" is a horrible word. But it's probably better than "differentiand"...

[^2]: Technically we should be careful here: applying the Sum Rule, Power Rule, Chain Rule or Inverse Rule could result in the differand staying the same size, but in that case there is always another applicable rule which will reduce the size of the differand.

[^3]: [https://arxiv.org/abs/1912.01412](https://arxiv.org/abs/1912.01412)