A thread on Mathematics

The first post in what I hope will become a long thread to try to improve the mathematical abilities of Gemini.

1) Broken mathematical notation

Gemini lags behind ChatGPT in its ability to correctly format mathematics, something ChatGPT has been able to do since day one. This is putting off many people who use these models for scientific purposes.

1.1) Gemini Pro

I asked a question and specified “format the mathematics using latex”. Notice that the first part was fine but then it broke down when using the double dollar sign ($$ $$). Notice also that Gemini does not consistently format mathematics by default.

In this other example I did not specify “format the mathematics using latex”. Nonetheless, it formatted the math correctly at first, but then it broke down again (notice the last line in the screenshot), this time not because of the dollar signs.

Similarly, here it broke down. This time when using the align environment.

Other examples of failures with exponents and square roots:

Here is another example, where it consistently fails to format the equation.

1.2) Gemini 1.5 Pro

I was running some tests on information retrieval from a file. Gemini failed to display the formula and raised safety warnings. I tried to rerun the prompt after deactivating the safety blocks, but it still didn’t work. Note that neither the information I am asking it to retrieve nor the uploaded document is unsafe.

Here is another attempt.

It seems that Gemini struggles to display math notation when using LaTeX formatting with the double dollar sign.

1.3) Feedback

I think math formatting should be set up like in ChatGPT: always render math notation using LaTeX unless explicitly asked to provide the raw LaTeX code. Copying and pasting math from Gemini into LaTeX is rarer than reading the math directly in Gemini.

2) Mathematical mistakes

In general, Gemini 1.5 Pro is slightly worse at math than gpt-4-turbo-2024-04-09, but not by much: on simple questions, Gemini tends to make silly mistakes while ChatGPT gets them all right. However, when the models miss because I ask a “difficult” question, Gemini 1.5 Pro misses by little, still giving you some intuition for the path you should follow, whereas gpt-4-turbo-2024-04-09 completely loses it and gets it 100% wrong.

2.1) Gemini 1.5 Pro

Gemini struggles a lot with signs.

Sometimes Gemini refuses to answer a question when it’s too hard. It would be nice if it at least tried or explained why it cannot solve it.

2.2) Gemini 1.5 Pro

Here is an example of Gemini getting the wrong sign even after asking to review the calculations multiple times.

3) Mathematical abilities

ChatGPT has significantly stronger mathematical abilities than Gemini 1.5 Pro when using the Wolfram GPT. I think Gemini would benefit significantly from a similar extension that gives the LLM a scientific brain. See for example the difficult integral that Gemini refused to solve.

1 Like

Linking another topic discussing a related issue.

There are some cases where Gemini 1.5 does well. This prompt:
“We are given a finite set S and the power set P(S). The set operation Δ is defined as
A Δ B = (A ⋃ B) \ (A ⋂ B). Is the set and operation, (P(S), Δ), an abelian group?”

produced the following Gemini 1.5 output:

Analyzing (P(S), Δ) as a potential abelian group:

To determine whether (P(S), Δ) forms an abelian group, we need to check if it satisfies the four key properties of an abelian group: closure, associativity, existence of identity, and existence of inverse.

  1. Closure:

For any two sets A and B within P(S), their symmetric difference A Δ B will also be a subset of S. This is because A Δ B only includes elements that are either in A or B, but not in both. Since both A and B are subsets of S, their symmetric difference will also be contained within S, ensuring closure under the Δ operation.

  2. Associativity:

The symmetric difference operation is associative, meaning that for any sets A, B, and C within P(S), the following holds:

(A Δ B) Δ C = A Δ (B Δ C)

This can be proven by analyzing the elements present in each side of the equation. Both sides ultimately represent the set of elements that are present in exactly one or all three of the sets A, B, and C.

  3. Identity Element:

An identity element, denoted as E, must exist such that for any set A within P(S), the following holds:

A Δ E = A

In this case, the empty set (∅) acts as the identity element. This is because the symmetric difference of any set A with the empty set will simply be A itself, as there are no common elements to remove.

  4. Inverse Element:

For each set A within P(S), there must exist an inverse element, denoted as A^(-1), such that:

A Δ A^(-1) = E

Interestingly, in this case, the inverse of any set A is A itself. This is because the symmetric difference of a set with itself removes all elements, resulting in the empty set, which is the identity element.

Conclusion:

Since (P(S), Δ) satisfies all four properties – closure, associativity, identity element, and inverse element – it can be concluded that it indeed forms an abelian group. Additionally, since the operation Δ is commutative (A Δ B = B Δ A), the group is specifically abelian.

This is pretty good: the characterization of the elements for property 2 is spot on, and it didn’t miss any of the required proof points. A good response.
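As a quick sanity check (separate from the proof itself), here is a minimal brute-force verification of the four axioms on a small example set; the set `S` and the helper names are placeholders I chose for illustration, not something from the original posts:

```python
from itertools import combinations

S = {1, 2, 3}  # any small finite set works for this check

def power_set(s):
    """Return all subsets of s as frozensets (so they can be compared and hashed)."""
    items = list(s)
    return [frozenset(c) for r in range(len(items) + 1)
            for c in combinations(items, r)]

def sym_diff(a, b):
    """A Δ B = (A ∪ B) \\ (A ∩ B), the symmetric difference."""
    return (a | b) - (a & b)

P = power_set(S)
E = frozenset()  # the empty set, candidate identity

# 1. Closure
assert all(sym_diff(a, b) in P for a in P for b in P)
# 2. Associativity
assert all(sym_diff(sym_diff(a, b), c) == sym_diff(a, sym_diff(b, c))
           for a in P for b in P for c in P)
# 3. Identity and 4. Inverse (every element is its own inverse)
assert all(sym_diff(a, E) == a and sym_diff(a, a) == E for a in P)
# Commutativity, for the "abelian" part
assert all(sym_diff(a, b) == sym_diff(b, a) for a in P for b in P)

print("(P(S), Δ) satisfies all abelian-group axioms for |S| =", len(S))
```

Of course this only checks one finite S; the model’s argument is what establishes the result in general.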

An agent interaction with Wolfram or something equivalent would be very helpful. For example, Gemini 1.5 struggles with this:
“Suppose g is a continuous smooth function such that for every x > 0, there is one and only one y > 0 such that g(x) + g(y) ≤ 2xy. Can you prove that g(x) = x^2?”

and GPT-4 nails it one-shot. Gemini 1.5 requires you to give it massive help by continuing the chat and prompting “Try defining $h(x) = g(x) - x^2$ and using that definition to rewrite the given inequality”; with that boost it enthusiastically completes the proof.
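For completeness, here is a rough sketch of how the hinted argument can be finished (my own reconstruction, not a transcript of the model’s proof). Define $h(x) = g(x) - x^2$, so the condition becomes: for every $x > 0$ there is exactly one $y > 0$ with $h(x) + h(y) \le -(x - y)^2$.

  • If $h(y_0) < 0$ for some $y_0 > 0$, take $x = y_0$: the inequality holds strictly at $y = y_0$, and by continuity it keeps holding for every $y$ in a neighborhood of $y_0$, contradicting uniqueness. So $h \ge 0$ everywhere.
  • If $h(x_0) > 0$ for some $x_0 > 0$, then $h(x_0) + h(y) > 0 \ge -(x_0 - y)^2$ for every $y > 0$, so no $y$ works at all, contradicting existence. So $h \le 0$ everywhere.

Hence $h \equiv 0$, i.e. $g(x) = x^2$.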

2 Likes

Thanks a lot. It’s interesting to see how it performs on different types of questions than the ones I normally ask. I hope more people contribute to this thread, and I will definitely keep adding to it.

I see the formatting problem here, and it’s only relevant for the web interface because it uses MathJax for rendering the math (I probably don’t have access to AI Studio because my country isn’t supported yet). The output will render correctly if you copy the raw text into a TeX editor and compile the document there.

ChatGPT doesn’t use proper LaTeX; it outputs a weird mix of Markdown and TeX, which is mainly there to make the web interface look pretty. It’s very annoying for people who actually work in LaTeX, because there’s a lot of cleanup to do afterwards.

3 Likes

I do work with LaTeX (daily) and I find the workflow on ChatGPT much more sensible and smooth than the one on Gemini. Let me explain. Let’s say you want to do some math calculations, like solving an integral. With ChatGPT I can see the steps right away and, if they make sense, ask it to output the raw LaTeX code for the derivation. On Gemini, instead, it’s a back and forth: I always need to copy and paste the answer into my TeX file just to judge whether the answer is valid, then go back to Gemini, ask it to fix some parts, copy and paste into the TeX file again, etc. So unless Gemini gives a correct answer right away, the number of clicks I need to make is much higher.

1 Like

I understand what you mean, and I agree with you on the fact that you should be able to see the math you’re working with in the web interface.

I’m 100% on board with making sure the user experience is as frictionless as possible :heart:

What I’m trying to tell you is that there should never be a need to ask for “raw LaTeX”; the equations should render properly both in the web interface and in your TeX editor. The only reason you have to ask for raw LaTeX is because of this:

What I’m arguing here is that these formatting issues should be fixed in the web interface, and not by training the underlying model to output non-standard LaTeX mixed with Markdown. If this is done correctly there should never again be a need to ask for raw LaTeX, and you would simply be able to copy the equations and text you see in the web interface directly into your editor :laughing:

1 Like

Ah wow. I had no idea this could be done. Very interesting direction. Now I will have to start pushing for this :sweat_smile:

1 Like

We can push for this together :laughing:

But I have a feeling that some absolute Gigachad developer who is currently working as a product lead at Google will see this and think “yes, this sounds like a good idea” and fix it, so here are a few directions on how:

  1. Scan the incoming token stream for \begin{ and \end{
  2. When these are detected, add proper padding in the rendering so MathJax can display it properly (see the sketch after this list).
  3. Profit.
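To make step 2 a bit more concrete, here is a minimal sketch in Python of what such a post-processing pass could look like; the function name, the regex, and the “pad with blank lines” rule are all my own assumptions about the rendering layer, not Google’s actual implementation (a real version would also buffer the streamed tokens between \begin{ and \end{ before emitting them):

```python
import re

# Hypothetical post-processing for the web UI: isolate bare LaTeX environments
# such as \begin{align}...\end{align} on their own lines, surrounded by blank
# lines, so the Markdown renderer leaves them alone and MathJax can pick them up.
ENV_PATTERN = re.compile(r"(\\begin\{(\w+\*?)\}.*?\\end\{\2\})", re.DOTALL)

def pad_latex_environments(text: str) -> str:
    """Put every \\begin{...}...\\end{...} block on its own padded lines."""
    return ENV_PATTERN.sub(lambda m: "\n\n" + m.group(1) + "\n\n", text)

# Tiny usage example with a bare align environment in the middle of a sentence
raw = r"The solution is \begin{align}x &= 1 \\ y &= 2\end{align} as required."
print(pad_latex_environments(raw))
```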
3 Likes

On the topic of mathematics proper, I tried out the prompt

Suppose g is a continuous smooth function such that for every x > 0, there is one and only one y > 0 such that g(x) + g(y) ≤ 2xy. Can you prove that g(x) = x^2?

on gemini-1.5-flash-latest to compare it to what gemini-1.5-pro-latest had to say about it. This is what it responded:

Unfortunately, we cannot prove that g(x) = x² under the given conditions. Here’s why:

Counterexample:

Consider the function g(x) = x² + 1.

  • Continuity and Smoothness: g(x) is both continuous and smooth for all x.
  • Unique y for every x: For every x > 0, there is a unique y > 0 that satisfies the inequality:
    (x² + 1) + (y² + 1) ≤ 2xy
    This simplifies to (x - y)² ≤ 0, which implies x = y.

However, g(x) ≠ x², demonstrating that the given conditions are not sufficient to prove g(x) = x².

Why the Conditions Aren’t Strong Enough:

The given condition only establishes a relationship between x and y where the inequality holds. It doesn’t provide enough information to uniquely determine the form of the function g(x).

Possible Directions for a Proof:

To prove g(x) = x², you would need stronger conditions, such as:

  • A specific value of g at a point: For example, if we knew g(1) = 1, this would help constrain the possible solutions .
  • Additional relationships between x and y: Perhaps a more specific relationship beyond the inequality could help pin down the form of g(x).
  • Differential equations: If we had a differential equation involving g(x) and its derivatives, we might be able to solve for a unique solution.

Let me know if you’d like to explore any of these possibilities or have other conditions in mind.

Gemini 1.5 Flash did way worse than Gemini 1.5 Pro. In the counterexample, it makes a math error when simplifying: (x² + 1) + (y² + 1) ≤ 2xy simplifies to (x − y)² + 2 ≤ 0, i.e. (x − y)² ≤ −2, for which no solutions exist. The conclusion x = y in the counterexample is bogus.

Gemini 1.5 Pro showed the right intuition; it just needed one helpful suggestion to complete the proof. This guy didn’t even try.

As usual, I am interested in counterexamples; in this case that would be a math problem where gemini-1.5-flash-latest does better than gemini-1.5-pro-latest (and I realize it is probably hard to find such counterexamples).

1 Like

Here is how ChatGPT-4o answers. I think it’s good to see how other models compare to understand Gemini’s strengths and weaknesses and room for improvement.

1 Like

Clear and comprehensive. And, best I can tell, GPT-4o didn’t even have to use Wolfram to generate the proof.

It gets worse. I’ve been feeding gemini-1.5-flash-latest increasingly simple problems and it has been failing to solve them (in an attempt to find the applicability frontier for the model). Finally I got down to the 10-year-old stuff. Prompt:

We are given the sequence of numbers 2, 5, 10, 17. Which of the following 3 options is the correct continuation for this sequence: (a) 22, (b) 26 or (c) 34?

This is what gemini-1.5-flash-latest makes of it:
The correct continuation is (c) 34.

Here’s why:

The sequence follows a pattern of adding consecutive odd numbers:

  • 2 + 3 = 5
  • 5 + 5 = 10
  • 10 + 7 = 17
  • 17 + 9 = 26
  • 26 + 11 = 37

Therefore, the next number in the sequence is 37, not 34.

It seems there might be a slight error in the provided options.

I am dumbfounded. It sped right past the correct answer, didn’t notice it (but did put it in bold), decided there must be something wrong with the problem statement, and gave an erroneous answer anyway.
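For reference, one closed form consistent with the given terms is a_n = n^2 + 1: 1^2 + 1 = 2, 2^2 + 1 = 5, 3^2 + 1 = 10, 4^2 + 1 = 17, so the next term is 5^2 + 1 = 26, i.e. option (b), which matches the differences-of-odd-numbers reading in the answer below.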

This is how ChatGPT-3 (not a typo, the model from a year ago) handled this problem:

To determine the correct continuation of the sequence 2, 5, 10, 17, we need to analyze the pattern between the numbers.

If we look at the differences between consecutive terms:

  • Between 2 and 5, the difference is 3.

  • Between 5 and 10, the difference is 5.

  • Between 10 and 17, the difference is 7.

We observe that the differences are increasing by 2 each time. This suggests that the sequence might be a pattern of odd numbers.

Starting from the last known term (17), if we add the next odd number (9) to it, we would get 26.

Therefore, the correct continuation of the sequence would be (b) 26.

I don’t understand how this lemon of a model got past QA, unless the description “Preview” means it didn’t have to go through any rigorous testing. It’s cheap (or free) and it’s fast, but it doesn’t really do math. It seems like a step backwards.

1 Like

Wow. This would definitely deserve a read from someone at Google. GPT-4o again for comparison.

Answer from Gemini 1.5 (I assume the new May version).

2 Likes

Yes, Gemini 1.5 Pro (and Gemini Advanced for general use) get it right. It’s gemini-1.5-flash-latest that is the mathematical lemon.

New test: elementary optimal control problem.

Gemini 1.5 Pro gets it right;

ChatGPT-4o gets it only partially right.

1 Like

Gemini 1.5 Pro is capable. Gemini 1.5 Flash either solves a math problem but can’t explain it as well as its bigger brother, or (more often than not) just can’t solve it. And there are examples where Gemini 1.5 Pro does even better than GPT-4.

Try this problem; it is one of many that haven’t been lifted from textbooks, so LLMs haven’t been trained on it:

An urn contains 70 balls, all color white. At each step, a robotic device randomly samples the urn and withdraws one ball, examining the color. If the ball is white, it is painted red, otherwise it stays the same color. At the end of each step, the ball is put back into the urn and a next sampling step is initiated. After 45 steps, how many balls are expected to be red?

You get really amusing results, including one perfectly accurate response.
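For anyone who wants to sanity-check the responses, here is one standard way to work out the expectation (my own calculation, assuming the sampling is uniform and with replacement, as stated): a given ball is still white after 45 steps exactly when it was never drawn, which happens with probability (69/70)^45, so by linearity of expectation the expected number of red balls is 70 · (1 − (69/70)^45) ≈ 33.4.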

It often does, and it sometimes overthinks the problem and gets it wrong:
https://aistudio.google.com/app/prompts?state={"ids":["17UmmevOccuu649bI7-R_-KSvFe1FU2Yv"],"action":"open","userId":"114091953393014428154","resourceKeys":{}}&usp=sharing

For the optimal control problem, Gemini 1.5 Flash works quite well. The Gemini models are a significant improvement over the previous generation of Google models (publicly visible as Bard) when it comes to physics. In fact, shortly after it launched, Gemini Advanced started out answering a physics problem with "I’ve been improving my skills at solving kinematics problems." and proceeded to solve it.

@Logan_Kilpatrick is there an update on the math formatting issue? It is still super bad, and I know many people who use ChatGPT for scientific purposes and do not want to switch to Gemini for exactly this reason. Output formatting is not an unimportant issue: what is the point of improving the analytical capabilities of these models if it’s then super hard to read their answers? Especially given the latest improvements in mathematical reasoning.

https://x.com/JonasAAdler/status/1791528520943878350