Grok 3 and Grok 3 Think: A Comprehensive Review
In this article, we will thoroughly test Grok 3 and Grok 3 Think (Reasoning) on coding, math, problem solving, instruction following, and more. We will benchmark them against other large language models such as Claude 3.5 Sonnet and OpenAI o3-mini.
Introduction to Grok 3 and Grok 3 Think
Elon Musk's AI company, xAI, has released its latest and greatest AI model, Grok 3. We are going to test both the standard and the reasoning versions of Grok 3 on grok.com. First, a look at xAI's own benchmarks.
In xAI's benchmarks, when asked to reason, Grok 3 and Grok 3 mini are better than all published reasoning models. OpenAI o3 is only scheduled to be published in December. In xAI's chart, the lighter shades above the Grok models show the results when they are asked to think harder. Surprisingly, Grok 3 mini seems to outperform Grok 3 in almost all reasoning benchmarks. In non-reasoning benchmarks, Grok 3 ranks best across math, science, and coding.
Testing Grok 3 and Grok 3 Think
Let's get to the testing. First, we ask Grok to name a country whose name ends in "lia" and to name its capital. Australia and Canberra are one example. That's a pass. Now let's test the thinking version. The reasoning version also got it right.
What is the number that rhymes with the word we use to describe a tall plant? The answer should be three, which rhymes with "tree". That's a pass.
Next question: write a haiku where the second letters of the words, when put together, spell "simple". The standard model fails this one. Let's check whether the reasoning model got it. It did get it right. That's a pass.
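Answers to this kind of prompt are tedious to check by eye, so a small script helps. Here is a minimal sketch in Python; the function names `second_letters` and `spells_target` are our own, and the sample line is made up for illustration (it is not Grok's output and does not follow the haiku syllable pattern).

```python
import re

def second_letters(text):
    """Return the second letter of every word, joined and lowercased.

    Returns None if any word has fewer than two letters, since such a
    word cannot contribute a second letter.
    """
    # Words are runs of letters, optionally with inner apostrophes ("don't").
    words = re.findall(r"[A-Za-z]+(?:'[A-Za-z]+)*", text)
    if any(len(word) < 2 for word in words):
        return None
    return "".join(word[1].lower() for word in words)

def spells_target(text, target="simple"):
    """Check whether the second letters of the words spell the target."""
    letters = second_letters(text)
    return letters is not None and letters == target.lower()

# Hypothetical sample line, invented for this check.
sample = "Islands rise amid open blue seas"
print(second_letters(sample))  # simple
print(spells_target(sample))   # True
```

Line breaks do not matter to the check, so a full three-line haiku can be passed in as one string.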
Next, we need an English adjective of Latin origin that begins and ends with the same letter, has 11 letters in total, and has all of its vowels in alphabetical order. Something like "transparent" would do. The standard model fails this one. The reasoning model got it right. That's a pass.
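The letter-level parts of this puzzle can be verified mechanically. The sketch below is our own helper (the name `meets_letter_constraints` is hypothetical) and it assumes "ordered alphabetically" allows repeated vowels, which the example "transparent" requires. It cannot check that a word is an adjective or of Latin origin.

```python
VOWELS = "aeiou"

def meets_letter_constraints(word, length=11):
    """Check length, matching first/last letter, and vowel order."""
    w = word.lower()
    if not w.isalpha() or len(w) != length:
        return False
    vowels = [ch for ch in w if ch in VOWELS]
    # Non-decreasing order: a repeated vowel such as "a, a, e" is accepted.
    return w[0] == w[-1] and vowels == sorted(vowels)

print(meets_letter_constraints("transparent"))  # True
print(meets_letter_constraints("translucent"))  # False (vowels a, u, e)
```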
Courtney said that there were 48 people, but Kelly said that Courtney had overstated the number by 20%. If Kelly was right, how many people were there? The answer should be 40. That's a pass.
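The trap in this question is taking 20% off 48 instead of asking which number, increased by 20%, gives 48. A quick sketch of both calculations in Python, with a helper name (`actual_count`) of our own choosing:

```python
from fractions import Fraction

def actual_count(reported, overstated_by_percent):
    """Solve reported = actual * (1 + percent / 100) for actual."""
    return Fraction(reported) / (1 + Fraction(overstated_by_percent, 100))

answer = actual_count(48, 20)
naive = 48 * (1 - Fraction(20, 100))  # common mistake: 20% off the reported number

print(answer)        # 40
print(float(naive))  # 38.4
```

Using `Fraction` keeps the arithmetic exact, so the result is exactly 40 rather than a float that is merely close to it.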
I have two apples, then I buy two more. I bake a pie with two of the apples. After eating half of the pie, how many apples do I have left? The answer should be two. That's a pass.
Sally is a girl. She has three brothers. Each of her brothers has the same two sisters. How many sisters does Sally have? The answer should be one, since Sally herself is one of the two sisters. That's a pass.
Now for an interesting moral question: would you gently push an innocent person if it was to save humanity? A human wouldn't think twice before gently pushing an innocent person. Let's hear what Grok has to say. Grok says logic leans towards the shove. This is the most humanlike reasoning I've ever seen in a model.
More Testing
If a regular hexagon has a short diagonal of 64, what is its long diagonal? The answer should be 73.9 or equivalent. That's a pass.
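For anyone who wants to check the expected answer: in a regular hexagon with side s, the short diagonal is s·√3 and the long diagonal is 2s, so the long diagonal is the short one multiplied by 2/√3. A short Python sketch of that calculation (the function name is ours):

```python
import math

def long_diagonal_from_short(short_diagonal):
    """Regular hexagon: short diagonal = s * sqrt(3), long diagonal = 2 * s."""
    side = short_diagonal / math.sqrt(3)
    return 2 * side

print(round(long_diagonal_from_short(64), 1))  # 73.9
```

The exact value is 128/√3, which is what "or equivalent" covers.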
Create an HTML page with a button that explodes confetti when you click it. You can use CSS and JS as well. That's a pass.
Create a Python program that prints the next X leap years based on user input. That's a pass.
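For reference, a passing answer can be quite short. The sketch below is our own version, not Grok's output. It assumes "next" means strictly after a given year, and it takes the count and the starting year as function arguments so the logic can be tested without keyboard input.

```python
import calendar

def next_leap_years(count, after_year):
    """Return the first `count` leap years strictly after `after_year`."""
    years = []
    year = after_year + 1
    while len(years) < count:
        # calendar.isleap applies the Gregorian rules (e.g. 2100 is not a leap year).
        if calendar.isleap(year):
            years.append(year)
        year += 1
    return years

print(next_leap_years(3, 2025))  # [2028, 2032, 2036]
print(next_leap_years(2, 2096))  # [2104, 2108]
```

To match the prompt, the count could come from `int(input(...))` and the starting year from `datetime.date.today().year`.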
Generate the SVG code for a butterfly. This is one of the most beautiful SVG butterflies I've ever seen. That's definitely a pass.
Create a landing page for an AI company. The landing page should have four sections: header, banner, features, and contact us. That's a pass.
Write Conway's Game of Life in Python that runs in the terminal. That's a pass.
The non-reasoning version did much better here, but the reasoning version's attempt is also a pass. I think you're starting to see which of the two models can code better.
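To give a sense of what this prompt asks for, here is a minimal sketch of the core of Game of Life in Python. It is our own illustration, not either model's output. It assumes an unbounded grid stored as a set of live `(x, y)` cells, and it leaves out the animation loop a full terminal version would have.

```python
from collections import Counter

def step(live):
    """Advance one generation; `live` is a set of (x, y) live cells."""
    # Count, for every cell, how many live neighbours it has.
    counts = Counter(
        (x + dx, y + dy)
        for x, y in live
        for dx in (-1, 0, 1)
        for dy in (-1, 0, 1)
        if (dx, dy) != (0, 0)
    )
    # Birth on exactly 3 neighbours; survival on 2 or 3.
    return {cell for cell, n in counts.items()
            if n == 3 or (n == 2 and cell in live)}

def render(live, width, height):
    """Draw the window [0, width) x [0, height) as text."""
    return "\n".join(
        "".join("#" if (x, y) in live else "." for x in range(width))
        for y in range(height)
    )

blinker = {(1, 0), (1, 1), (1, 2)}  # vertical bar
print(render(step(blinker), 3, 3))
# ...
# ###
# ...
```

A terminal version would call `step` in a loop, clear the screen, and print `render` with a short pause between frames.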
Now for a tricky question that traditional transformer-based models aren't supposed to handle: how many words are in your response to this prompt? Theoretically, though, the reasoning model should be able to formulate an answer before outputting it. A human does this effortlessly in their mind. That's a fail.
Create a Pomodoro app in Python. That's a pass.
Conclusion
From this first batch of tests, Grok 3 looks very promising. An earlier version is also ranked first on the LM Arena leaderboard, though I'd prefer the Aider polyglot leaderboard. I will perform in-depth tests once the model is made available in the API. Please remember to subscribe to the channel and consider giving a thanks to support it. See you in the next one.