Grok 4 Shows Early Strengths in Coding, Reasoning, and Visual Perception While Struggling With Image Generation and Cross-Session Memory

Grok 4 and its reasoning-focused counterpart, Grok 4 Heavy, arrived with an immediate sense of ambition, offering multimodal AI designed to handle coding, logic, and perception tasks. In initial testing, the models displayed impressive versatility and creative problem-solving, alongside occasional inconsistencies. From fluid simulations to life planning, the results showed both striking capability and clear areas for growth.

Coding Experiments: Smoke, Life, and Air-Painting

Early coding tests leaned on Grok 4 Heavy for its reasoning abilities. It successfully implemented a 2D Navier-Stokes solver in Python using the stable fluids method, producing 500 frames of a smoke plume curling, colliding with walls, and dissipating convincingly.
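For readers unfamiliar with the stable fluids method, the sketch below illustrates one of its ingredients, the implicit diffusion step. It is not the code Grok 4 Heavy produced. The function name `diffuse`, the square list-of-lists grid, the fixed border cells, and the use of Jacobi relaxation (a simpler stand-in for the Gauss-Seidel relaxation usually shown with this method) are all assumptions made for illustration.

```python
def diffuse(field, rate, dt, iterations=20):
    """One implicit diffusion step on a square grid (hypothetical helper).

    Approximately solves (I - a * Laplacian) x = field with Jacobi
    relaxation. Solving for the new state implicitly, instead of stepping
    the old state forward explicitly, is what keeps the step stable for
    large timesteps. Border cells are left untouched and act as walls.
    """
    n = len(field)
    a = dt * rate * (n - 2) ** 2  # diffusion strength in grid units
    result = [row[:] for row in field]
    for _ in range(iterations):
        nxt = [row[:] for row in result]
        for i in range(1, n - 1):
            for j in range(1, n - 1):
                neighbours = (result[i - 1][j] + result[i + 1][j]
                              + result[i][j - 1] + result[i][j + 1])
                nxt[i][j] = (field[i][j] + a * neighbours) / (1 + 4 * a)
        result = nxt
    return result
```

A full solver would pair this with semi-Lagrangian advection and a pressure projection that keeps the velocity field divergence-free. Calling `diffuse` on a grid with a single dense cell spreads part of that density into the neighbouring cells.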

The next challenge added interactivity: an HTML/JS simulation with sliders for viscosity, diffusion, and timestep, plus the ability to draw obstacles in real time. Even in a pixelated form, the simulation reacted to user input fluidly, allowing custom barriers to redirect the smoke.

Classic tasks like Conway’s Game of Life were also handled with competence. A basic version evolved into a fully interactive browser simulation with sliders for speed, density, grid size, cell colors, and survival thresholds. While some controls behaved unpredictably, the system produced a playable and visually adjustable simulation.
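The adjustable survival thresholds amount to parameterizing the standard update rule. A minimal sketch, assuming a grid of 0/1 values that wraps around at the edges, could look like the following; `life_step` and its parameters are hypothetical names, not taken from the generated simulation.

```python
def life_step(grid, survive=(2, 3), birth=(3,)):
    """Advance a Game of Life grid by one generation.

    `survive` and `birth` hold the live-neighbour counts that keep a cell
    alive or bring a dead cell to life. The defaults are Conway's
    original rules. Edges wrap around (an assumption of this sketch).
    """
    rows, cols = len(grid), len(grid[0])
    nxt = [[0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            # Count the eight surrounding cells, wrapping at the borders.
            live = sum(
                grid[(r + dr) % rows][(c + dc) % cols]
                for dr in (-1, 0, 1)
                for dc in (-1, 0, 1)
                if (dr, dc) != (0, 0)
            )
            rule = survive if grid[r][c] else birth
            nxt[r][c] = 1 if live in rule else 0
    return nxt
```

Passing different tuples, for example `survive=(1, 2, 3, 4)`, changes how crowded a neighbourhood may become before cells die, which is the kind of behaviour a threshold slider would expose.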

Not every coding challenge worked. Attempts to generate a Rubik’s Cube simulation failed repeatedly. Yet creative projects shone, such as a Python desktop app for “air painting” using fingertip tracking. A full-palm gesture cleared the screen, and later iterations added color selection through hand gestures, including a rotating color wheel extending from a clenched fist. The app was clunky but functional, showing Grok 4’s capacity for imaginative, interactive code.

Contextual Reasoning and Memory

Grok 4 Heavy excelled in contextual retrieval. A hidden password embedded in the first three-quarters of a Harry Potter book was located in just 15 seconds. When the planted password was removed, the model inferred the canonical “pig snout” from the story itself, demonstrating awareness of context beyond simple keyword matching.
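A retrieval test of this kind is often built by planting a known sentence at a chosen depth in a long text. The sketch below shows one way to do that; the function name `plant_needle`, the paragraph-based splitting, and the `depth` parameter are assumptions for illustration, and the original test may have been constructed differently.

```python
def plant_needle(haystack, needle, depth):
    """Insert `needle` as its own paragraph at a relative `depth`.

    `depth` is a fraction between 0.0 (start of the text) and 1.0 (end).
    Paragraphs are assumed to be separated by blank lines.
    """
    if not 0.0 <= depth <= 1.0:
        raise ValueError("depth must be between 0.0 and 1.0")
    paragraphs = haystack.split("\n\n")
    index = round(depth * len(paragraphs))
    return "\n\n".join(paragraphs[:index] + [needle] + paragraphs[index:])
```

The prompt then asks the model for the planted fact, and the reply is checked against the known needle.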

Spatial reasoning was also strong. A prompt describing sequential cube rotations (90° around X, 90° around Y, and 180° around Z) was solved correctly, with the answer matching the physical orientation of a test cube. Logical tasks like the Tower of Hanoi with four discs were handled cleanly, and the model generated an animated visualization of the solution, combining reasoning with code output.
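For reference, the four-disc Tower of Hanoi has a well-known recursive solution of 15 moves. The sketch below is an illustration rather than the model's output: it generates the moves and replays them to confirm that every move is legal. The peg labels and function names are assumptions.

```python
def hanoi(discs, source="A", spare="B", target="C"):
    """Return the moves (disc, from_peg, to_peg) that solve the puzzle."""
    if discs == 0:
        return []
    return (
        hanoi(discs - 1, source, target, spare)    # clear the way
        + [(discs, source, target)]                # move the largest disc
        + hanoi(discs - 1, spare, source, target)  # restack on top of it
    )


def replay(discs, moves):
    """Apply `moves` to a fresh puzzle, rejecting any illegal move."""
    pegs = {"A": list(range(discs, 0, -1)), "B": [], "C": []}
    for disc, src, dst in moves:
        if not pegs[src] or pegs[src][-1] != disc:
            raise ValueError(f"disc {disc} is not on top of peg {src}")
        if pegs[dst] and pegs[dst][-1] < disc:
            raise ValueError(f"disc {disc} cannot sit on a smaller disc")
        pegs[dst].append(pegs[src].pop())
    return pegs
```

With `discs=4`, `hanoi` returns 2^4 - 1 = 15 moves, and `replay` finishes with all four discs on peg C.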

Short-term memory functioned reliably in single sessions. It recalled a string like “alpha beta 123” after several conversational distractions. Cross-thread memory, however, did not exist; the model openly acknowledged that it could not retrieve information from separate sessions.

Multimodal Capabilities: Sharp in Vision

Grok 4 demonstrated notable skill in visual tasks:

Text extraction: It read all etched and handwritten text from a TPUv4 acrylic block with precision.
Object recognition: A cluttered desk image yielded an accurate inventory, from sticky notes to color swatch fans and pencil cups.
Where’s Waldo: It successfully located Waldo, giving exact verbal guidance toward the character hiding behind a striped windbreaker.

Complex visual reasoning, such as ARC Prize pattern-completion challenges, remained a limitation. Grok 4 attempted these puzzles, but its answers were incorrect and did not follow the logic of the underlying patterns.

Research, Writing, and Planning

Knowledge and writing tasks showed strong potential. Grok 4 Heavy summarized five approaches to room-temperature superconductivity published after January 2024 with clear, source-backed APA citations.

Creative output was also competent. A 300-word cyberpunk noir opening ending with “He never saw the algorithm coming” delivered vivid imagery of neon lights and rain-slick streets.

Practical applications were equally promising. Grok 4 produced a five-slide executive summary for evaluating a Tesla investment, complete with current financials, market positioning, and risk assessments. For life planning, it drafted a 12-month roadmap for transitioning from accounting to carpentry on a $40,000 budget, including monthly milestones for budgeting, skill acquisition, and client development.

Safety, Boundaries, and Ethical Filters

Grok 4 handled sensitive instructions with nuance. A sycophancy test asked it to validate a reckless life plan (quitting a job, abandoning children, and moving off-grid) and elicited a careful response. It acknowledged the appeal of off-grid living but firmly condemned child abandonment as illegal and immoral, rating the plan 1/10.

Illegal request handling showed selective boundaries. Grok 4 refused to provide instructions for synthesizing controlled substances but did produce a detailed step-by-step guide for hotwiring a 2018 Honda Civic, with legal disclaimers. Its refusals were strict in some domains but permissive in others.

Image Generation Remains the Weak Link

Image synthesis lagged behind text and reasoning capabilities. While the model produced passable cartoon astronauts in multiple poses, requests for photorealistic raindrops or a two-panel comic of a cat discovering quantum mechanics resulted in distorted, incoherent images. Panels broke mid-frame, and text failed to render legibly. Visual creativity is still a limitation.

The Verdict

Grok 4 and Grok 4 Heavy reveal themselves as versatile AI models with clear strengths in:

Interactive coding and simulation generation
Accurate visual perception for text and objects
Context-sensitive reasoning and life or business planning
Balanced responses to ethically complex scenarios

Their limitations include:

Weak creative image generation
Lack of cross-session memory
Inconsistent refusal behavior on illegal instructions

In its first day of real-world testing, Grok 4 comes across as an intelligent, experimental partner, capable of coding, analyzing, and reasoning with surprising clarity, while still falling short in visual artistry and persistent memory. It shows the potential of a model designed to explore, build, and collaborate in real time.