For the second time this year, Samsung has been accused of cheating on benchmarks. Now Anand Lal Shimpi and Brian Klug have discovered “optimizations” in devices from more Android OEMs. I do not find anything surprising here. When you see devices with the same hardware producing statistically different synthetic benchmark results, that should raise some eyebrows. History shows us that for as long as there have been synthetic benchmarks, manufacturers have been optimizing for them. Seriously, this has been going on since the 1980s.
Modern computers, smartphones included, do not run their processors at full speed at all times. Some even shut down processor cores to conserve power. The best made devices deliver only as much power as needed to run an app, and no more, striking a balance between performance and battery life. Without going into details, several manufacturers have been optimizing their Android builds to detect benchmarks. When a benchmark is detected, the processors run at full speed for its duration, rather than the way they would operate under normal conditions. This skews the benchmark result.
One example of this was the Exynos version of the Samsung Galaxy S4. It would only allow games to run the PowerVR SGX 544MP3 graphics chip at a maximum of 480 MHz instead of its full speed of 533 MHz, a decision probably made to balance performance against heat output. However, it would allow a few apps and gaming benchmarks to run at 533 MHz. Some would call cheating too harsh a word. Whatever you call it, it means the benchmark does not represent real-world performance: a 3D game would run at 480 MHz, while a gaming benchmark would run at 533 MHz. I think we should just call a spade a spade. In fairness, it should be noted that most games on the market right now would run at the same speed at 480 MHz and 533 MHz, with smartphone displays or the games themselves capping frame rates.
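The reported mechanism boils down to a simple whitelist: if the running app matches a known benchmark, lift the clock cap. A minimal sketch in Python, where the whitelist entries are purely illustrative (not any actual list an OEM shipped) and the 480/533 MHz caps are the Galaxy S4 figures quoted above:

```python
# Hypothetical sketch of whitelist-style benchmark boosting.
# App names below are illustrative assumptions, not a real OEM list.
BENCHMARK_WHITELIST = {"gfxbench", "antutu", "quadrant"}

def gpu_clock_cap_mhz(app_name: str) -> int:
    """Return the GPU clock cap applied to the given app."""
    if app_name.lower() in BENCHMARK_WHITELIST:
        return 533  # full speed, reserved for recognized benchmarks
    return 480      # conservative cap for everything else, including games
```

The point of the sketch is how little it takes: the device is not faster, it simply treats benchmark processes as a privileged class.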
Optimizing for synthetic benchmarks does not improve the user experience. So why do manufacturers do it? Well, it is simple really. Reviewers use synthetic benchmarks to rank smartphone performance. This is not what benchmarks were intended for, and they should not be used this way. More on that later.
Anand Lal Shimpi’s solution to all this is “to continue to evolve the (benchmark) suite ahead of those optimizing for it.” Unfortunately, and with all due respect, this is the solution of a reviewer drunk on benchmarks. No offense to Mr. Lal Shimpi, who is well regarded in the industry, but he should know better. The real solution is to stop relying exclusively on synthetic benchmarks.
No one has really bothered to benchmark the benchmarks. Does a better GFXBench score equate to faster performance in Modern Combat? Does a better SunSpider score accurately translate to faster webpage loading times?
Apple’s iPhone is a good testbed for this kind of comparison. It has been around for six years, longer than any other current smartphone line. PCMag has compiled Web browser benchmarks of the original iPhone up to the iPhone 5. A comparison of the original iPhone to the iPhone 5S would be more difficult because of changes in the benchmark suite used.
Original iPhone:

- SunSpider (lower is better) – 46579
- GUIMark 3 – 3.35
- Browsermark – 8839

iPhone 5:

- SunSpider (lower is better) – 947
- GUIMark 3 – 58.1
- Browsermark – 189025
The GUIMark 3 scores indicate that the web browser on the iPhone 5 performs 17X faster than on the original iPhone. Browsermark puts the improvement higher, at a factor of 21X. SunSpider indicates that the iPhone 5 browser is 49X faster than the original iPhone’s. Averaging the three gives roughly 29X. The result: a web page that takes three seconds to load on my iPhone 5 today would have taken about one and a half minutes to load on the original iPhone!
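The arithmetic behind those multiples can be laid out in a few lines of Python, using the PCMag scores quoted above. Note the direction flip: SunSpider is a time-based score (lower is better), so the old score is divided by the new one, while the other two are throughput-style scores divided the other way around.

```python
# Speedup ratios implied by the PCMag scores quoted in the text.
original = {"SunSpider": 46579, "GUIMark 3": 3.35, "Browsermark": 8839}
iphone5 = {"SunSpider": 947, "GUIMark 3": 58.1, "Browsermark": 189025}

speedups = {
    # SunSpider measures time, so lower is better: old / new.
    "SunSpider": original["SunSpider"] / iphone5["SunSpider"],
    # GUIMark 3 and Browsermark score throughput: new / old.
    "GUIMark 3": iphone5["GUIMark 3"] / original["GUIMark 3"],
    "Browsermark": iphone5["Browsermark"] / original["Browsermark"],
}
average = sum(speedups.values()) / len(speedups)

for name, s in speedups.items():
    print(f"{name}: {s:.0f}x")
print(f"Average: {average:.0f}x")  # ~29x; 3 s * 29 is roughly 87 s
```

This prints roughly 49x, 17x, and 21x, averaging to about 29x, which is exactly the calculation the paragraph above (mis)uses.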
Now, this is the wrong way to interpret these benchmarks. Even combining three benchmarks gives little indication of real-world performance. Synthetic benchmarks have their use. A benchmark mimics a particular type of workload on a component or system. Synthetic benchmarks do this with specially created apps, while application benchmarks run real-world apps on the system. Application benchmarks are what should be used if you want a much better measure of real-world performance. Synthetic benchmarks are useful for testing individual components and are great for diagnosing and locating system bottlenecks. Combining synthetic and real-world benchmarks would also allow a reviewer to understand better why a device performs a certain way. Presenting tallies of the scores of several devices across several benchmarks really says nothing.
Basically, using a synthetic benchmark is like using a car’s horsepower rating to determine its speed. How fast a car can go depends on multiple factors: weight, aerodynamics, drivetrain and a dozen other variables. The car will generally run only as fast as its slowest component allows. It is the same with electronic devices. In a given task, a device runs at the speed of the slowest relevant component, not the fastest.
Running real-world benchmarks, like measuring how long a smartphone takes to load a game, process a picture, or even load an actual webpage, would be more useful to the consumer. If reviewers want to keep using synthetic benchmarks, then the scores should be presented with an analysis of how those benchmarks relate to real-world performance. This would make benchmark optimization useless, and could also be used to ferret out bad benchmarks. This, I submit, is the best solution to this benchmarking brouhaha.
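As a toy illustration of that kind of real-world measurement, here is a sketch that times an actual page fetch rather than running a synthetic JavaScript workload. The repeat count and the use of a median are my own assumptions for the sketch, not a published methodology:

```python
# Toy "real world" measurement: wall-clock time to fetch a full page.
import time
import urllib.request


def median(samples: list) -> float:
    """Middle value of the samples; resists a one-off network spike."""
    ordered = sorted(samples)
    return ordered[len(ordered) // 2]


def time_page_load(url: str, runs: int = 3) -> float:
    """Return the median wall-clock time (seconds) to fetch a page fully."""
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        with urllib.request.urlopen(url) as resp:
            resp.read()  # read the whole body, as a browser would
        samples.append(time.perf_counter() - start)
    return median(samples)


# Example usage (requires network access):
# print(f"{time_page_load('https://example.com'):.2f} s")
```

A measurement like this is crude, and it bundles network, CPU, and storage together. But that bundling is the point: it is the same bundle the user experiences, which is exactly what a synthetic score strips away.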
If you want to find out how fast a car is, take it to several test tracks, pull out a stopwatch and measure lap times. Trying to figure out a car’s performance by comparing horsepower, 0–60 MPH acceleration tests, drag coefficient, braking and roadholding tests is really not the way to go.