Tuesday, June 20, 2017

Do NOT Trust OnePlus 5 Benchmarks in Reviews – How our Review Unit is Grossly Cheating at Benchmarks

Earlier this year, we published a report that denounced OnePlus (and other companies) for their improper behavior with regard to benchmark manipulation on newer builds of OxygenOS. Today, we sadly must follow up on our accusations, as the company has once more been inappropriately manipulating benchmark scores on the OnePlus 5.

While no customers have a device in their hands (it just launched, after all), we learned about OnePlus' new benchmark cheating mechanism through our review unit, which we received about ten days before the embargo lifted and reviewers were allowed to report on the device. Unfortunately, it is almost certain that every single review of the OnePlus 5 that contains a benchmark is using misleading results, as OnePlus provided reviewers with a device that cheats on benchmarks. This is an inexcusable move, because it is ultimately an attempt not just to mislead customers, but also to taint the work of reviewers and journalists with misleading data that most are not able to vet or verify. As a result, every OnePlus 5 review citing benchmark scores as an accolade of the phone's success is misleading both writers and readers, and performance analyses based on synthetic benchmarks are invalidated. What is worse is that, this time around, the cheating mechanism is blatant and aimed at maximizing performance, unlike last time, when it did not increase scores by much on average but did reduce variance and thermal throttling, as we found.

Before we jump into the details, I would like to state that we are disappointed in the company for once more resorting to these practices. We also will not provide a full performance analysis involving all of our included tests as many of our preferred benchmarks are affected by the cheating mechanism. Finally, we will be compartmentalizing this report from our overall judgment of the device itself, because we are confident the culprit code will be removed from consumer builds following this report and our conversations with OnePlus representatives. While we do not believe this feature article necessarily should alter your perception of the hardware itself, it is right for it to nudge your opinion of the company given it is their second transgression.


A Quick Word on Methodology

All scores in this article were obtained on a OnePlus 5 review unit running OxygenOS version 4.5.0 (A5000_22_170603); this is a pre-production unit, and it was originally loaded with pre-production software which received an OTA to the version named above. OnePlus forwarded reviewers instructions to enable the ability to download benchmark applications from the Play Store, presumably so that there would be no benchmark score leaks ahead of time. This did clue me in to the fact that OnePlus was referring to benchmark packages by name in their ROM. As for testing, the ROM had minimal background processes and no third-party applications, and Airplane Mode was enabled where applicable; CPU frequencies were logged only to determine the extent of the cheating, and not during the tests that produced scores for this article. All temperatures were measured using a FLIR C2 Compact, with each endurance run beginning at an outer temperature of 28.5°C | 83.3°F.


Benchmark Manipulation — How it's Done

Last January, our report unearthed a cheating mechanism found in OxygenOS Beta builds and in the shipping software of the OnePlus 3T. We attributed these changes to the recent merger of the then-disparate OxygenOS and HydrogenOS developer teams, and the underlying codebase of OxygenOS which was now to be shared with HydrogenOS, though this speculation is yet to be confirmed. It made sense to us at the time and comments from OnePlus representatives made to XDA-Developers added credibility to our theory. With the OnePlus 5, we see a different kind of cheating mechanism, but we cannot pinpoint whether this was consciously introduced by the same developers who added it the first time around. We only know it targets the same packages.

So how does it work, and what's the difference? Last time around, OnePlus introduced changes to the behavior of their ROM whenever it detected that a benchmark application was opened. These applications were explicitly listed by their package IDs within the ROM in a manifest that specified the targets. Then, the ROM would alter the frequency in relation to an adjusted CPU load: our tools showed CPU load would drop to 0% regardless of obvious activity within the application, and the CPU would see a near-minimum frequency of 1.29GHz in the big cores and 0.98GHz in the little cores. This minimum frequency reduced the effective frequency range, which in turn reduced the number of step frequencies; in benchmarks, this resulted in slightly lower variance and, as we showed, higher sustained performance, as the higher minimum frequency could not be overridden by thermal throttling. In short, cheating behavior was clear and demonstrable both by looking at score variance and by monitoring CPU frequencies throughout the benchmark, which showed a frequency floor that, for the most part, allowed the device to consistently score closer to its full potential.

The OnePlus 5, on the other hand, is an entirely different beast: it resorts to the kind of obvious, calculated cheating mechanisms we saw in flagships in the early days of Android, an approach that is clearly intended to maximize scores in the most misleading fashion. While there are no governor switches when a user enters a benchmark (at least, none that we could observe), the minimum frequency of the little cluster jumps to the maximum frequency, as seen under performance governors. All little cores are affected and kept at 1.9GHz, and it is through this cheat that OnePlus achieves some of the highest GeekBench 4 scores of a Snapdragon 835 to date, and likely the highest attainable given its no-compromise configuration. These scores are certainly higher than those obtained by similar devices and by Qualcomm's own MSM8998 test device, which we were lucky enough to benchmark. Below is a list of benchmark applications affected:

  • AnTuTu (com.antutu.benchmark.full)
  • Androbench (com.andromeda.androbench2)
  • Geekbench 4 (com.primatelabs.geekbench)
  • GFXBench (com.glbenchmark.glbenchmark27)
  • Quadrant (com.aurorasoftworks.quadrant.ui.standard)
  • Nenamark 2 (se.nena.nenamark2)
  • Vellamo (com.quicinc.vellamo)
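To make the targeting concrete, here is a minimal Python sketch of what a package-ID lookup of this kind could look like. It is purely illustrative: the ROM's actual code is not public, the function name and the default floor passed in are hypothetical, and only the package IDs and the 1.9GHz little-cluster maximum come from this article.

```python
# Hypothetical sketch only: this is NOT OnePlus' code, just an illustration
# of matching a foreground app against a fixed list of package IDs.

# Package IDs exactly as listed in this article.
TARGETED_PACKAGES = frozenset({
    "com.antutu.benchmark.full",
    "com.andromeda.androbench2",
    "com.primatelabs.geekbench",
    "com.glbenchmark.glbenchmark27",
    "com.aurorasoftworks.quadrant.ui.standard",
    "se.nena.nenamark2",
    "com.quicinc.vellamo",
})

# Little-cluster maximum frequency reported in the article (1.9GHz), in kHz.
LITTLE_MAX_KHZ = 1_900_000


def little_cluster_floor_khz(foreground_package: str, default_floor_khz: int) -> int:
    """Return the minimum little-cluster frequency to apply for the foreground app.

    Targeted packages get the floor raised to the cluster maximum; every
    other package keeps whatever floor the caller passes in as the default.
    """
    if foreground_package in TARGETED_PACKAGES:
        return LITTLE_MAX_KHZ
    return default_floor_khz
```

Because a lookup like this is an exact match on the package ID, the same benchmark shipped under a different identifier would fall through to the default floor. That is consistent with a renamed build evading the boost.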

Unsurprisingly, the applications affected are the exact same ones as last time around; OnePlus is clearly targeting the very same packages. The difference in scores is just what you would expect, for the most part. We were able to evade the cheating mechanism with a spoofed build of GeekBench 4, similarly to the testing in our last report. We found that while running GeekBench 4 from the Play Store, the device scored over 6,700 in multi-core, whereas we never obtained a score of 6,500 once the device behaved as expected with our hidden build of GeekBench. Below you can see a frequency over time plot for the OnePlus 5's little cluster when running GeekBench 4 from the Play Store, and the same configuration running a build of GeekBench 4 stripped of identifiers that is able to fool OnePlus' cheating mechanism.

Benchmark Manipulation test: OnePlus 5 Geekbench 4 CPU Frequency without benchmark cheating

Benchmark Manipulation test: OnePlus 5 Geekbench 4 CPU Frequency with benchmark cheating

In case it isn't evident from the graphs above: we polled the CPU frequency every 100ms, and in total, only 24.4% of readings returned the maximum frequency of 1.9GHz with cheating disabled. Meanwhile, in the run with cheating enabled, a staggering 95% of readings were at the maximum frequency. It is absolutely evident that OnePlus is keeping the CPU frequencies of these cores artificially high during the benchmark, which results in significantly higher overall scores in the multi-core test and also shows up in various CPU-bound subscores in the detailed breakdown of every test (particularly in integer and float operations). The difference is clearest and most advantageous in multi-core scores, however; single-core results are actually surprisingly similar between the runs with and without benchmark cheating, with the single-core score actually being higher on average without manipulation.
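For readers who want to reproduce this kind of measurement, the following Python sketch shows one way to poll a core's frequency and compute the share of readings at the maximum. It is not the tooling we used: it assumes the standard Linux cpufreq sysfs node is readable on the device and that cpu0 belongs to the little cluster, and all function names are our own.

```python
import time
from pathlib import Path
from typing import Callable, Iterable, List

# Standard Linux cpufreq sysfs node (values are in kHz).
# Assumption: cpu0 is one of the little cores on the device under test.
CUR_FREQ = Path("/sys/devices/system/cpu/cpu0/cpufreq/scaling_cur_freq")


def read_cur_freq_khz(path: Path = CUR_FREQ) -> int:
    """Read the current frequency of one core, in kHz."""
    return int(path.read_text().strip())


def poll(reader: Callable[[], int], samples: int, interval_s: float = 0.1) -> List[int]:
    """Collect `samples` readings, sleeping `interval_s` seconds after each one."""
    readings = []
    for _ in range(samples):
        readings.append(reader())
        time.sleep(interval_s)
    return readings


def share_at_max(readings: Iterable[int], max_khz: int) -> float:
    """Return the fraction of readings at (or above) the maximum frequency."""
    values = list(readings)
    if not values:
        raise ValueError("no readings to analyze")
    return sum(1 for v in values if v >= max_khz) / len(values)
```

On a device, something like `share_at_max(poll(read_cur_freq_khz, 600), 1_900_000)` would cover roughly one minute at the 100ms interval. The reader is passed in as a callable so the analysis can also be run against logged values.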

Benchmark Manipulation test: Comparison between OnePlus 5 Geekbench 4 scores with and without benchmark cheating

Still, multi-core is the figure that most people consider and immediately notice when it comes to this specific benchmark, given that Android is a highly parallel operating system that is now full of multi-threaded applications after years of support for multiple cores. Even if the increase is only meaningful in multi-threaded benchmarks and tests, it still results in a considerable, unfair and unrepresentative advantage over other devices that let their standard governor and performance settings operate under the benchmark. These altered results are not representative of the real-world performance of the OnePlus 5 in any way, as they reflect a peak and otherwise-unattainable performance of the device under artificial conditions and without constraints.

Benchmark Manipulation test: Comparison between OnePlus 5 Geekbench 4 scores with and without benchmark cheating

 

The multi-core score delta between GeekBench 4 runs with and without the cheating mechanism can be up to 6.5%, though on average it is around 5%. It might look insignificant, but that nudge is enough to propel the device ahead of other Snapdragon 835 devices. Above you can see a dot plot of multiple independent runs of GeekBench 4 with and without the cheating mechanism. The chasm is evident, and as one can infer from the boxplot, it cannot be a result of inherent variance. In short, holding the CPU frequencies artificially high does indeed produce measurably better results in synthetic benchmarks.
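The arithmetic behind such a comparison is simple, and a short Python sketch may help. The helper names are our own, and the only figures used are the two multi-core thresholds quoted earlier (over 6,700 with the mechanism active, never 6,500 without it); the individual run scores are not reproduced here.

```python
from typing import Sequence


def percent_delta(boosted: float, baseline: float) -> float:
    """Relative gain of `boosted` over `baseline`, in percent."""
    return (boosted - baseline) / baseline * 100.0


def ranges_overlap(a: Sequence[float], b: Sequence[float]) -> bool:
    """True if the [min, max] ranges of two sets of scores overlap.

    Non-overlapping ranges are a quick (not rigorous) sign that the gap
    between two sets of runs is larger than run-to-run variance.
    """
    return min(a) <= max(b) and min(b) <= max(a)


# Using only the thresholds quoted in the article: any boosted run above
# 6,700 against any clean run below 6,500 differs by more than this value.
lower_bound = percent_delta(6700, 6500)
```

Here `lower_bound` works out to roughly 3.1%, which is a floor on the gap between any such pair of runs and is consistent with the reported average of around 5%.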

Below you can see a plot of performance over time with their accompanying temperatures, as we wanted to determine whether there is thermal relaxation at play as well, or whether there was a difference in scores during sustained benchmarking.

Benchmark Manipulation test: Comparison between OnePlus 5 Geekbench 4 Single Core scores and thermal throttling with and without benchmark cheating

Benchmark Manipulation test: Comparison between OnePlus 5 Geekbench 4 Multi Core scores and thermal throttling with and without benchmark cheating

 

We set up GeekBench 4 tests with a two-second break between the results screen and the start of the next benchmark run; external device temperature (not battery temperature as reported by Android) was measured using a FLIR thermal camera after a second of calibration, averaging the three immediate measurements in the two seconds between runs. I was rather surprised to see that, overall, both configurations heated up at around the same rate and neither of them saw a drop in score. All results in each data set are within the expected variance, suggesting there is no thermal throttling at play. Upon closer inspection, this really should not come as a surprise, given that sustained performance is one of the inherent strengths of the Cortex-A73 cores that the Snapdragon 835's Kryo cores are based on. In addition, the affected cores are the power-efficient cores, and GeekBench 4 specifically comes with measures to prevent throttling from altering the scores of the sub-tests near the end of a run, as we learned from our interview with John Poole.
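As a small illustration of the temperature bookkeeping described above, the Python sketch below averages a set of readings and formats the result in the "°C | °F" style used throughout this article. The function names are our own, and the readings in the usage note are simply the starting temperature repeated, not logged data.

```python
from statistics import mean
from typing import Sequence


def celsius_to_fahrenheit(celsius: float) -> float:
    """Convert a temperature from degrees Celsius to degrees Fahrenheit."""
    return celsius * 9.0 / 5.0 + 32.0


def format_reading(readings_c: Sequence[float]) -> str:
    """Average the readings (in Celsius) and format them as used in this article."""
    avg_c = mean(readings_c)
    return f"{avg_c:.1f}°C | {celsius_to_fahrenheit(avg_c):.1f}°F"
```

For example, `format_reading([28.5, 28.5, 28.5])` yields `28.5°C | 83.3°F`, matching the starting temperature quoted in the methodology.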

Interestingly enough, not all popular benchmarks are targeted by OnePlus' cheating mechanism. 3DMark, for example, did not actually see any of these problems when running tests or even opening the application. However, other benchmarks like GFXBench are targeted and we see the same CPU behavior when opening and running them. In fact, during a sustained performance run using GFXBench's Manhattan Battery Test, the OnePlus 5 reached temperatures of over 50°C | 122°F (outer temperature), a very rare occurrence among devices I have tested in the past, all of which experience some degree of thermal throttling that prevents them from getting quite that hot.


Fool me Once, Shame on Me; Fool me Twice, Shame on You

It is a bit upsetting that it has gotten to the point where we have to call out the same company twice for manipulating benchmark scores. The fact that all of this was done on review units as well further exacerbates the issue: this cheating mechanism is aimed at maximizing performance and making the device look better or faster in performance sections of reviews. The targeting and manipulation system was packaged in pre-production units sent to journalists who will base their findings on their device from OnePlus, many of them unable or unwilling to verify the existence of cheating in their review unit. It is by no means their fault, but XDA is on the lookout for benchmark manipulation only because we found it in the past, and we thought it was best to inform our readers and potential phone buyers.

We hope this article might rekindle a broader conversation about benchmarks, their role, and their utility in today's smartphone reviews. Make no mistake, companies like Qualcomm and Samsung do care about benchmarks, and they do consider them a valid, if incomplete, way for customers to judge the performance of their devices, even though they have more sophisticated tools to refer to when developing their processors. Ultimately, benchmarks can be of great importance if one understands what the software is measuring, and to what extent its results can be used to deduce the ranking of a particular processor, a particular configuration of hardware, or in more holistic terms, a specific phone with the changes in behavior its software introduces as well. I think that we have come to a time where it is more important to focus on real-world performance and power efficiency than on raw computing or processing prowess, because it is clear at this point that the bottleneck to real-world performance comes from Android and particular implementations of it by OEMs.

Going back to OnePlus, I really do not know why the company's software team, or which side of the software team specifically, re-introduced benchmark manipulation after being called out. It is worse this time around, with the apparent purpose of inflating scores produced by reviewer handsets. The OnePlus 5 is still an incredibly well-performing device that really doesn't need benchmark cheating to make a statement. Truly, I have been amazed by its fluidity and general responsiveness, and it is clear to me after my time with the company's devices, as well as interviews and conversations with their management, that they know performance is a strong aspect of their phones. It is a calculated move, most likely, as they might have figured out that it was worth annoying a small sector of the primarily-Western side of the enthusiast market in order to perhaps plaster the Internet with the highest benchmark scores they could muster. Whatever the case, I honestly hope the company rights this wrong as, while I have great things to say about their hardware, they have started this release on the wrong foot in my eyes.


Statement from OnePlus

We reached out to OnePlus for a comment on this issue, and here's what they had to say:

People use benchmark apps in order to ascertain the performance of their device, and we want users to see the true performance of the OnePlus 5. Therefore, we have allowed benchmark apps to run in a state similar to daily usage, including the running of resource intensive apps and games. Additionally, when launching apps the OnePlus 5 runs at a similar state in order to increase the speed in which apps open. We are not overclocking the device, rather we are displaying the performance potential of the OnePlus 5.

This statement, which we received this morning, is a bit of a shock to hear, as the benchmark cheating puts the device into a state which is explicitly not how the device will run in day-to-day usage, and it represents performance that you will not see in other apps that aren't specifically targeted by such boosts.

Keep in mind that unlike in competitive overclocking, most phone benchmarks are designed to represent how a phone will operate in everyday usage. The goal is not simply to achieve the highest score possible, but rather to represent how the phone actually performs in day-to-day usage under regular thermal profiles and battery usage. These benchmarks are not designed to measure some "performance potential" that is not achievable in real-world use, and any attempt to target them with "defeat device" style benchmark cheating code is misleading to users. If you lock CPU clock speeds to their maximum value and allow the phone's body temperature to rise to unusable levels when certain apps are opened, then that is not indicative of how the phone will operate when in actual use.

While the thermal profile was relatively normal in the CPU-heavy Geekbench 4, where the fantastic sustained performance of the ARM Cortex-A73 based Kryo 280 cores allows the phone to run at the increased battery usage levels that the benchmark cheating brought without getting too hot, we saw a completely different story with GPU-intensive apps. As mentioned, when testing sustained performance with GFXBench's Manhattan Battery Test, the OnePlus 5 reached temperatures of around 50°C | 122°F (outer temperature), which is scorching hot for a phone and thoroughly uncomfortable to hold. Trying to play video games or use other GPU-intensive apps with a 50°C | 122°F phone would just be a poor user experience.

Even if OnePlus is targeting non-benchmark apps as well with their benchmark cheating code, it would still be a problem, as it would mean that the performance that you see in targeted intensive apps today will be completely different from what you see in current apps that are not on the list, or in future intensive apps once OnePlus stops updating the list. This could be remedied by allowing users to whitelist which applications benefit from hidden boosts, and by transparently displaying which ones benefit by default. We suggested this in our last report, but it hasn't been implemented.

We are disappointed with OnePlus' actions in this matter, and hope that OnePlus will, for the second time, remove the benchmark cheating code from their software. It is misrepresenting their phone to their customers, and is not the type of behavior that we like to see with devices as otherwise awesome as the OnePlus 3T and the OnePlus 5.



from xda-developers http://ift.tt/2rNoIub
