You bring up a lot of points, and instead of going point-by-point, let me try to zoom out a little.
I have never heard that out-of-order designs are more efficient. Can you link to any evidence? I mean actual measurements, not a back-of-the-envelope calculation (which would be more complicated anyway, because we would have to account for fast vs. slow transistor types among many other implementation factors).
ARM seems to agree (we can quibble about how ARM measures performance and which obvious and less obvious factors it takes into account). For example, Peter Greenhalgh (chief architect of the Cortex-R4, A8, A5, and A53, as well as big.LITTLE) answered some comments on Anandtech (including one of mine):
Originally Posted by wrkingclasshero
What is ARM's most power efficient processing core? I don't mean using the least power, I mean work per watt. [...]
Originally Posted by Peter Greenhalgh
In the traditional applications class, Cortex-A5, Cortex-A7 and Cortex-A53 have very similar energy efficiency. Once a micro-architecture moves to Out-of-Order and increases the ILP/MLP speculation window and frequency there is a trade-off of power against performance which reduces energy efficiency. There’s no real way around this as higher performance requires more speculative transistors. This is why we believe in big.LITTLE as we have simple (relatively) in-order processors that minimise wasted energy through speculation and higher-performance out-of-order cores which push single-thread performance.
Across the entire portfolio of ARM processors a good case could be made for Cortex-M0+ being the more energy efficient processor depending on the workload and the power in the system around the Cortex-M0+ processor.
There are more details in point 5 of this Q&A on ARM's website. Of course, you could now say that this guy didn't just drink the Kool-Aid, he thought of the recipe before he drank it!
But I think it explicitly supports my claim.
Originally Posted by P
It isn't a contest because they're running on different tracks. More efficient would mean that a dotted line (in-order core) would be higher up than a solid line (OoOE cores) at the same X coordinate, and doesn't happen very often, because they don't exist at the same performance level.
No, the curve of the LITTLE core need not be higher here. The graph I finally found makes this more explicit:
It shows that near the crossover point the efficiency of the big and the LITTLE core is comparable, although I think I have seen actual measurements where the big core even had a very slight edge. The other point is that the big core's line does not extend down to the very low frequencies, nor does the LITTLE core reach the very high performance region. Put succinctly: because each core covers a narrower optimization window, i.e. you can optimize the big core for high frequencies (in terms of pipeline length, transistor types, etc.) and the LITTLE core for energy efficiency, you get a SoC that is more energy efficient than if you used a single core to span the whole gamut.
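Just to make the crossover argument concrete, here is a toy model with invented numbers (the IPC, capacitance, and V/f figures are made up for illustration, not taken from any real core): power scales roughly as C·f·V², performance as IPC·f, and for each performance target you pick whichever core delivers it at lower power. The LITTLE core wins at low targets, the big core at high ones, and only the big core reaches the top of the range.

```python
# Toy DVFS model; all numbers are invented for illustration.
# power ~ capacitance * f * V^2, with V rising roughly linearly with f,
# and performance ~ IPC * f.

def volts(f_ghz, v0=0.6, slope=0.25):
    # assumed linear voltage/frequency curve
    return v0 + slope * f_ghz

def power_w(cap, f_ghz):
    v = volts(f_ghz)
    return cap * f_ghz * v * v

# hypothetical cores: IPC, switched capacitance, max frequency (GHz)
LITTLE = dict(ipc=1.0, cap=0.3, fmax=1.8)
big    = dict(ipc=2.2, cap=1.0, fmax=2.8)

def freq_for(core, target_perf):
    # frequency needed to hit the target, or None if out of reach
    f = target_perf / core["ipc"]
    return f if f <= core["fmax"] else None

def best_core(target_perf):
    # pick the core that hits the target at the lowest power
    candidates = []
    for name, core in (("LITTLE", LITTLE), ("big", big)):
        f = freq_for(core, target_perf)
        if f is not None:
            candidates.append((power_w(core["cap"], f), name))
    return min(candidates)[1] if candidates else None

print(best_core(0.8), best_core(5.0))  # low target -> LITTLE, high -> big
```

With these made-up curves the two cores come out nearly equal near the LITTLE core's top frequency, which is the crossover region the graph shows.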
Now you could ask whether a big.BIG SoC (like the 2 x 4 Cortex-A53 SoCs that were around for a while, with 4 energy-efficient A53s and 4 higher-clocked A53s) could be even more efficient. One counterargument is die area: it would be quite wasteful. Strictly speaking, though, that wasn't my claim (even if die area is certainly a consideration in SoC design, even when money is no object).
Originally Posted by P
Not because of their design, because of their clockspeed and voltage target. If you need performance X and you have an in-order and an OoOE core that can both deliver performance X, the most efficient way to do it is to use the OoOE core and clock it down to the same performance. That will use less power than the in-order core, unless you have to clock it so low that you're past the peak efficiency.
Oh, yes, design has something to do with it: you tend to need longer pipelines if you want designs to hit high frequencies, which makes bubbles more costly. This is exactly the problem OoO solves: keeping the pipeline ("more") filled.
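A quick back-of-the-envelope on why bubbles hurt deeper pipelines more (the branch fraction, mispredict rate, and flush depths below are invented for illustration): every mispredicted branch flushes the front of the pipe, so the penalty in cycles scales with pipeline depth.

```python
# Toy CPI model with illustrative numbers: a mispredicted branch
# flushes the pipeline front, so the penalty grows with its depth.
def effective_cpi(base_cpi, branch_frac, mispredict_rate, flush_depth):
    return base_cpi + branch_frac * mispredict_rate * flush_depth

shallow = effective_cpi(1.0, 0.2, 0.05, 8)   # short in-order pipe
deep    = effective_cpi(1.0, 0.2, 0.05, 20)  # long high-frequency pipe
print(round(shallow, 2), round(deep, 2))     # 1.08 1.2
```

The deep pipe pays more per bubble; OoO execution earns its keep by finding independent instructions to issue while the stalled ones wait, at the cost of all the speculative bookkeeping hardware Greenhalgh mentions.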
Originally Posted by P
But how does the scheduler know that? Are the threads marked, somehow? If they are, that just further reinforces my point that the in-order cores aren't cut out for general purpose computing.
I don't really know how the schedulers figure this out, but here is the relevant info straight from the horse's mouth:
Originally Posted by ARM
In big.LITTLE Global Task Scheduling, the same mechanism is in operation, but the OS keeps track of the load history of each thread and uses that history plus real-time performance sampling to balance threads appropriately among big and LITTLE cores.
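The mechanism ARM describes can be sketched as something like the following toy (the history length and the up/down thresholds are invented; real Global Task Scheduling is far more involved): keep a short load history per thread and migrate it between clusters with some hysteresis.

```python
from collections import deque

# Toy sketch of load-history-based cluster selection; thresholds and
# history length are invented, not ARM's actual values.
class Thread:
    def __init__(self, name, history_len=4):
        self.name = name
        self.history = deque(maxlen=history_len)  # recent load samples
        self.cluster = "LITTLE"                   # start on LITTLE

    def sample(self, load_pct):
        # record a real-time performance sample (load in percent)
        self.history.append(load_pct)

    def avg_load(self):
        return sum(self.history) / len(self.history) if self.history else 0.0

UP_THRESHOLD, DOWN_THRESHOLD = 80.0, 30.0  # hysteresis, invented values

def rebalance(thread):
    # migrate up on sustained high load, down on sustained low load
    load = thread.avg_load()
    if thread.cluster == "LITTLE" and load > UP_THRESHOLD:
        thread.cluster = "big"
    elif thread.cluster == "big" and load < DOWN_THRESHOLD:
        thread.cluster = "LITTLE"
    return thread.cluster

t = Thread("ui")
for load in (95, 90, 85, 92):  # sustained heavy load
    t.sample(load)
print(rebalance(t))            # migrates to "big"
```

So the threads aren't "marked" up front; the OS infers where they belong from their measured behavior, which is why in-order cores can still run general-purpose threads, just the lighter ones.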
Apple said something similar when it explained where some of its iOS 12 performance benefits come from: among other things, the OS now ramps up CPU frequencies much more aggressively and ramps them down more quickly.
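The idea behind that kind of ramping policy can be sketched like this (the frequencies and step sizes are made up, and this is in no way Apple's actual governor): jump straight to the top on a burst of load, then drop quickly once the work is done, so you spend less time at inefficient middle frequencies.

```python
# Toy frequency governor; all numbers invented for illustration.
def step(freq_ghz, load, fmin=0.6, fmax=2.4, aggressive=True):
    """Return the next frequency given the current load (0.0-1.0)."""
    if load > 0.5:
        # aggressive policy races to the max; a timid one inches up
        return fmax if aggressive else min(fmax, freq_ghz + 0.2)
    # aggressive policy also drops quickly when the load disappears
    return max(fmin, freq_ghz - (0.6 if aggressive else 0.1))

f = 0.6
f = step(f, 0.9)      # burst of work: jump straight to fmax
print(f)              # 2.4
f = step(f, 0.0)      # idle again: ramp down quickly
print(round(f, 1))    # 1.8
```

The "race to idle" intuition is the same as with big.LITTLE: finish the burst fast, then get back to the low-power operating point as soon as possible.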