March 18, 2024

A significant release for ARM, the Cortex A76 might look as if only a few digits have changed since the company's last CPU. It is a major step all the same: the core has been engineered to perform at its peak while keeping to the same form factor.

Delivering a 35 percent performance improvement and roughly 40 percent better power efficiency over its predecessor, the A75, ARM’s A76 remains compatible with existing processors as well as the firm’s DynamIQ CPU cluster technology.

The best way to investigate ARM’s Cortex A76 is to push it to the extreme by feeding its execution core a variety of tasks.

Diving deep into the execution core, which is what I set out to write this post about, we see that the newer CPU boasts two simple ALUs (arithmetic logic units) for basic arithmetic and bit-shifting, a combined multi-cycle integer and ALU unit, and a branch unit. The improvement is easier to appreciate when you recall that the A75 had only one ALU and one ALU/MAC.

These are paired with two 128-bit SIMD/NEON execution pipes, which offer double the bandwidth of the previous design. Only one of them can handle floating-point divide and multiply-accumulate instructions. Half-precision FP16 support stays the same, but the wider pipes also help boost the low-precision INT8 dot product extensions, which, as we know, are in much demand these days.
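To make the INT8 dot product idea concrete, here is a minimal sketch in plain Python. It assumes the semantics of the Armv8.2 signed dot product instruction, where each group of four 8-bit products is summed into one 32-bit accumulator lane, so one 128-bit vector carries 16 INT8 values feeding four lanes. The helper names `to_int8` and `dot4_accumulate` are made up for illustration, and the model covers the arithmetic only, not timing or the hardware pipes.

```python
def to_int8(x):
    """Wrap an integer into the signed 8-bit range [-128, 127]."""
    return ((x + 128) % 256) - 128


def dot4_accumulate(acc, a, b):
    """Model a 4-way INT8 dot product over one 128-bit vector.

    acc: four accumulator lanes (treated as plain Python ints here)
    a, b: sixteen INT8 values each
    Returns the four updated accumulator lanes.
    """
    assert len(acc) == 4 and len(a) == 16 and len(b) == 16
    out = []
    for lane in range(4):
        total = acc[lane]
        # Each lane consumes four adjacent 8-bit elements from a and b.
        for k in range(4):
            i = 4 * lane + k
            total += to_int8(a[i]) * to_int8(b[i])
        out.append(total)
    return out


if __name__ == "__main__":
    a = list(range(16))  # 0, 1, ..., 15
    b = [1] * 16
    print(dot4_accumulate([0, 0, 0, 0], a, b))  # [6, 22, 38, 54]
```

Sixteen multiplies and sixteen adds collapse into a single vector operation in this model, which is why wider SIMD pipes pay off for low-precision workloads.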

Arm Cortex-A76 microarchitecture

The major front-end change in the A76 is a branch predictor that is decoupled from the main instruction fetch and runs at 32 versus 16 bytes per cycle. The main reason behind this change is to expose more memory-level parallelism, which comes in handy when dealing with TLB and cache misses.

The A76 moves to four-instruction-per-cycle decode, up from three on the A75 and two on the A73, and the core can now dispatch up to eight μops per cycle. ARM is making a serious change in processing power here: combined with eight issue queues, one for each of the execution units, fed from a 128-entry instruction window, the design is built to improve IPC, namely instructions per cycle.

Keeping the pipes well fed, even during a cache miss, ensures high instruction throughput.
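A rough way to see what the wider front end buys is a best-case bound on cycles. The sketch below uses the per-cycle widths quoted above (two for the A73, three for the A75, four for the A76); the 1200-instruction workload and the function name `min_decode_cycles` are arbitrary choices for illustration. Real code never reaches this bound, since dependencies, branch mispredictions and cache misses all add cycles.

```python
import math

# Instructions per cycle for each core, as quoted in the text.
DECODE_WIDTH = {"A73": 2, "A75": 3, "A76": 4}


def min_decode_cycles(n_instructions, width):
    """Lower bound on the cycles needed to push n instructions through
    a front end that handles at most `width` instructions per cycle."""
    return math.ceil(n_instructions / width)


if __name__ == "__main__":
    for core, width in DECODE_WIDTH.items():
        # Prints 600, 400 and 300 cycles for the A73, A75 and A76.
        print(core, min_decode_cycles(1200, width))
```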

So far this was all about fetch and execution improvements, but what if memory becomes the bottleneck? Well, ARM has taken care of that too, folks!

Though the L1 and L2 caches remain the same, the decoupled address generation and cache look-up pipelines have been altered and now receive double the bandwidth. Memory parallelism was kept in mind in the design, which can handle 68 in-flight loads, 72 in-flight stores and 20 outstanding non-prefetch misses. The entire cache hierarchy has been optimized for latency too: it takes four cycles to reach L1, nine to reach L2 and 31 to reach L3. The upshot is that memory access is faster, which helps speed up execution.
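The cache latencies above can be folded into a back-of-the-envelope average load latency. In this sketch only the 4, 9 and 31 cycle figures come from the text; the hit rates and the 200-cycle DRAM latency are invented for illustration, and the function name `average_load_latency` is hypothetical. It treats each figure as the total latency for a load served by that level.

```python
# Latencies in cycles, as quoted in the text.
L1_CYCLES, L2_CYCLES, L3_CYCLES = 4, 9, 31


def average_load_latency(l1_hit, l2_hit, l3_hit, dram_cycles):
    """Expected cycles per load.

    Hit rates are local: the fraction of accesses that reach a level
    and hit in it. dram_cycles is an assumed main-memory latency.
    """
    reach_l2 = 1 - l1_hit                  # loads that miss L1
    reach_l3 = reach_l2 * (1 - l2_hit)     # loads that also miss L2
    reach_dram = reach_l3 * (1 - l3_hit)   # loads that miss every cache
    return (l1_hit * L1_CYCLES
            + reach_l2 * l2_hit * L2_CYCLES
            + reach_l3 * l3_hit * L3_CYCLES
            + reach_dram * dram_cycles)


if __name__ == "__main__":
    # Assumed hit rates: 90% in L1, 50% in L2, 50% in L3; 200-cycle DRAM.
    print(round(average_load_latency(0.9, 0.5, 0.5, 200), 3))
```

With these made-up numbers the rare trips to main memory account for about half of the average, which is why the ability to keep many misses in flight matters as much as the raw cache latencies.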

This time ARM has begun moving away from 32-bit with the A76. It still supports AArch32, but only for unprivileged application code at EL0, while AArch64 is supported at every exception level up to EL3, from the OS down to low-level firmware. For now we can only guess that at some point in the future ARM will switch entirely to 64-bit support. But then, the ecosystem matters a lot.

If it still seems like a unicorn out of a fantasy, well, a CPU is judged by how many instructions it can execute per cycle, and that is where ARM has proved itself. ARM added an extra math unit to handle floating point for complex calculations, which lifts performance accordingly.

As for drawbacks, the one I at least agree with is that the execution cores need to be kept fed as much of the time as possible, or they just sit there wasting silicon and power. They also need to be fed quickly and efficiently; otherwise cache misses and system stalls can arise. That calls for better branch prediction and prefetching, as well as faster access to cache memory. All of this costs silicon and power too, so it has to be used as efficiently as possible.

Arm Cortex-A76 detailed benchmarks

ARM has paid a lot of attention to the A76. The number of detailed changes shows that this is a substantial redesign rather than a few tweaks to the A75. With the move down to 7nm and the overall IPC improvements, we are looking at a notable 35 percent performance increase over the already impressive A75. The A76 also keeps power low by running at lower frequencies while still hitting those performance targets.

The Cortex A76 is set to rule mobile devices, from smartphones to laptops and beyond. We expect to see it in products by 2019.