Startup Claims it Can Boost CPU Performance by 2-100X

By: Bryan Cockfield

13 June 2024 at 02:00

As Moore’s Law has slowed a bit, with chip makers reaching the physical limits of transistor size, researchers are having to look beyond cramming more transistors onto a chip to increase CPU performance. ARM is having a bit of a moment by improving the performance-per-watt of many computing platforms, but other ideas need to come to the forefront to make any big pushes in this area. A startup called Flow Computing claims it can improve modern CPUs by a significant amount with a slight change to their standard architecture.

It hopes to make these improvements by adding a parallel processing unit, which it calls the “back end”, to a more-or-less standard CPU, the “front end”. The two computing units would sit on the same chip, with a shared bus allowing them to communicate extremely quickly, so the front end can rapidly offload tasks that are better suited to parallel processing to the back end. Since the front end keeps essentially the same components as a modern CPU, the startup hopes to maintain backwards compatibility with existing software while allowing developers to optimize for the new parallel computing unit when needed.

We’ll take a step back and refrain from claiming this is the future of computing until we see some results and maybe a prototype or two. Still, the idea does show some promise. It is similar to some ARM computers, which have multiple cores optimized for different tasks, and to other computers that offload non-graphics tasks to a GPU, which is better suited to parallel workloads. Even the Raspberry Pi is starting to take advantage of external GPUs for tasks like these.

Hackaday
Make Your Code Slower With Multithreading
8 June 2024 at 02:00

By: Dave Rowntree

With the performance of modern CPU cores plateauing recently, the main performance gains now come from multiple cores and multithreaded applications. Typically, a fast GPU is only so mind-bogglingly quick because thousands of cores operate in parallel on the same set of tasks. So it would seem prudent to write our applications in a multithreaded fashion to take advantage of this parallelism. But as [Marc Brooker] illustrates, it’s not as simple as one would assume, and it’s very easy to end up with far worse overall performance and no easy way to fix it.

[Marc] was rerunning an old experiment to calculate, by brute force, the expected number of shared birthdays in a group of people. The experiment was essentially a tight loop running a pseudorandom number generator, the standard libc rand() function. [Marc] profiled single-threaded and multithreaded versions of the code and noted that the runtime increased dramatically beyond two threads. Something fishy was going on. Running perf, [Marc] noted that there were significant L1 cache misses, but the real killer for performance was the increase in expensive context switches. Perf indicated that with four threads, there was an overhead of nearly 50% servicing spin locks. There were no locks in the code, so after more perf magic, the syscalls taking all the time were identified. Something in there was using a futex (or fast userspace mutex) a whole lot.
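For readers who want to picture the workload, here is a minimal sketch of that kind of brute-force loop. It is not [Marc]'s actual code: the function names, the 365-day year, and the choice to count every person who shares a birthday with at least one other person are all assumptions made for illustration. The relevant detail is that every iteration of the inner loop calls rand(), so a threaded version that runs this loop in each thread hammers the same shared generator.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define DAYS_IN_YEAR 365

/* Simulate one group and count the people who share a birthday with
 * at least one other person (an assumed definition of "shared"). */
static int shared_birthdays_in_group(int group_size)
{
    int count[DAYS_IN_YEAR];
    int shared = 0;

    assert(group_size >= 0);
    memset(count, 0, sizeof count);

    /* Tight loop: one rand() call per person. On glibc each call
     * locks the generator's shared global state. */
    for (int i = 0; i < group_size; i++)
        count[rand() % DAYS_IN_YEAR]++;

    for (int d = 0; d < DAYS_IN_YEAR; d++)
        if (count[d] > 1)
            shared += count[d];

    return shared;
}

/* Brute-force estimate: average the count over many simulated groups. */
double mean_shared_birthdays(int group_size, long trials)
{
    long total = 0;

    assert(trials > 0);
    for (long t = 0; t < trials; t++)
        total += shared_birthdays_in_group(group_size);

    return (double)total / (double)trials;
}
```

Calling mean_shared_birthdays(23, 1000000) from a single thread runs at full speed. Splitting the trials across threads that all call rand() is the setup in which the slowdown described above appears.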

A dig through the glibc source code turned up a comment that said it all:

/* POSIX.1c requires that there is mutual exclusion for the `rand' and `srand' functions to prevent concurrent calls from modifying common data. */

__libc_lock_lock (lock);
(void) __random_r (&unsafe_state, &retval);
__libc_lock_unlock (lock);

By replacing the call to rand() with random_r(), which works on state supplied by the caller and so needs no shared lock, the program’s performance with four threads improved dramatically. The runtime dropped to about a quarter of the single-threaded version, the theoretical ideal for four threads. As [Marc] summarizes, multithreaded programming is not always as straightforward as one might think. A lock hidden inside a library call can make threaded code far slower than the single-threaded version, and while a fix like this one is sometimes available, that is not guaranteed in every situation.
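The following is a minimal sketch of that fix, assuming a Linux system with glibc, where random_r() and initstate_r() are available as extensions. The struct birthday_job type, the function name, and the per-thread seed are hypothetical and not taken from [Marc]'s program. The point it illustrates is that each thread owns its generator state, so no call ever touches shared data.

```c
#ifndef _DEFAULT_SOURCE
#define _DEFAULT_SOURCE   /* asks glibc to declare random_r() and initstate_r() */
#endif
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define DAYS_IN_YEAR 365

/* Hypothetical per-thread work item: inputs plus a private result slot. */
struct birthday_job {
    unsigned int seed;    /* give each thread a different seed */
    int group_size;
    long trials;
    long total_shared;    /* written only by the owning thread */
};

/* Thread entry point; the signature matches what pthread_create() expects. */
void *run_birthday_job(void *arg)
{
    struct birthday_job *job = arg;
    struct random_data rng;
    char state[64];
    int32_t r;

    assert(job != NULL && job->group_size >= 0);
    job->total_shared = 0;

    /* glibc expects a zeroed random_data before initstate_r(). */
    memset(&rng, 0, sizeof rng);
    if (initstate_r(job->seed, state, sizeof state, &rng) != 0)
        return NULL;

    for (long t = 0; t < job->trials; t++) {
        int count[DAYS_IN_YEAR] = {0};

        for (int i = 0; i < job->group_size; i++) {
            random_r(&rng, &r);   /* uses only this thread's state */
            count[r % DAYS_IN_YEAR]++;
        }
        for (int d = 0; d < DAYS_IN_YEAR; d++)
            if (count[d] > 1)
                job->total_shared += count[d];
    }
    return arg;
}
```

To use it from multiple threads, fill in one struct birthday_job per thread, pass each to pthread_create() with run_birthday_job as the start routine, and sum the total_shared fields after pthread_join(). Because the state buffer lives on each thread's own stack, the threads never contend for it.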

The art of debugging and profiling code is complex, so here’s how to use Valgrind to look for problems you might not even know about. Even the humble Linux pipe needs to be thought out to get decent performance. What a surprise!