With that said, programming for parallelism is going to be a lot more complicated than the way we're used to programming. For one, we are used to a fixed ordering of events. If we write code like this:
Code:
DoPhysics();
DoAI();
DoRendering();
We expect DoPhysics() to finish before DoAI() starts, and DoAI() before DoRendering(). Once work is spread across threads, that guaranteed ordering is gone, and we have to decide how to split the program up ourselves. The first implementation you'll probably think of is to put each "system" of your program on its own thread. Let's say the machine you're running on has 8 threads. The Physics, AI, Input, Rendering, Networking and Dynamic Level Loading systems would each get their own thread to run on. This would use 6 of the machine's threads but leave 2 unused. Now, this isn't really a performance loss, but it is a loss of potential. You'll still be running faster than a single-threaded application, but you won't be running as fast as you possibly could be. This problem is only going to get worse as processors become able to run hundreds, maybe thousands of threads in parallel. Even now, with Intel's new Skylake-X processors offering 36 threads, that's 30 threads going to waste.
Even worse, what if the machine has only 4 threads instead of 8? Only 4 of your threads can run at once, and they'll constantly fight amongst each other to get executed by the processor. Thread 2 could be almost done with a very heavy task, only to be interrupted and evicted by Thread 3, throwing all that work away. The cost of these "context switches" is so heavy that it can make your program two times slower than a single-threaded one! This implementation simply does not scale with the number of threads on the machine.
More importantly, there are some serious synchronisation issues with all of these threads running in parallel. Physics and AI would both be modifying the enemies in the game. Having two threads touch the same memory (while at least one of them is changing it) is undefined behaviour according to the C++11 memory model. You'll need to make a thread wait while another thread is accessing the memory. Simple enough, but this can destroy performance dramatically. If you have 6 threads and one thread is "blocking" all the others, it's as if you had a single-threaded application. Except you'll still incur the same cost that comes from having 6 threads running together, plus the cost of the blocking mechanism itself! You do not want a thread to block if you can avoid it.
To put more icing on the metaphorical shit cake, some of these threads are going to do much less work than others. Let's be serious: does input really deserve its own thread? It's not doing heavy calculation like AI, nor copying large amounts of data to the GPU like Rendering. Physics is actually doing more work than any other system most of the time. Really, more processing power should go to Physics than to any other system.
Alright, so threading every system in the program clearly isn't the way to take advantage of parallelism. What about pipelining? Basically, we treat our program like a giant factory with an assembly line made up of multiple "stages". We put each gameobject on this assembly line and have it go through each stage. The first stage updates the input of the gameobject. The second stage updates its physics. The third updates its AI and finally, the fourth stage renders it onto the screen. Each stage of this assembly line runs on its own thread, and while one stage processes a gameobject, the previous stage processes the next gameobject.
This fixes the synchronisation issue the previous implementation had. Each stage is processing its own gameobject, causing no conflict in memory access. More importantly, it's closer to the way we're used to programming, where there is a fixed order to when things happen. This is nice, but it doesn't really fix any of the previous issues like scaling. Worse, the issue of some threads doing more work than others has become even more pronounced! The physics stage takes longer to process one gameobject, and if the following stages are much faster, they'll be done with their gameobject before the physics stage passes them a new one. The physics stage "starves" the other threads, essentially blocking them from doing work. Then we're back to the same problem we encountered with our multithreaded program where only one thread is actually running.
Let's try "vectorizing" the threads instead. In most programs, you'll have a "vector" of gameobjects that you traverse to update them. We'll use std::thread::hardware_concurrency() to get the number of threads we should make, and split the gameobjects equally among those threads to update. Once they are all done, we return to the main thread, which then gives the "worker threads" another vector to work with. Each worker thread runs the same code and processes the same number of objects (there might be a remainder when dividing the objects, but that just means some threads get one extra object). This means each thread is doing basically the same amount of work, making the workload evenly spread. There are no synchronisation issues because each thread touches its own section of the vector, and unless there are more threads than objects, it scales very well with the number of threads on the machine.
So far, this implementation is vastly better than any of the previous ones. I've benchmarked the performance of all of these implementations on a machine with 8 threads before. The vectorization implementation ends up producing results that are around 8 times faster than the single-threaded version! All the other implementations either give mediocre performance boosts or end up slower than the single-threaded version itself.
This is an amazing benchmark, but we can certainly do better still. The one problem I had with this implementation is the wait all the worker threads have to do at the end. Once a worker thread is done with its task, it gets blocked until all the other worker threads are done with their own work. More importantly, when they all do finish, they still need to wait for the main thread to divide up another vector of objects to give to them. When benchmarking this implementation, I regularly saw the CPU usage dip from 100% to 40% or lower because of the blocking of those threads. Each dip lasted less than a millisecond, but that's still potential performance being wasted; a loss that accumulates quickly the longer the application runs.
Task Based Programming is potentially the furthest we can push the multithreading concept to get the most performance out of parallelism. You have a "thread pool" of worker threads and one scheduling thread. The scheduling thread pushes "tasks" onto a "queue" that the worker threads pull tasks from. These tasks can range from reading data from a file to processing entire arrays. Once a worker thread is done with its task, it can pull another task from the queue without waiting for the other threads to finish. No blocking needed.

The scheduling thread won't be doing the heavy lifting most of the time. Its job is to schedule the events and other input that the program receives, and to pass tasks to the worker threads that do the real work. It can check if a task has been finished and, if it has, use the result to create another task. Otherwise, it'll try to do something else instead. Finally, if the scheduling thread doesn't have anything else to do, it can pop a task from the queue and run it itself.

There should be no synchronisation problems with this implementation as long as the tasks are isolated enough from each other. If a thread is given an easy task, it can complete it quickly and move on to the next one. And as long as the scheduling thread produces enough tasks for everyone, no thread will ever block another.
Tomorrow, I'll post an implementation of a program that implements Task Based Programming. It'll use a queue that allows you to push and pop without blocking. Hopefully, it'll inspire you to make your first move towards parallelism!