With that said, programming for parallelism is going to be a lot more complicated than the way we're used to programming. For one, we are used to a fixed ordering of events. If we write code like this:
Code:
DoPhysics();
DoAI();
DoRendering();
We expect DoPhysics() to finish before DoAI() starts, and DoAI() before DoRendering(). Once work is spread across threads, that guaranteed ordering is gone, and we have to decide how to split the program up ourselves. The first implementation you'll probably think of is to put each "system" of your program on its own thread. Let's say the machine you're running on has 8 threads. The Physics, AI, Input, Rendering, Networking and Dynamic Level Loading systems would each get their own thread to run on. This would use 6 of the machine's threads but leave 2 unused. Now, this isn't really a performance loss, but it is a loss of potential. You'll still be running faster than a single-threaded application, but you won't be running as fast as you possibly could be. This problem is only going to get worse as processors become able to run hundreds, maybe thousands of threads in parallel. Even now, with Intel's new Skylake-X processors offering 36 threads, that's 30 threads going to waste.
Even worse, what if the machine has only 4 threads instead of 8? Only 4 of your threads can run at once, and they'll constantly fight amongst each other to get executed by the processor. Thread 2 could be almost done with a very heavy task, only to be interrupted and evicted by Thread 3, throwing all that work away. The cost of these "context switches" is so heavy that it can make your program two times slower than a single-threaded one! This implementation simply does not scale with the number of threads on the machine.
More importantly, there are some serious synchronisation issues with all of these threads running in parallel. Physics and AI would both be modifying the enemies in the game. Having two threads touch the same memory (while at least one of them is changing it) is undefined behaviour according to the C++11 memory model. You'll need to make a thread wait while another thread is accessing the memory. Simple enough, but this can destroy performance dramatically. If you have 6 threads and one thread is "blocking" all the others, it's as if you had a single-threaded application. Except you'll still incur the same cost that comes from having 6 threads running together, plus the cost of the blocking mechanism itself! You do not want a thread to block if you can avoid it.
To put more icing on the metaphorical shit cake, some of these threads are going to do much less work than others. Let's be serious: does input really deserve its own thread? It's not doing heavy calculation like AI, nor copying large amounts of data to the GPU like Rendering. Physics is actually doing more work than any other system most of the time. Really, more processing power should go to Physics than to any other system.
Alright, so threading every system in the program clearly isn't the way to take advantage of parallelism. What about pipelining? Basically, we treat our program like a giant factory with an assembly line made up of multiple "stages". We put each gameobject on this assembly line and have it go through each stage. The first stage updates the input of the gameobject. The second stage updates its physics. The third updates its AI and finally, the fourth stage renders it onto the screen. Each stage of this assembly line runs on its own thread, and while one stage processes a gameobject, the previous stage processes the next gameobject.
This fixes the synchronisation issue the previous implementation had. Each stage is processing its own gameobject, causing no conflict in memory access. More importantly, it's closer to the way we're used to programming, where there is a fixed order to when things happen. This is nice, but it doesn't really fix any of the previous issues like scaling. Worse, the issue of some threads doing more work than others has become even more pronounced! The physics stage takes longer to process one gameobject, and if the following stages are much faster, they'll be done with their gameobject before the physics stage passes them a new one. The physics stage "starves" the other threads, essentially blocking them from doing work. Then we're back to the same problem we encountered with our multithreaded program where only one thread is actually running.
Let's try "vectorizing" the threads instead. In most programs, you'll have a "vector" of gameobjects that you traverse to update them. We'll use std::thread::hardware_concurrency() to get the number of threads we should make, and split the gameobjects equally among those threads to update. Once they are all done, we return to the main thread, which then gives the "worker threads" another vector to work with. Each worker thread runs the same code and processes the same number of objects (there might be a remainder when dividing the objects, but that just means some threads get one extra object). This means each thread is doing basically the same amount of work, making the workload evenly spread. There are no synchronisation issues because each thread touches its own section of the vector, and unless there are more threads than objects, it scales very well with the number of threads on the machine.
So far, this implementation is vastly better than any of the previous ones. I've benchmarked the performance of all of these implementations on a machine with 8 threads before. The vectorization implementation ends up producing results that are around 8 times faster than the single-threaded version! All the other implementations either give mediocre performance boosts or end up slower than the single-threaded version itself.
This is an amazing benchmark, but we can certainly do better still. The one problem I had with this implementation is the wait all the worker threads have to do at the end. Once a worker thread is done with its task, it gets blocked until all the other worker threads are done with their own work. More importantly, when they all do finish, they still need to wait for the main thread to divide up another vector of objects to give to them. When benchmarking this implementation, I regularly saw the CPU usage dip from 100% to 40% or lower because of the blocking of those threads. Each dip lasted less than a millisecond, but that's still potential performance being wasted; a loss that accumulates quickly the longer the application runs.
Task Based Programming is potentially the furthest we can push the multithreading concept to get the most performance out of parallelism. You have a "thread pool" of worker threads and one scheduling thread. The scheduling thread pushes "tasks" onto a "queue" that the worker threads pull tasks from. These tasks can range from reading data from a file to processing entire arrays. Once a worker thread is done with its task, it can pull another task from the queue without waiting for the other threads to finish. No blocking needed.

The scheduling thread won't be doing the heavy lifting most of the time. Its job is to schedule the events and other input that the program receives, and to pass tasks to the worker threads that do the real work. It can check if a task has been finished and, if it has, use the result to create another task. Otherwise, it'll try to do something else instead. Finally, if the scheduling thread doesn't have anything else to do, it can pop a task from the queue and run it itself.

There should be no synchronisation problems with this implementation as long as the tasks are isolated enough from each other. If a thread is given an easy task, it can complete it quickly and move on to the next one. And as long as the scheduling thread produces enough tasks for everyone, no thread will ever block another.
Tomorrow, I'll post an implementation of a program that implements Task Based Programming. It'll use a queue that allows you to push and pop without blocking. Hopefully, it'll inspire you to make your first move towards parallelism!