Archive for September 2008

Results

I finished writing my own Thread Pool subsystem and then put it through its paces. The results are obviously not that shocking because anything that is multi threaded is more efficient on dual core machines. The machine I was running it on has a AMD Turion 64 X2 CPU and sports two 2.0Ghz cores. It’s not the best machine but its quite speedy for what it is.

Anyway, as I mentioned in the previous entry, threads are great and run best when they have a dedicated core. When there are more threads then cores, the OS then has to time-slice in order to give the threads equal time on the CPU. Having a dual core machine, the optimal amount of threads that I can handle is 2. The following is a graph of the time it took to process 50,000 square root calculations, 100 times in succession:

Thread Graph

On the first run, I did not use any threads. Rather, I let the calculations run in sequential order on the main process. The blue bar represents the thread priority at normal and the purple bar represents the thread priority at Highest. I did this in order to measure the speed difference between the normal and highest priority setting. I should also note that even though the thread priority can be manipulated, it is up to the OS to enforce that. In some cases the OS may decide not to enforce the request for higher priority and give the processing power to other threads.

The second and subsequent runs were using the Thread Pool with an increasing allotment of threads. Each 50,000 calculations were plugged into its own Job and sent off to a worker thread. Having only one thread was not much of a boost over a single process (Took about 1920ms). The main process was put to sleep using the WaitForSingleObject() function call while the worker thread did the processing. At two threads there is a massive increase in time. Just as one would assume, the amount of work was spread over two cores and the time it took to process it all was cut in half (About 950ms). This is the optimal amount of work my CPU can handle because each thread has a dedicated CPU core. Increasing the amount of threads actually increased the amount of time it took to process the calculations. The reason being is that the OS had to time-slice the different threads on both cores. The time increased by about 20ms each time I increased the number of threads.

Thread Pools are very efficient at what they do. As mentioned in the previous entry, creating a thread and destroying a thread could have increased the amount of time required to process a batch of 100 calculations. Regardless, the work I done is Open Source and as is my engine. You can download it here (I’m not responsible for unexplained fires, deaths, or alien abductions due to using this code).

Multicore Processing And Game Engines

I have been passively researching multicore processing for the last few weeks and I came to the conclusion that it is rather easy to implement. In its simplest form, all you need to do is create threads and have them do Jobs. The OS will then schedule a thread to be run on a dedicated core. Having multiple cores makes those threads run at the same time as opposed to the old time-slicing method of single core processors. But, at the very base level, it’s rather primitive and can actually be improved upon.

Creating threads and closing them is fairly fast but may be a bottle neck if the engine does that consistently. The best way to handle this is to not do it, obviously. This is where a Thread Pool comes in handy. It creates a bunch of worker threads that don’t get destroyed until the program exits. Each thread will sit idle until a job has been passed into it to be processed. This involves the use of critical sections and semaphores to accomplish and is much faster then allocating and deallocating threads. A critical section is optimized for speed as compared to any other form of asynchronous data sharing and messaging (alternatives include Mutexes, Events, and so forth). The rule of thumb is to create enough threads so that the OS does not have to time-slice. This is usually done by allocating [num of cores] + 1 threads.

In order for a Thread Pool to work properly, it requires a few things. Firstly, a Job queue. This is a long list of jobs that will get distributed between the threads once threads become available. Secondly, some sort of thread state management. It includes a set of states that the threads can be at. The basic types are ‘Working’ and ‘Idle’, but it can vary on the amount of complexity you add to the Thread Pool. Lastly, it requires data sharing. I suggest writing an object that wraps data around a locking/unlocking mechanism (Semaphores come in handy for this task). Once these aspects are implemented, the Thread Pool is basically finished.

Usage is another key role. Lets assume that you either use the built in Win32 Threading pool or roll your own, it doesn’t matter which one you do. Furthermore, you have some very repetitive code that you want to multithread. If you don’t quite know what is multithreadable, the best place to start looking is in any for loop. The place where I’m going to use my Thread Pool is in a loop where I would have to update some world object, such as a player state or even scene management. For example, this loop might call your world object and cause it to return collision information with its nearest neighbors. Plug the world objects into the Thread Pool, have it run asynchronously and output some data into some shared object. As it’s doing this, have the main thread wait or do some other processing until the output is realized. Once its all done, the Thread Pool will suspend its threads and the main thread can resume doing its job.

What this Thread Pool is designed to do is to complete small tasks asynchronously. Sticking a dedicated piece of code on one thread (such as a sound subsystem, or networking) is rather counter productive to a Thread Pool because it utilizes the thread until the end of the program. I would suggest that a subsystem that is substantially heavy be on a separately spawned thread instead. Another word of advice that I came across is to keep the amount of writing done on each thread to a minimum because it requires locks. The best approach is to have each thread write to its own dedicated memory that is attached to the Job that its processing. Keep the shared data read only when possible.

I’m in the process of implementing my own Thread Pool and once its done, I’ll post some metrics.

|