Archive for the win32 Category

Radiant Update

I held a bet with my brother to see who can guess the number of lines of code (or at least, close to it) my rendering engine is comprised of. Neither of us were close but I got a good sense of how much work went into this giant mash of code. About 25,000 lines is the total. That total is comprised of approximately 12,000 lines of computational code and about 13,000 lines of comments. This number does not include blank lines and such things. Regardless, it’s a staggering number for a project that is being worked on by one person, part time.

The engine itself is about 45% complete, with the majority of planning done. One of the components, my math module (it handles math stuffs and collision detection), is finally finished. I can breath a sigh of relief, it’s not easy stuff. Unfortunately I didn’t write any unit tests for it so I don’t know if it actually works or not. The project is open source. So, if anyone is masochistic enough to write a tests for my math module, by all means :).

Primary Export: Pain

Everyone knows (or should know) that when you put together a DLL, you need to export functionality so that programs using your DLL know where to find your functions. This is usually done by prefixing classes or functions with __declspec(dllexport) or manually writing a definition file. Straight forward and right to the point. But what happens when you need to export something that does not have a name. Say for example, an overloaded new operator. What the hell does a new operator look like as a definition symbol?

I’ll give you a hint: its not human readable!

So, before I actually spoil the beans and tell you what I did, I have to explain why I did it, because its rather interesting. Radiant, my game engine, is split into 6 DLLs, all of which touch and create dynamically allocated memory somewhere. The problem with allocating memory all over the place is that you need to delete it in the same address space (or DLL) as you allocated it in. With that said, to make it more complex, I had to overload the new and delete operators in order to wrap around _aligned_malloc() and _aligned_free() calls. This is a special type of allocation that allows you to align dynamic memory to an address that is divisible by 16 (or any other value). This is crucial if your using SSE or any special instruction set because all values need to be aligned to at least 16 bytes.

Anyway, going back to the problem at hand, I have a bunch of overloaded operator functions that cannot be exported because if I try to add the __declspec(dllexport) prefix to them, the compiler will scream and tell you that the declarations of the functions do not match up with what is defined internally. Basically, what I am stuck with are a handful of functions that cannot exported programmatically. This is where the definition file comes into play. Exporting a function or class is as easy as entering its name in the definition file under the heading of EXPORTS. But here is the kicker, the overloaded operators of ‘new’ and ‘delete’ do not have a name! They are declared internally in a header that exists in the compiler’s own static data, and there’s no way to override inclusion of that header. Therefore, the only way is to manually enter the function’s mangled name into the definition file.

The mangled name looks something like the following:

??2@YAPAXI@Z (void * __cdecl operator new(unsigned int))
??3@YAXPAX@Z (void __cdecl operator delete(void *))
??_U@YAPAXI@Z (void * __cdecl operator new[](unsigned int))
??_V@YAXPAX@Z (void __cdecl operator delete[](void *))



In fact, those mangled names are fairly generic and may not match up correctly. But their names are very similar to what they should be. A more specific example of the name would be located in the Visual Studio directory under VC\crt\src\intel\_sampld_.def. This file contains a slew of definitions. What your looking for are the first four definitions that look very similar to the ones posted above. If you are running under a x64 or Itanium architecture, there are definition files for those architectures as well.

After a successful compile, all dynamic memory is allocated and deallocated in a single DLL’s memory space. This prevents the Heap Corruption errors I was getting before and allows me to further enhance the allocation and deallocation of dynamic memory. Woot!

I would suggest reading the following link because it contains a very good description of how this process works. Unfortunately, I stumbled upon this file AFTER I already fixed this problem. Alas, C’est La Vie.

Results

I finished writing my own Thread Pool subsystem and then put it through its paces. The results are obviously not that shocking because anything that is multi threaded is more efficient on dual core machines. The machine I was running it on has a AMD Turion 64 X2 CPU and sports two 2.0Ghz cores. It’s not the best machine but its quite speedy for what it is.

Anyway, as I mentioned in the previous entry, threads are great and run best when they have a dedicated core. When there are more threads then cores, the OS then has to time-slice in order to give the threads equal time on the CPU. Having a dual core machine, the optimal amount of threads that I can handle is 2. The following is a graph of the time it took to process 50,000 square root calculations, 100 times in succession:

Thread Graph

On the first run, I did not use any threads. Rather, I let the calculations run in sequential order on the main process. The blue bar represents the thread priority at normal and the purple bar represents the thread priority at Highest. I did this in order to measure the speed difference between the normal and highest priority setting. I should also note that even though the thread priority can be manipulated, it is up to the OS to enforce that. In some cases the OS may decide not to enforce the request for higher priority and give the processing power to other threads.

The second and subsequent runs were using the Thread Pool with an increasing allotment of threads. Each 50,000 calculations were plugged into its own Job and sent off to a worker thread. Having only one thread was not much of a boost over a single process (Took about 1920ms). The main process was put to sleep using the WaitForSingleObject() function call while the worker thread did the processing. At two threads there is a massive increase in time. Just as one would assume, the amount of work was spread over two cores and the time it took to process it all was cut in half (About 950ms). This is the optimal amount of work my CPU can handle because each thread has a dedicated CPU core. Increasing the amount of threads actually increased the amount of time it took to process the calculations. The reason being is that the OS had to time-slice the different threads on both cores. The time increased by about 20ms each time I increased the number of threads.

Thread Pools are very efficient at what they do. As mentioned in the previous entry, creating a thread and destroying a thread could have increased the amount of time required to process a batch of 100 calculations. Regardless, the work I done is Open Source and as is my engine. You can download it here (I’m not responsible for unexplained fires, deaths, or alien abductions due to using this code).

Multicore Processing And Game Engines

I have been passively researching multicore processing for the last few weeks and I came to the conclusion that it is rather easy to implement. In its simplest form, all you need to do is create threads and have them do Jobs. The OS will then schedule a thread to be run on a dedicated core. Having multiple cores makes those threads run at the same time as opposed to the old time-slicing method of single core processors. But, at the very base level, it’s rather primitive and can actually be improved upon.

Creating threads and closing them is fairly fast but may be a bottle neck if the engine does that consistently. The best way to handle this is to not do it, obviously. This is where a Thread Pool comes in handy. It creates a bunch of worker threads that don’t get destroyed until the program exits. Each thread will sit idle until a job has been passed into it to be processed. This involves the use of critical sections and semaphores to accomplish and is much faster then allocating and deallocating threads. A critical section is optimized for speed as compared to any other form of asynchronous data sharing and messaging (alternatives include Mutexes, Events, and so forth). The rule of thumb is to create enough threads so that the OS does not have to time-slice. This is usually done by allocating [num of cores] + 1 threads.

In order for a Thread Pool to work properly, it requires a few things. Firstly, a Job queue. This is a long list of jobs that will get distributed between the threads once threads become available. Secondly, some sort of thread state management. It includes a set of states that the threads can be at. The basic types are ‘Working’ and ‘Idle’, but it can vary on the amount of complexity you add to the Thread Pool. Lastly, it requires data sharing. I suggest writing an object that wraps data around a locking/unlocking mechanism (Semaphores come in handy for this task). Once these aspects are implemented, the Thread Pool is basically finished.

Usage is another key role. Lets assume that you either use the built in Win32 Threading pool or roll your own, it doesn’t matter which one you do. Furthermore, you have some very repetitive code that you want to multithread. If you don’t quite know what is multithreadable, the best place to start looking is in any for loop. The place where I’m going to use my Thread Pool is in a loop where I would have to update some world object, such as a player state or even scene management. For example, this loop might call your world object and cause it to return collision information with its nearest neighbors. Plug the world objects into the Thread Pool, have it run asynchronously and output some data into some shared object. As it’s doing this, have the main thread wait or do some other processing until the output is realized. Once its all done, the Thread Pool will suspend its threads and the main thread can resume doing its job.

What this Thread Pool is designed to do is to complete small tasks asynchronously. Sticking a dedicated piece of code on one thread (such as a sound subsystem, or networking) is rather counter productive to a Thread Pool because it utilizes the thread until the end of the program. I would suggest that a subsystem that is substantially heavy be on a separately spawned thread instead. Another word of advice that I came across is to keep the amount of writing done on each thread to a minimum because it requires locks. The best approach is to have each thread write to its own dedicated memory that is attached to the Job that its processing. Keep the shared data read only when possible.

I’m in the process of implementing my own Thread Pool and once its done, I’ll post some metrics.

Work++; // Again!

I’ve been slacking. No, really, mega slacking. Its been almost two months since I wrote something here. Earlier, I wrote about my latest employment opportunity at PWLabs. Well, that job ended abruptly. There was nothing wrong with that job, I enjoyed it fully, but several days after I started, I got a call from a recruiter that works for McAfee (Yes, the anti virus maker). Impeccable timing! He hooked me up with the dev team at McAfee and we had a phone interview. That, eventually, blossomed to a full 5 hour interview (you heard me correctly, 5 Hours!) where I took two tests on top of an in-person interview. Needless to say, I got the job and my brief career at PWLabs came to a close.

The stuff that they have me doing right now is guarded with an NDA, but its mostly going to be Win32 programming. My experience at BumpTop and at Strike Technologies helped me get my foot in the door. I always avoided huge Corporate jobs but this one actually made me excited.

For those that don’t know, McAfee’s engineering department is located in Waterloo, so the drive from Toronto is a bit time consuming. I ended up buying a 2008 Hyundai Accent for the trip, it’s not a bad car. Nevertheless, I will be moving down to this area eventually. Some places that I was looking at are in Cambridge, Kitchener, and Guelph. But that won’t happen for a few years anyway.

At A Crossroads.

On my spare time (if there is such a thing), I have been slowly putting together my Game Engine. The foundational stuff is pretty much written and tested and I am just about to start writing the heart of this engine, the renderer. But as I’m thinking about the code structure, I am also debating if I should stick it out on the web as an Open Source project. I read up on a bunch of licenses but none of them were exactly what I needed. The type of license I’m looking for is one that would credit my name whenever it is used or redistributed. Also, it would force people who wants to use it for commercial use to cough up some coin. In other words: free for personal use, not free for commercial use, and my name to be always credited. Anyone know of a license like this?

One of the licenses I found useful was the MAME license. I might ask its author for a copy. On another note, I just set up an SVN server to host my engine. Also, I am slowly putting together a webpage for this project so people can easily find it. I’ll post the link soon.

EDIT: Dave advised me against using some of the lesser known licenses and advocated LGPL. I’m going to give LGPL a thorough read.

The BumpTop Presentation

The Presentation Info:

Location: Seneca @ York, room T2109.
When: Tuesday, Nov 6, 2007 at 11:40am.

I will be discussing the relationship between Human Computer Interaction and high performance 3D graphics. Also, I will demonstrate the work that I have been doing lately on the BumpTop 3D Desktop. All are welcomed to attend.

Arrays, Post Haste!

Almost any chance I get, I like to dabble in crazy things ranging from code optimizations to patterns and data structures. Well the last few weekends I had a some of time to kill and with my ever growing disgust of the STL Vector class, I decided to write my own (Growable Array class), just for the hell of it. Out of the box, Vectors are slow, especially when being run in debug mode. There are a slew of iterator checks, data verifications (during memory swaps, etc) and so on which completely turned me off of using the Vector class in time-critical applications such as games.

In that case, lets bust out some numbers. After I wrote my Array class (and optimized it, heavily) , I decided to test it against the Vector class in various different compilation modes. First was a regular debug build (without compiler optimizations), the second test was with full compiler optimizations, and lastly was a test using full compiler optimizations and that neat trick I mentioned in a previous post about disabling iterator checks (#define _SECURE_SCL 0). I was surprised to see the numbers once they came in, check them out:

No Optimizations:
  Vector Array
Push Back: 6542ms 619ms
Operator []: 1156ms 444ms
Operator =: (50) 1710ms 1639ms
Operator ==: (20) 62742ms 2471ms
Operator !=: (20) 62622ms 2452ms
Get Size: 380ms 365ms
Get Back/Front: 28400ms 830ms
PopBack: 351ms 40ms
Erase/Remove: (1000) 25755ms 23777ms

Full Optimizations:
  Vector Array
Push Back: 167ms 84ms
Operator []: 31ms 0ms
Operator =: (1000) 15291ms 15932ms
Operator ==: (500) 0ms 0ms
Operator !=: (500) 0ms 0ms
Get Size: 0ms 0ms
Get Back/Front: 78ms 0ms
PopBack: 4ms 2ms
Erase/Remove: (1000) 22584ms 22373ms

Optimizations and _SECURE_SCL Trick:
  Vector Array
Push Back: 168ms 81ms
Operator []: 0ms 0ms
Operator =: (1000) 15250ms 15832ms
Operator ==: (500) 0ms 0ms
Operator !=: (500) 0ms 0ms
Get Size: 0ms 0ms
Get Back/Front: 78ms 0ms
PopBack: 2ms 1ms
Erase/Remove:(1000) 22611ms 22419m

 
Lots of Numbers! Let me explain what I did to get these numbers. The tests were conducted on 1 million integers within a loop, with the exception of the equals operator (operator=), comparison operator (==), the not equals operator (!=), and the Erase/Remove functions which were reduced considerably since they took so long. The numbers in brackets are the number of iterations that were done to get the result. Due to the debugging information that is added into the Vector class, the performance of the class will slow down to about one-seventh of the speed it should run at. But once all the debugging information is removed, you can really see the class fly. But I guess that was not enough for me. Once the Debugging information was removed and the compiler fully optimized the code, the Vector class was raised to the level of my Array class. It was on par with the majority of the tests, with the exception of push_back(), back(), and front().

One advantage that my Array class has over the Vector class is that it tries to use realloc() in order to grow, rather then trying to allocate new space and copying over the old info. Secondly, I scoured the web for some assembly tricks and stumbled upon an optimization paper written by AMD that mentioned the use of a buffer when copying information from one memory address to another. This is all done on the CPU, using registers. This function out preforms the stock memcpy() or memmove() functions that are included with Visual Studio. I also read that tapping into the MMX or SSE technology would give potential gain, but currently I don’t know enough about that technology.

Either way, I’m releasing the source for everyone to check out under the zlib/libpng license. If anyone has questions/comments/improvements to this class, please feel free to email me. Enjoy :).

Exceptions Con’t

The problem with using SEH (As talked about in my previous blog) is that it does not catch any debugging information that is included inside any of the STL headers. I ran into this problem last night when trying to test the crash handler and I was baffled to see that my program was crashing outright and the crash handler did not work. I tracked the problem down to STL’s Vector class. When compiled in Release mode, the Vector class still contained debugging information which allowed the Just-In-Time (JIT) debugger to kick in and attempt to debug the application. This is bad because the JIT debugger has priority over any other exception handling. Consequentially, the crash handler was not fired and the mini dump was not created and emailed.

So how does one fix this? Actually its just a one-liner. Just by defining this before you include the Vector class removes the debugging code form compilation:

#define _SECURE_SCL 0
#include <vector>

In addition, the debugging code inside the Vector class is all over the place. It will literally remove a massive chunk of assembly code from the executable (and in the case of games, it will push out an extra 20fps or so). Having safety inside STL is nice when you do not have a crash handler, but it can create problems. However, it does compromise security because if you create a buffer overflow, it is easy to exploit that. I would suggest that if your interested in knowing more about this, check out this article, and the MSDN article on Checking Iterators.

Exception Reported!

The last few weeks or so were ridiculously busy at work. The BumpTop internal alpha release is nearly completed and we will be shipping it out to a select few so that they can test it. I am really excited to put the 10,000+ lines of code that I have added to the project to the test in the real world. All of it has been a challenge thus far and as a consequence, it probably contains numerous bugs that I alone was not able to find. So, before the alpha is being shipped out, I am adding a neat snippet of code that allows me to do a stack trace on the executable. This snippet of code dumps the call stack and current variables to a file and then emails it to me for debugging purposes. Unfortunately for all the people that shun the Microsoft C++ Compiler, this is only available using it because the compiler itself exports a debugging database, which the crash can use to identify symbols (variables, line numbers, function names, etc). There might be other compilers out there that export symbols or failmaps, but since this is all new to me, I have only tried it using Visual Studio.

So, on to the code. Now before you scroll down and click the link, let me explain how this code works. Firstly, it uses Structured Exception Handling (SEH), which is much like C++ exception handling but its proprietary to Microsoft. The difference is that SEH does not call the destructor on local objects when the exception is caught, while regular C++ exception handling does. Secondly, the syntax for this type of exception handling is rather different because it uses the proprietary __try and __except keywords to catch the exception. The __except keyword can call a specific type of function that is referred to as a “Filter” by MSDN. This filter is where all your stack tracing and memory dumping can happen. Within this filter, I access a library by the name of DBGHLP.DLL. It contains a function that will dump the contents of a call stack into a file. This dump can be opened by Visual Studio and run much like a regular program. At that point, the contents of the crash can be seen and easily debugged by the programmer. But enough chit-chat, here is the Code.

Also, since the BumpTop alpha is going out internally the public alpha will be available very soon (within a week or two). If anyone is interested in helping with beta testing, please send me an email to mark(NO_SPAM)@bumptop.com. Please remove the (NO_SPAM) string in the email address.