Lately at work I've been dealing with a problematic socket server. The currently deployed version has something of a memory leak (to the tune of 140+ MB/day), probably due to complications from incorrectly multithreading System.Net.Sockets.Socket instances (note: they're not thread-safe).
Unfortunately, when I redid the socket server to lock all the sockets and other non-thread-safe resources, I ran into a deadlock. In chasing it down, I used Phil Haack's modification of Ian Griffiths's TimedLock class. That enabled me to find where the deadlocks were and eliminate them. This class is really a very clever tool, with one small problem: it was throwing exceptions on the production server. The test server ran fine for days at a time, loaded down as heavily as I could manage, but the production server locked up inside of two hours every time. The first error in the log was always an ArgumentException thrown by the stack-trace hashtable, saying that the object being inserted as the key was already in the hashtable.
After several days of debugging, and a few e-mails exchanged with Phil, he said the following to me:
If the object wasn't removed from the hashtable via the dispose method before the second lock is acquired, that could cause the error.
I started to write back, saying "But isn't the whole point of the locking that there is no way any other thread could acquire that lock until Dispose is called, thus calling Monitor.Exit and removing the object from the hashtable?", and then I was, as they say, enlightened. The sequence of events in the TimedLock runs like this:
TimedLock tl = TimedLock.Lock(o);
    Monitor.TryEnter(o);
    StackTraces.Add(o);
...
tl.Dispose();
    Monitor.Exit(o);
    StackTraces.Remove(o);
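For context, here's a simplified sketch of what that looks like inside the class. This is my reconstruction from the behavior described in this post, not Phil's actual code; details like the ten-second timeout, the exception types, and the StackTraces table name are illustrative assumptions.

```csharp
using System;
using System.Collections;
using System.Threading;

public struct TimedLock : IDisposable
{
    private readonly object target;

    // Hypothetical debugging table mapping each locked object to the
    // stack trace that acquired it; the real class keeps something similar.
    private static readonly Hashtable StackTraces =
        Hashtable.Synchronized(new Hashtable());

    private TimedLock(object o) { target = o; }

    public static TimedLock Lock(object o)
    {
        // Illustrative timeout -- the point of TimedLock is to fail
        // loudly instead of deadlocking silently.
        if (!Monitor.TryEnter(o, TimeSpan.FromSeconds(10)))
            throw new ApplicationException("Timed out waiting for lock");

        // Throws ArgumentException if o is already a key in the table.
        StackTraces.Add(o, Environment.StackTrace);
        return new TimedLock(o);
    }

    public void Dispose()
    {
        Monitor.Exit(target);        // the problematic order: release the monitor first...
        StackTraces.Remove(target);  // ...then remove the hashtable entry
    }
}
```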
On a single-CPU machine (such as our test server), this code runs fine, I would guess, 99.99999% of the time. On a dual-CPU machine (such as the production server in question), however, it runs fine only 99% of the time. That hundredth time, here's what happens (assuming o is the same object in both threads):
Thread A                             Thread B
--------                             --------
TimedLock tl = TimedLock.Lock(o);
  Monitor.TryEnter(o);
  StackTraces.Add(o);
                                     TimedLock tl = TimedLock.Lock(o);
...                                    Monitor.TryEnter(o); // blocked
...                                    ...waiting
tl.Dispose();                          ...waiting
  Monitor.Exit(o);                     ...waiting
                                       StackTraces.Add(o); //******
  StackTraces.Remove(o);
The starred line is where the exception gets thrown. Textbook race condition -- if Thread B doesn't hit that Add() call between Thread A's calls to Monitor.Exit and StackTraces.Remove, then everything looks fine. But every once in a while (such as when processing a send and a receive simultaneously on a socket), it'll hit that tiny little target and blow the whole thing up.
What's worse is that, as written, once that target has been hit, that object can't be successfully TimedLocked (even though the original lock has been released) until the TimedLock that hit the exception has been finalized. Wrapping the TimedLock in a using statement doesn't save you, because the exception escapes Lock() before the using variable is ever assigned, leaving nothing for using to Dispose.
The fix? Simple -- swap the order of the Monitor.Exit() and StackTraces.Remove() calls, so that the object is removed from the hashtable before the monitor is released and any other thread has a chance to re-add it.
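In code, the corrected Dispose looks roughly like this (again a sketch, assuming a StackTraces hashtable keyed on the locked object):

```csharp
public void Dispose()
{
    // Remove the hashtable entry while we still hold the monitor, so no
    // other thread can acquire the lock and re-add the same key before
    // this entry is gone.
    StackTraces.Remove(target);
    Monitor.Exit(target);
}
```

Because the removal now happens inside the critical section, the window between Exit and Remove that Thread B was slipping through no longer exists.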
This all looks very cut and dried now that I've laid it out, but before anyone goes accusing Phil of not knowing his stuff, reread the subject of this post. Multithreading is hard. .NET (and other modern platforms) does a good job of hiding some of the complexity; for most WinForms apps, for instance, threading is very easy as long as you remember to use InvokeRequired and Invoke. For something more complex, such as a server app with multiple long-running threads that must access common resources, you need some help, and writing that help can be very difficult. It took me about three full days to find this bug, and all I have to say at the end is that if I weren't using a good helper class like TimedLock, it would have taken me much, much longer.
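For reference, the standard WinForms pattern mentioned above looks like this (statusLabel and UpdateStatus are hypothetical names for illustration):

```csharp
private delegate void StatusUpdater(string text);

// Safe to call from any thread: if we're on a worker thread,
// re-invoke this method on the UI thread that owns the control.
private void UpdateStatus(string text)
{
    if (statusLabel.InvokeRequired)
    {
        statusLabel.Invoke(new StatusUpdater(UpdateStatus),
                           new object[] { text });
    }
    else
    {
        statusLabel.Text = text;
    }
}
```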
One other lesson I've (re)learned: always, always, always test multithreaded code on a multiprocessor machine, because race conditions and other timing problems are so much easier to hit on that platform.