Strange COM+/SC Problem with Threads

September 30, 2006. Posted in: .NET | Enterprise Services | Web Services

I've been spending most of today diagnosing a fairly strange issue that has been nagging one of our services in production, and which we hadn't yet been able to fully repro until today. The basic application itself is fairly simple: It's an ASP.NET v1.1 Web Service that calls into a custom ServicedComponent configured locally in a COM+ Server application:

[Figure: Service Architecture]

It worked fine for a while, but a few weeks ago we started getting intermittent failures: after a few days of constant running, the COM+ application would stop working and start returning errors when making a call to a remote ServicedComponent. The remote component works fine (it was a preexisting component built for another project, and it works fine for everyone else). All of this is running on Windows 2000 Server, by the way.

The really weird issue is that the errors the application would start returning were of two different kinds:

  1. It would initially start throwing InvalidCastExceptions saying "QueryInterface for interface System.IDisposable failed" when our COM+ component called Dispose() on the remote ServicedComponent's proxy.
  2. It would eventually start throwing another exception saying "Old format or invalid type library".

Once those started appearing, our COM+ Server Application became a dead-weight. It didn't crash, but *all* requests caused one of the errors mentioned above. Restarting IIS (to reset the ASP.NET Web Service) didn't work, and only shutting down and restarting the COM+ Server application would clear it up.

We had tried several things to resolve this, including the obvious ones like checking that we were Dispose()'ing everything correctly (we had missed a few cases, and fixed them), completely unregistering our serviced components and registering everything again, checking that we had the .NET FX v1.1 SP1 installed (we did) and so forth, without any luck.

Today we finally got some good information about what seems to be causing the problem, though why it happens is still a mystery to me at this point. Something in our COM+ Server application appears to be causing the thread pool to misbehave: it looks as if, for almost every request received by the server, a new thread is created, used and then discarded. It doesn't get destroyed. It doesn't return to the pool to be reused. It just sits there sleeping.

We discovered this after noticing that the thread count of the COM+ server process just kept growing as we threw requests at it. We didn't notice this before because response time didn't seem to be much affected, and while memory usage was on the rise, it didn't grow steadily and at intervals large chunks of memory would get reclaimed by the system. Obviously, we should've done more extensive load testing to discover this before.

Anyway, the COM+ server process eventually reaches a state where the virtual memory size is around 150MB or more and the Thread Count for the process reaches a little over 7000 (yes, 7000), at which point it stops servicing requests as I described before. Having discovered this, I got out trusty old Windbg and, after running a smaller repro scenario (processing just a few hundred requests), took a minidump of the COM+ server process and brought it back with me for analysis.

I've spent the rest of the day looking at it, though I must admit I haven't found anything too useful yet. Looking at all the thread stacks (all 93 of them, whoooohooo!), some are easily recognizable and expected, like some RPC and threadpool control threads. However, the vast majority of the threads (around 70 of them) show this as the thread stack:

0:000> ~37 kb
ChildEBP RetAddr Args to Child
04eef514 7c59a28f 00000001 04eef52c 00000000 NTDLL!ZwDelayExecution+0xb
04eef534 79233c74 00003a98 00000001 04eef7c0 KERNEL32!SleepEx+0x32
04eef554 79233cec 00003a98 04eef58c 010b5d08 mscorwks!Thread::UserSleep+0x93
04eef564 00e4b69e 04eef570 00003a98 017fc55c mscorwks!ThreadNative::Sleep+0x30
WARNING: Frame IP not in any known module. Following frames may be wrong.
04eef5b4 791d94bc 04eef6cc 791ed194 04eef608 0xe4b69e
04eef5bc 791ed194 04eef608 00000000 04eef5e0 mscorwks!CallDescrWorker+0x30
04eef6cc 791ed54b 00bc82f3 79b7a000 04eef6f8 mscorwks!MethodDesc::CallDescr+0x1b8
04eef788 791ed5b9 79bc82f3 79b7a000 79b91d95 mscorwks!MethodDesc::CallDescr+0x4f
04eef7b0 792e87b7 04eef7f4 03a54298 791b32a8 mscorwks!MethodDesc::Call+0x97
04eef7fc 792e8886 04eef814 03a5a598 792e87c4 mscorwks!ThreadNative::KickOffThread_Worker+0x9d
04eef8a0 791cf03c 03a5a598 00000000 00000000 mscorwks!ThreadNative::KickOffThread+0xc2
04eeffb4 7c57b388 000a7898 006f0063 00000003 mscorwks!Thread::intermediateThreadProc+0x44
04eeffec 00000000 791ceffb 000a7898 00000000 KERNEL32!BaseThreadStart+0x52

From this, it would appear that the threads were indeed being put to sleep right after processing a request. But why, and by what, is still something we're trying to figure out.
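
With dozens of threads to look through, it can help to group them by their top few frames instead of reading each stack by hand. Here is a minimal sketch of that idea in Python, assuming the output of a `~*kb` listing has been saved to a text file. The function name, the header pattern and the choice of four frames are my own assumptions for illustration, not part of any debugger tooling.

```python
import re
from collections import Counter

# Assumed header format for each thread in "~*kb" output, e.g.
# "  37  Id: 8a4.b10 Suspend: 1 Teb: 7ffa4000 Unfrozen"
THREAD_HEADER = re.compile(r"^\s*[.#]?\s*\d+\s+Id:\s")
HEX8 = re.compile(r"^[0-9a-fA-F]{8}$")


def stack_signatures(dump_text, depth=4):
    """Count threads by the symbols of their top `depth` frames."""
    stacks = []  # one list of symbols per thread, innermost frame first
    for line in dump_text.splitlines():
        if THREAD_HEADER.match(line):
            stacks.append([])
            continue
        parts = line.split()
        # Frame lines start with two 8-digit hex columns (ChildEBP, RetAddr).
        is_frame = (
            len(parts) >= 3 and HEX8.match(parts[0]) and HEX8.match(parts[1])
        )
        if is_frame and stacks:
            # The last column is the symbol; drop offsets such as "+0x93".
            stacks[-1].append(parts[-1].split("+")[0])
    return Counter(tuple(stack[:depth]) for stack in stacks if stack)
```

Calling `stack_signatures(text).most_common(3)` on such a listing would put the sleeping threads at the top of the list, since they all share the same ZwDelayExecution / SleepEx / Thread::UserSleep frames.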

Some things might be worth mentioning at this point:

  • We're not manually creating new threads in our code.
  • We're not doing any async operations or calling any async delegates, so it's not a case of us forgetting to call EndXXX() somewhere.
  • We're not explicitly using the managed ThreadPool anywhere.
  • We're not hanging on to the request thread (after all, until the error happens we're getting responses right away to the calling Web Service).

We still have a few things we want to try out, and we're considering getting the most recent COM+ rollup package of hotfixes for Windows 2000 onboard to see if it helps, though we haven't seen anything specific in the KB articles that points directly at a problem like ours. I'm still very reluctant to say this is a platform bug, though; there might be something we did to cause it and we're just missing it.

If anyone happens to have any ideas or suggestions, they would sure be appreciated!

Update 2006/09/30: I've been playing with this some more, and realized I was doing something stupid: I totally forgot about loading the SOS (Son Of Strike) extension to the debugger. With this in hand, I've been able to discover a few extra things:

Using the !comstate and !threads commands reveals that most of the threads being created are not even related to the COM+ thread pool at all. Here's the breakdown of threads:

  • 1 is the finalizer thread (thread 13)
  • 1 is a threadpool worker thread (thread 19)
  • 9 threads are in the MTA, and 1 in the STA
  • All the rest of the threads are not even in a COM apartment at all.

Looking at all the other "weird" threads with SOS, I was able to dump a more useful stack trace:

0:000> ~37e !clrstack
Thread 37
ESP EIP
0x04eef58c 0x77f883a3 [FRAME: ECallMethodFrame] [DEFAULT] Void System.Threading.Thread.Sleep(I4)
0x04eef59c 0x03943e37 [DEFAULT] [hasThis] Void Microsoft.Practices.EnterpriseLibrary.Configuration.Storage.ConfigurationChangeWatcher.Poller()
0x04eef7c0 0x791d94bc [FRAME: GCFrame]

Hmm... interesting. We do use the Enterprise Library for database access in this application (though it's just calling a few SPs that manage a transaction log), and now it appears it's ENTLIB creating watcher threads. Certainly something we'll look at Monday morning; we may be initializing something wrong and I have an idea as to what it might be.

BTW, did I mention Windbg + Symbol Server + SOS rocks?

Update 2006/10/02: Yep, verified; the problem was the Enterprise Library. We got rid of it, and the problem went away. Go figure!
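
For anyone wondering how a watcher can end up producing thousands of sleeping threads, here is one plausible pattern, sketched in Python rather than .NET. This is purely an illustration of the symptom, not a confirmed explanation of what our code or the Enterprise Library was doing: `PollingWatcher` and `WatcherCache` are hypothetical names, and the 15 second interval simply mirrors the 0x3a98 ms argument passed to SleepEx in the native stack above.

```python
import threading


class PollingWatcher:
    """Hypothetical stand-in for a watcher that polls on its own thread."""

    def __init__(self, interval_seconds=15.0):
        self._interval = interval_seconds
        self._stop = threading.Event()
        self._thread = threading.Thread(target=self._poll, daemon=True)
        self._thread.start()

    def _poll(self):
        # Sleep between checks until close() is called.
        while not self._stop.wait(self._interval):
            pass  # a real watcher would check for changes here

    def close(self):
        self._stop.set()
        self._thread.join()


class WatcherCache:
    """Creates the watcher once and hands out the same instance afterwards."""

    def __init__(self):
        self._lock = threading.Lock()
        self._watcher = None

    def get(self):
        with self._lock:
            if self._watcher is None:
                self._watcher = PollingWatcher()
            return self._watcher

    def close(self):
        with self._lock:
            if self._watcher is not None:
                self._watcher.close()
                self._watcher = None


def extra_threads(requests=50):
    """Return (threads added by per-request watchers, threads added by a cached one)."""
    baseline = threading.active_count()

    # One new watcher per simulated request, none of them closed yet.
    leaked = [PollingWatcher() for _ in range(requests)]
    per_request = threading.active_count() - baseline
    for watcher in leaked:
        watcher.close()

    cache = WatcherCache()
    for _ in range(requests):
        cache.get()  # every simulated request reuses the same watcher
    cached = threading.active_count() - baseline
    cache.close()

    return per_request, cached


if __name__ == "__main__":
    print(extra_threads())
```

In this sketch the per-request version adds one sleeping thread for every simulated request, while the cached version adds a single thread no matter how many requests come in, which is the kind of difference that would be invisible in response times but very visible in the process thread count.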

Clemens, ES and JITA

December 28, 2005. Posted in: Enterprise Services
Clemens has posted a new article on Enterprise Services (COM+) and the power of JIT-Activation and pooling proxy instances to ServicedComponent objects. As usual, Clemens shines in explaining the way COM+ and Enterprise Services work and provides a very clear overview and explanation of the techniques he shows. Excellent stuff; I'm definitely keeping it around here to take advantage of later on.

Enterprise Services 2.0 and Transaction Promotion + Delegation

July 20, 2005. Posted in: Enterprise Services
Robert Hurlbut talks here about Paul Fallon's comment that Enterprise Services will get Transaction Promotion and Delegation support. I agree this is great news. I'm already very excited about System.Transactions, as the API is extremely nice, and we're planning on using it for the project we are about to begin to control transactional support across plugins attached to an event-driven processing pipeline for the application.

Robert also asks: "I noticed recently through Reflector that Systems.Transactions references System.EnterpriseServices in its assembly (Systems.Transactions.dll). I know that S.T uses managed calls to MSDTC when a transaction has been promoted to a full distributed transaction, but I wonder if ES is still in the mix somehow as a ServicedComponent wrapper?"

Reflector also has this really cool feature that if you right click on a referenced assembly, you can select the View Imports option and it will show you all the classes/methods that are used from that assembly. Here's what it shows for Beta 2 System.Transactions/System.EnterpriseServices:

System.EnterpriseServices.ContextUtil.get_IsInTransaction() : Boolean
System.EnterpriseServices.ContextUtil.get_SystemTransaction() : Transaction
System.EnterpriseServices.ContextUtil.IsDefaultContext() : Boolean
System.EnterpriseServices.ServiceConfig
System.EnterpriseServices.ServiceConfig..ctor()
System.EnterpriseServices.ServiceConfig.set_BringYourOwnSystemTransaction(Transaction) : Void
System.EnterpriseServices.ServiceConfig.set_Synchronization(SynchronizationOption) : Void
System.EnterpriseServices.ServiceDomain.Enter(ServiceConfig) : Void
System.EnterpriseServices.ServiceDomain.Leave() : TransactionStatus
System.EnterpriseServices.SynchronizationOption
System.EnterpriseServices.TransactionStatus

Diablo and Indigo

December 7, 2004. Posted in: Enterprise Services
While searching for a few things this morning, I ran into an older post from Christian, pondering whether Indigo was supposed to be the next version of COM+ or not. This, of course, has been discussed quite a bit in posts from others like Don [2] and Richard.

What did intrigue me was that, shortly after that, I ran into these old posts from Steve Swartz on the DCOM mailing list, where he mentions "Diablo" as being "the code name for the managed service environment that will ship next year" (that was 2001, btw).

Does anyone care to tell the uninformed (aka "me") what Diablo was supposed to have been (or is) and what relation it ended up having with Indigo, if any?

EnterpriseServices based applications

March 21, 2004. Posted in: Enterprise Services
Enterprise Services is a pretty nice technology overall (as is COM+ in general), but, unfortunately, it does have a few significant downsides in .NET-based applications. The team I'm on has been working for quite a while on a fairly large .NET application (250,000+ LOC and growing) which is based heavily on EnterpriseServices, so we do have some experience with the subject :)

That aside, here are some of the downsides we've run into:

  • The already well known problem with exception propagation
  • The Compile-Register-Test cycle time is brutal. Registering and deregistering ServicedComponents in COM+ is an extremely expensive operation.
  • We have tons of boilerplate code we need to write on each method of our serviced components to do custom logging, security and a few other things. This sucks, but well, unless we switched to some sort of pipeline-based framework or had AOP, it's something we'd have to live with anyway :)
  • No support for app.config files for Server applications. Yes, WinServer 2003 supports it, but that doesn't help us in any way, since our app must run under Win2000. The biggest problem with this is not writing your own configuration code; we already have a bit of glue for this that uses IProcessInitializer and custom XML files. The problem is that most 3rd party components you will find out there rely heavily on app.config files, so if you can't support them, you again need to write some (small) extra code to support them.
  • Using IProcessInitializer: It can sometimes be a pain to track down problems at this point. DebugView seems to be our strongest ally so far :)

The registration time is probably the issue that is causing us the most pain right now... Our developers are suffering cycle times of up to 5 minutes sometimes, and that's not only very frustrating, it is a real and very significant problem because it really kills productivity.... if you compile, say, 10 times a day (and that's a very, very low number), then you're wasting almost an entire hour just waiting for the process to finish!
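
A quick back-of-the-envelope check of that estimate, as a small Python sketch. The function name is made up for illustration, and 5 minutes is the worst case mentioned above, so the result is an upper bound rather than a measured average.

```python
def minutes_waiting_per_day(cycles_per_day, minutes_per_cycle):
    """Total minutes one developer spends waiting on registration each day."""
    return cycles_per_day * minutes_per_cycle


# Figures from the post: 10 compiles a day at up to 5 minutes each.
worst_case = minutes_waiting_per_day(10, 5)
print(f"{worst_case} minutes per developer per day")
```

That comes to 50 minutes per developer per day, and it grows linearly with the number of compiles.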

Yes, I know getting faster machines would help (the current set is a couple of years old!), but unfortunately, that's not an option.

So, as one of the people in charge of the application's architecture, I'm heavily looking into anything that can help us bring down this cost, including something that seems pretty ugly at first sight: heavily cutting the number of ServicedComponent-derived classes... our current system has well over 60 ServicedComponent classes.

How could I do this? Good question. Unfortunately, I haven't got a good answer right now. I've been playing with a few tricks, but nothing seems very clean. I've also seriously considered moving most of the application back-end code into WSE's SoapService-derived classes, and then exposing a single end-point on the COM+ application that deals with the basic stuff (of course, some special classes wouldn't fit this paradigm, but I could live with it). Heck, I've even got a prototype SoapTransport implementation that implements this.

Of course, in such a model, taking heavy control of distributed transactions and stuff becomes slightly more complicated, but I'm confident I could come up with a way to extend our framework (if I can call it that) to provide some support for this. However, it somehow feels... well.... strange, to say the least (it also has quite a few benefits, such as allowing us to take advantage of pipeline-based execution of some of the pre- and post-processing required). I could, of course, build a solution from scratch, but that seems even uglier :)

One bit that of course worries me about going this way is that migration to newer technologies (Indigo, for example) could eventually be problematic. And it would certainly add more complexity to the solution, which can't be a good thing...

Does anyone have any comments? Options? (Losing COM+ is not an option; one of the requirements we currently have is support for distributed transactions.) I'd really, really appreciate any ideas!

About

Tomas Restrepo is co-founder of devdeo ltda. His interests include .NET, Connected Systems, PowerShell and, lately, dynamic programming languages.

Copyright © 2002-2008, Tomas Restrepo.