Service Fabric Streaming Trace Viewer

One of my features I find the most useful in the Service Fabric SDK tools for Visual Studio is the Streaming Trace Viewer window in Visual Studio. This is used to provide a streaming (i.e. close to real-time) view of the events generated by the SF runtime and the application running on the cluster. You can learn more about this feature from the official docs.

What is not immediately obvious about this feature is that this is just a generic ETW event listener, and so becomes useful on a whole bunch of other scenarios.

##Launching the Diagnostics Viewer If you read the documentation above, you’ll have noticed that you can open the streaming trace viewer from the Cloud Explorer tool window:

Launching streaming trace viewer

One big downside of this is that it requires not only that you have the Service Fabric SDK installed, but also that you have a local Service Fabric cluster deployed, and started.

##Capturing ETW Traces When the Diagnostics Viewer window opens, it is by default configured to collect traces from the Service Fabric ETW providers, which are:

Microsoft-ServiceFabric-Actors
Microsoft-ServiceFabric-Services
Microsoft-ServiceFabric:4:0x4000000000000000

Each line is basically the name of the ETW provider to enable. You can optionally specify the trace level and the keywords in the provider to enable, which allows you to be more selective about which ETW events to capture. If you’re familiar with using other providers in PerfView, you’ll recognize the syntax right away.

The beauty on this is that you can easily change the list of ETW providers that are enabled by clicking on the little gearbox icon (third one from the left):

Selecting ETW Providers

In the example above, I’m enabling some of the ETW providers in the Application Insights SDK to troubleshoot some issues related to dependency collection. Once you’ve customized the list, press the Apply button so it takes effect.

##Viewing Events Once collection is started, you can watch events scroll by on the grid:

ETW events scrolling by

You can also pause and restart the trace at any time using the first button at the top-left. For each event, you’ll notice that a TimeStamp, the event name and the textual message associated with the event will be displayed. However, you can also expand an event to view the full event detail in JSON format:

ETW event details

Also not immediately obvious: You can right-click on an event and select the “Copy Event” command from the context menu. This will copy the full JSON details to the clipboard.

##Conclusion ETW events are a fantastic tool in Windows and .NET for troubleshooting all kinds of issues. I’m a huge fan of PerfView; it’s an amazing tool and extremely useful. However, often I need something simpler to capture simpler events.

Overall, the Service Fabric Diagnostics Viewer is an awesome tool for this scenario, as it saves me the trouble of having to start a collection, reproducce a problem, and then wait until the trace is viewable to understand what’s going. For simpler scenarios, this is a far more productive way of going about it.

My only wish here is: I would love to have this as a standalone Visual Studio extension unrelated to Service Fabric. And if not that, at least be able to start the diagnostics viewer without having to have a local Service Fabric node deployed!

Culture is not supported

Today I was helping a customer troubleshoot an issue with a tool that is WPF based. What was happening was a bit surprising to me, and something I haven’t seen all that clearly documented, so thought that putting it in a blog post would be useful in case I run into it again.

The situation was this: * Customer connected from their local machine to a Windows Server 2008 R2 server through Remote Desktop Connection. * Customer ran the WPF tool. * When the customer tried to give focus to a TextBox, the application would blow up with the following error:

Culture is not supported.
Parameter name: culture
22538 (0x580a) is an invalid culture identifier.

Culture 0x580a is es-419 or Spanish (Latin America). Indeed, this gave us a clue: This culture was not originally supported in Windows Server 2008 R2 (It only started being available in Windows 8.1 and above). So we looked at the region settings for the user and the machine, and found that: * The server was configured with the region settings for Spanish (Argentina). * The client machine used to start the RDP session was configured for Spanish (Latin America).

We thought that was it! So we changed the client region settings and tried again, withouth success. Where was this es-419 culture coming from, then?

Time to go back to the code!

Looking at the stack trace for the exception, I noticed that it would go through the InputLanguageSource.CurrentInputLanguage, so I fired up ILSpy and went hunting for the System.Windows.Input.InputLanguageSource class, located in PresentationCore.dll, and found this:

internal InputLanguageSource(InputLanguageManager inputlanguagemanager)
{
    this._inputlanguagemanager = inputlanguagemanager;
    this._langid = (short)NativeMethods.IntPtrToInt32(SafeNativeMethods.GetKeyboardLayout(0));
    this._dispatcherThreadId = SafeNativeMethods.GetCurrentThreadId();
    this._inputlanguagemanager.RegisterInputLanguageSource(this);
}

You can already see the surprising aspect here: The culture is being instantiated not from the regional settings, but from the keyboard layout!

Sure enough, the customer had the keyboard layout set to Spanish - Latin America, which is a perfectly supported keyboard layout in Windows Server 2008, but does not match to a corresponding culture.

We then simply changed the keyboard layout to something else, and the error was resolved.

The story of a TF.exe hang

This morning I was setting up to move some new code I had written as an initial experiment into a new codebase, which is stored using Team Foundation Version Control (TFVC) in Visual Studio Team Services (VSTS). Before I could do that, I needed to go and update my local workspace to the latest version.

Unfortunately, Visual Studio 2017 was not being cooperative, and would get stuck with the “downloading files” dialog open everytime I tried it. So I decided to give TF.exe a try and see if the command line tool was doing the same. It was.

Annoyed with it, and with my search-fu failing me, I decided to attach trusty old WinDBG to the hung instance of TF.exe. What were the threads doing?

DbgId ThreadId ClrThread Apartment Kind       Action
===== ======== ========= ========= ========== ===================================================================================================
    0     31c4   104d2d0 STA                  MonitorWait: Microsoft.TeamFoundation.VersionControl.Client.AsyncOperation.WaitForCompletion(Int32)
    2     1b98   105b6b8 MTA       Finalizer  
    6      6bc   637a078 MTA       Worker     (idle)
    7     2854   637b690           Worker     (idle)
    9     7f10   63adef8 MTA       Background WaitHandle.WaitAny: 0x00000000040708bc
   12      4ec   71f7d00 MTA       IO         
   13     71e4   7232520 MTA       Worker     WaitHandle.WaitOne: 0x0000000003eacd64

Thread 12 didn’t seem to be relevant. Thread 13, however, seemed to be stuck on a file async operation:

DbgId ThreadId Apartment Kind   CLR          GC Mode    GC Suspending?
   13     71e4 MTA       Worker v4.7.2098.00 Preemptive no

WaitHandle.WaitOne
------------------
WaitHandle Type    Microsoft.Win32.SafeHandles.SafeWaitHandle
WaitHandle Address 0x0000000003eacd64

SP       IP       Function
08b5eb44 00000000 HelperMethodFrame_1OBJ [System.Threading.WaitHandle.WaitOneNative(System.Runtime.InteropServices.SafeHandle, UInt32, Boolean, Boolean)]
08b5ec28 72092f71 System.Threading.WaitHandle.InternalWaitOne(System.Runtime.InteropServices.SafeHandle, Int64, Boolean, Boolean)
08b5ec40 72092f38 System.Threading.WaitHandle.WaitOne(Int32, Boolean)
08b5ec54 72092fc2 System.Threading.WaitHandle.WaitOne()
08b5ec5c 720d16d9 System.IO.FileStreamAsyncResult.Wait()
08b5ec7c 7208769b System.IO.FileStream.EndWrite(System.IAsyncResult)
08b5ec8c 720b76ce System.IO.FileStream.FlushWrite(Boolean)
08b5ec9c 721267ee System.IO.FileStream.Dispose(Boolean)
08b5ecbc 72126786 System.IO.Stream.Close()
08b5ecc8 0f48294c Microsoft.TeamFoundation.VersionControl.Client.UploadDownload.DownloadBeginReadCallback(System.IAsyncResult)
08b5ed6c 713e83e9 System.Net.LazyAsyncResult..ctor(System.Object, System.Object, System.AsyncCallback, System.Object)
08b5ed80 713e829b System.Net.ConnectStream.BeginReadWithoutValidation(Byte[], Int32, Int32, System.AsyncCallback, System.Object)
08b5eddc 713e8193 System.Net.ConnectStream.BeginRead(Byte[], Int32, Int32, System.AsyncCallback, System.Object)
08b5ee04 0f483d6c Microsoft.TeamFoundation.VersionControl.Client.UploadDownload.DecompressAndWriteChunk(DownloadAsyncResult)
08b5ee20 0f4839e1 Microsoft.TeamFoundation.VersionControl.Client.UploadDownload.DownloadDestBeginWriteCallback(System.IAsyncResult)
08b5ee4c 720d1867 System.IO.FileStreamAsyncResult.CallUserCallbackWorker()
08b5ee58 720d1838 System.IO.FileStreamAsyncResult+<>c.<CallUserCallback>b__39_0(System.Object)
08b5ee5c 720705c2 System.Threading.QueueUserWorkItemCallback.WaitCallback_Context(System.Object)
08b5ee64 72109a4a System.Threading.ExecutionContext.RunInternal(System.Threading.ExecutionContext, System.Threading.ContextCallback, System.Object, Boolean)
08b5eed4 72109956 System.Threading.ExecutionContext.Run(System.Threading.ExecutionContext, System.Threading.ContextCallback, System.Object, Boolean)
08b5eee8 720714c0 System.Threading.QueueUserWorkItemCallback.System.Threading.IThreadPoolWorkItem.ExecuteWorkItem()
08b5eefc 72070c22 System.Threading.ThreadPoolWorkQueue.Dispatch()
08b5ef4c 72070a8a System.Threading._ThreadPoolWaitCallback.PerformWaitCallback()
08b5f170 00000000 DebuggerU2MCatchHandlerFrame

Looking at the async operation, we see indeed it’s not complete, even after minutes of waiting:

0x03ff028c System.IO.FileStreamAsyncResult
[statics]
  0000  _userCallback           : NULL
  0004  _userStateObject        : NULL
  0008  _waitHandle             : 03ff02bc (System.Threading.ManualResetEvent)
  000c  _handle                 : 03fd7814 (Microsoft.Win32.SafeHandles.SafeFileHandle)
  0010  _overlapped             : 03e90fd8 (System.UIntPtr)
  0014  _EndXxxCalled           : 1 (System.Int32)
  0018  _numBytes               : 0 (System.Int32)
  001c  _errorCode              : 0 (System.Int32)
  0020  _numBufferedBytes       : 0 (System.Int32)
  0024  _isWrite                : True (System.Boolean)
  0025  _isComplete             : False (System.Boolean)
  0026  _completedSynchronously : False (System.Boolean)

So clearly, the hang is happening because the async operation to close one of the .gz files used when downloading changes is never completing.

Where does this leave me? Nowhere, really, since I still have no clue why the file close is never completing. However, spending some quality time in WinDBG sure is better than throwing my laptop out the window!

Up next: Wiping my local workspace and spending 1+ hour downloading everything again.

Moved to Jekyll

I’ve updated my site, moving away from Wordpress. Instead it is now a static site build with Jekyll.

What's new in IIS 10.0

A new version of IIS is included in Windows 10, which was recently released. I was expecting it to be called "IIS 9.0" to follow the model used until now, but we've jumped all the way to "IIS 10.0". However, not much information seems to be available about what has changed, and the documentation at IIS.NET has yet to be updated.

I'll outline some changes I've noticed on this post. Most of them where found simply by going over the configuration schemas in %WINDIR%\System32\inetsrv\config\schema on my own.

Wildcard Host Header Support


This is a very useful feature that allows you to create a Web Site with a host header value of "*.<domain>", meaning that you don't have to create separate bindings for each subdomain you need to bind to the same Web Site! You can read more about this feature here.

IISAdministration PowerShell Module


The WebAdministration module for managing IIS from PowerShell was introduced in IIS 7.5. However, it suffers from several issues, including the fact that it often doesn't feel very "PowerShelly". With IIS 10.0, a new alternative PowerShell module has been added, called IISAdministration. You can read more about this here.

Environment Variables


In IIS 10.0, you can now easily specify a custom set of environment variables for an Application Pool, which will be defined for any worker process created from it. This is not supported in the UI, but you can easily add them using the Configuration Editor, or using AppCmd. For example, the following command will add a new environment variable to the ".NET v4.5" application pool named "MYVARIABLE":

appcmd set apppool ".NET v4.5" /+environmentVariables.add[@start,name='MYVARIABLE',value='C:\temp']

I think this is a very nice, useful feature, particularly for complex hosting environments with many application pools and where you might not want to define the variables at the machine level.

Log Event On Recycle Defaults


Up to IIS 8.5, the logEventOnRecycle property of application pools had a default value of "Time, Memory, PrivateMemory". This means that when an application pool was recycled for any other reason, WAS (Windows Process Activation Services) would not leave an audit event on the System Event Log telling you about it.

With IIS 10.0, the default value of logEventOnRecycle property has been changed to "Time, Requests, Schedule, Memory, IsapiUnhealthy, OnDemand, ConfigChange, PrivateMemory", which is much more useful and aligned with our recommended practices.

Support for Permanent Redirects


When using the "HTTP Redirection" feature, you can now tell IIS to respond with an status code of 308 Permanent Redirect.

Http Redirects

Remove the Server Header


One feature that the old UrlScan tool had that was missing from the Request Filtering module introduced in IIS 7.0 was the option to remove the IIS Server version header ("Server: Microsoft-IIS/10.0"). The option to do so has now been added back in IIS 10.0 with the removeServerHeader option. I don't think this has been exposed in the Management Console yet, but you can use the AppCmd.exe to enable it, like this:

appcmd.exe set config "Default Web Site" /section:system.webServer/security/requestFiltering /removeServerHeader:True

Tracing after Timeout in FREB


IIS 7.0 introduced a fantastic feature known as Failed Request Tracing (aka FREB). However, one issue with FREB was that if you had rules configured for collecting traces based on request execution time, the trace file would get written as soon as the time limit expired. That meant you usually ended up with a "partial trace" that didn't include all events until the request completed.

There is a new attribute you can configure on your rules called traceAllAfterTimeout in IIS 10.0. When it is enabled, IIS will wait until the request is actually completed before writing the trace file. This option, however, is not exposed in the User Interface, but you can still use command line tools to enable it. The following command will add a rule and then enable the option, for example:

appcmd configure trace "Default Web Site/" /enable /path:*.aspx /timeTaken:00:10:00
appcmd.exe set config "Default Web Site" -section:system.webServer/tracing/traceFailedRequests /[path='*.aspx'].failureDefinitions.traceAllAfterTimeout:"True"

Conclusion


Overall, I think these are all excellent, and welcome additions to IIS. Once the next version of Windows Server ships, we'll see if there are any other improvements implemented.

 

WCF: Be careful with allowCookies="true"

I always enjoy a good troubleshooting challenge, and recently my good friend and colleage Martin Jalaf came up with an interesting one.

A customer was load testing a Web application. Like many others, it consisted of farm of load-balanced servers running an ASP.NET application front-end application, which then talked to a set of WCF services running on a second server farm. In both cases, an F5 device was used to load balanced HTTP traffic.

When the customer started increasing the load as part of their load tests, they would notice that calls to the WCF services would get slower, and would sometimes take upwards of 10 minutes to complete. However, the servers running the WCF services, as well as the SQL Server used showed little CPU usage.

Reviewing a few memory dumps of the front-end servers, all that could be observed was threads stuck calling the WCF service. Given this, our first suspicion was that WCF throttling was incorrectly configured. However, dumps of the WCF service revealed this was not the case. In fact, only a single call to the WCF service was running at any given time, and throttling settings were too high to be causing any issues. This could be easily observed using the excellent netext debugging extension:

0:000> !wservice
Address State EndPoints BaseAddresses Behaviors Throttled Calls/Max Sessions/Max ConfigName,.NET Type
000000e9229c78d8 Opened 0n1 0n1 0n10 True 0n1/0n2000 0n0/0n400 "WCFSecurity.SecurityService",WCFSecurity.SecurityService

The WCF service was configured with InstanceContextMode=Single, and ConcurrencyMode=Multiple. However, this did not appear to be related at all. Another unexpected symptom was that despite the fact that the WCF servers were behind a load balancer device, all requests ended up being processed by the same server most of the time. I immediately suspected that the F5 device was configured with session affinity enabled, even though the customer had initially denied it.

Since the WCF service was exposed using an HTTP-based binding, I asked Martin to check the ASP.NET HttpContext instances, just in case they were somehow relevant. We immediately found that there were hundreds of HttpContext instances in the dump that had yet to complete!

0:000> !whttp
HttpContext Thread Time Out Running Status Verb Url
000000e622a12850 -- 00:00:00 00:20:52 200 POST /WCFSecurity/SecurityService.svc
000000e622a18e58 -- 00:00:00 00:20:47 200 POST /WCFSecurity/SecurityService.svc
000000e622a23c58 -- 00:00:00 00:20:39 200 POST /WCFSecurity/SecurityService.svc
....

Looking closely, we realized that all HTTP calls were stuck in the AcquireRequestState stage of the ASP.NET pipeline, which meant we were actually observing an instance of the infamous ASP.NET Session Lock issue (see the section titled Concurrent Requests and Session State)!

Session: cq2uoplja4aqu2aoobw3f1pf
HttpContext Thread Poll (ms) Time Path
================ ====== ========= ==== ===================================
000000e622a12850 500 1253 /WCFSecurity/SecurityService.svc
000000e622a18e58 500 1248 /WCFSecurity/SecurityService.svc
000000e622a23c58 500 1240 /WCFSecurity/SecurityService.svc
000000e622a25d50 500 1238 /WCFSecurity/SecurityService.svc
000000e622a27b60 500 1236 /WCFSecurity/SecurityService.svc
000000e622a283f8 500 1236 /WCFSecurity/SecurityService.svc
....

Why was ASP.NET Session being used here? The service was configured with:

<system.serviceModel>
  <serviceHostingEnvironment>
    aspnetCompatibilityEnabled="true"
  />
</system.serviceModel>

This meant that the service was enabling access to the ASP.NET services, but the service itself was not using session state. What was really unexpected in this case was that all the requests from a single front-end server were ending up with the exact same ASP.NET session cookie, particularly because, by default, the client-side of WCF will not accept and send cookies at all.

Given the issues with load-balancing not working as expected, we also checked the cookies collection of the HttpContext, and found out that, indeed, the F5 server had session affinity enabled using cookies:

0x000000e922d5ced0 System.Collections.Specialized.NameObjectCollectionBase+NameObjectEntry
0000 Key : 000000e922d5cc48 "BIGipServerpl_ATV_qa_80" [23] (System.String)
0008 Value : 000000e922d5cc00 (System.Web.HttpCookie)

With all this data, the light bulb went on, and we realized that the WCF binding on the ASP.NET front-end application had to have the allowCookies property enabled! We looked at it, and, sure enough it was enabled. However, this only explained why session cookies were present, but not why why the same session ID was used for all requests.

Digging a bit into the WCF source code, we realized that cookie support is implemented at the ChannelFactory level in the HTTP transport. The ASP.NET Front-End application was using standard proxy classes derived from ClientBase, which meant that by default, channel factory caching was enabled. This meant that, on a W3WP.EXE process on the front-end, every single call to the WCF service would end up using the same HTTP cookies.

Conclusion


Once the customer configured the binding with allowCookies="false" and fixed the other issues found, application performance significantly improved!

The moral of the story is: be careful when you enable the allowCookies option when calling WCF services over HTTP. In this case the combination of allowCookies, channel factory caching, aspnetCompabilitityEnabled, and an incorrectly configured load-balancer completely killed the application capability to scale.

 

Processing IIS ETW events using Azure Stream Analytics

Updated 2015/07/23: I've updated this post with some up-to-date information on Stream Analytics and some additional comments.

This blog post continues my series of blog posts about ETW logging in IIS 8.5. One of the things that I wanted to do from the start was using this as an excuse to dig into some Azure services. In the last entry, I updated the sample code to push ETW events collected from IIS in near real-time to an Azure Event Hub.

Now, I’d like to take that a bit further and actually process the events that make it to the hub. Naturally, I could write my own Event Hub listener that processes said events, but the Azure Stream Analytics service sounded like a pretty interesting way to do this, so I decided to give it a try!

The collector code serializes each ETW event into a flat JSON structure, using the header_ and event_ prefixes as necessary. Note, however, that Stream Analytics now supports hierarchical JSON structures (i.e. with nested objects), so you could use a different representation if you wanted

Configuring the Stream Analytics job.


The first step to use Stream Analytics is creating a job in the management portal:

clip_image002

Once your Stream Analytics job is created, we will want to connect it to our existing Event Hub that is receiving events from the collector service.

Creating an Input


Before doing this, we need to add a new SAS policy that Stream Analytics can use to read the events from the hub.

To do this, go to your Event Hub configuration page, look for the “shared access policies” section and add a new one with Manage permissions:

clip_image002[10]

Then click on the Save button to persist the changes.

Now go back to the Stream Analytics instance and select the Add Inputs option. Here’re we will want to connect it to our existing Event Hub instance:

clip_image002[12]

The type of the input should be Data Stream, Event Hub, and then enter an alias for the input and select the existing event hub and the right shared access policy:

clip_image004

Then select the serialization format and encoding:

clip_image002[14]

Creating a Query


Now that we have a way to get data into our Stream Analytics job, we can define one or more queries on top of the data. For example, let’s say we want to create a summary of the successful hits within a 5 minute window:

clip_image001

Creating an Output


Now let’s define an output for our query. In this case, I want to store the job output into a SQL Database on azure, which I have already created. Notice that you will want your DB to be on the same region as the Stream Analytics job! In this database, we’ll create a simple table to store the results:

clip_image001[7]

Go to the Output section of your Stream Analytics job and select the Add an Output option. Then select “SQL Database” as the destination, and enter your database details:

clip_image002

Note: If you use a database that has ‘-’ characters as part of the name, trying to create the output will result in an error, at least when I tried it.

clip_image002[16]Running the job


Now we can start our Stream Analytics job and test the results!

After the job has been running for a while, and we’re feeding data to the Event Hub, we start seeing the results in our database table:

clip_image004[6]

Overall, getting this working was surprisingly easy. Since I did my original tests, Stream Analytics and the documentation has improved substantially, making it even easier!

Supporting other event types


While my original idea was to handle IIS log information, there is no reason the same technique cannot be used to collect and process other ETW events. I've recently updated the code to make the core code independent of the IIS log provider, and now you can add/replace the existing ETW providers/parsers used with your own using MEF. This was very easy since the collector is based on the excellent TraceEvent library to collect ETW events.

Playing with Roslyn and VSIX

I have been playing a bit with the Roslyn APIs in Visual Studio "14". Since I am a big fan of VS editor extensions, my first attempt was building something simple: A Visual Studio tagger that would colorize uses of parameter variables and field variables in code:

Roslyn Colorizer

The code is ugly as it is, and does not use the Roslyn Async API, which is something I'd like to look into later on. I'm also not entirely sure if there's a better way to build such a thing (Roslyn has it's own concept of classification, but not sure if that is extensible). It was fun, though!

Code can be found here

ETW Logging in IIS 8.5 (2)

A few days ago I posted a sample on capturing ETW log events generated by IIS 8.5.

This turned out to be a good excuse to play around with the new Event Hubs service in Microsoft Azure, so I've now extended the sample with a collector application that can capture the ETW log events generated by IIS and push them in JSON format to an Azure Event Hub.

IIS - Event Hub

From here, I could process events in a number of ways, and it will be interesting to play with other Azure services to do just that later on. The sample code is updated on the github repository, but be aware it is very rough, but I'll keep improving it.

ETW Logging in IIS 8.5

One of the coolest features in IIS 8.5 (Windows Server 2012 R2) is the option to write W3SVC log entries to ETW (Event Tracing for Windows) instead of the normal W3C log files. I've published a sample on how to capture these events. Check out the readme for more details.