Thursday, 24 February 2011

SQL Server Schedulers and CPU Performance

For most applications, Windows handles (as you would expect) the scheduling of work onto each CPU. It uses a preemptive scheduling approach, which means that Windows gives each process a priority based on a combination of resource usage, interactivity and user preference.

Once a process reaches its turn, Windows assigns it to a core and it begins running, but the preemptive approach means that Windows can forcibly interrupt it at any point and let another process take over.

This works well for most applications, but not so well for SQL Server. This is because Windows treats SQL Server's worker threads with the same priority algorithms as any other process, and this can lead to lots of context switching.

To get around this issue, SQLOS has its own CPU scheduler, called the SOS Scheduler. This allows SQL Server to schedule its own processor workload, using a cooperative scheduling algorithm. This means that instead of forcing a thread to yield, it allows threads to yield voluntarily.

But hold on a moment, surely this could lead to a situation where a runaway thread just takes control of the entire system and does not let anything else get a look in, right? Thankfully, no! Instead, SQL has a process called the Scheduler Monitor, which is a thread that periodically checks to make sure that active schedulers have yielded since the last time it checked. Unlike other SQL threads, this thread works in preemptive mode, so it cannot cause the above issue itself.

So how many schedulers do you have? Is there a one-to-one mapping with CPU cores? Well, not exactly. There will be one scheduler for each CPU core in the server (even if SQL has been configured not to use some of them, by setting processor affinity), one scheduler for the DAC (so that even if something has gone badly wrong with the schedulers, you can still connect), and then one scheduler for each Resource Monitor.

Resource Monitor is a process that helps respond to memory pressure in the caches, and there will be one for every NUMA node within the system, regardless of whether you have hard or soft NUMA. (Non-Uniform Memory Access is a way of improving performance by splitting memory between processors or groups of processors, so that each processor can access its local memory quickly - but I digress further and further, so I will talk about this another time!)

You can see your schedulers by querying the sys.dm_os_schedulers DMV. This will give you lots of information, but there are some columns that are especially worthy of note.
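
For example, a quick look at them might go something like this (the column names here are from SQL Server 2008, so check Books Online for your version):

    -- One row per scheduler; parent_node_id shows which NUMA node each one belongs to
    SELECT scheduler_id,
           parent_node_id,
           status,
           current_tasks_count,
           runnable_tasks_count,
           active_workers_count
    FROM sys.dm_os_schedulers;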

First up is a column called status. Here you may see various values made up of ONLINE/OFFLINE and VISIBLE/HIDDEN. If a scheduler is Online, it means the SQL instance has access to it. Offline means that you have set CPU affinity and SQL is not allowed to use that core. The reason it is set to Offline, rather than removed, is that it allows you to change your CPU affinity without the need to restart the instance. (Very cool!)

If a scheduler is Visible, it means that it maps directly to a CPU core and will accept user requests. If it is set to Hidden, it means it is associated with a Resource Monitor. The scheduler associated with the DAC is easy to identify, as its status ends with (DAC)! ;-)
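
If you want to see this for yourself, something along these lines breaks the schedulers down by status (VISIBLE ONLINE, HIDDEN ONLINE, VISIBLE ONLINE (DAC) and so on):

    -- How many schedulers of each type do we have?
    SELECT status,
           COUNT(*) AS scheduler_count
    FROM sys.dm_os_schedulers
    GROUP BY status;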

The next interesting column is preemptive_switches_count. Now, you will remember me telling you earlier that SQL threads are cooperative, rather than preemptive, so why do we have this counter? Well, if SQL needs to run anything outside of the SQL Server environment (such as distributed queries, or certain CLR objects, etc), then it must switch to preemptive mode, so that Windows can schedule it to run.
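
If you are curious which schedulers have been doing a lot of this switching, a query like this gives a rough idea:

    -- Schedulers that have switched to preemptive mode most often
    SELECT scheduler_id,
           preemptive_switches_count,
           context_switches_count
    FROM sys.dm_os_schedulers
    WHERE status = 'VISIBLE ONLINE'
    ORDER BY preemptive_switches_count DESC;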

runnable_tasks_count brings some context to our understanding of SQL waits. A request in SQL can be in one of three states: RUNNING, RUNNABLE or SUSPENDED. Running is fairly obvious. Suspended means that it is waiting for a resource to become available (this is where you see wait types associated with disk, such as the PAGEIOLATCH waits, network waits, such as ASYNC_NETWORK_IO, and transaction log bottlenecks, such as WRITELOG, etc). Runnable means that it is ready to run, but is waiting for a scheduler to become available.
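
You can see the state of each request, and what the suspended ones are waiting for, with something like this against sys.dm_exec_requests (again, column names from SQL Server 2008):

    -- Current requests, their state, and what the suspended ones are waiting on
    SELECT session_id,
           status,          -- running, runnable, suspended...
           wait_type,       -- e.g. PAGEIOLATCH_SH, ASYNC_NETWORK_IO, WRITELOG
           wait_time,
           scheduler_id
    FROM sys.dm_exec_requests
    WHERE session_id > 50;  -- rough filter to skip most system sessions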

work_queue_count shows you the number of tasks that are assigned to the scheduler and waiting to run. pending_disk_io_count shows the number of IOs on the scheduler that are waiting to be completed. These can be helpful for diagnosing resource pressure.
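
As a rough and ready check, something like this shows those counters for the schedulers that take user work (a sustained non-zero runnable_tasks_count or work_queue_count suggests CPU pressure, while a high pending_disk_io_count points at the disks):

    -- A quick check for CPU and IO pressure across the visible schedulers
    SELECT scheduler_id,
           runnable_tasks_count,
           work_queue_count,
           pending_disk_io_count,
           load_factor
    FROM sys.dm_os_schedulers
    WHERE status = 'VISIBLE ONLINE';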

Whilst we are on the subject of resource pressure, it is quite common on multi-processor servers to see a large number of CXPACKET waits. These waits can often be a symptom of badly tuned queries, and can be resolved by introducing proper indexing strategies, etc, but sometimes they occur because parallelism itself is the bottleneck. But parallelism is good, right? Well, not for OLTP systems. It would be nice if there was a way to tell SQL the type of workload to expect, but there isn't. If your queries are tuned and you are still getting lots of CXPACKET waits, then basically a query is running in parallel, one core has already finished its work, and it is waiting to receive a packet from another core so that it can synchronise, etc. The way to get around this is to reduce the MAXDOP setting for the query (or for the whole instance in some cases).
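
As a sketch, you can cap parallelism for a single query with a hint, or for the whole instance with sp_configure (the table in the first query is just a made-up example):

    -- Cap parallelism for one query with a hint
    SELECT OrderID, SUM(LineTotal) AS OrderTotal
    FROM dbo.OrderDetails          -- hypothetical table, for illustration only
    GROUP BY OrderID
    OPTION (MAXDOP 1);

    -- Or cap it for the whole instance
    EXEC sp_configure 'show advanced options', 1;
    RECONFIGURE;
    EXEC sp_configure 'max degree of parallelism', 4;
    RECONFIGURE;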

If, however, you reduce your MAXDOP setting, you can suddenly see your CXPACKET waits being replaced by SOS_SCHEDULER_YIELD waits. This is because, now that you have reduced the parallelism, you are coming under CPU pressure on the cores that are still being used. Basically it is telling you that a thread has voluntarily yielded so that another thread can take over, and is now waiting for its next turn on the scheduler.
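
One rough way to gauge whether CPU pressure is behind your waits is to compare signal wait time (time spent waiting for a scheduler after the resource became available) with total wait time, along these lines:

    -- What proportion of wait time is spent just waiting for a scheduler?
    SELECT SUM(signal_wait_time_ms) AS signal_wait_ms,
           SUM(wait_time_ms)        AS total_wait_ms,
           CAST(100.0 * SUM(signal_wait_time_ms)
                / SUM(wait_time_ms) AS DECIMAL(5, 2)) AS signal_wait_pct
    FROM sys.dm_os_wait_stats;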

So if you are coming under CPU pressure, what can you do about it? Well, that is another long blog for another time, but here are a few suggestions.

1) Check your benchmarks to see if you can establish a usage increase pattern (You are taking regular benchmarks, right?) You may find that it is a natural curve, or you may be able to correlate a sharp spike to a particular release, and then go troubleshooting...

2) Run the following query, then click through to the query plans of the most expensive queries to see if they can be optimized:

    SELECT *
    FROM sys.dm_exec_query_stats a
    CROSS APPLY sys.dm_exec_query_plan(a.plan_handle) p
    ORDER BY a.total_worker_time DESC;

3) Run a Perfmon trace simultaneously with a Profiler trace. You can then correlate the two traces in Profiler to identify the queries causing CPU pressure and investigate.

I hope you enjoy reading this post as much as I enjoyed writing it. After a week spent playing with a very large FastTrack at a Microsoft Lab, I felt the overwhelming urge to write something geeky! :)
