Tuesday, 29 March 2011

The Five Horsemen Of The SQL Server Apocalypse!

Ok, so it should be four horsemen, but there was a two-way tie for fourth place. What can I say?

Thank you all for your votes in the What is the WORST Performing Feature Of SQL Server poll (see this post).
You can see the results and percentages in a nice little chart at this link, and they are certainly interesting, so let's look at the winners in a bit more detail...

The clear winner is Auto shrink / Scheduled shrink. But why is this so bad? Well, basically it kills performance, both during the shrink operation and, worse still, after the operation has finished. When you shrink a file, SQL Server starts at the end of the file and works backwards, taking each page and moving it to the earliest available free space within the file.

This is a resource-intensive operation that can cause performance issues whilst it is running, but it also leaves your indexes highly fragmented. I have already blogged about this in detail, so I won't repeat myself too much here, but please check out this post.
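If you want to check whether any of your databases have this switched on, a quick way is to query sys.databases and then turn it off with ALTER DATABASE. (The database name below is just an example.)

SELECT name, is_auto_shrink_on
FROM sys.databases
WHERE is_auto_shrink_on = 1

ALTER DATABASE [AdventureWorksDW2008R2] SET AUTO_SHRINK OFF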

Next in the ranking is Autogrow logs by 1MB. Why is this so bad? Well, once again, the problem is two-fold. Firstly, if your log is coming under space pressure, and therefore needs to grow, it can cause huge performance problems, as SQL Server may need to grow the log file (in some circumstances) for almost every transaction. Remember, depending on the action you are performing, SQL may need to record before and after images of the data. I have also experienced issues in the past where, during large operations, the database suddenly ends up in Recovery because it has not been able to keep up with the number of log grows required; in my case this eventually resulted in the log becoming corrupt, which (the DB was nearly 1TB) meant over an hour of downtime, and of course the transaction being rolled back. See this post for some tips on Log File Optimization.
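If you do need to let the log grow, a far better approach is to give it a sensible, fixed growth increment rather than 1MB (or a percentage). Something along these lines (note that the logical log file name below is an assumption; yours will differ)...

ALTER DATABASE [AdventureWorksDW2008R2]
MODIFY FILE (NAME = AdventureWorksDW2008R2_Log, FILEGROWTH = 512MB)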

The other issue is Log File Fragmentation. Log fragmentation? What on earth is that? Well, basically, if you grow your log file in tiny chunks, you end up with a massive number of VLFs (Virtual Log Files). On the VLDB I mentioned above, the client ended up with over 2,000. The rule-of-thumb recommendation is not to have more than 50! See this post for more info.
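If you want to see how many VLFs you currently have, you can use DBCC LOGINFO, which (although undocumented) returns one row per VLF, so the number of rows returned is your VLF count...

DBCC LOGINFO ('AdventureWorksDW2008R2')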

Very close behind, in 3rd place, came Functions in Select list. Why not put a function in a SELECT list? That is what they are there for, isn't it? No! No, no, no, no, no, no, no! NO! :) If you put a scalar function in a SELECT list, it has to be evaluated separately for every single row of the query, which causes a "cursor-like" effect, and it often cannot be included in the same execution plan. Let's have a look at an example, and at how we could perform the task better...

Using the AdventureWorksDW2008R2 database, I have created the following example, which is a reproduction of one of the worst SQL implementations I have seen. (The code has been changed to protect the guilty!) :)

The first thing I am going to do is create some "dirty data" in the FactResellerSales table, to demonstrate why the code existed, with the following script...

  UPDATE [AdventureWorksDW2008R2].[dbo].[FactResellerSales]
  SET OrderQuantity = 0
  WHERE ProductKey = 351


  UPDATE [AdventureWorksDW2008R2].[dbo].[FactResellerSales]
  SET UnitPrice = 0
  WHERE ProductKey = 223


...Now to create the function, which I will call SaftyDiv, and which will basically stop a divide-by-zero error occurring...

CREATE FUNCTION dbo.SaftyDiv(@FirstNumber DECIMAL, @SecondNumber DECIMAL)
RETURNS DECIMAL
AS
BEGIN
     DECLARE @ReturnValue DECIMAL

     IF (@FirstNumber = 0)
     BEGIN
          SET @ReturnValue = 0
     END

     ELSE IF (@SecondNumber = 0)
     BEGIN
          SET @ReturnValue = 0
     END

     ELSE
     BEGIN
          SET @ReturnValue = @FirstNumber / @SecondNumber
     END

     RETURN @ReturnValue
END


...So, there are so many things wrong with this function that I will not patronise you, or drive myself insane, by listing them all, but let's see how it performs when we use it to select values from the table...

SELECT
       [SalesTerritoryKey]
      ,[SalesOrderNumber]
      ,[SalesOrderLineNumber]
      , [dbo].[SaftyDiv] ([OrderQuantity], [UnitPrice])
 
  FROM [AdventureWorksDW2008R2].[dbo].[FactResellerSales]


 SQL Server Execution Times:
   CPU time = 438 ms,  elapsed time = 1559 ms.

...If you look at the execution plan for this, it looks very clean, but that is in fact because SQL was unable to include the function in the same plan, and actually had to create a second plan for it!
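If you want to see this for yourself, you can peek into the plan cache after running the query; something like the following (standard DMVs, nothing clever) should show separate entries referencing SaftyDiv, one for the ad-hoc batch and one for the function...

SELECT cp.objtype, cp.usecounts, st.text
FROM sys.dm_exec_cached_plans AS cp
CROSS APPLY sys.dm_exec_sql_text(cp.plan_handle) AS st
WHERE st.text LIKE '%SaftyDiv%'
-- (you may also see this query itself in the results)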

So let's see what performance we get if we tear down the buffer cache and then rewrite this query to use a CASE expression in the SELECT list. To be honest, there are more elegant ways of doing it than this, but it's getting late, and this demonstrates the point...
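In case you want to reproduce the tests, this is roughly what I mean by tearing down the caches between runs (do NOT run this on a production server). The "SQL Server Execution Times" output quoted with each query is what you get when you turn on time statistics.

CHECKPOINT
DBCC DROPCLEANBUFFERS   -- empties the buffer cache
DBCC FREEPROCCACHE      -- empties the plan cache

SET STATISTICS TIME ON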

SELECT
       [SalesTerritoryKey]
      ,[SalesOrderNumber]
      ,[SalesOrderLineNumber]
      , CASE WHEN ([OrderQuantity] = 0 OR [UnitPrice] = 0) THEN (SELECT 0)  ELSE (SELECT [OrderQuantity] / [UnitPrice] ) END 


  FROM [AdventureWorksDW2008R2].[dbo].[FactResellerSales]


 SQL Server Execution Times:
   CPU time = 94 ms,  elapsed time = 1177 ms.

...So you can see, we got a 78.5% improvement in CPU time (because it only needed to compile one plan instead of two), and we also got a 24.5% improvement in elapsed time. Remember, in this example we are only dealing with approx. 65,000 rows. Imagine if we started scaling that up to millions of rows!

Tied for fourth place were Encrypt all data with cell level encryption and Cursors. Now I must be honest, this was a surprise for me; I thought that cursors would be right up there with Auto/Scheduled shrink. Why did I think that? Well, they are the bane of my life! With no disrespect meant whatsoever to .NET developers, you can always tell when somebody who is experienced in writing .NET code, but less so in SQL, has been writing SQL code. That is because they love cursors! And it makes perfect sense: in .NET languages, looping is often the best way to achieve your goals, but T-SQL is a SET-based language, meaning it is optimized for performing an operation on multiple rows at the same time, as opposed to looping over a set of rows, which is exactly what a cursor does.

There is a place in T-SQL for the use of cursors, but these days it is a very small, limited place, and it basically only shows itself in situations where you need to iterate over a series of DDL operations, such as looping over indexes in a dynamic rebuild scenario (see the sketch below).
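As a rough illustration of the sort of thing I mean (this is a sketch only, with no filtering on fragmentation levels or ONLINE options), a cursor over the catalog views building dynamic ALTER INDEX statements is a perfectly reasonable use...

DECLARE @sql NVARCHAR(MAX)

DECLARE IndexCursor CURSOR FOR
SELECT 'ALTER INDEX ' + QUOTENAME(i.name) + ' ON '
       + QUOTENAME(SCHEMA_NAME(o.schema_id)) + '.' + QUOTENAME(o.name) + ' REBUILD'
FROM sys.indexes i
INNER JOIN sys.objects o ON i.object_id = o.object_id
WHERE o.type = 'U'      -- user tables only
  AND i.type > 0        -- skip heaps

OPEN IndexCursor
FETCH NEXT FROM IndexCursor INTO @sql
WHILE @@FETCH_STATUS = 0
 BEGIN
      EXEC sp_executesql @sql
      FETCH NEXT FROM IndexCursor INTO @sql
 END

CLOSE IndexCursor
DEALLOCATE IndexCursor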

For almost all other purposes, we have a better way of doing things. For example, if we need to produce a cross-tabulated query, we have the PIVOT and UNPIVOT operators. If we need to implement recursive logic, we have recursive CTEs. For concatenating rows into a string, or vice versa, there are tricks we can use with XML (see this post for an example, and the sketch below), and for very complex logic or string manipulation, we have CLR integration.
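For example (a minimal sketch, assuming the standard column names in the sample database's DimProduct table), the classic FOR XML PATH trick for gluing rows into a single comma-separated string looks like this...

SELECT STUFF((SELECT ', ' + p.EnglishProductName
              FROM [AdventureWorksDW2008R2].[dbo].[DimProduct] p
              WHERE p.ProductSubcategoryKey = 1
              FOR XML PATH('')), 1, 2, '') AS ProductList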

Let's look at an example of using a simple cursor versus a simple SET-based solution to perform the same task, and see how they perform...

-- Temp table to hold each sale alongside its running total
CREATE TABLE #Sales
(
SalesAmount DECIMAL,
RunningTotal DECIMAL
)


DECLARE @SalesAmount DECIMAL,
        @RunningTotal DECIMAL

SET @RunningTotal = 0

-- Declare a cursor over every row in the fact table
DECLARE myCursor CURSOR
FOR
SELECT SalesAmount
FROM FactResellerSales


-- Fetch the rows one at a time, accumulating the total as we go
OPEN myCursor
FETCH NEXT FROM myCursor INTO @SalesAmount
WHILE @@FETCH_STATUS = 0
 BEGIN
      SET @RunningTotal = @RunningTotal + @SalesAmount
      INSERT #Sales VALUES (@SalesAmount,@RunningTotal)
      FETCH NEXT FROM myCursor INTO @SalesAmount
 END


CLOSE myCursor
DEALLOCATE myCursor


SELECT * FROM #Sales
ORDER BY RunningTotal

...To be honest, I don't know how long this would take to complete, because I got bored of watching it run after about 19 minutes (remember, there are only around 65,000 rows in this table) and killed it, so I could run my SET-based version!

So, how long did the SET-based version take? Drum roll...

SELECT a.SalesAmount,
       SUM(b.SalesAmount) AS RunningTotal
FROM FactResellerSales a
INNER JOIN FactResellerSales b
ON (a.SalesOrderNumber = b.SalesOrderNumber
        AND a.SalesOrderLineNumber = b.SalesOrderLineNumber)
GROUP BY a.SalesAmount
ORDER BY RunningTotal

...Less than 1 Second! I rest my case! :)

So finally, what is wrong with using cell-level encryption? Well, nothing, if it is used in moderation, and it is often a necessary evil in order to meet regulatory requirements, but if you over-indulge, then you have a problem.

If you need to encrypt a CreditCardNumber column to meet a regulatory requirement, then that is fine. Use a symmetric key, and avoid encrypting that key with an asymmetric key, and that with a certificate, and so on (there is a minimal sketch of this below). But do not go right ahead and encrypt the entire table when there is no requirement to do so, other than a manager "thinking it might be a good idea".

As a technical person, it is your responsibility to point out the limitations of technology to the business. So if, as I have, you get a Director telling you that they want the salaries encrypted, to avoid the slightest risk that somebody might find out what a disgustingly large bonus they get, the correct answer is something along the lines of my standard reply: "Yes Sir, certainly, you are the customer, and if that's what you want, then I can make it happen. However, please be warned that depending on the encryption algorithms used, this can cause a performance degradation of up to 45-50% when accessing that data, and can cause data-bloat of up to 4,000%!" That is normally enough to make people see it your way! ;-)
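For what it's worth, here is a minimal sketch of the sort of targeted approach I mean. The table and column names are made up purely for illustration, and in real life you would think harder about key management than a hard-coded password...

-- A symmetric key protected by a password (the shortest sensible key chain)
CREATE SYMMETRIC KEY CardKey
WITH ALGORITHM = AES_256
ENCRYPTION BY PASSWORD = 'Str0ng_P@ssw0rd!'

-- Encrypt just the one column that the regulation actually cares about
OPEN SYMMETRIC KEY CardKey DECRYPTION BY PASSWORD = 'Str0ng_P@ssw0rd!'

UPDATE dbo.Customers
SET CreditCardNumberEnc = EncryptByKey(Key_GUID('CardKey'), CreditCardNumber)

-- ...and decrypt it only when you genuinely need to read it back
SELECT CONVERT(VARCHAR(25), DecryptByKey(CreditCardNumberEnc)) AS CreditCardNumber
FROM dbo.Customers

CLOSE SYMMETRIC KEY CardKey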

I hope you find these results as interesting as I did.


Find my book, Pro SQL Server Administration on Amazon -

America

United Kingdom
