Tuesday, 29 March 2011

The Five Horsemen Of The SQL Server Apocalypse!

Ok, so it should be four horsemen, but there was a two-way tie for fourth place. What can I say?

Thank you all for your votes in the What Is The WORST Performing Feature Of SQL Server poll (see this post).
You can see the results and percentages in a nice little chart at this link, and they are certainly interesting, so let's look at the winners in a bit more detail...

The clear winner is Auto shrink / Scheduled shrink. But why is this so bad? Well, basically, it kills performance, both during the shrink operation and, worse still, after the operation has finished. A shrink starts at the end of the file and works backwards, taking each page and moving it to the earliest available free space within the file.

This is a resource-intensive operation that can cause performance issues whilst it is running, but it also leaves your indexes highly fragmented. I have already blogged about this in detail, so I won't repeat myself too much here, but please check out this post.
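If you want to check which databases have auto shrink enabled, and turn it off, something along these lines will do the job. (This is only a sketch; the database and table names are the ones used later in this post, and the index rebuild at the end simply illustrates repairing the fragmentation a shrink leaves behind.)

-- Find databases with auto shrink enabled
SELECT name
FROM sys.databases
WHERE is_auto_shrink_on = 1

-- Turn it off for a given database
ALTER DATABASE [AdventureWorksDW2008R2] SET AUTO_SHRINK OFF

-- If a shrink has already fragmented your indexes, rebuild them, e.g.
ALTER INDEX ALL ON [dbo].[FactResellerSales] REBUILD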

Next in the ranking is Autogrow logs by 1MB. Why is this so bad? Well, once again, the problem is two-fold. Firstly, if your log is coming under space pressure, and therefore needs to grow, it can cause huge performance problems, because (in some circumstances) SQL needs to grow the log file for almost every transaction. Remember, depending on the action you are performing, SQL may need to record before and after images of the data. I have also experienced issues in the past where, during large operations, the database suddenly ends up in Recovery, because it has not been able to keep up with the number of log grows required; in my case (the DB was nearly 1TB) this eventually resulted in the log becoming corrupt, over an hour of downtime, and of course the transaction being rolled back. See this post for some tips on Log File Optimization.

The other issue is Log File Fragmentation. Log fragmentation? What on earth is that? Well, basically, if you grow your log file in tiny chunks, you end up with a massive number of VLFs (Virtual Log Files). On the VLDB I mentioned above, the client ended up with over 2,000. The rule-of-thumb recommendation is not to have more than 50! See this post for more info, and see below for a quick way to check where you stand.
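The VLF count is easy to check, because DBCC LOGINFO returns one row per VLF, and a sensible fixed growth increment avoids the grow-by-1MB trap in the first place. (A sketch only; the logical log file name below is a placeholder, and you should pick a growth size appropriate to your own workload.)

-- One row is returned per VLF, so the row count is your VLF count
DBCC LOGINFO ('AdventureWorksDW2008R2')

-- Set a sensible, fixed autogrowth increment on the log file
ALTER DATABASE [AdventureWorksDW2008R2]
MODIFY FILE (NAME = 'AdventureWorksDW2008R2_Log', FILEGROWTH = 512MB)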

Very close behind, in 3rd place, came Functions in the SELECT list. Why not put a function in a SELECT list? That is what they are there for, isn't it? No! No, no, no, no, no, no, no! NO! :) If you put a function in a SELECT list, it needs to be evaluated separately for every single row of the query, which causes a "cursor-like" effect, and often it cannot even be included in the same execution plan. Let's have a look at an example, and how we could perform the task better...

Using the AdventureWorksDW2008R2 database, I have created the following example, which is a reproduction of one of the worst SQL implementations I have seen. (The code has been changed to protect the guilty!) :)

The first thing I am going to do is create some "dirty data" in the FactResellerSales table, to demonstrate why the code existed, with the following script...

  UPDATE [AdventureWorksDW2008R2].[dbo].[FactResellerSales]
  SET OrderQuantity = 0
  WHERE ProductKey = 351


  UPDATE [AdventureWorksDW2008R2].[dbo].[FactResellerSales]
  SET UnitPrice = 0
  WHERE ProductKey = 223


...Now to create the function, which I will call SaftyDiv, and which will basically stop a divide-by-zero error occurring...

CREATE FUNCTION dbo.SaftyDiv(@FirstNumber DECIMAL, @SecondNumber DECIMAL)
RETURNS DECIMAL
AS
BEGIN
     DECLARE @ReturnValue DECIMAL

     IF (@FirstNumber = 0)
     BEGIN
          SET @ReturnValue = 0
     END

     ELSE IF (@SecondNumber = 0)
     BEGIN
          SET @ReturnValue = 0
     END

     ELSE
     BEGIN
          SET @ReturnValue = @FirstNumber / @SecondNumber
     END

     RETURN @ReturnValue
END


...So, there are so many things wrong with this function that I will not patronise you, or drive myself insane, by listing them all, but let's see how it performs when we use it to select values from the table...

SELECT
       [SalesTerritoryKey]
      ,[SalesOrderNumber]
      ,[SalesOrderLineNumber]
      , [dbo].[SaftyDiv] ([OrderQuantity], [UnitPrice])
 
  FROM [AdventureWorksDW2008R2].[dbo].[FactResellerSales]


 SQL Server Execution Times:
   CPU time = 438 ms,  elapsed time = 1559 ms.

...If you look at the execution plan for this, it looks very clean, but that is, in fact, because SQL was unable to include the function in the same plan, and actually had to create a second plan for it!

So let's see what performance we get if we tear down the buffer cache and then rewrite this query to use a CASE expression in the SELECT list. To be honest, there are more elegant ways of doing it than this, but it's getting late, and it demonstrates the point...
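(For the tear-down itself, the standard commands are along these lines. Never run them on a production server, as they empty the buffer and plan caches for the whole instance.)

CHECKPOINT
DBCC DROPCLEANBUFFERS
DBCC FREEPROCCACHE

...and here is the rewritten query...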

SELECT
       [SalesTerritoryKey]
      ,[SalesOrderNumber]
      ,[SalesOrderLineNumber]
      , CASE WHEN ([OrderQuantity] = 0 OR [UnitPrice] = 0) THEN (SELECT 0)  ELSE (SELECT [OrderQuantity] / [UnitPrice] ) END 


  FROM [AdventureWorksDW2008R2].[dbo].[FactResellerSales]


 SQL Server Execution Times:
   CPU time = 94 ms,  elapsed time = 1177 ms.

...So you can see, we got a 78.5% improvement in processor time (largely because SQL no longer has to invoke the function once per row, and only needed to compile one plan instead of two), and we also got a 24.5% improvement in elapsed time. Remember, in this example we are only dealing with approx. 65,000 rows. Imagine if we started scaling that up to millions of rows!

Tied for fourth place were Encrypt all data with cell level encryption and Cursors. Now, I must be honest, this was a surprise for me; I thought that cursors would be right up there with Auto/Scheduled shrink. Why did I think that? Well, they are the bane of my life! With no disrespect meant whatsoever to .NET developers, you can always tell when somebody who is experienced in writing .NET code, but less so in SQL, has been writing SQL code. That is because they love cursors! And it makes perfect sense: in .NET languages, looping is often the best way to achieve your goals, but T-SQL is a set-based language, meaning it is optimized for performing an operation on multiple rows at the same time, as opposed to looping over a set of rows, which is exactly what a cursor does.

There is a place in T-SQL for cursors, but these days it is a very small, limited place, and it basically only shows itself in situations where you need to iterate over a series of DDL objects, such as looping over indexes in a dynamic rebuild scenario.

For almost all other purposes, we have a better way of doing things. For example, if we need to produce a cross-tabulated query, we have the PIVOT and UNPIVOT operators. If we need to implement recursive logic, we have recursive CTEs. For concatenating rows into a string, or vice versa, there are tricks we can use with XML (see this post for an example), and for very complex logic or string manipulation, we have CLR integration.
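As a tiny illustration of the recursive CTE option, here is a sketch that generates the numbers 1 to 10 - exactly the sort of job people often reach for a WHILE loop or cursor to do.

WITH Numbers (n) AS
(
     SELECT 1
     UNION ALL
     SELECT n + 1
     FROM Numbers
     WHERE n < 10
)
SELECT n
FROM Numbers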

Let's look at an example of using a simple cursor versus a simple set-based solution to perform the same task, and see how they perform...

CREATE TABLE #Sales
(
SalesAmount DECIMAL,
RunningTotal DECIMAL
)


DECLARE @SalesAmount DECIMAL,
        @RunningTotal DECIMAL

SET @RunningTotal = 0

DECLARE myCursor CURSOR
FOR
SELECT SalesAmount
FROM FactResellerSales


OPEN myCursor
FETCH NEXT FROM myCursor INTO @SalesAmount
WHILE @@FETCH_STATUS = 0
 BEGIN
      SET @RunningTotal = @RunningTotal + @SalesAmount
      INSERT #Sales VALUES (@SalesAmount,@RunningTotal)
      FETCH NEXT FROM myCursor INTO @SalesAmount
 END


CLOSE myCursor
DEALLOCATE myCursor


SELECT * FROM #Sales
ORDER BY RunningTotal

...To be honest, I don't know how long this would take to complete, because I got bored of watching it run after about 19 minutes (remember, there are only around 65,000 rows in this table) and killed it, so I could run my set-based version!

So, how long did the set-based version take? Drum roll...

SELECT a.SalesAmount,
       SUM(b.SalesAmount) AS RunningTotal
FROM FactResellerSales a
INNER JOIN FactResellerSales b
ON (a.SalesOrderNumber = b.SalesOrderNumber
        AND a.SalesOrderLineNumber = b.SalesOrderLineNumber)
GROUP BY a.SalesAmount
ORDER BY RunningTotal

...Less than 1 Second! I rest my case! :)
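As a footnote, from SQL Server Denali (2012) onwards, windowed aggregates accept an ORDER BY clause, which expresses a true per-row running total directly and avoids the self-join altogether. A sketch, assuming the same FactResellerSales table...

SELECT SalesAmount,
       SUM(SalesAmount) OVER (ORDER BY SalesOrderNumber, SalesOrderLineNumber
                              ROWS UNBOUNDED PRECEDING) AS RunningTotal
FROM FactResellerSales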

So finally, what is wrong with using cell-level encryption? Well, nothing, if it is used in moderation, and it is often a necessary evil in order to meet regulatory requirements, but if you over-indulge, then you have a problem.

If you need to encrypt a CreditCardNumber column to meet a regulatory requirement, then that is fine. Use a symmetric key, and avoid encrypting that key with an asymmetric key, and that with a certificate, and so on! But do not go right ahead and encrypt the entire table when there is no requirement to do so, other than a manager who "thinks it might be a good idea". As a technical person, it is your responsibility to point out the limitations of technology to the business. So if, as I have, you have had a Director telling you that they want the salaries encrypted, to avoid the slightest risk that somebody might find out what a disgustingly large bonus they get, the correct answer is something along the lines of my standard reply: "Yes Sir, certainly, you are the customer, and if that's what you want, then I can make it happen. However, please be warned that, depending on the encryption algorithms used, this can cause a performance degradation of up to 45-50% when accessing that data, and can cause data-bloat of up to 4,000%!" That is normally enough to make people see it your way! ;-)
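For completeness, here is a minimal sketch of the "encrypt just the sensitive column with a symmetric key" approach. (The table, column, key name and password are purely illustrative, and protecting the key with a password is simply the shortest option; choose whatever key protection suits your environment.)

-- Assumed table: dbo.Customers (CreditCardNumber VARCHAR(25), CreditCardNumberEnc VARBINARY(256))
CREATE SYMMETRIC KEY CardKey
WITH ALGORITHM = AES_256
ENCRYPTION BY PASSWORD = 'Pl@ceholder_Passw0rd!'

OPEN SYMMETRIC KEY CardKey DECRYPTION BY PASSWORD = 'Pl@ceholder_Passw0rd!'

-- Encrypt only the column that the regulation actually covers
UPDATE dbo.Customers
SET CreditCardNumberEnc = EncryptByKey(Key_GUID('CardKey'), CreditCardNumber)

-- Decrypt it only when you genuinely need the plain value
SELECT CONVERT(VARCHAR(25), DecryptByKey(CreditCardNumberEnc)) AS CreditCardNumber
FROM dbo.Customers

CLOSE SYMMETRIC KEY CardKey

Keep the encryption to the columns that genuinely need it, and the overhead stays manageable.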

I hope you find these results as interesting as I did.



Sunday, 20 March 2011

What Is The WORST Feature Of SQL Server?

I am planning a post on "How To Kill Your SQL Server Without Even Trying!", and I would like to know what you think the worst features of SQL Server are (from a PERFORMANCE perspective ONLY!). I will do another post on worst practices (from a non-performance perspective) at a later date. If your favourite option is not listed, leave a comment.

Once I have your votes, I will include the most popular answers in a post, with examples.



Sexy SSIS In Denali - Usability Enhancements

Following the success of my last post on SSIS in Denali (see here), I have been wanting to do a post for some time giving an overview of some of the nice new usability enhancements for developers. Unfortunately, the subject is predominantly visual, so I have been putting it off, as it really requires a webcast, and 1) I hate the sound of my own voice! 2) I do not have any professional screen capture software. Today, however, I bit the bullet and recorded a short piece that gives an overview of some of my favourite new features for developers. Most of them seem like small changes, but together they make the product much easier to use.

I hope you enjoy it, but please bear the previous two caveats in mind!!! ;-)

If you enjoy this, and would like to see more webcasts, then please leave a comment to let me know. It's always good to know how to spend my blogging time!




How Do You Create A Database?

There was a question on MSDN that, paraphrased, asked what steps you need to go through to create a database. Now, of course, the exact steps will be different in every situation, but there are certain things that people often miss, so I thought I would compile a 15-point list of the high-level steps that you should take -

1) Consult the business to decide on what data you need to store, estimated volumes, how it will be used, how many users, etc. (You should try to predict 3 years into the future)

2) Go through conceptual/logical design. i.e. Normalization, ERD diagrams, etc.

3) Decide what technology will suit your application best. Is this SQL Server (normally, in my biased opinion! ;-) ), or is it Oracle, MySQL, etc.?

4) Decide what hardware spec you will require to support the database application, and what software versions you need, i.e. Windows and SQL Server editions, etc.

5) Design the physical table structure, including data types, compression, etc.

6) Design how you will get your data into, and out of, the database. This will involve logical steps agreed with the business, and may include physical technologies, such as SSIS, stored procedures, functions, endpoints, linked servers, BCP, etc.

7) Create the physical database, specifying files, filegroups, etc. (see the sketch after this list).

8) Create physical structures, such as tables, programmable objects, etc.

9) Design security policies, and ensure the principle of least privilege is followed.

10) Agree SLAs with the business owners

11) Design HA and DR strategies for database, so you can set appropriate Recovery Model, configure Mirroring, etc.

12) Go through SAT cycle on Dev environment. Test code functionality, performance, HA strategy, recovery times, etc.

13) Promote database to UAT environment through Backup/Restore, Scripts, or Copy Database wizard, etc.

14) Make sure the business fully tests and signs off the functionality.

15) Promote to Live.
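To illustrate step 7, here is a minimal sketch of creating a database with explicitly sized files, a separate filegroup and a sensible log growth increment. (The names, paths and sizes are purely illustrative.)

CREATE DATABASE SalesDB
ON PRIMARY
(
     NAME = 'SalesDB_Data',
     FILENAME = 'D:\SQLData\SalesDB_Data.mdf',
     SIZE = 10GB,
     FILEGROWTH = 1GB
),
FILEGROUP Archive
(
     NAME = 'SalesDB_Archive',
     FILENAME = 'E:\SQLData\SalesDB_Archive.ndf',
     SIZE = 20GB,
     FILEGROWTH = 2GB
)
LOG ON
(
     NAME = 'SalesDB_Log',
     FILENAME = 'F:\SQLLogs\SalesDB_Log.ldf',
     SIZE = 5GB,
     FILEGROWTH = 512MB
)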


Sunday, 13 March 2011

How Do Join Operators In SQL Server Work?

I was asked a question this week, that I could only answer at a very high level, so of course my instant reaction was to research a more detailed answer and blog about it! The question was "How do the join operators in SQL Server work?"

So first, a little background... In SQL Server, in the old days, if you joined two tables together there was only one way that the optimizer could perform the task, and that was with an operation called Nested Loops (or loop join). In more recent versions, however, in order to support larger databases with larger tables, two additional operators have been added. These are the Merge Join and the Hash Match (hash join). But how do these operators work?

Well, first up is Nested Loops. This operator will use statistics to decide which is the smaller of the two tables (in terms of rows) and will mark this table as the "outer table", marking the larger table as the "inner table". For each row in the outer table, it will loop through the rows of the inner table, looking for a matching key based on the join column(s), and join the tables on that value.

Next is Hash Match. This operator works by choosing one of the tables and running a hash function against the key column(s). It then puts these hashed values into an in-memory hash table, although if this becomes too large, it can be written to TempDB. It then runs the hash function against the second table, and joins the rows based on a match of the two hashed values.

Lastly is Merge Join. This operator relies on both of the tables being sorted by the join key. Because both inputs are sorted in the same way, SQL can look at the first value in the first table and the first value in the second table. If they match, it can output them to the result set; if not, it moves forward through whichever input has the lower key and repeats the process, until all of the rows have been processed.

So when is each operator best? Well, Nested Loops is generally better for smaller tables, and it also needs to be able to perform an index seek on the "inner table" to get any kind of performance.

Hash Match can be better for larger tables, because it requires less IO; each key only needs to be read once, rather than multiple times. However, the hash function consumes a fair bit of CPU, and the hash table can require a lot of memory. If you are coming under memory pressure, it may need to spill the table to TempDB, which causes more IO on top of the memory pressure, making it a less desirable option.

Merge Join is generally regarded as the most efficient of the operators for larger datasets, because it only needs to read the keys once, and it also does not require many CPU cycles. However, it does require both tables to be sorted. If they are (especially by the clustered index), then the operator is very fast, but if they are not, then SQL needs to perform a sort operation on them before they enter the join. This, of course, is very inefficient.

It is possible to "override" the optimizer's choice of operator, by using OPTION(Loop Join), OPTION(Hash join) or OPTION(Merge Join) after the join in your query. However, remember that optimizer is very smart, and does not make that many mistakes. Updating Statistics and rebuilding indexes is usually the best way to ensure the optimizer uses the correct option. If you really do need to use these hints, however, then you should always work with SQL, rather than against it.

What do I mean by that? Well, if we take the example that SQL is deciding to use a Merge Join, but we know that this will never be the most effective option, because one of the tables (for some reason) will never be sorted by the join key, then instead of telling the optimizer that it must use a Hash Match, we should tell it to use anything it wants except a Merge Join. We can do this by comma-separating the list of operators that it is allowed to use. In this example, we would say OPTION (HASH JOIN, LOOP JOIN), as sketched below.
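Here is a quick sketch of what that looks like in practice, reusing the AdventureWorks tables from earlier (the join itself is just for illustration):

SELECT f.SalesOrderNumber,
       d.EnglishProductName
FROM FactResellerSales f
INNER JOIN DimProduct d
        ON f.ProductKey = d.ProductKey
OPTION (HASH JOIN, LOOP JOIN)   -- anything except a Merge Join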



Sunday, 6 March 2011

System Pages In SQL Server

As you know, SQL Server's data files are made up of 8KB pages, which are organized into 64KB (eight-page) extents. There are several different page types for data, indexes, LOB data and IAMs (Index Allocation Maps), which are the root pages for partition/allocation unit combinations, but there are also several other pages that SQL Server maintains internally, in order to operate quickly and correctly.

The first of these page types, is the Database Boot Page. This is always Page 9, in file 1 of your database, and includes information such as the current and original database versions, the LSN (Log Sequence Number) of the last Checkpoint, and the Database Mirroring ID.

The second page type is the File Header Page, and this is always page 0 in every file. This also contains a raft of useful information, with some of the highlights being the auto growth size, the minimum file size, the HDD sector size and the LSNs/GUIDs required for restore operations.
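If you want to poke around in these pages yourself, the undocumented DBCC PAGE command will dump them (so stick to a test box), and trace flag 3604 sends the output to your query window. The database name below is just an example.

DBCC TRACEON (3604)

-- Dump the boot page (file 1, page 9) with full detail
DBCC PAGE ('AdventureWorksDW2008R2', 1, 9, 3)

-- Dump the file header page (page 0 of file 1)
DBCC PAGE ('AdventureWorksDW2008R2', 1, 0, 3)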

Page 1 of every file is a PFS (Page Free Space) page, which is a byte-map detailing how full each page is. Obviously, there is a limit to the number of pages whose space can be recorded within a single byte-map, so this page is repeated on every 8,088th page. The information stored for each page recorded in the byte-map includes:

The amount of free space on a page
Are there Ghost records (deleted rows) on the page?
Is the page an Index Allocation Map?
Is the page part of a mixed extent?
Has the page been allocated?

On pages 2 and 3 we begin the GAM interval. This consists of a GAM (Global Allocation Map) and an SGAM (Shared Global Allocation Map) page. These two pages are bitmaps that record the status of the following 64,000 extents, which equates to around 4GB of data.

The GAM page is a record of which extents have been allocated. 0 = Allocated, 1 = Unallocated.

The SGAM page, meanwhile, is a record of whether there are free pages within a mixed extent. 0 = the extent is either full of mixed pages or is a uniform extent; 1 = there are one or more pages free in the mixed extent.

It is then possible for SQL to combine the bits from each of these bit-maps to determine if an extent is uniform, mixed or unallocated, so that it knows which mixed extents can be used for new or small objects and which extents are free to be assigned uniformly to larger objects.

Just taking a step back: SQL Server (by default) uses both mixed extents (where different pages are assigned to different objects) and uniform extents (where all pages within the extent are assigned to the same object). The rule it uses is that when an object is first created, it will use mixed extents, but once the object has grown to the point where it would fill an entire extent, it switches to uniform extents. It never switches back, however, even if you truncate the table.

It is possible to change this behaviour by turning on trace flag 1118. This is generally used to alleviate contention on system pages in TempDB, and it forces SQL to use only uniform extents, even when a small object is first created.
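Turning the trace flag on globally is a one-liner, sketched below; you would normally also add -T1118 as a startup parameter so that it survives a restart.

DBCC TRACEON (1118, -1)   -- the -1 applies the trace flag globally

-- Check which trace flags are currently active
DBCC TRACESTATUS (-1)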

Anyway, back to the point - the last two system pages of note are the MLM and DCM pages. These can be found at pages 7 and 6 of a file respectively, and they contain information that helps with disaster recovery. Like the other bitmap pages, they are repeated every 511,230th page.

The MLM (Minimally Logged Map) page, more commonly known as the BCM (Bulk Changed Map), is a bitmap that specifies whether an extent has been modified by any minimally logged transactions since the last transaction log backup. This helps SQL quickly determine what has changed since the last transaction log backup.

Finally, the DCM (Differential Changed Map) page is a bitmap that specifies whether an extent has changed since the last full backup. This means that when you take a differential backup, SQL can easily determine which extents it needs to back up.

Denormalizing A Column Into A String

If you wanted to take query results and turn them into a string, how would you go about it? Traditionally, you might use a dreaded cursor/WHILE loop or, at best, a complex, poorly performing UDF. But since the introduction of native XML in SQL Server 2005, there is a nice little trick you can use.

As an example, I am going to use a scenario from a question I answered on MSDN. Basically, the requirement was to take a list of values, turn them into a string, and then insert the average value into the middle. So the first thing I am going to do is use a table variable to generate the list...

DECLARE @values table(col1 float)

INSERT INTO @values VALUES(2), (2), (3), (4), (5)

...Next, I am going to find the average value, and insert this into the list. For simplicity, I will use a second table variable...

DECLARE @values2 table(col1 float)

INSERT INTO @values2
SELECT AVG(col1) AS col1 FROM @values
UNION ALL
SELECT col1 FROM @values
ORDER BY col1


...So, now that I have the list of values I need, sorted into a table variable, it's time for the interesting bit. Basically, I am going to use the data() XQuery function, which returns a typed value for each item specified, inside a sub-query with a FOR XML clause. The outer query has no FOR XML clause, so the results are returned as relational data...

SELECT DISTINCT ConCat_Column =
(
SELECT CAST(col1 as decimal(3,2))  AS [data()]
FROM @values2
FOR XML PATH ('')
)


...The results are as follows, with the average value (3.20) sitting in the middle of the list...

2.00 2.00 3.00 3.20 4.00 5.00

...Very cool eh?