Friday, November 18, 2011

T-SQL Tuesday #024: The Roundup

Thanx to everyone for their contributions to T-SQL Tuesday #024: Prox ‘n’ Funx.

The roundup is finally here… Well, shux, I’ve been so busy that I almost let it fall through the crax, but before weex and weex went by, I finally read through the stax of submissions. I certainly had a lot of kix reading them… you could hear shriex of delight coming out of my office window.

(Okay, I admit it… The roundup was late because I spent a lot of time looking up words that end with a -ks sound… It loox like you can count me among the ranx of freax ‘n’ geex ‘n’ punx ‘n’ jerx ‘n’ dorx ‘n’ quax who carry something a little too far… It stinx, doesn’t it?).

Methinx it was a success! The SQL Community rox!

But don’t take it from me… Read the submissions for yourself. Click on the linx below for some good reading.

declare @Subject varchar(100)='T-SQL Tuesday #024: Prox ''n'' Funx';

insert [Brad Schulz].Blog
select Subject='T-SQL Tuesday #024 Roundup'
,Content
from
(
select Content from [Rob Farley].Blog where Subject=@Subject
union all
select Content from [Noel McKinney].Blog where Subject=@Subject
union all
select Content from [Greg Lucas].Blog where Subject=@Subject
union all
select Content from [Michael J. Swart].Blog where Subject=@Subject
union all
select Content from [Kent Chenery].Blog where Subject=@Subject
union all
select Content from [Steve Wales].Blog where Subject=@Subject
union all
select Content from [Merrill Aldrich].Blog where Subject=@Subject
union all
select Content from [Aaron Bertrand].Blog where Subject=@Subject
union all
select Content from [Pinal Dave].Blog where Subject=@Subject
union all
select Content from [Rich Brown].Blog where Subject=@Subject
union all
select Content from [Bob Pusateri].Blog where Subject=@Subject
union all
select Content from [Kerry Tyler].Blog where Subject=@Subject
union all
select Content from [Jes Schultz Borland].Blog where Subject=@Subject
union all
select Content from [Thomas Rushton].Blog where Subject=@Subject
union all
select Content from [Jason Brimhall].Blog where Subject=@Subject
union all
select Content from [Nancy Hidy Wilson].Blog where Subject=@Subject
union all
select Content from [David Howard].Blog where Subject=@Subject
union all
select Content from [Brad Schulz].Blog where Subject=@Subject
) Contributions

Rob Farley contributes a wonderfully detailed post comparing inline functions and non-inline functions, filled with lots of examples and query plans. Rob categorically states that “if it’s not inline, it’s rubbish.” I couldn’t agree more, but then again, I’m always a sucker for any blog post that uses the word “rubbish”.

Noel McKinney talks about a situation concerning the abuse and overuse of stored procedures. Imagine a database that is completely devoid of data and yet consumes truckloads of disk space… because it houses 27,000 stored procedures! A good story to keep in the back of your mind in designing solutions.

Greg Lucas gives us a detailed explanation of a useful ExceptionHandler procedure that is designed to be used in the CATCH block of a TRY…CATCH construct, logging detailed information about the error in a table. It encapsulates and standardizes error handling so that the developer can save time in coding.

Michael J. Swart talks about some interesting numbers related to stored procedures. At first I thought it would be the usual numbers like “maximum parameters allowed” and “maximum nested levels”, but if you know Michael, then you know he wouldn’t only spout out dry statistics like that. By the time you finish his list, you’ll be struggling to catch your breath from laughing.

Kent Chenery joins T-SQL Tuesday for the first time, contributing a CLR routine for calculating Easter in any given year. That’s pretty cool… I never even thought about the fact that an algorithm existed for calculating Easter. Make sure you read the Wikipedia article he references… the detail behind the algorithm is fascinating.

Steve Wales is another first-time contributor to T-SQL Tuesday. He talks about the differences in coding procedures and functions between SQL Server and that other database created by that company in my neck of the woods, Oracle. He also provides a link to an earlier post of his warning that you need to be aware of the database compatibility level when using built-in functions.

Merrill Aldrich offers his ode to composable solutions, illustrating how table-valued functions (and views) offer maximum composability and flexibility (and that stored procedures don’t offer that same flexibility). As a rabid fan of TVFs, I couldn’t agree more.

Aaron Bertrand wrote a terrific post on the shortcomings of INFORMATION_SCHEMA views earlier this month, but he waited for T-SQL Tuesday to post his pièce de résistance, plunging the knife deeper in talking about INFORMATION_SCHEMA.ROUTINES (and syscomments and sp_helptext) and how they just don’t cut it in getting the full procedure definition.

Pinal Dave talks about the new CUME_DIST() analytic function in SQL2012, which provides a really cool easy-to-implement percentile-calculating capability. Pinal’s post is the first of a series of blog posts about the new SQL2012 analytic functions… make sure you add them to your must-read list.
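
Speaking of CUME_DIST(), if you haven’t played with it yet, here’s a quick taste of the syntax (just a made-up AdventureWorks example of my own… it requires SQL2012 or later):

select SalesOrderID
,TotalDue
,PctRank=cume_dist() over (order by TotalDue) --fraction of rows with TotalDue at or below this row's
from Sales.SalesOrderHeader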

Rich Brown tells us about the benefits of using stored procedures for Reporting Services datasets, abstracting away the data layer from the presentation layer. He also brings up an interesting idea of putting Reporting Services procedures into their own schema.

Bob Pusateri gives an overview of the pros and cons of both procedures and functions. Speaking of procedures, though, I also have to mention Bob’s post of October 18th, which shows a cool trick of using sp_executesql to create a view in a database other than the current one.

Kerry Tyler discusses scenarios (and some myths) regarding transactions being left open by procedures that produce errors or experience timeouts, concluding with a discussion on XACT_ABORT.

Jes Schultz Borland expresses her frustration about the fact that a Reporting Services dataset only recognizes the first result set from a procedure that produces multiple result sets. It’s confusing, since the term dataset means different things on different platforms (in .NET, the DataSet class truly is a collection of DataTables).

Thomas Rushton makes a “Hell yes” case in answer to the question, “Should I learn to use stored procedures?” And he follows up with a second contribution talking about the best practice in releasing stored procedure scripts (or scripts for any kind of object for that matter).

Jason Brimhall talks about functions… specifically functions that he is thankful for, functions he can’t live without, and functions that he had fun with in delving into SQL Server internals. I’m going to set aside some time to read more about those.

Nancy Hidy Wilson reminds us of the myriad of information we can glean from the good old SERVERPROPERTY() function… and she reminds us to check out DATABASEPROPERTY() and DATABASEPROPERTYEX() as well.
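
For instance, here are just a few of the many properties available:

select ServerName=serverproperty('ServerName')
,Edition=serverproperty('Edition')
,ProductVersion=serverproperty('ProductVersion')
,ProductLevel=serverproperty('ProductLevel')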

David Howard discusses bad plans resulting from parameter sniffing and gives a list of techniques to solve the problem.

And, finally, my contribution talks about a procedure that I use to deal with a client system that contains no stored procedures at all… it finds the original text of ad-hoc queries in a Forced Parameterization database.

Thanx again for your contributions!

Wednesday, November 9, 2011

T-SQL Tuesday #024: A Procedure To Deal With No Procedures

Oh my… Here it is Wednesday, and I’m a day late for T-SQL Tuesday #024, which is hosted this month by some joker named Brad Schulz. Hopefully he’ll understand and accept my late entry… If he doesn’t, then I’ll just have to make him understand.

The theme for this month is Prox ‘n’ Funx (aka Procedures and Functions)… and I’m going to talk about a procedure that I wrote to help me deal with the lack of procedures.

You probably had to re-read that last sentence, didn’t you? Well, this whole blog post might turn your brain inside-out a bit, so you might want to read it slowly or be prepared to re-read it after you finish. My apologies in advance.

Here’s the situation… purely hypothetical, you understand… (wink, wink).

Imagine yourself faced with a SQL Server Database that is just a big dumb file cabinet. All it holds is data, and there isn’t a single procedure or function to be found… every single query that goes to the system is an ad-hoc query constructed within an enormous .NET application.

For example, a typical query in the C# code may be put together like this:

string SQL = "";
SQL +=
" SELECT soh.SalesOrderNumber ";
SQL +=
" ,soh.ShipDate ";
SQL +=
" ,soh.CustomerID ";
SQL +=
" ,s.Name ";
SQL +=
" ,soh.TotalDue ";
SQL +=
" FROM Sales.SalesOrderHeader soh ";
SQL +=
" JOIN Person.Address a ON soh.ShipToAddressID=a.AddressID ";
SQL +=
" JOIN Person.StateProvince sp ON a.StateProvinceID=sp.StateProvinceID ";
SQL +=
" JOIN Sales.Store s ON soh.CustomerID=s.CustomerID ";
SQL +=
" WHERE soh.SalesOrderID IN ";
SQL +=
" (SELECT SalesOrderID ";
SQL +=
" FROM Sales.SalesOrderDetail ";
SQL +=
" WHERE ProductID=" + SQLUtils.SQLValue(ProdID) + ") ";
SQL +=
" AND soh.ShipDate>=" + SQLUtils.SQLValue(FromDate);
SQL +=
" AND soh.ShipDate<" + SQLUtils.SQLValue(ToDate);
SQL +=
" AND a.City=" + SQLUtils.SQLValue(CityName);
SQL +=
" AND sp.Name=" + SQLUtils.SQLValue(StateProvinceName);
DataTable dtResult = SQLUtils.ExecSQLQuery(SQL)
Never mind the thoughts that are going through your head regarding readability. Never mind the maintenance nightmare. Never mind the potential for SQL Injection. Never mind the…

Oh never mind.

So hundreds of thousands of ad-hoc queries get sent to the system all day long.

My job? Make things go faster.

Okay, fine. However, there’s one other wrinkle. This particular database operates under Forced Parameterization. This may reduce the frequency of query compilations, but it makes my job a little more complicated.

Let me illustrate… Let’s set the AdventureWorks database to use Forced Parameterization:

alter database AdventureWorks set parameterization forced

With that in place, every ad-hoc query sent to the system is parameterized… In other words, any literal within the ad-hoc query is converted into a parameter and the query as a whole is parameterized so that its plan can be reused by any subsequent ad-hoc query that has the same “code shape”.
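
By the way, you can verify a database’s parameterization setting at any time by peeking at sys.databases:

select name
,is_parameterization_forced --1 = forced, 0 = simple
from sys.databases
where name=N'AdventureWorks'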

Let’s execute the following query in AdventureWorks:

SELECT soh.SalesOrderNumber
,soh.ShipDate
,soh.CustomerID
,s.Name
,soh.TotalDue
FROM Sales.SalesOrderHeader soh
JOIN Person.Address a ON soh.ShipToAddressID=a.AddressID
JOIN Person.StateProvince sp ON a.StateProvinceID=sp.StateProvinceID
JOIN Sales.Store s ON soh.CustomerID=s.CustomerID
WHERE soh.SalesOrderID IN (SELECT SalesOrderID
FROM Sales.SalesOrderDetail
WHERE ProductID=897)
AND soh.ShipDate>='20030801'
AND soh.ShipDate<'20030901'
AND a.City='Milsons Point'
AND sp.Name='New South Wales'

(Those of you who read my blog regularly may be shocked by my use of UPPER CASE keywords… I never do that… But I’m doing it here to illustrate a point).

As you may know, there is a DMV called sys.dm_exec_cached_plans, which contains information about all the query plans that have been cached, along with a plan_handle column that you can use to acquire the actual text of the query that created the plan. There’s also a DMV called sys.dm_exec_query_stats, which contains performance statistics information for cached plans, but its plan_handle (or sql_handle) column points to a different kind of text when you’re dealing with Forced Parameterization.

Here is a simplified version of a Performance Statistics Query that I run, which uses sys.dm_exec_query_stats to acquire performance information so that I can find plans that perform a lot of reads (the ones with the most reads are at the top):

select Code=(select [processing-instruction(q)]=':'+nchar(13)+qt.text+nchar(13)
for xml path(''),type)
,qp.Query_Plan
,[#Reads]=qs.total_logical_reads
,[#Execs]=qs.execution_count
from sys.dm_exec_query_stats qs
cross apply sys.dm_exec_sql_text(plan_handle) qt
cross apply sys.dm_exec_query_plan(plan_handle) qp
where qt.text not like '%sys.dm\_%' escape '\'
order by [#Reads] desc

Here’s what the result looks like in SSMS:



In case you’re wondering about that “processing-instruction” and XML gibberish for the generation of the first column, you can read more about that here. But, in short, you can see that it creates a hyperlink to the query text. When I click on that link, I can see that the text looks like this:

(@0 int,@1 varchar(8000),@2 varchar(8000),
@3 varchar(8000),@4 varchar(8000))select soh . SalesOrderNumber ,
soh . ShipDate , soh . CustomerID , s . Name , soh . TotalDue
from Sales . SalesOrderHeader soh join Person . Address a on
soh . ShipToAddressID = a . AddressID join Person . StateProvince sp
on a . StateProvinceID = sp . StateProvinceID join Sales . Store s
on soh . CustomerID = s . CustomerID where soh . SalesOrderID in
( select SalesOrderID from Sales . SalesOrderDetail where ProductID
= @0 ) and soh . ShipDate > = @1 and soh . ShipDate < @2 and
a . City = @3 and sp . Name = @4

I have split it into multiple lines for this blog post… In reality it’s one looooonnnng string. But you can see that this is NOT the actual text of the query that I executed. In parameterizing this query, SQL converted all my UPPER CASE keywords into lower case, and it got rid of any extraneous white space (multiple spaces, tabs, carriage returns, line feeds), and it inserted a single space between all commas and periods/full-stops and operators.
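
(By the way, if that processing-instruction trick for generating the clickable Code column looks like black magic, here’s a minimal standalone sketch of it… the @s variable is just a stand-in for any multi-line query text):

declare @s nvarchar(max)=N'select 1'+nchar(13)+N'from SomeTable'
select Code=(select [processing-instruction(q)]=N':'+nchar(13)+@s+nchar(13)
for xml path(''),type)
--In the SSMS grid, the XML value renders as a clickable link that opens
--the full text in a new window with its line breaks intact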

Now when another query comes along that is exactly like the first query (except querying for different values)…

SELECT soh.SalesOrderNumber
,soh.ShipDate
,soh.CustomerID
,s.Name
,soh.TotalDue
FROM Sales.SalesOrderHeader soh
JOIN Person.Address a ON soh.ShipToAddressID=a.AddressID
JOIN Person.StateProvince sp ON a.StateProvinceID=sp.StateProvinceID
JOIN Sales.Store s ON soh.CustomerID=s.CustomerID
WHERE soh.SalesOrderID IN (SELECT SalesOrderID
FROM Sales.SalesOrderDetail
WHERE ProductID=870)
AND soh.ShipDate>='20030101'
AND soh.ShipDate<'20040101'
AND a.City='London'
AND sp.Name='England'

…and we once again execute our Performance Statistics Query to look at the plan cache...



…we see that the same parameterized query has been reused (note the #Execs is equal to 2).

(Those of you with a quick eye may have also noticed that the #Reads skyrocketed… I’ll talk about that in a later post).

So my problem is this: In order to improve the performance of the application, I can use my Performance Statistics Query to find the plans that exhibited the most reads and see if they can be tuned, either by adding an index or by recommending that the query be rewritten a different way.

But unfortunately all I get is that parameterized version of the query text. I have no indication what the ACTUAL ad-hoc queries were that were sent to the system. Yes, I can see the text of the ad-hoc queries via the sys.dm_exec_cached_plans DMV, but how do I find the exact ones that have to do with this parameterized query? It’s like looking for a needle in a haystack.

So I wrote a procedure to do that for me. It essentially takes the parameterized query text and creates a LIKE pattern out of it so that I can find its original ad-hoc code via the plan_handle of the sys.dm_exec_cached_plans DMV.

I create this procedure in a database called DBATools (which houses other utilities and data that I use for monitoring the server). It starts like so, receiving the text of the parameterized query code and putting it into a local @Text variable:

use DBATools
go

create procedure GetAdhocQueries
@ParameterizedQuery nvarchar(max)
as

declare @Text nvarchar(max)
set @Text=@ParameterizedQuery

Its first step is to get rid of the parameter declarations at the beginning of the query. Well, luckily, all these parameterized queries are of the pattern “(@0 … )xxxxx”, where “xxxxx” is one of the following: SELECT or UPDATE or INSERT or DELETE or MERGE or WITH (in the case of a CTE). So let’s use that knowledge to strip off the parameter list:

/*Get rid of parameter list*/
select @Text=substring(@Text,isnull(min(Pos),0)+1,len(@Text))
from (select charindex(N')select ',@Text)
union all
select charindex(N')insert ',@Text)
union all
select charindex(N')update ',@Text)
union all
select charindex(N')delete ',@Text)
union all
select charindex(N')merge ',@Text)
union all
select charindex(N')with ',@Text)) x(Pos)
where Pos>0

So that makes our @Text variable look like the following, stripped of its parameter list (again, I’m artificially word-wrapping the text here… it’s just one loonng string):

select soh . SalesOrderNumber , soh . ShipDate , 
soh . CustomerID , s . Name , soh . TotalDue
from Sales . SalesOrderHeader soh join Person . Address a on
soh . ShipToAddressID = a . AddressID join Person . StateProvince sp
on a . StateProvinceID = sp . StateProvinceID join Sales . Store s
on soh . CustomerID = s . CustomerID where soh . SalesOrderID in
( select SalesOrderID from Sales . SalesOrderDetail where ProductID
= @0 ) and soh . ShipDate > = @1 and soh . ShipDate < @2 and
a . City = @3 and sp . Name = @4

Next I need to find all the parameters in the text. They are all surrounded with a single space, so they are easy to find because they have the pattern “ @[0-9] ” (for a single digit parameter) or “ @[0-9][0-9] ” (for a two-digit parameter), etc. So the following will find all 4-digit, 3-digit, 2-digit, and 1-digit parameters in the text and substitute them with a single percent sign (%):

/*Substitute all parameters (i.e. @0 or @1 or ...) with Percents (%)*/
declare @NumDigs tinyint
,@Pos int
set @NumDigs=5
while @NumDigs>1
begin
set @NumDigs=@NumDigs-1
while 1=1
begin --Continue substituting until there are no more
set @Pos=patindex(N'% @'+replicate(N'[0-9]',@NumDigs)+N' %',@Text)
if @Pos=0 break --No more... we're done with the loop
set @Text=stuff(@Text,@Pos,3+@NumDigs,N' % ')
end
end

And now our @Text variable looks like this:

select soh . SalesOrderNumber , soh . ShipDate , 
soh . CustomerID , s . Name , soh . TotalDue
from Sales . SalesOrderHeader soh join Person . Address a on
soh . ShipToAddressID = a . AddressID join Person . StateProvince sp
on a . StateProvinceID = sp . StateProvinceID join Sales . Store s
on soh . CustomerID = s . CustomerID where soh . SalesOrderID in
( select SalesOrderID from Sales . SalesOrderDetail where ProductID
= % ) and soh . ShipDate > = % and soh . ShipDate < % and
a . City = % and sp . Name = %

Now, I need to take care of those spaces that had been inserted between every operator and keyword when the query was parameterized. So I replace all those spaces with percent signs. At the same time, I also replace any normal pattern-matching characters (like a caret or left square bracket or right square bracket) with percent signs. I don’t want them screwing up my final query into the plan cache. Finally, I place a percent sign at the beginning and the end of the text:

/*Replace all spaces and pattern characters (i.e. "[", "]", "^") with Percents (%)*/
/*Also place a Percent at the beginning and the end*/
set @Text=N'%'
+replace(replace(replace(replace(@Text
,N' '
,N'%')
,N'['
,N'%')
,N']'
,N'%')
,N'^'
,N'%')
+N'%'

At this point our @Text variable looks like this:

%select%soh%.%SalesOrderNumber%,%soh%.%ShipDate%,%
soh%.%CustomerID%,%s%.%Name%,%soh%.%TotalDue%
from%Sales%.%SalesOrderHeader%soh%join%Person%.%Address%a%on%
soh%.%ShipToAddressID%=%a%.%AddressID%join%Person%.%StateProvince%sp%
on%a%.%StateProvinceID%=%sp%.%StateProvinceID%join%Sales%.%Store%s%
on%soh%.%CustomerID%=%s%.%CustomerID%where%soh%.%SalesOrderID%in%
(%select%SalesOrderID%from%Sales%.%SalesOrderDetail%where%ProductID%
=%%%)%and%soh%.%ShipDate%>%=%%%and%soh%.%ShipDate%<%%%and%
a%.%City%=%%%and%sp%.%Name%=%%%

Just to be neat and tidy, I get rid of all the multiple percent signs and replace them with a single one:

/*Transform all multiple Percents into single ones*/
set @Text=replace(@Text,N'%%%',N'%%')
set @Text=replace(@Text,N'%%',N'%')

And that results in this:

%select%soh%.%SalesOrderNumber%,%soh%.%ShipDate%,%
soh%.%CustomerID%,%s%.%Name%,%soh%.%TotalDue%
from%Sales%.%SalesOrderHeader%soh%join%Person%.%Address%a%on%
soh%.%ShipToAddressID%=%a%.%AddressID%join%Person%.%StateProvince%sp%
on%a%.%StateProvinceID%=%sp%.%StateProvinceID%join%Sales%.%Store%s%
on%soh%.%CustomerID%=%s%.%CustomerID%where%soh%.%SalesOrderID%in%
(%select%SalesOrderID%from%Sales%.%SalesOrderDetail%where%ProductID%
=%)%and%soh%.%ShipDate%>%=%and%soh%.%ShipDate%<%and%
a%.%City%=%and%sp%.%Name%=%

Finally, since LIKE patterns are not allowed to be more than 4000 characters, I truncate the @Text variable to that length if needed:

/*Truncate to 4000 characters max*/
if len(@Text)>4000 set @Text=left(@Text,3999)+N'%'

At this point, we can finally find all the ad-hoc queries in the cache whose text is LIKE our @Text variable:

/*Find our Adhoc queries that match the pattern*/
select Code=(select [processing-instruction(q)]=N':'+nchar(13)+qt.text+nchar(13)
for xml path(''),type)
,DB=db_name(QueryDatabaseID)
,[#Bytes]=cp.size_in_bytes
,cp.usecounts
from sys.dm_exec_cached_plans cp
cross apply (select QueryDatabaseID=convert(int,value)
from sys.dm_exec_plan_attributes(cp.plan_handle)
where attribute='dbid') F_DB
cross apply sys.dm_exec_sql_text(cp.plan_handle) qt
where objtype='Adhoc'
and qt.text like @Text
order by cp.usecounts desc

Here is the result:



And when we click on one of the hyperlinks, we can see the original ad-hoc query in all its glory, just as it was originally submitted.

<?q :
SELECT soh.SalesOrderNumber
,soh.ShipDate
,soh.CustomerID
,s.Name
,soh.TotalDue
FROM Sales.SalesOrderHeader soh
JOIN Person.Address a ON soh.ShipToAddressID=a.AddressID
JOIN Person.StateProvince sp ON a.StateProvinceID=sp.StateProvinceID
JOIN Sales.Store s ON soh.CustomerID=s.CustomerID
WHERE soh.SalesOrderID IN (SELECT SalesOrderID
FROM Sales.SalesOrderDetail
WHERE ProductID=870)
AND soh.ShipDate>='20030101'
AND soh.ShipDate<'20040101'
AND a.City='London'
AND sp.Name='England'

?>

So, to take advantage of this new procedure, I just incorporate it into my original Performance Statistics Query, like so:

select Code=(select [processing-instruction(q)]=':'+nchar(13)+qt.text+nchar(13)
for xml path(''),type)
,qp.Query_Plan
,[#Reads]=qs.total_logical_reads
,[#Execs]=qs.execution_count
,AdHocStmts
from sys.dm_exec_query_stats qs
cross apply sys.dm_exec_sql_text(plan_handle) qt
cross apply sys.dm_exec_query_plan(plan_handle) qp
cross apply
(select AdHocStmts=case
when qt.text like '(@0 %'
then N'exec DBATools.dbo.GetAdhocQueries '''
+replace(qt.text,N'''',N'''''')+N' '''
else null
end) F_Adhoc
where qt.text not like '%sys.dm\_%' escape '\'
order by [#Reads] desc

In short, if the query text (in sys.dm_exec_sql_text) starts with a “(@0 “, I know it’s a parameterized query, so I will have the AdHocStmts column populated with the text to call my GetAdhocQueries procedure. When I execute the above query, I get the following result with the new AdHocStmts column:



And I can copy/paste the AdHocStmts column into a query window and execute it (again, I only word-wrap it here for clarity… it’s actually a looonng single line of text).

exec DBATools.dbo.GetAdhocQueries '(@0 int,@1 varchar(8000),@2 varchar(8000),
@3 varchar(8000),@4 varchar(8000))select soh . SalesOrderNumber , soh .
ShipDate , soh . CustomerID , s . Name , soh . TotalDue from Sales .
SalesOrderHeader soh join Person . Address a on soh . ShipToAddressID =
a . AddressID join Person . StateProvince sp on a . StateProvinceID =
sp . StateProvinceID join Sales . Store s on soh . CustomerID = s .
CustomerID where soh . SalesOrderID in ( select SalesOrderID from
Sales . SalesOrderDetail where ProductID = @0 ) and soh . ShipDate
> = @1 and soh . ShipDate < @2 and a . City = @3 and sp . Name = @4 '

And that will find the ad-hoc queries that are associated with that particular parameterized query.

So this GetAdhocQueries procedure allows me to find the ad-hoc query needles in the plan cache haystack. Now if I only had a way to find those $#&@ queries in the C# code as easily… oh well, that’s somebody else’s problem.
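
(And for anyone who wants to copy and paste, here’s the entire GetAdhocQueries procedure in one piece, assembled from the fragments above):

use DBATools
go

create procedure GetAdhocQueries
@ParameterizedQuery nvarchar(max)
as

declare @Text nvarchar(max)
set @Text=@ParameterizedQuery

/*Get rid of parameter list*/
select @Text=substring(@Text,isnull(min(Pos),0)+1,len(@Text))
from (select charindex(N')select ',@Text)
union all
select charindex(N')insert ',@Text)
union all
select charindex(N')update ',@Text)
union all
select charindex(N')delete ',@Text)
union all
select charindex(N')merge ',@Text)
union all
select charindex(N')with ',@Text)) x(Pos)
where Pos>0

/*Substitute all parameters (i.e. @0 or @1 or ...) with Percents (%)*/
declare @NumDigs tinyint
,@Pos int
set @NumDigs=5
while @NumDigs>1
begin
set @NumDigs=@NumDigs-1
while 1=1
begin --Continue substituting until there are no more
set @Pos=patindex(N'% @'+replicate(N'[0-9]',@NumDigs)+N' %',@Text)
if @Pos=0 break --No more... we're done with the loop
set @Text=stuff(@Text,@Pos,3+@NumDigs,N' % ')
end
end

/*Replace all spaces and pattern characters (i.e. "[", "]", "^") with Percents (%)*/
/*Also place a Percent at the beginning and the end*/
set @Text=N'%'
+replace(replace(replace(replace(@Text,N' ',N'%'),N'[',N'%'),N']',N'%'),N'^',N'%')
+N'%'

/*Transform all multiple Percents into single ones*/
set @Text=replace(@Text,N'%%%',N'%%')
set @Text=replace(@Text,N'%%',N'%')

/*Truncate to 4000 characters max*/
if len(@Text)>4000 set @Text=left(@Text,3999)+N'%'

/*Find our Adhoc queries that match the pattern*/
select Code=(select [processing-instruction(q)]=N':'+nchar(13)+qt.text+nchar(13)
for xml path(''),type)
,DB=db_name(QueryDatabaseID)
,[#Bytes]=cp.size_in_bytes
,cp.usecounts
from sys.dm_exec_cached_plans cp
cross apply (select QueryDatabaseID=convert(int,value)
from sys.dm_exec_plan_attributes(cp.plan_handle)
where attribute='dbid') F_DB
cross apply sys.dm_exec_sql_text(cp.plan_handle) qt
where objtype='Adhoc'
and qt.text like @Text
order by cp.usecounts desc
go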

Monday, October 31, 2011

Invitation for T-SQL Tuesday #024: Prox ‘n’ Funx

Boy, time really flies, doesn’t it? It’s hard to believe, but we are finishing up our second year of T-SQL Tuesdays! Adam Machanic started the whole concept back in November 2009 with T-SQL Tuesday #001, and here we are already up to #024.

So, without further ado…

You are hereby invited to this month’s T-SQL Tuesday #024, which will take place on November 8, 2011.

So, all you T-SQL Bloggers out there, please join the blog party and write up something revolving around this month’s topic: Prox ‘n’ Funx (which is just a coo-ul way of referring to Procedures and Functions).

This topic covers a lot of ground, so there’s a myriad of possibilities in what you can write about. You could discuss a really cool stored procedure or function that you wrote. You could write about a Dynamic Management Function that you can’t live without… or perhaps write about some of the new functions that are coming in SQL2012. How about limitations or “gotchas” or performance issues in working with procedures and functions? And on and on and on…

Now for those nasty rules:

1) Your post must go live between 00:00:00 GMT on Tuesday November 8 and 00:00:00 GMT on Wednesday November 9. If you’re unsure exactly when that is, well, guess what? There’s a function for that! It’s called GETUTCDATE().

2) Your post must link back to this post, and the link must be anchored from the logo (found above) which must also appear at the top of your post.

3) Make sure you leave a comment or a trackback here on this blog regarding your post (so that I can collect the posts and write a round-up).

Optionally:

4) “T-SQL Tuesday #024” should be included in the title of the post.

5) Tweet about your post using the hash tag #TSQL2sDay.

(I have to admit… that last rule is kind of amusing since I don’t have a Twitter account myself… A big thank-you to Rob Farley for initially tweeting this invitation).

I’m looking forward to reading your submissions. But don’t delay in composing your post! T-SQL Tuesday 24 is coming fast! I can almost hear the clock ticking down… Ka-chink! Ka-chunk! Ka-chink! Ka-chunk!

Tuesday, October 4, 2011

T-SQL Tuesday #023: Flip Side of the JOIN

Unbelievable… It’s been almost 5 months since I last posted something here. I’ve had a lot going on since that last post: A daughter graduating from college, a son graduating from high school, a few family vacation trips, moving my son into college, and between all of that, I was juggling an overwhelming amount of work from 3 demanding clients (and still am, quite frankly).

But I’m back now, ready to finally re-JOIN the SQL blogging world once again.

And that is very apt, because this post is part of the October T-SQL Tuesday on the subject of JOINs, hosted by Stuart Ainsworth.

So, let’s not waste any time… Let’s plunge in…

We are all aware of the various JOINs that are available to us in the T-SQL syntax: INNER JOIN, LEFT OUTER JOIN, RIGHT OUTER JOIN, FULL OUTER JOIN, and CROSS JOIN. But what I wanted to talk about today are two other kinds of JOINs: The (LEFT or RIGHT) Semi JOIN and the (LEFT or RIGHT) Anti Semi JOIN.

These are not available to us directly in the syntactical sense, but they are employed in a Query Plan when you use certain types of queries.

A LEFT Semi JOIN returns rows from the left side that have at least one matching row on the right side. At first glance, this seems very much like a regular INNER JOIN, except for one thing. With a LEFT Semi JOIN, the rows from the left table will be returned at most once. Even if the table on the right side contains hundreds of matches for the row on the left side, only one copy of the left-hand row will be returned.

For example, let’s find Handlebar Products (SubCategory=4) in the Products table that appeared on at least one Order. If we use a regular INNER JOIN…

select p.ProductID
from Production.Product p
join Sales.SalesOrderDetail sod on p.ProductID=sod.ProductID
where ProductSubcategoryID=4
/*
ProductID
---------
808
808
808
. . .
809
809
809
. . .
947
947
947
(1531 row(s) affected)
*/

…We get tons of duplicate Product IDs. The (actual) execution plan for the above query looks like so:



Note that the data flow arrow from the Products table shows that 8 rows were processed, and that the Nested Loops operator requested a total of 1531 rows from the SalesOrderDetail Index Seek for those 8 product rows.

In order to eliminate all the duplicates, we have to introduce a DISTINCT to the query:

select distinct p.ProductID
from Production.Product p
join Sales.SalesOrderDetail sod on p.ProductID=sod.ProductID
where ProductSubcategoryID=4
/*
ProductID
---------
808
809
810
811
813
946
947
*/

Interestingly enough, when we look at the actual execution plan for that query, we might expect to find the same plan as before, just with an Aggregate operator at the top of the tree to eliminate the duplicates and shrink the 1531 rows down to 7. Instead, this is what we find:



Note that the Nested Loops operator is a LEFT Semi JOIN. Somehow the optimizer is “smart enough” to realize (because of our INNER JOIN and the DISTINCT and our result set only consisting of columns from one side of the JOIN) that it would be more expensive to do the full INNER JOIN and then eliminate the duplicates, and so it employed a LEFT Semi JOIN. This is much more efficient because the Nested Loops operator only needs to request a single row from the SalesOrderDetail Index Seek operator for each Product processed… If a single row exists in SalesOrderDetail, then it can release the Product row to the Select operator for output. If no row exists, then it tosses out the Product row and moves on. If you hover over the data flow arrow coming out of the Index Seek, you’ll see that only 7 rows were passed along (because one of the 8 Product rows did not have a match).

I don’t know about you, but the hairs on the back of my neck stand up whenever I see a DISTINCT in a query. This same solution could be employed (more clearly in my opinion) via three other types of queries, all of which (by nature) involve a Semi JOIN… an INTERSECT query or an EXISTS query or an IN query:

select ProductID
from Production.Product
where ProductSubcategoryID=4
intersect
select ProductID
from Sales.SalesOrderDetail

select ProductID
from Production.Product p
where ProductSubcategoryID=4
and exists (select *
from Sales.SalesOrderDetail
where ProductID=p.ProductID)

select ProductID
from Production.Product
where ProductSubcategoryID=4
and ProductID in (select ProductID
from Sales.SalesOrderDetail)

All three of the above queries produce the exact same plan, which is the very efficient Semi JOIN plan that we just examined.

It’s really a matter of style as to which approach you use. I prefer EXISTS or IN. The INTERSECT operator is kind of cool, but it is very limiting. For example, let’s say we wanted the result set to include the Name of the Product as well. In the EXISTS and IN queries (and, for that matter, in our original INNER JOIN/DISTINCT query), we simply add the Name column to the SELECT clause and we’re done… the query plan would remain unchanged except that an extra column will come from the Product table.
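
Here’s what the EXISTS version looks like with the Name column added:

select ProductID
,Name
from Production.Product p
where ProductSubcategoryID=4
and exists (select *
from Sales.SalesOrderDetail
where ProductID=p.ProductID)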

But with the INTERSECT query, we have to introduce the Name column to both sides of the INTERSECT, meaning that we have to add an extra JOIN to get the Name:

select ProductID
,Name
from Production.Product
where ProductSubcategoryID=4
intersect
select sod.ProductID
,p.Name
from Sales.SalesOrderDetail sod
join Production.Product p on sod.ProductID=p.ProductID
/*
ProductID Name
--------- ----------------------
808 LL Mountain Handlebars
809 ML Mountain Handlebars
810 HL Mountain Handlebars
811 LL Road Handlebars
813 HL Road Handlebars
946 LL Touring Handlebars
947 HL Touring Handlebars
*/

And the execution plan now involves a lot more work:



We could also try to re-work the INTERSECT query using a CTE or a derived table to get just the ProductIDs and then JOIN the result to the Products table to get the Name column, like so…

with ProductsInOrders as
(
select ProductID
from Production.Product
where ProductSubcategoryID=4
intersect
select ProductID
from Sales.SalesOrderDetail
)
select pio.ProductID
,p.Name
from ProductsInOrders pio
join Production.Product p on pio.ProductID=p.ProductID
/*
ProductID Name
--------- ----------------------
808 LL Mountain Handlebars
809 ML Mountain Handlebars
810 HL Mountain Handlebars
811 LL Road Handlebars
813 HL Road Handlebars
946 LL Touring Handlebars
947 HL Touring Handlebars
*/

…But we still can’t get around the fact that we have to access the Products table multiple times. In fact, the execution plan for the above query is quite amusing when you look at it:



For each Product row acquired in the Clustered Index Scan, it does a SEEK into the same Clustered Index to get the Name! What a waste of resources.

Now on to Anti Semi JOINs…

A LEFT Anti Semi JOIN returns rows from the left side that have no matching rows on the right side… It’s the exact opposite of the Semi JOIN. As you can probably guess, this kind of JOIN is employed when you execute a query using EXCEPT or NOT EXISTS or NOT IN:

select ProductID
from Production.Product
where ProductSubcategoryID=4
except
select ProductID
from Sales.SalesOrderDetail
/*
ProductID
---------
812
*/

select ProductID
from Production.Product p
where ProductSubcategoryID=4
and not exists (select *
from Sales.SalesOrderDetail
where ProductID=p.ProductID)
/*
ProductID
---------
812
*/

select ProductID
from Production.Product
where ProductSubcategoryID=4
and ProductID not in (select ProductID
from Sales.SalesOrderDetail)
/*
ProductID
---------
812
*/

All three of the above queries produce the exact same execution plan using a LEFT Anti Semi JOIN:



So, for each of the 8 Product rows, the Nested Loops operator requests a row from the SalesOrderDetail table. If one exists, then it tosses the Product row aside and moves on. If one does not exist, then it releases the Product row up to the Select operator for output.

The EXCEPT operator has the same limitations as were described for the INTERSECT operator, and is therefore not as useful as the NOT EXISTS or NOT IN types of queries.

One important note about NOT IN: it is only equivalent to the NOT EXISTS query if the column being checked is non-nullable. If ProductID in Sales.SalesOrderDetail allowed NULLs, then the NOT IN query plan would look like this:



There’s a lot of logic employed in the plan to handle the fact that there may be NULLs in SalesOrderDetail. We can get back to our simpler plan, however, by adding a WHERE IS NOT NULL predicate to our IN subquery:

select ProductID
from Production.Product
where ProductSubcategoryID=4
and ProductID not in (select ProductID
from Sales.SalesOrderDetail
where ProductID is not null)

So if you prefer the NOT IN style over the NOT EXISTS style of querying, it’s a good idea to get in the habit of including a WHERE IS NOT NULL predicate in the subquery.
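
To see why a single NULL poisons a NOT IN, consider this little self-contained demo (the values are made up): 812 NOT IN (808, NULL) expands to 812<>808 AND 812<>NULL, and since the second comparison evaluates to UNKNOWN, the predicate can never be TRUE:

declare @Candidate int=812
select Verdict=case
when @Candidate not in (select v
from (values (808),(cast(null as int))) t(v))
then 'row returned'
else 'row filtered out' --this is what you get... 812<>NULL is UNKNOWN
end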

By the way, many people in the past have tried to emulate the Anti Semi JOIN behavior by doing a LEFT JOIN and adding a WHERE IS NULL to the query to only find rows that have no match on the right side, like so:

select p.ProductID
from Production.Product p
left join Sales.SalesOrderDetail sod on p.ProductID=sod.ProductID
where ProductSubcategoryID=4
and sod.ProductID is null

But this actually produces a plan that LEFT JOINs everything and then employs a Filter to only allow the IS NULL non-matches, creating a lot of unnecessary work and a much more inefficient query than the true Anti Semi JOIN:



I’ve seen people claim in the past that the LEFT JOIN/IS NULL approach is faster than the NOT EXISTS approach, but frankly, I can’t see it. If anyone has an example to offer, I’d certainly like to take a look.