SQL Server Nation
For all of your SQL Server needs.
Comparing a Sub-Query and a UDF

There has always been a lot of confusion in terms of performance between correlated sub-queries vs user-defined functions.  So, I've taken it upon myself to try to figure out which one is more effecient.  The results I have found are by no means conclusive, but will hopefully give you an idea as to when to use which approach as well as a general methodology for how to compare queries for effeciency purposes.

First, I'll start by creating a scalar user defined function that accepts the EmployeeID and the RateChangeDate and returns the SUM of the Rate field from EmployeePayHistory table in the AdventureWorks database.  This is a pretty common type of scalar valued user-defined function (UDF) used to return month-to-date and year-to-date style data.

CREATE FUNCTION dbo.udf_GetPayHistory
(
@EmployeeID INT,
@RateChangeDate DATETIME
)
RETURNS MONEY
AS
BEGIN
DECLARE @PayHistory MONEY

SET @PayHistory =
(
SELECT SUM(Rate)
FROM HumanResources.EmployeePayHistory eh
WHERE
eh.EmployeeID = @EmployeeID AND
eh.RateChangeDate >= @RateChangeDate
)
RETURN(@PayHistory)
END

 In the code snippet below I query the HumanResources.Employee table in the AdventureWorks database and pass the EmployeeID and a static date to the function I defined above. 

SELECT 
EmployeeID,
LoginID,
dbo.udf_GetPayHistory(e.EmployeeID,'1/1/1997')
FROM
HumanResources.Employee e
WHERE
EmployeeID BETWEEN 100 AND 200

 

The statement below uses a correlated sub-query to return the same information as the statement above which uses the UDF to return data.  The sub-query is correlated in that the HumanResources.EmployeePayHistory table access HumanResources.Employee table, which is table being queried at the outermost level.  So, although this statement operates as a batch statement, you can almost think of it as processing iteratively.  The inner query passing data to the outer query until the outer query satisfies its WHERE condition.

DECLARE @RateChangeDate DATETIME
SET @RateChangeDate = '1/1/1997'

SELECT
EmployeeID,
LoginID,
PayHistory =
(
SELECT SUM(Rate)
FROM HumanResources.EmployeePayHistory eh
WHERE
e.EmployeeID = eh.EmployeeID AND
eh.RateChangeDate >= @RateChangeDate
)
FROM HumanResources.Employee e
WHERE EmployeeID BETWEEN 100 AND 200

 

If I run the statements together and view the Actual Execution Plan for the two statements, it is the sub-query approach that takes up 75% of the processing time.  If I were to solely go by this information, I would think that the UDF approach to this solution is actually a LOT faster than the sub-query approach.  To confirm this assumption, I'll need to run a couple of other tests.

 

Below is a comparison of the STATISTICS IO statement between the two statements.  Because the two statements access the same two tables, I expect STATISTICS IO to show me the same information for the two statements, but this is NOT the case.  The statement that uses the UDF only includes reads for the Employee table, while the statement that uses the correlated sub-query involves both tables.  Clearly the 2nd statement is slower because it is showing more reads occurring on the database, right?

Next, I want to look at the exact time it takes for each of these statements to run.  You always have to be careful when comparing time between statements.  As a comparison metric, time is always dependent upon other factors on your database system at the time of running, such as load, memory pressure, etc.  The figure can change with each run.  However, since I'll be running these two statements at approximately the same time on the same system, I am comfortable with their approximate results.

Below is the screenshot for the subquery approach.  This query only takes 1ms.  I had run this query several times, so the execution plan had already been generated, which does help in execution time.  This query seems pretty fast.


Below is the STATISTICS TIME for the UDF approach.  Notice that using the UDF requires just a fraction more time.  Ultimately, the time differences is insignificant since there is such a slight variation.  Even though I ran the statements several times and received the same numbers each time, I'm still not really sure which statement is more effecient.

So, my next step is to look at the statement results in SQL Server Profiler.  I'm mainly interested in the Reads and Duration columns from Profiler.  The results from the subquery are below.  The number of reads from this approach match up very closely to the results from STATISTICS IO, which is what I expect.

 

However, when I view the results from using the user-defined function, I get results that I am not expecting.  The number of reads do not match up at all with what was returned from STATISTICS IO.  The results returned from Profiler shows the ENTIRE number of reads for the query statement.  The part that is misleading is that Profiler is showing all reads that the UDF incurrs.  So, even though it is being showed as being called once, there are more reads involved in the database through its use.

 

Results

Today I looked at a performance comparison between using a correlated sub-query and a user-defined function (UDF).  When I use the tools I am accustomed to using in terms of determining statement effeciency, I was misled.  I had to dig a bit depper and break out SQL Server Profiler to really determine which statement was faster; the correlated sub-query.  While the sub-query is faster, it does not promote code re-use like the UDF does, so there is a trade off.  If I were to use the logic over and over again, I would consider the performance trade off of using the UDF so I didn't have to rewrite the logic over and over again.  It really depends on the situation you're faced with to determine which approach to use.

 

 


Posted 4 Mar 2009 9:40 AM by TimChapman
Copyright SQL Server Nation 2010
Powered by Community Server (Non-Commercial Edition), by Telligent Systems