Aggregating With Correlated Sub-Queries #tsql2sday

Posted by & filed under , , , , , .

olap_1It’s that time of the month again.  Time for T-SQL Tuesday!  This month’s event is being hosted by Jes Schultz Borland (Blog|@grrl_geek).  The topic that she chose for this month is aggregates.

The Dilemma

A few months ago, a client I was working with had a dilemma with one of their search queries.  A query that previously took 10-15 seconds to complete would now take at least 45 seconds and sometimes even longer.

This change occurred shortly after a release in which the distribution of data in the database changed dramatically.  Instead of evenly distributed data across all of the tables, a few tables and indexes now had single values concentrations that equaled 50-75% of the index.

To mimic his behavior, we’ll create a copy of the SalesOrderDetail in AdventureWorks that is weighted towards a single ProductID.  This script will accomplish this:


IF OBJECT_ID('dbo.SalesOrderDetail') IS NOT NULL
DROP TABLE dbo.SalesOrderDetail
GO

SELECT
sad.*
INTO
SalesOrderDetail
FROM
Sales.SalesOrderDetail sad

INSERT  INTO SalesOrderDetail
(
SalesOrderID
,CarrierTrackingNumber
,OrderQty
,ProductID
,SpecialOfferID
,UnitPrice
,UnitPriceDiscount
,LineTotal
,rowguid
,ModifiedDate
)
SELECT
sad.SalesOrderID
,sad.CarrierTrackingNumber
,sad.OrderQty
,sad.ProductID
,sad.SpecialOfferID
,sad.UnitPrice
,sad.UnitPriceDiscount
,sad.LineTotal
,sad.rowguid
,sad.ModifiedDate
FROM
Sales.SalesOrderDetail sad
CROSS JOIN (
SELECT TOP 20
*
FROM
Sales.SalesOrderHeader) soh
WHERE
productid=870
GO

CREATE CLUSTERED INDEX tix_SalesOrderDetail_SalesOrderIDSalesOrderDetailID ON SalesOrderDetail(SalesOrderID,SalesOrderDetailID)
GO

CREATE UNIQUE NONCLUSTERED INDEX tix_SalesOrderDetail_SalesOrderDetailID ON SalesOrderDetail(SalesOrderDetailID)
GO

CREATE NONCLUSTERED INDEX tix_SalesOrderDetail_ProductID ON SalesOrderDetail(ProductID)
GO

Also, create the following index:


CREATE INDEX IX_Product_NameProductSubcategoryID ON Production.Product(Name) INCLUDE (ProductSubcategoryID)

This will remove a Key Lookup that doesn’t affect the demonstration.  It does help simplify the plan a little to highlight the issue.

Now that we have data to work with, let’s begin with a mock-up of the search query that the client was using.  As you may have guessed, I am using the AdventureWorks database for this demonstration.

The query that was being executed was fairly simple.  It had a few tables that were joined together to gather some base data.  In this case we will be looking at products and the category that the product belongs to.  We also want to include a count of the number of order in which the product has been included in.  For out “search” we are only going to return the product Water Bottle – 30 oz.  This is the product that is taking up most of the table.

This query looks like the following:


SELECT
ps.Name
,p.ProductID
,p.Name
,COUNT(*)
FROM
Production.ProductCategory pc
INNER JOIN Production.ProductSubcategory ps ON pc.ProductCategoryID=ps.ProductCategoryID
INNER JOIN Production.Product p ON ps.ProductSubcategoryID=p.ProductSubcategoryID
LEFT OUTER JOIN SalesOrderDetail sad ON p.ProductID=sad.ProductID
WHERE
p.Name='Water Bottle - 30 oz.'
GROUP BY
ps.Name
,p.ProductID
,p.Name
ORDER BY
p.Name
GO

This is a fairly common construct when querying.  I’ve written queries just like this for years.  If you take a look at the execution plan you’ll notice something that maybe you weren’t aware of.

image

The rows returned from SalesOrderDetail are join to the Product and ProductCategory data before they are aggregated.  Now in most cases this isn’t a big deal.  In some rare cases it is a big deal – and with the client I was working with that day it definitely was.  In their database, it wasn’t 98K rows that were expanded out.  It was millions of rows and this was leading to some serious issues with performance.

A Solution

I often watch webcasts and one that I had watched around the time this issue came up was Rob Farley’s (Blog | @Rob_Farley) Designing for Simplification.  In this webcast, Rob talks about the rules for having a join.  Looking at the query plan above, the LEFT OUTER JOIN is duplicating rows.  But the real purpose of that JOIN is to introduce a value.

We have a single column to aggregate, with a COUNT().  One of the ways that we can accomplish this would be to use a correlated sub-query.  Through this, the data is aggregated before it is joined to the other data in the results.  To accomplish this, the query would be re-written as such:


SELECT
ps.Name
,p.ProductID
,p.Name
,(
SELECT
COUNT(*)
FROM
SalesOrderDetail sad
WHERE
p.ProductID=sad.ProductID)
FROM
Production.ProductCategory pc
INNER JOIN Production.ProductSubcategory ps ON pc.ProductCategoryID=ps.ProductCategoryID
INNER JOIN Production.Product p ON ps.ProductSubcategoryID=p.ProductSubcategoryID
WHERE
p.Name='Water Bottle - 30 oz.'
ORDER BY
p.Name

The chief difference is that the query to determine the COUNT() is no longer brought in through a LEFT OUTER JOIN.  Instead it is included in the SELECT statement as a sub-query.  This small change does influence the execution plan:

image

As you can see above, the 98K rows are aggregated right away.  They aren’t joined to the other tables and duplicating that data unnecessarily.  The GROUP BY statement on the outer statement is no longer needed and the performance is slightly improved.

Lesson Learned

Unfortunately, this example of the technique I used with a correlated sub-query didn’t match the dramatic performance improvement that I witnessed at the client.  You could though see that there was a difference in the execution plan and there is also a slight query cost difference.  This technique is one that given the right distributions of data and “proper” table schema will provide significant benefits.

Another take-away from this post is that you need to try different techniques and watch your execution plans.  Different methods of writing T-SQL will influence the results and you need to try many techniques to find the right one.

Bonus Post

I guess I wasn’t really paying attention when I wrote my Monday blog post since it covered Aggregates as well.  If you want to read a bonus post on aggregates, check out XQuery for the Non-Expert – Aggregates.

  • http://sqlblog.com/blogs/rob_farley Rob Farley

    Wohoo! Glad to have helped. :)

  • http://www.webbtechsolutions.com/blog Joe Webb

    Nice post, Jason! I include the JOIN vs SubQuery topic in my Writing Better Queries presentation, but not at this depth.

    • Jason Strate

      Thanks, Joe.

  • Johan Bijnens

    Nice article.
    Another alternative is to actually migrate the count to a nested table expressions. In this case, so depending on statistics – it generates the same execution plan as your optimal plan.

    SELECT ps.Name
    , p.ProductID
    , p.Name
    , isnull(cnt.nRows, 0) as nRows
    FROM Production.ProductCategory pc
    INNER JOIN Production.ProductSubcategory ps
    ON pc.ProductCategoryID = ps.ProductCategoryID
    INNER JOIN Production.Product p
    ON ps.ProductSubcategoryID = p.ProductSubcategoryID
    left join (
    SELECT ProductID, COUNT(*) nRows
    FROM SalesOrderDetail sad
    group by ProductID
    ) cnt
    ON cnt.ProductID = p.ProductID

    WHERE p.Name = 'Water Bottle – 30 oz.'
    ORDER BY p.Name

    • Jason Strate

      Yup, that will work as well.

  • jesborland

    Great info Jason! Thanks for joining the party (as if there's a doubt you would!).

    • Jason Strate

      Thanks, Jes. Actually, I forget to do T-SQL Tuesday more often than not. :(

  • Pingback: T-SQL Tuesday #17 – APPLY Knowledge | Strate SQL()

  • Pingback: Less Than Dot - Blog - Awesome()