Welcome to day eight post for the 31 Days of SSIS. If you are just joining the series you may want to read the introductory post. In another post I talked about the raw file format for sources and destinations.
In this post, I want to discuss a solution I worked on where I used a Merge Join transformation with Raw Files to join data sets from two sources. This solution is based upon techniques that I’ve used at a few clients to solve their business needs and to improve the performance of a process that went sideways due to the volume of production data.
The general idea for this package is that you have two or more datasets that have similar or related data. Unfortunately, this data isn’t stored on a platform where the the two datasets can be easily joined together.
One scenario could be where you have a source system with a few million records in it. These records need to be imported into a second system. Since the number of new or changed records is relatively low, you want to trim off the records that the second system wouldn’t need for inserts or updates.
Another scenario could be the that you need to import a subset of a large file each day. The records that are imported need to match a second set of data, that has a few thousand records, that has already been loaded into the database. This technique could be used to determine the records that are part of the subset.
There are countless other scenarios that this could be used for. As I walk through the demo package, I hope that it will spark some ideas of areas where you could apply this type of a solution.
Merge Join Solution
This solution can basically be summed up as an application of the Merge Join transformation. So from a simple perspective we are just implementing one of the transformations. When adding this transformation to a package, though, there are a number of considerations to take.
Before the considerations, let’s define the Merge Join transformation. This transformation allows data from two data flows to be joined together on specified column(s). The join operation occurs as the datasets are streamed through the transformation and because of this there is only a single pass of the data. This requires that both data flows be sorted on the join column(s). Finally, the data can be joined as either a FULL, LEFT, or INNER join.
Based on the needs of this transformation, our solution will have two data sources. One of the data sources will be a raw file source. I have opted to use this because in most cases that I use the Merge Join transformation there is a raw file included as part of the solution.
To fulfill the sorting requirement, we’ll review two methods for having sorted data within a Data Flow. The first will be by using the Sort transformation. The second will be by modifying the data source such that it knows that the data is sorted. There are times that you will know that data is sorted within the data flow from previous operations. Use this knowledge to prevent the addition of unnecessary Sort transformations.
Package Control Flow
With the description above there isn’t a lot that will be included in the Package Control Flow. There are two Data flows in the SSIS package for the purposes of having a complete solution. Though in the end only one of them truly has a bearing on the solution.
This two Data Flows in the Control Flow are:
- DF – Data Setup: Data Flow task included to create a raw file that can be used as the source for the process. This would be deleted if you opt to use this package.
- DF – Match Rows: Data Flow task that contains the Merge Join and related transformations. This transformation will join to data sets together similar to querying two table together.
There are a couple variables that are part of the package to make referencing the raw files easier
The variables from above were added for the following purposes:
- RawFilePersonContact: Location of the raw file that is the source and destination for the Person.Contact data
- WorkFolder: This is my one variable to rule them all that sets the RawFileHierarchy and RawFileSource path.
Package Data Flows
As noted above, there are two Data Flows in this SSIS Package. The Data Flows is where all of the work is happening in this package.
DF – Data Setup
This Data Flow task creates the raw file that is being used as source data for the other data flow. If you are unfamiliar with using raw files, I suggest reading “Raw Files Are Awesome”. From a solution perspective, this data flow would likely not be included in a final solution.
The data flow is configured with the following transformations:
- OLE – Person Contact Source: OLE DB Source to the AdventureWorks database. The data source retrieves data from Person.Contact. The one item to note with this data source is that the query against Person.Contact is sorting the data. This will be important to know in the next data flow.
- RFD – Person Contact Destination: Raw File Destination that receives all of the records from Person.Contact.
This data flow was pretty bland. But it’s only there for this demo and just serves as a model for extracting data to raw files.
DF – Match Rows
The DF – Match Rows Data Flow task is where the real work for the solution occurs. In this data flow, data is taken in through two data sources. One of the data sources is pre-sorted (mentioned in the first data flow) and the data source is configured with this information. The other data source is not sorted and requires a sort within the data flow. Both datasets are joined with a Merge Join transformation to find matching rows.
- OLES – Person Contact Source: OLE DB Source to the AdventureWorks database. The data source retrieves all of the data from Person.Contact.
- ST – ContactID: Sort transformation that sorts the data on the ContactID column.
- RF – Person Contact Source: Raw File Source that contains all of the records from Person.Contact.
- MJ – ContactID: Merge Join transformation that joins both data flows on the ContactID column. For simplicity all of the columns from both data flows are included.
- RS – Junk: A Row Sampling task that is included because it is easy to include and provides a data path from the Merge Join transformation. This is here for attaching Data Viewers and to see the row counts.
Configuring Source Sort Order
As was mentioned, the RF – Person Contact Source has been configured in order to allow the SSIS package to know that the data it is retrieving is sorted. Changing the data source to allow this is relatively simple and is done in the same manner on the other data sources.
First, browse to the Show Advanced Editor menu item for the right-click menu on RF – Person Contact Source. An example of the right-click menu is to the right. Once this is selected the Advance Editor will open.
Browse in the Advance Editor to the Input and Output Properties tab. On this tab there will be a tree view for the transformation output. Select the root item in the tree view and the Common Properties window will populate. This can be done by following the red arrows the to the yellow region.
With in the Common Properties there is an option for IsSorted. Change this property to True. This change will make the SSIS package think that the output from this source is sorted. SSIS will not perform any validations to validate this.
Once the output is configured as sorted, the sort columns need to be selected. To do this, select the Output Columns folder. The select the ContactID column. Follow the red arrows below to the orange section. This will be the SortKeyPosition. Change this value from 0 to 1.
When this is done, the the data source is configured to be a sorted data source. Remember what I said above, SSIS will not perform any validations to validate that data sources marked sorted are sorted.
Configure Merge Join Transformation
Configuring the Merge Join transformation is relatively easy. Within the properties window for the transformation choose between INNER, LEFT, or FULL joins. These joins do the same operation that they do within T-SQL.
Next, select the columns that you want the transformation to output. The join columns will already selected so there is no need to configure those.
It’s really as easy as that.
Those properties are:
- MaxBufferPerInput: This property can be used to tune the memory requirements for the Merge Join. Joining data does incur resource costs. Depending on the data being processed this could have an impact on other packages. Or you may have additional resources you can apply to the execution. This property will assist in controlling that.
- TreatNullsAsEqual: Exactly as the property states, the property will allow NULL values to be treated as the same values.
Word of Caution
This transformation can bite you in uncomfortable ways if you don’t pay attention to how the data is coming into it. First, the transformation only does one pass across the datasets. If you have unordered records in one of the datasets you could end up with more, less or duplicated data. Pay attention to the sorts. Document the sorts so they aren’t removed by mistake.
Merge Join Wrap-up
The Merge Join can be an extremely useful transformation. With relatively low effort it can join together two datasets and provide an output that spans multiple platforms. The transformation sometimes gets a dodgy reputation because of its need for sorted data. But with a little planning the data can often be pre-sorted to avoid this pitfall.
With the scenarios in the introduction, I didn’t get into any specifics on performance and when you’d choose this method or maybe a Lookup transformation. This question can often be answered by looking at two factors:
- If you are pre-SQL Server 2008, then when you have static data for comparisons this method can provide a nice boost over a Lookup transformation.
- If the number of rows are in the 10’s of thousands or past a million records you will likely want to use a raw file over a cache for the Lookup transformation. A lot of time can be spent building a cache to have the index and avoid needing to sort data. Test both ways because the Merge Join solution performance may surprise you.
But what about real performance from a real world scenario. Here’s an anecdote from my experiences with the Merge Join solution.
I was once called to take a look at an SSIS package that was using a cache file and a Lookup transformation. The solution was deployed to production and the production data didn’t match the development data volumes (no surprise). The solution that ran great was taking hours to complete. I think by the time I had been called it had been run more than 10 hours at a time a few times without completion.
A little rewriting of the package to replace a couple Lookup transformations with this solution and the package was back to development execution timeframes. The last time I had talked to anyone on the package it was executing in about 5 minutes for the operation that had taken hours before.
If you have any questions, please leave a comment. If you have challenges you are facing, I would be interested in hearing about those as well.