Why’d That Job Fail?

Posted by & filed under , , .

Better than Bothan spiesHave you ever had a SQL Agent Job fail?  Of course, just like myself, you have had this happen.

When SQL Agent  jobs fail, you will want to know why it failed.  The problem is that knowing why a SQL Server Agent job has failed can sometimes be a tricky bit of business.

The typical issue I run into is that the output message on the job step is usually longer than the amount of space available to store the output message.  And as Murphey’s Law would predict, the bit of the message that contains the error will be at the tail end of the output message and not stored for later review.

Check Out The Message

Not exactly sure what I am talking about?  Take a look at the following failure message:

Executed as user AWESOMESTRATE ServiceAccount. …0.4035.00 for 32-bit Copyright (C) Microsoft Corp 1984-2005. All rights reserved. Started 63513 PM Error 2010-06-23 183536.87 Code 0xC001000E Source UserImport Description The connection SERVER_01 is not found. This error is thrown by Connections collection when the specific connection element is not found. End Error Error 2010-06-23 183536.87 Code 0xC001000E Source UserImport Description The connection SERVER_02 is not found. This error is thrown by Connections collection when the specific connection element is not found. End Error Error 2010-06-23 183536.88 Code 0xC001000E Source UserImport Description The connection SERVER_03 is not found. This error is thrown by Connections collection when the specific connection element is not found. End Error Error 2010-06-23 190030.68 Code 0×00000000 Source Log Import Activity Descript… The package execution fa… The step failed.

I got this a few weeks back on a SSIS package that was being executed.  The step had obviously failing based on the “The step failed.” message.  Why did it fail?  What didn’t happen?  Can you tell that the root cause of the error was based on an Execute SQL task?

Nope! You can’t and that’s the problem that prevents efficient troubleshooting of SQL Server Agent job issues.

Getting The Full Message

There should be a method for obtaining the full message so that you have what you need to troubleshoot job failures.  As there should be, there is.  This is available through on of the Advanced features on SQL Agent Job steps – the Output file.

image

If you set the Output file value to a path and file name, the output above will export the entire message generated by the SQL Server Agent job into a format that is much easier to read.

Below is the same job failure from above, but this time the output is from the Output file.  If you notice, without much trouble it is evident that the Execute SQL Task due to a TCP  Provider error.  Now instead of digging and looking around for the reason, the issue at hand can be resolved.

Microsoft (R) SQL Server Execute Package Utility
Version 9.00.4035.00 for 32-bit
Copyright (C) Microsoft Corp 1984-2005. All rights reserved.

Started:  6:35:13 PM
Error: 2010-06-23 18:35:36.87
Code: 0xC001000E
Source: UserImport
Description: The connection “SERVER_01″ is not found. This error is thrown by Connections collection when the specific connection element is not found.
End Error
Error: 2010-06-23 18:35:36.87
Code: 0xC001000E
Source: UserImport
Description: The connection “SERVER_02″ is not found. This error is thrown by Connections collection when the specific connection element is not found.
End Error
Error: 2010-06-23 18:35:36.88
Code: 0xC001000E
Source: UserImport
Description: The connection “SERVER_03″ is not found. This error is thrown by Connections collection when the specific connection element is not found.
End Error
Error: 2010-06-23 19:00:30.68
Code: 0×00000000
Source: Execute SQL Task
Description: TCP Provider: The semaphore timeout period has expired.
End Error
DTExec: The package execution returned DTSER_FAILURE (1).
Started:  6:35:13 PM
Finished: 7:01:05 PM
Elapsed:  1551.45 seconds

What About My Existing Job Steps?

At this point, hopefully I’ve got you convinced that a single tweak to your SQL Server Agent Job Steps can help you resolve failures with greater ease.  Let’s run out and update all of the SQL Servers and get this going.  No problem, right?

Well, what if your SQL Agent Job schedule has a few dozen jobs running.  Each job has a few steps.  And this is in an environment with a few dozen or hundred servers.  This can be a daunting task.  Daunting to the point that you just don’t have the time for this type of maintenance.

Fortunately, the script below can help you out by updating all of the job steps with a Output file details.  The script uses SQL Agent tokens to configure the job steps to output the files to the LOG folder under the SQL Server directory.  From there it assigns the Output file with the job name, the job step id, and the step name.  If you need to change directory location, just modify the @OutputFileDir variable.


DECLARE @OutputFileDir varchar(128)
SET @OutputFileDir = '$(ESCAPE_SQUOTE(SQLDIR))\LOG\'

SELECT
'EXEC msdb.dbo.sp_update_jobstep @job_id=N''' + CONVERT(varchar(36), j.job_id)
+ ''', @step_id=' + CONVERT(varchar(3), js.step_id)
+ ',@output_file_name=N''' + @OutputFileDir
+ REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(j.name,SPACE(1)
, '_'),'\',''),'/',''),':',''),'*',''),'?',''),'"',''),'<',''),'>',''),'|','')
+ '_$(ESCAPE_SQUOTE(STEPID))_'
+ REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(js.step_name,SPACE(1)
, '_'),'\',''),'/',''),':',''),'*',''),'?',''),'"',''),'<',''),'>',''),'|','') + '.txt''
GO'
FROM dbo.sysjobsteps js
INNER JOIN dbo.sysjobs j ON j.job_id = js.job_id
WHERE enabled = 1

Sweet, Run This In Production

No, please don’t do that… yet.

If you like this, check it out in your development environment and make certain that it works for you.  Then go ahead with your regular application lifecycle to promote this to other environments.  Who knows, HTML encoding might break part of this script and cause issues – it happened on a post of mine last week.