Updated 2016-01-06: Added an extra Job_Message check.

I've been experimenting with with the Management Data Warehouse (MDW) feature which is extremely important in SQL 2014 and above because it includes the "Analysis, Migrate, and Report" (AMR) tool. Here are some excellent examples of how it works:

What you'll find after setting it up are:

  • Data Collectors (DC). The last two are new and for AMR.
  • Agent jobs.
  • A tonne of SSIS packages stored in msdb.
  • Lots of temporary cache files used by the jobs.
  • A new "Transaction Performance Analysis Overview" report when you right-click on the MDW database which is the gateway to AMR. Unlike other MDW reports you can't get to it through the Management -> Data Collection area which makes them hard to find.

If everything works perfectly for you then there's no reason to read on. But looking at the massive list of Microsoft Connect entries relating to MDW failures you are probably going to have problems at some stage.

The reason I wrote this post is because the problem I most often encounter (and also Google'd and found others encounter as well) is with corrupted cache files.

  • An agent job executes a DC package which writes a cache file somewhere on disk.
  • The cluster fails over, a disk becomes inaccessible for a moment during a snapshot backup, or a cricket chirps too loudly, and the cache file becomes corrupted. Yes, I am positive this is related solely to MDW and not a general filesystem corruption issue.
  • The SSIS upload package can't upload the file because it's corrupted. Microsoft ignores all common sense about error handling and the job dies repeatedly instead of doing something sensible like logging the error and renaming or removing the broken file.
  • After some time (and lots of missing data in your MDW) the job shows up in your daily checks. Now you must play the exciting game of correlating the job name to the data collector to the cache file so that you can remove it and get things going again; which is like playing Sudoku with GUIDs and no pencil.

So I set up an agent job to do it for me every 15 minutes. Security note: Run it under a proxy account otherwise someone malicious could use it to delete other files from your system. Here's the logic:

  • Get the individual step history for upload jobs used by active data collectors.
  • Filter to just the most recent history and look for the message indicating a corrupt cache file. (There are a few different kinds but this one has found them all so far).
    • Delete the corresponding set of cache files from the file system. This is calculated using either a cache path configured with DC or a guess as to where the SQL Server Agent account path is. The latter check requires SQL 2008 R2 SP1+.
    • Restart the upload job. This is to clear the last error so that it will not be picked up again by this cleanup job unless the cache files are corrupted again.
Set Nocount On

Declare @Collector_Name Sysname, @Job_Name Sysname
Declare	@Job_Id Uniqueidentifier
Declare	@Cache_File Nvarchar(4000)

-- You can make this a permanent table in msdb if you like to record the failures.
Declare	@DC_Failures Table (
	Collector_Name Sysname,
	Job_Name Sysname,
	Job_Id Uniqueidentifier,
	Cache_File Nvarchar(4000)
)

Insert	@DC_Failures
Select	Collector_Name,
		Job_Name,
		Job_Id,
		'Del ' + Case When Cache_Directory Is Not Null Then Cast(Cache_Directory As Sysname) Else Agent_Temp End +
		'\*{' + Cast(Collection_Set_Uid As Sysname) + '}*.cache'
From	(
		-- Get the history of data collector upload job steps
		Select	scs.name As Collector_Name,
				scs.collection_set_uid As Collection_Set_Uid,
				sj.name As Job_Name,
				sj.job_id As Job_Id,
				sjh.message As Job_Message,
				Row_Number() Over (Partition By sj.name Order By sjh.run_date Desc, sjh.run_time Desc, sjh.step_id Desc) As Newest
		From	msdb.dbo.syscollector_collection_sets scs
		Join	msdb.dbo.sysjobs sj
		On		scs.upload_job_id = sj.job_id
		Join	msdb.dbo.sysjobhistory sjh
		On		sj.job_id = sjh.job_id
		And		sjh.step_id <> 0
		Where	scs.is_running = 1
		) a
Outer Apply (
		Select	parameter_value As Cache_Directory
		From	msdb.dbo.syscollector_config_store scs
		Where	scs.parameter_name = 'CacheDirectory'
		) b
Outer Apply (
		Select	'C:\Users\' + Replace(dss.service_account, 'NT Service\', '') + '\AppData\Local\Temp' As Agent_Temp
		From	sys.dm_server_services dss -- 2008 R2 SP1
		Where	dss.servicename Like 'SQL Server Agent%'
		) c
Where	Newest = 1 -- Filter the latest run
And     (
				Job_Message Like '%The file is damaged or not a SSIS-produced raw data file.%'
		Or		Job_Message Like '%Description: String too long.%')
		)

-- Nothing to do
If		@@Rowcount = 0
Begin
		Return
End

-- Give up if we have something to do but can't do it
If		Not Exists (Select Top 1 0 From sys.configurations sc Where sc.name = 'xp_cmdshell' and sc.value_in_use = 1)
Begin
		Raiserror(N'xp_cmdshell not enabled', 16, 1)
		Return
End

Declare	DC_Failures Cursor Local Forward_Only Read_Only Static For
		Select	Collector_Name,
				Job_Name,
				Job_Id,
				Cache_File
		From	@DC_Failures

Open	DC_Failures
Fetch Next From DC_Failures Into @Collector_Name, @Job_Name, @Job_Id, @Cache_File

While	@@Fetch_Status = 0
Begin
		Print	'Trying to restart collector ' + @Collector_Name + ', job ' + @Job_Name + ', by removing cache file with ' + @Cache_File
		Declare @DC_Output Table (
				[output] Nvarchar(4000)
		)
		Insert	@DC_Output
		Exec	sys.xp_cmdshell @Cache_File
		Exec	msdb.dbo.sp_start_job @job_id = @Job_Id

		Fetch Next From DC_Failures Into @Collector_Name, @Job_Name, @Job_Id, @Cache_File
End

Close DC_Failures
Deallocate DC_Failures

I would prefer not to have to write this job but resolving these issues by hand has just sucked up too much time to continue to use MDW without it.