I just read PowerShell, Pester, and Ola Hallengren's Maintenance Solution by @sqldbawithbeard and I'm certainly going to give it and Mike Robbins' Operational Validation framework a try soon.
I do however have a similar process already in place using PowerShell and Jenkins. What I find interesting though is that nothing is ever as easy as the demo because tests are hard.
Here's Rob's list and everything I had to work around on real systems while building my own tests.
SQL Server Agent is running
It seems straightforward.
Of course Windows Internal Database and Express don't have Agent so ignore those, if you have them, after you identify which edition they are.
In fact I go one step further and validate the service is set to start automatically as some installs default to it off while administrators or developers may manually shut it down during vMotions and who knows what, respectively.
Of course then you have to make sure you exclude failover clusters. Because those are set to manual so Windows can manage them.
But there's other non-failover cluster third party software which offers psuedo-cluster functionality for business continuity and disaster recovery and which does the same thing by managing the agent services itself.
Before you know it the edge cases outnumber the tests.
We should have 4 backup jobs
Definitely 4, not 3, and 5 is right out.
Unless of course someone builds a 500GB database with SQL CLR UDTs which slow down DBCC to a crawl, so you move the backups and maintenance for that database into their own jobs or steps with different flags, and possibly start staggering some stuff so it finishes on the weekend without dripping into Monday.
Then you've got more. But just kind of sweep them under the carpet. We have file system backup checks so that covers most of it…
We should have the clean up jobs
Just never add a job step to do another kind of clean up and then forget you rolled it out only to half of the servers so the rest are missing the step. And then don't design a test that compares against the raw original job you rolled out because then you find tests failing on all of the stuff you justifiably modified to cater for the big databases above.
All jobs should be scheduled (and enabled)
Job exists. Job scheduled. Schedule enabled. We're all set to go.
Unless the job is disabled. Better check the job is enabled.
And that the schedule is in the past, not a year from now. I don't know how that happened.
And that the schedule triggers every day or on whatever other period I expect and not just every February 29th that is also coincidentally a Friday. Okay that has never happened but I've seen similar. I think someone was trying to rattle me on purpose!
The jobs should have succeeded.
All tests pass.
Except on new systems where the jobs haven't had the chance to run once yet. So remember to run the jobs when you first build them onto a system.
So then the tests pass unless someone just created a new developer database. Then the tests are going to fail until the next full backup; unless you're using @ChangeBackupType = 'Y' in your DIFF job.
Look that's all. I'm not saying it's not worthwhile. I'm not saying it doesn't get better. I'm just saying nothing is ever as easy as the demo.