Resolve Data Protection Manager (DPM) Recovery Points and Registry Problems.

By Steven Jordan on 4/16/2014.

*Update:  This issue is related to Windows Update KB2506143 and KB2506146 (i.e., WMF 3.0).  Uninstall these updates from DPM 2010!

Problem Statement: The DPM server takes a long time to log on. It can take 15 to 90 minutes to log on after the server restarts. Additionally, Windows updates fail and roll back to their previous state after the server restarts. The integrity of system backups and restorations is at risk because the DPM server has become unreliable.

Additional Symptoms:

  a.)  Expired recovery points are not removed per DPM policy goals. Roughly half of the protected members show excessive recovery points in the DPM console:

Figure 1. Example of a protected member with 20478 recovery points
  b.)  Protection groups show excessive volume size. For example, the Exchange protection group indicates the recovery point volume consumes nearly 2TB.

  c.)  PruneshadowcopiesDPM2010.ps1 is a DPM PowerShell script that removes expired recovery points. The script hangs and does not remove expired recovery points.

 d.)  DPM console hangs when deleting inactive protection group members. The GUI is unresponsive and must be manually closed.

 e.)  The registry System file has bloated to over 220MB. System is located in c:\windows\system32\config\.

Figure 2. Bloated registry.
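
For anyone who wants to check for symptom (e) without browsing to the folder, here is a minimal sketch in Python. The 100 MB warning threshold and the function names are my own assumptions for illustration; the only figure from this post is that the SYSTEM file had grown past 220 MB. Reading the size of the live hive may require an elevated prompt.

```python
import os

# Hypothetical threshold for illustration; the post only reports that the
# SYSTEM file had grown past 220 MB.
WARN_MB = 100


def hive_size_mb(path):
    """Return the size of a registry hive file in megabytes."""
    return os.path.getsize(path) / (1024 * 1024)


def is_bloated(path, warn_mb=WARN_MB):
    """True when the hive file is larger than the warning threshold."""
    return hive_size_mb(path) > warn_mb


if __name__ == "__main__":
    hive = r"C:\Windows\System32\config\SYSTEM"
    if os.path.exists(hive):
        print(f"{hive}: {hive_size_mb(hive):.1f} MB, bloated={is_bloated(hive)}")
```

The check only reads the file size; it does not open or parse the hive.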
Root Cause:

   There were excessive disk-based recovery points (i.e., VSS volumes). In our case, DPM (or Windows) had improbably kept tens of thousands of recovery points per protection member. DPM, by design, is only supposed to store up to 64 recovery points for its file members, and up to 448 recovery points for its application (e.g., SQL database) members.

   The problem did not affect every protection group member. Some members (e.g., recent additions) had fewer than 100 recovery points. However, nearly half of all the protection group members had excessive (e.g., over 20,000) recovery points (Figure 1).
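
To make the limits concrete, here is a small Python sketch that flags members whose recovery point count exceeds the design limits quoted above (64 for file members, 448 for application members). The member names and most of the counts are hypothetical; only the 20478 figure comes from Figure 1. It assumes the counts were gathered by hand from the DPM console.

```python
# Design limits quoted in the post.
LIMITS = {"file": 64, "application": 448}


def over_limit(members):
    """Return (name, count, limit) for each member above its design limit.

    members: iterable of (name, kind, recovery_point_count) tuples,
    where kind is "file" or "application".
    """
    flagged = []
    for name, kind, count in members:
        limit = LIMITS[kind]
        if count > limit:
            flagged.append((name, count, limit))
    return flagged


# Hypothetical inventory; only the 20478 figure comes from the post (Figure 1).
members = [
    ("fileserver-d", "file", 20478),
    ("new-share", "file", 40),
    ("sql-db", "application", 310),
]

for name, count, limit in over_limit(members):
    print(f"{name}: {count} recovery points (limit {limit})")
```

With this sample data only the first member is flagged, since 20478 is far above the 64-point limit for file members.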

   The excessive, or rather expired, recovery points had to be removed. Normally, DPM automatically removes expired recovery points with its PruneshadowcopiesDPM2010.ps1 script. The default script was not working, so I turned to a custom PowerShell script named PruneVSS.ps1.

   PruneVSS.ps1 is a handy tool that removes disk-based recovery points based on date. Its interactive session determines protection groups and recovery point date ranges. N.B. The script was originally written by the late Ruud Baars.

  I had mixed success with Baars' script. It worked great on resources that had fewer than 8,000 recovery points. The script hung indefinitely for protection group members with more than 10,000 recovery points. The situation required extreme measures.

Inactive Protection Group Members

   The final option nixes the remaining protection group members that continue to retain expired recovery points. This DPM nuclear option removes all disk-based recovery points by deleting their associated volumes. It is imperative to plan for continuity before committing. It's best to ensure the secondary DPM server has backups of the primary protection groups and to make a full tape backup before proceeding.

  The afflicted protection group members were transitioned to inactive protection group members. I then attempted to remove the disk-based recovery points using the DPM console. Unfortunately, I had limited success using the GUI. I was able to remove the disk-based recovery points from a few of the inactive members. For the majority, however, the console simply froze. At this point, I turned to a second custom PowerShell script, named removeinactivedatasource.ps1. This script was a lifesaver: it removed all remaining disk-based recovery points. I ran the script in verbose mode so I could see its progress. It took about two hours to complete its job.

   I then moved the inactive protection group members back to their original protection groups. N.B., the recovery points must be deleted before re-adding the members to their original protection groups; otherwise DPM will continue to use their originally assigned volumes.

   The next day recovery points looked great: fewer than 100 for each member in the DPM console. DPM's PruneshadowcopiesDPM2010.ps1 also ran without problems. I had high hopes that the problem was solved, except that the DPM server continued to hang after restarting. Victory was short-lived.

Secondary Cause

   I had won a battle but not the war. Efforts to fix the recovery point volumes were successful, but the cure exposed a secondary sickness: phantom VSS volumes.

  I was fortunate to discover a handful of blogs that had somewhat similar DPM problems. Microsoft explains some of the symptoms in KB982210:

"This issue occurs because there are a large amount of orphaned registry keys. The Volume Shadow Copy Service (VSS) snapshots create many registry keys. However, they are not deleted after the VSS snapshot operations are completed."

Indeed, the DPM server's registry System file was bloated with nearly 15,000 VSS volume registry keys.

Fig. 3. Registry bloating from VSS snapshots.

   Scott Forsyth's Blog recommends applying the hotfix from KB982210. The hotfix, however, cannot install on a DPM server unless it runs Hyper-V! In fact, most of the focus for this problem centers on Hyper-V backups, but my problem has nothing to do with Hyper-V. Even if I wanted to install Hyper-V to allow the hotfix installation, the server was in no condition to install a new feature; all updates failed upon restarting the server.

   In our case, DPM uses iSCSI disks for the replica and shadow copy volumes. The alternate approach removes the phantom devices via script and then requires a second tool that shrinks the registry. Both Forsyth and Gary Fenton recommend running the Microsoft tool called DevNodeClean to remove phantom devices from the registry.


   DevNodeClean is available from Microsoft support or it can be compiled with Visual Studio per KB934234. Fenton also has a complete version available for download on his blog.

   I ran DevNodeClean and it indeed found orphaned devices: a grand total of 7. That was far fewer than the 10,000 I had expected. DevNodeClean did not work in this instance because it only checks for orphaned devices on disks, partitions, and volumes; it does not check for phantom volume shadow copies.

I described the problem to a talented programmer, #SAK, who works at my office. He reviewed DevNodeClean and further developed it so it checks for orphaned VSS volumes. SAK explained his program lists all orphaned VSS volumes from the command prompt: c:\cleanup.exe.

The program can remove all orphaned VSS volumes when run with a switch: c:\cleanup -r

Success! The SAK cleanup application found and deleted nearly 10,000 orphaned VSS volumes from the registry.  Download the SAK Cleanup tool from my OneDrive.

   N.B., David Candy's Blog has a good alternative to SAK's custom application. The modified RmHidDev.bat also finds and deletes orphaned VSS shadow volumes.

Tertiary Problem (i.e., third time's a charm):

 The crazy slow logons remained, even after all the expired recovery point volumes were deleted and all the orphaned VSS volume registry keys had been removed. Gambit's blog explains that DPM's problems persist because of its bloated registry. I confirmed the registry size had not changed:

Fig. 4. Bloated registry causes log on profile and update issues.

 Microsoft support provides a tool that shrinks the registry, called Chkreg.  N.B., Chkreg is only available by contacting their support team.  Chkreg is also available for download from my OneDrive.  The tool is easy to use, but the process is somewhat tedious. Essentially, Chkreg cannot fix the system file while the server is operational.  The server must be turned off and the disk must be accessed using a separate method.

  I shut the server down and used the Windows 2008 installation media to boot into the recovery mode command line.  I then used the recovery command prompt to navigate to c:\windows\System32\config, and copied the system file to a separate location. N.B., the drive letters in the recovery command prompt were different from what Windows normally uses. DiskPart provides current assignments with its list disk, list partition, and list volume commands.

  I removed the Windows CD and restarted the server (and waited an hour). When the server was back up I used the chkreg tool to repair the copy of the system file. I issued the following commands:

   #Chkreg /F SYSTEM /R
   #Chkreg /F SYSTEM /C

  The new system file was significantly smaller than the original. The system file shrank from 219 MB to approximately 140 MB. I admit, I had hoped the new file size would be closer to 10 MB, but at least there was some progress.

  Once more, I restarted the DPM server, and accessed the recovery command prompt with the installation media. I moved the original system file to a new location as a precaution. I then copied the new (i.e., shrunken) system file back to its original location, c:\windows\system32\config. I restarted the server and waited for DPM to come back online.

End result: it worked!  I can finally log onto the DPM server in less than 30 seconds.  Shortly thereafter I installed a year's worth of updates. Everything installed OK and the server remains trouble-free.
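
The order of the swap matters: back up the original first, then drop in the shrunken copy. I did this by hand from the recovery command prompt, but the same sequence can be sketched in Python, assuming the Windows volume is offline (e.g., the disk is attached to another machine that has Python). The function name, the backup folder, and the SYSTEM.orig file name are my own hypothetical choices; never run this against a live system.

```python
import os
import shutil


def swap_hive(config_dir, shrunken_hive, backup_dir):
    """Back up the original SYSTEM file, then put the shrunken copy in its place.

    config_dir:    offline copy of ...\\windows\\system32\\config
    shrunken_hive: path to the SYSTEM file already processed by Chkreg
    backup_dir:    hypothetical folder that keeps the original as a precaution
    """
    original = os.path.join(config_dir, "SYSTEM")
    os.makedirs(backup_dir, exist_ok=True)
    backup = os.path.join(backup_dir, "SYSTEM.orig")
    if os.path.exists(backup):
        # Never clobber an earlier backup of the original hive.
        raise FileExistsError(f"refusing to overwrite existing backup: {backup}")
    shutil.move(original, backup)           # step 1: keep the original safe
    shutil.copy2(shrunken_hive, original)   # step 2: install the shrunken file
    return backup
```

If the server fails to boot afterwards, the backup can be moved back to restore the original state.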


