Investigate large snapshot sizes in ShadowProtect

Written By Tami Sutcliffe (Super Administrator)

Updated at August 20th, 2021

Overview

One common issue seen in volume-based backups is larger-than-expected snapshot or consolidated image file sizes, or "bloat". It is seen most often in ShadowProtect deployments; AppAssure's and Acronis' deduplication can mitigate the effect enough that it might not be noticed.

For ShadowProtect, the rule of thumb is to look for consolidated daily images (-cd) of approximately 1-2 GB or less. Anything more can be a point of suspicion, but doesn't necessarily mean there's a problem. You might also investigate and find that the behavior causing the extra size is simply part of the normal operation of the system or the needs of the customer (such as a specialized database application that must run SQL dumps to a specific, non-alterable location, or an anti-virus solution with directory location restrictions).

That 1-2 GB estimate usually applies to highly active volumes. Less active or very well organized volumes can come in quite a bit lower: a pure file-storage volume might have consolidated daily images of only a few tens or hundreds of MB, depending on traffic and use.

This document covers investigation approaches in a ShadowProtect deployment, but similar steps can be used for any volume-based backup deployment.

Note: Depending on the software installed on the server, how the server is used, etc., it is possible that a larger consolidated or snapshot size is normal for the server(s) in question. Your mileage may vary.

Description

Additional related information can be found in the following KB article: Understanding Data Retention on a ShadowProtect BDR

Causes

The principal driver of image file size is FileWrite operations or sector changes on the agent server. Some common causes are:

  1. Anti-virus definition updates
    1. Particularly for network-wide management or installs.
    2. Symantec, Trend Micro, and AVG have been some of the more common ones we've seen.
    3. E.g., Symantec requires (or used to require) definition files to be stored under C:\Program Files\Common Files.
    4. The impact can often be seen in consolidated daily file sizes of 3-8 GB.
  2. Disk Defragmentation
    1. Disk defragmentation software changes sectors, so incremental backups may be larger than expected, especially if the disk was severely fragmented. An incremental backup contains only the sectors that have changed since the last full or incremental image was taken; defragmentation rearranges sectors across the disk, greatly increasing both the time and the size of the next incremental image.
    2. The impact ranges from minimal to incremental images nearly the size of the base image.
  3. Application Backups
    1. SQL dumps due to server configuration or specialized application configurations
    2. Windows Backup (.bak files) stored to protected volumes
    3. Windows ShadowCopy backup turned on
    4. Exchange store defragmentation is a possible cause, though it has not been confirmed as either causing or not causing extra snapshot file size.
  4. File Movement
    1. Adding data to an agent's storage
    2. Moving files from one location to another
      1. Archiving files to a different directory
      2. Customer user workflow resulting in files moving around on the disk
      3. Deleting data into a Recycle Bin

Note: Causes do not necessarily have to be "new" data. Movement of data from one position on a volume to another looks like "new" data from the snapshot's perspective, as the sketch below illustrates.
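To illustrate, here is a minimal Python sketch of sector-based change tracking (a simplified model for illustration only, not ShadowProtect's actual engine). Because an incremental must capture every changed sector, moving a file dirties sectors at both its old and new locations:

```python
import hashlib

SECTOR = 4  # bytes per "sector" in this toy model

def sector_hashes(disk: bytes):
    """Hash each fixed-size sector, standing in for sector change tracking."""
    return [hashlib.md5(disk[i:i + SECTOR]).digest()
            for i in range(0, len(disk), SECTOR)]

def changed_sectors(before: bytes, after: bytes):
    """Sectors whose contents differ; an incremental must capture all of these."""
    return [i for i, (a, b) in enumerate(zip(sector_hashes(before),
                                             sector_hashes(after))) if a != b]

# A 10-sector disk with an 8-byte file occupying sectors 1-2.
base = bytearray(b"\x00" * 40)
base[4:12] = b"FILEDATA"

# "Move" the file: clear the old sectors and write the same bytes at sectors 7-8.
moved = bytearray(base)
moved[4:12] = b"\x00" * 8
moved[28:36] = b"FILEDATA"

# No new data exists on the volume, yet four sectors changed (two old, two new),
# so the incremental carries twice the file's size in changed sectors.
print(changed_sectors(bytes(base), bytes(moved)))  # -> [1, 2, 7, 8]
```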

Investigation Steps

There are three general approaches to investigating and tracking down the cause of unexpected sizes:

  • Resource Investigations (writes to disk)
    1. Performance Monitor, Resource Monitor, or Sysinternals' Process Explorer can be used for active monitoring of running processes
      1. Disk I/O Usage Peaks
      2. We have seen SQL leaks cause this, as well as other software issues.
    2. Sysinternals' Process Monitor, or possibly Windows performance counters, logging FileWrite operations (a minimal logging sketch appears after this list)
      1. It is important to spool or cache the output of any tracking software (like Sysinternals' Procmon) to disk, not RAM, and to a volume not protected by a volume-based backup. Spooling the log output to a BDR's X volume is often the best available location.
  • Comparison Tools
    1. Any file and folder comparison tool.
    2. We cannot make any specific suggestion, but some partners have used one or more of the following: DiffDaff, WinMerge, BeyondCompare, etc.
  • Scheduled Operations
    1. Windows Task Scheduler
    2. Services
    3. Internal scheduling in any specific or custom applications, both 3rd-party and Windows components such as Backup or Shadow Copy
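For the resource investigation approach, a small script can passively log per-process disk writes instead of someone watching Resource Monitor live. A minimal sketch using the third-party psutil package (the path, interval, and threshold are illustrative assumptions, and the log should go to an unprotected volume such as the BDR's X: share, per the spooling note above):

```python
import time
import psutil  # third-party: pip install psutil

LOG_PATH = r"X:\ProcMon\write_log.txt"  # hypothetical path on an unprotected volume
INTERVAL = 15          # seconds between samples
THRESHOLD = 10 << 20   # log only processes writing more than 10 MB per interval

def sample():
    """Snapshot cumulative bytes written, per PID."""
    counts = {}
    for proc in psutil.process_iter(["pid", "name"]):
        try:
            counts[proc.info["pid"]] = (proc.info["name"],
                                        proc.io_counters().write_bytes)
        except (psutil.NoSuchProcess, psutil.AccessDenied):
            continue
    return counts

with open(LOG_PATH, "a") as log:
    previous = sample()
    while True:
        time.sleep(INTERVAL)
        current = sample()
        stamp = time.strftime("%Y-%m-%d %H:%M:%S")
        for pid, (name, written) in current.items():
            # Delta against the last sample; unseen PIDs count as zero.
            delta = written - previous.get(pid, (name, written))[1]
            if delta > THRESHOLD:
                log.write(f"{stamp}  pid={pid}  {name}  wrote {delta >> 20} MB\n")
        log.flush()
        previous = current
```

Schedule it to run over the suspect window (e.g., via Task Scheduler) rather than leaving it running indefinitely.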

The generally recommended approach is to change the ShadowProtect job schedule to run in 15-minute increments across the entire 24-hour period for a day or so (depending on how frequently you see extra-large consolidated files). This identifies whether the extra data is scattered evenly throughout the day (e.g., each 15-minute snapshot coming in at 150-300 MB is often a bit high) or isolated to specific spikes (e.g., 1 GB jumps at 6 AM, 2 PM, and 9 PM). Once you've identified the scope to investigate, apply the approaches above to those time periods.
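To spot the spikes quickly once the 15-minute schedule is running, the incremental files can simply be listed with their sizes and timestamps. A minimal sketch, assuming the standard .spi incremental naming and a hypothetical image folder on the BDR:

```python
from datetime import datetime
from pathlib import Path

IMAGE_DIR = Path(r"X:\Backups\SERVER01")  # hypothetical image folder

# Print incrementals oldest-first so size spikes stand out at a glance.
for f in sorted(IMAGE_DIR.glob("*.spi"), key=lambda p: p.stat().st_mtime):
    st = f.stat()
    stamp = datetime.fromtimestamp(st.st_mtime).strftime("%Y-%m-%d %H:%M")
    print(f"{stamp}  {st.st_size / 2**20:8.1f} MB  {f.name}")
```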

Example: Most 15-minute snapshots might come in around 50 MB. This is perhaps a little high, but could be normal, or even on the low side, for a mail server. However, every morning at 2 AM there is a snapshot of 400-500 MB. You can apply the following approaches:

  • Resource Investigation
    1. Active: Be up at 1:45 AM to watch Process Explorer or Resource Monitor, focusing on Disk I/O and the related processes.
    2. Passive: Set a performance counter or Process Monitor to capture FileWrite operations (or WriteFile and perhaps also WriteConfig) between 1:45 AM and 2:15 AM. Store the outputs across the network to the BDR's X:\ProcMon folder. Review the log outputs to determine which process was writing data during that time.
  • Comparison Tools (a scripted comparison sketch appears after this list)
    1. Mount the 1:45 AM snapshot
    2. Mount the spiked 2:00 AM snapshot
    3. Run a comparison tool to analyze the differences between mounts #1 and #2
  • Scheduled Operations
    1. Investigate all installed applications and Windows configurations, including Event Viewer, to determine whether any scheduled tasks kicked off during that time that would explain the extra data
    2. Maintenance tasks in SQL or Exchange
    3. Anti-virus updates and scans (though scanning very rarely impacts FileWrite operations)
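For the comparison step, the Python standard library's filecmp module can substitute for a GUI diff tool. A minimal sketch, assuming the 1:45 AM and 2:00 AM snapshots are mounted at hypothetical drive letters Y: and Z::

```python
import filecmp

def report(dcmp: filecmp.dircmp, indent: int = 0):
    """Recursively print files that exist only in, or differ between, the mounts."""
    pad = " " * indent
    for name in dcmp.right_only:
        print(f"{pad}only in 2:00 AM snapshot: {name}")
    for name in dcmp.diff_files:
        print(f"{pad}changed: {name}")
    for sub in dcmp.subdirs.values():
        report(sub, indent + 2)

# Y: = the 1:45 AM mount, Z: = the 2:00 AM mount (hypothetical mount points)
report(filecmp.dircmp("Y:\\", "Z:\\"))
```

Note that dircmp compares files shallowly by default (size and timestamp), which is usually enough to surface large writes without reading every file.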

Note: One testing approach is to disable any application backups or updates you are suspicious of (such as AV definition updates or disk defragmentation) for a day or two and observe the impact on snapshot sizes.

An additional article that might be useful when identifying the 'spike' time period: How To: Read ShadowProtect Chain

Mitigation Solutions

Action steps available

  1. Once the cause has been identified, the solution is usually to simply shut off the process. Disk defragmentation and Windows scheduled Shadow Copy usually fall into this category.
  2. Alternatively, some applications or operations need to continue running. Operations like most SQL dumps, or some AV installations, can be redirected to another volume which isn't protected by a volume-based backup.
  3. However, some operations or applications may not allow reconfiguration. In that case, it is important to remember that some server configurations and usage patterns will result in larger regular snapshot and consolidated sizes than would be seen in other deployments. Large sizes aren't always indicative of a problem, but the causes should usually be confirmed so that usage and resource considerations or expectations can be adjusted.

Inherent Mitigation available

The consolidation process offers some inherent mitigation for larger-than-expected snapshots. Consolidation is an additive process, but it is also a sorting or filtering mechanism.

For example: If 40 GB were deleted from a file server and dumped into the Recycle Bin, and a snapshot captured this state, the result could be an extra-large snapshot, perhaps as high as 40 GB depending on compression, etc. If the Recycle Bin were emptied before the end of the day and the last snapshot captured an empty Recycle Bin, the consolidated daily file should recognize the data as completely gone. This has not been extensively tested, but interim tests and discussions with StorageCraft and partners indicate it is often true. The exceptions we've seen may be a result of the Recycle Bin contents not actually being removed, or of how the operating system tracks and handles sectors at a low level.
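As a minimal sketch of this filtering effect (a simplified last-writer-wins model for illustration, not StorageCraft's actual consolidation logic): if consolidation keeps only each sector's final state for the day, data written and deleted within the same day contributes nothing to the daily file:

```python
# Each incremental records the sectors it changed: {sector: contents}.
snapshots = [
    {10: b"recycled", 11: b"recycled"},  # 9:00 AM: deletions land in the Recycle Bin
    {10: b"\x00", 11: b"\x00"},          # 5:00 PM: the Recycle Bin is emptied
]

# Consolidation is additive, but last-writer-wins: only final contents survive.
consolidated = {}
for snap in snapshots:
    consolidated.update(snap)

# Sectors whose final state matches the start of the day can be filtered out...
start_of_day = {10: b"\x00", 11: b"\x00"}
net_change = {s: v for s, v in consolidated.items() if start_of_day.get(s) != v}

# ...so the intraday churn adds nothing to the consolidated daily file.
print(net_change)  # -> {}
```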

ImageManager version 6, which requires ShadowProtect version 5, adds a level of consolidation called "monthly rollup", which allows the consolidation process to operate on monthly files. With earlier versions of ImageManager, once extra data was added to a monthly file, the only resolution was to re-base the chain.

Note: Other than the consolidation process, there is no way to retroactively apply a cleanup. If circumstances require "cleaning up" an existing chain and the consolidation process isn't enough or isn't quick enough, the only solution is to re-base the chain and archive (or delete) the existing chain. Please see our article on restarting a chain or server: How To: Restart ShadowProtect Chain