Failed Storage Spaces Disk Recovery

This wasn't a fun one to write considering the circumstances. Data protection tech is great when everything is set up fresh and working properly. Everything is just fine until it isn't, and Storage Spaces failed me here. This post applies to the Windows 10 variation specifically, which shares some of the same general principles with larger enterprise deployments in Server 2016/2019.

So, what happened?

My setup:
  • Asus Mobo, Intel Core i7-4790K, 32GB RAM
  • Windows 10 Enterprise, 1809
  • 120GB SSD (Boot)
  • 3 x WD Red 3TB HDDs
    • 1 storage pool: 8TB
    • 2 x mirrored vDisks
      • D:\ (Docs) - 2TB
      • M:\ (Media) - 2TB
The 3 x 3TB WD Red HDDs in my HTPC are going on 5 years old and were running fine, literally until I placed the order for new drives to replace them proactively. Murphy. On the day I receive the drives, I see resiliency warnings in Spaces on one HDD and both vDisks. I also see this in the System log: "The device, \Device\Harddisk2\DR2, has a bad block." Fun. Below is what Spaces told me, which seems to lead to a pretty obvious next step, right? Remove the drive labeled "Old2" from the pool, then replace it. Right? WRONG.

The other bit of good news I uncovered was that Windows no longer recognized my Media volume, which instead of reporting ReFS was marked RAW. At this point my M drive was basically dead and I was unable to browse in File Explorer.

Here's where things go from bad to worse, and in hindsight I can confidently say: DO NOT TRUST THE SPACES UI! At this stage I'm still hopeful that the repair process will remedy this. I click Remove on the Old2 HDD and, according to the UI, the process fails.

What I didn't know yet was that this drive, marked with a warning by the UI, was not actually the drive having problems. Even better, this drive that supposedly "failed" to be removed from the pool was actually marked as retired, which triggered repair and regeneration jobs for each vDisk. The UI later caught up and flagged Old2 as "preparing for removal". So I dutifully complied, shut the box down and added a fresh WD Red HDD. Below is the Spaces UI before I added the new drive. Something to note here: if you interrupt these repair jobs with a reboot, they start over at 0%.

What I also didn't fully realize at this point was how Spaces actually behaves when you elect to voluntarily remove a drive from a pool. To see what's really going on, you have to use PowerShell, which is what I should have done from the start and ignored the UI entirely. The command Get-StorageJob will give you insight into the repair process. As you can see below I now have 4 jobs running, 2 per vDisk.
What is essentially happening here is that the disk I elected to remove from the pool is being inventoried and slowly drained, as the data it hosts is regenerated somewhere else in the pool from the second copy in the mirror. Because I have 3 disks in my pool, no single disk has a complete copy of all files in a mirror set for a particular vDisk. The disk I'm removing holds 633GB of data from the Media vDisk and 197GB of data from the Docs vDisk. There also appears to be no way to stop or cancel this job once it has begun.
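
For reference, here is a minimal sketch of how I'd check job progress from an elevated PowerShell prompt. The GB conversion columns are my own addition for readability; the rest is plain Get-StorageJob and Get-PhysicalDisk output, and the property names assume the stock Storage module in Windows 10.

```powershell
# List every Storage Spaces job with its state and progress.
# ProcessedGB and TotalGB are calculated columns added here for readability.
Get-StorageJob |
    Sort-Object Name |
    Format-Table Name, JobState, PercentComplete,
        @{ Label = 'ProcessedGB'; Expression = { [math]::Round($_.BytesProcessed / 1GB, 1) } },
        @{ Label = 'TotalGB';     Expression = { [math]::Round($_.BytesTotal / 1GB, 1) } } -AutoSize

# Show which physical disks Spaces has actually marked as retired,
# regardless of what the Control Panel UI claims.
Get-PhysicalDisk |
    Where-Object { $_.Usage -eq 'Retired' } |
    Format-Table DeviceId, FriendlyName, SerialNumber, Usage, OperationalStatus, HealthStatus -AutoSize
```

Running this before touching anything in the UI would have told me straight away which disk had been retired and how much data was queued for regeneration.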

Now we wait.

My WD Reds are 5400RPM SATA drives, so the slowest of the slow. With 197GB marked for recovery on my first vDisk, my repair operation averaged 2GB processed per hour. Molasses. The net here is that for Spaces to recover 197GB on a 2TB mirror space took ~43 hours. Ugh. Patience is very important, as this is not a speedy process. The 633GB for the Media vDisk took even longer. What complicated the recovery was that I was removing a perfectly good drive from the pool because the UI told me to, leaving the problem drive to work around its bad blocks while serving the data being recovered. How long would this have taken if the bad drive had been removed instead of a perfectly good one? No idea.
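
If you want a rough idea of how long the wait will be, one approach is to sample Get-StorageJob twice and extrapolate. This is only a sketch under some assumptions: the 10-minute interval is arbitrary, the totals are summed across all running jobs, and the estimate will be off if a job finishes or a new one starts between samples.

```powershell
# Estimate remaining repair time by sampling job progress twice.
# The interval is an arbitrary choice; longer samples smooth out stalls.
$intervalSeconds = 600

$first = @(Get-StorageJob | Where-Object { $_.JobState -eq 'Running' })
if ($first.Count -eq 0) {
    Write-Host 'No running storage jobs found.'
}
else {
    Start-Sleep -Seconds $intervalSeconds
    $second = @(Get-StorageJob | Where-Object { $_.JobState -eq 'Running' })

    # Sum progress across all running jobs in each sample.
    $processedBefore = ($first  | Measure-Object -Property BytesProcessed -Sum).Sum
    $processedAfter  = ($second | Measure-Object -Property BytesProcessed -Sum).Sum
    $totalBytes      = ($second | Measure-Object -Property BytesTotal -Sum).Sum

    $delta = $processedAfter - $processedBefore
    if ($delta -le 0) {
        Write-Host 'No measurable progress during this sample.'
    }
    else {
        $gbPerHour = ($delta / 1GB) * 3600 / $intervalSeconds
        $hoursLeft = (($totalBytes - $processedAfter) / 1GB) / $gbPerHour
        Write-Host ('Rate: {0:N1} GB/hour, roughly {1:N0} hours remaining' -f $gbPerHour, $hoursLeft)
    }
}
```

Treat the result as a ballpark only. In my case the rate varied a lot, and as noted earlier a reboot sends the jobs back to 0%.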

CrystalDiskInfo was useful here to shed some light on which drive was actually bad and why by reporting SMART data. There are a few interesting things happening in the image below that I want to call your attention to.
  • First, the Spaces UI would periodically change from a blue information banner to the red Error banner you see below, flagging my M vDisk as having no resiliency. This error would clear on its own after a few minutes. Checking in PowerShell shows everything as "OK", but the disk Crystal flagged as bad has high read error and pending sector counts. So far this jibes with what is being reported in the System event log.
  • Windows originally flagged a warning on the HDD listed on the bottom, "Old2". There is no clear indication why, although I have a suspicion a bit further down. Checking in Crystal, Old2 reads as good, echoed by PowerShell. Notice that Old2 is marked as "Retired" in PowerShell. The drive that actually has a problem, Old1, shows OK in the Spaces UI and PowerShell, but is reporting bad blocks in the event log. Why doesn't Spaces see that this drive has a problem?!
  • The other interesting thing here is that although my new HDD, serial ending 68E3, is contributing capacity to the pool, you'll notice that no data is actually being stored on it according to the UI. This is after 8 days of data recovery!! I would have thought Spaces would make use of all storage resources in the pool when performing a repair/regeneration activity. Apparently not! My guess is that the process rebuilds using the existing media, and then I'll have to manually run an "optimize drive usage" job to re-balance the pool when the repair is complete. Geez, wtf.
  • As the HDD Old2 was very slowly being prepared to be removed, the % used as reported continuously dropped as the bytes were processed. 
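
Some of what Crystal and the UI show can also be pulled from PowerShell. The sketch below assumes an elevated prompt and a single non-primordial pool. Not every drive or controller populates the reliability counters, and as far as I can tell they do not include a pending sector count, so a dedicated SMART tool is still worth running alongside this.

```powershell
# Per-disk health, read errors and allocation for every disk in the pool.
# Reliability counters may be blank on some hardware; treat blanks as "unknown".
Get-StoragePool -IsPrimordial $false | Get-PhysicalDisk | ForEach-Object {
    $counters = $_ | Get-StorageReliabilityCounter
    [pscustomobject]@{
        DeviceId              = $_.DeviceId
        FriendlyName          = $_.FriendlyName
        SerialNumber          = $_.SerialNumber
        Usage                 = $_.Usage
        HealthStatus          = $_.HealthStatus
        ReadErrorsTotal       = $counters.ReadErrorsTotal
        ReadErrorsUncorrected = $counters.ReadErrorsUncorrected
        PowerOnHours          = $counters.PowerOnHours
        # Share of the disk currently allocated to virtual disks in the pool.
        AllocatedPct          = [math]::Round(100 * $_.AllocatedSize / $_.Size, 1)
    }
} | Format-Table -AutoSize
```

AllocatedPct is the number to watch on both ends: it should fall on the disk being drained, and it shows whether a newly added disk is holding any data yet.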

While I waited and since there is no way to stop the jobs running, I dug more into the physical infrastructure of the pool. Get-PhysicalDisk will show the specifics of all physical media in the system, including the device ID. If you recall back to my System log entry, disk 2 was reporting a bad block which coincides with what Crystal is saying.
Piping Get-PhysicalDisk to Get-VirtualDisk also shows some useful information, as you can see exactly what each disk is doing with regard to each vDisk. I did this below for each disk in turn, and you can see that the Media vDisk is flagged as Degraded on Old1. Because no single disk has a full copy of any mirror data set, disks Old2 and Old3 are flagged as having no redundancy, as at this point there is none. This output also gives a potential clue as to why Old2 was flagged by Spaces to begin with. Why does Old2 not contribute to the Docs vDisk? Could this be the cause of the original warning? I highly suspect that this was happening at the same time Old1 was reporting a bad block in the event log.
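
The checks described above can be sketched roughly as follows. Two assumptions to flag: I'm filtering the System log on source "disk" with event ID 7, which is how the bad block message showed up for me, and I'm assuming the Harddisk number in that message lines up with DeviceId from Get-PhysicalDisk, as it did on my box.

```powershell
# Recent bad-block events from the System log (assumed: source "disk", event ID 7).
Get-WinEvent -FilterHashtable @{ LogName = 'System'; ProviderName = 'disk'; Id = 7 } `
    -MaxEvents 20 -ErrorAction SilentlyContinue |
    Format-Table TimeCreated, Message -Wrap

# Physical disks that belong to a pool, ordered by device ID.
$poolDisks = Get-StoragePool -IsPrimordial $false |
    Get-PhysicalDisk |
    Sort-Object { [int]$_.DeviceId }

$poolDisks |
    Format-Table DeviceId, FriendlyName, SerialNumber, Usage, OperationalStatus, HealthStatus -AutoSize

# For each pool disk, list the virtual disks that have data on it,
# along with their column and data-copy layout.
foreach ($disk in $poolDisks) {
    Write-Host "Virtual disks on $($disk.FriendlyName) (DeviceId $($disk.DeviceId)):"
    $disk | Get-VirtualDisk |
        Format-Table FriendlyName, ResiliencySettingName, NumberOfColumns,
            NumberOfDataCopies, OperationalStatus, HealthStatus -AutoSize
}
```

A disk that is missing from a vDisk's list, the way Old2 was missing from Docs, stands out immediately in this output.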

Unfortunately the majority of Spaces-related event logs are near useless and there doesn't appear to be a way to enable verbose logging for troubleshooting purposes. Four separate Spaces-related event logs filled with useful detail like this:

So how did it all end? In failure. Now to be fair, the Docs vDisk was recovered successfully, but I never lost the file system on that volume either. I was so excited: after 14+ long days I got so close to the end. Only .03% was left to be drained from Old2, but it never finished. It got into this weird loop of small 256MB and 512MB repair jobs that just didn't seem to complete, over and over. I even let it sit for a full 24 hours just to see if it would work itself out. Nope. So here I am, with a Spaces recovery job that won't finish and a RAW volume that I can't browse.

Great, what now?

Luckily I have a good backup, so as much as I hate that it's come to this, I'll be rebuilding and restoring from backup. If you aren't as fortunate, don't fret, you have options. I looked at a few different tools that are capable of restoring from ReFS RAW disasters. The way these tools work is that they offer a free demo build that will allow you to scan your disks to identify what exactly can be recovered. If you actually want to restore real files, then you have to pay.
  • ReclaiMe File Recovery is a tool that specifically calls out support for ReFS as well as NAS systems. There is another variant I found on Google specifically for Storage Spaces data recovery, but the logos, while similar, are different, the docs are old and, aside from a link to ReclaiMe on the page, I'm not completely sure it's the same company. File Recovery should be all you need here anyway, as you are recovering whatever you can from bare drives. If you get to the point where you just need to save SOMETHING, for $79 ReclaiMe might be able to help.
  • EaseUS Data Recovery is another option and a name you might have seen before. They boast ReFS support as well as RAW partition recovery. $70 gets you an unlimited data recovery license for a single PC.
  • R-Studio I had never heard of before, but they offer a really impressive list of capabilities and supported file/operating systems. If you search, there are some people in the forums reporting successful recoveries with this tool. $60 gets you an R-Studio NTFS license, which also supports ReFS.
If your drive has bad blocks you may have data loss. If you're recovering from a 2-way mirror in Storage Spaces, one of your drives from the pool should have the files you hope to recover. 

Lessons Learned

  • First and foremost, I no longer trust ReFS. Search Google for "ReFS RAW" and look at how many people have experienced lost file systems turning RAW on their ReFS volumes. Completely unacceptable. If you value your data, stick to NTFS. 
  • The Storage Spaces UI in Windows 10 is super buggy at best and completely useless at worst. Check in PowerShell and a third-party SMART tool like Crystal to see what's really going on before taking any action. Don't believe the UI if it flags a particular drive as having problems!  
  • Always, always, ALWAYS add a new disk to your Spaces pool first, then re-balance before you elect to remove anything! Would that have saved me here? Considering that I got screwed by ReFS leaving my Media vDisk in a RAW state, I doubt it.
  • Initially, this experience had me rethinking my 3-disk mirror setup. At the end of the day what I have is 1 column with 2 data copies spread across 3 disks. I've proven previously that performance doesn't change unless you're running 2 columns, which requires 4 disks minimum, so that 3rd disk really only contributes storage here. I can still only tolerate a single disk failure but what I've done is spread my failure domain to a slightly larger surface. None of my 3 drives should ever have a full mirror replica of any vDisk volume. So in the event of a failure, what should happen is that I have less data to rebuild once the failed drive is replaced, thus speeding my recovery time. I'll stick with my 3-disk pool for now but will rebuild using NTFS. If I get burned again... 
  • RAID and software-defined data protection are not a replacement for backups! You still need to back up your data or roll the dice with the recovery tools should disaster strike.
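
To make the "add first, re-balance, then remove" lesson concrete, here is a hedged sketch of the order I'd follow next time. The pool name and serial number are hypothetical placeholders, and -CanPool $true grabs every unpooled disk in the box, so check what it returns first. Run the steps one at a time from an elevated prompt and wait for Get-StorageJob to show no running jobs between them; this is not meant to be pasted in as a single script.

```powershell
# Hypothetical values: substitute your own pool name and the serial of the disk to retire.
$poolName     = 'Storage pool'
$retireSerial = 'REPLACE-WITH-SERIAL'

# 1. Add the new, unpooled disk(s) to the pool before removing anything.
$newDisks = Get-PhysicalDisk -CanPool $true
Add-PhysicalDisk -StoragePoolFriendlyName $poolName -PhysicalDisks $newDisks

# 2. Re-balance existing data across all disks, including the new one.
Optimize-StoragePool -FriendlyName $poolName

# 3. Once no jobs are running, retire the disk you verified as bad
#    (SMART data and event log, not the UI) and repair the virtual disks.
$oldDisk = Get-PhysicalDisk | Where-Object { $_.SerialNumber -eq $retireSerial }
$oldDisk | Set-PhysicalDisk -Usage Retired
Get-StoragePool -FriendlyName $poolName | Get-VirtualDisk | Repair-VirtualDisk

# 4. Only after the repair jobs finish, remove the retired disk from the pool.
Remove-PhysicalDisk -StoragePoolFriendlyName $poolName -PhysicalDisks $oldDisk
```

The key point is that the disk being retired is one you identified yourself from the serial number, not one the UI picked for you.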
