We recently had a case where the two virtual machines acting as the only domain controllers for a child domain had their VHDX files compromised. Essentially, the disk hosting the VHDX files filled up and the machines shut down.
When we turned them back on, neither directory service would come online; DCDIAG reported that both were waiting for initial replication to complete. Here's how we got them back online.
- Shut down all of the domain controller servers.
- Reboot one of them into Safe Mode with Networking.
- Log in with a domain account that has administrative rights. Here we were fortunate that, because this was a child domain, we could log in as an administrative user from the forest root domain.
- We discovered that the servers were self-referencing for DNS. That is normally not a huge problem, but it becomes one when both servers fail to come up. Again, we were saved by having made the DNS zone a forest-level zone, so it was available everywhere. We used the following commands to add a DNS server from the forest root domain and delete the locally-referenced DNS servers:
  NETSH INT IP ADD DNS "Connection Name" <IP Address of DNS Server>
  NETSH INT IP DELETE DNS "Connection Name" <IP Address to remove>
- Next, we also disabled UAC to prevent any administrative escalation issues while we were bringing things online. This was accomplished by setting HKLM\SOFTWARE\Microsoft\Windows\CurrentVersion\Policies\System\EnableLUA (DWORD) from 1 to 0.
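- Next, we used NTDSUTIL to seize the FSMO roles. Since this was a child domain, we only needed to take over the three domain-level roles: PDC Emulator, RID Master, and Infrastructure Master. The Schema Master and Domain Naming Master roles are forest-wide, so a child domain does not have its own copies to recover.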
- We rebooted into normal mode on that DC so that everything would start up.
- After about 5-10 minutes, the directory service completed initialization and was able to respond to logon requests and password changes.
- Next, we booted the secondary domain controller into Safe Mode with Networking.
- We used the same NETSH commands to flip its DNS over to a forest root DNS server.
- Since the domain was mostly up, we chose not to disable UAC on that server.
- We rebooted the secondary server into normal mode.
- After a few minutes, it was also up and running. We then tested replication and monitored it using REPADMIN commands.
- At this point, we had a working Active Directory infrastructure (hooray!).
- Next, we took a look at SYSVOL replication. We placed a text file at the top of the SYSVOL share and waited for it to replicate over. After a few minutes of it not appearing, we looked into the event log.
- We found an event indicating that the DFS Replication service's database was corrupted and needed to be reset. The event text even included the command line to do so, which turned out to be a WMIC command that resets the service.
- We had to run the same command on both domain controllers. Once unblocked, DFSR ran its health check and compared the state of both SYSVOL copies. After a minute or two, replication was working.
- Finally, we did the reboot test and rebooted both servers, one at a time.
- We were then able to verify that AD and DFS replication were both working after the reboots.
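For anyone who wants the commands from the steps above in one place, here is a minimal sketch that only builds and prints the command lines; it runs nothing against a server. Several things in it are assumptions rather than what we literally typed: the connection name, the IP addresses, and the server name `CHILD-DC1` are placeholders, the helper names are made up for illustration, and the two REPADMIN switches are just common choices (the write-up above does not record which ones were used). NTDSUTIL will still ask for confirmation on each seizure when run interactively.

```python
"""Dry run: print the recovery command lines so they can be reviewed before use."""


def dns_repoint(connection, new_dns, old_dns_servers):
    """Add a forest root DNS server first, then remove each old entry."""
    commands = [["netsh", "interface", "ip", "add", "dns", connection, new_dns]]
    for old in old_dns_servers:
        commands.append(["netsh", "interface", "ip", "delete", "dns", connection, old])
    return commands


def disable_uac():
    """Set EnableLUA to 0; the change applies at the next reboot."""
    key = r"HKLM\SOFTWARE\Microsoft\Windows\CurrentVersion\Policies\System"
    return [["reg", "add", key, "/v", "EnableLUA", "/t", "REG_DWORD", "/d", "0", "/f"]]


def seize_domain_roles(target_dc):
    """Seize only the three domain-level FSMO roles onto target_dc."""
    return [[
        "ntdsutil", "roles", "connections",
        f"connect to server {target_dc}", "quit",
        "seize pdc", "seize rid master", "seize infrastructure master",
        "quit", "quit",
    ]]


def check_replication():
    """Replication checks to run after the reboot (switches are an assumption)."""
    return [["repadmin", "/replsummary"], ["repadmin", "/showrepl"]]


def to_command_line(argv):
    """Join arguments, double-quoting any that contain a space."""
    return " ".join(f'"{arg}"' if " " in arg else arg for arg in argv)


def recovery_plan(connection, new_dns, old_dns_servers, target_dc):
    """Return the command lines in the order the steps above used them."""
    steps = (
        dns_repoint(connection, new_dns, old_dns_servers)
        + disable_uac()
        + seize_domain_roles(target_dc)
        + check_replication()
    )
    return [to_command_line(step) for step in steps]


if __name__ == "__main__":
    # Placeholder values; substitute your own connection name, IPs, and DC name.
    for line in recovery_plan("Ethernet", "10.0.0.10", ["127.0.0.1"], "CHILD-DC1"):
        print(line)
```

Printing the plan first makes it easy to sanity-check the IP addresses and the target server before pasting anything into an elevated prompt on a domain controller.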
At that point, we were up and running again. A few lessons learned:
- The circumstances that got us into this mess in the first place were complex; it took a perfect storm to cause the issue. The bottom line is to make sure that the domain controllers' VHDX files are never on the same shared volume on the Hyper-V cluster.
- Make sure that the domain has its own administrator-level accounts whose passwords you know for certain.
- Flip your DNS zones to forest-wide replication so that you have the ability to point to alternate DNS servers and know that the data is available.
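The self-referencing DNS problem is easy to audit ahead of time. The sketch below is one hypothetical way to do it, assuming you have already collected each domain controller's configured DNS servers by whatever means you prefer. The function names, server names, and addresses are all made up for illustration. A domain controller is flagged when every resolver it uses is either loopback or another domain controller in the same domain, which is the situation that bit us.

```python
from ipaddress import ip_address


def resolvers_outside_domain(dns_servers, domain_dc_ips):
    """Return the configured DNS servers that are neither loopback nor a DC of this domain."""
    dc_ips = {ip_address(ip) for ip in domain_dc_ips}
    outside = []
    for server in dns_servers:
        addr = ip_address(server)
        if addr.is_loopback or addr in dc_ips:
            continue  # depends on this domain's own DCs being up
        outside.append(server)
    return outside


def at_risk(dns_config, domain_dc_ips):
    """Names of DCs with no resolver outside their own domain, sorted."""
    return sorted(
        name
        for name, servers in dns_config.items()
        if not resolvers_outside_domain(servers, domain_dc_ips)
    )


if __name__ == "__main__":
    # Hypothetical inventory: DC name -> DNS servers configured on its adapter.
    child_dcs = ["10.1.0.10", "10.1.0.11"]
    config = {
        "CHILD-DC1": ["127.0.0.1", "10.1.0.11"],
        "CHILD-DC2": ["10.1.0.10", "10.0.0.10"],  # 10.0.0.10 is a forest root DNS server
    }
    print(at_risk(config, child_dcs))
```

With the sample inventory, only `CHILD-DC1` is flagged, since `CHILD-DC2` also points at a forest root DNS server. Pointing at an outside server only helps if the zone is actually replicated there, which is why forest-wide replication matters.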