Later. First of all, don’t panic. So this is how the story begins…
The customer complained that there were some errors in the DCs’ event logs related to the File Replication Service (FRS, service name NtFrs). This is the service responsible for replicating the SYSVOL folder if you haven’t yet migrated to DFSR (which is better, faster, etc). The DCs were still using NtFrs, so the scary event entry was something like this:
“The File Replication Service has detected that the replica set «DOMAIN SYSTEM VOLUME (SYSVOL SHARE)» is in JRNL_WRAP_ERROR.
Replica set name is : «DOMAIN SYSTEM VOLUME (SYSVOL SHARE)»
Replica root path is : «c:\windows\sysvol\domain»
Replica root volume is : «\\.\C:»
A Replica set hits JRNL_WRAP_ERROR when the record that it is trying to read from the NTFS USN journal is not found. This can occur because of one of the following reasons.
 Volume «\\.\C:» has been formatted.
 The NTFS USN journal on volume «\\.\C:» has been deleted.
 The NTFS USN journal on volume «\\.\C:» has been truncated. Chkdsk can truncate the journal if it finds corrupt entries at the end of the journal.
 File Replication Service was not running on this computer for a long time.
 File Replication Service could not keep up with the rate of Disk IO activity on «\\.\C:».
Setting the «Enable Journal Wrap Automatic Restore» registry parameter to 1 will cause the following recovery steps to be taken to automatically recover from this error state.
 At the first poll, which will occur in 5 minutes, this computer will be deleted from the replica set. If you do not want to wait 5 minutes, then run «net stop ntfrs» followed by «net start ntfrs» to restart the File Replication Service.
 At the poll following the deletion this computer will be re-added to the replica set. The re-addition will trigger a full tree sync for the replica set.
WARNING: During the recovery process data in the replica tree may be unavailable. You should reset the registry parameter described above to 0 to prevent automatic recovery from making the data unexpectedly unavailable if this error condition occurs again.
To change this registry parameter, run regedit.
Click on Start, Run and type regedit.
Click down the key path:
Double click on the value name
«Enable Journal Wrap Automatic Restore»
and update the value.
If the value name is not present you may add it with the New->DWORD Value function under the Edit Menu item. Type the value name exactly as shown above.”
Just to prevent you from doing anything wrong here: please DO NOT follow the last instruction about modifying the registry key displayed above. If your DCs are post-2003, you have to fix it differently.
If you have more than one DC, you should check the contents of the SYSVOL folder. Compare the DCs and check that they have an identical set of files in the Policies folder, in the Scripts folder and in the StarterGPOs folder (if you’ve created this one through the GP Management console). In our case, there were 91 GPOs on the first DC, but only 80 GPOs on the second DC, so 11 GPOs were missing.
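If you would rather not eyeball two Explorer windows, the comparison can be scripted. The following is only a minimal sketch, not a supported tool: the function names are my own, and it assumes you can reach both SYSVOL trees from the machine that runs it (for example through each DC’s SYSVOL share). It compares file names only, not file contents.

```python
import os

def list_relative_files(root):
    """Return the set of file paths under root, relative to root."""
    found = set()
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full_path = os.path.join(dirpath, name)
            found.add(os.path.relpath(full_path, root))
    return found

def compare_sysvol(root_a, root_b):
    """Return (only_in_a, only_in_b) as sorted lists of relative paths."""
    files_a = list_relative_files(root_a)
    files_b = list_relative_files(root_b)
    return sorted(files_a - files_b), sorted(files_b - files_a)

def count_gpos(policies_dir):
    """Count GPO folders, i.e. subfolders of Policies named like {GUID}."""
    return sum(
        1
        for entry in os.listdir(policies_dir)
        if entry.startswith("{")
        and entry.endswith("}")
        and os.path.isdir(os.path.join(policies_dir, entry))
    )
```

You would call compare_sysvol with the two domain folders under SYSVOL (one per DC) and count_gpos with each Policies folder. In a case like ours, the counts would differ and one of the two returned lists would name the files that never replicated.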
Other symptoms were:
- You make changes to a logon script, but not all users get the change
- A changed or newly created GPO is not applied to all users or computers (that’s obvious, because we already know that some GPOs are missing)
WHAT CAUSED IT?
In this specific scenario, we already knew that the 2 DCs had lost connectivity for some days because of a network misconfiguration. So what really happened?
FRS has an internal database that contains all the files and folders it is replicating, and each of these has a globally unique ID (GUID). The database also contains a pointer to the last NTFS disk operation (in the USN Journal/NTFS Journal) that the FRS service processed. Every time a file in the SYSVOL folder changes on disk, this is what happens:
- the operation is picked up by NTFS and an entry is made in the NTFS Journal
- FRS monitors the NTFS Journal for changes and notes that a change has been made to that file
- FRS keeps a record of the last NTFS Journal event that it processed and checks if it has processed it already
- If it hasn’t processed it already, it looks at whether it is a file that it should replicate
- If it should be replicated, the file goes into the normal process of staging, replicating, etc.
- FRS increments the entry in its database about the NTFS Journal event that it has processed so it won’t consider it again.
Let’s say that we create a new GPO and verify that it is replicated to the next domain controller. Then we make 3 changes to this GPO. NTFS records each operation in the NTFS Journal, so the journal is now at entry 4, and FRS picks up each entry and triggers the replication mechanism. If we stop the FRS service and make 20 more changes to this GPO, the NTFS Journal keeps growing and reaches entry 24. But since FRS is stopped, it is not monitoring the NTFS Journal. FRS still knows the last NTFS Journal entry that it processed, and it will compare this with the current NTFS Journal the next time it starts. For this example, assume the NTFS Journal keeps only the last 10 entries (the real journal is limited by its size on disk, not by a fixed number of entries).
The next time the FRS service starts, it sees that it has missed NTFS operations on the disk: it last processed NTFS operation 4, but the NTFS Journal is now at 24 and only goes back 10 entries (15-24), so operations 5-14 are gone and can never be processed into the FRS database.
So this is why FRS reports that «DOMAIN SYSTEM VOLUME (SYSVOL SHARE)» is in JRNL_WRAP_ERROR.
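The example with the numbers 4, 24 and 10 can be replayed as a toy simulation. This is a sketch of the idea only: ToyJournal and check_for_wrap are names I made up, the real USN journal is sized in bytes rather than entries, and FRS does not expose anything like this.

```python
from collections import deque

class ToyJournal:
    """Toy stand-in for the NTFS Journal: keeps only the newest entries."""

    def __init__(self, max_entries):
        # deque(maxlen=...) silently drops the oldest entry when full
        self.entries = deque(maxlen=max_entries)
        self.next_usn = 1

    def record_change(self, description):
        self.entries.append((self.next_usn, description))
        self.next_usn += 1

    def oldest_usn(self):
        # With an empty journal nothing has been lost yet
        return self.entries[0][0] if self.entries else self.next_usn

    def latest_usn(self):
        return self.next_usn - 1

def check_for_wrap(journal, last_processed_usn):
    """Return the entries FRS needs but the journal no longer holds."""
    first_needed = last_processed_usn + 1
    return list(range(first_needed, journal.oldest_usn()))

def simulate():
    journal = ToyJournal(max_entries=10)
    journal.record_change("create GPO")            # entry 1
    for i in range(3):
        journal.record_change(f"edit {i + 1}")     # entries 2-4
    last_processed = journal.latest_usn()          # FRS stops here, at 4
    for i in range(20):
        journal.record_change(f"offline edit {i + 1}")  # entries 5-24
    lost = check_for_wrap(journal, last_processed)
    return last_processed, journal.latest_usn(), lost
```

Calling simulate() gives back the last processed entry (4), the current journal position (24) and the list of lost entries (5 through 14). A non-empty list is the toy equivalent of the journal wrap: FRS cannot simply resume from where it stopped.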
WHAT CAN I DO TO FIX IT?
You need to reset things somehow, so that FRS rebuilds its database and starts tracking the NTFS Journal from its current position. You have 2 options to do this:
- Get the data from the second DC through replication, so practically you need to perform the D2 non-authoritative FRS restore (more details follow)
- Make this DC authoritative and “send” the data through replication to all other DCs, called the D4 authoritative FRS restore. But in this case we know that all other DCs function without errors, so it’s better to stick with the first option. The D4 approach needs planning, is time consuming and you practically need to stop the FRS service on all DCs. I prefer to use it as a last resort…
Details on the D2 option can be found in this excellent Microsoft Knowledge Base article, which also covers the D4 approach: http://support.microsoft.com/kb/290762. For your convenience I’ve copied the steps here:
“To perform a non-authoritative restore, stop the FRS service, configure the BurFlags registry key, and then restart the FRS service. To do so:
- Click Start, and then click Run.
- In the Open box, type cmd and then press ENTER.
- In the Command box, type net stop ntfrs.
- Click Start, and then click Run.
- In the Open box, type regedit and then press ENTER.
- Locate the following subkey in the registry: HKEY_LOCAL_MACHINE\System\CurrentControlSet\Services\NtFrs\Parameters\Backup/Restore\Process at Startup
- In the right pane, double-click BurFlags.
- In the Edit DWORD Value dialog box, type D2 and then click OK.
- Quit Registry Editor, and then switch to the Command box.
- In the Command box, type net start ntfrs.
- Quit the Command box.
When the FRS service restarts, the following actions occur:
- The value for BurFlags registry key returns to 0.
- Files in the reinitialized FRS folders are moved to a Pre-existing folder.
- An event 13565 is logged to signal that a nonauthoritative restore is started.
- The FRS database is rebuilt.
- The member performs an initial join of the replica set from an upstream partner or from the computer that is specified in the Replica Set Parent registry key if a parent has been specified for SYSVOL replica sets.
- The reinitialized computer runs a full replication of the affected replica sets when the relevant replication schedule begins.
- When the process is complete, an event 13516 is logged to signal that FRS is operational. If the event is not logged, there is a problem with the FRS configuration.
Note: The placement of files in the Pre-existing folder on reinitialized members is a safeguard in FRS designed to prevent accidental data loss. Any files destined for the replica that exist only in the local Pre-existing folder and did not replicate in after the initial replication may then be copied to the appropriate folder. When outbound replication has occurred, delete files in the Pre-existing folder to free up additional drive space.”
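If you have to repeat the D2 steps on more than one DC, the same three actions (stop NtFrs, set BurFlags to D2, start NtFrs) can be written down as commands. What follows is my own sketch, not part of the KB article: the function names are hypothetical, it uses the standard net and reg command-line tools instead of the Registry Editor, and it defaults to a dry run that only prints what it would do. Test it in a lab before you let it touch a production DC.

```python
import subprocess

# Registry location quoted in the KB steps above ("Backup/Restore" is one key name)
BURFLAGS_KEY = (
    r"HKLM\System\CurrentControlSet\Services\NtFrs\Parameters"
    r"\Backup/Restore\Process at Startup"
)
D2_NON_AUTHORITATIVE = 0xD2  # hexadecimal D2, as typed into the DWORD dialog

def d2_restore_commands():
    """Return the three commands for a non-authoritative (D2) restore."""
    return [
        ["net", "stop", "ntfrs"],
        # reg add is given the value in decimal (0xD2 == 210)
        ["reg", "add", BURFLAGS_KEY, "/v", "BurFlags",
         "/t", "REG_DWORD", "/d", str(D2_NON_AUTHORITATIVE), "/f"],
        ["net", "start", "ntfrs"],
    ]

def run_d2_restore(dry_run=True):
    """Print the commands (dry run) or execute them in order on this DC."""
    handled = []
    for argv in d2_restore_commands():
        if dry_run:
            print("would run:", subprocess.list2cmdline(argv))
        else:
            # check=True stops the sequence if any command fails
            subprocess.run(argv, check=True)
        handled.append(argv)
    return handled
```

Running run_d2_restore() without arguments changes nothing and just lists the commands. Passing dry_run=False requires an elevated prompt on the affected DC, and afterwards you still need to watch the event log for events 13565 and 13516 as described above.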
So if you go back to your affected DC, you can see that the contents of its SYSVOL folder are now identical to those on the other DC, and we get Event ID 13516, saying that “The File Replication Service is no longer preventing the computer DC from becoming a domain controller. The system volume has been successfully initialized and the Netlogon service has been notified that the system volume is now ready to be shared as SYSVOL.”