I was helping a customer restore a large XenServer pool that had become corrupted after a new XenServer host joined it. The XAPI service had stopped and would not start again, and moving the master role did not help. The customer's only apparent option was to reinstall XenServer on a mission-critical system. Instead, we decided to dig into the problem and try to fix the XenServer database by hand. Here is what we did.
The XenServer configuration database is an XML file, available on the pool master as /var/xapi/state.db. Trying to start the XenServer API with the service xapi start command simply failed, but the log file /var/log/xensource.log gave a clue to what was wrong:
Caught exception at toplevel: 'Db_exn.DBCache_NotFound("missing row", "VLAN", "OpaqueRef:25704e28-b130-4789-b244-87b08eceade1")'
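If the log is long, a small filter can pull out the references xapi is complaining about. This is only a sketch: the function name missing_refs is made up for illustration, and the pattern is based purely on the exception line quoted above.

```shell
#!/bin/bash
# Print each distinct OpaqueRef that appears on a DBCache_NotFound line.
# Reads log text on stdin. The pattern is derived from the single
# exception line quoted in this post, so treat it as an assumption.
missing_refs() {
    grep 'DBCache_NotFound' \
        | grep -o 'OpaqueRef:[0-9a-f-]\{36\}' \
        | sort -u
}
```

On the master it could be run as missing_refs < /var/log/xensource.log.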
We opened a copy of state.db in Notepad and searched for the row containing the OpaqueRef id from the log. We deleted the entire row and saved the file. Starting the xapi service again with service xapi start then produced a new error in the log, so we repeated the procedure until the service started. After that, everything was up and running again. The cause seemed to be that the new host joining the pool did not have the correct NIC configuration, which left invalid references in the database. I think this was a weakness in the XenServer 5.6 release they were running, but I have not been able to confirm whether newer versions are affected.
To test the database repair without affecting the running VMs in the production environment, we recreated a XenServer host virtually. We installed XenServer 5.6 as a VM on a XenServer 6.5 SP1 host and used this command to enable support for nested XenServer installs:
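We did the edit by hand, but the same step could be scripted against a copy of the file. The sketch below is not what we ran. The function name strip_rows is hypothetical, and it assumes that every row is a self-closing <row ... /> element on a single line and that no attribute value contains a literal '>' character. Check both against your own state.db before trusting it.

```shell
#!/bin/bash
# Sketch: write a copy of state.db with every <row ... /> that mentions a
# given OpaqueRef removed. The input file is never modified.
# Assumptions (not confirmed by the post): rows are self-closing, sit on a
# single line, and no attribute value contains '>'.
strip_rows() {  # usage: strip_rows <input copy> <output file> <OpaqueRef:...>
    local src=$1 dst=$2 ref=$3
    # Accept only a well-formed reference so it is safe inside the sed pattern.
    if ! printf '%s\n' "$ref" | grep -q '^OpaqueRef:[0-9a-f-]\{36\}$'; then
        echo "not an OpaqueRef: $ref" >&2
        return 1
    fi
    # [^>]* cannot cross a '>', so each match stays inside one <row .../> tag.
    sed "s|<row [^>]*${ref}[^>]*/>||g" "$src" > "$dst"
    # Report how many occurrences of the reference are left in the output.
    echo "remaining: $(grep -o -F -- "$ref" "$dst" | wc -l | tr -d ' ')"
}
```

If it reports anything other than "remaining: 0", one of the assumptions does not hold for that file and the row has to be removed by hand. Either way, look over the output file before putting it back in place of state.db.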
xe vm-param-set uuid=<UUID> platform:exp-nested-hvm=true
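If you only know the VM's name and not its UUID, the lookup and the setting can be combined. This is a minimal sketch: the function name enable_nested_hvm is made up here, and it has to run on a host where the xe CLI is available.

```shell
#!/bin/bash
# Sketch: look up a VM's UUID by name-label and set the nested HVM flag
# from the command above.
enable_nested_hvm() {  # usage: enable_nested_hvm <VM name-label>
    local uuid
    uuid=$(xe vm-list name-label="$1" params=uuid --minimal)
    if [ -z "$uuid" ]; then
        echo "no VM named $1" >&2
        return 1
    fi
    # --minimal prints a comma-separated list when several VMs share the name.
    case "$uuid" in
        *,*) echo "more than one VM named $1" >&2; return 1 ;;
    esac
    xe vm-param-set uuid="$uuid" platform:exp-nested-hvm=true
}
```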
This is actually a very cool feature that lets you build a complete XenServer pool virtually for experimentation. I really recommend it for testing XenServer changes before making them in production.
Just for fun I had to try running XenServer within XenServer within XenServer, and it works. I wonder how deep it goes.
To sum it up, it's very nice that state.db is an XML file, so it can be fixed by hand in an emergency, but it would be better to simply restore state.db from a backup. The problem is that state.db is not backed up when you run a metadata backup; only the machine definitions are. Make sure you take a copy of your state.db file, at the very least before making any changes to your XenServer pool. Another weakness in XenServer is that when the XenServer API service is down you cannot run any xe commands. If you just delete state.db and rebuild it, you will lose all your NIC and disk configurations, and you either have to reinstall XenServer or add the NICs and disks manually using many CLI commands. All of this has to be done before you can restore the metadata file. This is why it is so important to have a backup of state.db.
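Taking that copy can be as simple as a timestamped cp. A minimal sketch follows. The function name backup_state_db and the default backup directory are my own assumptions, so pass your own paths as arguments.

```shell
#!/bin/bash
# Sketch: copy state.db to a timestamped file before changing the pool.
# The default backup directory is an assumption; pass your own paths.
backup_state_db() {  # usage: backup_state_db [source] [backup directory]
    local src=${1:-/var/xapi/state.db}
    local dir=${2:-/root/state-db-backups}
    local dst
    dst="$dir/state.db.$(date +%Y%m%d-%H%M%S)"
    mkdir -p "$dir" || return 1
    # -p keeps the mode, ownership and timestamps of the original file.
    cp -p "$src" "$dst" || return 1
    # Print the path of the new copy so it can be logged or reused.
    echo "$dst"
}
```

Running backup_state_db on the pool master before a host join or a network change leaves a copy you can fall back on. Keeping a copy somewhere other than the master itself is a good idea too.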