[PVE-User] problems with 3.2-beta
athompso at athompso.net
Sun Feb 2 19:00:55 CET 2014
Overall, the Ceph GUI is great. I actually got Ceph up and running (and
working) this time! Syncing ceph.conf through corosync is such an
obvious way to simplify things... for small clusters, anyway.
I am seeing some problems, however, and I'm not sure if they're just me,
or if I should be opening bugs:
1. I have one node that's up and running just fine, pvecm claims
everything's fine, but I can't migrate VMs that started somewhere else
to it - migration always fails, claiming the node is dead. Nothing
unusual appears in any logfile that I can see... or at least nothing
that looks bad to me. I can create a new VM there, migrate it (online)
to another node and migrate it back (online, again), but VMs that were
started on another node won't migrate.
2. CPU usage in the "Summary" screen of each VM sometimes reports
nonsensical values: right now one VM is using 126% of 1 vCPU.
3. The Wiki page on setting up CEPH Server doesn't mention that you can
do most of the setup from within the GUI. Since I have write access
there, I guess I should fix it myself :-).
4. (This isn't really new...) SPICE continues to be a major PITA when
running Ubuntu 12.04LTS as the management client. Hmm, I just found a
PPA with virt-viewer packages that work. I should update the Wiki with
that info, too.
5. Stopping VMs with HA enabled is now an *extremely* slow process. If
I disable HA for a particular VM, I notice that Stop also produces a
Shutdown task and takes longer than it used to, though not unreasonably
so. I don't understand why Stop isn't instantaneous, though. Typing
"stop" into a qm monitor is also slow... the only way I have to rapidly
stop a VM is to kill the KVM process.
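For the record, here are the commands I'm comparing (VMID 100 is just an
example, and the PID-file path is from my 3.x nodes, so treat this as a
sketch rather than gospel):

```shell
# Graceful shutdown: sends an ACPI power-button event and waits for the guest
qm shutdown 100

# Hard stop: I'd expect this to terminate the KVM process immediately
qm stop 100

# Last resort, what I actually end up doing when even Stop hangs
kill -9 $(cat /var/run/qemu-server/100.pid)
```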
6. I'm not sure if this is new, but when I have a VM under HA, if I stop
it manually, it immediately restarts. I don't know if I ever tried that
under 3.1 Enterprise... maybe it always worked this way?
Ceph speeds are barely acceptable (10-20 MB/s) but that's typical of
Ceph in my experience so far, even with caching turned on. (Still a bit
of a letdown compared to Sheepdog's 300 MB/s burst throughput, though.)
One thing I'm not sure of is OSD placement... if I have two drives per
host dedicated to Ceph (and thus two OSDs), and my pool "size" is 2,
does that mean a single node failure could render some data
unreachable? I've adjusted my "size" to 3 just in case, but I don't
understand how this works. Sheepdog guarantees that multiple copies of
an object won't be stored on the same host for exactly this reason, but
I can't tell what Ceph does.
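For what it's worth, my reading is that this is governed by the CRUSH
rule attached to the pool, not by "size" itself. The stock rule on my
test cluster looks roughly like this (pasted from memory via "ceph osd
crush rule dump", so names may differ on other versions):

```
rule replicated_ruleset {
    ruleset 0
    type replicated
    min_size 1
    max_size 10
    step take default
    step chooseleaf firstn 0 type host
    step emit
}
```

If I understand it correctly, "chooseleaf ... type host" picks each
replica from a *different* host bucket, so with size=2 the two copies of
a PG shouldn't both land on OSDs in the same node — which would make
Ceph behave like Sheepdog here. A rule with "type osd" instead would
allow exactly the failure mode I was worried about.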
Also not sure what's going on with thin provisioning; I guess Ceph and
QEMU/KVM don't do thin provisioning at all, in any way, shape or form?