[pve-devel] [PATCH v2 qemu-server 8/9] memory: add virtio-mem support

DERUMIER, Alexandre Alexandre.DERUMIER at groupe-cyllene.com
Wed Jan 25 11:28:32 CET 2023


> 
> Sure, doing it in parallel is perfectly fine. I'm just thinking that
> switching gears (too early) and redirecting the request might not be
> ideal. You also issue an additional qom-set to go back to
> $virtiomem->{current} * 1024 *1024 if a request didn't make progress
> in
> time. But to be sure that request worked, we'd also need to monitor
> it
> ;) I think issuing that request is fine as-is, but if there is a
> "hanging" device, we really can't do much. And I do think the user
> should be informed via an appropriate error if there is a problematic
> device.
> 
> Maybe we can use 10 seconds instead of 5 (2-3 seconds already sounds
> too
> close to 5 IMHO), so that we have a good margin, and just die instead
> of
> trying to redirect the request to another device. After issuing the
> reset request and writing our best guess of what the memory is to the
> config of course.
> 
I forgot to say that it doesn't time out 5s after the qom-set, but
times out after 5s if no memory change is detected by qom-get.
I'm resetting the retry counter whenever a change is detected.
(So 5s is already quite generous; in practice, if it blocks for 1s,
it's really blocking.)


if ($virtiomem->{current} != $virtiomem->{last}) {
    # value has changed, but target not yet reached
    print "virtiomem$id: changed but did not reach target yet\n";
    $virtiomem->{retry} = 0;
    $virtiomem->{last} = $virtiomem->{current};
    next;
}
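
For context, here is roughly how that check sits inside the polling loop.
This is a simplified sketch, not the exact patch code; the device path,
the $virtiomems hash layout and $vmid are illustrative, mon_cmd is the
usual QMP helper:

# Simplified sketch of the polling loop (illustrative only).
# Assumes $virtiomems maps id => { target, current, last, retry, completed }.
use PVE::QemuServer::Monitor qw(mon_cmd);

my $retry_max = 5;    # ~5s without progress (1s sleep per iteration)

while (1) {
    sleep 1;

    my $all_done = 1;
    foreach my $id (sort keys %$virtiomems) {
        my $virtiomem = $virtiomems->{$id};
        next if $virtiomem->{completed};

        # poll the current size with qom-get (path shown is illustrative)
        my $size = mon_cmd($vmid, 'qom-get',
            path => "/machine/peripheral/virtiomem$id",
            property => 'size',
        );
        $virtiomem->{current} = $size / 1024 / 1024;

        if ($virtiomem->{current} == $virtiomem->{target}) {
            $virtiomem->{completed} = 1;
            next;
        }
        $all_done = 0;

        if ($virtiomem->{current} != $virtiomem->{last}) {
            # progress detected: reset the retry counter
            $virtiomem->{retry} = 0;
            $virtiomem->{last} = $virtiomem->{current};
            next;
        }

        # no change since the last poll
        die "virtiomem$id: no progress for ${retry_max}s\n"
            if $virtiomem->{retry} >= $retry_max;
        $virtiomem->{retry}++;
    }

    last if $all_done;
}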

> If it really is an issue in practice that certain devices often take
> too
> long for whatever reason, we can still add the redirect logic. But
> starting out, I feel like it's not worth the additional complexity.
> 
The only real reason would be on unplug, if memory blocks are unmovable
(kernel reserved) or, with heavy fragmentation, no block is available.
(With 4MB granularity it's hard to hit, but with bigger maxmem and
bigger blocks we have more chances to trigger it. With 1GB hugepages
it's easier to trigger too, I think.)


But if you want something simpler,

I can do it like before: split the memory across the number of sockets,
and if we get an error on one socket, don't try to redispatch the
remaining blocks of that socket on the other nodes.
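
To illustrate that simpler variant, a minimal sketch; the helper
plug_memory_on_socket() and the sizes are hypothetical, only the shape
of the logic matters:

# Sketch of "split by socket, no redispatch on error" (hypothetical helper).
my $sockets = 2;          # number of virtual sockets / NUMA nodes (example)
my $mem_change = 4096;    # total MiB to plug or unplug (example)

# balance the request evenly across the sockets
my $per_socket = int($mem_change / $sockets);

my $errors = '';
foreach my $socket (0 .. $sockets - 1) {
    eval { plug_memory_on_socket($socket, $per_socket) };    # hypothetical
    if (my $err = $@) {
        # remember the failure, but do NOT redispatch the remaining
        # blocks of this socket on the other sockets
        $errors .= "socket $socket: $err";
    }
}
die $errors if $errors;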


> > > Would it actually be better to just fill up the first, then the
> > > second
> > > etc. as needed, rather than balancing? My gut feeling is that
> > > having
> > > fewer "active" devices is better. But this would have to be
> > > tested
> > > with
> > > some benchmarks of course.
> > 
> > Well, from numa perspective, you really want to balance as much as
> > possible. (That's why, with classic hotplug, we add/remove dimm on
> > each
> > socket alternatively).
> > 
> > That's the whole point of numa, read the nearest memory attached to
> > the
> > processor where the process are running.
> > 
> > That's a main advantage of virtio-mem  vs balloning (which doesn't
> > handle numa, and remove pages randomly on any socket)
> 
> Makes sense. Thanks for the explanation!
> 


