[pve-devel] [PATCH 2/6] qemu_drive_mirror : handle multiple jobs
Wolfgang Bumiller
w.bumiller at proxmox.com
Fri Dec 23 10:53:12 CET 2016
On Fri, Dec 23, 2016 at 10:20:03AM +0100, Alexandre DERUMIER wrote:
> Hi wolfgang,
>
> I have done more test with drive mirror and nbd target.
>
> It seems that the hang occurs only if the target IP is unreachable (no network response)
>
> # drive_mirror -n drive-scsi0 nbd://66.66.66.66/target
> ERROR: VM 183 qmp command 'human-monitor-command' failed - got timeout
>
> If the IP address exists and is up,
> # drive_mirror -n drive-scsi0 nbd://10.3.94.89:666/target
> Failed to connect socket: Connection refused
>
>
>
> I'm not sure, but maybe it can hang too if pve-firewall does a drop instead of a reject on the target port.
>
>
>
> I think this comes from qemu's net/socket.c,
>
> where we have an infinite loop.
>
> I'm not sure how to add a timeout here, help is welcome :)
Doesn't the job show up in query-block-jobs before the connection
finishes? If so, we could just assume that if it's still at 0% after a
timeout, the connection doesn't work. The question, though, is what a
good timeout would be.
Or we find (or add) a way to query the connection status, then we can
poll that before we start polling the block-job status.
(we already poll for stuff, so it's not worse than before ;p)
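Roughly, that polling fallback could look like this (a standalone sketch,
not existing PVE code; the `query_progress` callback stands in for whatever
QMP `query-block-jobs` round trip we'd actually do):

```c
#include <unistd.h>

/* Sketch of the polling idea: keep querying the block job's progress; if
 * it is still at 0% when the deadline expires, assume the NBD connection
 * never came up and give up.  The callback is hypothetical and stands in
 * for a QMP query-block-jobs round trip. */
static int wait_for_job_start(int (*query_progress)(void *opaque), void *opaque,
                              int timeout_ms, int interval_ms)
{
    int waited = 0;

    while (waited < timeout_ms) {
        int pct = query_progress(opaque);
        if (pct < 0) {
            return -1;          /* job disappeared: hard error */
        }
        if (pct > 0) {
            return 0;           /* data is flowing, the connection works */
        }
        usleep(interval_ms * 1000);
        waited += interval_ms;
    }
    return -1;                  /* still at 0% after the timeout */
}

/* Demo callback: reports 0% for the first few polls, then 50%. */
static int demo_progress(void *opaque)
{
    int *calls = opaque;
    return (*calls)++ < 3 ? 0 : 50;
}
```

The interval/timeout split mirrors the polling we already do elsewhere, so
it would slot into the existing loop rather than adding a new mechanism.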
Then again, an actual timeout on qemu's side might even make it
upstream...? Maybe? (If it doesn't I'd prefer polling as described above
over maintaining yet another qemu patch ;-) )
But it won't be a simple select/poll timeout as they use non-blocking
sockets and register the fd with a callback for the connection
completion later (in net_socket_fd_init() it calls
qemu_set_fd_handler() where it ends up controlled by the aio handlers).
It might be possible to add a timer (qemu_timer_new() & friends
perhaps?) to NetSocketState (the NetSocketState cleanup would have to
remove it, and if it gets triggered it cancels the connection instead).
> static int net_socket_connect_init(NetClientState *peer,
>                                    const char *model,
>                                    const char *name,
>                                    const char *host_str)
> {
>     NetSocketState *s;
>     int fd, connected, ret;
>     struct sockaddr_in saddr;
>
>     if (parse_host_port(&saddr, host_str) < 0)
>         return -1;
>
>     fd = qemu_socket(PF_INET, SOCK_STREAM, 0);
>     if (fd < 0) {
>         perror("socket");
>         return -1;
>     }
>     qemu_set_nonblock(fd);
>
>     connected = 0;
>     for(;;) {
>         ret = connect(fd, (struct sockaddr *)&saddr, sizeof(saddr));
>         if (ret < 0) {
>             if (errno == EINTR || errno == EWOULDBLOCK) {
>                 /* continue */
>             } else if (errno == EINPROGRESS ||
>                        errno == EALREADY ||
>                        errno == EINVAL) {
>                 break;
>             } else {
>                 perror("connect");
>                 closesocket(fd);
>                 return -1;
>             }
>         } else {
>             connected = 1;
>             break;
>         }
>     }
>     s = net_socket_fd_init(peer, model, name, fd, connected);
>     if (!s)
>         return -1;
>     snprintf(s->nc.info_str, sizeof(s->nc.info_str),
>              "socket: connect to %s:%d",
>              inet_ntoa(saddr.sin_addr), ntohs(saddr.sin_port));
>     return 0;
> }
>
>
>
>
> ----- Original Message -----
> From: "aderumier" <aderumier at odiso.com>
> To: "Wolfgang Bumiller" <w.bumiller at proxmox.com>
> Cc: "pve-devel" <pve-devel at pve.proxmox.com>
> Sent: Wednesday, 21 December 2016 13:57:10
> Subject: Re: [pve-devel] [PATCH 2/6] qemu_drive_mirror : handle multiple jobs
>
> >>Then it can still hang if the destination disappears between tcp_ping()
> >>and the `drive-mirror` command, so I'd rather get better behavior on qemu's
> >>side. It needs a time-out or a way to cancel it or something.
> Yes sure!
>
> I'm currently looking at qemu code to see how nbd client works.
>
> ----- Original Message -----
> From: "Wolfgang Bumiller" <w.bumiller at proxmox.com>
> To: "aderumier" <aderumier at odiso.com>
> Cc: "pve-devel" <pve-devel at pve.proxmox.com>, "dietmar" <dietmar at proxmox.com>
> Sent: Wednesday, 21 December 2016 12:20:28
> Subject: Re: [pve-devel] [PATCH 2/6] qemu_drive_mirror : handle multiple jobs
>
> > On December 21, 2016 at 10:51 AM Alexandre DERUMIER <aderumier at odiso.com> wrote:
> >
> >
> > >>IIRC that was the only blocker.
> > >>
> > >>Basically the patchset has to work *without* tcp_ping() since it is an
> > >>unreliable check, and then we still have to catch failing connections
> > >>_correctly_. (There's no point in knowing that "some time in the past
> > >>you were able to connect to something which may or may not have been a
> > >>qemu nbd server", we need to know whether the drive-mirror job itself
> > >>was able to connect.)
> >
> > For me, the mirror job auto-aborts if the connection fails during the migration. Do you see another behaviour?
>
> That covers one problem. IIRC the disk-deletion problem was that due
> to wrong [] usage around an IPv6 address it could not connect in the
> first place and didn't error as I would have hoped.
>
> >
> > the tcp_ping() was placed just before launching the drive-mirror command, because it was hanging in this case.
>
> Then it can still hang if the destination disappears between tcp_ping()
> and the `drive-mirror` command, so I'd rather get better behavior on qemu's
> side. It needs a time-out or a way to cancel it or something.
>
> _______________________________________________
> pve-devel mailing list
> pve-devel at pve.proxmox.com
> http://pve.proxmox.com/cgi-bin/mailman/listinfo/pve-devel
>
>