[pve-devel] [PATCH manager 1/7] replace systemd timer with pvescheduler daemon
    Dominik Csapak 
    d.csapak at proxmox.com
       
    Tue Nov  2 10:26:42 CET 2021
    
    
  
On 10/29/21 14:05, Fabian Ebner wrote:
> Am 07.10.21 um 10:27 schrieb Dominik Csapak:
>> From: Thomas Lamprecht <t.lamprecht at proxmox.com>
>>
>> The whole thing is already prepared for this, the systemd timer was
>> just a fixed periodic timer with a frequency of one minute. And we
>> just introduced it as the assumption was made that less memory usage
>> would be generated with this approach, AFAIK.
>>
>> But logging 4+ lines just about that the timer was started, even if
>> it does nothing, and that 24/7 is not to cheap and a bit annoying.
>>
>> So in a first step add a simple daemon, which forks of a child for
>> running jobs once a minute.
>> This could be made still a bit more intelligent, i.e., look if we
>> have jobs tor run before forking - as forking is not the cheapest
>> syscall. Further, we could adapt the sleep interval to the next time
>> we actually need to run a job (and sending a SIGUSR to the daemon if
>> a job interval changes such, that this interval got narrower)
>>
>> We try to sync running on minute-change boundaries at start, this
>> emulates systemd.timer behaviour, we had until now. Also user can
>> configure jobs on minute precision, so they probably expect that
>> those also start really close to a minute change event.
>> Could be adapted to resync during running, to factor in time drift.
>> But, as long as enough cpu cycles are available we run in correct
>> monotonic intervalls, so this isn't a must, IMO.
>>
>> Another improvement could be locking a bit more fine grained, i.e.
>> not on a per-all-local-job-runs basis, but per-job (per-guest?)
>> basis, which would improve temporary starvement  of small
>> high-periodic jobs through big, less peridoci jobs.
>> We argued that it's the user fault if such situations arise, but they
>> can evolve over time without noticing, especially in compolexer
>> setups.
>>
>> Signed-off-by: Thomas Lamprecht <t.lamprecht at proxmox.com>
>> Signed-off-by: Dominik Csapak <d.csapak at proxmox.com>
>> ---
>>   PVE/Service/Makefile          |   2 +-
>>   PVE/Service/pvescheduler.pm   | 102 ++++++++++++++++++++++++++++++++++
>>   bin/Makefile                  |   6 +-
>>   bin/pvescheduler              |  28 ++++++++++
>>   debian/postinst               |   3 +-
>>   services/Makefile             |   3 +-
>>   services/pvescheduler.service |  16 ++++++
>>   services/pvesr.service        |   8 ---
>>   services/pvesr.timer          |  12 ----
>>   9 files changed, 155 insertions(+), 25 deletions(-)
>>   create mode 100644 PVE/Service/pvescheduler.pm
> 
> Just noting: All the other .pm-files in PVE/Service have the executable 
> flag set.
> 
>>   create mode 100755 bin/pvescheduler
>>   create mode 100644 services/pvescheduler.service
>>   delete mode 100644 services/pvesr.service
>>   delete mode 100644 services/pvesr.timer
>>
>> diff --git a/PVE/Service/Makefile b/PVE/Service/Makefile
>> index fc1cdb14..f66ceb81 100644
>> --- a/PVE/Service/Makefile
>> +++ b/PVE/Service/Makefile
>> @@ -1,6 +1,6 @@
>>   include ../../defines.mk
>> -SOURCES=pvestatd.pm pveproxy.pm pvedaemon.pm spiceproxy.pm
>> +SOURCES=pvestatd.pm pveproxy.pm pvedaemon.pm spiceproxy.pm 
>> pvescheduler.pm
>>   all:
>> diff --git a/PVE/Service/pvescheduler.pm b/PVE/Service/pvescheduler.pm
>> new file mode 100644
>> index 00000000..ce55c45a
>> --- /dev/null
>> +++ b/PVE/Service/pvescheduler.pm
>> @@ -0,0 +1,102 @@
>> +package PVE::Service::pvescheduler;
>> +
>> +use strict;
>> +use warnings;
>> +
>> +use POSIX qw(WNOHANG);
>> +use PVE::SafeSyslog;
>> +use PVE::API2::Replication;
>> +
>> +use PVE::Daemon;
>> +use base qw(PVE::Daemon);
>> +
>> +my $cmdline = [$0, @ARGV];
>> +my %daemon_options = (stop_wait_time => 180, max_workers => 0);
>> +my $daemon = __PACKAGE__->new('pvescheduler', $cmdline, 
>> %daemon_options);
>> +
>> +my $finish_jobs = sub {
>> +    my ($self) = @_;
>> +    foreach my $cpid (keys %{$self->{jobs}}) {
>> +    my $waitpid = waitpid($cpid, WNOHANG);
>> +    if (defined($waitpid) && ($waitpid == $cpid)) {
>> +        delete ($self->{jobs}->{$cpid});
>> +    }
>> +    }
>> +};
>> +
>> +sub run {
>> +    my ($self) = @_;
>> +
>> +    my $jobs= {};
>> +    $self->{jobs} = $jobs;
>> +
>> +    my $old_sig_chld = $SIG{CHLD};
>> +    local $SIG{CHLD} = sub {
>> +    local ($@, $!, $?); # do not overwrite error vars
>> +    $finish_jobs->($self);
>> +    $old_sig_chld->(@_) if $old_sig_chld;
>> +    };
>> +
>> +    my $logfunc = sub { syslog('info', $_[0]) };
>> +
>> +    my $run_jobs = sub {
>> +    my $child = fork();
>> +    if (!defined($child)) {
>> +        die "fork failed: $!\n";
>> +    } elsif ($child == 0) {
>> +        $self->after_fork_cleanup();
>> +        PVE::API2::Replication::run_jobs(undef, $logfunc, 0, 1);
> 
> Like this, the whole replication log ends up in syslog.
yes, you're right, we probably don't want that in the syslog
(same for vzdump jobs later)
> 
>> +        POSIX::_exit(0);
>> +    }
>> +
>> +    $jobs->{$child} = 1;
>> +    };
>> +
>> +    # try to run near minute boundaries, makes more sense to the user 
>> as he
>> +    # configures jobs with minute precision
>> +    my ($current_seconds) = localtime;
>> +    sleep(60 - $current_seconds) if (60 - $current_seconds >= 5);
>> +
>> +    for (;;) {
>> +    last if $self->{shutdown_request};
>> +
>> +    $run_jobs->();
>> +
>> +    my $sleep_time = 60;
>> +    my $slept = 0; # SIGCHLD interrupts sleep, so we need to keep track
>> +    while ($slept < $sleep_time) {
>> +        last if $self->{shutdown_request};
>> +        $slept += sleep($sleep_time - $slept);
>> +    }
>> +    }
> 
> Won't the loop drift away from the minute boundary over time? The rest 
> of the loop takes time to execute, which pushes in the positive 
> direction, while the fact that the accuracy when interrupted is only 
> whole seconds pushes in the negative direction. But we can't really rely 
> on those to cancel each other out.
> 
yes, imho the fix would be to pull the correction we have from before
the loop to inside the loop. (maybe only every X loops, since the drift
from the few forks + sleep should not amount to much per iteration)
>> +
>> +    # jobs have a lock timeout of 60s, wait a bit more for graceful 
>> termination
>> +    my $timeout = 0;
>> +    while (keys %$jobs > 0 && $timeout < 75) {
>> +    kill 'TERM', keys %$jobs;
>> +    $timeout += sleep(5);
>> +    }
>> +    # ensure the rest gets stopped
>> +    kill 'KILL', keys %$jobs if (keys %$jobs > 0);
>> +}
>> +
>> +sub shutdown {
>> +    my ($self) = @_;
>> +
>> +    syslog('info', 'got shutdown request, signal running jobs to stop');
>> +
>> +    kill 'TERM', keys %{$self->{jobs}};
>> +    $self->{shutdown_request} = 1;
>> +}
>> +
>> +$daemon->register_start_command();
>> +$daemon->register_stop_command();
>> +$daemon->register_status_command();
>> +
>> +our $cmddef = {
>> +    start => [ __PACKAGE__, 'start', []],
>> +    stop => [ __PACKAGE__, 'stop', []],
>> +    status => [ __PACKAGE__, 'status', [], undef, sub { print shift . 
>> "\n";} ],
>> +};
>> +
>> +1;
>> diff --git a/bin/Makefile b/bin/Makefile
>> index df30b27c..12cb4671 100644
>> --- a/bin/Makefile
>> +++ b/bin/Makefile
>> @@ -6,7 +6,7 @@ export NOVIEW=1
>>   PERL_DOC_INC_DIRS=..
>>   include /usr/share/pve-doc-generator/pve-doc-generator.mk
>> -SERVICES = pvestatd pveproxy pvedaemon spiceproxy
>> +SERVICES = pvestatd pveproxy pvedaemon spiceproxy pvescheduler
>>   CLITOOLS = vzdump pvesubscription pveceph pveam pvesr pvenode pvesh 
>> pve6to7
>>   SCRIPTS =              \
>> @@ -52,6 +52,10 @@ pve6to7.1:
>>       printf ".TH PVE6TO7 1\n.SH NAME\npve6to7 \- Proxmox VE upgrade 
>> checker script for 6.4 to 7.x\n" > $@
>>       printf ".SH SYNOPSIS\npve6to7 [--full]\n" >> $@
>> +pvescheduler.8:
>> +    # FIXME: add to doc-generator
>> +    echo ".TH pvescheduler 8" > $@
>> +
>>   pveversion.1.pod: pveversion
>>   pveupgrade.1.pod: pveupgrade
>>   pvereport.1.pod: pvereport
>> diff --git a/bin/pvescheduler b/bin/pvescheduler
>> new file mode 100755
>> index 00000000..5f2bde52
>> --- /dev/null
>> +++ b/bin/pvescheduler
>> @@ -0,0 +1,28 @@
>> +#!/usr/bin/perl
>> +
>> +use strict;
>> +use warnings;
>> +
>> +use PVE::Service::pvescheduler;
>> +
>> +use PVE::RPCEnvironment;
>> +use PVE::SafeSyslog;
>> +
>> +$SIG{'__WARN__'} = sub {
>> +    my $err = $@;
>> +    my $t = $_[0];
>> +    chomp $t;
>> +    print STDERR "$t\n";
>> +    syslog('warning', "%s", $t);
>> +    $@ = $err;
>> +};
>> +
>> +my $prepare = sub {
>> +    my $rpcenv = PVE::RPCEnvironment->init('priv');
>> +
>> +    $rpcenv->init_request();
>> +    $rpcenv->set_language($ENV{LANG});
>> +    $rpcenv->set_user('root at pam');
>> +};
>> +
>> +PVE::Service::pvescheduler->run_cli_handler(prepare => $prepare);
>> diff --git a/debian/postinst b/debian/postinst
>> index aed4da3f..f6407178 100755
>> --- a/debian/postinst
>> +++ b/debian/postinst
>> @@ -79,6 +79,7 @@ case "$1" in
>>       deb-systemd-invoke reload-or-try-restart pvestatd.service
>>       deb-systemd-invoke reload-or-try-restart pveproxy.service
>>       deb-systemd-invoke reload-or-try-restart spiceproxy.service
>> +    deb-systemd-invoke reload-or-try-restart pvescheduler.service
>>       exit 0;;
>> @@ -102,7 +103,7 @@ case "$1" in
>>       # same as dh_systemd_enable (code copied)
>> -    UNITS="pvedaemon.service pveproxy.service spiceproxy.service 
>> pvestatd.service pvebanner.service pvesr.timer pve-daily-update.timer"
>> +    UNITS="pvedaemon.service pveproxy.service spiceproxy.service 
>> pvestatd.service pvebanner.service pvescheduler.service 
>> pve-daily-update.timer"
>>       NO_RESTART_UNITS="pvenetcommit.service pve-guests.service"
>>       for unit in ${UNITS} ${NO_RESTART_UNITS}; do
>> diff --git a/services/Makefile b/services/Makefile
>> index a3d04a2f..b46c7119 100644
>> --- a/services/Makefile
>> +++ b/services/Makefile
>> @@ -13,8 +13,7 @@ SERVICES=            \
>>       pve-storage.target    \
>>       pve-daily-update.service\
>>       pve-daily-update.timer    \
>> -    pvesr.service        \
>> -    pvesr.timer
>> +    pvescheduler.service
>>   .PHONY: install
>>   install: ${SERVICES}
>> diff --git a/services/pvescheduler.service 
>> b/services/pvescheduler.service
>> new file mode 100644
>> index 00000000..11769e80
>> --- /dev/null
>> +++ b/services/pvescheduler.service
>> @@ -0,0 +1,16 @@
>> +[Unit]
>> +Description=Proxmox VE scheduler
>> +ConditionPathExists=/usr/bin/pvescheduler
>> +Wants=pve-cluster.service
>> +After=pve-cluster.service
>> +After=pve-storage.target
>> +
>> +[Service]
>> +ExecStart=/usr/bin/pvescheduler start
>> +ExecStop=/usr/bin/pvescheduler stop
>> +PIDFile=/var/run/pvescheduler.pid
>> +KillMode=process
>> +Type=forking
>> +
>> +[Install]
>> +WantedBy=multi-user.target
>> diff --git a/services/pvesr.service b/services/pvesr.service
>> deleted file mode 100644
>> index cbaed1ca..00000000
>> --- a/services/pvesr.service
>> +++ /dev/null
>> @@ -1,8 +0,0 @@
>> -[Unit]
>> -Description=Proxmox VE replication runner
>> -ConditionPathExists=/usr/bin/pvesr
>> -After=pve-cluster.service
>> -
>> -[Service]
>> -Type=oneshot
>> -ExecStart=/usr/bin/pvesr run --mail 1
>> diff --git a/services/pvesr.timer b/services/pvesr.timer
>> deleted file mode 100644
>> index 01d7b9c7..00000000
>> --- a/services/pvesr.timer
>> +++ /dev/null
>> @@ -1,12 +0,0 @@
>> -[Unit]
>> -Description=Proxmox VE replication runner
>> -
>> -[Timer]
>> -AccuracySec=1
>> -RemainAfterElapse=no
>> -
>> -[Timer]
>> -OnCalendar=minutely
>> -
>> -[Install]
>> -WantedBy=timers.target
>> \ No newline at end of file
>>
    
    
More information about the pve-devel
mailing list