[pve-devel] [PATCH manager 1/7] replace systemd timer with pvescheduler daemon
Dominik Csapak
d.csapak at proxmox.com
Tue Nov 2 10:26:42 CET 2021
On 10/29/21 14:05, Fabian Ebner wrote:
> Am 07.10.21 um 10:27 schrieb Dominik Csapak:
>> From: Thomas Lamprecht <t.lamprecht at proxmox.com>
>>
>> The whole thing is already prepared for this, the systemd timer was
>> just a fixed periodic timer with a frequency of one minute. And we
>> just introduced it as the assumption was made that less memory usage
>> would be generated with this approach, AFAIK.
>>
>> But logging 4+ lines just about that the timer was started, even if
>> it does nothing, and that 24/7 is not to cheap and a bit annoying.
>>
>> So in a first step add a simple daemon, which forks of a child for
>> running jobs once a minute.
>> This could be made still a bit more intelligent, i.e., look if we
>> have jobs tor run before forking - as forking is not the cheapest
>> syscall. Further, we could adapt the sleep interval to the next time
>> we actually need to run a job (and sending a SIGUSR to the daemon if
>> a job interval changes such, that this interval got narrower)
>>
>> We try to sync running on minute-change boundaries at start, this
>> emulates systemd.timer behaviour, we had until now. Also user can
>> configure jobs on minute precision, so they probably expect that
>> those also start really close to a minute change event.
>> Could be adapted to resync during running, to factor in time drift.
>> But, as long as enough cpu cycles are available we run in correct
>> monotonic intervalls, so this isn't a must, IMO.
>>
>> Another improvement could be locking a bit more fine grained, i.e.
>> not on a per-all-local-job-runs basis, but per-job (per-guest?)
>> basis, which would improve temporary starvement of small
>> high-periodic jobs through big, less peridoci jobs.
>> We argued that it's the user fault if such situations arise, but they
>> can evolve over time without noticing, especially in compolexer
>> setups.
>>
>> Signed-off-by: Thomas Lamprecht <t.lamprecht at proxmox.com>
>> Signed-off-by: Dominik Csapak <d.csapak at proxmox.com>
>> ---
>> PVE/Service/Makefile | 2 +-
>> PVE/Service/pvescheduler.pm | 102 ++++++++++++++++++++++++++++++++++
>> bin/Makefile | 6 +-
>> bin/pvescheduler | 28 ++++++++++
>> debian/postinst | 3 +-
>> services/Makefile | 3 +-
>> services/pvescheduler.service | 16 ++++++
>> services/pvesr.service | 8 ---
>> services/pvesr.timer | 12 ----
>> 9 files changed, 155 insertions(+), 25 deletions(-)
>> create mode 100644 PVE/Service/pvescheduler.pm
>
> Just noting: All the other .pm-files in PVE/Service have the executable
> flag set.
>
>> create mode 100755 bin/pvescheduler
>> create mode 100644 services/pvescheduler.service
>> delete mode 100644 services/pvesr.service
>> delete mode 100644 services/pvesr.timer
>>
>> diff --git a/PVE/Service/Makefile b/PVE/Service/Makefile
>> index fc1cdb14..f66ceb81 100644
>> --- a/PVE/Service/Makefile
>> +++ b/PVE/Service/Makefile
>> @@ -1,6 +1,6 @@
>> include ../../defines.mk
>> -SOURCES=pvestatd.pm pveproxy.pm pvedaemon.pm spiceproxy.pm
>> +SOURCES=pvestatd.pm pveproxy.pm pvedaemon.pm spiceproxy.pm
>> pvescheduler.pm
>> all:
>> diff --git a/PVE/Service/pvescheduler.pm b/PVE/Service/pvescheduler.pm
>> new file mode 100644
>> index 00000000..ce55c45a
>> --- /dev/null
>> +++ b/PVE/Service/pvescheduler.pm
>> @@ -0,0 +1,102 @@
>> +package PVE::Service::pvescheduler;
>> +
>> +use strict;
>> +use warnings;
>> +
>> +use POSIX qw(WNOHANG);
>> +use PVE::SafeSyslog;
>> +use PVE::API2::Replication;
>> +
>> +use PVE::Daemon;
>> +use base qw(PVE::Daemon);
>> +
>> +my $cmdline = [$0, @ARGV];
>> +my %daemon_options = (stop_wait_time => 180, max_workers => 0);
>> +my $daemon = __PACKAGE__->new('pvescheduler', $cmdline,
>> %daemon_options);
>> +
>> +my $finish_jobs = sub {
>> + my ($self) = @_;
>> + foreach my $cpid (keys %{$self->{jobs}}) {
>> + my $waitpid = waitpid($cpid, WNOHANG);
>> + if (defined($waitpid) && ($waitpid == $cpid)) {
>> + delete ($self->{jobs}->{$cpid});
>> + }
>> + }
>> +};
>> +
>> +sub run {
>> + my ($self) = @_;
>> +
>> + my $jobs= {};
>> + $self->{jobs} = $jobs;
>> +
>> + my $old_sig_chld = $SIG{CHLD};
>> + local $SIG{CHLD} = sub {
>> + local ($@, $!, $?); # do not overwrite error vars
>> + $finish_jobs->($self);
>> + $old_sig_chld->(@_) if $old_sig_chld;
>> + };
>> +
>> + my $logfunc = sub { syslog('info', $_[0]) };
>> +
>> + my $run_jobs = sub {
>> + my $child = fork();
>> + if (!defined($child)) {
>> + die "fork failed: $!\n";
>> + } elsif ($child == 0) {
>> + $self->after_fork_cleanup();
>> + PVE::API2::Replication::run_jobs(undef, $logfunc, 0, 1);
>
> Like this, the whole replication log ends up in syslog.
yes, you're right, we probably don't want that in the syslog
(same for vzdump jobs later)
>
>> + POSIX::_exit(0);
>> + }
>> +
>> + $jobs->{$child} = 1;
>> + };
>> +
>> + # try to run near minute boundaries, makes more sense to the user
>> as he
>> + # configures jobs with minute precision
>> + my ($current_seconds) = localtime;
>> + sleep(60 - $current_seconds) if (60 - $current_seconds >= 5);
>> +
>> + for (;;) {
>> + last if $self->{shutdown_request};
>> +
>> + $run_jobs->();
>> +
>> + my $sleep_time = 60;
>> + my $slept = 0; # SIGCHLD interrupts sleep, so we need to keep track
>> + while ($slept < $sleep_time) {
>> + last if $self->{shutdown_request};
>> + $slept += sleep($sleep_time - $slept);
>> + }
>> + }
>
> Won't the loop drift away from the minute boundary over time? The rest
> of the loop takes time to execute, which pushes in the positive
> direction, while the fact that the accuracy when interrupted is only
> whole seconds pushes in the negative direction. But we can't really rely
> on those to cancel each other out.
>
yes, imho the fix would be to pull the correction we have from before
the loop to inside the loop. (maybe only every X loops, since the drift
from the few forks + sleep should not amount to much per iteration)
>> +
>> + # jobs have a lock timeout of 60s, wait a bit more for graceful
>> termination
>> + my $timeout = 0;
>> + while (keys %$jobs > 0 && $timeout < 75) {
>> + kill 'TERM', keys %$jobs;
>> + $timeout += sleep(5);
>> + }
>> + # ensure the rest gets stopped
>> + kill 'KILL', keys %$jobs if (keys %$jobs > 0);
>> +}
>> +
>> +sub shutdown {
>> + my ($self) = @_;
>> +
>> + syslog('info', 'got shutdown request, signal running jobs to stop');
>> +
>> + kill 'TERM', keys %{$self->{jobs}};
>> + $self->{shutdown_request} = 1;
>> +}
>> +
>> +$daemon->register_start_command();
>> +$daemon->register_stop_command();
>> +$daemon->register_status_command();
>> +
>> +our $cmddef = {
>> + start => [ __PACKAGE__, 'start', []],
>> + stop => [ __PACKAGE__, 'stop', []],
>> + status => [ __PACKAGE__, 'status', [], undef, sub { print shift .
>> "\n";} ],
>> +};
>> +
>> +1;
>> diff --git a/bin/Makefile b/bin/Makefile
>> index df30b27c..12cb4671 100644
>> --- a/bin/Makefile
>> +++ b/bin/Makefile
>> @@ -6,7 +6,7 @@ export NOVIEW=1
>> PERL_DOC_INC_DIRS=..
>> include /usr/share/pve-doc-generator/pve-doc-generator.mk
>> -SERVICES = pvestatd pveproxy pvedaemon spiceproxy
>> +SERVICES = pvestatd pveproxy pvedaemon spiceproxy pvescheduler
>> CLITOOLS = vzdump pvesubscription pveceph pveam pvesr pvenode pvesh
>> pve6to7
>> SCRIPTS = \
>> @@ -52,6 +52,10 @@ pve6to7.1:
>> printf ".TH PVE6TO7 1\n.SH NAME\npve6to7 \- Proxmox VE upgrade
>> checker script for 6.4 to 7.x\n" > $@
>> printf ".SH SYNOPSIS\npve6to7 [--full]\n" >> $@
>> +pvescheduler.8:
>> + # FIXME: add to doc-generator
>> + echo ".TH pvescheduler 8" > $@
>> +
>> pveversion.1.pod: pveversion
>> pveupgrade.1.pod: pveupgrade
>> pvereport.1.pod: pvereport
>> diff --git a/bin/pvescheduler b/bin/pvescheduler
>> new file mode 100755
>> index 00000000..5f2bde52
>> --- /dev/null
>> +++ b/bin/pvescheduler
>> @@ -0,0 +1,28 @@
>> +#!/usr/bin/perl
>> +
>> +use strict;
>> +use warnings;
>> +
>> +use PVE::Service::pvescheduler;
>> +
>> +use PVE::RPCEnvironment;
>> +use PVE::SafeSyslog;
>> +
>> +$SIG{'__WARN__'} = sub {
>> + my $err = $@;
>> + my $t = $_[0];
>> + chomp $t;
>> + print STDERR "$t\n";
>> + syslog('warning', "%s", $t);
>> + $@ = $err;
>> +};
>> +
>> +my $prepare = sub {
>> + my $rpcenv = PVE::RPCEnvironment->init('priv');
>> +
>> + $rpcenv->init_request();
>> + $rpcenv->set_language($ENV{LANG});
>> + $rpcenv->set_user('root at pam');
>> +};
>> +
>> +PVE::Service::pvescheduler->run_cli_handler(prepare => $prepare);
>> diff --git a/debian/postinst b/debian/postinst
>> index aed4da3f..f6407178 100755
>> --- a/debian/postinst
>> +++ b/debian/postinst
>> @@ -79,6 +79,7 @@ case "$1" in
>> deb-systemd-invoke reload-or-try-restart pvestatd.service
>> deb-systemd-invoke reload-or-try-restart pveproxy.service
>> deb-systemd-invoke reload-or-try-restart spiceproxy.service
>> + deb-systemd-invoke reload-or-try-restart pvescheduler.service
>> exit 0;;
>> @@ -102,7 +103,7 @@ case "$1" in
>> # same as dh_systemd_enable (code copied)
>> - UNITS="pvedaemon.service pveproxy.service spiceproxy.service
>> pvestatd.service pvebanner.service pvesr.timer pve-daily-update.timer"
>> + UNITS="pvedaemon.service pveproxy.service spiceproxy.service
>> pvestatd.service pvebanner.service pvescheduler.service
>> pve-daily-update.timer"
>> NO_RESTART_UNITS="pvenetcommit.service pve-guests.service"
>> for unit in ${UNITS} ${NO_RESTART_UNITS}; do
>> diff --git a/services/Makefile b/services/Makefile
>> index a3d04a2f..b46c7119 100644
>> --- a/services/Makefile
>> +++ b/services/Makefile
>> @@ -13,8 +13,7 @@ SERVICES= \
>> pve-storage.target \
>> pve-daily-update.service\
>> pve-daily-update.timer \
>> - pvesr.service \
>> - pvesr.timer
>> + pvescheduler.service
>> .PHONY: install
>> install: ${SERVICES}
>> diff --git a/services/pvescheduler.service
>> b/services/pvescheduler.service
>> new file mode 100644
>> index 00000000..11769e80
>> --- /dev/null
>> +++ b/services/pvescheduler.service
>> @@ -0,0 +1,16 @@
>> +[Unit]
>> +Description=Proxmox VE scheduler
>> +ConditionPathExists=/usr/bin/pvescheduler
>> +Wants=pve-cluster.service
>> +After=pve-cluster.service
>> +After=pve-storage.target
>> +
>> +[Service]
>> +ExecStart=/usr/bin/pvescheduler start
>> +ExecStop=/usr/bin/pvescheduler stop
>> +PIDFile=/var/run/pvescheduler.pid
>> +KillMode=process
>> +Type=forking
>> +
>> +[Install]
>> +WantedBy=multi-user.target
>> diff --git a/services/pvesr.service b/services/pvesr.service
>> deleted file mode 100644
>> index cbaed1ca..00000000
>> --- a/services/pvesr.service
>> +++ /dev/null
>> @@ -1,8 +0,0 @@
>> -[Unit]
>> -Description=Proxmox VE replication runner
>> -ConditionPathExists=/usr/bin/pvesr
>> -After=pve-cluster.service
>> -
>> -[Service]
>> -Type=oneshot
>> -ExecStart=/usr/bin/pvesr run --mail 1
>> diff --git a/services/pvesr.timer b/services/pvesr.timer
>> deleted file mode 100644
>> index 01d7b9c7..00000000
>> --- a/services/pvesr.timer
>> +++ /dev/null
>> @@ -1,12 +0,0 @@
>> -[Unit]
>> -Description=Proxmox VE replication runner
>> -
>> -[Timer]
>> -AccuracySec=1
>> -RemainAfterElapse=no
>> -
>> -[Timer]
>> -OnCalendar=minutely
>> -
>> -[Install]
>> -WantedBy=timers.target
>> \ No newline at end of file
>>
More information about the pve-devel
mailing list