[PVE-User] Crash after Upgrade PVE2.3 / Cron backup crashes with 2.6.32-19 but not with 2.6.32-17
Martin Schuchmann
ms at city-pc.de
Fri Mar 22 02:31:01 CET 2013
Hi,
Today I had the same behaviour as yesterday - at 23:00 cron started the
backup job and the whole node immediately became unresponsive: via the
web interface I could still see the running VMs, but they were no longer
reachable via RDP/SSH. "vzctl enter" did not work either.
Again there were kernel errors in the syslog.
After a hard reset we noticed a logical volume that had been created at
the same time as the crash:
lvdisplay
--- Logical volume ---
LV Path /dev/promo3/vzsnap-promo3-0
LV Name vzsnap-promo3-0
VG Name promo3
LV UUID DswPod-t1lR-yKen-vwDH-sG5D-Djpl-wo9iSX
LV Write Access read/write
LV Creation host, time promo3, 2013-03-21 23:00:02 +0100
LV Status available
# open 0
LV Size 4,00 GiB
Current LE 1024
Segments 1
Allocation inherit
Read ahead sectors auto
- currently set to 256
Block device 253:3
After deleting it via lvremove I started a manual backup (snapshot mode)
of the CTs and VMs - no problems occurred.
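For reference, the cleanup and the manual test were roughly the
following (the LV path is taken from the lvdisplay output above, the
vzdump options mirror the cron job shown in the log below):

# remove the stale snapshot left over from the crashed job
lvremove /dev/promo3/vzsnap-promo3-0
# re-run the backup by hand with the same options as the cron job
vzdump 306 --mode snapshot --compress lzo --storage p2_storage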
Then I created a new cron backup job via the web interface for the same
machine that had just been backed up successfully.
A few seconds after the job started, the following errors occurred:
Mar 22 01:00:01 promo3 vzdump[12622]: <root at pam> starting task
UPID:promo3:00003150:0004D259:514B9F01:vzdump::root at pam:
Mar 22 01:00:01 promo3 vzdump[12624]: INFO: starting new backup job:
vzdump 306 --quiet 1 --mailto ms at city-pc.de --mode snapshot --compress
lzo --storage p2_storage
Mar 22 01:00:01 promo3 vzdump[12624]: INFO: Starting Backup of VM 306
(openvz)
Mar 22 01:00:02 promo3 pmxcfs[4048]: [status] notice: received log
Mar 22 01:00:02 promo3 kernel: EXT3-fs: barriers disabled
Mar 22 01:00:02 promo3 kernel: kjournald starting. Commit interval 5
seconds
Mar 22 01:00:02 promo3 kernel: EXT3-fs (dm-3): using internal journal
Mar 22 01:00:02 promo3 kernel: ext3_orphan_cleanup: deleting
unreferenced inode 70148485
Mar 22 01:00:02 promo3 kernel: ext3_orphan_cleanup: deleting
unreferenced inode 70148484
Mar 22 01:00:02 promo3 kernel: ext3_orphan_cleanup: deleting
unreferenced inode 70148481
Mar 22 01:00:02 promo3 kernel: ext3_orphan_cleanup: deleting
unreferenced inode 70148429
Mar 22 01:00:02 promo3 kernel: ext3_orphan_cleanup: deleting
unreferenced inode 70148427
Mar 22 01:00:02 promo3 kernel: ext3_orphan_cleanup: deleting
unreferenced inode 70156303
Mar 22 01:00:02 promo3 kernel: EXT3-fs (dm-3): 6 orphan inodes deleted
Mar 22 01:00:02 promo3 kernel: EXT3-fs (dm-3): recovery complete
Mar 22 01:00:02 promo3 kernel: EXT3-fs (dm-3): mounted filesystem with
ordered data mode
Mar 22 01:00:02 promo3 pmxcfs[4048]: [status] notice: received log
Mar 22 01:01:06 promo3 pvestatd[4511]: WARNING: command 'df -P -B 1
/mnt/pve/p3_storage' failed: got timeout
Mar 22 01:01:36 promo3 pvestatd[4511]: WARNING: command 'df -P -B 1
/mnt/pve/p3_storage' failed: got timeout
Mar 22 01:02:58 promo3 kernel: device-mapper: snapshots: Invalidating
snapshot: Unable to allocate exception.
Mar 22 01:03:05 promo3 kernel: Aborting journal on device dm-3.
Mar 22 01:03:05 promo3 kernel: Buffer I/O error on device dm-3, logical
block 342819330
Mar 22 01:03:05 promo3 kernel: lost page write due to I/O error on dm-3
Mar 22 01:03:05 promo3 kernel: JBD: I/O error detected when updating
journal superblock for dm-3.
Mar 22 01:03:05 promo3 kernel: EXT3-fs (dm-3): error:
ext3_journal_start_sb: Detected aborted journal
Mar 22 01:03:05 promo3 kernel: EXT3-fs (dm-3): error: remounting
filesystem read-only
Mar 22 01:03:09 promo3 kernel: EXT3-fs (dm-3): error: ext3_put_super:
Couldn't clean up the journal
Mar 22 01:03:10 promo3 vzdump[12624]: ERROR: Backup of VM 306 failed -
command '(cd /mnt/vzsnap0/private/306;find . '(' -regex '^\.$' ')' -o
'(' -type 's' -prune ')' -o -print0|sed 's/\\/\\\\/g'|tar cpf - --totals
--sparse --numeric-owner --no-recursion --one-file-system --null -T
-|lzop)
>/mnt/pve/p2_storage/dump/vzdump-openvz-306-2013_03_22-01_00_01.tar.dat' failed: exit code 2
Mar 22 01:03:10 promo3 vzdump[12624]: INFO: Backup job finished with errors
Mar 22 01:03:12 promo3 citadel: 1 unique messages to be merged
Mar 22 01:03:12 promo3 citadel: 1 unique messages to be merged
Mar 22 01:03:12 promo3 vzdump[12624]: job errors
Mar 22 01:03:12 promo3 vzdump[12622]: <root at pam> end task
UPID:promo3:00003150:0004D259:514B9F01:vzdump::root at pam: job errors
Mar 22 01:03:12 promo3 /USR/SBIN/CRON[12617]: (CRON) error (grandchild
#12620 failed with exit status 255)
This time the backup did not crash the whole node, but it failed.
lvdisplay also showed the snapshot LV again during the failed backup:
--- Logical volume ---
LV Path /dev/promo3/vzsnap-promo3-0
LV Name vzsnap-promo3-0
VG Name promo3
LV UUID UkxdQW-GGM7-raEO-MSxS-k9jZ-s0D1-g2ZO9M
LV Write Access read/write
LV Creation host, time promo3, 2013-03-22 01:00:01 +0100
LV snapshot status active destination for data
LV Status available
# open 1
LV Size 2,55 TiB
Current LE 669651
COW-table size 4,00 GiB
COW-table LE 1024
Allocated to snapshot 60,92%
Snapshot chunk size 4,00 KiB
Segments 1
Allocation inherit
Read ahead sectors auto
- currently set to 256
Block device 253:3
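The "Unable to allocate exception" message above suggests the snapshot
COW table may have run full. One thing I could try - just an assumption
on my side, not verified as the cause - is giving vzdump a larger LVM
snapshot, either in /etc/vzdump.conf or per job:

# /etc/vzdump.conf - LVM snapshot size in MB (8192 is only an example)
size: 8192
# or on the command line for a single job:
vzdump 306 --mode snapshot --size 8192 --compress lzo --storage p2_storage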
I added another backup cron job a few minutes later. It started, and I
tried to have a look at the LVs - but lvdisplay did not respond.
I opened a new SSH session and tried lvscan - it did not respond either
during the backup; CTRL-C ended with:
promo3:~# lvscan
^C CTRL-c detected: giving up waiting for lock
/run/lock/lvm/V_promo3: flock failed: Interrupted system call
Can't get lock for promo3
Skipping volume group promo3
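To find out what is holding the LVM lock while the node is in that
state, something like the following might help (just a guess on my
side, nothing PVE-specific):

# list the LVM lock files and their timestamps
ls -l /run/lock/lvm/
# query device-mapper directly, without taking the VG lock
dmsetup info -c
# look for hung lvm or vzdump processes
ps aux | grep -E 'lvm|vzdump'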
The syslog from this attempt:
Mar 22 01:08:01 promo3 /USR/SBIN/CRON[13592]: (root) CMD (vzdump 306
--quiet 1 --mode snapshot --mailto ms at city-pc.de --compress lzo
--storage p2_storage)
Mar 22 01:08:02 promo3 vzdump[13593]: <root at pam> starting task
UPID:promo3:0000351B:00058E23:514BA0E2:vzdump::root at pam:
Mar 22 01:08:02 promo3 vzdump[13595]: INFO: starting new backup job:
vzdump 306 --quiet 1 --mailto ms at city-pc.de --mode snapshot --compress
lzo --storage p2_storage
Mar 22 01:08:02 promo3 vzdump[13595]: INFO: Starting Backup of VM 306
(openvz)
Mar 22 01:09:54 promo3 rrdcached[4027]: flushing old values
Mar 22 01:09:54 promo3 rrdcached[4027]: rotating journals
Mar 22 01:09:54 promo3 rrdcached[4027]: started new journal
/var/lib/rrdcached/journal//rrd.journal.1363910994.615643
Mar 22 01:11:07 promo3 kernel: ct0 nfs: server 10.1.0.3 not responding,
still trying
Mar 22 01:11:22 promo3 kernel: INFO: task nfsd:3957 blocked for more
than 120 seconds.
Mar 22 01:11:22 promo3 kernel: "echo 0 >
/proc/sys/kernel/hung_task_timeout_secs" disables this message.
Mar 22 01:11:22 promo3 kernel: nfsd D ffff880879f2d1e0 0
3957 2 0 0x00000000
Mar 22 01:11:22 promo3 kernel: ffff880879f2f900 0000000000000046
ffff8808619a4fc0 0000000000000001
Mar 22 01:11:22 promo3 kernel: 00000000000005a8 ffff88087bb77e00
0000000000000080 0000000000000004
Mar 22 01:11:22 promo3 kernel: ffff880879f2f8d0 ffffffff81182e1b
ffff880879f2d7a8 000000000001e9c0
Mar 22 01:11:22 promo3 kernel: Call Trace:
Mar 22 01:11:22 promo3 kernel: [<ffffffff81182e1b>] ?
cache_flusharray+0xab/0x100
Mar 22 01:11:22 promo3 kernel: [<ffffffff810974ee>] ?
prepare_to_wait+0x4e/0x80
Mar 22 01:11:22 promo3 kernel: [<ffffffff8119ad69>]
__sb_start_write+0x169/0x1a0
Mar 22 01:11:22 promo3 kernel: [<ffffffff81097200>] ?
autoremove_wake_function+0x0/0x40
Mar 22 01:11:22 promo3 kernel: [<ffffffff81127489>]
generic_file_aio_write+0x69/0x100
Mar 22 01:11:22 promo3 kernel: [<ffffffff81127420>] ?
generic_file_aio_write+0x0/0x100
Mar 22 01:11:22 promo3 kernel: [<ffffffff8119872b>]
do_sync_readv_writev+0xfb/0x140
Mar 22 01:11:22 promo3 kernel: [<ffffffff811b3e40>] ? iput+0x30/0x70
Mar 22 01:11:22 promo3 kernel: [<ffffffff81097200>] ?
autoremove_wake_function+0x0/0x40
Mar 22 01:11:22 promo3 kernel: [<ffffffffa0361e70>] ?
nfsd_acceptable+0x0/0x120 [nfsd]
Mar 22 01:11:22 promo3 kernel: [<ffffffff81198557>] ?
rw_copy_check_uvector+0x97/0x120
Mar 22 01:11:22 promo3 kernel: [<ffffffff81199696>]
do_readv_writev+0xd6/0x1f0
Mar 22 01:11:22 promo3 kernel: [<ffffffffa0361ff2>] ?
nfsd_setuser_and_check_port+0x62/0xb0 [nfsd]
Mar 22 01:11:22 promo3 kernel: [<ffffffffa0571e99>] ?
vzquota_qlnk_destroy+0x29/0x110 [vzdquota]
Mar 22 01:11:22 promo3 kernel: [<ffffffff811997f8>] vfs_writev+0x48/0x60
Mar 22 01:11:22 promo3 kernel: [<ffffffffa0363a25>]
nfsd_vfs_write+0x115/0x480 [nfsd]
Mar 22 01:11:22 promo3 kernel: [<ffffffffa0365cbb>] ?
nfsd_open+0x23b/0x2c0 [nfsd]
Mar 22 01:11:22 promo3 kernel: [<ffffffffa0366107>]
nfsd_write+0xe7/0x100 [nfsd]
Mar 22 01:11:22 promo3 kernel: [<ffffffffa036e1df>]
nfsd3_proc_write+0xaf/0x140 [nfsd]
Mar 22 01:11:22 promo3 kernel: [<ffffffffa035e52e>]
nfsd_dispatch+0xfe/0x240 [nfsd]
Mar 22 01:11:22 promo3 kernel: [<ffffffffa027e174>]
svc_process_common+0x344/0x650 [sunrpc]
Mar 22 01:11:22 promo3 kernel: [<ffffffff8105a620>] ?
default_wake_function+0x0/0x20
Mar 22 01:11:22 promo3 kernel: [<ffffffffa027e7b2>]
svc_process+0x102/0x150 [sunrpc]
Mar 22 01:11:22 promo3 kernel: [<ffffffffa035ee5d>] nfsd+0xcd/0x180 [nfsd]
Mar 22 01:11:22 promo3 kernel: [<ffffffffa035ed90>] ? nfsd+0x0/0x180 [nfsd]
Mar 22 01:11:22 promo3 kernel: [<ffffffff81096c26>] kthread+0x96/0xa0
Mar 22 01:11:22 promo3 kernel: [<ffffffff8100c1aa>] child_rip+0xa/0x20
Mar 22 01:11:22 promo3 kernel: [<ffffffff81096b90>] ? kthread+0x0/0xa0
Mar 22 01:11:22 promo3 kernel: [<ffffffff8100c1a0>] ? child_rip+0x0/0x20
Mar 22 01:11:22 promo3 kernel: INFO: task nfsd:3958 blocked for more
than 120 seconds.
Mar 22 01:11:22 promo3 kernel: "echo 0 >
/proc/sys/kernel/hung_task_timeout_secs" disables this message.
Mar 22 01:11:22 promo3 kernel: nfsd D ffff880879f2c700 0
3958 2 0 0x00000000
Mar 22 01:11:22 promo3 kernel: ffff88087aa97900 0000000000000046
ffff88087aa978a0 ffff881064c4b000
Mar 22 01:11:22 promo3 kernel: ffff8808619a4fc0 ffff88087bb77e00
ffff88086de62bf8 0000000000000020
Mar 22 01:11:22 promo3 kernel: ffff88087aa978d0 ffffffff81182e1b
ffff880879f2ccc8 000000000001e9c0
During the job, listing the mounted PVE storage did not work either:
ls /mnt/pve/
simply hung.
All machines on the node were inaccessible again.
I rebooted with kernel 2.6.32-17, created a new cron job, and the backup
worked as it had for the previous 12 months. Here is the syslog:
Mar 22 02:22:01 promo3 /USR/SBIN/CRON[3738]: (root) CMD (vzdump 306
--quiet 1 --mode snapshot --compress lzo --storage p2_storage)
Mar 22 02:22:01 promo3 vzdump[3739]: <root at pam> starting task
UPID:promo3:00000E9D:00007A5A:514BB239:vzdump::root at pam:
Mar 22 02:22:01 promo3 vzdump[3741]: INFO: starting new backup job:
vzdump 306 --quiet 1 --mailto ms at city-pc.de --mode snapshot --compress
lzo --storage p2_storage
Mar 22 02:22:01 promo3 vzdump[3741]: INFO: Starting Backup of VM 306
(openvz)
Mar 22 02:22:02 promo3 kernel: EXT3-fs: barriers disabled
Mar 22 02:22:02 promo3 kernel: kjournald starting. Commit interval 5
seconds
Mar 22 02:22:02 promo3 kernel: EXT3-fs (dm-3): using internal journal
Mar 22 02:22:02 promo3 kernel: ext3_orphan_cleanup: deleting
unreferenced inode 70148484
Mar 22 02:22:02 promo3 kernel: ext3_orphan_cleanup: deleting
unreferenced inode 70148481
Mar 22 02:22:02 promo3 kernel: ext3_orphan_cleanup: deleting
unreferenced inode 70148465
Mar 22 02:22:02 promo3 kernel: ext3_orphan_cleanup: deleting
unreferenced inode 70148429
Mar 22 02:22:02 promo3 kernel: ext3_orphan_cleanup: deleting
unreferenced inode 70148427
Mar 22 02:22:02 promo3 kernel: ext3_orphan_cleanup: deleting
unreferenced inode 70156303
Mar 22 02:22:02 promo3 kernel: EXT3-fs (dm-3): 6 orphan inodes deleted
Mar 22 02:22:02 promo3 kernel: EXT3-fs (dm-3): recovery complete
Mar 22 02:22:02 promo3 kernel: EXT3-fs (dm-3): mounted filesystem with
ordered data mode
Mar 22 02:22:30 promo3 ntpd[1884]: Listen normally on 38 veth306.0
fe80::c04e:44ff:fe61:ecfe UDP 123
Mar 22 02:25:49 promo3 vzdump[3741]: INFO: Finished Backup of VM 306
(00:03:48)
Mar 22 02:25:49 promo3 vzdump[3741]: INFO: Backup job finished successfully
Mar 22 02:25:49 promo3 citadel: 1 unique messages to be merged
Mar 22 02:25:49 promo3 citadel: 1 unique messages to be merged
Mar 22 02:25:49 promo3 vzdump[3739]: <root at pam> end task
UPID:promo3:00000E9D:00007A5A:514BB239:vzdump::root at pam: OK
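Until this is sorted out I will keep booting 2.6.32-17. Pinning it as
the default GRUB entry should look roughly like this (standard Debian
GRUB2 assumed; the exact menu entry title has to be checked in
/boot/grub/grub.cfg first):

# /etc/default/grub - the entry title below is only an assumption
GRUB_DEFAULT="Proxmox VE GNU/Linux, with Linux 2.6.32-17-pve"
# regenerate the boot configuration afterwards
update-grub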
Any hints on how to prevent this issue with kernel 2.6.32-19?
Regards, Martin