[pbs-devel] [PATCH proxmox-backup] docs: add note for not using remote storages

Thomas Lamprecht t.lamprecht at proxmox.com
Mon Jun 17 17:58:00 CEST 2024


Am 13/06/2024 um 10:02 schrieb Dominik Csapak:
> On 6/12/24 17:40, Thomas Lamprecht wrote:
> but we already do regularly recommend against using remote storage,
> just not in the docs but in the forum (as do many of our users).
> 
> we also recommend against slow storage, but that can also work
> depending on the use case/workload/exact setup

If a user complains, it's safe to assume that it's too slow for their
use case; otherwise they would not be in the forum.

It's also OK to tell users that their storage is too slow and that a local
storage with some SSDs might be a (relatively) cheap alternative to address
that, especially in the previously mentioned combination where a small and
fast local storage is used for incoming backups while the remote storage is
still used, via syncing, to hold a longer history of backups.
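
To make that concrete, such a split could look roughly like the following.
This is a sketch from memory, so double-check the exact flags, and whether
your version supports sync jobs with a local source, against the docs;
datastore names and paths are made up:

  # small but fast datastore on local SSDs for incoming backups
  proxmox-backup-manager datastore create incoming /mnt/local-ssd/incoming

  # big datastore on the NFS/SMB mount for the longer history
  proxmox-backup-manager datastore create archive /mnt/remote-nfs/archive

  # periodically pull the snapshots from the fast datastore over
  proxmox-backup-manager sync-job create incoming-to-archive \
      --store archive --remote-store incoming --schedule daily

Prune settings can then be kept rather aggressive on the small incoming
datastore while the archive one keeps the longer history.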

Both have nothing to do with a blanket recommendation against remote
storage, i.e., one made without looking closely at the actual setup, and I
hope such blanket statements are not currently being made frequently
without context.

>>>
>>> (I know that datastore creation is not the best benchmark for this,
>>> but it shows that there is significant overhead on some operations)
>>
>> Yeah, one creates a datastore only once, and on actual backup there
>> are at max a few mkdirs, not 65k, so not really relevant here.
>> Also, just because there's some overhead (allowing simultaneous mounts
>> doesn't come for free), it doesn't mean that it's actually a problem for
>> actual backups. As said, a blanket recommendation against a setup that
>> is already rather frequent is IMO just deterring (future) users.
> 
> It's not only datastore creation; garbage collection and
> all operations that have to access many files in succession suffer
> from the overhead here.
> 
> My point is that using a remote fs (regardless of which) adds so much
> overhead that it often turns what would be 'reasonable' performance
> locally into 'unreasonably slow', so you'd have to massively
> overcompensate for that in hardware. This is possible, of course, but
> highly unlikely for the vast majority of users.
> 

That a storage being remote makes it unusably slow for PBS by definition
is just not true (see the next paragraph of my reply, which expands on
that).

>>
>> If only; from forum and office requests it's quite sensible to assume
>> that a good amount of users already have their storage box, and they'd
>> need to have one to be able to test it in any way, so it's already too
>> late for such a recommendation.
>>
>> It might be better to describe a setup for how to still use their
>> existing NFS/SMB/... attached storage in the best way possible, e.g., by
>> using a fast, small local storage for incoming backups and the bigger
>> remote storage only through syncing to it. Besides getting good
>> performance with existing, slower storage (of any type), this has a few
>> benefits, like already having an extra copy of the most recent data.
> 
> Ultimately it's your call, but personally I'd prefer a broad statement
> that deters users from using a suboptimal setup in the first place
> over not mentioning it at all in the official docs and explaining
> every week in the forum that it's a bad idea.

Again, just because a storage is remote does *not* mean that it has to be
too slow to be used; i.e., just because there is _some_ overhead does
*not* mean that the storage becomes unusable. Ceph, e.g., is a remote
storage that can be made plenty fast, as our own benchmark papers show,
and some users in huge environments even have to use it for backups, as
nothing else can scale to that amount of data and performance.
Or take Blockbridge, who provide fast remote storage through NVMe over
TCP.

So by counterexample, including our *own* benchmarks, I think we really
can establish as a fact that there can be remote storage setups that are fast,
and I do not see any point in arguing that further.

> 
> This is the same as recommending fast disks: one can use slow disks
> in some (small) setups successfully without problems, but it does not
> scale properly, so we recommend against it. For remote storage,

It really isn't; recommending fast local storage in a recommended
system specs section is not the same as a blanket recommendation against
a whole class of storage.

> the vast majority of users probably won't invest in a super
> high-performance NAS/SAN box, so recommending against using those
> is worth mentioning in the docs IMHO.

As mentioned in my last reply, with that logic we'd have thousands of
things to recommend against: lots of old/low-power HW, some USB HW (while
some other, nicer one can be totally fine again), ... This would blow up
the section so much over time that almost nobody would read it to
completion, not really helping with such annoying cases in the forum or
other channels (those cannot really be fixed by just adding a bullet
point; IME users are even encouraged to go further in the wrong direction
if the argumentation isn't sound (and sometimes even then..)).

> 
> It does not have to be in the system requirements though; we could
> also put a longer explanation in, e.g., the FAQ or the datastore
> section. I just put it in the system requirements because we call out
> slow disks there too, and I guessed it is one of the more widely read
> sections.
> 

I reworked the system requirements part per my previous proposal; that
fits the style of recommending for things, not against them, and tells the
user what's actually important, not some possible correlate of that.

https://git.proxmox.com/?p=proxmox-backup.git;a=commitdiff;h=5c15fb97b4d507c2f60428b3dba376bdbfadf116

This is getting long again, so only as a short draft that would need some
more thought and expansion: an IMO better help than recommending against
such things would be to provide a CLI command that lets users test some
basic throughput and access times (e.g., with a cold/flushed FS cache) and
that uses these measurements to extrapolate to some GC/verify examples
mirroring real-world small/medium/big setups.
While naturally still not perfect, it would tell the user much more to see
that a workload with, e.g., 30 VMs (backup groups), each with, say,
~100 GB of space usage and 10 snapshots per backup group, would need
roughly X time for a GC and Y time for a verification of all data. Surely
quite a bit more complex to do sanely, but something like that would IMO
be *much* more helpful.
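
Just to sketch the direction (this is purely illustrative and nothing like
an actual implementation; the numbers in the comments are the made-up
example workload from above, using the fact that VM images are stored as
fixed 4 MiB chunks, so ~100 GB per group comes out at roughly 25600 chunks
when ignoring dedup): such a command could sample per-file metadata
latency on the datastore filesystem and extrapolate that over the chunk
count a GC has to touch:

use std::{fs, path::Path, time::Instant};

fn main() -> std::io::Result<()> {
    // Path on the datastore filesystem to probe (hypothetical CLI arg).
    let dir = std::env::args().nth(1).expect("usage: probe <path-on-datastore>");
    let probe = Path::new(&dir).join(".probe");
    fs::create_dir_all(&probe)?;

    // Create a sample of small files standing in for chunk files.
    // NOTE: these are cache-warm; a real tool would flush the FS cache
    // first (the cold-cache part from above) to get honest numbers.
    const SAMPLES: u32 = 1000;
    for i in 0..SAMPLES {
        fs::write(probe.join(format!("chunk-{i:04}")), b"x")?;
    }

    // Time one metadata operation per file; GC phase one does an atime
    // update per referenced chunk, which this stat only approximates.
    let start = Instant::now();
    for i in 0..SAMPLES {
        fs::metadata(probe.join(format!("chunk-{i:04}")))?;
    }
    let per_op = start.elapsed() / SAMPLES;

    // Example workload: 30 groups x ~25600 chunks each, so roughly
    // 768000 metadata operations for a full GC mark phase.
    let chunks: u32 = 30 * 25_600;
    println!(
        "per-op latency: {:?}, extrapolated GC metadata walk: {:?}",
        per_op,
        per_op * chunks
    );

    fs::remove_dir_all(&probe)?;
    Ok(())
}

At, say, 50 µs per operation (local SSD) that walk stays well under a
minute, while 1 ms per operation (one network round-trip each) already
means close to 13 minutes; showing users such concrete numbers for their
own storage would IMO be far more convincing than a generic warning.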



