[pbs-devel] [PATCH proxmox-backup] docs: add note for not using remote storages

Dominik Csapak d.csapak at proxmox.com
Thu Jun 13 10:02:24 CEST 2024


On 6/12/24 17:40, Thomas Lamprecht wrote:
> Am 12/06/2024 um 08:39 schrieb Dominik Csapak:
>>
>> On 6/11/24 8:05 PM, Thomas Lamprecht wrote:
>>> This section is a quite central and important one, so I'm being a bit
>>> more nitpicky with it than with other content. NFS boxes are still quite
>>> popular; a blanket recommendation against them quite probably won't
>>> help our cause of reducing noise in our help channels.
>>>
>>> Dietmar already applied this, so would need a follow-up please.
>>
>> sure
>>
>>>
>>> Am 11/06/2024 um 11:30 schrieb Dominik Csapak:
>>>> such as NFS or SMB. They will not provide the expected performance
>>>> and it's better to recommend against them.
>>>
>>> Not so sure about recommending against them as a blanket statement;
>>> the "remote" adjective is a bit subtle and, e.g., a full-flash
>>> NVMe storage attached over a 100G link with latency in the µs range
>>> surely beats basically any local spinner-only storage and probably even
>>> a lot of SATA-attached SSD ones.
>>
>> well, the mere fact of using NFS makes some operations a few orders of
>> magnitude slower. e.g. here, creating a datastore locally takes a few
>> seconds (probably fast due to the page cache), but on a locally
>> mounted NFS export (so no network involved) on the same disk it takes
>> a few minutes. so at least some file creation/deletion operations
>> are orders of magnitude slower just by using NFS (though i guess
>> there are some options/implementations that can influence that,
>> such as the async/sync export options)
>>
>> also a remote SMB share from Windows (same physical host though, so
>> again, no real network) takes ~a minute for the same operation
>>
>> so yes, while I generally agree that remote storage can be fast
>> enough, using any of these protocols slows some file operations down
>> significantly, even with fast storage and a fast network
> 
> Just because there is some overhead (the result of a trade-off to get
> an FS that can be accessed in parallel/simultaneously) doesn't mean that
> we should recommend against that FS, which is IMO a bit strange to do
> in a system requirements recommendation list anyway (there's a huge
> list of things that'd need to be added then, from not using
> USB 1.0 pen drives as backing storage to not sliding strong magnets
> over the server).

but we already regularly recommend against using remote storage,
just not in the docs but in the forum (as do many of our users).

we also recommend against slow storage, even though that too can work
depending on the use case/workload/exact setup

> 
>>
>> (i know that datastore creation is not the best benchmark for this,
>> but it shows that there is significant overhead on some operations)
> 
> Yeah, one creates a datastore only once, and an actual backup does
> at most a few mkdirs, not 65k, so that's not really relevant here.
> Also, allowing simultaneous mounts doesn't come for free, but just
> because there's some overhead doesn't mean it's actually a problem
> for actual backups. As said, a blanket recommendation against a setup
> that is already rather frequent is IMO just deterring (future) users.

it's not only datastore creation; garbage collection and
all operations that have to access many files in succession suffer
from the overhead here.

my point is that using a remote fs (regardless of which one)
adds so much overhead that it often turns what would be 'reasonable'
performance locally into 'unreasonably slow', so you'd have to massively
overcompensate for that in hardware. This is possible ofc, but highly
unlikely for the vast majority of users.
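to make that concrete, here is a rough sketch (not a proper benchmark;
the path is a placeholder) of the kind of metadata-heavy operation
involved: creating a datastore lays out 65536 chunk subdirectories,
and on NFS each mkdir becomes at least one network round-trip:

```shell
# Rough sketch (not an exact benchmark): datastore creation lays out
# the 65536 chunk subdirectories (0000..ffff) under .chunks/ -- each
# one a metadata operation, and on NFS a network round-trip.
# /tmp/ds-local is a placeholder; repeat with a path on the NFS/SMB
# mount to compare.
base=/tmp/ds-local
mkdir -p "$base/.chunks"
start=$(date +%s)
seq 0 65535 | xargs printf '%04x\n' | (cd "$base/.chunks" && xargs mkdir)
end=$(date +%s)
echo "created 65536 chunk dirs in $((end - start))s"
```

on a local filesystem this finishes in seconds; pointed at an NFS or
SMB mount, the same loop is where the minutes come from (a sync export
makes it even worse, since every mkdir has to hit stable storage first).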

> 
> 
>>>
>>> Also, it can be totally fine to use as second datastore, i.e. in a setup
>>> with a (smaller) datastore backed by (e.g. local) fast storage that is
>>> then periodically synced to a slower remote.
>>>
>>>> Signed-off-by: Dominik Csapak <d.csapak at proxmox.com>
>>>> ---
>>>> if we want to discourage users even more, we could also detect it on
>>>> datastore creation and put a warning into the task log
>>>
>>> I would avoid that, at least not without actually measuring how the
>>> storage performs (which is probably quite prone to errors, or would
>>> require periodic measurements).
>>
>> fine with me
>>
>>>
>>>>
>>>> also if we ever come around to implementing the 'health' page thomas
>>>> wished for, we can put a warning/error there too
>>>>
>>>>    docs/system-requirements.rst | 3 +++
>>>>    1 file changed, 3 insertions(+)
>>>>
>>>> diff --git a/docs/system-requirements.rst b/docs/system-requirements.rst
>>>> index fb920865..17756b7b 100644
>>>> --- a/docs/system-requirements.rst
>>>> +++ b/docs/system-requirements.rst
>>>> @@ -41,6 +41,9 @@ Recommended Server System Requirements
>>>>      * Use only SSDs, for best results
>>>>      * If HDDs are used: Using a metadata cache is highly recommended, for example,
>>>>        add a ZFS :ref:`special device mirror <local_zfs_special_device>`.
>>>> +  * While it's technically possible to use remote storages such as NFS or SMB,
>>>
>>> Up-front: I first wrote some possible smaller improvements and then
>>> a replacement (see below), but I kept the smaller ones anyway
>>>
>>> Would do s/remote storages/remote storage/
>>>
>>> (We use "storages" quite a few times already, but if possible keeping it
>>> singular sounds nicer IMO)
>>
>> ok
>>
>>>
>>>> +    the additional latency and overhead drastically reduces performance and it's
>>>
>>> s/additional latency and overhead/additional latency overhead/ ?
>>>
>>> or "network overhead"
>>>
>>> If it'd stay as is, the "reduces" should be changed to "reduce" ("latency and
>>> overhead" is plural).
>>>
>>
>> i actually meant two things here: the network latency and the
>> additional overhead of the second filesystem layer
> 
> Then it'd have helped me to avoid mixing a specific overhead (latency)
> with a generic mention of the word overhead, like:
> 
> "... the added overhead of networking and providing concurrent file system access
> drastically reduces performance ..."
> 
> But that sounds a bit convoluted, so the best option here might be to just
> use "added overhead".
> 
> 
>>>
>>> But I'd rather reword the whole thing to focus more on what the actual issue
>>> is, i.e., not NFS or SMB/CIFS per se, but accessing them over a slow network.
>>> Maybe something like:
>>>
>>> * Avoid using remote storage, like NFS or SMB/CIFS, connected over a slow
>>>     (< 10 Gbps) and/or high latency (> 1 ms) link. Such a storage can
>>>     dramatically reduce performance and may even negatively impact the
>>>     backup source, e.g. by causing IO hangs.
>>>
>>> I pulled the numbers in parentheses out of thin air, but IMO they shouldn't be too far
>>> off from 2024 Slow™, no hard feelings on adapting them though.
>>
>> IMHO i'd not mention any specific numbers at all, unless we actually
>> benchmarked such a setup. so what about:
> 
> Not sure what numbers from a benchmark would be of use here? One knows what
> fast storage can do latency wise and how much bandwidth is a good baseline
> – granted, the numbers are not helping for every specific setup, but doing
> some benchmark won't change that either.
> Anyway, won't matter, see below.
> 
>>
>> * Avoid using remote storage, like NFS or SMB/CIFS, connected over a
>> slow and/or high latency link. Such a storage can dramatically reduce
>> performance and may even negatively impact the backup source, e.g. by
>> causing IO hangs. If you want to use such a storage, make sure it
>> performs as expected by testing it before using it in production.
>>
> 
> That starts to get rather convoluted, tbh. The more I think about this,
> the more I prefer just reverting the whole thing; I see no gain in
> "bashing" NFS/SMB just because they have some overhead.
> 
> If, we could simply adapt the "Use only SSDs, for best results" point to:
> 
> "Prefer fast local storage that delivers high IOPS for random IO workloads; use only enterprise SSDs for best results."
> 
> Would be a better fit to convey that fast local storage should be preferred,
> especially in a "recommended" (not "recommended against") list.
> 
> 
>>
>> By adding that additional sentence we hopefully nudge some users
>> into actually testing before deploying, instead of
>> complaining afterwards that it's slow.
> 
> If only; from forum and office requests it's quite sensible to assume
> that a good amount of users already have their storage box, and they'd
> need to have it to test it in any way, so it's already too late.
> 
> It might be better to describe how to still use their
> existing NFS/SMB/... attached storage in the best way possible. E.g., by
> using a fast, small local storage for incoming backups and the bigger
> remote storage only through syncing to it. This has a few benefits besides
> getting good performance with existing, slower storage (of any type), like
> already having an extra copy of the most recent data.

ultimately it's your call, but personally i'd prefer a broad statement
that deters users from a suboptimal setup in the first place
over not mentioning it at all in the official docs and having to explain
every week in the forums that it's a bad idea

this is the same as recommending fast disks: one can use slow disks
in some (small) setups successfully without problems, but it does not
scale properly, so we recommend against them. for remote storage,
the vast majority of users probably won't invest in a super
high-performance NAS/SAN box, so recommending against using those
is worth mentioning in the docs IMHO

it does not have to be in the system requirements though; we could
also put a longer explanation in e.g. the FAQ or the datastore section.
i just put it in the system requirements because we call out
slow disks there too and i guessed this is one of the more
widely read sections.
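fwiw, the sync setup thomas describes could look roughly like this on
the CLI (all names, the address and the schedule are made up, and the
exact options may differ between versions):

```shell
# Sketch: backups go to a small datastore on fast local disks of one
# PBS instance; the instance whose datastore sits on the big, slower
# storage periodically pulls them over via a sync job.
# Run on the instance with the big datastore; everything below is a
# placeholder.
proxmox-backup-manager remote create fast-pbs \
    --host 192.0.2.10 --auth-id 'sync@pbs' --password 'SECRET'
proxmox-backup-manager sync-job create pull-from-fast \
    --store big-slow-store --remote fast-pbs --remote-store fast-store \
    --schedule hourly
```

this way the slow storage only sees the sync traffic instead of the
latency-sensitive backup path, and the most recent backups exist in
two places.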



