[pbs-devel] [PATCH proxmox-backup 1/3] fix #3786: api: add resync-corrupt option to sync jobs

Mon Oct 21 10:16:57 CEST 2024

On  2024-10-17 15:20, Gabriel Goller wrote:
> On 17.10.2024 08:15, Thomas Lamprecht wrote:
>> Am 15/10/2024 um 15:18 schrieb Gabriel Goller:
>>> This option allows us to "fix" corrupt snapshots (and/or their chunks)
>>> by pulling them from another remote. When traversing the remote
>>> snapshots, we check if it exists locally, and if it is, we check if the
>>> last verification of it failed. If the local snapshot is broken and the
>>> `resync-corrupt` option is turned on, we pull in the remote snapshot,
>>> overwriting the local one.
>>>
>>> This is very useful and has been requested a lot, as there is currently
>>> no way to "fix" corrupt chunks/snapshots even if the user has a healthy
>>> version of it on their offsite instance.
>>>
>>> Signed-off-by: Gabriel Goller <g.goller at proxmox.com>
>>> Originally-by: Shannon Sterz <s.sterz at proxmox.com>
>>
>> those two trailers probably should be switched to uphold causal ordering.
> 
> Makes sense.
> 
>> In general, it would be _really_ nice to get some regression testing for
>> sync jobs covering the selection with different matching and options.
>> The lack of that doesn't have to block your/shannon's patch, but IMO it
>> should be added rather sooner than later. Sync is becoming more and more
>> complex, and the selection of snapshots is a very important and fundamental
>> part of it.
> 
> Do you think we could mock ourselves out of this and unit-test it
> somehow? For full regression tests we will need Lukas's CI work.
> 

To chime in here, while it will be certainly possible to test such stuff once the
end-to-end testing tooling is finished (it's getting there), in most cases it is
preferable to test as much as possible in unit tests as they are
  - much faster (<<1s, compared to end-to-end tests which could take seconds or even minutes, depending on what they do)
  - easier to write and maintain (you are much closer to the code under test)
  - also serve as documentation for the code (e.g. what contracts a unit of code must uphold)
  - far shorter feedback loop / better DX (cargo test vs. having to set up/run test instance VMs or trigger some CI system)

Especially for something like the snapshot selection in sync jobs, a good test suite might cover
many different permutations and corner cases, which would be *a lot of* of work to write
and also take *a long time* to execute in a full bells and whistles end-to-end test where you 
perform an actual sync job between two real PBS installations.

That being said, it will be of course a challenge to factor out the sync logic to make it testable.
Without me being familiar with the code, it could involve abstracting the interactions with the rest of
the system with some trait, moving the sync logic into a separate structs / functions and use
dependency injection to give the sync module/fn a concrete implementation to use (in tests
you'd then use a fake implementation). For existing code these refactoring steps can be
quite intimidating, which is the reason why I am a big fan of writing unit tests *while*
writing the actual implementation (similar to, but not quite TDD; there you'd write the tests *first*),
as this ensures that the code I'm writing is actually testable.

-- 
- Lukas