[pve-devel] [PATCH pve-storage] fix #6450: add file-checksum endpoint to storage API

Thu Oct 2 14:51:16 CEST 2025

On October 2, 2025 2:41 pm, Thomas Lamprecht wrote:
> Am 02.10.25 um 14:15 schrieb Shannon Sterz:
>>>              warn $@ if $@;
>>>          }
>>>
>>> +        if (exists $param->{checksum}) {
>>> +            print "calculating checksum...\n";
>>> +            $entry->{checksum} = PVE::Tools::get_file_hash($param->{checksum}, $path);
>> i've tested this with some not too uncommon disk images such as a 32GB
>> volume that is essentially empty and the api endpoint here just times
>> out. which is not too surprising. i wonder if we can cache the hashes
>> here somehow and calculate them in a worker task. i also wonder how
>> this should ideally work for running vm and container images as their
>> checksum could change all the time.
>> 
>> maybe we can at least calculate the hashes here for some more static
>> assets such as isos etc. ahead of time and only enable this flag for things
>> like that (so isos, container templates, images of vm and container
>> templates etc.) basically things that don't change that much?
> 
> 
> I could not find it, but IIRC there was such a request (or patch?) for
> checksums of storage content submitted in the past where we discussed
> this already.
> 
> Anyhow, this is really not something trivial and would need some system
> to cache the hash while also having a heuristic that ensures the cached
> hash is still valid – as having a wrong hash returned might needlessly
> wreck some nerves of any admin that takes their job seriously.
> 
> We could do a file that contains the hash(es) and an inode nr., file
> size and mtime value from the time those hash(es) got created as
> heuristic to detect legitimate change. Plus probably the date to
> show the user that this was not calculated on the fly.
> And yes, actual calculation needs to happen in a task worker, as
> this can run for quite a while on big files and/or slow storages.
> So probably best done in a dedicated API call I guess, but with all
> this in mind I'm questioning a bit if this is really worth that much
> effort...
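
for illustration, the cache heuristic described above (hash plus inode nr.,
file size and mtime, plus the calculation date) could be sketched roughly
like this. it is written in Python for brevity rather than Perl, and all
names (`cached_hash`, the JSON sidecar file, its field names) are made up
for this sketch and are not part of the patch or of any existing API. the
hashing is done inline here only to keep the example short, in practice it
would have to run in a task worker:

```python
import hashlib
import json
import os
import time


def file_fingerprint(path):
    """Cheap metadata used to detect that the file has (probably) changed."""
    st = os.stat(path)
    return {"inode": st.st_ino, "size": st.st_size, "mtime_ns": st.st_mtime_ns}


def compute_hash(path, algo="sha256", chunk_size=1024 * 1024):
    """Hash the file in chunks so big images are never loaded into memory."""
    h = hashlib.new(algo)
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def cached_hash(path, cache_path, algo="sha256"):
    """Return (digest, computed_at, from_cache) for path.

    The hypothetical sidecar file at cache_path is only trusted if the
    stored fingerprint still matches the file's current metadata.
    """
    fp = file_fingerprint(path)
    try:
        with open(cache_path) as f:
            entry = json.load(f)
    except (OSError, ValueError):
        entry = None  # missing or corrupt cache file: recompute

    if entry and entry.get("fingerprint") == fp and algo in entry.get("hashes", {}):
        return entry["hashes"][algo], entry["computed_at"], True

    digest = compute_hash(path, algo)
    now = int(time.time())
    entry = {"fingerprint": fp, "hashes": {algo: digest}, "computed_at": now}

    # write to a temporary file and rename, so readers never see a partial entry
    tmp = cache_path + ".tmp"
    with open(tmp, "w") as f:
        json.dump(entry, f)
    os.replace(tmp, cache_path)
    return digest, now, False
```

the fingerprint is taken before hashing, so a file that is modified while
being hashed would most likely fail the comparison on the next lookup and be
recalculated. the `computed_at` value is what could be shown to the user to
indicate that the hash was not calculated on the fly.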

recently discussed this with Dominik in the context of the streaming PBS
content API - we should really finally get around to implementing an async
storage content list API call - then this could easily be only enabled
for the async variant..

the rough sketch was:

- add a task worker variant that is "ephemeral"/"light-weight"/..
- such task workers return a structured result object that is saved to disk
- the API endpoint starting them returns some kind of "token" (similar
  to the UPID for regular tasks, or maybe even use the same format?)
- they are not included in the regular task list
- the result can be queried using the token; once the task has finished,
  either an error or the result is returned and the result is removed
  from disk

the UI could then trigger periodic refreshes of the content view, always
display (slightly outdated) information, etc.pp. other clients could
opt into the async variant as well, if it fits their use case.

besides the storage content view, there are a few more that would benefit
from this kind of mechanism (with or without a client-side cache):

https://bugzilla.proxmox.com/show_bug.cgi?id=4447
https://bugzilla.proxmox.com/show_bug.cgi?id=3045

https://bugzilla.proxmox.com/show_bug.cgi?id=4961

and probably a few more that I failed to find quickly.