[pve-devel] [PATCH v3 pve-common 01/12] introduce PVE::Path
Max Carrara
m.carrara at proxmox.com
Mon Feb 3 15:21:23 CET 2025
On Fri Jan 31, 2025 at 1:23 PM CET, Fiona Ebner wrote:
> Am 09.01.25 um 15:48 schrieb Max Carrara:
> > +=head3 path_join(@paths)
> > +
> > +Joins multiple paths together. All kinds of paths are supported.
> > +
> > +Does not perform any C<L<< normalization|/"path_normalize($path)" >>>.
> > +
> > + my $joined = path_join("foo", "bar/baz", "qux.txt");
> > + # foo/bar/baz/qux.txt
> > +
> > + my $joined = path_join("/", "etc/pve/", "storage.cfg");
> > + # /etc/pve/storage.cfg
> > +
> > +Similar to C<L<< path_push()|/"path_push($path, $other)">>>, should any of the
> > +C<@paths> be an absolute path, it I<replaces> all preceding paths.
>
> Seems like the docs are somehow broken here, looking at it with perldoc,
> it just states 'Similar to "path_push()"' and the rest of the sentence
> is missing.
My bad; made a typo in the docstring for path_components() (just above
this function). Thanks for spotting this!
>
> Why this kind of behavior with absolute paths? Seems surprising to me.
> Wouldn't failing the call be better?
I decided to stick with Rust's behaviour [pathbuf] here and in most
other places, simply in order to keep the behaviour consistent.
I'm not *really* a fan of it either, but I think it's better than
failing the call -- IMO having to wrap path_join / path_push into an
eval-block would be quite awkward for what's a rather "small" operation
:s
Perhaps I should provide some additional context here as well so as to
make my "design philosophy" more clear: The reason why I modeled
PVE::Path (mostly) after Rust's std::path::{Path, PathBuf} is because:
1. Most (newer) developers will be much more likely to be familiar
with Rust instead of Perl, so if they should see this library,
they'll have a much easier time using it (or at least that's what
I *hope* would happen). Sort of as in: "Oh, this is just like the
path stuff in Rust, except for a few Perl-esque exceptions!"
2. I wanted the library to be essentially exception-free, except where
obvious invariants aren't upheld by the caller (e.g. passing undef).
Unless you match the error message with a regex, there's no
structurally typed way to differ between errors in Perl, so I'd
like the "types" of errors to be mostly the same wherever possible.
3. I didn't want to reinvent the wheel and instead stick to what other
popular libraries do as much as possible, while also avoiding any
additional abstractions (e.g. introducing an object for paths).
With all that being said, since that behaviour is surprising to you, I
have no quarrels changing it! :P After all I'd like to have something
that doesn't cause any friction when using it.
[pathbuf]: https://doc.rust-lang.org/std/path/struct.PathBuf.html#method.push
>
> > +
> > + my $joined = path_join("foo/bar", "/etc", "resolv.conf");
> > + # /etc/resolv.conf
> > +
> > + my $joined = path_join("foo", "/etc/resolv.conf", "/etc/hosts");
> > + # /etc/hosts
> > +
> > +Throws an exception if any of the passed C<@paths> is C<undef>.
> > +
> > +=cut
> > +
> > +sub path_join : prototype(@) {
> > + my (@paths) = @_;
> > +
> > + if (!scalar(@paths)) {
> > + return '';
> > + }
> > +
> > + croak "one of the provided paths is undef" if any { !defined($_) } @paths;
> > +
>
> I think the rest could be written more efficiently like (untested):
>
> my $resulting_path = shift @paths;
> for my $path (@paths) {
> if ($path =~ m#^/#) {
> $resulting_path = $path;
> } else {
> $resulting_path = path_push($resulting_path, $path);
> }
> }
Note that I'm doing a reverse iteration below; I think doing it either
way should be fine, because I doubt we'll ever need to join that many
path components where it's actually going to make a difference ;P
>
> > + # Find the last occurrence of a root directory and start conjoining the
> > + # components from there onwards
> > + my $index = scalar(@paths) - 1;
> > + while ($index > 0) {
> > + last if $paths[$index] =~ m#^/#;
> > + $index--;
> > + }
> > +
> > + @paths = @paths[$index .. (scalar(@paths) - 1)];
> > +
> > + my $resulting_path = shift @paths;
> > +
> > + for my $path (@paths) {
> > + $resulting_path = path_push($resulting_path, $path);
> > + }
> > +
> > + return $resulting_path;
> > +}
> > +
> > +=head3 path_normalize($path)
> > +
> > +Wrapper for L<C<File::Spec/canonpath>>. Performs a logical cleanup of the given
> > +C<$path>.
> > +
> > +This removes unnecessary components of a path that can be safely
> > +removed, such as C<.>, trailing C</> or repeated occurrences of C</>.
> > +
> > +For example, C<foo/./bar/baz/.> and C<foo////bar//baz//> will both become
> > +C<foo/bar/baz>.
> > +
> > +B<Note:> This will I<not> remove components referencing the parent directory,
> > +i.e. C<..>. For example, C<foo/bar/../baz> and C<foo/bar/baz/..> will therefore
> > +remain as they are. However, the parent directory of C</> is C</>, so
> > +C</../../foo> will be normalized to C</foo>.
> > +
> > +Throws an exception if C<$path> is C<undef> or the call to C<canonpath> failed.
> > +
> > +=cut
> > +
> > +sub path_normalize : prototype($) {
> > + my ($path) = @_;
> > +
> > + croak "\$path is undef" if !defined($path);
> > +
> > + my $cleaned_path = eval {
> > + File::Spec->canonpath($path);
> > + };
> > +
>
> Style nit: blank lines between eval and using $@ are better avoided
> IMHO. I'd also have the eval expression be a single line, but no strong
> feelings.
Could you perhaps elaborate why, please?
>
> > + croak "failed to clean up path: $@" if $@;
> > +
> > + return $cleaned_path;
> > +}
> > +
> > +=head3 path_parent($path)
> > +
> > +Returns the given C<$path> without its final component, if there is one.
> > +
> > +Trailing and repeated occurrences of C</> and C<.> are normalized on the fly
> > +when needed. This means that e.g. C<foo////bar///.//> becomes C<foo>, but
> > +C<foo/.//bar//./baz> becomes C<foo/.//bar>.
> > +
> > +This function's behaviour is almost identical to Rust's
> > +L<< Path::parent|https://doc.rust-lang.org/std/path/struct.Path.html#method.parent >>,
> > +with a few adaptations made wherever Perl treats things differently:
> > +
> > +=over
> > +
> > +=item * C</foo/bar> becomes C</foo>, C<foo/bar> becomes C<foo>.
> > +
> > +=item * C</foo> becomes C</>.
> > +
> > +=item * C<foo/bar/..> becomes C<foo/bar>. Note that C<foo/bar> is not
> > +necessarily the real parent in the filesystem in the case of e.g. symlinks.
> > +
> > +=item * C<foo/../bar> becomes C<foo/..>.
> > +
> > +=item * C<foo> is interpreted as C<./foo> and becomes C<.>. This is because Perl's
> > +C<L<File::Spec/canonpath>> interprets C<./foo> and C<foo> as the same thing.
>
> I fail to see the "because" here. Why is this special handling important
> in practice, i.e. why exactly care about the behavior of canonpath here?
> It seems really surprising to get "." when asking for the parent of
> "foo". With the normalization explained above, I'd argue it'd be much
> more natural to return the empty string for both "./foo" and "foo". Or
> if you want to follow Rust's parent() more closely (which I guess is the
> reason for "/" and "." having different results), have "./foo" return
> "." and "foo" return the empty string.
Yeah, this is the only thing where I wasn't sure if it's the right way
to go. The reason why I chose to handle it that way is because
File::Spec->canonpath() will return "foo" for both "foo" and "./foo"
-- and I wasn't 100% sure whether *not* handling it the same will be
detrimental elsewhere.
I totally get that it's unexpected though, so I really wouldn't mind
changing it to one of the two alternatives you suggested. :P
(Addendum: Handling paths is really finicky, so in cases where I wasn't
100% convinced of something I decided to be conservative instead; this
is one such case. 😅)
>
> > +=head3 path_push($path, $other)
> > +
> > +Extends C<$path> with C<$other>, returning a new path.
> > +
> > +If C<$other> is absolute, it will be returned instead.
> > +
>
> Again, seems rather surprising to me to auto-magically do this.
See my elaboration further above.
>
> > +Throws an exception if any of the arguments is C<undef>.
> > +
> > +=cut
> > +
> > +sub path_push : prototype($$) {
> > + my ($path, $other) = @_;
> > +
> > + croak "\$path is undef" if !defined($path);
> > + croak "\$other is undef" if !defined($other);
> > +
> > + return $path if $other eq '';
> > + return $other if path_is_absolute($other);
> > +
> > + my $need_sep = $path ne '' && $path !~ m#/$#;
> > +
> > + $path .= "/" if $need_sep;
> > + $path .= $other;
> > +
> > + return $path;
> > +}
> > +
> > +=head3 path_pop($path)
> > +
> > +Alias for C<L<< path_parent()|/"path_parent($path)" >>>.
> > +
>
> Why have this alias? The name suggests it would behave like an inverse
> to path_push(), but it does not, because of the normalization of
> path_parent(). So I'd rather not have it or have it be a non-normalizing
> version (but maybe not worth it) to avoid surprises.
Could you please elaborate what you mean regarding the normalization of
path_parent()?
path_parent() (or path_pop()) will only normalize as much as it needs to
in order to remove the last path component and get to the one that
precedes that:
my $p = "foo/././bar/././baz/./.";
$p = PVE::Path::path_parent($p): # or path_pop()
# $p is now "foo/././bar"
path_push() does actually behave like an inverse if you include the
redundant stuff:
$p = PVE::Path::path_push($p, "././baz/./.")
# $p is now "foo/././bar/././baz/./." again
It's not a complete inverse, that's true (as in, it's not bijective ;P),
because you can't replicate the example above with a path like
"foo///bar///baz" since you can't push "///baz" onto "foo///bar"
("///baz" is treated as an absolute path).
I think a non-normalizing version wouldn't really be that nice to use;
one would have to constantly remember to call path_normalize()
beforehand to actually get the desired result (i.e. getting the parent
directory).
I can just remove the alias if you prefer; it's not equivalent to Rust's
PathBuf::pop() [pop] and if it's a source of confusion, we can probably
do without.
[pop]: https://doc.rust-lang.org/std/path/struct.PathBuf.html#method.pop
>
> > +=cut
> > +
> > +sub path_pop : prototype($) {
> > + my ($path) = @_;
> > + return path_parent($path);
> > +}
> > +
> > +=head3 path_file_name($path)
> > +
> > +Returns the last component of the given C<$path>, if it is a legal file name,
> > +or C<undef> otherwise.
> > +
> > +If C<$path> is an empty string, C</>, C<.> or ends with a C<..> component,
> > +there is no valid file name.
> > +
> > +B<Note:> This does not check whether the given C<$path> actually points to a
> > +file or a directory etc.
> > +
> > +Throws an exception if C<$path> is C<undef>.
> > +
> > +=cut
> > +
> > +sub path_file_name : prototype($) {
> > + my ($path) = @_;
> > +
> > + croak "\$path is undef" if !defined($path);
> > +
> > + my @components = path_components($path);
> > +
> > + if (!scalar(@components)) {
> > + return;
> > + }
> > +
> > + if (
> > + scalar(@components) == 1
> > + && ($components[0] eq '/' || $components[0] eq '.')
> > + ) {
>
> Style nit: this conditional fits well into 100 characters, so can be one
> line instead of four
Fair point, but I prefer to have it a little more readable here, and
it's still in line with our style guide. :P
>
> > + return;
> > + }
> > +
> > + if ($components[-1] eq '..') {
> > + return;
> > + }
> > +
> > + return $components[-1];
> > +}
> > +
> > +=head3 path_with_file_name($path, $file_name)
> > +
> > +Returns C<$path> with C<$file_name> as the new last component.
> > +
> > +This is essentially like calling C<L<< path_parent()|/"path_parent($path)" >>>
> > +and using C<L<< path_push()|/"path_push($path, $other)" >>> to append the new
> > +file name, but handles a few extra cases:
> > +
> > +=over
> > +
> > +=item * If C<$path> is C</>, appends C<$file_name>.
> > +
> > +=item * If C<$path> is an empty string, appends C<$file_name>.
> > +
> > +=item * If C<$path> ends with a parent directory reference (C<..>), replaces it
> > +with C<$file_name>.
>
> Hmm, why is that an "extra case" compared to calling path_parent() and
> path_push()? I.e. your implementation doesn't handle ".." specifically.
>
This is stated here for completeness's sake, but I guess it's not really
an extra case, yeah.
However, you made me check the exact details again though, and it seems
that Rust's PathBuf::set_file_name() [sfn] would actually behave
differently here; if a path ends with a parent dir reference, it would
actually just append the file name, not replace it :P
.. so, thanks for spotting this! I actually want to adapt this now 😄
[sfn]: https://doc.rust-lang.org/std/path/struct.PathBuf.html#method.set_file_name
> > +
> > +=back
> > +
> > +Throws an exception if any of the arguments is C<undef> or if C<$file_name>
> > +contains a path separator (C</>).
> > +
> > +=cut
> > +
> > +sub path_with_file_name : prototype($$) {
> > + my ($path, $file_name) = @_;
> > +
> > + croak "\$path is undef" if !defined($path);
> > + croak "\$file_name is undef" if !defined($file_name);
> > + croak "\$file_name contains a path separator: $file_name" if $file_name =~ m|/|;
> > +
> > + my $parent = path_parent($path);
> > +
> > + # undef means that $path was either "" or "/", so we can just append to it
> > + return ($path . $file_name) if !defined($parent);
> > +
> > + # Because the parent of "foo" is ".", return $file_name to stay consistent.
> > + # Otherwise, we'd end up with a current path ref prepended ("./$file_name")
> > + if ($parent eq '.' && $path !~ m|/|) {
> > + return $file_name;
> > + }
>
> See, you already need special handling because of that surprising
> behavior ;)
Well yeah, because I chose the behaviour of path_parent() to be the way
it is; of course I'm gonna ensure the other functions take that decision
into account too 😅
>
> > +
> > + return path_push($parent, $file_name);
> > +}
> > +
> > +my sub _path_file_prefix_suffix_str : prototype($) {
> > + my ($file_name) = @_;
> > +
> > + confess "\$file_name is undef" if !defined($file_name);
> > +
> > + confess "\$prefix not matched" if $file_name !~ m|^(\.*[^\.]*)(.*)|;
> > + my ($prefix, $suffix_str) = ($1, $2);
> > +
> > + return ($prefix, $suffix_str);
> > +}
> > +
> > +my sub _path_file_suffixes_from_str : prototype($) {
> > + my ($suffix_str) = @_;
> > +
> > + confess "\$suffix_str is undef" if !defined($suffix_str);
> > +
> > + # Suffixes are extracted "manually" because join()ing the result of split()
> > + # results in a different file name than the original. Let's say you have a
> > + # file named "foo.bar.". The correct suffixes would be ("bar", "").
> > + # With split, you get the following:
> > + # split(/\./, ".bar.") --> ("", "bar") --> join()ed to "foo..bar"
> > + # split(/\./, ".bar.", -1) --> ("", "bar", "") --> join()ed to "foo..bar."
>
> Couldn't you just shift away the first entry from the result?
Hm, I think so. I'll see how it behaves. Thanks for the suggestion!
>
> > + my @suffixes = ();
> > + while ($suffix_str =~ s|^(\.[^\.]*)||) {
> > + my $suffix = $1;
> > + $suffix =~ s|^\.||;
>
> Nit: Could also use substr() rather then regex replace.
Good catch, thanks!
>
> > + push(@suffixes, $suffix);
> > + }
> > +
> > + return @suffixes;
> > +}
> > +
> > +=head3 path_file_prefix($path)
> > +
> > +Returns the prefix of the file name of the given C<$path>. If the C<$path> does
> > +not have a valid file name and thus no prefix, C<undef> is returned instead.
> > +
> > +The prefix of a file name is the part before any extensions (suffixes).
> > +
> > + my $prefix = path_file_prefix("/etc/resolv.conf");
> > + # resolv
> > +
> > + my $prefix = path_file_prefix("/tmp/archive.tar.zst");
> > + # archive
> > +
> > +Throws an exception if C<$path> is C<undef>.
> > +
>
> Documentation should mention how dot files (e.g. .foo.txt) are treated.
>
Sure, I can add that.
FWIW in the case of ".foo.txt", the prefix would be ".foo".
> > +=cut
> > +
> > +sub path_file_prefix : prototype($) {
> > + my ($path) = @_;
> > +
> > + croak "\$path is undef" if !defined($path);
> > +
> > + my $file_name = path_file_name($path);
> > + return undef if !defined($file_name);
> > +
> > + my ($prefix, $suffix_str) = _path_file_prefix_suffix_str($file_name);
> > + return $prefix;
> > +}
> > +
> > +=head3 path_with_file_prefix($path, $prefix)
> > +
>
> Hmm, looking at this and path_with_file_suffix{,es}(), would it maybe be
> nicer to have a single
> path_with_file_name_from_parts($path, $prefix, @suffixes)
> function instead of these? Would seem more natural/straightforward to
> me. The implementations are rather involved IMHO compared to how useful
> the functions are. It's very easy to get the behavior for the (I suspect
> rather uncommon) case of where you want to replace only prefix or
> suffix(es) without already knowing the other. Just need to call
> path_file_parts() first. Or did you take inspiration from somewhere else
> for these?
I'll have to disagree here precisely because having to call
path_file_parts() first is IMO tedious from the caller's perspective. In
many cases I rarely care about all of the individual parts; instead I
just e.g. want to rename a file while keeping the suffix(es), or I want to
trim off the last suffix of something like "archive.tar.gz" because I'm
decompressing it to "archive.tar" or similar.
In other words, in a lot of cases I just want to perform one individual
operation, and the path_file_* and path_with_file_* functions represent
all those individual operations as concisely as possible. The burden of
"figuring out what to do" shouldn't be given to the caller in such
cases.
Regarding inspiration: These functions were inspired after having read a
related Rust GitHub issue [issue] where a lot of this was discussed as
well.
That issue also suggests having some kind of iterator for file parts,
that other methods of std::path::Path can be based on; this is actually
what led me to adding path_file_parts() below. You have access to the
individual parts there and can just join() them to the file name. That
would be very similar to the path_with_file_name_from_parts() function
you suggested. :P
[issue]: https://github.com/rust-lang/rust/issues/86319
>
> ---snip 8<---
>
> > +=head3 path_file_parts($path)
> > +
> > +Returns the parts that constitute the file name (prefix and suffixes) of a
> > +C<$path> as a list. If the C<$path> does not have a valid file name, an empty
> > +list is returned instead.
> > +
> > +These parts are split in such a way that allows them to be C<join>ed together,
> > +resulting in the original file name of the given C<$path> again.
> > +
> > + my @file_parts = path_file_parts("/etc/pve/firewall/cluster.fw");
> > + # ("cluster", "fw")
> > + my $file_name = join(".", @file_parts);
> > +
> > + my @file_parts = path_file_parts("/tmp/archive.tar.gz");
> > + # ("archive", "tar", "gz")
> > + my $file_name = join(".", @file_parts);
> > +
> > +Throws an exception if C<$path> is C<undef>.
> > +
>
> Should mention how dot files are treated.
Sure thing!
>
> > +=cut
> > +
> > +sub path_file_parts : prototype($) {
> > + my ($path) = @_;
> > +
> > + croak "\$path is undef" if !defined($path);
> > +
> > + my $file_name = path_file_name($path);
> > + if (!defined($file_name)) {
> > + return ();
> > + }
> > +
> > + my ($prefix, $suffix_str) = _path_file_prefix_suffix_str($file_name);
> > + my @suffixes = _path_file_suffixes_from_str($suffix_str);
> > +
> > + return ($prefix, @suffixes);
> > +}
> > +
> > +=head3 path_starts_with($path, $other_path)
> > +
> > +Checks whether a C<$path> starts with the components of C<$other_path>.
> > +
> > + my $starts_with = path_starts_with("/etc/pve/firewall/cluster.fw", "/etc/pve");
> > + # 1
> > +
> > +Throws an exception if any of the arguments is C<undef>.
> > +
>
> Should document the behavior for empty paths as well as the normalization.
ACK, will add that!
>
> > +=cut
> > +
> > +sub path_starts_with : prototype($$) {
> > + my ($path, $other_path) = @_;
> > +
> > + croak "\$path is undef" if !defined($path);
> > + croak "\$other_path if undef" if !defined($other_path);
> > +
> > + # Nothing starts with nothing
> > + return 1 if ($path eq '' && $other_path eq '');
> > +
> > + # Nothing cannot start with something
> > + # Something cannot start with nothing
> > + return if ($path eq '' || $other_path eq '');
> > +
> > + my @components = path_components($path);
> > + my @other_components = path_components($other_path);
> > +
> > + my @pairs = zip_shortest(\@components, \@other_components);
>
> Why not just shift off from both in a loop?
>
Sure, that's of course also possible; I just found zip_shortest more
expressive.
> > +
> > + # for my ($comp, $other_comp) (@pairs) is experimental
> > + for my $pair (@pairs) {
> > + my ($comp, $other_comp) = $pair->@*;
> > +
> > + if ($comp ne $other_comp) {
> > + return;
> > + }
> > + }
> > +
> > + return 1;
> > +}
> > +
> > +=head3 path_ends_with($path, $other_path)
> > +
> > +Checks whether a C<$path> ends with the components of C<$other_path>.
> > +
> > + my $ends_with = path_ends_with("/etc/pve/firewall/cluster.fw", "firewall/cluster.fw");
> > + # 1
> > +
> > +Throws an exception if any of the arguments is C<undef>.
> > +
>
> Should document the behavior for empty paths as well as the normalization.
>
Same as above; will do that.
> > +=cut
> > +
> > +sub path_ends_with : prototype($$) {
> > + my ($path, $other_path) = @_;
> > +
> > + croak "\$path is undef" if !defined($path);
> > + croak "\$other_path if undef" if !defined($other_path);
> > +
> > + # Nothing ends with nothing
> > + return 1 if ($path eq '' && $other_path eq '');
> > +
> > + # Nothing cannot end with something
> > + # Something cannot end with nothing
> > + return if ($path eq '' || $other_path eq '');
> > +
> > + my @components_rev = reverse(path_components($path));
> > + my @other_components_rev = reverse(path_components($other_path));
> > +
> > + my @pairs_rev = zip_shortest(\@components_rev, \@other_components_rev);
>
> Why not just shift off from both in a loop?
>
Same as above; just found zip_shortest more expressive.
> > +
> > + # for my ($comp, $other_comp) (@pairs_rev) is experimental
> > + for my $pair (@pairs_rev) {
> > + my ($comp, $other_comp) = $pair->@*;
> > +
> > + if ($comp ne $other_comp) {
> > + return;
> > + }
> > + }
> > +
> > + return 1;
> > +}
> > +
> > +=head3 path_equals($path, $other_path)
> > +
> > +Checks whether C<$path> equals C<$other_path>. The paths are compared
> > +by their components, meaning that it's not necessary to
> > +L<< normalize|/"path_normalize($path)" >> them beforehand.
> > +
> > +=cut
> > +
> > +sub path_equals : prototype($$) {
> > + my ($path, $other_path) = @_;
> > +
> > + croak "\$path is undef" if !defined($path);
> > + croak "\$other_path if undef" if !defined($other_path);
> > +
> > + # Nothing is nothing
> > + return 1 if ($path eq '' && $other_path eq '');
> > +
> > + # Nothing is not something
> > + # Something is not nothing
> > + return if ($path eq '' || $other_path eq '');
> > +
> > + my @components = path_components($path);
> > + my @other_components = path_components($other_path);
> > +
> > + return if scalar(@components) != scalar(@other_components);
> > +
> > + my @pairs = zip_longest(\@components, \@other_components);
>
> Why not just shift off from both in a loop?
>
Similar to above, I just found zip_longest more expressive.
> > +
> > + # for my ($comp, $other_comp) (@pairs_rev) is experimental
> > + for my $pair (@pairs) {
> > + my ($comp, $other_comp) = $pair->@*;
> > +
> > + return if !defined($comp) || !defined($other_comp);
> > +
> > + if ($comp ne $other_comp) {
> > + return;
> > + }
> > + }
> > +
> > + return 1;
> > +}
> > +
> > +1;
More information about the pve-devel
mailing list