Nicolas Martyanoff – Brain dump

Working with Common Lisp pathnames

Common Lisp pathnames, used to represent file paths, have the reputation of being hard to work with. This article aims to change this unfair reputation while highlighting the occasional quirks along the way.

Filenames and file paths

The distinction between filenames and file paths is not always obvious. On POSIX systems, the filename is the name of the file, while a file path represents its absolute or relative location in the file system. This also means that all filenames are file paths, but not the other way around.

Common Lisp uses the term filename for objects which are either pathnames or namestrings, both being representations of file paths. We will try to avoid confusion by using the terms filenames, pathnames and namestrings when referring to Common Lisp concepts, and we will talk about file paths when referring to the language-agnostic file system concept.

Pathnames

Pathnames are an implementation-independent representation of file paths made of six components:

  • a host identifying either the file system or a logical host;
  • a device identifying the logical or physical device containing the file;
  • a directory representing an absolute or relative list of directory names;
  • a name;
  • a type, a value nowadays known as the file extension;
  • a version, because yes, file systems used to support file versioning.

While this representation might seem way too complicated (it originates from a time when the file system ecosystem was much richer), it is still suitable for modern file systems.

The make-pathname function is used to create pathnames and lets you specify all components. For example, the following call yields a pathname for the file path written /var/run/example.pid on POSIX systems:

(make-pathname :directory '(:absolute "var" "run") :name "example" :type "pid")

Common Lisp functions manipulating file paths of course accept pathnames, letting you keep the same convenient structured representation everywhere, only converting from/to a native representation at the boundaries of your program.
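
As a small illustration, the components of a pathname can be read back with the standard accessor functions. The variable name *PID-PATH* is only chosen for this example; the values in comments are what I would expect on a POSIX system.

```lisp
;; Example pathname; the variable name *PID-PATH* is ours.
(defparameter *pid-path*
  (make-pathname :directory '(:absolute "var" "run")
                 :name "example" :type "pid"))

;; Each component has its own standard accessor.
(pathname-directory *pid-path*) ; (:ABSOLUTE "var" "run")
(pathname-name *pid-path*)      ; "example"
(pathname-type *pid-path*)      ; "pid"
```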

Special characters

What happens when you construct a pathname with components containing separation characters, e.g. a directory name containing / on a POSIX system or a type containing .? According to Common Lisp 19.2.2.1.1, the behaviour is implementation-defined; but if the implementation accepts these component values it must handle quoting correctly.

For example:

  • CLISP rejects separator characters in component values, signaling a SIMPLE-ERROR condition.
  • CCL accepts them and quotes them when converting the pathname to a namestring. So (namestring (make-pathname :name "foo/bar" :type "a.b")) yields "foo\\/bar.a\\.b" on Linux.
  • SBCL accepts and quotes them but does not quote . in type components, yielding "foo\\/bar.a.b" for the example above.
  • ECL accepts them but fails to quote them when converting the pathname to a namestring.

One could wonder about which implementation, CCL or SBCL, is correct regarding the quoting of the . character in type strings on POSIX platforms. While everyone understands that / is special in file and directory names, . is debatable because POSIX does not mention the type extension in its definitions: foo.txt is the name of the file, not a combination of a name and a type. As such, I would argue that quoting and not quoting are both correct. And as you will realize when reading about namestrings further in this article, it is irrelevant since namestrings are not POSIX paths.

Note that whether ECL violates the standard or not is unclear since there is no character quoting for POSIX paths. In other words, there is no such thing as a directory named a/b, because it could not be referenced in a way different from a directory named b in a directory named a. This behaviour derives directly from POSIX systems treating paths as strings and not as structured objects.
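
To check how a given implementation behaves, a small probe along these lines can help. It is only a sketch: TRY-NAMESTRING is a hypothetical helper name, and the result depends entirely on the implementation being tested.

```lisp
(defun try-namestring (&rest components)
  "Build a pathname from COMPONENTS (MAKE-PATHNAME keyword arguments)
and return its namestring, or a description of the error if the
implementation rejects the component values."
  (handler-case
      (namestring (apply #'make-pathname components))
    (error (condition)
      (format nil "rejected: ~A" condition))))

;; Either a quoted (or unquoted) namestring, or a "rejected: ..." message.
(try-namestring :name "foo/bar" :type "a.b")
```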

Invalid characters

The Common Lisp standard mentions special characters but is silent on the subject of invalid characters. For example POSIX forbids null bytes in filenames. But since it is not a separation character, implementations are free to deal with it as they see fit.

When testing implementations with a pathname containing a null byte using (make-pathname :name (string (code-char 0))), CCL, SBCL and ECL accept it while CLISP signals an error mentioning an illegal argument.

I am not convinced by CLISP’s behaviour since null bytes are only invalid in POSIX paths, not in Common Lisp filenames, meaning that the error should occur when the pathname is internally converted to a format usable by the operating system.

Pathname component case

A rarely mentioned property of pathnames is the support for case conversion. MAKE-PATHNAME and functions returning pathname components (e.g. PATHNAME-TYPE) support a :CASE argument, either :COMMON or :LOCAL, indicating how to handle character case in strings.

With :LOCAL, which is the default value, these functions assume that component strings are already represented following the conventions of the underlying operating system. It also dictates that if the host only supports one character case, strings must be returned converted to this case.

With :COMMON, these functions will use the default (customary) case of the host if the string is provided all uppercase, and the opposite case if the string is provided all lowercase. Mixed case strings are not transformed.

These behaviours are not intuitive and made much more sense at a time where some file systems only supported one specific case. You should probably stay away from component case handling unless you really know what you are doing.
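
For the curious, here is a minimal sketch of the :COMMON convention. It assumes a host whose customary case is lowercase, which is what implementations usually assume on POSIX systems; in that case I would expect the first accessor call to return "readme" and the second "README", but results may differ elsewhere. The variable name *README* is only for illustration.

```lisp
;; All-uppercase strings passed with :CASE :COMMON stand for
;; "the customary case of the host".
(defparameter *readme*
  (make-pathname :name "README" :type "TXT" :case :common))

(pathname-name *readme*)               ; local representation (:LOCAL is the default)
(pathname-name *readme* :case :common) ; common representation
```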

On a personal note, as someone running Linux and FreeBSD, I am curious about the behaviour of various implementations on Windows and macOS since both NTFS and APFS are case insensitive by default.

Unspecific components

While all components can be null, some of them can be :UNSPECIFIC (which ones is implementation-defined). The only real use case for :UNSPECIFIC is to affect the behaviour of MERGE-PATHNAMES: if a component is null, the function uses the value of the component in the pathname passed as its second argument (the default pathname); if a component is :UNSPECIFIC, the function keeps :UNSPECIFIC in the resulting pathname.

For example:

(merge-pathnames (make-pathname :name "foo")
                 (make-pathname :type "txt"))

yields the "foo.txt" namestring, but

(merge-pathnames (make-pathname :name "foo" :type :unspecific)
                 (make-pathname :type "txt"))

yields "foo".

Unfortunately, since support for :UNSPECIFIC in a given component is implementation-defined, it cannot be relied upon, which makes it more interesting than useful.

Namestrings

Namestrings are another representation for file paths. While pathnames are structured objects, namestrings are just strings. The most important aspect of namestrings is that unless they are logical namestrings (something we will cover later), the way they represent paths is implementation-defined (cf. Common Lisp 19.1.1 Namestrings as Filenames). In other words the namestring for the file foo of type txt in directory data could be data/foo.txt. Or data\foo.txt. Or data|foo#txt. Or any other nonsensical representation. Fortunately implementations tend to act rationally and use a representation as similar as possible to the one of their host operating system.

One should always remember that even though namestrings look and feel like paths, they are still a representation of a Common Lisp pathname, meaning that they may or may not map to a valid native path. The most obvious example would be a pathname whose name is the null byte, created with (make-pathname :name (string (code-char 0))), whose namestring is a one-character string that has no valid native representation on modern operating systems.

Pathnames can be converted to namestrings using the NAMESTRING function, while namestrings can be parsed into pathnames with PARSE-NAMESTRING. The #P reader macro uses PARSE-NAMESTRING to read a pathname. As such, #P"/tmp/foo.txt" is identical to #.(parse-namestring '"/tmp/foo.txt").

Note that most functions dealing with files will accept a pathname designator, i.e. either a pathname, a namestring or a stream.
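
A quick way to get a feel for the conversion is to parse a namestring and look at the resulting components. This sketch assumes an implementation using a POSIX-like namestring syntax; since the syntax is implementation-defined, the components obtained elsewhere may be different. The variable name *PARSED* is only for illustration.

```lisp
;; Parse a (physical) namestring into a pathname.
(defparameter *parsed* (parse-namestring "data/foo.txt"))

(pathname-directory *parsed*) ; directory component as a list
(pathname-name *parsed*)      ; name component
(pathname-type *parsed*)      ; type component

;; And back to a string.
(namestring *parsed*)
```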

Native namestrings

An unfortunately missing feature from Common Lisp is the ability to parse native namestrings, i.e. paths that use the representation of the underlying operating system.

To understand why it is a problem, let us take *.txt, a perfectly valid filename at least on any POSIX platform. In Common Lisp, you can construct a pathname representing this filename with (make-pathname :name "*" :type "txt"). No problem whatsoever. However the "*.txt" namestring actually represents a pathname whose name component is :WILD. The standard does not define any namestring that is guaranteed to return this pathname when passed to PARSE-NAMESTRING.

As a result, when processing filenames coming from the external world (a command line argument, a list of paths in a document, etc.), you have no way to handle those that contain characters used by Common Lisp for wild components.

There is no standard way of solving this issue. Some implementations provide functions to parse native namestrings, e.g. SBCL with SB-EXT:PARSE-NATIVE-NAMESTRING or CCL with CCL:NATIVE-TO-PATHNAME.
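
A thin portability wrapper could look like the following sketch. PARSE-NATIVE-PATH is a hypothetical name; only the two implementation-specific functions mentioned above are used, and any other implementation gets an error.

```lisp
(defun parse-native-path (string)
  "Parse STRING as a native path, without interpreting wildcard characters.
Only SBCL and CCL are handled in this sketch."
  #+sbcl (sb-ext:parse-native-namestring string)
  #+ccl (ccl:native-to-pathname string)
  #-(or sbcl ccl)
  (error "No native namestring parser known for this implementation: ~S"
         string))

;; The name component is expected to be the string "*" and not :WILD.
(pathname-name (parse-native-path "*.txt"))
```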

Wildcards

Up to now pathnames may have looked like a slightly unusual representation for paths. But we are just getting started.

Pathnames can be wild, meaning that they contain one or more wild components. Wild components can match any value. All components can be made wild with the special value :WILD. Directory elements also support :WILD-INFERIORS, which matches any number of directory levels.

As such

(make-pathname :directory '(:absolute "tmp" :wild) :name "foo" :type :wild)

is equivalent to the /tmp/*/foo.* POSIX glob pattern, while

(make-pathname :directory '(:absolute "tmp" :wild-inferiors "data" :wild)
               :name :wild :type :wild)

is equivalent to /tmp/**/data/*/*.*.

Wild pathnames only really make sense for the DIRECTORY function which returns files matching a specific pathname.
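
Two standard functions are handy when experimenting with wild pathnames: WILD-PATHNAME-P tells whether a pathname (or one of its components) is wild, and PATHNAME-MATCH-P tests a pathname against a wild one. A short sketch, where *PATTERN* is just an example name; both functions return generalized booleans.

```lisp
(defparameter *pattern*
  (make-pathname :directory '(:absolute "tmp" :wild)
                 :name "foo" :type :wild))

(wild-pathname-p *pattern*)       ; true: at least one component is wild
(wild-pathname-p *pattern* :name) ; false: the name is the string "foo"

;; Does /tmp/a/foo.txt match the pattern?
(pathname-match-p (make-pathname :directory '(:absolute "tmp" "a")
                                 :name "foo" :type "txt")
                  *pattern*)      ; true
```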

Logical pathnames

So far we have talked about pathnames representing either paths to physical files or patterns of filenames. Logical pathnames go further and let you work with files in a location-independent way.

Logical pathnames are based on logical hosts, set as pathname host components. Logical pathnames can be passed around and manipulated as any other pathnames; when used to access files, they are translated to a physical pathname, i.e. a pathname referring to an actual file in the file system.

SBCL uses logical pathnames for source file locations. While SBCL is shipped with its source files, their actual location on disk depends on how the software was installed. Instead of manually merging pathnames with a base directory value everywhere, SBCL uses the SYS logical host to map all pathnames whose directory starts with SRC to the actual location on disk. For example on my machine:

(translate-logical-pathname "SYS:SRC;ASSEMBLY;MASTER.LISP")

yields #P"/usr/share/sbcl-source/src/assembly/master.lisp".

Another example would be CCL which maps pathnames with the HOME logical host to subpaths of the home directory of the user.

Note that logical hosts are global to the Common Lisp environment. While SYS is reserved for the implementation, all other hosts are free to use by anyone. To avoid collisions, it is a good idea to name hosts after their program or library.

Logical namestrings

Logical namestrings are implementation-independent, meaning that you can safely use them in your programs without wondering about how they will be interpreted. Their syntax, detailed in section 19.3.1 of the Common Lisp standard, is quite different from usual POSIX paths. The host is separated from the rest of the path by a colon character, and directory names are separated by semicolon characters.

For example "SOURCE:SERVER;LISTENER.LISP" is the logical namestring equivalent of the /server/listener.lisp POSIX path for the SOURCE logical host.

The astute reader will notice the use of uppercase characters in logical namestrings. It happens that the different parts of a logical namestring are defined as using uppercase characters, but that the implementation translates lowercase letters to uppercase letters when parsing the namestrings (cf. Common Lisp 19.3.1.1.7). We use the canonical uppercase representation for clarity.

Translation

Translation is controlled by a table that maps each logical host to a list of patterns (wild logical pathnames or namestrings) and their associated wild physical pathnames.

One can obtain the list of translations for a logical host with LOGICAL-PATHNAME-TRANSLATIONS and update it with (SETF LOGICAL-PATHNAME-TRANSLATIONS). Each translation is a list where the first element is a logical pathname or namestring (usually a wild pathname) and the second element is a pathname or namestring to translate into.

The translation process looks for the first entry that satisfies PATHNAME-MATCH-P, which is guaranteed to behave in a way consistent with DIRECTORY. When there is a match, the translation process replaces the corresponding patterns in each component.

And of course if translation results in a logical pathname, it will be recursively translated until a physical pathname is obtained.

A simple example would be the use of a logical host referring to a temporary directory. This lets a program manipulate temporary pathnames without having to know their actual physical location, the translation process being controlled in a single location.

(setf (logical-pathname-translations "tmp")
      (list (list (make-pathname :host "tmp"
                                 :directory '(:absolute :wild-inferiors)
                                 :name :wild :type :wild)
                  (make-pathname :directory '(:absolute "var" "tmp" :wild-inferiors)
                                 :name :wild :type :wild))))

or if we were to use namestrings:

(setf (logical-pathname-translations "tmp")
      '(("TMP:**;*.*.*" "/var/tmp/**/*.*")))

Translating pathnames or namestrings using the TMP logical host yields the expected results. For example (translate-logical-pathname "TMP:CACHE;DATA.TXT") yields #P"/var/tmp/cache/data.txt".
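
Since the first matching entry wins, more specific patterns should come before more general ones. The following sketch uses a hypothetical MYAPP host and made-up directories to illustrate the ordering. I would expect the first translation to end up under /var/cache/myapp/ and the second under /usr/share/myapp/, but the exact result (letter case included) depends on the implementation.

```lisp
;; Hypothetical host and directories, for illustration only.
(setf (logical-pathname-translations "MYAPP")
      '(("MYAPP:CACHE;**;*.*.*" "/var/cache/myapp/**/*.*")  ; specific entry first
        ("MYAPP:**;*.*.*"       "/usr/share/myapp/**/*.*"))) ; catch-all last

;; Matches the first entry.
(translate-logical-pathname "MYAPP:CACHE;INDEX.DAT")

;; Does not match the first entry, falls through to the second one.
(translate-logical-pathname "MYAPP:DOC;README.TXT")
```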

Caveats

While logical pathnames are an elegant abstraction, they are plagued by multiple issues that make them hard to use correctly and in a portable way.

Logical namestring components can only contain letters, digits and hyphens (or the * and ** sequences for wild namestrings). This limitation probably comes from a need to be compatible with all existing file systems, but it can be a showstopper if one needs to refer to files whose naming scheme is not controlled by the program.

Namestring parsing is confusing: calling PARSE-NAMESTRING on an invalid namestring (because it contains invalid characters or because the host is not a known logical host) will not fail. Instead the string will be parsed as a physical namestring, introducing silent bugs. The LOGICAL-PATHNAME function can be used to validate logical pathnames and namestrings.
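
A validation helper could be sketched as follows. LOGICAL-NAMESTRING-P is a hypothetical name, and the sketch assumes that the implementation signals an error from LOGICAL-PATHNAME when the string is not a valid logical namestring.

```lisp
(defun logical-namestring-p (string)
  "Return true if STRING can be converted to a logical pathname,
false if LOGICAL-PATHNAME signals an error."
  (handler-case
      (progn
        (logical-pathname string)
        t)
    (error ()
      nil)))

;; A physical namestring should be refused.
(logical-namestring-p "/tmp/foo.txt")
```

With the TMP host defined earlier, (logical-namestring-p "TMP:CACHE;DATA.TXT") should return true.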

The way translation converts between both pathname patterns is unclear. It is not specified by the Common Lisp standard. Debugging patterns can quickly become very frustrating, especially with implementations unable to produce quality error diagnostics.

Finally, the behaviour of logical pathnames with other functions is rarely obvious, leading to frustrating debugging sessions.

They nevertheless are a unique and helpful feature for very specific use cases.

Recipes

Resolving a path

Files are accessible through multiple paths. For example, on POSIX systems, foo/bar/baz.txt and foo/bar/../bar/baz.txt refer to the same file. If your operating system and file system support symbolic links, you can refer to the same physical file from multiple links, themselves being files.

It is sometimes useful to obtain the canonical path of a file. On POSIX systems, the realpath function serves this purpose. In Common Lisp, this canonical path is called a truename, and the TRUENAME function returns it.
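
TRUENAME signals a FILE-ERROR when the file cannot be found, so a small wrapper is often convenient. CANONICAL-PATH below is a hypothetical helper name; the standard PROBE-FILE function offers a similar behaviour, returning either the truename or NIL.

```lisp
(defun canonical-path (path)
  "Return the truename of PATH, or NIL if the file cannot be found."
  (handler-case
      (truename path)
    (file-error ()
      nil)))

;; The home directory should exist, so this is expected to return a pathname.
(canonical-path (user-homedir-pathname))
```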

Transforming paths

The :DEFAULTS option of MAKE-PATHNAME is useful to construct a pathname that is a variation of another pathname. When a component is not supplied to MAKE-PATHNAME, its value is taken from the pathname passed with :DEFAULTS; a component explicitly supplied as NIL stays null.

For example to create the pathname of a file in the same directory as another pathname:

(make-pathname :name "bar"
               :defaults (make-pathname :directory '(:absolute "tmp") :name "foo"))

Or to create a wild pathname matching the same file names but with any extension:

(make-pathname :type :wild
               :defaults (make-pathname :name "foo" :type "txt"))

Or to obtain a pathname for the directory of a file:

(make-pathname :name nil
               :defaults (make-pathname :directory '(:relative "a" "b" "c")
                                        :name "foo"))
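
The last example works because a component that is not supplied is filled from :DEFAULTS, while a component explicitly supplied as NIL is kept as is. A short sketch of the difference, with *BASE* being just an example name:

```lisp
(defparameter *base*
  (make-pathname :directory '(:absolute "tmp") :name "foo" :type "txt"))

;; :TYPE is not supplied, so it is taken from *BASE*.
(pathname-type (make-pathname :name "bar" :defaults *base*))           ; "txt"

;; :TYPE is explicitly NIL, so the resulting pathname has no type.
(pathname-type (make-pathname :name "bar" :type nil :defaults *base*)) ; NIL
```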

Joining two paths

Joining (or concatenating) two paths can be done with MERGE-PATHNAMES. In general calling (MERGE-PATHNAMES PATH1 PATH2) returns a new pathname whose components are taken either from PATH1 when they are not null, or from PATH2 when they are. As a special case, if the directory component of PATH1 is relative, the directory component of the resulting pathname is the directory of PATH2 followed by the directory of PATH1.

In other words

(merge-pathnames (make-pathname :directory '(:relative "x" "y"))
                 (make-pathname :directory '(:absolute "a" "b" "c")))

yields "/a/b/c/x/y/" but

(merge-pathnames (make-pathname :directory '(:absolute "x" "y"))
                 (make-pathname :directory '(:absolute "a" "b" "c")))

yields "/x/y/".

Finding files

The DIRECTORY function returns files matching a pathname, wild or not.

If the pathname is not wild, DIRECTORY returns a list of zero or one element depending on whether a file exists at this location or not.

If the pathname is wild, DIRECTORY behaves similarly to POSIX globs. Due to the way pathnames are structured, with the name and type being two different components, a common error is to specify a wild name without a type. In this case, DIRECTORY will not return any file with an extension (since their pathname has a non-null type). To match all files with any extension, set both the name and the type to :WILD.
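
As a sketch, a helper listing the content of a directory could set both components at once. LIST-FILES is a hypothetical name; whether subdirectories are part of the result varies between implementations.

```lisp
(defun list-files (directory-path)
  "Return the truenames of the entries of DIRECTORY-PATH,
matching any name and any type."
  (directory (make-pathname :name :wild :type :wild
                            :defaults directory-path)))

;; List the content of the home directory of the current user.
(list-files (user-homedir-pathname))
```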

Another interesting possibility is to only match directories. Directories are represented by pathnames with a non-null directory component and a null name component. Therefore to find all directories in /tmp (top-level only):

(directory (make-pathname :directory '(:absolute "tmp" :wild)))

Note that DIRECTORY returns truenames, i.e. pathnames representing the canonical location of the files. An unexpected consequence is that the function will resolve symlinks. Since the Common Lisp standard explicitly allows extra optional arguments, some implementations have a way to disable symlink resolving, e.g. SBCL with :RESOLVE-SYMLINKS or CCL with :FOLLOW-LINKS.
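
These extensions can be hidden behind reader conditionals. The following is only a sketch: LIST-FILES-NO-FOLLOW is a hypothetical name, and implementations other than SBCL and CCL simply fall back to the default behaviour.

```lisp
(defun list-files-no-follow (pattern)
  "Like DIRECTORY, but ask the implementation not to resolve symbolic
links when it offers a way to do so."
  #+sbcl (directory pattern :resolve-symlinks nil)
  #+ccl (directory pattern :follow-links nil)
  #-(or sbcl ccl) (directory pattern))
```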

Resolving tildes in paths

It is commonly believed that support for tilde characters in paths is a universal feature. It is not. Tilde prefixes are defined in POSIX in the context of the shell (cf. POSIX 2017 2.6.1 Tilde Expansion) and are only supported in very specific locations.

To obtain the path of a file relative to the home directory of the current user, use the USER-HOMEDIR-PATHNAME function.

For example:

(merge-pathnames (make-pathname :directory '(:relative ".emacs.d")
                                :name "init" :type "el")
                 (user-homedir-pathname))
