Nicolas Martyanoff – Brain dump About

Reading files faster in Common Lisp

While Common Lisp has functions to open, read and write files, none of them takes care of reading and returning the entire content. This is something that I do very regularly, so it made sense to add such a function to Tungsten. It turned out to be a bit more complicated than expected.

A simple but incorrect implementation

The simplest implementation relies on the FILE-LENGTH function which returns the length of a stream (which of course only makes sense for a file stream). The Hyperspec clearly states that “for a binary file, the length is measured in units of the element type of the stream”. Since we are only reading binary data, everything is fine.

Let us write the function:

(defun read-file (path)
  (declare (type (or pathname string) path))
  (with-open-file (file path :element-type 'core:octet)
    (let ((data (make-array (file-length file) :element-type 'core:octet)))
      (read-sequence data file)
      data)))

Note that CORE:OCTET is a Tungsten type for (UNSIGNED-BYTE 8).

The function works as expected, returning the content of the file as an octet vector. But it is not entirely correct.

This implementation only works for regular files. Various files on UNIX will report a length of zero but can still be read. Now you might protest that it would not make sense to call READ-FILE on a device such as /dev/urandom, and you would be right. But a valid example would be pseudo files such as those part of procfs. If you want to obtain memory stats about your process on Linux, you can simply read /proc/self/statm. But this is not a regular file and READ-FILE will return an empty octet vector.

Doing it right and slow

The right way to read a file is to read its content block by block until the read operation fails because it reached the end of the file.

Let us re-write READ-FILE:

(defun read-file (path)
  (declare (type (or pathname string) path))
  (let ((data (make-array 0 :element-type 'core:octet :adjustable t))
        (block-size 4096)
        (offset 0))
    (with-open-file (file path :element-type 'core:octet)
      (loop
        (let* ((capacity (array-total-size data))
               (nb-left (- capacity offset)))
          (when (< nb-left block-size)
            (let ((new-length (+ capacity (- block-size nb-left))))
              (setf data (adjust-array data new-length)))))
        (let ((end (read-sequence data file :start offset)))
          (when (= end offset)
            (return-from read-file (adjust-array data end)))
          (setf offset end))))))

This time we rely on an adjustable array; we iterate, making sure we have enough space in the array to read an entire block each time. When the array is too short, we use ADJUST-ARRAY to extend it, relying on its ability to reuse the underlying storage instead of systematically copying its content.

Finally, once READ-SEQUENCE stops returning data, we truncate the array to the right size and return it.

This function worked correctly and I started using it regularly. Recently I started working with a file larger than usual and realized that READ-FILE was way too slow. With a NVMe drive, I would expect to be able to read a 10+MB file almost instantaneously, but it took several seconds.

After inspecting the code to find what could be so slow, I started to wonder about ADJUST-ARRAY; while I thought SBCL would internally extend the underlying memory in large blocks to minimize allocations, behaving similarly to realloc() in C, it turned out not to be the case. While reading the code behind ADJUST-ARRAY, I learned that it precisely allocates the required size. As a result, this implementation of READ-FILE performs one memory allocation for each 4kB block. Not a problem for small files, slow for larger ones.

A final version, correct and fast

Since I understood what the problem was, fixing it was trivial. When there is not enough space to read a block, we extend the array by at least 50% of its current size. Of course this is a balancing act: for example doubling the size at each allocation would reduce even more the number of allocations, but would increase the total amount of memory allocated. The choice is up to you.

(defun read-file (path)
  (declare (type (or pathname string) path))
  (let ((data (make-array 0 :element-type 'core:octet :adjustable t))
        (block-size 4096)
        (offset 0))
    (with-open-file (file path :element-type 'core:octet)
      (loop
        (let* ((capacity (array-total-size data))
               (nb-left (- capacity offset)))
          (when (< nb-left block-size)
            (let ((new-length (max (+ capacity (- block-size nb-left))
                                   (floor (* capacity 3) 2))))
              (setf data (adjust-array data new-length)))))
        (let ((end (read-sequence data file :start offset)))
          (when (= end offset)
            (return-from read-file (adjust-array data end)))
          (setf offset end))))))

This last version reads a 250MB file in a quarter of a second, while the original version took almost two minutes. Much better!

Share the word!

Liked my article? Follow me on Twitter or on Mastodon to see what I'm up to.