While Common Lisp has functions to open, read and write files, none of them takes care of reading and returning the entire content. This is something that I do very regularly, so it made sense to add such a function to Tungsten. It turned out to be a bit more complicated than expected.
A simple but incorrect implementation
The simplest implementation relies on the FILE-LENGTH
function which returns
the length of a stream (which of course only makes sense for a file stream).
The
Hyperspec
clearly states that “for a binary file, the length is measured in units of the
element type of the stream”. Since we are only reading binary data, everything
is fine.
Let us write the function:
(defun read-file (path)
(declare (type (or pathname string) path))
(with-open-file (file path :element-type 'core:octet)
(let ((data (make-array (file-length file) :element-type 'core:octet)))
(read-sequence data file)
data)))
Note that CORE:OCTET
is a Tungsten type for (UNSIGNED-BYTE 8)
.
The function works as expected, returning the content of the file as an octet vector. But it is not entirely correct.
This implementation only works for regular files. Various files on UNIX will
report a length of zero but can still be read. Now you might protest that it
would not make sense to call READ-FILE
on a device such as /dev/urandom
,
and you would be right. But a valid example would be pseudo files such as
those part of procfs
. If
you want to obtain memory stats about your process on Linux, you can simply
read /proc/self/statm
. But this is not a regular file and READ-FILE
will
return an empty octet vector.
Doing it right and slow
The right way to read a file is to read its content block by block until the read operation fails because it reached the end of the file.
Let us re-write READ-FILE
:
(defun read-file (path)
(declare (type (or pathname string) path))
(let ((data (make-array 0 :element-type 'core:octet :adjustable t))
(block-size 4096)
(offset 0))
(with-open-file (file path :element-type 'core:octet)
(loop
(let* ((capacity (array-total-size data))
(nb-left (- capacity offset)))
(when (< nb-left block-size)
(let ((new-length (+ capacity (- block-size nb-left))))
(setf data (adjust-array data new-length)))))
(let ((end (read-sequence data file :start offset)))
(when (= end offset)
(return-from read-file (adjust-array data end)))
(setf offset end))))))
This time we rely on an adjustable array; we iterate, making sure we have
enough space in the array to read an entire block each time. When the array is
too short, we use ADJUST-ARRAY
to extend it, relying on its ability to reuse
the underlying storage instead of systematically copying its content.
Finally, once READ-SEQUENCE
stops returning data, we truncate the array to
the right size and return it.
This function worked correctly and I started using it regularly. Recently I
started working with a file larger than usual and realized that READ-FILE
was way too slow. With a NVMe drive, I would expect to be able to read a 10+MB
file almost instantaneously, but it took several seconds.
After inspecting the code to find what could be so slow, I started to wonder
about ADJUST-ARRAY
; while I thought SBCL would internally extend the
underlying memory in large blocks to minimize allocations, behaving similarly
to realloc()
in C, it turned out not to be the case. While reading the code
behind ADJUST-ARRAY
, I learned that it precisely allocates the required
size. As a result, this implementation of READ-FILE
performs one memory
allocation for each 4kB block. Not a problem for small files, slow for larger
ones.
A final version, correct and fast
Since I understood what the problem was, fixing it was trivial. When there is not enough space to read a block, we extend the array by at least 50% of its current size. Of course this is a balancing act: for example doubling the size at each allocation would reduce even more the number of allocations, but would increase the total amount of memory allocated. The choice is up to you.
(defun read-file (path)
(declare (type (or pathname string) path))
(let ((data (make-array 0 :element-type 'core:octet :adjustable t))
(block-size 4096)
(offset 0))
(with-open-file (file path :element-type 'core:octet)
(loop
(let* ((capacity (array-total-size data))
(nb-left (- capacity offset)))
(when (< nb-left block-size)
(let ((new-length (max (+ capacity (- block-size nb-left))
(floor (* capacity 3) 2))))
(setf data (adjust-array data new-length)))))
(let ((end (read-sequence data file :start offset)))
(when (= end offset)
(return-from read-file (adjust-array data end)))
(setf offset end))))))
This last version reads a 250MB file in a quarter of a second, while the original version took almost two minutes. Much better!