Jacob Hinkle bffef78291 First draft of record() command.
So far, this records content hashes, recorded times, filetypes, and
symlink targets into a temp table. Deleted files are not detected,
non-root paths are left dangling, and no directory content hashes are
computed.

Currently I am using sha256 of parent sha256 + filename as the key in
filedir. This is pretty wasteful: each key is 32 bytes, and since each
filedir entry has a parent (except the root), that is 64 bytes per
entry in keys alone (integer keys would need only a few bytes).
Maintaining a low probability of collision is important for a
distributed system like this, where we envision importing datasets into
possibly very large master databases. However, the sha256 chain itself
starts from a 128-bit UUID for the root's parent, so the scheme already
has only that much collision avoidance (which is still very large).
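
A minimal sketch of the key scheme described above, to make the 64
bytes per entry concrete. The function name, the byte encoding
(parent digest followed by the UTF-8 filename), the seed UUID, and the
example filenames are all assumptions for illustration, not the actual
implementation.

```python
import hashlib
import uuid


def filedir_key(parent_key: bytes, filename: str) -> bytes:
    """Return sha256(parent_key + filename) as a 32-byte digest.

    The concatenation order and UTF-8 encoding are assumptions.
    """
    return hashlib.sha256(parent_key + filename.encode("utf-8")).digest()


# Hypothetical 128-bit UUID standing in for the root's parent.
root_parent = uuid.UUID("00000000-0000-0000-0000-000000000001").bytes

# Hypothetical names for the root entry and one child entry.
root_key = filedir_key(root_parent, "dataset")
child_key = filedir_key(root_key, "data.csv")

# A non-root row stores its own key and its parent's key.
key_bytes_per_row = len(child_key) + len(root_key)
```

Here key_bytes_per_row comes out to 64, while the only entropy seeding
the whole chain is the 16-byte UUID at the top.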

Moving to UUIDs for keys is attractive for that 2x savings in space; the
obvious candidate would be v5 UUIDs, which are derived from a UUID
namespace and a string. The string could be the relative path for each
filedir entry. Alternatively, we could use the parent's UUID as
namespace and the filename as the string, similar to how we use those
bits of information now to compute the SHA256 hash.
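
The two v5 options above can be sketched with Python's standard uuid
module. The function names, the namespace value, and the example path
are hypothetical; this only illustrates the two derivations and is not
a decision on which one to adopt.

```python
import uuid


def key_from_relpath(dataset_ns: uuid.UUID, relpath: str) -> uuid.UUID:
    """Option 1: one fixed namespace, relative path as the name."""
    return uuid.uuid5(dataset_ns, relpath)


def key_from_parent(parent_key: uuid.UUID, filename: str) -> uuid.UUID:
    """Option 2: the parent's UUID is the namespace, filename is the name."""
    return uuid.uuid5(parent_key, filename)


# Hypothetical namespace UUID for a dataset root.
dataset_ns = uuid.UUID("12345678-1234-5678-1234-567812345678")

# Option 1 needs only the namespace and the full relative path.
by_path = key_from_relpath(dataset_ns, "src/main.py")

# Option 2 walks the tree, deriving each key from its parent's key.
src_key = key_from_parent(dataset_ns, "src")
by_parent = key_from_parent(src_key, "main.py")
```

Either way each key is 16 bytes, so a row holding its own key and its
parent's key needs 32 bytes instead of 64. The two options generally
produce different keys for the same file, so they cannot be mixed.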

Changing to UUID (v5) keys for filedir, in addition to the current UUID
(v4) keys for filedir_version, suggests that perhaps we should switch to
UUID (v5) for all our other keys. Tables with deterministic sha256 keys
are:
 - machine
 - user
 - filedir
 - environment
 - package
 - module
 - func
Each of these keys is derived from some bit of information that's
typically also present in each row of the table. In other words, the
sha256 is there as a convenient way to avoid using multi-column primary
keys. But it may double or triple some of these table sizes, so we
should really consider using a minimal hash, or in some cases other
alternatives such as integer primary keys. I've avoided integer keys in
this rewrite so far since they complicate importing, but they are likely
the most space-efficient option.
2022-11-09 12:48:35 -05:00