12 Commits

Author SHA1 Message Date
Jacob Hinkle
dc172789db Merge branch 'master' of github.com:jacobhinkle/nancy 2022-11-09 13:01:27 -05:00
Jacob Hinkle
bffef78291 First draft of record() command.
So far, this can record content hashes and recorded times/filetypes, as
well as symlink targets into a temp table. There is no detection of
deleted files, and non-root paths are left dangling. Also, no directory
content hashes are computed.

Currently I am using sha256 of parent sha256 + filename as the key in
filedir. This is pretty wasteful as they are 32 bytes each, and since
each filedir entry has a parent (except the root), this is 64 bytes for
each entry just in keys (using ints this would be just a few bytes
instead). Maintaining a low probability of collision is important for a
distributed system like this where we envision importing datasets into
possibly very large master databases. However, the sha256 itself starts
from a 128-bit UUID for the root's parent, so collision avoidance is
already capped at that level (which is still very large).
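
To make the key-size comparison above concrete, here is a rough
back-of-the-envelope sketch in Rust. It counts only the two keys each
filedir entry carries (its own and its parent's) and ignores row and
index overhead. The 8-byte integer width is an assumption (an upper
bound; SQLite stores small integers in fewer bytes), and all names are
hypothetical rather than taken from the actual code.

```rust
/// Key widths in bytes. The sha256 and UUID widths are fixed; the integer
/// width is an assumed upper bound for a 64-bit integer key.
const SHA256_KEY_BYTES: u64 = 32;
const UUID_KEY_BYTES: u64 = 16;
const INT_KEY_BYTES: u64 = 8;

/// Bytes spent on keys alone for `entries` filedir rows, where each row
/// stores its own key plus its parent's key. (Hypothetical helper; the
/// root is treated like any other row for simplicity.)
fn key_storage_bytes(entries: u64, key_bytes: u64) -> u64 {
    entries * 2 * key_bytes
}
```

Under these assumptions, one million entries cost 64,000,000 bytes of
sha256 keys, 32,000,000 bytes of UUID keys, and at most 16,000,000 bytes
of integer keys.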

Moving to UUIDs for keys is attractive for that 2x savings in space; the
obvious candidate would be v5 UUIDs, which are derived from a UUID
namespace and a string. The string could be the relative path for each
filedir entry. Alternatively, we could use the parent's UUID as
namespace and the filename as the string, similar to how we use those
bits of information now to compute the SHA256 hash.
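
A minimal sketch of the second alternative (parent's UUID as namespace,
filename as the string) might look like the following. It is only an
illustration: SHA-1 is hand-rolled here purely to keep the example
dependency-free, whereas real code would more likely use a maintained
crate. The function names, the choice to hash filenames as UTF-8 bytes,
and the use of the root's parent UUID as the starting namespace are all
assumptions, not the project's actual implementation.

```rust
/// Minimal SHA-1 (FIPS 180-4), included only so the sketch needs no crates.
fn sha1(data: &[u8]) -> [u8; 20] {
    let mut h: [u32; 5] = [0x67452301, 0xEFCDAB89, 0x98BADCFE, 0x10325476, 0xC3D2E1F0];
    // Pad: 0x80, zeros up to 56 mod 64, then the bit length as a big-endian u64.
    let mut msg = data.to_vec();
    let bit_len = (data.len() as u64) * 8;
    msg.push(0x80);
    while msg.len() % 64 != 56 {
        msg.push(0);
    }
    msg.extend_from_slice(&bit_len.to_be_bytes());

    for chunk in msg.chunks(64) {
        // Expand the 16 input words into the 80-word message schedule.
        let mut w = [0u32; 80];
        for i in 0..16 {
            w[i] = u32::from_be_bytes([chunk[4 * i], chunk[4 * i + 1], chunk[4 * i + 2], chunk[4 * i + 3]]);
        }
        for i in 16..80 {
            w[i] = (w[i - 3] ^ w[i - 8] ^ w[i - 14] ^ w[i - 16]).rotate_left(1);
        }
        let (mut a, mut b, mut c, mut d, mut e) = (h[0], h[1], h[2], h[3], h[4]);
        for i in 0..80 {
            let (f, k) = match i {
                0..=19 => ((b & c) | (!b & d), 0x5A827999u32),
                20..=39 => (b ^ c ^ d, 0x6ED9EBA1),
                40..=59 => ((b & c) | (b & d) | (c & d), 0x8F1BBCDC),
                _ => (b ^ c ^ d, 0xCA62C1D6),
            };
            let temp = a
                .rotate_left(5)
                .wrapping_add(f)
                .wrapping_add(e)
                .wrapping_add(k)
                .wrapping_add(w[i]);
            e = d;
            d = c;
            c = b.rotate_left(30);
            b = a;
            a = temp;
        }
        h[0] = h[0].wrapping_add(a);
        h[1] = h[1].wrapping_add(b);
        h[2] = h[2].wrapping_add(c);
        h[3] = h[3].wrapping_add(d);
        h[4] = h[4].wrapping_add(e);
    }

    let mut out = [0u8; 20];
    for (i, word) in h.iter().enumerate() {
        out[4 * i..4 * i + 4].copy_from_slice(&word.to_be_bytes());
    }
    out
}

/// Name-based UUID (version 5, RFC 4122): SHA-1 over the namespace bytes
/// followed by the name, truncated to 16 bytes, with the version and
/// variant bits overwritten.
fn uuid_v5(namespace: &[u8; 16], name: &[u8]) -> [u8; 16] {
    let mut input = namespace.to_vec();
    input.extend_from_slice(name);
    let digest = sha1(&input);
    let mut id = [0u8; 16];
    id.copy_from_slice(&digest[..16]);
    id[6] = (id[6] & 0x0f) | 0x50; // version 5
    id[8] = (id[8] & 0x3f) | 0x80; // RFC 4122 variant
    id
}

/// Hypothetical chained key: start from the root's parent UUID and derive
/// each entry's key from its parent's key and its own filename.
fn filedir_key(root_parent: &[u8; 16], components: &[&str]) -> [u8; 16] {
    components
        .iter()
        .fold(*root_parent, |parent, name| uuid_v5(&parent, name.as_bytes()))
}
```

With this chaining, the key for `a/b` is `uuid_v5(uuid_v5(root_parent,
"a"), "b")`, so a key depends on the whole path from the root but each
row only needs its parent's 16-byte key to compute its own.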

Changing to UUID (v5) keys for filedir, in addition to the current UUID
(v4) keys for filedir_version, suggests that perhaps we should switch to
UUID (v5) for all our other keys. Tables with deterministic sha256 keys
are:
 - machine
 - user
 - filedir
 - environment
 - package
 - module
 - func
Each of these is derived from some bit of information that's typically
also present in each row of the table. In other words, the sha256 is
there as a convenient way to avoid using multi-column primary keys. But
it may double or triple some of these table sizes, so we should really
consider using a minimal hash, or in some cases we could consider other
alternatives such as integer primary keys. I've avoided that in this
rewrite so far since it complicates importing, but it's likely the most
space-efficient option.
2022-11-09 12:48:35 -05:00
Jacob Hinkle
e5fa7a32f2 Rename "store" to "dataset" in schema 2022-11-09 12:48:06 -05:00
Jacob Hinkle
7cb76da9d7 Add BSD-3 LICENSE 2022-11-09 10:09:57 -05:00
Jacob Hinkle
12b669d591 Run cargo fmt 2022-10-27 15:23:52 -04:00
Jacob Hinkle
feab22026d Clean up interface to Program::perform_task() 2022-10-27 15:23:13 -04:00
Jacob Hinkle
88dd2bc220 Fix error handling and common path search in find_dataset_dir 2022-10-27 10:37:16 -04:00
Jacob Hinkle
a38bc78093 Start on find_dataset_dir. Format with cargo fmt 2022-10-26 15:00:39 -04:00
Jacob Hinkle
3556590f7b Work out using rusqlite and migrations.
This is a big commit where I learned how to do proper error tracking,
including handling From properly, and deriving it in some cases. The
record subcommand still is not implemented but will be easier now that I
decided to use SQLite temp tables as my data structure. This means I can
simply implement a few loops in the fs submodule in order to scan
directories, and dump entries into temp tables. When finished, I'll drop
the tables. This is nice because SQLite already contains a very
efficient B-tree implementation that we can use with indices on these
temp tables. It also means we don't have to hold possibly millions of
directory entries in memory, and most importantly, we don't have to
figure out a bidirectional tree structure in Rust.
2022-10-26 12:37:47 -04:00
Jacob Hinkle
afb2fae01b First working query of store_uuid 2022-10-14 12:53:33 -04:00
dc2edcf0a3 Add log/env_logger, empty store module 2022-10-13 12:55:27 -04:00
4cd3b71839 Initial skeleton with lib and cli 2022-10-13 10:16:10 -04:00