nancyrs

Author	SHA1	Message	Date
Jacob Hinkle	dc172789db	Merge branch 'master' of github.com:jacobhinkle/nancy	2022-11-09 13:01:27 -05:00
Jacob Hinkle	bffef78291	First draft of record() command. So far, this can record content hashes, recorded times, filetypes, and symlink targets into a temp table. There is no detection of deleted files, and non-root paths are left dangling. Also, no directory content hashes are computed.
Currently I am using the sha256 of (parent sha256 + filename) as the key in filedir. This is pretty wasteful: the keys are 32 bytes each, and since every filedir entry except the root also stores its parent's key, that is 64 bytes per entry just in keys (with integer keys this would be just a few bytes). Maintaining a low probability of collision is important for a distributed system like this, where we envision importing datasets into possibly very large master databases. However, the sha256 chain itself starts from a 128-bit UUID for the root's parent, so there is already only that much collision avoidance (which is still very large).
Moving to UUIDs for keys is attractive for that 2x savings in space. The obvious candidate is v5 UUIDs, which are derived from a namespace UUID and a string. The string could be the relative path of each filedir entry. Alternatively, we could use the parent's UUID as the namespace and the filename as the string, similar to how we use those pieces of information now to compute the SHA256 hash.
Changing to UUID (v5) keys for filedir, in addition to the current UUID (v4) keys for filedir_version, suggests that perhaps we should switch to UUID (v5) for all our other keys. Tables with deterministic sha256 keys are:
- machine
- user
- filedir
- environment
- package
- module
- func
Each of these keys is derived from some bit of information that's typically also present in each row of the table. In other words, the sha256 is there as a convenient way to avoid using multi-column primary keys. But it may double or triple some of these table sizes, so we should really consider using a minimal hash, or in some cases other alternatives such as integer primary keys. I've avoided that in this rewrite so far since it complicates importing, but it's likely the most space-efficient.	2022-11-09 12:48:35 -05:00
Jacob Hinkle	e5fa7a32f2	Rename "store" to "dataset" in schema	2022-11-09 12:48:06 -05:00
Jacob Hinkle	7cb76da9d7	Add BSD-3 LICENSE	2022-11-09 10:09:57 -05:00
Jacob Hinkle	12b669d591	Run cargo fmt	2022-10-27 15:23:52 -04:00
Jacob Hinkle	feab22026d	Clean up interface to Program::perform_task()	2022-10-27 15:23:13 -04:00
Jacob Hinkle	88dd2bc220	Fix error handling and common path search in find_dataset_dir	2022-10-27 10:37:16 -04:00
Jacob Hinkle	a38bc78093	Start on find_dataset_dir. Format with cargo fmt	2022-10-26 15:00:39 -04:00
Jacob Hinkle	3556590f7b	Work out using rusqlite and migrations. This is a big commit where I learned how to do proper error tracking, including handling From properly and deriving it in some cases. The record subcommand is still not implemented, but it will be easier now that I have decided to use SQLite temp tables as my data structure. This means I can simply implement a few loops in the fs submodule to scan directories and dump entries into temp tables. When finished, I'll drop the tables. This is nice because SQLite already contains a very efficient B-tree implementation that we can use with indices on these temp tables. It also means we don't have to hold possibly millions of directory entries in memory, and most importantly, we don't have to figure out a bidirectional tree structure in Rust.	2022-10-26 12:37:47 -04:00
Jacob Hinkle	afb2fae01b	First working query of store_uuid	2022-10-14 12:53:33 -04:00
Jacob Hinkle	dc2edcf0a3	Add log/env_logger, empty store module	2022-10-13 12:55:27 -04:00
Jacob Hinkle	4cd3b71839	Initial skeleton with lib and cli	2022-10-13 10:16:10 -04:00

12 Commits
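
The key-size trade-off discussed in commit bffef78291 can be worked through with a little arithmetic. The sketch below is not code from the repository: the function names, the constants, and the one-million-row dataset size are hypothetical and chosen only for illustration. It assumes each filedir row stores its own key plus its parent's key, ignores the root (which has no parent key to derive from), and ignores index and row overhead.

```rust
// Hypothetical helper names; none of this is taken from the nancy codebase.

const SHA256_BYTES: usize = 32; // a sha256 digest is 256 bits
const UUID_BYTES: usize = 16; // a UUID is 128 bits
// Assumption: 8 bytes is an upper bound for an integer key. SQLite stores
// small integers in fewer bytes, so real integer keys can be smaller.
const INT64_BYTES: usize = 8;

/// Bytes spent on keys in one filedir row: its own key plus its parent's key.
const fn key_bytes_per_row(key_size: usize) -> usize {
    2 * key_size
}

/// Total key bytes for `rows` filedir entries (the parentless root is ignored).
const fn total_key_bytes(rows: usize, key_size: usize) -> usize {
    rows * key_bytes_per_row(key_size)
}

fn main() {
    let rows = 1_000_000; // hypothetical dataset size
    let schemes = [
        ("sha256", SHA256_BYTES),
        ("uuid", UUID_BYTES),
        ("int64", INT64_BYTES),
    ];
    for (name, size) in schemes {
        println!(
            "{name:>6}: {} bytes/row in keys, {} bytes total",
            key_bytes_per_row(size),
            total_key_bytes(rows, size)
        );
    }
}
```

Under these assumptions the sha256 scheme costs 64 bytes per row in keys and the UUID scheme 32, which is the 2x saving the commit message refers to.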