So far, this can record content hashes, recorded times, filetypes, and symlink targets into a temp table. There is no detection of deleted files, and non-root paths are left dangling. Also, no directory content hashes are computed.

Currently I am using the sha256 of (parent sha256 + filename) as the key in filedir. This is pretty wasteful: each key is 32 bytes, and since every filedir entry except the root also stores its parent's key, that is 64 bytes per entry in keys alone (with ints this would be just a few bytes instead).

Maintaining a low probability of collision is important for a distributed system like this, where we envision importing datasets into possibly very large master databases. However, the sha256 chain itself starts from a 128-bit UUID for the root's parent, so there is already only that much collision avoidance (which is already very large).

Moving to UUIDs for keys is attractive for that 2x savings in space. The obvious candidate would be v5 UUIDs, which are derived from a namespace UUID and a string. The string could be the relative path for each filedir entry. Alternatively, we could use the parent's UUID as the namespace and the filename as the string, similar to how we use those bits of information now to compute the SHA256 hash.

Changing to UUID (v5) keys for filedir, in addition to the current UUID (v4) keys for filedir_version, suggests that perhaps we should switch to UUID (v5) for all our other keys. Tables with deterministic sha256 keys are:

- machine
- user
- filedir
- environment
- package
- module
- func

Each of these keys is derived from some bit of information that's typically also present in each row of the table. In other words, the sha256 is there as a convenient way to avoid multi-column primary keys. But it may double or triple some of these table sizes, so we should really consider using a smaller hash, or in some cases other alternatives such as integer primary keys. I've avoided integer keys in this rewrite so far since they complicate importing, but they're likely the most space-efficient.
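
To put rough numbers on the key overhead described above, here is a back-of-the-envelope sketch. The row count is a made-up figure, the helper name `filedir_key_bytes` is hypothetical, and 8 bytes is used as an upper bound for an integer key (SQLite stores small integers in fewer bytes).

```rust
/// Bytes spent on keys in `filedir`, assuming every row stores its own key
/// plus its parent's key. (The root has no parent, so this overcounts by
/// one key; that is negligible for large tables.)
fn filedir_key_bytes(rows: u64, key_len: u64) -> u64 {
    rows * 2 * key_len
}

fn main() {
    // Hypothetical table size, chosen only for illustration.
    let rows: u64 = 10_000_000;

    let schemes: [(&str, u64); 3] = [
        ("SHA-256", 32),
        ("UUID", 16),
        ("integer (upper bound)", 8),
    ];

    for (scheme, key_len) in schemes {
        let mib = filedir_key_bytes(rows, key_len) as f64 / (1024.0 * 1024.0);
        println!("{:<22} {:>2} B/key {:>8.1} MiB in keys", scheme, key_len, mib);
    }
}
```

The ratio between the first two rows is the 2x saving mentioned above; it does not depend on the row count.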
TOML, 33 lines, 791 B
[package]
name = "nancy"
version = "0.1.0"
edition = "2021"
authors = ["Jacob Hinkle <jacob.hinkle@jhink.org>"]
description = "Composable provenance tracking for scientific data analysis"
repository = "https://git.jhink.org/jacob/nancy"
readme = "README.md"

# See more keys and their definitions at https://doc.rust-lang.org/cargo/reference/manifest.html

[lib]
name = "nancy"
path = "src/lib.rs"

[[bin]]
name = "nancy"
path = "src/main.rs"

[dependencies]
blake3 = "1.3.1"
clap = { version = "4.0.14", features = ["derive"] }
derive_more = "0.99.17"
env_logger = "0.9.1"
jwalk = "0.6.0"
log = "0.4.17"
once_cell = "1.15.0"
rayon = "1.5.3"
ring = "0.16.20"
rusqlite = { version = "0.28.0", features = ["uuid"] }
rusqlite_migration = "1.0.0"
uuid = { version = "1.2.1", features = ["v4"] }
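
The dependency list above enables only the `v4` feature of `uuid`. The v5 keys proposed in the commit message would presumably go through the crate's `Uuid::new_v5`, which needs the `v5` feature as well. To stay free of external crates, the sketch below instead spells out the RFC 4122 version-5 construction by hand with a small SHA-1, and chains it the way the message suggests: parent key as namespace, filename as name. The helper `filedir_key`, the root name `"/"`, and the root-parent value are all assumptions for illustration, not taken from the project.

```rust
/// Minimal SHA-1, included only so the sketch compiles without dependencies.
fn sha1(data: &[u8]) -> [u8; 20] {
    let mut h: [u32; 5] = [0x67452301, 0xEFCDAB89, 0x98BADCFE, 0x10325476, 0xC3D2E1F0];
    // Padding: 0x80, zeros up to 56 mod 64, then the bit length as big-endian u64.
    let mut msg = data.to_vec();
    msg.push(0x80);
    while msg.len() % 64 != 56 {
        msg.push(0);
    }
    msg.extend_from_slice(&((data.len() as u64) * 8).to_be_bytes());

    for chunk in msg.chunks(64) {
        let mut w = [0u32; 80];
        for i in 0..16 {
            let j = 4 * i;
            w[i] = u32::from_be_bytes([chunk[j], chunk[j + 1], chunk[j + 2], chunk[j + 3]]);
        }
        for i in 16..80 {
            w[i] = (w[i - 3] ^ w[i - 8] ^ w[i - 14] ^ w[i - 16]).rotate_left(1);
        }
        let (mut a, mut b, mut c, mut d, mut e) = (h[0], h[1], h[2], h[3], h[4]);
        for i in 0..80 {
            let (f, k): (u32, u32) = match i {
                0..=19 => ((b & c) | (!b & d), 0x5A827999),
                20..=39 => (b ^ c ^ d, 0x6ED9EBA1),
                40..=59 => ((b & c) | (b & d) | (c & d), 0x8F1BBCDC),
                _ => (b ^ c ^ d, 0xCA62C1D6),
            };
            let t = a
                .rotate_left(5)
                .wrapping_add(f)
                .wrapping_add(e)
                .wrapping_add(k)
                .wrapping_add(w[i]);
            e = d;
            d = c;
            c = b.rotate_left(30);
            b = a;
            a = t;
        }
        h[0] = h[0].wrapping_add(a);
        h[1] = h[1].wrapping_add(b);
        h[2] = h[2].wrapping_add(c);
        h[3] = h[3].wrapping_add(d);
        h[4] = h[4].wrapping_add(e);
    }
    let mut out = [0u8; 20];
    for (i, word) in h.iter().enumerate() {
        out[4 * i..4 * i + 4].copy_from_slice(&word.to_be_bytes());
    }
    out
}

/// RFC 4122 version-5 UUID: SHA-1 over the namespace bytes followed by the
/// name, truncated to 16 bytes, with version and variant bits overwritten.
fn uuid_v5(namespace: &[u8; 16], name: &[u8]) -> [u8; 16] {
    let mut input = namespace.to_vec();
    input.extend_from_slice(name);
    let digest = sha1(&input);
    let mut id = [0u8; 16];
    id.copy_from_slice(&digest[..16]);
    id[6] = (id[6] & 0x0f) | 0x50; // version 5
    id[8] = (id[8] & 0x3f) | 0x80; // RFC 4122 variant
    id
}

/// Hypothetical helper: key of a filedir entry, derived from its parent's key.
fn filedir_key(parent: &[u8; 16], filename: &str) -> [u8; 16] {
    uuid_v5(parent, filename.as_bytes())
}

fn to_hex(bytes: &[u8]) -> String {
    bytes.iter().map(|b| format!("{:02x}", b)).collect()
}

fn main() {
    // Made-up stand-in for the random UUID used as the root's parent.
    let root_parent = [0x11u8; 16];
    let root = filedir_key(&root_parent, "/");
    let dir = filedir_key(&root, "data");
    let file = filedir_key(&dir, "run1.csv");
    println!("root {}", to_hex(&root));
    println!("dir  {}", to_hex(&dir));
    println!("file {}", to_hex(&file));
}
```

Each key is 16 bytes and is fully determined by the chain of names leading back to the root's parent, so two machines scanning the same tree from the same starting UUID would arrive at the same keys. The alternative mentioned in the commit message, hashing the full relative path under a single namespace, would replace the chain with one `uuid_v5` call per entry.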