Releases: mar-file-system/GUFI

0.6.10

12 Dec 00:30

Documentation

  • NEW quickstart guide
  • Major LaTeX updates
  • Major man page updates

gufi_query

  • Added ability to only process directories with matching uid/gid
    • --dir-match-uid/--dir-match-gid
    • Use with --min-level
  • Disabled printing messages when errno is EACCES - use --print-eacces to re-enable

gufi_vt/gufi_vt_*

  • Run on remotes
    • gufi_vt: remote_cmd + remote_arg
    • gufi_vt_*: remote_cmd + remote_args
  • Added max_level to match with min_level
  • Query multiple indexroots
  • (gufi_vt only) added ability to only process directories with matching uid/gid
    • dir_match_uid/dir_match_gid
    • Use with min_level
  • Installed by client RPM

User Facing Scripts

  • -h for help has been removed from all scripts

  • Need Python 3.9 for argparse feature

  • gufi_find

    • Fixed -printf
    • Added -print and -print0
    • Added --compress
    • -ls columns are now padded
    • Fixed timestamps of entries that are more than 6 months old
  • gufi_ls

    • -l columns are now padded
    • Fixed timestamps of entries that are more than 6 months old
  • gufi_query.py

    • no longer automatically attaches config.indexroot

gufi_incremental_update

  • Removed suspect method 2

NEW gufi_top_info

  • Store extra data about indexes at the top of a GUFI tree
  • Administrator-only tool
  • Indexing now prints wall clock start and end time that can be used as extra data

gufi_treesummary_all

  • Added ability to do incremental update instead of recomputing everything per invocation

QueuePerThreadPool function signature updates by @cadegore (#185)

  • Now using QPTPool_ctx_t * instead of QPTPool_t *
  • Simplified interface: QPTPool_f now takes in 2 arguments instead of 4

make install now only installs server files

  • Use client RPM to install client files

SQLite3 removed incorrect cache configuration

  • Users should delete previous sqlite3 build and install, and rebuild

GitHub Actions

  • Removed CentOS 8
  • Removed rockylinux:8 and rockylinux:9
  • Added rockylinux/rockylinux:9 and rockylinux/rockylinux:10
  • Now compiling with -Wwrite-strings

0.6.9

22 Oct 22:03

NEW gufi_incremental_update (#177)

  • Given an old index and an updated source tree, update the index in-place without reindexing the entire source tree
    • Combines bfwreaddirplus2db.c and gitest.py into one executable
  • Removed bfwreaddirplus2db.c, tsmepoch2time.c, tsmtime2epoch.c, gitest.py, and gitestcomplex.py and related files

NEW parallel_find by @cadegore (#174)

  • Implemented --min-level, --max-level, and -t/--type-filter
  • Most find(1) -printf format specifiers are available via --format

NEW Example deployment diagram and description by @tbautis (#175)

NEW Initial GUFI OnDemand example website by @fluganator (#182)

NEW gufi_find_outliers

  • Searches for directory statistics that are more than 3 standard deviations away from the mean
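
As an illustration of the 3-standard-deviation rule, here is a minimal sketch in Python. It is not the gufi_find_outliers implementation: the function name, the use of the population standard deviation, and the sample file counts are all assumptions made for the example.

```python
import statistics

def find_outliers(values, num_stdevs=3.0):
    """Return values more than num_stdevs population standard deviations from the mean."""
    if len(values) < 2:
        return []
    mean = statistics.fmean(values)
    stdev = statistics.pstdev(values)
    if stdev == 0:
        # All values are identical, so nothing can be an outlier
        return []
    return [v for v in values if abs(v - mean) > num_stdevs * stdev]

# Hypothetical per-directory file counts: twenty ordinary directories and one huge one
file_counts = [10] * 20 + [1000]
outliers = find_outliers(file_counts)
print(outliers)
```

With a small number of directories no value can be 3 standard deviations away from the mean, so a check like this is only meaningful over reasonably large sets of directories.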

Argument Parser Update by @cadegore (#178)

  • Switched from using getopt(3) to getopt_long(3)

Distributed Processing

  • Can now switch out find(1) for other commands
    • Added find.sh as example and default script to run
    • Added parallel_find.sh as alternative to find.sh
  • Added ability to take in an existing path list before the paths are distributed to nodes
    • Taking in per-node path lists that have already been distributed was supported previously

Streaming Indexing

  • gufi_dir2trace can now write to stdout
    • Use - as the trace name
  • gufi_trace2index can now read from stdin
    • Use - as the trace name
  • Can run gufi_dir2trace ... /path/to/src - | gufi_trace2index ... - index to get index/src
    • The thread counts of the two commands do not have to match
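
The pipeline above can also be driven from a script. The following Python sketch only shows how the two commands connect through a pipe; the helper names are hypothetical, and the options elided as "..." in the notes are left for the caller to supply. The -n thread-count flag used in the example call is taken from the 0.6.6 notes and is otherwise an assumption here.

```python
import subprocess

def build_streaming_commands(src, index, dir2trace_opts, trace2index_opts):
    """Build the argv lists for gufi_dir2trace ... src - | gufi_trace2index ... - index."""
    # "-" as the trace name makes gufi_dir2trace write to stdout
    producer = ["gufi_dir2trace", *dir2trace_opts, src, "-"]
    # "-" as the trace name makes gufi_trace2index read from stdin
    consumer = ["gufi_trace2index", *trace2index_opts, "-", index]
    return producer, consumer

def run_streaming_index(src, index, dir2trace_opts, trace2index_opts):
    """Run both commands connected by a pipe and return their exit codes."""
    producer, consumer = build_streaming_commands(
        src, index, dir2trace_opts, trace2index_opts)
    with subprocess.Popen(producer, stdout=subprocess.PIPE) as p1:
        with subprocess.Popen(consumer, stdin=p1.stdout) as p2:
            # Drop our copy of the pipe so the producer sees SIGPIPE if the consumer exits
            p1.stdout.close()
            p2.wait()
        p1.wait()
    return p1.returncode, p2.returncode

# The two sides may use different thread counts
producer, consumer = build_streaming_commands(
    "/path/to/src", "index", ["-n", "4"], ["-n", "8"])
print(" ".join(producer), "|", " ".join(consumer))
```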

Miscellaneous

  • SQLite3 UDFs
    • Added stdev_from_parts for use with new tot* columns in summary and treesummary tables
    • rpath(sname, sroll) → rpath(sname, sroll [, name])
      • New optional 3rd argument appends the file/link name to the path so the caller does not have to do it manually
    • Added spath(sname, sroll [, name]) to handle source paths with rollups (requires -p)
  • Now calling sqlite3_initialize and sqlite3_shutdown in input_init and input_fini respectively
    • Fixes race condition in gufi_treesummary_all and gufi_index2dir caused by compiling SQLite3 without locking and then not calling an SQLite3 function (and auto-initializing SQLite3) at least once in main before starting threads
      • Now also building SQLite3 with SQLITE_OMIT_AUTOINIT
      • Delete dependencies and rebuild
  • name and nameto are no longer in struct input
    • Positional arguments are now executable-defined so that generic variable names don't have to be used
  • skipfile can now contain whitespace
  • Minor man page updates
  • Bundled Google Test source update
    • Requires C++17
    • Delete dependencies and rebuild
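
For the stdev_from_parts UDF listed under SQLite3 UDFs above, the underlying idea is that a standard deviation can be computed from running totals (sum of squares, sum, and count), and that those totals can simply be added across directories. The Python sketch below illustrates that idea only: the argument order and the choice of sample (n - 1) variance are assumptions, not taken from the GUFI source.

```python
import math
import statistics

def stdev_from_parts(sum_of_squares, total, count):
    """Sample standard deviation from running totals (hypothetical signature)."""
    if count < 2:
        return None
    variance = (sum_of_squares - total * total / count) / (count - 1)
    # Guard against tiny negative values caused by floating point rounding
    return math.sqrt(max(variance, 0.0))

# Hypothetical file sizes in two directories
dir_a = [2, 4, 4]
dir_b = [4, 5, 5, 7, 9]

# Each directory stores only (sum of squares, sum, count)
parts = [(sum(v * v for v in d), sum(d), len(d)) for d in (dir_a, dir_b)]

# The parts add column by column, so the combined value needs no per-file data
combined = stdev_from_parts(*(sum(column) for column in zip(*parts)))
expected = statistics.stdev(dir_a + dir_b)
print(combined, expected)
```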

GitHub Actions

  • Added AlmaLinux
  • macOS
    • Package updates
    • sqlite-vec and sqlite-lembed patches

0.6.8

21 Aug 17:33

NEW scripts/distributed

  • gufi_distributed.py provides a framework for allowing distributed processing of trees across nodes
  • Created scripts for distributing
    • gufi_dir2index, gufi_dir2trace, gufi_treesummary_all, gufi_rollup, gufi_unrollup, gufi_query
  • High Level Description
    • Use find -mindepth <level> -maxdepth <level> -type d -printf "%P\n" or equivalent to get directories at a given level
    • Distribute paths for nodes to process via per-node files
    • Process subtrees starting at level with -y <level> and -D <filename>
    • Process top of tree with -z (<level> - 1)
  • Distribute with ssh or sbatch
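
The first two steps of the high level description can be sketched in Python as follows. This is not gufi_distributed.py: the function names, the round-robin assignment, and the node.N file names are assumptions for illustration, and level is assumed to be at least 1.

```python
import os
import tempfile

def dirs_at_level(root, level):
    """Relative paths of directories exactly `level` levels below root."""
    found = []

    def walk(path, depth):
        if depth == level:
            found.append(os.path.relpath(path, root))
            return
        with os.scandir(path) as entries:
            for entry in entries:
                if entry.is_dir(follow_symlinks=False):
                    walk(entry.path, depth + 1)

    walk(root, 0)
    return sorted(found)

def distribute(paths, node_count):
    """Assign paths to nodes round-robin (one list per node)."""
    return [paths[i::node_count] for i in range(node_count)]

def write_node_files(groups, out_dir):
    """Write one path list per node and return the file names."""
    filenames = []
    for i, group in enumerate(groups):
        filename = os.path.join(out_dir, f"node.{i}")
        with open(filename, "w") as f:
            f.writelines(path + "\n" for path in group)
        filenames.append(filename)
    return filenames

with tempfile.TemporaryDirectory() as root:
    for sub in ("a/x", "a/y", "b/z", "c"):
        os.makedirs(os.path.join(root, sub))
    level2 = dirs_at_level(root, 2)
    groups = distribute(level2, 2)
    print(level2, groups)
```

Each per-node file would then be handed to one node, which processes the subtrees listed in it; the directories above the chosen level are handled separately.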

NEW Added gufi_du

  • Requires treesummary tables

NEW Indexing Plugin by @bertschinger (#170)

  • gufi_dir2index -U <plugin>
  • Initial Lustre plugin

statx(2)

  • If available, use it instead of lstat(2) to reduce traffic on parallel filesystems

CMake

  • Minimum version is now 3.19
    • Updated GitHub Actions Older CMake build
  • Variable changes to allow for more flexible installs
    • Removed BIN and LIB
    • Added SERVER_BIN, SERVER_LIB, SERVER_CONFIG
    • Added CLIENT_BIN, CLIENT_LIB, CLIENT_CONFIG
  • Fixed install paths of individual files

Python Scripts

  • Removed Python2
    • Removed testing in GitHub Actions
  • Server script shebangs have been changed to #!@Python3_EXECUTABLE@
    • Can be set at configure time with cmake -DPython3_EXECUTABLE=<path>
  • Disabled Python shebang mangling by rpmbuild

Miscellaneous

  • Moved struct stat and crtime back into struct work
  • vrpentries dtotfile → dtotfiles
  • QueuePerThreadPool
    • steal is now a function
    • swapping is now a compile time option
  • gufi_query
    • % Formatting has been moved into User Strings
    • Fixed descend handling d_type with value DT_UNKNOWN
    • -a now takes in an integer to switch between modes
  • NEW Basic longitudinal study scripts
    • contrib/stats/per-level.sh
      • Per-level data collection from filesystem trees
    • contrib/stats/process-per-level.py
      • Compute the difference between two user defined vectors selected from collected data
  • LaTeX documentation updates
  • manpage cleanup (no content updates)

0.6.7

01 May 22:53

NEW Added virtual tables allowing for indexes to be accessible as one giant table instead of many small ones

  • gufi_vt_* (gufi_vt_treesummary, gufi_vt_summary, gufi_vt_entries, gufi_vt_pentries, gufi_vt_vrsummary, and gufi_vt_vrpentries)
    • Fixed schemas
    • Can query directly
    • Testing with SQLAlchemy and PugSQL
      • GUFI can now be queried by tools that use SQLAlchemy
  • gufi_vt
    • User defined schema
    • Requires CREATE VIRTUAL TABLE before querying
    • Testing with gufi_sqlite3
  • Added -u flag to gufi_query to support virtual tables

NEW Added virtual table run_vt allowing for arbitrary commands to be run with popen(3) and the results used as a SQLite 3 table.

  • Requires CREATE VIRTUAL TABLE before querying

NEW UDFs for running arbitrary commands with popen(3) and getting stdout as a single SQL value

  • strop - return the first line of the output
  • intop - expects to find an integer at the start of the first line of output
  • blobop - return all of the output
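
The difference between the three UDFs can be mimicked in Python. The sketch below reuses the UDF names for readability but is only an approximation of the descriptions above, running commands through the shell on a POSIX system; returning None when no leading integer is found is an assumption.

```python
import re
import subprocess

def _run(cmd):
    """Run cmd through the shell and return its stdout as bytes."""
    return subprocess.run(cmd, shell=True, stdout=subprocess.PIPE, check=False).stdout

def strop(cmd):
    """First line of the output, as text."""
    lines = _run(cmd).decode().splitlines()
    return lines[0] if lines else ""

def intop(cmd):
    """Integer found at the start of the first line of output."""
    match = re.match(r"[+-]?\d+", strop(cmd))
    return int(match.group(0)) if match else None

def blobop(cmd):
    """All of the output, unmodified."""
    return _run(cmd)

cmd = r"printf '42 files\nsecond line\n'"
print(strop(cmd))
print(intop(cmd))
print(blobop(cmd))
```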

NEW Query Replacement in gufi_query -T, -S, and -E

  • % Formatting
    • Replace appearances of %n, %i, and %s with the current directory name, the current directory path, and the source prefix (requires -p), respectively
  • User Strings
    • Store SQL values using the setstr('key', value) function and retrieve them in a later query with {key}
    • Per-thread state
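
To make the two replacement mechanisms concrete, here is a small Python sketch of the substitution step. It is not how gufi_query implements it; the function name, the sample paths, and the dictionary standing in for values stored with setstr are all hypothetical.

```python
import re

def expand_query(sql, dir_name, dir_path, source_prefix, user_strings):
    """Expand %n, %i, %s and {key} placeholders in a single pass."""
    percent = {"n": dir_name, "i": dir_path, "s": source_prefix}

    def replace(match):
        if match.group(1):
            return percent[match.group(1)]
        # {key} is looked up in the strings stored earlier by the same thread
        return str(user_strings[match.group(2)])

    return re.sub(r"%([nis])|\{(\w+)\}", replace, sql)

# Hypothetical state: the value a previous query stored with setstr('limit', 10)
stored = {"limit": 10}
query = "SELECT '%i/' || name, {limit} FROM entries WHERE '%n' != '';"
expanded = expand_query(query, "docs", "/index/home/docs", "/home", stored)
print(expanded)
```

Doing both replacements in one pass means that a directory name which happens to contain braces is not expanded a second time.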

NEW AI Capabilities

querydbs

  • Changed from C to Python wrapper around gufi_sqlite3

Miscellaneous

  • C11 support is now required
  • CMake minimum version is now 3.16.0
    • Updated GitHub Actions Older CMake build
  • Completely removed -i and -t flags
  • SQLite 3 is now built with FTS5
    • Users should delete previous sqlite3 build and install, and rebuild
  • Updated sqlite3-pcre entry point to sqlite3_pcre2_init
    • Users should delete previous sqlite3-pcre build and install, and rebuild
  • Updated min/max level to be closer to how find(1) works
  • Removed test macros OPENDB, ADDQUERYFUNCS, SQL_EXEC
  • NEW thread_id UDF
  • Optional bash completion script install
    • enabled by default; Run cmake -DBASH_COMPLETION=Off to disable
  • LaTeX documentation updates
  • Previously installed jemalloc can now be used if found

GitHub Actions

  • Fixed macOS tests
    • Previously was building but not running tests
      • Regression test scripts updates
      • Removed macOS version of copyfd - now using generic version
  • Fixed cygwin build
    • Test source tree needed permissions explicitly set
  • Added Alpine Linux Edge
  • Added cmake --install and make install tests
  • Removed Ubuntu 20.04 build

0.6.6

19 Dec 20:48

Memory Usage

  • Changed struct work name to be dynamically allocated, but still contiguous within the struct
    • Address now points to space after struct work
      • This is similar to, but not the same as, a flexible array member to avoid needing to compile with C++ extensions enabled
  • Swap work items to storage if queue limit is hit
    • gufi_dir2index, gufi_dir2trace, gufi_trace2index, and gufi_query
    • use -M <bytes> to set queue_limit
      • <bytes> is divided across the number of threads (-n) and work size to produce queue_limit
    • use -s <prefix> to write swap files to a location that is not pwd
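
As a worked example of how -M relates to queue_limit, the sketch below assumes plain integer division and a made-up work item size; the exact formula and the real size of a work item are not stated in these notes.

```python
def queue_limit(total_bytes, threads, work_size):
    """Number of work items each thread may queue before swapping (assumed formula)."""
    if threads < 1 or work_size < 1:
        raise ValueError("threads and work_size must be positive")
    return total_bytes // threads // work_size

# 1 GiB split across 16 threads with a hypothetical 512 byte work item
limit = queue_limit(1 << 30, 16, 512)
print(limit)
```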

QueuePerThreadPool

  • Large amounts of code reorganization and separating out code into functions
  • API updates to support swapping, BottomUp updates, and to generally have better design

Miscellaneous

  • CMake minimum version is now 3.5.0
    • Updated GitHub Actions Older CMake build
  • Now using _at functions to reduce cost of path name resolution
  • struct input now has dynamically allocated values - call input_fini to clean up
  • descend
    • If struct dirent d_type is DT_UNKNOWN, call lstat(2)
    • Removed unnecessary arguments
  • Individual trace files are now scouted in parallel to reduce likelihood of work generation being a bottleneck
  • Now some executables dump VmHWM from /proc/self/status at the end
  • Updated version string to print consistently across C and Python
  • Split BottomUp code into multiple functions in case non-BottomUp functions need to be run with BottomUp
  • Removed PRINT_CUMULATIVE_TIMES, PRINT_PER_THREAD_STATS, and PRINT_QPTPOOL_QUEUE_SIZE
    • Performance History Framework
    • gnuplot scripts
    • GitHub Actions debug builds
  • Updated gpfs-scan-tool to compile with latest code

GitHub Actions

  • Removed macOS 12 and remnants of 13
  • Added macOS 15
  • Python 2 build now runs on CentOS 8

0.6.5

30 Jul 21:51

gufi_query

  • Amortize external database views creation when -Q is not provided
  • Amortize xattr views creation when -x is not passed in

gufi_rollup

  • Clear out old rollup data before copying new data in
    • Unified with gufi_unrollup SQL
  • treesummary is no longer copied upwards when rolling up
  • Index summary.inode to speed up queries
  • Fixed accidental modification of index during dry run

QueuePerThreadPool

  • Claimed work can now be stolen to prevent starvation caused by long running threads
  • If there is work that can be stolen, at least one work item will be taken even if the multiplier results in 0
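
The "at least one work item" rule can be sketched as below. This is an illustration rather than the QueuePerThreadPool code: the multiplier is modelled as a numerator/denominator pair, and stealing from the back of the victim's queue is an assumption.

```python
from collections import deque

def steal_count(available, numerator, denominator):
    """How many items to steal: the scaled amount, but never 0 when work exists."""
    if available == 0:
        return 0
    return max(1, available * numerator // denominator)

def steal(victim, thief, numerator, denominator):
    """Move items from the victim's queue to the thief's queue."""
    count = steal_count(len(victim), numerator, denominator)
    for _ in range(count):
        thief.append(victim.pop())
    return count

victim = deque(["dir1", "dir2", "dir3"])
thief = deque()
# 3 * 1 // 4 is 0, but one item is still taken
stolen = steal(victim, thief, 1, 4)
print(stolen, list(victim), list(thief))
```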

External Databases

  • Admins no longer have to know what files to track
  • Changed external databases to be set by users in per-directory files called external.gufi that list one path per line
    • Relative paths will be treated as relative to the source tree (not the current directory in the index)
    • Changed -q to check that external db files listed are valid
  • Now tracked in trace files (old trace files do not have to be changed)

contrib/gufi_sqlite -> src/gufi_sqlite3

  • Added printing results - previous usage did not require it

NEW gufi_index2dir

  • Convert an index into a source tree with file sizes of 0

NEW gufi_trace2dir

  • Convert trace files into a source tree with file sizes of 0

NEW parallel_cpr

  • Parallel cp -r

Misc

  • When descending a directory, if struct dirent d_type is not set, fall back to calling lstat(2)
  • Updated opendb behavior
  • Updated dupdir behavior
  • No longer replacing both search and prefix with prefix in regression test output

GitHub Actions

  • Restored code coverage report with codecov
  • Updated actions/checkout to v4
  • Updated actions/cache to v4
  • Added Rocky Linux 9

0.6.4

14 May 19:30

New: External User Databases

  • Allows for arbitrary user data to be attached to filesystem metadata and queried
    • Can be rolled up
  • gufi_dir2index/gufi_dir2trace -q
    • Added e type to trace file format - does not affect old trace files
  • gufi_query -I -Q
    • Added new views: esummary, epentries, exsummary, expentries, evrsummary, evrpentries, evrxsummary, evrxpentries
      • Always available, but will not be filled unless -Q is used.
    • Reorganized processdir to be easier to read
  • External user database count is tracked in treesummary
  • Removed attachname column from external_dbs

Extended Attributes

  • xattrs view and convenience views are now always available, but only filled when -x is passed to gufi_query

gufi_client now calls ssh with subprocess.Popen instead of paramiko

parallel_rmr top-level directory bug fix

Longitudinal Snapshot

  • More columns
  • Cache intermediate results
  • Allow for rolled up indexes
  • Different views
    • Graph (G)
    • Per Level (L)
    • Siblings (S)
    • Per Directory (D)

Dependencies

  • Updated sqlite3-pcre to use pcre2
    • Existing installs should delete the GUFI sqlite3-pcre build/install and rebuild
  • Removed paramiko tarball
  • Added new SQLite3 patch to increase attach limit to 254
    • Existing installs should delete the GUFI SQLite3 build/install and rebuild

GitHub Actions

  • Removed macOS 11 and 13
  • Added macOS 14
  • Added Ubuntu 24.04
  • Now uploading PDFs and RPMs to tagged releases
  • Added test on Windows with cygwin
  • Codecov actions update is causing issues, so changed to not error on upload failure

0.6.3

14 Feb 20:19

gufi_query

  • Input paths can now be symlinks
  • Immediate subdirectories of input paths can now be symlinks
  • gufi_query will get the realpath of the top-level input paths for traversal, but the custom SQLite functions path and rpath will print the current path with the original user provided prefix
    • fpath still prints the actual path
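
The prefix handling described above amounts to traversing the resolved path while reporting results under the prefix the user typed. A minimal Python sketch of that mapping follows; the function name and the example paths are hypothetical and this is not the path/rpath implementation.

```python
import os

def display_path(current_realpath, real_root, user_root):
    """Report a traversed path under the user provided prefix instead of the realpath."""
    relative = os.path.relpath(current_realpath, real_root)
    if relative == ".":
        return user_root
    return os.path.join(user_root, relative)

# The user passed the symlink "link", which resolves to /real/index
shown = display_path("/real/index/a/b", "/real/index", "link")
print(shown)
```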

Schema changes

  • Added ppinode to pentries
  • dmaxgidI → dmaxgid
  • inode INT64 → inode TEXT

New: contrib/longitudinal_snapshot.py

  • Takes a snapshot of an index tree and summarizes each directory's metadata (min, max, mean, median, histograms, etc. of file size, file count, timestamps, string lengths, etc.), and places the data into a single SQLite database file that is much smaller than the index itself
    • Recommend running gufi_treesummary_all before generating a snapshot
  • See Discussion in #149

New: contrib/treediff

  • Walks directory tree and prints top-most directory mismatches

More tests

  • Added empty directory to test tree
  • Added deploy test

GitHub Actions

  • macOS 11 → macOS 14
  • Now keeping pdf documentation as artifacts when building main branch for 14 days

0.6.2

20 Dec 20:09

Schema Changes

  • summary now has a 0 size count column called totzero
  • New views summarylong and vrsummarylong join summary and vrsummary with tables/views that contain additional data that should be associated with them but do not need to be added into the summary table. Currently, no extra information is attached.

${SEARCH} now contains an empty db.db to guarantee a db.db above all indexes under ${SEARCH}.

  • This can be expanded in the future to add information that is separate from the rest of the index tree.
  • Fixes #49

NEW: gufi_treesummary_all generates treesummary tables for all directories in an index instead of one directory at a time.

gufi_rollup now also generates treesummary tables while processing index

  • gufi_unrollup does not remove treesummary tables because there is no way to tell whether they were generated by gufi_rollup. A column recording which utility generated them might be added in the future.

gufi_stat → gufi_stat_bin

  • gufi_stat is now a script that calls gufi_stat_bin
  • Server configuration file now also needs the path to gufi_stat_bin

gufi_stats

  • average-leaf-files
  • average-leaf-links
  • average-leaf-size
  • median-leaf-links
  • median-leaf-size
  • filesize-log2-bins
  • filesize-log1024-bins
  • dirfilecount-log2-bins (#146)
  • dirfilecount-log1024-bins (#146)
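
For readers unfamiliar with log2 and log1024 binning, the Python sketch below shows one way to bucket values by floor(log2(x)) and floor(log1024(x)). The bin boundaries and the separate bucket for zero are assumptions for illustration; they are not taken from the gufi_stats implementation.

```python
from collections import Counter

def log2_bin(size):
    """floor(log2(size)) for size >= 1, computed exactly with integers."""
    return size.bit_length() - 1

def log1024_bin(size):
    """floor(log1024(size)) for size >= 1 (1024 is 2 to the 10th power)."""
    return log2_bin(size) // 10

def histogram(sizes, bin_func):
    """Count values per bin, keeping zero in its own bucket."""
    counts = Counter()
    for size in sizes:
        key = "zero" if size == 0 else bin_func(size)
        counts[key] += 1
    return dict(counts)

# Hypothetical file sizes in bytes
sizes = [0, 1, 1023, 1024, 4096, 1048576]
log2_hist = histogram(sizes, log2_bin)
log1024_hist = histogram(sizes, log1024_bin)
print(log2_hist)
print(log1024_hist)
```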

Scripts now have a --verbose/-V flag to print the command being run (#142)

bfwreaddirplus2db was reorganized.

NEW: split_trace splits trace files into chunks for parallel processing by gufi_trace2index

SQLite3

  • Updated from version 3.27 to version 3.43 to get built-in math functions

    • Existing indexes should be rebuilt
  • Also added math functions stdevs, stdevp, and median

  • Replaced subdirs_walked() with subdirs(srollsubdirs, sroll)

When printing result columns, the delimiter after the last column is no longer printed

  • Prevents pandas from unnecessarily generating a column of Nones when parsing output
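
The effect of the trailing delimiter is easy to reproduce with any delimiter-based parser. The sketch below uses Python's csv module instead of pandas to stay dependency free; the "|" delimiter and the sample rows are assumptions for the example.

```python
import csv
import io

def parse(output, delimiter="|"):
    """Split delimiter-separated output into rows of columns."""
    return list(csv.reader(io.StringIO(output), delimiter=delimiter))

# Old style: a delimiter after the last column yields an extra empty column
old_rows = parse("file1|1024|\n")
# New style: no trailing delimiter, so only the real columns are parsed
new_rows = parse("file1|1024\n")
print(old_rows, new_rows)
```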

Significant increase in testing and code coverage

CMake

  • db2, fuse, and gpfs tool building can be disabled even if the libraries are found
  • Added make pylint, make shellcheck, and make checkstyle
  • gufi_client_jail should not have been created
  • Example configuration files are no longer renamed to config.example

GitHub Actions

  • Now building on macOS 11, 12, and 13
  • Now building with -Wall -Wextra -Werror -pedantic

Added cygwin GCC support (not tested with CI)

0.6.1

30 May 15:37

Reduced size of struct work
Added optional work compression with zlib to gufi_dir2index, gufi_dir2trace, and gufi_query
Added in-situ processing of work items in descend function - after enqueuing n directories, the remaining immediate directories are processed in the parent thread instead of enqueued
gufi_query no longer requires at least one of -T, -S, or -E
Changed gufi_trace2index to read from file descriptors using pread(2) instead of FILE *s with getline(3)
Removed BENCHMARK macro
Documentation and test updates

QueuePerThreadPool

  • Added soft memory limit via deferred processing
    • If a thread's wait queue gets too big, new work items are placed in a different queue so they are not processed until the wait queue is empty
    • QPTPool_enqueue now returns whether the new work item was placed in the wait queue or in the deferred queue
  • QPTPool_init now only requires thread count and thread arguments to initialize
    • The other properties can be optionally set with setter functions
    • Previous QPTPool_init has been renamed to QPTPool_init_with_props
  • Symmetrical start up (QPTPool_init and QPTPool_start) and end (QPTPool_wait and QPTPool_destroy)
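
The deferred processing described in the first item can be pictured with two queues per thread. The Python sketch below is a single-threaded model of that behavior, not the QueuePerThreadPool code; the class name, method names, and string return values are hypothetical.

```python
from collections import deque

class ThreadQueues:
    """One thread's wait queue plus a deferred queue used once the soft limit is hit."""

    def __init__(self, queue_limit):
        self.queue_limit = queue_limit
        self.waiting = deque()
        self.deferred = deque()

    def enqueue(self, item):
        """Queue an item and report which queue it went to."""
        if len(self.waiting) >= self.queue_limit:
            self.deferred.append(item)
            return "deferred"
        self.waiting.append(item)
        return "waiting"

    def next_item(self):
        """Deferred items are only taken once the wait queue is empty."""
        if self.waiting:
            return self.waiting.popleft()
        if self.deferred:
            return self.deferred.popleft()
        return None

queues = ThreadQueues(queue_limit=2)
placed = [queues.enqueue(i) for i in range(4)]
order = [queues.next_item() for _ in range(5)]
print(placed, order)
```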

SQLite3

  • Renamed path() to rpath()
    • Returns full path properly for original and rolled up indexes
    • Use with new views vrsummary, vrpentries, vrxsummary, and vrxpentries
  • Restored path(), epath(), and fpath() functions
  • Removed alignment arguments from functions
  • Updated URI processing to replace percent characters

Renamed

  • bfti → gufi_treesummary
  • rollup → gufi_rollup
  • unrollup → gufi_unrollup

Performance History Framework

  • Added helper script that allows user to specify a range of commits and how many times to benchmark each commit
    • Downloads second copy of repo
  • Added support for collecting new/renamed/removed cumulative_times debug values for gufi_query in older commits
  • Plotting supports including or excluding commits without data
  • More documentation

Removed INSTALL, NOTES.txt, Makefiles, and bfmi
Added SQL guide
Added presentation from MSST 2023