Releases: mar-file-system/GUFI
0.6.10
Documentation
- NEW quickstart guide
- Major LaTeX updates
- Major man page updates
gufi_query
- Added ability to only process directories with matching uid/gid
  - --dir-match-uid/--dir-match-gid
  - Use with --min-level
- Disabled printing messages when errno is EACCES
  - Use --print-eacces to re-enable
gufi_vt/gufi_vt_*
- Run on remotes
  - gufi_vt: remote_cmd + remote_arg
  - gufi_vt_*: remote_cmd + remote_args
- Added max_level to match with min_level
- Query multiple indexroots
- (gufi_vt only) Added ability to only process directories with matching uid/gid
  - dir_match_uid/dir_match_gid
  - Use with min_level
- Installed by client RPM
User Facing Scripts
- -h for help has been removed from all scripts
- Need Python 3.9 for argparse feature
- gufi_find
  - Fixed -printf
  - Added -print and -print0
  - Added --compress
  - -ls columns are now padded
    - Fixed timestamps of entries that are more than 6 months old
- gufi_ls
  - -l columns are now padded
    - Fixed timestamps of entries that are more than 6 months old
- gufi_query.py
  - No longer automatically attaches config.indexroot
gufi_incremental_update
- Removed suspect method 2
NEW gufi_top_info
- Store extra data about indexes at the top of a GUFI tree
- Administrator-only tool
- Indexing now prints wall clock start and end time that can be used as extra data
gufi_treesummary_all
- Added ability to do incremental update instead of recomputing everything per invocation
QueuePerThreadPool function signature updates by @cadegore (#185)
- Now using QPTPool_ctx_t * instead of QPTPool_t *
- Simplified interface: QPTPool_f now takes in 2 arguments instead of 4
make install now only installs server files
- Use client RPM to install client files
SQLite3 removed incorrect cache configuration
- Users should delete previous sqlite3 build and install, and rebuild
GitHub Actions
- Removed CentOS 8
- Removed rockylinux:8 and rockylinux:9
- Added rockylinux/rockylinux:9 and rockylinux/rockylinux:10
- Now compiling with -Wwrite-strings
0.6.9
NEW gufi_incremental_update (#177)
- Given an old index and an updated source tree, update the index in-place without reindexing the entire source tree
  - Combines bfwreaddirplus2db.c and gitest.py into one executable
- Removed bfwreaddirplus2db.c, tsmepoch2time.c, tsmtime2epoch.c, gitest.py, and gitestcomplex.py and related files
NEW parallel_find by @cadegore (#174)
- Implemented --min-level, --max-level, and -t/--type-filter
- Most find(1) -printf format specifiers are available via --format
NEW Example deployment diagram and description by @tbautis (#175)
NEW Initial GUFI OnDemand example website by @fluganator (#182)
NEW gufi_find_outliers
- Searches for directory statistics that are more than 3 standard deviations away from the mean
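The 3-standard-deviation rule can be illustrated with a small sketch. This is not gufi_find_outliers itself: the function name find_outliers, the sample values, and the choice of the population standard deviation are all assumptions made for illustration.

```python
import statistics

def find_outliers(values, threshold=3.0):
    """Return the values more than `threshold` standard deviations from the mean."""
    mean = statistics.fmean(values)
    stdev = statistics.pstdev(values)  # population stdev: an assumption
    if stdev == 0:
        return []  # all values are identical, so nothing stands out
    return [v for v in values if abs(v - mean) > threshold * stdev]

# Hypothetical per-directory file counts: twenty ordinary directories and one huge one
file_counts = [10] * 20 + [1000]
outliers = find_outliers(file_counts)
print(outliers)
```

With very few samples no value can be 3 standard deviations from the mean, so a rule like this is only meaningful once enough directories are being compared.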
Argument Parser Update by @cadegore (#178)
- Switched from using getopt(3) to getopt_long(3)
Distributed Processing
- Can now switch out find(1) for other commands
  - Added find.sh as example and default script to run
  - Added parallel_find.sh as alternative to find.sh
- Added ability to take in an existing path list before the paths are distributed to nodes
- The ability to take in already distributed per-node path lists already existed
Streaming Indexing
- gufi_dir2trace can now write to stdout
  - Use - as the trace name
- gufi_trace2index can now read from stdin
  - Use - as the trace name
- Can run gufi_dir2trace ... /path/to/src - | gufi_trace2index ... - index to get index/src
  - Thread counts do not have to match
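The same kind of pipe can be set up from a script. The sketch below is a generic producer/consumer pipeline built with subprocess; run_pipeline is a hypothetical helper, and the two stand-in commands are placeholders for the gufi_dir2trace and gufi_trace2index command lines, whose remaining arguments are elided in the notes above and are therefore left to the caller.

```python
import subprocess
import sys

def run_pipeline(producer_argv, consumer_argv):
    """Connect the producer's stdout to the consumer's stdin, like `producer | consumer`."""
    producer = subprocess.Popen(producer_argv, stdout=subprocess.PIPE)
    consumer = subprocess.Popen(consumer_argv, stdin=producer.stdout,
                                stdout=subprocess.PIPE)
    producer.stdout.close()  # lets the producer get SIGPIPE if the consumer exits early
    output, _ = consumer.communicate()
    producer.wait()
    return producer.returncode, consumer.returncode, output

# Stand-in commands (assumptions): one writes a line to stdout, the other
# reads stdin and upper-cases it. Real usage would pass the two GUFI
# command lines, each with "-" as the trace name.
producer_cmd = [sys.executable, "-c", "print('trace line')"]
consumer_cmd = [sys.executable, "-c",
                "import sys; sys.stdout.write(sys.stdin.read().upper())"]
result = run_pipeline(producer_cmd, consumer_cmd)
print(result)
```

Checking both return codes matters here because a failure in the producer does not by itself make the consumer fail.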
Miscellaneous
- SQLite3 UDFs
  - Added stdev_from_parts for use with new tot* columns in summary and treesummary tables
  - rpath(sname, sroll) → rpath(sname, sroll [, name])
    - New optional 3rd argument appends the file/link name to the path so the caller does not have to do it manually
  - Added spath(sname, sroll [, name]) to handle source paths with rollups (requires -p)
- Now calling sqlite3_initialize and sqlite3_shutdown in input_init and input_fini respectively
  - Fixes race condition in gufi_treesummary_all and gufi_index2dir caused by compiling SQLite3 without locking and then not calling an SQLite3 function (and auto-initializing SQLite3) at least once in main before starting threads
    - Now also building SQLite3 with SQLITE_OMIT_AUTOINIT
    - Delete dependencies and rebuild
- name and nameto are no longer in struct input
  - Positional arguments are now executable-defined so that generic variable names don't have to be used
- skipfile can now contain whitespace
- Minor man page updates
- Bundled Google Test source update
- Requires C++17
- Delete dependencies and rebuild
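The notes do not give the signature of stdev_from_parts, so the following is only a sketch of the general idea behind computing a standard deviation from stored totals instead of from the raw values. The function name stdev_from_sums, its argument order, and the use of the sample (n - 1) form are assumptions.

```python
import math

def stdev_from_sums(sum_of_squares, total, count):
    """Sample standard deviation from a sum of squares, a sum, and a count."""
    if count < 2:
        return None  # undefined for fewer than two samples
    variance = (sum_of_squares - total * total / count) / (count - 1)
    return math.sqrt(max(variance, 0.0))  # clamp tiny negative rounding error

# Hypothetical file sizes; only the three totals are needed afterwards
sizes = [2, 4, 4, 4, 5, 5, 7, 9]
parts = (sum(s * s for s in sizes), sum(sizes), len(sizes))
print(stdev_from_sums(*parts))
```

Because sums, sums of squares, and counts can each be added together, totals of this kind can be combined across directories without revisiting individual entries.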
GitHub Actions
- Added AlmaLinux
- macOS
- Package updates
- sqlite-vec and sqlite-lembed patches
0.6.8
NEW scripts/distributed
- gufi_distributed.py provides a framework for allowing distributed processing of trees across nodes
  - Created scripts for distributing gufi_dir2index, gufi_dir2trace, gufi_treesummary_all, gufi_rollup, gufi_unrollup, gufi_query
- High Level Description
  - Use find -mindepth <level> -maxdepth <level> -type d -printf "%P\n" or equivalent to get directories at a given level
  - Distribute paths for nodes to process via per-node files
  - Process subtrees starting at level with -y <level> and -D <filename>
  - Process top of tree with -z (<level> - 1)
- Distribute with ssh or sbatch
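The first two steps of the high level description (collect the directories at one level, then split them into per-node lists) can be sketched as follows. This is not gufi_distributed.py: dirs_at_level and split_round_robin are hypothetical helpers, and round-robin assignment is an assumption about how paths might be divided.

```python
import os
import tempfile

def dirs_at_level(root, level):
    """Relative paths of the directories exactly `level` levels below `root`."""
    found = []

    def walk(path, depth):
        if depth == level:
            found.append(os.path.relpath(path, root))
            return
        with os.scandir(path) as entries:
            for entry in sorted(entries, key=lambda e: e.name):
                if entry.is_dir(follow_symlinks=False):
                    walk(entry.path, depth + 1)

    walk(root, 0)
    return found

def split_round_robin(paths, node_count):
    """Assign paths to nodes in round-robin order, one list per node."""
    return [paths[i::node_count] for i in range(node_count)]

# Demo on a throwaway tree whose level-2 directories are a/x, a/y, and b/z
with tempfile.TemporaryDirectory() as root:
    for parts in (("a", "x"), ("a", "y"), ("b", "z")):
        os.makedirs(os.path.join(root, *parts))
    level2 = dirs_at_level(root, 2)
    per_node = split_round_robin(level2, 2)
print(per_node)
```

Each inner list would then be written to its own per-node file, and everything above the chosen level would be handled separately as the top of the tree.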
NEW Added gufi_du
- Requires treesummary tables
NEW Indexing Plugin by @bertschinger (#170)
- gufi_dir2index -U <plugin>
  - Initial Lustre plugin
statx(2)
- If available, use it instead of lstat(2) to reduce traffic on parallel filesystems
CMake
- Minimum version is now 3.19
- Updated GitHub Actions Older CMake build
- Variable changes to allow for more flexible installs
  - Removed BIN and LIB
  - Added SERVER_BIN, SERVER_LIB, SERVER_CONFIG
  - Added CLIENT_BIN, CLIENT_LIB, CLIENT_CONFIG
- Fixed install paths of individual files
Python Scripts
- Removed Python2
- Removed testing in GitHub Actions
- Server script shebangs have been changed to #!@Python3_EXECUTABLE@
  - Can be set at configure time with cmake -DPython3_EXECUTABLE=<path>
- Disabled Python shebang mangling by rpmbuild
Miscellaneous
- Moved struct stat and crtime back into struct work
- vrpentries dtotfile → dtotfiles
- QueuePerThreadPool
  - steal is now a function
  - swapping is now a compile time option
- gufi_query
  - % Formatting has been moved into User Strings
  - Fixed descend handling d_type with value DT_UNKNOWN
  - -a now takes in an integer to switch between modes
- NEW Basic longitudinal study scripts
  - contrib/stats/per-level.sh
    - Per-level data collection from filesystem trees
  - contrib/stats/process-per-level.py
    - Compute the difference between two user defined vectors selected from collected data
- LaTeX documentation updates
- manpage cleanup (no content updates)
0.6.7
NEW Added virtual tables allowing for indexes to be accessible as one giant table instead of many small ones
- gufi_vt_* (gufi_vt_treesummary, gufi_vt_summary, gufi_vt_entries, gufi_vt_pentries, gufi_vt_vrsummary, and gufi_vt_vrpentries)
  - Fixed schemas
  - Can query directly
  - Testing with SQLAlchemy and PugSQL
    - GUFI can now be queried by tools that use SQLAlchemy
- gufi_vt
  - User defined schema
  - Requires CREATE VIRTUAL TABLE before querying
  - Testing with gufi_sqlite3
- Added -u flag to gufi_query to support virtual tables
NEW Added virtual table run_vt allowing for arbitrary commands to be run with popen(3) and the results used as a SQLite 3 table.
- Requires CREATE VIRTUAL TABLE before querying
NEW UDFs for running arbitrary commands with popen(3) and get stdout as a single SQL value
- strop - returns the first line of the output
- intop - expects to find an integer at the start of the first line of output
- blobop - returns all of the output
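To give a feel for what command-running UDFs look like from the SQL side, here is a rough Python analogue registered on an in-memory database. GUFI implements these functions in C with popen(3); the Python versions below only borrow the names, and their handling of edge cases (no output, no leading integer) is an assumption.

```python
import re
import sqlite3
import subprocess

def _run(cmd):
    """Run a shell command and return everything it wrote to stdout as bytes."""
    return subprocess.run(cmd, shell=True, stdout=subprocess.PIPE, check=False).stdout

def strop(cmd):
    """First line of the command's output, or NULL if there was none."""
    lines = _run(cmd).decode().splitlines()
    return lines[0] if lines else None

def intop(cmd):
    """Integer found at the start of the first line of output, or NULL."""
    first = strop(cmd)
    match = re.match(r"\s*[-+]?\d+", first) if first is not None else None
    return int(match.group()) if match else None

def blobop(cmd):
    """All of the command's output as a BLOB."""
    return _run(cmd)

conn = sqlite3.connect(":memory:")
for name, func in (("strop", strop), ("intop", intop), ("blobop", blobop)):
    conn.create_function(name, 1, func)

row = conn.execute(
    "SELECT strop('echo hello'), intop('echo 42 files'), blobop('echo hi')"
).fetchone()
print(row)
```

Functions like these execute whatever command string the query supplies, so queries using them should only come from trusted users.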
NEW Query Replacement in gufi_query -T, -S, and -E
- % Formatting
  - Replace appearances of %n, %i, and %s with the current directory name, the current directory path, and the source prefix (requires -p), respectively
- User Strings
  - Store SQL values using the setstr('key', value) function and retrieve them in a later query with {key}
  - Per-thread state
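The two kinds of replacement can be pictured as a single pass over the query text. The sketch below is not gufi_query's implementation: expand_query is a hypothetical function, the dictionary stands in for values stored earlier with setstr, and leaving unknown placeholders untouched is an assumption.

```python
def expand_query(sql, dirname, dirpath, source_prefix, user_strings):
    """Expand %n, %i, %s and {key} placeholders in one left-to-right pass."""
    percent = {"n": dirname, "i": dirpath, "s": source_prefix}
    out = []
    i = 0
    while i < len(sql):
        ch = sql[i]
        if ch == "%" and i + 1 < len(sql) and sql[i + 1] in percent:
            out.append(percent[sql[i + 1]])
            i += 2
            continue
        if ch == "{":
            end = sql.find("}", i)
            key = sql[i + 1:end] if end != -1 else None
            if key in user_strings:
                out.append(str(user_strings[key]))
                i = end + 1
                continue
        out.append(ch)  # ordinary character or unknown placeholder: keep as-is
        i += 1
    return "".join(out)

# Hypothetical values for one directory and one stored user string
expanded = expand_query(
    "SELECT '%n', '%i', '%s' LIMIT {limit};",
    "dir", "/index/dir", "/source", {"limit": 5},
)
print(expanded)
```

Doing the substitution in one pass means replacement text is never rescanned, so a directory name that happens to contain %s or {key} is not expanded a second time.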
NEW AI Capabilities
- Added asg017/sqlite-vec as a provided dependency
- Added asg017/sqlite-lembed as a downloaded dependency due to submodules
- Enabled by default; run cmake -DDEP_AI=Off to disable
querydbs
- Changed from C to Python wrapper around gufi_sqlite3
Miscellaneous
- C11 support is now required
- CMake minimum version is now 3.16.0
- Updated GitHub Actions Older CMake build
- Completely removed -i and -t flags
- SQLite 3 is now built with FTS5
  - Users should delete previous sqlite3 build and install, and rebuild
- Updated sqlite3-pcre entry point to sqlite3_pcre2_init
  - Users should delete previous sqlite3-pcre build and install, and rebuild
- Updated min/max level to be closer to how find(1) works
- Removed test macros OPENDB, ADDQUERYFUNCS, SQL_EXEC
- NEW thread_id UDF
- Optional bash completion script install
  - Enabled by default; run cmake -DBASH_COMPLETION=Off to disable
- LaTeX documentation updates
- Previously installed jemalloc can now be used if found
GitHub Actions
- Fixed macOS tests
  - Previously was building but not running tests
  - Regression test scripts updates
  - Removed macOS version of copyfd
    - Now using generic version
- Fixed cygwin build
  - Test source tree needed permissions explicitly set
- Added Alpine Linux Edge
- Added cmake --install and make install tests
- Removed Ubuntu 20.04 build
0.6.6
Memory Usage
- Changed struct work name to be dynamically allocated, but still contiguous within the struct
  - Address now points to space after struct work
    - This is similar to, but not the same as, a flexible array member to avoid needing to compile with C++ extensions enabled
- Swap work items to storage if queue limit is hit
  - gufi_dir2index, gufi_dir2trace, gufi_trace2index, and gufi_query
    - Use -M <bytes> to set queue_limit
      - <bytes> is divided across the number of threads (-n) and work size to produce queue_limit
    - Use -s <prefix> to write swap files to a location that is not pwd
QueuePerThreadPool
- Large amounts of code reorganization and separating out code into functions
- API updates to support swapping, BottomUp updates, and to generally have better design
Miscellaneous
- CMake minimum version is now 3.5.0
  - Updated GitHub Actions Older CMake build
- Now using _at functions to reduce cost of path name resolution
- struct input now has dynamically allocated values - call input_fini to clean up
- descend
  - If struct dirent d_type is DT_UNKNOWN, call lstat(2)
  - Removed unnecessary arguments
- Individual trace files are now scouted in parallel to reduce likelihood of work generation being a bottleneck
- Some executables now dump VmHWM from /proc/self/status at the end
- Updated version string to print consistently across C and Python
- Split BottomUp code into multiple functions in case non-BottomUp functions need to be run with BottomUp
- Removed PRINT_CUMULATIVE_TIMES, PRINT_PER_THREAD_STATS, and PRINT_QPTPOOL_QUEUE_SIZE
  - Performance History Framework
  - gnuplot scripts
  - GitHub Actions debug builds
- Updated gpfs-scan-tool to compile with latest code
GitHub Actions
- Removed macOS 12 and remnants of 13
- Added macOS 15
- Python 2 build now runs on CentOS 8
0.6.5
gufi_query
- Amortize external database views creation when -Q is not provided
- Amortize xattr views creation when -x is not passed in
gufi_rollup
- Clear out old rollup data before copying new data in
  - Unified with gufi_unrollup SQL
- treesummary is no longer copied upwards when rolling up
- Index summary.inode to speed up queries
- Fixed accidental modification of index during dry run
QueuePerThreadPool
- Claimed work can now be stolen to prevent starvation caused by long running threads
- If there is work that can be stolen, at least one work item will be taken even if the multiplier results in 0
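The "at least one item" rule amounts to a floor on the computed steal size. The sketch below is not QueuePerThreadPool code: steal_count is a hypothetical function, and expressing the multiplier as an integer numerator and denominator is an assumption.

```python
def steal_count(available, numerator, denominator):
    """How many items to take from a queue that currently holds `available` items."""
    if available <= 0:
        return 0  # nothing to steal
    scaled = available * numerator // denominator  # integer multiply, may round to 0
    return max(1, min(available, scaled))          # never 0, never more than exists

print(steal_count(1, 1, 2))   # the multiplier alone would give 0
print(steal_count(10, 1, 2))
```

Without the floor, a thread looking at a queue with a single long-waiting item would compute a steal size of 0 and leave that item behind.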
External Databases
- Admins no longer have to know what files to track
- Changed external databases to be set by users in per-directory files called external.gufi that list one path per line
  - Relative paths will be treated as relative to the source tree (not the current directory in the index)
- Changed -q to check that external db files listed are valid
- Now tracked in trace files (old trace files do not have to be changed)
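Reading an external.gufi file is simple enough to sketch. The function below is hypothetical and is not GUFI's parser; it assumes that a relative path is resolved against the source directory that holds the external.gufi file and that blank lines are skipped, neither of which is spelled out in the notes above.

```python
import os
import tempfile

def read_external_gufi(source_dir):
    """Return the paths listed in source_dir/external.gufi, one per line."""
    paths = []
    with open(os.path.join(source_dir, "external.gufi"), encoding="utf-8") as f:
        for line in f:
            path = line.rstrip("\n")
            if not path:
                continue  # skipping blank lines is an assumption
            if not os.path.isabs(path):
                # assumption: relative to the directory in the source tree
                path = os.path.join(source_dir, path)
            paths.append(os.path.normpath(path))
    return paths

# Demo with a throwaway directory and two hypothetical database paths
with tempfile.TemporaryDirectory() as src_dir:
    with open(os.path.join(src_dir, "external.gufi"), "w", encoding="utf-8") as f:
        f.write("user.db\n/abs/other.db\n")
    demo_paths = read_external_gufi(src_dir)
    expected_first = os.path.normpath(os.path.join(src_dir, "user.db"))
print(demo_paths)
```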
contrib/gufi_sqlite -> src/gufi_sqlite3
- Added printing results - previous usage did not require it
NEW gufi_index2dir
- Convert an index into a source tree with file sizes of 0
NEW gufi_trace2dir
- Convert trace files into a source tree with file sizes of 0
NEW parallel_cpr
- Parallel cp -r
Misc
- When descending a directory, if struct dirent d_type is not set, fall back to calling lstat(2)
- Updated opendb behavior
- Updated dupdir behavior
- No longer replacing both search and prefix with prefix in regression test output
GitHub Actions
- Restored code coverage report with codecov
- Updated actions/checkout to v4
- Updated actions/cache to v4
- Added Rocky Linux 9
0.6.4
New: External User Databases
- Allows for arbitrary user data to be attached to filesystem metadata and queried
- Can be rolled up
- gufi_dir2index/gufi_dir2trace -q
  - Added e type to trace file format - does not affect old trace files
- gufi_query -I -Q
  - Added new views: esummary, epentries, exsummary, expentries, evrsummary, evrpentries, evrxsummary, evrxpentries
    - Always available, but will not be filled unless -Q is used.
  - Reorganized processdir to be easier to read
- External user database count is tracked in treesummary
- Removed attachname column from external_dbs
Extended Attributes
- xattrs view and convenience views are now always available, but only filled when -x is passed to gufi_query
gufi_client now calls ssh with subprocess.Popen instead of paramiko
parallel_rmr top-level directory bug fix
Longitudinal Snapshot
- More columns
- Cache intermediate results
- Allow for rolled up indexes
- Different views
  - Graph (G)
  - Per Level (L)
  - Siblings (S)
  - Per Directory (D)
Dependencies
- Updated sqlite3-pcre to use pcre2
  - Existing installs should delete the GUFI sqlite3-pcre build/install and rebuild
- Removed paramiko tarball
- Added new SQLite3 patch to increase attach limit to 254
  - Existing installs should delete the GUFI SQLite3 build/install and rebuild
GitHub Actions
- Removed macOS 11 and 13
- Added macOS 14
- Added Ubuntu 24.04
- Now uploading PDFs and RPMs to tagged releases
- Added test on Windows with cygwin
- Codecov actions update is causing issues, so changed to not error on upload failure
0.6.3
gufi_query
- Input paths can now be symlinks
- Immediate subdirectories of input paths can now be symlinks
- gufi_query will get the realpath of the top-level input paths for traversal, but the custom SQLite functions path and rpath will print the current path with the original user provided prefix
  - fpath still prints the actual path
Schema changes
- Added ppinode to pentries
- dmaxgidI → dmaxgid
- inode INT64 → inode TEXT
New: contrib/longitudinal_snapshot.py
- Takes a snapshot of an index tree and summarizes each directory's metadata (i.e. min, max, mean, median, histograms, etc. of file size, file count, timestamps, string lengths, etc.) and places the data into a single SQLite database file that is much smaller than the index itself
  - Recommend running gufi_treesummary_all before generating a snapshot
- See Discussion in #149
New: contrib/treediff
- Walks directory tree and prints top-most directory mismatches
More tests
- Added empty directory to test tree
- Added deploy test
GitHub Actions
- macOS 11 → macOS 14
- Now keeping pdf documentation as artifacts when building main branch for 14 days
0.6.2
Schema Changes
- summary now has a 0 size count column called totzero
- New views summarylong and vrsummarylong join summary and vrsummary with tables/views that contain additional data that should be associated with them but do not need to be added into the summary table. Currently, no extra information is attached.
${SEARCH} now contains an empty db.db to guarantee a db.db above all indexes under ${SEARCH}.
- This can be expanded in the future to add information that is separate from the rest of the index tree.
- Fixes #49
NEW: gufi_treesummary_all generates treesummary tables for all directories in an index instead of one directory at a time.
gufi_rollup now also generates treesummary tables while processing index
- gufi_unrollup does not remove treesummary tables because there is no way to tell whether or not they were generated by gufi_rollup. A column recording which utility generated them might be added in the future.
gufi_stat → gufi_stat_bin
- gufi_stat is now a script that calls gufi_stat_bin
- Server configuration file now also needs the path to gufi_stat_bin
gufi_stats
- average-leaf-files
- average-leaf-links
- average-leaf-size
- median-leaf-links
- median-leaf-size
- filesize-log2-bins
- filesize-log1024-bins
- dirfilecount-log2-bins (#146)
- dirfilecount-log1024-bins (#146)
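As a rough picture of what a log2 histogram groups together, the sketch below bins integer sizes by floor(log2(size)). It is not the gufi_stats query: log2_bin and histogram are hypothetical helpers, and placing sizes 0 and 1 in bin 0 is an assumption, so the actual bin boundaries may differ.

```python
def log2_bin(size):
    """Bin index for a non-negative integer: floor(log2(size)), with 0 and 1 in bin 0."""
    return max(size.bit_length() - 1, 0)

def histogram(sizes):
    """Count how many sizes fall into each log2 bin, ordered by bin index."""
    bins = {}
    for size in sizes:
        index = log2_bin(size)
        bins[index] = bins.get(index, 0) + 1
    return dict(sorted(bins.items()))

# Hypothetical file sizes in bytes
size_bins = histogram([0, 1, 2, 3, 4, 1023, 1024, 1025])
print(size_bins)
```

A log1024 variant would group by powers of 1024 instead, which lines the bins up with the KiB, MiB, and GiB boundaries.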
Scripts now have a --verbose/-V flag to print the command being run (#142)
bfwreaddirplus2db was reorganized.
NEW: split_trace splits trace files into chunks for parallel processing by gufi_trace2index
SQLite3
- Updated from version 3.27 to version 3.43 to get built-in math functions
  - Existing indexes should be rebuilt
- Also added math functions stdevs, stdevp, and median
- Replaced subdirs_walked() with subdirs(srollsubdirs, sroll)
When printing result columns, the delimiter after the last column is no longer printed
- Prevents pandas from unnecessarily generating a column of Nones when parsing output
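The effect of the trailing delimiter is easy to see by splitting a line by hand. The sketch uses plain str.split instead of pandas to stay dependency-free; the "|" delimiter and the sample row are assumptions.

```python
DELIM = "|"
row = ["file.txt", "1024", "1000"]

old_line = DELIM.join(row) + DELIM  # delimiter printed after every column
new_line = DELIM.join(row)          # delimiter printed only between columns

old_fields = old_line.split(DELIM)
new_fields = new_line.split(DELIM)
print(old_fields)  # ends with an extra empty field
print(new_fields)
```

A parser that treats the delimiter strictly as a separator turns that extra empty field into one more column with no data in it.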
Significant increase in testing and code coverage
CMake
- db2, fuse, and gpfs tool building can be disabled even if the libraries are found
- Added make pylint, make shellcheck, and make checkstyle
- gufi_client_jail should not have been created
- Example configuration files are no longer renamed to config.example
GitHub Actions
- Now building on macOS 11, 12, and 13
- Now building with -Wall -Wextra -Werror -pedantic
Added cygwin GCC support (not tested with CI)
0.6.1
Reduced size of struct work
Added optional work compression with zlib to gufi_dir2index, gufi_dir2trace, and gufi_query
Added in-situ processing of work items in descend function - after enqueuing n directories, the remaining immediate directories are processed in the parent thread instead of enqueued
gufi_query no longer requires at least one of -T, -S, or -E
Changed gufi_trace2index to read from file descriptors using pread(2) instead of FILE *s with getline(3)
Removed BENCHMARK macro
Documentation and test updates
QueuePerThreadPool
- Added soft memory limit via deferred processing
  - If a thread's wait queue gets too big, new work items are placed in a different queue so they are not processed until the wait queue is empty
  - QPTPool_enqueue now returns whether the new work item was placed in the wait queue or in the deferred queue
- QPTPool_init now only requires thread count and thread arguments to initialize
  - The other properties can be optionally set with setter functions
  - Previous QPTPool_init has been renamed to QPTPool_init_with_props
- Symmetrical start up (QPTPool_init and QPTPool_start) and end (QPTPool_wait and QPTPool_destroy)
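The deferred-processing idea can be modeled with two queues per thread. The class below is a single-threaded Python sketch, not the QueuePerThreadPool C API: ThreadQueues, its method names, and the string return values are hypothetical, and locking is left out entirely.

```python
from collections import deque

class ThreadQueues:
    """One thread's wait queue plus a deferred queue acting as a soft limit."""

    def __init__(self, queue_limit):
        self.queue_limit = queue_limit
        self.waiting = deque()
        self.deferred = deque()

    def enqueue(self, item):
        """Place the item and report which queue received it."""
        if len(self.waiting) >= self.queue_limit:
            self.deferred.append(item)
            return "deferred"
        self.waiting.append(item)
        return "wait"

    def next_item(self):
        """Take from the wait queue first; touch deferred items only once it is empty."""
        if self.waiting:
            return self.waiting.popleft()
        if self.deferred:
            return self.deferred.popleft()
        return None

queues = ThreadQueues(queue_limit=2)
placements = [queues.enqueue(n) for n in range(4)]
order = [queues.next_item() for _ in range(4)]
print(placements, order)
```

The limit is soft in the sense that deferred items are still accepted and kept; they are simply held back until the wait queue has drained.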
SQLite3
- Renamed path() to rpath()
  - Returns full path properly for original and rolled up indices
  - Use with new views vrsummary, vrpentries, vrxsummary, and vrxpentries
- Restored path(), epath(), and fpath() functions
- Removed alignment arguments from functions
- Updated URI processing to replace percent characters
Renamed
- bfti → gufi_treesummary
- rollup → gufi_rollup
- unrollup → gufi_unrollup
Performance History Framework
- Added helper script that allows user to specify a range of commits and how many times to benchmark each commit
- Downloads second copy of repo
- Added support for collecting new/renamed/removed cumulative_times debug values for gufi_query in older commits
- Plotting supports including or excluding commits without data
- More documentation
Removed INSTALL, NOTES.txt, Makefiles, and bfmi
Added SQL guide
Added presentation from MSST 2023