Prevent hanging when sample collection time > interval#124
Prevent hanging when sample collection time > interval#124benbuckman wants to merge 20 commits intotmm1:masterfrom
Conversation
| Records information about the frames captured in `_stackprof.frames_buffer`, | ||
| up to `frame_count-1` (buffer may contain more frames from prior sample), | ||
| captured `timestamp_delta` microseconds after previous sample. | ||
| */ |
There was a problem hiding this comment.
(This is based on my understanding of the code, please correct if I'm misunderstanding. I'm also not sure what the canonical comment format should be here.)
| if (_stackprof.raw) { | ||
| gettimeofday(&_stackprof.last_sample_at, NULL); | ||
| } | ||
| gettimeofday(&_stackprof.last_sample_at, NULL); |
There was a problem hiding this comment.
(It seems simpler, and trivial, to assign this always, rather than track the downstream paths that do or don't need it.)
ext/stackprof/stackprof.c
Outdated
| struct timeval sampling_start, sampling_finish; | ||
|
|
||
| gettimeofday(&sampling_start, NULL); | ||
| long int time_since_last_sample_usec = diff_timevals_usec(&_stackprof.last_sample_at, &sampling_start); |
There was a problem hiding this comment.
(Note, this replaces the timestamp math fix in #122.)
There was a problem hiding this comment.
Instead of long int, include stdint.h and use int64_t
ext/stackprof/stackprof.c
Outdated
| return MICROSECONDS_IN_SECOND * diff->tv_sec + diff->tv_usec; | ||
| } | ||
|
|
||
| long int diff_timevals_usec(struct timeval *start, struct timeval *end) { |
There was a problem hiding this comment.
(I adapted this from an algorithm at https://www.guyrutenberg.com/2007/09/22/profiling-code-using-clock_gettime/, and tested it with this repl: https://repl.it/@BenBuckman/SanityCheck-Timevals)
|
@tmm1 Thanks for your feedback so far. |
|
Thinking about it some more, another option would be to track and expose |
I see an integer stackprof/ext/stackprof/stackprof.c Line 655 in 1d5ae02 missed_samples the same as I'm proposing here?never mind, the search results missed where it's used, stackprof/ext/stackprof/stackprof.c Line 260 in 1d5ae02 It looks like |
|
@tmm1 || @tenderlove || other maintainers, do you have time to review this? |
causes segfaults in tests 😞 This reverts commit ac49055.
- in `fprintf`s - `_stackprof.interval` is an int - `timestamp_delta` as `int64_t` - but doing the same with `_stackprof.raw_timestamp_deltas` makes it segfault!
|
Edit: this is fixed, I needed to also change the |
it appears that at 1μs, performance hit of profiling causes test to take >10s to run, and time out. try at 10μs instead.
ext/stackprof/stackprof.c
Outdated
|
|
||
| if (_stackprof.debug) { | ||
| printf("started with interval %d (%ld sec %d usec)\n", | ||
| NUM2INT(interval), timer.it_interval.tv_sec, timer.it_interval.tv_usec); |
There was a problem hiding this comment.
(This highlights the separate issue of #125)
| if (raw) { | ||
| gettimeofday(&_stackprof.last_sample_at, NULL); | ||
| if (_stackprof.debug) { | ||
| printf("started with interval %d (%ld sec %d usec)\n", |
There was a problem hiding this comment.
(This highlights the separate issue of #125)
This attempts to fix the bug described in #123, in which, when
intervalis faster than the time required to record and analyze stack frames, the job queue piles up faster than it's flushed and the program hangs.Pieces:
in_signal_handlerlock to the global_stackprofstruct, and adds a check, and escape hatch, instackprof_signal_handler.timeval_to_usecanddiff_timevals_usecfor working with time durations more easily aslong ints of microseconds, rather thantimevalstructs.debugmode with stdout+stderr output recording the internal timings and flow.More context on the underlying bug, and consideration of alternative approaches, is at #123.
Motivation:
We would like to use StackProf on a large-scale Ruby application in production. The inability to set an interval that is both fast enough to produce useful data, and guaranteed to never hang the program, is a blocker to doing so.
Caveats:
stackprof_signal_handlerdoes to the quality of the final results. (At the end of the day, it's a kind of user or program error if the interval is so fast that the escape hatch is necessary. The question is how to deal with it.)wallmode; I have not yet analyzed how this bug or fix work withcpuorobjectmodes (though I suspect they're similar).cc @NickLaMuro, thank you for your engagement with #123!