-
Notifications
You must be signed in to change notification settings - Fork 7
CallGraphNode! optimizations
#104
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #104 +/- ##
==========================================
- Coverage 96.22% 95.99% -0.24%
==========================================
Files 4 4
Lines 424 424
==========================================
- Hits 408 407 -1
- Misses 16 17 +1 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
|
I assume you were referring to this line? the idea, if I recall correctly, was to knock out big factors asap so that I minimized the number of strides in lower recursions (i.e., if you wait to do larger radices until the end, you'll have to take a lot of strides). I have absolutely no idea if that actually benchmarked better bc that obv makes the lower recursions take larger strides. If this were a performance critical package, I would assume that one needs to optimize according to a heuristic that accounts for the page size.... perhaps another day. |
| N1_idx = searchsortedlast(N_cp, N_isqrt) | ||
| N1 = N_cp[N1_idx] # N1 <= N_isqrt <= N_fsqrt | ||
| if N1_idx != lastindex(N_cp) && (abs(N_cp[N1_idx+1] - N_fsqrt) < (N_fsqrt - N1)) | ||
| N1 = N_cp[N1_idx+1] # can be >= N_fsqrt |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I try not to be too stingy on the codecov stuff, but I'm a little worried about introducing new logic here and then not adding a test for it. My understanding is that you're basically looking at the integer square root and seeing if N_cp[i] or N_cp[i+1] is closer to the floating-point sqrt. So worst-case, the output wouldn't be wrong, just lower performance runtime (since N1 will divide N nmw). I think it's fine, in that case, to not have a test. I just wanted some self-documentation because I've completely forgotten how this code works over the past 5 years.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I couldn't immediately think of some numbers to exercise that code path, and I think it is probably quite hard (maybe impossible, but I'm lazy to put pen to paper for this [2,2,3] would do it I think) to hit that case, so I just put it there to be extra safe.
Also N1_idx != lastindex(N_cp) is in fact impossible to hit by the very nature of the problem, but it's probably good to put it there to help the compiler elide a bounds check and exception.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ah, but [2,2,3] would hit the first case...
|
btw thanks for the PRs in recent days, very exciting and the benchmarks look great! |
Yeah. Would have been better to let this cook for a bit longer and figure out a better strategy, but I don't immediately see anything too jarring in the benchmarks so I guess it's ok to let things go. (had a wild idea about randomizing the order of elements in |
I think most of the FFTA use-case is for small arrays---I'll leave the better strategies to johnson and frigo, since it's really difficult to design cache-oblivious algorithms for mixed radix FFTs afaik.
lord have mercy. |
|
FFTW's GPL license is really a pity, would have been wonderful if parts of it could have been translated directly into Julia 🥲 |
Was playing around with this stuff locally while writing the previous PR, but it's probably better split off from that.
_ispow24to remove a loop that doesn't optimize outN > 0for more safetyNa power of 2 / 3 / (1 or prime)) inCallGraphNode!Primes.Factorizationobjectsearchsortedlastwithisqrt(N)and exploit an inequalityThese are mostly changes to the implementation. The only algorithmic change is in
cumprod(Ns), which can be changed tocumprod(reverse(Ns))together with other minor changes to yield the same heuristic used previously. Is there a reason for the current order, or is it just preference? It feels slightly more natural to implement the way it is in this PR.Using Cthulhu to descend into optimized and typed
CallGraph{ComplexF64}shows that the PR makes the function much smaller. As a rough proxy, inlining cost decreases from ~2000 to ~800, although obviouslyCallGraphNode!still won't be inlined.