Skip to content

Conversation

@wheeheee
Copy link
Contributor

@wheeheee wheeheee commented Feb 9, 2026

Was playing around with this stuff locally while writing the previous PR, but it's probably better split off from that.

  • Rewrote _ispow24 to remove a loop that doesn't optimize out
  • Check N > 0 for more safety
  • Made apparent the mutual exclusivity of branches (N a power of 2 / 3 / (1 or prime)) in CallGraphNode!
  • Use the Primes.Factorization object
  • Slightly reduce allocations, using searchsortedlast with isqrt(N) and exploit an inequality
  • Some purely cosmetic formatting

These are mostly changes to the implementation. The only algorithmic change is in cumprod(Ns), which can be changed to cumprod(reverse(Ns)) together with other minor changes to yield the same heuristic used previously. Is there a reason for the current order, or is it just preference? It feels slightly more natural to implement the way it is in this PR.

Using Cthulhu to descend into optimized and typed CallGraph{ComplexF64} shows that the PR makes the function much smaller. As a rough proxy, inlining cost decreases from ~2000 to ~800, although obviously CallGraphNode! still won't be inlined.

@codecov
Copy link

codecov bot commented Feb 9, 2026

Codecov Report

❌ Patch coverage is 97.56098% with 1 line in your changes missing coverage. Please review.
✅ Project coverage is 95.99%. Comparing base (af2da8e) to head (5985765).
⚠️ Report is 4 commits behind head on main.

Files with missing lines Patch % Lines
src/callgraph.jl 96.15% 1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #104      +/-   ##
==========================================
- Coverage   96.22%   95.99%   -0.24%     
==========================================
  Files           4        4              
  Lines         424      424              
==========================================
- Hits          408      407       -1     
- Misses         16       17       +1     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@dannys4
Copy link
Collaborator

dannys4 commented Feb 9, 2026

I assume you were referring to this line?

cumprod(Ns[end:-1:1])[end:-1:1]

the idea, if I recall correctly, was to knock out big factors asap so that I minimized the number of strides in lower recursions (i.e., if you wait to do larger radices until the end, you'll have to take a lot of strides). I have absolutely no idea if that actually benchmarked better bc that obv makes the lower recursions take larger strides. If this were a performance critical package, I would assume that one needs to optimize according to a heuristic that accounts for the page size.... perhaps another day.

N1_idx = searchsortedlast(N_cp, N_isqrt)
N1 = N_cp[N1_idx] # N1 <= N_isqrt <= N_fsqrt
if N1_idx != lastindex(N_cp) && (abs(N_cp[N1_idx+1] - N_fsqrt) < (N_fsqrt - N1))
N1 = N_cp[N1_idx+1] # can be >= N_fsqrt
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I try not to be too stingy on the codecov stuff, but I'm a little worried about introducing new logic here and then not adding a test for it. My understanding is that you're basically looking at the integer square root and seeing if N_cp[i] or N_cp[i+1] is closer to the floating-point sqrt. So worst-case, the output wouldn't be wrong, just lower performance runtime (since N1 will divide N nmw). I think it's fine, in that case, to not have a test. I just wanted some self-documentation because I've completely forgotten how this code works over the past 5 years.

Copy link
Contributor Author

@wheeheee wheeheee Feb 10, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I couldn't immediately think of some numbers to exercise that code path, and I think it is probably quite hard (maybe impossible, but I'm lazy to put pen to paper for this [2,2,3] would do it I think) to hit that case, so I just put it there to be extra safe.

Also N1_idx != lastindex(N_cp) is in fact impossible to hit by the very nature of the problem, but it's probably good to put it there to help the compiler elide a bounds check and exception.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ah, but [2,2,3] would hit the first case...

@dannys4
Copy link
Collaborator

dannys4 commented Feb 9, 2026

btw thanks for the PRs in recent days, very exciting and the benchmarks look great!

@dannys4 dannys4 added this pull request to the merge queue Feb 9, 2026
Merged via the queue into JuliaMath:main with commit 54e2152 Feb 9, 2026
8 checks passed
@wheeheee
Copy link
Contributor Author

I assume you were referring to this line?

cumprod(Ns[end:-1:1])[end:-1:1]

the idea, if I recall correctly, was to knock out big factors asap so that I minimized the number of strides in lower recursions (i.e., if you wait to do larger radices until the end, you'll have to take a lot of strides). I have absolutely no idea if that actually benchmarked better bc that obv makes the lower recursions take larger strides. If this were a performance critical package, I would assume that one needs to optimize according to a heuristic that accounts for the page size.... perhaps another day.

Yeah. Would have been better to let this cook for a bit longer and figure out a better strategy, but I don't immediately see anything too jarring in the benchmarks so I guess it's ok to let things go. (had a wild idea about randomizing the order of elements in Ns...but that's non-deterministic...)

@wheeheee wheeheee deleted the small_optim branch February 10, 2026 02:50
@dannys4
Copy link
Collaborator

dannys4 commented Feb 10, 2026

figure out a better strategy

I think most of the FFTA use-case is for small arrays---I'll leave the better strategies to johnson and frigo, since it's really difficult to design cache-oblivious algorithms for mixed radix FFTs afaik.

randomizing the order of elements in Ns

lord have mercy.

@wheeheee
Copy link
Contributor Author

FFTW's GPL license is really a pity, would have been wonderful if parts of it could have been translated directly into Julia 🥲

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants