Skip to content

Optimize Scikit-learn model loading by adding Bulk Tree Construction API #651

Open
dantegd wants to merge 2 commits intodmlc:mainlinefrom
dantegd:optimize-sklearn-loader
Open

Optimize Scikit-learn model loading by adding Bulk Tree Construction API #651
dantegd wants to merge 2 commits intodmlc:mainlinefrom
dantegd:optimize-sklearn-loader

Conversation

@dantegd
Copy link

@dantegd dantegd commented Dec 19, 2025

This PR introduces a bulk tree construction API that significantly improves performance when importing scikit-learn RandomForest models into Treelite. In my benchmarks, the new API achieves ~7-10x speedup over the existing node-by-node construction approach of the current sklearn loader.

The current implementation spends significant time in per-node overhead due to:

  • Repeated ModelBuilder method calls for each node
  • Python-C++ boundary crossing overhead accumulating across millions of nodes
  • Memory allocation patterns that don't benefit from bulk operations

This becomes a bottleneck in workflows like cuML's RandomForestClassifier.from_sklearn(), where treelite import time dominates the conversion process.

This PR implements a BulkConstructTree friend function that directly populates the Tree class's internal ContiguousArray members in a single pass, bypassing the ModelBuilder abstraction for sklearn imports.

Initial benchmarks:

Configuration Total Nodes Old API (ms) Bulk API (ms) Speedup
classifier, 50 trees, depth=10 39,844 13.5 1.8 7.45x
classifier, 100 trees, depth=15 351,826 77.3 10.2 7.54x
classifier, 300 trees, depth=20 2,520,062 544.9 60.7 8.98x
regressor, 100 trees, depth=15 978,436 195.6 18.8 10.42x

@dantegd dantegd marked this pull request as ready for review January 6, 2026 18:40
@codecov
Copy link

codecov bot commented Jan 7, 2026

Codecov Report

❌ Patch coverage is 99.29329% with 2 lines in your changes missing coverage. Please review.
✅ Project coverage is 83.73%. Comparing base (a9ce6c3) to head (0c46c84).

Files with missing lines Patch % Lines
python/treelite/sklearn/importer.py 71.42% 2 Missing ⚠️
Additional details and impacted files
@@             Coverage Diff              @@
##           mainline     #651      +/-   ##
============================================
- Coverage     84.35%   83.73%   -0.63%     
============================================
  Files            75       76       +1     
  Lines          6653     6927     +274     
  Branches        543      561      +18     
============================================
+ Hits           5612     5800     +188     
- Misses         1041     1127      +86     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

Copy link
Collaborator

@hcho3 hcho3 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we go ahead and simply remove the old sklearn model builder functions? So LoadSKLearnRandomForestRegressorBulk should be simply called LoadSKLearnRandomForestRegressor, etc.

I don't see a good reason to keep the old functions around, if the new functions are equivalent in functionalities but faster.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants