[ENH] big data GAM#188

Open
dswah wants to merge 39 commits into main from bam

@dswah
Owner

dswah commented Jul 22, 2018

fixes #187 #76
fixes #124

write an out-of-core example like pomegranate's:
https://pomegranate.readthedocs.io/en/latest/ooc.html

  • QR updating

  • documentation

  • all methods avoid using full model matrix

  • statistics estimation work with new pirls

  • simplify statistics estimation

  • gamma is an instance argument

  • chunk size is instance arg

  • all models inherit new behavior

  • test with large dataset

  • write parallel version

  • ensure parallel version works in serial

  • do memory profiling. see if we can easily optimize memory anywhere

  • try parallelism

  • merge @maorn 'parallel' branch into this one

  • logic for skipping any parallelism if n_cores==1 (joblib automatically does this)

  • add some tests for the new features

  • fix a couple of broken tests

  • figure out looping in partial_dependence...

  • get rid of matrix vs ndarray warnings
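
For readers unfamiliar with the "QR updating" item, here is a minimal sketch of the general idea, not the code in this branch. It solves an unweighted, unpenalized least-squares problem one row block at a time, keeping only the current block and a small triangular factor in memory. The function name `blockwise_qr_solve` and its arguments are hypothetical, and the PIRLS weights and smoothing penalty that a GAM fit needs are left out for brevity.

```python
import numpy as np


def blockwise_qr_solve(blocks, n_features):
    """Least squares over row blocks using a running QR factorization.

    `blocks` yields (X_block, y_block) pairs. Only one block and an
    R factor with at most n_features rows are held in memory at a time.
    Hypothetical helper for illustration; weights and penalties omitted.
    """
    R = np.zeros((0, n_features))  # running triangular factor
    f = np.zeros(0)                # running Q^T y
    for X_blk, y_blk in blocks:
        # Stack the previous factor on top of the new rows and re-factorize.
        stacked = np.vstack([R, X_blk])
        rhs = np.concatenate([f, y_blk])
        Q, R = np.linalg.qr(stacked)  # reduced QR
        f = Q.T @ rhs
    # Solve R @ beta = f; lstsq also tolerates a rank-deficient R.
    return np.linalg.lstsq(R, f, rcond=None)[0]
```

Because each update only re-factorizes a matrix with (at most n_features + block rows) rows, peak memory depends on the chunk size rather than on the total number of samples.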

subsequent PR?

  • use joblib with Pool? (this will enable use of dask)
  • use batch_size instead of block_size
  • enable mini-batches, add batches_per_epoch parameter and partial_fit method

memory profile

@dswah changed the title from "big data GAM" to "[WIP] big data GAM" on Jul 22, 2018
@codecov

codecov bot commented Jul 23, 2018

Codecov Report

❗ No coverage uploaded for pull request base (master@b986ec5).
The diff coverage is n/a.

@@            Coverage Diff            @@
##             master     #188   +/-   ##
=========================================
  Coverage          ?   91.33%           
=========================================
  Files             ?       19           
  Lines             ?     2492           
  Branches          ?        0           
=========================================
  Hits              ?     2276           
  Misses            ?      216           
  Partials          ?        0

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update b986ec5...03898ea.

@dswah
Owner Author

dswah commented Jul 24, 2018

awesome!!!! just tried a dataset that crashes my notebook when no partitioning is used, but that correctly solves when the optimization is incremental!!!!!

@maorn

maorn commented Jul 24, 2018 via email

@maorn

maorn commented Jul 26, 2018 via email

@dswah
Owner Author

dswah commented Jul 26, 2018

@maorn that is really cool!

to contribute your code, please do the following:

  • put your changes in a safe place
  • fork the repo, and clone your fork on your computer
  • commit your changes (i.e. the parallel code in pygam.py)
  • push your changes to your remote repo fork
  • open a pull request from your remote repo to this branch

Attention!!
please make sure that you don't lose the code you've already written!

  • copy it or something before forking/cloning...

looking forward to reading your code :)

@maorn

maorn commented Oct 21, 2018

hi,
what is the state of this branch?
is there anything missing on my end before it can be merged into the master branch?

@dswah
Owner Author

dswah commented Oct 21, 2018

hi @maorn!
i think there are still a couple of things we need to do before we merge:

  • a rebase of your 'parallel' branch off of this one
  • logic for skipping any parallelism if n_cores==1
  • logic for partial dependence and quantiles that uses the new features
  • add some tests for the new features
  • fix a couple of broken tests
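
As a rough illustration of the `n_cores==1` item above, the sketch below maps a function over data blocks and bypasses the worker pool entirely in the single-core case. It uses the standard library's `concurrent.futures` rather than joblib, and `map_blocks` is a hypothetical helper name, not something in pyGAM.

```python
from concurrent.futures import ThreadPoolExecutor


def map_blocks(func, blocks, n_cores=1):
    """Apply `func` to every block, in order.

    Hypothetical helper: with n_cores == 1 no pool is created at all,
    so the serial path carries no parallelism overhead.
    """
    if n_cores == 1:
        return [func(block) for block in blocks]
    with ThreadPoolExecutor(max_workers=n_cores) as pool:
        # pool.map preserves the input order of the blocks
        return list(pool.map(func, blocks))
```

Both branches return results in block order, so callers do not need to know which path was taken.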

@mohsenzabihi

Hi @maorn and @dswah, may I ask about the status of this work? Do you plan to merge it into master?

@dswah
Owner Author

dswah commented Jul 16, 2019

@mohsenzabihi @ccurro The plan is to merge this branch into master in August.

But it needs a little love right now.
Specifically, i need to

  • adapt all remaining occurrences of gam._modelmat, like in partial dependence and quantiles, to use the new blockwise scheme
  • remove joblib for now since it doesn't look like we get any benefit from parallelizing linear algebra operations
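
To make the "blockwise scheme" in the first item concrete, here is a minimal sketch of computing a linear predictor without ever building the full model matrix. It is an assumption about the general approach, not the branch's code: `blockwise_linear_predictor`, `toy_modelmat`, and `block_size` are hypothetical names, and the toy quadratic basis stands in for whatever `gam._modelmat` actually constructs.

```python
import numpy as np


def toy_modelmat(X):
    """Stand-in basis expansion: intercept, linear and squared terms."""
    return np.column_stack([np.ones(len(X)), X, X ** 2])


def blockwise_linear_predictor(X, coef, modelmat, block_size=1000):
    """Compute modelmat(X) @ coef one row block at a time.

    Only a (block_size x n_basis) slice of the model matrix exists
    at any moment, instead of the full (n_samples x n_basis) matrix.
    """
    out = np.empty(X.shape[0])
    for start in range(0, X.shape[0], block_size):
        stop = start + block_size  # slicing clips past the last row
        out[start:stop] = modelmat(X[start:stop]) @ coef
    return out
```

The same loop shape would apply to any per-row quantity derived from the model matrix, which is why partial dependence and quantiles are natural candidates.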

@tjburch

tjburch commented Feb 28, 2022

I know this PR is pretty old, but I'd still be really happy to see this functionality implemented. Figured I'd just mention it since it's been a couple of years since there have been any updates.

@dswah changed the title from "[WIP] big data GAM" to "[ENH] big data GAM" on Dec 22, 2025
Development

Successfully merging this pull request may close these issues.

  • Memory consumption error
  • add joblib

4 participants