
[WIP] Balanced Random Forest #8732


Closed
wants to merge 17 commits
Conversation

massich
Contributor

@massich massich commented Apr 12, 2017

Reference Issue

Fixes #8607

What does this implement/fix? Explain your changes.

This PR takes over #5181 (and #8728).


Tasks to be performed

@MechCoder
Member

Can you provide a summary of what exactly is left to do in the PR description? Thanks!

@potash

potash commented May 17, 2017

@massich check out my branch feature/balanced-random-forest-api. The changes are:

  1. Followed the discussion of @glemaitre @arjoly @amueller in bootstrapping based on sample weights in random forests #8607: removed the ad-hoc support for multioutput balanced random forest, and raise an error when it is attempted.

  2. Added unit tests for the two BRF helper methods to test_balanced_random_forest.py; it wasn't obvious to me which of the existing test files they belong in, so feel free to move them.

  3. I changed the API to be class_weight="balanced_bootstrap" as discussed in bootstrapping based on sample weights in random forests #8607.

Please let me know what is left to get this merged.
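For readers following along, the `class_weight="balanced_bootstrap"` idea can be sketched with scikit-learn's existing utilities (an illustrative sketch, not the PR's implementation): each tree draws a bootstrap sample and then reweights it so that both classes contribute equal total weight.

```python
import numpy as np
from sklearn.utils.class_weight import compute_sample_weight

rng = np.random.RandomState(0)
y = np.array([0] * 90 + [1] * 10)        # imbalanced labels

# Bootstrap draw for a single tree, then "balanced" weights
# computed on the bootstrap sample itself (the key BRF idea).
indices = rng.randint(0, len(y), len(y))
w = compute_sample_weight("balanced", y[indices])

# Within the bootstrap, total weight per class is equal.
print(w[y[indices] == 0].sum(), w[y[indices] == 1].sum())
```

With plain `class_weight="balanced"` the weights are computed once on the full training set; recomputing them per bootstrap sample is what distinguishes the `"balanced_bootstrap"` option discussed in #8607.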

@massich
Contributor Author

massich commented May 18, 2017

@potash I am benchmarking the estimator here. My idea for the benchmark is:

  • Using sklearn datasets:
    • Create a synthetic dataset and go from balanced to highly unbalanced to see when BRF is beneficial
    • Repeat the experiment with Breast dataset in Sk-learn.
  • Using sklearn-imbalance:
    • Test against their selection of imbalanced datasets
  • Using openML:
    • Explore some imbalanced datasets
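A minimal version of the synthetic-data sweep described above might look like this (illustrative only; the imbalance levels and scoring choice are assumptions, not the actual benchmark code):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Sweep from balanced to highly imbalanced and record minority-class
# recall for a plain random forest; a BRF would be compared on the
# same grid.
for minority_frac in (0.5, 0.25, 0.1, 0.02):
    X, y = make_classification(
        n_samples=2000, n_features=20,
        weights=[1 - minority_frac, minority_frac],
        random_state=0,
    )
    clf = RandomForestClassifier(n_estimators=50, random_state=0)
    scores = cross_val_score(clf, X, y, cv=3, scoring="recall")
    print(f"minority={minority_frac:.2f}  recall={scores.mean():.3f}")
```

Minority-class recall (rather than accuracy) is the quantity that should degrade as the imbalance grows, which is where a balanced random forest is expected to help.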

@potash

potash commented May 18, 2017

Sounds good. You'll want to merge feature/balanced-random-forest-api so you can work off the new api (class_weight="balanced_bootstrap") and merge brf-example as it's been updated there too. Let me know if I can help with the examples.

@amueller
Member

amueller commented May 18, 2017

There are some benchmarks on a real dataset, and also a quick implementation of the feature using imblearn, here: https://github.com/amueller/applied_ml_spring_2017/blob/master/slides/aml-15-resampling-imbalanced-data.ipynb
You can see around Out[83] that this method is doing much better than any of the others.

@raghavrv raghavrv added the Sprint label Jun 3, 2017
@raghavrv raghavrv self-requested a review June 28, 2017 13:05
@geneorama

Hello there, is it possible to get an update on this? We're using this model in production (https://github.com/Chicago/lead-model), and as we prepare to go live it would be very helpful for deployment if this branch were in the standard scikit-learn library.

Thanks for all the great work here!

Also, let us know if there's something we can do to move this forward.

@amueller
Member

This needs tests, documentation and examples. I'm a big fan of this method, so I'd be happy to see it moved forward. @massich are you still working on it? Would you like some help?
I liked using the mammography dataset: https://www.openml.org/d/310, see #9908 for a loader ;)

@glemaitre
Member

In the meantime, we have the BalancedBaggingClassifier, which can be turned into a balanced random forest by setting max_features='auto', if I am not wrong.
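BalancedBaggingClassifier lives in imbalanced-learn rather than scikit-learn, so as a dependency-free illustration, here is a hand-rolled sketch of the same idea (not imblearn's actual code; all names below are made up for this example): each tree trains on a bootstrap of the minority class plus an equally sized random draw from the majority class, with max_features="sqrt" to mimic random-forest feature subsampling.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def fit_balanced_forest(X, y, n_trees=25, seed=0):
    """Sketch: bagging over per-tree balanced undersamples (binary y in {0, 1})."""
    rng = np.random.RandomState(seed)
    minority = np.flatnonzero(y == 1)
    majority = np.flatnonzero(y == 0)
    trees = []
    for _ in range(n_trees):
        # Balanced resample: bootstrap the minority class, then draw an
        # equally sized sample from the majority class.
        idx = np.concatenate([
            rng.choice(minority, len(minority), replace=True),
            rng.choice(majority, len(minority), replace=True),
        ])
        tree = DecisionTreeClassifier(max_features="sqrt", random_state=rng)
        trees.append(tree.fit(X[idx], y[idx]))
    return trees

def predict_balanced_forest(trees, X):
    """Majority vote over the ensemble."""
    votes = np.mean([t.predict(X) for t in trees], axis=0)
    return (votes >= 0.5).astype(int)
```

Because every tree sees a class-balanced sample, the ensemble's decision threshold is not dominated by the majority class, which is the effect the BalancedBaggingClassifier workaround achieves.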

@amueller
Member

@glemaitre I believe you are right.

@massich
Contributor Author

massich commented Nov 22, 2017

Actually, it completely stalled. I did not even finish the benchmark; I was playing with OpenML but never finished it. It has been sitting for 6 months.

We should definitely revive it.

@chkoar
Contributor

chkoar commented Jan 5, 2018

@massich what is the current status of this PR? Do you need a hand? According to a previous comment of @amueller this PR needs love, tests, documentation and examples, right?

@jnothman
Member

IMO it would be good if you helped complete this, @chkoar

@chkoar
Contributor

chkoar commented Feb 21, 2018

@jnothman That was the intention. If it is not picked up by anyone else I will give it a go in a couple of weeks. @massich has already given me write access to his repos.

@potash

potash commented Feb 21, 2018

@chkoar let me know if there's anything I (original author of the feature) can do to help. Would be very happy to see this merged.

@chkoar
Contributor

chkoar commented Feb 18, 2019

@potash ok, thanks. Let's hope that it will be merged during the upcoming sprint.

@jnothman
Member

I think you should expect a little less. But let's hope it will be a lot closer to merging after the sprint.

@chkoar chkoar mentioned this pull request Feb 22, 2019
@massich
Contributor Author

massich commented Feb 24, 2019

Closing in favor of #13227. Thanks @chkoar for taking over.

@massich massich closed this Feb 24, 2019