If you compete in or follow the machine learning competitions on Kaggle, then by now you’re probably familiar with the concept of ensembling. Almost without exception, the winners of each competition combine a variety of estimators into a single, more powerful model. It doesn’t matter what kind of data is given, or whether regression or classification is required: no single machine learning algorithm can compete with an ensemble. Even if a shiny new algorithm were published tomorrow that was objectively better than random forests, gradient boosting, and the like, you could just add it to your ensemble, making it even better.
There exists a variety of ensembling techniques, most of which stem from two methods known as “stacking” and “blending”. I may go over these in depth in a future post, but for now all you need to know is that several diverse models make predictions separately, which are then combined into a single prediction. The basic idea is that while every model has imperfections, most of the models are correct for any given prediction, so each individual error has a negligible effect.
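To make the stacking idea concrete, here is a minimal sketch using scikit-learn’s built-in `StackingRegressor` (the dataset and hyperparameters are illustrative, not from the post): base estimators make cross-validated predictions on the training set, and a meta-estimator learns how to combine them.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import (GradientBoostingRegressor, RandomForestRegressor,
                              StackingRegressor)
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Toy regression problem standing in for a real dataset
X, y = make_regression(n_samples=300, n_features=10, noise=10.0, random_state=0)
X_trn, X_tst, y_trn, y_tst = train_test_split(X, y, random_state=0)

# Diverse base estimators each produce out-of-fold predictions...
base = [
    ('rf', RandomForestRegressor(n_estimators=50, random_state=0)),
    ('gbt', GradientBoostingRegressor(n_estimators=100, random_state=0)),
]

# ...and a simple meta-estimator learns to weight them
stack = StackingRegressor(estimators=base,
                          final_estimator=LinearRegression(), cv=5)
stack.fit(X_trn, y_trn)
print(mean_squared_error(y_tst, stack.predict(X_tst)))
```

The meta-estimator is trained on out-of-fold base predictions (`cv=5`), which avoids leaking the base models’ training-set fit into the second level.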
I think it’s fair to say that most of us don’t rewrite the random forest algorithm in C every time we want to use it. We have things like scikit-learn so we don’t have to continuously reinvent the wheel. Yet for some reason we don’t have a generalized, reusable tool for creating ensembles, despite the fact that everyone and their mother is using them. That is why I created Berserker.
You can get all of the details in the readme, but here are a few key features:
Finally, I’ll leave you with the source code and output for a demo using the popular Boston housing prices dataset. With only a few lines of code, you can create an ensemble (of ensembles) which outperforms the vanilla random forest and GBT in scikit-learn.
```python
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

from berserker.ensemble import Ensemble
from berserker.layers import Layer
from berserker.nodes import Node

# X_trn, y_trn, X_tst come from a train/test split of the Boston housing data
model = Ensemble(X_trn, y_trn, mean_squared_error)

# base estimator pool
model.add_layer(folds=5)
model.add_node(RandomForestRegressor(50), name='50 Tree Random Forest')
model.add_node(GradientBoostingRegressor(n_estimators=250), name='250 Gradient Boosted Trees')

# meta-estimator
model.add_layer()
model.add_node(LinearRegression(), name='Lin Reg Meta Estimator')

preds = model.predict(X_tst)
```
```
Level 1 Estimators (12 features)        Validation Error
-----------------------------------------------------
50 Tree RF                              16.1368
Gradient Boosted Trees                  18.4357

Level 2 Estimators (14 features)        Validation Error
-----------------------------------------------------
Lin Reg Meta Estimator                  15.5071
```
I urge you to try it out. This is my first attempt at writing a library, so I openly welcome any criticism.
It shouldn’t really come as a surprise, but apparently companies like having their products on shelves for the biggest shopping day of the year. More than $50 billion was spent during Black Friday weekend in 2014, which is generally considered to be the start of the Christmas shopping season.
We see an interesting pattern in the release dates of video games in relation to the unofficial holiday. Histograms of video game release dates show that most games are released in the weeks leading up to Black Friday, with a plurality of releases immediately before it.
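The alignment described above can be measured by expressing each release date as an offset from that year’s Black Friday. The helper below is a sketch of that calculation (the function names are mine, not from the original analysis); Black Friday is the day after the fourth Thursday of November.

```python
import datetime

def black_friday(year):
    # Fourth Thursday of November, plus one day
    nov1 = datetime.date(year, 11, 1)
    first_thursday = nov1 + datetime.timedelta((3 - nov1.weekday()) % 7)
    return first_thursday + datetime.timedelta(weeks=3, days=1)

def days_from_black_friday(release):
    # Negative values mean the game shipped before Black Friday
    return (release - black_friday(release.year)).days

print(black_friday(2014))                                    # 2014-11-28
print(days_from_black_friday(datetime.date(2014, 11, 18)))   # -10
```

Histogramming these offsets across many titles is what surfaces the pile-up of releases just before the holiday.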
It’s pretty apparent that Black Friday is a deadline of sorts for video game publishers.
The recent release of AshleyMadison.com users’ data has taken the internet by storm. While many are interested in digging up the dirty secrets of friends, neighbors, and significant others, some of us are just interested in what we can learn from the data.
Using the list of transactions, I made a plot attempting to shed some light on the demographics of cheating. Totaling the transaction amounts by state, and dividing by each state’s population, we can see which states’ citizens spent the most on average.
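The per-capita figure described above boils down to a group-and-divide. Here is an illustrative sketch with made-up numbers (the real analysis used the full transaction dump and actual state populations):

```python
# Hypothetical (state, amount) transaction records
transactions = [("AL", 20.0), ("AL", 35.0), ("NY", 50.0)]

# Hypothetical state populations
population = {"AL": 4_850_000, "NY": 19_750_000}

# Total the transaction amounts by state...
totals = {}
for state, amount in transactions:
    totals[state] = totals.get(state, 0.0) + amount

# ...then divide by each state's population for spending per capita
per_capita = {state: totals[state] / population[state] for state in totals}
```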
Note that Alabama is at the top. This could potentially be because users are making fake accounts and not changing the state from the default setting.
I’m in the process of porting all of my work into GitHub Pages and Jekyll. Knowing very little about web design, I’ve decided to fork an existing template that I really liked and go from there. All I had to do was strip out most of the “business” content, like clients and testimonials, and update the config files with my own information.
Now all that’s left is to flesh out the site with more content, like projects and blog posts.