Blog

A Brief Introduction to Berserker

Nov 1, 2015. | By: Jake

If you compete in or follow the machine learning competitions on Kaggle, then by now you’re probably familiar with the concept of ensembling. Almost without exception, the winners of each competition combine a variety of estimators into a single, more powerful model. It doesn’t matter what kind of data is given, or whether regression or classification is required: there isn’t a single machine learning algorithm that can compete with an ensemble. Even if a shiny new algorithm were published tomorrow that was objectively better than random forests, gradient boosting, and the like, you could just add it to your ensemble, making it even better.

There exists a variety of ensembling techniques, most of which stem from two methods known as “stacking” and “blending”. I may go over these in depth in a future post, but for now all you need to know is that several diverse models separately make predictions, which are then combined into a single prediction. The basic idea is that while every model has imperfections, the models tend to make different mistakes, so for any given prediction most of them are correct and each individual error has a negligible effect on the combined result.
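To make that intuition concrete, here is a minimal sketch using a plain majority vote, which is simpler than stacking or blending proper. The labels and the three models' predictions are made up for illustration: each hypothetical model is wrong on two different samples, so the vote can outscore every individual model.

```python
from collections import Counter

# Hypothetical ground truth and per-model predictions, made up for illustration.
y_true = [1, 0, 1, 1, 0, 1, 0, 0]
model_preds = {
    "model_a": [1, 0, 0, 1, 0, 1, 0, 1],  # wrong on samples 2 and 7
    "model_b": [1, 1, 1, 1, 0, 0, 0, 0],  # wrong on samples 1 and 5
    "model_c": [0, 0, 1, 1, 1, 1, 0, 0],  # wrong on samples 0 and 4
}


def accuracy(preds, truth):
    """Fraction of predictions that match the true labels."""
    return sum(p == t for p, t in zip(preds, truth)) / len(truth)


def majority_vote(pred_lists):
    """For each sample, return the label predicted by most models."""
    return [Counter(votes).most_common(1)[0][0] for votes in zip(*pred_lists)]


ensemble_preds = majority_vote(list(model_preds.values()))
individual_acc = {name: accuracy(p, y_true) for name, p in model_preds.items()}
ensemble_acc = accuracy(ensemble_preds, y_true)

print(individual_acc)
print(ensemble_acc)
```

Because no two of these toy models fail on the same sample, every error is outvoted. Real models have correlated errors, so the gain is usually smaller, which is why diversity among the base models matters.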

The Case for Berserker

I think it’s fair to say that most of us don’t rewrite the random forest algorithm in C code every time we want to use it. We have things like scikit-learn so we don’t have to continuously reinvent the wheel. Yet for some reason we don’t have generalized, reusable tools for creating ensembles, despite the fact that everyone and their mother is using them. That is why I created Berserker.

You can get all of the details in the readme, but here are a few key features:

  • A familiar scikit-learn api/syntax
  • Prediction memoization
  • Generate models algorithmically

A Simple Example

Finally, I’ll leave you with the source code and output for a demo using the popular Boston housing prices dataset. With only a few lines of code, you can create an ensemble (of ensembles) which outperforms the vanilla random forest and GBT in scikit-learn.

from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

from berserker.ensemble import Ensemble
from berserker.layers import Layer
from berserker.nodes import Node

# X_trn, y_trn, and X_tst are assumed to be a train/test split of the Boston housing data
model = Ensemble(X_trn, y_trn, mean_squared_error)

# base estimator pool
model.add_layer(folds=5)
model.add_node(RandomForestRegressor(50), name='50 Tree Random Forest')
model.add_node(GradientBoostingRegressor(n_estimators=250), name='250 Gradient Boosted Trees')

# meta-estimator
model.add_layer()
model.add_node(LinearRegression(), name='Lin Reg Meta Estimator')

preds = model.predict(X_tst)

Level 1 Estimators (12 features)      Validation Error
-----------------------------------------------------
50 Tree RF                            16.1368
Gradient Boosted Trees                18.4357

Level 2 Estimators (14 features)      Validation Error
-----------------------------------------------------
Lin Reg Meta Estimator                15.5071

I urge you to try it out. This is my first attempt at writing a library, so I openly welcome any criticism.


The Push to Release Video Games Before Black Friday

Aug 22, 2015. | By: Jake

It shouldn’t really come as a surprise, but apparently companies like having their games on shelves for the biggest shopping day of the year. More than $50 billion was spent during Black Friday weekend in 2014, which is generally considered to be the start of the Christmas shopping season.

We see an interesting pattern in the release dates of video games in relation to the unofficial holiday. Histograms of video game release dates show that most games are released in the weeks leading up to Black Friday, with a plurality of releases immediately before it.

[Figures: histograms of release dates for Xbox and PS3 games]

It’s pretty apparent that Black Friday is a deadline of sorts for video game publishers.
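For anyone who wants to reproduce this kind of histogram, here is a rough sketch of the binning step. It assumes Black Friday is the day after the fourth Thursday of November, and the release dates are hypothetical placeholders, not the data behind the plots.

```python
from collections import Counter
from datetime import date, timedelta


def black_friday(year):
    """Day after US Thanksgiving (the fourth Thursday of November)."""
    nov1 = date(year, 11, 1)
    # date.weekday(): Monday is 0, so Thursday is 3
    first_thursday = nov1 + timedelta(days=(3 - nov1.weekday()) % 7)
    return first_thursday + timedelta(weeks=3, days=1)


def weeks_before_black_friday(release):
    """Whole weeks from a release date to that year's Black Friday (negative = after)."""
    return (black_friday(release.year) - release).days // 7


# Hypothetical release dates, for illustration only.
releases = [
    date(2014, 3, 11),
    date(2014, 10, 7),
    date(2014, 11, 4),
    date(2014, 11, 11),
    date(2014, 11, 18),
    date(2014, 11, 18),
]

# Histogram: number of releases per "weeks before Black Friday" bin.
histogram = Counter(weeks_before_black_friday(d) for d in releases)
for weeks, count in sorted(histogram.items()):
    print(f"{weeks:>3} weeks before: {'#' * count}")
```

Measuring each release relative to its own year's Black Friday lets several years of data share one axis, even though the holiday falls on a different date each year.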


Analyzing the Ashley Madison Leak

Aug 22, 2015. | By: Jake

The recent release of AshleyMadison.com users’ data has taken the internet by storm. While many are interested in digging up the dirty secrets of friends, neighbors, and significant others, some of us are just interested in what we can learn from the data.

Using the list of transactions, I made a plot attempting to shed some light on the demographics of cheating. Totaling the transaction amounts by state, and dividing by each state’s population, we can see which states’ citizens spent the most on average.

[Figure: transaction spending per capita by state]

Note that Alabama is at the top. This could be due to users making fake accounts and not changing the state from the default setting.
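The per-capita calculation behind the plot can be sketched in a few lines. The transactions and populations below are hypothetical placeholders rather than values from the leak or the census, and the variable names are my own.

```python
from collections import defaultdict

# Hypothetical (state, amount in dollars) rows; not taken from the leaked data.
transactions = [
    ("AL", 50.0),
    ("AL", 30.0),
    ("NY", 100.0),
    ("NY", 20.0),
    ("TX", 60.0),
]

# Hypothetical state populations, for illustration only.
populations = {"AL": 4_000_000, "NY": 20_000_000, "TX": 30_000_000}

# Total the transaction amounts by state.
totals = defaultdict(float)
for state, amount in transactions:
    totals[state] += amount

# Divide each state's total by its population.
per_capita = {state: total / populations[state] for state, total in totals.items()}

# Rank states from highest to lowest spending per resident.
ranking = sorted(per_capita.items(), key=lambda item: item[1], reverse=True)
for state, spend in ranking:
    print(f"{state}: ${spend:.8f} per resident")
```

Normalizing by population is what keeps large states from dominating the ranking simply because they have more people.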


Moving to Github Pages

Aug 19, 2015. | By: Jake

I’m in the process of porting all of my work to GitHub Pages and Jekyll. Knowing very little about web design, I’ve decided to fork an existing template that I really liked and go from there. All I had to do was strip out most of the “business” content like clients and testimonials, and update the config files with my own information.

Now all that’s left is to flesh out the site with more content like projects and blog posts.

