In this post I take a deep dive into bagging, or bootstrap aggregating. The focus is on building intuition for the underlying mechanics so that you better understand why this technique is so powerful. Bagging is most commonly associated with Random Forest models, but the underlying idea is more general and can be applied to any model.
Bagging — just like boosting — sits within the ensemble family of learners. Bagging involves three key elements:
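As a minimal sketch of the core mechanics (bootstrap sampling, fitting independent base learners, aggregating their predictions), here is bagging by hand with scikit-learn decision trees; the dataset is synthetic for illustration:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, random_state=0)

rng = np.random.default_rng(0)
models = []
for _ in range(25):
    # Bootstrap: sample rows with replacement
    idx = rng.integers(0, len(X), size=len(X))
    # Fit an independent base learner on each bootstrap sample
    models.append(DecisionTreeClassifier().fit(X[idx], y[idx]))

# Aggregate: majority vote across the ensemble (binary labels)
preds = np.stack([m.predict(X) for m in models])
ensemble_pred = (preds.mean(axis=0) >= 0.5).astype(int)
```

In practice you would reach for `sklearn.ensemble.BaggingClassifier`, which wraps exactly this loop.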
Working with natural language data can often be challenging due to its lack of structure. Most data scientists, analysts and product managers are familiar with structured tables, consisting of rows and columns, but less familiar with unstructured documents, consisting of sentences and words. For this reason, knowing how to approach a natural language dataset can be daunting. In this post I want to demonstrate how you can use the awesome Python package Pandas to structure natural language and extract interesting insights quickly.
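A minimal sketch of what this looks like, using a few made-up documents in place of a real corpus:

```python
import pandas as pd

# Hypothetical documents; in practice, load your own text data
df = pd.DataFrame({"text": [
    "The product arrived quickly and works great",
    "Terrible support, the product broke in a week",
    "Great value, would buy again",
]})

# Structure the text with vectorised string methods
df["n_words"] = df["text"].str.split().str.len()
df["mentions_product"] = df["text"].str.contains("product", case=False)

# Explode to one row per word for quick frequency insights
words = df["text"].str.lower().str.split().explode()
top_words = words.value_counts().head(5)
```

The key idea is that `.str` accessors and `.explode()` turn free text into ordinary rows and columns, at which point all the usual Pandas tooling applies.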
How to model the unexpected and unlikely the Bayesian way.
Rare events are by definition, well, rare. But, inevitably they do happen and when they do they have outsized consequences. 9/11 was a tail event. The financial crisis of 2007/08 was a tail event. Coronavirus was a tail event. Many of the everyday products you use, services you engage with and companies you work for are tail events. So, yes, tail events are rare but when they happen their impact is huge.
Predicting a tail event or extreme value is intrinsically hard. In fact, one of the most challenging…
A few years ago I developed a model to identify fraudulent transactions for an online two-sided marketplace. My initial model was based on characteristics of the transaction and its context. This model was quite good, but I wanted to make it better. I was already using a Gradient Boosted Tree and more hyperparameter tuning wasn’t leading to any significant performance gains. I turned again to the features.
The model was already using text messages between buyers and sellers, including metadata on the messages (such as number of messages exchanged), but it wasn’t considering the time between messages. Specifically, it wasn’t…
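As a sketch of the kind of feature this points at, here is one way to compute time-between-messages features with pandas; the message-log schema below is hypothetical:

```python
import pandas as pd

# Hypothetical message log: one row per message in a conversation
msgs = pd.DataFrame({
    "transaction_id": [1, 1, 1, 2, 2],
    "sent_at": pd.to_datetime([
        "2021-01-01 10:00", "2021-01-01 10:02", "2021-01-01 10:30",
        "2021-01-02 09:00", "2021-01-02 09:01",
    ]),
})

# Time between consecutive messages within each transaction
msgs["gap"] = msgs.groupby("transaction_id")["sent_at"].diff()

# Aggregate into per-transaction features for the model
features = msgs.groupby("transaction_id")["gap"].agg(
    mean_gap_s=lambda g: g.dt.total_seconds().mean(),
    max_gap_s=lambda g: g.dt.total_seconds().max(),
)
```

These per-transaction aggregates can then be joined back onto the training table alongside the existing message-count metadata.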
This post is a short addendum to my longer analysis on earnings on Medium. In that post I estimated that one hour of member reading time is worth around $1.62 per day. One of the findings from that analysis is that internal member reading time is a stronger predictor of earnings than internal views. Nevertheless, it is still interesting to understand the relationship between internal views and earnings.
Views on Medium are split between internal and external views. Internal views are from Medium members and count toward earnings, whereas external views do not. …
I started writing on Medium in 2017, but only recently joined the partner program. My most popular post (link below) has accumulated 210K views and 95 hours of member reading time. Awesome. In total, I have earned $3.33 from it. Not so awesome. This is because I published the article in late 2017 but only joined the partner program in April 2021.
Choosing the best layout, imagery and text to present to users on your webpage is hard. But, it doesn’t have to be. In fact *you* don’t need to choose — you can let data do it for you. In this post, I’m going to share a simple algorithm — implemented in Python — that will help you to identify the best webpage across multiple creative options. It’s time to get familiar with an awesome idea — the multiarmed bandit.
FYI I have a longer technical post on this subject here.
The multiarmed bandit (MAB) — The name derives from antiquated…
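To make the idea concrete, here is a minimal epsilon-greedy bandit choosing between three hypothetical webpage variants; the click-through rates are made up purely for the simulation:

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical click-through rates for three variants (unknown in practice)
true_rates = [0.04, 0.06, 0.09]
counts = np.zeros(3)   # times each variant was shown
values = np.zeros(3)   # running estimate of each variant's rate
epsilon = 0.1

for _ in range(10_000):
    # Explore with probability epsilon, otherwise exploit the best estimate
    if rng.random() < epsilon:
        arm = int(rng.integers(3))
    else:
        arm = int(np.argmax(values))
    reward = float(rng.random() < true_rates[arm])  # simulated click
    counts[arm] += 1
    # Incremental mean update of the arm's estimated rate
    values[arm] += (reward - values[arm]) / counts[arm]

best = int(np.argmax(counts))
```

Unlike a classic A/B test, traffic shifts toward the better-performing variant while the experiment is still running.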
Scikit-learn is *the* go-to package for standard machine learning models in Python. It not only provides most of the core algorithms that you would want to use in practice (e.g. GBMs, Random Forests, Logistic/Linear regression), but also provides a wide range of transforms for feature preprocessing (e.g. one-hot encoding, label encoding) as well as metrics and other convenience functions for tracking model performance. But, there will still be times when you need to do something a little bit different. In these instances I often still want to work within the general API of scikit-learn. …
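One common pattern for staying within that API is subclassing `BaseEstimator` and `TransformerMixin`; the `LogTransformer` below is a hypothetical example of a custom transform that then works with `Pipeline`, `GridSearchCV` and friends for free:

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin

class LogTransformer(BaseEstimator, TransformerMixin):
    """Hypothetical custom transform that plays nicely with sklearn Pipelines."""

    def __init__(self, offset=1.0):
        self.offset = offset

    def fit(self, X, y=None):
        # Stateless transform: nothing to learn from the data
        return self

    def transform(self, X):
        return np.log(np.asarray(X) + self.offset)
```

Because `TransformerMixin` supplies `fit_transform` and `BaseEstimator` supplies `get_params`/`set_params`, this drops straight into any sklearn workflow.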
Being an effective data scientist often means being able to identify the right solution for a particular problem. In this post, I want to discuss three techniques that have enabled me to solve tricky problems across multiple contexts, but that aren't widely used. The three techniques are quantile regression, exponential decay regression and tree-embedded logistic regression. For each technique, I'll provide a motivating example for why I think it is worth adding to your toolkit, and I'll also wrap each technique in a custom sklearn model so that you can easily apply it to your own problems.
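As a taste of the first technique, quantile regression is available in plain scikit-learn via `GradientBoostingRegressor` with `loss="quantile"`; the data below is synthetic, with noise that grows with `X` so the upper quantile is genuinely different from the mean:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(1000, 1))
y = X.ravel() + rng.normal(0, 1 + X.ravel() / 5)  # heteroscedastic noise

# Quantile regression: predict the 90th percentile rather than the mean
q90 = GradientBoostingRegressor(loss="quantile", alpha=0.9).fit(X, y)
upper = q90.predict(X)

# Roughly 90% of observations should fall below the predicted quantile
coverage = (y <= upper).mean()
```

Fitting two such models (e.g. alpha=0.1 and alpha=0.9) gives a prediction interval rather than a single point estimate.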
Pytorch is really fun to work with and if you are looking for a framework to get started with neural networks I highly recommend it — see my short tutorial on how to get up and running with a basic neural net in Pytorch here.
What many people don’t realise however is that Pytorch can be used for general gradient optimization. In other words, you can use Pytorch to find the minimum or maximum of arbitrarily complex optimization objectives. But, why would you want to do this? I can think of at least three good reasons (there are many more).
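As a minimal sketch, here is Pytorch's autograd and the Adam optimizer minimizing a simple quadratic; in practice the objective could be any differentiable function you can write in torch operations:

```python
import torch

# Find the minimum of f(x) = (x - 3)^2 + 1 via gradient descent
x = torch.tensor(0.0, requires_grad=True)
optimizer = torch.optim.Adam([x], lr=0.1)

for _ in range(500):
    optimizer.zero_grad()
    loss = (x - 3) ** 2 + 1
    loss.backward()   # autograd computes d(loss)/dx
    optimizer.step()  # move x downhill
```

The same loop works unchanged whether `x` is a scalar or a tensor of millions of parameters, which is exactly why it generalises beyond neural networks.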
Data Scientist, Economist, Pragmatist.