Photo by Ross Sneddon on Unsplash

In this post I take a deep dive into bagging, or bootstrap aggregating. The focus is on building intuition for the underlying mechanics so that you can better understand why this technique is so powerful. Bagging is most commonly associated with Random Forest models, but the underlying idea is more general and can be applied to any model.

Bagging — just like boosting — sits within the ensemble family of learners. Bagging involves three key elements:
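The preview cuts off before the list, but the core mechanics are easy to sketch: draw bootstrap samples, fit one base model per sample, and aggregate their predictions. Below is a minimal, self-contained illustration; the `train_stump` base learner and the toy data are made up for this sketch.

```python
import random

def bootstrap(data, rng):
    # Sample n points *with replacement* from a dataset of size n.
    return [rng.choice(data) for _ in data]

def bag(train, data, n_models=25, seed=0):
    # Fit one base model per bootstrap sample.
    rng = random.Random(seed)
    return [train(bootstrap(data, rng)) for _ in range(n_models)]

def predict(models, x):
    # Aggregate by majority vote (use the mean for regression).
    votes = [m(x) for m in models]
    return max(set(votes), key=votes.count)

# Toy base learner: a decision stump that splits at the mean of its
# training sample and predicts the majority label on each side.
def train_stump(data):
    threshold = sum(x for x, _ in data) / len(data)
    def majority(side):
        labels = [y for x, y in data if (x > threshold) == side] or [0]
        return max(set(labels), key=labels.count)
    left, right = majority(False), majority(True)
    return lambda x: right if x > threshold else left

data = [(1, 0), (2, 0), (3, 0), (8, 1), (9, 1), (10, 1)]
models = bag(train_stump, data)
```

Each individual stump is noisy, but averaging over many bootstrap fits smooths that noise out — which is exactly the intuition behind Random Forests.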

Accelerate analysis by bringing structure to unstructured data

Working with natural language data can be challenging due to its lack of structure. Most data scientists, analysts and product managers are familiar with structured tables, consisting of rows and columns, but less so with unstructured documents, consisting of sentences and words. As a result, it can be hard to know how to approach a natural language dataset. In this post I want to demonstrate how you can use two awesome Python packages, spaCy and Pandas, to structure natural language and extract interesting insights quickly.
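As a taste of what "bringing structure" looks like, the sketch below turns a couple of documents into one-row-per-token records, the shape that drops straight into a pandas DataFrame. A naive whitespace tokenizer stands in for spaCy here so the example runs on its own (spaCy's `nlp(text)` would yield tokens enriched with lemmas, part-of-speech tags and entities, but needs a downloaded model); the documents are made up.

```python
# Structuring raw text: one row per token. A naive whitespace split
# stands in for spaCy's tokenizer; swapping in nlp(text) would add
# lemmas, POS tags and entities to each row.

docs = [
    "spaCy structures natural language",
    "Pandas tabulates the result",
]

rows = [
    {"doc_id": i, "position": j, "token": tok.lower()}
    for i, text in enumerate(docs)
    for j, tok in enumerate(text.split())
]
# rows is now structured data: pandas.DataFrame(rows) gives a table you
# can group, filter and count like any other.
```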

Introduction to spaCy

spaCy is a very popular Python package for advanced NLP — I have a beginner…

How to model the unexpected and unlikely the Bayesian way.

Image by Author.

Rare events are, by definition, well, rare. But, inevitably, they do happen, and when they do they have outsized consequences. 9/11 was a tail event. The financial crisis of 2007/08 was a tail event. Coronavirus was a tail event. Many of the everyday products you use, services you engage with and companies you work for are tail events. So, yes, tail events are rare, but when they happen their impact is huge.

Predicting a tail event or extreme value is intrinsically hard. In fact, one of the most challenging…
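To see why the modelling choice matters so much in the tail, compare a thin-tailed and a fat-tailed distribution at the same extreme value. The parameters below are illustrative only, and the snippet sidesteps the Bayesian machinery the post goes on to describe: it just shows how drastically the assumed tail shape changes the probability assigned to an extreme.

```python
import math

def normal_tail(x):
    # P(X > x) for a standard normal variable (thin, exponential-ish tail).
    return 0.5 * math.erfc(x / math.sqrt(2))

def pareto_tail(x, x_min=1.0, alpha=2.0):
    # P(X > x) for a Pareto variable (fat, power-law tail).
    return (x_min / x) ** alpha if x > x_min else 1.0

print(normal_tail(6))  # ~1e-9: the normal model calls x = 6 near-impossible
print(pareto_tail(6))  # ~0.028: the fat-tailed model does not
```

Under the thin-tailed model the extreme is essentially ruled out; under the fat-tailed one it is merely uncommon — a gap of seven orders of magnitude from the same data point.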

Image by Author.

A few years ago I developed a model to identify fraudulent transactions for an online two-sided marketplace. My initial model was based on characteristics of the transaction and its context. This model was quite good, but I wanted to make it better. I was already using a Gradient Boosted Tree and more hyperparameter tuning wasn’t leading to any significant performance gains. I turned again to the features.

The model was already using text messages between buyers and sellers, including metadata on the messages (such as number of messages exchanged), but it wasn’t considering the time between messages. Specifically, it wasn’t…
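To make that concrete, here is a hypothetical sketch of the kind of feature in question: the gaps between consecutive messages in a thread. The timestamps are invented for illustration.

```python
from datetime import datetime

# Made-up message timestamps from one buyer-seller thread.
messages = [
    datetime(2021, 5, 1, 9, 0, 0),
    datetime(2021, 5, 1, 9, 0, 4),
    datetime(2021, 5, 1, 9, 0, 7),
]

# Seconds elapsed between each consecutive pair of messages.
gaps = [(b - a).total_seconds() for a, b in zip(messages, messages[1:])]

# Summary statistics of the gaps (min, mean, ...) can then be fed to the
# gradient boosted tree as transaction-level features.
min_gap = min(gaps)
```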

Image by Author

This post is a short addendum to my longer analysis on earnings on Medium. In that post I estimated that one hour of member reading time is worth around $1.62 per day. One of the findings from that analysis is that internal member reading time is a stronger predictor of earnings than internal views. Nevertheless, it is still interesting to understand the relationship between internal views and earnings.

Views on Medium are split between internal and external views. Internal views are from Medium members and count toward earnings, whereas external views do not. …

Image by author.

I started writing on Medium in 2017, but only recently joined the partner program. My most popular post (link below) has accumulated 210K views and 95 hours of member reading time. Awesome. In total, I have earned $3.33 from it. Not so awesome. This is because I published the article in late 2017 but only joined the partner program in April 2021.

Photo by Lewis Keegan on Unsplash

Choosing the best layout, imagery and text to present to users on your webpage is hard. But, it doesn’t have to be. In fact *you* don’t need to choose — you can let data do it for you. In this post, I’m going to share a simple algorithm — implemented in Python — that will help you to identify the best webpage across multiple creative options. It’s time to get familiar with an awesome idea — the multiarmed bandit.

FYI I have a longer technical post on this subject here.

The Multiarmed… wait, what?

The multiarmed bandit (MAB) — the name derives from antiquated…
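As a taste of the idea before the explanation is cut off, the sketch below runs an epsilon-greedy strategy, one simple member of the MAB family, against simulated click rates; all numbers are invented. Mostly serve the variant with the best observed click rate, and occasionally explore a random one.

```python
import random

def epsilon_greedy(true_rates, steps=10000, epsilon=0.1, seed=1):
    rng = random.Random(seed)
    shows = [0] * len(true_rates)   # times each variant was served
    clicks = [0] * len(true_rates)  # clicks each variant received
    for _ in range(steps):
        if rng.random() < epsilon:
            arm = rng.randrange(len(true_rates))           # explore
        else:
            arm = max(range(len(true_rates)),              # exploit
                      key=lambda i: clicks[i] / shows[i] if shows[i] else 0.0)
        shows[arm] += 1
        clicks[arm] += rng.random() < true_rates[arm]      # simulated user
    return shows

shows = epsilon_greedy([0.02, 0.05, 0.11])  # variant 2 is truly best
# The bandit concentrates traffic on the best variant as evidence accrues.
```

In a real experiment the simulated draw is replaced by whether the user actually clicked; the algorithm itself is unchanged.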

Photo by Jean-Philippe Delberghe on Unsplash

Scikit-learn is *the* go-to package for standard machine learning models in Python. It not only provides most of the core algorithms that you would want to use in practice (e.g. GBMs, Random Forests, Logistic/Linear Regression), but also a wide range of transforms for feature preprocessing (e.g. one-hot encoding, label encoding) as well as metrics and other convenience functions for tracking model performance. But there will still be times when you need to do something a little bit different. In these instances I often still want to work within the general API of scikit-learn. …
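That general API is mostly a convention: an estimator learns state in `fit` (storing it in attributes with a trailing underscore) and applies it in `transform` or `predict`. The made-up quantile-clipping transformer below follows that contract; to keep the sketch dependency-free it doesn't subclass anything, though in practice you would inherit from `sklearn.base.BaseEstimator` and `TransformerMixin` to get `get_params` and pipeline support for free.

```python
class ClipTransformer:
    """Clip values to crude quantile bounds learned from training data."""

    def __init__(self, lower=0.05, upper=0.95):
        self.lower = lower
        self.upper = upper

    def fit(self, X, y=None):
        xs = sorted(X)
        n = len(xs)
        # Learn clip bounds from the training data: state is set in fit
        # (trailing underscore) and used in transform, the core sklearn
        # pattern.
        self.lo_ = xs[int(self.lower * (n - 1))]
        self.hi_ = xs[int(self.upper * (n - 1))]
        return self  # returning self enables fit(...).transform(...)

    def transform(self, X):
        return [min(max(x, self.lo_), self.hi_) for x in X]

t = ClipTransformer(lower=0.0, upper=0.9)
out = t.fit(list(range(11))).transform([100, 5, -3])
```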

Photo by Myriam Jessier on Unsplash

Being an effective data scientist often means being able to identify the right solution for a particular problem. In this post, I want to discuss three techniques that have enabled me to solve tricky problems across multiple contexts, but that aren't widely used. The three techniques are quantile regression, exponential decay regression and tree-embedded logistic regression. For each technique, I'll provide a motivating example for why I think it is worth adding to your toolkit, and I'll also wrap each one in a custom sklearn model so that you can easily apply it to your own problems.

All code…
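As a flavour of the first technique: quantile regression swaps the usual squared error for the pinball (quantile) loss, which penalizes over- and under-prediction asymmetrically, so minimizing it fits a chosen quantile rather than the mean. A self-contained sketch with made-up numbers:

```python
def pinball_loss(y_true, y_pred, q):
    # q in (0, 1) is the target quantile: under-predictions cost q per
    # unit of error, over-predictions cost (1 - q).
    total = 0.0
    for yt, yp in zip(y_true, y_pred):
        err = yt - yp
        total += q * err if err >= 0 else (q - 1) * err
    return total / len(y_true)

y = [1.0, 2.0, 3.0, 4.0, 100.0]  # heavy upper tail

# The same constant guess is scored very differently depending on which
# quantile you are targeting.
loss_med = pinball_loss(y, [3.0] * 5, q=0.5)  # targeting the median
loss_p90 = pinball_loss(y, [3.0] * 5, q=0.9)  # targeting the 90th pct
```

At q = 0.9 the loss punishes under-prediction of the tail point far more heavily, pushing the fitted value upward — which is exactly what makes it useful for skewed targets.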

Image by Author.

PyTorch is really fun to work with, and if you are looking for a framework to get started with neural networks I highly recommend it — see my short tutorial on how to get up and running with a basic neural net in PyTorch here.

What many people don't realise, however, is that PyTorch can be used for general gradient optimization. In other words, you can use PyTorch to find the minimum or maximum of arbitrarily complex optimization objectives. But why would you want to do this? I can think of at least three good reasons (there are many more).
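To keep this preview self-contained, here is the idea in plain Python rather than PyTorch: gradient descent on an arbitrary objective. What PyTorch adds is autograd: `loss.backward()` computes the gradient of any objective you can write, so you never hand-derive it as is done here for the toy objective f(x) = (x - 3)^2.

```python
def minimize(grad, x0, lr=0.1, steps=200):
    # Plain gradient descent: the same update torch.optim.SGD performs,
    # with the gradient supplied by hand instead of by autograd.
    x = x0
    for _ in range(steps):
        x -= lr * grad(x)
    return x

# Gradient of f(x) = (x - 3)^2 is 2 * (x - 3); the minimum is at x = 3.
x_min = minimize(lambda x: 2 * (x - 3), x0=0.0)
```

The payoff with PyTorch is that the objective can be arbitrarily complex — anything differentiable you can express in code — while the optimization loop stays this simple.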

Conor Mc.

Data Scientist, Economist, Pragmatist.
