Image by Author.

A few years ago I developed a model to identify fraudulent transactions for an online two-sided marketplace. My initial model was based on characteristics of the transaction and its context. This model was quite good, but I wanted to make it better. I was already using a Gradient Boosted Tree and more hyperparameter tuning wasn’t leading to any significant performance gains. I turned again to the features.

The model was already using text messages between buyers and sellers, including metadata on the messages (such as number of messages exchanged), but it wasn’t considering the time between messages. Specifically, it wasn’t…
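As a sketch of the kind of feature this points at, here is one way to derive message-gap features with pandas. The table layout and the column names (`conversation_id`, `sent_at`) are hypothetical, not the marketplace's actual schema.

```python
import pandas as pd

# Hypothetical messages table: one row per message, with a conversation id
# and a timestamp (names are illustrative, not from the original model).
messages = pd.DataFrame({
    "conversation_id": [1, 1, 1, 2, 2],
    "sent_at": pd.to_datetime([
        "2021-01-01 10:00", "2021-01-01 10:05", "2021-01-01 12:00",
        "2021-01-02 09:00", "2021-01-02 09:01",
    ]),
})

# Time between consecutive messages within each conversation, in seconds.
messages["gap_seconds"] = (
    messages.sort_values("sent_at")
            .groupby("conversation_id")["sent_at"]
            .diff()
            .dt.total_seconds()
)

# Aggregate per conversation into candidate model features.
gap_features = messages.groupby("conversation_id")["gap_seconds"].agg(
    ["mean", "median", "max"]
)
print(gap_features)
```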

Image by Author

This post is a short addendum to my longer analysis on earnings on Medium. In that post I estimated that one hour of member reading time is worth around $1.62 per day. One of the findings from that analysis is that internal member reading time is a stronger predictor of earnings than views are. Nevertheless, it is still interesting to understand the relationship between internal views and earnings.

Views on Medium are split between internal and external views. Internal views are from Medium members and count toward earnings, whereas external views do not. …

Image by author.

I started writing on Medium in 2017, but only recently joined the partner program. My most popular post (link below) has accumulated 210K views and 95K hours of member reading time. Awesome. In total, I have earned $3.33 from it. Not so awesome. This is because I published the article in late 2017 but only joined the partner program in April 2021.

Photo by Lewis Keegan on Unsplash

Choosing the best layout, imagery and text to present to users on your webpage is hard. But, it doesn’t have to be. In fact *you* don’t need to choose — you can let data do it for you. In this post, I’m going to share a simple algorithm — implemented in Python — that will help you to identify the best webpage across multiple creative options. It’s time to get familiar with an awesome idea — the multiarmed bandit.

FYI I have a longer technical post on this subject here.

The Multiarmed… wait, what?

The multiarmed bandit (MAB) — The name derives from antiquated…
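The post's full implementation isn't shown in this preview, but a minimal epsilon-greedy bandit, one of the simplest MAB strategies, looks roughly like the sketch below. The click rates are simulated purely for the demo; in production the rewards would come from real user clicks.

```python
import random

# Epsilon-greedy bandit sketch: each "arm" is one creative variant of the
# webpage; the reward is a simulated click (1) or no click (0).
true_click_rates = [0.04, 0.05, 0.07]  # unknown in practice; assumed for the demo
epsilon = 0.1
counts = [0] * len(true_click_rates)
values = [0.0] * len(true_click_rates)  # running mean reward per arm

for _ in range(10_000):
    if random.random() < epsilon:
        arm = random.randrange(len(true_click_rates))  # explore a random arm
    else:
        arm = values.index(max(values))                # exploit the current best
    reward = 1 if random.random() < true_click_rates[arm] else 0
    counts[arm] += 1
    values[arm] += (reward - values[arm]) / counts[arm]  # incremental mean update

print("estimated click rates:", [round(v, 3) for v in values])
```

Over time the exploit branch concentrates traffic on the best-performing variant, while the epsilon fraction of traffic keeps checking the alternatives.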

Photo by Jean-Philippe Delberghe on Unsplash

Scikit-learn is *the* go-to package for standard machine learning models in Python. It not only provides most of the core algorithms that you would want to use in practice (e.g. GBMs, Random Forests, Logistic/Linear regression), but also a wide range of transforms for feature preprocessing (e.g. one-hot encoding, label encoding), as well as metrics and other convenience functions for tracking model performance. But, there will still be times when you need to do something a little bit different. In these instances I often still want to work within the general API of scikit-learn. …
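For example, one way to stay inside scikit-learn's API is to subclass `BaseEstimator` and `TransformerMixin`, so a custom transform can sit in a `Pipeline` next to the built-in preprocessors. The quantile-clipping logic below is just an illustration of the pattern, not something from the original post.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin

class QuantileClipper(BaseEstimator, TransformerMixin):
    """Clip each feature to quantile bounds learned from the training data."""

    def __init__(self, lower=0.01, upper=0.99):
        self.lower = lower
        self.upper = upper

    def fit(self, X, y=None):
        X = np.asarray(X, dtype=float)
        # Learn per-column bounds on the training data only.
        self.lower_bounds_ = np.quantile(X, self.lower, axis=0)
        self.upper_bounds_ = np.quantile(X, self.upper, axis=0)
        return self

    def transform(self, X):
        X = np.asarray(X, dtype=float)
        return np.clip(X, self.lower_bounds_, self.upper_bounds_)
```

Because the class follows the fit/transform contract and inherits `get_params`/`set_params` from `BaseEstimator`, it works with `Pipeline`, cross-validation and grid search out of the box.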

Photo by Myriam Jessier on Unsplash

Being an effective data scientist often means being able to identify the right solution for a particular problem. In this post, I want to discuss three techniques that have enabled me to solve tricky problems across multiple contexts, but that aren’t widely used. The three techniques are quantile regression, exponential decay regression and tree-embedded logistic regression. For each technique, I’ll provide a motivating example for why I think it is worth adding it to your toolkit and will also wrap each technique in a custom sklearn model so that you can easily apply it to your own problems.
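As a taste of the first technique, scikit-learn's `GradientBoostingRegressor` supports a quantile loss directly, so a bare-bones quantile regression (here for the 90th percentile, on made-up data) can look like this. The post's custom sklearn wrappers are not reproduced here.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

# Hypothetical data: y has skewed noise, so the mean and the
# 90th percentile differ noticeably.
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(500, 1))
y = X.ravel() + rng.exponential(scale=2.0, size=500)

# Quantile regression via scikit-learn's built-in quantile loss.
q90 = GradientBoostingRegressor(loss="quantile", alpha=0.9)
q90.fit(X, y)

print(q90.predict([[5.0]]))  # an estimate of the 90th percentile, not the mean
```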

All code…

Image by Author.

Pytorch is really fun to work with, and if you are looking for a framework to get started with neural networks I highly recommend it — see my short tutorial on how to get up and running with a basic neural net in Pytorch here.

What many people don’t realise, however, is that Pytorch can be used for general gradient optimization. In other words, you can use Pytorch to find the minimum or maximum of arbitrarily complex optimization objectives. But, why would you want to do this? I can think of at least three good reasons (there are many more).
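A minimal sketch of the idea: define a differentiable objective, let autograd compute the gradients, and step an optimizer towards the minimum. The toy objective below is my own example, not one from the post.

```python
import torch

# Find the minimum of f(x, y) = (x - 3)^2 + (y + 1)^2 by gradient descent.
# The same pattern works for arbitrarily complex differentiable objectives.
params = torch.tensor([0.0, 0.0], requires_grad=True)
optimizer = torch.optim.Adam([params], lr=0.1)

for step in range(500):
    optimizer.zero_grad()
    loss = (params[0] - 3) ** 2 + (params[1] + 1) ** 2
    loss.backward()   # autograd computes the gradient for us
    optimizer.step()  # take a descent step

print(params.detach())  # close to [3.0, -1.0]
```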

Getting started in data science can be daunting. Data scientists are expected to blend several hard-to-acquire skill sets in one individual, namely statistics, software engineering and analytics. Knowing where to start or what to focus on is really difficult when you are new to the field or thinking of diving in (believe me, I remember it well!).

With that said, it is easy to fall into the trap of feeling so overwhelmed that you delay or stop your learning journey altogether. This would be a terrible outcome for you though. …

Image by author

Why Boosting Works

Gradient boosting is one of the most effective ML techniques out there. In this post I take a look at why boosting works. TL;DR: boosting corrects the mistakes of previous learners by fitting patterns in the residuals.

Boosting

In this post I take a look at boosting, with a focus on building an intuition for why this technique works. Most people who work in data science and machine learning will know that gradient boosting is one of the most powerful and effective algorithms out there. …
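The residual-fitting idea can be shown in a few lines: fit a shallow tree, subtract its (shrunken) predictions, then fit the next tree to what is left over. For squared error the residuals are exactly the negative gradient of the loss, which is where "gradient" boosting gets its name. The data here is synthetic and the loop is a bare-bones illustration, not a production implementation.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression problem: noisy sine wave.
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(300, 1))
y = np.sin(X.ravel()) + rng.normal(scale=0.2, size=300)

learning_rate = 0.1
prediction = np.zeros_like(y)
trees = []

for _ in range(100):
    residuals = y - prediction                 # what the ensemble still gets wrong
    tree = DecisionTreeRegressor(max_depth=2)  # a weak learner
    tree.fit(X, residuals)                     # fit the current mistakes
    prediction += learning_rate * tree.predict(X)
    trees.append(tree)

print("final MSE:", np.mean((y - prediction) ** 2))
```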

The dplyr package in R makes data wrangling significantly easier. The beauty of dplyr is that, by design, the options available are limited. Specifically, a set of key verbs form the core of the package. Using these verbs you can solve a wide range of data problems effectively in a shorter timeframe. Whilst transitioning to Python I have greatly missed the ease with which I can think through and solve problems using dplyr in R. The purpose of this short post is to demonstrate how to execute the key dplyr verbs when manipulating data using Python (with the pandas package).
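As a preview, here is a rough mapping from the core dplyr verbs (filter, select, mutate, arrange, group_by, summarise) to pandas, using made-up data. The pandas idioms below are one reasonable translation, not the only one.

```python
import pandas as pd

# Illustrative data.
df = pd.DataFrame({
    "species": ["a", "a", "b", "b"],
    "height": [10, 12, 20, 24],
    "width": [3, 4, 6, 7],
})

result = (
    df[df["height"] > 10]                            # filter()
      .loc[:, ["species", "height"]]                 # select()
      .assign(height_m=lambda d: d["height"] / 100)  # mutate()
      .sort_values("height")                         # arrange()
      .groupby("species", as_index=False)            # group_by() ...
      .agg(mean_height=("height", "mean"))           # ... + summarise()
)
print(result)
```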

dplyr…

Conor Mc.

Data Scientist, Economist, Pragmatist.
