Exploring the MNIST Digits Dataset

Introduction: The MNIST digits dataset is a famous dataset of handwritten digit images. You can read more about it on Wikipedia or Yann LeCun’s page. It’s a useful dataset because it provides an example of a fairly simple, straightforward image-processing task for which we know exactly what state-of-the-art accuracy is. I plan to use this dataset for a couple of upcoming machine learning blog posts, and since the first step of pretty much any ML task is ‘explore your data,’ I figured I would post this first, so I can refer back to it instead of repeating it in each subsequent post.

Processing Multiple Pandas DataFrame Columns in Parallel

Introduction: Python’s Pandas library is great for all sorts of data-processing tasks. However, one thing it doesn’t support out of the box is parallel processing across multiple cores. I’ve been wanting a simple way to process Pandas DataFrames in parallel, and recently I found this truly awesome blog post. It shows how to apply an arbitrary Python function to each object in a sequence, in parallel, using Pool.
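
As a rough illustration of the idea (not the code from the referenced post), here is a minimal sketch that assumes Pool means multiprocessing.Pool from the standard library. The names standardize and process_columns are hypothetical helpers made up for this example; each DataFrame column is handed to a worker process as a separate task.

```python
from multiprocessing import Pool

import numpy as np
import pandas as pd


def standardize(column):
    """Hypothetical per-column function: rescale to zero mean and unit variance."""
    return (column - column.mean()) / column.std()


def process_columns(df, func, n_workers=2):
    """Apply func to every column of df, one column per task, in a process pool."""
    columns = [df[name] for name in df.columns]
    with Pool(processes=n_workers) as pool:
        # pool.map returns results in the same order as the input sequence.
        results = pool.map(func, columns)
    # Reassemble the processed Series into a DataFrame, keeping column order.
    return pd.concat(results, axis=1)


if __name__ == "__main__":
    # The guard keeps worker processes from re-running this section on import.
    rng = np.random.default_rng(0)
    df = pd.DataFrame(rng.normal(size=(1000, 4)), columns=list("abcd"))
    out = process_columns(df, standardize)
    print(out.describe().loc[["mean", "std"]])
```

Each column has to be pickled and sent to a worker, so this only tends to pay off when the per-column function is expensive relative to that overhead.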

"You rarely want to use DataFrame.apply"

Tom Augspurger, one of the maintainers of Python’s Pandas library for data analysis, has an awesome series of blog posts on writing idiomatic Pandas code. In fact, you should probably leave this site now and go read one of those blog posts; they’re really good. His post on Performance has an especially interesting tip: “You rarely want to use DataFrame.apply and almost never should use it with axis=1 [which processes the DataFrame row-by-row, ‘across columns’].”
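
To make the tip concrete, here is a small sketch comparing the two styles on a made-up DataFrame (the price and quantity columns are assumptions for illustration, not data from the post). Both lines compute the same result; the second works on whole columns at once instead of calling a Python function per row.

```python
import pandas as pd

# Hypothetical example data.
df = pd.DataFrame({"price": [2.5, 4.0, 1.25], "quantity": [4, 1, 8]})

# Row-by-row: pandas builds a Series for each row and calls the lambda once per row.
total_apply = df.apply(lambda row: row["price"] * row["quantity"], axis=1)

# Vectorized: a single multiplication over entire columns.
total_vectorized = df["price"] * df["quantity"]

# Both approaches give the same values.
assert total_apply.equals(total_vectorized)
```

On a three-row frame the difference is invisible, but the per-row version scales with the number of Python function calls, which is typically what makes it slow on large data.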

Problem With nflScrapR GoalToGo Variable

My new favorite dataset is the trove of NFL play-by-play data now downloadable in R through nflscrapR. However, in my previous post ‘Exploring NFL Play by Play Data with NFLScrapR’, I noticed something that didn’t look right with the GoalToGo variable, based on the plot below. The problem here is that the probability distribution in the right sub-plot says that for GoalToGo situations (e.g. ‘3rd and Goal’), a play never gained more than 10 yards.

Exploring NFL Yards-Per-Play Distributions, using R/ggplot

In this post I’d like to dive into NFL yards-per-play (ypp) outcomes, looking at ypp distributions under different conditions. I’ll be using NFL 2009-2015 play-by-play data that I’ve downloaded using the awesome R library nflscrapR. For an overview of the variables available with nflscrapR, see my previous post ‘Exploring NFL Play by Play Data with NFLScrapR’.

Load/Prep the Data

#Load libraries
library(ggplot2)
library(sqldf)
library(dplyr)
library(scales)
library(assertthat)
#Import data
pbp_data <- read.

Pandas Data Wrangling: Avoiding that 'SettingWithCopyWarning'

If you use Python for data analysis, you probably use Pandas for data munging. And if you use Pandas, you’ve probably come across the warning below:

SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy

The Pandas documentation is great in general, but it’s easy to read through the link above and still be confused.
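
For orientation, here is a minimal sketch of the two patterns the warning message itself points toward, using a made-up DataFrame (the team and yards columns are assumptions for illustration). It is not necessarily the approach the full post takes.

```python
import pandas as pd

# Hypothetical example data.
df = pd.DataFrame({"team": ["NE", "DEN", "NE"], "yards": [5, -2, 12]})

# Chained indexing such as df[df["team"] == "NE"]["yards"] = 0 assigns into a
# temporary object, which is the pattern the warning is about.

# A single .loc call selects rows and column together and assigns into df itself.
df.loc[df["team"] == "NE", "yards"] = 0

# When a separate subset is wanted, an explicit copy makes the intent unambiguous.
den = df[df["team"] == "DEN"].copy()
den["yards"] = den["yards"] + 10
```

After this runs, df holds the zeroed NE rows, while the change to den stays confined to the copy and leaves df untouched.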

Exploring NFL Play by Play Data with NFLScrapR

nflscrapR is an awesome R library which queries the official NFL API for play-by-play data and parses it into an R dataframe. Data is available from 2009 through the latest week of the current season. In this blog post, I’ll explore the seven full seasons of play-by-play data available from 2009-2015. Getting the Data: To download a season of play-by-play data to an R dataframe, execute the following in an R session: