My favorite pandas DataFrame viewer

Eyeballs help when cleaning data

By JJ Brosnan

In this blog, I’ll tell a story of value gained from Deephaven as a Python IDE while working on an application using baseball data.

Unfortunately, baseball data notation isn’t completely standardized. This article details data in baseball and how we overcame the challenges of dirty data. If there’s one thing I hope you take away, it’s this:

There are a few different avenues for obtaining baseball data within Python. I use the first on this list in the large gif atop this blog. Here are just a few:

They have their pros and cons, but if you’re interested in working with baseball data in Python, you should check all of them out — it’s really cool stuff.

Ryan’s models are meant to work on previous seasons. But we at Deephaven want it to work on the current season — we’re after that $5.6 million. That’s where the story of this blog really starts.

The pybaseball statcast data is easy to recreate. Get the start date for the 2022 MLB season, today’s date, import pybaseball, and perform a single function call.
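Those steps can be sketched in a few lines. This is a minimal sketch, assuming pybaseball is installed and that the 2022 regular season opened on April 7, 2022. The helper names `season_window` and `fetch_statcast_2022` are hypothetical; `statcast(start_dt, end_dt)` is pybaseball's own function and takes `YYYY-MM-DD` strings.

```python
from datetime import date

# Assumption: Opening Day of the 2022 regular season.
SEASON_START_2022 = date(2022, 4, 7)


def season_window(today=None):
    """Return (start_dt, end_dt) as YYYY-MM-DD strings for the season so far."""
    today = today or date.today()
    return SEASON_START_2022.isoformat(), today.isoformat()


def fetch_statcast_2022(today=None):
    """Download Statcast data for 2022 so far (hypothetical helper, needs network)."""
    # Imported lazily so the date helper works without the third-party package.
    from pybaseball import statcast

    start_dt, end_dt = season_window(today)
    return statcast(start_dt=start_dt, end_dt=end_dt)
```

Calling `fetch_statcast_2022()` returns a pandas DataFrame. The download can take a while for a multi-month window.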

Recreating retrosheet data is not as easy. Unfortunately, Retrosheet doesn’t publish its data until the end of each season. Thus, I had to recreate retrosheet for 2022.

Here’s a chunk of a single retrosheet file for the New York Yankees in 2021:

This is but a small chunk of a single event file, which contains information for every home game in a single season for a single team. A full season’s worth of retrosheet data is split into 30 files, with one for each team. The data is split as follows:

After many hours of work and scouring everywhere I could think for baseball information, I had finally finished the work and recreated retrosheet for every team thus far in 2022 (minus play-by-play info).

Here’s a chunk of my 2022 retrosheet (currentsheet?) for the Yankees:

There are only a few minor differences between my “currentsheet” files and the original retrosheet event files:

There’s a lot of data in retrosheet event files. When I first fed my currentsheet data into Ryan’s code, I got errors that were relatively easy to diagnose. They included:

After removing these two (and one or two others), everything looked in tip-top shape. So, I was pretty frustrated when I got another error. I had the pybaseball statcast data for all of the correct games, so why wasn’t my code working?

I spent quite some time tracking down bugs in the code (and my faux retrosheet data) only to come up with nothing. When I did all of this work, I did it in a standard Python session from my terminal with no GUI. I could print the data (it’s all stored in Pandas DataFrames), but printing a DataFrame will not print all rows and columns if there are too many, and with this stuff, there are a LOT. Even when a DataFrame isn’t huge, it’s not very pleasant to look at when printed to a console.
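For anyone stuck in a plain terminal, pandas can be told to stop eliding rows and columns. The sketch below uses a made-up wide DataFrame as a stand-in for the baseball data and lifts the display limits temporarily with `pd.option_context`. It shows everything, but the output is still hard on the eyes, which is the point of the paragraph above.

```python
import pandas as pd

# Hypothetical stand-in for the real data: 100 rows by 40 columns.
df = pd.DataFrame({f"col_{i}": range(100) for i in range(40)})

# Default rendering: pandas elides rows and columns past its display limits.
default_view = repr(df)

# Lift the limits for this block only; the defaults return afterwards.
with pd.option_context("display.max_rows", None, "display.max_columns", None):
    full_view = repr(df)

print(len(default_view.splitlines()), "lines by default")
print(len(full_view.splitlines()), "lines with limits lifted")
```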

I had been doing work with Deephaven for something else entirely, and had finally wrapped that up. I knew exactly which line in Ryan’s code produced an error, but had yet to figure out why. So, I spun up Deephaven to have a look at my DataFrames that were being merged.

Here are the side-by-side DataFrames as displayed in Deephaven:

Hmm, looks the same. Or does it?

So, the starting pitcher for the Miami Marlins is causing issues. How come? Well, his name is Max Meyer, and he never played in an MLB game prior to 2022. That means that he does not have a retrosheet player ID, and as such, the code converts that lack of ID to a NaN value in the DataFrame. This is a rather hard-to-find bug with a simple fix.
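Here is a small sketch of how one missing ID changes a whole column. The names other than Max Meyer and all of the ID values are made up for illustration, and the nullable-integer conversion at the end is one possible fix, not necessarily the one applied to Ryan’s code.

```python
import pandas as pd

# Hypothetical ID lookup table: made-up names mapped to made-up integer IDs.
known_ids = pd.Series({"Veteran A": 101, "Veteran B": 102})
starters = pd.Series(["Veteran A", "Max Meyer", "Veteran B"], name="starter")

# map() yields NaN where no ID exists, which upcasts the column to float64.
ids = starters.map(known_ids)
print(ids.dtype)

# The rows worth inspecting are the ones whose ID came back missing.
missing = starters[ids.isna()]
print(missing.tolist())

# One possible fix (an assumption): pandas' nullable integer dtype keeps
# the IDs as integers while still marking the gap explicitly.
ids_nullable = ids.astype("Int64")
```

A float column where integers were expected is easy to miss in a truncated printout and much easier to spot in a table view.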

That’s not where this tale ends. After fixing the Max Meyer issue, I re-ran the code and now had a column of integers. But by then I was on high alert, so I scrolled through the data for 2022, and during the scroll, I noticed something:

The starting pitcher for the Boston Red Sox on July 4, 2022 apparently has a -1 for a player ID (hint: that’s not a valid player ID). What’s happening there? That starting pitcher’s name is Austin Davis. And as it turns out, there was once another baseball player named Austin Davis:
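A quick sanity check can surface this kind of sentinel without scrolling. The sketch below assumes that valid player IDs in this pipeline are positive integers. The helper `invalid_id_rows`, the column names, and every value other than the -1 for Austin Davis are hypothetical.

```python
import pandas as pd

# Hypothetical starters table; IDs are made up, and -1 marks a failed lookup.
starters = pd.DataFrame({
    "date": ["2022-07-04", "2022-07-05", "2022-07-06"],
    "starter": ["Austin Davis", "Pitcher B", "Pitcher C"],
    "player_id": [-1, 102, 103],
})


def invalid_id_rows(df, id_col="player_id"):
    """Return rows whose ID is missing, non-positive, or not a whole number."""
    ids = pd.to_numeric(df[id_col], errors="coerce")
    bad = ids.isna() | (ids <= 0) | (ids % 1 != 0)
    return df[bad]


suspects = invalid_id_rows(starters)
print(suspects)
```

Running a check like this right after the ID lookup would flag both the NaN case and the -1 case before the merge.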

So, I spent HOURS looking for a bug caused by missing data for a second baseman who played in a single game in the 1946 Negro National League II season for the New York Black Yankees. This was me when I realized it:

While AI cannot guarantee that you’ll Beat the Streak, you can flip the odds in your favor. Just like the Oakland Athletics flipped their odds in 2002 using sabermetrics, you too can do so by using deep learning. As you read above, the models are only part of the story: you need to visualize and process your data with care to make accurate predictions.

Data scientists, engineers, and analysts alike spend only a small fraction of their time writing code to process data. Most of their time is spent ironing out kinks in code caused by these gremlins. Max Meyer and Austin Davis were the gremlins wreaking havoc in my baseball data. What will the gremlins turn out to be in your next project?

Don’t be like me and foolishly think “I don’t need to see my data while I work with it.” Deephaven is a powerful tool even if you don’t use the table API. It can drastically cut down on the time you spend debugging issues due to gremlins in your data.
