Getting ready for machine learning - cleaning up free NHL game and odds datasets
Before beginning any feature engineering or ML, it’s necessary to clean up the data. In this article, we work through a real-life example.
While it’s always glamorous to focus on the few lines of code that read in a dataset, optimize some hyperparameters, and fit a machine learning model, that is only a small fraction of the whole process. In the real world you’re not dealing with the pre-processed, nicely packaged example data you find in scikit-learn. Instead, data comes in a variety of formats that may need to be manipulated to fit your goals, and it is often missing or incorrect and needs to be filled in. In many cases the data doesn’t exist at all and has to be created via an API, web scraping, or manual effort.
I was looking into creating some models to predict NHL games, but found only one viable source of betting information that was also free: Sports Book Reviews Online, which provides yearly historical odds in Excel format. The data is mostly complete but has some issues and comes in a bit of a weird format. In this article I’ll document my process for cleaning up, merging, and manipulating the data for use in future analysis.
As always, the full code is available on GitHub.
First, I downloaded the data, newest year to oldest, and explored it. It has quite a bit of information, including the over/under and the opening lines. The puck line only appears in the later seasons.
Date | Rot | VH | Team | 1st | 2nd | 3rd | Final | Open | Close | PuckLine | PuckLineOdds | OpenOU | OpenOUOdds | CloseOU | CloseOUOdds
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
113 | 41 | V | Pittsburgh | 1 | 1 | 1 | 3 | -110 | -115 | 1.5 | -310 | 6 | -110 | 6 | 105 |
113 | 42 | H | Philadelphia | 2 | 1 | 3 | 6 | 100 | -105 | -1.5 | 260 | 6 | -110 | 6 | -125 |
113 | 43 | V | Montreal | 2 | 1 | 1 | 0 | 132 | 120 | 1.5 | -220 | 6 | -110 | 6 | -110 |
113 | 44 | H | Toronto | 1 | 2 | 1 | 1 | -145 | -140 | -1.5 | 195 | 6 | -110 | 6 | -110 |
113 | 45 | V | Chicago | 0 | 0 | 1 | 1 | 230 | 210 | 1.5 | -125 | 6.5 | -110 | 6.5 | -110 |
113 | 46 | H | TampaBay | 3 | 1 | 1 | 5 | -260 | -250 | -1.5 | 105 | 6.5 | -110 | 6.5 | -110 |
113 | 47 | V | Vancouver | 1 | 1 | 3 | 5 | 135 | 130 | 1.5 | -215 | 5.5 | -110 | 6.5 | 100 |
113 | 48 | H | Edmonton | 0 | 1 | 2 | 3 | -150 | -150 | -1.5 | 190 | 5.5 | -110 | 6.5 | -120 |
113 | 49 | V | St.Louis | 2 | 0 | 2 | 4 | 135 | 125 | 1.5 | -200 | 5.5 | -110 | 6 | -110 |
There were a couple of issues with the data that I had to fix manually in Excel before I could start using the files in Python. The main one: in the raw spreadsheets, the PuckLine, OpenOU, and CloseOU headers each span two columns, the line itself and its odds, so the header row doesn’t line up with the data. Although I’m sure there is a way to do this programmatically, since it was only a few years of data I manually split each of these headers (into PuckLine/PuckLineOdds, OpenOU/OpenOUOdds, CloseOU/CloseOUOdds) and exported each file to .csv.
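If you’d rather skip the manual step, a rough sketch of the programmatic route could look like the following. This is my own addition, not the original workflow: the file name is hypothetical, and it assumes each raw .xlsx has exactly one misaligned header row to skip.
import pandas as pd
# Assumed column order, matching the sample table above, with the
# line/odds pairs split into separate names
cols = ['Date', 'Rot', 'VH', 'Team', '1st', '2nd', '3rd', 'Final',
        'Open', 'Close', 'PuckLine', 'PuckLineOdds',
        'OpenOU', 'OpenOUOdds', 'CloseOU', 'CloseOUOdds']
# Skip the misaligned header row and assign the real names directly
raw = pd.read_excel('nhl_odds/2021.xlsx', header=None, skiprows=1, names=cols)
raw.to_csv('nhl_odds/2021.csv', index=False)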
From there, I can import them into pandas. I loop through the CSVs season by season and combine them all into a single dataframe. There is some inconsistent naming across columns, namely OpenOU / Open OU and CloseOU / Close OU, so I first apply consistent names. Additionally, the puck line is only present in later seasons, so I dropped those columns entirely so each year would have the same set of columns.
import pandas as pd
import numpy as np

# Read each season's CSV, tag it with the year, and collect the frames
frames = []
for year in range(2008, 2022):
    temp_df = pd.read_csv(f'nhl_odds/{year}.csv')
    temp_df['Year'] = str(year)
    # Address inconsistencies in column naming across seasons
    temp_df.rename(columns={'Open OU': 'OpenOU', 'Close OU': 'CloseOU'}, inplace=True)
    frames.append(temp_df)
# DataFrame.append was removed in pandas 2.0; pd.concat is the replacement.
# Not passing ignore_index keeps each season's original row labels.
combined_df = pd.concat(frames)
# Drop columns that are only in later data
combined_df.drop(columns=['Puck Line', 'PuckLineOdds', 'PuckLine'], inplace=True, errors='ignore')
combined_df
 | Date | Rot | VH | Team | 1st | 2nd | 3rd | Final | Open | Close | OpenOU | OpenOUOdds | CloseOU | CloseOUOdds | Year
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
0 | 929 | 1 | V | Anaheim | 0 | 0 | 1 | 1 | -155 | -123 | 6.0 | 100 | 6.0 | -105 | 2008 |
1 | 929 | 2 | H | LosAngeles | 1 | 1 | 2 | 4 | 135 | 103 | 6.0 | -120 | 6.0 | -115 | 2008 |
2 | 930 | 51 | V | LosAngeles | 0 | 1 | 0 | 1 | 115 | 132 | 6.0 | -115 | 6.0 | 112 | 2008 |
3 | 930 | 52 | H | Anaheim | 2 | 2 | 0 | 4 | -135 | -152 | 6.0 | -105 | 6.0 | -132 | 2008 |
The main concern I have with the data is that it has two rows per game: one for the visiting team and one for the home team. Analysis would be much easier with each game combined into a single row. Before merging the rows, however, I needed to evaluate the data and clean up any errors.
I noticed a few issues with the data:

- A handful of individual cells contain garbled or non-numeric values.
- Some moneyline odds are recorded as 0, which is not a valid price.
- Several over/under fields contain ‘NL’ (no line) instead of a number.
- A few over/under totals were entered as negative numbers.

When there was a specific instance, I updated the cell itself. When there were multiple instances, I mass-replaced the offending data, defaulting missing odds to -110 (the standard vig) and missing totals to 5.5 (a common NHL total).
# Fix individual garbled cells (row positions found by inspection);
# iloc with get_loc avoids chained-assignment pitfalls
combined_df.iloc[33752, combined_df.columns.get_loc('Open')] = -110
combined_df.iloc[33753, combined_df.columns.get_loc('Open')] = -110
combined_df['Open'] = pd.to_numeric(combined_df['Open'])
combined_df.iloc[33875, combined_df.columns.get_loc('Final')] = 5
combined_df['Final'] = pd.to_numeric(combined_df['Final'])
combined_df.iloc[34658, combined_df.columns.get_loc('CloseOUOdds')] = 100

# Moneyline odds of 0 are invalid; treat them as missing and default to -110.
# Fill per column -- a frame-wide fillna would overwrite NaNs everywhere else.
combined_df.loc[combined_df.Open == 0, 'Open'] = np.nan
combined_df['Open'] = combined_df['Open'].fillna(-110)
combined_df.loc[combined_df.Close == 0, 'Close'] = np.nan
combined_df['Close'] = combined_df['Close'].fillna(-110)

# 'NL' means no line was offered; default totals to 5.5 and odds to -110
combined_df.loc[combined_df.OpenOU == 'NL', 'OpenOU'] = np.nan
combined_df.loc[combined_df.CloseOU == 'NL', 'CloseOU'] = np.nan
combined_df.loc[combined_df.OpenOUOdds == 'NL', 'OpenOUOdds'] = np.nan
combined_df.loc[combined_df.CloseOUOdds == 'NL', 'CloseOUOdds'] = np.nan
combined_df['OpenOU'] = pd.to_numeric(combined_df['OpenOU']).fillna(5.5)
combined_df['CloseOU'] = pd.to_numeric(combined_df['CloseOU']).fillna(5.5)
combined_df['OpenOUOdds'] = pd.to_numeric(combined_df['OpenOUOdds']).fillna(-110)
combined_df['CloseOUOdds'] = pd.to_numeric(combined_df['CloseOUOdds']).fillna(-110)

# A few totals were entered as negative numbers; treat those as missing too
combined_df.loc[combined_df.OpenOU < 0, 'OpenOU'] = np.nan
combined_df['OpenOU'] = combined_df['OpenOU'].fillna(5.5)
combined_df.loc[combined_df.CloseOU < 0, 'CloseOU'] = np.nan
combined_df['CloseOU'] = combined_df['CloseOU'].fillna(5.5)
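Before moving on, a quick sanity check (my addition, not part of the original cleanup) confirms the odds columns are now numeric and fully populated:
num_cols = ['Open', 'Close', 'OpenOU', 'OpenOUOdds', 'CloseOU', 'CloseOUOdds']
# Every cleaned column should be numeric with no missing values left
assert combined_df[num_cols].notna().all().all()
assert all(pd.api.types.is_numeric_dtype(combined_df[c]) for c in num_cols)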
Next, I looked at team names, since the same team can appear under several names (e.g. Tampa, TampaBay, Tampa Bay). Because I was looking at games prior to Seattle joining the league, I expected 31 unique teams, but found more. I looked at the list of unique names and applied a mapping to get the count down to 31.
combined_df.Team.unique()
# Map duplicates and relocated franchises to a single canonical name
# (note 'Tampa' must map straight to 'TampaBay'; replace applies the
# dict in one pass, so chaining 'Tampa' -> 'Tampa Bay' -> 'TampaBay' won't work)
fixed_names = {'NY Islanders': 'NYIslanders', 'Tampa Bay': 'TampaBay', 'Tampa': 'TampaBay',
               'Arizonas': 'Arizona', 'WinnipegJets': 'Winnipeg', 'Atlanta': 'Winnipeg',
               'Phoenix': 'Arizona'}
combined_df.replace({'Team': fixed_names}, inplace=True)
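A one-line assertion (again my addition) confirms the mapping worked:
# With Seattle not yet in the league, exactly 31 franchises should remain
assert combined_df['Team'].nunique() == 31, sorted(combined_df['Team'].unique())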
Finally, I could combine the games into a single row per game. The data is formatted consistently: the visiting team is always the first row, and the home team is the row immediately after it. This simplifies the merge, because I can use shift to grab the row below rather than searching further down the data for the matching home row. After merging the data, I renamed the columns to match the new schema.
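Since shift(-1) silently pairs whatever row happens to come next, it’s worth verifying the alternation assumption first. This quick check is my addition, not part of the original code:
# Rows must strictly alternate visitor, home for shift(-1) to pair correctly
vh = combined_df['VH'].to_numpy()
assert (vh[0::2] == 'V').all() and (vh[1::2] == 'H').all()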
combined_df['HomeTeam'] = combined_df.Team.shift(-1)
combined_df['Home1st'] = combined_df['1st'].shift(-1)
combined_df['Home2nd'] = combined_df['2nd'].shift(-1)
combined_df['Home3rd'] = combined_df['3rd'].shift(-1)
combined_df['HomeFinal'] = combined_df.Final.shift(-1)
combined_df['MlHomeOpen'] = combined_df.Open.shift(-1)
combined_df['MlHomeClose'] = combined_df.Close.shift(-1)
# For the over/under, the visitor row carries the over and the home row the under
# The line itself is identical for both, so only the under odds need pulling from the home row
combined_df['OpenUOdds'] = combined_df.OpenOUOdds.shift(-1)
combined_df['CloseUOdds'] = combined_df.CloseOUOdds.shift(-1)
# Now we've pulled the needed data into a single row, so we can drop the extra (now in-correct) Home row
combined_df.drop(combined_df[combined_df['VH'] == 'H'].index, inplace = True)
# Clean up and rename columns
combined_df.drop(columns=['Rot','VH'], inplace=True)
combined_df.rename(columns={'Team': 'AwayTeam', '1st': 'Away1st', '2nd': 'Away2nd', '3rd': 'Away3rd', 'Final': 'AwayFinal', \
'Open': 'MlAwayOpen', 'Close': 'MlAwayClose', 'OpenOUOdds': 'OpenOOdds', 'CloseOUOdds': 'CloseOOdds'}, inplace=True)
combined_df
 | Date | AwayTeam | Away1st | Away2nd | Away3rd | AwayFinal | MlAwayOpen | MlAwayClose | OpenOU | OpenOOdds | … | Year | HomeTeam | Home1st | Home2nd | Home3rd | HomeFinal | MlHomeOpen | MlHomeClose | OpenUOdds | CloseUOdds
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
0 | 929 | Anaheim | 0 | 0 | 1 | 1 | -155.0 | -123.0 | 6.0 | 100 | … | 2008 | LosAngeles | 1.0 | 1.0 | 2.0 | 4.0 | 135.0 | 103.0 | -120.0 | -115.0 |
2 | 930 | LosAngeles | 0 | 1 | 0 | 1 | 115.0 | 132.0 | 6.0 | -115 | … | 2008 | Anaheim | 2.0 | 2.0 | 0.0 | 4.0 | -135.0 | -152.0 | -105.0 | -132.0 |
4 | 1003 | Anaheim | 0 | 2 | 0 | 2 | 165.0 | 190.0 | 5.5 | 105 | … | 2008 | Detroit | 1.0 | 0.0 | 1.0 | 3.0 | -185.0 | -230.0 | -125.0 | -140.0 |
6 | 1003 | Montreal | 1 | 0 | 1 | 3 | 130.0 | 140.0 | 6.0 | -105 | … | 2008 | Carolina | 1.0 | 0.0 | 1.0 | 2.0 | -150.0 | -160.0 | -115.0 | -130.0 |
8 | 1003 | Ottawa | 2 | 0 | 1 | 4 | -120.0 | -125.0 | 6.0 | -120 | … | 2008 | Toronto | 2.0 | 1.0 | 0.0 | 3.0 | 100.0 | 105.0 | 100.0 | 115.0 |
… | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … |
1894 | 628 | Montreal | 0 | 1 | 0 | 1 | 185.0 | 180.0 | 5.0 | -120 | … | 2021 | TampaBay | 1.0 | 1.0 | 3.0 | 5.0 | -225.0 | -200.0 | 100.0 | 105.0 |
1896 | 630 | Montreal | 0 | 1 | 0 | 1 | 180.0 | 177.0 | 5.0 | -120 | … | 2021 | TampaBay | 0.0 | 2.0 | 1.0 | 3.0 | -220.0 | -197.0 | 100.0 | 100.0 |
1898 | 702 | TampaBay | 2 | 2 | 2 | 6 | -145.0 | -123.0 | 5.0 | -120 | … | 2021 | Montreal | 1.0 | 1.0 | 1.0 | 3.0 | 125.0 | 113.0 | 100.0 | 100.0 |
1900 | 705 | TampaBay | 0 | 1 | 1 | 2 | -145.0 | -150.0 | 5.0 | -135 | … | 2021 | Montreal | 1.0 | 0.0 | 1.0 | 3.0 | 125.0 | 135.0 | 115.0 | 120.0 |
1902 | 707 | Montreal | 0 | 0 | 0 | 0 | 180.0 | 215.0 | 5.0 | -120 | … | 2021 | TampaBay | 0.0 | 1.0 | 0.0 | 1.0 | -220.0 | -245.0 | 100.0 | 120.0 |
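One last spot check before exporting, also my addition: no game should list the same club on both sides, and a per-season row count is a quick way to eyeball completeness.
# No team plays itself; group by season to eyeball game counts
assert (combined_df['AwayTeam'] != combined_df['HomeTeam']).all()
print(combined_df.groupby('Year').size())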
Finally, with the data clean and in the format we want, we can export the data to CSV and use it in a future article.
combined_df.to_csv('x.csv', index=True)