In this post I'm going to scrape www.basketball-reference.com to get general information on every player who has played in the NBA. I'm also going to get per-state population data from Wikipedia.
import numpy as np
import pandas as pd
from bs4 import BeautifulSoup, SoupStrainer
import sys
import string
import requests
import datetime
import progressbar
import time
First, let's get the information from basketball-reference. There are great tutorials on web scraping available online. Note that websites frequently change their layout, so scraping code typically needs to be modified over time.
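One way to catch a layout change early is to verify up front that the page still has the structure the parser expects. Here is a minimal sketch; the expected header names are an assumption based on the current layout:
# fail fast if the players table no longer looks the way the parser expects
# (the header names below are an assumption based on the current layout)
EXPECTED_COLS = {'From', 'To', 'Pos', 'Ht', 'Wt', 'Birth Date', 'Colleges'}

def check_layout(url='http://www.basketball-reference.com/players/a/'):
    soup = BeautifulSoup(requests.get(url).text, "lxml")
    table = soup.find('table')
    if table is None:
        raise RuntimeError('players table not found - layout may have changed')
    headers = {th.text.strip() for th in table.findAll('th')}
    missing = EXPECTED_COLS - headers
    if missing:
        raise RuntimeError('table layout changed, missing columns: %s' % sorted(missing))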
Basketball-reference sorts players by the first letter of their last name; see http://www.basketball-reference.com/players/a/ for an example. Each such page contains a table with the players and their information. First we are going to read the table and extract that information.
def player_info():
    '''
    This function web scrapes basketball-reference and extracts each player's info.
    '''
    players = []
    base_url = 'http://www.basketball-reference.com/players/'
    # get player tables from the alphabetical list pages
    for letter in string.ascii_lowercase:
        page_request = requests.get(base_url + letter)
        soup = BeautifulSoup(page_request.text, "lxml")
        # find the players table in the soup
        table = soup.find('table')
        if table:
            table_body = table.find('tbody')
            # loop over the list of players
            for row in table_body.findAll('tr'):
                # get name and url
                player_url = row.find('a')
                player_names = player_url.text
                player_pages = player_url['href']
                # get some player info from the table cells
                cells = row.findAll('td')
                active_from = int(cells[0].text)
                active_to = int(cells[1].text)
                position = cells[2].text
                height = cells[3].text
                weight = cells[4].text
                birth_date = cells[5].text
                college = cells[6].text
                # create entry
                player_entry = {'url': player_pages,
                                'name': player_names,
                                'active_from': active_from,
                                'active_to': active_to,
                                'position': position,
                                'college': college,
                                'height': height,
                                'weight': weight,
                                'birth_date': birth_date}
                # append player dictionary
                players.append(player_entry)
    return pd.DataFrame(players)
players_general_info = player_info() # call function that scrapes general info
players_general_info.head() # preview
Each player also has a personal page with additional information, for example http://www.basketball-reference.com/players/a/abdulka01.html. So I'm going to visit each player's page and extract that information as well.
This will take much longer because we have to scrape a page for every player. Therefore, I wrote a function that scrapes a single player's page, and I'm going to run it in a loop with a status bar (I always prefer a status bar for scripts that run for a while):
def player_detail_info(url):
    '''
    Scrape a player's personal page. Input is the player's url (without www.basketball-reference.com).
    '''
    # we do not need to parse the whole page; the information we want sits in the <p> tags
    personal = SoupStrainer('p')
    page_request = requests.get('http://www.basketball-reference.com' + url)
    soup = BeautifulSoup(page_request.text, "lxml", parse_only=personal)  # parse only the part we are interested in
    p = soup.findAll('p')
    # initialize some values - sometimes they are not available
    shoots = None
    birth_place = None
    high_school = None
    draft = None
    position_str = None
    # loop over the personal info paragraphs and pick out the fields we want
    for prow in p:
        # look for the shoots field
        if 'Shoots:' in prow.text:
            s = prow.text.replace('\n', '').split(u'\u25aa')  # clean text
            if len(s) > 1:
                shoots = s[1].split(':')[1].strip()
        # look for the position
        elif 'Position:' in prow.text:
            s = prow.text.replace('\n', '').split(u'\u25aa')
            if len(s) > 1:
                position_str = s[0].split(':')[1].strip()
            else:
                position_str = prow.text.split('Position:')[1].strip()  # needed when the shoots field is missing
        # look for the birth place
        elif 'Born:' in prow.text:
            s = prow.text.split(u'in\xa0')  # clean text
            if len(s) > 1:
                birth_place = s[1]
        elif 'High School:' in prow.text:
            s = prow.text.replace('\n', '').split(':')
            if len(s) > 1:
                high_school = s[1].lstrip()
        elif 'Draft:' in prow.text:
            s = prow.text.replace('\n', '').split(':')
            if len(s) > 1:
                draft = s[1].lstrip()
    # collect all of the info in a dictionary
    player_entry = {'url': url,
                    'birth_place': birth_place,
                    'shoots': shoots,
                    'high_school': high_school,
                    'draft': draft,
                    'position_str': position_str}
    return player_entry
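As a quick check, we can run the function on the example page from above; it should return a dictionary with the fields we just extracted:
player_detail_info('/players/a/abdulka01.html')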
Now we can run the main loop to create the list:
players_details_info_list = []
bar = progressbar.ProgressBar(max_value=len(players_general_info))
for i, url in enumerate(players_general_info.url):
    try:
        players_details_info_list.append(player_detail_info(url))
    except Exception:
        print('cannot load: %s; location %d' % (url, i))
    bar.update(i)
    time.sleep(0.1)  # pause between requests to go easy on the server
players_detail_df = pd.DataFrame(players_details_info_list) # convert to dataframe
players_detail_df.head() # preview
Now we can join both DataFrames so we have one DataFrame with all of the information.
players_df = players_general_info.merge(players_detail_df,how='outer',on='url')
players_df.head() # preview
Let's save the data before we proceed:
players_df.to_csv('player_info_raw.csv', encoding='utf-8') # specify the encoding, otherwise non-ASCII characters raise an error
We need to do some cleaning of the data. I'm going to start by looking at the data type of each column:
players_df.dtypes
# convert weight (lbs) to numeric
players_df['weight'] = pd.to_numeric(players_df['weight'], errors='coerce')
# convert height from a 'feet-inches' string to inches
height_in_inches = players_df['height'].str.split('-', expand=True)
players_df['height_in_inches'] = 12.0*pd.to_numeric(height_in_inches[0], errors='coerce') + pd.to_numeric(height_in_inches[1], errors='coerce')
# calculate BMI (kg/m^2) for each player: lbs -> kg, inches -> meters
players_df['BMI'] = (players_df['weight'].values/2.2)/(players_df['height_in_inches'].values*2.54/100)**2
# convert birth date into datetime
players_df['birth_date'] = pd.to_datetime(players_df['birth_date'], format='%B %d, %Y', errors='coerce')
# clean birth place and split the country
players_df[['birth_place','birth_country']] = players_df['birth_place'].str.split('\n',expand=True).iloc[:,:2]
I'm also going to split the high school information into school name, city and state.
# split high school into state and city
def split_highschool(x):
    '''
    Takes a string of the form "high school name in city, state"
    and returns the name, city and state.
    '''
    if isinstance(x, str):  # guard against missing values (None/NaN)
        s = x.split(' in ')[1].split(',')
        if len(s) == 2:
            city = s[0].strip()
            state = s[1].strip()
            name = x.split(' in ')[0]
        else:
            city = None
            state = x.split(' in ')[1]
            name = x.split(' in ')[0]
    else:
        city = None
        state = None
        name = None
    return pd.Series([city, state, name], index=['city', 'state', 'name'])
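As a quick check, here is the function on a made-up example (hypothetical school and location):
split_highschool('Central High School in Springfield, Illinois')
# returns a Series: city='Springfield', state='Illinois', name='Central High School'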
# now apply the function
players_df[['hs_city','hs_state','hs_name']] = players_df['high_school'].apply(split_highschool)
Let's preview the data again:
players_df.head()
This seems like a good place to stop. We might want to do further manipulation on the data set depending on the analysis we are trying to do (e.g. splitting the birth place into city and state; see the sketch after the save below). Let's save the data.
players_df.to_csv('player_info_clean.csv', encoding='utf-8')
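As an example of such further manipulation, here is a minimal sketch of splitting birth_place into city and state. It assumes a 'City, State' format, so birth places outside the US will not split cleanly:
# split birth_place on the first comma (assumes 'City, State' format)
bp = players_df['birth_place'].str.split(',', n=1, expand=True)
players_df['birth_city'] = bp[0].str.strip()
players_df['birth_state'] = bp[1].str.strip()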
Getting population data per state
I would also like to get the population data for each state since 1947 (when the NBA started). I will use this data in my next post. Luckily, that data is available on Wikipedia (https://en.wikipedia.org/wiki/List_of_U.S._states_by_historical_population).
Let's extract the data from wikipedia:
url = 'https://en.wikipedia.org/wiki/List_of_U.S._states_by_historical_population'
page_request = requests.get(url)
soup = BeautifulSoup(page_request.text,"lxml")
table = soup.findAll('table')
# looking at the source code, there are a few different tables.
# The 6th table contains all of the raw data (rounded to the nearest thousand)
raw_data = table[5].findAll('td') # get all the numbers
st = table[5].findAll('th') # get all the states names
We are now going to loop over each row and get the information we need:
data = []
for row in raw_data:
    # if data is unavailable then insert NaN
    if 'n/a' in row.text:
        data.append(np.nan)
    else:
        data.append(row.text.replace(',', ''))  # remove thousands separators from the numbers
# get the list of state names
state = []
for row in st:
    state.append(row.text)
The data comes back as one long flat list, so we need to reshape it to the correct size. There are 116 years' worth of data.
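Before reshaping, a quick length check guards against the table changing shape (the expected sizes match the reshape below and are an assumption about the current table):
# sanity checks before reshaping: 116 years, each with 52 values (year + per-state values)
assert len(data) == 116 * 52
assert len(state) == 52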
# convert data to float and reshape it to a 52 by 116 array
data_array = np.array(data,dtype=float).reshape((116,52)).T
# convert to DataFrame
population_by_state = pd.DataFrame(data_array[1:,:],index=state[1:],columns = data_array[0,:].astype(int))
# preview
population_by_state.head()
And we can save the data
population_by_state.to_csv('population_by_state.csv')
In my next post I will use this data to plot some interesting information.
Thank you, Tomer Tal, for sharing a template of your basketball-reference scraping code.