In this post I'm going to scrape www.basketball-reference.com to get some general information about every player who has played in the NBA. I'm also going to get population data per state from Wikipedia.

In [1]:
import numpy as np
import pandas as pd
from bs4 import BeautifulSoup, SoupStrainer
import sys
import string
import requests
import datetime
import progressbar
import time

First, let's get the information from basketball-reference. (If you're new to web scraping, there are plenty of great tutorials online.)

Note that websites frequently change layout so scraping code typically needs to be modified over time.
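One lightweight defense is to check that the table header still matches what the parser expects before trusting the rest of the scrape. A minimal sketch (the `EXPECTED` column names and the `layout_ok` helper are my own illustration, based on the layout at the time of writing):

```python
from bs4 import BeautifulSoup

# EXPECTED is an assumption about the column layout at the time of writing;
# update it if basketball-reference changes the page.
EXPECTED = ['Player', 'From', 'To', 'Pos', 'Ht', 'Wt', 'Birth Date', 'Colleges']

def layout_ok(html):
    '''Return True if the first table's header row matches EXPECTED.'''
    soup = BeautifulSoup(html, 'html.parser')
    table = soup.find('table')
    if table is None:
        return False
    headers = [th.text.strip() for th in table.find('tr').findAll('th')]
    return headers == EXPECTED

# quick self-check on a synthetic page with the expected header
sample = '<table><tr>%s</tr></table>' % ''.join('<th>%s</th>' % h for h in EXPECTED)
print(layout_ok(sample))  # True
```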

Basketball-reference sorts players by the first letter of their last name; see http://www.basketball-reference.com/players/a/ for an example. Each of those pages has a table with players and their information, which we are going to read and extract first.

In [2]:
def player_info():
    '''
    This function web scrapes basketball-reference and extracts player's info.
    '''
    players = []
    base_url = 'http://www.basketball-reference.com/players/'

    # get player tables from alphabetical list pages
    for letter in string.ascii_lowercase:
        page_request = requests.get(base_url + letter)
        soup = BeautifulSoup(page_request.text,"lxml")
        # find table in soup
        table = soup.find('table')

        if table:
            table_body = table.find('tbody')

            # loop over list of players
            for row in table_body.findAll('tr'):

                # get name and url
                player_url = row.find('a')
                player_names = player_url.text
                player_pages = player_url['href']

                # get some player's info from table
                cells = row.findAll('td')
                active_from = int(cells[0].text)
                active_to = int(cells[1].text)
                position = cells[2].text
                height = cells[3].text
                weight = cells[4].text
                birth_date = cells[5].text
                college = cells[6].text    

                # create entry
                player_entry = {'url': player_pages,
                                'name': player_names,
                                'active_from': active_from,
                                'active_to': active_to,
                                'position': position,
                                'college': college,
                                'height': height,
                                'weight': weight,
                                'birth_date': birth_date}

                # append player dictionary
                players.append(player_entry)
                
    return pd.DataFrame(players)
In [3]:
players_general_info = player_info() # call function that scrapes general info
players_general_info.head() # preview
Out[3]:
active_from active_to birth_date college height name position url weight
0 1991 1995 June 24, 1968 Duke University 6-10 Alaa Abdelnaby F-C /players/a/abdelal01.html 240
1 1969 1978 April 7, 1946 Iowa State University 6-9 Zaid Abdul-Aziz C-F /players/a/abdulza01.html 235
2 1970 1989 April 16, 1947 University of California, Los Angeles 7-2 Kareem Abdul-Jabbar C /players/a/abdulka01.html 225
3 1991 2001 March 9, 1969 Louisiana State University 6-1 Mahmoud Abdul-Rauf G /players/a/abdulma02.html 162
4 1998 2003 November 3, 1974 San Jose State University 6-6 Tariq Abdul-Wahad F /players/a/abdulta01.html 223

Each player also has a personal page with additional information. For example - http://www.basketball-reference.com/players/a/abdulka01.html. So I'm going to go into each player's page and extract that information as well.

In this case it is going to take much longer because we have to scrape a separate page for every player. Therefore, I wrote a function that scrapes a single player's page, and I'm going to run it in a loop with a status bar (I always prefer a status bar for scripts that run for a while):
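The function below uses bs4's `SoupStrainer` so that only the `<p>` tags are parsed, which keeps parsing cheap on large pages. A toy demonstration of the idea (the HTML snippet is made up):

```python
from bs4 import BeautifulSoup, SoupStrainer

# Toy demo with made-up HTML: the strainer keeps only <p> tags, so
# everything else is discarded during parsing rather than afterwards.
html = '<div>ignored</div><p>Shoots: Right</p><p>Born: 1947</p>'
only_p = SoupStrainer('p')
soup = BeautifulSoup(html, 'html.parser', parse_only=only_p)
print([p.text for p in soup.findAll('p')])  # ['Shoots: Right', 'Born: 1947']
```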

In [4]:
def player_detail_info(url):
    '''
    Scrape a player's personal page. Input is the player's url
    (without www.basketball-reference.com).
    '''
    # we do not need to parse the whole page since the information we are interested in is only a small part
    personal = SoupStrainer('p')
    page_request = requests.get('http://www.basketball-reference.com' + url)
    soup = BeautifulSoup(page_request.text,"lxml",parse_only=personal) # parse only part we are interested in
    p = soup.findAll('p') 

    # initialize some values - sometimes they are not available
    shoots = None
    birth_place = None
    high_school = None
    draft = None
    position_str = None

    # loop over personal info to get certain information
    for prow in p:
        # look for shoots field
        if 'Shoots:' in prow.text:
            s = prow.text.replace('\n','').split(u'\u25aa') # clean text
            if len(s)>1:
                shoots = s[1].split(':')[1].lstrip().rstrip()
        # look for position
        elif 'Position:' in prow.text:
            s = prow.text.replace('\n','').split(u'\u25aa')
            if len(s)>1:
                position_str = s[0].split(':')[1].lstrip().rstrip()
            else:
                position_str = prow.text.split('Position:')[1].lstrip().rstrip() # when shoots does not exist we need this
        # look for born
        elif 'Born:' in prow.text:
            s = prow.text.split(u'in\xa0') # clean text
            if len(s)>1:
                birth_place = s[1]
        elif 'High School:' in prow.text:
            s = prow.text.replace('\n','').split(':') 
            if len(s)>1:
                high_school = s[1].lstrip()
        elif 'Draft:' in prow.text:
            s = prow.text.replace('\n','').split(':')
            if len(s)>1:
                draft = s[1].lstrip()

    # create dictionary with all of the info            
    player_entry = {'url': url,
                    'birth_place': birth_place,
                    'shoots': shoots,
                    'high_school': high_school,
                    'draft': draft,
                    'position_str': position_str}

    return player_entry

Now we can run the main loop to create the list:

In [5]:
players_details_info_list = []
bar = progressbar.ProgressBar(max_value=len(players_general_info))
for i,url in enumerate(players_general_info.url):
    try:
        players_details_info_list.append(player_detail_info(url))
    except Exception:
        print('cannot load: %s; location %d' % (url, i))
    bar.update(i)
    time.sleep(0.1)
 99% (4459 of 4460) |############################################################################################ | Elapsed Time: 1:14:25 ETA: 0:00:00
In [6]:
players_detail_df = pd.DataFrame(players_details_info_list) # convert to DataFrame
players_detail_df.head() # preview
Out[6]:
birth_place draft high_school position_str shoots url
0 Cairo, Egypt\neg\n Portland Trail Blazers, 1st round (25th pick, ... Bloomfield in Bloomfield, New Jersey Power Forward Right /players/a/abdelal01.html
1 Brooklyn, New York\nus\n Cincinnati Royals, 1st round (5th pick, 5th ov... John Jay in Brooklyn, New York Center and Power Forward Right /players/a/abdulza01.html
2 New York, New York\nus\n Milwaukee Bucks, 1st round (1st pick, 1st over... Power Memorial in New York, New York Center Right /players/a/abdulka01.html
3 Gulfport, Mississippi\nus\n Denver Nuggets, 1st round (3rd pick, 3rd overa... Gulfport in Gulfport, Mississippi Point Guard Right /players/a/abdulma02.html
4 Maisons Alfort, France\nfr\n Sacramento Kings, 1st round (11th pick, 11th o... Lycee Aristide Briand in Evreux, France Shooting Guard Right /players/a/abdulta01.html

Now we can join both DataFrames so we have one DataFrame with all of the information.

In [7]:
players_df = players_general_info.merge(players_detail_df,how='outer',on='url')
players_df.head() # preview
Out[7]:
active_from active_to birth_date college height name position url weight birth_place draft high_school position_str shoots
0 1991 1995 June 24, 1968 Duke University 6-10 Alaa Abdelnaby F-C /players/a/abdelal01.html 240 Cairo, Egypt\neg\n Portland Trail Blazers, 1st round (25th pick, ... Bloomfield in Bloomfield, New Jersey Power Forward Right
1 1969 1978 April 7, 1946 Iowa State University 6-9 Zaid Abdul-Aziz C-F /players/a/abdulza01.html 235 Brooklyn, New York\nus\n Cincinnati Royals, 1st round (5th pick, 5th ov... John Jay in Brooklyn, New York Center and Power Forward Right
2 1970 1989 April 16, 1947 University of California, Los Angeles 7-2 Kareem Abdul-Jabbar C /players/a/abdulka01.html 225 New York, New York\nus\n Milwaukee Bucks, 1st round (1st pick, 1st over... Power Memorial in New York, New York Center Right
3 1991 2001 March 9, 1969 Louisiana State University 6-1 Mahmoud Abdul-Rauf G /players/a/abdulma02.html 162 Gulfport, Mississippi\nus\n Denver Nuggets, 1st round (3rd pick, 3rd overa... Gulfport in Gulfport, Mississippi Point Guard Right
4 1998 2003 November 3, 1974 San Jose State University 6-6 Tariq Abdul-Wahad F /players/a/abdulta01.html 223 Maisons Alfort, France\nfr\n Sacramento Kings, 1st round (11th pick, 11th o... Lycee Aristide Briand in Evreux, France Shooting Guard Right
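As a sanity check on an outer join, pandas' `indicator` option tags each row with where it came from, which makes unmatched urls easy to spot. A sketch with toy frames (not the scraped data):

```python
import pandas as pd

# Toy frames, not the scraped data: one url matches, two do not.
left = pd.DataFrame({'url': ['/a.html', '/b.html'], 'name': ['A', 'B']})
right = pd.DataFrame({'url': ['/a.html', '/c.html'], 'shoots': ['Right', 'Left']})

merged = left.merge(right, how='outer', on='url', indicator=True)
# the _merge column reports 'both', 'left_only' or 'right_only' per row
print(sorted(merged['_merge'].astype(str)))  # ['both', 'left_only', 'right_only']
```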

Let's save the data before we proceed:

In [9]:
players_df.to_csv('player_info_raw.csv', encoding='utf-8') # specify encoding to avoid a UnicodeEncodeError
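When reading the file back later, `index_col=0` restores the index that `to_csv` writes as the first, unnamed column. A toy round trip (using `StringIO` in place of a file on disk; for a real file you would also pass `encoding='utf-8'`):

```python
import io
import pandas as pd

# Toy round trip; StringIO stands in for the csv file on disk.
df = pd.DataFrame({'name': ['Kareem Abdul-Jabbar'], 'weight': [225]})
buf = io.StringIO()
df.to_csv(buf)              # for a file on disk, pass encoding='utf-8' too
buf.seek(0)
back = pd.read_csv(buf, index_col=0)  # index_col=0 restores the saved index
print(back.equals(df))  # True
```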

We need to do some cleaning of the data. I'm going to start by looking at the data types of each column:

In [10]:
players_df.dtypes
Out[10]:
active_from      int64
active_to        int64
birth_date      object
college         object
height          object
name            object
position        object
url             object
weight          object
birth_place     object
draft           object
high_school     object
position_str    object
shoots          object
dtype: object
In [24]:
# convert weight to a number (coerce bad values to NaN)
players_df['weight'] = pd.to_numeric(players_df['weight'], errors='coerce')

# convert height to inches
height_in_inches = players_df['height'].str.split('-',expand=True)
players_df['height_in_inches'] = 12.0*pd.to_numeric(height_in_inches[0], errors='coerce')+pd.to_numeric(height_in_inches[1], errors='coerce')

# calculate BMI for each player
players_df['BMI'] = (players_df['weight'].values/2.2)/(players_df['height_in_inches'].values*2.54/100)**2

# convert birth date into datetime 
players_df['birth_date'] = pd.to_datetime(players_df['birth_date'], format='%B %d, %Y', errors='coerce')

# clean birth place and split the country
players_df[['birth_place','birth_country']] = players_df['birth_place'].str.split('\n',expand=True).iloc[:,:2]
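As a quick check on the unit conversions (pounds to kilograms, feet-inches to meters), we can recompute the BMI of the first player from the preview tables by hand:

```python
# Numbers for Alaa Abdelnaby taken from the preview tables: 240 lb, 6-10.
weight_lb, height_str = 240, '6-10'

feet, inches = (int(v) for v in height_str.split('-'))
height_in_inches = 12 * feet + inches      # 82 inches
weight_kg = weight_lb / 2.2                # pounds -> kilograms
height_m = height_in_inches * 2.54 / 100   # inches -> meters

bmi = weight_kg / height_m ** 2
print(round(bmi, 6))  # 25.147419, matching the BMI computed by the cell above
```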

I'm also going to split the high school information into city, state and school name.

In [25]:
# split high school into state and city
def split_highschool(x):
    '''
    Takes a string of the form 'school name in city, state'
    and returns the city, state and school name.
    '''
    if isinstance(x, str) and ' in ' in x:  # guard against missing values
        s = x.split(' in ')[1].split(',')
        if len(s)==2:
            city = s[0].lstrip().rstrip()
            state = s[1].lstrip().rstrip()
            name = x.split(' in ')[0]
        else:
            city = None
            state = x.split(' in ')[1]
            name = x.split(' in ')[0]
    else:
        city = None
        state = None
        name = None
    return pd.Series([city, state, name], index=['city','state','name'])

# now apply the function
players_df[['hs_city','hs_state','hs_name']] = players_df['high_school'].apply(split_highschool)
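To see what the helper does, here is the same split logic applied to one of the strings from the table above, as a self-contained sketch:

```python
# The same split logic on one sample string from the table above.
hs = 'Power Memorial in New York, New York'

name, location = hs.split(' in ', 1)
parts = [p.strip() for p in location.split(',')]
if len(parts) == 2:
    city, state = parts
else:
    city, state = None, parts[0]

print(name, city, state)  # Power Memorial New York New York
```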

Let's preview the data again:

In [26]:
players_df.head()
Out[26]:
active_from active_to birth_date college height name position url weight birth_place draft high_school position_str shoots height_in_inches BMI birth_country hs_city hs_state hs_name
0 1991 1995 1968-06-24 Duke University 6-10 Alaa Abdelnaby F-C /players/a/abdelal01.html 240.0 Cairo, Egypt Portland Trail Blazers, 1st round (25th pick, ... Bloomfield in Bloomfield, New Jersey Power Forward Right 82.0 25.147419 eg Bloomfield New Jersey Bloomfield
1 1969 1978 1946-04-07 Iowa State University 6-9 Zaid Abdul-Aziz C-F /players/a/abdulza01.html 235.0 Brooklyn, New York Cincinnati Royals, 1st round (5th pick, 5th ov... John Jay in Brooklyn, New York Center and Power Forward Right 81.0 25.235256 us Brooklyn New York John Jay
2 1970 1989 1947-04-16 University of California, Los Angeles 7-2 Kareem Abdul-Jabbar C /players/a/abdulka01.html 225.0 New York, New York Milwaukee Bucks, 1st round (1st pick, 1st over... Power Memorial in New York, New York Center Right 86.0 21.433619 us New York New York Power Memorial
3 1991 2001 1969-03-09 Louisiana State University 6-1 Mahmoud Abdul-Rauf G /players/a/abdulma02.html 162.0 Gulfport, Mississippi Denver Nuggets, 1st round (3rd pick, 3rd overa... Gulfport in Gulfport, Mississippi Point Guard Right 73.0 21.418013 us Gulfport Mississippi Gulfport
4 1998 2003 1974-11-03 San Jose State University 6-6 Tariq Abdul-Wahad F /players/a/abdulta01.html 223.0 Maisons Alfort, France Sacramento Kings, 1st round (11th pick, 11th o... Lycee Aristide Briand in Evreux, France Shooting Guard Right 78.0 25.824121 fr Evreux France Lycee Aristide Briand

This seems like a good place to stop. We might want to do further manipulation on the data set depending on the analysis we are trying to do (e.g. split birth place into city and state). Let's save the data.

In [28]:
players_df.to_csv('player_info_clean.csv', encoding='utf-8')

Getting population data per state

I would also like to get the population data for each state since 1947 (when the NBA started). I will use this data in my next post. Luckily, that data is available on wikipedia (https://en.wikipedia.org/wiki/List_of_U.S._states_by_historical_population).

Let's extract the data from wikipedia:

In [29]:
url = 'https://en.wikipedia.org/wiki/List_of_U.S._states_by_historical_population'
page_request = requests.get(url)
soup = BeautifulSoup(page_request.text,"lxml")
table = soup.findAll('table')

# looking at the source code, there are a few different tables. 
# The 6th table contains all of the raw data (rounded to the thousand)
raw_data = table[5].findAll('td') # get all the numbers
st = table[5].findAll('th') # get all the states names

We are now going to loop over each row and get the information we need:

In [30]:
data = []
for row in raw_data:
    # if data is unavailable then insert NaN
    if 'n/a' in row.text:
        data.append(np.nan)
    else:
        data.append(row.text.replace(',','')) # remove , from numbers
        
# get the list of states        
state = []
for row in st:
    state.append(row.text)

The data comes back as one long list, so we need to reshape it to the correct size: there are 116 years' worth of data, and each row of the source table holds the year followed by the state values (52 values in total).
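The reshape-then-transpose pattern works because NumPy fills arrays row by row (C order), matching a table stored one year per row. A toy demo with a 3x2 "table":

```python
import numpy as np

# Toy version of the same pattern: a flat list from a 3-row, 2-column
# table. reshape fills row by row (C order), so table rows stay intact;
# the transpose then turns the table's columns into rows of the array.
flat = np.arange(6)
a = flat.reshape((3, 2))
print(a.tolist())    # [[0, 1], [2, 3], [4, 5]]
print(a.T.tolist())  # [[0, 2, 4], [1, 3, 5]]
```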

In [31]:
# convert data to float, reshape to 116 rows (years) by 52 columns, then transpose
data_array = np.array(data,dtype=float).reshape((116,52)).T
# convert to DataFrame
population_by_state = pd.DataFrame(data_array[1:,:],index=state[1:],columns = data_array[0,:].astype(int))
# preview
population_by_state.head()
Out[31]:
1900 1901 1902 1903 1904 1905 1906 1907 1908 1909 ... 2006 2007 2008 2009 2010 2011 2012 2013 2014 2015
AL 1830000.0 1907000.0 1935000.0 1957000.0 1978000.0 2012000.0 2045000.0 2058000.0 2070000.0 2108000.0 ... 4628981.0 4672840.0 4718206.0 4757938.0 4785822.0 4801695.0 4817484.0 4833996.0 4849377.0 4858979.0
AK NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... 675302.0 680300.0 687455.0 698895.0 713856.0 722572.0 731081.0 737259.0 736732.0 738432.0
AZ 124000.0 131000.0 138000.0 144000.0 151000.0 158000.0 167000.0 176000.0 186000.0 196000.0 ... 6029141.0 6167681.0 6280362.0 6343154.0 6411999.0 6472867.0 6556236.0 6634997.0 6731484.0 6828065.0
AR 1314000.0 1341000.0 1360000.0 1384000.0 1419000.0 1447000.0 1465000.0 1484000.0 1513000.0 1545000.0 ... 2821761.0 2848650.0 2874554.0 2896843.0 2922297.0 2938430.0 2949300.0 2958765.0 2966369.0 2978204.0
CA 1490000.0 1550000.0 1623000.0 1702000.0 1792000.0 1893000.0 1976000.0 2054000.0 2161000.0 2282000.0 ... 36021202.0 36250311.0 36604337.0 36961229.0 37336011.0 37701901.0 38062780.0 38431393.0 38802500.0 39144818.0

5 rows × 116 columns

And we can save the data:

In [33]:
population_by_state.to_csv('population_by_state.csv')

In my next post I will use this data to plot some interesting information.

Thank you, Tomer Tal, for sharing a template of your basketball-reference scraping code.

Let me know what you think!


