In this post I'm going to scrape www.basketball-reference.com to get general information on every player who has played in the NBA. I'm also going to get per-state population data from Wikipedia.
import numpy as np
import pandas as pd
from bs4 import BeautifulSoup, SoupStrainer
import sys
import string
import requests
import datetime
import progressbar
import time
First, let's get the information from basketball-reference. There are great tutorials on web scraping available online. Note that websites frequently change their layout, so scraping code typically needs to be modified over time.
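One way to catch a layout change early is to verify up front that the page still has the structure the parser expects. Here is a minimal sketch; the expected header names are an assumption based on the current layout:
# fail fast if the players table no longer looks the way the parser expects
# (the header names below are an assumption based on the current layout)
EXPECTED_COLS = {'From', 'To', 'Pos', 'Ht', 'Wt', 'Birth Date', 'Colleges'}

def check_layout(url='http://www.basketball-reference.com/players/a/'):
    soup = BeautifulSoup(requests.get(url).text, "lxml")
    table = soup.find('table')
    if table is None:
        raise RuntimeError('players table not found - layout may have changed')
    headers = {th.text.strip() for th in table.findAll('th')}
    missing = EXPECTED_COLS - headers
    if missing:
        raise RuntimeError('table layout changed, missing columns: %s' % sorted(missing))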
Basketball-reference sorts players by the first letter of their last name; see http://www.basketball-reference.com/players/a/ for an example. Each such page contains a table with the players and their information. First we are going to read the table and extract that information.
def player_info():
    '''
    This function web scrapes basketball-reference and extracts each player's info.
    '''
    players = []
    base_url = 'http://www.basketball-reference.com/players/'
    # get player tables from the alphabetical list pages
    for letter in string.ascii_lowercase:
        page_request = requests.get(base_url + letter)
        soup = BeautifulSoup(page_request.text, "lxml")
        # find the players table in the soup
        table = soup.find('table')
        if table:
            table_body = table.find('tbody')
            # loop over the list of players
            for row in table_body.findAll('tr'):
                # get name and url
                player_url = row.find('a')
                player_names = player_url.text
                player_pages = player_url['href']
                # get some player info from the table cells
                cells = row.findAll('td')
                active_from = int(cells[0].text)
                active_to = int(cells[1].text)
                position = cells[2].text
                height = cells[3].text
                weight = cells[4].text
                birth_date = cells[5].text
                college = cells[6].text
                # create entry
                player_entry = {'url': player_pages,
                                'name': player_names,
                                'active_from': active_from,
                                'active_to': active_to,
                                'position': position,
                                'college': college,
                                'height': height,
                                'weight': weight,
                                'birth_date': birth_date}
                # append player dictionary
                players.append(player_entry)
    return pd.DataFrame(players)
players_general_info = player_info() # call function that scrapes general info
players_general_info.head() # preview
Each player also has a personal page with additional information, for example http://www.basketball-reference.com/players/a/abdulka01.html. So I'm going to visit each player's page and extract that information as well.
This will take much longer because we have to scrape a page for every player. Therefore, I wrote a function that scrapes a single player's page, and I'm going to run it in a loop with a status bar (I always prefer a status bar for scripts that run for a while):
def player_detail_info(url):
    '''
    Scrape a player's personal page. Input is the player's url (without www.basketball-reference.com).
    '''
    # we do not need to parse the whole page; the information we want sits in the <p> tags
    personal = SoupStrainer('p')
    page_request = requests.get('http://www.basketball-reference.com' + url)
    soup = BeautifulSoup(page_request.text, "lxml", parse_only=personal)  # parse only the part we are interested in
    p = soup.findAll('p')
    # initialize some values - sometimes they are not available
    shoots = None
    birth_place = None
    high_school = None
    draft = None
    position_str = None
    # loop over the personal info paragraphs and pick out the fields we want
    for prow in p:
        # look for the shoots field
        if 'Shoots:' in prow.text:
            s = prow.text.replace('\n', '').split(u'\u25aa')  # clean text
            if len(s) > 1:
                shoots = s[1].split(':')[1].strip()
        # look for the position
        elif 'Position:' in prow.text:
            s = prow.text.replace('\n', '').split(u'\u25aa')
            if len(s) > 1:
                position_str = s[0].split(':')[1].strip()
            else:
                position_str = prow.text.split('Position:')[1].strip()  # needed when the shoots field is missing
        # look for the birth place
        elif 'Born:' in prow.text:
            s = prow.text.split(u'in\xa0')  # clean text
            if len(s) > 1:
                birth_place = s[1]
        elif 'High School:' in prow.text:
            s = prow.text.replace('\n', '').split(':')
            if len(s) > 1:
                high_school = s[1].lstrip()
        elif 'Draft:' in prow.text:
            s = prow.text.replace('\n', '').split(':')
            if len(s) > 1:
                draft = s[1].lstrip()
    # collect all of the info in a dictionary
    player_entry = {'url': url,
                    'birth_place': birth_place,
                    'shoots': shoots,
                    'high_school': high_school,
                    'draft': draft,
                    'position_str': position_str}
    return player_entry
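As a quick check, we can run the function on the example page from above; it should return a dictionary with the fields we just extracted:
player_detail_info('/players/a/abdulka01.html')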
Now we can run the main loop to create the list:
players_details_info_list = []
bar = progressbar.ProgressBar(max_value=len(players_general_info))
for i, url in enumerate(players_general_info.url):
    try:
        players_details_info_list.append(player_detail_info(url))
    except Exception:
        print('cannot load: %s; location %d' % (url, i))
    bar.update(i)
    time.sleep(0.1)  # pause between requests to go easy on the server
players_detail_df = pd.DataFrame(players_details_info_list) # convert to dataframe
players_detail_df.head() # preview
Now we can join both DataFrames so we have one DataFrame with all of the information.
players_df = players_general_info.merge(players_detail_df,how='outer',on='url')
players_df.head() # preview
Let's save the data before we proceed:
players_df.to_csv('player_info_raw.csv', encoding='utf-8') # specify the encoding, otherwise non-ASCII characters raise an error
We need to do some cleaning of the data. I'm going to start by looking at the data type of each column:
players_df.dtypes
# convert weight (lbs) to numeric
players_df['weight'] = pd.to_numeric(players_df['weight'], errors='coerce')
# convert height from a 'feet-inches' string to inches
height_in_inches = players_df['height'].str.split('-', expand=True)
players_df['height_in_inches'] = 12.0*pd.to_numeric(height_in_inches[0], errors='coerce') + pd.to_numeric(height_in_inches[1], errors='coerce')
# calculate BMI (kg/m^2) for each player: lbs -> kg, inches -> meters
players_df['BMI'] = (players_df['weight'].values/2.2)/(players_df['height_in_inches'].values*2.54/100)**2
# convert birth date into datetime
players_df['birth_date'] = pd.to_datetime(players_df['birth_date'], format='%B %d, %Y', errors='coerce')
# clean birth place and split the country
players_df[['birth_place','birth_country']] = players_df['birth_place'].str.split('\n',expand=True).iloc[:,:2]
I'm also going to split the high school information into school name, city and state.
# split high school into state and city
def split_highschool(x):
    '''
    Takes a string of the form "high school name in city, state"
    and returns the name, city and state.
    '''
    if isinstance(x, str):  # guard against missing values (None/NaN)
        s = x.split(' in ')[1].split(',')
        if len(s) == 2:
            city = s[0].strip()
            state = s[1].strip()
            name = x.split(' in ')[0]
        else:
            city = None
            state = x.split(' in ')[1]
            name = x.split(' in ')[0]
    else:
        city = None
        state = None
        name = None
    return pd.Series([city, state, name], index=['city', 'state', 'name'])
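As a quick check, here is the function on a made-up example (hypothetical school and location):
split_highschool('Central High School in Springfield, Illinois')
# returns a Series: city='Springfield', state='Illinois', name='Central High School'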
# now apply the function
players_df[['hs_city','hs_state','hs_name']] = players_df['high_school'].apply(split_highschool)
Let's preview the data again:
players_df.head()
This seems like a good place to stop. We might want to do further manipulation on the data set depending on the analysis we are trying to do (e.g. splitting the birth place into city and state; see the sketch after the save below). Let's save the data.
players_df.to_csv('player_info_clean.csv', encoding='utf-8')
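As an example of such further manipulation, here is a minimal sketch of splitting birth_place into city and state. It assumes a 'City, State' format, so birth places outside the US will not split cleanly:
# split birth_place on the first comma (assumes 'City, State' format)
bp = players_df['birth_place'].str.split(',', n=1, expand=True)
players_df['birth_city'] = bp[0].str.strip()
players_df['birth_state'] = bp[1].str.strip()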
Getting population data per state
I would also like to get the population data for each state since 1947 (when the NBA started). I will use this data in my next post. Luckily, that data is available on Wikipedia (https://en.wikipedia.org/wiki/List_of_U.S._states_by_historical_population).
Let's extract the data from wikipedia:
url = 'https://en.wikipedia.org/wiki/List_of_U.S._states_by_historical_population'
page_request = requests.get(url)
soup = BeautifulSoup(page_request.text,"lxml")
table = soup.findAll('table')
# looking at the source code, there are a few different tables.
# The 6th table contains all of the raw data (rounded to the nearest thousand)
raw_data = table[5].findAll('td') # get all the numbers
st = table[5].findAll('th') # get all the states names
We are now going to loop over each row and get the information we need:
data = []
for row in raw_data:
    # if data is unavailable then insert NaN
    if 'n/a' in row.text:
        data.append(np.nan)
    else:
        data.append(row.text.replace(',', ''))  # remove thousands separators from the numbers
# get the list of state names
state = []
for row in st:
    state.append(row.text)
The data comes back as one long flat list, so we need to reshape it to the correct size. There are 116 years' worth of data.
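Before reshaping, a quick length check guards against the table changing shape (the expected sizes match the reshape below and are an assumption about the current table):
# sanity checks before reshaping: 116 years, each with 52 values (year + per-state values)
assert len(data) == 116 * 52
assert len(state) == 52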
# convert data to float and reshape it to a 52 by 116 array
data_array = np.array(data,dtype=float).reshape((116,52)).T
# convert to DataFrame
population_by_state = pd.DataFrame(data_array[1:,:],index=state[1:],columns = data_array[0,:].astype(int))
# preview
population_by_state.head()
And we can save the data
population_by_state.to_csv('population_by_state.csv')
In my next post I will use this data to plot some interesting information.
Thank you, Tomer Tal, for sharing a template of your basketball-reference scraping code.