The goal of the project was to compare different NBA players based on their shot selection and cluster them into groups. These new groups can be compared to the assigned position of players to check for correlation.

In [1]:
%matplotlib inline

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits.axes_grid1 import make_axes_locatable
from scipy import misc,ndimage
import scipy.cluster.hierarchy as sch
import seaborn as sns
import NBAapi as nba
import urllib, cStringIO
from collections import defaultdict

Load shot chart data

In [2]:
df,_ = nba.shotchart.shotchartdetail(season='2014-15')
df.head()
Out[2]:
GRID_TYPE GAME_ID GAME_EVENT_ID PLAYER_ID PLAYER_NAME TEAM_ID TEAM_NAME PERIOD MINUTES_REMAINING SECONDS_REMAINING ... SHOT_ZONE_AREA SHOT_ZONE_RANGE SHOT_DISTANCE LOC_X LOC_Y SHOT_ATTEMPTED_FLAG SHOT_MADE_FLAG GAME_DATE HTM VTM
0 Shot Chart Detail 0021400001 2 203076 Anthony Davis 1610612740 New Orleans Pelicans 1 11 43 ... Center(C) 16-24 ft. 20 50 194 1 0 20141028 NOP ORL
1 Shot Chart Detail 0021400001 4 202696 Nikola Vucevic 1610612753 Orlando Magic 1 11 31 ... Center(C) 16-24 ft. 18 -8 189 1 1 20141028 NOP ORL
2 Shot Chart Detail 0021400001 7 203076 Anthony Davis 1610612740 New Orleans Pelicans 1 11 6 ... Left Side Center(LC) 16-24 ft. 18 -131 127 1 0 20141028 NOP ORL
3 Shot Chart Detail 0021400001 9 203901 Elfrid Payton 1610612753 Orlando Magic 1 10 54 ... Center(C) Less Than 8 ft. 1 -15 4 1 0 20141028 NOP ORL
4 Shot Chart Detail 0021400001 25 203076 Anthony Davis 1610612740 New Orleans Pelicans 1 10 29 ... Center(C) Less Than 8 ft. 0 0 1 1 1 20141028 NOP ORL

5 rows × 24 columns

Let's write a few useful functions that will be used throughout this project:

In [3]:
def KDE_heatmap(df,sigma=3):
    '''
    This function performs KDE calculation for a given shot chart.
    Input  - dataframe with x and y shot coordinates.
    Option - sigma (in feet).
    Output - KDE of shot chart
    '''
    N,_,_ = np.histogram2d( 0.1*df['LOC_X'].values, 0.1*df['LOC_Y'].values,bins = [500, 500],range = [[-25,25],[-5.25,44.75]])
    KDE = ndimage.filters.gaussian_filter(N,10.0*sigma)
    return 1.0*KDE/np.sum(KDE)

def players_picture(player_id):
    '''
    Input: player ID
    Output: players picture
    '''
    URL = "http://stats.nba.com/media/players/230x185/%d.png" %player_id
    file = cStringIO.StringIO(urllib.urlopen(URL).read())
    return misc.imread(file)

def correlation_distance(N1,N2):
    '''
    Takes two 2D arrays from the KDE function and returns half the L1 distance between them.
    Output values are between 0-1 where 0 is identical and 1 is no similarity.
    '''
    D = np.sum(abs(N1-N2))/2.0
    return D

def shot_scatter(df,player_pic=True,ax=None,noise=True,**kwargs):
    '''
    Plotting scatter plot of shots.
    input - dataframe with x and y coordinates.
    optional - player_pic (default True) loads player picture. Use if dataframe is for a single player. 
               ax (default None) can pass plot axis.
               noise (default True) adds some random scatter to the data for better visualization  
               other - any keyword arguments that can be passed into the scatter function (e.g. the transparency value, alpha)
    '''
    if ax is None: 
        ax = plt.gca(xlim = [30,-30],ylim = [-7,43],xticks=[],yticks=[],aspect=1.0)
    nba.plot.court(ax,outer_lines=True,color='black',lw=2.0,direction='down')
    ax.axis('off')
    if noise:
        X = df.LOC_X.values + np.random.normal(loc=0.0, scale=1.5, size=len(df.LOC_X.values))
        Y = df.LOC_Y.values + np.random.normal(loc=0.0, scale=1.5, size=len(df.LOC_Y.values))
    else:
        X = df.LOC_X.values
        Y = df.LOC_Y.values
    ax.scatter(-0.1*X,0.1*Y,**kwargs)
    if player_pic:
        name = df.PLAYER_NAME.values[0]
        player_id = df.PLAYER_ID.values[0]
        pic = players_picture(player_id)
        ax.imshow(pic,extent=[15,25,30,37.8261])
        ax.text(20,29,name,fontsize=16,horizontalalignment='center',verticalalignment='center')
    ax.text(0,-7,'By: Doingthedishes',color='black',horizontalalignment='center',fontsize=20,fontweight='bold')
    
def shot_heatmap(df,sigma = 1,log=False,player_pic=True,ax=None,cmap='jet'):
    '''
    This function plots a heatmap based on the shot chart.
    input - dataframe with x and y coordinates.
    optional - log (default False) plots heatmap in log scale. 
               player_pic (default True) adds the player's picture if True 
               sigma - the sigma of the Gaussian kernel. In feet (default=1)
               ax (default None) can pass plot axis.
               cmap (default 'jet') colormap used for the heatmap.
    '''
    N = KDE_heatmap(df,sigma)
    if ax is None:
        ax = plt.gca(xlim = [30,-30],ylim = [-7,43],xticks=[],yticks=[],aspect=1.0)
    nba.plot.court(ax,outer_lines=True,color='black',lw=2.0,direction='down')
    ax.axis('off')
    if log:
        ax.imshow(np.rot90(np.log10(N+1)),cmap=cmap,extent=[25.0, -25.0, -5.25, 44.75])
    else:
        ax.imshow(np.rot90(N),cmap=cmap,extent=[25.0, -25.0, -5.25, 44.75])
    if player_pic:
        player_id = df.PLAYER_ID.values[0]
        pic = players_picture(player_id)
        ax.imshow(pic,extent=[15,25,30,37.8261])
    ax.text(0,-7,'By: Doingthedishes',color='white',horizontalalignment='center',fontsize=20,fontweight='bold')

Choose players with more than 500 shots:

In [4]:
player_df = pd.DataFrame({'shots' : df.groupby(by=['PLAYER_ID','PLAYER_NAME']).size()}).reset_index()
idx = player_df.shots.values > 500
player_df = player_df.ix[idx]
players = player_df.PLAYER_ID.values
player_df.head()
Out[4]:
PLAYER_ID PLAYER_NAME shots
1 977 Kobe Bryant 713
2 1495 Tim Duncan 819
4 1717 Dirk Nowitzki 1062
5 1718 Paul Pierce 656
12 1938 Manu Ginobili 589

Plot example heatmaps + scatter plots

In [5]:
name = 'Stephen Curry'
f, axarr = plt.subplots(1,2,figsize=(20,10),facecolor='white')
for i in range(2):
    axarr[i].set_ylim([-10,41.5])
    axarr[i].set_xlim([25,-25])
    axarr[i].set_aspect(1)
    axarr[i].set_xticks([])
    axarr[i].set_yticks([])
    axarr[i].axis('off')
f.subplots_adjust(hspace=0,wspace=0)
shot_scatter(df[df['PLAYER_NAME']==name],ax=axarr[0],alpha = 0.2)
shot_heatmap(df[df['PLAYER_NAME']==name],ax=axarr[1],player_pic=False,log=True)
In [6]:
name = 'Dirk Nowitzki'
f, axarr = plt.subplots(1,2,figsize=(20,10),facecolor='white')
for i in range(2):
    axarr[i].set_ylim([-10,41.5])
    axarr[i].set_xlim([25,-25])
    axarr[i].set_aspect(1)
    axarr[i].set_xticks([])
    axarr[i].set_yticks([])
    axarr[i].axis('off')
f.subplots_adjust(hspace=0,wspace=0)
shot_scatter(df[df['PLAYER_NAME']==name],ax=axarr[0],alpha = 0.2)
shot_heatmap(df[df['PLAYER_NAME']==name],ax=axarr[1],player_pic = False)

Comparing players

To compare players, I calculate the KDE for each player based on their shots during the entire season. The KDE is conceptually equivalent to calculating a heatmap for each player.

Note: when computing the KDE, the $\sigma$ of the Gaussian kernel needs to be chosen. A larger $\sigma$ means that shots farther apart are still correlated, but we do not want to choose a $\sigma$ that is too large. I found that $\sigma = 3$ (feet) was a good compromise between resolution and ensuring that close shots are correlated.

In [7]:
hmaps = np.zeros([500,500,len(players)])
for i,player in enumerate(player_df.PLAYER_NAME):
    hmaps[:,:,i] = KDE_heatmap(df[df['PLAYER_ID']==players[i]])
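
To make the role of $\sigma$ concrete, here is a minimal sketch on synthetic data (not the real shot charts). The helper `kde_grid` and the two made-up players are assumptions for illustration only; the grid and the feet-to-bins conversion mirror KDE_heatmap above. Two players who shoot from spots 4 feet apart look almost unrelated at $\sigma = 1$ and considerably closer at $\sigma = 3$:

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def kde_grid(x_ft, y_ft, sigma_ft=3.0, bins=500):
    """Hypothetical helper: histogram shots on a 50 ft x 50 ft grid, then smooth."""
    counts, _, _ = np.histogram2d(x_ft, y_ft, bins=[bins, bins],
                                  range=[[-25, 25], [-5.25, 44.75]])
    bins_per_foot = bins / 50.0                      # 10 bins per foot for bins=500
    kde = gaussian_filter(counts, sigma_ft * bins_per_foot)
    return kde / kde.sum()                           # normalize to a density

rng = np.random.RandomState(0)
# Two made-up players whose favorite spots are 4 ft apart.
a_x, a_y = rng.normal(0.0, 0.5, 300), rng.normal(15.0, 0.5, 300)
b_x, b_y = rng.normal(4.0, 0.5, 300), rng.normal(15.0, 0.5, 300)

distances = {}
for sigma in (1.0, 3.0):
    diff = kde_grid(a_x, a_y, sigma) - kde_grid(b_x, b_y, sigma)
    distances[sigma] = 0.5 * np.abs(diff).sum()      # same measure as correlation_distance
```

In this toy setup `distances[3.0]` comes out smaller than `distances[1.0]`, which is the trade-off described in the note: a wider kernel treats nearby shots as similar at the cost of spatial resolution.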

Now that we have a heatmap for each player (i.e. a KDE), we can take each pair of players and compare how similar their heatmaps are. To do so I'm using the correlation_distance function that I defined above. There are several ways to compute a similarity measure. I chose this one after exploring a few different options. Another similarity measure that I like (but am not showing here) is the Kernel Distance (https://arxiv.org/abs/1103.1625).

By comparing each pair of players we can create a similarity matrix $S$ using $1 - D$, where $D$ is the distance between the two players' shot densities (i.e. how different they are from each other):

In [8]:
S = np.zeros([len(players),len(players)])
for i in xrange(len(players)):
    for j in xrange(i,len(players)):
        S[i,j] = 1 - correlation_distance(hmaps[:,:,i],hmaps[:,:,j])
        S[j,i] = S[i,j]
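
Because correlation_distance is half the L1 (city-block) distance between two normalized maps, the same matrix can also be built without the double loop. The sketch below is one possible alternative, not the code used for the results here; `hmaps_demo`, `D_demo` and `S_demo` are stand-in names on a small random grid:

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

rng = np.random.RandomState(1)
# Stand-in for hmaps: 4 hypothetical players on a small 20x20 grid.
hmaps_demo = rng.rand(20, 20, 4)
hmaps_demo /= hmaps_demo.sum(axis=(0, 1), keepdims=True)   # each map sums to 1

# One row per player, one column per grid cell.
flat = hmaps_demo.reshape(-1, hmaps_demo.shape[2]).T

# Half the city-block distance between every pair of rows.
D_demo = squareform(pdist(flat, metric='cityblock')) / 2.0
S_demo = 1.0 - D_demo
```

For the full 500x500 grid each row has 250,000 entries, so memory use is worth checking before applying this to all players at once.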

Let's plot the distance matrix $D = 1 - S$:

In [9]:
fig= plt.figure(figsize=(8,8))
im = plt.imshow(1-S,cmap='jet')
plt.colorbar(im,fraction=0.046, pad=0.04)
plt.show()

At this point it does not look like much. We need to work a little more to get some interesting information out of this data.

Find players with minimum similarity:

In [10]:
i,j = np.unravel_index(np.argmin(S),np.shape(S))
name1 =  player_df.PLAYER_NAME.where(player_df['PLAYER_ID']== players[i], np.nan).max()
name2 =  player_df.PLAYER_NAME.where(player_df['PLAYER_ID']== players[j], np.nan).max()
f, axarr = plt.subplots(1,2,figsize=(20,10),facecolor='white')
for n in range(2):
    axarr[n].set_ylim([-10,41.5])
    axarr[n].set_xlim([25,-25])
    axarr[n].set_aspect(1)
    axarr[n].set_xticks([])
    axarr[n].set_yticks([])
    axarr[n].axis('off')
f.suptitle('Similarity = {}'.format(np.round(S[i,j],2)),fontsize=20.0,y=1.02)
shot_scatter(df[df['PLAYER_NAME']==name1],ax=axarr[0],alpha=0.2)
shot_scatter(df[df['PLAYER_NAME']==name2],ax=axarr[1],alpha=0.2)
f.tight_layout()

Find players with maximum similarity

In [11]:
i,j = np.unravel_index(np.argmax(S-np.identity(len(S))),np.shape(S))
name1 =  player_df.PLAYER_NAME.where(player_df['PLAYER_ID']== players[i], np.nan).max()
name2 =  player_df.PLAYER_NAME.where(player_df['PLAYER_ID']== players[j], np.nan).max()
f, axarr = plt.subplots(1,2,figsize=(20,10),facecolor='white')
for n in range(2):
    axarr[n].set_ylim([-10,41.5])
    axarr[n].set_xlim([25,-25])
    axarr[n].set_aspect(1)
    axarr[n].set_xticks([])
    axarr[n].set_yticks([])
    axarr[n].axis('off')
f.suptitle('Similarity = {}'.format(np.round(S[i,j],2)),fontsize=20.0,y=1.02)
shot_scatter(df[df['PLAYER_NAME']==name1],ax=axarr[0],alpha=0.2)
shot_scatter(df[df['PLAYER_NAME']==name2],ax=axarr[1],alpha=0.2)
f.tight_layout()

Clustering

The distance matrix D (D = 1 − S) can be clustered using hierarchical clustering. I tried several linkage functions, including single-link, complete-link and average-link. Average-link yielded the best clusters. The matrix is then reorganized in such a way that close pairs are next to each other - a common technique used in genetic research (https://www.ncbi.nlm.nih.gov/pubmed/9843981).

After inspecting the dendrogram and the resulting matrix, the players were divided into 6 clusters.
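
One way to compare linkage functions numerically is the cophenetic correlation coefficient, which measures how well the dendrogram preserves the original pairwise distances. This is an assumption about how such a comparison could be done, not necessarily the criterion used in this project, and the sketch below runs on synthetic 2D points rather than the player distances:

```python
import numpy as np
import scipy.cluster.hierarchy as sch
from scipy.spatial.distance import pdist

rng = np.random.RandomState(2)
# Synthetic data: two well-separated blobs of 15 points each.
points = np.vstack([rng.normal(0, 1, (15, 2)), rng.normal(6, 1, (15, 2))])
condensed = pdist(points)            # condensed vector of pairwise distances

scores = {}
for method in ('single', 'complete', 'average'):
    Z = sch.linkage(condensed, method=method)
    c, _ = sch.cophenet(Z, condensed)   # correlation between tree and true distances
    scores[method] = c
```

A higher value in `scores` means the tree distorts the distances less; visual inspection of the dendrogram, as done below, remains a reasonable complement.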

In [12]:
D = 1-S
fig = plt.figure(figsize=(10,10))

# Compute and plot dendrogram.
ax2 = fig.add_axes([0.3,0.71,0.6,0.2])
Y = sch.linkage(D, method='average')
Z1 = sch.dendrogram(Y,color_threshold = 1.45)
ax2.set_xticks([])
ax2.set_yticks([])

# Plot distance matrix.
axmatrix = fig.add_axes([0.3,0.1,0.6,0.6])
idx1 = Z1['leaves']
D = D[idx1,:]
D = D[:,idx1]
im = axmatrix.matshow(D, aspect='auto', origin='lower',cmap='jet')
axmatrix.set_xticks([])
axmatrix.set_yticks([])

# Plot colorbar.
axcolor = fig.add_axes([0.91,0.1,0.02,0.6])
plt.colorbar(im, cax=axcolor)
C:\Users\eyal\Anaconda2\lib\site-packages\ipykernel_launcher.py:6: ClusterWarning: scipy.cluster: The symmetric non-negative hollow observation matrix looks suspiciously like an uncondensed distance matrix
  
Out[12]:
<matplotlib.colorbar.Colorbar at 0x1a72fc18>
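
The ClusterWarning above appears because sch.linkage received the square matrix D and therefore treated each row as an observation vector instead of using the entries as distances. If the intent is to cluster on the distances themselves, the matrix can first be converted to condensed form. A minimal sketch on a made-up matrix (`D_square` is hypothetical; applying this to the real D may change the clusters shown in this post):

```python
import numpy as np
import scipy.cluster.hierarchy as sch
from scipy.spatial.distance import squareform

# Hypothetical symmetric distance matrix with a zero diagonal.
rng = np.random.RandomState(3)
A = rng.rand(6, 6)
D_square = (A + A.T) / 2.0
np.fill_diagonal(D_square, 0.0)

condensed = squareform(D_square, checks=True)   # vector of length n*(n-1)/2
Z = sch.linkage(condensed, method='average')    # linkage on the distances directly
leaves = sch.dendrogram(Z, no_plot=True)['leaves']
```

The `leaves` order can then be used to reorder the matrix exactly as idx1 is used above.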

We can view the results of the clustering in a table format:

In [13]:
num_clusters = 6
T = sch.fcluster(Y,num_clusters,criterion='maxclust')
player_names = player_df.PLAYER_NAME.values

pd.set_option('display.max_rows', 70)
cluster_df = pd.DataFrame(player_names,columns = ['PLAYER_NAME'])
cluster_df['CLUSTER'] = T

d = cluster_df.groupby('CLUSTER').PLAYER_NAME.apply(list).to_dict()

pd.DataFrame({k : pd.Series(v) for k, v in d.items()}).add_prefix('CLUSTER ').fillna('')
Out[13]:
 | CLUSTER 1 | CLUSTER 2 | CLUSTER 3 | CLUSTER 4 | CLUSTER 5 | CLUSTER 6
0 | DeAndre Jordan | Tim Duncan | Kobe Bryant | Manu Ginobili | Dirk Nowitzki | Kyle Korver
1 | Greg Monroe | Pau Gasol | Paul Pierce | Tony Parker | |
2 | Jonas Valanciunas | Zach Randolph | Jamal Crawford | LeBron James | |
3 | Kenneth Faried | Nene | Joe Johnson | Dwyane Wade | |
4 | Andre Drummond | Carlos Boozer | Matt Barnes | David West | |
5 | | Luis Scola | Rasual Butler | Boris Diaw | |
6 | | Chris Kaman | Carmelo Anthony | Luol Deng | |
7 | | Zaza Pachulia | Chris Bosh | Josh Smith | |
8 | | Al Jefferson | Mo Williams | Deron Williams | |
9 | | Brandon Bass | Devin Harris | Jarrett Jack | |
10 | | Amir Johnson | JR Smith | Ersan Ilyasova | |
11 | | Marcin Gortat | Jameer Nelson | Monta Ellis | |
12 | | Al Horford | Kevin Martin | LaMarcus Aldridge | |
13 | | Thaddeus Young | Beno Udrih | Rudy Gay | |
14 | | Marc Gasol | Trevor Ariza | Rajon Rondo | |
15 | | Brook Lopez | Chris Paul | Paul Millsap | |
16 | | Marreese Speights | Gerald Green | J.J. Barea | |
17 | | Roy Hibbert | CJ Miles | Mike Conley | |
18 | | Blake Griffin | Lou Williams | Jeff Green | |
19 | | Tyreke Evans | JJ Redick | Corey Brewer | |
20 | | Jordan Hill | Kyle Lowry | Rodney Stuckey | |
21 | | Taj Gibson | PJ Tucker | Jason Smith | |
22 | | Derrick Favors | Arron Afflalo | Wilson Chandler | |
23 | | DeMarcus Cousins | O.J. Mayo | Aaron Brooks | |
24 | | Timofey Mozgov | Kevin Love | Derrick Rose | |
25 | | Enes Kanter | Danilo Gallinari | Russell Westbrook | |
26 | | Nikola Vucevic | Eric Gordon | D.J. Augustin | |
27 | | Donatas Motiejunas | Ryan Anderson | Jerryd Bayless | |
28 | | Anthony Davis | Courtney Lee | Serge Ibaka | |
29 | | Tyler Zeller | Nicolas Batum | Mario Chalmers | |
30 | | Henry Sims | George Hill | Luc Mbah a Moute | |
31 | | Nerlens Noel | Anthony Morrow | Goran Dragic | |
32 | | Gorgui Dieng | Stephen Curry | James Harden | |
33 | | Giannis Antetokounmpo | Gerald Henderson | DeMar DeRozan | |
34 | | Elfrid Payton | DeMarre Carroll | Brandon Jennings | |
35 | | | Wayne Ellington | Jrue Holiday | |
36 | | | Jodie Meeks | Ty Lawson | |
37 | | | Patrick Beverley | Jeff Teague | |
38 | | | Danny Green | Darren Collison | |
39 | | | Wesley Matthews | John Wall | |
40 | | | Wesley Johnson | Evan Turner | |
41 | | | Patrick Patterson | Gordon Hayward | |
42 | | | Avery Bradley | Eric Bledsoe | |
43 | | | Greivis Vasquez | Lance Stephenson | |
44 | | | Gary Neal | Jeremy Lin | |
45 | | | Brandon Knight | Kyrie Irving | |
46 | | | Klay Thompson | Kemba Walker | |
47 | | | Marcus Morris | Markieff Morris | |
48 | | | Nikola Mirotic | Kawhi Leonard | |
49 | | | Norris Cole | Tobias Harris | |
50 | | | Bojan Bogdanovic | Reggie Jackson | |
51 | | | Isaiah Thomas | Jimmy Butler | |
52 | | | Bradley Beal | Chandler Parsons | |
53 | | | Damian Lillard | Dion Waiters | |
54 | | | Terrence Ross | Harrison Barnes | |
55 | | | Jae Crowder | Evan Fournier | |
56 | | | Khris Middleton | Jared Sullinger | |
57 | | | Hollis Thompson | Draymond Green | |
58 | | | Ben McLemore | Dennis Schroder | |
59 | | | Kentavious Caldwell-Pope | Kelly Olynyk | |
60 | | | Robert Covington | Michael Carter-Williams | |
61 | | | Tim Hardaway Jr. | Victor Oladipo | |
62 | | | Trey Burke | Solomon Hill | |
63 | | | Langston Galloway | Zach LaVine | |
64 | | | | Jordan Clarkson | |
65 | | | | Andrew Wiggins | |

Plot heatmaps for each cluster to visualize the differences

In [14]:
f, axarr = plt.subplots(3,2,figsize=(20,30),facecolor='white')
for i in xrange(6):
    axarr[i/2,i%2].set_ylim([-10,41.5])
    axarr[i/2,i%2].set_xlim([25,-25])
    axarr[i/2,i%2].set_aspect(1)
    axarr[i/2,i%2].set_xticks([])
    axarr[i/2,i%2].set_yticks([])
    axarr[i/2,i%2].axis('off')
    axarr[i/2,i%2].set_title('Cluster {}'.format(i+1),fontsize=20.0)
    idx = T==i+1
    Lia = df['PLAYER_ID'].isin(players[idx])
    shot_heatmap(df.ix[Lia],ax=axarr[i/2,i%2],log=True,player_pic=False)
    if np.sum(Lia)<2000:
        a = 0.3
    else:
        a = 0.01
    axarr[i/2,i%2].scatter(-0.1*df.ix[Lia,'LOC_X'],0.1*df.ix[Lia,'LOC_Y'],marker='+',color='white',alpha = a)
    
f.subplots_adjust(hspace=0,wspace=0)

Conclusions about clusters:

  1. Under the basket
  2. Under the basket + midrange
  3. Under the basket + 3s
  4. Everywhere
  5. Dirk Nowitzki - no corner shots. Lots of midrange jumpers
  6. Kyle Korver - mostly from 3

Include player position from Nylon Calculus

Nylon Calculus (http://nyloncalculus.com/) used to publish player position data. I downloaded the player position list for the relevant season. The file can be found on my GitHub.

In [15]:
PP = pd.read_csv(r'C:\Users\eyal\Desktop\NBA\notebooks\player_postion.txt')
PP.head()
Out[15]:
PLAYER_NAME POSITION
0 James Harden 2.08
1 Andrew Wiggins 2.90
2 Damian Lillard 1.21
3 Chris Paul 1.01
4 Trevor Ariza 3.14

Merge dataframes

We need to merge the new dataframe of player positions with the one that holds the player information:

In [16]:
player_position = pd.merge(player_df,PP,on='PLAYER_NAME',how='left')
player_position = player_position.drop_duplicates(subset='PLAYER_NAME')
index = player_position['PLAYER_NAME'].index[player_position['POSITION'].apply(np.isnan)]
player_position.set_value(index, 'POSITION', [2.0,2.0])
player_position['cluster'] = T
player_position.head()
Out[16]:
PLAYER_ID PLAYER_NAME shots POSITION cluster
0 977 Kobe Bryant 713 2.31 3
1 1495 Tim Duncan 819 5.00 2
2 1717 Dirk Nowitzki 1062 4.12 5
3 1718 Paul Pierce 656 3.09 3
4 1938 Manu Ginobili 589 2.32 4
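
The cell above fills in a position of 2.0 for the players that had no match in the position file. A quick way to see which names failed to match is the `indicator` option of pd.merge. The sketch below uses made-up frames (`players_demo`, `positions_demo`) and placeholder names, not the real data:

```python
import pandas as pd

# Hypothetical stand-ins for player_df and PP.
players_demo = pd.DataFrame({'PLAYER_NAME': ['Player A', 'Player B', 'Player C']})
positions_demo = pd.DataFrame({'PLAYER_NAME': ['Player A', 'Player C'],
                               'POSITION': [1.2, 4.8]})

# indicator=True adds a '_merge' column telling where each row came from.
merged = pd.merge(players_demo, positions_demo, on='PLAYER_NAME',
                  how='left', indicator=True)

# Rows found only in the left frame have no position and need manual handling.
missing = merged.loc[merged['_merge'] == 'left_only', 'PLAYER_NAME'].tolist()
```

Listing `missing` before assigning defaults makes it explicit which players receive a hand-entered position.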

Plot (normalized) position distribution

In [17]:
n = np.zeros([5,num_clusters])
for i in xrange(num_clusters):
    temp = np.round(player_position[player_position['cluster']==i+1]['POSITION'].values+0.01)
    n[:,i],_ = np.histogram(temp,bins = [0,1.1,2.1,3.1,4.1,5.1])
In [18]:
n_nor = np.zeros([5,num_clusters])
for i in xrange(5):
    n_nor[i,:] = n[i,:]/sum(n[i,:])
In [19]:
fig = plt.figure(figsize=(10,8))
ax = fig.add_subplot(111)
x = np.array([1,2,3,4,5,6])
ax.bar(x-0.2, n_nor[0,:],width=0.1,color='b',align='center')
ax.bar(x-0.1, n_nor[1,:],width=0.1,color='g',align='center')
ax.bar(x, n_nor[2,:],width=0.1,color='r',align='center')
ax.bar(x+0.1, n_nor[3,:],width=0.1,color='k',align='center')
ax.bar(x+0.2, n_nor[4,:],width=0.1,color='y',align='center')
plt.xlabel('Cluster #')
plt.ylabel('Ratio')
plt.legend(['PG','SG','SF','PF','C'])
plt.xlim(0.5,6.5)
Out[19]:
(0.5, 6.5)
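
The normalized counts in n_nor give, for each position, the fraction of its players that falls in each cluster. The same table can be sketched with pd.crosstab; the frame `demo` below is a small made-up example, and rounding the fractional position to the nearest label is an assumption that mirrors the np.round step above:

```python
import numpy as np
import pandas as pd

# Hypothetical sample with fractional positions and cluster labels.
demo = pd.DataFrame({
    'POSITION': [1.01, 1.21, 2.08, 2.90, 3.14, 4.12, 5.00, 4.60],
    'cluster':  [4,    4,    3,    4,    3,    5,    2,    2],
})
labels = {1: 'PG', 2: 'SG', 3: 'SF', 4: 'PF', 5: 'C'}
demo['POS'] = np.round(demo['POSITION']).astype(int).map(labels)

# normalize='index' divides each row (position) by its total,
# so every row sums to 1 across clusters.
share = pd.crosstab(demo['POS'], demo['cluster'], normalize='index')
```

Calling `share.plot(kind='bar')` on such a table would give a grouped bar chart similar to the one above, with positions and clusters swapped on the axes.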

Based on this distribution we can reach the following conclusions:

  1. Cluster 4 has the most players, followed by cluster 3. This indicates that players are now expected to shoot from everywhere on the court.
  2. The vast majority of PGs and almost half of the SFs are in cluster 4. These are most likely the two positions that require the highest versatility.
  3. Only 5 players shoot exclusively from under the basket (4 of them are centers). This old style of play, where the big man shoots only from close range, is disappearing from the NBA.
  4. Most SGs are in cluster 3 (their job is to shoot 3s).
  5. The vast majority of centers are in cluster 2 - shooting from both under the basket and from midrange.
  6. PFs are the most evenly distributed between the clusters, with about 40% in clusters 2 and 4. This is the position that has evolved the most: some PFs these days shoot a lot of 3s, while others take a more traditional role of shooting under the basket and from midrange. About 20% are in cluster 3 - the "stretch" 4 is a new position in the NBA where big men with 3 point range are
  7. Korver and Nowitzki have a fairly unique shot selection.

Acknowledgements:

This project was done as part of CS 6140 / Data Mining at the University of Utah (http://www.cs.utah.edu/~jeffp/teaching/cs5140.html)

Many thanks to Prof. Jeff Phillips for his guidance.


