The goal of the project was to compare different NBA players based on their shot selection and cluster them into groups. These new groups can be compared to the assigned position of players to check for correlation.
%matplotlib inline
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits.axes_grid1 import make_axes_locatable
from scipy import misc,ndimage
import scipy.cluster.hierarchy as sch
import seaborn as sns
import NBAapi as nba
import urllib, cStringIO
from collections import defaultdict
Load shot chart data¶
df,_ = nba.shotchart.shotchartdetail(season='2014-15')
df.head()
Let's write a few useful functions that will be used throughout this project:¶
def KDE_heatmap(df,sigma=3):
    '''
    This function performs KDE calculation for a given shot chart.
    Input - dataframe with x and y shot coordinates.
    Option - sigma (in feet).
    Output - KDE of shot chart
    '''
    # LOC_X and LOC_Y are in tenths of a foot; each of the 500 bins spans 0.1 ft.
    N,_,_ = np.histogram2d( 0.1*df['LOC_X'].values, 0.1*df['LOC_Y'].values,bins = [500, 500],range = [[-25,25],[-5.25,44.75]])
    # The filter takes sigma in bins, hence the factor of 10 (bins per foot).
    KDE = ndimage.gaussian_filter(N,10.0*sigma)
    return 1.0*KDE/np.sum(KDE)
def players_picture(player_id):
    '''
    Input: player ID
    Output: player's picture
    '''
    URL = "http://stats.nba.com/media/players/230x185/%d.png" %player_id
    file = cStringIO.StringIO(urllib.urlopen(URL).read())
    return misc.imread(file)
def correlation_distance(N1,N2):
    '''
    Takes two 2D arrays from the KDE function and finds the distance between the two arrays.
    Output values are between 0-1 where 0 is identical and 1 is no similarity.
    '''
    D = np.sum(abs(N1-N2))/2.0
    return D
def shot_scatter(df,player_pic=True,ax=None,noise=True,**kwargs):
    '''
    Plotting scatter plot of shots.
    input - dataframe with x and y coordinates.
    optional - player_pic (default True) loads player picture. Use if dataframe is for a single player.
               ax (default None) can pass plot axis.
               noise (default True) adds some random scatter to the data for better visualization
    other - any variables that can be passed into the scatter function (e.g. transparency value)
    '''
    if ax is None:
        ax = plt.gca(xlim = [30,-30],ylim = [-7,43],xticks=[],yticks=[],aspect=1.0)
    nba.plot.court(ax,outer_lines=True,color='black',lw=2.0,direction='down')
    ax.axis('off')
    if noise:
        X = df.LOC_X.values + np.random.normal(loc=0.0, scale=1.5, size=len(df.LOC_X.values))
        Y = df.LOC_Y.values + np.random.normal(loc=0.0, scale=1.5, size=len(df.LOC_Y.values))
    else:
        X = df.LOC_X.values
        Y = df.LOC_Y.values
    ax.scatter(-0.1*X,0.1*Y,**kwargs)
    if player_pic:
        name = df.PLAYER_NAME.values[0]
        player_id = df.PLAYER_ID.values[0]
        pic = players_picture(player_id)
        ax.imshow(pic,extent=[15,25,30,37.8261])
        ax.text(20,29,name,fontsize=16,horizontalalignment='center',verticalalignment='center')
    ax.text(0,-7,'By: Doingthedishes',color='black',horizontalalignment='center',fontsize=20,fontweight='bold')
def shot_heatmap(df,sigma = 1,log=False,player_pic=True,ax=None,cmap='jet'):
    '''
    This function plots a heatmap based on the shot chart.
    input - dataframe with x and y coordinates.
    optional - log (default False) plots heatmap in log scale.
               player_pic (default True) adds player's picture if true
               sigma - the sigma of the Gaussian kernel. In feet (default=1)
    '''
    N = KDE_heatmap(df,sigma)
    if ax is None:
        ax = plt.gca(xlim = [30,-30],ylim = [-7,43],xticks=[],yticks=[],aspect=1.0)
    nba.plot.court(ax,outer_lines=True,color='black',lw=2.0,direction='down')
    ax.axis('off')
    if log:
        ax.imshow(np.rot90(np.log10(N+1)),cmap=cmap,extent=[25.0, -25.0, -5.25, 44.75])
    else:
        ax.imshow(np.rot90(N),cmap=cmap,extent=[25.0, -25.0, -5.25, 44.75])
    if player_pic:
        player_id = df.PLAYER_ID.values[0]
        pic = players_picture(player_id)
        ax.imshow(pic,extent=[15,25,30,37.8261])
    ax.text(0,-7,'By: Doingthedishes',color='white',horizontalalignment='center',fontsize=20,fontweight='bold')
Choose players with 500 shots or more:¶
player_df = pd.DataFrame({'shots' : df.groupby(by=['PLAYER_ID','PLAYER_NAME']).size()}).reset_index()
idx = player_df.shots.values > 500
player_df = player_df.ix[idx]
players = player_df.PLAYER_ID.values
player_df.head()
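As a rough illustration of the filtering step, here is a minimal sketch on a made-up shot log (the toy data and the `min_shots` name are assumptions, not part of the project). It uses `.loc` with a boolean mask, since the `.ix` indexer is not available in recent pandas releases:

```python
import pandas as pd

# Toy shot log (hypothetical data): one row per shot attempt.
toy = pd.DataFrame({
    'PLAYER_ID':   [1, 1, 1, 2, 2, 3],
    'PLAYER_NAME': ['A', 'A', 'A', 'B', 'B', 'C'],
})

# Count shots per player, then keep players above a threshold.
counts = (toy.groupby(['PLAYER_ID', 'PLAYER_NAME'])
             .size()
             .reset_index(name='shots'))
min_shots = 1  # the notebook uses 500 on the real data
kept = counts.loc[counts['shots'] > min_shots]
kept_ids = kept['PLAYER_ID'].tolist()
```

Here `kept_ids` holds the IDs of players A and B, who have more than one shot in the toy log.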
Plot example heatmaps + scatter plots¶
name = 'Stephen Curry'
f, axarr = plt.subplots(1,2,figsize=(20,10),facecolor='white')
for i in range(2):
    axarr[i].set_ylim([-10,41.5])
    axarr[i].set_xlim([25,-25])
    axarr[i].set_aspect(1)
    axarr[i].set_xticks([])
    axarr[i].set_yticks([])
    axarr[i].axis('off')
f.subplots_adjust(hspace=0,wspace=0)
shot_scatter(df[df['PLAYER_NAME']==name],ax=axarr[0],alpha = 0.2)
shot_heatmap(df[df['PLAYER_NAME']==name],ax=axarr[1],player_pic=False,log=True)
name = 'Dirk Nowitzki'
f, axarr = plt.subplots(1,2,figsize=(20,10),facecolor='white')
for i in range(2):
    axarr[i].set_ylim([-10,41.5])
    axarr[i].set_xlim([25,-25])
    axarr[i].set_aspect(1)
    axarr[i].set_xticks([])
    axarr[i].set_yticks([])
    axarr[i].axis('off')
f.subplots_adjust(hspace=0,wspace=0)
shot_scatter(df[df['PLAYER_NAME']==name],ax=axarr[0],alpha = 0.2)
shot_heatmap(df[df['PLAYER_NAME']==name],ax=axarr[1],player_pic = False)
Comparing players¶
In order to compare players I'm going to calculate the KDE for each player based on their shots during the entire season. The KDE is conceptually equivalent to calculating the heatmap for each player.
Note: when computing the KDE, the $\sigma$ of the Gaussian kernel needs to be chosen. A larger $\sigma$ means that shots farther apart still count as correlated, but we do not want a $\sigma$ so large that it washes out the spatial detail. I found that $\sigma = 3$ (feet) was a good compromise between resolution and ensuring that close shots are correlated.
hmaps = np.zeros([500,500,len(players)])
for i,player in enumerate(player_df.PLAYER_NAME):
    hmaps[:,:,i] = KDE_heatmap(df[df['PLAYER_ID']==players[i]])
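To make the effect of $\sigma$ concrete, here is a small sketch on synthetic shot coordinates. The `kde_grid` helper and the random data are assumptions for illustration only; the helper follows the same histogram-then-Gaussian-filter idea as `KDE_heatmap`, but takes coordinates already in feet rather than the tenths of a foot stored in `LOC_X` and `LOC_Y`:

```python
import numpy as np
from scipy import ndimage

def kde_grid(x_ft, y_ft, sigma_ft=3.0, bins_per_ft=10):
    """Histogram shots on a 50 ft x 50 ft grid, smooth with a Gaussian, normalize."""
    n_bins = 50 * bins_per_ft
    counts, _, _ = np.histogram2d(
        x_ft, y_ft, bins=[n_bins, n_bins],
        range=[[-25, 25], [-5.25, 44.75]])
    # gaussian_filter takes sigma in bins, so convert feet to bins.
    smooth = ndimage.gaussian_filter(counts, sigma_ft * bins_per_ft)
    return smooth / smooth.sum()

rng = np.random.RandomState(0)
x = rng.normal(0.0, 5.0, size=1000)   # synthetic shot coordinates, in feet
y = rng.normal(10.0, 5.0, size=1000)

narrow = kde_grid(x, y, sigma_ft=1.0)
wide = kde_grid(x, y, sigma_ft=3.0)
```

Both grids sum to 1, so they can be compared as probability maps. The wider kernel spreads each shot over a larger area, which lowers the peak of the map and makes nearby shots from two different players overlap more.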
Now that we have a heatmap (i.e. a KDE) for each player, we can take each pair of players and compare how similar their heatmaps are. To do so I'm using the correlation_distance function that I defined above. There are several ways to compute a similarity measure; I chose this one after exploring a few different options. Another similarity measure that I like (but am not showing here) is the Kernel Distance (https://arxiv.org/abs/1103.1625).
By comparing each pair of players we can create a similarity matrix $S$ with entries $1 - D$, where $D$ is the distance between the two players' shot densities (i.e. how different they are from each other):
S = np.zeros([len(players),len(players)])
for i in xrange(len(players)):
    for j in xrange(i,len(players)):
        S[i,j] = 1 - correlation_distance(hmaps[:,:,i],hmaps[:,:,j])
        S[j,i] = S[i,j]
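For intuition about the range of this distance, here is a tiny worked example on hand-made 2x2 "heatmaps" that each sum to 1. Half the summed absolute difference of two normalized maps is the quantity usually called the total variation distance; the `tv_distance` name and the toy arrays are mine, chosen only for illustration:

```python
import numpy as np

def tv_distance(p, q):
    """Half the L1 distance between two arrays that each sum to 1."""
    return np.abs(p - q).sum() / 2.0

p = np.array([[0.5, 0.5], [0.0, 0.0]])      # all shots in the top row
q = np.array([[0.0, 0.0], [0.5, 0.5]])      # all shots in the bottom row
r = np.array([[0.25, 0.25], [0.25, 0.25]])  # shots spread evenly

d_same = tv_distance(p, p)      # identical maps
d_disjoint = tv_distance(p, q)  # no overlap at all
d_partial = tv_distance(p, r)   # partial overlap

# Pairwise similarity matrix, built the same way as S above.
maps = [p, q, r]
n = len(maps)
S_toy = np.ones((n, n))
for i in range(n):
    for j in range(i + 1, n):
        S_toy[i, j] = S_toy[j, i] = 1.0 - tv_distance(maps[i], maps[j])
```

Identical maps give a distance of 0, maps with no overlap give 1, and the evenly spread map sits halfway from each of the other two.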
Let's plot the matrix D:¶
fig= plt.figure(figsize=(8,8))
im = plt.imshow(1-S,cmap='jet')
plt.colorbar(im,fraction=0.046, pad=0.04)
plt.show()
At this point it does not look like much. We need to work a little more to get some interesting information out of this data.
Find players with minimum similarity:¶
i,j = np.unravel_index(np.argmin(S),np.shape(S))
name1 = player_df.PLAYER_NAME.where(player_df['PLAYER_ID']== players[i], np.nan).max()
name2 = player_df.PLAYER_NAME.where(player_df['PLAYER_ID']== players[j], np.nan).max()
f, axarr = plt.subplots(1,2,figsize=(20,10),facecolor='white')
for n in range(2):
    axarr[n].set_ylim([-10,41.5])
    axarr[n].set_xlim([25,-25])
    axarr[n].set_aspect(1)
    axarr[n].set_xticks([])
    axarr[n].set_yticks([])
    axarr[n].axis('off')
f.suptitle('Similarity = {}'.format(np.round(S[i,j],2)),fontsize=20.0,y=1.02)
shot_scatter(df[df['PLAYER_NAME']==name1],ax=axarr[0],alpha=0.2)
shot_scatter(df[df['PLAYER_NAME']==name2],ax=axarr[1],alpha=0.2)
f.tight_layout()
Find players with maximum similarity¶
i,j = np.unravel_index(np.argmax(S-np.identity(len(S))),np.shape(S))
name1 = player_df.PLAYER_NAME.where(player_df['PLAYER_ID']== players[i], np.nan).max()
name2 = player_df.PLAYER_NAME.where(player_df['PLAYER_ID']== players[j], np.nan).max()
f, axarr = plt.subplots(1,2,figsize=(20,10),facecolor='white')
for n in range(2):
    axarr[n].set_ylim([-10,41.5])
    axarr[n].set_xlim([25,-25])
    axarr[n].set_aspect(1)
    axarr[n].set_xticks([])
    axarr[n].set_yticks([])
    axarr[n].axis('off')
f.suptitle('Similarity = {}'.format(np.round(S[i,j],2)),fontsize=20.0,y=1.02)
shot_scatter(df[df['PLAYER_NAME']==name1],ax=axarr[0],alpha=0.2)
shot_scatter(df[df['PLAYER_NAME']==name2],ax=axarr[1],alpha=0.2)
f.tight_layout()
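The search for the most and least similar pair can be illustrated on a made-up 3x3 similarity matrix. Subtracting the identity, as done above, pushes the diagonal down to 0 so a player is not matched with themselves; the sketch below uses an alternative I find a little more explicit, masking the diagonal with `-inf` on a copy. The matrix values are hypothetical:

```python
import numpy as np

S_toy = np.array([[1.0, 0.2, 0.7],
                  [0.2, 1.0, 0.4],
                  [0.7, 0.4, 1.0]])

masked = S_toy.copy()
np.fill_diagonal(masked, -np.inf)   # a player is never their own best match

# argmax/argmin work on the flattened array; unravel_index maps back to (row, col).
best = np.unravel_index(np.argmax(masked), masked.shape)
worst = np.unravel_index(np.argmin(S_toy), S_toy.shape)
```

Because the matrix is symmetric, each pair appears twice and the first occurrence in row-major order is the one returned.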
Clustering¶
The distance matrix D (D = 1 − S) can be clustered using hierarchical clustering. I tried several linkage functions, including single-link, complete-link and average-link; average-link yielded the best clusters. The matrix is then reorganized in such a way that close pairs are next to each other - a common technique used in genetic research (https://www.ncbi.nlm.nih.gov/pubmed/9843981).
After inspecting the dendrogram and the resulting matrix, the players were divided into 6 clusters.
D = 1-S
fig = plt.figure(figsize=(10,10))
# Compute and plot dendrogram.
ax2 = fig.add_axes([0.3,0.71,0.6,0.2])
Y = sch.linkage(D, method='average')
Z1 = sch.dendrogram(Y,color_threshold = 1.45)
ax2.set_xticks([])
ax2.set_yticks([])
# Plot distance matrix.
axmatrix = fig.add_axes([0.3,0.1,0.6,0.6])
idx1 = Z1['leaves']
D = D[idx1,:]
D = D[:,idx1]
im = axmatrix.matshow(D, aspect='auto', origin='lower',cmap='jet')
axmatrix.set_xticks([])
axmatrix.set_yticks([])
# Plot colorbar.
axcolor = fig.add_axes([0.91,0.1,0.02,0.6])
plt.colorbar(im, cax=axcolor)
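One detail worth knowing about `sch.linkage`: given a 2-D array, it treats each row as an observation vector and computes Euclidean distances between rows, while a precomputed distance matrix is expected in condensed (1-D) form, which `scipy.spatial.distance.squareform` produces. The call above passes the square matrix, so players are grouped by how similar their rows of D are. The sketch below, on a made-up 4x4 distance matrix, runs both variants side by side. Which one suits the real data better is a judgment call, and switching would change the dendrogram heights and therefore the color threshold:

```python
import numpy as np
import scipy.cluster.hierarchy as sch
from scipy.spatial.distance import squareform

# Toy symmetric distance matrix with zero diagonal (hypothetical values).
D_toy = np.array([[0.0, 0.1, 0.8, 0.9],
                  [0.1, 0.0, 0.7, 0.8],
                  [0.8, 0.7, 0.0, 0.2],
                  [0.9, 0.8, 0.2, 0.0]])

# Rows treated as feature vectors (what passing the square matrix does).
Z_rows = sch.linkage(D_toy, method='average')

# Entries treated as the pairwise distances themselves.
Z_pairs = sch.linkage(squareform(D_toy), method='average')

labels_rows = sch.fcluster(Z_rows, 2, criterion='maxclust')
labels_pairs = sch.fcluster(Z_pairs, 2, criterion='maxclust')
```

On this toy matrix both variants split the items into the same two groups, but the merge heights differ: the first merge happens at 0.1 (the stored distance) in `Z_pairs` and at 0.2 (the Euclidean distance between the first two rows) in `Z_rows`.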
We can view the results of the clustering in a table format:
num_clusters = 6
T = sch.fcluster(Y,num_clusters,criterion='maxclust')
player_names = player_df.PLAYER_NAME.values
pd.set_option('display.max_rows', 70)
cluster_df = pd.DataFrame(player_names,columns = ['PLAYER_NAME'])
cluster_df['CLUSTER'] = T
d = cluster_df.groupby('CLUSTER').PLAYER_NAME.apply(list).to_dict()
pd.DataFrame({k : pd.Series(v) for k, v in d.items()}).add_prefix('CLUSTER ').fillna('')
Plot heatmaps for each cluster to visualize the differences¶
f, axarr = plt.subplots(3,2,figsize=(20,30),facecolor='white')
for i in xrange(6):
    axarr[i/2,i%2].set_ylim([-10,41.5])
    axarr[i/2,i%2].set_xlim([25,-25])
    axarr[i/2,i%2].set_aspect(1)
    axarr[i/2,i%2].set_xticks([])
    axarr[i/2,i%2].set_yticks([])
    axarr[i/2,i%2].axis('off')
    axarr[i/2,i%2].set_title('Cluster {}'.format(i+1),fontsize=20.0)
    idx = T==i+1
    Lia = df['PLAYER_ID'].isin(players[idx])
    shot_heatmap(df.ix[Lia],ax=axarr[i/2,i%2],log=True,player_pic=False)
    if np.sum(Lia)<2000:
        a = 0.3
    else:
        a = 0.01
    axarr[i/2,i%2].scatter(-0.1*df.ix[Lia,'LOC_X'],0.1*df.ix[Lia,'LOC_Y'],marker='+',color='white',alpha = a)
f.subplots_adjust(hspace=0,wspace=0)
Conclusions about clusters:
- Cluster 1: Under the basket
- Cluster 2: Under the basket + midrange
- Cluster 3: Under the basket + 3s
- Cluster 4: Everywhere
- Cluster 5: Dirk Nowitzki - no corner shots. Lots of midrange jumpers
- Cluster 6: Kyle Korver - mostly from 3
Include player position from nylon calculus¶
Nylon Calculus (http://nyloncalculus.com/) used to have player position data. I downloaded the player position list for the relevant year. This file can be found on my GitHub.
PP = pd.read_csv(r'C:\Users\eyal\Desktop\NBA\notebooks\player_postion.txt')
PP.head()
Merge dataframes¶
We need to merge the new dataframe of player positions with the one that holds the player information:
player_position = pd.merge(player_df,PP,on='PLAYER_NAME',how='left')
player_position = player_position.drop_duplicates(subset='PLAYER_NAME')
index = player_position['PLAYER_NAME'].index[player_position['POSITION'].apply(np.isnan)]
player_position.set_value(index, 'POSITION', [2.0,2.0])
player_position['cluster'] = T
player_position.head()
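The merge-and-patch step can be sketched on toy data. The names and the placeholder position below are made up; the point is only to show how a left join leaves `NaN` for players missing from the position list, and how those gaps can be filled by player name rather than by their order in the index. Note also that `DataFrame.set_value`, used above, is not available in recent pandas releases:

```python
import pandas as pd

players_toy = pd.DataFrame({'PLAYER_NAME': ['A', 'B', 'C']})
positions_toy = pd.DataFrame({'PLAYER_NAME': ['A', 'C'],
                              'POSITION': [1.0, 4.5]})

# Left join keeps every player; unmatched ones get NaN in POSITION.
merged = pd.merge(players_toy, positions_toy, on='PLAYER_NAME', how='left')
missing = merged.loc[merged['POSITION'].isnull(), 'PLAYER_NAME'].tolist()

# Fill the gaps by name; 2.0 is a placeholder value for this example.
manual = {'B': 2.0}
merged['POSITION'] = merged['POSITION'].fillna(merged['PLAYER_NAME'].map(manual))
```

Listing `missing` before patching makes it easy to check that the manual values go to the intended players.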
Plot (normalized) position distribution¶
n = np.zeros([5,num_clusters])
for i in xrange(num_clusters):
    temp = np.round(player_position[player_position['cluster']==i+1]['POSITION'].values+0.01)
    n[:,i],_ = np.histogram(temp,bins = [0,1.1,2.1,3.1,4.1,5.1])
n_nor = np.zeros([5,num_clusters])
for i in xrange(5):
    n_nor[i,:] = n[i,:]/sum(n[i,:])
fig = plt.figure(figsize=(10,8))
ax = fig.add_subplot(111)
x = np.array([1,2,3,4,5,6])
ax.bar(x-0.2, n_nor[0,:],width=0.1,color='b',align='center')
ax.bar(x-0.1, n_nor[1,:],width=0.1,color='g',align='center')
ax.bar(x, n_nor[2,:],width=0.1,color='r',align='center')
ax.bar(x+0.1, n_nor[3,:],width=0.1,color='k',align='center')
ax.bar(x+0.2, n_nor[4,:],width=0.1,color='y',align='center')
plt.xlabel('Cluster #')
plt.ylabel('Ratio')
plt.legend(['PG','SG','SF','PF','C'])
plt.xlim(0.5,6.5)
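The normalization used in this plot divides each position's counts by that position's total, so the bars for one position sum to 1 across clusters. A compact way to get the same kind of table, shown here on made-up data with position labels instead of the rounded numeric positions, is `pd.crosstab` with `normalize='index'`:

```python
import pandas as pd

# Toy assignment of players to positions and clusters (hypothetical data).
toy = pd.DataFrame({
    'POSITION': ['PG', 'PG', 'SG', 'C', 'C', 'C'],
    'cluster':  [4, 4, 3, 2, 2, 1],
})

# Rows: positions, columns: clusters, values: share of that position in each cluster.
share = pd.crosstab(toy['POSITION'], toy['cluster'], normalize='index')
```

In this toy table two of the three centers fall in cluster 2, so `share.loc['C', 2]` is 2/3, and every row sums to 1.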
Based on this distribution we can reach the following conclusions:¶
- Cluster 4 has the most players, followed by cluster 3. This is an indication that players are required to shoot from everywhere on the court these days.
- The vast majority of PGs and almost half of the SFs are in cluster 4. These are most likely the two positions that require the highest versatility.
- There are only 5 players who shoot only from under the basket (4 of them are centers). This old style of play, where the big man shoots only from close range, is disappearing from the NBA.
- Most SGs are in cluster 3 (their job is to shoot 3s).
- The vast majority of centers are in cluster 2 - shooting from both under the basket and from midrange.
- PFs are the most evenly distributed between the clusters, with about 40% in clusters 2 and 4. This is the position that has evolved the most: some PFs these days shoot a lot of 3s, while others take a more traditional role of shooting under the basket and from midrange. About 20% are in cluster 3 - the "stretch" 4 is a new position in the NBA where big men with 3 point range are
- Korver and Nowitzki have a fairly unique shot selection.
Acknowledgements:¶
This project was done as part of CS 6140 / Data Mining at the University of Utah (http://www.cs.utah.edu/~jeffp/teaching/cs5140.html)
Many thanks to Prof. Jeff Phillips for his guidance.