The goal of the project was to compare different NBA players based on their shot selection and cluster them into groups. These new groups can be compared to the assigned position of players to check for correlation.

In [1]:

%matplotlib inline

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits.axes_grid1 import make_axes_locatable
from scipy import misc,ndimage
import scipy.cluster.hierarchy as sch
import seaborn as sns
import NBAapi as nba
import urllib, cStringIO
from collections import defaultdict

Load shot chart data

In [2]:

df,_ = nba.shotchart.shotchartdetail(season='2014-15')
df.head()

Out[2]:

	GRID_TYPE	GAME_ID	GAME_EVENT_ID	PLAYER_ID	PLAYER_NAME	TEAM_ID	TEAM_NAME	PERIOD	MINUTES_REMAINING	SECONDS_REMAINING	...	SHOT_ZONE_AREA	SHOT_ZONE_RANGE	SHOT_DISTANCE	LOC_X	LOC_Y	SHOT_ATTEMPTED_FLAG	SHOT_MADE_FLAG	GAME_DATE	HTM	VTM
0	Shot Chart Detail	0021400001	2	203076	Anthony Davis	1610612740	New Orleans Pelicans	1	11	43	...	Center(C)	16-24 ft.	20	50	194	1	0	20141028	NOP	ORL
1	Shot Chart Detail	0021400001	4	202696	Nikola Vucevic	1610612753	Orlando Magic	1	11	31	...	Center(C)	16-24 ft.	18	-8	189	1	1	20141028	NOP	ORL
2	Shot Chart Detail	0021400001	7	203076	Anthony Davis	1610612740	New Orleans Pelicans	1	11	6	...	Left Side Center(LC)	16-24 ft.	18	-131	127	1	0	20141028	NOP	ORL
3	Shot Chart Detail	0021400001	9	203901	Elfrid Payton	1610612753	Orlando Magic	1	10	54	...	Center(C)	Less Than 8 ft.	1	-15	4	1	0	20141028	NOP	ORL
4	Shot Chart Detail	0021400001	25	203076	Anthony Davis	1610612740	New Orleans Pelicans	1	10	29	...	Center(C)	Less Than 8 ft.	0	0	1	1	1	20141028	NOP	ORL

5 rows × 24 columns

Let's write a few useful functions that will be used throughout this project:

In [3]:

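def KDE_heatmap(df,sigma=3):
    '''
    This function performs the KDE calculation for a given shot chart.
    Input  - dataframe with x and y shot coordinates (LOC_X, LOC_Y, in tenths of a foot).
    Option - sigma of the Gaussian kernel (in feet).
    Output - KDE of the shot chart on a 500x500 grid, normalized to sum to 1.
    '''
    # 500 bins over 50 ft in each direction -> 10 bins per foot
    N,_,_ = np.histogram2d( 0.1*df['LOC_X'].values, 0.1*df['LOC_Y'].values,bins = [500, 500],range = [[-25,25],[-5.25,44.75]])
    # sigma is given in feet, so convert it to bins (10 bins per foot)
    KDE = ndimage.gaussian_filter(N,10.0*sigma)
    return 1.0*KDE/np.sum(KDE)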

def players_picture(player_id):
    '''
    Input: player ID
    Output: players picture
    '''
    URL = "http://stats.nba.com/media/players/230x185/%d.png" %player_id
    file = cStringIO.StringIO(urllib.urlopen(URL).read())
    return misc.imread(file)

def correlation_distance(N1,N2):
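    '''
    Takes two 2D arrays from the KDE function and returns the distance between them,
    computed as half the sum of absolute differences (the total variation distance).
    Because each KDE sums to 1, output values are between 0 and 1,
    where 0 is identical and 1 is no overlap.
    '''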
    D = np.sum(abs(N1-N2))/2.0
    return D

def shot_scatter(df,player_pic=True,ax=None,noise=True,**kwargs):
    '''
    Plots a scatter plot of shots.
    input - dataframe with x and y coordinates.
    optional - player_pic (default True) loads the player's picture. Use if the dataframe is for a single player.
               ax (default None) an existing plot axis to draw on.
               noise (default True) adds some random scatter to the data for better visualization.
               other - any keyword arguments accepted by the scatter function (e.g. the transparency value alpha).
    '''
    if ax is None: 
        ax = plt.gca(xlim = [30,-30],ylim = [-7,43],xticks=[],yticks=[],aspect=1.0)
    nba.plot.court(ax,outer_lines=True,color='black',lw=2.0,direction='down')
    ax.axis('off')
    if noise:
        X = df.LOC_X.values + np.random.normal(loc=0.0, scale=1.5, size=len(df.LOC_X.values))
        Y = df.LOC_Y.values + np.random.normal(loc=0.0, scale=1.5, size=len(df.LOC_Y.values))
    else:
        X = df.LOC_X.values
        Y = df.LOC_Y.values
    ax.scatter(-0.1*X,0.1*Y,**kwargs)
    if player_pic:
        name = df.PLAYER_NAME.values[0]
        player_id = df.PLAYER_ID.values[0]
        pic = players_picture(player_id)
        ax.imshow(pic,extent=[15,25,30,37.8261])
        ax.text(20,29,name,fontsize=16,horizontalalignment='center',verticalalignment='center')
    ax.text(0,-7,'By: Doingthedishes',color='black',horizontalalignment='center',fontsize=20,fontweight='bold')
    
def shot_heatmap(df,sigma = 1,log=False,player_pic=True,ax=None,cmap='jet'):
    '''
    This function plots a heatmap based on the shot chart.
    input - dataframe with x and y coordinates.
    optional - sigma (default 1) the sigma of the Gaussian kernel, in feet.
               log (default False) plots the heatmap in log scale.
               player_pic (default True) adds the player's picture if True.
               ax (default None) an existing plot axis to draw on.
               cmap (default 'jet') colormap for the heatmap.
    '''
    N = KDE_heatmap(df,sigma)
    if ax is None:
        ax = plt.gca(xlim = [30,-30],ylim = [-7,43],xticks=[],yticks=[],aspect=1.0)
    nba.plot.court(ax,outer_lines=True,color='black',lw=2.0,direction='down')
    ax.axis('off')
    if log:
        ax.imshow(np.rot90(np.log10(N+1)),cmap=cmap,extent=[25.0, -25.0, -5.25, 44.75])
    else:
        ax.imshow(np.rot90(N),cmap=cmap,extent=[25.0, -25.0, -5.25, 44.75])
    if player_pic:
        player_id = df.PLAYER_ID.values[0]
        pic = players_picture(player_id)
        ax.imshow(pic,extent=[15,25,30,37.8261])
    ax.text(0,-7,'By: Doingthedishes',color='white',horizontalalignment='center',fontsize=20,fontweight='bold')

Choose players with more than 500 shots:

In [4]:

player_df = pd.DataFrame({'shots' : df.groupby(by=['PLAYER_ID','PLAYER_NAME']).size()}).reset_index()
idx = player_df.shots.values > 500
player_df = player_df.loc[idx]
players = player_df.PLAYER_ID.values
player_df.head()

Out[4]:

	PLAYER_ID	PLAYER_NAME	shots
1	977	Kobe Bryant	713
2	1495	Tim Duncan	819
4	1717	Dirk Nowitzki	1062
5	1718	Paul Pierce	656
12	1938	Manu Ginobili	589

Plot example heatmaps + scatter plots

In [5]:

name = 'Stephen Curry'
f, axarr = plt.subplots(1,2,figsize=(20,10),facecolor='white')
for i in range(2):
    axarr[i].set_ylim([-10,41.5])
    axarr[i].set_xlim([25,-25])
    axarr[i].set_aspect(1)
    axarr[i].set_xticks([])
    axarr[i].set_yticks([])
    axarr[i].axis('off')
f.subplots_adjust(hspace=0,wspace=0)
shot_scatter(df[df['PLAYER_NAME']==name],ax=axarr[0],alpha = 0.2)
shot_heatmap(df[df['PLAYER_NAME']==name],ax=axarr[1],player_pic=False,log=True)

In [6]:

name = 'Dirk Nowitzki'
f, axarr = plt.subplots(1,2,figsize=(20,10),facecolor='white')
for i in range(2):
    axarr[i].set_ylim([-10,41.5])
    axarr[i].set_xlim([25,-25])
    axarr[i].set_aspect(1)
    axarr[i].set_xticks([])
    axarr[i].set_yticks([])
    axarr[i].axis('off')
f.subplots_adjust(hspace=0,wspace=0)
shot_scatter(df[df['PLAYER_NAME']==name],ax=axarr[0],alpha = 0.2)
shot_heatmap(df[df['PLAYER_NAME']==name],ax=axarr[1],player_pic = False)

Comparing players

In order to compare players I'm going to calculate the KDE for each player based on their shots during the entire season. The KDE is conceptually equivalent to calculating the heatmap for each player.

Note: when computing the KDE, the $\sigma$ of the Gaussian kernel needs to be chosen. A larger $\sigma$ means that shots farther apart are still going to be correlated, but we do not want to choose a $\sigma$ that is too large. I found that $\sigma = 3$ (feet) was a good compromise between resolution and ensuring that close shots are correlated.

In [7]:

hmaps = np.zeros([500,500,len(players)])
for i,player_id in enumerate(players):
    hmaps[:,:,i] = KDE_heatmap(df[df['PLAYER_ID']==player_id])
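
To make the role of $\sigma$ concrete, here is a minimal sketch of the same histogram-then-blur idea on synthetic shots. The coordinates below are randomly generated, not real data, and `kde_grid` is a hypothetical helper name rather than part of this notebook. With 500 bins spanning 50 feet there are 10 bins per foot, which is why the kernel width in feet is multiplied by 10 before filtering.

```python
import numpy as np
from scipy import ndimage

def kde_grid(x_ft, y_ft, sigma_ft=3.0, bins=500,
             extent=((-25.0, 25.0), (-5.25, 44.75))):
    """Histogram shots on a court grid, blur with a Gaussian, normalize to sum to 1."""
    counts, _, _ = np.histogram2d(x_ft, y_ft, bins=[bins, bins], range=extent)
    # Convert sigma from feet to bins: 500 bins / 50 ft = 10 bins per foot.
    bins_per_foot = bins / (extent[0][1] - extent[0][0])
    smoothed = ndimage.gaussian_filter(counts, sigma_ft * bins_per_foot)
    return smoothed / smoothed.sum()

# Synthetic shots in feet (assumption: a single blob near the rim).
rng = np.random.RandomState(0)
x = rng.normal(0.0, 2.0, size=200)
y = rng.normal(1.0, 2.0, size=200)

narrow = kde_grid(x, y, sigma_ft=1.0)  # sharper map, more sensitive to exact shot location
wide = kde_grid(x, y, sigma_ft=3.0)    # smoother map, nearby shots overlap more
```

Both maps sum to 1, but the wider kernel spreads the same mass over more of the grid, so its peak is lower. That is the resolution trade-off described in the note above.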

Now that we have a heatmap (i.e. KDE) for each player, we can take each pair of players and compare how similar their heatmaps are. To do so I'm using the correlation_distance function defined above. There are several ways to compute a similarity measure; I chose this one after exploring a few different options. Another similarity measure that I like (but am not showing here) is the Kernel Distance (https://arxiv.org/abs/1103.1625).

By comparing each pair of players we can create a similarity matrix $S$ using $1 - D$ where D is the player's shot density distance (i.e. how different they are from each other):

In [8]:

S = np.zeros([len(players),len(players)])
for i in range(len(players)):
    for j in range(i,len(players)):
        S[i,j] = 1 - correlation_distance(hmaps[:,:,i],hmaps[:,:,j])
        S[j,i] = S[i,j]
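
As a small worked example of this distance, the sketch below uses three hand-made 2x2 "heatmaps" that each sum to 1 (toy values, not player data). The name `tv_distance` is mine and mirrors what correlation_distance computes. The pairwise matrix is built with `scipy.spatial.distance.pdist` using the city-block metric, which is one possible alternative to the double loop, not the method used in this notebook.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

def tv_distance(p, q):
    """Half the sum of absolute differences between two arrays that each sum to 1."""
    return np.abs(p - q).sum() / 2.0

# Toy heatmaps (assumption: illustrative values only).
a = np.array([[0.5, 0.5], [0.0, 0.0]])      # all mass in the top row
b = np.array([[0.0, 0.0], [0.5, 0.5]])      # all mass in the bottom row
c = np.array([[0.25, 0.25], [0.25, 0.25]])  # mass spread evenly

maps = np.stack([a, b, c], axis=-1)            # shape (2, 2, 3), like hmaps
flat = maps.reshape(-1, maps.shape[-1]).T      # one flattened map per row
D = squareform(pdist(flat, metric='cityblock')) / 2.0
S = 1.0 - D                                    # similarity matrix
```

Here `a` and `b` do not overlap at all, so their distance is 1 and their similarity is 0, while `a` and `c` share half their mass and end up at distance 0.5.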

Let's plot the matrix D:

In [9]:

fig= plt.figure(figsize=(8,8))
im = plt.imshow(1-S,cmap='jet')
plt.colorbar(im,fraction=0.046, pad=0.04)
plt.show()

At this point it does not look like much. We need to work a little more to get some interesting information out of this data.

Find players with minimum similarity:

In [10]:

i,j = np.unravel_index(np.argmin(S),np.shape(S))
name1 =  player_df.PLAYER_NAME.where(player_df['PLAYER_ID']== players[i], np.nan).max()
name2 =  player_df.PLAYER_NAME.where(player_df['PLAYER_ID']== players[j], np.nan).max()
f, axarr = plt.subplots(1,2,figsize=(20,10),facecolor='white')
for n in range(2):
    axarr[n].set_ylim([-10,41.5])
    axarr[n].set_xlim([25,-25])
    axarr[n].set_aspect(1)
    axarr[n].set_xticks([])
    axarr[n].set_yticks([])
    axarr[n].axis('off')
f.suptitle('Similarity = {}'.format(np.round(S[i,j],2)),fontsize=20.0,y=1.02)
shot_scatter(df[df['PLAYER_NAME']==name1],ax=axarr[0],alpha=0.2)
shot_scatter(df[df['PLAYER_NAME']==name2],ax=axarr[1],alpha=0.2)
f.tight_layout()

Find players with maximum similarity

In [11]:

i,j = np.unravel_index(np.argmax(S-np.identity(len(S))),np.shape(S))
name1 =  player_df.PLAYER_NAME.where(player_df['PLAYER_ID']== players[i], np.nan).max()
name2 =  player_df.PLAYER_NAME.where(player_df['PLAYER_ID']== players[j], np.nan).max()
f, axarr = plt.subplots(1,2,figsize=(20,10),facecolor='white')
for n in range(2):
    axarr[n].set_ylim([-10,41.5])
    axarr[n].set_xlim([25,-25])
    axarr[n].set_aspect(1)
    axarr[n].set_xticks([])
    axarr[n].set_yticks([])
    axarr[n].axis('off')
f.suptitle('Similarity = {}'.format(np.round(S[i,j],2)),fontsize=20.0,y=1.02)
shot_scatter(df[df['PLAYER_NAME']==name1],ax=axarr[0],alpha=0.2)
shot_scatter(df[df['PLAYER_NAME']==name2],ax=axarr[1],alpha=0.2)
f.tight_layout()
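
The index lookups in the last two cells can be illustrated on a tiny similarity matrix (made-up values). Every player has similarity 1 with themselves, so the diagonal has to be excluded before searching for the most similar pair. The cell above does this by subtracting the identity matrix; the sketch below uses `np.fill_diagonal` as an alternative way to mask it.

```python
import numpy as np

# Toy symmetric similarity matrix (assumption: illustrative values only).
S = np.array([[1.0, 0.2, 0.7],
              [0.2, 1.0, 0.4],
              [0.7, 0.4, 1.0]])

# Least similar pair: the diagonal is the maximum, so argmin needs no masking.
i_min, j_min = np.unravel_index(np.argmin(S), S.shape)

# Most similar pair: mask the diagonal first so a player is not matched with themselves.
masked = S.copy()
np.fill_diagonal(masked, -np.inf)
i_max, j_max = np.unravel_index(np.argmax(masked), masked.shape)
```

`np.argmin` and `np.argmax` return positions in the flattened array, and `np.unravel_index` converts them back to (row, column) pairs.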

Clustering

The distance matrix D (D = 1 − S) can be clustered using hierarchical clustering. I tried several linkage functions, including single-link, complete-link and average-link; average-link yielded the best clusters. The matrix is then reordered in such a way that close pairs are next to each other, a common technique in genetic research (https://www.ncbi.nlm.nih.gov/pubmed/9843981).

After inspecting the dendrogram and the resulting matrix, the players were divided into 6 clusters.

In [12]:

D = 1-S
fig = plt.figure(figsize=(10,10))

# Compute and plot dendrogram.
ax2 = fig.add_axes([0.3,0.71,0.6,0.2])
Y = sch.linkage(D, method='average')
Z1 = sch.dendrogram(Y,color_threshold = 1.45)
ax2.set_xticks([])
ax2.set_yticks([])

# Plot distance matrix.
axmatrix = fig.add_axes([0.3,0.1,0.6,0.6])
idx1 = Z1['leaves']
D = D[idx1,:]
D = D[:,idx1]
im = axmatrix.matshow(D, aspect='auto', origin='lower',cmap='jet')
axmatrix.set_xticks([])
axmatrix.set_yticks([])

# Plot colorbar.
axcolor = fig.add_axes([0.91,0.1,0.02,0.6])
plt.colorbar(im, cax=axcolor)

C:\Users\eyal\Anaconda2\lib\site-packages\ipykernel_launcher.py:6: ClusterWarning: scipy.cluster: The symmetric non-negative hollow observation matrix looks suspiciously like an uncondensed distance matrix
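
The warning appears because `sch.linkage` treats a square 2D input as a table of observations (one row per item) rather than as pairwise distances. A minimal sketch of the alternative, on a toy 4x4 distance matrix, is to convert to the condensed form with `scipy.spatial.distance.squareform` first. Note that this is not what the cell above does: switching would likely change the dendrogram, the color threshold, and the cluster assignments reported in this notebook.

```python
import numpy as np
import scipy.cluster.hierarchy as sch
from scipy.spatial.distance import squareform

# Toy symmetric distance matrix with a zero diagonal (assumption: illustrative values).
D = np.array([[0.0, 0.1, 0.8, 0.9],
              [0.1, 0.0, 0.7, 0.8],
              [0.8, 0.7, 0.0, 0.2],
              [0.9, 0.8, 0.2, 0.0]])

condensed = squareform(D)                  # upper triangle as a flat vector, length n*(n-1)/2
Y = sch.linkage(condensed, method='average')

# Reorder rows and columns so that items merged early sit next to each other.
leaves = sch.leaves_list(Y)
D_ordered = D[np.ix_(leaves, leaves)]
```

In this toy case items 0 and 1 merge first, then items 2 and 3, and the two groups join at the average of the four cross-group distances.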

Out[12]:

<matplotlib.colorbar.Colorbar at 0x1a72fc18>

We can view the results of the clustering in a table format:

In [13]:

num_clusters = 6
T = sch.fcluster(Y,num_clusters,criterion='maxclust')
player_names = player_df.PLAYER_NAME.values

pd.set_option('display.max_rows', 70)
cluster_df = pd.DataFrame(player_names,columns = ['PLAYER_NAME'])
cluster_df['CLUSTER'] = T

d = cluster_df.groupby('CLUSTER').PLAYER_NAME.apply(list).to_dict()

pd.DataFrame({k : pd.Series(v) for k, v in d.items()}).add_prefix('CLUSTER ').fillna('')

Out[13]:

	CLUSTER 1	CLUSTER 2	CLUSTER 3	CLUSTER 4	CLUSTER 5	CLUSTER 6
0	DeAndre Jordan	Tim Duncan	Kobe Bryant	Manu Ginobili	Dirk Nowitzki	Kyle Korver
1	Greg Monroe	Pau Gasol	Paul Pierce	Tony Parker
2	Jonas Valanciunas	Zach Randolph	Jamal Crawford	LeBron James
3	Kenneth Faried	Nene	Joe Johnson	Dwyane Wade
4	Andre Drummond	Carlos Boozer	Matt Barnes	David West
5		Luis Scola	Rasual Butler	Boris Diaw
6		Chris Kaman	Carmelo Anthony	Luol Deng
7		Zaza Pachulia	Chris Bosh	Josh Smith
8		Al Jefferson	Mo Williams	Deron Williams
9		Brandon Bass	Devin Harris	Jarrett Jack
10		Amir Johnson	JR Smith	Ersan Ilyasova
11		Marcin Gortat	Jameer Nelson	Monta Ellis
12		Al Horford	Kevin Martin	LaMarcus Aldridge
13		Thaddeus Young	Beno Udrih	Rudy Gay
14		Marc Gasol	Trevor Ariza	Rajon Rondo
15		Brook Lopez	Chris Paul	Paul Millsap
16		Marreese Speights	Gerald Green	J.J. Barea
17		Roy Hibbert	CJ Miles	Mike Conley
18		Blake Griffin	Lou Williams	Jeff Green
19		Tyreke Evans	JJ Redick	Corey Brewer
20		Jordan Hill	Kyle Lowry	Rodney Stuckey
21		Taj Gibson	PJ Tucker	Jason Smith
22		Derrick Favors	Arron Afflalo	Wilson Chandler
23		DeMarcus Cousins	O.J. Mayo	Aaron Brooks
24		Timofey Mozgov	Kevin Love	Derrick Rose
25		Enes Kanter	Danilo Gallinari	Russell Westbrook
26		Nikola Vucevic	Eric Gordon	D.J. Augustin
27		Donatas Motiejunas	Ryan Anderson	Jerryd Bayless
28		Anthony Davis	Courtney Lee	Serge Ibaka
29		Tyler Zeller	Nicolas Batum	Mario Chalmers
30		Henry Sims	George Hill	Luc Mbah a Moute
31		Nerlens Noel	Anthony Morrow	Goran Dragic
32		Gorgui Dieng	Stephen Curry	James Harden
33		Giannis Antetokounmpo	Gerald Henderson	DeMar DeRozan
34		Elfrid Payton	DeMarre Carroll	Brandon Jennings
35			Wayne Ellington	Jrue Holiday
36			Jodie Meeks	Ty Lawson
37			Patrick Beverley	Jeff Teague
38			Danny Green	Darren Collison
39			Wesley Matthews	John Wall
40			Wesley Johnson	Evan Turner
41			Patrick Patterson	Gordon Hayward
42			Avery Bradley	Eric Bledsoe
43			Greivis Vasquez	Lance Stephenson
44			Gary Neal	Jeremy Lin
45			Brandon Knight	Kyrie Irving
46			Klay Thompson	Kemba Walker
47			Marcus Morris	Markieff Morris
48			Nikola Mirotic	Kawhi Leonard
49			Norris Cole	Tobias Harris
50			Bojan Bogdanovic	Reggie Jackson
51			Isaiah Thomas	Jimmy Butler
52			Bradley Beal	Chandler Parsons
53			Damian Lillard	Dion Waiters
54			Terrence Ross	Harrison Barnes
55			Jae Crowder	Evan Fournier
56			Khris Middleton	Jared Sullinger
57			Hollis Thompson	Draymond Green
58			Ben McLemore	Dennis Schroder
59			Kentavious Caldwell-Pope	Kelly Olynyk
60			Robert Covington	Michael Carter-Williams
61			Tim Hardaway Jr.	Victor Oladipo
62			Trey Burke	Solomon Hill
63			Langston Galloway	Zach LaVine
64				Jordan Clarkson
65				Andrew Wiggins
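
The cluster assignment step can be sketched on toy data as follows. The names and distances are hypothetical, and the linkage here is built from a condensed distance vector (via `squareform`), which differs from the call used in this notebook. With `criterion='maxclust'`, `fcluster` cuts the tree so that at most the requested number of flat clusters is formed.

```python
import numpy as np
import pandas as pd
import scipy.cluster.hierarchy as sch
from scipy.spatial.distance import squareform

names = np.array(['A', 'B', 'C', 'D', 'E'])  # hypothetical player names

# Toy distances: A and B are close, C, D and E are close, the two groups are far apart.
D = np.array([[0.0, 0.1, 0.9, 0.9, 0.9],
              [0.1, 0.0, 0.9, 0.9, 0.9],
              [0.9, 0.9, 0.0, 0.2, 0.2],
              [0.9, 0.9, 0.2, 0.0, 0.2],
              [0.9, 0.9, 0.2, 0.2, 0.0]])

Y = sch.linkage(squareform(D), method='average')
labels = sch.fcluster(Y, 2, criterion='maxclust')   # one cluster label per player

# Group the names by cluster label, as in the table above.
groups = pd.Series(names).groupby(labels).apply(list).to_dict()
```

The label numbers themselves carry no meaning; only the grouping does.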

Plot heatmaps for each cluster to visualize the differences

In [14]:

f, axarr = plt.subplots(3,2,figsize=(20,30),facecolor='white')
for i in range(6):
    ax = axarr[i//2,i%2]  # integer division keeps the row index an int
    ax.set_ylim([-10,41.5])
    ax.set_xlim([25,-25])
    ax.set_aspect(1)
    ax.set_xticks([])
    ax.set_yticks([])
    ax.axis('off')
    ax.set_title('Cluster {}'.format(i+1),fontsize=20.0)
    idx = T==i+1
    Lia = df['PLAYER_ID'].isin(players[idx])
    shot_heatmap(df.loc[Lia],ax=ax,log=True,player_pic=False)
    # use more transparent markers when a cluster has many shots
    if np.sum(Lia)<2000:
        a = 0.3
    else:
        a = 0.01
    ax.scatter(-0.1*df.loc[Lia,'LOC_X'],0.1*df.loc[Lia,'LOC_Y'],marker='+',color='white',alpha = a)

f.subplots_adjust(hspace=0,wspace=0)

Conclusions about clusters:

1. Under the basket
2. Under the basket + midrange
3. Under the basket + 3s
4. Everywhere
5. Dirk Nowitzki - no corner shots. Lots of midrange jumpers
6. Kyle Korver - mostly from 3

Include player position from Nylon Calculus

Nylon Calculus (http://nyloncalculus.com/) used to have player position data. I downloaded the player position list for the relevant year. This file can be found on my GitHub.

In [15]:

PP = pd.read_csv(r'C:\Users\eyal\Desktop\NBA\notebooks\player_postion.txt')
PP.head()

Out[15]:

	PLAYER_NAME	POSITION
0	James Harden	2.08
1	Andrew Wiggins	2.90
2	Damian Lillard	1.21
3	Chris Paul	1.01
4	Trevor Ariza	3.14

Merge dataframes

We need to merge the new dataframe of player positions with the one that has the player information:

In [16]:

player_position = pd.merge(player_df,PP,on='PLAYER_NAME',how='left')
player_position = player_position.drop_duplicates(subset='PLAYER_NAME')
# players without a position in the Nylon Calculus file are assigned 2.0 (SG)
index = player_position['PLAYER_NAME'].index[player_position['POSITION'].isnull()]
player_position.loc[index, 'POSITION'] = 2.0
player_position['cluster'] = T
player_position.head()

Out[16]:

	PLAYER_ID	PLAYER_NAME	shots	POSITION	cluster
0	977	Kobe Bryant	713	2.31	3
1	1495	Tim Duncan	819	5.00	2
2	1717	Dirk Nowitzki	1062	4.12	5
3	1718	Paul Pierce	656	3.09	3
4	1938	Manu Ginobili	589	2.32	4
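
The merge step can be sketched on made-up data as follows (the player names, shot counts and positions are hypothetical). A left merge keeps every row of the player table, leaves `POSITION` as NaN where the position file has no match, and can produce repeated rows when the position file lists a name more than once. This is why duplicates are dropped and missing values are filled afterwards.

```python
import numpy as np
import pandas as pd

# Hypothetical inputs: 'Player A' appears twice in the position file, 'Player B' is missing.
players = pd.DataFrame({'PLAYER_NAME': ['Player A', 'Player B', 'Player C'],
                        'shots': [700, 650, 520]})
positions = pd.DataFrame({'PLAYER_NAME': ['Player A', 'Player A', 'Player C'],
                          'POSITION': [2.1, 2.1, 4.8]})

merged = pd.merge(players, positions, on='PLAYER_NAME', how='left')
n_rows_after_merge = len(merged)            # duplicates in the right table add rows

merged = merged.drop_duplicates(subset='PLAYER_NAME').reset_index(drop=True)

# Fill players with no position using a placeholder value (assumption: 2.0, as above).
missing = merged['POSITION'].isnull()
missing_names = merged.loc[missing, 'PLAYER_NAME'].tolist()
merged.loc[missing, 'POSITION'] = 2.0
```

Listing `missing_names` before filling makes it easy to check by hand which players received the placeholder position.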

Plot (normalized) position distribution

In [17]:

n = np.zeros([5,num_clusters])
for i in range(num_clusters):
    temp = np.round(player_position[player_position['cluster']==i+1]['POSITION'].values+0.01)
    n[:,i],_ = np.histogram(temp,bins = [0,1.1,2.1,3.1,4.1,5.1])

In [18]:

n_nor = np.zeros([5,num_clusters])
for i in range(5):
    n_nor[i,:] = n[i,:]/sum(n[i,:])
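
A small worked example of the two cells above, using made-up positions and cluster labels. Two details are easy to miss. First, `np.round` rounds exact halves to the nearest even number, so adding 0.01 pushes a value such as 2.5 up to 3. Second, the normalization is per position (per row), so for each position the ratios across clusters sum to 1.

```python
import numpy as np

# Hypothetical fractional positions (1 = PG ... 5 = C) and cluster labels.
positions = np.array([1.01, 1.21, 2.08, 2.5, 2.9, 3.14, 4.12, 5.0])
clusters = np.array([1, 1, 2, 2, 2, 1, 2, 2])
n_clusters = 2

half_even = np.round(2.5)              # rounds to the even neighbour, 2.0
rounded = np.round(positions + 0.01)   # the small offset breaks the tie upwards

# Count players per (position, cluster), using the same bin edges as above.
counts = np.zeros((5, n_clusters))
for c in range(n_clusters):
    counts[:, c], _ = np.histogram(rounded[clusters == c + 1],
                                   bins=[0, 1.1, 2.1, 3.1, 4.1, 5.1])

# Normalize each position (row) so its ratios across clusters sum to 1.
share = counts / counts.sum(axis=1, keepdims=True)
```

In this toy data, the third position (SF) has one player in cluster 1 and two in cluster 2, giving ratios of 1/3 and 2/3.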

In [19]:

fig = plt.figure(figsize=(10,8))
ax = fig.add_subplot(111)
x = np.array([1,2,3,4,5,6])
ax.bar(x-0.2, n_nor[0,:],width=0.1,color='b',align='center')
ax.bar(x-0.1, n_nor[1,:],width=0.1,color='g',align='center')
ax.bar(x, n_nor[2,:],width=0.1,color='r',align='center')
ax.bar(x+0.1, n_nor[3,:],width=0.1,color='k',align='center')
ax.bar(x+0.2, n_nor[4,:],width=0.1,color='y',align='center')
plt.xlabel('Cluster #')
plt.ylabel('Ratio')
plt.legend(['PG','SG','SF','PF','C'])
plt.xlim(0.5,6.5)

Out[19]:

(0.5, 6.5)

Based on this distribution we can reach the following conclusions:

Cluster 4 has the most players, followed by cluster 3. This indicates that players are required to shoot from everywhere on the court these days.
The vast majority of PGs, and almost half of the SFs, are in cluster 4. These are most likely the two positions that require the highest versatility.
There are only 5 players who shoot only under the basket (4 of them are centers). This old style of play, where the big man only shoots from close range, is disappearing from the NBA.
Most SGs are in cluster 3 (their job is to shoot 3s).
The vast majority of centers are in cluster 2, shooting from both under the basket and midrange.
PFs are the most evenly distributed between the clusters, with about 40% each in clusters 2 and 4. This is the position that has evolved the most: some PFs these days shoot a lot of 3s, while others take the more traditional role of shooting under the basket and from midrange. About 20% are in cluster 3: the "stretch" 4, a big man with 3-point range, is a new position in the NBA.
Korver and Nowitzki have a fairly unique shot selection.

Acknowledgements:

This project was done as part of CS 6140 / Data Mining at the University of Utah (http://www.cs.utah.edu/~jeffp/teaching/cs5140.html)

Many thanks to Prof. Jeff Phillips for his guidance.

NBA Player Shot Similarity