Using Machine Learning To Find Every Missing DPOY in NBA History (1982–1947)

JakeAllenData
6 min read · Jul 23, 2024


Using NBA data and machine learning to find out which players “would’ve” won the award before its introduction in 1983.

Photo by: Author

Introduction

The NBA Defensive Player of the Year (DPOY) award originated in 1983 and is given to the “best” defensive player at the end of each NBA season. As a basketball fan, I’ve always heard the noise surrounding the legendary Bill Russell and his defensive prowess, but today I wanted to put it to the test and find out how many DPOYs he would’ve won if the award had been around back then (1947).

Well… I also just wanted to answer this question in general:

“What players would’ve won this award 🏆 if it had already been introduced when the NBA originated in 1947?”

Data

All the data used for this project comes from Basketball Reference. I used several datasets from the Sumitro Datta | NBA Stats (1947–present) Kaggle hub to piece together a single dataset for the analysis.

For full details, feel free to investigate the code: My GitHub

Model Prediction Value

The value the model predicts is award_dpoy_share. This statistic is the number of voting points a player obtained divided by the maximum number of points possible. For example, Rudy Gobert received 433 points out of a maximum of 495, giving him a share of 0.875, the largest among the contenders, meaning he won the award.

Table Calculation of ‘award_dpoy_share’
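As a quick illustration, here’s a minimal pandas sketch of that calculation. The column names pts_won and pts_max are my own placeholders, not necessarily the dataset’s actual names; Gobert’s numbers come from the example above.

```python
import pandas as pd

# One voting row, using the Gobert example from the text.
votes = pd.DataFrame({
    "player": ["Rudy Gobert"],
    "pts_won": [433],  # voting points the player received
    "pts_max": [495],  # maximum voting points available that season
})

# Share = points won / maximum points possible: 433 / 495 ≈ 0.875
votes["award_dpoy_share"] = votes["pts_won"] / votes["pts_max"]
print(votes)
```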

The Missing Data

These two charts visualize the NaNs (“blanks”) in each feature across two very different time periods (1950–1980 and 1980–2024). As you can see, many advanced statistics (vorp, bpm, dbpm, obpm, etc.) were not fully tracked until around 1974. The same goes for the basic percentage stats (orb, drb, ast, etc.).

With that said, I did include a couple of these advanced features in the modeling process because they were still available for about 11 of the missing DPOY seasons.

NaN Counts (1950–1980)
NaN Counts (1980–2024)
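For context, here’s a rough sketch of how counts like these can be produced with pandas. The filename and the season column name are assumptions on my part, not the project’s actual names.

```python
import pandas as pd

df = pd.read_csv("player_seasons.csv")  # assumed filename

# Split the player-season rows into the two eras compared in the charts.
early = df[df["season"].between(1950, 1980)]
modern = df[df["season"].between(1981, 2024)]

# Count the missing values ("blanks") per feature in each era.
print(early.isna().sum().sort_values(ascending=False))
print(modern.isna().sum().sort_values(ascending=False))
```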

What Happened Here?

Something interesting I found is that there have only been 4 DPOYs ever who won the award despite being listed on the All-Defense 2nd team that season. This is very strange, since 91% of the time the DPOY makes the All-Defense 1st team. Can someone explain to me what happened in those seasons?

DPOYs, Not All-Defense 1st Team
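Finding those oddball seasons is a simple filter. Here’s a sketch, assuming boolean-style flag columns named after the features used elsewhere in this article and an assumed filename:

```python
import pandas as pd

df = pd.read_csv("player_seasons.csv")  # assumed filename

# DPOY winners who were NOT on the All-Defense 1st team that season.
oddballs = df[(df["won_dpoy"] == 1) & (df["All-Defense_1st"] == 0)]
print(oddballs[["season", "player"]])
```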

Analysis Notes

  • 82% of DPOYs are top 8 in the league in defensive win shares (dws).
  • 100% of DPOYs make an All-Defensive team (All-Defense), obviously.
  • 52% of DPOYs make an All-NBA team selection (All-NBA).
  • 91% of DPOYs are 1st team All-Defense (All-Defense_1st). (A quick pandas check of the dws and All-Defense numbers is sketched below.)
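Here’s that quick check, a hedged sketch assuming one row per player-season and the column names used in this article (dws, won_dpoy, All-Defense_1st); the filename is a placeholder.

```python
import pandas as pd

df = pd.read_csv("player_seasons.csv")  # assumed filename

# Rank every player within his season by defensive win shares.
df["dws_rank"] = df.groupby("season")["dws"].rank(ascending=False, method="min")

winners = df[df["won_dpoy"] == 1]
print("Top 8 in dws:        ", (winners["dws_rank"] <= 8).mean())  # ~0.82
print("All-Defense 1st team:", winners["All-Defense_1st"].mean())  # ~0.91
```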

Machine Learning Results

I chose two different model archetypes for this experiment: first, I wanted to understand the key differences between the models, and second, I wanted to determine which is better in general for a problem like this.

Logistic Regression (Classification) vs. Random-Forest (Regression)

  • Logistic Regression → Target → ‘won_dpoy’ (the single DPOY winner).
  • RF Regression → Target → ‘award_dpoy_share’ (the DPOY vote share).
  • Logistic Regression → Error Metrics → Hit-Rate/BackTesting.
  • RF Regression → Error Metrics → AP/NDCG/BackTesting. (A minimal sketch of both setups follows.)
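To make the comparison concrete, here’s a minimal sketch of both setups, not my exact pipeline; the filename, feature subset, split years, and NaN handling are all placeholders.

```python
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LogisticRegression

df = pd.read_csv("player_seasons.csv")  # assumed filename
features = ["dws", "blk_percent", "blk_per_game", "All-Defense_1st_share"]

# Crude NaN handling for the sketch only.
df[features] = df[features].fillna(0)

# Back-test: train on older seasons, evaluate on 2020-2024.
train, test = df[df["season"] < 2020], df[df["season"] >= 2020].copy()

# Classification setup: predict the single winner directly.
clf = LogisticRegression(max_iter=1000)
clf.fit(train[features], train["won_dpoy"])

# Regression setup: predict the vote share, then rank within each season.
reg = RandomForestRegressor(n_estimators=500, random_state=42)
reg.fit(train[features], train["award_dpoy_share"])

test["pred_share"] = reg.predict(test[features])
test["pred_rank"] = test.groupby("season")["pred_share"].rank(ascending=False)
```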

Which Model Performed Better?

  • Winner: Random-Forest Regression

Specific Model Results

Logistic: 3/5 current DPOYs (2024–2020) were correctly predicted.

  • It labeled the 2020 winner, Giannis Antetokounmpo, as ranked 3rd in the projection.
  • It labeled the 2022 winner, Marcus Smart, as ranked 4th in the projection.
  • It didn’t rank the other contenders (the rest of the vote) as accurately as Random-Forest did.

RF: 3/5 current DPOYs (2024–2020) were correctly predicted.

  • It labeled the 2020 winner, Giannis Antetokounmpo, as ranked 3rd in the projection.
  • It labeled the 2022 winner, Marcus Smart, as ranked 3rd in the projection.
  • Random-Forest not only had better DPOY predictions, but it also assessed the other contenders for each season better as well.

Model Decision Making

To understand the complex decision-making of the Random-Forest Regressor, here’s the general feature importance of the model.

  • Players with large All-Defense_1st_share values contribute massively and positively to the prediction.
  • Players with large block values (blk_percent/blk_per_game) also contribute strongly to the prediction.
  • And so on…
Feature Importance | Random-Forest Regression
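A chart like the one above can be reproduced from the fitted regressor’s built-in importances. Continuing from the earlier model sketch (reg and features):

```python
import pandas as pd

# Impurity-based importance per feature, normalized to sum to 1.
importances = pd.Series(reg.feature_importances_, index=features)
print(importances.sort_values(ascending=False))
```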

The Missing DPOYs’ Results 🤨 | Plus My Analysis

True Contender Definition:

True Contenders = when the model was tested on current-day DPOYs (2024–2020), it always had the correct DPOY winner within the top 3 predicted players.
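In code, this “true contender” check is just a top-3 hit rate over the back-tested seasons. A sketch, continuing from the model sketch above:

```python
# For each back-tested season, did the actual winner land in the model's top 3?
winners = test[test["won_dpoy"] == 1]
print("Winner in top 3:", (winners["pred_rank"] <= 3).mean())  # 1.0 per the text
```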

Here is a table displaying the true contenders with the model’s predicted probability of winning the DPOY for each season (1982–1947). As you can see, there are some seasons the model has major difficulty assessing due to a lack of relevant data (e.g., 1952).

Random-Forest | Missing DPOY | Projections

Takeaways

Here is a list of the unique players, sorted first by the total probability of winning they accumulated across the missing seasons, and second by the number of those awards each player would’ve won according to the model. (A sketch of this aggregation follows the list.)

  • 1st: Bill Russell | 4.39 | 5 Awards
  • 2nd: Wilt Chamberlain | 4.16 | 6 Awards
  • 3rd: George Mikan | 3.66 | 6 Awards
  • 4th: Bobby Jones | 2.78 | 3 Awards
  • 5th: Dave DeBusschere | 2.55 | 4 Awards
  • 6th: Dennis Johnson | 1.89 | 3 Awards
  • 7th: Kareem Abdul-Jabbar | 1.66 | 2 Awards
  • 8th: Walt Frazier | 1.64 | 2 Awards
  • 9th: Dolph Schayes | 1.37 | 1 Award
  • 10th: Maurice Stokes | 1.22 | 2 Awards
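Here’s a sketch of that aggregation, continuing from the model sketch: project the pre-1983 seasons, sum each player’s predicted share, and count the seasons in which he ranks first.

```python
# Project the missing seasons (before the award existed).
missing = df[df["season"] < 1983].copy()
missing["pred_share"] = reg.predict(missing[features])

# Total accumulated share per player, plus the seasons he would've won.
total_share = missing.groupby("player")["pred_share"].sum()
model_awards = (
    missing.loc[missing.groupby("season")["pred_share"].idxmax(), "player"]
    .value_counts()
)

leaderboard = (
    total_share.to_frame("total_share")
    .join(model_awards.rename("model_awards"))
    .sort_values("total_share", ascending=False)
)
print(leaderboard.head(10))
```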

The Missing DPOYs… According to My Analysis 🙄

Now here is the same table again, but this time with yellow highlighting to show my picks, based on my own analysis (a Power BI dashboard and Google searches).

My DPOY Picks

According to my personal research-based picks, here are the total award counts for each unique player throughout the missing seasons:

  • 1st: Bill Russell | 8 Awards
  • 2nd: George Mikan | 5 Awards
  • 3rd: Chamberlain, Frazier, and Abdul-Jabbar | 3 Awards
  • 4th: Schayes, Stokes, DeBusschere, and Johnson | 2 Awards
  • 5th: Miasek, Dallmar, and Sloan | 1 Award

Conclusion

Overall, this project was quite challenging, to say the least. The lack of data from 1968–1947 became more apparent as the predictions rolled further into the past; for instance, in 1947 you can see that both John Logan and Stan Miasek have the exact same probability of winning the award 🤨 simply because of the few statistics available back then. To conclude, realistically more data from that era would need to exist in order to make a truly accurate prediction.

With that said, I want to know what you all think about this. Who do you think are the missing DPOYs? Let me know in the comments.

Feedback

Any questions or feedback are always welcome, so feel free to post them. For full details of every step in this project, check out my GitHub.
