Helper functions
show_table = function(df, caption="") {
return (df %>% kbl(caption = caption) %>% kable_classic_2(bootstrap_options = c("striped", "hover"), full_width = F, font_size=20, html_font="Cambria"))
}
Read in player data from scraped .csv
file.
players_raw = read_csv("players.csv")
player_years = sqldf("SELECT Name, MIN(YEAR) AS FirstYear, MAX(YEAR) AS LastYear FROM players_raw GROUP BY Name")
players = sqldf("SELECT DISTINCT players_raw.Name, players_raw.Pos, players_raw.Hometown, player_years.FirstYear as FirstYear, player_years.LastYear as LastYear FROM players_raw LEFT JOIN player_years ON players_raw.Name=player_years.Name")
This removed all duplicates, while also keeping for what years those players played. However there are still some duplicates because a player could have switched positions between seasons. See this in the case of FB/OLB/TE Riley Nowakowski. We still have three individual rows for this one player.
players %>%
filter(Name == "Riley Nowakowski") %>%
show_table()
Name | Pos | Hometown | FirstYear | LastYear |
---|---|---|---|---|
Riley Nowakowski | OLB | Milwaukee, Wis. | 2020 | 2023 |
Riley Nowakowski | FB | Milwaukee, Wis. | 2020 | 2023 |
Riley Nowakowski | TE | Milwaukee, Wis. | 2020 | 2023 |
agg = aggregate(Pos~Name, players, FUN=c)
towns = c()
firstyears = c()
lastyears = c()
for (i in 1:length(agg$Name)) {
player = head(players[players$Name == agg$Name[i],], 1)
towns = append(towns, player$Hometown[1])
firstyears = append(firstyears, player$FirstYear[1])
lastyears = append(lastyears, player$LastYear[1])
}
players_no_dup = agg %>%
mutate(Hometown = towns, FirstYear = firstyears, LastYear = lastyears)
# Individual town adjustment
players_no_dup$Hometown[players_no_dup$Hometown == "Montreal, Wis."] = "Montreal, Wisconsin"
players_no_dup$Hometown[players_no_dup$Hometown == "Howard, Kan."] = "Howard, Kansas"
players_w_mult_pos = (nrow(players)-nrow(players_no_dup))/nrow(players) * 100
tibble("Pos. Not Combined" = nrow(players), "Pos. Combined" = nrow(players_no_dup), "% of Multiple Position" = round(players_w_mult_pos, 2)) %>%
show_table("Length of Players DF")
Pos. Not Combined | Pos. Combined | % of Multiple Position |
---|---|---|
520 | 453 | 12.88 |
After this cell, we see that the length of the players DataFrame has decreased significantly, showing that we correctly identified players with multiple positions, and combined them to one row.We also showed that we have 12.88% of players that have played (officially) multiple positions throughout their careers at Wisconsin.
The first part of this analysis will be geographical. Where are Wisconsin football players originating from? How has that trend changed over the past decade?
I created a script to map every Hometown
to its longitude and latitude geocode. The main reason for doing this
was so there wouldn’t be a ton of calls to the Google Maps API every
time I ran this .rmd
file.
geocodes = read_csv("geocodes.csv")
print_players = function(df) {
string = ""
if (is.na(df) || nrow(df) == 0){
return (NA)
}
for (i in 1:nrow(df)){
row = df[i,]
player = ifelse(row$FirstYear == row$LastYear, glue('{row$Name}, {row$FirstYear}'), glue('{row$Name}, {row$FirstYear}-{row$LastYear}'))
string = paste(string, player, "<br>")
}
return (string)
}
town_players = c()
for (town in geocodes$Hometown) {
town_players = append(town_players, list(players_no_dup[players_no_dup["Hometown"] == town,]))
}
geocodes$Players = town_players
m = leaflet() %>%
addProviderTiles(provider=providers$Esri.WorldGrayCanvas) %>%
addCircleMarkers(geocodes$lon, geocodes$lat, popup=paste("<b>", geocodes$Hometown, "</b><br>", lapply(geocodes$Players, print_players)), label=geocodes$Hometown, radius=4.5, color="#C5050C", fillOpacity=1.0, stroke=FALSE)
m
Getting distance from Madison, WI for all cities in our dataset.
year_dists = players_raw
# Fix the two individual cities
players_raw$Hometown[year_dists$Hometown == "Howard, Kan."] = "Howard, Kansas"
players_raw$Hometown[year_dists$Hometown == "Montreal, Wis."] = "Montreal, Wisconsin"
madison_dists = c()
madison = geocode("Madison WI")
for (i in 1:nrow(players_raw)){
city = geocodes %>% filter(Hometown == players_raw$Hometown[i]) %>% select(lon, lat)
if (is.na(city$lon[1]) || is.na(city$lat[1])){
print(players_raw$Hometown[i])
}
madison_dists = append(madison_dists, distHaversine(madison, c(city$lon[1], city$lat[1])))
}
year_dists$dist_madison = madison_dists
avg_distances = year_dists %>%
group_by(Year) %>%
summarize(avg_dist = mean(dist_madison/1000))
farthest_usa_dist = (year_dists %>% filter(Hometown == "Kauai, Hawaii") %>% pull(dist_madison))[1]
american_dists = year_dists %>% filter(dist_madison <= farthest_usa_dist)
avg_usa_distances = american_dists %>%
group_by(Year) %>%
summarize(avg_dist = mean(dist_madison/1000))
avg_cusa_distances = american_dists %>%
filter(!endsWith(Hometown, "Hawaii")) %>%
group_by(Year) %>%
summarize(avg_dist = mean(dist_madison/1000))
far_players = sqldf("SELECT Name, Pos, Hometown, Year, dist_madison FROM american_dists")
sqldf("SELECT DISTINCT far_players.Name, far_players.Pos, far_players.Hometown, player_years.FirstYear as FirstYear, player_years.LastYear as LastYear, far_players.dist_madison/1000 as 'Distance(km)' FROM far_players LEFT JOIN player_years ON far_players.Name=player_years.Name ORDER BY far_players.dist_madison DESC LIMIT 5") %>% show_table()
Name | Pos | Hometown | FirstYear | LastYear | Distance(km) |
---|---|---|---|---|---|
Nick Herbig | OLB | Kauai, Hawaii | 2020 | 2022 | 6774.561 |
Micah Kapoi | OL | Kapolei, Hawaii | 2014 | 2018 | 6709.919 |
Kamo’i Latu | S | Honolulu, Hawaii | 2022 | 2023 | 6693.346 |
Trech Kekahuna | WR | Honolulu, Hawaii | 2023 | 2023 | 6693.346 |
Cam Phillips | WR | San Francisco, Calif. | 2019 | 2019 | 2840.278 |
We see that most of the variance (in the USA) will likely come from Hawaii. Let’s see how that plays out.
ggplot() +
geom_smooth(data=avg_distances, mapping=aes(Year, avg_dist, color="Worldwide"), size=1.25, se=F) +
geom_smooth(data=avg_usa_distances, mapping=aes(Year, avg_dist, color="USA"), size=1.25, se=F) +
geom_smooth(data=avg_cusa_distances, mapping=aes(Year, avg_dist, color="Continental USA"), size=1.25, se=F) +
scale_color_manual(name='Region',
breaks=c('Worldwide', 'USA', 'Continental USA'),
values=c('Worldwide'='black', 'USA'='red', 'Continental USA'='blue')) +
scale_x_continuous(breaks=c(2013, 2015, 2017, 2019, 2021, 2023)) +
ylab("Distance (km)")
> We see in this graph how the distances can change based on what values we can leave out. We see that as we remove higher distances which can skew the average, our graph evens out more, to show more true changes. We can possibly see that there might be an upward trend right now in getting recruits that aren’t as close to UW.
Let’s also do it by variance (using just USA), for more accuracy, as this tells us more about the distance from the average, giving a better representation of the “spread” of our geographic points.
var_americas = avg_cusa_distances = american_dists %>%
group_by(Year) %>%
summarize(dists = var(dist_madison/1000))
ggplot(var_americas, aes(Year, dists)) +
geom_smooth(se=F, color="black") +
ggtitle("Variance in Distance of American UW Football Hometowns") +
ylab("Variance") +
theme(axis.text = element_blank(), axis.ticks = element_blank())
The variance tells a similar story. There was a small hill in the mid-2010s, which dipped closer to COVID, and is now on an upward trend.
One of the more impactful changes in college football has been the transfer portal and NIL (Name, Image and Likeness). Coming into affect in the summer of 2021, it has changed the landscape of college football, along with the other schools. How has recruiting changed for Wisconsin since?
var_pre2021 = american_dists %>%
group_by(Year) %>%
summarize(dists = var(dist_madison/1000)) %>%
filter(Year < 2021)
var_2021 = american_dists %>%
group_by(Year) %>%
summarize(dists = var(dist_madison/1000)) %>%
filter(Year >= 2021)
years = c("2010-2021", "2021-2023")
avgs = c(mean(var_pre2021$dists), mean(var_2021$dists))
ggplot(data.frame(years, avgs), aes(x=years, y=avgs)) +
geom_col(width=0.4, fill = "lightblue", color="black") +
ylab("Average Variance (km)") +
xlab("Years") +
theme(axis.text.y = element_blank(), axis.ticks.y = element_blank())
We see that over the last few years, Wisconsin has branched out a little more, and that could be due to NIL/transfer portal changes, or even recent head coaching changes.