今天继续学习pandas。并实践下All GM Chess Games on Chess.com这个dataset。
读入dataset,做一些统计工作,如下,
# 1. 统计一共有多少GM,以及是谁
print(f"1. 一共有 " + str(df.player_name.nunique()) +" 位GM,分别是:")
print(df.player_name.unique())
# 2. 统计每个GM下了多少盘棋
print("\n2. 每个GM下了多少盘棋(取top10):")
print(df.player_name.value_counts().head(10))
# 3. 平均每年每个GM下几盘棋
df['Date'] = pd.to_datetime(df['Date'])
df['Year'] = df['Date'].dt.year
average_games_per_year = df.groupby(['player_name', 'Year']).size().groupby('player_name').mean().round(0).astype(int)
# 取前十名
top_10_gms_average_games = average_games_per_year.groupby('player_name').mean().sort_values(ascending=False).head(10)
print("\n3. 平均每年每个GM下几盘棋(取top10):")
print(top_10_gms_average_games)
# 4. 计算2021年谁的胜率最高,需要至少达到超过全年所有棋手平均下棋局数
# 提取关键字段
df_2021 = df[['player', 'White', 'Black', 'Result', 'Date']]
df_2021['Date'] = pd.to_datetime(df_2021['Date'])
df_2021['Year'] = df_2021['Date'].dt.year
# 计算2021年所有棋手下的总局数
games_in_2021 = df[df['Year'] == 2021]
total_games_by_player = pd.concat([games_in_2021['player'], games_in_2021['White'], games_in_2021['Black']]).value_counts()
# 计算全年平均下棋局数
average_games_per_player = total_games_by_player.mean()
# 过滤出下棋局数超过平均值的棋手
qualified_players = total_games_by_player[total_games_by_player > average_games_per_player].index
# 筛选出符合条件的对局
qualified_games = games_in_2021[games_in_2021['player'].isin(qualified_players)]
# 计算每个棋手的胜利次数和总局数
player_stats = pd.DataFrame()
for player in qualified_players:
wins_as_white = qualified_games[(qualified_games['Result'] == '1-0') & ((qualified_games['White'] == player) | (qualified_games['Black'] == player))]
wins_as_black = qualified_games[(qualified_games['Result'] == '0-1') & ((qualified_games['White'] == player) | (qualified_games['Black'] == player))]
total_games = qualified_games[(qualified_games['White'] == player) | (qualified_games['Black'] == player)]
player_stats = pd.concat([player_stats, pd.DataFrame({
'player': [player],
'total_wins': [len(wins_as_white) + len(wins_as_black)],
'total_games': [len(total_games)]
})])
# 计算胜率
player_stats['win_percentage'] = player_stats['total_wins'] / player_stats['total_games']
# 选择在2021年下棋局数超过平均值的棋手中胜率最高的
highest_win_percentage_player = player_stats.nlargest(1, 'win_percentage')
print("\n4. 2021年胜率最高且下棋局数超过全年平均的玩家:")
print(highest_win_percentage_player[['player', 'win_percentage', 'total_games']])
# 5. 总体看,执白胜率高还是执黑胜率高
# 创建一个新列 'player_color' 表示每个玩家在比赛中是执白还是执黑
df['player_color'] = df.apply(lambda row: 'White' if row['player'] == row['White'] else 'Black', axis=1)
# 创建一个新列 'player_result' 表示每个玩家在比赛中的结果(胜负关系)
df['player_result'] = df.apply(lambda row: 1 if (((row['player'] == row['White']) & (row['Result'] == '1-0')) |
((row['player'] == row['Black']) & (row['Result'] == '0-1'))) else 0, axis=1)
# 计算总体执白和执黑的比赛次数和胜率
overall_stats = df.groupby('player_color').agg(total_games=('Result', 'count'), total_wins=('player_result', 'sum'))
# 计算总体胜率
overall_stats['win_percentage'] = overall_stats['total_wins'] / overall_stats['total_games']
print("\n5. 总体执白和执黑的胜率统计:")
print(overall_stats)
数据集很大,运行4、5比较慢,大约等个半小时吧。
1. 一共有 1055 位GM,分别是:
['Tingjie Lei' 'Дмитрий Хегай' 'Ivan Ivanisevic' ... 'Зубарев Александр'
'ZURAB AZMAIPARASHVILI' 'Nikita Petrov']
2. 每个GM下了多少盘棋(取top10):
player_name
Rogelio Jr Antonio 138587
Daniel Naroditsky 70765
Aman Hambleton 46833
Hikaru Nakamura 43342
Hoang Thong Tu 41392
ZURAB AZMAIPARASHVILI 38654
Khatanbaatar Bazar 38212
Nihal Sarin 33453
Yannick Gozzoli 32219
Andrew Tang 30594
Name: count, dtype: int64
3. 平均每年每个GM下几盘棋(取top10):
player_name
Khatanbaatar Bazar 12737.0
Rogelio Jr Antonio 10661.0
ZURAB AZMAIPARASHVILI 9664.0
Brandon Jacobson 8586.0
Alireza Firouzja 7647.0
Lev Gutman 5656.0
Daniel Naroditsky 5443.0
Petar Drenchev 5006.0
Danielian Elina 4947.0
Hikaru Nakamura 4816.0
dtype: float64
4. 2021年胜率最高且下棋局数超过全年平均的玩家:
player win_percentage total_games
0 sahpufjunior 1.0 238
5. 总体执白和执黑的胜率统计:
total_games total_wins win_percentage
player_color
Black 4014186 408073 0.101658
White 796890 445434 0.558965