Dataframe的数据中,选择某一列,某一行,或者某个子区域,该怎么办呢?
比如,Titanic 数据表中,如果仅仅对乘客感兴趣,可以这样操作:
In [4]: ages = titanic["Age"]
In [5]: ages.head()
Out[5]:
0 22.0
1 38.0
2 26.0
3 35.0
4 35.0
Name: Age, dtype: float64
In [6]: type(titanic["Age"])
Out[6]: pandas.core.series.Series
In [7]: titanic["Age"].shape
Out[7]: (891,)
比如,Titanic 数据表中,想选择多个属性进行组合研究,不仅仅对乘客感兴趣,还需要知道性别,可以这样操作:
In [8]: age_sex = titanic[["Age", "Sex"]]
In [9]: age_sex.head()
Out[9]:
Age Sex
0 22.0 male
1 38.0 female
2 26.0 female
3 35.0 female
4 35.0 male
In [10]: type(titanic[["Age", "Sex"]])
Out[10]: pandas.core.frame.DataFrame
In [11]: titanic[["Age", "Sex"]].shape
Out[11]: (891, 2)
比如,Titanic 数据表中,对乘客的年龄大于35岁的集合感兴趣
In [12]: above_35 = titanic[titanic["Age"] > 35]
In [13]: above_35.head()
Out[13]:
PassengerId Survived Pclass ... Fare Cabin Embarked
1 2 1 1 ... 71.2833 C85 C
6 7 0 1 ... 51.8625 E46 S
11 12 1 1 ... 26.5500 C103 S
13 14 0 3 ... 31.2750 NaN S
15 16 1 2 ... 16.0000 NaN S
[5 rows x 12 columns]
In [15]: above_35.shape
Out[15]: (217, 12)
事实上,括号内的条件其实是一个真值列表:
In [14]: titanic["Age"] > 35
Out[14]:
0 False
1 True
2 False
3 False
4 False
...
886 False
887 False
888 False
889 False
890 False
Name: Age, Length: 891, dtype: bool
此外,还对乘客的座舱等级感兴趣,筛选等级2,3的,可以这样操作:
In [16]: class_23 = titanic[titanic["Pclass"].isin([2, 3])]
In [17]: class_23.head()
Out[17]:
PassengerId Survived Pclass ... Fare Cabin Embarked
0 1 0 3 ... 7.2500 NaN S
2 3 1 3 ... 7.9250 NaN S
4 5 0 3 ... 8.0500 NaN S
5 6 0 3 ... 8.4583 NaN Q
7 8 0 3 ... 21.0750 NaN S
[5 rows x 12 columns]
# 等价于:
In [18]: class_23 = titanic[(titanic["Pclass"] == 2) | (titanic["Pclass"] == 3)]
In [19]: class_23.head()
Out[19]:
PassengerId Survived Pclass ... Fare Cabin Embarked
0 1 0 3 ... 7.2500 NaN S
2 3 1 3 ... 7.9250 NaN S
4 5 0 3 ... 8.0500 NaN S
5 6 0 3 ... 8.4583 NaN Q
7 8 0 3 ... 21.0750 NaN S
[5 rows x 12 columns]
此外,在数据清洗中经常用到,把NA值或者非NA值筛选出来,另做处理,可以这样操作:
In [20]: age_no_na = titanic[titanic["Age"].notna()]
In [21]: age_no_na.head()
Out[21]:
PassengerId Survived Pclass ... Fare Cabin Embarked
0 1 0 3 ... 7.2500 NaN S
1 2 1 1 ... 71.2833 C85 C
2 3 1 3 ... 7.9250 NaN S
3 4 1 1 ... 53.1000 C123 S
4 5 0 3 ... 8.0500 NaN S
[5 rows x 12 columns]
In [22]: age_no_na.shape
Out[22]: (714, 12)
比如,Titanic 数据表中,对乘客的年龄大于35岁的名字感兴趣,
In [23]: adult_names = titanic.loc[titanic["Age"] > 35, "Name"]
In [24]: adult_names.head()
Out[24]:
1 Cumings, Mrs. John Bradley (Florence Briggs Th...
6 McCarthy, Mr. Timothy J
11 Bonnell, Miss. Elizabeth
13 Andersson, Mr. Anders Johan
15 Hewlett, Mrs. (Mary D Kingcome)
Name: Name, dtype: object
如果对第10-25行,3到5列感兴趣,可以这样操作:
In [25]: titanic.iloc[9:25, 2:5]
Out[25]:
Pclass Name Sex
9 2 Nasser, Mrs. Nicholas (Adele Achem) female
10 3 Sandstrom, Miss. Marguerite Rut female
11 1 Bonnell, Miss. Elizabeth female
12 3 Saundercock, Mr. William Henry male
13 3 Andersson, Mr. Anders Johan male
.. ... ... ...
20 2 Fynney, Mr. Joseph J male
21 2 Beesley, Mr. Lawrence male
22 3 McGowan, Miss. Anna "Annie" female
23 1 Sloper, Mr. William Thompson male
24 3 Palsson, Miss. Torborg Danira female
[16 rows x 3 columns]
以上代码只是一个简单示例,示例代码中的表达式和变量范围也可以根据实际问题进行修改。
觉得有用 收藏 收藏 收藏
点个赞 点个赞 点个赞
End
GPT专栏文章:
GPT实战系列-ChatGLM3本地部署CUDA11+1080Ti+显卡24G实战方案
GPT实战系列-LangChain + ChatGLM3构建天气查询助手
GPT实战系列-大模型为我所用之借用ChatGLM3构建查询助手
GPT实战系列-P-Tuning本地化训练ChatGLM2等LLM模型,到底做了什么?(二)
GPT实战系列-P-Tuning本地化训练ChatGLM2等LLM模型,到底做了什么?(一)
GPT实战系列-ChatGLM2部署Ubuntu+Cuda11+显存24G实战方案