1.1.2 Investing: Programming Basics: pandas
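In other words, the outer structure is a dict and the inner values are lists: each dict key becomes a column name, and each list supplies that column's values.
# A simple pandas example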
import pandas as pd
test_data = {
"country":["China","China","China"],
"sites":["baidu","sougou","hao123"],
"rank":[1,3,2]
}
df1 = pd.DataFrame(test_data)
print(df1)
country sites rank
0 China baidu 1
1 China sougou 3
2 China hao123 2
import pandas as pd
# Create a DataFrame from a dict (including Series values)
d={'one':pd.Series([1.,2.,3.],index=['a','b','c']),
'two':pd.Series([1.,2.,3.,4.,],index=['a','b','c','d']),
'three':range(4),
'four':1.,
'five':'f'}
df=pd.DataFrame(d)
print(df)
# Use dataframe.index and dataframe.columns to inspect the row and column labels;
# dataframe.values returns the DataFrame's elements as an array
print ("DataFrame index:\n",df.index)
print ("DataFrame columns:\n",df.columns)
print ("DataFrame values:\n",df.values)
one two three four five
a 1.0 1.0 0 1.0 f
b 2.0 2.0 1 1.0 f
c 3.0 3.0 2 1.0 f
d NaN 4.0 3 1.0 f
DataFrame index:
Index(['a', 'b', 'c', 'd'], dtype='object')
DataFrame columns:
Index(['one', 'two', 'three', 'four', 'five'], dtype='object')
DataFrame values:
[[1.0 1.0 0 1.0 'f']
[2.0 2.0 1 1.0 'f']
[3.0 3.0 2 1.0 'f']
[nan 4.0 3 1.0 'f']]
# A DataFrame can also be created from a dict whose values are arrays (lists), but all arrays must have the same length:
d = {'one': [1., 2., 3., 4.], 'two': [4., 3., 2., 1.]}
df = pd.DataFrame(d, index=['a', 'b', 'c', 'd'])
print(df)
one two
a 1.0 4.0
b 2.0 3.0
c 3.0 2.0
d 4.0 1.0
# Build a DataFrame from dict-style JSON
json1 = {
"2021-07-01":{"name":"kelvin","age":"31","region":"江苏"},
"2021-07-02":{"name":"tom","age":"29","region":"上海","unkonwn":"111"},
"2021-07-03":{"name":"kipper","age":"15","region":"杭州"}
}
df = pd.DataFrame(json1).T  # .T transposes, swapping row and column labels; without it, name, age, etc. would be the row labels.
print(df)
name age region unkonwn
2021-07-01 kelvin 31 江苏 NaN
2021-07-02 tom 29 上海 111
2021-07-03 kipper 15 杭州 NaN
json1 = {
"2021-07-01":{"close":10},
"2021-07-02":{"close":11},
"2021-07-03":{"close":6}
}
df = pd.DataFrame(json1).T
print(df)
index_price=pd.DataFrame({'列名1':df.close}).dropna()
print(index_price)
close
2021-07-01 10
2021-07-02 11
2021-07-03 6
列名1
2021-07-01 10
2021-07-02 11
2021-07-03 6
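# When the input is a list of dicts rather than a dict of arrays, the equal-length restriction does not apply, and missing values are filled with NaN
d = [{'a': 1.6, 'b': 2}, {'a': 3, 'b': 6, 'c': 9}]
df = pd.DataFrame(d)
print(df)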
a b c
0 1.6 2 NaN
1 3.0 6 9.0
# Build a DataFrame from JSON
# df = pd.read_json('sites.json') would build a DataFrame from the file sites.json
json1 = [
{"name":"kelvin","age":"31","region":"江苏"},
{"name":"tom","age":"29","region":"上海","unkonwn":"111"},
{"name":"kipper","age":"15","region":"杭州"}
]
df = pd.DataFrame(json1)
print(df)
df.to_json()
name age region unkonwn
0 kelvin 31 江苏 NaN
1 tom 29 上海 111
2 kipper 15 杭州 NaN
'{"name":{"0":"kelvin","1":"tom","2":"kipper"},"age":{"0":"31","1":"29","2":"15"},"region":{"0":"\\u6c5f\\u82cf","1":"\\u4e0a\\u6d77","2":"\\u676d\\u5dde"},"unkonwn":{"0":null,"1":"111","2":null}}'
# Get data from a DataFrame with df.loc[0]
my_data = [["kelvin","31"],["tom","29"],["kipper","13"]]
my_column = ["name", "age"]
df = pd.DataFrame(data = my_data, columns = my_column)
print(df)
print()
print(df.loc[0])  # row 0
print()
print(df.loc[0]["name"])
name age
0 kelvin 31
1 tom 29
2 kipper 13
name kelvin
age 31
Name: 0, dtype: object
kelvin
# When processing real data you sometimes need an empty DataFrame, which can be created like this
df = pd.DataFrame()
print(df)
Empty DataFrame
Columns: []
Index: []
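Note: the ssl package must be imported (see the code below), otherwise reading from the URL raises an error.
Several CSV files that have already been scraped are provided for direct use:
CSI 300 history: SH510300.csv
CSI 300 historical closing prices: SH510300-close.csv
CSI 500 historical closing prices: SH510500-close.csv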
import pandas as pd
import ssl  # without the next line: URLError: <urlopen error [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: unable to get local issuer certificate (_ssl.c:1091)>
ssl._create_default_https_context = ssl._create_unverified_context  # disables HTTPS certificate verification; use only with trusted sources
sh300 = pd.read_csv("https://gitee.com/kelvin11/public-resources/raw/master/SH510300.csv")
sh300
date uuid volume open high low close chg percent turnoverrate amount
0 2012/5/28 SH510300|2012-05-28 1277518769 2.1572 2.2046 2.1513 2.2020 0.0255 1.17 10.45 NaN
1 2012/5/29 SH510300|2012-05-29 714949008 2.2004 2.2503 2.2004 2.2359 0.0339 1.54 5.85 NaN
2 2012/5/30 SH510300|2012-05-30 265887198 2.2342 2.2384 2.2266 2.2291 -0.0068 -0.30 2.17 NaN
3 2012/5/31 SH510300|2012-05-31 178155984 2.2164 2.2367 2.2097 2.2240 -0.0051 -0.23 1.46 NaN
4 2012/6/1 SH510300|2012-06-01 179350035 2.2232 2.2494 2.2156 2.2240 0.0000 0.00 1.47 NaN
... ... ... ... ... ... ... ... ... ... ... ...
2792 2023/11/20 SH510300|2023-11-20 858430360 3.6370 3.6570 3.6100 3.6450 0.0110 0.30 0.00 3.119865e+09
2793 2023/11/21 SH510300|2023-11-21 931605485 3.6550 3.6860 3.6400 3.6500 0.0050 0.14 0.00 3.414863e+09
2794 2023/11/22 SH510300|2023-11-22 762202706 3.6410 3.6460 3.6100 3.6110 -0.0390 -1.07 0.00 2.765608e+09
2795 2023/11/23 SH510300|2023-11-23 774971808 3.6090 3.6320 3.5950 3.6300 0.0190 0.53 0.00 2.800813e+09
2796 2023/11/24 SH510300|2023-11-24 743453294 3.6260 3.6270 3.5980 3.6050 -0.0250 -0.69 0.00 2.684276e+09
2797 rows × 11 columns
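Building on the example above, you can go one step further when reading the CSV by parsing the dates and using them as the index: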
import pandas as pd
import ssl # # URLError: <urlopen error [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: unable to get local issuer certificate (_ssl.c:1091)>
ssl._create_default_https_context = ssl._create_unverified_context
sh300 = pd.read_csv("https://gitee.com/kelvin11/public-resources/raw/master/SH510300.csv", parse_dates=['date'], index_col='date')
# Select the 2023 data
sh300.loc['2023']
uuid volume open high low close chg percent turnoverrate amount
date
2023-01-03 SH510300|2023-01-03 656310782 3.8761 3.8998 3.8299 3.8880 0.0109 0.28 0.0 2.576088e+09
2023-01-04 SH510300|2023-01-04 980799721 3.8870 3.9037 3.8702 3.8929 0.0049 0.13 0.0 3.874438e+09
2023-01-05 SH510300|2023-01-05 774502293 3.9136 3.9726 3.9106 3.9667 0.0738 1.90 0.0 3.108571e+09
2023-01-06 SH510300|2023-01-06 541080825 3.9677 3.9992 3.9638 3.9825 0.0158 0.40 0.0 2.187353e+09
2023-01-09 SH510300|2023-01-09 780959941 4.0022 4.0228 3.9894 4.0071 0.0246 0.62 0.0 3.178181e+09
... ... ... ... ... ... ... ... ... ... ...
2023-11-20 SH510300|2023-11-20 858430360 3.6370 3.6570 3.6100 3.6450 0.0110 0.30 0.0 3.119865e+09
2023-11-21 SH510300|2023-11-21 931605485 3.6550 3.6860 3.6400 3.6500 0.0050 0.14 0.0 3.414863e+09
2023-11-22 SH510300|2023-11-22 762202706 3.6410 3.6460 3.6100 3.6110 -0.0390 -1.07 0.0 2.765608e+09
2023-11-23 SH510300|2023-11-23 774971808 3.6090 3.6320 3.5950 3.6300 0.0190 0.53 0.0 2.800813e+09
2023-11-24 SH510300|2023-11-24 743453294 3.6260 3.6270 3.5980 3.6050 -0.0250 -0.69 0.0 2.684276e+09
217 rows × 10 columns
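The year lookup above relies on the index being a DatetimeIndex. A minimal sketch on synthetic data (the `prices` frame and its values are made up for illustration, not taken from SH510300.csv) shows the same kind of label-based date selection by year, by month, and by date range:

```python
import pandas as pd

# Synthetic daily closes; the dates and values are illustrative only.
idx = pd.date_range("2022-12-29", periods=6, freq="D")
prices = pd.DataFrame({"close": [3.80, 3.82, 3.85, 3.88, 3.89, 3.96]}, index=idx)

year_2023 = prices.loc["2023"]                  # every row whose date falls in 2023
dec_2022 = prices.loc["2022-12"]                # every row in December 2022
window = prices.loc["2022-12-31":"2023-01-02"]  # label slices include both endpoints

print(year_2023)
print(dec_2022)
print(window)
```

Unlike positional slices, a label slice on dates keeps both endpoints, so `window` holds three rows here.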
# Read a CSV with pandas: read_csv, head, tail
df = pd.read_csv("nba.csv")
df.head(10)  # returns the first 10 rows; the result is not assigned or printed here
print(df)
Name Team Number Position Age Height Weight \
0 Avery Bradley Boston Celtics 0.0 PG 25.0 6-2 180.0
1 Jae Crowder Boston Celtics 99.0 SF 25.0 6-6 235.0
2 John Holland Boston Celtics 30.0 SG 27.0 6-5 205.0
3 R.J. Hunter Boston Celtics 28.0 SG 22.0 6-5 185.0
4 Jonas Jerebko Boston Celtics 8.0 PF 29.0 6-10 231.0
.. ... ... ... ... ... ... ...
453 Shelvin Mack Utah Jazz 8.0 PG 26.0 6-3 203.0
454 Raul Neto Utah Jazz 25.0 PG 24.0 6-1 179.0
455 Tibor Pleiss Utah Jazz 21.0 C 26.0 7-3 256.0
456 Jeff Withey Utah Jazz 24.0 C 26.0 7-0 231.0
457 NaN NaN NaN NaN NaN NaN NaN
College Salary
0 Texas 7730337.0
1 Marquette 6796117.0
2 Boston University NaN
3 Georgia State 1148640.0
4 NaN 5000000.0
.. ... ...
453 Butler 2433333.0
454 NaN 900000.0
455 NaN 2900000.0
456 Kansas 947276.0
457 NaN NaN
[458 rows x 9 columns]
# Write a DataFrame to CSV
name = ["kelvin", "tom", "kipper"]
age = [31, 29, 15]
region = ["江苏", "上海", "杭州"]
dict1 = {"name":name, "age":age, "region":region}
df = pd.DataFrame(dict1)
print(df)
df.to_csv("test_csv.csv")
name age region
0 kelvin 31 江苏
1 tom 29 上海
2 kipper 15 杭州
DataFrame operations are column-oriented: think of every operation as first selecting a column from the DataFrame.
import pandas as pd
data = {
"calories": [420, 380, 390],
"duration": [50, 40, 45]
}
df = pd.DataFrame(data, index = ["day1", "day2", "day3"])
print(df)
print('Select by column name:\n',df['calories'])
print(df['calories']['day1']) # access one row of a specific column, i.e. row day1 of column calories
# Select two columns
print('Select two columns:\n',df[['calories', 'duration']])
calories duration
day1 420 50
day2 380 40
day3 390 45
Select by column name:
day1 420
day2 380
day3 390
Name: calories, dtype: int64
420
Select two columns:
calories duration
day1 420 50
day2 380 40
day3 390 45
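df.loc[["day1","day2"]]  # select rows by label
print(df.iloc[0])  # select the first row; the i in iloc stands for integer position, so rows can be selected by position even when row labels are defined
print(df.loc['day2'])  # select the row labeled day2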
# Return multiple rows from a DataFrame
data = {
"calories": [420, 380, 390],
"duration": [50, 40, 45]
}
df = pd.DataFrame(data, index = ["day1", "day2", "day3"])
print(df)
print('Select rows by label:\n', df.loc[["day1","day2"]])
print('First-row element via iloc:\n', df.iloc[0]['calories'])  # the calories value in the first row
calories duration
day1 420 50
day2 380 40
day3 390 45
Select rows by label:
calories duration
day1 420 50
day2 380 40
First-row element via iloc:
420
data = {
"calories": [420, 380, 390],
"duration": [50, 40, 45]
}
df = pd.DataFrame(data, index = ["day1", "day2", "day3"])
print(df)
print('Row slice:\n', df[0:2])  # rows 0 and 1; row 2 is excluded
calories duration
day1 420 50
day2 380 40
day3 390 45
Row slice:
calories duration
day1 420 50
day2 380 40
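# Combine row and column selection:
print(df[['b', 'd']].iloc[[1, 3]])  # columns b and d, rows at positions 1 and 3
print(df.iloc[[1, 3]][['b', 'd']])  # rows at positions 1 and 3, columns b and d
print(df[['b', 'd']].loc[['beta', 'delta']])  # columns b and d, rows beta and delta
print(df.loc[['beta', 'delta']][['b', 'd']])  # rows beta and delta, columns b and d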
df:
a b c d e
alpha 0.0 0.0 0.0 0.0 0.0
beta 1.0 2.0 3.0 4.0 5.0
gamma 2.0 4.0 6.0 8.0 10.0
delta 3.0 6.0 9.0 12.0 15.0
eta 4.0 8.0 12.0 16.0 20.0
b d
beta 2.0 4.0
delta 6.0 12.0
b d
beta 2.0 4.0
delta 6.0 12.0
b d
beta 2.0 4.0
delta 6.0 12.0
b d
beta 2.0 4.0
delta 6.0 12.0
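# To access a single element at a specific position rather than whole rows or columns,
# dataframe.at and dataframe.iat
# are the fastest options; at takes row/column labels and iat takes integer positions
print(df)
print(df.iat[2, 3])  # the 3rd row, 4th column
print(df.at['gamma', 'd'])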
a b c d e
alpha 0.0 0.0 0.0 0.0 0.0
beta 1.0 2.0 3.0 4.0 5.0
gamma 2.0 4.0 6.0 8.0 10.0
delta 3.0 6.0 9.0 12.0 15.0
eta 4.0 8.0 12.0 16.0 20.0
8.0
8.0
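The `df` used in the row/column selection and `at`/`iat` examples is not constructed in the text. One way to build a frame with the same labels and values, assuming each entry equals the row position times (column position + 1) as the printed output suggests:

```python
import numpy as np
import pandas as pd

rows = ["alpha", "beta", "gamma", "delta", "eta"]
cols = ["a", "b", "c", "d", "e"]
# Outer product: row position (0..4) times column multiplier (1..5), as floats.
values = np.outer(np.arange(5.0), np.arange(1.0, 6.0))
df = pd.DataFrame(values, index=rows, columns=cols)

print(df)
print(df.iat[2, 3])         # by integer position: row 2, column 3
print(df.at["gamma", "d"])  # by label: the same cell
```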
import pandas as pd
sh300 = pd.read_csv('SH510300-收盘价.csv')
print('sh300 raw data\n', sh300.head())
sh300.columns = ['date', 'sh300']
print('sh300 after renaming columns\n', sh300.head())
sh500 = pd.read_csv('SH510500-收盘价.csv')
print('sh500 raw data\n', sh500.head())
sh500.columns = ['date', 'sh500']
print('sh500 after renaming columns\n', sh500.head())
merged_df = sh300.merge(sh500, on = 'date', how="outer")
print('After merge:\n', merged_df)
# Conditional filter. Note: date is still a string column here, so this comparison is lexicographic
# (e.g. '2012/10/8' sorts before '2012/5/30'); pass parse_dates=['date'] to read_csv for a true date comparison.
print(merged_df[merged_df.date>'2012/5/30'])
sh300 raw data
date close
0 2012/5/28 2.2020
1 2012/5/29 2.2359
2 2012/5/30 2.2291
3 2012/5/31 2.2240
4 2012/6/1 2.2240
sh300 after renaming columns
date sh300
0 2012/5/28 2.2020
1 2012/5/29 2.2359
2 2012/5/30 2.2291
3 2012/5/31 2.2240
4 2012/6/1 2.2240
sh500 raw data
date close
0 2013/3/15 3.0215
1 2013/3/18 2.9717
2 2013/3/19 2.9904
3 2013/3/20 3.0683
4 2013/3/21 3.0994
sh500 after renaming columns
date sh500
0 2013/3/15 3.0215
1 2013/3/18 2.9717
2 2013/3/19 2.9904
3 2013/3/20 3.0683
4 2013/3/21 3.0994
After merge:
date sh300 sh500
0 2012/5/28 2.2020 NaN
1 2012/5/29 2.2359 NaN
2 2012/5/30 2.2291 NaN
3 2012/5/31 2.2240 NaN
4 2012/6/1 2.2240 NaN
... ... ... ...
2792 2023/11/20 3.6450 5.758
2793 2023/11/21 3.6500 5.741
2794 2023/11/22 3.6110 5.668
2795 2023/11/23 3.6300 5.719
2796 2023/11/24 3.6050 5.673
[2797 rows x 3 columns]
date sh300 sh500
3 2012/5/31 2.2240 NaN
4 2012/6/1 2.2240 NaN
5 2012/6/4 2.1631 NaN
6 2012/6/5 2.1657 NaN
7 2012/6/6 2.1640 NaN
... ... ... ...
2792 2023/11/20 3.6450 5.758
2793 2023/11/21 3.6500 5.741
2794 2023/11/22 3.6110 5.668
2795 2023/11/23 3.6300 5.719
2796 2023/11/24 3.6050 5.673
[2733 rows x 3 columns]
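Because `date` was read as plain text in the example above, the filter compares strings character by character rather than calendar dates. A small sketch with made-up rows (the frame below is hypothetical, not the CSV data) shows the difference and one way to compare real dates using `pd.to_datetime`:

```python
import pandas as pd

# Hypothetical rows in the same year/month/day text format as the CSV.
df = pd.DataFrame({
    "date": ["2012/5/28", "2012/5/31", "2012/10/8", "2013/3/15"],
    "close": [2.2020, 2.2240, 2.1000, 2.5000],
})

# String comparison: "2012/10/8" sorts before "2012/5/30" because "1" < "5".
by_string = df[df["date"] > "2012/5/30"]

# Date comparison: parse the text first, then compare against a Timestamp.
parsed = pd.to_datetime(df["date"], format="%Y/%m/%d")
by_date = df[parsed > pd.Timestamp("2012-05-30")]

print(by_string)  # the 2012/10/8 row is missing
print(by_date)    # the 2012/10/8 row is included
```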
# Iterate over a DataFrame
person = {
"name": ['Google', 'Runoob' , 'Taobao'],
"age": [50, 200, 12345]
}
df = pd.DataFrame(person)
print(df)
print(df.index)
for x in df.index:
    print(x)
    if df.loc[x, "age"] > 120:
        df.loc[x, "age"] = 1
        # df.drop(x, inplace=True) would delete the row instead
print(df)
name age
0 Google 50
1 Runoob 200
2 Taobao 12345
RangeIndex(start=0, stop=3, step=1)
0
1
2
name age
0 Google 50
1 Runoob 1
2 Taobao 1
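The loop above visits one row at a time. For a rule like this, a boolean mask can apply the same update in a single step; a minimal sketch using the same sample data:

```python
import pandas as pd

person = {
    "name": ["Google", "Runoob", "Taobao"],
    "age": [50, 200, 12345],
}
df = pd.DataFrame(person)

# Select every row whose age exceeds 120 and overwrite just the age column.
too_old = df["age"] > 120
df.loc[too_old, "age"] = 1

print(df)
```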
import pandas as pd
json1 = [{"date":"2021-07-01", "close": 3.1},
{"date":"2021-07-02", "close": 3.3},
{"date":"2021-07-03", "close": 2.8}
]
df1 = pd.DataFrame(json1)
print(df1)
# Rename df1's columns
# df1.columns[0] = 'sh300'  # cannot be modified in place: TypeError: Index does not support mutable operations
df1.columns = ['date', 'sh300']
print('df1 after renaming columns:\n', df1)
date close
0 2021-07-01 3.1
1 2021-07-02 3.3
2 2021-07-03 2.8
df1 after renaming columns:
date sh300
0 2021-07-01 3.1
1 2021-07-02 3.3
2 2021-07-03 2.8
import pandas as pd
import ssl # # URLError: <urlopen error [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: unable to get local issuer certificate (_ssl.c:1091)>
ssl._create_default_https_context = ssl._create_unverified_context
sh300 = pd.read_csv("https://gitee.com/kelvin11/public-resources/raw/master/SH510300-close.csv", parse_dates=['date'], index_col='date')
print(sh300)
sh300 = sh300.rename(columns={'close': 'sh300_close'})
print('After renaming the column:\n', sh300)
close
date
2012-05-28 2.2020
2012-05-29 2.2359
2012-05-30 2.2291
2012-05-31 2.2240
2012-06-01 2.2240
... ...
2023-11-20 3.6450
2023-11-21 3.6500
2023-11-22 3.6110
2023-11-23 3.6300
2023-11-24 3.6050
[2797 rows x 1 columns]
After renaming the column:
sh300_close
date
2012-05-28 2.2020
2012-05-29 2.2359
2012-05-30 2.2291
2012-05-31 2.2240
2012-06-01 2.2240
... ...
2023-11-20 3.6450
2023-11-21 3.6500
2023-11-22 3.6110
2023-11-23 3.6300
2023-11-24 3.6050
[2797 rows x 1 columns]
import pandas as pd
json1 = [{"date":"2021-07-01", "close": 3.1},
{"date":"2021-07-02", "close": 3.3},
{"date":"2021-07-03", "close": 2.8}
]
df1 = pd.DataFrame(json1)
print(df1)
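# Build the data for the new row
new_line = {"date":"2021-07-04", "close": 7.4}
# DataFrame.append was removed in pandas 2.0; wrap the dict in a one-row DataFrame and use pd.concat instead
pd.concat([df1, pd.DataFrame([new_line])], ignore_index=True)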
date close
0 2021-07-01 3.1
1 2021-07-02 3.3
2 2021-07-03 2.8
date close
0 2021-07-01 3.1
1 2021-07-02 3.3
2 2021-07-03 2.8
3 2021-07-04 7.4
# Fix erroneous data in a DataFrame
person = {
"name": ['Google', 'Runoob' , 'Taobao'],
"age": [50, 40, 12345] # the age 12345 is an error
}
df = pd.DataFrame(person)
print(df)
print()
df.loc[2,"age"] = 30 # correct the value
print(df)
name age
0 Google 50
1 Runoob 40
2 Taobao 12345
name age
0 Google 50
1 Runoob 40
2 Taobao 30
import pandas as pd
json1 = {
"2021-07-01":{"close": 3.1},
"2021-07-02":{"close": 3.3},
"2021-07-03":{"close": 2.8}
}
df1 = pd.DataFrame(json1).T
print(df1)
# Rename df1's column
# df1.columns[0] = 'sh300'  # cannot be modified in place: TypeError: Index does not support mutable operations
df1.columns = ['sh300']
print('df1 after renaming columns:\n', df1)
json2 = {
"2021-07-01":{"close": 11},
"2021-07-02":{"close": 8},
"2021-07-04":{"close": 20}
}
df2 = pd.DataFrame(json2).T
df2.columns = ['sh500']
print(df2)
# Concatenate the two DataFrames column-wise
df3 = pd.concat([df1, df2], axis=1)
print('df3 after concatenation:\n', df3)
close
2021-07-01 3.1
2021-07-02 3.3
2021-07-03 2.8
df1 after renaming columns:
sh300
2021-07-01 3.1
2021-07-02 3.3
2021-07-03 2.8
sh500
2021-07-01 11
2021-07-02 8
2021-07-04 20
df3 after concatenation:
sh300 sh500
2021-07-01 3.1 11.0
2021-07-02 3.3 8.0
2021-07-03 2.8 NaN
2021-07-04 NaN 20.0
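This article covers the details of merge fairly thoroughly: https://zhuanlan.zhihu.com/p/634229183
The following example uses a common layout for stock closing prices to show how to merge the closing prices of two stocks: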
import pandas as pd
json1 = [{"date":"2021-07-01", "close": 3.1},
{"date":"2021-07-02", "close": 3.3},
{"date":"2021-07-03", "close": 2.8}
]
df1 = pd.DataFrame(json1)
print(df1)
# Rename df1's columns
# df1.columns[0] = 'sh300'  # cannot be modified in place: TypeError: Index does not support mutable operations
df1.columns = ['date', 'sh300']
print('df1 after renaming columns:\n', df1)
json2 = [{"date":"2021-07-01", "close": 11},
{"date":"2021-07-02", "close": 8},
{"date":"2021-07-04", "close": 20}
]
df2 = pd.DataFrame(json2)
df2.columns = ['date', 'sh500']
print('df2 after renaming columns:\n', df2)
merged_df = df1.merge(df2, on = 'date', how="outer")
print('After merge:\n', merged_df)
date close
0 2021-07-01 3.1
1 2021-07-02 3.3
2 2021-07-03 2.8
df1 after renaming columns:
date sh300
0 2021-07-01 3.1
1 2021-07-02 3.3
2 2021-07-03 2.8
df2 after renaming columns:
date sh500
0 2021-07-01 11
1 2021-07-02 8
2 2021-07-04 20
After merge:
date sh300 sh500
0 2021-07-01 3.1 11.0
1 2021-07-02 3.3 8.0
2 2021-07-03 2.8 NaN
3 2021-07-04 NaN 20.0
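The `how` argument decides which dates survive the merge. A short sketch on the same two small tables compares `outer` with `inner`, and uses `indicator=True` to record where each row came from:

```python
import pandas as pd

df1 = pd.DataFrame({"date": ["2021-07-01", "2021-07-02", "2021-07-03"],
                    "sh300": [3.1, 3.3, 2.8]})
df2 = pd.DataFrame({"date": ["2021-07-01", "2021-07-02", "2021-07-04"],
                    "sh500": [11, 8, 20]})

# inner keeps only dates present in both tables
inner = df1.merge(df2, on="date", how="inner")
# outer keeps every date; indicator=True adds a _merge column naming the source
outer = df1.merge(df2, on="date", how="outer", indicator=True)

print(inner)
print(outer)
```

For price series with different listing dates, `outer` keeps the full history and leaves NaN where one series has no quote, while `inner` restricts the result to the overlapping dates.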
import pandas as pd
import ssl # # URLError: <urlopen error [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: unable to get local issuer certificate (_ssl.c:1091)>
ssl._create_default_https_context = ssl._create_unverified_context
sh300 = pd.read_csv("https://gitee.com/kelvin11/public-resources/raw/master/SH510300-close.csv", parse_dates=['date'], index_col='date')
print(sh300)
sh300.index[0] # Timestamp('2012-05-28 00:00:00')
sh300.index # DatetimeIndex(['2012-05-28', '2012-05-29', '2012-05-30', '2012-05-31',...dtype='datetime64[ns]', name='date', length=2797, freq=None)
sh300.index.to_list() # [Timestamp('2012-05-28 00:00:00'),Timestamp('2012-05-29 00:00:00'),....]
sh300['close']['2023-11-24'] # 3.605
sh300['close'].to_list() # [2.202, 2.2359,...3.605]
sh300['close'].values # array([2.202 , 2.2359, 2.2291, ..., 3.611 , 3.63 , 3.605 ])
sh300['close'].values[-1] # 3.605
close
date
2012-05-28 2.2020
2012-05-29 2.2359
2012-05-30 2.2291
2012-05-31 2.2240
2012-06-01 2.2240
... ...
2023-11-20 3.6450
2023-11-21 3.6500
2023-11-22 3.6110
2023-11-23 3.6300
2023-11-24 3.6050
[2797 rows x 1 columns]
3.605
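See "Creating a DataFrame" above.
# When reading a CSV you can specify which values count as missing.
# By default, na and -- are not treated as missing; once listed in na_values they are read as NaN. NaN, empty cells, NA, and n/a are still treated as missing.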
missing_value = ["na","--"]
df = pd.read_csv("property-data.csv", na_values = missing_value)
print(df)
PID ST_NUM ST_NAME OWN_OCCUPIED NUM_BEDROOMS NUM_BATH SQ_FT
0 100001000.0 104.0 PUTNAM Y 3.0 1 1000.0
1 100002000.0 197.0 LEXINGTON N 3.0 1.5 NaN
2 100003000.0 NaN LEXINGTON N NaN 1 850.0
3 100004000.0 201.0 BERKELEY 12 1.0 NaN 700.0
4 NaN 203.0 BERKELEY Y 3.0 2 1600.0
5 100006000.0 207.0 BERKELEY Y NaN 1 800.0
6 100007000.0 NaN WASHINGTON NaN 2.0 HURLEY 950.0
7 100008000.0 213.0 TREMONT Y 1.0 1 NaN
8 100009000.0 215.0 TREMONT Y NaN 2 1800.0
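The same behaviour can be reproduced without the `property-data.csv` file. The sketch below reads a small inline CSV (its contents are made up for illustration) with and without `na_values`:

```python
import io
import pandas as pd

# A tiny made-up CSV: "na" and "--" stand in for missing data.
csv_text = "ST_NUM,NUM_BEDROOMS,SQ_FT\n104,3,1000\n197,na,--\n,2,850\n"

default = pd.read_csv(io.StringIO(csv_text))
custom = pd.read_csv(io.StringIO(csv_text), na_values=["na", "--"])

print(default.isnull().sum())  # only the empty ST_NUM cell counts as missing
print(custom.isnull().sum())   # "na" and "--" are now missing as well
```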
# Drop entire rows that have missing data in the specified columns
# DataFrame.dropna(axis=0, how='any', thresh=None, subset=None, inplace=False)
df = pd.read_csv("property-data.csv")
print(df)
print()
df = df.dropna(subset=["ST_NUM"])  # dropna returns a new DataFrame, so assign the result back
print(df)
PID ST_NUM ST_NAME OWN_OCCUPIED NUM_BEDROOMS NUM_BATH SQ_FT
0 100001000.0 104.0 PUTNAM Y 3 1 1000
1 100002000.0 197.0 LEXINGTON N 3 1.5 --
2 100003000.0 NaN LEXINGTON N NaN 1 850
3 100004000.0 201.0 BERKELEY 12 1 NaN 700
4 NaN 203.0 BERKELEY Y 3 2 1600
5 100006000.0 207.0 BERKELEY Y NaN 1 800
6 100007000.0 NaN WASHINGTON NaN 2 HURLEY 950
7 100008000.0 213.0 TREMONT Y 1 1 NaN
8 100009000.0 215.0 TREMONT Y na 2 1800
PID ST_NUM ST_NAME OWN_OCCUPIED NUM_BEDROOMS NUM_BATH SQ_FT
0 100001000.0 104.0 PUTNAM Y 3 1 1000
1 100002000.0 197.0 LEXINGTON N 3 1.5 --
3 100004000.0 201.0 BERKELEY 12 1 NaN 700
4 NaN 203.0 BERKELEY Y 3 2 1600
5 100006000.0 207.0 BERKELEY Y NaN 1 800
7 100008000.0 213.0 TREMONT Y 1 1 NaN
8 100009000.0 215.0 TREMONT Y na 2 1800
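# Fill NaN (missing data) with a replacement value
df = pd.read_csv("property-data.csv")
print(df)
df["ST_NUM"] = df["ST_NUM"].fillna(10000)  # assign back rather than calling inplace=True on a selected column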
print()
print(df)
PID ST_NUM ST_NAME OWN_OCCUPIED NUM_BEDROOMS NUM_BATH SQ_FT
0 100001000.0 104.0 PUTNAM Y 3 1 1000
1 100002000.0 197.0 LEXINGTON N 3 1.5 --
2 100003000.0 NaN LEXINGTON N NaN 1 850
3 100004000.0 201.0 BERKELEY 12 1 NaN 700
4 NaN 203.0 BERKELEY Y 3 2 1600
5 100006000.0 207.0 BERKELEY Y NaN 1 800
6 100007000.0 NaN WASHINGTON NaN 2 HURLEY 950
7 100008000.0 213.0 TREMONT Y 1 1 NaN
8 100009000.0 215.0 TREMONT Y na 2 1800
PID ST_NUM ST_NAME OWN_OCCUPIED NUM_BEDROOMS NUM_BATH SQ_FT
0 100001000.0 104.0 PUTNAM Y 3 1 1000
1 100002000.0 197.0 LEXINGTON N 3 1.5 --
2 100003000.0 10000.0 LEXINGTON N NaN 1 850
3 100004000.0 201.0 BERKELEY 12 1 NaN 700
4 NaN 203.0 BERKELEY Y 3 2 1600
5 100006000.0 207.0 BERKELEY Y NaN 1 800
6 100007000.0 10000.0 WASHINGTON NaN 2 HURLEY 950
7 100008000.0 213.0 TREMONT Y 1 1 NaN
8 100009000.0 215.0 TREMONT Y na 2 1800
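Some common operations
df3.iat[3,3]=np.nan  # set the value at row position 3, column position 3 to missing (was 0.129151)
df3.iat[1,2]=np.nan  # set the value at row position 1, column position 2 to missing (was 1.127064)
# Drop rows that contain any missing value;
# how='all' would drop only rows in which every value is NaN:
df3.dropna(how='any')
# Dropping columns works the same way with axis=1
df3.dropna(how='any',axis=1)
# thresh parameter: with thresh=4, a row is kept only if it has at least 4 non-NaN values
df3.iloc[2,2]=np.nan
df3.dropna(thresh=4)
# Operations that modify a DataFrame or Series return a new object and leave
# the original unchanged; to change the original, assign the result back (or pass inplace=True)
# Fill with the column mean
df3['C'] = df3['C'].fillna(df3['C'].mean())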
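The snippet above operates on a `df3` that is not defined in the text. A self-contained sketch with a small hypothetical frame (the values are chosen only for illustration) shows `how`, `thresh`, and mean filling:

```python
import numpy as np
import pandas as pd

# Hypothetical 5x4 frame with a few missing values.
df3 = pd.DataFrame({
    "A": [1.0, 2.0, np.nan, 4.0, 5.0],
    "B": [1.0, np.nan, np.nan, 4.0, np.nan],
    "C": [1.0, 2.0, np.nan, 5.0, np.nan],
    "D": [1.0, 2.0, np.nan, 4.0, 5.0],
})

any_dropped = df3.dropna(how="any")  # drop rows with at least one NaN
all_dropped = df3.dropna(how="all")  # drop only rows that are entirely NaN
thresh_kept = df3.dropna(thresh=3)   # keep rows with at least 3 non-NaN values

# Fill the gaps in column C with that column's mean (NaN is ignored by mean()).
filled_c = df3["C"].fillna(df3["C"].mean())

print(any_dropped, all_dropped, thresh_kept, filled_c, sep="\n\n")
```

None of these calls changes `df3` itself; each returns a new object.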
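# Drop missing data from a DataFrame
# As the output shows, these are not treated as null: na, --
# These are treated as null: NaN, empty cells, NA, n/a
# Empty cells and NA are converted to NaN when read into the DataFrame
df = pd.read_csv("property-data.csv")
print(df)
print(df.isnull())  # not null: na, --; null: NaN, empty cells, NA, n/a
print()
print(df.dropna())  # lowercase na is not dropped, and neither is --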
PID ST_NUM ST_NAME OWN_OCCUPIED NUM_BEDROOMS NUM_BATH SQ_FT
0 100001000.0 104.0 PUTNAM Y 3 1 1000
1 100002000.0 197.0 LEXINGTON N 3 1.5 --
2 100003000.0 NaN LEXINGTON N NaN 1 850
3 100004000.0 201.0 BERKELEY 12 1 NaN 700
4 NaN 203.0 BERKELEY Y 3 2 1600
5 100006000.0 207.0 BERKELEY Y NaN 1 800
6 100007000.0 NaN WASHINGTON NaN 2 HURLEY 950
7 100008000.0 213.0 TREMONT Y 1 1 NaN
8 100009000.0 215.0 TREMONT Y na 2 1800
PID ST_NUM ST_NAME OWN_OCCUPIED NUM_BEDROOMS NUM_BATH SQ_FT
0 False False False False False False False
1 False False False False False False False
2 False True False False True False False
3 False False False False False True False
4 True False False False False False False
5 False False False False True False False
6 False True False True False False False
7 False False False False False False True
8 False False False False False False False
PID ST_NUM ST_NAME OWN_OCCUPIED NUM_BEDROOMS NUM_BATH SQ_FT
0 100001000.0 104.0 PUTNAM Y 3 1 1000
1 100002000.0 197.0 LEXINGTON N 3 1.5 --
8 100009000.0 215.0 TREMONT Y na 2 1800
import pandas as pd
import numpy as np
df = pd.DataFrame({'A' : ['true', 'false', 'true', 'false',
'true', 'false', 'true', 'false'],
'B' : ['one', 'one', 'two', 'three',
'two', 'two', 'one', 'three'],
'C' : np.random.randn(8),
'D' : np.random.randn(8)})
print(df)
print(df.groupby(['A']).sum(numeric_only=True))  # group by column A and sum the numeric columns
print(df.groupby(['A','B']).sum())  # group by columns A and B and sum
A B C D
0 true one 0.131962 0.000795
1 false one -0.282576 0.440043
2 true two -1.467742 -1.328217
3 false three 1.228367 0.637844
4 true two 0.119230 0.894900
5 false two -0.067859 0.507391
6 true one 0.870252 1.892529
7 false three 0.671450 0.736440
C D
A
false 1.549382 2.321718
true -0.346298 1.460008
C D
A B
false one -0.282576 0.440043
three 1.899816 1.374284
two -0.067859 0.507391
true one 1.002214 1.893324
two -1.348512 -0.433317
See "Modifying a DataFrame" above.
import pandas as pd
import ssl # # URLError: <urlopen error [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: unable to get local issuer certificate (_ssl.c:1091)>
ssl._create_default_https_context = ssl._create_unverified_context
sh300 = pd.read_csv("https://gitee.com/kelvin11/public-resources/raw/master/SH510300-close.csv", parse_dates=['date'], index_col='date')
print(sh300)
print('Shifted down by 1:', sh300.shift(1))  # shift(1) moves values down by one row; -1 moves them up by one.
# Commonly used to compute rates of change. Day 2's return: (day 2 - day 1) / day 1
ret_daily = (sh300 - sh300.shift(1))/sh300.shift(1)
print('Daily returns computed with shift:', ret_daily)
# pct_change computes the same quantity directly, so the two results match.
print('pct_change result:', sh300.pct_change())
close
date
2012-05-28 2.2020
2012-05-29 2.2359
2012-05-30 2.2291
2012-05-31 2.2240
2012-06-01 2.2240
... ...
2023-11-20 3.6450
2023-11-21 3.6500
2023-11-22 3.6110
2023-11-23 3.6300
2023-11-24 3.6050
[2797 rows x 1 columns]
Shifted down by 1: close
date
2012-05-28 NaN
2012-05-29 2.2020
2012-05-30 2.2359
2012-05-31 2.2291
2012-06-01 2.2240
... ...
2023-11-20 3.6340
2023-11-21 3.6450
2023-11-22 3.6500
2023-11-23 3.6110
2023-11-24 3.6300
[2797 rows x 1 columns]
Daily returns computed with shift: close
date
2012-05-28 NaN
2012-05-29 0.015395
2012-05-30 -0.003041
2012-05-31 -0.002288
2012-06-01 0.000000
... ...
2023-11-20 0.003027
2023-11-21 0.001372
2023-11-22 -0.010685
2023-11-23 0.005262
2023-11-24 -0.006887
[2797 rows x 1 columns]
pct_change result: close
date
2012-05-28 NaN
2012-05-29 0.015395
2012-05-30 -0.003041
2012-05-31 -0.002288
2012-06-01 0.000000
... ...
2023-11-20 0.003027
2023-11-21 0.001372
2023-11-22 -0.010685
2023-11-23 0.005262
2023-11-24 -0.006887
[2797 rows x 1 columns]
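As a check on the formula (day 2 - day 1) / day 1, the sketch below computes daily returns both ways on the first five closing prices shown above and confirms that they agree:

```python
import numpy as np
import pandas as pd

close = pd.Series(
    [2.2020, 2.2359, 2.2291, 2.2240, 2.2240],
    index=pd.to_datetime(["2012-05-28", "2012-05-29", "2012-05-30",
                          "2012-05-31", "2012-06-01"]),
    name="close",
)

prev = close.shift(1)           # previous day's close; NaN on the first row
manual = (close - prev) / prev  # divide by the previous day, not the current one
builtin = close.pct_change()

print(pd.DataFrame({"manual": manual, "pct_change": builtin}))
print(np.allclose(manual.dropna(), builtin.dropna()))
```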
See the previous example, "shift: offsetting rows".
import pandas as pd
import ssl # # URLError: <urlopen error [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: unable to get local issuer certificate (_ssl.c:1091)>
ssl._create_default_https_context = ssl._create_unverified_context
sh300 = pd.read_csv("https://gitee.com/kelvin11/public-resources/raw/master/SH510300-close.csv", parse_dates=['date'], index_col='date')
print(sh300)
# Moving averages:
ma_day = [5,20,52,252]
for ma in ma_day:
    column_name = f"MA{ma}"  # e.g. MA5 is the 5-day moving average
    sh300[column_name] = sh300["close"].rolling(ma).mean()
print(sh300.head(10))
close
date
2012-05-28 2.2020
2012-05-29 2.2359
2012-05-30 2.2291
2012-05-31 2.2240
2012-06-01 2.2240
... ...
2023-11-20 3.6450
2023-11-21 3.6500
2023-11-22 3.6110
2023-11-23 3.6300
2023-11-24 3.6050
[2797 rows x 1 columns]
close MA5 MA20 MA52 MA252
date
2012-05-28 2.2020 NaN NaN NaN NaN
2012-05-29 2.2359 NaN NaN NaN NaN
2012-05-30 2.2291 NaN NaN NaN NaN
2012-05-31 2.2240 NaN NaN NaN NaN
2012-06-01 2.2240 2.22300 NaN NaN NaN
2012-06-04 2.1631 2.21522 NaN NaN NaN
2012-06-05 2.1657 2.20118 NaN NaN NaN
2012-06-06 2.1640 2.18816 NaN NaN NaN
2012-06-07 2.1505 2.17346 NaN NaN NaN
2012-06-08 2.1429 2.15724 NaN NaN NaN
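The leading NaN values appear because `rolling(n).mean()` needs n observations before it produces a result. A small sketch on the first ten closes shown above; the `min_periods=1` variant is an optional alternative, not something the original uses:

```python
import pandas as pd

close = pd.Series([2.2020, 2.2359, 2.2291, 2.2240, 2.2240,
                   2.1631, 2.1657, 2.1640, 2.1505, 2.1429])

ma5 = close.rolling(5).mean()                         # NaN until 5 values are available
ma5_partial = close.rolling(5, min_periods=1).mean()  # averages whatever is available so far

print(pd.DataFrame({"close": close, "ma5": ma5, "ma5_partial": ma5_partial}))
```

With `min_periods=1` the early rows are averages over fewer than five days, so they are not comparable with the later values.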
See concat and merge under "Modifying a DataFrame".
import pandas as pd
import ssl # # URLError: <urlopen error [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: unable to get local issuer certificate (_ssl.c:1091)>
ssl._create_default_https_context = ssl._create_unverified_context
sh300 = pd.read_csv("https://gitee.com/kelvin11/public-resources/raw/master/510300.csv", parse_dates=['date'], index_col='date')
sh500 = pd.read_csv("https://gitee.com/kelvin11/public-resources/raw/master/510500.csv", parse_dates=['date'], index_col='date')
yiyao512010 = pd.read_csv("https://gitee.com/kelvin11/public-resources/raw/master/512010.csv", parse_dates=['date'], index_col='date')
# Keep only the close column
sh300 = sh300[['close']]  # double brackets are required; single brackets would return a Series, which has no column name
sh500 = sh500[['close']]
yiyao512010 = yiyao512010[['close']]
# Rename the column
sh300 = sh300.rename(columns={'close': '510300'})
sh500 = sh500.rename(columns={'close': '510500'})
yiyao512010 = yiyao512010.rename(columns={'close': '512010'})
# Merge the data
merged_df = pd.merge(sh300,sh500, on = 'date', how="outer")
merged_df = pd.merge(merged_df, yiyao512010, on = 'date', how="outer")
print(merged_df)
510300 510500 512010
date
2012-05-28 2.004 NaN NaN
2012-05-29 2.044 NaN NaN
2012-05-30 2.036 NaN NaN
2012-05-31 2.030 NaN NaN
2012-06-01 2.030 NaN NaN
... ... ... ...
2023-12-20 3.369 5.412 0.401
2023-12-21 3.400 5.426 0.405
2023-12-22 3.406 5.410 0.403
2023-12-25 3.415 5.403 0.404
2023-12-26 3.392 5.349 0.400
[2819 rows x 3 columns]
Simple return = (current value - previous value) / previous value * 100%
import pandas as pd
import ssl # # URLError: <urlopen error [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: unable to get local issuer certificate (_ssl.c:1091)>
ssl._create_default_https_context = ssl._create_unverified_context
sh300 = pd.read_csv("https://gitee.com/kelvin11/public-resources/raw/master/SH510300-close.csv", parse_dates=['date'], index_col='date')
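sh300 = sh300.loc['2023']  # use only the 2023 data
print(sh300)
# Simple return: (current value - previous value) / previous value * 100%
last = sh300['close'].iloc[-1]  # positional access; plain [-1] on a date-indexed Series is deprecated
first = sh300['close'].iloc[0]
simple_return = ((last - first)/first).round(4) * 100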
simple_return
close
date
2023-01-03 3.8880
2023-01-04 3.8929
2023-01-05 3.9667
2023-01-06 3.9825
2023-01-09 4.0071
... ...
2023-11-20 3.6450
2023-11-21 3.6500
2023-11-22 3.6110
2023-11-23 3.6300
2023-11-24 3.6050
[217 rows x 1 columns]
-7.28
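The same calculation can be wrapped in a small helper. `simple_return` is a hypothetical function name introduced here for illustration; it is checked against the first and last 2023 closes shown above:

```python
import pandas as pd

def simple_return(prices: pd.Series) -> float:
    """Return (last - first) / first as a percentage."""
    first = prices.iloc[0]
    last = prices.iloc[-1]
    return float((last - first) / first * 100)

# First and last 2023 closes from the output above.
closes = pd.Series([3.8880, 3.6050])
result = simple_return(closes)
print(round(result, 2))
```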
sh300['pct_change'] = sh300['close'].pct_change()
sh300['cum_profit'] = pd.DataFrame(1+sh300['pct_change']).cumprod()-1
import pandas as pd
import ssl # # URLError: <urlopen error [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: unable to get local issuer certificate (_ssl.c:1091)>
ssl._create_default_https_context = ssl._create_unverified_context
sh300 = pd.read_csv("https://gitee.com/kelvin11/public-resources/raw/master/510300.csv", parse_dates=['date'], index_col='date')
sh300 = sh300[['close']].copy()  # only the close column is needed
print(sh300)
# Cumulative return of the CSI 300 from 2012-05-28 to 2023-12-26
sh300['pct_change'] = sh300['close'].pct_change()
sh300['cum_profit'] = pd.DataFrame(1+sh300['pct_change']).cumprod()-1
print('Cumulative since 2012:', sh300)  # final value 0.692615, i.e. a return of about 69% over 11 years
# Cumulative return of the CSI 300 from 2023-01-03 to 2023-12-26
sh300 = sh300.loc['2023'].copy()  # copy so the column assignments below do not act on a slice
sh300['pct_change'] = sh300['close'].pct_change()
sh300['cum_profit'] = pd.DataFrame(1+sh300['pct_change']).cumprod()-1
print('Cumulative for 2023:', sh300)  # final value -0.126898; 2023 was a rough year
close
date
2012-05-28 2.004
2012-05-29 2.044
2012-05-30 2.036
2012-05-31 2.030
2012-06-01 2.030
... ...
2023-12-20 3.369
2023-12-21 3.400
2023-12-22 3.406
2023-12-25 3.415
2023-12-26 3.392
[2819 rows x 1 columns]
Cumulative since 2012: close pct_change cum_profit
date
2012-05-28 2.004 NaN NaN
2012-05-29 2.044 0.019960 0.019960
2012-05-30 2.036 -0.003914 0.015968
2012-05-31 2.030 -0.002947 0.012974
2012-06-01 2.030 0.000000 0.012974
... ... ... ...
2023-12-20 3.369 -0.009118 0.681138
2023-12-21 3.400 0.009202 0.696607
2023-12-22 3.406 0.001765 0.699601
2023-12-25 3.415 0.002642 0.704092
2023-12-26 3.392 -0.006735 0.692615
[2819 rows x 3 columns]
Cumulative for 2023: close pct_change cum_profit
date
2023-01-03 3.885 NaN NaN
2023-01-04 3.890 0.001287 0.001287
2023-01-05 3.965 0.019280 0.020592
2023-01-06 3.981 0.004035 0.024710
2023-01-09 4.006 0.006280 0.031145
... ... ... ...
2023-12-20 3.369 -0.009118 -0.132819
2023-12-21 3.400 0.009202 -0.124839
2023-12-22 3.406 0.001765 -0.123295
2023-12-25 3.415 0.002642 -0.120978
2023-12-26 3.392 -0.006735 -0.126898
[239 rows x 3 columns]
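Compounding the daily returns gives the same number as comparing the last price directly with the first. A minimal check using the first five closes from the output above:

```python
import pandas as pd

close = pd.Series([2.004, 2.044, 2.036, 2.030, 2.030])

daily = close.pct_change()
# (1 + r1) * (1 + r2) * ... - 1; the leading NaN is skipped and stays NaN in the result
cum_profit = (1 + daily).cumprod() - 1

# Direct comparison of each price with the first price.
direct = close / close.iloc[0] - 1

print(pd.DataFrame({"cum_profit": cum_profit, "direct": direct}))
```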
Cumulative product (cumprod) grouped by year
import pandas as pd
import ssl # # URLError: <urlopen error [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: unable to get local issuer certificate (_ssl.c:1091)>
ssl._create_default_https_context = ssl._create_unverified_context
sh300 = pd.read_csv("https://gitee.com/kelvin11/public-resources/raw/master/SH510300-close.csv", parse_dates=['date'], index_col='date')
print(sh300)
# Reduce the dates to yearly periods
y_sh300 = sh300.pct_change().to_period('Y').dropna()
print('Dates reduced to years:', y_sh300)
# Group by year and compound the daily returns into one return per year
y_ret = (y_sh300.groupby(y_sh300.index).apply(lambda x: ((1+x).cumprod()-1).iloc[-1])).round(4)
print('Cumulative return per year:', y_ret)
close
date
2012-05-28 2.2020
2012-05-29 2.2359
2012-05-30 2.2291
2012-05-31 2.2240
2012-06-01 2.2240
... ...
2023-11-20 3.6450
2023-11-21 3.6500
2023-11-22 3.6110
2023-11-23 3.6300
2023-11-24 3.6050
[2797 rows x 1 columns]
Dates reduced to years: close
date
2012 0.015395
2012 -0.003041
2012 -0.002288
2012 0.000000
2012 -0.027383
... ...
2023 0.003027
2023 0.001372
2023 -0.010685
2023 0.005262
2023 -0.006887
[2796 rows x 1 columns]
Cumulative return per year: close
date
2012 -0.0172
2013 -0.0586
2014 0.5376
2015 0.0684
2016 -0.0971
2017 0.2337
2018 -0.2415
2019 0.3860
2020 0.2908
2021 -0.0402
2022 -0.2014
2023 -0.0702
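An alternative sketch of the per-year compounding groups on `index.year` instead of converting to periods. The data below is synthetic and `yearly_return` is a hypothetical helper name:

```python
import pandas as pd

# Synthetic closes spanning two calendar years.
close = pd.Series(
    [10.0, 11.0, 12.1, 12.1, 10.89],
    index=pd.to_datetime(["2021-12-30", "2021-12-31", "2022-01-03",
                          "2022-01-04", "2022-01-05"]),
)

daily = close.pct_change().dropna()

def yearly_return(returns: pd.Series) -> float:
    """Compound a series of daily returns into one period return."""
    return float((1 + returns).prod() - 1)

y_ret = daily.groupby(daily.index.year).apply(yearly_return).round(4)
print(y_ret)
```

Each year's first daily return is measured against the last close of the previous year, which is why a yearly figure computed this way can differ slightly from a first-to-last comparison within the year.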