pandas is one of the most heavily used Python libraries in data analysis. Reading and writing files is a fundamental step in any analysis workflow, and its speed affects the performance of the whole pipeline. The program below benchmarks pandas reading and writing several common file formats.
```python
import time

import numpy as np
import pandas as pd


def write_data(df):
    start = time.time()
    store = pd.HDFStore('D:\\test\\store.h5')
    store['df'] = df
    store.close()
    print(f'HDF5 write: {time.time() - start:.2f}s')

    start = time.time()
    df.to_csv('D:\\test\\df.csv', index=False)
    print(f'CSV write: {time.time() - start:.2f}s')

    start = time.time()
    df.to_pickle('D:\\test\\df.pickle')
    print(f'pickle write: {time.time() - start:.2f}s')

    start = time.time()
    df.to_parquet('D:\\test\\df.parquet')
    print(f'parquet write: {time.time() - start:.2f}s')

    start = time.time()
    df.to_feather('D:\\test\\df.feather')
    print(f'feather write: {time.time() - start:.2f}s')


def read_data():
    start = time.time()
    store = pd.HDFStore('D:\\test\\store.h5', mode='r')
    df1 = store.get('df')
    store.close()
    print(f'HDF5 read: {time.time() - start:.2f}s')

    start = time.time()
    df1 = pd.read_csv('D:\\test\\df.csv')
    print(f'CSV read: {time.time() - start:.2f}s')

    start = time.time()
    df1 = pd.read_pickle('D:\\test\\df.pickle')
    print(f'pickle read: {time.time() - start:.2f}s')

    start = time.time()
    df1 = pd.read_parquet('D:\\test\\df.parquet')
    print(f'parquet read: {time.time() - start:.2f}s')

    start = time.time()
    df1 = pd.read_feather('D:\\test\\df.feather')
    print(f'feather read: {time.time() - start:.2f}s')


if __name__ == '__main__':
    # Generate 100 million rows x 5 columns of random data.
    # String column names are required here: to_parquet and to_feather
    # raise an error on the default integer column labels.
    data = pd.DataFrame(np.random.rand(100_000_000, 5), columns=list('abcde'))
    write_data(data)
    read_data()
```
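The repeated start/print pattern above can be factored into a small context-manager timer; `time.perf_counter` is a monotonic high-resolution clock and is better suited to measuring elapsed intervals than `time.time`. A minimal sketch (the `timer` helper is hypothetical, not part of pandas):

```python
import time
from contextlib import contextmanager


@contextmanager
def timer(label):
    # perf_counter() is monotonic and high-resolution, so it is the
    # recommended clock for measuring elapsed wall time.
    start = time.perf_counter()
    try:
        yield
    finally:
        print(f'{label}: {time.perf_counter() - start:.2f}s')


# Usage:
# with timer('CSV write'):
#     df.to_csv('D:\\test\\df.csv', index=False)
```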
| | HDF5 | CSV | pickle | parquet | feather |
|---|---|---|---|---|---|
| Read | 11.8s | 68.9s | 3.5s | 6.5s | 5.3s |
| Write | 4s | 532s | 3.3s | 28.4s | 9.4s |
| File size | 4.46 GB | 9.06 GB | 3.72 GB | 3.84 GB | 3.72 GB |
The table shows that formats such as HDF5, pickle, and parquet deliver good read/write performance and compact files, while CSV is by far the slowest to write and the largest on disk. When analyzing large amounts of data, the binary formats are worth considering.
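One caveat on pickle's speed: pickle files can only be read back from Python (and may be tied to the pandas version that wrote them), whereas parquet and feather are language-neutral. Within Python, though, the round trip is lossless, which a quick check confirms (a minimal sketch writing a hypothetical `df_demo.pickle` to the current directory):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.rand(1000, 5), columns=list('abcde'))

# Round-trip through pickle: values, dtypes, and index survive exactly.
df.to_pickle('df_demo.pickle')
restored = pd.read_pickle('df_demo.pickle')
assert restored.equals(df)
```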