《Python数据分析技术栈》第06章使用 Pandas 准备数据 01 Pandas概览(Pandas at a glance)
Wes McKinney developed the Pandas library in 2008. The name (Pandas) comes from the term “Panel Data” used in econometrics for analyzing time-series data. Pandas has many features, listed in the following, that make it a popular tool for data wrangling and analysis.
Wes McKinney 于 2008 年开发了 Pandas 库。Pandas 这个名字来源于计量经济学中用于分析时间序列数据的术语 “面板数据”。Pandas 有许多功能,这些功能使其成为数据处理和分析的常用工具。
Pandas provides features for labeling of data or indexing, which speeds up the retrieval of data.
Pandas 提供数据标签或索引功能,可加快数据检索速度。
Input and output support: Pandas provides options to read data from different file formats like JSON (JavaScript Object Notation), CSV (Comma-Separated Values), Excel, and HDF5 (Hierarchical Data Format Version 5). It can also be used to write data into databases, web services, and so on.
输入和输出支持: Pandas 提供从不同文件格式读取数据的选项,如 JSON(JavaScript Object Notation)、CSV(Comma-Separated Values)、Excel 和 HDF5(Hierarchical Data Format Version 5)。它还可用于将数据写入数据库、网络服务等。
Most of the data that is needed for analysis is not contained in a single source, and we often need to combine datasets to consolidate the data that we need for analysis. Again, Pandas comes to the rescue with tailor-made functions to combine data.
分析所需的大部分数据并不包含在单一来源中,因此我们经常需要合并数据集,以整合分析所需的数据。Pandas 又一次提供了量身定制的合并数据函数。
Speed and enhanced performance: The Pandas library is based on Cython, which combines the convenience and ease of use of Python with the speed of the C language. Cython helps to optimize performance and reduce overheads.
速度和增强的性能 Pandas 库基于 Cython,它将 Python 的方便易用与 C 语言的速度相结合。Cython 有助于优化性能和减少开销。
Data visualization: To derive insights from the data and make it presentable to the audience, viewing data using visual means is crucial, and Pandas provides a lot of built-in visualization tools using Matplotlib as the base library.
数据可视化: 要从数据中获得洞察力并将其呈现给受众,使用可视化手段查看数据至关重要,而 Pandas 使用 Matplotlib 作为基础库,提供了大量内置可视化工具。
Support for other libraries: Pandas integrates smoothly with other libraries like Numpy, Matplotlib, Scipy, and Scikit-learn. Thus we can perform other tasks like numerical computations, visualizations, statistical analysis, and machine learning in conjunction with data manipulation.
支持其他库 Pandas 可与 Numpy、Matplotlib、Scipy 和 Scikit-learn 等其他库顺利集成。因此,我们可以结合数据处理执行其他任务,如数值计算、可视化、统计分析和机器学习。
Grouping: Pandas provides support for the split-apply-combine methodology, whereby we can group our data into categories, apply separate functions on them, and combine the results.
分组: Pandas 支持 "拆分-应用-合并 "方法,我们可以将数据分组,分别应用不同的函数,然后合并结果。
Handling missing data, duplicates, and filler characters: Data often has missing values, duplicates, blank spaces, special characters (like $, &), and so on that may need to be removed or replaced. With the functions provided in Pandas, you can handle such anomalies with ease.
处理缺失数据、重复数据和填充字符: 数据中经常会有需要删除或替换的缺失值、重复数据、空白、特殊字符(如 $、&)等。利用 Pandas 提供的函数,您可以轻松处理此类异常情况。
Mathematical operations: Many numerical operations and computations can be performed in Pandas, with NumPy being used at the back end for this purpose.
数学运算 在 Pandas 中可以执行许多数值运算和计算,NumPy 在后端用于此目的。
If you have not already installed Pandas, go to the Anaconda Prompt and enter the following command.
如果尚未安装 Pandas,请转到 Anaconda 提示符并输入以下命令。
pip install pandas
Once the Pandas library is installed, you need to import it before using its functions. In your Jupyter notebook, type the following to import this library.
安装好 Pandas 库后,在使用其功能之前需要将其导入。在 Jupyter 笔记本中,键入以下内容导入该库。
import pandas as pd
Here, pd is a shorthand name or alias that is a standard for Pandas.
这里,pd 是 Pandas 标准的速记名称或别名。
For some of the examples, we also use functions from the NumPy library. Ensure that both the Pandas and NumPy libraries are installed and imported.
在部分示例中,我们还使用了 NumPy 库中的函数。确保已安装并导入 Pandas 和 NumPy 库。
You need to download a dataset, “subset-covid-data.csv”, that contains data about the number of cases and deaths related to the COVID-19 pandemic for various countries on a particular date. Please use the following link for downloading the dataset: https://github.com/DataRepo2019/Data-files/blob/master/subset-covid-data.csv
您需要下载一个名为 "subset-covid-data.csv "的数据集,其中包含特定日期不同国家与 COVID-19 大流行相关的病例数和死亡数的数据。请使用以下链接下载数据集: https://github.com/DataRepo2019/Data-files/blob/master/subset-covid-data.csv