Python Data Analysis Technology Stack, Chapter 07: Python Data Visualization, 01 Matplotlib
In the last chapter, we read about Pandas, a library with functions for preparing data for analysis and visualization. Visualization helps us understand patterns in the data, identify outliers and other points of interest, and present our findings to an outside audience without sifting through the data manually. It also helps us glean information from raw data and gain insights that would otherwise be difficult to draw.
After going through this chapter, you will be able to understand the commonly used plots, work with the object-oriented and stateful approaches in Matplotlib, use Pandas for plotting, and create graphs with Seaborn.
In your Jupyter notebook, type the following to import the libraries used in this chapter.
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
Here, plt is a shorthand name (an alias) for the pyplot module of Matplotlib that we use for plotting, sns is an alias for the Seaborn library, and pd is an alias for Pandas.
If these libraries are not installed, open the Anaconda Prompt and install them as follows:
pip install matplotlib
pip install seaborn
pip install pandas
We use the Titanic dataset in this chapter to demonstrate the various plots.
Please download the dataset using the following link: https://github.com/DataRepo2019/Data-files/blob/master/titanic.csv
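Once the file is downloaded, it can be loaded into a Pandas DataFrame. The following is a minimal sketch, assuming the file was saved as titanic.csv in the notebook's working directory. The helper name load_titanic and the local path are illustrative choices, not something the chapter prescribes.

```python
import os
import pandas as pd

def load_titanic(path="titanic.csv"):
    """Read the Titanic CSV into a DataFrame (hypothetical helper; the default path is an assumption)."""
    return pd.read_csv(path)

# Only attempt to load the data if the file has actually been downloaded.
if os.path.exists("titanic.csv"):
    df = load_titanic()
    # Quick sanity checks before plotting
    print(df.shape)
    print(df.head())
```

Checking the shape and the first few rows before plotting is a cheap way to confirm that the file was read as a table rather than as a saved web page.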
Some of the basic plots that are widely used in exploratory or descriptive data analysis include bar plots, pie charts, histograms, scatter plots, box plots, and heat maps; these are explained in Table 7-1.
Bar chart: A bar chart enables visualization of categorical data, with the width or height of each bar representing the value for its category. The bars can be shown either vertically or horizontally.
Histogram: A histogram is used to visualize the distribution of a continuous variable. It divides the range of the continuous variable into intervals and shows where most of the values lie.
Box plot: Box plots help with visually depicting the statistical characteristics of the data. A box plot provides a five-point summary, with each line in the accompanying figure representing a statistical measure of the data being plotted. These five measures are the minimum, the first quartile (25th percentile), the median, the third quartile (75th percentile), and the maximum. The small circles or dots in the figure represent the outliers (or extreme values). The two lines on either side of the box, representing the minimum and maximum values (excluding outliers), are also called "whiskers". Any point outside these whiskers is called an outlier. The middle line in the box represents the median. A box plot is generally used for continuous (ratio/interval) variables, though it can also be used for some categorical variables, such as ordinal variables.
Pie chart: A pie chart shows the distinct values of a variable as sectors within a circle. Pie charts are used with categorical variables.
Scatter plot: A scatter plot displays the values of two continuous variables as points on the x and y axes and helps us visualize whether the two variables are correlated.
Heat map: A heat map shows the correlation between multiple variables using a color-coded matrix, where the color saturation represents the strength of the correlation between the variables. A heat map can aid in the visualization of multiple variables at once.
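To make the six plot types above concrete, the sketch below draws each of them once with Matplotlib. All of the data is made up for illustration and is not taken from the Titanic dataset; the variable names are likewise illustrative. The five measures of the box plot summary are computed explicitly with NumPy percentiles.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs as a script; omit in Jupyter
import matplotlib.pyplot as plt
import numpy as np

# Made-up illustrative data (not the Titanic dataset)
rng = np.random.default_rng(0)
ages = rng.normal(30, 12, 200).clip(1, 80)
fares = ages * 1.5 + rng.normal(0, 10, 200)
noise = rng.normal(size=200)
categories = ["A", "B", "C"]
counts = [40, 25, 35]

# The five measures summarized by a box plot
minimum, q1, median, q3, maximum = np.percentile(ages, [0, 25, 50, 75, 100])

fig, axes = plt.subplots(2, 3, figsize=(12, 7))

axes[0, 0].bar(categories, counts)         # categorical values as bar heights
axes[0, 0].set_title("Bar chart")

axes[0, 1].hist(ages, bins=10)             # distribution of a continuous variable
axes[0, 1].set_title("Histogram")

axes[0, 2].boxplot(ages)                   # box, whiskers, and outlier markers
axes[0, 2].set_title("Box plot")

axes[1, 0].pie(counts, labels=categories)  # categories as sectors of a circle
axes[1, 0].set_title("Pie chart")

axes[1, 1].scatter(ages, fares, s=10)      # two continuous variables
axes[1, 1].set_title("Scatter plot")

# Correlation matrix of three variables shown as a color-coded grid
corr = np.corrcoef(np.vstack([ages, fares, noise]))
axes[1, 2].imshow(corr, cmap="coolwarm", vmin=-1, vmax=1)
axes[1, 2].set_title("Heat map")

fig.tight_layout()
```

Note that Matplotlib's boxplot, by default, extends the whiskers only to the most extreme points within 1.5 times the interquartile range of the box, so the whisker ends can differ from the raw minimum and maximum computed above; points beyond the whiskers are drawn as outliers.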
Let us now have a look at some of the Python libraries that are used for visualization, starting with Matplotlib.
The main library for data visualization in Python is Matplotlib. Matplotlib has many visualization features that are similar to MATLAB (a computational environment and programming language with plotting tools). Matplotlib is mainly used for plotting two-dimensional graphs, with limited support for creating three-dimensional graphs.
Plots created using Matplotlib require more lines of code and more customization of the plot parameters than plots created with libraries like Seaborn and Pandas, which use default settings to simplify the code needed to create plots.
Matplotlib forms the backbone of most of the visualizations that are performed using Python.
There are two interfaces in Matplotlib, stateful and object-oriented, that are described in Table 7-2.
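As a rough illustration of the difference between the two interfaces, the sketch below draws the same made-up bar chart twice: once through pyplot functions that act on an implicit "current" figure, and once by calling methods on explicit figure and axes objects. The data and variable names are invented for this example.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs as a script; omit in Jupyter
import matplotlib.pyplot as plt

x = ["A", "B", "C"]  # made-up categories
y = [3, 7, 5]        # made-up values

# Stateful interface: pyplot keeps track of the "current" figure and axes
plt.figure(figsize=(4, 3))
plt.bar(x, y)
plt.xlabel("Category")
plt.ylabel("Value")
stateful_ax = plt.gca()  # the axes that pyplot was implicitly drawing on

# Object-oriented interface: we hold the figure and axes objects ourselves
fig, ax = plt.subplots(figsize=(4, 3))
ax.bar(x, y)
ax.set_xlabel("Category")
ax.set_ylabel("Value")
```

Both halves produce an equivalent chart. The practical difference shows up with several subplots, where holding each axes object explicitly avoids having to track which one is currently active.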
The object-oriented approach is the recommended approach for plotting in Matplotlib because it lets us control and customize each of the individual objects or plots. The following steps use the object-oriented methodology for plotting.
Create a figure (the outer container) and set its dimensions: The plt.figure function creates a figure and sets its dimensions (width and height, in inches) through the figsize argument, as shown in the following.
fig=plt.figure(figsize=(10,5))
Determine the number of subplots and assign positions for each of the subplots in the figure: In the following example, we create two subplots and place them vertically. Hence, we divide the figure into two rows and one column, with one subplot in each section. The fig.add_subplot function creates an axes object (subplot) and assigns it a position. The argument 211, passed to the add_subplot call that creates the first axes object ("ax1"), gives it the first position in a figure with two rows and one column. The argument 212, passed to the call that creates the second axes object ("ax2"), gives it the second position in the same grid. Note that the first digit indicates the number of rows, the second digit the number of columns, and the last digit the position of the subplot or axes.
ax1=fig.add_subplot(211)
ax2=fig.add_subplot(212)
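The three-digit argument is shorthand for three separate integers, so add_subplot(211) can also be written as add_subplot(2, 1, 1). The sketch below uses the longer form, checks that position 1 ends up above position 2, and shows plt.subplots as an alternative that creates the figure and the grid of axes in one call. The names fig2, ax_top, and ax_bottom are illustrative.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs as a script; omit in Jupyter
import matplotlib.pyplot as plt

fig = plt.figure(figsize=(10, 5))
ax1 = fig.add_subplot(2, 1, 1)  # same as add_subplot(211): 2 rows, 1 column, position 1
ax2 = fig.add_subplot(2, 1, 2)  # same as add_subplot(212): 2 rows, 1 column, position 2

# Positions are numbered from the top, while figure coordinates grow upwards,
# so the first subplot has the larger y0.
top_y = ax1.get_position().y0
bottom_y = ax2.get_position().y0
print(top_y > bottom_y)

# Alternative: create the figure and both axes in a single call
fig2, (ax_top, ax_bottom) = plt.subplots(2, 1, figsize=(10, 5))
```

The separate-integer form is also the one to use when a grid has ten or more positions, since the three-digit shorthand only works for single-digit values.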
Plot and label each subplot: After the positions are assigned to each subplot, we move on to generating the individual subplots. We create one histogram (using the hist function) and one bar plot (using the bar function). The x and y axes are labeled using the set_xlabel and set_ylabel functions.
#reading the Titanic dataset (assumes titanic.csv has been saved in the working directory)
df=pd.read_csv('titanic.csv')
#labelling the x axis
ax1.set_xlabel("Age")
#labelling the y axis
ax1.set_ylabel("Frequency")
#plotting a histogram using the hist function (missing ages are dropped first)
ax1.hist(df['Age'].dropna())
#labelling the x axis
ax2.set_xlabel("Category")
#labelling the y axis
ax2.set_ylabel("Numbers")
#setting the x and y lists for plotting the bar chart
x=['Males','Females']
y=[577,314]
#using the bar function to plot a bar graph
ax2.bar(x,y)
Note that the top half of Figure 7-1 is occupied by the first axes object (histogram), and the bottom half of the figure contains the second subplot (bar plot).
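In the example above, the bar heights were typed in by hand. A possible variation, sketched below, derives the counts from the data with value_counts and calls tight_layout so that the axis labels of the two stacked subplots do not overlap. The small DataFrame is a made-up stand-in so that the sketch runs on its own; the column names Age and Sex are an assumption about the downloaded file.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs as a script; omit in Jupyter
import matplotlib.pyplot as plt
import pandas as pd

# Tiny made-up stand-in for the Titanic data. With the real file, one would use
# df = pd.read_csv("titanic.csv") instead (file name and column names are assumptions).
df = pd.DataFrame({
    "Age": [22, 38, 26, None, 35, 54, 2, 27],
    "Sex": ["male", "female", "female", "male", "male", "male", "female", "male"],
})

fig = plt.figure(figsize=(10, 5))
ax1 = fig.add_subplot(211)
ax2 = fig.add_subplot(212)

# Histogram of the ages, ignoring missing values
ax1.hist(df["Age"].dropna())
ax1.set_xlabel("Age")
ax1.set_ylabel("Frequency")

# Bar heights computed from the data instead of being hardcoded
counts = df["Sex"].value_counts()
ax2.bar(counts.index, counts.values)
ax2.set_xlabel("Category")
ax2.set_ylabel("Numbers")

# Adjust the spacing so labels of the stacked subplots do not collide
fig.tight_layout()
```

Computing the counts from the DataFrame keeps the bar chart consistent with the data if the file changes, at the cost of depending on the exact column name and category labels in the file.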