Python Data Analysis Technology Stack, Chapter 06: Preparing Data with Pandas, 08: Indexing
Indexing is fundamental to Pandas: looking up data by index label is typically much faster than scanning a column for matching values, so choosing an appropriate index matters for performance. An Index object is backed by a NumPy array, is immutable (its labels cannot be modified in place), and contains hashable objects. A hashable object is one that can be converted to an integer (its hash value) based on its contents, which is the same property that dictionary keys need. Objects that compare equal always have the same hash value, and objects with different values will usually, though not always, have different hash values.
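As a small illustration of hashability, the sketch below hashes a few plain Python objects and then looks up a label in an Index. The labels are borrowed from the periodic table examples used later in this section; the variable names are chosen here for illustration only.

```python
import pandas as pd

# A string is hashable: hash() returns an integer derived from its contents.
hash_of_label = hash('He')

# Objects that compare equal always produce the same hash value.
equal_objects_match = hash((1, 'H')) == hash((1, 'H'))

# A list is mutable and therefore not hashable.
try:
    hash(['H', 'He'])
    list_is_hashable = True
except TypeError:
    list_is_hashable = False

# Hashable labels are what allow an Index to map a label to a position.
symbols = pd.Index(['H', 'He', 'Li'])
position_of_he = symbols.get_loc('He')

print(equal_objects_match, list_is_hashable, position_of_he)
```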
Pandas has two types of indexes: a row index (vertical), with labels attached to rows, and a column index, with a label (the column name) for every column.
Let us now explore index objects: their data types, their properties, and how they speed up access to data.
An index object has a type. Some of the common index types are listed here.
• Index: The generic index type; the column index has this type.
• RangeIndex: The default index type in Pandas (used when an index is not defined separately), implemented as a range of increasing integers. Because only the start, stop, and step values need to be stored rather than every label, this index type saves memory.
• Int64Index: An index type containing integers as labels. The labels need not be equally spaced, whereas equal spacing is required for an index of type RangeIndex. (From pandas 2.0 onward this is no longer a separate class; integer labels are held in a plain Index with an integer dtype.)
• Float64Index: Contains floating-point numbers (numbers with a decimal point) as index labels. (From pandas 2.0 onward, likewise a plain Index with a float dtype.)
• IntervalIndex: Contains intervals (for instance, the interval between two integers) as labels.
• CategoricalIndex: Contains labels drawn from a limited, finite set of values (categories).
• DatetimeIndex: Used to represent dates and times, as in time-series data.
• PeriodIndex: Represents periods like quarters, months, or years.
• TimedeltaIndex: Represents durations, such as the difference between two dates or times.
• MultiIndex: Hierarchical index with multiple levels.
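The following sketch constructs one index of several of the types listed above and prints the class name of each. The dates, categories, and tuples are made-up sample values, and the constructors shown are only one of several ways to build each type.

```python
import pandas as pd

default_index = pd.Series([10, 20, 30]).index                    # no index given
date_index = pd.date_range('2024-01-01', periods=3, freq='D')    # daily timestamps
period_index = pd.period_range('2024-01', periods=3, freq='M')   # monthly periods
timedelta_index = pd.to_timedelta(['1 days', '2 days'])          # durations
interval_index = pd.interval_range(start=0, end=3)               # (0, 1], (1, 2], (2, 3]
categorical_index = pd.CategoricalIndex(['gas', 'metal', 'gas']) # two categories
multi_index = pd.MultiIndex.from_tuples(
    [(1, 'H'), (1, 'He'), (2, 'Li')], names=['Period', 'Symbol']
)

all_indexes = [default_index, date_index, period_index, timedelta_index,
               interval_index, categorical_index, multi_index]
index_type_names = [type(ix).__name__ for ix in all_indexes]
print(index_type_names)
```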
When a Pandas object is created without an explicit index, a default index of type RangeIndex is created, as mentioned earlier. Its first label is 0 (corresponding to the first item of the Pandas Series or DataFrame), its second label is 1, and so on, following an arithmetic progression with a step of one.
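A quick way to confirm this behavior is to inspect the start, stop, and step of the default index. The Series below is a minimal example made up for this purpose.

```python
import pandas as pd

elements = pd.Series(['Hydrogen', 'Helium', 'Lithium'])  # no index supplied
default_index = elements.index

# A RangeIndex stores only these three numbers, not every label.
index_range = (default_index.start, default_index.stop, default_index.step)
labels = list(default_index)

print(type(default_index).__name__, index_range, labels)
```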
We can set a customized index using either the index parameter or the index attribute. In the Series and DataFrame objects we created earlier, we only supplied values for the individual items; since no index labels were given, the default index (of type RangeIndex) was used.
We can use the index parameter when we define a Series or DataFrame to give custom values to the index labels.
periodic_table=pd.DataFrame({'Element':['Hydrogen','Helium','Lithium', 'Beryllium','Boron']},index=['H','He','Li','Be','B'])
If we skip the index parameter during the creation of the object, we can set the labels afterward using the index attribute, as shown here.
periodic_table.index=['H','He','Li','Be','B']
The set_index method can be used to set an index using an existing column, as demonstrated in the following. Note that set_index returns a new DataFrame rather than modifying the original:
periodic_table=pd.DataFrame({'Element':['Hydrogen','Helium','Lithium','Beryllium','Boron'],'Symbols':['H','He','Li','Be','B']})
periodic_table=periodic_table.set_index('Symbols') # assign the result; set_index does not modify the DataFrame in place
The index can be turned back into a regular column, restoring the default RangeIndex, using the reset_index method:
periodic_table.reset_index()
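The round trip between set_index and reset_index can be sketched as follows. The DataFrame is a shortened version of the one above, and the names indexed and restored are chosen here for illustration. By default both methods return a new object and leave the original untouched.

```python
import pandas as pd

periodic_table = pd.DataFrame({
    'Element': ['Hydrogen', 'Helium', 'Lithium'],
    'Symbols': ['H', 'He', 'Li'],
})

# set_index returns a new DataFrame; 'Symbols' moves from the columns to the index.
indexed = periodic_table.set_index('Symbols')
original_columns = list(periodic_table.columns)  # the original is unchanged
indexed_columns = list(indexed.columns)

# reset_index moves the index back into a column and restores a RangeIndex.
restored = indexed.reset_index()
restored_columns = list(restored.columns)

print(original_columns, indexed_columns, restored_columns)
```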
We can also set the index when we read data from an external file into a DataFrame, using the index_col parameter, as shown in the following.
titanic=pd.read_csv('titanic.csv',index_col='PassengerId')
titanic.head()
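If the titanic.csv file is not at hand, the effect of index_col can be tried on an in-memory CSV. The three rows below are hypothetical placeholder data, not records from the real dataset; only the PassengerId column name is taken from the example above.

```python
import io
import pandas as pd

# Hypothetical sample standing in for the contents of a CSV file.
csv_text = """PassengerId,Name,Age
1,Passenger A,22
2,Passenger B,38
3,Passenger C,26
"""

# index_col names the column whose values become the row labels.
passengers = pd.read_csv(io.StringIO(csv_text), index_col='PassengerId')

index_name = passengers.index.name
remaining_columns = list(passengers.columns)
age_of_passenger_2 = passengers.loc[2, 'Age']   # lookup by label, not position

print(index_name, remaining_columns, age_of_passenger_2)
```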
Indexes can dramatically improve the speed of access to data. Let us understand this with the help of an example.
Consider the following DataFrame:
periodic_table=pd.DataFrame({'Atomic Number':[1,2,3,4,5],'Element':['Hydrogen','Helium','Lithium','Beryllium','Boron'],'Symbol':['H','He', 'Li','Be','B']})
Now, try retrieving the element with atomic number 2 without the use of an index, and measure the time taken for retrieval using the %timeit magic function (available in IPython and Jupyter). When the index is not used, every value in the column is compared against the target (a linear search), which is relatively time consuming.
%timeit periodic_table[periodic_table['Atomic Number']==2]
Now, set the "Atomic Number" column as the index and use the loc indexer to see how long the search takes:
new_periodic_table=periodic_table.set_index(['Atomic Number'])
%timeit new_periodic_table.loc[2]
In this run, the search performed without an index took on the order of milliseconds (around 1.66 ms). With the index, the retrieval took on the order of microseconds (281 μs), which is a significant improvement. Exact timings will vary with the machine and the Pandas version.
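Outside a notebook, the same comparison can be sketched with the standard-library timeit module. The table below is a synthetic one with 100,000 rows (the column names key and value are made up here), since the difference is easier to observe on a larger table. No timings are shown because they depend on the machine.

```python
import timeit

import numpy as np
import pandas as pd

n_rows = 100_000
table = pd.DataFrame({'key': np.arange(n_rows), 'value': np.arange(n_rows) * 2})
indexed_table = table.set_index('key')
target = n_rows - 1

# Both approaches retrieve the same row.
by_mask = table[table['key'] == target]   # compares every value in the column
by_label = indexed_table.loc[target]      # label lookup through the index

# Time 100 repetitions of each lookup.
mask_seconds = timeit.timeit(lambda: table[table['key'] == target], number=100)
loc_seconds = timeit.timeit(lambda: indexed_table.loc[target], number=100)

print(f'boolean mask: {mask_seconds:.4f} s, loc on index: {loc_seconds:.4f} s')
```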
As mentioned earlier, the index object is immutable: once defined, its individual labels cannot be modified in place. (The index as a whole can still be replaced, as we did earlier by assigning a new list of labels to the index attribute.)
As an example, let us try changing one of the index labels in the periodic table DataFrame we just defined, as shown in the following. This raises a TypeError, since we are trying to modify an immutable object.
periodic_table.index[2]=0
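The sketch below catches the error and then shows one way to change a label without mutating the Index: the rename method, which builds a new DataFrame with a new index. The replacement label 'Li-7' is an arbitrary value chosen for illustration.

```python
import pandas as pd

periodic_table = pd.DataFrame(
    {'Element': ['Hydrogen', 'Helium', 'Lithium']},
    index=['H', 'He', 'Li'],
)

# Item assignment on an Index is rejected with a TypeError.
try:
    periodic_table.index[2] = 'Li-7'
    assignment_failed = False
except TypeError:
    assignment_failed = True

# rename returns a new DataFrame whose index has the label replaced.
renamed = periodic_table.rename(index={'Li': 'Li-7'})

print(assignment_failed, list(renamed.index))
```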
While the labels of an Index object cannot be changed in place, we can retrieve information about the index using its attributes, such as the values it contains, whether it has any null values, and so on.
Let us look at some of the index attributes with some examples.
Consider the column index in the following DataFrame:
periodic_table=pd.DataFrame({'Element':['Hydrogen','Helium','Lithium','Beryllium','Boron']},index=['H','He','Li','Be','B'])
column_index=periodic_table.columns
Some of the attributes of the column index are as follows.
values attribute: Returns the column names as an array
column_index.values
hasnans attribute: Returns a Boolean True or False value based on the presence of null values
column_index.hasnans
nbytes attribute: Returns the number of bytes occupied in memory
column_index.nbytes
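The three attributes can be checked together in a script, as sketched below. The second index, index_with_gap, is a made-up example added only to show hasnans returning True.

```python
import pandas as pd

periodic_table = pd.DataFrame(
    {'Element': ['Hydrogen', 'Helium', 'Lithium', 'Beryllium', 'Boron']},
    index=['H', 'He', 'Li', 'Be', 'B'],
)
column_index = periodic_table.columns

column_names = list(column_index.values)  # the labels held by the index
has_missing = column_index.hasnans        # no null labels here
memory_bytes = column_index.nbytes        # size of the underlying data in bytes

# An index that contains a missing label reports hasnans as True.
index_with_gap = pd.Index([1.0, float('nan'), 3.0])
gap_has_missing = index_with_gap.hasnans

print(column_names, has_missing, memory_bytes, gap_has_missing)
```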
When two Pandas objects are added, their index labels are aligned first. For labels present in both objects, the corresponding values are added (or concatenated, in the case of strings). Where a label appears in only one of the objects, the value for that label in the resultant object is null (np.nan).
Let us understand this with an example. Here, the 0 index label in s1 does not have a match in s2, and the last index label (10) in s2 does not have a match in s1. The values for these two labels are null in the combined object. All other values, where the index labels align, are added together.
s1=pd.Series(np.arange(10),index=np.arange(10))
s2=pd.Series(np.arange(10),index=np.arange(1,11))
s1+s2
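The same addition is repeated below with the unmatched labels extracted explicitly. The last lines use the add method with fill_value=0, which treats a label missing from one Series as 0 instead of producing a null; this is an optional alternative and not part of the example above.

```python
import numpy as np
import pandas as pd

s1 = pd.Series(np.arange(10), index=np.arange(10))      # labels 0..9
s2 = pd.Series(np.arange(10), index=np.arange(1, 11))   # labels 1..10

total = s1 + s2                                   # aligned on labels 0..10
unmatched_labels = list(total[total.isna()].index)

# Label 1 holds 1 in s1 and 0 in s2; label 9 holds 9 in s1 and 8 in s2.
value_at_1 = total.loc[1]
value_at_9 = total.loc[9]

# Treat a missing label as 0 rather than null.
filled_total = s1.add(s2, fill_value=0)

print(unmatched_labels, value_at_1, value_at_9, filled_total.loc[10])
```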
We can perform set operations like union, difference, and symmetric difference on indexes from different objects.
Consider the following indexes, "i1" and "i2", created from the two Series objects ("s1" and "s2") we created in the previous section. The union returns every label that appears in either index:
i1=s1.index
i2=s2.index
i1.union(i2)
The difference returns the elements present in one set but not in the other.
i1.difference(i2) #elements present in i1 but not in i2
The symmetric difference returns the elements that are not common to the two sets. It differs from the difference operation in that it includes the elements unique to each of the two sets, not just those unique to the first:
i1.symmetric_difference(i2)
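For reference, the three set operations are collected below with their results converted to plain lists. The indexes are rebuilt directly with pd.Index so that the snippet runs on its own; they hold the same labels as s1.index and s2.index.

```python
import numpy as np
import pandas as pd

i1 = pd.Index(np.arange(10))      # labels 0..9, as in s1
i2 = pd.Index(np.arange(1, 11))   # labels 1..10, as in s2

union_labels = list(i1.union(i2))                       # labels in either index
only_in_i1 = list(i1.difference(i2))                    # labels in i1 but not in i2
only_in_i2 = list(i2.difference(i1))                    # labels in i2 but not in i1
not_shared = list(i1.symmetric_difference(i2))          # labels in exactly one index

print(union_labels, only_in_i1, only_in_i2, not_shared)
```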
You can also perform arithmetic operations on two index objects of the same length, as shown in the following. In current versions of Pandas, the operation is applied element by element; it is not a set difference.
i1-i2
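A minimal sketch of the element-by-element behavior, assuming a current version of Pandas and using the same labels as the indexes above:

```python
import numpy as np
import pandas as pd

i1 = pd.Index(np.arange(10))      # 0, 1, ..., 9
i2 = pd.Index(np.arange(1, 11))   # 1, 2, ..., 10

# Each label of i2 is subtracted from the label of i1 at the same position.
elementwise = i1 - i2
elementwise_labels = list(elementwise)

# Arithmetic with a scalar is applied to every label.
doubled_labels = list(i1 * 2)

print(elementwise_labels)
print(doubled_labels)
```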