About the pandas version: this article is based on pandas 2.1.2.
About updates: this article is revised and expanded as new stable pandas releases come out.
Pandas.Series.describe
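Generates descriptive statistics for a Series. The result is a multi-row table in which each row is one statistic: count, mean, standard deviation, minimum, quartiles, maximum, and so on. Missing values (NaN) are excluded from the calculation.
Syntax:
Series.describe(percentiles=None, include=None, exclude=None)
Returns:
Series or DataFrame
The return type follows the caller: Series.describe returns a Series, and the DataFrame version of the method returns a DataFrame.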
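As a quick check of the two points above (NaN is skipped, and a Series comes back), here is a minimal sketch. The sample values are made up for illustration and do not come from the original article.

```python
import numpy as np
import pandas as pd

# Illustrative data: four slots, one of them missing
s = pd.Series([1.0, 2.0, np.nan, 4.0])

desc = s.describe()

print(type(desc).__name__)  # the result is itself a Series
print(desc["count"])        # only the three non-missing values are counted
print(desc["mean"])         # mean of 1, 2 and 4; NaN takes no part in it
```

Comparing desc["count"] with len(s) is a cheap way to see how many values were missing.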
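Parameters:
include : 'all', list-like of dtypes or None (default), optional
The include parameter is a whitelist of data types. A column whose dtype appears in the whitelist takes part in the description. Ignored for Series.
exclude : list-like of dtypes or None (default), optional
The exclude parameter is a blacklist of data types. A column whose dtype appears in the blacklist is left out of the description. Ignored for Series.
percentiles : list-like of numbers, optional
The percentiles parameter customizes which percentiles appear in the output. Every element of the list must lie between 0 and 1. By default only [0.25, 0.5, 0.75] are returned (the first to third quartiles).
Note: you can pass several percentiles at once. See Example 1.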
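The sketch below illustrates, on made-up data, that include and exclude make no difference for a Series, and that a percentile outside the 0-1 range is rejected. The ValueError branch assumes pandas 2.x behavior; the variable names are illustrative only.

```python
import numpy as np
import pandas as pd

s = pd.Series(np.arange(1, 10))

# For a Series, include/exclude are accepted but have no effect
baseline = s.describe()
same_with_include = baseline.equals(s.describe(include=["object"]))
same_with_exclude = baseline.equals(s.describe(exclude=[np.number]))
print(same_with_include, same_with_exclude)

# Percentiles must lie in [0, 1]; 1.5 is expected to raise ValueError
try:
    s.describe(percentiles=[1.5])
    out_of_range_rejected = False
except ValueError:
    out_of_range_rejected = True
print(out_of_range_rejected)
```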
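Note: although numpy.number includes complex numbers (np.complexfloating), Pandas.DataFrame.describe, like Series.describe, only supports real-valued calculations. If the data contains complex numbers that have not been excluded, the call fails with TypeError: a must be an array of real numbers. See Example 2.
For numeric data, the result's index includes count, mean, std, min and max, plus the lower, 50 and upper percentiles. By default the lower percentile is 25 and the upper percentile is 75. The 50th percentile is the same as the median.
For object data, such as strings or timestamps, the result's index includes count, unique, top and freq. top is the most common value and freq is the frequency of that value. Timestamps also include the first and last items (this sentence follows the official docstring; in pandas 2.x, datetime data is summarized like numeric data instead).
If several object values share the highest count, the count and top results are chosen arbitrarily from among them.
For mixed data types provided via a DataFrame, only the numeric columns are analyzed by default. If the DataFrame consists only of object ('object') and categorical ('category') data without any numeric columns, the default is to return an analysis of those columns. If include='all' is passed, the result contains the union of the attributes of every type.
The include and exclude parameters can be used to limit which columns of a DataFrame are analyzed. They are ignored when analyzing a Series.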
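To make the object-data case above concrete, here is a minimal sketch with invented string values. The value "a" is deliberately the only most frequent one, so top is unambiguous.

```python
import pandas as pd

# Illustrative data: "a" appears three times, None is a missing value
s = pd.Series(["a", "b", "a", "c", "a", None])

desc = s.describe()
print(list(desc.index))                 # count, unique, top, freq
print(desc["count"], desc["unique"])    # non-missing values, distinct values
print(desc["top"], desc["freq"])        # most common value and how often it occurs

# A categorical Series is summarized with the same four rows
cat_desc = s.astype("category").describe()
print(list(cat_desc.index))
```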
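Related methods
count: number of non-null cells
max: maximum
min: minimum
mean: mean
std: sample / population standard deviation
select_dtypes: select columns by data type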
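The related methods above can be combined to rebuild the default numeric summary by hand. This is only a sketch of the relationship, not how pandas implements describe internally. It assumes the default percentiles and the default sample standard deviation (ddof=1) of Series.std.

```python
import numpy as np
import pandas as pd

s = pd.Series(np.arange(1, 10))

# Rebuild each row of the default summary from the individual methods
manual = pd.Series(
    {
        "count": s.count(),
        "mean": s.mean(),
        "std": s.std(),            # sample standard deviation (ddof=1)
        "min": s.min(),
        "25%": s.quantile(0.25),
        "50%": s.quantile(0.5),    # same value as s.median()
        "75%": s.quantile(0.75),
        "max": s.max(),
    },
    dtype="float64",
)

desc = s.describe()
matches = np.allclose(manual.to_numpy(), desc.to_numpy())
print(matches)
```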
Example 1: specifying several percentiles
import numpy as np
import pandas as pd

# Integers 1 through 9
s = pd.Series(np.arange(1, 10, 1))
# include is ignored for a Series; only percentiles changes the output
s.describe(include=[np.number], percentiles=[0.1, 0.4, 0.7, 0.8, 0.85])
Output:
count    9.000000
mean     5.000000
std      2.738613
min      1.000000
10%      1.800000
40%      4.200000
50%      5.000000
70%      6.600000
80%      7.400000
85%      7.800000
max      9.000000
dtype: float64
Example 2: describing complex numbers raises TypeError
import numpy as np
import pandas as pd

# Build the demo data
s = pd.Series([1 + 1j, 2 + 2j, 3 + 3j])
s
0    1.0+1.0j
1    2.0+2.0j
2    3.0+3.0j
dtype: complex128
s.describe()
Note: the traceback below shows df.describe() and DataFrameDescriber frames, so it appears to have been recorded from a DataFrame holding the same data. The final TypeError is the one described in the note on complex numbers.
D:\miniconda3\envs\python3.12\Lib\site-packages\numpy\core\_methods.py:49: ComplexWarning: Casting complex values to real discards the imaginary part
return umr_sum(a, axis, dtype, out, keepdims, initial, where)
D:\miniconda3\envs\python3.12\Lib\site-packages\pandas\core\nanops.py:944: RuntimeWarning: invalid value encountered in sqrt
result = np.sqrt(nanvar(values, axis=axis, skipna=skipna, ddof=ddof, mask=mask))
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
Cell In[59], line 1
----> 1 df.describe()
File D:\miniconda3\envs\python3.12\Lib\site-packages\pandas\core\generic.py:11544, in NDFrame.describe(self, percentiles, include, exclude)
11302 @final
11303 def describe(
11304 self,
(...)
11307 exclude=None,
11308 ) -> Self:
11309 """
11310 Generate descriptive statistics.
11311
(...)
11542 max NaN 3.0
11543 """
> 11544 return describe_ndframe(
11545 obj=self,
11546 include=include,
11547 exclude=exclude,
11548 percentiles=percentiles,
11549 ).__finalize__(self, method="describe")
File D:\miniconda3\envs\python3.12\Lib\site-packages\pandas\core\methods\describe.py:97, in describe_ndframe(obj, include, exclude, percentiles)
90 else:
91 describer = DataFrameDescriber(
92 obj=cast("DataFrame", obj),
93 include=include,
94 exclude=exclude,
95 )
---> 97 result = describer.describe(percentiles=percentiles)
98 return cast(NDFrameT, result)
File D:\miniconda3\envs\python3.12\Lib\site-packages\pandas\core\methods\describe.py:170, in DataFrameDescriber.describe(self, percentiles)
168 for _, series in data.items():
169 describe_func = select_describe_func(series)
--> 170 ldesc.append(describe_func(series, percentiles))
172 col_names = reorder_columns(ldesc)
173 d = concat(
174 [x.reindex(col_names, copy=False) for x in ldesc],
175 axis=1,
176 sort=False,
177 )
File D:\miniconda3\envs\python3.12\Lib\site-packages\pandas\core\methods\describe.py:232, in describe_numeric_1d(series, percentiles)
227 formatted_percentiles = format_percentiles(percentiles)
229 stat_index = ["count", "mean", "std", "min"] + formatted_percentiles + ["max"]
230 d = (
231 [series.count(), series.mean(), series.std(), series.min()]
--> 232 + series.quantile(percentiles).tolist()
233 + [series.max()]
234 )
235 # GH#48340 - always return float on non-complex numeric data
236 dtype: DtypeObj | None
File D:\miniconda3\envs\python3.12\Lib\site-packages\pandas\core\series.py:2769, in Series.quantile(self, q, interpolation)
2765 # We dispatch to DataFrame so that core.internals only has to worry
2766 # about 2D cases.
2767 df = self.to_frame()
-> 2769 result = df.quantile(q=q, interpolation=interpolation, numeric_only=False)
2770 if result.ndim == 2:
2771 result = result.iloc[:, 0]
File D:\miniconda3\envs\python3.12\Lib\site-packages\pandas\core\frame.py:11831, in DataFrame.quantile(self, q, axis, numeric_only, interpolation, method)
11827 raise ValueError(
11828 f"Invalid method: {method}. Method must be in {valid_method}."
11829 )
11830 if method == "single":
> 11831 res = data._mgr.quantile(qs=q, interpolation=interpolation)
11832 elif method == "table":
11833 valid_interpolation = {"nearest", "lower", "higher"}
File D:\miniconda3\envs\python3.12\Lib\site-packages\pandas\core\internals\managers.py:1508, in BlockManager.quantile(self, qs, interpolation)
1504 new_axes = list(self.axes)
1505 new_axes[1] = Index(qs, dtype=np.float64)
1507 blocks = [
-> 1508 blk.quantile(qs=qs, interpolation=interpolation) for blk in self.blocks
1509 ]
1511 return type(self)(blocks, new_axes)
File D:\miniconda3\envs\python3.12\Lib\site-packages\pandas\core\internals\blocks.py:1587, in Block.quantile(self, qs, interpolation)
1584 assert self.ndim == 2
1585 assert is_list_like(qs) # caller is responsible for this
-> 1587 result = quantile_compat(self.values, np.asarray(qs._values), interpolation)
1588 # ensure_block_shape needed for cases where we start with EA and result
1589 # is ndarray, e.g. IntegerArray, SparseArray
1590 result = ensure_block_shape(result, ndim=2)
File D:\miniconda3\envs\python3.12\Lib\site-packages\pandas\core\array_algos\quantile.py:39, in quantile_compat(values, qs, interpolation)
37 fill_value = na_value_for_dtype(values.dtype, compat=False)
38 mask = isna(values)
---> 39 return quantile_with_mask(values, mask, fill_value, qs, interpolation)
40 else:
41 return values._quantile(qs, interpolation)
File D:\miniconda3\envs\python3.12\Lib\site-packages\pandas\core\array_algos\quantile.py:97, in quantile_with_mask(values, mask, fill_value, qs, interpolation)
95 result = np.repeat(flat, len(values)).reshape(len(values), len(qs))
96 else:
---> 97 result = _nanpercentile(
98 values,
99 qs * 100.0,
100 na_value=fill_value,
101 mask=mask,
102 interpolation=interpolation,
103 )
105 result = np.array(result, copy=False)
106 result = result.T
File D:\miniconda3\envs\python3.12\Lib\site-packages\pandas\core\array_algos\quantile.py:218, in _nanpercentile(values, qs, na_value, mask, interpolation)
216 return result
217 else:
--> 218 return np.percentile(
219 values,
220 qs,
221 axis=1,
222 # error: No overload variant of "percentile" matches argument types
223 # "ndarray[Any, Any]", "ndarray[Any, dtype[floating[_64Bit]]]",
224 # "int", "Dict[str, str]" [call-overload]
225 method=interpolation, # type: ignore[call-overload]
226 )
File D:\miniconda3\envs\python3.12\Lib\site-packages\numpy\lib\function_base.py:4277, in percentile(a, q, axis, out, overwrite_input, method, keepdims, interpolation)
4275 a = np.asanyarray(a)
4276 if a.dtype.kind == "c":
-> 4277 raise TypeError("a must be an array of real numbers")
4279 q = np.true_divide(q, 100)
4280 q = asanyarray(q) # undo any decay that the ufunc performed (see gh-13105)
TypeError: a must be an array of real numbers
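One possible way around the error, which is not part of the original article, is to describe the real and imaginary parts separately. The sketch below reuses the data from Example 2. Whether per-component statistics are meaningful depends on your data, so treat this as an assumption rather than a recommended practice.

```python
import numpy as np
import pandas as pd

s = pd.Series([1 + 1j, 2 + 2j, 3 + 3j])

# Split the complex values into two real-valued Series
values = s.to_numpy()
real_desc = pd.Series(values.real, index=s.index).describe()
imag_desc = pd.Series(values.imag, index=s.index).describe()

# Put the two summaries side by side for comparison
summary = pd.DataFrame({"real": real_desc, "imag": imag_desc})
print(summary)
```

The magnitudes, np.abs(values), could be summarized the same way if the modulus is the quantity of interest.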