Pandas.Series.describe(): Descriptive Statistics Explained, with Code and Test Data, Updated with Each Pandas Release

Published: January 22, 2024

About the pandas version: this article is based on pandas 2.1.2.

About updates: this article is revised and extended as new stable versions of pandas are released.


Pandas.Series.describe()

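Pandas.Series.describe generates descriptive statistics for a Series. It returns a table with one row per statistic: count, mean, standard deviation, minimum, quartiles, maximum, and so on.

  • Missing values (NaN) in the data being described are excluded from the calculation.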
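
To illustrate how missing values are skipped, here is a minimal sketch. The sample values and the variable name `s_nan` are made up for this illustration and are not part of the original test data.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data: five entries, two of them missing
s_nan = pd.Series([1.0, 2.0, np.nan, 4.0, np.nan])

desc = s_nan.describe()

# count only reflects the three non-missing values
print(desc["count"])
# mean, min and max are computed over 1.0, 2.0 and 4.0 only
print(desc["mean"], desc["min"], desc["max"])
```

The NaN entries reduce count rather than propagating into the other statistics.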

Syntax:

Series.describe(percentiles=None, include=None, exclude=None)

Returns:

  • Series or DataFrame

    The return type follows the caller: calling describe on a Series returns a Series, and calling it on a DataFrame returns a DataFrame.
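
A short sketch of the two return types. The data and the names `s_demo` and `df_demo` are assumptions chosen only for this illustration.

```python
import pandas as pd

# Hypothetical demonstration objects
s_demo = pd.Series([1, 2, 3])
df_demo = pd.DataFrame({"a": [1, 2, 3], "b": [4.0, 5.0, 6.0]})

s_desc = s_demo.describe()
df_desc = df_demo.describe()

# A Series yields a Series of statistics
print(type(s_desc).__name__)
# A DataFrame yields a DataFrame with one column of statistics per described column
print(type(df_desc).__name__, list(df_desc.columns))
```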

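Parameters:

include: data type whitelist

  • include : 'all', list-like of dtypes or None (default), optional

    The include parameter specifies which data types take part in the description. A column whose data type appears in the whitelist is included in the result.

    • This parameter has no effect on a Series.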

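exclude: data type blacklist

  • exclude : list-like of dtypes or None (default), optional

    The exclude parameter specifies a blacklist of data types to leave out. A column whose data type appears in the blacklist is not included in the result.

    • This parameter has no effect on a Series.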
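
The following sketch checks that include and exclude make no difference for a Series. The sample data is made up, and the dtype lists passed here were picked only because they would filter out an integer column if they were honored.

```python
import numpy as np
import pandas as pd

# Hypothetical integer Series
s_int = pd.Series([1, 2, 3, 4])

plain = s_int.describe()
# For a DataFrame these arguments would drop an integer column;
# for a Series they are ignored
with_include = s_int.describe(include=[object])
with_exclude = s_int.describe(exclude=[np.number])

print(plain.equals(with_include))
print(plain.equals(with_exclude))
```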

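percentiles: custom percentiles

  • percentiles : list-like of numbers, optional

    The percentiles parameter sets which percentiles appear in the output.

    • list-like: a list-like of custom percentiles. Every element should be between 0 and 1. By default only [0.25, 0.5, 0.75] are returned (the first through third quartiles).

    Note:

    You can specify several percentiles at once. See Example 1.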
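
As a quick sketch of the default versus a custom request, the snippet below uses made-up data and compares the labeled rows with Series.quantile, which uses linear interpolation by default.

```python
import pandas as pd

# Hypothetical sample data
s_pct = pd.Series([10, 20, 30, 40, 50])

default_desc = s_pct.describe()
custom_desc = s_pct.describe(percentiles=[0.1, 0.9])

# The default output carries the three quartile rows
print(list(default_desc.index))

# Each requested percentile shows up as a row labeled with a percent sign
print(custom_desc["10%"], s_pct.quantile(0.1))
print(custom_desc["90%"], s_pct.quantile(0.9))
```

The row labels are the percentiles formatted as percentages, so 0.1 becomes "10%".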

Note:

  • Although numpy.number includes complex numbers (np.complexfloating), describe only supports real-valued data. If the data contains complex values that have not been excluded, the call raises TypeError: a must be an array of real numbers. See Example 2.

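  • For numeric data, the index of the result includes count, mean, std, min, max, and the lower, 50 and upper percentiles. By default the lower percentile is 25 and the upper percentile is 75. The 50 percentile is the same as the median.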

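  • For object data, such as strings or timestamps, the index of the result includes count, unique, top and freq. top is the most common value and freq is the frequency of that value. The upstream documentation adds that timestamps also include the first and last items; note, however, that since pandas 2.0 datetime data is always summarized like numeric data (count, mean, min, percentiles, max), so first and last do not appear there.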

  • If several object values share the highest count, the count and top results are chosen arbitrarily from among those values.

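  • For mixed data types supplied through a DataFrame, only the numeric columns are analyzed by default. If the DataFrame contains only object ('object') and categorical ('category') data and no numeric columns, the default is to analyze both the object and the categorical columns. If include='all' is passed, the result contains the union of the attributes of each type.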

  • The include and exclude parameters can be used to limit which columns of a DataFrame are analyzed. They are ignored when analyzing a Series.
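
The notes above on result indexes and mixed data types can be sketched as follows. All data and variable names here are hypothetical and chosen only to make the row and column labels easy to inspect.

```python
import pandas as pd

# Hypothetical numeric and object Series
num = pd.Series([3, 1, 2, 2])
obj = pd.Series(["a", "b", "a", "c"], dtype=object)

num_desc = num.describe()
obj_desc = obj.describe()

# Numeric data: count, mean, std, min, quartiles, max
print(list(num_desc.index))
# Object data: count, unique, top, freq
print(list(obj_desc.index))
# "a" occurs twice, more often than any other value
print(obj_desc["top"], obj_desc["freq"])

# Hypothetical mixed DataFrame: one numeric column, one string column
mixed = pd.DataFrame({"n": [1, 2, 3], "s": ["x", "y", "x"]})

# By default only the numeric column is described
print(list(mixed.describe().columns))
# include="all" describes every column; statistics that do not apply are NaN
print(list(mixed.describe(include="all").columns))
```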


Examples:


Example 1: Custom percentiles

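import pandas as pd
import numpy as np

# Build a Series holding the integers 1 through 9
s = pd.Series(np.arange(1, 10, 1))

# Request custom percentiles; include is ignored for a Series, so it has no effect here
s.describe(include=[np.number], percentiles=[0.1, 0.4, 0.7, 0.8, 0.85])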
count    9.000000
mean     5.000000
std      2.738613
min      1.000000
10%      1.800000
40%      4.200000
50%      5.000000
70%      6.600000
80%      7.400000
85%      7.800000
max      9.000000
dtype: float64
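
The percentile rows above can be reproduced by hand. The sketch below assumes linear interpolation, which is the default of Series.quantile; the helper `linear_percentile` is a hypothetical function written for this illustration and is not part of pandas. For the 10% row, the position is 0.1 * (9 - 1) = 0.8, which lies between the values 1 and 2, giving 1 + 0.8 * (2 - 1) = 1.8.

```python
import numpy as np
import pandas as pd

s = pd.Series(np.arange(1, 10, 1))


def linear_percentile(sorted_values, q):
    """Hypothetical helper: percentile by linear interpolation between neighbors."""
    pos = q * (len(sorted_values) - 1)      # fractional position in the sorted data
    lo = int(np.floor(pos))                 # index of the value just below
    hi = min(lo + 1, len(sorted_values) - 1)
    frac = pos - lo                         # how far to move toward the upper value
    return sorted_values[lo] + frac * (sorted_values[hi] - sorted_values[lo])


values = np.sort(s.to_numpy())
for q in [0.1, 0.4, 0.5, 0.7, 0.8, 0.85]:
    # The manual result should agree with Series.quantile
    print(q, linear_percentile(values, q), s.quantile(q))
```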

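Example 2: Descriptive statistics of complex numbers

Example 2-1: Build a Series containing complex numbers
import numpy as np
import pandas as pd

# Build the demonstration data
s = pd.Series([1 + 1j, 2 + 2j, 3 + 3j])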

s
0    1.0+1.0j
1    2.0+2.0j
2    3.0+3.0j
dtype: complex128

Example 2-2: Describing complex numbers raises an error
s.describe()
(The traceback below shows df.describe() and DataFrameDescriber, so it appears to have been captured from the equivalent DataFrame call. The failure it ends in comes from the quantile computation that both the Series and DataFrame paths rely on.)
D:\miniconda3\envs\python3.12\Lib\site-packages\numpy\core\_methods.py:49: ComplexWarning: Casting complex values to real discards the imaginary part
  return umr_sum(a, axis, dtype, out, keepdims, initial, where)
D:\miniconda3\envs\python3.12\Lib\site-packages\pandas\core\nanops.py:944: RuntimeWarning: invalid value encountered in sqrt
  result = np.sqrt(nanvar(values, axis=axis, skipna=skipna, ddof=ddof, mask=mask))



---------------------------------------------------------------------------

TypeError                                 Traceback (most recent call last)

Cell In[59], line 1
----> 1 df.describe()


File D:\miniconda3\envs\python3.12\Lib\site-packages\pandas\core\generic.py:11544, in NDFrame.describe(self, percentiles, include, exclude)
  11302 @final
  11303 def describe(
  11304     self,
   (...)
  11307     exclude=None,
  11308 ) -> Self:
  11309     """
  11310     Generate descriptive statistics.
  11311 
   (...)
  11542     max            NaN      3.0
  11543     """
> 11544     return describe_ndframe(
  11545         obj=self,
  11546         include=include,
  11547         exclude=exclude,
  11548         percentiles=percentiles,
  11549     ).__finalize__(self, method="describe")


File D:\miniconda3\envs\python3.12\Lib\site-packages\pandas\core\methods\describe.py:97, in describe_ndframe(obj, include, exclude, percentiles)
     90 else:
     91     describer = DataFrameDescriber(
     92         obj=cast("DataFrame", obj),
     93         include=include,
     94         exclude=exclude,
     95     )
---> 97 result = describer.describe(percentiles=percentiles)
     98 return cast(NDFrameT, result)


File D:\miniconda3\envs\python3.12\Lib\site-packages\pandas\core\methods\describe.py:170, in DataFrameDescriber.describe(self, percentiles)
    168 for _, series in data.items():
    169     describe_func = select_describe_func(series)
--> 170     ldesc.append(describe_func(series, percentiles))
    172 col_names = reorder_columns(ldesc)
    173 d = concat(
    174     [x.reindex(col_names, copy=False) for x in ldesc],
    175     axis=1,
    176     sort=False,
    177 )


File D:\miniconda3\envs\python3.12\Lib\site-packages\pandas\core\methods\describe.py:232, in describe_numeric_1d(series, percentiles)
    227 formatted_percentiles = format_percentiles(percentiles)
    229 stat_index = ["count", "mean", "std", "min"] + formatted_percentiles + ["max"]
    230 d = (
    231     [series.count(), series.mean(), series.std(), series.min()]
--> 232     + series.quantile(percentiles).tolist()
    233     + [series.max()]
    234 )
    235 # GH#48340 - always return float on non-complex numeric data
    236 dtype: DtypeObj | None


File D:\miniconda3\envs\python3.12\Lib\site-packages\pandas\core\series.py:2769, in Series.quantile(self, q, interpolation)
   2765 # We dispatch to DataFrame so that core.internals only has to worry
   2766 #  about 2D cases.
   2767 df = self.to_frame()
-> 2769 result = df.quantile(q=q, interpolation=interpolation, numeric_only=False)
   2770 if result.ndim == 2:
   2771     result = result.iloc[:, 0]


File D:\miniconda3\envs\python3.12\Lib\site-packages\pandas\core\frame.py:11831, in DataFrame.quantile(self, q, axis, numeric_only, interpolation, method)
  11827     raise ValueError(
  11828         f"Invalid method: {method}. Method must be in {valid_method}."
  11829     )
  11830 if method == "single":
> 11831     res = data._mgr.quantile(qs=q, interpolation=interpolation)
  11832 elif method == "table":
  11833     valid_interpolation = {"nearest", "lower", "higher"}


File D:\miniconda3\envs\python3.12\Lib\site-packages\pandas\core\internals\managers.py:1508, in BlockManager.quantile(self, qs, interpolation)
   1504 new_axes = list(self.axes)
   1505 new_axes[1] = Index(qs, dtype=np.float64)
   1507 blocks = [
-> 1508     blk.quantile(qs=qs, interpolation=interpolation) for blk in self.blocks
   1509 ]
   1511 return type(self)(blocks, new_axes)


File D:\miniconda3\envs\python3.12\Lib\site-packages\pandas\core\internals\blocks.py:1587, in Block.quantile(self, qs, interpolation)
   1584 assert self.ndim == 2
   1585 assert is_list_like(qs)  # caller is responsible for this
-> 1587 result = quantile_compat(self.values, np.asarray(qs._values), interpolation)
   1588 # ensure_block_shape needed for cases where we start with EA and result
   1589 #  is ndarray, e.g. IntegerArray, SparseArray
   1590 result = ensure_block_shape(result, ndim=2)


File D:\miniconda3\envs\python3.12\Lib\site-packages\pandas\core\array_algos\quantile.py:39, in quantile_compat(values, qs, interpolation)
     37     fill_value = na_value_for_dtype(values.dtype, compat=False)
     38     mask = isna(values)
---> 39     return quantile_with_mask(values, mask, fill_value, qs, interpolation)
     40 else:
     41     return values._quantile(qs, interpolation)


File D:\miniconda3\envs\python3.12\Lib\site-packages\pandas\core\array_algos\quantile.py:97, in quantile_with_mask(values, mask, fill_value, qs, interpolation)
     95     result = np.repeat(flat, len(values)).reshape(len(values), len(qs))
     96 else:
---> 97     result = _nanpercentile(
     98         values,
     99         qs * 100.0,
    100         na_value=fill_value,
    101         mask=mask,
    102         interpolation=interpolation,
    103     )
    105     result = np.array(result, copy=False)
    106     result = result.T


File D:\miniconda3\envs\python3.12\Lib\site-packages\pandas\core\array_algos\quantile.py:218, in _nanpercentile(values, qs, na_value, mask, interpolation)
    216     return result
    217 else:
--> 218     return np.percentile(
    219         values,
    220         qs,
    221         axis=1,
    222         # error: No overload variant of "percentile" matches argument types
    223         # "ndarray[Any, Any]", "ndarray[Any, dtype[floating[_64Bit]]]",
    224         # "int", "Dict[str, str]"  [call-overload]
    225         method=interpolation,  # type: ignore[call-overload]
    226     )


File D:\miniconda3\envs\python3.12\Lib\site-packages\numpy\lib\function_base.py:4277, in percentile(a, q, axis, out, overwrite_input, method, keepdims, interpolation)
   4275 a = np.asanyarray(a)
   4276 if a.dtype.kind == "c":
-> 4277     raise TypeError("a must be an array of real numbers")
   4279 q = np.true_divide(q, 100)
   4280 q = asanyarray(q)  # undo any decay that the ufunc performed (see gh-13105)


TypeError: a must be an array of real numbers
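
One possible way around this error, offered only as a sketch and not as an approach taken from the original article, is to describe real-valued views of the data separately: the real parts, the imaginary parts, and the magnitudes. The variable names are assumptions made for this illustration.

```python
import pandas as pd

# Same demonstration data as in Example 2-1
s = pd.Series([1 + 1j, 2 + 2j, 3 + 3j])

values = s.to_numpy()

# Describe real and imaginary parts as ordinary float Series
real_desc = pd.Series(values.real, index=s.index).describe()
imag_desc = pd.Series(values.imag, index=s.index).describe()

# Series.abs() returns the magnitude of each complex value
abs_desc = s.abs().describe()

print(real_desc)
print(imag_desc)
print(abs_desc)
```

Which view is meaningful depends on what the complex values represent, so this should be treated as a starting point rather than a general replacement for describe.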
Source: https://blog.csdn.net/mingqinsky/article/details/135751793