《Python数据分析技术栈》 Chapter 03, Section 01: Descriptive data analysis - Steps

Published: January 19, 2024


Figure 4-1 illustrates the methodology followed in descriptive data analysis, step by step.


1. Retrieving and importing data

2. Cursory data review and problem identification

3. Data wrangling: tidying, cleansing, transformation, and enrichment

4. Data exploration and visualization

5. Publishing and presenting findings


Let us understand each of these steps in detail.


Data retrieval

Data could be stored in a structured format (like databases or spreadsheets) or an unstructured format (like web pages, emails, Word documents). After considering parameters such as the cost and structure of the data, we need to figure out how to retrieve this data. Libraries like Pandas provide functions for importing data in a variety of formats.

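As a minimal sketch of this step, the snippet below imports a small CSV dataset with Pandas. The inline CSV content is hypothetical, standing in for a file, database export, or URL; functions such as `pd.read_excel`, `pd.read_sql`, and `pd.read_json` cover other common formats.

```python
import io

import pandas as pd

# Hypothetical inline CSV standing in for a file, database, or web source
csv_source = io.StringIO("name,age\nAlice,30\nBob,25")

df = pd.read_csv(csv_source)
print(df.shape)  # (2, 2)
```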

Cursory data review and problem identification

In this step, we form first impressions of the data that we want to analyze. We aim to understand each of the individual columns or features, the meanings of the abbreviations and notations used in the dataset, what the records represent, and the units used for data storage. We also need to ask the right questions and figure out what we need to do before getting into the nitty-gritty of our analysis. These questions may include: Which features are relevant for the analysis? Do individual columns show an increasing or decreasing trend? Are there missing values? Are we trying to develop a forecast and predict one feature?

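A first-impressions pass of this kind is typically a handful of Pandas calls. The dataset below is hypothetical, chosen only to show the idiom:

```python
import pandas as pd

# Hypothetical dataset for a cursory review
df = pd.DataFrame({
    "temp_c": [21.0, 22.5, None, 19.8],
    "city": ["Oslo", "Oslo", "Bergen", "Bergen"],
})

df.info()                  # column names, dtypes, non-null counts
print(df.describe())       # summary statistics for numeric columns
print(df.isna().sum())     # missing values per column
print(df["city"].unique()) # distinct categories in a feature
```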

Data wrangling

This step is the crux of data analysis and the most time-consuming activity, with data analysts and scientists spending approximately 80% of their time on this.


Data in its raw form is often unsuitable for analysis due to any of the following reasons: presence of missing and redundant values, outliers, incorrect data types, presence of extraneous data, more than one unit of measurement being used, data being scattered across different sources, and columns not being correctly identified.


Data wrangling or munging is the process of transforming raw data so that it is suitable for mathematical processing and plotting graphs. It involves removing or substituting missing values and incomplete entries, getting rid of filler values like semicolons and commas, filtering the data, changing data types, eliminating redundancy, and merging data with other sources.

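Several of these operations can be sketched in a few lines of Pandas. The raw data below is hypothetical, constructed to exhibit a filler character, a missing value, a duplicate row, and wrong data types:

```python
import pandas as pd

# Hypothetical raw data: filler character, missing value, duplicate, wrong dtypes
raw = pd.DataFrame({
    "price": ["10;", "12", "12", None],
    "qty":   ["1", "2", "2", "3"],
})

clean = raw.copy()
clean["price"] = clean["price"].str.rstrip(";").astype(float)   # strip fillers, fix dtype
clean["price"] = clean["price"].fillna(clean["price"].median()) # substitute missing value
clean["qty"] = clean["qty"].astype(int)
clean = clean.drop_duplicates()                                 # eliminate redundancy
print(clean)
```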

Data wrangling comprises tidying, cleansing, and enriching data. In data tidying, we identify the variables in our dataset and map them to columns. We also structure data along the right axis and ensure that the rows contain observations and not features. The purpose of converting data into a tidy form is to have data in a structure that facilitates ease of analysis. Data cleansing involves dealing with missing values, incorrect data types, outliers, and wrongly entered data. In data enrichment, we may add data from other sources and create new columns or features that may be helpful for our analysis.

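The tidying step described above, mapping variables to columns and observations to rows, is what `pandas.melt` does when a dataset stores a variable (such as year) in its column headers. The data and column names below are hypothetical:

```python
import pandas as pd

# Hypothetical "wide" data: years appear as column headers instead of values
wide = pd.DataFrame({
    "country": ["NO", "SE"],
    "2022": [5.4, 10.4],
    "2023": [5.5, 10.5],
})

# Tidying: one row per observation (country, year)
tidy = wide.melt(id_vars="country", var_name="year", value_name="population_m")

# Enrichment: derive a new, analysis-friendly feature from an existing column
tidy["year"] = tidy["year"].astype(int)
print(tidy)
```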

Data exploration and visualization

After the data has been prepared, the next step involves finding patterns in data, summarizing key characteristics, and understanding relationships among various features. With visualization, you can achieve all of this, and also lucidly present critical findings. Python libraries for visualization include Matplotlib, Seaborn, and Pandas.

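A small sketch of the exploration side, using hypothetical data: pairwise correlations summarize the relationship between numeric features, and the same relationship could then be plotted.

```python
import pandas as pd

# Hypothetical data for exploring relationships between features
df = pd.DataFrame({
    "height_cm": [150, 160, 170, 180],
    "weight_kg": [50, 61, 69, 80],
})

print(df.corr())  # pairwise correlations between numeric features

# The same relationship could be visualized with, e.g.:
# df.plot.scatter(x="height_cm", y="weight_kg")  # uses Matplotlib under the hood
```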

Presenting and publishing our analysis

Jupyter notebooks serve the dual purpose of both executing our code and serving as a platform to provide a high-level summary of our analysis. By adding notes, headings, annotations, and images, you can spruce up your notebook to make it presentable to a broader audience. The notebook can be downloaded in a variety of formats, like PDF, which can later be shared with others for review.


Source: https://blog.csdn.net/qq_37703224/article/details/135688111