Python Data Analysis Technology Stack, Chapter 03, Part 02: Structure of data
The data that we need to analyze could have any of the following structures:
Structured data: Arranged in rows and columns. Examples: spreadsheets, CSV/Excel files, relational databases.
Unstructured data: Lacks a predefined structure or form. Examples: photos, videos, web pages, documents.
Semi-structured data: Not structured like data in relational databases, but has some organizing properties, such as tags, that make analysis easier. Examples: JSON, XML.
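The difference between structured and semi-structured data can be sketched with Python's standard library. The sample records below are hypothetical and exist only for illustration: the CSV rows all share the same columns, while the JSON records carry their own keys (tags) and need not agree with each other.

```python
import csv
import io
import json

# Structured: every row has the same columns (hypothetical sample data).
csv_text = "name,age\nAsha,31\nBen,27\n"
rows = list(csv.DictReader(io.StringIO(csv_text)))

# Semi-structured: keys act as tags, and records may differ in shape.
json_text = '[{"name": "Asha", "age": 31}, {"name": "Ben", "tags": ["new"]}]'
records = json.loads(json_text)

# Collect the field names seen in each row or record.
csv_columns = [sorted(row.keys()) for row in rows]
json_keys = [sorted(rec.keys()) for rec in records]

print(csv_columns)
print(json_keys)
```

Unstructured data such as photos or free-form documents has no comparable field names to inspect, which is why it usually needs extra processing before analysis.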
There are broadly two levels of data: continuous and categorical. Continuous data can be further classified as ratio or interval, while categorical data can be either nominal or ordinal. The levels of data are shown in Figure 4-3.
[Figure 4-3: Levels of data. Panels: Categorical/Discrete or Qualitative Data; Continuous or Quantitative Data]
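As a minimal sketch, the two-tier classification can be written as a lookup table. The variable names are hypothetical; `blood_group` in particular is an assumed example of a nominal variable and does not come from the text.

```python
# Each level of measurement belongs to one of the two broad types.
LEVELS = {
    "nominal": "categorical",
    "ordinal": "categorical",
    "interval": "continuous",
    "ratio": "continuous",
}

# Hypothetical example variables, one per level of measurement.
examples = {
    "blood_group": "nominal",           # assumed example, labels only
    "student_rank": "ordinal",          # ordered labels
    "temperature_celsius": "interval",  # no absolute zero
    "height_cm": "ratio",               # absolute zero exists
}


def broad_type(variable):
    """Return 'categorical' or 'continuous' for a known example variable."""
    return LEVELS[examples[variable]]


for name in examples:
    print(name, "->", examples[name], "->", broad_type(name))
```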
The following are some essential points to note:
Numeric values for categorical variables: Categorical data is not restricted to non-numeric values. For example, the rank of a student, which could take values like 1, 2, 3, and so on, is an ordinal (categorical) variable whose values are numbers. However, these numbers do not have arithmetic significance; for instance, it would not make sense to compute the average rank.
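The point about ranks can be illustrated with a short sketch using the standard `statistics` module. The list of ranks is hypothetical. The arithmetic mean produces a value that is not a rank at all, whereas an order-based summary such as the low median always returns one of the observed ranks.

```python
import statistics

# Hypothetical ranks of six students (ordinal values stored as numbers).
ranks = [1, 2, 3, 4, 5, 6]

# The mean treats ranks as quantities and yields 3.5, which no student holds.
mean_rank = statistics.mean(ranks)

# The low median relies only on ordering and returns an actual rank.
median_rank = statistics.median_low(ranks)

print(mean_rank, median_rank)
```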
Significance of a true zero point: Interval variables do not have an absolute zero as a reference point, while ratio variables do. An absolute zero denotes the complete absence of the quantity being measured. For example, height and weight are ratio variables: a value of 0 means there is no height or weight at all, so for a real person it would indicate an invalid or nonexistent data point. For an interval variable like temperature (when measured in degrees Celsius or Fahrenheit), a value of 0 does not mean that the quantity is absent; 0 is just one of the values that the temperature variable can assume. On the other hand, temperature measured on the Kelvin scale is a ratio variable, since an absolute zero is defined for this scale.
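A small sketch shows why the zero point matters. The two temperatures are hypothetical. Dividing Celsius readings suggests that one day is "twice as hot" as the other, but the same two temperatures on the Kelvin scale, which has an absolute zero, differ by only a few percent.

```python
def celsius_to_kelvin(celsius):
    """Convert degrees Celsius to kelvins."""
    return celsius + 273.15


# Hypothetical readings on a warm day and a cool day.
warm_c, cool_c = 20.0, 10.0

# Ratio on the interval scale: 2.0, but this depends on where 0 was placed.
ratio_celsius = warm_c / cool_c

# Ratio on the ratio scale: roughly 1.035, a physically meaningful comparison.
ratio_kelvin = celsius_to_kelvin(warm_c) / celsius_to_kelvin(cool_c)

print(ratio_celsius, round(ratio_kelvin, 4))
```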
Identifying interval variables: Interval variables do not have an absolute zero as a reference point, but it may not be obvious which variables have this characteristic. Whenever we talk about the percentage change in a figure, it is relative to its previous value. For instance, the percentage change in inflation or unemployment is calculated with the previous value in time as the reference point. These are instances of interval data. Another example of an interval variable is the score obtained in a standardized test like the GRE (Graduate Record Examinations). The minimum score is 260, and the maximum score is 340. The scoring is relative and does not start from 0. With interval data, you can perform addition and subtraction, but you cannot meaningfully multiply or divide values (operations that are permissible for ratio data).
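The following sketch illustrates why subtraction is safe for interval data while division is not. The two scores are hypothetical; only the minimum of 260 comes from the text. Moving the origin of the scale leaves the difference between two scores unchanged but alters their ratio, so the ratio carries no stable meaning.

```python
GRE_MIN = 260  # minimum total score, as stated in the text

# Hypothetical scores for two test takers.
score_a, score_b = 320, 300

# Differences are unaffected by where the scale starts.
diff_raw = score_a - score_b
diff_shifted = (score_a - GRE_MIN) - (score_b - GRE_MIN)

# Ratios change when the origin moves, so they are not meaningful here.
ratio_raw = score_a / score_b
ratio_shifted = (score_a - GRE_MIN) / (score_b - GRE_MIN)

print(diff_raw, diff_shifted)
print(round(ratio_raw, 3), ratio_shifted)
```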