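x='x' and y='y': specify the data for the x-axis and y-axis, respectively. hue='group': colors the scatter points by the values in the 'group' column, so points with different 'group' values get different colors.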
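The parameters above match the signature of seaborn's `scatterplot`, so a minimal sketch could look like the following. It assumes the note refers to seaborn, and the DataFrame contents are made up for illustration.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import pandas as pd
import seaborn as sns

# Hypothetical data: column names match the parameters described above.
df = pd.DataFrame({
    "x": [1, 2, 3, 4, 5, 6],
    "y": [2.0, 2.9, 4.1, 3.0, 4.2, 5.1],
    "group": ["a", "a", "a", "b", "b", "b"],
})

# x and y choose the column for each axis; hue colors points by the 'group' column.
ax = sns.scatterplot(data=df, x="x", y="y", hue="group")
```

The call returns a matplotlib `Axes`, so titles and labels can be adjusted afterwards in the usual way.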
Types of statistical charts
Variable types:
Categorical Ordinal: The variable represents categories or groups (e.g., adult or not adult); ordinal implies an ==ordered relationship== among the categories (e.g., low, medium, high).
Nominal: There is no inherent order or ranking among the categories; they are simply different groups.
Quantitative Continuous: A number within a range; any value inside the range is possible.
Quantitative Discrete: Numeric values that are distinct and separate.
Histogram:
Description: Histograms are used to visualize the distribution of a continuous variable by dividing the data into bins and displaying the frequency of observations in each bin.
Types:
Single-Peak (Unimodal): One clear peak in the distribution.
Bimodal: Two clear peaks. Both peaks must show up in the overall trend of the distribution for it to count as bimodal.
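Skewed (Left or Right): left (negatively skewed: the long tail points toward small values, most values are large, and the mean is typically less than the median) or right (positively skewed: the long tail points toward large values, most values are small, and the mean is typically greater than the median).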
Bell-Shaped: Symmetrical distribution resembling a bell curve, often observed in normal distributions.
Use: Assessing the spread, dispersion, and overall shape of a distribution.
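As a rough sketch of the binning and skew ideas above, the snippet below counts values per bin and compares mean and median. The helper `bin_counts` and both data sets are hypothetical and use only the standard library.

```python
from statistics import mean, median

def bin_counts(values, low, width, n_bins):
    """Count how many values fall in each bin [low + i*width, low + (i+1)*width)."""
    counts = [0] * n_bins
    for v in values:
        i = int((v - low) // width)
        if 0 <= i < n_bins:
            counts[i] += 1
    return counts

# Hypothetical samples: one with a long right tail, one with a long left tail.
right_skewed = [1, 2, 2, 3, 3, 3, 4, 4, 5, 20]
left_skewed = [1, 16, 17, 17, 18, 18, 18, 19, 19, 20]

# Bar heights of a histogram with bins of width 5 starting at 0.
counts = bin_counts(right_skewed, low=0, width=5, n_bins=5)

# The tail pulls the mean toward it; the median barely moves.
print(mean(right_skewed), median(right_skewed))
print(mean(left_skewed), median(left_skewed))
```

In the right-skewed sample the mean lands above the median, and in the left-skewed sample it lands below, which is the pattern the list describes.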
Bar charts:
Description: represent categorical data with rectangular bars. The lengths of the bars are proportional to the values they represent.
Use: Useful for comparing the values of different categories. Bar charts are versatile and can be used for both nominal and ordinal categorical data.
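One common value to plot per category is a simple count. The sketch below assumes hypothetical nominal data and tallies it with the standard library; each resulting count would be the length of one bar.

```python
from collections import Counter

# Hypothetical nominal data: each distinct label becomes one bar.
fruits = ["apple", "pear", "apple", "kiwi", "pear", "apple"]

# Map each category to the number of times it occurs (the bar length).
bar_heights = Counter(fruits)

for category, height in bar_heights.most_common():
    print(category, height)
```

Because the categories are nominal, the bars can be drawn in any order; sorting by count, as `most_common()` does, is just one readable choice.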
Difference between bar charts and histograms:
A histogram is a graphical representation in which data is grouped into ==continuous number ranges== and each range corresponds to a vertical bar. A bar chart, by contrast, has one bar per category.
Pie Chart:
Description: Pie charts represent data in a circular graph where each category is shown as a wedge, and the size of each wedge corresponds to the proportion of that category in the whole.
Use: Useful for displaying the composition of a whole, highlighting the relative sizes of different categories.
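To make the wedge sizing concrete, here is a small sketch that turns category totals into shares and wedge angles. The category names and numbers are invented for illustration.

```python
# Hypothetical category totals; names and numbers are for illustration only.
counts = {"A": 50, "B": 30, "C": 20}
total = sum(counts.values())

# Share of the whole for each category (the shares add up to 1).
shares = {name: value / total for name, value in counts.items()}

# Each wedge's angle is its share of the whole, scaled to 360 degrees.
angles = {name: value * 360 / total for name, value in counts.items()}

print(shares)
print(angles)
```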
Box Plot (Box-and-Whisker Plot):
Description: Box plots provide a visual summary of the distribution of a numerical variable through quartiles (25th, 50th, and 75th percentiles) and identify potential outliers.
Use: Useful for comparing the spread and central tendency of different groups or variables.
Key concept: interquartile range (IQR).
Five-Number Summary:
Explanation of the five-number summary: ==minimum, first quartile (Q1, 25% of values fall below it), median (50%), third quartile (Q3, 75%), and maximum.==
Example using the adult male heights histogram with values for each parameter.
Interquartile Range (IQR):
Introduction of the Interquartile Range (IQR) as a measure of spread.
Calculation of IQR using Q3 minus Q1 in the adult male heights example.
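The five-number summary and IQR can be sketched with Python's `statistics` module. The heights below are hypothetical and are not the values from the adult male heights example; the `"inclusive"` quartile method is also an assumption, and other conventions can give slightly different quartiles.

```python
from statistics import quantiles

# Hypothetical heights in cm (not the values from the original example).
heights = [160, 165, 168, 170, 172, 175, 178, 182, 190]

# quantiles(..., n=4) returns the three cut points Q1, median, Q3.
q1, med, q3 = quantiles(heights, n=4, method="inclusive")

# Five-number summary: minimum, Q1, median, Q3, maximum.
summary = (min(heights), q1, med, q3, max(heights))

# IQR is the width of the middle 50% of the data.
iqr = q3 - q1

print(summary)
print(iqr)
```

With nine sorted values and the inclusive method, the quartiles fall exactly on the 3rd, 5th, and 7th values, which keeps the arithmetic easy to check by hand.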
Comparison of Measures:
Emphasis on the robustness of the median as an estimate of the center, less influenced by outliers.
Standard deviation as an average distance from the mean.
Preference for ==IQR== over the ==range== due to robustness against outliers.
· Note on outliers: Even though outliers are plotted individually in box plots, they are still part of the data set and still affect the mean.
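The robustness comparison can be checked numerically. The sketch below uses two hypothetical samples that differ only in their largest value. It also flags outliers with the 1.5 × IQR fences that many box plot implementations use; that rule is an assumption here, since the notes do not state it.

```python
from statistics import mean, median, pstdev, quantiles

def iqr(values):
    """Interquartile range using the 'inclusive' quartile convention."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    return q3 - q1

# Hypothetical samples: identical except for one extreme value.
base = [10, 12, 13, 14, 15, 16, 18]
with_outlier = [10, 12, 13, 14, 15, 16, 100]

for name, data in (("base", base), ("with outlier", with_outlier)):
    print(name,
          "mean:", round(mean(data), 2),
          "median:", median(data),
          "range:", max(data) - min(data),
          "IQR:", iqr(data),
          "std:", round(pstdev(data), 2))

# Assumed convention: points beyond 1.5 * IQR from the quartiles are outliers.
q1, _, q3 = quantiles(with_outlier, n=4, method="inclusive")
low_fence = q1 - 1.5 * (q3 - q1)
high_fence = q3 + 1.5 * (q3 - q1)
outliers = [v for v in with_outlier if v < low_fence or v > high_fence]
```

The mean, range, and standard deviation all jump when the extreme value appears, while the median and IQR stay the same.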
Scatter Plot:
Description: Scatter plots are used to display the relationship between two continuous variables. Each point on the plot represents an observation with values on both X and Y axes.
Use: They help determine whether there is a positive, negative, or no correlation between variables, and whether a relationship is linear.
Judgement:
r=1: Perfect positive correlation. As one variable increases, the other variable increases proportionally.
r=-1: Perfect negative correlation. As one variable increases, the other variable decreases proportionally.
r=0: No linear correlation. The variables are not linearly related.
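The three cases can be reproduced with a small hand-written Pearson correlation. This is a sketch on hypothetical data; the helper name `pearson_r` is illustrative, and results are compared approximately because of floating-point rounding.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations from the means (unnormalized covariance).
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)

xs = [1, 2, 3, 4, 5]
rising = [2 * x + 1 for x in xs]       # exact line with positive slope
falling = [10 - 3 * x for x in xs]     # exact line with negative slope
curved = [(x - 3) ** 2 for x in xs]    # symmetric parabola, not a line

print(pearson_r(xs, rising))   # close to 1
print(pearson_r(xs, falling))  # close to -1
print(pearson_r(xs, curved))   # 0: no linear correlation
```

The parabola case shows why r=0 only rules out a linear relationship: `curved` depends entirely on `xs`, yet its correlation is zero.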