计算机毕设:网民社交网络数据的分析与挖掘

发布时间:2024年01月18日

1.读数据表

首先,我们读取原始数据,并查看各字段基本情况。

gradyeargenderagefriendsbasketballfootballsoccersoftballvolleyballswimmingcheerleadingbaseballtennissportscutesexsexyhotkisseddancebandmarchingmusicrockgodchurchjesusbiblehairdressblondemallshoppingclotheshollisterabercrombiediedeathdrunkdrugs
2006M18.987000000000000000100000000000000000000
2006F18.8010010000000010000000221000640100000000
2006M18.33569010000000000000020100000000000000100
2006F18.8750000000000010000000010000000000000000
200618.99510000000000001005110301000100020000011

2.年龄缺失值填补

缺失值(missing value)是指现有数据集中某个或某些属性的值是不完全的。 由于大部分机器学习模型无法处理缺失值,在数据建模前需要填补或者剔除缺失值。对于连续变量age,我们使用该列的均值进行填充,结果如下表所示。

gradyeargenderagefriendsbasketballfootballsoccersoftballvolleyballswimmingcheerleadingbaseballtennissportscutesexsexyhotkisseddancebandmarchingmusicrockgodchurchjesusbiblehairdressblondemallshoppingclotheshollisterabercrombiediedeathdrunkdrugs
2006M18.987000000000000000100000000000000000000
2006F18.8010010000000010000000221000640100000000
2006M18.33569010000000000000020100000000000000100
2006F18.8750000000000010000000010000000000000000
200618.99510000000000001005110301000100020000011

3.性别缺失值填补

对于离散变量gender,我们使用“未知”进行填充,结果如下表所示。

gradyeargenderagefriendsbasketballfootballsoccersoftballvolleyballswimmingcheerleadingbaseballtennissportscutesexsexyhotkisseddancebandmarchingmusicrockgodchurchjesusbiblehairdressblondemallshoppingclotheshollisterabercrombiediedeathdrunkdrugs
2006M18.987000000000000000100000000000000000000
2006F18.8010010000000010000000221000640100000000
2006M18.33569010000000000000020100000000000000100
2006F18.8750000000000010000000010000000000000000
2006未知18.99510000000000001005110301000100020000011

?

5.异常值处理前直方图

异常值(outlier),也称为极端值,是数据集中某些数值明显偏离其余数据点的样本点。因为线性回归模型等机器学习模型对异常值较为敏感,对异常值进行处理有利于提高建模的鲁棒性。

接下来,我们用直方图查看friends列数据分布情况。

6.异常值处理

通过数据筛选组件,我们可以剔除掉大于�3+1.5×���Q3?+1.5×IQR的数据点,结果如下表所示。

gradyeargenderagefriendsbasketballfootballsoccersoftballvolleyballswimmingcheerleadingbaseballtennissportscutesexsexyhotkisseddancebandmarchingmusicrockgodchurchjesusbiblehairdressblondemallshoppingclotheshollisterabercrombiediedeathdrunkdrugs
2006M18.987000000000000000100000000000000000000
2006F18.8010010000000010000000221000640100000000
2006M18.33569010000000000000020100000000000000100
2006F18.8750000000000010000000010000000000000000
2006未知18.99510000000000001005110301000100020000011

7.Z-Score标准化

数据标准化指的是将数据按比例缩放的预处理操作。 当我们希望消除量纲的影响、帮助模型收敛、适应模型假设时,就可能需要进行数据标准化。

在本案例中,我们将介绍比较常用的Z-Score标准化和MinMax标准化。下面我们对数据集中friends列做Z-Score标准化,使得处理后的数据均值为0,标准差为1。

gradyeargenderagefriendsbasketballfootballsoccersoftballvolleyballswimmingcheerleadingbaseballtennissportscutesexsexyhotkisseddancebandmarchingmusicrockgodchurchjesusbiblehairdressblondemallshoppingclotheshollisterabercrombiediedeathdrunkdrugs
2006M18.98-0.720678000000000000000100000000000000000000
2006F18.801-0.99873010000000010000000221000640100000000
2006M18.3351.742069010000000000000020100000000000000100
2006F18.875-0.99873000000000010000000010000000000000000
2006未知18.995-0.601512000000000001005110301000100020000011
特征均值标准差
friends25.14316525.175147

8.异常值处理后直方图

?

文章来源:https://blog.csdn.net/mqdlff_python/article/details/135676320
本文来自互联网用户投稿,该文观点仅代表作者本人,不代表本站立场。本站仅提供信息存储空间服务,不拥有所有权,不承担相关法律责任。