To put it plainly, a decision tree is just a binary tree.
Let's take buying gold as an example! Gold comes in 999 and 9999 grades. The difference is the gold's purity (how much impurity it contains). In decision trees we borrow this notion of "purity" as well. If everything in a result set belongs to one class, we say it is "very pure". If the result set has 6 examples but only 3 belong to one class, we say it is "not pure", and we call the examples outside those three the "impurities".
If a result set (after one or more binary splits) is all cats / all non-cats, we say that result set is very pure.
If a result set contains both cats and non-cats, it is not pure. But "not pure" comes in degrees, which brings us to the formula we compute:
p₁: the fraction of the examples that are cats.
When a group of 6 examples has 0 cats, the entropy is 0 and purity is at its best.
When a group of 6 examples has 3 cats (p₁ = 0.5), the entropy is 1 and purity is at its worst; with 2 cats out of 6, the entropy is about 0.92.
…
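These numbers come from the binary (two-class) entropy formula, stated in terms of the $p_1$ above and using the convention $0\log_2(0) = 0$:

$$
H(p_1) = -p_1 \log_2(p_1) - (1 - p_1)\log_2(1 - p_1)
$$

Plugging in the examples: $p_1 = 0/6$ gives $H = 0$ (very pure), $p_1 = 3/6$ gives $H = 1$ (least pure), and $p_1 = 2/6$ gives $H \approx 0.92$.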
When buying gold, there are professional instruments to judge the gold's purity. In a decision tree's result set, how do we judge purity? What is the standard? This brings us to the definition of **"information entropy"**.
In Machine Learning, entropy measures the level of disorder or uncertainty in a given dataset or system. It is a metric that quantifies the amount of information in a dataset, and it is commonly used to evaluate the quality of a model and its ability to make accurate predictions.
A higher entropy value indicates a more heterogeneous dataset with diverse classes, while a lower entropy signifies a more pure and homogeneous subset of data. Decision tree models can use entropy to determine the best splits to make informed decisions and build accurate predictive models.
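As a quick sanity check of the numbers above, here is a minimal Python sketch of the entropy formula (the function name `entropy` is my own, not from the original notes):

```python
import math

def entropy(p1: float) -> float:
    """Binary entropy in bits; by convention 0 * log2(0) = 0."""
    if p1 in (0.0, 1.0):
        return 0.0
    return -p1 * math.log2(p1) - (1 - p1) * math.log2(1 - p1)

print(entropy(0 / 6))  # 0.0    -> all non-cats, maximally pure
print(entropy(3 / 6))  # 1.0    -> half cats, least pure
print(entropy(2 / 6))  # ~0.918 -> the 0.92 quoted above
```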
Information gain calculates the reduction in entropy or surprise from transforming a dataset in some way.
It is commonly used when constructing decision trees from a training dataset: the information gain is evaluated for each candidate variable, and the variable that maximizes the gain is selected, which in turn minimizes the entropy and best splits the dataset into groups for effective classification.
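Here is a minimal sketch of that selection rule, assuming binary labels (1 = cat, 0 = non-cat); the function names and the example split are hypothetical, not from the original notes:

```python
import math

def entropy(p1: float) -> float:
    # Binary entropy in bits; by convention 0 * log2(0) = 0.
    if p1 in (0.0, 1.0):
        return 0.0
    return -p1 * math.log2(p1) - (1 - p1) * math.log2(1 - p1)

def information_gain(parent, left, right):
    # Entropy at the parent node minus the example-weighted entropy
    # of the two child branches produced by a split.
    frac = lambda labels: sum(labels) / len(labels)  # fraction of cats
    w_l = len(left) / len(parent)
    w_r = len(right) / len(parent)
    return entropy(frac(parent)) - (w_l * entropy(frac(left)) + w_r * entropy(frac(right)))

# Hypothetical split of 10 animals (1 = cat, 0 = non-cat):
parent = [1] * 5 + [0] * 5   # root node: 5 cats out of 10, H = 1.0
left   = [1, 1, 1, 1, 0]     # left branch: 4/5 cats
right  = [1, 0, 0, 0, 0]     # right branch: 1/5 cats
print(round(information_gain(parent, left, right), 2))  # 0.28 bits of gain
```

The candidate split with the largest gain is the one the tree would pick at that node.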
【Summary】: