Troubleshooting Congestion using a Remote Monitoring Platform
The unpleasant reality of troubleshooting congestion in lossless Ethernet networks can be changed by using a remote monitoring platform that continuously polls the Pause frame counters and maintains a time- and date-stamped history.
The UCS Traffic Monitoring (UTM) app is an example of such an application. Refer to Chapter 9 for more details on it. The UTM app can detect and troubleshoot congestion in near real-time using comparative analysis, trending, and seasonality.
Comparative Analysis
Compare the rate of Pause frames on the network ports (host ports and switch ports) and detect whether a few ports have a much higher rate than the others.
In Figure 7-8, with thousands of hosts, follow these steps:
1. Poll the Tx and Rx Pause frame counters from the edge switchports or hosts every 60 seconds.
2. Calculate the delta of the accumulated number of Pause frames to know the change over the 60-second duration.
3. Sort the hosts in descending order of Tx Pause, or the edge switchports in descending order of Rx Pause.
4. Investigate the top 10 hosts in this list. Typically, these hosts have a higher severity of slow-drain.
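The steps above can be sketched in a few lines of Python. This is only an illustration, not how the UTM app is implemented: the function names (`pause_deltas`, `top_suspects`) and the input shape (a dictionary mapping a port or host name to its cumulative Pause frame counter) are assumptions made for the example, and how the counters are actually polled is left out.

```python
from typing import Dict, List, Tuple


def pause_deltas(prev: Dict[str, int], curr: Dict[str, int]) -> Dict[str, int]:
    """Change in the cumulative Pause frame counter between two polls.

    `prev` and `curr` map a port (or host) name to the counter value read
    at two consecutive polls, for example 60 seconds apart.
    """
    deltas: Dict[str, int] = {}
    for port, count in curr.items():
        if port not in prev:
            continue  # no baseline yet for this port
        delta = count - prev[port]
        if delta < 0:
            # Assumption: a negative delta means the counter was cleared or
            # wrapped, so this interval is skipped rather than guessed.
            continue
        deltas[port] = delta
    return deltas


def top_suspects(deltas: Dict[str, int], n: int = 10) -> List[Tuple[str, int]]:
    """Ports sorted in descending order of Pause frame delta, top n only."""
    return sorted(deltas.items(), key=lambda item: item[1], reverse=True)[:n]


# Hypothetical counter values from two consecutive polls.
previous_poll = {"host-a": 100, "host-b": 5000, "host-c": 40}
current_poll = {"host-a": 130, "host-b": 9500, "host-c": 40}

for port, delta in top_suspects(pause_deltas(previous_poll, current_poll)):
    print(f"{port}: {delta} Pause frames in the last interval")
```

Working on deltas rather than raw counters matters because the counters are cumulative: a port that paused heavily last month but is quiet now would otherwise stay at the top of the list.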
The same comparative analysis should be used across all similar entities. For example, compare all the spine ports with each other and detect whether a few ports report an excessive number of Pause frames.
Trends and Seasonality
Pause frames are important for the operation of a lossless Ethernet network, and hence, their nominal activity is fine. But analyze any spikes and dips carefully. Also, check whether the Pause frame count has been rising over the last few days or weeks, even if there are no sudden spikes. Additionally, look for seasonality, that is, whether spikes in the number of Pause frames are observed during specific hours in a day, days in a week, or even months in a year.
In graphical terms, a flat line with low counts is fine. Pay attention to spikes, especially large spikes that are sustained for longer durations.
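As a rough illustration of these three checks (spikes, a gradual rise, and seasonality), the following Python sketch operates on a history of per-interval Pause frame deltas. The function names, the spike rule (a sample exceeding a multiple of the median), and the default factor of 3 are assumptions chosen for the example, not thresholds recommended by the text. Tune them against the nominal activity of your own network.

```python
from collections import defaultdict
from datetime import datetime
from statistics import median
from typing import Dict, List, Sequence, Tuple


def find_spikes(values: Sequence[float], factor: float = 3.0) -> List[int]:
    """Indexes of samples that exceed `factor` times the series median.

    The floor of 1 keeps the threshold meaningful when the median is 0.
    """
    if not values:
        return []
    threshold = factor * max(median(values), 1)
    return [i for i, v in enumerate(values) if v > threshold]


def trend_slope(values: Sequence[float]) -> float:
    """Least-squares slope per sample; a positive value means a rising count."""
    n = len(values)
    if n < 2:
        return 0.0
    mean_x = (n - 1) / 2
    mean_y = sum(values) / n
    num = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(values))
    den = sum((x - mean_x) ** 2 for x in range(n))
    return num / den


def hourly_profile(samples: Sequence[Tuple[datetime, float]]) -> Dict[int, float]:
    """Average Pause frame delta for each hour of the day (0-23)."""
    buckets: Dict[int, List[float]] = defaultdict(list)
    for timestamp, value in samples:
        buckets[timestamp.hour].append(value)
    return {hour: sum(vals) / len(vals) for hour, vals in sorted(buckets.items())}
```

A flat series returns no spikes and a slope near zero. An hourly profile in which one or two hours stand well above the rest suggests seasonality, for example a nightly backup window. The same bucketing idea extends to days of the week or months by grouping on `timestamp.weekday()` or `timestamp.month`.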
Monitoring a Slow-drain Suspect
A suspect is an end device that sends Pause frames, but it may or may not be a culprit. To be sure, more information is needed because, as mentioned earlier, just counting the number of Pause frames does not convey how long transmission was actually stopped.
Based on our experience, the following ar