ncclFlowModel output files are empty & known bugs have gone unfixed for a long time #247
Description
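After running a SimAI simulation, ncclFlowModel_EndToEnd.csv and ncclFlowModel_detailed_72.csv are empty. Following Issue #91, I deleted the DP-related lines at the start of the workload and the optimizer-related lines at the end. The files are then written, but the simulation results no longer contain any DP communication data. In addition, while using SimAI I ran into several bugs that were already reported in earlier issues and have remained unfixed for a long time. I would like to understand the root cause of the abnormal NCCL output and the correct way to fix it, and also to ask about the repository's current maintenance status and bug-fix plans.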
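For anyone trying to reproduce this, here is a minimal sketch of how the "empty output" condition could be checked after a run. The two file names come from this report. The function name `empty_outputs` and the `result_dir` argument are hypothetical, since the report does not say where the simulator writes its results.

```python
from pathlib import Path

# File names taken from this report; the directory they live in is an assumption
# and has to be supplied by the caller.
NCCL_OUTPUTS = ("ncclFlowModel_EndToEnd.csv", "ncclFlowModel_detailed_72.csv")


def empty_outputs(result_dir, names=NCCL_OUTPUTS):
    """Return the file names that are missing or contain only whitespace."""
    empty = []
    for name in names:
        path = Path(result_dir) / name
        # A missing file is treated the same as an empty one.
        if not path.is_file():
            empty.append(name)
            continue
        text = path.read_text(encoding="utf-8", errors="replace")
        if not text.strip():
            empty.append(name)
    return empty
```

Calling `empty_outputs("<result directory>")` before and after editing the workload shows which of the two files the edit affects.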
Simulation command
AS_SEND_LAT=3 AS_NVLS_ENABLE=1 ./bin/SimAI_simulator -t 16 -w ./example/H800-gpt_7B-world_size64-tp8-pp1-ep1-gbs512-mbs8-seq1440-MOE-False-GEMM-False-flash_attn-True.txt -n ./Spectrum-X_64g_8gps_400Gbps_H800 -c astra-sim-alibabacloud/inputs/config/SimAI.conf
Lines removed from the workload
grad_gather -1 1 NONE 0 1 NONE 0 1 ALLGATHER 1627652096 100
grad_param_comm -1 1 NONE 0 1 NONE 0 1 REDUCESCATTER 3255304192 100
grad_param_compute -1 1 NONE 0 611776 NONE 0 1 NONE 0 100
layernorm -1 1 NONE 0 1 ALLREDUCE 1627652096 1 NONE 0 100
embedding_grads -1 1 NONE 0 1 ALLREDUCE 94371840 1 NONE
cross_entropy1 -1 0 ALLREDUCE 46080 0 NONE 0 0 NONE 0 100
cross_entropy2 -1 0 ALLREDUCE 46080 0 NONE 0 0 NONE 0 100
cross_entropy3 -1 0 ALLREDUCE 46080 0 NONE 0 0 NONE 0 100
optimizer1 -1 0 ALLREDUCE 4 0 NONE 0 0 NONE 0 100
optimizer2 -1 0 ALLREDUCE 4 0 NONE 0 0 NONE 0 100
optimizer3 -1 0 ALLREDUCE
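The manual workaround above can be sketched as a small filter, so that the removed lines are kept for comparison rather than lost. This is only an illustration of the workaround, not a fix for the underlying problem. It assumes that the first whitespace-separated token of each workload line is the layer name, which matches the lines listed here but is not confirmed by SimAI documentation. `split_workload` and `REMOVED_LAYERS` are hypothetical names.

```python
# Layer names copied from the removed lines listed in this report.
REMOVED_LAYERS = {
    "grad_gather", "grad_param_comm", "grad_param_compute", "layernorm",
    "embedding_grads", "cross_entropy1", "cross_entropy2", "cross_entropy3",
    "optimizer1", "optimizer2", "optimizer3",
}


def split_workload(lines, removed=REMOVED_LAYERS):
    """Split workload lines into (kept, dropped) by their first token.

    Assumption: the first whitespace-separated token is the layer name.
    """
    kept, dropped = [], []
    for line in lines:
        tokens = line.split()
        if tokens and tokens[0] in removed:
            dropped.append(line)
        else:
            kept.append(line)
    return kept, dropped
```

Matching by name drops every line with one of these names, which may be broader than the hand edit if a name such as `layernorm` also appears elsewhere in the file. If the workload file records a layer count in its header, that value would also need adjusting; the sketch does not handle it.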