ncclFlowModel output files are empty & known bugs have gone unfixed for a long time #247

@hpc-lee


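After running a SimAI simulation, ncclFlowModel_EndToEnd.csv and ncclFlowModel_detailed_72.csv are both empty. Following Issue #91, I deleted the DP lines at the start of the workload and the optimizer-related lines at the end. The files are then written, but the simulation results no longer contain any DP communication data.

While using SimAI I also ran into several bugs that were already reported in earlier issues and have remained unfixed for a long time. I would like to understand the root cause of the abnormal NCCL output and the correct way to resolve it, and to ask about the repository's current maintenance status and bug-fix plans.
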
Simulation command:
AS_SEND_LAT=3 AS_NVLS_ENABLE=1 ./bin/SimAI_simulator -t 16 -w ./example/H800-gpt_7B-world_size64-tp8-pp1-ep1-gbs512-mbs8-seq1440-MOE-False-GEMM-False-flash_attn-True.txt -n ./Spectrum-X_64g_8gps_400Gbps_H800 -c astra-sim-alibabacloud/inputs/config/SimAI.conf
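
For anyone trying to reproduce the symptom after a run like the one above, here is a minimal sketch that reports how many non-blank lines each expected output file contains. The two file names come from this issue; the helper name `count_nonblank_lines` and the result directory are hypothetical, since the issue does not say where SimAI writes these files.

```python
from pathlib import Path

# File names are taken from the issue text; where they are written is an
# assumption, so the directory must be supplied by the caller.
OUTPUT_NAMES = ["ncclFlowModel_EndToEnd.csv", "ncclFlowModel_detailed_72.csv"]


def count_nonblank_lines(result_dir, names=OUTPUT_NAMES):
    """Map each expected file name to its non-blank line count, or None if missing."""
    counts = {}
    for name in names:
        path = Path(result_dir) / name
        if not path.is_file():
            counts[name] = None
            continue
        with path.open(encoding="utf-8", errors="replace") as fh:
            counts[name] = sum(1 for line in fh if line.strip())
    return counts


if __name__ == "__main__":
    # "." is a placeholder; point this at the directory holding the CSV files.
    for name, n in count_nonblank_lines(".").items():
        status = "missing" if n is None else f"{n} non-blank line(s)"
        print(f"{name}: {status}")
```

A count of 0 corresponds to the empty files described above; a count of 1 may mean only a header row was written, which is worth mentioning separately when reporting.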
Lines removed from the workload:

grad_gather -1 1 NONE 0 1 NONE 0 1 ALLGATHER 1627652096 100
grad_param_comm -1 1 NONE 0 1 NONE 0 1 REDUCESCATTER 3255304192 100
grad_param_compute -1 1 NONE 0 611776 NONE 0 1 NONE 0 100
layernorm -1 1 NONE 0 1 ALLREDUCE 1627652096 1 NONE 0 100
embedding_grads -1 1 NONE 0 1 ALLREDUCE 94371840 1 NONE

cross_entropy1 -1 0 ALLREDUCE 46080 0 NONE 0 0 NONE 0 100
cross_entropy2 -1 0 ALLREDUCE 46080 0 NONE 0 0 NONE 0 100
cross_entropy3 -1 0 ALLREDUCE 46080 0 NONE 0 0 NONE 0 100
optimizer1 -1 0 ALLREDUCE 4 0 NONE 0 0 NONE 0 100
optimizer2 -1 0 ALLREDUCE 4 0 NONE 0 0 NONE 0 100
optimizer3 -1 0 ALLREDUCE
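
To make it easier to see which collectives disappear from the simulation when these lines are deleted, the following sketch summarizes them. It assumes that a complete layer line has 12 whitespace-separated fields and that the collective type sits in columns 3, 6 and 9 with its size in the next column, which is what the complete lines quoted above look like; the function name `summarize` is hypothetical. Lines with a different field count (such as the two that appear cut off in this post) are only listed by name, not interpreted.

```python
EXPECTED_FIELDS = 12      # assumption: matches the complete lines quoted in this issue
COMM_COLUMNS = (3, 6, 9)  # assumption: columns that hold NONE / ALLREDUCE / ...


def summarize(lines):
    """Return (complete, incomplete).

    complete:   list of (layer_name, [(collective, size), ...])
    incomplete: names of lines that do not have EXPECTED_FIELDS fields
    """
    complete, incomplete = [], []
    for raw in lines:
        fields = raw.split()
        if not fields:
            continue  # skip blank lines
        if len(fields) != EXPECTED_FIELDS:
            incomplete.append(fields[0])
            continue
        comms = [
            (fields[i], int(fields[i + 1]))
            for i in COMM_COLUMNS
            if fields[i] != "NONE"
        ]
        complete.append((fields[0], comms))
    return complete, incomplete


if __name__ == "__main__":
    sample = [
        "grad_gather -1 1 NONE 0 1 NONE 0 1 ALLGATHER 1627652096 100",
        "layernorm -1 1 NONE 0 1 ALLREDUCE 1627652096 1 NONE 0 100",
        "optimizer3 -1 0 ALLREDUCE",
    ]
    complete, incomplete = summarize(sample)
    for name, comms in complete:
        print(name, comms)
    print("incomplete:", incomplete)
```

Running it over the removed lines shows the ALLGATHER, REDUCESCATTER and ALLREDUCE entries that the workaround from Issue #91 takes out of the simulation.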
