Generating the computational graph

5.19 Save-then-load approach

TensorBoard tutorial:

The first two steps must both run on the cluster (there is no local environment), and loading the model takes a lot of memory (so it cannot be done directly on the management node).

Train the model and save it

Save the entire model when MB15 is initialized:

model = VisionTransformer(**model_kwargs)
torch.save(model, '/mnt/lustre/sunye/Desktop/mb_det/scripts/arun_log/mb15.pth')

Missing dependencies for training: pip install lightseq timm einops fvcore --user
See job-*.log for the training run

Load the model, generate the summary, generate the runs

Install dependencies: pip install torchsummary future tensorboard --user

See visualize.py for details; a minimal sketch follows.
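A minimal sketch of what visualize.py presumably does (reconstructed from this log; the actual script was not transcribed, and the runs directory name is made up):

import torch
from torchsummary import summary
from torch.utils.tensorboard import SummaryWriter

# load the fully pickled model; the MB15/VisionTransformer class definitions
# must be importable for torch.load to unpickle it
model = torch.load('/mnt/lustre/sunye/Desktop/mb_det/scripts/arun_log/mb15.pth')
model = model.cuda().eval()

input_size = (3, 32, 32)              # one of the shapes tried below
summary(model, input_size=input_size, device='cuda')

writer = SummaryWriter('runs/mb15')   # hypothetical log dir
writer.add_graph(model, input_to_model=torch.rand(1, *input_size).cuda())
writer.close()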

However (see arun_log), input_size=(3,32,32), (3,64,64), (3,224,224), (64,64), (3,32) and (3,64) were all tried and all fail with:

IndexError: too many indices for tensor of dimension


Whether for summary or for graph, an input shape and a concrete sample input must be provided, otherwise summary produces no output.

Upload the runs and open the TensorBoard dashboard

# upload from the cluster
rm -f all.tar.gz 
tar -zcvf all.tar.gz *
curl -X DELETE http://go.d5.sensetime.com/sunye/visualize/all.tar.gz
curl -F file=@all.tar.gz http://go.d5.sensetime.com/sunye/visualize

# then download and extract on the local machine
tar -zxvf all.tar.gz

# or sync from the local machine instead
sr sync -a bj15:/mnt/lustre/sunye/Desktop/mb_det/scripts/arun_log/ /data/sunye/visualize/

Start TensorBoard

tensorboard --logdir=./ --port=18140

# check whether any data exists (all fields show '-' if none)
tensorboard --inspect --logdir=./    

Forward the port to the local machine and open it:

ssh -v <user>@10.198.20.231 -L 18140:127.0.0.1:18140

5.20 Generating directly during training

How PyTorch builds computational graphs: https://pytorch.org/blog/computational-graphs-constructed-in-pytorch/

  1. The main use of summary is knowing how many kinds of op mb contains, e.g. which variants of Conv2d appear
  2. The graph mainly gives an intuitive feel for mb; we need to think about how to store this kind of graph in the database (one hypothetical layout is sketched after this list):
    ideally even capturing the repetition, e.g. mb is built from three levels, stage -> block -> op, where each block may repeat several times to form a stage
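One hypothetical storage layout (field names invented for illustration, not an existing schema): each node carries its level and a repeat count, so both the stage -> block -> op nesting and the repetition can be represented:

# hypothetical record for one stage (names and counts illustrative only)
stage = {
    'name': 'stage3',
    'level': 'stage',
    'children': [
        {
            'name': 'FusedMBConv3x3',
            'level': 'block',
            'repeats': 22,   # the block repeats 22 times within this stage
            'children': [
                {'name': 'Conv2d',      'level': 'op', 'repeats': 1, 'children': []},
                {'name': 'BatchNorm2d', 'level': 'op', 'repeats': 1, 'children': []},
            ],
        },
    ],
}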

Tried running on bj16? bj16 lacks the corresponding POD dataset, so back to bj15:

sr sync -a bj16:/mnt/lustre/sunye/Desktop/mb_det/pod /data/sunye/ 
sr sync -a /data/sunye/pod/ bj15:/mnt/lustre/sunye/Desktop/mb_det/pod/

sr sync -a bj15:/mnt/lustre/sunye/Desktop/mb_det/scripts/ /data/sunye/20220520/

Make sure the training run itself does not error

The model and data load correctly and the data types are printed out.
spring-provided torch version:

pip install --user http://spring.sensetime.com/pypi/packages/torch-1.8.1+cuda90.cudnn7.6.3-cp36-cp36m-linux_x86_64.whl

Modify the code to run summary during model construction

mb.py 793

from torch.utils.tensorboard import SummaryWriter   
from torchsummary import summary
input_size = (3, 32, 32)
print('---------------------summary---------------------')
print(summary(model, input_size=input_size, device='cuda'))
print('---------------------graph---------------------')
writer = SummaryWriter('graph')
writer.add_graph(model, input_to_model=torch.rand(64, *input_size).cuda(), verbose=True)
writer.close()

mb.py 569

# accept either a raw tensor (as passed by summary/add_graph) or POD's input dict
# x = input['image']
if type(input) == torch.Tensor:
    x = input
else:
    x = input['image']
# if isinstance(x, torch.cuda.FloatTensor):
#     print('input: torch.cuda.FloatTensor  -to_cpu-> torch.FloatTensor')
#     x = x.to('cpu')
# elif isinstance(x, torch.FloatTensor):
#     print('input: torch.FloatTensor')
# else:
#     print('input:', type(x), x.type())
# if x.type() != 'torch.cuda.HalfTensor':
#     x = x.type(torch.cuda.HalfTensor)
print('-----------forward: ', type(input), x.shape)

Result: see summary.txt

Error while generating the computational graph

The error is below. Roughly: the model's output dict is not made purely of tensors; both List[Tensor] and List[int] appear. This is likely tied to how the model structure is defined, and the graph cannot be traced as is:

---------------------graph---------------------
-----------forward:  <class 'torch.Tensor'> torch.Size([64, 3, 32, 32])
Tracer cannot infer type of {'features': [tensor([[...]]],
       device='cuda:1', grad_fn=<CheckpointFunctionBackward>), tensor([[[...]]], device='cuda:1', grad_fn=<ViewBackward>), tensor([[[[ 0.1311]],
...
         [[-0.1093]]]], device='cuda:1', grad_fn=<ViewBackward>)], 'strides': [8, 16, 32]}
:Dictionary inputs to traced functions must have consistent type. Found List[Tensor] and List[int]
Traceback (most recent call last):
  File "/mnt/lustre/share/spring/conda_envs/miniconda3/envs/s0.3.4/lib/python3.6/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/mnt/lustre/share/spring/conda_envs/miniconda3/envs/s0.3.4/lib/python3.6/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/mnt/lustre/sunye/Desktop/mb_det/pod/__main__.py", line 28, in <module>
    main()
  File "/mnt/lustre/sunye/Desktop/mb_det/pod/__main__.py", line 22, in main
    args.run(args)
  File "/mnt/lustre/sunye/Desktop/mb_det/pod/commands/train.py", line 113, in _main
    pod_helper = pod_helper_class(cfg, inference_only=inference_only)
  File "/mnt/lustre/sunye/Desktop/mb_det/pod/spring/pod_helper.py", line 63, in __init__
    self._build()
  File "/mnt/lustre/sunye/Desktop/mb_det/pod/spring/pod_helper.py", line 126, in _build
    self.model = self.build_model()
  File "/mnt/lustre/sunye/Desktop/mb_det/pod/spring/sci_helper.py", line 77, in build_model
    model = model_helper_ins(self.config['net'])
  File "/mnt/lustre/sunye/Desktop/mb_det/pod/models/model_helper.py", line 36, in __init__
    module = self.build(mtype, kwargs)
  File "/mnt/lustre/sunye/Desktop/mb_det/pod/models/model_helper.py", line 66, in build
    return cls(**kwargs)
  File "/mnt/lustre/sunye/Desktop/mb_det/pod/models/backbones/mb.py", line 810, in MB15
    writer.add_graph(model, input_to_model=torch.rand(64, *input_size).cuda(), verbose=True)
  File "/mnt/lustre/sunye/.local/lib/python3.6/site-packages/torch/utils/tensorboard/writer.py", line 723, in add_graph
    self._get_file_writer().add_graph(graph(model, input_to_model, verbose))
  File "/mnt/lustre/sunye/.local/lib/python3.6/site-packages/torch/utils/tensorboard/_pytorch_graph.py", line 292, in graph
    raise e
  File "/mnt/lustre/sunye/.local/lib/python3.6/site-packages/torch/utils/tensorboard/_pytorch_graph.py", line 286, in graph
    trace = torch.jit.trace(model, args)
  File "/mnt/lustre/sunye/.local/lib/python3.6/site-packages/torch/jit/_trace.py", line 742, in trace
    _module_class,
  File "/mnt/lustre/sunye/.local/lib/python3.6/site-packages/torch/jit/_trace.py", line 940, in trace_module
    _force_outplace,
RuntimeError: Tracer cannot infer type of {'features': 

Attempted fix (https://github.com/facebookresearch/detr/issues/208)

Modified _pytorch_graph.py line 286:

model = torch.jit.script(model)
trace = torch.jit.trace(model, args)

A new problem appeared:

(screenshot of the new error; not transcribed)

5.23 onnx

On the jit module: it creates serialized and optimizable models via TorchScript.

For model visualization, the forward methods contain many conditional control-flow statements, so scripting is needed rather than tracing. TorchScript is a statically typed subset of Python that can be written directly (with the @torch.jit.script decorator) or generated automatically from Python code by tracing.
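A toy example (not from the repo) of why control flow forces scripting: trace records only the branch the sample input happens to take, while script compiles the branch itself:

import torch

@torch.jit.script
def f(x: torch.Tensor) -> torch.Tensor:
    if x.sum() > 0:    # script keeps this as a real branch;
        return x * 2   # trace would bake in whichever side the example input hit
    return -x

print(f.code)          # the compiled TorchScript still contains the if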

Dictionary inputs to traced functions must have consistent type

The immediate cause is that forward returns a dict whose value types are inconsistent; either delete strides or turn it into tensors (both variants sketched below).
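Sketched against the output dict shown in the error (the real return statement in mb.py was not transcribed, so both variants below are hypothetical):

# option 1: drop the offending key entirely (the route taken here)
#     return {'features': features}
# option 2: make every dict value the same type, e.g. List[Tensor] like 'features'
#     return {'features': features,
#             'strides': [torch.tensor(s) for s in [8, 16, 32]]}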

With strides deleted, the run succeeded and produced a valid graph structure; the events file is 1.2 MB.

(screenshot: the rendered graph in TensorBoard)

The rendered graph can be viewed at 10.198.20.231:18140; since the image is large, it takes a long time to load.

TracerWarning (can be ignored)

A few warnings appear while generating the runs, and they may underlie the errors above. They fall into the following 3 categories (logs first; causes summarized after):

WRN py.warnings 2022-05-23 11:43:36 /mnt/lustre/sunye/Desktop/mb_det/pod/models/backbones/mb.py:583: TracerWarning: Converting a tensor to a Python integer might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
  print('-----------forward: ', type(input), x.shape)

WRN py.warnings 2022-05-23 11:43:36 /mnt/lustre/sunye/Desktop/mb_det/pod/models/backbones/mb.py:595: TracerWarning: torch.Tensor results are registered as constants in the trace. You can safely ignore this warning if you use this function to create tensors out of constant variables that would be the same every time you call this function. In any other case, this might cause the trace to be incorrect.
  x = checkpoint.checkpoint(blk, *(x, Variable(torch.Tensor([h])), Variable(torch.Tensor([w]))))

WRN py.warnings 2022-05-23 11:43:38 /mnt/lustre/sunye/.local/lib/python3.6/site-packages/torch/tensor.py:590: RuntimeWarning: Iterating over a tensor might cause the trace to be incorrect. Passing a tensor of different shape won't change the number of iterations executed (and might lead to errors or silently give incorrect results).
  'incorrect results).', category=RuntimeWarning)
  1. The forward function converts a tensor to a Python int or float.
  2. Tensors are constructed inside the function (torch.Tensor results are registered as constants in the trace).
  3. Looping over a tensor.

Netron visualization

https://github.com/lutzroeder/netron

Netron supports loading .pth models directly, but importing the 4.2 GB pth fails: archive contains no model files in mb15.pth

Convert the graph to an onnx file instead; see to_onnx.py (a sketch follows the tutorial link below).

https://docs.microsoft.com/zh-cn/windows/ai/windows-ml/tutorials/pytorch-convert-model
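A sketch of what to_onnx.py presumably contains, reconstructed from the tracebacks below and the tutorial above (the tutorial uses the modelInput/modelOutput names; everything else is an assumption):

import torch

def Convert_ONNX(model, input_size):
    model.eval()
    dummy_input = torch.randn(1, *input_size).cuda()
    torch.onnx.export(
        model, dummy_input, 'mb15.onnx',
        export_params=True,                  # store the trained weights in the file
        input_names=['modelInput'],
        output_names=['modelOutput'],
        dynamic_axes={'modelInput': {0: 'batch_size'},
                      'modelOutput': {0: 'batch_size'}})
    # use_external_data_format=True was added later, after the 2 GB protobuf error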


ONNX export errors:

In spring, SyncBNFunc's input_sizes is empty; patched:

/mnt/lustre/sunye/.local/lib/python3.6/site-packages/spring/linklink/nn/syncbn_layer.py 29

if input_sizes and len(input_sizes) == 2:
    # batchnorm1d accepts 2d and 3d array, but ONNX only accepts 3d
    input = g.op("Unsqueeze", input, axes_i=[2])

(spring was reinstalled into the user site-packages beforehand, so a local copy exists to patch:)

pip install --user --force-reinstall --no-deps http://spring.sensetime.com/pypi/packages/spring-0.7.0+cu90.torch181.mvapich2.pmi2.nartgpu-cp36-cp36m-linux_x86_64.whl

The export then failed on gradient checkpointing:
Traceback (most recent call last):
  File "to_onnx.py", line 43, in <module>
    Convert_ONNX(model, (3, 32, 32))
  File "to_onnx.py", line 36, in Convert_ONNX
    'modelOutput' : {0 : 'batch_size'}}) 
  File "/mnt/lustre/sunye/.local/lib/python3.6/site-packages/torch/onnx/__init__.py", line 276, in export
    custom_opsets, enable_onnx_checker, use_external_data_format)
  File "/mnt/lustre/sunye/.local/lib/python3.6/site-packages/torch/onnx/utils.py", line 94, in export
    use_external_data_format=use_external_data_format)
  File "/mnt/lustre/sunye/.local/lib/python3.6/site-packages/torch/onnx/utils.py", line 712, in _export
    val_add_node_names, val_use_external_data_format, model_file_location)
RuntimeError: ONNX export failed: Couldn't export Python operator CheckpointFunction

Commented out checkpoint.checkpoint in the forward function (roughly as sketched below).
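The change, roughly (the checkpointed call is the one shown in the TracerWarning above; the direct-call form is an assumption based on the block's forward signature):

# before: each block ran under gradient checkpointing, which ONNX cannot export
# x = checkpoint.checkpoint(blk, *(x, Variable(torch.Tensor([h])), Variable(torch.Tensor([w]))))
# after: call the block directly (blk takes (x, h, w) per its forward definition)
x = blk(x, h, w)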

This run took nearly 10 minutes:

Traceback (most recent call last):
  File "to_onnx.py", line 41, in <module>
    Convert_ONNX(model, (3, 32, 32))
  File "to_onnx.py", line 34, in Convert_ONNX
    output_names = ['modelOutput']) 
  File "/mnt/lustre/sunye/.local/lib/python3.6/site-packages/torch/onnx/__init__.py", line 276, in export
    custom_opsets, enable_onnx_checker, use_external_data_format)
  File "/mnt/lustre/sunye/.local/lib/python3.6/site-packages/torch/onnx/utils.py", line 94, in export
    use_external_data_format=use_external_data_format)
  File "/mnt/lustre/sunye/.local/lib/python3.6/site-packages/torch/onnx/utils.py", line 712, in _export
    val_add_node_names, val_use_external_data_format, model_file_location)
RuntimeError: Exporting model exceed maximum protobuf size of 2GB. Please call torch.onnx.export with use_external_data_format=True.
srun: error: BJ-IDC1-10-10-15-45: task 0: Exited with exit code 1
srun: Terminating job step 2956894.0

Following the hint, use_external_data_format=True was added and the export rerun. This time it took nearly 20 minutes and finally succeeded, saving a series of external weight files plus the final mb15.onnx; together they are about the same size as the pth file, but mb15.onnx itself is only 1.4 MB.

Opening the onnx file

Download mb15.onnx to the local machine and open it with netron-desktop:

  • Warning: This graph contains a large number of nodes and might take a long time to render. Do you want to continue?
  • Rendering took about 30 s

Or open 10.198.20.231:18141 directly.

(screenshots of the netron rendering)

5.23 script

Code in script.py; logs in arun_log.

https://zhuanlan.zhihu.com/p/135911580

Python builtin <built-in method apply of FunctionMeta object at 0x7f1f554c61b8> is currently not supported in Torchscript:
  File "/mnt/lustre/share/spring/conda_envs/miniconda3/envs/s0.3.4/lib/python3.6/site-packages/spring/linklink/nn/syncbn_layer.py", line 283
    def forward(self, input):
        return SyncBNFunc.apply(
               ~~~~~~~~~~~~~~~~ <--- HERE
            input,
            self.weight,

Changed this forward to simply return input (hack below; it strips the BN computation from the scripted model).
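i.e., in syncbn_layer.py (a visualization-only hack: the scripted model is not numerically usable afterwards):

def forward(self, input):
    # bypass SyncBNFunc.apply so torch.jit.script can compile this module
    return input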

_scale(__torch__.pod.models.backbones.mb.SqueezeExcitation self, Tensor input, Tensor inplace) -> (Tensor):
Expected a value of type 'Tensor (inferred)' for argument 'inplace' but instead found type 'bool'.
Inferred 'inplace' to be of type 'Tensor' because it was not annotated with an explicit type.
:
  File "/mnt/lustre/sunye/Desktop/mb_det/pod/models/backbones/mb.py", line 74
    def forward(self, input):
        scale = self._scale(input, True)
                ~~~~~~~~~~~ <--- HERE
        return scale * input

Since inplace is unused anyway, it was simply removed.

RuntimeError: 
Expected a default value of type Tensor (inferred) on parameter "h".Because "h" was not annotated with an explicit type it is assumed to be type 'Tensor'.:
  File "/mnt/lustre/sunye/Desktop/mb_det/pod/models/backbones/mb.py", line 175
    def forward(self, x, h=14, w=14):
    ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
        # pdb.set_trace()
        ~~~~~~~~~~~~~~~~~
        h = int(h)
        ~~~~~~~~~~
        w = int(w)
        ~~~~~~~~~~
        residual = self.residual(x)
        ~~~~~~~~~~~~~~~~~~~~~~~~~~~
        if self.b1 is not None:
        ~~~~~~~~~~~~~~~~~~~~~~~
            x = self.b1(x)
            ~~~~~~~~~~~~~~
        x = self.b2(x)
        ~~~~~~~~~~~~~~
    
        return residual + self.drop_path(x)
        ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ <--- HERE

Seems related to default arguments: TorchScript assumes an unannotated forward parameter is a Tensor, so its default value must be a Tensor too: https://dev-discuss.pytorch.org/t/adapting-models-to-use-torchscript-and-getting-them-to-produce-fusions/129

mb.py - 175

def forward(self, x, h=14, w=14):
        # pdb.set_trace()
        h = int(h)
        w = int(w)
        residual = self.residual(x)
        if self.b1 is not None:
            x = self.b1(x)
        x = self.b2(x)

        return residual + self.drop_path(x)

Changed to:

def forward(self, x: torch.Tensor, h: int=14, w: int=14):
        # pdb.set_trace()
        h = int(h)
        w = int(w)
        residual = self.residual(x)
        if self.b1 is not None:
            x = self.b1(x)
        x = self.b2(x)

        return residual + self.drop_path(x)

Explicit type annotations were then added to all the remaining forward functions below as well.

RuntimeError: 
Module 'Attention' has no attribute 'static_a' :
  File "/mnt/lustre/sunye/Desktop/mb_det/pod/models/backbones/mb.py", line 276
        attn = attn.softmax(dim=-1)
        if self.static:
            attn = attn + self.static_a
                          ~~~~~~~~~~~~~ <--- HERE
        attn = self.attn_drop(attn)

In the source code:

def __init__(...):
    ...
    if self.static:
        self.static_a = nn.Parameter(torch.Tensor(1, num_heads, seq_l, seq_l))
        trunc_normal_(self.static_a)
    self.custom_flops = 2 * seq_l * seq_l * dim
    self.window = window

def forward(self, x: torch.Tensor, head: int=0, mask_type: str=None):
    ...
    if self.static:
        attn = attn + self.static_a

This only runs when self.static is true, so why is static_a treated as undefined? Because torch.jit.script compiles every branch of forward regardless of runtime values: self.static is an ordinary bool attribute, not a compile-time constant, so the compiler still needs static_a to exist on this instance, and it was never created (static is False here).

Changed it to if hasattr(self, 'static_a'): (TorchScript evaluates hasattr statically, so the dead branch is dropped).
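In context:

attn = attn.softmax(dim=-1)
if hasattr(self, 'static_a'):   # replaces `if self.static:`
    attn = attn + self.static_a
attn = self.attn_drop(attn)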

Expected a default value of type str on parameter "mask_type".:
  File "/mnt/lustre/sunye/Desktop/mb_det/pod/models/backbones/mb.py", line 261
    def forward(self, x: torch.Tensor, head: int=0, mask_type: str=None):
    ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
        B, N, C = x.shape
        ~~~~~~~~~~~~~~~~~
        qkv = self.qkv(x).reshape(B, N, 3, self.num_heads, C // self.num_heads).permute(2, 0, 3, 1, 4)
        ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    
        ...
        return x
        ~~~~~~~~ <--- HERE

Changed to def forward(self, x: torch.Tensor, head: int=0, mask_type: str=''):

Expected a value of type 'Tensor (inferred)' for argument 'h' but instead found type 'int'.
Inferred 'h' to be of type 'Tensor' because it was not annotated with an explicit type.
:
  File "/mnt/lustre/sunye/Desktop/mb_det/pod/models/backbones/mb.py", line 377
        # print(h, w)
        if self.conv_embedding == 1 or self.conv_embedding == 3:
            x = x + to3d(self.pos_embed(to4d(x, h, w)))
                                        ~~~~ <--- HERE
        elif self.conv_embedding == 2:
            x = x + self.pos_embed(x.transpose(1, 2)).transpose(1, 2)
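The log ends here, but presumably the same class of fix applies: annotate the helper so h and w stop being inferred as Tensor. The body below is only a guess at a typical to4d ((B, N, C) tokens back to a (B, C, h, w) feature map); the real implementation was not transcribed:

def to4d(x: torch.Tensor, h: int, w: int) -> torch.Tensor:
    B, N, C = x.shape
    return x.transpose(1, 2).reshape(B, C, h, w)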

5.24 onnx visualization

Graph-structure information from multiple sources:

  • The tree structure from print(model) in pytorch: print_model.log
  • The graph structure printed from the onnx file: graph.log (describes the nodes in topological order, but carries no hierarchy information)
  • The summary generated by tracing the model in pytorch: summary.log (print(model) and summary(model, input_size=input_size, device='cuda') print complementary information: the former has the static structure and parameter settings, the latter has the parameter count of each op)

What is the relationship between this tree structure and the onnx graph structure?


print_model.log shows that the level-2 module blocks defines 50 identical FusedMBConv3x3 blocks.

Roughly, the requirements are as follows (a folding sketch for requirement 2 follows the list):

  1. Hierarchical visualization of large models: netron can load onnx models, but for large models it is still very inefficient, or outright unusable

    netron issue filed back in 2018: https://github.com/lutzroeder/netron/issues/68 ; after the bug was relabeled a feature it was never touched again (a single-developer project)

  2. Folding isomorphic subgraphs: we need hierarchical visualization of onnx models, ideally abbreviating the repetition inside them, e.g. blockA(+22) meaning this structure repeats 22 times

  3. Model summary: which OPs are used, estimated Ops per node / per network
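For requirement 2, a rough pytorch-side sketch (folding by repr equality is an assumption, and the real target is the onnx graph, but it shows the idea):

from itertools import groupby
import torch.nn as nn

def fold_repeats(parent: nn.Module):
    # collapse consecutive structurally-identical children into (name, count) pairs;
    # identical configuration implies identical repr() text
    folded = []
    for _, grp in groupby(parent.children(), key=repr):
        members = list(grp)
        folded.append((type(members[0]).__name__, len(members)))
    return folded

# e.g. on mb's `blocks` this should yield something like [('FusedMBConv3x3', 50), ...]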


Related netron issue: https://github.com/lutzroeder/netron/issues/204

Existing visualization tools

https://datascience.stackexchange.com/questions/12851/how-do-you-visualize-neural-network-architectures

This link lists a wide variety of neural-network visualization tools:

The netscope viewer offers per-layer MACC, FLOPS and memory-access information (scroll to the bottom), but no model summary.

onnx parsing

https://zhuanlan.zhihu.com/p/371177698

https://bindog.github.io/blog/2020/03/13/deep-learning-model-convert-and-depoly/

netron/source/onnx*.js defines the onnx parser.

Statistical analysis

5.26 Multi-level op features and parameter-count statistics

sr sync -a bj15:/mnt/lustre/sunye/Desktop/mb_det/scripts/ /data/sunye/20220526/

Problem analysis

print(model) yields a tree, whereas the result of tracing is a graph with no hierarchy (a pile of nodes in topological order).

In terms of "structure + control flow = computational graph; computational graph + weights = model": print(model) has the structure but no control flow, while trace has the control flow but keeps only a subset of the structure and loses its hierarchy.

Requirement: stop visualizing the computational graph and storing the graph structure; collect op-granularity information instead (details: https://gitlab.bj.sensetime.com/ostrich/cli/-/issues/52)

Steps

  • In the runtime code, add "dump the model structure to model_str" and "dump the model summary to summary_str"
    • print(model): don't actually print; reuse the traversal logic behind print (see the sketch after this list)
      • This requires knowing exactly what nn.Module.__str__ does and how it walks the model
    • print(summary(model, input_size=input_size, device='cuda'))
  • Build a one-to-one mapping between the two and display parameter counts at each level of the hierarchy
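A minimal sketch of reusing that traversal without printing (assumption: named_modules() walks the same tree nn.Module.__repr__ renders):

import torch.nn as nn

def model_tree(model: nn.Module):
    rows = []
    for name, module in model.named_modules():
        depth = name.count('.') + 1 if name else 0   # root has the empty name
        own_params = sum(p.numel() for p in module.parameters(recurse=False))
        rows.append((depth, name or '(root)', type(module).__name__,
                     module.extra_repr(), own_params))
    return rows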

Four ways of obtaining the model structure

Counts, file sizes and times were all measured on mb15.

| Method | traverse | summary | trace | script |
| --- | --- | --- | --- | --- |
| Description | In-order traversal of the model structure from root to leaf; each node is an nn.Module | Feed in a tensor; run a hook on every nn.Module actually executed | Needs an input to run forward once; the graph structure is recovered from that input's flow path; not suited to code with if and for-loops | Compiles Python code into TorchScript, then exports an optimizable, deployable model |
| Key functions (call stack) | print(model) → nn.Module.__str__ → model.modules() | summary(model, input_size=input_size, device='cuda') → model.apply(register_hook) | writer.add_graph(model, input_to_model=torch.rand(1, *input_size).cuda()) → graph(model, input_to_model, verbose) → torch.jit.trace(model, args) | torch.jit.script(model) |
| Count | 1616 Modules | 1557 Modules | 7517 nodes | — |
| Output | Model tree with parameters such as shape, kernel_size, stride, padding, bias | Flat list of each executed module's attributes: input_shape, output_shape, trainable, nb_params | Partial computational graph (and its parameters); onnx | Complete computational graph (and its parameters); deployable model |
| File size | 93 KB | 165 KB | graph structure only: 1.4 MB; graph + weights: 4.2 GB | 4.2 GB |
| Time | 0.14 s | 7.55 s (slow) | — | — (run failed) |

Implementation

The original hope was to save mb15.pth and then run all the steps above offline locally, but UP (POD) depends on spring, and spring can only be used in the cluster environment and requires cuda (gpu). This makes development quite painful.

Added in mb.py (MB15); see save.py (a sketch follows the timings). Time cost:

  • time(summary): 7.5516321659088135
  • time(model): 7.691348552703857
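A sketch of what save.py presumably does (model and input_size already in scope; capturing torchsummary's printed table via stdout redirection is an assumption about the implementation):

import io
import time
from contextlib import redirect_stdout
from torchsummary import summary

t0 = time.time()
buf = io.StringIO()
with redirect_stdout(buf):       # torchsummary prints its table to stdout
    summary(model, input_size=input_size, device='cuda')
summary_str = buf.getvalue()
print('time(summary):', time.time() - t0)

t0 = time.time()
model_str = str(model)           # the same traversal print(model) performs
print('time(model):', time.time() - t0)

with open('summary.txt', 'w') as f:
    f.write(summary_str)
with open('print_model.log', 'w') as f:
    f.write(model_str)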

The data has now been collected; how do we build the mapping between the summary and traverse outputs?? One hypothetical starting point is sketched below.
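Key both views by module object identity, since summary's hooks and traverse's walk touch the very same nn.Module instances (sketch; guards are minimal):

import torch
import torch.nn as nn

def collect_stats(model: nn.Module):
    # hierarchical name for every module instance, keyed by object id
    name_of = {id(m): n or '(root)' for n, m in model.named_modules()}
    stats = {}

    def hook(module, inputs, output):
        first = inputs[0] if inputs else None
        stats[name_of[id(module)]] = {
            'input_shape': list(first.shape) if isinstance(first, torch.Tensor) else None,
            'output_shape': list(output.shape) if isinstance(output, torch.Tensor) else None,
            'nb_params': sum(p.numel() for p in module.parameters(recurse=False)),
        }

    handles = [m.register_forward_hook(hook) for m in model.modules()]
    return stats, handles   # run one forward pass, then call h.remove() on each handle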

Start from a simple network such as resnet152 under torchvision/models; see the resnet folder.