
Debugging fixes and modifications made to the original project

NoBa1anc3 edited this page May 11, 2020 · 2 revisions

Debugging and fixes

Numeric overflow

Process finished with exit code 139 (interrupted by signal 11: SIGSEGV)
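  • Cause: prediction was run with randomly initialized weights. Inside ResNet-101, tf.keras.layers.BatchNormalization is called with training=False (meaning that it will use the moving mean and the moving variance to normalize the current batch, rather than using the mean and variance of the current batch). As a result, many of the values output by C4 and C5 have magnitudes in the range (10e6, 10e8), so many of the values in P2 ~ P5 fall in the same range. The rpn_deltas computed from these values are also extremely large. The RPN uses rpn_deltas to refine the initial proposal boxes; for the width the update is Wnew = Wold * exp(dw), so Wnew overflows and ends up as nan. The boxes containing nan then reach the roi_align layer, which calls tf.image.crop_and_resize. That function does not accept boxes containing nan and fails with the error above.
  • Fix: remove the prediction step that ran before the training code.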
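The overflow described above can be reproduced outside the model. The following is a minimal NumPy sketch, not the project's code: `apply_deltas` and `MAX_DELTA` are hypothetical names, and clamping dh/dw before the exponential is a different mitigation from the fix actually applied here (removing the early prediction step). It only illustrates why a huge dw turns box coordinates non-finite.

```python
import math

import numpy as np

# Hypothetical upper bound for dh/dw; log(1000/16) is used purely for illustration.
MAX_DELTA = math.log(1000.0 / 16.0)


def apply_deltas(boxes, deltas, clip=True):
    """Apply (dy, dx, dh, dw) deltas to (y1, x1, y2, x2) boxes in float32."""
    boxes = np.asarray(boxes, dtype=np.float32)
    deltas = np.asarray(deltas, dtype=np.float32)
    h = boxes[:, 2] - boxes[:, 0]
    w = boxes[:, 3] - boxes[:, 1]
    cy = boxes[:, 0] + 0.5 * h
    cx = boxes[:, 1] + 0.5 * w
    dy, dx, dh, dw = deltas.T
    if clip:
        # Bound the exponent so exp() stays within float32 range.
        dh = np.minimum(dh, MAX_DELTA)
        dw = np.minimum(dw, MAX_DELTA)
    # Silence overflow warnings so the non-finite result can be inspected.
    with np.errstate(over="ignore", invalid="ignore"):
        cy = cy + dy * h
        cx = cx + dx * w
        h = h * np.exp(dh)  # Hnew = Hold * exp(dh)
        w = w * np.exp(dw)  # Wnew = Wold * exp(dw)
        return np.stack(
            [cy - 0.5 * h, cx - 0.5 * w, cy + 0.5 * h, cx + 0.5 * w], axis=1)


boxes = [[0.0, 0.0, 10.0, 10.0]]
deltas = [[0.1, 0.1, 1e6, 1e6]]  # magnitudes similar to those reported above
print(np.isfinite(apply_deltas(boxes, deltas, clip=False)).all())
print(np.isfinite(apply_deltas(boxes, deltas, clip=True)).all())
```

Checking proposals with `np.isfinite` (or `tf.math.is_finite` in TensorFlow) before they reach the cropping step is another way to surface the problem as a readable error instead of a segmentation fault.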

Out of memory

Process finished with exit code 139 (interrupted by signal 11: SIGSEGV)
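  • Cause: the explanations found online all come down to insufficient RAM, CPU, or GPU memory. The cause here turned out to be running out of memory at run time. The same code does not raise this error on Colab, so a difference in some library version is the suspected trigger.
  • Fix: reducing the number of preloaded images resolved the problem.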
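One way to cap how many images are held in memory is a small bounded cache in front of the loader. This is only a sketch under assumptions: `BoundedImageCache` and `load_fn` are hypothetical and not part of the project, whose actual change was simply to preload fewer images.

```python
from collections import OrderedDict


class BoundedImageCache:
    """Keep at most max_items loaded images, evicting the least recently used."""

    def __init__(self, load_fn, max_items):
        self._load_fn = load_fn      # hypothetical callable: index -> image
        self._max_items = max_items
        self._cache = OrderedDict()

    def get(self, key):
        if key in self._cache:
            # Mark as most recently used and return the cached image.
            self._cache.move_to_end(key)
            return self._cache[key]
        value = self._load_fn(key)
        self._cache[key] = value
        if len(self._cache) > self._max_items:
            # Drop the least recently used entry to bound memory use.
            self._cache.popitem(last=False)
        return value

    def __len__(self):
        return len(self._cache)
```

With `max_items` set to a small multiple of the batch size, memory use stays roughly constant regardless of dataset size, at the cost of reloading images that have been evicted.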

Mismatched number of model weights

ValueError: Layer #0 (named "res_net") expects 0 weight(s), but the saved weights have 318 element(s).
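  • Cause: the model reports that it expects 0 weights, while the loaded h5 file contains more than 0. The error does not occur on Colab, so it is probably a TensorFlow version issue.
  • Fix: before loading the weights, run one complete training pass on any single image. Loading the weights afterwards no longer raises the error.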
batch_imgs, batch_metas, batch_bboxes, batch_labels = train_dataset[0]
with tf.GradientTape() as tape:
    _, _, _, _ = model((np.array([batch_imgs]), np.array([batch_metas]),
                        np.array([batch_bboxes]), np.array([batch_labels])), training=True)

model.load_weights(CHECKPOINT, by_name=True)

Threshold too high

In detection/models/detector/faster_rcnn.py, the parameter

self.RCNN_MIN_CONFIDENCE

is set too high and should be lowered; otherwise every box is filtered out.

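PS: in the models provided by the most-starred Mask RCNN repository on GitHub, the minimum confidence is 0.9 for balloon detection, 0 for cell (nucleus) detection, and 0 for object detection on the COCO dataset. This suggests that the minimum confidence should be chosen per detection target. For the nucleus setting of 0, the code author's explanation is: "Don't exclude based on confidence. Since we have two classes then 0.5 is the minimum anyway as it picks between nucleus and BG."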
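The effect of the threshold can be seen on a toy example. This is an illustrative sketch only: `filter_by_confidence` is a hypothetical helper, and the labels and scores are made up rather than taken from the project.

```python
def filter_by_confidence(detections, min_confidence):
    """Keep (label, score) detections whose score is at least min_confidence."""
    return [d for d in detections if d[1] >= min_confidence]


# Made-up detections: moderate scores, as an undertrained model might produce.
detections = [("class_1", 0.62), ("class_1", 0.48), ("class_2", 0.71)]

for threshold in (0.9, 0.5, 0.0):
    kept = filter_by_confidence(detections, threshold)
    print("min confidence", threshold, "keeps", len(kept), "boxes")
```

When no score reaches the threshold, nothing survives the filter, which is the behavior seen with an overly large RCNN_MIN_CONFIDENCE.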

Exception handling

In detection/models/bbox_head/bbox_head.py, if nms_keep is empty (length 0), the line

nms_keep = tf.concat(nms_keep, axis=0)

raises an error, so a guard for the empty case is needed:

if len(nms_keep) != 0:
    nms_keep = tf.concat(nms_keep, axis=0)
else:
    nms_keep = tf.zeros([0, ], tf.int64)

Modifications and additions

  • Added configuration options for the number of epochs, batch size, whether to fine-tune, learning rate, evaluation frequency, and normalization.
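import getopt
import sys

# Defaults: ImageNet normalization
img_mean = (123.675, 116.28, 103.53)
img_std = (58.395, 57.12, 57.375)

epochs = 100
batch_size = 2
learning_rate = 1e-4
checkpoint = 500
finetune = 0

# -b batch size, -f finetune, -l learning rate, -e epochs,
# -c checkpoint interval (in batches), -n normalization preset
opts, args = getopt.getopt(sys.argv[1:], "b:f:l:e:c:n:")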

for opt, arg in opts:
    if opt == '-b':
        batch_size = int(arg)
    elif opt == '-f':
        finetune = int(arg)
    elif opt == '-l':
        learning_rate = float(arg)
    elif opt == '-e':
        epochs = int(arg)
    elif opt == '-c':
        checkpoint = int(arg)
    elif opt == '-n':
        if int(arg) == 0:
            img_mean = (0., 0., 0.)
            img_std = (1., 1., 1.)
        elif int(arg) == 1:
            # Company Articles Dataset
            img_mean = (0.9684, 0.9683, 0.9683)
            img_std = (0.1502, 0.1505, 0.1505)
  • Normalization can be configured as no normalization, ImageNet normalization, or Company Articles dataset normalization.
img_mean = (123.675, 116.28, 103.53)
img_std = (58.395, 57.12, 57.375)

    elif opt == '-n':
        if int(arg) == 0:
            img_mean = (0., 0., 0.)
            img_std = (1., 1., 1.)
        elif int(arg) == 1:
            # Company Articles Dataset
            img_mean = (0.9684, 0.9683, 0.9683)
            img_std = (0.1502, 0.1505, 0.1505)
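
The same options could also be expressed with argparse, which generates help text and validates types. This is an alternative sketch rather than the project's code: `parse_config`, the long option names, and the preset names in `NORMALIZATION` are assumptions, while the numeric values are copied from the snippet above.

```python
import argparse

# Normalization presets; the values come from the snippet above,
# the preset names are hypothetical.
NORMALIZATION = {
    "imagenet": ((123.675, 116.28, 103.53), (58.395, 57.12, 57.375)),
    "none": ((0., 0., 0.), (1., 1., 1.)),
    "company": ((0.9684, 0.9683, 0.9683), (0.1502, 0.1505, 0.1505)),
}


def parse_config(argv=None):
    """Parse training options; argv=None falls back to sys.argv[1:]."""
    parser = argparse.ArgumentParser(description="Training options (sketch)")
    parser.add_argument("-b", "--batch-size", type=int, default=2)
    parser.add_argument("-f", "--finetune", type=int, default=0)
    parser.add_argument("-l", "--learning-rate", type=float, default=1e-4)
    parser.add_argument("-e", "--epochs", type=int, default=100)
    parser.add_argument("-c", "--checkpoint", type=int, default=500)
    parser.add_argument("-n", "--normalization",
                        choices=sorted(NORMALIZATION), default="imagenet")
    args = parser.parse_args(argv)
    # Resolve the preset name into the mean/std tuples used by the dataset.
    args.img_mean, args.img_std = NORMALIZATION[args.normalization]
    return args


cfg = parse_config(["-b", "4", "-n", "company"])
print(cfg.batch_size, cfg.img_mean, cfg.img_std)
```

Named presets avoid the silent fall-through in the getopt version, where an unrecognized value for -n leaves the ImageNet defaults in place.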

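  • Split training into separate .py files, one that predicts and saves the model per batch and one that does so per epoch.
  • Added output of the classification and localization losses for both the RPN and RCNN parts of the network.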
print('Epoch:', epoch, 'Batch:', batch, 'Loss:', loss_value.numpy(),
      'RPN Class Loss:', rpn_class_loss.numpy(),
      'RPN Bbox Loss:', rpn_bbox_loss.numpy(),
      'RCNN Class Loss:', rcnn_class_loss.numpy(),
      'RCNN Bbox Loss:', rcnn_bbox_loss.numpy())
  • Added model saving.
if batch % checkpoint == 0 and batch != 0:
    model.save_weights('./model/epoch_' + str(epoch) + '_batch_' + str(batch) + '.h5')
  • Added model loading.
model.load_weights('model/faster_rcnn.h5', by_name=True)
  • Added prediction on the validation set during training, including computing AP and AR and saving the prediction results.
dataset_results = []
imgIds = []

for idx in range(len(test_dataset)):
    if idx % 10 == 9 or idx + 1 == len(test_dataset):
        print(str(idx + 1) + ' / ' + str(len(test_dataset)))

    img, img_meta, _, _ = test_dataset[idx]

    proposals = model.simple_test_rpn(img, img_meta)
    res = model.simple_test_bboxes(img, img_meta, proposals)

    # visualize.display_instances(ori_img, res['rois'], res['class_ids'],
    #                             test_dataset.get_categories(), scores=res['scores'])

    image_id = test_dataset.img_ids[idx]
    imgIds.append(image_id)

    for pos in range(res['class_ids'].shape[0]):
        results = dict()
        results['score'] = float(res['scores'][pos])
        results['category_id'] = test_dataset.label2cat[int(res['class_ids'][pos])]
        y1, x1, y2, x2 = [float(num) for num in list(res['rois'][pos])]
        results['bbox'] = [x1, y1, x2 - x1 + 1, y2 - y1 + 1]
        results['image_id'] = image_id
        dataset_results.append(results)

if dataset_results:
    with open('result/epoch_' + str(epoch) + '_batch_' + str(batch) + '.json', 'w') as f:
        f.write(json.dumps(dataset_results))

    coco_dt = test_dataset.coco.loadRes(
        'result/epoch_' + str(epoch) + '_batch_' + str(batch) + '.json')
    cocoEval = COCOeval(test_dataset.coco, coco_dt, 'bbox')
    cocoEval.params.imgIds = imgIds

    cocoEval.evaluate()
    cocoEval.accumulate()
    cocoEval.summarize()

    with open('result/evaluation.txt', 'a+') as f:
        # Separate the epoch and batch fields so the log line stays readable.
        content = 'Epoch: ' + str(epoch) + ' Batch: ' + str(batch) \
                  + '\n' + str(cocoEval.stats) + '\n'
        f.write(content)
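
The box conversion inside the loop above can be isolated into a small helper for clarity. A minimal sketch: `roi_to_coco_bbox` is a hypothetical name, and the `+ 1` on width and height mirrors the convention used in the loop rather than a requirement of the COCO format.

```python
def roi_to_coco_bbox(roi):
    """Convert a (y1, x1, y2, x2) box to COCO-style [x, y, width, height]."""
    y1, x1, y2, x2 = (float(v) for v in roi)
    # Same inclusive-pixel convention as the evaluation loop: size = max - min + 1.
    return [x1, y1, x2 - x1 + 1, y2 - y1 + 1]


print(roi_to_coco_bbox((10, 20, 50, 80)))  # → [20.0, 10.0, 61.0, 41.0]
```

Keeping the conversion in one function makes the coordinate order swap, from (y, x) in the model output to (x, y) in the results file, easier to test on its own.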