Object detection is a key technology for solving many business problems. In ADAS, for instance, FCW (forward collision warning) relies on object detection to detect and recognize the vehicles and pedestrians ahead; likewise, a face-recognition gate needs a face detector to find the people passing through and hand the face ROIs to the recognition module for verification. Since this column focuses on AI algorithms for the embedded side, this post discusses how to DIY a Mobilenet-SSD that runs in near real time on embedded devices.

Here we use Google's open-source Object Detection API; installation is covered in the official documentation, so I won't repeat it. To train your own object detector with the Object Detection API, you generally go through the following steps: preparing the training data (tfrecord), setting up the preprocessing function, selecting the model, and configuring the training parameters. Once these are done, you can train following the officially documented procedure. Let's walk through how to prepare each part of the model DIY.

  • Data preparation

This depends on your own annotated data, which can be in any format; typically each image is associated with bounding boxes of different classes. When generating tfrecords, you only need to map your data to the tfrecord layout the Object Detection API expects. Below is a simple example; you can write a custom parser for your own annotation file format.

import tensorflow as tf
from object_detection.utils import dataset_util

def create_tf_example(example):
    # TODO(user): Populate the following variables from your example.
    height = example['height']  # Image height
    width = example['width']  # Image width
    filename = example['filename']  # Filename of the image. Empty if image is not from file
    encoded_image_data = example['image']  # Encoded image bytes
    image_format = example['format']  # b'jpeg' or b'png'

    xmins = example['xmin']  # List of normalized left x coordinates in bounding box (1 per box)
    xmaxs = example['xmax']  # List of normalized right x coordinates in bounding box (1 per box)
    ymins = example['ymin']  # List of normalized top y coordinates in bounding box (1 per box)
    ymaxs = example['ymax']  # List of normalized bottom y coordinates in bounding box (1 per box)
    classes_text = example['text']  # List of string class names of bounding boxes (1 per box)
    classes = example['label']  # List of integer class ids of bounding boxes (1 per box)

    tf_example = tf.train.Example(features=tf.train.Features(feature={
        'image/height': dataset_util.int64_feature(height),
        'image/width': dataset_util.int64_feature(width),
        'image/filename': dataset_util.bytes_feature(filename),
        'image/source_id': dataset_util.bytes_feature(filename),
        'image/encoded': dataset_util.bytes_feature(encoded_image_data),
        'image/format': dataset_util.bytes_feature(image_format),
        'image/object/bbox/xmin': dataset_util.float_list_feature(xmins),
        'image/object/bbox/xmax': dataset_util.float_list_feature(xmaxs),
        'image/object/bbox/ymin': dataset_util.float_list_feature(ymins),
        'image/object/bbox/ymax': dataset_util.float_list_feature(ymaxs),
        'image/object/class/text': dataset_util.bytes_list_feature(classes_text),
        'image/object/class/label': dataset_util.int64_list_feature(classes),
    }))
    return tf_example

So you need to parse your own annotation format and fill in, one by one, the entries of the Python dict (height, width, filename, and so on) that create_tf_example then maps to the image/height, image/width, image/filename features. Note that xmin, xmax, ymin and ymax are lists whose element values lie in [0, 1]: they have all been normalized by the image width and height. The following example code assembles the tfrecord files:

import cv2
import glob
import os
import xml.etree.ElementTree as ET
import tensorflow as tf
from object_detection.utils import dataset_util
from PIL import Image
import io
import random

train_writer = tf.python_io.TFRecordWriter("./dms_train.tfrecords")
test_writer = tf.python_io.TFRecordWriter("./dms_test.tfrecords")
examples = []

annotation_files = [
    "./annotations/2019-06-13.xml",
    "./annotations/2019-06-25.xml"
]

for annotation_file in annotation_files:
    # annotation_file = "./annotations/1468_imglab.xml"
    root = ET.parse(annotation_file).getroot()
    image_folder = annotation_file.replace("./annotations/", "")
    image_folder = image_folder.replace(".xml", "")

    for image in root.findall("images/image"):
        filename = image.get("file")
        filename = os.path.join("./image-data/", image_folder, filename)
        boxes = image.findall("box")

        # Read the encoded jpeg bytes and query the image size.
        with tf.gfile.GFile(filename, 'rb') as fid:
            encoded_jpg = fid.read()
        encoded_jpg_io = io.BytesIO(encoded_jpg)
        img = Image.open(encoded_jpg_io)
        width, height = img.size

        xmins = []
        ymins = []
        xmaxs = []
        ymaxs = []
        texts = []
        labels = []
        example = {}

        for box in boxes:
            # Normalize the box coordinates by the image width/height.
            x = int(box.get("left")) / float(width)
            y = int(box.get("top")) / float(height)
            w = int(box.get("width")) / float(width)
            h = int(box.get("height")) / float(height)
            label_text = box.findall("label")[0].text

            # Clamp the normalized box to [0, 1].
            xmin = max(x, 0.0)
            ymin = max(y, 0.0)
            xmax = min(x + w, 1.0)
            ymax = min(y + h, 1.0)

            if label_text == "face":
                label = 1
            elif label_text == "phone":
                label = 2
            elif label_text == "smoke":
                label = 3
            else:
                continue  # Ignore boxes with unknown labels.
            labels.append(label)
            texts.append(label_text.encode('utf8'))
            xmins.append(xmin)
            ymins.append(ymin)
            xmaxs.append(xmax)
            ymaxs.append(ymax)

        if len(xmins) > 0:
            example['filename'] = filename.encode('utf8')
            example['image'] = encoded_jpg
            example['format'] = b'jpg'
            example['height'] = height
            example['width'] = width
            example['xmin'] = xmins
            example['xmax'] = xmaxs
            example['ymin'] = ymins
            example['ymax'] = ymaxs
            example['label'] = labels
            example['text'] = texts
            examples.append(example)

# Shuffle, then send the first 100 examples to the test set and the rest to training.
idx = 0
random.shuffle(examples)
for example in examples:
    tf_example = create_tf_example(example)
    if idx < 100:
        test_writer.write(tf_example.SerializeToString())
    else:
        train_writer.write(tf_example.SerializeToString())
    idx += 1

train_writer.close()
test_writer.close()
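Before kicking off training, it can be worth sanity-checking the generated files. Below is a minimal sketch using the same TF 1.x API as the script above (the file name matches the writer above; everything else is just illustrative):

import tensorflow as tf

# Iterate over the serialized records, count them, and peek at the first one.
count = 0
for record in tf.python_io.tf_record_iterator("./dms_train.tfrecords"):
    if count == 0:
        ex = tf.train.Example()
        ex.ParseFromString(record)
        print(sorted(ex.features.feature.keys()))  # e.g. image/encoded, image/height, ...
        print(ex.features.feature['image/height'].int64_list.value)
    count += 1
print("total examples:", count)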

  • Preprocessing function setup

The Object Detection API hard-codes its preprocessing function; by default it applies value * (2/255) - 1, mapping pixel values into [-1, 1]. If you need different preprocessing, e.g. whitening with ImageNet mean/std values, you can modify preprocess in models/ssd_mobilenet_v1_feature_extractor.py as follows:

def preprocess(self, resized_inputs):
    """SSD preprocessing.

    Whitens pixel values with fixed mean/std instead of mapping to [-1, 1].

    Args:
      resized_inputs: a [batch, height, width, channels] float tensor
        representing a batch of images.

    Returns:
      preprocessed_inputs: a [batch, height, width, channels] float tensor
        representing a batch of images.
    """
    # Approximate ImageNet per-channel mean and std.
    means = tf.constant((123.00, 123.00, 123.00), dtype=tf.float32)
    stds = tf.constant((58.000, 58.000, 58.000), dtype=tf.float32)
    output = tf.subtract(resized_inputs, means)
    output = tf.divide(output, stds)
    return output
    # Default behavior: return (2.0 / 255.0) * resized_inputs - 1.0
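Whatever you change here must be mirrored at inference time on the device, otherwise the deployed detector sees differently scaled inputs than it was trained on. Here is a minimal NumPy/OpenCV sketch of the equivalent whitening (my own illustration, assuming the 224x224 input used later in this post; not code from the API):

import cv2
import numpy as np

def preprocess_for_inference(bgr_image):
    # The training pipeline feeds RGB images, so convert from OpenCV's BGR first.
    rgb = cv2.cvtColor(bgr_image, cv2.COLOR_BGR2RGB)
    resized = cv2.resize(rgb, (224, 224)).astype(np.float32)
    # Same mean/std as in the modified preprocess() above.
    return (resized - 123.0) / 58.0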

  • Model selection

Model selection really means choosing the Mobilenet-SSD parameters that suit your business scenario, and these are set in the model's config file. The parameters that control model size are the input width and height, depth_multiplier (which scales the number of channels produced by each depthwise layer), and the internal parameters of the anchor_generator. For close-range face detection, for example, the input can be quite small: a 224x224 input and a depth_multiplier of 0.5 already meet the business requirement, and such a model runs in near real time on today's low- to mid-range embedded devices such as the RK3288 and RK3399. The anchor_generator exposes quite a few knobs: you can shorten the aspect_ratios list (fewer output anchors), adjust min_scale and max_scale (which affect sensitivity to large and small objects), or change num_layers (the number of feature layers fed into the detection layers; Mobilenet-SSD uses 6 by default), as illustrated in the sketch below.
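To build intuition for min_scale and max_scale, the following sketch reproduces the linear per-layer scale interpolation used by the ssd_anchor_generator (this mirrors the formula from the SSD paper; it is my own illustration, not code from the API):

def ssd_layer_scales(min_scale=0.2, max_scale=0.95, num_layers=6):
    # One anchor scale per feature layer, linearly spaced between the two bounds.
    return [min_scale + (max_scale - min_scale) * i / (num_layers - 1.0)
            for i in range(num_layers)]

print(ssd_layer_scales())  # [0.2, 0.35, 0.5, 0.65, 0.8, 0.95]

Lowering max_scale biases the anchors toward smaller objects; shortening the aspect_ratios list or reducing num_layers shrinks the number of anchors and thus the post-processing cost.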

  • Training parameter configuration

The training parameter configuration mainly affects detection quality, not detector speed. There are many knobs; typically you tune the optimization algorithm, batch size, learning rate and data augmentation. Data augmentation in particular deserves experimentation: random horizontal flips, random image value changes, random crops, and so on. Used well, it maximizes the value extracted from your training data. The sample configuration file below covers the details of both model selection and training parameter configuration:

# SSD with Mobilenet v1, configured for a custom dataset.
# Users should configure the fine_tune_checkpoint field in the train config as
# well as the label_map_path and input_path fields in the train_input_reader
# and eval_input_reader.
# TPU-compatible

model {
  ssd {
    num_classes: 1
    box_coder {
      faster_rcnn_box_coder {
        y_scale: 10.0
        x_scale: 10.0
        height_scale: 5.0
        width_scale: 5.0
      }
    }
    matcher {
      argmax_matcher {
        matched_threshold: 0.5
        unmatched_threshold: 0.5
        ignore_thresholds: false
        negatives_lower_than_unmatched: true
        force_match_for_each_row: true
      }
    }
    similarity_calculator {
      iou_similarity {
      }
    }
    anchor_generator {
      ssd_anchor_generator {
        num_layers: 6
        min_scale: 0.2
        max_scale: 0.95
        aspect_ratios: 1.0
        aspect_ratios: 2.0
        aspect_ratios: 0.5
        aspect_ratios: 3.0
        aspect_ratios: 0.3333
      }
    }
    image_resizer {
      fixed_shape_resizer {
        height: 224
        width: 224
      }
    }
    box_predictor {
      convolutional_box_predictor {
        min_depth: 0
        max_depth: 0
        num_layers_before_predictor: 0
        use_dropout: false
        dropout_keep_probability: 0.8
        kernel_size: 1
        box_code_size: 4
        apply_sigmoid_to_scores: false
        conv_hyperparams {
          activation: RELU_6,
          regularizer {
            l2_regularizer {
              weight: 0.00004
            }
          }
          initializer {
            truncated_normal_initializer {
              stddev: 0.03
              mean: 0.0
            }
          }
          batch_norm {
            train: true,
            scale: true,
            center: true,
            decay: 0.9997,
            epsilon: 0.001,
          }
        }
      }
    }
    feature_extractor {
      type: 'ssd_mobilenet_v1'
      min_depth: 16
      depth_multiplier: 0.5
      conv_hyperparams {
        activation: RELU_6,
        regularizer {
          l2_regularizer {
            weight: 0.00004
          }
        }
        initializer {
          truncated_normal_initializer {
            stddev: 0.03
            mean: 0.0
          }
        }
        batch_norm {
          train: true,
          scale: true,
          center: true,
          decay: 0.9997,
          epsilon: 0.001,
        }
      }
    }
    loss {
      localization_loss {
        weighted_smooth_l1 {
        }
      }
      classification_loss {
        weighted_sigmoid {
        }
      }
      hard_example_miner {
        num_hard_examples: 3000
        iou_threshold: 0.99
        loss_type: BOTH
        max_negatives_per_positive: 3
        min_negatives_per_image: 3
      }
      classification_weight: 1.0
      localization_weight: 1.0
    }
    normalize_loss_by_num_matches: true
    post_processing {
      batch_non_max_suppression {
        score_threshold: 1e-8
        iou_threshold: 0.6
        max_detections_per_class: 50
        max_total_detections: 50
      }
      score_converter: SIGMOID
    }
  }
}

train_config: {
  batch_size: 32
  num_batch_queue_threads: 1
  batch_queue_capacity: 2000
  optimizer {
    rms_prop_optimizer: {
      learning_rate: {
        exponential_decay_learning_rate {
          initial_learning_rate: 0.001
          decay_steps: 18750
          decay_factor: 0.5
        }
      }
      momentum_optimizer_value: 0.9
      decay: 0.9
      epsilon: 1.0
    }
  }
  # fine_tune_checkpoint: "/home/shuai/models/ssd_mobilenet_v1_0.75_depth_300x300_coco14_sync_2018_07_03/model.ckpt"
  fine_tune_checkpoint: "/home/data/zhangxd/train_face_0.5_224x224/model.ckpt-200058"
  from_detection_checkpoint: true
  load_all_detection_checkpoint_vars: false
  # Note: The below line limits the training process to 200K steps, which we
  # empirically found to be sufficient. Remove the below line to train
  # indefinitely.
  num_steps: 200000
  data_augmentation_options {
    random_adjust_brightness {
    }
  }
  data_augmentation_options {
    random_image_scale {
    }
  }
  data_augmentation_options {
    random_jitter_boxes {
    }
  }
  data_augmentation_options {
    ssd_random_crop {
    }
  }
  data_augmentation_options {
    random_horizontal_flip {
    }
  }
  max_number_of_boxes: 50
  unpad_groundtruth_tensors: false
}

train_input_reader: {
  tf_record_input_reader {
    input_path: "/home/data/zhangxd/face_det_data/face_det_train.tfrecords"
  }
  label_map_path: "/home/data/zhangxd/face_det_data/face_label_map.pbtxt"
  num_readers: 1
  prefetch_size: 256
  read_block_length: 32
}

eval_config: {
  num_examples: 1000
  visualization_export_dir: "/home/data/zhangxd/visualization"
  visualize_groundtruth_boxes: true
  min_score_threshold: 0.5
  num_visualizations: 100
  include_metrics_per_category: true
}

eval_input_reader: {
  tf_record_input_reader {
    input_path: "/home/data/zhangxd/face_det_data/face_det_test.tfrecords"
  }
  label_map_path: "/home/data/zhangxd/face_det_data/face_label_map.pbtxt"
  shuffle: false
  num_readers: 1
  prefetch_size: 32
  read_block_length: 16
}
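With the tfrecords and the pipeline config in place, training is launched with the official scripts. For reference, a typical invocation looks like the following sketch (the paths are placeholders, and the flags are those of object_detection/model_main.py in the TF 1.x models repo, so check your checkout):

# Run from tensorflow/models/research; paths below are placeholders.
python object_detection/model_main.py \
    --pipeline_config_path=/path/to/your_pipeline.config \
    --model_dir=/path/to/train_dir \
    --alsologtostderr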


  • Wrapping up

With all the above preparation done, you can train using the official training scripts. At this point, you have designed and trained a detector that runs in near real time on most embedded devices. If you are interested in embedded CNN deployment, you can also check my earlier posts in this column and try deploying your freshly trained lightweight Mobilenet-SSD; the links are in the references below. Comments and follows are welcome. Thanks!

  • References

糖心他爸, "實戰MNN之Mobilenet SSD部署(含源碼)" (Hands-on Mobilenet SSD deployment with MNN, source included), zhuanlan.zhihu.com

糖心他爸, "詳解MNN的tf-MobilenetSSD-cpp部署流程" (A walkthrough of the MNN tf-MobilenetSSD C++ deployment flow), zhuanlan.zhihu.com

糖心他爸, "使用NNAPI加速android-tflite的Mobilenet分類器" (Accelerating an android-tflite Mobilenet classifier with NNAPI), zhuanlan.zhihu.com

糖心他爸, "使用TVM在android中進行Mobilenet SSD部署" (Deploying Mobilenet SSD on Android with TVM), zhuanlan.zhihu.com
