Bilinear Convolutional Neural Network Model (Bilinear CNN)

ICCV 2015

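Abstract

Bilinear CNN model: it consists of two feature extractors whose outputs are multiplied using the outer product and then pooled to obtain an image descriptor.
There are many ways to do feature fusion, so three fusion methods were compared on two datasets.

Many widely used texture representations can be written as the outer product of two suitably designed features.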

| Dataset | Fusion | Fold-1 | Fold-2 | Fold-3 | Fold-4 | Fold-5 | avg |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 2018Data | outer product | 0.9030987115323181 | 0.8406864550342491 | 0.8960236365874678 | 0.8698119534655764 | 0.8870514957452408 | 0.8793344504729704 |
| 2018Data | addition | 0.8787941572568113 | 0.8582870620608324 | 0.8855902251842876 | 0.8549717274210491 | 0.8923102154615716 | 0.8739906774769104? |
| 2018Data | concat | | | | | | |
| 2020Data | outer product | 0.8738744458199748 | 0.883488966943666 | 0.8823698495351177 | 0.888978422419196 | 0.8865737595481854 | 0.883057088853228 |
| 2020Data | addition | 0.857781431434003 | 0.8530718361846381 | 0.856447951572803 | 0.8862443122156107 | 0.886259132838726 | 0.8672857097715232? |
| 2020Data | concat | 0.8613690840605577 | 0.8715629417205214 | 0.8735109378384232 | 0.8908408664506225 | 0.8895555184850699 | 0.8773678697110389? |
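
The three fusion modes compared above can be sketched on two small feature vectors. This is only an illustration in NumPy: the function name `fuse` and the mode strings are hypothetical, and the code used for the experiments may differ.

```python
import numpy as np

def fuse(a, b, mode):
    """Fuse two 1-D feature vectors (hypothetical helper for illustration)."""
    if mode == "outer":
        # Outer product: every pairwise product a[i] * b[j], flattened to M*N values.
        return np.outer(a, b).ravel()
    if mode == "add":
        # Element-wise addition: requires both vectors to have the same length.
        return a + b
    if mode == "concat":
        # Concatenation: output length is M + N.
        return np.concatenate([a, b])
    raise ValueError(f"unknown fusion mode: {mode}")

a = np.array([1.0, 2.0, 3.0])
b = np.array([4.0, 5.0, 6.0])
for mode in ("outer", "add", "concat"):
    print(mode, fuse(a, b, mode).shape)
```

Only the outer product keeps every pairwise interaction between the two features, which is also why its output is the largest of the three.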

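The pooling function aggregates the bilinear combinations over all locations to obtain global information about the image. The code uses sum pooling, which is orderless.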
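
A small check of the orderless property, assuming the two feature outputs are stored as arrays of shape `(L, M)` and `(L, N)` for `L` locations (the shapes and names here are illustrative assumptions): shuffling the locations leaves the sum-pooled result unchanged.

```python
import numpy as np

rng = np.random.default_rng(0)
L, M, N = 6, 4, 3                      # locations, channels of A, channels of B
fa = rng.standard_normal((L, M))
fb = rng.standard_normal((L, N))

def sum_pool_bilinear(fa, fb):
    # Sum over locations l of outer(fa[l], fb[l]) is the matrix product fa.T @ fb.
    return fa.T @ fb

perm = rng.permutation(L)              # apply the same reordering to both features
pooled = sum_pool_bilinear(fa, fb)
pooled_shuffled = sum_pool_bilinear(fa[perm], fb[perm])
print(np.allclose(pooled, pooled_shuffled))
```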

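Advantages:
- Models local pairwise feature interactions in a translation-invariant manner, which suits fine-grained image classification
- Generalizes various orderless texture descriptors
- The bilinear form simplifies gradient computation.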
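
To make the last point concrete, here is a sketch under assumed notation (not taken from the notes): let $A \in \mathbb{R}^{L \times M}$ and $B \in \mathbb{R}^{L \times N}$ be the outputs of the two feature extractors over $L$ locations, $x = A^{T} B$ the sum-pooled bilinear feature, and $\ell$ the loss. The chain rule then gives both gradients as a single matrix product each:

$$
x = A^{T} B, \qquad
\frac{\partial \ell}{\partial A} = B \left( \frac{\partial \ell}{\partial x} \right)^{T}, \qquad
\frac{\partial \ell}{\partial B} = A \, \frac{\partial \ell}{\partial x}
$$
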
Experimental results
- Achieves 84.1% accuracy on the CUB200-2011 dataset.

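1. Introduction

Fine-grained recognition

Classifying objects that belong to the same basic category into subcategories usually requires recognizing highly localized features that are independent of pose and location in the image. For example, distinguishing a "California gull" from a "Ring-billed gull" requires recognizing subtle differences in body color and texture, or in feather color.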

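Common techniques fall into two groups:

- Part-based models: first localize the parts, then extract features from them to build the image descriptor. Drawback: appearance usually changes with location, pose, and viewpoint.
- Holistic models: construct a feature representation of the whole image directly. These include classical image representations such as Bag-of-Visual-Words and its many variants suited to texture analysis.

CNN-based part models require part annotations on the training images, which are expensive to obtain, and some classes, such as textures and scenes, have no clearly defined parts.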

contribution

It consists of two feature extractors based on CNNs whose outputs are multiplied using the outer product at each location of the image and pooled across locations to obtain an image descriptor.

The outer product captures pairwise correlations between the feature channels and can model part-feature interactions, e.g., if one of the networks was a part detector and the other a local feature extractor.

M-Net: 52.7%, 58.8%
D-Net: 61.0%, 70.4%
M-Net + D-Net: 84.1% (bilinear model)

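Experimental results

The authors evaluate the model on fine-grained recognition datasets of birds, aircraft, and cars. The results show that B-CNN outperforms existing models on most fine-grained datasets, including models trained with part-level supervision, while remaining quite efficient.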

2. Bilinear models for image classification

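**Model:** it consists of two feature extractors whose outputs are multiplied using the outer product and pooled to obtain an image descriptor.

In the code, the two feature-extractor networks are merged into a single network.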

Bilinear pooling:

Bilinear model $B$

$B$ is a quadruple:

$B = (f_A, f_B, P, C)$

- feature functions: $f_A, f_B$
- pooling function: $P$
- classification function: $C$
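
As a sketch of how the four components fit together (the location index $l$, the image $I$, and the output sizes $c \times M$ and $c \times N$ are notation assumed here for illustration), the two feature outputs are combined at each location by a matrix outer product, and $P$ aggregates the result over all locations, here with a sum:

$$
\mathrm{bilinear}(l, I, f_A, f_B) = f_A(l, I)^{T} f_B(l, I), \qquad
\phi(I) = \sum_{l} \mathrm{bilinear}(l, I, f_A, f_B)
$$

With $f_A(l, I) \in \mathbb{R}^{c \times M}$ and $f_B(l, I) \in \mathbb{R}^{c \times N}$, each bilinear combination is an $M \times N$ matrix, and $C$ classifies the pooled descriptor $\phi(I)$.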

code
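- First, the two features at the same location are combined by the bilinear feature combination (matrix outer product), giving the matrix $b$.
- Then sum pooling is applied to $b$ over all locations, giving the matrix $\xi$.
- $\xi$ is flattened into a vector, called the bilinear vector $x$.
- Moment normalization (the signed square root, $y = \operatorname{sign}(x)\sqrt{|x|}$) and L2 normalization are applied to $x$, giving the fused feature $z$.
- Finally, $z$ is used for fine-grained classification.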
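
The steps above can be sketched in NumPy as follows. This is a minimal illustration, not the code the notes refer to: the function name `bilinear_pool`, the `(L, M)` / `(L, N)` array layout, and the small `eps` guard against division by zero are all assumptions.

```python
import numpy as np

def bilinear_pool(fa, fb, eps=1e-12):
    """Fuse features fa (L, M) and fb (L, N) taken at the same L locations."""
    # b[l] = outer(fa[l], fb[l]); summing over l gives xi with shape (M, N).
    xi = np.einsum("lm,ln->mn", fa, fb)
    # Flatten xi into the bilinear vector x of length M * N.
    x = xi.reshape(-1)
    # Signed square root, then L2 normalization.
    y = np.sign(x) * np.sqrt(np.abs(x))
    z = y / (np.linalg.norm(y) + eps)
    return z

rng = np.random.default_rng(0)
fa = rng.standard_normal((49, 8))   # e.g. a 7x7 feature map with 8 channels
fb = rng.standard_normal((49, 5))   # same 49 locations, 5 channels
z = bilinear_pool(fa, fb)
print(z.shape, np.linalg.norm(z))
```

The resulting `z` has length 8 × 5 = 40 and unit L2 norm, and would be passed to a classifier.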

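Excessive dimensionality: the output dimension is the product of the dimensions of features A and B. This amounts to enumerating pairings: there are 512×512 feature representations, many pairwise combinations of features are used, and the features are redundant.

Think of one channel as one feature map, that is, one kind of feature. For example, with 512 feature maps before the bilinear operation there are 512×512 feature maps after it: the 512 features are paired with the other 512 features, giving 512×512 feature combinations.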
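
A quick calculation of what this dimensionality means in practice. The 200-class linear classifier is an assumption for illustration (200 is the number of classes in CUB200-2011); the variable names are hypothetical.

```python
import numpy as np

channels_a = channels_b = 512
a = np.zeros(channels_a, dtype=np.float32)
b = np.zeros(channels_b, dtype=np.float32)

# One outer product pairs every channel of A with every channel of B.
bilinear_dim = np.outer(a, b).size
print(bilinear_dim)                # 262144

# Weights of a linear classifier on top of the bilinear vector (assumed 200 classes).
num_classes = 200
print(bilinear_dim * num_classes)  # 52428800
```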