Paper notes: Similarity of Neural Network Representations Revisited (ICML 2019)

Title: Similarity of Neural Network Representations Revisited (ICML 2019)

Authors: Simon Kornblith et al. (ICML 2019, Long Beach, California)

Contents

Aim:

Invariance properties of similarity indexes: three aspects

Comparing Similarity Structures

Related Similarity Indexes

Results

Conclusion and Future Work


Aim:

  • Key idea: one can first measure the similarity between every pair of examples in each representation separately, and then compare the resulting similarity structures.

Invariance properties of similarity indexes: three aspects

1. Invariance to Invertible Linear Transformation

Definition: a similarity index is invariant to invertible linear transformation if s(X, Y) = s(XA, YB) for any full-rank A and B (a small numeric check follows at the end of this subsection).

Key sentences:

  • We demonstrate that early layers, but not later layers, learn similar representations on different datasets.
  • Invariance to invertible linear transformation implies that the scale of directions in activation space is irrelevant.
  • Neural networks trained from different random initializations develop representations with similar large principal components; consequently, similarity measures that depend mostly on those dominant components (e.g. Euclidean distances between examples) agree across networks. A similarity index that is invariant to invertible linear transformation ignores this aspect of the representation and assigns the same score to networks that match only in their large principal components as to networks that match only in their small principal components.
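As a concrete illustration of the definition above, here is a minimal numeric sketch (my own, not from the paper) using the mean squared canonical correlation R^2_CCA, one of the indexes discussed later in these notes, which is invariant to invertible linear transformation. The function name and the normalization by the smaller width are my choices.

```python
import numpy as np

def mean_squared_cca(X, Y):
    """R^2_CCA: mean squared canonical correlation between the column spaces
    of activation matrices X (n, p1) and Y (n, p2)."""
    X = X - X.mean(axis=0)                 # center each feature
    Y = Y - Y.mean(axis=0)
    Qx, _ = np.linalg.qr(X)                # orthonormal basis for span(X)
    Qy, _ = np.linalg.qr(Y)                # orthonormal basis for span(Y)
    # squared singular values of Qy^T Qx are the squared canonical correlations
    return np.linalg.norm(Qy.T @ Qx, "fro") ** 2 / min(X.shape[1], Y.shape[1])

rng = np.random.default_rng(0)
n, p = 200, 10
X, Y = rng.normal(size=(n, p)), rng.normal(size=(n, p))
A, B = rng.normal(size=(p, p)), rng.normal(size=(p, p))  # generically full rank

# s(X, Y) == s(XA, YB): a CCA-based index is unchanged by invertible
# linear transformation of either representation.
print(np.isclose(mean_squared_cca(X, Y), mean_squared_cca(X @ A, Y @ B)))  # True
```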

2. Invariance to Orthogonal Transformation

Definition: s(X, Y) = s(XU, YV) for full-rank orthonormal matrices U and V such that U^T U = I and V^T V = I.

Key sentences:

  • Orthogonal transformations preserve scalar products and Euclidean distances between examples.
  • Invariance to orthogonal transformation implies invariance to permutation, which is needed to accommodate symmetries of neural networks.

3. Invariance to Isotropic Scaling

Definition: s(X, Y) = s(αX, βY) for any α, β ∈ R+.

Key sentences:

  • A similarity index that is invariant to both orthogonal transformation and non-isotropic scaling (i.e. rescaling of individual features) is invariant to any invertible linear transformation; this follows from the existence of the singular value decomposition of the transformation matrix (see the short derivation after this list).
  • We are interested in similarity indexes that are invariant to isotropic but not necessarily non-isotropic scaling.
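Spelling out the SVD remark above (my own expansion of the paper's one-line argument): if an index ignores orthogonal transformations and non-isotropic scaling of the features, it necessarily ignores any invertible linear transformation.

```latex
% Any invertible A has an SVD A = U \Sigma V^{\top}, with U, V orthogonal and
% \Sigma diagonal with strictly positive entries (a non-isotropic scaling).
\begin{aligned}
s(XA, Y) &= s(XU\Sigma V^{\top}, Y) \\
         &= s(XU\Sigma, Y) && \text{(invariance to the orthogonal } V^{\top}\text{)} \\
         &= s(XU, Y)       && \text{(invariance to the non-isotropic scaling } \Sigma\text{)} \\
         &= s(X, Y)        && \text{(invariance to the orthogonal } U\text{)}.
\end{aligned}
```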

Comparing Similarity Structures

If we use an inner product to measure similarity, the similarity between representational similarity matrices reduces to another intuitive notion of pairwise feature similarity.

1. Dot Product-Based Similarity.
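Written out, the reduction is the identity below (for centered X and Y): the left-hand side compares the n × n similarity structures over examples, while the right-hand side sums squared dot products between every pair of features in X and Y.

```latex
\langle \operatorname{vec}(XX^{\top}), \operatorname{vec}(YY^{\top}) \rangle
  = \operatorname{tr}\!\left(XX^{\top} YY^{\top}\right)
  = \lVert Y^{\top} X \rVert_{\mathrm{F}}^{2}
```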

2. Hilbert-Schmidt Independence Criterion.

  • HSIC was proposed as a test statistic for determining whether two sets of variables are independent; however, HSIC is not an estimator of mutual information.
  • HSIC is not invariant to isotropic scaling, but it can be made invariant through normalization (a minimal estimator sketch follows after this list).
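A minimal sketch of the empirical HSIC estimator, HSIC(K, L) = tr(KHLH) / (n − 1)^2, with H the centering matrix; the linear kernel below is just for illustration.

```python
import numpy as np

def hsic(K, L):
    """Empirical HSIC between n x n Gram matrices K and L:
    HSIC(K, L) = tr(K H L H) / (n - 1)^2, with H = I - (1/n) 1 1^T."""
    n = K.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n    # centering matrix
    return np.trace(K @ H @ L @ H) / (n - 1) ** 2

rng = np.random.default_rng(1)
X, Y = rng.normal(size=(50, 8)), rng.normal(size=(50, 8))
K, L = X @ X.T, Y @ Y.T                    # linear-kernel Gram matrices
print(hsic(K, L))
print(hsic(2.0 * K, L))                    # differs: HSIC itself is not scale-invariant
```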

3. Centered Kernel Alignment.

  • Kernel selection: RBF kernel k(x_i, x_j) = exp(−||x_i − x_j||^2_2 / (2σ^2)).
  • In practice, we find that RBF and linear kernels give similar results across most experiments (a sketch of both kernels follows after this list).
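Below is a hedged sketch of CKA as these notes describe it: CKA(K, L) = HSIC(K, L) / sqrt(HSIC(K, K) · HSIC(L, L)), which for a linear kernel reduces to ||Y^T X||^2_F / (||X^T X||_F ||Y^T Y||_F). Setting the RBF bandwidth σ as a fraction of the median pairwise distance follows the paper's description as I recall it; the particular fraction below is an arbitrary choice.

```python
import numpy as np

def _center(K):
    """Doubly center a Gram matrix: H K H with H = I - (1/n) 1 1^T."""
    n = K.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n
    return H @ K @ H

def cka(K, L):
    """CKA between Gram matrices: HSIC(K, L) / sqrt(HSIC(K, K) HSIC(L, L))."""
    Kc, Lc = _center(K), _center(L)
    hsic_kl = np.sum(Kc * Lc)              # tr(Kc Lc) for symmetric matrices
    return hsic_kl / np.sqrt(np.sum(Kc * Kc) * np.sum(Lc * Lc))

def linear_cka(X, Y):
    """Closed form for a linear kernel: ||Y^T X||_F^2 / (||X^T X||_F ||Y^T Y||_F)."""
    X = X - X.mean(axis=0)
    Y = Y - Y.mean(axis=0)
    return (np.linalg.norm(Y.T @ X, "fro") ** 2
            / (np.linalg.norm(X.T @ X, "fro") * np.linalg.norm(Y.T @ Y, "fro")))

def rbf_gram(X, sigma_frac=0.5):
    """RBF Gram matrix with sigma set to a fraction of the median pairwise
    distance between examples (the fraction is an illustrative choice)."""
    sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    sigma = sigma_frac * np.sqrt(np.median(sq))
    return np.exp(-sq / (2 * sigma ** 2))

rng = np.random.default_rng(2)
n, p = 100, 10
X, Y = rng.normal(size=(n, p)), rng.normal(size=(n, p))
U, _ = np.linalg.qr(rng.normal(size=(p, p)))          # random orthogonal matrix

base = linear_cka(X, Y)
print(np.isclose(base, linear_cka(X @ U, Y)))          # True: orthogonal invariance
print(np.isclose(base, linear_cka(3.1 * X, 0.2 * Y)))  # True: isotropic scaling
print(np.isclose(base, cka(X @ X.T, Y @ Y.T)))         # matches the Gram-matrix form
print(cka(rbf_gram(X), rbf_gram(Y)))                   # RBF-kernel CKA
```

The checks at the end tie this subsection back to the invariance discussion: linear CKA is unchanged by orthogonal transformation and isotropic scaling of either representation, but (unlike CCA) not by arbitrary invertible linear transformations.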

Related Similarity Indexes

1. Linear Regression.

We are unaware of any application of linear regression to measuring similarity of neural network representations.
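For completeness, a sketch of the regression-based index as I understand it, R^2 = 1 − min_B ||X − YB||^2_F / ||X||^2_F, i.e. the fraction of variance in X explained by a linear fit from Y; note that it is asymmetric. The function name is mine.

```python
import numpy as np

def linear_regression_r2(X, Y):
    """Fraction of variance in X explained by a linear fit from Y:
    R^2 = 1 - min_B ||X - Y B||_F^2 / ||X||_F^2 (asymmetric in X and Y)."""
    X = X - X.mean(axis=0)
    Y = Y - Y.mean(axis=0)
    B, *_ = np.linalg.lstsq(Y, X, rcond=None)   # least-squares fit of X from Y
    residual = X - Y @ B
    return 1.0 - np.linalg.norm(residual, "fro") ** 2 / np.linalg.norm(X, "fro") ** 2

rng = np.random.default_rng(3)
X = rng.normal(size=(100, 10))
W = rng.normal(size=(10, 10))
print(linear_regression_r2(X, X @ W))                        # ~1.0: X is a linear function of XW
print(linear_regression_r2(X, rng.normal(size=(100, 10))))   # much smaller for unrelated Y
```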

2. Canonical Correlation Analysis (CCA)

The mean CCA correlation ρ̄_CCA was previously used to measure similarity between neural network representations.
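For reference, the CCA summary statistics as I recall them from the paper, where Q_X and Q_Y are orthonormal bases for the column spaces of X and Y, ρ_i are the canonical correlations, and p_1 is the width of the narrower representation:

```latex
\rho_i = i\text{-th singular value of } Q_Y^{\top} Q_X, \qquad
\bar{\rho}_{\mathrm{CCA}} = \frac{1}{p_1} \sum_i \rho_i, \qquad
R^2_{\mathrm{CCA}} = \frac{\lVert Q_Y^{\top} Q_X \rVert_{\mathrm{F}}^{2}}{p_1}
                   = \frac{1}{p_1} \sum_i \rho_i^{2}
```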

3. SVCCA.

SVCCA first reduces each representation with a truncated SVD and then applies CCA; it is invariant to invertible linear transformation only if the retained subspace does not change.
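A hedged sketch of SVCCA as I understand it: project each representation onto the top singular directions that explain a fixed fraction of its variance, then run CCA on the truncated representations. The 0.99 threshold is the commonly quoted choice and should be treated as an assumption here.

```python
import numpy as np

def truncate_by_variance(X, frac=0.99):
    """Project centered X onto the top singular directions explaining `frac`
    of its variance (the SVD truncation step of SVCCA, as I understand it)."""
    X = X - X.mean(axis=0)
    U, s, _ = np.linalg.svd(X, full_matrices=False)
    explained = np.cumsum(s ** 2) / np.sum(s ** 2)
    k = int(np.searchsorted(explained, frac)) + 1
    return U[:, :k] * s[:k]                      # n x k reduced representation

def mean_cca(X, Y):
    """Mean canonical correlation between the column spaces of X and Y."""
    Qx, _ = np.linalg.qr(X - X.mean(axis=0))
    Qy, _ = np.linalg.qr(Y - Y.mean(axis=0))
    rho = np.linalg.svd(Qy.T @ Qx, compute_uv=False)
    return rho.mean()

def svcca(X, Y, frac=0.99):
    return mean_cca(truncate_by_variance(X, frac), truncate_by_variance(Y, frac))

rng = np.random.default_rng(4)
X, Y = rng.normal(size=(200, 30)), rng.normal(size=(200, 30))
print(svcca(X, Y))
```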

4. Projection-Weighted CCA.

Projection-weighted CCA is closely related to linear regression.
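If I recall projection-weighted CCA (Morcos et al.) correctly, it averages the canonical correlations with weights measuring how much of X each canonical variable accounts for, which is what ties it to regression:

```latex
\rho_{\mathrm{PW}} = \frac{\sum_i \alpha_i \rho_i}{\sum_i \alpha_i}, \qquad
\alpha_i = \sum_j \lvert \langle \mathbf{h}_i, \mathbf{x}_j \rangle \rvert
```

where h_i is the i-th canonical variable of X and x_j is the j-th column (neuron) of X.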

5. Neuron Alignment Procedures.

Earlier neuron-alignment work found that the maximum matching subsets are very small for intermediate layers.

Summary:

SVCCA and projection-weighted CCA were also motivated by the idea that eigenvectors that correspond to small eigenvalues are less important, but
linear CKA incorporates this weighting symmetrically and can be computed without a matrix decomposition.

Results

1. A Sanity Check

Aim: Given a pair of architecturally identical networks trained from different random initializations, for each layer in the first network, the most similar layer in the second network should be the architecturally corresponding layer.

  • Results on a simple VGG-like convolutional network based on All-CNN-C: only CKA passes.
  • Results on Transformer networks (all layers are of equal width): all indexes pass. (A sketch of the check itself follows after this list.)
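A sketch of how one might run this sanity check, assuming `acts_a` and `acts_b` are lists of per-layer activation matrices for the same examples from two networks trained from different seeds; the random arrays below are placeholders for shape only, and the linear CKA is the closed form given earlier in these notes.

```python
import numpy as np

def linear_cka(X, Y):
    X = X - X.mean(axis=0)
    Y = Y - Y.mean(axis=0)
    return (np.linalg.norm(Y.T @ X, "fro") ** 2
            / (np.linalg.norm(X.T @ X, "fro") * np.linalg.norm(Y.T @ Y, "fro")))

def sanity_check(acts_a, acts_b):
    """acts_a, acts_b: lists of (n_examples, n_features) activation matrices,
    one per layer, from two architecturally identical networks."""
    sim = np.array([[linear_cka(a, b) for b in acts_b] for a in acts_a])
    # the index passes if, for every layer i of network A, the most similar
    # layer of network B is the architecturally corresponding layer i
    return np.all(np.argmax(sim, axis=1) == np.arange(len(acts_a))), sim

# toy stand-in for real activations (hypothetical data, for shape only)
rng = np.random.default_rng(5)
acts_a = [rng.normal(size=(256, 64)) for _ in range(8)]
acts_b = [rng.normal(size=(256, 64)) for _ in range(8)]
passed, sim = sanity_check(acts_a, acts_b)
print(passed)  # usually False on random data; real activations are needed
```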

2. Using CKA to Understand Network Architectures

  • Left figure: when CKA stops changing as more layers are added, accuracy also does not change; 1x depth denotes the base network, while 2x and 4x denote how many times each layer is repeated.
  • Right figure: layers in the same block group (i.e. at the same feature map scale) are more similar than layers in different block groups.
  • Right figure: activations inside residual blocks differ from the post-residual activations, whereas post-residual activations are similar to one another.
  • CKA is equally effective at revealing relationships between layers of different architectures. As networks are made deeper, the new layers are effectively inserted in between the old layers.
  • Increasing layer width leads to more similar representations between networks.

3. Across Datasets

  • Networks trained on CIFAR-10 and CIFAR-100 develop similar representations in their early layers.

Conclusion and Future Work

  • CKA consistently identifies correspondences between layers, not only in the same network trained from different initializations, but across entirely different architectures, whereas other methods do not.

  • CKA captures intuitive notions of similarity, i.e. that neural networks trained from different initializations should be similar to each other.