Co-occurrence to reweight class confidences
3 minute read

This is a neat little trick that I don’t think I ever shoe-horned into a publication. It’s essentially a means to leverage side information in the form of NLP corpora to improve multi-class classification.

In short, the trick is: multiply output confidence scores by a co-occurrence matrix. This helps your model predict classes that are more likely to appear together, e.g. co-occurring classes like “zebra” and “giraffe” should boost each other’s scores, while “zebra” and “chaise” should not. That’s the CliffsNotes version.

Assume you have some classification layer, probably a softmax layer, that you’re taking as confidence estimates across your classes. You’ve trained this model to predict, say, ImageNet classes. You’ve fed in a load of training data, with a particular distribution of one-hot annotation vectors. Even with something as big as ImageNet, you’re still bound by the limitations of your particular training set – and, crucially, its distribution of one-hot class labels.

Even if you only have one class label, most photos have more than one thing in them. A chair is much more likely to be in a living room with a rug than it is to be in a savannah with tall grass. Your model has learned some of this implicitly through the context of your training data, but you can leverage your extra knowledge to improve matters. You need to get an external source that contains information about the co-occurrence relationships of your target classes. In my case, I used English language corpora. If “chair” shows up with “rug” and “living room” more often than with “savannah” or “tall grass”, we can use that to our advantage.

You can choose your favorite co-occurrence measure, such as the Dice coefficient, or a more implicit one like distance within a semantic embedding. The end result you want is some structured encoding of inter-class relatedness in the form of an $N\times N$ matrix, $Q$, with 1s along the diagonal (“chair” should always be present when “chair” is present). Once you’ve generated this matrix, all you do is multiply your class confidence scores by it. Assuming you have some set of predictions, $\hat{y}$, in the form of an $M\times N$ matrix, your new predictions are given by:
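As a minimal sketch of one way to build $Q$, here is a Dice-coefficient version computed from a toy corpus of documents, each represented as a set of terms. The function name and corpus format are my own illustration, not a fixed recipe:

```python
import numpy as np

def dice_cooccurrence(docs, classes):
    """Build an N x N relatedness matrix Q from a corpus.

    docs: iterable of sets of terms (one set per document).
    classes: list of N class terms.
    Q[i, j] is the Dice coefficient 2*|A and B| / (|A| + |B|),
    with 1s along the diagonal.
    """
    # Number of documents mentioning each class term
    counts = [sum(1 for d in docs if c in d) for c in classes]
    n = len(classes)
    Q = np.eye(n)
    for i in range(n):
        for j in range(i + 1, n):
            both = sum(1 for d in docs if classes[i] in d and classes[j] in d)
            denom = counts[i] + counts[j]
            Q[i, j] = Q[j, i] = 2 * both / denom if denom else 0.0
    return Q
```

Any corpus that reflects your target domain will do; swap in pointwise mutual information or embedding similarity here if you prefer, as long as the diagonal stays 1.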

$$\hat{y}_{new} = \hat{y}\cdot Q$$

In essence, you’re re-scoring all the predictions with a little help from their friends. As a contrived example, imagine you have an input image that yields the following class confidence scores:

| Dog | Bone | Collar | Large Hadron Collider | Physicist |
|-----|------|--------|-----------------------|-----------|
| 0.3 | 0.2  | 0.2    | 0.3                   | 0.0       |

Looking at the raw scores, our algorithm thinks the input photo is either a dog or a particle accelerator, with equal probability. Luckily, we have a co-occurrence matrix, built from news stories, that tells us how frequently particular words show up together in a single article:

|           | Dog  | Bone | Collar | LHC  | Physicist |
|-----------|------|------|--------|------|-----------|
| Dog       | 1    | 0.2  | 0.3    | 0    | 0.01      |
| Bone      | 0.2  | 1    | 0.1    | 0    | 0         |
| Collar    | 0.3  | 0.1  | 1      | 0    | 0.05      |
| LHC       | 0    | 0    | 0      | 1    | 0.95      |
| Physicist | 0.01 | 0    | 0.05   | 0.95 | 1         |

We multiply our predictions by this matrix, and we find our new scores:

| Dog | Bone | Collar | Large Hadron Collider | Physicist |
|-----|------|--------|-----------------------|-----------|
| 0.4 | 0.28 | 0.31   | 0.3                   | 0.298     |

The extra evidence from the uncertainty spread across “bone” and “collar” has given the edge to “dog”. Notice the effect on “Physicist”: because the “Large Hadron Collider” was seldom mentioned without “physicist” also being mentioned, it receives a huge boost. This kind of “spreading the wealth” can happen. If it is undesirable, you can try variants like $(\hat{y}\cdot Q) * \hat{y}$ (where $*$ denotes element-wise multiplication), which gates the re-scored confidences by the original ones. Also note that, depending on your choice of co-occurrence measure, the re-weighted predictions may or may not be normalized.
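The whole re-scoring step is a one-liner in NumPy. Here is a sketch reproducing the toy example above, including the element-wise variant and an optional renormalization:

```python
import numpy as np

# Class order: Dog, Bone, Collar, LHC, Physicist
y_hat = np.array([[0.3, 0.2, 0.2, 0.3, 0.0]])  # M x N raw confidences (M = 1 here)

# N x N co-occurrence matrix Q, 1s on the diagonal
Q = np.array([
    [1.00, 0.20, 0.30, 0.00, 0.01],
    [0.20, 1.00, 0.10, 0.00, 0.00],
    [0.30, 0.10, 1.00, 0.00, 0.05],
    [0.00, 0.00, 0.00, 1.00, 0.95],
    [0.01, 0.00, 0.05, 0.95, 1.00],
])

# Re-score: each class borrows confidence from its frequent companions
y_new = y_hat @ Q  # → [[0.4, 0.28, 0.31, 0.3, 0.298]]

# Element-wise variant: gate by the original confidences, so classes the
# model never predicted (Physicist, at 0.0) can't ride in on co-occurrence alone
gated = (y_hat @ Q) * y_hat

# Optional: renormalize rows if you want the scores to sum to 1 again
normalized = y_new / y_new.sum(axis=1, keepdims=True)
```

Note how the gated variant zeroes “Physicist” back out, since its original confidence was 0.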

Well, there you go. Quick and painless, and I’ve found it can yield a small improvement, particularly when your training data isn’t large and representative of the natural world.

tags:  dice  nlp  softmax  machinelearning