Decoupling Features and Classes with Self-Organizing Class Embeddings
Classification with neural networks is weird! There, I said it!
We usually have a single output per class, as if each class were somehow its own feature. The numbers these outputs produce are then interpreted as a log-probability distribution over all the available classes. Everybody knows this doesn't quite make sense, yet we bake it into the loss as if it were a mathematical fact. And needing a separate output for every single class becomes insanely wasteful once you train on more than a few thousand classes: if your model is small, the output layer might well be bigger than the rest of the network combined.
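To make that concrete, here's a quick back-of-the-envelope sketch in PyTorch. The architecture and sizes are made up purely for illustration, but the imbalance they show is real: a tiny backbone next to a 100k-class head.

```python
import torch.nn as nn

num_classes = 100_000   # e.g. a large vocabulary or fine-grained label set
hidden_dim = 256

backbone = nn.Sequential(
    nn.Linear(512, hidden_dim), nn.ReLU(),
    nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
)
head = nn.Linear(hidden_dim, num_classes)  # one output per class

count = lambda m: sum(p.numel() for p in m.parameters())
print(f"backbone: {count(backbone):,} params")  # ~200K
print(f"head:     {count(head):,} params")      # ~25.7M, dwarfs the backbone
```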
You can at least get around the huge number of outputs with an embedding-based method, either by beheading a pretrained network or by doing some contrastive training. But the similarities you get out of those embeddings are hard to interpret, and you have no measure of certainty.
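Here's a minimal sketch of what that embedding route looks like. The encoder and the per-class prototypes below are placeholders, not any particular pretrained model; the point is just that the cosine scores rank classes without coming with any calibrated notion of confidence.

```python
import torch
import torch.nn.functional as F

d, num_classes = 128, 100_000
encoder = torch.nn.Linear(512, d)         # stand-in for a beheaded pretrained backbone
prototypes = torch.randn(num_classes, d)  # one embedding per class

x = torch.randn(1, 512)
z = F.normalize(encoder(x), dim=-1)
sims = z @ F.normalize(prototypes, dim=-1).T  # cosine similarities in [-1, 1]

# The scores order the classes, but a 0.83 vs. a 0.79 carries no
# probabilistic meaning -- there is no built-in measure of certainty.
print(sims.topk(5).values)
```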
So what can we do about it, you ask? I collected a few ideas...