Hybrid Semantic Bottlenecks: Bridging Interpretability and Performance in Deep Learning

Deep learning models have achieved remarkable success across a wide range of tasks, particularly in computer vision. However, this success often comes at the cost of transparency. As models grow more complex, understanding why a decision was made becomes increasingly difficult a problem commonly referred to as the black-box nature of deep learning.

Several approaches have attempted to address this issue, including post-hoc explanation methods and inherently interpretable models. Among them, Concept Bottleneck Models (CBMs) stand out by forcing predictions to pass through human-interpretable concepts. While effective from an interpretability standpoint, CBMs frequently suffer from a significant drop in predictive performance. The reason is simple: not all discriminative information can be neatly expressed in terms humans can name.

This raises a fundamental question:
Is interpretability necessarily at odds with performance?

A Hybrid Perspective

Instead of enforcing interpretability across the entire latent space, I explored a different idea:
what if we only constrain the parts of the model that can be meaningfully explained, while leaving the rest free to learn whatever is necessary for performance?

This intuition led to the design of a Hybrid Semantic Bottleneck architecture. The key idea is to explicitly split the internal representation into two components:

Human-defined semantic concepts, which are supervised and fully auditable
Unconstrained latent representations, which capture residual information that cannot be easily expressed as concepts

The final prediction is made using both components, but only the human-defined concepts are subject to semantic supervision. This selective constraint preserves interpretability where it makes sense, without unnecessarily limiting the model’s expressive capacity.

What the Experiments Show

To validate this idea, I evaluated the architecture on the FashionMNIST dataset, comparing it against both a standard convolutional baseline and a strict concept bottleneck model.

The results were revealing:

The hybrid model recovered most of the predictive accuracy of the baseline
Human-defined concepts were learned with very high accuracy
In qualitative analyses, the model activated concepts only when they were semantically applicable
Crucially, when no known human concept applied, the model did not fabricate explanations it relied on its latent representation instead

This behavior is important. Rather than pretending to be fully interpretable, the model is honest about the limits of its explanations.

Why This Matters

Interpretability should not mean oversimplification. Forcing all reasoning into human concepts can be just as misleading as offering no explanation at all. A hybrid approach acknowledges an important reality: human knowledge is partial, and models should be allowed to represent what we cannot yet fully describe.

By explicitly separating what is explainable from what is not, hybrid semantic bottlenecks offer a practical path toward transparent, auditable, and high-performing models — particularly in domains where partial semantic knowledge is available.

Final Thoughts

This work reinforces a broader lesson in machine learning:
good interpretability is not about explaining everything, but about explaining the right things and being explicit about what remains latent.

The full paper and executable experiments are publicly available, and I plan to continue exploring how hybrid representations can improve trust and accountability in real-world AI systems.

The full paper is available as a preprint on Zenodo:
https://zenodo.org/records/18508972

An executable version of all experiments can be accessed via Google Colab:
https://colab.research.google.com/drive/1q8y2zA_bUbZNOhR90QG_JTKha-pHMH7D?usp=sharing