TLDR: TEMPEST is a novel method that enables transformer models to learn semantic representations directly from compressed multimedia data (like MP3s and JPEGs). By exploiting the inherent ‘block’ structure within compressed file formats for tokenization, TEMPEST significantly reduces the input sequence length, leading to lower computational complexity and memory usage. The approach achieves competitive accuracy compared to state-of-the-art models while offering substantial efficiency gains, demonstrating its potential for large-scale multimedia processing without full decoding.
In the world of artificial intelligence, transformer models have become incredibly powerful, especially for understanding language and various forms of multimedia like images, audio, and video. However, a significant challenge arises when applying these models to multimedia: the raw data results in extremely long input sequences that can overwhelm a model's memory and compute budget. Imagine trying to process every single byte of a video; the amount of data is immense, and the attention mechanism at the core of transformers scales quadratically with sequence length in both compute and memory.
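To see why that quadratic scaling bites at byte-level granularity, here is a quick back-of-the-envelope illustration (our numbers, not the paper's):

```python
# Self-attention materializes an n-by-n score matrix, so compute and
# memory grow quadratically with sequence length n. Byte-level media
# streams make n enormous:
for n in (1_000, 10_000, 1_000_000):
    print(f"n = {n:>9,}  ->  {n * n:>19,} attention scores per head, per layer")
```

At a million byte-tokens, a single attention map already has a trillion entries, which is why shortening the sequence itself is so attractive.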
Researchers have explored various ways to tackle this issue, such as using approximate attention or merging tokens. Now, a new method called TEMPEST (TransformErs froM comPressed rEpreSenTations) offers an innovative solution by approaching the problem from a different angle: directly learning from compressed data streams.
TEMPEST leverages the inherent structure of compressed file formats (CFFs), such as MP3 for audio or JPEG for images. These formats are designed for efficient storage and transmission, compacting data while preserving essential information. The key insight behind TEMPEST is that while individual bytes in a compressed stream might not carry clear semantic meaning, many CFFs are organized into ‘blocks’ – the smallest encoded units that can be decoded independently. These blocks, like MP3 frames or JPEG Minimum Coded Units (MCUs), naturally encapsulate self-contained information, making them ideal candidates for tokenization.
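As a concrete illustration of that block structure, here is a minimal sketch (ours, not code from the paper) that cuts an MP3 byte stream at frame-sync boundaries. It relies only on the documented MP3 frame header, which begins with eleven set bits; a production tokenizer would also parse each header to compute the exact frame length and reject false syncs that can occur inside payload bytes:

```python
def split_mp3_frames(data: bytes) -> list[bytes]:
    """Split an MP3 byte stream at frame-sync boundaries: a frame header
    starts with 0xFF followed by a byte whose top three bits are set.
    This naive scan can be fooled by sync-like payload bytes; it is only
    meant to show how block boundaries fall out of the format itself."""
    starts = [i for i in range(len(data) - 1)
              if data[i] == 0xFF and (data[i + 1] & 0xE0) == 0xE0]
    return [data[a:b] for a, b in zip(starts, starts[1:] + [len(data)])]
```

Each returned chunk is a self-contained frame, which is exactly the kind of unit TEMPEST can treat as one token group.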
Instead of processing raw, uncompressed bytes or even partially decoded media, TEMPEST treats these compressed blocks as atomic units for its transformer model. Each block is independently embedded into a short, compact sequence of tokens. This significantly reduces the overall sequence length that the main transformer has to process, yielding substantial savings in both computation and memory.
The TEMPEST architecture consists of three main components: a block embedding network, a block reconstruction network, and a classification network. The block embedding network takes the compressed blocks and transforms them into a compact, high-dimensional representation. The reconstruction network helps ensure that these embedded blocks remain informative by trying to reconstruct the original byte sequence. Finally, the classification network, similar to a Vision Transformer (ViT), learns semantic representations from these embedded block sequences to perform tasks like classification.
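Here is a minimal PyTorch sketch of that three-part layout as we read it from the description above. The byte-embedding scheme, the flat linear projections, the number of tokens per block, and all layer sizes are our assumptions for illustration, not the paper's actual design:

```python
import torch
import torch.nn as nn

class TempestSketch(nn.Module):
    """Illustrative three-part layout: block embedder, block reconstructor,
    and a ViT-style classifier. All sizes are placeholder assumptions."""

    def __init__(self, block_len=256, tokens_per_block=2, d=192,
                 n_layers=4, n_heads=4, n_classes=50):
        super().__init__()
        self.tokens_per_block = tokens_per_block
        # 1) Block embedding network: raw bytes (0..255) -> a few tokens.
        self.byte_embed = nn.Embedding(256, d)
        self.block_proj = nn.Linear(block_len * d, tokens_per_block * d)
        # 2) Block reconstruction network: tokens -> per-byte logits,
        #    an auxiliary objective that keeps embeddings informative.
        self.recon = nn.Linear(tokens_per_block * d, block_len * 256)
        # 3) Classification network: ViT-like encoder over all block tokens
        #    (positional embeddings omitted here for brevity).
        self.cls_token = nn.Parameter(torch.zeros(1, 1, d))
        layer = nn.TransformerEncoderLayer(d, n_heads, 4 * d, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        self.head = nn.Linear(d, n_classes)

    def forward(self, blocks):  # blocks: (B, n_blocks, block_len) byte values
        B, n_blocks, L = blocks.shape
        x = self.byte_embed(blocks.long())         # (B, n_blocks, L, d)
        x = self.block_proj(x.flatten(2))          # (B, n_blocks, k*d)
        recon_logits = self.recon(x)               # bytes back, for aux loss
        tokens = x.view(B, n_blocks * self.tokens_per_block, -1)
        tokens = torch.cat([self.cls_token.expand(B, -1, -1), tokens], dim=1)
        h = self.encoder(tokens)                   # short sequence: k per block
        return self.head(h[:, 0]), recon_logits
```

During training, the classification loss would be combined with a reconstruction loss (e.g., cross-entropy of `recon_logits` against the original bytes) so the block embeddings stay informative; how the paper weights the two terms is beyond this summary.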
The effectiveness and generality of TEMPEST have been demonstrated across various experiments. For audio, it was tested with MP3 and Opus formats on datasets like ESC-50, Speech Commands v2 (SC2), and AudioSet. For images, it was evaluated using JPEG on the MNIST dataset. TEMPEST achieved accuracy competitive with state-of-the-art models while drastically reducing the number of tokens per second of audio. For instance, it uses only 32 tokens per second compared to 108 tokens per second for a baseline Audio Spectrogram Transformer (AST), leading to an attention matrix that is an order of magnitude smaller.
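That "order of magnitude" figure checks out with simple arithmetic (ours, not a computation quoted from the paper): since the attention matrix grows with the square of the token count, per second of audio it shrinks by

$$\left(\frac{108}{32}\right)^{2} \approx 11.4,$$

i.e., roughly 11× fewer attention entries than the AST baseline.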
Further studies revealed interesting properties. Training TEMPEST with audio streams encoded at multiple bit rates (e.g., 20, 26, and 32 kbps) acts as a powerful form of data augmentation, improving accuracy and generalization. Similarly, performing inference by combining predictions from multiple bit rates for the same input also boosted performance, akin to multi-crop evaluation in vision tasks.
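A sketch of how that multi-bit-rate recipe could look in practice, assuming ffmpeg is available on the PATH and a `model` callable that maps a compressed byte stream to a logits tensor (both are our assumptions; the paper's actual pipeline is not covered here):

```python
import pathlib
import random
import subprocess
import tempfile

BIT_RATES = ("20k", "26k", "32k")  # the rates cited in the summary

def encode_at(wav_path: str, bit_rate: str) -> bytes:
    """Re-encode a wav file as MP3 at the given bit rate and return the
    compressed bytes. Assumes the ffmpeg CLI is installed and on PATH."""
    out = pathlib.Path(tempfile.mkdtemp()) / f"{bit_rate}.mp3"
    subprocess.run(["ffmpeg", "-y", "-loglevel", "error", "-i", wav_path,
                    "-b:a", bit_rate, str(out)], check=True)
    return out.read_bytes()

def augmented_example(wav_path: str) -> bytes:
    # Training-time augmentation: re-encode each example at a random rate.
    return encode_at(wav_path, random.choice(BIT_RATES))

def ensemble_predict(model, wav_path: str):
    # Inference-time ensembling: average class probabilities across rates,
    # analogous to multi-crop evaluation in vision. `model` is assumed to
    # return a torch logits tensor given a compressed byte stream.
    probs = [model(encode_at(wav_path, br)).softmax(-1) for br in BIT_RATES]
    return sum(probs) / len(probs)
```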
This novel approach bypasses the need for full media decoding, offering a lightweight and efficient way to extract semantic meaning directly from compressed multimedia files. This has significant implications for large-scale applications where processing millions of media files efficiently is crucial. For more details, see the full research paper.