A team from the Center for Deep Learning at Northwestern University recently proposed the Sample-based Dynamic Hierarchical Transformer (DHT), a transformer architecture with adaptive online compression capabilities. DHT adaptively optimizes the underlying network architecture during training and also yields a flexible network for efficient inference. A distinctive aspect of the approach is that the underlying layers and heads are sample-specific: the number of layers and heads is customized for each individual sample. The results were published in “Proceedings on Engineering Sciences”. Northwestern University and the Institute of Computing Technology of the Chinese Academy of Sciences are the research and development units. The first author is PhD student Fanfei Meng.
High efficiency, dynamic search, reinforcement-learning-based tuning, and economical automated machine learning have recently become hot topics. The transformer model, with its self-attention layers and multi-head space, is considered a very promising deep language model because of its low computational complexity and excellent performance in many classification tasks. However, transformers still face high computational costs and overfitting, which is a serious problem under the tight computing budgets of many business use cases. Designing a model with high efficiency and low memory requirements is therefore critical for running intensive computations on edge devices. According to reports, the team optimized the dynamic search mechanism with linear contextual bandits and showed that the training efficiency of the Dynamic Hierarchical Transformer is generally increased by 74%, and its inference efficiency by 81%. To the best of our knowledge, DHT is the most efficient Transformer model architecture in the world.
Traditional network compression methods first train the full set of layers and then reduce the model size through layer-wise network compression or knowledge distillation. On the one hand, this two-step procedure can be very time-consuming. On the other hand, different tasks may require different lightweight transformers, which makes uniform compression inflexible. By contrast, the team designed a dynamic, data-driven transformer model whose size is optimized during training, skipping the separate compression step while maintaining decent predictive capability. In addition, to account for head interactions and the order in which samples appear during training, they formulated rewards at the batch level rather than as one-step gains, which mitigates the reduction in model performance.
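To make the idea more concrete, the following is a minimal sketch of how a linear contextual bandit could pick a per-sample layer count and be updated with a batch-level reward. It is an illustration only, not the team's implementation: it uses the standard LinUCB rule, covers depth but not heads, and the names `LinUCBArm`, `choose_arm`, `batch_reward`, `update_from_batch`, and the `cost_weight` penalty are all hypothetical. The reward formula (batch accuracy minus a depth penalty) is an assumption, since the article does not state the actual reward.

```python
import math


class LinUCBArm:
    """One arm (here: one candidate layer count) of a standard LinUCB bandit."""

    def __init__(self, dim):
        # A starts as the identity, so its inverse is the identity as well.
        self.a_inv = [[1.0 if i == j else 0.0 for j in range(dim)] for i in range(dim)]
        self.b = [0.0] * dim

    def _a_inv_times(self, x):
        return [sum(row[j] * x[j] for j in range(len(x))) for row in self.a_inv]

    def ucb(self, x, alpha):
        """Upper confidence bound: theta.x + alpha * sqrt(x^T A^{-1} x)."""
        theta = self._a_inv_times(self.b)
        a_inv_x = self._a_inv_times(x)
        mean = sum(t * xi for t, xi in zip(theta, x))
        var = sum(xi * v for xi, v in zip(x, a_inv_x))
        return mean + alpha * math.sqrt(max(var, 0.0))

    def update(self, x, reward):
        """Rank-one update A += x x^T, b += reward * x (Sherman-Morrison on A^{-1})."""
        a_inv_x = self._a_inv_times(x)
        denom = 1.0 + sum(xi * v for xi, v in zip(x, a_inv_x))
        for i in range(len(x)):
            for j in range(len(x)):
                self.a_inv[i][j] -= a_inv_x[i] * a_inv_x[j] / denom
        self.b = [bi + reward * xi for bi, xi in zip(self.b, x)]


def choose_arm(arms, x, alpha=1.0):
    """Return the index of the arm with the highest UCB for context x."""
    scores = [arm.ucb(x, alpha) for arm in arms]
    return max(range(len(arms)), key=scores.__getitem__)


def batch_reward(correct, depths, max_depth, cost_weight=0.3):
    """Assumed reward: batch accuracy minus a penalty on the mean depth used."""
    accuracy = sum(correct) / len(correct)
    mean_depth = sum(depths) / len(depths)
    return accuracy - cost_weight * mean_depth / max_depth


def update_from_batch(arms, contexts, choices, reward):
    """Credit every per-sample choice in the batch with the shared batch reward."""
    for x, k in zip(contexts, choices):
        arms[k].update(x, reward)


if __name__ == "__main__":
    arms = [LinUCBArm(dim=2) for _ in range(3)]  # arm k stands for depth k + 1
    contexts = [[1.0, 0.2], [1.0, 0.9]]          # hypothetical per-sample features
    choices = [choose_arm(arms, x) for x in contexts]
    depths = [k + 1 for k in choices]
    reward = batch_reward([1, 0], depths, max_depth=3)
    update_from_batch(arms, contexts, choices, reward)
    print(depths, round(reward, 3))
```

Sharing one reward across the whole batch, rather than scoring each sample's choice in isolation, is the part of the sketch that mirrors the batch-level formulation described above; everything else is a generic bandit loop.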
Professor Archer, a faculty member at Cornell University, believes that this pioneering achievement has great potential in the field. The work offers a new approach to the industrial deployment of online compression and neural architecture search. The paradigm of sample-based adaptive training and downstream inference offers excellent computing efficiency and lower hardware requirements, which is promising for the development of edge computing and mobile computing.