A new AI processor based on state-of-the-art neural network theory
A new accelerator chip called “Hiddenite” that achieves state-of-the-art accuracy in computing sparse “hidden neural networks” with lower computational loads has been developed by Tokyo Tech researchers. By constructing the model on-chip, through a combination of weight generation and “supermask” expansion, the Hiddenite chip significantly reduces external memory access for improved computational efficiency.
Deep neural networks (DNNs) are a complex machine learning architecture for AI that requires many parameters to learn how to predict outputs. DNNs can, however, be “pruned”, thereby reducing computational load and model size. A few years ago, the “lottery ticket hypothesis” took the world of machine learning by storm. The hypothesis stated that a randomly initialized DNN contains subnetworks that, after training, achieve an accuracy equivalent to that of the original DNN. The larger the network, the more “lottery tickets” there are for successful optimization. These lottery tickets allow “pruned” sparse neural networks to achieve accuracies equivalent to those of more complex, “dense” networks, thereby reducing overall computational loads and energy consumption.
One technique for finding such subnetworks is the hidden neural network (HNN) algorithm, which applies AND logic (where the output is high only when all inputs are high) to the initialized random weights and a binary mask called a “supermask” (Fig. 1). The supermask, defined by the top-k% highest scores, denotes unselected and selected connections as 0 and 1, respectively. The HNN helps to improve computational efficiency on the software side. However, computing neural networks also requires improvements in hardware components.
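The supermask idea described above can be illustrated with a small sketch. It is not the researchers' implementation: the function names `make_supermask` and `apply_supermask`, the layer size, and the uniform random weights and scores are all assumptions made for illustration. It shows only how a top-k% selection over scores yields a 0/1 mask, and how that mask gates a set of frozen random weights.

```python
import random

def make_supermask(scores, k_percent):
    """Return a 0/1 mask that marks the top k_percent of scores with 1 (hypothetical helper)."""
    n_keep = max(1, round(len(scores) * k_percent / 100))
    # Indices of the n_keep highest scores.
    top = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:n_keep]
    mask = [0] * len(scores)
    for i in top:
        mask[i] = 1
    return mask

def apply_supermask(weights, mask):
    """Keep a weight where its mask bit is 1, zero it out otherwise."""
    return [w if m else 0.0 for w, m in zip(weights, mask)]

# Assumed toy setup: 10 frozen random weights and 10 scores.
rng = random.Random(0)
weights = [rng.uniform(-1, 1) for _ in range(10)]
scores = [rng.random() for _ in range(10)]

mask = make_supermask(scores, 30)        # keep the top 30% of connections
sparse = apply_supermask(weights, mask)  # the selected subnetwork's weights
print(mask)
print(sparse)
```

In this sketch the weights are never changed. The scores decide which connections survive, which matches the idea that the subnetwork is selected rather than trained.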
Traditional DNN accelerators offer high performance, but they do not take into account the power consumption caused by external memory access. Now, researchers from the Tokyo Institute of Technology (Tokyo Tech), led by Professors Jaehoon Yu and Masato Motomura, have developed a new accelerator chip called “Hiddenite”, which can compute hidden neural networks with significantly improved power consumption. “Reducing external memory access is the key to reducing power consumption. Currently, achieving high inference accuracy requires large models. But this increases external memory access to load model parameters. Our main motivation behind the development of Hiddenite was to reduce this external memory access,” says Prof. Motomura. Their study will feature in the upcoming International Solid-State Circuits Conference (ISSCC) 2022, a prestigious international conference showcasing the pinnacles of achievement in integrated circuits.
“Hiddenite” stands for Hidden Neural Network Inference Tensor Engine and is the first HNN inference chip. The Hiddenite architecture (Fig. 2) offers three advantages that reduce external memory access and achieve high energy efficiency. The first is on-chip weight generation, which regenerates the weights using a random number generator. This eliminates the need to store the weights in external memory and access them there. The second is “on-chip supermask expansion”, which reduces the number of supermasks that need to be loaded by the accelerator. The third is a high-density four-dimensional (4D) parallel processor that maximizes data reuse during computation, thereby improving efficiency.
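The reasoning behind on-chip weight generation can be sketched in a few lines. This is an illustration under stated assumptions, not a description of the chip's circuitry: Python's software `random.Random` stands in for the hardware random number generator, and the function name `generate_weights`, the seed value, the uniform weight range, and the example mask are all hypothetical. The point is only that a deterministic generator reproduces the same weights from the same seed, so the weights themselves never need to be stored or fetched.

```python
import random

def generate_weights(seed, n):
    """Regenerate n pseudo-random weights from a seed (hypothetical stand-in for a hardware RNG)."""
    rng = random.Random(seed)
    return [rng.uniform(-1, 1) for _ in range(n)]

SEED = 1234                              # assumed seed; the only weight-related state kept
stored_mask = [1, 0, 0, 1, 0, 1, 0, 0]   # assumed supermask bits for an 8-connection layer

# Regenerating twice from the same seed gives identical weights,
# so nothing but the seed and the mask has to be loaded.
first = generate_weights(SEED, len(stored_mask))
again = generate_weights(SEED, len(stored_mask))
print(first == again)

# Effective layer weights: regenerated weights gated by the mask.
effective = [w if m else 0.0 for w, m in zip(first, stored_mask)]
print(effective)
```

Under these assumptions, what must be transferred per connection is one mask bit rather than a full weight value.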
“The first two factors are what set the Hiddenite chip apart from existing DNN inference accelerators,” Prof. Motomura reveals. “Additionally, we also introduced a new training method for hidden neural networks, called ‘score distillation’, in which the weights from conventional knowledge distillation are distilled into the scores, because hidden neural networks never update the weights. The accuracy using score distillation is comparable to that of the binary model, while the model is half the size of the binary model.”
Based on the Hiddenite architecture, the team designed, fabricated, and measured a prototype chip using Taiwan Semiconductor Manufacturing Company’s (TSMC) 40 nm process (Fig. 3). The chip measures only 3 mm x 3 mm and handles 4,096 MAC (multiply-and-accumulate) operations at once. It achieves state-of-the-art computational efficiency, up to 34.8 trillion (tera) operations per second (TOPS) per watt of power, while reducing the amount of model transfer to half that of binarized networks.
These findings and their successful demonstration in a real silicon chip are sure to cause another paradigm shift in the world of machine learning, paving the way for faster, more efficient, and ultimately more environmentally friendly computing.