Deep neural networks (DNNs) dramatically improve the accuracy of machine learning applications, such as object detection and speech recognition, that require human-like intelligence. In Design Automation Laboratory, we conduct research in various aspects of deep learning, from algorithms to hardware/software architectures.
Compared with other machine learning techniques, DNNs typically require far more computation because of the many layers and neurons that comprise the network. Moreover, industrial and academic demands keep increasing the size and topological complexity of DNNs. For this reason, using high-performance computers with accelerators such as GPUs and/or clustering many machines is regarded as a practical way to implement DNNs. Considering, however, that machine learning has also been rapidly adopted in resource-constrained mobile and embedded systems, such as self-driving cars and patient data analysis, researchers have paid great attention to finding ways to execute DNNs efficiently, including minimizing the required precision and reducing network size.
In contrast to those studies based on conventional binary arithmetic computing, we conduct research and development in a different type of computing called stochastic computing (SC) for higher efficiency. SC can implement a circuit with a smaller hardware footprint, lower power, and shorter critical path delay than conventional binary logic. It also has advantages in error tolerance and bit-level parallelism. Our research focuses on developing architectures and inference/learning methods to effectively apply stochastic computing to deep neural network implementations.
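To illustrate why SC circuits can be so small: in the standard unipolar encoding, a value in [0, 1] is represented as the probability of a 1 appearing in a random bitstream, and multiplying two values requires only a single AND gate per bit. The sketch below simulates this (it is a textbook SC example for illustration, not a specific circuit from our work):

```python
import random

def to_stream(p, length, rng):
    """Encode a probability p in [0, 1] as a unipolar stochastic bitstream."""
    return [1 if rng.random() < p else 0 for _ in range(length)]

def sc_multiply(a_bits, b_bits):
    """Bitwise AND of two independent unipolar streams multiplies their values."""
    return [a & b for a, b in zip(a_bits, b_bits)]

def value(bits):
    """Decode a bitstream back to a value: the fraction of 1s."""
    return sum(bits) / len(bits)

rng = random.Random(0)
N = 10000  # longer streams give lower decoding error
a = to_stream(0.5, N, rng)
b = to_stream(0.4, N, rng)
prod = value(sc_multiply(a, b))  # ≈ 0.5 * 0.4 = 0.2
```

The same trade-off that makes SC cheap also makes it approximate: accuracy improves only with stream length, which is why inference/learning methods must be co-designed with the arithmetic.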
A spiking neural network (SNN) is the third generation of artificial neural networks, in which each neuron maintains a small internal state and delivers information in the form of spikes. Compared to conventional artificial neural networks, SNNs resemble biological neurons more closely and have the potential to perform more efficiently than conventional fixed-point neural networks. Research has been conducted in both academia and industry, including the SpiNNaker (University of Manchester) and TrueNorth (IBM) projects. We explore efficient design of spiking neural network systems, from hardware architectures to algorithmic approaches.
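The "small state" kept by each neuron is typically a membrane potential that leaks over time, integrates incoming current, and emits a spike when it crosses a threshold. A minimal discrete-time leaky integrate-and-fire (LIF) neuron, one common SNN neuron model (shown here only as an illustration; the parameter values are arbitrary), looks like this:

```python
def lif_step(v, input_current, v_thresh=1.0, leak=0.9):
    """One timestep of a discrete leaky integrate-and-fire neuron.

    The membrane potential leaks, integrates the input current, and
    fires a spike (then resets) when it reaches the threshold.
    Returns (new_potential, spike_bit).
    """
    v = leak * v + input_current
    if v >= v_thresh:
        return 0.0, 1  # fire a spike and reset the potential
    return v, 0

# A constant sub-threshold input accumulates until a spike fires,
# so the neuron converts input magnitude into spike frequency.
v, spikes = 0.0, []
for _ in range(20):
    v, s = lif_step(v, 0.3)
    spikes.append(s)
# With these parameters the neuron fires every 4 timesteps.
```

Because communication happens only when spikes occur, hardware can stay idle for inactive neurons, which is the main source of the potential efficiency advantage.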
Due to the massive amounts of computation in state-of-the-art DNN models, both academia and industry have actively studied specialized hardware architectures for accelerating deep neural network execution. For example, Google announced the use of their hardware accelerator for DNN inference called Tensor Processing Unit (TPU); Intel announced plans to support efficient DNN execution in their Xeon Phi products based on Nervana’s DNN accelerator design; a series of hardware accelerator designs have appeared in recent computer architecture conferences.
We study high-performance and energy-efficient hardware implementations of deep learning algorithms. Our goal is to develop hardware accelerators for deep learning algorithms, efficient distribution schemes for scalability, and system-/application-level support to ease the application of such hardware accelerators.
As data-intensive applications become increasingly important, the memory hierarchy has become a key component determining the efficiency of a computer system. The memory hierarchy consists of multiple levels of memory, from on-chip caches to main memory and storage, each with different performance-capacity-power trade-offs. In Design Automation Laboratory, we study each level of the memory hierarchy to improve its performance, energy efficiency, and reliability so that future computer systems can process large amounts of data more efficiently.
Spin-Transfer Torque RAM (STT-RAM) is a new non-volatile memory technology that can provide much lower static power and higher density than conventional charge-based memory (e.g., SRAM). Such characteristics are desirable to construct very large on-chip caches in an energy-/area-efficient way. However, compared to SRAM caches, STT-RAM caches exhibit inefficient write operations, which can potentially offset the energy benefit of STT-RAM.
We have explored energy-efficient and scalable on-chip cache designs based on STT-RAM. The primary objective is to mitigate the impact of STT-RAM's high write energy while utilizing it for on-chip caches. Our approaches mainly modify the on-chip cache architecture to reduce the number of STT-RAM writes.
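One well-known way to cut STT-RAM writes, shown here purely as an illustrative example (it is not a description of our specific designs), is redundant-write elimination: read the stored data first and perform the costly write only when the new value actually differs.

```python
class WriteFilteredCache:
    """Toy model of a cache store that filters redundant STT-RAM writes.

    A read-before-write check skips the expensive write operation
    whenever the incoming data matches what is already stored.
    """
    def __init__(self):
        self.lines = {}           # addr -> stored value
        self.writes_performed = 0  # counts actual (costly) array writes

    def write(self, addr, value):
        if self.lines.get(addr) == value:
            return  # redundant write: data unchanged, skip the STT-RAM write
        self.lines[addr] = value
        self.writes_performed += 1

cache = WriteFilteredCache()
for v in [7, 7, 7, 8]:
    cache.write(0x40, v)
# Only two of the four writes reach the STT-RAM array.
```

The extra read costs little because STT-RAM reads are fast and low-energy; the savings come from avoiding the asymmetric write cost.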
With the advent of big-data applications, the memory bandwidth requirements of data-intensive computing systems have continuously increased. However, conventional main memory technologies based on narrow off-chip channels (e.g., DDR3/4) cannot fulfill such high demand, mainly because CPU pin counts limit the available off-chip bandwidth. This memory bandwidth bottleneck has been exacerbated in recent years by architectural innovations that greatly improve computation speed (e.g., chip multiprocessors, hardware accelerators). This motivates a new computing paradigm that can fundamentally overcome the memory bandwidth bottleneck of conventional systems.
For this purpose, we conduct research in designing processing-in-memory (PIM) systems, where a portion of an application's computation is offloaded to in-memory computation units. PIM can not only provide very high memory bandwidth to in-memory computation units but also realize scalable memory bandwidth without being limited by CPU pin counts. Our research mainly focuses on architectures and programming models for PIM systems targeting data-intensive applications, and aims to provide high performance and energy efficiency with an intuitive programming model.
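The bandwidth argument can be made concrete with a simple cost model. In the hypothetical sketch below (the functions and cost accounting are illustrative assumptions, not a real PIM programming interface), a reduction executed on the host must pull every element across the off-chip channel, whereas a PIM unit beside the memory array sends back only the final result:

```python
WORD = 8  # assumed bytes per element crossing the off-chip channel

def host_sum(data):
    """Host-side reduction: every element crosses the off-chip channel."""
    traffic = len(data) * WORD
    return sum(data), traffic

def pim_sum(data):
    """In-memory reduction: only the 8-byte result returns to the CPU."""
    traffic = WORD
    return sum(data), traffic

data = list(range(1_000_000))
s_host, t_host = host_sum(data)
s_pim, t_pim = pim_sum(data)
# Same result, but off-chip traffic drops from len(data)*WORD bytes to WORD bytes.
```

Deciding which portions of an application benefit from such offloading, and exposing that choice through an intuitive programming model, is exactly the design space this research addresses.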
We conduct research and development in reconfigurable MP-SoC structures, aiming to provide the flexibility and extensibility needed by designers and users of embedded systems. DAL was selected as a National Research Laboratory (NRL) by the Korean Ministry of Education, Science and Technology in 2008.
We develop a configurable processor for application specific optimization and flexibility. This includes the basic processor, instruction extension methodology, cost/power minimization methodology, and a software development environment.
Future applications will require higher levels of concurrency for faster data processing and more flexibility to run various kinds of tasks. To meet these requirements, we develop an RCM (Reconfigurable Computing Module) with a two-dimensional array of processing elements. The RCM will also support floating-point operations so that it can handle 3D graphics applications as well as video applications.
To meet the high bandwidth demands of systems with a large number of processors, we study several communication architectures, especially bus matrices and networks-on-chip. Since the memory architecture cannot be separated from the communication architecture, we also study new methodologies to co-design them.
Given the application and the MP-SoC platform, the designer must determine the number and types of processors to be integrated into the system, a decision that is crucial to the system's performance, cost, and power consumption. We study communication architecture generation methods to meet these requirements, as well as methods for describing the application, implementing the system, and verifying the implementation.