ACM Fellow and IEEE Fellow,
Endowed Distinguished Professor
University of Delaware, USA
A Founder and Chair
IEEE/CS Dataflow STC (Special Technology Community)
Director of Advanced Computer System Architecture Laboratory,
School of Computer Science and Technology,
University of Science and Technology of China
Abstract. The high availability of a 100-petaflops (100P) or larger production system such as Sunway TaihuLight for computational science and engineering applications remains very attractive to the supercomputing community, since obtaining peak performance on irregular applications such as computer algebra problems is still a challenging problem. In this short talk, we will introduce our preliminary work on a dataflow-based runtime system for Sunway TaihuLight, aimed at exploiting the computational resources of a 100P production system with great efficiency.
President, Global Supercomputing Corporation, USA
Abstract. The dataflow computing model introduced an elegant and highly influential parallel alternative to the von Neumann sequential computing model. But the dataflow approach also introduced new programming models that are fundamentally incompatible with von Neumann sequential single-threaded instruction execution. Furthermore, there is a misconception that in the sequential von Neumann computing model (much criticized by the dataflow community) instructions must be executed sequentially. While there exist proofs that optimal theoretical performance is unachievable in corner cases, in practice nothing prevents a modern von Neumann computer from executing an application in a time only slightly greater than the critical-path length of the application's entire execution trace plus speed-of-light delays, while remaining fully compatible with the von Neumann sequential execution model. The exascale, very large cloud centers of the near future, comprising millions of FPGAs and ASICs, will provide the infrastructure for enabling such performance, using application-specific, customized hardware compiled from a sequential application.
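The critical-path bound invoked in the abstract can be illustrated with a small sketch: given a dependence graph of operations with latencies, the earliest possible finish time under unlimited parallelism is the longest latency-weighted path. The graph, node names, and latencies below are purely illustrative, not taken from any real application trace.

```python
# Critical-path lower bound on execution time of a dependence graph.
# Nodes are operations with latencies; edges are data dependences.
# The graph and latencies here are hypothetical, for illustration only.
from functools import lru_cache

latency = {"load_a": 2, "load_b": 2, "mul": 4, "add": 1, "store": 2}
deps = {                       # node -> operations it depends on
    "load_a": [],
    "load_b": [],
    "mul": ["load_a", "load_b"],
    "add": ["mul", "load_b"],
    "store": ["add"],
}

@lru_cache(maxsize=None)
def finish(node):
    """Earliest finish time of `node` assuming unlimited parallelism."""
    start = max((finish(d) for d in deps[node]), default=0)
    return start + latency[node]

critical_path = max(finish(n) for n in deps)
print(critical_path)  # load_a/load_b in parallel, then mul, add, store: 2+4+1+2 = 9
```

No schedule, on any number of processors, can finish this graph in fewer than 9 time units; the abstract's claim is that a von Neumann machine with sufficient customized hardware can approach this bound.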
Director of High Throughput Computer Research Center,
Institute of Computing Technology,
Chinese Academy of Sciences (ICT, CAS)
Abstract. Dataflow architectures are playing an increasingly important role in high-end computing. In this short talk, we will present a practical design method for dataflow processors and our newly developed dataflow processor, SmarCo, which demonstrates the outstanding efficiency of the dataflow execution model through its high performance.
Chief Engineer, Machine and Deep Learning,
IBM Corp, USA
Abstract. The emergence of Deep Artificial Neural Networks (DNNs) is revolutionizing information technology with an emphasis on extracting information from massive data corpora. Deep Learning is the process of training a DNN; it is a highly numerically intensive operation dominated by a small number of computational kernels that are well known in the high-performance computing community, such as generalized matrix-matrix multiplication and other dense stencil computations. In 2016, IBM introduced the new S822LC for HPC server, designed to deliver unprecedented performance for both Artificial Intelligence and traditional High-Performance Computing (HPC) workloads. With its high-performance NVLink connections, the S822LC for HPC server offers a sweet spot of scalability, performance, and efficiency for Deep Learning applications. The next-generation S822LC for HPC systems combine the balanced high-performance Power server design with four high-performance P100 GPUs, which exploit dataflow principles to maximize throughput by scheduling groups of computational threads based on operand availability, hiding latency and delivering peak performance. The GPUs are connected via NVLink for enhanced peer-to-peer GPU multiprocessing, and via CPU-GPU NVLink for enhanced performance and programmability.
Since its introduction in 2016, these accelerator-based server designs have demonstrated the benefits of numeric accelerators, first introduced in the IBM RoadRunner supercomputer based on IBM's Cell BE design. In 2016, we demonstrated training one of the most common DNNs, AlexNet, on the full ImageNet 2012 dataset (a de facto industry-standard training corpus), setting a new industry record with the first training time of under an hour on a single S822LC for HPC server. In 2017, we demonstrated training of the most complex DNNs in use today on a cluster of 64 server nodes, achieving a training time of just 7 hours for the ResNet-101 neural network model, as well as record image-recognition accuracy of 33.8% on 7.5M images from the ImageNet-22k dataset, compared to the previous best published result of 29.8%. We also achieved a record for fastest absolute training time of 50 minutes by training the ResNet-50 model on the ImageNet-1K dataset.
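The scheduling principle mentioned in the abstract, issuing work as soon as its operands are available rather than in program order, can be sketched in miniature. The tiny operation graph and names below are illustrative assumptions, not a description of the GPU's actual scheduler.

```python
# Minimal sketch of dataflow-style scheduling: an operation becomes
# ready to fire as soon as all of its operands have arrived.
# The operation graph and values are hypothetical, for illustration only.
from collections import deque

ops = {                          # op -> (function, operand names)
    "t1": (lambda a, b: a * b, ("x", "y")),
    "t2": (lambda a, b: a + b, ("t1", "z")),
}
values = {"x": 3, "y": 4, "z": 5}    # operands available at the start

# Seed the ready queue with ops whose operands are all present.
ready = deque(op for op, (_, srcs) in ops.items()
              if all(s in values for s in srcs))
while ready:
    op = ready.popleft()
    fn, srcs = ops[op]
    values[op] = fn(*(values[s] for s in srcs))   # "fire" the operation
    # A newly produced value may enable dependent operations.
    for other, (_, osrcs) in ops.items():
        if other not in values and all(s in values for s in osrcs):
            ready.append(other)

print(values["t2"])  # (3 * 4) + 5 = 17
```

In hardware the same idea lets the scheduler swap in whichever thread groups have their operands ready, overlapping memory latency of one group with computation in another.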
Principal Research Manager,
Abstract. In this talk I will discuss two projects at Microsoft that address the end of Moore’s Law and silicon scaling. Project Catapult uses reconfigurable computing to accelerate datacenter services such as Bing search and Azure networking in Microsoft datacenters. Project E2 is a next-generation Explicit Data Graph Execution (EDGE) architecture that utilizes a hybrid von Neumann/dataflow model to overcome the limitations of traditional CISC/RISC instruction set architectures.
Head of Qualcomm Research Raleigh, North Carolina, USA
Leader of Processor Research
Abstract. In this short talk, we will consider the relative strengths and weaknesses of the dataflow and conventional von Neumann models, and how they can be combined to obtain the best features of each for high performance and power efficiency, while maintaining compatibility with existing high-level software stacks and programming models.