Conference Paper

XOR-CiM: An Efficient Computing-in-SOT-MRAM Design for Binary Neural Network Acceleration

Conference Paper

Mehrdad Morsali, Ranyang Zhou, Sepehr Tabrizchi, Arman Roohi, and Shaahin Angizi

24th International Symposium on Quality Electronic Design (ISQED)

Publication year: 2023

In this work, we leverage the uni-polar switching behavior of Spin-Orbit Torque Magnetic Random Access Memory (SOT-MRAM) to develop an efficient digital Computing-in-Memory (CiM) platform named XOR-CiM. XOR-CiM converts typical MRAM sub-arrays to massively parallel computational cores with ultra-high bandwidth, greatly reducing the energy consumed by convolutional layers and accelerating the inference of X(N)OR-intensive Binary Neural Networks (BNNs). With inference accuracy similar to that of digital CiMs, XOR-CiM achieves ∼4.5× higher energy efficiency and 1.8× speed-up compared to recent MRAM-based CiM platforms.
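
As a rough illustration of why BNN inference is X(N)OR-intensive, the sketch below computes a binary dot product with XNOR followed by a population count. It is a generic software model, not the XOR-CiM circuit; the bit packing (1 encodes +1, 0 encodes -1) and the function names are assumptions made for this example.

```python
def binary_dot(a_bits: int, w_bits: int, n: int) -> int:
    """Dot product of two length-n {-1, +1} vectors packed into integers.

    Assumed encoding: bit value 1 means +1, bit value 0 means -1.
    """
    mask = (1 << n) - 1
    xnor = ~(a_bits ^ w_bits) & mask   # 1 wherever the two signs agree
    matches = bin(xnor).count("1")     # population count
    return 2 * matches - n             # (#agreements) - (#disagreements)


def unpack(bits: int, n: int) -> list:
    """Expand a packed vector back to a list of +1/-1 values (for checking)."""
    return [1 if (bits >> i) & 1 else -1 for i in range(n)]


a, w, n = 0b1011, 0b1101, 4
reference = sum(x * y for x, y in zip(unpack(a, n), unpack(w, n)))
result = binary_dot(a, w, n)
```

Here `result` equals `reference`, so one XNOR and one popcount replace n multiply-accumulate steps.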

SenTer: A Reconfigurable Processing-in-Sensor Architecture Enabling Efficient Ternary MLP

Conference Paper

Sepehr Tabrizchi, Rebati Gaire, Shaahin Angizi, and Arman Roohi

Proceedings of the Great Lakes Symposium on VLSI (GLSVLSI)

Publication year: 2023

Recently, Intelligent IoT (IIoT), including various sensors, has gained significant attention due to its capability of sensing, deciding, and acting by leveraging artificial neural networks (ANN). Nevertheless, to achieve acceptable accuracy and high performance in visual systems, a power-delay-efficient architecture is required. In this paper, we propose an ultra-low-power processing-in-sensor architecture, namely SenTer, realizing low-precision ternary multi-layer perceptron networks, which can operate in detection and classification modes. Moreover, SenTer supports two activation functions based on user needs and the desired accuracy-energy trade-off. SenTer is capable of performing all the required computations for the MLP’s first layer in the analog domain and then submitting its results to a co-processor. Therefore, SenTer significantly reduces the overhead of analog buffers, data conversion, and transmission power consumption by using only one ADC. Additionally, our simulation results demonstrate acceptable accuracy on various datasets compared to the full-precision models.

P-PIM: A Parallel Processing-in-DRAM Framework Enabling RowHammer Protection

Conference Paper

Ranyang Zhou, Sepehr Tabrizchi, Mehrdad Morsali, Arman Roohi, and Shaahin Angizi

Design, Automation & Test in Europe Conference & Exhibition (DATE)

Publication year: 2023

In this work, we propose a Parallel Processing-In-DRAM architecture named P-PIM leveraging the high density of DRAM to enable fast and flexible computation. P-PIM enables bulk bit-wise in-DRAM logic between operands in the same bit-line by elevating the analog operation of the memory sub-array based on a novel dual-row activation mechanism. With this, P-PIM can opportunistically perform a complete and inexpensive in-DRAM RowHammer (RH) self-tracking and mitigation technique to protect the memory unit against such a challenging security vulnerability. Our results show that P-PIM achieves ~72% higher energy efficiency than the fastest charge-sharing-based designs. As for the RH protection, with a worst-case slowdown of ~0.8%, P-PIM achieves up to 71% energy saving over the SRAM/CAM-based frameworks and about 90% saving over DRAM-based frameworks.

Ocellus: Highly Parallel Convolution-in-Pixel Scheme Realizing Power-Delay-Efficient Edge Intelligence

Conference Paper

S. Tabrizchi, Sh. Angizi, and A. Roohi

ACM/IEEE International Symposium on Low Power Electronics and Design (ISLPED)

Publication year: 2023

NeSe: Near-Sensor Event-Driven Scheme for Low Power Energy Harvesting Sensors

Conference Paper

Sepehr Tabrizchi, Mehrdad Morsali, Shaahin Angizi, Arman Roohi

IEEE International Symposium on Circuits and Systems (ISCAS)

Publication year: 2023

Digital technologies have made it possible to deploy visual sensor nodes capable of detecting motion events in the coverage area cost-effectively. However, background subtraction, as a widely used approach, remains an intractable task due to its inability to achieve competitive accuracy and reduced computation cost simultaneously. In this paper, an effective background subtraction approach, namely NeSe, for tiny energy-harvested sensors is proposed leveraging non-volatile memory (NVM). Using the developed software/hardware method, the accuracy and efficiency of event detection can be adjusted at runtime by changing the precision depending on the application’s needs. Due to the near-sensor implementation of background subtraction and NVM usage, the proposed design reduces the data movement overhead while ensuring intermittent resiliency. The background is stored for a specific time interval within NVMs and compared with the next frame. If the power is cut, the background remains unchanged and is updated after the interval passes. Once the moving object is detected, the device switches to the high-powered sensor mode to capture the image.
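
The store-and-compare flow described above can be sketched in software as follows. This is only an illustrative model under stated assumptions: 8-bit pixels, precision reduced by dropping low-order bits, and the names `detect_event`, `pixel_thresh`, and `count_thresh` are hypothetical and not taken from the paper.

```python
def detect_event(background, frame, bits=4, pixel_thresh=2, count_thresh=1):
    """Compare a stored background with the next frame at reduced precision.

    background, frame: sequences of 8-bit pixel values (assumed format).
    bits: number of most significant bits kept per pixel (runtime knob).
    Returns True when enough pixels differ to count as a motion event.
    """
    shift = 8 - bits
    changed = 0
    for b, f in zip(background, frame):
        # Drop low-order bits so that small noise does not register as change.
        if abs((b >> shift) - (f >> shift)) >= pixel_thresh:
            changed += 1
    return changed >= count_thresh


stored_background = [100] * 16          # would live in non-volatile memory
noisy_frame = [103] * 16                # same scene with slight sensor noise
moving_frame = [100] * 15 + [200]       # one pixel changed by an object

quiet = detect_event(stored_background, noisy_frame)
event = detect_event(stored_background, moving_frame)
```

Lowering `bits` trades detection sensitivity for cheaper comparisons, which mirrors the runtime accuracy/efficiency adjustment the abstract mentions.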

Comparative Study of Low Bit-width DNN Accelerators: Opportunities and Challenges

Conference Paper

D. Vungarala, M. Morsali, S. Tabrizchi, A. Roohi, and Sh. Angizi

66th International Midwest Symposium on Circuits and Systems (MWSCAS)

Publication year: 2023

TizBin: A Low-Power Image Sensor with Event and Object Detection Using Efficient Processing-in-Pixel Schemes

Conference Paper

Sepehr Tabrizchi, Shaahin Angizi, and Arman Roohi

40th International Conference on Computer Design (ICCD)

Publication year: 2022

In the Artificial Intelligence of Things (AIoT) era, always-on intelligent and self-powered visual perception systems have gained considerable attention and are widely used. Thus, this paper proposes TizBin, a low-power processing in-sensor scheme with event and object detection capabilities to eliminate power costs of data conversion and transmission and enable data-intensive neural network tasks. Once the moving object is detected, TizBin architecture switches to the high-power object detection mode to capture the image. TizBin offers several unique features, such as analog convolutions enabling low-precision ternary weight neural networks (TWNN) to mitigate the overhead of analog buffer and analog-to-digital converters. Moreover, TizBin exploits non-volatile magnetic RAMs to store NN’s weights, remarkably reducing static power consumption. Our circuit-to-application co-simulation results for TWNNs demonstrate minor accuracy degradation on various image datasets, while TizBin achieves a frame rate of 1000 and efficiency of ∼1.83 TOp/s/W.

semiMul: Floating-Point Free Implementations for Efficient and Accurate Neural Network Training

Conference Paper

Ali Nezhadi, Shaahin Angizi, and Arman Roohi

21st IEEE International Conference on Machine Learning and Applications (ICMLA)

Publication year: 2022

Multiply–accumulate operation (MAC) is a fundamental component of machine learning tasks, where multiplication (either integer or float multiplication) compared to addition is costly in terms of hardware implementation or power consumption. In this paper, we approximate floating-point multiplication by converting it to integer addition while preserving the test accuracy of shallow and deep neural networks. We mathematically show and prove that our proposed method can be utilized with any floating-point format (e.g., FP8, FP16, FP32, etc.). It is also highly compatible with conventional hardware architectures and can be employed in CPU, GPU, or ASIC accelerators for neural network tasks with minimum hardware cost. Moreover, the proposed method can be utilized in embedded processors without a floating-point unit to perform neural network tasks. We evaluated our method on various datasets such as MNIST, FashionMNIST, SVHN, CIFAR-10, and CIFAR-100, with both FP16 and FP32 arithmetic. The proposed method preserves the test accuracy and, in some cases, overcomes the overfitting problem and improves the test accuracy.
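
One well-known way to turn a floating-point multiplication into an integer addition is the logarithm-style (Mitchell-type) approximation: add the two IEEE 754 bit patterns and subtract the bit pattern of 1.0. The sketch below shows that generic trick for FP32; the abstract does not give the exact semiMul formulation, so this should be read as an assumption-laden illustration rather than the authors' method. It assumes normal, nonzero inputs whose product stays in the FP32 range.

```python
import struct

ONE_BITS = 0x3F800000  # IEEE 754 single-precision bit pattern of 1.0


def float_to_bits(x: float) -> int:
    """Reinterpret a value as its 32-bit single-precision bit pattern."""
    return struct.unpack("<I", struct.pack("<f", x))[0]


def bits_to_float(i: int) -> float:
    """Reinterpret a 32-bit pattern as a single-precision value."""
    return struct.unpack("<f", struct.pack("<I", i))[0]


def approx_mul(a: float, b: float) -> float:
    """Approximate a*b with one integer addition on the bit patterns.

    Assumes normal, nonzero inputs and an in-range result; zeros,
    subnormals, infinities, and NaNs are not handled in this sketch.
    """
    ia, ib = float_to_bits(a), float_to_bits(b)
    sign = (ia ^ ib) & 0x80000000
    # Adding the magnitudes adds the exponents and (approximately) the
    # log2 of the mantissas; subtracting ONE_BITS removes the doubled bias.
    magnitude = (ia & 0x7FFFFFFF) + (ib & 0x7FFFFFFF) - ONE_BITS
    return bits_to_float(sign | (magnitude & 0x7FFFFFFF))


exact_case = approx_mul(2.0, 3.0)     # mantissa fractions sum below 1
worst_case = approx_mul(1.5, 1.5)     # true product is 2.25
```

When the mantissa fractions sum to less than one the result can be exact, as for 2.0 × 3.0; the largest relative error of this approximation is about 11%, reached at 1.5 × 1.5.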

SCiMA: a Generic Single-Cycle Compute-in-Memory Acceleration Scheme for Matrix Computations

Conference Paper

Sepehr Tabrizchi, Shaahin Angizi, and Arman Roohi

IEEE International Symposium on Circuits and Systems (ISCAS)

Publication year: 2022

This work proposes a new generic Single-cycle Compute-in-Memory (CiM) Accelerator for matrix computation named SCiMA. SCiMA is developed on top of the existing commodity Spin-Orbit Torque Magnetic Random-Access Memory chip. Every sub-array’s peripherals are transformed to realize a full set of single-cycle 2- and 3-input in-memory bulk bitwise functions specifically designed to accelerate a wide variety of graph and matrix multiplication tasks. We explore SCiMA’s efficiency by selecting a complex matrix processing operation, i.e., calculating determinant as an essential and under-explored application in the CiM domain. The cross-layer device-to-architecture simulation framework shows the presented platform can reduce energy consumption by 70.43% compared with the most recent CiM designs implemented with the same memory technology. SCiMA also achieves up to 2.5x speedup compared with current CiM platforms.

ReFACE: Efficient Design Methodology for Acceleration of Digital Filter Implementations

Conference Paper

Arman Roohi, Shaahin Angizi, Pooriya Navaeilavasani, and MohammadReza Taheri

23rd International Symposium on Quality Electronic Design (ISQED)

Publication year: 2022

Because of the impressive performance and success of artificial intelligence (AI)-based applications, filters as a primary part of digital signal processing systems are widely used, especially finite impulse response (FIR) filtering. Although they offer several advantages, such as stability, they are computationally intensive. Hence, in this paper, we propose a systematic methodology to efficiently implement computing in-memory (CIM) accelerators for FIR filters using various CMOS and post-CMOS technologies, referred to as ReFACE. ReFACE leverages a residue number system (RNS) to speed up the essential operations of digital filters, instead of traditional arithmetic implementation that suffers from the inevitable lengthy carry propagation chain. Moreover, the CIM architecture eliminates the off-chip data transfer by leveraging the maximum internal bandwidth of memory chips to realize a local and parallel computation on small residues independently. Taking advantage of both RNS and CIM results in significant power and latency reduction. As a proof-of-concept, ReFACE is leveraged to implement a 4-tap RNS FIR. The simulation results verified its superior performance with up to 85× and 12× improvement in energy consumption and execution time, respectively, compared with an ASIC accelerator.
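
To make the carry-free property of the residue number system concrete, the sketch below evaluates one output sample of a 4-tap FIR filter entirely on small residues and converts back with the Chinese Remainder Theorem. The moduli (7, 8, 9), the unsigned operands, and all function names are illustrative assumptions; the abstract does not specify ReFACE's moduli set or conversion circuits. Python 3.8 or later is assumed for `math.prod` and the modular inverse via `pow`.

```python
from math import prod

MODULI = (7, 8, 9)  # pairwise coprime; dynamic range 7 * 8 * 9 = 504 (assumed)


def to_rns(x: int) -> tuple:
    """Forward conversion: one small residue per modulus."""
    return tuple(x % m for m in MODULI)


def from_rns(residues) -> int:
    """Reverse conversion using the Chinese Remainder Theorem."""
    big_m = prod(MODULI)
    total = 0
    for r, m in zip(residues, MODULI):
        partial = big_m // m
        total += r * partial * pow(partial, -1, m)  # modular inverse
    return total % big_m


def rns_fir_output(taps, window) -> int:
    """One FIR output sample: sum(h[k] * x[k]) computed per residue channel.

    Each channel works independently on small numbers, so no carry
    propagates between channels. Inputs are assumed non-negative and the
    true result must stay below the dynamic range.
    """
    acc = [0] * len(MODULI)
    for h, x in zip(taps, window):
        hr, xr = to_rns(h), to_rns(x)
        acc = [(a + p * q) % m for a, p, q, m in zip(acc, hr, xr, MODULI)]
    return from_rns(acc)


y = rns_fir_output([1, 2, 3, 4], [5, 6, 7, 8])  # 5 + 12 + 21 + 32
```

The three residue channels could run in parallel, which is the property an in-memory implementation would exploit.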

ReD-LUT: Reconfigurable In-DRAM LUTs Enabling Massive Parallel Computation

Conference Paper

Ranyang Zhou, Arman Roohi, Durga Misra, and Shaahin Angizi

41st IEEE/ACM International Conference on Computer-Aided Design (ICCAD)

Publication year: 2022

In this paper, we propose a reconfigurable processing-in-DRAM architecture named ReD-LUT leveraging the high density of commodity main memory to enable a flexible, general-purpose, and massively parallel computation. ReD-LUT supports lookup table (LUT) queries to efficiently execute complex arithmetic operations (e.g., multiplication, division, etc.) via only memory read operation. In addition, ReD-LUT enables bulk bit-wise in-memory logic by elevating the analog operation of the DRAM sub-array to implement Boolean functions between operands stored in the same bit-line beyond the scope of prior DRAM-based proposals. We explore the efficacy of ReD-LUT in two computationally-intensive applications, i.e., low-precision deep learning acceleration, and the Advanced Encryption Standard (AES) computation. Our circuit-to-architecture simulation results show that for a quantized deep learning workload, ReD-LUT reduces the energy consumption per image by a factor of 21.4× compared with the GPU and achieves ~37.8× speedup and 2.1× energy-efficiency over the best in-DRAM bit-wise accelerators. As for AES data-encryption, it reduces energy consumption by a factor of ~2.2× compared to an ASIC implementation.

Integrated Sensing and Computing using Energy-Efficient Magnetic Synapses

Conference Paper

Shaahin Angizi and Arman Roohi

23rd International Symposium on Quality Electronic Design (ISQED)

Publication year: 2022

This work presents a processing-in-sensor platform leveraging magnetic devices as a flexible and efficient solution for real-time and smart image processing in AI devices. The main idea is to combine the typical sensing mechanism with an intrinsic coarse-grained convolution operation at the edge to remarkably reduce the power consumption of data conversion and transmission to an off-chip processor imposed by the first layer of deep neural networks. Our initial results demonstrate acceptable accuracy on the SVHN image data-set, while the proposed platform substantially reduces data conversion and transmission energy compared with a baseline sensor-CPU platform.

FlexiDRAM: A Flexible in-DRAM Framework to Enable Parallel General-Purpose Computation

Conference Paper

Ranyang Zhou, Arman Roohi, Durga Misra, and Shaahin Angizi

Proceedings of the ACM/IEEE International Symposium on Low Power Electronics and Design (ISLPED)

Publication year: 2022

In this paper, we propose a Flexible processing-in-DRAM framework named FlexiDRAM that supports the efficient implementation of complex bulk bitwise operations. This framework is developed on top of a new reconfigurable in-DRAM accelerator that leverages the analog operation of DRAM sub-arrays and elevates it to implement XOR2-MAJ3 operations between operands stored in the same bit-line. FlexiDRAM first generates an efficient XOR-MAJ representation of the desired logic and then appropriately allocates DRAM rows to the operands to execute any in-DRAM computation. We develop the ISA and software support required for in-DRAM computation. FlexiDRAM transforms current memory architecture to a massively parallel computational unit and can be leveraged to significantly reduce the latency and energy consumption of complex workloads. Our extensive circuit-to-architecture simulation results show that, averaged across two well-known deep learning workloads, FlexiDRAM achieves ∼15× energy saving and 13× speedup over the GPU, outperforming recent processing-in-DRAM platforms.
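
A classic example of an XOR-MAJ representation is the full adder, whose sum is two chained XOR2 operations and whose carry is one MAJ3. The sketch below models bulk bitwise addition with each Python integer standing in for one memory row that holds the same bit position of many operands. The row layout and the helper names are assumptions for illustration and do not reflect FlexiDRAM's actual ISA or row allocation.

```python
def xor2(a: int, b: int) -> int:
    return a ^ b


def maj3(a: int, b: int, c: int) -> int:
    return (a & b) | (a & c) | (b & c)


def to_rows(values, width):
    """Bit-slice operands: row i holds bit i of every value (one lane each)."""
    return [
        sum(((v >> i) & 1) << lane for lane, v in enumerate(values))
        for i in range(width)
    ]


def from_rows(rows, lanes):
    """Inverse of to_rows: rebuild one integer per lane."""
    return [
        sum(((r >> lane) & 1) << i for i, r in enumerate(rows))
        for lane in range(lanes)
    ]


def bulk_add(a_rows, b_rows):
    """Ripple-carry addition of all lanes at once using only XOR2 and MAJ3."""
    carry = 0
    out = []
    for a, b in zip(a_rows, b_rows):
        out.append(xor2(xor2(a, b), carry))  # sum bit for every lane
        carry = maj3(a, b, carry)            # carry bit for every lane
    out.append(carry)                        # final carry-out row
    return out


a_vals, b_vals = [3, 5, 7, 1], [1, 2, 7, 6]
sums = from_rows(bulk_add(to_rows(a_vals, 3), to_rows(b_vals, 3)), lanes=4)
```

Each loop iteration touches whole rows, so the work per bit position is independent of how many operands are packed side by side.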

Enabling efficient training of convolutional neural networks for histopathology images

Conference Paper

Mohammed H. Alali, Arman Roohi, and Jitender S. Deogun

International Conference on Image Analysis and Processing

Publication year: 2022

Convolutional Neural Networks (CNNs) have gained lots of attention in various digital imaging applications. They have proven to produce incredible results, especially on big data, that require high processing demands. With the increasing size of datasets, especially in computational pathology, CNN processing takes even longer and uses higher computational resources. Considerable research has been conducted to improve the efficiency of CNN, such as quantization. This paper aims to apply efficient training and inference of ResNet using quantization on histopathology images, the Patch Camelyon (PCam) dataset. An analysis for efficient approaches to classify histopathology images is presented. First, the original RGB-colored images are evaluated. Then, compression methods such as channel reduction and sparsity are applied. When comparing sparsity on grayscale with RGB modes, classification accuracy is relatively the same, but the total number of MACs is less in sparsity on grayscale by 77% than RGB. A higher classification result was achieved by grayscale mode, which requires much fewer MACs than the original RGB mode. Our method’s low energy and processing make this project suitable for inference on wearable healthcare low powered devices and mobile hospitals in rural areas or developing countries. This also assists pathologists by presenting a preliminary diagnosis.

Efficient Targeted Bit-Flip Attack Against the Local Binary Pattern Network

Conference Paper

Arman Roohi and Shaahin Angizi

IEEE International Symposium on Hardware Oriented Security and Trust (HOST)

Publication year: 2022

Deep neural networks (DNNs) have shown their great capability of surpassing human performance in many areas. With the help of quantization, artificial intelligence (AI) powered devices are ubiquitously deployed. Yet, the easily accessible AI-powered edge devices become the target of malicious users who can deteriorate the privacy and integrity of the inference process. This paper proposes two adversarial attack scenarios, including three threat models, which crush local binary pattern networks (LBPNet). These attacks can be applied maliciously to flip a limited number of susceptible bits in kernels within the system’s shared memory. The threat could be mounted through a Row-Hammer attack and significantly reduces the model’s accuracy. Our preliminary simulation results demonstrate that flipping only the most significant bit of the first LBP layer decreases the accuracy from 99.51% to 18% on the MNIST dataset. We then briefly discuss potential hardware/software-oriented defense mechanisms as countermeasures to such attacks.
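
The outsized effect of a most-significant-bit flip can be seen on a single stored value. The sketch below assumes an 8-bit two's complement parameter, which is an illustrative choice only; the abstract does not state how LBPNet kernel parameters are encoded in memory, and the helper names are hypothetical.

```python
def flip_bit(value: int, bit: int, width: int = 8) -> int:
    """Return the stored word with one bit inverted."""
    return (value ^ (1 << bit)) & ((1 << width) - 1)


def as_signed(value: int, width: int = 8) -> int:
    """Interpret a stored word as a two's complement integer (assumed format)."""
    if value & (1 << (width - 1)):
        return value - (1 << width)
    return value


stored = 0b00000101                           # parameter value 5
after_lsb = as_signed(flip_bit(stored, 0))    # least significant bit flipped
after_msb = as_signed(flip_bit(stored, 7))    # most significant bit flipped
```

Flipping the lowest bit moves the value from 5 to 4, while flipping the highest bit moves it to -123, which suggests why a handful of well-chosen flips can be so damaging.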

EaseMiss: HW/SW Co-Optimization for Efficient Large Matrix-Matrix Multiply Operations

Conference Paper

Ali Nezhadi, Shaahin Angizi, and Arman Roohi

IEEE 15th Dallas Circuit And System Conference (DCAS)

Publication year: 2022

Due to the essential role of matrix multiplication in many scientific applications, especially in data- and compute-intensive applications, we explore the efficiency of widely used matrix multiplication algorithms. This paper proposes an HW/SW co-optimization technique, entitled EaseMiss, to reduce the cache miss ratio for large general matrix-matrix multiplications. First, we revise the algorithms by applying three software optimization techniques to improve performance. Choosing the proper algorithms to achieve the best performance is examined and formulated. By leveraging the proposed optimizations, the number of cache misses decreases by a factor of 3 in a conventional data cache. To improve further, we then propose SPLiTCACHE to virtually split the data cache according to the matrices’ dimensions for better data reuse. This method can be easily embedded into conventional general-purpose processors or GPUs at the cost of negligible logic circuit overhead. With a correct and valid split, the obtained results show that cache misses are reduced by a factor of 2 on average compared to the conventional data cache in machine learning workloads.

Design and Evaluation of a Robust Power-Efficient Ternary SRAM Cell

Conference Paper

Sepehr Tabrizchi, Shaahin Angizi, and Arman Roohi

IEEE 65th International Midwest Symposium on Circuits and Systems (MWSCAS)

Publication year: 2022

This paper presents a novel ternary Static Random Access Memory (T-SRAM) cell. To validate the functionality of the proposed T-SRAM, carbon nanotube field-effect transistors are selected as a proof-of-concept, although either post-CMOS or CMOS technologies can replace them. Our T-SRAM intrinsically eliminates the need to store the intermediate ternary state’s voltage level, thus significantly reducing leakage power and increasing robustness. Extensive SPICE simulation and comparison results show that the proposed T-SRAM can be a promising alternative to CMOS SRAMs deployed in low-power edge AI. Further, the analysis verifies that the proposed design is more robust than previous implementations.

A Processing-in-Pixel Accelerator based on Multi-level HfOx ReRAM

Conference Paper

Minhaz Abedin, Arman Roohi, Nathaniel Cady, and Shaahin Angizi

International Conference on Compilers, Architecture, and Synthesis for Embedded Systems (CASES)

Publication year: 2022

This work paves the way to realize a processing-in-pixel accelerator based on a multi-level HfOx ReRAM as a flexible, energy-efficient, and high-performance solution for real-time and smart image processing at edge devices. The proposed design intrinsically implements and supports a coarse-grained convolution operation in low-bit-width neural networks leveraging a novel compute-pixel with non-volatile weight storage at the sensor side. Our evaluations show that such a design can remarkably reduce the power consumption of data conversion and transmission to an off-chip processor while maintaining accuracy compared with the recent in-sensor computing designs.

RNSiM: Efficient Deep Neural Network Accelerator Using Residue Number Systems

Conference Paper

Arman Roohi, MohammadReza Taheri, Shaahin Angizi, and Deliang Fan

2021 IEEE/ACM International Conference On Computer Aided Design (ICCAD)

Publication year: 2021

In this paper, we propose an efficient convolutional neural network (CNN) accelerator design, entitled RNSiM, based on the Residue Number System (RNS) as an alternative for the conventional binary number representation. Instead of traditional arithmetic implementation that suffers from the inevitable lengthy carry propagation chain, the novelty of RNSiM lies in that all the data, including stored weights and communication/computation, are handled in the RNS domain. Due to the inherent parallelism of the RNS arithmetic, power and latency are significantly reduced. Moreover, an enhanced integrated intermodulo operation core is developed to decrease the overhead imposed by non-modular operations. Further improvement in systems’ performance efficiency is achieved by developing efficient Processing-in-Memory (PIM) designs using various volatile CMOS and non-volatile Post-CMOS technologies to accelerate RNS-based multiplication-and-accumulations (MACs). The RNSiM accelerator’s performance on different datasets, including MNIST, SVHN, and CIFAR-10, is evaluated. With almost the same accuracy as the baseline CNN, the RNSiM accelerator can significantly increase both energy-efficiency and speedup compared with the state-of-the-art FPGA, GPU, and PIM designs. RNSiM and other RNS-PIMs, based on our method, reduce the energy consumption by 28−77× and 331−897× compared with the FPGA and the GPU platforms, respectively.

Processing-in-Memory Acceleration of MAC-based Applications Using Residue Number System: A Comparative Study

Conference Paper

Shaahin Angizi, Arman Roohi, MohammadReza Taheri, Deliang Fan

31st ACM Great Lakes Symposium on VLSI (GLSVLSI 2021), June 22-25, 2021

Publication year: 2021

Entropy-Based Modeling for Estimating Adversarial Bit-flip Attack Impact on Binarized Neural Network

Conference Paper

Navid Khoshavi, Saman Sargolzaei, Yu Bi, A. Roohi

26th Asia and South Pacific Design Automation Conference (ASP-DAC), Jan.18-21

Publication year: 2021

Over the past years, the high demand for efficiently processing deep learning (DL) models has driven the market of chip design companies. However, the new Deep Chip architectures, a common term for DL hardware accelerators, have paid little attention to the security requirements of quantized neural networks (QNNs), while black/white-box adversarial attacks can jeopardize the integrity of the inference accelerator. Therefore, in this paper, the resiliency of QNN topologies to black-box attacks is comprehensively studied. Herein, different attack scenarios are performed on an FPGA-processor co-design, and the collected results are extensively analyzed to estimate the degree of impact of different types of attacks on the QNN topology. To be specific, we evaluated the sensitivity of the QNN accelerator to a varying number of bit-flip attacks (BFAs) that might occur in the operational lifetime of the device. The BFAs are injected at uniformly distributed times either across the entire QNN or per individual layer during the image classification. The acquired results are utilized to build the entropy-based model that can be leveraged to construct QNN architectures resilient to bit-flip attacks.

SHIELDeNN: Online Accelerated Framework for Fault-Tolerant Deep Neural Network Architectures

Conference Paper

Navid Khoshavi, Arman Roohi, Connor Broyles, Saman Sargolzaei, Yu Bi, David Z. Pan

57th Design Automation Conference (DAC), San Francisco, CA, USA, July 19-23

Publication year: 2020

We propose SHIELDeNN, an end-to-end inference accelerator framework that synergizes the mitigation approach and computational resources to realize a low-overhead error-resilient Neural Network (NN) overlay. We develop a rigorous fault assessment paradigm to delineate a ground-truth fault-skeleton map for revealing the most vulnerable parameters in NN. The error-susceptible parameters and resource constraints are given to a function to find a superior design. The error-resiliency magnitude offered by SHIELDeNN can be adjusted based on the given boundaries. SHIELDeNN methodology improves the error-resiliency magnitude of cnvW1A1 by 17.19% and 96.15% for 100 MBUs that target weight and activation layers, respectively.

Normally-Off Computing Design Methodology Using Spintronics: From Devices to Architectures

Conference Paper

Arman Roohi

International Green and Sustainable Computing Conference, October 19-22

Publication year: 2020

This work shows a promising solution to efficiently implement normally-off computing (NoC) and power analysis side-channel attack resilient designs using spin-based devices. Spintronics, as post-CMOS devices, provide interesting features such as non-volatility, which plays an essential role in NoC structures. Besides, spin-based components can naturally function as a polymorphic gate (PG) that realizes reconfigurable logic functions with inherent security attributes. However, Spintronics’ utilization imposes power and area overhead compared to CMOS-based designs. Thus, herein an efficient design methodology is introduced to mitigate these overheads. This approach, entitled NV-Clustering, is first extended to realize the targeted insertion of PG modules within VLSI implementations to make them resilient against power failure. Then PARC, as an extension of NV-Clustering, was developed as a power-masked synthesis method for use in the presence of power analysis side-channel attacks. PARC randomly generates power-maskable building blocks with the optimum PDP and area overhead. In addition to NV-Clustering, PARC can be expanded against fault injection approaches due to PG modules’ reconfigurability to cover the faults.

Fiji-FIN: A Fault Injection Framework on Quantized Neural Network Inference Accelerator

Conference Paper

Navid Khoshavi, Connor Broyles, Yu Bi, Arman Roohi

IEEE International Conference on Machine Learning and Applications (ICMLA), October 19-22

Publication year: 2020

In recent years, the big data boom has boosted the development of highly accurate prediction models derived from machine learning (ML) and deep learning (DL) algorithms. These models can be orchestrated on customized hardware in safety-critical missions to accelerate the inference process in ML/DL-powered IoT. However, radiation-induced transient faults and black/white-box attacks can potentially impact the individual parameters in ML/DL models, which may result in generating noisy data/labels or compromising the pre-trained model. In this paper, we propose Fiji-FIN, a suitable framework for evaluating the resiliency of IoT devices during the ML/DL model execution with respect to the major security challenges such as bit perturbation attacks and soft errors. Fiji-FIN is capable of injecting both single bit/event flip/upset and multi-bit flip/upset faults on the architectural ML/DL accelerator embedded in ML/DL-powered IoT. Fiji-FIN is significantly more accurate compared to the existing software-level fault injection paradigms on ML/DL-driven IoT devices.

Processing-In-Memory Acceleration of Convolutional Neural Networks for Energy-Efficiency, and Power-Intermittency Resilience

Conference Paper

Arman Roohi, Shaahin Angizi, Deliang Fan, Ronald F DeMara

20th International Symposium on Quality Electronic Design (ISQED), Santa Clara, CA, USA, 2019, pp. 8-13.

Publication year: 2019

Abstract

Herein, a bit-wise Convolutional Neural Network (CNN) in-memory accelerator is implemented using Spin-Orbit Torque Magnetic Random Access Memory (SOT-MRAM) computational sub-arrays. It utilizes a novel AND-Accumulation method capable of significantly reducing energy consumption within convolutional layers and performs various low bit-width CNN inference operations entirely within MRAM. Power-intermittence resiliency is also enhanced by retaining the partial state information needed to maintain computational forward-progress, which is advantageous for battery-less IoT nodes. Simulation results indicate ~5.4× higher energy-efficiency and 9× speedup over ReRAM-based acceleration, or ~9.7× higher energy-efficiency and 13.5× speedup over recent CMOS-only approaches, while maintaining inference accuracy comparable to baseline designs.

IRC: Cross-layer design exploration of Intermittent Robust Computation units for IoTs

Conference Paper

Arman Roohi, Ronald F DeMara

IEEE Computer Society Annual Symposium on VLSI (ISVLSI), Miami, FL, USA, 2019, pp. 354-359.

Publication year: 2019

Abstract

Energy-harvesting-powered computing offers intriguing and vast opportunities to dramatically transform the landscape of the Internet of Things (IoT) devices by utilizing ambient sources of energy to achieve battery-free computing. In order to operate within the restricted energy capacity and intermittency profile, it is proposed to innovate Intermittent Robust Computation (IRC) Unit as a new duty-cycle-variable computing approach leveraging the non-volatility inherent in spin-based switching devices. The foundations of IRC will be advanced from the device-level upwards, by extending a Spin Hall Effect Magnetic Tunnel Junction (SHE-MTJ) device. The device will then be used to realize SHE-MTJ Majority/Polymorphic Gate (MG/PG) logic approaches and libraries. Then a Logic-Embedded Flip-Flop (LE-FF) is developed to realize rudimentary Boolean logic functions along with an inherent state-holding capability within a compact footprint. Finally, the NV-Clustering synthesis procedure and corresponding tool module are proposed to instantiate the LE-FF library cells within conventional Register Transfer Language (RTL) specifications. This selectively clusters together logic and NV state-holding functionality, based on energy and area minimization criteria. It also realizes middleware-coherent, intermittent computation without checkpointing, micro-tasking, or software bloat and energy overheads vital to IoT. Simulation results for various benchmark circuits including ISCAS-89 validate functionality and power dissipation, area, and delay benefits.

Synthesis of Normally-Off Boolean Circuits: An Evolutionary Optimization Approach Utilizing Spintronic Devices

Conference Paper

Arman Roohi, Ramtin Zand, Ronald F DeMara

19th International Symposium on Quality Electronic Design (ISQED), Santa Clara, CA, 2018, pp. 49-54.

Publication year: 2018

Abstract

In this paper, we develop an evolutionary-driven circuit optimization methodology, which can be leveraged for the synthesis of spintronic-based normally-off computing (NoC) circuits. NoC architectures distribute nonvolatile memory elements throughout the CMOS logic plane, creating a new class of fine-grained functionally-constrained synthesis challenges. Spin-based NoC circuit synthesis objectives include increased computational throughput and reduced static power consumption. Our proposed methodology utilizes Genetic Algorithms (GAs) to optimize the implementation of a Boolean logic expression in terms of area, delay, or power consumption. It first leverages the spin-based device characteristics to achieve a primary semi-optimized implementation, then further performance optimization is applied to the implemented design based on the NoC requirements and optimization criteria. As a proof-of-concept, the optimization approach is leveraged to implement a functionally-complete set of Boolean logic gates using spin Hall effect (SHE)-magnetic tunnel junctions (MTJs), which are optimized for both power and delay objectives. Such NoC synthesis methodologies support NoC circuit design for emerging-device and hybrid CMOS logic applications. Finally, simulation results and analyses verified the functionality of our proposed optimization tool for NoC circuit implementations.

Logic-Encrypted Synthesis for Energy-Harvesting-Powered Spintronic-Embedded Datapath Design

Conference Paper

Arman Roohi, Ramtin Zand, Ronald F DeMara

Great Lakes Symposium on VLSI (GLSVLSI ’18). Association for Computing Machinery, New York, NY, USA, 9–14.

Publication year: 2018

Abstract

The objectives of advancing secure, intermittency-tolerant, and energy-aware logic datapaths are addressed herein by developing a spin-based design methodology and its corresponding synthesis steps. The approach selectively inserts Non-Volatile (NV) Polymorphic Gates (PGs) to realize datapaths which are suitable for intrinsic operation in Energy-Harvesting-Powered (EHP) devices. Spin Hall Effect (SHE)-based Magnetic Tunnel Junctions (MTJs) are utilized to design NV-PGs, which are combined within a Flip-Flop (FF) circuit to develop a PG-FF realizing Boolean logic functions with inherent state-holding capability. The reconfigurability of PGs is leveraged for logic encryption to enhance the security of the developed intermittency-resilient circuits, which are applied to ISCAS-89, MCNC, and ITC-99 benchmarks. The results obtained indicate that the PG-FF based design can achieve up to 7.1% and 13.6% improvements in terms of area and Power Delay Product (PDP), respectively, compared to NV-FF based methodologies that replace the CMOS-based FFs with NV-FFs. Further PDP improvements are achieved by using low-energy-barrier SHE-MTJ devices within the PG-FF circuit. SHE-MTJs with a 30kT energy barrier exhibit a 40.5% reduction in PDP at the cost of lower retention times in the range of minutes, which is still sufficient to achieve forward progress in EHP devices having more than hundreds of power-on and power-off cycles per minute.

Heterogeneous technology configurable fabrics for field-programmable co-design of CMOS and spin-based devices

Conference Paper

Ronald F DeMara, Arman Roohi, Ramtin Zand, Steven D Pyle

Journal of Consumer Psychology, Volume 22, Issue 2, April 2012, Pages 191-194

Publication year: 2018

Abstract

The architecture, operation, and characteristics of two post-CMOS reconfigurable fabrics are identified to realize energy-sparing and resilience features, while remaining feasible for near-term fabrication. First, Storage Cell Replacement Fabrics (SCRFs) provide a reconfigurable computing platform utilizing near-zero leakage Spin Hall Effect devices which replace SRAM bit-cells within Look-Up Tables (LUTs) and/or switch boxes to complement the advantages of MOS transistor-based multiplexer select trees. Second, Heterogeneous Technology Configurable Fabrics (HTCFs) are identified to extend reconfigurable computing platforms via a palette of CMOS, spin-based, or other emerging device technologies, such as various Magnetic Tunnel Junction (MTJ) and Domain Wall Motion devices. HTCFs are composed of a triad of Emerging Device Blocks, CMOS Logic Blocks, and Signal Conversion Blocks. This facilitates a novel architectural approach to reduce leakage energy, minimize communication occurrence and energy cost by eliminating unnecessary data transfer, and support auto-tuning for resilience. Furthermore, HTCFs enable new advantages of technology co-design which trades off alternative mappings between emerging devices and transistors at runtime by allowing dynamic remapping to adaptively leverage the intrinsic computing features of each device technology. Both SCRFs and HTCFs offer a platform for fine-grained Logic-In-Memory architectures and runtime adaptive hardware. SPICE simulations indicate 6% to 67% reduction in read energy, 21% reduction in reconfiguration energy, and 78% higher clock frequency versus alternative fabricated emerging device architectures, and a significant reduction in leakage compared to CMOS-based approaches.

Secure Intermittent-Robust Computation for Energy Harvesting Device Security and Outage Resilience

Conference Paper

Arman Roohi, Ronald F DeMara, Longfei Wang, Selçuk Köse

IEEE SmartWorld, Ubiquitous Intelligence & Computing, Advanced & Trusted Computed, Scalable Computing & Communications, Cloud & Big Data Computing, Internet of People and Smart City Innovation

Publication year: 2017

Abstract:

In this paper, we propose Secure Intermittent-Robust Computation (SIRC) for Energy Harvesting Powered Internet of Things (IoT) Devices. This effort innovates a new duty-cycle-variable computing approach to facilitate and invigorate security in energy-harvesting-powered IoT network nodes. The proposed SIRC architecture is developed from the ground up by extending emerging post-CMOS switching elements to realize majority-gate logic that is intrinsically capable of middleware-coherent, battery-free operation without check-pointing or micro-tasking, and that can be resilient to wireless power transfer attacks, including charge attacks and data attacks. Potential countermeasures for these attacks are identified at the circuit level through gate-resolution immunity to power interruption. As a proof-of-concept, a power-maskable design using the SIRC approach is developed for the s27 circuit from the ISCAS-89 benchmark suite. The obtained results show that SIRC provides reduced area consumption and an increased number of power traces required to extract encrypted data.

Reactive rejuvenation of CMOS logic paths using self-activating voltage domains

Conference Paper

Rizwan A Ashraf, Ahmad Al-Zahrani, Navid Khoshavi, Ramtin Zand, Soheil Salehi, Arman Roohi, Mingjie Lin, Ronald F DeMara

IEEE International Symposium on Circuits and Systems (ISCAS), Lisbon, 2015, pp. 2944-2947.

Publication year: 2015

Abstract:

Although technology scaling is pursued to realize higher-performance computer systems, it also results in Integrated Circuits (ICs) suffering from increasing Process, Voltage, and Temperature (PVT) variations and adverse aging effects. In most cases, these reliability threats manifest themselves as timing errors on critical speed-paths of the circuit if a large design guardband is not reserved. In this work, we propose the Reactive Rejuvenation (RR) architectural approach, consisting of detection and recovery phases, to mitigate BTI-induced aging of the circuit. The BTI impact on the performance of critical and near-critical paths is continuously examined through a lightweight logic circuit which asserts an error signal in the case of any timing violation in those paths. When a timing violation occurs in the system, the timing-sensitive portion of the circuit is recovered from BTI by switching computations to a redundant aging-critical voltage domain. The proposed technique achieves aging mitigation and reduced energy consumption as compared to a baseline circuit. Thus, significant voltage guardbands to meet the desired timing specification are avoided.

Modeling an Improved Modified Type in Metallic Quantum-Dot Fixed Cell for Nano Structure Implementation

Conference Paper

Samira Sayedsalehi, Arman Roohi

23rd Euromicro International Conference on Parallel, Distributed, and Network-Based Processing, Turku, 2015, pp. 412-415.

Publication year: 2015

Abstract:

Quantum-dot cellular automata (QCA) is a transistor-less computation approach which encodes binary information via the configuration of charges among quantum dots. The fundamental QCA logic primitives are the majority gate and the inverter gate, which can be employed to design various QCA circuits. In this study, a detailed model of a modified type of fixed metal-dot QCA cell is explored by applying a fixed, predefined level of polarization. An efficient architecture, controlled by the predefined polarization of fixed cells positioned next to the input cells, is presented for implementing a desired nano structure. The efficiency of the proposed approach is verified by implementing several important examples of Boolean functions.

Cost-efficient QCA reversible combinational circuits based on a new reversible gate

Conference Paper

Amir Mokhtar Chabi, Arman Roohi, Ronald F DeMara, Shaahin Angizi, Keivan Navi, Hossein Khademolhosseini

18th CSI International Symposium on Computer Architecture and Digital Systems (CADS), Tehran, 2015, pp. 1-6.

Publication year: 2015

Abstract:

Nanotechnologies, notably Quantum-dot Cellular Automata (QCA), provide an attractive perspective for future computing technologies. In this paper, QCA is investigated as an implementation method for reversible logic. A novel XOR gate and a new approach to implementing a 2:1 multiplexer are presented. Moreover, an efficient universal reversible gate based on the proposed XOR gate is designed. The proposed reversible gate outperforms other reversible gates in implementing the QCA standard benchmark combinational functions in terms of area, complexity, power consumption, and cost function. The gate achieves the lowest overall cost among the most cost-efficient designs presented so far, with a reduction of 24%.
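
As a rough illustration of what "reversible" means here, the sketch below checks that a gate maps distinct inputs to distinct outputs. It does not reproduce the gate proposed in the paper; the well-known Feynman (controlled-NOT) and Fredkin (controlled-swap) gates are used as stand-ins, and the helper names `is_reversible` and `mux2` are hypothetical.

```python
from itertools import product


def feynman(a, b):
    """Feynman (CNOT) gate: passes a through and outputs a XOR b."""
    return (a, a ^ b)


def fredkin(c, a, b):
    """Fredkin gate: swaps a and b when the control bit c is 1."""
    return (c, b, a) if c else (c, a, b)


def is_reversible(gate, n_inputs):
    """A gate is reversible if every input pattern yields a unique output."""
    inputs = list(product((0, 1), repeat=n_inputs))
    outputs = {gate(*bits) for bits in inputs}
    return len(outputs) == len(inputs)


def mux2(sel, d0, d1):
    """2:1 multiplexer read from the second output of a Fredkin gate."""
    return fredkin(sel, d0, d1)[1]


print(is_reversible(feynman, 2))                     # True
print(is_reversible(fredkin, 3))                     # True
print(is_reversible(lambda a, b: (a & b,), 2))       # False: AND loses information
```

The Fredkin gate is included because it shows how a 2:1 multiplexer can be obtained from a reversible primitive; the paper's own XOR-based construction may differ.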

Implementation of reversible logic design in nanoelectronics on basis of majority gates

Conference Paper

Arman Roohi, Hossein Khademolhosseini, Samira Sayedsalehi, Keivan Navi

The 16th CSI International Symposium on Computer Architecture and Digital Systems (CADS 2012), Shiraz, Fars, 2012, pp. 1-6.

Publication year: 2012

Abstract:

Due to its low power dissipation in computing, reversible logic is an attractive field of research in quantum and optical computing. Since conventional CMOS technology cannot be used for implementing reversible gates owing to its high power dissipation, novel technologies such as nano-scale ones are being employed. In this paper, we utilize Quantum-dot Cellular Automata (QCA) as a candidate technology for implementing reversible logic gates. This paper presents a new realization approach to reversible logic based on majority gates (MGs), and a new reversible gate is proposed as well. The gate is compared with an existing MG-based structure in terms of delay, complexity, and area. The results show that even though our gate requires more cells, it returns the outputs in fewer clock cycles, and hence the design is faster.

A new redundant method on representing numbers with moduli set {3ⁿ, 3ⁿ−1, 3ⁿ−2}

Conference Paper

Hossein Khademolhosseini, Arman Roohi

2011 International Conference on Computer, Communication and Electrical Technology (ICCCET), Tamilnadu, 2011, pp. 163-166.

Publication year: 2011

Abstract:

The residue number system (RNS) represents a number by its residues with respect to a moduli set. Because operations can run in parallel on each residue and the residues are smaller than their binary equivalents, calculations can be performed at higher speed. Owing to these features, RNS is used in many applications such as DSP devices and filters. Summation is the most widely used operation in this system, and conversions and other operations can be built from it. The method offered in this paper is a new definition for number representation using the moduli set {3ⁿ−2, 3ⁿ−1, 3ⁿ}. We use redundancy to improve the representation of the residues. This method makes conversions, summation, and consequently subtraction and multiplication faster, and it considerably simplifies their circuits.
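
To make the residue idea concrete, here is a minimal sketch of standard (non-redundant) RNS arithmetic over the moduli set {3ⁿ−2, 3ⁿ−1, 3ⁿ}. It does not implement the redundant representation proposed in the paper; the function names are hypothetical, and the conversion back to an integer uses the textbook Chinese Remainder Theorem.

```python
from math import gcd, prod


def moduli_set(n):
    """Return the moduli (3^n - 2, 3^n - 1, 3^n); intended for n >= 2."""
    base = 3 ** n
    return (base - 2, base - 1, base)


def to_rns(x, moduli):
    """Forward conversion: one residue per modulus."""
    return tuple(x % m for m in moduli)


def rns_add(a, b, moduli):
    """Addition is carried out independently (in parallel) per residue."""
    return tuple((ra + rb) % m for ra, rb, m in zip(a, b, moduli))


def from_rns(residues, moduli):
    """Reverse conversion using the Chinese Remainder Theorem."""
    dynamic_range = prod(moduli)
    total = 0
    for r, m in zip(residues, moduli):
        partial = dynamic_range // m
        total += r * partial * pow(partial, -1, m)  # modular inverse (Python 3.8+)
    return total % dynamic_range


moduli = moduli_set(2)  # (7, 8, 9), dynamic range 504
# The three moduli are pairwise coprime, which CRT requires.
assert all(gcd(p, q) == 1 for i, p in enumerate(moduli) for q in moduli[i + 1:])

s = rns_add(to_rns(123, moduli), to_rns(200, moduli), moduli)
print(s, from_rns(s, moduli))  # (1, 3, 8) 323
```

Each residue channel works on numbers no larger than 3ⁿ, which is why the per-channel adders can be small and run in parallel.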

A combinational logic optimization for majority gate-based nanoelectronic circuits based on GA

Conference Paper

A Roohi, M Kamrani, S Sayedsalehi, K Navi

2011 International Semiconductor Device Research Symposium (ISDRS), College Park, MD, 2011, pp. 1-2.

Publication year: 2011

Abstract

Quantum-dot cellular automata (QCA) is a new computing method in nanotechnology with notable features such as low power, small dimensions, and high-speed switching. A QCA device stores logic based on the position of individual electrons. The fundamental logic elements in QCA are the majority (Fig.1 (a)) and inverter gates (Fig.1 (b)) that operate based on the Coulomb repulsion between electrons [1].
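
The logical behavior of these two primitives can be sketched in a few lines. This is only a Boolean-level model, not a physical QCA simulation or the GA-based optimization from the paper; the function names are hypothetical. Fixing one majority-gate input to 0 or 1 yields AND or OR, respectively.

```python
from itertools import product


def majority(a, b, c):
    """Three-input majority: output is 1 when at least two inputs are 1."""
    return (a & b) | (b & c) | (a & c)


def inverter(a):
    """Logical NOT on a single bit."""
    return 1 - a


def and_gate(a, b):
    """AND obtained by fixing the third majority input to 0."""
    return majority(a, b, 0)


def or_gate(a, b):
    """OR obtained by fixing the third majority input to 1."""
    return majority(a, b, 1)


# Exhaustively confirm the derived gates against Python's bitwise operators.
for a, b in product((0, 1), repeat=2):
    assert and_gate(a, b) == (a & b)
    assert or_gate(a, b) == (a | b)
```

Together with the inverter, these derived gates form a functionally complete set, which is why majority-gate count is a natural cost metric for QCA logic optimization.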

Arman Roohi

Assistant Professor @ UNL
