In this work, we leverage the uni-polar switching behavior of Spin-Orbit Torque Magnetic Random Access Memory (SOT-MRAM) to develop an efficient digital Computing-in-Memory (CiM) platform named XOR-CiM. XOR-CiM converts typical MRAM sub-arrays into massively parallel computational cores with ultra-high bandwidth, greatly reducing the energy consumption of convolutional layers and accelerating the inference of X(N)OR-intensive Binary Neural Networks (BNNs). With inference accuracy similar to that of digital CiMs, XOR-CiM achieves ∼4.5× higher energy efficiency and 1.8× speedup compared to recent MRAM-based CiM platforms.
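For context, the X(N)OR kernel at the heart of BNN inference reduces a ±1 dot product to an XNOR followed by a popcount. A minimal software sketch of that kernel (a plain NumPy emulation, not the SOT-MRAM circuit; the bit encoding is an illustrative assumption):

```python
import numpy as np

def xnor_dot(a_bits, b_bits):
    """Dot product of two +/-1 vectors encoded as bits (1 -> +1, 0 -> -1).
    XNOR replaces multiplication; a popcount replaces accumulation."""
    n = a_bits.size
    matches = np.count_nonzero(~(a_bits ^ b_bits))  # popcount of XNOR
    return 2 * matches - n                          # map back to a +/-1 sum

a = np.array([1, 0, 1, 1, 0], dtype=bool)  # encodes [+1, -1, +1, +1, -1]
b = np.array([1, 1, 0, 1, 0], dtype=bool)  # encodes [+1, +1, -1, +1, -1]
print(xnor_dot(a, b))  # 1, matching the +/-1 dot product
```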
Recently, the Intelligent IoT (IIoT), encompassing various sensors, has gained significant attention for its capability to sense, decide, and act by leveraging artificial neural networks (ANNs). Nevertheless, achieving acceptable accuracy and high performance in visual systems requires a power- and delay-efficient architecture. In this paper, we propose an ultra-low-power processing in-sensor architecture, namely SenTer, realizing low-precision ternary multi-layer perceptron (MLP) networks that can operate in detection and classification modes. Moreover, SenTer supports two activation functions based on user needs and the desired accuracy-energy trade-off. SenTer performs all the required computations for the MLP's first layer in the analog domain and then submits the results to a co-processor. Therefore, SenTer significantly reduces the overhead of analog buffers, data conversion, and transmission power consumption by using only one ADC. Additionally, our simulation results demonstrate acceptable accuracy on various datasets compared to full-precision models.
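A ternary weight set {-1, 0, +1} turns every first-layer multiplication into an add, a subtract, or a skip, which is what makes analog in-sensor accumulation cheap. A minimal sketch of that multiply-free computation (a digital NumPy emulation, not the analog datapath; sizes are illustrative):

```python
import numpy as np

def ternary_matvec(W, x):
    """Matrix-vector product for W in {-1, 0, +1} without multiplications:
    +1 weights add the input, -1 weights subtract it, 0 weights are skipped."""
    return (W == 1).astype(x.dtype) @ x - (W == -1).astype(x.dtype) @ x

rng = np.random.default_rng(0)
W = rng.integers(-1, 2, size=(4, 8))  # ternary first-layer weights
x = rng.standard_normal(8)            # pixel readout vector
assert np.allclose(ternary_matvec(W, x), W @ x)
```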
In this work, we propose a Parallel Processing-In-DRAM architecture named P-PIM, leveraging the high density of DRAM to enable fast and flexible computation. P-PIM enables bulk bit-wise in-DRAM logic between operands in the same bit-line by elevating the analog operation of the memory sub-array through a novel dual-row activation mechanism. With this, P-PIM can opportunistically perform a complete and inexpensive in-DRAM RowHammer (RH) self-tracking and mitigation technique to protect the memory unit against this challenging security vulnerability. Our results show that P-PIM achieves ~72% higher energy efficiency than the fastest charge-sharing-based designs. As for RH protection, with a worst-case slowdown of ~0.8%, P-PIM achieves up to 71% energy savings over SRAM/CAM-based frameworks and about 90% over DRAM-based frameworks.
Digital technologies have made it possible to cost-effectively deploy visual sensor nodes capable of detecting motion events in their coverage area. However, background subtraction, a widely used approach, remains intractable on such nodes because competitive accuracy and reduced computation cost are difficult to achieve simultaneously. In this paper, an effective background subtraction approach for tiny energy-harvested sensors, namely NeSe, is proposed leveraging non-volatile memory (NVM). Using the developed software/hardware method, the accuracy and efficiency of event detection can be adjusted at runtime by changing the precision depending on the application's needs. Due to the near-sensor implementation of background subtraction and the use of NVM, the proposed design reduces data movement overhead while ensuring intermittent resiliency. The background is stored for a specific time interval within NVMs and compared with the next frame. If power is cut, the background remains unchanged and is updated after the interval passes. Once a moving object is detected, the device switches to the high-power sensor mode to capture the image.
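For reference, the algorithmic core of such an event detector is a thresholded frame difference against the stored background. A minimal sketch, where the threshold and pixel-count knobs are illustrative stand-ins for the runtime-adjustable precision described above:

```python
import numpy as np

def detect_motion(frame, background, threshold=12, min_pixels=50):
    """Flag a motion event when enough pixels deviate from the background.
    Comparing fewer high-order bits (a coarser threshold) trades accuracy
    for energy, mirroring the adjustable-precision idea."""
    diff = np.abs(frame.astype(np.int16) - background.astype(np.int16))
    return np.count_nonzero(diff > threshold) >= min_pixels

bg = np.zeros((32, 32), dtype=np.uint8)     # stored background (kept in NVM)
frame = bg.copy(); frame[8:16, 8:16] = 200  # a bright moving object
print(detect_motion(frame, bg))             # True -> wake the full sensor
```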
In the Artificial Intelligence of Things (AIoT) era, always-on intelligent and self-powered visual perception systems have gained considerable attention and are widely used. This paper therefore proposes TizBin, a low-power processing in-sensor scheme with event and object detection capabilities that eliminates the power costs of data conversion and transmission and enables data-intensive neural network tasks. Once a moving object is detected, the TizBin architecture switches to the high-power object detection mode to capture the image. TizBin offers several unique features, such as analog convolutions enabling low-precision ternary-weight neural networks (TWNNs) to mitigate the overhead of analog buffers and analog-to-digital converters. Moreover, TizBin exploits non-volatile magnetic RAMs to store the NN's weights, remarkably reducing static power consumption. Our circuit-to-application co-simulation results for TWNNs demonstrate minor accuracy degradation on various image datasets, while TizBin achieves a frame rate of 1000 fps and an efficiency of ∼1.83 TOp/s/W.
The multiply-accumulate operation (MAC) is a fundamental component of machine learning tasks, where multiplication (integer or floating-point) is costly in terms of hardware implementation and power consumption compared to addition. In this paper, we approximate floating-point multiplication by converting it to integer addition while preserving the test accuracy of shallow and deep neural networks. We mathematically show and prove that our proposed method can be utilized with any floating-point format (e.g., FP8, FP16, FP32, etc.). It is also highly compatible with conventional hardware architectures and can be employed in CPU, GPU, or ASIC accelerators for neural network tasks with minimal hardware cost. Moreover, the proposed method can be utilized in embedded processors without a floating-point unit to perform neural network tasks. We evaluated our method on various datasets, such as MNIST, FashionMNIST, SVHN, CIFAR-10, and CIFAR-100, with both FP16 and FP32 arithmetic. The proposed method preserves the test accuracy and, in some cases, overcomes overfitting and improves the test accuracy.
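One well-known way to realize such a conversion is the Mitchell-style logarithmic approximation, sketched here for FP32 as a hedged illustration (the paper's exact offset and rounding may differ): adding the raw IEEE-754 bit patterns of two positive floats sums their exponents exactly and their mantissas approximately, so a single integer addition plus a bias correction approximates the product.

```python
import numpy as np

def approx_mul_fp32(a, b):
    """Approximate elementwise a*b for positive float32 values using one
    integer addition per element. 0x3F800000 is the bit pattern of 1.0f;
    subtracting it removes the doubled exponent bias."""
    ai = np.asarray(a, dtype=np.float32).view(np.uint32)
    bi = np.asarray(b, dtype=np.float32).view(np.uint32)
    return (ai + bi - np.uint32(0x3F800000)).view(np.float32)

x = np.float32([3.0, 0.5, 1.7])
y = np.float32([5.0, 8.0, 2.2])
print(approx_mul_fp32(x, y))  # ~[14.0, 4.0, 3.6] vs exact [15.0, 4.0, 3.74]
```

The relative error of this family of approximations is bounded (at most about 11% for the plain version) and tends to average out across the many MACs of a network layer, which is why test accuracy can be preserved.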
This work proposes a new generic single-cycle Compute-in-Memory (CiM) accelerator for matrix computation named SCiMA. SCiMA is developed on top of an existing commodity Spin-Orbit Torque Magnetic Random-Access Memory chip. Every sub-array's peripherals are transformed to realize a full set of single-cycle 2- and 3-input in-memory bulk bitwise functions specifically designed to accelerate a wide variety of graph and matrix multiplication tasks. We explore SCiMA's efficiency through a complex matrix processing operation, i.e., determinant calculation, an essential and under-explored application in the CiM domain. Our cross-layer device-to-architecture simulation framework shows that the presented platform reduces energy consumption by 70.43% compared with the most recent CiM designs implemented in the same memory technology. SCiMA also achieves up to 2.5× speedup over current CiM platforms.
Owing to the impressive performance and success of artificial intelligence (AI)-based applications, filters, especially finite impulse response (FIR) filters, are widely used as a primary part of digital signal processing systems. Although they offer several advantages, such as stability, they are computationally intensive. Hence, in this paper, we propose a systematic methodology, referred to as ReFACE, to efficiently implement computing-in-memory (CIM) accelerators for FIR filters using various CMOS and post-CMOS technologies. ReFACE leverages a residue number system (RNS) to speed up the essential operations of digital filters, instead of a traditional arithmetic implementation that suffers from an inevitably lengthy carry propagation chain. Moreover, the CIM architecture eliminates off-chip data transfer by leveraging the maximum internal bandwidth of memory chips to realize local and parallel computation on small residues independently. Taking advantage of both RNS and CIM results in significant power and latency reductions. As a proof of concept, ReFACE is leveraged to implement a 4-tap RNS FIR filter. The simulation results verify its superior performance, with up to 85× and 12× improvements in energy consumption and execution time, respectively, compared with an ASIC accelerator.
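To make the RNS idea concrete, here is a minimal software sketch (the moduli are an illustrative choice, not the paper's) of a 4-tap FIR dot product computed independently on small residues and recovered with the Chinese Remainder Theorem:

```python
from math import prod

MODULI = (7, 11, 13, 15)  # pairwise coprime; an illustrative 4-modulus set

def crt(residues):
    """Recover an integer (mod prod(MODULI)) from its residues via the CRT."""
    M = prod(MODULI)
    return sum(r * (M // m) * pow(M // m, -1, m)
               for r, m in zip(residues, MODULI)) % M

def fir_rns(taps, samples):
    """4-tap FIR dot product where each MAC runs independently per modulus
    on short, carry-free residues."""
    acc = [0] * len(MODULI)
    for h, x in zip(taps, samples):
        for i, m in enumerate(MODULI):
            acc[i] = (acc[i] + (h % m) * (x % m)) % m
    return crt(acc)

taps, window = [3, 1, 4, 2], [10, 20, 30, 40]
assert fir_rns(taps, window) == sum(h * x for h, x in zip(taps, window))
```

Because each residue channel works on short words with no carries between channels, the per-modulus MACs in the inner loop are exactly the operations a CIM sub-array can execute locally and in parallel.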
In this paper, we propose a reconfigurable processing-in-DRAM architecture named ReD-LUT, leveraging the high density of commodity main memory to enable flexible, general-purpose, and massively parallel computation. ReD-LUT supports lookup table (LUT) queries to efficiently execute complex arithmetic operations (e.g., multiplication, division, etc.) via memory read operations only. In addition, ReD-LUT enables bulk bit-wise in-memory logic by elevating the analog operation of the DRAM sub-array to implement Boolean functions between operands stored in the same bit-line, beyond the scope of prior DRAM-based proposals. We explore the efficacy of ReD-LUT in two computationally intensive applications, i.e., low-precision deep learning acceleration and Advanced Encryption Standard (AES) computation. Our circuit-to-architecture simulation results show that for a quantized deep learning workload, ReD-LUT reduces the energy consumption per image by a factor of 21.4× compared with the GPU and achieves ~37.8× speedup and 2.1× energy efficiency over the best in-DRAM bit-wise accelerators. As for AES data encryption, it reduces energy consumption by a factor of ~2.2× compared to an ASIC implementation.
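The LUT-query idea can be illustrated in a few lines: precompute a product table once, and each subsequent "multiplication" becomes a single read, which is the operation DRAM performs natively (operand widths here are illustrative, not ReD-LUT's configuration):

```python
# 8-bit x 8-bit product table, built once (in a ReD-LUT-style design this
# would live in DRAM rows); each runtime multiply is then a single read.
LUT = [[a * b for b in range(256)] for a in range(256)]

def mul_via_lut(a, b):
    return LUT[a][b]  # one memory read instead of a multiplier circuit

assert mul_via_lut(27, 58) == 27 * 58
```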
This work presents a processing-in-sensor platform leveraging magnetic devices as a flexible and efficient solution for real-time and smart image processing in AI devices. The main idea is to combine the typical sensing mechanism with an intrinsic coarse-grained convolution operation at the edge to remarkably reduce the power consumption of data conversion and transmission to an off-chip processor imposed by the first layer of deep neural networks. Our initial results demonstrate acceptable accuracy on the SVHN image dataset, while the proposed platform substantially reduces data conversion and transmission energy compared with a baseline sensor-CPU platform.
In this paper, we propose a flexible processing-in-DRAM framework named FlexiDRAM that supports the efficient implementation of complex bulk bitwise operations. The framework is developed on top of a new reconfigurable in-DRAM accelerator that leverages the analog operation of DRAM sub-arrays and elevates it to implement XOR2-MAJ3 operations between operands stored in the same bit-line. FlexiDRAM first generates an efficient XOR-MAJ representation of the desired logic and then appropriately allocates DRAM rows to the operands to execute any in-DRAM computation. We develop the ISA and software support required for in-DRAM computation. FlexiDRAM transforms the current memory architecture into a massively parallel computational unit and can be leveraged to significantly reduce the latency and energy consumption of complex workloads. Our extensive circuit-to-architecture simulation results show that, averaged across two well-known deep learning workloads, FlexiDRAM achieves ∼15× energy savings and 13× speedup over the GPU, outperforming recent processing-in-DRAM platforms.
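To see why an XOR2-MAJ3 basis is expressive, note that a full adder, the kernel of bulk in-memory addition, maps onto exactly two XOR2 gates and one MAJ3 gate. A small sketch verifying the mapping:

```python
def xor2(a, b):     # 2-input XOR primitive
    return a ^ b

def maj3(a, b, c):  # 3-input majority primitive
    return (a & b) | (b & c) | (a & c)

def full_adder(a, b, cin):
    """Full adder in the XOR2-MAJ3 basis:
    sum = XOR2(XOR2(a, b), cin), carry = MAJ3(a, b, cin)."""
    return xor2(xor2(a, b), cin), maj3(a, b, cin)

for a in (0, 1):
    for b in (0, 1):
        for c in (0, 1):
            s, cout = full_adder(a, b, c)
            assert 2 * cout + s == a + b + c   # exhaustively correct
```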
Convolutional Neural Networks (CNNs) have gained considerable attention in various digital imaging applications. They have proven to produce incredible results, especially on big data, but impose high processing demands. With the increasing size of datasets, especially in computational pathology, CNN processing takes even longer and uses more computational resources. Considerable research has been conducted to improve CNN efficiency, such as quantization. This paper applies efficient training and inference of ResNet using quantization on histopathology images, the Patch Camelyon (PCam) dataset, and presents an analysis of efficient approaches to classifying histopathology images. First, the original RGB-colored images are evaluated. Then, compression methods such as channel reduction and sparsity are applied. When comparing sparsity on grayscale with RGB modes, classification accuracy is relatively the same, but the total number of MACs is 77% lower for sparsity on grayscale than for RGB. Grayscale mode thus achieves a higher classification result while requiring far fewer MACs than the original RGB mode. Our method's low energy and processing requirements make it suitable for inference on low-power wearable healthcare devices and in mobile hospitals in rural areas or developing countries, and it can assist pathologists by providing a preliminary diagnosis.
Deep neural networks (DNNs) have shown a great capability to surpass human performance in many areas. With the help of quantization, artificial intelligence (AI)-powered devices are ubiquitously deployed. Yet, easily accessible AI-powered edge devices have become the target of malicious users who can deteriorate the privacy and integrity of the inference process. This paper proposes two adversarial attack scenarios, comprising three threat models, that crush local binary pattern networks (LBPNet). These attacks can be applied maliciously to flip a limited number of susceptible bits in kernels within the system's shared memory. The threat could be delivered through a RowHammer attack and significantly drops the model's accuracy. Our preliminary simulation results demonstrate that flipping only the most significant bit of the first LBP layer decreases the accuracy from 99.51% to 18% on the MNIST dataset. We then briefly discuss potential hardware/software-oriented defense mechanisms as countermeasures to such attacks.
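The attack primitive itself is simple; a hedged sketch of the kind of single-bit corruption involved (signed 8-bit kernels and the targeted bit position are illustrative assumptions, not the paper's exact memory layout):

```python
import numpy as np

def flip_msb(weights_q8):
    """Flip the most significant bit of every signed 8-bit kernel weight,
    emulating the corruption a RowHammer-driven attacker could induce in
    shared memory (bit position and layout are illustrative)."""
    return (weights_q8.view(np.uint8) ^ np.uint8(0x80)).view(np.int8)

kernel = np.array([23, -41, 7, 112], dtype=np.int8)
print(flip_msb(kernel))  # [-105, 87, -121, -16]: small weights become huge
```

Because the MSB carries the sign and the largest magnitude, one flip can turn a small weight into a large one of opposite sign, which is why a handful of flips can collapse accuracy.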
Due to the essential role of matrix multiplication in many scientific applications, especially in data- and compute-intensive applications, we explore the efficiency of widely used matrix multiplication algorithms. This paper proposes an HW/SW co-optimization technique, entitled EaseMiss, to reduce the cache miss ratio of large general matrix-matrix multiplications. First, we revise the algorithms by applying three software optimization techniques to improve performance, and we examine and formulate how to choose the proper algorithm for the best performance. By leveraging the proposed optimizations, the number of cache misses decreases by a factor of 3 in a conventional data cache. To improve further, we then propose SPLiTCACHE, which virtually splits the data cache according to the matrices' dimensions for better data reuse. This method can easily be embedded into conventional general-purpose processors or GPUs at the cost of negligible logic overhead. With correct and valid splitting, the obtained results show that cache misses are reduced by a factor of 2 on average compared to a conventional data cache in machine learning workloads.
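Loop tiling is the canonical software technique for cutting GEMM cache misses; a minimal sketch, where the tile size is an illustrative parameter that would be tuned to the targeted cache level:

```python
import numpy as np

def blocked_matmul(A, B, T=64):
    """Tiled GEMM: operating on T x T blocks keeps each working set resident
    in cache, cutting capacity misses versus the naive triple loop."""
    n, k = A.shape
    _, m = B.shape
    C = np.zeros((n, m), dtype=A.dtype)
    for i in range(0, n, T):
        for j in range(0, m, T):
            for p in range(0, k, T):
                C[i:i+T, j:j+T] += A[i:i+T, p:p+T] @ B[p:p+T, j:j+T]
    return C

A, B = np.random.rand(128, 96), np.random.rand(96, 80)
assert np.allclose(blocked_matmul(A, B), A @ B)
```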
This paper presents a novel ternary Static Random Access Memory (T-SRAM) cell. To validate the functionality of the proposed T-SRAM, carbon nanotube field-effect transistors are selected as a proof of concept, although other CMOS or post-CMOS technologies could be used instead. Our T-SRAM intrinsically eliminates the need to store the intermediate ternary state's voltage level, thus significantly reducing leakage power and increasing robustness. Extensive SPICE simulations and comparisons show that the proposed T-SRAM can be a promising alternative to CMOS SRAMs for deployment in low-power edge AI. Further analysis verifies that the proposed design is more robust than previous implementations.
This work paves the way to realizing a processing-in-pixel accelerator based on multi-level HfOx ReRAM as a flexible, energy-efficient, and high-performance solution for real-time and smart image processing at edge devices. The proposed design intrinsically implements and supports a coarse-grained convolution operation in low-bit-width neural networks, leveraging a novel compute-pixel with non-volatile weight storage at the sensor side. Our evaluations show that such a design can remarkably reduce the power consumption of data conversion and transmission to an off-chip processor while maintaining accuracy compared with recent in-sensor computing designs.
In this paper, we propose an efficient convolutional neural network (CNN) accelerator design, entitled RNSiM, based on the Residue Number System (RNS) as an alternative to the conventional binary number representation. Instead of a traditional arithmetic implementation that suffers from an inevitably lengthy carry propagation chain, the novelty of RNSiM lies in that all the data, including stored weights and communication/computation, are handled in the RNS domain. Due to the inherent parallelism of RNS arithmetic, power and latency are significantly reduced. Moreover, an enhanced integrated intermodulo operation core is developed to decrease the overhead imposed by non-modular operations. Further improvement in the system's performance efficiency is achieved by developing efficient Processing-in-Memory (PIM) designs using various volatile CMOS and non-volatile post-CMOS technologies to accelerate RNS-based multiply-and-accumulate operations (MACs). The RNSiM accelerator's performance is evaluated on different datasets, including MNIST, SVHN, and CIFAR-10. With almost the same accuracy as the baseline CNN, the RNSiM accelerator significantly improves both energy efficiency and speed compared with state-of-the-art FPGA, GPU, and PIM designs. RNSiM and other RNS-PIMs based on our method reduce energy consumption by factors of 28-77× and 331-897× compared with the FPGA and GPU platforms, respectively.
Over the past years, the high demand for efficiently processing deep learning (DL) models has driven the chip design market. However, the new Deep Chip architectures (a common term for DL hardware accelerators) have paid little attention to the security requirements of quantized neural networks (QNNs), while black-/white-box adversarial attacks can jeopardize the integrity of the inference accelerator. Therefore, in this paper, a comprehensive study of the resiliency of QNN topologies to black-box attacks is presented. Herein, different attack scenarios are performed on an FPGA-processor co-design, and the collected results are extensively analyzed to estimate the degree of impact of different types of attacks on the QNN topology. Specifically, we evaluate the sensitivity of the QNN accelerator to a range of bit-flip attacks (BFAs) that might occur over the operational lifetime of the device. The BFAs are injected at uniformly distributed times, either across the entire QNN or per individual layer, during image classification. The acquired results are utilized to build an entropy-based model that can be leveraged to construct QNN architectures resilient to bit-flip attacks.
We propose SHIELDeNN, an end-to-end inference accelerator framework that synergizes a mitigation approach and computational resources to realize a low-overhead, error-resilient Neural Network (NN) overlay. We develop a rigorous fault assessment paradigm that delineates a ground-truth fault-skeleton map revealing the most vulnerable parameters in the NN. The error-susceptible parameters and resource constraints are given to a function that finds a superior design. The error-resiliency magnitude offered by SHIELDeNN can be adjusted within given boundaries. The SHIELDeNN methodology improves the error-resiliency magnitude of cnvW1A1 by 17.19% and 96.15% for 100 MBUs that target the weight and activation layers, respectively.
This work shows a promising solution for efficiently implementing normally-off computing (NoC) and power-analysis side-channel-attack-resilient designs using spin-based devices. Spintronic devices, as post-CMOS devices, provide interesting features such as non-volatility, which plays an essential role in NoC structures. Besides, spin-based components can naturally function as polymorphic gates (PGs) that realize reconfigurable logic functions with inherent security attributes. However, utilizing spintronics imposes power and area overheads compared to CMOS-based designs. Thus, an efficient design methodology is introduced herein to mitigate these overheads. This approach, entitled NV-Clustering, first realizes targeted insertion of PG modules within VLSI implementations to make them resilient against power failure. Then, PARC, an extension of NV-Clustering, is developed as a power-masked synthesis method for operation in the presence of power analysis side-channel attacks. PARC randomly generates power-maskable building blocks with optimal PDP and area overhead. In addition, owing to the PG modules' reconfigurability, PARC can be extended to cover faults introduced by fault-injection approaches.
In recent years, the big data boom has boosted the development of highly accurate prediction models derived from machine learning (ML) and deep learning (DL) algorithms. These models can be orchestrated on customized hardware in safety-critical missions to accelerate the inference process in ML/DL-powered IoT. However, radiation-induced transient faults and black-/white-box attacks can potentially impact the individual parameters of ML/DL models, which may result in noisy data/labels or a compromised pre-trained model. In this paper, we propose Fiji-FIN, a framework for evaluating the resiliency of IoT devices during ML/DL model execution with respect to major security challenges such as bit perturbation attacks and soft errors. Fiji-FIN is capable of injecting both single-bit/event and multi-bit flip/upset faults on the architectural ML/DL accelerator embedded in ML/DL-powered IoT. Fiji-FIN is significantly more accurate than existing software-level fault-injection paradigms on ML/DL-driven IoT devices.
Herein, a bit-wise Convolutional Neural Network (CNN) in-memory accelerator is implemented using Spin-Orbit Torque Magnetic Random Access Memory (SOT-MRAM) computational sub-arrays. It utilizes a novel AND-accumulation method that significantly reduces energy consumption within convolutional layers and performs various low-bitwidth CNN inference operations entirely within MRAM. Power-intermittence resiliency is also enhanced by retaining the partial state information needed to maintain computational forward progress, which is advantageous for battery-less IoT nodes. Simulation results indicate ~5.4× higher energy efficiency and 9× speedup over ReRAM-based acceleration, and roughly ~9.7× higher energy efficiency and 13.5× speedup over recent CMOS-only approaches, while maintaining inference accuracy comparable to baseline designs.
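The essence of an AND-accumulation MAC is that a multi-bit product decomposes into bitwise ANDs of bit-planes followed by shifted accumulation, so the memory array only ever computes ANDs. A small sketch under that assumption (bit widths are illustrative, not the accelerator's exact dataflow):

```python
def and_accumulate_mul(x, y, bits=4):
    """Multiply two unsigned low-bit-width integers using only bitwise AND,
    shifts, and accumulation -- the primitive pattern a bit-wise in-MRAM
    MAC decomposes convolutions into."""
    acc = 0
    for i in range(bits):            # bit-plane of x
        for j in range(bits):        # bit-plane of y
            bit = ((x >> i) & 1) & ((y >> j) & 1)  # single in-memory AND
            acc += bit << (i + j)                  # shifted accumulation
    return acc

assert all(and_accumulate_mul(a, b) == a * b
           for a in range(16) for b in range(16))
```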
Energy-harvesting-powered computing offers intriguing and vast opportunities to dramatically transform the landscape of Internet of Things (IoT) devices by utilizing ambient energy sources to achieve battery-free computing. In order to operate within the restricted energy capacity and intermittency profile, we propose the Intermittent Robust Computation (IRC) unit as a new duty-cycle-variable computing approach leveraging the non-volatility inherent in spin-based switching devices. The foundations of IRC are advanced from the device level upwards by extending a Spin Hall Effect Magnetic Tunnel Junction (SHE-MTJ) device. The device is then used to realize SHE-MTJ Majority/Polymorphic Gate (MG/PG) logic approaches and libraries. Next, a Logic-Embedded Flip-Flop (LE-FF) is developed to realize rudimentary Boolean logic functions along with an inherent state-holding capability within a compact footprint. Finally, the NV-Clustering synthesis procedure and corresponding tool module are proposed to instantiate the LE-FF library cells within conventional Register Transfer Language (RTL) specifications. This selectively clusters together logic and NV state-holding functionality based on energy and area minimization criteria. It also realizes middleware-coherent, intermittent computation without checkpointing, micro-tasking, or software bloat, avoiding energy overheads that are critical concerns for IoT. Simulation results for various benchmark circuits, including ISCAS-89, validate functionality and demonstrate power dissipation, area, and delay benefits.
In this paper, we develop an evolutionary-driven circuit optimization methodology that can be leveraged for the synthesis of spintronic-based normally-off computing (NoC) circuits. NoC architectures distribute non-volatile memory elements throughout the CMOS logic plane, creating a new class of fine-grained, functionally-constrained synthesis challenges. Spin-based NoC circuit synthesis objectives include increased computational throughput and reduced static power consumption. Our proposed methodology utilizes Genetic Algorithms (GAs) to optimize the implementation of a Boolean logic expression in terms of area, delay, or power consumption. It first leverages the spin-based device characteristics to obtain a primary, semi-optimized implementation; further performance optimization is then applied to the implemented design based on the NoC requirements and optimization criteria. As a proof of concept, the optimization approach is leveraged to implement a functionally complete set of Boolean logic gates using spin Hall effect (SHE) magnetic tunnel junctions (MTJs), optimized for both power and delay objectives. The resulting synthesis methodology supports NoC circuit design for emerging-device and hybrid CMOS logic applications. Finally, simulation results and analyses verify the functionality of our proposed optimization tool for NoC circuit implementations.
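As a sketch of the evolutionary loop involved (a generic GA skeleton, not the paper's encoding or cost models): a bit-string genome stands in for a candidate gate-level implementation, and the cost function stands in for an area/delay/power estimator; all parameters are illustrative.

```python
import random

def genetic_minimize(cost, n_bits=16, pop=30, gens=60, p_mut=0.05):
    """Barebones GA: truncation selection, one-point crossover, bit-flip
    mutation, minimizing an arbitrary cost over bit-string genomes."""
    genomes = [[random.randint(0, 1) for _ in range(n_bits)]
               for _ in range(pop)]
    for _ in range(gens):
        genomes.sort(key=cost)
        parents = genomes[:pop // 2]               # keep the fitter half
        children = []
        while len(children) < pop - len(parents):
            a, b = random.sample(parents, 2)
            cut = random.randrange(1, n_bits)      # one-point crossover
            child = a[:cut] + b[cut:]
            child = [g ^ (random.random() < p_mut) for g in child]  # mutate
            children.append(child)
        genomes = parents + children
    return min(genomes, key=cost)

best = genetic_minimize(cost=sum)  # toy cost: asserted bits ~ "gate count"
print(sum(best))                   # converges toward 0
```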
The objectives of advancing secure, intermittency-tolerant, and energy-aware logic datapaths are addressed herein by developing a spin-based design methodology and its corresponding synthesis steps. The approach selectively inserts Non-Volatile (NV) Polymorphic Gates (PGs) to realize datapaths suitable for intrinsic operation in Energy-Harvesting-Powered (EHP) devices. Spin Hall Effect (SHE)-based Magnetic Tunnel Junctions (MTJs) are utilized to design NV-PGs, which are combined within a Flip-Flop (FF) circuit to develop a PG-FF realizing Boolean logic functions with inherent state-holding capability. The reconfigurability of PGs is leveraged for logic encryption to enhance the security of the developed intermittency-resilient circuits, which are applied to the ISCAS-89, MCNC, and ITC-99 benchmarks. The results indicate that the PG-FF-based design can achieve up to 7.1% and 13.6% improvements in area and Power Delay Product (PDP), respectively, compared to NV-FF-based methodologies that replace CMOS-based FFs with NV-FFs. Further PDP improvements are achieved by using low-energy-barrier SHE-MTJ devices within the PG-FF circuit. SHE-MTJs with 30kT energy barriers exhibit a 40.5% reduction in PDP at the cost of retention times reduced to the range of minutes, which is still sufficient to achieve forward progress in EHP devices experiencing more than hundreds of power-on/power-off cycles per minute.
The architecture, operation, and characteristics of two post-CMOS reconfigurable fabrics are identified to realize energy-sparing and resilience features while remaining feasible for near-term fabrication. First, Storage Cell Replacement Fabrics (SCRFs) provide a reconfigurable computing platform utilizing near-zero-leakage Spin Hall Effect devices which replace SRAM bit-cells within Look-Up Tables (LUTs) and/or switch boxes to complement the advantages of MOS transistor-based multiplexer select trees. Second, Heterogeneous Technology Configurable Fabrics (HTCFs) are identified to extend reconfigurable computing platforms via a palette of CMOS, spin-based, or other emerging device technologies, such as various Magnetic Tunnel Junction (MTJ) and Domain Wall Motion devices. HTCFs are composed of a triad of Emerging Device Blocks, CMOS Logic Blocks, and Signal Conversion Blocks. This facilitates a novel architectural approach to reduce leakage energy, minimize communication occurrence and energy cost by eliminating unnecessary data transfer, and support auto-tuning for resilience. Furthermore, HTCFs enable new advantages of technology co-design which trades off alternative mappings between emerging devices and transistors at runtime by allowing dynamic remapping to adaptively leverage the intrinsic computing features of each device technology. Both SCRFs and HTCFs offer a platform for fine-grained Logic-In-Memory architectures and runtime-adaptive hardware. SPICE simulations indicate 6% to 67% reduction in read energy, 21% reduction in reconfiguration energy, and 78% higher clock frequency versus alternative fabricated emerging device architectures, and a significant reduction in leakage compared to CMOS-based approaches.
Quantum-dot cellular automata (QCA) is a new computing approach in nanotechnology that offers considerable features such as low power consumption, small dimensions, and high-speed switching. A QCA device stores logic states based on the positions of individual electrons. The fundamental logic elements in QCA are the majority gate (Fig. 1(a)), which implements MAJ(A, B, C) = AB + BC + CA, and the inverter gate (Fig. 1(b)); both operate based on the Coulomb repulsion between electrons [1].