Design of energy-efficient multiplier based on 3:2 compressor

ABSTRACT


INTRODUCTION
A multiplier is an essential block in many nano-electronic, control and automation applications [1]. Wherever multiplication is required, a multiplier module (circuit) is needed [2]. It therefore finds use in computer vision, computer-aided design, image processing, DSP processors, MAC units, communication systems, filters, IoT applications, and VLSI circuits and systems [3]. Consequently, the speed of operation, power consumption and complexity of these applications depend to some extent on the core multiplier module [4,5]. Designing low-power, high-speed multiplier circuits in nanotechnology for VLSI applications is one of the current research objectives in nano-electronics [6]. Low-power and high-speed multipliers have been designed by adopting different approaches and techniques [7], each with its own advantages and disadvantages. Representative architectures and techniques are reported in [8][9][10][11][12][13][14][15][16]; they rely either on an algorithm or on a new architecture. A study of these designs shows that each has drawbacks relative to the others, and three major issues are observed: high power consumption, low speed of operation and architectural complexity. Thus, in this work, an energy-efficient multiplier has been designed by using an algorithm and adopting a new structural module.
The multiplier reported here is a 4x4 design built in three steps. In the first step, partial products (PP) are generated from the two 4-bit inputs; in the second step, the partial products are reduced; and in the final step, a fast addition yields the final product of the 4x4 multiplier. In the first step, AND gates are used to generate the partial products by multiplying each bit of one number by each bit of the other. The partial products are then reduced by using the Wallace tree algorithm, one of the oldest and most effective techniques for reducing multiplication complexity [8]. Another reason for its widespread popularity is its speed of operation [9]. The final step is the fast addition, which is performed by a carry-save adder (CSA), chosen because it is one of the fast adders. The performance of the designed multiplier is compared with two existing Wallace multipliers reported in [5] and [11]. Since the Wallace tree algorithm is used for partial product processing (PPP), the designed multiplier is also a Wallace multiplier. In the rest of the paper, the terms multiplier and Wallace multiplier are used interchangeably unless stated otherwise. The rest of this manuscript is organised as follows: Section 2 explains the designed multiplier in detail, Section 3 presents the performance analysis and discussion, and Section 4 concludes the article, followed by the references.

METHODOLOGY
In any MxN multiplication, the first step is the generation of the partial products of the two input numbers. The partial products are then added up to obtain the final result. In a Wallace multiplier, the process is the same but is carried out in three steps [8,11,12]. Initially, the partial products (PP) are produced by multiplying the two inputs, which is followed by partial product processing (PPP). Finally, the final addition (FA) is performed by a fast adder [13]. In the PPP, the PPs are organised into stages by distributing the partial products according to the Wallace tree algorithm. Each stage contains a number of rows, calculated as shown in (1) [5], where i is the stage index and Si gives the number of rows in that stage.
In this work, the multiplier has been designed in the three steps mentioned above: partial products (PP) are generated from the inputs, then PP processing (PPP) is applied, and finally the final addition (FA) is performed by a fast adder. AND gates are used for PP generation. The PPP is designed in two phases. In the first phase, the Wallace tree algorithm is used to reduce the PPs, whereas in the second phase the PPs are added by using an energy-efficient half adder and a 3:2 compressor. Since the proposed multiplier is 4x4 bits, the two 4-bit inputs give 16 partial products, which are reduced by the PPP. The final addition is performed once the reduction of the partial products is complete. The three design steps are described in detail below.
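
Equation (1) is not reproduced in this text. As an illustration only, the sketch below assumes the commonly used Wallace row-count recurrence, in which every full group of three rows is compressed into two and any leftover rows pass through unchanged, i.e. S(i+1) = 2*floor(S(i)/3) + (S(i) mod 3). Whether this matches (1) exactly is an assumption, and the function name wallace_stage_rows is hypothetical.

```python
def wallace_stage_rows(initial_rows: int) -> list[int]:
    """Row count of each stage until at most two rows remain.

    Assumed recurrence (not taken from the paper's equation (1)):
    each full group of three rows becomes two rows, and the
    remaining one or two rows are carried to the next stage as is.
    """
    if initial_rows < 1:
        raise ValueError("initial_rows must be positive")
    rows = [initial_rows]
    while rows[-1] > 2:
        current = rows[-1]
        rows.append(2 * (current // 3) + current % 3)
    return rows


# A 4x4 multiplier starts with four rows of partial products.
print(wallace_stage_rows(4))  # [4, 3, 2]
```

Under this assumption, four initial rows shrink to three and then to two, so the reduction ends at the third stage.
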

1st step:
The 1st step is partial product generation. The partial products are generated by using AND gates. CMOS AND gates are used here because they outperform other types of AND gates [17]. Since the proposed multiplier is 4x4 bits, there are 16 partial products, and thus 16 CMOS AND gates are required to produce them. Let the two numbers be input1 = a3a2a1a0 and input2 = b3b2b1b0. The partial products are then p00=a0b0, p01=b0a1, p02=b0a2, p03=b0a3, p10=b1a0, p11=b1a1, p12=b1a2, p13=b1a3, p20=b2a0, p21=b2a1, p22=b2a2, p23=b2a3, p30=b3a0, p31=b3a1, p32=b3a2 and p33=b3a3. These are shown in Figure 1. Similarly, an NxN multiplier has N^2 partial products and thus requires the same number of AND gates.
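
The partial product generation can be checked with a short behavioural model. The sketch below is not the CMOS gate-level circuit; it only reproduces the logic p[i][j] = b_i AND a_j used in the naming above. The function names partial_products and weighted_sum are hypothetical, and the weight 2^(i+j) attached to each p[i][j] is the usual positional weight of binary multiplication.

```python
def partial_products(a: int, b: int, width: int = 4) -> list[list[int]]:
    """Return p[i][j] = b_i AND a_j, following the p00 ... p33 naming."""
    a_bits = [(a >> j) & 1 for j in range(width)]
    b_bits = [(b >> i) & 1 for i in range(width)]
    return [[b_bits[i] & a_bits[j] for j in range(width)]
            for i in range(width)]


def weighted_sum(pp: list[list[int]]) -> int:
    """Add all partial products, each weighted by 2**(i + j)."""
    return sum(bit << (i + j)
               for i, row in enumerate(pp)
               for j, bit in enumerate(row))


pp = partial_products(0b1011, 0b0110)   # 11 x 6
print(sum(len(row) for row in pp))      # 16 partial products
print(weighted_sum(pp))                 # 66
```

Summing the weighted partial products gives the ordinary product, which is what the reduction and final addition steps must preserve.
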

2nd step:
The 2nd step is partial product processing (PPP), in which the partial products are reduced. The Wallace algorithm is used for this purpose, rearranging the PPs into a tree-like parallel structure. This is achieved by categorising the rows above into stages with the Wallace tree algorithm, using (1) to determine the number of rows in each stage. The algorithm is repeated until a stage contains only two rows of partial products. A flow chart of the procedure is shown in Figure 2, and Figure 3 illustrates how the PPP is carried out. The 1st stage contains 4 rows and the 2nd stage contains 3 rows, but the 3rd stage contains only two rows of PPs. Thus, the algorithm stops at the 3rd stage, which is the final stage. In every stage except the final one, the partial products are computed by using half adders and 3:2 compressors. Compressors are used because they are energy-efficient for multiplication [17][18][19][20][21][22]. Where two rows of PPs are to be combined, a half adder (HA) is used, whereas for three rows a 3:2 compressor is used; the compressor reduces three rows of partial products to two. The HA used here is the one reported in [18], whereas the 3:2 compressor is designed using the architecture described in [17] and [23].
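
The reduction can be illustrated with a behavioural model. The sketch below is an assumption-laden simplification: it works column by column (bits of equal weight) rather than on the exact row grouping of Figure 3, it models the 3:2 compressor with ordinary full-adder logic, and it says nothing about the transistor-level circuits of [17], [18] or [23]. All function names are hypothetical.

```python
def half_adder(a: int, b: int) -> tuple[int, int]:
    """Two bits in, (sum, carry) out."""
    return a ^ b, a & b


def compressor_3_2(a: int, b: int, c: int) -> tuple[int, int]:
    """Three bits in, (sum, carry) out; logically a full adder."""
    return a ^ b ^ c, (a & b) | (b & c) | (a & c)


def reduce_stage(columns: list[list[int]]) -> list[list[int]]:
    """One reduction stage; columns[k] holds the bits of weight 2**k."""
    out = [[] for _ in range(len(columns) + 1)]
    for k, col in enumerate(columns):
        idx = 0
        while len(col) - idx >= 3:            # groups of three bits
            s, c = compressor_3_2(*col[idx:idx + 3])
            out[k].append(s)
            out[k + 1].append(c)
            idx += 3
        if len(col) - idx == 2:               # leftover pair
            s, c = half_adder(col[idx], col[idx + 1])
            out[k].append(s)
            out[k + 1].append(c)
        elif len(col) - idx == 1:             # single bit passes through
            out[k].append(col[idx])
    return out


def wallace_reduce(a: int, b: int, width: int = 4) -> tuple[int, int, int]:
    """Reduce the partial products of a*b to two rows.

    Returns (row0, row1, number_of_reduction_stages).
    """
    columns = [[] for _ in range(2 * width)]
    for i in range(width):
        for j in range(width):
            columns[i + j].append(((b >> i) & 1) & ((a >> j) & 1))
    stages = 0
    while max(len(col) for col in columns) > 2:
        columns = reduce_stage(columns)
        stages += 1
    row0 = sum(col[0] << k for k, col in enumerate(columns) if len(col) > 0)
    row1 = sum(col[1] << k for k, col in enumerate(columns) if len(col) > 1)
    return row0, row1, stages


row0, row1, stages = wallace_reduce(13, 11)
print(row0 + row1 == 13 * 11, stages)
```

Each half adder and compressor preserves the weighted sum of its inputs, so the two remaining rows always add up to the product; only the final two-row addition is left for the last step.
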

3rd step:
The third step is the fast addition. In the PPP, the Wallace algorithm stops as soon as a stage contains only two rows, i.e., there are no further stages. As shown in Figure 3, stage 3 contains only two rows of partial products, so it is the final stage, and an addition is required to process these two rows. Since this is the last stage, a fast addition is needed to keep the delay low and thereby improve the performance of the multiplier. Thus, a fast adder is used in this step. The fast adder used here is a 4-bit carry-save adder (CSA), as the CSA is one of the fastest adders [9].
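
The gate-level structure of the 4-bit CSA is not described in this section, so the following is only a behavioural stand-in for the final two-row addition. It assumes a carry-save style step in which a sum vector (bitwise XOR) and a carry vector (bitwise AND, shifted left by one position) are produced without propagating carries, and it repeats that step until no carries remain. The function names are hypothetical.

```python
def carry_save_step(x: int, y: int) -> tuple[int, int]:
    """One carry-save style step: x + y == sum_vector + carry_vector."""
    sum_vector = x ^ y
    carry_vector = (x & y) << 1
    return sum_vector, carry_vector


def final_addition(row0: int, row1: int) -> int:
    """Add the last two rows by repeating the carry-save step."""
    total, carry = row0, row1
    while carry:
        total, carry = carry_save_step(total, carry)
    return total


print(final_addition(0b01100110, 0b00101001) == 0b01100110 + 0b00101001)
```

The model only confirms the arithmetic; the delay advantage claimed for the CSA comes from the hardware structure, which this sketch does not capture.
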

RESULTS AND DISCUSSION
The designed Wallace multiplier is the combination of the three steps discussed above. Its performance is evaluated by simulating the multiplier with the Synopsys EDA tool at room temperature, using a 90 nm CMOS PDK. The results are shown in Table 1. Power, delay, power-delay product (PDP) and energy-delay product (EDP) have been calculated, and the performance is then compared with the conventional design [5] and Hussain [11]. Although the conventional Wallace multiplier has the lowest power consumption, the proposed design has the best delay, PDP and EDP. The delay has been minimised by using the 3:2 compressor and the CSA. Since the proposed design has the lowest delay and moderate power consumption among the Wallace multipliers considered, it has the best PDP and EDP. It can therefore be said that the designed multiplier is the most energy-efficient when compared with the conventional design [5] and Hussain [11]. The effects on delay, power, PDP and EDP of varying the power supply are also observed. To assess the performance further, the multiplier is also simulated in 32 nm CMOS technology, and the results are shown in Table 2. The designed multiplier again performs better than the other two multipliers: although the conventional Wallace multiplier consumes the least power, the proposed design has the best delay, PDP and EDP compared with [5] and [11]. The results show that the designed multiplier has the best EDP, i.e., it is energy-efficient. This is achieved through the algorithm and the structural optimisation of the design, the latter obtained by choosing suitable circuit modules. The low-power compressor and HA used in the PPP, together with the CSA used in the final addition, speed up the operation.
To establish the validity of the findings and the significance of the results, power, delay, PDP and EDP analyses have been carried out against the power supply. These are discussed below.

Power analysis against power supply:
Power is one of the most important parameters of any digital circuit or system. Hence, the total power has been calculated, and its variation has been observed by sweeping the supply voltage from 0.6 V to 1.2 V. The lowest supply voltage considered is 0.6 V because of the minimum threshold voltage required for 90 nm CMOS technology, whereas 1.2 V is the highest voltage level considered, as per the ITRS roadmap. The total power consumption reported here is the sum of the static power (when the circuit is in steady state) and the dynamic power (when the circuit is in a transition state). The results are shown in Table 3, and a comparison graph against the supply voltage is shown in Figure 4.

Delay analysis against power supply:
Speed is another of the most important performance parameters of modern nano-electronic applications. So, the delay has been calculated, and its variation with the power supply has also been observed. The delay is measured from the instant the input reaches one-half of the supply voltage (50% Vdd) to the instant the latest output signal reaches the same voltage level. Thus, the worst-case delay is recorded. The delay is calculated at 0.6 V, 0.8 V, 1 V and 1.2 V, and the values are listed in Table 4. The variation of delay with the supply voltage is shown in Figure 5. The designed multiplier has the best delay because of its architecture. The compressor used in the partial product processing unit shortens the data flow path and thereby speeds up the operation. Another reason is the use of a carry-save adder in the final step, where the operation starts without waiting for the carry bit (an inherent property of the CSA), thus reducing the delay.
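
The 50% Vdd measurement described above can be sketched on sampled waveforms. This is a minimal illustration, not the simulator's own measurement routine: the waveforms are made-up piecewise-linear samples, the crossing instant is found by linear interpolation, and all function names are hypothetical.

```python
def crossing_time(times: list[float], volts: list[float], level: float) -> float:
    """First instant at which the sampled waveform crosses `level`."""
    for k in range(len(times) - 1):
        v0, v1 = volts[k], volts[k + 1]
        if v0 == level:
            return times[k]
        if (v0 - level) * (v1 - level) < 0:       # crossing inside the segment
            frac = (level - v0) / (v1 - v0)
            return times[k] + frac * (times[k + 1] - times[k])
    if volts[-1] == level:
        return times[-1]
    raise ValueError("waveform never crosses the requested level")


def propagation_delay(times, v_in, v_out, vdd: float) -> float:
    """Delay between the 50% Vdd crossings of input and output."""
    half = 0.5 * vdd
    return crossing_time(times, v_out, half) - crossing_time(times, v_in, half)


def worst_case_delay(times, v_in, outputs, vdd: float) -> float:
    """Largest 50%-to-50% delay over all output signals."""
    return max(propagation_delay(times, v_in, v, vdd) for v in outputs)


# Hypothetical samples (time in ps, voltage in V), for illustration only.
t = [0.0, 10.0, 20.0, 30.0, 40.0]
v_in = [0.0, 1.0, 1.0, 1.0, 1.0]
out_a = [0.0, 0.0, 1.0, 1.0, 1.0]
out_b = [0.0, 0.0, 0.0, 1.0, 1.0]
print(worst_case_delay(t, v_in, [out_a, out_b], vdd=1.0))  # 20.0
```

Taking the maximum over all outputs corresponds to recording the latest output transition, i.e. the worst-case delay.
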

PDP and EDP analysis against power supply:
In digital circuits and systems for nano-electronic applications, delay or power alone cannot define the performance of a circuit. Thus, the power-delay product (PDP) is calculated; it is also used as a figure of merit that indicates the energy efficiency of a circuit or system. However, low-PDP circuits may still operate slowly, so the energy-delay product (EDP) is used as a further metric to evaluate the performance. Both PDP and EDP are therefore calculated, and both are also evaluated while the supply voltage is varied from 0.6 V to 1.2 V. The PDP and EDP analyses against the power supply are shown in Tables 5 and 6, respectively. Comparison graphs for PDP and EDP against the power supply are shown in Figures 6 and 7, respectively. The proposed multiplier gives the best results in both cases. So, it can be said that the designed multiplier is energy-efficient for VLSI circuits and systems.
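
The two metrics follow directly from power and delay: PDP = power x delay, and EDP = PDP x delay. The sketch below uses these standard definitions with clearly hypothetical numbers; they are not the values of Tables 5 and 6, and the design labels are placeholders.

```python
def pdp(power_w: float, delay_s: float) -> float:
    """Power-delay product in joules."""
    return power_w * delay_s


def edp(power_w: float, delay_s: float) -> float:
    """Energy-delay product in joule-seconds (PDP times delay)."""
    return pdp(power_w, delay_s) * delay_s


# Hypothetical designs: (power in W, delay in s). Not measured data.
designs = {
    "lower_power_slower": (8e-6, 200e-12),
    "higher_power_faster": (10e-6, 100e-12),
}

for name, (power, delay) in designs.items():
    print(name, pdp(power, delay), edp(power, delay))

best_edp = min(designs, key=lambda name: edp(*designs[name]))
print(best_edp)  # higher_power_faster
```

The example shows how a design that does not have the lowest power can still have the lower PDP and EDP when its delay is sufficiently shorter, because delay enters the EDP quadratically.
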

CONCLUSION
In this article, a 4x4 multiplier designed in three steps is reported for nano-electronic, control and automation applications. The 1st step is partial product (PP) generation by using AND gates. The 2nd step is PP processing (PPP), which is designed in two phases. In the first phase, the Wallace tree algorithm is used to reduce the PPs, whereas in the second phase the partial products are computed by using an energy-efficient half adder and a 3:2 compressor. Where there are two rows of PPs a half adder is used, whereas for three rows a 3:2 compressor is used. In the 3rd step, the final addition is performed by a carry-save adder (CSA), which speeds up the overall operation. The multiplier has been simulated by using the Synopsys tool with a 90 nm CMOS PDK. The performance metrics of the multiplier, namely power, delay, PDP and EDP, are computed and compared with those of two other multipliers. The effects of varying the input power supply on power, delay, PDP and EDP are also observed. The proposed Wallace multiplier shows the best performance in terms of delay, PDP and EDP among the multipliers considered. Thus, the multiplier is energy-efficient and could be a possible alternative for future nano-electronic, control and automation applications.