US 20020178196 A1 Abstract A coprocessor including a first multiplication circuit and a second multiplication circuit, each with a series input to receive n bits and a series output to give n+k bits. The coprocessor also includes addition and multiplexing circuits enabling the data elements produced by the multiplication circuits to be added up with one another and with other data elements encoded on n bits. The invention makes parallel use of the multiplication circuits to carry out modular or non-modular operations on pieces of binary data having n bits or more.
Claims(13) 1. A device comprising:
a first register, a second register, a third register, a fourth register and a fifth register,
at least one input terminal to receive binary pieces of data to be stored in these registers,
a first multiplication circuit that performs a multiplication operation between two pieces of data stored in the first and third registers,
a second multiplication circuit that performs a multiplication operation between two pieces of data stored in the first and fourth registers,
a first addition circuit that performs operations of addition between a piece of data stored in the second register and a piece of data produced by the first multiplication circuit,
a second addition circuit that performs an operation of addition between a piece of data produced by the first addition circuit and a piece of data given to the second addition circuit by the second multiplication circuit,
a delay cell to delay the supply to the second addition circuit of the piece of data given by the second multiplication circuit,
multiplexing means that selectively supplies, to inputs of the first addition circuit, the contents of the second register or a permanent logic state, the connection of an input of the second multiplication circuit to an output of the first register, the connection of the output of the first multiplication circuit to one of the registers and the supply to the second addition circuit of a piece of data produced by the first addition circuit or a permanent logic state.
2. A device according to
3. A device according to
4. A device according to one of the
5. A device according to
6. A device according to
7. A device according to
8. A device according to
9. A device according to
10. A device according to
11. A device according to
12. A device comprising a processor, a memory, a communications bus and a device defined according to
13.
A method for the implementation of a non-modular multiplication A*B, A and B being pieces of binary data encoded in n bits, n being an integer, these pieces of data being subdivided into m words of k bits, $A = A_{m-1} \ldots A_0$ and $B = B_{m-1} \ldots B_0$, m being an even number, the method comprising the following steps:
1—Initialization:
loading the pieces of data A and B into first and second n-bit registers with series input and output, and loading the words $A_0$ and $A_1$ into third and fourth k-bit registers with series input and parallel output,
initializing first and second addition circuits and first and second multiplication circuits,
selecting a first input of a first multiplexer so that it permanently supplies logic zeros to a first series input of the first addition circuit,
selecting an input of a second multiplexer so that the pieces of data produced by the second multiplication circuit are given with a delay of k clock strokes to a series input of the second addition circuit,
selecting inputs of third and fourth multiplexers so as to connect an output of the first register to series inputs of the first and second multiplication circuits;
2—Implementation of a computation loop with i as an index varying from 1 to m/2
2.1—Iteration 1:
loading the contents of the third and fourth registers into fifth and sixth k-bit registers with parallel input and output, these outputs being connected to parallel inputs of the first and second multiplication circuits,
performing, by simultaneous rightward shifting of the contents of the first register and of a seventh n-bit register with series input and output, multiplication operations of the words $A_1$ and $A_0$ by the piece of data B, the pieces of data produced by the first and second multiplication circuits being encoded on n+k bits,
adding, in the first addition circuit, the bits produced by the first multiplication circuit with the bits given by the first multiplexer,
storing the k first bits produced by the first multiplication circuit in an eighth n-bit register with series input and output,
adding, in the second addition circuit, the n+k bits produced by the second multiplication circuit with the n most significant bits produced by the first multiplication circuit, these bits being complemented by k zeros,
storing, in the eighth register, the k first bits produced by the second addition circuit and storing, in the seventh register, the following n bits,
during the above operations, transferring the words $A_3$ and $A_2$ into the third and fourth registers,
selecting a second input of the first multiplexer in order to connect the output of the seventh register to the first input of the first addition circuit;
2.j—iteration j, j varying from 2 to m/2−1:
loading the contents of the third and fourth registers into the fifth and sixth registers,
performing, by simultaneous rightward shifting of the contents of the first and seventh registers, multiplication operations of the words $A_{2j-1}$ and $A_{2j-2}$ by the piece of data B,
adding, in the first addition circuit, the bits produced by the first multiplication circuit with the contents of the seventh register,
storing the k first bits produced by the first addition circuit in the eighth register,
adding, in the second addition circuit, the n+k bits produced by the second multiplication circuit with the n most significant bits produced by the first addition circuit complemented by k zeros to obtain an identical size for the pieces of data that are added up,
storing, in the eighth register, the k first bits produced by the second addition circuit and storing, in the seventh register, the n following bits,
during the above operations, transferring the words $A_{2j+1}$ and $A_{2j}$ into the third and fourth registers; and
2.m/2—iteration m/2
Resuming step 2.j, without the transfer of words from the second register into the third and fourth registers; at the end of this iteration, the n least significant bits of the result are in the eighth register and the n most significant bits of the result are in the seventh register.
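The loop of claim 13 can be modelled at word level. The Python sketch below is a hypothetical model, not a cycle-accurate description of the registers. Python integers stand in for the series registers, and `p1` and `p2` stand for the outputs of the first and second addition circuits. The right shift `p1 >> k` plays the role of the k-clock-stroke delay that aligns the second multiplier's output with the most significant bits of the first adder. The function and variable names are illustrative assumptions. Each iteration consumes one pair of words, $A_{2j-2}$ on the first multiplier and $A_{2j-1}$ on the second, so only m/2 iterations are needed.

```python
def split_words(x: int, k: int, m: int) -> list[int]:
    """Split x into m k-bit words, least significant first: [A_0, A_1, ..., A_{m-1}]."""
    mask = (1 << k) - 1
    return [(x >> (k * i)) & mask for i in range(m)]


def mul_two_multipliers(a: int, b: int, k: int, m: int) -> tuple[int, int, int]:
    """Word-level model of the two-multiplier loop (hypothetical, not cycle accurate).

    Returns (high, low, low_bits) with a * b == (high << low_bits) | low.
    For even m, low_bits == n == m * k, so `high` and `low` correspond to the
    seventh and eighth registers at the end of the last iteration.
    """
    n = k * m
    if not (0 <= a < (1 << n) and 0 <= b < (1 << n)):
        raise ValueError("operands must fit on n = m * k bits")
    if m % 2:                            # odd m: pad A with one zero word
        m += 1
    words = split_words(a, k, m)
    mask = (1 << k) - 1
    high = 0                             # running n-bit upper part ("seventh register")
    low = 0                              # result bits emitted k at a time ("eighth register")
    for j in range(m // 2):              # m/2 iterations, one word pair each
        a_even, a_odd = words[2 * j], words[2 * j + 1]
        p1 = a_even * b + high           # first multiplier + first adder: n+k bits
        low |= (p1 & mask) << (2 * j * k)
        p2 = a_odd * b + (p1 >> k)       # second multiplier + second adder, aligned by k bits
        low |= (p2 & mask) << ((2 * j + 1) * k)
        high = p2 >> k                   # the n following bits go back to "high"
    return high, low, m * k
```

For example, with k = 16 and m = 4 (n = 64), `high` holds the 64 most significant bits of A*B and `low` the 64 least significant bits. Padding A with a zero word when m is odd is one way to mirror the (m+1)*k-bit registers that the description mentions for that case.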
Description [0001] This application is a continuation of application Ser. No. 09/428,607, filed Oct. 27, 1999, which in turn is a continuation of application Ser. No. 09/004,375, filed Jan. 8, 1998, entitled MODULAR ARITHMETIC COPROCESSOR COMPRISING TWO MULTIPLICATION CIRCUITS WORKING IN PARALLEL, which prior applications are incorporated herein by reference. [0002] 1. Field of the Invention [0003] The invention relates to a modular arithmetic coprocessor comprising two multiplication circuits working in parallel. More specifically, the invention relates to the improvement of a known arithmetic coprocessor enabling the performance of modular operations according to the Montgomery method in order to extend the applications of this coprocessor. The Montgomery method performs modular computations in a finite field denoted $GF(2^n)$ [0004] 2. Description of the Prior Art [0005] Conventionally, modular operations on $GF(2^n)$ [0006] There are commercially available integrated circuits dedicated to such applications. [0007] These include, for example, the device manufactured by SGS-THOMSON MICROELECTRONICS S.A. as model number ST16CF54, built around an association of the type including a central processing unit and an arithmetic coprocessor and dedicated to performing modular computations. The coprocessor used enables processing of the modular operations by the use of the Montgomery method. It is the object of European patent application No. 0 601 907 A2, hereinafter called the document D2, which is incorporated herein by reference. This coprocessor is illustrated in FIG. 1 (this figure corresponds to FIG. 2 of the document D2). [0008] The basic operation (called a P [0009] In the coprocessor described in the document D2, k=32 and m=8 or 16. This device may be used to produce the result of the modular multiplication A*B mod N. The modular multiplication can be broken down into two successive Pfield elementary operations.
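The Pfield operation of the document D2 is a word-serial Montgomery product. Because the defining formulas are cut off in this text, the Python sketch below assumes the textbook definitions:

* Pfield(A, B)_N = A*B*2^(-m*k) mod N, for odd N.
* J0 = -N^(-1) mod 2^k, a constant computed in software.
* H = 2^(2*m*k) mod N, used by the second Pfield operation.

The loop is modelled on the outline in [0028] to [0032], whose intermediate formulas are also truncated here. One difference is that D2 subtracts N or 0 during the following iteration, whereas this sketch performs a single conditional subtraction at the end. The function names are hypothetical.

```python
def montgomery_pfield(a: int, b: int, n_mod: int, k: int, m: int) -> int:
    """Word-serial Montgomery product a * b * 2^(-m*k) mod n_mod (assumed Pfield definition).

    Assumes n_mod is odd, n_mod < 2^(m*k) and 0 <= a, b < n_mod.
    """
    radix = 1 << k
    mask = radix - 1
    j0 = (-pow(n_mod, -1, radix)) % radix   # J0 = -N^(-1) mod 2^k (needs Python 3.8+)
    s = 0
    for i in range(m):                       # one k-bit word of A per iteration
        a_i = (a >> (k * i)) & mask
        x = s + a_i * b                      # X = S + A_i * B
        y0 = ((x & mask) * j0) & mask        # Y0 = X0 * J0 mod 2^k
        z = x + y0 * n_mod                   # Z = X + Y0 * N, divisible by 2^k
        s = z >> k                           # S = Z / 2^k; S stays below N + B < 2N
    return s - n_mod if s >= n_mod else s    # one conditional subtraction suffices


def modmul(a: int, b: int, n_mod: int, k: int, m: int) -> int:
    """A * B mod N as two successive Pfield operations, with H = 2^(2*m*k) mod N (assumed)."""
    h = pow(2, 2 * m * k, n_mod)
    return montgomery_pfield(montgomery_pfield(a, b, n_mod, k, m), h, n_mod, k, m)
```

The second call cancels the 2^(-m*k) factor introduced by the first one, because H contributes 2^(2*m*k) and the second Pfield operation removes one more 2^(m*k).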
P [0010] The coprocessor [0011] three shift registers [0012] multiplexers [0013] three registers [0014] two multiplication circuits [0015] two registers [0016] multiplexers [0017] a demultiplexer [0018] series subtraction circuits [0019] series addition circuits [0020] delay cells [0021] a comparison circuit [0022] For further details on the arrangement of the different elements of the circuit with respect to one another, reference may be made to the document D2 and especially to FIGS. 2 and 3, and to the extracts from the description pertaining thereto: page 15, line 54 to page 16, line 13, and page 17, line 50 to page 18, line 55. [0023] The use of the coprocessor [0024] 1—The Initialization of the Circuit [0025] the software computation of a parameter J [0026] the serial loading of B into the register [0027] the initialization of the two multiplication circuits [0028] 2—The Setting up of a Loop Indexed by i with i Varying from 1 to m [0029] the parallel loading into the register [0030] the performance of the different elementary operations in order to perform the following computations: [0031] $S(i) = Z/2^k$ [0032] the subtraction, during the following iteration, of N or 0 from S depending on whether S is greater than N or not. [0033] 3—The Output of the Result S(k) by Means of an Output Terminal [0034] For further details on the running of a method of this kind, reference may be made to the document D2 and more particularly to the following extracts: page 4, line 41 to page 6, line 17, and page 19, lines 7 to 49. [0035] Until now, the device shown in FIG. 1 could be used to optimize processing operations (in terms of computation time, memory size, etc.) for modular operations using a fixed data size, in this case 256 or 512 bits (depending on whether m is equal to 8 or 16). Now, cryptography requires increasingly efficient machines working at ever-higher speeds and using ever-more complex keys.
The trend is thus towards the handling of pieces of data encoded on 768 or even 1024 bits. To process pieces of data of this kind, it is possible to envisage the use of larger-sized circuits by adapting the elements of the circuit to the sizes of the pieces of data. This approach may raise problems in applications such as smart cards, for which the size of the circuits is physically limited because of the differences in flexibility between the cards and the silicon substrates. Furthermore, there is a demand for the integration of increasing numbers of different functional elements on a card of this kind, and the space available for an encryption circuit is accordingly further reduced. It is therefore necessary to find solutions that limit the increase in the size of this circuit while at the same time enabling optimum operation for pieces of data whose size is greater than the size of the originally planned registers. This problem is not limited to modular arithmetic coprocessors that process pieces of data with a fixed size of 256 or 512 bits. It can also be transposed more generally to data-handling coprocessors that need to be used for operations on data whose size exceeds their processing capacity. [0036] If it is desired to carry out modular operations using operands with a size greater than what is managed by the coprocessor (namely in practice greater than the size of the registers), it is possible to use a standard processor (with 8, 16 or 32 bits), a memory and the coprocessor of FIG. 1, the coprocessor being used to perform standard (that is to say non-modular) operations of multiplication. [0037] It is possible, with the coprocessor described in D2, to carry out standard operations of multiplication A*B on sizes of up to n bits by means of the following procedure.
[0038] 1—Initialization [0039] the loading of k logic zeros into the register [0040] the initialization of the multiplication circuit [0041] 2—The Setting up of a Computation Loop with i as an Index Varying from 1 to m [0042] the loading of the contents of the register [0043] the performance, by a simultaneous rightward shift of the registers [0044] the storage of the k least significant bits into the register [0045] the loading of the word A, into the register [0046] At the end of a procedure such as this, there is therefore the least significant bit of the result in the register [0047] It is possible to perform the multiplication of a piece of data B encoded on n bits by means of a piece of data A encoded on m′ words with m′ as an integer greater than m. For this purpose, the loop is done with i varying from 1 to m′. At every m iterations, the contents of the register [0048] Since the coprocessor can be used to carry out standard operations of multiplication, it is possible to perform modular operations on operands encoded on a number m′*k bits with m′>m. For this purpose, the operands A, B and N are manipulated by being divided into q (q as an integer) sub-operands of n bits: A[q−1], A[q−2] . . . A[ [0049] The following method is used: [0050] 1.1—The Multiplication of B by the First Sub-operand of the Piece of Data A [0051] [0052] 2—A[ [0053] . [0054] . [0055] Q—A[ [0056] 1.2—Computation of the Result of the Multiplication of B by the First Sub-operand of A [0057] computation of R[ [0058] computation of c [0059] . [0060] . [0061] computation of c [0062] computation of c [0063] If it is assumed that R[ [0064] It is of course possible to perform the addition operations as and when the results are output. This makes it possible to minimize the size of the memory in which the results are stored. 
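Steps 1.1 and 1.2 appear to break the product of the n-bit sub-operand A[0] and the q-sub-operand piece of data B into n-by-n-bit products, which the coprocessor can compute, and then combine them with carries c. The indices are truncated in this text, so the Python sketch below assumes that reading. It models each coprocessor pass as one Python multiplication, and the function names are hypothetical. Emitting each R[i] as soon as its partial product is available reflects the remark that the additions can be performed as and when the results are output.

```python
def mul_by_sub_operand(b_subs: list[int], a_sub: int, n: int) -> list[int]:
    """Multiply B = (B[q-1] ... B[0]) by one n-bit sub-operand of A.

    b_subs lists the sub-operands of B least significant first. Returns
    R[0..q] (q + 1 sub-operands of n bits) whose value is a_sub * B.
    """
    mask = (1 << n) - 1
    r = []
    carry = 0                            # carry c between successive partial products
    for b_i in b_subs:
        t = a_sub * b_i + carry          # one n x n -> 2n-bit product, plus the carry
        r.append(t & mask)               # R[i]: low n bits
        carry = t >> n                   # high n bits feed the next step
    r.append(carry)                      # R[q]: final carry
    return r


def from_subs(subs: list[int], n: int) -> int:
    """Reassemble an integer from n-bit sub-operands, least significant first."""
    return sum(s << (n * i) for i, s in enumerate(subs))
```

The sum a_sub * b_i + carry never exceeds (2^n - 1) * 2^n, so each intermediate value fits on 2n bits and every carry fits on n bits.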
[0065] 1.3—Computation of the Result of a Multiplication [0066] X[ [0067] 1.4—Computation of the Result of the Multiplication of the First Sub-operand of Y by the Piece of Data N [0068] 1—Y[ [0069] 2—Y[ [0070] . [0071] . [0072] Q—Y[ [0073] 1.5—Computation of the Result of the Multiplication of N by the First Sub-operand of the Piece of Data Y [0074] computation of T[ [0075] computation of c [0076] . [0077] . [0078] computation of c [0079] computation of c [0080] If it is assumed that T[ [0081] It is of course possible to perform the addition operations as and when the results are output. This makes it possible to minimize the size of the memory in which the results are stored. [0082] 1.6—Computation of the Result of the Modular Multiplication of B by the First Sub-operand of the Piece of Data A [0083] Computation of U+X and storage of the result, referenced Z. [0084] The result Z of the addition has the form (c) Z[q] Z[q−1] . . . Z[ [0085] storage of S( [0086] 2—Resumption of the Steps 1.1 to 1.6 in Considering the Second Sub-operand of the Piece of Data A and in Modifying the Step 1.2 as Here Below [0087] computation of R[ [0088] computation of c [0089] . [0090] . [0091] computation of c [0092] computation of c [0093] Then: [0094] computation of W+S( [0095] Q—Resumption of the Above Step 2 in Taking into Consideration the qth Sub-operand of A. [0096] The final result of the computation is S(q)−(N or 0). [0097] As can be seen, the method requires a certain number of exchanges of data with the exterior. These exchanges entail penalties in terms of computation time and memory space to store the results extracted from the coprocessor. Generally, the advantage of coprocessors is that they use a faster clock frequency than the other elements connected to them. Hence, the benefit of using a coprocessor is reduced if the processing operations for which it is designed involve exchanges with circuits (standard processors, memories, etc.)
that work more slowly, namely circuits to whose speeds they have to adapt during the exchanges. [0098] The inventor has sought to modify the coprocessor illustrated in FIG. 1 so as to improve the processing of the above operations, and more particularly so as to reduce the processing time. To do this, the inventor proposes to modify the existing device so that it makes parallel use of the multiplication circuits [0099] Thus, the invention relates to a device comprising: [0100] a first register, a second register, a third register, a fourth register and a fifth register, [0101] at least one input terminal to receive binary pieces of data to be stored in these registers, [0102] a first multiplication circuit that performs a multiplication operation between two pieces of data stored in the first and third registers, [0103] a second multiplication circuit that performs a multiplication operation between two pieces of data stored in the first and fourth registers, [0104] a first addition circuit that performs operations of addition between a piece of data stored in the second register and a piece of data produced by the first multiplication circuit, [0105] a second addition circuit that performs an operation of addition between a piece of data produced by the first addition circuit and a piece of data given to the second addition circuit by the second multiplication circuit, [0106] a delay cell to delay the supply to the second addition circuit of the piece of data given by the second multiplication circuit, [0107] multiplexing means that selectively supplies, to inputs of the first addition circuit, the contents of the second register or a permanent logic state, the connection of an input of the second multiplication circuit to an output of the first register, the connection of the output of the first multiplication circuit to one of the registers and the supply to the second addition circuit of a piece of data produced by the first addition circuit or a permanent 
logic state. [0108] According to one embodiment, the multiplexing means include a first multiplexer comprising two series inputs and one series output, a first input of said multiplexer being connected to an output of the second register, a second input of the multiplexer receiving a permanent logic state and the output of the multiplexer being connected to an input of the first addition circuit. [0109] According to one embodiment, the device further comprises a subtraction circuit, placed between the second register and the first addition circuit, that performs a subtraction operation between a piece of data stored in the second register and a piece of data stored in the fifth register, the first multiplexer comprises a third series input, said multiplexer being placed between the subtraction circuit and the first addition circuit and the third input of said multiplexer being connected to an output of the subtraction circuit. [0110] According to one embodiment, the device furthermore comprises a third addition circuit, series-connected with the first addition circuit, that performs addition operations between pieces of data stored in the second and fifth registers and a piece of data produced by the first multiplication circuit and multiplexing means that selectively supplies, to an input of the third addition circuit, the contents of the fifth register or a permanent logic state. [0111] According to one embodiment, the multiplexing means comprise a second multiplexer having a first input, this first input enabling the connection of the output of the first or third addition circuit to one of the registers to store all or a part of the pieces of data produced by addition between the pieces of data stored in the second and fifth registers and a piece of data produced by the first multiplication circuit. 
[0112] According to one embodiment, the second multiplexer comprises a second input connected to the output of the second addition circuit for the storage, in one or more of the registers, of the data produced by this second addition circuit. [0113] According to one embodiment, the third and fourth registers being used to provide pieces of data to the first and second multiplication circuits, the device comprises means to connect the output of either one of the second or fifth registers to inputs of these third and fourth registers. [0114] According to one embodiment, the device comprises a sixth register with series input and series output and multiplexing means to connect the output of this sixth register to inputs of the third and fourth registers. [0115] According to one embodiment, the device comprises a multiplexer to selectively connect the input of the third register to the output of the sixth register or to an input terminal. [0116] According to one embodiment, the device comprises a multiplexer having two inputs and one output, a first input of the multiplexer being connected to an input terminal to receive pieces of data from outside the device, a second input of the multiplexer being connected to the output of the sixth register for reintroducing, into said register, the pieces of data given at its output. [0117] According to one embodiment, the device further comprises a delay cell, placed between an output of the first addition circuit and an input of the second addition circuit, comprising multiplexing means to directly connect said first and second addition circuits, thus preventing the introduction of a delay between said circuits. [0118] The invention also relates to a device comprising a processor, a memory, a communications bus and a device as defined here above.
[0119] The invention also relates to a method for the implementation of a non-modular multiplication A*B, A and B being pieces of binary data encoded in n bits, n being an integer, these pieces of data being subdivided into m words of k bits, $A = A_{m-1} \ldots A_0$ and $B = B_{m-1} \ldots B_0$, m being an even number, the method comprising the following steps: [0120] 1—Initialization: [0121] loading the pieces of data A and B into first and second n-bit registers with series input and output, and loading the words $A_0$ and $A_1$ into third and fourth k-bit registers with series input and parallel output, [0122] initializing first and second addition circuits and first and second multiplication circuits, [0123] selecting a first input of a first multiplexer so that it permanently supplies logic zeros to a first series input of the first addition circuit, [0124] selecting an input of a second multiplexer so that the pieces of data produced by the second multiplication circuit are given with a delay of k clock strokes to a series input of the second addition circuit, [0125] selecting inputs of third and fourth multiplexers so as to connect an output of the first register to series inputs of the first and second multiplication circuits.
[0126] 2—Implementation of a Computation Loop with i as an Index Varying from 1 to m/2 [0127] 2.1—Iteration 1: [0128] loading the contents of the third and fourth registers into fifth and sixth k-bit registers with parallel input and output, these outputs being connected to parallel inputs of the first and second multiplication circuits, [0129] performing, by simultaneous rightward shifting of the contents of the first register and of a seventh n-bit register with series input and output, multiplication operations of the words $A_1$ and $A_0$ by the piece of data B, the pieces of data produced by the first and second multiplication circuits being encoded on n+k bits, [0130] adding, in the first addition circuit, the bits produced by the first multiplication circuit with the bits given by the first multiplexer, [0131] storing the k first bits produced by the first multiplication circuit in an eighth n-bit register with series input and output, [0132] adding, in the second addition circuit, the n+k bits produced by the second multiplication circuit with the n most significant bits produced by the first multiplication circuit, these bits being complemented by k zeros, [0133] storing, in the eighth register, the k first bits produced by the second addition circuit and storing, in the seventh register, the following n bits, [0134] during the above operations, transferring the words $A_3$ and $A_2$ into the third and fourth registers, [0135] selecting a second input of the first multiplexer in order to connect the output of the seventh register to the first input of the first addition circuit.
[0136] 2.j—Iteration j, j Varying from 2 to m/2−1: [0137] loading the contents of the third and fourth registers into the fifth and sixth registers, [0138] performing, by simultaneous rightward shifting of the contents of the first and seventh registers, multiplication operations of the words $A_{2j-1}$ and $A_{2j-2}$ by the piece of data B, [0139] adding, in the first addition circuit, the bits produced by the first multiplication circuit with the contents of the seventh register, [0140] storing the k first bits produced by the first addition circuit in the eighth register, [0141] adding, in the second addition circuit, the n+k bits produced by the second multiplication circuit with the n most significant bits produced by the first addition circuit complemented by k zeros to obtain an identical size for the pieces of data that are added up, [0142] storing, in the eighth register, the k first bits produced by the second addition circuit and storing, in the seventh register, the n following bits, [0143] during the above operations, transferring the words $A_{2j+1}$ and $A_{2j}$ into the third and fourth registers. [0144] 2.m/2—Iteration m/2 [0145] Resuming step 2.j, without the transfer of words from the second register into the third and fourth registers; at the end of this iteration, the n least significant bits of the result are in the eighth register and the n most significant bits of the result are in the seventh register. [0146] The invention will be understood more clearly and other particular features and advantages shall appear from the following description, made with reference to the appended drawings, of which: [0147] FIG. 1 shows a coprocessor [0148] FIG. 2 shows an example of a structure of a circuit comprising a coprocessor, [0149] FIG. 3 shows an exemplary embodiment of a coprocessor [0150] FIG. 2 shows an encryption circuit [0151] FIG. 3 shows an exemplary coprocessor [0152] The circuit shown in FIG.
3 comprises: [0153] four shift registers [0154] a multiplexer [0155] a multiplexer [0156] a multiplexer [0157] a multiplexer [0158] three k-cell registers [0159] a multiplexer [0160] two multiplication circuits [0161] two k-bit storage registers [0162] a multiplexer [0163] two multiplexers [0164] a multiplexer [0165] subtraction circuits [0166] a multiplexer [0167] three addition circuits [0168] a multiplexer [0169] a multiplexer [0170] delay cells [0171] a comparison circuit [0172] two multiplexers [0173] a multiplexer [0174] a demultiplexer [0175] a delay cell [0176] a multiplexer [0177] a multiplexer [0178] two output terminals [0179] As shall be seen further below, this exemplary coprocessor, made according to the invention could undergo modifications without departing from the framework of the invention. [0180] With regard to the output and input terminals, it could be chosen to use distinct terminals but these could also be one or more input/output terminals common to several elements of the coprocessor. One advantage in using distinct terminals is that it is possible to receive and/or give pieces of data from and/or to elements external to the coprocessor (such as the processor [0181] Furthermore, with regard to the elements of the circuit [0182] As compared with the device of FIG. 1, the device of FIG. 3 includes the same elements, some added elements and modifications in the connections of the elements with one another. In particular, the device of FIG. 
3 has a supplementary register [0183] The register [0184] It is possible, if necessary, to do without the register [0185] The multiplexer [0186] The multiplexer [0187] It will be noted that the multiplexer [0188] The fourth input of the multiplexer [0189] The addition circuit [0190] The multiplexer [0191] It will be noted that it would be possible to place the addition circuit [0192] It would also be possible to place the addition circuit [0193] It is also possible to place the addition circuit [0194] It is again possible to place the addition circuit [0195] As shall be seen further below, the addition circuit [0196] The multiplexer [0197] It is possible if necessary to use a two-input multiplexer and not circumvent the delay cell [0198] The delay cell [0199] It is thus possible to add up the bits produced by the multiplication circuits [0200] If the delay cell [0201] The multiplexer [0202] It is also possible to directly store the bits produced by the multiplication circuit [0203] It is possible, by means of the invention, to carry out a non-modular multiplication without taking account of the elements of the device [0204] Furthermore, the invention enables operations to be performed by the parallel use of the two multiplication circuits [0205] 1. Non-modular Multiplication on Pieces of Data with a Size Smaller than or Equal to n Bits [0206] Let us assume that it is sought to carry out a standard multiplication A*B, the pieces of binary data A and B being encoded on n bits. We shall consider the subdivision of the pieces of data A and B into m words of k bits. Let $A = A_{m-1} \ldots A_0$ and $B = B_{m-1} \ldots B_0$. [0207] It is assumed that m is a multiple of two. If this is not so, registers of (m+1)*k bits will be used. It is also possible to control the multiplexer [0208] Similarly, the pieces of data A and B could have a size smaller than n.
If necessary, the most significant bits of these pieces of data will be complemented by zeros to obtain pieces of data of a size equal to n bits, if it is desired to have only one control program for the processor. It is also possible to provide for a sequencing of the commands of the device comprising a variable number of computation loops, enabling the processing time to be reduced if the pieces of data are encoded on a number of bits smaller than n. It is also possible to use operands of different sizes, by complementing the smaller-sized operand with logic zeros or by adapting the control program of the coprocessor. [0209] The following procedure is used: [0210] 1—Initialization: [0211] the loading in the registers [0212] the initialization of the addition circuits [0213] the selection of the second inputs of the multiplexers [0214] the selection of the second input of the multiplexer [0215] the selection of the second and fourth inputs of the multiplexers [0216] 2—Implementation of a Computation Loop with i as an Index Varying from 1 to m/2 2.1—Iteration 1: [0217] the loading of the contents of the registers [0218] the performance, by simultaneous rightward shifting of the contents of the registers [0219] the addition, in the addition circuit [0220] the storage of the k first bits produced by the multiplication circuit [0221] the addition, in the addition circuit [0222] the storage in the register [0223] during the above operations, the transfer of the words A [0224] the selection of the first input of the multiplexer [0225] 2j—Iteration j, j Varying from 2 to m/2−1: [0226] the loading of the contents of the registers [0227] the performance, by simultaneous rightward shifting of the registers [0228] the addition, in the addition circuit [0229] the storage of the k first bits produced by the addition circuit [0230] the addition, in the addition circuit [0231] the storage in the register [0232] during the above operations, the transfer of the words A [0233]
2.m/2—Iteration m/2: [0234] Step 2j is repeated, apart from the transfer of words from the register [0235] At the end of this iteration, the n least significant bits of the result are in the register [0236] The time needed to perform the operation is reduced by 50% with respect to the device shown in FIG. 1 (counting only the computation steps proper, since the initialization step has the same duration for the devices of FIGS. 1 and 3). [0237] It will be noted that the addition circuit [0238] It is also possible, without modifying the device of FIG. 3, to exchange the roles of the addition circuit [0239] It is also possible, as the case may be, to place the addition circuit [0240] Furthermore, if the aim is simply to improve the prior art device for multiplications on pieces of data of size n, it will also be possible to eliminate the register [0241] It will be noted that, in the parallel performance of multiplication operations as described, the multiplication circuit [0242] If necessary, it is possible to operate in reverse. In this case, the data elements produced by the multiplication circuit [0243] A problem arises for the addition of the contents of the register [0244] Finally, it will be noted that, to make the cell [0245] 2. Multiplication on Pieces of Data with a Size Greater than n [0246] Suppose that a standard multiplication A*B=C is to be performed, the binary pieces of data A and B being encoded on more than n bits. As an example, we shall assume pieces of data encoded on 2*n bits, the result C being encoded on 4*n bits. A, B and C have the form A[ [0247] As above, the method described extends without difficulty to operands of different sizes.
[0248] By using the device of FIG. 1, the following procedure is used: [0249] 1—the loading of the piece of data B[ [0250] 2—the computation of A[ [0251] 3—the computation of A[ [0252] 4—the loading of the piece of data B[ [0253] 5—the computation of A[ [0254] 6—the computation of A[ [0255] The subsequent steps are performed outside the coprocessor, for example by means of a processor or a dedicated wired circuit. [0256] 7—the computation of R′[ [0257] 8—the computation of T[ [0258] 9—the computation of R′″[ [0259] By using the device of FIG. 3, the same operation can be done as follows: [0260] 1—the loading of the pieces of data A[ [0261] 2—the computation of A[ [0262] 3—the loading of the piece of data A[ [0263] 4—the computation of A[ [0264] 5—the loading of the pieces of data A[ [0265] 6—the computation of A[ [0266] 7—the loading of the pieces of data A[ [0267] 8—the computation of A[ [0268] The computation time is reduced by 50% for steps 2, 3, 5 and 6 of the prior-art procedure, and in addition the steps corresponding to external additions of 2*n bits are no longer needed. Furthermore, with the device of FIG. 1, data must routinely be output to the exterior of the coprocessor once the multiplications have been done, which costs both time and the memory space needed to store them. [0269] With the device according to the invention, the only piece of data that is output, apart from the sub-operands of the result, is the intermediate result R′[ [0270] The method described is given by way of example.
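To make the arithmetic of the single-size loop and of the 2*n-bit procedures concrete, here is a minimal Python sketch. It models only the values computed, not the bit-serial timing, and every name in it is hypothetical. `coproc_mul_add` assumes, as the parallel loop described earlier suggests, that each pass feeds two k-bit words of one operand to the two multiplication circuits, the second product being delayed by k bits, so that 2k finished result bits leave per pass (hence m/2 passes instead of m). `mul_2n_schoolbook` builds a 4*n-bit product from four n-bit multiply-adds, folding the carries into the addend as the A*B+C operation allows. `mul_2n_karatsuba` uses the textbook subtractive Karatsuba form with three products, which may differ in detail from the elided steps of the Karatsuba method the description turns to next.

```python
def coproc_mul_add(a: int, b: int, c: int, k: int, m: int) -> int:
    """Return a*b + c, reading a as m k-bit words, two words per pass.

    Hypothetical value-level model of the two-multiplier loop (no timing):
    per pass, one circuit multiplies the low word of the pair by b, the
    other multiplies the high word by b with a k-bit delay, both products
    are added to the running partial result, and 2k finished bits leave.
    """
    if m % 2:
        raise ValueError("m must be even: each pass consumes two words of a")
    word_mask = (1 << k) - 1
    pair_mask = (1 << (2 * k)) - 1
    acc = c   # running partial result, initially the addend c
    out = 0   # finished low-order bits of the result
    for i in range(m // 2):
        a_lo = (a >> (2 * i * k)) & word_mask
        a_hi = (a >> ((2 * i + 1) * k)) & word_mask
        t = acc + a_lo * b + ((a_hi * b) << k)   # delay cell = shift by k
        out |= (t & pair_mask) << (2 * i * k)
        acc = t >> (2 * k)
    return out | (acc << (m * k))


def split(x: int, n: int):
    """Return the (low, high) n-bit halves of a 2n-bit value."""
    return x & ((1 << n) - 1), x >> n


def mul_2n_schoolbook(a: int, b: int, n: int, k: int) -> int:
    """4n-bit product of 2n-bit a and b from four n-bit multiply-adds."""
    m, mask = n // k, (1 << n) - 1
    a0, a1 = split(a, n)
    b0, b1 = split(b, n)
    t0 = coproc_mul_add(a0, b0, 0, k, m)            # A[0]*B[0]
    t1 = coproc_mul_add(a1, b0, t0 >> n, k, m)      # A[1]*B[0] + high half of t0
    t2 = coproc_mul_add(a0, b1, t1 & mask, k, m)    # A[0]*B[1] + low half of t1
    # A[1]*B[1] plus two n-bit carries (an A*B+C+D shape); still < 2^(2n).
    t3 = coproc_mul_add(a1, b1, (t1 >> n) + (t2 >> n), k, m)
    return (t0 & mask) | ((t2 & mask) << n) | (t3 << (2 * n))


def mul_2n_karatsuba(a: int, b: int, n: int, k: int) -> int:
    """Same product with three n-bit multiplications (subtractive Karatsuba)."""
    m = n // k
    a0, a1 = split(a, n)
    b0, b1 = split(b, n)
    hi = coproc_mul_add(a1, b1, 0, k, m)
    lo = coproc_mul_add(a0, b0, 0, k, m)
    # Comparisons give the signs, so only n-bit magnitudes are multiplied.
    da, sa = (a1 - a0, 1) if a1 >= a0 else (a0 - a1, -1)
    db, sb = (b1 - b0, 1) if b1 >= b0 else (b0 - b1, -1)
    cross = sa * sb * coproc_mul_add(da, db, 0, k, m)  # (a1-a0)*(b1-b0)
    middle = hi + lo - cross                            # a1*b0 + a0*b1
    return lo + (middle << n) + (hi << (2 * n))
```

For example, `mul_2n_schoolbook(a, b, 256, 32)` multiplies two 512-bit values using only 256-bit operands (with addends below 2^257) at the modeled coprocessor interface. The sketch assumes k divides n and that m = n/k is even.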
Other methods can be implemented while still benefiting from the simultaneous use of two multiplication circuits [0271] For example, the Karatsuba method described below can be used: [0272] 1—first, the comparison of the n bits of A[ [0273] 2—the computation of A[ [0274] 3—the computation of A[ [0275] 4—the computation of (B[ [0276] 5—the computation of (B[ [0277] 6—the computation of 2*C[ [0278] Only three multiplications are performed instead of four (multiplication by two is obtained directly in binary logic by shifting the pieces of data), and the operation can be faster, depending on the difference between the clock frequencies used by the coprocessor and the processor. However, this requires exchanges between the coprocessor and the exterior, and memory space to store the intermediate results. It will be noted that the additions could also be performed using the resources of the coprocessor (registers and addition circuits). [0279] Multiplying pieces of data larger than 2*n bits requires exchanges between the coprocessor and the exterior, since at least three pieces of data of the same place value must then be added. Nevertheless, the coprocessor according to the invention remains advantageous for multiplication operations. Furthermore, assuming n=512, the ability to compute on 1024 bits currently appears generally sufficient for the security goals of civilian encryption applications. [0280] 3. Modular Operations on Pieces of Data with a Size of n: Example 1 [0281] The coprocessor illustrated in FIG. 3 makes it possible to perform modular operations on operands encoded on m′*k bits, with m′ greater than or equal to m, more quickly than the device of FIG. 1.
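The step-by-step method of this example (X = A[i]*B + S, a value Y derived from the low word of X, Y*N added to X, division by 2^n, and a final S(q)−(N or 0)) has the shape of a word-serial Montgomery multiplication with n-bit words. The Python sketch below assumes that interpretation. The constant `j0 = −N^(−1) mod 2^n` and the function name are assumptions of this sketch, not taken from the text; N must be odd, with A, B < N.

```python
def montgomery_mul(a: int, b: int, n_mod: int, word_bits: int, q: int) -> int:
    """Return a*b*2^(-word_bits*q) mod n_mod, consuming one word of a per pass.

    Assumes n_mod is odd, n_mod < 2**(word_bits*q) and 0 <= a, b < n_mod.
    """
    base = 1 << word_bits
    mask = base - 1
    j0 = (-pow(n_mod, -1, base)) % base  # hypothetical precomputed constant
    s = 0
    for i in range(q):
        a_i = (a >> (i * word_bits)) & mask
        x = a_i * b + s               # X = A[i]*B + S(i-1), an A*B+C operation
        y = ((x & mask) * j0) & mask  # Y = X[0]*j0 mod 2^n
        z = x + y * n_mod             # Z = U + X with U = Y*N; low word is zero
        s = z >> word_bits            # S(i) = Z / 2^n, and S(i) < 2N holds
    return s - n_mod if s >= n_mod else s  # S(q) - (N or 0)
```

Each pass performs one product-plus-sum of the A*B+C kind for X and another for Z, the operation shape the text says the coprocessor handles efficiently. The result carries the factor 2^(−n·q), as is usual for Montgomery products.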
[0282] For this purpose, the operands A, B and N are handled by dividing them into q (q being an integer) sub-operands of n bits: A[q−1] A[q−2] . . . A[ [0283] It will be noted that in the above method the addition circuit [0284] The operation A*B+C is therefore performed with the utmost efficiency by means of the resources of the coprocessor. [0285] The method is as follows: [0286] 1.1—The Multiplication of B by the First Sub-operand of A [0287] 1—A[ [0288] 2—A[ [0289] . [0290] . [0291] q—A[ [0292] The result of the multiplication is the piece of data X[q] X[q−1] . . . X[ [0293] 1.2—The Computation of the Result of a Multiplication [0294] X[ [0295] 1.3—The Computation of the Result of the Multiplication of the First Sub-operand of Y by the Piece of Data N [0296] 1—Y[ [0297] 2—Y[ [0298] . [0299] . [0300] q—Y[ [0301] The result of the multiplication is the piece of data U[q] U[q−1] . . . U[ [0302] 1.4—The Computation of the Result of the Modular Multiplication of B by the First Sub-operand of A [0303] U+X is computed and the result, denoted Z, is stored. [0304] The result Z of the addition has the form (c) Z[q] Z[q−1] . . . Z[ [0305] S( [0306] 2—Repetition of Steps 1.1 to 1.4 for the Second Sub-operand of A, with Step 1.1 Modified as Below [0307] 1—A[ [0308] 2—A[ [0309] . [0310] . [0311] q—A[ [0312] W+S( [0313] q—Repetition of the Above Step 2, Taking into Consideration the q [0314] The final result of the computation is S(q)−(N or 0). [0315] Gain in Computation Time [0316] The computation time is measured in number of clock cycles of the coprocessor.
[0317] The multiplication of the contents of the register [0318] Method According to the Prior Art [0319] Computation of the values A [0320] Computation of the values W: q·q·n=n·q² [0321] Computation of the values X: (q−1)·(q+1)·n=n·(q²−1) [0322] Computation of the values Y: q·m·(n+2·k)=n·q·(m+2) [0323] Computation of the values T: q·[q·m·(n+2·k)]=n·q²·(m+2) [0324] Computation of the values U: q·q·n=n·q² [0325] Computation of the values Z: q·(q+1)·n=n·(q²+q) [0326] The number of cycles needed to perform the computations is given by the following formula: 2·n·(m+4)·q² + n·(m+3)·q − n [0327] Method Using the Invention [0328] Computation of the values A*B+R: q·[q·m/2·(n+k)]=n/2·(m+1)·q² [0329] Computation of the values X: (q−1)·(q+1)·n=n·(q²−1) [0330] Computation of the values Y: q·m/2·(n+k)=n/2·(m+1)·q [0331] Computation of the values T: q·[q·m/2·(n+k)]=n/2·(m+1)·q² [0332] Computation of the values Z: q·(q+1)·n=n·(q²+q) [0333] The number of cycles needed to perform the computations is given by the following formula: [0334] n/2·(2m+4)·q [0335] Assume that q=3 and k=32. [0336] For m=8 (n=256), the first method requires 63,488 cycles and the second method requires 27,136 cycles, giving a gain of 57.26%. [0337] For m=16 (n=512), the first method requires 212,992 cycles and the second method requires 97,024 cycles, giving a gain of 54.45%. [0338] It will be observed that these computations do not take account of the exchanges of data between the coprocessor and the exterior, these exchanges being far more numerous in the implementation of the first method. The time needed to perform these exchanges depends on the clock frequency used to set the rate of operation of the external elements (such as the processor [0339] The ability of the coprocessor according to the invention to perform operations of the A*B+C type also has other advantages, for example in the implementation of the RSA encryption method.
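The RSA computation that follows recombines two half-size exponentiations by the Chinese remainder theorem. In this Python sketch, the exponent reductions D mod (P−1) and D mod (Q−1) and the constant U = Q^(−1) mod P are the standard CRT choices (the same ones PKCS #1 uses), assumed here because the extracted formulas lost their exponents; `rsa_crt` is a hypothetical name. The returned value has the X*Q+B shape that the text maps onto the coprocessor's A*B+C operation.

```python
def rsa_crt(m: int, d: int, p: int, q: int) -> int:
    """Compute m**d mod (p*q) from two half-size exponentiations (assumed CRT form)."""
    a = pow(m % p, d % (p - 1), p)  # A: exponentiation modulo P
    b = pow(m % q, d % (q - 1), q)  # B: exponentiation modulo Q
    u = pow(q, -1, p)               # U = Q^-1 mod P (Python 3.8+)
    b_mod_p = b % p
    if a < b_mod_p:
        x = ((a + p - b_mod_p) * u) % p
    else:
        x = ((a - b_mod_p) * u) % p
    return x * q + b                # C = X*Q + B, an A*B+C operation
```

The result satisfies C ≡ B (mod Q) and C ≡ A (mod P), and since X ≤ P−1 and B ≤ Q−1, C stays below P·Q without a final reduction. With the small textbook key P=61, Q=53, D=2753, `rsa_crt(2790, 2753, 61, 53)` recovers 65.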
[0340] RSA Method [0341] The RSA encryption method makes it necessary to perform computations of the C=M [0342] An algorithm to perform this computation is the following: [0343] A=(M mod P) [0344] B=(M mod Q) [0345] U=Q [0346] If A<B mod P then [0347] C=(((A+P−(B mod P))*U) mod P)*Q+B [0348] Else [0349] C=(((A−(B mod P))*U) mod P)*Q+B [0350] The invention enables C, which has the form X*Q+B, to be computed by loading B into the register [0351] 4. Modular Operations on Pieces of Data with a Size Greater than n: Example 2. [0352] In this example, the addition circuit [0353] The operation A*B+C+D is performed. [0354] As before, the operands A, B and N are considered to take the form of q n-bit sub-operands: A[q−1] A[q−2] . . . A[ [0355] The method is as follows: [0356] 1.1—The Multiplication of B by the First Sub-operand of A [0357] 1—A[ [0358] 2—A[ [0359] . [0360] . [0361] q—A[ [0362] The result of the multiplication is the piece of data X[q] X[q−1] . . . X[ [0363] 1.2—The Computation of the Result of an Operation of Multiplication [0364] X[ [0365] 1.3—The Computation of the Result of the Multiplication of the First Sub-operand of Y by the Piece of Data N [0366] 1—Y[ [0367] 2—Y[ [0368] . [0369] . [0370] . [0371] q—Y[ [0372] X[q]+T[ [0373] The result Z of the multiplication is the piece of data Z[q] Z[q−1] . . . Z[ [0374] Storage of S1=Z/2 [0375] 2—Repetition of Steps 1.1 to 1.3 for the Second Sub-operand of A, with Step 1.1 Modified as Below [0376] 1—A[ [0377] 2—A[ [0378] . [0379] . [0380] q—A[ [0381] X[q]=R[ [0382] . [0383] . [0384] . [0385] q—Repetition of the Above Steps, Taking into Consideration the q [0386] The final result of the computation is S(q)−(N or 0). [0387] Gain in Computation Time [0388] The computation time is measured in number of clock cycles of the coprocessor.
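The percentages quoted for Example 1 above, and for Example 2 in the paragraphs that follow, are relative savings against the prior-art cycle count: gain = 1 − (cycles with the invention)/(prior-art cycles). This short Python check uses only the cycle counts stated in the text (q=3, k=32, so n=256 for m=8 and n=512 for m=16); the function and dictionary names are illustrative.

```python
def gain_percent(prior_cycles: int, new_cycles: int) -> float:
    """Relative saving, in percent, of new_cycles over prior_cycles."""
    return 100.0 * (1 - new_cycles / prior_cycles)


# Cycle counts as stated in the text (q = 3, k = 32).
PRIOR = {8: 63_488, 16: 212_992}      # prior-art method
EXAMPLE_1 = {8: 27_136, 16: 97_024}   # Example 1 with the invention
EXAMPLE_2 = {8: 25_472, 16: 90_880}   # Example 2 with the invention

for m in (8, 16):
    print(f"m={m}: Example 1 {gain_percent(PRIOR[m], EXAMPLE_1[m]):.2f}%, "
          f"Example 2 {gain_percent(PRIOR[m], EXAMPLE_2[m]):.2f}%")
# m=8: Example 1 57.26%, Example 2 59.88%
# m=16: Example 1 54.45%, Example 2 57.33%
```

Both examples are measured against the same prior-art counts, which is why Example 2's extra fusion of additions shows up as a further few points of gain.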
[0389] The multiplication of the contents of the register [0390] Computation of the values A*B+R+S: [q·m/2·(n+k)]=n/2·q [0391] Computation of the values X [0392] Computation of the values Y: q·m/2·(n+k)=n/2·q·(m+1) [0393] Computation of the values Y*N+T+X: q·[q·m/2·(n+k)]=n/2·(m+1)·q² [0394] Computation of the values Z [0395] The number of cycles needed to perform the computations is given by the following formula: [0396] n·(m+1)·q [0397] Assume that q=3 and k=32. [0398] For m=8 (n=256), the method requires 25,472 cycles, giving a gain of 59.88% with respect to the prior-art method. [0399] For m=16 (n=512), the method requires 90,880 cycles, giving a gain of 57.33%. [0400] It will be observed that these computations do not take account of the exchanges of data between the coprocessor and the exterior, these exchanges being far more numerous in the implementation of the prior-art method. The time needed to perform these exchanges depends on the clock frequency used to set the rate of operation of the external elements (such as the processor [0401] Having thus described at least one illustrative embodiment of the invention, various alterations, modifications, and improvements will readily occur to those skilled in the art. Such alterations, modifications, and improvements are intended to be within the spirit and scope of the invention. Accordingly, the foregoing description is by way of example only and is not intended as limiting. The invention is limited only as defined in the following claims and the equivalents thereto.