FIELD OF THE INVENTION
The present invention relates to digital data integrity and, more particularly, to a technique for detecting malicious tampering at a very fine level of granularity without the performance constraints of relying purely on digital signatures.
BACKGROUND OF THE INVENTION
Today, almost all critical business records are generated, managed and stored electronically, creating efficiencies and cost-savings for businesses. Unfortunately, digital information can be easily deleted, altered and/or manipulated. For businesses, the burden of proof is on the company to ensure and attest to the accuracy and credibility of their electronic business records. This ability to prove the integrity of critical business records becomes especially important in litigation where executives are often called upon to support their claims of ownership of any discoverable records, as well as verify their history of creation and use.
It is important to distinguish involuntary changes to data (such as those caused by transmission errors) from voluntary changes (tampering). When the objective is to detect involuntary changes, the integrity information is commonly calculated without any added security, because there is no attacker who will also alter the integrity information to hide the data changes. Examples of patents about verification of data integrity for involuntary changes are European Patent EP1665611 “Data transmission path comprising an apparatus for verifying data integrity”, U.S. Pat. No. 5,581,790 “Data feeder control system for performing data integrity check while transferring predetermined number of blocks with variable bytes through a selected one of many channels”, U.S. Pat. No. 7,330,998 “Data integrity verification”, U.S. Pat. No. 6,446,087 “System for maintaining the integrity of application data”, European Patent EP676068 “Data integrity check in buffered data transmission” and European Patent EP1198891 “Data integrity management for data storage systems”, amongst others.
When the objective is to detect tampering, however, the method used to provide data integrity must also prevent tampering with the integrity information itself; some kind of cryptography is therefore required. The proposed invention falls into this category.
Especially in well-regulated environments that operate with large volumes of sensitive information, the integrity of the data must be guaranteed by a system that eliminates the risk of data manipulation.
Electronic records have been proven to have been manipulated in cases ranging from stock options fraud to loan fraud to intellectual property disputes. Some recent examples of actual cases surrounding the manipulation of electronic records include:
- Top executives at a successful technology company attempted to alter electronic records to hide a secret options-related slush fund to cover the tracks of their backdating options scheme.
- A prominent real estate developer received an electronic version of a loan agreement to print and sign. Rather than just signing the document, he made subtle changes to it in order to make the terms of the loan more favorable to himself. The changes went undetected for a year until the loan was refinanced.
- An auditor impeded a federal investigation by intentionally altering, destroying and falsifying the financial records of a now defunct credit card issuer in order to downplay or eliminate evidence that there were “red flags” that he should have caught.
- Two major Wall Street firms settled with the SEC after being accused of “late trading”. Late trading or “after-hours” trading involves placing orders for mutual fund shares after the market close, but still getting that day's earlier price, rather than the next day's closing price.
- A prominent scientist, funded by millions of dollars in state and private funding was charged with fraud and embezzlement, after admitting that he manipulated photo images of stem cells in his research.
The industry has been addressing these deficiencies by several means, including WORM (Write Once Read Many) devices, digital signatures, redundant off-site storage managed by different people, etc., but each of them has drawbacks that call for a more efficient solution: WORM devices are slower than other storage devices, and a drive can be replaced by a tampered one; digital signatures have a high computational cost that makes them impractical to use on their own in systems with significant transaction volume, and they do not prevent changes in the order of records; and duplicating the storage systems and their administration is costly and complicates the subsequent audit process.
The state of the art today is based on the use of digital signatures (Public Key Infrastructure based) accompanied by an accurate date and time stamp to provide authenticity to data that may later be audited, but the following issues are not addressed:
- a) When processing a huge volume of data, the performance required is not cost efficient, or cannot be achieved at all, because of the limited performance of digital signatures.
- b) Digital signatures and timestamps do not by themselves guarantee that no records have been deleted without notice, which means that immutability is not a feature of such log registries.
The present invention addresses both issues, providing a cost-efficient method and system to provide fine granular integrity to huge volumes of data while guaranteeing immutability. The use of symmetric message authentication functions to create the links, together with digital signatures for chunks of links, makes it possible to generate immutable digital chains in a cost-efficient way using standard industry hardware and software.
A patent exists that proposes a primitive solution using a cumulative hash function (U.S. Pat. No. 6,640,294), but it does not address the problem of malicious tampering, because it is possible to recalculate the entire set of hashes to match the modified data values (this is clear from the statement “[ . . . ] if there is an accidental error, attempts to recover the lost data can be made [ . . . ]” at column 3, line 32). U.S. Pat. No. 6,640,294 is also oriented to data storage. In contrast, the proposed invention:
- Considers malicious tampering and therefore uses cryptographic functions, such as Message Authentication Codes in combination with a secret key, to prevent malicious replacement of the integrity information. Timestamps are also included.
- Provides authenticity, so it is not possible to impersonate the source of data.
- Is oriented not to data storage but to integrity generation. The integrity is managed alongside the data, so it is possible to keep the data and the integrity together, to keep only the integrity, or to purely generate the integrity and keep neither the data nor the integrity.
SUMMARY OF THE INVENTION
With the proposed invention it is possible to generate fine granular integrity for huge volumes of data in real time at a very low computational cost.
The invention proposes a scalable system that can receive different digital data from multiple sources and generates integrity streams associated to the original data.
Message Authentication Codes are used to create a digital chain of integrity links. The algorithm proposed in the preferred embodiment creates multiple parallel chains to achieve a high volume of transactions per second.
The symmetric session keys used by the Message Authentication Codes to create the digital chain are stored encrypted under an asymmetric public key. An audit tool component is presented to allow the owners of the corresponding asymmetric private key to verify data integrity and generate audit reports. The use of a Public Key Infrastructure (PKI) and certificates ensures that only authorized parties can verify the integrity.
The proposed system is designed in such a way that it can process the digital data at binary level and at data format level. When working in binary mode, the system processes the digital data at byte level, regardless of the format of the data (audio, video, documents, transactions, files . . . ).
BRIEF DESCRIPTION OF THE DRAWINGS
The invention is best understood from the following detailed description when read in connection with the accompanying drawing. It is emphasized that, according to common practice, the various features of the drawing are not to scale. On the contrary, the dimensions of the various features are arbitrarily expanded or reduced for clarity. Included in the drawing are the following Figures:
FIG. 1 is an illustration of an exemplary embodiment of a system in which the invention may be implemented. There are several information source(s) (310, 312) that communicate with the Integrity Generation System (305) through a Network (405). There are also some of the different receivers of the immutable digital chains of integrity: same receivers as senders of the original data (312), different ones (311) and storage media (320).
FIG. 2 is an illustration of a software architecture showing an exemplary implementation of the invention. There is a data communications layer (505) that provides an API (600) to communicate with the data information sources (310, 312), a cryptographic layer (510) that generates the immutable digital chains of integrity and an integrity communications layer (515) that sends to the appropriate receivers (311) and/or stores (320) the generated immutable digital chains of integrity.
FIG. 3 shows the architecture of the system with its functional modules. There is an API Module (600) that receives the original data from the information sources. This API Module passes the original data to the Integrity Generation Module (610), which generates the integrity with the (optional) use of the HSM module (650) and using the public keys certified by a Trusted Third Party (660). The immutable digital chains of integrity generated by the Integrity Generation Module (610) are then communicated to the authorized receiver(s) by means of the Integrity Communication Module (620) and/or stored by the Storage Media Module (640). When an integrity verification is requested, the Audit Tool Module (630), with its web based interface, accepts requests through the Integrity Communication Module (620) in one of three ways: by receiving both the original data and the integrity; by receiving only the original data and retrieving the integrity from the Storage Media Module (640); or by retrieving both the original data and the integrity from the Storage Media Module (640).
The present invention proposes to generate fine granular integrity to huge volumes of data in real time, involving the following steps:
- a) receiving the data. An API (Application Programming Interface) (600, 505) is provided to enable the communication with the different data sources;
- b) processing the data applying cryptographic routines (510, 610) to generate one or more immutable digital chains that contain at least the original data related integrity information including timestamps; and
- c) communicating said digital chain(s) to the appropriate receiver (620), that could be the same as the sender of data (312), a different one (311), a storage media (320), etc.
The system described herein is preferably implemented as a software program, platform independent Java implementation, running in standard hardware. However, the system may be implemented in various embodiments using other well known implementations, such as, for example, Microsoft's .net technology or C++. The executable applications, as described herein, are computer programs (software) stored within the main memory or a secondary memory on any suitable computer running preferably Linux or Windows. Such computer programs, when executed, enable a processor to perform the features of the present invention. The system as disclosed herein can be implemented by a programmer, using commercially available development tools. Obviously, as technology changes, other computers and/or operating systems may be preferable in the future.
In a preferred embodiment, the use of an industry standard Hardware Security Module (HSM) (650), at least to generate and keep secure the asymmetric cryptographic keys, provides a higher degree of security and full independence, because even the system administrator cannot access these keys.
The system is proposed as a 3-tier software architecture: 1) the data communications tier (505), which is in charge of the connection with data sources; 2) the business or cryptographic tier (510), which is in charge of generating the immutable digital chains; and 3) the integrity communications tier (515), which is in charge of sending said digital chain(s) to the appropriate receiver, which could be the same sender of data (312), a different one (311), a storage media (320), etc.
Designing the application in layers (tiers) is useful for several reasons. In a multiple tier design, each tier can be run on a separate machine, or machines, allowing for improved processing performance. Depending on the design, multiprocessor machines or many different independent computers can be used to improve performance. Efficient layering can give structure to the application, promote scalability, and ease long-term maintenance requirements for the code.
The proposed system is designed in such a way that it can process the digital data at a binary level and at a data format level. When working in binary mode, the system processes the digital data at byte level, regardless of the format of the data (audio, video, documents, transactions, files . . . ).
- Receiving Data to Generate Its Integrity
To receive the original data whose integrity is to be generated, the proposed system provides an Application Programming Interface (600). For the network (405) transport protocol, the invention proposes the use of industry standards such as, but not restricted to, the following:
- Asynchronous messaging, like JMS;
- Synchronous communication, like web services using HTTP/S (TLS/SSL) calls over TCP/IP;
- Other communication protocols such as syslog, SNMP, SMTP, secure syslog, etc.
- Generating Integrity: Immutable Digital Chains
- Data messages m_i: a Message is the data provided in any call to the proposed system in order to generate its integrity.
- Entry: Tuple of values such as a Message, a Timestamp, a link and the type of the Message, etc.
- Register: Ordered set of entries
- PAud: Encryption with the public key of the entity authorized to verify the integrity
- SS: Encryption with the system's private key
- DSS: digital signature made by the system
- ts: timestamp
- ∥: concatenation
- MAC: (Message Authentication Code) is an authentication tag derived by applying an authentication scheme, together with a secret key, to a message.
Unlike digital signatures, MACs are computed and verified with the same key, so that they can only be verified by the intended recipient. There are four types of MACs: (1) unconditionally secure, (2) hash function-based, (3) stream cipher-based or (4) block cipher-based.
In a preferred embodiment, the integrity is generated as immutable digital chains following the cryptographic protocol defined below:
- 1. The proposed system establishes at least one session key (symmetric key) that will be kept secured by means of a digital envelope using public-key cryptography:
- 1.1. The system generates randomly a session key, K.
- 1.2. The system destroys securely the old previous session key (if it exists).
- 1.3. The system encrypts the new key with the public key (PAud), obtaining K′=PAud(K)
- 1.4. The system digitally signs the encrypted key K′ obtaining K″=DSS(K′)
- 1.5. The system adds to at least one of the digital chains at least the values K′ and K″, a timestamp, and the digital signature of all previous data. That is, entry_0 = (m_0, ts, DS_0 = SS(h(m_0 ∥ ts ∥ 1))), where m_0 = PAud(K).
- 2. Every time a message (unit of data) m_i is received, a new link is added to the corresponding digital chain, preserving the sequence order. Every added entry_i is derived from the previous entry, entry_{i-1}, according to the formula: entry_i = (ts, MAC_K(m_i ∥ ts ∥ MAC_{i-1})).
- 3. If the system is never stopped, the chain has no end. When the system is shut down (for example, because the server needs maintenance), the chain is securely closed by creating a special final entry_N formed by a tuple of at least the following elements: the timestamp ts, the link with the previous entry N−1, and a digital signature over m_N and ts together with the previous MAC_{N-1}. That is, entry_N = (ts, SS[h(m_N ∥ ts ∥ MAC_{N-1})]), where m_N contains at least the chain identifier.
As seen, each time an entry is added to a digital chain, the session key is used to compute a cryptographic message authentication code (MAC) over the entry and the MAC of the previous entry. To provide another level of security, it is possible to change the session key after a predefined time or a predefined number of iterations and start using a new one as defined at step 1.
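The link formula of step 2 can be sketched as follows. This is a minimal illustration, not the claimed implementation: it assumes HMAC-SHA256 as the MAC, adds length prefixes before concatenation (not specified in the text), and uses a fixed placeholder in place of the signed entry_0 of step 1.5. The names `mac`, `Chain` and `initial_link` are hypothetical.

```python
import hashlib
import hmac
import os
import time


def mac(key: bytes, *parts: bytes) -> bytes:
    """HMAC-SHA256 over the parts; length prefixes (an assumption of this
    sketch) keep the concatenation m_i || ts || MAC_{i-1} unambiguous."""
    h = hmac.new(key, digestmod=hashlib.sha256)
    for part in parts:
        h.update(len(part).to_bytes(4, "big"))
        h.update(part)
    return h.digest()


class Chain:
    """One digital chain: entry_i = (ts, MAC_K(m_i || ts || MAC_{i-1}))."""

    def __init__(self, session_key: bytes, initial_link: bytes):
        self._key = session_key
        self._prev = initial_link   # placeholder for the link of entry_0
        self.entries = []           # list of (ts, link) tuples

    def append(self, message: bytes, ts=None):
        ts = time.time_ns() if ts is None else ts
        link = mac(self._key, message, ts.to_bytes(8, "big"), self._prev)
        self._prev = link
        self.entries.append((ts, link))
        return ts, link


key = os.urandom(32)   # step 1.1: randomly generated session key K
chain = Chain(key, initial_link=b"\x00" * 32)
for i, msg in enumerate([b"tx-1", b"tx-2", b"tx-3"]):
    chain.append(msg, ts=1_000 + i)
```

Because each link depends on the previous one, changing, removing or reordering any message changes every later link, and only one symmetric MAC computation is needed per message.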
Metronome entries are added to the digital chain at predefined regular intervals and are generated in the same way as the links that close a chain. Metronome entries thus provide digital signatures for the chunk of messages contained in the digital chain between one metronome entry and the previous one, adding another level of security. In a preferred embodiment, metronome entries contain at least the same information detailed at step 3 above but without the message field (that is, only timestamping information). Additionally, in another embodiment the metronome entry could also contain a digital signature of its values.
In another embodiment, it is also possible to include the original data inside the links of the digital chain, providing the integrity together with the original data (the messages m1 to mn). In this embodiment, as an option, it is also possible to encrypt the messages m1 to mn (original data) using a symmetric encryption algorithm, such as AES (preferred), DES, 3DES, IDEA, etc. The secret key to be used for encryption could be the same key K used for integrity (MAC) or a different one also encrypted with a different public key belonging to a different entity, which will provide separation of roles between the entity allowed to verify the integrity and the one allowed to access the original data.
The process to verify the integrity consists of repeating the same process followed during integrity generation, starting from the last entry that contains an encrypted symmetric key K, and verifying the MACs and digital signatures. The entries are verified preserving the sequence order.
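The MAC part of this verification can be sketched as below. The sketch assumes that the auditor has already recovered the session key K with the private key, that the MAC is HMAC-SHA256 with length-prefixed inputs, and that a fixed placeholder stands in for the link of entry_0; signature checks are omitted. The helper names `link_mac` and `first_invalid_entry` are hypothetical.

```python
import hashlib
import hmac


def link_mac(key, message, ts, prev):
    """MAC_K(m_i || ts || MAC_{i-1}) with length-prefixed parts (assumption)."""
    data = b"".join(len(p).to_bytes(4, "big") + p
                    for p in (message, ts.to_bytes(8, "big"), prev))
    return hmac.new(key, data, hashlib.sha256).digest()


def first_invalid_entry(key, initial_link, messages, entries):
    """Recompute the chain in order; return the index of the first entry
    that does not match, or None if every link verifies."""
    if len(messages) != len(entries):
        return min(len(messages), len(entries))
    prev = initial_link
    for i, (message, (ts, stored)) in enumerate(zip(messages, entries)):
        expected = link_mac(key, message, ts, prev)
        if not hmac.compare_digest(expected, stored):
            return i
        prev = stored
    return None


# Build a small chain, then check it against intact and tampered data.
key, start = b"k" * 32, b"\x00" * 32
messages = [b"pay 10", b"pay 20", b"pay 30"]
entries, prev = [], start
for ts, m in enumerate(messages, start=1):
    prev = link_mac(key, m, ts, prev)
    entries.append((ts, prev))

intact = first_invalid_entry(key, start, messages, entries)
tampered = first_invalid_entry(
    key, start, [b"pay 10", b"pay 99", b"pay 30"], entries)
swapped = first_invalid_entry(
    key, start, [b"pay 20", b"pay 10", b"pay 30"], entries)
```

The returned index localizes the first altered or reordered message, which is what gives the integrity its fine granularity.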
If the system is compromised, the attacker has no way to recreate the MACs without knowing the session key, and therefore cannot modify the entries without detection.
Consider an attacker who chooses simply to delete or truncate a register rather than attempting to modify existing entries without detection. No new valid entries can be added once a register has been truncated, since intermediate links will have been lost, and this will be detected during verification.
Consider now an attacker who deletes entries from the end of the chain. In this scenario, the lack of new entries could suggest that no more data have been received recently (rather than that entries have been deleted). The use of metronome entries prevents this kind of attack: if the attacker deletes entries from the end, he will also delete the metronome entries; if he leaves the metronome entries, their digital signatures will not match. In either case the authorized Auditor will detect the situation, and the last valid entry indicates the earliest time at which the register could have been truncated.
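One way an auditor could apply this reasoning is sketched below. It is an illustrative assumption, not part of the described protocol: the entries are taken to have already passed MAC and signature verification, the chain is taken to be still open, and the rule "the newest metronome entry must be no older than one interval" is a hypothetical check. The names `Entry` and `audit_tail` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Entry:
    ts: int           # timestamp in seconds
    metronome: bool   # True for metronome entries, False for message links


def audit_tail(entries, interval, audit_time):
    """Flag a register whose newest metronome entry is older than one
    metronome interval, and report the time of the last valid entry."""
    metronomes = [e.ts for e in entries if e.metronome]
    last_metronome = metronomes[-1] if metronomes else None
    suspected = (last_metronome is None
                 or audit_time - last_metronome > interval)
    earliest = entries[-1].ts if entries else None
    return {"truncation_suspected": suspected,
            "earliest_truncation": earliest}


register = [Entry(100, False), Entry(160, True),
            Entry(170, False), Entry(220, True)]
complete = audit_tail(register, interval=60, audit_time=250)
truncated = audit_tail(register[:2], interval=60, audit_time=250)
```

A quiet period without messages still produces metronome entries, so their absence distinguishes truncation from inactivity.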
As said before, the preferred embodiment considers generating multiple concurrently maintained digital chains to reduce latency and make better use of the available computational capacity. The system establishes as many concurrent session keys as there are chains (the number is configurable). Every chain is independent of the others and works independently, but all chains are securely linked together at creation time. In this way, neither a single chain nor the complete set of chains can be entirely deleted without detection. Additionally, metronome entries are added to all current chains at the same time, so all chains should have the same number of metronome entries. Metronome entries added at the same time have the same identifier value, which simplifies detecting truncation.
In a preferred embodiment, besides protecting the integrity inside the chains, it is also necessary to consider attacks other than modification inside a chain, namely the deletion of some of the multiple chains generated.
Since there are far fewer chains than entries inside the chains, mathematical operations can be used to verify the integrity of the whole set of chains.
Let us assume that a storage media (320) is storing chains from n servers in a non-uniform way; that is, it is hard to impose an order on the chains while they are being stored. Moreover, if the chains are grouped and chained together again, the deletion of one of them cannot be detected.
To overcome these drawbacks, the last MAC value of each chain can be taken and a polynomial created with these values as roots, according to the formula:
P(x) = ∏_{i=1}^{n} (x − x_i)
As more chains arrive, the polynomial continues to be extended up to some limit, after which it is signed. This makes it easy to move backward or forward among the chains from any chosen point. Deleted chains can be detected easily by cancelling the values of the remaining chains (the final MAC values inside the chains), and the values of the deleted chains can also be recovered.
Said polynomial is not going to have a repeating root; i.e., the multiplicity of each root is going to be 1 and sum of multiplicities is going to be equal to the degree of the polynomial. This property is a direct consequence of the collision-resistance of MAC functions.
Moreover, if an attacker deletes an integrity value, she cannot compute a different value that makes the polynomial look the same. This is because polynomial rings over a field are unique factorization domains, which means that a polynomial cannot be factored into a different set of linear factors.
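The construction and the cancellation step can be sketched as follows. The sketch assumes arithmetic modulo the prime 2^127 − 1 (the text does not fix the arithmetic at this point) and uses small integers as stand-ins for the final MAC values; signing of the polynomial is omitted. The function names are hypothetical.

```python
P_MOD = 2**127 - 1   # a Mersenne prime; the choice of field is an assumption


def times_linear(coeffs, root):
    """Multiply a polynomial (lowest degree first) by (x - root) mod P_MOD."""
    out = [0] * (len(coeffs) + 1)
    for i, c in enumerate(coeffs):
        out[i + 1] = (out[i + 1] + c) % P_MOD
        out[i] = (out[i] - root * c) % P_MOD
    return out


def divide_linear(coeffs, root):
    """Synthetic division by (x - root); returns (quotient, remainder)."""
    quotient = [0] * (len(coeffs) - 1)
    carry = 0
    for i in range(len(coeffs) - 1, 0, -1):
        carry = (coeffs[i] + carry * root) % P_MOD
        quotient[i - 1] = carry
    remainder = (coeffs[0] + carry * root) % P_MOD
    return quotient, remainder


def build(roots):
    """P(x) = product of (x - x_i) over all chain values x_i."""
    coeffs = [1]
    for r in roots:
        coeffs = times_linear(coeffs, r % P_MOD)
    return coeffs


def cancel_remaining(signed_poly, remaining):
    """Cancel the surviving chain values; the polynomial that is left
    has exactly the deleted chain values as roots."""
    poly = signed_poly
    for r in remaining:
        poly, rem = divide_linear(poly, r % P_MOD)
        if rem != 0:
            raise ValueError("value is not a root of the signed polynomial")
    return poly


final_macs = [0x1111, 0x2222, 0x3333, 0x4444]   # stand-ins for last MAC values
poly = build(final_macs)
same_poly = build(reversed(final_macs))          # arrival order is irrelevant

leftover = cancel_remaining(poly, [0x1111, 0x2222, 0x4444])
recovered = (-leftover[0]) % P_MOD               # leftover is x - 0x3333
```

If no chain has been deleted, cancelling every stored value leaves the constant polynomial 1; any leftover factor identifies a missing chain.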
Another advantage of this polynomial is that it can be obfuscated without any need for encryption. This might be achieved, for example, by choosing a random number and adding it to the constant term of the polynomial. The size of the interval from which the random number is chosen can be set as a security parameter for the polynomial, so it can be adjusted. Furthermore, the polynomial can be made public by signing it and sending it to different locations inside the network, which reduces the risk of the polynomial being harmed.
There are of course many other ways to create such structures in which the order of computation is not important, for example modular multiplication of chain values, which might be less costly than polynomials. However, polynomial arithmetic modulo 2 is fast and convenient to implement. What security requires is a unique factorization domain under the chosen operation.
The polynomial is updated when a new chain arrives at the database by multiplying it by the factor for the new chain value; that is, P(x) becomes P(x)·(x − x_{n+1}). Re-signing the whole polynomial every time it is updated is time consuming. To avoid this cost, it is proposed to use homomorphic encryption, which makes it possible to sign only the newly arrived chain factor (x − x_{n+1}) and multiply, because homomorphism means that DS(P(x))·DS(x − x_{n+1}) = DS(P(x)·(x − x_{n+1})). This is much more efficient than signing the updated polynomial.
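The multiplicative property relied on here can be illustrated with unpadded ("textbook") RSA, which is a swapped-in example scheme chosen for this sketch and is not named in the text. The key below is a toy key, unpadded RSA is not secure in practice, and the integers merely stand in for some encoding of P(x) and of the new factor, which the text leaves unspecified.

```python
# Toy textbook-RSA key (n = 61 * 53); illustration only, far too small to use.
N, E, D = 3233, 17, 2753


def sign(value: int) -> int:
    """Unpadded RSA signature: value^D mod N."""
    return pow(value % N, D, N)


def verify(value: int, signature: int) -> bool:
    """Check that signature^E mod N reproduces the value."""
    return pow(signature, E, N) == value % N


p_val = 1234    # stands in for an integer encoding of P(x)
factor = 77     # stands in for the new factor (x - x_{n+1})

# Multiply the two signatures instead of re-signing the product.
combined = (sign(p_val) * sign(factor)) % N
resigned = sign(p_val * factor)
```

Under these assumptions `combined` equals `resigned`, so only the new factor needs a fresh signing operation on each update.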
Another embodiment considers just timestamping and signing the polynomial. As the chain integrity values arrive, the system signs each recomputed polynomial in turn. Signing with the timestamp might reduce replacement attacks if fake but “indistinguishable” chain values are generated and added periodically to the computation of the polynomial. These periods must be short enough to prevent replacement attacks. The authenticity of the timestamp must be preserved in any case.
Another embodiment considers creating several polynomials instead of just one, using a pseudorandom function to determine which polynomial is to be updated. The purpose is to prevent an adversary from learning which polynomial is updated. The seed of the pseudorandom function is kept secret. A replacement attack is then revealed by the question “How can it be that all of the polynomials are the same for that period of time?”
Another embodiment considers a continuation of the first improvement: the polynomial is going to be updated for each new coming metronome value. After this; it is going to be signed homomorphically. To keep the degree of the polynomial at a reasonable level, we just have to cancel the last signed metronome entry and then we have to update it with the new coming metronome entry. For the last chain values (or link values) we update with them as usual but they are not going to be cancelled as metronome entries. They are going to stay as the real roots of the polynomial. To summarize, link values or the last chain values we add to the computation of the polynomial are going to be permanent, metronome entries added as roots are going to be temporary; they are going to be replaced by each new coming metronome entry.
The arithmetic to use is going to depend on the signature scheme as well as the fastest implementation which is going to be suitable. It is suggested to use binary arithmetic so that the computation of polynomial is going to be very fast. But in general a polynomial of degree n is going to be multiplied with a factor which has degree 1; so in any case it is fast.
Another embodiment, in order to avoid dividing the polynomial each time the metronome roots are replaced, proposes to keep the polynomial that existed before the opening of a new integrity generation session (a polynomial created from the previous link/last chain values). Let us call this polynomial “P”. By the above discussion, P consists only of factors whose roots are last link values (belonging to the previous integrity generation sessions), which are never divided out. There is another polynomial, “Q”, which contains both the last chain values as roots and the last arrived metronome value as a factor. With each newly arrived metronome entry m_i, Q is updated as Q = P·(x − m_i) and Q is signed again as before.
This overwrite operation avoids dividing out old metronome values. Furthermore, the cost of signing is kept constant, and the signature scheme does not have to be homomorphic.
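The P/Q bookkeeping can be sketched as follows, assuming arithmetic modulo the prime 2^127 − 1 and small integers as stand-ins for link and metronome values; the signing of Q is omitted. The class `SessionPolynomials` and its method names are hypothetical.

```python
P_MOD = 2**127 - 1   # prime modulus; an assumption made for this sketch


def times_linear(coeffs, root):
    """Multiply a polynomial (lowest degree first) by (x - root) mod P_MOD."""
    out = [0] * (len(coeffs) + 1)
    for i, c in enumerate(coeffs):
        out[i + 1] = (out[i + 1] + c) % P_MOD
        out[i] = (out[i] - root * c) % P_MOD
    return out


def evaluate(coeffs, x):
    """Horner evaluation of a polynomial mod P_MOD."""
    acc = 0
    for c in reversed(coeffs):
        acc = (acc * x + c) % P_MOD
    return acc


class SessionPolynomials:
    """P holds the permanent roots (last chain values of earlier sessions);
    Q = P * (x - m_i) is overwritten at each metronome entry m_i."""

    def __init__(self, previous_last_links):
        self.p = [1]
        for value in previous_last_links:
            self.p = times_linear(self.p, value % P_MOD)
        self.q = list(self.p)

    def on_metronome(self, metronome_value):
        # Recompute Q from P, so the old metronome root is never divided out.
        self.q = times_linear(self.p, metronome_value % P_MOD)
        return self.q   # this is the polynomial that would be signed


polys = SessionPolynomials([101, 202, 303])
polys.on_metronome(9001)
q = polys.on_metronome(9002)
```

After any number of metronome entries, Q has degree one more than P, which is why the signing cost stays constant.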
The use of an industry standard Hardware Security Module (HSM) (650), where at least the pair of private and public keys for digital signatures is generated and the private key is held securely, guarantees the immutability of the digital chain, because nobody can access the private key used to sign, not even privileged users such as the system administrators.
- Delivering the Integrity
The integrity communications tier (515) is in charge of delivering the integrity. As seen before, the integrity is formed by at least one immutable digital chain, and in a preferred embodiment this chain is delivered to the sender of the original data in real time as it is being created, link by link, using the same communications protocol established to receive the original data. The owner of the original data now possesses an integrity token related to the original data, which can be verified at any time by the owner of the asymmetric private key corresponding to the public key used to encrypt the symmetric session key(s). An example application is a real-time video system, such as a centralized CCTV server that receives multiple video streams and stores the video in a never-ending file (when the disk is full, instead of closing the file it continues storing data from the beginning, generating a continuous stream file), where the integrity is generated at the same time as the video and stored alongside it in the CCTV system. The CCTV system sends the video stream to the proposed integrity system in real time, the integrity system generates the integrity and sends it back to the CCTV system, also in real time, and finally the CCTV system stores the video stream in its storage media together with the integrity stream (the digital chain). The benefits over purely using digital signatures are evident in this example, because the integrity is generated continuously along the video stream instead of over snapshots.
In another embodiment, the integrity is stored by the proposed system instead of being delivered, while the original data is not kept. In this scenario, when an integrity verification is required, the system only needs to receive the original data and it will generate the integrity report using the previously stored integrity.
Another embodiment contemplates the integrity system storing both the integrity and the original data, together or separately in different storage media. In this scenario the integrity system also works as a secure repository of data. The audit tool will not only generate an integrity report but also export the original data, guaranteeing its integrity.
- Audit Tool to Verify the Integrity
The system provides a web based audit tool (630) that is in charge of verifying the integrity of the data, generating the integrity reports and, in some cases, delivering the original data. The audit tool requires access to the asymmetric private key of the authorized receiver(s) of the integrity, as well as the public key used by the system, in order to recover the symmetric session keys needed to verify the integrity by repeating the same process followed to generate it and comparing both results. To guarantee the security of the process, in a preferred embodiment the public keys are all certified by a trusted third party (660).
While preferred embodiments of the invention have been shown and described herein, it will be understood that such embodiments are provided by way of example only. Numerous variations, changes and substitutions will occur to those skilled in the art without departing from the spirit of the invention. Accordingly, it is intended that the appended claims cover all such variations as fall within the spirit and scope of the invention.