Monday, 2 July 2018

Using Oracle Data Pump for Data Science

Introduction

I'm getting some data for data analytics.  The supplier of the data wants to anonymise it before giving it to me.  The best way I know of doing that (without having a big lookup table) is using a secure hash function.  Secure hash functions are used to scramble data and have a set of properties that make it impractical to reverse the scrambling.  This is different to encryption which is designed to allow the encryption operation to be reversed (if the encryption key is known).

So, simply applying a secure hashing function to data would seem to be a good way of obscuring the original, right?  Well, while it isn't possible to reverse hashing, it turns out that with the speed of modern computers, when the size of the input is fairly small, it is possible to generate a lookup table of every possible input-to-hash pair.  Consider telephone numbers; they're fairly short - 10 digits in the US which is about 33 bits of entropy - and a lookup table could easily be generated once and stored.  This effectively renders the hash ineffective because it can be reversed through the lookup table.

Password storage mechanisms get around this problem by generating and storing a nonce - a number used once - for every password stored. The lookup table is rendered useless and it's back to brute forcing every input to hash value.  As mentioned above, modern computers are fast so password databases combine the nonce approach with a repeated application of the hashing algorithm.  This increases the difficulty according to the number of iterations.

This works for storing passwords; there will only be one entry in the database for the user password, and its nonce can make it separate from every other. For relational data, there are many rows that relate to each other by being equal to one another.  The fact that they are equal allows us to join the dots and build machine learning models of the data. Putting a random nonce with every row in a relational database destroys the relationship it has with any other data. The machine learning algorithms need the relationships between data points, so destroying these relationships is a very bad thing.

Therefore the technical requirement is that there be a consistent way to scramble the data so that the relationships are preserved. This basically means that instead of having a nonce for every individual data point, we have a secret key that is used for all of them. Thus the same value will be hashed in the same way regardless of where it is found, and the relationships between the values are preserved. Note that, if you're going to be getting multiple dumps of data from your source, for the values to hash to the same value (thereby maintaining their relationships), you need to use the same secret key each time.

Why not have a lookup table?

It's worth exploring this idea a bit before we dismiss it. It's entirely possible that we could have a lookup table that maps values to an anonymous version. That means changing the system to accommodate the needs of anonymisation. It also means that the mechanism for doing so has to be done carefully so that it doesn't leak information. For example, replacing values with a numeric primary key that is looked up elsewhere could provide information about when the subscriber joined.

It remains to be determined how user generated values should be anonymised. Thinking about our phone numbers again, a system that contains information about telephone subscribers and the calls they make could anonymise the caller by replacing the caller's number with their primary key in the subscribers table through a lookup. The called number should also be anonymised, and the same approach could be taken if they're also a subscriber; but, what if they aren't? And what if they become a subscriber at a later date?

A word on Hash collisions

The secure hash function produces unpredictable (but deterministic) output for a given input. They're designed to produce different outputs for different inputs but because they effectively compress their input down to something smaller (160 bits in the case of SHA-1), there will be cases where different inputs generate the same output value from the hash function.

In the case of our telephone numbers, we're talking 10^10 different numbers. If we're using SHA-1, each input telephone number will produce one of 2^160 or 1.46x10^48 output values from the hash function. The probability of one number hashing to the same as another is going to be very small.

HMAC - Hashed Message Authentication Code

Using HMAC (RFC2104) provides a convenient way to combine a value with a key. In its role as a message authentication code (MAC), it has been designed to stop an attacker being able to find two messages that have the same MAC. This means that messages can't be forged by the attacker. Note that we don't have that requirement in this case and we could simply extend the hash input by the key, as in MAC = H(input || key).

Note that certain cryptographic hash functions have been weakened over recent years through cryptanalysis. Hash functions like MD4 and MD5 are now considered broken. HMAC improves the collision resistance of these. It may also be more convenient to use HMAC given the availability of implementations.

Password Based Key Derivation Functions

For anonymising the data, whether it is a simple hash with the input extended by a key or the HMAC is used with the key, the strength of the key is what determines how well the data is protected. Just using a simple password (e.g. "monkey") as the key doesn't provide much protection because of the existence of password databases and heuristics based approaches for cracking passwords.

Passphrases are better, like the famous XKCD example of "correct horse battery staple" because they have a higher entropy. But more is better, and because secure hashing algorithms pad their input to an even number of 512 bit blocks, for the typical database table column (e.g. holding a phone number) you can add a very large key with no penalty in performance.

Algorithms such as PBKDF2 and scrypt have been designed to make the generation of high entropy keys easy and computationally expensive. Used with a passphrase such as that above, the practicality of brute forcing the passphrase and/or key diminishes greatly.

Oracle Data Pump

Oracle provides the expdp and impdp tools for efficiently exporting and importing data between databases. In order that my data supplier can anonymise data before they give it to me, I want to setup a parameter file that does the hashing of the data as it is exported. The expdp supports the REMAP_DATA parameter in which you tell it the table, column, and a function to use for remapping it.

Oracle also provides the packages required to do the hashing and necessary conversions between data types - specifically DBMS_CRYPTO, UTL_RAW. The following example shows how to run HMAC on a input "blah" with a key "monkey". The key will ultimately need to be replaced with a suitably hardened value. The HMAC function outputs a binary value, so it is converted to Base64 as shown - this is required for Oracle Data Pump as the result of the remap must be the same data type as the original column's data type; the UTL_ENCODE package does this.

1  select utl_raw.cast_to_varchar2(
  2    utl_encode.base64_encode(
  3      dbms_crypto.mac(
  4        utl_raw.cast_to_raw('blah'),
  5        dbms_crypto.HMAC_SH1,
  6        utl_raw.cast_to_raw('monkey')
  7      )
  8    )
  9  )
 10  from dual

What remains is to implement the simple code in the select above into a PL/SQL package that can be called from the REMAP_DATA parameter.