Monday, 27 August 2018

Using Oracle Data Pump for Data Science - Pt. 2

Recap and Introduction

In the last post I laid out the general ideas behind the approach to be taken for anonymising data as it is extracted from an Oracle database. The use of HMAC provides a convenient way of using a hardened secret key to consistently apply a secure one way hashing algorithm to input data and get an anonymised output - i.e. one that can't be linked to its original value.

  • We want it to be consistent so that we can get the same result for a given input. Randomising anonymises but it also destroys the data.
  • We want a secure one-way hashing algorithm like SHA-1 because the anonymised output can't be reversed to the original text.
  • We want to use a secret key to guard against rainbow tables being used, especially for low entropy input data.
  • The secret key should be hardened against being guessed by brute force approaches. A password based key derivation function like PBKDF2 can do this.

We finished last time with a simple select from dual SQL statement that demonstrated using the DBMS_CRYPTO package to run the HMAC algorithm on some text. Included in that statement were calls to utility packages UTL_RAW and UTL_ENCODE to perform type conversions, and to encode the RAW value as a Base64 string.

Using Oracle Data Pump's REMAP_DATA parameter means that the data is anonymised at its source rather than having post-processing steps. As we will see later, these post-processing steps don't necessarily disappear; but performing this first step on the input values themselves is a significant first step, and it is useful to apply it in-line.

Using REMAP_DATA

The REMAP_DATA parameter to Oracle Data Pump's expdp (and impdp) commands allows an individual column to be remapped to different values. The format of the parameter is as follows:

REMAP_DATA=[schema.]tablename.column_name:[schema.]pkg.function

So this tells us we need a PL/SQL package that contains a function. The documentation also says that the returned value from the function has to match the type of the column. Our primary focus here is anonymising personally identifiable information or PII, which in Oracle data type terms means VARCHAR2.

Following from the simple code in the last post, the code for such a function could be as simple as this:

 1  FUNCTION hash_varchar2_to_base64(input IN VARCHAR2)
 2  RETURN VARCHAR2 IS
 3
 4  BEGIN
 5    RETURN(
 6      utl_raw.cast_to_varchar2(
 7        utl_encode.base64_encode(
 8          dbms_crypto.mac(
 9            utl_raw.cast_to_raw(input),
10            dbms_crypto.HMAC_SH1,
11            utl_raw.cast_to_raw('monkey')
12          )
13        )
14      )
15    );
16  END;

PBKDF2 Implementation

What about that key on line 11 above? It doesn't meet out need for a hardened key. To generate a hardened key we're going to use PBKDF2. This requires a password or passphrase, a random salt, a number of iterations to perform, and a desired key length.

 1  FUNCTION pbkdf2(password IN VARCHAR2, salt IN VARCHAR2,
 2                  iterations IN INTEGER, dklen IN INTEGER)
 3  RETURN RAW IS
 4    blocks       INTEGER;
 5    block        RAW(32767);
 6    prf          RAW(32767);
 7    last_prf     RAW(32767);
 8    salt_raw     RAW(32767);
 9    password_raw RAW(32767);
10    key          RAW(32767);
11  BEGIN
12    blocks := ceil(dklen/20); -- HMAC-SHA-1 output is 20 bytes
13    password_raw := utl_raw.cast_to_raw(password);
14    salt_raw := utl_raw.cast_to_raw(salt);
15    FOR i IN 1..blocks
16    LOOP
17      last_prf := dbms_crypto.mac(
18                    utl_raw.concat(
19                      salt_raw,
20                      utl_raw.cast_from_binary_integer(i, utl_raw.big_endian)
21                    ),
22                    dbms_crypto.HMAC_SH1,
23                    password_raw
24                  );
25      block := last_prf;
26      FOR j IN 2..iterations
27      LOOP
28        prf := dbms_crypto.mac(last_prf, dbms_crypto.HMAC_SH1, password_raw);
29        block := utl_raw.bit_xor(block, prf);
30        last_prf := prf;
31      END LOOP;
32      key := utl_raw.concat(key, block);
33    END LOOP;
34    RETURN utl_raw.substr(key, 1, dklen);
35  END;

Performance considerations

Embedding a call to the function above would meet the requirement for a hardened key. But the idea of a key derivation function is that it is intentionally slow. Embedding a call to the PBKDF2 function to form a key based passphrase is going to slow things down a lot because it will recompute the hardened key for each value to be encoded.

Key access by expdp

We also can't use package state variables to embed the key within the package because the package state is part of the session. The expdp/impdp commands have their own session with Oracle so there's no way to set up the key.

One idea could be to pre-compute the key and embed in the code as a Base64 string. This would change line 14 to be something like this (where "<base 64 key>" represents the key):

11            utl_encode.base64_decode(utl_raw.cast_to_raw('<base64 key>'))

That means every time we want to change the key, we have to re-create the package with the key embedded in it.

We can now make things more modular by putting the PBKDF2 function above in its own package. We then declare a constant in the package body of the package containing the hash_varchar2_to_base64 function to hold the key. This means modifying the code at install time which might be useful as the installer has the choice to put the package in a separate schema; this makes the source of the package inaccessible (assuming the DBA_SOURCE view is also kept restricted).

The final version of the data anonymisation package therefore looks something like this:

 1  raw_key CONSTANT RAW(32767) := anonymous_key.pbkdf2(
 2      '<YOUR PASSWORD>', '<YOUR SALT>', 
 3       <YOUR ITERATIONS>, <YOUR DESIRED KEY LENGTH>);
 4
 5  FUNCTION hash_varchar2_to_base64(input IN VARCHAR2)
 6  RETURN VARCHAR2 IS
 7
 8  BEGIN
 9    RETURN(
10      utl_raw.cast_to_varchar2(
11        utl_encode.base64_encode(
12          dbms_crypto.mac(
13            utl_raw.cast_to_raw(input),
14            dbms_crypto.HMAC_SH1,
15            rawkey
16          )
17        )
18      )
19    );
20  END;

Monday, 2 July 2018

Using Oracle Data Pump for Data Science

Introduction

I'm getting some data for data analytics.  The supplier of the data wants to anonymise it before giving it to me.  The best way I know of doing that (without having a big lookup table) is using a secure hash function.  Secure hash functions are used to scramble data and have a set of properties that make it impractical to reverse the scrambling.  This is different to encryption which is designed to allow the encryption operation to be reversed (if the encryption key is known).

So, simply applying a secure hashing function to data would seem to be a good way of obscuring the original, right?  Well, while it isn't possible to reverse hashing, it turns out that with the speed of modern computers, when the size of the input is fairly small, it is possible to generate a lookup table of every possible input-to-hash pair.  Consider telephone numbers; they're fairly short - 10 digits in the US which is about 33 bits of entropy - and a lookup table could easily be generated once and stored.  This effectively renders the hash ineffective because it can be reversed through the lookup table.

Password storage mechanisms get around this problem by generating and storing a nonce - a number used once - for every password stored. The lookup table is rendered useless and it's back to brute forcing every input to hash value.  As mentioned above, modern computers are fast so password databases combine the nonce approach with a repeated application of the hashing algorithm.  This increases the difficulty according to the number of iterations.

This works for storing passwords; there will only be one entry in the database for the user password, and its nonce can make it separate from every other. For relational data, there are many rows that relate to each other by being equal to one another.  The fact that they are equal allows us to join the dots and build machine learning models of the data. Putting a random nonce with every row in a relational database destroys the relationship it has with any other data. The machine learning algorithms need the relationships between data points, so destroying these relationships is a very bad thing.

Therefore the technical requirement is that there be a consistent way to scramble the data so that the relationships are preserved. This basically means that instead of having a nonce for every individual data point, we have a secret key that is used for all of them. Thus the same value will be hashed in the same way regardless of where it is found, and the relationships between the values are preserved. Note that, if you're going to be getting multiple dumps of data from your source, for the values to hash to the same value (thereby maintaining their relationships), you need to use the same secret key each time.

Why not have a lookup table?

It's worth exploring this idea a bit before we dismiss it. It's entirely possible that we could have a lookup table that maps values to an anonymous version. That means changing the system to accommodate the needs of anonymisation. It also means that the mechanism for doing so has to be done carefully so that it doesn't leak information. For example, replacing values with a numeric primary key that is looked up elsewhere could provide information about when the subscriber joined.

It remains to be determined how user generated values should be anonymised. Thinking about our phone numbers again, a system that contains information about telephone subscribers and the calls they make could anonymise the caller by replacing the caller's number with their primary key in the subscribers table through a lookup. The called number should also be anonymised, and the same approach could be taken if they're also a subscriber; but, what if they aren't? And what if they become a subscriber at a later date?

A word on Hash collisions

The secure hash function produces unpredictable (but deterministic) output for a given input. They're designed to produce different outputs for different inputs but because they effectively compress their input down to something smaller (160 bits in the case of SHA-1), there will be cases where different inputs generate the same output value from the hash function.

In the case of our telephone numbers, we're talking 10^10 different numbers. If we're using SHA-1, each input telephone number will produce one of 2^160 or 1.46x10^48 output values from the hash function. The probability of one number hashing to the same as another is going to be very small.

HMAC - Hashed Message Authentication Code

Using HMAC (RFC2104) provides a convenient way to combine a value with a key. In its role as a message authentication code (MAC), it has been designed to stop an attacker being able to find two messages that have the same MAC. This means that messages can't be forged by the attacker. Note that we don't have that requirement in this case and we could simply extend the hash input by the key, as in MAC = H(input || key).

Note that certain cryptographic hash functions have been weakened over recent years through cryptanalysis. Hash functions like MD4 and MD5 are now considered broken. HMAC improves the collision resistance of these. It may also be more convenient to use HMAC given the availability of implementations.

Password Based Key Derivation Functions

For anonymising the data, whether it is a simple hash with the input extended by a key or the HMAC is used with the key, the strength of the key is what determines how well the data is protected. Just using a simple password (e.g. "monkey") as the key doesn't provide much protection because of the existence of password databases and heuristics based approaches for cracking passwords.

Passphrases are better, like the famous XKCD example of "correct horse battery staple" because they have a higher entropy. But more is better, and because secure hashing algorithms pad their input to an even number of 512 bit blocks, for the typical database table column (e.g. holding a phone number) you can add a very large key with no penalty in performance.

Algorithms such as PBKDF2 and scrypt have been designed to make the generation of high entropy keys easy and computationally expensive. Used with a passphrase such as that above, the practicality of brute forcing the passphrase and/or key diminishes greatly.

Oracle Data Pump

Oracle provides the expdp and impdp tools for efficiently exporting and importing data between databases. In order that my data supplier can anonymise data before they give it to me, I want to setup a parameter file that does the hashing of the data as it is exported. The expdp supports the REMAP_DATA parameter in which you tell it the table, column, and a function to use for remapping it.

Oracle also provides the packages required to do the hashing and necessary conversions between data types - specifically DBMS_CRYPTO, UTL_RAW. The following example shows how to run HMAC on a input "blah" with a key "monkey". The key will ultimately need to be replaced with a suitably hardened value. The HMAC function outputs a binary value, so it is converted to Base64 as shown - this is required for Oracle Data Pump as the result of the remap must be the same data type as the original column's data type; the UTL_ENCODE package does this.

1  select utl_raw.cast_to_varchar2(
  2    utl_encode.base64_encode(
  3      dbms_crypto.mac(
  4        utl_raw.cast_to_raw('blah'),
  5        dbms_crypto.HMAC_SH1,
  6        utl_raw.cast_to_raw('monkey')
  7      )
  8    )
  9  )
 10  from dual

What remains is to implement the simple code in the select above into a PL/SQL package that can be called from the REMAP_DATA parameter.