Monday, 27 August 2018

Using Oracle Data Pump for Data Science - Pt. 2

Recap and Introduction

In the last post I laid out the general ideas behind the approach to be taken for anonymising data as it is extracted from an Oracle database. The use of HMAC provides a convenient way of using a hardened secret key to consistently apply a secure one way hashing algorithm to input data and get an anonymised output - i.e. one that can't be linked to its original value.

  • We want it to be consistent so that we can get the same result for a given input. Randomising anonymises but it also destroys the data.
  • We want a secure one-way hashing algorithm like SHA-1 because the anonymised output can't be reversed to the original text.
  • We want to use a secret key to guard against rainbow tables being used, especially for low entropy input data.
  • The secret key should be hardened against being guessed by brute force approaches. A password based key derivation function like PBKDF2 can do this.

We finished last time with a simple select from dual SQL statement that demonstrated using the DBMS_CRYPTO package to run the HMAC algorithm on some text. Included in that statement were calls to utility packages UTL_RAW and UTL_ENCODE to perform type conversions, and to encode the RAW value as a Base64 string.

Using Oracle Data Pump's REMAP_DATA parameter means that the data is anonymised at its source rather than having post-processing steps. As we will see later, these post-processing steps don't necessarily disappear; but performing this first step on the input values themselves is a significant first step, and it is useful to apply it in-line.

Using REMAP_DATA

The REMAP_DATA parameter to Oracle Data Pump's expdp (and impdp) commands allows an individual column to be remapped to different values. The format of the parameter is as follows:

REMAP_DATA=[schema.]tablename.column_name:[schema.]pkg.function

So this tells us we need a PL/SQL package that contains a function. The documentation also says that the returned value from the function has to match the type of the column. Our primary focus here is anonymising personally identifiable information or PII, which in Oracle data type terms means VARCHAR2.

Following from the simple code in the last post, the code for such a function could be as simple as this:

 1  FUNCTION hash_varchar2_to_base64(input IN VARCHAR2)
 2  RETURN VARCHAR2 IS
 3
 4  BEGIN
 5    RETURN(
 6      utl_raw.cast_to_varchar2(
 7        utl_encode.base64_encode(
 8          dbms_crypto.mac(
 9            utl_raw.cast_to_raw(input),
10            dbms_crypto.HMAC_SH1,
11            utl_raw.cast_to_raw('monkey')
12          )
13        )
14      )
15    );
16  END;

PBKDF2 Implementation

What about that key on line 11 above? It doesn't meet out need for a hardened key. To generate a hardened key we're going to use PBKDF2. This requires a password or passphrase, a random salt, a number of iterations to perform, and a desired key length.

 1  FUNCTION pbkdf2(password IN VARCHAR2, salt IN VARCHAR2,
 2                  iterations IN INTEGER, dklen IN INTEGER)
 3  RETURN RAW IS
 4    blocks       INTEGER;
 5    block        RAW(32767);
 6    prf          RAW(32767);
 7    last_prf     RAW(32767);
 8    salt_raw     RAW(32767);
 9    password_raw RAW(32767);
10    key          RAW(32767);
11  BEGIN
12    blocks := ceil(dklen/20); -- HMAC-SHA-1 output is 20 bytes
13    password_raw := utl_raw.cast_to_raw(password);
14    salt_raw := utl_raw.cast_to_raw(salt);
15    FOR i IN 1..blocks
16    LOOP
17      last_prf := dbms_crypto.mac(
18                    utl_raw.concat(
19                      salt_raw,
20                      utl_raw.cast_from_binary_integer(i, utl_raw.big_endian)
21                    ),
22                    dbms_crypto.HMAC_SH1,
23                    password_raw
24                  );
25      block := last_prf;
26      FOR j IN 2..iterations
27      LOOP
28        prf := dbms_crypto.mac(last_prf, dbms_crypto.HMAC_SH1, password_raw);
29        block := utl_raw.bit_xor(block, prf);
30        last_prf := prf;
31      END LOOP;
32      key := utl_raw.concat(key, block);
33    END LOOP;
34    RETURN utl_raw.substr(key, 1, dklen);
35  END;

Performance considerations

Embedding a call to the function above would meet the requirement for a hardened key. But the idea of a key derivation function is that it is intentionally slow. Embedding a call to the PBKDF2 function to form a key based passphrase is going to slow things down a lot because it will recompute the hardened key for each value to be encoded.

Key access by expdp

We also can't use package state variables to embed the key within the package because the package state is part of the session. The expdp/impdp commands have their own session with Oracle so there's no way to set up the key.

One idea could be to pre-compute the key and embed in the code as a Base64 string. This would change line 14 to be something like this (where "<base 64 key>" represents the key):

11            utl_encode.base64_decode(utl_raw.cast_to_raw('<base64 key>'))

That means every time we want to change the key, we have to re-create the package with the key embedded in it.

We can now make things more modular by putting the PBKDF2 function above in its own package. We then declare a constant in the package body of the package containing the hash_varchar2_to_base64 function to hold the key. This means modifying the code at install time which might be useful as the installer has the choice to put the package in a separate schema; this makes the source of the package inaccessible (assuming the DBA_SOURCE view is also kept restricted).

The final version of the data anonymisation package therefore looks something like this:

 1  raw_key CONSTANT RAW(32767) := anonymous_key.pbkdf2(
 2      '<YOUR PASSWORD>', '<YOUR SALT>', 
 3       <YOUR ITERATIONS>, <YOUR DESIRED KEY LENGTH>);
 4
 5  FUNCTION hash_varchar2_to_base64(input IN VARCHAR2)
 6  RETURN VARCHAR2 IS
 7
 8  BEGIN
 9    RETURN(
10      utl_raw.cast_to_varchar2(
11        utl_encode.base64_encode(
12          dbms_crypto.mac(
13            utl_raw.cast_to_raw(input),
14            dbms_crypto.HMAC_SH1,
15            rawkey
16          )
17        )
18      )
19    );
20  END;

No comments: