Friday, 3 May 2019

TOML Syntax Diagrams

I recently discovered TOML - Tom's Obvious, Minimal Language. I really like it. It comes close to JSON's simplicity but adds some features that make it kinder to humans. It also adds date-time values which makes up for a lot.

But I also really like the way JSON is documented on json.org. The syntax diagrams (sometimes called railroad diagrams) provide a concise visual representation of a syntax.

So with an online railroad diagram generator and starting from the ABNF on the TOML GitHub page, I put these together. The result has considerably fewer rules than the ABNF on GitHub. The aim was to reduce the number of diagrams, shorten the output, and make the whole thing easier to grok.

To this end, the whitespace rules in the ABNF have been removed completely here. Instead, the places where whitespace is allowed are described below each diagram.

7th of May, 2019 - Updated with a description of each diagram. Fixed a couple of issues.

Feedback welcome.





TOML:






TOML     ::= ( Key '=' Value | '[' Key ']' | '[[' Key ']]' )? Comment? ( Newline ( Key '=' Value | '[' Key ']' | '[[' Key ']]' )? Comment? )*



A TOML document is a table or dictionary; it consists of key-value pairs and is equivalent to a JSON object. The values themselves can be strings, numbers, booleans, dates, arrays of values, or nested tables.

It is a line-oriented document. This means that apart from multi-line strings and arrays (described later), everything happens on one line. Blank lines are OK and used to improve readability.

There are three main types of lines:
  • Lines that describe key-value pairs. This is the first option above, where key and value are separated by an equals sign.
  • Lines that set a parent key for the key-value pairs that follow. The parent key is set by putting a key in square brackets.
  • Lines that append an element to a table entry that contains an array. The parent key of the array is put in double square brackets.
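
As a rough illustration of these three line types, the Python sketch below classifies each line of a small sample document. The name classify_line and the sample keys are made up for this example, and only unquoted keys are handled, so treat it as a simplification of the diagram rather than a parser.

```python
import re

# Unquoted keys only ([A-Za-z0-9_-]+, optionally dotted). Quoted keys are left
# out to keep the sketch short.
BARE_KEY = r"[A-Za-z0-9_-]+(?:[ \t]*\.[ \t]*[A-Za-z0-9_-]+)*"

# Order matters: '[[' must be tried before '['.
LINE_KINDS = [
    ("array-of-tables", re.compile(rf"^[ \t]*\[\[[ \t]*({BARE_KEY})[ \t]*\]\]")),
    ("table", re.compile(rf"^[ \t]*\[[ \t]*({BARE_KEY})[ \t]*\]")),
    ("key-value", re.compile(rf"^[ \t]*({BARE_KEY})[ \t]*=")),
]


def classify_line(line):
    """Return which of the three line types a line is, or 'other'."""
    for kind, pattern in LINE_KINDS:
        if pattern.match(line):
            return kind
    return "other"  # blank line, comment-only line, or something unsupported


sample = """\
title = "Example"        # key-value pair at the root

[owner]                  # sets the parent key to 'owner'
name = "Tom"

[[servers]]              # appends a table to the 'servers' array
host = "alpha"
"""

for line in sample.splitlines():
    print(f"{classify_line(line):16} {line}")
```

The double-bracket pattern is tried first because a line starting with "[[" would otherwise be mistaken for the single-bracket form.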




Key      ::= ( [A-Za-z0-9#x2D_]+ | QuotedString ) ( '.' ( [A-Za-z0-9#x2D_]+ | QuotedString ) )*




A key is a string containing Unicode characters. In a TOML document, unquoted keys are sequences of upper or lower case letters, digits, dashes or underscores. To contain other characters, a key must be quoted using single or double quotes.

All keys are relative to the current parent key (if it has been set in either of the square-bracket or double-square-bracket forms above), or the root if no parent key has been set yet.

A path down through a series of nested tables is indicated by separating the elements of the path by period characters. A quoted string can be used to include a period in a key itself. Spaces or tabs are allowed either side of the period.
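
To show how a dotted key breaks down into path elements, here is a small Python sketch. The function name split_key is hypothetical, and escape sequences inside double-quoted elements are matched but not decoded, so this is only an approximation of the Key diagram.

```python
import re

# One path element: an unquoted key, a double-quoted string, or a
# single-quoted string, with optional spaces or tabs either side.
PART = re.compile(
    r"""[ \t]*(?:([A-Za-z0-9_-]+)|"((?:[^"\\]|\\.)*)"|'([^']*)')[ \t]*"""
)


def split_key(key):
    """Split a dotted key into its path elements."""
    parts = []
    pos = 0
    while True:
        match = PART.match(key, pos)
        if match is None:
            raise ValueError(f"bad key element at offset {pos}: {key!r}")
        # Exactly one of the three groups is set for a successful match.
        parts.append(next(g for g in match.groups() if g is not None))
        pos = match.end()
        if pos == len(key):
            return parts
        if key[pos] != ".":
            raise ValueError(f"expected '.' at offset {pos}: {key!r}")
        pos += 1  # step over the period


print(split_key("servers.alpha.ip"))
print(split_key('site."google.com".port'))
```

The second call shows the point made above: the period inside the quoted element stays part of that element instead of splitting the path.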





Value    ::= MultilineString

           | QuotedString

           | 'true'

           | 'false'

           | Array

           | InlineTable

           | DateTime

           | Number




As mentioned above, the values themselves can be strings, numbers, booleans, dates, arrays of values, or nested tables of values.

Booleans are simply the text 'true' or 'false'. Other value types are understood by either a token that introduces them, or the format of their data.

The different formats are described below.

Comment:





Comment  ::= '#' [#x0009#x0020-#x10FFFF]*




Comments, as in many scripting languages, start with a hash or pound character, and comments end at the end of a line. A comment can appear either on a line by itself, or at the end of a line of content.

Spaces or tabs are allowed on either side of tokens. This means at the start of a line, and on either side of the equals sign, square brackets, or double square brackets.




Newline  ::= #x000D? #x000A




The document can be Unix or DOS formatted. Lines end in either a newline character or a carriage-return and newline combination.





QuotedString
         ::= '"' ( [#x0020-#x0021#x0023-#x005B#x005D-#x007E#x0080-#x10FFFF] | Escaped )* '"'

           | "'" [#x0009#x0020-#x0026#x0028-#x007E#x0080-#x10FFFF]* "'"




A quoted string is introduced by either single or double quotes and must be closed by the same character on the same line. It can contain Unicode characters, and double-quoted strings support escape sequences using backslash. Single-quoted strings don't support escape sequences and hence can't contain single quotes.

Spaces or tabs within a string are preserved.





MultilineString
         ::= '"""' ( [#x0020-#x005B#x005D-#x007E#x0080-#x10FFFF] | Escaped | '\'? Newline )* '"""'

           | "'''" ( [#x0009#x0020-#x007E#x0080-#x10FFFF] | Newline )* "'''"




Multiline strings, as the name suggests, allow a string to appear over a number of lines. They start with either a triple double quote or a triple single quote, and end when a matching triple appears.

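In the double-quoted form, a backslash at the end of a line consumes that newline and all following newlines, spaces, or tabs up to the next other character.

A newline immediately after the opening token is ignored; other whitespace in the string is preserved.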





Escaped  ::= '\' ( '"' | '\' | 'b' | 'f' | 'n' | 'r' | 't' | 'uXXXX' | 'UXXXXXXXX' )




Escape sequences in double-quoted strings and multiline strings are introduced with a backslash. The set is similar to that found in JSON, except that there is no escape for forward slash. There is also the ability to specify a 32-bit Unicode code point using upper case U and eight hex characters.

No spaces or tabs are allowed after the backslash.


Array:




Array    ::= '[' ArrayComment ( Value ( ',' ArrayComment Value )* ','? ArrayComment )? ']'




Arrays of values are introduced by a left square bracket.

Values in the array are separated by commas, and each comma can be followed by a comment and/or newline. The final value in the array is allowed to be followed by a comma, which is ignored.

Newlines, spaces, or tabs can follow the opening square bracket and precede the closing square bracket.


ArrayComment:





ArrayComment
         ::= ( Comment? Newline )*









InlineTable
         ::= '{' ( Key '=' Value ( ',' Key '=' Value )* ','? )? '}'




An inline table allows a number of key-value pairs in a table to be specified on a single line. It is surrounded by curly braces and the key-value pairs are separated by commas.

Each key is separated from its value by an equals sign.

Spaces or tabs can appear around the curly braces, commas, and equals signs.




DateTime ::= 'YYYY-MM-DD' ( ( 'T' | ' ' ) 'HH:NN:SS' ( '.' [0-9]+ )? ( 'Z' | ( '+' | '-' ) 'HH:NN' )? )?

           | 'HH:NN:SS' ( '.' [0-9]+ )?




DateTime values are either absolute or relative to some unspecified entity, depending on whether a timezone is specified. A time of day value (irrespective of date) can also be specified; in this case it is always 'relative'.

The start of a DateTime or relative time is recognised by there being four digits followed by a hyphen, or two digits followed by a colon. A date can optionally be followed by a time, which is separated from the date by upper case T or a space. Fractional seconds can be given in the time.

The timezone can be given as upper case Z meaning UTC, or as an offset given as plus or minus character and a value in hours and minutes.

No spaces or tabs other than the separator between the date and time are allowed.
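
The shapes described above can be sketched as a Python regular expression. The function name datetime_kind is made up for this example, and the labels it returns are the names the TOML specification uses for the four forms (the 'absolute' form above is the one with an offset). The sketch only checks the shape of the text; it does not validate ranges such as month 13.

```python
import re

DATE = r"[0-9]{4}-[0-9]{2}-[0-9]{2}"
TIME = r"[0-9]{2}:[0-9]{2}:[0-9]{2}(?:\.[0-9]+)?"   # fractional seconds optional
OFFSET = r"Z|[+-][0-9]{2}:[0-9]{2}"

# Group 1: time after a date, group 2: timezone, group 3: time on its own.
DATETIME = re.compile(
    DATE + "(?:[T ](" + TIME + ")(" + OFFSET + ")?)?" + "|(" + TIME + ")"
)


def datetime_kind(text):
    """Name the form of a date/time value, or return None if it has none."""
    match = DATETIME.fullmatch(text)
    if match is None:
        return None
    time, offset, time_only = match.groups()
    if time_only:
        return "local time"
    if offset:
        return "offset date-time"
    if time:
        return "local date-time"
    return "local date"


for text in ["1979-05-27T07:32:00Z", "1979-05-27 07:32:00.999-07:00",
             "1979-05-27T07:32:00", "1979-05-27", "07:32:00"]:
    print(f"{text:32} {datetime_kind(text)}")
```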




Number   ::= ( '+' | '-' )? ( ( '0' | [1-9] ( '_'? [0-9] )* ) ( '.' [0-9] ( '_'? [0-9] )* )? ( 'e' ( '+' | '-' )? [0-9] ( '_'? [0-9] )* )? | 'inf' | 'nan' )

           | '0x' [0-9A-F] ( '_'? [0-9A-F] )*

           | '0o' [0-7] ( '_'? [0-7] )*

           | '0b' [0-1] ( '_'? [0-1] )*




Numeric values are either decimal, hexadecimal, octal, or binary depending on how they start. Decimals start with a digit, a plus character, or a minus character. A decimal value can have an optional fractional part, introduced by a period, and an exponent part introduced by 'e'. The exponent can be positive or negative.

Hexadecimal, octal and binary values are always integers. Decimal values with no fraction or exponent are considered integers. Otherwise they are floating point values. There are also the special floating point values for infinity, given as 'inf', and not-a-number given as 'nan'.

Space and tab characters are not allowed in numbers although, as it is in Ruby, any pair of digits can be separated by an underscore to provide spacing.
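
As a sketch of how the description above maps onto a matcher, here is a Python regular expression that treats the fraction and exponent as optional and sorts values into integers and floats. The name number_kind is hypothetical, and like the diagram it accepts only lower case 'e' and upper case hex digits.

```python
import re

# Digits with at most one underscore between any pair of digits.
DIGITS = r"[0-9](?:_?[0-9])*"

NUMBER = re.compile(
    rf"""
      [+-]? (?:
          (?: 0 | [1-9](?:_?[0-9])* )    # integer part, no leading zeros
          (?: \. {DIGITS} )?              # optional fraction
          (?: e [+-]? {DIGITS} )?         # optional exponent
        | inf | nan
      )
    | 0x [0-9A-F](?:_?[0-9A-F])*          # hexadecimal
    | 0o [0-7](?:_?[0-7])*                # octal
    | 0b [01](?:_?[01])*                  # binary
    """,
    re.VERBOSE,
)


def number_kind(text):
    """Return 'integer', 'float', or None if the text is not a number."""
    if NUMBER.fullmatch(text) is None:
        return None
    body = text.lstrip("+-")
    if body.startswith(("0x", "0o", "0b")):
        return "integer"
    if body in ("inf", "nan") or "." in body or "e" in body:
        return "float"
    return "integer"


for text in ["5", "1_000", "0xDEAD_BEEF", "3.14", "6.02e23", "-inf", "01", "1__0"]:
    print(f"{text:12} {number_kind(text)}")
```

The prefix check comes before the search for 'e' so that hexadecimal values are never mistaken for floats.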





 

... generated by Railroad Diagram Generator

The following is the source EBNF I gave to the Railroad Diagram Generator.


TOML ::= ( 
  ( Key '=' Value | '[' Key ']' | '[[' Key ']]') )?
  Comment? 
  ( Newline ( 
    ( Key '=' Value | '[' Key ']' | '[[' Key ']]') )? Comment?
  )*

Key ::= ( ( [A-Za-z0-9-_] )+ | QuotedString ) ( '.' Key )?

Value ::= MultilineString | QuotedString | 'true' | 'false'
        | Array | InlineTable | DateTime | Number

Comment ::= '#' [#x0009#x0020-#x10FFFF]*

Newline ::= ( #x0D )? #x0A 

QuotedString ::= ( '"' ( [#x20-#x21#x23-#x5B#x5D-#x7E#x80-#x10FFFF] | Escaped )* '"' ) |
( "'" ( [#x09#x20-#x26#x28-#x7E#x80-#x10FFFF] )* "'" )

MultilineString ::= ( '"""' ( ( [#x20-#x5B#x5D-#x7E#x80-#x10FFFF] | Escaped ) | Newline | ( '\' Newline ) )* '"""' ) |
( "'''" ( [#x09#x20-#x7E#x80-#x10FFFF] | Newline )* "'''" )

Escaped ::= '\' ( '"' | '\' | 'b' | 'f' | 'n' | 'r' | 't' | 'uXXXX' | 'UXXXXXXXX' )

Array ::= '[' ( ArrayVals )? ArrayComment ']' 
ArrayVals ::= ( ArrayComment Value ',' ArrayVals?  ) | ( ArrayComment Value )
ArrayComment ::= ( ( Comment )? Newline )*

InlineTable ::= '{' ( InlineTableKeyvals )?  '}'
InlineTableKeyvals ::= Key '=' Value ( ',' InlineTableKeyvals? )?

DateTime ::= 'YYYY-MM-DD' ( ( 'T' | ' ' ) 'HH:NN:SS' ( '.' [0-9]+ )? ( 'Z' | ( '+' | '-' ) 'HH:NN' )? )?
           | 'HH:NN:SS' ( '.' [0-9]+ )?

Number ::= 
( '+' | '-' )? (
    ( '0' | [1-9] ( '_'? [0-9] )* )
    ( '.' [0-9] ( '_'? [0-9] )* )?
    ( 'e' ( '+' | '-' )? [0-9] ( '_'? [0-9] )* )?
  |
 'inf' |
 'nan' ) |
'0x'  [0-9A-F] ( '_'? [0-9A-F] )* |
'0o' [0-7] ( '_'? [0-7] )* |
'0b' [0-1] ( '_'? [0-1] )*

Monday, 27 August 2018

Using Oracle Data Pump for Data Science - Pt. 2

Recap and Introduction

In the last post I laid out the general ideas behind the approach to be taken for anonymising data as it is extracted from an Oracle database. The use of HMAC provides a convenient way of using a hardened secret key to consistently apply a secure one way hashing algorithm to input data and get an anonymised output - i.e. one that can't be linked to its original value.

  • We want it to be consistent so that we can get the same result for a given input. Randomising anonymises but it also destroys the data.
  • We want a secure one-way hashing algorithm like SHA-1 so that the anonymised output can't be reversed to the original text.
  • We want to use a secret key to guard against rainbow tables being used, especially for low entropy input data.
  • The secret key should be hardened against being guessed by brute force approaches. A password based key derivation function like PBKDF2 can do this.

We finished last time with a simple select from dual SQL statement that demonstrated using the DBMS_CRYPTO package to run the HMAC algorithm on some text. Included in that statement were calls to utility packages UTL_RAW and UTL_ENCODE to perform type conversions, and to encode the RAW value as a Base64 string.

Using Oracle Data Pump's REMAP_DATA parameter means that the data is anonymised at its source rather than in post-processing steps. As we will see later, these post-processing steps don't necessarily disappear; but anonymising the input values themselves is a significant first step, and it is useful to apply it in-line.

Using REMAP_DATA

The REMAP_DATA parameter to Oracle Data Pump's expdp (and impdp) commands allows an individual column to be remapped to different values. The format of the parameter is as follows:

REMAP_DATA=[schema.]tablename.column_name:[schema.]pkg.function

So this tells us we need a PL/SQL package that contains a function. The documentation also says that the returned value from the function has to match the type of the column. Our primary focus here is anonymising personally identifiable information or PII, which in Oracle data type terms means VARCHAR2.

Following from the simple code in the last post, the code for such a function could be as simple as this:

 1  FUNCTION hash_varchar2_to_base64(input IN VARCHAR2)
 2  RETURN VARCHAR2 IS
 3
 4  BEGIN
 5    RETURN(
 6      utl_raw.cast_to_varchar2(
 7        utl_encode.base64_encode(
 8          dbms_crypto.mac(
 9            utl_raw.cast_to_raw(input),
10            dbms_crypto.HMAC_SH1,
11            utl_raw.cast_to_raw('monkey')
12          )
13        )
14      )
15    );
16  END;

PBKDF2 Implementation

What about that key on line 11 above? It doesn't meet our need for a hardened key. To generate a hardened key we're going to use PBKDF2. This requires a password or passphrase, a random salt, a number of iterations to perform, and a desired key length.

FUNCTION pbkdf2(password IN VARCHAR2, salt IN VARCHAR2,
                iterations IN INTEGER, dklen IN INTEGER)
RETURN RAW IS
  blocks       INTEGER;
  block        RAW(32767);
  prf          RAW(32767);
  last_prf     RAW(32767);
  salt_raw     RAW(32767);
  password_raw RAW(32767);
  key          RAW(32767);
BEGIN
  blocks := ceil(dklen/20); -- HMAC-SHA-1 output is 20 bytes
  password_raw := utl_raw.cast_to_raw(password);
  salt_raw := utl_raw.cast_to_raw(salt);
  FOR i IN 1..blocks
  LOOP
    last_prf := dbms_crypto.mac(
                  utl_raw.concat(
                    salt_raw,
                    utl_raw.cast_from_binary_integer(i, utl_raw.big_endian)
                  ),
                  dbms_crypto.HMAC_SH1,
                  password_raw
                );
    block := last_prf;
    FOR j IN 2..iterations
    LOOP
      prf := dbms_crypto.mac(last_prf, dbms_crypto.HMAC_SH1, password_raw);
      block := utl_raw.bit_xor(block, prf);
      last_prf := prf;
    END LOOP;
    key := utl_raw.concat(key, block);
  END LOOP;
  RETURN utl_raw.substr(key, 1, dklen);
END;
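
For anyone wanting to check a key outside the database, the same loop structure can be sketched in Python and compared against the standard library's own PBKDF2. The function name pbkdf2_sha1 is made up for this example, and the password, salt and iteration count are placeholder values.

```python
import hashlib
import hmac
import struct


def pbkdf2_sha1(password, salt, iterations, dklen):
    """PBKDF2 with HMAC-SHA-1, written to mirror the PL/SQL loop structure."""
    blocks = -(-dklen // 20)  # ceiling division; HMAC-SHA-1 output is 20 bytes
    key = b""
    for i in range(1, blocks + 1):
        # First round: HMAC over salt || 4-byte big-endian block index.
        last = hmac.new(password, salt + struct.pack(">I", i), hashlib.sha1).digest()
        block = int.from_bytes(last, "big")
        # Remaining rounds: HMAC over the previous output, XORed into the block.
        for _ in range(2, iterations + 1):
            last = hmac.new(password, last, hashlib.sha1).digest()
            block ^= int.from_bytes(last, "big")
        key += block.to_bytes(20, "big")
    return key[:dklen]


# Placeholder inputs, for illustration only.
ours = pbkdf2_sha1(b"correct horse battery staple", b"some salt", 1000, 32)
theirs = hashlib.pbkdf2_hmac("sha1", b"correct horse battery staple", b"some salt", 1000, 32)
print(ours == theirs)
```

Agreement with hashlib.pbkdf2_hmac only says something about this Python version; comparing its output with the PL/SQL function's output for the same inputs would be a separate check.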

Performance considerations

Embedding a call to the function above would meet the requirement for a hardened key. But the idea of a key derivation function is that it is intentionally slow. Calling the PBKDF2 function inside the remap function to derive the key from a passphrase is going to slow things down a lot because it will recompute the hardened key for each value to be encoded.

Key access by expdp

We also can't use package state variables to embed the key within the package because the package state is part of the session. The expdp/impdp commands have their own session with Oracle so there's no way to set up the key.

One idea could be to pre-compute the key and embed it in the code as a Base64 string. This would change line 11 to be something like this (where "<base64 key>" represents the key):

11            utl_encode.base64_decode(utl_raw.cast_to_raw('<base64 key>'))

That means every time we want to change the key, we have to re-create the package with the key embedded in it.

We can now make things more modular by putting the PBKDF2 function above in its own package. We then declare a constant in the package body of the package containing the hash_varchar2_to_base64 function to hold the key. This means modifying the code at install time which might be useful as the installer has the choice to put the package in a separate schema; this makes the source of the package inaccessible (assuming the DBA_SOURCE view is also kept restricted).

The final version of the data anonymisation package therefore looks something like this:

raw_key CONSTANT RAW(32767) := anonymous_key.pbkdf2(
    '<YOUR PASSWORD>', '<YOUR SALT>',
     <YOUR ITERATIONS>, <YOUR DESIRED KEY LENGTH>);

FUNCTION hash_varchar2_to_base64(input IN VARCHAR2)
RETURN VARCHAR2 IS

BEGIN
  RETURN(
    utl_raw.cast_to_varchar2(
      utl_encode.base64_encode(
        dbms_crypto.mac(
          utl_raw.cast_to_raw(input),
          dbms_crypto.HMAC_SH1,
          raw_key
        )
      )
    )
  );
END;

Monday, 2 July 2018

Using Oracle Data Pump for Data Science

Introduction

I'm getting some data for data analytics.  The supplier of the data wants to anonymise it before giving it to me.  The best way I know of doing that (without having a big lookup table) is using a secure hash function.  Secure hash functions are used to scramble data and have a set of properties that make it impractical to reverse the scrambling.  This is different to encryption which is designed to allow the encryption operation to be reversed (if the encryption key is known).

So, simply applying a secure hashing function to data would seem to be a good way of obscuring the original, right?  Well, while it isn't possible to reverse hashing, it turns out that with the speed of modern computers, when the size of the input is fairly small, it is possible to generate a lookup table of every possible input-to-hash pair.  Consider telephone numbers; they're fairly short - 10 digits in the US which is about 33 bits of entropy - and a lookup table could easily be generated once and stored.  This effectively renders the hash ineffective because it can be reversed through the lookup table.
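
To make the lookup-table problem concrete, here is a small Python sketch. It uses four-digit PINs as a stand-in for phone numbers so that it runs instantly; the names sha1_hex and lookup are made up for the example. The same idea scales to ten-digit numbers given more time and storage.

```python
import hashlib
import math


def sha1_hex(text):
    """Plain, unkeyed SHA-1 of a string, as a hex digest."""
    return hashlib.sha1(text.encode("ascii")).hexdigest()


# Precompute every possible input once: hash -> original value.
lookup = {sha1_hex(f"{n:04d}"): f"{n:04d}" for n in range(10_000)}

# Later, any "anonymised" value can be reversed with a dictionary lookup.
anonymised = sha1_hex("0427")
print(lookup[anonymised])

# Entropy of a 10-digit phone number, in bits (a little over 33).
print(math.log2(10**10))
```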

Password storage mechanisms get around this problem by generating and storing a nonce - a number used once - for every password stored. The lookup table is rendered useless and it's back to brute forcing every input to hash value.  As mentioned above, modern computers are fast so password databases combine the nonce approach with a repeated application of the hashing algorithm.  This increases the difficulty according to the number of iterations.

This works for storing passwords; there will only be one entry in the database for the user password, and its nonce can make it separate from every other. For relational data, there are many rows that relate to each other by being equal to one another.  The fact that they are equal allows us to join the dots and build machine learning models of the data. Putting a random nonce with every row in a relational database destroys the relationship it has with any other data. The machine learning algorithms need the relationships between data points, so destroying these relationships is a very bad thing.

Therefore the technical requirement is that there be a consistent way to scramble the data so that the relationships are preserved. This basically means that instead of having a nonce for every individual data point, we have a secret key that is used for all of them. Thus the same value will be hashed in the same way regardless of where it is found, and the relationships between the values are preserved. Note that, if you're going to be getting multiple dumps of data from your source, for the values to hash to the same value (thereby maintaining their relationships), you need to use the same secret key each time.
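
The difference between a per-row nonce and a shared secret key can be seen in a short Python sketch. The call records, the key and the function names salted and keyed are all invented for illustration, and the key here is deliberately not a hardened one.

```python
import hashlib
import hmac
import os

# Invented call records: (caller, called). One caller appears twice.
calls = [("555-0100", "555-0199"), ("555-0100", "555-0142"), ("555-0199", "555-0100")]

SECRET_KEY = b"example key, not hardened"  # placeholder only


def salted(value):
    """Hash with a fresh random nonce each time, as a password store would."""
    nonce = os.urandom(16)
    return hashlib.sha1(nonce + value.encode("ascii")).hexdigest()


def keyed(value):
    """Hash with one secret key shared by every value."""
    return hmac.new(SECRET_KEY, value.encode("ascii"), hashlib.sha1).hexdigest()


salted_callers = {salted(caller) for caller, _ in calls}
keyed_callers = {keyed(caller) for caller, _ in calls}

# With nonces every row looks like a different caller; with a shared key
# the two calls from the same number still line up.
print(len(salted_callers), len(keyed_callers))
```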

Why not have a lookup table?

It's worth exploring this idea a bit before we dismiss it. It's entirely possible that we could have a lookup table that maps values to an anonymous version. That means changing the system to accommodate the needs of anonymisation. It also means that the mechanism for doing so has to be done carefully so that it doesn't leak information. For example, replacing values with a numeric primary key that is looked up elsewhere could provide information about when the subscriber joined.

It remains to be determined how user generated values should be anonymised. Thinking about our phone numbers again, a system that contains information about telephone subscribers and the calls they make could anonymise the caller by replacing the caller's number with their primary key in the subscribers table through a lookup. The called number should also be anonymised, and the same approach could be taken if they're also a subscriber; but, what if they aren't? And what if they become a subscriber at a later date?

A word on Hash collisions

The secure hash function produces unpredictable (but deterministic) output for a given input. They're designed to produce different outputs for different inputs but because they effectively compress their input down to something smaller (160 bits in the case of SHA-1), there will be cases where different inputs generate the same output value from the hash function.

In the case of our telephone numbers, we're talking 10^10 different numbers. If we're using SHA-1, each input telephone number will produce one of 2^160 or 1.46x10^48 output values from the hash function. The probability of one number hashing to the same as another is going to be very small.
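
As a back-of-the-envelope check on "very small", the chance that any two of the 10^10 numbers collide can be bounded by the number of pairs divided by the number of possible outputs. The Python sketch below does that arithmetic; it assumes the hash outputs are uniformly spread, which is an idealisation.

```python
from fractions import Fraction

numbers = 10**10          # possible 10-digit phone numbers
outputs = 2**160          # possible SHA-1 outputs

# Number of distinct pairs of inputs that could collide.
pairs = numbers * (numbers - 1) // 2

# Union bound: P(at least one collision) <= pairs / outputs.
p_upper = Fraction(pairs, outputs)
print(f"{float(p_upper):.2e}")   # about 3.4e-29
```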

HMAC - Hash-based Message Authentication Code

Using HMAC (RFC2104) provides a convenient way to combine a value with a key. In its role as a message authentication code (MAC), it has been designed to stop an attacker being able to find two messages that have the same MAC. This means that messages can't be forged by the attacker. Note that we don't have that requirement in this case and we could simply extend the hash input by the key, as in MAC = H(input || key).

Note that certain cryptographic hash functions have been weakened over recent years through cryptanalysis. Hash functions like MD4 and MD5 are now considered broken. The HMAC construction is less exposed to these collision attacks than the bare hash function. It may also be more convenient to use HMAC given the availability of implementations.

Password Based Key Derivation Functions

For anonymising the data, whether it is a simple hash with the input extended by a key or the HMAC is used with the key, the strength of the key is what determines how well the data is protected. Just using a simple password (e.g. "monkey") as the key doesn't provide much protection because of the existence of password databases and heuristics based approaches for cracking passwords.

Passphrases are better, like the famous XKCD example of "correct horse battery staple", because they have a higher entropy. But more is better, and because secure hashing algorithms pad their input to a whole number of 512-bit blocks, for the typical database table column (e.g. holding a phone number) you can add a very large key with no penalty in performance.

Algorithms such as PBKDF2 and scrypt have been designed to make the generation of high entropy keys easy, yet computationally expensive to brute force. Used with a passphrase such as that above, the practicality of brute forcing the passphrase and/or key diminishes greatly.

Oracle Data Pump

Oracle provides the expdp and impdp tools for efficiently exporting and importing data between databases. In order that my data supplier can anonymise data before they give it to me, I want to set up a parameter file that does the hashing of the data as it is exported. The expdp tool supports the REMAP_DATA parameter, in which you tell it the table, column, and a function to use for remapping it.

Oracle also provides the packages required to do the hashing and necessary conversions between data types - specifically DBMS_CRYPTO and UTL_RAW. The following example shows how to run HMAC on an input "blah" with a key "monkey". The key will ultimately need to be replaced with a suitably hardened value. The HMAC function outputs a binary value, so it is converted to Base64 as shown - this is required for Oracle Data Pump as the result of the remap must be the same data type as the original column's data type; the UTL_ENCODE package does this.

select utl_raw.cast_to_varchar2(
  utl_encode.base64_encode(
    dbms_crypto.mac(
      utl_raw.cast_to_raw('blah'),
      dbms_crypto.HMAC_SH1,
      utl_raw.cast_to_raw('monkey')
    )
  )
)
from dual
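
For comparison outside the database, the same chain of steps can be sketched in Python with the standard library. The function name hash_to_base64 is made up here, and the sketch assumes the text is encoded as UTF-8, which may or may not match the database character set used by UTL_RAW.CAST_TO_RAW.

```python
import base64
import hashlib
import hmac


def hash_to_base64(value, key):
    """HMAC-SHA-1 of a string under the given key, encoded as Base64 text."""
    digest = hmac.new(key, value.encode("utf-8"), hashlib.sha1).digest()
    return base64.b64encode(digest).decode("ascii")


token = hash_to_base64("blah", b"monkey")
print(token, len(token))
```

Whatever the input, the 20-byte digest always encodes to 28 Base64 characters, so a remapped column needs to be at least that wide.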

What remains is to implement the simple code in the select above into a PL/SQL package that can be called from the REMAP_DATA parameter.

Sunday, 24 August 2014

Media Centre in Galway, Ireland

Introduction

I've been progressing my media centre setup at home.  This is something I've been wanting to get to since we moved into our new home.  As we built it, the house was cabled with a certain media centre architecture in mind.


  • No silly co-ax based distribution;
  • Just one or two Cat5e cables from a central point to every room where the bits and bytes that make up our digital media might be consumed;
  • Co-axial cables going from the same central point up to TV/satellite antennae on the roof;
  • And something in the middle to make it work together.


See, this is to be a client-server setup, distributed using TCP/UDP.


I was under time pressure to make decisions about this stuff as we built the house, so the "something in the middle" was a bit undefined.  I had built MythTV systems in the past (for my dad and myself), and I really liked the idea of being client-server.  Once you start costing out everything in a MythTV based setup, it quickly gets pretty expensive.


So I've been holding off.


In the meantime, the cost of computing has been tumbling.  New powerful yet low power consumption devices are being released, and satellite distribution over IP is burgeoning.  In the last month, all the pieces look like they've finally fallen into place.

MythTV

Media centres have interested me for a long time.  I first started thinking about these things back in the mid to late '90s when companies like Tiny in the UK were selling PCs with TV tuners in them.  As I recall it was a bit of a hack at the time because I think the tuner card just carved out a part of the VGA signal and overlaid the picture.  

Fast forward a few years and the idea of pausing "live TV" started coming to market through devices like those from Tivo.  In '04 or '05 a friend back home in Australia put me on to the MythTV project.  So I built one and it was a lot of fun.  I started from a Debian 'net install' and got something running on an old Pentium III that a friend had donated to the cause.  It had a nice Hauppauge PVR-350 for tuning the analogue cable TV and encoding/decoding MPEG-2, a 400GB HDD.  It was really cool.

But it was also noisy, so I moved to a quieter SFF machine, quieter PSU, etc.  After some time, it fell into disrepair for want of maintenance, upgrades etc.  There was also an annoying bit of interference on the PVR-350's encoder/decoder; I thought maybe a power fluctuation caused by the HDD affecting the ADC/DAC?

I also built a MythTV box for my Dad in Australia.  There we had 720p/1080i over DVB-T to play with, so I had to size it for HD decoding.  This meant either a really fast and expensive CPU or trying to do GPU decoding - I went for the latter because nVidia GeForce 5200 cards were supposedly up to it and relatively cheap.  For a tuner I had access to the DViCo Fusion HDTV range of DVB-T tuners, and someone at UofQ had done the patches needed to get it working under V4L.

The project for Dad actually started on an old AMD K6 based PC that he had lying around.  I thought it might have worked but the Linux kernel drivers I was trying to use wouldn't work on that thing.  So we went for a whole new machine built from parts.  By this stage Mythbuntu was available, so we had an easier time installing the software.

Not living in Australia anymore, I wasn't able to keep it running for Dad, but I think he had fun keeping it going as much as he could.  It fried a couple of PSUs (failing fans), which he replaced.  It did get an upgraded graphics card at one stage, a passively cooled nVidia 7x00 thing, and that improved things a bit.  Dad periodically goes back to it and messes around, installs the latest Mythbuntu etc.  I think he sometimes uses it to record TV programs for the family which he then burns to disc.

The MythTV project moved to using the new VDPAU drivers some time ago and away from the xvmc drivers.  Unfortunately getting a VDPAU capable graphics card was going to mean a whole new machine so his MythTV setup is no longer in active service.

Farewell MythTV

When it came to the new house, I just wasn't liking the price of putting together a MythTV setup.  The cost of the PCI card satellite tuners alone was adding up.  I then discovered Sat>IP where a bunch of companies were already selling hardware with four tuners, and would stream over gigabit ethernet.  But there didn't seem to be any support in MythTV...

I'm also fairly time-poor these days, so the idea of giving time to installing and maintaining it has been putting me off.  We got a DVR from our cable TV provider quite some time ago so the immediate need to set up MythTV at home has gone away.  Having had the DVR from our provider for some time now I've realised how immature MythTV actually is.  This DVR upgrades itself, automatically, without anybody having to come to the house.  That's really useful.



The front end hardware for a MythTV setup wasn't looking so cheap either, although I've had my eye on Zotac's ZBox range for a while.  It's a shame there hasn't been a Raspberry Pi port because that is super cheap.  XBMC is very attractive from a cost point of view because it seems quite well supported on Raspberry Pi.

Having done so much with MythTV I was mostly ignoring XBMC - just checking in on it now and again.  Also, for a long time it didn't support live TV which is a critical requirement.  It was purely a media centre application designed to play/stream content over IP.  That's still the case mostly, but now it *does* include live TV through a series of "PVR Add-ons".  It's actually very well integrated.


I recently found that the tvheadend project (one of XBMC's PVR backends) has added support for the Sat>IP protocol as of April 2014.  So another piece of the puzzle now fits.  The tvheadend server also can do the job of fetching the EPG, controlling the tuners, putting recordings on disk, and streaming the video to XBMC over TCP/IP.


With tvheadend supporting Sat>IP and XBMC supporting Live TV in place, I now feel I can actually move forward with this project.

Getting Started

I bought a Raspberry Pi (RPi) and have installed XBMC on it.  Actually what I installed was OpenELEC, which is a stripped down Linux OS with XBMC pre-installed.  It runs off a 32GB Micro SD card using 4W of power.

At the moment it's all running in the RPi - I have a borrowed USB DVB-T tuner - and I've been able to watch Irish digital TV in HD.  When I consider all the work I did to get the original MythTV setup running in '05, it was amazingly easy.  Something else - OpenELEC can update itself in situ.  Fantastic stuff.

RF Stage

The next phase is to install a proper TV antenna and a satellite dish.  A lot of the TV we pay for on cable plus a bunch of channels besides is actually available for free on satellite.  Some are in HD too (something we'd have to pay extra for on our cable TV subscription).  I haven't picked a TV antenna just yet.  We're less than 3 km from a 250W repeater and I have direct line of sight, so I don't think there's anything too fancy required there.  The satellite dish will have a quad LNB which means it can tune to four different satellite frequencies at once.

The LNB also acts as the IF stage, downconverting the 10~12 GHz from the satellite to something that can run over cheap co-ax.  Usually the LNB is connected to a set-top-box but in my case it will go into a Sat>IP server which streams the DVB-S/DVB-S2 over IP.


Tuning the terrestrial transmissions is likely to be done using DVB-T USB sticks.  More on this later.

Clients and Servers

The end-goal will be to split the installation into separate frontend (XBMC) and backend (tvheadend).  The backend server is going to be an HP N54L micro server that I bought a little while back and should be good for housing hard drives and running tvheadend.  It will have the DVB-T tuners connected (probably two) and an ethernet connection to the Sat>IP server.

Then I can run OpenELEC/XBMC on one or more RPi(s) as clients.  XBMC is also available for the Mac and Windows.  iOS apps as clients are the next thing of course, although I don't see it as something that would be heavily used.

Pi

The RPi and what's behind it is a really interesting part of the story.  You've probably heard of them?  As I recall, what was behind the project was that an engineer at Broadcom (Eben Upton) identified a surplus of a particular ARM system-on-chip (SoC) they had, and proposed that they design and build a small, very cheap system that was easy for kids to get started with.

The goal of the project was to trigger the next generation of British computer engineers; he wasn't happy with the state of computer education in schools which seems to revolve around teaching kids MS Office applications and I think this is a theme repeated in Ireland (and Australia I believe?).

The price he originally committed to was something like 25 pounds.  To achieve that they put in fairly cheap power management components, removed ethernet, etc to produce a Model A.  But it had USB, a 700MHz processor and 256MB of RAM.  There was also Model B which included the ethernet and double the RAM.

The power management design meant that many have found that if too much current was drawn by a USB device, the system would shut down all the time.  But it boots from an SD card, has HDMI and composite video and stereo audio outputs, a bunch of general purpose input/output (GPIO) ports, an optional camera.  It's a great concept and has inspired a lot of little projects as well as other similar devices.


Now there's a Model B+ which has switched from SD to micro SD, fixed the power supply issues, added more USB ports, improved the audio output; the B+ is what I've bought.

GPU Acceleration

The Broadcom SoC actually has a GPU built in; in that sense I guess it's basically like the SoC you'd find in a smartphone.  The GPU can encode/decode H.264 video when you get the RPi out of the box, something I find amazing considering what little the graphics cards we bought for your MythTV could do.

They've now released programming interfaces for the RPi's GPU so there are actually lots of things that can be done with it, as seen in the sample code they've released.

The generation of graphics card(s) I got for Dad's setup could only support MPEG-2 through xvmc acceleration that MythTV could use at the time.  Since then the graphics card manufacturers have made the video decoding libraries available for Linux.  But in order to support H.264 we were looking at a complete hardware replacement because this was only available for later model graphics cards which only fit PCI-E slots - not the AGP we started with.

MythTV has long since dropped support for xvmc.  Dad is getting on OK without GPU support for H.264, but it could be better with VDPAU.

Broadcast Standards

In Australia DVB-T in HD is using MPEG-2.  This is actually quite wasteful, and it means that to actually be able to do an HD stream and an SD stream (or two) in a single multiplex they have to compress the video quite hard.  I find sports on the SD channels quite unwatchable because so much detail is lost.

In Ireland, we're using the Nordig standard, and SD/HD H.264 is being broadcast over DVB-T rather than DVB-T2.  So it's a much more efficient codec and better use of the spectrum.  DVB-T2 might have been more efficient again from a symbol coding efficiency perspective, but it being DVB-T is great for me because the DVB-T tuners are a lot cheaper than the DVB-T2 tuners.


I can also very cheaply get licenses for the RPi to decode MPEG-2 (and VC-1) which is good because a lot of the standard definition channels off the satellite are encoded using MPEG-2.  An alternative to that might be to get an Elgato EyeTV Netstream 4Sat as the Sat>IP server because it includes hardware transcoding to H.264.  But it's a good bit more expensive too.

XBMC

Currently the RPi is connected to our 42" LCD and the picture from it is great.  I get full 1080p from the little thing.  The UI of XBMC is also pretty nice, and there are even smartphone apps available to control it.  XBMC supports Airplay too.

To be honest, in terms of quality it's not as good as something like an AppleTV; not by a long chalk.  I guess the little 700MHz CPU gets over taxed at times.


XBMC is awesome in the number of add-ons it has (I have one for ABC Radio National installed), but I find the add-ons slow to fetch/process their data.  That said, once the streams get going it seems rock-solid.  One downside is that the TV is being broadcast interlaced.  We had this issue on MythTV as well I think.  I can turn on deinterlacing but I'm not sure how well the RPi is going to handle it.


I will have to wait and see once I get an antenna up.  I may yet be getting more powerful hardware to use as a frontend!

Anyway I will keep you posted!

Wednesday, 24 December 2008

My Skypephone from 3


I've been a prepaid mobile subscriber for years now.  I've been moving around between countries a fair bit, so it made sense from a cost point of view; I didn't want to be paying enormous roaming charges.  This meant a number of other things:
  • Pre-loved phones because otherwise, the phone would be SIM-locked,
  • Manual top-ups rather than just paying at the end of the month,
  • Relatively expensive calls (e.g. as much as €2.50 per minute to call the folks back home),
  • Multiple numbers, one for each country,
  • Expiring numbers, if I didn't return in time to top-up again & renew,
  • Difficulty managing phone-books (if held in the SIM).
Although I love tech, I've resisted getting new phones, and have satisfied my interest through reading about it, and getting those pre-loved phones from family.  Of the phones I've had, my favourite was a Sony-Ericsson T630.  The form factor was so right that just holding the phone in hand was a pleasure.  The interface was easy to navigate and, although different to the Nokia phones I had had before it, easy to learn.  It was a sad day when it was stolen from me.

My least favourite phone has been my latest, a Samsung SGH-E720.  Although it has quite a nice menu structure, it has such poor audio performance that calls were to be avoided until an alternative was available.  The speaker was often muffled, and similarly the microphone made it difficult for callers to understand me.  I had a Motorola RAZR V3 as a second phone from work.  Motorola are not known for their user interfaces and I have to say the RAZR had the worst menu structure of any mobile I've ever used.  It was however a great performer in the audio quality stakes.  I would often swap my SIM from the Samsung to the Moto just to make a single call.

I'm also a big fan of Skype.  Their software is great because the interface is so easy to use; by way of example, I talked my mum through installing it and setting up an account for herself, and I did it from 20,000 miles away over an IM connection.  I almost love it to a fault because I may insist that you have it to contact me! :-)

So when the 3G carrier 3 introduced their Skypephone, I was at last interested in buying a new phone again.  The prospect of unlimited calls to and from my Skype contacts for free is an amazing offer.  Also, when roaming I'm often in a network covered by 3's networks, and roaming is effectively free on those networks.  So after not having bought a new phone for ten years, I had found a phone deal worth having and I got one.

The first thing you notice about the Skypephone is how small it is.  You know that little pocket on a pair of jeans that you can put coins (or your iPod Nano) into?  I often put my Skypephone in there - it's that small.  

The second thing you notice is the short battery life, which I'm sure is in no small part due to the small size.  I don't make a lot of calls, but the phone can struggle to get through the second day after an overnight charge.  But I've had the phone now for quite a few months, and the short battery life is something I've gotten used to.  I just make sure to charge it every night.  The connector on the bottom is also a fairly standard mini-USB port, and it can charge from any computer.  So in today's world a battery top-up is never far away.

This mini-USB port also means that the PC connection is simple, using common USB cables that we all have a million of at this stage; no specialist cables to lose.  Unfortunately the included PC-Sync software leaves something to be desired.  I mean, it works, but the UI is a complete mess.  And sadly there's no Mac version of the PC-Sync software.  But when connected to a PC or Mac, the phone asks you if you want PC-Sync or a USB Mass Storage device.  This latter works fine with my Mac, so I am able to access the supplied 512MB micro-SD card and load it with music and pod-casts.

The audio quality in calls isn't bad... not as good as the Motorola RAZR mentioned above, but more than acceptable.  I have had complaints from callers about background noise.  I guess the microphone in the phone isn't set up quite right.  The saving grace is that the microphone in the headphones cum hands-free is much better.  While on the headphones, they're not too bad, either in calls or while using the media-player.  The headphones/hands-free also connect to the same mini-USB port as is used for charging and PC connection, and it auto-detects what's been connected - sounds cool, eh?  Well...

That's not to say I've not had issues with the phone.  Within a couple of months of buying the thing, the universal mini-USB started acting up.  With nothing connected at all, with every button press, or incoming call, or what-have-you, the phone would think that the headphones had been inserted (and pop up the two second dialog box to tell you).  This made the phone almost unusable.  Thankfully the kind folk at 3 repaired it free of charge.

What else...?

The UI to the phone is quite good, with a simple layout, logical menus, and easy to understand functionality.  The keys are quite responsive, although very small.  I put this small key size down to the small size of the phone, and I initially didn't like it; I've long gotten used to it however.  The apps in the phone are simple enough.  Occasionally, when I'm bored, I'll surf the free content around the 3 portal.  The phone seems quite responsive when looking at these pages; a far sight better than the original phones 3 were selling.  Actually, the Symbian phones I've seen still seem quite slow.  Sadly the low resolution screen makes general internet browsing pointless.  The phone can also be used as a Bluetooth modem, but it's only UMTS (not HSDPA), so not as fast as it could be.

I use the media-player application fairly regularly.  It seems that the media-player understands ID3 tags on MP3 files, sort of... annoyingly, when playing an album it orders the songs alphabetically by the track name in the ID3 tag (no, not by filename, or by ID3 track number).  I could probably write a script to modify the ID3 tags on the phone to work around this, but couldn't they get it right?  Oh well...
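For what it's worth, the workaround I have in mind is to prefix each title with its zero-padded track number, so that the alphabetical sort happens to give the album order.  A rough sketch of just the renaming logic; the Track struct and the sample album are made up, and a real script would need an ID3 library to read and write the tags on the card, which I've left out:

```ruby
# Hypothetical in-memory stand-in for the ID3 fields of each file.
Track = Struct.new(:number, :title)

# Prefix every title with its zero-padded track number so that sorting
# alphabetically by title reproduces the album order.
def prefix_titles(tracks)
  width = [tracks.map(&:number).max.to_s.length, 2].max
  tracks.map do |t|
    Track.new(t.number, format("%0#{width}d %s", t.number, t.title))
  end
end

album = [
  Track.new(1, "Zebra"),
  Track.new(2, "Apple"),
  Track.new(10, "Mango")
]

fixed = prefix_titles(album)

# Sort the way the phone does (by title) and show the resulting order.
puts fixed.sort_by(&:title).map(&:title)
```

The zero padding matters: without it, track 10 would sort ahead of track 2.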

And then there's the Skype functionality which is quite central to this phone - it's the "Skypephone" after all.  Unlike other X-Series phones from 3 where Skype has to be started every time, the Skype application is always running.  Also rather than using Fring (as on X-Series), the Skypephone uses iSkoot.  Both of these are mobile applications for making Skype calls, but the big difference is that iSkoot uses normal circuit switched calls to make the connections, rather than the data connection made by Fring.  I think this makes the Skype calling experience much better in the Skypephone.

When in its idle mode, the central button labelled with the Skype logo does just what you'd expect - it brings you straight into the already running Skype application.

The iSkoot Skype application is not without its annoyances.  I've found that it's a less than perfect implementation of the Skype chat protocol, and where normally I'd get two copies of messages (one on each of PC and Skypephone), sometimes the phone seems to "eat" messages such that the PC never gets them.  And when I cross in and out of 3 network coverage, I've found that there are times when the only way to get the Skype application connected again is to reboot the phone.

Despite all this, I've saved money and been able to re-connect with people back home because I'm no longer afraid of how much a call is going to cost.  In the first month of owning the phone, I made three hours of Skype calls in various 3 networks, all free!  So despite its issues, I love my Skypephone.

The future is now...

The Skypephone is produced by Amoi in China, although I'm reliably informed that the software was produced in Italy.  Amoi's name for the phone is the WP-S1.  It's based on the BREW platform from Qualcomm.  A second generation Skypephone has been introduced in some of the 3 networks (the WP-S2), and this new model is meant to have better battery life.  3 are also about to introduce the INQ-1, a low-cost touch-screen phone also from Amoi.  I can't wait to see it.

Thursday, 13 November 2008

Hash Wrangling

Someone once asked me "just what is a Symbol in Ruby". I was stumbling around the web and found this: http://www.randomhacks.net/articles/2007/01/20/13-ways-of-looking-at-a-ruby-symbol

The way I personally think of it is as an immutable string that has a hashed internal representation, which probably encompasses definition 4 from the web page above. Having coded to the Ruby C API, I also think of them in terms of that hashed internal representation, and the C typedef ID (definitions 5 and 13).
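A small illustration of that view, as a sketch rather than a definition: two strings with the same characters are separate objects, whereas every mention of a given symbol refers to one and the same object. The variable names are just for the example.

```ruby
# .dup guarantees two separate String objects with equal contents.
a = "name".dup
b = "name".dup

puts a == b                     # true: same characters
puts a.equal?(b)                # false: two distinct String objects
puts :name.equal?(:name)        # true: one shared Symbol object
puts a.to_sym.equal?(b.to_sym)  # true: both strings map to that Symbol
```

That shared identity is what makes a symbol cheap to compare and handy as a hash key.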

But this also reminds me of something I've been thinking about for a while with respect to hashes. Let's call it hash-wrangling:

wrangle |ˈraŋg(ə)l|
noun: a dispute or argument, typically one that is long and complicated.
verb:
  1. [intrans.] have such a dispute or argument : [as n.] (wrangling) weeks of political wrangling. See note at quarrel.
  2. [ trans. ] round up, herd, or take charge of (livestock) : the horses were wrangled early.
  3. another term for wangle .
ORIGIN late Middle English : compare with Low German wrangeln, frequentative of wrangen ‘to struggle’ ; related to wring .

Hashes are a great programming construct that I first encountered in awk's associative arrays. As a former C programmer, hashes were great because they provide tremendous power in freeing you from the mundane simplicity of C's data structures, and let you think about the relationships between different data. Gone are all thoughts of B-trees, search algorithms, etc and you're left with just the hash key and its related value.

The problem I see with hashes however is that it is far too easy to get the key wrong. It's quite easy to misspell a key, and have the resulting value stored in the wrong place, or fetched from the wrong place. The interpreter doesn't care what the hash keys are, and will just return a null when the value is fetched from the wrong place.
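In Ruby terms the failure looks something like this (a minimal sketch; the hash contents are invented). It also shows Hash#fetch, which raises instead of returning nil and so catches a misspelled key on the read side at least:

```ruby
person = { name: "Ada", email: "ada@example.com" }

# A misspelled key on write silently creates a second entry...
person[:emial] = "new@example.com"

# ...and a misspelled key on read silently returns nil.
p person[:nmae]

# Hash#fetch raises KeyError instead, so the mistake surfaces at once.
begin
  person.fetch(:nmae)
rescue KeyError => e
  puts "caught: #{e.message}"
end
```

The original :email entry is untouched by the misspelled write, which is exactly the sort of bug that goes unnoticed.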

I've sometimes considered creating forms of enumerated types for the hash keys, as a convenient way to verify that only the correct keys are being used for a given hash. But this gets very long-winded very quickly. And not only is it easy to get the key wrong, but with a hash using the "unknown" datatype (needed to allow nested hashes), it's also easy to get the type of the value wrong.

I think the only solution for this is to write test scripts that exercise every line of code. This sort of approach only goes so far however, and doesn't necessarily catch those cases where the hash keys are generated at runtime. You end up having to write lots of code that validates the inbound hash keys.

This is where an Object Oriented language like Ruby can help a lot because you can build the validation into the class definition itself. The class is basically a hash structure that you've defined in your code up front, and that is validated every time you access it. A hash is then relegated to only those values that might have been received from outside, but not yet passed to the class for validation.
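Here is roughly what I mean, as a sketch under my own assumptions: the Person class and its fields are hypothetical, and the type checks are just one way of doing the validation.

```ruby
# Hypothetical class for illustration: the "hash structure" is declared
# up front, and both keys and value types are checked on the way in.
class Person
  attr_reader :name, :age

  def initialize(name:, age:)
    raise TypeError, "name must be a String" unless name.is_a?(String)
    raise TypeError, "age must be an Integer" unless age.is_a?(Integer)
    @name = name
    @age = age
  end
end

# Data arriving from outside is still a hash until it is handed over.
inbound = { name: "Ada", age: 36 }
person = Person.new(**inbound)
puts person.name

# A misspelled key is rejected by the keyword arguments themselves.
begin
  Person.new(nmae: "Ada", age: 36)
rescue ArgumentError => e
  puts "rejected: #{e.message}"
end

# A misspelled accessor fails loudly rather than returning nil.
begin
  person.nmae
rescue NoMethodError
  puts "no such attribute"
end
```

Both kinds of mistake from earlier, the wrong key and the wrong value type, now raise at the point where the data enters the class.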

This is not to say that the structure of the class and its data should become static. There is tremendous power in the way a dynamic language like Ruby has all classes open for the addition of new methods all the time. So, even if your class validates the data through its definition, thus staying DRYer in your approach to data structures, you can easily add more capabilities later.
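A small sketch of that openness, again with a made-up class: the first definition does the validation, and a later one (which could live in a different file) reopens the same class to add a method without touching the original code.

```ruby
# Hypothetical class for illustration.
class Reading
  attr_reader :celsius

  def initialize(celsius)
    raise TypeError, "celsius must be Numeric" unless celsius.is_a?(Numeric)
    @celsius = celsius
  end
end

# Later, possibly elsewhere, the same class is reopened and extended.
class Reading
  def fahrenheit
    celsius * 9.0 / 5 + 32
  end
end

puts Reading.new(100).fahrenheit
```

Existing validation still applies to every instance; the new method simply becomes available alongside it.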

Wednesday, 21 November 2007

Apple Quality?

I have been ruminating lately over whether or not to upgrade my Apple hardware products to more recent versions.  But I am starting to have concerns over the quality of recent Apple hardware products.  Let me explain...

I have a third generation "classic" iPod and still use it all the time.  I travel quite a lot, so it's great for all that dead-time waiting in airport lounges, etc.  As a piece of industrial design it's fantastic.  I love the way I can operate it using the same hand I'm holding it in.  Part of this comes from the select button in the middle of the click wheel, which is slightly convex and gives a nice, positive click sensation when you press it.

More recent iPods, however, have a very flat or even slightly concave select button, and there is no click sensation when you press it.

I know I'm being picky, but this is actually an issue that is holding me back from an upgrade.  Why did they change it?  I'm guessing that it was made flat/concave to prevent the button from being accidentally pressed if the iPod is placed face down on a table (not an issue for me because I use an iSkin)?

I'm probably going to upgrade fairly soon anyway, because my current unit is starting to have hard disk issues.  Occasionally I find that it won't start up.  The disk spins up but then makes a distinct click and spins down and up again.  Eventually I get a "sad iPod" icon and a web link telling me to go to Apple's iPod support page.  To recover the situation, I have to perform a bit of "percussive maintenance".

I do love my current iPod, and will be sad when it really dies.  Apple have been adding features to tempt me to buy a new one.  I recently played with both an iPhone and an iPod Touch and I have to agree with everyone raving about these products' touch-based interfaces.  It was just so natural.  So I'll probably buy one of those.  But I've also noticed that the same concave button is there on the iPhone and the iPod Touch as the home button, and again the buttons on these units lack that positive feedback when pressed.

The other Apple product that I own is a beloved PowerBook G4, 15", 1.67GHz.  I've had this baby in my life for a couple of years now and I feel truly enriched by it.  At work I use a Dell Latitude D610 - a machine that I truly despise.  I resent its presence in my life and the amount of time I spend waiting for it.  Two things I like:
  • The quality of the PowerBook's screen is lovely and natural, as compared to the Dell's screen which, when viewed even slightly off-axis, makes my eyes bleed.  
  • The sound from the PowerBook, whether from the built-in speakers, or from the headphone jack is great.  With the Dell, the speakers are poor and, instead of the Sennheiser headphones I use with the PowerBook, I have to use lower quality (iPod!) headphones to avoid having my brains sucked out by the bus-noise you get from its headphone jack.
Now that Leopard is out, I feel it's time to upgrade so I've been looking at specs and prices and analysing what I'm going to use it for, reading reviews and just generally enjoying the shopping experience. :)

But I've come across some articles saying that the screen quality isn't as good as my current PB and that there is some high-pitched noise or other digital interference when you listen using the headphone jack.  What's happening here?  Perhaps the headphone noise is down to the chipset that Intel provide to support their Core 2 Duo processor, but that doesn't explain the screen dropping in quality.

So I put it to you, dear reader, that Apple hardware quality is decreasing over time.  Perhaps it's just part of being competitive - after all, their products seem to be better value than ever.  But I feel there is still an expectation of quality that comes with every piece of Apple hardware.

Comments?