I recently came across a Stack Overflow post asking whether it was possible to run Hadoop jobs using Azure Table Storage as a data source. The user had been running an ETL process to get data out of Azure Table Storage and into Hadoop for further processing, which seems unnecessary. I also needed a way to run SQL-like queries over very large Azure Diagnostics logs, and fancied writing a custom InputFormat anyway, so I put together a quick Hive Storage Handler that lets you query Azure Tables directly from Hive. This means you can now run a Hive query something like this:
CREATE EXTERNAL TABLE test (Test STRING)
STORED BY "com.redgate.hadoop.hive.azuretables.AzureTablesStorageHandler"
TBLPROPERTIES(
  'azuretables.account_name'='<account_name>',
  'azuretables.access_key'='<storage_key>',
  'azuretables.table'='test',
  'azuretables.partition_keys'='A,B',
  'azuretables.column_map'='test'
);

SELECT * FROM test;
The code for the storage handler, input format, record reader, etc. is all available on GitHub.
The implementation is fairly basic, and there are certainly ways it could do a better job of generating splits. At the moment it simply maps one Azure partition key to one split, which may limit your parallelism if you don't have an effective partitioning scheme in Azure. I'm sure I'll get round to doing something more sensible at some point, and of course if anyone wants to jump in, I'll gladly accept pull requests.
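To make the parallelism limitation concrete, here is a minimal sketch of that split strategy in plain Java. The class and field names are illustrative only (they are not the actual code on GitHub), and Hadoop's real InputSplit machinery is stood in for by a simple class:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: each Azure partition key becomes exactly one input
// split, so the number of mappers is capped by the number of keys listed
// in the 'azuretables.partition_keys' table property.
public class PartitionKeySplitter {

    // Stand-in for Hadoop's InputSplit: records which partition key
    // a single mapper should scan.
    static class AzureTableSplit {
        final String partitionKey;
        AzureTableSplit(String partitionKey) { this.partitionKey = partitionKey; }
    }

    // One split per comma-separated partition key, mirroring the
    // behaviour described above.
    static List<AzureTableSplit> getSplits(String partitionKeysProperty) {
        List<AzureTableSplit> splits = new ArrayList<>();
        for (String key : partitionKeysProperty.split(",")) {
            splits.add(new AzureTableSplit(key.trim()));
        }
        return splits;
    }

    public static void main(String[] args) {
        // 'A,B' from the example table definition yields exactly two
        // splits, i.e. at most two mappers no matter how much data
        // each partition holds.
        List<AzureTableSplit> splits = getSplits("A,B");
        System.out.println(splits.size());              // prints 2
        System.out.println(splits.get(0).partitionKey); // prints A
    }
}
```

A smarter implementation could sub-divide large partitions by row key ranges to produce more splits per partition key.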
Hope this is useful for someone out there. Please let me know how you get on, and file any bugs as GitHub issues.