Recently, MongoDB brought out a connector that lets you use Pig, Hive and core MapReduce in Hadoop to operate on Mongo-sourced data.
Someone was asking on Stack Overflow about connecting Mongo data to HDInsight, so I thought I’d make it work.
First off, we need to make the sbt build for mongo-hadoop work. Out of the box you can’t build it for HDInsight, since the Hadoop version supported by HDInsight (1.2) is not one of the options supported by Mongo’s build. This is not as serious as it sounds; we just need to adjust the build script.
The changes required are in my GitHub repo, pending a PR to Mongo.
All you need to do then is change the build.sbt file (not in my PR for obvious reasons):

```
hadoopRelease in ThisBuild := "1.2"
```
Then you can run the build. I’m doing that on my Mac, but you can do this on a Windows command line in much the same way.
This will produce a few jar files:

```
./core/target/mongo-hadoop-core_1.2.0-1.2.0.jar
./flume/target/mongo-flume-1.2.0.jar
./hive/target/mongo-hadoop-hive_1.2.0-1.2.0.jar
./pig/target/mongo-hadoop-pig_1.2.0-1.2.0.jar
```
Upload these jars to the Azure storage container associated with your cluster. You will then need to specify this location whenever you want to use the Mongo functionality. Alternatively, you could put them in the magic location /user/hdp/share/lib/ (though I’ve not personally tested that). You will also need to download the Mongo Java Driver and upload it to the same place.
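For reference, the BLOB_LOCATION placeholders in the scripts that follow take the standard HDInsight form wasb://&lt;container&gt;@&lt;storage_account&gt;.blob.core.windows.net/&lt;path&gt;. Here’s a tiny Python sketch that builds one (the container and account names are made up, purely for illustration):

```python
def wasb_uri(container, account, path):
    # Standard HDInsight wasb:// addressing: container@account.blob.core.windows.net
    return "wasb://{0}@{1}.blob.core.windows.net/{2}".format(
        container, account, path.lstrip("/"))

# Hypothetical container/account names, purely for illustration
print(wasb_uri("mycontainer", "mystorageacct",
               "mongo-connector/mongo-hadoop-pig_1.2.0-1.2.0.jar"))
```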
If building these is a bit of a pain, I’ve put them all in a zip for you.
Now let’s try running a Pig script against data from a Mongo server. Here’s an example that just reads a BSON file produced by mongodump:
```
REGISTER 'wasb://BLOB_LOCATION/mongo-connector/mongo-java-driver-2.9.3.jar';
REGISTER 'wasb://BLOB_LOCATION/mongo-connector/mongo-hadoop-core_1.2.0-1.2.0.jar';
REGISTER 'wasb://BLOB_LOCATION/mongo-connector/mongo-hadoop-pig_1.2.0-1.2.0.jar';

raw = LOAD 'wasb://BLOB_LOCATION/yield_historical_in.bson' using com.mongodb.hadoop.pig.BSONLoader;
raw_limited = LIMIT raw 3;
DUMP raw_limited;
```
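Incidentally, a mongodump .bson file is nothing exotic: it’s just BSON documents concatenated back to back, which is what BSONLoader walks through. To make the format concrete, here’s a minimal pure-Python sketch that decodes a single document, handling only string and int32 fields (the sample document and field names are made up for illustration):

```python
import struct

def decode_bson_doc(buf):
    # A BSON document is: int32 total length, a run of elements, then a 0x00 terminator.
    total_len = struct.unpack_from("<i", buf, 0)[0]
    pos, doc = 4, {}
    while pos < total_len - 1:               # stop before the trailing 0x00
        etype = buf[pos]; pos += 1
        end = buf.index(0, pos)              # element name is a NUL-terminated cstring
        name = buf[pos:end].decode("utf-8")
        pos = end + 1
        if etype == 0x02:                    # UTF-8 string: int32 byte length, bytes, NUL
            slen = struct.unpack_from("<i", buf, pos)[0]
            pos += 4
            doc[name] = buf[pos:pos + slen - 1].decode("utf-8")
            pos += slen
        elif etype == 0x10:                  # 32-bit integer, little-endian
            doc[name] = struct.unpack_from("<i", buf, pos)[0]
            pos += 4
        else:
            raise ValueError("unsupported BSON type 0x%02x" % etype)
    return doc

# Hand-encoded equivalent of {"name": "ten year", "rate": 4}
elements = (b"\x02name\x00\x09\x00\x00\x00ten year\x00"
            b"\x10rate\x00\x04\x00\x00\x00")
sample = struct.pack("<i", len(elements) + 5) + elements + b"\x00"
print(decode_bson_doc(sample))               # {'name': 'ten year', 'rate': 4}
```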
And here’s a similar example using a live Mongo connection:
```
-- First, register jar dependencies
REGISTER wasb://BLOB_LOCATION/mongo-connector/mongo-java-driver-2.9.3.jar       -- mongodb java driver
REGISTER wasb://BLOB_LOCATION/mongo-connector/mongo-hadoop-core_1.2.0-1.2.0.jar -- mongo-hadoop core lib
REGISTER wasb://BLOB_LOCATION/mongo-connector/mongo-hadoop-pig_1.2.0-1.2.0.jar  -- mongo-hadoop pig lib

raw = LOAD 'mongodb://mongohost/Db.collection' using com.mongodb.hadoop.pig.MongoLoader;
raw_limited = LIMIT raw 3;
DUMP raw_limited;
```
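The connection URI does double duty here: the host comes from the authority part and the Db.collection namespace comes from the path. A quick Python illustration of that split, just to show the URI anatomy (the helper name is my own, not anything from the connector):

```python
from urllib.parse import urlparse

def split_mongo_uri(uri):
    # A MongoLoader-style URI names a "Db.collection" namespace in its path
    parsed = urlparse(uri)
    db, _, coll = parsed.path.lstrip("/").partition(".")
    return parsed.hostname, db, coll

print(split_mongo_uri("mongodb://mongohost/Db.collection"))
# → ('mongohost', 'Db', 'collection')
```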
Similar methods ought to work for Hive and MapReduce jobs, but that’s a blog for another day.