This one was really nice!
This spring we had an opportunity like never before to study big data and its tools, manipulation, and querying.
We studied mostly Hadoop on our university's cluster, Hive on Hadoop, Google BigQuery, and MongoDB (through its web-demo interface only).
Hadoop on Linux was easy for me, as I was already familiar with the *nix command line, and we didn't need to learn MapReduce programming, only the basic Hadoop commands and the MapReduce demos included with Hadoop (mainly counting words from text documents 😀 ). Hadoop is a clustered solution for distributed computing and data storage. As far as I understood, a data node can be installed automatically just by plugging a computer with a blank HDD into the cluster; PXE boot takes care of the rest, meaning the master node provides the needed software to the data node.
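The idea behind that word-count demo is easy to sketch in plain Python. This is just an illustration of the map and reduce phases, not the actual Java demo that ships with Hadoop:

```python
from collections import Counter

def map_phase(lines):
    # Map: emit a (word, 1) pair for every word in every line.
    for line in lines:
        for word in line.split():
            yield (word, 1)

def reduce_phase(pairs):
    # Reduce: sum the counts per word.
    counts = Counter()
    for word, n in pairs:
        counts[word] += n
    return dict(counts)

lines = ["hello big data", "big data is big"]
print(reduce_phase(map_phase(lines)))
# {'hello': 1, 'big': 3, 'data': 2, 'is': 1}
```

On a real cluster, Hadoop runs many mappers in parallel over HDFS blocks and shuffles the pairs to reducers by key; the toy version above just does both phases in one process.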
Google BigQuery was a bit more difficult: importing data from CSV and defining the schema, well, not that difficult, but still. The query language was just different enough from SQL that my experiments felt a bit overwhelming and daunting. I didn't master this one completely, but I still passed.
Hive, well, what do you expect, it's almost pure SQL, just that the data is stored in HDFS (the Hadoop Distributed File System). This was surprisingly easy (well, maybe not so surprising, as I had already completed the "Information Management" course). Our final project was also done on Hive, and I got through it.
MongoDB was otherwise nice too, but it seemed to be unable to update nested structures, or at least that's how it looked to me… It can handle huge amounts of document-type (JSON) data. At first I had real problems with MongoDB and its manipulation and query language, it being so far from SQL, but after the web-interface tutorial and some playing around on my own installation it wasn't that bad, and it got me interested.
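For what it's worth, MongoDB's update language can actually reach into nested documents with dot notation, e.g. `{"$set": {"address.city": "Tampere"}}`. Here's a minimal plain-Python sketch of that idea (the document and field names are made up for illustration, and this mimics the behavior rather than calling MongoDB itself):

```python
def set_nested(doc, path, value):
    # Mimics MongoDB's {"$set": {"a.b.c": value}} dot notation:
    # walk down the nested dicts and set the leaf field in place,
    # creating intermediate dicts if they are missing.
    *parents, leaf = path.split(".")
    node = doc
    for key in parents:
        node = node.setdefault(key, {})
    node[leaf] = value
    return doc

person = {"name": "Pekka", "address": {"city": "Helsinki", "zip": "00100"}}
set_nested(person, "address.city", "Tampere")
print(person["address"])
# {'city': 'Tampere', 'zip': '00100'}
```

In the real mongo shell the equivalent would be an `update` with a `$set` operator; the point is just that the nesting is addressed by a dotted path string.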
I recommend that everyone studying computer science take part in courses like this if they get the chance!
Well, I got a 4/5 on this course and really enjoyed it!