This blog is written by Dr. Prithwis Mukherjee – Program Director, Business Analytics. For more blogs, please visit his blog page – www.yantrajaal.com
Hadoop is a conceptual delight and an architectural marvel. Anybody who understands the immense challenge of crunching through a humungous amount of data will appreciate the way it transparently distributes the workload across multiple computers and marvel at the elegance with which it does so.
Thirty years after my first tryst with data — as relational database management systems that I had come across at the University of Texas at Dallas — myintroduction to Hadoop was an eye opener into a whole new world of data processing. Last summer, I managed to Demystify Map Reduce and Hadoop by installing it on Ubuntu and running a few Java programs but frankly I was more comfortable with Pig and Hive that allowed a non-Java person — or pre-Java dinosaur — like me to perform meaningful tasks with Map-Reduce. RHadoop should have helped bridge the gap between R based Data Science ( or Business Analytics) and the world of Hadoop but integrating the two was a tough challenge and I had to settle with hadoop streaming API as a means of executing predictive statistics tasks like linear regression with R in a Hadoop environment.
The challenge with Hadoop is that it is meant for men, not boys. It does not run on Windows — the OS that boys like to play with — nor does it have a “cute” GUI interface. You need to key in mile long unix commands to get the job done and while this gives you good bragging points in the geek community, all that you can expect from the MBAwallahs and the middle managers at India’s famous “IT companies” is nothing more than a shrug and a cold shoulder.
But the trouble is Hadoop ( and Map Reduce) is important. It is an order of magnitude more important than all the Datawarehousing and Business Intelligence that IT managers in India have been talking about for the past fifteen years. So how do we get these little boys ( who think they are hairy chested adult men — few women have any interest in such matters) to try out Hadoop?
Enter Hortonworks, Hue and H2O ( seriously !) as a panacea to all problems.
First, you stay with your beloved Windows (7 or 8) and install a virtual machine software. In my case, I chose and downloaded Oracle VM VirtualBox that I can use for free, and installed it on my Windows 7 partition. ( I have an Ubuntu 14.04 partition where I do some real work). A quick double-click on the downloaded software is enough to get it up and running.
Next I downloaded the Hortonworks Hadoop Data Platform (HDP) sandbox as an Oracle Virtual Appliance and and imported it into the Oracle VM Virtual Box ( or this) and what did I get? A completely configured single node Hadoop cluster along with Pig, Hive and a host of applications that constitute the Hadoop ecosystem! And all this in just about 15 minutes — unbelievable!
Is this for real? Yes it is!
For example the same Java WordCount program that I had compiled and executed last year, worked perfectly with the same shell script that I modified with the corresponding Hadoop libraries present in the sandbox.
But what really takes the cake is that all data movement from the Unix file system to the HDFS file system is through upload/download through the browser. Not the dfs command, -copyFromLocal! This is due to the magic of Hue, a free and opensource product that gives a GUI to the Hadoop ecosystem. Hue can installed, like Pig or Hive on any Hadoop system but in the Hortonworks sandbox, it comes pre-installed.
But as a data scientist, one must be able to run R programs with Hadoop. To do so, follow instructions given here. But there are some deviations
- You log into the sandbox ( or rather ssh [email protected] -p 2222) andinstall RStudio server, but it will NOT be visible at port 8787 as promised until you follow instructions given in this post. Port 8787 in the guest operating system must be made visible to the host operating system.
- You can start with installing the packages rmr2 and rhdfs. The other two, rhbase and plyrmr are not really necessary to get started with. Also devtools is not really required. Just use wget to pull thelatest zip files and install the same.
- However RHadoop will NOT WORK unless the packages are installed in the common library and NOT in the personal library of the userid used to install the packages. See the solution to problem given inthis post. This means that the entire installation must be made with the root login. Even this is a challenge because when you use the sudo command, the environment variable CMD_STREAMING is not available and without this package rhdfs cannot be installed. This Catch22 can be overcome by installing without the sudo command BUT giving write privilege to all on the system library, which would be something like /usr/lib64/R/library.
- RStudio server would need a non-system, non-admin userid to access and use
Once you get past all this, you should be able to run the simple R+Hadoop program given inthis post, but also the Hortonworks R Hadoop tutorial that uses linear regression to predict website visitors.
R and Hadoop is fine but converting a standard machine learning algorithm to make it work in the Map Reduce format is not easy, and this is where we use H2O, a remarkable open source product that allows standard machine learning tasks like Linear Regression, Decision Trees, K-Means to be performed through a GUI interface. H2O runs on Windows or Unix as a server that is accessed through a browser at localhost:54321. To install it on the Hortonworks HDP sandbox in the Oracle VM VirtualBox, follow instructions given in this Hortonworks+H2O tutorial.
In this case, you will (or might) face these problems
- The server may not be available at 10.0.2.15 ( the principal IP address ) but at the 127.0.0.1 ( or local host)
- The server may not become visible until you configure the VM to forward ports as explained in the solution given inthis post.
Once you have H2O configured with Hadoop on the sandbox, then all normal machine learning tasks should be automatically ported to the map-reduce format and can benefit from the scalability of Hadoop.
So what have we achieved ? We have …
- Hadoop, the quintessential product to address big data solutions
- Hortonworks that eliminates the problems associated with installing a complex product like Hadoop, plus its associate products like Pig, Hive, Mahout etc
- The Oracle VM VirtualBox that allows us to install Hortonworks in a Windows environment
- Hue that gives a GUI to interact with the Hadoop ecosystem, without having to remember long and complex Unix commands.
- RHadoop that allows us to use RStudio to run Map Reduce programs. For moresee this.
- H2O that allows us to run complete machine learning jobs, either in stand alone mode or as Map Reduce jobs in Hadoop through an intuitive GUI.
If you think this is too geeky, then think again if you really want to get into data science! In reality, once we get past the initial installation, the rest is very much GUI driven and the data scientist may just feel that he ( or she) is back in the never-never land of menu driven software where you enter values and press shiny buttons. But in reality you would be running heavy duty Hadoop programs, that in principle, can crunch through terabyes of data.