This tutorial assumes that you already know the basics of Linux, HDFS, and Hadoop.
You’ll download a VM with a full environment to run your first big data Hello-World project. Note that this tutorial was written on a macOS environment.
- Download the Hadoop Ecosystem VM for VirtualBox – here
- Once it has been downloaded, import the appliance into VirtualBox.
- Before you start your VM, you should have at least one shared folder configured (see the VBoxManage sketch after this list).
- Now you should have a full Hadoop ecosystem already installed in your virtual machine.
- user is: gsantos
- password is: gil1234
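If you prefer the command line over the VirtualBox GUI for the shared folder step, VBoxManage can do it. This is a minimal sketch; the VM name "Hadoop-Ecosystem-VM" and the host path are placeholders for whatever your own setup uses:

```
# Add a shared folder to the (powered-off) VM; adjust the VM name and host path
VBoxManage sharedfolder add "Hadoop-Ecosystem-VM" \
  --name shared \
  --hostpath ~/hadoop-shared \
  --automount
```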
In your VM, open a terminal and type: start-dfs.sh. This will start the HDFS services of the Hadoop framework.
After this step, you should also start YARN with the following command: start-yarn.sh. YARN is another tool from the Hadoop framework ecosystem.
Then type the jps command to verify that all required Hadoop services are running.
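Putting those steps together, a session should look roughly like this; the process IDs are illustrative and will differ on your machine:

```
$ start-dfs.sh
$ start-yarn.sh
$ jps
2321 NameNode
2456 DataNode
2634 SecondaryNameNode
2801 ResourceManager
2937 NodeManager
3050 Jps
```

If NameNode, DataNode, ResourceManager, and NodeManager all appear, HDFS and YARN are up.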
If you run into problems, for instance the NameNode or DataNode doesn’t show up, try this solution.
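One common cause on a fresh VM (not necessarily the fix linked above) is NameNode metadata that was never formatted or got corrupted. A generic recovery sketch, with the caveat that reformatting erases everything already stored in HDFS:

```
# Stop the daemons first
stop-yarn.sh
stop-dfs.sh

# Reformat the NameNode metadata (WARNING: wipes existing HDFS data)
hdfs namenode -format

# Restart the services and re-check
start-dfs.sh
start-yarn.sh
jps
```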
Now we have everything we need to run our first MapReduce job on the Hadoop ecosystem. Since big data jobs work against millions of records, you should load a real dataset into your VM and run the Word Count MapReduce job, one of the example jobs that ship with Hadoop. Go to the https://grouplens.org/datasets/movielens/ website and grab one of the sample datasets.
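If you’d rather fetch the archive from the VM’s terminal than through a browser, curl works too. The direct URL below is an assumption based on GroupLens’s usual download layout, so double-check it on the datasets page:

```
# Download the MovieLens 20M dataset (verify the URL on grouplens.org)
curl -L -O https://files.grouplens.org/datasets/movielens/ml-20m.zip
```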
- I chose the ml-20m.zip file; now I need to put it into HDFS in my VM. Follow the steps below.
- After downloading, unzip the file with the following command line: unzip ml-20m.zip
- Inside the extracted folder you’ll find several files; let’s take movies.csv.
- We need a target directory laid out in the HDFS file system; try this solution (there is also a minimal sketch after this list).
- Now we need to put this file into HDFS with the following command line: hdfs dfs -put movies.csv /bigdata
- You can also use the command line hdfs dfs -ls /bigdata to verify that the file is actually in HDFS.
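Here is the minimal sketch promised above, assuming /bigdata does not exist yet and you are in the folder that contains movies.csv:

```
# Create the target directory in HDFS
hdfs dfs -mkdir -p /bigdata

# Copy the local file into HDFS
hdfs dfs -put movies.csv /bigdata

# Confirm the file landed where we expect it
hdfs dfs -ls /bigdata
```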
Once we have the file in HDFS, it’s time to call the Word Count MapReduce job. For this, we need to understand how to invoke it:
- JARPath is: /opt/hadoop/share/hadoop/mapreduce
- mainClass is: wordcount
- inputFile is: /bigdata
- outputDir is: /output
The full command line would be: hadoop jar /opt/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.3.jar wordcount /bigdata /output
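One detail worth knowing: MapReduce refuses to start if the output directory already exists. If you rerun the job, remove /output first:

```
# Clear a previous run's output, then run Word Count again
hdfs dfs -rm -r /output
hadoop jar /opt/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.3.jar wordcount /bigdata /output
```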
Then you should see the Word Count MapReduce job running:
It will generate another file in HDFS with the result of your MapReduce job. This file is normally named part-r-00000; you can see its content with the following command line:
hdfs dfs -cat /output/part-r-00000
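Because the full word-count output is long, it can be handier to list what the job wrote and preview just the first lines:

```
# List everything the job produced (part files plus the _SUCCESS marker)
hdfs dfs -ls /output

# Preview the first 20 lines of the results
hdfs dfs -cat /output/part-r-00000 | head -20
```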