SQL Server MVP Jeremiah Peschka posted two articles about Hadoop, which got me interested in NoSQL.
I don't have much experience with NoSQL or Linux, so I am going to set up a testing environment on my laptop over the holidays.
1. Download the CentOS Linux setup ISO file
http://www.centos.org/
2. Download the Java JDK 1.6
http://www.oracle.com/technetwork/java/javase/downloads/index.html
3. Download the Hadoop setup file
http://hadoop.apache.org/#Download+Hadoop
I downloaded release 1.0.4
4. Create VMs with VMware Workstation
I created 3 VMs:
linux1 : 192.168.27.29 -----> master
linux2 : 192.168.27.31 -----> slave
linux3 : 192.168.27.32 -----> slave
5. Install the Linux OS
6. Configure the VM IP address
vi /etc/sysconfig/network-scripts/ifcfg-eth0
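For example, on linux1 a static configuration could look like the lines below (NETMASK and GATEWAY here are assumptions for my lab network; adjust them to yours):
# NETMASK and GATEWAY below are assumptions, change them for your network
DEVICE=eth0
BOOTPROTO=static
ONBOOT=yes
IPADDR=192.168.27.29
NETMASK=255.255.255.0
GATEWAY=192.168.27.1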
7. Configure host name and hosts file
vi /etc/sysconfig/network --------->set the hostname
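On linux1, for instance, the file should contain something like:
NETWORKING=yes
HOSTNAME=linux1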
vi /etc/hosts ---------> add the ip/hostname mappings for all 3 servers, for instance
192.168.27.29 linux1
192.168.27.31 linux2
192.168.27.32 linux3
8. Install the JDK
Copy the JDK install file to the VM with VMware shared folders, and extract it to a local folder. I installed the JDK in /usr/jdk1.6.0_37
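For instance, assuming the shared folder is mounted at /mnt/hgfs/share and you downloaded the self-extracting .bin installer from Oracle (the exact file name may differ):
cd /usr
cp /mnt/hgfs/share/jdk-6u37-linux-i586.bin .
chmod +x jdk-6u37-linux-i586.bin
./jdk-6u37-linux-i586.bin    # extracts to /usr/jdk1.6.0_37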
9. Install Hadoop
Copy the install file to the VM with VMware shared folders, and extract it to a local folder. I installed the Hadoop files in /usr/hadoop-1.0.4
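For instance (assuming the release tarball hadoop-1.0.4.tar.gz is in the shared folder):
tar -xzf /mnt/hgfs/share/hadoop-1.0.4.tar.gz -C /usr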
10. Create folders for Hadoop
tmp folder: /usr/hadoop-1.0.4/tmp
Data folder: /usr/hadoopfiles/Data
Name folder: /usr/hadoopfiles/Name
Make sure the folder owner is the user that will start the Hadoop processes, and the Data and Name folders must have permission 755:
chmod 755 /usr/hadoopfiles/Data /usr/hadoopfiles/Name
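The folders have to exist before you run the chmod above. A minimal sketch of creating them (the dedicated hadoop user is my assumption; substitute whatever account starts the daemons):
mkdir -p /usr/hadoop-1.0.4/tmp /usr/hadoopfiles/Data /usr/hadoopfiles/Name
# "hadoop:hadoop" is an assumed user/group, not from the original setup
chown -R hadoop:hadoop /usr/hadoopfiles /usr/hadoop-1.0.4/tmp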
11. Set environment variables
vi /etc/profile
then add the lines below:
HADOOP_HOME=/usr/hadoop-1.0.4
JAVA_HOME=/usr/jdk1.6.0_37
CLASSPATH=.:$JAVA_HOME/lib/dt.jar:$JAVA_HOME/lib/tools.jar:$CLASSPATH
PATH=$JAVA_HOME/bin:$HADOOP_HOME/bin:$PATH
export JAVA_HOME
export HADOOP_HOME
export CLASSPATH
export PATH
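Reload the profile and verify that both tools are on the PATH:
source /etc/profile
java -version
hadoop version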
12. Set up SSH
1) Generate an ssh public key on all 3 servers
ssh-keygen -t dsa -P '' -f ~/.ssh/id_dsa
cat ~/.ssh/id_dsa.pub >> ~/.ssh/authorized_keys
run "ssh localhost" to test that ssh works. Make sure the authorized_keys file has the correct permissions; that's important:
chmod 644 authorized_keys
2) Copy the file id_dsa.pub to the other 2 servers with a new file name. For instance, on linux1, copy id_dsa.pub to linux2 and linux3 with the name linux1_id_dsa.pub, as shown below.
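One way to do the copy (assuming you connect as root; you will still be prompted for a password at this point):
scp ~/.ssh/id_dsa.pub root@linux2:~/.ssh/linux1_id_dsa.pub
scp ~/.ssh/id_dsa.pub root@linux3:~/.ssh/linux1_id_dsa.pub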
3) Log on to the other 2 servers and import the new file
cat ~/.ssh/linux1_id_dsa.pub >> ~/.ssh/authorized_keys
Do these 3 steps on all 3 servers, and make sure you can ssh to any remote server without a password prompt.
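A quick check from linux1 (repeat from the other two servers as well):
ssh linux2 hostname
ssh linux3 hostname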
13. Configure Hadoop.
1) Open $HADOOP_HOME/conf/hadoop-env.sh, and set the line below
export JAVA_HOME=/usr/jdk1.6.0_37
2) Open $HADOOP_HOME/conf/masters, add the line below
linux1
3) Open $HADOOP_HOME/conf/slaves, add the lines below
linux2
linux3
4) Edit $HADOOP_HOME/conf/core-site.xml
<configuration>
  <!-- global properties -->
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/usr/hadoop-1.0.4/tmp</value>
    <description>A base for other temporary directories.</description>
  </property>
  <!-- file system properties -->
  <property>
    <name>fs.default.name</name>
    <value>hdfs://linux1:9000</value>
  </property>
</configuration>
5) Edit $HADOOP_HOME/conf/hdfs-site.xml
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>2</value>
  </property>
  <property>
    <name>dfs.name.dir</name>
    <value>/usr/hadoopfiles/Name</value>
  </property>
  <property>
    <name>dfs.data.dir</name>
    <value>/usr/hadoopfiles/Data</value>
  </property>
</configuration>
6) Edit $HADOOP_HOME/conf/mapred-site.xml (the job tracker runs on linux1, the master)
<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>linux1:9001</value>
  </property>
</configuration>
Do the same configuration on all 3 servers.
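Rather than editing every server by hand, you can push the conf folder from linux1 (assuming the same install path on all servers):
scp -r /usr/hadoop-1.0.4/conf root@linux2:/usr/hadoop-1.0.4/
scp -r /usr/hadoop-1.0.4/conf root@linux3:/usr/hadoop-1.0.4/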
14. Disable the firewall on all 3 servers
service iptables stop
chkconfig iptables off
15. Format the name node
cd /usr/hadoop-1.0.4/bin
./hadoop namenode -format
16. Start Hadoop on the master (linux1)
./start-all.sh
17. Run "jps" on all 3 servers to check whether Hadoop is running.
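On the master you should see the NameNode, SecondaryNameNode and JobTracker processes, and on the slaves DataNode and TaskTracker. For example (the process ids will differ):
jps
2481 NameNode
2658 SecondaryNameNode
2742 JobTracker
2843 Jps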
Alternatively, you can open the web pages below:
http://linux1:50030
http://linux1:50070
You can check the log files in the logs folder in case any process fails to start.
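As a quick smoke test, you can run the example job that ships with the release (hadoop-examples-1.0.4.jar sits in the install folder):
cd /usr/hadoop-1.0.4
bin/hadoop jar hadoop-examples-1.0.4.jar pi 10 100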
It is a good start to learning Hadoop. Even Microsoft is developing data solutions with Hadoop on the Windows platform, so it is time to learn new things.
Reference:
http://blog.csdn.net/skyering/article/details/6457466