Counting Words with Hadoop + MapReduce on Ubuntu

Article link: https://blog.csdn.net/qq_44388747/article/details/143923921

Overview of the steps:

1. Start Hadoop

2. Create the input file and upload it to HDFS

3. Write the MapReduce program (Python)

4. Run the MapReduce job

5. View the results

Step 1: Start Hadoop

start-dfs.sh
start-yarn.sh

Notes:

(1) Check that the Hadoop environment variables are configured correctly.

gedit ~/.bashrc

It should contain:

export HADOOP_HOME=/usr/local/hadoop
export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin

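Here HADOOP_HOME must be the path where Hadoop is actually installed. After editing the file, run "source ~/.bashrc" or open a new terminal so the change takes effect.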

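(2) Verify that Hadoop is running

Open the following URLs in a browser:
http://localhost:9870/
http://localhost:8088/

The first is the HDFS NameNode web UI and the second is the YARN ResourceManager web UI (the default ports in Hadoop 3.x). If both pages load, Hadoop started successfully.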
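The browser check above can also be scripted. The following is a minimal sketch, assuming the two default addresses from the text (9870 for the NameNode UI and 8088 for the ResourceManager UI in Hadoop 3.x); the function name is_reachable and its opener parameter are made up for illustration and are not part of Hadoop.

```python
import urllib.error
import urllib.request

# Web UI addresses taken from the text above (Hadoop 3.x defaults).
WEB_UIS = {
    "NameNode (HDFS)": "http://localhost:9870/",
    "ResourceManager (YARN)": "http://localhost:8088/",
}


def is_reachable(url, timeout=3, opener=urllib.request.urlopen):
    """Return True if an HTTP GET to url succeeds with a status below 400.

    opener is a hypothetical hook so the function can be tested without
    a running cluster; by default the real urlopen is used.
    """
    try:
        with opener(url, timeout=timeout) as response:
            return response.status < 400
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, HTTP error status, and similar failures.
        return False


if __name__ == "__main__":
    for name, url in WEB_UIS.items():
        state = "up" if is_reachable(url) else "NOT reachable"
        print(f"{name}: {url} is {state}")
```

If either line reports "NOT reachable", the corresponding daemon probably did not start; the log files under the Hadoop installation directory are the usual place to look next.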

Step 2: Create the input file and put it into HDFS

1. Create the file words.txt (the file name and path are up to you)

gedit words.txt

2. Upload words.txt to HDFS

First create the directory:

hdfs dfs -mkdir -p /user/hadoop/input

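Here "hadoop" is my username; replace it with yours, and adjust every later command that contains this path accordingly.

Put the file into HDFS: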

hdfs dfs -put words.txt /user/hadoop/input/

Verify that the upload succeeded:

hdfs dfs -ls /user/hadoop/input/

The listing should show words.txt.
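For a reproducible test run, words.txt can also be generated by a script before it is uploaded, so that the expected counts are known in advance. This is only a sketch: the sample text and the function name write_sample are hypothetical, and any whitespace-separated text works just as well.

```python
from pathlib import Path

# Hypothetical sample text; it is not taken from the original article.
SAMPLE_TEXT = """\
hello hadoop
hello mapreduce
hadoop streaming with python
"""


def write_sample(path="words.txt"):
    """Write the sample text to path and return the number of words written."""
    Path(path).write_text(SAMPLE_TEXT, encoding="utf-8")
    return len(SAMPLE_TEXT.split())


if __name__ == "__main__":
    print(f"wrote {write_sample()} words to words.txt")
```

With this input, a correct word count reports 2 for "hello" and "hadoop" and 1 for every other word.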

Step 3: Write the MapReduce program (Python)

Create mapper.py

gedit mapper.py

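#!/usr/bin/python3
import sys

# Read input lines from stdin and emit "word<TAB>1" for every word.
for line in sys.stdin:
    line = line.strip()
    # split() without arguments splits on any run of whitespace.
    words = line.split()
    for word in words:
        print(f"{word}\t1")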
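Because mapper.py splits only on whitespace, "Hello," and "hello" are counted as different words. If case-insensitive counting without punctuation is wanted, a variant along the following lines could be used instead. It is a sketch under stated assumptions: the regular expression and the function name tokenize are illustrative choices, not part of the original script.

```python
#!/usr/bin/python3
import re
import sys

# Assumption: a "word" is a run of word characters (letters, digits,
# underscore, in any script) or apostrophes.
WORD_RE = re.compile(r"[\w']+")


def tokenize(line):
    """Return the lowercase words found in one line of text."""
    return [w.lower() for w in WORD_RE.findall(line)]


if __name__ == "__main__":
    for line in sys.stdin:
        for word in tokenize(line):
            print(f"{word}\t1")
```

The output format is unchanged ("word<TAB>1"), so this variant works with the same reducer.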

Create reducer.py in the same way

gedit reducer.py

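#!/usr/bin/python3
import sys

current_word = None
current_count = 0

# Hadoop Streaming sorts the mapper output by key, so equal words arrive consecutively.
for line in sys.stdin:
    line = line.strip()
    if not line:
        continue
    try:
        word, count = line.split('\t', 1)
        count = int(count)
    except ValueError:
        # Skip malformed lines (no tab, or a count that is not an integer).
        continue

    if current_word == word:
        current_count += count
    else:
        # A new word starts: emit the total of the previous one.
        if current_word is not None:
            print(f"{current_word}\t{current_count}")
        current_word = word
        current_count = count

# Emit the total of the last word; nothing is printed for empty input.
if current_word is not None:
    print(f"{current_word}\t{current_count}")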
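Before submitting the job, it can help to see what Hadoop Streaming does with the two scripts: it feeds the input lines to the mapper, sorts the mapper output by key, and feeds the sorted lines to the reducer. The sketch below simulates these three phases in memory. The function names are hypothetical, and the logic mirrors mapper.py and reducer.py rather than importing them.

```python
from itertools import groupby


def map_phase(lines):
    """Emit (word, 1) for every whitespace-separated word, like mapper.py."""
    for line in lines:
        for word in line.split():
            yield word, 1


def reduce_phase(sorted_pairs):
    """Sum the counts of consecutive equal keys, like reducer.py."""
    for word, group in groupby(sorted_pairs, key=lambda pair: pair[0]):
        yield word, sum(count for _, count in group)


def word_count(lines):
    # sorted() stands in for the shuffle/sort step between map and reduce.
    return list(reduce_phase(sorted(map_phase(lines))))


if __name__ == "__main__":
    sample = ["hello hadoop", "hello mapreduce"]
    for word, count in word_count(sample):
        print(f"{word}\t{count}")
```

The real scripts can be checked the same way without a cluster by running "cat words.txt | python3 mapper.py | sort | python3 reducer.py" in the directory that contains them.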

Make the scripts executable

chmod +x mapper.py
chmod +x reducer.py

Notes:

(1) Remember the paths of mapper.py and reducer.py, because they are needed later. For example, mine are:
/home/hadoop/桌面/MapReduce/mapper.py

/home/hadoop/桌面/MapReduce/reducer.py

(2) The first line of each script must point to your python3 interpreter; mine is "/usr/bin/python3". You can find yours with "which python3".

which python3

(3) Make sure python3 is installed.

Recent Ubuntu releases ship with python3 preinstalled. You can check the version with python3 --version.

python3 --version

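If it is not installed, install the latest or a specific version from the command line, for example with "sudo apt install python3".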

Step 4: Run the MapReduce job

hadoop jar $HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-*.jar \
-input /user/your_username/input/words.txt \
-output /user/your_username/output \
-mapper "/path/to/mapper.py" \
-reducer "/path/to/reducer.py" \
-file /path/to/mapper.py \
-file /path/to/reducer.py

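Notes:

(1) your_username: replace it with your actual username. Mine is hadoop.

(2) /path/to/mapper.py and /path/to/reducer.py: replace them with the actual paths of your mapper.py and reducer.py scripts. Recent Hadoop releases warn that -file is deprecated in favor of the generic -files option, but the command still runs.

For example, mine are:

/home/hadoop/桌面/MapReduce/mapper.py

/home/hadoop/桌面/MapReduce/reducer.py

So my command is: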

hadoop jar $HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-*.jar \
    -input /user/hadoop/input/words.txt \
    -output /user/hadoop/output \
    -mapper "/home/hadoop/桌面/MapReduce/mapper.py" \
    -reducer "/home/hadoop/桌面/MapReduce/reducer.py" \
    -file "/home/hadoop/桌面/MapReduce/mapper.py" \
    -file "/home/hadoop/桌面/MapReduce/reducer.py"

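(3) Before running the job, make sure the output directory "/user/hadoop/output/" does not exist in HDFS; otherwise the job fails.

If it exists, delete it first (replace hadoop with your actual username):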

hdfs dfs -rm -r /user/hadoop/output
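Since the command has to be adapted to the username and the script location, it can be assembled programmatically. The sketch below only builds and prints the command; it mirrors the options used above. The function names build_streaming_command and find_streaming_jars are hypothetical, and /usr/local/hadoop is assumed as a fallback when HADOOP_HOME is not set.

```python
import getpass
import glob
import os
import shlex


def find_streaming_jars(hadoop_home):
    """Expand the hadoop-streaming-*.jar pattern that the shell expands above."""
    pattern = os.path.join(
        hadoop_home, "share", "hadoop", "tools", "lib", "hadoop-streaming-*.jar"
    )
    return sorted(glob.glob(pattern))


def build_streaming_command(jar, username, script_dir):
    """Return the argument list of the word-count streaming job."""
    mapper = os.path.join(script_dir, "mapper.py")
    reducer = os.path.join(script_dir, "reducer.py")
    return [
        "hadoop", "jar", jar,
        "-input", f"/user/{username}/input/words.txt",
        "-output", f"/user/{username}/output",
        "-mapper", mapper,
        "-reducer", reducer,
        "-file", mapper,
        "-file", reducer,
    ]


if __name__ == "__main__":
    # Assumption: fall back to /usr/local/hadoop if HADOOP_HOME is unset.
    jars = find_streaming_jars(os.environ.get("HADOOP_HOME", "/usr/local/hadoop"))
    if not jars:
        print("no hadoop-streaming jar found; check HADOOP_HOME")
    else:
        # Assumes the scripts are in the current directory.
        cmd = build_streaming_command(jars[0], getpass.getuser(), os.getcwd())
        print(shlex.join(cmd))
```

The printed line can be reviewed and pasted into a terminal, or the list can be passed to subprocess.run once the output directory has been removed.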

Step 5: View and retrieve the results

List the output files (replace hadoop with your actual username):

hdfs dfs -ls /user/hadoop/output/

The listing should contain an output file named "part-00000".

Print the results (replace hadoop with your actual username):

hdfs dfs -cat /user/hadoop/output/part-00000

Notes:

You can copy the result to the local filesystem:

hdfs dfs -get /user/hadoop/output/part-00000 path/wordcount_result.txt

Here "path/wordcount_result.txt" is the local path where you want to save the file.
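Once the result file is on the local disk, it can be post-processed with ordinary Python. The following sketch lists the most frequent words. It assumes the "word<TAB>count" format written by reducer.py; the function names parse_counts and top_words are made up for illustration.

```python
import sys


def parse_counts(lines):
    """Parse 'word<TAB>count' lines into a list of (word, count) tuples."""
    pairs = []
    for line in lines:
        word, sep, count = line.rstrip("\n").partition("\t")
        # Skip lines without a tab or with a non-numeric count.
        if sep and count.strip().isdigit():
            pairs.append((word, int(count)))
    return pairs


def top_words(pairs, n=10):
    """Return the n most frequent words; ties are ordered alphabetically."""
    return sorted(pairs, key=lambda pair: (-pair[1], pair[0]))[:n]


if __name__ == "__main__":
    # Usage: python3 top_words.py path/wordcount_result.txt
    if len(sys.argv) > 1:
        with open(sys.argv[1], encoding="utf-8") as result_file:
            for word, count in top_words(parse_counts(result_file)):
                print(f"{count:>8}  {word}")
```

The script name top_words.py in the usage comment is arbitrary; the argument is the local path used in the "hdfs dfs -get" command.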