In a Linux terminal, write a standalone Spark application in Python that counts the number of lines in /data/bigfiles/test.txt containing 'a' and 'b' and prints the result (sample output as shown below), then submit it with the spark-submit command and save the run's output log to /root/run.txt.

Posted: 2025-07-13 13:28:47 · Views: 18
### How to write a standalone Spark application that counts lines containing a and b

Here, "lines containing a and b" is read as lines that contain both characters, so the filter condition is `'a' in line and 'b' in line`. The steps are as follows:

1. **Create the Python script `count_ab.py`**

```python
from pyspark import SparkContext

if __name__ == "__main__":
    # Initialize the SparkContext and set the application name
    sc = SparkContext(appName="ABLineCounter")

    # Read the target file and keep only lines containing both 'a' and 'b'
    lines = sc.textFile("/data/bigfiles/test.txt")
    ab_lines = lines.filter(lambda x: 'a' in x and 'b' in x)

    # Run the count action
    result = ab_lines.count()

    # Print the result and release resources
    print(f"Number of lines containing both 'a' and 'b': {result}")
    sc.stop()
```
2. **Run the program with spark-submit**

```bash
# Run in local mode (using all CPU cores)
spark-submit \
  --master "local[*]" \
  count_ab.py \
  > /root/run.txt 2>&1
```

3. **Parameter notes**
   - `--master "local[*]"`: run in local mode; `*` means use all available CPU cores[^3][^5]
   - `> /root/run.txt`: redirect standard output to the file (use `>>` instead if you want to append rather than overwrite)
   - `2>&1`: merge standard error into standard output

4. **Cluster-mode configuration**

```bash
# Example for Spark standalone cluster mode
spark-submit \
  --master spark://master-host:7077 \
  --deploy-mode client \
  count_ab.py \
  > /root/run.txt 2>&1
```

**Notes**:
1. The input file path must be accessible from every Spark node
2. Writing to `/root` requires the corresponding (usually root) permissions
3. The captured log contains both Spark's system log output and the application's own output
4. Detailed log levels can be configured via a `log4j.properties` file[^4]
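The sample output referenced in the task statement was not reproduced above. If the intended output reports the two counts separately (in the style of the classic Spark quick-start example, e.g. `Lines with a: 3, Lines with b: 2`) rather than a single combined count, the filter step changes as sketched below; the separate-count interpretation, the `cache()` call, and the output wording are assumptions, not part of the original answer.

```python
from pyspark import SparkContext

if __name__ == "__main__":
    sc = SparkContext(appName="ABLineCounter")

    # Cache the RDD because it is scanned twice, once per count
    lines = sc.textFile("/data/bigfiles/test.txt").cache()

    # Count lines containing 'a' and lines containing 'b' independently
    num_a = lines.filter(lambda line: 'a' in line).count()
    num_b = lines.filter(lambda line: 'b' in line).count()

    print(f"Lines with a: {num_a}, Lines with b: {num_b}")
    sc.stop()
```

This variant is submitted with the same `spark-submit` command shown in step 2.

For the log4j configuration mentioned in the notes, a minimal sketch is shown below, assuming a Spark build that still reads `conf/log4j.properties` (log4j 1.x; newer Spark releases ship `log4j2.properties` instead). Raising the root level keeps `/root/run.txt` limited to warnings, errors, and the application's own `print` output:

```properties
# conf/log4j.properties -- based on Spark's log4j.properties.template
log4j.rootCategory=WARN, console
log4j.appender.console=org.apache.log4j.ConsoleAppender
log4j.appender.console.target=System.err
log4j.appender.console.layout=org.apache.log4j.PatternLayout
log4j.appender.console.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %c{1}: %m%n
```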
【实验目的】 (1)掌握使用 Spark 访问本地文件和 HDFS 文件的方法。 (2)掌握 Spark 应用程序的编写、编译和运行方法。 【实验平台】 (1) 操作系统: Ubuntu 16.04。 (2) Spark 版本: 3.4.0。 (3) Hadoop 版本: 3.3.5。 【实验内容】 1.Spark读取文件系统的数据 (1)在spark-shell中读取Linux系统本地文件“/home/hadoop/test.txt”,然后统计出文件的行数; (2)在spark-shell中读取HDFS系统文件“/user/hadoop/test.txt”(如果该文件不存在,请先创建),然后,统计出文件的行数; (3)编写独立应用程序(推荐使用Scala语言),读取HDFS系统文件“/user/hadoop/test.txt”(如果该文件不存在,请先创建),然后,统计出文件的行数;通过sbt工具将整个应用程序编译打包成 JAR包,并将生成的JAR包通过 spark-submit 提交到 Spark 中运行命令。 SimpleApp.scala文件内代码内容如下所示: simple.sbt文件内代码如下: 使用、usr/local/sbt/sbt package命令将以上代码文件夹打jar包,打包完成后可看到打包成功的输出,如下图: 运行打好的jar包,使用命令/usr/local/spark/bin/spark-submit --class “SimpleApp” /usr/local/spark/mycode/HDFStest/target/scala-2.11/a-simple-hdfs-test_2.11-1.0.jar即可,如下: 2.编写独立应用程序实现数据去重 对于两个输入文件 A 和 B,编写 Spark 独立应用程序(推荐使用 Scala 语言),对两个文件进行合并,并剔除其中重复的内容,得到一个新文件 C。下面是输入文件和输出文件的一个样例,供参考。 输入文件 A 的样例如下: 20170101 x 20170102 y 20170103 x 20170104 y 20170105 z 20170106 z 输入文件 B 的样例如下: 20170101 y 20170102 y 20170103 x 20170104 z 20170105 y 根据输入的文件 A 和 B 合并得到的输出文件 C 的样例如下: 20170101 x 20170101 y 20170102 y 20170103 x 20170104 y 20170104 z 20170105 y 20170105 z 20170106 z RemDup.scala文件内代码内容如下所示: 创建simple.sbt文件并向其中输入代码,使用命令/usr/local/sbt/sbt package 将代码文件夹打包,打包成功后结果如下: 再使用命令/usr/local/spark/bin/spark-submit --class "RemDup" /home/hadoop/sparkapp2/RemDup/target/scala-2.11/remove-duplication_2.11-1.0.jar 即可运行,结果如下 3.编写独立应用程序实现求平均值问题 每个输入文件表示班级学生某个学科的成绩,每行内容由两个字段组成,第一个是学生名字,第二个是学生的成绩;编写 Spark 独立应用程序求出所有学生的平均成绩,并输出到一个新文件中。下面是输入文件和输出文件的一个样例,供参考。 Algorithm 成绩: 小明 92 小红 87 小新 82 小丽 90 Database 成绩: 小明 95 小红 81 小新 89 小丽 85 Python 成绩: 小明 82 小红 83 小新 94 小丽 91 平均成绩如下: (小红,83.67) (小新,88.33) (小明,89.67) (小丽,88.67) 创建代码文件夹,并创建存放成绩数据的文件夹datas,使用vim命令向其中添加题示数据: 运行jar包,使用命令/usr/local/spark/bin/spark-submit --class "AvgScore" /home/hadoop/sparkapp3/AvgScore/target/scala-2.11/average-score_2.11-1.0.jar 得到运行结果,每个学生的平均成绩如下: 【实验问题与解决】

一、进行换源处理,这里选的是阿里的yum源 1:在root用户下输入 cd /etc/yum.repos.d 命令,输入ls 并展开 2:将CentOS-Base.repo 进行备份,在root用户下输入命令 curl -o /etc/yum.repos.d/CentOS-Base.repo http://mirrors.aliyun.com/repo/Centos-7.repo 3:第三步完成后输入命令 yum clean all 4:第四步完成后输入命令 yum makecache 直至数据加载完成 5:第五步完成后输入命令 yum update -y 更新安装包 直至完成 6:最后在root用户下输入命令 yum install wget 安装wget软件,安装成功如图所示 二、在基于Red Hat的发行版中,可以使用DNF或YUM工具安装Python: yum install python3 安装完成后,执行以下命令检查安装结果: python3 --version 环境配置 vi ~/.bashrc export SPARK_HOME=/usr/local/spark export PYTHONPATH=$SPARK_HOME/python:$SPARK_HOME/python/lib/py4j-0.10.9.7-src.zip:$PYTHONPATH export PATH=$SPARK_HOME/bin:$PATH source ~/.bashrc 三、打开Pyspark cd /usr/local/spark ./bin/pyspark 现在,你可以开始编写你的第一个PySpark程序。可以使用以下代码创建一个简单的Spark应用程序,计算一个数据集的平均值。 from pyspark.sql import SparkSession # 创建Spark会话 spark = SparkSession.builder \ .appName("MyFirstApp") \ .getOrCreate() # 创建数据 data = [(1, 'Alice'), (2, 'Bob'), (3, 'Cathy')] columns = ["id", "name"] # 创建DataFrame df = spark.createDataFrame(data, columns) # 显示DataFrame内容 df.show() # 计算平均值 average_id = df.groupBy().avg("id").first()[0] print(f"Average ID: {average_id}") # 结束Spark会话 spark.stop() 四、jupytor Install pip: yum install epel-release yum install python3-pip Verify pip Installation: pip3 --version 1. 安装编译工具和依赖库 yum groupinstall "Development Tools" yum install -y python3-devel libffi-devel openssl-devel pkgconfig 2. 安装 argon2 依赖库 yum install -y libargon2-devel 3更新 pip 和 setuptools pip3 install --upgrade pip setuptools wheel 4. 单独安装 argon2-cffi-bindings pip3 install argon2-cffi-bindings 1. 安装Jupyter(Linux终端) 在Linux终端中运行以下命令: pip3 install jupyter 2. 配置Spark环境变量(Linux终端) 在Linux终端中编辑环境变量文件: vi ~/.bashrc export SPARK_HOME=/usr/local/spark export PYTHONPATH=$SPARK_HOME/python:$SPARK_HOME/python/lib/py4j-0.10.9.7-src.zip:$PYTHONPATH export PATH=$SPARK_HOME/bin:$PATH source ~/.bashrc 3. 安装依赖库(Linux终端) 在Linux终端中安装Python依赖: pip3 install findspark pyspark 2、生成jupyter配置文件 #root用户: jupyter notebook --generate-config --allow-root 3、打开ipython,生成密码 ipython from notebook.auth import passwd passwd() – 输入密码并按回车确认 输入密码后,会生成秘钥,稍后用到,秘钥:'argon2:$argon2id$v=19$m=10240,t=10,p=8$mPEEyotF56sm8V1Iv6/6Ng$xM7dYChXNFxeVBG9LaHUcTmj5pLRSOiFB3hfqSPe/Sk' 4、修改配置文件 vi /root/.jupyter/jupyter_notebook_config.py c.NotebookApp.ip = '*' #c.NotebookApp.notebook_dir = '/root/jupyder_dir' c.NotebookApp.open_browser = False c.NotebookApp.password = 'argon2:$argon2id$v=19$m=10240,t=10,p=8$mPEEyotF56sm8V1Iv6/6Ng$xM7dYChXNFxeVBG9LaHUcTmj5pLRSOiFB3hfqSPe/Sk' c.NotebookApp.port = 8888 c.NotebookApp.allow_remote_access = True c.NotebookApp.allow_root = True 5、启动jupyter – root用户 jupyter notebook --allow-root

2023-06-02 23:12:37 WARN NativeCodeLoader:60 - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable Setting default log level to "WARN". To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel). Traceback (most recent call last): File "mysqlTest.py", line 12, in <module> jdbcDF=spark.read.format("jdbc").option("url","jdbc:mysql://localhost:3306/sparktest?useSSL=false").option("driver","com.mysql.cj.jdbc.Driver").option("dbtable","employee").option("user", "root").option("password", "123456").load() File "/usr/local/spark/python/pyspark/sql/readwriter.py", line 172, in load return self._df(self._jreader.load()) File "/usr/local/spark/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 1257, in __call__ File "/usr/local/spark/python/pyspark/sql/utils.py", line 63, in deco return f(*a, **kw) File "/usr/local/spark/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py", line 328, in get_return_value py4j.protocol.Py4JJavaError: An error occurred while calling o31.load. : java.lang.ClassNotFoundException: com.mysql.cj.jdbc.Driver at java.net.URLClassLoader.findClass(URLClassLoader.java:381) at java.lang.ClassLoader.loadClass(ClassLoader.java:424) at java.lang.ClassLoader.loadClass(ClassLoader.java:357) at org.apache.spark.sql.execution.datasources.jdbc.DriverRegistry$.register(DriverRegistry.scala:45) at org.apache.spark.sql.execution.datasources.jdbc.JDBCOptions$$anonfun$5.apply(JDBCOptions.scala:99) at org.apache.spark.sql.execution.datasources.jdbc.JDBCOptions$$anonfun$5.apply(JDBCOptions.scala:99) at scala.Option.foreach(Option.scala:257) at org.apache.spark.sql.execution.datasources.jdbc.JDBCOptions.<init>(JDBCOptions.scala:99) at org.apache.spark.sql.execution.datasources.jdbc.JDBCOptions.<init>(JDBCOptions.scala:35) at org.apache.spark.sql.execution.datasources.jdbc.JdbcRelationProvider.createRelation(JdbcRelationProvider.scala:32) at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:318) at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:223) at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:211) at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:167) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:498) at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244) at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357) at py4j.Gateway.invoke(Gateway.java:282) at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132) at py4j.commands.CallCommand.execute(CallCommand.java:79) at py4j.GatewayConnection.run(GatewayConnection.java:238) at java.lang.Thread.run(Thread.java:748)

WARNING: Using incubator modules: jdk.incubator.vector Using Spark's default log4j profile: org/apache/spark/log4j2-defaults.properties Setting default log level to "WARN". To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel). 25/08/07 11:16:29 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable ❌ 驱动加载失败: An error occurred while calling z:java.lang.Class.forName. : java.lang.ClassNotFoundException: com.mysql.cj.jdbc.Driver at java.base/java.net.URLClassLoader.findClass(URLClassLoader.java:445) at java.base/java.lang.ClassLoader.loadClass(ClassLoader.java:592) at java.base/java.lang.ClassLoader.loadClass(ClassLoader.java:525) at java.base/java.lang.Class.forName0(Native Method) at java.base/java.lang.Class.forName(Class.java:467) at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:77) at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.base/java.lang.reflect.Method.invoke(Method.java:568) at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244) at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:374) at py4j.Gateway.invoke(Gateway.java:282) at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132) at py4j.commands.CallCommand.execute(CallCommand.java:79) at py4j.ClientServerConnection.waitForCommands(ClientServerConnection.java:184) at py4j.ClientServerConnection.run(ClientServerConnection.java:108) at java.base/java.lang.Thread.run(Thread.java:842) ⚠️ 请检查: 1. jar包位置 2. spark-defaults.conf配置 3. 文件权限 进程已结束,退出代码为 0 指定jar包位置后 config("spark.jars", "/export/server/spark/jars/mysql-connector-java-8.0.28.jar"). 25/08/07 11:20:46 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable Using Spark's default log4j profile: org/apache/spark/log4j2-defaults.properties Setting default log level to "WARN". To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel). ✅ MySQL驱动已加载: com.mysql.cj.jdbc.Driver Traceback (most recent call last): File "/tmp/pycharm_project_49/00_example/test.py", line 26, in <module> .load() ^^^^^^ File "/root/.virtualenvs/pySpark/lib/python3.12/site-packages/pyspark/sql/readwriter.py", line 318, in load return self._df(self._jreader.load()) ^^^^^^^^^^^^^^^^^^^^ File "/root/.virtualenvs/pySpark/lib/python3.12/site-packages/py4j/java_gateway.py", line 1362, in __call__ return_value = get_return_value( ^^^^^^^^^^^^^^^^^ File "/root/.virtualenvs/pySpark/lib/python3.12/site-packages/pyspark/errors/exceptions/captured.py", line 282, in deco return f(*a, **kw) ^^^^^^^^^^^ File "/root/.virtualenvs/pySpark/lib/python3.12/site-packages/py4j/protocol.py", line 327, in get_return_value raise Py4JJavaError( py4j.protocol.Py4JJavaError: An error occurred while calling o39.load. : java.sql.SQLException: No suitable driver at java.sql/java.sql.DriverManager.getDriver(DriverManager.java:299) at org.apache.spark.sql.execution.datasources.jdbc.JDBCOptions.$anonfun$driverClass$2(JDBCOptions.scala:118) at scala.Option.getOrElse(Option.scala:201)

SLF4J: Class path contains multiple SLF4J bindings. SLF4J: Found binding in [jar:file:/opt/spark/spark-2.4.0-bin-hadoop2.7/jars/slf4j-log4j12-1.7.16.jar!/org/slf4j/impl/StaticLoggerBinder.class] SLF4J: Found binding in [jar:file:/opt/Hadoop/hadoop-2.7.2/share/hadoop/common/lib/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class] SLF4J: Found binding in [jar:file:/opt/hbase-1.2.6/lib/slf4j-log4j12-1.7.5.jar!/org/slf4j/impl/StaticLoggerBinder.class] SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation. SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory] 25/03/21 15:50:53 WARN Utils: Your hostname, master resolves to a loopback address: 127.0.0.1; using 192.168.180.130 instead (on interface ens33) 25/03/21 15:50:53 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address 25/03/21 15:50:53 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable Setting default log level to "WARN". To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel). 25/03/21 15:50:56 WARN FileStreamSink: Error while looking for metadata directory. Traceback (most recent call last): File "/home/abz/PycharmProjects/untitled1/2024.10.22.test.py", line 6, in <module> df = spark.read.csv("传媒综合.csv", header=True, inferSchema=True, sep="|", encoding='utf-8') File "/opt/spark/spark-2.4.0-bin-hadoop2.7/python/pyspark/sql/readwriter.py", line 472, in csv return self._df(self._jreader.csv(self._spark._sc._jvm.PythonUtils.toSeq(path))) File "/opt/spark/spark-2.4.0-bin-hadoop2.7/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 1257, in __call__ File "/opt/spark/spark-2.4.0-bin-hadoop2.7/python/pyspark/sql/utils.py", line 63, in deco return f(*a, **kw) File "/opt/spark/spark-2.4.0-bin-hadoop2.7/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py", line 328, in get_return_value py4j.protocol.Py4JJavaError: An erro

import os from pyspark.sql import SparkSession from pyspark.ml.classification import LogisticRegression from pyspark.ml.evaluation import MulticlassClassificationEvaluator from T2 import spark # 设置环境变量 spark_files = 'D:\spark实训\spark_files' os.environ['PYSPARK_PYTHON'] = os.path.join(spark_files, 'spark_virenvs\python.exe') os.environ['SPARK_HOME'] = os.path.join(spark_files, 'spark-3.4.4-bin-hadoop3') os.environ['JAVA_HOME'] = os.path.join(spark_files, 'Java\jdk-18.0.1') # 加载数据集 train_data = spark.read.csv("D:\spark大数据快速运算大作业\数据集\题目二\data_train.txt", inferSchema=True, header=False) test_data = spark.read.csv("D:\spark大数据快速运算大作业\数据集\题目二\data_test.txt", inferSchema=True, header=False) # 数据预处理 from pyspark.ml.feature import VectorAssembler assembler = VectorAssembler( inputCols=[f"_{i}" for i in range(1, 55)], # 前54列为特征 outputCol="features" ) train = assembler.transform(train_data).withColumnRenamed("_55", "label") test = assembler.transform(test_data).withColumnRenamed("_55", "label") from pyspark.ml.classification import RandomForestClassifier rf = RandomForestClassifier( numTrees=100, maxDepth=10, seed=42, labelCol="label", featuresCol="features" ) rf_model = rf.fit(train) rf_predictions = rf_model.transform(test) lr = LogisticRegression( maxIter=100, regParam=0.01, family="multinomial", # 多分类设置 labelCol="label", featuresCol="features" ) lr_model = lr.fit(train) lr_predictions = lr_model.transform(test) evaluator = MulticlassClassificationEvaluator( labelCol="label", predictionCol="prediction", metricName="accuracy" ) rf_accuracy = evaluator.evaluate(rf_predictions) lr_accuracy = evaluator.evaluate(lr_predictions) # F1分数评估 evaluator.setMetricName("f1") rf_f1 = evaluator.evaluate(rf_predictions) lr_f1 = evaluator.evaluate(lr_predictions) # 模型训练与评估完整流程 def train_evaluate(model, train_data, test_data): model = model.fit(train_data) predictions = model.transform(test_data) # 计算评估指标 accuracy = evaluator.setMetricName("accuracy").evaluate(predictions) f1 = evaluator.setMetricName("f1").evaluate(predictions) # 特征重要性(随机森林特有) if isinstance(model, RandomForestClassificationModel): importances = model.featureImportances print("Top 5 features:", importances.values.argsort()[-5:][::-1]) return accuracy, f1 # 执行比较 rf_acc, rf_f1 = train_evaluate(rf, train, test) lr_acc, lr_f1 = train_evaluate(lr, train, test)D:\spark实训\spark_files\spark_virenvs\python.exe D:\spark大数据快速运算大作业\22.py 25/06/12 17:24:08 WARN Shell: Did not find winutils.exe: java.io.FileNotFoundException: java.io.FileNotFoundException: HADOOP_HOME and hadoop.home.dir are unset. -see https://wiki.apache.org/hadoop/WindowsProblems Setting default log level to "WARN". To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel). 25/06/12 17:24:08 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable 正在加载数据... 训练随机森林模型... 随机森林训练时间: 21.42秒 随机森林评估结果: 准确率: 0.6507 F1分数: 0.6208 训练逻辑回归模型... 
25/06/12 17:24:38 WARN InstanceBuilder: Failed to load implementation from:dev.ludovic.netlib.blas.JNIBLAS 逻辑回归训练时间: 14.64秒 逻辑回归评估结果: 准确率: 0.6712 F1分数: 0.6558 ===== 模型比较 ===== 随机森林训练时间: 21.42秒 vs 逻辑回归训练时间: 14.64秒 随机森林准确率: 0.6507 vs 逻辑回归准确率: 0.6712 随机森林F1分数: 0.6208 vs 逻辑回归F1分数: 0.6558 随机森林特征重要性(前10): 特征 0: 0.3683 特征 13: 0.2188 特征 35: 0.0529 特征 23: 0.0500 特征 25: 0.0368 特征 51: 0.0336 特征 52: 0.0281 特征 17: 0.0279 特征 5: 0.0244 特征 36: 0.0236 25/06/12 17:24:49 ERROR Instrumentation: java.lang.RuntimeException: java.io.FileNotFoundException: java.io.FileNotFoundException: HADOOP_HOME and hadoop.home.dir are unset. -see https://wiki.apache.org/hadoop/WindowsProblems at org.apache.hadoop.util.Shell.getWinUtilsPath(Shell.java:735) at org.apache.hadoop.util.Shell.getSetPermissionCommand(Shell.java:270) at org.apache.hadoop.util.Shell.getSetPermissionCommand(Shell.java:286) at org.apache.hadoop.fs.RawLocalFileSystem.setPermission(RawLocalFileSystem.java:978) at org.apache.hadoop.fs.RawLocalFileSystem.mkOneDirWithMode(RawLocalFileSystem.java:660) at org.apache.hadoop.fs.RawLocalFileSystem.mkdirsWithOptionalPermission(RawLocalFileSystem.java:700) at org.apache.hadoop.fs.RawLocalFileSystem.mkdirs(RawLocalFileSystem.java:672) at org.apache.hadoop.fs.RawLocalFileSystem.mkdirsWithOptionalPermission(RawLocalFileSystem.java:699) at org.apache.hadoop.fs.RawLocalFileSystem.mkdirs(RawLocalFileSystem.java:672) at org.apache.hadoop.fs.RawLocalFileSystem.mkdirsWithOptionalPermission(RawLocalFileSystem.java:699) at org.apache.hadoop.fs.RawLocalFileSystem.mkdirs(RawLocalFileSystem.java:672) at org.apache.hadoop.fs.RawLocalFileSystem.mkdirsWithOptionalPermission(RawLocalFileSystem.java:699) at org.apache.hadoop.fs.RawLocalFileSystem.mkdirs(RawLocalFileSystem.java:672) at org.apache.hadoop.fs.ChecksumFileSystem.mkdirs(ChecksumFileSystem.java:788) at org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.setupJob(FileOutputCommitter.java:356) at org.apache.hadoop.mapred.FileOutputCommitter.setupJob(FileOutputCommitter.java:131) at org.apache.hadoop.mapred.OutputCommitter.setupJob(OutputCommitter.java:265) at org.apache.spark.internal.io.HadoopMapReduceCommitProtocol.setupJob(HadoopMapReduceCommitProtocol.scala:188) at org.apache.spark.internal.io.SparkHadoopWriter$.write(SparkHadoopWriter.scala:79) at org.apache.spark.rdd.PairRDDFunctions.$anonfun$saveAsHadoopDataset$1(PairRDDFunctions.scala:1091) at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23) at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112) at org.apache.spark.rdd.RDD.withScope(RDD.scala:405) at org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopDataset(PairRDDFunctions.scala:1089) at org.apache.spark.rdd.PairRDDFunctions.$anonfun$saveAsHadoopFile$4(PairRDDFunctions.scala:1062) at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23) at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112) at org.apache.spark.rdd.RDD.withScope(RDD.scala:405) at org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopFile(PairRDDFunctions.scala:1027) at org.apache.spark.rdd.PairRDDFunctions.$anonfun$saveAsHadoopFile$3(PairRDDFunctions.scala:1009) at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23) at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) 
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112) at org.apache.spark.rdd.RDD.withScope(RDD.scala:405) at org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopFile(PairRDDFunctions.scala:1008) at org.apache.spark.rdd.PairRDDFunctions.$anonfun$saveAsHadoopFile$2(PairRDDFunctions.scala:965) at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23) at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112) at org.apache.spark.rdd.RDD.withScope(RDD.scala:405) at org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopFile(PairRDDFunctions.scala:963) at org.apache.spark.rdd.RDD.$anonfun$saveAsTextFile$2(RDD.scala:1593) at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23) at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112) at org.apache.spark.rdd.RDD.withScope(RDD.scala:405) at org.apache.spark.rdd.RDD.saveAsTextFile(RDD.scala:1593) at org.apache.spark.rdd.RDD.$anonfun$saveAsTextFile$1(RDD.scala:1579) at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23) at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112) at org.apache.spark.rdd.RDD.withScope(RDD.scala:405) at org.apache.spark.rdd.RDD.saveAsTextFile(RDD.scala:1579) at org.apache.spark.ml.util.DefaultParamsWriter$.saveMetadata(ReadWrite.scala:413) at org.apache.spark.ml.Pipeline$SharedReadWrite$.$anonfun$saveImpl$1(Pipeline.scala:250) at org.apache.spark.ml.Pipeline$SharedReadWrite$.$anonfun$saveImpl$1$adapted(Pipeline.scala:247) at org.apache.spark.ml.util.Instrumentation$.$anonfun$instrumented$1(Instrumentation.scala:191) at scala.util.Try$.apply(Try.scala:213) at org.apache.spark.ml.util.Instrumentation$.instrumented(Instrumentation.scala:191) at org.apache.spark.ml.Pipeline$SharedReadWrite$.saveImpl(Pipeline.scala:247) at org.apache.spark.ml.PipelineModel$PipelineModelWriter.saveImpl(Pipeline.scala:346) at org.apache.spark.ml.util.MLWriter.save(ReadWrite.scala:168) at org.apache.spark.ml.PipelineModel$PipelineModelWriter.super$save(Pipeline.scala:344) at org.apache.spark.ml.PipelineModel$PipelineModelWriter.$anonfun$save$4(Pipeline.scala:344) at org.apache.spark.ml.MLEvents.withSaveInstanceEvent(events.scala:174) at org.apache.spark.ml.MLEvents.withSaveInstanceEvent$(events.scala:169) at org.apache.spark.ml.util.Instrumentation.withSaveInstanceEvent(Instrumentation.scala:42) at org.apache.spark.ml.PipelineModel$PipelineModelWriter.$anonfun$save$3(Pipeline.scala:344) at org.apache.spark.ml.PipelineModel$PipelineModelWriter.$anonfun$save$3$adapted(Pipeline.scala:344) at org.apache.spark.ml.util.Instrumentation$.$anonfun$instrumented$1(Instrumentation.scala:191) at scala.util.Try$.apply(Try.scala:213) at org.apache.spark.ml.util.Instrumentation$.instrumented(Instrumentation.scala:191) at org.apache.spark.ml.PipelineModel$PipelineModelWriter.save(Pipeline.scala:344) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:498) at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244) at 
py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:374) at py4j.Gateway.invoke(Gateway.java:282) at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132) at py4j.commands.CallCommand.execute(CallCommand.java:79) at py4j.ClientServerConnection.waitForCommands(ClientServerConnection.java:182) at py4j.ClientServerConnection.run(ClientServerConnection.java:106) at java.lang.Thread.run(Thread.java:748) Caused by: java.io.FileNotFoundException: java.io.FileNotFoundException: HADOOP_HOME and hadoop.home.dir are unset. -see https://wiki.apache.org/hadoop/WindowsProblems at org.apache.hadoop.util.Shell.fileNotFoundException(Shell.java:547) at org.apache.hadoop.util.Shell.getHadoopHomeDir(Shell.java:568) at org.apache.hadoop.util.Shell.getQualifiedBin(Shell.java:591) at org.apache.hadoop.util.Shell.<clinit>(Shell.java:688) at org.apache.hadoop.util.StringUtils.<clinit>(StringUtils.java:79) at org.apache.hadoop.conf.Configuration.getTimeDurationHelper(Configuration.java:1907) at org.apache.hadoop.conf.Configuration.getTimeDuration(Configuration.java:1867) at org.apache.hadoop.conf.Configuration.getTimeDuration(Configuration.java:1840) at org.apache.hadoop.util.ShutdownHookManager.getShutdownTimeout(ShutdownHookManager.java:183) at org.apache.hadoop.util.ShutdownHookManager$HookEntry.<init>(ShutdownHookManager.java:207) at org.apache.hadoop.util.ShutdownHookManager.addShutdownHook(ShutdownHookManager.java:304) at org.apache.spark.util.SparkShutdownHookManager.install(ShutdownHookManager.scala:181) at org.apache.spark.util.ShutdownHookManager$.shutdownHooks$lzycompute(ShutdownHookManager.scala:50) at org.apache.spark.util.ShutdownHookManager$.shutdownHooks(ShutdownHookManager.scala:48) at org.apache.spark.util.ShutdownHookManager$.addShutdownHook(ShutdownHookManager.scala:153) at org.apache.spark.util.ShutdownHookManager$.<init>(ShutdownHookManager.scala:58) at org.apache.spark.util.ShutdownHookManager$.<clinit>(ShutdownHookManager.scala) at org.apache.spark.util.Utils$.createTempDir(Utils.scala:341) at org.apache.spark.util.Utils$.createTempDir(Utils.scala:331) at org.apache.spark.deploy.SparkSubmit.prepareSubmitEnvironment(SparkSubmit.scala:370) at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:955) at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:192) at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:215) at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:91) at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:1111) at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:1120) at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala) Caused by: java.io.FileNotFoundException: HADOOP_HOME and hadoop.home.dir are unset. at org.apache.hadoop.util.Shell.checkHadoopHomeInner(Shell.java:467) at org.apache.hadoop.util.Shell.checkHadoopHome(Shell.java:438) at org.apache.hadoop.util.Shell.<clinit>(Shell.java:515) ... 23 more 25/06/12 17:24:49 ERROR Instrumentation: java.lang.RuntimeException: java.io.FileNotFoundException: java.io.FileNotFoundException: HADOOP_HOME and hadoop.home.dir are unset. 
-see https://wiki.apache.org/hadoop/WindowsProblems at org.apache.hadoop.util.Shell.getWinUtilsPath(Shell.java:735) at org.apache.hadoop.util.Shell.getSetPermissionCommand(Shell.java:270) at org.apache.hadoop.util.Shell.getSetPermissionCommand(Shell.java:286) at org.apache.hadoop.fs.RawLocalFileSystem.setPermission(RawLocalFileSystem.java:978) at org.apache.hadoop.fs.RawLocalFileSystem.mkOneDirWithMode(RawLocalFileSystem.java:660) at org.apache.hadoop.fs.RawLocalFileSystem.mkdirsWithOptionalPermission(RawLocalFileSystem.java:700) at org.apache.hadoop.fs.RawLocalFileSystem.mkdirs(RawLocalFileSystem.java:672) at org.apache.hadoop.fs.RawLocalFileSystem.mkdirsWithOptionalPermission(RawLocalFileSystem.java:699) at org.apache.hadoop.fs.RawLocalFileSystem.mkdirs(RawLocalFileSystem.java:672) at org.apache.hadoop.fs.RawLocalFileSystem.mkdirsWithOptionalPermission(RawLocalFileSystem.java:699) at org.apache.hadoop.fs.RawLocalFileSystem.mkdirs(RawLocalFileSystem.java:672) at org.apache.hadoop.fs.RawLocalFileSystem.mkdirsWithOptionalPermission(RawLocalFileSystem.java:699) at org.apache.hadoop.fs.RawLocalFileSystem.mkdirs(RawLocalFileSystem.java:672) at org.apache.hadoop.fs.ChecksumFileSystem.mkdirs(ChecksumFileSystem.java:788) at org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.setupJob(FileOutputCommitter.java:356) at org.apache.hadoop.mapred.FileOutputCommitter.setupJob(FileOutputCommitter.java:131) at org.apache.hadoop.mapred.OutputCommitter.setupJob(OutputCommitter.java:265) at org.apache.spark.internal.io.HadoopMapReduceCommitProtocol.setupJob(HadoopMapReduceCommitProtocol.scala:188) at org.apache.spark.internal.io.SparkHadoopWriter$.write(SparkHadoopWriter.scala:79) at org.apache.spark.rdd.PairRDDFunctions.$anonfun$saveAsHadoopDataset$1(PairRDDFunctions.scala:1091) at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23) at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112) at org.apache.spark.rdd.RDD.withScope(RDD.scala:405) at org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopDataset(PairRDDFunctions.scala:1089) at org.apache.spark.rdd.PairRDDFunctions.$anonfun$saveAsHadoopFile$4(PairRDDFunctions.scala:1062) at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23) at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112) at org.apache.spark.rdd.RDD.withScope(RDD.scala:405) at org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopFile(PairRDDFunctions.scala:1027) at org.apache.spark.rdd.PairRDDFunctions.$anonfun$saveAsHadoopFile$3(PairRDDFunctions.scala:1009) at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23) at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112) at org.apache.spark.rdd.RDD.withScope(RDD.scala:405) at org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopFile(PairRDDFunctions.scala:1008) at org.apache.spark.rdd.PairRDDFunctions.$anonfun$saveAsHadoopFile$2(PairRDDFunctions.scala:965) at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23) at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112) at org.apache.spark.rdd.RDD.withScope(RDD.scala:405) at 
org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopFile(PairRDDFunctions.scala:963) at org.apache.spark.rdd.RDD.$anonfun$saveAsTextFile$2(RDD.scala:1593) at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23) at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112) at org.apache.spark.rdd.RDD.withScope(RDD.scala:405) at org.apache.spark.rdd.RDD.saveAsTextFile(RDD.scala:1593) at org.apache.spark.rdd.RDD.$anonfun$saveAsTextFile$1(RDD.scala:1579) at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23) at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112) at org.apache.spark.rdd.RDD.withScope(RDD.scala:405) at org.apache.spark.rdd.RDD.saveAsTextFile(RDD.scala:1579) at org.apache.spark.ml.util.DefaultParamsWriter$.saveMetadata(ReadWrite.scala:413) at org.apache.spark.ml.Pipeline$SharedReadWrite$.$anonfun$saveImpl$1(Pipeline.scala:250) at org.apache.spark.ml.Pipeline$SharedReadWrite$.$anonfun$saveImpl$1$adapted(Pipeline.scala:247) at org.apache.spark.ml.util.Instrumentation$.$anonfun$instrumented$1(Instrumentation.scala:191) at scala.util.Try$.apply(Try.scala:213) at org.apache.spark.ml.util.Instrumentation$.instrumented(Instrumentation.scala:191) at org.apache.spark.ml.Pipeline$SharedReadWrite$.saveImpl(Pipeline.scala:247) at org.apache.spark.ml.PipelineModel$PipelineModelWriter.saveImpl(Pipeline.scala:346) at org.apache.spark.ml.util.MLWriter.save(ReadWrite.scala:168) at org.apache.spark.ml.PipelineModel$PipelineModelWriter.super$save(Pipeline.scala:344) at org.apache.spark.ml.PipelineModel$PipelineModelWriter.$anonfun$save$4(Pipeline.scala:344) at org.apache.spark.ml.MLEvents.withSaveInstanceEvent(events.scala:174) at org.apache.spark.ml.MLEvents.withSaveInstanceEvent$(events.scala:169) at org.apache.spark.ml.util.Instrumentation.withSaveInstanceEvent(Instrumentation.scala:42) at org.apache.spark.ml.PipelineModel$PipelineModelWriter.$anonfun$save$3(Pipeline.scala:344) at org.apache.spark.ml.PipelineModel$PipelineModelWriter.$anonfun$save$3$adapted(Pipeline.scala:344) at org.apache.spark.ml.util.Instrumentation$.$anonfun$instrumented$1(Instrumentation.scala:191) at scala.util.Try$.apply(Try.scala:213) at org.apache.spark.ml.util.Instrumentation$.instrumented(Instrumentation.scala:191) at org.apache.spark.ml.PipelineModel$PipelineModelWriter.save(Pipeline.scala:344) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:498) at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244) at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:374) at py4j.Gateway.invoke(Gateway.java:282) at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132) at py4j.commands.CallCommand.execute(CallCommand.java:79) at py4j.ClientServerConnection.waitForCommands(ClientServerConnection.java:182) at py4j.ClientServerConnection.run(ClientServerConnection.java:106) at java.lang.Thread.run(Thread.java:748) Caused by: java.io.FileNotFoundException: java.io.FileNotFoundException: HADOOP_HOME and hadoop.home.dir are unset. 
-see https://wiki.apache.org/hadoop/WindowsProblems at org.apache.hadoop.util.Shell.fileNotFoundException(Shell.java:547) at org.apache.hadoop.util.Shell.getHadoopHomeDir(Shell.java:568) at org.apache.hadoop.util.Shell.getQualifiedBin(Shell.java:591) at org.apache.hadoop.util.Shell.<clinit>(Shell.java:688) at org.apache.hadoop.util.StringUtils.<clinit>(StringUtils.java:79) at org.apache.hadoop.conf.Configuration.getTimeDurationHelper(Configuration.java:1907) at org.apache.hadoop.conf.Configuration.getTimeDuration(Configuration.java:1867) at org.apache.hadoop.conf.Configuration.getTimeDuration(Configuration.java:1840) at org.apache.hadoop.util.ShutdownHookManager.getShutdownTimeout(ShutdownHookManager.java:183) at org.apache.hadoop.util.ShutdownHookManager$HookEntry.<init>(ShutdownHookManager.java:207) at org.apache.hadoop.util.ShutdownHookManager.addShutdownHook(ShutdownHookManager.java:304) at org.apache.spark.util.SparkShutdownHookManager.install(ShutdownHookManager.scala:181) at org.apache.spark.util.ShutdownHookManager$.shutdownHooks$lzycompute(ShutdownHookManager.scala:50) at org.apache.spark.util.ShutdownHookManager$.shutdownHooks(ShutdownHookManager.scala:48) at org.apache.spark.util.ShutdownHookManager$.addShutdownHook(ShutdownHookManager.scala:153) at org.apache.spark.util.ShutdownHookManager$.<init>(ShutdownHookManager.scala:58) at org.apache.spark.util.ShutdownHookManager$.<clinit>(ShutdownHookManager.scala) at org.apache.spark.util.Utils$.createTempDir(Utils.scala:341) at org.apache.spark.util.Utils$.createTempDir(Utils.scala:331) at org.apache.spark.deploy.SparkSubmit.prepareSubmitEnvironment(SparkSubmit.scala:370) at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:955) at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:192) at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:215) at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:91) at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:1111) at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:1120) at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala) Caused by: java.io.FileNotFoundException: HADOOP_HOME and hadoop.home.dir are unset. at org.apache.hadoop.util.Shell.checkHadoopHomeInner(Shell.java:467) at org.apache.hadoop.util.Shell.checkHadoopHome(Shell.java:438) at org.apache.hadoop.util.Shell.<clinit>(Shell.java:515) ... 23 more Traceback (most recent call last): File "D:\spark大数据快速运算大作业\22.py", line 6, in <module> from T2 import spark File "D:\spark大数据快速运算大作业\T2.py", line 113, in <module> rf_model.write().overwrite().save("models/random_forest_model") File "D:\spark实训\spark_files\spark_virenvs\lib\site-packages\pyspark\ml\util.py", line 197, in save self._jwrite.save(path) File "D:\spark实训\spark_files\spark_virenvs\lib\site-packages\py4j\java_gateway.py", line 1322, in __call__ return_value = get_return_value( File "D:\spark实训\spark_files\spark_virenvs\lib\site-packages\pyspark\errors\exceptions\captured.py", line 169, in deco return f(*a, **kw) File "D:\spark实训\spark_files\spark_virenvs\lib\site-packages\py4j\protocol.py", line 326, in get_return_value raise Py4JJavaError( py4j.protocol.Py4JJavaError: An error occurred while calling o963.save. : java.lang.RuntimeException: java.io.FileNotFoundException: java.io.FileNotFoundException: HADOOP_HOME and hadoop.home.dir are unset. 
-see https://wiki.apache.org/hadoop/WindowsProblems at org.apache.hadoop.util.Shell.getWinUtilsPath(Shell.java:735) at org.apache.hadoop.util.Shell.getSetPermissionCommand(Shell.java:270) at org.apache.hadoop.util.Shell.getSetPermissionCommand(Shell.java:286) at org.apache.hadoop.fs.RawLocalFileSystem.setPermission(RawLocalFileSystem.java:978) at org.apache.hadoop.fs.RawLocalFileSystem.mkOneDirWithMode(RawLocalFileSystem.java:660) at org.apache.hadoop.fs.RawLocalFileSystem.mkdirsWithOptionalPermission(RawLocalFileSystem.java:700) at org.apache.hadoop.fs.RawLocalFileSystem.mkdirs(RawLocalFileSystem.java:672) at org.apache.hadoop.fs.RawLocalFileSystem.mkdirsWithOptionalPermission(RawLocalFileSystem.java:699) at org.apache.hadoop.fs.RawLocalFileSystem.mkdirs(RawLocalFileSystem.java:672) at org.apache.hadoop.fs.RawLocalFileSystem.mkdirsWithOptionalPermission(RawLocalFileSystem.java:699) at org.apache.hadoop.fs.RawLocalFileSystem.mkdirs(RawLocalFileSystem.java:672) at org.apache.hadoop.fs.RawLocalFileSystem.mkdirsWithOptionalPermission(RawLocalFileSystem.java:699) at org.apache.hadoop.fs.RawLocalFileSystem.mkdirs(RawLocalFileSystem.java:672) at org.apache.hadoop.fs.ChecksumFileSystem.mkdirs(ChecksumFileSystem.java:788) at org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.setupJob(FileOutputCommitter.java:356) at org.apache.hadoop.mapred.FileOutputCommitter.setupJob(FileOutputCommitter.java:131) at org.apache.hadoop.mapred.OutputCommitter.setupJob(OutputCommitter.java:265) at org.apache.spark.internal.io.HadoopMapReduceCommitProtocol.setupJob(HadoopMapReduceCommitProtocol.scala:188) at org.apache.spark.internal.io.SparkHadoopWriter$.write(SparkHadoopWriter.scala:79) at org.apache.spark.rdd.PairRDDFunctions.$anonfun$saveAsHadoopDataset$1(PairRDDFunctions.scala:1091) at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23) at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112) at org.apache.spark.rdd.RDD.withScope(RDD.scala:405) at org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopDataset(PairRDDFunctions.scala:1089) at org.apache.spark.rdd.PairRDDFunctions.$anonfun$saveAsHadoopFile$4(PairRDDFunctions.scala:1062) at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23) at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112) at org.apache.spark.rdd.RDD.withScope(RDD.scala:405) at org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopFile(PairRDDFunctions.scala:1027) at org.apache.spark.rdd.PairRDDFunctions.$anonfun$saveAsHadoopFile$3(PairRDDFunctions.scala:1009) at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23) at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112) at org.apache.spark.rdd.RDD.withScope(RDD.scala:405) at org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopFile(PairRDDFunctions.scala:1008) at org.apache.spark.rdd.PairRDDFunctions.$anonfun$saveAsHadoopFile$2(PairRDDFunctions.scala:965) at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23) at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112) at org.apache.spark.rdd.RDD.withScope(RDD.scala:405) at 
org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopFile(PairRDDFunctions.scala:963) at org.apache.spark.rdd.RDD.$anonfun$saveAsTextFile$2(RDD.scala:1593) at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23) at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112) at org.apache.spark.rdd.RDD.withScope(RDD.scala:405) at org.apache.spark.rdd.RDD.saveAsTextFile(RDD.scala:1593) at org.apache.spark.rdd.RDD.$anonfun$saveAsTextFile$1(RDD.scala:1579) at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23) at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112) at org.apache.spark.rdd.RDD.withScope(RDD.scala:405) at org.apache.spark.rdd.RDD.saveAsTextFile(RDD.scala:1579) at org.apache.spark.ml.util.DefaultParamsWriter$.saveMetadata(ReadWrite.scala:413) at org.apache.spark.ml.Pipeline$SharedReadWrite$.$anonfun$saveImpl$1(Pipeline.scala:250) at org.apache.spark.ml.Pipeline$SharedReadWrite$.$anonfun$saveImpl$1$adapted(Pipeline.scala:247) at org.apache.spark.ml.util.Instrumentation$.$anonfun$instrumented$1(Instrumentation.scala:191) at scala.util.Try$.apply(Try.scala:213) at org.apache.spark.ml.util.Instrumentation$.instrumented(Instrumentation.scala:191) at org.apache.spark.ml.Pipeline$SharedReadWrite$.saveImpl(Pipeline.scala:247) at org.apache.spark.ml.PipelineModel$PipelineModelWriter.saveImpl(Pipeline.scala:346) at org.apache.spark.ml.util.MLWriter.save(ReadWrite.scala:168) at org.apache.spark.ml.PipelineModel$PipelineModelWriter.super$save(Pipeline.scala:344) at org.apache.spark.ml.PipelineModel$PipelineModelWriter.$anonfun$save$4(Pipeline.scala:344) at org.apache.spark.ml.MLEvents.withSaveInstanceEvent(events.scala:174) at org.apache.spark.ml.MLEvents.withSaveInstanceEvent$(events.scala:169) at org.apache.spark.ml.util.Instrumentation.withSaveInstanceEvent(Instrumentation.scala:42) at org.apache.spark.ml.PipelineModel$PipelineModelWriter.$anonfun$save$3(Pipeline.scala:344) at org.apache.spark.ml.PipelineModel$PipelineModelWriter.$anonfun$save$3$adapted(Pipeline.scala:344) at org.apache.spark.ml.util.Instrumentation$.$anonfun$instrumented$1(Instrumentation.scala:191) at scala.util.Try$.apply(Try.scala:213) at org.apache.spark.ml.util.Instrumentation$.instrumented(Instrumentation.scala:191) at org.apache.spark.ml.PipelineModel$PipelineModelWriter.save(Pipeline.scala:344) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:498) at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244) at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:374) at py4j.Gateway.invoke(Gateway.java:282) at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132) at py4j.commands.CallCommand.execute(CallCommand.java:79) at py4j.ClientServerConnection.waitForCommands(ClientServerConnection.java:182) at py4j.ClientServerConnection.run(ClientServerConnection.java:106) at java.lang.Thread.run(Thread.java:748) Caused by: java.io.FileNotFoundException: java.io.FileNotFoundException: HADOOP_HOME and hadoop.home.dir are unset. 
-see https://wiki.apache.org/hadoop/WindowsProblems at org.apache.hadoop.util.Shell.fileNotFoundException(Shell.java:547) at org.apache.hadoop.util.Shell.getHadoopHomeDir(Shell.java:568) at org.apache.hadoop.util.Shell.getQualifiedBin(Shell.java:591) at org.apache.hadoop.util.Shell.<clinit>(Shell.java:688) at org.apache.hadoop.util.StringUtils.<clinit>(StringUtils.java:79) at org.apache.hadoop.conf.Configuration.getTimeDurationHelper(Configuration.java:1907) at org.apache.hadoop.conf.Configuration.getTimeDuration(Configuration.java:1867) at org.apache.hadoop.conf.Configuration.getTimeDuration(Configuration.java:1840) at org.apache.hadoop.util.ShutdownHookManager.getShutdownTimeout(ShutdownHookManager.java:183) at org.apache.hadoop.util.ShutdownHookManager$HookEntry.<init>(ShutdownHookManager.java:207) at org.apache.hadoop.util.ShutdownHookManager.addShutdownHook(ShutdownHookManager.java:304) at org.apache.spark.util.SparkShutdownHookManager.install(ShutdownHookManager.scala:181) at org.apache.spark.util.ShutdownHookManager$.shutdownHooks$lzycompute(ShutdownHookManager.scala:50) at org.apache.spark.util.ShutdownHookManager$.shutdownHooks(ShutdownHookManager.scala:48) at org.apache.spark.util.ShutdownHookManager$.addShutdownHook(ShutdownHookManager.scala:153) at org.apache.spark.util.ShutdownHookManager$.<init>(ShutdownHookManager.scala:58) at org.apache.spark.util.ShutdownHookManager$.<clinit>(ShutdownHookManager.scala) at org.apache.spark.util.Utils$.createTempDir(Utils.scala:341) at org.apache.spark.util.Utils$.createTempDir(Utils.scala:331) at org.apache.spark.deploy.SparkSubmit.prepareSubmitEnvironment(SparkSubmit.scala:370) at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:955) at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:192) at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:215) at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:91) at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:1111) at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:1120) at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala) Caused by: java.io.FileNotFoundException: HADOOP_HOME and hadoop.home.dir are unset. at org.apache.hadoop.util.Shell.checkHadoopHomeInner(Shell.java:467) at org.apache.hadoop.util.Shell.checkHadoopHome(Shell.java:438) at org.apache.hadoop.util.Shell.<clinit>(Shell.java:515) ... 23 more Process finished with exit code 1

第9章 Spark MLlib 教材配套机房上机实验指南 实验7 Spark MLlib编程初级实践 一、实验目的 (1)通过实验掌握基本的MLLib编程方法; (2)掌握用MLLib解决一些常见的数据分析问题,包括数据导入、成分分析和分类和预测等。 二、实验平台 操作系统:Ubuntu16.04 JDK版本:1.8或以上版本 Spark版本:3.4.0 Python版本:3.8.18 数据集:下载Adult数据集(http://archive.ics.uci.edu/ml/datasets/Adult),该数据集也可以直接到本教程官网的“下载专区”的“数据集”中下载。数据从美国1994年人口普查数据库抽取而来,可用来预测居民收入是否超过50K$/year。该数据集类变量为年收入是否超过50k$,属性变量包含年龄、工种、学历、职业、人种等重要信息,值得一提的是,14个属性变量中有7个类别型变量。 三、实验内容和要求 1.数据导入 从文件中导入数据,并转化为DataFrame。 【参考答案】 答案: # 导入需要的包 from pyspark.ml.feature import PCA from pyspark.sql import Row from pyspark.ml.linalg import Vector,Vectors from pyspark.ml.evaluation import MulticlassClassificationEvaluator from pyspark.ml import Pipeline,PipelineModel from pyspark.ml.feature import IndexToString, StringIndexer, VectorIndexer,HashingTF, Tokenizer from pyspark.ml.classification import LogisticRegression from pyspark.ml.classification import LogisticRegressionModel from pyspark.ml.classification import BinaryLogisticRegressionSummary, LogisticRegression from pyspark.sql import functions from pyspark.ml.tuning import CrossValidator, ParamGridBuilder from pyspark.sql import SparkSession # 获取训练集测试集(需要对测试集进行一下处理,adult.data.txt的标签是>50K和<=50K,而adult.test.txt的标签是>50K.和<=50K.,这里是把adult.test.txt标签的“.”去掉了。另外,确保adult.data.txt和adult.test.txt最后没有多一个空格。) # 读取训练集文件 >>> with open("/usr/local/spark/adult.data", "r") as train_file:     train_data = train_file.readlines() # 读取测试集文件 >>> with open("/usr/local/spark/adult.test", "r") as test_file:     test_data = test_file.readlines() # 去除测试集标签中的点号 "." >>> test_data = [line.replace(" >50K.", " >50K").replace(" <=50K.", " <=50K") for line in test_data] # 去除测试集第一行(|1x3 Cross validator) >>> if test_data[0].strip() == "|1x3 Cross validator":     test_data = test_data[1:] # 去除训练集文件末尾的空行 >>> while train_data and train_data[-1].strip() == "":     train_data.pop() # 去除测试集文件末尾的空行 >>> while test_data and test_data[-1].strip() == "":     test_data.pop() # 保存处理后的训练集为adult.data.txt >>> with open("/usr/local/spark/adult.data.txt", "w") as train_processed_file:     train_processed_file.writelines(train_data) # 保存处理后的测试集为adult.test.txt >>> with open("/usr/local/spark/adult.test.txt", "w") as test_processed_file:     test_processed_file.writelines(test_data) # 训练集测试集文件处理完成 >>> def f(x): rel = {} rel['features']=Vectors.dense(float(x[0]),float(x[2]),float(x[4]),float(x[10]),float(x[11]),float(x[12])) rel['label'] = str(x[14]) return rel >>> spark = SparkSession.builder.appName("SparkMLlib").getOrCreate() >>> df = spark.sparkContext.textFile("file:///usr/local/spark/adult.data.txt").map(lambda line: line.split(',')).map(lambda p: Row(**f(p))).toDF() >>> test = spark.sparkContext.textFile("file:///usr/local/spark/adult.test.txt").map(lambda line: line.split(',')).map(lambda p: Row(**f(p))).toDF() 2.进行主成分分析(PCA) 对6个连续型的数值型变量进行主成分分析。PCA(主成分分析)是通过正交变换把一组相关变量的观测值转化成一组线性无关的变量值,即主成分的一种方法。PCA通过使用主成分把特征向量投影到低维空间,实现对特征向量的降维。请通过setK()方法将主成分数量设置为3,把连续型的特征向量转化成一个3维的主成分。 【参考答案】 构建PCA模型,并通过训练集进行主成分分解,然后分别应用到训练集和测试集 >>> pca = PCA(k=3, inputCol="features", outputCol="pcaFeatures").fit(df) >>> result = pca.transform(df) >>> testdata = pca.transform(test) >>> result.show(truncate=False) +------------------------------------+------+-----------------------------------------------------------+ |features |label |pcaFeatures | +------------------------------------+------+-----------------------------------------------------------+ |[39.0,77516.0,13.0,2174.0,0.0,40.0] | <=50K|[77516.06543281932,-2171.6489938846544,-6.94636047659662] | 
|[50.0,83311.0,13.0,0.0,0.0,13.0] | <=50K|[83310.99935595779,2.526033892795439,-3.3887024086797206] | |[38.0,215646.0,9.0,0.0,0.0,40.0] | <=50K|[215645.99925048652,6.551842584558863,-8.58495396907331] | |[53.0,234721.0,7.0,0.0,0.0,40.0] | <=50K|[234720.99907961808,7.130299808626894,-9.360179790809578] | |[28.0,338409.0,13.0,0.0,0.0,40.0] | <=50K|[338408.99918830546,10.289249842829474,-13.36825187163079] | |[37.0,284582.0,14.0,0.0,0.0,40.0] | <=50K|[284581.99916695454,8.649756033721609,-11.281731333793076] | |[49.0,160187.0,5.0,0.0,0.0,16.0] | <=50K|[160186.9992693704,4.8657537211958015,-6.394299355794672] | |[52.0,209642.0,9.0,0.0,0.0,45.0] | >50K |[209641.99910851714,6.366453450454777,-8.387055585722319] | |[31.0,45781.0,14.0,14084.0,0.0,50.0]| >50K |[45781.42721110637,-14082.596953729322,-26.303509105368967]| |[42.0,159449.0,13.0,5178.0,0.0,40.0]| >50K |[159449.15652342225,-5173.151337268407,-15.351831002502344]| |[37.0,280464.0,10.0,0.0,0.0,80.0] | >50K |[280463.99908861093,8.519356755970295,-11.18800053344727] | |[30.0,141297.0,13.0,0.0,0.0,40.0] | >50K |[141296.99942061218,4.290098166706542,-5.6631132626324545] | |[23.0,122272.0,13.0,0.0,0.0,30.0] | <=50K|[122271.99953623723,3.7134109235615136,-4.887549331279782] | |[32.0,205019.0,12.0,0.0,0.0,50.0] | <=50K|[205018.99929839544,6.227844686218623,-8.176186180265162] | |[40.0,121772.0,11.0,0.0,0.0,40.0] | >50K |[121771.99934864059,3.6945287780608354,-4.9185835672785005]| |[34.0,245487.0,4.0,0.0,0.0,45.0] | <=50K|[245486.99924622502,7.460149417474324,-9.7500032428796] | |[25.0,176756.0,9.0,0.0,0.0,35.0] | <=50K|[176755.99943997271,5.370793765357621,-7.029037217536837] | |[32.0,186824.0,9.0,0.0,0.0,40.0] | <=50K|[186823.9993467819,5.6755410564333655,-7.445605003141201] | |[38.0,28887.0,7.0,0.0,0.0,50.0] | <=50K|[28886.999469511487,0.8668334219453465,-1.2969921640114914]| |[43.0,292175.0,14.0,0.0,0.0,45.0] | >50K |[292174.99908683443,8.879323215730548,-11.599483225617751] | +------------------------------------+------+-----------------------------------------------------------+ only showing top 20 rows >>> testdata.show(truncate=False) +------------------------------------+------+------------------------------------------------------------+ |features |label |pcaFeatures | +------------------------------------+------+------------------------------------------------------------+ |[25.0,226802.0,7.0,0.0,0.0,40.0] | <=50K|[226801.9993670891,6.893313042338155,-8.993983821758418] | |[38.0,89814.0,9.0,0.0,0.0,50.0] | <=50K|[89813.9993894769,2.720987324481492,-3.6809508659703227] | |[28.0,336951.0,12.0,0.0,0.0,40.0] | >50K |[336950.9991912231,10.244920104044986,-13.310695651855433] | |[44.0,160323.0,10.0,7688.0,0.0,40.0]| >50K |[160323.2327290343,-7683.121090489598,-19.729118648463572] | |[18.0,103497.0,10.0,0.0,0.0,30.0] | <=50K|[103496.99961293538,3.1428623091567167,-4.141563083946152] | |[34.0,198693.0,6.0,0.0,0.0,30.0] | <=50K|[198692.99933690467,6.037911774664424,-7.894879761309243] | |[29.0,227026.0,9.0,0.0,0.0,40.0] | <=50K|[227025.99932507658,6.899470708683594,-9.011878890809934] | |[63.0,104626.0,15.0,3103.0,0.0,32.0]| >50K |[104626.09338764263,-3099.8250060691976,-9.648800672049632] | |[24.0,369667.0,10.0,0.0,0.0,40.0] | <=50K|[369666.9991911036,11.241251385630434,-14.58110445420285] | |[55.0,104996.0,4.0,0.0,0.0,10.0] | <=50K|[104995.99929475832,3.186050789410868,-4.236895975019619] | |[65.0,184454.0,9.0,6418.0,0.0,40.0] | >50K |[184454.19392400666,-6412.391589847377,-18.518448307258247] | |[36.0,212465.0,13.0,0.0,0.0,40.0] | 
3. Train a classification model and predict resident income

On top of the principal component analysis, use a logistic regression model (or a decision tree model) to predict whether a resident's income exceeds 50K, and validate it on the Test dataset.

[Reference answer]

Train the logistic regression model, run it on the test data, and obtain the prediction accuracy:

>>> labelIndexer = StringIndexer(inputCol="label", outputCol="indexedLabel").fit(result)
>>> for label in labelIndexer.labels: print(label)
<=50K
>50K
>>> featureIndexer = VectorIndexer(inputCol="pcaFeatures", outputCol="indexedFeatures").fit(result)
>>> print(featureIndexer.numFeatures)
3
>>> labelConverter = IndexToString(inputCol="prediction", outputCol="predictedLabel", labels=labelIndexer.labels)
>>> lr = LogisticRegression().setLabelCol("indexedLabel").setFeaturesCol("indexedFeatures").setMaxIter(100)
>>> lrPipeline = Pipeline().setStages([labelIndexer, featureIndexer, lr, labelConverter])
>>> lrPipelineModel = lrPipeline.fit(result)
>>> lrModel = lrPipelineModel.stages[2]
>>> print("Coefficients: \n " + str(lrModel.coefficientMatrix) + "\nIntercept: " + str(lrModel.interceptVector) + "\n numClasses: " + str(lrModel.numClasses) + "\n numFeatures: " + str(lrModel.numFeatures))
Coefficients:
 DenseMatrix([[-1.98284969e-07, -3.50909196e-04, -8.45149645e-04]])
Intercept: [-1.4525986478021733]
 numClasses: 2
 numFeatures: 3
>>> lrPredictions = lrPipelineModel.transform(testdata)
>>> evaluator = MulticlassClassificationEvaluator().setLabelCol("indexedLabel").setPredictionCol("prediction")
>>> lrAccuracy = evaluator.evaluate(lrPredictions)
>>> lrAccuracy
0.7764235163053484
>>> print("Test Error = %g " % (1.0 - lrAccuracy))
Test Error = 0.223576
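Beyond the single accuracy figure, it is often useful to look at a few individual predictions and a simple confusion count. The following is an illustrative sketch, not part of the reference answer; it assumes the `lrPredictions` DataFrame produced above in the same session:

```python
# Compare true labels with predicted labels for a few test rows
lrPredictions.select("label", "predictedLabel", "probability").show(5, truncate=False)

# A rough confusion matrix: counts per (true label, predicted label) pair
lrPredictions.groupBy("label", "predictedLabel").count().show()
```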
4. Hyperparameter tuning

Use CrossValidator to determine the optimal parameters, including the best number of PCA dimensions and the classifier's own parameters.

[Reference answer]

>>> pca = PCA().setInputCol("features").setOutputCol("pcaFeatures")
>>> labelIndexer = StringIndexer().setInputCol("label").setOutputCol("indexedLabel").fit(df)
>>> featureIndexer = VectorIndexer().setInputCol("pcaFeatures").setOutputCol("indexedFeatures")
>>> labelConverter = IndexToString().setInputCol("prediction").setOutputCol("predictedLabel").setLabels(labelIndexer.labels)
>>> lr = LogisticRegression().setLabelCol("indexedLabel").setFeaturesCol("indexedFeatures").setMaxIter(100)
>>> lrPipeline = Pipeline().setStages([pca, labelIndexer, featureIndexer, lr, labelConverter])
>>> paramGrid = ParamGridBuilder().addGrid(pca.k, [1, 2, 3, 4, 5, 6]).addGrid(lr.elasticNetParam, [0.2, 0.8]).addGrid(lr.regParam, [0.01, 0.1, 0.5]).build()
>>> cv = CrossValidator().setEstimator(lrPipeline).setEvaluator(MulticlassClassificationEvaluator().setLabelCol("indexedLabel").setPredictionCol("prediction")).setEstimatorParamMaps(paramGrid).setNumFolds(3)
>>> cvModel = cv.fit(df)
>>> lrPredictions = cvModel.transform(test)
>>> evaluator = MulticlassClassificationEvaluator().setLabelCol("indexedLabel").setPredictionCol("prediction")
>>> lrAccuracy = evaluator.evaluate(lrPredictions)
>>> print("Accuracy: " + str(lrAccuracy))
Accuracy: 0.7831750221537989
>>> bestModel = cvModel.bestModel
>>> lrModel = bestModel.stages[3]
>>> print("Coefficients: \n " + str(lrModel.coefficientMatrix) + "\nIntercept: " + str(lrModel.interceptVector) + "\n numClasses: " + str(lrModel.numClasses) + "\n numFeatures: " + str(lrModel.numFeatures))
Coefficients:
 DenseMatrix([[-1.46929818e-07, -1.68830548e-04, -8.8415347..., 4.93295331e-02, 3.11466080e-02, -2.82108785...]])
Intercept: [-7.4699641638803005]
 numClasses: 2
 numFeatures: 6
>>> pcaModel = bestModel.stages[0]
>>> print("Primary Component: " + str(pcaModel.pc))
Primary Component: DenseMatrix([[-9.90507714e-06, -1.43514070e-04, ... (6 total)],
                                [ 9.99999999e-01,  3.04337871e-05, ...],
                                [-1.05283840e-06, -4.27228452e-05, ...],
                                [ 3.03678811e-05, -9.99998483e-01, ...],
                                [-3.91389877e-05,  1.72989546e-03, ...],
                                [-2.19555372e-06, -1.31095844e-04, ...]])

As the output shows, the optimal number of PCA dimensions is 6.
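Rather than reading the chosen dimensionality off the coefficient and loading matrices, the selected parameter combination can also be queried programmatically. A minimal sketch, not part of the reference answer, assuming `cvModel` and `bestModel` from the session above:

```python
# Parameters chosen by CrossValidator (same pyspark session as above).
best_pca = bestModel.stages[0]    # PCAModel stage of the winning pipeline
best_lr = bestModel.stages[3]     # LogisticRegressionModel stage
print("best k =", best_pca.getK())
print("best regParam =", best_lr.getRegParam(),
      "best elasticNetParam =", best_lr.getElasticNetParam())

# Average cross-validation metric for the top few parameter maps
ranked = sorted(zip(cvModel.avgMetrics, cvModel.getEstimatorParamMaps()),
                key=lambda t: t[0], reverse=True)
for metric, params in ranked[:3]:
    print(round(metric, 4), {p.name: v for p, v in params.items()})
```

With the parameter grid used above, the top-ranked entry should report k = 6, matching the conclusion stated earlier.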
