Loading .arff Data in Python

Original post · last modified 2025-06-25 14:20:27


License: CC 4.0 BY-SA

Tags: #MachineLearning #ArtificialIntelligence

First published 2025-06-10 14:46:00

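The problem:

Loading .arff data in the format shown below fails.

Figure 1: Attribute declarations of the data

Figure 2: Data samples

Explanation of the data:

Each pair of braces {} holds one instance (sample). Inside the braces, the comma-separated entries each consist of a feature (attribute) index followed by that feature's value.

For example, take the first sample: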

{8 1, 11 1, 61 1, 62 1, 63 2, 64 3, 65 1, 66 1, 67 1, 68 1, 69 1, 70 1, 71 1, 72 1, 73 1, 74 1, 75 1, 76 1, 77 1, 78 1, 79 1, 80 1, 81 1, 82 1, 83 1, 84 1, 85 1, 86 1, 87 1, 88 1, 89 1, 90 1, 91 1, 92 3, 93 1, 94 1, 95 1, 96 1, 97 1, 98 1, 99 1, 100 1, 101 1, 102 1, 103 1, 104 1, 105 1, 106 1, 107 1, 108 1, 109 1, 110 1, 111 1, 112 1, 113 2, 114 1, 115 1, 116 1, 117 1, 118 1, 119 4, 120 1, 121 1, 122 1, 123 1, 124 1, 23168 1}

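In the entry 8 1, the 8 is the attribute index, so the feature declared as @attribute Att9 numeric in Figure 1 has the value 1. (Attribute indices start at 0, so index 8 is the ninth attribute, Att9.) Figure 4 shows this attribute after loading.

Attributes 0 through 7 do not appear in this sample, which means their values are 0. This is the sparse ARFF format: only the non-zero values are written out.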
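To make the rule above concrete, here is a minimal sketch that expands one sparse row into a dense list. The function name expand_sparse_row and the 5-attribute example row are made up for illustration; they are not part of the original data.

```python
def expand_sparse_row(row_text, n_attributes):
    """Expand a sparse ARFF row like '{1 5, 3 2}' into a dense list of floats."""
    dense = [0.0] * n_attributes          # attributes that are not listed are 0
    body = row_text.strip().strip('{}')   # drop the braces and outer whitespace
    for entry in body.split(','):
        if not entry.strip():             # tolerate an empty row '{}'
            continue
        index, value = entry.split()      # "index value", e.g. "1 5"
        dense[int(index)] = float(value)
    return dense


# Hypothetical row with 5 attributes: index 1 holds 5, index 3 holds 2.
print(expand_sparse_row('{1 5, 3 2}', 5))  # [0.0, 5.0, 0.0, 2.0, 0.0]
```

Indices 0, 2 and 4 are not mentioned in the row, so they stay at 0, just like attributes 0 through 7 in the first sample.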

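Problem statement:

Loading the data above directly in Python with scipy fails. The scipy documentation states that scipy.io.arff.loadarff cannot read files with sparse data (rows written in {}):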

from scipy.io import arff

path = 'E:/Arts1.arff'
data, meta = arff.loadarff(path)
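Before picking a loader, it can help to check whether a file uses the sparse row format at all. The following is a rough sketch under the assumption that looking at the first data row after @data is enough; the helper name uses_sparse_rows and the sample lines are hypothetical.

```python
def uses_sparse_rows(lines):
    """Return True if the first data row after '@data' starts with '{'."""
    in_data = False
    for line in lines:
        text = line.strip()
        if not text or text.startswith('%'):   # skip blank lines and ARFF comments
            continue
        if in_data:
            return text.startswith('{')        # sparse rows are wrapped in braces
        if text.lower().startswith('@data'):   # ARFF keywords are case-insensitive
            in_data = True
    return False                               # no data row found


# Hypothetical in-memory file content, for illustration only.
sample = [
    '@relation demo',
    '@attribute Att1 numeric',
    '@attribute Att2 numeric',
    '@data',
    '{1 3}',
]
print(uses_sparse_rows(sample))  # True
```

For a real file, the lines could come from open(path).readlines(). If the check returns True, a custom parser such as the one in this post is needed.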

Solution:

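import numpy as np
import pandas as pd


def parse_row(line, len_row):
    """Convert one sparse ARFF row such as '{8 1, 11 1}' into a dense array."""
    # Drop the surrounding braces and whitespace (including the newline).
    line = line.strip().strip('{}')

    # Attributes that are not listed in a sparse row have the value 0.
    row = np.zeros(len_row)
    for data in line.split(','):
        if not data.strip():
            # An empty row '{}' has no entries, so every value stays 0.
            continue
        index, value = data.split()
        row[int(index)] = float(value)

    return row


def read_data_arff(filename):
    """Read a sparse ARFF file with numeric attributes into a DataFrame."""
    # Step 1. Read data by row.
    with open(filename, 'r') as fp:
        file_content = [line.strip() for line in fp]

    # Step 2. Get the columns.
    # ARFF keywords are case-insensitive, so '@ATTRIBUTE' is matched as well.
    columns = []
    for line in file_content:
        if line.lower().startswith('@attribute'):
            # The attribute name is the second whitespace-separated token.
            columns.append(line.split()[1])

    # Step 3. Get the rows.
    len_row = len(columns)
    rows = [parse_row(line, len_row)
            for line in file_content if line.startswith('{')]

    # Step 4. Return the results.
    return pd.DataFrame(data=rows, columns=columns)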
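One thing to keep in mind: the dense approach stores one float per attribute per row, and the first sample already refers to index 23168, so every row has at least 23169 columns. If memory becomes a problem, one possible alternative (not what this post uses) is to collect only the non-zero entries as (row, column, value) triplets. The sketch below uses a hypothetical helper name, sparse_rows_to_triplets, and made-up demo lines.

```python
def sparse_rows_to_triplets(lines):
    """Collect (row, column, value) triplets from sparse ARFF data rows."""
    rows, cols, vals = [], [], []
    row_id = 0
    for line in lines:
        text = line.strip()
        if not text.startswith('{'):
            continue                      # skip header lines, comments, blanks
        for entry in text.strip('{}').split(','):
            if entry.strip():             # an empty row '{}' adds no triplets
                index, value = entry.split()
                rows.append(row_id)
                cols.append(int(index))
                vals.append(float(value))
        row_id += 1                       # empty rows still count as samples
    return rows, cols, vals


# Hypothetical data section with three samples, the second one all zeros.
demo = ['@data', '{0 1, 2 4}', '{}', '{1 3}']
print(sparse_rows_to_triplets(demo))
# ([0, 0, 2], [0, 2, 1], [1.0, 4.0, 3.0])
```

If SciPy is available, these three lists can be passed to scipy.sparse.coo_matrix((vals, (rows, cols)), shape=(n_rows, n_columns)) to build a sparse matrix, where n_rows and n_columns are the number of samples and attributes.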

Results:

path = 'E:/Arts1.arff'
data = read_data_arff(path)

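Figure 3: The loaded data

Figure 4: The value corresponding to the entry 8 1 of the first sample

Figure 5: The label space

Further reading: "Reading ARFF files directly with Pandas" (arff2pandas), CSDN blog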