[Spatio-temporal dataset processing] Converting NYC into the Gowalla format


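Problems I ran into while converting the NYC dataset into the Gowalla format:

After the NYC data is loaded, some records contain null values and need to be dropped.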

df = df.dropna()  # dropna() returns a new DataFrame, so assign it back

The data has no column names, so the unneeded columns can be dropped by position:

x = [2,3,6]
df.drop(df.columns[x], axis=1, inplace=True)

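After the drop, the label of each remaining column is unchanged (the same as before the drop), but selecting a column with iloc still requires its current position.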

0         Tue Apr 03 18:17:18 +0000 2012
1         Tue Apr 03 18:22:04 +0000 2012
2         Tue Apr 03 19:12:07 +0000 2012
3         Tue Apr 03 19:12:13 +0000 2012
4         Tue Apr 03 19:18:23 +0000 2012
                       ...              
573698    Sat Feb 16 02:34:35 +0000 2013
573699    Sat Feb 16 02:34:53 +0000 2013
573700    Sat Feb 16 02:34:55 +0000 2013
573701    Sat Feb 16 02:35:17 +0000 2013
573702    Sat Feb 16 02:35:29 +0000 2013
Name: 7, Length: 565100, dtype: object
The column originally labeled 7 now has to be selected with position 4:
times = df.iloc[:, 4]
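
The difference between column labels and column positions can be seen on a toy frame. This is a minimal sketch; the one-row, 8-column frame and its values are made up for illustration.

```python
import pandas as pd

# Hypothetical 8-column frame without a header, so the labels are 0..7.
df = pd.DataFrame([[10, 20, 30, 40, 50, 60, 70, 80]])

df = df.drop(df.columns[[2, 3, 6]], axis=1)

# Labels keep their original values; positions are renumbered from 0.
labels = list(df.columns)          # [0, 1, 4, 5, 7]
by_position = df.iloc[:, 4]        # 5th remaining column (label 7)
by_label = df.loc[:, 7]            # the same column, selected by label

print(labels)
print(by_position.equals(by_label))
```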

Next comes the time format conversion.
(1) Selecting the column with `df.iloc[:, 4]` as above triggers a warning:

SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame

It needs to be written as:

times = df.iloc[:, 4].copy()

(2) A KeyError is raised:

raise KeyError(key) from err
KeyError: 72

While looking into the problem I found a post that gave me some hints: https://ask.csdn.net/questions/7644527
Change the code to:

times = df.iloc[:, 4].to_numpy()

I'm not sure exactly why this works; if anyone knows, please tell me!
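
One likely explanation, offered as an assumption since the failing loop is not shown: after `dropna()` the row index keeps its original labels, so it has gaps (the output above runs up to label 573702 but has only 565100 rows). Indexing a Series with an integer, as in `times[i]`, looks up the label `i` rather than the i-th row, and fails at the first label that was dropped. A NumPy array is always indexed by position. The toy data below is made up.

```python
import pandas as pd

# Hypothetical data: row label 2 is missing, as it would be after dropna().
s = pd.Series(["a", "b", "d"], index=[0, 1, 3])

try:
    s[2]                      # label-based lookup: label 2 does not exist
    raised = False
except KeyError:
    raised = True

arr = s.to_numpy()
third = arr[2]                # positional lookup on the array: "d"
same = s.iloc[2]              # positional lookup on the Series: "d"

print(raised, third, same)
```

Calling `df = df.reset_index(drop=True)` right after `dropna()` would also make labels and positions coincide again.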
(3) Time format
We need to convert timestamps such as

Tue Apr 03 18:08:57 +0000 2012

into the Gowalla-style format, for example

2010-10-12T19:44:40Z

The code is as follows:

from datetime import datetime

times = df.iloc[:, 4].to_numpy()
for i in range(len(times)):
    # parse e.g. 'Tue Apr 03 18:08:57 +0000 2012', then format it as 'YYYY-MM-DDTHH:MM:SSZ'
    t = datetime.strptime(times[i], '%a %b %d %H:%M:%S +0000 %Y')
    times[i] = t.strftime('%Y-%m-%dT%H:%M:%SZ')
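
The conversion can also be wrapped in a small function and applied to a whole column. This is a sketch using only the standard library; the function name `to_gowalla_time` is made up for illustration.

```python
from datetime import datetime

SRC_FMT = "%a %b %d %H:%M:%S +0000 %Y"   # e.g. Tue Apr 03 18:08:57 +0000 2012
DST_FMT = "%Y-%m-%dT%H:%M:%SZ"           # e.g. 2012-04-03T18:08:57Z

def to_gowalla_time(raw: str) -> str:
    """Convert one NYC-style UTC timestamp string to the Gowalla-style format."""
    return datetime.strptime(raw, SRC_FMT).strftime(DST_FMT)

converted = [to_gowalla_time(t) for t in [
    "Tue Apr 03 18:08:57 +0000 2012",
    "Sat Feb 16 02:34:35 +0000 2013",
]]
print(converted)
```

With pandas, `df.iloc[:, 4].map(to_gowalla_time)` would apply it to every row without an explicit index loop.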

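Concatenating the processed times with the other attribute columns produced an error:
one had shape (x,)
and the other had shape (x, 1).
I tried flattening and reshape, but nothing changed, even though printing each one separately showed the same dimensions.
In the end I saved times to a separate file; reading that file back and then concatenating gave the correct result…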
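
The error itself is not shown, so the following is only a guess at the cause: `pd.concat(..., axis=1)` matches rows by index label, not by position. A frame built from a NumPy array gets a fresh 0..n-1 index, while `df` still carries the gapped index left by `dropna()`, so the two do not line up. Writing to a file and reading it back resets the index, which would explain why that workaround helped. The toy data is made up.

```python
import numpy as np
import pandas as pd

# Hypothetical frame whose index has a gap, as after dropna().
df = pd.DataFrame({"user": [1, 2, 3]}, index=[0, 1, 3])
times = np.array(["t0", "t1", "t3"], dtype=object)   # shape (3,)

# Fresh RangeIndex 0..2 vs. df's 0, 1, 3: rows are matched by label.
misaligned = pd.concat([df, pd.DataFrame(times)], axis=1)

# Reusing df's index keeps the rows paired one to one.
aligned = pd.concat([df, pd.Series(times, index=df.index, name="time")], axis=1)

print(misaligned.shape, aligned.shape)
```

The misaligned result has one row per label in either index, with NaN where a label is missing on one side.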
Convert the location strings into integer IDs (very slow, and in the end the result may not even be correct…)

import numpy as np

location = np.array(df.iloc[:, 4], dtype=object)  # object array, so ints can replace strings
n = location.shape[0]
l_id = 0  # next unused location id
for i in range(n):
    if isinstance(location[i], int):
        continue  # already replaced by an id
    tmp = location[i]
    location[i] = int(l_id)
    # give every later occurrence of the same string the same id
    for j in range(i + 1, n):
        if isinstance(location[j], int):
            continue
        elif location[j] == tmp:
            location[j] = int(l_id)
    # print(location)
    l_id += 1
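
The nested loop compares every pair of rows, so its cost grows quadratically with the number of check-ins. A dictionary gives the same first-appearance numbering in a single pass. This is a sketch, not the method used in this post; the function name `encode_first_seen` and the sample venue strings are made up.

```python
def encode_first_seen(values):
    """Map each distinct value to 0, 1, 2, ... in order of first appearance."""
    ids = {}
    encoded = []
    for v in values:
        if v not in ids:
            ids[v] = len(ids)      # next unused id
        encoded.append(ids[v])
    return encoded, ids

codes, mapping = encode_first_seen(["venue_b", "venue_a", "venue_b", "venue_c"])
print(codes)     # [0, 1, 0, 2]
```

pandas offers the same operation as `pd.factorize(df.iloc[:, 4])`, which returns the integer codes together with the array of unique values.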

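# put location back into the frame
location = pd.DataFrame(location, index=df.index)  # reuse df's index so concat pairs the rows correctly
df = df.iloc[:, 0:4]
df = pd.concat([df, location], axis=1)  # assign back to df so the renaming applies to all 5 columns
df.columns = ['user', 'time', 'x', 'y', 'location']  # rename the columns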

Re-index and sort user and location; the IDs are required to be contiguous.

df = df.sort_values(by=['user', 'location'], ascending=True)
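
Sorting reorders the rows but does not change the ID values themselves. If the IDs have gaps, one way to make them contiguous is to replace each ID by its rank among the distinct values. A sketch under that assumption; the column names follow the code above and the data is made up.

```python
import pandas as pd

# Hypothetical check-ins with non-contiguous user and location ids.
df = pd.DataFrame({
    "user": [470, 12, 470, 979],
    "location": [5, 0, 2, 5],
})

for col in ["user", "location"]:
    # sort=True numbers the distinct values in ascending order: 0, 1, 2, ...
    codes, _ = pd.factorize(df[col], sort=True)
    df[col] = codes

df = df.sort_values(by=["user", "location"], ascending=True).reset_index(drop=True)
print(df)
```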