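Problems encountered while converting the NYC dataset into the Gowalla format:
- After loading the NYC data, some records contain null values and need to be dropped
df = df.dropna()
- The data has no header row, so the columns only carry integer labels; unneeded columns can be dropped by position with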
x = [2,3,6]
df.drop(df.columns[x], axis=1, inplace=True)
- After the drop, each column keeps the label it had before, but selecting a column with iloc still needs its current position
0 Tue Apr 03 18:17:18 +0000 2012
1 Tue Apr 03 18:22:04 +0000 2012
2 Tue Apr 03 19:12:07 +0000 2012
3 Tue Apr 03 19:12:13 +0000 2012
4 Tue Apr 03 19:18:23 +0000 2012
...
573698 Sat Feb 16 02:34:35 +0000 2013
573699 Sat Feb 16 02:34:53 +0000 2013
573700 Sat Feb 16 02:34:55 +0000 2013
573701 Sat Feb 16 02:35:17 +0000 2013
573702 Sat Feb 16 02:35:29 +0000 2013
Name: 7, Length: 565100, dtype: object
The column originally labeled 7 is now at position 4, so it is selected with
times = df.iloc[:, 4]
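A small sketch of the label-versus-position distinction, using a made-up 8-column frame (the toy data and the name `demo` are assumptions, not the NYC file). After `drop`, the surviving columns keep their original integer labels, so the same column can be reached by label or by position:

```python
import pandas as pd

# Toy frame with integer column labels 0..7, standing in for a header-less file
demo = pd.DataFrame([[f"r{r}c{c}" for c in range(8)] for r in range(3)])

# Drop columns by position, as in the step above
demo.drop(demo.columns[[2, 3, 6]], axis=1, inplace=True)

print(list(demo.columns))        # the labels are unchanged: 0, 1, 4, 5, 7
by_label = demo[7]               # label-based lookup
by_position = demo.iloc[:, 4]    # position-based lookup
print(by_label.equals(by_position))
```

Selecting by label (`demo[7]`) avoids having to recount positions after every drop.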
- Next comes the time format conversion
(1) Selecting the column as in step 3 and then modifying it triggers a warning
SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame
It needs to be written as
times = df.iloc[:, 4].copy()
(2) A KeyError is raised
raise KeyError(key) from err
KeyError: 72
While looking into the problem, I found a post that gave me some hints: https://ask.csdn.net/questions/7644527
I changed the code to:
times = df.iloc[:, 4].to_numpy()
I'm not sure exactly why this works; explanations are very welcome!!!
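One likely explanation, offered as a guess from the printed output rather than a confirmed diagnosis: the Series shown earlier ends at label 573702 but has Length 565100, so `dropna()` left gaps in the index. On a Series with an integer index, `times[i]` looks up the label `i`, and a label that was dropped (such as 72) raises `KeyError`. A NumPy array is indexed purely by position, so the gaps no longer matter. A toy reproduction with made-up data:

```python
import pandas as pd

# Toy Series with a missing value at label 2
s = pd.Series(["a", "b", None, "d"]).dropna()
print(list(s.index))              # label 2 is gone after dropna

raised = False
try:
    s[2]                          # label-based lookup on an integer index
except KeyError:
    raised = True
print("KeyError raised:", raised)

arr = s.to_numpy()
print(arr[2])                     # positional lookup: third remaining value

# Alternative: renumber the index so labels and positions agree again
s_reset = s.reset_index(drop=True)
print(s_reset[2])
```

If this is the cause, calling `df = df.reset_index(drop=True)` right after `dropna()` would be another way to avoid the error.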
(3) Time format
Timestamps such as
Tue Apr 03 18:08:57 +0000 2012
need to be converted into the format of
2010-10-12T19:44:40Z
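The code is as follows:
from datetime import datetime
times = df.iloc[:, 4].to_numpy()
for i in range(len(times)):
    # parse the source format, then write the value back in the target format
    parsed = datetime.strptime(times[i], '%a %b %d %H:%M:%S +0000 %Y')
    times[i] = parsed.strftime('%Y-%m-%dT%H:%M:%SZ')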
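The same conversion can probably be done without a Python loop. The sketch below uses `pd.to_datetime` with an explicit format and then `.dt.strftime`; the two sample timestamps are copied from the output shown earlier, and the names `raw`, `parsed`, and `converted` are only for illustration:

```python
import pandas as pd

# Two sample timestamps in the source format
raw = pd.Series(["Tue Apr 03 18:17:18 +0000 2012",
                 "Sat Feb 16 02:35:29 +0000 2013"])

# Parse with an explicit format ("+0000" is matched as literal text),
# then render every value in the target format
parsed = pd.to_datetime(raw, format="%a %b %d %H:%M:%S +0000 %Y")
converted = parsed.dt.strftime("%Y-%m-%dT%H:%M:%SZ")
print(converted.tolist())
```

Because the result is still a Series with the same index as the input, it can be assigned back to a column directly.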
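- Concatenating the processed times with the other attribute columns raised an error
One has shape (x,)
The other has shape (x,1)
I tried flattening and reshape, but nothing changed, even though the shapes looked the same when printed separately
So I saved times to a file on its own, then read the file back and concatenated at the end, which worked…
- Converting the location strings into integer IDs (extremely slow, and possibly not even correct…)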
import numpy as np
location = df.iloc[:, 4]
location = np.array(location)
l_id = 0
n = location.shape[0]
for i in range(n):
    # entries that were already replaced by an integer ID are skipped
    if isinstance(location[i], int):
        continue
    tmp = location[i]
    location[i] = int(l_id)
    # give every other occurrence of the same string the same ID (O(n^2) overall)
    for j in range(n):
        if isinstance(location[j], int):
            continue
        elif location[j] == tmp:
            location[j] = int(l_id)
    # print(location)
    l_id += 1
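# concatenate location back in
import pandas as pd
# reuse df's index so that concat lines the rows up one to one
location = pd.DataFrame(location, index=df.index)
df = df.iloc[:, 0:4]
# keep the result in df; the original stored it in `data` and then renamed df's 4 columns with 5 names
df = pd.concat([df, location], axis=1)
a = ['user','time','x','y','location']
df.columns = a  # rename the columns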
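A much faster alternative to the nested loop might be `pd.factorize`, which assigns integer codes in order of first appearance, the same numbering the loop produces. This is a sketch on made-up data (the venue strings and the name `toy` are assumptions):

```python
import pandas as pd

# Toy check-ins; the venue strings are made up
toy = pd.DataFrame({"user": [10, 10, 42, 42],
                    "location": ["cafe", "park", "cafe", "gym"]})

# factorize returns (codes, uniques); codes follow the order of first appearance
codes, uniques = pd.factorize(toy["location"])
toy["location"] = codes
print(toy["location"].tolist())   # [0, 1, 0, 2]
print(list(uniques))              # ['cafe', 'park', 'gym']
```

Since the codes are written straight into the existing column, no separate concatenation step is needed.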
- Re-index and sort user and location; the IDs are required to be consecutive
df = df.sort_values(by=['user', 'location'], ascending=True)
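Sorting alone does not renumber anything. Assuming "consecutive" means that the distinct IDs should run 0..k-1 with no gaps, one possible sketch is to convert each column to category codes before sorting (the toy values below are made up):

```python
import pandas as pd

# Toy data with gaps in both ID columns
toy = pd.DataFrame({"user": [42, 7, 42, 100],
                    "location": [30, 5, 5, 90]})

# Category codes number the distinct values 0..k-1 in sorted order
for col in ["user", "location"]:
    toy[col] = toy[col].astype("category").cat.codes

toy = toy.sort_values(by=["user", "location"]).reset_index(drop=True)
print(toy.values.tolist())        # [[0, 0], [1, 0], [1, 1], [2, 2]]
```

If the IDs should start at 1 instead, adding 1 to the codes would do it.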