Python. Pandas. Big data. Messy TSV file. How to wrangle the data?

So, I have messy data stored in a TSV file that I need to analyze.
This is how it looks:

1	status=200 protocol=http region_name=Podolsk datetime=2016-03-10 15:51:58 user_ip=0.120.81.243 user_agent=Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/48.0.2564.116 Safari/537.36 user_id=7885299833141807155 user_vhost=tindex.ru method=GET page=/search/

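The problem is that some rows have a different column order, and some of them are missing values. I need to clean this up with high performance, because the datasets I am working with are up to 100 gigabytes.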

import pandas as pd

Data = pd.read_table('data/data.tsv', sep='\t+', header=None,
                     names=['status', 'protocol',
                            'region_name', 'datetime',
                            'user_ip', 'user_agent',
                            'user_id', 'user_vhost',
                            'method', 'page'],
                     engine='python')
Clean_Data = Data.dropna().reset_index(drop=True)

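Now I have gotten rid of the missing values, but one problem still remains!
This is how the data looks:
[image]

And this is where the problem shows up:
[image]

As you can see, some columns are offset.
I came up with a very low-performance solution: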

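for column in Clean_Data.columns:
    # Snapshot the index labels so that dropping rows does not disturb the iteration.
    ids = Clean_Data.index.tolist()
    for row, i in zip(Clean_Data[column], ids):
        # Every cell should contain its own column name; drop the row if it does not.
        if str(column) not in row:
            Clean_Data.drop([i], inplace=True)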

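So now the data looks fine... at least I can work with it!
But what is a high-performance alternative to the method I described above?

Update on unutbu's code: traceback error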

---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-4-52c9d76f9744> in <module>()
8 df.index.names = ['index', 'num']
9
---> 10 df = df.set_index('field', append=True)
11 df.index = df.index.droplevel(level='num')
12 df = df['value'].unstack(level=1)

/Users/Peter/anaconda/lib/python2.7/site-packages/pandas/core/frame.pyc in set_index(self, keys, drop, append, inplace, verify_integrity)
2805 if isinstance(self.index, MultiIndex):
2806 for i in range(self.index.nlevels):
-> 2807 arrays.append(self.index.get_level_values(i))
2808 else:
2809 arrays.append(self.index)

/Users/Peter/anaconda/lib/python2.7/site-packages/pandas/indexes/multi.pyc in get_level_values(self, level)
664 values = _simple_new(filled, self.names[num],
665 freq=getattr(unique, 'freq', None),
--> 666 tz=getattr(unique, 'tz', None))
667 return values
668

/Users/Peter/anaconda/lib/python2.7/site-packages/pandas/indexes/range.pyc in _simple_new(cls, start, stop, step, name, dtype, **kwargs)
124 return RangeIndex(start, stop, step, name=name, **kwargs)
125 except TypeError:
--> 126 return Index(start, stop, step, name=name, **kwargs)
127
128 result._start = start

/Users/Peter/anaconda/lib/python2.7/site-packages/pandas/indexes/base.pyc in __new__(cls, data, dtype, copy, name, fastpath, tupleize_cols, **kwargs)
212 if issubclass(data.dtype.type, np.integer):
213 from .numeric import Int64Index
--> 214 return Int64Index(data, copy=copy, dtype=dtype, name=name)
215 elif issubclass(data.dtype.type, np.floating):
216 from .numeric import Float64Index

/Users/Peter/anaconda/lib/python2.7/site-packages/pandas/indexes/numeric.pyc in __new__(cls, data, dtype, copy, name, fastpath, **kwargs)
105 # with a platform int
106 if (dtype is None or
--> 107 not issubclass(np.dtype(dtype).type, np.integer)):
108 dtype = np.int64
109

TypeError: data type "index" not understood

Pandas version: 0.18.0-np110py27_0

Update

Everything works now... thank you all!

Suppose you have TSV data such as:

status=A protocol=B region_name=C datetime=D user_ip=E user_agent=F user_id=G
user_id=G status=A region_name=C user_ip=E datetime=D user_agent=F protocol=B
protocol=B datetime=D status=A user_ip=E user_agent=F user_id=G

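The order of the fields may be scrambled, and there may be missing values. However, you do not have to drop rows just because the fields do not appear in a particular order. You can use the field names provided by the row data itself to place the values in the correct columns. For example,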

import pandas as pd

df = pd.read_table('data/data.tsv', sep='\t+',header=None, engine='python')
df = df.stack().str.extract(r'([^=]*)=(.*)', expand=True).dropna(axis=0)
df.columns = ['field', 'value']

df = df.set_index('field', append=True)
df.index = df.index.droplevel(level=1)
df = df['value'].unstack(level=1)

print(df)

yields

field datetime protocol region_name status user_agent user_id user_ip
index
0            D        B           C      A          F       G       E
1            D        B           C      A          F       G       E
2            D        B        None      A          F       G       E
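
If the stack/unstack round trip feels opaque, the same idea (let each cell's own field name decide the column) can be sketched in plain Python. This is not the method used above, only an illustration. `parse_line` is a hypothetical helper name, and the sketch assumes that cells are separated by tabs and that a field name never contains `=`.

```python
def parse_line(line):
    """Turn one tab-separated line of key=value cells into a dict."""
    record = {}
    for cell in line.rstrip('\n').split('\t'):
        # partition splits at the first '=' only, so '=' inside a value survives
        name, sep, value = cell.partition('=')
        if sep and name:  # skip empty cells and cells without '=' (e.g. a leading row number)
            record[name] = value
    return record


# Made-up sample lines: scrambled field order, a doubled tab, a missing field.
lines = [
    '1\tstatus=A\tprotocol=B\tregion_name=C\tuser_id=G',
    '2\tuser_id=G\tstatus=A\tregion_name=C\tprotocol=B',
    '3\tprotocol=B\t\tstatus=A\tuser_id=G',
]
records = [parse_line(line) for line in lines]
for record in records:
    print(record)
```

A list of dicts like `records` can be passed to `pd.DataFrame`, which fills keys missing from a row with NaN. A Python-level loop like this may well be slower than the vectorized version on large files, so treat it as a way to understand the transformation rather than a replacement.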

To handle a large TSV file, you can process the rows in chunks and then concatenate the processed chunks into a single DataFrame at the end:

import pandas as pd

# Example value (an assumption): the number of rows to be processed per
# iteration. Tune it to the memory you have available.
chunksize = 100000
dfs = []
reader = pd.read_table('data/data.tsv', sep='\t+', header=None, engine='python',
                       iterator=True, chunksize=chunksize)
for df in reader:
    # One (field, value) row per cell; cells without '=' are dropped.
    df = df.stack().str.extract(r'([^=]*)=(.*)', expand=True).dropna(axis=0)
    df.columns = ['field', 'value']
    df.index.names = ['index', 'num']

    # Move 'field' into the index, drop the cell position, then turn the
    # field names into columns.
    df = df.set_index('field', append=True)
    df.index = df.index.droplevel(level='num')
    df = df['value'].unstack(level=1)
    dfs.append(df)

df = pd.concat(dfs, ignore_index=True)
print(df)
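
If even the concatenated result is too large to hold in memory, one option is to write each tidied chunk out as soon as it is ready instead of collecting the chunks in a list. The following is a minimal sketch of that variation, under some assumptions: the full list of field names (`FIELDS`) is known up front, `tidy_chunk` is a hypothetical helper name, and in-memory `io.StringIO` buffers stand in for the real input and output files.

```python
import io
import pandas as pd

# Assumed to be known in advance; taken from the sample rows above.
FIELDS = ['status', 'protocol', 'region_name', 'datetime',
          'user_ip', 'user_agent', 'user_id']


def tidy_chunk(chunk):
    """Turn one raw chunk of 'field=value' cells into one column per field."""
    pairs = chunk.stack().str.extract(r'([^=]*)=(.*)', expand=True).dropna(axis=0)
    pairs.columns = ['field', 'value']
    pairs.index.names = ['index', 'num']
    pairs = pairs.set_index('field', append=True)
    pairs.index = pairs.index.droplevel(level='num')
    return pairs['value'].unstack(level=1)


rows = [
    ['status=A', 'protocol=B', 'region_name=C', 'datetime=D',
     'user_ip=E', 'user_agent=F', 'user_id=G'],
    ['user_id=G', 'status=A', 'region_name=C', 'user_ip=E',
     'datetime=D', 'user_agent=F', 'protocol=B'],
    ['protocol=B', 'datetime=D', 'status=A', 'user_ip=E',
     'user_agent=F', 'user_id=G'],
]
raw = io.StringIO('\n'.join('\t'.join(row) for row in rows) + '\n')
out = io.StringIO()  # stands in for an output file opened for writing

reader = pd.read_table(raw, sep='\t+', header=None, engine='python', chunksize=2)
for n, chunk in enumerate(reader):
    # reindex forces the same column set and order in every chunk,
    # so the pieces line up in the output even if a chunk lacks a field.
    tidy = tidy_chunk(chunk).reindex(columns=FIELDS)
    tidy.to_csv(out, sep='\t', index=False, header=(n == 0))

# Read the cleaned file back to check the result.
out.seek(0)
result = pd.read_csv(out, sep='\t')
print(result)
```

The `reindex` call matters here: without it, a chunk in which some field never occurs would be written with fewer columns than the others.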

Explanation: Given df:

In [527]: df = pd.DataFrame({0: ['status=A', 'user_id=G', 'protocol=B'],
   .....:                    1: ['protocol=B', 'status=A', 'datetime=D'],
   .....:                    2: ['region_name=C', 'region_name=C', 'status=A'],
   .....:                    3: ['datetime=D', 'user_ip=E', 'user_ip=E'],
   .....:                    4: ['user_ip=E', 'datetime=D', 'user_agent=F'],
   .....:                    5: ['user_agent=F', 'user_agent=F', 'user_id=G'],
   .....:                    6: ['user_id=G', 'protocol=B', None]}); df
Out[527]:
            0           1              2           3             4             5           6
0    status=A  protocol=B  region_name=C  datetime=D     user_ip=E  user_agent=F   user_id=G
1   user_id=G    status=A  region_name=C   user_ip=E    datetime=D  user_agent=F  protocol=B
2  protocol=B  datetime=D       status=A   user_ip=E  user_agent=F     user_id=G        None

you can coalesce all the values into a single column:

In [449]: df.stack()
Out[449]:
0  0         status=A
   1       protocol=B
   2    region_name=C
   3       datetime=D
   4        user_ip=E
   5     user_agent=F
   6        user_id=G
1  0        user_id=G
   1         status=A
   2    region_name=C
   3        user_ip=E
   4       datetime=D
   5     user_agent=F
   6       protocol=B
2  0       protocol=B
   1       datetime=D
   2         status=A
   3        user_ip=E
   4     user_agent=F
   5        user_id=G
dtype: object

and then apply .str.extract(r'([^=]*)=(.*)') to separate the field name from the value:

In [450]: df = df.stack().str.extract(r'([^=]*)=(.*)', expand=True).dropna(axis=0); df
Out[450]:
               0  1
0 0       status  A
  1     protocol  B
  2  region_name  C
  3     datetime  D
  4      user_ip  E
  5   user_agent  F
  6      user_id  G
1 0      user_id  G
  1       status  A
  2  region_name  C
  3      user_ip  E
  4     datetime  D
  5   user_agent  F
  6     protocol  B
2 0     protocol  B
  1     datetime  D
  2       status  A
  3      user_ip  E
  4   user_agent  F
  5      user_id  G
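
To see why this pattern is a reasonable fit for values such as user agents or URLs, it can help to try it with the standard `re` module on a few cells. In the pattern, `[^=]*` cannot cross an `=`, so the first group stops at the first `=`, and `(.*)` takes the rest of the cell, including spaces and any later `=`. The helper name `split_pair` and the sample cells below are made up for illustration.

```python
import re

# The same pattern that is passed to .str.extract above.
PAIR = re.compile(r'([^=]*)=(.*)')


def split_pair(cell):
    """Return (field, value) for a 'field=value' cell, or None if there is no '='."""
    match = PAIR.search(cell)
    return match.groups() if match else None


cells = [
    'status=200',
    'page=/search/?q=a=b',
    'user_agent=Mozilla/5.0 (Windows NT 6.1; WOW64)',
    '1',  # no '=' at all, e.g. a leading row number
]
for cell in cells:
    print(repr(cell), '->', split_pair(cell))
```

Cells for which the pattern finds no match come out of `.str.extract` as NaN, which is why the `.dropna(axis=0)` that follows removes them.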

To make it easier to refer to parts of the DataFrame, let's give the columns and index levels descriptive names:

In [530]: df.columns = ['field', 'value']; df.index.names = ['index', 'num']; df
Out[530]:
                 field value
index num
0     0         status     A
      1       protocol     B
...

Now if we move the field column into the index:

In [531]: df = df.set_index('field', append=True); df
Out[531]:
                      value
index num field
0     0   status          A
      1   protocol        B
      2   region_name     C
      3   datetime        D
...

and drop the num index level:

In [532]: df.index = df.index.droplevel(level='num'); df
Out[532]:
                  value
index field
0     status          A
      protocol        B
      region_name     C
      datetime        D
...

then we can obtain a DataFrame of the desired form
by moving the field index level into the column index:

In [533]: df = df['value'].unstack(level=1); df
Out[533]:
field datetime protocol region_name status user_agent user_id user_ip
index
0            D        B           C      A          F       G       E
1            D        B           C      A          F       G       E
2            D        B        None      A          F       G       E
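
The set_index / droplevel / unstack sequence is one way to reshape the long table of (index, field, value) rows. As an alternative sketch, not the code used above, `DataFrame.pivot` can do the same reshaping in one call once the row label is an ordinary column. The small `pairs` frame below is made-up sample data standing in for the extracted field/value pairs.

```python
import pandas as pd

# Long format: one row per (original row label, field name, value).
pairs = pd.DataFrame({
    'index': [0, 0, 1, 1, 2],
    'field': ['status', 'protocol', 'protocol', 'status', 'status'],
    'value': ['A', 'B', 'B', 'A', 'A'],
})

# One row per original row label, one column per field name.
# Row 2 has no 'protocol' entry, so that cell is left missing.
wide = pairs.pivot(index='index', columns='field', values='value')
print(wide)
```

Both approaches should raise an error if the same field name occurs twice within one row, because the reshaped table would then need two values in one cell. If the data may contain such rows, they need to be deduplicated first.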