Convert list of dictionaries to a pandas DataFrame
I have a list of dictionaries like this:

    [{'points': 50, 'time': '5:00', 'year': 2010},
     {'points': 25, 'time': '6:00', 'month': "february"},
     {'points': 90, 'time': '9:00', 'month': 'january'},
     {'points_h1': 20, 'month': 'june'}]

And I want to turn this into a pandas DataFrame like this:

          month  points  points_h1  time  year
    0       NaN      50        NaN  5:00  2010
    1  february      25        NaN  6:00   NaN
    2   january      90        NaN  9:00   NaN
    3      june     NaN         20   NaN   NaN

Note: The order of the columns does not matter.
How can I turn the list of dictionaries into a pandas DataFrame as shown above?
Supposing d is your list of dicts, simply:

    pd.DataFrame(d)
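To make that concrete, here is a minimal sketch that runs the one-liner on the list from the question. The variable name d follows the answer above; the checks at the end are only there for illustration.

```python
import pandas as pd

# The list of dictionaries from the question; keys differ between records.
d = [
    {'points': 50, 'time': '5:00', 'year': 2010},
    {'points': 25, 'time': '6:00', 'month': 'february'},
    {'points': 90, 'time': '9:00', 'month': 'january'},
    {'points_h1': 20, 'month': 'june'},
]

df = pd.DataFrame(d)

# One row per dictionary, one column per distinct key.
print(df.shape)
print(sorted(df.columns))
# Keys that are missing from a record show up as NaN in that row.
print(df.isna().sum().to_dict())
```

Column order may differ between pandas versions, which is why the sketch sorts the column names before printing.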
In pandas 16.2, I had to do
How do I convert a list of dictionaries to a pandas DataFrame?
The other answers are correct, but they say little about the advantages and limitations of these methods. The aim of this post is to show examples of these methods in different situations, discuss when to use them (and when not to), and suggest alternatives.
Depending on the structure and format of your data, there are situations where either all three methods work, or some work better than others, or some don't work at all.
Consider a very contrived example.
    import numpy as np
    import pandas as pd

    np.random.seed(0)
    # 'records' is the full name of the orient that older pandas abbreviated as 'r'.
    data = pd.DataFrame(
        np.random.choice(10, (3, 4)), columns=list('ABCD')).to_dict('records')

    print(data)
    [{'A': 5, 'B': 0, 'C': 3, 'D': 3},
     {'A': 7, 'B': 9, 'C': 3, 'D': 5},
     {'A': 2, 'B': 4, 'C': 7, 'D': 6}]
This list consists of "records" with every key present. This is the simplest case you could encounter.

    # The following methods all produce the same output.
    pd.DataFrame(data)
    pd.DataFrame.from_dict(data)
    pd.DataFrame.from_records(data)

       A  B  C  D
    0  5  0  3  3
    1  7  9  3  5
    2  2  4  7  6
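If you want to check the "same output" claim yourself rather than eyeball three printouts, a small sketch along these lines should do it. The names df_ctor, df_dict and df_recs are just illustrative.

```python
import pandas as pd

data = [{'A': 5, 'B': 0, 'C': 3, 'D': 3},
        {'A': 7, 'B': 9, 'C': 3, 'D': 5},
        {'A': 2, 'B': 4, 'C': 7, 'D': 6}]

df_ctor = pd.DataFrame(data)
df_dict = pd.DataFrame.from_dict(data)
df_recs = pd.DataFrame.from_records(data)

# DataFrame.equals compares shape, labels, values and dtypes.
same = df_ctor.equals(df_dict) and df_ctor.equals(df_recs)
print(same)
```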
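A word on dictionary orientations:
Before continuing, it is important to distinguish between the different types of dictionary orientations that pandas supports. There are two primary types: "columns" and "index".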
For example, data above is in the "columns" orientation:

    data_c = [
        {'A': 5, 'B': 0, 'C': 3, 'D': 3},
        {'A': 7, 'B': 9, 'C': 3, 'D': 5},
        {'A': 2, 'B': 4, 'C': 7, 'D': 6}]

    pd.DataFrame.from_dict(data_c, orient='columns')

       A  B  C  D
    0  5  0  3  3
    1  7  9  3  5
    2  2  4  7  6
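Note: If you are using pd.DataFrame.from_records, the orientation is assumed to be "columns" and cannot be changed. Data in the "index" orientation, where the outer keys are the index labels, looks like this: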
    data_i = {
        0: {'A': 5, 'B': 0, 'C': 3, 'D': 3},
        1: {'A': 7, 'B': 9, 'C': 3, 'D': 5},
        2: {'A': 2, 'B': 4, 'C': 7, 'D': 6}}

    pd.DataFrame.from_dict(data_i, orient='index')

       A  B  C  D
    0  5  0  3  3
    1  7  9  3  5
    2  2  4  7  6

This case is not considered in the OP, but it is still useful to know.
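To see what the orient argument actually changes, here is a rough sketch that loads the same dict-of-dicts both ways. The names as_columns and as_index are made up for the example.

```python
import pandas as pd

data_i = {
    0: {'A': 5, 'B': 0, 'C': 3, 'D': 3},
    1: {'A': 7, 'B': 9, 'C': 3, 'D': 5},
    2: {'A': 2, 'B': 4, 'C': 7, 'D': 6},
}

# Default orient='columns': the outer keys 0, 1, 2 become column labels.
as_columns = pd.DataFrame.from_dict(data_i)
# orient='index': the outer keys become row labels instead.
as_index = pd.DataFrame.from_dict(data_i, orient='index')

print(as_columns.shape, list(as_columns.columns))
print(as_index.shape, list(as_index.columns))

# Transposing the first result gives the same values as the second.
print((as_columns.T.values == as_index.values).all())
```

So if you end up with the wrong orientation, transposing with .T is usually enough to recover.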
Setting a custom index
If you need a custom index on the resulting DataFrame, you can set it with the index argument:

    pd.DataFrame(data, index=['a', 'b', 'c'])
    # pd.DataFrame.from_records(data, index=['a', 'b', 'c'])

       A  B  C  D
    a  5  0  3  3
    b  7  9  3  5
    c  2  4  7  6
This is not supported by pd.DataFrame.from_dict.
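If you are tied to from_dict for some reason, two possible workarounds are sketched below: relabel the index after construction, or key the records by label and load them with orient='index'. The labels list is an assumption for the example, not something from the question.

```python
import pandas as pd

data = [{'A': 5, 'B': 0, 'C': 3, 'D': 3},
        {'A': 7, 'B': 9, 'C': 3, 'D': 5},
        {'A': 2, 'B': 4, 'C': 7, 'D': 6}]
labels = ['a', 'b', 'c']  # illustrative index labels

# Workaround 1: build the frame first, then assign the index.
df1 = pd.DataFrame.from_dict(data)
df1.index = labels

# Workaround 2: turn the list into {label: record} and use orient='index'.
df2 = pd.DataFrame.from_dict(dict(zip(labels, data)), orient='index')

print(df1)
print(df2)
```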
All methods work out of the box when handling dictionaries with missing keys/column values. For example,

    data2 = [
        {'A': 5, 'C': 3, 'D': 3},
        {'A': 7, 'B': 9, 'F': 5},
        {'B': 4, 'C': 7, 'E': 6}]

    # The methods below all produce the same output.
    pd.DataFrame(data2)
    pd.DataFrame.from_dict(data2)
    pd.DataFrame.from_records(data2)

         A    B    C    D    E    F
    0  5.0  NaN  3.0  3.0  NaN  NaN
    1  7.0  9.0  NaN  NaN  NaN  5.0
    2  NaN  4.0  7.0  NaN  6.0  NaN
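You may notice the integers came back as 5.0, 7.0 and so on. That is because NaN is a float, so columns with gaps get upcast to float64. If that bothers you, here is a sketch of two possible ways around it, assuming a pandas version that has the nullable 'Int64' dtype (0.24 or later, as far as I know).

```python
import pandas as pd

data2 = [{'A': 5, 'C': 3, 'D': 3},
         {'A': 7, 'B': 9, 'F': 5},
         {'B': 4, 'C': 7, 'E': 6}]

df = pd.DataFrame(data2)
# Every column has at least one gap, so every column is float64.
print(df.dtypes)

# Option 1: fill the gaps with a sentinel value and cast back to plain integers.
filled = df.fillna(0).astype(int)

# Option 2: keep the gaps as <NA> by using the nullable integer dtype.
nullable = df.astype('Int64')

print(filled)
print(nullable)
```

Whether 0 is an acceptable sentinel depends on your data; the nullable dtype avoids having to pick one.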
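Reading a subset of columns
"What if I don't want to read in every single column?" You can specify this with the columns parameter.
For example, to read only columns 'A', 'D', and 'F' from data2 above, pass a list:

    pd.DataFrame(data2, columns=['A', 'D', 'F'])
    # pd.DataFrame.from_records(data2, columns=['A', 'D', 'F'])

         A    D    F
    0  5.0  3.0  NaN
    1  7.0  NaN  5.0
    2  NaN  NaN  NaN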
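Two side effects of the columns parameter are worth knowing, shown in the sketch below: the result follows the order of the list you pass, and a name that appears in no record does not raise but gives an all-NaN column. The name 'Z' is made up for illustration.

```python
import pandas as pd

data2 = [{'A': 5, 'C': 3, 'D': 3},
         {'A': 7, 'B': 9, 'F': 5},
         {'B': 4, 'C': 7, 'E': 6}]

# 'Z' is a hypothetical name that is not a key in any record.
df = pd.DataFrame(data2, columns=['F', 'A', 'Z'])

# Column order follows the list passed in, not the order of the keys.
print(list(df.columns))
# A name found in no record yields a column that is entirely NaN.
print(df['Z'].isna().all())
```

So a typo in a column name fails silently here, which is easy to miss.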
This is not supported by pd.DataFrame.from_dict with the default orient "columns":

    pd.DataFrame.from_dict(data2, orient='columns', columns=['A', 'B'])

    ValueError: cannot use columns parameter with orient='columns'
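One way around that error, if you still want from_dict, is to load everything and select the columns afterwards. A minimal sketch, with subset and raised as illustrative names:

```python
import pandas as pd

data2 = [{'A': 5, 'C': 3, 'D': 3},
         {'A': 7, 'B': 9, 'F': 5},
         {'B': 4, 'C': 7, 'E': 6}]

# Passing columns together with orient='columns' raises a ValueError.
try:
    pd.DataFrame.from_dict(data2, orient='columns', columns=['A', 'B'])
    raised = False
except ValueError as exc:
    raised = True
    print(exc)

# Build the full frame, then keep only the columns of interest.
subset = pd.DataFrame.from_dict(data2)[['A', 'B']]
print(subset)
```

The trade-off is that every column is materialised first, which may matter for very wide records.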
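Reading a subset of rows
None of these methods supports this directly. You will have to iterate over your data and perform a reverse delete in place as you iterate. For example, to extract only rows 0 and 2 from data2 above, you can use:

    rows_to_select = {0, 2}
    # Iterate in reverse so deletions do not shift the indices still to visit.
    # Note: this mutates data2 in place.
    for i in reversed(range(len(data2))):
        if i not in rows_to_select:
            del data2[i]

    pd.DataFrame(data2)
    # pd.DataFrame.from_dict(data2)
    # pd.DataFrame.from_records(data2)

         A    B  C    D    E
    0  5.0  NaN  3  3.0  NaN
    1  NaN  4.0  7  NaN  6.0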
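If you would rather not modify data2, here is a sketch of two non-destructive alternatives: filter into a new list before loading, or load everything and pick rows by position with iloc. This is not the approach from the answer above, just a variation on it.

```python
import pandas as pd

data2 = [{'A': 5, 'C': 3, 'D': 3},
         {'A': 7, 'B': 9, 'F': 5},
         {'B': 4, 'C': 7, 'E': 6}]
rows_to_select = {0, 2}

# Alternative 1: build a new list instead of deleting from data2 in place.
selected = [row for i, row in enumerate(data2) if i in rows_to_select]
df = pd.DataFrame(selected)

# Alternative 2: load everything, then select rows by position.
df_iloc = pd.DataFrame(data2).iloc[sorted(rows_to_select)]

print(df)
print(df_iloc)
```

The two results are not identical: the iloc version keeps the original row labels (0 and 2) and still carries column 'F', which is all NaN for the selected rows.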
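The panacea for nested data: json_normalize
As an alternative to the methods above, json_normalize also works with lists of dictionaries (records):

    # pandas >= 1.0; older versions expose this as pd.io.json.json_normalize
    pd.json_normalize(data)

       A  B  C  D
    0  5  0  3  3
    1  7  9  3  5
    2  2  4  7  6

    pd.json_normalize(data2)

         A    B  C    D    E
    0  5.0  NaN  3  3.0  NaN
    1  NaN  4.0  7  NaN  6.0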
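Again, keep in mind that the data passed to json_normalize needs to be in the list-of-dictionaries (records) format.
As mentioned, json_normalize can also handle nested dictionaries, for example: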
    data_nested = [
        {'counties': [{'name': 'Dade', 'population': 12345},
                      {'name': 'Broward', 'population': 40000},
                      {'name': 'Palm Beach', 'population': 60000}],
         'info': {'governor': 'Rick Scott'},
         'shortname': 'FL',
         'state': 'Florida'},
        {'counties': [{'name': 'Summit', 'population': 1234},
                      {'name': 'Cuyahoga', 'population': 1337}],
         'info': {'governor': 'John Kasich'},
         'shortname': 'OH',
         'state': 'Ohio'}
    ]

    # pandas >= 1.0; older versions expose this as pd.io.json.json_normalize
    pd.json_normalize(data_nested,
                      record_path='counties',
                      meta=['state', 'shortname', ['info', 'governor']])

             name  population    state shortname info.governor
    0        Dade       12345  Florida        FL    Rick Scott
    1     Broward       40000  Florida        FL    Rick Scott
    2  Palm Beach       60000  Florida        FL    Rick Scott
    3      Summit        1234     Ohio        OH   John Kasich
    4    Cuyahoga        1337     Ohio        OH   John Kasich
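For the simpler case where records only contain nested dictionaries (no inner lists), you do not need record_path or meta at all. The sketch below uses a cut-down version of the data above and assumes pandas 1.0 or later, where the function is available as pd.json_normalize.

```python
import pandas as pd

# A reduced version of data_nested: nested dicts only, no 'counties' lists.
records = [
    {'state': 'Florida', 'info': {'governor': 'Rick Scott'}},
    {'state': 'Ohio', 'info': {'governor': 'John Kasich'}},
]

# json_normalize flattens inner dicts into dotted column names.
flat = pd.json_normalize(records)
print(sorted(flat.columns))

# The plain constructor keeps each inner dict as a single object in one column.
raw = pd.DataFrame(records)
print(type(raw.loc[0, 'info']))
```

That difference is the main reason to reach for json_normalize over the plain constructor when the records are nested.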
For more information on the meta and record_path arguments, check the documentation.
Here's a table of all the methods discussed above, along with supported features/functionality.
You can also use
...: {'points': 25, 'time': '6:00', 'month':"february