How to use PyCall in Julia to convert Python output to a Julia DataFrame

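I want to retrieve some data from quandl and analyze it in Julia. Unfortunately, there is no official API available for this (yet). I am aware of this solution, but its functionality is still quite limited and it does not follow the same syntax as the original Python API.

I thought it would be a smart move to use PyCall to retrieve the data with the official Python API from within Julia. This does produce output, but I am not sure how to convert it into a format I can use in Julia (ideally a DataFrame).

I tried the following: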

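using PyCall, DataFrames
@pyimport quandl  # wrap the Python quandl module

# fetch the WIKI/AAPL dataset, asking the Python API for a pandas object
data = quandl.get("WIKI/AAPL", returns="pandas");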

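Julia converts this output to a Dict{Any,Any}. When I use returns="numpy" instead of returns="pandas", I end up with a PyObject rec.array.

How can I get data to be a Julia DataFrame, as quandl.jl would return it? Note that quandl.jl is not an option for me because it does not support automatic retrieval of multiple assets and lacks several other features, so I have to use the Python API.

Thank you for any suggestions!

Here is one option:

First, extract the column names from the data object: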

julia> colnames = map(Symbol, data[:columns])
12-element Array{Symbol,1}:
:Open
:High
:Low
:Close
:Volume
Symbol("Ex-Dividend")
Symbol("Split Ratio")
Symbol("Adj. Open")
Symbol("Adj. High")
Symbol("Adj. Low")
Symbol("Adj. Close")
Symbol("Adj. Volume")

Then put all the columns into a DataFrame:

julia> y = DataFrame(Any[Array(data[c]) for c in colnames], colnames)

6×12 DataFrames.DataFrame
│ Row │ Open  │ High  │ Low   │ Close │ Volume   │ Ex-Dividend │ Split Ratio │
├─────┼───────┼───────┼───────┼───────┼──────────┼─────────────┼─────────────┤
│ 1   │ 28.75 │ 28.87 │ 28.75 │ 28.75 │ 2.0939e6 │ 0.0         │ 1.0         │
│ 2   │ 27.38 │ 27.38 │ 27.25 │ 27.25 │ 785200.0 │ 0.0         │ 1.0         │
│ 3   │ 25.37 │ 25.37 │ 25.25 │ 25.25 │ 472000.0 │ 0.0         │ 1.0         │
│ 4   │ 25.87 │ 26.0  │ 25.87 │ 25.87 │ 385900.0 │ 0.0         │ 1.0         │
│ 5   │ 26.63 │ 26.75 │ 26.63 │ 26.63 │ 327900.0 │ 0.0         │ 1.0         │
│ 6   │ 28.25 │ 28.38 │ 28.25 │ 28.25 │ 217100.0 │ 0.0         │ 1.0         │

│ Row │ Adj. Open │ Adj. High │ Adj. Low │ Adj. Close │ Adj. Volume │
├─────┼───────────┼───────────┼──────────┼────────────┼─────────────┤
│ 1   │ 0.428364  │ 0.430152  │ 0.428364 │ 0.428364   │ 1.17258e8   │
│ 2   │ 0.407952  │ 0.407952  │ 0.406015 │ 0.406015   │ 4.39712e7   │
│ 3   │ 0.378004  │ 0.378004  │ 0.376216 │ 0.376216   │ 2.6432e7    │
│ 4   │ 0.385453  │ 0.38739   │ 0.385453 │ 0.385453   │ 2.16104e7   │
│ 5   │ 0.396777  │ 0.398565  │ 0.396777 │ 0.396777   │ 1.83624e7   │
│ 6   │ 0.420914  │ 0.422851  │ 0.420914 │ 0.420914   │ 1.21576e7   │

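Thanks to @Matt B. for the suggestion that simplified the code.

The problem with the above is that the column types inside the DataFrame are Any. To make it more efficient, here are a few functions that get the job done: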

# first, guess the Julia equivalent of the object's NumPy dtype
function guess_type(x::PyCall.PyObject)
    string_dtype = x[:dtype][:name]  # e.g. "float64"
    # capitalize the first letter: "float64" -> "Float64"
    julia_string = string(uppercase(string_dtype[1]), string_dtype[2:end])

    # parse(::String) is the Julia 0.6 spelling; on Julia 0.7+ use Meta.parse
    return eval(parse(julia_string))
end

# convert an individual column, falling back to an Any array if the guess was wrong
function convert_column(x)
    y = try
        Array{guess_type(x)}(x)
    catch
        Array(x)
    end
    return y
end

# put everything together into a single function
function convert_pandas(df)
    # read the column names from the argument, not from the global `data`
    colnames = map(Symbol, df[:columns])
    y = DataFrame(Any[convert_column(df[c]) for c in colnames], colnames)

    return y
end

Applied to data, the functions above give the same column names as before, but with the correct Float64 column types:

y = convert_pandas(data);
showcols(y)
9147×12 DataFrames.DataFrame
│ Col # │ Name        │ Eltype  │ Missing │
├───────┼─────────────┼─────────┼─────────┤
│ 1     │ Open        │ Float64 │ 0       │
│ 2     │ High        │ Float64 │ 0       │
│ 3     │ Low         │ Float64 │ 0       │
│ 4     │ Close       │ Float64 │ 0       │
│ 5     │ Volume      │ Float64 │ 0       │
│ 6     │ Ex-Dividend │ Float64 │ 0       │
│ 7     │ Split Ratio │ Float64 │ 0       │
│ 8     │ Adj. Open   │ Float64 │ 0       │
│ 9     │ Adj. High   │ Float64 │ 0       │
│ 10    │ Adj. Low    │ Float64 │ 0       │
│ 11    │ Adj. Close  │ Float64 │ 0       │
│ 12    │ Adj. Volume │ Float64 │ 0       │
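
The type-guessing step above builds a type name from the dtype string and evaluates it, which only works when the capitalized dtype name happens to be a valid Julia type (a pandas "object" column, for example, would fall through to the Any fallback). As a possible alternative, here is a minimal sketch that uses an explicit lookup table instead of eval. The names DTYPE_TO_JULIA and lookup_type are hypothetical and not part of PyCall or DataFrames, and the table lists only a handful of common dtypes as an assumption about what the data contains.

```julia
# Hypothetical lookup table from NumPy dtype names to Julia element types.
# Only a few common dtypes are listed here; extend it as needed.
const DTYPE_TO_JULIA = Dict(
    "float64" => Float64,
    "float32" => Float32,
    "int64"   => Int64,
    "int32"   => Int32,
    "bool"    => Bool,
)

# Return the Julia type for a dtype name, or Any when the name is not in the table.
lookup_type(dtype_name::AbstractString) = get(DTYPE_TO_JULIA, dtype_name, Any)

lookup_type("float64")  # known dtype: looked up in the table
lookup_type("object")   # unknown dtype: falls back to Any
```

With this in place, the body of guess_type could be reduced to lookup_type(x[:dtype][:name]), so unknown dtypes yield Any directly instead of raising an error that convert_column has to catch.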