How to iterate over the individual values of one column in a multi-column DataFrame?
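I have a multi-column DataFrame with the columns [Country], [Energy Supply], [Energy Supply per Capita] and [% Renewable].
In the Energy Supply column, I want to convert the unit from petajoules to gigajoules. But in the process, when a value looks like "..." (missing values are marked this way), the string gets multiplied as well, i.e. repeated.
To stop this from happening, I am running: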
```python
energy = pd.read_excel("Energy Indicators.xls",skiprows = 16, skip_footer = 38)
energy.drop(['Unnamed: 0','Unnamed: 1'],axis = 1, inplace = True)
energy.columns = ['Country', 'Energy Supply', 'Energy Supply per Capita', '% Renewable']
for i in energy['Energy Supply']:
    if (isinstance(energy[i],int) == True):
        energy['Energy Supply'][i]=energy['Energy Supply'][i]*1000000
return (energy)
```
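But I am not getting the result I want, which is to change only the values of integer type: nothing changes at all.
I think the problem is that the first two lines of the loop give the same result for every value: because the first row is a 'str', the program does not modify the values. What I want instead is to check each value individually for whether it is of integer type and, if it is, multiply the number by 1000000.
Input: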
```
          Country  Energy Supply  Energy Supply per Capita  % Renewable
0             NaN     Petajoules                Gigajoules            %
1     Afghanistan            321                        10      78.6693
2         Albania            102                        35          100
3         Algeria           1959                        51      0.55101
4  American Samoa            ...                       ...     0.641026
```
Expected output:
```
          Country  Energy Supply  Energy Supply per Capita  % Renewable
0             NaN     Petajoules                Gigajoules            %
1     Afghanistan        3210000                        10      78.6693
2         Albania        1020000                        35          100
3         Algeria       19590000                        51      0.55101
4  American Samoa            ...                       ...     0.641026
```
Current output:
```
          Country    Energy Supply  Energy Supply per Capita  % Renewable
0             NaN  PetajoulesPeta.                Gigajoules            %
1     Afghanistan          3210000                        10      78.6693
2         Albania          1020000                        35          100
3         Algeria         19590000                        51      0.55101
4  American Samoa         ........                       ...     0.641026
```
You can use:
```python
energy['Energy Supply'] = energy['Energy Supply'].apply(lambda x: int(x) * 1000000 if str(x).isnumeric() else x)
print (energy)
```

```
          Country  Energy Supply  Energy Supply per Capita  % Renewable
0             NaN     Petajoules                Gigajoules            %
1     Afghanistan      321000000                        10      78.6693
2         Albania      102000000                        35          100
3         Algeria     1959000000                        51      0.55101
4  American Samoa            ...                        ..     0.641026
```
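One caveat with the `str(x).isnumeric()` check: it is only true for strings made up entirely of digit characters, so values such as `-5` or `2.5` would be left unchanged. A minimal sketch of a type-based alternative, assuming the column holds real Python or NumPy numbers mixed with strings. The helper name `scale_if_number` and the sample values are made up for illustration and are not part of the original data:

```python
import numbers

import pandas as pd


def scale_if_number(value, factor=1_000_000):
    """Multiply real numbers by factor; return anything else unchanged."""
    # bool is a subclass of int, so exclude it explicitly.
    if isinstance(value, numbers.Real) and not isinstance(value, bool):
        return value * factor
    return value


# Hypothetical sample mirroring the mixed column in the question.
sample = pd.Series(["Petajoules", 321, -5, 2.5, "..."], dtype=object)
scaled = sample.apply(scale_if_number)
print(scaled.tolist())
```

Strings like "Petajoules" and "..." pass through untouched because the check looks at the value's type rather than at its text.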
This works for me:
```python
import pandas as pd
import numpy as np

data = {"Energy Supply":[1,30,"Petajoules",5,70]*2000000}
energy = pd.DataFrame(data)
```
Input:
```
    Energy Supply
0               1
1              30
2      Petajoules
3               5
4              70
5               1
6              30
7      Petajoules
8               5
9              70
10              1
11             30
12     Petajoules
13              5
14             70
15              1
16             30
17     Petajoules
18              5
19             70
20              1
21             30
22     Petajoules
23              5
24             70
25              1
26             30
27     Petajoules
28              5
29             70
...

[10000000 rows x 1 columns]
```
Then I convert the Series to an array and set the values:
```python
arr = energy["Energy Supply"].values
for i in range(len(arr)):
    if isinstance(arr[i],int):
        arr[i] = arr[i]*1000000
    else:
        pass
```
The output looks like this:
```
    Energy Supply
0         1000000
1        30000000
2      Petajoules
3         5000000
4        70000000
5         1000000
6        30000000
7      Petajoules
8         5000000
9        70000000
10        1000000
11       30000000
12     Petajoules
13        5000000
14       70000000
15        1000000
16       30000000
17     Petajoules
18        5000000
19       70000000
20        1000000
21       30000000
22     Petajoules
23        5000000
24       70000000
25        1000000
26       30000000
27     Petajoules
28        5000000
29       70000000
...

[10000000 rows x 1 columns]
```
This solution is about twice as fast as `apply`.
Looping over the array:
```
loop: 100%|██████████| 10000000/10000000 [00:07<00:00, 1376439.75it/s]
```
Using `apply`:
```
apply: 100%|██████████| 10000000/10000000 [00:14<00:00, 687420.00it/s]
```
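To reproduce a comparison like the one above on your own machine, here is a rough sketch using the standard `timeit` module instead of a progress bar. It assumes a much smaller sample than the 10,000,000 rows used above so that it finishes quickly; the function names `with_loop` and `with_apply` are made up for illustration, and timings will vary by machine and pandas version:

```python
import timeit

import pandas as pd

# Smaller than the original test data so the sketch runs quickly.
values = [1, 30, "Petajoules", 5, 70] * 20_000


def with_loop():
    # Take a writable copy of the underlying object array, then edit it in place.
    arr = pd.Series(values, dtype=object).to_numpy(copy=True)
    for i in range(len(arr)):
        if isinstance(arr[i], int):
            arr[i] = arr[i] * 1_000_000
    return pd.Series(arr)


def with_apply():
    series = pd.Series(values, dtype=object)
    return series.apply(lambda x: x * 1_000_000 if isinstance(x, int) else x)


loop_result = with_loop()
apply_result = with_apply()
same = loop_result.tolist() == apply_result.tolist()
print("results match:", same)

# Both timings include building the Series, so the overhead is the same for each.
print("loop :", timeit.timeit(with_loop, number=3))
print("apply:", timeit.timeit(with_apply, number=3))
```

Checking that both approaches return the same values before timing them avoids comparing the speed of two functions that do different things.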
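If you convert the Series to numeric, the string values become NaN. Converting the Series with pd.to_numeric() and multiplying the values with np.where takes about 5 seconds: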
```python
import pandas as pd
import numpy as np
import time

data = {"Energy Supply":[1,30,"Petajoules",5,70]*2000000}
energy = pd.DataFrame(data)
t = time.time()
energy["Energy Supply"] = pd.to_numeric(energy["Energy Supply"],errors="coerce")
energy["Energy_Supply"] = np.where((energy["Energy Supply"]%1==0),energy["Energy Supply"]*100,energy["Energy Supply"])
t1 = time.time()
print(t1-t)
```

```
5.275099515914917
```
But you can also simply do this after using pd.to_numeric():
```python
energy["Energy Supply"] = energy["Energy Supply"]*1000000
```
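If the placeholder strings should survive instead of turning into NaN, one possible variation is to scale the numeric version and then fall back to the original value wherever the conversion failed. This is only a sketch on a small made-up sample; the column names `Supply (numeric)` and `Supply (mixed)` are hypothetical and chosen for illustration:

```python
import pandas as pd

# Hypothetical sample mirroring the mixed column in the question.
energy = pd.DataFrame({"Energy Supply": ["Petajoules", 321, 102, 1959, "..."]})

# Non-numeric entries become NaN.
numeric = pd.to_numeric(energy["Energy Supply"], errors="coerce")

# Option 1: a fully numeric column, with NaN where the original held a string.
energy["Supply (numeric)"] = numeric * 1_000_000

# Option 2: scaled numbers, with the original strings kept where conversion failed.
energy["Supply (mixed)"] = (numeric * 1_000_000).where(
    numeric.notna(), energy["Energy Supply"]
)
print(energy)
```

The first option keeps the column numeric, which makes later arithmetic and aggregation straightforward. The second produces an object column again, so it is mainly useful for display.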