How to classify observations based on their covariates in dataframe and numpy?
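I have a dataset with n observations and, say, 2 variables x1 and x2. I am trying to classify each observation based on a set of conditions on its (x1, x2) values. For example, the dataset looks like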
```
df:
Index    X1    X2
1        0.2   0.8
2        0.6   0.2
3        0.2   0.1
4        0.9   0.3
```
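The groups are defined as

- Group 1: x1 < 0.5 & x2 >= 0.5
- Group 2: x1 >= 0.5 & x2 >= 0.5
- Group 3: x1 < 0.5 & x2 < 0.5
- Group 4: x1 >= 0.5 & x2 < 0.5

I would like to generate the following dataframe.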
```
expected result:
Index    X1    X2    Group
1        0.2   0.8   1
2        0.6   0.2   4
3        0.2   0.1   3
4        0.9   0.3   4
```
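Also, would it be better/faster to use numpy arrays for this type of problem?

In answer to your last question, I definitely think the two can be used together here. For example, `np.select` from numpy can build the new pandas column: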
```python
import numpy as np
import pandas as pd

# Lay out your conditions
conditions = [((df.X1 < 0.5) & (df.X2 >= 0.5)),
              ((df.X1 >= 0.5) & (df.X2 >= 0.5)),
              ((df.X1 < 0.5) & (df.X2 < 0.5)),
              ((df.X1 >= 0.5) & (df.X2 < 0.5))]

# Name the resulting groups (in the same order as the conditions)
choicelist = [1, 2, 3, 4]

df['group'] = np.select(conditions, choicelist, default=-1)

# Above, I've set the default to -1, but change it as you see fit:
# if none of your conditions are met, that row is classified as -1
```

```
>>> df
   Index   X1   X2  group
0      1  0.2  0.8      1
1      2  0.6  0.2      4
2      3  0.2  0.1      3
3      4  0.9  0.3      4
```
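For anyone who wants to run this end to end, here is a minimal self-contained sketch. The frame construction is an assumption based on the sample data in the question (with `Index` as an ordinary column, as in the printed output above). It also applies the same `np.select` call to plain NumPy arrays obtained with `to_numpy()`, which is one way to try the "numpy arrays" variant the question asks about.

```python
import numpy as np
import pandas as pd

# Sample data from the question (construction assumed for illustration)
df = pd.DataFrame({"Index": [1, 2, 3, 4],
                   "X1": [0.2, 0.6, 0.2, 0.9],
                   "X2": [0.8, 0.2, 0.1, 0.3]})

# Pull the columns out as plain NumPy arrays
x1 = df["X1"].to_numpy()
x2 = df["X2"].to_numpy()

# One boolean mask per group, in the same order as the labels below
conditions = [
    (x1 < 0.5) & (x2 >= 0.5),   # group 1
    (x1 >= 0.5) & (x2 >= 0.5),  # group 2
    (x1 < 0.5) & (x2 < 0.5),    # group 3
    (x1 >= 0.5) & (x2 < 0.5),   # group 4
]

# Rows matching no condition (for example NaN values) get -1
groups = np.select(conditions, [1, 2, 3, 4], default=-1)
df["group"] = groups
print(df)
```

Whether the array version is noticeably faster than passing the pandas Series directly probably depends on the size of the data, so it is worth timing both on your own dataset.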
Something like this:
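```python
# ge (>=) rather than gt (>) so that a value of exactly 0.5 falls in the
# group the question defines for it; each row becomes a string such as
# 'FalseTrue', which is then mapped to its group number
df[['X1', 'X2']].ge(0.5).astype(str).sum(axis=1).map(
    {'FalseTrue': 1, 'TrueFalse': 4, 'FalseFalse': 3, 'TrueTrue': 2})
```

```
Out[56]:
0    1
1    4
2    3
3    4
dtype: int64
```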
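The same idea (turn each row into a pair of booleans, then map the pair to a group) can also be sketched without string concatenation by indexing a small lookup table. This is an alternative illustration rather than part of the answer above; the names `upper_x1`, `upper_x2`, and `lookup` are made up for the example, and the frame is rebuilt from the question's sample data.

```python
import numpy as np
import pandas as pd

# Sample data from the question (construction assumed for illustration)
df = pd.DataFrame({"X1": [0.2, 0.6, 0.2, 0.9],
                   "X2": [0.8, 0.2, 0.1, 0.3]})

# Boolean flags: is each value in the upper half (>= 0.5)?
upper_x1 = df["X1"].to_numpy() >= 0.5
upper_x2 = df["X2"].to_numpy() >= 0.5

# lookup[i, j] is the group for (X1 >= 0.5) == i and (X2 >= 0.5) == j
lookup = np.array([[3, 1],    # X1 <  0.5: X2 < 0.5 -> 3, X2 >= 0.5 -> 1
                   [4, 2]])   # X1 >= 0.5: X2 < 0.5 -> 4, X2 >= 0.5 -> 2

# Index the table with the flags converted to 0/1 integers
df["group"] = lookup[upper_x1.astype(int), upper_x2.astype(int)]
print(df)
```

One caveat: a NaN compares as False here and is therefore treated as "below 0.5", whereas the `np.select` approach would assign it the default label.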