How to implement 'in' and 'not in' for Pandas dataframe
How can I achieve the equivalents of SQL's IN and NOT IN?
I have a list with the required values. Here's the scenario:

    import pandas as pd

    df = pd.DataFrame({'countries': ['US', 'UK', 'Germany', 'China']})
    countries = ['UK', 'China']

    # pseudo-code (not valid pandas):
    # df[df['countries'] not in countries]

My current way of doing this is as follows:

    df = pd.DataFrame({'countries': ['US', 'UK', 'Germany', 'China']})
    countries = pd.DataFrame({'countries': ['UK', 'China'], 'matched': True})

    # IN
    df.merge(countries, how='inner', on='countries')

    # NOT IN
    not_in = df.merge(countries, how='left', on='countries')
    not_in = not_in[pd.isnull(not_in['matched'])]

But this seems like a horrible kludge. Can anyone improve on it?
You can use pd.Series.isin.
For "IN", use: something.isin(somewhere)
Or for "NOT IN": ~something.isin(somewhere)
As a worked example:

    >>> df
      countries
    0        US
    1        UK
    2   Germany
    3     China
    >>> countries
    ['UK', 'China']
    >>> df.countries.isin(countries)
    0    False
    1     True
    2    False
    3     True
    Name: countries, dtype: bool
    >>> df[df.countries.isin(countries)]
      countries
    1        UK
    3     China
    >>> df[~df.countries.isin(countries)]
      countries
    0        US
    2   Germany
An alternative solution that uses the .query() method:

    In [5]: df.query("countries in @countries")
    Out[5]:
      countries
    1        UK
    3     China

    In [6]: df.query("countries not in @countries")
    Out[6]:
      countries
    0        US
    2   Germany
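As a small sketch of the same idea: query should also accept a list literal written directly in the expression, so the external variable is optional. The frame below is the one from the question; the result names (in_var, in_literal, not_in_literal) are only for illustration.

```python
import pandas as pd

df = pd.DataFrame({'countries': ['US', 'UK', 'Germany', 'China']})
countries = ['UK', 'China']

# Reference a Python variable with the @ prefix...
in_var = df.query("countries in @countries")

# ...or write the list inline inside the query string.
in_literal = df.query("countries in ['UK', 'China']")
not_in_literal = df.query("countries not in ['UK', 'China']")

print(in_literal)
print(not_in_literal)
```

The @ form is usually the more convenient one when the list is built at runtime.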
I've usually been doing generic filtering over rows like this:

    criterion = lambda row: row['countries'] not in countries
    not_in = df[df.apply(criterion, axis=1)]
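For comparison, here is a minimal sketch that runs the row-wise filter above next to the vectorised isin form on the question's data and checks that they select the same rows. The names row_wise and vectorised are illustrative only.

```python
import pandas as pd

df = pd.DataFrame({'countries': ['US', 'UK', 'Germany', 'China']})
countries = ['UK', 'China']

# Row-wise version: the Python function is called once per row.
criterion = lambda row: row['countries'] not in countries
row_wise = df[df.apply(criterion, axis=1)]

# Vectorised version: a single isin call on the whole column, then negated.
vectorised = df[~df['countries'].isin(countries)]

print(row_wise.equals(vectorised))
```

The apply version is more flexible (any Python predicate works), but for a plain membership test the isin form avoids the per-row function call.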
I wanted to filter out the dfbc rows whose BUSINESS_ID also appears in the BUSINESS_ID column of dfProfilesBusIds.
Finally got it working:

    dfbc = dfbc[~dfbc['BUSINESS_ID'].isin(dfProfilesBusIds['BUSINESS_ID'])]
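A minimal sketch of that filter on made-up data, next to an equivalent anti-join written with merge and indicator=True. The contents of both frames and the NAME column are hypothetical stand-ins; only the frame names and BUSINESS_ID come from the post above.

```python
import pandas as pd

# Hypothetical stand-ins for the two frames named above.
dfbc = pd.DataFrame({'BUSINESS_ID': [101, 102, 103, 104],
                     'NAME': ['a', 'b', 'c', 'd']})
dfProfilesBusIds = pd.DataFrame({'BUSINESS_ID': [102, 104, 999]})

# NOT IN via isin: keep rows whose id is absent from the other frame.
kept = dfbc[~dfbc['BUSINESS_ID'].isin(dfProfilesBusIds['BUSINESS_ID'])]

# The same anti-join via merge: indicator=True adds a '_merge' column
# whose value is 'left_only' for rows with no match on the right.
merged = dfbc.merge(dfProfilesBusIds.drop_duplicates(),
                    on='BUSINESS_ID', how='left', indicator=True)
anti = merged.loc[merged['_merge'] == 'left_only', dfbc.columns]

print(kept)
print(anti)
```

The isin form is shorter for a single key column; the merge form may be easier to extend when the match involves several columns.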
How to implement 'in' and 'not in' for a pandas DataFrame?
Pandas offers two methods: Series.isin and DataFrame.isin, for Series and DataFrames respectively.

    ╒════════╤══════════════════════╤══════════════════════╕
    │        │ Python               │ Pandas               │
    ╞════════╪══════════════════════╪══════════════════════╡
    │ in     │ item in sequence     │ sequence.isin(item)  │
    ├────────┼──────────────────────┼──────────────────────┤
    │ not in │ item not in sequence │ ~sequence.isin(item) │
    ╘════════╧══════════════════════╧══════════════════════╛

To implement "not in", you must invert the result of isin with ~.
Also note that in the pandas case, "…
The most common scenario is applying an isin condition on a specific column to filter rows in a DataFrame.
    import numpy as np
    import pandas as pd

    df = pd.DataFrame({'countries': ['US', 'UK', 'Germany', np.nan, 'China']})
    df
      countries
    0        US
    1        UK
    2   Germany
    3       NaN
    4     China

    c1 = ['UK', 'China']             # list
    c2 = {'Germany'}                 # set
    c3 = pd.Series(['China', 'US'])  # Series
    c4 = np.array(['US', 'UK'])      # array

    df['countries'].isin(c1)

    0    False
    1     True
    2    False
    3    False
    4     True
    Name: countries, dtype: bool

    # `in` operation
    df[df['countries'].isin(c1)]

      countries
    1        UK
    4     China

    # `not in` operation
    df[~df['countries'].isin(c1)]

      countries
    0        US
    2   Germany
    3       NaN

    # Filter with `set` (tuples work too)
    df[df['countries'].isin(c2)]

      countries
    2   Germany

    # Filter with another Series
    df[df['countries'].isin(c3)]

      countries
    0        US
    4     China

    # Filter with array
    df[df['countries'].isin(c4)]

      countries
    0        US
    1        UK
Filter on many columns
Sometimes you will want to apply an "in" membership check with some search terms over multiple columns:

    df2 = pd.DataFrame({
        'A': ['x', 'y', 'z', 'q'], 'B': ['w', 'a', np.nan, 'x'], 'C': np.arange(4)})
    df2

       A    B  C
    0  x    w  0
    1  y    a  1
    2  z  NaN  2
    3  q    x  3

    c1 = ['x', 'w', 'p']

To apply the isin condition to both columns "A" and "B", use DataFrame.isin:

    df2[['A', 'B']].isin(c1)

           A      B
    0   True   True
    1  False  False
    2  False  False
    3  False   True

From this, to retain the rows where at least one column is True, use any along axis 1:

    df2[['A', 'B']].isin(c1).any(axis=1)

    0     True
    1    False
    2    False
    3     True
    dtype: bool

    df2[df2[['A', 'B']].isin(c1).any(axis=1)]

       A  B  C
    0  x  w  0
    3  q  x  3

Similarly, to retain the rows where all columns are True, use all in the same way:

    df2[df2[['A', 'B']].isin(c1).all(axis=1)]

       A  B  C
    0  x  w  0
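When each column needs its own set of allowed values, DataFrame.isin can also take a dict that maps column names to value lists. The sketch below reuses df2 from above; the dict name allowed and its contents are assumptions made up for illustration.

```python
import numpy as np
import pandas as pd

df2 = pd.DataFrame({'A': ['x', 'y', 'z', 'q'],
                    'B': ['w', 'a', np.nan, 'x'],
                    'C': np.arange(4)})

# Hypothetical per-column search terms: keys are column names.
allowed = {'A': ['x', 'q'], 'B': ['w']}

# Columns that are not keys of the dict (here 'C') come back all False,
# so restrict the mask to the checked columns before reducing with all().
mask = df2.isin(allowed)
rows = df2[mask[list(allowed)].all(axis=1)]

print(mask)
print(rows)
```

Swapping all(axis=1) for any(axis=1) keeps rows where at least one of the checked columns matches its own list.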
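Notable mentions:
In addition to the methods described above, you can also use the NumPy equivalent, numpy.isin:

    # `in` operation
    df[np.isin(df['countries'], c1)]

      countries
    1        UK
    4     China

    # `not in` operation
    df[np.isin(df['countries'], c1, invert=True)]

      countries
    0        US
    2   Germany
    3       NaN

Why is it worth considering? NumPy functions are usually a bit faster than their pandas equivalents because of lower overhead. Since this is an elementwise operation that does not depend on index alignment, there are very few situations where this method is not an appropriate replacement for pandas' isin.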
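A quick way to convince yourself that the two are interchangeable for filtering is to compare the masks directly. This is only a sketch on the question's NaN-free data; pandas_mask and numpy_mask are illustrative names. The visible difference is the return type: pandas gives a boolean Series that carries the index, NumPy a plain boolean array.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'countries': ['US', 'UK', 'Germany', 'China']})
c1 = ['UK', 'China']

pandas_mask = df['countries'].isin(c1)                 # boolean Series with df's index
numpy_mask = np.isin(df['countries'].to_numpy(), c1)   # plain boolean ndarray

# Both masks select the same rows when used for boolean indexing.
same_rows = df[pandas_mask].equals(df[numpy_mask])
print(type(pandas_mask).__name__, type(numpy_mask).__name__, same_rows)
```

Any speed difference depends on data size and dtype, so it is worth timing both on your own data before switching.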
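Pandas routines are usually iterative when working with strings, because string operations are hard to vectorise. There is a lot of evidence to suggest that list comprehensions will be faster here. We resort to a plain Python `in` check now:

    c1_set = set(c1)  # Using `in` with `sets` is a constant time operation...
                      # This doesn't matter for pandas because the implementation differs.

    # `in` operation
    df[[x in c1_set for x in df['countries']]]

      countries
    1        UK
    4     China

    # `not in` operation
    df[[x not in c1_set for x in df['countries']]]

      countries
    0        US
    2   Germany
    3       NaN

It is a lot more unwieldy to specify, however, so don't use it unless you know what you're doing.
Lastly, there's also DataFrame.query, which is covered in the .query() answer above.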
    df = pd.DataFrame({'countries': ['US', 'UK', 'Germany', 'China']})
    countries = ['UK', 'China']

Implement "in":

    df[df.countries.isin(countries)]

Implement "not in" as "in" over the remaining countries:

    df[df.countries.isin([x for x in np.unique(df.countries) if x not in countries])]
I prefer to create a function in which I can use R-style %in% and %!in% operators:

    import re
    import pandas as pd

    def subset(df, query="", select=("")):
        query = str(query)
        # Swap the boolean operators for placeholder characters, strip spaces, then split.
        query = query.replace("|", "¤")
        query = query.replace("&", "§")
        query = query.replace(" ", "")
        queries = re.split("¤|§", query)
        # the %!in% and %in% operators have to be evaluated afterwards
        in_operator = ["%in%", "%!in%"]
        queries_lc = [q for q in queries if any(x in q for x in in_operator)]
        queries = [q for q in queries if not any(x in q for x in in_operator)]
        if len(select) == 0:
            select = df.columns
        query = "".join(queries)
        query = query.replace("¤", "|")
        query = query.replace("§", "&")
        if len(queries_lc) > 0:
            for lc_q in queries_lc:
                if in_operator[0] in lc_q:  # %in%
                    var, list_con = re.split(in_operator[0], lc_q)
                    globals()["list_condition_used_in_subset"] = eval(list_con)
                    df = df[df[var].isin(list_condition_used_in_subset)]
                else:  # %!in% - not in
                    var, list_con = re.split(in_operator[1], lc_q)
                    globals()["list_condition_used_in_subset"] = eval(list_con)
                    df = df[~df[var].isin(list_condition_used_in_subset)]
        if len(queries) == 0 and len(queries_lc) > 0:
            df = df[select]  # if only a list condition query
        else:
            df = pd.DataFrame(df.query(query)[select])  # perform query and return selected - normal thing
        return df

    df = pd.DataFrame({'countries': ['US', 'UK', 'Germany', 'China'], "GDP": [1, 2, 3, 4]})
    countries = ['UK', 'China']

    subset(df, query="countries %in% countries & GDP > 2")

      countries  GDP
    3     China    4

    subset(df, query="countries %!in% countries", select=["GDP"])

       GDP
    0    1
    2    3

It may be rather long, but it can be used for many purposes.