Pyspark: Filter dataframe based on multiple conditions
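I want to filter a dataframe on two conditions: first, d < 5; second, if the value in col1 equals the corresponding value in col3, then the value in col2 must not equal the corresponding value in col4.
Given the original dataframe: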
+----+----+----+----+---+
|col1|col2|col3|col4|  d|
+----+----+----+----+---+
|   A|  xx|   D|  vv|  4|
|   C| xxx|   D|  vv| 10|
|   A|   x|   A|  xx|  3|
|   E| xxx|   B|  vv|  3|
|   E| xxx|   F| vvv|  6|
|   F|xxxx|   F| vvv|  4|
|   G| xxx|   G| xxx|  4|
|   G| xxx|   G|  xx|  4|
|   G| xxx|   G| xxx| 12|
|   B|xxxx|   B|  xx| 13|
+----+----+----+----+---+
The desired dataframe is:
+----+----+----+----+---+
|col1|col2|col3|col4|  d|
+----+----+----+----+---+
|   A|  xx|   D|  vv|  4|
|   A|   x|   A|  xx|  3|
|   E| xxx|   B|  vv|  3|
|   F|xxxx|   F| vvv|  4|
|   G| xxx|   G|  xx|  4|
+----+----+----+----+---+
The code I tried does not work as expected:
cols = [('A','xx','D','vv',4), ('C','xxx','D','vv',10), ('A','x','A','xx',3),
        ('E','xxx','B','vv',3), ('E','xxx','F','vvv',6), ('F','xxxx','F','vvv',4),
        ('G','xxx','G','xxx',4), ('G','xxx','G','xx',4), ('G','xxx','G','xxx',12),
        ('B','xxxx','B','xx',13)]
df = spark.createDataFrame(cols, ['col1','col2','col3','col4','d'])

df.filter((df.d<5) & (df.col2!=df.col4) & (df.col1==df.col3)).show()

+----+----+----+----+---+
|col1|col2|col3|col4|  d|
+----+----+----+----+---+
|   A|   x|   A|  xx|  3|
|   F|xxxx|   F| vvv|  4|
|   G| xxx|   G|  xx|  4|
+----+----+----+----+---+
What should I do to get the expected result?
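Your logical condition is wrong: it requires col1 == col3 for every row, so rows where col1 and col3 differ are dropped, even though the rule only restricts rows where they match. IIUC, what you want is: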
import pyspark.sql.functions as f

df.filter((f.col('d') < 5))\
    .filter(
        ((f.col('col1') != f.col('col3')) |
         (f.col('col2') != f.col('col4')) & (f.col('col1') == f.col('col3')))
    )\
    .show()
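I split the filter into two chained filter() calls; combining both conditions in a single call is equivalent.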
Output:
+----+----+----+----+---+
|col1|col2|col3|col4|  d|
+----+----+----+----+---+
|   A|  xx|   D|  vv|  4|
|   A|   x|   A|  xx|  3|
|   E| xxx|   B|  vv|  3|
|   F|xxxx|   F| vvv|  4|
|   G| xxx|   G|  xx|  4|
+----+----+----+----+---+
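As a sanity check, the rule "if col1 equals col3, then col2 must differ from col4" is a logical implication, which can be rewritten as "col1 != col3 or col2 != col4". The sketch below replays both predicates over the question's sample rows in plain Python, without Spark. The names `rows`, `keep`, and `keep_original` are illustrative and do not come from the original answer.

```python
# Sample rows from the question: (col1, col2, col3, col4, d)
rows = [
    ('A', 'xx', 'D', 'vv', 4), ('C', 'xxx', 'D', 'vv', 10),
    ('A', 'x', 'A', 'xx', 3), ('E', 'xxx', 'B', 'vv', 3),
    ('E', 'xxx', 'F', 'vvv', 6), ('F', 'xxxx', 'F', 'vvv', 4),
    ('G', 'xxx', 'G', 'xxx', 4), ('G', 'xxx', 'G', 'xx', 4),
    ('G', 'xxx', 'G', 'xxx', 12), ('B', 'xxxx', 'B', 'xx', 13),
]

def keep(col1, col2, col3, col4, d):
    """Hypothetical helper mirroring the corrected condition."""
    # "col1 == col3 implies col2 != col4" is the same as
    # "col1 != col3 or col2 != col4".
    return d < 5 and (col1 != col3 or col2 != col4)

def keep_original(col1, col2, col3, col4, d):
    """The asker's condition, which additionally forces col1 == col3."""
    return d < 5 and col2 != col4 and col1 == col3

kept = [r for r in rows if keep(*r)]
kept_original = [r for r in rows if keep_original(*r)]

for r in kept:
    print(r)
```

Here `kept` holds the five rows of the desired dataframe, while `kept_original` holds only the three rows the asker's filter returned.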
You can also write it as follows, passing the whole condition as a SQL expression string:
df.filter('d<5 and (col1 <> col3 or (col1 = col3 and col2 <> col4))').show()
Result:
+----+----+----+----+---+
|col1|col2|col3|col4|  d|
+----+----+----+----+---+
|   A|  xx|   D|  vv|  4|
|   A|   x|   A|  xx|  3|
|   E| xxx|   B|  vv|  3|
|   F|xxxx|   F| vvv|  4|
|   G| xxx|   G|  xx|  4|
+----+----+----+----+---+
A quicker way to write it, referencing the columns directly on df:
df.filter((df.d < 5) & ((df.col1 != df.col3) | (df.col2 != df.col4) & (df.col1 == df.col3)))\
    .show()
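The column-expression filters above leave the `&` part unparenthesized inside the `|`. That works because in Python `&` binds more tightly than `|`, so `x | y & z` groups as `x | (y & z)`, and PySpark Column objects overload these same operators. A minimal illustration with plain booleans (the variable names are illustrative only):

```python
# In Python, & has higher precedence than |,
# so a | b & c is parsed as a | (b & c).
a, b, c = True, False, False

implicit = a | b & c      # grouped as a | (b & c)
explicit = a | (b & c)    # same expression with the grouping spelled out
regrouped = (a | b) & c   # a different expression with a different value

print(implicit, explicit, regrouped)
```

Adding the explicit parentheses around the `&` part does not change the result but may make the intent easier to read.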