Better way to remove statistical outliers than this?
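This code works, but I can't help feeling that it's a hack, especially the "offset" part. I had to put that in because otherwise every index stored in deletes is off by one more each time I perform a del.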
```python
# remove outliers > devs # of std deviations
devs = 1
deletes = []
for num, duration in enumerate(durations):
    if (duration > (mean_duration + (devs * std_dev_one_test))) or \
       (duration < (mean_duration - (devs * std_dev_one_test))):
        deletes.append(num)
offset = 0
for delete in deletes:
    del durations[delete - offset]
    del dates[delete - offset]
    offset += 1
```
Any ideas on how to make it better?
As you iterate over the list, build a list of keepers instead:
```python
def isKeeper(duration):
    if (duration > (mean_duration + (devs * std_dev_one_test))) or \
       (duration < (mean_duration - (devs * std_dev_one_test))):
        return False
    return True

durations = [duration for duration in durations if isKeeper(duration)]
```
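The question keeps `dates` in step with `durations`, so the same filter presumably has to be applied to both lists. Here is a minimal sketch of one way to do that, assuming the two lists have equal length. The sample values and the helper name `is_keeper` are made up for illustration, and `mean_duration` and `std_dev_one_test` are assumed to be computed elsewhere, as in the question:

```python
# Hypothetical sample data; in the question these values already exist.
durations = [10, 11, 9, 10, 50]
dates = ["d1", "d2", "d3", "d4", "d5"]
mean_duration = 18.0      # assumed to be precomputed
std_dev_one_test = 16.0   # assumed to be precomputed (rounded here)
devs = 1

def is_keeper(duration):
    """True when duration lies within devs standard deviations of the mean."""
    return abs(duration - mean_duration) <= devs * std_dev_one_test

# Filter (duration, date) pairs so that both lists stay aligned.
kept = [(duration, date) for duration, date in zip(durations, dates)
        if is_keeper(duration)]
durations = [duration for duration, _ in kept]
dates = [date for _, date in kept]
```

Filtering the pairs first and splitting them afterwards means no index bookkeeping is needed at all.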
Maybe something like this:
```python
import numpy as np

myList = [1, 2, 3, 4, 5, 6, 7, 3, 4, 5, 3, 5, 99]

mean_duration = np.mean(myList)
std_dev_one_test = np.std(myList)

def drop_outliers(x):
    # Return a boolean rather than x itself, so a valid value of 0 is kept
    return abs(x - mean_duration) <= std_dev_one_test

# list() is needed on Python 3, where filter() returns an iterator
myList = list(filter(drop_outliers, myList))
```
Result:
```python
>>> myList
[1, 2, 3, 4, 5, 6, 7, 3, 4, 5, 3, 5]
```
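If NumPy is not already a dependency, roughly the same filter can be sketched with the standard library. This assumes `statistics.pstdev` is the right stand-in, since `np.std` computes the population standard deviation by default. The `devs` parameter and the list-in, list-out signature are additions for illustration, not part of the answer above:

```python
from statistics import mean, pstdev

def drop_outliers(values, devs=1):
    """Return the values within devs population std deviations of the mean."""
    center = mean(values)
    spread = pstdev(values)  # population std dev, like np.std's default
    return [x for x in values if abs(x - center) <= devs * spread]

my_list = [1, 2, 3, 4, 5, 6, 7, 3, 4, 5, 3, 5, 99]
filtered = drop_outliers(my_list)
```

With this sample data only the 99 falls outside one standard deviation, so `filtered` holds the other twelve values in their original order.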
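Do the indices shift because you are deleting items from the list, and is the offset there to compensate for that?
If so, just delete from back to front, so that removing an item doesn't affect the positions of the items you still have to process.
In other words, start at the last item of the list and iterate toward the front.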
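A minimal sketch of the back-to-front idea, assuming `deletes` holds the indices in ascending order (which is what the `enumerate` loop in the question produces). The sample values are made up:

```python
# Hypothetical sample data standing in for the question's lists.
durations = [10, 11, 9, 10, 50, 12, -30]
dates = ["d1", "d2", "d3", "d4", "d5", "d6", "d7"]
deletes = [4, 6]  # indices of the outliers, in ascending order

# Deleting the highest index first leaves the lower indices valid,
# so no offset bookkeeping is needed.
for index in reversed(deletes):
    del durations[index]
    del dates[index]
```

Afterwards `durations` is `[10, 11, 9, 10, 12]` and `dates` has lost the matching entries `"d5"` and `"d7"`.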
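These questions may be of interest: "Remove many elements of list (python)" and "Python: Removing list element while iterating over list".
Another good one: "Remove items from a list while iterating" (thanks to @paulmcguire for suggesting it in the comments).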
If your dataset is small, you can invert the logic and keep the good values instead of deleting the bad ones:
```python
# keep values within devs # of std deviations
devs = 1
keeps = []
for duration in durations:
    if (duration <= (mean_duration + (devs * std_dev_one_test))) and \
       (duration >= (mean_duration - (devs * std_dev_one_test))):
        keeps.append(duration)
```