Python still having issues with try-except clause
I'm using the tld Python library with pandas' apply function to extract the first-level domain from proxy request logs. When I hit an odd request that tld doesn't know how to handle, I get an error message like this:
```
TldBadUrl: Is not a valid URL http:1 con!

TldBadUrlTraceback (most recent call last)
in engine
----> 1 new_fld_column = request_2['request'].apply(get_fld)

/usr/local/lib/python2.7/site-packages/pandas/core/series.pyc in apply(self, func, convert_dtype, args, **kwds)
   2353             else:
   2354                 values = self.asobject
-> 2355                 mapped = lib.map_infer(values, f, convert=convert_dtype)
   2356
   2357             if len(mapped) and isinstance(mapped[0], Series):

pandas/_libs/src/inference.pyx in pandas._libs.lib.map_infer (pandas/_libs/lib.c:66440)()

/home/cdsw/.local/lib/python2.7/site-packages/tld/utils.pyc in get_fld(url, fail_silently, fix_protocol, search_public, search_private, **kwargs)
    385         fix_protocol=fix_protocol,
    386         search_public=search_public,
--> 387         search_private=search_private
    388     )
    389

/home/cdsw/.local/lib/python2.7/site-packages/tld/utils.pyc in process_url(url, fail_silently, fix_protocol, search_public, search_private)
    289             return None, None, parsed_url
    290         else:
--> 291             raise TldBadUrl(url=url)
    292
    293     domain_parts = domain_name.split('.')
```
To get around this, it was suggested that I wrap the function in a try-except clause, so that I could later query for NaN values to find the rows that went wrong:
```python
import numpy as np
import tld
from tld import get_fld

def try_get_fld(x):
    try:
        return get_fld(x)
    except tld.exceptions.TldBadUrl:
        return np.nan
```
This seems to work for some of the "requests", such as "http:1 con" and "http:/login.cgi%00", but it fails on "http://urnt12.knhc..txt/", where I get another error message, similar to the one above:
```
TldDomainNotFound: Domain urnt12.knhc..txt didn't match any existing TLD name!
```
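For what it's worth, the failure can be reproduced outside of apply() with a single call. This is only a minimal sketch, using one of the sample requests shown below:

```python
import tld
from tld import get_fld

try:
    get_fld('http://urnt12.knhc..txt/')
except tld.exceptions.TldBadUrl:
    print('caught TldBadUrl')  # not reached: this URL raises a different exception
# tld.exceptions.TldDomainNotFound propagates here, just like it does inside apply()
```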
This is what the data looks like; there are roughly 240,000 "requests" in total in the dataframe, which is called request:
```
                                      request  request count
0           https://login.microsoftonline.com          24521
1              https://dt.adsafeprotected.com          11521
2         https://googleads.g.doubleclick.net           6252
3                   https://fls-na.amazon.com          65225
4  https://v10.vortex-win.data.microsoft.com         7852222
5                        https://ib.adnxs.com             12
6                                  http:1 CON              6
7                         http:/login.cgi%00          45822
8                    http://urnt12.knhc..txt/              1
```
My code:
```python
from tld import get_tld
from tld import get_fld
import pandas as pd
import numpy as np

# Read back into a dataframe
request = pd.read_csv('Proxy/Proxy_Analytics/Request_Grouped_By_Request_Count_12032018.csv')

# Remove rows where there were null values in the request column
request = request[pd.notnull(request['request'])]

# Find the urls that contain IP addresses and exclude them from the new dataframe
request = request[~request.request.str.findall(r'[0-9]+(?:\.[0-9]+){3}').astype(bool)]

# Reset index
request = request.reset_index(drop=True)

import tld
from tld import get_fld

def try_get_fld(x):
    try:
        return get_fld(x)
    except tld.exceptions.TldBadUrl:
        return np.nan

request['flds'] = request['request'].apply(try_get_fld)

#faulty_url_df = request[request['flds'].isna()]
#print(faulty_url_df)
```
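The commented-out lines at the end are the NaN query I mentioned. On a tiny stand-in frame (built from a few of the sample requests above, since the original CSV isn't shareable) the pipeline stops before that query is ever reached:

```python
import numpy as np
import pandas as pd
import tld
from tld import get_fld

def try_get_fld(x):
    try:
        return get_fld(x)
    except tld.exceptions.TldBadUrl:
        return np.nan

# Small stand-in for the real CSV, using a few of the sample requests above
request = pd.DataFrame({'request': ['https://login.microsoftonline.com',
                                    'http:1 CON',
                                    'http://urnt12.knhc..txt/']})

# 'http:1 CON' is handled (TldBadUrl is caught and becomes NaN), but
# 'http://urnt12.knhc..txt/' raises TldDomainNotFound, which is not caught,
# so apply() aborts before the NaN query below can run.
request['flds'] = request['request'].apply(try_get_fld)

faulty_url_df = request[request['flds'].isna()]
print(faulty_url_df)
```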
That fails because it is a different exception. Your except clause only catches tld.exceptions.TldBadUrl, so a tld.exceptions.TldDomainNotFound is not handled and propagates.
You can either be less specific in the except clause, catching more exceptions with a single except clause, or you can add another except clause to catch the other exception type:
```python
try:
    return get_fld(x)
except tld.exceptions.TldBadUrl:
    return np.nan
except tld.exceptions.TldDomainNotFound:
    print("Domain not found!")
    return np.nan
```
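A variant of the first option, not spelled out above, is to list both exception types in a single except clause as a tuple. A minimal sketch of the wrapper with that change:

```python
import numpy as np
import tld
from tld import get_fld

def try_get_fld(x):
    try:
        return get_fld(x)
    # one except clause that handles both failure modes seen in the question
    except (tld.exceptions.TldBadUrl, tld.exceptions.TldDomainNotFound):
        return np.nan
```

With either form, apply() runs to completion and the commented-out `faulty_url_df = request[request['flds'].isna()]` query should then pick out the unparseable rows. The traceback in the question also shows that get_fld accepts a fail_silently argument; passing fail_silently=True should make it return None instead of raising, which would be another way to sidestep the exceptions entirely.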