How to catch a UnicodeDecodeError caused by an invalid continuation byte in MySQL data
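I am moving hundreds of millions of rows of text data from MySQL into a search engine, but I cannot handle a Unicode error in one of the retrieved strings. I tried explicitly encoding and decoding the retrieved strings so that Python would raise a Unicode exception and show me where the problem lies.
The exception is thrown after tens of millions of rows have been processed on my laptop (sigh...), but I cannot catch it, skip that row, and move on, which is what I want. All the text in the MySQL database should be UTF-8.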
    UnicodeDecodeError: 'utf-8' codec can't decode byte 0xed in position 143: invalid continuation byte
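For reference, the same message can be reproduced in isolation. This is only a minimal sketch: the sample bytes are invented, and the idea that the offending byte is a Latin-1 character embedded in otherwise UTF-8 text is an assumption that the post does not establish.

```python
# Hypothetical sample data: in UTF-8, 0xED announces a 3-byte sequence,
# but the byte that follows it here (a space) is not a continuation byte.
raw = b"caf\xed x"

message = None
bad_byte = b""
try:
    raw.decode("utf-8")
except UnicodeDecodeError as exc:
    message = str(exc)
    # exc.start is the offset of the first byte that could not be decoded.
    bad_byte = raw[exc.start:exc.start + 1]

print(message)

# Read as Latin-1 (an assumption about the data's real encoding),
# the same byte is an ordinary accented character.
as_latin1 = bad_byte.decode("latin-1")
print(repr(as_latin1))
```

If the real rows behave like this sample, the data would contain text in a single-byte encoding rather than UTF-8, whatever the server variables report.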
I set up the connection using MySQL Connector/Python.
Here are the database character settings:
    +--------------------------+--------+
    | Variable_name            | Value  |
    +--------------------------+--------+
    | character_set_client     | utf8   |
    | character_set_connection | utf8   |
    | character_set_database   | utf8   |
    | character_set_filesystem | binary |
    | character_set_results    | utf8   |
    | character_set_server     | utf8   |
    | character_set_system     | utf8   |
    | collation_connection     |        |
    | collation_database       |        |
    | collation_server         |        |
    +--------------------------+--------+
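What is wrong with my exception handling below? Note that the variable "last_feeds_id" is not printed either, but that is probably just further evidence that the except clauses are not being reached.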
    last_feeds_id = 0
    for feedsid, ts, url, bid, title, html in cursor:
        try:
            # to catch UnicodeErrors and see where the problem lies
            # from: https://mail.python.org/pipermail/python-list/2012-July/627441.html
            # also see https://stackoverflow.com/questions/28583565/str-object-has-no-attribute-decode-python-3-error

            # feeds.URL is varchar(255) in mysql
            enc_url = url.encode(encoding='UTF-8', errors='strict')
            dec_url = enc_url.decode(encoding='UTF-8', errors='strict')

            # texts.title is varchar(600) in mysql
            enc_title = title.encode(encoding='UTF-8', errors='strict')
            dec_title = enc_title.decode(encoding='UTF-8', errors='strict')

            # texts.html is text in mysql
            enc_html = html.encode(encoding='UTF-8', errors='strict')
            dec_html = enc_html.decode(encoding='UTF-8', errors='strict')

            data = {"timestamp": ts,
                    "url": dec_url,
                    "bid": bid,
                    "title": dec_title,
                    "html": dec_html}

            es.index(index="blogposts",
                     doc_type="blogpost",
                     body=data)
        except UnicodeDecodeError as e:
            print("Last feeds id: {}".format(last_feeds_id))
            print(e)
        except UnicodeEncodeError as e:
            print("Last feeds id: {}".format(last_feeds_id))
            print(e)
        except UnicodeError as e:
            print("Last feeds id: {}".format(last_feeds_id))
            print(e)
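One possible explanation, which nothing in the post confirms: if the connector decodes each row while the `for` statement is fetching it, the exception is raised in the loop header, outside the `try` block, so no `except` clause can see it. The sketch below moves the fetch inside the `try`. `FakeCursor`, `index_all`, and the sample rows are hypothetical stand-ins, not part of MySQL Connector/Python, and whether a real cursor can keep fetching after such an error has not been verified here.

```python
class FakeCursor:
    """Hypothetical stand-in for a cursor that decodes each row as it is fetched."""

    def __init__(self, raw_rows):
        self._rows = iter(raw_rows)

    def __iter__(self):
        return self

    def __next__(self):
        feedsid, raw = next(self._rows)      # StopIteration ends the iteration
        return feedsid, raw.decode("utf-8")  # may raise UnicodeDecodeError


def index_all(cursor):
    """Fetch rows one at a time so that errors raised by the fetch can be caught."""
    indexed, skipped = [], []
    last_feeds_id = 0
    rows = iter(cursor)
    while True:
        try:
            feedsid, title = next(rows)  # the fetch now happens inside the try
        except StopIteration:
            break
        except UnicodeDecodeError as e:
            # The failing row's id is not available, so record the last good one.
            skipped.append((last_feeds_id, str(e)))
            continue
        last_feeds_id = feedsid          # updated only after a successful fetch
        indexed.append((feedsid, title))
    return indexed, skipped


sample_rows = [(1, b"ok"), (2, b"caf\xed x"), (3, b"fine")]  # invented data
indexed, skipped = index_all(FakeCursor(sample_rows))
print(indexed)
print(skipped)
```

Two points about the loop above are independent of this guess. `last_feeds_id` is never updated there, so it would print 0 even if an `except` clause ran. And encoding a `str` to UTF-8 and decoding it straight back cannot raise `UnicodeDecodeError`, so that round trip cannot reveal bad bytes that were rejected before the string was created.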
It is complaining about hex-
Does your Python source code begin with
    # -*- coding: utf-8 -*-
See more Python UTF-8 tips.
Also, review the "best practices" here.
You have