Unicode Problem with SQLAlchemy
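I know I have a problem with a conversion from Unicode, but I'm not sure where it's happening.
I'm extracting data about a recent European trip from a directory of HTML files. Some of the location names have non-ASCII characters (such as é, ô, ü). I'm getting the data from a string representation of the file using regex.
If I print the locations as I find them, they print with the correct characters, so the encoding must be OK: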
    Le Pré-Saint-Gervais, France
    Hôtel-de-Ville, France
I'm storing the data in a SQLite table using SQLAlchemy:
    from sqlalchemy import create_engine, Column, Integer, Date, Time, Unicode, String, Float
    from sqlalchemy.ext.declarative import declarative_base
    from sqlalchemy.orm import sessionmaker

    Base = declarative_base()

    class Point(Base):
        __tablename__ = 'points'

        id = Column(Integer, primary_key=True)
        pdate = Column(Date)
        ptime = Column(Time)
        location = Column(Unicode(32))
        weather = Column(String(16))
        high = Column(Float)
        low = Column(Float)
        lat = Column(String(16))
        lon = Column(String(16))
        image = Column(String(64))
        caption = Column(String(64))

        def __init__(self, filename, pdate, ptime, location, weather, high, low, lat, lon, image, caption):
            self.filename = filename
            self.pdate = pdate
            self.ptime = ptime
            self.location = location
            self.weather = weather
            self.high = high
            self.low = low
            self.lat = lat
            self.lon = lon
            self.image = image
            self.caption = caption

        def __repr__(self):
            return "<Point('%s','%s','%s')>" % (self.filename, self.pdate, self.ptime)

    # In-memory SQLite database; nothing is written to disk
    engine = create_engine('sqlite:///:memory:', echo=False)
    Base.metadata.create_all(engine)
    Session = sessionmaker(bind=engine)
    session = Session()
I loop through the files and insert the data from each one into the database:
    for filename in filelist:

        # open the file and extract the information using regex such as:
        location_re = re.compile("(.*)", re.M)
        # extract other data

        newpoint = Point(filename, pdate, ptime, location, weather, high, low, lat, lon, image, caption)
        session.add(newpoint)
        session.commit()
I see the following warning on each insert:
    /usr/lib/python2.5/site-packages/SQLAlchemy-0.5.4p2-py2.5.egg/sqlalchemy/engine/default.py:230: SAWarning: Unicode type received non-unicode bind param value 'Spitalfields, United Kingdom'
      param.append(processors[key](compiled_params[key]))
And when I try to do anything with the table, such as:
    session.query(Point).all()
I get:
    Traceback (most recent call last):
      File "./extract_trips.py", line 131, in <module>
        session.query(Point).all()
      File "/usr/lib/python2.5/site-packages/SQLAlchemy-0.5.4p2-py2.5.egg/sqlalchemy/orm/query.py", line 1193, in all
        return list(self)
      File "/usr/lib/python2.5/site-packages/SQLAlchemy-0.5.4p2-py2.5.egg/sqlalchemy/orm/query.py", line 1341, in instances
        fetch = cursor.fetchall()
      File "/usr/lib/python2.5/site-packages/SQLAlchemy-0.5.4p2-py2.5.egg/sqlalchemy/engine/base.py", line 1642, in fetchall
        self.connection._handle_dbapi_exception(e, None, None, self.cursor, self.context)
      File "/usr/lib/python2.5/site-packages/SQLAlchemy-0.5.4p2-py2.5.egg/sqlalchemy/engine/base.py", line 931, in _handle_dbapi_exception
        raise exc.DBAPIError.instance(statement, parameters, e, connection_invalidated=is_disconnect)
    sqlalchemy.exc.OperationalError: (OperationalError) Could not decode to UTF-8 column 'points_location' with text 'Le Pré-Saint-Gervais, France' None None
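I would like to be able to correctly store and then return the location names with the original characters intact. Any help would be much appreciated.
I found this article, which helped explain my troubles somewhat: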
http://www.amk.ca/python/howto/unicode#reading-and-writing-unicode-data
I was able to get the desired results by using the codecs module and then changing my program as follows:
When opening the file:
[code snippet missing from the source]
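What opening a file through the codecs module can look like is sketched below. This is only an illustration, not the poster's actual code: the ISO-8859-1 encoding is assumed from the print step that follows, and the sample text and temporary file are hypothetical stand-ins for one of the HTML files.

```python
import codecs
import os
import tempfile

# Hypothetical sample standing in for the contents of one HTML file.
sample = u"Le Pr\u00e9-Saint-Gervais, France"

# Write the sample as ISO-8859-1 bytes so there is something to read back.
fd, path = tempfile.mkstemp(suffix=".html")
with os.fdopen(fd, "wb") as raw:
    raw.write(sample.encode("iso-8859-1"))

# codecs.open decodes while reading, so read() returns text rather than
# raw bytes. The encoding here is an assumption about the source files.
infile = codecs.open(path, "r", encoding="iso-8859-1")
try:
    contents = infile.read()
finally:
    infile.close()
    os.remove(path)

print(repr(contents))
```

Any regex applied to contents then yields text values, which is what a Unicode column expects as a bind parameter.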
When printing the location:
    print location.encode('ISO-8859-1')
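I can now query and manipulate data from the table without the earlier error. I just have to specify the encoding when I output the text.
(I still don't entirely understand how this is working, so I guess it's time to learn more about Python's unicode handling.)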
From sqlalchemy.org

See section 0.4.2:

    added new flag to String and create_engine(),
    assert_unicode=(True|False|'warn'|None). Defaults to False or None on
    create_engine() and String, 'warn' on the Unicode type. When True,
    results in all unicode conversion operations raising an exception when a
    non-unicode bytestring is passed as a bind parameter. 'warn' results
    in a warning. It is strongly advised that all unicode-aware applications
    make proper use of Python unicode objects (i.e. u'hello' and not 'hello')
    so that data round trips accurately.
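I think you're trying to pass in a non-unicode bytestring. Perhaps this will put you on the right track? Some form of conversion is needed; compare 'hello' and u'hello'.
Cheers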
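The difference between a bytestring and a unicode object can be shown with a small sketch. It assumes the bytes in question are ISO-8859-1, which fits the working fix described earlier in the thread but is not confirmed for every file; the sample text is taken from the traceback.

```python
text = u"Le Pr\u00e9-Saint-Gervais, France"

latin1_bytes = text.encode("iso-8859-1")  # e-acute becomes the single byte 0xE9
utf8_bytes = text.encode("utf-8")         # e-acute becomes the two bytes 0xC3 0xA9

# Decoding with the encoding that produced the bytes recovers the text.
assert latin1_bytes.decode("iso-8859-1") == text
assert utf8_bytes.decode("utf-8") == text

# Decoding ISO-8859-1 bytes as UTF-8 fails, because 0xE9 followed by "-"
# is not a valid UTF-8 sequence.
try:
    latin1_bytes.decode("utf-8")
    decode_failed = False
except UnicodeDecodeError:
    decode_failed = True

print(decode_failed)
```

This is probably the same kind of mismatch behind the "Could not decode to UTF-8" error: bytes in one encoding were stored as-is and later read back as if they were UTF-8.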
Try using a column type of Unicode rather than String for the unicode columns:
    Base = declarative_base()

    class Point(Base):
        __tablename__ = 'points'

        id = Column(Integer, primary_key=True)
        pdate = Column(Date)
        ptime = Column(Time)
        location = Column(Unicode(32))
        weather = Column(String(16))
        high = Column(Float)
        low = Column(Float)
        lat = Column(String(16))
        lon = Column(String(16))
        image = Column(String(64))
        caption = Column(String(64))
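Edit: Response to comment:
If you're getting a warning about unicode encodings, there are two things you can try:
Convert your location to unicode. This would mean having your Point created like this:

    newpoint = Point(filename, pdate, ptime, unicode(location), weather, high, low, lat, lon, image, caption)

The unicode conversion will produce a unicode string when passed either a string or a unicode string, so you don't need to worry about what you pass in.
If that doesn't solve the encoding issues, try calling encode on your unicode objects. That would mean using code like:

    newpoint = Point(filename, pdate, ptime, unicode(location).encode('utf-8'), weather, high, low, lat, lon, image, caption)

This step probably won't be necessary, but what it essentially does is convert a unicode object from unicode code points to a specific byte representation (in this case, UTF-8). I'd expect SQLAlchemy to do this for you when you pass in unicode objects, but it may not.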
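One caveat on the first suggestion: in Python 2, unicode(location) with no encoding argument decodes with the default ASCII codec, so it raises UnicodeDecodeError on a bytestring containing non-ASCII bytes. Decoding explicitly with the file's actual encoding avoids that. The sketch below is a rough illustration only. It assumes ISO-8859-1 input, and it uses the standard-library sqlite3 module with a simplified, hypothetical points table rather than SQLAlchemy, so that it stays self-contained.

```python
import sqlite3

# Bytes as they might come out of a regex over an undecoded file
# (assumption: the source files are ISO-8859-1).
raw_location = u"H\u00f4tel-de-Ville, France".encode("iso-8859-1")

# Decode explicitly with the known encoding instead of relying on a default.
location = raw_location.decode("iso-8859-1")

# Simplified stand-in for the points table, kept in memory.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE points (id INTEGER PRIMARY KEY, location TEXT)")
conn.execute("INSERT INTO points (location) VALUES (?)", (location,))
conn.commit()

# Read the value back to check that it survived the round trip.
stored = conn.execute("SELECT location FROM points").fetchone()[0]
conn.close()

print(stored == location)
```

With the value decoded before the insert, the database layer handles the storage encoding itself, and the text read back compares equal to what was stored.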