Unicode Problem with SQLAlchemy

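I know I have a problem with a conversion from Unicode, but I'm not sure where it's happening.

I'm extracting data about a recent European trip from a directory of HTML files. Some of the location names contain non-ASCII characters (such as é, ô, ü). I'm getting the data from a string representation of each file using regex.

If I print the locations as I find them, they print with the correct characters, so the encoding must be OK: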

Le Pré-Saint-Gervais, France
Hôtel-de-Ville, France

I'm storing the data in a SQLite table using SQLAlchemy:

import re

from sqlalchemy import Column, Date, Float, Integer, String, Time, Unicode, create_engine
from sqlalchemy.ext.declarative import declarative_base
from sqlalchemy.orm import sessionmaker

Base = declarative_base()
class Point(Base):
    __tablename__ = 'points'

    id = Column(Integer, primary_key=True)
    pdate = Column(Date)
    ptime = Column(Time)
    location = Column(Unicode(32))
    weather = Column(String(16))
    high = Column(Float)
    low = Column(Float)
    lat = Column(String(16))
    lon = Column(String(16))
    image = Column(String(64))
    caption = Column(String(64))

    def __init__(self, filename, pdate, ptime, location, weather, high, low, lat, lon, image, caption):
        self.filename = filename
        self.pdate = pdate
        self.ptime = ptime
        self.location = location
        self.weather = weather
        self.high = high
        self.low = low
        self.lat = lat
        self.lon = lon
        self.image = image
        self.caption = caption

    def __repr__(self):
        return "<Point('%s','%s','%s')>" % (self.filename, self.pdate, self.ptime)

engine = create_engine('sqlite:///:memory:', echo=False)
Base.metadata.create_all(engine)
Session = sessionmaker(bind = engine)
session = Session()

I loop through the files and insert the data from each one into the database:

for filename in filelist:

    # open the file and extract the information using regex such as:
    location_re = re.compile("(.*)",re.M)
    # extract other data

    newpoint = Point(filename, pdate, ptime, location, weather, high, low, lat, lon, image, caption)
    session.add(newpoint)
    session.commit()

I see the following warning on each insert:

/usr/lib/python2.5/site-packages/SQLAlchemy-0.5.4p2-py2.5.egg/sqlalchemy/engine/default.py:230: SAWarning: Unicode type received non-unicode bind param value 'Spitalfields, United Kingdom'
  param.append(processors[key](compiled_params[key]))

When I try to do anything with the table, such as:

session.query(Point).all()

I get:

Traceback (most recent call last):
  File "./extract_trips.py", line 131, in <module>
    session.query(Point).all()
  File "/usr/lib/python2.5/site-packages/SQLAlchemy-0.5.4p2-py2.5.egg/sqlalchemy/orm/query.py", line 1193, in all
    return list(self)
  File "/usr/lib/python2.5/site-packages/SQLAlchemy-0.5.4p2-py2.5.egg/sqlalchemy/orm/query.py", line 1341, in instances
    fetch = cursor.fetchall()
  File "/usr/lib/python2.5/site-packages/SQLAlchemy-0.5.4p2-py2.5.egg/sqlalchemy/engine/base.py", line 1642, in fetchall
    self.connection._handle_dbapi_exception(e, None, None, self.cursor, self.context)
  File "/usr/lib/python2.5/site-packages/SQLAlchemy-0.5.4p2-py2.5.egg/sqlalchemy/engine/base.py", line 931, in _handle_dbapi_exception
    raise exc.DBAPIError.instance(statement, parameters, e, connection_invalidated=is_disconnect)
    raise exc.DBAPIError.instance(statement, parameters, e, connection_invalidated=is_disconnect)
sqlalchemy.exc.OperationalError: (OperationalError) Could not decode to UTF-8 column 'points_location' with text 'Le Pré-Saint-Gervais, France' None None
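
One plausible reading of this error, assuming the HTML files were saved as ISO-8859-1 (an assumption, not something the traceback states): the location reached the database as raw bytes, and those bytes are not valid UTF-8. The sketch below is written for Python 3 and reproduces only the decoding step, not SQLAlchemy itself.

```python
# Assumption: the source files are ISO-8859-1 (Latin-1) encoded.
raw = "Le Pré-Saint-Gervais, France".encode("iso-8859-1")

# In Latin-1, "é" is the single byte 0xE9. In UTF-8, 0xE9 opens a
# three-byte sequence, so the "-" that follows makes the data invalid.
try:
    raw.decode("utf-8")
    decoded_as_utf8 = True
except UnicodeDecodeError:
    decoded_as_utf8 = False

# Decoding with the encoding the bytes were written in succeeds.
location = raw.decode("iso-8859-1")

print(repr(raw))
print(decoded_as_utf8)
```

If the files turn out to use a different encoding, the same check applies with that encoding's name substituted.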

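I would like to be able to store and then return the location names correctly, with the original characters intact. Any help would be much appreciated.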


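I found this article, which helped explain my troubles:

http://www.amk.ca/python/howto/unicode#reading-and-writing-unicode-data

I was able to get the desired results by using the codecs module and then changing my program as follows: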

When opening the file, I read it through the codecs module so that its contents are decoded to unicode as they are read.
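
A minimal sketch of decoding a file at read time, offered as an illustration rather than the exact code used here. It is written for Python 3, where the built-in `open()` accepts an `encoding` argument; on Python 2, `codecs.open(filename, 'r', encoding=...)` plays the same role. The ISO-8859-1 encoding and the temporary file are assumptions made for the demonstration.

```python
import os
import tempfile

# Write a small Latin-1 encoded file to stand in for one of the HTML files.
with tempfile.NamedTemporaryFile(suffix=".html", delete=False) as tmp:
    tmp.write("Hôtel-de-Ville, France\n".encode("iso-8859-1"))
    filename = tmp.name

try:
    # Passing the encoding makes read() return decoded text, not raw bytes.
    with open(filename, "r", encoding="iso-8859-1") as infile:
        contents = infile.read()
finally:
    os.remove(filename)

print(type(contents).__name__)
```

Any regex applied to `contents` afterwards operates on text, so the extracted location is already unicode when it reaches the database layer.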

When printing the location:

print location.encode('ISO-8859-1')

I can now query and manipulate data from the table without the earlier error. I just have to specify the encoding when I output the text.
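
To illustrate that round trip, here is a sketch that uses the standard-library `sqlite3` module in place of SQLAlchemy (a substitution made only to keep the example self-contained) and targets Python 3. The two-column table is a cut-down, hypothetical version of `points`, and the ISO-8859-1 encoding is again an assumption. Text is decoded once on the way in and encoded only at output time.

```python
import sqlite3

# Assumption: the bytes extracted from the files are ISO-8859-1 encoded.
raw_locations = [
    "Le Pré-Saint-Gervais, France".encode("iso-8859-1"),
    "Hôtel-de-Ville, France".encode("iso-8859-1"),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE points (id INTEGER PRIMARY KEY, location TEXT)")

# Decode to text before binding, so the database only ever sees unicode.
for raw in raw_locations:
    conn.execute(
        "INSERT INTO points (location) VALUES (?)",
        (raw.decode("iso-8859-1"),),
    )
conn.commit()

# Query results come back as text with the original characters intact.
stored = [row[0] for row in conn.execute("SELECT location FROM points ORDER BY id")]
conn.close()

# Encode only when producing output in a specific encoding.
output_bytes = [name.encode("iso-8859-1") for name in stored]

print(output_bytes == raw_locations)
```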

(I still don't entirely understand how this works, so I guess it's time to learn more about Python's unicode handling.)


From Sqlalchemy.org

See section 0.4.2

added new flag to String and create_engine(), assert_unicode=(True|False|'warn'|None). Defaults to False or None on create_engine() and String, 'warn' on the Unicode type. When True, results in all unicode conversion operations raising an exception when a non-unicode bytestring is passed as a bind parameter. 'warn' results in a warning. It is strongly advised that all unicode-aware applications make proper use of Python unicode objects (i.e. u'hello' and not 'hello') so that data round trips accurately.

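I think you are trying to pass in a non-unicode bytestring. Perhaps this will put you on the right track? Some form of conversion is needed; compare 'hello' and u'hello'.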
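
The behaviour the changelog describes can be imitated in a few lines. The sketch below is not SQLAlchemy code: `check_bind_param` is a hypothetical helper whose `mode` argument loosely mirrors the `assert_unicode` values, and the choice of `TypeError` is an assumption. It is written for Python 3, where `bytes` plays the role of the Python 2 byte string and `str` the role of `unicode`.

```python
import warnings


def check_bind_param(value, mode="warn"):
    """Hypothetical stand-in for the assert_unicode check (not SQLAlchemy's code).

    mode=True raises on a byte string, mode='warn' emits a warning,
    and mode=False or None lets the value through unchecked.
    """
    if mode and isinstance(value, bytes):
        message = "Unicode type received non-unicode bind param value %r" % (value,)
        if mode is True:
            raise TypeError(message)
        warnings.warn(message)
    return value


# A text value passes silently; a byte string triggers the warning.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    check_bind_param("Spitalfields, United Kingdom")
    check_bind_param(b"Spitalfields, United Kingdom")

print(len(caught))
```

Only the second call adds an entry to `caught`, which matches the situation in the question: the value arriving at the `Unicode` column was a byte string rather than a unicode object.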

Cheers


Try using a column type of Unicode rather than String for the unicode columns:

Base = declarative_base()
class Point(Base):
    __tablename__ = 'points'

    id = Column(Integer, primary_key=True)
    pdate = Column(Date)
    ptime = Column(Time)
    location = Column(Unicode(32))
    weather = Column(String(16))
    high = Column(Float)
    low = Column(Float)
    lat = Column(String(16))
    lon = Column(String(16))
    image = Column(String(64))
    caption = Column(String(64))

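Edit: Response to comment:

If you're getting a warning about unicode encodings, then there are two things you can try:

  • Convert your location to unicode. This would mean having your Point created like this:

    newpoint = Point(filename, pdate, ptime, unicode(location), weather, high, low, lat, lon, image, caption)

    The unicode conversion will produce a unicode string when passed either a string or a unicode string, so you don't need to worry about which one you pass in. Note that for a byte string containing non-ASCII characters, unicode() assumes ASCII by default, so the source encoding has to be given explicitly, for example unicode(location, 'iso-8859-1') (substitute the files' actual encoding).

  • If that doesn't solve the encoding issues, try calling encode on your unicode objects. That would mean using code like:

    newpoint = Point(filename, pdate, ptime, unicode(location).encode('utf-8'), weather, high, low, lat, lon, image, caption)

    This step probably won't be necessary, but what it essentially does is convert a unicode object from unicode code points to a specific byte representation (in this case, UTF-8). I'd expect SQLAlchemy to do this for you when you pass in unicode objects, but it may not.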
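
Both bullets can be sketched together. This is an illustration under assumptions, not the answer's own code: `to_text` is a hypothetical helper, the ISO-8859-1 default is a guess at the files' encoding, and the code targets Python 3, where `str` corresponds to Python 2's `unicode` and `bytes` to its byte string.

```python
def to_text(value, encoding="iso-8859-1"):
    """Return text whether given bytes or text (hypothetical helper).

    The default encoding is an assumption about the source files.
    """
    if isinstance(value, bytes):
        return value.decode(encoding)
    return value


raw = "Hôtel-de-Ville, France".encode("iso-8859-1")

# First bullet: normalise to text, whatever type came in.
location = to_text(raw)
same_location = to_text(location)

# Second bullet: turn the text into an explicit UTF-8 byte representation.
utf8_bytes = location.encode("utf-8")

print(location == same_location)
print(repr(utf8_bytes))
```

Passing already-decoded text through `to_text` leaves it unchanged, which is the property the first bullet relies on.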