Regular expressions: split a Python string with nested separator symbols

I need to take the string

i = "1,'Test','items (one, two, etc.)',1,'long, list'"

and extract the following array of strings:

['1', "'Test'", "'items (one, two, etc.)'", '1', "'long, list'"]

With the help of the regular expression

r = re.split(r',+(?=[^()]*(?:\(|$))', i)

I only get the following result:

['1', "'Test'", "'items (one, two, etc.)'", '1', "'long", " list'"]

UPD1

Empty (NULL) values should also be supported:

i = "1,'Test',NULL,'items (one, two, etc.)',1,'long, list'"
['1', "'Test'", 'NULL', "'items (one, two, etc.)'", '1', "'long, list'"]


In this case you don't need re.split. You can use re.findall inside a list comprehension:

>>> [k for j in re.findall(r"(\d)|'([^']*)'", i) for k in j if k]
['1', 'Test', 'items (one, two, etc.)', '1', 'long, list']

The regex above matches either anything between single quotes, '([^']*)', or a single digit, (\d).
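
The pattern in this answer captures only single digits and drops the quotes. As a minimal sketch (the tokenize helper and its pattern are illustrative assumptions, not part of the original answer), a two-alternative pattern can keep the quotes and also pick up multi-digit numbers and bare words such as NULL:

```python
import re

# Hypothetical pattern: a field is either a single-quoted run (which may contain
# commas) or a run of characters that contains neither a comma nor a quote.
FIELD = re.compile(r"'[^']*'|[^,']+")

def tokenize(line):
    """Return the comma-separated fields of line, keeping the surrounding quotes."""
    return FIELD.findall(line)

i = "1,'Test',NULL,'items (one, two, etc.)',1,'long, list'"
print(tokenize(i))
```

This sketch skips empty fields (two adjacent commas) and does not handle escaped quotes inside a quoted field.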

Alternatively, as a more direct approach in this case, you can use ast.literal_eval:

>>> from ast import literal_eval
>>> literal_eval(i)
(1, 'Test', 'items (one, two, etc.)', 1, 'long, list')
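
Note that literal_eval only accepts Python literals, so it returns the digits as integers and rejects a bare word such as NULL. A small sketch of that behaviour, where parse_literal is a hypothetical wrapper introduced only for illustration:

```python
from ast import literal_eval

def parse_literal(line):
    """Parse line as a Python tuple literal; return None if it is not a valid literal."""
    try:
        return literal_eval(line)
    except (ValueError, SyntaxError):
        # Bare words such as NULL are names, not literals, so literal_eval rejects them.
        return None

print(parse_literal("1,'Test','long, list'"))
print(parse_literal("1,'Test',NULL"))
```

Returning None on failure is just one possible design; re-raising the exception may be preferable when bad input should not pass silently.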


This is a job for the csv module:

import csv
from io import StringIO  # Python 3; on Python 2 use: from StringIO import StringIO

line = "1,'Test','items (one, two, etc.)',1,'long, list'"
# Tell the reader that fields are quoted with single quotes
reader = csv.reader(StringIO(line), quotechar="'")
row = next(reader)

# row == ['1', 'Test', 'items (one, two, etc.)', '1', 'long, list']

The key here is to create a csv reader that specifies the single quote as the quote character.
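
In Python 3 the same idea can be sketched without StringIO, because csv.reader accepts any iterable of strings. The parse_row helper and the mapping of NULL to None are assumptions made for this example, prompted by the update in the question:

```python
import csv

def parse_row(line):
    """Split one line with csv, treating the single quote as the quote character."""
    # csv.reader accepts any iterable of strings, so a one-element list is enough.
    row = next(csv.reader([line], quotechar="'"))
    # Assumption for this sketch: the bare word NULL stands for a missing value.
    return [None if field == "NULL" else field for field in row]

print(parse_row("1,'Test',NULL,'items (one, two, etc.)',1,'long, list'"))
```

Because csv strips the quotes, a quoted 'NULL' string cannot be told apart from a bare NULL after parsing; if that distinction matters, a different approach is needed.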


You can split on the single quote:

i = "1,'Test','items (one, two, etc.)',1,'long, list'"

print([ele.strip(" ,") for ele in i.split("'") if ele.strip(",")])
['1', 'Test', 'items (one, two, etc.)', '1', 'long, list']

Or with map:

print([ele for ele in map(lambda x: x.strip(","), i.split("'")) if ele])

Using map with Python 3 is very efficient:

In [7]: i = "1,'Test','items (one, two, etc.)',1,'long, list'"

In [8]: timeit [ele for ele in map(lambda x: x.strip(","), i.split("'")) if ele]
1000000 loops, best of 3: 1.5 μs per loop

In [9]: r = re.compile(r"(\d)|'([^']*)'")

In [10]: timeit [k for j in r.findall(i) for k in j if k]
100000 loops, best of 3: 3.92 μs per loop

It is even faster with Python 2 and itertools.imap:

In [9]: from itertools import imap
In [10]: timeit [ele for ele in imap(lambda x: x.strip(","), i.split("'")) if ele]
1000000 loops, best of 3: 871 ns per loop

In [11]: r = re.compile(r"(\d)|'([^']*)'")
In [12]: timeit [k for j in r.findall(i) for k in j if k]
100000 loops, best of 3: 4.27 μs per loop

In [17]: from ast import literal_eval
In [18]: timeit literal_eval(i)
100000 loops, best of 3: 16.2 μs per loop

All of them return the same output, except literal_eval, which evaluates the digits to integers:

In [19]: literal_eval(i)
Out[19]: (1, 'Test', 'items (one, two, etc.)', 1, 'long, list')

In [20]: [k for j in r.findall(i) for k in j if k]
Out[20]: ['1', 'Test', 'items (one, two, etc.)', '1', 'long, list']

In [21]: [ele for ele in imap(lambda x: x.strip(","), i.split("'")) if ele]
Out[21]: ['1', 'Test', 'items (one, two, etc.)', '1', 'long, list']
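
The comparison above can be reproduced outside IPython with the standard timeit module. This is only a sketch for Python 3: the function names and the repetition count are arbitrary choices, and the timings will differ from machine to machine.

```python
import re
import timeit
from ast import literal_eval

i = "1,'Test','items (one, two, etc.)',1,'long, list'"
r = re.compile(r"(\d)|'([^']*)'")

def with_split():
    # Split on the quote, then strip the leftover commas
    return [ele for ele in map(lambda x: x.strip(","), i.split("'")) if ele]

def with_findall():
    # Each match is a (digit, quoted) pair; keep the non-empty member
    return [k for j in r.findall(i) for k in j if k]

def with_literal_eval():
    # Parses the string as a tuple literal, so digits become ints
    return literal_eval(i)

for func in (with_split, with_findall, with_literal_eval):
    # number=10000 is an arbitrary choice for this sketch
    seconds = timeit.timeit(func, number=10000)
    print(f"{func.__name__}: {seconds / 10000 * 1e6:.2f} µs per call")
```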

NULL values make no difference:

i = "1,'Test',NULL,'items (one, two, etc.)',1,'long, list'"

print([ele for ele in map(lambda x: x.strip(","), i.split("'")) if ele])

['1', 'Test', 'NULL', 'items (one, two, etc.)', '1', 'long, list']
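
Splitting on the quote works for the strings above because two unquoted fields never sit next to each other; an input such as "1,2,'a'" would yield '1,2' as a single element. A possible refinement, sketched here with a hypothetical split_fields helper, uses the fact that after splitting on the quote the odd-numbered chunks are the quoted ones, and splits only the remaining chunks on commas:

```python
def split_fields(line):
    """Split line on commas, leaving commas inside single quotes untouched."""
    fields = []
    for index, chunk in enumerate(line.split("'")):
        if index % 2:
            # Odd chunks were enclosed in quotes: keep them whole
            fields.append(chunk)
        else:
            # Even chunks are outside quotes: split on commas, drop empty pieces
            fields.extend(part for part in chunk.split(",") if part)
    return fields

print(split_fields("1,2,'a, b',NULL"))
print(split_fields("1,'Test',NULL,'items (one, two, etc.)',1,'long, list'"))
```

The sketch assumes balanced quotes and no escaped quotes inside a field, and it drops empty fields as well as empty quoted strings are kept as ''.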