How to remove ALL kinds of line breaks or formatting from strings in Python

nlp python strip web-scraping

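I know the classic ways of dealing with line breaks, tabs, and so on: .strip() or .replace("\n", ""). But sometimes there are special cases where these methods fail, for example: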

'H\xf6cke\n\n:\n\nDie'.strip()

gives: 'H\xf6cke\n\n:\n\nDie'

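How can I catch these rare cases, which would otherwise have to be covered one by one (e.g. by .replace("*", ""))? The above is just one example I came across.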

Related discussion

In [1]: import re

In [2]: text = 'H\xf6cke\n\n:\n\nDie'

In [3]: re.sub(r'\s+', '', text)
Out[3]: 'Höcke:Die'

\s:

Matches Unicode whitespace characters (which includes [ \t\n\r\f\v],
and also many other characters, for example the non-breaking spaces
mandated by typography rules in many languages). If the ASCII flag is
used, only [ \t\n\r\f\v] is matched (but the flag affects the entire
regular expression, so in such cases using an explicit [ \t\n\r\f\v]
may be a better choice).

"+"

Causes the resulting RE to match 1 or more repetitions of the
preceding RE.
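
To make the difference between Unicode and ASCII matching concrete, here is a minimal sketch. The sample string is made up for illustration (it adds a non-breaking space, a tab, and a carriage return to the string from the question), and the variable names are just placeholders.

```python
import re

# Hypothetical sample: non-breaking space (U+00A0), CR/LF, tab, and plain newlines.
text = "H\xf6cke\u00a0\r\n\t:\n\nDie"

# Default matching on str patterns is Unicode-aware, so \s also covers U+00A0.
unicode_clean = re.sub(r"\s+", "", text)

# With re.ASCII, \s only matches [ \t\n\r\f\v], so the non-breaking space survives.
ascii_clean = re.sub(r"\s+", "", text, flags=re.ASCII)

print(repr(unicode_clean))
print(repr(ascii_clean))
```

If the words should stay separated, replacing each run with a single space, as in re.sub(r"\s+", " ", text).strip(), is probably the more useful variant.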

Related discussion

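From the strip documentation: Return a copy of the string with leading and trailing whitespace removed. If chars is given and not None, remove characters in chars instead.

That is why it did not remove the '\n' characters inside the text.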
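
A small sketch of that behaviour, using a made-up string with whitespace added at both ends (the variable names are only for illustration):

```python
# Hypothetical sample: whitespace at both ends and line breaks in the middle.
text = "\n\t H\xf6cke\n\n:\n\nDie \r\n"

# strip() only trims the ends; the inner "\n\n" runs are left untouched.
stripped = text.strip()
print(repr(stripped))

# With a chars argument, those characters are trimmed from the ends instead.
ends_trimmed = "::H\xf6cke\n\n:\n\nDie::".strip(":")
print(repr(ends_trimmed))
```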

If you want to remove the '\n' occurrences, you can use

'H\xf6cke\n\n:\n\nDie'.replace('\n','')
Output: Höcke:Die
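
If several specific characters need to go, chaining replace() calls gets tedious. One possible alternative, sketched here under the assumption that the unwanted characters are known in advance, is str.translate() with a deletion table. The sample string and the set of characters are made up for illustration.

```python
# Hypothetical sample containing CR, LF, and tab characters.
text = "H\xf6cke\r\n\t:\n\nDie"

# The third argument of str.maketrans lists characters to delete.
delete_table = str.maketrans("", "", "\r\n\t")

cleaned = text.translate(delete_table)
print(repr(cleaned))
```

This only removes the characters listed in the table, so anything not anticipated (for example a non-breaking space) would still have to be added by hand.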

If you don't want to import anything, use replace():

a = "H\xf6cke\n\n:\n\nDie"
print(a.replace("\n", ""))

# Höcke:Die
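
Another import-free option, not taken from the answer above but in the same spirit, is to split on whitespace and join the pieces again. str.split() without arguments splits on runs of any whitespace, so it is not limited to "\n". The sample string here is made up for illustration.

```python
# Hypothetical sample: non-breaking space, CR/LF, tab, and plain newlines.
a = "H\xf6cke\u00a0\r\n\t:\n\nDie"

# split() with no arguments splits on any run of whitespace characters.
parts = a.split()

# Join with nothing to drop all whitespace, or with a space to normalize it.
no_whitespace = "".join(parts)
single_spaced = " ".join(parts)

print(repr(no_whitespace))
print(repr(single_spaced))
```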