Python Programming/Strings

Overview

Strings in Python at a glance:

str1 = "Hello"                # A new string using double quotes
str2 = 'Hello'                # Single quotes do the same
str3 = "Hello\tworld\n"       # One with a tab and a newline
str4 = str1 + " world"        # Concatenation
str5 = str1 + str(4)          # Concatenation with a number
str6 = str1[2]                # 3rd character
str6a = str1[-1]              # Last character
#str1[0] = "M"                # No way; strings are immutable
for char in str1: print(char) # For each character
str7 = str1[1:]               # Without the 1st character
str8 = str1[:-1]              # Without the last character
str9 = str1[1:4]              # Substring: 2nd to 4th character
str10 = str1 * 3              # Repetition
str11 = str1.lower()          # Lowercase
str12 = str1.upper()          # Uppercase
str13 = str1.rstrip()         # Strip right (trailing) whitespace
str14 = str1.replace('l','h') # Replacement
list15 = str1.split('l')      # Splitting
if str1 == str2: print("Equ") # Equality test
if "el" in str1: print("In")  # Substring test
length = len(str1)            # Length
pos1 = str1.find('llo')       # Index of substring or -1
pos2 = str1.rfind('l')        # Index of substring, from the right
count = str1.count('l')       # Number of occurrences of a substring

print(str1, str2, str3, str4, str5, str6, str7, str8, str9, str10)
print(str11, str12, str13, str14, list15)
print(length, pos1, pos2, count)

See also the chapter on regular expressions for advanced pattern matching on strings in Python.

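String operations

Equality

Two strings are equal if they have *exactly* the same contents, meaning that they are the same length and each character matches at every position. Many other languages compare strings by identity instead; that is, two strings are considered equal only if they occupy the same space in memory. Python uses the is operator to test the identity of strings and of any two objects in general.

Examples: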

>>> a = 'hello'; b = 'hello' # Assign 'hello' to a and b.
>>> a == b                   # check for equality
True
>>> a == 'hello'             #
True
>>> a == "hello"             # (choice of delimiter is unimportant)
True
>>> a == 'hello '            # (extra space)
False
>>> a == 'Hello'             # (wrong case)
False
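The difference between equality (==) and identity (is) can be sketched as follows. The variable names are illustrative only. Whether two equal strings that were built separately share one object is an implementation detail, so the sketch prints that result without relying on it.

```python
a = 'hello'
b = ''.join(['hel', 'lo'])   # same contents, assembled at run time
c = a                        # a second name for the very same object

print(a == b)   # True: the contents match
print(c is a)   # True: both names refer to one object
print(a is b)   # implementation-dependent; do not rely on it
```

In ordinary code, compare strings with == and reserve is for identity checks such as x is None.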

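Numerical

Two numeric-like operations can be performed on strings: addition and multiplication. String addition is just another name for concatenation, which simply joins strings together. String multiplication is repeated addition, or concatenation. So: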

>>> c = 'a'
>>> c + 'b'
'ab'
>>> c * 5
'aaaaa'

Containment

There is a simple operator, 'in', that returns True if the first operand is contained in the second. This also works on substrings:

>>> x = 'hello'
>>> y = 'ell'
>>> x in y
False
>>> y in x
True

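Note that 'print(x in y)' would display the same value.

Indexing and slicing

Much like arrays in other languages, the individual characters in a string can be accessed by an integer representing their position in the string. The first character in string s would be s[0] and the nth character would be at s[n-1].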

>>> s = "Xanadu"
>>> s[1]
'a'

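Unlike arrays in many other languages, Python also indexes sequences backwards, using negative numbers. The last character has index -1, the second to last character has index -2, and so on.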

>>> s[-4]
'n'

We can also use "slices" to access a substring of s. s[a:b] will give us a string starting with s[a] and ending with s[b-1].

>>> s[1:4]
'ana'

Neither of these is assignable.

>>> print(s)
>>> s[0] = 'J'
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: 'str' object does not support item assignment
>>> s[1:3] = "up"
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: 'str' object does not support item assignment
>>> print(s)

Outputs (assuming the errors were suppressed):

Xanadu
Xanadu

Another feature of slices is that if the beginning or end is left empty, it will default to the first or last index, depending on context:

>>> s[2:]
'nadu'
>>> s[:3]
'Xan'
>>> s[:]
'Xanadu'

You can also use negative numbers in slices:

>>> print(s[-2:])
du

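To understand slices, it's easiest not to count the elements themselves. It is a bit like counting not on your fingers, but in the spaces between them. The list is indexed like this: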

Element:     1     2     3     4
Index:    0     1     2     3     4
         -4    -3    -2    -1

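So, when we ask for the [1:3] slice, that means we start at boundary 1, end at boundary 3, and take everything in between them (elements 2 and 3 in the diagram). If you are used to indexes in C or Java, this can be a bit disconcerting until you get used to it.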
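The "count the gaps, not the elements" rule can be checked with a short sketch. It reuses the string "Xanadu" from the examples above; the variable name middle is illustrative only.

```python
s = "Xanadu"

# Boundaries:  0 X 1 a 2 n 3 a 4 d 5 u 6
middle = s[1:3]        # everything between boundary 1 and boundary 3
print(middle)          # an

# For 0 <= a <= b <= len(s), the slice s[a:b] holds b - a characters.
print(len(s[2:5]))     # 3

# Cutting at any boundary and gluing the halves back restores the string.
for i in range(len(s) + 1):
    assert s[:i] + s[i:] == s
```

Because the end boundary is exclusive, adjacent slices such as s[:i] and s[i:] never overlap and never leave a gap.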

String constants

String constants can be found in the standard string module. For example, string.digits equals '0123456789'.
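As a small illustration, the constants can be used for membership tests. The helper is_hex below is a hypothetical name introduced for this sketch and is not part of the string module.

```python
import string

print(string.digits)            # 0123456789
print(string.ascii_lowercase)   # abcdefghijklmnopqrstuvwxyz

def is_hex(text):
    """Return True if text is non-empty and made only of hex digits."""
    return len(text) > 0 and all(ch in string.hexdigits for ch in text)

print(is_hex('1aF'))   # True
print(is_hex('xyz'))   # False
```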

Links:

Python documentation of "string" module -- python.org

String methods

There are a number of methods or built-in string functions:

capitalize
center
count
decode (Python 2 str only; in Python 3 it is a method of bytes)
encode
endswith
expandtabs
find
index
isalnum
isalpha
isdigit
islower
isspace
istitle
isupper
join
ljust
lower
lstrip
replace
rfind
rindex
rjust
rstrip
split
splitlines
startswith
strip
swapcase
title
translate
upper
zfill

Only some of these methods are covered below.

is*

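isalnum(), isalpha(), isdigit(), islower(), isupper(), isspace(), and istitle() fit into this category.

The string being tested must have a length of at least 1, or else the is* methods return False. In other words, a string object with len(string) == 0 is considered "empty", or False.

isalnum returns True if the string is entirely composed of alphanumeric characters (i.e. no punctuation).
isalpha and isdigit work similarly for alphabetic characters only or numeric characters only, respectively.
isspace returns True if the string is composed entirely of whitespace.
islower, isupper, and istitle return True if the string is in lowercase, uppercase, or titlecase respectively. Uncased characters, such as digits, are "allowed", but there must be at least one cased character in the string object in order to return True. Titlecase means the first cased character of each word is uppercase, and any immediately following cased characters are lowercase. Curiously, 'Y2K'.istitle() returns True. That is because uppercase characters can only follow uncased characters. Likewise, lowercase characters can only follow uppercase or lowercase characters. Hint: whitespace is uncased.

Example: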

>>> '2YK'.istitle()
False
>>> 'Y2K'.istitle()
True
>>> '2Y K'.istitle()
True
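The rules above, including the empty-string case and the "at least one cased character" requirement, can be spot-checked with a short sketch. The sample strings are chosen for illustration only.

```python
print(''.isalnum())          # False: the empty string fails the tests listed above
print('abc123'.isalnum())    # True
print('abc 123'.isalnum())   # False: the space is not alphanumeric
print('abc'.isalpha())       # True
print('123'.isdigit())       # True
print(' \t\n'.isspace())     # True: tabs and newlines count as whitespace
print('123'.islower())       # False: no cased character present
print('abc123'.islower())    # True: digits are uncased and therefore allowed
```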

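Title, upper, lower, swapcase, capitalize

Returns the string converted to title case, upper case, lower case, inverted case, or capitalized, respectively.

The title method capitalizes the first letter of each word in the string (and makes the rest lower case). Words are identified as substrings of alphabetic characters that are separated by non-alphabetic characters, such as digits or whitespace. This can lead to some unexpected behavior. For example, the string "x1x" will be converted to "X1X" instead of "X1x".

The swapcase method makes all uppercase letters lowercase and vice versa.

The capitalize method is like title except that it considers the entire string to be a word (i.e. it makes the first character upper case and the rest lower case).

Example: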

s = 'Hello, wOrLD'
print(s)             # 'Hello, wOrLD'
print(s.title())     # 'Hello, World'
print(s.swapcase())  # 'hELLO, WoRld'
print(s.upper())     # 'HELLO, WORLD'
print(s.lower())     # 'hello, world'
print(s.capitalize())# 'Hello, world'

Keywords: to lowercase, to uppercase, lcase, ucase, downcase, upcase.
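The word-boundary behavior of title can surprise with digits and apostrophes. The sketch below shows the effect and one possible workaround; title_words is a hypothetical helper written for this illustration, and it assumes words are separated by single spaces.

```python
print('x1x'.title())        # X1X
print("they're".title())    # They'Re

def title_words(text):
    """Capitalize each space-separated word, leaving inner letters after
    digits or apostrophes in lower case (hypothetical helper)."""
    return ' '.join(word.capitalize() for word in text.split(' '))

print(title_words("they're x1x"))   # They're X1x
```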

count

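Returns the number of occurrences of the specified substring in the string. For example:

>>> s = 'Hello, world'
>>> s.count('o') # the number of 'o's in 'Hello, world' (2)
2

Hint: .count() is case-sensitive, so this example only counts lowercase 'o' characters. For example, if you ran: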

>>> s = 'HELLO, WORLD'
>>> s.count('o') # print the number of lowercase 'o's in 'HELLO, WORLD' (0)
0
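If a case-insensitive count is wanted, one common approach is to normalize the case first. The sketch below also shows the optional start argument and the fact that occurrences are counted without overlap.

```python
s = 'HELLO, WORLD'
print(s.count('o'))                   # 0
print(s.lower().count('o'))           # 2

# count accepts optional start and end positions, like a slice
print('Hello, world'.count('o', 5))   # 1

# Occurrences are counted without overlapping
print('aaaa'.count('aa'))             # 2
```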

strip, rstrip, lstrip

Returns a copy of the string with the leading (lstrip) and trailing (rstrip) whitespace removed. strip removes both.

>>> s = '\t Hello, world\n\t '
>>> print(s)
         Hello, world

>>> print(s.strip())
Hello, world
>>> print(s.lstrip())
Hello, world
        # ends here
>>> print(s.rstrip())
         Hello, world

Note the leading and trailing tabs and newlines.

The strip methods can also be used to remove other types of characters.

import string
s = 'www.wikibooks.org'
print(s)
print(s.strip('w'))                      # Removes all w's from outside
print(s.strip(string.ascii_lowercase))   # Removes all lowercase letters from outside
print(s.strip(string.printable))         # Removes all printable characters (leaves an empty string)

Outputs:

www.wikibooks.org
.wikibooks.org
.wikibooks.

Note that string.ascii_lowercase and string.printable require an import string statement. (Python 2 also offered string.lowercase; it was removed in Python 3.)
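One point worth illustrating: the argument to strip, lstrip, and rstrip is treated as a set of characters, not as a prefix or suffix. The sketch below uses the same sample string; the character sets are chosen for illustration.

```python
s = 'www.wikibooks.org'

# Every leading/trailing character found in the set is removed,
# so the 'w' of 'wikibooks' disappears as well.
print(s.strip('w.org'))   # ikibooks
print(s.lstrip('w.'))     # ikibooks.org
print(s.rstrip('.org'))   # www.wikibooks
```

Stripping stops at the first character that is not in the set, which is why the interior dots survive.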

ljust, rjust, center

Left-justifies, right-justifies, or centers a string within a given field size (the rest is padded with spaces).

>>> s = 'foo'
>>> s
'foo'
>>> s.ljust(7)
'foo    '
>>> s.rjust(7)
'    foo'
>>> s.center(7)
'  foo  '
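These methods also accept an optional fill character as a second argument. The sketch below shows that, together with zfill from the method list, which pads with zeros on the left.

```python
s = 'foo'
print(s.ljust(7, '-'))    # foo----
print(s.rjust(7, '.'))    # ....foo
print(s.center(7, '*'))   # **foo**
print(s.center(2))        # foo    (a width smaller than the string changes nothing)
print('42'.zfill(5))      # 00042
```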

join

Joins together the given sequence with the string as separator:

>>> seq = ['1', '2', '3', '4', '5']
>>> ' '.join(seq)
'1 2 3 4 5'
>>> '+'.join(seq)
'1+2+3+4+5'

map may be helpful here (it converts the numbers in seq into strings):

>>> seq = [1,2,3,4,5]
>>> ' '.join(map(str, seq))
'1 2 3 4 5'

Now arbitrary objects may be in seq instead of just strings.
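A sketch of what happens without the conversion, and an equivalent spelling with a generator expression. The sample list is illustrative only.

```python
seq = [1, 2.5, None, 'four']

# join itself only accepts strings and raises TypeError otherwise
try:
    ', '.join(seq)
except TypeError as err:
    print('join failed:', err)

# Converting each item first, here with a generator expression
joined = ', '.join(str(item) for item in seq)
print(joined)   # 1, 2.5, None, four
```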

find, index, rfind, rindex

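The find and index methods return the index of the first found occurrence of the given subsequence. If it is not found, find returns -1 but index raises a ValueError. rfind and rindex are the same as find and index except that they search through the string from right to left (i.e. they find the last occurrence).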

>>> s = 'Hello, world'
>>> s.find('l')
2
>>> s[s.index('l'):]
'llo, world'
>>> s.rfind('l')
10
>>> s[:s.rindex('l')]
'Hello, wor'
>>> s[s.index('l'):s.rindex('l')]
'llo, wor'

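Because Python strings accept negative subscripts, index is probably the better choice in situations like the ones shown: if the substring were missing, find would return -1, which is itself a valid subscript, and the slice would silently yield an unintended value.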
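The pitfall can be reproduced in a few lines. The helper after is a hypothetical name used only for this sketch; it shows one way to guard a find result before slicing.

```python
s = 'Hello, world'

pos = s.find('z')
print(pos)        # -1
print(s[pos:])    # d   (silently slices from the last character)

try:
    s.index('z')
except ValueError:
    print('not found')   # index reports the problem instead

def after(text, sub):
    """Return text from the first occurrence of sub, or '' if absent."""
    start = text.find(sub)
    return text[start:] if start != -1 else ''

print(after(s, 'l'))    # llo, world
```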

replace

replace works just like it sounds. It returns a copy of the string with all occurrences of the first parameter replaced with the second parameter.

>>> 'Hello, world'.replace('o', 'X')
'HellX, wXrld'

Or, using variable assignment:

string = 'Hello, world'
newString = string.replace('o', 'X')
print(string)
print(newString)

Outputs:

Hello, world
HellX, wXrld

Notice that the original variable (string) remains unchanged after the call to replace.
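replace also takes an optional third argument that limits how many occurrences are replaced. A brief sketch, with sample values chosen for illustration:

```python
s = 'Hello, world'
print(s.replace('l', 'L', 2))   # HeLLo, world   (only the first two)
print(s.replace('xyz', '?'))    # Hello, world   (nothing to replace)
print(s.replace('o', ''))       # Hell, wrld     (replacing with '' deletes)
```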

expandtabs

Replaces tabs with the appropriate number of spaces (default number of spaces per tab = 8; this can be changed by passing the tab size as an argument).

s = 'abcdefg\tabc\ta'
print(s)
print(len(s))
t = s.expandtabs()
print(t)
print(len(t))

Outputs:

abcdefg abc     a
13
abcdefg abc     a
17

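Notice how, although these two strings look the same, the second string (t) has a different length because each tab is represented by spaces, not tab characters.

To use a tab size of 4 instead of 8: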

v = s.expandtabs(4)
print(v)
print(len(v))

Outputs:

abcdefg abc a
13

Please note that each tab is not always counted as eight spaces. Rather, a tab "pushes" the count to the next multiple of eight. For example:

s = '\t\t'
print(s.expandtabs().replace(' ', '*'))
print(len(s.expandtabs()))

Outputs:

****************
16

s = 'abc\tabc\tabc'
print(s.expandtabs().replace(' ', '*'))
print(len(s.expandtabs()))

Outputs:

abc*****abc*****abc
19
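The "push to the next multiple" rule can be written out by hand. The function expand_tabs below is a hypothetical re-implementation for illustration, not how Python implements it; it assumes a single-line string and a positive tab size.

```python
def expand_tabs(text, tabsize=8):
    """Expand tabs in a single-line string (illustrative sketch)."""
    out = []
    column = 0
    for ch in text:
        if ch == '\t':
            # Number of spaces needed to reach the next tab stop
            pad = tabsize - column % tabsize
            out.append(' ' * pad)
            column += pad
        else:
            out.append(ch)
            column += 1
    return ''.join(out)

for sample in ['\t\t', 'abc\tabc\tabc', 'abcdefg\tabc\ta']:
    assert expand_tabs(sample) == sample.expandtabs()
    assert expand_tabs(sample, 4) == sample.expandtabs(4)

print(len(expand_tabs('abc\tabc\tabc')))   # 19
```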

split, splitlines

The split method returns a list of the words in the string. It can take a separator argument to use instead of whitespace.

>>> s = 'Hello, world'
>>> s.split()
['Hello,', 'world']
>>> s.split('l')
['He', '', 'o, wor', 'd']

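Note that in neither case is the separator included in the split strings, but empty strings are allowed.

The splitlines method breaks a multiline string into many single-line strings. It is analogous to split('\n') (but accepts '\r' and '\r\n' as delimiters as well), except that if the string ends in a newline character, splitlines ignores that final character (see example).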

>>> s = """
... One line
... Two lines
... Red lines
... Blue lines
... Green lines
... """
>>> s.split('\n')
['', 'One line', 'Two lines', 'Red lines', 'Blue lines', 'Green lines', '']
>>> s.splitlines()
['', 'One line', 'Two lines', 'Red lines', 'Blue lines', 'Green lines']
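The difference is easiest to see with mixed line endings. The sketch below uses an illustrative sample string and also shows the optional keepends argument, which keeps the line terminators in the result.

```python
text = 'one\r\ntwo\nthree\n'

print(text.split('\n'))                 # ['one\r', 'two', 'three', '']
print(text.splitlines())                # ['one', 'two', 'three']
print(text.splitlines(keepends=True))   # ['one\r\n', 'two\n', 'three\n']
```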

The split method also accepts multi-character string literals:

txt = 'May the force be with you'
spl = txt.split('the')
print(spl)
# ['May ', ' force be with you']
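A few further cases may be worth keeping in mind: splitting on whitespace collapses runs of spaces while an explicit separator does not, and an optional second argument limits the number of splits. The sample strings are illustrative.

```python
print('a  b'.split())             # ['a', 'b']
print('a  b'.split(' '))          # ['a', '', 'b']
print('a,b,c,d'.split(',', 1))    # ['a', 'b,c,d']
print('a,b,c,d'.rsplit(',', 1))   # ['a,b,c', 'd']
print(''.split())                 # []
print(''.split(','))              # ['']
```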

Unicode

In Python 3.x, all strings (the type str) contain Unicode by default.

In Python 2.x, there is a dedicated unicode type in addition to the str type: u = u"Hello"; type(u) is unicode.

The topic name in the internal help is UNICODE.

Examples for Python 3.x:

v = "Hello Günther"
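- Uses a Unicode code point directly in the source code; the source file has to be in UTF-8 encoding.
v = "Hello G\xfcnther"
- Uses \xfc to specify an 8-bit Unicode code point.
v = "Hello G\u00fcnther"
- Uses \u00fc to specify a 16-bit Unicode code point.
v = "Hello G\U000000fcnther"
- Uses \U000000fc to specify a 32-bit Unicode code point, with U in uppercase.
v = "Hello G\N{LATIN SMALL LETTER U WITH DIAERESIS}nther"
- Uses \N followed by the Unicode point name to specify a Unicode code point.
v = "Hello G\N{latin small letter u with diaeresis}nther"
- The code point name can be in lowercase.
import unicodedata; n = unicodedata.name(chr(252))
- Obtains the Unicode code point name given a Unicode character, here ü.
v = "Hello G" + chr(252) + "nther"
- chr() takes a Unicode code point and returns a string with one Unicode character.
c = ord("ü")
- Yields the code point number.
b = "Hello Günther".encode("UTF-8")
- Creates a byte sequence (bytes) out of a Unicode string.
b = "Hello Günther".encode("UTF-8"); u = b.decode("UTF-8")
- Decodes bytes into a Unicode string via the decode() method.
v = b"Hello " + "G\u00fcnther"
- Raises a TypeError, since bytes and str cannot be concatenated (recent Python 3 versions report "can't concat str to bytes").
v = b"Hello".decode("ASCII") + "G\u00fcnther"
- Now it works.
f = open("File.txt", encoding="UTF-8"); lines = f.readlines(); f.close()
- Opens a file for reading with a specific encoding and reads from it. If no encoding is specified, the encoding of locale.getpreferredencoding() is used.
f = open("File.txt", "w", encoding="UTF-8"); f.write("Hello G\u00fcnther"); f.close()
- Writes to a file in a specified encoding.
f = open("File.txt", encoding="UTF-8-sig"); lines = f.readlines(); f.close()
- The -sig encoding means that any leading byte order mark (BOM) is automatically stripped.
import tokenize; f = tokenize.open("File.txt"); lines = f.readlines(); f.close()
- Automatically detects the encoding based on an encoding marker present in the file, such as a BOM, stripping the marker.
f = open("File.txt", "w", encoding="UTF-8-sig"); f.write("Hello G\u00fcnther"); f.close()
- Writes to a file in UTF-8, writing a BOM at the beginning.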
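Several of the Python 3 one-liners above can be combined into a single runnable sketch. It uses the \u escape so that it does not depend on the encoding of the source file; the variable names are illustrative.

```python
import unicodedata

v = "Hello G\u00fcnther"
b = v.encode("UTF-8")

print(len(v))                        # 13 code points
print(len(b))                        # 14 bytes: the u-umlaut needs two bytes in UTF-8
print(b)                             # b'Hello G\xc3\xbcnther'
print(b.decode("UTF-8") == v)        # True: decoding reverses encoding
print(ord(v[7]))                     # 252
print(chr(252) == v[7])              # True
print(unicodedata.name(v[7]))        # LATIN SMALL LETTER U WITH DIAERESIS
```

The difference between len(v) and len(b) is a reminder that a str counts code points while bytes counts encoded bytes.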

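Examples for Python 2.x:

v = u"Hello G\u00fcnther"
- Uses \u00fc to specify a 16-bit Unicode code point.
v = u"Hello G\U000000fcnther"
- Uses \U000000fc to specify a 32-bit Unicode code point, with U in uppercase.
v = u"Hello G\N{LATIN SMALL LETTER U WITH DIAERESIS}nther"
- Uses \N followed by the Unicode point name to specify a Unicode code point.
v = u"Hello G\N{latin small letter u with diaeresis}nther"
- The code point name can be in lowercase.
import unicodedata; unicodedata.name(unichr(252))
- Obtains the Unicode code point name given a Unicode character, here ü.
v = "Hello G" + unichr(252) + "nther"
- unichr() takes a Unicode code point and returns a string with one Unicode character.
c = ord(u"ü")
- Yields the code point number.
b = u"Hello Günther".encode("UTF-8")
- Creates a byte sequence (str) out of a Unicode string. type(b) is str.
b = u"Hello Günther".encode("UTF-8"); u = b.decode("UTF-8")
- Decodes bytes (type str) into a Unicode string via the decode() method.
v = "Hello" + u"Hello G\u00fcnther"
- Concatenates a str (bytes) and a Unicode string without an error.
import codecs; f = codecs.open("File.txt", encoding="UTF-8"); lines = f.readlines(); f.close()
- Opens a file for reading with a specific encoding and reads from it. If no encoding is specified, codecs.open returns an ordinary file object and no decoding is performed.
f = codecs.open("File.txt", "w", encoding="UTF-8"); f.write(u"Hello G\u00fcnther"); f.close()
- Writes to a file in a specified encoding.
- Unlike the Python 3 variant, when told to write a newline via \n, it does not write the operating-system-specific newline but a literal \n; this makes a difference on Windows.
- To ensure text-mode-like operation, one can write os.linesep.
f = codecs.open("File.txt", encoding="UTF-8-sig"); lines = f.readlines(); f.close()
- The -sig encoding means that any leading byte order mark (BOM) is automatically stripped.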

Links:

Unicode HOWTO for Python 3, docs.python.org
Unicode HOWTO for Python 2, docs.python.org
Processing Text Files in Python 3, curiousefficiency.org
PEP 263 – Defining Python Source Code Encodings, python.org
unicodedata — Unicode Database in Python Library Reference, docs.python.org
Get a list of all the encodings Python can encode to, stackoverflow.com

External links

"String Methods" chapter -- python.org
Python documentation of "string" module -- python.org
