Python Programming/Strings

Overview

Strings in Python at a glance:

str1 = "Hello"                # A new string using double quotes
str2 = 'Hello'                # Single quotes do the same
str3 = "Hello\tworld\n"       # One with a tab and a newline
str4 = str1 + " world"        # Concatenation
str5 = str1 + str(4)          # Concatenation with a number
str6 = str1[2]                # 3rd character
str6a = str1[-1]              # Last character
#str1[0] = "M"                # No way; strings are immutable
for char in str1: print(char) # For each character
str7 = str1[1:]               # Without the 1st character
str8 = str1[:-1]              # Without the last character
str9 = str1[1:4]              # Substring: 2nd to 4th character
str10 = str1 * 3              # Repetition
str11 = str1.lower()          # Lowercase
str12 = str1.upper()          # Uppercase
str13 = str1.rstrip()         # Strip right (trailing) whitespace
str14 = str1.replace('l','h') # Replacement
list15 = str1.split('l')      # Splitting
if str1 == str2: print("Equ") # Equality test
if "el" in str1: print("In")  # Substring test
length = len(str1)            # Length
pos1 = str1.find('llo')       # Index of substring or -1
pos2 = str1.rfind('l')        # Index of substring, from the right
count = str1.count('l')       # Number of occurrences of a substring

print(str1, str2, str3, str4, str5, str6, str7, str8, str9, str10)
print(str11, str12, str13, str14, list15)
print(length, pos1, pos2, count)

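See also the chapter Regular Expression for more advanced pattern matching on strings in Python.

String operations

Equality

Two strings are equal if they have exactly the same contents, meaning that they have the same length and the same character at every position. Many other languages compare strings by identity instead; that is, two strings are considered equal only if they occupy the same space in memory. Python uses the is operator to test the identity of strings, and of any two objects in general.

Examples: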

>>> a = 'hello'; b = 'hello' # Assign 'hello' to a and b.
>>> a == b                   # check for equality
True
>>> a == 'hello'             #
True
>>> a == "hello"             # (choice of delimiter is unimportant)
True
>>> a == 'hello '            # (extra space)
False
>>> a == 'Hello'             # (wrong case)
False
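The difference between equality and identity described above can be sketched as follows. The variable names are illustrative. Whether two equal strings that were built separately share one object is an implementation detail, so the sketch relies only on the == result and on plain aliasing.

```python
a = 'hello'
b = ''.join(['hel', 'lo'])   # same contents as a, assembled at run time
c = a                        # a second name for the very same object

print(a == b)   # True: same length, same characters in the same positions
print(c is a)   # True: c and a refer to one object

# Whether `a is b` is True depends on the interpreter's string caching,
# so identity should not be used to compare string contents.
print(a is b)
```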

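Numerical

There are two quasi-numerical operations that can be performed on strings: addition and multiplication. String addition is just another name for concatenation, which simply sticks the strings together. String multiplication is repeated addition, or concatenation. So: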

>>> c = 'a'
>>> c + 'b'
'ab'
>>> c * 5
'aaaaa'
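A small sketch of the limits of these operations, assuming Python 3: a string cannot be added to a number directly, so the number has to be converted with str() first (as str5 in the overview does).

```python
c = 'a'

try:
    result = c + 5           # str + int is not defined
except TypeError:
    result = c + str(5)      # convert explicitly, then concatenate

print(result)     # a5
print(3 * c)      # aaa (the operands may appear in either order)
print(c * 0)      # prints an empty line: zero repetitions give ''
```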

Containment

There is a simple operator 'in' that returns True if the first operand is contained in the second. This also works on substrings:

>>> x = 'hello'
>>> y = 'ell'
>>> x in y
False
>>> y in x
True
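A few further cases may make the behaviour of in clearer. The dictionary below is only a convenient way to print each expression next to its result; it is not part of any API.

```python
x = 'hello'

checks = {
    "'ell' in x": 'ell' in x,          # contiguous substring
    "'hlo' in x": 'hlo' in x,          # characters present, but not contiguous
    "'ELL' in x": 'ELL' in x,          # the test is case-sensitive
    "'' in x": '' in x,                # the empty string is in every string
    "'xyz' not in x": 'xyz' not in x,  # negated form
}
for expression, value in checks.items():
    print(expression, '->', value)
```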

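Note that 'print(x in y)' would display the same value.

Indexing and Slicing

Much like arrays in other languages, the individual characters in a string can be accessed by an integer representing their position in the string. The first character in string s is s[0] and the nth character is s[n-1].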

>>> s = "Xanadu"
>>> s[1]
'a'

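Unlike arrays in many other languages, Python sequences can also be indexed backwards, using negative numbers. The last character has index -1, the second to last character has index -2, and so on.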

>>> s[-4]
'n'

We can also use "slices" to access a substring of s. s[a:b] gives us a string starting with s[a] and ending with s[b-1].

>>> s[1:4]
'ana'

None of these are assignable.

>>> print(s)
>>> s[0] = 'J'
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: 'str' object does not support item assignment
>>> s[1:3] = "up"
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: 'str' object does not support item assignment
>>> print(s)

Outputs (assuming the errors were suppressed):

Xanadu
Xanadu
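Since strings cannot be changed in place, the usual workaround is to build a new string from slices of the old one. A minimal sketch, with illustrative variable names:

```python
s = "Xanadu"

t = 'J' + s[1:]            # replace the first character
u = s[:1] + "up" + s[3:]   # replace the slice s[1:3]

print(t)   # Janadu
print(u)   # Xupadu
print(s)   # Xanadu (the original is untouched)
```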

Another feature of slices is that if the beginning or end is left empty, it defaults to the first or last index, depending on context:

>>> s[2:]
'nadu'
>>> s[:3]
'Xan'
>>> s[:]
'Xanadu'

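You can also use negative numbers in slices:

>>> s[-2:]
'du'

To understand slices, it is easiest not to count the elements themselves. It is a bit like counting not on your fingers, but in the spaces between them. The list is indexed like this: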

Element:     1     2     3     4
Index:    0     1     2     3     4
         -4    -3    -2    -1

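So, when we ask for the [1:3] slice, that means we start at index 1, end at index 3, and take everything in between them (elements 2 and 3 in the diagram above). If you are used to indexes in C or Java, this can be a bit disconcerting until you get used to it.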
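Two further slice behaviours, not covered in the text above, follow from the same model: an optional third "step" value, and the fact that slice bounds are clipped to the string while single indices are not. A short sketch:

```python
s = "Xanadu"

every_other = s[::2]    # step of 2: indices 0, 2, 4
backwards = s[::-1]     # negative step walks from the end
clipped = s[2:100]      # slice bounds are clipped to the string length

print(every_other)   # Xnd
print(backwards)     # udanaX
print(clipped)       # nadu

try:
    s[100]               # a single index is not clipped
except IndexError as exc:
    print('IndexError:', exc)
```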

String constants

String constants can be found in the standard string module. An example is string.digits, which equals '0123456789'.
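As an illustration, the constants can serve as ready-made character sets. The helper is_hex below is a hypothetical name introduced only for this sketch; string.digits, string.ascii_lowercase and string.hexdigits are real constants of the module in Python 3.

```python
import string

print(string.digits)            # 0123456789
print(string.ascii_lowercase)   # abcdefghijklmnopqrstuvwxyz
print(string.hexdigits)         # 0123456789abcdefABCDEF

def is_hex(text):
    """Return True if text is non-empty and contains only hexadecimal digits."""
    return len(text) > 0 and all(ch in string.hexdigits for ch in text)

print(is_hex('c0ffee'))   # True
print(is_hex('coffee'))   # False: 'o' is not a hexadecimal digit
```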

Links:

Python documentation of "string" module -- python.org

String methods

There are a number of methods or built-in string functions:

capitalize
center
count
decode (Python 2 only; in Python 3 it is a method of bytes)
encode
endswith
expandtabs
find
index
isalnum
isalpha
isdigit
islower
isspace
istitle
isupper
join
ljust
lower
lstrip
replace
rfind
rindex
rjust
rstrip
split
splitlines
startswith
strip
swapcase
title
translate
upper
zfill

Only some of these are covered below.

is*

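isalnum(), isalpha(), isdigit(), islower(), isupper(), isspace(), and istitle() fit into this category.

The string being tested must have a length of at least 1, or the is* methods return False. In other words, a string object with len(string) == 0 is considered "empty", or False.

isalnum returns True if the string is entirely composed of alphabetic and/or numeric characters (i.e. no punctuation).
isalpha and isdigit work similarly for alphabetic characters only or numeric characters only, respectively.
isspace returns True if the string is composed entirely of whitespace.
islower, isupper, and istitle return True if the string is in lowercase, uppercase, or titlecase respectively. Uncased characters, such as digits, are "allowed", but there must be at least one cased character in the string for True to be returned. Titlecase means that the first cased character of each word is uppercase and any immediately following cased characters are lowercase. Curiously, 'Y2K'.istitle() returns True. That is because uppercase characters can only follow uncased characters. Likewise, lowercase characters can only follow uppercase or lowercase characters. Hint: whitespace is uncased.

Example: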

>>> '2YK'.istitle()
False
>>> 'Y2K'.istitle()
True
>>> '2Y K'.istitle()
True
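The rules above, including the special case of the empty string, can be checked with a few calls. This is only a sketch; the list name empty_results is illustrative.

```python
# Every is* method returns False for the empty string.
empty_results = [
    ''.isalnum(), ''.isalpha(), ''.isdigit(), ''.islower(),
    ''.isupper(), ''.isspace(), ''.istitle(),
]
print(any(empty_results))      # False

print('abc123'.isalnum())      # True
print('abc 123'.isalnum())     # False: the space is neither letter nor digit
print(' \t\n'.isspace())       # True
print('abc1'.islower())        # True: the digit is uncased and therefore allowed
print('123'.islower())         # False: no cased character at all
```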

Title, Upper, Lower, Swapcase, Capitalize

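Return the string converted to title case, upper case, lower case, inverted case, or with only its first letter capitalized, respectively.

The title method capitalizes the first letter of each word in the string (and makes the rest lower case). Words are identified as substrings of alphabetic characters that are separated by non-alphabetic characters, such as digits or whitespace. This can lead to some unexpected behavior. For example, the string "x1x" is converted to "X1X" instead of "X1x".

The swapcase method makes all uppercase letters lowercase and vice versa.

The capitalize method is like title except that it considers the entire string to be a word (i.e. it makes the first character upper case and the rest lower case).

Example: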

s = 'Hello, wOrLD'
print(s)             # 'Hello, wOrLD'
print(s.title())     # 'Hello, World'
print(s.swapcase())  # 'hELLO, WoRld'
print(s.upper())     # 'HELLO, WORLD'
print(s.lower())     # 'hello, world'
print(s.capitalize())# 'Hello, world'

Keywords: to lower case, to upper case, lowercase, uppercase, downcase, upcase.
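The word-boundary rule of title also affects apostrophes, which count as non-alphabetic separators. The helper title_words below is a hypothetical workaround written for this sketch, not a standard library function; it treats only single spaces as word separators.

```python
print('x1x'.title())            # X1X
print('x1x'.capitalize())       # X1x
print("they're here".title())   # They'Re Here

def title_words(text):
    """Capitalize each space-separated word, leaving inner punctuation alone."""
    return ' '.join(word.capitalize() for word in text.split(' '))

print(title_words("they're here"))   # They're Here
```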

count

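Returns the number of occurrences of the specified substring in the string, i.e.:

>>> s = 'Hello, world'
>>> s.count('o') # the number of 'o's in 'Hello, world' (2)
2

Hint: .count() is case-sensitive, so this example only counts the lowercase letter 'o'. For example, if you ran: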

>>> s = 'HELLO, WORLD'
>>> s.count('o') # print the number of lowercase 'o's in 'HELLO, WORLD' (0)
0
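A short sketch of three related points: counting regardless of case by normalizing first, the fact that occurrences are counted without overlap, and the optional start and end arguments that restrict the search range. Variable names are illustrative.

```python
s = 'HELLO, WORLD'

either_case = s.lower().count('o')           # normalize the case, then count
non_overlapping = 'aaaa'.count('aa')         # matches at 0 and 2 only
in_range = 'Hello, world'.count('l', 0, 4)   # search only within 'Hell'

print(either_case)       # 2
print(non_overlapping)   # 2
print(in_range)          # 2
```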

strip, rstrip, lstrip

Return a copy of the string with the leading (lstrip) or trailing (rstrip) whitespace removed. strip removes both.

>>> s = '\t Hello, world\n\t '
>>> print(s)
         Hello, world

>>> print(s.strip())
Hello, world
>>> print(s.lstrip())
Hello, world
        # ends here
>>> print(s.rstrip())
         Hello, world

Note the leading and trailing tabs and newlines.

The strip methods can also be used to remove other types of characters.

import string
s = 'www.wikibooks.org'
print(s)
print(s.strip('w'))                # Removes all w's from outside
print(s.strip(string.ascii_lowercase))   # Removes all lowercase letters from outside
print(s.strip(string.printable))   # Removes all printable characters

Outputs:

www.wikibooks.org
.wikibooks.org
.wikibooks.

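Note that string.ascii_lowercase (named string.lowercase in Python 2) and string.printable require an import string statement. The last call strips every character, so it prints an empty line.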
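One point worth illustrating is that the argument to strip, lstrip and rstrip is treated as a set of characters, not as a prefix or suffix. The sketch below contrasts that with removing an exact prefix by testing and slicing; the variable names are illustrative.

```python
s = 'www.wikibooks.org'

# The argument is a set of characters, so the 'w' of 'wikibooks' goes too.
as_set = s.lstrip('w.')

# Removing an exact prefix instead: test for it, then slice it off.
prefix = 'www.'
as_prefix = s[len(prefix):] if s.startswith(prefix) else s

print(as_set)      # ikibooks.org
print(as_prefix)   # wikibooks.org
```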

ljust, rjust, center

Left-justify, right-justify or center a string within a given field size (the rest is padded with spaces).

>>> s = 'foo'
>>> s
'foo'
>>> s.ljust(7)
'foo    '
>>> s.rjust(7)
'    foo'
>>> s.center(7)
'  foo  '
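These methods also accept an optional fill character as a second argument, and they never truncate a string that is already wider than the field. A brief sketch, which also shows zfill from the method list above:

```python
s = 'foo'

print(s.ljust(7, '.'))    # foo....
print(s.rjust(7, '.'))    # ....foo
print(s.center(7, '*'))   # **foo**
print(s.ljust(2))         # foo (a field that is too small leaves the string as is)
print('42'.zfill(5))      # 00042 (pads on the left with zeros)
```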

join

Joins together the given sequence, using the string as the separator:

>>> seq = ['1', '2', '3', '4', '5']
>>> ' '.join(seq)
'1 2 3 4 5'
>>> '+'.join(seq)
'1+2+3+4+5'

map may be helpful here (it converts the numbers in seq into strings):

>>> seq = [1,2,3,4,5]
>>> ' '.join(map(str, seq))
'1 2 3 4 5'

Now arbitrary objects may be in seq instead of just strings.
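Why the conversion is needed can be seen by letting join fail first: it raises a TypeError when an item is not a string. The sketch below falls back to a generator expression, which is an alternative to map; the variable names are illustrative.

```python
seq = [1, 2, 3, 4, 5]

try:
    joined = ' '.join(seq)                         # items are ints, not strings
except TypeError:
    joined = ' '.join(str(item) for item in seq)   # convert each item first

print(joined)             # 1 2 3 4 5
print('-'.join('abc'))    # a-b-c (a string is itself a sequence of characters)
```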

find, index, rfind, rindex

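The find and index methods return the index of the first occurrence of the given substring in the string. If it is not found, find returns -1, whereas index raises a ValueError. rfind and rindex are the same as find and index except that they search through the string from right to left (i.e. they find the last occurrence).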

>>> s = 'Hello, world'
>>> s.find('l')
2
>>> s[s.index('l'):]
'llo, world'
>>> s.rfind('l')
10
>>> s[:s.rindex('l')]
'Hello, wor'
>>> s[s.index('l'):s.rindex('l')]
'llo, wor'

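Because Python strings accept negative subscripts, index is probably the better choice in situations like the ones shown above: if the substring is not found, find returns -1, which is itself a valid index (the last character), so slicing with it would silently yield an unintended value.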
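The pitfall can be reproduced with a substring that does not occur. A minimal sketch, with illustrative variable names:

```python
s = 'Hello, world'

pos = s.find('z')     # not found, so pos is -1
tail = s[pos:]        # silently becomes s[-1:], the last character

print(pos)    # -1
print(tail)   # d

try:
    s.index('z')      # index refuses to continue instead
    found = True
except ValueError:
    found = False
print(found)  # False
```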

replace

replace works just like it sounds. It returns a copy of the string with all occurrences of the first parameter replaced with the second parameter.

>>> 'Hello, world'.replace('o', 'X')
'HellX, wXrld'

Or, using variable assignment:

string = 'Hello, world'
newString = string.replace('o', 'X')
print(string)
print(newString)

Outputs:

Hello, world
HellX, wXrld

Notice that the original variable (string) remains unchanged after the call to replace.
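replace also takes an optional third argument that limits the number of replacements, and because each call returns a new string, calls can be chained. A brief sketch with illustrative variable names:

```python
s = 'Hello, world'

first_only = s.replace('o', 'X', 1)               # replace at most one occurrence
unchanged = s.replace('z', 'X')                   # nothing to replace
chained = s.replace('l', 'L').replace('o', '0')   # each call returns a new string

print(first_only)   # HellX, world
print(unchanged)    # Hello, world
print(chained)      # HeLL0, w0rLd
```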

expandtabs

Replaces tabs with the appropriate number of spaces (the default tab size is 8; this can be changed by passing the tab size as an argument).

s = 'abcdefg\tabc\ta'
print(s)
print(len(s))
t = s.expandtabs()
print(t)
print(len(t))

Outputs:

abcdefg abc     a
13
abcdefg abc     a
17

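Notice that, although the two strings look the same, the second string (t) has a different length because each tab is represented by spaces rather than by a tab character.

To use a tab size of 4 instead of 8: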

v = s.expandtabs(4)
print(v)
print(len(v))

Outputs:

abcdefg abc a
13

Note that a tab is not always counted as eight spaces. Rather, a tab "pushes" the column count to the next multiple of eight. For example:

s = '\t\t'
print(s.expandtabs().replace(' ', '*'))
print(len(s.expandtabs()))

Outputs:

 ****************
 16

s = 'abc\tabc\tabc'
print(s.expandtabs().replace(' ', '*'))
print(len(s.expandtabs()))

Outputs:

 abc*****abc*****abc
 19

split, splitlines

The split method returns a list of the words in the string. It can take a separator argument to use instead of whitespace.

>>> s = 'Hello, world'
>>> s.split()
['Hello,', 'world']
>>> s.split('l')
['He', '', 'o, wor', 'd']

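Note that in neither case is the separator included in the split strings, but empty strings are allowed.

The splitlines method breaks a multiline string into many single-line strings. It is analogous to split('\n') (but accepts '\r' and '\r\n' as delimiters as well), except that if the string ends in a newline character, splitlines ignores that final character (see example).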

>>> s = """
... One line
... Two lines
... Red lines
... Blue lines
... Green lines
... """
>>> s.split('\n')
['', 'One line', 'Two lines', 'Red lines', 'Blue lines', 'Green lines', '']
>>> s.splitlines()
['', 'One line', 'Two lines', 'Red lines', 'Blue lines', 'Green lines']

The split method also accepts multi-character string literals:

txt = 'May the force be with you'
spl = txt.split('the')
print(spl)
# ['May ', ' force be with you']
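A few details that the examples above do not show: split() with no argument collapses runs of whitespace while split(' ') does not, an optional second argument limits the number of splits, and splitlines can keep the line endings. A short sketch with illustrative variable names:

```python
spaced = '  a  b '

by_whitespace = spaced.split()       # runs of whitespace collapse, edges ignored
by_space = spaced.split(' ')         # every single space is a separator
limited = 'a,b,c'.split(',', 1)      # at most one split
kept = 'x\ny\n'.splitlines(True)     # keep the line endings

print(by_whitespace)   # ['a', 'b']
print(by_space)        # ['', '', 'a', '', 'b', '']
print(limited)         # ['a', 'b,c']
print(kept)            # ['x\n', 'y\n']
```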

Unicode

In Python 3.x, all strings (the type str) contain Unicode by default.

In Python 2.x, there is a dedicated unicode type in addition to the str type: u = u"Hello"; type(u) is unicode.

The topic name in the internal help is UNICODE.

Examples for Python 3.x:

v = "Hello Günther"
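- Uses a Unicode character directly in the source code; the source file has to be in UTF-8 encoding.
v = "Hello G\xfcnther"
- Specifies an 8-bit Unicode code point using \xfc.
v = "Hello G\u00fcnther"
- Specifies a 16-bit Unicode code point using \u00fc.
v = "Hello G\U000000fcnther"
- Specifies a 32-bit Unicode code point using \U000000fc, with a capital U.
v = "Hello G\N{LATIN SMALL LETTER U WITH DIAERESIS}nther"
- Specifies a Unicode code point using \N followed by the Unicode code point name.
v = "Hello G\N{latin small letter u with diaeresis}nther"
- The code point name can be in lowercase.
n = unicodedata.name(chr(252))
- Gets the Unicode code point name given a Unicode character, here ü. Requires import unicodedata.
v = "Hello G" + chr(252) + "nther"
- chr() takes a Unicode code point and returns a string containing one Unicode character.
c = ord("ü")
- Yields the code point number.
b = "Hello Günther".encode("UTF-8")
- Creates a byte sequence (bytes) out of a Unicode string.
b = "Hello Günther".encode("UTF-8"); u = b.decode("UTF-8")
- Decodes bytes into a Unicode string via the decode() method.
v = b"Hello " + "G\u00fcnther"
- Raises a TypeError: bytes cannot be concatenated with str.
v = b"Hello".decode("ASCII") + "G\u00fcnther"
- Now it works.
f = open("File.txt", encoding="UTF-8"); lines = f.readlines(); f.close()
- Opens a file for reading with a specific encoding and reads from it. If no encoding is specified, the encoding of locale.getpreferredencoding() is used.
f = open("File.txt", "w", encoding="UTF-8"); f.write("Hello G\u00fcnther"); f.close()
- Writes to a file with the specified encoding.
f = open("File.txt", encoding="UTF-8-sig"); lines = f.readlines(); f.close()
- The -sig encoding means that any leading byte order mark (BOM) is automatically stripped.
f = tokenize.open("File.txt"); lines = f.readlines(); f.close()
- Automatically detects the encoding based on an encoding marker present in the file, such as a BOM, and strips the marker. Requires import tokenize.
f = open("File.txt", "w", encoding="UTF-8-sig"); f.write("Hello G\u00fcnther"); f.close()
- Writes to a file in UTF-8, writing a BOM at the beginning.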
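Several of the Python 3 one-liners above can be combined into one runnable sketch. The temporary directory and the variable names are assumptions made for this illustration so that no real File.txt is touched, and the with statement is used in place of explicit close() calls.

```python
import os
import tempfile
import unicodedata

text = "Hello G\u00fcnther"
data = text.encode("UTF-8")

print(len(text), len(data))        # 13 14 (ü takes two bytes in UTF-8)
print(unicodedata.name(text[7]))   # LATIN SMALL LETTER U WITH DIAERESIS
print(data.decode("UTF-8") == text)   # True

try:
    ascii_safe = text.encode("ASCII")   # ü has no ASCII representation
except UnicodeEncodeError:
    ascii_safe = text.encode("ASCII", errors="replace")
print(ascii_safe)                  # b'Hello G?nther'

# Write and read back with an explicit encoding.
with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "File.txt")
    with open(path, "w", encoding="UTF-8") as f:
        f.write(text)
    with open(path, encoding="UTF-8") as f:
        round_trip = f.read()
print(round_trip == text)          # True
```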

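Examples for Python 2.x:

v = u"Hello G\u00fcnther"
- Specifies a 16-bit Unicode code point using \u00fc.
v = u"Hello G\U000000fcnther"
- Specifies a 32-bit Unicode code point using \U000000fc, with a capital U.
v = u"Hello G\N{LATIN SMALL LETTER U WITH DIAERESIS}nther"
- Specifies a Unicode code point using \N followed by the Unicode code point name.
v = u"Hello G\N{latin small letter u with diaeresis}nther"
- The code point name can be in lowercase.
unicodedata.name(unichr(252))
- Gets the Unicode code point name given a Unicode character, here ü. Requires import unicodedata.
v = "Hello G" + unichr(252) + "nther"
- unichr() takes a Unicode code point and returns a string containing one Unicode character.
c = ord(u"ü")
- Yields the code point number.
b = u"Hello Günther".encode("UTF-8")
- Creates a byte sequence (str) out of a Unicode string. type(b) is str.
b = u"Hello Günther".encode("UTF-8"); u = b.decode("UTF-8")
- Decodes bytes (type str) into a Unicode string via the decode() method.
v = "Hello" + u"Hello G\u00fcnther"
- Concatenates a str (bytes) and a Unicode string without an error.
f = codecs.open("File.txt", encoding="UTF-8"); lines = f.readlines(); f.close()
- Opens a file for reading with a specific encoding and reads from it. Requires import codecs. If no encoding is specified, codecs.open returns an ordinary file object that performs no decoding.
f = codecs.open("File.txt", "w", encoding="UTF-8"); f.write(u"Hello G\u00fcnther"); f.close()
- Writes to a file with the specified encoding.
- Unlike the Python 3 variant, if told to write a newline via \n, it does not write the operating-system-specific newline but a literal \n; this makes a difference on Windows, for example.
- To ensure text-mode-like operation, one can write os.linesep.
f = codecs.open("File.txt", encoding="UTF-8-sig"); lines = f.readlines(); f.close()
- The -sig encoding means that any leading byte order mark (BOM) is automatically stripped.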

Links:

Unicode HOWTO for Python 3, docs.python.org
Unicode HOWTO for Python 2, docs.python.org
Processing Text Files in Python 3, curiousefficiency.org
PEP 263 – Defining Python Source Code Encodings, python.org
unicodedata: Unicode Database, in the Python Library Reference, docs.python.org
Get a list of all the encodings Python can encode to, stackoverflow.com

External links

"String Methods" chapter -- python.org
Python documentation of "string" module -- python.org
