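Abstract: This article summarizes how to use regular expressions in Python. Regular expressions are a powerful text-processing tool for matching, finding, and replacing text patterns. The article covers the basic concepts and syntax of Python regular expressions, including pattern matching, escape characters, and special characters, and walks through the commonly used functions of the re module such as match, search, and findall. It is meant to help readers pick up Python regular expressions quickly and process text more efficiently.
The most detailed official walkthrough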
https://docs.python.org/zh-cn/3.11/howto/regex.html#simple-patterns
Cheat sheets
https://zhuanlan.zhihu.com/p/658261452
https://blog.csdn.net/Java_ZZZZZ/article/details/130862224
https://docs.python.org/zh-cn/3.8/library/re.html
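Using the re library
re.findall(): find all substrings that match a pattern
import re

# Case with matches
txt = "ai aiThe rain in Spain"
x = re.findall("ai", txt)
print(x)

# Case with no match: an empty list is returned
txt = "adafda dafasdf"
x = re.findall("ai", txt)
print(x)

s = '中国人adfadsfasfasdfsdaf中国万岁\n'
print(re.findall(r"\w", s))

s = '中国人adfads----fasfasdfsdaf中国万岁\n'
print(re.findall(r"\w", s))
# In ASCII terms, \w covers the 26 uppercase letters [A-Z], the 26 lowercase letters [a-z], the 10 digits [0-9], and the underscore "_".
# For str patterns, Python 3 also matches Unicode word characters by default, so the Chinese characters are matched here.
# \w does not include the hyphen, which is why this test string contains hyphens.

s = '中国人adfads----fasfasdfsdaf中国万岁\n'
print(re.findall(r"\w", s, re.A))  # with re.A (re.ASCII), Chinese characters are not matched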
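\d+
import re

result = re.findall(r'\d+', '123acb567def98')
# Here r'\d+' describes the shape of the substring we are looking for, built on top of \d:
# the substring looks like \d\d\d\d..., but we do not know how many digits there are,
# so we write \d+ (the plus sign means "at least one").
# re.findall() then finds every substring of that shape, not just one of them,
# and returns them all as a list. Note that re.findall() does not return positions; use finditer() for that.
print(result)
\d
import re

result = re.findall(r'\d', '123acb567def98')
# Here the substring shape is just r'\d', a single digit.
# re.findall() again finds every matching substring, not just one.
print(result)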
^
import re

s = "https://blog.csdn.net/weixin_44799217http"
ret = re.findall(r"^http", s)  # ^ anchors the match at the start of the string
print(ret)
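finditer(): returns the positions of the matches
\d+
import re

p = re.compile(r"\d+")
s = '123acb567def98'
for m in p.finditer(s):
    print(m.span(), m.group())
\d
import re

p = re.compile(r"\d")  # the results are similar, but each digit is matched on its own
s = '123acb567def98'
for m in p.finditer(s):
    print(m.span(), m.group())

import re

p = re.compile("[a-z]+")  # same idea, now matching runs of lowercase letters
s = '123acb567def98'
for m in p.finditer(s):
    print(m.span(), m.group())

import re

p = re.compile("[a-z]")
for m in p.finditer('a1b2c3d4'):
    print(m.start(), m.group())
Note that \d and [0-9] are not identical: for str patterns, \d matches any Unicode decimal digit (unless re.ASCII is set), while [0-9] matches only the ASCII digits 0 to 9.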
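The difference between \d and [0-9] can be sketched as follows. The sample string and variable names are made up for illustration; the full-width digits are written as escapes so the example does not depend on the file encoding.

```python
import re

# Hypothetical sample: ASCII digits followed by full-width digits U+FF11..U+FF13
text = "123 and \uff11\uff12\uff13"

unicode_digits = re.findall(r"\d+", text)        # \d is Unicode-aware for str patterns
ascii_class = re.findall(r"[0-9]+", text)        # [0-9] covers only the ASCII digits
ascii_flag = re.findall(r"\d+", text, re.ASCII)  # re.ASCII restricts \d to [0-9]

print(unicode_digits)
print(ascii_class)
print(ascii_flag)
```

If the input may contain non-ASCII digits and only 0 to 9 should count, [0-9] or the re.ASCII flag is the safer choice.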
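re.match(): check whether a string matches a pattern from its start
re.match() tries to match the pattern at the start of the string. On success it returns a match object; otherwise it returns None.
import re

result = re.match(r'1[35678]\d{9}', '13111111111')
print(result.group())  # match succeeds
result = re.match(r'1[35678]\d{9}', '12111111111')
print(result)  # None: the second digit is 2
result = re.match(r'1[35678]\d{9}', '121111111112')
print(result)  # None: the second digit is 2 (the 12-digit length is not the cause, since re.match() does not anchor the end)
Note that matching starts at the beginning of the string.
import re

result = re.match("hello", "hello world")
print(result)

import re

result = re.match("hello123", "hello world")
print(result)
The match must be at the start of the string; otherwise None is returned.
import re

result = re.match("hello", "qhello world")
print(result)

print(re.match('super', 'insuperable'))

import re

print(re.match('www', 'www.runoob.com').span())  # matches at the start
print(re.match('com', 'www.runoob.com'))  # not at the start, so None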
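Because re.match() returns None on failure, calling .group() or .span() directly on the result raises AttributeError when nothing matches. A minimal sketch of a guarded call follows; the helper name extract_prefix is hypothetical and not part of the re module.

```python
import re

def extract_prefix(pattern, text):
    """Return the text matched at the start of `text`, or None if there is no match."""
    m = re.match(pattern, text)
    if m is None:       # re.match() returns None when the start does not match
        return None
    return m.group()    # the matched substring

print(extract_prefix(r"1[35678]\d{9}", "13111111111"))
print(extract_prefix(r"1[35678]\d{9}", "12111111111"))
```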
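re.fullmatch(): match the entire string
import re

string = 'geeks'
pattern = 'g...s'
print(re.fullmatch(pattern, string))  # match: the pattern covers all five characters

import re

string = 'geeks'
pattern = 'g..s'
print(re.fullmatch(pattern, string))  # None: the pattern covers only four characters
Difference between re.match() and re.fullmatch()
import re

string = 'geeks'
pattern = 'g...'
print(re.fullmatch(pattern, string))  # None: the trailing "s" is not covered

import re

# Note the difference: re.match() only needs the start of the string to match
string = 'geeks'
pattern = 'g...'
print(re.match(pattern, string))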
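One way to see the relationship, as a sketch: re.match() with an explicit end anchor behaves like re.fullmatch() for this pattern. Wrapping the pattern in (?:...) and appending \Z is an illustration chosen here, not something taken from the examples above.

```python
import re

string = "geeks"
pattern = "g..."

prefix_match = re.match(pattern, string)             # matches the first four characters
whole_match = re.fullmatch(pattern, string)          # None: the trailing "s" is left over
anchored = re.match(rf"(?:{pattern})\Z", string)     # \Z forces the match to end at the end of the string

print(prefix_match.group())
print(whole_match, anchored)
```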
re.search()
re.search() scans the string for the pattern and returns the first match it finds; if nothing in the string matches, it returns None.
import re

ret = re.search(r"\d+", "阅读次数为 9999")
print(ret.group())

print(re.search('super', 'superstition').span())
print(re.search('super', 'superstition').group())

print(re.search('super', 'insuperable').span())

import re

txt = "The rain in Spain"
x = re.search(r"\s", txt)
print("The first white-space character is located in position:", x.start())
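The contrast between re.match() and re.search() on the same input can be sketched like this, reusing the 'insuperable' string from the examples above (the variable names are made up):

```python
import re

word = "insuperable"

at_start = re.match("super", word)    # None: "super" does not begin at position 0
anywhere = re.search("super", word)   # finds the first occurrence anywhere in the string

print(at_start)
print(anywhere.span(), anywhere.group())
```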
re.split(): split a string at regular-expression matches
import re

txt = "The rain in Spain"
x = re.split(r"\s", txt)
print(x)

txt = "The rain in Spain"
x = re.split(r"\s", txt, maxsplit=1)  # split at the first match only
print(x)

st = "abc1def23mn4xyz"
result = re.split(r"\d+", st)
print(result)

st = "abc1def23mn4xyz"
result = re.split(r"[a-z]+", st)
print(result)

string = "Hello,World,Python"
result = string.split(",")  # split on commas
print(result)  # prints ['Hello', 'World', 'Python']
result = re.split(r",", string)
print(result)
re.sub(): replace matched substrings
import re

txt = "The rain in Spain"
x = re.sub(r"\s", "9", txt)
print(x)

txt = "The rain in Spain"
x = re.sub(r"\s", "9", txt, count=2)  # replace only the first two matches
print(x)

import re

st = "abc1def23mn4xyz"
result = re.sub(r"\d+", "_", st)
print(result)
Syntax notes
\s
import re

p = re.compile(r"\s+")  # matches runs of whitespace (space, \n, \t)
s = "瓦房店分12打发打发的==大 发的\n是方\t法"
for m in p.finditer(s):
    print(m.span(), m.group())
Difference between * and +
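A minimal sketch of the difference, using a made-up sample string: * allows zero or more repetitions of the preceding item, while + requires at least one.

```python
import re

text = "ac abc abbc"

star = re.findall(r"ab*c", text)  # b* allows zero or more "b", so "ac" also matches
plus = re.findall(r"ab+c", text)  # b+ requires at least one "b"

print(star)
print(plus)
```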
The ? operator
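The ? character has two roles, sketched below with made-up sample strings: on its own it makes the preceding item optional, and placed after another quantifier it makes that quantifier non-greedy.

```python
import re

# As a quantifier, ? makes the preceding item optional
optional = re.findall(r"colou?r", "color colour")

# After another quantifier, ? makes it non-greedy (match as little as possible)
html = "<b>one</b><b>two</b>"
greedy = re.findall(r"<b>.*</b>", html)
lazy = re.findall(r"<b>.*?</b>", html)

print(optional)
print(greedy)
print(lazy)
```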
{m,n}
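A short sketch of the counted quantifiers, on a made-up sample string: {m} means exactly m repetitions, {m,n} between m and n, and {m,} at least m. The \b word boundaries are added here so that each number is tested as a whole.

```python
import re

text = "1 12 123 1234 12345"

exactly_three = re.findall(r"\b\d{3}\b", text)    # exactly 3 digits
two_to_four = re.findall(r"\b\d{2,4}\b", text)    # between 2 and 4 digits
at_least_four = re.findall(r"\b\d{4,}\b", text)   # 4 or more digits

print(exactly_three)
print(two_to_four)
print(at_least_four)
```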
Case 1 (matching Chinese characters)
import re

p = re.compile("[\u4e00-\u9fa5]+")  # matches runs of Chinese characters in this range
s = "瓦房店分12打发打发的==大法!!!发的是方法"
for m in p.finditer(s):
    print(m.span(), m.group())
Case 2
Case 3
Case 4
Case 5
Case 6
Selecting the strings in a list that match a pattern
import re

mylist = ["dog", "cat", "wildcat", "thundercat", "cow", "hooo"]
r = re.compile(".*cat")
newlist = list(filter(r.match, mylist))  # keep the items for which r.match() finds a match
print(newlist)
# The filter() function is worth studying carefully.
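For readers less familiar with filter(), the same selection can be sketched as a list comprehension. The second variant, which uses search() with the bare pattern "cat", is an alternative added here for comparison; the variable names are made up.

```python
import re

mylist = ["dog", "cat", "wildcat", "thundercat", "cow", "hooo"]
r = re.compile(".*cat")

# filter() keeps the items for which r.match returns a match object (truthy)
with_filter = list(filter(r.match, mylist))

# Equivalent list comprehension
with_comprehension = [s for s in mylist if r.match(s)]

# search() looks anywhere in the string, so the leading ".*" is not needed
cat_re = re.compile("cat")
with_search = [s for s in mylist if cat_re.search(s)]

print(with_filter)
```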
Alternatively, do everything with pandas, which is very convenient.
Regular expressions in pandas
pandas.str.match (element-wise matching)
example1
import numpy as np
import pandas as pd

a = np.array(['A0', 'A1', 'A2', 'A3', 'A4', 'B0', 'B1', 'C0'])
pd.Series(a).str.match(r'A[0-2]')
example2
import pandas as pd

s = pd.Series(['zzzz', 'zzzd', 'zzdd', 'zddd', 'dddn', 'ddnz', 'dnzn', 'nznz',
               'znzn', 'nznd', 'zndd', 'nddd', 'ddnn', 'dnnn', 'nnnz', 'nnzn',
               'nznn', 'znnn', 'nnnn', 'nnnd', 'nndd', 'dddz', 'ddzn', 'dznn',
               'znnz', 'nnzz', 'nzzz', 'zzzn', 'zznn', 'dddd', 'dnnd'])
# print(s.str.endswith("dd"))
# print("*" * 50)
# print(s[s.str.endswith("dd")])
# print("*" * 50)
print("*" * 50)
print(s.str.match(".*dd$"))
print(s[s.str.match(".*dd$")])
pandas.str.extract
Note that what gets returned is the content matched inside the parentheses (the capture groups) of the regular expression.
example1
import pandas as pd

ele = ["Toy Story (1995)", "GoldenEye (1995)", "Four Rooms (1995)", "Get Shorty (1995)", "Copycat (1995)"]
df = pd.DataFrame({"movie_title": ele})
print(df)
# expand=False returns a Series, which is assigned to the new column
df['just_movie_titles'] = df['movie_title'].str.extract(r'(.+?) \(', expand=False)
df
example2
import pandas as pd

df = pd.DataFrame({"col1": ["1/1/100 'BA1", "1/1/102Packe", "1/1/102 'to_"]})
df["col2"] = df['col1'].str.extract(r'(\d+/\d+/\d+)', expand=True)
df
Result: col2 holds 1/1/100, 1/1/102, and 1/1/102.
example3
# importing pandas as pd
import pandas as pd

# Creating the Series
sr = pd.Series(['New_York', 'Lisbon', 'Tokyo', 'Paris', 'Munich'])

# Creating the index
idx = ['City 1', 'City 2', 'City 3', 'City 4', 'City 5']

# set the index
sr.index = idx

# Print the series
print(sr)

# extract groups having a vowel followed by any character
result = sr.str.extract(pat='([aeiou].)')

# print the result
print(result)
example4
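import pandas as pd

s = pd.Series(['a1', 'b2', 'c3'])
s.str.extract(r'([ab])(\d)')
Setting expand=True
s.str.extract(r'[ab](\d)', expand=True)
Setting new column names
# Named groups become the column names; the names letter and digit follow the pandas documentation example
s.str.extract(r'(?P<letter>[ab])(?P<digit>\d)')

s.str.extract(r'(\d)')

s.str.extract(r'([ab])')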
pandas.str.split
example1
import pandas as pd

temp = pd.DataFrame({'ticker': ['spx 5/25/2001 p500', 'spx 5/25/2001 p600', 'spx 5/25/2001 p700']})
temp2 = temp.ticker.str.split(' ')
print(temp2)
temp2.str[-1]
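Extracting part of each element in a column and storing it back as a column
import pandas as pd

df = pd.DataFrame({
    'gene': ["1 // foo // blabla", "2 // bar // lalala", "3 // qux // trilil", "4 // woz // hohoho"],
    'cell1': [5, 9, 1, 7],
    'cell2': [12, 90, 13, 87]})
print(df)
df['gene'] = df['gene'].str.split('//').str[1]
df
Result: the gene column now holds ' foo ', ' bar ', ' qux ', and ' woz ' (the surrounding spaces are kept).
pandas.str.extract can achieve the same result as pandas.str.split
import pandas as pd

df = pd.DataFrame({
    'gene': ["1 // foo // blabla", "2 // bar // lalala", "3 // qux // trilil", "4 // woz // hohoho"],
    'cell1': [5, 9, 1, 7],
    'cell2': [12, 90, 13, 87]})
print(df)
# expand=False returns a Series, so the result can be assigned back to the column
df["gene"] = df["gene"].str.extract(r"\/\/([a-z ]+)\/\/", expand=False)
print(df)
df["gene"] = df["gene"].str.strip()
df
Result: after str.strip(), the gene column holds foo, bar, qux, and woz.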
example2
import pandas as pd

df = pd.DataFrame({'Scenario': ['HI', 'HI', 'HI', 'HI', 'HI', 'HI'],
                   'Savings': ['Total_FFC_base0', 'Total_FFC_savings1', 'Total_FFC_saving2',
                               'Total_FFC_savings3', 'Total_site_base0', 'Total_site_savings1'],
                   'PC1': [0.12, 0.15, 0.12, 0.17, 0.12, 0.15],
                   'PC2': [0.13, 0.12, 0.14, 0.15, 0.15, 0.15]})
print(df)
# The group names were lost in the source text; Savings and EL are assumed here to mirror the target columns
df[['Savings', 'EL']] = df['Savings'].str.extract(r'_(?P<Savings>.*)_.*(?P<EL>\d+)')
df

import pandas as pd

df = pd.DataFrame({'Scenario': ['HI', 'HI', 'HI', 'HI', 'HI', 'HI'],
                   'Savings': ['Total_FFC_base0', 'Total_FFC_savings1', 'Total_FFC_saving2',
                               'Total_FFC_savings3', 'Total_site_base0', 'Total_site_savings1'],
                   'PC1': [0.12, 0.15, 0.12, 0.17, 0.12, 0.15],
                   'PC2': [0.13, 0.12, 0.14, 0.15, 0.15, 0.15]})
print(df)
df['Savings'].str.extract(r'(.*)_(.*)_(.*)')

df['Savings'].str.extract(r'(.*)_(.*)_(.*)\d')

df['Savings'].str.extract(r'(.*)')

df['Savings'].str.extract(r'(\d+)')
# Only the parts wrapped in parentheses are returned; whatever sits outside the parentheses acts as a marker and does not appear in the result.
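The same rule about parentheses can be seen with the plain re module. The sketch below reuses one of the Savings strings; the non-capturing group (?:...) is an addition for contrast and does not appear in the examples above.

```python
import re

name = "Total_FFC_savings1"

# Only the parenthesised part is returned; the underscores act as markers
middle = re.search(r"_(.*)_", name).group(1)

# (?:...) groups without capturing, so it contributes nothing to the result
digits = re.search(r"(?:base|savings?)(\d+)", name).group(1)

print(middle)
print(digits)
```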
Hands-on example 1
import numpy as np
import pandas as pd

ele = np.array([
    'CD1C_P14_S91', 'CD1C_P14_S96', 'CD1C_P3_S12', 'CD141_P7_S22', 'CD141_P7_S24',
    'CD1C_P4_S36', 'CD141_P7_S7', 'CD141_P8_S27', 'CD141_P8_S31', 'CD141_P9_S72',
    'pDC_P10_S73', 'pDC_P10_S74', 'pDC_P10_S83', 'pDC_P13_S56', 'pDC_P13_S59',
    'pDC_P13_S70', 'pDC_P14_S76', 'pDC_P14_S78', 'pDC_P14_S87', 'pDC_P14_S89',
    'pDC_P14_S90', 'pDC_P14_S91', 'pDC_P14_S92', 'pDC_P3_S14', 'pDC_P3_S16',
    'pDC_P3_S17', 'pDC_P3_S18', 'pDC_P3_S1', 'pDC_P3_S21', 'pDC_P3_S2',
    'pDC_P3_S4', 'pDC_P3_S5', 'pDC_P4_S28', 'pDC_P4_S29', 'pDC_P4_S30',
    'pDC_P4_S36', 'pDC_P4_S37', 'pDC_P4_S40', 'pDC_P4_S42', 'pDC_P4_S43',
    'pDC_P4_S45', 'pDC_P4_S46', 'pDC_P4_S48', 'pDC_P7_S15', 'pDC_P7_S16',
    'pDC_P7_S17', 'pDC_P7_S1', 'pDC_P7_S21', 'pDC_P7_S22', 'pDC_P7_S3',
    'pDC_P7_S7', 'pDC_P8_S26', 'pDC_P8_S28', 'pDC_P8_S32', 'pDC_P8_S34',
    'pDC_P8_S39', 'pDC_P8_S40', 'pDC_P8_S42', 'pDC_P8_S44', 'pDC_P8_S46',
    'pDC_P8_S47', 'pDC_P9_S52', 'pDC_P9_S54', 'pDC_P9_S61', 'pDC_P9_S63',
    'pDC_P9_S65', 'pDC_P9_S71', 'DoubleNeg_P10_S73', 'DoubleNeg_P10_S76', 'DoubleNeg_P10_S79',
    'DoubleNeg_P10_S80', 'DoubleNeg_P10_S81', 'DoubleNeg_P10_S84', 'DoubleNeg_P10_S86', 'DoubleNeg_P13_S49',
    'DoubleNeg_P13_S53', 'DoubleNeg_P13_S64', 'DoubleNeg_P13_S67', 'DoubleNeg_P14_S74', 'DoubleNeg_P14_S78',
    'DoubleNeg_P14_S81', 'DoubleNeg_P14_S82', 'DoubleNeg_P14_S83', 'DoubleNeg_P14_S87', 'DoubleNeg_P14_S90',
    'DoubleNeg_P14_S92', 'DoubleNeg_P14_S95', 'DoubleNeg_P3_S1', 'DoubleNeg_P3_S20', 'DoubleNeg_P3_S24',
    'DoubleNeg_P3_S3', 'DoubleNeg_P3_S5', 'DoubleNeg_P3_S7', 'DoubleNeg_P4_S29', 'DoubleNeg_P4_S30',
    'DoubleNeg_P4_S35', 'DoubleNeg_P4_S39', 'DoubleNeg_P4_S42', 'DoubleNeg_P4_S45', 'DoubleNeg_P4_S46',
    'DoubleNeg_P7_S11', 'DoubleNeg_P7_S13', 'DoubleNeg_P7_S14', 'DoubleNeg_P7_S16', 'DoubleNeg_P7_S24',
    'DoubleNeg_P7_S2', 'DoubleNeg_P7_S3', 'DoubleNeg_P7_S5', 'DoubleNeg_P7_S7', 'DoubleNeg_P7_S8',
    'DoubleNeg_P8_S25', 'DoubleNeg_P8_S30', 'DoubleNeg_P8_S38', 'DoubleNeg_P8_S41', 'DoubleNeg_P8_S42',
    'DoubleNeg_P8_S43', 'DoubleNeg_P8_S44', 'DoubleNeg_P9_S64', 'DoubleNeg_P9_S66', 'CD1C_P13_S57',
    'CD1C_P13_S63', 'CD1C_P14_S85'])
df = pd.DataFrame({"cell": ele})
df
Test 1 (extract uppercase letters only)
df["cell"].str.extract(r"([A-Z]+)")
Test 2 (extract uppercase and lowercase letters)
df["cell"].str.extract(r"([A-Za-z]+)")
Test 3 (combined pattern)
df["cell"].str.extract(r"([A-Za-z]+\d+[A-Za-z]+)")
# Only the CD1C rows match; CD141, pDC, and DoubleNeg rows do not, so they come back as NaN
Test 4 (using split)
print(df["cell"].str.split("_").str[0])
print(df["cell"].str.split("_").str[0].value_counts())
Test 5 (using a regular expression)
# [a-zA-Z0-9] matches letters and digits
print(df["cell"].str.extract(r"([a-zA-Z0-9]+)"))
print(df["cell"].str.extract(r"([a-zA-Z0-9]+)").value_counts())
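pandas.str.fullmatch()
Testing shows that pandas.str.match() and pandas.str.fullmatch() behave differently.
For example:
import numpy as np
import pandas as pd

# A match at the start of each string is enough here
a = np.array(['A0', 'A11', 'A2-', 'A3', 'A4', 'B0', 'B1', 'C0'])
pd.Series(a).str.match(r'A[0-2]')
Result: True for 'A0', 'A11', and 'A2-'; False for the rest.
As you can see, pandas.str.match() matches from the start only: as long as the beginning satisfies the pattern, whatever follows is ignored.
With the same data, pandas.str.fullmatch() gives a different result.
# Here the whole string has to match
pd.Series(a).str.fullmatch(r'A[0-2]')
Keep this difference in mind: fullmatch() returns True only for 'A0' here.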
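The two pandas methods mirror re.match() and re.fullmatch(), so the difference can be sketched without pandas. The list below copies the values from the example above; the variable names are made up.

```python
import re

values = ['A0', 'A11', 'A2-', 'A3', 'A4', 'B0', 'B1', 'C0']
pattern = re.compile(r'A[0-2]')

# match(): only the start of each string has to fit the pattern
prefix_hits = [bool(pattern.match(v)) for v in values]

# fullmatch(): the whole string has to fit the pattern
full_hits = [bool(pattern.fullmatch(v)) for v in values]

print(prefix_hits)
print(full_hits)
```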