import re
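re.search() can do the job of find() and startswith()
if re.search('From:', line): print(line)     # same test as the find() line below
if line.find('From:') >= 0: print(line)
if re.search('^From:', line): print(line)    # same test as line.startswith('From:')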
^X.*:    # starts with 'X', followed by any character (including spaces) zero or more times (.*), then ':'
result: matches 'X-Sieve-DSPAM:' but also 'X-Plane is big:'
fine-tuning the match so that no whitespace is allowed between X and ':' -> ^X-\S+:
'\S' matches any non-whitespace character; '+' means one or more times
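A small sketch of the difference between the loose and the fine-tuned pattern. The sample header lines are made up for illustration and do not come from a real mail file.

```python
import re

# Hypothetical sample lines, invented for this illustration
lines = [
    'X-Sieve-DSPAM: yes',
    'X-Plane is big: two weeks late',
    'Subject: hello',
]

# ^X.*: allows anything (spaces too) between the X and the colon
loose = [s for s in lines if re.search(r'^X.*:', s)]

# ^X-\S+: allows only non-whitespace between 'X-' and the colon
strict = [s for s in lines if re.search(r'^X-\S+:', s)]

print(loose)
print(strict)
```

The loose pattern keeps both X lines; the strict one drops the 'X-Plane' line because a space appears before its colon.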
re.findall() extracts the matched strings, replacing manual slicing such as
atpos = var.find('@'), sppos = var.find(' ', atpos), var[atpos+1:sppos]
x = 'my 2 best numbers are 9 and 12'
y = re.findall('[0-9]+', x)   # [] matches one character; [0-9]+ means one or more digits
print(y) -> ['2', '9', '12']
x = 'From: Using the : character'
y = re.findall('^F.+:', x) --> print(y) --> ['From: Using the :']   # greedy matching takes the longest possible string
y = re.findall('^F.+?:', x) --> print(y) --> ['From:']   # non-greedy matching takes the shortest possible string
(from here on, x is a mail 'From ' line whose address ends in @uct.ac.za)
y = re.findall('\S+@\S+', x) --> print(y) --> [email protected]
y = re.findall('^From (\S+@\S+)', x) --> print(y) --> [email protected]
# () are not part of the match; they mark where the returned string starts and stops
y = re.findall('@([^ ]*)', x)   # '@' means scan the string until an at sign is found
# inside [], ^ means NOT, so [^ ] matches a single non-blank character; * repeats [^ ] zero or more times
print(y) -> ['uct.ac.za']
y = re.findall('^From .*@([^ ]*)', x) --> print(y) --> ['uct.ac.za']
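For comparison, a minimal sketch that pulls the host name out of one line both ways: with find() and slicing, and with a single findall() call. The sample line and its example.com address are invented for this illustration.

```python
import re

# Hypothetical sample line; the address is invented for this illustration
x = 'From someone@example.com Sat Jan  5 09:14:16 2008'

# Manual approach: locate '@', then the next space, then slice between them
atpos = x.find('@')
sppos = x.find(' ', atpos)
host_by_slice = x[atpos + 1:sppos]

# Regex approach: the parentheses select the part of the match to return
host_by_regex = re.findall(r'^From .*@([^ ]*)', x)

print(host_by_slice)
print(host_by_regex)
```

Note that the slice gives a plain string while findall() gives a list, so the regex result still needs [0] to get the host itself.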
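numlist = []
for line in handle:   # handle: an open mail file (assumed; the notes omit the loop header)
    stuff = re.findall('^X-DSPAM-Confidence: ([0-9.]+)', line)
    if len(stuff) != 1: continue
    num = float(stuff[0])
    numlist.append(num)
print(max(numlist))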
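A self-contained version of the confidence loop that runs without a mail file. The list of header lines and their values are made up; a real run would loop over an opened file instead.

```python
import re

# Hypothetical header lines standing in for a mail file; the values are invented
sample = [
    'X-DSPAM-Confidence: 0.8475',
    'X-DSPAM-Probability: 0.0000',
    'Subject: nothing to extract here',
    'X-DSPAM-Confidence: 0.6178',
]

numlist = []
for line in sample:
    stuff = re.findall(r'^X-DSPAM-Confidence: ([0-9.]+)', line)
    if len(stuff) != 1:
        continue  # skip lines without exactly one confidence value
    numlist.append(float(stuff[0]))

print(numlist)
print(max(numlist))
```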
if you want a special regular expression character to behave as a plain character, prefix it with '\'
y = re.findall('\$[0-9.]+', x) --> print(y) --> ['$10.00']   # \$ is a literal dollar sign; [0-9.] means a digit OR a period; note that . is not a wildcard inside []
[a-z0-9] matches a lowercase letter OR a digit
'+' repeats the preceding item one or more times; '*' repeats it zero or more times
^          Matches the beginning of a line
$          Matches the end of the line
.          Matches any character
\s         Matches whitespace
\S         Matches any non-whitespace character
*          Repeats a character zero or more times
*?         Repeats a character zero or more times (non-greedy)
+          Repeats a character one or more times
+?         Repeats a character one or more times (non-greedy)
[aeiou]    Matches a single character in the listed set
[^XYZ]     Matches a single character not in the listed set
[a-z0-9]   The set of characters can include a range
(          Indicates where string extraction is to start
)          Indicates where string extraction is to end
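A quick check of several entries from the reference list above, using made-up sample strings. The variable names are only for this sketch.

```python
import re

s = 'ac abc abbc'   # made-up sample string

star = re.findall(r'a(b*)c', s)    # * allows zero b's; () returns only the b's
plus = re.findall(r'a(b+)c', s)    # + needs at least one b
first = re.findall(r'^\S+', s)     # ^ anchors at the start of the line
last = re.findall(r'\S+$', s)      # $ anchors at the end of the line
in_set = re.findall(r'[a-z0-9]+', 'abc-123 XYZ')   # ranges inside a set
not_set = re.findall(r'[^XYZ ]+', 'abc XYZ')       # ^ inside [] negates the set

for name, result in [('star', star), ('plus', plus), ('first', first),
                     ('last', last), ('in_set', in_set), ('not_set', not_set)]:
    print(name + ': ' + repr(result))
```

The empty string in the `star` result comes from 'ac', where b* matched zero characters; `plus` skips that word entirely.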
import re
handle = open(r'C:\Users\Sharon\Desktop\Python\regex_sum_335980.txt')   # the r prefix makes this a raw string, which solves the '\r' problem (no carriage return in the path)
total = 0
for line in handle:
    numlist = re.findall('[0-9]+', line)   # findall returns a list of strings
    if len(numlist) == 0: continue
    for i in numlist:
        total = total + int(i)
print(total)
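The same summing idea as a sketch that runs without the data file, written for Python 3. Here io.StringIO stands in for the opened file, and the three text lines are invented for this illustration.

```python
import io
import re

# io.StringIO stands in for the opened file; the text is invented for this sketch
handle = io.StringIO('Why 3 apples\nno digits here\n42 and 7 more\n')

total = 0
for line in handle:
    # findall returns a list of digit strings, empty when the line has none
    total += sum(int(n) for n in re.findall(r'[0-9]+', line))

print(total)
```

Summing an empty list gives 0, so the explicit length check and `continue` are not needed in this form.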