How to process HTML files with regular expressions in Python

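The finditer method finds every match in a string. If you have already used findall, you will know that it returns a list made up of the matching strings. finditer instead returns an iterator that yields one match object per match, in order. These match objects can be visited with a for loop, which is how group 1 gets printed in the code below.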
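
As a minimal sketch of the difference between the two methods, the following compares what findall and finditer return. The sample line is made up for illustration and is not taken from the data file.

```python
import re

# Hypothetical sample line, chosen only for illustration.
line = "Logic programming in SICStus; more logic follows."
pattern = re.compile(r"(logic|sicstus)", re.I)

# findall returns a list of strings (the text of group 1, because the
# pattern contains exactly one group).
all_strings = pattern.findall(line)

# finditer returns an iterator of match objects; each one gives access
# to group(), start(), end() and so on.
found = [(m.group(1), m.start()) for m in pattern.finditer(line)]

print(all_strings)  # ['Logic', 'SICStus', 'logic']
for text, pos in found:
    print(text, "at offset", pos)
```

A match object is the better choice whenever you need more than the matched text, such as the position of the match or several groups at once.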

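You need to write Python regular expressions that recognize particular patterns in an HTML text file. Add code to the STARTER script that compiles REs for these patterns (assigning them to meaningfully named variables) and applies these REs to each line of the file, printing out the matches found.

1. Write a pattern that recognizes HTML tags, and print each one as "TAG: tag string" (e.g. "TAG: b" for the tag <b>). For simplicity, assume that the open and close angle brackets (<, >) of each tag always appear on the same line of text. A first attempt might use the regex "<.*>", where "." is the predefined character class symbol that matches any character. Try this out and work out why it is not a good solution. Then write a better solution that fixes the problem.

2. Modify the code so that it distinguishes opening and closing tags (e.g. p vs /p), printing OPENTAG and CLOSETAG.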
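
The problem with "<.*>" from task 1 shows up as soon as a line holds more than one tag. The sketch below uses an invented sample line and compares the greedy pattern with two common alternatives.

```python
import re

# Hypothetical sample line with several tags on it.
line = "<p><b>bold text</b></p>"

greedy = re.compile(r"<.*>")       # '.*' grabs as much as possible
non_greedy = re.compile(r"<.*?>")  # '.*?' stops at the first '>'
negated = re.compile(r"<[^>]+>")   # the match can never cross a '>'

print(greedy.findall(line))      # ['<p><b>bold text</b></p>']
print(non_greedy.findall(line))  # ['<p>', '<b>', '</b>', '</p>']
print(negated.findall(line))     # ['<p>', '<b>', '</b>', '</p>']
```

The greedy version runs from the first "<" to the last ">" on the line, so the whole line comes back as a single "tag". Either of the other two patterns returns each tag separately.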

import re

#------------------------------

testRE = re.compile('(logic|sicstus)', re.I)
# Any tag, open or close; group 1 is the tag label.
testI = re.compile(r'</?([A-Za-z][A-Za-z0-9]*)[^>]*>')
# Opening tags only: the first char after '<' must be a letter.
testO = re.compile(r'<([A-Za-z][^\s>]*)[^>]*>')
# Closing tags only: '<' must be followed by '/'.
testC = re.compile(r'</([^\s>]*)[^>]*>')

with open('RGX_DATA.html') as infs:
    for linenum, line in enumerate(infs, start=1):
        if line.strip() == '':
            continue
        print('  ', '-' * 100, '[%d]' % linenum, '\n   TEXT:', line, end='')

        # finditer reports every match on the line, not just the first
        for m in testRE.finditer(line):
            print('** TEST-RE:', m.group(1))

        for m in testI.finditer(line):
            print('TAG:', m.group(1))

        for m in testO.finditer(line):
            print('OPENTAG:', m.group(1))

        for m in testC.finditer(line):
            print('CLOSETAG:', m.group(1))

Note that some HTML tags have parameters, for example:

<table border=1 cellspacing=0 cellpadding=8>

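Once tags are being found and printed, make sure your pattern works for tags both with and without parameters. Now extend your code so that, for opening tags, it prints both the tag label and the parameters, e.g.: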

OPENTAG: table
PARAM: border=1
PARAM: cellspacing=0
PARAM: cellpadding=8

        for m in testO.finditer(line):
            # Drop the angle brackets, then split on whitespace: the first
            # item is the tag label, the remaining items are parameters.
            parts = m.group().replace('<', '').replace('>', '').split()
            for num, part in enumerate(parts):
                if num == 0:
                    print('OPENTAG:', part)
                else:
                    print('PARAM:', part)

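In a regular expression, a backreference can be used to say that a substring matched by an earlier part of the regex must occur again. The format is \N (where N is a positive integer), and it refers back to the text matched by the Nth group of the regex. For example, a regex such as r'(\w+) \1' matches only if the string matched by the group (\w+) occurs again at the position of the backreference \1. It would match the string "the the", for example, where "the" occurs twice. Using a backreference, write a pattern that matches when a line contains a paired open and close tag, for example the <b> and </b> around text in bold.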
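
A short sketch of backreferences in action, using invented sample strings. The pair pattern here is deliberately simplified and assumes tags without parameters.

```python
import re

# Doubled word: group 1 must be followed by a space and the same text.
doubled = re.compile(r"\b(\w+) \1\b")

# Same idea for tags: the close tag must repeat the label captured
# from the open tag. Simplified: assumes tags carry no parameters.
pair = re.compile(r"<(\w+)>(.*?)</\1>")

print(doubled.search("Paris in the the spring").group(1))  # the

sample = "some <b>bold</b> and <i>italic</i> text"
for m in pair.finditer(sample):
    print("PAIR [%s]: %s" % (m.group(1), m.group(2)))

# Mismatched labels do not match, because \1 must equal group 1.
print(pair.search("<b>text</i>"))  # None
```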

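We might want to create a script that performs HTML stripping, i.e. one that takes an HTML file and returns a plain text file from which all HTML markup has been removed. We are not going to do that here. Instead, consider a simpler case: deleting any HTML tags found on each line of the input data file.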

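If you have already defined an RE that recognizes HTML tags, you should be able to use it to produce the text with the tags removed, and print the result to the screen labelled as STRIPPED.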
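
One way to do this is the sub method of a compiled pattern, which replaces every match with a given string. The sketch below uses a simplified tag pattern and an invented sample line; strip_tags is a hypothetical helper name, not part of the STARTER script.

```python
import re

# Simplified tag pattern: optional '/', a label, then anything up to '>'.
tag = re.compile(r"</?\w+[^>]*>")

def strip_tags(line):
    """Return the line with every tag match replaced by an empty string."""
    return tag.sub("", line)

line = '<p>See <a href="page.html">this page</a> for <b>details</b>.</p>'
print("STRIPPED:", strip_tags(line))  # STRIPPED: See this page for details.
```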

import re

#------------------------------
# PART 1: 

# Key thing is to avoid matching strings that include
# multiple tags, e.g. treating '<p><b>' as a single
# tag. Can do this in several ways. Firstly, use
# non-greedy matching, so get shortest possible match
# including the two angle brackets:

tag = re.compile('</?(.*?)>')

# The above treats the '/' of a close tag as a separate
# optional component - so that this doesn't turn up as
# part of the match '.group(1)', which is meant to return
# the tag label.
# Following alternative solution uses a negated character
# class to explicitly prevent this including '>':

tag = re.compile('</?([^>]+)>')

# Finally, following version separates finding the tag
# label string from any (optional) parameters that might
# also appear before the close angle bracket:

tag = re.compile(r'</?(\w+\b)([^>]+)?>')

# Note that use of '\b' (as word boundary anchor) here means
# we must mark the regex string as a 'raw' string (r'..').

#------------------------------
# PART 2: 

# Following closeTag definition requires the first char
# after the open angle bracket to be '/', while openTag
# definition excludes this by requiring first char to be
# a 'word char' (\w):

openTag  = re.compile(r'<(\w[^>]*)>')
closeTag = re.compile(r'</([^>]*)>')

# Following revised definitions are more carefully stated
# for correct extraction of tag label (separately from
# any parameters):

openTag  = re.compile(r'<(\w+\b)([^>]+)?>')
closeTag = re.compile(r'</(\w+\b)\s*>')

#------------------------------
# PART 3: 

# Above openTag definition will already get the string
# encompassing any parameters, and return it as
# m.group(2), i.e. defn:

openTag  = re.compile(r'<(\w+\b)([^>]+)?>')

# If assume that parameters are continuous non-whitespace
# chars separated by whitespace chars, then we can divide
# them up using split - and that's how we handle them
# here. (In reality, parameter strings can be a lot more
# messy than this, but we won't try to deal with that.)

#------------------------------
# PART 4: 

openCloseTagPair = re.compile(r'<(\w+\b)([^>]+)?>(.*?)</\1\s*>')

# Note use of non-greedy matching for the text falling
# *between* the open/close tag pair - to avoid false
# results where have two similar tag pairs on same line.

#------------------------------
# PART 5: URLS

# This is quite tricky. The URL expressions in the file
# are of two kinds, of which the first is a string
# between double quotes ("..") which may include
# whitespace. For this case we might have a regex:

url = re.compile('href=("[^">]+")', re.I)

# The second case does not have quotes, and does not
# allow whitespace, consisting of a continuous sequence
# of non-whitespace material (that ends when you reach a
# space or close bracket '>'). This might be (raw string
# needed because of the '\s'):

url = re.compile(r'href=([^">\s]+)', re.I)

# We can combine these two cases as follows, and still
# get the expression back as group(1):

url = re.compile(r'href=("[^">]+"|[^">\s]+)', re.I)

# Note that I've done nothing here to exclude 'mailto:'
# links as being accepted as URLS.

#------------------------------

with open('RGX_DATA.html') as infs:
    for linenum, line in enumerate(infs, start=1):
        if line.strip() == '':
            continue
        print('  ', '-' * 100, '[%d]' % linenum, '\n   TEXT:', line, end='')

        # PART 1: find HTML tags
        # (The following uses 'finditer' to find ALL matches
        # within the line)

        mm = tag.finditer(line)
        for m in mm:
            print('** TAG:', m.group(1), ' + [%s]' % m.group(2))

        # PART 2,3: find open/close tags (+ params of open tags)

        mm = openTag.finditer(line)
        for m in mm:
            print('** OPENTAG:', m.group(1))
            if m.group(2):
                for param in m.group(2).split():
                    print('    PARAM:', param)

        mm = closeTag.finditer(line)
        for m in mm:
            print('** CLOSETAG:', m.group(1))

        # PART 4: find open/close tag pairs appearing on same line

        mm = openCloseTagPair.finditer(line)
        for m in mm:
            print("** PAIR [%s]: \"%s\"" % (m.group(1), m.group(3)))

        # PART 5: find URLs:

        mm = url.finditer(line)
        for m in mm:
            print('** URL:', m.group(1))

        # PART 6: Strip out HTML tags (note that .sub will do all
        # possible substitutions, unless number is limited by count
        # keyword arg - which is fortunately what we want here)

        stripped = tag.sub('', line)
        print('** STRIPPED:', stripped, end='')
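
To check how the combined url pattern from PART 5 treats quoted and unquoted href values, here is a small standalone sketch. The sample lines and the example.org addresses are invented for illustration and do not come from RGX_DATA.html.

```python
import re

# Combined pattern from PART 5: a quoted value, or an unquoted run of
# characters that excludes whitespace, '"' and '>'.
url = re.compile(r'href=("[^">]+"|[^">\s]+)', re.I)

# Hypothetical sample lines.
samples = [
    '<a href="http://example.org/a page.html">quoted</a>',
    '<A HREF=http://example.org/index.html>unquoted</A>',
]

found = [m.group(1) for s in samples for m in url.finditer(s)]
for value in found:
    print('** URL:', value)
```

Note that for the quoted case group(1) still includes the surrounding double quotes. If they are not wanted, value.strip('"') would remove them.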
