BeautifulSoup: Getting script tags and their contents

This tutorial shows how to get <script> tags and their contents with BeautifulSoup.

Getting all script tags

To get all script tags, use the find_all() function.

Let's look at an example.

from bs4 import BeautifulSoup # Import BeautifulSoup module

# HTML source
html = '''
<head>

<script src="/static/js/prism.js"></script>
<script src="/static/js/bootstrap.bundle.min.js"></script>
<script src="/static/js/main.js"></script>

<script> console.log('Hellow BeautifulSoup') </script>
</head>

'''

soup = BeautifulSoup(html, 'html.parser') # Parse the HTML

scripts = soup.find_all("script") # Find all script tags

print(scripts) # Print the result

Output:

[<script src="/static/js/prism.js"></script>, <script src="/static/js/bootstrap.bundle.min.js"></script>, <script src="/static/js/main.js"></script>, <script> console.log('Hellow BeautifulSoup') </script>]

As you can see, find_all() returns the script tags as a list. Now let's print them one by one.

for script in scripts: # Loop over the script tags
    print(script)

Output:

<script src="/static/js/prism.js"></script>
<script src="/static/js/bootstrap.bundle.min.js"></script>
<script src="/static/js/main.js"></script>
<script> console.log('Hellow BeautifulSoup') </script>
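When only the first script tag is needed, find() or the limit argument of find_all() can be used instead of looping over the whole list. The following is a minimal sketch; the shortened HTML and the variable names first and first_only are chosen here for illustration only.

```python
from bs4 import BeautifulSoup

# Shortened HTML, assumed here for illustration
html = """
<head>
<script src="/static/js/prism.js"></script>
<script> console.log('Hellow BeautifulSoup') </script>
</head>
"""

soup = BeautifulSoup(html, "html.parser")

scripts = soup.find_all("script")              # every script tag
first = soup.find("script")                    # first match, or None if there is none
first_only = soup.find_all("script", limit=1)  # list holding at most one tag

print(len(scripts))     # 2
print(first["src"])     # /static/js/prism.js
print(len(first_only))  # 1
```

Note that find() returns a single tag (or None), while find_all() always returns a list, even when limit=1.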

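Getting script tags that reference a script file

To get only the script tags that reference a script file, we need to:

  • Use the find_all() function
  • Pass the src=True argument

Example: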

from bs4 import BeautifulSoup # Import BeautifulSoup module

# HTML source
html = '''
<head>

<script src="/static/js/prism.js"></script>
<script src="/static/js/bootstrap.bundle.min.js"></script>
<script src="/static/js/main.js"></script>

<script> console.log('Hellow BeautifulSoup') </script>
</head>

'''

soup = BeautifulSoup(html, 'html.parser') # Parse the HTML

scripts = soup.find_all("script", src=True) # Find all script tags that have a src attribute

print(scripts) # Print the result

Output:

[<script src="/static/js/prism.js"></script>, <script src="/static/js/bootstrap.bundle.min.js"></script>, <script src="/static/js/main.js"></script>]

To get the src attribute of each script tag, use the following code.

# Get the src attribute
for script in scripts: # Loop over the script tags
    print(script['src'])

Output:

/static/js/prism.js
/static/js/bootstrap.bundle.min.js
/static/js/main.js

As you can see, we use ['src'] to get the src URL of a script tag.
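Two variations may be useful here. Indexing with ['src'] raises a KeyError on a tag that has no src attribute, so when looping over all script tags, the tag's get() method is a safer choice because it returns None instead. The src filter also accepts a compiled regular expression rather than True. The sketch below assumes a shortened HTML snippet, and the ".min.js" pattern is only an illustrative choice.

```python
import re

from bs4 import BeautifulSoup

# Shortened HTML, assumed here for illustration
html = """
<head>
<script src="/static/js/bootstrap.bundle.min.js"></script>
<script src="/static/js/main.js"></script>
<script> console.log('Hellow BeautifulSoup') </script>
</head>
"""

soup = BeautifulSoup(html, "html.parser")

# get() returns None for tags without a src attribute instead of raising KeyError
srcs = [script.get("src") for script in soup.find_all("script")]
print(srcs)
# ['/static/js/bootstrap.bundle.min.js', '/static/js/main.js', None]

# Keep only script tags whose src matches a regular expression
minified = [s["src"] for s in soup.find_all("script", src=re.compile(r"\.min\.js$"))]
print(minified)
# ['/static/js/bootstrap.bundle.min.js']
```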

Getting the contents of a script tag

To get the contents of a script tag, use the .string attribute. Let's look at an example first:

# HTML source
html = '''
<head>

<script src="/static/js/prism.js"></script>
<script src="/static/js/bootstrap.bundle.min.js"></script>
<script src="/static/js/main.js"></script>

<script> console.log('Hellow BeautifulSoup') </script>
</head>

'''

soup = BeautifulSoup(html, 'html.parser') # Parse the HTML

scripts = soup.find_all("script", string=True) # Find only the script tags that have inline content

print(scripts) # Print the result

Output:

[<script> console.log('Hellow BeautifulSoup') </script>]

We set string=True to find only the script tags that have content. Now let's print the content of each one.

# Get the content of each script
for script in scripts: # Loop over the script tags
    print(script.string)

Output:

console.log('Hellow BeautifulSoup') 
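For an empty script tag that only references a file, .string is None, so a single loop over all script tags can tell external and inline scripts apart. The sketch below is one way to do that under the same assumptions as the examples above. The list names external and inline are hypothetical, and strip() is used only to trim the surrounding whitespace.

```python
from bs4 import BeautifulSoup

# Shortened HTML, assumed here for illustration
html = """
<head>
<script src="/static/js/main.js"></script>
<script> console.log('Hellow BeautifulSoup') </script>
</head>
"""

soup = BeautifulSoup(html, "html.parser")

external = []  # src URLs of script tags without inline code
inline = []    # inline code with surrounding whitespace removed

for script in soup.find_all("script"):
    if script.string is None:
        # Empty tag: no inline code, so record its src (None if it has no src either)
        external.append(script.get("src"))
    else:
        inline.append(script.string.strip())

print(external)  # ['/static/js/main.js']
print(inline)    # ["console.log('Hellow BeautifulSoup')"]
```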

Conclusion

In this BeautifulSoup tutorial, we learned how to get all script tags. We also learned how to get a script tag's src attribute and its contents.