如何在 Google BigQuery 中将 URL 参数提取为 ARRAY

我们将使用 BigQuery 的标准 SQL 方言中提供的REGEXP_EXTRACT_ALL 函数从 URL 的查询部分提取参数并将它们作为数组返回。

代码

#standardSQL
SELECT
  REGEXP_EXTRACT_ALL(query,r'(?:\?|&)((?:[^=]+)=(?:[^&]*))') as params,
  REGEXP_EXTRACT_ALL(query,r'(?:\?|&)(?:([^=]+)=(?:[^&]*))') as keys,
  REGEXP_EXTRACT_ALL(query,r'(?:\?|&)(?:(?:[^=]+)=([^&]*))') as values
FROM
  table

工作示例

#standardSQL
# Extract query parameters from a URL as ARRAY in BigQuery; standard-sql; 2018-04-08
# @see http://www.pascallandau.com/bigquery-snippets/extract-url-parameters-array/
WITH examples AS (
  SELECT 1   AS id, 
  '?foo=bar' AS query, 
  'simple'   AS description
  UNION ALL SELECT 2, '?foo=bar&bar=baz', 'multiple params'
  UNION ALL SELECT 3, '?foo[]=bar&foo[]=baz', 'arrays'
  UNION ALL SELECT 4, '', 'no query'
)
SELECT 
  id, 
  query,
  REGEXP_EXTRACT_ALL(query,r'(?:\?|&)((?:[^=]+)=(?:[^&]*))') as params,
  REGEXP_EXTRACT_ALL(query,r'(?:\?|&)(?:([^=]+)=(?:[^&]*))') as keys,
  REGEXP_EXTRACT_ALL(query,r'(?:\?|&)(?:(?:[^=]+)=([^&]*))') as values,
  description
FROM examples

结果

Rowidqueryparamskeysvaluesdescription
11?foo=barfoo=barfoobarsimple
22?foo=bar&bar=bazfoo=barfoobarmultiple params
bar=bazbarbaz
33?foo[]=bar&foo[]=bazfoo[]=barfoo[]bararrays
foo[]=bazfoo[]baz
44no query

在 BigQuery 上运行

在 BigQuery 控制台中打开

笔记

  • REGEXP_EXTRACT_ALL只有 1 个捕获组除外,因此我们需要将所有其他组标记为非捕获(?:
  • 如果 URL 包含片段部分(例如 https://example.org/?foo=bar#baz),则片段当前不会被删除。为此,请在使用 REGEXP_REPLACE提取之前删除片段,例如: REGEXP_EXTRACT_ALL( REGEXP_EXTRACT(query, r'#.*', ''), r'(?:\?|&)(?:(?:[^=]+)=([^&]*))') as values

链接

用例

  • 从您的日志文件中编译所有参数的列表
  • 评估参数键/值的频率