fix: Fix audio base64 being output to the console log #8748

Open
Luna-channel wants to merge 2 commits into AstrBotDevs:master from Luna-channel:investigate-stt-base64

Conversation

Luna-channel (Contributor) commented Jun 12, 2026

Motivation

When audio preprocessing fails in the OpenAI-compatible provider, the warning log currently prints the original audio_ref directly.

If the audio reference is a data:audio/...;base64,... URL, the full base64 payload is written to the console log. This can flood logs and may expose sensitive audio data.

This PR redacts base64 payloads in data URLs before logging them.

Modifications

  • Modified astrbot/core/provider/sources/openai_source.py.

  • Added _redact_data_url_for_log() to redact data:*;base64,... payloads in log output.

  • Applied the redaction helper to the audio preprocessing failure warning.

  • Kept existing audio preprocessing and STT behavior unchanged.

  • This is NOT a breaking change.

Screenshots or Test Results

Verification command:

python -m py_compile astrbot/core/provider/sources/openai_source.py

Result:

Passed with no output.

Before this change, logs could include full audio data URLs:

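音频 data:audio/wav;base64,UklGRk... 预处理失败,将忽略。错误: ...
(English: "Audio data:audio/wav;base64,UklGRk... preprocessing failed and will be ignored. Error: ...")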

After this change, the base64 payload is redacted:

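音频 data:audio/wav;base64,<redacted 12345 chars> 预处理失败,将忽略。错误: ...
(English: "Audio data:audio/wav;base64,<redacted 12345 chars> preprocessing failed and will be ignored. Error: ...")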
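
Because `py_compile` only confirms that the file parses, a small behavioral check may be a useful complement. The sketch below copies the helper body quoted in the review comments into a standalone function. The module-level name `redact_data_url_for_log` and the sample payload are assumptions for illustration, not code from the PR.

```python
import re


def redact_data_url_for_log(value: str) -> str:
    """Standalone copy of the PR's helper, for illustration only."""
    match = re.match(r"^(data:[^;,]+;base64,)(.*)$", value, flags=re.IGNORECASE)
    if not match:
        return value
    prefix, payload = match.groups()
    return f"{prefix}<redacted {len(payload)} chars>"


# Hypothetical sample: a 12345-character payload standing in for real audio.
sample = "data:audio/wav;base64," + "A" * 12345
redacted = redact_data_url_for_log(sample)
print(redacted)  # data:audio/wav;base64,<redacted 12345 chars>

# Values that are not base64 data URLs pass through unchanged.
plain = redact_data_url_for_log("https://example.com/a.wav")
print(plain)  # https://example.com/a.wav
```

The prefix, including the media type, is kept so the log still shows what kind of audio failed, while the payload is reduced to its length.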

Checklist

  • 😊 If there are new features added in the PR, I have discussed it with the authors through issues/emails, etc.

  • 👀 My changes have been well-tested, and "Verification Steps" and "Screenshots" have been provided above.

  • 🤓 I have ensured that no new dependencies are introduced, OR if new dependencies are introduced, they have been added to the appropriate locations in requirements.txt and pyproject.toml.

  • 😮 My changes do not introduce malicious code.

Summary by Sourcery

Bug Fixes:

  • Prevent audio preprocessing warning logs from printing full base64-encoded audio data URLs by redacting their payloads.

dosubot (bot) added the size:S (This PR changes 10-29 lines, ignoring generated files) and area:provider (The bug / feature is about AI Provider, Models, LLM Agent, LLM Agent Runner) labels Jun 12, 2026

sourcery-ai (bot, Contributor) left a comment

Hey - I've found 1 issue, and left some high level feedback:

  • Consider precompiling the data URL regex at module scope (e.g., _DATA_URL_RE = re.compile(...)) and reusing it in _redact_data_url_for_log to avoid recompiling the pattern on every call in hot paths.
  • It may be safer for _redact_data_url_for_log to accept Any and early-return non-string values (or explicitly cast to str), so that it can be reused more broadly in logging without risking type errors.
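
Both suggestions above can be sketched together as follows. This is a minimal illustration written as a module-level function rather than the PR's static method. `_DATA_URL_RE` follows the reviewer's example name, the function name `redact_data_url_for_log` is an assumption, and the pattern itself is left exactly as in the PR.

```python
import re
from typing import Any

# Compiled once at import time; the name follows the reviewer's example.
_DATA_URL_RE = re.compile(r"^(data:[^;,]+;base64,)(.*)$", re.IGNORECASE)


def redact_data_url_for_log(value: Any) -> str:
    """Return a log-safe string; non-string inputs are converted with str()."""
    if not isinstance(value, str):
        return str(value)
    match = _DATA_URL_RE.match(value)
    if not match:
        return value
    prefix, payload = match.groups()
    return f"{prefix}<redacted {len(payload)} chars>"


print(redact_data_url_for_log(None))  # None
print(redact_data_url_for_log("data:audio/wav;base64,QUJD"))
# data:audio/wav;base64,<redacted 4 chars>
```

Python's `re` module already caches recently compiled patterns internally, so the performance gain from precompiling is probably modest. The main benefit is arguably that the pattern gets a name and a single definition.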

## Individual Comments

### Comment 1
<location path="astrbot/core/provider/sources/openai_source.py" line_range="85-86" />
<code_context>
             return None

+    @staticmethod
+    def _redact_data_url_for_log(value: str) -> str:
+        match = re.match(r"^(data:[^;,]+;base64,)(.*)$", value, flags=re.IGNORECASE)
+        if not match:
+            return value
</code_context>
<issue_to_address>
**issue (bug_risk):** The data URL regex may miss valid data URLs that have additional parameters before `;base64`.

The pattern `r"^(data:[^;,]+;base64,)(.*)$"` only matches when `;base64` directly follows the media type. Valid data URLs like `data:audio/wav;codec=opus;rate=48000;base64,...` won’t be redacted and will be logged in full. Consider a pattern such as `r"^(data:[^,]*;base64,)(.*)$"` (or a variant allowing multiple `;key=value` segments) so base64 payloads in data URLs with extra parameters are still redacted.
</issue_to_address>
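
The difference between the two patterns in this comment can be checked directly. In the sketch below, the URL is the reviewer's example with a placeholder payload substituted for the elided one. The constant names `ORIGINAL` and `WIDENED` are assumptions for illustration.

```python
import re

# Pattern from the PR: ";base64" must directly follow the media type.
ORIGINAL = re.compile(r"^(data:[^;,]+;base64,)(.*)$", re.IGNORECASE)
# Pattern proposed in the review: any parameters may precede ";base64".
WIDENED = re.compile(r"^(data:[^,]*;base64,)(.*)$", re.IGNORECASE)

# Reviewer's example URL; "QUJDRA==" is a placeholder payload.
url = "data:audio/wav;codec=opus;rate=48000;base64,QUJDRA=="

original_match = ORIGINAL.match(url)
widened_match = WIDENED.match(url)

print(original_match is None)  # True: the payload would be logged in full
print(widened_match.group(1))  # data:audio/wav;codec=opus;rate=48000;base64,
print(widened_match.group(2))  # QUJDRA==
```

Because `[^,]*` cannot cross a comma, the prefix group stops at the comma that separates the header from the data and cannot extend into the payload.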


Comment thread astrbot/core/provider/sources/openai_source.py Outdated

gemini-code-assist (bot, Contributor) left a comment

Code Review

This pull request introduces a helper method _redact_data_url_for_log to redact base64-encoded data URLs in warning logs when audio preprocessing fails. The reviewer identified an issue where the regular expression used for matching data URLs may fail if the URL contains additional parameters (such as charset) and suggested a more robust regex along with type checking to prevent potential runtime errors.

Comment on lines +84 to +90
    @staticmethod
    def _redact_data_url_for_log(value: str) -> str:
        match = re.match(r"^(data:[^;,]+;base64,)(.*)$", value, flags=re.IGNORECASE)
        if not match:
            return value
        prefix, payload = match.groups()
        return f"{prefix}<redacted {len(payload)} chars>"


medium

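The regular expression r"^(data:[^;,]+;base64,)(.*)$" fails to match data URLs that carry extra parameters.

Per RFC 2397, the media type portion of a data URL may include additional parameters (for example data:text/plain;charset=utf-8;base64,YQ==). Because [^;,]+ stops at the first semicolon, the pattern never reaches the later ;base64, segment. Redaction then fails and the full base64 payload is still written to the log.

Consider changing the regular expression to r"^(data:.*?;base64,)(.*)$", which non-greedily matches all parameters before ;base64,. In addition, to keep re.match from raising TypeError when a non-string value is passed in, consider adding a type check.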

Suggested change
-    @staticmethod
-    def _redact_data_url_for_log(value: str) -> str:
-        match = re.match(r"^(data:[^;,]+;base64,)(.*)$", value, flags=re.IGNORECASE)
-        if not match:
-            return value
-        prefix, payload = match.groups()
-        return f"{prefix}<redacted {len(payload)} chars>"
+    @staticmethod
+    def _redact_data_url_for_log(value: Any) -> str:
+        if not isinstance(value, str):
+            return str(value)
+        match = re.match(r"^(data:.*?;base64,)(.*)$", value, flags=re.IGNORECASE)
+        if not match:
+            return value
+        prefix, payload = match.groups()
+        return f"{prefix}<redacted {len(payload)} chars>"
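
To see how the suggested version behaves, the sketch below copies it into a standalone function (the module-level name `redact_data_url_for_log` is an assumption) and runs it on the review's RFC 2397 example, on a non-string value, and on a payload containing a line break. The last case is not raised in the review. It is included as a possible edge case only, and the PR does not establish whether such inputs ever reach this code path.

```python
import re
from typing import Any


def redact_data_url_for_log(value: Any) -> str:
    """Standalone copy of the suggested helper, for illustration only."""
    if not isinstance(value, str):
        return str(value)
    match = re.match(r"^(data:.*?;base64,)(.*)$", value, flags=re.IGNORECASE)
    if not match:
        return value
    prefix, payload = match.groups()
    return f"{prefix}<redacted {len(payload)} chars>"


# Example URL from the review comment (extra parameter before ";base64").
with_params = redact_data_url_for_log("data:text/plain;charset=utf-8;base64,YQ==")
print(with_params)  # data:text/plain;charset=utf-8;base64,<redacted 4 chars>

# Non-string input is converted instead of reaching re.match.
from_bytes = redact_data_url_for_log(b"raw")
print(from_bytes)  # b'raw'

# A payload with an embedded line break is not matched, because "." does not
# match "\n" without re.DOTALL, so the value comes back unredacted.
multiline_input = "data:audio/wav;base64,QUJD\nREVG"
multiline = redact_data_url_for_log(multiline_input)
print(multiline == multiline_input)  # True
```

If multi-line payloads turn out to matter, adding `re.DOTALL` to the flags would let `.` match newlines as well.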

Luna-channel (Contributor, Author) commented

Fixes #8676
