FEAT Add WordDocConverter for Word document generation #1365
ducktv1203 wants to merge 6 commits into Azure:main
…te rendering and integration with PyRIT's data serialization system.
@microsoft-github-policy-service agree
Pull request overview
Adds a new PyRIT file converter that emits Word documents (.docx) from text prompts, complementing the existing PDF file-converter tooling and documentation.
Changes:
- Introduces `WordDocConverter` with direct `.docx` generation and template-based placeholder injection.
- Exports `WordDocConverter` from `pyrit.prompt_converter`.
- Adds unit tests and updates converter documentation; updates project dependencies for `python-docx`.
Reviewed changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated 9 comments.
| File | Description |
|---|---|
| `pyrit/prompt_converter/word_doc_converter.py` | Implements .docx generation + template injection and serialization. |
| `pyrit/prompt_converter/__init__.py` | Exposes WordDocConverter in the package exports. |
| `pyproject.toml` | Adds python-docx dependency (but also introduces a duplicate pypdf entry). |
| `tests/unit/converter/test_word_doc_converter.py` | Adds unit tests for direct + template modes and identifier behavior. |
| `doc/code/converters/5_file_converters.py` | Documents Word document conversion usage alongside PDF converters. |

    async def _serialize_docx(self, docx_bytes: bytes) -> DataTypeSerializer:
        """
        Save the generated ``.docx`` bytes through PyRIT's data serializer.

        The serializer picks a unique filename and writes the bytes to the configured storage location (local disk by default).

        Args:
            docx_bytes: Raw content of the Word document.

        Returns:
            DataTypeSerializer: Serializer whose ``.value`` contains the output path.
        """
        docx_serializer = data_serializer_factory(
            category="prompt-memory-entries",
            data_type="binary_path",
            extension="docx",
        )

        await docx_serializer.save_data(docx_bytes)

        return docx_serializer

Async private method _serialize_docx doesn’t follow the project convention that async methods must end with _async. Rename it (e.g., _serialize_docx_async) and update the call site and related tests/patches accordingly.
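
A minimal sketch of what the rename implies for the call site and for tests that patch the method. `SketchConverter` and its return values are hypothetical stand-ins, not the PR's class; only the `_async` suffix and the patch target name are the point.

```python
import asyncio
from unittest.mock import AsyncMock, patch


class SketchConverter:
    """Hypothetical stand-in for the converter; not the PR's implementation."""

    async def _serialize_docx_async(self, docx_bytes: bytes) -> str:
        # Placeholder for the real serializer call.
        return f"saved {len(docx_bytes)} bytes"

    async def convert_async(self, prompt: str) -> str:
        # Call site updated to the renamed method.
        return await self._serialize_docx_async(prompt.encode("utf-8"))


async def run_patched() -> str:
    # Tests that patch the method must use the new name as the patch target.
    with patch.object(
        SketchConverter,
        "_serialize_docx_async",
        new=AsyncMock(return_value="mocked.docx"),
    ) as mock_save:
        result = await SketchConverter().convert_async("hello")
        mock_save.assert_awaited_once()
    return result


print(asyncio.run(run_patched()))
```

A patch that still targets the old name `_serialize_docx` would raise `AttributeError` in this sketch, which is why the tests need updating together with the rename.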

    # Rewind to read from the start of the stored bytes.
    self._existing_doc_bytes.seek(0)
    document = Document(self._existing_doc_bytes)

self._existing_doc_bytes is a shared BytesIO whose cursor is mutated via seek(0). If convert_async is called concurrently on the same converter instance, the shared cursor can cause races/corrupted reads. Store immutable bytes instead and create a new BytesIO per conversion (or otherwise guard access).

Suggested change:

    - # Rewind to read from the start of the stored bytes.
    - self._existing_doc_bytes.seek(0)
    - document = Document(self._existing_doc_bytes)
    + existing_doc_bytes = self._existing_doc_bytes
    + if isinstance(existing_doc_bytes, BytesIO):
    +     template_bytes = existing_doc_bytes.getvalue()
    + else:
    +     template_bytes = existing_doc_bytes
    + document_stream = BytesIO(template_bytes)
    + document = Document(document_stream)

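
The pattern behind this suggestion can be sketched without `python-docx`: snapshot the template to immutable `bytes` once, then hand every conversion its own `BytesIO`. `TemplateHolder` and `read_all` are hypothetical names used only for illustration; the real converter would pass the fresh stream to `Document(...)`.

```python
from concurrent.futures import ThreadPoolExecutor
from io import BytesIO
from typing import Union


class TemplateHolder:
    """Hypothetical stand-in; only the byte-handling pattern matters here."""

    def __init__(self, template: Union[bytes, BytesIO]) -> None:
        # Snapshot to immutable bytes once, so no cursor state is shared.
        if isinstance(template, BytesIO):
            self._template_bytes = template.getvalue()
        else:
            self._template_bytes = bytes(template)

    def open_stream(self) -> BytesIO:
        # Each caller gets an independent stream positioned at offset 0.
        return BytesIO(self._template_bytes)


holder = TemplateHolder(BytesIO(b"template-content"))


def read_all(_: int) -> bytes:
    return holder.open_stream().read()


# Concurrent readers never touch a shared cursor.
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(read_all, range(8)))
```

Doing the snapshot in the constructor rather than on each conversion, as sketched here, is a design choice that differs slightly from the suggested change, which converts on every call.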

    (e.g. ``{{ prompt }}``) while preserving all original formatting.
    If no template was provided, a new document is generated where each line of the prompt (split on ``\\n``)
    becomes a new paragraph with the configured font settings.

    Args:
        prompt: The text to embed in the Word document.
        input_type: Must be ``text``.

This docstring claims template-based injection preserves “all original formatting”, but _render_paragraph rewrites runs and can collapse mixed formatting when placeholders span runs (it clears subsequent runs and applies first-run formatting to the whole rendered text). Please either (a) implement run-aware replacement that preserves mixed formatting, or (b) adjust the docstring/behavior description to reflect this limitation.

Suggested change:

    - (e.g. ``{{ prompt }}``) while preserving all original formatting.
    - If no template was provided, a new document is generated where each line of the prompt (split on ``\\n``)
    - becomes a new paragraph with the configured font settings.
    - Args:
    -     prompt: The text to embed in the Word document.
    -     input_type: Must be ``text``.
    + (e.g. ``{{ prompt }}``), using the original document's layout and styling as a base. Mixed formatting within or
    + across placeholder regions may not be preserved exactly.
    + If no template was provided, a new document is generated where each line of the prompt (split on ``\\n``)
    + becomes a new paragraph with the configured font settings.
    + Args:
    +     prompt (str): The text to embed in the Word document.
    +     input_type (PromptDataType): Must be ``text``.

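
For option (a), one possible shape of run-aware replacement is sketched below. It models runs with a hypothetical `Run` dataclass (text plus a single `bold` flag) rather than real `python-docx` run objects, matches only the exact placeholder string, and handles only its first occurrence. The idea is to put the replacement into the first run the placeholder touches and trim the placeholder's remainder out of later runs, leaving every run's formatting intact.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Run:
    """Hypothetical stand-in for a formatted run of text in a paragraph."""

    text: str
    bold: bool = False


def replace_across_runs(runs: List[Run], placeholder: str, value: str) -> bool:
    """Replace the first occurrence of placeholder, even if it spans runs."""
    full_text = "".join(run.text for run in runs)
    start = full_text.find(placeholder)
    if start < 0:
        return False
    end = start + len(placeholder)

    offset = 0
    inserted = False
    for run in runs:
        run_start, run_end = offset, offset + len(run.text)
        offset = run_end
        if run_end <= start or run_start >= end:
            continue  # This run does not overlap the placeholder.
        # Keep the text outside the placeholder span; drop the part inside it.
        before = run.text[: max(start - run_start, 0)]
        after = run.text[end - run_start:]
        run.text = before + ("" if inserted else value) + after
        inserted = True
    return True


runs = [Run("Hello {{ "), Run("prompt", bold=True), Run(" }}!")]
replaced = replace_across_runs(runs, "{{ prompt }}", "WORLD")
```

In this sketch the injected text takes the formatting of the first overlapping run, and runs that held only placeholder text end up empty but keep their formatting attributes. Whether that is the desired behavior for the converter is a design decision for the PR author.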

        template = Template(full_text)
        rendered_text = template.render(**template_vars)
    except Exception as e:
        logger.warning(f"Failed to render paragraph template: {e}")
        return

Rendering arbitrary Jinja2 templates from document text via Template(full_text).render(...) is unsafe if the template content is not fully trusted (Jinja2 templates can be abused for code execution/data access). Consider using jinja2.sandbox.SandboxedEnvironment, restricting to a simple {{ prompt }} replacement, or otherwise documenting and enforcing that templates must be trusted.
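
The "simple `{{ prompt }}` replacement" option could look roughly like the sketch below, which uses only the standard library. `inject_prompt` is a hypothetical helper name, and this is deliberately not Jinja2-compatible: any other expression in the template text is left untouched rather than evaluated.

```python
import re

# Matches {{ prompt }} with optional whitespace inside the braces.
_PROMPT_PLACEHOLDER = re.compile(r"\{\{\s*prompt\s*\}\}")


def inject_prompt(template_text: str, prompt: str) -> str:
    """Substitute the prompt for each {{ prompt }} placeholder, nothing else."""
    # A function replacement keeps backslashes in the prompt literal
    # instead of treating them as regex group references.
    return _PROMPT_PLACEHOLDER.sub(lambda _match: prompt, template_text)


rendered = inject_prompt(
    "Dear team, {{ prompt }} Regards. {{ 7 * 7 }}",
    r"see C:\new\1",
)
print(rendered)
```

If full template features are needed, the `jinja2.sandbox.SandboxedEnvironment` route mentioned above is the alternative; this sketch trades expressiveness for a much smaller attack surface.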

    from io import BytesIO
    from pathlib import Path
    from typing import Optional
    from docx import Document
    from docx.shared import Pt
    from jinja2 import Template

    from pyrit.common.logger import logger
    from pyrit.identifiers import ConverterIdentifier
    from pyrit.models import PromptDataType, data_serializer_factory
    from pyrit.models.data_type_serializer import DataTypeSerializer
    from pyrit.prompt_converter.prompt_converter import ConverterResult, PromptConverter

Import grouping is inconsistent with other modules (stdlib vs third-party vs local). Add a blank line between standard-library imports (io/pathlib/typing) and third-party imports (docx/jinja2) to match the repository’s import organization pattern.

    "pyodbc>=5.1.0",
    "python-dotenv>=1.0.1",
    "python-docx>=1.2.0",
    "pypdf>=5.1.0",

dependencies lists pypdf twice with different minimum versions (>=5.1.0 and >=6.6.2). This is conflicting/ambiguous for resolvers and should be collapsed to a single requirement (likely keep only the stricter >=6.6.2 unless there’s a specific reason to lower it).

Suggested change:

    - "pypdf>=5.1.0",


    # The `WordDocConverter` generates Word documents (.docx) from text using `python-docx`. It supports two modes:
    #
    # 1. **Direct generation**: Convert plain text strings into Word documents. The prompt becomes the document content.
    # 2. **Template-based generation**: Supply an existing `.docx` file containing jinja2 placeholders (e.g., `{{ prompt }}`). The converter replaces placeholders with the prompt text while preserving the original document's formatting, tables, headers, and footers. The original file is never modified — a new file is always generated.

The docs state that template-based generation preserves the original document’s formatting. Given the current implementation can collapse run-level formatting when placeholders span multiple runs, please either update the documentation to mention this limitation or improve the implementation to truly preserve mixed formatting.

    # This mode takes an existing `.docx` file that contains jinja2 `{{ prompt }}` placeholders and replaces them with the provided prompt text. This is useful for embedding adversarial content into realistic document templates (e.g., resumes, reports, invoices) while preserving all original formatting.

    # %%
    import tempfile

This import of module tempfile is redundant, as it was previously imported on line 144.

Suggested change:

    - import tempfile


    from pyrit.prompt_converter.prompt_converter import ConverterResult, PromptConverter


    class WordDocConverter(PromptConverter):

This class does not call `PromptConverter.__init__` during initialization. (`WordDocConverter.__init__` may be missing a call to a base class `__init__`.)
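
A minimal sketch of the usual fix, assuming the base initializer takes no required arguments. `BaseConverter` is a hypothetical stand-in because `PromptConverter`'s real `__init__` signature is not shown in this review, and `font_size` is an illustrative parameter only.

```python
class BaseConverter:
    """Hypothetical stand-in for PromptConverter; real signature not shown here."""

    def __init__(self) -> None:
        # State that subclasses silently miss if they skip this call.
        self.base_initialized = True


class SketchWordDocConverter(BaseConverter):
    """Illustrative subclass; not the PR's WordDocConverter."""

    def __init__(self, font_size: int = 12) -> None:
        # Run the base initializer before setting subclass state.
        super().__init__()
        self._font_size = font_size


converter = SketchWordDocConverter(font_size=14)
print(converter.base_initialized)
```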
Description:
Adds WordDocConverter - a file converter that transforms text prompts into Word documents (.docx). Issue #424.
Two modes:
Files changed:
Tests and Documentation:
Tests: 21 unit tests covering init/validation, direct generation, template-based generation (body paragraphs, tables, multiple placeholders, no-placeholder passthrough), end-to-end with real .docx output, and identifier correctness. All passed. (21/21)