Closed
Description
Summary
When converting DOCX files in which a space character has different formatting than the adjacent text, the space is lost in the output. markitdown converts the DOCX to HTML with mammoth and then passes that HTML to markdownify, which strips whitespace-only content from inline formatting tags.
Example:
- DOCX content: "further"[normal] + " "[bold] + "reference"[normal]
- HTML from mammoth: further<strong> </strong>reference
- Current output: furtherreference
- Expected output: further reference
Root Cause
This is an upstream issue in markdownify, not markitdown itself. When chomp() encounters whitespace-only text inside inline tags like <strong>, <b>, <em>, or <i>, it strips the content entirely.
Upstream issue: matthewwithanm/python-markdownify#155
Fix PR: matthewwithanm/python-markdownify#253
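The root cause above can be illustrated with a small standard-library-only model. This is a simplified sketch of the behavior described, not markdownify's actual source: `convert_inline_lossy` and `convert_inline_preserving` are hypothetical names, and the preserving variant is only one possible fix, not necessarily what the linked PR does.

```python
def chomp(text):
    """Split text into (prefix, suffix, stripped), keeping one edge space per side."""
    prefix = " " if text and text[0] == " " else ""
    suffix = " " if text and text[-1] == " " else ""
    return prefix, suffix, text.strip()


def convert_inline_lossy(text, markup="**"):
    """Model of the reported behavior: whitespace-only content yields ''."""
    prefix, suffix, stripped = chomp(text)
    if not stripped:
        return ""  # the whitespace-only run disappears here
    return f"{prefix}{markup}{stripped}{markup}{suffix}"


def convert_inline_preserving(text, markup="**"):
    """Variant that emits a whitespace-only run as plain, unformatted text."""
    prefix, suffix, stripped = chomp(text)
    if not stripped:
        return text  # keep the space, drop only the markup
    return f"{prefix}{markup}{stripped}{markup}{suffix}"


# Inner text of <strong> </strong> from the example is a single space.
print(repr("further" + convert_inline_lossy(" ") + "reference"))
print(repr("further" + convert_inline_preserving(" ") + "reference"))
```

In this model the first call joins the two words and the second keeps them separated, which mirrors the current and expected outputs in the example.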
Workaround
Until the markdownify fix is released, users can apply a monkey-patch before importing markitdown:
import markdownify

# Keep references to the original inline converters so the wrappers can delegate to them.
_original_convert_strong = markdownify.MarkdownConverter.convert_strong
_original_convert_b = markdownify.MarkdownConverter.convert_b
_original_convert_em = markdownify.MarkdownConverter.convert_em
_original_convert_i = markdownify.MarkdownConverter.convert_i

def _make_whitespace_preserving_converter(original_method):
    """Wrap an inline converter so whitespace-only text is returned unchanged."""
    def wrapper(self, el, text, *args, **kwargs):
        # Whitespace-only content would otherwise be dropped along with its markup.
        if text and not text.strip():
            return text
        return original_method(self, el, text, *args, **kwargs)
    return wrapper

markdownify.MarkdownConverter.convert_strong = _make_whitespace_preserving_converter(_original_convert_strong)
markdownify.MarkdownConverter.convert_b = _make_whitespace_preserving_converter(_original_convert_b)
markdownify.MarkdownConverter.convert_em = _make_whitespace_preserving_converter(_original_convert_em)
markdownify.MarkdownConverter.convert_i = _make_whitespace_preserving_converter(_original_convert_i)

from markitdown import MarkItDown  # Import after patch
Suggested Action
Once markdownify releases the fix, consider bumping the dependency version to include it.
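Until then, the wrapper pattern from the workaround can be checked without installing markdownify. The sketch below assumes a hypothetical stand-in class, `FakeConverter`, whose `convert_strong` drops whitespace-only text the way the issue describes; it is not the real `MarkdownConverter`, and `preserve_whitespace` is an illustrative name.

```python
import functools


class FakeConverter:
    """Hypothetical stand-in for markdownify.MarkdownConverter (assumption)."""

    def convert_strong(self, el, text, *args, **kwargs):
        # Mimics the reported behavior: whitespace-only content is dropped.
        stripped = text.strip()
        return f"**{stripped}**" if stripped else ""


def preserve_whitespace(original_method):
    """Return a wrapper that passes whitespace-only text through unchanged."""
    @functools.wraps(original_method)
    def wrapper(self, el, text, *args, **kwargs):
        if text and not text.strip():
            return text
        return original_method(self, el, text, *args, **kwargs)
    return wrapper


converter = FakeConverter()
before = converter.convert_strong(None, " ")

# Patch the class; instances created earlier pick up the wrapper as well.
FakeConverter.convert_strong = preserve_whitespace(FakeConverter.convert_strong)
after = converter.convert_strong(None, " ")
bold = converter.convert_strong(None, "bold")

print(repr(before), repr(after), repr(bold))
```

Because the patch replaces the method on the class, it also applies to instances that already exist, and non-whitespace text still goes through the original converter.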