Whitespace lost when DOCX has formatting boundaries mid-text (upstream markdownify issue) #1539

@sp2935

Summary

When converting a DOCX file in which a space character is formatted differently from the text around it, the space is lost in the output. markitdown converts DOCX to HTML with mammoth and then passes that HTML to markdownify, which strips whitespace-only content from inline formatting tags.

Example:

  • DOCX content: "further" [normal] + " " [bold] + "reference" [normal]
  • HTML from mammoth: further<strong> </strong>reference
  • Current output: furtherreference
  • Expected output: further reference

Root Cause

This is an upstream issue in markdownify, not in markitdown itself. When the content of an inline tag such as <strong>, <b>, <em>, or <i> is whitespace-only, chomp() strips it to an empty string and the converter emits nothing for that element, so the space disappears.

Upstream issue: matthewwithanm/python-markdownify#155
Fix PR: matthewwithanm/python-markdownify#253
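
The stripping described under Root Cause can be modelled in a few lines. This is a simplified sketch for illustration, not markdownify's actual source: `chomp_model` and `convert_inline_model` are hypothetical names, and the real converter handles more cases (options, nested tags, escaping).

```python
# Simplified model of the behaviour described above; NOT markdownify's real code.
# chomp_model and convert_inline_model are hypothetical names used for illustration.

def chomp_model(text):
    """Remember one leading/trailing space, then strip the text."""
    prefix = " " if text and text[0] == " " else ""
    suffix = " " if text and text[-1] == " " else ""
    return prefix, suffix, text.strip()

def convert_inline_model(text, markup="**"):
    """Wrap text in markup; emit nothing if only whitespace was inside the tag."""
    prefix, suffix, stripped = chomp_model(text)
    if not stripped:
        # Whitespace-only content is dropped here, prefix and suffix included.
        return ""
    return f"{prefix}{markup}{stripped}{markup}{suffix}"

# "further<strong> </strong>reference" reduces to three pieces joined together.
result = "further" + convert_inline_model(" ") + "reference"
print(result)  # furtherreference
```

In this model the early return is where the space is lost, which is why the workaround below intercepts whitespace-only text before the original converter runs.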

Workaround

Until the markdownify fix is released, users can apply a monkey-patch before importing markitdown:

import markdownify

_original_convert_strong = markdownify.MarkdownConverter.convert_strong
_original_convert_b = markdownify.MarkdownConverter.convert_b
_original_convert_em = markdownify.MarkdownConverter.convert_em
_original_convert_i = markdownify.MarkdownConverter.convert_i

def _make_whitespace_preserving_converter(original_method):
    def wrapper(self, el, text, *args, **kwargs):
        # Whitespace-only content: return it unchanged instead of letting it be dropped.
        if text and not text.strip():
            return text
        # Anything else goes through the original converter.
        return original_method(self, el, text, *args, **kwargs)
    return wrapper

markdownify.MarkdownConverter.convert_strong = _make_whitespace_preserving_converter(_original_convert_strong)
markdownify.MarkdownConverter.convert_b = _make_whitespace_preserving_converter(_original_convert_b)
markdownify.MarkdownConverter.convert_em = _make_whitespace_preserving_converter(_original_convert_em)
markdownify.MarkdownConverter.convert_i = _make_whitespace_preserving_converter(_original_convert_i)

from markitdown import MarkItDown  # Import after patch
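
To sanity-check the wrapper logic without installing anything, the same pattern can be exercised against a stand-in class. This is a sketch only: `StubConverter` is a hypothetical substitute for markdownify.MarkdownConverter and merely imitates the dropping behaviour, so it says nothing about how a particular markdownify version behaves.

```python
class StubConverter:
    """Hypothetical stand-in that drops whitespace-only inline content."""

    def convert_strong(self, el, text, *args, **kwargs):
        stripped = text.strip()
        return f"**{stripped}**" if stripped else ""

def make_whitespace_preserving(original_method):
    """Same wrapper pattern as the workaround above."""
    def wrapper(self, el, text, *args, **kwargs):
        if text and not text.strip():
            return text
        return original_method(self, el, text, *args, **kwargs)
    return wrapper

# Behaviour before patching: the space is dropped.
unpatched = StubConverter().convert_strong(None, " ")

# Patch the class attribute, as the workaround does for MarkdownConverter.
StubConverter.convert_strong = make_whitespace_preserving(StubConverter.convert_strong)

converter = StubConverter()
patched_space = converter.convert_strong(None, " ")    # whitespace passes through
patched_bold = converter.convert_strong(None, "bold")  # normal text is still wrapped

print(repr(unpatched), repr(patched_space), repr(patched_bold))
```

Note that the wrapper returns the whitespace without any markup, so a bold space becomes a plain space; for Markdown output that is usually the desired result.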

Suggested Action

Once markdownify releases the fix, consider raising markitdown's minimum markdownify version so that the fix is included.
