
MarkItDown is a new library from Microsoft that turns almost any document (PDF, Word, PowerPoint, Excel files) into clean Markdown with one function call. Feed it a file, get back structured text with headings and tables intact.
So, if some or all of your project’s documentation lives in Word files, you can quickly convert them and build your documentation with a tool like MkDocs or Sphinx.
While you could use an LLM to parse these documents, you run the risk of hallucinations and of chewing through your tokens. The usual alternative is to stitch together PyPDF2 for PDFs, python-docx for Word, and openpyxl for Excel, then hand-write glue code to normalize whatever those three libraries hand back.
MarkItDown collapses all of that into one interface.
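For contrast, here is a rough sketch of what the hand-stitched approach tends to look like. The function names and the dispatch table are hypothetical, the third-party imports are deferred into each function so the module loads even when a library is missing, and the PDF helper assumes PyPDF2 2.x or later. Treat it as an illustration of the glue code involved, not a recommended implementation:

```python
from pathlib import Path


def pdf_to_text(path: str) -> str:
    # PyPDF2 returns plain text page by page; headings and tables are not preserved.
    from PyPDF2 import PdfReader

    reader = PdfReader(path)
    return "\n".join(page.extract_text() or "" for page in reader.pages)


def docx_to_text(path: str) -> str:
    # python-docx exposes body paragraphs; tables would need separate handling.
    from docx import Document

    return "\n".join(p.text for p in Document(path).paragraphs)


def xlsx_to_text(path: str) -> str:
    # openpyxl yields raw cell values row by row; formatting them is up to you.
    from openpyxl import load_workbook

    workbook = load_workbook(path, read_only=True, data_only=True)
    lines = []
    for sheet in workbook.worksheets:
        lines.append(sheet.title)
        for row in sheet.iter_rows(values_only=True):
            lines.append("\t".join("" if cell is None else str(cell) for cell in row))
    return "\n".join(lines)


# Hypothetical dispatch table: one entry (and one dependency) per format.
EXTRACTORS = {".pdf": pdf_to_text, ".docx": docx_to_text, ".xlsx": xlsx_to_text}


def extract_text(path: str) -> str:
    # Pick an extractor from the file extension; unknown types fail loudly.
    suffix = Path(path).suffix.lower()
    if suffix not in EXTRACTORS:
        raise ValueError(f"No extractor registered for {suffix!r}")
    return EXTRACTORS[suffix](path)
```

Each helper returns text in a slightly different shape, and every new format means another dependency plus another entry in the table.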
Installation and Usage
Install it with uv (see our uv guide if you need a refresher), then convert a file in a few lines:
from markitdown import MarkItDown
md = MarkItDown()
result = md.convert("quarterly_report.pdf")
print(result.text_content)
Instead of a flat text blob with the structure stripped out, you get something like this:
# Q3 Revenue Summary
| Region | Revenue | Growth |
|--------|---------|--------|
| East | $1.2M | 12% |
| West | $980K | 8% |
Total revenue grew 10% quarter over quarter...
Why Markdown as the output
Markdown keeps just enough structure (headings, tables, lists) to preserve meaning, but stays plain enough to paste directly into an LLM prompt or a RAG pipeline without a separate cleanup pass.
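To make the RAG point concrete, here is a minimal sketch of splitting converted Markdown into heading-based chunks using only the standard library. The `split_by_heading` helper is hypothetical and not part of MarkItDown, and it deliberately ignores edge cases such as `#` characters inside fenced code blocks:

```python
import re

# ATX-style headings: one to six '#' characters followed by whitespace.
HEADING = re.compile(r"^(#{1,6})\s+(.*)$")


def split_by_heading(markdown: str) -> list[dict]:
    """Group Markdown lines under the most recent heading."""
    chunks = []
    current = {"heading": "", "lines": []}
    for line in markdown.splitlines():
        match = HEADING.match(line)
        if match:
            # Close the previous chunk before starting a new one.
            if current["heading"] or current["lines"]:
                chunks.append(current)
            current = {"heading": match.group(2).strip(), "lines": []}
        else:
            current["lines"].append(line)
    if current["heading"] or current["lines"]:
        chunks.append(current)
    return [
        {"heading": c["heading"], "body": "\n".join(c["lines"]).strip()}
        for c in chunks
    ]


sample = """# Q3 Revenue Summary
| Region | Revenue | Growth |
|--------|---------|--------|
| East | $1.2M | 12% |
| West | $980K | 8% |
Total revenue grew 10% quarter over quarter...
"""

# One chunk: the heading plus the table and summary line as its body.
chunks = split_by_heading(sample)
```

Because the table stays in pipe syntax inside the chunk body, it can go into a prompt or an index as-is.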
One of MarkItDown's biggest advantages is that it normalizes every input format to the same output, so swapping file types doesn't mean swapping libraries or adding code:
result = md.convert("meeting_notes.docx")
print(result.text_content)
result = md.convert("sales_data.xlsx")
print(result.text_content)
When should you use it?
MarkItDown is worth reaching for when:
You're feeding mixed document types to an LLM and want one consistent input format
You're building a document search or indexing tool and need structure preserved
Users upload files in unpredictable formats and you need to normalize them before processing
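As a sketch of the upload scenario, the helper below converts a batch of files and records failures instead of stopping at the first bad one. `normalize_uploads`, `markitdown_converter`, and the `SUPPORTED_SUFFIXES` allow-list are all assumptions made for this example. The converter is passed in as a plain callable, so the only MarkItDown calls involved are the `convert()` and `text_content` shown earlier:

```python
from pathlib import Path
from typing import Callable, Iterable

# Assumed allow-list for this sketch, based on the formats named in this post.
SUPPORTED_SUFFIXES = {".pdf", ".docx", ".pptx", ".xlsx"}


def normalize_uploads(
    paths: Iterable[str],
    convert: Callable[[str], str],
) -> tuple[dict[str, str], dict[str, str]]:
    """Convert each upload to Markdown; collect failures instead of raising."""
    converted: dict[str, str] = {}
    failed: dict[str, str] = {}
    for path in paths:
        suffix = Path(path).suffix.lower()
        if suffix not in SUPPORTED_SUFFIXES:
            failed[path] = f"unsupported file type {suffix!r}"
            continue
        try:
            converted[path] = convert(path)
        except Exception as exc:  # one bad upload should not stop the batch
            failed[path] = f"{type(exc).__name__}: {exc}"
    return converted, failed


def markitdown_converter() -> Callable[[str], str]:
    """Build a converter backed by MarkItDown (imported lazily)."""
    from markitdown import MarkItDown

    md = MarkItDown()
    return lambda path: md.convert(path).text_content
```

With MarkItDown installed, a call might look like `converted, failed = normalize_uploads(upload_paths, markitdown_converter())`, where `upload_paths` is whatever list of file paths your application collects.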
» Note: it's built for machine consumption, not human readers. If you need pixel-perfect conversion, Pandoc still does that better.
Keep in mind: this package is brand new from Microsoft. It’s still being developed, and I’m sure we’ll see more features pop up down the line.
Happy coding!

