Sherman IT: How to convert a PDF file into a Markdown file or HTML file?

20 January 2026

To convert a PDF into a Markdown or HTML file on Ubuntu using the Python-based tool Marker, set up the environment, install the marker-pdf package, and then use the command-line interface.

Prerequisites

Before installing Marker, ensure your Ubuntu system meets the following prerequisites:

Python: Version 3.10 or higher is required.
PyTorch: Marker needs PyTorch to run, as it relies on deep learning models.
System Libraries: You may need additional system libraries for advanced features like OCR with ocrmypdf or tesseract.
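
The prerequisites above can be checked before installing anything. The following is a minimal sketch using only the Python standard library; the function name check_prerequisites is illustrative, and it assumes the optional OCR tools are exposed on PATH under the executable names tesseract and ocrmypdf.

```python
import shutil
import sys

# Minimum Python version stated in the prerequisites above.
MIN_PYTHON = (3, 10)

# Optional OCR executables (assumed to be on PATH under these names).
OPTIONAL_TOOLS = ("tesseract", "ocrmypdf")


def check_prerequisites():
    """Return a dict mapping each prerequisite to True (met) or False."""
    report = {"python>=3.10": sys.version_info >= MIN_PYTHON}
    for tool in OPTIONAL_TOOLS:
        # shutil.which returns None when the executable is not on PATH.
        report[tool] = shutil.which(tool) is not None
    return report


if __name__ == "__main__":
    for name, ok in check_prerequisites().items():
        print(f"{name}: {'found' if ok else 'missing'}")
```

A missing OCR tool is not necessarily a problem, since those libraries are only described as optional here.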

Step-by-Step Guide

Open your terminal on Ubuntu.

Install system dependencies (optional, but recommended for full functionality, including OCR):

bash

# Install required apt packages
sudo apt-get update
sudo apt-get install -y build-essential libssl-dev libffi-dev python3-dev
# Install tesseract and ghostscript related dependencies
sudo apt-get install -y libleptonica-dev libtesseract-dev pkg-config
sudo apt-get install -y ocrmypdf ghostscript

Create a virtual environment to avoid conflicts with other Python projects (recommended):
bash
python3 -m venv marker_env
source marker_env/bin/activate
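
To confirm that the virtual environment is actually active before installing packages, a quick check along these lines may help. It is a sketch that relies on the standard behaviour of venv-style environments, where sys.prefix differs from sys.base_prefix; the helper name in_virtualenv is illustrative.

```python
import sys


def in_virtualenv():
    """Return True when the interpreter runs inside a venv-style environment."""
    # Inside a venv, sys.prefix points at the environment directory while
    # sys.base_prefix still points at the base Python installation.
    return sys.prefix != sys.base_prefix


if __name__ == "__main__":
    print("virtual environment active:", in_virtualenv())
    print("interpreter prefix:", sys.prefix)
```

If this reports False after activation, the shell is probably still using the system python3 rather than the one in marker_env.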
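Install PyTorch by following the instructions on the official PyTorch website for your specific system configuration (CPU or GPU). On Linux, a plain pip install torch typically pulls the CUDA-enabled build, so a typical CPU-only installation command looks like:
bash
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cpu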
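
After this step it can be useful to verify which PyTorch build ended up in the environment. The sketch below imports torch lazily, so it also runs (and reports the absence) when PyTorch is not installed; the function name describe_torch is illustrative.

```python
import importlib.util


def describe_torch():
    """Return a short description of the installed PyTorch build, if any."""
    if importlib.util.find_spec("torch") is None:
        return "torch is not installed"
    import torch  # imported lazily so the script also runs without PyTorch

    # torch.cuda.is_available() is False for CPU-only builds or when no GPU is usable.
    device = "cuda" if torch.cuda.is_available() else "cpu"
    return f"torch {torch.__version__} (device: {device})"


if __name__ == "__main__":
    print(describe_torch())
```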

Install Marker using pip. You can install the basic PDF package or the full version for other document types:

bash

# For PDF conversion only
pip install marker-pdf

# For full functionality (PDFs, images, etc.)
# pip install 'marker-pdf[full]'

Convert a PDF file to Markdown using the command-line tool marker_single. To produce HTML instead, pass --output_format html.

bash

marker_single /path/to/file.pdf

Useful options:

--output_format [markdown|json|html|chunks]: Specify the format for the output results.
--output_dir PATH: Directory where output files will be saved. Defaults to the value specified in settings.OUTPUT_DIR.
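
When the conversion has to be run from a script, the command can be assembled and launched with subprocess. This is a minimal sketch that only uses the two options listed above; the helper names build_marker_command and convert, their default arguments, and the example paths are illustrative and not part of Marker itself.

```python
import shutil
import subprocess

# Output formats listed for --output_format above.
OUTPUT_FORMATS = ("markdown", "json", "html", "chunks")


def build_marker_command(pdf_path, output_format="markdown", output_dir=None):
    """Build the marker_single argument list for one PDF file."""
    if output_format not in OUTPUT_FORMATS:
        raise ValueError(f"unsupported output format: {output_format!r}")
    command = ["marker_single", str(pdf_path), "--output_format", output_format]
    if output_dir is not None:
        command += ["--output_dir", str(output_dir)]
    return command


def convert(pdf_path, output_format="markdown", output_dir=None):
    """Run marker_single and raise if it is missing or exits with an error."""
    if shutil.which("marker_single") is None:
        raise RuntimeError("marker_single not found; is the virtual environment active?")
    subprocess.run(build_marker_command(pdf_path, output_format, output_dir), check=True)


if __name__ == "__main__":
    # Placeholder paths; replace them with real ones before calling convert().
    print(" ".join(build_marker_command("/path/to/file.pdf", "html", "/path/to/output")))
```

Keeping the command construction separate from the call that runs it makes it possible to inspect or log the exact command line without Marker being installed.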

https://github.com/datalab-to/marker
