20 January 2026

How to convert a PDF file into a Markdown file or HTML file?

To convert a PDF into Markdown or HTML on Ubuntu with Marker, a Python-based tool, you set up the environment, install the
marker-pdf package, and then run its command-line interface.
Prerequisites
Before installing Marker, ensure your Ubuntu system meets the following prerequisites:
  • Python: Version 3.10 or higher is required.
  • PyTorch: Marker needs PyTorch to run, as it relies on deep learning models.
  • System Libraries: You may need additional system libraries for advanced features like OCR with ocrmypdf or tesseract. 
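
The Python requirement above can be checked before installing anything. The following is a minimal sketch, assuming GNU sort (standard on Ubuntu) for version comparison; the helper name version_ge is made up for this example and is not part of Marker.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of Marker): succeed if version $1 >= version $2.
version_ge() {
    # sort -V orders version strings numerically by component;
    # if the required minimum sorts first, the installed version is new enough.
    [ "$(printf '%s\n%s\n' "$2" "$1" | sort -V | head -n 1)" = "$2" ]
}

if command -v python3 >/dev/null 2>&1; then
    # "python3 --version" prints something like "Python 3.12.3"; keep the number.
    py_version="$(python3 --version 2>&1 | awk '{print $2}')"
    if version_ge "$py_version" "3.10"; then
        echo "Python $py_version meets the 3.10 minimum"
    else
        echo "Python $py_version is older than the required 3.10"
    fi
else
    echo "python3 was not found on PATH"
fi
```

A plain string comparison would rank 3.9 above 3.10, which is why the sketch relies on sort -V instead.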
Step-by-Step Guide
  1. Open your terminal on Ubuntu.
  2. Install system dependencies (optional, but recommended for full functionality, including OCR):
    bash
    # Install required apt packages
    sudo apt-get update
    sudo apt-get install -y build-essential libssl-dev libffi-dev python3-dev
    # Install tesseract and ghostscript related dependencies
    sudo apt-get install -y libleptonica-dev libtesseract-dev pkg-config
    sudo apt-get install -y ocrmypdf ghostscript
    
  3. Create a virtual environment to avoid conflicts with other Python projects (recommended):
    bash
    python3 -m venv marker_env
    source marker_env/bin/activate
    
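  4. Install PyTorch by following the instructions on the official PyTorch website for your specific system configuration (CPU or GPU). On Linux, a plain "pip install torch" typically pulls a CUDA-enabled build, so a CPU-only installation points pip at the PyTorch CPU wheel index:
    bash
    pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cpu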
    
  5. Install Marker using pip. You can install the basic PDF package or the full version for other document types:
    bash
    # For PDF conversion only
    pip install marker-pdf
    
    # For full functionality (PDFs, images, etc.)
    # pip install 'marker-pdf[full]'
    
  6. Convert a PDF file using the command-line tool marker_single. Without extra options, the output is Markdown:

    $ marker_single /path/to/file.pdf

    Two useful options:

    --output_format [markdown|json|html|chunks]: Specify the format for the output results. Use html to get an HTML file instead of Markdown.
    --output_dir PATH: Directory where output files will be saved. Defaults to the value specified in settings.OUTPUT_DIR.

    For the full list of options, see the project page: https://github.com/datalab-to/marker
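
To show how the two options listed in step 6 fit together, here is a small wrapper sketch around marker_single. It uses only the --output_format and --output_dir flags mentioned above. The function names convert_pdf and convert_dir and the default folder ./marker_output are assumptions made for this example, not part of Marker itself.

```shell
#!/usr/bin/env bash
# Hypothetical wrapper (not part of Marker).
# Usage: convert_pdf FILE.pdf [FORMAT] [OUTPUT_DIR]
convert_pdf() {
    local pdf_file="$1"
    local format="${2:-markdown}"
    local out_dir="${3:-./marker_output}"   # assumed default, chosen for this sketch

    # Accept only the formats listed for --output_format.
    case "$format" in
        markdown|json|html|chunks) ;;
        *) echo "unsupported format: $format" >&2; return 2 ;;
    esac

    if [ ! -f "$pdf_file" ]; then
        echo "no such file: $pdf_file" >&2
        return 1
    fi

    # Fail early if Marker is not installed or the virtual environment is inactive.
    if ! command -v marker_single >/dev/null 2>&1; then
        echo "marker_single not found; activate marker_env first" >&2
        return 127
    fi

    mkdir -p "$out_dir"
    marker_single "$pdf_file" --output_format "$format" --output_dir "$out_dir"
}

# Convert every PDF in a folder, one marker_single call per file.
# Usage: convert_dir INPUT_DIR [FORMAT] [OUTPUT_DIR]
convert_dir() {
    local input_dir="$1"
    local f
    for f in "$input_dir"/*.pdf; do
        [ -f "$f" ] || continue          # skip when the glob matches nothing
        convert_pdf "$f" "$2" "$3" || echo "failed: $f" >&2
    done
}
```

With the functions loaded in the shell, calling convert_pdf with /path/to/file.pdf, html, and an output folder of your choice should produce HTML output, while passing markdown (or omitting the format) gives Markdown. Calling marker_single once per file keeps the sketch simple, although the models are then reloaded for every document.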
