convert docx labnotebook to logseq


I kept a lot of lab notes in Word documents. For each day I wrote notes, I used a header with the date and put the text below it. Recently, I switched to taking notes in Logseq, so I wanted to convert all the existing notes into markdown files.

There are a few things that make this complicated:

pandoc can do the conversion, but it only writes a single markdown file, while I want one file per day
pandoc also writes all images into the same folder with names like image1.png, so the names clash as soon as you run it on multiple files

So, here are some scripts to work around this.

First, we use a Lua filter to rename the image files attached to the docx to the SHA1 of their content (a function that is luckily supplied with pandoc!):

-- Basic concept from https://stackoverflow.com/a/65389719/446140
-- CC BY-SA 4.0 by tarleb
--
-- Adjusted by reox, released as CC BY-SA 4.0
local mediabag = require 'pandoc.mediabag'
-- Stores the replaced file names for lookup
local replacements = {}

-- Get the extension of a file
function extension(path)
  return path:match("^.+(%..+)$")
end
-- Get the directory name of a file
function dirname(path)
    return path:match("(.*[/\\])")
end

-- Delete image files and re-insert under new name
-- Uses the SHA1 of the content of the file to rename the image path
for fp, mt, contents in mediabag.items() do
  mediabag.delete(fp)
  local sha1 = pandoc.utils.sha1(contents)
  -- Stores the file in the same directory as the original
  --local new_fp = dirname(fp) .. "/image_" .. sha1 .. extension(fp)
  -- Stores them in a folder called assets
  local new_fp = "assets" .. "/image_" .. sha1 .. extension(fp)
  replacements[fp] = new_fp
  mediabag.insert(new_fp, mt, contents)
end

-- Adjust the path to the image file
function Image (img)
  local new_src = replacements[img.src]
  -- Leave images that are not in the mediabag (e.g. external links) untouched
  if not new_src then
    return img
  end
  -- Use this to simply replace
  -- img.src = new_src
  -- Move into relative folder
  img.src = "../" .. new_src
  -- Set a caption, similar to logseq
  img.caption = "image.png"
  -- Remove attributes such as width and height
  img.attr = {}
  return img
end

The filter has a few extra quirks: it removes all attributes from each image, sets the caption to image.png, and moves the files into a folder named assets, which is the layout Logseq uses.
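To see why naming by content hash avoids clashes between documents, the same naming scheme can be sketched in plain shell with sha1sum from coreutils. This is only an illustration: the helper name hashed_name and the sample files are made up, and the filter itself uses pandoc.utils.sha1 rather than sha1sum.

```shell
#!/bin/sh
# Hypothetical helper (not part of the filter): print the path an image
# would get under the naming scheme assets/image_<sha1>.<ext>
hashed_name () {
    file=$1
    # sha1sum prints "<hash>  <filename>"; keep only the hash
    hash=$(sha1sum "$file" | cut -d' ' -f1)
    # Everything after the last dot is treated as the extension
    ext=${file##*.}
    printf 'assets/image_%s.%s\n' "$hash" "$ext"
}

# Two made-up "images" with the same file name but different content,
# as they would come out of two different docx files
tmp=$(mktemp -d)
mkdir "$tmp/doc1" "$tmp/doc2"
printf 'first picture' > "$tmp/doc1/image1.png"
printf 'second picture' > "$tmp/doc2/image1.png"

hashed_name "$tmp/doc1/image1.png"
hashed_name "$tmp/doc2/image1.png"
```

The two calls should print different paths even though both inputs are called image1.png, while identical content always maps to the same path, so an image that appears in several documents is stored only once.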

Now, with the filter saved as mediabag_filter.lua, we can convert the docx to markdown using pandoc:

pandoc -f docx -t markdown notebook.docx -o full_file.md --extract-media=. --lua-filter=mediabag_filter.lua

For the splitting part, we can use the tool csplit, which is part of coreutils. Basically, we tell it to split the file at every occurrence of an H3:

 csplit --prefix='journals/journal_' --suffix-format='%03d.md' full_file.md '%^### %' '{1}' '/^### /' '{*}'

An additional adjustment is the leading '%^### %' pattern: it skips everything before the first H3 instead of writing it to a file.
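The splitting can be tried on a small stand-in file before running it on real notes. The following is a minimal sketch, assuming GNU csplit; the sample content is made up, and it uses a simplified variant of the call above that leaves out the '{1}' repeat count, so the skip pattern is applied exactly once.

```shell
#!/bin/sh
# Build a tiny stand-in for full_file.md (made-up content)
tmp=$(mktemp -d)
cd "$tmp" || exit 1
cat > sample.md <<'EOF'
Preamble that belongs to no day
### 01.02.2023
first day
### 02.02.2023
second day
EOF

# %regex% discards everything before the first H3,
# /regex/ followed by {*} then starts a new file at every further H3
csplit --quiet --prefix='journal_' --suffix-format='%03d.md' \
    sample.md '%^### %' '/^### /' '{*}'

# Show the resulting files and the first line of the first one
ls journal_*.md
head -n 1 journal_000.md
```

With this input, csplit should leave two files, journal_000.md and journal_001.md, each starting with its H3 line, and the preamble should not end up in either of them.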

Everything can be put into a small script that converts the docx files, puts the results into the right folder and renames them to journal file names that Logseq understands:

#!/bin/bash

convert_file () {
    pandoc -f docx -t markdown "$1" -o full_file.md --extract-media=. --lua-filter=mediabag_filter.lua
    # Remove H1 and H2 from file
    sed -e 's/^[#]\{1,2\} .*$//' full_file.md > filtered.md
    # Remove empty headers. This seems to happen when we have two lines of H3 in the
    # file by accident
    sed -i -e 's/^#\+\s\+$//' filtered.md

    # Split file
    mkdir -p journals
    # Skip any text up to the first H3, then split at each H3
    csplit --prefix='journals/journal_' --suffix-format='%03d.md' filtered.md '%^### %' '{1}' '/^### /' '{*}'

    # Rename files
    for file in journals/journal_*.md; do
        h=$(head -n 1 "$file")

        if ! [ "x${h:0:4}" = "x### " ]; then
            echo "$file does not start with header!"
            exit 1
        fi

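        # The header is expected to look like "### DD.MM.YYYY"
        # (fixed offsets, any single-character separator)
        Y=${h:10:4}
        M=${h:7:2}
        D=${h:4:2}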
        dir=$(dirname "$file")

        echo "$file --> $dir/${Y}_${M}_${D}.md"

        # Remove header line, not required by logseq
        sed -i -e "1d" "$file"

        # Shift headers by -3
        # Apply table format, that logseq understands
        pandoc -f markdown -t markdown-simple_tables-grid_tables-multiline_tables+pipe_tables "$file" --shift-heading-level-by=-3 -o "$dir/${Y}_${M}_${D}.md"
        rm "$file"
    done

    # Cleanup
    rm full_file.md
    rm filtered.md
}


for arg in "$@"; do
    echo "Converting $arg ..."
    convert_file "$arg"
done
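The renaming step reads the date from fixed character offsets (4, 7 and 10), which implies headers of the form "### DD.MM.YYYY". A slightly more defensive variant could check the pattern before building the file name. The sketch below is one way to do that with sed; the helper name journal_name is hypothetical and not part of the script above.

```shell
#!/bin/sh
# Hypothetical helper: turn a "### DD.MM.YYYY" header into YYYY_MM_DD.
# Prints nothing and returns non-zero if the header does not look like a date.
journal_name () {
    # Capture day, month and year; any single non-digit separator is accepted
    name=$(printf '%s\n' "$1" |
        sed -n 's/^### \([0-9]\{2\}\)[^0-9]\([0-9]\{2\}\)[^0-9]\([0-9]\{4\}\).*$/\3_\2_\1/p')
    [ -n "$name" ] && printf '%s\n' "$name"
}

journal_name '### 24.12.2021'    # prints 2021_12_24
journal_name '### Some other heading' || echo "not a date header"
```

Compared with the plain substring approach, a header that is not a date is rejected instead of producing a garbled file name.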

The one thing that does not work properly is automatically converting the files into the list format that Logseq uses. That is tricky to do, because we cannot simply turn every paragraph into a block.

Furthermore, some equations do not render properly and some other text looks terrible. However, this is only a start; you can probably improve this script quite a bit and adapt it to your needs.