I kept a lot of lab notes in Word documents. For each day I wrote notes, I used a header with the date and put the text below it. Recently, I switched to taking notes in Logseq, and I wanted to convert all the existing notes into markdown files.
There are a few things that make this complicated:
- pandoc can do this, but only writes a single markdown file
- pandoc further writes all images into the same folder with names like image1.png, which makes it very hard to run it on multiple files
So, here are some scripts to work around this.
First, we use a Lua filter to rename the image files attached to the docx to the SHA1 of their content (a hash function that is luckily supplied with pandoc!):
-- Basic concept from https://stackoverflow.com/a/65389719/446140
-- CC BY-SA 4.0 by tarleb
--
-- Adjusted by reox, released as CC BY-SA 4.0
local mediabag = require 'pandoc.mediabag'

-- Stores the replaced file names for lookup
local replacements = {}

-- Get the extension of a file (including the dot), or "" if there is none
local function extension(path)
  return path:match("^.+(%..+)$") or ""
end

-- Get the directory name of a file
local function dirname(path)
  return path:match("(.*[/\\])")
end

-- Delete image files and re-insert under new name
-- Uses the SHA1 of the content of the file to rename the image path
for fp, mt, contents in mediabag.items() do
  mediabag.delete(fp)
  local sha1 = pandoc.utils.sha1(contents)
  -- Stores the file in the same directory as specified
  --local new_fp = dirname(fp) .. "/image_" .. sha1 .. extension(fp)
  -- Stores them in a folder called assets
  local new_fp = "assets" .. "/image_" .. sha1 .. extension(fp)
  replacements[fp] = new_fp
  mediabag.insert(new_fp, mt, contents)
end

-- Adjust the path to the image file
function Image (img)
  local new_src = replacements[img.src]
  -- Leave images untouched that were not in the mediabag (e.g. external links)
  if not new_src then
    return nil
  end
  -- Use this to simply replace
  -- img.src = new_src
  -- Move into relative folder
  img.src = "../" .. new_src
  -- Set the caption (alt text), similar to logseq
  img.caption = { pandoc.Str("image.png") }
  -- Remove attributes such as width and height
  img.attr = pandoc.Attr()
  return img
end
The filter also contains a few Logseq-specific tweaks: all attributes are removed from the image, the caption is set to image.png, and the files are moved into a folder named assets, as Logseq does.
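To see why naming files by their content hash avoids clashes between documents, the same idea can be reproduced outside pandoc. The following is only a minimal shell sketch and not part of the filter: the helper hashed_name and the sample files are made up for illustration, and it assumes that sha1sum from coreutils is available.

```shell
#!/bin/sh
# Illustration only: mimic the filter's naming scheme with sha1sum.
# hashed_name is a hypothetical helper, not part of pandoc.
hashed_name() {
    # $1: path to an image file
    ext="${1##*.}"                          # text after the last dot
    sum=$(sha1sum "$1" | cut -d ' ' -f 1)   # hex digest of the file content
    printf 'assets/image_%s.%s\n' "$sum" "$ext"
}

workdir=$(mktemp -d)
# Two documents that both contain an "image1.png", with different content
mkdir -p "$workdir/doc_a" "$workdir/doc_b"
printf 'abc' > "$workdir/doc_a/image1.png"
printf 'xyz' > "$workdir/doc_b/image1.png"

name_a=$(hashed_name "$workdir/doc_a/image1.png")
name_b=$(hashed_name "$workdir/doc_b/image1.png")
echo "$name_a"
echo "$name_b"
```

Both source files are called image1.png, but the resulting names differ because the contents differ. The same picture pasted into several documents would map to a single file name.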
Now, with the filter saved as mediabag_filter.lua (the name used in the script further down), we can convert the docx to markdown using pandoc:
pandoc -f docx -t markdown notebook.docx -o full_file.md --extract-media=. --lua-filter=mediabag_filter.lua
For the splitting part, we can use the tool csplit, contained in coreutils.
Basically, we tell it to split the file at every occurrence of an H3, i.e. a line starting with "### ":
csplit --prefix='journals/journal_' --suffix-format='%03d.md' full_file.md '%^### %' '{1}' '/^### /' '{*}'
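An additional adjustment is the leading '%^### %' pattern: csplit skips everything up to a matching H3 without writing it to a file. Note that '{1}' repeats the preceding pattern once more, so the skip is applied twice and the first output file starts at the second H3; drop '{1}' if the first H3 section should be kept.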
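The splitting step can be tried on a tiny sample before running it on real notes. This is only a sketch: the sample text and dates are made up, and GNU csplit is assumed. Unlike the command above, it leaves out '{1}', so the first entry is kept.

```shell
#!/bin/sh
# Illustration only: split a tiny sample file at every H3 with csplit.
workdir=$(mktemp -d)
cd "$workdir" || exit 1
mkdir -p journals

# Hypothetical sample: a preamble followed by two dated entries
printf '%s\n' \
    'Some preamble text' \
    '' \
    '### 01.02.2023' \
    'First entry' \
    '' \
    '### 02.02.2023' \
    'Second entry' > full_file.md

# '%^### %' skips (without writing) everything before the first H3,
# '/^### /' '{*}' then starts a new file at every following H3.
csplit --quiet --prefix='journals/journal_' --suffix-format='%03d.md' \
    full_file.md '%^### %' '/^### /' '{*}'

first=$(head -n 1 journals/journal_000.md)
second=$(head -n 1 journals/journal_001.md)
echo "$first"
echo "$second"
```

Each output file starts with its own H3 line, which is what the renaming step relies on. The preamble does not end up in any file.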
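Everything can be put into a small script that converts the docx files, puts the results into the right folder and renames them into journal files that Logseq understands. The substring offsets used for the renaming assume that every H3 holds a date in day, month, year order with single-character separators, for example "### 24.12.2022":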
#!/bin/bash

convert_file () {
    pandoc -f docx -t markdown "$1" -o full_file.md --extract-media=. --lua-filter=mediabag_filter.lua

    # Remove H1 and H2 from file
    sed -e 's/^[#]\{1,2\} .*$//' full_file.md > filtered.md
    # Remove empty headers. This seems to happen when we have two lines of H3 in the
    # file by accident
    sed -i -e 's/^#\+\s\+$//' filtered.md

    # Split file
    mkdir -p journals
    # Skip any text before the H3 headers, then split at each H3.
    # Note: '{1}' repeats the skip once, so the output starts at the second H3.
    csplit --prefix='journals/journal_' --suffix-format='%03d.md' filtered.md '%^### %' '{1}' '/^### /' '{*}'

    # Rename files
    for file in journals/journal_*.md; do
        h=$(head -n 1 "$file")
        if ! [ "x${h:0:4}" = "x### " ]; then
            echo "$file does not start with header!"
            exit 1
        fi
        # The offsets assume a header like "### DD.MM.YYYY"
        Y=${h:10:4}
        M=${h:7:2}
        D=${h:4:2}
        dir=$(dirname "$file")
        echo "$file --> $dir/${Y}_${M}_${D}.md"
        # Remove header line, not required by logseq
        sed -i -e "1d" "$file"
        # Shift headers up by 3 levels
        # Apply table format, that logseq understands
        pandoc -f markdown -t markdown-simple_tables-grid_tables-multiline_tables+pipe_tables "$file" --shift-heading-level-by=-3 -o "$dir/${Y}_${M}_${D}.md"
        rm "$file"
    done

    # Cleanup
    rm full_file.md
    rm filtered.md
}

for arg in "$@"; do
    echo "Converting $arg ..."
    convert_file "$arg"
done
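The fixed offsets in the renaming loop silently produce wrong file names if a header does not follow the expected layout. As a hedged alternative, and not what the script above does, the header could be parsed with a regular expression that yields nothing on a mismatch. The helper journal_name is hypothetical, and the layout "### DD?MM?YYYY" with any single non-digit separator is an assumption derived from the offsets in the script.

```shell
#!/bin/sh
# Illustration only: derive a Logseq journal file name from an H3 date header.
# journal_name is a hypothetical helper; the header layout "### DD?MM?YYYY"
# (any single non-digit separator) is an assumption based on the offsets above.
journal_name() {
    # Prints YYYY_MM_DD.md, or nothing if the header does not match
    printf '%s\n' "$1" | sed -n \
        's/^### \([0-9][0-9]\)[^0-9]\([0-9][0-9]\)[^0-9]\([0-9][0-9][0-9][0-9]\).*$/\3_\2_\1.md/p'
}

good=$(journal_name '### 24.12.2022')
bad=$(journal_name '### Meeting notes')
echo "good: $good"
echo "bad:  '$bad'"
```

An empty result can then be used to stop with an error message instead of writing a file with a garbled name.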
The only thing that does not work properly is automatically converting the files into the list format that Logseq uses. That is tricky to do, because we cannot simply use every paragraph as a block.
Furthermore, some equations do not work properly and some other text looks terrible. However, this is only a start, and you can probably improve this script quite a bit and adapt it to your needs.