If you have documentation in html but want to provide markdown too, what can you do? Ian Bruntlett describes how he used a shell script to automate the translation and what he learnt.
A filter is a program that reads its input stream (file descriptor 0, aka stdin, aka cin), modifies it, then writes the results to its output stream (file descriptor 1, aka stdout, aka cout). Errors get written to the error stream (file descriptor 2, aka stderr, aka cerr).
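As a minimal sketch of the idea (my own example, not part of the script discussed below), here is a filter that upper-cases its input; the name upcase is made up:

```shell
#!/usr/bin/env bash
# upcase - a minimal filter: read stdin, write the transformed
# text to stdout, and report any problem on stderr.
if ! tr 'a-z' 'A-Z'; then
    echo "upcase: translation failed" >&2
    exit 1
fi
```

Running `echo hello | ./upcase` prints `HELLO`; like any filter, it can sit in the middle of a pipeline.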
I have been maintaining some HTML pages on I.T.-related sources of information, ‘TECH-Manuals’, for personal use for quite some time. I wanted to put one of those pages online. GitHub seemed like a good idea and I uploaded it there [Bruntlett-2]. Unfortunately, when you look at HTML pages on GitHub, you see the raw HTML; apparently you have to upload Markdown (.md) files instead. After searching the Ubuntu package repositories with Synaptic Package Manager, I discovered that the html2markdown command would do what I wanted. With some caution I performed this command to create tm-free-software.md, ready to upload to GitHub [Bruntlett-2]:
$ html2markdown < tm-free-software.html \
  > tm-free-software.md
That is all well and good but it relies on me not getting the input and output filenames wrong. As a filter, it works on one file only. That is OK but I also wanted the ability to use wildcards (shell globbing) to save typing and to make it easier to use with the find command.
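To give an idea of the intended usage (the script name irb-html2markdown comes from Listing 7 later in this article; the directory layout here is invented for illustration):

```shell
# Shell globbing: the shell expands *.html and the script
# loops over the resulting filenames.
irb-html2markdown *.html

# find: convert every HTML file below the current directory,
# passing batches of filenames to the script.
find . -name '*.html' -exec irb-html2markdown {} +
```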
The resulting shell script is modular – and, as a bonus, if you have another filter that you want to run on multiple files, all you have to do is rename the shell script, edit and rename the function perform_html2markdown, and edit strip_extension’s code to accept the new input filename extension(s).
The first bit of executable code checks that at least one parameter has been passed:
if [ $# -lt 1 ]; then
  Usage >&2
  exit 11
fi
As I thought was conventional, the error message got sent to stderr (it turns out I was wrong) and then the script exits with a status code of 11, so as not to collide with return values of html2markdown itself. See [Cooper14] for more information.
The next bit of script loops over the shell script’s parameters. As long as the script has parameters, it invokes my function perform_html2markdown with a single input filename. I had to name it something different to html2markdown so, in an obscure reference to COBOL’s PERFORM statement, I named the function perform_html2markdown.
The for loop is used to execute perform_html2markdown until it fails or all input files have been processed.
So, how does perform_html2markdown work? (Listing 1.)
function perform_html2markdown()
{
  local input_html_file output_markdown_file
  input_html_file="$1"
  if ! output_markdown_file=$(strip_extension \
    "$input_html_file").md ; then
    echo "Bad HTML filename: $input_html_file";
    return 10; # unsupported or bad filename
               # extension
  fi
  echo Translating "$input_html_file" to \
    "$output_markdown_file"
  html2markdown --no-skip-internal-links < \
    "$input_html_file" > "$output_markdown_file"
}

Listing 1
In the interests of modularity and ease of development and maintenance, I use functions in bash, making sure that I declare working variables as local. Unlike in C++, variables assigned in a bash function are global by default. That has caused me problems in the past.
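A tiny experiment (my own, not part of the script) shows the difference that local makes:

```shell
#!/usr/bin/env bash
# In bash, a variable assigned inside a function is global by
# default; 'local' restricts it to the function's scope.
leaky() { x="clobbered"; }
tidy()  { local x="safe inside"; }

x="original"; leaky; echo "$x"   # prints: clobbered
x="original"; tidy;  echo "$x"   # prints: original
```

Without local, the first call silently overwrites the caller’s variable.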
The function perform_html2markdown knows it has been passed a parameter and declares two local variables – input_html_file and output_markdown_file. The input_html_file variable is, as is to be expected, the name of the input file. I could have used $1 instead but I decided to name it to make future maintenance work that little bit nicer. The output_markdown_file variable is set by calling another function:
function strip_extension
{
  local destination_file
  # cater for .html source
  destination_file=$(basename "$1" .html)
  # cater for HTML source
  destination_file=$(basename "$destination_file" .HTML)
  echo "$destination_file"
}
This version contained a bug – strip_extension also strips leading directory components! So, I paused writing this article to learn more, referring to [Newham05] and [GNU].
Aided by the power of functions and help from accu-general [ACCU] I came up with Listing 2.
# remove .HTML or .html from parameter 1 and then
# output/return that result.
function strip_extension
{
  declare -i length
  local length filename \
    filename_stub dot_extension
  filename=$1
  length=${#filename}
  if [ $length -lt 6 ]; then
    return 2 # given filename too short
  fi
  dot_extension=${filename:$length-5:5}
  if [ "$dot_extension" == ".html" ] ||
     [ "$dot_extension" == ".HTML" ]; then
    filename_stub=${filename:0:-5}
    echo "$filename_stub"
  else
    echo
    return 1; # unsupported or bad filename
              # extension
  fi
}

Listing 2
Confident that I had fixed the problem, I asked on accu-general for comments. I received some very interesting implementations of the function strip_extension.
Sven opted for two approaches – one using basename (Listing 3). This works… after a fashion. If the user had provided a directory name, that directory name would be lost: basename turned dir-of-html/blank.html into blank.html. The script was in use by me for quite some time before I discovered the bug and, consequently, paused writing this article.
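The behaviour is easy to reproduce at a prompt (the file need not exist – basename works on the characters alone):

```shell
# basename strips the directory part as well as the suffix, so any
# path information in the input filename is silently discarded:
basename "dir-of-html/blank.html" .html   # prints: blank

# Appending .md to that result means the output file lands in the
# current directory, not next to its source file.
```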
function strip_extension
{
  filename=$1
  case "$filename" in
    *.html)
      stub=$(basename "$filename" .html)
      ;;
    *.HTML)
      stub=$(basename "$filename" .HTML)
      ;;
    *)
      echo "Not an HTML file name."
      return 1
      ;;
  esac
  echo "$stub"
}

Listing 3
Sven’s second approach uses sed – a stream-editing tool that I have a little experience with. See Listing 4. This works. By running sed, an external command, it is slower, but it has the benefit of being correct!
function strip_extension
{
  filename=$1
  case "$filename" in
    *.html | *.HTML)
      stub=$(echo "$filename" \
        | sed -e 's/\.[^.]*$//')
      ;;
    *)
      echo "Not an HTML file name."
      return 1
      ;;
  esac
  echo "$stub"
}

Listing 4
Hans Vredeveld came up with another solution (see Listing 5). This uses bash’s pattern-matching operators, which are minimally documented in [Newham05] and [GNU]. To quote the former:
${variable%pattern}
If the pattern matches the end of the variable’s value, delete the shortest part that matches and return the rest.
function strip_extension
{
  filename=$1
  case "$filename" in
    *.html | *.HTML)
      stub=$(echo "${filename%.*}")
      ;;
    *)
      echo "Not an HTML file name."
      return 1
      ;;
  esac
  echo "$stub"
}

Listing 5
As with other aspects of shell usage, I had to experiment with this to get a better idea of it. The expansion names the variable (note the absence of a preceding $), followed by a % sign to tell bash what to do, and the pattern is the thing to delete. In this case, strip_extension uses a pattern of .*, which means match a dot followed by any number of characters. So, not only does it work for .html or .HTML, it works for .odt etc. The filename does not have to exist – bash is just working with characters. Here are some examples:
$ filename=dir.html/blank.html
$ echo ${filename%.*}
dir.html/blank
$ filename=dir.html/blank2.html
$ echo ${filename%tml}
dir.html/blank2.h
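For completeness (my own addition, not raised in the discussion), bash also offers %%, which deletes the longest match rather than the shortest – the difference only shows when the value contains more than one dot:

```shell
# % deletes the shortest match from the end of the value,
# %% deletes the longest match:
filename=archive.tar.gz
echo "${filename%.*}"    # prints: archive.tar
echo "${filename%%.*}"   # prints: archive
```

For strip_extension the two behave identically, because the case statement has already guaranteed exactly one trailing .html or .HTML.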
This was then further refined by Sven to avoid using echo:
> stub=$(echo "${filename%.*}")
You don’t need command substitution here: the following should be enough:
> stub="${filename%.*}"
Also discussed by Sven was my use of a while loop to iterate over the command-line parameters:
while [ "$1" != "" ]; do
  if ! perform_html2markdown "$1" ; then
    exit $?
  fi
  shift
done
This is a hang-over from my MS-DOS days, where to access more than 9 (I think) parameters, you had to use the shift command. Another quirk is that exit $? can be replaced with exit.
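A quick demonstration (my own) of what shift costs you – after a shift-based loop, the positional parameters really are gone:

```shell
#!/usr/bin/env bash
# shift consumes the positional parameters as it goes, so after a
# shift-based loop nothing is left for later code to inspect.
set -- one two three        # simulate three command-line parameters
while [ "$1" != "" ]; do
    shift
done
echo "parameters remaining: $#"   # prints: parameters remaining: 0
```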
He proposed:
for f in "$@"; do
  if ! perform_html2markdown "$f" ; then
    exit $?
  fi
done
This does the same job without making processed parameters unavailable. The use of $@ and $* is… subtle. To quote Cameron Newham and Bill Rosenblatt [Newham05]:
“$*” is a single string that consists of all of the positional parameters, separated by the first character in the value of the environment variable IFS (internal field separator), which is a space, TAB, and NEWLINE by default. On the other hand, “$@” is equal to “$1” “$2”... “$N”, where N is the number of positional parameters. That is, it’s equal to N separate double-quoted strings, which are separated by spaces. If there are no positional parameters, “$@” expands to nothing. We’ll explore the ramifications of this difference in a little while.
Dabbling further, I used Newham and Rosenblatt’s [Newham05] function countargs (see Listing 6).
#!/usr/bin/env bash
# experimenting from Learning the bash shell,
# chapter 4, page 90
function countargs
{
  echo "$# args."
}
IFS=,
echo -n '$* : '
countargs "$*"
echo "$*"
echo -n '$@ : '
countargs "$@"
echo "$@"

Listing 6
When run, this illustrates the difference between $* and $@.
$ ./countargs Hello World
$* : 1 args.
Hello,World
$@ : 2 args.
Hello World
Note that I set the IFS global variable to a single comma. This is to illustrate how "$*" uses the first character of IFS (which defaults to a space) as the separator when joining the parameters into a single string.
The main loop which iterated over the HTML filenames looked like this:
while [ "$1" != "" ]; do
  if ! perform_html2markdown "$1" ; then
    exit $?
  fi
  shift
done
which did not preserve the return code (aka $?) of perform_html2markdown. It was possible to copy the value of $? into a variable immediately and return that, but I felt that was clunky. Taking into account Sven’s recommendation to use for rather than while, and fixing the bug, I came up with this:
for f in "$@"; do
  perform_html2markdown "$f" || exit
done
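Because a bare exit reuses the status of the last command executed, this one-liner propagates perform_html2markdown’s own return code. A stand-in function (my own, deliberately failing with status 7) shows the mechanism in isolation:

```shell
#!/usr/bin/env bash
# 'cmd || exit' exits with cmd's own status: the bare 'exit' reuses
# the status of the last command, which is the failing cmd itself.
fails_with_7() { return 7; }
( fails_with_7 || exit )    # subshell exits with the function's status
echo "exit status: $?"      # prints: exit status: 7
```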
The current version of the code looks like Listing 7.
#!/usr/bin/env bash
# Name : irb-html2markdown
# Purpose : To run the html2markdown command with
# less chance of making a silly typing mistake
# (c) Ian Bruntlett
#
# Changelog removed for brevity. See
# [Bruntlett-1] for complete version...

function Usage
{
cat <<END-OF-USAGE-MESSAGE
Usage: $0 name-of-html-file-1 name-of-html-file-2 name-of-html-file-etc
For each given html filename, convert the HTML file into markdown,
writing the results to a file with a similar name of the origin - the
only change being the results filename has .md at the end and not .html
Note: Because this utility removes the HTML suffix from filenames, you
can use globbing to specify input files.
Return codes:
0  Success
1  Input file does not exist
10 Input filename not a .html or .HTML file
11 No parameters passed on command line
END-OF-USAGE-MESSAGE
}

# remove .HTML or .html from parameter 1 and then
# output/return that result.
# With thanks to accu-general posters: Mathias
# Gaunard and Hans Vredeveld and Sven
# See https://www.gnu.org/software/bash/manual/
# bash.html#Shell-Parameter-Expansion
# 10 - Input filename not an html or HTML file
# name
function strip_extension
{
  local filename stub
  filename=$1
  case "$filename" in
    *.html | *.HTML)
      stub="${filename%.*}"
      ;;
    *)
      echo "Not an HTML file name."
      return 10
      ;;
  esac
  echo "$stub"
}

function perform_html2markdown()
{
  local input_html_file output_markdown_file
  input_html_file="$1"
  if ! output_markdown_file=$(strip_extension \
    "$input_html_file").md ; then
    echo "Bad HTML filename: $input_html_file";
    return 10; # unsupported or bad filename
               # extension
  fi
  echo Translating "$input_html_file" to \
    "$output_markdown_file"
  html2markdown --no-skip-internal-links < \
    "$input_html_file" > "$output_markdown_file"
}

# main code here
if [ $# -lt 1 ]; then
  Usage >&2
  exit 11 # error. need at least 1 parameter
fi

for f in "$@"; do
  perform_html2markdown "$f" || exit
done

Listing 7
References
[ACCU] https://accu.org/faq/mailing-lists-faq/#accu-general – in particular, Mathias, Hans, and Sven.
[Bruntlett-1] GitHub repository (bash scripts): https://github.com/ian-bruntlett/studies/tree/main/bash
[Bruntlett-2] TECH-Manuals: https://github.com/ian-bruntlett/TECH-Manuals
[Cooper14] Mendel Cooper (2014) Advanced Bash Scripting Guide: Appendix E, ‘Exit codes with special meanings’, available at: https://tldp.org/LDP/abs/html/exitcodes.html
[GNU] The GNU Bash Reference Manual, for Bash, Version 5.3, last updated 18 May 2025, available at: https://www.gnu.org/software/bash/manual/bash.html
[Newham05] Cameron Newham and Bill Rosenblatt (2005) Learning the bash Shell, published by O’Reilly.
Ian Bruntlett is a keen reader of software development books. He has promised himself a long stint at dealing with C++, once he has got to grips with Git.