If you have documentation in html but want to provide markdown too, what can you do? Ian Bruntlett describes how he used a shell script to automate the translation and what he learnt.
A filter is a program that reads its input stream (file descriptor 0, aka stdin, aka cin), modifies it, then writes the results to its output stream (file descriptor 1, aka stdout, aka cout). Errors are written to the error stream (file descriptor 2, aka stderr, aka cerr).
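As a minimal sketch of the idea (upcase is a made-up name, not part of the article's script), a one-line bash filter:

```shell
#!/usr/bin/env bash
# upcase - a minimal filter: reads stdin (fd 0),
# transforms it, writes stdout (fd 1); any errors
# would go to stderr (fd 2)
tr '[:lower:]' '[:upper:]'
```

Used as, for example, `echo hello | ./upcase`, which prints `HELLO`.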
I have been maintaining some HTML pages on I.T.-related sources of information, ‘TECH-Manuals’, for personal use for quite some time. I wanted to put one of those pages online. GitHub seemed like a good idea and I uploaded it there [Bruntlett-1]. Unfortunately, when you look at HTML pages on GitHub, you see the raw HTML. Apparently you have to upload Markdown (.md) files instead. After searching the Ubuntu package repositories with Synaptic Package Manager, I discovered that the html2markdown command would do what I wanted. Cautiously, I ran this command to create tm-free-software.md ready to upload to GitHub [Bruntlett-2]:
$ html2markdown < tm-free-software.html \
  > tm-free-software.md
That is all well and good but it relies on me not getting the input and output filenames wrong. As a filter, it works on one file only. That is OK but I also wanted the ability to use wildcards (shell globbing) to save typing and to make it easier to use with the find command.
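The point of accepting multiple parameters is that the shell, not the script, expands wildcards into a list of filenames before the script runs. A sketch of the intended usage (irb-html2markdown is the name the script acquires later in the article):

```shell
# The shell expands *.html into a list of names,
# so the script receives each filename as a parameter:
$ irb-html2markdown *.html

# find can hand batches of matching files to the script:
$ find . -name '*.html' -exec irb-html2markdown {} +
```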
The resulting shell script is modular – and, as a bonus, if you have another filter that you want to work on multiple files, all you have to do is rename the shell script, edit and rename function perform_html2markdown and edit strip_extension’s code to accept the new input filename extension(s).
The first bit of executable code checks that at least one parameter has been passed:
if [ $# -lt 1 ]; then
  Usage >&2
  exit 11
fi
As I thought was conventional, the error message got sent to stderr (it turns out I was wrong) and then the script exits with a status code of 11 so as not to collide with return values of html2markdown itself. See [Cooper14] for more information.
The next bit of script loops over the shell script’s parameters, invoking a function of mine with a single input filename each time round. The function needed a name different from html2markdown so, in an obscure reference to COBOL’s PERFORM statement, I named it perform_html2markdown.
The for loop is used to execute perform_html2markdown until it fails or all input files have been processed.
So, how does perform_html2markdown work? (Listing 1.)
function perform_html2markdown()
{
  local input_html_file output_markdown_file
  input_html_file="$1"
  if ! output_markdown_file=$(strip_extension "$input_html_file").md ; then
    echo "Bad HTML filename: $input_html_file";
    return 10; # unsupported or bad filename extension
  fi
  echo Translating "$input_html_file" to "$output_markdown_file"
  html2markdown --no-skip-internal-links < "$input_html_file" > "$output_markdown_file"
}
Listing 1
In the interests of modularity and ease of development and maintenance, I use functions in bash, making sure that I declare working variables as local. Unlike in C++, variables assigned in a bash function are global by default. That has caused me problems in the past.
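A small demonstration of the pitfall (the function and variable names here are illustrative, not from the script): without local, an assignment inside a function clobbers the caller’s variable of the same name.

```shell
#!/usr/bin/env bash
function leaky   { x="changed"; }         # no 'local' - writes the global x
function careful { local x; x="changed"; } # 'local' shadows the global x

x="original"
leaky
echo "$x"    # prints: changed

x="original"
careful
echo "$x"    # prints: original
```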
The function perform_html2markdown knows it has been passed a parameter and declares two local variables – input_html_file and output_markdown_file. The input_html_file variable is, as is to be expected, the name of the input file. I could have used $1 instead but I decided to name it to make future maintenance work that little bit nicer. The output_markdown_file variable is set by calling another function, strip_extension:
function strip_extension
{
  local destination_file
  # cater for .html source
  destination_file=$(basename "$1" .html)
  # cater for .HTML source
  destination_file=$(basename "$destination_file" .HTML)
  echo "$destination_file"
}
This version contained a bug – strip_extension also strips any leading directory components! So, I paused writing this article to learn more, referring to [Newham05] and [GNU].
Aided by the power of functions and help from accu-general [ACCU] I came up with Listing 2.
# remove .HTML or .html from parameter 1 and then
# output/return that result.
function strip_extension
{
  declare -i length
  local length filename filename_stub dot_extension
  filename=$1
  length=${#filename}
  if [ $length -lt 6 ]; then
    return 2 # given filename too short
  fi
  dot_extension=${filename:$length-5:5}
  if [ "$dot_extension" == ".html" ] ||
     [ "$dot_extension" == ".HTML" ]; then
    filename_stub=${filename:0:-5}
    echo "$filename_stub"
  else
    echo
    return 1; # unsupported or bad filename extension
  fi
}
Listing 2
Confident that I had fixed the problem, I asked on accu-general for comments. I received some very interesting implementations of the function strip_extension.
Sven opted for two approaches – one using basename (Listing 3). This works… after a fashion. If the user provides a directory name, that directory name is lost, so basename dir-of-html/blank.html becomes blank.html. My script had been in use for quite some time before I discovered this bug and paused writing this article.
function strip_extension
{
  filename=$1
  case "$filename" in
    *.html)
      stub=$(basename "$filename" .html)
      ;;
    *.HTML)
      stub=$(basename "$filename" .HTML)
      ;;
    *)
      echo "Not an HTML file name."
      return 1
      ;;
  esac
  echo "$stub"
}
Listing 3
Sven’s second approach uses sed – a stream-editing tool that I have a little experience with (Listing 4). This works. By running sed, an external command, it is slower, but it has the benefit of being correct!
function strip_extension
{
  filename=$1
  case "$filename" in
    *.html | *.HTML)
      stub=$(echo "$filename" | sed -e 's/\.[^.]*$//')
      ;;
    *)
      echo "Not an HTML file name."
      return 1
      ;;
  esac
  echo "$stub"
}
Listing 4
Hans Vredeveld came up with another solution (see Listing 5). This uses bash’s pattern-matching operators and is minimally documented by [Newham05] and [GNU]. To quote the former:
${variable%pattern} If the pattern matches the end of the variable’s value, delete the shortest part that matches and return the rest.
function strip_extension
{
  filename=$1
  case "$filename" in
    *.html | *.HTML)
      stub=$(echo "${filename%.*}")
      ;;
    *)
      echo "Not an HTML file name."
      return 1
      ;;
  esac
  echo "$stub"
}
Listing 5
As with other aspects of shell usage, I had to experiment with this to get a better idea of it. The expansion names the variable (note the absence of a preceding $), followed by a % sign to tell bash what to do, and then the pattern to delete. In this case, strip_extension uses a pattern of .*, which matches a dot followed by any number of following characters. So, not only does it work for .html or .HTML, it works for .odt etc. The file does not have to exist – bash is only working with characters. Here are some examples:
$ filename=dir.html/blank.html
$ echo ${filename%.*}
dir.html/blank
$ filename=dir.html/blank2.html
$ echo ${filename%tml}
dir.html/blank2.h
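Bash also offers %% to delete the longest matching part instead of the shortest, which matters when the pattern could match more than one suffix – a small sketch, using a filename invented for illustration:

```shell
filename=archive.tar.gz
echo "${filename%.*}"    # shortest match deleted: archive.tar
echo "${filename%%.*}"   # longest match deleted: archive
```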
This was then further refined by Sven to avoid using echo:
> stub=$(echo "${filename%.*}")
You don’t need command substitution here: the following should be enough:
> stub="${filename%.*}"
Sven also discussed my use of a while loop to iterate over the command-line parameters:
while [ "$1" != "" ]; do
  if ! perform_html2markdown "$1" ; then
    exit $?
  fi
  shift
done
This is a hang-over from my MS-DOS days, where, to access more than 9 (I think) parameters, you had to use the shift command. Another quirk is that exit $? can simply be replaced with exit.
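In bash that limit is gone, but shift still works the same way: it discards $1 and renumbers the remaining parameters. A small sketch:

```shell
#!/usr/bin/env bash
set -- one two three  # set the positional parameters explicitly
echo "$1"             # prints: one
shift                 # discard $1; $2 becomes $1, and so on
echo "$1"             # prints: two
echo "$#"             # prints: 2 (parameters remaining)
```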
He proposed:
for f in "$@"; do
  if ! perform_html2markdown "$f" ; then
    exit $?
  fi
done
This does the same job without making processed parameters unavailable. The use of $@ and $* is… subtle. To quote Cameron Newham and Bill Rosenblatt [Newham05]:
“$*” is a single string that consists of all of the positional parameters, separated by the first character in the value of the environment variable IFS (internal field separator), which is a space, TAB, and NEWLINE by default. On the other hand, “$@” is equal to “$1” “$2”... “$N”, where N is the number of positional parameters. That is, it’s equal to N separate double-quoted strings, which are separated by spaces. If there are no positional parameters, “$@” expands to nothing. We’ll explore the ramifications of this difference in a little while.
Dabbling further, I used Newham and Rosenblatt’s [Newham05] function countargs (see Listing 6).
#!/usr/bin/env bash
# experimenting from Learning the bash shell,
# chapter 4, page 90
function countargs
{
  echo "$# args."
}
IFS=,
echo -n '$* : '
countargs "$*"
echo "$*"
echo -n '$@ : '
countargs "$@"
echo "$@"
Listing 6
When run, this illustrates the difference between $* and $@.
$ ./countargs Hello World
$* : 1 args.
Hello,World
$@ : 2 args.
Hello World
Note that I set the IFS global variable to a single comma. This illustrates how "$*" joins the parameters using the first character of IFS (which defaults to a space) when outputting them.
The main loop which iterated over the HTML filenames looked like this:
while [ "$1" != "" ]; do
  if ! perform_html2markdown "$1" ; then
    exit $?
  fi
  shift
done
which did not preserve the return code ($?) of perform_html2markdown – by the time exit $? runs, $? holds the status of the if test (which is 0 whenever the negated function call ‘succeeds’), not the function’s own status. It was possible to copy the value of $? into a variable immediately and return that, but I felt that was clunky. Taking into account Sven’s recommendation to use for rather than while, and fixing the bug, I came up with this:
for f in "$@"; do
  perform_html2markdown "$f" || exit
done
The current version of the code looks like Listing 7.
#!/usr/bin/env bash
# Name : irb-html2markdown
# Purpose : To run the html2markdown command with
# less chance of making a silly typing mistake
# (c) Ian Bruntlett
#
# Changelog removed for brevity. See
# [Bruntlett-1] for complete version...

function Usage
{
  cat <<END-OF-USAGE-MESSAGE
Usage: $0 name-of-html-file-1 name-of-html-file-2 name-of-html-file-etc
For each given html filename, convert the HTML file into markdown, writing the results to a file with a similar name to the original - the only change being that the results filename ends in .md rather than .html
Note:
Because this utility removes the HTML suffix from filenames, you can use globbing to specify input files.
Return codes:
0  Success
1  Input file does not exist
10 Input filename not a .html or .HTML file
11 No parameters passed on command line
END-OF-USAGE-MESSAGE
}

# remove .HTML or .html from parameter 1 and then
# output/return that result.
# With thanks to accu-general posters: Mathias
# Gaunard and Hans Vredeveld and Sven
# See https://www.gnu.org/software/bash/manual/bash.html#Shell-Parameter-Expansion
# 10 - Input filename not an html or HTML file name
function strip_extension
{
  local filename stub
  filename=$1
  case "$filename" in
    *.html | *.HTML)
      stub="${filename%.*}"
      ;;
    *)
      echo "Not an HTML file name."
      return 10
      ;;
  esac
  echo "$stub"
}

function perform_html2markdown()
{
  local input_html_file output_markdown_file
  input_html_file="$1"
  if ! output_markdown_file=$(strip_extension "$input_html_file").md ; then
    echo "Bad HTML filename: $input_html_file";
    return 10; # unsupported or bad filename extension
  fi
  echo Translating "$input_html_file" to "$output_markdown_file"
  html2markdown --no-skip-internal-links < "$input_html_file" > "$output_markdown_file"
}

# main code here
if [ $# -lt 1 ]; then
  Usage >&2
  exit 11 # error. need at least 1 parameter
fi

for f in "$@"; do
  perform_html2markdown "$f" || exit
done
Listing 7
References
[ACCU] https://accu.org/faq/mailing-lists-faq/#accu-general – in particular, Mathias, Hans, and Sven.
[Bruntlett-1] GitHub repository: https://github.com/ian-bruntlett/studies/tree/main/bash
[Bruntlett-2] TECH-Manuals: https://github.com/ian-bruntlett/TECH-Manuals
[Cooper14] Mendel Cooper (2014) Advanced Bash Scripting Guide: Appendix E, ‘Exit codes with special meanings’, available at: https://tldp.org/LDP/abs/html/exitcodes.html
[GNU] The GNU Bash Reference Manual, for Bash, Version 5.3, last updated 18 May 2025, available at: https://www.gnu.org/software/bash/manual/bash.html
[Newham05] Cameron Newham and Bill Rosenblatt (2005) Learning the bash Shell, published by O’Reilly.
Ian Bruntlett is a keen reader of software development books. He has promised himself a long stint at dealing with C++, once he has got to grips with Git.