If you have documentation in html but want to provide markdown too, what can you do? Ian Bruntlett describes how he used a shell script to automate the translation and what he learnt.
A filter is a program that reads its input stream (file descriptor 0, aka stdin, aka cin), modifies it, then writes the results to its output stream (file descriptor 1, aka stdout, aka cout). Errors get written to the error stream (file descriptor 2, aka stderr, aka cerr).
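As a minimal sketch of the idea (my own example, not part of the script discussed below), here is a filter that upper-cases its input; the name upcase is made up:

```shell
#!/usr/bin/env bash
# upcase - a minimal filter: read stdin, write the transformed
# text to stdout, and report any problem on stderr.
if ! tr 'a-z' 'A-Z'; then
    echo "upcase: translation failed" >&2
    exit 1
fi
```

Running `echo hello | ./upcase` prints `HELLO`; like any filter, it can sit in the middle of a pipeline.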
I have been maintaining some HTML pages on I.T.-related sources of information, ‘TECH-Manuals’, for personal use for quite some time. I wanted to put one of those pages online. GitHub seemed like a good idea and I uploaded it there [Bruntlett-2]. Unfortunately, when you look at HTML pages on GitHub, you see the raw HTML; apparently you have to upload Markdown (.md) files instead. After searching the Ubuntu package repositories with Synaptic Package Manager, I discovered that the html2markdown command would do what I wanted. With some caution I performed this command to create tm-free-software.md, ready to upload to GitHub [Bruntlett-2]:
$ html2markdown < tm-free-software.html \
  > tm-free-software.md
That is all well and good but it relies on me not getting the input and output filenames wrong. As a filter, it works on one file only. That is OK but I also wanted the ability to use wildcards (shell globbing) to save typing and to make it easier to use with the find command.
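To give an idea of the intended usage (the script name irb-html2markdown comes from Listing 7 later in this article; the directory layout here is invented for illustration):

```shell
# Shell globbing: the shell expands *.html and the script
# loops over the resulting filenames.
irb-html2markdown *.html

# find: convert every HTML file below the current directory,
# passing batches of filenames to the script.
find . -name '*.html' -exec irb-html2markdown {} +
```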
The resulting shell script is modular – and, as a bonus, if you have another filter that you want to run on multiple files, all you have to do is rename the shell script, edit and rename the function perform_html2markdown, and edit strip_extension’s code to accept the new input filename extension(s).
The first bit of executable code checks that at least one parameter has been passed:
if [ $# -lt 1 ]; then
  Usage >&2
  exit 11
fi
As I thought was conventional, the error message got sent to stderr (it turns out I was wrong) and then the script exits with a status code of 11, so as not to collide with return values of html2markdown itself. See [Cooper14] for more information.
The next bit of script loops over the shell script’s parameters. As long as the script has parameters, it invokes my function perform_html2markdown with a single input filename. I had to name it something different to html2markdown so, in an obscure reference to COBOL’s PERFORM statement, I named the function perform_html2markdown.
The for loop is used to execute perform_html2markdown until it fails or all input files have been processed.
So, how does perform_html2markdown work? (Listing 1.)
function perform_html2markdown()
{
  local input_html_file output_markdown_file
  input_html_file="$1"
  if ! output_markdown_file=$(strip_extension \
    "$input_html_file").md ; then
    echo "Bad HTML filename: $input_html_file";
    return 10; # unsupported or bad filename
               # extension
  fi
  echo Translating "$input_html_file" to \
    "$output_markdown_file"
  html2markdown --no-skip-internal-links < \
    "$input_html_file" > "$output_markdown_file"
}

Listing 1
In the interests of modularity and ease of development and maintenance, I use functions in bash, making sure that I declare working variables as local. Unlike in C++, variables assigned in a bash function are global by default. That has caused me problems in the past.
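A tiny experiment (my own, not part of the script) shows the difference that local makes:

```shell
#!/usr/bin/env bash
# In bash, a variable assigned inside a function is global by
# default; 'local' restricts it to the function's scope.
leaky() { x="clobbered"; }
tidy()  { local x="safe inside"; }

x="original"; leaky; echo "$x"   # prints: clobbered
x="original"; tidy;  echo "$x"   # prints: original
```

Without local, the first call silently overwrites the caller’s variable.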
The function perform_html2markdown knows it has been passed a parameter and declares two local variables – input_html_file and output_markdown_file. The input_html_file variable is, as is to be expected, the name of the input file. I could have used $1 instead but I decided to name it to make future maintenance work that little bit nicer. The output_markdown_file variable is set by calling another function:
function strip_extension
{
  local destination_file
  # cater for .html source
  destination_file=$(basename "$1" .html)
  # cater for HTML source
  destination_file=$(basename "$destination_file" .HTML)
  echo "$destination_file"
}
This version contained a bug – strip_extension also strips leading directory components! So, I paused writing this article to learn more, referring to [Newham05] and [GNU].
Aided by the power of functions and help from accu-general [ACCU] I came up with Listing 2.
# remove .HTML or .html from parameter 1 and then
# output/return that result.
function strip_extension
{
  declare -i length
  local length filename \
    filename_stub dot_extension
  filename=$1
  length=${#filename}
  if [ $length -lt 6 ]; then
    return 2 # given filename too short
  fi
  dot_extension=${filename:$length-5:5}
  if [ "$dot_extension" == ".html" ] ||
     [ "$dot_extension" == ".HTML" ]; then
    filename_stub=${filename:0:-5}
    echo "$filename_stub"
  else
    echo
    return 1; # unsupported or bad filename
              # extension
  fi
}

Listing 2
Confident that I had fixed the problem, I asked on accu-general for comments. I received some very interesting implementations of the function strip_extension.
Sven opted for two approaches – one using basename (Listing 3). This works… after a fashion. If the user had provided a directory name, that directory name would be lost: basename turned dir-of-html/blank.html into blank.html. The script was in use by me for quite some time before I discovered the bug and, consequently, paused writing this article.
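The behaviour is easy to reproduce at a prompt (the file need not exist – basename works on the characters alone):

```shell
# basename strips the directory part as well as the suffix, so any
# path information in the input filename is silently discarded:
basename "dir-of-html/blank.html" .html   # prints: blank

# Appending .md to that result means the output file lands in the
# current directory, not next to its source file.
```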
function strip_extension
{
  filename=$1
  case "$filename" in
    *.html)
      stub=$(basename "$filename" .html)
      ;;
    *.HTML)
      stub=$(basename "$filename" .HTML)
      ;;
    *)
      echo "Not an HTML file name."
      return 1
      ;;
  esac
  echo "$stub"
}

Listing 3
Sven’s second approach uses sed – a stream-editing tool that I have a little experience with. See Listing 4. This works. By running sed, an external command, it is slower, but it has the benefit of being correct!
function strip_extension
{
  filename=$1
  case "$filename" in
    *.html | *.HTML)
      stub=$(echo "$filename" \
        | sed -e 's/\.[^.]*$//')
      ;;
    *)
      echo "Not an HTML file name."
      return 1
      ;;
  esac
  echo "$stub"
}

Listing 4
Hans Vredeveld came up with another solution (see Listing 5). This uses bash’s pattern-matching operators, which are minimally documented in [Newham05] and [GNU]. To quote the former:
${variable%pattern}
If the pattern matches the end of the variable’s value, delete the shortest part that matches and return the rest.
function strip_extension
{
  filename=$1
  case "$filename" in
    *.html | *.HTML)
      stub=$(echo "${filename%.*}")
      ;;
    *)
      echo "Not an HTML file name."
      return 1
      ;;
  esac
  echo "$stub"
}

Listing 5
As with other aspects of shell usage, I had to experiment with this to get a better idea of it. The expansion names the variable (note the absence of a preceding $), followed by a % sign to tell bash what to do, and the pattern is the thing to delete. In this case, strip_extension uses a pattern of .*, which means match a dot followed by any number of characters. So, not only does it work for .html or .HTML, it works for .odt etc. The filename does not have to exist – bash is just working with characters. Here are some examples:
$ filename=dir.html/blank.html
$ echo ${filename%.*}
dir.html/blank
$ filename=dir.html/blank2.html
$ echo ${filename%tml}
dir.html/blank2.h
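For completeness (my own addition, not raised in the discussion), bash also offers %%, which deletes the longest match rather than the shortest – the difference only shows when the value contains more than one dot:

```shell
# % deletes the shortest match from the end of the value,
# %% deletes the longest match:
filename=archive.tar.gz
echo "${filename%.*}"    # prints: archive.tar
echo "${filename%%.*}"   # prints: archive
```

For strip_extension the two behave identically, because the case statement has already guaranteed exactly one trailing .html or .HTML.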
This was then further refined by Sven to avoid using echo:
> stub=$(echo "${filename%.*}")
You don’t need command substitution here: the following should be enough:
> stub="${filename%.*}"
Also discussed by Sven was my use of a while loop to iterate over the command-line parameters:
while [ "$1" != "" ]; do
  if ! perform_html2markdown "$1" ; then
    exit $?
  fi
  shift
done
This is a hang-over from my MS-DOS days, where to access more than 9 (I think) parameters, you had to use the shift command. Another quirk is that exit $? can be replaced with exit.
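A quick demonstration (my own) of what shift costs you – after a shift-based loop, the positional parameters really are gone:

```shell
#!/usr/bin/env bash
# shift consumes the positional parameters as it goes, so after a
# shift-based loop nothing is left for later code to inspect.
set -- one two three        # simulate three command-line parameters
while [ "$1" != "" ]; do
    shift
done
echo "parameters remaining: $#"   # prints: parameters remaining: 0
```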
He proposed:
for f in "$@"; do
  if ! perform_html2markdown "$f" ; then
    exit $?
  fi
done
This does the same job without making processed parameters unavailable. The use of $@ and $* is… subtle. To quote Cameron Newham and Bill Rosenblatt [Newham05]:
“$*” is a single string that consists of all of the positional parameters, separated by the first character in the value of the environment variable IFS (internal field separator), which is a space, TAB, and NEWLINE by default. On the other hand, “$@” is equal to “$1” “$2”... “$N”, where N is the number of positional parameters. That is, it’s equal to N separate double-quoted strings, which are separated by spaces. If there are no positional parameters, “$@” expands to nothing. We’ll explore the ramifications of this difference in a little while.
Dabbling further, I used Newham and Rosenblatt’s [Newham05] function countargs (see Listing 6).
#!/usr/bin/env bash
# experimenting from Learning the bash shell,
# chapter 4, page 90
function countargs
{
  echo "$# args."
}
IFS=,
echo -n '$* : '
countargs "$*"
echo "$*"
echo -n '$@ : '
countargs "$@"
echo "$@"

Listing 6
When run, this illustrates the difference between $* and $@.
$ ./countargs Hello World
$* : 1 args.
Hello,World
$@ : 2 args.
Hello World
Note that I set the IFS global variable to a single comma. This is to illustrate how "$*" uses the first character of IFS (which defaults to a space) as the separator when joining the parameters into a single string.
The main loop which iterated over the HTML filenames looked like this:
while [ "$1" != "" ]; do
  if ! perform_html2markdown "$1" ; then
    exit $?
  fi
  shift
done
which did not preserve the return code (aka $?) of perform_html2markdown. It was possible to copy the value of $? into a variable immediately and return that, but I felt that was clunky. Taking into account Sven’s recommendation to use for rather than while, and fixing the bug, I came up with this:
for f in "$@"; do
  perform_html2markdown "$f" || exit
done
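Because a bare exit reuses the status of the last command executed, this one-liner propagates perform_html2markdown’s own return code. A stand-in function (my own, deliberately failing with status 7) shows the mechanism in isolation:

```shell
#!/usr/bin/env bash
# 'cmd || exit' exits with cmd's own status: the bare 'exit' reuses
# the status of the last command, which is the failing cmd itself.
fails_with_7() { return 7; }
( fails_with_7 || exit )    # subshell exits with the function's status
echo "exit status: $?"      # prints: exit status: 7
```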
The current version of the code looks like Listing 7.
#!/usr/bin/env bash
# Name : irb-html2markdown
# Purpose : To run the html2markdown command with
# less chance of making a silly typing mistake
# (c) Ian Bruntlett
#
# Changelog removed for brevity. See
# [Bruntlett-1] for complete version...

function Usage
{
cat <<END-OF-USAGE-MESSAGE
Usage: $0 name-of-html-file-1 name-of-html-file-2 name-of-html-file-etc
For each given html filename, convert the HTML file into markdown,
writing the results to a file with a similar name of the origin - the
only change being the results filename has .md at the end and not .html
Note: Because this utility removes the HTML suffix from filenames, you
can use globbing to specify input files.
Return codes:
0  Success
1  Input file does not exist
10 Input filename not a .html or .HTML file
11 No parameters passed on command line
END-OF-USAGE-MESSAGE
}

# remove .HTML or .html from parameter 1 and then
# output/return that result.
# With thanks to accu-general posters: Mathias
# Gaunard and Hans Vredeveld and Sven
# See https://www.gnu.org/software/bash/manual/
# bash.html#Shell-Parameter-Expansion
# 10 - Input filename not an html or HTML file
# name
function strip_extension
{
  local filename stub
  filename=$1
  case "$filename" in
    *.html | *.HTML)
      stub="${filename%.*}"
      ;;
    *)
      echo "Not an HTML file name."
      return 10
      ;;
  esac
  echo "$stub"
}

function perform_html2markdown()
{
  local input_html_file output_markdown_file
  input_html_file="$1"
  if ! output_markdown_file=$(strip_extension \
    "$input_html_file").md ; then
    echo "Bad HTML filename: $input_html_file";
    return 10; # unsupported or bad filename
               # extension
  fi
  echo Translating "$input_html_file" to \
    "$output_markdown_file"
  html2markdown --no-skip-internal-links < \
    "$input_html_file" > "$output_markdown_file"
}

# main code here
if [ $# -lt 1 ]; then
  Usage >&2
  exit 11 # error. need at least 1 parameter
fi

for f in "$@"; do
  perform_html2markdown "$f" || exit
done

Listing 7
References
[ACCU] https://accu.org/faq/mailing-lists-faq/#accu-general – in particular, Mathias, Hans, and Sven.
[Bruntlett-1] GitHub repository (bash scripts): https://github.com/ian-bruntlett/studies/tree/main/bash
[Bruntlett-2] TECH-Manuals: https://github.com/ian-bruntlett/TECH-Manuals
[Cooper14] Mendel Cooper (2014) Advanced Bash Scripting Guide: Appendix E, ‘Exit codes with special meanings’, available at: https://tldp.org/LDP/abs/html/exitcodes.html
[GNU] The GNU Bash Reference Manual, for Bash, Version 5.3, last updated 18 May 2025, available at: https://www.gnu.org/software/bash/manual/bash.html
[Newham05] Cameron Newham and Bill Rosenblatt (2005) Learning the bash Shell, published by O’Reilly.
Ian Bruntlett is a keen reader of software development books. He has promised himself a long stint at dealing with C++, once he has got to grips with Git.