An Associative Container for Non-bash Shell Scripts

An Associative Container for Non-bash Shell Scripts

By Ralph McArdell

Overload, 30(167):9-15, February 2022


Basic shell facilities don’t provide associative containers. Ralph McArdell shows you what to do if you need one.

Have you ever found yourself stuck with having to write a *nix/ *nux shell script that cannot assume that Bash and system core utilites having GNU extensions are available and so specify the shell processor simply to be /bin/sh and then found you could really use some sort of container to store multiple values?

This happened to me a while ago and I thought I would attempt to create an associative container abstraction that could be used with only basic shell facilities.

Basic selection

Having decided to use ‘only basic shell facilities’, and not being a shell scripting guru, the next question was just what are ‘basic shell facilities’? What facilities are available for sh? In fact what shell is used when we use /bin/sh? It seems that these days /bin/sh may simply link to some other shell program such as Dash.

A little research lead me to the Open Group’s [OpenGroup] page on the POSIX (IEEE Std 1003.1-2017) ‘Shell Command Language’ [ShellLang]. It seemed that trying to use only the shell language facilities as described on this page would be the most portable solution, assuming as few facilities would be available as possible.

Not quite basic

After reading through the specification of the Shell Command Language, it became apparent that one feature that I would really like to use was not available as standard. The local utility used to declare variables local to a function call is not required to be implemented but is sort-of reserved in that the results of using local has unspecified results. It seems local variables within a function were considered but rejected, though the identifier local was reserved just in case local function variables make it into a future version of the POSIX standard [ShellLangRationale].

I decided that I would use local – at least until it proved a problem, whereby most uses could probably be replaced with careful use of global variables. The exception would be cases where recursion was required. However, I would assume local variables were only in scope in the call of the declaring function, and not in functions called from the declaring function call – similar to most other languages’ local variable scoping rules.

The requirements

In the end it turned out that in addition to local, some basic use of core utilities were required. The full requirements to use the associative container shell script library are:

  • the POSIX Shell Command Language, as described at [ShellLang]
  • the local utility
  • echo -n (the use of -n is not strictly portable)
  • basic use of the tr character transform utility
  • basic use of the sed stream editor utility to perform global subsitutions.

What’s in the box?

Having decided on a super-set of the standard POSIX Shell Command Language to use, the next thing was to look at the available facilities and work out how they could be used. Some of the facilities (or lack of) that are of note are mentioned below.

First off, flow control contructs – if, elif, else, case, while, until, for – are supported, as are user-defined functions. However, returning values from a direct function call (one that does not start a sub-command shell in a separate process) is limited to the numeric exit status – although the Boolean-like true and false can be used for predicate functions that can be called and tested in flow control constructs such as if or while. This is because true and false are built-in utilities that return with an exit code of 0 (success) or a non-zero value (failure) respectively.

To return a copy of an expression, as is more familiar, a user-defined function can be called using Command Substitution, which executes the function in a separate, presumably forked, sub-shell process. The output to stdout from a Command Substitution call is captured and effectively become the function call’s return value. Note that Command Substitution is also commonly used to capture the output of commands into variables. However, creating and destroying the sub-shell around the function call is expensive.

Variables store string values. Sometimes these can be interpreted as numbers. There are some limited operations that can be performed on a variable’s string value:

  • Various error or default/alternative value substitution actions around values being null, set and not null or unset.
  • Character length of the string value of a variable.
  • Removal of a pattern matching the smallest or largest prefix or suffix of the variable’s value from the variable’s value.

Notably, as far as I could see, there is no facility to more directly reference variable values’ characters or substrings other than the prefix/suffix pattern match removal operations.

Arithmetic can be performed on numeric values using Arithmetic Expansion.

Facilities often used on the command line such as redirection (>, <, etc.), pipelines (|), and-or lists (&&, ||) , asynchronous execution (&) and so on are supported.

There is a set of special variables. Of particularly interest are $@ and $# for the list of parameters passed to the script from the command line or to a function call and the number of such parameters, along with the set, unset and shift built-in utilities that allow control and processing of script and function call parameter lists.

Script source files can be included in other script source files using the dot (.) built-in utility. The bash style source synonym for dot is not supported.

The shape of code to come

By now, I had an idea of the general shape of the project. The first order of business would be to decide on a name and use it, or a form of it, as the basis for file names and, given the lack of namespaces, as a prefix to function names and the like.

The implementation code would be contained in a shell script file intended for inclusion using ., the dot built-in utility, in client scripts. Such included files do not need to be executable themselves. Sometime later I found out that there is a convention that such included shell script library files should have a .sh suffix.

The public, client, API would consist of a set of functions, most of which would return string values. To make the use of these functions more familair they would be designed to be called using Command Substitution.

Any private internal implementation details – helper functions, global variables and the like – would have uglified names involving prefixed and suffixed double underscores.

Given that values are strings, the associative container instances would have to be represented as a string. Instance creation operations would have to return the string-instance value. Operations that do not modify an instance would take an instance as the first parameter to its implementation function. Operations that modify an instance would both take an intance as the first parameter and return the modified instance. Note that all parameter passing and value returning is effectively by value, so lots of potential copying.

What’s in a name

The first point of action was to decide a name from which the script library file name and library script identifier names can be derived.

The name I chose, following in the footsteps of Python, was dict.

The library file name was also originally just dict but later was changed to dict.sh, following the discovery of the previously mentioned convention.

What’s a dict to do?

Obviously dict instances would be of a dictionary type – also known as maps. These are associative containers whose entries are key:value value pairs that allow values to be looked up by their associated key. They can be ordered or unordered.

Given the limited options available with the facilities of the POSIX Shell Command Language, I decided it would be much simpler to opt for dicts being unordered.

The basic operations required would be:

  • Create dict instances.
  • Add and update entries in a dict.
  • Lookup and return a value in a dict given its associated key.
  • Remove an entry from a dict given the associated key.

Some additional operations also proved useful:

  • A function to return the size of the dict, being the number of entries.
  • A predicate to check whether some value, which of course is a string, appears to be a string formatted as a dict.
  • A for-each operation that calls a function for each entry in a dict.
  • A customisable pretty-print function to pretty print a dict in a user-controlled fashion.
  • To aid debugging a function to print a dict string with special unprintable characters translated to printable characters.

One capability that needed to be added explicitly is the ability to nest one dict as a value of another, preferrably to an arbitrary depth.

The dict data format

As the only data type to play with is the string and with no true random access operations into a string, the format of a dict was going to have to be sequential, with an appearence of random access only – think tape access rather than disk. This sort of thinking lead to vague memories of some of the lesser used control codes in the ASCII (and thereby the Unicode®) character set. The ones of interest here are the separator control codes. Checking a convenient ASCII table source [AsciiTable] we see these are:

  • FS (value hex 1C) – File Separator
  • GS (value hex 1D) – Group Separator
  • RS (value hex 1E) – Record Separator
  • US (value hex 1F) – Unit Separator

On the one hand, using these values to separate parts of a dict instance means keys and values cannot contain them. On the other hand, for the intended use cases this should not be much of a problem. Where it would be an issue would be keys or values being binary data but this is not a primary use case. If storing binary data in a dict is required then the data can be encoded, for example with base64 encoding.

The current layout of a dict in Extended Backus Naur Form [EBNF] is as shown in Figure 1.

            dict = header, { entry };
          header = dict-type, entry-separator,
                   version, field-separator,
                   size, entry-separator;
           entry = key, field-separator, value,
                   entry-separator;
       dict-type = GS, "D", "i" "C", "t", GS;
         version = digits, ".", digits, ".",
                   digits;
            size = digits;
             key = string;
           value = string | prepared-dict;
 entry-separator = field-separator, 
                   record-separator;
 field-separator = US;
record-separator = RS;
   prepared-dict = ? a dict instance that has had
                     its special separator
                     characters substituted so it
                     can be safely used as a 
                     nested dict value 
                     ?;
          string = character, { character };
       character = ? any valid character except
                     ASCII FS, GS, RS, US ?;
          digits = digit, { digit };
           digit = "0" | "1" | "2" | "3" | "4" |
                   "5" | "6" | "7" | "8" | "9" ;
              GS = "\0x1D";
              RS = "\0x1E";
              US = "\0x1F";
			
Figure 1

There are several things to note about the layout. The first is that a dict instance has two main parts: the header followed by zero or more entries.

The header consists of an initial dict identifier ‘chunk’ consisting of the four characters 'DiCt' surrounded by an ASCII Group Separator character on each side. This is followed by metadata fields for the version of the dict structure (currently '1.1.0') and the size of the dict being the number of entries, which starts at zero (an empty dict) and is updated as entries are added and removed. Note that each field is followed by an ASCII Unit Separator character, both in the header metadata record and within entries. The end of the metadata record is signified by an ASCII Record Separator character.

The following entries section consists of key:value pair entry records. As with the metadata record each key and each value is followed by an ASCII Unit Separator character, and each entry record is terminated with an ASCII Record Separator character.

Keys can be any string of characters except those used as separators – currently the ASCII GS, RS and US characters are used, but the FS – File Separator character might have a use in a later update, so I suggest it is best to not use any of the four ASCII separator characters.

The same restrictions apply to entry values, except that a dict can be nested as a value of another dict. In these cases the characters that could cause confusion are substituted for different sequences on entry with the reverse substitutions being applied on value extraction.

The dict API functions

The current set of dict API functions are:

  • dict_declare to create a dict and optionally populate it with one or more initial entries.
  • dict_set to add or update one or more entries to a dict.
  • dict_get to return from a dict the value associated with the provided key, or an empty string if not found.
  • dict_remove to remove an entry from a dict given its key.
  • dict_size to return the number of entries as stored in a dict’s metadata record’s size field.
  • dict_is_dict – a predicate function – to check if a value is a dict.
  • dict_for_each to iterate over the entries of a dict calling a function for each entry passing it the entry’s key, value, one based computed record index and any extra parameters passsed to dict_for_each.
  • dict_count returns the count of the number of records which should be the same as the value returned by dict_size. It uses dict_for_each and its computed record index to iteratively count the number of entries. The main use of dict_count is for testing.
  • dict_pretty_print prints the entries of a dict according to a caller provided format – itself specified by a dict. Uses dict_for_each to recursively iterate over entries and nested dict entry values. This is a case where local variables are really required.
  • dict_print_raw to print the raw dict string with the unprintable separator characters translated to others. Mainly useful for debugging.

In addition there are simple versions of functions that insert or extract values: dict_declare_simple, dict_set_simple and dict_get_simple. These function act as their non-simple versions except that passing dicts as values is not supported. They assume entry values inserted or extracted are not dict strings and thus elide the checks and special handling required for dict values.

Finally there is a function that is intended to be used with dict_for_each, dict_op_to_var_flat, that creates a variable for each entry in the dict based on the entry’s key name. The ‘flat’ part of the name is intended to indicate that the function does not recurse down into nested dict values.

Slice and splice

There are some basic private internal constants and operations to aid in building and updating dict strings. For example the initial empty dict is a fixed string as the only variable part is the size metadata field value and initially this is always 0.

Some small functions help out with chores, such as ensuring keys are followed by the US field separator character and building an entry string by combining such a decorated key with a value and the US RS entry separator character sequence.

Locating entries is more interesting. The value associated with a key is required as the result. First the header is stripped from the dict (remember parameters are passed by value so in fact the header is stripped from a copy of the dict) by using the remove-smallest-prefix operation ${dict#hdr-pattern}, where hdr-pattern is the fixed part of the header including the entry-separator with the size field value wildcarded.

Next the rest of the dict string up to and including the required key field is removed using the remove-smallest-prefix operation again, this time with a pattern specifying all characters up to the preceding entry-separator and the decorated key value. This leaves the start of the dict substring copy with the value associated with the key followed by all the subsequent entries. Finally, all these subsequent entries that follow the value are removed using the remove-largest-suffix operation ${value_plus%%pattern} with a pattern specifying the entry-separator and all following characters, leaving just the value.

If removing the dict portion up to the required value did not remove anything then there was no pattern match meaning the key was not found in the dict. In this case an empty string is returned.

Add or update is similar except in this case if a key is found the prefix and suffix parts of the sliced up dict around the associated value are kept and the new value spliced in between them, the old value being discarded. In the case a key is not found a new entry is created and appended to the end of the dict.

Entry removal again consists of breaking the existing dict into parts and splicing parts back together. In this case the parts spliced together to form the result are a prefix part up to the key of the entry being removed and a suffix part from the start of the entry, if any, following the removed entry’s entry separator sequence.

A wrinkle in the above is the updating of the size metadata field. This is performed if required after the dict (copy) has been spliced back together and is done by stripping the header and then appending the resulting entry records to a newly built header with the updated size value using string concatenation of the various substring parts.

dict_for_each first strips off the header of its passed dict copy then repeatedly extracts the key and value of the initial entry record in the resulting entry records and then removes the whole of the initial entry record, until eventually there are no more entries to remove. On each iteration, a function whose name is passed as the second parameter to dict_for_each is called and passed the extracted key and value values and a one-based computed record number along with any parameters passed to dict_for_each following the second function name parameter.

Being prepared

If the value of an entry to add or update is itself a dict then it has to be modified before being inserted; otherwise, its entries can interfere with operations on the outer containing dict. Likewise, if the value of an entry located with dict_get is a modified nested dict then it has to be modified back to its original state before the value is returned.

dict_declare and dict_set check to see if a value to add or update is a dict and if it is it is prepared for nesting. Likewise, if a value extracted by dict_get appears to be a prepared nested dict then it is prepared for unnesting.

The prepare-for-nesting operation inserts an ASCII GS character before every field-separator and record-separator – which are the ASCII US and RS characters respectively. This prevents the matching that occurs with the slicing and splicing of dict operations from matching any fields or entries in the prepared for nesting dict value. For example, when stripping away entries before the value of an associated key the pattern used to match the smallest prefix to remove is everything (*) up to an entry-separator followed by the decorated key, being the key followed by a field-separator, which for a key value of "KEY" expands to:

  match = [string], US, RS, "K", "E", "Y", US;

If a nested dict value happened to also have an entry with a key value of "KEY", and this occured before the entry being searched for, if the nested dict entry had not been prepared this entry would be found. However, after preparing for nesting, the entry fragments that would have matched would have the following sequence:

  prepared-averted-match = GS, US, GS, RS, "K",
    "E", "Y", GS, US;

which fails to match both as the prepared modified entry-separator sequence is now GS, US, GS, RS and as if that were not enough the modified decorated key has become "K", "E", "Y", GS, US.

The prepare-for-unnesting reverses the effects of preparing for nesting. It replaces all occurrences of GS, US with US and every occurrence of GS, RS with RS.

These transformations work for multiple levels of nested dict values. Each additional level of nesting causes an additonal GS character to be inserted before each US and RS character. So if a dict has the following sequence of characters for an entry:

  unnested-entry = "K", "E", "Y", US, "V", "A",
    "L", "U", "E", US, RS;

Then inserting the dict into another dict tranforms the entry character sequence like so:

  nested-entry-level-1 = "K", "E", "Y", GS, US,
    "V", "A", "L", "U", "E", GS, US, GS, RS;

If the second dict is then inserted into a third, the already-prepared-for-nesting nested dict value is modified again when the second dict is prepared for nesting, and the entry sequence is further transformed thus:

  nested-entry-level-2 = "K", "E", "Y", GS, GS, US,
    "V", "A", "L", "U", "E", GS, GS, US, GS, GS, RS;

If the second dict nested in the third dict is looked up in the third dict then its prepared separator sequences are transformed by the prepare-for-unnesting operation, effectively removing one GS character before each US and RS character, which for the initial dict, nested as a value of the second nested dict, removes one of the two accumulated GS characters before its US and RS characters, transforming the entry sequence from nested-entry-level-2 back to nested-entry-level-1. If the dict nested in the second dict copy obtained from the third dict is then looked up in the second dict copy, its prepared separator sequences will again be transformed by the prepare-for-unnesting operation, removing the remaining GS characters before its US and RS characters, transforming the entry sequence from nested-entry-level-1 back to unnested-entry.

Unforunately, the only way I could see to perform the required replacements for the prepare-for-nesting and prepare-for-unnesting operations was to farm the tasks out to sed using two global substitutions per preparation operation. This, of course, is quite a heavy weight afair.

Exemplary dict-ation

Enough of the waffle, let’s dive in an see how to use dict in practice. The dict.sh shell script library and the examples shown are available in the GitHub sh-utils repository [ShUtils].

So let us start with the obligatory Hello World example (Listing 1).

#!/bin/sh

. dict.sh

record="$(dict_declare_simple 'greeting' 'Hello'\
  'who' 'World')"
echo "$(dict_get_simple "${record}" 'greeting'),\
  $(dict_get_simple  "${record}" 'who')!"

record="$(dict_set_simple "${record}"\
  'greeting' 'Hi' 'who' 'Earth')"
greeting="$(dict_get_simple "${record}"\
  'greeting')"
who="$(dict_get_simple  "${record}" 'who')"
echo "${greeting}, ${who}!"
			
Listing 1

First the source of the dict shell script library is included using the dot built-in. It is assumed that dict.sh is either on the PATH or in the same directory as the Hello World script.

Next a dict is created using dict_declare_simple, as we are not nesting any dict values. It can be seen that the parameters to dict_declare_simple are values for initial entries to initialise the new dict with. They are grouped pairwise: key1 value1 key2 value2 ... keyN valueN. This is easy to handle in Shell Command Language thanks to the $@ and $# special parameters and the shift built-in. The returned dict string is assigned to a variable named record.

The whole greeting is then written to the console by calling dict_get_simple to obtain the values associated with the greeting and who keys. The returned values are expanded directly into the echo’d string.

The dict is then updated by calling dict_set_simple. Note the pattern of passing the expanded record variable as the first argument and receiving the returned dict string back into the record variable. If we wanted to keep the original state of record then the updated dict could be assigned to a different variable. Once again note the pairwise grouping of entry keys and values in the parameter list.

Finally, the greeting and who values are obtained from the updated dict, this time assigning each value to its own variable, which are then written to the console.

Assuming the Hello World script file is called dict_hello_world, we are in a console session and the working directory is the directory containing dict_hello_world then executing:

./dict_hello_world

should produce the following results (not forgetting to ensure the script is executable):

  Hello, World!
  Hi, Earth!

What is needed though is more colour! This can be achieved by using ANSI terminal control sequences [AnsiTermCodes]. In the next example the Hello World example is extended to add keys foreground and background which have nested dict values containing keys r, g and b to store 24-bit (8 bit + 8 bit + 8 bit) RGB colour value triples which are used to set the foreground and background colours of the output greeting text using 24-bit colour mode ANSI terminal control escape codes. (See Listing 2.)

#!/bin/sh
. dict.sh

readonly ASCII_ESC=$(echo '\033')
readonly ANSI_CMD_PREFIX="${ASCII_ESC}["
readonly ANSI_CMD_END="2m"
readonly ANSI_CMD_RESET="${ANSI_CMD_PREFIX}0m"
readonly \
  ANSI_CMD_FG_MULTIBIT="${ANSI_CMD_PREFIX}38"
readonly \
  ANSI_CMD_BG_MULTIBIT="${ANSI_CMD_PREFIX}48"
readonly \
  ANSI_CMD_FG_24BIT="${ANSI_CMD_FG_MULTIBIT};2"
readonly \
  ANSI_CMD_BG_24BIT="${ANSI_CMD_BG_MULTIBIT};2"

set_ansi_24bit_colours_string() {
  local fg="${1}"
  local bg="${2}"
  local r="$(dict_get_simple "${fg}" 'r')"
  local g="$(dict_get_simple "${fg}" 'g')"
  local b="$(dict_get_simple "${fg}" 'b')"
  local fg_cmd="${ANSI_CMD_FG_24BIT};${r};${g};\
${b}${ANSI_CMD_END}"

  r="$(dict_get_simple "${bg}" 'r')"
  g="$(dict_get_simple "${bg}" 'g')"
  b="$(dict_get_simple "${bg}" 'b')"
  return_value="${fg_cmd}${ANSI_CMD_BG_24BIT};\
${r}; ${g};${b}${ANSI_CMD_END}"                                     
}

record="$(dict_declare \
  'greeting' 'Hello' 'who' 'World' \
  'foreground' "$(dict_declare_simple 'r' '127' \
                  'g' '255' 'b' '80' )" \
  'background' "$(dict_declare_simple 'r' '80' \
                  'g' '0'   'b' '0'  )" \
  )"

set_ansi_24bit_colours_string \
  "$(dict_get "${record}" 'foreground')" \
  "$(dict_get "${record}" 'background')"

echo -n "${return_value}"
echo -n "$(dict_get_simple "${record}"\
        'greeting'),\
  $(dict_get_simple  "${record}" 'who')!"
echo "${ANSI_CMD_RESET}"

record="$(dict_set_simple "${record}" \
  'greeting' 'Hi' 'who' 'Earth')"

record="$(dict_set "${record}" \
  'foreground' "$(dict_declare_simple 'r' '255' \
                  'g' '127' 'b' '80' )" \
  'background' "$(dict_declare_simple 'r' '0' \
                  'g' '80' 'b' '0' )" \
  )"

fore="$(dict_get "${record}" 'foreground')"
back="$(dict_get "${record}" 'background')"
set_ansi_24bit_colours_string "${fore}" "${back}"

greeting="$(dict_get_simple "${record}" \
          'greeting')"
who="$(dict_get_simple  "${record}" 'who')"
echo "${return_value}${greeting},\
  ${who}!${ANSI_CMD_RESET}"
			
Listing 2

After including the dict.sh source as before, there is support for setting the ANSI terminal foreground and background 24-bit colours consisting of a bunch of constants (readonly variables) that build up the ANSI terminal control escape codes and the set_ansi_24bit_colours_string function that takes dicts having r, g and b keys whose values specify the individual 8-bit components of the 24-bit colours for the foreground and background. The set_ansi_24bit_colours_string result is returned in the return_value global variable and is a string that will set the requested 24-bit colours when written to a supporting ANSI terminal such Xterm, GNOME terminal or KDE’s Konsole (as listed in the 24-bit section of [AnsiTermCodes]).

As previously, a dict is created. However, this time dict_declare must be used as some of the initial entry values are dicts. The nested dict values for the foreground and background entries are created by calling dict_declare_simple as none of the nested dicts’ entries’ values are themselves dicts.

set_ansi_24bit_colours_string is called before outputting the complete greeting, and is passed the foreground and background RGB triple dicts which were obtained using dict_get (as they are nested dict values) from the record dict and stored in variables fore and back.

The string returned by set_ansi_24bit_colours_string in return_value is echo’d to the console without a terminating newline to set the terminal’s foreground and background colours and then the greeting is output as before except no terminating newline is output as following the greeting the ANSI_CMD_RESET constant string is output to reset the terminal, removing the previously set foreground and background colours.

The greeting and who entries of record are updated as before with dict_set_simple. Then the foreground and background RGB triple dict values are updated using dict_set. Note that all four entries could have been updated in a single call to dict_get. However, doing it in two parts demonstrates that a dict containing nested dict values can still be operated on by the simple forms of the API functions so long as no dict entry values are involved in that specific function call.

Once more, the (updated) foreground and background RGB triple dicts are passed to set_ansi_24bit_colours_string to obtain the new set-colours ANSI control escape code string. The updated values for the greeting and who entries are, as before, obtained with calls to dict_get_simple and stored in variables greeting and who.

Finally, the whole output sequence: set-colour-control-escape-codes, greeting, who, ANSI reset control-escape-code is output in a single echo command.

Executing the extended Hello World example should produce the same textual output, only with more colour involved.

Other tricks

There are a few things that you can do with dicts that might be of interest. First, dicts are strings. Yes, I know that has been stated a time or two already. However, as strings they can be used like any other string, although of course just printing them out loses something in the process – the unprintable separator characters for a start.

More usefully, they can be passed to commands and other scripts as arguments. This is useful if, for example, you have a suite of mutli-stage scripts that run one after the other and it is the intial script’s job to obtain the arguments and options for the job – they are collected in a dict which is passed from one script to the next, each accessing the arguments and options relevant to itself.

Of course as strings they can also be saved to disk and read back later, or sent over a network. Basically, dicts are already serialised. The only issue in the future would be with regard to possible differing versions of the dict data format, at which time the version metadata field in the dict header would become a lot more relevant.

As an associative key:value container it is possible to use dicts to emulate other container types. Two that come to mind are vectors (that is single dimension arrays) and sets.

The characteristics of a vector are that values are identified by an integer index, usually starting at 0 or 1 and increasing by one for each entry. Another characteristic of some implementations is that they can only have elements efficiently added to the end of the vector.

A dict can emulate a vector which for example has elements starting at index 0 and only allows appending elements to the end. The idea would be to wrap the underlying dict operation function calls in vector specific operation functions.

Operations that create new elements do not require the index key values as parameters. Rather they synthesise the next index as being the value returned by dict_size. Hence an empty vector-dict would have size zero and so the index key for the first element would be '0', the size is then one thus the index of the next added element would be '1' and so on. Operations that add new elements would be created with initial element values and append new element values, which would be wrappers around dict_declare (or dict_declare_simple) and dict_set (or dict_set_simple). Listing 3 shows what a vector_append function might look like.

vector_append() {
  local vector="${1}"
  shift
  local count=$#
  local index=$(dict_size "${vector}")
  while [ ${count} -gt 0 ]; do
    local value="${1}"
    shift
    set -- "$@" "${index}" "${value}"
    count=$(( ${count}-1 ))
    index=$(( ${index}+1 ))
  done

  # pass the vector and converted entry values 
  # to dict_set:
  vector_return_value="$(dict_set \
                         "${vector}" "$@")"
}
			
Listing 3

Note that the values to be added have to be pairwise interleaved with their index key values which requires modifying the $@ special parameter while iterating over it which is why a snapshot of the value of $# is taken before the iteration starts which is then manually decremented. Similarly a snapshot of the intitial size of the vector is taken before iteration start which is incremented manually – this is to reduce the overhead of calling dict_size.

Other operations that could be useful would be accessing a specific element by index key, which would simply devolve to a call to dict_get (or dict_get_simple), and updating an existing element, which would basically wrap a call to dict_set (or dict_set_simple) with checks on the index value to ensure it is a valid and in range index value.

Iteration can of course be performed via dict_for_each, possibly via a wrapper that adapts the function called if not all of the usual three arguments dict_for_each passes to the supplied function for each entry (key, value and computed one based record number) are not required.

Emulating a set is similar in that both keys and values do not need to be specified when adding set members. In this case, it is only the key values that need to be given, the associated values are of no interest other than them not being empty so that set membership can be determined by passing the cadidate value to dict_get_simple (or dict_get) and testing the result to see if it is not empty. The entry values can all be the same – preferably a short value such as '_'.

As with the vector emulation case, functions can be written that wrap the underlying dict operations. For example, Listing 4 shows what a predicate function to check if a set-dict contains a value might look like.

set_contains() {
  local set="${1}"    # the haystack to search
  local value="${2}"  # the needle to find
  local maybe_in="$(dict_get_simple \
                  "${set}" "${value}")"
  if [ -n "${maybe_in}" ]; then
    true; return
  else
    false; return
  fi
}
			
Listing 4

Conclusions and possible future directions

The dict.sh Shell Command Language library script implements a dictionary container using mostly just standard features of the POSIX Shell Command Language plus simple use of a few core utilities, with the exception of local and echo -n.

However, performance is a concern. The call-out to sed to perform the required substitutions when inserting or retrieving nested dicts is particularly heavy on performance.

The use of Command Substitution is also an area that drags down performance. In fact internally the private, uglyfied support functions are now all called directly and return their results by setting a __dict_return_value__ global variable.

I have a couple of ideas to potentially address these two issues.

The first is to re-work how dicts are nested that removes the need for changing the data format and thus removes the dependence on sed. The idea – very nebulous at the moment – would be to give each dict an identifier, which might be random and may need to be globally unique, and bolt nested dicts onto the end of the dict string with some sort of new separator sequence – maybe using the currently unused FS character. Entries that have nested dict values would have values that reference that dict by identifier.

The second idea is to add a parallel set of API functions that are called directly rather than by Command Substitution and return their value in a variable specified by name by the caller as an additional parameter by making used of the eval built-in utility. These would be more cumbersome to use but eleminate having to spin up a sub-shell process for every call.

However, at the end of the day there is only going to be so much that can be done to reduce performance overheads. The underlying sequential nature of strings and the complexity required by the slice and splice operations will end up limiting performance. dicts are never going to have any great performance, especially at scale. However they were not intended to.

The reliance on echo -n for the existing API functions can be fixed by using standard facilties of the printf utility.

Some additional operations would be useful. One that springs to mind is a dict_merge or dict_extend operation that combines the elements of two (or more?) dicts.

Rather than emulating other container types, it would be cleaner to implement each as their own library script, tailoring the data structure and operations to suit each.

I do not know if or when I will make these changes as I had already gone down a few rabbit holes while getting on with a larger task when I started implementing dict. It would be good to bottom out and pop back up out of some of the rabbit holes.

References

[AsciiTable] ASCII Table https://www.asciitable.com/

[AnsiTermCodes] ANSI terminal control codes https://en.wikipedia.org/wiki/ANSI_escape_code

[EBNF] Extended Backus Naur Form https://en.wikipedia.org/wiki/Extended_Backus%E2%80%93Naur_form

[OpenGroup] The Open Group https://www.opengroup.org/

[ShellLang] Shell Command Language https://pubs.opengroup.org/onlinepubs/9699919799/utilities/V3_chap02.html

[ShellLangRationale] Rationale for Shell and Utilities, Shell Command Language, Shell Commands, Function Definition Command, p.2 https://pubs.opengroup.org/onlinepubs/9699919799/xrat/V4_xcu_chap02.html

[ShUtils] sh-utils repository https://github.com/ralph-mcardell/sh-utils

Ralph McArdell Ralph started programming around 1980 in the 6th form in BASIC on a teletype over a 110bps dialup modem. Since the mid 1990s he has been a freelance developer using a variety of systems and programming languages, with an emphasis on C++. He has been an ACCU member since the turn of the millenium and helps organise ACCU London meetings.