Drive Musings on Linux

Dealing with large files can be hard. Ian Bruntlett muses on various approaches that can help on Linux.

Motivation for this article

Files are important, which is why we entrust them to computer systems. For backups, I use a personal shell script that creates .tar.gz files, which then get backed up to an external drive [Bruntlett16, Bruntlett21]. For a long time that worked, until one of my .tar.gz files exceeded the 4GB barrier. Attempting to copy big files then resulted in an error message complaining that the file was too big for the file system. So, I started using ext4 instead.
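A backup script can spot oversized archives before the copy fails. The sketch below is not part of my backup script; the function name files_over_limit is made up for illustration, and the default limit assumes the destination is FAT32 (per-file maximum of 4 GiB minus one byte), which the text above does not actually state.

```bash
#!/bin/bash
# List regular files under a directory that are larger than a size limit.
# Default limit: 4294967295 bytes (4 GiB - 1), the FAT32 per-file maximum.
# ASSUMPTION: the destination drive is FAT32; adjust the limit otherwise.
files_over_limit()
{
  local dir="$1"
  local limit="${2:-4294967295}"
  # GNU find: -size +Nc matches files whose size exceeds N bytes
  find "$dir" -type f -size "+${limit}c" -print
}
```

Calling files_over_limit ~/backups prints nothing if every file would fit; any path it does print needs splitting or a different destination filesystem.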

That works, except when you distro-hop (I don’t) or otherwise change user accounts. ext4 stores user and group ownership on the filesystem as numeric IDs, which leads to complications if you save data as one user with, say, a UID (User ID) of 1000 and later try to access it from a different account with, say, a UID of 1001.
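To see the numeric IDs involved, compare what the filesystem recorded with the account you are logged in as. This is a minimal sketch using GNU stat and id; the function names owner_of and owned_by_me are hypothetical.

```bash
#!/bin/bash
# Print the numeric owner UID, group GID and permission string of a path.
owner_of()
{
  stat --format='%u %g %A' -- "$1"
}

# Succeeds if the current user is the numeric owner of the path.
owned_by_me()
{
  [ "$(stat --format=%u -- "$1")" -eq "$(id -u)" ]
}
```

If owned_by_me fails for a file on an external ext4 drive, you are being treated as ‘other’ (or as a group member) for that file, whatever the user name on the original machine was.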

As of the 5.4 release (November 2019) of the Linux kernel, native exFAT support is built in. This filesystem does not store user or group metadata, so UID and permission conflicts don’t arise. After much experimentation, I figure that exFAT is the way to go for USB flash drives and hard drives.

More about storage using ext4

I use Linux and my files are currently all stored on drives formatted to the ‘ext4’ format. What is ext4? Well, to quote Wikipedia, “ext4 (fourth extended filesystem) is a journaling file system for Linux”. It works. That was good enough for me.

That is, until you try to access the saved data from a different user account. From what I have seen on my systems, the conflict tends to be that on the originating system you have full read/write/execute permissions, whereas if you try to access the data through the ‘other’ permissions, you only have read/execute permissions. As always with Linux, ‘it depends’ on how your system is set up and on your default access permissions, which are set with the umask command. umask is a shell built-in, so use help umask (rather than a man page) for more information.
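The effect of umask is easy to observe. The sketch below, with a hypothetical function name mode_under_umask, creates a scratch file under a given umask and prints the octal mode it ended up with. It relies on touch requesting mode 666 for new files, from which the umask bits are then removed.

```bash
#!/bin/bash
# Show the octal mode a new file receives under a given umask.
# Runs in a subshell so the caller's umask is left untouched.
mode_under_umask()
{
  (
    umask "$1"
    tmpdir=$(mktemp -d) || exit 1
    touch "$tmpdir/probe"
    stat --format=%a "$tmpdir/probe"
    rm -rf "$tmpdir"
  )
}

mode_under_umask 022   # typically prints 644: 'other' may read but not write
mode_under_umask 077   # typically prints 600: 'other' gets nothing
```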

There is a second wrinkle. ext4 filesystems are usually created with a certain percentage of the drive reserved for privileged processes, typically 5%. This is helpful if the partition holds your root filesystem. However, if you use ext4 for things like external drives (such as USB hard drives), you find that you run out of space sooner than expected. This is because an ext4 filesystem has various parameters stored away. The one we are interested in is ‘Reserved block count’. If it is non-zero, it gives the number of filesystem blocks (not sectors) available only to privileged processes. So if you are using an ext4-formatted external drive, you have three options:

Just ignore the fact you aren’t using the full drive.
Use sudo to copy files to the drive.
Set the Reserved Block Count to 0 using the tune2fs command.

Linux presents a single filesystem tree that can be stored across one or more drives. That tree gives programs a way to access stored data, such as /home, without having to think about which device, or even which partition, a file or directory is on.

How do you find out where a file or directory is physically stored? You use the df command, which ‘reports file system disk space usage’. Cryptic, no? To get an overview of your filesystem storage, type df -Th at the command line (-T adds the filesystem type column and -h prints human-readable sizes) and you might see something like Figure 1.

$ df -Th
Filesystem    Type   Size  Used Avail Use% Mounted on
tmpfs         tmpfs  784M  2.1M  782M   1% /run
/dev/sda5     ext4   909G  370G  493G  43% /
tmpfs         tmpfs  3.9G     0  3.9G   0% /dev/shm
tmpfs         tmpfs  5.0M  4.0K  5.0M   1% /run/lock
tmpfs         tmpfs  784M  148K  784M   1% /run/user/1000
/dev/sdb1     ext4    14G  1.7G   12G  13% /media/ian/HERMES

Figure 1

The above output is only partially interesting. I ignore the entries for tmpfs, which are RAM-backed filesystems used by Linux itself.

The interesting stuff begins with /dev, an abbreviation of ‘device’. The critical entry is the / (root) partition. Interestingly enough, you can place (‘mount’) parts of the Linux file system over multiple drives and partitions. So, if you wanted to, you could mount the root (/) filesystem on one drive and the home files of all users (/home) on a different drive. Note that the ‘Type’ column is important: the reserved-block discussion that follows is only relevant to filesystems with a type of ext4 (and presumably earlier versions of ext as well, though this hasn’t been tested on them).
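To ask about one particular path rather than the whole system, df also accepts a file or directory name and reports only the filesystem that holds it. A small sketch follows, assuming GNU coreutils (the --output option is a GNU extension); the function name where_is is made up.

```bash
#!/bin/bash
# Report the device, filesystem type and mount point holding a path.
# --output is a GNU coreutils df option.
where_is()
{
  df --output=source,fstype,target -- "$1" | tail -n 1
}

where_is /
```

Running where_is on a directory under /media shows at a glance whether an external drive is ext4 or something else before you reach for tune2fs.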

To view the Reserved Block Count (the default reservation is 5%), use the tune2fs command. This assumes that the filesystem is on the device /dev/sdb1:

  $ sudo tune2fs -l /dev/sdb1 | grep Reserved
  Reserved block count:     97766
  Reserved GDT blocks:      477
  Reserved blocks uid:      0 (user root)
  Reserved blocks gid:      0 (group root)
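The count is in filesystem blocks, so turning it into bytes needs the block size, which tune2fs -l also prints on its ‘Block size’ line. The sketch below reads that listing on standard input; the function name reserved_bytes is hypothetical, and the 4096-byte block size used in the example is an assumption, since the excerpt above only shows the ‘Reserved’ lines.

```bash
#!/bin/bash
# Read `tune2fs -l` output on stdin and print the reserved space in bytes.
reserved_bytes()
{
  awk -F': *' '
    /^Reserved block count:/ { blocks = $2 }
    /^Block size:/           { size   = $2 }
    END { printf "%.0f\n", blocks * size }
  '
}

# Typical use (needs root to read the device):
#   sudo tune2fs -l /dev/sdb1 | reserved_bytes
```

With 97766 reserved blocks and an assumed block size of 4096 bytes, that is 400,449,536 bytes, roughly 400 MB that ordinary users cannot write to.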

To set the Reserved Block Count to 0, use the tune2fs command again, this time with -m 0 (the reserved percentage):

  george@lucas:~$ sudo tune2fs -m 0 /dev/sdb1
  tune2fs 1.46.5 (30-Dec-2021)
  Setting reserved blocks percentage to 0% (0 blocks)

So, now we can use all of an ext4 filesystem’s space for our files. How do we keep them safe? By backing them up to a memory stick. How do we keep our memory sticks synchronised with the contents of our main drive(s)?

Keeping backup drives up to date

What if you want to quickly check whether a directory on a main drive is reasonably similar to a copy on an external drive? I wrote a shell script, irb-dirstat (Listing 1, at the end of the article), which, given a directory, counts the number of bytes, files, and directories in it (including its subdirectories).

Here is the help / usage message for the script:

  $ ./irb-dirstat --help
  irb-dirstat: usage irb-dirstat directory1 [directory2 etc]
  Used to check actual number of bytes, files, and directories in a directory
  
  Formatting output options for bytes used in files
  -b or -B output number of bytes (this is the default)
  -k or -K output number of KiB
  -m or -M output number of MiB
  -g or -G output number of GiB
  -t or -T output number of TiB
  -e or -E output number of EiB
  --commas    output byte count with commas
  --no-commas output byte count without commas
  
  --help display this message

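The above should be fairly self-explanatory. Note that the arguments are processed in order, so a formatting option only affects directories named after it on the command line. The script is useful when doing rough comparisons of a couple of sub-directories. Here is a sample of its output…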

  ~$ ./irb-dirstat --commas ~/isos
  /home/ian/isos
  Byte count 76,884,580,097
  Dir  count 71
  File count 148

To take advantage of globbing, irb-dirstat can handle one or more directory names as arguments. This is particularly useful when dealing with multiple directories or to compare a source directory with a destination directory.

However, irb-dirstat is best used for rough but quick comparison of directory trees. GNOME’s meld command will do a thorough check of two subdirectories (it does other things as well). Unfortunately, it is necessarily slow and sometimes crashes with an error message so it isn’t something I rely on alone.

The diff command can check two directories recursively, if you pass it two directories and the -r flag. I haven’t managed to crash this command and its output is very helpful. I tend to use irb-dirstat to quickly ensure the drives are reasonably synchronised and finally use the diff command for a more thorough, byte by byte comparison.
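For the final check, the exit status of diff is as useful as its output: 0 means no differences were found, 1 means some were. A minimal wrapper follows, with the hypothetical name trees_match; adding -q keeps the output to one line per differing file instead of a full listing of the changes.

```bash
#!/bin/bash
# Succeeds (exit status 0) when two directory trees have identical contents.
# -r recurses into subdirectories; -q reports only which files differ.
trees_match()
{
  diff -rq -- "$1" "$2"
}
```

Something like trees_match ~/isos /media/ian/HERMES/isos && echo "in sync" then makes the check easy to script.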

#!/bin/bash
# default values, to be overridden by command
# line options
divisor_scaling="Byte"
divisor=1
use_commas=0

function report_bytes_used_in_files()
{
  echo -n "$divisor_scaling count "
  byte_count=$(count_bytes_used_in_files "$1")
  if [ "$use_commas" -eq 1 ] ; then
    # %' groups thousands (locale dependent);
    # .0 drops the fractional part
    printf "%'.0f\n" "$byte_count"
  else
    echo "$byte_count"
  fi
}

function count_bytes_used_in_files()
{
  # this command inspired by 
  # https://stackoverflow.com
  # -r: do not run stat if find finds no files
  find "$1/"* -type f -print0 | \
  xargs -0 -r stat --format=%s | \
  awk -v divisor="$divisor" \
    '{s+=$1} END {print s/divisor}'
}

function report_no_of_directories()
{
  echo -n "Dir  count "
  count_no_of_directories "$1"
}

function count_no_of_directories()
{
  find "$1"/* -type d | wc -l
}

function report_no_of_files()
{
  echo -n "File count "
  count_no_of_files "$1"
}

function count_no_of_files()
{
  find "$1"/* -type f | wc -l
}

# return value 0=files present,
# 1=error or no files present
function are_there_any_files()
{
  find "$1"/* -maxdepth 1 -type f \
   -o -type d -iname "*" 1> /dev/null
}

function do_dirstat()
{
  if [ $# -ne 1  ] ; then
    echo "do_dirstat expects exactly 1" \
     "parameter, got $#." >&2;
    return 1;
  fi;

  # check the argument type first, so that a
  # non-directory is reported as such rather
  # than as an empty directory
  if [ ! -d "$1"  ] ; then
    echo "Parameter $1 is not a directory" >&2;
    return 1;
  fi;

  if ! are_there_any_files "$1" ; then
    # echo NO FILES
    echo "$1"
    echo "$divisor_scaling" count 0
    echo "Dir  count 0"
    echo "File count 0"
    echo
    return 1 # is a useful value?
  fi
  
  echo "$1"
  report_bytes_used_in_files "$1"
  report_no_of_directories "$1"
  report_no_of_files "$1"
  echo
}

function display_help()
{
  cat <<END_OF_HELP
irb-dirstat: usage irb-dirstat directory1 [directory2 etc]
Used to check actual number of bytes, files, and directories in a directory
Formatting output options for bytes used in files
-b or -B output number of bytes (this is the default)
-k or -K output number of KiB
-m or -M output number of MiB
-g or -G output number of GiB
-t or -T output number of TiB
-e or -E output number of EiB
--commas    output byte count with commas
--no-commas output byte count without commas

--help display this message
END_OF_HELP
}

if [ $# -eq 0 ] ; then
  display_help
  exit
fi

for arg in "$@"
do
  case "$arg" in
  -b|-B) divisor_scaling="Byte";
         divisor=1 ;;
  -k|-K) divisor_scaling="KiB ";
         divisor=1024 ;;
  -m|-M) divisor_scaling="MiB ";
         divisor=1048576 ;;
  -g|-G) divisor_scaling="GiB ";
         divisor=$((2**30)) ;;
  -t|-T) divisor_scaling="TiB ";
         divisor=$((2**40)) ;;
  # 1 EiB is 2**60 bytes (2**50 would be PiB)
  -e|-E) divisor_scaling="EiB ";
         divisor=$((2**60)) ;;
  --help) display_help ;;
  --commas) use_commas=1;;
  --no-commas) use_commas=0;;
  *) do_dirstat "$arg";;
  esac
done

Listing 1

The whole of irb-dirstat can be found online at https://github.com/ian-bruntlett/studies
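One behaviour of Listing 1 is worth knowing when comparing trees: because it builds its paths with the shell glob "$1"/*, top-level entries whose names begin with a dot are not visited under bash’s default settings. The variant below is a sketch of an alternative, not the published script; it lets find walk the directory itself, and -mindepth 1 stops the starting directory from being counted. The function names are made up.

```bash
#!/bin/bash
# Count regular files below a directory, including dot-files.
count_all_files()
{
  find "$1" -mindepth 1 -type f | wc -l
}

# Count directories below a directory, excluding the directory itself.
count_all_dirs()
{
  find "$1" -mindepth 1 -type d | wc -l
}
```

Either approach works for a rough comparison as long as the same one is used on both the source and the backup.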

References

[Bruntlett16] Ian Bruntlett ‘Stufftar’ in Overload 132, April 2016, available at https://accu.org/journals/overload/24/132/bruntlett_2226/

[Bruntlett21] Ian Bruntlett ‘Stufftar Revisited’ in Overload 165, October 2021, available at https://accu.org/journals/overload/29/165/bruntlett/

Ian Bruntlett is a keen reader of software development books. He has promised himself a long stint at dealing with C++, once he has got to grips with Git.
