Dealing with large files can be hard. Ian Bruntlett muses on various approaches that can help on Linux.
Motivation for this article
Files are important, which is why we entrust them to computer systems. For backups, I use a personal shell script which creates .tar.gz files, which then get backed up to an external drive [Bruntlett16, Bruntlett21]. For a long time that worked, until one of my .tar.gz files exceeded the 4GB barrier. Attempting to copy it eventually resulted in an error message complaining that the file was too big for the file system (a 4GiB per-file limit is characteristic of FAT32, the format many USB drives ship with). So, I started using ext4 instead.
That works, with one complication that bites if you distro-hop (I don’t): ext4 stores user and group ownership metadata on the filesystem. If you save data as one user, say with a UID (User ID) of 1000, and later try to access it from a different user account, say with a UID of 1001, the permissions may not be what you expect.
As of the 5.4 release (November 2019) of the Linux kernel, native exFAT support is built in. This filesystem does not store user or group metadata, so UID and permission conflicts don’t arise. After much experimentation, I figure that exFAT is the way to go for USB flash drives and hard drives.
More about storage using ext4
I use Linux and my files are currently all stored on drives formatted to the ‘ext4’ format. What is ext4? Well, to quote Wikipedia, “ext4 (fourth extended filesystem) is a journaling file system for Linux”. It works. That was good enough for me.
Until you try to access the saved data from a different user account. From what I have seen on my systems, the conflict tends to be that on the originating system you have full read/write/execute permissions, whereas if you try to access the data as a different user you fall under the ‘other’ permissions and only have read/execute. As always with Linux, ‘it depends’ on how your system is set up and what your default access permissions have been set to using the umask command. umask is a shell built-in, so you’ll need to use the command help umask for more information.
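To make the effect of umask concrete, here is a minimal sketch, assuming bash and GNU coreutils (for stat --format). demo_umask is a hypothetical helper name used only for illustration: it creates an empty file under a given mask and prints the permissions the file ended up with.

```shell
#!/bin/bash
# demo_umask MASK
# Create an empty file under the given umask and print the resulting
# permission bits in octal. The body runs in a subshell, so the umask
# of the calling shell is left untouched.
demo_umask() (
  dir=$(mktemp -d) || exit 1
  umask "$1"
  touch "$dir/f"              # touch asks for 666; the umask removes bits
  stat --format=%a "$dir/f"
  rm -r "$dir"
)

demo_umask 022   # group and other lose write permission
demo_umask 077   # group and other lose everything
```

With a mask of 022 a new file is readable by everyone but writable only by its owner, which matches the "read but not write" behaviour described above when a different UID looks at the data.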
Well, ext4 on Linux systems is usually configured to reserve a certain percentage of the drive for privileged processes, typically 5%. This is helpful if the partition holds your root filesystem. However, if you are strange enough to use ext4 for things like external drives (such as a USB hard drive), you find that you run out of space sooner than expected. This is because an ext4 filesystem has various parameters stored away. The one we are interested in is ‘Reserved block count’. If it is non-zero, it indicates the number of blocks exclusively available to privileged processes. So if you are using an ext4-formatted external drive, you have three options:
- Just ignore the fact you aren’t using the full drive.
- Use sudo to copy files to the drive.
- Set the Reserved Block Count to 0 using the tune2fs command.
Linux presents a single filesystem tree that can be stored across one or more drives. That tree gives programs a way to access stored data, such as /home, without having to think about which device, or even which partition, a file or directory is on.
How do you find out where a file or directory is physically stored? You use the df command, which ‘reports file system disk space usage’. Cryptic, no? To get an overview of your filesystem storage, type df at the command line; adding -T shows each filesystem’s type and -h prints sizes in human-readable units. You might see something like Figure 1.
$ df -Th
Filesystem     Type   Size  Used Avail Use% Mounted on
tmpfs          tmpfs  784M  2.1M  782M   1% /run
/dev/sda5      ext4   909G  370G  493G  43% /
tmpfs          tmpfs  3.9G     0  3.9G   0% /dev/shm
tmpfs          tmpfs  5.0M  4.0K  5.0M   1% /run/lock
tmpfs          tmpfs  784M  148K  784M   1% /run/user/1000
/dev/sdb1      ext4    14G  1.7G   12G  13% /media/ian/HERMES
Figure 1
The above output is partially interesting. I ignore the entries for tmpfs, which are RAM disks used by Linux itself.
The interesting stuff begins with /dev, an abbreviation of ‘device’. The critical entry is the / (root) partition. Interestingly enough, you can place (‘mount’) parts of the Linux file system over multiple drives and partitions. So, if you wanted to, you could mount the root (/) filesystem on one drive and the home files of all users (/home) on a different drive. Note that the ‘Type’ column is important: this article is only relevant to filesystems with a type of ext4 (and presumably earlier versions of ext as well, although this hasn’t been tested on them).
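df also accepts a path argument, in which case it reports only the filesystem that holds that path. The following is a minimal sketch, assuming GNU df (for the --output option); where_is is a hypothetical helper name, not part of the author's scripts.

```shell
#!/bin/bash
# where_is PATH
# Print the device, filesystem type and mount point of the filesystem
# that holds PATH. tail drops the header line that df prints first.
where_is() {
  df --output=source,fstype,target "$1" | tail -n 1
}

where_is /
where_is "$HOME"
```

If the second column does not say ext4, the reserved-block discussion in this article does not apply to that path.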
To view the Reserved Block Count (the default is 5% of the filesystem), use the tune2fs command. This assumes that the filesystem is on the device /dev/sdb1:
$ sudo tune2fs -l /dev/sdb1 | grep Reserved
Reserved block count:     97766
Reserved GDT blocks:      477
Reserved blocks uid:      0 (user root)
Reserved blocks gid:      0 (group root)
To set the Reserved Block Count to 0, use the tune2fs command again:
george@lucas:~$ sudo tune2fs -m 0 /dev/sdb1
tune2fs 1.46.5 (30-Dec-2021)
Setting reserved blocks percentage to 0% (0 blocks)
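The raw block count is hard to judge at a glance, so it can help to turn it into a percentage. This is a rough sketch, assuming the ‘Block count:’ and ‘Reserved block count:’ lines that tune2fs -l prints; reserved_percent is a hypothetical helper name and reads the tune2fs listing from standard input.

```shell
#!/bin/bash
# reserved_percent
# Read the output of "tune2fs -l" on standard input and print the
# reserved blocks as a percentage of all blocks, to one decimal place.
reserved_percent() {
  awk -F: '
    /^Block count:/          { total = $2 + 0 }
    /^Reserved block count:/ { reserved = $2 + 0 }
    END { if (total > 0) printf "%.1f\n", 100 * reserved / total }
  '
}

# Typical use (needs root and a real ext4 device, here assumed to be /dev/sdb1):
#   sudo tune2fs -l /dev/sdb1 | reserved_percent
```

Running it before and after tune2fs -m 0 is a quick way to confirm that the change took effect.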
So, now we can use all of an ext4 filesystem’s space for our files. How do we keep them safe? By backing up, to a memory stick. How do we keep our memory sticks synchronised with the contents of our main drive(s)?
Keeping backup drives up to date
What if you want to quickly check whether a directory on a main drive is reasonably similar to its copy on an external drive? I wrote a shell script, irb-dirstat (in Listing 1, at the end of the article), which, given a directory, counts the number of bytes, files, and directories in it (including its subdirectories).
Here is the help / usage message for the script:
$ ./irb-dirstat --help
irb-dirstat: usage irb-dirstat directory1 [directory2 etc]
Used to check actual number of bytes, files, and directories in a directory
Formatting output options for bytes used in files
 -b or -B     output number of bytes (this is the default)
 -k or -K     output number of KiB
 -m or -M     output number of MiB
 -g or -G     output number of GiB
 -t or -T     output number of TiB
 -e or -E     output number of EiB
 --commas     output byte count with commas
 --no-commas  output byte count without commas
 --help       display this message
The above information should be fairly obvious. It is useful when doing rough comparisons of a couple of sub-directories. Here is a sample of its output…
~$ ./irb-dirstat --commas ~/isos
/home/ian/isos
Byte count 76,884,580,097
Dir count 71
File count 148
To take advantage of globbing, irb-dirstat can handle one or more directory names as arguments. This is particularly useful when dealing with multiple directories or when comparing a source directory with a destination directory.
However, irb-dirstat is best used for a rough but quick comparison of directory trees. GNOME’s meld command will do a thorough check of two subdirectories (it does other things as well). Unfortunately, it is necessarily slow and sometimes crashes with an error message, so it isn’t something I rely on alone.
The diff command can check two directories recursively if you pass it two directories and the -r flag. I haven’t managed to crash this command and its output is very helpful. I tend to use irb-dirstat to quickly ensure the drives are reasonably synchronised and finally use the diff command for a more thorough, byte-by-byte comparison.
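The quick-then-thorough workflow can be sketched in a few lines. This is an illustration only, not the author's irb-dirstat: count_files and compare_trees are hypothetical helper names, and the quick stage compares only the number of regular files rather than bytes and directories as well.

```shell
#!/bin/bash
# count_files DIR
# Count the regular files below DIR (file names containing newlines
# would be miscounted; good enough for a rough check).
count_files() {
  find "$1" -type f | wc -l
}

# compare_trees SOURCE DEST
# Stage 1 is cheap: compare the file counts.
# Stage 2 is thorough: diff -r compares contents, -q only reports
# whether files differ instead of printing the differences.
compare_trees() {
  if [ "$(count_files "$1")" -ne "$(count_files "$2")" ]; then
    echo "file counts differ"
    return 1
  fi
  if diff -rq "$1" "$2" > /dev/null; then
    echo "trees match"
  else
    echo "contents differ"
    return 1
  fi
}

# Typical use (both paths are placeholders):
#   compare_trees ~/isos /media/ian/HERMES/isos
```

Skipping the expensive diff when the counts already disagree is the point of the two stages; drop the -q flag and the redirection to see which files differ.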
#!/bin/bash

# default values, to be overridden by command
# line options
divisor_scaling="Byte"
divisor=1
use_commas=0

function report_bytes_used_in_files()
{
  echo -n "$divisor_scaling count "
  byte_count=$(count_bytes_used_in_files "$1")
  if [ $use_commas -eq "1" ] ; then
    printf "%'f\n" "$byte_count"
  else
    echo "$byte_count"
  fi
}

function count_bytes_used_in_files()
{
  # this command inspired by
  # https://stackoverflow.com
  find "$1/"* -type f -print0 | \
    xargs -0 stat --format=%s | \
    awk -v divisor="$divisor" \
    '{s+=$1} END {print s/divisor}'
}

function report_no_of_directories()
{
  echo -n "Dir count "
  count_no_of_directories "$1"
}

function count_no_of_directories()
{
  find "$1"/* -type d | wc -l
}

function report_no_of_files()
{
  echo -n "File count "
  count_no_of_files "$1"
}

function count_no_of_files()
{
  find "$1"/* -type f | wc -l
}

# return value 0=files present,
# 1=error or no files present
function are_there_any_files()
{
  find "$1"/* -maxdepth 1 -type f \
    -o -type d -iname "*" 1> /dev/null
}

function do_dirstat()
{
  if [ $# -ne 1 ] ; then
Listing 1
The whole of irb-dirstat can be found online at https://github.com/ian-bruntlett/studies
References
[Bruntlett16] Ian Bruntlett ‘Stufftar’ in Overload 132, April 2016, available at https://accu.org/journals/overload/24/132/bruntlett_2226/
[Bruntlett21] Ian Bruntlett ‘Stufftar Revisited’ in Overload 165, October 2021, available at https://accu.org/journals/overload/29/165/bruntlett/
Ian is a keen reader of software development books. He has promised himself a long stint at dealing with C++, once he has got to grips with Git.