These are my data storage notes, targeting primarily personal data backups: regular files (documents, photo and music collections, not databases), moderate volume, added or edited rarely, backups are managed manually.
The "3-2-1 rule" for backups suggests keeping at least 3 copies of data, on at least 2 different storage devices, with at least one copy off-site.
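As a rough illustration of the rule, the check below reads a hand-maintained inventory of copies and reports whether it satisfies 3-2-1. The inventory format (one copy per line: a label, a device name, and either onsite or offsite) and the function name check_321 are made up for this sketch.

```shell
#!/bin/sh
# check_321: read an inventory from stdin, one copy per line:
#   LABEL DEVICE onsite|offsite
# Lines starting with '#' and lines with fewer than 3 fields are ignored.
# Prints a summary; succeeds only if the 3-2-1 rule is satisfied.
# (Hypothetical helper and format, for illustration only.)
check_321() {
    awk '
        /^#/ || NF < 3 { next }
        {
            copies++
            devices[$2] = 1
            if ($3 == "offsite") offsite++
        }
        END {
            n = 0
            for (d in devices) n++
            printf "copies=%d devices=%d offsite=%d\n", copies, n, offsite
            exit !(copies >= 3 && n >= 2 && offsite >= 1)
        }
    '
}

check_321 <<'EOF'
# label    device     location
working    nvme0      onsite
hdd        usb-hdd    onsite
flash      usb-flash  offsite
EOF
```

The exit status can be used in a reminder script; the summary line is only there to show which of the three conditions fell short.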
The regular infosec CIA triad (confidentiality, integrity, availability) is desirable and fairly straightforward to apply. We'll need encryption, so that lost or decommissioned drives won't leak personal data (i.e., crypto-shredding can be employed); integrity checking, so that we'll either read back the data that was written or detect data corruption; varied and common technologies (hardware interfaces, drivers, filesystems, file formats), so that there will be a good chance that at least some of the backups can be accessed with reasonable effort in different situations in the future.
The technologies covered here are usable for both backups and working storage. I prefer to use more general tools, since they tend to be better maintained, and learning them usually is a more useful time investment than learning specialized backup systems (but for those, see Bacula, Borg).
Reliable computer hardware is desirable to minimize errors and hardware failures: a UPS, ECC memory, and quality hardware (including storage) in general. External HDDs are cheap and handy for local backups, while USB flash drives seem more suitable for off-site backups (though less suitable for backups in general), being more robust for physical transfer. Both HDDs and flash drives provide interfaces different from those of the primary internal drives, and are easy to transfer, to plug into different machines, and to keep unplugged.
I find it useful (for peace of mind, at least) to set up a bootable operating system on at least one of the backup drives, with all the software necessary to read the backups. So there's usually an EFI system partition (ESP), an unencrypted partition for /boot (GRUB2 can handle encrypted ones, but it won't make much difference), an encrypted partition for the rest of the system (to prevent possible data leaks via cache, for instance, after backups are accessed from it), and a separate encrypted partition for the backup itself.
When installing a system using an installer, on a machine with more than one disk and some existing systems present, the installer would often use a seemingly random ESP on one of the internal disks, instead of the one on the backup drive. Fixing it may involve booting via the GRUB shell after GRUB fails to find or access its config from the /boot partition, remounting /boot/efi/ (and fixing it in /etc/fstab) to point to the correct drive's ESP, and then running grub-install to install it there, as well as removing undesirable directories from the ESP manually and adjusting things with efibootmgr. Or one can opt for a more involved, manual installation and set it up properly from the start: see, for instance, "Installing Debian GNU/Linux from a Unix/Linux System" and "Full disk encryption, including /boot: Unlocking LUKS devices from GRUB".
I do partitioning with fdisk, mostly because other common tools (or at least their fancy user interfaces) tend to be buggy, and/or to hide technical information, neither of which is desirable when partitioning storage devices. fdisk is nice, commonly available, and works well.
RAID 1 is nice to set if there are spare disks, but usually not as critical for redundant personal backups as it is, for instance, for a production server.
As of 2021 and for Linux-based systems, some of the common software options are:
Below are notes and command cheatsheets for the setups I use.
This is probably the most basic and widely supported setup for Linux-based systems. Only authenticated integrity checks are supported by cryptsetup (and those are experimental), so no CRC and no recovery from minor errors without RAID, apparently. CRC won't be useful for repairs on top of an encrypted partition either. Perhaps dm-integrity can be set separately to use CRC32C, but that would complicate the setup. Or it can be skipped altogether, since integrity checking is experimental, and wiping can slow down the process quite a bit (while skipping it easily leads to errors).
Initial setup:
    cryptsetup luksFormat --type luks2 --integrity hmac-sha256 /dev/sdXY
    # alternatively: cryptsetup luksFormat /dev/sdXY
    cryptsetup open /dev/sdXY backup2
    mkfs.ext4 /dev/mapper/backup2
    cryptsetup close backup2
    mkdir /var/lib/backup2
A typical session:
    cryptsetup open /dev/sdXY backup2
    mount -t ext4 /dev/mapper/backup2 /var/lib/backup2/
    # synchronize backups
    umount /var/lib/backup2/
    cryptsetup close backup2
When done, in order to safely eject a device, run eject /dev/sdX, or possibly udisksctl power-off -b /dev/sdX.
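The session above could be wrapped into a small script, so that the mapping is closed even if the synchronization step fails. This is only a sketch: the function names, the DRY_RUN variable, and the plain rsync --archive call used as the "synchronize backups" step are assumptions, not part of the setup described here. With DRY_RUN set, the commands are printed instead of being executed.

```shell
#!/bin/sh
# run: execute a command, or just print it if DRY_RUN is non-empty.
# (Hypothetical helper for this sketch.)
run() {
    if [ -n "${DRY_RUN:-}" ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

# backup_session DEVICE MAPPING_NAME MOUNT_POINT SOURCE_DIR
# Opens the LUKS device, mounts it, synchronizes, then unmounts and
# closes it. Returns the status of the synchronization step.
# Note: plain sh has no local variables, so these names are global.
backup_session() {
    dev=$1
    name=$2
    mnt=$3
    src=$4
    run cryptsetup open "$dev" "$name" || return 1
    if run mount -t ext4 "/dev/mapper/$name" "$mnt"; then
        # The trailing slash on SOURCE_DIR makes rsync copy its contents.
        run rsync --archive --exclude='lost+found' "$src" "$mnt/"
        status=$?
        run umount "$mnt"
    else
        status=1
    fi
    run cryptsetup close "$name"
    return "$status"
}

# Example (as root):
#   backup_session /dev/sdXY backup2 /var/lib/backup2 /home/user/
```

Running it once with DRY_RUN=1 is a cheap way to check the device and path arguments before touching anything.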
For RAID with mdadm, see "dm-crypt + dm-integrity + dm-raid = awesome!".
ZFS is not modular like LUKS and friends, there are license compatibility issues, and it's generally rather unusual, but apparently a good filesystem containing all the features needed here.
Initial setup:
    # Install zfsutils-linux
    apt install zfsutils-linux
    # Find a partition ID
    ls -l /dev/disk/by-id/ | grep sda4
    # Use that ID to create a single-device pool. The "mirror" keyword
    # should be added to set RAID 1.
    zpool create tank usb-WD_Elements_...-part4
    # Create an encrypted file system.
    mkdir /var/lib/backup/
    zfs create -o encryption=on -o keyformat=passphrase -o mountpoint=/var/lib/backup tank/backup
ZFS comes with its own mounting and unmounting commands, and if it's to be used from different systems, the pools should be exported and imported (or just force-imported). A typical session, assuming that it's used from different systems:
    # List pools available for import
    zpool import
    # Import the pool
    zpool import tank
    # Mount an encrypted file system
    zfs mount -l tank/backup
    # (Synchronize backups here)
    # Unmount the file system (or it'll happen on export)
    zfs unmount tank/backup
    # Unmount the pool too (also unnecessary to do manually though)
    zfs unmount tank
    # Export the pool
    zpool export tank
S.M.A.R.T. monitoring and testing can be done with smartmontools, and is usually supported even by external and older USB drives.
I normally use just rsync --archive for the initial backup, then rsync --exclude='lost+found' --archive --verbose --checksum --dry-run --delete to compare backups and for data scrubbing, and the same command without --dry-run afterwards, if everything looks fine.
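rsync --checksum compares two copies against each other; to check a single copy against its own earlier state, a checksum manifest is one option. This is not part of the workflow described above, just a minimal sketch that assumes GNU coreutils and findutils; the SHA256SUMS file name and the function names are arbitrary.

```shell
#!/bin/sh
# make_manifest DIR
# Write DIR/SHA256SUMS with a checksum for every regular file under DIR
# (the manifest itself excluded). Hypothetical helper for this sketch.
make_manifest() {
    (
        cd "$1" || exit 1
        find . -type f ! -name SHA256SUMS -print0 |
            sort -z |
            xargs -0 -r sha256sum > SHA256SUMS
    )
}

# verify_manifest DIR
# Re-read every file listed in DIR/SHA256SUMS and compare checksums.
# Prints only the mismatches; fails if there are any.
verify_manifest() {
    (
        cd "$1" || exit 1
        sha256sum --check --quiet SHA256SUMS
    )
}

# Example:
#   make_manifest /var/lib/backup2     # after synchronizing
#   verify_manifest /var/lib/backup2   # later, as a scrub
```

The manifest has to be regenerated after every intentional change, and files added since the last run are not covered until then.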
For data erasure, dd is handy for wiping both disks and partitions (before decommissioning drives, or if there were unencrypted partitions before), e.g.:
    dd status=progress if=/dev/urandom of=/dev/sdX bs=1M
    dd status=progress if=/dev/urandom of=/dev/sdXY bs=1M
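Since dd overwrites whatever it is pointed at, a guard that refuses to touch a mounted device may save some grief. The sketch below is an assumption of how such a guard could look, with made-up function names; it only compares device path prefixes against mount sources, so it will not notice a device that is in use through a device-mapper name such as /dev/mapper/backup2.

```shell
#!/bin/sh
# is_mounted DEVICE [MOUNTS_FILE]
# Succeeds if DEVICE, or anything whose path starts with DEVICE (such
# as its partitions), is listed as a mount source. MOUNTS_FILE defaults
# to /proc/mounts. Hypothetical helper for this sketch.
is_mounted() {
    awk -v dev="$1" '
        index($1, dev) == 1 { found = 1 }
        END { exit !found }
    ' "${2:-/proc/mounts}"
}

# wipe DEVICE
# Overwrite DEVICE with random data, unless it appears to be mounted.
wipe() {
    if is_mounted "$1"; then
        echo "refusing to wipe $1: it (or a partition on it) is mounted" >&2
        return 1
    fi
    dd status=progress if=/dev/urandom of="$1" bs=1M
}

# Example (as root):
#   wipe /dev/sdX
```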
Public data may be useful to back up as well: its regular sources may be censored/blocked by a government, or simply become unavailable because of a technical issue (along with the rest of the Internet if the issue is near the user). In that case, the focus should be on high availability, probably along with integrity, while confidentiality hardly matters (unless it is outlawed). I think even unencrypted NTFS is good enough for this, and easily readable from any common system.
As for the data to back up (and later read) this way, Kiwix is a nice project. Its primary viewer may seem awkward for use in normal circumstances, but apparently it aims to be useful to the general public and in bad circumstances: it provides archives as packages, while the viewer (with versions for every common OS) can also serve those to others in a local network via a web browser. library.kiwix.org provides, among others, indexed archives of Project Gutenberg (about 60,000 public domain books), Wikipedia, Wikibooks, Wikiversity, Wiktionary, Wikisource, ready.gov, WikiHow, various StackExchange projects, and many smaller bits like ArchWiki, RationalWiki, Explain XKCD (contains the comics). As of 2022, those would take just 200 to 300 GB, even with images and some non-English versions added.
Other large and legal archives to consider for backing up: Wikimedia Downloads, Complete OSM Data, maybe Debian archive mirroring and other software archives, arXiv and other Open Access sources. If one gets into tape storage, Common Crawl can be considered too. And then there are copyright-infringing but much larger libraries like Library Genesis (for a trimmed-down, txt-only version, see offline-os), as well as music and movies (particularly long TV series may be good for hoarding; out of nice sci-fi ones, there are Doctor Who, Star Trek, Red Dwarf, Farscape, Lexx, Firefly, Defiance, Battlestar Galactica, Babylon 5, The X-Files, First Wave; plenty more can be found in Wikipedia).
YouTube videos may be useful to hoard as well: there are many nice ones, including educational channels, and platforms like it seem to get blocked quickly when a government tries to block information flows (see censorship of YouTube). At 480p most videos would be watchable and not take much space (perhaps 2 to 5 MB per minute), and one can download them with youtube-dl, e.g.: youtube-dl --download-archive archive.txt -f 'bestvideo[height<=480]+bestaudio/best[height<=480]' 'https://www.youtube.com/c/3blue1brown/videos' (see also: some tricks to avoid throttling). I've collected some video links, including interesting YouTube channels. I think it's best to go after relatively information-dense ones (lectures, online lessons) first, possibly followed by entertainment-education, pop-sci, and documentaries.
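To get a feel for the storage needed, the 2 to 5 MB per minute figure can be turned into a quick estimate. The function name and the 100-hour example are made up for this sketch, and the result is only as good as the per-minute guess.

```shell
#!/bin/sh
# estimate_mb HOURS MB_PER_MINUTE
# Print the approximate size of HOURS of video, in MB.
# (Hypothetical helper; integer arithmetic only.)
estimate_mb() {
    echo $(( $1 * 60 * $2 ))
}

# 100 hours of video at the low and high ends of the 480p estimate.
for rate in 2 5; do
    echo "100 hours at ${rate} MB/min: about $(( $(estimate_mb 100 "$rate") / 1000 )) GB"
done
```

So 100 hours of lectures would come to roughly 12 to 30 GB under these assumptions.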
When backing up data to a remote (and usually less trusted) machine, it should be encrypted and verified client-side (so options like plain rsync over SSH are not suitable), but preferably still allowing for incremental backups (so tar and gpg are not suitable, either). One can still employ LUKS or ZFS though, by accessing remote block devices via iSCSI (in particular, tgt and open-iscsi seem to work smoothly on Debian), NBD, or similar protocols, possibly on top of IPsec, WireGuard, tunnels made with SSH port forwarding, TLS (e.g., with stunnel), or anything else establishing a secure channel, to add encryption and more secure authentication.
A test iSCSI setup example:
    # server (192.168.1.2)
    apt install tgt
    dd if=/dev/zero of=/tmp/iscsi.disk bs=1M count=128
    tgtadm --lld iscsi --op new --mode target --tid 1 --targetname iqn:2024-07:com.example:tmp-iscsi.disk
    tgtadm --lld iscsi --op show --mode target
    tgtadm --lld iscsi --op new --mode logicalunit --tid 1 --lun 1 -b /tmp/iscsi.disk
    tgtadm --lld iscsi --op new --mode account --user foo --password bar
    tgtadm --lld iscsi --op show --mode account
    tgtadm --lld iscsi --op bind --mode target --tid 1 --initiator-address 192.168.1.3 --initiator-name foo
    tgtadm --lld iscsi --op unbind --mode target --tid 1 --initiator-address 192.168.1.3 --initiator-name foo
    tgtadm --lld iscsi --op bind --mode target --tid 1 --initiator-address 192.168.1.3

    # client (192.168.1.3)
    apt install open-iscsi lsscsi
    iscsiadm --mode discovery --type sendtargets --portal 192.168.1.2
    iscsiadm --mode node --targetname iqn:2024-07:com.example:tmp-iscsi.disk --portal 192.168.1.2 --login
    iscsiadm --mode session --print=1
    lsscsi
    # a block device is available at this point
    iscsiadm --mode node --targetname iqn:2024-07:com.example:tmp-iscsi.disk --portal 192.168.1.2 --logout