Mastering Batch File Manipulation

Being lazy requires some practice

Countless of times, have I had to use a library, a framework or any other piece of software, that did not comply with my project build system, documentation system, or whatsoever system. In such tough time, you can either modify each file manually or ... use some kind of scripting to greatly speed up the process.

But, return on investment does not seem obvious when you do not yet know how to do it. You have to learn and master a certain amount of tools before being fluent at the task. You may think you are pressed by the time and actually never get into it. Each time you don't try it, must be seen as a missed opportunity to learn it.

Later in the article, I will try to explain the tools I master. But they are not the only ones that can be used. First, I'm using linux environment, then if you are a windows user, this may not help you practically. Second, maybe the tools I use are not the most efficient, you may try to get even better toolset. And finally, you should also use tools that matches your way of thinking and the knowledge you have. For example, I studied mathematics and theoretical computer science a lot before finally switching to engineering studies. This gave me some thinking patterns that may not be yours. So don't be frustrated if these tools do not fit you. The essential is to find the ones that are fitting you.

Using Google search engine, ChatGPT, or whatsoever to find a new strategy is fine as long as you try to understand and master what you find. A good way to know if you understand something, is to try to use it in a context that you did not yet encounter and check if you can make it out. Confronting reality is the only way to test your limits.

"This is the way" — The Mandalorian

Before getting into more details here about my toolset. There are some prerequistes you need to understand. I'll try to keep it short.

Prerequisites

Shell basics

I'll concentrate on bash shell even though I use zsh, because it's more widely used by beginners, being the default one on most linux distributions.

When you open a terminal, it runs the default shell (most probably bash) of your user. In this shell, you can enter some code, from simple command calls to function declaration, to loops and so on.
Examples:

This is to ensure you realize that your shell is more than a command prompt. It is actually running an interpreter for a scripting language. This scripting language, allows you to perform complex tasks without having to code a dedicated program in C for example.

Strings

In bash every variable is either a string or an array of strings. You can declare them like this:

echo "Without string delimiters"

# You can declare without delimiters only if there is no space or
# special characters.
VAR_00=write_something_without_space
declare -p VAR_00  # This is to print the variable declaration

# You can declare a string array like this
VAR_01=(write something even with spaces)
declare -p VAR_01  # This is to print the variable declaration

echo "With double quote delimiters"

# You can declare anything containing spaces
VAR_10="write 'something' even with spaces"
declare -p VAR_10  # This is to print the variable declaration

# You can concatenate string declaration. This is useful to mix
# various delimiters.
VAR_11="write ""something"" even with spaces"
declare -p VAR_11  # This is to print the variable declaration

# You can call other variables
VAR_12="write '${VAR_11}' even with spaces"
declare -p VAR_12  # This is to print the variable declaration

# You can use $ symbol if you escape it
VAR_13="write '\${VAR_11}' even with spaces"
declare -p VAR_13  # This is to print the variable declaration

# You can declare a string array like this
VAR_14=("write something" even "with spaces")
declare -p VAR_14  # This is to print the variable declaration

echo "Without simple quote delimiters"

# You can declare anything containing spaces
VAR_20='write something even with spaces'
declare -p VAR_20  # This is to print the variable declaration

# You can concatenate string declaration. This is useful to mix
# various delimiters.
VAR_21='write ''something'' even with spaces'
declare -p VAR_21  # This is to print the variable declaration

# You CANNOT call other variables
VAR_22='write "${VAR_11}" even with spaces'
declare -p VAR_22  # This is to print the variable declaration

# You can declare a string array like this
VAR_23=('write something' even 'with spaces')
declare -p VAR_23  # This is to print the variable declaration

echo "With a quote delimiter mix"

VAR_30='write $omething'" even with spaces"
declare -p VAR_30  # This is to print the variable declaration

VAR_31=('write something'" even" with spaces)
declare -p VAR_31  # This is to print the variable declaration

Pattern Matching: Wildcard

One of the basis of efficient shell usage is to master pattern matching in order to describe what you want to do to the machine, as briefly as possible.

Let's say you create a directory and create some files, by running:

cd /tmp
mkdir test-dir
cd test-dir
touch a12 be3 34jlo 4 kg64.txt elr.bt 233 v2Z

If you want to match only some of these without having to type them explicitly, you can use the following pattern declarators:

Pattern	Description
`*`	Match zero or more characters
`?`	Match any single character
`[...]`	Match any of the characters in a set
`?(patterns)`	Match zero or one occurrences of the patterns (extglob)
`*(patterns)`	Match zero or more occurrences of the patterns (extglob)
`+(patterns)`	Match one or more occurrences of the patterns (extglob)
`@(patterns)`	Match one occurrence of the patterns (extglob)
`!(patterns)`	Match anything that doesn't match one of the patterns (extglob)

Example:

As you can see, if no match is found, the pattern itself is passed to the program as argument and then do not match any file of the system.

Another usage of wildcard is in expression evaluation using the notation [[ expr ]]. To use it in this context, you have to write [[ $var == wilcard ]] or [[ $var != wilcard ]] for the negative form. Example:

And the last usage that comes into my mind is in case statements. Example:

VALUES=("adcbacbadcbad" "value with spaces" "9421312" "" "qrtqrt23432!!%")

shopt -s extglob

for VALUE in "${VALUES[@]}"
do
  case "$VALUE" in
    *' '*)
      echo "VALUE '$VALUE' has a space"
    ;;
    +(a|b|c|d))
      echo "VALUE '$VALUE' is abcd"
    ;;
    +(1|2|3|4|5|6|7|8|9|0))
      echo "VALUE '$VALUE' is number"
    ;;
    '')
      echo "VALUE '$VALUE' empty"
    ;;
    *)
      echo "VALUE '$VALUE' default"
    ;;
  esac
done

Pattern Matching: Regular Expression

Regular expressions are more expressive than wildcards, but you cannot expand them as we did earlier. You can only use them in the form of [[ $var =~ regex ]]. There also another difference, which is that you can capture parts of the match with the bash variable BASH_REMATCH.

It's a good thing to master regular expression, as they are usable in a lot of tools, not only bash, but also grep, sed, find, and many more tools that we will look at later

Regexp Cheat Sheet

Here is a basic usage example:

Variable and Parameter Manipulations

Let's start with the simple variables and parameters:

Now, let's take a look at variable and parameter arrays:

"$@" and "$*" parameter arrays behave the same way than "${TOTO[@]}" and "${TOTO[*]}" in the previous example.

Expression Expansion

This one is easy to understand, nothing tricky. Some patterns automatically expand before evaluation. These are examples covering all the expansion I know:

echo {c..f}         # =>  c d e f
echo {3..7}         # =>  3 4 5 6 7
echo {3,e,z,5}      # =>  3 e z 5
echo ka{c..f}2m     # =>  kac2m kad2m kae2m kaf2m
echo p{3..7}ap      # =>  p3ap p4ap p5ap p6ap p7ap
echo _{3,e,z,5}     # =>  _3 _e _z _5

## QUIZZ
## What do you think this gives?
echo a{1,2}r{f..h}l

Example of practical usages. If you want to do a for loop from 1 to 30 you only have to do:

for I in {1..30}; do echo "$I"; done

You can also quickly change the extension of a given file:

mv /path/without/extension.{old_ext,new_ext}

Mathematical expressions

These ones are using syntax similar to C language. The syntax is (( math_expr )) and $(( math_expr )) to get result as variable. Let's give some useful examples:

for ((I=0; I<10; I+=2)); do echo "$I"; done

invert_order() {
  local -a ARRAY=()
  while (($# > 0))
  do
    ARRAY[$#]="$1"
    shift
  done
  echo "${ARRAY[@]}"
}

invert_order 1 2 3 4 5

sum() {
  if (($# > 1))
  then
    echo $(($1 + $2))
  else
    echo "Wrong number of arguments $#"
    return 1
  fi
}

sum 3 4
sum 1
sum

Scopes

A scope allows you to group several expressions and make them usable as a single entity. You can declare them in two manners. Either { expr1; expr2; ...; exprN; } or ( expr1; expr2; ...; exprN; ). The difference between the both being that than the first stays in the same process, while the second forks a new one. This has implications notably when calling builtins, using variables and on crashes. Examples:

with_braces() {
  { echo "First"; return 1; echo "Second"; }
  echo "Third"
}

with_parenthesis() {
  ( echo "First"; return 1; echo "Second"; )
  echo "Third"
}

echo "# With braces"
with_braces

echo "# With parenthesis"
with_parenthesis

with_braces() {
  local VALUE=42
  echo "VALUE=$VALUE"
  { VALUE=314; echo "VALUE=$VALUE"; }
  echo "VALUE=$VALUE"
}

with_parenthesis() {
  local VALUE=42
  echo "VALUE=$VALUE"
  ( VALUE=314; echo "VALUE=$VALUE"; )
  echo "VALUE=$VALUE"
}

echo "# With braces"
with_braces

echo "# With parenthesis"
with_parenthesis

Functions

There are 4 ways to declare a function:

type1() [[ expr ]]
type2() (( expr ))
type3() { expr1; ...; exprN; }
type4() ( expr1; ...; exprN; )

Let's implement the same function in the 4 ways:

echo "# gt1"
gt1() [[ "$1" -gt "$2" ]]
gt1 3 5 && echo "OK" || echo "KO"
gt1 5 3 && echo "OK" || echo "KO"

echo "# gt2"
gt2() (( "$1" > "$2" ))
gt2 3 5 && echo "OK" || echo "KO"
gt2 5 3 && echo "OK" || echo "KO"

echo "# gt3"
gt3() { test "$1" -gt "$2"; }
gt3 3 5 && echo "OK" || echo "KO"
gt3 5 3 && echo "OK" || echo "KO"

echo "# gt4"
gt4() ( test "$1" -gt "$2"; )
gt4 3 5 && echo "OK" || echo "KO"
gt4 5 3 && echo "OK" || echo "KO"

The two last declaration types have the same implications than scopes.

In scope-type functions, you can use supplementary keywords like return (as in C language, it ends the function and set the exit value) or local (when declaring a variable it limits its "lifetime" to the current scope). Example:

function1() {
  local VAR="some value"
  echo "From function1: VAR='$VAR'"
}

function2() {
  VAR="other value"
  echo "From function2: VAR='$VAR'"
}

echo "################"
echo "AT #1: VAR='$VAR'"
function1
echo "AT #2: VAR='$VAR'"
echo "################"
echo "AT #3: VAR='$VAR'"
function2
echo "AT #4: VAR='$VAR'"
echo "################"

Assignment

The goal of this is to put the standard output of some expression in a variable. There are two ways to make it, either $(runnable_expression) or `runnable_expression` (which is considered the old-style form).

remove_extension() {
  echo "${1%.*}"
}

FILENAME="example.tar.gz"
FILENAME_WITHOUT_EXT="$(remove_extension "$FILENAME")"

echo "FILENAME:             $FILENAME"
echo "FILENAME_WITHOUT_EXT: $FILENAME_WITHOUT_EXT"

As always, the expression can be anything, even a scope. Example:

VALUE="$({ echo -n "something"; echo -n "another thing"; echo -n 1{a..f}2; })"
echo "VALUE: $VALUE"

Unix Pipes

When bash execute a script, it creates a process for each part of the code, linking them the way you specified. A process can be schematized like this:

When you create pipes, you're doing this:

You can basically pipe anything into anything, assuming that the receiving thing effectively consume it standard input, otherwise the first one will stay stucked waiting for it's output to be consummed. Most of the programs automatically detect that something attached to their standard input and change their behaviour. For example, you can pipe a scope into a program:

Now let's assume I have a file which contain addresses, one by line, that I must use to perform some action. A file like this one:

https://apache.com/riding-horses/
https://cowboy.org/where-are-the-horses?findHorses=1
http://sheriff.com/something-is-happening/
https://renegade.com/get-out-of-jail/

Unix Redirections

The goal of this feature is to control each available stream in a script. Let's say you want do do this:

Knowing that any unconnected stream will use the terminal ones. In this example:

PROCESS_0:stdin will use the terminal input. If the process tries to read it's input, it will wait for you to type on your keyboard.
PROCESS_2:stdout will output in the terminal stream.

If we didn't redirect PROCESS_2:stderr to NULL it would also output in the terminal stream.

First, let's explain file redirection. It will cover FILE_1 and NULL cases. You can do it by writing:

# > is equivalent to 1> and redirects the 1st file descriptor which is stdout
PROCESS_1 > /path/to/FILE_1

# 2> is to redirect the process 2nd file descriptor which is stderr
PROCESS_2 2> /dev/null

Now, we want to redirect PROCESS_0:stdout and PROCESS_0:stderr to PROCESS_1:stdin, but pipes only allow you to redirect PROCESS_0:stdout to PROCESS_1:stdin. Then, the solution is to redirect PROCESS_0:stderr to PROCESS_0:stdout before piping the whole into PROCESS_1:stdin. You can do it by writing:

# This tells the 2nd file descriptor to redirect into the first one
PROCESS_0 2>&1 | PROCESS_1

PROCESS_1:stderr is supposed to go into PROCESS_2:stdin, knowing that PROCESS_1:stdout goes to FILE_1, we can write:

PROCESS_1 2>&1 >FILE_1 | PROCESS_2

# If we did:
#PROCESS_1 >FILE_1 2>&1 | PROCESS_2
# Then, it first redirects 1 into FILE_1, then 2 to 1 which is now FILE_1
# ending up with both being redirected to FILE_1

A full example could be:

PROCESS_0() {
  >&2 echo "A line with some data"
  sleep 1
  echo "Something different";
  sleep 1
  echo "An other line with data";
  sleep 1
  >&2 echo "Finally this"
}

PROCESS_1() {
  local COUNT=0
  local TOTAL=0
  while read LINE
  do
    if [[ $LINE =~ ^.*' line '.*$ ]]
    then
      COUNT=$((COUNT+1))
    fi
    TOTAL=$((TOTAL+1))
    >&2 echo "$COUNT:$TOTAL"
    echo "$LINE"
  done
}

PROCESS_2() {
  while read LINE
  do
    local COUNT="${LINE%:*}"
    local TOTAL="${LINE#*:}"
    printf "[2K\rCOUNT=%02d TOTAL=%02d" "$COUNT" "$TOTAL"
    >&2 echo -n "  DEBUG: $LINE"
  done
  echo ""
}

echo "==============="
PROCESS_0 2>&1 | PROCESS_1 2>&1 1>/tmp/file1.txt | PROCESS_2 2> /dev/null
echo "==============="
cat /tmp/file1.txt
echo "==============="

## Try it by running this in your terminal:
# curl 'https://duriez.info/d/TSzO7Kuk' 2>/dev/null | bash

Injection

The goal of this feature is to use variables or files as stream for a given process. Let's say we want to do:

Forks

Builtins

`cd, pushd, popd`

`echo`

`type`

`export`

`set`

`declare`

My toolset

`man`

`file`

`which`

`find`

find $path -name $wildcard

`grep`

grep -R $string $path

grep -RE $regexp $path