Fix HTML Formatting Using Simple Shell Scripting

If you often write HTML in an editor and then paste into WordPress, you'll notice that sometimes annoying formatting tags (like <span> tags) are added. Using simple shell scripts, you can automatically clean up that garbage HTML formatting with a few simple commands.

Why use shell scripting? If you're new to programming, it's much, much better to start small. Not only are you less likely to give up, but you'll have opportunities to stop and learn along the way. That said, your first programs can be really useful even if they're also really simple.

Shell scripting is a great place to start coding for this precise reason: it's easy to put together something in a couple of lines of code that will save you lots of time. Let's take a look at a few recipes, or "patterns," you can repurpose into scripts of your own.

Why Shell Scripting?

First, let's define "shell scripting" as writing scripts to be run in the Bash shell. Technically speaking, other scripting languages such as PowerShell could also be termed "shell scripting." But why focus on shell scripting in general, and Bash scripting in particular, in the first place?

With the introduction of the Windows Subsystem for Linux, the Bash shell is now available on all major PC platforms. (It's also included on macOS and just about all Linux distributions out of the box.) It's even available on Android phones with Termux, a free and open source download from Google Play.
Shell scripting lets you focus on programming fundamentals, because the heaviest lifting is done for you by the commands you'll include. Suppose you want to compress some files in a traditional desktop application written in C. You'll either need to write a little code to use a compatible software library that will do the job, or write a lot of code from scratch to actually do the compression. In a shell script, all you need to do is run the tar command on the desired files.
You can develop in small steps, in an interactive way. To continue the above example, let's say you've decided you'll use tar to do your compression, but you're not yet sure which of its options you want. Just play around with it at the prompt until you get the result you want, then copy/paste the command you used into your script.
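As a rough sketch of that "play at the prompt, then paste into a script" workflow, the commands below create a throwaway folder, compress it with tar, and list the result. The folder and file names are invented for illustration, and the gzip option (-z) is just one of the choices you might settle on.

```shell
#!/bin/bash
# Sketch: experiment with tar options, then keep the command that worked.
# The "notes" folder and "draft.txt" file are made-up example names.
workdir=$(mktemp -d)
mkdir "$workdir/notes"
echo "first draft" > "$workdir/notes/draft.txt"

# -c creates an archive, -z compresses it with gzip, -f names the archive.
# -C switches into the directory first so paths inside the archive stay short.
tar -czf "$workdir/notes.tar.gz" -C "$workdir" notes

# -t lists the contents of the archive without extracting anything.
listing=$(tar -tzf "$workdir/notes.tar.gz")
echo "$listing"
```

Once the listing looks right at the prompt, the tar line is the part you would copy into your script.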

With the above in mind, here are a few ideas for useful shell scripts you can put together with just a couple of lines of code. We'll build them around the already considerable powers of the Pandoc conversion utility.

1. Collecting Long Lists of Parameters

The easiest and most straightforward way to use a shell script is as a kind of shortcut for an existing command. Some command line programs have a ton of flags, and their syntax isn't always clear. But you can take one of these commands, with all its complicated options, and throw it into a shell script with a name that's easier to enter. Consider the following command, which runs Pandoc on a Markdown file and creates an ODT file, using a template file:

        pandoc -r markdown -w odt --reference-odt=/path/to/folder/containing/mscript-template.odt -o manuscript.odt manuscript.md

I use Pandoc on a daily basis, as I author everything in lightweight markup like Markdown and Asciidoc. And yet when converting to ODT, I type "odt-reference" instead of "reference-odt." Every. Single. Time. Plus the path to the template won't autocomplete like most shell commands. Creating a simple script can save all that mistyping:

        #!/bin/bash
        # Convert the Markdown file given as $1 to ODT; ${1%.md} drops the .md extension
        pandoc -r markdown -w odt --reference-odt=/path/to/folder/containing/mscript-template.odt -o "${1%.md}.odt" "$1"

The first line of the script directs the system to use the Bash shell to run it. The next one takes the first argument at the command line ($1), and runs Pandoc with a set of flags on it. It's worth noting there are other ways to do this, such as using the alias command on Unix-ish systems. But making small shell scripts means you can keep them handy (such as in your ~/bin folder), quickly copy (or sync) them elsewhere, and change them with any text editor. Save your script with a file name that's easy to remember and type (e.g. "markdown2odt.sh"). Don't forget to give it executable permissions.
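The permissions step might look something like the sketch below. It uses a tiny stand-in script (hello.sh, a hypothetical name) in a scratch folder rather than the real Pandoc wrapper, so it runs even where Pandoc isn't installed.

```shell
#!/bin/bash
# Sketch: create a small stand-in script, make it executable, and run it.
# "hello.sh" is a hypothetical name used only for this demonstration.
scratch=$(mktemp -d)

# The quoted 'EOF' keeps $1 literal so it ends up inside the new script.
cat > "$scratch/hello.sh" <<'EOF'
#!/bin/bash
echo "Converting $1"
EOF

chmod +x "$scratch/hello.sh"    # grant execute permission

# Run the script with one argument, which it sees as $1.
output=$("$scratch/hello.sh" manuscript.md)
echo "$output"                  # prints: Converting manuscript.md
```

The same chmod +x step applies to the real script once you've saved it in a folder like ~/bin.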

2. Piping Output to Clean HTML Formatting

Connecting two terminal commands with a pipe ("|") character causes the output of the first to be used as the input of the second. (If you've never seen this before, check out our quick guide to the command line.) But having to type two commands in the right order, with the right parameters, only compounds the problem we just discussed. Wrapping this double-command up in a shell script makes it that much more convenient.

One trick I use with Pandoc is to "clean" HTML formatting, or remove all inline styling. If you've ever tried to export a word processor document to HTML, you can see there's a ton of styles (span tags) that get added in and among the text.

The DocBook XML format has no convention for inline styles, so if we convert HTML to DocBook all this formatting gets tossed out. Then we can use Pandoc to convert the DocBook back to HTML, and we get a nice bit of markup that we can (for example) paste into WordPress. Rather than do this with individual calls to Pandoc, the following script chains them together to:

Convert the exported HTML file to DocBook, which has no inline styles (before the pipe)
Convert the DocBook back into what is now nice, clean HTML formatting (after the pipe)

        #!/bin/bash
        # HTML to DocBook (drops inline styles), then DocBook back to HTML in the same file
        pandoc -w docbook "$1" | pandoc -r docbook -w html -o "$1" -

Explaining Standard Input/Output

The above takes advantage of the terminal concepts of "standard input" and "standard output." If you were to run the first part of the command, you'd get a whole bunch of XML shown in the terminal. That's because we haven't given Pandoc any other output (such as a file) to use. So it's using the only fallback it's got: standard output, in this case the terminal.

On the other hand, the dash character at the end of the second Pandoc command means it should use "standard input." Run by itself, Pandoc would simply sit and wait for you to provide some text via its default input, by typing on the keyboard. When we combine them, you can almost imagine the first command spitting out a bunch of XML to the terminal, where it is immediately piped into the second command as input.
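If you'd like to see the two concepts in isolation, here is a small sketch that uses printf, tr, and cat as stand-ins for the Pandoc commands (an assumption made so it runs on any system). The sample text is invented for the demonstration.

```shell
#!/bin/bash
# Sketch: standard output feeding standard input, with stand-in commands.

# Run alone, printf writes to standard output, which is the terminal.
printf '<p>hello</p>\n'

# With a pipe, the same text becomes the standard input of tr instead.
shouted=$(printf '<p>hello</p>\n' | tr 'a-z' 'A-Z')
echo "$shouted"                 # prints: <P>HELLO</P>

# Like Pandoc, cat accepts "-" to mean "read from standard input".
copied=$(printf 'from stdin\n' | cat -)
echo "$copied"                  # prints: from stdin
```

Try running `cat -` on its own: it waits silently for you to type, which is the same behavior described for the second Pandoc command.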

The result is, if you save this as "clean-html.sh," you can run it on any HTML file to get rid of those bothersome styles. The best part is that Pandoc will read from the file, then overwrite it at the end, meaning there are no temp files littered about.

3. Running Programs on Multiple HTML Files

Some programs allow you to specify wildcards such as the asterisk at the command line. This allows you to, for example, move all JPG images to your "Pictures" folder:

        mv *.jpg ~/Pictures

But other programs don't handle wildcards the way we'd like, and Pandoc is one of them: given several input files, it concatenates them into a single document instead of converting each one separately. So what happens when we have a whole directory full of exported HTML files and we want to clean up the HTML formatting? Do we need to run our "clean-html.sh" script on each one of them manually?

No, because we're not newbies. We can wrap our piped command in a "for-each" loop. This will go to each HTML file in the current directory in turn, and perform the clean operation on it. Let's also add a little message via the echo statement to let us know as each file is taken care of:

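        #!/bin/bash
        # Clean every HTML file in the current directory, one at a time
        for filename in ./*.html
        do
          pandoc -w docbook "$filename" | pandoc -r docbook -w html -o "$filename" -
          echo "Working on $filename... HTML is clean!"
        done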

Now if you have a folder full of "dirty" HTML, you can run this script on it and end up with some sparkly-clean HTML formatting.
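Since the cleaner overwrites files in place, you may want to preview which files a loop like this would touch before letting it loose. The sketch below is one way to do that, under a few assumptions: it works in a scratch folder with made-up file names, swaps the Pandoc call for an echo, and turns on Bash's nullglob option so that a folder with no HTML files produces zero iterations.

```shell
#!/bin/bash
# Sketch: dry run of the loop, reporting files instead of converting them.
# one.html, two.html, and notes.txt are made-up example files.
scratch=$(mktemp -d)
touch "$scratch/one.html" "$scratch/two.html" "$scratch/notes.txt"
cd "$scratch" || exit 1

shopt -s nullglob   # Bash option: a pattern with no matches expands to nothing

count=0
for filename in ./*.html
do
  echo "Would clean $filename"
  count=$((count + 1))
done
echo "$count HTML file(s) found"    # prints: 2 HTML file(s) found
```

Without nullglob, a folder containing no HTML files would run the loop once with the literal text "./*.html" as the file name.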

Where to Go From Here

If you like tinkering, you'll love shell scripting, because there's always tweaking to be done. Some ideas on how to use these patterns as a basis for other scripts include the following:

Adding support for conversion directly from the word processor file, since Pandoc supports ODT and DOCX input (i.e., the chain becomes ODT/DOCX > DocBook XML > HTML).
Combining both HTML cleaners into one, such that if a file is provided it cleans that, otherwise it automatically cleans everything in the current directory (adds dealing with command line arguments).
Providing the user with additional export options like PDF (adds choices based on input, via if-then or case statements).
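As a starting point for the second idea, a combined cleaner might look roughly like the sketch below. The helper names clean_html and choose_targets are hypothetical, and the check for Pandoc is an added assumption so the script does nothing harmful on a machine without it.

```shell
#!/bin/bash
# Sketch: clean the file named on the command line, or every HTML file
# in the current directory when no name is given.
# clean_html and choose_targets are hypothetical helper names.

clean_html() {
  # The same HTML > DocBook > HTML round trip as the single-file script
  pandoc -w docbook "$1" | pandoc -r docbook -w html -o "$1" -
}

choose_targets() {
  # Print the file passed as $1, or each ./*.html file when $1 is empty
  if [ -n "$1" ]; then
    printf '%s\n' "$1"
  else
    for filename in ./*.html; do
      if [ -e "$filename" ]; then
        printf '%s\n' "$filename"
      fi
    done
  fi
}

if ! command -v pandoc >/dev/null 2>&1; then
  echo "pandoc is not installed; nothing cleaned" >&2
else
  choose_targets "$1" | while IFS= read -r target; do
    clean_html "$target" && echo "Cleaned $target"
  done
fi
```

Keeping the "which files?" decision in its own function means you can test it at the prompt without converting anything.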

As you can see, with shell scripts you can build things a little at a time, testing them out at the prompt and tacking them onto your scripts as you go.

What do you say, does shell scripting seem a little less intimidating now? Are you ready to try your hand at automating your dullest tasks? If you decide to jump in, let us know how it goes below in the comments!