Fix HTML Formatting Using Simple Shell Scripting

Aaron Peters 28-09-2017

If you often write HTML in an editor and then paste into WordPress, you’ll notice that sometimes annoying formatting tags (like <span> tags) are added. Using simple shell scripts, you can automatically clean up that garbage HTML formatting with a few simple commands.


Why use shell scripting? If you’re new to programming, it’s much, much better to start small. Not only are you less likely to give up, but you’ll have opportunities to stop and learn along the way. That said, your first programs can be really useful even if they’re also really simple.

Shell scripting is a great place to start coding for this precise reason: it’s easy to put together something in a couple lines of code that will save you lots of time. Let’s take a look at a couple of recipes, or “patterns,” you can repurpose into scripts of your own.

Why Shell Scripting?

First, let’s define “shell scripting” as writing scripts to be run in the Bash shell. Technically speaking, other scripting languages such as PowerShell could also be termed “shell scripting.” But why focus on shell scripting in general, and Bash scripting in particular, in the first place?

With the above in mind, here are a couple of ideas for useful shell scripts you can put together with just a couple lines of code. We’ll be building a couple of scripts to enhance the already considerable powers of the Pandoc conversion utility.

1. Collecting Long Lists of Parameters

The easiest and most straightforward way to use a shell script is as a kind of shortcut for an existing command. Some command line programs have a ton of flags, and their syntax isn’t always clear. But you can take one of these commands, with all its complicated options, and throw them into a shell script with a name that’s easier to enter. Consider the following command, which runs Pandoc on a Markdown file and creates an ODT file, using a template file:

pandoc -r markdown -w odt --reference-odt=/path/to/folder/containing/mscript-template.odt -o manuscript.odt

I use Pandoc on a daily basis, as I author everything in lightweight markup like Markdown and AsciiDoc. And yet when converting to ODT, I type “odt-reference” instead of “reference-odt.” Every. Single. Time. Plus, the path to the template won’t autocomplete like most shell commands. Creating a simple script can save all that mistyping:

#! /bin/bash
pandoc -r markdown -w odt --reference-odt=/path/to/folder/containing/mscript-template.odt -o "$1.odt" "$1"

The first line of the script directs the system to use the Bash shell to run it. The next one takes the first argument at the command line ($1), and runs Pandoc with a set of flags on it. It’s worth noting there are other ways to do this, such as using the alias command on Unix-ish systems. But making small shell scripts means you can keep them handy (such as in your ~/bin folder), quickly copy (or sync) them elsewhere, and change them with any text editor. Save your script with a file name that’s easy to remember and type. Don’t forget to give it executable permissions.
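To see the save, chmod, and run steps in one place, here is a minimal sketch that does not need Pandoc at all. The script name show-args, the throwaway folder, and the file name chapter1.md are hypothetical stand-ins. It also shows how "$1.odt" expands when the argument already carries an extension.

```shell
#!/bin/bash
# Hypothetical demo: write a tiny wrapper script into a throwaway folder,
# make it executable, and run it with one argument.
demo_dir=$(mktemp -d)

cat > "$demo_dir/show-args" <<'EOF'
#!/bin/bash
# "$1" is the first argument given on the command line
echo "input: $1, output: $1.odt"
EOF

chmod +x "$demo_dir/show-args"                  # grant executable permission
result=$("$demo_dir/show-args" chapter1.md)     # run it like any other command
echo "$result"
```

The output file name is simply the argument with ".odt" tacked on, so chapter1.md becomes chapter1.md.odt.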

2. Piping Output to Clean HTML Formatting

Connecting two terminal commands with a pipe (“|”) character causes the output of the first to be used as the input of the second. (If you’ve never seen this before, check out our quick guide to the command line.) But having to type two commands in the right order, with the right parameters, only compounds the problem we just discussed. Wrapping this double-command up in a shell script makes it that much more convenient.

One trick I use with Pandoc is to “clean” HTML formatting, or remove all inline styling. If you’ve ever tried to export a word processor document to HTML, you’ve seen the ton of styles (span tags) that get added in and among the text.


(Image: messy HTML formatting)

The DocBook XML format has no convention for inline styles, so if we convert HTML to DocBook, all this formatting gets tossed out. Then we can use Pandoc to convert the DocBook back to HTML, and we get a nice bit of markup that you can (for example) paste into WordPress. Rather than do this with individual calls to Pandoc, the following script chains them together to:

  1. Convert the exported HTML file to DocBook, which has no inline styles (before the pipe)
  2. Convert the DocBook back into what is now nice, clean HTML formatting (after the pipe)

#! /bin/bash
# HTML to DocBook drops the inline styles; DocBook back to HTML overwrites the original file
pandoc -w docbook "$1" | pandoc -r docbook -w html -o "$1" -

(Image: clean HTML formatting)

Explaining Standard Input/Output

The above takes advantage of the terminal concepts of “standard input” and “standard output.” If you were to run the first part of the command, you’d get a whole bunch of XML shown in the terminal. That’s because we haven’t given Pandoc any other output (such as a file) to use. So it’s using the only fallback it’s got: standard output, in this case the terminal.


On the other hand, the dash character at the end of the second Pandoc command means it should use “standard input.” Run by itself, the command would sit and wait for you to provide some text via its default input, by typing on the keyboard. When we combine them, you can almost imagine the first command spitting out a bunch of XML to the terminal, where it is immediately piped into the second command as input.
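If you want to poke at standard input and output without Pandoc, the same plumbing works with everyday tools. This is only an illustration: printf, tr, and cat stand in for the two Pandoc calls, and the sample text is made up.

```shell
#!/bin/bash
# printf writes to standard output; the pipe hands that text to tr,
# which reads it from standard input and upper-cases it.
shouted=$(printf 'hello from stdout\n' | tr 'a-z' 'A-Z')
echo "$shouted"

# Like Pandoc, cat accepts "-" as a file name meaning "read standard input".
echoed=$(printf 'piped text\n' | cat -)
echo "$echoed"
```

Try running `cat -` on its own: it waits for you to type, which is the same behavior described above for the second Pandoc command.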

The result is, if you save this script under a short name, you can run it on any HTML file to get rid of those bothersome styles. The best part is Pandoc will read from the file, then overwrite it at the end, meaning there are no temp files littered about.
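Because the script overwrites its input, a small guard can save some grief when you mistype a file name. The sketch below wraps the same pipeline in a function; the name clean_html and the usage message are my own inventions, not part of Pandoc.

```shell
#!/bin/bash
# Sketch: the same two-step pipeline, but refuse to run unless exactly
# one existing file is given. clean_html is a hypothetical name.
clean_html() {
  if [ "$#" -ne 1 ] || [ ! -f "$1" ]; then
    echo "Usage: clean_html FILE.html" >&2
    return 1
  fi
  # HTML -> DocBook (inline styles dropped) -> HTML, written back to the same file
  pandoc -w docbook "$1" | pandoc -r docbook -w html -o "$1" -
}
```

To use it as a standalone script, add a final line that reads clean_html "$@" so the command line arguments are passed along.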

3. Running Programs on Multiple HTML Files

Some programs allow you to specify wildcards such as the asterisk at the command line. This allows you to, for example, move all JPG images to your “Pictures” folder:

mv *.jpg ~/Pictures
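It helps to know that the shell, not mv, does the wildcard matching. Here is a small sketch in a throwaway folder (the file names are made up) that uses echo to reveal what the command would actually receive.

```shell
#!/bin/bash
# Demo in a throwaway folder so no real files are touched.
demo_dir=$(mktemp -d)
touch "$demo_dir/one.jpg" "$demo_dir/two.jpg" "$demo_dir/notes.txt"

# The shell replaces *.jpg with the matching names before the command runs.
matches=$(cd "$demo_dir" && echo *.jpg)
echo "mv would receive: $matches"
```

Swapping mv for echo like this is a handy way to preview a wildcard before doing anything irreversible.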

But other programs won’t process each matching file separately, and Pandoc is one of them: given several input files, it concatenates them into a single output. So what happens when we have a whole directory full of exported HTML files and we want to clean up the HTML formatting? Do we need to run our cleanup script on each one of them manually?


No, because we’re not newbies. We can wrap our piped command in a “for-each” loop. This will go to each HTML file in the current directory in turn, and perform the clean operation on it. Let’s also add a little message via the echo statement to let us know all the files have been taken care of:

#! /bin/bash
for filename in ./*.html; do
  pandoc -w docbook "$filename" | pandoc -r docbook -w html -o "$filename" -
  echo "Working on $filename... HTML is clean!"
done

Now if you have a folder full of “dirty” HTML, you can run this script on it and end up with some sparkly-clean HTML formatting.
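Since the loop overwrites every file it touches, a dry run is a cheap safety net. The sketch below uses the same loop pattern but only reports what it would clean; the throwaway folder and file names are invented for the demo, and the check for a non-matching wildcard is my own addition.

```shell
#!/bin/bash
# Dry run of the loop pattern: list what would be cleaned without calling Pandoc.
work_dir=$(mktemp -d)              # throwaway folder for the demo
touch "$work_dir/a.html" "$work_dir/my notes.html" "$work_dir/skip.txt"

count=0
for filename in "$work_dir"/*.html; do
  [ -e "$filename" ] || continue   # no matches: skip the literal "*.html" pattern
  echo "Would clean: $filename"
  count=$((count + 1))
done
echo "$count file(s) found"
```

Quoting "$filename" is what lets the loop cope with names containing spaces, such as "my notes.html" here.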

(Image: cleaning multiple HTML files)

Where to Go From Here

If you like tinkering, you’ll love shell scripting, because there’s always tweaking to be done. Some ideas on how to use these patterns as a basis for other scripts include the following:

As you can see, with shell scripts you can build things a little at a time, testing them out at the prompt and tacking them onto your scripts as you go.

What do you say, does shell scripting seem a little less intimidating now? Are you ready to try your hand at automating your dullest tasks? If you decide to jump in, let us know how it goes below in the comments!



  1. dragonmouth
    September 28, 2017 at 1:29 pm

    This only works if one is fluent in scripting.

    Can't HTML validators, such as TidyHTML, be used instead?

    • Aaron Peters
      September 28, 2017 at 3:36 pm


      I had looked briefly into these, albeit a while ago. What I'd found at the time was that they're focused on specifically *validating* HTML, and inline styles are technically valid. So they didn't contain a feature to remove in-line styling, at least that I could find.

      But I'd argue that to accomplish the above you don't need to be "fluent" in scripting. The point of this post is that you can get started by using terminal commands you already know and wrapping just a bit of simple scripting around them. After all, everyone who is fluent with scripting *now* had to create something simple like this at some point in the past...