Matt Duran

I break things to learn how they work

Sometimes, the simplest thing is the correct answer

Published in 📝 Posts

Whoa, two posts in one day!? Hold onto your hats, folks! This is just a quick post about a bash script I wrote last week, and, much like this website itself, it's a reminder that sometimes the simplest way is the best way to do something.

The problem

This site has an RSS feed, which lets me publish posts so that, if anyone actually reads this page, folks can get each new post sent to them automatically in their preferred reader. It's an incredible, underrated technology -- actually the backbone of how podcasts get distributed -- and my preferred way of using the internet. One pain point, however, is that if I want my post to come across the way I've formatted it here, with headers and links and not just plain text, I need to encode the HTML into the feed, because the feed itself is XML. This means replacing the angle brackets of all the HTML tags with their escaped equivalents so they don't break the XML. For the GCP Cloud Resume post, despite my better judgement, I did this manually. It took a number of pushes to the RSS feed, going over it with a fine-toothed comb, to find the one symbol I missed -- posting, checking, repeating -- until I finally got it right. At that point I decided, "I'm never doing this again, time to automate it".
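To make concrete what that encoding involves: every angle bracket in the article body has to become an XML entity. Here's a quick illustration in the terminal (the snippet is made up, not from a real post):

```shell
# Each HTML tag's angle brackets must become &lt; and &gt;
# before the markup can live inside the XML feed.
printf '%s\n' '<h2>A header and a <a href="#">link</a></h2>' \
  | sed -e 's/</\&lt;/g' -e 's/>/\&gt;/g'
# -> &lt;h2&gt;A header and a &lt;a href="#"&gt;link&lt;/a&gt;&lt;/h2&gt;
```

That's the substitution I was doing by hand, across an entire post.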

First pass at the problem

Okay, so what do we need to solve this? We need:

  • To parse the existing HTML document and extract only the part of the article we want. This would be between the article tags
  • To encode all of the HTML tags so that they don't break the XML format
  • To insert this into the existing XML document so that it updates the feed when this is pushed to GitHub

Relatively straightforward problem. I initially thought, "I could write a Python script to do this, no problem!" -- I know Python pretty well and do most of my projects in it because it's easy to get something done quickly. I started looking into how I could parse a local HTML document, and most of the blog posts, documentation pages, and Stack Overflow threads I came across pointed me to BeautifulSoup. I've used BS4 before to make a web scraper, but that seemed like overkill for this -- I really just needed to copy something out of a local file. Isn't there something more lightweight I could use?

I came across html.parser, which seemed straightforward and fit my use case -- a Python standard-library module for parsing text files formatted in HTML. Simple, right? But partway through setting it up, testing it, and writing out the parser class and its handler functions, I realized that maybe this wasn't the answer either. There really should be a lighter-weight option for this -- and then it hit me.

Bash to the rescue

What I'm doing is really just basic substitution and text editing -- all of this is just a fancy way of doing things like grep, sed, and awk! I don't need a whole language for this; I've got everything I need in the terminal. Like I said earlier, I really like to keep things as simple as possible -- it makes it easy to remember what I'm doing, easy to maintain, and very unlikely to break. That's why this site is only HTML and CSS, hosted in a bucket, with no bells or whistles to be found.

The code

Alright, so how are we going to do this? Digging around online, I found xmllint, which, with the --html flag, can parse an HTML file and extract content based on the tags that you specify. Next, I just need to replace the greater-than and less-than characters with their encoded versions -- &gt; and &lt;. Once I have that, I just need to feed it into the XML file with some additional text around it -- bash is perfect for this!
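As a quick proof of concept before wiring it into the script, the whole pipeline can be tried on a throwaway file (sample.html here is a stand-in, not a real post):

```shell
# A minimal page with the same main/article structure as a post.
cat > sample.html <<'EOF'
<html><body><main><article><p>Hello, feed!</p></article></main></body></html>
EOF

# Pull out the article element, then encode the angle brackets.
xmllint --html --xpath "//main/article" sample.html \
  | sed -e 's/</\&lt;/g' -e 's/>/\&gt;/g'
# -> &lt;article&gt;&lt;p&gt;Hello, feed!&lt;/p&gt;&lt;/article&gt;
```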

After some testing and iterations, I came up with the below script. I'd run the bash script and specify the file that I want to add to the rss feed as well as the title that I'd like to give it like this:

./update-rss.sh posts/new-article.html "This is my new article"

        
          #! /bin/sh

          if test -f "$1"; then
          # Get current year in #### format
              year=$(date +%Y)
          # Get current month in ## format (%m -- %M would be minutes)
              month=$(date +%m)
          # Get current day of month in ## format
              day=$(date +%d)
          # Strip the path and only retain what's after the last '/' for the guid
              title=${1##*/}
          # Get the date in RFC-822 format -- Mon, 01 Jan 1999 14:00:00 -0800
              pubdate=$(date -R)

          # Remove the last line from the xml file
          # </feed>
          # This allows the script to append directly
              sed -i '$ d' public/feed.xml
          # Print the new entry tags into the xml file
              printf '%s\n' "<entry>" \
                  "<title>$2</title>" \
                  "<link href='https://mattduran.info/${year}/${month}/${day}/${title}'/>" \
                  "<guid isPermaLink='false'>${title}</guid>" \
                  "<pubDate>${pubdate}</pubDate>" \
                  "<content type='html'>" >> public/feed.xml
          # Parse the existing HTML file for the article,
          # escape all the < and > characters so they don't break the XML,
          # then append to the end of feed.xml
              xmllint --html --xpath "//main/article" "$1" | sed -e 's/</\&lt;/g' -e 's/>/\&gt;/g' >> public/feed.xml
          # Print the closing tags back into the file to restore the xml format
              printf '%s\n' "</content>" "</entry>" "</feed>" >> public/feed.xml
          else
              echo "$1 does not exist."
          fi
        
        
So here's a quick breakdown of the script:
  • We pass in two parameters (posts/new-article.html and "This is my new article") which are referred to as $1 and $2 in the script
  • If the file exists, proceed on with the script, otherwise quit out
  • We get the current date for when the article was published and the raw title of the file from its path, and save these off as variables
  • We then remove the last line from the XML file so we can add the contents directly to the file
  • Next, we echo in some header information including the title of the post, the url for the post, the publication time, and specify that this is to be encoded as HTML
  • We use xmllint to grab the original article contents and, with sed, replace all the greater-than and less-than symbols with their encoded counterparts. This is then appended to the XML file
  • Finally, we close out the content and feed tags and we're good to post!
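One bit worth calling out from step three: the ${1##*/} expansion is plain POSIX parameter expansion, no external tools needed. The ## operator strips the longest leading match of a pattern (a hypothetical two-level path makes the difference from a single # visible):

```shell
path="public/posts/new-article.html"
# '##*/' strips the longest prefix matching '*/' --
# everything up to and including the last slash.
echo "${path##*/}"   # -> new-article.html
# A single '#' strips only the shortest match:
echo "${path#*/}"    # -> posts/new-article.html
```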
Nothing groundbreaking, but I'm proud of coming up with it pretty quickly. Now I just need to wrap it in a GitHub Action and never touch it again!