Fixing common Hugo encoding problems

I posted a link to my blog on Slack and was greeted with HTML entities right in the website summary. Certain characters, like the apostrophe, showed up literally as &rsquo;.

Here’s how I fixed this problem.

A screenshot from Slack of a post from this blog. The description shows HTML entities literally: … they&rsquo;ll … instead of they’ll

HTML Meta Tags

Investigation

Slack and social media sites use meta tags defined by the OpenGraph protocol to fetch information like the summary, publish dates, and images relevant for posting on a feed. On this post, we can see those tags contain HTML entities.

<meta property='og:description' content='I work at AWS, and predictably we use a lot of AWS cloud services. In many cases, when an engineer looks for a computer platform, they&rsquo;ll often go directly to AWS Lambda because &ldquo;it&rsquo;s Serverless&rdquo; with the justification that it&rsquo;s simple and the best option no matter what and not want to explore alternatives. The FaaS (Functions as a Service) compute style is great for a certain category of system problems&ndash;ones in which you don&rsquo;t need strict control over how it executes. AWS Lambda only exposes limited controls and depending on what your workload is like, you could run into unexpected scaling and failure modes. There are alternative compute environments that avoid those limitations that you should know about. '>

<meta property='article:published_time' content='2024-09-29T00:00:00&#43;00:00'/><meta property='article:modified_time' content='2024-09-29T00:00:00&#43;00:00'/>

My Hugo template had a file that looked like this:

{{- $title := partialCached "data/title" . .RelPermalink -}}
{{- $description := partialCached "data/description" . .RelPermalink -}}

<meta property='og:title' content='{{ $title }}'>
<meta property='og:description' content='{{ $description }}'>
<meta property='og:url' content='{{ .Permalink }}'>
<meta property='og:site_name' content='{{ .Site.Title }}'>

If we look at the opengraph.html template in Hugo’s repo, it pipes the value through plainify to strip HTML tags, then through htmlUnescape to decode HTML entities back into the characters they represent.

{{- with or .Description .Summary site.Params.description | plainify | htmlUnescape }}
  <meta property="og:description" content="{{ trim . "\n\r\t " }}">
{{- end }}
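
To make the two steps concrete, here is a minimal Python sketch of what that pipeline does to a string. The plainify and html_unescape functions below are my own rough stand-ins, not Hugo's implementation: the tag stripping is a simple regular expression, and the sample summary is shortened from the meta tag above.

```python
import html
import re


def plainify(s: str) -> str:
    """Rough stand-in for Hugo's plainify: drop anything that looks like a tag."""
    return re.sub(r"<[^>]*>", "", s)


def html_unescape(s: str) -> str:
    """Stand-in for Hugo's htmlUnescape: decode entities into characters."""
    return html.unescape(s)


# Shortened sample of the rendered summary (hypothetical input).
summary = "<p>they&rsquo;ll go to Lambda because &ldquo;it&rsquo;s Serverless&rdquo;</p>"

# Same order as the Hugo template: strip tags first, then decode entities.
description = html_unescape(plainify(summary))
print(description)

# Numeric entities such as the &#43; in the published_time tag decode the same way.
published = html_unescape("2024-09-29T00:00:00&#43;00:00")
print(published)

# Order matters: decoding first can turn escaped text into something tag-like.
escaped_text = "1 &lt; 2 and 3 &gt; 2"
strip_then_decode = html_unescape(plainify(escaped_text))
decode_then_strip = plainify(html_unescape(escaped_text))
print(repr(strip_then_decode), repr(decode_then_strip))
```

In this sketch, stripping before decoding keeps text that was escaped on purpose, while decoding first lets the tag stripper eat part of the sentence. That is one plausible reason for the plainify-then-htmlUnescape order.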

Solution

Let’s do a similar fix.

{{- $description := partialCached "data/description" . .RelPermalink -}}
{{- /* Strip HTML tags, then decode entities such as &rsquo; */ -}}
{{- $description = $description | plainify | htmlUnescape -}}
{{- /* Drop leading and trailing newlines */ -}}
{{- $description = trim $description "\n" -}}
{{- $description = $description | safeHTMLAttr -}}

<meta property='og:title' content='{{ htmlUnescape .Title | safeHTMLAttr }}'>
<meta property='og:description' content='{{ $description }}'>

Now we get the following HTML:

<meta property='og:description' content='I work at AWS, and predictably we use a lot of AWS cloud services. In many cases, when an engineer looks for a computer platform, they’ll often go directly to AWS Lambda because “it’s Serverless” with the justification that it’s simple and the best option no matter what and not want to explore alternatives.
The FaaS (Functions as a Service) compute style is great for a certain category of system problems–ones in which you don’t need strict control over how it executes. AWS Lambda only exposes limited controls and depending on what your workload is like, you could run into unexpected scaling and failure modes. There are alternative compute environments that avoid those limitations that you should know about.'>

And now my OpenGraph tags are generated correctly.

ActivityPub JSON

Investigation

My secret upcoming project is adding ActivityPub support to this blog. It’s not finished yet and will get its own post, but here we can see a similar encoding problem. All the content was being HTML-escaped, which meant that links were not clickable and their markup was visible in the post.

A screenshot of my blog post in Mastodon. You can see HTML elements directly in the post. Example: I was using a mixture of

This is exactly the same problem as before. Say I’ve got a file: layouts/posts/single.post_json.json that generates the ActivityPub JSON for a single post with the following subset:

{
  "id": "{{ .Permalink }}",
  "type": "Article",
  "content1": {{ printf "%s" .Summary | jsonify }},
  "content2": {{ .Content | htmlUnescape | jsonify }},
  "content3": "{{.Summary}}",
}

Looking at these three encodings, which one do you think is correct? Let’s use a test post that contains quotation marks, a line break, and a link:

---
title: test \"test2"
author: "author\""
---

test "test"<br> [test](https://example.com). test"

Running hugo gives me the following output file:

{
  "content1": "\u003cp\u003ehere \u0026ldquo;is\u0026rdquo; a summary\u003cbr\u003e \u003ca class=\"link\" href=\"https://example.com\"  target=\"_blank\" rel=\"noopener\"\n    \u003etest\u003c/a\u003e. test\u0026quot;\u003c/p\u003e",
  "content2": "\u003cp\u003ehere “is” a summary\u003cbr\u003e \u003ca class=\"link\" href=\"https://example.com\"  target=\"_blank\" rel=\"noopener\"\n    \u003etest\u003c/a\u003e. test\"\u003c/p\u003e\n",
  "content3": "<p>here &ldquo;is&rdquo; a summary<br> <a class="link" href="https://example.com"  target="_blank" rel="noopener"
    >test</a>. test&quot;</p>",
}

Solution

Looking at content3, the quotation marks are not escaped, which produces invalid JSON, so that’s out. Looking at content1, we see HTML entities, like \u0026ldquo; (the JSON-escaped form of &ldquo;). That’s why we’re seeing the HTML tags written raw in Mastodon, instead of as a clickable link as we expected. content2 does not include any HTML entities, so it is the correct format.
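
As a quick check of that reasoning, here is a minimal Python sketch that parses shortened versions of the three fields. Python's json module is only a stand-in for whatever parser the consuming server uses, and the strings are trimmed down from the output above.

```python
import json

# Shortened versions of the three fields from the generated file.
content1 = r'"\u003cp\u003ehere \u0026ldquo;is\u0026rdquo; a summary\u003c/p\u003e"'
content2 = r'"\u003cp\u003ehere “is” a summary\u003c/p\u003e"'
content3 = '"<a class="link" href="https://example.com">test</a>"'

# A JSON parser turns \u003c back into "<", but it knows nothing about
# HTML entities, so &ldquo; survives in content1.
parsed1 = json.loads(content1)
parsed2 = json.loads(content2)
print(parsed1)
print(parsed2)

# content3 has unescaped quotes, so the string ends early and parsing fails.
try:
    json.loads(content3)
    content3_is_valid = True
except json.JSONDecodeError:
    content3_is_valid = False
print(content3_is_valid)
```

After parsing, content1 still carries the entities while content2 carries the real characters, and content3 never parses at all.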

Those \u003c Unicode escapes are extraneous; we can use jsonify’s noHTMLEscape option to disable them. Here’s the final version:

{
  ...
  "content": {{ .Content | htmlUnescape | jsonify (dict "noHTMLEscape" true) }},
  ...
}
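
To show why turning the escapes off changes nothing for a JSON consumer, here is a small Python sketch. The jsonify helper and its no_html_escape parameter are hypothetical: they only imitate the <, > and & escaping seen in the output above and are not Hugo's implementation.

```python
import json


def jsonify(value: str, no_html_escape: bool = False) -> str:
    """Hypothetical imitation of jsonify's HTML escaping, for illustration only."""
    out = json.dumps(value, ensure_ascii=False)
    if not no_html_escape:
        # Imitate the default: &, < and > are written as \uXXXX escapes.
        out = (
            out.replace("&", "\\u0026")
            .replace("<", "\\u003c")
            .replace(">", "\\u003e")
        )
    return out


html_fragment = '<a href="https://example.com">test</a>'

escaped = jsonify(html_fragment)
unescaped = jsonify(html_fragment, no_html_escape=True)
print(escaped)
print(unescaped)

# Both forms decode to the same string, so the choice is about readability.
same_after_parsing = json.loads(escaped) == json.loads(unescaped) == html_fragment
print(same_after_parsing)
```

Either encoding round-trips to the same HTML. The unescaped form is just easier to read when you inspect the generated file.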

This pattern of string | htmlUnescape | jsonify (dict "noHTMLEscape" true) should only be used when you intentionally want to emit HTML tags into a JSON field value. Don’t allow user-defined content to enter these values without sanitizing it. If you’re using Hugo, though, the content is probably all static and defined by you.

Summary

By default, Hugo tries to escape every single string to avoid emitting HTML elements where they don’t belong. This is a good security mechanism, but sometimes you need to generate non-HTML files, like JSON, and sometimes you want HTML elements to be emitted in an output artifact without being escaped.

If you want HTML elements to be removed and are emitting into an HTML element, do this:

<meta property="og:description" content="{{ .Summary | plainify | htmlUnescape }}">

If you’re emitting into a JSON file, do this:

"field": {{ .Content | htmlUnescape | jsonify (dict "noHTMLEscape" true) }}

I hope this helps. Stay tuned for further posts on ActivityPub.
