Parsing means analyzing a document's structure and interpreting it. The process may include extracting specific elements, attributes, or data from the document and verifying that the document is well-formed according to a standard or set of rules. On the web, parsing is mainly used to extract data from pages or to manipulate a page's structure before displaying it to users.

Go provides packages for working with document formats, including the HTML and XML formats commonly used on the web. The html package (golang.org/x/net/html) provides functions for tokenizing and parsing HTML.

The HTML Package

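The html package provides an HTML5-compliant tokenizer and parser that you can use to parse HTML documents, traverse the parse tree, and manipulate the tree structure. The package is not part of Go's standard library; it lives in the golang.org/x/net module, which the Go team maintains. The standard library has a smaller package that is also named html, but it only escapes and unescapes HTML text.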

One of the main features of the html package is the Parse function, which parses an HTML document and returns the root node of the parse tree. From there, you can follow node fields like FirstChild and NextSibling to navigate the tree and extract information from the document. The package also provides the ParseFragment function for parsing fragments of HTML documents.

The EscapeString function escapes special characters in a string so that you can include the string in HTML more safely. It converts characters such as <, >, &, and quotes to their corresponding HTML entities, which helps prevent cross-site scripting (XSS) attacks when you place untrusted text inside HTML.
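Here is a small sketch of escaping untrusted input before placing it in markup. To keep the example free of external dependencies, it uses the EscapeString function of the standard library's html package; the renderComment helper and the sample input are made up for this illustration.

```go
package main

import (
    "fmt"
    "html"
)

// renderComment wraps untrusted text in a <p> element after escaping it.
// (Hypothetical helper for this sketch.)
func renderComment(userInput string) string {
    return "<p>" + html.EscapeString(userInput) + "</p>"
}

func main() {
    fmt.Println(renderComment(`<script>alert("hi")</script>`))
    // Prints: <p>&lt;script&gt;alert(&#34;hi&#34;)&lt;/script&gt;</p>
}
```

The browser displays the escaped text literally instead of running it as a script.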

To get started with the html package, add it to your module by running go get golang.org/x/net/html, then import the package in your Go project files.

        import "golang.org/x/net/html"

The html package doesn't provide functions for generating HTML from your data. Instead, you can use the html/template package, which offers a set of functions for building HTML templates. The html/template package also provides the template.HTMLEscape function, which writes the escaped version of a byte slice to an io.Writer such as an HTTP response writer.

The html/template package is part of the standard library, so you don't need to install anything. Here’s how you can import the package.

        import "html/template"
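As a quick sketch of template.HTMLEscape, the program below writes escaped text into a bytes.Buffer, which stands in for a response writer. The escapeToString helper and the sample strings are made up for this illustration; template.HTMLEscapeString is the package's string-returning variant.

```go
package main

import (
    "bytes"
    "fmt"
    "html/template"
)

// escapeToString writes the escaped form of s into a buffer and returns it.
// (Hypothetical helper; any io.Writer can take the buffer's place.)
func escapeToString(s string) string {
    var buf bytes.Buffer
    template.HTMLEscape(&buf, []byte(s))
    return buf.String()
}

func main() {
    fmt.Println(escapeToString(`<a href="/x">Tom & Jerry</a>`))
    // Prints: &lt;a href=&#34;/x&#34;&gt;Tom &amp; Jerry&lt;/a&gt;

    fmt.Println(template.HTMLEscapeString("5 > 3"))
    // Prints: 5 &gt; 3
}
```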

The html/template package is the standard library's HTML templating package. It is widely used in the Go ecosystem and supports various operations and data types.

Parsing HTML in Go

The Parse function of the html package parses HTML text and documents. It takes a single argument, an io.Reader that supplies the HTML (a file or an HTTP response body, for example), and returns an *html.Node, which is the root node of the parsed document, along with an error.

Here’s how you can use the Parse function to parse a webpage and return all the URLs on the web page.

package main

import (
    "fmt"
    "net/http"

    "golang.org/x/net/html"
)

func main() {
    // Send an HTTP GET request to the example.com web page
    resp, err := http.Get("https://www.example.com")
    if err != nil {
        fmt.Println("Error:", err)
        return
    }
    defer resp.Body.Close()

    // Use the html package to parse the response body from the request
    doc, err := html.Parse(resp.Body)
    if err != nil {
        fmt.Println("Error:", err)
        return
    }

    // Find and print all links on the web page
    var links []string
    var link func(*html.Node)
    link = func(n *html.Node) {
        if n.Type == html.ElementNode && n.Data == "a" {
            for _, a := range n.Attr {
                if a.Key == "href" {
                    // adds a new link entry when the attribute matches
                    links = append(links, a.Val)
                }
            }
        }

        // traverses the HTML of the webpage from the first child node
        for c := n.FirstChild; c != nil; c = c.NextSibling {
            link(c)
        }
    }
    link(doc)

    // loops through the links slice
    for _, l := range links {
        fmt.Println("Link:", l)
    }
}

The main function sends an HTTP GET request to the website with the Get function of the http package and defers closing the response body. The Parse function of the html package reads the response body and returns the root node of the parsed document.

The links variable is a slice of strings that holds the URLs from the webpage. The link function takes a pointer to an html.Node. When the node is an a element, the function loops over the node's Attr slice and appends the Val of every attribute whose Key is href. It then calls itself on each child, starting at FirstChild and following NextSibling, so it visits every node in the document. Finally, the for loop prints all the URLs in the links slice.

Here’s the result of the operation.

[Image: result of retrieving links from a webpage]

Generating HTML in Go

The html/template package provides a set of functions for the safe and efficient parsing and execution of HTML templates. It has the same interface as the text/template package but automatically escapes the data it inserts based on where the data appears, which makes the output safe against code injection. You don't need the html parsing package to use it.
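The safety comes from escaping that the package applies on its own. As a minimal sketch, the program below passes a script tag as template data and renders it into a buffer; the renderGreeting helper and the template text are made up for this illustration.

```go
package main

import (
    "bytes"
    "fmt"
    "html/template"
)

// renderGreeting executes a one-line template with name as its data.
// (Hypothetical helper for this sketch.)
func renderGreeting(name string) (string, error) {
    t, err := template.New("greeting").Parse(`<p>Hello, {{.}}!</p>`)
    if err != nil {
        return "", err
    }
    var buf bytes.Buffer
    if err := t.Execute(&buf, name); err != nil {
        return "", err
    }
    return buf.String(), nil
}

func main() {
    out, err := renderGreeting("<script>alert(1)</script>")
    if err != nil {
        fmt.Println("Error:", err)
        return
    }
    fmt.Println(out)
    // Prints: <p>Hello, &lt;script&gt;alert(1)&lt;/script&gt;!</p>
}
```

The template author's markup passes through unchanged, while the inserted data is escaped.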

You can generate HTML for server-side rendering with the html/template package. Generating HTML is handy for many use cases, such as sending emails and rendering pages on the server. You can use built-in Go data types like maps and structs to supply the data for your webpage.

You’ll need to understand Go HTML templating syntax to successfully generate HTML with the html/template package.

package main

import (
    "html/template"
    "os"
)

type webPage struct {
    Title string
    Heading string
    Text string
}

func main() {
    // Define the template
    tmpl := `
<!DOCTYPE html>
<html>
<head>
    <title>{{.Title}}</title>
</head>
<body>
    <h1>{{.Heading}}</h1>
    <p>{{.Text}}</p>
</body>
</html>`

    // Define the data to be used in the template
    web := webPage{
        Title: "An Example Page",
        Heading: "Welcome to my website!",
        Text: "This is the home page of my website.",
    }

    // Create a new template and parse the template string
    t, err := template.New("webpage").Parse(tmpl)
    if err != nil {
        panic(err)
    }

    // Execute the template and write the result to stdout
    err = t.Execute(os.Stdout, web)
    if err != nil {
        panic(err)
    }
}

The tmpl variable holds the template string. The string uses Go templating actions such as {{.Title}} to mark where the page title, the h1 heading, and the paragraph text go. The webPage struct defines the data for the webpage with the Title, Heading, and Text fields.

The New function of the template package creates a new template named webpage, and its Parse method parses the template string. The Execute method of the resulting template runs the template with the data from your struct instance and writes the result to the given io.Writer (in this case os.Stdout, so the result prints to the console).

[Image: result from html generation]
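Templates can also work with maps and use actions such as range and if. Here is a small sketch under those assumptions; the renderList helper, the map keys, and the markup are made up for this illustration.

```go
package main

import (
    "bytes"
    "fmt"
    "html/template"
)

// listTmpl loops over .Items and shows the footer only when .Footer is set.
const listTmpl = `<ul>{{range .Items}}<li>{{.}}</li>{{end}}</ul>{{if .Footer}}<footer>{{.Footer}}</footer>{{end}}`

// renderList executes listTmpl with a map as its data.
// (Hypothetical helper for this sketch.)
func renderList(items []string, footer string) (string, error) {
    t, err := template.New("list").Parse(listTmpl)
    if err != nil {
        return "", err
    }
    data := map[string]interface{}{
        "Items":  items,
        "Footer": footer,
    }
    var buf bytes.Buffer
    if err := t.Execute(&buf, data); err != nil {
        return "", err
    }
    return buf.String(), nil
}

func main() {
    out, err := renderList([]string{"Go", "HTML"}, "Thanks for reading")
    if err != nil {
        fmt.Println("Error:", err)
        return
    }
    fmt.Println(out)
    // Prints: <ul><li>Go</li><li>HTML</li></ul><footer>Thanks for reading</footer>
}
```

Inside the range block, the dot refers to the current item. An empty string counts as false in an if action, so the footer disappears when no footer text is passed.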

Build Web Applications With Go

Learning to parse and generate HTML with Go is a step toward building more sophisticated web applications. You can use frameworks like Gin and Echo and routers like Gorilla Mux and Chi to build the server side of your web application.

These packages are built on the net/http package (the built-in package for interacting with HTTP in Go) and abstract the complexities of setting up servers and routers in Go.
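To connect the pieces, here is a rough sketch of a net/http handler that renders a template into the response. The handler name, the route, the port in the comment, and the page content are all assumptions made for this illustration. The example calls the handler in-process with the httptest package so that it runs without opening a port.

```go
package main

import (
    "fmt"
    "html/template"
    "log"
    "net/http"
    "net/http/httptest"
)

// page is parsed once at startup; template.Must panics if parsing fails.
var page = template.Must(template.New("page").Parse(`<h1>{{.}}</h1>`))

// homeHandler renders the template straight into the response writer.
func homeHandler(w http.ResponseWriter, r *http.Request) {
    w.Header().Set("Content-Type", "text/html; charset=utf-8")
    if err := page.Execute(w, "Welcome to my website!"); err != nil {
        log.Println("template error:", err)
    }
}

// renderHome calls the handler with a recorded request and returns the body.
func renderHome() string {
    rec := httptest.NewRecorder()
    req := httptest.NewRequest(http.MethodGet, "/", nil)
    homeHandler(rec, req)
    return rec.Body.String()
}

func main() {
    fmt.Println(renderHome())
    // Prints: <h1>Welcome to my website!</h1>

    // In a real application, you would register the handler and start a
    // server instead (the port here is an arbitrary choice):
    //   http.HandleFunc("/", homeHandler)
    //   log.Fatal(http.ListenAndServe(":8080", nil))
}
```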