What is semantic markup, and why does it matter?
You’ve probably heard of web standards, semantic markup, the separation of style from content, and other buzzwords if you’ve read any blogs about web development or open source in the last four years. The problem with advocating for semantic markup is that if you don’t know what it is, it’s easy to live without. In fact, the non-semantic markup of today was probably considered the semantic markup of yesterday, so it’s also easy to get confused. But if you’re in the business of creating documents (websites, blogs, stories, articles, papers, books, etc.), then once you learn and understand semantic markup, you’ll never go back.
So what is semantic markup?
First, what is markup? Entering “define: markup” in Google yields this definition:
Markup refers to the use of a markup language to describe the structure and appearance of a particular document.
I believe that the term markup comes from the printing/publishing industry. Manuscripts for books would have annotations indicating to the typesetter where certain text would be bold, italic, indented, etc. A manuscript ready for publication would literally be “marked up” with annotations. When we use a language like HTML to create a document, we are essentially doing the same thing: annotating a plain text document with little notes about what the different parts of the text mean, and how they should be presented.
Second, what does semantic mean? Here’s a definition from Google:
of or relating to meaning or the study of meaning
So semantic markup is essentially a meaningful way to describe the structure and appearance of a particular document.
Say you are going to write a piece of prose (be it fictional, academic, technical, or whatever). What are the different kinds of text that you might use?
- Chapter and section headings and subheadings
- Vertical lists (e.g., bulleted or numbered lists)
- Large quotations (the kind that are indented and offset from the rest of the text)
- Tables, figures and illustrations
- Snippets of mathematical formulas or computer code (if you are writing a technical document)
- Normal text (paragraphs, the main body of your document)
Depending on the specific kind of document that you are writing, you may not need all of these, or you may need some elements that aren’t listed. But for 95% of formal writing, every single element of your document can be classified as one of the above kinds of text. And within a document, the majority of the text is of the last kind — normal text — the other elements generally appear as aberrations from normal text. As a writer, you want to make it as easy as possible to read and interpret your document, and over the years certain presentational customs have been developed to clarify to the reader when a particular part of your text should be understood as existing outside the general flow of the document.
For example, longer quotations are usually indented to the right and presented without quotation marks. When you see indented text, you know that you are reading a quotation. Or section headings are usually bold or underlined, and presented in a larger font. Because of these conventions, you don’t confuse a section heading with part of a sentence. But these presentational conventions (indenting, bolding, using a different font size, etc.) are only visual cues to alert the reader that the part of the text that they are currently reading differs structurally from the usual flow of the prose. You can create these visual cues on a typewriter just as easily as you could using a computer running a word processor.
Before desktop computing, visual cues were a fine way of differentiating different parts of a text. They certainly make it easier to interpret a document — imagine if you read an entire book with no page breaks, section headings or indentations. It would be very difficult to understand when a topic changed or when the author was quoting another written work. But now, the majority of formal written communication is created using a computer, and it makes sense to use some of the additional power that computers afford to further aid the processes of writing and reading. That’s where semantic markup comes in.
In Microsoft Word, if I want to make a quotation indented, I can hit the right indent button, type my text, then hit the left outdent button to return to the normal paragraph style. Any reader would know that the text that I just typed represents a quotation. But the computer doesn’t know that. All the computer knows is that the formatting of the text is different. But if I created my document in HTML, I could use semantic markup to literally denote the quotation as such:
<p>Some normal text</p>
<p>More normal text</p>
<blockquote>This part is quoted</blockquote>
<p>Back to normal text</p>
Looking at the HTML above, all you know about the quoted part is that it is a quotation. You do not know that it is to be indented, and in fact, it need not be (although most browsers’ default styles will indent it). What’s the difference between my method of making a quotation in Word and my method using HTML? In Word, I created a visual style, the meaning of which the reader must infer. In HTML, I explicitly changed the structural significance of the quotation. In my HTML, I could use CSS to make all <blockquote> elements indented. Or I could make them italic. Or I could surround them with gigantic quotation marks. I can choose to apply any number of visual markup techniques to the document because the core document is written using semantic markup. The difference is that using visual markup, the interpretation process goes like this:
Visual signal in document -> Different interpretation by reader
whereas using semantic markup, interpretation is more flexible:
Semantic signal in document source -> Different visual signal in document -> Different interpretation by reader
So why is that important?
Because if I use semantic markup, I can go backwards, too. The computer can read my HTML and tell me that the indented part is a quotation (because the <blockquote> tag is what caused the indentation). It can’t do that with the Word doc; it can only tell me that the text is indented. Say I wanted to change the visual style of all of my quotations. Instead of being indented, they are going to be written in italics. I can make that change easily in HTML. I just edit the CSS file to say:
blockquote {font-style: italic; margin-left: 0px}
But in Word, I have to either change the style used to create the indented quotations or, worse, manually unindent and italicize every quotation.
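To make the "going backwards" idea concrete, here is a minimal sketch in Python using the standard library's html.parser module. The names QuoteFinder and find_quotes are mine, invented for illustration, and the second fragment is a hypothetical stand-in for what a visual-only tool might produce (an indented paragraph with no stated meaning). This is one way a program could recover quotations, not the only one.

```python
from html.parser import HTMLParser


class QuoteFinder(HTMLParser):
    """Collects the text of every top-level <blockquote> in a document."""

    def __init__(self):
        super().__init__()
        self.depth = 0      # how many <blockquote> elements we are inside
        self.quotes = []    # text of each quotation found so far
        self._buffer = []   # text pieces of the quotation being read

    def handle_starttag(self, tag, attrs):
        if tag == "blockquote":
            self.depth += 1

    def handle_endtag(self, tag):
        if tag == "blockquote" and self.depth > 0:
            self.depth -= 1
            if self.depth == 0:
                self.quotes.append("".join(self._buffer).strip())
                self._buffer = []

    def handle_data(self, data):
        # Only keep text that appears inside a quotation.
        if self.depth > 0:
            self._buffer.append(data)


def find_quotes(html):
    finder = QuoteFinder()
    finder.feed(html)
    finder.close()
    return finder.quotes


semantic = """
<p>Some normal text</p>
<blockquote>This part is quoted</blockquote>
<p>Back to normal text</p>
"""

# Hypothetical visual-only version: it looks the same on screen,
# but nothing in the source says "this is a quotation".
visual = """
<p>Some normal text</p>
<p style="margin-left: 40px">This part is quoted</p>
<p>Back to normal text</p>
"""

print(find_quotes(semantic))  # ['This part is quoted']
print(find_quotes(visual))    # []
```

The program finds the quotation in the first fragment because the markup says what the text is. In the second fragment all it could ever report is that one paragraph has a left margin.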
Say I used styles in Word, though. What’s the big deal? It’s just as easy to change a Word style as it is to change a CSS file. True. But remember, even though I created a quotation style, Word doesn’t understand the meaning of that style; it just knows that the visual presentation is different. Recall my claim that most documents use a limited number of text elements (headers, quotes, tables, lists, etc.). If I wanted to convert my Word document to HTML, there’s a good chance that the converted code would maintain the visual cues, but it wouldn’t maintain the underlying logical cues. If you read the generated HTML, you’d have no way of knowing that the italicized paragraphs represented quotations, because Word never knew that itself. On the other hand, if I wanted to convert my HTML file to LaTeX, all of the semantic markup would be transferred. Elements of the <blockquote> type would become \begin{quote} … \end{quote}. And if I mapped my CSS file to a custom LaTeX style file, I could also preserve the visual markup. The point is that translating the semantic markup from HTML to LaTeX and translating the visual markup from HTML to LaTeX would be two independent processes. But converting semantic markup from Word to HTML is impossible because Word only uses visual markup.
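Here is a rough sketch of what the semantic half of an HTML-to-LaTeX translation could look like, again in Python with the standard html.parser module. The OPEN and CLOSE tables and the html_to_latex function are hypothetical names of my own, the mapping covers only a handful of elements, and it does not escape LaTeX special characters or preserve whitespace between inline elements, so treat it as an illustration rather than a real converter.

```python
from html.parser import HTMLParser

# Hypothetical element-to-LaTeX mapping; a real converter would cover far more.
OPEN = {"blockquote": "\\begin{quote}\n", "h1": "\\section{", "em": "\\emph{"}
CLOSE = {"blockquote": "\n\\end{quote}\n\n", "h1": "}\n\n", "em": "}", "p": "\n\n"}


class LatexConverter(HTMLParser):
    """Translates element names only; style attributes are never consulted."""

    def __init__(self):
        super().__init__()
        self.out = []

    def handle_starttag(self, tag, attrs):
        self.out.append(OPEN.get(tag, ""))

    def handle_endtag(self, tag):
        self.out.append(CLOSE.get(tag, ""))

    def handle_data(self, data):
        # Skip the whitespace-only text that sits between block elements.
        if data.strip():
            self.out.append(data)


def html_to_latex(html):
    converter = LatexConverter()
    converter.feed(html)
    converter.close()
    return "".join(converter.out).strip() + "\n"


source = (
    "<p>Some normal text</p>\n"
    "<blockquote>This part is quoted</blockquote>\n"
    "<p>Back to normal text</p>"
)

print(html_to_latex(source))
```

Notice that the converter looks only at what each element is, never at how it is styled. How quotations should look in the LaTeX output would be decided separately, in a style file, which is the independence I described above.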
The absence of semantic markup, therefore, didn’t matter before all of your data became digital. But now it does. Everything that you write is data, and there’s a good chance that you’ll want more than one application to be able to read and modify that data. Doing so doesn’t require standardized formats, just clean, logical markup underlying the different formats, so that data of one type can be easily mapped to data of another type. But as long as you use programs that do not utilize any semantic markup, or if you misuse markup languages by trying to mimic visual markup, this kind of interchange will not be possible.
Last thoughts
I know, I know, you think it doesn’t apply to you. Maybe you’re right. If you don’t write formal documents, if you don’t have a webpage or a blog, if you don’t collaborate electronically with others, then maybe it doesn’t. But here are the cases, right off the top of my head, where semantics matter a lot:
- Formal papers or presentations – semantic markup means your writing can be translated to LaTeX or XHTML and viewed in different media: online, offline, mobile, etc.
- Web pages – semantic markup means that your web pages can be easily viewed on mobile browsers, or special browsers designed for those with disabilities
- Blogs – semantic markup means that your blog can be rendered on special browsers, as above, as well as translated to Atom or RSS, so people can syndicate your site and read it in a news reader
Those are three important media. I’d say that I get about 80% of my information from papers, web pages and blogs. And as the move toward semantic markup continues, those sources are becoming far more useful to me.