Tuesday, January 25, 2022

Is HTML A Programming Language?

There is a lot of debate on the Internet about whether or not HTML is, in fact, a programming language. On one side, you have those that state that HTML is a declarative programming language, and those that state otherwise are either gate-keeping, or downplaying the importance of HTML. On the other side, you have those that state that HTML is not a programming language, because it is a markup language. That's as far as the debate usually goes, but most people don't know exactly why one side is correct. There is only one correct answer, but to get there, we need to understand how computers work.

A computer uses RAM to store the data it is working with in a randomly addressable area of storage. In most computers, this memory is just a contiguous area of storage, though some devices, like certain video game consoles, might use different banks of memory that may not actually be logically contiguous. For simplicity, we'll just assume that memory is linear and can be used for any type of data. In most programs, memory is used in one of three ways.

The first classification of memory is known as "code." This type of memory is loaded by the OS or virtual machine, then typically marked read-only so it cannot be modified during runtime. Historically, viruses exploited the fact that code memory was writeable, so they could modify themselves while they ran, defeating the antivirus software of the time. Later, various mechanisms were created so that executing code could not be modified, and data could not be executed as code. Any error in code typically results in a program halting.

After this, we have an area of memory called the stack. It typically grows from a fixed address downwards towards zero, within a region reserved for it. As such, this type of memory is a fixed size, and usually cannot be changed once set up. The processor uses the stack to store local variables, and also to remember which functions were called, so that control can return to the calling function when a "return" is executed. Growing past the lower bound of the reserved region results in a stack overflow, and popping past the fixed starting address is a stack underflow. Either of these conditions can cause the program to crash, since it no longer remembers what came before. Stacks use a First In, Last Out design, and are typically modified through push, pop, call, and return instructions.
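The stack behavior described above can be sketched in a few lines of Python. This is a toy model under stated assumptions, not how any real processor implements its stack: the class name FixedStack, the exception names, and the tiny region size are all invented for illustration.

```python
class StackOverflow(Exception):
    """Raised when a push would run past the bottom of the reserved region."""


class StackUnderflow(Exception):
    """Raised when a pop would run past the fixed starting address."""


class FixedStack:
    """Toy fixed-size stack that grows downwards (hypothetical model)."""

    def __init__(self, size=8):
        self.memory = [0] * size   # the reserved, fixed-size region
        self.top = size            # stack pointer starts at the high address

    def push(self, value):
        if self.top == 0:
            raise StackOverflow("no room left below the stack pointer")
        self.top -= 1              # grow towards the low end of the region
        self.memory[self.top] = value

    def pop(self):
        if self.top == len(self.memory):
            raise StackUnderflow("nothing left to pop")
        value = self.memory[self.top]
        self.top += 1              # shrink back towards the starting address
        return value


stack = FixedStack(size=2)
stack.push("first")
stack.push("second")
print(stack.pop())  # the last value pushed comes off first
print(stack.pop())
```

A third push on this two-slot stack would raise StackOverflow, and a third pop would raise StackUnderflow, which mirrors the two failure conditions named in the text.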

Finally, we have the third area of memory, called the heap. This is dynamically allocated memory, and is therefore limited only by the amount of free memory available. When a program allocates space on the heap, the memory address of that space is assigned to a local variable on the stack, often called a pointer, or a reference. When the program is done with the data, that memory then needs to be freed. In many languages, this is automatic, but in some older languages the programmer has to explicitly free the memory that is no longer used. If the addresses are "lost" before the heap memory is freed, this results in a behavior known as a memory leak.
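To make the idea of a lost address concrete, here is a minimal sketch of a heap in Python. It is only a model: ToyHeap, its methods, and the starting address 1000 are hypothetical names and values chosen for illustration, and Python itself manages real memory automatically.

```python
class ToyHeap:
    """Toy heap: hands out integer addresses and tracks live allocations."""

    def __init__(self):
        self.blocks = {}           # address -> stored data
        self.next_address = 1000   # arbitrary starting address for illustration

    def allocate(self, data):
        address = self.next_address
        self.next_address += 1
        self.blocks[address] = data
        return address             # the caller keeps this "pointer" in a local variable

    def free(self, address):
        del self.blocks[address]

    def live_blocks(self):
        return len(self.blocks)


heap = ToyHeap()

pointer = heap.allocate("some data")   # local variable holds the address
heap.free(pointer)                     # freed properly: nothing is left behind

pointer = heap.allocate("more data")
pointer = None                         # address "lost" before the block was freed

print(heap.live_blocks())  # 1: a block is still allocated but can no longer be freed
```

The second allocation is the leak: the block still exists in the heap, but no variable remembers where it is.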

At this point, we should briefly talk about the concept of a "virtual machine." A virtual machine emulates a physical processor, but can execute its own set of instructions that may not be native to the physical processor. It has a section defined as code, a stack, and heap memory. For purposes of deciding if a language is a programming language, a virtual machine is the same as a physical processor. In other words, even though the instructions are technically data as far as the physical processor is concerned, they are still considered machine instructions in the context of a virtual machine, as those instructions are directly converted to physical instructions.
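As a rough illustration of what a virtual machine does, the sketch below runs a made-up instruction set on a toy stack machine. The opcodes (PUSH, ADD, JUMP_IF_ZERO, PRINT) and the run function are assumptions invented for this example and do not correspond to any real VM. It models only the code and stack areas, not a heap.

```python
def run(program):
    """Execute a list of (opcode, argument) pairs on a toy stack machine."""
    stack = []     # the VM's own stack
    output = []    # values produced by PRINT
    pc = 0         # program counter: index of the next instruction in the code

    while pc < len(program):
        opcode, argument = program[pc]
        pc += 1
        if opcode == "PUSH":
            stack.append(argument)
        elif opcode == "ADD":
            right = stack.pop()
            left = stack.pop()
            stack.append(left + right)
        elif opcode == "JUMP_IF_ZERO":
            if stack.pop() == 0:
                pc = argument      # conditional branch to another instruction
        elif opcode == "PRINT":
            output.append(stack.pop())
        else:
            raise ValueError(f"unknown opcode: {opcode}")
    return output


# The "code" section: instructions executed one at a time, in order.
program = [
    ("PUSH", 2),
    ("PUSH", 3),
    ("ADD", None),
    ("PRINT", None),
]
print(run(program))  # [5]
```

To the physical processor the program list is just data, but inside the toy machine each entry is an instruction that is fetched, decoded, and executed.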

We can also talk about interpreters. They are essentially virtual machines, but have the added ability that a developer can pause the machine at any time, change the instructions as they want, or even run ad hoc functions without first compiling any code. Many modern languages have some sort of interpreter that can be paused, inspected, and modified at any time. This helps facilitate shorter development cycles than traditional languages that need to be compiled and debugged.

Now, let's discuss what a programming language is. It is source code that is somehow directly converted into instructions that the processor uses to execute some algorithm. These algorithms use data on the stack and the heap in order to process input, perform calculations, and then generate some kind of output. For example, given a simple program that adds two numbers, the algorithm reads two inputs, performs a mathematical calculation, then outputs the result.

Alan Turing was one of the first to describe computer logic, in the form of the Turing Machine. This machine had three parts: the main processor, registers that informed the processor of its state, and an infinitely long tape used to read and store the data used in its operation. This basic description of a Turing Machine is the same general philosophy that all modern devices are built on. The code, stack, and heap memory areas are, more or less, the three parts of the Turing Machine.
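The three parts described above (a processor following rules, a current state, and a tape) can be sketched as a small simulator. This is a simplified illustration rather than a formal definition: the function name, the rule format, the "_" blank symbol, and the max_steps safety limit are all assumptions made for the example.

```python
def run_turing_machine(rules, tape, state="start", blank="_", max_steps=1000):
    """Run a single-tape machine until it reaches the 'halt' state.

    rules maps (state, symbol) -> (symbol_to_write, head_move, next_state),
    where head_move is -1 (left) or +1 (right).
    """
    cells = dict(enumerate(tape))   # sparse tape: position -> symbol
    head = 0
    steps = 0
    while state != "halt":
        if steps >= max_steps:
            raise RuntimeError("machine did not halt within max_steps")
        symbol = cells.get(head, blank)          # unvisited cells read as blank
        write, move, state = rules[(state, symbol)]
        cells[head] = write
        head += move
        steps += 1
    # Read the tape back in order, dropping blank cells at either end.
    ordered = "".join(cells[i] for i in sorted(cells))
    return ordered.strip(blank)


# Example rule table: flip every bit, then halt at the first blank cell.
flip_bits = {
    ("start", "0"): ("1", +1, "start"),
    ("start", "1"): ("0", +1, "start"),
    ("start", "_"): ("_", +1, "halt"),
}
print(run_turing_machine(flip_bits, "1011"))  # 0100
```

The rule table plays the role of the code, the state variable plays the role of the registers, and the dictionary of cells stands in for the unbounded tape.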

Given this, we can now examine the evidence for or against HTML as a programming language. There is no access to a code section, there is no stack, no control statements, no allocation of dynamic memory objects, no addresses or references, or anything else remotely approaching the description of a virtual machine or processor. Some people will make an argument that a collection of pages forms a Turing Machine, but that only points to the file structure being a finite automaton. The HTML itself is still just data.

In reality, the HTML is parsed into a data structure called the Document Object Model. This DOM is loaded into the browser's heap memory, and then rendered as output on the screen or destined for a PDF or printer. At no point does the DOM implement any algorithms. It is used as input for the rendering engine, a part of the browser's code, in order to generate output. This is not the same as being a program whose instructions are executed one at a time in order.
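The parsing step described above can be approximated with Python's standard html.parser module. This is only a rough stand-in for a browser's DOM: the TreeBuilder class, the parse helper, and the nested-dictionary layout are hypothetical, and the sketch assumes well-formed markup in which every start tag has a matching end tag.

```python
from html.parser import HTMLParser


class TreeBuilder(HTMLParser):
    """Builds a nested dict tree from markup; a rough stand-in for a DOM."""

    def __init__(self):
        super().__init__()
        self.root = {"tag": "#document", "children": []}
        self.open_nodes = [self.root]   # path from the root to the current node

    def handle_starttag(self, tag, attrs):
        node = {"tag": tag, "attrs": dict(attrs), "children": []}
        self.open_nodes[-1]["children"].append(node)
        self.open_nodes.append(node)

    def handle_endtag(self, tag):
        # Assumes well-formed markup: each end tag closes the current node.
        if len(self.open_nodes) > 1:
            self.open_nodes.pop()

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.open_nodes[-1]["children"].append({"text": text})


def parse(markup):
    """Parse markup and return the root of the resulting tree."""
    builder = TreeBuilder()
    builder.feed(markup)
    builder.close()
    return builder.root


tree = parse("<p>Hello <b>world</b></p>")
print(tree)
```

The result is nothing more than nested dictionaries and lists. Nothing in the tree is executed; any behavior comes from the code that walks it, just as the rendering engine walks the DOM.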

As further evidence against HTML being a programming language, one should note that code cannot modify code. You can generate and execute code on the fly, but you cannot directly modify code from code in virtually any language. If HTML were a programming language, then JavaScript should not be able to modify it directly, as this would be a major security risk. HTML already has plenty of potential security vulnerabilities without also being an executable set of instructions.

We have different terms to define a job's skill set. Designers work only with things considered data, including HTML, images, and so on. Programmers work only with programming languages, typically text apps and server-side code. Developers take on the role of both programmers and designers, utilizing CSS, HTML, and JavaScript in equal parts. In each of these three roles, significantly skilled people earn a very respectable living.

Information Technology belongs to the field of Computer Science. Science is a collection of knowledge, and that includes classifying various types of data. We cannot call HTML a programming language and preserve the purity of that scientific knowledge. HTML is not, by any technical definition, a programming language. It doesn't compile to code, it is not executed in a real or virtual processor, it can be modified in real-time by JavaScript, and it cannot perform even the simplest calculations or conditional branches without CSS or JavaScript. It is truly a document format, much like a Word Document or PDF.

In conclusion, we should now understand two things. HTML is not a programming language, as it has none of the hallmarks of what we consider a programming language, and HTML is a vitally important tool for any developer or designer working on web-based applications, as anything with a web-based UI must use HTML at some level. Those who use HTML effectively are definitely in high demand, and will continue to be so for the foreseeable future. Nobody is trying to "gate-keep" anybody by defining HTML as not-a-programming-language. They are simply using the pure definition of what a programming language is, and wish to see the term preserved.