Languages and Compilers: Getting Started

On my undergraduate course I’m undertaking a short (but stressful!) project to build an expression analyser and a very small compiler that translates a higher-level language into the Forth language, with the official specification found on the CE305 website [1]. This will probably be written in Java, but could be written in Cool, since I’m also taking the most excellent Compilers course on Coursera by Stanford University [2], backed up with the excellent ‘Writing Compilers and Interpreters: A Software Engineering Approach’ [3].
The brief for the expression analyser is that I’m required to :

produce and present details of the formal specification of the expression analyser, including the regular expressions for the tokeniser and the BNF grammar that is implemented by the parser. The parser should have options to display the internal representation of parsed expressions in an easily readable form (e.g. using Graphviz or similar). The final program should produce sensible error messages if given ill-formed expressions. Well-formed expressions should be compiled into Forth code, which should be demonstrated to run correctly on the gForth interpreter [1].

This means that I’m required to submit a specification and implementation for the report, and in the specification I must specify the ‘languages and translations that are involved in your expression compiler. These should include the tokens, The expressions, The target Language and The Translation.’. Furthermore, for the compiler I’m required to:

produce and present details of the formal specification of the expression analyser, including the regular expressions for the tokeniser and the BNF grammar that is implemented by the parser. The parser should have options to display the internal representation of parsed expressions in an easily readable form (e.g. using Graphviz or similar). The final program should produce sensible error messages if given ill-formed expressions. Well-formed expressions should be compiled into Forth code, which should be demonstrated to run correctly on the gForth interpreter [1].

This is obviously a fairly challenging task, and one considered by my professor to be the traditional epitome of computer science in many ways. What I’m going to explore in this initial article is really the foundations of what compilers and languages are, what the influences on compilers have been, how they work, and perhaps also why, in the view of Aiken [2], there are so many languages around: because the fundamental cost of a programming language is the cost of ‘training’ programmers.

You may recall that I’ve already remarked in a previous post on Brooks’s observation that there are two major elements to all software development, essence and accident, which lead to delays and ultimately higher project costs [4]. Whilst the focus of Brooks’s article was on the essential, human difficulties rather than accidental ones such as syntactic mistakes, even Brooks admitted that the technological breakthrough of FORTRAN and its compiler meant that code which took weeks or months to develop in the late 1950s took only hours or days.
While the creation of new languages, compilers and interpreters continues today, I must confess that I’ve never really given enough thought to what happens when I compile code with Clang or use the Python interpreter. This project is finally a chance to really understand what’s going on when I run into strange linker and compiler errors, because in the process of building it I will have got to grips with the real fundamentals of languages and their compilers and/or interpreters.
Let’s get started by discussing some fundamentals.
What do we mean by a language?
For most of us we’re used to speaking and writing our own mother tongue, whether it be English, Spanish, Chinese or Manx. A language is a set of sequences of symbols to which we attach some significance or meaning, and generally the reason for any kind of communication is to gain influence in some way. In order to communicate, the community that speaks that language must agree upon a common protocol for how it should be structured into a meaningful set of articles, nouns, verbs, adjectives and so on. For any language, therefore, the following must hold true.
Language = semantics +  syntax +  symbols
                       (or meaning + structure + alphabet)
We will now continue to  analyse these parts from the perspective of language processing and compiling.

Semantics :

‘His name was Claire, her name was Bob’
From the human point of view that obviously doesn’t sound right, because we know almost instinctively that Claire is normally a girl’s name and Bob a boy’s name, and yet the pronouns don’t match up in the sentence. The same thing can happen in the compiler world when we have problems with incompatible types. For this we use type checking to locate places where types mismatch, for example:
int i = 1.0;

The above snippet of code should, even for the novice programmer, be setting off some alarm bells: in languages like Java, a double (floating-point) value cannot be implicitly converted to an integer. It just doesn’t work. Here’s another example:
Peter tells Peter his great-great-grandfather was Santa Claus
In this second example a human can easily see that the most likely interpretation is that Peter 1 has told another Peter 2 that Peter 2’s great-great-grandfather was Santa Claus, and hopefully we can agree upon that, dear reader. However, for the computer this is a very ambiguous sentence – it might mean what we’ve already discussed, but it could also mean that Peter 1’s great-great-grandfather was Santa Claus, or that a third person not in the discussion was related to Santa Claus – we don’t really know exactly.
Moreover, if we were to translate a sentence into another language we expect that the semantics should be preserved so that the purpose of communication is fulfilled (typically to influence or command another agent).  It’s by no means clear how easy it would be for the average human to directly translate the sentence ‘who ate all the pies’ into bits and bytes without error.
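To make the type-checking idea concrete, a toy semantic check might look like the following; the type names and the single widening rule are entirely my own illustration, not a real compiler’s rule table:

```java
// A toy semantic check in the spirit of the int/double example above.
public class TypeCheck {
    // Returns true if a value of type `source` may be assigned to `target`.
    static boolean assignable(String target, String source) {
        if (target.equals(source)) return true;
        // Widening int -> double is safe; narrowing double -> int is not.
        return target.equals("double") && source.equals("int");
    }

    public static void main(String[] args) {
        // double d = 1;  -- fine (widening)
        System.out.println(assignable("double", "int"));
        // int i = 1.0;   -- rejected (narrowing), as in the example above
        System.out.println(assignable("int", "double"));
    }
}
```

A real type checker would walk the parse tree and consult a table of rules like this at every assignment and operator.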

Syntax:

According to the Oxford dictionary, syntax is defined as ‘The arrangement of words and phrases to create well-formed sentences in a language: the syntax of English’, but it is also defined as ‘the structure of statements in a computer language’ (more on that later). According to materials from the University of Colorado [6], a syntax is comprised of three elements:

  • Constituency – the way words are grouped together.
  • Grammatical relations – the way in which subject, object and verb are structured, and their possible transformations.
  • Subcategorisation and dependencies – the contextual constraints on what words can be placed where; for example, ‘I want’ can be followed by an infinitive, ‘to fly a plane’.

In order to understand any sentence, a human has to be able to break up the different parts of that sentence. For example, if I wanted to understand the sentence ‘who ate all the pies’ I would subconsciously look at each of the words in that sentence to understand what it meant, and a computer program called a lexer does the same thing for the following line of code:

 if a == 5 then x=2 

From this example we can break the statement up into a series of tokens – words recognised by patterns (note that white space alone is not enough, since ‘x=2’ contains no spaces). This allows us to identify the keywords, operators and variables that form the building blocks of the sentence, such as assignments or relations. From these we can build a parse tree so that we can understand the structure of the sentence as a hierarchy of the different terms, as we can see below.

[Screenshot: parse tree for the example statement ‘if a == 5 then x=2’]
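The tokenisation step itself is straightforward to sketch. Below is a minimal, illustrative tokeniser in Java; the token set, regular expression and class name are my own, not part of the CE305 brief:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// A minimal tokeniser: one regex alternative per token class.
// Keywords and '==' are listed before the general patterns so they win.
public class TinyLexer {
    private static final Pattern TOKEN = Pattern.compile(
        "\\s*(==|=|if|then|else|[A-Za-z]+|[0-9]+)");

    public static List<String> tokenize(String input) {
        List<String> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(input);
        int pos = 0;
        // Repeatedly match the next token at the current position.
        while (pos < input.length() && m.find(pos) && m.start() == pos) {
            tokens.add(m.group(1));
            pos = m.end();
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Splits the statement into: if, a, ==, 5, then, x, =, 2
        System.out.println(tokenize("if a == 5 then x=2"));
    }
}
```

A real tokeniser would also attach a token type (keyword, identifier, number, operator) to each lexeme rather than returning bare strings.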

This structure is implied by the grammar of the language and the constituency (grouping together) of some of the words like ‘==’. Humans are by nature experts in grammar, and natural languages are, to a reasonable approximation, describable by context-free grammars. It’s also the case that the syntax of high-level programming languages is context free as well. This makes understanding grammars a key part of the parsing process. So what do we mean by a context-free grammar? Well, all context-free grammars have the following components [2]:

  • A set of terminals T,
  • A set of non-terminals N,
  • A start symbol S,
  • A set of productions (rules) P.

When we parse a sentence we need to get from the start symbol to the sentence itself by applying productions, transitioning towards terminal symbols, which cannot be rewritten any further – and it’s these terminals that form the leaves of our parse tree. The productions and the terminals determine what constitutes the symbols of the language, or its acceptable ‘tokens’.

For example, consider the following productions:

EXPR → if EXPR then EXPR else EXPR fi [2]
EXPR → while EXPR loop EXPR pool
EXPR → id

In this example, the non-terminals are the capitalised symbols and the terminals are in lower case. The first production tells us that an expression can be an if-then-else whose condition and branches are themselves expressions – for example, with an identifier for a variable stored on the machine: if x then x else x fi.
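A parser can be written almost directly from productions like these. Below is a sketch of a recursive-descent recogniser for just the if/id fragment of the example; the class and method names are my own:

```java
import java.util.List;

// A recursive-descent recogniser for:
//   EXPR -> if EXPR then EXPR else EXPR fi | id
// Each non-terminal becomes a method; each terminal becomes a match().
public class ExprParser {
    private final List<String> tokens;
    private int pos = 0;

    private ExprParser(List<String> tokens) { this.tokens = tokens; }

    // Returns true if the whole token list derives from EXPR.
    public static boolean accepts(List<String> tokens) {
        ExprParser p = new ExprParser(tokens);
        return p.expr() && p.pos == tokens.size();
    }

    private boolean expr() {
        if (match("id")) return true;                  // EXPR -> id
        return match("if") && expr() && match("then")  // EXPR -> if EXPR then
            && expr() && match("else") && expr()       //   EXPR else EXPR fi
            && match("fi");
    }

    // Consume the next token if it equals `expected`.
    private boolean match(String expected) {
        if (pos < tokens.size() && tokens.get(pos).equals(expected)) {
            pos++;
            return true;
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println(accepts(List.of("if", "id", "then", "id", "else", "id", "fi")));
    }
}
```

A real parser would build and return parse-tree nodes instead of booleans, but the shape of the code follows the grammar in exactly the same way.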

As you can imagine, this gets very complicated, very quickly. I hope it’s possible to see how a language’s grammar not only can transform a sentence, but can be used to determine the valid rules by which a sentence might be constructed, in human-readable languages too.

Symbols :

In human languages we normally have a commonly agreed set of symbols, or an alphabet, that when combined together make words, which in turn make sentences, as we’ve already expounded. The major hurdle between human language and machine language is that the sets of symbols are completely different: in English we’re used to communicating with a set of 26 letters, the computer with just 0 and 1. This presents a fundamental challenge: communicating with a computer operating in a regular language of 1s and 0s, when you, me and anybody else are used to communicating under a completely different ruleset. This is the role of a computer translator: to carry out the otherwise unfathomable process of translating complex logical statements and expressions into a completely alien language of 1s and 0s.
a compiler takes source code and translates it into machine code, an interpreter takes source code and translates it into actions
– Mak, R.
However, there are at least three ways in which code can effectively be translated. It may be interpreted, it may be compiled, or, in a relatively modern twist, it may be compiled and interpreted in some hybrid of the two. A suitable and succinct definition of these terms is provided by Mak, who states that ‘a compiler takes source code and translates it into machine code, an interpreter takes source code and translates it into actions’. There’s a subtle but profound difference lurking in that sentence, not only in the implementation of the translation but also in the capability to debug problems and the behaviour of the translated programs: for example, are they fully optimised, and do they completely retain the meaning of the source language they were translated from?
What I take from Mak’s definition is that while in both cases code is translated into a language that retains the principles we’ve established, the compiler separates the steps of translation and execution, whereas the interpreter translates and executes in real time. Hence we can visualise how a compiler, at its most fundamental level, is distinct from an interpreter.
What is the effect of this difference in implementation? Perhaps the biggest distinction is performance: even modern interpreted languages such as Python can be around 20–30 times slower (even by today’s standards) than a compiled language, which is significant when you’re dealing with real-time systems such as trains, planes or nuclear power stations, where you expect guaranteed performance from the components in the system. However, what interpreted languages lose in performance they make up for in ease of use and portability.
Why is this? Well, an interpreted language allows for far better debugging: not only can you see exactly where an exception (an error in the program) occurred, because you’re translating as you execute the problem code, but you can also pause the translation and execution, allowing for step-by-step debugging. It also allows for far better portability, because compilers are typically written for, and compile to, specific architectures only (big-endian versus little-endian differences might cause problems, for example), whereas interpreters translate on the fly into the host’s architecture. In all, this has led to the movement towards modern hybrid translators: ‘just in time’ compilers that combine the advantages of both interpreters and compilers, giving the benefits of debugging and portability with the performance of compiled languages. In the next section we’ll look at how this all potentially fits together in a modern compiler.

How this all fits together :

[Screenshot: diagram of the lexer, parser and executor, from Mak [3]]

Since the beginning of this article I’ve subtly been introducing elements of compilation – the parsing and tokenisation found in its initial stages – which we can see in the above diagram from Mak’s book as the lexer and parser, with the executor used only in interpreters [3]. What we’re still missing is a way to actually perform the translation.

According to Mak [3], the front end generates intermediate code and a symbol table that serve as the common interface between the front and back ends. What this means is that the parser generates a parse tree, which we’ve covered, and a symbol table [3, ch. 4], which is effectively a place where the compiler stores a list of tokens (typically identifiers) along with metadata about each one. The advantage of this approach is that front ends and back ends become independent: front ends for different source languages can share a common back end.
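To make the symbol-table idea concrete, here is a minimal sketch; the structure and the attribute names are my own simplification, not Mak’s actual API:

```java
import java.util.HashMap;
import java.util.Map;

// A minimal symbol table: each identifier maps to a small bag of
// attributes (metadata) that later compiler phases can consult.
public class SymbolTable {
    private final Map<String, Map<String, Object>> entries = new HashMap<>();

    // Record an identifier along with some illustrative metadata.
    public void define(String name, String type, int declaredAtLine) {
        Map<String, Object> attrs = new HashMap<>();
        attrs.put("type", type);
        attrs.put("declaredAt", declaredAtLine);
        entries.put(name, attrs);
    }

    // Look up one attribute of an identifier; null if unknown.
    public Object lookup(String name, String attribute) {
        Map<String, Object> attrs = entries.get(name);
        return attrs == null ? null : attrs.get(attribute);
    }
}
```

In a full compiler the tables are also nested, one per scope, so that a lookup can walk outwards from the innermost block to the global scope.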

The process of code generation for the Java Virtual Machine is somewhat unique. The approach recommended by Mak is to use a language called Jasmin [7], an assembly language for the Java Virtual Machine. In his example, the code generator processes the intermediate code (the parse tree) together with the symbol table to generate machine-language instructions. In the example given, this means the compiler is able to convert a Pascal hello-world program into its Java equivalent:
From this : 
PROGRAM HelloOnce;

BEGIN
writeln('Hello, world.')
END.
to the equivalent Java code : 
public class HelloOnce
{
    private static RunTimer _runTimer;
    private static PascalTextIn _standardIn;

    public static void main(String[] args)
    {
        _runTimer = new RunTimer();
        _standardIn = new PascalTextIn();
        System.out.println("Hello, world.");
        _runTimer.printElapsedTime();
    }
}
However, in our example we will be generating another lower-level language, and we will need a different way of generating code… more on that when I figure it out!
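That said, because Forth is a stack-based, postfix language, a plausible first guess is that the code generator will simply emit the parse tree in postorder: operands first, operator last. A sketch of that idea (the tree representation and names are my own assumption, not the project’s actual design):

```java
// Emitting Forth from an expression tree: a postorder walk produces
// postfix code, which matches Forth's stack discipline directly.
public class ForthGen {
    // A binary expression node; leaves hold literals or identifiers.
    static class Node {
        final String value;
        final Node left, right;
        Node(String value) { this(value, null, null); }
        Node(String value, Node left, Node right) {
            this.value = value; this.left = left; this.right = right;
        }
    }

    // Postorder: left operand, right operand, then the operator word.
    static String emit(Node n) {
        if (n.left == null) return n.value;  // leaf
        return emit(n.left) + " " + emit(n.right) + " " + n.value;
    }

    public static void main(String[] args) {
        // (1 + 2) * 3  becomes the Forth code "1 2 + 3 *"
        Node tree = new Node("*",
            new Node("+", new Node("1"), new Node("2")),
            new Node("3"));
        System.out.println(emit(tree));
    }
}
```

Feeding the emitted string to gForth should leave the expression’s value on the stack, which is exactly the behaviour the brief asks us to demonstrate.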

Final thought :

So far we’ve covered a lot of material on compilers and the nature of languages, but have you ever considered why on earth there are so many languages out there? According to Aiken [2] this has a lot to do with the ‘economy of programming languages’: the collaboration problem of trying to get so many programmers to agree on anything, and the costliness of training programmers in new languages. This is true of Java, FORTRAN and any other language you can think of.
The aforementioned FORTRAN was a scientific language, designed with parallelism and heavy floating-point computation in mind. In contrast, SQL was almost entirely designed for business purposes, to store and analyse data: queries and data analysis were at its core. Systems languages, in turn, were designed around different concerns again, such as object orientation. What couldn’t be achieved in one language led to the factionalisation of programmers into other groups, who created their own languages. If the trend towards variation continues, there’s a chance I might even make my own language in a few years to come, and while I don’t underestimate the challenge of writing pete++, the skills learned from constructing compilers and languages are still highly relevant today.

Referencing:

[1] Fox, C. (2013). “CE305 Languages and Compilers”. University of Essex [Online]. Available: http://orb.essex.ac.uk/ce/ce305/#sec-6.1 [Accessed: Dec 14, 2014].
[2] Aiken, A. (2013). “Compilers”. Coursera [Online]. Available: https://class.coursera.org/compilers-003
[3] Mak, R., Writing Compilers and Interpreters: A Software Engineering Approach, 3rd ed. Indianapolis, Indiana: Wiley Publishing, 2009.
[4] Gent, P., “Personal Extreme Programming. Part 2 – Why Agile and why XP?”, Nov 9, 2013 [Blog entry]. Peter Codes. Available: https://petercodes.wordpress.com/2013/11/09/personal-extreme-programming-part-2-why-agile-and-why-xp/ [Accessed: Nov 9, 2013].
[5] Klein, W. (Oct 31, 2005). “Context Free Grammars”. University of Edinburgh [Online]. Available: http://www.inf.ed.ac.uk/teaching/courses/icl/lectures/2006/cfg-lec.pdf [Accessed: Jan 15, 2014].
[6] UCCS (n.d.). “Context-Free Grammars for English”. UCCS [Online]. Available: http://www.cs.uccs.edu/~jkalita/work/cs589/2010/12Grammars.pdf [Accessed: Jan 14, 2014].
[7] Meyer, J. et al., “Jasmin”. SourceForge [Online]. Available: http://jasmin.sourceforge.net [Accessed: Jan 15, 2014].

Bibliography :

[8] UltraTechnology, “A History of Forth” [Online]. Available: http://www.ultratechnology.com