Are you creating a programming language? Wouldn’t you like it to be secure?
Disclaimer: I am not a security expert, not even remotely, but I have read about security and viruses and I can imagine various ways people might take advantage of this sort of system. Furthermore, since you ultimately decide the language with which you will create an interpreter, I expect you to make yourself aware of best programming practices within that language, and consequently, I won’t cover such information here.
I would like to say you shouldn’t be worried about enforcing strict rules for your language, but that’s not true. Ideally, we want to avoid any security holes that could be exploited by running someone else’s source code through our interpreter, especially since the interpreter might be embedded in another application. To avoid such holes, an interpreter must handle its input carefully and “crash” by returning an error signal rather than abruptly killing the entire application. Stray tokens are a particular concern because they can conceal information intended for malicious purposes. A carefully crafted input could obey the syntax rules while stuffing the file with byte instructions that get ignored because of act-dumb or forgot-to-implement rules. That said, you need to be aware of where (and how) things happen inside the interpreter. Carefully restricting the character sets allowed in tokens is the best start, because it leaves far fewer crafted bits to worry about later in the program. By not implementing commands like “goto” (and disguised “goto” commands, like “switch”), a great deal of syntax play can be avoided. (That’s not to say “goto” isn’t a useful feature that, if removed, shouldn’t be compensated for with something suitable. The basic problem with goto is that an attacker may use it to bypass the current procedure and jump to code that is malicious, or simply bad to run in the current state of the program. How able they are to do this depends on what kind of goto you have implemented and the unique design of the program being exploited.)
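To make the restrict-the-character-set idea concrete, here is a minimal sketch in Python (the whitelist and the function names are my own inventions, not taken from any real interpreter): a tokenizer that refuses anything outside a small allowed set and reports the problem through a return value instead of letting an exception kill the embedding application.

```python
import re

# Whitelist of characters the tokenizer will accept; everything else is
# rejected up front, before any later stage has to reason about it.
# (The exact set is arbitrary for this example.)
ALLOWED = re.compile(r"[A-Za-z0-9_+\-*/()=<>\s]")

def tokenize(source):
    """Return (tokens, error). Never raises; a bad character yields an
    error signal instead of crashing the host application."""
    tokens, buf = [], []
    for i, ch in enumerate(source):
        if not ALLOWED.match(ch):
            return None, f"illegal character {ch!r} at offset {i}"
        if ch.isspace():
            if buf:
                tokens.append("".join(buf))
                buf = []
        else:
            buf.append(ch)
    if buf:
        tokens.append("".join(buf))
    return tokens, None
```

With this shape, `tokenize("x = 1 + y")` hands back a token list and `None`, while a smuggled byte like `"\x00"` produces `None` and an error message the host can log or show, without the interpreter ever throwing.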
Remember, this is said considering situations where you are accepting code from some foreign source and trying to feed it into the interpreter. Obviously, if you are working alone, you probably don’t care as much about the security of the system because you believe you can feed it perfect input (or at least try). I, on the other hand, would like every new language interpreter to be suitable for a REPL, even if reimplemented for online usage (which would invite a number of crafty visitors).
One of the major problems with SQL databases like MySQL is injection: the ability to craft nifty little inputs that extract all sorts of data beyond what the naive programmer designed for. If you’re creating a database language, you should avoid letting general operations mix with administrative data operations. Wonderful thought in hindsight. Too late for MySQL. In a similar way, a language without tight control over its syntax could allow someone to forge an input that defeats any defensive programming put in place.
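The classic application-side fix is to keep input out of the query syntax entirely with parameterized queries. A small demonstration using Python’s built-in sqlite3 module (the table and the crafted input are made up for the example; the same principle applies to MySQL drivers):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT, is_admin INTEGER)")
conn.execute("INSERT INTO users VALUES ('alice', 0), ('root', 1)")

# A crafted "nifty little input": closes the quote, adds its own
# condition, and comments out the rest of the statement.
name = "alice' OR is_admin=1 --"

# Vulnerable: the input is spliced directly into the syntax.
leaked = conn.execute(
    "SELECT name FROM users WHERE name = '%s'" % name).fetchall()

# Safer: a bound parameter can only ever be data, never syntax.
safe = conn.execute(
    "SELECT name FROM users WHERE name = ?", (name,)).fetchall()
```

The vulnerable query leaks every admin row; the parameterized one matches nothing, because no user is literally named `alice' OR is_admin=1 --`.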
Security ultimately boils down to input handling. If your program crashes from random, unrepeatable events, that’s just a bug, not necessarily a security hole. How often does that happen? Probably rarely since states are generally repeatable (unless they are dependent on truly random numbers). If there is any sort of input whatsoever that can cause your program to crash, that’s a security hole.
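One practical way to act on that definition is to fuzz your own entry point: feed it piles of random bytes and treat any uncaught exception as a security bug, not just a crash. A toy sketch, where `run()` is a stand-in for whatever your interpreter actually exposes (the filtering rule inside it is invented for the example):

```python
import random
import string

def run(source):
    """Stand-in for an interpreter entry point: returns an error string
    for bad input instead of raising."""
    if any(ch not in string.printable for ch in source):
        return "error: illegal character"
    return "ok"

random.seed(0)  # a fixed seed keeps any failure repeatable
for _ in range(1000):
    garbage = "".join(chr(random.randrange(256)) for _ in range(32))
    result = run(garbage)           # must never raise...
    assert isinstance(result, str)  # ...only ever report
```

If this loop ever throws, you have found exactly the kind of input-triggered crash described above, and you have the seed to reproduce it.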
One of the major questions you have to ask is whether you should sacrifice power for security. The mainstream programming languages are designed to be powerful. A language meant to be embedded in an application is a bit different. It’s only meant to do certain things for that application and yet be capable of anything the application needs. That means it needs to be extensible. Extensions should be eyed with suspicion. As extensions may be quickly created for a project and yet directly handle input, they are the most susceptible to security-breaking bugs, regardless of the maturity of the rest of the code base.
That said, suppose we would like to accept Unicode? First, you have to filter out all raw data that is not valid Unicode, but the very fact that it is Unicode provides a whole slew of bytes that could be used for malicious purposes. Some languages handle Unicode by forcing input to come in the form of ASCII-represented code points (think escape sequences like \u0041 for “A”). The file itself is thus not useful for storing “complicated” malicious byte code (yes, I realize that’s a bit of a relative statement). This is a reasonably effective solution, but hardly readable. It’s not too difficult to safely verify Unicode input in strings. What is of more concern is (1) how attributes of the source code can be hidden from examiners using glyph-less characters (such as zero-width and other invisible separator characters) and (2) how the Unicode will be used by extensions that are meant to handle raw data. Anything resembling whitespace must either be treated as any other whitespace (and dropped) or treated as a syntax error, preferably the latter because it alerts the user. Unicode handling by extensions, on the other hand, is a tricky subject (after all, we don’t know what sort of extensions will be created to use that data).
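For the glyph-less character problem, Unicode’s own category data is enough to flag the suspicious cases. A sketch using Python’s unicodedata module (the particular set of categories I reject reflects my own judgment of what counts as invisible, not any standard rule):

```python
import unicodedata

def check_invisibles(source):
    """Return the offset of the first suspicious character, or None.
    Flags format characters (Cf, e.g. zero-width space), surrogates
    (Cs), private-use (Co), unassigned (Cn), controls other than
    tab/newline, and any space separator that isn't a plain space."""
    for i, ch in enumerate(source):
        cat = unicodedata.category(ch)
        if cat in ("Cf", "Cs", "Co", "Cn"):
            return i
        if cat == "Cc" and ch not in "\t\n\r":
            return i
        if cat == "Zs" and ch != " ":
            return i
    return None
```

Running this over `"x = 1\u200b+ 2"` flags the zero-width character at offset 5, while ordinary ASCII source passes clean; the caller can then raise the syntax error recommended above.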
If a language is to be powerful, it should be able to handle raw byte data. There is no escaping this fact. Programming languages that I have worked with that don’t directly handle raw byte data are a real pain when you eventually need that power. But again, do we sacrifice power for security? Allowing any sort of input that is meant to be interpreted as bytes can be problematic. Storing the byte data as chunks in lists is one way to prevent issues (because the data can’t be read in sequence as Assembly instructions), but this does not prevent some unwary extension from utilizing the byte code in a way the hacker desires. Ultimately, then, the question of extensions implementing raw byte data processing capabilities is a question for the user.
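A sketch of the chunks-in-a-list idea (the chunk size is arbitrary, and note the caveat in the comments: Python makes no hard guarantee about where each chunk lands in memory, so this illustrates the technique rather than proving a defense):

```python
CHUNK = 16

def store_bytes(data, chunk=CHUNK):
    """Split raw byte data into separately allocated chunks so it never
    sits in the interpreter as one contiguous run that could be read in
    sequence as machine instructions. (Allocation placement is up to the
    runtime; this is illustration, not a guarantee.)"""
    return [bytes(data[i:i + chunk]) for i in range(0, len(data), chunk)]

def read_bytes(chunks):
    """Reassemble on demand, for consumers that really need the raw run.
    This is the step an extension should treat with suspicion."""
    return b"".join(chunks)
```

The round trip is lossless, which is exactly the point made above: the chunking protects nothing once an unwary extension calls the reassembly step on a hacker’s behalf.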
Now let’s consider inputs from foreign sources that can be examined before being run. I’ve already covered the fact that code might contain different forms of whitespace. Furthermore, it should be obvious why we can’t protect ourselves against obfuscated code. But what about code comments containing executable code? Here again, we can check for conformance to visual Unicode rules by checking for characters with glyphs. Or we could filter out such characters early in the collection process (but don’t trust anything coming from a foreign source, even if it comes through your own website).
Admittedly, from the perspective of the interpreter, anything unwanted can be optionally ditched. You can probably get away with doing that in a compiled language, but that’s because the compiler can ignore such stray characters and doesn’t have to put them in the final result (oh and by the way, GCC doesn’t ignore stray Unicode characters – it just spits error messages at you).
Stray characters and bytes in a source file are a red flag, in my opinion. Some of the best tricks are the most subtle. The way to avoid any unwanted business is to simply reject anything that breaks even a single necessary rule and report everything else that might. I won’t go into detail about the ideas that come to my mind, but I don’t want my laziness as a programmer to allow any of them.
~ Direct Code Execution ~
Before I end this article, I have a short rant.
There is one function that should never be a standard feature in any interpreted language, no matter how convenient it may be. That function, while it may come in different forms, is put most simply in Python’s built-in function, eval(). This function accepts a string and executes its contents as real code. It can be readily abused, and I’m sure it has been. This is not the same as adding a library. The eval() function may be used to run code from a (we’ll call it) “foreign source” while giving it elevated privileges, which allows a hacker to do whatever he wants, so long as the language is capable of it. Library imports, on the other hand, always come from the host computer unless someone abuses the system (which is possible)… or decides to use an online package manager/repository. I’m not going to write a blog post about why online package managers/repos (like npm) are stupid, but they are perhaps the epitome of unsafe practices in this regard. We programmers can be so trusting, and we don’t need an example like left-pad to prove that. (Don’t believe me? What happens if someone decides to “update” the package manager code to download and install spyware? Even if the author wouldn’t do such a thing, all it takes for access is a stolen credential.)
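To make the danger concrete, here is about the smallest demonstration I can think of: a string that looks like harmless data but, once handed to eval(), runs whatever it likes with the program’s own privileges (the payload here is deliberately harmless, but nothing about eval() makes it so):

```python
# Imagine this string arrived over the network or from a config file.
# It is "just data" right up until someone calls eval() on it.
user_input = "__import__('os').getpid() > 0"

# eval() runs it with the full privileges of this process. The same
# mechanism would happily run os.remove(...) or worse.
result = eval(user_input)
```

Nothing in the language distinguishes this benign expression from a destructive one; the decision was surrendered the moment the string reached eval().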
I understand eval() is one of those conveniences that are unique to interpreted languages, as are dynamically typed variables. After all, why can’t a program run itself? Perhaps if the job requires it, it can be implemented or incorporated (plugin, anyone?), but as a general feature of the language or a function included in the standard library, it is begging to be abused.
(Notably, Python’s eval() can be limited to having a unique global and local space, but without those parameters, it defaults to executing in the current program space with full access to everything.)
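Here is what those parameters look like in practice. Even this restricted form is damage limitation rather than a real sandbox (escapes from empty-`__builtins__` environments are well documented), which is rather the point of the rant:

```python
# Restrict what the evaluated code can see: no builtins, and only the
# names we explicitly hand over. Damage limitation, not a sandbox.
safe_globals = {"__builtins__": {}}
allowed = {"x": 2, "y": 3}

result = eval("x * y + 1", safe_globals, allowed)  # arithmetic still works

blocked = False
try:
    eval("__import__('os')", safe_globals, allowed)  # but imports do not
except Exception:
    blocked = True  # NameError: __import__ is not defined here
```

Without the two extra arguments, the same calls would have run in the current program space with full access to everything.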
I may be exaggerating the problem, and I’d probably look like a hypocrite if I ever used the function. I suppose it depends on a number of factors, but it’d be nice to see security warnings and potential issues noted in the standard documentation for the languages that have such built-in functions. Maybe people don’t want to be bogged down with details? What are instructions and documentation for if not to overload you with a bunch of facts? X)
~ Conclusion ~
I’d like to point out that this is by no means an exhaustive list of security considerations. These are just a few things that came to mind for a blog post.