benchmarking the branches

Before going too far, I wanted to test the speed difference between the two branches to see how much of a difference “fn” makes and whether it even merits consideration. For this, I would need a lengthy test, but to ensure that the data is reasonably comparable, I needed code that takes about the same time on both branches. Hence, I created a looping test that merely increments an integer and outputs the elapsed time. The times for the two branches should differ by no more than a small amount. If that amount happens to be equal to or greater than the difference in times for a test comparing the usage of “fn” to not using it, then either the test isn’t time-consuming enough, or the time cost of function construction doesn’t outweigh the cost elsewhere, or – a stretch – there really is no difference because clock() stinks, at least on my PC. clock() can be shown to be reliable by getting consistent clock times, so any difference greater than the clock-time variance is significant. That leaves only the first two options viable.
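The idea of treating clock variance as a noise floor can be sketched outside Copper; here’s a rough illustration in Python (the helper names are mine, not part of any test code):

```python
import time

def noise_floor(samples=50):
    """Estimate timer jitter by taking back-to-back clock readings.

    Any benchmark difference smaller than this floor cannot be
    distinguished from clock noise."""
    deltas = []
    for _ in range(samples):
        t1 = time.perf_counter()
        t2 = time.perf_counter()
        deltas.append(t2 - t1)
    return max(deltas)

def is_significant(time_a, time_b):
    """A timing difference only matters if it exceeds the jitter."""
    return abs(time_a - time_b) > noise_floor()
```

With seconds-scale benchmark times and microsecond-scale jitter, a call like `is_significant(19.8, 23.9)` confirms the gaps measured below sit well above the noise floor.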

Second, having a benchmark also allows me to compare Copper to other languages. This may be fair or unfair given that every language has its own ways of being optimized, and (a) the optimal code designs may look nothing alike and (b) the optimal code designs may not be the ones most commonly written. Nevertheless, benchmarks using mostly identical code do reveal which language is better for the given algorithm(s) and say something about speed (though we may not be able to infer which language is faster overall from a few contrived tests alone).
There are at least four languages to which Copper ought to be compared: ChaiScript, AngelScript, Lua, and JavaScript. ChaiScript and AngelScript are likely to be faster because they are compiled down to a form of easily-processed bytecode. Lua doesn’t even have the same paradigms, so it’s very difficult to compare it to Copper. For a short time, I thought JavaScript would be the only language that is easy to compare, being similar in paradigms, similar in syntax, and embeddable. However, JavaScript undergoes just-in-time compiling in browsers (which makes the comparison more like the one with ChaiScript). For testing the speed of JavaScript, it makes the most sense to use something like Node.js, for which dozens of applications are being written and whose audience Copper may appeal to. There are libraries for benchmarking like benchmark.js, and a timer could be added via C++ (if I wanted to go to the trouble of creating a Node.js extension), but other than that, the only built-in clocking mechanism for JavaScript is Date().getTime(), so that’s what I used.

Testing: fn() vs []

branch: master

test 1


benchmark = {
	m = MSec_ClockTime()
	i = int(0)
	loop {
		if ( int_equal(i: 1000000) ) { stop }
		i = int+(i: 1)
	}
	mm = MSec_ClockTime()
	print(MSecTime_rdc(mm: m:))
}
benchmark()

Times in milliseconds:
19811
19675
19711
19645

test 1 using int_sum()


benchmark = {
	m = MSec_ClockTime()
	i = int(0)
	loop {
		if ( int_equal(i: 1000000) ) { stop }
		i = int_sum(i: 1)
	}
	mm = MSec_ClockTime()
	print(MSecTime_rdc(mm: m:))
}
benchmark()

Times in milliseconds:
19868
20159
20187
19981

test 2

Same code as before, but this time, I’ve added a function creation.


benchmark = {
	m = MSec_ClockTime()
	i = int(0)
	loop {
		if ( int_equal(i: 1000000) ) { stop }
		i = int+(i: 1)
		a = []
	}
	mm = MSec_ClockTime()
	print(MSecTime_rdc(mm: m:))
}
benchmark()

Times in milliseconds:
28808
28585
28805
28785

branch: fn

test 1


benchmark = {
	m = MSec_ClockTime()
	i = int(0)
	loop {
		if ( int_equal(i: 1000000) ) { stop }
		i = int_sum(i: 1)
	}
	mm = MSec_ClockTime()
	print(MSecTime_rdc(mm: m:))
}
benchmark()

Times in milliseconds:
23890
23966
23621
23573
test 2
Same code as before, but this time, I’ve added a function creation.


benchmark = {
	m = MSec_ClockTime()
	i = int(0)
	loop {
		if ( int_equal(i: 1000000) ) { stop }
		i = int_sum(i: 1)
		a = fn()
	}
	mm = MSec_ClockTime()
	print(MSecTime_rdc(mm: m:))
}
benchmark()

Times in milliseconds:
35250
35552
33932
34086

C++

This should be the fastest for obvious reasons.


#include <cstdio>
#include <ctime>

int main() {
	clock_t t1, t2;
	t1 = clock();
	int i = 0;
	for (; i < 1000000; i++);
	t2 = clock();
	// clock() counts ticks; on Linux, clock_t is long int.
	// Scale by 1000/CLOCKS_PER_SEC to report milliseconds.
	printf("time = %li ms, i = %d\n", (t2 - t1) * 1000 / CLOCKS_PER_SEC, i);
	return 0;
}

Times were all about 8 milliseconds even. Not surprising.

Part 1 Analysis

Within each branch, the clock times were consistent (as one might expect), but between branches, the times for the baseline loop were radically different. The increases in time from adding function creation were roughly the same – about 9 seconds on master and 11 on fn, or around twice the difference between the baseline loop times. This means function construction speed was not much affected by loading “fn”, yet the only changes in the codebase were for using square brackets “[]”. This means that those tiny changes (and believe me, there weren’t very many) created a huge speedup in an indirect way.
One thing that can be ruled out: the speed up would not occur at parse-time since everything in the loop has already been tokenized. (I even tested skipping the “fn” tokenization just to see its effect, and there was no time difference.) The difference is certainly not in which functions are called. I set the engine to output verbosely (showing the functions it calls) and performed a difference-check (using the program “diff”) on that output. There should not have been a difference, and there wasn’t. I checked all the other relevant files, and everything was the same.
To ensure names weren’t affecting it, I ran test 1 using the code from test 2 for the baseline loop, set the names for int functions to their originals, and made sure all the preprocessor flags were the same (including name changes, etc.). No time difference was seen.
The only relevant section of code is the loop:


loop {
	if ( int_equal(i: 1000000) ) { stop }
	i = int_sum(i: 1)
}

The implementations of the loop are identical.
How is there an improvement then? I don’t know. I compared everything I thought might be influencing the situation, but there were no differences that would have accounted for the speed improvement. Optimizations were turned off, which means the compiler shouldn’t have (but may have) been moving some code around. Even after running the main file through GCC’s preprocessor with the -E flag, there were no real differences that should have resulted in faster code in this case. Bizarre.
I leave this as a mystery to be solved later.

Part 2: CopperVM vs JavaScript

I do wish I knew the internals of how the JavaScript virtual machine is implemented in Firefox, but as it is, I have no idea. Perhaps because it does just-in-time compiling before it runs, the results are skewed in its favor, but that shouldn’t matter much: if that’s the advantage using the language gives you, take advantage of it.
For testing the speed of JavaScript, I used the following code:

panel = document.getElementById("output");
panel.innerHTML = "Test start";
t1 = new Date().getTime();
var i = 0;
while( i < 1000000 ) {
	i += 1;
}
t2 = new Date().getTime();
panel.innerHTML = String(t2 - t1) + ", " + String(i);

Times:

Between 3 and 6 on the output, but usually 4 or 5. Date().getTime() returns the time in milliseconds since the Unix epoch, so “3” and “6” are milliseconds.
These times obviously trump the CopperVM by a long shot.
Notably, the biggest thing affecting the time should be the loop itself. Of course, the compiler could optimize the entire loop out, thereby skewing the test results, so it makes more sense to perform an array growth, which is what I try next.

(Note: It’s obviously weird that clock time for JS would be “faster” than C++. The C++ version is true time while my JS version may be optimized out, or some docs are lying to me about how to print the milliseconds out of JS. *shrug*)

Copper


benchmark = {
	l = List()
	m = MSec_ClockTime()
	i = int(0)
	loop {
		if ( int_equal(i: 1000000) ) { stop }
		i = int+(i: 1)
		Push_top(l: i:)
	}
	mm = MSec_ClockTime()
	print(MSecTime_rdc(mm: m:))
}
benchmark()

Times in milliseconds:

30863
30530
30579
30872

JavaScript


panel = document.getElementById("output");
panel.innerHTML = "Test start";
var l = [];
t1 = new Date().getTime();
var i = 0;
while( i < 1000000 ) {
	i += 1;
	l.push(i);
}
t2 = new Date().getTime();
panel.innerHTML = String(t2 - t1) + ", " + String(i);

Times in milliseconds:

Fluctuated around 30, which means roughly a 6–10x increase on the JavaScript end, but still…

Copper’s time increased by about 10 seconds, which actually means it suffered a smaller performance penalty than JavaScript in percentage terms, but 10 seconds is still huge.

Part 2 Analysis

Given that every language I would be comparing Copper to (except Lua) is compiled, it all mostly boils down to comparing compiled vs interpreted code more than the actual semantics. What this does mean is that, if I want speed, I may have to consider how to compile Copper.
JavaScript is known for its speed, but obviously, that has more to do with its compilation than any inherent design in the language itself.
That said, Copper functions run as tokens, and I’m inclined to believe there are ways this process could be sped up. I doubt changes will get it to match the speed of JavaScript, due to the inherent nature of interpreting (calling tons of C++ functions), but there can be improvements.
Notably, there is a huge slowdown for Copper in that objects are copied, not merely adjusted. JavaScript allows for the ++ and += operators, whereas Copper recreates an entire variable every time. I could add an increment operator, though this does mean the number would no longer be safe for multi-threading (if I ever get to that point).
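As a toy illustration of that cost (this is not the Copper implementation, just a sketch of the two strategies), compare rebuilding a value object on every addition with mutating it in place:

```python
class IntObject:
    """Toy stand-in for a VM integer object."""
    def __init__(self, value):
        self.value = value

def add_by_copy(obj, n):
    # Copy semantics: every addition constructs a brand-new object,
    # costing an allocation on each pass through a loop.
    return IntObject(obj.value + n)

def add_in_place(obj, n):
    # Hypothetical increment builtin: mutate the existing object.
    # Cheaper, but unsafe if another thread holds a reference to it.
    obj.value += n
    return obj

copied = IntObject(0)
for _ in range(1000):
    copied = add_by_copy(copied, 1)   # 1000 allocations

mutated = IntObject(0)
for _ in range(1000):
    mutated = add_in_place(mutated, 1)  # no extra allocations
```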

Part 3: Optimizations On

For my previous tests, I turned optimizations off for compilation. I did this intentionally. I wanted to know the “true” speed of my virtual machine, so that anyone using it would know just how fast or slow it is in its natural state. However, C++ compilers can do a lot to speed up code, and the Copper virtual machine was designed to be readable, with the expectation that the compiler would perform the optimizations.
With optimizations on, here are the times for test 1 of the master branch.

benchmark = {
	m = MSec_ClockTime()
	i = int(0)
	loop {
		if ( int_equal(i: 1000000) ) { stop }
		i = int+(i: 1)
	}
	mm = MSec_ClockTime()
	print(MSecTime_rdc(mm: m:))
}
benchmark()

Times in milliseconds:
10819
10780
10909
10968

That’s almost HALF the time. It’s still an enormous 10 seconds, but shaving 9 seconds off the clock is rather nice.
Now let’s consider the list appending test:

benchmark = {
	l = List()
	m = MSec_ClockTime()
	i = int(0)
	loop {
		if ( int_equal(i: 1000000) ) { stop }
		i = int+(i: 1)
		Push_top(l: i:)
	}
	mm = MSec_ClockTime()
	print(MSecTime_rdc(mm: m:))
}
benchmark()

Times in milliseconds:

16418
16478
16302
16317

That’s about a 14 second improvement overall – the baseline loop accounts for roughly 9 of those seconds, so the List code itself got about 5 seconds faster. It’s amazing what the compiler can do in terms of optimizing your code. It also makes the Copper VM look a lot more reasonable.
One of the biggest slowdowns is the search through built-in functions. Each name is compared character-by-character against the input rather than looked up via a hash table. Doing the string comparisons this way is extremely slow; when I short-circuited the search (in other words, cut out the part that looks for built-in functions), I got the following times:

Times in milliseconds:

14337
14354
14289
14294

A full 2 seconds! That’s significant enough for me to want to redo that system.
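The shape of that fix is straightforward; here’s a hedged sketch in Python (builtin names borrowed from the tests above, lookup helpers hypothetical):

```python
# Linear search: the requested name is compared against every
# builtin name in turn -- each miss costs a string comparison.
BUILTINS = [
    ("int_equal", lambda a, b: a == b),
    ("int_sum",   lambda a, b: a + b),
]

def find_builtin_linear(name):
    for builtin_name, fn in BUILTINS:
        if builtin_name == name:  # character-by-character compare
            return fn
    return None

# Hash table: the name is hashed once, then the lookup averages
# O(1) instead of O(n) string comparisons.
BUILTIN_TABLE = dict(BUILTINS)

def find_builtin_hashed(name):
    return BUILTIN_TABLE.get(name)
```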

For comparison:

Python

#!/usr/bin/env python3
import time

t1 = time.process_time()
a = []
for i in range(1000000):
	a.append(i)
t2 = time.process_time()
print((t2 - t1) * 1000)

Times in milliseconds:

267.823025
273.351204
258.46204
273.033348

100 * 275 / 16400 = 1.68%, or in other words, Python runs in under 2% of the time. Note that this is the time for Python running compiled bytecode. I also ran the following code within the console interpreter (and note that, since it isn’t in a function, the Python VM is forced to compile as it runs):

import time

t1 = time.process_time()
a = []
for i in range(1000000):
 a.append(i)

t2 = time.process_time()
print((t2 - t1) * 1000)

The times came out to about 518, 528, 493, and 466. In other words, it took roughly twice as long, but that’s still only about 3% of the Copper VM’s time.
Admittedly, Python has had the benefit of years of input and ideas on optimizations, as well as a strict (and what I’d call “invasive”) type system that makes compiling easier. Still, the fact that the Copper VM is over 30x slower – even compared to interpreted Python – is unsatisfactory.

Conclusion

Yes, interpreting raw code is slow. That’s no surprise. It’s somewhat silly to even compare a virtual machine interpreter to compiled code, but it must be done.
I suspect a number of changes I plan to make to the VM will slow things down, including changes to the number interface and foreign function interface, but they must be done. The language is intended to be easy, not fast.
All this does get me thinking about how I can optimize the virtual machine. Copper is very much a spoon-fed sort of language – the first token dictates the processing – so it may be possible in the future to build a parse tree (within each function body) that could perform its own processing in an alternate, faster way while still being compatible with the rest of the virtual machine. This would bloat the code a bit (unfortunately), so it may end up being an optional add-on if I ever implement it, but it’s probably worth looking into. The method I have in mind would build on the C++ stack (unfortunately) but not require as many function calls, assuming the design ends up matching what I’m envisioning (though I doubt it will).
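A minimal sketch of that idea (illustrative node classes, not a design commitment): parse the body into a tree once, then let the loop re-evaluate the tree with plain method calls instead of re-dispatching on raw tokens each iteration.

```python
class Const:
    def __init__(self, value): self.value = value
    def eval(self, env): return self.value

class Var:
    def __init__(self, name): self.name = name
    def eval(self, env): return env[self.name]

class Add:
    def __init__(self, lhs, rhs): self.lhs, self.rhs = lhs, rhs
    def eval(self, env): return self.lhs.eval(env) + self.rhs.eval(env)

# "i = int_sum(i: 1)" parsed once into a tree...
tree = Add(Var("i"), Const(1))
env = {"i": 0}
# ...so each loop iteration is just a tree walk, with no token
# re-scanning or task re-dispatching.
for _ in range(1000):
    env["i"] = tree.eval(env)
```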
In any case, seeing as the Copper VM is rather slow, it does make me more hesitant to write tons of code in it. I’d like to think that a number of bulk-duty operations (such as adding numbers by matrices, etc.) could be added to the VM to speed up processing, but I don’t think this is going to make a serious improvement and it may even make matters worse (you still need to build the large data structures and then extract from them).
I’m aware that the worst aspect of the Copper VM affecting its speed is its safe approach to being stackless: the system of processing tasks, which allows everything to be cleared easily in the case of an error. Because of this mechanism, and despite Copper’s loose syntax, it shouldn’t be hard to add debugging hooks without messing anything up. But the process cycling hurts speed: every token of every task results in a number of function calls that could be avoided if each task got to view all of the tokens it wanted when it was run. At the time I created it, I didn’t think I could have designed this task system to be any faster without risking readability and misinterpretation of the desired syntax. Now that I have something concrete to work with, I’m considering where there might be room for improvement.
Improving the built-in function search will certainly save time – hopefully close to the full 2 seconds I observed when it was short-circuited. It will not be difficult to do this, and both the “fn” and “master” branches will get to have it. A similar improvement could be made when resolving token types during interpretation, but as this only has a one-time effect on the token, it hardly merits being considered a speed bottleneck.
The global scope could also be switched from a list to a round robin hash to speed it up. This is important if one considers access speed more important than memory.
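Assuming this means an open-addressing scheme (my reading of “round robin hash”), a minimal sketch of a hash-based scope looks like this: a fixed-size slot array is preallocated (the memory cost) in exchange for near-constant lookup (the speed gain).

```python
class ScopeTable:
    """Minimal open-addressing hash table for variable lookup.

    Illustrative only: fixed capacity, linear probing, no resizing,
    and it assumes the table never completely fills."""
    def __init__(self, capacity=64):
        self.slots = [None] * capacity  # each slot: (name, value)

    def _probe(self, name):
        idx = hash(name) % len(self.slots)
        while True:
            slot = self.slots[idx]
            if slot is None or slot[0] == name:
                return idx
            idx = (idx + 1) % len(self.slots)  # step to next slot

    def set(self, name, value):
        self.slots[self._probe(name)] = (name, value)

    def get(self, name):
        slot = self.slots[self._probe(name)]
        return slot[1] if slot is not None else None

scope = ScopeTable()
scope.set("i", 5)
scope.set("i", 6)  # re-assignment lands in the same slot
```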
Given how slow string comparison is, type checking – especially within foreign functions – is liable to be slow. At the current time, the Copper VM doesn’t utilize enumerated types as other language VMs could. It uses strings so as to be more flexible. This comes with a speed cost that is liable to eat up clock time unless a suitable replacement can be implemented.
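The difference can be sketched like so (the tag names are invented for illustration): an enumerated tag check is a single integer comparison, while a string tag costs per-character work on every check.

```python
from enum import Enum, auto

class TypeTag(Enum):
    FUNCTION = auto()
    NUMBER = auto()
    LIST = auto()

def check_type_string(tag, expected):
    # String tags: flexible (extensions can invent new type names),
    # but every check is an O(length) character comparison.
    return tag == expected

def check_type_enum(tag, expected):
    # Enumerated tags: one integer/identity comparison per check.
    return tag is expected
```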
This analysis suggests that I need to check for more speed bottlenecks, and perhaps make some major revisions to the VM if I’m still not happy with the results.
