Lab 2

Hi everyone,

In this post, I will answer some questions from Lab 2. First, the lab asks me to write a basic C program that prints a message on the screen, Hello World!-style, something like this:

#include <stdio.h>

int main() {
    printf("Hello World!\n");
}

This program simply prints Hello World! to the screen, like any beginner tutorial on the internet about how to write code.

Then I compiled the program with GCC using each of the compiler options below. After running objdump on the result, I can see the code location and the address of the output string that are set up in <main>.

-g               # enable debugging information
-O0              # do not optimize (that's a capital letter and then the digit zero)
-fno-builtin     # do not use builtin function optimizations

 gcc lab2.c -g -O0 -fno-builtin -o lab2 

Option 1: (Add -static)

401bb5:       55                      push   %rbp
  401bb6:       48 89 e5                mov    %rsp,%rbp
  401bb9:       bf 10 00 48 00          mov    $0x480010,%edi
  401bbe:       b8 00 00 00 00          mov    $0x0,%eax
  401bc3:       e8 f8 72 00 00          callq  408ec0 <_IO_printf>
  401bc8:       b8 00 00 00 00          mov    $0x0,%eax
  401bcd:       5d                      pop    %rbp
  401bce:       c3                      retq   
  401bcf:       90                      nop

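When compiling with the -static option, the file is much bigger than the original because the C library code that the program needs is copied into the executable instead of being loaded from a shared library at run time. This is also why main() now calls <_IO_printf> directly instead of going through <printf@plt>.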

Option 2: (Remove -fno-builtin)

 401126:       55                      push   %rbp
  401127:       48 89 e5                mov    %rsp,%rbp
  40112a:       bf 10 20 40 00          mov    $0x402010,%edi
  40112f:       e8 fc fe ff ff          callq  401030 <puts@plt>
  401134:       b8 00 00 00 00          mov    $0x0,%eax
  401139:       5d                      pop    %rbp
  40113a:       c3                      retq   
  40113b:       0f 1f 44 00 00          nopl   0x0(%rax,%rax,1)

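When compiling without -fno-builtin, the file size becomes smaller and the function call changes from <printf@plt> to <puts@plt>. Because the format string has no format specifiers and ends with a newline, GCC is free to replace the printf() call with the simpler puts().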

Option 3: (Remove -g)

When compiling without the -g option, the file becomes even smaller than the previous ones and there is no more debugging output. With debugging information disabled, the objdump output no longer includes debug sections such as .debug_str.

Option 4: (Add additional arguments to printf())

401126:       55                      push   %rbp
  401127:       48 89 e5                mov    %rsp,%rbp
  40112a:       48 83 ec 08             sub    $0x8,%rsp
  40112e:       6a 0a                   pushq  $0xa
  401130:       6a 09                   pushq  $0x9
  401132:       6a 08                   pushq  $0x8
  401134:       6a 07                   pushq  $0x7
  401136:       6a 06                   pushq  $0x6
  401138:       41 b9 05 00 00 00       mov    $0x5,%r9d
  40113e:       41 b8 04 00 00 00       mov    $0x4,%r8d
  401144:       b9 03 00 00 00          mov    $0x3,%ecx
  401149:       ba 02 00 00 00          mov    $0x2,%edx
  40114e:       be 01 00 00 00          mov    $0x1,%esi
  401153:       bf 10 20 40 00          mov    $0x402010,%edi
  401158:       b8 00 00 00 00          mov    $0x0,%eax
  40115d:       e8 ce fe ff ff          callq  401030 <printf@plt>
  401162:       48 83 c4 30             add    $0x30,%rsp
  401166:       b8 00 00 00 00          mov    $0x0,%eax
  40116b:       c9                      leaveq 
  40116c:       c3                      retq   
  40116d:       0f 1f 00                nopl   (%rax)

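Compiling with additional arguments to printf() did not involve any change to the compilation options, but the disassembly of main() did change. The address of the format string goes in %edi, the first five extra arguments go in %esi, %edx, %ecx, %r8d and %r9d, and the remaining arguments are pushed onto the stack in reverse order.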

Option 5: (Move printf() to separate function call)

40113c:       55                      push   %rbp
  40113d:       48 89 e5                mov    %rsp,%rbp
  401140:       b8 00 00 00 00          mov    $0x0,%eax
  401145:       e8 dc ff ff ff          callq  401126 <output>
  40114a:       b8 00 00 00 00          mov    $0x0,%eax
  40114f:       5d                      pop    %rbp
  401150:       c3                      retq   
  401151:       66 2e 0f 1f 84 00 00    nopw   %cs:0x0(%rax,%rax,1)
  401158:       00 00 00 
  40115b:       0f 1f 44 00 00          nopl   0x0(%rax,%rax,1)

In the regular compilation, the printf() call and its arguments appear directly in the code for main(). In the changed file, main() only contains the call to the output() function; the printf() call itself now lives in the separate <output> section of the disassembly.

Option 6: (Remove -O0 and add -O3)

401126:       55                      push   %rbp
  401127:       48 89 e5                mov    %rsp,%rbp
  40112a:       bf 10 20 40 00          mov    $0x402010,%edi
  40112f:       b8 00 00 00 00          mov    $0x0,%eax
  401134:       e8 f7 fe ff ff          callq  401030 <printf@plt>
  401139:       b8 00 00 00 00          mov    $0x0,%eax
  40113e:       5d                      pop    %rbp
  40113f:       c3                      retq   

Replacing the -O0 option with -O3 caused the main() function to be placed much earlier in the code, and it raises the optimization level to improve performance.

Project Update 3

Greetings everyone,

So, this is my last update for this project. A few days ago, I got a response to my pull request from the repo owner. He said that “optimizing” tests is rather pointless. Also, I used my new variable incorrectly, since I never recalculate its value. Finally, my change would fail to build on certain old compilers (mainly Visual Studio) because of variable declarations that aren’t at the beginning of the functions.

After looking at the whole code more carefully, I can see his point and I agree with the owner. He recommended that I focus on the JSON code itself instead of the tests. So I understand why he will not merge my changes into his branch.

Overall, even though my changes were not accepted, I am still grateful for the experience that this project has given me. I had a chance to apply my knowledge of optimization and improve my skills further. This project will help me in my future career as a developer.

Project Update 2.2

Hello everyone,

I have come back with some updates on my progress. For the past few weeks, I have been benchmarking and profiling the code in order to optimize it. After a while, I was able to optimize test1.c and test2.c, the two tests I singled out before.

For test2.c, I was able to combine the two #ifdef TEST_FORMATTED blocks into one. In the end, it gives the same result with less code. The file was quite short, so I do not really have anything else to add or remove. If it passes the test, it is OK for me.

Old:

#ifdef TEST_FORMATTED
	int sflags = 0;
#endif

	MC_SET_DEBUG(1);

#ifdef TEST_FORMATTED
	sflags = parse_flags(argc, argv);
#endif

New:

#ifdef TEST_FORMATTED
	int sflags = 0;
	sflags = parse_flags(argc, argv);
#endif

For test1.c, I made the same change as in test2.c: I merged the #ifdef TEST_FORMATTED blocks so that they take fewer lines. I also applied hoisting to optimize the loops. Here is how I changed it:

Old:

my_array = json_object_new_array();
	json_object_array_add(my_array, json_object_new_int(1));
	json_object_array_add(my_array, json_object_new_int(2));
	json_object_array_add(my_array, json_object_new_int(3));
	json_object_array_put_idx(my_array, 4, json_object_new_int(5));
	printf("my_array=\n");
	for(i=0; i < json_object_array_length(my_array); i++)
	{
		json_object *obj = json_object_array_get_idx(my_array, i);
		printf("\t[%d]=%s\n", (int)i, json_object_to_json_string(obj));
	}
	printf("my_array.to_string()=%s\n", json_object_to_json_string(my_array));

	json_object_put(my_array);

	test_array_del_idx();

	my_array = json_object_new_array();
	json_object_array_add(my_array, json_object_new_int(3));
	json_object_array_add(my_array, json_object_new_int(1));
	json_object_array_add(my_array, json_object_new_int(2));
	json_object_array_put_idx(my_array, 4, json_object_new_int(0));
	printf("my_array=\n");
	for(i=0; i < json_object_array_length(my_array); i++)
	{
		json_object *obj = json_object_array_get_idx(my_array, i);
		printf("\t[%d]=%s\n", (int)i, json_object_to_json_string(obj));
	}
	printf("my_array.to_string()=%s\n", json_object_to_json_string(my_array));
	json_object_array_sort(my_array, sort_fn);
	printf("my_array=\n");
	for(i=0; i < json_object_array_length(my_array); i++)
	{
		json_object *obj = json_object_array_get_idx(my_array, i);
		printf("\t[%d]=%s\n", (int)i, json_object_to_json_string(obj));
	}

New:

my_array = json_object_new_array();
	json_object_array_add(my_array, json_object_new_int(1));
	json_object_array_add(my_array, json_object_new_int(2));
	json_object_array_add(my_array, json_object_new_int(3));
	json_object_array_put_idx(my_array, 4, json_object_new_int(5));
	unsigned int my_array_length = json_object_array_length(my_array);
	printf("my_array=\n");
	for(i=0; i < my_array_length; i++)
	{
		json_object *obj = json_object_array_get_idx(my_array, i);
		printf("\t[%d]=%s\n", (int)i, json_object_to_json_string(obj));
	}
	printf("my_array.to_string()=%s\n", json_object_to_json_string(my_array));

	json_object_put(my_array);

	test_array_del_idx();

	my_array = json_object_new_array();
	json_object_array_add(my_array, json_object_new_int(3));
	json_object_array_add(my_array, json_object_new_int(1));
	json_object_array_add(my_array, json_object_new_int(2));
	json_object_array_put_idx(my_array, 4, json_object_new_int(0));
	printf("my_array=\n");
	for(i=0; i < my_array_length; i++)
	{
		json_object *obj = json_object_array_get_idx(my_array, i);
		printf("\t[%d]=%s\n", (int)i, json_object_to_json_string(obj));
	}
	printf("my_array.to_string()=%s\n", json_object_to_json_string(my_array));
	json_object_array_sort(my_array, sort_fn);
	printf("my_array=\n");
	for(i=0; i < my_array_length; i++)
	{
		json_object *obj = json_object_array_get_idx(my_array, i);
		printf("\t[%d]=%s\n", (int)i, json_object_to_json_string(obj));
	}

After making sure the output had not changed, I ran the make command again to get my result, and the runtime had gone down quite a bit. All the tests passed and there were no problems running the command, which suggests that the optimization was successful.

So in the end, I was able to optimize the code so that it makes fewer calls and runs better than before. I will try to get my code accepted by the upstream project. Here is my GitHub repo if you want to check it out. Thank you for your time.

GitHub link: https://github.com/hoaianhkhang/json-c/tree/test-build

Project Update 2

Hello everyone again,

I have come back with an update on my project. Right now, I am diving deeper into the author's code and profiling it as I work. It is hard to improve the code, as I am still figuring out how to make it as optimized as possible. I hope to have more to share in the near future. Thank you for your time.

Project Update 1

Hello everyone,

So it has been a long time since I first enrolled in this course. I have learned a lot about software and how it is compiled in general. For this course, I need to choose a project to work with. After a while, I decided to go with json-c, because I have always wanted to work with the C language and this project lets me improve my skills further.

First, I cloned the project from GitHub so that I can work on it freely.

[haung1@xerxes ~]$  git clone https://github.com/json-c/json-c.git
Cloning into 'json-c'...
remote: Enumerating objects: 121, done.
remote: Counting objects: 100% (121/121), done.
remote: Compressing objects: 100% (83/83), done.
remote: Total 3956 (delta 60), reused 78 (delta 38), pack-reused 3835
Receiving objects: 100% (3956/3956), 2.91 MiB | 13.43 MiB/s, done.
Resolving deltas: 100% (2545/2545), done.

I then created a branch so that my work does not interfere with the original branch, and built the project by following the instructions on the GitHub page. After that, I began to benchmark the application on an AArch64 system by running make test in the CMake build directory.


[haung1@xerxes build-test]$ make test
Running tests...
Test project /home/haung1/json-c/build-test
      Start  1: test1
 1/21 Test  #1: test1 ............................   Passed    1.54 sec
      Start  2: test2
 2/21 Test  #2: test2 ............................   Passed    1.41 sec
      Start  3: test4
 3/21 Test  #3: test4 ............................   Passed    0.26 sec
      Start  4: testReplaceExisting
 4/21 Test  #4: testReplaceExisting ..............   Passed    0.23 sec
      Start  5: test_cast
 5/21 Test  #5: test_cast ........................   Passed    0.28 sec
      Start  6: test_charcase
 6/21 Test  #6: test_charcase ....................   Passed    0.22 sec
      Start  7: test_compare
 7/21 Test  #7: test_compare .....................   Passed    0.23 sec
      Start  8: test_deep_copy
 8/21 Test  #8: test_deep_copy ...................   Passed    0.27 sec
      Start  9: test_double_serializer
 9/21 Test  #9: test_double_serializer ...........   Passed    0.23 sec
      Start 10: test_float
10/21 Test #10: test_float .......................   Passed    0.24 sec
      Start 11: test_int_add
11/21 Test #11: test_int_add .....................   Passed    0.20 sec
      Start 12: test_json_pointer
12/21 Test #12: test_json_pointer ................   Passed    0.28 sec
      Start 13: test_locale
13/21 Test #13: test_locale ......................   Passed    0.28 sec
      Start 14: test_null
14/21 Test #14: test_null ........................   Passed    0.22 sec
      Start 15: test_parse
15/21 Test #15: test_parse .......................   Passed    0.28 sec
      Start 16: test_parse_int64
16/21 Test #16: test_parse_int64 .................   Passed    0.22 sec
      Start 17: test_printbuf
17/21 Test #17: test_printbuf ....................   Passed    0.22 sec
      Start 18: test_set_serializer
18/21 Test #18: test_set_serializer ..............   Passed    0.23 sec
      Start 19: test_set_value
19/21 Test #19: test_set_value ...................   Passed    0.20 sec
      Start 20: test_util_file
20/21 Test #20: test_util_file ...................   Passed    0.26 sec
      Start 21: test_visit
21/21 Test #21: test_visit .......................   Passed    0.27 sec

100% tests passed, 0 tests failed out of 21

Total Test time (real) =   7.58 sec

After running the tests, I saw that test1 and test2 are the two files that took the most time to run, which means I will focus on optimizing them first. I see that I can improve this application in multiple ways:

  • Altered build options:
    I may try some compiler options like -g or -O3, and I will adjust them as my work progresses.
  • Code changes to permit better optimization by the compiler, and algorithm improvements: To reduce the runtime, I will apply hoisting and inlining, and also try strength reduction on some loops in the code. Even though the owner has already optimized it quite well, I think I can improve it further to get a better runtime. After learning all the techniques in the SPO600 course, I believe I can apply my knowledge to improve this project.

As I stated at the beginning, I will run all of my experiments in a separate branch to be safe. I will also test my code multiple times and compare the original code with mine, to make sure that nothing differs except the runtime, so that the result is trustworthy. I will share more in my next post. Thank you for reading.

Lab 5

Hey there everyone,

It is me, Anh, again with some of my experience with Lab 5, a lab that is, frankly, quite hard for me. I will talk about how I resolved my problems throughout this post. For this lab, I have to use some of the Lab 4 files again, which is a little bit convenient.

Part 1: Auto-Vectorization

First, I changed the Makefile by adding the -fopt-info-vec-all option.

Then, after I ran the command gcc -g -O3 -fopt-info-vec-all vol1.c -o vol1, I saw that the loop at line 32 was vectorized.

The loop at line 38, however, was not vectorized yet.

So, in order to vectorize another part of the code, I decided to change the part that sums up the data from this:

To this:

After I ran the command again, that line was vectorized, so I have successfully vectorized one more loop.

Part 2: Inline Assembler

For the next part, I need to look at add.c and make sure that I understand how the inline assembler code works and why. The task is to modify the code to calculate b mod a using inline assembler and print the result, so I changed what I needed in the add.c file.

After that, I ran the time command to get the runtime.

The next objective involves vol_inline.c, which contains a version of the volume scaling problem that uses inline assembler and the SQDMULH instruction. I need to copy, build, and verify the operation of this program on an AArch64 system.

As we can see, the default sample count is:

vol.h

#define SAMPLES 5000000

With a simple time command, we can see how long it takes to run.

If I increase the number of samples, the runtime increases, and if I lower the number of samples, the runtime decreases too.

Now I will try to answer some of the questions in this lab.

Question 1: What is an alternate approach?
I would let the compiler choose which registers to use for the variables instead of doing it myself.

Question 2: Should we use 32767 or 32768 in next line? why?
I use 32767, since the upper bound of an int16_t value is 32767, so using 32768 would cause an overflow.

Question 3: What does it mean to “duplicate” values in the next line?
It means copying the same value into every lane of the vector register. Here, the volume factor is stored in the SIMD register 8 times, once per 16-bit lane.

Question 4: Why is #16 included in the str line but not in the ldr line?
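I did not want to advance the cursor at the ldr, since I still need the current cursor position to store the values with the str. The #16 on the str then moves the cursor forward by 16 bytes, which is the next 8 samples.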

Question 5: What do these next three lines do?

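The 1st line declares the output operand. “+r” means that it is held in a register that is both read and written.

The 2nd line declares the input operand.

The 3rd line declares that the asm clobbers memory, which means the compiler must not keep memory values cached in registers across the asm and will reload data from memory after it executes.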

Question 6: are the results usable? are they correct?
It does not return the same number, so I would say no.

Part 3: C Intrinsics

For this part, I need to run the vol_intrinsics program, and after running the command, this is how long it took.

If I increase the number of samples, the runtime increases, and if I lower the number of samples, the runtime decreases too.

Question 1: What do these intrinsic functions do?
vst1q_s16 stores a single vector register to memory.
vqdmulhq_s16 multiplies two vectors lane by lane (a saturating doubling multiply that keeps the high half of each result).
vdupq_n_s16 sets all vector lanes to the same value.

Question 2: Why is the increment below 8 instead of 16 or some other value?
We are using int16_t, so a 128-bit vector register holds 8 lanes of 16 bits each. We increment by 8 to move on to the next 8 values.

Question 3: Why is this line not needed in the inline assembler version of this program?
Because we set the vectors up to be 8 lanes wide, and that is done explicitly in the inline assembler, while it is not done with intrinsics.

Question 4: are the results usable? are they correct?
It does not return the same result.

This lab took longer than I thought it would. I may come back here to update something if I need to. Until then, see you in my next post.