Lab 2

Hi everyone,

In this post, I will answer some questions from Lab 2. First, the lab asks me to write a basic C program that prints a message on the screen, Hello World!-style, something like this:

#include <stdio.h>

int main() {
    printf("Hello World!\n");
}

This program simply prints Hello World! to the screen, like any beginner tutorial on the internet about how to write code.

Then I compiled the program with GCC, using each of the compiler options below. After running objdump on the result, I can see the code location and the output string that is set up in the <main> section.

-g               # enable debugging information
-O0              # do not optimize (that's a capital letter and then the digit zero)
-fno-builtin     # do not use builtin function optimizations

 gcc lab2.c -g -O0 -fno-builtin -o lab2 

Option 1: (Add -static)

  401bb5:       55                      push   %rbp
  401bb6:       48 89 e5                mov    %rsp,%rbp
  401bb9:       bf 10 00 48 00          mov    $0x480010,%edi
  401bbe:       b8 00 00 00 00          mov    $0x0,%eax
  401bc3:       e8 f8 72 00 00          callq  408ec0 <_IO_printf>
  401bc8:       b8 00 00 00 00          mov    $0x0,%eax
  401bcd:       5d                      pop    %rbp
  401bce:       c3                      retq   
  401bcf:       90                      nop

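When compiling with the -static option, the file is bigger than the original build without the option, because the C library code that the program uses is copied into the executable instead of being loaded from a shared library at run time. The call in main() also goes straight to <_IO_printf> instead of going through <printf@plt>.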

Option 2: (Remove -fno-builtin)

  401126:       55                      push   %rbp
  401127:       48 89 e5                mov    %rsp,%rbp
  40112a:       bf 10 20 40 00          mov    $0x402010,%edi
  40112f:       e8 fc fe ff ff          callq  401030 <puts@plt>
  401134:       b8 00 00 00 00          mov    $0x0,%eax
  401139:       5d                      pop    %rbp
  40113a:       c3                      retq   
  40113b:       0f 1f 44 00 00          nopl   0x0(%rax,%rax,1)

When compiling without -fno-builtin, the file becomes smaller and the function call changes from <printf@plt> to <puts@plt>.

Option 3: (Remove -g)

When compiling without the -g option, the file becomes even smaller than the previous files and there is no more debugger output. With debugging information disabled, the objdump output no longer includes the contents of the .debug_str section.

Option 4: (Add additional arguments to printf())

  401126:       55                      push   %rbp
  401127:       48 89 e5                mov    %rsp,%rbp
  40112a:       48 83 ec 08             sub    $0x8,%rsp
  40112e:       6a 0a                   pushq  $0xa
  401130:       6a 09                   pushq  $0x9
  401132:       6a 08                   pushq  $0x8
  401134:       6a 07                   pushq  $0x7
  401136:       6a 06                   pushq  $0x6
  401138:       41 b9 05 00 00 00       mov    $0x5,%r9d
  40113e:       41 b8 04 00 00 00       mov    $0x4,%r8d
  401144:       b9 03 00 00 00          mov    $0x3,%ecx
  401149:       ba 02 00 00 00          mov    $0x2,%edx
  40114e:       be 01 00 00 00          mov    $0x1,%esi
  401153:       bf 10 20 40 00          mov    $0x402010,%edi
  401158:       b8 00 00 00 00          mov    $0x0,%eax
  40115d:       e8 ce fe ff ff          callq  401030 <printf@plt>
  401162:       48 83 c4 30             add    $0x30,%rsp
  401166:       b8 00 00 00 00          mov    $0x0,%eax
  40116b:       c9                      leaveq 
  40116c:       c3                      retq   
  40116d:       0f 1f 00                nopl   (%rax)

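Adding more arguments to printf() did change the code in main(), even though the compilation options stayed the same. The first five extra arguments are loaded into the registers %esi, %edx, %ecx, %r8d and %r9d, and the remaining five are pushed onto the stack in reverse order before the call to <printf@plt>.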

Option 5: (Move printf() to separate function call)

  40113c:       55                      push   %rbp
  40113d:       48 89 e5                mov    %rsp,%rbp
  401140:       b8 00 00 00 00          mov    $0x0,%eax
  401145:       e8 dc ff ff ff          callq  401126 <output>
  40114a:       b8 00 00 00 00          mov    $0x0,%eax
  40114f:       5d                      pop    %rbp
  401150:       c3                      retq   
  401151:       66 2e 0f 1f 84 00 00    nopw   %cs:0x0(%rax,%rax,1)
  401158:       00 00 00 
  40115b:       0f 1f 44 00 00          nopl   0x0(%rax,%rax,1)

In the regular build, the printf() call and its string argument appear directly in the main() code. In the changed build, main() only contains the call to the output() function; the printf() call itself now lives in the separate <output> section.

Option 6: (Remove -O0 and add -O3)

  401126:       55                      push   %rbp
  401127:       48 89 e5                mov    %rsp,%rbp
  40112a:       bf 10 20 40 00          mov    $0x402010,%edi
  40112f:       b8 00 00 00 00          mov    $0x0,%eax
  401134:       e8 f7 fe ff ff          callq  401030 <printf@plt>
  401139:       b8 00 00 00 00          mov    $0x0,%eax
  40113e:       5d                      pop    %rbp
  40113f:       c3                      retq   

Replacing the -O0 option with -O3 caused the main() function to be placed much earlier in the code, and it raises the optimization level to improve performance.

Project Update 3

Greeting everyone,

So, this is my last update for this project. A few days ago, I got a response to my pull request from the repo owner. He said that “optimizing” tests is rather pointless. Also, I used my new variable incorrectly, since I never recalculate that value. Finally, the change would fail to build on certain old compilers (mainly Visual Studio) because of variable declarations that aren’t at the beginning of the functions.

After looking at the whole code more carefully, I can see his point and I agree with the owner. He recommended that I focus on the JSON code instead of the tests. So I understand why he will not merge my changes into his branch.

Overall, even though my changes were not merged, I am still grateful for the experience that this project has given me. I had a chance to apply my knowledge of optimization and improve my skills further. This project will help me in my future career as a developer.

Project Update 2.2

Hello everyone,

I am back with some updates on my progress. For the past few weeks, I have been benchmarking and profiling the code in order to optimize it. After a while, I was able to optimize test1.c and test2.c, as I stated before.

For test2.c, I was able to combine the two #ifdef TEST_FORMATTED blocks into one. In the end, it gives the same result with less code. The file is quite short, so I do not really have anything else to add or remove. If it passes the test, that is good enough for me.

Old:

#ifdef TEST_FORMATTED
	int sflags = 0;
#endif

	MC_SET_DEBUG(1);

#ifdef TEST_FORMATTED
	sflags = parse_flags(argc, argv);
#endif

New:

#ifdef TEST_FORMATTED
	int sflags = 0;
	sflags = parse_flags(argc, argv);
#endif

For test1.c, I made the same change as in test2.c: I combined the #ifdef TEST_FORMATTED blocks so that they take fewer lines. I also applied hoisting to optimize the loops. Here is how I changed it:

Old:

	my_array = json_object_new_array();
	json_object_array_add(my_array, json_object_new_int(1));
	json_object_array_add(my_array, json_object_new_int(2));
	json_object_array_add(my_array, json_object_new_int(3));
	json_object_array_put_idx(my_array, 4, json_object_new_int(5));
	printf("my_array=\n");
	for(i=0; i < json_object_array_length(my_array); i++)
	{
		json_object *obj = json_object_array_get_idx(my_array, i);
		printf("\t[%d]=%s\n", (int)i, json_object_to_json_string(obj));
	}
	printf("my_array.to_string()=%s\n", json_object_to_json_string(my_array));

	json_object_put(my_array);

	test_array_del_idx();

	my_array = json_object_new_array();
	json_object_array_add(my_array, json_object_new_int(3));
	json_object_array_add(my_array, json_object_new_int(1));
	json_object_array_add(my_array, json_object_new_int(2));
	json_object_array_put_idx(my_array, 4, json_object_new_int(0));
	printf("my_array=\n");
	for(i=0; i < json_object_array_length(my_array); i++)
	{
		json_object *obj = json_object_array_get_idx(my_array, i);
		printf("\t[%d]=%s\n", (int)i, json_object_to_json_string(obj));
	}
	printf("my_array.to_string()=%s\n", json_object_to_json_string(my_array));
	json_object_array_sort(my_array, sort_fn);
	printf("my_array=\n");
	for(i=0; i < json_object_array_length(my_array); i++)
	{
		json_object *obj = json_object_array_get_idx(my_array, i);
		printf("\t[%d]=%s\n", (int)i, json_object_to_json_string(obj));
	}

New:

	my_array = json_object_new_array();
	json_object_array_add(my_array, json_object_new_int(1));
	json_object_array_add(my_array, json_object_new_int(2));
	json_object_array_add(my_array, json_object_new_int(3));
	json_object_array_put_idx(my_array, 4, json_object_new_int(5));
	unsigned int my_array_length = json_object_array_length(my_array);
	printf("my_array=\n");
	for(i=0; i < my_array_length; i++)
	{
		json_object *obj = json_object_array_get_idx(my_array, i);
		printf("\t[%d]=%s\n", (int)i, json_object_to_json_string(obj));
	}
	printf("my_array.to_string()=%s\n", json_object_to_json_string(my_array));

	json_object_put(my_array);

	test_array_del_idx();

	my_array = json_object_new_array();
	json_object_array_add(my_array, json_object_new_int(3));
	json_object_array_add(my_array, json_object_new_int(1));
	json_object_array_add(my_array, json_object_new_int(2));
	json_object_array_put_idx(my_array, 4, json_object_new_int(0));
	printf("my_array=\n");
	for(i=0; i < my_array_length; i++)
	{
		json_object *obj = json_object_array_get_idx(my_array, i);
		printf("\t[%d]=%s\n", (int)i, json_object_to_json_string(obj));
	}
	printf("my_array.to_string()=%s\n", json_object_to_json_string(my_array));
	json_object_array_sort(my_array, sort_fn);
	printf("my_array=\n");
	for(i=0; i < my_array_length; i++)
	{
		json_object *obj = json_object_array_get_idx(my_array, i);
		printf("\t[%d]=%s\n", (int)i, json_object_to_json_string(obj));
	}

After making sure the output had not changed, I ran the make command again to get my result, and the runtime had gone down quite a bit. All the tests passed and there were no problems running the command, which suggests that this optimization was successful.

So in the end, I was able to make the code shorter and make it run better than before. I will try to get my code accepted by the upstream project. Here is my GitHub repo if you want to check it out. Thank you for your time.

GitHub link: https://github.com/hoaianhkhang/json-c/tree/test-build

Project Update 2

Hello everyone again,

I am back with an update on my project. Right now, I am diving deeper into the author's code and profiling it while I work. It is hard to improve the code, as I am still figuring out how to make it as optimized as possible. I hope to have more to share in the near future. Thank you for your time.

Project Update 1

Hello everyone,

It has been a long time since I first enrolled in this course. I have learned a lot about software and how it is compiled in general. For this course, I need to choose a project to work on. After a while, I decided to go with json-c, because I have always wanted to work with the C language and this project lets me improve my skills further.

First, I cloned the project from GitHub so that I can work on it freely.

[haung1@xerxes ~]$  git clone https://github.com/json-c/json-c.git
Cloning into 'json-c'...
remote: Enumerating objects: 121, done.
remote: Counting objects: 100% (121/121), done.
remote: Compressing objects: 100% (83/83), done.
remote: Total 3956 (delta 60), reused 78 (delta 38), pack-reused 3835
Receiving objects: 100% (3956/3956), 2.91 MiB | 13.43 MiB/s, done.
Resolving deltas: 100% (2545/2545), done.

Then I created a branch so that my work does not interfere with the original branch, and built the project with the help of the GitHub page. After that, I began to benchmark the application on an AArch64 system by running the make test command in the CMake build directory.


[haung1@xerxes build-test]$ make test
Running tests...
Test project /home/haung1/json-c/build-test
      Start  1: test1
 1/21 Test  #1: test1 ............................   Passed    1.54 sec
      Start  2: test2
 2/21 Test  #2: test2 ............................   Passed    1.41 sec
      Start  3: test4
 3/21 Test  #3: test4 ............................   Passed    0.26 sec
      Start  4: testReplaceExisting
 4/21 Test  #4: testReplaceExisting ..............   Passed    0.23 sec
      Start  5: test_cast
 5/21 Test  #5: test_cast ........................   Passed    0.28 sec
      Start  6: test_charcase
 6/21 Test  #6: test_charcase ....................   Passed    0.22 sec
      Start  7: test_compare
 7/21 Test  #7: test_compare .....................   Passed    0.23 sec
      Start  8: test_deep_copy
 8/21 Test  #8: test_deep_copy ...................   Passed    0.27 sec
      Start  9: test_double_serializer
 9/21 Test  #9: test_double_serializer ...........   Passed    0.23 sec
      Start 10: test_float
10/21 Test #10: test_float .......................   Passed    0.24 sec
      Start 11: test_int_add
11/21 Test #11: test_int_add .....................   Passed    0.20 sec
      Start 12: test_json_pointer
12/21 Test #12: test_json_pointer ................   Passed    0.28 sec
      Start 13: test_locale
13/21 Test #13: test_locale ......................   Passed    0.28 sec
      Start 14: test_null
14/21 Test #14: test_null ........................   Passed    0.22 sec
      Start 15: test_parse
15/21 Test #15: test_parse .......................   Passed    0.28 sec
      Start 16: test_parse_int64
16/21 Test #16: test_parse_int64 .................   Passed    0.22 sec
      Start 17: test_printbuf
17/21 Test #17: test_printbuf ....................   Passed    0.22 sec
      Start 18: test_set_serializer
18/21 Test #18: test_set_serializer ..............   Passed    0.23 sec
      Start 19: test_set_value
19/21 Test #19: test_set_value ...................   Passed    0.20 sec
      Start 20: test_util_file
20/21 Test #20: test_util_file ...................   Passed    0.26 sec
      Start 21: test_visit
21/21 Test #21: test_visit .......................   Passed    0.27 sec

100% tests passed, 0 tests failed out of 21

Total Test time (real) =   7.58 sec

After running the tests, I saw that test1 and test2 are the two that took the most time to run, so I will focus on optimizing them first. I see that I can improve this application in several ways:

  • Altered build options:
    I may try some compiler options such as -g or -O3 and adjust them depending on my progress.
  • Code changes to permit better optimization by the compiler, and algorithm improvements: To reduce the runtime, I will apply hoisting and inlining, and also try strength reduction for some loops in the code. Even though the owner has already optimized it quite well, I think I can improve it further to get a better runtime. After learning these techniques in the SPO600 course, I believe I can apply my knowledge to this project.

As I stated at the beginning, I will do all of my experiments in a separate branch to be safe. I will test my code multiple times and compare it with the original code, making sure that nothing differs except the runtime, so that the result is trustworthy. I will post more in my next update. Thank you for reading.

Lab 5

Hey there everyone,

It is me, Anh, again with some of my experience with Lab 5, a lab that is, frankly, quite hard for me. I will talk about how I resolved my problems throughout this post. For this lab, I have to use some of the Lab 4 files again, which is a little bit convenient.

Part 1: Auto-Vectorization

First, I changed the Makefile by adding the -fopt-info-vec-all option.

Then, after I ran the command gcc -g -O3 -fopt-info-vec-all vol1.c -o vol1, I saw that the loop at line 32 was vectorized.


The loop at line 38, however, was not vectorized yet.

So, in order to vectorize that part of the code, I decided to change the loop that sums up the data from this:

To this:

After running the command again, that line was vectorized, so I have successfully vectorized one more loop.

Part 2: Inline Assembler

For the next part, I need to look at add.c and make sure that I understand how the inline assembler code works and why. The task is to modify the code to calculate b mod a using inline assembler and print the result. So I changed what I needed in the add.c file.

After that, I ran the time command to get the runtime.

The next objective involves the file vol_inline.c, which contains a version of the volume scaling problem that uses inline assembler and the SQDMULH instruction. I need to copy, build, and verify the operation of this program on an AArch64 system.

As we can see, the default number of samples is defined in this file:

vol.h

#define SAMPLES 5000000

And with a simple time command, we can see how long it takes to run.

If I increase the number of samples, the runtime increases, and if I lower the number of samples, the runtime decreases too.

Now I will try to answer some of the questions in this lab.

Question 1: What is an alternate approach?
I would let the compiler choose which registers to use for the variables instead of doing it myself.

Question 2: Should we use 32767 or 32768 in next line? why?
I use 32767, since the upper bound of an int16_t value is 32767. The value 32768 does not fit in an int16_t and would overflow.

Question 3: What does it mean to “duplicate” values in the next line?
It means putting the same value into every vector lane of the register.

Here, it means storing the volume factor into the SIMD register 8 times.

Question 4: Why is #16 included in the str line but not in the ldr line?
I do not want to advance the cursor at the ldr, since I still need the current cursor position to store the values with the str.

Question 5: What do these next three lines do?

The 1st line is the output operand. “+r” means that it is a read/write register.

The 2nd line declares the input operand.

The 3rd line declares that the asm clobbers memory, which means the compiler will reload data from memory after it executes.

Question 6: are the results usable? are they correct?
It does not return the same number, so I would say no.

Part 3: C Intrinsics

For this part, I need to run the vol_intrinsics program. After running the command, this is how long it took.

If I increase the number of samples, the runtime increases, and if I lower the number of samples, the runtime decreases too.

Question 1: What do these intrinsic functions do?
vst1q_s16 stores a single vector to memory.
vqdmulhq_s16 multiplies 2 vectors lane by lane (a saturating doubling multiply that keeps the high half of each result).
vdupq_n_s16 sets all vector lanes to the same value.

Question 2: Why is the increment below 8 instead of 16 or some other value?
We are using int16_t, so a 128-bit vector register holds 8 lanes. We increment by 8 to get to the next 8 values.

Question 3: Why is this line not needed in the inline assembler version of this program?
Because we set the vectors up to be 8 lanes wide, and the increment is done inside the inline assembler (the #16 on the str line), while it is not done by the intrinsics.

Question 4: are the results usable? are they correct?
It does not return the same result.

This lab took longer than I thought it would. I may come back to update this post if I need to. Until then, see you in my next post.

Lab 4

Salutations, my fellow programmers,

It is me again, with an update on my coding adventure. It took me a long time, but I am now able to report my progress on SPO600 Lab 4.

These are my objectives, which I copied from the Lab 4 page:

+Build and test this file.

  • Does it produce the same output each time?

+Test the performance of this program.

  • How long does it take to run the scaling?
  • How much time is spent scaling the sound samples? Be sure to eliminate the time taken for the non-scaling part of the program (e.g., random sample generation).
  • Do multiple runs take the same time? How much variation do you observe? What is the likely cause of this variation?
  • Is there any difference in the results produced by the various algorithms? How much does numeric accuracy matter in this application?

So first of all, I extracted the whole archive, just as I should.

Then I started running the test to see its run time.

I noticed that it gave the same result, 94, each time. After that, I pre-calculated a lookup table (array) of all possible sample values multiplied by the volume factor, and looked up each sample in that table to get the scaled values. This is what I got.

And here, I converted the volume factor 0.75 to a fixed-point integer by multiplying it by a binary number representing a fixed-point value of “1”.

Overall, this lab gave me a chance to practice alternative methods in my code. I will post more with my Lab 5 progress. Thanks for reading.

SPO600 Lab 1

Hi there everyone,

This is my first lab for SPO600, so it is quite easy. I just need to find two open source software packages that have different licenses, so I chose Audacity, which uses the GNU General Public License (GPL), and LibreOffice, which uses the Mozilla Public License v2.0.

+ Accept Code Procedure:

Audacity uses a mailing list and GitHub to receive code from its community, which proves to be very useful for receiving and improving contributions, and is widely accessible to a lot of open source developers. Meanwhile, LibreOffice uses Bugzilla for bug tracking and testing bug reports, and also uses GitHub to provide source code to developers.

+ Successfully Submitted Patch

The next thing I did was identify one successfully submitted patch from the community. I found a pull request on the Audacity GitHub repo, from a user named Pokechu22, asking for support for labels in two formats. From what I have seen, this user forked the original repo, worked on it on his own in a new branch called subrip-label-v2, and then created a pull request asking for the master branch to be updated. After some checking, the request passed and was accepted by the owner of Audacity. There were a lot of other people working on this project too. Issues were being resolved and pull requests were being created constantly. It usually does not take long for the participants and the owner to communicate with each other about how to improve the application even more. GitHub is a really convenient tool, so I can see why these two applications prefer to use it.

Overall, I think this approach to getting bug fixes and communicating with developers is very convenient and helpful. If I were in the community, I would use GitHub carefully and test my submission in many ways, so that it gives a helpful result and contributes to this big open source community.

0.4 Final

Hey, hello for the last time.

So, this is my final post for this course (I may update this site with content from other courses in the future).

For the last few weeks, I have had a lot of work to do, from my job and also from other courses, so I had to choose something that suited my time. After a while, I found a roguelike project that is really interesting. It needed some help with implementing a stat, so I helped the owner with it. It was a JavaScript feature that I implemented after I found out that he wanted it to work like the Pokemon Speed stat. I added another variable to both the player and monster stats, and also changed the battlehandler to support the speed stat by deciding whether the monster or the player has the higher stat. After this project, I learned more about how games work on a website, and I may use that in my other courses in the future.

That is it for now. I think I will be off for a while. See you guys later.