eXtreme Development


Translate:  

  • Microbenchmark as a SDLC practice

    2021-07

    While developing software, many times we develop a correct solution but we always overlook the non functional requirements (specially performance and security).

    This is the first case:
    We were tracking down an issue on a chain of calls in a Java app that was already optimised in the past but it turns out the volumetry of data changed, so, parameters were slightly outdated.
    The fix was as simple as increasing the buffer to 64KB in a call in order to fit the whole json file on a Java BufferedReader object.
    While we didn't had much time to further analyse it, my personal hypothesis is that the Java app was calling the Linux read syscall quite a few times instead of a single read that gets the whole file content.
    The performance improvement was significant, a call that was taking quite a few milliseconds, sometimes, going up to seconds and blocking the Quarkus/Vert.x workers, now, it does the whole chain of calls in less than 60 milliseconds.

    Following that case, we tracked down another chain of calls in C++ related to this first case.
    Again, this C++ app was properly tuned in the past and always worked like a charm but (there is always a but), we noticed that strings were concatenated and that a call had 2 loops that could be unified in a single loop.

    So, I wrote this fairly simple C++ code to replicate the situation and microbenchmark it:

    
    #include <chrono>
    #include <ctime>
    #include <iomanip>
    #include <iostream>
    #include <sstream>
    #include <string>
    
    //     g++ -std=c++17 -O2 -o loop__gcctest loop.cpp
    // clang++ -std=c++17 -O2 -o loopclangtest loop.cpp
    
    // https://gist.github.com/polaris/adee936198995a6f8c697c419d21f734
    static std::string timePointAsString()
    {
        auto now = std::chrono::system_clock::now();
        auto in_time_t = std::chrono::system_clock::to_time_t(now);
        std::stringstream ss;
        ss << std::put_time(std::localtime(&in_time_t), "%Y-%m-%d %X");
        return ss.str();
    }
    
    int main()
    {
        auto w = 50000;
        std::chrono::duration<double> case1, case3;
        std::string str1, str3;
    
        //std::cout << "starting double loop with concatenation\n";
        auto start = std::chrono::system_clock::now();
        str1 = str1 + "part1\n";
        for (auto x = 0; x < w; x++)
        {
            str1 = str1 + std::to_string(x) + "," + timePointAsString() + "," + "abcdef\n";
        }
    
        str1 = str1 + "part2\n";
        for (auto x = 0; x < w; x++)
        {
            str1 = str1 + std::to_string(x) + "," + timePointAsString() + "," + "abcdef\n";
        }
        case1 = std::chrono::system_clock::now() - start;
    
        std::string temp1, temp2;
        //std::cout << "starting single loop with append\n";
        start = std::chrono::system_clock::now();
        temp1.append("part1\n");
        temp2.append("part2\n");
        for (auto x = 0; x < w; x++)
        {
            auto curr_time = timePointAsString();
    
            temp1.append(std::to_string(x));
            temp1.append(",");
            temp1.append(curr_time);
            temp1.append(",");
            temp1.append("abcdef\n");
    
            temp2.append(std::to_string(x));
            temp2.append(",");
            temp2.append(curr_time);
            temp2.append(",");
            temp2.append("abcdef\n");
        }
        str3.append(temp1);
        str3.append(temp2);
        case3 = std::chrono::system_clock::now() - start;
    
        std::cout << case1.count() << ";" << case3.count() << "\n";
    
        return 0;
    }
    

    The time difference stored on case1 variable is the current scenario AS-IS, the case3 variable is the TO-BE (the variable case2 was another variation of the concatenation, omitted to simplify the reading of this article).
    Microbenchmarking this sample code, I extracted the following graph which reads as "under 95% of the cases, it takes n seconds to concatenate/append data using gcc/clang":

    String concatenation with double loop vs String append() with single loop
    As we can see, the changes from double loop with concatenation to single loop with append brings a significant performance gain and it will be applied in the coming sprint! :)
    Ah, of course, the explanation for this performance issue is simple, replacing 2 loops by one reduces the number of iterations (O(2n) -> O(n)), replacing the concatenation is related to another thing: string append in order to be efficient, is usually implemented as a linked list, while string concatenations is usually implemented by copying the whole array every time a new piece of string is added.

    To sum up, these are some of the takeaways:
    • Always make resources (CPU, memory, disk, threads, file descriptors, etc) 'tunable' through an external configuration
    • Every time a new feature is introduced, benchmark it to establish a baseline
    • If you're changing the current code, pay attention:
      • where your application call graph spend most of its time
      • to the addition of new loops (to simplify the math, consider each loop with a cost of O(n), so, one loop inside another is O(n^2), one loop before/after another is O(2n), etc)
      • there are any queues on your code (queues are very effective to handle non-uniform workloads but you have to pay attention on how you model it)
      • don't saturate resources with parallelism, a very classic error is assuming that simply adding more threads and all problems are gone which is not true, quite the opposite, if you saturate threads, you will degrade the performance
      • a derivation of the above: adding more threads to a single process instead of creating more instances of that process with a few number of threads - adding more threads to a process will divide the quantum of time that this process has on operating system, if you add more processes with less threads, instead, you will gain more quantum of time from the operating system for your applications
    • Don't forget to check the best practices for your programming language
    • Most important: Please, add microbenchmark as a practice on your SDLC


 

Geraldo Netto
SourceForge
{
 "name": "Geraldo Netto" 
 "contacts": [
          
       
 ]
 "sponsor me": [
       
 ]
 "food": [
      
 ]
 "folk": [
          
 ]
 "mindset": [
   "Control Theory"
   "Seneca"
   "utilitarianism"
   "..."
 ]
 "praying": [
   "Hermetism"
   "Kabbalah"
   "Spiritism"
   "Tantra"
   "Umbanda Branca"
   "..."
 ]
 "music": [
   "Cafe Del Mar"
   "Fito Paez"
   "Jennifer Rostock"
   "Juli"
   "Leningrad"
   "Mafalda Veiga"
   "Oasis"
   "Raimundos"
   "Rammstein"
   "Sabaton"
   "Skank"
   "The Cure"
   "Tom Jobim"
   "Zucchero"
   "..."
 ]
 "arts": [
   "Kandinsky"
   "Pollock"
   "..."
 ]
 "tech": [
   "algorithms"
   "distributed systems"
   "linux"
   "statistics"
   "..."
 ]
}