)]}'
{
  "commit": "a01cb764bd34d3bb11b12a3262ffaffe2a99fd22",
  "tree": "d24a4da73be7a1efbda04f290f025ed399c82e9b",
  "parents": [
    "8c53c1d9c4bf6787d4e12f90d73165863f6f726e"
  ],
  "author": {
    "name": "Simon Marchi",
    "email": "simon.marchi@polymtl.ca",
    "time": "Fri Sep 19 16:27:00 2025 -0400"
  },
  "committer": {
    "name": "Simon Marchi",
    "email": "simon.marchi@efficios.com",
    "time": "Tue Sep 30 19:37:20 2025 +0000"
  },
  "message": "gdbsupport: use dynamic partitioning in gdb::parallel_for_each\n\ngdb::parallel_for_each uses static partitioning of the workload, meaning\nthat each worker thread receives a similar number of work items.  Change\nit to use dynamic partitioning, where worker threads pull work items\nfrom a shared work queue when they need to.\n\nNote that gdb::parallel_for_each is currently only used for processing\nminimal symbols in GDB.  I am looking at improving the startup\nperformance of GDB, where the minimal symbol process is one step.\n\nWith static partitioning, there is a risk of workload imbalance if some\nthreads receive \"easier\" work than others.  Some threads sit still while\nothers finish working on their share of the work.  This is not\ndesirable, because the gdb::parallel_for_each takes as long as the\nslowest thread takes.\n\nWhen loading a file with a lot of minimal symbols (~600k) in GDB, with\n\"maint set per-command time on\", I observe some imbalance:\n\n    Time for \"minsyms install worker\": wall 0.732, user 0.550, sys 0.041, user+sys 0.591, 80.7 % CPU\n    Time for \"minsyms install worker\": wall 0.881, user 0.722, sys 0.071, user+sys 0.793, 90.0 % CPU\n    Time for \"minsyms install worker\": wall 2.107, user 1.804, sys 0.147, user+sys 1.951, 92.6 % CPU\n    Time for \"minsyms install worker\": wall 2.351, user 2.003, sys 0.151, user+sys 2.154, 91.6 % CPU\n    Time for \"minsyms install worker\": wall 2.611, user 2.322, sys 0.235, user+sys 2.557, 97.9 % CPU\n    Time for \"minsyms install worker\": wall 3.074, user 2.729, sys 0.203, user+sys 2.932, 95.4 % CPU\n    Time for \"minsyms install worker\": wall 3.486, user 3.074, sys 0.260, user+sys 3.334, 95.6 % CPU\n    Time for \"minsyms install worker\": wall 3.927, user 3.475, sys 0.336, user+sys 3.811, 97.0 % CPU\n                                              ^\n                                          ----´\n\nThe fastest thread took 0.732 seconds to complete its work (and then sat\nstill), while the slowest took 3.927 seconds.  This means the\nparallel_for_each took a bit less than 4 seconds.\n\nEven if the number of minimal symbols assigned to each worker is the\nsame, I suppose that some symbols (e.g. those that need demangling) take\nlonger to process, which could explain the imbalance.\n\nWith this patch, things are much more balanced:\n\n    Time for \"minsym install worker\": wall 2.807, user 2.222, sys 0.144, user+sys 2.366, 84.3 % CPU\n    Time for \"minsym install worker\": wall 2.808, user 2.073, sys 0.131, user+sys 2.204, 78.5 % CPU\n    Time for \"minsym install worker\": wall 2.804, user 1.994, sys 0.151, user+sys 2.145, 76.5 % CPU\n    Time for \"minsym install worker\": wall 2.808, user 1.977, sys 0.135, user+sys 2.112, 75.2 % CPU\n    Time for \"minsym install worker\": wall 2.808, user 2.061, sys 0.142, user+sys 2.203, 78.5 % CPU\n    Time for \"minsym install worker\": wall 2.809, user 2.012, sys 0.146, user+sys 2.158, 76.8 % CPU\n    Time for \"minsym install worker\": wall 2.809, user 2.178, sys 0.137, user+sys 2.315, 82.4 % CPU\n    Time for \"minsym install worker\": wall 2.820, user 2.141, sys 0.170, user+sys 2.311, 82.0 % CPU\n                                              ^\n                                          ----´\n\nIn this version, the parallel_for_each took about 2.8 seconds,\nrepresenting a reduction of ~1.2 seconds for this step.  Not\nlife-changing, but it\u0027s still good I think.\n\nNote that this patch helps when loading big programs.  My go-to test\nprogram for this is telegram-desktop that I built from source.  For\nsmall programs (including loading gdb itself), it makes no perceptible\ndifference.\n\nNow the technical bits:\n\n - One impact that this change has on the minimal symbol processing\n   specifically is that not all calls to compute_and_set_names (a\n   critical region guarded by a mutex) are done at the end of each\n   worker thread\u0027s task anymore.\n\n   Before this patch, each thread would compute the names and hash values for\n   all the minimal symbols it has been assigned, and then would call\n   compute_and_set_names for all of them, while holding the mutex (thus\n   preventing other threads from doing this same step).\n\n   With the shared work queue approach, each thread grabs a batch of of\n   minimal symbols, computes the names and hash values for them, and\n   then calls compute_and_set_names (with the mutex held) for this batch\n   only.  It then repeats that until the work queue is empty.\n\n   There are therefore more small and spread out compute_and_set_names\n   critical sections, instead of just one per worker thread at the end.\n   Given that before this patch the work was not well balanced among worker\n   threads, I guess that threads would enter that critical region at\n   roughly different times, causing little contention.\n\n   In the \"with this patch\" results, the CPU utilization numbers are not\n   as good, suggesting that there is some contention.  But I don\u0027t know\n   if it\u0027s contention due to the compute_and_set_names critical section\n   or the shared work queue critical section.  That can be investigated\n   later.  In any case, what ultimately counts is the wall time, which\n   improves.\n\n - One choice I had to make was to decide how many work items (in this\n   case minimal symbols) each worker should pop when getting work from\n   the shared queue.  The general wisdom is that:\n\n   - popping too few items, and the synchronization overhead becomes\n     significant, and the total processing time increases\n   - popping too many items, and we get some imbalance back, and the\n     total processing time increases again\n\n   I experimented using a dynamic batch size proportional to the number\n   of remaining work items.  It worked well in some cases but not\n   always.  So I decided to keep it simple, with a fixed batch size.\n   That can always be tweaked later.\n\n - I want to still be able to use scoped_time_it to measure the time\n   that each worker thread spent working on the task.  I find it really\n   handy when measuring the performance impact of changes.\n\n   Unfortunately, the current interface of gdb::parallel_for_each, which\n   receives a simple callback, is not well-suited for that, once I\n   introduce the dynamic partitioning.  The callback would get called\n   once for each work item batch (multiple time for each worker thread),\n   so it\u0027s not possible to maintain a per-worker thread object for the\n   duration of the parallel for.\n\n   To allow this, I changed gdb::parallel_for_each to receive a worker\n   type as a template parameter.  Each worker thread creates one local\n   instance of that type, and calls operator() on it for each work item\n   batch.  By having a scoped_time_it object as a field of that worker,\n   we can get the timings per worker thread.\n\n   The drawbacks of this approach is that we must now define the\n   parallel task in a separate class and manually capture any context we\n   need as fields of that class.\n\nChange-Id: Ibf1fea65c91f76a95b9ed8f706fd6fa5ef52d9cf\nApproved-By: Tom Tromey \u003ctom@tromey.com\u003e\n",
  "tree_diff": [
    {
      "type": "modify",
      "old_id": "c4a0fa873761497a11872a54257c841440f93bf5",
      "old_mode": 33188,
      "old_path": "gdb/minsyms.c",
      "new_id": "cc50febba6c90b12b8b1dc2b021bdc22060d198a",
      "new_mode": 33188,
      "new_path": "gdb/minsyms.c"
    },
    {
      "type": "modify",
      "old_id": "559446deef43c91eb69bcc6a52b0439d463184b9",
      "old_mode": 33188,
      "old_path": "gdb/unittests/parallel-for-selftests.c",
      "new_id": "9d86aa67b106c73c165cf14304574cc5ecbdef67",
      "new_mode": 33188,
      "new_path": "gdb/unittests/parallel-for-selftests.c"
    },
    {
      "type": "modify",
      "old_id": "50406763ded4eeadff80a3b362b7843d380fd798",
      "old_mode": 33188,
      "old_path": "gdbsupport/parallel-for.h",
      "new_id": "8d085df42fb9a2938abf6c3c84751fd714566872",
      "new_mode": 33188,
      "new_path": "gdbsupport/parallel-for.h"
    }
  ]
}
