Message ID | 20240904134152.2141-4-thunder.leizhen@huawei.com |
---|---|
State | New |
Series | [1/3] list: add hlist_cut_number() |
On Wed, Sep 04 2024 at 21:41, Zhen Lei wrote:
> Currently, there are multiple places where several nodes are extracted
> from one list and added to another list. Extracting and splicing the
> nodes one by one is not only inefficient, it also hurts readability.
> The same work can be done with hlist_cut_number() and
> hlist_splice_init(), which move the entire sublist at once.
>
> When the number of nodes expected to be moved is less than or equal to
> 0, or the source list is empty, hlist_cut_number() safely returns 0.
> The splice is performed only when the return value of
> hlist_cut_number() is greater than 0.
>
> For the two calls to hlist_cut_number() in __free_object(), the result
> is obviously positive, so the check of the return value is omitted.

Sure, but hlist_cut_number() suffers from the same problem as the
current code. It is a massive cache line chase as you actually have to
walk the list to figure out where to cut it off.

All related functions have this problem and all of this code is very
strict about boundaries. Instead of accurately doing the refill, purge
etc. we should look into proper batch mode mechanisms. Let me think
about it.

Thanks,

        tglx
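For readers following the exchange: hlist_cut_number() and hlist_splice_init() are added earlier in this series and are not shown on this page. A minimal sketch of what such helpers could look like — inferred only from how the diff further down uses them, so the real patches may well differ — is:

```c
#include <linux/list.h>

/*
 * Sketch only: detach up to @count nodes from the front of @from and hang
 * them on @to. Returns the number of nodes actually moved and stores the
 * last moved node in @last. Returns 0 for count <= 0 or an empty @from.
 */
static inline int hlist_cut_number(struct hlist_head *to, struct hlist_head *from,
				   int count, struct hlist_node **last)
{
	struct hlist_node *pos = from->first;
	int n = 0;

	if (count <= 0 || !pos)
		return 0;

	/* Walk forward until @count nodes are covered or the list ends. */
	while (n < count) {
		n++;
		*last = pos;
		if (!pos->next)
			break;
		pos = pos->next;
	}

	/* Move the [from->first, *last] segment as one piece. */
	to->first = from->first;
	to->first->pprev = &to->first;
	from->first = (*last)->next;
	if (from->first)
		from->first->pprev = &from->first;
	(*last)->next = NULL;

	return n;
}

/*
 * Sketch only: splice the whole of @from (whose last node is @last) onto
 * the head of @to and reinitialize @from.
 */
static inline void hlist_splice_init(struct hlist_head *from, struct hlist_node *last,
				     struct hlist_head *to)
{
	if (hlist_empty(from))
		return;

	last->next = to->first;
	if (to->first)
		to->first->pprev = &last->next;
	to->first = from->first;
	to->first->pprev = &to->first;
	INIT_HLIST_HEAD(from);
}
```

The walk inside hlist_cut_number() is exactly the cache line chase the reply above objects to: finding the cut point still touches every node on the way there.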
On 2024/9/10 19:44, Thomas Gleixner wrote:
> On Tue, Sep 10 2024 at 12:00, Leizhen wrote:
>> On 2024/9/10 2:41, Thomas Gleixner wrote:
>>> All related functions have this problem and all of this code is very
>>> strict about boundaries. Instead of accurately doing the refill, purge
>>> etc. we should look into proper batch mode mechanisms. Let me think
>>> about it.
>>
>> It may be helpful to add several arrays to record the first node of each
>> batch in each free list. Take 'percpu_pool' as an example:
>>
>> struct debug_percpu_free {
>> 	struct hlist_head	free_objs;
>> 	int			obj_free;
>> +	int			batch_idx;
>> +	struct hlist_node	*batch_first[4];  // ODEBUG_POOL_PERCPU_SIZE / ODEBUG_BATCH_SIZE
>> };
>>
>> A new free node is added to the head of the list, and a batch is cut
>> from the tail of the list.
>>
>>   NodeA<-->...<-->NodeB<-->...<-->NodeC<-->NodeD<-->free_objs
>>         |---one batch---|---one batch---|
>>         |               |
>>         batch_first[0]  batch_first[1]
>
> The current data structures are not fit for the purpose. Gluing
> workarounds into the existing mess makes it just worse.
>
> So the data structures need to be redesigned from the ground up to be
> fit for the purpose.
>
> allocation:
>
>   1) Using the global pool for single object allocations is wrong
>
>      During boot this can be a completely disconnected list, which does
>      not need any accounting, does not need pool_lock and can be just
>      protected with irqsave like the per CPU pools. It's effectively a
>      per CPU pool because at that point there is only one CPU and
>      everything is single threaded.
>
>   2) The per CPU pool handling is backwards
>
>      If the per CPU pool is empty, then the pool needs to be refilled
>      with a batch from the global pool and allocated from there.
>
>      Allocation then always happens from the active per CPU batch slot.
>
> free:
>
>   1) Early boot
>
>      Just put it back on the dedicated boot list and be done
>
>   2) After obj_cache is initialized
>
>      Put it back to the per CPU pool into the active batch slot. If the
>      slot becomes full then make the next slot the active slot. If the
>      full slot was the top most slot then move that slot either into the
>      global pool when there is a free slot, or move it to the to_free
>      pool.
>
> That means the per CPU pool is different from the global pools as it can
> allocate/free single objects, while the global pools are strictly stacks
> of batches. Any movement between per CPU pools and global pools is batch
> based and just moves lists from one head to another.
>
> That minimizes the pool lock contention and the cache footprint. The
> global to_free pool must have an extra twist to accommodate non-batch
> sized drops and to handle the all-slots-are-full case, but that's just a
> trivial detail.

That's great. I really admire you for completing the refactor in such a
short time. But I have a few minor comments.

1. When kmem_cache_zalloc() is called to allocate objs for filling,
   if less than one batch of objs is allocated, all of them can be
   pushed to the local CPU. That is, call pcpu_free() one by one.
   In this way, the number of free objs cached by pool_global and
   pool_to_free is always an integer multiple of ODEBUG_BATCH_SIZE.
2. Member tot_cnt of struct global_pool can be deleted. We can get it
   simply and quickly through (slot_idx * ODEBUG_BATCH_SIZE). Avoid
   redundant maintenance.
3. debug_objects_pool_min_level also needs to be adjusted accordingly,
   to the number of batches of the min level.

> See the completely untested combo patch against tip core/debugobjects
> below.
>
> Thanks,
>
>         tglx
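A rough sketch of the batch-based data model described in the reply quoted above, assuming illustrative names and sizes (pool_global, pool_percpu, a 16-object batch) rather than anything from an actual patch:

```c
#include <linux/spinlock.h>
#include <linux/list.h>

#define ODEBUG_BATCH_SIZE	16
#define ODEBUG_POOL_SIZE	1024
#define ODEBUG_POOL_SLOTS	(ODEBUG_POOL_SIZE / ODEBUG_BATCH_SIZE)

/*
 * Illustrative only: the global pools are strict stacks of full batches.
 * Each slot holds exactly ODEBUG_BATCH_SIZE objects, so refill/purge moves
 * one list head instead of walking individual nodes.
 */
struct pool_global {
	raw_spinlock_t		lock;
	int			slot_idx;			/* number of filled slots */
	struct hlist_head	slots[ODEBUG_POOL_SLOTS];	/* one full batch per slot */
};

/*
 * Illustrative only: the per CPU pool is the only place where single
 * objects are allocated and freed. When it runs empty it pops a whole
 * slot from the global pool; when it overflows it pushes a whole slot
 * back (or to the to_free pool).
 */
struct pool_percpu {
	int			obj_free;	/* objects currently in free_objs */
	struct hlist_head	free_objs;	/* single-object alloc/free */
};

/* Pop one full batch from the global pool into an empty per CPU pool. */
static bool pool_percpu_refill(struct pool_percpu *pcp, struct pool_global *gp)
{
	unsigned long flags;
	bool ok = false;

	raw_spin_lock_irqsave(&gp->lock, flags);
	if (gp->slot_idx > 0) {
		/* Move the whole batch as one list head: no per-node walk. */
		hlist_move_list(&gp->slots[--gp->slot_idx], &pcp->free_objs);
		pcp->obj_free = ODEBUG_BATCH_SIZE;
		ok = true;
	}
	raw_spin_unlock_irqrestore(&gp->lock, flags);
	return ok;
}
```

The point of this shape is that refill and purge move one hlist_head per batch; only the per CPU pool ever walks or counts individual objects.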
On Wed, Sep 11 2024 at 15:44, Leizhen wrote:
> On 2024/9/10 19:44, Thomas Gleixner wrote:
>> That minimizes the pool lock contention and the cache footprint. The
>> global to_free pool must have an extra twist to accommodate non-batch
>> sized drops and to handle the all-slots-are-full case, but that's just a
>> trivial detail.
>
> That's great. I really admire you for completing the refactor in such a
> short time.

The trick is to look at it from the data model and not from the code.
You need to sit down and think about which data model is required to
achieve what you want. So the goal was batching, right?

That made it clear that the global pools need to be stacks of batches
and never handle single objects because that makes it complex. As a
consequence the per CPU pool is the one which does single object
alloc/free and then either gets a full batch from the global pool or
drops one into it. The rest is just mechanical.

> But I have a few minor comments.
> 1. When kmem_cache_zalloc() is called to allocate objs for filling,
>    if less than one batch of objs is allocated, all of them can be
>    pushed to the local CPU. That is, call pcpu_free() one by one.

If that's the case then we should actually immediately give them back
because that's a sign of memory pressure.

> 2. Member tot_cnt of struct global_pool can be deleted. We can get it
>    simply and quickly through (slot_idx * ODEBUG_BATCH_SIZE). Avoid
>    redundant maintenance.

Agreed.

> 3. debug_objects_pool_min_level also needs to be adjusted accordingly,
>    to the number of batches of the min level.

Sure. There are certainly more problems with that code. As I said, it's
untested and way too big to be reviewed. I'll split it up into more
manageable bits and pieces.

Thanks,

        tglx
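Point 2 above, which is agreed to here, amounts to deriving the count instead of maintaining it. In terms of the illustrative pool_global sketched earlier (again an assumption, not code from the series):

```c
/*
 * Illustrative only: with a strictly batch-based global pool there is no
 * need to maintain a separate total counter; it follows directly from the
 * number of filled slots.
 */
static inline int pool_global_free_count(const struct pool_global *gp)
{
	return gp->slot_idx * ODEBUG_BATCH_SIZE;
}
```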
On 2024/9/11 16:54, Thomas Gleixner wrote:
> On Wed, Sep 11 2024 at 15:44, Leizhen wrote:
>> On 2024/9/10 19:44, Thomas Gleixner wrote:
>>> That minimizes the pool lock contention and the cache footprint. The
>>> global to_free pool must have an extra twist to accommodate non-batch
>>> sized drops and to handle the all-slots-are-full case, but that's just
>>> a trivial detail.
>>
>> That's great. I really admire you for completing the refactor in such a
>> short time.
>
> The trick is to look at it from the data model and not from the code.
> You need to sit down and think about which data model is required to
> achieve what you want. So the goal was batching, right?

Yes. When I found a hole in the road, I thought about how to fill it.
But you think more deeply: why is the hole there, and is there a problem
with the foundation? I've benefited a lot from communicating with you
these days.

> That made it clear that the global pools need to be stacks of batches
> and never handle single objects because that makes it complex. As a
> consequence the per CPU pool is the one which does single object
> alloc/free and then either gets a full batch from the global pool or
> drops one into it. The rest is just mechanical.
>
>> But I have a few minor comments.
>> 1. When kmem_cache_zalloc() is called to allocate objs for filling,
>>    if less than one batch of objs is allocated, all of them can be
>>    pushed to the local CPU. That is, call pcpu_free() one by one.
>
> If that's the case then we should actually immediately give them back
> because that's a sign of memory pressure.

Yes, that makes sense, and that's a solution too.

>> 2. Member tot_cnt of struct global_pool can be deleted. We can get it
>>    simply and quickly through (slot_idx * ODEBUG_BATCH_SIZE). Avoid
>>    redundant maintenance.
>
> Agreed.
>
>> 3. debug_objects_pool_min_level also needs to be adjusted accordingly,
>>    to the number of batches of the min level.
>
> Sure. There are certainly more problems with that code. As I said, it's
> untested and way too big to be reviewed. I'll split it up into more
> manageable bits and pieces.

Looking forward to it...

> Thanks,
>
>         tglx
On Wed, Sep 11 2024 at 10:54, Thomas Gleixner wrote:
> On Wed, Sep 11 2024 at 15:44, Leizhen wrote:
>> 2. Member tot_cnt of struct global_pool can be deleted. We can get it
>>    simply and quickly through (slot_idx * ODEBUG_BATCH_SIZE). Avoid
>>    redundant maintenance.
>
> Agreed.
>
>> 3. debug_objects_pool_min_level also needs to be adjusted accordingly,
>>    to the number of batches of the min level.
>
> Sure. There are certainly more problems with that code. As I said, it's
> untested and way too big to be reviewed. I'll split it up into more
> manageable bits and pieces.

I finally found some time and ended up doing it differently. I'll send
out the patches later today.

Thanks,

        tglx
diff --git a/lib/debugobjects.c b/lib/debugobjects.c
index db8f6b4b8b3151a..1cb9458af3d0b4f 100644
--- a/lib/debugobjects.c
+++ b/lib/debugobjects.c
@@ -128,8 +128,9 @@ static const char *obj_states[ODEBUG_STATE_MAX] = {
 static void fill_pool(void)
 {
 	gfp_t gfp = __GFP_HIGH | __GFP_NOWARN;
-	struct debug_obj *obj;
+	HLIST_HEAD(freelist);
 	unsigned long flags;
+	int cnt;
 
 	/*
 	 * The upper-layer function uses only one node at a time. If there are
@@ -152,17 +153,19 @@ static void fill_pool(void)
 	 * the WRITE_ONCE() in pool_lock critical sections.
 	 */
 	if (READ_ONCE(obj_nr_tofree)) {
+		struct hlist_node *last;
+
 		raw_spin_lock_irqsave(&pool_lock, flags);
 		/*
 		 * Recheck with the lock held as the worker thread might have
 		 * won the race and freed the global free list already.
 		 */
-		while (obj_nr_tofree && (obj_pool_free < debug_objects_pool_min_level)) {
-			obj = hlist_entry(obj_to_free.first, typeof(*obj), node);
-			hlist_del(&obj->node);
-			WRITE_ONCE(obj_nr_tofree, obj_nr_tofree - 1);
-			hlist_add_head(&obj->node, &obj_pool);
-			WRITE_ONCE(obj_pool_free, obj_pool_free + 1);
+		cnt = min(obj_nr_tofree, debug_objects_pool_min_level - obj_pool_free);
+		cnt = hlist_cut_number(&freelist, &obj_to_free, cnt, &last);
+		if (cnt > 0) {
+			hlist_splice_init(&freelist, last, &obj_pool);
+			WRITE_ONCE(obj_pool_free, obj_pool_free + cnt);
+			WRITE_ONCE(obj_nr_tofree, obj_nr_tofree - cnt);
 		}
 		raw_spin_unlock_irqrestore(&pool_lock, flags);
 	}
@@ -172,8 +175,6 @@ static void fill_pool(void)
 
 	while (READ_ONCE(obj_pool_free) < debug_objects_pool_min_level) {
 		struct debug_obj *new, *last = NULL;
-		HLIST_HEAD(freelist);
-		int cnt;
 
 		for (cnt = 0; cnt < ODEBUG_BATCH_SIZE; cnt++) {
 			new = kmem_cache_zalloc(obj_cache, gfp);
@@ -245,30 +246,28 @@ alloc_object(void *addr, struct debug_bucket *b, const struct debug_obj_descr *d
 	raw_spin_lock(&pool_lock);
 	obj = __alloc_object(&obj_pool);
 	if (obj) {
-		obj_pool_used++;
-		WRITE_ONCE(obj_pool_free, obj_pool_free - 1);
+		int cnt = 0;
 
 		/*
 		 * Looking ahead, allocate one batch of debug objects and
 		 * put them into the percpu free pool.
 		 */
 		if (likely(obj_cache)) {
-			int i;
-
-			for (i = 0; i < ODEBUG_BATCH_SIZE; i++) {
-				struct debug_obj *obj2;
-
-				obj2 = __alloc_object(&obj_pool);
-				if (!obj2)
-					break;
-				hlist_add_head(&obj2->node,
-					       &percpu_pool->free_objs);
-				percpu_pool->obj_free++;
-				obj_pool_used++;
-				WRITE_ONCE(obj_pool_free, obj_pool_free - 1);
+			struct hlist_node *last;
+			HLIST_HEAD(freelist);
+
+			cnt = hlist_cut_number(&freelist, &obj_pool, ODEBUG_BATCH_SIZE, &last);
+			if (cnt > 0) {
+				hlist_splice_init(&freelist, last, &percpu_pool->free_objs);
+				percpu_pool->obj_free += cnt;
 			}
 		}
 
+		/* add one for obj */
+		cnt++;
+		obj_pool_used += cnt;
+		WRITE_ONCE(obj_pool_free, obj_pool_free - cnt);
+
 		if (obj_pool_used > obj_pool_max_used)
 			obj_pool_max_used = obj_pool_used;
 
@@ -300,6 +299,7 @@ static void free_obj_work(struct work_struct *work)
 	struct debug_obj *obj;
 	unsigned long flags;
 	HLIST_HEAD(tofree);
+	int cnt;
 
 	WRITE_ONCE(obj_freeing, false);
 	if (!raw_spin_trylock_irqsave(&pool_lock, flags))
@@ -315,12 +315,12 @@ static void free_obj_work(struct work_struct *work)
 	 * may be gearing up to use more and more objects, don't free any
 	 * of them until the next round.
 	 */
-	while (obj_nr_tofree && obj_pool_free < debug_objects_pool_size) {
-		obj = hlist_entry(obj_to_free.first, typeof(*obj), node);
-		hlist_del(&obj->node);
-		hlist_add_head(&obj->node, &obj_pool);
-		WRITE_ONCE(obj_pool_free, obj_pool_free + 1);
-		WRITE_ONCE(obj_nr_tofree, obj_nr_tofree - 1);
+	cnt = min(obj_nr_tofree, debug_objects_pool_size - obj_pool_free);
+	cnt = hlist_cut_number(&tofree, &obj_to_free, cnt, &tmp);
+	if (cnt > 0) {
+		hlist_splice_init(&tofree, tmp, &obj_pool);
+		WRITE_ONCE(obj_pool_free, obj_pool_free + cnt);
+		WRITE_ONCE(obj_nr_tofree, obj_nr_tofree - cnt);
 	}
 	raw_spin_unlock_irqrestore(&pool_lock, flags);
 	return;
@@ -346,11 +346,12 @@ static void free_obj_work(struct work_struct *work)
 
 static void __free_object(struct debug_obj *obj)
 {
-	struct debug_obj *objs[ODEBUG_BATCH_SIZE];
 	struct debug_percpu_free *percpu_pool;
-	int lookahead_count = 0;
+	struct hlist_node *last;
+	HLIST_HEAD(freelist);
 	unsigned long flags;
 	bool work;
+	int cnt;
 
 	local_irq_save(flags);
 	if (!obj_cache)
@@ -371,56 +372,36 @@ static void __free_object(struct debug_obj *obj)
 	 * As the percpu pool is full, look ahead and pull out a batch
 	 * of objects from the percpu pool and free them as well.
 	 */
-	for (; lookahead_count < ODEBUG_BATCH_SIZE; lookahead_count++) {
-		objs[lookahead_count] = __alloc_object(&percpu_pool->free_objs);
-		if (!objs[lookahead_count])
-			break;
-		percpu_pool->obj_free--;
-	}
+	cnt = hlist_cut_number(&freelist, &percpu_pool->free_objs, ODEBUG_BATCH_SIZE, &last);
+	percpu_pool->obj_free -= cnt;
+
+	/* add one for obj */
+	cnt++;
+	hlist_add_head(&obj->node, &freelist);
 
 free_to_obj_pool:
 	raw_spin_lock(&pool_lock);
 	work = (obj_pool_free > debug_objects_pool_size) && obj_cache &&
 	       (obj_nr_tofree < ODEBUG_FREE_WORK_MAX);
-	obj_pool_used--;
+	obj_pool_used -= cnt;
 
 	if (work) {
-		WRITE_ONCE(obj_nr_tofree, obj_nr_tofree + 1);
-		hlist_add_head(&obj->node, &obj_to_free);
-		if (lookahead_count) {
-			WRITE_ONCE(obj_nr_tofree, obj_nr_tofree + lookahead_count);
-			obj_pool_used -= lookahead_count;
-			while (lookahead_count) {
-				hlist_add_head(&objs[--lookahead_count]->node,
-					       &obj_to_free);
-			}
-		}
+		WRITE_ONCE(obj_nr_tofree, obj_nr_tofree + cnt);
+		hlist_splice_init(&freelist, last, &obj_to_free);
 
 		if ((obj_pool_free > debug_objects_pool_size) &&
 		    (obj_nr_tofree < ODEBUG_FREE_WORK_MAX)) {
-			int i;
-
 			/*
 			 * Free one more batch of objects from obj_pool.
 			 */
-			for (i = 0; i < ODEBUG_BATCH_SIZE; i++) {
-				obj = __alloc_object(&obj_pool);
-				hlist_add_head(&obj->node, &obj_to_free);
-				WRITE_ONCE(obj_pool_free, obj_pool_free - 1);
-				WRITE_ONCE(obj_nr_tofree, obj_nr_tofree + 1);
-			}
+			cnt = hlist_cut_number(&freelist, &obj_pool, ODEBUG_BATCH_SIZE, &last);
+			hlist_splice_init(&freelist, last, &obj_to_free);
+			WRITE_ONCE(obj_pool_free, obj_pool_free - cnt);
+			WRITE_ONCE(obj_nr_tofree, obj_nr_tofree + cnt);
 		}
 	} else {
-		WRITE_ONCE(obj_pool_free, obj_pool_free + 1);
-		hlist_add_head(&obj->node, &obj_pool);
-		if (lookahead_count) {
-			WRITE_ONCE(obj_pool_free, obj_pool_free + lookahead_count);
-			obj_pool_used -= lookahead_count;
-			while (lookahead_count) {
-				hlist_add_head(&objs[--lookahead_count]->node,
-					       &obj_pool);
-			}
-		}
+		WRITE_ONCE(obj_pool_free, obj_pool_free + cnt);
+		hlist_splice_init(&freelist, last, &obj_pool);
 	}
 	raw_spin_unlock(&pool_lock);
 	local_irq_restore(flags);
Currently, there are multiple places where several nodes are extracted
from one list and added to another list. Extracting and splicing the
nodes one by one is not only inefficient, it also hurts readability.
The same work can be done with hlist_cut_number() and
hlist_splice_init(), which move the entire sublist at once.

When the number of nodes expected to be moved is less than or equal to
0, or the source list is empty, hlist_cut_number() safely returns 0.
The splice is performed only when the return value of hlist_cut_number()
is greater than 0.

For the two calls to hlist_cut_number() in __free_object(), the result
is obviously positive, so the check of the return value is omitted.

Signed-off-by: Zhen Lei <thunder.leizhen@huawei.com>
---
 lib/debugobjects.c | 115 +++++++++++++++++++--------------------------
 1 file changed, 48 insertions(+), 67 deletions(-)
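Schematically, the conversion described above looks as follows; the excerpt mirrors the fill_pool() hunk of the diff and is not a standalone program:

```c
/* Before: one node at a time, a pointer chase per object. */
while (obj_nr_tofree && (obj_pool_free < debug_objects_pool_min_level)) {
	obj = hlist_entry(obj_to_free.first, typeof(*obj), node);
	hlist_del(&obj->node);
	WRITE_ONCE(obj_nr_tofree, obj_nr_tofree - 1);
	hlist_add_head(&obj->node, &obj_pool);
	WRITE_ONCE(obj_pool_free, obj_pool_free + 1);
}

/*
 * After: cut the sublist once, splice it once. hlist_cut_number() returns
 * 0 for a non-positive count or an empty source list, so the splice and
 * the counter updates are guarded by the return value.
 */
cnt = min(obj_nr_tofree, debug_objects_pool_min_level - obj_pool_free);
cnt = hlist_cut_number(&freelist, &obj_to_free, cnt, &last);
if (cnt > 0) {
	hlist_splice_init(&freelist, last, &obj_pool);
	WRITE_ONCE(obj_pool_free, obj_pool_free + cnt);
	WRITE_ONCE(obj_nr_tofree, obj_nr_tofree - cnt);
}
```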