Message ID | 1447435656-26152-2-git-send-email-gary.robertson@linaro.org |
---|---|
State | New |
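For readers skimming the archive, here is a minimal sketch of how an application might drive the isolation helpers introduced by the patch quoted below. It is based only on the prototypes declared in the posted linux_isolation.h; the worker function, CPU count, error handling and the use of odph_linux_pthread_join() are illustrative assumptions, and ODP global/local initialization is elided.

```c
#include <odp.h>
#include <odp/helper/linux.h>
#include <odp/helper/linux_isolation.h>

/* Hypothetical worker body - stands in for the application's packet loop */
static void *worker_fn(void *arg)
{
	(void)arg;
	/* ... packet processing ... */
	return NULL;
}

int main(void)
{
	odph_linux_pthread_t thr_tbl[8];
	odp_cpumask_t mask;
	int num;

	/* ... odp_init_global()/odp_init_local() elided ... */

	/* Mount /dev/cpuset and build the control/data-plane cpusets
	 * (requires root privileges, per the patch description). */
	if (odph_isolation_init_global())
		return -1;

	/* Ask for up to 4 isolated worker CPUs (count is an assumption) */
	odph_cpumask_default_worker(&mask, 4);

	/* Pin one worker thread per isolated CPU; returns threads created */
	num = odph_linux_isolated_pthread_create(thr_tbl, &mask,
						 worker_fn, NULL);

	odph_linux_pthread_join(thr_tbl, num);

	/* Tear the isolation cpusets back down */
	odph_isolation_term_global();
	return 0;
}
```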
Oops - clicked the wrong reply option. Nicolas raises an excellent point. I think at least a configuration option may be needed to enable or disable isolation. There are also impacts on how CPUs should be allocated on isolated versus non-isolated platforms, and some function calls which need to be substituted depending on absence or presence of isolation support. However this patch was intended as a preliminary introduction of isolation support at the earliest possible time frame - with the expectations that further refinements would be needed. As such thanks to Nicolas for his input - and I would encourage others to alert me to additional shortcomings or problems I may have overlooked. On Fri, Nov 13, 2015 at 12:10 PM, Nicolas Morey-Chaisemartin < nmorey@kalray.eu> wrote: > Le 13/11/15 18:27 , Gary S. Robertson a écrit : > >> This patch adds ODP helper code for setting up cpuset-based >> isolated execution environments on a Linux platform. >> >> By executing applications on dedicated CPU cores with minimal scheduler >> contention or 'interference', latency determinism can be significantly >> enhanced, and performance can be improved and made more deterministic as >> well. >> Performance gains are dependent on the degree of CPU loading and scheduler >> contention which would otherwise occur in a 'normal' non-isolated >> environment. >> >> This isolation API requires an underlying Linux kernel with cpuset >> support, >> and will return an error if such support is missing. >> If the underlying kernel also includes LNG-originated 'NO_HZ_FULL' >> support, >> this support will be used to the extent that it is available. >> (NOTE that isolation setup requires root privileges during execution.) >> >> This patch also modifies the pktio performance test as an example of >> how the new isolation helpers might be employed by an application, >> and as a convenient means of quantifying the improved performance >> made possible by executing in an isolated environment. >> >> It is anticipated that this API will evolve as use cases are defined >> and further features or refinements are requested - hence this is only >> the 'initial' API submission. >> >> Signed-off-by: Gary S. 
Robertson <gary.robertson@linaro.org> >> --- >> helper/Makefile.am | 2 + >> helper/include/odp/helper/linux_isolation.h | 98 + >> helper/linux_isolation.c | 2901 >> +++++++++++++++++++++++++++ >> test/performance/odp_pktio_perf.c | 21 +- >> 4 files changed, 3016 insertions(+), 6 deletions(-) >> create mode 100644 helper/include/odp/helper/linux_isolation.h >> create mode 100644 helper/linux_isolation.c >> >> diff --git a/helper/Makefile.am b/helper/Makefile.am >> index e72507e..f9c3558 100644 >> --- a/helper/Makefile.am >> +++ b/helper/Makefile.am >> @@ -11,6 +11,7 @@ helperincludedir = $(includedir)/odp/helper/ >> helperinclude_HEADERS = \ >> $(srcdir)/include/odp/helper/ring.h \ >> $(srcdir)/include/odp/helper/linux.h \ >> + $(srcdir)/include/odp/helper/linux_isolation.h \ >> $(srcdir)/include/odp/helper/chksum.h\ >> $(srcdir)/include/odp/helper/eth.h\ >> $(srcdir)/include/odp/helper/icmp.h\ >> @@ -29,6 +30,7 @@ noinst_HEADERS = \ >> __LIB__libodphelper_la_SOURCES = \ >> linux.c \ >> + linux_isolation.c \ >> ring.c \ >> hashtable.c \ >> lineartable.c >> diff --git a/helper/include/odp/helper/linux_isolation.h >> b/helper/include/odp/helper/linux_isolation.h >> new file mode 100644 >> index 0000000..2fc8266 >> --- /dev/null >> +++ b/helper/include/odp/helper/linux_isolation.h >> @@ -0,0 +1,98 @@ >> +/* Copyright (c) 2013, Linaro Limited >> + * All rights reserved. >> + * >> + * SPDX-License-Identifier: BSD-3-Clause >> + */ >> + >> + >> +/** >> + * @file >> + * >> + * ODP Linux isolation helper API >> + * >> + * This file is an optional helper to odp.h APIs. These functions are >> provided >> + * to ease common setups for isolation using cpusets in a Linux system. >> + * User is free to implement the same setups in other ways (not via this >> API). >> + */ >> + >> +#ifndef ODP_LINUX_ISOLATION_H_ >> +#define ODP_LINUX_ISOLATION_H_ >> + >> +#ifdef __cplusplus >> +extern "C" { >> +#endif >> + >> +#include <odp.h> >> + >> +/* >> + * Verify the level of underlying operating system support. >> + * (Return with error if the OS does not at least support cpusets) >> + * Set up system-wide CPU masks and cpusets >> + * (Future) Set up file-based persistent cpuset management layer >> + * to allow cooperative use of system isolation resources >> + * by multiple independent ODP instances. >> + */ >> +int odph_isolation_init_global( void ); >> + >> +/* >> + * Migrate all tasks from cpusets created for isolation support to the >> + * generic boot-level single cpuset. >> + * Remove all isolated CPU environments and cpusets >> + * Zero out system-wide CPU masks >> + * (Future) Reset persistent file-based cpuset management layer >> + * to show no system isolation resources are available. >> + */ >> +int odph_isolation_term_global( void ); >> + >> +/** >> + * Creates and launches pthreads >> + * >> + * Creates, pins and launches threads to separate CPU's based on the >> cpumask. 
>> + * >> + * @param thread_tbl Thread table >> + * @param mask CPU mask >> + * @param start_routine Thread start function >> + * @param arg Thread argument >> + * >> + * @return Number of threads created >> + */ >> +int odph_linux_isolated_pthread_create(odph_linux_pthread_t *thread_tbl, >> + const odp_cpumask_t *mask, >> + void *(*start_routine) (void *), >> + void *arg); >> + >> +/** >> + * Fork a process >> + * >> + * Forks and sets CPU affinity for the child process >> + * >> + * @param proc Pointer to process state info (for output) >> + * @param cpu Destination CPU for the child process >> + * >> + * @return On success: 1 for the parent, 0 for the child >> + * On failure: -1 for the parent, -2 for the child >> + */ >> +int odph_linux_isolated_process_fork(odph_linux_process_t *proc, int >> cpu); >> + >> +/** >> + * Fork a number of processes >> + * >> + * Forks and sets CPU affinity for child processes >> + * >> + * @param proc_tbl Process state info table (for output) >> + * @param mask CPU mask of processes to create >> + * >> + * @return On success: 1 for the parent, 0 for the child >> + * On failure: -1 for the parent, -2 for the child >> + */ >> +int odph_linux_isolated_process_fork_n(odph_linux_process_t *proc_tbl, >> + const odp_cpumask_t *mask); >> + >> +int odph_cpumask_default_worker(odp_cpumask_t *mask, int num); >> +int odph_cpumask_default_control(odp_cpumask_t *mask, int num >> ODP_UNUSED); >> + >> +#ifdef __cplusplus >> +} >> +#endif >> + >> +#endif >> diff --git a/helper/linux_isolation.c b/helper/linux_isolation.c >> new file mode 100644 >> index 0000000..5ca6c7f >> --- /dev/null >> +++ b/helper/linux_isolation.c >> @@ -0,0 +1,2901 @@ >> +/* >> + * This file contains declarations and definitions of functions and >> + * data structures which are useful for manipulating cpusets in support >> + * of OpenDataPlane (ODP) high-performance applications. >> + * >> + * Copyright (c) 2015, Linaro Limited >> + * All rights reserved. >> + * SPDX-License-Identifier: BSD-3-Clause >> + */ >> + >> +#ifndef _GNU_SOURCE >> +#define _GNU_SOURCE >> +#endif >> + >> +#include <ctype.h> >> +#include <dirent.h> >> +#include <errno.h> >> +#include <fcntl.h> >> +#include <fts.h> >> +#include <pthread.h> >> +#include <sched.h> >> +#include <semaphore.h> >> +#include <signal.h> >> +#include <stdarg.h> >> +#include <stdio.h> >> +#include <stdlib.h> >> +#include <string.h> >> +#include <time.h> >> +#include <unistd.h> >> +#include <sys/mman.h> >> +#include <sys/mount.h> >> +#include <sys/resource.h> >> +#include <sys/stat.h> >> +#include <sys/syscall.h> >> +#include <sys/time.h> >> +#include <sys/types.h> >> +#include <sys/wait.h> >> + >> +#include <odp/init.h> >> +#include <odp_internal.h> >> +#include <odp/cpumask.h> >> +#include <odp/debug.h> >> +#include <odp_debug_internal.h> >> +#include <odp/helper/linux.h> >> +#include "odph_debug.h" >> + >> +typedef unsigned long long uint_64_t; >> +typedef unsigned int uint32_t; >> +typedef unsigned short uint16_t; >> +typedef unsigned char uint8_t; >> + >> >> +/****************************************************************************** >> + * The following constants are important for determining isolation >> capacities >> + * MAX_CPUS_SUPPORTED is used to dimension arrays and some loops in the >> + * isolation helper code. >> + * The HOUSEKEEPING_RATIO_* constants define the ratio of housekeeping >> CPUs >> + * (i.e. 'control plane' CPUs) - see MULTIPLIER >> + * versus isolated CPUs (i.e. 
'data plane >> CPUs) - >> + * see DIVISOR >> + * The calculation is: >> + * NUMBER OF HOUSEKEEPING CPUs = >> + * (NUMBER OF CPUs * HOUSEKEEPING_RATIO_MULTIPLIER) >> + * divided by HOUSEKEEPING_RATIO_DIVISOR. >> + * If NUMBER OF HOUSEKEEPING CPUs < 1, NUMBER OF HOUSEKEEPING CPUs ++ >> + * NUMBER OF ISOLATED CPUs = >> + * NUMBER OF CPUs - NUMBER OF HOUSEKEEPING CPUs >> + >> ******************************************************************************/ >> +#define MAX_CPUS_SUPPORTED 64 >> +#define HOUSEKEEPING_RATIO_MULTIPLIER 1 >> +#define HOUSEKEEPING_RATIO_DIVISOR 4 >> + >> >> +/****************************************************************************** >> + * >> + * Concatenate a string into a destination buffer >> + * containing an existing string such that the length of the resulting >> string >> + * (including the terminating NUL) does not exceed the buffer size >> + * >> + >> ******************************************************************************/ >> +static inline char *__strcat_bounded( char *dst_strg, const char >> *src_strg, >> + size_t dstlen ) { >> + *(dst_strg + (dstlen - 1)) = '\0'; >> + return( strncat( dst_strg, src_strg, >> + ((dstlen - 1) - strlen( dst_strg )) ) ); >> +} >> + >> +#define strcat_bounded( dest, src ) \ >> + __strcat_bounded( dest, src, (sizeof( dest )) ) >> + >> >> +/****************************************************************************** >> + * >> + * Copy a string into a destination buffer and NUL-terminate it >> + * such that the length of the resulting string >> + * (including the terminating NUL) does not exceed the buffer size >> + * >> + >> ******************************************************************************/ >> +static inline char *__strcpy_bounded( char *dst_strg, const char >> *src_strg, >> + size_t dstlen ) { >> + *(dst_strg + (dstlen - 1)) = '\0'; >> + return( strncpy( dst_strg, src_strg, (dstlen - 1) ) ); >> +} >> + >> +#define strcpy_bounded( dest, src ) \ >> + __strcpy_bounded( dest, src, (sizeof( dest )) ) >> + >> +#define MAX_ERR_MSG_SIZE 256 >> +#define ERR_STRING_SIZE 80 >> + >> +#define NSEC_PER_SEC 1000000000L >> + >> +static void sleep_nsec( long nsec ) >> +{ >> + struct timespec delay, remaining; >> + >> + if ( nsec >= NSEC_PER_SEC ) { >> + delay.tv_sec = nsec / NSEC_PER_SEC; >> + delay.tv_nsec = nsec % NSEC_PER_SEC; >> + } else { >> + delay.tv_sec = 0; >> + delay.tv_nsec = nsec; >> + } >> + for ( errno = EINTR; errno == EINTR; ) { >> + errno = 0; >> + if ( (clock_nanosleep( CLOCK_MONOTONIC, 0, &delay, &remaining )) >> && >> + (errno == EINVAL) ) { >> + errno = 0; >> + clock_nanosleep( CLOCK_REALTIME, 0, &delay, &remaining ); >> + } >> + delay.tv_sec = remaining.tv_sec; >> + delay.tv_nsec = remaining.tv_nsec; >> + } >> +} >> + >> +static sem_t strerror_lock; >> +static char error_buf[ERR_STRING_SIZE]; >> + >> +#define TM_STAMP_SIZE 30 >> +#define MAX_EVENT_STRING_SIZE ((size_t)(MAX_ERR_MSG_SIZE - TM_STAMP_SIZE >> - 4)) >> +#define TM_STAMP_MSEC 19 >> +#define TM_STAMP_MSEC_END 23 >> +#define TM_STAMP_CTIME_END 25 >> +#define TM_STAMP_DATE_END 11 >> +#define TM_STAMP_YEAR (TM_STAMP_MSEC_END + 1) >> +#define TM_STAMP_PREFIX_END 31 >> + >> +static sem_t logmsg_lock; >> +static char stderr_log_msg[MAX_ERR_MSG_SIZE]; >> + >> >> +/****************************************************************************** >> + * >> + * Return the task ID of the calling thread or process >> + * (this is a system-wide thread ID used for scheduling all tasks, >> + * whether single-threaded processes or individual threads within >> 
+ * multithreaded processes) >> + * This is the identifier used for migrating tasks between cpusets >> + * >> + >> ******************************************************************************/ >> +static pid_t gettaskid( void ) >> +{ >> + return( (pid_t)(syscall(SYS_gettid)) ); >> +} >> + >> +/* */ >> >> +/****************************************************************************** >> + * >> + * Convert an error number to an error type string in errstring >> + * >> + >> ******************************************************************************/ >> +static char *errstring( int error_no ) >> +{ >> + >> + pthread_cleanup_push( (void(*)(void *))sem_post, >> + (void *)&strerror_lock ); >> + sem_wait( &strerror_lock ); >> + >> + error_buf[ ERR_STRING_SIZE - 1 ] = '\0'; >> + strncpy( error_buf, strerror( error_no ), (ERR_STRING_SIZE - 2) ); >> + >> + pthread_cleanup_pop( 1 ); >> + >> + return( error_buf ); >> +} >> + >> +/* */ >> >> +/****************************************************************************** >> + * >> + * Print a time stamp and event message out to stderr >> + * >> + >> ******************************************************************************/ >> +static void stderr_log( const char *fmt_str, ... ) >> +{ >> + va_list args; >> + struct timeval time_now; >> + struct tm time_fields; >> + int i, j; >> + char *event_msg_start; >> + >> + pthread_cleanup_push( (void(*)(void *))sem_post, >> + (void *)&logmsg_lock ); >> + sem_wait( &logmsg_lock ); >> + >> + /* >> + * Snapshot the current time down to the resolution of the CPU >> clock. >> + */ >> + gettimeofday( &time_now, (struct timezone *)NULL ); >> + >> + /* >> + * Convert time to a calender time string with 1 sec resolution >> + */ >> + localtime_r( (time_t *)(&(time_now.tv_sec)), &time_fields ); >> + asctime_r( &time_fields, stderr_log_msg ); >> + >> + /* >> + * Shift the year and newline down to make room for a msec string >> field >> + */ >> + for ( i = TM_STAMP_CTIME_END, j = (TM_STAMP_SIZE - 1); >> + i >= TM_STAMP_MSEC; i--, j-- ) >> + stderr_log_msg[j] = stderr_log_msg[i]; >> + >> + /* >> + * Insert the millisecond time stamp field into the string between >> the >> + * seconds and the year as :000 thru :999. Then overwrite the >> premature >> + * NUL with a space to 're-attach' the year and newline >> + */ >> + snprintf( &(stderr_log_msg[TM_STAMP_MSEC]), 5, ":%.3ld", >> + (time_now.tv_usec / 1000) ); >> + stderr_log_msg[TM_STAMP_MSEC_END] = ' '; >> + >> + /* >> + * NUL out the newline at the end of the timestamp so we can >> + * prefix the log message with the timestamp. >> + */ >> + stderr_log_msg[TM_STAMP_SIZE - 2] = '\0'; >> + strcat_bounded( stderr_log_msg, " - " ); >> + event_msg_start = &(stderr_log_msg[strlen( stderr_log_msg )]); >> + >> + /* >> + * Format the caller's event message into a constant string >> + */ >> + va_start( args, fmt_str ); >> + vsnprintf( event_msg_start, MAX_EVENT_STRING_SIZE, fmt_str, args ); >> + stderr_log_msg[MAX_EVENT_STRING_SIZE - 1] = '\0'; >> + va_end( args ); >> + strcat_bounded( stderr_log_msg, "\n" ); >> + >> + /* >> + * Then print the time stamp and event message out to stderr >> + */ >> + fputs( stderr_log_msg, stderr ); >> + >> + pthread_cleanup_pop( 1 ); >> +} >> + >> +/* */ >> >> +/****************************************************************************** >> + * >> + * Initialize the semaphores used for serializing error message >> handling. 
>> + * >> + >> ******************************************************************************/ >> +static void init_errmsg_locks( void ) >> +{ >> + sem_init( &strerror_lock, 0, 1 ); >> + sem_init( &logmsg_lock, 0, 1 ); >> +} >> + >> +#define DIRMODE ((mode_t)(S_IRUSR | S_IWUSR | S_IXUSR | \ >> + S_IRGRP | S_IXGRP | \ >> + S_IROTH | S_IXOTH)) >> +#define FILEMODE ((mode_t)(S_IRUSR | S_IWUSR | S_IRGRP | S_IROTH)) >> + >> +/* */ >> >> +/****************************************************************************** >> + * >> + * Data structures associated with CPUSETS >> + * >> + >> ******************************************************************************/ >> + >> +static int numcpus; >> +static size_t cpusetsize; >> +static int cpusets_supported; >> +static int cpuset_prefix_required; >> + >> +/* >> + * Shared path construction buffers and position markers >> + * Used to construct absolute paths to directories and files in the >> + * cpuset file hierarchy. Shared in order to reduce stack usage, >> + * (especially with nested function calls) - and kept thread-safe >> + * by the locks below. >> + */ >> +static char pathname_buf[128]; /* Use only while holding >> pathname_lock! */ >> +static char fieldname_buf[128]; /* Use only while holding >> fieldname_lock! */ >> +static char cpulist[128]; /* Use only while holding >> fieldname_lock! */ >> +static char cpuname[3]; /* Use only while holding >> fieldname_lock! */ >> +static int end_cpuset_base_path; /* Use only while holding >> fieldname_lock! */ >> +static int end_field_base_path; /* Use only while holding >> fieldname_lock! */ >> + >> +/* >> + * Locks for thread-safe use of the shared path construction buffers. >> + * Locking order - if pathname_lock is needed it must always be taken >> + * before taking fieldname_lock and released after >> + * releasing fieldname_lock. >> + * if from_path_lock is needed it must always be taken >> + * before taking to_path_lock and released after >> + * releasing to_path_lock. >> + * fieldname lock is the primary exclusion mechanism and by implication >> + * allows thread-safe access to the cpuset directory tree >> + * by all tasks using this suite of helper functions. >> + */ >> +static sem_t pathname_lock; >> +static sem_t fieldname_lock; >> +static sem_t from_path_lock; >> +static sem_t to_path_lock; >> + >> +/* */ >> >> +/****************************************************************************** >> + * >> + * Functions associated with cpusets >> + * >> + >> ******************************************************************************/ >> +/* */ >> >> +/****************************************************************************** >> + * >> + * Called from applications when switching to a new cpuset. >> + * >> + * Takes a string specifying the name of the desired cpuset >> + * relative to the mount point '/dev/cpuset/' >> + * - eg. 'cplane' or 'dplane'...etc. >> + * >> + * Obtains the required lock on the shared fieldname path buffer. >> + * Sets the (shared) current cpuset path string, >> + * and creates the cpuset management tree base directory >> + * if it is not already present. >> + * Initializes the fieldname path string to the new base directory path. >> + * Returns while holding the lock on the shared fieldname path buffer. >> + * NOTE - a NULL cpuset_name defaults to the top-level 'master' cpuset. 
>> + * >> + >> ******************************************************************************/ >> +static char *newpathname( const char *cpuset_name ) >> +{ >> + /* >> + * Lock exclusive access to the fieldname path buffer >> + * and by inference, to the current cpuset management directory tree >> + */ >> + sem_wait( &fieldname_lock ); >> + >> + /* Create a string containing the full path to the caller's >> directory */ >> + strcpy_bounded( fieldname_buf, "/dev/cpuset/" ); >> + >> + if ( cpuset_name != (char *)NULL ) { >> + strcat_bounded( fieldname_buf, cpuset_name ); >> + >> + /* >> + * Create the new cpuset tree under /dev/cpuset >> + * fieldname_buf = "/dev/cpuset/<path>" >> + */ >> + mkdir( fieldname_buf, DIRMODE ); >> + >> + strcat_bounded( fieldname_buf, "/" ); >> + } >> + >> + /* >> + * If a cpuset_name was specified, then >> + * fieldname_buf = "/dev/cpuset/<path>/" --else-- >> + * fieldname_buf = "/dev/cpuset/" >> + * Mark the end of the path base string for this cpuset >> + */ >> + end_cpuset_base_path = strnlen( fieldname_buf, (sizeof( >> fieldname_buf ) - 1) ); >> + >> + if ( cpuset_prefix_required ) >> + strcat_bounded( fieldname_buf, "cpuset." ); >> + >> + /* Mark the end of the field path string for this cpuset */ >> + end_field_base_path = strnlen( fieldname_buf, >> + (sizeof( fieldname_buf ) - 1) ); >> + >> + /* Return to the caller with the fieldname_buf lock held */ >> + return( fieldname_buf ); >> +} >> + >> +/* */ >> >> +/****************************************************************************** >> + * >> + * Called from applications to create a complete cpuset fieldname path. >> + * >> + * Requires and assumes that the caller currently holds the lock >> + * for exclusive use of the shared fieldname path buffer. >> + * >> + * Resets the (shared) current fieldname path to its initial contents, >> + * effectively truncating the name of any previous field path. >> + * Then concatenates the cpuset-relative field name string specified >> + * by the caller onto the path base, creating the full field path name. >> + * >> + >> ******************************************************************************/ >> +static char *newfieldname( const char *field ) >> +{ >> + fieldname_buf[end_field_base_path] = '\0'; >> + strcat_bounded( fieldname_buf, field ); >> + return( fieldname_buf ); >> +} >> + >> + >> +/* */ >> >> +/****************************************************************************** >> + * >> + * Called from applications to release the lock on the shared >> + * fieldname path buffer. This enables serialized access to the >> + * cpuset management structure within a multi-threaded process. >> + * The application releases this lock after it finishes processing >> + * all fields of the current cpuset, guaranteeing that other threads >> + * using this utility will not interfere with that cpuset. >> + * >> + >> ******************************************************************************/ >> +static void releasefieldname( void ) >> +{ >> + sem_post( &fieldname_lock ); >> +} >> + >> +/* */ >> >> +/****************************************************************************** >> + * >> + * Verify and initialize basic CPUSET support >> + * >> + >> ******************************************************************************/ >> +static int init_cpusets( void ) >> +{ >> + int mounted = 0; >> + int retcode = -1; >> + int fileno; >> + >> + /* >> + * Initialize the locks used to serialize access to the error message >> + * logging functions and buffers. 
This needs to be done prior to >> most >> + * of the other cpuset setup functions... so take care of it here. >> + */ >> + init_errmsg_locks(); >> + >> + /* Init locks for thread-safe access to static path-building strings >> */ >> + sem_init( &pathname_lock, 0, 1 ); >> + sem_init( &fieldname_lock, 0, 1 ); >> + sem_init( &from_path_lock, 0, 1 ); >> + sem_init( &to_path_lock, 0, 1 ); >> + >> + cpuset_prefix_required = 0; >> + cpusets_supported = 0; >> + >> +try2mount: >> + /* Try to mount the cpuset pseudo-filesystem at /dev/cpuset */ >> + mkdir( "/dev/cpuset", DIRMODE ); >> + if ( mount( "none", "/dev/cpuset", "cpuset", >> + (MS_NODEV | MS_NOEXEC | MS_NOSUID), (void *)NULL ) ) { >> + switch ( errno ) { >> + case EBUSY : >> + mounted = 1; >> + break; >> + case ENODEV : >> + ODPH_ERR( "cpusets not supported - aborting!\n" ); >> + break; >> + case EPERM : >> + ODPH_ERR( "Insufficient privileges for cpusets - >> aborting!\n" ); >> + break; >> + default : >> + break; >> + } >> + } >> + if ( mounted > 0) { >> + cpusets_supported = 1; >> + retcode = 0; >> + fileno = open( "/dev/cpuset/cpuset.cpus", O_RDONLY ); >> + if ( fileno > 0 ) { >> + cpuset_prefix_required = 1; >> + close( fileno ); >> + } >> + } else { >> + /* >> + * Try up to two more times to get the cpusets filesystem mounted >> + * before giving up >> + */ >> + if ( --mounted > -3 ) { >> + /* Delay 50 msec to allow the mount to settle and try again >> */ >> + sleep_nsec( 50000000 ); >> + goto try2mount; >> + } >> + } >> + >> + /* Support available CPU cores up to MAX_CPUS_SUPPORTED cores */ >> + numcpus = (int)sysconf( _SC_NPROCESSORS_ONLN ); >> + >> + if( numcpus > MAX_CPUS_SUPPORTED ) { >> + fprintf( stderr, >> + "\rNOTE: MAX_CPUS_SUPPORTED defined as: %d,\n", >> MAX_CPUS_SUPPORTED ); >> + fprintf( stderr, >> + "\r but number of CPU cores detected is: %d\n", numcpus ); >> + fprintf( stderr, >> + "\r Change MAX_CPUS_SUPPORTED in isolation_config.h and >> rebuild\n" >> + ); >> + fprintf(stderr, >> + "\r to support use of all CPU cores on this platform\n" ); >> + } >> + numcpus = (numcpus > MAX_CPUS_SUPPORTED) ? MAX_CPUS_SUPPORTED : >> numcpus; >> + >> + /* Save the required cpuset mask size for global reference */ >> + cpusetsize = CPU_ALLOC_SIZE( numcpus ); >> + >> + return( retcode ); >> +} >> + >> +/* */ >> >> +/****************************************************************************** >> + * >> + * Enable or disable full dynticks operation on the specified cpuset >> + * >> + >> ******************************************************************************/ >> +static void request_dynticks( const char *path, int on_off ) >> +{ >> + int retval, fileno; >> + >> + if ( on_off ) >> + ODPH_DBG( "Requesting dynticks on cpuset %s\n", path ); >> + else >> + ODPH_DBG( "Dynticks not needed on cpuset %s\n", path ); >> + >> + /* >> + * Set the fieldname path string to the base of the path >> + * to the caller's specified cpuset. 
>> + */ >> + pthread_cleanup_push( (void(*)(void *))sem_post, (void >> *)&fieldname_lock ); >> + newpathname( path ); >> + >> + /* >> + * Create an absolute path string to the "fulldynticks" field >> + * for the cpuset >> + */ >> + newfieldname( "fulldynticks" ); >> + >> + /* >> + * Specify whether the cores in this cpuset should offload kernel >> + * housekeeping tasks to other cores or else accept those tasks >> + */ >> + fileno = open( fieldname_buf, (O_RDWR | O_CREAT | O_TRUNC), FILEMODE >> ); >> + if ( fileno > 0 ) { >> + if ( on_off ) >> + retval = write( fileno, "1", 1 ); >> + else >> + retval = write( fileno, "0", 1 ); >> + close( fileno ); >> + } >> + >> + /* >> + * Create an absolute path string to the "quiesce" field >> + * for the cpuset >> + */ >> + newfieldname( "quiesce" ); >> + >> + /* >> + * Migrate timers / hrtimers away from the CPUs in this cpuset -or- >> + * allow timers / hrtimers for this CPU and system-wide use. >> + */ >> + fileno = open( fieldname_buf, (O_RDWR | O_CREAT | O_TRUNC), FILEMODE >> ); >> + if ( fileno > 0 ) { >> + if ( on_off ) >> + retval = write( fileno, "1", 1 ); >> + else >> + retval = write( fileno, "0", 1 ); >> + close( fileno ); >> + } >> + >> + /* Release the lock on the fieldname_buf and the cpuset */ >> + releasefieldname(); >> + pthread_cleanup_pop( 0 ); >> + >> + /* Make the C compiler happy... do something with retval */ >> + if ( retval ) retval = 0; >> +} >> + >> +/* */ >> >> +/****************************************************************************** >> + * >> + * Enable or disable isolation on the specified cpuset >> + * (A NULL path defaults to the top-level master cpuset.) >> + * >> + >> ******************************************************************************/ >> +static void set_cpuset_isolation( const char *path, int on_off ) >> +{ >> + int retval, fileno; >> + >> + if ( on_off ) >> + ODPH_DBG( "Disabling load balancing on cpuset %s\n", path ); >> + else >> + ODPH_DBG( "Enabling load balancing on cpuset %s\n", path ); >> + >> + /* >> + * Set the fieldname path string to the base of the path >> + * to the caller's specified cpuset. >> + */ >> + pthread_cleanup_push( (void(*)(void *))sem_post, (void >> *)&fieldname_lock ); >> + newpathname( path ); >> + >> + /* >> + * Create an absolute path string to the "sched_load_balance" field >> + * for the cpuset >> + */ >> + newfieldname( "sched_load_balance" ); >> + >> + /* >> + * Enable or disable load balancing >> + */ >> + fileno = open( fieldname_buf, (O_RDWR | O_CREAT | O_TRUNC), FILEMODE >> ); >> + if ( fileno > 0 ) { >> + if ( on_off ) >> + retval = write( fileno, "0", 1 ); >> + else >> + retval = write( fileno, "1", 1 ); >> + close( fileno ); >> + } >> + >> + /* >> + * Create an absolute path string to the >> + * "sched_relax_domain_level" field for the cpuset >> + */ >> + newfieldname( "sched_relax_domain_level" ); >> + >> + /* >> + * Enable or disable event-based load balancing >> + */ >> + fileno = open( fieldname_buf, (O_RDWR | O_CREAT | O_TRUNC), FILEMODE >> ); >> + if ( fileno > 0 ) { >> + if ( on_off ) >> + retval = write( fileno, "0", 1 ); >> + else >> + retval = write( fileno, "-1", 2 ); >> + close( fileno ); >> + } >> + >> + /* Release the lock on the fieldname_buf and the cpuset */ >> + releasefieldname(); >> + pthread_cleanup_pop( 0 ); >> + >> + /* Make the C compiler happy... 
do something with retval */ >> + if ( retval ) retval = 0; >> +} >> + >> +/* */ >> >> +/****************************************************************************** >> + * >> + * Create a new management tree for the specified cpuset >> + * >> + >> ******************************************************************************/ >> +static void create_cpuset( const char *path, cpu_set_t *mask, int >> isolated ) >> +{ >> + int retval, i, fileno, endlist; >> + >> + /* >> + * Set the fieldname path string to the base of the path >> + * to the caller's specified cpuset. >> + */ >> + pthread_cleanup_push( (void(*)(void *))sem_post, (void >> *)&fieldname_lock ); >> + newpathname( path ); >> + >> + /* Create an absolute path string to the "mems" field for the cpuset >> */ >> + newfieldname( "mems" ); >> + >> + /* Init the "mems" field so all cpusets share the same memory map */ >> + fileno = open( fieldname_buf, (O_RDWR | O_CREAT | O_TRUNC), FILEMODE >> ); >> + if ( fileno > 0 ) { >> + retval = write( fileno, "0", 1 ); >> + close( fileno ); >> + } >> + >> + cpulist[0] = '\0'; >> + for ( i = 0, endlist = 0; i < numcpus; i++ ) { >> + if ( CPU_ISSET( i, mask) ) { >> + /* >> + * Create a comma-separated list of CPU cores in this cpuset >> + * based on the cpuset mask passed in by the caller. >> + */ >> + snprintf( cpuname, sizeof( cpuname ), "%d", i ); >> + strcat_bounded( cpulist, cpuname ); >> + /* Mark the location of the trailing comma */ >> + endlist = strnlen( cpulist, (sizeof( cpulist ) - 1) ); >> + strcat_bounded( cpulist, "," ); >> + } >> + } >> + /* Remove the last superfluous trailing comma from the string */ >> + cpulist[endlist] = '\0'; >> + >> + /* Create an absolute path string to the "cpus" field for the cpuset >> */ >> + newfieldname( "cpus" ); >> + >> + /* >> + * Now populate the overall CPU list for the current cpuset >> + */ >> + fileno = open( fieldname_buf, (O_RDWR | O_CREAT | O_TRUNC), FILEMODE >> ); >> + if ( fileno > 0 ) { >> + retval = write( fileno, cpulist, strlen( cpulist ) ); >> + close( fileno ); >> + } >> + >> + /* Release the lock on the fieldname_buf and the cpuset */ >> + releasefieldname(); >> + pthread_cleanup_pop( 0 ); >> + >> + /* If the cpuset is to be isolated, turn off load balancing */ >> + set_cpuset_isolation( path, isolated ); >> + >> + /* Make the C compiler happy... do something with retval */ >> + if ( retval ) retval = 0; >> +} >> + >> +/* */ >> >> +/****************************************************************************** >> + * >> + * Delete the directory and file hierarchy associated with the cpuset >> + * specified by the contents of fieldname_buf >> + * >> + * Requires that the caller already holds fieldname_lock -and- >> + * assumes all tasks, etc. have been previously migrated away from the >> + * specified cpuset. >> + * >> + >> ******************************************************************************/ >> +static int cpuset_delete( void ) >> +{ >> + int retcode = -1; >> + int i, core_fileno; >> + >> + ODPH_DBG( "Deleting cpuset %s\n", fieldname_buf ); >> + >> + /* >> + * Create an absolute path string to the "cpus" field for the cpuset >> + */ >> + strcat_bounded( fieldname_buf, "/" ); >> + if ( cpuset_prefix_required ) >> + strcat_bounded( fieldname_buf, "cpuset." 
); >> + >> + strcat_bounded( fieldname_buf, "cpus" ); >> + >> + /* >> + * fieldname_buf == "/dev/cpuset/<path>/cpu<n>/cpuset.cpus" >> + * -or- "/dev/cpuset/<path>/cpu<n>/cpus" >> + * -or- "/dev/cpuset/<path>/cpuset.cpus" >> + * -or- "/dev/cpuset/<path>/cpus" >> + * De-populate the CPU list to contain no cores >> + */ >> + core_fileno = open( fieldname_buf, (O_RDWR | O_TRUNC) ); >> + if ( core_fileno > 0 ) { >> + /* >> + * Try for up to 2 seconds to depopulate the CPU cores. >> + * This allows time for any task migrations to stabilize. >> + */ >> + for ( i = 0; i < 100; i++ ) { >> + errno = 0; >> + retcode = write( core_fileno, "", 1 ); >> + if ( !((retcode < 0) && >> + ((errno == EINTR) || (errno == EBUSY))) ) >> + break; >> + >> + /* Sleep 20 msec to allow depopulation to take effect */ >> + sleep_nsec( 20000000 ); >> + } >> + close( core_fileno ); >> + } >> + >> + fieldname_buf[end_cpuset_base_path] = '\0'; >> + /* >> + * fieldname_buf == "/dev/cpuset/<path>/cpu<n>" >> + * -or- "/dev/cpuset/<path>" >> + * Delete the cpuset tree for this core >> + */ >> + retcode = rmdir( fieldname_buf ); >> + if ( retcode ) { >> + ODPH_ERR( "Unable to delete cpuset %s - error %s\n", >> + fieldname_buf, errstring( errno ) ); >> + } >> + >> + return( retcode ); >> +} >> + >> +/* */ >> >> +/****************************************************************************** >> + * >> + * Delete the management tree for the specified cpuset >> + * >> + >> ******************************************************************************/ >> +static void delete_cpuset( const char *path ) >> +{ >> + /* >> + * Return the CPU cores in this cpuset to general purpose duty. >> + * Turn load balancing back on and indicate full dynticks not needed. >> + * This is done here to inform the kernel as to how these cores may >> be >> + * used and operated. >> + */ >> + set_cpuset_isolation( path, 0 ); >> + request_dynticks( path, 0 ); >> + >> + /* >> + * Create an absolute path string to the "cpus" field for the cpuset >> + * newpathname marks the end of the cpuset base path string at a >> position >> + * following the slash - that is where the field name string would be >> + * concatenated onto the path - eg. '/dev/cpuset/<path>/' >> + * cpuset_delete() wants this marker to point to the position prior >> to >> + * the slash - eg. '/dev/cpuset/<path>' - so adjust it. >> + */ >> + pthread_cleanup_push( (void(*)(void *))sem_post, (void >> *)&fieldname_lock ); >> + newpathname( path ); >> + end_cpuset_base_path--; >> + fieldname_buf[end_cpuset_base_path] = '\0'; >> + >> + /* >> + * Depopulate the CPU list for the cpuset and remove its >> + * directory hierarchy >> + */ >> + cpuset_delete(); >> + >> + /* Release the lock on the fieldname_buf and the cpuset */ >> + releasefieldname(); >> + pthread_cleanup_pop( 0 ); >> +} >> + >> +/* */ >> >> +/****************************************************************************** >> + * >> + * Modify the per-cpu management tree for the specified cpuset >> + * to either enable or disable scheduler load balancing on each >> single-core >> + * cpuset descended from the specified parent cpuset. >> + * Assumes the /dev/cpuset filesystem already mounted and the >> + * per-core cpusets already initialized. 
>> + * >> + >> ******************************************************************************/ >> +static void set_per_core_cpusets_isolated( const char *path, cpu_set_t >> *mask, >> + int on_off ) >> +{ >> + int retval, i, core_fileno, cpu_num_offset; >> + >> + if ( on_off ) >> + ODPH_DBG( "Disabling load balancing on per-core cpusets in >> %s\n", path ); >> + else >> + ODPH_DBG( "Enabling load balancing on per-core cpusets in %s\n", >> path ); >> + >> + >> + /* >> + * Set the pathname and fieldname path strings to the base of the >> path >> + * to the specified 'parent' cpuset. >> + */ >> + pthread_cleanup_push( (void(*)(void *))sem_post, (void >> *)&fieldname_lock ); >> + newpathname( path ); >> + fieldname_buf[end_cpuset_base_path] = '\0'; >> + >> + /* >> + * Create an individual cpuset for each CPU to facilitate isolation >> + */ >> + strcat_bounded( fieldname_buf, "cpu" ); >> + /* >> + * fieldname_buf == /dev/cpuset/<path>/cpu >> + * mark the location where we append the CPU number >> + */ >> + cpu_num_offset = strnlen( fieldname_buf, (sizeof( fieldname_buf ) - >> 1) ); >> + >> + for ( i = 0; i < numcpus; i++ ) { >> + if ( CPU_ISSET( i, mask) ) { >> + snprintf( cpuname, (sizeof( cpuname ) - 1), "%d", i ); >> + strcat_bounded( fieldname_buf, cpuname ); >> + /* >> + * fieldname_buf == /dev/cpuset/<path>/cpu<n> >> + * where <n> is the current core number (0 -> numcpus-1) >> + * Modify the cpuset tree for this core only >> + */ >> + mkdir( fieldname_buf, DIRMODE ); >> + >> + strcat_bounded( fieldname_buf, "/" ); >> + if ( cpuset_prefix_required ) >> + strcat_bounded( fieldname_buf, "cpuset." ); >> + /* Mark the end of the path string for this core */ >> + end_field_base_path = strnlen( fieldname_buf, >> + (sizeof( fieldname_buf ) - 1) >> ); >> + >> + /* Create a path string to the "sched_load_balance" field */ >> + newfieldname( "sched_load_balance" ); >> + /* >> + * fieldname_buf == >> + * >> "/dev/cpuset/<path>/cpu<n>/cpuset.sched_load_balance" >> + * -or- >> "/dev/cpuset/<path>/cpu<n>/sched_load_balance" >> + * Set the specified load balancing on this single-core >> cpuset >> + */ >> + core_fileno = open( fieldname_buf, (O_RDWR | O_CREAT | >> O_TRUNC), >> + FILEMODE ); >> + if ( core_fileno > 0 ) { >> + if ( on_off ) >> + retval = write( core_fileno, "0", 1 ); >> + else >> + retval = write( core_fileno, "1", 1 ); >> + close( core_fileno ); >> + } >> + >> + /* Create a path string to the "sched_relax_domain_level" >> field */ >> + newfieldname( "sched_relax_domain_level" ); >> + /* >> + * fieldname_buf == >> + * >> "/dev/cpuset/<path>/cpu<n>/cpuset.sched_relax_domain_level" >> + * -or- >> "/dev/cpuset/<path>/cpu<n>/sched_relax_domain_level" >> + * Set the specified behavior on this single-core cpuset >> + */ >> + core_fileno = open( fieldname_buf, (O_RDWR | O_CREAT | >> O_TRUNC), >> + FILEMODE ); >> + if ( core_fileno > 0 ) { >> + if ( on_off ) >> + retval = write( core_fileno, "0", 1 ); >> + else >> + retval = write( core_fileno, "-1", 2 ); >> + close( core_fileno ); >> + } >> + >> + /* >> + * Reset the current field pathname to: >> + * fieldname_buf == /dev/cpuset/<path>/cpu >> + * in preparation for the next CPU core >> + * in the data plane cpuset mask >> + */ >> + fieldname_buf[cpu_num_offset] = '\0'; >> + } >> + } >> + >> + /* Make the C compiler happy... 
do something with retval */ >> + if ( retval ) retval = 0; >> + >> + /* Release the lock on the fieldname_buf and the cpuset */ >> + releasefieldname(); >> + pthread_cleanup_pop( 0 ); >> +} >> + >> + >> +/* */ >> >> +/****************************************************************************** >> + * >> + * Modify the per-cpu management tree for the specified cpuset >> + * to either enable or disable full dynticks operation on each >> single-core >> + * cpuset descended from the specified parent cpuset. >> + * Assumes the /dev/cpuset filesystem already mounted and the >> + * per-core cpusets already initialized. >> + * >> + >> ******************************************************************************/ >> +static void request_per_core_dynticks( const char *path, cpu_set_t *mask, >> + int on_off ) >> +{ >> + int retval, i, core_fileno, cpu_num_offset; >> + if ( on_off ) >> + ODPH_DBG( "Requesting dynticks on per-core cpusets in %s\n", >> path ); >> + else >> + ODPH_DBG( "Dynticks not needed on per-core cpusets in %s\n", >> path ); >> + >> + >> + /* >> + * Set the pathname and fieldname path strings to the base of the >> path >> + * to the specified 'parent' cpuset. >> + */ >> + pthread_cleanup_push( (void(*)(void *))sem_post, (void >> *)&fieldname_lock ); >> + newpathname( path ); >> + fieldname_buf[end_cpuset_base_path] = '\0'; >> + >> + /* >> + * Create an individual cpuset for each CPU to facilitate isolation >> + */ >> + strcat_bounded( fieldname_buf, "cpu" ); >> + /* >> + * fieldname_buf == /dev/cpuset/<path>/cpu >> + * mark the location where we append the CPU number >> + */ >> + cpu_num_offset = strnlen( fieldname_buf, (sizeof( fieldname_buf ) - >> 1) ); >> + >> + for ( i = 0; i < numcpus; i++ ) { >> + if ( CPU_ISSET( i, mask) ) { >> + snprintf( cpuname, (sizeof( cpuname ) - 1), "%d", i ); >> + strcat_bounded( fieldname_buf, cpuname ); >> + /* >> + * fieldname_buf == /dev/cpuset/<path>/cpu<n> >> + * where <n> is the current core number (0 -> numcpus-1) >> + * Modify the cpuset tree for this core only >> + */ >> + mkdir( fieldname_buf, DIRMODE ); >> + >> + strcat_bounded( fieldname_buf, "/" ); >> + if ( cpuset_prefix_required ) >> + strcat_bounded( fieldname_buf, "cpuset." 
); >> + /* Mark the end of the path string for this core */ >> + end_field_base_path = strnlen( fieldname_buf, >> + (sizeof( fieldname_buf ) - 1) >> ); >> + >> + /* Create a path string to the "fulldynticks" field */ >> + newfieldname( "fulldynticks" ); >> + /* >> + * fieldname_buf == >> + * "/dev/cpuset/<path>/cpu<n>/cpuset/fulldynticks" >> + * -or- "/dev/cpuset/<path>/cpu<n>/fulldynticks" >> + */ >> + core_fileno = open( fieldname_buf, (O_RDWR | O_CREAT | >> O_TRUNC), >> + FILEMODE ); >> + if ( core_fileno > 0 ) { >> + if ( on_off ) >> + /* Mark this single-core cpuset for full dynticks >> mode */ >> + retval = write( core_fileno, "1", 1 ); >> + else >> + /* Mark this single-core cpuset for housekeeping >> mode */ >> + retval = write( core_fileno, "0", 1 ); >> + close( core_fileno ); >> + } >> + >> + /* >> + * Create an absolute path string to the "quiesce" field >> + * for the cpuset >> + */ >> + newfieldname( "quiesce" ); >> + /* >> + * fieldname_buf == >> + * "/dev/cpuset/<path>/cpu<n>/cpuset/quiesce" >> + * -or- "/dev/cpuset/<path>/cpu<n>/quiesce" >> + */ >> + core_fileno = open( fieldname_buf, (O_RDWR | O_CREAT | >> O_TRUNC), >> + FILEMODE ); >> + if ( core_fileno > 0 ) { >> + if ( on_off ) >> + /* Migrate timers / hrtimers away from this cpuset */ >> + retval = write( core_fileno, "1", 1 ); >> + else >> + /* Enable migration of timers / hrtimers onto this >> cpuset */ >> + retval = write( core_fileno, "0", 1 ); >> + close( core_fileno ); >> + } >> + >> + /* >> + * Reset the current field pathname to: >> + * fieldname_buf == /dev/cpuset/<path>/cpu >> + * in preparation for the next CPU core >> + * in the data plane cpuset mask >> + */ >> + fieldname_buf[cpu_num_offset] = '\0'; >> + } >> + } >> + >> + /* Make the C compiler happy... do something with retval */ >> + if ( retval ) retval = 0; >> + >> + /* Release the lock on the fieldname_buf and the cpuset */ >> + releasefieldname(); >> + pthread_cleanup_pop( 0 ); >> +} >> + >> +/* */ >> >> +/****************************************************************************** >> + * >> + * Create a new per-cpu management tree for the specified parent cpuset >> + * Assumes the /dev/cpuset filesystem already mounted and the >> + * parent cpuset already initialized. >> + * >> + >> ******************************************************************************/ >> +static void create_per_core_cpusets( const char *path, cpu_set_t *mask, >> + int isolated ) >> +{ >> + int retval, i, core_fileno, cpu_num_offset; >> + >> + /* >> + * Set the pathname and fieldname path strings to the base of the >> path >> + * to the specified 'parent' cpuset. 
>> + */ >> + pthread_cleanup_push( (void(*)(void *))sem_post, (void >> *)&fieldname_lock ); >> + newpathname( path ); >> + fieldname_buf[end_cpuset_base_path] = '\0'; >> + >> + /* >> + * Create an individual cpuset for each CPU to facilitate isolation >> + */ >> + strcat_bounded( fieldname_buf, "cpu" ); >> + /* >> + * fieldname_buf == /dev/cpuset/<path>/cpu >> + * mark the location where we append the CPU number >> + */ >> + cpu_num_offset = strnlen( fieldname_buf, (sizeof( fieldname_buf ) - >> 1) ); >> + >> + for ( i = 0; i < numcpus; i++ ) { >> + if ( CPU_ISSET( i, mask) ) { >> + snprintf( cpuname, (sizeof( cpuname ) - 1), "%d", i ); >> + strcat_bounded( fieldname_buf, cpuname ); >> + /* >> + * fieldname_buf == /dev/cpuset/<path>/cpu<n> >> + * where <n> is the current core number (0 -> numcpus-1) >> + * Create a new cpuset tree for this core only >> + */ >> + mkdir( fieldname_buf, DIRMODE ); >> + >> + strcat_bounded( fieldname_buf, "/" ); >> + if ( cpuset_prefix_required ) >> + strcat_bounded( fieldname_buf, "cpuset." ); >> + /* Mark the end of the path string for this core */ >> + end_field_base_path = strnlen( fieldname_buf, >> + (sizeof( fieldname_buf ) - 1) >> ); >> + >> + /* Create an absolute path string to the "mems" field */ >> + newfieldname( "mems" ); >> + /* >> + * fieldname_buf == "/dev/cpuset/<path>/cpu<n>/cpuset.mems" >> + * -or- "/dev/cpuset/<path>/cpu<n>/mems" >> + * Init the "mems" field so all cpusets share the same >> memory map >> + */ >> + core_fileno = open( fieldname_buf, (O_RDWR | O_CREAT | >> O_TRUNC), >> + FILEMODE ); >> + if ( core_fileno > 0 ) { >> + retval = write( core_fileno, "0", 1 ); >> + close( core_fileno ); >> + } >> + >> + /* Create an absolute path string to the "cpus" field */ >> + newfieldname( "cpus" ); >> + /* >> + * fieldname_buf == "/dev/cpuset/<path>/cpu<n>/cpuset.cpus" >> + * -or- "/dev/cpuset/<path>/cpu<n>/cpus" >> + * Init the CPU list to contain only the current core >> + */ >> + core_fileno = open( fieldname_buf, (O_RDWR | O_CREAT | >> O_TRUNC), >> + FILEMODE ); >> + if ( core_fileno > 0 ) { >> + retval = write( core_fileno, cpuname, strlen( cpuname ) >> ); >> + close( core_fileno ); >> + } >> + >> + /* Create a path string to the "sched_load_balance" field */ >> + newfieldname( "sched_load_balance" ); >> + /* >> + * fieldname_buf == >> + * >> "/dev/cpuset/<path>/cpu<n>/cpuset.sched_load_balance" >> + * -or- >> "/dev/cpuset/<path>/cpu<n>/sched_load_balance" >> + * Set the specified load balancing on this single-core >> cpuset >> + */ >> + core_fileno = open( fieldname_buf, (O_RDWR | O_CREAT | >> O_TRUNC), >> + FILEMODE ); >> + if ( core_fileno > 0 ) { >> + if ( isolated ) >> + retval = write( core_fileno, "0", 1 ); >> + else >> + retval = write( core_fileno, "1", 1 ); >> + close( core_fileno ); >> + } >> + >> + /* Create a path string to the "sched_relax_domain_level" >> field */ >> + newfieldname( "sched_relax_domain_level" ); >> + /* >> + * fieldname_buf == >> + * >> "/dev/cpuset/<path>/cpu<n>/cpuset.sched_relax_domain_level" >> + * -or- >> "/dev/cpuset/<path>/cpu<n>/sched_relax_domain_level" >> + * Set the specified behavior on this single-core cpuset >> + */ >> + core_fileno = open( fieldname_buf, (O_RDWR | O_CREAT | >> O_TRUNC), >> + FILEMODE ); >> + if ( core_fileno > 0 ) { >> + if ( isolated ) >> + retval = write( core_fileno, "0", 1 ); >> + else >> + retval = write( core_fileno, "-1", 2 ); >> + close( core_fileno ); >> + } >> + >> + /* >> + * Reset the current field pathname to: >> + * fieldname_buf == /dev/cpuset/<path>/cpu >> + 
* in preparation for the next CPU core >> + * in the data plane cpuset mask >> + */ >> + fieldname_buf[cpu_num_offset] = '\0'; >> + } >> + } >> + >> + /* Make the C compiler happy... do something with retval */ >> + if ( retval ) retval = 0; >> + >> + /* Release the lock on the fieldname_buf and the cpuset */ >> + releasefieldname(); >> + pthread_cleanup_pop( 0 ); >> +} >> + >> +/* */ >> >> +/****************************************************************************** >> + * >> + * Delete the per-cpu management tree for the specified cpuset >> + * >> + >> ******************************************************************************/ >> +static void delete_per_core_cpusets( const char *path, cpu_set_t *mask ) >> +{ >> + int i, cpu_num_offset; >> + >> + /* >> + * Return the CPUs in the per_core cpusets to general purpose duty. >> + * Turn load balancing back on and indicate full dynticks not needed. >> + * This is done here to inform the kernel as to how these cores may >> be >> + * used and operated. >> + */ >> + set_per_core_cpusets_isolated( path, mask, 0 ); >> + request_per_core_dynticks( path, mask, 0 ); >> + >> + /* >> + * Set the pathname and fieldname path strings to the base of the >> path >> + * to the specified cpuset. >> + */ >> + pthread_cleanup_push( (void(*)(void *))sem_post, (void >> *)&fieldname_lock ); >> + newpathname( path ); >> + fieldname_buf[end_cpuset_base_path] = '\0'; >> + >> + /* >> + * Delete the individual cpuset for each CPU >> + */ >> + strcat_bounded( fieldname_buf, "cpu" ); >> + >> + /* >> + * fieldname_buf == /dev/cpuset/<path>/cpu >> + * mark the location where we append the CPU number >> + */ >> + cpu_num_offset = strnlen( fieldname_buf, (sizeof( fieldname_buf ) - >> 1) ); >> + >> + for ( i = 0; i < numcpus; i++ ) { >> + if ( CPU_ISSET( i, mask) ) { >> + snprintf( cpuname, sizeof( cpuname ), "%d", i ); >> + strcat_bounded( fieldname_buf, cpuname ); >> + /* Mark the end of the path string for this cpuset */ >> + end_cpuset_base_path = strnlen( fieldname_buf, >> + (sizeof( fieldname_buf ) - >> 1) ); >> + >> + /* >> + * Depopulate the CPU list for the cpuset and remove its >> + * directory hierarchy >> + */ >> + cpuset_delete(); >> + >> + /* >> + * Reset the pathname to: >> + * fieldname_buf == /dev/cpuset/<path>/cpu >> + * in preparation for the next CPU core >> + * in the data plane cpuset mask >> + */ >> + fieldname_buf[cpu_num_offset] = '\0'; >> + } >> + } >> + >> + /* Release the lock on the fieldname_buf and the cpuset */ >> + releasefieldname(); >> + pthread_cleanup_pop( 0 ); >> +} >> + >> +/* */ >> >> +/****************************************************************************** >> + * >> + * Read the specified value from the specified field of the cpuset >> per-cpu >> + * management tree for the specified CPU and return it at caller's value >> ptr. >> + * If the file for the specified field is missing or empty then *value >> is NULL. >> + * >> + * Assumes the /dev/cpuset filesystem already mounted and the >> + * cpusets already initialized. 
>> + * >> + >> ******************************************************************************/ >> +static void get_per_cpu_field_for( int cpu, const char *path, const char >> *field, >> + char *value, size_t len ) >> +{ >> + int retval = 0; >> + int num_read, core_fileno; >> + >> + pthread_cleanup_push( (void(*)(void *))sem_post, (void >> *)&pathname_lock ); >> + sem_wait( &pathname_lock ); >> + >> + /* Get the name of this single-core cpuset based on the specified >> CPU */ >> + strcpy( pathname_buf, path ); >> + strcat_bounded( pathname_buf, "/cpu" ); >> + snprintf( cpuname, sizeof( cpuname ), "%d", cpu ); >> + strcat_bounded( pathname_buf, cpuname ); >> + >> + /* Set the fieldname path string to point to fields within this >> cpuset */ >> + pthread_cleanup_push( (void(*)(void *))sem_post, (void >> *)&fieldname_lock ); >> + newpathname( pathname_buf ); >> + >> + /* >> + * fieldname_buf == "/dev/cpuset/<path>/cpu<n>/cpuset." >> + * -or- "/dev/cpuset/<path>/cpu<n>/" >> + */ >> + strcat_bounded( fieldname_buf, field ); >> + /* >> + * fieldname_buf == "/dev/cpuset/<path>/cpu<n>/cpuset.<field>" >> + * -or- "/dev/cpuset/<path>/cpu<n>/<field>" >> + */ >> + if ( value ) { >> + core_fileno = open( fieldname_buf, O_RDONLY ); >> + if ( core_fileno > 0 ) { >> + for ( num_read = 0; num_read < len; ) { >> + num_read = read( core_fileno, (void *)value, len ); >> + if ( (num_read < len) && (errno != EINTR) ) >> + retval = -1; >> + break; >> + } >> + /* If the field file is missing or empty */ >> + close( core_fileno ); >> + if ( len && (retval < 0) ) { >> + *value = (char)'\0'; >> + ODPH_ERR( "Failed to get value for %s - error %s\n", >> + fieldname_buf, errstring( errno ) ); >> + } >> + } else >> + *value = (char)'\0'; >> + } >> + >> + /* Release the lock on the fieldname_buf and the cpuset */ >> + releasefieldname(); >> + pthread_cleanup_pop( 0 ); >> + pthread_cleanup_pop( 1 ); >> + >> + /* Make the C compiler happy... do something with retval */ >> + if ( retval ) retval = 0; >> +} >> + >> +/* */ >> >> +/****************************************************************************** >> + * >> + * Write the specified value to the specified field of the per-cpu >> + * management tree for the specified CPU and cpuset >> + * If value is NULL then the file for the specified field will be >> truncated. >> + * >> + * Assumes the /dev/cpuset filesystem already mounted and the >> + * cpusets already initialized. >> + * >> + >> ******************************************************************************/ >> +static void set_per_cpu_field_for( int cpu, const char *path, const char >> *field, >> + const char *value ) >> +{ >> + int retval, core_fileno; >> + >> + pthread_cleanup_push( (void(*)(void *))sem_post, (void >> *)&pathname_lock ); >> + sem_wait( &pathname_lock ); >> + >> + /* Get the name of this single-core cpuset based on the specified >> CPU */ >> + strcpy( pathname_buf, path ); >> + strcat_bounded( pathname_buf, "/cpu" ); >> + snprintf( cpuname, sizeof( cpuname ), "%d", cpu ); >> + strcat_bounded( pathname_buf, cpuname ); >> + >> + /* Set the fieldname path string to point to fields within this >> cpuset */ >> + pthread_cleanup_push( (void(*)(void *))sem_post, (void >> *)&fieldname_lock ); >> + newpathname( pathname_buf ); >> + >> + /* >> + * fieldname_buf == "/dev/cpuset/<path>/cpu<n>/cpuset." 
>> + * -or- "/dev/cpuset/<path>/cpu<n>/" >> + */ >> + strcat_bounded( fieldname_buf, field ); >> + /* >> + * fieldname_buf == "/dev/cpuset/<path>/cpu<n>/cpuset.<field>" >> + * -or- "/dev/cpuset/<path>/cpu<n>/<field>" >> + */ >> + core_fileno = open( fieldname_buf, (O_RDWR | O_CREAT | O_TRUNC), >> FILEMODE ); >> + if ( core_fileno > 0 ) { >> + /* If value is NULL then the field file will simply be truncated >> */ >> + if ( value ) >> + retval = write( core_fileno, value, strlen( value ) ); >> + close( core_fileno ); >> + } >> + >> + /* Release the lock on the fieldname_buf and the cpuset */ >> + releasefieldname(); >> + pthread_cleanup_pop( 0 ); >> + pthread_cleanup_pop( 1 ); >> + >> + /* Make the C compiler happy... do something with retval */ >> + if ( retval ) retval = 0; >> +} >> + >> +/* */ >> >> +/****************************************************************************** >> + * >> + * Migrate timers and hrtimers away from the specified cpuset's CPU >> cores >> + * >> + >> ******************************************************************************/ >> +static void quiesce_cpus( const char *path ) >> +{ >> + int retval, fileno; >> + >> + /* >> + * Set the fieldname path string to the base of the path >> + * to the caller's specified cpuset. >> + */ >> + pthread_cleanup_push( (void(*)(void *))sem_post, (void >> *)&fieldname_lock ); >> + newpathname( path ); >> + >> + /* >> + * Create an absolute path string to the "quiesce" field >> + * for the cpuset >> + */ >> + newfieldname( "quiesce" ); >> + >> + /* >> + * Migrate timers / hrtimers away from the cpuset's CPUs >> + */ >> + fileno = open( fieldname_buf, (O_RDWR | O_CREAT | O_TRUNC), FILEMODE >> ); >> + if ( fileno > 0 ) { >> + retval = write( fileno, "1", 1 ); >> + close( fileno ); >> + } >> + >> + pthread_cleanup_pop( 1 ); >> + >> + /* Make the C compiler happy... do something with retval */ >> + if ( retval ) retval = 0; >> +} >> + >> +/* */ >> >> +/****************************************************************************** >> + * >> + * Move the specified task away from its current cpuset >> + * and onto the cores of the specified new cpuset >> + * Specifying a NULL path string pointer defaults to /dev/cpuset >> + * >> + * Assumes the caller passes in a legitimate task PID string. >> + * >> + * Returns an int == zero if migration successful or -1 if an error >> occurred >> + >> ******************************************************************************/ >> +static int migrate_task( const char *callers_pid, const char >> *to_cpuset_path ) >> +{ >> + size_t num_read, num_to_write; >> + int i, to_fileno, proc_pid_fileno, end_of_file, migrate_failed; >> + static char my_pid[24]; >> + static char written_pid[24]; >> + static char cur[2]; >> + static char to_path_buf[128]; >> + static char proc_path_buf[80]; >> + >> + pthread_cleanup_push( (void(*)(void *))sem_post, (void >> *)&to_path_lock ); >> + sem_wait( &to_path_lock ); >> + >> + /* Create strings containing the full path to the caller's cpusets */ >> + strcpy_bounded( to_path_buf, "/dev/cpuset/" ); >> + >> + /* Mark index of trailing slash for possible overwrite */ >> + num_read = strlen( to_path_buf ) - 1; >> + if ( to_cpuset_path ) >> + strcat_bounded( to_path_buf, to_cpuset_path ); >> + else >> + /* Migrate the task to default cpuset - remove trailing slash */ >> + to_path_buf[num_read] = '\0'; >> + >> + /* >> + * We will be manipulating the tasks in this cpuset, so >> + * extend the path string to specify the 'tasks' file. 
>> + */ >> + strcat_bounded( to_path_buf, "/tasks" ); >> + >> + /* >> + * Assemble a path to the status file for the caller's task in /proc >> + * to verify that the process still exists >> + */ >> + for ( i = 0; i < strlen( callers_pid ); i++ ) { >> + /* >> + * Don't include any trailing newline from callers_pid >> + * into the pathname string being built. >> + */ >> + if ( callers_pid[i] != (char)'\n' ) >> + written_pid[i] = callers_pid[i]; >> + else >> + written_pid[i] = (char)'\0'; >> + } >> + written_pid[i] = (char)'\0'; >> + strcpy_bounded( proc_path_buf, "/proc/" ); >> + strcat_bounded( proc_path_buf, written_pid ); >> + strcat_bounded( proc_path_buf, "/status" ); >> + proc_pid_fileno = open( proc_path_buf, O_RDONLY ); >> + >> + /* Init the result return value */ >> + migrate_failed = 0; >> + >> + /* Ignore the caller's task if its PID is stale */ >> + if ( proc_pid_fileno > 0 ) { >> + to_fileno = open( to_path_buf, (O_RDWR | O_CREAT | O_APPEND), >> + FILEMODE ); >> + } else { >> + to_fileno = -1; >> + migrate_failed = -1; >> + ODPH_ERR( "%s not found - failed to migrate %s\n", >> + proc_path_buf, callers_pid ); >> + } >> + >> + if ( to_fileno > 0 ) { >> + /* Capture our own ttid for comparison purposes */ >> + snprintf( my_pid, (sizeof( my_pid ) - 1), "%d", gettaskid() ); >> + >> + /* >> + * Now let's try to migrate the task. >> + * Try to write the PID for the caller's task into >> + * the task list for the specified 'to' cpuset. >> + */ >> + errno = 0; >> + num_to_write = strlen( written_pid ); >> + for ( num_read = 0; num_read < num_to_write; ) { >> + num_read = write( to_fileno, written_pid, num_to_write ); >> + if ( (num_read == (size_t)-1) && (errno != EINTR) ) >> + migrate_failed = -1; >> + break; >> + } >> + >> + if ( migrate_failed ) { >> + /* >> + * Scan the task's /proc status file to find its name. >> + */ >> + for ( end_of_file = 0; !end_of_file; ) { >> + /* Read one line of info from the task's /proc status >> file */ >> + for ( i = 0, cur[0] = (char)'\0'; (cur[0] != >> (char)'\n'); ) { >> + num_read = read( proc_pid_fileno, (void *)cur, 1 ); >> + if ( num_read > 0 ) { >> + if ( cur[0] != (char)'\n' ) { >> + proc_path_buf[i] = cur[0]; >> + i++; >> + } else { >> + proc_path_buf[i] = (char)'\0'; >> + } >> + } else { >> + proc_path_buf[i] = '\0'; >> + if ( errno != EINTR ) { >> + end_of_file = 1; >> + break; >> + } >> + } >> + } >> + >> + /* cpulist should contain a string unless EOF reached */ >> + if ( !(strncmp( proc_path_buf, "Name: ", 6 )) ) >> + break; >> + } >> + /* Failed to migrate current task */ >> + ODPH_ERR( "Failed to migrate pid %s - error %s\n", >> + written_pid, errstring( errno ) ); >> + } else { >> + /* >> + * If we are migrating our own task, sleep for 50 msec >> + * to allow time for migration to occur. >> + */ >> + if ( !strncmp( written_pid, my_pid, strlen( my_pid ) ) ) >> + sleep_nsec( 50000000 ); >> + } >> + close( to_fileno ); >> + } >> + if ( proc_pid_fileno > 0 ) >> + close( proc_pid_fileno ); >> + >> + pthread_cleanup_pop( 1 ); >> + >> + return( migrate_failed ); >> +} >> + >> +/* */ >> >> +/****************************************************************************** >> + * >> + * Move all tasks which can be migrated off of the cores of the current >> cpuset >> + * and onto the cores of the specified new cpuset >> + * Specifying a NULL path string pointer defaults to /dev/cpuset >> + * The 'except' parameter is an array of pid_t values >> + * which SHOULD NOT be migrated away from this core - terminated by >> + * a zero pid_t value. 
If the pointer to this array is NULL or if the >> + * first pid_t is zero, the function will try to migrate all processes >> off of >> + * the 'from' cpuset. >> + * >> + >> ******************************************************************************/ >> +static void migrate_tasks( const char *from_cpuset_path, >> + const char *to_cpuset_path, pid_t *except ) >> +{ >> + size_t num_read; >> + int i, from_fileno, pid_ready, end_of_file; >> + char callers_pid[24]; >> + char cur[1]; >> + static char from_path_buf[128]; >> + uint_64_t pid_numeric = 0; >> + pid_t cur_match; >> + >> + pthread_cleanup_push( (void(*)(void *))sem_post, (void >> *)&from_path_lock ); >> + sem_wait( &from_path_lock ); >> + >> + /* Create strings containing the full path to the caller's cpusets */ >> + strcpy_bounded( from_path_buf, "/dev/cpuset/" ); >> + >> + /* Mark index of trailing slash for possible overwrite */ >> + num_read = strlen( from_path_buf ) - 1; >> + if ( from_cpuset_path ) { >> + strcat_bounded( from_path_buf, from_cpuset_path ); >> + } else { >> + /* Migrate the task from default cpuset - remove trailing slash >> */ >> + from_path_buf[num_read] = '\0'; >> + } >> + >> + if ( to_cpuset_path ) >> + ODPH_DBG( "Migrating tasks from %s to /dev/cpuset/%s\n", >> + from_path_buf, to_cpuset_path ); >> + else >> + ODPH_DBG( "Migrating tasks from %s to /dev/cpuset\n", >> from_path_buf ); >> + >> + /* >> + * We will be manipulating the tasks in this cpuset, so >> + * extend the path string to specify the 'tasks' file. >> + */ >> + strcat_bounded( from_path_buf, "/tasks" ); >> + from_fileno = open( from_path_buf, O_RDWR ); >> + >> + if ( from_fileno > 0 ) { >> + for ( end_of_file = 0; !end_of_file; ) { >> + /* Read one line of PID info from the 'from' tasks file */ >> + callers_pid[0] = '\0'; >> + pid_ready = 0; >> + for ( i = 0; i < sizeof( callers_pid ); ) { >> + num_read = read( from_fileno, (void *)cur, 1 ); >> + switch ( num_read ) { >> + case 0 : >> + end_of_file = 1; >> + break; >> + case 1 : >> + if ( cur[0] == (char)'\n' ) { >> + callers_pid[i] = '\0'; >> + pid_ready = 1; >> + i = 0; >> + } else { >> + if ( (i + 1) < sizeof( callers_pid ) ) { > >
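For readers following the quoted helper code above, here is a minimal, self-contained sketch of the mechanism that migrate_task() and migrate_tasks() build on: a task is moved between cpusets by writing its kernel task ID into the destination cpuset's 'tasks' file. This sketch is not part of the patch; the '/dev/cpuset/dplane' path is a hypothetical example, and the patch itself wraps this step in path construction, locking, retries and error reporting.

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Move the calling thread into the cpuset whose 'tasks' file is given. */
static int join_cpuset(const char *tasks_path)
{
	char tid_str[24];
	int fd;
	int len = snprintf(tid_str, sizeof(tid_str), "%ld",
			   (long)syscall(SYS_gettid));

	fd = open(tasks_path, O_WRONLY);
	if (fd < 0)
		return -1;
	/* The kernel expects one task ID per write() to the tasks file. */
	if (write(fd, tid_str, len) != len) {
		close(fd);
		return -1;
	}
	close(fd);
	return 0;
}

int main(void)
{
	/* Hypothetical cpuset assumed to have been created beforehand. */
	if (join_cpuset("/dev/cpuset/dplane/tasks"))
		perror("join_cpuset");
	return 0;
}

As with the helper itself, this only works when run as root with the cpuset filesystem mounted at /dev/cpuset.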
On 13 November 2015 at 13:51, Gary Robertson <gary.robertson@linaro.org> wrote: > Oops - clicked the wrong reply option. > > Nicolas raises an excellent point. I think at least a configuration > option may be needed to enable or disable isolation. > I think that ./configure should check for the support and, if it is available, provide the configure option --enable-test-isolated; this is how nearly all our other optional capabilities work. In this case, if support is present, the default would be to enable --enable-test-isolated. > There are also impacts on how CPUs should be allocated on isolated versus > non-isolated platforms, and some function calls which need to be > substituted depending on absence or presence of isolation support. > However this patch was intended as a preliminary introduction of isolation > support at the earliest possible time frame - with the expectations that > further refinements would be needed. > As such thanks to Nicolas for his input - and I would encourage others to > alert me to additional shortcomings or problems I may have overlooked. > > On Fri, Nov 13, 2015 at 12:10 PM, Nicolas Morey-Chaisemartin < > nmorey@kalray.eu> wrote: > >> Le 13/11/15 18:27 , Gary S. Robertson a écrit : >> >>> This patch adds ODP helper code for setting up cpuset-based >>> isolated execution environments on a Linux platform. >>> >>> By executing applications on dedicated CPU cores with minimal scheduler >>> contention or 'interference', latency determinism can be significantly >>> enhanced, and performance can be improved and made more deterministic as >>> well. >>> Performance gains are dependent on the degree of CPU loading and >>> scheduler >>> contention which would otherwise occur in a 'normal' non-isolated >>> environment. >>> >>> This isolation API requires an underlying Linux kernel with cpuset >>> support, >>> and will return an error if such support is missing. >>> If the underlying kernel also includes LNG-originated 'NO_HZ_FULL' >>> support, >>> this support will be used to the extent that it is available. >>> (NOTE that isolation setup requires root privileges during execution.) >>> >>> This patch also modifies the pktio performance test as an example of >>> how the new isolation helpers might be employed by an application, >>> and as a convenient means of quantifying the improved performance >>> made possible by executing in an isolated environment. >>> >>> It is anticipated that this API will evolve as use cases are defined >>> and further features or refinements are requested - hence this is only >>> the 'initial' API submission. >>> >>> Signed-off-by: Gary S. 
Robertson <gary.robertson@linaro.org> >>> --- >>> helper/Makefile.am | 2 + >>> helper/include/odp/helper/linux_isolation.h | 98 + >>> helper/linux_isolation.c | 2901 >>> +++++++++++++++++++++++++++ >>> test/performance/odp_pktio_perf.c | 21 +- >>> 4 files changed, 3016 insertions(+), 6 deletions(-) >>> create mode 100644 helper/include/odp/helper/linux_isolation.h >>> create mode 100644 helper/linux_isolation.c >>> >>> diff --git a/helper/Makefile.am b/helper/Makefile.am >>> index e72507e..f9c3558 100644 >>> --- a/helper/Makefile.am >>> +++ b/helper/Makefile.am >>> @@ -11,6 +11,7 @@ helperincludedir = $(includedir)/odp/helper/ >>> helperinclude_HEADERS = \ >>> $(srcdir)/include/odp/helper/ring.h \ >>> $(srcdir)/include/odp/helper/linux.h \ >>> + $(srcdir)/include/odp/helper/linux_isolation.h \ >>> $(srcdir)/include/odp/helper/chksum.h\ >>> $(srcdir)/include/odp/helper/eth.h\ >>> $(srcdir)/include/odp/helper/icmp.h\ >>> @@ -29,6 +30,7 @@ noinst_HEADERS = \ >>> __LIB__libodphelper_la_SOURCES = \ >>> linux.c \ >>> + linux_isolation.c \ >>> ring.c \ >>> hashtable.c \ >>> lineartable.c >>> diff --git a/helper/include/odp/helper/linux_isolation.h >>> b/helper/include/odp/helper/linux_isolation.h >>> new file mode 100644 >>> index 0000000..2fc8266 >>> --- /dev/null >>> +++ b/helper/include/odp/helper/linux_isolation.h >>> @@ -0,0 +1,98 @@ >>> +/* Copyright (c) 2013, Linaro Limited >>> + * All rights reserved. >>> + * >>> + * SPDX-License-Identifier: BSD-3-Clause >>> + */ >>> + >>> + >>> +/** >>> + * @file >>> + * >>> + * ODP Linux isolation helper API >>> + * >>> + * This file is an optional helper to odp.h APIs. These functions are >>> provided >>> + * to ease common setups for isolation using cpusets in a Linux system. >>> + * User is free to implement the same setups in other ways (not via >>> this API). >>> + */ >>> + >>> +#ifndef ODP_LINUX_ISOLATION_H_ >>> +#define ODP_LINUX_ISOLATION_H_ >>> + >>> +#ifdef __cplusplus >>> +extern "C" { >>> +#endif >>> + >>> +#include <odp.h> >>> + >>> +/* >>> + * Verify the level of underlying operating system support. >>> + * (Return with error if the OS does not at least support cpusets) >>> + * Set up system-wide CPU masks and cpusets >>> + * (Future) Set up file-based persistent cpuset management layer >>> + * to allow cooperative use of system isolation resources >>> + * by multiple independent ODP instances. >>> + */ >>> +int odph_isolation_init_global( void ); >>> + >>> +/* >>> + * Migrate all tasks from cpusets created for isolation support to the >>> + * generic boot-level single cpuset. >>> + * Remove all isolated CPU environments and cpusets >>> + * Zero out system-wide CPU masks >>> + * (Future) Reset persistent file-based cpuset management layer >>> + * to show no system isolation resources are available. >>> + */ >>> +int odph_isolation_term_global( void ); >>> + >>> +/** >>> + * Creates and launches pthreads >>> + * >>> + * Creates, pins and launches threads to separate CPU's based on the >>> cpumask. 
>>> + * >>> + * @param thread_tbl Thread table >>> + * @param mask CPU mask >>> + * @param start_routine Thread start function >>> + * @param arg Thread argument >>> + * >>> + * @return Number of threads created >>> + */ >>> +int odph_linux_isolated_pthread_create(odph_linux_pthread_t *thread_tbl, >>> + const odp_cpumask_t *mask, >>> + void *(*start_routine) (void *), >>> + void *arg); >>> + >>> +/** >>> + * Fork a process >>> + * >>> + * Forks and sets CPU affinity for the child process >>> + * >>> + * @param proc Pointer to process state info (for output) >>> + * @param cpu Destination CPU for the child process >>> + * >>> + * @return On success: 1 for the parent, 0 for the child >>> + * On failure: -1 for the parent, -2 for the child >>> + */ >>> +int odph_linux_isolated_process_fork(odph_linux_process_t *proc, int >>> cpu); >>> + >>> +/** >>> + * Fork a number of processes >>> + * >>> + * Forks and sets CPU affinity for child processes >>> + * >>> + * @param proc_tbl Process state info table (for output) >>> + * @param mask CPU mask of processes to create >>> + * >>> + * @return On success: 1 for the parent, 0 for the child >>> + * On failure: -1 for the parent, -2 for the child >>> + */ >>> +int odph_linux_isolated_process_fork_n(odph_linux_process_t *proc_tbl, >>> + const odp_cpumask_t *mask); >>> + >>> +int odph_cpumask_default_worker(odp_cpumask_t *mask, int num); >>> +int odph_cpumask_default_control(odp_cpumask_t *mask, int num >>> ODP_UNUSED); >>> + >>> +#ifdef __cplusplus >>> +} >>> +#endif >>> + >>> +#endif >>> diff --git a/helper/linux_isolation.c b/helper/linux_isolation.c >>> new file mode 100644 >>> index 0000000..5ca6c7f >>> --- /dev/null >>> +++ b/helper/linux_isolation.c >>> @@ -0,0 +1,2901 @@ >>> +/* >>> + * This file contains declarations and definitions of functions and >>> + * data structures which are useful for manipulating cpusets in support >>> + * of OpenDataPlane (ODP) high-performance applications. >>> + * >>> + * Copyright (c) 2015, Linaro Limited >>> + * All rights reserved. >>> + * SPDX-License-Identifier: BSD-3-Clause >>> + */ >>> + >>> +#ifndef _GNU_SOURCE >>> +#define _GNU_SOURCE >>> +#endif >>> + >>> +#include <ctype.h> >>> +#include <dirent.h> >>> +#include <errno.h> >>> +#include <fcntl.h> >>> +#include <fts.h> >>> +#include <pthread.h> >>> +#include <sched.h> >>> +#include <semaphore.h> >>> +#include <signal.h> >>> +#include <stdarg.h> >>> +#include <stdio.h> >>> +#include <stdlib.h> >>> +#include <string.h> >>> +#include <time.h> >>> +#include <unistd.h> >>> +#include <sys/mman.h> >>> +#include <sys/mount.h> >>> +#include <sys/resource.h> >>> +#include <sys/stat.h> >>> +#include <sys/syscall.h> >>> +#include <sys/time.h> >>> +#include <sys/types.h> >>> +#include <sys/wait.h> >>> + >>> +#include <odp/init.h> >>> +#include <odp_internal.h> >>> +#include <odp/cpumask.h> >>> +#include <odp/debug.h> >>> +#include <odp_debug_internal.h> >>> +#include <odp/helper/linux.h> >>> +#include "odph_debug.h" >>> + >>> +typedef unsigned long long uint_64_t; >>> +typedef unsigned int uint32_t; >>> +typedef unsigned short uint16_t; >>> +typedef unsigned char uint8_t; >>> + >>> >>> +/****************************************************************************** >>> + * The following constants are important for determining isolation >>> capacities >>> + * MAX_CPUS_SUPPORTED is used to dimension arrays and some loops in the >>> + * isolation helper code. >>> + * The HOUSEKEEPING_RATIO_* constants define the ratio of housekeeping >>> CPUs >>> + * (i.e. 
'control plane' CPUs) - see >>> MULTIPLIER >>> + * versus isolated CPUs (i.e. 'data plane >>> CPUs) - >>> + * see DIVISOR >>> + * The calculation is: >>> + * NUMBER OF HOUSEKEEPING CPUs = >>> + * (NUMBER OF CPUs * HOUSEKEEPING_RATIO_MULTIPLIER) >>> + * divided by HOUSEKEEPING_RATIO_DIVISOR. >>> + * If NUMBER OF HOUSEKEEPING CPUs < 1, NUMBER OF HOUSEKEEPING CPUs ++ >>> + * NUMBER OF ISOLATED CPUs = >>> + * NUMBER OF CPUs - NUMBER OF HOUSEKEEPING CPUs >>> + >>> ******************************************************************************/ >>> +#define MAX_CPUS_SUPPORTED 64 >>> +#define HOUSEKEEPING_RATIO_MULTIPLIER 1 >>> +#define HOUSEKEEPING_RATIO_DIVISOR 4 >>> + >>> >>> +/****************************************************************************** >>> + * >>> + * Concatenate a string into a destination buffer >>> + * containing an existing string such that the length of the resulting >>> string >>> + * (including the terminating NUL) does not exceed the buffer size >>> + * >>> + >>> ******************************************************************************/ >>> +static inline char *__strcat_bounded( char *dst_strg, const char >>> *src_strg, >>> + size_t dstlen ) { >>> + *(dst_strg + (dstlen - 1)) = '\0'; >>> + return( strncat( dst_strg, src_strg, >>> + ((dstlen - 1) - strlen( dst_strg )) ) ); >>> +} >>> + >>> +#define strcat_bounded( dest, src ) \ >>> + __strcat_bounded( dest, src, (sizeof( dest )) ) >>> + >>> >>> +/****************************************************************************** >>> + * >>> + * Copy a string into a destination buffer and NUL-terminate it >>> + * such that the length of the resulting string >>> + * (including the terminating NUL) does not exceed the buffer size >>> + * >>> + >>> ******************************************************************************/ >>> +static inline char *__strcpy_bounded( char *dst_strg, const char >>> *src_strg, >>> + size_t dstlen ) { >>> + *(dst_strg + (dstlen - 1)) = '\0'; >>> + return( strncpy( dst_strg, src_strg, (dstlen - 1) ) ); >>> +} >>> + >>> +#define strcpy_bounded( dest, src ) \ >>> + __strcpy_bounded( dest, src, (sizeof( dest )) ) >>> + >>> +#define MAX_ERR_MSG_SIZE 256 >>> +#define ERR_STRING_SIZE 80 >>> + >>> +#define NSEC_PER_SEC 1000000000L >>> + >>> +static void sleep_nsec( long nsec ) >>> +{ >>> + struct timespec delay, remaining; >>> + >>> + if ( nsec >= NSEC_PER_SEC ) { >>> + delay.tv_sec = nsec / NSEC_PER_SEC; >>> + delay.tv_nsec = nsec % NSEC_PER_SEC; >>> + } else { >>> + delay.tv_sec = 0; >>> + delay.tv_nsec = nsec; >>> + } >>> + for ( errno = EINTR; errno == EINTR; ) { >>> + errno = 0; >>> + if ( (clock_nanosleep( CLOCK_MONOTONIC, 0, &delay, &remaining >>> )) && >>> + (errno == EINVAL) ) { >>> + errno = 0; >>> + clock_nanosleep( CLOCK_REALTIME, 0, &delay, &remaining ); >>> + } >>> + delay.tv_sec = remaining.tv_sec; >>> + delay.tv_nsec = remaining.tv_nsec; >>> + } >>> +} >>> + >>> +static sem_t strerror_lock; >>> +static char error_buf[ERR_STRING_SIZE]; >>> + >>> +#define TM_STAMP_SIZE 30 >>> +#define MAX_EVENT_STRING_SIZE ((size_t)(MAX_ERR_MSG_SIZE - >>> TM_STAMP_SIZE - 4)) >>> +#define TM_STAMP_MSEC 19 >>> +#define TM_STAMP_MSEC_END 23 >>> +#define TM_STAMP_CTIME_END 25 >>> +#define TM_STAMP_DATE_END 11 >>> +#define TM_STAMP_YEAR (TM_STAMP_MSEC_END + 1) >>> +#define TM_STAMP_PREFIX_END 31 >>> + >>> +static sem_t logmsg_lock; >>> +static char stderr_log_msg[MAX_ERR_MSG_SIZE]; >>> + >>> >>> +/****************************************************************************** >>> + * >>> + * Return the 
task ID of the calling thread or process >>> + * (this is a system-wide thread ID used for scheduling all tasks, >>> + * whether single-threaded processes or individual threads within >>> + * multithreaded processes) >>> + * This is the identifier used for migrating tasks between cpusets >>> + * >>> + >>> ******************************************************************************/ >>> +static pid_t gettaskid( void ) >>> +{ >>> + return( (pid_t)(syscall(SYS_gettid)) ); >>> +} >>> + >>> +/* */ >>> >>> +/****************************************************************************** >>> + * >>> + * Convert an error number to an error type string in errstring >>> + * >>> + >>> ******************************************************************************/ >>> +static char *errstring( int error_no ) >>> +{ >>> + >>> + pthread_cleanup_push( (void(*)(void *))sem_post, >>> + (void *)&strerror_lock ); >>> + sem_wait( &strerror_lock ); >>> + >>> + error_buf[ ERR_STRING_SIZE - 1 ] = '\0'; >>> + strncpy( error_buf, strerror( error_no ), (ERR_STRING_SIZE - 2) ); >>> + >>> + pthread_cleanup_pop( 1 ); >>> + >>> + return( error_buf ); >>> +} >>> + >>> +/* */ >>> >>> +/****************************************************************************** >>> + * >>> + * Print a time stamp and event message out to stderr >>> + * >>> + >>> ******************************************************************************/ >>> +static void stderr_log( const char *fmt_str, ... ) >>> +{ >>> + va_list args; >>> + struct timeval time_now; >>> + struct tm time_fields; >>> + int i, j; >>> + char *event_msg_start; >>> + >>> + pthread_cleanup_push( (void(*)(void *))sem_post, >>> + (void *)&logmsg_lock ); >>> + sem_wait( &logmsg_lock ); >>> + >>> + /* >>> + * Snapshot the current time down to the resolution of the CPU >>> clock. >>> + */ >>> + gettimeofday( &time_now, (struct timezone *)NULL ); >>> + >>> + /* >>> + * Convert time to a calender time string with 1 sec resolution >>> + */ >>> + localtime_r( (time_t *)(&(time_now.tv_sec)), &time_fields ); >>> + asctime_r( &time_fields, stderr_log_msg ); >>> + >>> + /* >>> + * Shift the year and newline down to make room for a msec string >>> field >>> + */ >>> + for ( i = TM_STAMP_CTIME_END, j = (TM_STAMP_SIZE - 1); >>> + i >= TM_STAMP_MSEC; i--, j-- ) >>> + stderr_log_msg[j] = stderr_log_msg[i]; >>> + >>> + /* >>> + * Insert the millisecond time stamp field into the string between >>> the >>> + * seconds and the year as :000 thru :999. Then overwrite the >>> premature >>> + * NUL with a space to 're-attach' the year and newline >>> + */ >>> + snprintf( &(stderr_log_msg[TM_STAMP_MSEC]), 5, ":%.3ld", >>> + (time_now.tv_usec / 1000) ); >>> + stderr_log_msg[TM_STAMP_MSEC_END] = ' '; >>> + >>> + /* >>> + * NUL out the newline at the end of the timestamp so we can >>> + * prefix the log message with the timestamp. 
>>> + */ >>> + stderr_log_msg[TM_STAMP_SIZE - 2] = '\0'; >>> + strcat_bounded( stderr_log_msg, " - " ); >>> + event_msg_start = &(stderr_log_msg[strlen( stderr_log_msg )]); >>> + >>> + /* >>> + * Format the caller's event message into a constant string >>> + */ >>> + va_start( args, fmt_str ); >>> + vsnprintf( event_msg_start, MAX_EVENT_STRING_SIZE, fmt_str, args ); >>> + stderr_log_msg[MAX_EVENT_STRING_SIZE - 1] = '\0'; >>> + va_end( args ); >>> + strcat_bounded( stderr_log_msg, "\n" ); >>> + >>> + /* >>> + * Then print the time stamp and event message out to stderr >>> + */ >>> + fputs( stderr_log_msg, stderr ); >>> + >>> + pthread_cleanup_pop( 1 ); >>> +} >>> + >>> +/* */ >>> >>> +/****************************************************************************** >>> + * >>> + * Initialize the semaphores used for serializing error message >>> handling. >>> + * >>> + >>> ******************************************************************************/ >>> +static void init_errmsg_locks( void ) >>> +{ >>> + sem_init( &strerror_lock, 0, 1 ); >>> + sem_init( &logmsg_lock, 0, 1 ); >>> +} >>> + >>> +#define DIRMODE ((mode_t)(S_IRUSR | S_IWUSR | S_IXUSR | \ >>> + S_IRGRP | S_IXGRP | \ >>> + S_IROTH | S_IXOTH)) >>> +#define FILEMODE ((mode_t)(S_IRUSR | S_IWUSR | S_IRGRP | S_IROTH)) >>> + >>> +/* */ >>> >>> +/****************************************************************************** >>> + * >>> + * Data structures associated with CPUSETS >>> + * >>> + >>> ******************************************************************************/ >>> + >>> +static int numcpus; >>> +static size_t cpusetsize; >>> +static int cpusets_supported; >>> +static int cpuset_prefix_required; >>> + >>> +/* >>> + * Shared path construction buffers and position markers >>> + * Used to construct absolute paths to directories and files in the >>> + * cpuset file hierarchy. Shared in order to reduce stack usage, >>> + * (especially with nested function calls) - and kept thread-safe >>> + * by the locks below. >>> + */ >>> +static char pathname_buf[128]; /* Use only while holding >>> pathname_lock! */ >>> +static char fieldname_buf[128]; /* Use only while holding >>> fieldname_lock! */ >>> +static char cpulist[128]; /* Use only while holding >>> fieldname_lock! */ >>> +static char cpuname[3]; /* Use only while holding >>> fieldname_lock! */ >>> +static int end_cpuset_base_path; /* Use only while holding >>> fieldname_lock! */ >>> +static int end_field_base_path; /* Use only while holding >>> fieldname_lock! */ >>> + >>> +/* >>> + * Locks for thread-safe use of the shared path construction buffers. >>> + * Locking order - if pathname_lock is needed it must always be taken >>> + * before taking fieldname_lock and released after >>> + * releasing fieldname_lock. >>> + * if from_path_lock is needed it must always be taken >>> + * before taking to_path_lock and released after >>> + * releasing to_path_lock. >>> + * fieldname lock is the primary exclusion mechanism and by implication >>> + * allows thread-safe access to the cpuset directory tree >>> + * by all tasks using this suite of helper functions. 
>>> + */ >>> +static sem_t pathname_lock; >>> +static sem_t fieldname_lock; >>> +static sem_t from_path_lock; >>> +static sem_t to_path_lock; >>> + >>> +/* */ >>> >>> +/****************************************************************************** >>> + * >>> + * Functions associated with cpusets >>> + * >>> + >>> ******************************************************************************/ >>> +/* */ >>> >>> +/****************************************************************************** >>> + * >>> + * Called from applications when switching to a new cpuset. >>> + * >>> + * Takes a string specifying the name of the desired cpuset >>> + * relative to the mount point '/dev/cpuset/' >>> + * - eg. 'cplane' or 'dplane'...etc. >>> + * >>> + * Obtains the required lock on the shared fieldname path buffer. >>> + * Sets the (shared) current cpuset path string, >>> + * and creates the cpuset management tree base directory >>> + * if it is not already present. >>> + * Initializes the fieldname path string to the new base directory path. >>> + * Returns while holding the lock on the shared fieldname path buffer. >>> + * NOTE - a NULL cpuset_name defaults to the top-level 'master' cpuset. >>> + * >>> + >>> ******************************************************************************/ >>> +static char *newpathname( const char *cpuset_name ) >>> +{ >>> + /* >>> + * Lock exclusive access to the fieldname path buffer >>> + * and by inference, to the current cpuset management directory tree >>> + */ >>> + sem_wait( &fieldname_lock ); >>> + >>> + /* Create a string containing the full path to the caller's >>> directory */ >>> + strcpy_bounded( fieldname_buf, "/dev/cpuset/" ); >>> + >>> + if ( cpuset_name != (char *)NULL ) { >>> + strcat_bounded( fieldname_buf, cpuset_name ); >>> + >>> + /* >>> + * Create the new cpuset tree under /dev/cpuset >>> + * fieldname_buf = "/dev/cpuset/<path>" >>> + */ >>> + mkdir( fieldname_buf, DIRMODE ); >>> + >>> + strcat_bounded( fieldname_buf, "/" ); >>> + } >>> + >>> + /* >>> + * If a cpuset_name was specified, then >>> + * fieldname_buf = "/dev/cpuset/<path>/" --else-- >>> + * fieldname_buf = "/dev/cpuset/" >>> + * Mark the end of the path base string for this cpuset >>> + */ >>> + end_cpuset_base_path = strnlen( fieldname_buf, (sizeof( >>> fieldname_buf ) - 1) ); >>> + >>> + if ( cpuset_prefix_required ) >>> + strcat_bounded( fieldname_buf, "cpuset." ); >>> + >>> + /* Mark the end of the field path string for this cpuset */ >>> + end_field_base_path = strnlen( fieldname_buf, >>> + (sizeof( fieldname_buf ) - 1) ); >>> + >>> + /* Return to the caller with the fieldname_buf lock held */ >>> + return( fieldname_buf ); >>> +} >>> + >>> +/* */ >>> >>> +/****************************************************************************** >>> + * >>> + * Called from applications to create a complete cpuset fieldname path. >>> + * >>> + * Requires and assumes that the caller currently holds the lock >>> + * for exclusive use of the shared fieldname path buffer. >>> + * >>> + * Resets the (shared) current fieldname path to its initial contents, >>> + * effectively truncating the name of any previous field path. >>> + * Then concatenates the cpuset-relative field name string specified >>> + * by the caller onto the path base, creating the full field path name. 
>>> + * >>> + >>> ******************************************************************************/ >>> +static char *newfieldname( const char *field ) >>> +{ >>> + fieldname_buf[end_field_base_path] = '\0'; >>> + strcat_bounded( fieldname_buf, field ); >>> + return( fieldname_buf ); >>> +} >>> + >>> + >>> +/* */ >>> >>> +/****************************************************************************** >>> + * >>> + * Called from applications to release the lock on the shared >>> + * fieldname path buffer. This enables serialized access to the >>> + * cpuset management structure within a multi-threaded process. >>> + * The application releases this lock after it finishes processing >>> + * all fields of the current cpuset, guaranteeing that other threads >>> + * using this utility will not interfere with that cpuset. >>> + * >>> + >>> ******************************************************************************/ >>> +static void releasefieldname( void ) >>> +{ >>> + sem_post( &fieldname_lock ); >>> +} >>> + >>> +/* */ >>> >>> +/****************************************************************************** >>> + * >>> + * Verify and initialize basic CPUSET support >>> + * >>> + >>> ******************************************************************************/ >>> +static int init_cpusets( void ) >>> +{ >>> + int mounted = 0; >>> + int retcode = -1; >>> + int fileno; >>> + >>> + /* >>> + * Initialize the locks used to serialize access to the error >>> message >>> + * logging functions and buffers. This needs to be done prior to >>> most >>> + * of the other cpuset setup functions... so take care of it here. >>> + */ >>> + init_errmsg_locks(); >>> + >>> + /* Init locks for thread-safe access to static path-building >>> strings */ >>> + sem_init( &pathname_lock, 0, 1 ); >>> + sem_init( &fieldname_lock, 0, 1 ); >>> + sem_init( &from_path_lock, 0, 1 ); >>> + sem_init( &to_path_lock, 0, 1 ); >>> + >>> + cpuset_prefix_required = 0; >>> + cpusets_supported = 0; >>> + >>> +try2mount: >>> + /* Try to mount the cpuset pseudo-filesystem at /dev/cpuset */ >>> + mkdir( "/dev/cpuset", DIRMODE ); >>> + if ( mount( "none", "/dev/cpuset", "cpuset", >>> + (MS_NODEV | MS_NOEXEC | MS_NOSUID), (void *)NULL ) ) { >>> + switch ( errno ) { >>> + case EBUSY : >>> + mounted = 1; >>> + break; >>> + case ENODEV : >>> + ODPH_ERR( "cpusets not supported - aborting!\n" ); >>> + break; >>> + case EPERM : >>> + ODPH_ERR( "Insufficient privileges for cpusets - >>> aborting!\n" ); >>> + break; >>> + default : >>> + break; >>> + } >>> + } >>> + if ( mounted > 0) { >>> + cpusets_supported = 1; >>> + retcode = 0; >>> + fileno = open( "/dev/cpuset/cpuset.cpus", O_RDONLY ); >>> + if ( fileno > 0 ) { >>> + cpuset_prefix_required = 1; >>> + close( fileno ); >>> + } >>> + } else { >>> + /* >>> + * Try up to two more times to get the cpusets filesystem >>> mounted >>> + * before giving up >>> + */ >>> + if ( --mounted > -3 ) { >>> + /* Delay 50 msec to allow the mount to settle and try again >>> */ >>> + sleep_nsec( 50000000 ); >>> + goto try2mount; >>> + } >>> + } >>> + >>> + /* Support available CPU cores up to MAX_CPUS_SUPPORTED cores */ >>> + numcpus = (int)sysconf( _SC_NPROCESSORS_ONLN ); >>> + >>> + if( numcpus > MAX_CPUS_SUPPORTED ) { >>> + fprintf( stderr, >>> + "\rNOTE: MAX_CPUS_SUPPORTED defined as: %d,\n", >>> MAX_CPUS_SUPPORTED ); >>> + fprintf( stderr, >>> + "\r but number of CPU cores detected is: %d\n", numcpus ); >>> + fprintf( stderr, >>> + "\r Change MAX_CPUS_SUPPORTED in isolation_config.h and >>> rebuild\n" >>> + ); 
>>> + fprintf(stderr, >>> + "\r to support use of all CPU cores on this platform\n" ); >>> + } >>> + numcpus = (numcpus > MAX_CPUS_SUPPORTED) ? MAX_CPUS_SUPPORTED : >>> numcpus; >>> + >>> + /* Save the required cpuset mask size for global reference */ >>> + cpusetsize = CPU_ALLOC_SIZE( numcpus ); >>> + >>> + return( retcode ); >>> +} >>> + >>> +/* */ >>> >>> +/****************************************************************************** >>> + * >>> + * Enable or disable full dynticks operation on the specified cpuset >>> + * >>> + >>> ******************************************************************************/ >>> +static void request_dynticks( const char *path, int on_off ) >>> +{ >>> + int retval, fileno; >>> + >>> + if ( on_off ) >>> + ODPH_DBG( "Requesting dynticks on cpuset %s\n", path ); >>> + else >>> + ODPH_DBG( "Dynticks not needed on cpuset %s\n", path ); >>> + >>> + /* >>> + * Set the fieldname path string to the base of the path >>> + * to the caller's specified cpuset. >>> + */ >>> + pthread_cleanup_push( (void(*)(void *))sem_post, (void >>> *)&fieldname_lock ); >>> + newpathname( path ); >>> + >>> + /* >>> + * Create an absolute path string to the "fulldynticks" field >>> + * for the cpuset >>> + */ >>> + newfieldname( "fulldynticks" ); >>> + >>> + /* >>> + * Specify whether the cores in this cpuset should offload kernel >>> + * housekeeping tasks to other cores or else accept those tasks >>> + */ >>> + fileno = open( fieldname_buf, (O_RDWR | O_CREAT | O_TRUNC), >>> FILEMODE ); >>> + if ( fileno > 0 ) { >>> + if ( on_off ) >>> + retval = write( fileno, "1", 1 ); >>> + else >>> + retval = write( fileno, "0", 1 ); >>> + close( fileno ); >>> + } >>> + >>> + /* >>> + * Create an absolute path string to the "quiesce" field >>> + * for the cpuset >>> + */ >>> + newfieldname( "quiesce" ); >>> + >>> + /* >>> + * Migrate timers / hrtimers away from the CPUs in this cpuset -or- >>> + * allow timers / hrtimers for this CPU and system-wide use. >>> + */ >>> + fileno = open( fieldname_buf, (O_RDWR | O_CREAT | O_TRUNC), >>> FILEMODE ); >>> + if ( fileno > 0 ) { >>> + if ( on_off ) >>> + retval = write( fileno, "1", 1 ); >>> + else >>> + retval = write( fileno, "0", 1 ); >>> + close( fileno ); >>> + } >>> + >>> + /* Release the lock on the fieldname_buf and the cpuset */ >>> + releasefieldname(); >>> + pthread_cleanup_pop( 0 ); >>> + >>> + /* Make the C compiler happy... do something with retval */ >>> + if ( retval ) retval = 0; >>> +} >>> + >>> +/* */ >>> >>> +/****************************************************************************** >>> + * >>> + * Enable or disable isolation on the specified cpuset >>> + * (A NULL path defaults to the top-level master cpuset.) >>> + * >>> + >>> ******************************************************************************/ >>> +static void set_cpuset_isolation( const char *path, int on_off ) >>> +{ >>> + int retval, fileno; >>> + >>> + if ( on_off ) >>> + ODPH_DBG( "Disabling load balancing on cpuset %s\n", path ); >>> + else >>> + ODPH_DBG( "Enabling load balancing on cpuset %s\n", path ); >>> + >>> + /* >>> + * Set the fieldname path string to the base of the path >>> + * to the caller's specified cpuset. 
>>> + */ >>> + pthread_cleanup_push( (void(*)(void *))sem_post, (void >>> *)&fieldname_lock ); >>> + newpathname( path ); >>> + >>> + /* >>> + * Create an absolute path string to the "sched_load_balance" field >>> + * for the cpuset >>> + */ >>> + newfieldname( "sched_load_balance" ); >>> + >>> + /* >>> + * Enable or disable load balancing >>> + */ >>> + fileno = open( fieldname_buf, (O_RDWR | O_CREAT | O_TRUNC), >>> FILEMODE ); >>> + if ( fileno > 0 ) { >>> + if ( on_off ) >>> + retval = write( fileno, "0", 1 ); >>> + else >>> + retval = write( fileno, "1", 1 ); >>> + close( fileno ); >>> + } >>> + >>> + /* >>> + * Create an absolute path string to the >>> + * "sched_relax_domain_level" field for the cpuset >>> + */ >>> + newfieldname( "sched_relax_domain_level" ); >>> + >>> + /* >>> + * Enable or disable event-based load balancing >>> + */ >>> + fileno = open( fieldname_buf, (O_RDWR | O_CREAT | O_TRUNC), >>> FILEMODE ); >>> + if ( fileno > 0 ) { >>> + if ( on_off ) >>> + retval = write( fileno, "0", 1 ); >>> + else >>> + retval = write( fileno, "-1", 2 ); >>> + close( fileno ); >>> + } >>> + >>> + /* Release the lock on the fieldname_buf and the cpuset */ >>> + releasefieldname(); >>> + pthread_cleanup_pop( 0 ); >>> + >>> + /* Make the C compiler happy... do something with retval */ >>> + if ( retval ) retval = 0; >>> +} >>> + >>> +/* */ >>> >>> +/****************************************************************************** >>> + * >>> + * Create a new management tree for the specified cpuset >>> + * >>> + >>> ******************************************************************************/ >>> +static void create_cpuset( const char *path, cpu_set_t *mask, int >>> isolated ) >>> +{ >>> + int retval, i, fileno, endlist; >>> + >>> + /* >>> + * Set the fieldname path string to the base of the path >>> + * to the caller's specified cpuset. >>> + */ >>> + pthread_cleanup_push( (void(*)(void *))sem_post, (void >>> *)&fieldname_lock ); >>> + newpathname( path ); >>> + >>> + /* Create an absolute path string to the "mems" field for the >>> cpuset */ >>> + newfieldname( "mems" ); >>> + >>> + /* Init the "mems" field so all cpusets share the same memory map */ >>> + fileno = open( fieldname_buf, (O_RDWR | O_CREAT | O_TRUNC), >>> FILEMODE ); >>> + if ( fileno > 0 ) { >>> + retval = write( fileno, "0", 1 ); >>> + close( fileno ); >>> + } >>> + >>> + cpulist[0] = '\0'; >>> + for ( i = 0, endlist = 0; i < numcpus; i++ ) { >>> + if ( CPU_ISSET( i, mask) ) { >>> + /* >>> + * Create a comma-separated list of CPU cores in this cpuset >>> + * based on the cpuset mask passed in by the caller. 
>>> + */ >>> + snprintf( cpuname, sizeof( cpuname ), "%d", i ); >>> + strcat_bounded( cpulist, cpuname ); >>> + /* Mark the location of the trailing comma */ >>> + endlist = strnlen( cpulist, (sizeof( cpulist ) - 1) ); >>> + strcat_bounded( cpulist, "," ); >>> + } >>> + } >>> + /* Remove the last superfluous trailing comma from the string */ >>> + cpulist[endlist] = '\0'; >>> + >>> + /* Create an absolute path string to the "cpus" field for the >>> cpuset */ >>> + newfieldname( "cpus" ); >>> + >>> + /* >>> + * Now populate the overall CPU list for the current cpuset >>> + */ >>> + fileno = open( fieldname_buf, (O_RDWR | O_CREAT | O_TRUNC), >>> FILEMODE ); >>> + if ( fileno > 0 ) { >>> + retval = write( fileno, cpulist, strlen( cpulist ) ); >>> + close( fileno ); >>> + } >>> + >>> + /* Release the lock on the fieldname_buf and the cpuset */ >>> + releasefieldname(); >>> + pthread_cleanup_pop( 0 ); >>> + >>> + /* If the cpuset is to be isolated, turn off load balancing */ >>> + set_cpuset_isolation( path, isolated ); >>> + >>> + /* Make the C compiler happy... do something with retval */ >>> + if ( retval ) retval = 0; >>> +} >>> + >>> +/* */ >>> >>> +/****************************************************************************** >>> + * >>> + * Delete the directory and file hierarchy associated with the cpuset >>> + * specified by the contents of fieldname_buf >>> + * >>> + * Requires that the caller already holds fieldname_lock -and- >>> + * assumes all tasks, etc. have been previously migrated away from the >>> + * specified cpuset. >>> + * >>> + >>> ******************************************************************************/ >>> +static int cpuset_delete( void ) >>> +{ >>> + int retcode = -1; >>> + int i, core_fileno; >>> + >>> + ODPH_DBG( "Deleting cpuset %s\n", fieldname_buf ); >>> + >>> + /* >>> + * Create an absolute path string to the "cpus" field for the cpuset >>> + */ >>> + strcat_bounded( fieldname_buf, "/" ); >>> + if ( cpuset_prefix_required ) >>> + strcat_bounded( fieldname_buf, "cpuset." ); >>> + >>> + strcat_bounded( fieldname_buf, "cpus" ); >>> + >>> + /* >>> + * fieldname_buf == "/dev/cpuset/<path>/cpu<n>/cpuset.cpus" >>> + * -or- "/dev/cpuset/<path>/cpu<n>/cpus" >>> + * -or- "/dev/cpuset/<path>/cpuset.cpus" >>> + * -or- "/dev/cpuset/<path>/cpus" >>> + * De-populate the CPU list to contain no cores >>> + */ >>> + core_fileno = open( fieldname_buf, (O_RDWR | O_TRUNC) ); >>> + if ( core_fileno > 0 ) { >>> + /* >>> + * Try for up to 2 seconds to depopulate the CPU cores. >>> + * This allows time for any task migrations to stabilize. 
>>> + */ >>> + for ( i = 0; i < 100; i++ ) { >>> + errno = 0; >>> + retcode = write( core_fileno, "", 1 ); >>> + if ( !((retcode < 0) && >>> + ((errno == EINTR) || (errno == EBUSY))) ) >>> + break; >>> + >>> + /* Sleep 20 msec to allow depopulation to take effect */ >>> + sleep_nsec( 20000000 ); >>> + } >>> + close( core_fileno ); >>> + } >>> + >>> + fieldname_buf[end_cpuset_base_path] = '\0'; >>> + /* >>> + * fieldname_buf == "/dev/cpuset/<path>/cpu<n>" >>> + * -or- "/dev/cpuset/<path>" >>> + * Delete the cpuset tree for this core >>> + */ >>> + retcode = rmdir( fieldname_buf ); >>> + if ( retcode ) { >>> + ODPH_ERR( "Unable to delete cpuset %s - error %s\n", >>> + fieldname_buf, errstring( errno ) ); >>> + } >>> + >>> + return( retcode ); >>> +} >>> + >>> +/* */ >>> >>> +/****************************************************************************** >>> + * >>> + * Delete the management tree for the specified cpuset >>> + * >>> + >>> ******************************************************************************/ >>> +static void delete_cpuset( const char *path ) >>> +{ >>> + /* >>> + * Return the CPU cores in this cpuset to general purpose duty. >>> + * Turn load balancing back on and indicate full dynticks not >>> needed. >>> + * This is done here to inform the kernel as to how these cores may >>> be >>> + * used and operated. >>> + */ >>> + set_cpuset_isolation( path, 0 ); >>> + request_dynticks( path, 0 ); >>> + >>> + /* >>> + * Create an absolute path string to the "cpus" field for the cpuset >>> + * newpathname marks the end of the cpuset base path string at a >>> position >>> + * following the slash - that is where the field name string would >>> be >>> + * concatenated onto the path - eg. '/dev/cpuset/<path>/' >>> + * cpuset_delete() wants this marker to point to the position prior >>> to >>> + * the slash - eg. '/dev/cpuset/<path>' - so adjust it. >>> + */ >>> + pthread_cleanup_push( (void(*)(void *))sem_post, (void >>> *)&fieldname_lock ); >>> + newpathname( path ); >>> + end_cpuset_base_path--; >>> + fieldname_buf[end_cpuset_base_path] = '\0'; >>> + >>> + /* >>> + * Depopulate the CPU list for the cpuset and remove its >>> + * directory hierarchy >>> + */ >>> + cpuset_delete(); >>> + >>> + /* Release the lock on the fieldname_buf and the cpuset */ >>> + releasefieldname(); >>> + pthread_cleanup_pop( 0 ); >>> +} >>> + >>> +/* */ >>> >>> +/****************************************************************************** >>> + * >>> + * Modify the per-cpu management tree for the specified cpuset >>> + * to either enable or disable scheduler load balancing on each >>> single-core >>> + * cpuset descended from the specified parent cpuset. >>> + * Assumes the /dev/cpuset filesystem already mounted and the >>> + * per-core cpusets already initialized. >>> + * >>> + >>> ******************************************************************************/ >>> +static void set_per_core_cpusets_isolated( const char *path, cpu_set_t >>> *mask, >>> + int on_off ) >>> +{ >>> + int retval, i, core_fileno, cpu_num_offset; >>> + >>> + if ( on_off ) >>> + ODPH_DBG( "Disabling load balancing on per-core cpusets in >>> %s\n", path ); >>> + else >>> + ODPH_DBG( "Enabling load balancing on per-core cpusets in >>> %s\n", path ); >>> + >>> + >>> + /* >>> + * Set the pathname and fieldname path strings to the base of the >>> path >>> + * to the specified 'parent' cpuset. 
>>> + */ >>> + pthread_cleanup_push( (void(*)(void *))sem_post, (void >>> *)&fieldname_lock ); >>> + newpathname( path ); >>> + fieldname_buf[end_cpuset_base_path] = '\0'; >>> + >>> + /* >>> + * Create an individual cpuset for each CPU to facilitate isolation >>> + */ >>> + strcat_bounded( fieldname_buf, "cpu" ); >>> + /* >>> + * fieldname_buf == /dev/cpuset/<path>/cpu >>> + * mark the location where we append the CPU number >>> + */ >>> + cpu_num_offset = strnlen( fieldname_buf, (sizeof( fieldname_buf ) - >>> 1) ); >>> + >>> + for ( i = 0; i < numcpus; i++ ) { >>> + if ( CPU_ISSET( i, mask) ) { >>> + snprintf( cpuname, (sizeof( cpuname ) - 1), "%d", i ); >>> + strcat_bounded( fieldname_buf, cpuname ); >>> + /* >>> + * fieldname_buf == /dev/cpuset/<path>/cpu<n> >>> + * where <n> is the current core number (0 -> numcpus-1) >>> + * Modify the cpuset tree for this core only >>> + */ >>> + mkdir( fieldname_buf, DIRMODE ); >>> + >>> + strcat_bounded( fieldname_buf, "/" ); >>> + if ( cpuset_prefix_required ) >>> + strcat_bounded( fieldname_buf, "cpuset." ); >>> + /* Mark the end of the path string for this core */ >>> + end_field_base_path = strnlen( fieldname_buf, >>> + (sizeof( fieldname_buf ) - >>> 1) ); >>> + >>> + /* Create a path string to the "sched_load_balance" field */ >>> + newfieldname( "sched_load_balance" ); >>> + /* >>> + * fieldname_buf == >>> + * >>> "/dev/cpuset/<path>/cpu<n>/cpuset.sched_load_balance" >>> + * -or- >>> "/dev/cpuset/<path>/cpu<n>/sched_load_balance" >>> + * Set the specified load balancing on this single-core >>> cpuset >>> + */ >>> + core_fileno = open( fieldname_buf, (O_RDWR | O_CREAT | >>> O_TRUNC), >>> + FILEMODE ); >>> + if ( core_fileno > 0 ) { >>> + if ( on_off ) >>> + retval = write( core_fileno, "0", 1 ); >>> + else >>> + retval = write( core_fileno, "1", 1 ); >>> + close( core_fileno ); >>> + } >>> + >>> + /* Create a path string to the "sched_relax_domain_level" >>> field */ >>> + newfieldname( "sched_relax_domain_level" ); >>> + /* >>> + * fieldname_buf == >>> + * >>> "/dev/cpuset/<path>/cpu<n>/cpuset.sched_relax_domain_level" >>> + * -or- >>> "/dev/cpuset/<path>/cpu<n>/sched_relax_domain_level" >>> + * Set the specified behavior on this single-core cpuset >>> + */ >>> + core_fileno = open( fieldname_buf, (O_RDWR | O_CREAT | >>> O_TRUNC), >>> + FILEMODE ); >>> + if ( core_fileno > 0 ) { >>> + if ( on_off ) >>> + retval = write( core_fileno, "0", 1 ); >>> + else >>> + retval = write( core_fileno, "-1", 2 ); >>> + close( core_fileno ); >>> + } >>> + >>> + /* >>> + * Reset the current field pathname to: >>> + * fieldname_buf == /dev/cpuset/<path>/cpu >>> + * in preparation for the next CPU core >>> + * in the data plane cpuset mask >>> + */ >>> + fieldname_buf[cpu_num_offset] = '\0'; >>> + } >>> + } >>> + >>> + /* Make the C compiler happy... do something with retval */ >>> + if ( retval ) retval = 0; >>> + >>> + /* Release the lock on the fieldname_buf and the cpuset */ >>> + releasefieldname(); >>> + pthread_cleanup_pop( 0 ); >>> +} >>> + >>> + >>> +/* */ >>> >>> +/****************************************************************************** >>> + * >>> + * Modify the per-cpu management tree for the specified cpuset >>> + * to either enable or disable full dynticks operation on each >>> single-core >>> + * cpuset descended from the specified parent cpuset. >>> + * Assumes the /dev/cpuset filesystem already mounted and the >>> + * per-core cpusets already initialized. 
>>> + * >>> + >>> ******************************************************************************/ >>> +static void request_per_core_dynticks( const char *path, cpu_set_t >>> *mask, >>> + int on_off ) >>> +{ >>> + int retval, i, core_fileno, cpu_num_offset; >>> + if ( on_off ) >>> + ODPH_DBG( "Requesting dynticks on per-core cpusets in %s\n", >>> path ); >>> + else >>> + ODPH_DBG( "Dynticks not needed on per-core cpusets in %s\n", >>> path ); >>> + >>> + >>> + /* >>> + * Set the pathname and fieldname path strings to the base of the >>> path >>> + * to the specified 'parent' cpuset. >>> + */ >>> + pthread_cleanup_push( (void(*)(void *))sem_post, (void >>> *)&fieldname_lock ); >>> + newpathname( path ); >>> + fieldname_buf[end_cpuset_base_path] = '\0'; >>> + >>> + /* >>> + * Create an individual cpuset for each CPU to facilitate isolation >>> + */ >>> + strcat_bounded( fieldname_buf, "cpu" ); >>> + /* >>> + * fieldname_buf == /dev/cpuset/<path>/cpu >>> + * mark the location where we append the CPU number >>> + */ >>> + cpu_num_offset = strnlen( fieldname_buf, (sizeof( fieldname_buf ) - >>> 1) ); >>> + >>> + for ( i = 0; i < numcpus; i++ ) { >>> + if ( CPU_ISSET( i, mask) ) { >>> + snprintf( cpuname, (sizeof( cpuname ) - 1), "%d", i ); >>> + strcat_bounded( fieldname_buf, cpuname ); >>> + /* >>> + * fieldname_buf == /dev/cpuset/<path>/cpu<n> >>> + * where <n> is the current core number (0 -> numcpus-1) >>> + * Modify the cpuset tree for this core only >>> + */ >>> + mkdir( fieldname_buf, DIRMODE ); >>> + >>> + strcat_bounded( fieldname_buf, "/" ); >>> + if ( cpuset_prefix_required ) >>> + strcat_bounded( fieldname_buf, "cpuset." ); >>> + /* Mark the end of the path string for this core */ >>> + end_field_base_path = strnlen( fieldname_buf, >>> + (sizeof( fieldname_buf ) - >>> 1) ); >>> + >>> + /* Create a path string to the "fulldynticks" field */ >>> + newfieldname( "fulldynticks" ); >>> + /* >>> + * fieldname_buf == >>> + * "/dev/cpuset/<path>/cpu<n>/cpuset/fulldynticks" >>> + * -or- "/dev/cpuset/<path>/cpu<n>/fulldynticks" >>> + */ >>> + core_fileno = open( fieldname_buf, (O_RDWR | O_CREAT | >>> O_TRUNC), >>> + FILEMODE ); >>> + if ( core_fileno > 0 ) { >>> + if ( on_off ) >>> + /* Mark this single-core cpuset for full dynticks >>> mode */ >>> + retval = write( core_fileno, "1", 1 ); >>> + else >>> + /* Mark this single-core cpuset for housekeeping >>> mode */ >>> + retval = write( core_fileno, "0", 1 ); >>> + close( core_fileno ); >>> + } >>> + >>> + /* >>> + * Create an absolute path string to the "quiesce" field >>> + * for the cpuset >>> + */ >>> + newfieldname( "quiesce" ); >>> + /* >>> + * fieldname_buf == >>> + * "/dev/cpuset/<path>/cpu<n>/cpuset/quiesce" >>> + * -or- "/dev/cpuset/<path>/cpu<n>/quiesce" >>> + */ >>> + core_fileno = open( fieldname_buf, (O_RDWR | O_CREAT | >>> O_TRUNC), >>> + FILEMODE ); >>> + if ( core_fileno > 0 ) { >>> + if ( on_off ) >>> + /* Migrate timers / hrtimers away from this cpuset >>> */ >>> + retval = write( core_fileno, "1", 1 ); >>> + else >>> + /* Enable migration of timers / hrtimers onto this >>> cpuset */ >>> + retval = write( core_fileno, "0", 1 ); >>> + close( core_fileno ); >>> + } >>> + >>> + /* >>> + * Reset the current field pathname to: >>> + * fieldname_buf == /dev/cpuset/<path>/cpu >>> + * in preparation for the next CPU core >>> + * in the data plane cpuset mask >>> + */ >>> + fieldname_buf[cpu_num_offset] = '\0'; >>> + } >>> + } >>> + >>> + /* Make the C compiler happy... 
do something with retval */ >>> + if ( retval ) retval = 0; >>> + >>> + /* Release the lock on the fieldname_buf and the cpuset */ >>> + releasefieldname(); >>> + pthread_cleanup_pop( 0 ); >>> +} >>> + >>> +/* */ >>> >>> +/****************************************************************************** >>> + * >>> + * Create a new per-cpu management tree for the specified parent cpuset >>> + * Assumes the /dev/cpuset filesystem already mounted and the >>> + * parent cpuset already initialized. >>> + * >>> + >>> ******************************************************************************/ >>> +static void create_per_core_cpusets( const char *path, cpu_set_t *mask, >>> + int isolated ) >>> +{ >>> + int retval, i, core_fileno, cpu_num_offset; >>> + >>> + /* >>> + * Set the pathname and fieldname path strings to the base of the >>> path >>> + * to the specified 'parent' cpuset. >>> + */ >>> + pthread_cleanup_push( (void(*)(void *))sem_post, (void >>> *)&fieldname_lock ); >>> + newpathname( path ); >>> + fieldname_buf[end_cpuset_base_path] = '\0'; >>> + >>> + /* >>> + * Create an individual cpuset for each CPU to facilitate isolation >>> + */ >>> + strcat_bounded( fieldname_buf, "cpu" ); >>> + /* >>> + * fieldname_buf == /dev/cpuset/<path>/cpu >>> + * mark the location where we append the CPU number >>> + */ >>> + cpu_num_offset = strnlen( fieldname_buf, (sizeof( fieldname_buf ) - >>> 1) ); >>> + >>> + for ( i = 0; i < numcpus; i++ ) { >>> + if ( CPU_ISSET( i, mask) ) { >>> + snprintf( cpuname, (sizeof( cpuname ) - 1), "%d", i ); >>> + strcat_bounded( fieldname_buf, cpuname ); >>> + /* >>> + * fieldname_buf == /dev/cpuset/<path>/cpu<n> >>> + * where <n> is the current core number (0 -> numcpus-1) >>> + * Create a new cpuset tree for this core only >>> + */ >>> + mkdir( fieldname_buf, DIRMODE ); >>> + >>> + strcat_bounded( fieldname_buf, "/" ); >>> + if ( cpuset_prefix_required ) >>> + strcat_bounded( fieldname_buf, "cpuset." 
); >>> + /* Mark the end of the path string for this core */ >>> + end_field_base_path = strnlen( fieldname_buf, >>> + (sizeof( fieldname_buf ) - >>> 1) ); >>> + >>> + /* Create an absolute path string to the "mems" field */ >>> + newfieldname( "mems" ); >>> + /* >>> + * fieldname_buf == "/dev/cpuset/<path>/cpu<n>/cpuset.mems" >>> + * -or- "/dev/cpuset/<path>/cpu<n>/mems" >>> + * Init the "mems" field so all cpusets share the same >>> memory map >>> + */ >>> + core_fileno = open( fieldname_buf, (O_RDWR | O_CREAT | >>> O_TRUNC), >>> + FILEMODE ); >>> + if ( core_fileno > 0 ) { >>> + retval = write( core_fileno, "0", 1 ); >>> + close( core_fileno ); >>> + } >>> + >>> + /* Create an absolute path string to the "cpus" field */ >>> + newfieldname( "cpus" ); >>> + /* >>> + * fieldname_buf == "/dev/cpuset/<path>/cpu<n>/cpuset.cpus" >>> + * -or- "/dev/cpuset/<path>/cpu<n>/cpus" >>> + * Init the CPU list to contain only the current core >>> + */ >>> + core_fileno = open( fieldname_buf, (O_RDWR | O_CREAT | >>> O_TRUNC), >>> + FILEMODE ); >>> + if ( core_fileno > 0 ) { >>> + retval = write( core_fileno, cpuname, strlen( cpuname ) >>> ); >>> + close( core_fileno ); >>> + } >>> + >>> + /* Create a path string to the "sched_load_balance" field */ >>> + newfieldname( "sched_load_balance" ); >>> + /* >>> + * fieldname_buf == >>> + * >>> "/dev/cpuset/<path>/cpu<n>/cpuset.sched_load_balance" >>> + * -or- >>> "/dev/cpuset/<path>/cpu<n>/sched_load_balance" >>> + * Set the specified load balancing on this single-core >>> cpuset >>> + */ >>> + core_fileno = open( fieldname_buf, (O_RDWR | O_CREAT | >>> O_TRUNC), >>> + FILEMODE ); >>> + if ( core_fileno > 0 ) { >>> + if ( isolated ) >>> + retval = write( core_fileno, "0", 1 ); >>> + else >>> + retval = write( core_fileno, "1", 1 ); >>> + close( core_fileno ); >>> + } >>> + >>> + /* Create a path string to the "sched_relax_domain_level" >>> field */ >>> + newfieldname( "sched_relax_domain_level" ); >>> + /* >>> + * fieldname_buf == >>> + * >>> "/dev/cpuset/<path>/cpu<n>/cpuset.sched_relax_domain_level" >>> + * -or- >>> "/dev/cpuset/<path>/cpu<n>/sched_relax_domain_level" >>> + * Set the specified behavior on this single-core cpuset >>> + */ >>> + core_fileno = open( fieldname_buf, (O_RDWR | O_CREAT | >>> O_TRUNC), >>> + FILEMODE ); >>> + if ( core_fileno > 0 ) { >>> + if ( isolated ) >>> + retval = write( core_fileno, "0", 1 ); >>> + else >>> + retval = write( core_fileno, "-1", 2 ); >>> + close( core_fileno ); >>> + } >>> + >>> + /* >>> + * Reset the current field pathname to: >>> + * fieldname_buf == /dev/cpuset/<path>/cpu >>> + * in preparation for the next CPU core >>> + * in the data plane cpuset mask >>> + */ >>> + fieldname_buf[cpu_num_offset] = '\0'; >>> + } >>> + } >>> + >>> + /* Make the C compiler happy... do something with retval */ >>> + if ( retval ) retval = 0; >>> + >>> + /* Release the lock on the fieldname_buf and the cpuset */ >>> + releasefieldname(); >>> + pthread_cleanup_pop( 0 ); >>> +} >>> + >>> +/* */ >>> >>> +/****************************************************************************** >>> + * >>> + * Delete the per-cpu management tree for the specified cpuset >>> + * >>> + >>> ******************************************************************************/ >>> +static void delete_per_core_cpusets( const char *path, cpu_set_t *mask ) >>> +{ >>> + int i, cpu_num_offset; >>> + >>> + /* >>> + * Return the CPUs in the per_core cpusets to general purpose duty. >>> + * Turn load balancing back on and indicate full dynticks not >>> needed. 
>>> + * This is done here to inform the kernel as to how these cores may >>> be >>> + * used and operated. >>> + */ >>> + set_per_core_cpusets_isolated( path, mask, 0 ); >>> + request_per_core_dynticks( path, mask, 0 ); >>> + >>> + /* >>> + * Set the pathname and fieldname path strings to the base of the >>> path >>> + * to the specified cpuset. >>> + */ >>> + pthread_cleanup_push( (void(*)(void *))sem_post, (void >>> *)&fieldname_lock ); >>> + newpathname( path ); >>> + fieldname_buf[end_cpuset_base_path] = '\0'; >>> + >>> + /* >>> + * Delete the individual cpuset for each CPU >>> + */ >>> + strcat_bounded( fieldname_buf, "cpu" ); >>> + >>> + /* >>> + * fieldname_buf == /dev/cpuset/<path>/cpu >>> + * mark the location where we append the CPU number >>> + */ >>> + cpu_num_offset = strnlen( fieldname_buf, (sizeof( fieldname_buf ) - >>> 1) ); >>> + >>> + for ( i = 0; i < numcpus; i++ ) { >>> + if ( CPU_ISSET( i, mask) ) { >>> + snprintf( cpuname, sizeof( cpuname ), "%d", i ); >>> + strcat_bounded( fieldname_buf, cpuname ); >>> + /* Mark the end of the path string for this cpuset */ >>> + end_cpuset_base_path = strnlen( fieldname_buf, >>> + (sizeof( fieldname_buf ) - >>> 1) ); >>> + >>> + /* >>> + * Depopulate the CPU list for the cpuset and remove its >>> + * directory hierarchy >>> + */ >>> + cpuset_delete(); >>> + >>> + /* >>> + * Reset the pathname to: >>> + * fieldname_buf == /dev/cpuset/<path>/cpu >>> + * in preparation for the next CPU core >>> + * in the data plane cpuset mask >>> + */ >>> + fieldname_buf[cpu_num_offset] = '\0'; >>> + } >>> + } >>> + >>> + /* Release the lock on the fieldname_buf and the cpuset */ >>> + releasefieldname(); >>> + pthread_cleanup_pop( 0 ); >>> +} >>> + >>> +/* */ >>> >>> +/****************************************************************************** >>> + * >>> + * Read the specified value from the specified field of the cpuset >>> per-cpu >>> + * management tree for the specified CPU and return it at caller's >>> value ptr. >>> + * If the file for the specified field is missing or empty then *value >>> is NULL. >>> + * >>> + * Assumes the /dev/cpuset filesystem already mounted and the >>> + * cpusets already initialized. >>> + * >>> + >>> ******************************************************************************/ >>> +static void get_per_cpu_field_for( int cpu, const char *path, const >>> char *field, >>> + char *value, size_t len ) >>> +{ >>> + int retval = 0; >>> + int num_read, core_fileno; >>> + >>> + pthread_cleanup_push( (void(*)(void *))sem_post, (void >>> *)&pathname_lock ); >>> + sem_wait( &pathname_lock ); >>> + >>> + /* Get the name of this single-core cpuset based on the specified >>> CPU */ >>> + strcpy( pathname_buf, path ); >>> + strcat_bounded( pathname_buf, "/cpu" ); >>> + snprintf( cpuname, sizeof( cpuname ), "%d", cpu ); >>> + strcat_bounded( pathname_buf, cpuname ); >>> + >>> + /* Set the fieldname path string to point to fields within this >>> cpuset */ >>> + pthread_cleanup_push( (void(*)(void *))sem_post, (void >>> *)&fieldname_lock ); >>> + newpathname( pathname_buf ); >>> + >>> + /* >>> + * fieldname_buf == "/dev/cpuset/<path>/cpu<n>/cpuset." 
>>> + * -or- "/dev/cpuset/<path>/cpu<n>/" >>> + */ >>> + strcat_bounded( fieldname_buf, field ); >>> + /* >>> + * fieldname_buf == "/dev/cpuset/<path>/cpu<n>/cpuset.<field>" >>> + * -or- "/dev/cpuset/<path>/cpu<n>/<field>" >>> + */ >>> + if ( value ) { >>> + core_fileno = open( fieldname_buf, O_RDONLY ); >>> + if ( core_fileno > 0 ) { >>> + for ( num_read = 0; num_read < len; ) { >>> + num_read = read( core_fileno, (void *)value, len ); >>> + if ( (num_read < len) && (errno != EINTR) ) >>> + retval = -1; >>> + break; >>> + } >>> + /* If the field file is missing or empty */ >>> + close( core_fileno ); >>> + if ( len && (retval < 0) ) { >>> + *value = (char)'\0'; >>> + ODPH_ERR( "Failed to get value for %s - error %s\n", >>> + fieldname_buf, errstring( errno ) ); >>> + } >>> + } else >>> + *value = (char)'\0'; >>> + } >>> + >>> + /* Release the lock on the fieldname_buf and the cpuset */ >>> + releasefieldname(); >>> + pthread_cleanup_pop( 0 ); >>> + pthread_cleanup_pop( 1 ); >>> + >>> + /* Make the C compiler happy... do something with retval */ >>> + if ( retval ) retval = 0; >>> +} >>> + >>> +/* */ >>> >>> +/****************************************************************************** >>> + * >>> + * Write the specified value to the specified field of the per-cpu >>> + * management tree for the specified CPU and cpuset >>> + * If value is NULL then the file for the specified field will be >>> truncated. >>> + * >>> + * Assumes the /dev/cpuset filesystem already mounted and the >>> + * cpusets already initialized. >>> + * >>> + >>> ******************************************************************************/ >>> +static void set_per_cpu_field_for( int cpu, const char *path, const >>> char *field, >>> + const char *value ) >>> +{ >>> + int retval, core_fileno; >>> + >>> + pthread_cleanup_push( (void(*)(void *))sem_post, (void >>> *)&pathname_lock ); >>> + sem_wait( &pathname_lock ); >>> + >>> + /* Get the name of this single-core cpuset based on the specified >>> CPU */ >>> + strcpy( pathname_buf, path ); >>> + strcat_bounded( pathname_buf, "/cpu" ); >>> + snprintf( cpuname, sizeof( cpuname ), "%d", cpu ); >>> + strcat_bounded( pathname_buf, cpuname ); >>> + >>> + /* Set the fieldname path string to point to fields within this >>> cpuset */ >>> + pthread_cleanup_push( (void(*)(void *))sem_post, (void >>> *)&fieldname_lock ); >>> + newpathname( pathname_buf ); >>> + >>> + /* >>> + * fieldname_buf == "/dev/cpuset/<path>/cpu<n>/cpuset." >>> + * -or- "/dev/cpuset/<path>/cpu<n>/" >>> + */ >>> + strcat_bounded( fieldname_buf, field ); >>> + /* >>> + * fieldname_buf == "/dev/cpuset/<path>/cpu<n>/cpuset.<field>" >>> + * -or- "/dev/cpuset/<path>/cpu<n>/<field>" >>> + */ >>> + core_fileno = open( fieldname_buf, (O_RDWR | O_CREAT | O_TRUNC), >>> FILEMODE ); >>> + if ( core_fileno > 0 ) { >>> + /* If value is NULL then the field file will simply be >>> truncated */ >>> + if ( value ) >>> + retval = write( core_fileno, value, strlen( value ) ); >>> + close( core_fileno ); >>> + } >>> + >>> + /* Release the lock on the fieldname_buf and the cpuset */ >>> + releasefieldname(); >>> + pthread_cleanup_pop( 0 ); >>> + pthread_cleanup_pop( 1 ); >>> + >>> + /* Make the C compiler happy... 
do something with retval */ >>> + if ( retval ) retval = 0; >>> +} >>> + >>> +/* */ >>> >>> +/****************************************************************************** >>> + * >>> + * Migrate timers and hrtimers away from the specified cpuset's CPU >>> cores >>> + * >>> + >>> ******************************************************************************/ >>> +static void quiesce_cpus( const char *path ) >>> +{ >>> + int retval, fileno; >>> + >>> + /* >>> + * Set the fieldname path string to the base of the path >>> + * to the caller's specified cpuset. >>> + */ >>> + pthread_cleanup_push( (void(*)(void *))sem_post, (void >>> *)&fieldname_lock ); >>> + newpathname( path ); >>> + >>> + /* >>> + * Create an absolute path string to the "quiesce" field >>> + * for the cpuset >>> + */ >>> + newfieldname( "quiesce" ); >>> + >>> + /* >>> + * Migrate timers / hrtimers away from the cpuset's CPUs >>> + */ >>> + fileno = open( fieldname_buf, (O_RDWR | O_CREAT | O_TRUNC), >>> FILEMODE ); >>> + if ( fileno > 0 ) { >>> + retval = write( fileno, "1", 1 ); >>> + close( fileno ); >>> + } >>> + >>> + pthread_cleanup_pop( 1 ); >>> + >>> + /* Make the C compiler happy... do something with retval */ >>> + if ( retval ) retval = 0; >>> +} >>> + >>> +/* */ >>> >>> +/****************************************************************************** >>> + * >>> + * Move the specified task away from its current cpuset >>> + * and onto the cores of the specified new cpuset >>> + * Specifying a NULL path string pointer defaults to /dev/cpuset >>> + * >>> + * Assumes the caller passes in a legitimate task PID string. >>> + * >>> + * Returns an int == zero if migration successful or -1 if an error >>> occurred >>> + >>> ******************************************************************************/ >>> +static int migrate_task( const char *callers_pid, const char >>> *to_cpuset_path ) >>> +{ >>> + size_t num_read, num_to_write; >>> + int i, to_fileno, proc_pid_fileno, end_of_file, migrate_failed; >>> + static char my_pid[24]; >>> + static char written_pid[24]; >>> + static char cur[2]; >>> + static char to_path_buf[128]; >>> + static char proc_path_buf[80]; >>> + >>> + pthread_cleanup_push( (void(*)(void *))sem_post, (void >>> *)&to_path_lock ); >>> + sem_wait( &to_path_lock ); >>> + >>> + /* Create strings containing the full path to the caller's cpusets >>> */ >>> + strcpy_bounded( to_path_buf, "/dev/cpuset/" ); >>> + >>> + /* Mark index of trailing slash for possible overwrite */ >>> + num_read = strlen( to_path_buf ) - 1; >>> + if ( to_cpuset_path ) >>> + strcat_bounded( to_path_buf, to_cpuset_path ); >>> + else >>> + /* Migrate the task to default cpuset - remove trailing slash */ >>> + to_path_buf[num_read] = '\0'; >>> + >>> + /* >>> + * We will be manipulating the tasks in this cpuset, so >>> + * extend the path string to specify the 'tasks' file. >>> + */ >>> + strcat_bounded( to_path_buf, "/tasks" ); >>> + >>> + /* >>> + * Assemble a path to the status file for the caller's task in /proc >>> + * to verify that the process still exists >>> + */ >>> + for ( i = 0; i < strlen( callers_pid ); i++ ) { >>> + /* >>> + * Don't include any trailing newline from callers_pid >>> + * into the pathname string being built. 
>>> + */ >>> + if ( callers_pid[i] != (char)'\n' ) >>> + written_pid[i] = callers_pid[i]; >>> + else >>> + written_pid[i] = (char)'\0'; >>> + } >>> + written_pid[i] = (char)'\0'; >>> + strcpy_bounded( proc_path_buf, "/proc/" ); >>> + strcat_bounded( proc_path_buf, written_pid ); >>> + strcat_bounded( proc_path_buf, "/status" ); >>> + proc_pid_fileno = open( proc_path_buf, O_RDONLY ); >>> + >>> + /* Init the result return value */ >>> + migrate_failed = 0; >>> + >>> + /* Ignore the caller's task if its PID is stale */ >>> + if ( proc_pid_fileno > 0 ) { >>> + to_fileno = open( to_path_buf, (O_RDWR | O_CREAT | O_APPEND), >>> + FILEMODE ); >>> + } else { >>> + to_fileno = -1; >>> + migrate_failed = -1; >>> + ODPH_ERR( "%s not found - failed to migrate %s\n", >>> + proc_path_buf, callers_pid ); >>> + } >>> + >>> + if ( to_fileno > 0 ) { >>> + /* Capture our own ttid for comparison purposes */ >>> + snprintf( my_pid, (sizeof( my_pid ) - 1), "%d", gettaskid() ); >>> + >>> + /* >>> + * Now let's try to migrate the task. >>> + * Try to write the PID for the caller's task into >>> + * the task list for the specified 'to' cpuset. >>> + */ >>> + errno = 0; >>> + num_to_write = strlen( written_pid ); >>> + for ( num_read = 0; num_read < num_to_write; ) { >>> + num_read = write( to_fileno, written_pid, num_to_write ); >>> + if ( (num_read == (size_t)-1) && (errno != EINTR) ) >>> + migrate_failed = -1; >>> + break; >>> + } >>> + >>> + if ( migrate_failed ) { >>> + /* >>> + * Scan the task's /proc status file to find its name. >>> + */ >>> + for ( end_of_file = 0; !end_of_file; ) { >>> + /* Read one line of info from the task's /proc status >>> file */ >>> + for ( i = 0, cur[0] = (char)'\0'; (cur[0] != >>> (char)'\n'); ) { >>> + num_read = read( proc_pid_fileno, (void *)cur, 1 ); >>> + if ( num_read > 0 ) { >>> + if ( cur[0] != (char)'\n' ) { >>> + proc_path_buf[i] = cur[0]; >>> + i++; >>> + } else { >>> + proc_path_buf[i] = (char)'\0'; >>> + } >>> + } else { >>> + proc_path_buf[i] = '\0'; >>> + if ( errno != EINTR ) { >>> + end_of_file = 1; >>> + break; >>> + } >>> + } >>> + } >>> + >>> + /* cpulist should contain a string unless EOF reached */ >>> + if ( !(strncmp( proc_path_buf, "Name: ", 6 )) ) >>> + break; >>> + } >>> + /* Failed to migrate current task */ >>> + ODPH_ERR( "Failed to migrate pid %s - error %s\n", >>> + written_pid, errstring( errno ) ); >>> + } else { >>> + /* >>> + * If we are migrating our own task, sleep for 50 msec >>> + * to allow time for migration to occur. >>> + */ >>> + if ( !strncmp( written_pid, my_pid, strlen( my_pid ) ) ) >>> + sleep_nsec( 50000000 ); >>> + } >>> + close( to_fileno ); >>> + } >>> + if ( proc_pid_fileno > 0 ) >>> + close( proc_pid_fileno ); >>> + >>> + pthread_cleanup_pop( 1 ); >>> + >>> + return( migrate_failed ); >>> +} >>> + >>> +/* */ >>> >>> +/****************************************************************************** >>> + * >>> + * Move all tasks which can be migrated off of the cores of the current >>> cpuset >>> + * and onto the cores of the specified new cpuset >>> + * Specifying a NULL path string pointer defaults to /dev/cpuset >>> + * The 'except' parameter is an array of pid_t values >>> + * which SHOULD NOT be migrated away from this core - terminated by >>> + * a zero pid_t value. If the pointer to this array is NULL or if the >>> + * first pid_t is zero, the function will try to migrate all processes >>> off of >>> + * the 'from' cpuset. 
>>> + * >>> + >>> ******************************************************************************/ >>> +static void migrate_tasks( const char *from_cpuset_path, >>> + const char *to_cpuset_path, pid_t *except ) >>> +{ >>> + size_t num_read; >>> + int i, from_fileno, pid_ready, end_of_file; >>> + char callers_pid[24]; >>> + char cur[1]; >>> + static char from_path_buf[128]; >>> + uint_64_t pid_numeric = 0; >>> + pid_t cur_match; >>> + >>> + pthread_cleanup_push( (void(*)(void *))sem_post, (void >>> *)&from_path_lock ); >>> + sem_wait( &from_path_lock ); >>> + >>> + /* Create strings containing the full path to the caller's cpusets >>> */ >>> + strcpy_bounded( from_path_buf, "/dev/cpuset/" ); >>> + >>> + /* Mark index of trailing slash for possible overwrite */ >>> + num_read = strlen( from_path_buf ) - 1; >>> + if ( from_cpuset_path ) { >>> + strcat_bounded( from_path_buf, from_cpuset_path ); >>> + } else { >>> + /* Migrate the task from default cpuset - remove trailing slash >>> */ >>> + from_path_buf[num_read] = '\0'; >>> + } >>> + >>> + if ( to_cpuset_path ) >>> + ODPH_DBG( "Migrating tasks from %s to /dev/cpuset/%s\n", >>> + from_path_buf, to_cpuset_path ); >>> + else >>> + ODPH_DBG( "Migrating tasks from %s to /dev/cpuset\n", >>> from_path_buf ); >>> + >>> + /* >>> + * We will be manipulating the tasks in this cpuset, so >>> + * extend the path string to specify the 'tasks' file. >>> + */ >>> + strcat_bounded( from_path_buf, "/tasks" ); >>> + from_fileno = open( from_path_buf, O_RDWR ); >>> + >>> + if ( from_fileno > 0 ) { >>> + for ( end_of_file = 0; !end_of_file; ) { >>> + /* Read one line of PID info from the 'from' tasks file */ >>> + callers_pid[0] = '\0'; >>> + pid_ready = 0; >>> + for ( i = 0; i < sizeof( callers_pid ); ) { >>> + num_read = read( from_fileno, (void *)cur, 1 ); >>> + switch ( num_read ) { >>> + case 0 : >>> + end_of_file = 1; >>> + break; >>> + case 1 : >>> + if ( cur[0] == (char)'\n' ) { >>> + callers_pid[i] = '\0'; >>> + pid_ready = 1; >>> + i = 0; >>> + } else { >>> + if ( (i + 1) < sizeof( callers_pid ) ) { >> >> > > _______________________________________________ > lng-odp mailing list > lng-odp@lists.linaro.org > https://lists.linaro.org/mailman/listinfo/lng-odp > > -- Mike Holmes Technical Manager - Linaro Networking Group Linaro.org <http://www.linaro.org/> *│ *Open source software for ARM SoCs
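For readers unfamiliar with the cpuset pseudo-filesystem driven by the helpers quoted above: the per-core setup reduces to a few mkdir() and write() calls on files under /dev/cpuset. The stand-alone sketch below is illustrative only - it is not part of the patch, error handling is omitted, the 'dplane' cpuset name is just an example, and it assumes a kernel that wants the 'cpuset.' prefix on the control files.

#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/stat.h>
#include <unistd.h>

/* Write a short value string into one cpuset control file */
static void put_field( const char *path, const char *val )
{
    int fd = open( path, (O_WRONLY | O_CREAT | O_TRUNC), 0644 );

    if ( fd >= 0 ) {
        write( fd, val, strlen( val ) );
        close( fd );
    }
}

/* Roughly what the quoted create_per_core_cpusets() does for one isolated core */
static void isolate_one_core( int cpu )
{
    char dir[64], field[96], num[4];

    snprintf( num, sizeof( num ), "%d", cpu );
    snprintf( dir, sizeof( dir ), "/dev/cpuset/dplane/cpu%d", cpu );
    mkdir( dir, 0755 );

    snprintf( field, sizeof( field ), "%s/cpuset.mems", dir );
    put_field( field, "0" );      /* all cpusets share memory node 0 */
    snprintf( field, sizeof( field ), "%s/cpuset.cpus", dir );
    put_field( field, num );      /* only this core in the cpuset */
    snprintf( field, sizeof( field ), "%s/cpuset.sched_load_balance", dir );
    put_field( field, "0" );      /* no scheduler load balancing */
    snprintf( field, sizeof( field ), "%s/cpuset.sched_relax_domain_level", dir );
    put_field( field, "0" );      /* no event-driven rebalancing */
}

Task placement then amounts to writing a PID into /dev/cpuset/dplane/cpu<n>/tasks, which is what the migrate_task() helper quoted above automates.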
Hmmm - runtime init is another possibility I hadn't fully considered. My code already checks for isolation support. Presumably I could set a runtime global flag or flags to indicate to all interested parties just which isolation features were enabled, and then select optimal behavior based on runtime configuration. The tricky part is deciding what the scope of the flags should be... odp_pktio_perf.c seems to be Linux-specific already and could conceivably use flags defined in the Linux isolation helpers to determine appropriate runtime behavior. But in some cases - e.g. odp_cpumask_default_worker() and odp_cpumask_default_control() - perhaps the behavior of the linux-generic implementation itself should be modified rather than substituting a helper call for the linux-generic one? Coming late to the party, I've missed out on the discussions and consensus as to where to draw the lines of abstraction, transparency, and least common denominator here. To what extent should linux-generic itself handle isolation support (or the lack of it) in an application-transparent manner? It seems a somewhat moot point for test applications intended to run on a Linux development host - but what about the wider scope of end-user applications? Even if linux-generic itself isn't intended to be performance-oriented, there may be a case for including isolation support there.

On Fri, Nov 13, 2015 at 1:49 PM, Nicolas Morey-Chaisemartin < nmorey@kalray.eu> wrote: > > > On 11/13/2015 07:56 PM, Mike Holmes wrote: > > > > On 13 November 2015 at 13:51, Gary Robertson <gary.robertson@linaro.org> > wrote: > >> Oops - clicked the wrong reply option. >> >> Nicolas raises an excellent point. I think at least a configuration >> option may be needed to enable or disable isolation. >> > > I think that ./configure should check for the support and, if it is > available, provide the configure option --enable-test-isolated; this is how > nearly all our other optional capabilities work. In this case, if support is > there, the default would be to enable --enable-test-isolated. > > It is. But the latest talk about RPM packaging and runtime compatibility > tends to move things another way. > I guess we want to move as many things as possible to runtime init so a > single compiled binary can leverage the best performance on whatever > platform it is running on (as long as it is ABI compliant) > > Nicolas > >
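To make the runtime-flag idea above concrete, here is a rough sketch of what such a capability probe might look like. The flag names, the global variable, and the probe function are hypothetical - nothing below is in the submitted patch. The first check mirrors the cpuset.cpus probe already done in init_cpusets(); the 'fulldynticks' probe is only an assumption about where the LNG field would appear.

#include <fcntl.h>
#include <unistd.h>

/* Hypothetical capability flags - not part of the submitted patch */
#define ODPH_ISOL_CAP_CPUSETS        (1u << 0)  /* cpuset filesystem usable   */
#define ODPH_ISOL_CAP_FULL_DYNTICKS  (1u << 1)  /* LNG 'fulldynticks' present */

static unsigned int odph_isolation_caps;  /* set once during global init */

static void probe_isolation_caps( void )
{
    int fd;

    odph_isolation_caps = 0;

    /* cpuset support: a mounted /dev/cpuset exposes a 'cpus' control file */
    fd = open( "/dev/cpuset/cpuset.cpus", O_RDONLY );
    if ( fd < 0 )
        fd = open( "/dev/cpuset/cpus", O_RDONLY );
    if ( fd >= 0 ) {
        odph_isolation_caps |= ODPH_ISOL_CAP_CPUSETS;
        close( fd );
    }

    /* LNG NO_HZ_FULL support: per-cpuset 'fulldynticks' control file */
    fd = open( "/dev/cpuset/cpuset.fulldynticks", O_RDONLY );
    if ( fd < 0 )
        fd = open( "/dev/cpuset/fulldynticks", O_RDONLY );
    if ( fd >= 0 ) {
        odph_isolation_caps |= ODPH_ISOL_CAP_FULL_DYNTICKS;
        close( fd );
    }
}

odph_isolation_init_global() could then publish these flags, letting callers such as odp_pktio_perf.c - or the linux-generic odp_cpumask_default_worker() / odp_cpumask_default_control() paths, if that is where the line is drawn - choose the appropriate behavior at runtime rather than at ./configure time.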
diff --git a/helper/Makefile.am b/helper/Makefile.am index e72507e..f9c3558 100644 --- a/helper/Makefile.am +++ b/helper/Makefile.am @@ -11,6 +11,7 @@ helperincludedir = $(includedir)/odp/helper/ helperinclude_HEADERS = \ $(srcdir)/include/odp/helper/ring.h \ $(srcdir)/include/odp/helper/linux.h \ + $(srcdir)/include/odp/helper/linux_isolation.h \ $(srcdir)/include/odp/helper/chksum.h\ $(srcdir)/include/odp/helper/eth.h\ $(srcdir)/include/odp/helper/icmp.h\ @@ -29,6 +30,7 @@ noinst_HEADERS = \ __LIB__libodphelper_la_SOURCES = \ linux.c \ + linux_isolation.c \ ring.c \ hashtable.c \ lineartable.c diff --git a/helper/include/odp/helper/linux_isolation.h b/helper/include/odp/helper/linux_isolation.h new file mode 100644 index 0000000..2fc8266 --- /dev/null +++ b/helper/include/odp/helper/linux_isolation.h @@ -0,0 +1,98 @@ +/* Copyright (c) 2013, Linaro Limited + * All rights reserved. + * + * SPDX-License-Identifier: BSD-3-Clause + */ + + +/** + * @file + * + * ODP Linux isolation helper API + * + * This file is an optional helper to odp.h APIs. These functions are provided + * to ease common setups for isolation using cpusets in a Linux system. + * User is free to implement the same setups in other ways (not via this API). + */ + +#ifndef ODP_LINUX_ISOLATION_H_ +#define ODP_LINUX_ISOLATION_H_ + +#ifdef __cplusplus +extern "C" { +#endif + +#include <odp.h> + +/* + * Verify the level of underlying operating system support. + * (Return with error if the OS does not at least support cpusets) + * Set up system-wide CPU masks and cpusets + * (Future) Set up file-based persistent cpuset management layer + * to allow cooperative use of system isolation resources + * by multiple independent ODP instances. + */ +int odph_isolation_init_global( void ); + +/* + * Migrate all tasks from cpusets created for isolation support to the + * generic boot-level single cpuset. + * Remove all isolated CPU environments and cpusets + * Zero out system-wide CPU masks + * (Future) Reset persistent file-based cpuset management layer + * to show no system isolation resources are available. + */ +int odph_isolation_term_global( void ); + +/** + * Creates and launches pthreads + * + * Creates, pins and launches threads to separate CPU's based on the cpumask. 
+ * + * @param thread_tbl Thread table + * @param mask CPU mask + * @param start_routine Thread start function + * @param arg Thread argument + * + * @return Number of threads created + */ +int odph_linux_isolated_pthread_create(odph_linux_pthread_t *thread_tbl, + const odp_cpumask_t *mask, + void *(*start_routine) (void *), + void *arg); + +/** + * Fork a process + * + * Forks and sets CPU affinity for the child process + * + * @param proc Pointer to process state info (for output) + * @param cpu Destination CPU for the child process + * + * @return On success: 1 for the parent, 0 for the child + * On failure: -1 for the parent, -2 for the child + */ +int odph_linux_isolated_process_fork(odph_linux_process_t *proc, int cpu); + +/** + * Fork a number of processes + * + * Forks and sets CPU affinity for child processes + * + * @param proc_tbl Process state info table (for output) + * @param mask CPU mask of processes to create + * + * @return On success: 1 for the parent, 0 for the child + * On failure: -1 for the parent, -2 for the child + */ +int odph_linux_isolated_process_fork_n(odph_linux_process_t *proc_tbl, + const odp_cpumask_t *mask); + +int odph_cpumask_default_worker(odp_cpumask_t *mask, int num); +int odph_cpumask_default_control(odp_cpumask_t *mask, int num ODP_UNUSED); + +#ifdef __cplusplus +} +#endif + +#endif diff --git a/helper/linux_isolation.c b/helper/linux_isolation.c new file mode 100644 index 0000000..5ca6c7f --- /dev/null +++ b/helper/linux_isolation.c @@ -0,0 +1,2901 @@ +/* + * This file contains declarations and definitions of functions and + * data structures which are useful for manipulating cpusets in support + * of OpenDataPlane (ODP) high-performance applications. + * + * Copyright (c) 2015, Linaro Limited + * All rights reserved. + * SPDX-License-Identifier: BSD-3-Clause + */ + +#ifndef _GNU_SOURCE +#define _GNU_SOURCE +#endif + +#include <ctype.h> +#include <dirent.h> +#include <errno.h> +#include <fcntl.h> +#include <fts.h> +#include <pthread.h> +#include <sched.h> +#include <semaphore.h> +#include <signal.h> +#include <stdarg.h> +#include <stdio.h> +#include <stdlib.h> +#include <string.h> +#include <time.h> +#include <unistd.h> +#include <sys/mman.h> +#include <sys/mount.h> +#include <sys/resource.h> +#include <sys/stat.h> +#include <sys/syscall.h> +#include <sys/time.h> +#include <sys/types.h> +#include <sys/wait.h> + +#include <odp/init.h> +#include <odp_internal.h> +#include <odp/cpumask.h> +#include <odp/debug.h> +#include <odp_debug_internal.h> +#include <odp/helper/linux.h> +#include "odph_debug.h" + +typedef unsigned long long uint_64_t; +typedef unsigned int uint32_t; +typedef unsigned short uint16_t; +typedef unsigned char uint8_t; + +/****************************************************************************** + * The following constants are important for determining isolation capacities + * MAX_CPUS_SUPPORTED is used to dimension arrays and some loops in the + * isolation helper code. + * The HOUSEKEEPING_RATIO_* constants define the ratio of housekeeping CPUs + * (i.e. 'control plane' CPUs) - see MULTIPLIER + * versus isolated CPUs (i.e. 'data plane CPUs) - + * see DIVISOR + * The calculation is: + * NUMBER OF HOUSEKEEPING CPUs = + * (NUMBER OF CPUs * HOUSEKEEPING_RATIO_MULTIPLIER) + * divided by HOUSEKEEPING_RATIO_DIVISOR. 
+ * If NUMBER OF HOUSEKEEPING CPUs < 1, NUMBER OF HOUSEKEEPING CPUs ++ + * NUMBER OF ISOLATED CPUs = + * NUMBER OF CPUs - NUMBER OF HOUSEKEEPING CPUs + ******************************************************************************/ +#define MAX_CPUS_SUPPORTED 64 +#define HOUSEKEEPING_RATIO_MULTIPLIER 1 +#define HOUSEKEEPING_RATIO_DIVISOR 4 + +/****************************************************************************** + * + * Concatenate a string into a destination buffer + * containing an existing string such that the length of the resulting string + * (including the terminating NUL) does not exceed the buffer size + * + ******************************************************************************/ +static inline char *__strcat_bounded( char *dst_strg, const char *src_strg, + size_t dstlen ) { + *(dst_strg + (dstlen - 1)) = '\0'; + return( strncat( dst_strg, src_strg, + ((dstlen - 1) - strlen( dst_strg )) ) ); +} + +#define strcat_bounded( dest, src ) \ + __strcat_bounded( dest, src, (sizeof( dest )) ) + +/****************************************************************************** + * + * Copy a string into a destination buffer and NUL-terminate it + * such that the length of the resulting string + * (including the terminating NUL) does not exceed the buffer size + * + ******************************************************************************/ +static inline char *__strcpy_bounded( char *dst_strg, const char *src_strg, + size_t dstlen ) { + *(dst_strg + (dstlen - 1)) = '\0'; + return( strncpy( dst_strg, src_strg, (dstlen - 1) ) ); +} + +#define strcpy_bounded( dest, src ) \ + __strcpy_bounded( dest, src, (sizeof( dest )) ) + +#define MAX_ERR_MSG_SIZE 256 +#define ERR_STRING_SIZE 80 + +#define NSEC_PER_SEC 1000000000L + +static void sleep_nsec( long nsec ) +{ + struct timespec delay, remaining; + + if ( nsec >= NSEC_PER_SEC ) { + delay.tv_sec = nsec / NSEC_PER_SEC; + delay.tv_nsec = nsec % NSEC_PER_SEC; + } else { + delay.tv_sec = 0; + delay.tv_nsec = nsec; + } + for ( errno = EINTR; errno == EINTR; ) { + errno = 0; + if ( (clock_nanosleep( CLOCK_MONOTONIC, 0, &delay, &remaining )) && + (errno == EINVAL) ) { + errno = 0; + clock_nanosleep( CLOCK_REALTIME, 0, &delay, &remaining ); + } + delay.tv_sec = remaining.tv_sec; + delay.tv_nsec = remaining.tv_nsec; + } +} + +static sem_t strerror_lock; +static char error_buf[ERR_STRING_SIZE]; + +#define TM_STAMP_SIZE 30 +#define MAX_EVENT_STRING_SIZE ((size_t)(MAX_ERR_MSG_SIZE - TM_STAMP_SIZE - 4)) +#define TM_STAMP_MSEC 19 +#define TM_STAMP_MSEC_END 23 +#define TM_STAMP_CTIME_END 25 +#define TM_STAMP_DATE_END 11 +#define TM_STAMP_YEAR (TM_STAMP_MSEC_END + 1) +#define TM_STAMP_PREFIX_END 31 + +static sem_t logmsg_lock; +static char stderr_log_msg[MAX_ERR_MSG_SIZE]; + +/****************************************************************************** + * + * Return the task ID of the calling thread or process + * (this is a system-wide thread ID used for scheduling all tasks, + * whether single-threaded processes or individual threads within + * multithreaded processes) + * This is the identifier used for migrating tasks between cpusets + * + ******************************************************************************/ +static pid_t gettaskid( void ) +{ + return( (pid_t)(syscall(SYS_gettid)) ); +} + +/**/ +/****************************************************************************** + * + * Convert an error number to an error type string in errstring + * + ******************************************************************************/ 
+static char *errstring( int error_no ) +{ + + pthread_cleanup_push( (void(*)(void *))sem_post, + (void *)&strerror_lock ); + sem_wait( &strerror_lock ); + + error_buf[ ERR_STRING_SIZE - 1 ] = '\0'; + strncpy( error_buf, strerror( error_no ), (ERR_STRING_SIZE - 2) ); + + pthread_cleanup_pop( 1 ); + + return( error_buf ); +} + +/**/ +/****************************************************************************** + * + * Print a time stamp and event message out to stderr + * + ******************************************************************************/ +static void stderr_log( const char *fmt_str, ... ) +{ + va_list args; + struct timeval time_now; + struct tm time_fields; + int i, j; + char *event_msg_start; + + pthread_cleanup_push( (void(*)(void *))sem_post, + (void *)&logmsg_lock ); + sem_wait( &logmsg_lock ); + + /* + * Snapshot the current time down to the resolution of the CPU clock. + */ + gettimeofday( &time_now, (struct timezone *)NULL ); + + /* + * Convert time to a calender time string with 1 sec resolution + */ + localtime_r( (time_t *)(&(time_now.tv_sec)), &time_fields ); + asctime_r( &time_fields, stderr_log_msg ); + + /* + * Shift the year and newline down to make room for a msec string field + */ + for ( i = TM_STAMP_CTIME_END, j = (TM_STAMP_SIZE - 1); + i >= TM_STAMP_MSEC; i--, j-- ) + stderr_log_msg[j] = stderr_log_msg[i]; + + /* + * Insert the millisecond time stamp field into the string between the + * seconds and the year as :000 thru :999. Then overwrite the premature + * NUL with a space to 're-attach' the year and newline + */ + snprintf( &(stderr_log_msg[TM_STAMP_MSEC]), 5, ":%.3ld", + (time_now.tv_usec / 1000) ); + stderr_log_msg[TM_STAMP_MSEC_END] = ' '; + + /* + * NUL out the newline at the end of the timestamp so we can + * prefix the log message with the timestamp. + */ + stderr_log_msg[TM_STAMP_SIZE - 2] = '\0'; + strcat_bounded( stderr_log_msg, " - " ); + event_msg_start = &(stderr_log_msg[strlen( stderr_log_msg )]); + + /* + * Format the caller's event message into a constant string + */ + va_start( args, fmt_str ); + vsnprintf( event_msg_start, MAX_EVENT_STRING_SIZE, fmt_str, args ); + stderr_log_msg[MAX_EVENT_STRING_SIZE - 1] = '\0'; + va_end( args ); + strcat_bounded( stderr_log_msg, "\n" ); + + /* + * Then print the time stamp and event message out to stderr + */ + fputs( stderr_log_msg, stderr ); + + pthread_cleanup_pop( 1 ); +} + +/**/ +/****************************************************************************** + * + * Initialize the semaphores used for serializing error message handling. + * + ******************************************************************************/ +static void init_errmsg_locks( void ) +{ + sem_init( &strerror_lock, 0, 1 ); + sem_init( &logmsg_lock, 0, 1 ); +} + +#define DIRMODE ((mode_t)(S_IRUSR | S_IWUSR | S_IXUSR | \ + S_IRGRP | S_IXGRP | \ + S_IROTH | S_IXOTH)) +#define FILEMODE ((mode_t)(S_IRUSR | S_IWUSR | S_IRGRP | S_IROTH)) + +/**/ +/****************************************************************************** + * + * Data structures associated with CPUSETS + * + ******************************************************************************/ + +static int numcpus; +static size_t cpusetsize; +static int cpusets_supported; +static int cpuset_prefix_required; + +/* + * Shared path construction buffers and position markers + * Used to construct absolute paths to directories and files in the + * cpuset file hierarchy. 
Shared in order to reduce stack usage, + * (especially with nested function calls) - and kept thread-safe + * by the locks below. + */ +static char pathname_buf[128]; /* Use only while holding pathname_lock! */ +static char fieldname_buf[128]; /* Use only while holding fieldname_lock! */ +static char cpulist[128]; /* Use only while holding fieldname_lock! */ +static char cpuname[3]; /* Use only while holding fieldname_lock! */ +static int end_cpuset_base_path; /* Use only while holding fieldname_lock! */ +static int end_field_base_path; /* Use only while holding fieldname_lock! */ + +/* + * Locks for thread-safe use of the shared path construction buffers. + * Locking order - if pathname_lock is needed it must always be taken + * before taking fieldname_lock and released after + * releasing fieldname_lock. + * if from_path_lock is needed it must always be taken + * before taking to_path_lock and released after + * releasing to_path_lock. + * fieldname lock is the primary exclusion mechanism and by implication + * allows thread-safe access to the cpuset directory tree + * by all tasks using this suite of helper functions. + */ +static sem_t pathname_lock; +static sem_t fieldname_lock; +static sem_t from_path_lock; +static sem_t to_path_lock; + +/**/ +/****************************************************************************** + * + * Functions associated with cpusets + * + ******************************************************************************/ +/**/ +/****************************************************************************** + * + * Called from applications when switching to a new cpuset. + * + * Takes a string specifying the name of the desired cpuset + * relative to the mount point '/dev/cpuset/' + * - eg. 'cplane' or 'dplane'...etc. + * + * Obtains the required lock on the shared fieldname path buffer. + * Sets the (shared) current cpuset path string, + * and creates the cpuset management tree base directory + * if it is not already present. + * Initializes the fieldname path string to the new base directory path. + * Returns while holding the lock on the shared fieldname path buffer. + * NOTE - a NULL cpuset_name defaults to the top-level 'master' cpuset. + * + ******************************************************************************/ +static char *newpathname( const char *cpuset_name ) +{ + /* + * Lock exclusive access to the fieldname path buffer + * and by inference, to the current cpuset management directory tree + */ + sem_wait( &fieldname_lock ); + + /* Create a string containing the full path to the caller's directory */ + strcpy_bounded( fieldname_buf, "/dev/cpuset/" ); + + if ( cpuset_name != (char *)NULL ) { + strcat_bounded( fieldname_buf, cpuset_name ); + + /* + * Create the new cpuset tree under /dev/cpuset + * fieldname_buf = "/dev/cpuset/<path>" + */ + mkdir( fieldname_buf, DIRMODE ); + + strcat_bounded( fieldname_buf, "/" ); + } + + /* + * If a cpuset_name was specified, then + * fieldname_buf = "/dev/cpuset/<path>/" --else-- + * fieldname_buf = "/dev/cpuset/" + * Mark the end of the path base string for this cpuset + */ + end_cpuset_base_path = strnlen( fieldname_buf, (sizeof( fieldname_buf ) - 1) ); + + if ( cpuset_prefix_required ) + strcat_bounded( fieldname_buf, "cpuset." 
); + + /* Mark the end of the field path string for this cpuset */ + end_field_base_path = strnlen( fieldname_buf, + (sizeof( fieldname_buf ) - 1) ); + + /* Return to the caller with the fieldname_buf lock held */ + return( fieldname_buf ); +} + +/**/ +/****************************************************************************** + * + * Called from applications to create a complete cpuset fieldname path. + * + * Requires and assumes that the caller currently holds the lock + * for exclusive use of the shared fieldname path buffer. + * + * Resets the (shared) current fieldname path to its initial contents, + * effectively truncating the name of any previous field path. + * Then concatenates the cpuset-relative field name string specified + * by the caller onto the path base, creating the full field path name. + * + ******************************************************************************/ +static char *newfieldname( const char *field ) +{ + fieldname_buf[end_field_base_path] = '\0'; + strcat_bounded( fieldname_buf, field ); + return( fieldname_buf ); +} + + +/**/ +/****************************************************************************** + * + * Called from applications to release the lock on the shared + * fieldname path buffer. This enables serialized access to the + * cpuset management structure within a multi-threaded process. + * The application releases this lock after it finishes processing + * all fields of the current cpuset, guaranteeing that other threads + * using this utility will not interfere with that cpuset. + * + ******************************************************************************/ +static void releasefieldname( void ) +{ + sem_post( &fieldname_lock ); +} + +/**/ +/****************************************************************************** + * + * Verify and initialize basic CPUSET support + * + ******************************************************************************/ +static int init_cpusets( void ) +{ + int mounted = 0; + int retcode = -1; + int fileno; + + /* + * Initialize the locks used to serialize access to the error message + * logging functions and buffers. This needs to be done prior to most + * of the other cpuset setup functions... so take care of it here. 
+ */ + init_errmsg_locks(); + + /* Init locks for thread-safe access to static path-building strings */ + sem_init( &pathname_lock, 0, 1 ); + sem_init( &fieldname_lock, 0, 1 ); + sem_init( &from_path_lock, 0, 1 ); + sem_init( &to_path_lock, 0, 1 ); + + cpuset_prefix_required = 0; + cpusets_supported = 0; + +try2mount: + /* Try to mount the cpuset pseudo-filesystem at /dev/cpuset */ + mkdir( "/dev/cpuset", DIRMODE ); + if ( mount( "none", "/dev/cpuset", "cpuset", + (MS_NODEV | MS_NOEXEC | MS_NOSUID), (void *)NULL ) ) { + switch ( errno ) { + case EBUSY : + mounted = 1; + break; + case ENODEV : + ODPH_ERR( "cpusets not supported - aborting!\n" ); + break; + case EPERM : + ODPH_ERR( "Insufficient privileges for cpusets - aborting!\n" ); + break; + default : + break; + } + } + if ( mounted > 0) { + cpusets_supported = 1; + retcode = 0; + fileno = open( "/dev/cpuset/cpuset.cpus", O_RDONLY ); + if ( fileno > 0 ) { + cpuset_prefix_required = 1; + close( fileno ); + } + } else { + /* + * Try up to two more times to get the cpusets filesystem mounted + * before giving up + */ + if ( --mounted > -3 ) { + /* Delay 50 msec to allow the mount to settle and try again */ + sleep_nsec( 50000000 ); + goto try2mount; + } + } + + /* Support available CPU cores up to MAX_CPUS_SUPPORTED cores */ + numcpus = (int)sysconf( _SC_NPROCESSORS_ONLN ); + + if( numcpus > MAX_CPUS_SUPPORTED ) { + fprintf( stderr, + "\rNOTE: MAX_CPUS_SUPPORTED defined as: %d,\n", MAX_CPUS_SUPPORTED ); + fprintf( stderr, + "\r but number of CPU cores detected is: %d\n", numcpus ); + fprintf( stderr, + "\r Change MAX_CPUS_SUPPORTED in isolation_config.h and rebuild\n" + ); + fprintf(stderr, + "\r to support use of all CPU cores on this platform\n" ); + } + numcpus = (numcpus > MAX_CPUS_SUPPORTED) ? MAX_CPUS_SUPPORTED : numcpus; + + /* Save the required cpuset mask size for global reference */ + cpusetsize = CPU_ALLOC_SIZE( numcpus ); + + return( retcode ); +} + +/**/ +/****************************************************************************** + * + * Enable or disable full dynticks operation on the specified cpuset + * + ******************************************************************************/ +static void request_dynticks( const char *path, int on_off ) +{ + int retval, fileno; + + if ( on_off ) + ODPH_DBG( "Requesting dynticks on cpuset %s\n", path ); + else + ODPH_DBG( "Dynticks not needed on cpuset %s\n", path ); + + /* + * Set the fieldname path string to the base of the path + * to the caller's specified cpuset. + */ + pthread_cleanup_push( (void(*)(void *))sem_post, (void *)&fieldname_lock ); + newpathname( path ); + + /* + * Create an absolute path string to the "fulldynticks" field + * for the cpuset + */ + newfieldname( "fulldynticks" ); + + /* + * Specify whether the cores in this cpuset should offload kernel + * housekeeping tasks to other cores or else accept those tasks + */ + fileno = open( fieldname_buf, (O_RDWR | O_CREAT | O_TRUNC), FILEMODE ); + if ( fileno > 0 ) { + if ( on_off ) + retval = write( fileno, "1", 1 ); + else + retval = write( fileno, "0", 1 ); + close( fileno ); + } + + /* + * Create an absolute path string to the "quiesce" field + * for the cpuset + */ + newfieldname( "quiesce" ); + + /* + * Migrate timers / hrtimers away from the CPUs in this cpuset -or- + * allow timers / hrtimers for this CPU and system-wide use. 
+ */ + fileno = open( fieldname_buf, (O_RDWR | O_CREAT | O_TRUNC), FILEMODE ); + if ( fileno > 0 ) { + if ( on_off ) + retval = write( fileno, "1", 1 ); + else + retval = write( fileno, "0", 1 ); + close( fileno ); + } + + /* Release the lock on the fieldname_buf and the cpuset */ + releasefieldname(); + pthread_cleanup_pop( 0 ); + + /* Make the C compiler happy... do something with retval */ + if ( retval ) retval = 0; +} + +/**/ +/****************************************************************************** + * + * Enable or disable isolation on the specified cpuset + * (A NULL path defaults to the top-level master cpuset.) + * + ******************************************************************************/ +static void set_cpuset_isolation( const char *path, int on_off ) +{ + int retval, fileno; + + if ( on_off ) + ODPH_DBG( "Disabling load balancing on cpuset %s\n", path ); + else + ODPH_DBG( "Enabling load balancing on cpuset %s\n", path ); + + /* + * Set the fieldname path string to the base of the path + * to the caller's specified cpuset. + */ + pthread_cleanup_push( (void(*)(void *))sem_post, (void *)&fieldname_lock ); + newpathname( path ); + + /* + * Create an absolute path string to the "sched_load_balance" field + * for the cpuset + */ + newfieldname( "sched_load_balance" ); + + /* + * Enable or disable load balancing + */ + fileno = open( fieldname_buf, (O_RDWR | O_CREAT | O_TRUNC), FILEMODE ); + if ( fileno > 0 ) { + if ( on_off ) + retval = write( fileno, "0", 1 ); + else + retval = write( fileno, "1", 1 ); + close( fileno ); + } + + /* + * Create an absolute path string to the + * "sched_relax_domain_level" field for the cpuset + */ + newfieldname( "sched_relax_domain_level" ); + + /* + * Enable or disable event-based load balancing + */ + fileno = open( fieldname_buf, (O_RDWR | O_CREAT | O_TRUNC), FILEMODE ); + if ( fileno > 0 ) { + if ( on_off ) + retval = write( fileno, "0", 1 ); + else + retval = write( fileno, "-1", 2 ); + close( fileno ); + } + + /* Release the lock on the fieldname_buf and the cpuset */ + releasefieldname(); + pthread_cleanup_pop( 0 ); + + /* Make the C compiler happy... do something with retval */ + if ( retval ) retval = 0; +} + +/**/ +/****************************************************************************** + * + * Create a new management tree for the specified cpuset + * + ******************************************************************************/ +static void create_cpuset( const char *path, cpu_set_t *mask, int isolated ) +{ + int retval, i, fileno, endlist; + + /* + * Set the fieldname path string to the base of the path + * to the caller's specified cpuset. + */ + pthread_cleanup_push( (void(*)(void *))sem_post, (void *)&fieldname_lock ); + newpathname( path ); + + /* Create an absolute path string to the "mems" field for the cpuset */ + newfieldname( "mems" ); + + /* Init the "mems" field so all cpusets share the same memory map */ + fileno = open( fieldname_buf, (O_RDWR | O_CREAT | O_TRUNC), FILEMODE ); + if ( fileno > 0 ) { + retval = write( fileno, "0", 1 ); + close( fileno ); + } + + cpulist[0] = '\0'; + for ( i = 0, endlist = 0; i < numcpus; i++ ) { + if ( CPU_ISSET( i, mask) ) { + /* + * Create a comma-separated list of CPU cores in this cpuset + * based on the cpuset mask passed in by the caller. 
+ */ + snprintf( cpuname, sizeof( cpuname ), "%d", i ); + strcat_bounded( cpulist, cpuname ); + /* Mark the location of the trailing comma */ + endlist = strnlen( cpulist, (sizeof( cpulist ) - 1) ); + strcat_bounded( cpulist, "," ); + } + } + /* Remove the last superfluous trailing comma from the string */ + cpulist[endlist] = '\0'; + + /* Create an absolute path string to the "cpus" field for the cpuset */ + newfieldname( "cpus" ); + + /* + * Now populate the overall CPU list for the current cpuset + */ + fileno = open( fieldname_buf, (O_RDWR | O_CREAT | O_TRUNC), FILEMODE ); + if ( fileno > 0 ) { + retval = write( fileno, cpulist, strlen( cpulist ) ); + close( fileno ); + } + + /* Release the lock on the fieldname_buf and the cpuset */ + releasefieldname(); + pthread_cleanup_pop( 0 ); + + /* If the cpuset is to be isolated, turn off load balancing */ + set_cpuset_isolation( path, isolated ); + + /* Make the C compiler happy... do something with retval */ + if ( retval ) retval = 0; +} + +/**/ +/****************************************************************************** + * + * Delete the directory and file hierarchy associated with the cpuset + * specified by the contents of fieldname_buf + * + * Requires that the caller already holds fieldname_lock -and- + * assumes all tasks, etc. have been previously migrated away from the + * specified cpuset. + * + ******************************************************************************/ +static int cpuset_delete( void ) +{ + int retcode = -1; + int i, core_fileno; + + ODPH_DBG( "Deleting cpuset %s\n", fieldname_buf ); + + /* + * Create an absolute path string to the "cpus" field for the cpuset + */ + strcat_bounded( fieldname_buf, "/" ); + if ( cpuset_prefix_required ) + strcat_bounded( fieldname_buf, "cpuset." ); + + strcat_bounded( fieldname_buf, "cpus" ); + + /* + * fieldname_buf == "/dev/cpuset/<path>/cpu<n>/cpuset.cpus" + * -or- "/dev/cpuset/<path>/cpu<n>/cpus" + * -or- "/dev/cpuset/<path>/cpuset.cpus" + * -or- "/dev/cpuset/<path>/cpus" + * De-populate the CPU list to contain no cores + */ + core_fileno = open( fieldname_buf, (O_RDWR | O_TRUNC) ); + if ( core_fileno > 0 ) { + /* + * Try for up to 2 seconds to depopulate the CPU cores. + * This allows time for any task migrations to stabilize. + */ + for ( i = 0; i < 100; i++ ) { + errno = 0; + retcode = write( core_fileno, "", 1 ); + if ( !((retcode < 0) && + ((errno == EINTR) || (errno == EBUSY))) ) + break; + + /* Sleep 20 msec to allow depopulation to take effect */ + sleep_nsec( 20000000 ); + } + close( core_fileno ); + } + + fieldname_buf[end_cpuset_base_path] = '\0'; + /* + * fieldname_buf == "/dev/cpuset/<path>/cpu<n>" + * -or- "/dev/cpuset/<path>" + * Delete the cpuset tree for this core + */ + retcode = rmdir( fieldname_buf ); + if ( retcode ) { + ODPH_ERR( "Unable to delete cpuset %s - error %s\n", + fieldname_buf, errstring( errno ) ); + } + + return( retcode ); +} + +/**/ +/****************************************************************************** + * + * Delete the management tree for the specified cpuset + * + ******************************************************************************/ +static void delete_cpuset( const char *path ) +{ + /* + * Return the CPU cores in this cpuset to general purpose duty. + * Turn load balancing back on and indicate full dynticks not needed. + * This is done here to inform the kernel as to how these cores may be + * used and operated. 
+ */ + set_cpuset_isolation( path, 0 ); + request_dynticks( path, 0 ); + + /* + * Create an absolute path string to the "cpus" field for the cpuset + * newpathname marks the end of the cpuset base path string at a position + * following the slash - that is where the field name string would be + * concatenated onto the path - eg. '/dev/cpuset/<path>/' + * cpuset_delete() wants this marker to point to the position prior to + * the slash - eg. '/dev/cpuset/<path>' - so adjust it. + */ + pthread_cleanup_push( (void(*)(void *))sem_post, (void *)&fieldname_lock ); + newpathname( path ); + end_cpuset_base_path--; + fieldname_buf[end_cpuset_base_path] = '\0'; + + /* + * Depopulate the CPU list for the cpuset and remove its + * directory hierarchy + */ + cpuset_delete(); + + /* Release the lock on the fieldname_buf and the cpuset */ + releasefieldname(); + pthread_cleanup_pop( 0 ); +} + +/**/ +/****************************************************************************** + * + * Modify the per-cpu management tree for the specified cpuset + * to either enable or disable scheduler load balancing on each single-core + * cpuset descended from the specified parent cpuset. + * Assumes the /dev/cpuset filesystem already mounted and the + * per-core cpusets already initialized. + * + ******************************************************************************/ +static void set_per_core_cpusets_isolated( const char *path, cpu_set_t *mask, + int on_off ) +{ + int retval, i, core_fileno, cpu_num_offset; + + if ( on_off ) + ODPH_DBG( "Disabling load balancing on per-core cpusets in %s\n", path ); + else + ODPH_DBG( "Enabling load balancing on per-core cpusets in %s\n", path ); + + + /* + * Set the pathname and fieldname path strings to the base of the path + * to the specified 'parent' cpuset. + */ + pthread_cleanup_push( (void(*)(void *))sem_post, (void *)&fieldname_lock ); + newpathname( path ); + fieldname_buf[end_cpuset_base_path] = '\0'; + + /* + * Create an individual cpuset for each CPU to facilitate isolation + */ + strcat_bounded( fieldname_buf, "cpu" ); + /* + * fieldname_buf == /dev/cpuset/<path>/cpu + * mark the location where we append the CPU number + */ + cpu_num_offset = strnlen( fieldname_buf, (sizeof( fieldname_buf ) - 1) ); + + for ( i = 0; i < numcpus; i++ ) { + if ( CPU_ISSET( i, mask) ) { + snprintf( cpuname, (sizeof( cpuname ) - 1), "%d", i ); + strcat_bounded( fieldname_buf, cpuname ); + /* + * fieldname_buf == /dev/cpuset/<path>/cpu<n> + * where <n> is the current core number (0 -> numcpus-1) + * Modify the cpuset tree for this core only + */ + mkdir( fieldname_buf, DIRMODE ); + + strcat_bounded( fieldname_buf, "/" ); + if ( cpuset_prefix_required ) + strcat_bounded( fieldname_buf, "cpuset." 
); + /* Mark the end of the path string for this core */ + end_field_base_path = strnlen( fieldname_buf, + (sizeof( fieldname_buf ) - 1) ); + + /* Create a path string to the "sched_load_balance" field */ + newfieldname( "sched_load_balance" ); + /* + * fieldname_buf == + * "/dev/cpuset/<path>/cpu<n>/cpuset.sched_load_balance" + * -or- "/dev/cpuset/<path>/cpu<n>/sched_load_balance" + * Set the specified load balancing on this single-core cpuset + */ + core_fileno = open( fieldname_buf, (O_RDWR | O_CREAT | O_TRUNC), + FILEMODE ); + if ( core_fileno > 0 ) { + if ( on_off ) + retval = write( core_fileno, "0", 1 ); + else + retval = write( core_fileno, "1", 1 ); + close( core_fileno ); + } + + /* Create a path string to the "sched_relax_domain_level" field */ + newfieldname( "sched_relax_domain_level" ); + /* + * fieldname_buf == + * "/dev/cpuset/<path>/cpu<n>/cpuset.sched_relax_domain_level" + * -or- "/dev/cpuset/<path>/cpu<n>/sched_relax_domain_level" + * Set the specified behavior on this single-core cpuset + */ + core_fileno = open( fieldname_buf, (O_RDWR | O_CREAT | O_TRUNC), + FILEMODE ); + if ( core_fileno > 0 ) { + if ( on_off ) + retval = write( core_fileno, "0", 1 ); + else + retval = write( core_fileno, "-1", 2 ); + close( core_fileno ); + } + + /* + * Reset the current field pathname to: + * fieldname_buf == /dev/cpuset/<path>/cpu + * in preparation for the next CPU core + * in the data plane cpuset mask + */ + fieldname_buf[cpu_num_offset] = '\0'; + } + } + + /* Make the C compiler happy... do something with retval */ + if ( retval ) retval = 0; + + /* Release the lock on the fieldname_buf and the cpuset */ + releasefieldname(); + pthread_cleanup_pop( 0 ); +} + + +/**/ +/****************************************************************************** + * + * Modify the per-cpu management tree for the specified cpuset + * to either enable or disable full dynticks operation on each single-core + * cpuset descended from the specified parent cpuset. + * Assumes the /dev/cpuset filesystem already mounted and the + * per-core cpusets already initialized. + * + ******************************************************************************/ +static void request_per_core_dynticks( const char *path, cpu_set_t *mask, + int on_off ) +{ + int retval, i, core_fileno, cpu_num_offset; + if ( on_off ) + ODPH_DBG( "Requesting dynticks on per-core cpusets in %s\n", path ); + else + ODPH_DBG( "Dynticks not needed on per-core cpusets in %s\n", path ); + + + /* + * Set the pathname and fieldname path strings to the base of the path + * to the specified 'parent' cpuset. + */ + pthread_cleanup_push( (void(*)(void *))sem_post, (void *)&fieldname_lock ); + newpathname( path ); + fieldname_buf[end_cpuset_base_path] = '\0'; + + /* + * Create an individual cpuset for each CPU to facilitate isolation + */ + strcat_bounded( fieldname_buf, "cpu" ); + /* + * fieldname_buf == /dev/cpuset/<path>/cpu + * mark the location where we append the CPU number + */ + cpu_num_offset = strnlen( fieldname_buf, (sizeof( fieldname_buf ) - 1) ); + + for ( i = 0; i < numcpus; i++ ) { + if ( CPU_ISSET( i, mask) ) { + snprintf( cpuname, (sizeof( cpuname ) - 1), "%d", i ); + strcat_bounded( fieldname_buf, cpuname ); + /* + * fieldname_buf == /dev/cpuset/<path>/cpu<n> + * where <n> is the current core number (0 -> numcpus-1) + * Modify the cpuset tree for this core only + */ + mkdir( fieldname_buf, DIRMODE ); + + strcat_bounded( fieldname_buf, "/" ); + if ( cpuset_prefix_required ) + strcat_bounded( fieldname_buf, "cpuset." 
); + /* Mark the end of the path string for this core */ + end_field_base_path = strnlen( fieldname_buf, + (sizeof( fieldname_buf ) - 1) ); + + /* Create a path string to the "fulldynticks" field */ + newfieldname( "fulldynticks" ); + /* + * fieldname_buf == + * "/dev/cpuset/<path>/cpu<n>/cpuset/fulldynticks" + * -or- "/dev/cpuset/<path>/cpu<n>/fulldynticks" + */ + core_fileno = open( fieldname_buf, (O_RDWR | O_CREAT | O_TRUNC), + FILEMODE ); + if ( core_fileno > 0 ) { + if ( on_off ) + /* Mark this single-core cpuset for full dynticks mode */ + retval = write( core_fileno, "1", 1 ); + else + /* Mark this single-core cpuset for housekeeping mode */ + retval = write( core_fileno, "0", 1 ); + close( core_fileno ); + } + + /* + * Create an absolute path string to the "quiesce" field + * for the cpuset + */ + newfieldname( "quiesce" ); + /* + * fieldname_buf == + * "/dev/cpuset/<path>/cpu<n>/cpuset/quiesce" + * -or- "/dev/cpuset/<path>/cpu<n>/quiesce" + */ + core_fileno = open( fieldname_buf, (O_RDWR | O_CREAT | O_TRUNC), + FILEMODE ); + if ( core_fileno > 0 ) { + if ( on_off ) + /* Migrate timers / hrtimers away from this cpuset */ + retval = write( core_fileno, "1", 1 ); + else + /* Enable migration of timers / hrtimers onto this cpuset */ + retval = write( core_fileno, "0", 1 ); + close( core_fileno ); + } + + /* + * Reset the current field pathname to: + * fieldname_buf == /dev/cpuset/<path>/cpu + * in preparation for the next CPU core + * in the data plane cpuset mask + */ + fieldname_buf[cpu_num_offset] = '\0'; + } + } + + /* Make the C compiler happy... do something with retval */ + if ( retval ) retval = 0; + + /* Release the lock on the fieldname_buf and the cpuset */ + releasefieldname(); + pthread_cleanup_pop( 0 ); +} + +/**/ +/****************************************************************************** + * + * Create a new per-cpu management tree for the specified parent cpuset + * Assumes the /dev/cpuset filesystem already mounted and the + * parent cpuset already initialized. + * + ******************************************************************************/ +static void create_per_core_cpusets( const char *path, cpu_set_t *mask, + int isolated ) +{ + int retval, i, core_fileno, cpu_num_offset; + + /* + * Set the pathname and fieldname path strings to the base of the path + * to the specified 'parent' cpuset. + */ + pthread_cleanup_push( (void(*)(void *))sem_post, (void *)&fieldname_lock ); + newpathname( path ); + fieldname_buf[end_cpuset_base_path] = '\0'; + + /* + * Create an individual cpuset for each CPU to facilitate isolation + */ + strcat_bounded( fieldname_buf, "cpu" ); + /* + * fieldname_buf == /dev/cpuset/<path>/cpu + * mark the location where we append the CPU number + */ + cpu_num_offset = strnlen( fieldname_buf, (sizeof( fieldname_buf ) - 1) ); + + for ( i = 0; i < numcpus; i++ ) { + if ( CPU_ISSET( i, mask) ) { + snprintf( cpuname, (sizeof( cpuname ) - 1), "%d", i ); + strcat_bounded( fieldname_buf, cpuname ); + /* + * fieldname_buf == /dev/cpuset/<path>/cpu<n> + * where <n> is the current core number (0 -> numcpus-1) + * Create a new cpuset tree for this core only + */ + mkdir( fieldname_buf, DIRMODE ); + + strcat_bounded( fieldname_buf, "/" ); + if ( cpuset_prefix_required ) + strcat_bounded( fieldname_buf, "cpuset." 
); + /* Mark the end of the path string for this core */ + end_field_base_path = strnlen( fieldname_buf, + (sizeof( fieldname_buf ) - 1) ); + + /* Create an absolute path string to the "mems" field */ + newfieldname( "mems" ); + /* + * fieldname_buf == "/dev/cpuset/<path>/cpu<n>/cpuset.mems" + * -or- "/dev/cpuset/<path>/cpu<n>/mems" + * Init the "mems" field so all cpusets share the same memory map + */ + core_fileno = open( fieldname_buf, (O_RDWR | O_CREAT | O_TRUNC), + FILEMODE ); + if ( core_fileno > 0 ) { + retval = write( core_fileno, "0", 1 ); + close( core_fileno ); + } + + /* Create an absolute path string to the "cpus" field */ + newfieldname( "cpus" ); + /* + * fieldname_buf == "/dev/cpuset/<path>/cpu<n>/cpuset.cpus" + * -or- "/dev/cpuset/<path>/cpu<n>/cpus" + * Init the CPU list to contain only the current core + */ + core_fileno = open( fieldname_buf, (O_RDWR | O_CREAT | O_TRUNC), + FILEMODE ); + if ( core_fileno > 0 ) { + retval = write( core_fileno, cpuname, strlen( cpuname ) ); + close( core_fileno ); + } + + /* Create a path string to the "sched_load_balance" field */ + newfieldname( "sched_load_balance" ); + /* + * fieldname_buf == + * "/dev/cpuset/<path>/cpu<n>/cpuset.sched_load_balance" + * -or- "/dev/cpuset/<path>/cpu<n>/sched_load_balance" + * Set the specified load balancing on this single-core cpuset + */ + core_fileno = open( fieldname_buf, (O_RDWR | O_CREAT | O_TRUNC), + FILEMODE ); + if ( core_fileno > 0 ) { + if ( isolated ) + retval = write( core_fileno, "0", 1 ); + else + retval = write( core_fileno, "1", 1 ); + close( core_fileno ); + } + + /* Create a path string to the "sched_relax_domain_level" field */ + newfieldname( "sched_relax_domain_level" ); + /* + * fieldname_buf == + * "/dev/cpuset/<path>/cpu<n>/cpuset.sched_relax_domain_level" + * -or- "/dev/cpuset/<path>/cpu<n>/sched_relax_domain_level" + * Set the specified behavior on this single-core cpuset + */ + core_fileno = open( fieldname_buf, (O_RDWR | O_CREAT | O_TRUNC), + FILEMODE ); + if ( core_fileno > 0 ) { + if ( isolated ) + retval = write( core_fileno, "0", 1 ); + else + retval = write( core_fileno, "-1", 2 ); + close( core_fileno ); + } + + /* + * Reset the current field pathname to: + * fieldname_buf == /dev/cpuset/<path>/cpu + * in preparation for the next CPU core + * in the data plane cpuset mask + */ + fieldname_buf[cpu_num_offset] = '\0'; + } + } + + /* Make the C compiler happy... do something with retval */ + if ( retval ) retval = 0; + + /* Release the lock on the fieldname_buf and the cpuset */ + releasefieldname(); + pthread_cleanup_pop( 0 ); +} + +/**/ +/****************************************************************************** + * + * Delete the per-cpu management tree for the specified cpuset + * + ******************************************************************************/ +static void delete_per_core_cpusets( const char *path, cpu_set_t *mask ) +{ + int i, cpu_num_offset; + + /* + * Return the CPUs in the per_core cpusets to general purpose duty. + * Turn load balancing back on and indicate full dynticks not needed. + * This is done here to inform the kernel as to how these cores may be + * used and operated. + */ + set_per_core_cpusets_isolated( path, mask, 0 ); + request_per_core_dynticks( path, mask, 0 ); + + /* + * Set the pathname and fieldname path strings to the base of the path + * to the specified cpuset. 
+ */ + pthread_cleanup_push( (void(*)(void *))sem_post, (void *)&fieldname_lock ); + newpathname( path ); + fieldname_buf[end_cpuset_base_path] = '\0'; + + /* + * Delete the individual cpuset for each CPU + */ + strcat_bounded( fieldname_buf, "cpu" ); + + /* + * fieldname_buf == /dev/cpuset/<path>/cpu + * mark the location where we append the CPU number + */ + cpu_num_offset = strnlen( fieldname_buf, (sizeof( fieldname_buf ) - 1) ); + + for ( i = 0; i < numcpus; i++ ) { + if ( CPU_ISSET( i, mask) ) { + snprintf( cpuname, sizeof( cpuname ), "%d", i ); + strcat_bounded( fieldname_buf, cpuname ); + /* Mark the end of the path string for this cpuset */ + end_cpuset_base_path = strnlen( fieldname_buf, + (sizeof( fieldname_buf ) - 1) ); + + /* + * Depopulate the CPU list for the cpuset and remove its + * directory hierarchy + */ + cpuset_delete(); + + /* + * Reset the pathname to: + * fieldname_buf == /dev/cpuset/<path>/cpu + * in preparation for the next CPU core + * in the data plane cpuset mask + */ + fieldname_buf[cpu_num_offset] = '\0'; + } + } + + /* Release the lock on the fieldname_buf and the cpuset */ + releasefieldname(); + pthread_cleanup_pop( 0 ); +} + +/**/ +/****************************************************************************** + * + * Read the specified value from the specified field of the cpuset per-cpu + * management tree for the specified CPU and return it at caller's value ptr. + * If the file for the specified field is missing or empty then *value is NULL. + * + * Assumes the /dev/cpuset filesystem already mounted and the + * cpusets already initialized. + * + ******************************************************************************/ +static void get_per_cpu_field_for( int cpu, const char *path, const char *field, + char *value, size_t len ) +{ + int retval = 0; + int num_read, core_fileno; + + pthread_cleanup_push( (void(*)(void *))sem_post, (void *)&pathname_lock ); + sem_wait( &pathname_lock ); + + /* Get the name of this single-core cpuset based on the specified CPU */ + strcpy( pathname_buf, path ); + strcat_bounded( pathname_buf, "/cpu" ); + snprintf( cpuname, sizeof( cpuname ), "%d", cpu ); + strcat_bounded( pathname_buf, cpuname ); + + /* Set the fieldname path string to point to fields within this cpuset */ + pthread_cleanup_push( (void(*)(void *))sem_post, (void *)&fieldname_lock ); + newpathname( pathname_buf ); + + /* + * fieldname_buf == "/dev/cpuset/<path>/cpu<n>/cpuset." + * -or- "/dev/cpuset/<path>/cpu<n>/" + */ + strcat_bounded( fieldname_buf, field ); + /* + * fieldname_buf == "/dev/cpuset/<path>/cpu<n>/cpuset.<field>" + * -or- "/dev/cpuset/<path>/cpu<n>/<field>" + */ + if ( value ) { + core_fileno = open( fieldname_buf, O_RDONLY ); + if ( core_fileno > 0 ) { + for ( num_read = 0; num_read < len; ) { + num_read = read( core_fileno, (void *)value, len ); + if ( (num_read < len) && (errno != EINTR) ) + retval = -1; + break; + } + /* If the field file is missing or empty */ + close( core_fileno ); + if ( len && (retval < 0) ) { + *value = (char)'\0'; + ODPH_ERR( "Failed to get value for %s - error %s\n", + fieldname_buf, errstring( errno ) ); + } + } else + *value = (char)'\0'; + } + + /* Release the lock on the fieldname_buf and the cpuset */ + releasefieldname(); + pthread_cleanup_pop( 0 ); + pthread_cleanup_pop( 1 ); + + /* Make the C compiler happy... 
do something with retval */ + if ( retval ) retval = 0; +} + +/**/ +/****************************************************************************** + * + * Write the specified value to the specified field of the per-cpu + * management tree for the specified CPU and cpuset + * If value is NULL then the file for the specified field will be truncated. + * + * Assumes the /dev/cpuset filesystem already mounted and the + * cpusets already initialized. + * + ******************************************************************************/ +static void set_per_cpu_field_for( int cpu, const char *path, const char *field, + const char *value ) +{ + int retval, core_fileno; + + pthread_cleanup_push( (void(*)(void *))sem_post, (void *)&pathname_lock ); + sem_wait( &pathname_lock ); + + /* Get the name of this single-core cpuset based on the specified CPU */ + strcpy( pathname_buf, path ); + strcat_bounded( pathname_buf, "/cpu" ); + snprintf( cpuname, sizeof( cpuname ), "%d", cpu ); + strcat_bounded( pathname_buf, cpuname ); + + /* Set the fieldname path string to point to fields within this cpuset */ + pthread_cleanup_push( (void(*)(void *))sem_post, (void *)&fieldname_lock ); + newpathname( pathname_buf ); + + /* + * fieldname_buf == "/dev/cpuset/<path>/cpu<n>/cpuset." + * -or- "/dev/cpuset/<path>/cpu<n>/" + */ + strcat_bounded( fieldname_buf, field ); + /* + * fieldname_buf == "/dev/cpuset/<path>/cpu<n>/cpuset.<field>" + * -or- "/dev/cpuset/<path>/cpu<n>/<field>" + */ + core_fileno = open( fieldname_buf, (O_RDWR | O_CREAT | O_TRUNC), FILEMODE ); + if ( core_fileno > 0 ) { + /* If value is NULL then the field file will simply be truncated */ + if ( value ) + retval = write( core_fileno, value, strlen( value ) ); + close( core_fileno ); + } + + /* Release the lock on the fieldname_buf and the cpuset */ + releasefieldname(); + pthread_cleanup_pop( 0 ); + pthread_cleanup_pop( 1 ); + + /* Make the C compiler happy... do something with retval */ + if ( retval ) retval = 0; +} + +/**/ +/****************************************************************************** + * + * Migrate timers and hrtimers away from the specified cpuset's CPU cores + * + ******************************************************************************/ +static void quiesce_cpus( const char *path ) +{ + int retval, fileno; + + /* + * Set the fieldname path string to the base of the path + * to the caller's specified cpuset. + */ + pthread_cleanup_push( (void(*)(void *))sem_post, (void *)&fieldname_lock ); + newpathname( path ); + + /* + * Create an absolute path string to the "quiesce" field + * for the cpuset + */ + newfieldname( "quiesce" ); + + /* + * Migrate timers / hrtimers away from the cpuset's CPUs + */ + fileno = open( fieldname_buf, (O_RDWR | O_CREAT | O_TRUNC), FILEMODE ); + if ( fileno > 0 ) { + retval = write( fileno, "1", 1 ); + close( fileno ); + } + + pthread_cleanup_pop( 1 ); + + /* Make the C compiler happy... do something with retval */ + if ( retval ) retval = 0; +} + +/**/ +/****************************************************************************** + * + * Move the specified task away from its current cpuset + * and onto the cores of the specified new cpuset + * Specifying a NULL path string pointer defaults to /dev/cpuset + * + * Assumes the caller passes in a legitimate task PID string. 
+ * + * Returns an int == zero if migration successful or -1 if an error occurred + ******************************************************************************/ +static int migrate_task( const char *callers_pid, const char *to_cpuset_path ) +{ + size_t num_read, num_to_write; + int i, to_fileno, proc_pid_fileno, end_of_file, migrate_failed; + static char my_pid[24]; + static char written_pid[24]; + static char cur[2]; + static char to_path_buf[128]; + static char proc_path_buf[80]; + + pthread_cleanup_push( (void(*)(void *))sem_post, (void *)&to_path_lock ); + sem_wait( &to_path_lock ); + + /* Create strings containing the full path to the caller's cpusets */ + strcpy_bounded( to_path_buf, "/dev/cpuset/" ); + + /* Mark index of trailing slash for possible overwrite */ + num_read = strlen( to_path_buf ) - 1; + if ( to_cpuset_path ) + strcat_bounded( to_path_buf, to_cpuset_path ); + else + /* Migrate the task to default cpuset - remove trailing slash */ + to_path_buf[num_read] = '\0'; + + /* + * We will be manipulating the tasks in this cpuset, so + * extend the path string to specify the 'tasks' file. + */ + strcat_bounded( to_path_buf, "/tasks" ); + + /* + * Assemble a path to the status file for the caller's task in /proc + * to verify that the process still exists + */ + for ( i = 0; i < strlen( callers_pid ); i++ ) { + /* + * Don't include any trailing newline from callers_pid + * into the pathname string being built. + */ + if ( callers_pid[i] != (char)'\n' ) + written_pid[i] = callers_pid[i]; + else + written_pid[i] = (char)'\0'; + } + written_pid[i] = (char)'\0'; + strcpy_bounded( proc_path_buf, "/proc/" ); + strcat_bounded( proc_path_buf, written_pid ); + strcat_bounded( proc_path_buf, "/status" ); + proc_pid_fileno = open( proc_path_buf, O_RDONLY ); + + /* Init the result return value */ + migrate_failed = 0; + + /* Ignore the caller's task if its PID is stale */ + if ( proc_pid_fileno > 0 ) { + to_fileno = open( to_path_buf, (O_RDWR | O_CREAT | O_APPEND), + FILEMODE ); + } else { + to_fileno = -1; + migrate_failed = -1; + ODPH_ERR( "%s not found - failed to migrate %s\n", + proc_path_buf, callers_pid ); + } + + if ( to_fileno > 0 ) { + /* Capture our own ttid for comparison purposes */ + snprintf( my_pid, (sizeof( my_pid ) - 1), "%d", gettaskid() ); + + /* + * Now let's try to migrate the task. + * Try to write the PID for the caller's task into + * the task list for the specified 'to' cpuset. + */ + errno = 0; + num_to_write = strlen( written_pid ); + for ( num_read = 0; num_read < num_to_write; ) { + num_read = write( to_fileno, written_pid, num_to_write ); + if ( (num_read == (size_t)-1) && (errno != EINTR) ) + migrate_failed = -1; + break; + } + + if ( migrate_failed ) { + /* + * Scan the task's /proc status file to find its name. 
+ */ + for ( end_of_file = 0; !end_of_file; ) { + /* Read one line of info from the task's /proc status file */ + for ( i = 0, cur[0] = (char)'\0'; (cur[0] != (char)'\n'); ) { + num_read = read( proc_pid_fileno, (void *)cur, 1 ); + if ( num_read > 0 ) { + if ( cur[0] != (char)'\n' ) { + proc_path_buf[i] = cur[0]; + i++; + } else { + proc_path_buf[i] = (char)'\0'; + } + } else { + proc_path_buf[i] = '\0'; + if ( errno != EINTR ) { + end_of_file = 1; + break; + } + } + } + + /* proc_path_buf should contain a status line unless EOF reached */ + if ( !(strncmp( proc_path_buf, "Name: ", 6 )) ) + break; + } + /* Failed to migrate current task */ + ODPH_ERR( "Failed to migrate pid %s - error %s\n", + written_pid, errstring( errno ) ); + } else { + /* + * If we are migrating our own task, sleep for 50 msec + * to allow time for migration to occur. + */ + if ( !strncmp( written_pid, my_pid, strlen( my_pid ) ) ) + sleep_nsec( 50000000 ); + } + close( to_fileno ); + } + if ( proc_pid_fileno > 0 ) + close( proc_pid_fileno ); + + pthread_cleanup_pop( 1 ); + + return( migrate_failed ); +} + +/**/ +/****************************************************************************** + * + * Move all tasks which can be migrated off of the cores of the current cpuset + * and onto the cores of the specified new cpuset + * Specifying a NULL path string pointer defaults to /dev/cpuset + * The 'except' parameter is an array of pid_t values + * which SHOULD NOT be migrated away from this core - terminated by + * a zero pid_t value. If the pointer to this array is NULL or if the + * first pid_t is zero, the function will try to migrate all processes off of + * the 'from' cpuset. + * + ******************************************************************************/ +static void migrate_tasks( const char *from_cpuset_path, + const char *to_cpuset_path, pid_t *except ) +{ + size_t num_read; + int i, from_fileno, pid_ready, end_of_file; + char callers_pid[24]; + char cur[1]; + static char from_path_buf[128]; + uint64_t pid_numeric = 0; + pid_t cur_match; + + pthread_cleanup_push( (void(*)(void *))sem_post, (void *)&from_path_lock ); + sem_wait( &from_path_lock ); + + /* Create strings containing the full path to the caller's cpusets */ + strcpy_bounded( from_path_buf, "/dev/cpuset/" ); + + /* Mark index of trailing slash for possible overwrite */ + num_read = strlen( from_path_buf ) - 1; + if ( from_cpuset_path ) { + strcat_bounded( from_path_buf, from_cpuset_path ); + } else { + /* Migrate the task from default cpuset - remove trailing slash */ + from_path_buf[num_read] = '\0'; + } + + if ( to_cpuset_path ) + ODPH_DBG( "Migrating tasks from %s to /dev/cpuset/%s\n", + from_path_buf, to_cpuset_path ); + else + ODPH_DBG( "Migrating tasks from %s to /dev/cpuset\n", from_path_buf ); + + /* + * We will be manipulating the tasks in this cpuset, so + * extend the path string to specify the 'tasks' file.
+ */ + strcat_bounded( from_path_buf, "/tasks" ); + from_fileno = open( from_path_buf, O_RDWR ); + + if ( from_fileno > 0 ) { + for ( end_of_file = 0; !end_of_file; ) { + /* Read one line of PID info from the 'from' tasks file */ + callers_pid[0] = '\0'; + pid_ready = 0; + for ( i = 0; i < sizeof( callers_pid ); ) { + num_read = read( from_fileno, (void *)cur, 1 ); + switch ( num_read ) { + case 0 : + end_of_file = 1; + break; + case 1 : + if ( cur[0] == (char)'\n' ) { + callers_pid[i] = '\0'; + pid_ready = 1; + i = 0; + } else { + if ( (i + 1) < sizeof( callers_pid ) ) { + callers_pid[i] = cur[0]; + callers_pid[++i] = '\0'; + } else { + ODPH_ERR( "PID %s too long\n", callers_pid ); + } + } + break; + default: + if ( errno != EINTR ) { + ODPH_ERR( "Failed to read %s - error %s\n", + from_path_buf, errstring( errno ) ); + end_of_file = 1; + } + } + if ( pid_ready || end_of_file ) + break; + } + if ( pid_ready ) { + /* + * callers_pid should contain a PID... + * If the caller specified any tasks which should not be + * migrated off of this CPU, check the callers_pid + * against the list and skip migrating it if it matches. + */ + if ( except != (pid_t *)NULL ) { + /* + * Convert the PID string to a number + * (64 bit in case pid_t might get that big) + */ + pid_numeric = strtoull( callers_pid, (char **)NULL, 10 ); + cur_match = except[0]; + for ( i = 0; cur_match != (pid_t)0; ) { + /* + * If a match is found, leave pid_numeric nonzero + * as a 'skip indicator flag'. + * If the list contains no matches, clear pid_numeric + * and try to migrate the current task . + */ + if ( cur_match == (pid_t)pid_numeric ) { + ODPH_DBG( "Skipped migrating task %llu\n", + pid_numeric ); + break; + } else { + cur_match = except[++i]; + if ( cur_match == (pid_t)0 ) { + pid_numeric = 0; + } + } + } + } else { + pid_numeric = 0; + } + if ( !pid_numeric ) + /* callers_pid wasn't specified to stay on this CPU */ + migrate_task( callers_pid, to_cpuset_path ); + } + } + close( from_fileno ); + } + + pthread_cleanup_pop( 1 ); +} + +/**/ +/****************************************************************************** + * + * Move all tasks from the per-cpu management tree for the specified + * 'parent' cpuset to the specified destination cpuset. + * A 'NULL' parent cpuset is not allowed, but if the destination cpuset is NULL + * then the tasks for all per_cpu descendants of the 'parent' cpuset will be + * moved to the top-level 'master' cpuset. + * + ******************************************************************************/ +static void migrate_per_core_tasks( const char *from_cpuset_path, + cpu_set_t *from_mask, + const char *to_cpuset_path ) +{ + int i, cpu_num_offset; + + /* + * Set the fieldname path string to the relative path for the + * per_core cpusets in the specified "from" 'parent' cpuset. 
+ */ + pthread_cleanup_push( (void(*)(void *))sem_post, (void *)&fieldname_lock ); + strcpy_bounded( fieldname_buf, from_cpuset_path ); + strcat_bounded( fieldname_buf, "/cpu" ); + + /* + * fieldname_buf == <path>/cpu + * mark the location where we append the CPU number + */ + cpu_num_offset = strnlen( fieldname_buf, (sizeof( fieldname_buf ) - 1) ); + + for ( i = 0; i < numcpus; i++ ) { + if ( CPU_ISSET( i, from_mask) ) { + snprintf( cpuname, (sizeof( cpuname ) - 1), "%d", i ); + strcat_bounded( fieldname_buf, cpuname ); + /* + * fieldname_buf == <path>/cpu<n> + * where <n> is the current core number (0 -> numcpus-1) + * Now migrate all tasks from this core to the destination cpuset + */ + migrate_tasks( (const char *)fieldname_buf, to_cpuset_path, + (pid_t *)NULL ); + + /* + * Reset the current field pathname to: + * fieldname_buf == <path>/cpu + * in preparation for the next CPU core + * in the specified "from" 'parent' cpuset mask + */ + fieldname_buf[cpu_num_offset] = '\0'; + } + } + + /* Release the lock on the fieldname_buf and the cpuset */ + pthread_cleanup_pop( 1 ); +} + +/**/ +/****************************************************************************** + * + * Force cpufreq to stay at highest clock rate to eliminate timer activities + * associated with cpu frequency monitoring and modifications on all + * isolated CPU cores + * + * Called only from tweak_system_tunables_for_isolation with pathname lock held + * + ******************************************************************************/ +static void fix_cpufreq_governor( cpu_set_t *hiperf_mask ) +{ + int retval, cpunum, cpu_num_offset, core_gov_fileno; + + /* Create a path to the cpufreq governor status/setting file in /sys */ + strcpy_bounded( pathname_buf, "/sys/devices/system/cpu/cpu" ); + + /* mark the location where we append the CPU number */ + cpu_num_offset = strnlen( pathname_buf, (sizeof( pathname_buf ) - 1) ); + + /* Use the 'performance' cpufreq governor on all data plane CPU cores */ + for ( cpunum = 0; cpunum < numcpus; cpunum++ ) { + if ( CPU_ISSET( cpunum, hiperf_mask) ) { + snprintf( cpuname, sizeof( cpuname ), "%d", cpunum ); + strcat_bounded( pathname_buf, cpuname ); + strcat_bounded( pathname_buf, "/cpufreq" ); + + core_gov_fileno = open( pathname_buf, (O_RDWR | O_TRUNC) ); + if ( core_gov_fileno > 0 ) { + retval = write( core_gov_fileno, "performance", 11 ); + close( core_gov_fileno ); + } + + /* + * Reset the current pathname to: + * pathname_buf == /sys/devices/system/cpu/cpu + * in preparation for the next CPU core + * in the data plane cpuset mask + */ + pathname_buf[cpu_num_offset] = '\0'; + } + } + + /* Make the C compiler happy... 
do something with retval */ + if ( retval ) retval = 0; +} + +/**/ +/****************************************************************************** + * + * Affine the specified IRQ number to the specified cpuset + * + * Takes an ascii string representing the IRQ number (from the /proc/irq path) + * and a cpuset mask specifying the acceptable target CPUs + * + * Returns an int = 0 if everything went okay - else < 0 + * + ******************************************************************************/ +static int affine_irq( char *irq_strg, cpu_set_t *mask ) +{ + unsigned long int cpumask; + int cpunum, irq_fileno; + int retval = -1; + + pthread_cleanup_push( (void(*)(void *))sem_post, (void *)&pathname_lock ); + sem_wait( &pathname_lock ); + pthread_cleanup_push( (void(*)(void *))sem_post, (void *)&fieldname_lock ); + sem_wait( &fieldname_lock ); + + /* + * Create an integer bit mask for the specified cpuset + * If the specified cpuset mask was invalid, default to CPU 0 + * Then convert the bitmask to an ASCII string in cpulist + */ + for ( cpunum = cpumask = 0; cpunum < numcpus; cpunum++ ) { + if ( CPU_ISSET( cpunum, mask) ) + cpumask |= (1 << cpunum); + } + if ( !cpumask ) + cpumask = 1; + snprintf( cpulist, sizeof( cpulist ), "%lx", cpumask ); + + /* Create a path to the smp_affinity mask file for the IRQ in /proc/irq */ + strcpy_bounded( pathname_buf, "/proc/irq/" ); + strcat_bounded( pathname_buf, irq_strg ); + strcat_bounded( pathname_buf, "/smp_affinity" ); + + /* Set the smp_affinity mask for this interrupt to the specified CPU */ + irq_fileno = open( pathname_buf, (O_RDWR | O_TRUNC) ); + if ( irq_fileno > 0 ) { + retval = write( irq_fileno, cpulist, strlen( cpulist ) ); + + /* + * Check whether or not the IRQ was affined successfully + * Look for an error on writing or a mismatch on the smp_affinity mask + * after a successful write. + */ + if ( retval > 0 ) { + retval = read( irq_fileno, cpulist, strlen( cpulist ) ); + if ( cpumask != strtoul( cpulist, (char **)NULL, 16 ) ) + retval = -1; + else + retval = 0; + } + close( irq_fileno ); + + if ( retval ) + /* Something went wrong with the affinity reassignment */ + ODPH_ERR( + "Unable to affine IRQ %s to cpus in cpuset mask < 0x%lx >\n", + irq_strg, cpumask ); + } + + pthread_cleanup_pop( 1 ); + pthread_cleanup_pop( 1 ); + + return( retval ); +} + +/**/ +/****************************************************************************** + * + * Affine IRQs to housekeeping CPUs + * + ******************************************************************************/ +static void affine_irqs_to_housekeeping( cpu_set_t *hskpg_mask ) +{ + DIR *d; + struct dirent *dir; + + ODPH_DBG( "Affining IRQs to housekeeping CPUs\n" ); + d = opendir("/proc/irq"); + if ( d ) { + while ( (dir = readdir( d )) != NULL ) { + if (dir->d_type == DT_DIR) { + if ( (strncmp( dir->d_name, ".", strlen( dir->d_name ) )) && + (strncmp( dir->d_name, "..", strlen( dir->d_name ) )) ) { + affine_irq( dir->d_name, hskpg_mask ); + } + } + } + closedir(d); + } +} + +/**/ +/****************************************************************************** + * + * Locate the mount point for the debugfs filesystem (if present) + * and initialize a static string with that path. + * + * Returns a pointer to the static mount point path string -or- + * a NULL pointer if no debugfs filesystem could be found. + * + * This function is not thread-safe. It should be called only from a single + * threaded context (i.e. inherently serialized) during isolation setup. 
+ * Furthermore, since the static buffer in which it stores its path string + * is modified and then accessed by the ftrace helper functions, it should + * not be called again if ftrace_tree_found != 0. + * + ******************************************************************************/ +#ifndef MAX_PATH +#define MAX_PATH 256 +#endif +#define _STR(x) #x +#define STR(x) _STR(x) + +/* + * Variables used to locate and save the path to the mounted debugfs + * filesystem. The path is saved in debugfs_path if the debugfs is found. + */ +static char debugfs_path[MAX_PATH+1]; + +static const char *find_debugfs( void ) +{ + FILE *fp; + char *path = (char *)NULL; + static char mount_fstype[100]; + + /* Try to open the /proc/mounts directory for reading */ + if ( (fp = fopen( "/proc/mounts", "r" )) != (FILE *)NULL ) { + /* + * Scan /proc/mounts for a mounted filesystem + * of type 'debugfs' + */ + while ( fscanf( fp, "%*s %" STR(MAX_PATH) "s %99s %*s %*d %*d\n", + debugfs_path, mount_fstype ) == 2 ) { + if ( !(strcmp( mount_fstype, "debugfs" )) ) { + /* + * Found the mount point path for the debugfs filesystem + * Return a pointer to the mount point path string. + */ + path = debugfs_path; + break; + } + } + fclose(fp); + } + + return( (const char *)path ); +} + +/**/ +/****************************************************************************** + * + * Affine bdi writeback workqueues to the specified cpuset + * + * Takes a cpuset mask specifying the acceptable target CPUs + * + * Called only from tweak_system_tunables_for_isolation with pathname lock held + * + ******************************************************************************/ +static void affine_bdi_workqueues( cpu_set_t *hskpg_mask ) +{ + unsigned long int cpumask; + int cpunum, proc_sys_fileno; + int retval = -1; + + /* + * Create an integer bit mask for the specified cpuset + * If the specified cpuset mask was invalid, default to CPU 0 + * Then convert the bitmask to an ASCII string in cpulist + */ + for ( cpunum = cpumask = 0; cpunum < numcpus; cpunum++ ) { + if ( CPU_ISSET( cpunum, hskpg_mask) ) + cpumask |= (1 << cpunum); + } + if ( !cpumask ) + cpumask = 1; + + pthread_cleanup_push( (void(*)(void *))sem_post, (void *)&fieldname_lock ); + sem_wait( &fieldname_lock ); + + snprintf( cpulist, sizeof( cpulist ), "%lx", cpumask ); + + /* Move bdi writeback workqueues to CPU0 */ + strcpy_bounded( pathname_buf, + "/sys/bus/workqueue/devices/writeback/cpumask" ); + proc_sys_fileno = open( pathname_buf, (O_RDWR | O_TRUNC) ); + if ( proc_sys_fileno > 0 ) { + retval = write( proc_sys_fileno, cpulist, strlen( cpulist ) ); + close( proc_sys_fileno ); + } + + pthread_cleanup_pop( 1 ); + + /* Make the C compiler happy... 
do something with retval */ + if ( retval ) retval = 0; +} + +/**/ +/****************************************************************************** + * + * Tweak system tunables for full dynticks isolation + * + ******************************************************************************/ +static void tweak_system_tunables_for_isolation( cpu_set_t *hskpg_mask, + cpu_set_t *hiperf_mask ) +{ + char *debugfs; + int retval = 0, proc_sys_fileno; + + /* Affine all IRQs to housekeeping CPUs */ + affine_irqs_to_housekeeping( hskpg_mask ); + + pthread_cleanup_push( (void(*)(void *))sem_post, (void *)&pathname_lock ); + sem_wait( &pathname_lock ); + + /* Eliminate cpufreq-related timer activity on all isolated CPU cores */ + fix_cpufreq_governor( hiperf_mask ); + + debugfs = (char *)find_debugfs(); + if ( debugfs ) { + /* Try to disable sched_tick_max_deferment */ + strcpy_bounded( pathname_buf, debugfs ); + strcat_bounded( pathname_buf, "/sched_tick_max_deferment" ); + proc_sys_fileno = open( pathname_buf, (O_RDWR | O_TRUNC) ); + if ( proc_sys_fileno > 2 ) { + retval = write( proc_sys_fileno, "-1", 2 ); + close( proc_sys_fileno ); + } + } + + /* Move bdi writeback workqueues to housekeeping CPUs */ + affine_bdi_workqueues( hskpg_mask ); + + /* Delay the vmstat timer (1000 seconds) */ + strcpy_bounded( pathname_buf, "/proc/sys/vm/stat_interval" ); + proc_sys_fileno = open( pathname_buf, (O_RDWR | O_TRUNC) ); + if ( proc_sys_fileno > 0 ) { + retval = write( proc_sys_fileno, "1000", 4 ); + close( proc_sys_fileno ); + } + + /* Delay the vm writeback timer (10000 centiseconds) */ + strcpy_bounded( pathname_buf, "/proc/sys/vm/dirty_writeback_centisecs" ); + proc_sys_fileno = open( pathname_buf, (O_RDWR | O_TRUNC) ); + if ( proc_sys_fileno > 0 ) { + retval = write( proc_sys_fileno, "10000", 5 ); + close( proc_sys_fileno ); + } + + /* Delay the vm expire timer (10000 centiseconds) */ + strcpy_bounded( pathname_buf, "/proc/sys/vm/dirty_expire_centisecs" ); + proc_sys_fileno = open( pathname_buf, (O_RDWR | O_TRUNC) ); + if ( proc_sys_fileno > 0 ) { + retval = write( proc_sys_fileno, "10000", 5 ); + close( proc_sys_fileno ); + } + + /* Shut down the NMI watchdog as it uses perf events */ + strcpy_bounded( pathname_buf, "/proc/sys/kernel/watchdog" ); + proc_sys_fileno = open( pathname_buf, (O_RDWR | O_TRUNC) ); + if ( proc_sys_fileno > 0 ) { + retval = write( proc_sys_fileno, "0", 1 ); + close( proc_sys_fileno ); + } + + pthread_cleanup_pop( 1 ); + + /* Make the C compiler happy...
do something with retval */ + if ( retval ) retval = 0; +} + +/**/ +/****************************************************************************** + * + * Indicate whether the specified CPU is protected from load balancing + * + * Takes a pathname for the isolated cpuset, + * a pointer to the isolated cpuset mask and + * an int specifying the target CPU + * + * Returns an int = 0 if the CPU is not designated as isolated, else nonzero + * + ******************************************************************************/ +static int cpu_isolated( const char *isol_cpuset_name, + cpu_set_t *isol_mask, int cpunum ) +{ + int retcode = 0; + char isolated_value[2] = {'0','\0'}; + + if ( CPU_ISSET( cpunum, isol_mask) ) { + get_per_cpu_field_for( cpunum, isol_cpuset_name, "sched_load_balance", + isolated_value, 1 ); + if ( isolated_value[0] == (char)'0' ) + retcode = 1; + } + + return( retcode ); +} + +/**/ +/****************************************************************************** + * + * Indicate whether the specified CPU is designated for full dynticks isolation + * + * Takes a pathname for the isolated cpuset, + * a pointer to the isolated cpuset mask and + * an int specifying the target CPU + * + * Returns an int = 0 if the CPU is not designated as tickless, else nonzero + * + ******************************************************************************/ +static int cpu_wants_full_dynticks( const char *isol_cpuset_name, + cpu_set_t *isol_mask, int cpunum ) +{ + int retcode = 0; + char fulldynticks_value[2] = {'0','\0'}; + + if ( CPU_ISSET( cpunum, isol_mask) ) { + get_per_cpu_field_for( cpunum, isol_cpuset_name, "fulldynticks", + fulldynticks_value, 1 ); + if ( fulldynticks_value[0] == (char)'1' ) { + retcode = 1; + + } else { + get_per_cpu_field_for( cpunum, isol_cpuset_name, "quiesce", + fulldynticks_value, 1 ); + if ( fulldynticks_value[0] == (char)'1' ) + retcode = 1; + } + } + + return( retcode ); +} + +/**/ +/****************************************************************************** + * + * Data structures associated with ODP isolation helper 'adapter layer' + * + ******************************************************************************/ +#define HOUSEKEEPING 0 +#define HIPERF 1 + +#define SET_DISABLED 0 +#define SET_ENABLED 1 + +/* CPUSET mask for the 'control plane' CPU core(s) */ +static cpu_set_t cplane; +/* CPUSET mask for the 'data plane' CPU cores as a group */ +static cpu_set_t dplane_master; +/* Array of CPUSET masks - one for each 'data plane' CPU core */ +static cpu_set_t dplane[MAX_CPUS_SUPPORTED]; + +/* CPUSET mask for the currently unallocated 'data plane' CPU core(s) */ +static cpu_set_t dplane_available; +/**/ +/****************************************************************************** + * + * ODP isolation helper 'adapter layer' functions + * + * These functions map the cpuset functions above onto relevant ODP 'hooks' + * and also will serve as a place to hook in an additional persistent + * cpuset management layer which can support multiple instances of ODP and / or + * multiple ODP applications concurrently and coordinate CPU usage between them + * + * NOTE: These functions will be the only externally visible components + * of the ODP isolation helper package. All other functions and + * data structures associated with the package will be internal and + * opaque to ODP. 
+ * + ******************************************************************************/ + +/****************************************************************************** + * + * Set up a cpuset mask for the 'control plane' which contains at least CPU 0 + * and possibly additional CPUs + * Set up a 'data plane master' cpuset mask as well as + * single-CPU 'data plane' cpuset masks for any cores not in the control plane + * + ******************************************************************************/ +static void setup_cpuset_masks( void ) +{ + int cpu, num_cplane_cpus, num_dplane_cpus; + cpu_set_t *ctl_set; + cpu_set_t *cur_set; + cpu_set_t *davl_set; + cpu_set_t *dmst_set; + + /* + * Allocate available CPUs to either the control plane or the + * data plane according to the defined housekeeping CPU ratio + */ + num_cplane_cpus = (numcpus * HOUSEKEEPING_RATIO_MULTIPLIER) / + HOUSEKEEPING_RATIO_DIVISOR; + /* A minimum of one housekeeping CPU is required */ + if ( num_cplane_cpus < 1 ) + num_cplane_cpus++; + num_dplane_cpus = numcpus - num_cplane_cpus; + + /* Initialize storage for the control plane cpuset mask */ + ctl_set = &cplane; + CPU_ZERO( ctl_set ); + for ( cpu = 0; cpu < num_cplane_cpus; cpu++ ) + CPU_SET( cpu, ctl_set ); + + /* + * Initialize storage for the data plane cpuset masks + */ + dmst_set = &dplane_master; + davl_set = &dplane_available; + CPU_ZERO( davl_set ); + CPU_ZERO( dmst_set ); + for ( cpu = 1; cpu < numcpus; cpu++ ) { + cur_set = &(dplane[cpu]); + CPU_ZERO( cur_set ); + if ( !(CPU_ISSET( cpu, ctl_set )) ) { + CPU_SET( cpu, davl_set ); + CPU_SET( cpu, dmst_set ); + CPU_SET( cpu, cur_set ); + } + } +} + +/**/ +/****************************************************************************** + * + * De-populate cpuset mask for the 'control plane' + * De-populate single-CPU 'data plane' cpuset masks + * + ******************************************************************************/ +static void clear_cpuset_masks( void ) +{ + int cpu; + + /* De-activate the control plane cpuset mask */ + CPU_ZERO( &cplane ); + + /* + * De-activate the data plane cpuset masks + */ + for ( cpu = 0; cpu < numcpus; cpu++ ) { + CPU_ZERO( &dplane[cpu] ); + } + + /* De-activate the data plane master cpuset mask */ + CPU_ZERO( &dplane_master ); + CPU_ZERO( &dplane_available ); +} + +/**/ + +/* + * Verify the level of underlying operating system support. + * (Return with error if the OS does not at least support cpusets) + * Set up system-wide CPU masks and cpusets + * (Future) Set up file-based persistent cpuset management layer + * to allow cooperative use of system isolation resources + * by multiple independent ODP instances. + */ +int odph_isolation_init_global( void ) +{ + int rc = 0; + + /* + * Since this function will be invoked from each ODP instance, + * it should first test to see if a prior instance has already + * initialized the underlying plumbing for isolation. + * If this is the case, then CPU 0 should always be present + * in the cplane mask and NOT present in the dplane_master mask - + * and the sum of cpus in the two masks should equal the sum of + * all possible CPUs. + */ + if (!(((CPU_COUNT(&cplane) + CPU_COUNT(&dplane_master)) == numcpus) && + (CPU_ISSET(0, &cplane) && !(CPU_ISSET(0, &dplane_master))))) { + /* + * Either initialization has not yet been performed or else + * the cpuset masks are corrupted... (re)initialize everything + * if the underlying platform has adequate isolation support. + */ + if ( ! 
(rc = init_cpusets()) ) { + /* + * Looks like we have at least minimal support for + * isolation... proceed. + */ + setup_cpuset_masks(); + + /* + * Create the management directory heirarchy for all + * cpusets we will be using. Do this in hierarchical + * order, creating 'parents' before 'descendants'. + */ + create_cpuset( "cplane", &cplane, HOUSEKEEPING ); + create_cpuset( "dplane", &dplane_master, HIPERF ); + create_per_core_cpusets( "dplane", &dplane_master, + HIPERF ); + + /* + * Turn off scheduler load balancing in the top-level + * cpuset. This effectively enables the use of all the + * 'descendant' cpusets created to support isolation. + */ + set_cpuset_isolation( (const char *)NULL, SET_ENABLED ); + + /* + * Enable isolation and disable load balancing on + * data plane cpusets. + */ + set_per_core_cpusets_isolated( "dplane", &dplane_master, + SET_ENABLED ); + set_cpuset_isolation( "dplane", SET_ENABLED ); + + /* + * Migrate interrupts onto control plane cores + * and optimize other system tunables for isolation + */ + tweak_system_tunables_for_isolation( &cplane, + &dplane_master ); + + /* + * Enable full tickless operation on data plane cpusets + */ + request_per_core_dynticks( "dplane", &dplane_master, + SET_ENABLED ); + request_dynticks( "dplane", SET_ENABLED ); + + /* + * Try to migrate all tasks onto control plane CPUs + */ + migrate_tasks( (char *)NULL, "cplane", (pid_t *)NULL ); + } + } + return rc; +} + +/* + * Migrate all tasks from cpusets created for isolation support to the + * generic boot-level single cpuset. + * Remove all isolated CPU environments and cpusets + * Zero out system-wide CPU masks + * (Future) Reset persistent file-based cpuset management layer + * to show no system isolation resources are available. + */ +int odph_isolation_term_global( void ) +{ + int rc = 0; + + /* + * Since this function will be invoked from each ODP instance, + * it should first test to see if a prior instance has already + * reset and eliminated the underlying plumbing for isolation. + * If so, it should return immediately without error. + */ + if ( CPU_ISSET( 0, &cplane ) ) { + /* + * Try to migrate all tasks onto default cpuset + */ + migrate_tasks( "cplane", (char *)NULL, (pid_t *)NULL ); + migrate_tasks( "dplane", (char *)NULL, (pid_t *)NULL ); + migrate_per_core_tasks( "dplane", &dplane_master, + (char *)NULL ); + + /* + * Disable isolation and enable load balancing + * on data plane cpusets + */ + set_per_core_cpusets_isolated( "dplane", &dplane_master, + SET_DISABLED ); + set_cpuset_isolation( "dplane", SET_DISABLED ); + + /* + * Disable full tickless operation on data plane cpusets + */ + request_per_core_dynticks( "dplane", &dplane_master, + SET_DISABLED ); + request_dynticks( "dplane", SET_DISABLED ); + + /* + * Turn on scheduler load balancing in the top-level cpuset + * This effectively disables the use of all the 'descendant' + * cpusets created to support isolation. + */ + set_cpuset_isolation( (const char *)NULL, SET_DISABLED ); + + /* + * Delete the management directory heirarchy for all cpusets + * we were using. Do this in reverse hierarchical order, + * deleting 'descendants' before 'parents'. + */ + delete_per_core_cpusets( "dplane", &dplane_master ); + delete_cpuset( "dplane" ); + delete_cpuset( "cplane" ); + + /* + * Clean out cpuset masks to show isolation support terminated + */ + clear_cpuset_masks(); + } + + return rc; +} + +/* + * If this is a worker thread, migrate all possible tasks and timers + * away from the isolated cpuset for this thread. 
+ * (Future) In the persistent management layer, mark the CPUs requested + * by this thread as allocated and unavailable to subsequent threads. +int odph_isolation_init_local( void ) +{ + int rc = 0; + + return rc; +} + */ + +/* + * (Future) In the persistent management layer, mark the CPUs used + * by this thread as deallocated and available to subsequent threads. +int odph_isolation_term_local( void ) +{ + int rc = 0; + + return rc; +} + */ + +static void *odp_run_start_routine(void *arg) +{ + odp_start_args_t *start_args = arg; + + /* ODP thread local init */ + if (odp_init_local(ODP_THREAD_WORKER)) { + ODPH_ERR("Local init failed\n"); + return NULL; + } + + ODPH_DBG( "Starting pthread routine @%p with args @%p\n", + start_args->start_routine, start_args->arg ); + void *ret_ptr = start_args->start_routine(start_args->arg); + ODPH_DBG( "Pthread routine @%p returned\n", + start_args->start_routine ); + + int ret = odp_term_local(); + if (ret < 0) + ODPH_ERR("Local term failed\n"); + else { + if (ret == 0 && odp_term_global()) + ODPH_ERR("Global term failed\n"); + } + + return ret_ptr; +} + +/* + * NOTE: When using isolated 'worker' pthreads, the main ODP process which + * creates these worker threads MUST run in the data plane 'master' cpuset. + * This is because these pthreads are 'part of' this process and therefore + * have to execute in the same scheduler domain. For the same reason, it + * doesn't work to place the cloned pthread into an isolated single-CPU + * cpuset - because that is an isolated scheduling domain from which the + * parent process isn't reachable. So if pthreads are employed they will + * all have to execute in the 'master' data plane cpuset. + * We will adopt a convention that the main ODP process executes + * on the first CPU in the data plane. If it spawns a worker thread for + * each CPU in the data plane, one of these workers will have to share + * the first data lane CPU with the ODP main process. This will also mean + * that tickless isolation of an ODP pthread will require at least three + * CPU cores (one 'control plane' and two 'data plane' CPUs). + * + * If only independent processes are used with the isolated cores - + * that is if no data plane pthreads are used - then the main ODP process can + * execute from the control plane. + */ +/** + * Creates and launches pthreads + * + * Creates, pins and launches threads to separate CPU's based on the cpumask. + * + * @param thread_tbl Thread table + * @param mask CPU mask + * @param start_routine Thread start function + * @param arg Thread argument + * + * @return Number of threads created + */ +int odph_linux_isolated_pthread_create(odph_linux_pthread_t *thread_tbl, + const odp_cpumask_t *mask_in, + void *(*start_routine) (void *), + void *arg) +{ + char pidstring[32]; + pid_t pid; + cpu_set_t cur_mask; + int retcode = 0; + int isolation_enabled = 0; + int i; + int num; + int cpu_count; + int cpu; + odp_cpumask_t mask; + + /* + * Make a local copy of the caller's cpumask and find the first CPU + * which is a member of that cpuset + */ + odp_cpumask_copy(&mask, mask_in); + cpu = odp_cpumask_first(&mask); + + /* + * Get the cpumask where the parent task is currently executing. 
+ * We will use this to return the parent task to its original cpuset + * after all the pthreads have been cloned + */ + CPU_ZERO( &cur_mask ); + sched_getaffinity(0, sizeof(cpu_set_t), &cur_mask); + + /* Create a pid string for parent task migration */ + snprintf( pidstring, sizeof( pidstring ), "%d", gettaskid() ); + + /* Ensure thread-safe access to our static buffers */ + pthread_cleanup_push( (void(*)(void *))sem_post, + (void *)&pathname_lock ); + sem_wait( &pathname_lock ); + + /* + * Identify the cpuset containing the destination CPU + * and then migrate the parent task onto it before forking. + */ + if (CPU_ISSET(cpu, &dplane_master)) { + /* + * Most likely case - the destination CPU is + * a data plane 'worker' CPU. + * Get the name of the single-core cpuset + * which contains the specified 'worker' CPU + */ + isolation_enabled = 1; + strcpy( pathname_buf, "dplane/cpu" ); + snprintf( cpuname, sizeof( cpuname ), "%d", cpu ); + strcat_bounded( pathname_buf, cpuname ); + /* + * If the parent task isn't already running + * on the first data plane CPU, migrate it there. + */ + if ( !(CPU_EQUAL( &dplane_master, &cur_mask )) ) { + /* + * Move the parent task to the master data plane + * cpuset. Then force it to the first CPU + * in that cpuset. + */ + retcode = migrate_task( pidstring, "dplane" ); + } + if ( !retcode ) { + for ( i = 1; i < numcpus; i++ ) { + if (CPU_ISSET(i, &dplane_master)) { + retcode = + sched_setaffinity( 0, + sizeof( cpu_set_t), + &(dplane[i]) ); + sleep_nsec( 10000000 ); + sched_setaffinity( 0, + sizeof( cpu_set_t), + &dplane_master ); + break; + } + } + } + } else { + if (CPU_ISSET(cpu, &cplane)) { + /* + * Less likely case - the destination CPU is + * a control plane 'housekeeping' CPU. + * Migrate the parent task to any control plane CPU. + */ + isolation_enabled = 1; + strcpy( pathname_buf, "cplane" ); + if ( !(CPU_EQUAL( &cplane, &cur_mask )) ) + retcode = migrate_task( pidstring, "cplane" ); + } + /* Otherwise isolation probably isn't enabled here */ + } + /* Unlock access to static buffers */ + pthread_cleanup_pop( 1 ); + + if ( isolation_enabled && retcode ) { + ODPH_ERR("parent task migration failed\n"); + return 0; + } + + /* + * At this point the parent task has been migrated to a cpuset + * which matches or contains the caller's cpuset. + * Now we can clone pthreads and affine them to specific CPUs + * within the parent task's available cpuset scheduling domain. + */ + + num = odp_cpumask_count(&mask); + memset(thread_tbl, 0, num * sizeof(odph_linux_pthread_t)); + + cpu_count = odp_cpu_count(); + + if (num < 1 || num > cpu_count) { + ODPH_ERR("Invalid number of threads: %d (%d cores available)\n", + num, cpu_count); + return 0; + } + + for (i = 0; i < num; i++) { + odp_cpumask_t thd_mask; + + /* + * Start worker threads with next CPU after the one running + * the main ODP process. This allows us to avoid + * running a worker thread alongside the main ODP process + * if the number of worker threads requested is fewer than + * the number of CPUs in the caller's cpumask. + * If the requested number of worker threads equals the number + * of CPUs in the caller's cpumask, we will 'wrap' and + * run the last worker alongside the main ODP process + * on the first CPU in the caller's cpumask. 
+ */ + if ( cpu == odp_cpumask_last(&mask) ) + cpu = odp_cpumask_first(&mask); + else + cpu = odp_cpumask_next(&mask, cpu); + + odp_cpumask_zero(&thd_mask); + odp_cpumask_set(&thd_mask, cpu); + + pthread_attr_init(&thread_tbl[i].attr); + + thread_tbl[i].cpu = cpu; + + pthread_attr_setaffinity_np(&thread_tbl[i].attr, + sizeof(cpu_set_t), &thd_mask.set); + + thread_tbl[i].start_args = malloc(sizeof(odp_start_args_t)); + if (thread_tbl[i].start_args == NULL) + ODPH_ABORT("Malloc failed"); + + thread_tbl[i].start_args->start_routine = start_routine; + thread_tbl[i].start_args->arg = arg; + + retcode = pthread_create(&thread_tbl[i].thread, + &thread_tbl[i].attr, + odp_run_start_routine, + thread_tbl[i].start_args); + if (retcode) { + ODPH_ERR("Failed to start thread on cpu #%d\n", cpu); + free(thread_tbl[i].start_args); + break; + } + } + + return i; +} + + +/** + * Fork a process + * + * Forks a child process running on the specified CPU. + * + * @param proc Pointer to process state info (for output) + * @param cpu Destination CPU for the child process + * + * @return On success: 1 for the parent, 0 for the child + * On failure: -1 for the parent, -2 for the child + */ +int odph_linux_isolated_process_fork(odph_linux_process_t *proc, int cpu) +{ + char pidstring[32]; + pid_t pid; + cpu_set_t cur_mask; + int retcode = 0; + int isolation_enabled = 0; + + /* + * Get the cpumask where the parent task is currently executing. + * We will use this to return the parent task to its original cpuset + * after all the forks have been completed + */ + CPU_ZERO( &cur_mask ); + sched_getaffinity(0, sizeof(cpu_set_t), &cur_mask); + + /* Create a pid string for parent task migration */ + snprintf( pidstring, sizeof( pidstring ), "%d", gettaskid() ); + + /* Ensure thread-safe access to our static buffers */ + pthread_cleanup_push( (void(*)(void *))sem_post, + (void *)&pathname_lock ); + sem_wait( &pathname_lock ); + + /* + * Identify the cpuset containing the destination CPU + * and then migrate the parent task onto it before forking. + */ + if (CPU_ISSET(cpu, &dplane_master)) { + /* + * Most likely case - the destination CPU is + * a data plane 'worker' CPU. + * Get the name of the single-core cpuset + * which contains the specified 'worker' CPU + */ + isolation_enabled = 1; + strcpy( pathname_buf, "dplane/cpu" ); + snprintf( cpuname, sizeof( cpuname ), "%d", cpu ); + strcat_bounded( pathname_buf, cpuname ); + } else { + if (CPU_ISSET(cpu, &cplane)) { + /* + * Less likely case - the destination CPU is + * a control plane 'housekeeping' CPU. 
+ */ + isolation_enabled = 1; + strcpy( pathname_buf, "cplane" ); + } + /* Otherwise isolation probably isn't enabled here */ + } + + /* + * If the new task is to be created in a different cpuset, + * then we first have to migrate the parent task there + */ + if ( isolation_enabled ) { + /* Migrate the parent task to the specified cpuset */ + retcode = migrate_task( pidstring, pathname_buf ); + } + + /* Unlock access to static buffers */ + pthread_cleanup_pop( 1 ); + + if ( isolation_enabled && retcode ) { + ODPH_ERR("parent task migration failed\n"); + return -2; + } + + pid = fork(); + + if (pid == 0) { + /* Child process */ + if ( !isolation_enabled ) { + odp_cpumask_t proc_mask; + + odp_cpumask_zero(&proc_mask); + odp_cpumask_set(&proc_mask, cpu); + if (sched_setaffinity(0, sizeof(cpu_set_t), + &proc_mask.set)) { + ODPH_ERR("sched_setaffinity() failed\n"); + return -2; + } + } + + if (odp_init_local(ODP_THREAD_WORKER)) { + ODPH_ERR("Local init failed\n"); + return -2; + } else { + return 0; + } + } else { + /* + * Parent process... + * After the fork is completed, the parent task may need to + * migrate back to the cpuset where it was running + * before starting the fork. + */ + if ( isolation_enabled ) { + if ( CPU_EQUAL( &cplane, &cur_mask ) ) + retcode = migrate_task( pidstring, "cplane" ); + else if ( CPU_EQUAL( &dplane_master, &cur_mask ) ) + retcode = migrate_task( pidstring, "dplane" ); + else + retcode = migrate_task( pidstring, + (const char *)NULL ); + } + + if (pid > 0) { + /* Fork was successful... init ODP process data */ + memset(proc, 0, sizeof(odph_linux_process_t)); + proc->pid = pid; + proc->cpu = cpu; + + /* + * Parent has been migrated back off of isolated CPU + * so only a single runnable task remains there... + */ + if (CPU_ISSET(cpu, &dplane_master)) { + /* + * Migrate timers / hrtimers away from the + * destination CPU if it is an isolated 'worker' + */ + set_per_cpu_field_for( cpu, "dplane", + "quiesce", "1" ); + } + + return 1; + } else { + ODPH_ERR("fork() failed\n"); + return -1; + } + } +} + +/** + * Fork a number of processes + * + * Forks child processes running on the specified CPUs + * + * @param proc_tbl Process state info table (for output) + * @param mask CPU mask of processes to create + * + * @return On success: 1 for the parent, 0 for the child + * On failure: -1 for the parent, -2 for the child + */ +int odph_linux_isolated_process_fork_n(odph_linux_process_t *proc_tbl, + const odp_cpumask_t *mask_in) +{ + odp_cpumask_t mask; + pid_t pid; + int num; + int cpu_count; + int cpu; + int i; + int retcode; + + odp_cpumask_copy(&mask, mask_in); + num = odp_cpumask_count(&mask); + + memset(proc_tbl, 0, num * sizeof(odph_linux_process_t)); + + cpu_count = odp_cpu_count(); + + if (num < 1 || num > cpu_count) { + ODPH_ERR("Bad num\n"); + return -1; + } + + cpu = odp_cpumask_first(&mask); + for (i = 0; i < num; i++) { + /* + * Try to fork a new task for each CPU in the caller's cpumask + */ + retcode = odph_linux_isolated_process_fork(&(proc_tbl[i]), cpu); + /* + * If this is a child process or an error occurred, we're done... + * The parent process will remain in the loop as long as the + * forks are successful and more CPUs remain in the mask. 
+ */ + if ( retcode != 1 ) + break; + cpu = odp_cpumask_next(&mask, cpu); + } + + return retcode; +} + +int odph_cpumask_default_worker(odp_cpumask_t *mask, int num) +{ + int avail, cpu, i; + cpu_set_t cpuset; + + /* + * build the local cpuset mask, allocating down from + * the highest numbered available data plane CPU + */ + CPU_ZERO( &cpuset ); + for (avail = 0, i = numcpus - 1; i > 0; --i) { + if (CPU_ISSET(i, &dplane_available)) { + /* Add this CPU to caller's cpuset mask */ + CPU_SET(i, &cpuset); + /* Remove it from the available data plane CPUs */ + CPU_CLR(i, &dplane_available); + /* increment the number of CPUs allocated */ + avail++; + } + } + if (avail == 0) + ODP_ABORT("no isolatable CPUs available\n"); + + odp_cpumask_zero(mask); + + /* + * If no user supplied number or it's too large, then attempt + * to use all available CPUs + */ + if (0 == num || avail < num) + num = avail; + + /* build the mask, allocating down from highest numbered CPU */ + for (cpu = 0, i = CPU_SETSIZE - 1; i >= 0 && cpu < num; --i) { + if (CPU_ISSET(i, &cpuset)) { + odp_cpumask_set(mask, i); + cpu++; + } + } + + return cpu; +} + +int odph_cpumask_default_control(odp_cpumask_t *mask, int num ODP_UNUSED) +{ + int avail, i; + + /* + * Use all control plane CPUs since these are inherently shared. + */ + odp_cpumask_zero(mask); + for (avail = i = 0; i < numcpus; ++i) { + if (CPU_ISSET(i, &cplane)) { + /* Add this CPU to caller's cpuset mask */ + odp_cpumask_set(mask, i); + avail++; + } + } + return avail; +} + diff --git a/test/performance/odp_pktio_perf.c b/test/performance/odp_pktio_perf.c index efd26dc..9583e01 100644 --- a/test/performance/odp_pktio_perf.c +++ b/test/performance/odp_pktio_perf.c @@ -26,6 +26,7 @@ #include <odp/helper/ip.h> #include <odp/helper/udp.h> #include <odp/helper/linux.h> +#include <odp/helper/linux_isolation.h> #include <getopt.h> #include <stdlib.h> @@ -547,8 +548,8 @@ static int setup_txrx_masks(odp_cpumask_t *thd_mask_tx, int i, cpu; num_workers = - odp_cpumask_default_worker(&cpumask, - gbl_args->args.cpu_count); + odph_cpumask_default_worker(&cpumask, + gbl_args->args.cpu_count); if (num_workers < 2) { LOG_ERR("Need at least two cores\n"); return -1; @@ -620,8 +621,8 @@ static int run_test_single(odp_cpumask_t *thd_mask_tx, /* start receiver threads first */ args_rx.batch_len = gbl_args->args.rx_batch_len; - odph_linux_pthread_create(&thd_tbl[0], thd_mask_rx, - run_thread_rx, &args_rx); + odph_linux_isolated_pthread_create(&thd_tbl[0], thd_mask_rx, + run_thread_rx, &args_rx); odp_barrier_wait(&gbl_args->rx_barrier); num_rx_workers = odp_cpumask_count(thd_mask_rx); @@ -630,8 +631,9 @@ static int run_test_single(odp_cpumask_t *thd_mask_tx, args_tx.pps = status->pps_curr / num_tx_workers; args_tx.duration = gbl_args->args.duration; args_tx.batch_len = gbl_args->args.tx_batch_len; - odph_linux_pthread_create(&thd_tbl[num_rx_workers], thd_mask_tx, - run_thread_tx, &args_tx); + odph_linux_isolated_pthread_create(&thd_tbl[num_rx_workers], + thd_mask_tx, + run_thread_tx, &args_tx); odp_barrier_wait(&gbl_args->tx_barrier); /* wait for transmitter threads to terminate */ @@ -994,9 +996,14 @@ int main(int argc, char **argv) odp_shm_t shm; int max_thrs; + odph_isolation_term_global(); + if (odp_init_global(NULL, NULL) != 0) LOG_ABORT("Failed global init.\n"); + if (odph_isolation_init_global() != 0) + LOG_ABORT("Failed global isolation init.\n"); + if (odp_init_local(ODP_THREAD_CONTROL) != 0) LOG_ABORT("Failed local init.\n"); @@ -1043,5 +1050,7 @@ int main(int argc, char **argv) 
test_term(); } + odph_isolation_term_global(); + return ret; }
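
For anyone wanting to try the helpers outside of odp_pktio_perf, the calling sequence from an application looks roughly like the sketch below. This is only a minimal illustration based on the helpers introduced in this patch (plus the existing odph_linux_pthread_join() from linux.h); the worker body, thread-table size and error handling are placeholders, not part of the proposed API.

#include <odp.h>
#include <odp/helper/linux.h>
#include <odp/helper/linux_isolation.h>

static void *worker_fn(void *arg)	/* placeholder worker body */
{
	(void)arg;
	/* ... packet processing loop ... */
	return NULL;
}

int main(void)
{
	odph_linux_pthread_t thr_tbl[8];
	odp_cpumask_t workers;
	int num;

	if (odp_init_global(NULL, NULL))
		return -1;

	/* Needs root and cpuset support; builds the cplane/dplane cpusets */
	if (odph_isolation_init_global())
		return -1;

	if (odp_init_local(ODP_THREAD_CONTROL))
		return -1;

	/* Allocate isolated data plane CPUs (0 requests all available) */
	num = odph_cpumask_default_worker(&workers, 0);

	/* Clone worker threads pinned inside the isolated cpusets */
	num = odph_linux_isolated_pthread_create(thr_tbl, &workers,
						 worker_fn, NULL);
	odph_linux_pthread_join(thr_tbl, num);

	/* Migrate tasks back and tear the cpuset hierarchy down */
	odph_isolation_term_global();
	return 0;
}

Note that, per the comment block above odph_linux_isolated_pthread_create(), tickless isolation of pthread workers needs at least three CPU cores (one control plane plus two data plane cores), because the main process must share the data plane scheduling domain with its threads.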
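
The process-based variant follows the same pattern. Below is a rough fragment of main() under the same assumptions, reaping children with plain waitpid() (so <sys/wait.h> and <stdlib.h> are assumed to be included); the helper fills in proc_tbl[i].pid and .cpu as shown in odph_linux_isolated_process_fork() above, and do_worker() is again a placeholder.

	odph_linux_process_t proc_tbl[8];
	odp_cpumask_t workers;
	int i, num, status, ret;

	num = odph_cpumask_default_worker(&workers, 2);
	ret = odph_linux_isolated_process_fork_n(proc_tbl, &workers);

	if (ret == 0) {
		/* Child: already migrated and locally initialized by the helper */
		do_worker();
		exit(odp_term_local());
	}

	if (ret == 1) {
		/* Parent: wait for the isolated children to finish */
		for (i = 0; i < num; i++)
			waitpid(proc_tbl[i].pid, &status, 0);
	}

In contrast to the pthread case, forked workers get their own isolated scheduling domains, so once the forks are done the parent process can keep running from the control plane cpuset.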