From patchwork Mon Mar 21 17:48:30 2016
From: Venkatesh Vivekanandan <venkatesh.vivekanandan@linaro.org>
To: lng-odp@lists.linaro.org
Date: Mon, 21 Mar 2016 23:18:30 +0530
Message-Id: <1458582510-21606-1-git-send-email-venkatesh.vivekanandan@linaro.org>
X-Mailer: git-send-email 1.9.1
Subject: [lng-odp] [PATCH v4 1/6] linux-generic: odp-ipfw: netmap-ipfw files

The example/ipfw files are taken as-is from netmap-ipfw; the source can
be found at https://github.com/luigirizzo/netmap-ipfw.git

Signed-off-by: Venkatesh Vivekanandan <venkatesh.vivekanandan@linaro.org>
---
v2: remove permanent stub files and generate them while building
v3: fix alphabetical order of .c files in Makefile and update README
v4: reduced version of the odp-ipfw app; it supports only direct mode,
    with no schedule/queue involved
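Note on "direct mode" (editorial illustration, not part of the patch):
in ODP terms it means the application polls packets straight from the
pktio device instead of receiving them through the scheduler or event
queues. Assuming the odp_pktin_recv() API and a pre-created packet
pool, the receive path looks roughly like:

	odp_pktio_param_t param;
	odp_pktio_param_init(&param);
	param.in_mode = ODP_PKTIN_MODE_DIRECT;	/* no scheduler involved */

	odp_pktio_t pktio = odp_pktio_open("eth0", pool, &param);
	odp_pktin_queue_t inq;
	odp_pktin_queue(pktio, &inq, 1);	/* fetch the input queue */
	odp_pktio_start(pktio);

	odp_packet_t pkts[32];
	int n = odp_pktin_recv(inq, pkts, 32);	/* busy-poll the device */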
 example/ipfw/BSDmakefile                          |    8 +
 example/ipfw/UPDATE                               |   44 +
 example/ipfw/extra/expand_number.c                |  101 +
 example/ipfw/extra/glue.c                         |  555 +++
 example/ipfw/extra/glue.h                         |  488 ++
 example/ipfw/extra/humanize_number.c              |  167 +
 example/ipfw/extra/ipfw2_mod.c                    |  278 ++
 example/ipfw/extra/linux_defs.h                   |  144 +
 example/ipfw/extra/missing.c                      |  732 +++
 example/ipfw/extra/missing.h                      |  801 ++++
 example/ipfw/extra/session.c                      |  644 +++
 example/ipfw/extra/sys/contrib/pf/net/pfvar.h     |   27 +
 example/ipfw/extra/sys/sys/kernel.h               |   26 +
 example/ipfw/extra/sys/sys/malloc.h               |   13 +
 example/ipfw/extra/sys/sys/mbuf.h                 |  383 ++
 example/ipfw/extra/sys/sys/module.h               |   43 +
 example/ipfw/extra/sys/sys/systm.h                |  159 +
 example/ipfw/extra/sys/sys/taskqueue.h            |   51 +
 example/ipfw/ipfw/altq.c                          |  151 +
 example/ipfw/ipfw/dummynet.c                      | 1410 ++++++
 example/ipfw/ipfw/ipfw.8                          | 3723 +++++++++++++++
 example/ipfw/ipfw/ipfw2.c                         | 4968 +++++++++++++++++++++
 example/ipfw/ipfw/ipfw2.h                         |  352 ++
 example/ipfw/ipfw/ipv6.c                          |  536 +++
 example/ipfw/ipfw/main.c                          |  628 +++
 example/ipfw/ipfw/nat.c                           | 1115 +++++
 example/ipfw/ipfw/tables.c                        | 2013 +++++++++
 example/ipfw/sys/net/pfil.h                       |  148 +
 example/ipfw/sys/net/radix.c                      | 1208 +++++
 example/ipfw/sys/net/radix.h                      |  168 +
 example/ipfw/sys/netgraph/ng_ipfw.h               |   33 +
 example/ipfw/sys/netinet/in_cksum.c               |  146 +
 example/ipfw/sys/netinet/ip_dummynet.h            |  264 ++
 example/ipfw/sys/netinet/ip_fw.h                  | 1009 +++++
 example/ipfw/sys/netinet/tcp.h                    |  247 +
 example/ipfw/sys/netinet/udp.h                    |   69 +
 example/ipfw/sys/netpfil/ipfw/dn_heap.c           |  552 +++
 example/ipfw/sys/netpfil/ipfw/dn_heap.h           |  192 +
 example/ipfw/sys/netpfil/ipfw/dn_sched.h          |  192 +
 example/ipfw/sys/netpfil/ipfw/dn_sched_fifo.c     |  120 +
 example/ipfw/sys/netpfil/ipfw/dn_sched_prio.c     |  229 +
 example/ipfw/sys/netpfil/ipfw/dn_sched_qfq.c      |  864 ++++
 example/ipfw/sys/netpfil/ipfw/dn_sched_rr.c       |  307 ++
 example/ipfw/sys/netpfil/ipfw/dn_sched_wf2q.c     |  373 ++
 example/ipfw/sys/netpfil/ipfw/ip_dn_glue.c        |  846 ++++
 example/ipfw/sys/netpfil/ipfw/ip_dn_io.c          |  960 ++++
 example/ipfw/sys/netpfil/ipfw/ip_dn_private.h     |  404 ++
 example/ipfw/sys/netpfil/ipfw/ip_dummynet.c       | 2334 ++++++++++
 example/ipfw/sys/netpfil/ipfw/ip_fw2.c            | 2905 ++++++++++++
 example/ipfw/sys/netpfil/ipfw/ip_fw_dynamic.c     | 1604 +++++++
 example/ipfw/sys/netpfil/ipfw/ip_fw_iface.c       |  537 +++
 example/ipfw/sys/netpfil/ipfw/ip_fw_log.c         |  567 +++
 example/ipfw/sys/netpfil/ipfw/ip_fw_pfil.c        |  587 +++
 example/ipfw/sys/netpfil/ipfw/ip_fw_private.h     |  625 +++
 example/ipfw/sys/netpfil/ipfw/ip_fw_sockopt.c     | 3469 ++++++++++++++
 example/ipfw/sys/netpfil/ipfw/ip_fw_table.c       | 3674 +++++++++++++++
 example/ipfw/sys/netpfil/ipfw/ip_fw_table.h       |  246 +
 example/ipfw/sys/netpfil/ipfw/ip_fw_table_algo.c  | 4081 +++++++++++++++++
 example/ipfw/sys/netpfil/ipfw/ip_fw_table_value.c |  812 ++++
 example/ipfw/sys/sys/fnv_hash.h                   |   71 +
 example/ipfw/sys/sys/hash.h                       |  133 +
 61 files changed, 49536 insertions(+)
 create mode 100644 example/ipfw/BSDmakefile
 create mode 100644 example/ipfw/UPDATE
 create mode 100644 example/ipfw/extra/expand_number.c
 create mode 100644 example/ipfw/extra/glue.c
 create mode 100644 example/ipfw/extra/glue.h
 create mode 100644 example/ipfw/extra/humanize_number.c
 create mode 100644 example/ipfw/extra/ipfw2_mod.c
 create mode 100644 example/ipfw/extra/linux_defs.h
 create mode 100644 example/ipfw/extra/missing.c
 create mode 100644 example/ipfw/extra/missing.h
 create mode 100644 example/ipfw/extra/session.c
 create mode 100644 example/ipfw/extra/sys/contrib/pf/net/pfvar.h
 create mode 100644 example/ipfw/extra/sys/sys/kernel.h
 create mode 100644 example/ipfw/extra/sys/sys/malloc.h
 create mode 100644 example/ipfw/extra/sys/sys/mbuf.h
 create mode 100644 example/ipfw/extra/sys/sys/module.h
 create mode 100644 example/ipfw/extra/sys/sys/systm.h
 create mode 100644 example/ipfw/extra/sys/sys/taskqueue.h
 create mode 100644 example/ipfw/ipfw/altq.c
 create mode 100644 example/ipfw/ipfw/dummynet.c
 create mode 100644 example/ipfw/ipfw/ipfw.8
 create mode 100644 example/ipfw/ipfw/ipfw2.c
 create mode 100644 example/ipfw/ipfw/ipfw2.h
 create mode 100644 example/ipfw/ipfw/ipv6.c
 create mode 100644 example/ipfw/ipfw/main.c
 create mode 100644 example/ipfw/ipfw/nat.c
 create mode 100644 example/ipfw/ipfw/tables.c
 create mode 100644 example/ipfw/sys/net/pfil.h
 create mode 100644 example/ipfw/sys/net/radix.c
 create mode 100644 example/ipfw/sys/net/radix.h
 create mode 100644 example/ipfw/sys/netgraph/ng_ipfw.h
 create mode 100644 example/ipfw/sys/netinet/in_cksum.c
 create mode 100644 example/ipfw/sys/netinet/ip_dummynet.h
 create mode 100644 example/ipfw/sys/netinet/ip_fw.h
 create mode 100644 example/ipfw/sys/netinet/tcp.h
 create mode 100644 example/ipfw/sys/netinet/udp.h
 create mode 100644 example/ipfw/sys/netpfil/ipfw/dn_heap.c
 create mode 100644 example/ipfw/sys/netpfil/ipfw/dn_heap.h
 create mode 100644 example/ipfw/sys/netpfil/ipfw/dn_sched.h
 create mode 100644 example/ipfw/sys/netpfil/ipfw/dn_sched_fifo.c
 create mode 100644 example/ipfw/sys/netpfil/ipfw/dn_sched_prio.c
 create mode 100644 example/ipfw/sys/netpfil/ipfw/dn_sched_qfq.c
 create mode 100644 example/ipfw/sys/netpfil/ipfw/dn_sched_rr.c
 create mode 100644 example/ipfw/sys/netpfil/ipfw/dn_sched_wf2q.c
 create mode 100644 example/ipfw/sys/netpfil/ipfw/ip_dn_glue.c
 create mode 100644 example/ipfw/sys/netpfil/ipfw/ip_dn_io.c
 create mode 100644 example/ipfw/sys/netpfil/ipfw/ip_dn_private.h
 create mode 100644 example/ipfw/sys/netpfil/ipfw/ip_dummynet.c
 create mode 100644 example/ipfw/sys/netpfil/ipfw/ip_fw2.c
 create mode 100644 example/ipfw/sys/netpfil/ipfw/ip_fw_dynamic.c
 create mode 100644 example/ipfw/sys/netpfil/ipfw/ip_fw_iface.c
 create mode 100644 example/ipfw/sys/netpfil/ipfw/ip_fw_log.c
 create mode 100644 example/ipfw/sys/netpfil/ipfw/ip_fw_pfil.c
 create mode 100644 example/ipfw/sys/netpfil/ipfw/ip_fw_private.h
 create mode 100644 example/ipfw/sys/netpfil/ipfw/ip_fw_sockopt.c
 create mode 100644 example/ipfw/sys/netpfil/ipfw/ip_fw_table.c
 create mode 100644 example/ipfw/sys/netpfil/ipfw/ip_fw_table.h
 create mode 100644 example/ipfw/sys/netpfil/ipfw/ip_fw_table_algo.c
 create mode 100644 example/ipfw/sys/netpfil/ipfw/ip_fw_table_value.c
 create mode 100644 example/ipfw/sys/sys/fnv_hash.h
 create mode 100644 example/ipfw/sys/sys/hash.h

diff --git a/example/ipfw/BSDmakefile b/example/ipfw/BSDmakefile
new file mode 100644
index 0000000..810ae2b
--- /dev/null
+++ b/example/ipfw/BSDmakefile
@@ -0,0 +1,8 @@
+# forward to use gmake
+.PHONY: ipfw kipfw
+
+all:
+	gmake
+
+$(.TARGETS) :
+	gmake MAKE=gmake $(.TARGETS)
diff --git a/example/ipfw/UPDATE b/example/ipfw/UPDATE
new file mode 100644
index 0000000..3da344f
--- /dev/null
+++ b/example/ipfw/UPDATE
@@ -0,0 +1,44 @@
+--- 20141017 --- updating to FreeBSD head 273155
+
+sys/net/pfil.h	V	$FreeBSD$
+sys/net/radix.h	V	$FreeBSD$
+sys/net/radix.c	V	merge, caddr_t -> u_char *
+
+sys/netgraph/ng_ipfw.h	-- unchanged
+
+sys/netinet/in_cksum.c	-- unchanged
+sys/netinet/ip_dummynet.h	V	add DN_IS_ECN
+sys/netinet/ip_fw.h	massive changes
+sys/netinet/tcp.h	V	$FreeBSD$
+sys/netinet/udp.h	V	$FreeBSD$
+
+sys/netpfil/ipfw
+dn_heap.c	-- unchanged
+dn_heap.h	-- unchanged
+dn_sched.h	$FreeBSD$
+dn_sched_fifo.c	-- unchanged
+dn_sched_prio.c	-- unchanged
+dn_sched_qfq.c	-- unchanged
+dn_sched_rr.c	-- unchanged
+dn_sched_wf2q.c	-- unchanged
+ip_dn_glue.c	V	$FreeBSD$
+ip_dn_io.c	V	ecn, check missing ifp
+ip_dn_private.h	V	$FreeBSD$
+ip_dummynet.c	V	$FreeBSD$, callout_reset_sbt, fs fixes, module
+ip_fw2.c	XXX large
+ip_fw_dynamic.c	XXX large
+ip_fw_log.c	XXX IP_FW_ARG.. TARG
+ip_fw_pfil.c	XXX small change
+ip_fw_private.h	XXX large
+ip_fw_sockopt.c	XXX huge
+ip_fw_table.c	XXX huge
+
+Userspace:
+altq.c	$FreeBSD$, bprintf
+dummynet.c	$FreeBSD$, ecn, bprintf
+ipfw2.c	$FreeBSD$, commands
+ipfw2.h	as above
+ipv6.c	as above
+main.c	small changes
+nat.c	internal
+
diff --git a/example/ipfw/extra/expand_number.c b/example/ipfw/extra/expand_number.c
new file mode 100644
index 0000000..523fbb0
--- /dev/null
+++ b/example/ipfw/extra/expand_number.c
@@ -0,0 +1,101 @@
+/*-
+ * Copyright (c) 2007 Eric Anderson
+ * Copyright (c) 2007 Pawel Jakub Dawidek
+ * All rights reserved.
+ *
+ * Redistribution and use in source and binary forms, with or without
+ * modification, are permitted provided that the following conditions
+ * are met:
+ * 1. Redistributions of source code must retain the above copyright
+ *    notice, this list of conditions and the following disclaimer.
+ * 2. Redistributions in binary form must reproduce the above copyright
+ *    notice, this list of conditions and the following disclaimer in the
+ *    documentation and/or other materials provided with the distribution.
+ *
+ * THIS SOFTWARE IS PROVIDED BY THE AUTHORS AND CONTRIBUTORS ``AS IS'' AND
+ * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
+ * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
+ * ARE DISCLAIMED. IN NO EVENT SHALL THE AUTHORS OR CONTRIBUTORS BE LIABLE
+ * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
+ * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
+ * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
+ * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
+ * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
+ * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
+ * SUCH DAMAGE.
+ */
+
+#include <sys/cdefs.h>
+__FBSDID("$FreeBSD: head/lib/libutil/expand_number.c 211343 2010-08-15 18:32:06Z des $");
+
+#include <sys/types.h>
+#include <ctype.h>
+#include <errno.h>
+#include <inttypes.h>
+//#include <libutil.h>
+#include <stdint.h>
+
+/*
+ * Convert an expression of the following forms to a uint64_t.
+ *	1) A positive decimal number.
+ *	2) A positive decimal number followed by a 'b' or 'B' (mult by 1).
+ *	3) A positive decimal number followed by a 'k' or 'K' (mult by 1 << 10).
+ *	4) A positive decimal number followed by a 'm' or 'M' (mult by 1 << 20).
+ *	5) A positive decimal number followed by a 'g' or 'G' (mult by 1 << 30).
+ *	6) A positive decimal number followed by a 't' or 'T' (mult by 1 << 40).
+ *	7) A positive decimal number followed by a 'p' or 'P' (mult by 1 << 50).
+ *	8) A positive decimal number followed by a 'e' or 'E' (mult by 1 << 60).
+ */
+int
+expand_number(const char *buf, uint64_t *num)
+{
+	uint64_t number;
+	unsigned shift;
+	char *endptr;
+
+	number = strtoumax(buf, &endptr, 0);
+
+	if (endptr == buf) {
+		/* No valid digits. */
+		errno = EINVAL;
+		return (-1);
+	}
+
+	switch (tolower((unsigned char)*endptr)) {
+	case 'e':
+		shift = 60;
+		break;
+	case 'p':
+		shift = 50;
+		break;
+	case 't':
+		shift = 40;
+		break;
+	case 'g':
+		shift = 30;
+		break;
+	case 'm':
+		shift = 20;
+		break;
+	case 'k':
+		shift = 10;
+		break;
+	case 'b':
+	case '\0': /* No unit. */
+		*num = number;
+		return (0);
+	default:
+		/* Unrecognized unit. */
+		errno = EINVAL;
+		return (-1);
+	}
+
+	if ((number << shift) >> shift != number) {
+		/* Overflow */
+		errno = ERANGE;
+		return (-1);
+	}
+
+	*num = number << shift;
+	return (0);
+}
diff --git a/example/ipfw/extra/glue.c b/example/ipfw/extra/glue.c
new file mode 100644
index 0000000..0786453
--- /dev/null
+++ b/example/ipfw/extra/glue.c
@@ -0,0 +1,555 @@
+/*
+ * Userland functions missing in linux
+ * taken from /usr/src/lib/libc/stdtime/time32.c
+ */
+
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+#include <errno.h>
+#include <netinet/in.h>		/* sockaddr_in */
+#include <netinet/tcp.h>	/* TCP_NODELAY */
+#include <sys/socket.h>
+#include <stdint.h>		/* uint* types */
+#include <unistd.h>
+#include <strings.h>		/* bzero */
+#include <arpa/inet.h>		/* htonl */
+
+#ifndef HAVE_NAT
+/* dummy nat functions */
+void
+ipfw_show_nat(int ac, char **av)
+{
+	D("unsupported");
+}
+
+void
+ipfw_config_nat(int ac, char **av)
+{
+	D("unsupported");
+}
+#endif /* HAVE_NAT */
+
+#ifdef NEED_STRTONUM
+/* missing in linux and windows */
+long long int
+strtonum(const char *nptr, long long minval, long long maxval,
+	const char **errstr)
+{
+	long long ret;
+	int errno_c = errno;	/* save actual errno */
+
+	errno = 0;
+#ifdef TCC
+	ret = strtol(nptr, (char **)errstr, 0);
+#else
+	ret = strtoll(nptr, (char **)errstr, 0);
+#endif
+	/* We accept only a string that represents exactly a number
+	 * (i.e. starts and ends with a digit).
+	 * The FreeBSD version wants errstr==NULL if no error occurs,
+	 * otherwise errstr should point to an error string.
+	 * For our purpose, we implement only the invalid error; range
+	 * errors aren't checked.
+	 */
+	if (errno != 0 || nptr == *errstr || **errstr != '\0')
+		*errstr = "invalid";
+	else {
+		*errstr = NULL;
+		errno = errno_c;
+	}
+	return ret;
+}
+
+int
+ishexnumber(int c)
+{
+	return ((c >= '0' && c <= '9') ||
+		(c >= 'a' && c <= 'f') ||
+		(c >= 'A' && c <= 'F') );
+}
+
+#endif /* NEED_STRTONUM */
+
+#ifdef __linux__
+
+
+int optreset;	/* missing in linux */
+
+/*
+ * not implemented in linux.
+ * taken from /usr/src/lib/libc/string/strlcpy.c
+ */
+size_t
+strlcpy(dst, src, siz)
+	char *dst;
+	const char *src;
+	size_t siz;
+{
+	char *d = dst;
+	const char *s = src;
+	size_t n = siz;
+
+	/* Copy as many bytes as will fit */
+	if (n != 0 && --n != 0) {
+		do {
+			if ((*d++ = *s++) == 0)
+				break;
+		} while (--n != 0);
+	}
+
+	/* Not enough room in dst, add NUL and traverse rest of src */
+	if (n == 0) {
+		if (siz != 0)
+			*d = '\0';	/* NUL-terminate dst */
+		while (*s++)
+			;
+	}
+
+	return(s - src - 1);	/* count does not include NUL */
+}
+
+
+#endif /* __linux__ */
+
+
+#if defined (EMULATE_SYSCTL)
+//XXX missing prerequisites
+#include	//openwrt
+#include	//openwrt
+#include <netinet/ip_fw.h>
+#include <netinet/ip_dummynet.h>
+int do_cmd(int optname, void *optval, uintptr_t optlen);
+#endif /* EMULATE_SYSCTL */
+
+/*
+ * set or get system information
+ * XXX lock acquisition/serialize calls
+ *
+ * we export this as sys/module/ipfw_mod/parameters/___
+ * This function gets and/or sets the value of the sysctl passed by
+ * the name parameter. If the old value is not desired,
+ * oldp and oldlenp should be set to NULL.
+ *
+ * XXX
+ * I do not know how this works in FreeBSD in the case
+ * where there is no write permission on the sysctl var.
+ * We read the value and set the return variables anyway,
+ * but return -1 on write failures, regardless of the
+ * read success.
+ *
+ * Since there is no information on types, in the following
+ * code we assume a length of 4 is an int.
+ *
+ * Returns 0 on success, -1 on errors.
+ */
+int
+sysctlbyname(const char *name, void *oldp, size_t *oldlenp, void *newp,
+	size_t newlen)
+{
+#if defined (EMULATE_SYSCTL)
+	/*
+	 * we embed the sysctl request in the usual sockopt mechanics.
+	 * the sockopt buffer is filled with a dn_id with IP_DUMMYNET3
+	 * command, and the special DN_SYSCTL_GET and DN_SYSCTL_SET
+	 * subcommands.
+	 * the syntax of this function is fully compatible with
+	 * POSIX sysctlbyname:
+	 * if newp and newlen are != 0 => this is a set
+	 * else if oldp and oldlen are != 0 => this is a get
+	 * to avoid too much overhead in the module, the whole
+	 * sysctl table is returned, and the parsing is done in userland;
+	 * a probe request is done to retrieve the size needed to
+	 * transfer the table, before the real request
+	 * if both old and new params = 0 => this is a print
+	 * this is a special request, done only by main()
+	 * to implement the extension './ipfw sysctl',
+	 * a command that bypasses the normal getopt, and that
+	 * is available on those platforms that use this
+	 * sysctl emulation.
+	 * in this case, a negative oldlen signals that *oldp
+	 * is actually a FILE* to print somewhere else than stdout
+	 */
+
+	int l;
+	int ret;
+	struct dn_id* oid;
+	struct sysctlhead* entry;
+	char* pstring;
+	char* pdata;
+	FILE* fp;
+
+	if((oldlenp != NULL) && ((int)*oldlenp < 0))
+		fp = (FILE*)oldp;
+	else
+		fp = stdout;
+	if(newp != NULL && newlen != 0)
+	{
+		//this is a set
+		l = sizeof(struct dn_id) + sizeof(struct sysctlhead) + strlen(name)+1 + newlen;
+		oid = malloc(l);
+		if (oid == NULL)
+			return -1;
+		oid->len = l;
+		oid->type = DN_SYSCTL_SET;
+		oid->id = DN_API_VERSION;
+
+		entry = (struct sysctlhead*)(oid+1);
+		pdata = (char*)(entry+1);
+		pstring = pdata + newlen;
+
+		entry->blocklen = ((sizeof(struct sysctlhead) + strlen(name)+1 + newlen) + 3) & ~3;
+		entry->namelen = strlen(name)+1;
+		entry->flags = 0;
+		entry->datalen = newlen;
+
+		bcopy(newp, pdata, newlen);
+		bcopy(name, pstring, strlen(name)+1);
+
+		ret = do_cmd(IP_DUMMYNET3, oid, (uintptr_t)l);
+		if (ret != 0)
+			return -1;
+	}
+	else
+	{
+		//this is a get or a print
+		l = sizeof(struct dn_id);
+		oid = malloc(l);
+		if (oid == NULL)
+			return -1;
+		oid->len = l;
+		oid->type = DN_SYSCTL_GET;
+		oid->id = DN_API_VERSION;
+
+		ret = do_cmd(-IP_DUMMYNET3, oid, (uintptr_t)&l);
+		if (ret != 0)
+			return -1;
+
+		l=oid->id;
+		free(oid);
+		oid = malloc(l);
+		if (oid == NULL)
+			return -1;
+		oid->len = l;
+		oid->type = DN_SYSCTL_GET;
+		oid->id = DN_API_VERSION;
+
+		ret = do_cmd(-IP_DUMMYNET3, oid, (uintptr_t)&l);
+		if (ret != 0)
+			return -1;
+
+		entry = (struct sysctlhead*)(oid+1);
+		while(entry->blocklen != 0)
+		{
+			pdata = (char*)(entry+1);
+			pstring = pdata+entry->datalen;
+
+			//time to check if this is a get or a print
+			if(name != NULL && oldp != NULL && *oldlenp > 0)
+			{
+				//this is a get
+				if(strcmp(name,pstring) == 0)
+				{
+					//match found, sanity check on len
+					if(*oldlenp < entry->datalen)
+					{
+						printf("%s error: buffer too small\n",__FUNCTION__);
+						return -1;
+					}
+					*oldlenp = entry->datalen;
+					bcopy(pdata, oldp, *oldlenp);
+					return 0;
+				}
+			}
+			else
+			{
+				//this is a print
+				if( name == NULL )
+					goto print;
+				if ( (strncmp(pstring,name,strlen(name)) == 0) && ( pstring[strlen(name)]=='\0' || pstring[strlen(name)]=='.' ) )
+					goto print;
+				else
+					goto skip;
+print:
+				fprintf(fp, "%s: ",pstring);
+				switch( entry->flags >> 2 )
+				{
+					case SYSCTLTYPE_LONG:
+						fprintf(fp, "%li ", *(long*)(pdata));
+						break;
+					case SYSCTLTYPE_UINT:
+						fprintf(fp, "%u ", *(unsigned int*)(pdata));
+						break;
+					case SYSCTLTYPE_ULONG:
+						fprintf(fp, "%lu ", *(unsigned long*)(pdata));
+						break;
+					case SYSCTLTYPE_INT:
+					default:
+						fprintf(fp, "%i ", *(int*)(pdata));
+				}
+				if( (entry->flags & 0x00000003) == CTLFLAG_RD )
+					fprintf(fp, "\t(read only)\n");
+				else
+					fprintf(fp, "\n");
+skip:				;
+			}
+			entry = (struct sysctlhead*)((unsigned char*)entry + entry->blocklen);
+		}
+		free(oid);
+		return 0;
+	}
+	//fallback for invalid options
+	return -1;
+
+#else /* __linux__ */
+	FILE *fp;
+	char *basename = "/sys/module/ipfw_mod/parameters/";
+	char filename[256];	/* full filename */
+	char *varp;
+	int ret = 0;		/* return value */
+	int d;
+
+	if (name == NULL)	/* XXX set errno */
+		return -1;
+
+	/* locate the filename */
+	varp = strrchr(name, '.');
+	if (varp == NULL)	/* XXX set errno */
+		return -1;
+
+	snprintf(filename, sizeof(filename), "%s%s", basename, varp+1);
+
+	/*
+	 * XXX we could open the file here, in rw mode,
+	 * but we would need to check whether the file
+	 * has write permissions.
+	 */
+
+	/* check parameters */
+	if (oldp && oldlenp) {	/* read mode */
+		fp = fopen(filename, "r");
+		if (fp == NULL) {
+			fprintf(stderr, "%s fopen error reading filename %s\n", __FUNCTION__, filename);
+			return -1;
+		}
+		if (*oldlenp == 4) {
+			if (fscanf(fp, "%d", &d) == 1)
+				memcpy(oldp, &d, *oldlenp);
+			else
+				ret = -1;
+		}
+		fclose(fp);
+	}
+
+	if (newp && newlen) {	/* write */
+		fp = fopen(filename, "w");
+		if (fp == NULL) {
+			fprintf(stderr, "%s fopen error writing filename %s\n", __FUNCTION__, filename);
+			return -1;
+		}
+		if (newlen == 4) {
+			if (fprintf(fp, "%d", *(int*)newp) < 1)
+				ret = -1;
+		}
+
+		fclose(fp);
+	}
+
+	return ret;
+#endif /* __linux__ */
+}
+
+/*
+ * The following two functions implement getsockopt/setsockopt
+ * replacements to talk over a TCP socket.
+ * Because the calls are synchronous, we can run blocking code
+ * and do not need to play special tricks to be selectable.
+ * The wire protocol for the emulation is the following:
+ *	REQUEST:  n32 req_size, level, optname; u8 data[req_size]
+ *	RESPONSE: n32 resp_size, ret_code; u8 data[resp_size]
+ *		data is only present if ret_code == 0
+ *
+ * Return 0 if the message was sent to the remote
+ * endpoint, -1 on error.
+ *
+ * If the required length is greater than the
+ * available buffer size, -1 is returned and
+ * optlen is the required length.
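+ *
+ * Example (editorial sketch, not in the original sources): with the
+ * wire_hdr layout below, a GET of a 4-byte option is sent as the
+ * 16-byte header
+ *	{ htonl(4), htonl(level), htonl(optname), htonl(SOPT_GET) }
+ * followed by no payload; in the response, optlen carries the actual
+ * data length and the level field carries the error code (0 on
+ * success), followed by the data bytes.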
+ */
+enum sock_type {GET_SOCKOPT, SET_SOCKOPT};
+
+struct wire_hdr {
+	uint32_t optlen;	/* actual data len */
+	uint32_t level;		/* or error */
+	uint32_t optname;	/* or act len */
+	uint32_t dir;		/* in or out */
+};
+
+/* do a complete write of the buffer */
+static int
+writen(int fd, const char *buf, int len)
+{
+	int i;
+
+	for (; len > 0; buf += i, len -= i) {
+		i = write(fd, buf, len);
+		ND("have %d wrote %d", len, i);
+		if (i < 0) {
+			if (errno == EAGAIN)
+				continue;
+			return -1;
+		}
+	}
+	return 0;
+}
+
+/* do a complete read */
+static int
+readn(int fd, char *buf, int len)
+{
+	int i, pos;
+
+	for (pos = 0; pos < len; pos += i) {
+		i = read(fd, buf + pos, len - pos);
+		ND("have %d want %d got %d", pos, len, i);
+		if (i < 0) {
+			if (errno == EAGAIN)
+				continue;
+			return -1;
+		}
+	}
+	ND("full read got %d", pos);
+	return 0;
+}
+
+int
+__sockopt2(int s, int level, int optname, void *optval, socklen_t *optlen,
+	enum sopt_dir dir)
+{
+	struct wire_hdr r;
+	int len = optlen && optval ? *optlen : 0;
+	int new_errno;
+
+	ND("dir %d optlen %d level %d optname %d", dir, len, level, optname);
+	/* send request to the server */
+	r.optlen = htonl(len);
+	r.level = htonl(level);
+	r.optname = htonl(optname);
+	r.dir = htonl(dir);
+
+	if (writen(s, (const char *) &r, sizeof(r)))
+		return -1;	/* error writing */
+
+	/* send data, if present */
+	if (len < 0) {
+		fprintf(stderr, "%s invalid args found\n", __FUNCTION__);
+		return -1;
+	} else if (len > 0) {
+		if (writen(s, optval, len))
+			return -1;	/* error writing */
+	}
+
+	/* read response size and error code */
+	if (readn(s, (char *)&r, sizeof(r)))
+		return -1;	/* error reading */
+	len = ntohl(r.optlen);
+	ND("got header, datalen %d", len);
+	if (len > 0) {
+		if (readn(s, optval, len)) {
+			return -1;	/* error reading */
+		}
+	}
+	if (optlen)
+		*optlen = ntohl(r.optlen);	/* actual len */
+	new_errno = ntohl(r.level);
+	if (new_errno)
+		errno = new_errno;
+	return (new_errno ? -1 : 0);
+}
+
+/*
+ * getsockopt() replacement.
+ */
+int
+getsockopt2(int s, int level, int optname, void *optval,
+	socklen_t *optlen)
+{
+	return __sockopt2(s, level, optname, optval, optlen, SOPT_GET);
+}
+
+/*
+ * setsockopt() replacement
+ */
+int
+setsockopt2(int s, int level, int optname, void *optval,
+	socklen_t optlen)
+{
+	/* optlen not changed, use the local address */
+	return __sockopt2(s, level, optname, optval, &optlen, SOPT_SET);
+}
+
+#ifdef socket
+#undef socket	/* we want the real one */
+#endif
+/*
+ * This function replaces the socket() call to connect to
+ * the ipfw control socket.
+ * We actually ignore the parameters if IPFW_HOST and IPFW_PORT
+ * are defined.
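+ *
+ * Example (editorial note): with IPFW_HOST=192.168.0.1 and
+ * IPFW_PORT=5555 set in the environment, the wrapped socket() call
+ * connects there instead of to the compiled-in LOCALADDR/IPFW_PORT,
+ * so the same ipfw binary can drive a remote instance.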
+ */
+int
+do_connect(const char *addr, int port)
+{
+	int conn_fd;
+
+	/* open the socket */
+#ifdef NETLINK
+
+struct rtnl_handle rth;
+
+	conn_fd = socket(PF_NETLINK, SOCK_DGRAM, NETLINK_ROUTE);
+#else
+	struct sockaddr_in server;	/* server address */
+	const char *s;
+
+	conn_fd = socket(AF_INET, SOCK_STREAM, 0);
+	if (conn_fd < 0) {
+		perror("socket");
+		return -1;
+	}
+#endif
+#ifndef NETLINK
+	/* fill the sockaddr structure with server address */
+	bzero(&server, sizeof(server));
+	server.sin_family = AF_INET;
+
+	/* override the host if set in the environment */
+	s = getenv("IPFW_HOST");
+	if (s)
+		addr = s;
+	inet_aton(addr, &server.sin_addr);
+	s = getenv("IPFW_PORT");
+	if (s && atoi(s) > 0)
+		port = atoi(s);
+	server.sin_port = htons(port);
+
+	/* connect to the server */
+	if (connect(conn_fd, (struct sockaddr*) &server, sizeof(server)) < 0) {
+		perror("connect");
+		return -1;
+	}
+#ifdef setsockopt /* we want the real one here */
+#undef setsockopt
+#undef getsockopt
+#endif
+	{
+		int on = 1, ret;
+		ret = setsockopt(conn_fd, IPPROTO_TCP, TCP_NODELAY, &on, sizeof(on));
+		ND("set TCP_NODELAY %d returns %d", on, ret);
+	}
+	if (0)
+		fprintf(stderr, "connected to %s:%d\n",
+		    inet_ntoa(server.sin_addr), ntohs(server.sin_port));
+#endif
+	return conn_fd;
+}
diff --git a/example/ipfw/extra/glue.h b/example/ipfw/extra/glue.h
new file mode 100644
index 0000000..97b25bf
--- /dev/null
+++ b/example/ipfw/extra/glue.h
@@ -0,0 +1,488 @@
+/*
+ * Copyright (c) 2009 Luigi Rizzo, Marta Carbone, Universita` di Pisa
+ *
+ * Redistribution and use in source and binary forms, with or without
+ * modification, are permitted provided that the following conditions
+ * are met:
+ * 1. Redistributions of source code must retain the above copyright
+ *    notice, this list of conditions and the following disclaimer.
+ * 2. Redistributions in binary form must reproduce the above copyright
+ *    notice, this list of conditions and the following disclaimer in the
+ *    documentation and/or other materials provided with the distribution.
+ *
+ * THIS SOFTWARE IS PROVIDED BY THE AUTHOR AND CONTRIBUTORS ``AS IS'' AND
+ * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
+ * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
+ * ARE DISCLAIMED. IN NO EVENT SHALL THE AUTHOR OR CONTRIBUTORS BE LIABLE
+ * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
+ * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
+ * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
+ * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
+ * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
+ * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
+ * SUCH DAMAGE.
+ */
+/*
+ * $Id: glue.h 8327 2011-03-22 17:01:35Z marta $
+ *
+ * glue code to adapt the FreeBSD version to linux and windows,
+ * userland and kernel.
+ * This is included before any other headers, so we do not have
+ * a chance to override any #define that should appear in other
+ * headers.
+ * First handle headers for userland and kernel. Then common code
+ * (including headers that require a specific order of inclusion),
+ * then the user- and kernel- specific parts.
+ */
+
+#ifndef _GLUE_H
+#define _GLUE_H
+
+/*
+ * common definitions to allow portability
+ */
+#ifndef __FBSDID
+#define __FBSDID(x) struct __hack
+#endif  /* FBSDID */
+
+#include <stdint.h>	/* linux needs it in addition to sys/types.h */
+#include <stddef.h>	/* for size_t */
+
+#define true 1		/* stdbool */
+#ifdef _KERNEL		/* prevent a warning */
+#undef _KERNEL
+#include
+#include
+#include <errno.h>	/* we want errno */
+#define _KERNEL
+#else
+#include
+#endif
+
+#include
+#ifndef USERSPACE
+#include
+#endif
+
+
+/*----- */
+
+/* ipfw2.c - from timeconv.h */
+static __inline time_t
+_long_to_time(long tlong)
+{
+	if (sizeof(long) == sizeof(__int32_t))
+		return((time_t)(__int32_t)(tlong));
+	return((time_t)tlong);
+}
+
+#define min(a, b) ((a) < (b) ? (a) : (b) )	// radix.c
+/*
+ * debugging macros from ip_dn_private.h
+ */
+#include <sys/time.h>
+#include <stdio.h>
+extern char *strrchr(const char *, int);
+static inline const char *xyz(const char *s) {
+	static char buf[128];
+	struct timeval t;
+	const char *ret = strrchr(s, '/');
+	if (ret) s = ret + 1;
+	gettimeofday(&t, NULL);
+	buf[sizeof(buf) - 1] = '\0';
+	snprintf(buf, sizeof(buf), "[%4d.%06d] %s",
+		(int)(t.tv_sec % 1000), (int)(t.tv_usec), s);
+	return buf;
+}
+
+#define ND(fmt, ...) do {} while (0)
+#define D1(fmt, ...) do {} while (0)
+#define D(fmt, ...) fprintf(stderr, "%s:%-10s [%d] " fmt "\n",	\
+	xyz(__FILE__), __FUNCTION__, __LINE__, ## __VA_ARGS__)
+
+/* Rate limited version of "D", lps indicates how many per second */
+#define RD(lps, format, ...)					\
+	do {							\
+		static int t0, __cnt;				\
+		struct timeval __xxts;				\
+		gettimeofday(&__xxts, NULL);			\
+		if (t0 != __xxts.tv_sec) {			\
+			t0 = __xxts.tv_sec;			\
+			__cnt = 0;				\
+		}						\
+		if (__cnt++ < lps) {				\
+			D(format, ##__VA_ARGS__);		\
+		}						\
+	} while (0)
+
+#define DX(lev, fmt, ...) do {				\
+	if (dn_cfg.debug > lev) D(fmt, ## __VA_ARGS__); } while (0)
+/* end debugging macros */
+
+
+/*
+ * sbin/ipfw on non-freebsd platform
+ */
+#ifdef NEED_STRTONUM
+/* prototypes from libutil */
+/* humanize_number(3) */
+#define HN_DECIMAL		0x01
+#define HN_NOSPACE		0x02
+#define HN_B			0x04
+#define HN_DIVISOR_1000		0x08
+#define HN_IEC_PREFIXES		0x10
+
+#define HN_GETSCALE		0x10
+#define HN_AUTOSCALE		0x20
+
+
+int humanize_number(char *_buf, size_t _len, int64_t _number,
+	const char *_suffix, int _scale, int _flags);
+int expand_number(const char *buf, uint64_t *num);
+
+
+long long
+strtonum(const char *nptr, long long minval, long long maxval,
+	const char **errstr);
+#ifndef __APPLE__
+int ishexnumber(int c);
+#endif
+#endif /* NEED_STRTONUM */
+
+#ifdef NEED_SYSCTLBYNAME	/* and other linux calls */
+int sysctlbyname(const char *name, void *oldp, size_t *oldlenp,
+	void *newp, size_t newlen);
+#define setprogname(x)	/* not present in linux */
+
+extern int optreset;	/* not present in linux */
+
+long long int strtonum(const char *nptr, long long minval,
+	long long maxval, const char **errstr);
+
+
+struct ether_addr;
+struct ether_addr * ether_aton(const char *a);
+
+#define ICMP6_MAXTYPE	201
+#define __u6_addr	in6_u
+#define in6_u __in6_u	/* missing type for ipv6 (linux 2.6.28) */
+
+
+#define __u6_addr32	u6_addr32
+/* on freebsd sys/socket.h pf specific */
+#define NET_RT_IFLIST	3	/* survey interface list */
+
+#define RTM_VERSION	5	/* Up the ante and ignore older versions */
+
+#endif // NEED_SYSCTLBYNAME
+
+#ifdef NEED_SIN_LEN
+/*
+ * linux at least does not have sin_len and sin6_len, so we remap
+ * to some safe fields (check use of sin6_flowinfo XXX)
+ */
+#define sin_len sin_zero[0]
+#define sin6_len sin6_flowinfo
+#endif /* NEED_SIN_LEN */
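+
+/*
+ * [Editorial illustration, not part of the original netmap-ipfw code:
+ * the debugging macros defined above are used as
+ *	D("bad len %d", len);		// always prints file/func/line
+ *	RD(5, "drop packet %p", m);	// at most 5 lines per second
+ * while ND() discards its arguments entirely.]
+ */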
+
+#ifdef NEED_ROUNDUP2	/* in FreeBSD it is in sys/param.h */
+/* round up to the next power of 2 (y) */
+#define roundup2(x, y)	(((x)+((y)-1))&(~((y)-1))) /* if y is powers of two */
+#endif // NEED_ROUNDUP2
+
+/* possibly redundant, does not harm */
+size_t strlcpy(char * dst, const char * src, size_t siz);
+
+/*
+ * Part 2: common userland and kernel definitions
+ */
+
+#define ICMP6_DST_UNREACH_NOROUTE	0	/* no route to destination */
+#define ICMP6_DST_UNREACH_ADMIN		1	/* administratively prohibited */
+#define ICMP6_DST_UNREACH_ADDR		3	/* address unreachable */
+#define ICMP6_DST_UNREACH_NOPORT	4	/* port unreachable */
+
+/*
+ * linux: sysctl are mapped into /sys/module/ipfw_mod parameters
+ * windows: they are emulated via get/setsockopt
+ */
+#define CTLFLAG_RD	1
+#define CTLFLAG_RDTUN	1
+#define CTLFLAG_RW	2
+#define CTLFLAG_SECURE3	0	/* unsupported */
+#define CTLFLAG_VNET	0	/* unsupported */
+
+/* if needed, queue.h must be included here after list.h */
+
+/*
+ * our own struct thread
+ */
+struct thread {	/* ip_fw_sockopt */
+	void *sopt_td;
+	void *td_ucred;
+};
+
+enum sopt_dir { SOPT_GET, SOPT_SET };
+
+struct sockopt {
+	enum sopt_dir sopt_dir;	/* is this a get or a set? */
+	int	sopt_level;	/* second arg of [gs]etsockopt */
+	int	sopt_name;	/* third arg of [gs]etsockopt */
+	void   *sopt_val;	/* fourth arg of [gs]etsockopt */
+	size_t	sopt_valsize;	/* (almost) fifth arg of [gs]etsockopt */
+	struct	thread *sopt_td; /* calling thread or null if kernel */
+};
+
+
+/*
+ * List of values used for set/getsockopt options.
+ * The base value on FreeBSD is defined as a macro,
+ * if not available we will use our own enum.
+ * The TABLE_BASE value is used in the kernel.
+ */
+#define _IPFW_SOCKOPT_BASE	100	/* 40 on freebsd */
+#define IP_FW_TABLE_ADD		(_IPFW_SOCKOPT_BASE + 0)
+#define IP_FW_TABLE_DEL		(_IPFW_SOCKOPT_BASE + 1)
+#define IP_FW_TABLE_FLUSH	(_IPFW_SOCKOPT_BASE + 2)
+#define IP_FW_TABLE_GETSIZE	(_IPFW_SOCKOPT_BASE + 3)
+#define IP_FW_TABLE_LIST	(_IPFW_SOCKOPT_BASE + 4)
+#define IP_FW_DYN_GET		(_IPFW_SOCKOPT_BASE + 5)
+
+#define IP_FW3			(_IPFW_SOCKOPT_BASE + 8)
+#define IP_DUMMYNET3		(_IPFW_SOCKOPT_BASE + 9)
+
+#define IP_FW_ADD		(_IPFW_SOCKOPT_BASE + 10)
+#define IP_FW_DEL		(_IPFW_SOCKOPT_BASE + 11)
+#define IP_FW_FLUSH		(_IPFW_SOCKOPT_BASE + 12)
+#define IP_FW_ZERO		(_IPFW_SOCKOPT_BASE + 13)
+#define IP_FW_GET		(_IPFW_SOCKOPT_BASE + 14)
+#define IP_FW_RESETLOG		(_IPFW_SOCKOPT_BASE + 15)
+
+#define IP_FW_NAT_CFG		(_IPFW_SOCKOPT_BASE + 16)
+#define IP_FW_NAT_DEL		(_IPFW_SOCKOPT_BASE + 17)
+#define IP_FW_NAT_GET_CONFIG	(_IPFW_SOCKOPT_BASE + 18)
+#define IP_FW_NAT_GET_LOG	(_IPFW_SOCKOPT_BASE + 19)
+
+#define IP_DUMMYNET_CONFIGURE	(_IPFW_SOCKOPT_BASE + 20)
+#define IP_DUMMYNET_DEL		(_IPFW_SOCKOPT_BASE + 21)
+#define IP_DUMMYNET_FLUSH	(_IPFW_SOCKOPT_BASE + 22)
+		/* 63 is missing */
+#define IP_DUMMYNET_GET		(_IPFW_SOCKOPT_BASE + 24)
+#define _IPFW_SOCKOPT_END	(_IPFW_SOCKOPT_BASE + 25)
+
+/*
+ * Part 3: userland stuff for linux/windows
+ */
+
+
+/*
+ * now remap functions for userland or linux kernel etc.
+ */
+#ifdef USERSPACE
+/*
+ * definitions used when the programs communicate through userspace.
+ * We need to define the socket and addresses used to talk, and
+ * the userland side must also remap socket() and [gs]etsockopt()
+ * to appropriate wrappers.
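+ *
+ * Example (editorial sketch): after the remapping below, client code
+ * such as
+ *	int s = socket(AF_INET, SOCK_RAW, IPPROTO_RAW);
+ *	getsockopt(s, IPPROTO_IP, IP_FW_GET, buf, &len);
+ * really calls do_connect(LOCALADDR, IPFW_PORT) and getsockopt2(),
+ * tunnelling the request over TCP to the kernel-side process.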
+ */
+
+#define LOCALADDR	"127.0.0.1"
+#define IPFW_PORT	5555
+
+#ifndef KERNEL_SIDE
+#ifdef _KERNEL
+#error _KERNEL defined in user space
+#endif
+int do_connect(const char *addr, int port);
+#include <sys/socket.h>	/* for socklen_t */
+
+#define socket(a, b, c)	do_connect(LOCALADDR, IPFW_PORT)
+#define setsockopt	setsockopt2
+#define getsockopt	getsockopt2
+int getsockopt2(int s, int lev, int optname, void *optval, socklen_t *optlen);
+int setsockopt2(int s, int lev, int optname, void *optval, socklen_t optlen);
+#endif /* KERNEL_SIDE */
+
+#endif /* USERSPACE */
+
+/*
+ * Part 5: windows specific stuff and sysctl emulation
+ */
+
+/*******************
+* SYSCTL emulation *
+********************/
+#ifdef EMULATE_SYSCTL
+
+/* this needs to be here, as it is part of the user-kernel messages */
+/* flag is set with the last 2 bits for access, as defined in glue.h,
+ * and the rest for type
+ */
+enum {
+	SYSCTLTYPE_INT = 0,
+	SYSCTLTYPE_UINT = 1,
+	SYSCTLTYPE_SHORT = 2,
+	SYSCTLTYPE_USHORT = 3,
+	SYSCTLTYPE_LONG = 4,
+	SYSCTLTYPE_ULONG = 5,
+	SYSCTLTYPE_STRING = 6,
+
+	/* the following are SYSCTL_PROC equivalents of the above,
+	 * where the SYSCTLTYPE is shifted 2 bits,
+	 * and SYSCTLTYPE_PROC is set
+	 */
+	SYSCTLTYPE_PROC = 0x100,
+	CTLTYPE_INT = (0x100 | (0<<2)),
+	CTLTYPE_UINT = (0x100 | (1<<2)),
+	CTLTYPE_LONG = (0x100 | (4<<2)),
+	CTLTYPE_ULONG = (0x100 | (5<<2))
+};
+
+struct sysctlhead {
+	uint32_t blocklen;	//total size of the entry
+	uint32_t namelen;	//strlen(name) + '\0'
+	uint32_t flags;		//type and access
+	uint32_t datalen;
+};
+
+
+#endif /* EMULATE_SYSCTL */
+int sysctlbyname(const char *name, void *oldp, size_t *oldlenp, void *newp,
+	size_t newlen);
+
+#ifndef __FreeBSD__
+#define test_bit(ix, pData)	((*pData) & (1<<(ix)))
+#define __set_bit(ix, pData)	(*pData) |= (1<<(ix))
+#define __clear_bit(ix, pData)	(*pData) &= ~(1<<(ix))
+
+static inline int fls(int _n)
+{
+	unsigned int n = _n;
+	int i = 0;
+	for (i = 0; n > 0; n >>= 1, i++)
+		;
+	return i;
+}
+
+static inline unsigned long __fls(unsigned long word)
+{
+	return fls(word) - 1;
+}
+
+
+#endif /* !FreeBSD */
+
+#ifdef KERNEL_SIDE
+/* sys/counter.h , to be moved to a file */
+typedef uint64_t *counter_u64_t;	// XXX kernel
+static inline void counter_u64_add(counter_u64_t c, int64_t v)
+{
+	*c += v;
+}
+static inline void counter_u64_zero(counter_u64_t c)
+{
+	*c = 0;
+}
+static inline uint64_t counter_u64_fetch(counter_u64_t c)
+{
+	return *c;
+}
+
+struct rm_priotracker {
+};
+
+#define vslock(_a, _b)	(0)
+#define vsunlock(_a, _b)
+
+typedef uint64_t u_register_t;	// XXX not on osx ?
+
+typedef uintptr_t eventhandler_tag;
+#define EVENTHANDLER_REGISTER(_a, _b, ...)	(uintptr_t)_b;
+#define EVENTHANDLER_DEREGISTER(_a, _b, ...)	(void)_b;
+
+// XXX this needs to be completed
+#define if_name(_ifp)	(_ifp->if_xname)
+#define ifunit_ref(_n)	NULL	// XXX
+#define if_rele(_n)
+
+#define rtalloc1_fib(_a, ...)	NULL
+#define rt_key(_a)	NULL
+#define rt_mask(_a)	NULL
+#define RTFREE_LOCKED(_a)	((void)NULL)
+struct rtentry {
+};
+#define rt_tables_get_rnh(_a, _b)	NULL
+
+#endif /* KERNEL_SIDE */
+
+#ifdef _KERNEL
+/* XXX kernel support */
+/* on freebsd net/if.h XXX used */
+#ifdef linux
+#define div64(a,b)	(((int64_t)a)/((int64_t)b))
+#define LINUX_VERSION_CODE	30003
+#define KERNEL_VERSION(a,b,c)	(a*10000+b*100 + c)
+#define __printflike(a,b)
+#endif /* linux */
+
+#endif /* _KERNEL */
+
+#ifndef __FreeBSD__
+#ifndef IFNAMSIZ
+#define IFNAMSIZ 16
+#endif
+#include "missing.h"
+
+struct if_data {
+	/* ... */
+	u_long ifi_mtu;	/* maximum transmission unit */
+};
+
+#endif
+
+#ifdef __APPLE__
+#include	// need in kernel
+
+/* needed both in kernel and userspace */
+struct if_data64 {	// XXX Darwin version
+	/* ... */
+	u_long ifi_mtu;	/* maximum transmission unit */
+};
+
+struct net_event_data {
+};
+
+struct in_addr;
+#endif /* __APPLE__ */
+
+#define __PAST_END(v, idx) v[idx]
+
+/*
+ * a fast copy routine
+ */
+#include <strings.h>
+// XXX only for multiples of 64 bytes, non overlapped.
+static inline void
+_pkt_copy(const void *_src, void *_dst, int l)
+{
+	const uint64_t *src = _src;
+	uint64_t *dst = _dst;
+#define likely(x)	__builtin_expect(!!(x), 1)
+#define unlikely(x)	__builtin_expect(!!(x), 0)
+	if (unlikely(l >= 1024)) {
+		bcopy(src, dst, l);
+		return;
+	}
+	for (; l > 0; l-=64) {
+		*dst++ = *src++;
+		*dst++ = *src++;
+		*dst++ = *src++;
+		*dst++ = *src++;
+		*dst++ = *src++;
+		*dst++ = *src++;
+		*dst++ = *src++;
+		*dst++ = *src++;
+	}
+}
+
+#endif /* !_GLUE_H */
diff --git a/example/ipfw/extra/humanize_number.c b/example/ipfw/extra/humanize_number.c
new file mode 100644
index 0000000..0b7382f
--- /dev/null
+++ b/example/ipfw/extra/humanize_number.c
@@ -0,0 +1,167 @@
+/*	$NetBSD: humanize_number.c,v 1.14 2008/04/28 20:22:59 martin Exp $	*/
+
+/*
+ * Copyright (c) 1997, 1998, 1999, 2002 The NetBSD Foundation, Inc.
+ * All rights reserved.
+ *
+ * This code is derived from software contributed to The NetBSD Foundation
+ * by Jason R. Thorpe of the Numerical Aerospace Simulation Facility,
+ * NASA Ames Research Center, by Luke Mewburn and by Tomas Svensson.
+ *
+ * Redistribution and use in source and binary forms, with or without
+ * modification, are permitted provided that the following conditions
+ * are met:
+ * 1. Redistributions of source code must retain the above copyright
+ *    notice, this list of conditions and the following disclaimer.
+ * 2. Redistributions in binary form must reproduce the above copyright
+ *    notice, this list of conditions and the following disclaimer in the
+ *    documentation and/or other materials provided with the distribution.
+ *
+ * THIS SOFTWARE IS PROVIDED BY THE NETBSD FOUNDATION, INC. AND CONTRIBUTORS
+ * ``AS IS'' AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED
+ * TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
+ * PURPOSE ARE DISCLAIMED.  IN NO EVENT SHALL THE FOUNDATION OR CONTRIBUTORS
+ * BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR
+ * CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF
+ * SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS
+ * INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN
+ * CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE)
+ * ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE
+ * POSSIBILITY OF SUCH DAMAGE.
+ */
+
+#include <sys/cdefs.h>
+__FBSDID("$FreeBSD: head/lib/libutil/humanize_number.c 220582 2011-04-12 22:48:03Z delphij $");
+
+#include <sys/types.h>
+#include <assert.h>
+#include <inttypes.h>
+#include <locale.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+//#include <libutil.h>
+
+static const int maxscale = 7;
+
+int
+humanize_number(char *buf, size_t len, int64_t quotient,
+    const char *suffix, int scale, int flags)
+{
+	const char *prefixes, *sep;
+	int	i, r, remainder, s1, s2, sign;
+	int64_t	divisor, max;
+	size_t	baselen;
+
+	assert(buf != NULL);
+	assert(suffix != NULL);
+	assert(scale >= 0);
+	assert(scale < maxscale || (((scale & (HN_AUTOSCALE | HN_GETSCALE)) != 0)));
+	assert(!((flags & HN_DIVISOR_1000) && (flags & HN_IEC_PREFIXES)));
+
+	remainder = 0;
+
+	if (flags & HN_IEC_PREFIXES) {
+		baselen = 2;
+		/*
+		 * Use the prefixes for power of two recommended by
+		 * the International Electrotechnical Commission
+		 * (IEC) in IEC 80000-3 (i.e. Ki, Mi, Gi...).
+		 *
+		 * HN_IEC_PREFIXES implies a divisor of 1024 here
+		 * (use of HN_DIVISOR_1000 would have triggered
+		 * an assertion earlier).
+		 */
+		divisor = 1024;
+		if (flags & HN_B)
+			prefixes = "B\0\0Ki\0Mi\0Gi\0Ti\0Pi\0Ei";
+		else
+			prefixes = "\0\0Ki\0Mi\0Gi\0Ti\0Pi\0Ei";
+	} else {
+		baselen = 1;
+		if (flags & HN_DIVISOR_1000)
+			divisor = 1000;
+		else
+			divisor = 1024;
+
+		if (flags & HN_B)
+			prefixes = "B\0\0k\0\0M\0\0G\0\0T\0\0P\0\0E";
+		else
+			prefixes = "\0\0\0k\0\0M\0\0G\0\0T\0\0P\0\0E";
+	}
+
+#define	SCALE2PREFIX(scale)	(&prefixes[(scale) * 3])
+
+	if (scale < 0 || (scale >= maxscale &&
+	    (scale & (HN_AUTOSCALE | HN_GETSCALE)) == 0))
+		return (-1);
+
+	if (buf == NULL || suffix == NULL)
+		return (-1);
+
+	if (len > 0)
+		buf[0] = '\0';
+	if (quotient < 0) {
+		sign = -1;
+		quotient = -quotient;
+		baselen += 2;		/* sign, digit */
+	} else {
+		sign = 1;
+		baselen += 1;		/* digit */
+	}
+	if (flags & HN_NOSPACE)
+		sep = "";
+	else {
+		sep = " ";
+		baselen++;
+	}
+	baselen += strlen(suffix);
+
+	/* Check if enough room for `x y' + suffix + `\0' */
+	if (len < baselen + 1)
+		return (-1);
+
+	if (scale & (HN_AUTOSCALE | HN_GETSCALE)) {
+		/* See if there are additional columns that can be used. */
+		for (max = 1, i = len - baselen; i-- > 0;)
+			max *= 10;
+
+		/*
+		 * Divide the number until it fits the given column.
+		 * If there will be an overflow by the rounding below,
+		 * divide once more.
+		 */
+		for (i = 0;
+		    (quotient >= max || (quotient == max - 1 && remainder >= 950)) &&
+		    i < maxscale; i++) {
+			remainder = quotient % divisor;
+			quotient /= divisor;
+		}
+
+		if (scale & HN_GETSCALE)
+			return (i);
+	} else {
+		for (i = 0; i < scale && i < maxscale; i++) {
+			remainder = quotient % divisor;
+			quotient /= divisor;
+		}
+	}
+
+	/* If a value <= 9.9 after rounding and ... */
+	if (quotient <= 9 && remainder < 950 && i > 0 && flags & HN_DECIMAL) {
+		/* baselen + \0 + .N */
+		if (len < baselen + 1 + 2)
+			return (-1);
+		s1 = (int)quotient + ((remainder + 50) / 1000);
+		s2 = ((remainder + 50) / 100) % 10;
+		r = snprintf(buf, len, "%d%s%d%s%s%s",
+		    sign * s1, localeconv()->decimal_point, s2,
+		    sep, SCALE2PREFIX(i), suffix);
+	} else
+		r = snprintf(buf, len, "%" PRId64 "%s%s%s",
+		    sign * (quotient + (remainder + 50) / 1000),
+		    sep, SCALE2PREFIX(i), suffix);
+
+	return (r);
+}
+
diff --git a/example/ipfw/extra/ipfw2_mod.c b/example/ipfw/extra/ipfw2_mod.c
new file mode 100644
index 0000000..974c6af
--- /dev/null
+++ b/example/ipfw/extra/ipfw2_mod.c
@@ -0,0 +1,278 @@
+/*
+ * Copyright (C) 2009 Luigi Rizzo, Marta Carbone, Universita` di Pisa
+ *
+ * Redistribution and use in source and binary forms, with or without
+ * modification, are permitted provided that the following conditions
+ * are met:
+ * 1. Redistributions of source code must retain the above copyright
+ *    notice, this list of conditions and the following disclaimer.
+ * 2. Redistributions in binary form must reproduce the above copyright
+ *    notice, this list of conditions and the following disclaimer in the
+ *    documentation and/or other materials provided with the distribution.
+ *
+ * THIS SOFTWARE IS PROVIDED BY THE AUTHOR AND CONTRIBUTORS ``AS IS'' AND
+ * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
+ * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
+ * ARE DISCLAIMED. IN NO EVENT SHALL THE AUTHOR OR CONTRIBUTORS BE LIABLE
+ * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
+ * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
+ * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
+ * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
+ * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
+ * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
+ * SUCH DAMAGE.
+ */
+
+/*
+ * $Id: ipfw2_mod.c 7787 2010-11-19 21:15:50Z marta $
+ *
+ * The main interface to build ipfw+dummynet as a linux module.
+ * (and possibly as a windows module as well, though that part
+ * is not complete yet).
+ *
+ * The control interface uses the sockopt mechanism
+ * on a socket(AF_INET, SOCK_RAW, IPPROTO_RAW).
+ *
+ * The data interface uses the netfilter interface, at the moment
+ * hooked to the PRE_ROUTING and POST_ROUTING hooks.
+ * Unfortunately the netfilter interface is a moving target,
+ * so we need a set of macros to adapt to the various cases.
+ *
+ * In the netfilter hook we just mark the packet as 'QUEUE' and then
+ * let the queue handler do the whole work (filtering and
+ * possibly emulation).
+ * As we receive packets, we wrap them with an mbuf descriptor
+ * so the existing ipfw+dummynet code runs unmodified.
+ */
+
+#include
+#include <sys/mbuf.h>		/* sizeof struct mbuf */
+#include <sys/param.h>		/* NGROUPS */
+#include <netinet/in.h>		/* in_addr */
+#include <netinet/ip_fw.h>	/* ip_fw_ctl_t, ip_fw_chk_t */
+#include <netinet/ip_dummynet.h>	/* ip_dn_ctl_t, ip_dn_io_t */
+#include <net/pfil.h>		/* PFIL_IN, PFIL_OUT */
+#include			/* inet_iif */
+
+#include <netpfil/ipfw/ip_fw_private.h>	/* ip_fw_ctl_t, ip_fw_chk_t */
+
+/*
+ * Here we allocate some global variables used in the firewall.
+ */
+//ip_dn_ctl_t *ip_dn_ctl_ptr;
+int (*ip_dn_ctl_ptr)(struct sockopt *);
+
+ip_fw_ctl_t *ip_fw_ctl_ptr;
+
+int (*ip_dn_io_ptr)(struct mbuf **m, int dir, struct ip_fw_args *fwa);
+
+void (*bridge_dn_p)(struct mbuf *, struct ifnet *);
+
+/* Divert hooks. */
+void (*ip_divert_ptr)(struct mbuf *m, int incoming);
+
+/* ng_ipfw hooks. */
+ng_ipfw_input_t *ng_ipfw_input_p = NULL;
+
+
+/*---
+ * Control hooks:
+ * ipfw_ctl_h() is a wrapper from the linux to the FreeBSD sockopt
+ * call convention; it then calls the ipfw handler to manage requests.
+ * In turn this is called by the linux set/get handlers.
+ */
+static int
+ipfw_ctl_h(struct sockopt *s, int cmd, int dir, int len, void __user *user)
+{
+	struct thread t;
+	int ret = EINVAL;
+
+	memset(s, 0, sizeof(*s));
+	s->sopt_name = cmd;
+	s->sopt_dir = dir;
+	s->sopt_valsize = len;
+	s->sopt_val = user;
+
+	/* sopt_td is not used but it is referenced */
+	memset(&t, 0, sizeof(t));
+	s->sopt_td = &t;
+
+	if (ip_fw_ctl_ptr && cmd != IP_DUMMYNET3 && (cmd == IP_FW3 ||
+	    cmd < IP_DUMMYNET_CONFIGURE))
+		ret = ip_fw_ctl_ptr(s);
+	else if (ip_dn_ctl_ptr && (cmd == IP_DUMMYNET3 ||
+	    cmd >= IP_DUMMYNET_CONFIGURE))
+		ret = ip_dn_ctl_ptr(s);
+
+	return -ret;	/* errors are < 0 on linux */
+}
+
+
+
+/*
+ * setsockopt hook has no return value other than the error code.
+ */
+int
+do_ipfw_set_ctl(void *sk, int cmd,
+	void __user *user, unsigned int len)
+{
+	struct sockopt s;	/* pass arguments */
+	return ipfw_ctl_h(&s, cmd, SOPT_SET, len, user);
+}
+
+/*
+ * getsockopt can return a block of data in response.
+ */
+int
+do_ipfw_get_ctl(void *sk,
+	int cmd, void __user *user, int *len)
+{
+	struct sockopt s;	/* pass arguments */
+	int ret = ipfw_ctl_h(&s, cmd, SOPT_GET, *len, user);
+
+	*len = s.sopt_valsize;	/* return length back to the caller */
+	return ret;
+}
+
+
+/*
+ * Module glue - init and exit function.
+ */
+#include <sys/module.h>
+/* descriptors for the children, until I find a way for the
+ * linker to produce them
+ */
+extern moduledata_t *moddesc_ipfw;
+extern moduledata_t *moddesc_dummynet;
+extern moduledata_t *moddesc_dn_fifo;
+extern moduledata_t *moddesc_dn_wf2qp;
+extern moduledata_t *moddesc_dn_rr;
+extern moduledata_t *moddesc_dn_qfq;
+extern moduledata_t *moddesc_dn_prio;
+extern int (*sysinit_ipfw_init)(void *);
+extern int (*sysuninit_ipfw_destroy)(void *);
+extern int (*sysinit_vnet_ipfw_init)(void *);
+extern int (*sysuninit_vnet_ipfw_uninit)(void *);
+
+/*---
+ * Glue code to implement the registration of children with the parent.
+ * Each child should call my_mod_register() when linking, so that
+ * module_init() and module_exit() can call init_children() and
+ * fini_children() to provide the necessary initialization.
+ * We use the same mechanism for MODULE_ and SYSINIT_.
+ * The former only get a pointer to the moduledata,
+ * the latter have two function pointers (init/uninit)
+ */
+#include
+struct mod_args {
+	const char *name;
+	int order;
+	struct moduledata *mod;
+	int (*init)(void *);
+	int (*uninit)(void *);
+};
+
+static unsigned int mod_idx;
+static struct mod_args mods[10];	/* hard limit to 10 modules */
+
+int
+my_mod_register(const char *name, int order,
+	struct moduledata *mod, int (*init)(void *), int (*uninit)(void *));
+/*
+ * my_mod_register() should be called automatically by the init
+ * functions in the submodules. Unfortunately this compiler/linker
+ * trick is not supported yet so we call it manually.
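+ *
+ * Example (editorial note): ipfw_module_init() below makes these
+ * calls explicitly, e.g.
+ *	my_mod_register("dummynet", 4, moddesc_dummynet, NULL, NULL);
+ * registers the dummynet moduledata so that init_children() can run
+ * its MOD_LOAD event handler in order.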
+ */
+int
+my_mod_register(const char *name, int order,
+	struct moduledata *mod, int (*init)(void *), int (*uninit)(void *))
+{
+	struct mod_args m;
+
+	m.name = name;
+	m.order = order;
+	m.mod = mod;
+	m.init = init;
+	m.uninit = uninit;
+
+	ND("called for %s", name);
+	if (mod_idx < sizeof(mods) / sizeof(mods[0]))
+		mods[mod_idx++] = m;
+	return 0;
+}
+
+static void
+init_children(void)
+{
+	unsigned int i;
+
+	/* Call the functions registered at init time. */
+	printf("%s mod_idx value %d\n", __FUNCTION__, mod_idx);
+	for (i = 0; i < mod_idx; i++) {
+		struct mod_args *m = &mods[i];
+		printf("+++ start module %d %s %s at %p order 0x%x\n",
+			i, m->name, m->mod ? m->mod->name : "SYSINIT",
+			m->mod, m->order);
+		if (m->mod && m->mod->evhand)
+			m->mod->evhand(NULL, MOD_LOAD, m->mod->priv);
+		else if (m->init)
+			m->init(NULL);
+	}
+}
+
+static void
+fini_children(void)
+{
+	int i;
+
+	/* Call the functions registered at init time. */
+	for (i = mod_idx - 1; i >= 0; i--) {
+		struct mod_args *m = &mods[i];
+		printf("+++ end module %d %s %s at %p order 0x%x\n",
+			i, m->name, m->mod ? m->mod->name : "SYSINIT",
+			m->mod, m->order);
+		if (m->mod && m->mod->evhand)
+			m->mod->evhand(NULL, MOD_UNLOAD, m->mod->priv);
+		else if (m->uninit)
+			m->uninit(NULL);
+	}
+}
+/*--- end of module binding helper functions ---*/
+
+int
+ipfw_module_init(void)
+{
+	int ret = 0;
+
+	my_mod_register("ipfw", 1, moddesc_ipfw, NULL, NULL);
+	my_mod_register("sy_ipfw", 2, NULL,
+		sysinit_ipfw_init, sysuninit_ipfw_destroy);
+	my_mod_register("sy_Vnet_ipfw", 3, NULL,
+		sysinit_vnet_ipfw_init, sysuninit_vnet_ipfw_uninit);
+	my_mod_register("dummynet", 4, moddesc_dummynet, NULL, NULL);
+	my_mod_register("dn_fifo", 5, moddesc_dn_fifo, NULL, NULL);
+	my_mod_register("dn_wf2qp", 6, moddesc_dn_wf2qp, NULL, NULL);
+	my_mod_register("dn_rr", 7, moddesc_dn_rr, NULL, NULL);
+	my_mod_register("dn_qfq", 8, moddesc_dn_qfq, NULL, NULL);
+	my_mod_register("dn_prio", 9, moddesc_dn_prio, NULL, NULL);
+	init_children();
+
+#ifdef EMULATE_SYSCTL
+	keinit_GST();
+#endif
+
+	return ret;
+}
+
+/* module shutdown */
+void
+ipfw_module_exit(void)
+{
+#ifdef EMULATE_SYSCTL
+	keexit_GST();
+#endif
+
+	fini_children();
+
+	printf("%s unloaded\n", __FUNCTION__);
+}
diff --git a/example/ipfw/extra/linux_defs.h b/example/ipfw/extra/linux_defs.h
new file mode 100644
index 0000000..b7994cf
--- /dev/null
+++ b/example/ipfw/extra/linux_defs.h
@@ -0,0 +1,144 @@
+#ifndef __LINUX_DEFS_
+#define __LINUX_DEFS_
+
+/* define, includes and functions missing in linux */
+
+#ifdef __linux__
+/* include and define */
+#include <arpa/inet.h>		/* inet_ntoa */
+#include
+
+#include <errno.h>		/* error define */
+#include <sys/types.h>		/* u_int32_t */
+#include <stdio.h>		/* snprintf */
+
+typedef struct mtx spinlock_t;
+typedef struct mtx rwlock_t;
+
+/*
+ * some network structures can be defined in the BSD way
+ * by using the _FAVOR_BSD definition. This is not true
+ * for the icmp structure.
+ * XXX struct icmp contains bsd names in
+ * /usr/include/netinet/ip_icmp.h
+ */
+#define icmp_code code
+#define icmp_type type
+
+/* linux in6_addr has no member __u6_addr
+ * replace the whole structure ?
+ */
+#define __u6_addr __in6_u
+// #define __u6_addr32 u6_addr32
+
+/* defined in linux/sctp.h with no bsd definition */
+struct sctphdr {
+	uint16_t src_port;	/* source port */
+	uint16_t dest_port;	/* destination port */
+	uint32_t v_tag;		/* verification tag of packet */
+	uint32_t checksum;	/* Adler32 C-Sum */
+	/* chunks follow... */
+} SCTP_PACKED;
+
+/* missing definition */
+#define TH_FIN 0x01
+#define TH_SYN 0x02
+#define TH_RST 0x04
+#define TH_ACK 0x10
+
+#define RTF_CLONING	0x100	/* generate new routes on use */
+
+#define IPPROTO_OSPFIGP	89	/* OSPFIGP */
+#define IPPROTO_CARP	112	/* CARP */
+#define IPPROTO_IPV4	IPPROTO_IPIP	/* for compatibility */
+
+#define CARP_VERSION	2
+#define CARP_ADVERTISEMENT	0x01
+
+#define PRIV_NETINET_IPFW	491	/* Administer IPFW firewall. */
+#define offsetof(TYPE, MEMBER) ((size_t) &((TYPE *)0)->MEMBER) /* linux/stdlib */
+
+#define IP_FORWARDING	0x1	/* most of ip header exists */
+
+#define NETISR_IP	2	/* same as AF_INET */
+
+#define PRIV_NETINET_DUMMYNET	494	/* Administer DUMMYNET. */
+
+extern int securelevel;
+
+struct carp_header {
+#if BYTE_ORDER == LITTLE_ENDIAN
+	u_int8_t	carp_type:4,
+			carp_version:4;
+#endif
+#if BYTE_ORDER == BIG_ENDIAN
+	u_int8_t	carp_version:4,
+			carp_type:4;
+#endif
+};
+
+struct pim {
+};
+
+struct route {
+	struct rtentry *ro_rt;
+	struct sockaddr ro_dst;
+};
+
+
+#if 0 // already in main header
+struct ifaltq {
+	void *ifq_head;
+};
+
+struct ifnet {
+	char	if_xname[IFNAMSIZ];	/* external name (name + unit) */
+	struct ifaltq if_snd;		/* output queue (includes altq) */
+};
+
+/* involves mbufs */
+int in_cksum(struct mbuf *m, int len);
+#define divert_cookie(mtag) 0
+#define divert_info(mtag) 0
+#define INADDR_TO_IFP(a, b) b = NULL
+#define pf_find_mtag(a) NULL
+#define pf_get_mtag(a) NULL
+#define AF_LINK AF_ASH	/* ? linux/socket.h */
+
+struct pf_mtag {
+	void		*hdr;	/* saved hdr pos in mbuf, for ECN */
+	sa_family_t	 af;	/* for ECN */
+	u_int32_t	 qid;	/* queue id */
+};
+#endif
+
+/* radix related */
+
+#if 0
+struct radix_node {
+	caddr_t	rn_key;		/* object of search */
+	caddr_t	rn_mask;	/* netmask, if present */
+};
+#endif
+
+
+/* missing functions */
+
+/* from bsd sys/queue.h */
+#define TAILQ_FOREACH_SAFE(var, head, field, tvar)		\
+	for ((var) = TAILQ_FIRST((head));			\
+	    (var) && ((tvar) = TAILQ_NEXT((var), field), 1);	\
+	    (var) = (tvar))
+
+#define SLIST_FOREACH_SAFE(var, head, field, tvar)		\
+	for ((var) = SLIST_FIRST((head));			\
+	    (var) && ((tvar) = SLIST_NEXT((var), field), 1);	\
+	    (var) = (tvar))
+
+/* depending of linux version */
+#ifndef ETHERTYPE_IPV6
+#define ETHERTYPE_IPV6	0x86dd	/* IP protocol version 6 */
+#endif
+
+#endif /* __linux__ */
+#endif /* !__LINUX_DEFS_ */
diff --git a/example/ipfw/extra/missing.c b/example/ipfw/extra/missing.c
new file mode 100644
index 0000000..1713bdd
--- /dev/null
+++ b/example/ipfw/extra/missing.c
@@ -0,0 +1,732 @@
+/*
+ * $Id$
+ *
+ * Support to compile the kernel side of ipfw/dummynet in userland.
+ * This file contains variables and functions that are not available in
+ * userland. It is compiled in a kernel-like environment, so
+ * it has _KERNEL defined, together with malloc() and free().
+ * They must be redefined here as we build the real thing.
+ */
+
+#include "glue.h"	/* normally comes from the command line */
+#include "missing.h"	/* normally comes from the command line */
+#undef _KERNEL
+#include
+#include
+#include
+#include <sys/time.h>	/* timersub */
+#define _KERNEL
+
+#include
+#include
+
+#include <sys/malloc.h>
+#undef malloc
+#undef free
+
+#include <stdlib.h>	// calloc
+
+#include <sys/socket.h>	/* struct sockaddr, route, sockopt... */
+#include
+
+#if 0
+#define IF_NAMESIZE	16	/* ip_fw.h */
+#define IFNAMSIZ	IF_NAMESIZE	/* ip_fw.h */
+#endif
+
+
+/*
+ * Global variables in the kernel
+ */
+int ticks;		/* kernel ticks counter */
+int hz = 5000;		/* default clock time */
+long tick = 0;		/* XXX is this 100000/hz ?
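+			 * [Editorial note: in FreeBSD this is
+			 * 1000000/hz, the tick length in
+			 * microseconds, i.e. 200 with hz = 5000.]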
+int bootverbose = 0;
+time_t time_uptime = 0;
+struct timeval boottime;
+
+int max_protohdr = 14 + 4 + 20 + 20;	/* mac, vlan, ip, tcp */
+int max_linkhdr;
+int ip_defttl;
+u_long in_ifaddrhmask;				/* mask for hash table */
+struct in_ifaddrhashhead *in_ifaddrhashtbl;	/* inet addr hash table */
+
+u_int rt_numfibs = RT_NUMFIBS;
+
+void
+module_register_init(const void *foo)
+{
+	D("start for %p", foo);
+}
+
+/* defined as assert */
+#include <assert.h>
+void
+panic(const char *fmt, ...)
+{
+	assert(1);
+}
+
+void
+getmicrouptime(struct timeval *tv)
+{
+	gettimeofday(tv, NULL);
+}
+
+/*
+ * pfil hook support.
+ * We make pfil_head_get return a non-null pointer, which is then ignored
+ * in our 'add-hook' routines.
+ */
+struct pfil_head;
+typedef int (pfil_hook_t)
+	(void *, struct mbuf **, struct ifnet *, int, struct inpcb *);
+
+struct pfil_head *
+pfil_head_get(int proto, u_long flags)
+{
+	static int dummy;
+	D("called");
+	return (struct pfil_head *)(void *)&dummy;
+}
+
+int
+pfil_add_hook(pfil_hook_t *func, void *arg, int dir, struct pfil_head *h)
+{
+	D("called");
+	return 0;
+}
+
+int
+pfil_remove_hook(pfil_hook_t *func, void *arg, int dir, struct pfil_head *h)
+{
+	D("called");
+	return 0;
+}
+
+/* from sys/netinet/ip_output.c */
+int
+ip_output(struct mbuf *m, struct mbuf *opt, struct route *ro, int flags,
+    struct ip_moptions *imo, struct inpcb *inp)
+{
+	D("unimplemented");
+	return 0;
+}
+
+struct tags_freelist tags_freelist;
+int tags_minlen = 64;
+int tags_freelist_count = 0;
+static int tags_freelist_max = 0;
+
+struct mbuf *mbuf_freelist;
+
+void
+m_freem(struct mbuf *m)
+{
+	struct m_tag *t;
+
+	/* free the m_tag chain */
+	while ( (t = SLIST_FIRST(&m->m_pkthdr.tags) ) ) {
+		ND("free tag %p", &m->m_pkthdr.tags);
+		SLIST_REMOVE_HEAD(&m->m_pkthdr.tags, m_tag_link);
+		SLIST_INSERT_HEAD(&tags_freelist, t, m_tag_link);
+		tags_freelist_count++;
+		if (tags_freelist_count > tags_freelist_max) {
+			static int pr=0;
+			if ((pr++ % 1000) == 0)
+				D("new max %d", tags_freelist_count);
+			tags_freelist_max = tags_freelist_count;
+		}
+	}
+	if (m->m_flags & M_STACK) {
+		ND("free invalid mbuf %p", m);
+		return;
+	}
+	/* free the mbuf */
+	ND("free(m = %p, M_IPFW);", m);
+	m->m_next = mbuf_freelist;
+	mbuf_freelist = m;
+}
+
+/* from net/netisr.c */
+int
+netisr_dispatch(u_int proto, struct mbuf *m)
+{
+	if ((int)proto < 0)
+		m_freem(m);
+	else if (m->__m_callback)
+		m->__m_callback(m, proto);
+	else
+		D("unimplemented proto %d mbuf %p", proto, m);
+	return 0;
+}
+
+/* define empty body for kernel function */
+int
+priv_check(struct thread *td, int priv)
+{
+	/* once connected, always allow */
+	ND("called");
+	return 0;
+}
+
+int
+securelevel_ge(struct ucred *cr, int level)
+{
+	/* we are always secure... */
+	ND("called");
+	return 0;
+}
+
+int
+sysctl_handle_int(SYSCTL_HANDLER_ARGS)
+{
+	int tmp;
+
+	ND("called");
+	if (!req || !req->oldptr || req->oldlen != sizeof(int))
+		return EINVAL;
+	tmp = arg1 ? *(int *)arg1 : arg2;
+	bcopy(&tmp, req->oldptr, sizeof(int));
+	/* XXX check the SET routine */
+	if (req->newptr && arg1)
+		bcopy(req->newptr, arg1, sizeof(int));
+	return 0;
+}
+
+int
+sysctl_handle_long(SYSCTL_HANDLER_ARGS)
+{
+	ND("called");
+	sysctl_handle_int(oidp, arg1, arg2, req);
+	return 0;
+}
+
+void
+ether_demux(struct ifnet *ifp, struct mbuf *m)
+{
+	if (m->__m_callback)
+		m->__m_callback(m, 0);
+	else
+		D("missing callback mbuf %p", m);
+	return;
+}
+
+int
+ether_output_frame(struct ifnet *ifp, struct mbuf *m)
+{
+	D("incomplete");
+	return 0;
+}
+
+void
+in_rtalloc_ign(struct route *ro, u_long ignflags, u_int fibnum)
+{
+	D("called");
+	return;
+}
+
+void
+icmp_error(struct mbuf *n, int type, int code, n_long dest, int mtu)
+{
+	D("called");
+	return;
+}
+
+void
+rtfree(struct rtentry *rt)
+{
+	D("called");
+	return;
+}
+
+u_short
+in_cksum_skip(struct mbuf *m, int len, int skip)
+{
+	D("called");
+	return 0;
+}
+
+u_short
+in_cksum_hdr(struct ip *ip)
+{
+	D("called");
+	return 0;
+}
+
+
+struct mbuf *
+ip_reass(struct mbuf *clone)
+{
+	D("called");
+	return clone;
+}
+#ifdef INP_LOCK_ASSERT
+#undef INP_LOCK_ASSERT
+#define INP_LOCK_ASSERT(a)
+#endif
+
+int
+jailed(struct ucred *cred)
+{
+	D("called");
+	return 0;
+}
+
+/*
+* Return 1 if an internet address is for a ``local'' host
+* (one to which we have a connection). If subnetsarelocal
+* is true, this includes other subnets of the local net.
+* Otherwise, it includes only the directly-connected (sub)nets.
+*/
+int
+in_localaddr(struct in_addr in)
+{
+	D("called");
+	return 1;
+}
+
+#if 0
+int ipfw_chg_hook(SYSCTL_HANDLER_ARGS)
+{
+	return 1;
+}
+#endif
+
+/*
+ * Procedures for the callout interface
+ *
+ * callout_init() initializes a descriptor,
+ * callout_reset() starts a timer
+ * callout_stop() stops a timer
+ *
+ * Internally we hold a list of callout entries etc etc.
+ */
+
+struct callout_tailq callout_head;
+
+#include
+void
+callout_init(struct callout *c, int mpsafe)
+{
+	D("c %p mpsafe %d", c, mpsafe);
+	bzero(c, sizeof(*c));
+}
+
+int
+callout_reset_on(struct callout *c, int due_ticks, void (*func)(void *), void *arg, int p)
+{
+	return callout_reset(c, due_ticks, func, arg);
+}
+
+int
+callout_reset(struct callout *c, int due_ticks, void (*func)(void *), void *arg)
+{
+	struct callout *cur;
+
+	ND("c %p ticks %d f %p(%p)", c, due_ticks, func, arg);
+	if (c->c_flags & CALLOUT_ACTIVE) {
+		D(" --- callout was already active");
+		return -1;
+	}
+	c->c_time = ticks + due_ticks;	/* XXX not the original meaning */
+	c->c_func = func;
+	c->c_arg = arg;
+	c->c_flags |= CALLOUT_ACTIVE;
+	TAILQ_FOREACH(cur, &callout_head, c_links.tqe) {
+		if ( (c->c_time - cur->c_time) < 0)
+			break;
+	}
+	if (cur)
+		TAILQ_INSERT_BEFORE(cur, c, c_links.tqe);
+	else
+		TAILQ_INSERT_TAIL(&callout_head, c, c_links.tqe);
+	return 0;	/* no error */
+}
+
+int
+_callout_stop_safe(struct callout *c, int safe)
+{
+	D("c %p safe %d", c, safe);
+	TAILQ_REMOVE(&callout_head, c, c_links.tqe);
+	return 0;
+}
+
+int
+callout_drain(struct callout *c)
+{
+	_callout_stop_safe(c, 1);
+	return 0;
+}
+
+void
+callout_startup(void)
+{
+	D("start");
+	TAILQ_INIT( &callout_head);
+}
+
+void
+callout_run(void)
+{
+	struct callout *cur, *tmp;
+
+	ND("Run pending callouts tick %d", ticks);
+	TAILQ_FOREACH_SAFE(cur, &callout_head, c_links.tqe, tmp) {
+		int delta = ticks - cur->c_time;
+		if (delta < 0) {	// early ?
+			//fprintf(stderr, "c %p due at %d\n", cur, cur->c_time);
+			continue;
+		}
+		if (delta > 100)
+			RD(1,"running %p due at %d now %d", cur, cur->c_time, ticks);
+		TAILQ_REMOVE(&callout_head, cur, c_links.tqe);
+		cur->c_flags &= ~CALLOUT_ACTIVE;
+		cur->c_func(cur->c_arg);
+	}
+}
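
The emulation drives these callouts from mainloop() (in session.c, later in this patch) by advancing `ticks` and calling callout_run() once per tick. A client therefore schedules work exactly as kernel code would; a minimal sketch, with demo_co/demo_cb invented for illustration:

	static struct callout demo_co;

	static void
	demo_cb(void *arg)
	{
		printf("fired at tick %d\n", ticks);	/* runs from callout_run() */
	}

	void
	demo_schedule(void)
	{
		callout_init(&demo_co, 1);	/* mpsafe flag, ignored here */
		/* hz ticks per emulated second, so this fires ~2s from now */
		callout_reset(&demo_co, 2 * hz, demo_cb, NULL);
	}
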
+
+/*
+ * the taskqueue type is actually opaque
+ */
+struct taskqueue {
+	STAILQ_ENTRY(taskqueue)	tq_link;
+	STAILQ_HEAD(, task)	tq_queue;
+	const char		*tq_name;
+	taskqueue_enqueue_fn	tq_enqueue;
+	void			*tq_context;
+	struct task		*tq_running;
+	int			tq_pcount;
+	int			tq_spin;
+	int			tq_flags;
+};
+
+#if 0
+/*
+ * instead of enqueueing, we run this immediately.
+ */
+int
+taskqueue_enqueue(struct taskqueue *queue, struct task *task)
+{
+	task->ta_func(task->ta_context, 1);
+	return 0;
+}
+#endif
+
+void
+taskqueue_thread_enqueue(void *context)
+{
+	D("ctx %p", context);
+}
+
+struct taskqueue *
+taskqueue_create_fast(const char *name, int mflags,
+    taskqueue_enqueue_fn enqueue, void *context)
+{
+	struct taskqueue *tq;
+
+	tq = calloc(1, sizeof(*tq));
+	if (tq == NULL)
+		return NULL;
+	D("start %s fn %p ctx %p", name, enqueue, context);
+	return tq;
+}
+
+int
+taskqueue_start_threads(struct taskqueue **tqp, int count, int pri,
+    const char *name, ...)
+{
+	D("tqp %p count %d (dummy)", tqp, count);
+	return 0;
+}
+
+void
+taskqueue_drain(struct taskqueue *queue, struct task *task)
+{
+	D("q %p task %p", queue, task);
+}
+
+void
+taskqueue_free(struct taskqueue *queue)
+{
+	D("q %p", queue);
+	free(queue);
+}
+
+void *
+kern_malloc(int sz)
+{
+	return calloc(sz, 1);	/* most of the time we want zeroed memory */
+}
+
+void
+kern_free(void *p)
+{
+	free(p);
+}
+
+#ifdef linux
+size_t
+strlcpy(char *dst, const char *src, size_t siz)
+{
+	char *d = dst;
+	const char *s = src;
+	size_t n = siz;
+
+	/* Copy as many bytes as will fit */
+	if (n != 0 && --n != 0) {
+		do {
+			if ((*d++ = *s++) == 0)
+				break;
+		} while (--n != 0);
+	}
+
+	/* Not enough room in dst, add NUL and traverse rest of src */
+	if (n == 0) {
+		if (siz != 0)
+			*d = '\0';	/* NUL-terminate dst */
+		while (*s++)
+			;
+	}
+
+	return(s - src - 1);	/* count does not include NUL */
+}
+#endif // linux
+
+#ifdef EMULATE_SYSCTL
+/*
+ * Support for sysctl emulation.
+ * We transfer options as part of the IP_DUMMYNET3 sockopt emulation,
+ * so we need to include ip_fw.h and ip_dummynet.h
+ */
+
+#include <netinet/ip_fw.h>	/* struct ip_fw_args */
+#include <netinet/ip_dummynet.h>	/* struct dn_id */
+static struct sysctltable GST;
+
+int
+kesysctl_emu_get(struct sockopt* sopt)
+{
+	struct dn_id* oid = sopt->sopt_val;
+	struct sysctlhead* entry;
+	int sizeneeded = sizeof(struct dn_id) + GST.totalsize +
+		sizeof(struct sysctlhead);
+	unsigned char* pstring;
+	unsigned char* pdata;
+	int i;
+
+	if (sopt->sopt_valsize < sizeneeded) {
+		// this is a probe to retrieve the space needed for
+		// a dump of the sysctl table
+		oid->id = sizeneeded;
+		sopt->sopt_valsize = sizeof(struct dn_id);
+		return 0;
+	}
+
+	entry = (struct sysctlhead*)(oid+1);
+	/* [entry][data(datalen)][name(namelen)] */
+	ND("copying values");
+	for (i=0; i<GST.count; i++) {
+		entry->blocklen = GST.entry[i].head.blocklen;
+		entry->namelen = GST.entry[i].head.namelen;
+		entry->flags = GST.entry[i].head.flags;
+		entry->datalen = GST.entry[i].head.datalen;
+		pdata = (unsigned char*)(entry+1);
+		pstring = pdata+GST.entry[i].head.datalen;
+		if (entry->flags & SYSCTLTYPE_PROC) {
+			//int (*f)(SYSCTL_HANDLER_ARGS);
+			sysctl_h_fn_t *f;
+			int tmp = 0, ret;
+			struct sysctl_req req;
+
+			bzero(&req, sizeof(req));
+			req.oldlen = req.newlen = sizeof(int);
+			req.oldptr = &tmp;
+			f = GST.entry[i].fn;
+			ND("-- %s is a proc -- at %p", GST.entry[i].name, f);
+			ret = f(NULL, NULL, (int)(intptr_t)(GST.entry[i].data), &req);
+			ND("-- %s returns %d", GST.entry[i].name, ret);
+			bcopy(&tmp, pdata, sizeof(tmp));
+		} else {
+			bcopy(GST.entry[i].data, pdata, GST.entry[i].head.datalen);
+		}
+		bcopy(GST.entry[i].name, pstring, GST.entry[i].head.namelen);
+		entry = (struct sysctlhead*)
+			((unsigned char*)(entry) + GST.entry[i].head.blocklen);
+	}
+	sopt->sopt_valsize = sizeneeded;
+	return 0;
+}
+
+int
+kesysctl_emu_set(void* p, int l)
+{
+	struct sysctlhead* entry;
+	unsigned char* pdata;
+	unsigned char* pstring;
+	int i = 0;
+
+	entry = (struct sysctlhead*)(((struct dn_id*)p)+1);
+	pdata = (unsigned char*)(entry+1);
+	pstring = pdata + entry->datalen;
+
+	for (i=0; i<GST.count; i++) {
+		if (strcmp((char *)pstring, GST.entry[i].name) != 0)
+			continue;
+		/* name matches, check the len consistency */
+		if (entry->datalen != GST.entry[i].head.datalen) {
+			printf("%s: len mismatch, user %d vs kernel %d\n",
+				__FUNCTION__, entry->datalen,
+				GST.entry[i].head.datalen);
+			return -1;
+		}
+		// check access (at the moment flags handles only the R/W rights;
+		// later on it will be type + access)
+		if( (GST.entry[i].head.flags & 3) == CTLFLAG_RD) {
+			printf("%s: the entry %s is read only\n",
+				__FUNCTION__,GST.entry[i].name);
+			return -1;
+		}
+		if (GST.entry[i].head.flags & SYSCTLTYPE_PROC) {
+			int (*f)(SYSCTL_HANDLER_ARGS);
+			int tmp = 0, ret;
+			struct sysctl_req req;
+
+			bzero(&req, sizeof(req));
+			req.oldlen = req.newlen = sizeof(int);
+			req.oldptr = &tmp;
+			req.newptr = pdata;
+			f = GST.entry[i].fn;
+			ND("-- %s is a proc -- at %p", GST.entry[i].name, f);
+			ret = f(NULL, NULL, (int)(intptr_t)(GST.entry[i].data), &req);
+			ND("-- %s returns %d", GST.entry[i].name, ret);
+		} else {
+			bcopy(pdata, GST.entry[i].data, GST.entry[i].head.datalen);
+		}
+		return 0;
+	}
+	D("%s: match not found\n",__FUNCTION__);
+	return 0;
+}
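
A userland reader is expected to call kesysctl_emu_get() twice: first with an undersized buffer so that the needed size comes back in oid->id, then with the real buffer to receive the [head][data][name] records. A sketch of that calling convention (local names invented, sockopt fields as used above):

	struct dn_id probe;
	struct sockopt sopt;
	char *buf;
	int need;

	sopt.sopt_val = &probe;
	sopt.sopt_valsize = sizeof(probe);	/* deliberately too small */
	kesysctl_emu_get(&sopt);		/* probe: size returned in id */
	need = probe.id;

	buf = calloc(1, need);
	sopt.sopt_val = buf;
	sopt.sopt_valsize = need;
	kesysctl_emu_get(&sopt);		/* full dump of the table */
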
+
+/* convert all _ to . until the first . */
+static void
+underscoretopoint(char* s)
+{
+	for (; *s && *s != '.'; s++)
+		if (*s == '_')
+			*s = '.';
+}
+
+static int
+formatnames(void)
+{
+	int i;
+	int size=0;
+	char* name;
+
+	for (i=0; i<GST.count; i++)
+		size += GST.entry[i].head.namelen;
+	GST.namebuffer = malloc(size);
+	if (GST.namebuffer == NULL)
+		return -1;
+	name = GST.namebuffer;
+	for (i=0; i<GST.count; i++) {
+		bcopy(GST.entry[i].name, name, GST.entry[i].head.namelen);
+		GST.entry[i].name = name;
+		name[GST.entry[i].head.namelen - 1] = '\0';
+		underscoretopoint(GST.entry[i].name);
+		name += GST.entry[i].head.namelen;
+	}
+	return 0;
+}
+
+static void
+dumpGST(void)
+{
+	int i;
+
+	for (i=0; i<GST.count; i++) {
+		printf("SYSCTL: entry %i\n", i);
+		printf("name %s\n", GST.entry[i].name);
+		printf("namelen %i\n", GST.entry[i].head.namelen);
+		printf("type %i access %i\n",
+			GST.entry[i].head.flags >> 2,
+			GST.entry[i].head.flags & 0x00000003);
+		printf("data %i\n", *(int*)(GST.entry[i].data));
+		printf("datalen %i\n", GST.entry[i].head.datalen);
+		printf("blocklen %i\n", GST.entry[i].head.blocklen);
+	}
+}
+
+void sysctl_addgroup_f1(void);
+void sysctl_addgroup_f2(void);
+void sysctl_addgroup_f3(void);
+void sysctl_addgroup_f4(void);
+
+void
+keinit_GST(void)
+{
+	int ret;
+
+	sysctl_addgroup_f1();
+	sysctl_addgroup_f2();
+	sysctl_addgroup_f3();
+	sysctl_addgroup_f4();
+	ret = formatnames();
+	if (ret != 0)
+		printf("conversion of names failed for some reason\n");
+	if (0)
+		dumpGST();	// XXX debugging
+	printf("*** Global Sysctl Table entries = %i, total size = %i ***\n",
+		GST.count, GST.totalsize);
+}
+
+void
+keexit_GST(void)
+{
+	if (GST.namebuffer != NULL)
+		free(GST.namebuffer);
+	bzero(&GST, sizeof(GST));
+}
+
+void
+sysctl_pushback(char* name, int flags, int datalen, void* data, sysctl_h_fn_t *fn)
+{
+	if (GST.count >= GST_HARD_LIMIT) {
+		printf("WARNING: global sysctl table full, this entry will not be added, "
+			"please recompile the module increasing the table size\n");
+		return;
+	}
+	GST.entry[GST.count].head.namelen = strlen(name)+1; //add space for '\0'
+	GST.entry[GST.count].name = name;
+	GST.entry[GST.count].head.flags = flags;
+	GST.entry[GST.count].data = data;
+	GST.entry[GST.count].fn = fn;
+	GST.entry[GST.count].head.datalen = datalen;
+	GST.entry[GST.count].head.blocklen =
+		((sizeof(struct sysctlhead) + GST.entry[GST.count].head.namelen +
+		GST.entry[GST.count].head.datalen)+3) & ~3;
+	GST.totalsize += GST.entry[GST.count].head.blocklen;
+	GST.count++;
+}
+#endif /* EMULATE_SYSCTL */
+
+extern int mainloop(int argc, char *argv[]);
+
+/*
+ * main program for ipfw kernel side when running a userspace emulation:
+ * open a socket on which we receive requests from userland,
+ * another socket for calls from the 'kernel' (simulating packet
+ * arrivals etc), and then periodically run the tick handler.
+ */
+int
+main(int argc, char *argv[])
+{
+	tick = 1000000/hz;
+	D("initializing tick to %ld", tick);
+	return mainloop(argc, argv);
+}
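
A note on what the EMULATE_SYSCTL machinery above ends up registering: names arrive from the SYSCTL_*() emulation in missing.h as a single '_'-separated prefix plus a dotted leaf, and formatnames()/underscoretopoint() rewrite the prefix into dotted sysctl form. A worked example (the knob name is invented):

	char name[] = "net_inet_ip_fw.enable";

	underscoretopoint(name);
	/* every '_' before the first original '.' becomes a '.':
	 * "net.inet.ip.fw.enable", i.e. the kernel sysctl name */
	printf("%s\n", name);
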
diff --git a/example/ipfw/extra/missing.h b/example/ipfw/extra/missing.h
new file mode 100644
index 0000000..b5b65b2
--- /dev/null
+++ b/example/ipfw/extra/missing.h
@@ -0,0 +1,801 @@
+/*
+ * Copyright (C) 2009 Luigi Rizzo, Marta Carbone, Universita` di Pisa
+ *
+ * Redistribution and use in source and binary forms, with or without
+ * modification, are permitted provided that the following conditions
+ * are met:
+ * 1. Redistributions of source code must retain the above copyright
+ *    notice, this list of conditions and the following disclaimer.
+ * 2. Redistributions in binary form must reproduce the above copyright
+ *    notice, this list of conditions and the following disclaimer in the
+ *    documentation and/or other materials provided with the distribution.
+ *
+ * THIS SOFTWARE IS PROVIDED BY THE AUTHOR AND CONTRIBUTORS ``AS IS'' AND
+ * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
+ * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
+ * ARE DISCLAIMED.  IN NO EVENT SHALL THE AUTHOR OR CONTRIBUTORS BE LIABLE
+ * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
+ * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
+ * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
+ * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
+ * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
+ * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
+ * SUCH DAMAGE.
+ */
+
+/*
+ * $Id: missing.h 8377 2011-04-04 16:08:27Z marta $
+ *
+ * Header for kernel variables and functions that are not available in
+ * userland.
+ */
+
+#ifndef _MISSING_H_
+#define _MISSING_H_
+
+#define KLD_MODULE	/* disable kernel dependencies */
+
+/* defined as assert */
+void panic(const char *fmt, ...);
+
+#define KASSERT(exp,msg) do { \
+	if (__predict_false(!(exp)))	\
+		panic msg;		\
+} while (0)
+/* don't bother to optimize */
+#ifndef __predict_false
+#define __predict_false(x)   (x)	/* __builtin_expect((exp), 0) */
+#endif // XXX
+
+
+#ifdef _KERNEL
+#define NEED_KERNEL
+#undef _KERNEL
+#endif
+
+#include <stdio.h>	// printf
+#include <net/if.h>	// IFNAMSIZ ?
+#include <string.h>	// strncmp
+#include <stdlib.h>	// bsearch
+#ifdef NEED_KERNEL
+#define _KERNEL
+#include
+#include
+
+#define __user	// not defined here ?
+#define __init
+#define __exit
+
+/* portability features, to be set before the rest: */
+#define WITHOUT_BPF	/* do not use bpf logging */
+
+#define MALLOC_DECLARE(x)	struct __hack	/* nothing */
+// XXX kernel malloc/free
+extern void *kern_malloc(int);
+extern void kern_free(void *);
+#define malloc(_size, type, flags) kern_malloc(_size)
+#define free(_var, type) kern_free(_var)
+
+/* inet_ntoa_r() differs in userspace and kernel.
+ * We load netinet/in.h so we get the kernel prototype ?
+ * but we also need to put #defines in the two places where
+ * it is used XXX fixme
+ */
+#include <netinet/in.h>
+
+/* log() conflicts with the math function.
+ * Revise, modifying the first argument.
+ */
+#define LOG_ERR		0x100
+#define LOG_INFO	0x200
+#ifndef LOG_SECURITY
+#define LOG_SECURITY	0x400
+#endif
+
+#define log(_level, fmt, arg...)  do {			\
+	int __attribute__((unused)) _querty = _level;	\
+	printf("kernel: " fmt, ##arg); } while (0)
+
+#endif /* _KERNEL */
+
+/*
+ * Kernel locking support.
+ * FreeBSD uses mtx in dummynet.c and struct rwlock ip_fw2.c
+ *
+ * In linux we use spinlock_bh to implement both.
+ * For 'struct rwlock' we need an #ifdef to change it to spinlock_t
+ */
+
+#ifndef DEFINE_SPINLOCK	/* this is for linux 2.4 */
+#if defined(__APPLE__)
+#define DEFINE_SPINLOCK(x)	struct mtx x;
+#else /* linux ? */
+#define DEFINE_SPINLOCK(x)	spinlock_t x	// = SPIN_LOCK_UNLOCKED
+#endif
+#endif
+
+/* 20111031
+ * redefine mutex in terms of threads.
+ */
+
+#undef _KERNEL
+// #include
+#include <pthread.h>
+#ifdef NEED_KERNEL
+#define _KERNEL
+#endif
+struct mtx {
+	pthread_mutex_t p0;
+};
+struct rwlock {
+	pthread_mutex_t p0;
+};
+struct rmlock {
+	pthread_mutex_t p0;
+};
+extern pthread_mutex_t dummynet_mtx_p;
+extern pthread_mutex_t ipfw_dyn_mtx_p;
+extern pthread_mutex_t pfil_global_lock_p;
+
+#define mtx_assert(a, b)
+/*
+ * the first argument to mtx_init is often a static variable,
+ * so use (void)m to prevent a compiler warning
+ */
+#define mtx_init(m, a,b,c)	do {			\
+	(void)m; pthread_mutex_init(&((m)->p0), NULL); } while (0)
+#define MTX_SYSINIT(a, m, c, d)	// pthread_mutex_init(m##_p, NULL)
+#define mtx_lock(m)		pthread_mutex_lock(m.p0)
+#define mtx_unlock(m)		pthread_mutex_unlock(m.p0)
+#define mtx_destroy(m)		pthread_mutex_destroy(m.p0)
+#if 1
+//------------------
+
+#if 1 // used for IPFW_UH
+#define rw_assert(a, b)
+#define rw_destroy(_l)
+#define rw_init(_l, msg)	// XXX mtx_init((_l), 0, 0, 0)
+#define rw_rlock(_l)		mtx_lock(_l)
+#define rw_runlock(_l)		mtx_unlock(_l)
+#define rw_wlock(_l)		mtx_lock(_l)
+#define rw_wunlock(_l)		mtx_unlock(_l)
+#define rw_init_flags(_l, s, v)
+#endif // XXX not used anymore
+
+#define rm_init(_l, msg)	// mtx_init(...)
+#define rm_rlock(_l, _t)	((void)_t, mtx_lock(_l))
+#define rm_runlock(_l, _t)	mtx_unlock(_l)
+#define rm_wlock(_l)		mtx_lock(_l)
+#define rm_wunlock(_l)		mtx_unlock(_l)
+#define rm_destroy(_l)		// XXX
+#define rm_assert(_l, _w)	// XXX
+
+
+#endif // locking on linux ?
+
+/* end of locking support */
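
Note the textual trick the lock macros above rely on: with an argument written as `&lvalue`, `mtx_lock(m)` expands `m.p0` to `&lvalue.p0`, which the compiler parses as `&(lvalue.p0)`, so the pthread call receives a valid pointer. They therefore only work when called with an explicit `&`. A minimal sketch (demo_mtx is invented):

	static struct mtx demo_mtx;

	void
	demo_locked_op(void)
	{
		mtx_init(&demo_mtx, "demo", NULL, 0);
		mtx_lock(&demo_mtx);	/* -> pthread_mutex_lock(&demo_mtx.p0) */
		/* critical section */
		mtx_unlock(&demo_mtx);
		mtx_destroy(&demo_mtx);
	}
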
+
+/*
+ * Reference to an ipfw rule that can be carried outside critical sections.
+ * A rule is identified by rulenum:rule_id which is ordered.
+ * In version chain_id the rule can be found in slot 'slot', so
+ * we don't need a lookup if chain_id == chain->id.
+ *
+ * On exit from the firewall this structure refers to the rule after
+ * the matching one (slot points to the new rule; rulenum:rule_id-1
+ * is the matching rule), and additional info (e.g. info often contains
+ * the insn argument or tablearg in the low 16 bits, in host format).
+ * On entry, the structure is valid if slot>0, and refers to the starting
+ * rules. 'info' contains the reason for reinject, e.g. divert port,
+ * divert direction, and so on.
+ */
+struct ipfw_rule_ref {
+	uint32_t	slot;		/* slot for matching rule	*/
+	uint32_t	rulenum;	/* matching rule number		*/
+	uint32_t	rule_id;	/* matching rule id		*/
+	uint32_t	chain_id;	/* ruleset id			*/
+	uint32_t	info;		/* see below			*/
+};
+
+/* ISO C restricts enumerator values to range of 'int'
+ * so we need IN to have a smaller value
+ */
+enum {
+	IPFW_INFO_MASK	= 0x0000ffff,
+	IPFW_INFO_OUT	= 0x00000000,	/* outgoing, just for convenience */
+	IPFW_INFO_IN	= 0x00800000,	/* incoming, overloads dir */
+	IPFW_ONEPASS	= 0x40000000,	/* One-pass, do not reinject */
+	IPFW_IS_MASK	= 0x30000000,	/* which source ? */
+	IPFW_IS_DIVERT	= 0x20000000,
+	IPFW_IS_DUMMYNET =0x10000000,
+	IPFW_IS_PIPE	= 0x08000000,	/* pipe=1, queue = 0 */
+};
+
+/* in netinet/in.h */
+#define in_nullhost(x)	((x).s_addr == INADDR_ANY)
+
+/* ip_dummynet.c */
+#ifndef __FreeBSD_version
+#define __FreeBSD_version 500035
+#endif
+
+/* define some macro for ip_dummynet */
+
+struct malloc_type {
+};
+
+#define MALLOC_DEFINE(type, shortdesc, longdesc)	\
+	struct malloc_type type[1]; void *md_dummy_ ## type = type
+
+#define CTASSERT(x)
+
+
+/*
+ * gettimeofday would be in sys/time.h but it is not
+ * visible if _KERNEL is defined
+ */
+//int gettimeofday(struct timeval *, struct timezone *);
+
+
+extern int hz;
+extern long tick;		/* exists in 2.4 but not in 2.6 */
+extern int bootverbose;
+extern struct timeval boottime;
+
+/* time_uptime is a FreeBSD variable increased each second */
+extern time_t time_uptime;
+
+extern int max_linkhdr;
+extern int ip_defttl;
+extern u_long in_ifaddrhmask;				/* mask for hash table */
+extern struct in_ifaddrhashhead *in_ifaddrhashtbl;	/* inet addr hash table */
+
+/*-------------------------------------------------*/
+
+/* define, includes and functions missing in linux */
+/* include and define */
+#include <arpa/inet.h>	/* inet_ntoa */
+
+struct mbuf;
+// XXX #define M_MCAST		0x04 /* send/received as link-level multicast */
+
+
+/* used by ip_dummynet.c */
+void reinject_drop(struct mbuf* m);
+
+#include <net/ethernet.h>	/* for ETHERTYPE_IP */
+
+#ifdef _KERNEL
+#define IF_NAMESIZE	16
+#ifndef IFNAMSIZ
+#define IFNAMSIZ	IF_NAMESIZE
+#endif
+//#include	/* IFNAMESIZ */
+#endif
+
+/*
+ * some network structures can be defined in the bsd way
+ * by using the _FAVOR_BSD definition. This is not true
+ * for the icmp structure.
+ * XXX struct icmp contains bsd names in
+ * /usr/include/netinet/ip_icmp.h
+ */
+
+/* missing definition */
+#define TH_FIN	0x01
+#define TH_SYN	0x02
+#define TH_RST	0x04
+#define TH_ACK	0x10
+
+/* 20131101 IPTOS from ip.h */
+/*
+ * Definitions for DiffServ Codepoints as per RFC2474
+ */
+#define	IPTOS_DSCP_CS0		0x00
+#define	IPTOS_DSCP_CS1		0x20
+#define	IPTOS_DSCP_AF11		0x28
+#define	IPTOS_DSCP_AF12		0x30
+#define	IPTOS_DSCP_AF13		0x38
+#define	IPTOS_DSCP_CS2		0x40
+#define	IPTOS_DSCP_AF21		0x48
+#define	IPTOS_DSCP_AF22		0x50
+#define	IPTOS_DSCP_AF23		0x58
+#define	IPTOS_DSCP_CS3		0x60
+#define	IPTOS_DSCP_AF31		0x68
+#define	IPTOS_DSCP_AF32		0x70
+#define	IPTOS_DSCP_AF33		0x78
+#define	IPTOS_DSCP_CS4		0x80
+#define	IPTOS_DSCP_AF41		0x88
+#define	IPTOS_DSCP_AF42		0x90
+#define	IPTOS_DSCP_AF43		0x98
+#define	IPTOS_DSCP_CS5		0xa0
+#define	IPTOS_DSCP_EF		0xb8
+#define	IPTOS_DSCP_CS6		0xc0
+#define	IPTOS_DSCP_CS7		0xe0
+
+/*
+ * ECN (Explicit Congestion Notification) codepoints in RFC3168 mapped to the
+ * lower 2 bits of the TOS field.
+ */
+#define	IPTOS_ECN_NOTECT	0x00	/* not-ECT */
+#define	IPTOS_ECN_ECT1		0x01	/* ECN-capable transport (1) */
+#define	IPTOS_ECN_ECT0		0x02	/* ECN-capable transport (0) */
+#define	IPTOS_ECN_CE		0x03	/* congestion experienced */
+#define	IPTOS_ECN_MASK		0x03	/* ECN field mask */
+
+/*------------------------- */
+
+#define RTF_CLONING	0x100	/* generate new routes on use */
+
+#define IPPROTO_OSPFIGP		89	/* OSPFIGP */
+#define IPPROTO_CARP		112	/* CARP */
+#define CARP_VERSION		2
+#define CARP_ADVERTISEMENT	0x01
+#define PRIV_NETINET_IPFW	491	/* Administer IPFW firewall. */
+#define IP_FORWARDING		0x1	/* most of ip header exists */
+#define NETISR_IP		2	/* same as AF_INET */
+#define PRIV_NETINET_DUMMYNET	494	/* Administer DUMMYNET. */
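
Given the ipfw_rule_ref encoding described above, a reinject reason is recovered by masking 'info'; a small sketch of the decode (reinject_source is invented, the masks come from the enum in this header):

	static int
	reinject_source(const struct ipfw_rule_ref *r)
	{
		switch (r->info & IPFW_IS_MASK) {
		case IPFW_IS_DIVERT:
			return 1;	/* low 16 bits carry the divert port */
		case IPFW_IS_DUMMYNET:
			return 2;	/* pipe or queue, see IPFW_IS_PIPE */
		default:
			return 0;	/* ordinary input/output path */
		}
	}
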
+
+extern int securelevel;
+
+#define if_xname	name
+#define if_snd		XXX
+
+// XXX we could use this to point to the incoming peer
+struct ifnet {
+	char	if_xname[IFNAMSIZ];	/* external name (name + unit) */
+	uint32_t	if_index;	// IP_FW_3
+};
+
+struct ifaltq {
+
+	void *ifq_head;
+};
+int ffs(int);	// XXX where
+int fls(int);	// XXX where
+
+struct ip;
+/* machine/in_cksum.h */
+int in_cksum(struct mbuf *m, int len);
+#ifndef __FreeBSD__
+u_short in_cksum_hdr(struct ip *);
+#endif
+
+
+#define CTR3(a, ...)
+#define uma_zone_set_max(a, b)	// XXX
+
+/*
+ * ifnet->if_snd is used in ip_dummynet.c to take the transmission
+ * clock.
+ */
+#if defined( __linux__)
+#define if_xname	name
+#define if_snd		XXX
+
+struct route_in6 {
+};
+
+#elif defined( _WIN32 )
+/* used in ip_dummynet.c */
+struct ifnet {
+	char	if_xname[IFNAMSIZ];	/* external name (name + unit) */
+//	struct ifaltq if_snd;		/* output queue (includes altq) */
+};
+
+struct net_device {
+	char	if_xname[IFNAMSIZ];	/* external name (name + unit) */
+};
+#elif defined(__APPLE__)
+typedef u_int32_t tcp_cc;
+#ifndef s6_addr32	// XXX
+#define s6_addr32 __u6_addr.__u6_addr32
+#endif
+#include
+
+struct route_in6 {
+};
+
+struct icmphdr {
+	u_char	icmp_type;	/* type of message, see below */
+	u_char	icmp_code;	/* type sub code */
+	u_short	icmp_cksum;	/* ones complement cksum of struct */
+};
+
+#define IPPROTO_SCTP	132	/* SCTP */
+
+/* defined in linux/sctp.h with no bsd definition */
+struct sctphdr {
+	uint16_t src_port;	/* source port */
+	uint16_t dest_port;	/* destination port */
+	uint32_t v_tag;		/* verification tag of packet */
+	uint32_t checksum;	/* Adler32 C-Sum */
+	/* chunks follow... */
+};
+
+struct carp_header {
+#if BYTE_ORDER == LITTLE_ENDIAN
+	u_int8_t	carp_type:4,
+			carp_version:4;
+#endif
+#if BYTE_ORDER == BIG_ENDIAN
+	u_int8_t	carp_version:4,
+			carp_type:4;
+#endif
+};
+
+
+struct pim {
+	int dummy;	/* windows compiler does not like empty definition */
+};
+
+#endif
+
+/* involves mbufs */
+//int in_cksum(struct mbuf *m, int len);
+#define divert_cookie(mtag) 0
+#define divert_info(mtag) 0
+#define pf_find_mtag(a) NULL
+#define pf_get_mtag(a) NULL
+#if !defined(_WIN32) && !defined(AF_LINK)
+#define AF_LINK AF_ASH	/* ? our sys/socket.h */
+#endif
+
+/* search the local ip addresses, used for the "me" keyword */
+#define INADDR_TO_IFP(ip, b)	b = NULL
+
+/* we don't pullup, either success or free and fail */
+#define m_pullup(m, x)	\
+	((m)->m_len >= x ? (m) : (FREE_PKT(m), NULL))
+
+struct pf_mtag {
+	void		*hdr;	/* saved hdr pos in mbuf, for ECN */
+	sa_family_t	af;	/* for ECN */
+	u_int32_t	qid;	/* queue id */
+};
+
+/* missing kernel functions */
+char *inet_ntoa(struct in_addr ina);
+long random(void);
+
+/*
+ * Return the result of a/b
+ *
+ * this is used in linux kernel space,
+ * since the 64bit division needs to
+ * be done using a macro
+ */
+//int64_t div64(int64_t a, int64_t b);
+
+/* from bsd sys/queue.h */
+#define TAILQ_FOREACH_SAFE(var, head, field, tvar)	\
+	for ((var) = TAILQ_FIRST((head));		\
+	    (var) && ((tvar) = TAILQ_NEXT((var), field), 1);	\
+	    (var) = (tvar))
+
+#define SLIST_FOREACH_SAFE(var, head, field, tvar)	\
+	for ((var) = SLIST_FIRST((head));		\
+	    (var) && ((tvar) = SLIST_NEXT((var), field), 1);	\
+	    (var) = (tvar))
+
+/*-------------------------------------------------*/
+#define RT_NUMFIBS 1
+extern u_int rt_numfibs;
+
+/* involves kernel locking function */
+#ifdef RTFREE
+#undef RTFREE
+#define RTFREE(a) fprintf(stderr, "RTFREE: commented out locks\n");
+#endif
+
+void getmicrouptime(struct timeval *tv);
+
+/* from sys/netinet/ip_output.c */
+struct ip_moptions;
+struct route;
+struct ip;
+
+struct inpcb;
+struct mbuf *ip_reass(struct mbuf *);
+int ip_output(struct mbuf *m, struct mbuf *opt, struct route *ro, int flags,
+    struct ip_moptions *imo, struct inpcb *inp);
+
+/* from net/netisr.c -- fails on FreeBSD */
+int netisr_dispatch(u_int proto, struct mbuf *m);
+
+
+/* definition moved to missing.c */
+int sooptcopyout(struct sockopt *sopt, const void *buf, size_t len);
+int copyout(const void *kaddr, void *uaddr, size_t len);
+
+int sooptcopyin(struct sockopt *sopt, void *buf, size_t len, size_t minlen);
+
+/* defined in session.c */
+int priv_check(struct thread *td, int priv);
+
+/* struct ucred is in linux/socket.h and has pid, uid, gid.
+ * We need a 'bsd_ucred' to also store the extra info
+ */
+
+struct bsd_ucred {
+	uid_t		uid;
+	gid_t		gid;
+	uint32_t	xid;
+	uint32_t	nid;
+};
+
+#ifdef _KERNEL
+
+#if 0 // XXX
+int
+cred_check(void *insn, int proto, struct ifnet *oif,
+    struct in_addr dst_ip, u_int16_t dst_port, struct in_addr src_ip,
+    u_int16_t src_port, struct bsd_ucred *u, int *ugid_lookupp,
+    struct sk_buff *skb);
+#endif
+
+struct ucred;
+int securelevel_ge(struct ucred *cr, int level);
+
+/*
+ * stripped down version of the sysctl api
+ */
+struct sysctl_oid;
+struct sysctl_req {
+	void	*oldptr;	/* store here the original value */
+	int	oldlen;
+	void	*newptr;	/* NULL on reads */
+	int	newlen;
+};
+
+#ifdef _WIN32
+#define module_param_named(_name, _var, _ty, _perm)
+#else /* !_WIN32 */
+
+#endif /* !_WIN32 so maybe __linux__ */
+
+#if 0 // XXX disable sysctl defined (__linux__) && !defined (EMULATE_SYSCTL)
+#define SYSCTL_DECL(_1)
+#define SYSCTL_OID(_1, _2, _3, _4, _5, _6, _7, _8)
+#define SYSCTL_NODE(_1, _2, _3, _4, _5, _6)
+#define _SYSCTL_BASE(_name, _var, _ty, _perm)		\
+	module_param_named(_name, *(_var), _ty,		\
+		( (_perm) == CTLFLAG_RD) ? 0444: 0644 )
+#define SYSCTL_PROC(_base, _oid, _name, _mode, _var, _val, _desc, _a, _b)
+
+#define SYSCTL_INT(_base, _oid, _name, _mode, _var, _val, _desc)	\
+	_SYSCTL_BASE(_name, _var, int, _mode)
+
+#define SYSCTL_LONG(_base, _oid, _name, _mode, _var, _val, _desc)	\
+	_SYSCTL_BASE(_name, _var, long, _mode)
+
+#define SYSCTL_ULONG(_base, _oid, _name, _mode, _var, _val, _desc)	\
+	_SYSCTL_BASE(_name, _var, ulong, _mode)
+
+#define SYSCTL_UINT(_base, _oid, _name, _mode, _var, _val, _desc)	\
+	_SYSCTL_BASE(_name, _var, uint, _mode)
+
+#define TUNABLE_INT(_name, _ptr)
+
+#define SYSCTL_VNET_PROC	SYSCTL_PROC
+#define SYSCTL_VNET_INT		SYSCTL_INT
+#define SYSCTL_VNET_UINT	SYSCTL_UINT
+
+#endif
+
+#define SYSCTL_HANDLER_ARGS \
+	struct sysctl_oid *oidp, void *arg1, int arg2, struct sysctl_req *req
+typedef int (sysctl_h_fn_t)(SYSCTL_HANDLER_ARGS);
+int sysctl_handle_int(SYSCTL_HANDLER_ARGS);
+int sysctl_handle_long(SYSCTL_HANDLER_ARGS);
+
+#ifdef EMULATE_SYSCTL	/* mandatory here */
+
+#define STRINGIFY(x) #x
+
+#ifdef SYSCTL_NODE
+#undef SYSCTL_NODE
+#endif
+#define SYSCTL_NODE(a,b,c,d,e,f)	int a; (void)a
+#define SYSCTL_DECL(a)
+
+#define GST_HARD_LIMIT 100
+
+/* In the module, GST is implemented as an array of
+ * sysctlentry, but while passing data to the userland
+ * pointers are useless, the buffer is actually made of:
+ * - sysctlhead (fixed size, containing lengths)
+ * - data (typically 32 bit)
+ * - name (zero-terminated and padded to mod4)
+ */
+
+struct sysctlentry {
+	struct sysctlhead head;
+	char* name;
+	void* data;
+	sysctl_h_fn_t *fn;
+};
+
+struct sysctltable {
+	int count;	//number of valid entries
+	int totalsize;	//total size of valid entries of all the valid tables
+	void* namebuffer;	//a buffer for all chained names
+	struct sysctlentry entry[GST_HARD_LIMIT];
+};
+
+#ifdef SYSBEGIN
+#undef SYSBEGIN
+#endif
+#define SYSBEGIN(x) void sysctl_addgroup_##x() {
+#ifdef SYSEND
+#undef SYSEND
+#endif
+#define SYSEND }
+
+/* XXX remove duplication */
+#define SYSCTL_INT(a,b,c,d,e,f,g)			\
+	sysctl_pushback(STRINGIFY(a) "." STRINGIFY(c) + 1,	\
+		(d) | (SYSCTLTYPE_INT << 2), sizeof(*e), e, NULL)
+
+#define SYSCTL_UINT(a,b,c,d,e,f,g)			\
+	sysctl_pushback(STRINGIFY(a) "." STRINGIFY(c) + 1,	\
+		(d) | (SYSCTLTYPE_UINT << 2), sizeof(*e), e, NULL)
+
+#define SYSCTL_LONG(a,b,c,d,e,f,g)			\
+	sysctl_pushback(STRINGIFY(a) "." STRINGIFY(c) + 1,	\
+		(d) | (SYSCTLTYPE_LONG << 2), sizeof(*e), e, NULL)
+
+#define SYSCTL_ULONG(a,b,c,d,e,f,g)			\
+	sysctl_pushback(STRINGIFY(a) "." STRINGIFY(c) + 1,	\
+		(d) | (SYSCTLTYPE_ULONG << 2), sizeof(*e), e, NULL)
+#define TUNABLE_INT(a,b)
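
Under EMULATE_SYSCTL the kernel-style declarations above become plain sysctl_pushback() calls collected by SYSBEGIN/SYSEND into one registration function. A sketch of a group (the knob name is invented; OID_AUTO and CTLFLAG_RW are assumed to come from glue.h, and keinit_GST() would also have to call the generated sysctl_addgroup_f9()):

	static int demo_verbose = 1;

	SYSBEGIN(f9)	/* expands to: void sysctl_addgroup_f9() { */
	SYSCTL_DECL(_net_inet_ip_fw);
	SYSCTL_INT(_net_inet_ip_fw, OID_AUTO, demo_verbose, CTLFLAG_RW,
	    &demo_verbose, 0, "demo knob");
	/* the line above is equivalent to:
	 * sysctl_pushback("net_inet_ip_fw.demo_verbose",
	 *	CTLFLAG_RW | (SYSCTLTYPE_INT << 2), sizeof(int),
	 *	&demo_verbose, NULL); */
	SYSEND	/* } */
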
+
+#define SYSCTL_PROC(a,b,c,d,e,f,g,h,i)			\
+	sysctl_pushback(STRINGIFY(a) "." STRINGIFY(c) + 1,	\
+		(d), 4 /* XXX large */, (void *)(f /* arg2 */), g)
+
+#define SYSCTL_VNET_PROC	SYSCTL_PROC
+#define SYSCTL_VNET_INT		SYSCTL_INT
+#define SYSCTL_VNET_UINT	SYSCTL_UINT
+
+void keinit_GST(void);
+void keexit_GST(void);
+int kesysctl_emu_set(void* p, int l);
+int kesysctl_emu_get(struct sockopt* sopt);
+void sysctl_pushback(char* name, int flags, int datalen, void* data, sysctl_h_fn_t *fn);
+
+#endif /* EMULATE_SYSCTL */
+
+struct ifnet;
+void ether_demux(struct ifnet *ifp, struct mbuf *m);
+
+int ether_output_frame(struct ifnet *ifp, struct mbuf *m);
+
+void in_rtalloc_ign(struct route *ro, u_long ignflags, u_int fibnum);
+
+void icmp_error(struct mbuf *n, int type, int code, uint32_t dest, int mtu);
+
+#define in_localip(_x)	(0)
+
+#ifndef __FreeBSD__
+struct rtentry;
+#endif
+void rtfree(struct rtentry *rt);
+
+u_short in_cksum_skip(struct mbuf *m, int len, int skip);
+
+#ifdef INP_LOCK_ASSERT
+#undef INP_LOCK_ASSERT
+#define INP_LOCK_ASSERT(a)
+#endif
+
+int jailed(struct ucred *cred);
+
+/*
+* Return 1 if an internet address is for a ``local'' host
+* (one to which we have a connection). If subnetsarelocal
+* is true, this includes other subnets of the local net.
+* Otherwise, it includes only the directly-connected (sub)nets.
+*/
+int in_localaddr(struct in_addr in);
+
+int fnmatch(const char *pattern, const char *string, int flags);
+
+/* vnet wrappers, in vnet.h and ip_var.h */
+//int ipfw_init(void);
+//void ipfw_destroy(void);
+
+#define	MTAG_IPFW	1148380143	/* IPFW-tagged cookie */
+#define	MTAG_IPFW_RULE	1262273568	/* rule reference */
+#define	MTAG_IPFW_CALL	1308397630	/* call stack */
+
+#ifdef __APPLE__
+#define offsetof(type, field)	__builtin_offsetof(type, field)
+#endif
+struct ip_fw_args;
+extern int (*ip_dn_io_ptr)(struct mbuf **m, int dir, struct ip_fw_args *fwa);
+
+#if 1	/* include vnet.h */
+#define curvnet                 NULL
+#define	CURVNET_SET(_v)
+#define	CURVNET_RESTORE()
+#define	VNET_ASSERT(condition)
+
+#define	VNET_NAME(n)		n
+#define	VNET_DECLARE(t, n)	extern t n
+#define	VNET_DEFINE(t, n)	t n
+#define	_VNET_PTR(b, n)		&VNET_NAME(n)
+/*
+ * Virtualized global variable accessor macros.
+ */
+#define	VNET_VNET_PTR(vnet, n)		(&(n))
+#define	VNET_VNET(vnet, n)		(n)
+
+#define	VNET_PTR(n)		(&(n))
+#define	VNET(n)			(n)
+
+#define	IS_DEFAULT_VNET(x)	(1)	// always true
+#endif
+
+VNET_DECLARE(int, ip_defttl);
+#define	V_ip_defttl	VNET(ip_defttl);
+
+
+// int ipfw_check_hook(void *arg, struct mbuf **m0, struct ifnet *ifp, int dir, struct inpcb *inp);
+// XXX used in netmap_io.c
+int ipfw_check_packet(void *arg, struct mbuf **m0, struct ifnet *ifp, int dir, struct inpcb *inp);
+int ipfw_check_frame(void *arg, struct mbuf **m0, struct ifnet *ifp, int dir, struct inpcb *inp);
+
+/* hooks for divert */
+extern void (*ip_divert_ptr)(struct mbuf *m, int incoming);
+
+extern int (*ip_dn_ctl_ptr)(struct sockopt *);
+typedef int ip_fw_ctl_t(struct sockopt *);
+extern ip_fw_ctl_t *ip_fw_ctl_ptr;
+
+
+/* netgraph prototypes */
+typedef int ng_ipfw_input_t(struct mbuf **, int, struct ip_fw_args *, int);
+extern ng_ipfw_input_t *ng_ipfw_input_p;
+
+/* For kernel ipfw_ether and ipfw_bridge. */
+struct ip_fw_args;
+
+#define	V_ip_fw_ctl_ptr		VNET(ip_fw_ctl_ptr)
+#define	V_tcbinfo		VNET(tcbinfo)
+#define	V_udbinfo		VNET(udbinfo)
+#endif /* _KERNEL */
+
+// sys/eventhandler.h
+#define EVENTHANDLER_DECLARE(a, b)
+
+/* application specific */
+struct sess;
+typedef int (handler_t)(struct sess *sess, void *arg);
+
+/*
+ * flags to control the callback
+ * WANT_READ	select on read
+ * WANT_WRITE	select on write
+ * WANT_RUN	run unconditionally
+ * WANT_DELETE	session is exiting
+ */
+enum flags_t {
+	WANT_READ=1, WANT_WRITE=2, WANT_RUN=4,
+	WANT_DELETE=0x8000
+};
+
+struct sess {
+	struct sess *next;
+	int fd;
+	handler_t *func;
+	void *arg;
+	enum flags_t flags;
+	void *private;	/* pointer managed by the session code */
+};
+struct sess *
+new_session(int fd, handler_t *func, void *arg, enum flags_t flags);
+
+
+void netmap_add_port(const char *dev);
+#endif /* !_MISSING_H_ */
diff --git a/example/ipfw/extra/session.c b/example/ipfw/extra/session.c
new file mode 100644
index 0000000..333edd3
--- /dev/null
+++ b/example/ipfw/extra/session.c
@@ -0,0 +1,644 @@
+/*
+ * Session handler to simulate soopt* and network communication
+ * over a TCP socket, and also run the callbacks.
+ */
+
+#ifdef _KERNEL
+#undef _KERNEL
+#endif
+/* these headers need to be compiled without _KERNEL */
+#include
+#include
+#include
+#include
+#include <netinet/tcp.h>	// TCP_NODELAY
+#include	// freebsd, used in rmlock
+#include <net/pfil.h>	// PFIL_IN
+#include
+extern int errno;
+
+
+#ifdef free
+/* we are built in a pseudo-kernel env so malloc and free are redefined */
+#undef free
+#undef malloc
+#endif /* free */
+
+#include
+#include
+#include
+#include <sys/time.h>	/* timersub */
+#include
+#include
+#include <unistd.h>	/* read() */
+
+#include <sys/mbuf.h>	/* mbuf */
+#define _KERNEL
+
+/* args for ipfw */
+#include <netinet/ip_fw.h>
+#include <netinet/ip_dummynet.h>
+
+/*
+ * Global variables need to be somewhere...
+ */
+void ip_dn_init(void);
+int ipfw_init(void);
+void ipfw_destroy(void);
+
+extern int (*ip_fw_ctl_ptr)(struct sockopt *);
+extern int (*ip_dn_ctl_ptr)(struct sockopt *);
+extern struct ip_fw *ip_fw_default_rule;
+
+extern int ticks;	/* kernel ticks counter */
+
+int callout_startup(void);
+int callout_run(void);
+
+/*
+ * generic handler for sockopt functions
+ */
+static int
+ctl_handler(struct sockopt *sopt)
+{
+	int error = EINVAL;
+
+	ND("called, level %d", sopt->sopt_level);
+	if (sopt->sopt_level != IPPROTO_IP)
+		return (EINVAL);
+	switch (sopt->sopt_name) {
+	default:
+		D("command not recognised %d", sopt->sopt_name);
+		break;
+	case IP_FW3:	// XXX untested
+	case IP_FW_ADD:	/* ADD actually returns the body... */
+	case IP_FW_GET:
+	case IP_FW_DEL:
+	case IP_FW_TABLE_GETSIZE:
+	case IP_FW_TABLE_LIST:
+	case IP_FW_NAT_GET_CONFIG:
+	case IP_FW_NAT_GET_LOG:
+	case IP_FW_FLUSH:
+	case IP_FW_ZERO:
+	case IP_FW_RESETLOG:
+	case IP_FW_TABLE_ADD:
+	case IP_FW_TABLE_DEL:
+	case IP_FW_TABLE_FLUSH:
+	case IP_FW_NAT_CFG:
+	case IP_FW_NAT_DEL:
+		if (ip_fw_ctl_ptr != NULL)
+			error = ip_fw_ctl_ptr(sopt);
+		else {
+			D("ipfw not enabled");
+			error = ENOPROTOOPT;
+		}
+		break;
+
+	case IP_DUMMYNET_GET:
+	case IP_DUMMYNET_CONFIGURE:
+	case IP_DUMMYNET_DEL:
+	case IP_DUMMYNET_FLUSH:
+	case IP_DUMMYNET3:
+		if (ip_dn_ctl_ptr != NULL)
+			error = ip_dn_ctl_ptr(sopt);
+		else
+			error = ENOPROTOOPT;
+		break;
+	}
+	ND("returning error %d", error);
+	return error;
+}
+
+/*
+ * copy data back to userland
+ */
+int
+sooptcopyout(struct sockopt *sopt, const void *buf, size_t len)
+{
+	size_t valsize = sopt->sopt_valsize;
+
+	ND("data len %d sopt_len %d", (int)len, (int)valsize);
+	if (len < valsize)
+		sopt->sopt_valsize = valsize = len;
+	bcopy(buf, sopt->sopt_val, valsize);
+	return 0;
+}
+
+int
+copyout(const void *kaddr, void *uaddr, size_t len)
+{
+	bcopy(kaddr, uaddr, len);
+	return 0;	/* no fault */
+}
+
+/*
+ * copy data from userland to kernel
+ */
+int
+sooptcopyin(struct sockopt *sopt, void *buf, size_t len, size_t minlen)
+{
+	size_t valsize = sopt->sopt_valsize;
+
+	ND("have %d len %d minlen %d", (int)valsize, (int)len, (int)minlen);
+	if (valsize < minlen)
+		return EINVAL;
+	if (valsize > len)
+		sopt->sopt_valsize = valsize = len;
+	bcopy(sopt->sopt_val, buf, valsize);
+	return 0;
+}
+
+/*
+ * session description for event-based programming
+ */
+/* event-based session support */
+
+#define SOCK_QLEN 5	/* listen length for incoming connection */
+
+static struct sess *all_sessions, *new_sessions;
+
+struct sess *
+new_session(int fd, handler_t *func, void *arg, enum flags_t flags)
+{
+	struct sess *desc;
+	desc = calloc(1, sizeof(*desc));
+	if (desc == NULL)
+		return NULL;
+	desc->fd = fd;
+	desc->func = func;
+	desc->arg = arg;
+	desc->flags = flags;
+	desc->next = new_sessions;
+	new_sessions = desc;
+	return desc;
+}
+
+/* remove deleted sessions, merge with new ones */
+static void
+merge_sessions(void)
+{
+	struct sess *cur, *prev, *tmp;
+
+	for (prev = NULL, cur = all_sessions; cur; prev = cur, cur = tmp) {
+		tmp = cur->next;
+		if ( (cur->flags & WANT_DELETE) == 0)
+			continue;
+		if (prev)
+			prev->next = cur->next;
+		else
+			all_sessions = cur->next;
+		memset(cur, 0, sizeof(*cur));
+		free(cur);
+		cur = prev;
+	}
+	if (prev)
+		prev->next = new_sessions;
+	else
+		all_sessions = new_sessions;
+	new_sessions = NULL;
+}
+
+/* set the fdset, return the fdmax+1 for select() */
+int
+set_sessions(fd_set *r, fd_set *w)
+{
+	struct sess *cur;
+	int fd_max = -1;
+	int count = 0,ready = 0;
+
+	FD_ZERO(r);
+	FD_ZERO(w);
+	merge_sessions();
+	for (cur = all_sessions; cur; cur = cur->next) {
+		count++;
+		if (cur->flags & WANT_RUN) {
+			ND("WANT_RUN on session %p", cur);
+			cur->flags &= ~WANT_RUN;
+			cur->func(cur, cur->arg);
+		}
+		if (cur->flags & WANT_READ)
+			FD_SET(cur->fd, r);
+		if (cur->flags & WANT_WRITE)
+			FD_SET(cur->fd, w);
+		if (cur->flags & (WANT_WRITE|WANT_READ)) {
+			ready ++;
+			if (cur->fd > fd_max)
+				fd_max = cur->fd;
+		}
+	}
+	ND("%d session %d waiting", count, ready);
+	return fd_max + 1;
+}
+
+int
+run_sessions(fd_set *r, fd_set *w)
+{
+	struct sess *cur;
+
+	for (cur = all_sessions; cur; cur = cur->next) {
+		int fd = cur->fd;
+		// fprintf(stderr, "%s sess %p\n", __FUNCTION__, cur);
+		if (FD_ISSET(fd, r) || FD_ISSET(fd, w))
+			cur->func(cur, cur->arg);
+	}
+	return 0;
+}
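
The three routines above, together with new_session(), are the whole event-loop API. A minimal handler, sketched under the assumption of an already-connected non-blocking descriptor (echo_handler is invented):

	static int
	echo_handler(struct sess *sess, void *arg)
	{
		char buf[512];
		int n = read(sess->fd, buf, sizeof(buf));

		if (n <= 0) {	/* EOF or error: merge_sessions() reaps us */
			sess->flags = WANT_DELETE;
			return 0;
		}
		write(sess->fd, buf, n);
		return 0;
	}

	/* attach it: new_session(fd, echo_handler, NULL, WANT_READ); */
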
+
+struct sess_buf {
+	int len;	/* allocation length */
+	int used;	/* bytes used */
+	int start;	/* start position for next write */
+	char data[0];
+};
+
+struct sess_buf *
+get_buf(int size, struct sess_buf *old)
+{
+	struct sess_buf *p = old;
+
+	if (!p) {
+		ND("new buffer size %d", size);
+		p = calloc(1, sizeof(*p) + size);
+	} else if (p->len >= size) {
+		return p;
+	} else {
+		ND("calling realloc %p %d", old, size);
+		p = realloc(old, sizeof(*p) + size);
+	}
+	if (!p) {
+		if (old)
+			free(old);
+	} else {
+		p->len = size;
+	}
+	return p;
+}
+
+/*
+ * do a non-blocking read into the buffer, reallocating if space
+ * is needed.
+ */
+static struct sess_buf *
+get_data(int fd, struct sess_buf *buf, int want)
+{
+	int l;
+
+	buf = get_buf(want, buf);
+	if (buf == NULL)
+		return buf;
+	l = read(fd, buf->data + buf->used, want - buf->used);
+	if (l > 0)
+		buf->used += l;
+	return buf;
+}
+
+/*
+ * Handler for a request coming from the control socket.
+ */
+enum sockopt_state {
+	READING = 0, WRITING = 1
+};
+
+struct sockopt_desc {
+	int state;	/* internal state */
+	struct sess_buf *rd;
+	struct sess_buf *wr;
+};
+
+/* header prepended to data in all transactions */
+struct rx_hdr {
+	uint32_t optlen;	/* data len */
+	uint32_t level;		/* or error ? */
+	uint32_t optname;	/* or desired len ? */
+	uint32_t dir;		/* in or out */
+};
+
+/*
+ * Return the number of remaining bytes from the buffer.
+ * The message is int optname; [int optlen; int data]
+ * where the second part is present or not depending on the
+ * message type.
+ */
+int
+get_want(struct sess_buf *rd, struct rx_hdr *r)
+{
+	struct rx_hdr _r;
+	int l = sizeof(_r);
+
+	if (r == NULL)
+		r = &_r;
+	if (!rd || rd->used < l) {
+		ND("short buffer (%d), return %d to bootstrap",
+			rd ? rd->used : -1, l);
+		return l;
+	}
+	bcopy(rd->data, r, l);
+	/* header fields are in network format, convert to host fmt */
+	r->optlen = ntohl(r->optlen);
+	r->level = ntohl(r->level);
+	r->optname = ntohl(r->optname);
+	r->dir = ntohl(r->dir);
+	l += r->optlen;
+	return l;
+}
+
+/*
+ * The sockopt commands are sent in network format (at least the header)
+ */
+int
+sockopt_handler(struct sess *sess, void *arg)
+{
+	struct sockopt_desc *d;
+	int error = 1;
+
+	ND("sess %p arg %p", sess, arg);
+	if (sess->private == NULL)
+		sess->private = calloc(1, sizeof(struct sockopt_desc));
+	d = sess->private;
+	if (d == NULL)
+		goto done;
+	if (sess->flags & WANT_READ) {
+		int l, want, prev;
+		struct rx_hdr r;
+		struct sockopt sopt;
+		struct thread dummy;
+
+		want = get_want(d->rd, &r);
+		prev = d->rd ? d->rd->used : 0;
+		ND("total message size is %d (prev %d)", want, prev);
+
+		d->rd = get_data(sess->fd, d->rd, want);
+		l = d->rd ? d->rd->used : 0;
+		ND("read %d prev %d want %d", l, prev, want);
+		if (l == prev)	/* no data -> error */
+			goto done;
+		want = get_want(d->rd, &r);
+		ND("again, want %d l %d", want, l);
+		if (l < want)	/* must read more data */
+			return 0;
+		sopt.sopt_dir = r.dir;
+		sopt.sopt_level = r.level;
+		sopt.sopt_name = r.optname;
+		sopt.sopt_val =
+			(l <= sizeof(r)) ? NULL : d->rd->data + sizeof(r);
+		sopt.sopt_valsize = r.optlen;
+		sopt.sopt_td = &dummy;
+		ND("dir 0x%x lev %d opt %d optval %p optlen %d",
+			sopt.sopt_dir,
+			sopt.sopt_level,
+			sopt.sopt_name,
+			sopt.sopt_val,
+			(int)sopt.sopt_valsize);
+
+		/* now call the handler */
+		r.level = htonl(ctl_handler(&sopt));
+		ND("handler returns %d", ntohl(r.level));
+		r.optlen = htonl(0);	/* default len */
+		r.dir = htonl(sopt.sopt_dir);
+		/* prepare the buffer for writing */
+		if (d->wr != NULL) {	/* previous write buffer */
+			free(d->wr);
+		}
+		d->wr = d->rd;
+		d->rd = NULL;
+		d->wr->used = sopt.sopt_valsize + sizeof(r);
+		d->wr->start = 0;
+		/* now update the header */
+		if (sopt.sopt_dir == SOPT_GET)
+			r.optlen = htonl(sopt.sopt_valsize);
+
+		bcopy(&r, d->wr->data, sizeof(r));
+
+		sess->flags = WANT_WRITE;
+		return 0;
+	}
+	if (sess->flags & WANT_WRITE) {
+		struct sess_buf *wr = d->wr;
+
+		int l = write(sess->fd, wr->data + wr->start,
+			wr->used - wr->start);
+		ND("written %d bytes out of %d", l,
+			wr->used - wr->start);
+		if (l <= 0) {
+			if (errno == EAGAIN)
+				return 0;
+			goto done;	/* error */
+		}
+		wr->start += l;
+		if (wr->start < wr->used)
+			return 0;
+		// prepare for another rpc
+		sess->flags = WANT_READ;
+		return 0;
+		//goto done;
+	}
+done:
+	ND("closing session");
+	if (d) {
+		if (sess->fd >= 0)
+			close(sess->fd);
+		if (d->rd)
+			free(d->rd);
+		if (d->wr)
+			free(d->wr);
+		d->rd = d->wr = NULL;
+		free(d);	/* private data */
+		sess->flags = WANT_DELETE;
+	}
+	return error;
+}
+
+
+/*
+ * testing code when reading fake packets from socket 5556.
+ * Turns out that ipfw_check_hook() is a lot slower than ipfw_chk()
+ * XXX new ipfw uses ipfw_check_frame or ipfw_check_packet
+ */
+int
+packet_handler(struct sess *sess, void *arg)
+{
+	char fake_buf[2048];
+	struct mbuf dm;
+	int i;
+
+	bzero(&dm, sizeof(dm));
+	dm.m_data = fake_buf + 14;	/* skip mac hdr */
+	dm.m_len = dm.m_pkthdr.len = 128;
+	fake_buf[14] = 0x45;	// ip
+	*(uint16_t *)(fake_buf+16) = htons(64);	// bytes
+	*(uint32_t *)(fake_buf+26) = htonl(0x01020304);	// src
+	*(uint32_t *)(fake_buf+30) = htonl(0x05060708);	// dst
+	{
+#if 0
+		struct ip_fw_args args;
+		bzero(&args, sizeof(args));
+		args.m = &dm;
+		for (i = 0; i < 1000; i++)
+			ipfw_chk(&args);
+#else
+		struct ifnet *ifp = NULL;
+		struct inpcb *inp = NULL;
+		struct mbuf *m = &dm;
+		ND("sess %p arg %p", sess, arg);
+		for (i = 0; i < 1000; i++)
+			ipfw_check_packet(NULL, &m, ifp, PFIL_IN, inp);
+#endif
+	}
+	return 0;
+}
+
+
+/*
+ * This task accepts a new connection and creates a new session.
+ */
+static int
+listener(struct sess *sess, void *arg)
+{
+	int fd;
+
+	ND("sess %p arg %p", sess, arg);
+	fd = accept(sess->fd, NULL, NULL);
+	if (fd < 0)
+		return -1;
+	fcntl(fd, F_SETFL, O_NONBLOCK);
+#ifdef setsockopt	/* make sure we don't redefine it */
+#error cannot compile this
+#endif
+	{
+		int on = 1, ret;
+		ret = setsockopt(fd, IPPROTO_TCP, TCP_NODELAY, &on, sizeof(on));
+		ND("TCP_NODELAY returns %d", ret);
+	}
+	new_session(fd, sess->arg ? sockopt_handler: packet_handler,
+		sess->arg, WANT_READ);
+	return 0;
+}
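
Putting the rx_hdr protocol above together from the client side: a request is a 16-byte big-endian header followed by optlen bytes of payload, and the reply comes back with the same header, its level field carrying the error code from ctl_handler(). A sketch of the smallest possible request, assuming fd is connected to the control socket:

	struct rx_hdr h;

	memset(&h, 0, sizeof(h));
	h.optlen  = htonl(0);			/* no payload */
	h.level   = htonl(IPPROTO_IP);
	h.optname = htonl(IP_FW_FLUSH);
	h.dir     = htonl(SOPT_SET);
	write(fd, &h, sizeof(h));	/* reply header: level = error code */
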
+
+/*
+ * listen on a socket,
+ * return the listen fd or -1 on error.
+ */
+static int
+do_server(const char *addr, int port)
+{
+	int fd = -1, on;
+	struct sockaddr_in server;
+
+	/* open the listen socket */
+	fd = socket(AF_INET, SOCK_STREAM, 0);
+	if (fd < 0) {
+		perror( "socket" );
+		return -1;
+	}
+
+	on = 1;
+#ifdef SO_REUSEADDR
+	if (setsockopt(fd, SOL_SOCKET, SO_REUSEADDR, &on, sizeof(on)) == -1)
+		perror("SO_REUSEADDR failed(non fatal)");
+#endif
+#ifdef SO_REUSEPORT
+	on = 1;
+	if (setsockopt(fd, SOL_SOCKET, SO_REUSEPORT, &on, sizeof(on)) == -1)
+		perror("SO_REUSEPORT failed(non fatal)");
+#endif
+
+	/* fill the server struct */
+	bzero(&server, sizeof(server));
+	server.sin_family = AF_INET;
+	inet_aton(addr, &server.sin_addr);
+	server.sin_port = htons(port);
+
+	/* bind the local address */
+	if (bind(fd, (struct sockaddr*) &server, sizeof(server)) < 0) {
+		perror( "bind" );
+		return -1;
+	}
+	D("+++ listening tcp %s:%d",
+	    inet_ntoa(server.sin_addr), ntohs(server.sin_port));
+
+	/* listen for incoming connection */
+	if (listen(fd, SOCK_QLEN) < 0) {
+		perror( "listen" );
+		return -1;
+	}
+	return fd;
+}
+
+extern int ipfw_module_init(void);
+
+/*
+ * main program for ipfw kernel side when running a userspace emulation:
+ * open a socket on which we receive requests from userland,
+ * another socket for calls from the 'kernel' (simulating packet
+ * arrivals etc), and then periodically run the tick handler.
+ */
+int
+mainloop(int argc, char *argv[])
+{
+	int listen_fd;
+	struct timeval t0;
+	const char *s, *addr = LOCALADDR;
+	int port = IPFW_PORT;
+	int i;
+	int old_ticks;
+	uint64_t callouts = 0, skipped = 0;
+
+	gettimeofday(&t0, NULL);
+	old_ticks = ticks = 0;
+	callout_startup();
+
+	ipfw_module_init();
+
+	/* override the host if set in the environment */
+	s = getenv("IPFW_HOST");
+	if (s)
+		addr = s;
+	s = getenv("IPFW_PORT");
+	if (s && atoi(s) > 0)
+		port = atoi(s);
+	/* start the server */
+	listen_fd = do_server(addr, port);
+	if (listen_fd < 0) {
+		printf("Error starting server\n");
+		return -1;
+	}
+	new_session(listen_fd, listener, (void *)1, WANT_READ);
+
+#ifdef WITH_NETMAP
+	for (i = 1; i < argc; i++) {
+		netmap_add_port(argv[i]);
+	}
+#endif /* WITH_NETMAP */
+
+#if 0	// test code: a telnet on 5556 becomes an infinite source
+	{
+		int net_fd = do_server(addr, port+1);
+		if (net_fd >= 0)
+			new_session(net_fd, listener, NULL, WANT_READ);
+	}
+#endif
+
+	for (;;) {
+		struct timeval now, delta = { 0, tick };
+		int n;
+		fd_set r, w;
+
+		n = set_sessions(&r, &w);
+		select(n, &r, &w, NULL, &delta);
+		run_sessions(&r, &w);
+		gettimeofday(&now, 0);
+		timersub(&now, &t0, &delta);
+		/* compute absolute ticks. */
+		ticks = (delta.tv_sec * hz) + (delta.tv_usec * hz) / 1000000;
+		if (old_ticks != ticks) {
+			callouts++;
+			callout_run();
+			old_ticks = ticks;
+		} else {
+			skipped++;
+		}
+		RD(1, "callouts %lu skipped %lu", (u_long)callouts, (u_long)skipped);
+	}
+	ipfw_destroy();
+	return 0;
+}
diff --git a/example/ipfw/extra/sys/contrib/pf/net/pfvar.h b/example/ipfw/extra/sys/contrib/pf/net/pfvar.h
new file mode 100644
index 0000000..257bbd6
--- /dev/null
+++ b/example/ipfw/extra/sys/contrib/pf/net/pfvar.h
@@ -0,0 +1,27 @@
+/*
+ * replacement for FreeBSD's pfqueue.h
+ */
+#include
+
+#define DIOCSTARTALTQ	_IO  ('D', 42)
+#define DIOCSTOPALTQ	_IO  ('D', 43)
+
+struct pf_altq {
+	TAILQ_ENTRY(pf_altq)	entries;
+	/* ... */
+	u_int32_t	qid;	/* return value */
+
+#define PF_QNAME_SIZE	64
+	char	qname[PF_QNAME_SIZE];	/* queue name */
+
+};
+
+struct pfioc_altq {
+	u_int32_t	action;
+	u_int32_t	ticket;
+	u_int32_t	nr;
+	struct pf_altq	altq;
+};
+
+#define DIOCGETALTQS	_IOWR('D', 47, struct pfioc_altq)
+#define DIOCGETALTQ	_IOWR('D', 48, struct pfioc_altq)
diff --git a/example/ipfw/extra/sys/sys/kernel.h b/example/ipfw/extra/sys/sys/kernel.h
new file mode 100644
index 0000000..f234d4a
--- /dev/null
+++ b/example/ipfw/extra/sys/sys/kernel.h
@@ -0,0 +1,26 @@
+/*
+ * from freebsd's kernel.h
+ */
+#ifndef _SYS_KERNEL_H_
+#define _SYS_KERNEL_H_
+
+#define SYSINIT(a, b, c, d, e)		\
+	int (*sysinit_ ## d)(void *) = (int (*)(void *))(d)
+#define VNET_SYSINIT(a, b, c, d, e)	\
+	SYSINIT(a, b, c, d, e)
+#define SYSUNINIT(a, b, c, d, e)	\
+	int (*sysuninit_ ## d)(void *) = (int (*)(void *))(d)
+#define VNET_SYSUNINIT(a, b, c, d, e)	\
+	SYSUNINIT(a, b, c, d, e)
+
+/*
+ * Some enumerated orders; "ANY" sorts last.
+ */
+enum sysinit_elem_order {
+	SI_ORDER_FIRST		= 0x0000000,	/* first*/
+	SI_ORDER_SECOND		= 0x0000001,	/* second*/
+	SI_ORDER_THIRD		= 0x0000002,	/* third*/
+	SI_ORDER_MIDDLE		= 0x1000000,	/* somewhere in the middle */
+	SI_ORDER_ANY		= 0xfffffff	/* last*/
+};
+#endif
diff --git a/example/ipfw/extra/sys/sys/malloc.h b/example/ipfw/extra/sys/sys/malloc.h
new file mode 100644
index 0000000..9bc64a3
--- /dev/null
+++ b/example/ipfw/extra/sys/sys/malloc.h
@@ -0,0 +1,13 @@
+/*
+ * $Id$
+ * replacement for sys/malloc.h to compile kernel in userspace
+ */
+
+#ifndef _SYS_MALLOC_H_
+#define _SYS_MALLOC_H_
+
+#define M_WAITOK	0x0000	/* can block */
+#define M_NOWAIT	0x0001	/* do not block */
+#define M_ZERO		0x0100	/* bzero the allocation */
+#endif /* _SYS_MALLOC_H_ */
+
diff --git a/example/ipfw/extra/sys/sys/mbuf.h b/example/ipfw/extra/sys/sys/mbuf.h
new file mode 100644
index 0000000..1f3af63
--- /dev/null
+++ b/example/ipfw/extra/sys/sys/mbuf.h
@@ -0,0 +1,383 @@
+/*
+ * Copyright (C) 2012 Luigi Rizzo, Universita` di Pisa
+ *
+ * BSD copyright.
+ *
+ * A simple compatibility interface to map mbufs onto userspace structs
+ */
+
+#ifndef _SYS_MBUF_H_
+#define _SYS_MBUF_H_
+#define VM_UMA_H	// kill this one // maybe not needed
+#define _VM_UMA_H_	// kill this one too
+
+// #include	/* we use free() */
+/* hopefully queue.h is already included by someone else */
+#include <sys/queue.h>
+#ifdef _KERNEL
+
+/* bzero not present on linux, but this should go in glue.h */
+// #define bzero(s, n) memset(s, 0, n)
+
+/*
+ * We implement a very simplified UMA allocator where the backend
+ * is simply malloc, and uma_zone only stores the length of the components.
+ */
+typedef int uma_zone_t;	/* the zone size */
+
+#define uma_zcreate(name, len, _3, _4, _5, _6, _7, _8)	(len)
+typedef int (*uma_init)(void *mem, int size, int flags);
+typedef void (*uma_fini)(void *mem, int size);
+
+
+#define uma_zfree(zone, item)	free(item, M_IPFW)
+#define uma_zalloc(zone, flags)	malloc(zone, M_IPFW, flags)
+#define uma_zdestroy(zone)	do {} while (0)
+
+/*-
+ * Macros for type conversion:
+ * mtod(m, t)	-- Convert mbuf pointer to data pointer of correct type.
+ */
+#define	mtod(m, t)	((t)((m)->m_data))
+
+#endif /* _KERNEL */
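
The degenerate UMA emulation above reduces a "zone" to the object size, with malloc/free (here redirected to kern_malloc/kern_free) as the backend; a small sketch of it in action (demo_zone is invented):

	static void
	demo_zone(void)
	{
		/* the "zone" is just the recorded object size */
		uma_zone_t zone = uma_zcreate("demo", sizeof(struct m_tag),
		    NULL, NULL, NULL, NULL, 0, 0);
		struct m_tag *t = uma_zalloc(zone, M_NOWAIT); /* kern_malloc */

		if (t != NULL)
			uma_zfree(zone, t);	/* kern_free */
	}
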
+
+/*
+ * Packet tag structure (see below for details).
+ */
+struct m_tag {
+	SLIST_ENTRY(m_tag)	m_tag_link;	/* List of packet tags */
+	u_int16_t		m_tag_id;	/* Tag ID */
+	u_int16_t		m_tag_len;	/* Length of data */
+	u_int32_t		m_tag_cookie;	/* ABI/Module ID */
+//	void			(*m_tag_free)(struct m_tag *);
+};
+
+/*
+ * Auxiliary structure to store values from the sk_buf.
+ * Note that we should not alter the sk_buff, and if we do
+ * so make sure to keep the values in sync between the mbuf
+ * and the sk_buff (especially m_len and m_pkthdr.len).
+ */
+
+struct skbuf;
+
+struct mbuf {
+	struct mbuf	*m_next;
+	struct mbuf	*m_nextpkt;
+	void		*m_data;	/* XXX should change to caddr_t */
+	int32_t		m_len;		/* length in this mbuf */
+	int		m_flags;
+	struct {
+		struct ifnet *rcvif;
+		int len;	/* total packet len */
+		SLIST_HEAD (packet_tags, m_tag) tags;
+	} m_pkthdr;
+	struct skbuf	*m_skb;
+	int	__max_m_len;	/* original value */
+
+	/*
+	 * in-stack mbufs point to an external buffer,
+	 * the two variables below contain base and size,
+	 * and have M_STACK set in m_flags.
+	 * Buffers from the heap have __m_extbuf = (char *)m + MSIZE
+	 */
+	void	*__m_extbuf;	/* external buffer base */
+	int	__m_extlen;	/* data in ext buffer */
+	void	(*__m_callback)(struct mbuf *, int);
+	void	*__m_peer;	/* argument attached to the mbuf */
+};
+
+/*
+ * note we also have M_FASTFWD_OURS mapped to M_PROTO1 0x10
+ */
+#ifndef M_SKIP_FIREWALL	/* XXX conflict in FreeBSD */
+#define M_SKIP_FIREWALL	0x01	/* skip firewall processing */
+#else
+#define	M_PROTO3	0x01	// FreeBSD 10 and 11
+#endif /* XXX conflict in FreeBSD */
+
+#define M_BCAST		0x02 /* send/received as link-level broadcast */
+#define M_MCAST		0x04 /* send/received as link-level multicast */
+#define M_PROTO1	0x10
+#define M_PROTO2	0x20
+#define M_FASTFWD_OURS	M_PROTO1
+#define M_IP_NEXTHOP	M_PROTO2
+#define M_STACK		0x1000	/* allocated on the stack */
+
+void m_freem(struct mbuf *m);
+
+#ifdef _KERNEL
+
+/*
+ * m_dup() is used in the TEE case, currently unsupported so we
+ * just return.
+ */
+static __inline struct mbuf *m_dup(struct mbuf *m, int n)
+{
+	(void)m;	/* UNUSED */
+	(void)n;	/* UNUSED */
+	D("unimplemented, expect panic");
+	return NULL;
+}
+
+
+static __inline void
+m_tag_prepend(struct mbuf *m, struct m_tag *t)
+{
+	ND("m %p tag %p", m, t);
+	SLIST_INSERT_HEAD(&m->m_pkthdr.tags, t, m_tag_link);
+}
+
+/*
+ * Unlink a tag from the list of tags associated with an mbuf.
+ */
+static __inline void
+m_tag_unlink(struct mbuf *m, struct m_tag *t)
+{
+
+	SLIST_REMOVE(&m->m_pkthdr.tags, t, m_tag, m_tag_link);
+}
+
+/*
+ * Return the next tag in the list of tags associated with an mbuf.
+ */ +static __inline struct m_tag * +m_tag_next(struct mbuf *m, struct m_tag *t) +{ + D("mbuf %p tag %p", m, t); + return (SLIST_NEXT(t, m_tag_link)); +} + +extern SLIST_HEAD (tags_freelist, m_tag) tags_freelist; +extern int tags_minlen; +extern int tags_freelist_count; + +extern int max_protohdr; /* uipc_mbuf.c - max proto header */ + +/* + * Create an mtag of the given type + */ +static __inline struct m_tag * +m_tag_alloc(uint32_t cookie, int type, int length, int wait) +{ + static int maxlen = 0; + int l = length + sizeof(struct m_tag); + struct m_tag *m = NULL; + + if (l > maxlen) { + D("new maxlen %d (%d)", l, length ); + maxlen = l; + } + if (l <= tags_minlen) { + l = tags_minlen; + m = SLIST_FIRST(&tags_freelist); + } + if (m) { + SLIST_REMOVE_HEAD(&tags_freelist, m_tag_link); + ND("allocate from freelist"); + tags_freelist_count--; + } else { + ND("size %d allocate from malloc", l); + m = malloc(l, 0, M_NOWAIT); + } + if (m) { + bzero(m, l); + m->m_tag_id = type; + m->m_tag_len = length; + m->m_tag_cookie = cookie; + ND("tag %p cookie %d type %d", m, cookie, type); + } + return m; +} + +#define MTAG_ABI_COMPAT 0 /* compatibility ABI */ + +static __inline struct m_tag * +m_tag_get(int type, int length, int wait) +{ + return m_tag_alloc(MTAG_ABI_COMPAT, type, length, wait); +} + +static __inline struct m_tag * +m_tag_first(struct mbuf *m) +{ + struct m_tag *t; + t = SLIST_FIRST(&m->m_pkthdr.tags); + ND("mbuf %p has %p", m, t); + return t; +} + +static __inline void +m_tag_delete(struct mbuf *m, struct m_tag *t) +{ + D("mbuf %p tag %p, ******* unimplemented", m, t); +} + +static __inline struct m_tag * +m_tag_locate(struct mbuf *m, u_int32_t cookie, int x, struct m_tag *t) +{ + struct m_tag *tag; + + ND("search %d %d in mbuf %p at %p", cookie, x, m, t); + if (t) + D("--- XXX ignore non-null t %p", t); + tag = SLIST_FIRST(&m->m_pkthdr.tags); + if (tag == NULL) + return NULL; + + ND("found tag %p cookie %d type %d (want %d %d)", + tag, tag->m_tag_cookie, tag->m_tag_id, cookie, x); + if (tag->m_tag_cookie != cookie || tag->m_tag_id != x) { + ND("want %d %d have %d %d, expect panic", + cookie, x, tag->m_tag_cookie, tag->m_tag_id); + return NULL; + } else + return tag; +} + +static __inline struct m_tag * +m_tag_find(struct mbuf *m, int type, struct m_tag *start) +{ + D("m %p", m); + return (SLIST_EMPTY(&m->m_pkthdr.tags) ? (struct m_tag *)NULL : + m_tag_locate(m, MTAG_ABI_COMPAT, type, start)); +} + +#define M_SETFIB(_m, _fib) /* nothing on linux */ + + +/* m_pullup is not supported, there is a macro in missing.h */ + +#define M_GETFIB(_m) 0 + +/* macro used to create a new mbuf */ +#define MT_DATA 1 /* dynamic (data) allocation */ +#ifndef MSIZE // defined on osx +#define MSIZE 256 /* size of an mbuf */ +#endif +#define MGETHDR(_m, _how, _type) ((_m) = m_gethdr((_how), (_type))) +#define MY_MCLBYTES 2048 /* XXX make slightly less */ + + +extern struct mbuf *mbuf_freelist; + +/* allocate and init a new mbuf using the same structure of FreeBSD */ +/* + * XXX for the userspace version, we actually allocate + * MCLBYTES right after the buffer to store a copy of the packet. 
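+ * A sketch of the resulting layout (derived from the assignments in
+ * m_gethdr() below, not original documentation):
+ *
+ *	m             m+1               m+MSIZE             m+MY_MCLBYTES
+ *	| struct mbuf | inline data ... | external buffer copy ...    |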
+ */ +static __inline struct mbuf * +m_gethdr(int how, short type) +{ + struct mbuf *m; + static const struct mbuf m0; /* zero-initialized */ + + if (mbuf_freelist) { + m = mbuf_freelist; + mbuf_freelist = m->m_next; + *m = m0; + } else { + m = malloc(MY_MCLBYTES, M_IPFW, M_NOWAIT); + } + + ND("new mbuf %p", m); + if (m == NULL) { + panic("mgethdr failed"); + return m; + } + + /* here we have MSIZE - sizeof(struct mbuf) available */ + m->m_data = m + 1; + m->__m_extbuf = (char *)m + MSIZE; + m->__m_extlen = MY_MCLBYTES - MSIZE; + + return m; +} + + +/* + * Arrange to prepend space of size plen to mbuf m. If a new mbuf must be + * allocated, how specifies whether to wait. If the allocation fails, the + * original mbuf chain is freed and m is set to NULL. + */ +static inline void M_PREPEND(struct mbuf *m, int plen, int how) +{ \ + if (plen < 0 || plen + m->m_len > m->__max_m_len) { + D("size too large"); + } else { + m->m_data -= plen; + m->m_len += plen; + } +} + +static inline void +m_adj(struct mbuf *mp, int req_len) +{ + if (req_len < 0 || req_len > mp->m_len) { + D("no m_adj for len %d in mlen %d", req_len, mp->m_len); + } else { + mp->m_data += req_len; + mp->m_len -= req_len; + } +} + +#define M_PREPEND_GOOD(m, plen, how) do { \ + struct mbuf **_mmp = &(m); \ + struct mbuf *_mm = *_mmp; \ + int _mplen = (plen); \ + int __mhow = (how); \ + \ + MBUF_CHECKSLEEP(how); \ + if (M_LEADINGSPACE(_mm) >= _mplen) { \ + _mm->m_data -= _mplen; \ + _mm->m_len += _mplen; \ + } else \ + _mm = m_prepend(_mm, _mplen, __mhow); \ + if (_mm != NULL && _mm->m_flags & M_PKTHDR) \ + _mm->m_pkthdr.len += _mplen; \ + *_mmp = _mm; \ +} while (0) + +/* + * Persistent tags stay with an mbuf until the mbuf is reclaimed. Otherwise + * tags are expected to ``vanish'' when they pass through a network + * interface. For most interfaces this happens normally as the tags are + * reclaimed when the mbuf is free'd. However in some special cases + * reclaiming must be done manually. An example is packets that pass through + * the loopback interface. Also, one must be careful to do this when + * ``turning around'' packets (e.g., icmp_reflect). + * + * To mark a tag persistent bit-or this flag in when defining the tag id. + * The tag will then be treated as described above. + */ +#define MTAG_PERSISTENT 0x800 + +#define PACKET_TAG_NONE 0 /* Nadda */ + +/* Packet tags for use with PACKET_ABI_COMPAT. */ +#define PACKET_TAG_IPSEC_IN_DONE 1 /* IPsec applied, in */ +#define PACKET_TAG_IPSEC_OUT_DONE 2 /* IPsec applied, out */ +#define PACKET_TAG_IPSEC_IN_CRYPTO_DONE 3 /* NIC IPsec crypto done */ +#define PACKET_TAG_IPSEC_OUT_CRYPTO_NEEDED 4 /* NIC IPsec crypto req'ed */ +#define PACKET_TAG_IPSEC_IN_COULD_DO_CRYPTO 5 /* NIC notifies IPsec */ +#define PACKET_TAG_IPSEC_PENDING_TDB 6 /* Reminder to do IPsec */ +#define PACKET_TAG_BRIDGE 7 /* Bridge processing done */ +#define PACKET_TAG_GIF 8 /* GIF processing done */ +#define PACKET_TAG_GRE 9 /* GRE processing done */ +#define PACKET_TAG_IN_PACKET_CHECKSUM 10 /* NIC checksumming done */ +#define PACKET_TAG_ENCAP 11 /* Encap. 
processing */
+#define PACKET_TAG_IPSEC_SOCKET 12 /* IPSEC socket ref */
+#define PACKET_TAG_IPSEC_HISTORY 13 /* IPSEC history */
+#define PACKET_TAG_IPV6_INPUT 14 /* IPV6 input processing */
+#define PACKET_TAG_DUMMYNET 15 /* dummynet info */
+#define PACKET_TAG_DIVERT 17 /* divert info */
+#define PACKET_TAG_IPFORWARD 18 /* ipforward info */
+#define PACKET_TAG_MACLABEL (19 | MTAG_PERSISTENT) /* MAC label */
+#define PACKET_TAG_PF 21 /* PF + ALTQ information */
+#define PACKET_TAG_RTSOCKFAM 25 /* rtsock sa family */
+#define PACKET_TAG_IPOPTIONS 27 /* Saved IP options */
+#define PACKET_TAG_CARP 28 /* CARP info */
+
+#endif /* _KERNEL */
+#endif /* !_SYS_MBUF_H_ */
diff --git a/example/ipfw/extra/sys/sys/module.h b/example/ipfw/extra/sys/sys/module.h
new file mode 100644
index 0000000..310e22b
--- /dev/null
+++ b/example/ipfw/extra/sys/sys/module.h
@@ -0,0 +1,43 @@
+/*
+ * trivial module support
+ */
+#ifndef _SYS_MODULE_H_
+#define _SYS_MODULE_H_
+typedef struct module *module_t;
+typedef int (*modeventhand_t)(module_t, int /* modeventtype_t */, void *);
+
+typedef enum modeventtype {
+ MOD_LOAD,
+ MOD_UNLOAD,
+ MOD_SHUTDOWN,
+ MOD_QUIESCE
+} modeventtype_t;
+
+typedef struct moduledata {
+ const char *name; /* module name */
+ modeventhand_t evhand; /* event handler */
+ void *priv; /* extra data */
+} moduledata_t;
+
+/*
+ * Hook the module descriptor, md, into our list of things to do.
+ * We should in principle respect the order of loading.
+ *
+ * XXX use the gcc .init functions
+ */
+#define DECLARE_MODULE(a, md, c,d) \
+ moduledata_t *moddesc_##a = &md
+
+/*
+ * XXX MODULE_VERSION is defined in linux too
+ */
+#define MODULE_DEPEND(a,b,c,d,e) struct __module_depend
+#if 1 // !defined(__FreeBSD__) // defined( __linux__ ) || defined( _WIN32 )
+#undef MODULE_VERSION
+#define MODULE_VERSION(a,b) struct __module_version
+#endif
+
+#define FEATURE(a, b) struct __feature
+
+#endif /* _SYS_MODULE_H_ */
+
diff --git a/example/ipfw/extra/sys/sys/systm.h b/example/ipfw/extra/sys/sys/systm.h
new file mode 100644
index 0000000..94036c9
--- /dev/null
+++ b/example/ipfw/extra/sys/sys/systm.h
@@ -0,0 +1,159 @@
+#ifndef _SYS_SYSTM_H_
+#define _SYS_SYSTM_H_
+
+#define CALLOUT_ACTIVE 0x0002 /* callout is currently active */
+#define CALLOUT_MPSAFE 0x0008 /* callout handler is mp safe */
+
+#if defined(USERSPACE) // freebsd userspace
+
+#include <sys/queue.h>
+#ifdef __FreeBSD__
+#include
+#endif
+
+/// SLIST_HEAD(callout_list, callout);
+struct callout;
+TAILQ_HEAD(callout_tailq, callout);
+struct callout {
+ union {
+ //SLIST_ENTRY(callout) sle;
+ TAILQ_ENTRY(callout) tqe;
+ } c_links;
+ int c_time; /* ticks to the event */
+ void *c_arg; /* function argument */
+ void (*c_func)(void *); /* function to call */
+ struct lock_object *c_lock; /* lock to handle */
+ int c_flags; /* state of this entry */
+ volatile int c_cpu; /* CPU we're scheduled on */
+
+};
+
+
+int callout_drain(struct callout *c);
+void callout_init(struct callout *c, int safe);
+int callout_reset(struct callout *c, int ticks, void (*fn)(void *), void *arg);
+int callout_reset_on(struct callout *c, int ticks, void (*fn)(void *), void *arg, int cpu);
+
+#else /* linux or windows */
+
+#ifndef _WIN32 /* this is the linux version */
+/* callout support, in <sys/callout.h> on FreeBSD */
+/*
+ * callout support on linux module is done using timers
+ */
+#include <linux/timer.h>
+#ifdef LINUX_24
+#include <linux/sched.h> /* jiffies definition is here in 2.4 */
+#endif
+#define callout timer_list
+static __inline int
+callout_reset_on(struct callout *co, int ticks, void (*fn)(void *), void *arg,
int cpu) +{ + co->expires = jiffies + ticks; + co->function = (void (*)(unsigned long))fn; + co->data = (unsigned long)arg; + /* + * Linux 2.6.31 and above has add_timer_on(co, cpu), + * otherwise add_timer() always schedules a callout on the same + * CPU used the first time, so we don't need more. + */ + add_timer(co); + return 0; +} + +#define callout_init(co, safe) init_timer(co) +#define callout_drain(co) del_timer(co) +#define callout_stop(co) del_timer(co) + +#else /* _WIN32 */ +#include + +/* This is the windows part for callout support */ +struct callout { + KTIMER thetimer; + KDPC timerdpc; + int dpcinitialized; + LARGE_INTEGER duetime; +}; + +void dummynet (void*); +VOID dummynet_dpc( + __in struct _KDPC *Dpc, + __in_opt PVOID DeferredContext, + __in_opt PVOID SystemArgument1, + __in_opt PVOID SystemArgument2 + ); + +VOID ipfw_dpc( + __in struct _KDPC *Dpc, + __in_opt PVOID DeferredContext, + __in_opt PVOID SystemArgument1, + __in_opt PVOID SystemArgument2 + ); + +/* callout_reset must handle two problems: + * - dummynet() scheduler must be run always on the same processor + * because do_gettimeofday() is based on cpu performance counter, and + * _occasionally_ can leap backward in time if we query another cpu. + * typically this won't happen that much, and the cpu will almost always + * be the same even without the affinity restriction, but better to be sure. + * - ipfw_tick() does not have the granularity requirements of dummynet() + * but we need to pass a pointer as argument. + * + * for these reasons, if we are called for dummynet() timer, + * KeInitializeDpc is called only once as it should be, and the thread + * is forced on cpu0 (which is always present), while if we're called + * for ipfw_tick(), we re-initialize the DPC each time, using + * parameter DeferredContext to pass the needed pointer. since this + * timer is called only once a sec, this won't hurt that much. 
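+ *
+ * To summarize the code below: the dummynet timer has its DPC set up
+ * once and bound to a fixed processor via KeSetTargetProcessorDpc(),
+ * while the ipfw_tick timer re-runs KeInitializeDpc() on every reset
+ * so that DeferredContext can carry the current argument pointer.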
+ */ +static __inline int +callout_reset_on(struct callout *co, int ticks, void (*fn)(void *), void *arg, int cpu) +{ + if(fn == &dummynet) + { + if(co->dpcinitialized == 0) + { + KeInitializeDpc(&co->timerdpc, dummynet_dpc, NULL); + KeSetTargetProcessorDpc(&co->timerdpc, cpu); + co->dpcinitialized = 1; + } + } + else + { + KeInitializeDpc(&co->timerdpc, ipfw_dpc, arg); + } + co->duetime.QuadPart = (-ticks)*10000; + KeSetTimer(&co->thetimer, co->duetime, &co->timerdpc); + return 0; +} + +static __inline void +callout_init(struct callout* co, int safe) +{ + printf("%s: initializing timer at %p\n",__FUNCTION__,co); + KeInitializeTimer(&co->thetimer); +} + +static __inline int +callout_drain(struct callout* co) +{ + BOOLEAN canceled = KeCancelTimer(&co->thetimer); + while (canceled != TRUE) + { + canceled = KeCancelTimer(&co->thetimer); + } + printf("%s: stopping timer at %p\n",__FUNCTION__,co); + return 0; +} + +static __inline int +callout_stop(struct callout* co) +{ + return callout_drain(co); +} + +#endif /* _WIN32 */ +#endif /* linux or windows */ + +#endif /* _SYS_SYSTM_H_ */ diff --git a/example/ipfw/extra/sys/sys/taskqueue.h b/example/ipfw/extra/sys/sys/taskqueue.h new file mode 100644 index 0000000..a9f79a0 --- /dev/null +++ b/example/ipfw/extra/sys/sys/taskqueue.h @@ -0,0 +1,51 @@ +#ifndef _SYS_TASKQUEUE_H_ +#define _SYS_TASKQUEUE_H_ + +/* + * Remap taskqueue to direct calls + */ + +#ifdef _WIN32 +struct task { + void (*func)(void*, int); +}; +#define taskqueue_enqueue_fast(tq, ta) (ta)->func(NULL,1) +#define TASK_INIT(a,b,c,d) do { \ + (a)->func = (c); } while (0) +#else +struct task { + void (*func)(void); +}; +#define taskqueue_enqueue_fast(tq, ta) (ta)->func() +#define TASK_INIT(a,b,c,d) do { \ + (a)->func = (void (*)(void))c; } while (0) + + +#endif +typedef void (*taskqueue_enqueue_fn)(void *context); + +// #define taskqueue_create(_a, _b, _c, _d) NULL +struct taskqueue *taskqueue_create_fast(const char *name, int mflags, + taskqueue_enqueue_fn enqueue, + void *context); +void taskqueue_thread_enqueue(void *context); + + +// #define taskqueue_create_fast(_a, _b, _c, _d) NULL +int taskqueue_start_threads(struct taskqueue **tqp, int count, int pri, + const char *name, ...) __printflike(4, 5); + + +// #define taskqueue_drain(_a, _b) /* XXX to be completed */ +// #define taskqueue_free(_a) /* XXX to be completed */ +void taskqueue_drain(struct taskqueue *queue, struct task *task); +void taskqueue_free(struct taskqueue *queue); + + +#define PRI_MIN (0) /* Highest priority. */ +#define PRI_MIN_ITHD (PRI_MIN) +#ifndef __FreeBSD__ +#define PI_NET (PRI_MIN_ITHD + 16) +#endif + +#endif /* !_SYS_TASKQUEUE_H_ */ diff --git a/example/ipfw/ipfw/altq.c b/example/ipfw/ipfw/altq.c new file mode 100644 index 0000000..ba6b639 --- /dev/null +++ b/example/ipfw/ipfw/altq.c @@ -0,0 +1,151 @@ +/* + * Copyright (c) 2002-2003 Luigi Rizzo + * Copyright (c) 1996 Alex Nash, Paul Traina, Poul-Henning Kamp + * Copyright (c) 1994 Ugen J.S.Antsilevich + * + * Idea and grammar partially left from: + * Copyright (c) 1993 Daniel Boulet + * + * Redistribution and use in source forms, with and without modification, + * are permitted provided that this entire comment appears intact. + * + * Redistribution in binary form may occur without any restrictions. + * Obviously, it would be nice if you gave credit where credit is due + * but requiring it would be too onerous. + * + * This software is provided ``AS IS'' without any warranties of any kind. 
+ *
+ * NEW command line interface for IP firewall facility
+ *
+ * $FreeBSD: head/sbin/ipfw/altq.c 270424 2014-08-23 17:37:18Z melifaro $
+ *
+ * altq interface
+ */
+
+#include <sys/types.h>
+#include <sys/socket.h>
+#include <sys/sockio.h>
+
+#include "ipfw2.h"
+
+#include <err.h>
+#include <errno.h>
+#include <fcntl.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+#include <sysexits.h>
+#include <unistd.h>
+
+#include <net/if.h> /* IFNAMSIZ */
+#include <net/pfvar.h>
+#include <netinet/in.h> /* in_addr */
+#include <netinet/ip_fw.h>
+
+/*
+ * Map between current altq queue id numbers and names.
+ */
+static TAILQ_HEAD(, pf_altq) altq_entries =
+ TAILQ_HEAD_INITIALIZER(altq_entries);
+
+void
+altq_set_enabled(int enabled)
+{
+ int pffd;
+
+ pffd = open("/dev/pf", O_RDWR);
+ if (pffd == -1)
+ err(EX_UNAVAILABLE,
+ "altq support opening pf(4) control device");
+ if (enabled) {
+ if (ioctl(pffd, DIOCSTARTALTQ) != 0 && errno != EEXIST)
+ err(EX_UNAVAILABLE, "enabling altq");
+ } else {
+ if (ioctl(pffd, DIOCSTOPALTQ) != 0 && errno != ENOENT)
+ err(EX_UNAVAILABLE, "disabling altq");
+ }
+ close(pffd);
+}
+
+static void
+altq_fetch(void)
+{
+ struct pfioc_altq pfioc;
+ struct pf_altq *altq;
+ int pffd;
+ unsigned int mnr;
+ static int altq_fetched = 0;
+
+ if (altq_fetched)
+ return;
+ altq_fetched = 1;
+ pffd = open("/dev/pf", O_RDONLY);
+ if (pffd == -1) {
+ warn("altq support opening pf(4) control device");
+ return;
+ }
+ bzero(&pfioc, sizeof(pfioc));
+ if (ioctl(pffd, DIOCGETALTQS, &pfioc) != 0) {
+ warn("altq support getting queue list");
+ close(pffd);
+ return;
+ }
+ mnr = pfioc.nr;
+ for (pfioc.nr = 0; pfioc.nr < mnr; pfioc.nr++) {
+ if (ioctl(pffd, DIOCGETALTQ, &pfioc) != 0) {
+ if (errno == EBUSY)
+ break;
+ warn("altq support getting queue list");
+ close(pffd);
+ return;
+ }
+ if (pfioc.altq.qid == 0)
+ continue;
+ altq = safe_calloc(1, sizeof(*altq));
+ *altq = pfioc.altq;
+ TAILQ_INSERT_TAIL(&altq_entries, altq, entries);
+ }
+ close(pffd);
+}
+
+u_int32_t
+altq_name_to_qid(const char *name)
+{
+ struct pf_altq *altq;
+
+ altq_fetch();
+ TAILQ_FOREACH(altq, &altq_entries, entries)
+ if (strcmp(name, altq->qname) == 0)
+ break;
+ if (altq == NULL)
+ errx(EX_DATAERR, "altq has no queue named `%s'", name);
+ return altq->qid;
+}
+
+static const char *
+altq_qid_to_name(u_int32_t qid)
+{
+ struct pf_altq *altq;
+
+ altq_fetch();
+ TAILQ_FOREACH(altq, &altq_entries, entries)
+ if (qid == altq->qid)
+ break;
+ if (altq == NULL)
+ return NULL;
+ return altq->qname;
+}
+
+void
+print_altq_cmd(struct buf_pr *bp, ipfw_insn_altq *altqptr)
+{
+ if (altqptr) {
+ const char *qname;
+
+ qname = altq_qid_to_name(altqptr->qid);
+ if (qname == NULL)
+ bprintf(bp, " altq ?<%u>", altqptr->qid);
+ else
+ bprintf(bp, " altq %s", qname);
+ }
+}
diff --git a/example/ipfw/ipfw/dummynet.c b/example/ipfw/ipfw/dummynet.c
new file mode 100644
index 0000000..1938307
--- /dev/null
+++ b/example/ipfw/ipfw/dummynet.c
@@ -0,0 +1,1410 @@
+/*
+ * Copyright (c) 2002-2003,2010 Luigi Rizzo
+ *
+ * Redistribution and use in source forms, with and without modification,
+ * are permitted provided that this entire comment appears intact.
+ *
+ * Redistribution in binary form may occur without any restrictions.
+ * Obviously, it would be nice if you gave credit where credit is due
+ * but requiring it would be too onerous.
+ *
+ * This software is provided ``AS IS'' without any warranties of any kind.
+ *
+ * $FreeBSD: head/sbin/ipfw/dummynet.c 270424 2014-08-23 17:37:18Z melifaro $
+ *
+ * dummynet support
+ */
+
+#include <sys/types.h>
+#include <sys/socket.h>
+/* XXX there are several sysctl leftover here */
+#include <sys/sysctl.h>
+
+#include "ipfw2.h"
+
+#include <ctype.h>
+#include <err.h>
+#include <errno.h>
+#include <libutil.h>
+#include <netdb.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+#include <sysexits.h>
+
+#include <net/if.h>
+#include <netinet/in.h>
+#include <netinet/ip_fw.h>
+#include <netinet/ip_dummynet.h>
+#include <arpa/inet.h> /* inet_ntoa */
+
+
+static struct _s_x dummynet_params[] = {
+ { "plr", TOK_PLR },
+ { "noerror", TOK_NOERROR },
+ { "buckets", TOK_BUCKETS },
+ { "dst-ip", TOK_DSTIP },
+ { "src-ip", TOK_SRCIP },
+ { "dst-port", TOK_DSTPORT },
+ { "src-port", TOK_SRCPORT },
+ { "proto", TOK_PROTO },
+ { "weight", TOK_WEIGHT },
+ { "lmax", TOK_LMAX },
+ { "maxlen", TOK_LMAX },
+ { "all", TOK_ALL },
+ { "mask", TOK_MASK }, /* alias for both */
+ { "sched_mask", TOK_SCHED_MASK },
+ { "flow_mask", TOK_FLOW_MASK },
+ { "droptail", TOK_DROPTAIL },
+ { "ecn", TOK_ECN },
+ { "red", TOK_RED },
+ { "gred", TOK_GRED },
+ { "bw", TOK_BW },
+ { "bandwidth", TOK_BW },
+ { "delay", TOK_DELAY },
+ { "link", TOK_LINK },
+ { "pipe", TOK_PIPE },
+ { "queue", TOK_QUEUE },
+ { "flowset", TOK_FLOWSET },
+ { "sched", TOK_SCHED },
+ { "pri", TOK_PRI },
+ { "priority", TOK_PRI },
+ { "type", TOK_TYPE },
+ { "flow-id", TOK_FLOWID},
+ { "dst-ipv6", TOK_DSTIP6},
+ { "dst-ip6", TOK_DSTIP6},
+ { "src-ipv6", TOK_SRCIP6},
+ { "src-ip6", TOK_SRCIP6},
+ { "profile", TOK_PROFILE},
+ { "burst", TOK_BURST},
+ { "dummynet-params", TOK_NULL },
+ { NULL, 0 } /* terminator */
+};
+
+#define O_NEXT(p, len) ((void *)((char *)p + len))
+
+static void
+oid_fill(struct dn_id *oid, int len, int type, uintptr_t id)
+{
+ oid->len = len;
+ oid->type = type;
+ oid->subtype = 0;
+ oid->id = id;
+}
+
+/* make room in the buffer and move the pointer forward */
+static void *
+o_next(struct dn_id **o, int len, int type)
+{
+ struct dn_id *ret = *o;
+ oid_fill(ret, len, type, 0);
+ *o = O_NEXT(*o, len);
+ return ret;
+}
+
+#if 0
+static int
+sort_q(void *arg, const void *pa, const void *pb)
+{
+ int rev = (co.do_sort < 0);
+ int field = rev ? -co.do_sort : co.do_sort;
+ long long res = 0;
+ const struct dn_flow_queue *a = pa;
+ const struct dn_flow_queue *b = pb;
+
+ switch (field) {
+ case 1: /* pkts */
+ res = a->len - b->len;
+ break;
+ case 2: /* bytes */
+ res = a->len_bytes - b->len_bytes;
+ break;
+
+ case 3: /* tot pkts */
+ res = a->tot_pkts - b->tot_pkts;
+ break;
+
+ case 4: /* tot bytes */
+ res = a->tot_bytes - b->tot_bytes;
+ break;
+ }
+ if (res < 0)
+ res = -1;
+ if (res > 0)
+ res = 1;
+ return (int)(rev ? res : -res);
+}
+#endif
+
+/* print a mask and header for the subsequent list of flows */
+static void
+print_mask(struct ipfw_flow_id *id)
+{
+ if (!IS_IP6_FLOW_ID(id)) {
+ printf(" "
+ "mask: %s 0x%02x 0x%08x/0x%04x -> 0x%08x/0x%04x\n",
+ id->extra ? "queue," : "",
+ id->proto,
+ id->src_ip, id->src_port,
+ id->dst_ip, id->dst_port);
+ } else {
+ char buf[255];
+ printf("\n mask: %sproto: 0x%02x, flow_id: 0x%08x, ",
+ id->extra ? "queue," : "",
+ id->proto, id->flow_id6);
+ inet_ntop(AF_INET6, &(id->src_ip6), buf, sizeof(buf));
+ printf("%s/0x%04x -> ", buf, id->src_port);
+ inet_ntop(AF_INET6, &(id->dst_ip6), buf, sizeof(buf));
+ printf("%s/0x%04x\n", buf, id->dst_port);
+ }
+}
+
+static void
+print_header(struct ipfw_flow_id *id)
+{
+ if (!IS_IP6_FLOW_ID(id))
+ printf("BKT Prot ___Source IP/port____ "
+ "____Dest. IP/port____ "
+ "Tot_pkt/bytes Pkt/Byte Drp\n");
+ else
+ printf("BKT ___Prot___ _flow-id_ "
+ "______________Source IPv6/port_______________ "
+ "_______________Dest.
IPv6/port_______________ " + "Tot_pkt/bytes Pkt/Byte Drp\n"); +} + +static void +list_flow(struct buf_pr *bp, struct dn_flow *ni) +{ + char buff[255]; + struct protoent *pe = NULL; + struct in_addr ina; + struct ipfw_flow_id *id = &ni->fid; + + pe = getprotobynumber(id->proto); + /* XXX: Should check for IPv4 flows */ + bprintf(bp, "%3u%c", (ni->oid.id) & 0xff, + id->extra ? '*' : ' '); + if (!IS_IP6_FLOW_ID(id)) { + if (pe) + bprintf(bp, "%-4s ", pe->p_name); + else + bprintf(bp, "%4u ", id->proto); + ina.s_addr = htonl(id->src_ip); + bprintf(bp, "%15s/%-5d ", + inet_ntoa(ina), id->src_port); + ina.s_addr = htonl(id->dst_ip); + bprintf(bp, "%15s/%-5d ", + inet_ntoa(ina), id->dst_port); + } else { + /* Print IPv6 flows */ + if (pe != NULL) + bprintf(bp, "%9s ", pe->p_name); + else + bprintf(bp, "%9u ", id->proto); + bprintf(bp, "%7d %39s/%-5d ", id->flow_id6, + inet_ntop(AF_INET6, &(id->src_ip6), buff, sizeof(buff)), + id->src_port); + bprintf(bp, " %39s/%-5d ", + inet_ntop(AF_INET6, &(id->dst_ip6), buff, sizeof(buff)), + id->dst_port); + } + pr_u64(bp, &ni->tot_pkts, 4); + pr_u64(bp, &ni->tot_bytes, 8); + bprintf(bp, "%2u %4u %3u", + ni->length, ni->len_bytes, ni->drops); +} + +static void +print_flowset_parms(struct dn_fs *fs, char *prefix) +{ + int l; + char qs[30]; + char plr[30]; + char red[90]; /* Display RED parameters */ + + l = fs->qsize; + if (fs->flags & DN_QSIZE_BYTES) { + if (l >= 8192) + sprintf(qs, "%d KB", l / 1024); + else + sprintf(qs, "%d B", l); + } else + sprintf(qs, "%3d sl.", l); + if (fs->plr) + sprintf(plr, "plr %f", 1.0 * fs->plr / (double)(0x7fffffff)); + else + plr[0] = '\0'; + + if (fs->flags & DN_IS_RED) { /* RED parameters */ + sprintf(red, + "\n\t %cRED w_q %f min_th %d max_th %d max_p %f", + (fs->flags & DN_IS_GENTLE_RED) ? 'G' : ' ', + 1.0 * fs->w_q / (double)(1 << SCALE_RED), + fs->min_th, + fs->max_th, + 1.0 * fs->max_p / (double)(1 << SCALE_RED)); + if (fs->flags & DN_IS_ECN) + strncat(red, " (ecn)", 6); + } else + sprintf(red, "droptail"); + + if (prefix[0]) { + printf("%s %s%s %d queues (%d buckets) %s\n", + prefix, qs, plr, fs->oid.id, fs->buckets, red); + prefix[0] = '\0'; + } else { + printf("q%05d %s%s %d flows (%d buckets) sched %d " + "weight %d lmax %d pri %d %s\n", + fs->fs_nr, qs, plr, fs->oid.id, fs->buckets, + fs->sched_nr, fs->par[0], fs->par[1], fs->par[2], red); + if (fs->flags & DN_HAVE_MASK) + print_mask(&fs->flow_mask); + } +} + +static void +print_extra_delay_parms(struct dn_profile *p) +{ + double loss; + if (p->samples_no <= 0) + return; + + loss = p->loss_level; + loss /= p->samples_no; + printf("\t profile: name \"%s\" loss %f samples %d\n", + p->name, loss, p->samples_no); +} + +static void +flush_buf(char *buf) +{ + if (buf[0]) + printf("%s\n", buf); + buf[0] = '\0'; +} + +/* + * generic list routine. We expect objects in a specific order, i.e. + * PIPES AND SCHEDULERS: + * link; scheduler; internal flowset if any; instances + * we can tell a pipe from the number. 
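+ * (so a pipe listing typically arrives as DN_LINK, then DN_SCH, an
+ * optional internal DN_FS, then a run of DN_FLOW instances -- the
+ * cases handled in the switch below; a sketch, not original text)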
+ * + * FLOWSETS: + * flowset; queues; + * link i (int queue); scheduler i; si(i) { flowsets() : queues } + */ +static void +list_pipes(struct dn_id *oid, struct dn_id *end) +{ + char buf[160]; /* pending buffer */ + int toPrint = 1; /* print header */ + struct buf_pr bp; + + buf[0] = '\0'; + bp_alloc(&bp, 4096); + for (; oid != end; oid = O_NEXT(oid, oid->len)) { + if (oid->len < sizeof(*oid)) + errx(1, "invalid oid len %d\n", oid->len); + + switch (oid->type) { + default: + flush_buf(buf); + printf("unrecognized object %d size %d\n", oid->type, oid->len); + break; + case DN_TEXT: /* list of attached flowsets */ + { + int i, l; + struct { + struct dn_id id; + uint32_t p[0]; + } *d = (void *)oid; + l = (oid->len - sizeof(*oid))/sizeof(d->p[0]); + if (l == 0) + break; + printf(" Children flowsets: "); + for (i = 0; i < l; i++) + printf("%u ", d->p[i]); + printf("\n"); + break; + } + case DN_CMD_GET: + if (co.verbose) + printf("answer for cmd %d, len %d\n", oid->type, oid->id); + break; + case DN_SCH: { + struct dn_sch *s = (struct dn_sch *)oid; + flush_buf(buf); + printf(" sched %d type %s flags 0x%x %d buckets %d active\n", + s->sched_nr, + s->name, s->flags, s->buckets, s->oid.id); + if (s->flags & DN_HAVE_MASK) + print_mask(&s->sched_mask); + } + break; + + case DN_FLOW: + if (toPrint != 0) { + print_header(&((struct dn_flow *)oid)->fid); + toPrint = 0; + } + list_flow(&bp, (struct dn_flow *)oid); + printf("%s\n", bp.buf); + break; + + case DN_LINK: { + struct dn_link *p = (struct dn_link *)oid; + double b = p->bandwidth; + char bwbuf[30]; + char burst[5 + 7]; + + /* This starts a new object so flush buffer */ + flush_buf(buf); + /* data rate */ + if (b == 0) + sprintf(bwbuf, "unlimited "); + else if (b >= 1000000) + sprintf(bwbuf, "%7.3f Mbit/s", b/1000000); + else if (b >= 1000) + sprintf(bwbuf, "%7.3f Kbit/s", b/1000); + else + sprintf(bwbuf, "%7.3f bit/s ", b); + + if (humanize_number(burst, sizeof(burst), p->burst, + "", HN_AUTOSCALE, 0) < 0 || co.verbose) + sprintf(burst, "%d", (int)p->burst); + sprintf(buf, "%05d: %s %4d ms burst %s", + p->link_nr % DN_MAX_ID, bwbuf, p->delay, burst); + } + break; + + case DN_FS: + print_flowset_parms((struct dn_fs *)oid, buf); + break; + case DN_PROFILE: + flush_buf(buf); + print_extra_delay_parms((struct dn_profile *)oid); + } + flush_buf(buf); // XXX does it really go here ? + } + + bp_free(&bp); +} + +/* + * Delete pipe, queue or scheduler i + */ +int +ipfw_delete_pipe(int do_pipe, int i) +{ + struct { + struct dn_id oid; + uintptr_t a[1]; /* add more if we want a list */ + } cmd; + oid_fill((void *)&cmd, sizeof(cmd), DN_CMD_DELETE, DN_API_VERSION); + cmd.oid.subtype = (do_pipe == 1) ? DN_LINK : + ( (do_pipe == 2) ? DN_FS : DN_SCH); + cmd.a[0] = i; + i = do_cmd(IP_DUMMYNET3, &cmd, cmd.oid.len); + if (i) { + i = 1; + warn("rule %u: setsockopt(IP_DUMMYNET_DEL)", i); + } + return i; +} + +/* + * Code to parse delay profiles. + * + * Some link types introduce extra delays in the transmission + * of a packet, e.g. because of MAC level framing, contention on + * the use of the channel, MAC level retransmissions and so on. + * From our point of view, the channel is effectively unavailable + * for this extra time, which is constant or variable depending + * on the link type. Additionally, packets may be dropped after this + * time (e.g. on a wireless link after too many retransmissions). + * We can model the additional delay with an empirical curve + * that represents its distribution. 
+ *
+ * cumulative probability
+ *  1.0 ^
+ *      |
+ * L    +-- loss-level          x
+ *      |                 ******
+ *      |                *
+ *      |           *****
+ *      |          *
+ *      |        **
+ *      |       *
+ *      +-------*------------------->
+ *                      delay
+ *
+ * The empirical curve may have both vertical and horizontal lines.
+ * Vertical lines represent constant delay for a range of
+ * probabilities; horizontal lines correspond to a discontinuity
+ * in the delay distribution: the link will use the largest delay
+ * for a given probability.
+ *
+ * To pass the curve to dummynet, we must store the parameters
+ * in a file as described below, and issue the command
+ *
+ * ipfw pipe config ... bw XXX profile ...
+ *
+ * The file format is the following, with whitespace acting as
+ * a separator and '#' indicating the beginning of a comment:
+ *
+ * samples N
+ * the number of samples used in the internal
+ * representation (2..1024; default 100);
+ *
+ * loss-level L
+ * The probability above which packets are lost.
+ * (0.0 <= L <= 1.0, default 1.0 i.e. no loss);
+ *
+ * name identifier
+ * Optionally, a name (listed by "ipfw pipe show")
+ * to identify the distribution;
+ *
+ * "delay prob" | "prob delay"
+ * One of these two lines is mandatory and defines
+ * the format of the following lines with data points.
+ *
+ * XXX YYY
+ * 2 or more lines representing points in the curve,
+ * with either delay or probability first, according
+ * to the chosen format.
+ * The unit for delay is milliseconds.
+ *
+ * Data points do not need to be ordered, and their number need not
+ * match the one specified in the "samples" line. ipfw will sort and
+ * interpolate the curve as needed.
+ *
+ * Example of a profile file:
+
+ name bla_bla_bla
+ samples 100
+ loss-level 0.86
+ prob delay
+ 0 200 # minimum overhead is 200ms
+ 0.5 200
+ 0.5 300
+ 0.8 1000
+ 0.9 1300
+ 1 1300
+
+ * Internally, we will convert the curve to a fixed number of
+ * samples, and when it is time to transmit a packet we will
+ * model the extra delay as extra bits in the packet.
+ *
+ */
+
+#define ED_MAX_LINE_LEN 256+ED_MAX_NAME_LEN
+#define ED_TOK_SAMPLES "samples"
+#define ED_TOK_LOSS "loss-level"
+#define ED_TOK_NAME "name"
+#define ED_TOK_DELAY "delay"
+#define ED_TOK_PROB "prob"
+#define ED_TOK_BW "bw"
+#define ED_SEPARATORS " \t\n"
+#define ED_MIN_SAMPLES_NO 2
+
+/*
+ * returns 1 if s is a non-negative number, with at most one '.'
+ */
+static int
+is_valid_number(const char *s)
+{
+ int i, dots_found = 0;
+ int len = strlen(s);
+
+ for (i = 0; i<len; i++)
+ if ((s[i] < '0' || s[i] > '9') &&
+ (s[i] != '.' || ++dots_found > 1))
+ return 0;
+ return 1;
+}
+
+/*
+ * Take as input a string describing a bandwidth value
+ * and return the numeric bandwidth value.
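+ * For instance (an illustrative reading of the parsing below, not
+ * original documentation): "300" means 300 bit/s, "30k" 30000 bit/s,
+ * "5Mbit/s" 5000000 bit/s, and "2MBytes" 16000000 bit/s (a bytes
+ * suffix multiplies by 8); a value starting with a lowercase letter
+ * is taken as an interface name instead.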
+ * set clocking interface or bandwidth value + */ +static void +read_bandwidth(char *arg, int *bandwidth, char *if_name, int namelen) +{ + if (*bandwidth != -1) + warnx("duplicate token, override bandwidth value!"); + + if (arg[0] >= 'a' && arg[0] <= 'z') { + if (!if_name) { + errx(1, "no if support"); + } + if (namelen >= IFNAMSIZ) + warn("interface name truncated"); + namelen--; + /* interface name */ + strncpy(if_name, arg, namelen); + if_name[namelen] = '\0'; + *bandwidth = 0; + } else { /* read bandwidth value */ + int bw; + char *end = NULL; + + bw = strtoul(arg, &end, 0); + if (*end == 'K' || *end == 'k') { + end++; + bw *= 1000; + } else if (*end == 'M' || *end == 'm') { + end++; + bw *= 1000000; + } + if ((*end == 'B' && + _substrcmp2(end, "Bi", "Bit/s") != 0) || + _substrcmp2(end, "by", "bytes") == 0) + bw *= 8; + + if (bw < 0) + errx(EX_DATAERR, "bandwidth too large"); + + *bandwidth = bw; + if (if_name) + if_name[0] = '\0'; + } +} + +struct point { + double prob; + double delay; +}; + +static int +compare_points(const void *vp1, const void *vp2) +{ + const struct point *p1 = vp1; + const struct point *p2 = vp2; + double res = 0; + + res = p1->prob - p2->prob; + if (res == 0) + res = p1->delay - p2->delay; + if (res < 0) + return -1; + else if (res > 0) + return 1; + else + return 0; +} + +#define ED_EFMT(s) EX_DATAERR,"error in %s at line %d: "#s,filename,lineno + +static void +load_extra_delays(const char *filename, struct dn_profile *p, + struct dn_link *link) +{ + char line[ED_MAX_LINE_LEN]; + FILE *f; + int lineno = 0; + int i; + + int samples = -1; + double loss = -1.0; + char profile_name[ED_MAX_NAME_LEN]; + int delay_first = -1; + int do_points = 0; + struct point points[ED_MAX_SAMPLES_NO]; + int points_no = 0; + + /* XXX link never NULL? 
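+ * (apparently never in this program: the only caller, ipfw_config_pipe(),
+ * guards the TOK_PROFILE case with NEED(p, "profile") before calling us)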
 */
+ p->link_nr = link->link_nr;
+
+ profile_name[0] = '\0';
+ f = fopen(filename, "r");
+ if (f == NULL)
+ err(EX_UNAVAILABLE, "fopen: %s", filename);
+
+ while (fgets(line, ED_MAX_LINE_LEN, f)) { /* read commands */
+ char *s, *cur = line, *name = NULL, *arg = NULL;
+
+ ++lineno;
+
+ /* parse the line */
+ while (cur) {
+ s = strsep(&cur, ED_SEPARATORS);
+ if (s == NULL || *s == '#')
+ break;
+ if (*s == '\0')
+ continue;
+ if (arg)
+ errx(ED_EFMT("too many arguments"));
+ if (name == NULL)
+ name = s;
+ else
+ arg = s;
+ }
+ if (name == NULL) /* empty line */
+ continue;
+ if (arg == NULL)
+ errx(ED_EFMT("missing arg for %s"), name);
+
+ if (!strcasecmp(name, ED_TOK_SAMPLES)) {
+ if (samples > 0)
+ errx(ED_EFMT("duplicate ``samples'' line"));
+ if (atoi(arg) <=0)
+ errx(ED_EFMT("invalid number of samples"));
+ samples = atoi(arg);
+ if (samples>ED_MAX_SAMPLES_NO)
+ errx(ED_EFMT("too many samples, maximum is %d"),
+ ED_MAX_SAMPLES_NO);
+ do_points = 0;
+ } else if (!strcasecmp(name, ED_TOK_BW)) {
+ char buf[IFNAMSIZ];
+ read_bandwidth(arg, &link->bandwidth, buf, sizeof(buf));
+ } else if (!strcasecmp(name, ED_TOK_LOSS)) {
+ if (loss != -1.0)
+ errx(ED_EFMT("duplicated token: %s"), name);
+ if (!is_valid_number(arg))
+ errx(ED_EFMT("invalid %s"), arg);
+ loss = atof(arg);
+ if (loss > 1)
+ errx(ED_EFMT("%s greater than 1.0"), name);
+ do_points = 0;
+ } else if (!strcasecmp(name, ED_TOK_NAME)) {
+ if (profile_name[0] != '\0')
+ errx(ED_EFMT("duplicated token: %s"), name);
+ strncpy(profile_name, arg, sizeof(profile_name) - 1);
+ profile_name[sizeof(profile_name)-1] = '\0';
+ do_points = 0;
+ } else if (!strcasecmp(name, ED_TOK_DELAY)) {
+ if (do_points)
+ errx(ED_EFMT("duplicated token: %s"), name);
+ delay_first = 1;
+ do_points = 1;
+ } else if (!strcasecmp(name, ED_TOK_PROB)) {
+ if (do_points)
+ errx(ED_EFMT("duplicated token: %s"), name);
+ delay_first = 0;
+ do_points = 1;
+ } else if (do_points) {
+ if (!is_valid_number(name) || !is_valid_number(arg))
+ errx(ED_EFMT("invalid point found"));
+ if (delay_first) {
+ points[points_no].delay = atof(name);
+ points[points_no].prob = atof(arg);
+ } else {
+ points[points_no].delay = atof(arg);
+ points[points_no].prob = atof(name);
+ }
+ if (points[points_no].prob > 1.0)
+ errx(ED_EFMT("probability greater than 1.0"));
+ ++points_no;
+ } else {
+ errx(ED_EFMT("unrecognised command '%s'"), name);
+ }
+ }
+
+ fclose (f);
+
+ if (samples == -1) {
+ warnx("'%s' not found, assuming 100", ED_TOK_SAMPLES);
+ samples = 100;
+ }
+
+ if (loss == -1.0) {
+ warnx("'%s' not found, assuming no loss", ED_TOK_LOSS);
+ loss = 1;
+ }
+
+ /* make sure that there are enough points. */
+ if (points_no < ED_MIN_SAMPLES_NO)
+ errx(ED_EFMT("too few samples, need at least %d"),
+ ED_MIN_SAMPLES_NO);
+
+ qsort(points, points_no, sizeof(struct point), compare_points);
+
+ /* interpolation */
+ for (i = 0; i<points_no-1; i++) {
+ double y1 = points[i].prob * samples;
+ double x1 = points[i].delay;
+ double y2 = points[i+1].prob * samples;
+ double x2 = points[i+1].delay;
+
+ int ix = y1;
+ int stop = y2;
+
+ if (x1 == x2) {
+ for (; ix<stop; ix++)
+ p->samples[ix] = x1;
+ } else {
+ double m = (y2-y1)/(x2-x1);
+ double c = y1 - m*x1;
+ for (; ix<stop; ix++)
+ p->samples[ix] = (ix - c)/m;
+ }
+ }
+ p->samples_no = samples;
+ p->loss_level = loss * samples;
+ strncpy(p->name, profile_name, sizeof(p->name));
+}
+
+/*
+ * configuration of pipes, schedulers, flowsets.
+ * When we configure a new scheduler, an empty pipe is created, so:
+ *
+ * do_pipe = 1 -> "pipe N config ..." only for backward compatibility
+ * sched N+Delta type fifo sched_mask ...
+ * pipe N+Delta
+ * flowset N+Delta pipe N+Delta (no parameters)
+ * sched N type wf2q+ sched_mask ...
+ * pipe N + * + * do_pipe = 2 -> flowset N config + * flowset N parameters + * + * do_pipe = 3 -> sched N config + * sched N parameters (default no pipe) + * optional Pipe N config ... + * pipe ==> + */ +void +ipfw_config_pipe(int ac, char **av) +{ + int i; + u_int j; + char *end; + struct dn_id *buf, *base; + struct dn_sch *sch = NULL; + struct dn_link *p = NULL; + struct dn_fs *fs = NULL; + struct dn_profile *pf = NULL; + struct ipfw_flow_id *mask = NULL; + int lmax; + uint32_t _foo = 0, *flags = &_foo , *buckets = &_foo; + + /* + * allocate space for 1 header, + * 1 scheduler, 1 link, 1 flowset, 1 profile + */ + lmax = sizeof(struct dn_id); /* command header */ + lmax += sizeof(struct dn_sch) + sizeof(struct dn_link) + + sizeof(struct dn_fs) + sizeof(struct dn_profile); + + av++; ac--; + /* Pipe number */ + if (ac && isdigit(**av)) { + i = atoi(*av); av++; ac--; + } else + i = -1; + if (i <= 0) + errx(EX_USAGE, "need a pipe/flowset/sched number"); + base = buf = safe_calloc(1, lmax); + /* all commands start with a 'CONFIGURE' and a version */ + o_next(&buf, sizeof(struct dn_id), DN_CMD_CONFIG); + base->id = DN_API_VERSION; + + switch (co.do_pipe) { + case 1: /* "pipe N config ..." */ + /* Allocate space for the WF2Q+ scheduler, its link + * and the FIFO flowset. Set the number, but leave + * the scheduler subtype and other parameters to 0 + * so the kernel will use appropriate defaults. + * XXX todo: add a flag to record if a parameter + * is actually configured. + * If we do a 'pipe config' mask -> sched_mask. + * The FIFO scheduler and link are derived from the + * WF2Q+ one in the kernel. + */ + sch = o_next(&buf, sizeof(*sch), DN_SCH); + p = o_next(&buf, sizeof(*p), DN_LINK); + fs = o_next(&buf, sizeof(*fs), DN_FS); + + sch->sched_nr = i; + sch->oid.subtype = 0; /* defaults to WF2Q+ */ + mask = &sch->sched_mask; + flags = &sch->flags; + buckets = &sch->buckets; + *flags |= DN_PIPE_CMD; + + p->link_nr = i; + + /* This flowset is only for the FIFO scheduler */ + fs->fs_nr = i + 2*DN_MAX_ID; + fs->sched_nr = i + DN_MAX_ID; + break; + + case 2: /* "queue N config ... " */ + fs = o_next(&buf, sizeof(*fs), DN_FS); + fs->fs_nr = i; + mask = &fs->flow_mask; + flags = &fs->flags; + buckets = &fs->buckets; + break; + + case 3: /* "sched N config ..." */ + sch = o_next(&buf, sizeof(*sch), DN_SCH); + fs = o_next(&buf, sizeof(*fs), DN_FS); + sch->sched_nr = i; + mask = &sch->sched_mask; + flags = &sch->flags; + buckets = &sch->buckets; + /* fs is used only with !MULTIQUEUE schedulers */ + fs->fs_nr = i + DN_MAX_ID; + fs->sched_nr = i; + break; + } + /* set to -1 those fields for which we want to reuse existing + * values from the kernel. + * Also, *_nr and subtype = 0 mean reuse the value from the kernel. + * XXX todo: support reuse of the mask. 
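+ * For example (an illustrative reading of the code, not part of the
+ * original comment): "sched 2 config pri 5" only overwrites par[2],
+ * leaving par[0] (weight) and par[1] (lmax) at -1 so the kernel
+ * keeps their current values.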
+ */ + if (p) + p->bandwidth = -1; + for (j = 0; j < sizeof(fs->par)/sizeof(fs->par[0]); j++) + fs->par[j] = -1; + while (ac > 0) { + double d; + int tok = match_token(dummynet_params, *av); + ac--; av++; + + switch(tok) { + case TOK_NOERROR: + NEED(fs, "noerror is only for pipes"); + fs->flags |= DN_NOERROR; + break; + + case TOK_PLR: + NEED(fs, "plr is only for pipes"); + NEED1("plr needs argument 0..1\n"); + d = strtod(av[0], NULL); + if (d > 1) + d = 1; + else if (d < 0) + d = 0; + fs->plr = (int)(d*0x7fffffff); + ac--; av++; + break; + + case TOK_QUEUE: + NEED(fs, "queue is only for pipes or flowsets"); + NEED1("queue needs queue size\n"); + end = NULL; + fs->qsize = strtoul(av[0], &end, 0); + if (*end == 'K' || *end == 'k') { + fs->flags |= DN_QSIZE_BYTES; + fs->qsize *= 1024; + } else if (*end == 'B' || + _substrcmp2(end, "by", "bytes") == 0) { + fs->flags |= DN_QSIZE_BYTES; + } + ac--; av++; + break; + + case TOK_BUCKETS: + NEED(fs, "buckets is only for pipes or flowsets"); + NEED1("buckets needs argument\n"); + *buckets = strtoul(av[0], NULL, 0); + ac--; av++; + break; + + case TOK_FLOW_MASK: + case TOK_SCHED_MASK: + case TOK_MASK: + NEED(mask, "tok_mask"); + NEED1("mask needs mask specifier\n"); + /* + * per-flow queue, mask is dst_ip, dst_port, + * src_ip, src_port, proto measured in bits + */ + + bzero(mask, sizeof(*mask)); + end = NULL; + + while (ac >= 1) { + uint32_t *p32 = NULL; + uint16_t *p16 = NULL; + uint32_t *p20 = NULL; + struct in6_addr *pa6 = NULL; + uint32_t a; + + tok = match_token(dummynet_params, *av); + ac--; av++; + switch(tok) { + case TOK_ALL: + /* + * special case, all bits significant + * except 'extra' (the queue number) + */ + mask->dst_ip = ~0; + mask->src_ip = ~0; + mask->dst_port = ~0; + mask->src_port = ~0; + mask->proto = ~0; + n2mask(&mask->dst_ip6, 128); + n2mask(&mask->src_ip6, 128); + mask->flow_id6 = ~0; + *flags |= DN_HAVE_MASK; + goto end_mask; + + case TOK_QUEUE: + mask->extra = ~0; + *flags |= DN_HAVE_MASK; + goto end_mask; + + case TOK_DSTIP: + mask->addr_type = 4; + p32 = &mask->dst_ip; + break; + + case TOK_SRCIP: + mask->addr_type = 4; + p32 = &mask->src_ip; + break; + + case TOK_DSTIP6: + mask->addr_type = 6; + pa6 = &mask->dst_ip6; + break; + + case TOK_SRCIP6: + mask->addr_type = 6; + pa6 = &mask->src_ip6; + break; + + case TOK_FLOWID: + mask->addr_type = 6; + p20 = &mask->flow_id6; + break; + + case TOK_DSTPORT: + p16 = &mask->dst_port; + break; + + case TOK_SRCPORT: + p16 = &mask->src_port; + break; + + case TOK_PROTO: + break; + + default: + ac++; av--; /* backtrack */ + goto end_mask; + } + if (ac < 1) + errx(EX_USAGE, "mask: value missing"); + if (*av[0] == '/') { + a = strtoul(av[0]+1, &end, 0); + if (pa6 == NULL) + a = (a == 32) ? 
~0 : (1 << a) - 1; + } else + a = strtoul(av[0], &end, 0); + if (p32 != NULL) + *p32 = a; + else if (p16 != NULL) { + if (a > 0xFFFF) + errx(EX_DATAERR, + "port mask must be 16 bit"); + *p16 = (uint16_t)a; + } else if (p20 != NULL) { + if (a > 0xfffff) + errx(EX_DATAERR, + "flow_id mask must be 20 bit"); + *p20 = (uint32_t)a; + } else if (pa6 != NULL) { + if (a > 128) + errx(EX_DATAERR, + "in6addr invalid mask len"); + else + n2mask(pa6, a); + } else { + if (a > 0xFF) + errx(EX_DATAERR, + "proto mask must be 8 bit"); + mask->proto = (uint8_t)a; + } + if (a != 0) + *flags |= DN_HAVE_MASK; + ac--; av++; + } /* end while, config masks */ +end_mask: + break; + + case TOK_RED: + case TOK_GRED: + NEED1("red/gred needs w_q/min_th/max_th/max_p\n"); + fs->flags |= DN_IS_RED; + if (tok == TOK_GRED) + fs->flags |= DN_IS_GENTLE_RED; + /* + * the format for parameters is w_q/min_th/max_th/max_p + */ + if ((end = strsep(&av[0], "/"))) { + double w_q = strtod(end, NULL); + if (w_q > 1 || w_q <= 0) + errx(EX_DATAERR, "0 < w_q <= 1"); + fs->w_q = (int) (w_q * (1 << SCALE_RED)); + } + if ((end = strsep(&av[0], "/"))) { + fs->min_th = strtoul(end, &end, 0); + if (*end == 'K' || *end == 'k') + fs->min_th *= 1024; + } + if ((end = strsep(&av[0], "/"))) { + fs->max_th = strtoul(end, &end, 0); + if (*end == 'K' || *end == 'k') + fs->max_th *= 1024; + } + if ((end = strsep(&av[0], "/"))) { + double max_p = strtod(end, NULL); + if (max_p > 1 || max_p < 0) + errx(EX_DATAERR, "0 <= max_p <= 1"); + fs->max_p = (int)(max_p * (1 << SCALE_RED)); + } + ac--; av++; + break; + + case TOK_ECN: + fs->flags |= DN_IS_ECN; + break; + + case TOK_DROPTAIL: + NEED(fs, "droptail is only for flowsets"); + fs->flags &= ~(DN_IS_RED|DN_IS_GENTLE_RED); + break; + + case TOK_BW: + NEED(p, "bw is only for links"); + NEED1("bw needs bandwidth or interface\n"); + read_bandwidth(av[0], &p->bandwidth, NULL, 0); + ac--; av++; + break; + + case TOK_DELAY: + NEED(p, "delay is only for links"); + NEED1("delay needs argument 0..10000ms\n"); + p->delay = strtoul(av[0], NULL, 0); + ac--; av++; + break; + + case TOK_TYPE: { + int l; + NEED(sch, "type is only for schedulers"); + NEED1("type needs a string"); + l = strlen(av[0]); + if (l == 0 || l > 15) + errx(1, "type %s too long\n", av[0]); + strcpy(sch->name, av[0]); + sch->oid.subtype = 0; /* use string */ + ac--; av++; + break; + } + + case TOK_WEIGHT: + NEED(fs, "weight is only for flowsets"); + NEED1("weight needs argument\n"); + fs->par[0] = strtol(av[0], &end, 0); + ac--; av++; + break; + + case TOK_LMAX: + NEED(fs, "lmax is only for flowsets"); + NEED1("lmax needs argument\n"); + fs->par[1] = strtol(av[0], &end, 0); + ac--; av++; + break; + + case TOK_PRI: + NEED(fs, "priority is only for flowsets"); + NEED1("priority needs argument\n"); + fs->par[2] = strtol(av[0], &end, 0); + ac--; av++; + break; + + case TOK_SCHED: + case TOK_PIPE: + NEED(fs, "pipe/sched"); + NEED1("pipe/link/sched needs number\n"); + fs->sched_nr = strtoul(av[0], &end, 0); + ac--; av++; + break; + + case TOK_PROFILE: + NEED((!pf), "profile already set"); + NEED(p, "profile"); + { + NEED1("extra delay needs the file name\n"); + pf = o_next(&buf, sizeof(*pf), DN_PROFILE); + load_extra_delays(av[0], pf, p); //XXX can't fail? 
+ --ac; ++av; + } + break; + + case TOK_BURST: + NEED(p, "burst"); + NEED1("burst needs argument\n"); + errno = 0; + if (expand_number(av[0], &p->burst) < 0) + if (errno != ERANGE) + errx(EX_DATAERR, + "burst: invalid argument"); + if (errno || p->burst > (1ULL << 48) - 1) + errx(EX_DATAERR, + "burst: out of range (0..2^48-1)"); + ac--; av++; + break; + + default: + errx(EX_DATAERR, "unrecognised option ``%s''", av[-1]); + } + } + + /* check validity of parameters */ + if (p) { + if (p->delay > 10000) + errx(EX_DATAERR, "delay must be < 10000"); + if (p->bandwidth == -1) + p->bandwidth = 0; + } + if (fs) { + /* XXX accept a 0 scheduler to keep the default */ + if (fs->flags & DN_QSIZE_BYTES) { + size_t len; + long limit; + + len = sizeof(limit); + if (sysctlbyname("net.inet.ip.dummynet.pipe_byte_limit", + &limit, &len, NULL, 0) == -1) + limit = 1024*1024; + if (fs->qsize > limit) + errx(EX_DATAERR, "queue size must be < %ldB", limit); + } else { + size_t len; + long limit; + + len = sizeof(limit); + if (sysctlbyname("net.inet.ip.dummynet.pipe_slot_limit", + &limit, &len, NULL, 0) == -1) + limit = 100; + if (fs->qsize > limit) + errx(EX_DATAERR, "2 <= queue size <= %ld", limit); + } + + if ((fs->flags & DN_IS_ECN) && !(fs->flags & DN_IS_RED)) + errx(EX_USAGE, "enable red/gred for ECN"); + + if (fs->flags & DN_IS_RED) { + size_t len; + int lookup_depth, avg_pkt_size; + + if (!(fs->flags & DN_IS_ECN) && (fs->min_th >= fs->max_th)) + errx(EX_DATAERR, "min_th %d must be < than max_th %d", + fs->min_th, fs->max_th); + else if ((fs->flags & DN_IS_ECN) && (fs->min_th > fs->max_th)) + errx(EX_DATAERR, "min_th %d must be =< than max_th %d", + fs->min_th, fs->max_th); + + if (fs->max_th == 0) + errx(EX_DATAERR, "max_th must be > 0"); + + len = sizeof(int); + if (sysctlbyname("net.inet.ip.dummynet.red_lookup_depth", + &lookup_depth, &len, NULL, 0) == -1) + lookup_depth = 256; + if (lookup_depth == 0) + errx(EX_DATAERR, "net.inet.ip.dummynet.red_lookup_depth" + " must be greater than zero"); + + len = sizeof(int); + if (sysctlbyname("net.inet.ip.dummynet.red_avg_pkt_size", + &avg_pkt_size, &len, NULL, 0) == -1) + avg_pkt_size = 512; + + if (avg_pkt_size == 0) + errx(EX_DATAERR, + "net.inet.ip.dummynet.red_avg_pkt_size must" + " be greater than zero"); + +#if 0 /* the following computation is now done in the kernel */ + /* + * Ticks needed for sending a medium-sized packet. + * Unfortunately, when we are configuring a WF2Q+ queue, we + * do not have bandwidth information, because that is stored + * in the parent pipe, and also we have multiple queues + * competing for it. So we set s=0, which is not very + * correct. But on the other hand, why do we want RED with + * WF2Q+ ? + */ + if (p.bandwidth==0) /* this is a WF2Q+ queue */ + s = 0; + else + s = (double)ck.hz * avg_pkt_size * 8 / p.bandwidth; + /* + * max idle time (in ticks) before avg queue size becomes 0. + * NOTA: (3/w_q) is approx the value x so that + * (1-w_q)^x < 10^-3. + */ + w_q = ((double)fs->w_q) / (1 << SCALE_RED); + idle = s * 3. 
/ w_q; + fs->lookup_step = (int)idle / lookup_depth; + if (!fs->lookup_step) + fs->lookup_step = 1; + weight = 1 - w_q; + for (t = fs->lookup_step; t > 1; --t) + weight *= 1 - w_q; + fs->lookup_weight = (int)(weight * (1 << SCALE_RED)); +#endif /* code moved in the kernel */ + } + } + + i = do_cmd(IP_DUMMYNET3, base, (char *)buf - (char *)base); + + if (i) + err(1, "setsockopt(%s)", "IP_DUMMYNET_CONFIGURE"); +} + +void +dummynet_flush(void) +{ + struct dn_id oid; + oid_fill(&oid, sizeof(oid), DN_CMD_FLUSH, DN_API_VERSION); + do_cmd(IP_DUMMYNET3, &oid, oid.len); +} + +/* Parse input for 'ipfw [pipe|sched|queue] show [range list]' + * Returns the number of ranges, and possibly stores them + * in the array v of size len. + */ +static int +parse_range(int ac, char *av[], uint32_t *v, int len) +{ + int n = 0; + char *endptr, *s; + uint32_t base[2]; + + if (v == NULL || len < 2) { + v = base; + len = 2; + } + + for (s = *av; s != NULL; av++, ac--) { + v[0] = strtoul(s, &endptr, 10); + v[1] = (*endptr != '-') ? v[0] : + strtoul(endptr+1, &endptr, 10); + if (*endptr == '\0') { /* prepare for next round */ + s = (ac > 0) ? *(av+1) : NULL; + } else { + if (*endptr != ',') { + warn("invalid number: %s", s); + s = ++endptr; + continue; + } + /* continue processing from here */ + s = ++endptr; + ac++; + av--; + } + if (v[1] < v[0] || + v[1] >= DN_MAX_ID-1 || + v[1] >= DN_MAX_ID-1) { + continue; /* invalid entry */ + } + n++; + /* translate if 'pipe list' */ + if (co.do_pipe == 1) { + v[0] += DN_MAX_ID; + v[1] += DN_MAX_ID; + } + v = (n*2 < len) ? v + 2 : base; + } + return n; +} + +/* main entry point for dummynet list functions. co.do_pipe indicates + * which function we want to support. + * av may contain filtering arguments, either individual entries + * or ranges, or lists (space or commas are valid separators). + * Format for a range can be n1-n2 or n3 n4 n5 ... + * In a range n1 must be <= n2, otherwise the range is ignored. + * A number 'n4' is translate in a range 'n4-n4' + * All number must be > 0 and < DN_MAX_ID-1 + */ +void +dummynet_list(int ac, char *av[], int show_counters) +{ + struct dn_id *oid, *x = NULL; + int ret, i; + int n; /* # of ranges */ + u_int buflen, l; + u_int max_size; /* largest obj passed up */ + + (void)show_counters; // XXX unused, but we should use it. + ac--; + av++; /* skip 'list' | 'show' word */ + + n = parse_range(ac, av, NULL, 0); /* Count # of ranges. */ + + /* Allocate space to store ranges */ + l = sizeof(*oid) + sizeof(uint32_t) * n * 2; + oid = safe_calloc(1, l); + oid_fill(oid, l, DN_CMD_GET, DN_API_VERSION); + + if (n > 0) /* store ranges in idx */ + parse_range(ac, av, (uint32_t *)(oid + 1), n*2); + /* + * Compute the size of the largest object returned. If the + * response leaves at least this much spare space in the + * buffer, then surely the response is complete; otherwise + * there might be a risk of truncation and we will need to + * retry with a larger buffer. + * XXX don't bother with smaller structs. 
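+ * Concretely (a sketch of the loop below, not original documentation):
+ * start from the kernel's size estimate (oid->id) plus max_size, and
+ * on each of up to 20 retries double buflen until the response leaves
+ * at least max_size bytes spare (l + max_size <= buflen).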
+ */ + max_size = sizeof(struct dn_fs); + if (max_size < sizeof(struct dn_sch)) + max_size = sizeof(struct dn_sch); + if (max_size < sizeof(struct dn_flow)) + max_size = sizeof(struct dn_flow); + + switch (co.do_pipe) { + case 1: + oid->subtype = DN_LINK; /* list pipe */ + break; + case 2: + oid->subtype = DN_FS; /* list queue */ + break; + case 3: + oid->subtype = DN_SCH; /* list sched */ + break; + } + + /* + * Ask the kernel an estimate of the required space (result + * in oid.id), unless we are requesting a subset of objects, + * in which case the kernel does not give an exact answer. + * In any case, space might grow in the meantime due to the + * creation of new queues, so we must be prepared to retry. + */ + if (n > 0) { + buflen = 4*1024; + } else { + ret = do_cmd(-IP_DUMMYNET3, oid, (uintptr_t)&l); + if (ret != 0 || oid->id <= sizeof(*oid)) + goto done; + buflen = oid->id + max_size; + oid->len = sizeof(*oid); /* restore */ + } + /* Try a few times, until the buffer fits */ + for (i = 0; i < 20; i++) { + l = buflen; + x = safe_realloc(x, l); + bcopy(oid, x, oid->len); + ret = do_cmd(-IP_DUMMYNET3, x, (uintptr_t)&l); + if (ret != 0 || x->id <= sizeof(*oid)) + goto done; /* no response */ + if (l + max_size <= buflen) + break; /* ok */ + buflen *= 2; /* double for next attempt */ + } + list_pipes(x, O_NEXT(x, l)); +done: + if (x) + free(x); + free(oid); +} diff --git a/example/ipfw/ipfw/ipfw.8 b/example/ipfw/ipfw/ipfw.8 new file mode 100644 index 0000000..9b8946b --- /dev/null +++ b/example/ipfw/ipfw/ipfw.8 @@ -0,0 +1,3723 @@ +.\" +.\" $FreeBSD: head/sbin/ipfw/ipfw.8 274925 2014-11-23 21:00:00Z joel $ +.\" +.Dd Aug 13, 2014 +.Dt IPFW 8 +.Os +.Sh NAME +.Nm ipfw +.Nd User interface for firewall, traffic shaper, packet scheduler, +in-kernel NAT. +.Sh SYNOPSIS +.Ss FIREWALL CONFIGURATION +.Nm +.Op Fl cq +.Cm add +.Ar rule +.Nm +.Op Fl acdefnNStT +.Op Cm set Ar N +.Brq Cm list | show +.Op Ar rule | first-last ... +.Nm +.Op Fl f | q +.Op Cm set Ar N +.Cm flush +.Nm +.Op Fl q +.Op Cm set Ar N +.Brq Cm delete | zero | resetlog +.Op Ar number ... +.Pp +.Nm +.Cm set Oo Cm disable Ar number ... Oc Op Cm enable Ar number ... +.Nm +.Cm set move +.Op Cm rule +.Ar number Cm to Ar number +.Nm +.Cm set swap Ar number number +.Nm +.Cm set show +.Ss SYSCTL SHORTCUTS +.Nm +.Cm enable +.Brq Cm firewall | altq | one_pass | debug | verbose | dyn_keepalive +.Nm +.Cm disable +.Brq Cm firewall | altq | one_pass | debug | verbose | dyn_keepalive +.Ss LOOKUP TABLES +.Nm +.Oo Cm set Ar N Oc Cm table Ar name Cm create Ar create-options +.Nm +.Oo Cm set Ar N Oc Cm table Ar name Cm destroy +.Nm +.Oo Cm set Ar N Oc Cm table Ar name Cm modify Ar modify-options +.Nm +.Oo Cm set Ar N Oc Cm table Ar name Cm swap Ar name +.Nm +.Oo Cm set Ar N Oc Cm table Ar name Cm add Ar table-key Op Ar value +.Nm +.Oo Cm set Ar N Oc Cm table Ar name Cm add Op Ar table-key Ar value ... +.Nm +.Oo Cm set Ar N Oc Cm table Ar name Cm atomic add Op Ar table-key Ar value ... +.Nm +.Oo Cm set Ar N Oc Cm table Ar name Cm delete Op Ar table-key ... 
+.Nm +.Oo Cm set Ar N Oc Cm table Ar name Cm lookup Ar addr +.Nm +.Oo Cm set Ar N Oc Cm table Ar name Cm lock +.Nm +.Oo Cm set Ar N Oc Cm table Ar name Cm unlock +.Nm +.Oo Cm set Ar N Oc Cm table +.Brq Ar name | all +.Cm list +.Nm +.Oo Cm set Ar N Oc Cm table +.Brq Ar name | all +.Cm info +.Nm +.Oo Cm set Ar N Oc Cm table +.Brq Ar name | all +.Cm detail +.Nm +.Oo Cm set Ar N Oc Cm table +.Brq Ar name | all +.Cm flush +.Ss DUMMYNET CONFIGURATION (TRAFFIC SHAPER AND PACKET SCHEDULER) +.Nm +.Brq Cm pipe | queue | sched +.Ar number +.Cm config +.Ar config-options +.Nm +.Op Fl s Op Ar field +.Brq Cm pipe | queue | sched +.Brq Cm delete | list | show +.Op Ar number ... +.Ss IN-KERNEL NAT +.Nm +.Op Fl q +.Cm nat +.Ar number +.Cm config +.Ar config-options +.Pp +.Nm +.Op Fl cfnNqS +.Oo +.Fl p Ar preproc +.Oo +.Ar preproc-flags +.Oc +.Oc +.Ar pathname +.Ss INTERNAL DIAGNOSTICS +.Nm +.Cm internal iflist +.Nm +.Cm internal talist +.Nm +.Cm internal vlist +.Sh DESCRIPTION +The +.Nm +utility is the user interface for controlling the +.Xr ipfw 4 +firewall, the +.Xr dummynet 4 +traffic shaper/packet scheduler, and the +in-kernel NAT services. +.Pp +A firewall configuration, or +.Em ruleset , +is made of a list of +.Em rules +numbered from 1 to 65535. +Packets are passed to the firewall +from a number of different places in the protocol stack +(depending on the source and destination of the packet, +it is possible for the firewall to be +invoked multiple times on the same packet). +The packet passed to the firewall is compared +against each of the rules in the +.Em ruleset , +in rule-number order +(multiple rules with the same number are permitted, in which case +they are processed in order of insertion). +When a match is found, the action corresponding to the +matching rule is performed. +.Pp +Depending on the action and certain system settings, packets +can be reinjected into the firewall at some rule after the +matching one for further processing. +.Pp +A ruleset always includes a +.Em default +rule (numbered 65535) which cannot be modified or deleted, +and matches all packets. +The action associated with the +.Em default +rule can be either +.Cm deny +or +.Cm allow +depending on how the kernel is configured. +.Pp +If the ruleset includes one or more rules with the +.Cm keep-state +or +.Cm limit +option, +the firewall will have a +.Em stateful +behaviour, i.e., upon a match it will create +.Em dynamic rules , +i.e., rules that match packets with the same 5-tuple +(protocol, source and destination addresses and ports) +as the packet which caused their creation. +Dynamic rules, which have a limited lifetime, are checked +at the first occurrence of a +.Cm check-state , +.Cm keep-state +or +.Cm limit +rule, and are typically used to open the firewall on-demand to +legitimate traffic only. +See the +.Sx STATEFUL FIREWALL +and +.Sx EXAMPLES +Sections below for more information on the stateful behaviour of +.Nm . +.Pp +All rules (including dynamic ones) have a few associated counters: +a packet count, a byte count, a log count and a timestamp +indicating the time of the last match. +Counters can be displayed or reset with +.Nm +commands. +.Pp +Each rule belongs to one of 32 different +.Em sets +, and there are +.Nm +commands to atomically manipulate sets, such as enable, +disable, swap sets, move all rules in a set to another +one, delete all rules in a set. +These can be useful to +install temporary configurations, or to test them. +See Section +.Sx SETS OF RULES +for more information on +.Em sets . 
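+.Pp
+For example, the following commands disable set 1, exchange the
+contents of sets 0 and 1, and move rule 100 into set 2:
+.Bd -literal -offset indent
+ipfw set disable 1
+ipfw set swap 0 1
+ipfw set move rule 100 to 2
+.Ed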
+.Pp +Rules can be added with the +.Cm add +command; deleted individually or in groups with the +.Cm delete +command, and globally (except those in set 31) with the +.Cm flush +command; displayed, optionally with the content of the +counters, using the +.Cm show +and +.Cm list +commands. +Finally, counters can be reset with the +.Cm zero +and +.Cm resetlog +commands. +.Pp +.Ss COMMAND OPTIONS +The following general options are available when invoking +.Nm : +.Bl -tag -width indent +.It Fl a +Show counter values when listing rules. +The +.Cm show +command implies this option. +.It Fl b +Only show the action and the comment, not the body of a rule. +Implies +.Fl c . +.It Fl c +When entering or showing rules, print them in compact form, +i.e., omitting the "ip from any to any" string +when this does not carry any additional information. +.It Fl d +When listing, show dynamic rules in addition to static ones. +.It Fl e +When listing and +.Fl d +is specified, also show expired dynamic rules. +.It Fl f +Do not ask for confirmation for commands that can cause problems +if misused, i.e., +.Cm flush . +If there is no tty associated with the process, this is implied. +.It Fl i +When listing a table (see the +.Sx LOOKUP TABLES +section below for more information on lookup tables), format values +as IP addresses. +By default, values are shown as integers. +.It Fl n +Only check syntax of the command strings, without actually passing +them to the kernel. +.It Fl N +Try to resolve addresses and service names in output. +.It Fl q +Be quiet when executing the +.Cm add , +.Cm nat , +.Cm zero , +.Cm resetlog +or +.Cm flush +commands; +(implies +.Fl f ) . +This is useful when updating rulesets by executing multiple +.Nm +commands in a script +(e.g., +.Ql sh\ /etc/rc.firewall ) , +or by processing a file with many +.Nm +rules across a remote login session. +It also stops a table add or delete +from failing if the entry already exists or is not present. +.Pp +The reason why this option may be important is that +for some of these actions, +.Nm +may print a message; if the action results in blocking the +traffic to the remote client, +the remote login session will be closed +and the rest of the ruleset will not be processed. +Access to the console would then be required to recover. +.It Fl S +When listing rules, show the +.Em set +each rule belongs to. +If this flag is not specified, disabled rules will not be +listed. +.It Fl s Op Ar field +When listing pipes, sort according to one of the four +counters (total or current packets or bytes). +.It Fl t +When listing, show last match timestamp converted with ctime(). +.It Fl T +When listing, show last match timestamp as seconds from the epoch. +This form can be more convenient for postprocessing by scripts. +.El +.Ss LIST OF RULES AND PREPROCESSING +To ease configuration, rules can be put into a file which is +processed using +.Nm +as shown in the last synopsis line. +An absolute +.Ar pathname +must be used. +The file will be read line by line and applied as arguments to the +.Nm +utility. +.Pp +Optionally, a preprocessor can be specified using +.Fl p Ar preproc +where +.Ar pathname +is to be piped through. +Useful preprocessors include +.Xr cpp 1 +and +.Xr m4 1 . +If +.Ar preproc +does not start with a slash +.Pq Ql / +as its first character, the usual +.Ev PATH +name search is performed. +Care should be taken with this in environments where not all +file systems are mounted (yet) by the time +.Nm +is being run (e.g.\& when they are mounted over NFS). 
+Once +.Fl p +has been specified, any additional arguments are passed on to the preprocessor +for interpretation. +This allows for flexible configuration files (like conditionalizing +them on the local hostname) and the use of macros to centralize +frequently required arguments like IP addresses. +.Ss TRAFFIC SHAPER CONFIGURATION +The +.Nm +.Cm pipe , queue +and +.Cm sched +commands are used to configure the traffic shaper and packet scheduler. +See the +.Sx TRAFFIC SHAPER (DUMMYNET) CONFIGURATION +Section below for details. +.Pp +If the world and the kernel get out of sync the +.Nm +ABI may break, preventing you from being able to add any rules. +This can adversely affect the booting process. +You can use +.Nm +.Cm disable +.Cm firewall +to temporarily disable the firewall to regain access to the network, +allowing you to fix the problem. +.Sh PACKET FLOW +A packet is checked against the active ruleset in multiple places +in the protocol stack, under control of several sysctl variables. +These places and variables are shown below, and it is important to +have this picture in mind in order to design a correct ruleset. +.Bd -literal -offset indent + ^ to upper layers V + | | + +----------->-----------+ + ^ V + [ip(6)_input] [ip(6)_output] net.inet(6).ip(6).fw.enable=1 + | | + ^ V + [ether_demux] [ether_output_frame] net.link.ether.ipfw=1 + | | + +-->--[bdg_forward]-->--+ net.link.bridge.ipfw=1 + ^ V + | to devices | +.Ed +.Pp +The number of +times the same packet goes through the firewall can +vary between 0 and 4 depending on packet source and +destination, and system configuration. +.Pp +Note that as packets flow through the stack, headers can be +stripped or added to it, and so they may or may not be available +for inspection. +E.g., incoming packets will include the MAC header when +.Nm +is invoked from +.Cm ether_demux() , +but the same packets will have the MAC header stripped off when +.Nm +is invoked from +.Cm ip_input() +or +.Cm ip6_input() . +.Pp +Also note that each packet is always checked against the complete ruleset, +irrespective of the place where the check occurs, or the source of the packet. +If a rule contains some match patterns or actions which are not valid +for the place of invocation (e.g.\& trying to match a MAC header within +.Cm ip_input +or +.Cm ip6_input ), +the match pattern will not match, but a +.Cm not +operator in front of such patterns +.Em will +cause the pattern to +.Em always +match on those packets. +It is thus the responsibility of +the programmer, if necessary, to write a suitable ruleset to +differentiate among the possible places. +.Cm skipto +rules can be useful here, as an example: +.Bd -literal -offset indent +# packets from ether_demux or bdg_forward +ipfw add 10 skipto 1000 all from any to any layer2 in +# packets from ip_input +ipfw add 10 skipto 2000 all from any to any not layer2 in +# packets from ip_output +ipfw add 10 skipto 3000 all from any to any not layer2 out +# packets from ether_output_frame +ipfw add 10 skipto 4000 all from any to any layer2 out +.Ed +.Pp +(yes, at the moment there is no way to differentiate between +ether_demux and bdg_forward). +.Sh SYNTAX +In general, each keyword or argument must be provided as +a separate command line argument, with no leading or trailing +spaces. +Keywords are case-sensitive, whereas arguments may +or may not be case-sensitive depending on their nature +(e.g.\& uid's are, hostnames are not). +.Pp +Some arguments (e.g., port or address lists) are comma-separated +lists of values. 
+In this case, spaces after commas ',' are allowed to make +the line more readable. +You can also put the entire +command (including flags) into a single argument. +E.g., the following forms are equivalent: +.Bd -literal -offset indent +ipfw -q add deny src-ip 10.0.0.0/24,127.0.0.1/8 +ipfw -q add deny src-ip 10.0.0.0/24, 127.0.0.1/8 +ipfw "-q add deny src-ip 10.0.0.0/24, 127.0.0.1/8" +.Ed +.Sh RULE FORMAT +The format of firewall rules is the following: +.Bd -ragged -offset indent +.Bk -words +.Op Ar rule_number +.Op Cm set Ar set_number +.Op Cm prob Ar match_probability +.Ar action +.Op Cm log Op Cm logamount Ar number +.Op Cm altq Ar queue +.Oo +.Bro Cm tag | untag +.Brc Ar number +.Oc +.Ar body +.Ek +.Ed +.Pp +where the body of the rule specifies which information is used +for filtering packets, among the following: +.Pp +.Bl -tag -width "Source and dest. addresses and ports" -offset XXX -compact +.It Layer-2 header fields +When available +.It IPv4 and IPv6 Protocol +TCP, UDP, ICMP, etc. +.It Source and dest. addresses and ports +.It Direction +See Section +.Sx PACKET FLOW +.It Transmit and receive interface +By name or address +.It Misc. IP header fields +Version, type of service, datagram length, identification, +fragment flag (non-zero IP offset), +Time To Live +.It IP options +.It IPv6 Extension headers +Fragmentation, Hop-by-Hop options, +Routing Headers, Source routing rthdr0, Mobile IPv6 rthdr2, IPSec options. +.It IPv6 Flow-ID +.It Misc. TCP header fields +TCP flags (SYN, FIN, ACK, RST, etc.), +sequence number, acknowledgment number, +window +.It TCP options +.It ICMP types +for ICMP packets +.It ICMP6 types +for ICMP6 packets +.It User/group ID +When the packet can be associated with a local socket. +.It Divert status +Whether a packet came from a divert socket (e.g., +.Xr natd 8 ) . +.It Fib annotation state +Whether a packet has been tagged for using a specific FIB (routing table) +in future forwarding decisions. +.El +.Pp +Note that some of the above information, e.g.\& source MAC or IP addresses and +TCP/UDP ports, can be easily spoofed, so filtering on those fields +alone might not guarantee the desired results. +.Bl -tag -width indent +.It Ar rule_number +Each rule is associated with a +.Ar rule_number +in the range 1..65535, with the latter reserved for the +.Em default +rule. +Rules are checked sequentially by rule number. +Multiple rules can have the same number, in which case they are +checked (and listed) according to the order in which they have +been added. +If a rule is entered without specifying a number, the kernel will +assign one in such a way that the rule becomes the last one +before the +.Em default +rule. +Automatic rule numbers are assigned by incrementing the last +non-default rule number by the value of the sysctl variable +.Ar net.inet.ip.fw.autoinc_step +which defaults to 100. +If this is not possible (e.g.\& because we would go beyond the +maximum allowed rule number), the number of the last +non-default value is used instead. +.It Cm set Ar set_number +Each rule is associated with a +.Ar set_number +in the range 0..31. +Sets can be individually disabled and enabled, so this parameter +is of fundamental importance for atomic ruleset manipulation. +It can be also used to simplify deletion of groups of rules. +If a rule is entered without specifying a set number, +set 0 will be used. 
+.br
+Set 31 is special in that it cannot be disabled,
+and rules in set 31 are not deleted by the
+.Nm ipfw flush
+command (but you can delete them with the
+.Nm ipfw delete set 31
+command).
+Set 31 is also used for the
+.Em default
+rule.
+.It Cm prob Ar match_probability
+A match is only declared with the specified probability
+(floating point number between 0 and 1).
+This can be useful for a number of applications such as
+random packet drop or
+(in conjunction with
+.Nm dummynet )
+to simulate the effect of multiple paths leading to out-of-order
+packet delivery.
+.Pp
+Note: this condition is checked before any other condition, including
+ones such as keep-state or check-state which might have side effects.
+.It Cm log Op Cm logamount Ar number
+Packets matching a rule with the
+.Cm log
+keyword will be made available for logging in two ways:
+if the sysctl variable
+.Va net.inet.ip.fw.verbose
+is set to 0 (default), one can use
+.Xr bpf 4
+attached to the
+.Li ipfw0
+pseudo interface.
+This pseudo interface can be created manually after boot
+by using the following command:
+.Bd -literal -offset indent
+# ifconfig ipfw0 create
+.Ed
+.Pp
+Or, automatically at boot time by adding the following
+line to the
+.Xr rc.conf 5
+file:
+.Bd -literal -offset indent
+firewall_logif="YES"
+.Ed
+.Pp
+There is no overhead if no
+.Xr bpf 4
+is attached to the pseudo interface.
+.Pp
+If
+.Va net.inet.ip.fw.verbose
+is set to 1, packets will be logged to
+.Xr syslogd 8
+with a
+.Dv LOG_SECURITY
+facility up to a maximum of
+.Cm logamount
+packets.
+If no
+.Cm logamount
+is specified, the limit is taken from the sysctl variable
+.Va net.inet.ip.fw.verbose_limit .
+In both cases, a value of 0 means unlimited logging.
+.Pp
+Once the limit is reached, logging can be re-enabled by
+clearing the logging counter or the packet counter for that entry; see the
+.Cm resetlog
+command.
+.Pp
+Note: logging is done after all other packet matching conditions
+have been successfully verified, and before performing the final
+action (accept, deny, etc.) on the packet.
+.It Cm tag Ar number
+When a packet matches a rule with the
+.Cm tag
+keyword, the numeric tag for the given
+.Ar number
+in the range 1..65534 will be attached to the packet.
+The tag acts as an internal marker (it is not sent out over
+the wire) that can be used to identify these packets later on.
+This can be used, for example, to provide trust between interfaces
+and to start doing policy-based filtering.
+A packet can have multiple tags at the same time.
+Tags are "sticky", meaning that once a tag is applied to a packet by a
+matching rule, it persists until explicitly removed.
+Tags are kept with the packet everywhere within the kernel, but are
+lost when the packet leaves the kernel, for example, when transmitting
+the packet out to the network or sending it to a
+.Xr divert 4
+socket.
+.Pp
+To check for previously applied tags, use the
+.Cm tagged
+rule option.
+To delete a previously applied tag, use the
+.Cm untag
+keyword.
+.Pp
+Note: since tags are kept with the packet everywhere in kernelspace,
+they can be set and unset anywhere in the kernel network subsystem
+(using the
+.Xr mbuf_tags 9
+facility), not only by means of the
+.Xr ipfw 4
+.Cm tag
+and
+.Cm untag
+keywords.
+For example, there can be a specialized
+.Xr netgraph 4
+node analyzing and tagging traffic for later inspection
+in the firewall.
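+.Pp
+For instance, tags can carry interface trust through the ruleset (a
+sketch; the rule numbers, the tag number and the interface name are
+placeholders):
+.Bd -literal -offset indent
+# tag packets arriving on the inner interface ...
+ipfw add 100 count tag 6 ip from any to any via em0 in
+# ... and act on the tag later in the ruleset
+ipfw add 200 allow ip from any to any tagged 6
+.Ed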
+.It Cm untag Ar number
+When a packet matches a rule with the
+.Cm untag
+keyword, the tag with the number
+.Ar number
+is searched among the tags attached to this packet and,
+if found, removed from it.
+Other tags bound to the packet, if present, are left untouched.
+.It Cm altq Ar queue
+When a packet matches a rule with the
+.Cm altq
+keyword, the ALTQ identifier for the given
+.Ar queue
+(see
+.Xr altq 4 )
+will be attached.
+Note that this ALTQ tag is only meaningful for packets going "out" of IPFW,
+and not being rejected or going to divert sockets.
+Note that if there is insufficient memory at the time the packet is
+processed, it will not be tagged, so it is wise to make your ALTQ
+"default" queue policy account for this.
+If multiple
+.Cm altq
+rules match a single packet, only the first one adds the ALTQ classification
+tag.
+In doing so, traffic may be shaped by using
+.Cm count Cm altq Ar queue
+rules for classification early in the ruleset, then later applying
+the filtering decision.
+For example,
+.Cm check-state
+and
+.Cm keep-state
+rules may come later and provide the actual filtering decisions in
+addition to the fallback ALTQ tag.
+.Pp
+You must run
+.Xr pfctl 8
+to set up the queues before IPFW will be able to look them up by name,
+and if the ALTQ disciplines are rearranged, the rules containing the
+queue identifiers in the kernel will likely have gone stale and need
+to be reloaded.
+Stale queue identifiers will probably result in misclassification.
+.Pp
+All system ALTQ processing can be turned on or off via
+.Nm
+.Cm enable Ar altq
+and
+.Nm
+.Cm disable Ar altq .
+The usage of
+.Va net.inet.ip.fw.one_pass
+is irrelevant to ALTQ traffic shaping, as the actual rule action is
+always followed after adding an ALTQ tag.
+.El
+.Ss RULE ACTIONS
+A rule can be associated with one of the following actions, which
+will be executed when the packet matches the body of the rule.
+.Bl -tag -width indent
+.It Cm allow | accept | pass | permit
+Allow packets that match the rule.
+The search terminates.
+.It Cm check-state
+Checks the packet against the dynamic ruleset.
+If a match is found, execute the action associated with
+the rule which generated this dynamic rule, otherwise
+move to the next rule.
+.br
+.Cm Check-state
+rules do not have a body.
+If no
+.Cm check-state
+rule is found, the dynamic ruleset is checked at the first
+.Cm keep-state
+or
+.Cm limit
+rule.
+.It Cm count
+Update counters for all packets that match the rule.
+The search continues with the next rule.
+.It Cm deny | drop
+Discard packets that match this rule.
+The search terminates.
+.It Cm divert Ar port
+Divert packets that match this rule to the
+.Xr divert 4
+socket bound to port
+.Ar port .
+The search terminates.
+.It Cm fwd | forward Ar ipaddr | tablearg Ns Op , Ns Ar port
+Change the next-hop on matching packets to
+.Ar ipaddr ,
+which can be an IP address or a host name.
+For IPv4, the next hop can also be supplied by the last table
+looked up for the packet by using the
+.Cm tablearg
+keyword instead of an explicit address.
+The search terminates if this rule matches.
+.Pp
+If
+.Ar ipaddr
+is a local address, then matching packets will be forwarded to
+.Ar port
+(or the port number in the packet if one is not specified in the rule)
+on the local machine.
+.br
+If
+.Ar ipaddr
+is not a local address, then the port number
+(if specified) is ignored, and the packet will be
+forwarded to the remote address, using the route as found in
+the local routing table for that IP.
+.br
+A
+.Ar fwd
+rule will not match layer-2 packets (those received
+on ether_input, ether_output, or bridged).
+.br
+The
+.Cm fwd
+action does not change the contents of the packet at all.
+In particular, the destination address remains unmodified, so
+packets forwarded to another system will usually be rejected by that system
+unless there is a matching rule on that system to capture them.
+For packets forwarded locally,
+the local address of the socket will be
+set to the original destination address of the packet.
+This makes the
+.Xr netstat 1
+entry look rather weird but is intended for
+use with transparent proxy servers.
+.It Cm nat Ar nat_nr | tablearg
+Pass the packet to a
+nat instance
+(for network address translation, address redirect, etc.):
+see the
+.Sx NETWORK ADDRESS TRANSLATION (NAT)
+Section for further information.
+.It Cm pipe Ar pipe_nr
+Pass the packet to a
+.Nm dummynet
+.Dq pipe
+(for bandwidth limitation, delay, etc.).
+See the
+.Sx TRAFFIC SHAPER (DUMMYNET) CONFIGURATION
+Section for further information.
+The search terminates; however, on exit from the pipe and if
+the
+.Xr sysctl 8
+variable
+.Va net.inet.ip.fw.one_pass
+is not set, the packet is passed again to the firewall code
+starting from the next rule.
+.It Cm queue Ar queue_nr
+Pass the packet to a
+.Nm dummynet
+.Dq queue
+(for bandwidth limitation using WF2Q+).
+.It Cm reject
+(Deprecated).
+Synonym for
+.Cm unreach host .
+.It Cm reset
+Discard packets that match this rule, and if the
+packet is a TCP packet, try to send a TCP reset (RST) notice.
+The search terminates.
+.It Cm reset6
+Discard packets that match this rule, and if the
+packet is a TCP packet, try to send a TCP reset (RST) notice.
+The search terminates.
+.It Cm skipto Ar number | tablearg
+Skip all subsequent rules numbered less than
+.Ar number .
+The search continues with the first rule numbered
+.Ar number
+or higher.
+It is possible to use the
+.Cm tablearg
+keyword with a skipto for a
+.Em computed
+skipto. Skipto may work either in O(log(N)) or in O(1) depending
+on the amount of memory and/or sysctl variables.
+See the
+.Sx SYSCTL VARIABLES
+section for more details.
+.It Cm call Ar number | tablearg
+The current rule number is saved in the internal stack and
+ruleset processing continues with the first rule numbered
+.Ar number
+or higher.
+If later a rule with the
+.Cm return
+action is encountered, the processing returns to the first rule
+with the number of this
+.Cm call
+rule plus one or higher
+(the same behaviour as with packets returning from a
+.Xr divert 4
+socket after a
+.Cm divert
+action).
+This can be used to implement something like assembly language
+.Dq subroutine
+calls to rules with common checks for different interfaces, etc.
+.Pp
+A rule with any number can be called, not just forward jumps as with
+.Cm skipto .
+So, to prevent endless loops in case of mistakes, both
+.Cm call
+and
+.Cm return
+actions don't do any jumps and simply go to the next rule if memory
+cannot be allocated or the stack overflows or underflows.
+.Pp
+Internally, the stack of rule numbers is implemented using the
+.Xr mbuf_tags 9
+facility and currently holds 16 entries.
+As mbuf tags are lost when a packet leaves the kernel,
+.Cm divert
+should not be used in subroutines to avoid endless loops
+and other undesired effects.
+.It Cm return
+Takes the rule number saved to the internal stack by the last
+.Cm call
+action and returns ruleset processing to the first rule
+with a number greater than that of the corresponding
+.Cm call
+rule.
+See description of the
+.Cm call
+action for more details.
+.Pp
+Note that
+.Cm return
+rules usually end a
+.Dq subroutine
+and thus are unconditional, but the
+.Nm
+command-line utility currently requires every action except
+.Cm check-state
+to have a body.
+While it is sometimes useful to return only on some packets,
+usually you want to print just
+.Dq return
+for readability.
+A workaround for this is to use the new syntax and the
+.Fl c
+switch:
+.Bd -literal -offset indent
+# Add a rule without actual body
+ipfw add 2999 return via any
+
+# List rules without "from any to any" part
+ipfw -c list
+.Ed
+.Pp
+This cosmetic annoyance may be fixed in future releases.
+.It Cm tee Ar port
+Send a copy of packets matching this rule to the
+.Xr divert 4
+socket bound to port
+.Ar port .
+The search continues with the next rule.
+.It Cm unreach Ar code
+Discard packets that match this rule, and try to send an ICMP
+unreachable notice with code
+.Ar code ,
+where
+.Ar code
+is a number from 0 to 255, or one of these aliases:
+.Cm net , host , protocol , port ,
+.Cm needfrag , srcfail , net-unknown , host-unknown ,
+.Cm isolated , net-prohib , host-prohib , tosnet ,
+.Cm toshost , filter-prohib , host-precedence
+or
+.Cm precedence-cutoff .
+The search terminates.
+.It Cm unreach6 Ar code
+Discard packets that match this rule, and try to send an ICMPv6
+unreachable notice with code
+.Ar code ,
+where
+.Ar code
+is one of 0, 1, 3 or 4, or one of these aliases:
+.Cm no-route, admin-prohib, address
+or
+.Cm port .
+The search terminates.
+.It Cm netgraph Ar cookie
+Divert the packet into netgraph with the given
+.Ar cookie .
+The search terminates.
+If the packet is later returned from netgraph, it is either
+accepted or continues with the next rule, depending on the
+.Va net.inet.ip.fw.one_pass
+sysctl variable.
+.It Cm ngtee Ar cookie
+A copy of the packet is diverted into netgraph, while the original
+packet continues with the next rule.
+See
+.Xr ng_ipfw 4
+for more information on the
+.Cm netgraph
+and
+.Cm ngtee
+actions.
+.It Cm setfib Ar fibnum | tablearg
+The packet is tagged so as to use the FIB (routing table)
+.Ar fibnum
+in any subsequent forwarding decisions.
+In the current implementation, this is limited to the values 0 through 15, see
+.Xr setfib 2 .
+Processing continues at the next rule.
+It is possible to use the
+.Cm tablearg
+keyword with setfib.
+If the tablearg value is not within the compiled range of fibs,
+the packet's fib is set to 0.
+.It Cm setdscp Ar DSCP | number | tablearg
+Set the specified DiffServ codepoint for an IPv4/IPv6 packet.
+Processing continues at the next rule.
+Supported values are:
+.Pp
+.Cm CS0
+.Pq Dv 000000 ,
+.Cm CS1
+.Pq Dv 001000 ,
+.Cm CS2
+.Pq Dv 010000 ,
+.Cm CS3
+.Pq Dv 011000 ,
+.Cm CS4
+.Pq Dv 100000 ,
+.Cm CS5
+.Pq Dv 101000 ,
+.Cm CS6
+.Pq Dv 110000 ,
+.Cm CS7
+.Pq Dv 111000 ,
+.Cm AF11
+.Pq Dv 001010 ,
+.Cm AF12
+.Pq Dv 001100 ,
+.Cm AF13
+.Pq Dv 001110 ,
+.Cm AF21
+.Pq Dv 010010 ,
+.Cm AF22
+.Pq Dv 010100 ,
+.Cm AF23
+.Pq Dv 010110 ,
+.Cm AF31
+.Pq Dv 011010 ,
+.Cm AF32
+.Pq Dv 011100 ,
+.Cm AF33
+.Pq Dv 011110 ,
+.Cm AF41
+.Pq Dv 100010 ,
+.Cm AF42
+.Pq Dv 100100 ,
+.Cm AF43
+.Pq Dv 100110 ,
+.Cm EF
+.Pq Dv 101110 ,
+.Cm BE
+.Pq Dv 000000 .
+Additionally, the DSCP value can be specified by number (0..63).
+It is also possible to use the
+.Cm tablearg
+keyword with setdscp.
+If the tablearg value is not within the 0..63 range, the lower 6 bits
+of the supplied value are used.
+.It Cm reass
+Queue and reassemble IP fragments.
+If the packet is not fragmented, counters are updated and
+processing continues with the next rule.
+If the packet is the last logical fragment, the packet is reassembled and, if
+.Va net.inet.ip.fw.one_pass
+is set to 0, processing continues with the next rule.
+Otherwise, the packet is allowed to pass and the search terminates.
+If the packet is a fragment in the middle of a logical group of fragments,
+it is consumed and
+processing stops immediately.
+.Pp
+Fragment handling can be tuned via
+.Va net.inet.ip.maxfragpackets
+and
+.Va net.inet.ip.maxfragsperpacket
+which limit, respectively, the maximum number of processable
+fragments (default: 800) and
+the maximum number of fragments per packet (default: 16).
+.Pp
+NOTA BENE: since fragments do not contain port numbers,
+port-based match patterns should be avoided in the body of a
+.Nm reass
+rule.
+Alternatively, direction-based (like
+.Nm in
+/
+.Nm out
+) and source-based (like
+.Nm via
+) match patterns can be used to select fragments.
+.Pp
+Usually a simple rule like:
+.Bd -literal -offset indent
+# reassemble incoming fragments
+ipfw add reass all from any to any in
+.Ed
+.Pp
+is all you need at the beginning of your ruleset.
+.El
+.Ss RULE BODY
+The body of a rule contains zero or more patterns (such as
+specific source and destination addresses or ports,
+protocol options, incoming or outgoing interfaces, etc.)
+that the packet must match in order to be recognised.
+In general, the patterns are connected by (implicit)
+.Cm and
+operators -- i.e., all must match in order for the
+rule to match.
+Individual patterns can be prefixed by the
+.Cm not
+operator to reverse the result of the match, as in
+.Pp
+.Dl "ipfw add 100 allow ip from not 1.2.3.4 to any"
+.Pp
+Additionally, sets of alternative match patterns
+.Pq Em or-blocks
+can be constructed by putting the patterns in
+lists enclosed between parentheses ( ) or braces { }, and
+using the
+.Cm or
+operator as follows:
+.Pp
+.Dl "ipfw add 100 allow ip from { x or not y or z } to any"
+.Pp
+Only one level of parentheses is allowed.
+Beware that most shells have special meanings for parentheses
+or braces, so it is advisable to put a backslash \\ in front of them
+to prevent such interpretations.
+.Pp
+The body of a rule must in general include a source and destination
+address specifier.
+The keyword
+.Ar any
+can be used in various places to specify that the content of
+a required field is irrelevant.
+.Pp
+The rule body has the following format:
+.Bd -ragged -offset indent
+.Op Ar proto Cm from Ar src Cm to Ar dst
+.Op Ar options
+.Ed
+.Pp
+The first part (proto from src to dst) is for backward
+compatibility with earlier versions of
+.Fx .
+In modern
+.Fx
+any match pattern (including MAC headers, IP protocols,
+addresses and ports) can be specified in the
+.Ar options
+section.
+.Pp
+Rule fields have the following meaning:
+.Bl -tag -width indent
+.It Ar proto : protocol | Cm { Ar protocol Cm or ... }
+.It Ar protocol : Oo Cm not Oc Ar protocol-name | protocol-number
+An IP protocol specified by number or name
+(for a complete list see
+.Pa /etc/protocols ) ,
+or one of the following keywords:
+.Bl -tag -width indent
+.It Cm ip4 | ipv4
+Matches IPv4 packets.
+.It Cm ip6 | ipv6
+Matches IPv6 packets.
+.It Cm ip | all
+Matches any packet.
+.El
+.Pp
+The
+.Cm ipv6
+keyword in the
+.Cm proto
+option will be treated as an inner protocol.
+The
+.Cm ipv4
+keyword is not available in the
+.Cm proto
+option.
+.Pp
+The
+.Cm { Ar protocol Cm or ... }
+format (an
+.Em or-block )
+is provided for convenience only but its use is deprecated.
+.It Ar src No and Ar dst : Bro Cm addr | Cm { Ar addr Cm or ... } Brc Op Oo Cm not Oc Ar ports
+An address (or a list, see below)
+optionally followed by
+.Ar ports
+specifiers.
+.Pp
+The second format
+.Em ( or-block
+with multiple addresses) is provided for convenience only and
+its use is discouraged.
+.It Ar addr : Oo Cm not Oc Bro
+.Cm any | me | me6 |
+.Cm table Ns Pq Ar name Ns Op , Ns Ar value
+.Ar | addr-list | addr-set
+.Brc
+.Bl -tag -width indent
+.It Cm any
+matches any IP address.
+.It Cm me
+matches any IP address configured on an interface in the system.
+.It Cm me6
+matches any IPv6 address configured on an interface in the system.
+The address list is evaluated at the time the packet is
+analysed.
+.It Cm table Ns Pq Ar name Ns Op , Ns Ar value
+Matches any IPv4 or IPv6 address for which an entry exists in the lookup table
+.Ar name .
+If an optional 32-bit unsigned
+.Ar value
+is also specified, an entry will match only if it has this value.
+See the
+.Sx LOOKUP TABLES
+section below for more information on lookup tables.
+.El
+.It Ar addr-list : ip-addr Ns Op Ns , Ns Ar addr-list
+.It Ar ip-addr :
+A host or subnet address specified in one of the following ways:
+.Bl -tag -width indent
+.It Ar numeric-ip | hostname
+Matches a single IPv4 address, specified as dotted-quad or a hostname.
+Hostnames are resolved at the time the rule is added to the firewall list.
+.It Ar addr Ns / Ns Ar masklen
+Matches all addresses with base
+.Ar addr
+(specified as an IP address, a network number, or a hostname)
+and mask width of
+.Cm masklen
+bits.
+As an example, 1.2.3.4/25 or 1.2.3.0/25 will match
+all IP numbers from 1.2.3.0 to 1.2.3.127 .
+.It Ar addr Ns : Ns Ar mask
+Matches all addresses with base
+.Ar addr
+(specified as an IP address, a network number, or a hostname)
+and the mask of
+.Ar mask ,
+specified as a dotted quad.
+As an example, 1.2.3.4:255.0.255.0 or 1.0.3.0:255.0.255.0 will match
+1.*.3.*.
+This form is advised only for non-contiguous
+masks.
+It is better to resort to the
+.Ar addr Ns / Ns Ar masklen
+format for contiguous masks, which is more compact and less
+error-prone.
+.El
+.It Ar addr-set : addr Ns Oo Ns / Ns Ar masklen Oc Ns Cm { Ns Ar list Ns Cm }
+.It Ar list : Bro Ar num | num-num Brc Ns Op Ns , Ns Ar list
+Matches all addresses with base address
+.Ar addr
+(specified as an IP address, a network number, or a hostname)
+and whose last byte is in the list between braces { } .
+Note that there must be no spaces between braces and
+numbers (spaces after commas are allowed).
+Elements of the list can be specified as single entries
+or ranges.
+The
+.Ar masklen
+field is used to limit the size of the set of addresses,
+and can have any value between 24 and 32.
+If not specified, it defaults to 24.
+.br
+This format is particularly useful to handle sparse address sets
+within a single rule.
+Because the matching occurs using a
+bitmask, it takes constant time and dramatically reduces
+the complexity of rulesets.
+.br
+As an example, an address specified as 1.2.3.4/24{128,35-55,89}
+or 1.2.3.0/24{128,35-55,89}
+will match the following IP addresses:
+.br
+1.2.3.128, 1.2.3.35 to 1.2.3.55, 1.2.3.89 .
+.It Ar addr6-list : ip6-addr Ns Op Ns , Ns Ar addr6-list
+.It Ar ip6-addr :
+A host or subnet specified in one of the following ways:
+.Bl -tag -width indent
+.It Ar numeric-ip | hostname
+Matches a single IPv6 address as allowed by
+.Xr inet_pton 3
+or a hostname.
+Hostnames are resolved at the time the rule is added to the firewall
+list.
+.It Ar addr Ns / Ns Ar masklen
+Matches all IPv6 addresses with base
+.Ar addr
+(specified as allowed by
+.Xr inet_pton 3
+or a hostname)
+and mask width of
+.Cm masklen
+bits.
+.El
+.Pp
+No support for sets of IPv6 addresses is provided because IPv6 addresses
+are typically random past the initial prefix.
+.It Ar ports : Bro Ar port | port Ns \&- Ns Ar port Ns Brc Ns Op , Ns Ar ports
+For protocols which support port numbers (such as TCP and UDP), optional
+.Cm ports
+may be specified as one or more ports or port ranges, separated
+by commas but no spaces, and an optional
+.Cm not
+operator.
+The
+.Ql \&-
+notation specifies a range of ports (including boundaries).
+.Pp
+Service names (from
+.Pa /etc/services )
+may be used instead of numeric port values.
+The length of the port list is limited to 30 ports or ranges,
+though one can specify larger ranges by using an
+.Em or-block
+in the
+.Cm options
+section of the rule.
+.Pp
+A backslash
+.Pq Ql \e
+can be used to escape the dash
+.Pq Ql -
+character in a service name (from a shell, the backslash must be
+typed twice to avoid the shell itself interpreting it as an escape
+character).
+.Pp
+.Dl "ipfw add count tcp from any ftp\e\e-data-ftp to any"
+.Pp
+Fragmented packets which have a non-zero offset (i.e., not the first
+fragment) will never match a rule which has one or more port
+specifications.
+See the
+.Cm frag
+option for details on matching fragmented packets.
+.El
+.Ss RULE OPTIONS (MATCH PATTERNS)
+Additional match patterns can be used within
+rules.
+Zero or more of these so-called
+.Em options
+can be present in a rule, optionally prefixed by the
+.Cm not
+operand, and possibly grouped into
+.Em or-blocks .
+.Pp
+The following match patterns can be used (listed in alphabetical order):
+.Bl -tag -width indent
+.It Cm // this is a comment.
+Inserts the specified text as a comment in the rule.
+Everything following // is considered a comment and stored in the rule.
+You can have comment-only rules, which are listed as having a
+.Cm count
+action followed by the comment.
+.It Cm bridged
+Alias for
+.Cm layer2 .
+.It Cm diverted
+Matches only packets generated by a divert socket.
+.It Cm diverted-loopback
+Matches only packets coming from a divert socket back into the IP stack
+input for delivery.
+.It Cm diverted-output
+Matches only packets going from a divert socket back outward to the IP
+stack output for delivery.
+.It Cm dst-ip Ar ip-address
+Matches IPv4 packets whose destination IP is one of the address(es)
+specified as argument.
+.It Bro Cm dst-ip6 | dst-ipv6 Brc Ar ip6-address
+Matches IPv6 packets whose destination IP is one of the address(es)
+specified as argument.
+.It Cm dst-port Ar ports
+Matches IP packets whose destination port is one of the port(s)
+specified as argument.
+.It Cm established
+Matches TCP packets that have the RST or ACK bits set.
+.It Cm ext6hdr Ar header
+Matches IPv6 packets containing the extended header given by
+.Ar header .
+Supported headers are:
+.Pp
+Fragment,
+.Pq Cm frag ,
+Hop-by-hop options
+.Pq Cm hopopt ,
+any type of Routing Header
+.Pq Cm route ,
+Source routing Routing Header Type 0
+.Pq Cm rthdr0 ,
+Mobile IPv6 Routing Header Type 2
+.Pq Cm rthdr2 ,
+Destination options
+.Pq Cm dstopt ,
+IPSec authentication headers
+.Pq Cm ah ,
+and IPsec encapsulated security payload headers
+.Pq Cm esp .
+.It Cm fib Ar fibnum
+Matches a packet that has been tagged to use
+the given FIB (routing table) number.
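+.Pp
+A minimal sketch of matching on a FIB annotation (the FIB and rule
+numbers and the address are placeholders; the kernel must be built
+with multiple FIBs):
+.Bd -literal -offset indent
+# bind traffic from 10.0.0.0/8 to routing table 4 ...
+ipfw add 100 setfib 4 ip from 10.0.0.0/8 to any in
+# ... and match on that annotation later
+ipfw add 200 count ip from any to any fib 4
+.Ed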
+.It Cm flow Ar table Ns Pq Ar name Ns Op , Ns Ar value
+Search for the flow entry in lookup table
+.Ar name .
+If not found, the match fails.
+Otherwise, the match succeeds and
+.Cm tablearg
+is set to the value extracted from the table.
+.Pp
+This option can be useful to quickly dispatch traffic based on
+certain packet fields.
+See the
+.Sx LOOKUP TABLES
+section below for more information on lookup tables.
+.It Cm flow-id Ar labels
+Matches IPv6 packets containing any of the flow labels given in
+.Ar labels .
+.Ar labels
+is a comma separated list of numeric flow labels.
+.It Cm frag
+Matches packets that are fragments and not the first
+fragment of an IP datagram.
+Note that these packets will not have
+the next protocol header (e.g.\& TCP, UDP) so options that look into
+these headers cannot match.
+.It Cm gid Ar group
+Matches all TCP or UDP packets sent by or received for a
+.Ar group .
+A
+.Ar group
+may be specified by name or number.
+.It Cm jail Ar prisonID
+Matches all TCP or UDP packets sent by or received for the
+jail whose prison ID is
+.Ar prisonID .
+.It Cm icmptypes Ar types
+Matches ICMP packets whose ICMP type is in the list
+.Ar types .
+The list may be specified as any combination of
+individual types (numeric) separated by commas.
+.Em Ranges are not allowed .
+The supported ICMP types are:
+.Pp
+echo reply
+.Pq Cm 0 ,
+destination unreachable
+.Pq Cm 3 ,
+source quench
+.Pq Cm 4 ,
+redirect
+.Pq Cm 5 ,
+echo request
+.Pq Cm 8 ,
+router advertisement
+.Pq Cm 9 ,
+router solicitation
+.Pq Cm 10 ,
+time-to-live exceeded
+.Pq Cm 11 ,
+IP header bad
+.Pq Cm 12 ,
+timestamp request
+.Pq Cm 13 ,
+timestamp reply
+.Pq Cm 14 ,
+information request
+.Pq Cm 15 ,
+information reply
+.Pq Cm 16 ,
+address mask request
+.Pq Cm 17
+and address mask reply
+.Pq Cm 18 .
+.It Cm icmp6types Ar types
+Matches ICMP6 packets whose ICMP6 type is in the list of
+.Ar types .
+The list may be specified as any combination of
+individual types (numeric) separated by commas.
+.Em Ranges are not allowed .
+.It Cm in | out
+Matches incoming or outgoing packets, respectively.
+.Cm in
+and
+.Cm out
+are mutually exclusive (in fact,
+.Cm out
+is implemented as
+.Cm not in Ns No ).
+.It Cm ipid Ar id-list
+Matches IPv4 packets whose
+.Cm ip_id
+field has a value included in
+.Ar id-list ,
+which is either a single value or a list of values or ranges
+specified in the same way as
+.Ar ports .
+.It Cm iplen Ar len-list
+Matches IP packets whose total length, including header and data, is
+in the set
+.Ar len-list ,
+which is either a single value or a list of values or ranges
+specified in the same way as
+.Ar ports .
+.It Cm ipoptions Ar spec
+Matches packets whose IPv4 header contains the comma separated list of
+options specified in
+.Ar spec .
+The supported IP options are:
+.Pp
+.Cm ssrr
+(strict source route),
+.Cm lsrr
+(loose source route),
+.Cm rr
+(record packet route) and
+.Cm ts
+(timestamp).
+The absence of a particular option may be denoted
+with a
+.Ql \&! .
+.It Cm ipprecedence Ar precedence
+Matches IPv4 packets whose precedence field is equal to
+.Ar precedence .
+.It Cm ipsec
+Matches packets that have IPSEC history associated with them
+(i.e., the packet comes encapsulated in IPSEC, the kernel
+has IPSEC support and IPSEC_FILTERTUNNEL option, and can correctly
+decapsulate it).
+.Pp
+Note that specifying
+.Cm ipsec
+is different from specifying
+.Cm proto Ar ipsec
+as the latter will only look at the specific IP protocol field,
+irrespective of IPSEC kernel support and the validity of the IPSEC data.
+.Pp
+Further note that this flag is silently ignored in kernels without
+IPSEC support.
+When given, it does not affect rule processing, and the rules are
+handled as if the
+.Cm ipsec
+flag were not present.
+.It Cm iptos Ar spec
+Matches IPv4 packets whose
+.Cm tos
+field contains the comma separated list of
+service types specified in
+.Ar spec .
+The supported IP types of service are:
+.Pp
+.Cm lowdelay
+.Pq Dv IPTOS_LOWDELAY ,
+.Cm throughput
+.Pq Dv IPTOS_THROUGHPUT ,
+.Cm reliability
+.Pq Dv IPTOS_RELIABILITY ,
+.Cm mincost
+.Pq Dv IPTOS_MINCOST ,
+.Cm congestion
+.Pq Dv IPTOS_ECN_CE .
+The absence of a particular type may be denoted
+with a
+.Ql \&! .
+.It Cm dscp spec Ns Op , Ns Ar spec
+Matches IPv4/IPv6 packets whose
+.Cm DS
+field value is contained in
+.Ar spec
+mask.
+Multiple values can be specified via
+the comma separated list.
+The value can be one of the keywords used in the
+.Cm setdscp
+action or an exact number.
+.It Cm ipttl Ar ttl-list
+Matches IPv4 packets whose time to live is included in
+.Ar ttl-list ,
+which is either a single value or a list of values or ranges
+specified in the same way as
+.Ar ports .
+.It Cm ipversion Ar ver
+Matches IP packets whose IP version field is
+.Ar ver .
+.It Cm keep-state
+Upon a match, the firewall will create a dynamic rule, whose
+default behaviour is to match bidirectional traffic between
+source and destination IP/port using the same protocol.
+The rule has a limited lifetime (controlled by a set of
+.Xr sysctl 8
+variables), and the lifetime is refreshed every time a matching
+packet is found.
+.It Cm layer2
+Matches only layer2 packets, i.e., those passed to
+.Nm
+from ether_demux() and ether_output_frame().
+.It Cm limit Bro Cm src-addr | src-port | dst-addr | dst-port Brc Ar N
+The firewall will only allow
+.Ar N
+connections with the same
+set of parameters as specified in the rule.
+One or more
+of source and destination addresses and ports can be
+specified.
+Currently,
+only IPv4 flows are supported.
+.It Cm lookup Bro Cm dst-ip | dst-port | src-ip | src-port | uid | jail Brc Ar name
+Search for an entry in lookup table
+.Ar name
+that matches the field specified as argument.
+If not found, the match fails.
+Otherwise, the match succeeds and
+.Cm tablearg
+is set to the value extracted from the table.
+.Pp
+This option can be useful to quickly dispatch traffic based on
+certain packet fields.
+See the
+.Sx LOOKUP TABLES
+section below for more information on lookup tables.
+.It Cm { MAC | mac } Ar dst-mac src-mac
+Match packets with given
+.Ar dst-mac
+and
+.Ar src-mac
+addresses, specified as the
+.Cm any
+keyword (matching any MAC address), or six groups of hex digits
+separated by colons,
+and optionally followed by a mask indicating the significant bits.
+The mask may be specified using either of the following methods:
+.Bl -enum -width indent
+.It
+A slash
+.Pq /
+followed by the number of significant bits.
+For example, an address with 33 significant bits could be specified as:
+.Pp
+.Dl "MAC 10:20:30:40:50:60/33 any"
+.It
+An ampersand
+.Pq &
+followed by a bitmask specified as six groups of hex digits separated
+by colons.
+For example, an address in which the last 16 bits are significant could
+be specified as:
+.Pp
+.Dl "MAC 10:20:30:40:50:60&00:00:00:00:ff:ff any"
+.Pp
+Note that the ampersand character has a special meaning in many shells
+and should generally be escaped.
+.El
+Note that the order of MAC addresses (destination first,
+source second) is
+the same as on the wire, but the opposite of the one used for
+IP addresses.
+.It Cm mac-type Ar mac-type
+Matches packets whose Ethernet Type field
+corresponds to one of those specified as argument.
+.Ar mac-type
+is specified in the same way as
+.Cm port numbers
+(i.e., one or more comma-separated single values or ranges).
+You can use symbolic names for known values such as
+.Em vlan , ipv4, ipv6 .
+Values can be entered as decimal or hexadecimal (if prefixed by 0x),
+and they are always printed as hexadecimal (unless the
+.Cm -N
+option is used, in which case symbolic resolution will be attempted).
+.It Cm proto Ar protocol
+Matches packets with the corresponding IP protocol.
+.It Cm recv | xmit | via Brq Ar ifX | Ar if Ns Cm * | Ar table Ns Po Ar name Ns Oo , Ns Ar value Oc Pc | Ar ipno | Ar any
+Matches packets received, transmitted or going through,
+respectively, the interface specified by exact name
+.Po Ar ifX Pc ,
+by device name
+.Po Ar if* Pc ,
+by IP address, or through some interface.
+Table
+.Ar name
+may be used to match the interface by its kernel ifindex.
+See the
+.Sx LOOKUP TABLES
+section below for more information on lookup tables.
+.Pp
+The
+.Cm via
+keyword causes the interface to always be checked.
+If
+.Cm recv
+or
+.Cm xmit
+is used instead of
+.Cm via ,
+then only the receive or transmit interface (respectively)
+is checked.
+By specifying both, it is possible to match packets based on
+both receive and transmit interface, e.g.:
+.Pp
+.Dl "ipfw add deny ip from any to any out recv ed0 xmit ed1"
+.Pp
+The
+.Cm recv
+interface can be tested on either incoming or outgoing packets,
+while the
+.Cm xmit
+interface can only be tested on outgoing packets.
+So
+.Cm out
+is required (and
+.Cm in
+is invalid) whenever
+.Cm xmit
+is used.
+.Pp
+A packet might not have a receive or transmit interface: packets
+originating from the local host have no receive interface,
+while packets destined for the local host have no transmit
+interface.
+.It Cm setup
+Matches TCP packets that have the SYN bit set but no ACK bit.
+This is the short form of
+.Dq Li tcpflags\ syn,!ack .
+.It Cm sockarg
+Matches packets that are associated with a local socket and
+for which the SO_USER_COOKIE socket option has been set
+to a non-zero value.
+As a side effect, the value of the
+option is made available as
+.Cm tablearg
+value, which in turn can be used as
+.Cm skipto
+or
+.Cm pipe
+number.
+.It Cm src-ip Ar ip-address
+Matches IPv4 packets whose source IP is one of the address(es)
+specified as an argument.
+.It Cm src-ip6 Ar ip6-address
+Matches IPv6 packets whose source IP is one of the address(es)
+specified as an argument.
+.It Cm src-port Ar ports
+Matches IP packets whose source port is one of the port(s)
+specified as argument.
+.It Cm tagged Ar tag-list
+Matches packets whose tags are included in
+.Ar tag-list ,
+which is either a single value or a list of values or ranges
+specified in the same way as
+.Ar ports .
+Tags can be applied to the packet using the
+.Cm tag
+rule action parameter (see its description for details on tags).
+.It Cm tcpack Ar ack
+TCP packets only.
+Match if the TCP header acknowledgment number field is set to
+.Ar ack .
+.It Cm tcpdatalen Ar tcpdatalen-list
+Matches TCP packets whose TCP data length is
+.Ar tcpdatalen-list ,
+which is either a single value or a list of values or ranges
+specified in the same way as
+.Ar ports .
+.It Cm tcpflags Ar spec
+TCP packets only.
+Match if the TCP header contains the comma separated list of
+flags specified in
+.Ar spec .
+The supported TCP flags are:
+.Pp
+.Cm fin ,
+.Cm syn ,
+.Cm rst ,
+.Cm psh ,
+.Cm ack
+and
+.Cm urg .
+The absence of a particular flag may be denoted
+with a
+.Ql \&! .
+A rule which contains a
+.Cm tcpflags
+specification can never match a fragmented packet which has
+a non-zero offset.
+See the
+.Cm frag
+option for details on matching fragmented packets.
+.It Cm tcpseq Ar seq
+TCP packets only.
+Match if the TCP header sequence number field is set to
+.Ar seq .
+.It Cm tcpwin Ar tcpwin-list
+Matches TCP packets whose header window field is set to
+.Ar tcpwin-list ,
+which is either a single value or a list of values or ranges
+specified in the same way as
+.Ar ports .
+.It Cm tcpoptions Ar spec
+TCP packets only.
+Match if the TCP header contains the comma separated list of
+options specified in
+.Ar spec .
+The supported TCP options are:
+.Pp
+.Cm mss
+(maximum segment size),
+.Cm window
+(tcp window advertisement),
+.Cm sack
+(selective ack),
+.Cm ts
+(rfc1323 timestamp) and
+.Cm cc
+(rfc1644 t/tcp connection count).
+The absence of a particular option may be denoted
+with a
+.Ql \&! .
+.It Cm uid Ar user
+Match all TCP or UDP packets sent by or received for a
+.Ar user .
+A
+.Ar user
+may be matched by name or identification number.
+.It Cm verrevpath
+For incoming packets,
+a routing table lookup is done on the packet's source address.
+If the interface on which the packet entered the system matches the
+outgoing interface for the route,
+the packet matches.
+If the interfaces do not match up,
+the packet does not match.
+All outgoing packets or packets with no incoming interface match.
+.Pp
+The name and functionality of the option is intentionally similar to
+the Cisco IOS command:
+.Pp
+.Dl ip verify unicast reverse-path
+.Pp
+This option can be used to make anti-spoofing rules to reject all
+packets with source addresses not from this interface.
+See also the option
+.Cm antispoof .
+.It Cm versrcreach
+For incoming packets,
+a routing table lookup is done on the packet's source address.
+If a route to the source address exists, but not the default route
+or a blackhole/reject route, the packet matches.
+Otherwise, the packet does not match.
+All outgoing packets match.
+.Pp
+The name and functionality of the option is intentionally similar to
+the Cisco IOS command:
+.Pp
+.Dl ip verify unicast source reachable-via any
+.Pp
+This option can be used to make anti-spoofing rules to reject all
+packets whose source address is unreachable.
+.It Cm antispoof
+For incoming packets, the packet's source address is checked to see
+whether it belongs to a directly connected network.
+If the network is directly connected, then the interface the packet
+came in on is compared to the interface the network is connected to.
+When the incoming interface and the directly connected interface are
+not the same, the packet does not match.
+Otherwise, the packet does match.
+All outgoing packets match.
+.Pp
+This option can be used to make anti-spoofing rules to reject all
+packets that pretend to be from a directly connected network but do
+not come in through that interface.
+This option is similar to but more restricted than
+.Cm verrevpath
+because it engages only on packets with source addresses of directly
+connected networks instead of all source addresses.
+.El
+.Sh LOOKUP TABLES
+Lookup tables are useful to handle large sparse sets of
+addresses or other search keys (e.g., ports, jail IDs, interface names).
+In the rest of this section we will use the term ``key''.
+A table name needs to match the following spec:
+.Ar table-name .
+Tables with the same name can be created in different
+.Ar sets .
+However, rules link to the tables in
+.Ar set 0
+by default.
+This behavior can be controlled by the
+.Va net.inet.ip.fw.tables_sets
+variable.
+See the
+.Sx SETS OF RULES
+section for more information.
+There may be up to 65535 different lookup tables.
+.Pp
+The following table types are supported:
+.Bl -tag -width indent
+.It Ar table-type : Ar addr | iface | number | flow
+.It Ar table-key : Ar addr Ns Oo / Ns Ar masklen Oc | iface-name | number | flow-spec
+.It Ar flow-spec : Ar flow-field Ns Op , Ns Ar flow-spec
+.It Ar flow-field : src-ip | proto | src-port | dst-ip | dst-port
+.It Cm addr
+matches IPv4 or IPv6 addresses.
+Each entry is represented by an
+.Ar addr Ns Op / Ns Ar masklen
+and will match all addresses with base
+.Ar addr
+(specified as an IPv4/IPv6 address, or a hostname) and mask width of
+.Ar masklen
+bits.
+If
+.Ar masklen
+is not specified, it defaults to 32 for IPv4 and 128 for IPv6.
+When looking up an IP address in a table, the most specific
+entry will match.
+.It Cm iface
+matches interface names.
+Each entry is represented by a string treated as an interface name.
+Wildcards are not supported.
+.It Cm number
+matches protocol ports, uids/gids or jail IDs.
+Each entry is represented by a 32-bit unsigned integer.
+Ranges are not supported.
+.It Cm flow
+Matches packet fields specified by
+.Ar flow
+type suboptions with table entries.
+.El
+.Pp
+Tables require explicit creation via
+.Cm create
+before use.
+.Pp
+The following creation options are supported:
+.Bl -tag -width indent
+.It Ar create-options : Ar create-option | create-options
+.It Ar create-option : Cm type Ar table-type | Cm valtype Ar value-mask | Cm algo Ar algo-desc |
+.Cm limit Ar number | Cm locked
+.It Cm type
+Table key type.
+.It Cm valtype
+Table value mask.
+.It Cm algo
+Table algorithm to use (see below).
+.It Cm limit
+Maximum number of items that may be inserted into the table.
+.It Cm locked
+Restrict any table modifications.
+.El
+.Pp
+Some of these options may be modified later via the
+.Cm modify
+keyword.
+The following options can be changed:
+.Bl -tag -width indent
+.It Ar modify-options : Ar modify-option | modify-options
+.It Ar modify-option : Cm limit Ar number
+.It Cm limit
+Alter the maximum number of items that may be inserted into the table.
+.El
+.Pp
+Additionally, a table can be locked or unlocked using the
+.Cm lock
+or
+.Cm unlock
+commands.
+.Pp
+Tables of the same
+.Ar type
+can be swapped with each other using the
+.Cm swap Ar name
+command.
+Swap may fail if table limits are set and the data exchange
+would exceed a limit.
+The operation is performed atomically.
+.Pp
+One or more entries can be added to a table at once using the
+.Cm add
+command.
+All additions are performed in a single operation.
+By default, an error in the addition of one entry does not influence
+the addition of other entries; however, a non-zero error code is
+returned in that case.
+The special
+.Cm atomic
+keyword may be specified before
+.Cm add
+to indicate an all-or-none add request.
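+.Pp
+A minimal sketch of table creation and population (the table name,
+keys, values and rule number here are placeholders):
+.Bd -literal -offset indent
+ipfw table T1 create type addr valtype skipto
+ipfw table T1 add 10.0.0.0/24 1000
+# all-or-none addition of several entries
+ipfw table T1 atomic add 192.168.0.0/27 2000 172.16.0.0/12 3000
+# use the looked-up value as a computed skipto target
+ipfw add 100 skipto tablearg ip from 'table(T1)' to any
+.Ed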
+.Pp
+One or more entries can be removed from a table at once using the
+.Cm delete
+command.
+By default, an error in the removal of one entry does not influence
+the removal of other entries; however, a non-zero error code is
+returned in that case.
+.Pp
+It may be possible to check which entry will be found for a particular
+.Ar table-key
+using the
+.Cm lookup
+.Ar table-key
+command.
+This functionality is optional and may be unsupported in some algorithms.
+.Pp
+The following operations can be performed on
+.Ar one
+or
+.Cm all
+tables:
+.Bl -tag -width indent
+.It Cm list
+Lists all entries.
+.It Cm flush
+Removes all entries.
+.It Cm info
+Shows generic table information.
+.It Cm detail
+Shows generic table information and algo-specific data.
+.El
+.Pp
+The following lookup algorithms are supported:
+.Bl -tag -width indent
+.It Ar algo-desc : algo-name | "algo-name algo-data"
+.It Ar algo-name: Ar addr:radix | addr:hash | iface:array | number:array | flow:hash
+.It Cm addr:radix
+Separate Radix trees for IPv4 and IPv6, the same way as the routing table (see
+.Xr route 4 ) .
+Default choice for the
+.Ar addr
+type.
+.It Cm addr:hash
+Separate auto-growing hashes for IPv4 and IPv6.
+Accepts entries with the same mask length specified initially via the
+.Cm "addr:hash masks=/v4,/v6"
+algorithm creation options.
+Assumes /32 and /128 masks by default.
+A search removes the host bits (according to the mask) from the supplied
+address and checks the resulting key in the appropriate hash.
+Mostly optimized for /64 and byte-ranged IPv6 masks.
+.It Cm iface:array
+An array storing sorted indexes for entries which are present in the system.
+Optimized for very fast lookup.
+.It Cm number:array
+An array storing sorted u32 numbers.
+.It Cm flow:hash
+An auto-growing hash storing flow entries.
+A search calculates a hash on the required packet fields and searches for
+matching entries in the selected bucket.
+.El
+.Pp
+The
+.Cm tablearg
+feature provides the ability to use a value, looked up in the table, as
+the argument for a rule action, action parameter or rule option.
+This can significantly reduce the number of rules in some configurations.
+If two tables are used in a rule, the result of the second (destination)
+is used.
+.Pp
+Each record may hold one or more values according to
+.Ar value-mask .
+This mask is set on table creation via the
+.Cm valtype
+option.
+The following value types are supported:
+.Bl -tag -width indent
+.It Ar value-mask : Ar value-type Ns Op , Ns Ar value-mask
+.It Ar value-type : Ar skipto | pipe | fib | nat | dscp | tag | divert |
+.Ar netgraph | limit | ipv4
+.It Cm skipto
+Rule number to jump to.
+.It Cm pipe
+Pipe number to use.
+.It Cm fib
+FIB number to match/set.
+.It Cm nat
+NAT number to jump to.
+.It Cm dscp
+DSCP value to match/set.
+.It Cm tag
+Tag number to match/set.
+.It Cm divert
+Port number to divert traffic to.
+.It Cm netgraph
+Hook number to move the packet to.
+.It Cm limit
+Maximum number of connections.
+.It Cm ipv4
+IPv4 nexthop to fwd packets to.
+.El
+.Pp
+The
+.Cm tablearg
+argument can be used with the following actions:
+.Cm nat, pipe , queue, divert, tee, netgraph, ngtee, fwd, skipto, setfib,
+action parameters:
+.Cm tag, untag,
+rule options:
+.Cm limit, tagged.
+.Pp
+When used with the
+.Cm skipto
+action, the user should be aware that the code will walk the ruleset
+up to a rule equal to, or past, the given number.
+.Pp
+See the
+.Sx EXAMPLES
+Section for example usage of tables and the tablearg keyword.
+.Sh SETS OF RULES
+Each rule or table belongs to one of 32 different
+.Em sets ,
+numbered 0 to 31.
+Set 31 is reserved for the default rule.
+.Pp
+By default, rules or tables are put in set 0, unless you use the
+.Cm set N
+attribute when adding a new rule or table.
+Sets can be individually and atomically enabled or disabled,
+so this mechanism permits an easy way to store multiple configurations
+of the firewall and quickly (and atomically) switch between them.
+.Pp
+By default, tables from set 0 are referenced when adding a rule with
+table opcodes, regardless of the rule's set.
+This behavior can be changed by setting the
+.Va net.inet.ip.fw.tables_sets
+variable to 1.
+The rule's set will then be used for table references.
+.Pp
+The command to enable/disable sets is
+.Bd -ragged -offset indent
+.Nm
+.Cm set Oo Cm disable Ar number ... Oc Op Cm enable Ar number ...
+.Ed
+.Pp
+where multiple
+.Cm enable
+or
+.Cm disable
+sections can be specified.
+Command execution is atomic on all the sets specified in the command.
+By default, all sets are enabled.
+.Pp
+When you disable a set, its rules behave as if they do not exist
+in the firewall configuration, with only one exception:
+.Bd -ragged -offset indent
+dynamic rules created from a rule before it had been disabled
+will still be active until they expire.
+In order to delete
+dynamic rules you have to explicitly delete the parent rule
+which generated them.
+.Ed
+.Pp
+The set number of rules can be changed with the command
+.Bd -ragged -offset indent
+.Nm
+.Cm set move
+.Brq Cm rule Ar rule-number | old-set
+.Cm to Ar new-set
+.Ed
+.Pp
+Also, you can atomically swap two rulesets with the command
+.Bd -ragged -offset indent
+.Nm
+.Cm set swap Ar first-set second-set
+.Ed
+.Pp
+See the
+.Sx EXAMPLES
+Section for some possible uses of sets of rules.
+.Sh STATEFUL FIREWALL
+Stateful operation is a way for the firewall to dynamically
+create rules for specific flows when packets that
+match a given pattern are detected.
+Support for stateful
+operation comes through the
+.Cm check-state , keep-state
+and
+.Cm limit
+options of
+.Nm rules .
+.Pp
+Dynamic rules are created when a packet matches a
+.Cm keep-state
+or
+.Cm limit
+rule, causing the creation of a
+.Em dynamic
+rule which will match all and only packets with
+a given
+.Em protocol
+between a
+.Em src-ip/src-port dst-ip/dst-port
+pair of addresses
+.Em ( src
+and
+.Em dst
+are used here only to denote the initial match addresses, but they
+are completely equivalent afterwards).
+Dynamic rules will be checked at the first
+.Cm check-state, keep-state
+or
+.Cm limit
+occurrence, and the action performed upon a match will be the same
+as in the parent rule.
+.Pp
+Note that no additional attributes other than protocol and IP addresses
+and ports are checked on dynamic rules.
+.Pp
+The typical use of dynamic rules is to keep a closed firewall configuration,
+but let the first TCP SYN packet from the inside network install a
+dynamic rule for the flow so that packets belonging to that session
+will be allowed through the firewall:
+.Pp
+.Dl "ipfw add check-state"
+.Dl "ipfw add allow tcp from my-subnet to any setup keep-state"
+.Dl "ipfw add deny tcp from any to any"
+.Pp
+A similar approach can be used for UDP, where a UDP packet coming
+from the inside will install a dynamic rule to let the response through
+the firewall:
+.Pp
+.Dl "ipfw add check-state"
+.Dl "ipfw add allow udp from my-subnet to any keep-state"
+.Dl "ipfw add deny udp from any to any"
+.Pp
+Dynamic rules expire after some time, which depends on the status
+of the flow and the setting of some
+.Cm sysctl
+variables.
+See Section
+.Sx SYSCTL VARIABLES
+for more details.
+For TCP sessions, dynamic rules can be instructed to periodically
+send keepalive packets to refresh the state of the rule when it is
+about to expire.
+.Pp
+See Section
+.Sx EXAMPLES
+for more examples on how to use dynamic rules.
+.Sh TRAFFIC SHAPER (DUMMYNET) CONFIGURATION
+.Nm
+is also the user interface for the
+.Nm dummynet
+traffic shaper, packet scheduler and network emulator, a subsystem that
+can artificially queue, delay or drop packets,
+emulating the behaviour of certain network links
+or queueing systems.
+.Pp
+.Nm dummynet
+operates by first using the firewall to select packets
+using any match pattern that can be used in
+.Nm
+rules.
+Matching packets are then passed to either of two
+different objects, which implement the traffic regulation:
+.Bl -hang -offset XXXX
+.It Em pipe
+A
+.Em pipe
+emulates a
+.Em link
+with given bandwidth and propagation delay,
+driven by a FIFO scheduler and a single queue with programmable
+queue size and packet loss rate.
+Packets are appended to the queue as they come out from
+.Nm ipfw ,
+and then transferred in FIFO order to the link at the desired rate.
+.It Em queue
+A
+.Em queue
+is an abstraction used to implement packet scheduling
+using one of several packet scheduling algorithms.
+Packets sent to a
+.Em queue
+are first grouped into flows according to a mask on the 5-tuple.
+Flows are then passed to the scheduler associated with the
+.Em queue ,
+and each flow uses scheduling parameters (weight and others)
+as configured in the
+.Em queue
+itself.
+A scheduler in turn is connected to an emulated link,
+and arbitrates the link's bandwidth among backlogged flows according to
+weights and to the features of the scheduling algorithm in use.
+.El
+.Pp
+In practice,
+.Em pipes
+can be used to set hard limits on the bandwidth that a flow can use, whereas
+.Em queues
+can be used to determine how different flows share the available bandwidth.
+.Pp
+A graphical representation of the binding of queues,
+flows, schedulers and links is below.
+.Bd -literal -offset indent
+                 (flow_mask|sched_mask)  sched_mask
+         +---------+   weight Wx  +-------------+
+         |         |->-[flow]-->--|             |-+
+    -->--| QUEUE x |   ...        |             | |
+         |         |->-[flow]-->--| SCHEDuler N | |
+         +---------+              |             | |
+             ...                  |             +--[LINK N]-->--
+         +---------+   weight Wy  |             | +--[LINK N]-->--
+         |         |->-[flow]-->--|             | |
+    -->--| QUEUE y |   ...        |             | |
+         |         |->-[flow]-->--|             | |
+         +---------+              +-------------+ |
+                                    +-------------+
+.Ed
+It is important to understand the role of the SCHED_MASK
+and FLOW_MASK, which are configured through the commands
+.Dl "ipfw sched N config mask SCHED_MASK ..."
+and
+.Dl "ipfw queue X config mask FLOW_MASK ..." .
+.Pp
+The SCHED_MASK is used to assign flows to one or more
+scheduler instances, one for each
+value of the packet's 5-tuple after applying SCHED_MASK.
+As an example, using ``src-ip 0xffffff00'' creates one instance
+for each /24 source subnet.
+.Pp
+The FLOW_MASK, together with the SCHED_MASK, is used to split
+packets into flows.
+As an example, using
+``src-ip 0x000000ff''
+together with the previous SCHED_MASK makes a flow for
+each individual source address.
+In turn, flows for each /24
+subnet will be sent to the same scheduler instance.
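+.Pp
+For example (the object numbers and trailing parameters here are
+placeholders), the two masks discussed above could be combined as
+.Pp
+.Dl "ipfw sched 10 config mask src-ip 0xffffff00 ..."
+.Dl "ipfw queue 10 config mask src-ip 0x000000ff ..."
+.Pp
+giving one scheduler instance per /24 source subnet, with a separate
+flow for each source host within each instance.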
+.Pp +The above diagram holds even for the +.Em pipe +case, with the only restriction that a +.Em pipe +only supports a SCHED_MASK, and forces the use of a FIFO +scheduler (these are for backward compatibility reasons; +in fact, internally, a +.Nm dummynet's +pipe is implemented exactly as above). +.Pp +There are two modes of +.Nm dummynet +operation: +.Dq normal +and +.Dq fast . +The +.Dq normal +mode tries to emulate a real link: the +.Nm dummynet +scheduler ensures that the packet will not leave the pipe faster than it +would on the real link with a given bandwidth. +The +.Dq fast +mode allows certain packets to bypass the +.Nm dummynet +scheduler (if packet flow does not exceed pipe's bandwidth). +This is the reason why the +.Dq fast +mode requires less CPU cycles per packet (on average) and packet latency +can be significantly lower in comparison to a real link with the same +bandwidth. +The default mode is +.Dq normal . +The +.Dq fast +mode can be enabled by setting the +.Va net.inet.ip.dummynet.io_fast +.Xr sysctl 8 +variable to a non-zero value. +.Pp +.Ss PIPE, QUEUE AND SCHEDULER CONFIGURATION +The +.Em pipe , +.Em queue +and +.Em scheduler +configuration commands are the following: +.Bd -ragged -offset indent +.Cm pipe Ar number Cm config Ar pipe-configuration +.Pp +.Cm queue Ar number Cm config Ar queue-configuration +.Pp +.Cm sched Ar number Cm config Ar sched-configuration +.Ed +.Pp +The following parameters can be configured for a pipe: +.Pp +.Bl -tag -width indent -compact +.It Cm bw Ar bandwidth | device +Bandwidth, measured in +.Sm off +.Op Cm K | M +.Brq Cm bit/s | Byte/s . +.Sm on +.Pp +A value of 0 (default) means unlimited bandwidth. +The unit must immediately follow the number, as in +.Pp +.Dl "ipfw pipe 1 config bw 300Kbit/s" +.Pp +If a device name is specified instead of a numeric value, as in +.Pp +.Dl "ipfw pipe 1 config bw tun0" +.Pp +then the transmit clock is supplied by the specified device. +At the moment only the +.Xr tun 4 +device supports this +functionality, for use in conjunction with +.Xr ppp 8 . +.Pp +.It Cm delay Ar ms-delay +Propagation delay, measured in milliseconds. +The value is rounded to the next multiple of the clock tick +(typically 10ms, but it is a good practice to run kernels +with +.Dq "options HZ=1000" +to reduce +the granularity to 1ms or less). +The default value is 0, meaning no delay. +.Pp +.It Cm burst Ar size +If the data to be sent exceeds the pipe's bandwidth limit +(and the pipe was previously idle), up to +.Ar size +bytes of data are allowed to bypass the +.Nm dummynet +scheduler, and will be sent as fast as the physical link allows. +Any additional data will be transmitted at the rate specified +by the +.Nm pipe +bandwidth. +The burst size depends on how long the pipe has been idle; +the effective burst size is calculated as follows: +MAX( +.Ar size +, +.Nm bw +* pipe_idle_time). +.Pp +.It Cm profile Ar filename +A file specifying the additional overhead incurred in the transmission +of a packet on the link. +.Pp +Some link types introduce extra delays in the transmission +of a packet, e.g., because of MAC level framing, contention on +the use of the channel, MAC level retransmissions and so on. +From our point of view, the channel is effectively unavailable +for this extra time, which is constant or variable depending +on the link type. +Additionally, packets may be dropped after this +time (e.g., on a wireless link after too many retransmissions). 
+We can model the additional delay with an empirical curve
+that represents its distribution.
+.Bd -literal -offset indent
+ cumulative probability
+  1.0 ^
+      |
+  L   +-- loss-level           x
+      |                  ******
+      |                 *
+      |            *****
+      |           *
+      |         **
+      |        *
+      +-------*------------------->
+                              delay
+.Ed
+The empirical curve may have both vertical and horizontal lines.
+Vertical lines represent constant delay for a range of
+probabilities.
+Horizontal lines correspond to a discontinuity in the delay
+distribution: the pipe will use the largest delay for a
+given probability.
+.Pp
+The file format is the following, with whitespace acting as
+a separator and '#' indicating the beginning of a comment:
+.Bl -tag -width indent
+.It Cm name Ar identifier
+optional name (listed by "ipfw pipe show")
+to identify the delay distribution;
+.It Cm bw Ar value
+the bandwidth used for the pipe.
+If not specified here, it must be present
+explicitly as a configuration parameter for the pipe;
+.It Cm loss-level Ar L
+the probability above which packets are lost
+(0.0 <= L <= 1.0, default 1.0, i.e., no loss);
+.It Cm samples Ar N
+the number of samples used in the internal
+representation of the curve (2..1024; default 100);
+.It Cm "delay prob" | "prob delay"
+One of these two lines is mandatory and defines
+the format of the following lines with data points.
+.It Ar XXX Ar YYY
+2 or more lines representing points in the curve,
+with either delay or probability first, according
+to the chosen format.
+The unit for delay is milliseconds.
+Data points do not need to be sorted.
+Also, the number of actual lines can be different
+from the value of the "samples" parameter: the
+.Nm
+utility will sort and interpolate
+the curve as needed.
+.El
+.Pp
+Example of a profile file:
+.Bd -literal -offset indent
+name          bla_bla_bla
+samples       100
+loss-level    0.86
+prob          delay
+0             200   # minimum overhead is 200ms
+0.5           200
+0.5           300
+0.8           1000
+0.9           1300
+1             1300
+#configuration file end
+.Ed
+.El
+.Pp
+The following parameters can be configured for a queue:
+.Pp
+.Bl -tag -width indent -compact
+.It Cm pipe Ar pipe_nr
+Connects a queue to the specified pipe.
+Multiple queues (with the same or different weights) can be connected to
+the same pipe, which specifies the aggregate rate for the set of queues.
+.Pp
+.It Cm weight Ar weight
+Specifies the weight to be used for flows matching this queue.
+The weight must be in the range 1..100, and defaults to 1.
+.El
+.Pp
+The following case-insensitive parameters can be configured for a
+scheduler:
+.Pp
+.Bl -tag -width indent -compact
+.It Cm type Ar {fifo | wf2q+ | rr | qfq}
+specifies the scheduling algorithm to use.
+.Bl -tag -width indent -compact
+.It Cm fifo
+is just a FIFO scheduler (which means that all packets
+are stored in the same queue as they arrive at the scheduler).
+FIFO has O(1) per-packet time complexity, with very low
+constants (estimate 60-80ns on a 2GHz desktop machine)
+but gives no service guarantees.
+.It Cm wf2q+
+implements the WF2Q+ algorithm, which is a Weighted Fair Queueing
+algorithm that permits flows to share bandwidth according to
+their weights.
+Note that weights are not priorities; even a flow
+with a minuscule weight will never starve.
+WF2Q+ has O(log N) per-packet processing cost, where N is the number
+of flows, and is the default algorithm used by previous versions of
+dummynet's queues.
+.It Cm rr +implements the Deficit Round Robin algorithm, which has O(1) processing +costs (roughly, 100-150ns per packet) +and permits bandwidth allocation according to weights, but +with poor service guarantees. +.It Cm qfq +implements the QFQ algorithm, which is a very fast variant of +WF2Q+, with similar service guarantees and O(1) processing +costs (roughly, 200-250ns per packet). +.El +.El +.Pp +In addition to the type, all parameters allowed for a pipe can also +be specified for a scheduler. +.Pp +Finally, the following parameters can be configured for both +pipes and queues: +.Pp +.Bl -tag -width XXXX -compact +.It Cm buckets Ar hash-table-size +Specifies the size of the hash table used for storing the +various queues. +Default value is 64 controlled by the +.Xr sysctl 8 +variable +.Va net.inet.ip.dummynet.hash_size , +allowed range is 16 to 65536. +.Pp +.It Cm mask Ar mask-specifier +Packets sent to a given pipe or queue by an +.Nm +rule can be further classified into multiple flows, each of which is then +sent to a different +.Em dynamic +pipe or queue. +A flow identifier is constructed by masking the IP addresses, +ports and protocol types as specified with the +.Cm mask +options in the configuration of the pipe or queue. +For each different flow identifier, a new pipe or queue is created +with the same parameters as the original object, and matching packets +are sent to it. +.Pp +Thus, when +.Em dynamic pipes +are used, each flow will get the same bandwidth as defined by the pipe, +whereas when +.Em dynamic queues +are used, each flow will share the parent's pipe bandwidth evenly +with other flows generated by the same queue (note that other queues +with different weights might be connected to the same pipe). +.br +Available mask specifiers are a combination of one or more of the following: +.Pp +.Cm dst-ip Ar mask , +.Cm dst-ip6 Ar mask , +.Cm src-ip Ar mask , +.Cm src-ip6 Ar mask , +.Cm dst-port Ar mask , +.Cm src-port Ar mask , +.Cm flow-id Ar mask , +.Cm proto Ar mask +or +.Cm all , +.Pp +where the latter means all bits in all fields are significant. +.Pp +.It Cm noerror +When a packet is dropped by a +.Nm dummynet +queue or pipe, the error +is normally reported to the caller routine in the kernel, in the +same way as it happens when a device queue fills up. +Setting this +option reports the packet as successfully delivered, which can be +needed for some experimental setups where you want to simulate +loss or congestion at a remote router. +.Pp +.It Cm plr Ar packet-loss-rate +Packet loss rate. +Argument +.Ar packet-loss-rate +is a floating-point number between 0 and 1, with 0 meaning no +loss, 1 meaning 100% loss. +The loss rate is internally represented on 31 bits. +.Pp +.It Cm queue Brq Ar slots | size Ns Cm Kbytes +Queue size, in +.Ar slots +or +.Cm KBytes . +Default value is 50 slots, which +is the typical queue size for Ethernet devices. +Note that for slow speed links you should keep the queue +size short or your traffic might be affected by a significant +queueing delay. +E.g., 50 max-sized ethernet packets (1500 bytes) mean 600Kbit +or 20s of queue on a 30Kbit/s pipe. +Even worse effects can result if you get packets from an +interface with a much larger MTU, e.g.\& the loopback interface +with its 16KB packets. +The +.Xr sysctl 8 +variables +.Em net.inet.ip.dummynet.pipe_byte_limit +and +.Em net.inet.ip.dummynet.pipe_slot_limit +control the maximum lengths that can be specified. 
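+.Pp
+For example, on a 30Kbit/s pipe a queue sized in bytes keeps the
+worst-case queueing delay bounded regardless of packet size:
+.Pp
+.Dl "ipfw pipe 1 config bw 30Kbit/s queue 5Kbytes"
+.Pp
+(5KBytes is 40Kbit, i.e., about 1.3 seconds of queue at that rate.)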
+.Pp +.It Cm red | gred Ar w_q Ns / Ns Ar min_th Ns / Ns Ar max_th Ns / Ns Ar max_p +[ecn] +Make use of the RED (Random Early Detection) queue management algorithm. +.Ar w_q +and +.Ar max_p +are floating +point numbers between 0 and 1 (inclusive), while +.Ar min_th +and +.Ar max_th +are integer numbers specifying thresholds for queue management +(thresholds are computed in bytes if the queue has been defined +in bytes, in slots otherwise). +The two parameters can also be of the same value if needed. The +.Nm dummynet +also supports the gentle RED variant (gred) and ECN (Explicit Congestion +Notification) as optional. Three +.Xr sysctl 8 +variables can be used to control the RED behaviour: +.Bl -tag -width indent +.It Va net.inet.ip.dummynet.red_lookup_depth +specifies the accuracy in computing the average queue +when the link is idle (defaults to 256, must be greater than zero) +.It Va net.inet.ip.dummynet.red_avg_pkt_size +specifies the expected average packet size (defaults to 512, must be +greater than zero) +.It Va net.inet.ip.dummynet.red_max_pkt_size +specifies the expected maximum packet size, only used when queue +thresholds are in bytes (defaults to 1500, must be greater than zero). +.El +.El +.Pp +When used with IPv6 data, +.Nm dummynet +currently has several limitations. +Information necessary to route link-local packets to an +interface is not available after processing by +.Nm dummynet +so those packets are dropped in the output path. +Care should be taken to ensure that link-local packets are not passed to +.Nm dummynet . +.Sh CHECKLIST +Here are some important points to consider when designing your +rules: +.Bl -bullet +.It +Remember that you filter both packets going +.Cm in +and +.Cm out . +Most connections need packets going in both directions. +.It +Remember to test very carefully. +It is a good idea to be near the console when doing this. +If you cannot be near the console, +use an auto-recovery script such as the one in +.Pa /usr/share/examples/ipfw/change_rules.sh . +.It +Do not forget the loopback interface. +.El +.Sh FINE POINTS +.Bl -bullet +.It +There are circumstances where fragmented datagrams are unconditionally +dropped. +TCP packets are dropped if they do not contain at least 20 bytes of +TCP header, UDP packets are dropped if they do not contain a full 8 +byte UDP header, and ICMP packets are dropped if they do not contain +4 bytes of ICMP header, enough to specify the ICMP type, code, and +checksum. +These packets are simply logged as +.Dq pullup failed +since there may not be enough good data in the packet to produce a +meaningful log entry. +.It +Another type of packet is unconditionally dropped, a TCP packet with a +fragment offset of one. +This is a valid packet, but it only has one use, to try +to circumvent firewalls. +When logging is enabled, these packets are +reported as being dropped by rule -1. +.It +If you are logged in over a network, loading the +.Xr kld 4 +version of +.Nm +is probably not as straightforward as you would think. +The following command line is recommended: +.Bd -literal -offset indent +kldload ipfw && \e +ipfw add 32000 allow ip from any to any +.Ed +.Pp +Along the same lines, doing an +.Bd -literal -offset indent +ipfw flush +.Ed +.Pp +in similar surroundings is also a bad idea. +.It +The +.Nm +filter list may not be modified if the system security level +is set to 3 or higher +(see +.Xr init 8 +for information on system security levels). 
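+.Pp
+The current security level can be checked with:
+.Pp
+.Dl "sysctl kern.securelevel"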
+.El +.Sh PACKET DIVERSION +A +.Xr divert 4 +socket bound to the specified port will receive all packets +diverted to that port. +If no socket is bound to the destination port, or if the divert module is +not loaded, or if the kernel was not compiled with divert socket support, +the packets are dropped. +.Sh NETWORK ADDRESS TRANSLATION (NAT) +.Nm +support in-kernel NAT using the kernel version of +.Xr libalias 3 . +.Pp +The nat configuration command is the following: +.Bd -ragged -offset indent +.Bk -words +.Cm nat +.Ar nat_number +.Cm config +.Ar nat-configuration +.Ek +.Ed +.Pp +The following parameters can be configured: +.Bl -tag -width indent +.It Cm ip Ar ip_address +Define an ip address to use for aliasing. +.It Cm if Ar nic +Use ip address of NIC for aliasing, dynamically changing +it if NIC's ip address changes. +.It Cm log +Enable logging on this nat instance. +.It Cm deny_in +Deny any incoming connection from outside world. +.It Cm same_ports +Try to leave the alias port numbers unchanged from +the actual local port numbers. +.It Cm unreg_only +Traffic on the local network not originating from an +unregistered address spaces will be ignored. +.It Cm reset +Reset table of the packet aliasing engine on address change. +.It Cm reverse +Reverse the way libalias handles aliasing. +.It Cm proxy_only +Obey transparent proxy rules only, packet aliasing is not performed. +.It Cm skip_global +Skip instance in case of global state lookup (see below). +.El +.Pp +Some specials value can be supplied instead of +.Va nat_number: +.Bl -tag -width indent +.It Cm global +Looks up translation state in all configured nat instances. +If an entry is found, packet is aliased according to that entry. +If no entry was found in any of the instances, packet is passed unchanged, +and no new entry will be created. +See section +.Sx MULTIPLE INSTANCES +in +.Xr natd 8 +for more information. +.It Cm tablearg +Uses argument supplied in lookup table. +See +.Sx LOOKUP TABLES +section below for more information on lookup tables. +.El +.Pp +To let the packet continue after being (de)aliased, set the sysctl variable +.Va net.inet.ip.fw.one_pass +to 0. +For more information about aliasing modes, refer to +.Xr libalias 3 . +See Section +.Sx EXAMPLES +for some examples about nat usage. +.Ss REDIRECT AND LSNAT SUPPORT IN IPFW +Redirect and LSNAT support follow closely the syntax used in +.Xr natd 8 . +See Section +.Sx EXAMPLES +for some examples on how to do redirect and lsnat. +.Ss SCTP NAT SUPPORT +SCTP nat can be configured in a similar manner to TCP through the +.Nm +command line tool. +The main difference is that +.Nm sctp nat +does not do port translation. +Since the local and global side ports will be the same, +there is no need to specify both. +Ports are redirected as follows: +.Bd -ragged -offset indent +.Bk -words +.Cm nat +.Ar nat_number +.Cm config if +.Ar nic +.Cm redirect_port sctp +.Ar ip_address [,addr_list] {[port | port-port] [,ports]} +.Ek +.Ed +.Pp +Most +.Nm sctp nat +configuration can be done in real-time through the +.Xr sysctl 8 +interface. +All may be changed dynamically, though the hash_table size will only +change for new +.Nm nat +instances. +See +.Sx SYSCTL VARIABLES +for more info. +.Sh LOADER TUNABLES +Tunables can be set in +.Xr loader 8 +prompt, +.Xr loader.conf 5 +or +.Xr kenv 1 +before ipfw module gets loaded. +.Bl -tag -width indent +.It Va net.inet.ip.fw.default_to_accept: No 0 +Defines ipfw last rule behavior. 
+This value overrides +.Cd "options IPFW_DEFAULT_TO_(ACCEPT|DENY)" +from kernel configuration file. +.It Va net.inet.ip.fw.tables_max: No 128 +Defines number of tables available in ipfw. +Number cannot exceed 65534. +.El +.Sh SYSCTL VARIABLES +A set of +.Xr sysctl 8 +variables controls the behaviour of the firewall and +associated modules +.Pq Nm dummynet , bridge , sctp nat . +These are shown below together with their default value +(but always check with the +.Xr sysctl 8 +command what value is actually in use) and meaning: +.Bl -tag -width indent +.It Va net.inet.ip.alias.sctp.accept_global_ootb_addip: No 0 +Defines how the +.Nm nat +responds to receipt of global OOTB ASCONF-AddIP: +.Bl -tag -width indent +.It Cm 0 +No response (unless a partially matching association exists - +ports and vtags match but global address does not) +.It Cm 1 +.Nm nat +will accept and process all OOTB global AddIP messages. +.El +.Pp +Option 1 should never be selected as this forms a security risk. +An attacker can +establish multiple fake associations by sending AddIP messages. +.It Va net.inet.ip.alias.sctp.chunk_proc_limit: No 5 +Defines the maximum number of chunks in an SCTP packet that will be +parsed for a +packet that matches an existing association. +This value is enforced to be greater or equal than +.Cm net.inet.ip.alias.sctp.initialising_chunk_proc_limit . +A high value is +a DoS risk yet setting too low a value may result in +important control chunks in +the packet not being located and parsed. +.It Va net.inet.ip.alias.sctp.error_on_ootb: No 1 +Defines when the +.Nm nat +responds to any Out-of-the-Blue (OOTB) packets with ErrorM packets. +An OOTB packet is a packet that arrives with no existing association +registered in the +.Nm nat +and is not an INIT or ASCONF-AddIP packet: +.Bl -tag -width indent +.It Cm 0 +ErrorM is never sent in response to OOTB packets. +.It Cm 1 +ErrorM is only sent to OOTB packets received on the local side. +.It Cm 2 +ErrorM is sent to the local side and on the global side ONLY if there is a +partial match (ports and vtags match but the source global IP does not). +This value is only useful if the +.Nm nat +is tracking global IP addresses. +.It Cm 3 +ErrorM is sent in response to all OOTB packets on both +the local and global side +(DoS risk). +.El +.Pp +At the moment the default is 0, since the ErrorM packet is not yet +supported by most SCTP stacks. +When it is supported, and if not tracking +global addresses, we recommend setting this value to 1 to allow +multi-homed local hosts to function with the +.Nm nat . +To track global addresses, we recommend setting this value to 2 to +allow global hosts to be informed when they need to (re)send an +ASCONF-AddIP. +Value 3 should never be chosen (except for debugging) as the +.Nm nat +will respond to all OOTB global packets (a DoS risk). +.It Va net.inet.ip.alias.sctp.hashtable_size: No 2003 +Size of hash tables used for +.Nm nat +lookups (100 < prime_number > 1000001). +This value sets the +.Nm hash table +size for any future created +.Nm nat +instance and therefore must be set prior to creating a +.Nm nat +instance. +The table sizes may be changed to suit specific needs. +If there will be few +concurrent associations, and memory is scarce, you may make these smaller. +If there will be many thousands (or millions) of concurrent associations, you +should make these larger. +A prime number is best for the table size. +The sysctl +update function will adjust your input value to the next highest prime number. 
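+.Pp
+For example, to provision for a very large number of concurrent
+associations one might pick a larger prime (the value below is only
+illustrative):
+.Pp
+.Dl "sysctl net.inet.ip.alias.sctp.hashtable_size=104729"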
+.It Va net.inet.ip.alias.sctp.holddown_time: No 0 +Hold association in table for this many seconds after receiving a +SHUTDOWN-COMPLETE. +This allows endpoints to correct shutdown gracefully if a +shutdown_complete is lost and retransmissions are required. +.It Va net.inet.ip.alias.sctp.init_timer: No 15 +Timeout value while waiting for (INIT-ACK|AddIP-ACK). +This value cannot be 0. +.It Va net.inet.ip.alias.sctp.initialising_chunk_proc_limit: No 2 +Defines the maximum number of chunks in an SCTP packet that will be parsed when +no existing association exists that matches that packet. +Ideally this packet +will only be an INIT or ASCONF-AddIP packet. +A higher value may become a DoS +risk as malformed packets can consume processing resources. +.It Va net.inet.ip.alias.sctp.param_proc_limit: No 25 +Defines the maximum number of parameters within a chunk that will be +parsed in a +packet. +As for other similar sysctl variables, larger values pose a DoS risk. +.It Va net.inet.ip.alias.sctp.log_level: No 0 +Level of detail in the system log messages (0 \- minimal, 1 \- event, +2 \- info, 3 \- detail, 4 \- debug, 5 \- max debug). +May be a good +option in high loss environments. +.It Va net.inet.ip.alias.sctp.shutdown_time: No 15 +Timeout value while waiting for SHUTDOWN-COMPLETE. +This value cannot be 0. +.It Va net.inet.ip.alias.sctp.track_global_addresses: No 0 +Enables/disables global IP address tracking within the +.Nm nat +and places an +upper limit on the number of addresses tracked for each association: +.Bl -tag -width indent +.It Cm 0 +Global tracking is disabled +.It Cm >1 +Enables tracking, the maximum number of addresses tracked for each +association is limited to this value +.El +.Pp +This variable is fully dynamic, the new value will be adopted for all newly +arriving associations, existing associations are treated +as they were previously. +Global tracking will decrease the number of collisions within the +.Nm nat +at a cost +of increased processing load, memory usage, complexity, and possible +.Nm nat +state +problems in complex networks with multiple +.Nm nats . +We recommend not tracking +global IP addresses, this will still result in a fully functional +.Nm nat . +.It Va net.inet.ip.alias.sctp.up_timer: No 300 +Timeout value to keep an association up with no traffic. +This value cannot be 0. +.It Va net.inet.ip.dummynet.expire : No 1 +Lazily delete dynamic pipes/queue once they have no pending traffic. +You can disable this by setting the variable to 0, in which case +the pipes/queues will only be deleted when the threshold is reached. +.It Va net.inet.ip.dummynet.hash_size : No 64 +Default size of the hash table used for dynamic pipes/queues. +This value is used when no +.Cm buckets +option is specified when configuring a pipe/queue. +.It Va net.inet.ip.dummynet.io_fast : No 0 +If set to a non-zero value, +the +.Dq fast +mode of +.Nm dummynet +operation (see above) is enabled. +.It Va net.inet.ip.dummynet.io_pkt +Number of packets passed to +.Nm dummynet . +.It Va net.inet.ip.dummynet.io_pkt_drop +Number of packets dropped by +.Nm dummynet . +.It Va net.inet.ip.dummynet.io_pkt_fast +Number of packets bypassed by the +.Nm dummynet +scheduler. +.It Va net.inet.ip.dummynet.max_chain_len : No 16 +Target value for the maximum number of pipes/queues in a hash bucket. +The product +.Cm max_chain_len*hash_size +is used to determine the threshold over which empty pipes/queues +will be expired even when +.Cm net.inet.ip.dummynet.expire=0 . 
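+.Pp
+With the default values shown here (16 and 64), that threshold works
+out to 16 * 64 = 1024 pipes/queues.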
+.It Va net.inet.ip.dummynet.red_lookup_depth : No 256 +.It Va net.inet.ip.dummynet.red_avg_pkt_size : No 512 +.It Va net.inet.ip.dummynet.red_max_pkt_size : No 1500 +Parameters used in the computations of the drop probability +for the RED algorithm. +.It Va net.inet.ip.dummynet.pipe_byte_limit : No 1048576 +.It Va net.inet.ip.dummynet.pipe_slot_limit : No 100 +The maximum queue size that can be specified in bytes or packets. +These limits prevent accidental exhaustion of resources such as mbufs. +If you raise these limits, +you should make sure the system is configured so that sufficient resources +are available. +.It Va net.inet.ip.fw.autoinc_step : No 100 +Delta between rule numbers when auto-generating them. +The value must be in the range 1..1000. +.It Va net.inet.ip.fw.curr_dyn_buckets : Va net.inet.ip.fw.dyn_buckets +The current number of buckets in the hash table for dynamic rules +(readonly). +.It Va net.inet.ip.fw.debug : No 1 +Controls debugging messages produced by +.Nm . +.It Va net.inet.ip.fw.default_rule : No 65535 +The default rule number (read-only). +By the design of +.Nm , the default rule is the last one, so its number +can also serve as the highest number allowed for a rule. +.It Va net.inet.ip.fw.dyn_buckets : No 256 +The number of buckets in the hash table for dynamic rules. +Must be a power of 2, up to 65536. +It only takes effect when all dynamic rules have expired, so you +are advised to use a +.Cm flush +command to make sure that the hash table is resized. +.It Va net.inet.ip.fw.dyn_count : No 3 +Current number of dynamic rules +(read-only). +.It Va net.inet.ip.fw.dyn_keepalive : No 1 +Enables generation of keepalive packets for +.Cm keep-state +rules on TCP sessions. +A keepalive is generated to both +sides of the connection every 5 seconds for the last 20 +seconds of the lifetime of the rule. +.It Va net.inet.ip.fw.dyn_max : No 8192 +Maximum number of dynamic rules. +When you hit this limit, no more dynamic rules can be +installed until old ones expire. +.It Va net.inet.ip.fw.dyn_ack_lifetime : No 300 +.It Va net.inet.ip.fw.dyn_syn_lifetime : No 20 +.It Va net.inet.ip.fw.dyn_fin_lifetime : No 1 +.It Va net.inet.ip.fw.dyn_rst_lifetime : No 1 +.It Va net.inet.ip.fw.dyn_udp_lifetime : No 5 +.It Va net.inet.ip.fw.dyn_short_lifetime : No 30 +These variables control the lifetime, in seconds, of dynamic +rules. +Upon the initial SYN exchange the lifetime is kept short, +then increased after both SYN have been seen, then decreased +again during the final FIN exchange or when a RST is received. +Both +.Em dyn_fin_lifetime +and +.Em dyn_rst_lifetime +must be strictly lower than 5 seconds, the period of +repetition of keepalives. +The firewall enforces that. +.It Va net.inet.ip.fw.dyn_keep_states: No 0 +Keep dynamic states on rule/set deletion. +States are relinked to default rule (65535). +This can be handly for ruleset reload. +Turned off by default. +.It Va net.inet.ip.fw.enable : No 1 +Enables the firewall. +Setting this variable to 0 lets you run your machine without +firewall even if compiled in. +.It Va net.inet6.ip6.fw.enable : No 1 +provides the same functionality as above for the IPv6 case. +.It Va net.inet.ip.fw.one_pass : No 1 +When set, the packet exiting from the +.Nm dummynet +pipe or from +.Xr ng_ipfw 4 +node is not passed though the firewall again. +Otherwise, after an action, the packet is +reinjected into the firewall at the next rule. +.It Va net.inet.ip.fw.tables_max : No 128 +Maximum number of tables. 
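+.Pp
+For example (an illustrative value; the same knob is available as a
+loader tunable, see above):
+.Pp
+.Dl "sysctl net.inet.ip.fw.tables_max=1024"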
+.It Va net.inet.ip.fw.verbose : No 1 +Enables verbose messages. +.It Va net.inet.ip.fw.verbose_limit : No 0 +Limits the number of messages produced by a verbose firewall. +.It Va net.inet6.ip6.fw.deny_unknown_exthdrs : No 1 +If enabled packets with unknown IPv6 Extension Headers will be denied. +.It Va net.link.ether.ipfw : No 0 +Controls whether layer-2 packets are passed to +.Nm . +Default is no. +.It Va net.link.bridge.ipfw : No 0 +Controls whether bridged packets are passed to +.Nm . +Default is no. +.El +.Sh INTERNAL DIAGNOSTICS +There are some commands that may be useful to understand current state +of certain subsystems inside kernel module. +These commands provide debugging output which may change without notice. +.Pp +Currently the following commands are available as +.Cm internal +sub-options: +.Bl -tag -width indent +.It Cm iflist +Lists all interface which are currently tracked by +.Nm +with their in-kernel status. +.It Cm talist +List all table lookup algorithms currently available. +.El +.Sh EXAMPLES +There are far too many possible uses of +.Nm +so this Section will only give a small set of examples. +.Pp +.Ss BASIC PACKET FILTERING +This command adds an entry which denies all tcp packets from +.Em cracker.evil.org +to the telnet port of +.Em wolf.tambov.su +from being forwarded by the host: +.Pp +.Dl "ipfw add deny tcp from cracker.evil.org to wolf.tambov.su telnet" +.Pp +This one disallows any connection from the entire cracker's +network to my host: +.Pp +.Dl "ipfw add deny ip from 123.45.67.0/24 to my.host.org" +.Pp +A first and efficient way to limit access (not using dynamic rules) +is the use of the following rules: +.Pp +.Dl "ipfw add allow tcp from any to any established" +.Dl "ipfw add allow tcp from net1 portlist1 to net2 portlist2 setup" +.Dl "ipfw add allow tcp from net3 portlist3 to net3 portlist3 setup" +.Dl "..." +.Dl "ipfw add deny tcp from any to any" +.Pp +The first rule will be a quick match for normal TCP packets, +but it will not match the initial SYN packet, which will be +matched by the +.Cm setup +rules only for selected source/destination pairs. +All other SYN packets will be rejected by the final +.Cm deny +rule. +.Pp +If you administer one or more subnets, you can take advantage +of the address sets and or-blocks and write extremely +compact rulesets which selectively enable services to blocks +of clients, as below: +.Pp +.Dl "goodguys=\*q{ 10.1.2.0/24{20,35,66,18} or 10.2.3.0/28{6,3,11} }\*q" +.Dl "badguys=\*q10.1.2.0/24{8,38,60}\*q" +.Dl "" +.Dl "ipfw add allow ip from ${goodguys} to any" +.Dl "ipfw add deny ip from ${badguys} to any" +.Dl "... normal policies ..." +.Pp +The +.Cm verrevpath +option could be used to do automated anti-spoofing by adding the +following to the top of a ruleset: +.Pp +.Dl "ipfw add deny ip from any to any not verrevpath in" +.Pp +This rule drops all incoming packets that appear to be coming to the +system on the wrong interface. +For example, a packet with a source +address belonging to a host on a protected internal network would be +dropped if it tried to enter the system from an external interface. +.Pp +The +.Cm antispoof +option could be used to do similar but more restricted anti-spoofing +by adding the following to the top of a ruleset: +.Pp +.Dl "ipfw add deny ip from any to any not antispoof in" +.Pp +This rule drops all incoming packets that appear to be coming from another +directly connected system but on the wrong interface. 
+For example, a packet with a source address of +.Li 192.168.0.0/24 , +configured on +.Li fxp0 , +but coming in on +.Li fxp1 +would be dropped. +.Pp +The +.Cm setdscp +option could be used to (re)mark user traffic, +by adding the following to the appropriate place in ruleset: +.Pp +.Dl "ipfw add setdscp be ip from any to any dscp af11,af21" +.Ss DYNAMIC RULES +In order to protect a site from flood attacks involving fake +TCP packets, it is safer to use dynamic rules: +.Pp +.Dl "ipfw add check-state" +.Dl "ipfw add deny tcp from any to any established" +.Dl "ipfw add allow tcp from my-net to any setup keep-state" +.Pp +This will let the firewall install dynamic rules only for +those connection which start with a regular SYN packet coming +from the inside of our network. +Dynamic rules are checked when encountering the first +occurrence of a +.Cm check-state , +.Cm keep-state +or +.Cm limit +rule. +A +.Cm check-state +rule should usually be placed near the beginning of the +ruleset to minimize the amount of work scanning the ruleset. +Your mileage may vary. +.Pp +To limit the number of connections a user can open +you can use the following type of rules: +.Pp +.Dl "ipfw add allow tcp from my-net/24 to any setup limit src-addr 10" +.Dl "ipfw add allow tcp from any to me setup limit src-addr 4" +.Pp +The former (assuming it runs on a gateway) will allow each host +on a /24 network to open at most 10 TCP connections. +The latter can be placed on a server to make sure that a single +client does not use more than 4 simultaneous connections. +.Pp +.Em BEWARE : +stateful rules can be subject to denial-of-service attacks +by a SYN-flood which opens a huge number of dynamic rules. +The effects of such attacks can be partially limited by +acting on a set of +.Xr sysctl 8 +variables which control the operation of the firewall. +.Pp +Here is a good usage of the +.Cm list +command to see accounting records and timestamp information: +.Pp +.Dl ipfw -at list +.Pp +or in short form without timestamps: +.Pp +.Dl ipfw -a list +.Pp +which is equivalent to: +.Pp +.Dl ipfw show +.Pp +Next rule diverts all incoming packets from 192.168.2.0/24 +to divert port 5000: +.Pp +.Dl ipfw divert 5000 ip from 192.168.2.0/24 to any in +.Ss TRAFFIC SHAPING +The following rules show some of the applications of +.Nm +and +.Nm dummynet +for simulations and the like. +.Pp +This rule drops random incoming packets with a probability +of 5%: +.Pp +.Dl "ipfw add prob 0.05 deny ip from any to any in" +.Pp +A similar effect can be achieved making use of +.Nm dummynet +pipes: +.Pp +.Dl "ipfw add pipe 10 ip from any to any" +.Dl "ipfw pipe 10 config plr 0.05" +.Pp +We can use pipes to artificially limit bandwidth, e.g.\& on a +machine acting as a router, if we want to limit traffic from +local clients on 192.168.2.0/24 we do: +.Pp +.Dl "ipfw add pipe 1 ip from 192.168.2.0/24 to any out" +.Dl "ipfw pipe 1 config bw 300Kbit/s queue 50KBytes" +.Pp +note that we use the +.Cm out +modifier so that the rule is not used twice. +Remember in fact that +.Nm +rules are checked both on incoming and outgoing packets. 
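+.Pp
+The resulting pipe configuration, together with its counters, can then
+be inspected with:
+.Pp
+.Dl "ipfw pipe show"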
+.Pp +Should we want to simulate a bidirectional link with bandwidth +limitations, the correct way is the following: +.Pp +.Dl "ipfw add pipe 1 ip from any to any out" +.Dl "ipfw add pipe 2 ip from any to any in" +.Dl "ipfw pipe 1 config bw 64Kbit/s queue 10Kbytes" +.Dl "ipfw pipe 2 config bw 64Kbit/s queue 10Kbytes" +.Pp +The above can be very useful, e.g.\& if you want to see how +your fancy Web page will look for a residential user who +is connected only through a slow link. +You should not use only one pipe for both directions, unless +you want to simulate a half-duplex medium (e.g.\& AppleTalk, +Ethernet, IRDA). +It is not necessary that both pipes have the same configuration, +so we can also simulate asymmetric links. +.Pp +Should we want to verify network performance with the RED queue +management algorithm: +.Pp +.Dl "ipfw add pipe 1 ip from any to any" +.Dl "ipfw pipe 1 config bw 500Kbit/s queue 100 red 0.002/30/80/0.1" +.Pp +Another typical application of the traffic shaper is to +introduce some delay in the communication. +This can significantly affect applications which do a lot of Remote +Procedure Calls, and where the round-trip-time of the +connection often becomes a limiting factor much more than +bandwidth: +.Pp +.Dl "ipfw add pipe 1 ip from any to any out" +.Dl "ipfw add pipe 2 ip from any to any in" +.Dl "ipfw pipe 1 config delay 250ms bw 1Mbit/s" +.Dl "ipfw pipe 2 config delay 250ms bw 1Mbit/s" +.Pp +Per-flow queueing can be useful for a variety of purposes. +A very simple one is counting traffic: +.Pp +.Dl "ipfw add pipe 1 tcp from any to any" +.Dl "ipfw add pipe 1 udp from any to any" +.Dl "ipfw add pipe 1 ip from any to any" +.Dl "ipfw pipe 1 config mask all" +.Pp +The above set of rules will create queues (and collect +statistics) for all traffic. +Because the pipes have no limitations, the only effect is +collecting statistics. +Note that we need 3 rules, not just the last one, because +when +.Nm +tries to match IP packets it will not consider ports, so we +would not see connections on separate ports as different +ones. +.Pp +A more sophisticated example is limiting the outbound traffic +on a net with per-host limits, rather than per-network limits: +.Pp +.Dl "ipfw add pipe 1 ip from 192.168.2.0/24 to any out" +.Dl "ipfw add pipe 2 ip from any to 192.168.2.0/24 in" +.Dl "ipfw pipe 1 config mask src-ip 0x000000ff bw 200Kbit/s queue 20Kbytes" +.Dl "ipfw pipe 2 config mask dst-ip 0x000000ff bw 200Kbit/s queue 20Kbytes" +.Ss LOOKUP TABLES +In the following example, we need to create several traffic bandwidth +classes and we need different hosts/networks to fall into different classes. +We create one pipe for each class and configure them accordingly. +Then we create a single table and fill it with IP subnets and addresses. +For each subnet/host we set the argument equal to the number of the pipe +that it should use. +Then we classify traffic using a single rule: +.Pp +.Dl "ipfw pipe 1 config bw 1000Kbyte/s" +.Dl "ipfw pipe 4 config bw 4000Kbyte/s" +.Dl "..." +.Dl "ipfw table T1 create type addr" +.Dl "ipfw table T1 add 192.168.2.0/24 1" +.Dl "ipfw table T1 add 192.168.0.0/27 4" +.Dl "ipfw table T1 add 192.168.0.2 1" +.Dl "..." +.Dl "ipfw add pipe tablearg ip from 'table(T1)' to any" +.Pp +Using the +.Cm fwd +action, the table entries may include hostnames and IP addresses. +.Pp +.Dl "ipfw table T2 create type addr ftype ip" +.Dl "ipfw table T2 add 192.168.2.0/24 10.23.2.1" +.Dl "ipfw table T21 add 192.168.0.0/27 router1.dmz" +.Dl "..." 
+.Dl "ipfw add 100 fwd tablearg ip from any to table(1)" +.Pp +In the following example per-interface firewall is created: +.Pp +.Dl "ipfw table IN create type iface valtype skipto,fib" +.Dl "ipfw table IN add vlan20 12000,12" +.Dl "ipfw table IN add vlan30 13000,13" +.Dl "ipfw table OUT create type iface valtype skipto" +.Dl "ipfw table OUT add vlan20 22000" +.Dl "ipfw table OUT add vlan30 23000" +.Dl ".." +.Dl "ipfw add 100 ipfw setfib tablearg ip from any to any recv 'table(IN)' in" +.Dl "ipfw add 200 ipfw skipto tablearg ip from any to any recv 'table(IN)' in" +.Dl "ipfw add 300 ipfw skipto tablearg ip from any to any xmit 'table(OUT)' out" +.Pp +The following example illustrate usage of flow tables: +.Pp +.Dl "ipfw table fl create type flow:flow:src-ip,proto,dst-ip,dst-port" +.Dl "ipfw table fl add 2a02:6b8:77::88,tcp,2a02:6b8:77::99,80 11" +.Dl "ipfw table fl add 10.0.0.1,udp,10.0.0.2,53 12" +.Dl ".." +.Dl "ipfw add 100 allow ip from any to any flow 'table(fl,11)' recv ix0" +.Ss SETS OF RULES +To add a set of rules atomically, e.g.\& set 18: +.Pp +.Dl "ipfw set disable 18" +.Dl "ipfw add NN set 18 ... # repeat as needed" +.Dl "ipfw set enable 18" +.Pp +To delete a set of rules atomically the command is simply: +.Pp +.Dl "ipfw delete set 18" +.Pp +To test a ruleset and disable it and regain control if something goes wrong: +.Pp +.Dl "ipfw set disable 18" +.Dl "ipfw add NN set 18 ... # repeat as needed" +.Dl "ipfw set enable 18; echo done; sleep 30 && ipfw set disable 18" +.Pp +Here if everything goes well, you press control-C before the "sleep" +terminates, and your ruleset will be left active. +Otherwise, e.g.\& if +you cannot access your box, the ruleset will be disabled after +the sleep terminates thus restoring the previous situation. +.Pp +To show rules of the specific set: +.Pp +.Dl "ipfw set 18 show" +.Pp +To show rules of the disabled set: +.Pp +.Dl "ipfw -S set 18 show" +.Pp +To clear a specific rule counters of the specific set: +.Pp +.Dl "ipfw set 18 zero NN" +.Pp +To delete a specific rule of the specific set: +.Pp +.Dl "ipfw set 18 delete NN" +.Ss NAT, REDIRECT AND LSNAT +First redirect all the traffic to nat instance 123: +.Pp +.Dl "ipfw add nat 123 all from any to any" +.Pp +Then to configure nat instance 123 to alias all the outgoing traffic with ip +192.168.0.123, blocking all incoming connections, trying to keep +same ports on both sides, clearing aliasing table on address change +and keeping a log of traffic/link statistics: +.Pp +.Dl "ipfw nat 123 config ip 192.168.0.123 log deny_in reset same_ports" +.Pp +Or to change address of instance 123, aliasing table will be cleared (see +reset option): +.Pp +.Dl "ipfw nat 123 config ip 10.0.0.1" +.Pp +To see configuration of nat instance 123: +.Pp +.Dl "ipfw nat 123 show config" +.Pp +To show logs of all the instances in range 111-999: +.Pp +.Dl "ipfw nat 111-999 show" +.Pp +To see configurations of all instances: +.Pp +.Dl "ipfw nat show config" +.Pp +Or a redirect rule with mixed modes could looks like: +.Pp +.Dl "ipfw nat 123 config redirect_addr 10.0.0.1 10.0.0.66" +.Dl " redirect_port tcp 192.168.0.1:80 500" +.Dl " redirect_proto udp 192.168.1.43 192.168.1.1" +.Dl " redirect_addr 192.168.0.10,192.168.0.11" +.Dl " 10.0.0.100 # LSNAT" +.Dl " redirect_port tcp 192.168.0.1:80,192.168.0.10:22" +.Dl " 500 # LSNAT" +.Pp +or it could be split in: +.Pp +.Dl "ipfw nat 1 config redirect_addr 10.0.0.1 10.0.0.66" +.Dl "ipfw nat 2 config redirect_port tcp 192.168.0.1:80 500" +.Dl "ipfw nat 3 config redirect_proto udp 192.168.1.43 
192.168.1.1" +.Dl "ipfw nat 4 config redirect_addr 192.168.0.10,192.168.0.11,192.168.0.12" +.Dl " 10.0.0.100" +.Dl "ipfw nat 5 config redirect_port tcp" +.Dl " 192.168.0.1:80,192.168.0.10:22,192.168.0.20:25 500" +.Sh SEE ALSO +.Xr cpp 1 , +.Xr m4 1 , +.Xr altq 4 , +.Xr divert 4 , +.Xr dummynet 4 , +.Xr if_bridge 4 , +.Xr ip 4 , +.Xr ipfirewall 4 , +.Xr ng_ipfw 4 , +.Xr protocols 5 , +.Xr services 5 , +.Xr init 8 , +.Xr kldload 8 , +.Xr reboot 8 , +.Xr sysctl 8 , +.Xr syslogd 8 +.Sh HISTORY +The +.Nm +utility first appeared in +.Fx 2.0 . +.Nm dummynet +was introduced in +.Fx 2.2.8 . +Stateful extensions were introduced in +.Fx 4.0 . +.Nm ipfw2 +was introduced in Summer 2002. +.Sh AUTHORS +.An Ugen J. S. Antsilevich , +.An Poul-Henning Kamp , +.An Alex Nash , +.An Archie Cobbs , +.An Luigi Rizzo . +.Pp +.An -nosplit +API based upon code written by +.An Daniel Boulet +for BSDI. +.Pp +Dummynet has been introduced by Luigi Rizzo in 1997-1998. +.Pp +Some early work (1999-2000) on the +.Nm dummynet +traffic shaper supported by Akamba Corp. +.Pp +The ipfw core (ipfw2) has been completely redesigned and +reimplemented by Luigi Rizzo in summer 2002. +Further +actions and +options have been added by various developer over the years. +.Pp +.An -nosplit +In-kernel NAT support written by +.An Paolo Pisati Aq Mt piso@FreeBSD.org +as part of a Summer of Code 2005 project. +.Pp +SCTP +.Nm nat +support has been developed by +.An The Centre for Advanced Internet Architectures (CAIA) Aq http://www.caia.swin.edu.au . +The primary developers and maintainers are David Hayes and Jason But. +For further information visit: +.Aq http://www.caia.swin.edu.au/urp/SONATA +.Pp +Delay profiles have been developed by Alessandro Cerri and +Luigi Rizzo, supported by the +European Commission within Projects Onelab and Onelab2. +.Sh BUGS +The syntax has grown over the years and sometimes it might be confusing. +Unfortunately, backward compatibility prevents cleaning up mistakes +made in the definition of the syntax. +.Pp +.Em !!! WARNING !!! +.Pp +Misconfiguring the firewall can put your computer in an unusable state, +possibly shutting down network services and requiring console access to +regain control of it. +.Pp +Incoming packet fragments diverted by +.Cm divert +are reassembled before delivery to the socket. +The action used on those packet is the one from the +rule which matches the first fragment of the packet. +.Pp +Packets diverted to userland, and then reinserted by a userland process +may lose various packet attributes. +The packet source interface name +will be preserved if it is shorter than 8 bytes and the userland process +saves and reuses the sockaddr_in +(as does +.Xr natd 8 ) ; +otherwise, it may be lost. +If a packet is reinserted in this manner, later rules may be incorrectly +applied, making the order of +.Cm divert +rules in the rule sequence very important. +.Pp +Dummynet drops all packets with IPv6 link-local addresses. +.Pp +Rules using +.Cm uid +or +.Cm gid +may not behave as expected. +In particular, incoming SYN packets may +have no uid or gid associated with them since they do not yet belong +to a TCP connection, and the uid/gid associated with a packet may not +be as expected if the associated process calls +.Xr setuid 2 +or similar system calls. +.Pp +Rule syntax is subject to the command line environment and some patterns +may need to be escaped with the backslash character +or quoted appropriately. 
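+.Pp
+For example, the braces used in or-blocks are special to most shells,
+so a rule like the
+.Em goodguys
+example above is normally entered quoted:
+.Pp
+.Dl "ipfw add allow ip from '{ 10.1.2.0/24 or 10.2.3.0/28 }' to any"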
+.Pp
+Due to the architecture of
+.Xr libalias 3 ,
+ipfw nat is not compatible with TCP segmentation offloading (TSO).
+Thus, to reliably nat your network traffic, please disable TSO
+on your NICs using
+.Xr ifconfig 8 .
+.Pp
+ICMP error messages are not implicitly matched by dynamic rules
+for the respective conversations.
+To avoid failures of network error detection and path MTU discovery,
+ICMP error messages may need to be allowed explicitly through static
+rules.
+.Pp
+Rules using
+.Cm call
+and
+.Cm return
+actions may lead to confusing behaviour if the ruleset contains mistakes,
+or if interaction with other subsystems (netgraph, dummynet, etc.) is
+involved.
+One possible case is a packet leaving
+.Nm
+in a subroutine on the input pass and later encountering an unpaired
+.Cm return
+on the output pass.
+As the call stack is kept intact after the input pass, the packet will
+suddenly return to the rule number used on the input pass, not one from
+the output pass.
+The order of processing should be checked carefully to avoid such mistakes.
diff --git a/example/ipfw/ipfw/ipfw2.c b/example/ipfw/ipfw/ipfw2.c
new file mode 100644
index 0000000..b8ef6ee
--- /dev/null
+++ b/example/ipfw/ipfw/ipfw2.c
@@ -0,0 +1,4968 @@
+/*
+ * Copyright (c) 2002-2003 Luigi Rizzo
+ * Copyright (c) 1996 Alex Nash, Paul Traina, Poul-Henning Kamp
+ * Copyright (c) 1994 Ugen J.S.Antsilevich
+ *
+ * Idea and grammar partially left from:
+ * Copyright (c) 1993 Daniel Boulet
+ *
+ * Redistribution and use in source forms, with and without modification,
+ * are permitted provided that this entire comment appears intact.
+ *
+ * Redistribution in binary form may occur without any restrictions.
+ * Obviously, it would be nice if you gave credit where credit is due
+ * but requiring it would be too onerous.
+ *
+ * This software is provided ``AS IS'' without any warranties of any kind.
+ *
+ * NEW command line interface for IP firewall facility
+ *
+ * $FreeBSD: head/sbin/ipfw/ipfw2.c 273253 2014-10-18 15:18:31Z melifaro $
+ */
+
+#include <sys/types.h>
+#include <sys/param.h>
+#include <sys/socket.h>
+#include <sys/sockio.h>
+#include <sys/sysctl.h>
+
+#include "ipfw2.h"
+
+#include <ctype.h>
+#include <err.h>
+#include <errno.h>
+#include <grp.h>
+#include <netdb.h>
+#include <pwd.h>
+#include <stdio.h>
+#include <stdarg.h>
+#include <stdlib.h>
+#include <string.h>
+#include <sysexits.h>
+#include <time.h>		/* ctime */
+#include <timeconv.h>		/* _long_to_time */
+#include <unistd.h>
+#include <fcntl.h>
+#include <stddef.h>		/* offsetof */
+
+#include <net/ethernet.h>
+#include <net/if.h>		/* only IFNAMSIZ */
+#include <netinet/in.h>
+#include <netinet/in_systm.h>	/* only n_short, n_long */
+#include <netinet/ip.h>
+#include <netinet/ip_icmp.h>
+#include <netinet/ip_fw.h>
+#include <netinet/tcp.h>
+#include <arpa/inet.h>
+
+struct cmdline_opts co;	/* global options */
+
+struct format_opts {
+	int bcwidth;
+	int pcwidth;
+	int show_counters;
+	uint32_t set_mask;	/* enabled sets mask */
+	uint32_t flags;		/* request flags */
+	uint32_t first;		/* first rule to request */
+	uint32_t last;		/* last rule to request */
+	uint32_t dcnt;		/* number of dynamic states */
+	ipfw_obj_ctlv *tstate;	/* table state data */
+};
+
+int resvd_set_number = RESVD_SET;
+
+int ipfw_socket = -1;
+
+#define	CHECK_LENGTH(v, len) do {			\
+	if ((v) < (len))				\
+		errx(EX_DATAERR, "Rule too long");	\
+	} while (0)
+/*
+ * Check if we have enough space in cmd buffer. Note that since
+ * first 8? u32 words are reserved by reserved header, full cmd
+ * buffer can't be used, so we need to protect from buffer overrun
+ * only. At the beginning, cblen is less than actual buffer size by
+ * size of ipfw_insn_u32 instruction + 1 u32 word. This eliminates need
+ * for checking small instructions fitting in given range.
+ * We also (ab)use the fact that ipfw_insn is always the first field
+ * for any custom instruction.
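+ * (This is also why CHECK_CMDLEN below can apply F_LEN() to cmd cast to
+ * a plain ipfw_insn: the length field sits in the same place for every
+ * instruction type.)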
+ */ +#define CHECK_CMDLEN CHECK_LENGTH(cblen, F_LEN((ipfw_insn *)cmd)) + +#define GET_UINT_ARG(arg, min, max, tok, s_x) do { \ + if (!av[0]) \ + errx(EX_USAGE, "%s: missing argument", match_value(s_x, tok)); \ + if (_substrcmp(*av, "tablearg") == 0) { \ + arg = IP_FW_TARG; \ + break; \ + } \ + \ + { \ + long _xval; \ + char *end; \ + \ + _xval = strtol(*av, &end, 10); \ + \ + if (!isdigit(**av) || *end != '\0' || (_xval == 0 && errno == EINVAL)) \ + errx(EX_DATAERR, "%s: invalid argument: %s", \ + match_value(s_x, tok), *av); \ + \ + if (errno == ERANGE || _xval < min || _xval > max) \ + errx(EX_DATAERR, "%s: argument is out of range (%u..%u): %s", \ + match_value(s_x, tok), min, max, *av); \ + \ + if (_xval == IP_FW_TARG) \ + errx(EX_DATAERR, "%s: illegal argument value: %s", \ + match_value(s_x, tok), *av); \ + arg = _xval; \ + } \ +} while (0) + +static struct _s_x f_tcpflags[] = { + { "syn", TH_SYN }, + { "fin", TH_FIN }, + { "ack", TH_ACK }, + { "psh", TH_PUSH }, + { "rst", TH_RST }, + { "urg", TH_URG }, + { "tcp flag", 0 }, + { NULL, 0 } +}; + +static struct _s_x f_tcpopts[] = { + { "mss", IP_FW_TCPOPT_MSS }, + { "maxseg", IP_FW_TCPOPT_MSS }, + { "window", IP_FW_TCPOPT_WINDOW }, + { "sack", IP_FW_TCPOPT_SACK }, + { "ts", IP_FW_TCPOPT_TS }, + { "timestamp", IP_FW_TCPOPT_TS }, + { "cc", IP_FW_TCPOPT_CC }, + { "tcp option", 0 }, + { NULL, 0 } +}; + +/* + * IP options span the range 0 to 255 so we need to remap them + * (though in fact only the low 5 bits are significant). + */ +static struct _s_x f_ipopts[] = { + { "ssrr", IP_FW_IPOPT_SSRR}, + { "lsrr", IP_FW_IPOPT_LSRR}, + { "rr", IP_FW_IPOPT_RR}, + { "ts", IP_FW_IPOPT_TS}, + { "ip option", 0 }, + { NULL, 0 } +}; + +static struct _s_x f_iptos[] = { + { "lowdelay", IPTOS_LOWDELAY}, + { "throughput", IPTOS_THROUGHPUT}, + { "reliability", IPTOS_RELIABILITY}, + { "mincost", IPTOS_MINCOST}, + { "congestion", IPTOS_ECN_CE}, + { "ecntransport", IPTOS_ECN_ECT0}, + { "ip tos option", 0}, + { NULL, 0 } +}; + +struct _s_x f_ipdscp[] = { + { "af11", IPTOS_DSCP_AF11 >> 2 }, /* 001010 */ + { "af12", IPTOS_DSCP_AF12 >> 2 }, /* 001100 */ + { "af13", IPTOS_DSCP_AF13 >> 2 }, /* 001110 */ + { "af21", IPTOS_DSCP_AF21 >> 2 }, /* 010010 */ + { "af22", IPTOS_DSCP_AF22 >> 2 }, /* 010100 */ + { "af23", IPTOS_DSCP_AF23 >> 2 }, /* 010110 */ + { "af31", IPTOS_DSCP_AF31 >> 2 }, /* 011010 */ + { "af32", IPTOS_DSCP_AF32 >> 2 }, /* 011100 */ + { "af33", IPTOS_DSCP_AF33 >> 2 }, /* 011110 */ + { "af41", IPTOS_DSCP_AF41 >> 2 }, /* 100010 */ + { "af42", IPTOS_DSCP_AF42 >> 2 }, /* 100100 */ + { "af43", IPTOS_DSCP_AF43 >> 2 }, /* 100110 */ + { "be", IPTOS_DSCP_CS0 >> 2 }, /* 000000 */ + { "ef", IPTOS_DSCP_EF >> 2 }, /* 101110 */ + { "cs0", IPTOS_DSCP_CS0 >> 2 }, /* 000000 */ + { "cs1", IPTOS_DSCP_CS1 >> 2 }, /* 001000 */ + { "cs2", IPTOS_DSCP_CS2 >> 2 }, /* 010000 */ + { "cs3", IPTOS_DSCP_CS3 >> 2 }, /* 011000 */ + { "cs4", IPTOS_DSCP_CS4 >> 2 }, /* 100000 */ + { "cs5", IPTOS_DSCP_CS5 >> 2 }, /* 101000 */ + { "cs6", IPTOS_DSCP_CS6 >> 2 }, /* 110000 */ + { "cs7", IPTOS_DSCP_CS7 >> 2 }, /* 100000 */ + { NULL, 0 } +}; + +static struct _s_x limit_masks[] = { + {"all", DYN_SRC_ADDR|DYN_SRC_PORT|DYN_DST_ADDR|DYN_DST_PORT}, + {"src-addr", DYN_SRC_ADDR}, + {"src-port", DYN_SRC_PORT}, + {"dst-addr", DYN_DST_ADDR}, + {"dst-port", DYN_DST_PORT}, + {NULL, 0} +}; + +/* + * we use IPPROTO_ETHERTYPE as a fake protocol id to call the print routines + * This is only used in this code. 
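+ * (Real IP protocol numbers fit in a single octet, so the value 0x1000
+ * cannot collide with any genuine IPPROTO_ constant.)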
+ */ +#define IPPROTO_ETHERTYPE 0x1000 +static struct _s_x ether_types[] = { + /* + * Note, we cannot use "-:&/" in the names because they are field + * separators in the type specifications. Also, we use s = NULL as + * end-delimiter, because a type of 0 can be legal. + */ + { "ip", 0x0800 }, + { "ipv4", 0x0800 }, + { "ipv6", 0x86dd }, + { "arp", 0x0806 }, + { "rarp", 0x8035 }, + { "vlan", 0x8100 }, + { "loop", 0x9000 }, + { "trail", 0x1000 }, + { "at", 0x809b }, + { "atalk", 0x809b }, + { "aarp", 0x80f3 }, + { "pppoe_disc", 0x8863 }, + { "pppoe_sess", 0x8864 }, + { "ipx_8022", 0x00E0 }, + { "ipx_8023", 0x0000 }, + { "ipx_ii", 0x8137 }, + { "ipx_snap", 0x8137 }, + { "ipx", 0x8137 }, + { "ns", 0x0600 }, + { NULL, 0 } +}; + + +static struct _s_x rule_actions[] = { + { "accept", TOK_ACCEPT }, + { "pass", TOK_ACCEPT }, + { "allow", TOK_ACCEPT }, + { "permit", TOK_ACCEPT }, + { "count", TOK_COUNT }, + { "pipe", TOK_PIPE }, + { "queue", TOK_QUEUE }, + { "divert", TOK_DIVERT }, + { "tee", TOK_TEE }, + { "netgraph", TOK_NETGRAPH }, + { "ngtee", TOK_NGTEE }, + { "fwd", TOK_FORWARD }, + { "forward", TOK_FORWARD }, + { "skipto", TOK_SKIPTO }, + { "deny", TOK_DENY }, + { "drop", TOK_DENY }, + { "reject", TOK_REJECT }, + { "reset6", TOK_RESET6 }, + { "reset", TOK_RESET }, + { "unreach6", TOK_UNREACH6 }, + { "unreach", TOK_UNREACH }, + { "check-state", TOK_CHECKSTATE }, + { "//", TOK_COMMENT }, + { "nat", TOK_NAT }, + { "reass", TOK_REASS }, + { "setfib", TOK_SETFIB }, + { "setdscp", TOK_SETDSCP }, + { "call", TOK_CALL }, + { "return", TOK_RETURN }, + { NULL, 0 } /* terminator */ +}; + +static struct _s_x rule_action_params[] = { + { "altq", TOK_ALTQ }, + { "log", TOK_LOG }, + { "tag", TOK_TAG }, + { "untag", TOK_UNTAG }, + { NULL, 0 } /* terminator */ +}; + +/* + * The 'lookup' instruction accepts one of the following arguments. + * -1 is a terminator for the list. + * Arguments are passed as v[1] in O_DST_LOOKUP options. 
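+ * For example, the rule option "lookup dst-ip N" corresponds to
+ * TOK_DSTIP in this list.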
+ */ +static int lookup_key[] = { + TOK_DSTIP, TOK_SRCIP, TOK_DSTPORT, TOK_SRCPORT, + TOK_UID, TOK_JAIL, TOK_DSCP, -1 }; + +static struct _s_x rule_options[] = { + { "tagged", TOK_TAGGED }, + { "uid", TOK_UID }, + { "gid", TOK_GID }, + { "jail", TOK_JAIL }, + { "in", TOK_IN }, + { "limit", TOK_LIMIT }, + { "keep-state", TOK_KEEPSTATE }, + { "bridged", TOK_LAYER2 }, + { "layer2", TOK_LAYER2 }, + { "out", TOK_OUT }, + { "diverted", TOK_DIVERTED }, + { "diverted-loopback", TOK_DIVERTEDLOOPBACK }, + { "diverted-output", TOK_DIVERTEDOUTPUT }, + { "xmit", TOK_XMIT }, + { "recv", TOK_RECV }, + { "via", TOK_VIA }, + { "fragment", TOK_FRAG }, + { "frag", TOK_FRAG }, + { "fib", TOK_FIB }, + { "ipoptions", TOK_IPOPTS }, + { "ipopts", TOK_IPOPTS }, + { "iplen", TOK_IPLEN }, + { "ipid", TOK_IPID }, + { "ipprecedence", TOK_IPPRECEDENCE }, + { "dscp", TOK_DSCP }, + { "iptos", TOK_IPTOS }, + { "ipttl", TOK_IPTTL }, + { "ipversion", TOK_IPVER }, + { "ipver", TOK_IPVER }, + { "estab", TOK_ESTAB }, + { "established", TOK_ESTAB }, + { "setup", TOK_SETUP }, + { "sockarg", TOK_SOCKARG }, + { "tcpdatalen", TOK_TCPDATALEN }, + { "tcpflags", TOK_TCPFLAGS }, + { "tcpflgs", TOK_TCPFLAGS }, + { "tcpoptions", TOK_TCPOPTS }, + { "tcpopts", TOK_TCPOPTS }, + { "tcpseq", TOK_TCPSEQ }, + { "tcpack", TOK_TCPACK }, + { "tcpwin", TOK_TCPWIN }, + { "icmptype", TOK_ICMPTYPES }, + { "icmptypes", TOK_ICMPTYPES }, + { "dst-ip", TOK_DSTIP }, + { "src-ip", TOK_SRCIP }, + { "dst-port", TOK_DSTPORT }, + { "src-port", TOK_SRCPORT }, + { "proto", TOK_PROTO }, + { "MAC", TOK_MAC }, + { "mac", TOK_MAC }, + { "mac-type", TOK_MACTYPE }, + { "verrevpath", TOK_VERREVPATH }, + { "versrcreach", TOK_VERSRCREACH }, + { "antispoof", TOK_ANTISPOOF }, + { "ipsec", TOK_IPSEC }, + { "icmp6type", TOK_ICMP6TYPES }, + { "icmp6types", TOK_ICMP6TYPES }, + { "ext6hdr", TOK_EXT6HDR}, + { "flow-id", TOK_FLOWID}, + { "ipv6", TOK_IPV6}, + { "ip6", TOK_IPV6}, + { "ipv4", TOK_IPV4}, + { "ip4", TOK_IPV4}, + { "dst-ipv6", TOK_DSTIP6}, + { "dst-ip6", TOK_DSTIP6}, + { "src-ipv6", TOK_SRCIP6}, + { "src-ip6", TOK_SRCIP6}, + { "lookup", TOK_LOOKUP}, + { "flow", TOK_FLOW}, + { "//", TOK_COMMENT }, + + { "not", TOK_NOT }, /* pseudo option */ + { "!", /* escape ? */ TOK_NOT }, /* pseudo option */ + { "or", TOK_OR }, /* pseudo option */ + { "|", /* escape */ TOK_OR }, /* pseudo option */ + { "{", TOK_STARTBRACE }, /* pseudo option */ + { "(", TOK_STARTBRACE }, /* pseudo option */ + { "}", TOK_ENDBRACE }, /* pseudo option */ + { ")", TOK_ENDBRACE }, /* pseudo option */ + { NULL, 0 } /* terminator */ +}; + +void bprint_uint_arg(struct buf_pr *bp, const char *str, uint32_t arg); +static int ipfw_get_config(struct cmdline_opts *co, struct format_opts *fo, + ipfw_cfg_lheader **pcfg, size_t *psize); +static int ipfw_show_config(struct cmdline_opts *co, struct format_opts *fo, + ipfw_cfg_lheader *cfg, size_t sz, int ac, char **av); +static void ipfw_list_tifaces(void); + +/* + * Simple string buffer API. + * Used to simplify buffer passing between function and for + * transparent overrun handling. + */ + +/* + * Allocates new buffer of given size @sz. + * + * Returns 0 on success. + */ +int +bp_alloc(struct buf_pr *b, size_t size) +{ + memset(b, 0, sizeof(struct buf_pr)); + + if ((b->buf = calloc(1, size)) == NULL) + return (ENOMEM); + + b->ptr = b->buf; + b->size = size; + b->avail = b->size; + + return (0); +} + +void +bp_free(struct buf_pr *b) +{ + + free(b->buf); +} + +/* + * Flushes buffer so new writer start from beginning. 
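+ *
+ * A minimal usage sketch of the buffer API as a whole (rulenum stands
+ * for any value being formatted): allocate once, accumulate with
+ * bprintf(), print the finished string, then rewind for the next line:
+ *
+ *	struct buf_pr bp;
+ *
+ *	if (bp_alloc(&bp, 4096) != 0)
+ *		err(EX_OSERR, "bp_alloc");
+ *	bprintf(&bp, "%05u", rulenum);
+ *	printf("%s\n", bp.buf);
+ *	bp_flush(&bp);
+ *	bp_free(&bp);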
+ */
+void
+bp_flush(struct buf_pr *b)
+{
+
+	b->ptr = b->buf;
+	b->avail = b->size;
+}
+
+/*
+ * Print message specified by @format and args.
+ * Automatically manage buffer space and transparently handle
+ * buffer overruns.
+ *
+ * Returns number of bytes that should have been printed.
+ */
+int
+bprintf(struct buf_pr *b, char *format, ...)
+{
+	va_list args;
+	int i;
+
+	va_start(args, format);
+
+	i = vsnprintf(b->ptr, b->avail, format, args);
+	va_end(args);
+
+	if (i > b->avail || i < 0) {
+		/* Overflow or print error */
+		b->avail = 0;
+	} else {
+		b->ptr += i;
+		b->avail -= i;
+	}
+
+	b->needed += i;
+
+	return (i);
+}
+
+/*
+ * Special values printer for tablearg-aware opcodes.
+ */
+void
+bprint_uint_arg(struct buf_pr *bp, const char *str, uint32_t arg)
+{
+
+	if (str != NULL)
+		bprintf(bp, "%s", str);
+	if (arg == IP_FW_TARG)
+		bprintf(bp, "tablearg");
+	else
+		bprintf(bp, "%u", arg);
+}
+
+/*
+ * Helper routine to print a possibly unaligned uint64_t on
+ * various platforms. If width > 0, print the value with
+ * the desired width, followed by a space;
+ * otherwise, return the required width.
+ */
+int
+pr_u64(struct buf_pr *b, uint64_t *pd, int width)
+{
+#ifdef TCC
+#define U64_FMT "I64"
+#else
+#define U64_FMT "llu"
+#endif
+	uint64_t u;
+	unsigned long long d;
+
+	bcopy (pd, &u, sizeof(u));
+	d = u;
+	return (width > 0) ?
+		bprintf(b, "%*" U64_FMT " ", width, d) :
+		snprintf(NULL, 0, "%" U64_FMT, d) ;
+#undef U64_FMT
+}
+
+
+void *
+safe_calloc(size_t number, size_t size)
+{
+	void *ret = calloc(number, size);
+
+	if (ret == NULL)
+		err(EX_OSERR, "calloc");
+	return ret;
+}
+
+void *
+safe_realloc(void *ptr, size_t size)
+{
+	void *ret = realloc(ptr, size);
+
+	if (ret == NULL)
+		err(EX_OSERR, "realloc");
+	return ret;
+}
+
+/*
+ * Compare things like interface or table names.
+ */
+int
+stringnum_cmp(const char *a, const char *b)
+{
+	int la, lb;
+
+	la = strlen(a);
+	lb = strlen(b);
+
+	if (la > lb)
+		return (1);
+	else if (la < lb)
+		return (-1);
+
+	return (strcmp(a, b));
+}
+
+
+/*
+ * conditionally runs the command.
+ * Selected options or negative -> getsockopt
+ */
+int
+do_cmd(int optname, void *optval, uintptr_t optlen)
+{
+	int i;
+
+	if (co.test_only)
+		return 0;
+
+	if (ipfw_socket == -1)
+		ipfw_socket = socket(AF_INET, SOCK_RAW, IPPROTO_RAW);
+	if (ipfw_socket < 0)
+		err(EX_UNAVAILABLE, "socket");
+
+	if (optname == IP_FW_GET || optname == IP_DUMMYNET_GET ||
+	    optname == IP_FW_ADD || optname == IP_FW3 ||
+	    optname == IP_FW_NAT_GET_CONFIG ||
+	    optname < 0 ||
+	    optname == IP_FW_NAT_GET_LOG) {
+		if (optname < 0)
+			optname = -optname;
+		i = getsockopt(ipfw_socket, IPPROTO_IP, optname, optval,
+		    (socklen_t *)optlen);
+	} else {
+		i = setsockopt(ipfw_socket, IPPROTO_IP, optname, optval, optlen);
+	}
+	return i;
+}
+
+/*
+ * do_set3 - pass ipfw control cmd to kernel
+ * @optname: option name
+ * @optval: pointer to option data
+ * @optlen: option length
+ *
+ * Assumes op3 header is already embedded.
+ * Calls setsockopt() with IP_FW3 as kernel-visible opcode.
+ * Returns 0 on success or errno otherwise.
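+ *
+ * "Embedded" means the ip_fw3_opheader sits at the start of the request
+ * buffer and the payload follows it, mirroring how do_range_cmd() below
+ * lays out its request (a sketch of that layout):
+ *
+ *	typedef struct _ipfw_range_header {
+ *		ip_fw3_opheader	opheader;	(IP_FW3 opcode goes here)
+ *		ipfw_range_tlv	range;
+ *	} ipfw_range_header;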
+ */
+int
+do_set3(int optname, ip_fw3_opheader *op3, uintptr_t optlen)
+{
+
+	if (co.test_only)
+		return (0);
+
+	if (ipfw_socket == -1)
+		ipfw_socket = socket(AF_INET, SOCK_RAW, IPPROTO_RAW);
+	if (ipfw_socket < 0)
+		err(EX_UNAVAILABLE, "socket");
+
+	op3->opcode = optname;
+
+	return (setsockopt(ipfw_socket, IPPROTO_IP, IP_FW3, op3, optlen));
+}
+
+/*
+ * do_get3 - pass ipfw control cmd to kernel
+ * @optname: option name
+ * @optval: pointer to option data
+ * @optlen: pointer to option length
+ *
+ * Assumes op3 header is already embedded.
+ * Calls getsockopt() with IP_FW3 as kernel-visible opcode.
+ * Returns 0 on success or errno otherwise.
+ */
+int
+do_get3(int optname, ip_fw3_opheader *op3, size_t *optlen)
+{
+	int error;
+
+	if (co.test_only)
+		return (0);
+
+	if (ipfw_socket == -1)
+		ipfw_socket = socket(AF_INET, SOCK_RAW, IPPROTO_RAW);
+	if (ipfw_socket < 0)
+		err(EX_UNAVAILABLE, "socket");
+
+	op3->opcode = optname;
+
+	error = getsockopt(ipfw_socket, IPPROTO_IP, IP_FW3, op3,
+	    (socklen_t *)optlen);
+
+	return (error);
+}
+
+/**
+ * match_token takes a table and a string, returns the value associated
+ * with the string (-1 in case of failure).
+ */
+int
+match_token(struct _s_x *table, char *string)
+{
+	struct _s_x *pt;
+	uint i = strlen(string);
+
+	for (pt = table ; i && pt->s != NULL ; pt++)
+		if (strlen(pt->s) == i && !bcmp(string, pt->s, i))
+			return pt->x;
+	return (-1);
+}
+
+/**
+ * match_token_relaxed takes a table and a string, returns the value
+ * associated with the string for the best match.
+ *
+ * Returns:
+ * value from @table for matched records
+ * -1 for non-matched records
+ * -2 if more than one record matches @string.
+ */
+int
+match_token_relaxed(struct _s_x *table, char *string)
+{
+	struct _s_x *pt, *m = NULL;
+	int i, c;
+
+	i = strlen(string);
+	c = 0;
+
+	for (pt = table ; i != 0 && pt->s != NULL ; pt++) {
+		if (strncmp(pt->s, string, i) != 0)
+			continue;
+		m = pt;
+		c++;
+	}
+
+	if (c == 1)
+		return (m->x);
+
+	return (c > 0 ? -2: -1);
+}
+
+/**
+ * match_value takes a table and a value, returns the string associated
+ * with the value (NULL in case of failure).
+ */
+char const *
+match_value(struct _s_x *p, int value)
+{
+	for (; p->s != NULL; p++)
+		if (p->x == value)
+			return p->s;
+	return NULL;
+}
+
+size_t
+concat_tokens(char *buf, size_t bufsize, struct _s_x *table, char *delimiter)
+{
+	struct _s_x *pt;
+	int l;
+	size_t sz;
+
+	for (sz = 0, pt = table ; pt->s != NULL; pt++) {
+		l = snprintf(buf + sz, bufsize - sz, "%s%s",
+		    (sz == 0) ? "" : delimiter, pt->s);
+		sz += l;
+		if (sz > bufsize)
+			return (bufsize);
+	}
+
+	return (sz);
+}
+
+/*
+ * helper function to process a set of flags and set bits in the
+ * appropriate masks.
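+ * E.g. parsing a writable string "syn,!ack" against f_tcpflags[] ORs
+ * TH_SYN into *set and TH_ACK into *clear; a token that fails to match
+ * stops the parse and is reported back through *e (a sketch, mirroring
+ * the real caller fill_flags_cmd() further down):
+ *
+ *	char buf[] = "syn,!ack";
+ *	uint32_t set = 0, clear = 0;
+ *	char *bad;
+ *
+ *	if (fill_flags(f_tcpflags, buf, &bad, &set, &clear) != 0)
+ *		errx(EX_DATAERR, "invalid flag %s", bad);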
+ */
+int
+fill_flags(struct _s_x *flags, char *p, char **e, uint32_t *set,
+    uint32_t *clear)
+{
+	char *q;	/* points to the separator */
+	int val;
+	uint32_t *which;	/* mask we are working on */
+
+	while (p && *p) {
+		if (*p == '!') {
+			p++;
+			which = clear;
+		} else
+			which = set;
+		q = strchr(p, ',');
+		if (q)
+			*q++ = '\0';
+		val = match_token(flags, p);
+		if (val <= 0) {
+			if (e != NULL)
+				*e = p;
+			return (-1);
+		}
+		*which |= (uint32_t)val;
+		p = q;
+	}
+	return (0);
+}
+
+void
+print_flags_buffer(char *buf, size_t sz, struct _s_x *list, uint32_t set)
+{
+	char const *comma = "";
+	int i, l;
+
+	for (i = 0; list[i].x != 0; i++) {
+		if ((set & list[i].x) == 0)
+			continue;
+
+		set &= ~list[i].x;
+		l = snprintf(buf, sz, "%s%s", comma, list[i].s);
+		if (l >= sz)
+			return;
+		comma = ",";
+		buf += l;
+		sz -= l;
+	}
+}
+
+/*
+ * _substrcmp takes two strings and returns 1 if they do not match,
+ * and 0 if they match exactly or the first string is a sub-string
+ * of the second.  A warning is printed to stderr in the case that the
+ * first string is a sub-string of the second.
+ *
+ * This function will be removed in the future through the usual
+ * deprecation process.
+ */
+int
+_substrcmp(const char *str1, const char* str2)
+{
+
+	if (strncmp(str1, str2, strlen(str1)) != 0)
+		return 1;
+
+	if (strlen(str1) != strlen(str2))
+		warnx("DEPRECATED: '%s' matched '%s' as a sub-string",
+		    str1, str2);
+	return 0;
+}
+
+/*
+ * _substrcmp2 takes three strings and returns 1 if the first two do not match,
+ * and 0 if they match exactly or the second string is a sub-string
+ * of the first.  A warning is printed to stderr in the case that the
+ * first string does not match the third.
+ *
+ * This function exists to warn about the bizarre construction
+ * strncmp(str, "by", 2) which is used to allow people to use a shortcut
+ * for "bytes".  The problem is that in addition to accepting "by",
+ * "byt", "byte", and "bytes", it also accepts "by_rabid_dogs" and any
+ * other string beginning with "by".
+ *
+ * This function will be removed in the future through the usual
+ * deprecation process.
+ */
+int
+_substrcmp2(const char *str1, const char* str2, const char* str3)
+{
+
+	if (strncmp(str1, str2, strlen(str2)) != 0)
+		return 1;
+
+	if (strcmp(str1, str3) != 0)
+		warnx("DEPRECATED: '%s' matched '%s'",
+		    str1, str3);
+	return 0;
+}
+
+/*
+ * prints one port, symbolic or numeric
+ */
+static void
+print_port(struct buf_pr *bp, int proto, uint16_t port)
+{
+
+	if (proto == IPPROTO_ETHERTYPE) {
+		char const *s;
+
+		if (co.do_resolv && (s = match_value(ether_types, port)) )
+			bprintf(bp, "%s", s);
+		else
+			bprintf(bp, "0x%04x", port);
+	} else {
+		struct servent *se = NULL;
+		if (co.do_resolv) {
+			struct protoent *pe = getprotobynumber(proto);
+
+			se = getservbyport(htons(port), pe ? pe->p_name : NULL);
+		}
+		if (se)
+			bprintf(bp, "%s", se->s_name);
+		else
+			bprintf(bp, "%d", port);
+	}
+}
+
+static struct _s_x _port_name[] = {
+	{"dst-port",	O_IP_DSTPORT},
+	{"src-port",	O_IP_SRCPORT},
+	{"ipid",	O_IPID},
+	{"iplen",	O_IPLEN},
+	{"ipttl",	O_IPTTL},
+	{"mac-type",	O_MAC_TYPE},
+	{"tcpdatalen",	O_TCPDATALEN},
+	{"tcpwin",	O_TCPWIN},
+	{"tagged",	O_TAGGED},
+	{NULL,		0}
+};
+
+/*
+ * Print the values in a list of 16-bit items of the types above.
+ * XXX todo: add support for mask.
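+ * The instruction body holds pairs of 16-bit values: "80,1000-2000"
+ * is stored as ports[] = { 80, 80, 1000, 2000 } with F_LEN == 3 (the
+ * opcode word plus two pairs), a single port being a pair with
+ * p[0] == p[1].  fill_newports() below is the matching writer.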
+ */
+static void
+print_newports(struct buf_pr *bp, ipfw_insn_u16 *cmd, int proto, int opcode)
+{
+	uint16_t *p = cmd->ports;
+	int i;
+	char const *sep;
+
+	if (opcode != 0) {
+		sep = match_value(_port_name, opcode);
+		if (sep == NULL)
+			sep = "???";
+		bprintf(bp, " %s", sep);
+	}
+	sep = " ";
+	for (i = F_LEN((ipfw_insn *)cmd) - 1; i > 0; i--, p += 2) {
+		bprintf(bp, "%s", sep);
+		print_port(bp, proto, p[0]);
+		if (p[0] != p[1]) {
+			bprintf(bp, "-");
+			print_port(bp, proto, p[1]);
+		}
+		sep = ",";
+	}
+}
+
+/*
+ * Like strtol, but also translates service names into port numbers
+ * for some protocols.
+ * In particular:
+ *	proto == -1 disables the protocol check;
+ *	proto == IPPROTO_ETHERTYPE looks up an internal table
+ *	proto == <some value in /etc/protocols> matches the values there.
+ * Returns *end == s in case the parameter is not found.
+ */
+static int
+strtoport(char *s, char **end, int base, int proto)
+{
+	char *p, *buf;
+	char *s1;
+	int i;
+
+	*end = s;		/* default - not found */
+	if (*s == '\0')
+		return 0;	/* not found */
+
+	if (isdigit(*s))
+		return strtol(s, end, base);
+
+	/*
+	 * find separator. '\\' escapes the next char.
+	 */
+	for (s1 = s; *s1 && (isalnum(*s1) || *s1 == '\\') ; s1++)
+		if (*s1 == '\\' && s1[1] != '\0')
+			s1++;
+
+	buf = safe_calloc(s1 - s + 1, 1);
+
+	/*
+	 * copy into a buffer skipping backslashes
+	 */
+	for (p = s, i = 0; p != s1 ; p++)
+		if (*p != '\\')
+			buf[i++] = *p;
+	buf[i++] = '\0';
+
+	if (proto == IPPROTO_ETHERTYPE) {
+		i = match_token(ether_types, buf);
+		free(buf);
+		if (i != -1) {	/* found */
+			*end = s1;
+			return i;
+		}
+	} else {
+		struct protoent *pe = NULL;
+		struct servent *se;
+
+		if (proto != 0)
+			pe = getprotobynumber(proto);
+		setservent(1);
+		se = getservbyname(buf, pe ? pe->p_name : NULL);
+		free(buf);
+		if (se != NULL) {
+			*end = s1;
+			return ntohs(se->s_port);
+		}
+	}
+	return 0;	/* not found */
+}
+
+/*
+ * Fill the body of the command with the list of port ranges.
+ */
+static int
+fill_newports(ipfw_insn_u16 *cmd, char *av, int proto, int cblen)
+{
+	uint16_t a, b, *p = cmd->ports;
+	int i = 0;
+	char *s = av;
+
+	while (*s) {
+		a = strtoport(av, &s, 0, proto);
+		if (s == av) 	/* empty or invalid argument */
+			return (0);
+
+		CHECK_LENGTH(cblen, i + 2);
+
+		switch (*s) {
+		case '-':	/* a range */
+			av = s + 1;
+			b = strtoport(av, &s, 0, proto);
+			/* Reject expressions like '1-abc' or '1-2-3'. */
+			if (s == av || (*s != ',' && *s != '\0'))
+				return (0);
+			p[0] = a;
+			p[1] = b;
+			break;
+		case ',':	/* comma separated list */
+		case '\0':
+			p[0] = p[1] = a;
+			break;
+		default:
+			warnx("port list: invalid separator <%c> in <%s>",
+			    *s, av);
+			return (0);
+		}
+
+		i++;
+		p += 2;
+		av = s + 1;
+	}
+	if (i > 0) {
+		if (i + 1 > F_LEN_MASK)
+			errx(EX_DATAERR, "too many ports/ranges\n");
+		cmd->o.len |= i + 1;	/* leave F_NOT and F_OR untouched */
+	}
+	return (i);
+}
+
+/*
+ * Fill the body of the command with the list of DiffServ codepoints.
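+ * The codepoints form a 64-bit set split over two 32-bit words: e.g.
+ * "cs1,af21" sets bit 8 (001000) and bit 18 (010010) in the low word,
+ * while codes 32 and above land in the high word.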
+ */
+static void
+fill_dscp(ipfw_insn *cmd, char *av, int cblen)
+{
+	uint32_t *low, *high;
+	char *s = av, *a;
+	int code;
+
+	cmd->opcode = O_DSCP;
+	cmd->len |= F_INSN_SIZE(ipfw_insn_u32) + 1;
+
+	CHECK_CMDLEN;
+
+	low = (uint32_t *)(cmd + 1);
+	high = low + 1;
+
+	*low = 0;
+	*high = 0;
+
+	while (s != NULL) {
+		a = strchr(s, ',');
+
+		if (a != NULL)
+			*a++ = '\0';
+
+		if (isalpha(*s)) {
+			if ((code = match_token(f_ipdscp, s)) == -1)
+				errx(EX_DATAERR, "Unknown DSCP code");
+		} else {
+			code = strtoul(s, NULL, 10);
+			if (code < 0 || code > 63)
+				errx(EX_DATAERR, "Invalid DSCP value");
+		}
+
+		if (code >= 32)
+			*high |= 1 << (code - 32);
+		else
+			*low |= 1 << code;
+
+		s = a;
+	}
+}
+
+static struct _s_x icmpcodes[] = {
+	{ "net",		ICMP_UNREACH_NET },
+	{ "host",		ICMP_UNREACH_HOST },
+	{ "protocol",		ICMP_UNREACH_PROTOCOL },
+	{ "port",		ICMP_UNREACH_PORT },
+	{ "needfrag",		ICMP_UNREACH_NEEDFRAG },
+	{ "srcfail",		ICMP_UNREACH_SRCFAIL },
+	{ "net-unknown",	ICMP_UNREACH_NET_UNKNOWN },
+	{ "host-unknown",	ICMP_UNREACH_HOST_UNKNOWN },
+	{ "isolated",		ICMP_UNREACH_ISOLATED },
+	{ "net-prohib",		ICMP_UNREACH_NET_PROHIB },
+	{ "host-prohib",	ICMP_UNREACH_HOST_PROHIB },
+	{ "tosnet",		ICMP_UNREACH_TOSNET },
+	{ "toshost",		ICMP_UNREACH_TOSHOST },
+	{ "filter-prohib",	ICMP_UNREACH_FILTER_PROHIB },
+	{ "host-precedence",	ICMP_UNREACH_HOST_PRECEDENCE },
+	{ "precedence-cutoff",	ICMP_UNREACH_PRECEDENCE_CUTOFF },
+	{ NULL, 0 }
+};
+
+static void
+fill_reject_code(u_short *codep, char *str)
+{
+	int val;
+	char *s;
+
+	val = strtoul(str, &s, 0);
+	if (s == str || *s != '\0' || val >= 0x100)
+		val = match_token(icmpcodes, str);
+	if (val < 0)
+		errx(EX_DATAERR, "unknown ICMP unreachable code ``%s''", str);
+	*codep = val;
+	return;
+}
+
+static void
+print_reject_code(struct buf_pr *bp, uint16_t code)
+{
+	char const *s;
+
+	if ((s = match_value(icmpcodes, code)) != NULL)
+		bprintf(bp, "unreach %s", s);
+	else
+		bprintf(bp, "unreach %u", code);
+}
+
+/*
+ * Returns the number of bits set (from left) in a contiguous bitmask,
+ * or -1 if the mask is not contiguous.
+ * XXX this needs a proper fix.
+ * This effectively works on masks in big-endian (network) format
+ * when compiled on little endian architectures.
+ *
+ * First bit is bit 7 of the first byte -- note, for MAC addresses,
+ * the first bit on the wire is bit 0 of the first byte.
+ * len is the max length in bits.
+ */
+int
+contigmask(uint8_t *p, int len)
+{
+	int i, n;
+
+	for (i=0; i<len ; i++)
+		if ( (p[i/8] & (1 << (7 - (i%8)))) == 0) /* first bit unset */
+			break;
+	for (n=i+1; n < len; n++)
+		if ( (p[n/8] & (1 << (7 - (n%8)))) != 0)
+			return -1; /* mask not contiguous */
+	return i;
+}
+
+/*
+ * print flags set/clear in the two bitmasks passed as parameters.
+ * There is a specialized check for f_tcpflags.
+ */
+static void
+print_flags(struct buf_pr *bp, char const *name, ipfw_insn *cmd,
+    struct _s_x *list)
+{
+	char const *comma = "";
+	int i;
+	uint8_t set = cmd->arg1 & 0xff;
+	uint8_t clear = (cmd->arg1 >> 8) & 0xff;
+
+	if (list == f_tcpflags && set == TH_SYN && clear == TH_ACK) {
+		bprintf(bp, " setup");
+		return;
+	}
+
+	bprintf(bp, " %s ", name);
+	for (i=0; list[i].x != 0; i++) {
+		if (set & list[i].x) {
+			set &= ~list[i].x;
+			bprintf(bp, "%s%s", comma, list[i].s);
+			comma = ",";
+		}
+		if (clear & list[i].x) {
+			clear &= ~list[i].x;
+			bprintf(bp, "%s!%s", comma, list[i].s);
+			comma = ",";
+		}
+	}
+}
+
+
+/*
+ * Print the ip address contained in a command.
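+ * Depending on the opcode this prints "me", "table(name[,value])",
+ * a set like "1.2.3.0/26{5,9-12}", or a comma-separated list of
+ * addr[/masklen] entries, resolved to host names when co.do_resolv
+ * is set.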
+ */
+static void
+print_ip(struct buf_pr *bp, struct format_opts *fo, ipfw_insn_ip *cmd,
+    char const *s)
+{
+	struct hostent *he = NULL;
+	struct in_addr *ia;
+	uint32_t len = F_LEN((ipfw_insn *)cmd);
+	uint32_t *a = ((ipfw_insn_u32 *)cmd)->d;
+	char *t;
+
+	if (cmd->o.opcode == O_IP_DST_LOOKUP &&
+	    len > F_INSN_SIZE(ipfw_insn_u32)) {
+		uint32_t d = a[1];
+		const char *arg = "<invalid>";
+
+		if (d < sizeof(lookup_key)/sizeof(lookup_key[0]))
+			arg = match_value(rule_options, lookup_key[d]);
+		t = table_search_ctlv(fo->tstate, ((ipfw_insn *)cmd)->arg1);
+		bprintf(bp, "%s lookup %s %s", cmd->o.len & F_NOT ? " not": "",
+		    arg, t);
+		return;
+	}
+	bprintf(bp, "%s%s ", cmd->o.len & F_NOT ? " not": "", s);
+
+	if (cmd->o.opcode == O_IP_SRC_ME || cmd->o.opcode == O_IP_DST_ME) {
+		bprintf(bp, "me");
+		return;
+	}
+	if (cmd->o.opcode == O_IP_SRC_LOOKUP ||
+	    cmd->o.opcode == O_IP_DST_LOOKUP) {
+		t = table_search_ctlv(fo->tstate, ((ipfw_insn *)cmd)->arg1);
+		bprintf(bp, "table(%s", t);
+		if (len == F_INSN_SIZE(ipfw_insn_u32))
+			bprintf(bp, ",%u", *a);
+		bprintf(bp, ")");
+		return;
+	}
+	if (cmd->o.opcode == O_IP_SRC_SET || cmd->o.opcode == O_IP_DST_SET) {
+		uint32_t x, *map = (uint32_t *)&(cmd->mask);
+		int i, j;
+		char comma = '{';
+
+		x = cmd->o.arg1 - 1;
+		x = htonl( ~x );
+		cmd->addr.s_addr = htonl(cmd->addr.s_addr);
+		bprintf(bp, "%s/%d", inet_ntoa(cmd->addr),
+		    contigmask((uint8_t *)&x, 32));
+		x = cmd->addr.s_addr = htonl(cmd->addr.s_addr);
+		x &= 0xff; /* base */
+		/*
+		 * Print bits and ranges.
+		 * Locate first bit set (i), then locate first bit unset (j).
+		 * If we have 3+ consecutive bits set, then print them as a
+		 * range, otherwise only print the initial bit and rescan.
+		 */
+		for (i=0; i < cmd->o.arg1; i++)
+			if (map[i/32] & (1<<(i & 31))) {
+				for (j=i+1; j < cmd->o.arg1; j++)
+					if (!(map[ j/32] & (1<<(j & 31))))
+						break;
+				bprintf(bp, "%c%d", comma, i+x);
+				if (j>i+2) {	/* range has at least 3 elements */
+					bprintf(bp, "-%d", j-1+x);
+					i = j-1;
+				}
+				comma = ',';
+			}
+		bprintf(bp, "}");
+		return;
+	}
+	/*
+	 * len == 2 indicates a single IP, whereas lists of 1 or more
+	 * addr/mask pairs have len = (2n+1). We convert len to n so we
+	 * use that to count the number of entries.
+	 */
+	for (len = len / 2; len > 0; len--, a += 2) {
+		int mb =	/* mask length */
+		    (cmd->o.opcode == O_IP_SRC || cmd->o.opcode == O_IP_DST) ?
+ 32 : contigmask((uint8_t *)&(a[1]), 32); + if (mb == 32 && co.do_resolv) + he = gethostbyaddr((char *)&(a[0]), sizeof(u_long), AF_INET); + if (he != NULL) /* resolved to name */ + bprintf(bp, "%s", he->h_name); + else if (mb == 0) /* any */ + bprintf(bp, "any"); + else { /* numeric IP followed by some kind of mask */ + ia = (struct in_addr *)&a[0]; + bprintf(bp, "%s", inet_ntoa(*ia)); + if (mb < 0) + bprintf(bp, ":%s", inet_ntoa(*ia ) ); + else if (mb < 32) + bprintf(bp, "/%d", mb); + } + if (len > 1) + bprintf(bp, ","); + } +} + +/* + * prints a MAC address/mask pair + */ +static void +print_mac(struct buf_pr *bp, uint8_t *addr, uint8_t *mask) +{ + int l = contigmask(mask, 48); + + if (l == 0) + bprintf(bp, " any"); + else { + bprintf(bp, " %02x:%02x:%02x:%02x:%02x:%02x", + addr[0], addr[1], addr[2], addr[3], addr[4], addr[5]); + if (l == -1) + bprintf(bp, "&%02x:%02x:%02x:%02x:%02x:%02x", + mask[0], mask[1], mask[2], + mask[3], mask[4], mask[5]); + else if (l < 48) + bprintf(bp, "/%d", l); + } +} + +static void +fill_icmptypes(ipfw_insn_u32 *cmd, char *av) +{ + uint8_t type; + + cmd->d[0] = 0; + while (*av) { + if (*av == ',') + av++; + + type = strtoul(av, &av, 0); + + if (*av != ',' && *av != '\0') + errx(EX_DATAERR, "invalid ICMP type"); + + if (type > 31) + errx(EX_DATAERR, "ICMP type out of range"); + + cmd->d[0] |= 1 << type; + } + cmd->o.opcode = O_ICMPTYPE; + cmd->o.len |= F_INSN_SIZE(ipfw_insn_u32); +} + +static void +print_icmptypes(struct buf_pr *bp, ipfw_insn_u32 *cmd) +{ + int i; + char sep= ' '; + + bprintf(bp, " icmptypes"); + for (i = 0; i < 32; i++) { + if ( (cmd->d[0] & (1 << (i))) == 0) + continue; + bprintf(bp, "%c%d", sep, i); + sep = ','; + } +} + +static void +print_dscp(struct buf_pr *bp, ipfw_insn_u32 *cmd) +{ + int i, c; + uint32_t *v; + char sep= ' '; + const char *code; + + bprintf(bp, " dscp"); + i = 0; + c = 0; + v = cmd->d; + while (i < 64) { + if (*v & (1 << i)) { + if ((code = match_value(f_ipdscp, i)) != NULL) + bprintf(bp, "%c%s", sep, code); + else + bprintf(bp, "%c%d", sep, i); + sep = ','; + } + + if ((++i % 32) == 0) + v++; + } +} + +/* + * show_ipfw() prints the body of an ipfw rule. + * Because the standard rule has at least proto src_ip dst_ip, we use + * a helper function to produce these entries if not provided explicitly. + * The first argument is the list of fields we have, the second is + * the list of fields we want to be printed. + * + * Special cases if we have provided a MAC header: + * + if the rule does not contain IP addresses/ports, do not print them; + * + if the rule does not contain an IP proto, print "all" instead of "ip"; + * + * Once we have 'have_options', IP header fields are printed as options. 
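+ *
+ * For example, a rule entered as just "allow" lists back as
+ * "allow ip from any to any": show_prerequisites() below fills in
+ * the missing "ip", "from any" and "to any" parts on demand.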
+ */ +#define HAVE_PROTO 0x0001 +#define HAVE_SRCIP 0x0002 +#define HAVE_DSTIP 0x0004 +#define HAVE_PROTO4 0x0008 +#define HAVE_PROTO6 0x0010 +#define HAVE_IP 0x0100 +#define HAVE_OPTIONS 0x8000 + +static void +show_prerequisites(struct buf_pr *bp, int *flags, int want, int cmd) +{ + (void)cmd; /* UNUSED */ + if (co.comment_only) + return; + if ( (*flags & HAVE_IP) == HAVE_IP) + *flags |= HAVE_OPTIONS; + + if ( !(*flags & HAVE_OPTIONS)) { + if ( !(*flags & HAVE_PROTO) && (want & HAVE_PROTO)) { + if ( (*flags & HAVE_PROTO4)) + bprintf(bp, " ip4"); + else if ( (*flags & HAVE_PROTO6)) + bprintf(bp, " ip6"); + else + bprintf(bp, " ip"); + } + if ( !(*flags & HAVE_SRCIP) && (want & HAVE_SRCIP)) + bprintf(bp, " from any"); + if ( !(*flags & HAVE_DSTIP) && (want & HAVE_DSTIP)) + bprintf(bp, " to any"); + } + *flags |= want; +} + +static void +show_static_rule(struct cmdline_opts *co, struct format_opts *fo, + struct buf_pr *bp, struct ip_fw_rule *rule, struct ip_fw_bcounter *cntr) +{ + static int twidth = 0; + int l; + ipfw_insn *cmd, *tagptr = NULL; + const char *comment = NULL; /* ptr to comment if we have one */ + int proto = 0; /* default */ + int flags = 0; /* prerequisites */ + ipfw_insn_log *logptr = NULL; /* set if we find an O_LOG */ + ipfw_insn_altq *altqptr = NULL; /* set if we find an O_ALTQ */ + int or_block = 0; /* we are in an or block */ + uint32_t uval; + + if ((fo->set_mask & (1 << rule->set)) == 0) { + /* disabled mask */ + if (!co->show_sets) + return; + else + bprintf(bp, "# DISABLED "); + } + bprintf(bp, "%05u ", rule->rulenum); + + /* Print counters if enabled */ + if (fo->pcwidth > 0 || fo->bcwidth > 0) { + pr_u64(bp, &cntr->pcnt, fo->pcwidth); + pr_u64(bp, &cntr->bcnt, fo->bcwidth); + } + + if (co->do_time == 2) + bprintf(bp, "%10u ", cntr->timestamp); + else if (co->do_time == 1) { + char timestr[30]; + time_t t = (time_t)0; + + if (twidth == 0) { + strcpy(timestr, ctime(&t)); + *strchr(timestr, '\n') = '\0'; + twidth = strlen(timestr); + } + if (cntr->timestamp > 0) { + t = _long_to_time(cntr->timestamp); + + strcpy(timestr, ctime(&t)); + *strchr(timestr, '\n') = '\0'; + bprintf(bp, "%s ", timestr); + } else { + bprintf(bp, "%*s", twidth, " "); + } + } + + if (co->show_sets) + bprintf(bp, "set %d ", rule->set); + + /* + * print the optional "match probability" + */ + if (rule->cmd_len > 0) { + cmd = rule->cmd ; + if (cmd->opcode == O_PROB) { + ipfw_insn_u32 *p = (ipfw_insn_u32 *)cmd; + double d = 1.0 * p->d[0]; + + d = (d / 0x7fffffff); + bprintf(bp, "prob %f ", d); + } + } + + /* + * first print actions + */ + for (l = rule->cmd_len - rule->act_ofs, cmd = ACTION_PTR(rule); + l > 0 ; l -= F_LEN(cmd), cmd += F_LEN(cmd)) { + switch(cmd->opcode) { + case O_CHECK_STATE: + bprintf(bp, "check-state"); + /* avoid printing anything else */ + flags = HAVE_PROTO | HAVE_SRCIP | + HAVE_DSTIP | HAVE_IP; + break; + + case O_ACCEPT: + bprintf(bp, "allow"); + break; + + case O_COUNT: + bprintf(bp, "count"); + break; + + case O_DENY: + bprintf(bp, "deny"); + break; + + case O_REJECT: + if (cmd->arg1 == ICMP_REJECT_RST) + bprintf(bp, "reset"); + else if (cmd->arg1 == ICMP_UNREACH_HOST) + bprintf(bp, "reject"); + else + print_reject_code(bp, cmd->arg1); + break; + + case O_UNREACH6: + if (cmd->arg1 == ICMP6_UNREACH_RST) + bprintf(bp, "reset6"); + else + print_unreach6_code(cmd->arg1); + break; + + case O_SKIPTO: + bprint_uint_arg(bp, "skipto ", cmd->arg1); + break; + + case O_PIPE: + bprint_uint_arg(bp, "pipe ", cmd->arg1); + break; + + case O_QUEUE: + bprint_uint_arg(bp, "queue ", 
cmd->arg1); + break; + + case O_DIVERT: + bprint_uint_arg(bp, "divert ", cmd->arg1); + break; + + case O_TEE: + bprint_uint_arg(bp, "tee ", cmd->arg1); + break; + + case O_NETGRAPH: + bprint_uint_arg(bp, "netgraph ", cmd->arg1); + break; + + case O_NGTEE: + bprint_uint_arg(bp, "ngtee ", cmd->arg1); + break; + + case O_FORWARD_IP: + { + ipfw_insn_sa *s = (ipfw_insn_sa *)cmd; + + if (s->sa.sin_addr.s_addr == INADDR_ANY) { + bprintf(bp, "fwd tablearg"); + } else { + bprintf(bp, "fwd %s",inet_ntoa(s->sa.sin_addr)); + } + if (s->sa.sin_port) + bprintf(bp, ",%d", s->sa.sin_port); + } + break; + + case O_FORWARD_IP6: + { + char buf[4 + INET6_ADDRSTRLEN + 1]; + ipfw_insn_sa6 *s = (ipfw_insn_sa6 *)cmd; + + bprintf(bp, "fwd %s", inet_ntop(AF_INET6, + &s->sa.sin6_addr, buf, sizeof(buf))); + if (s->sa.sin6_port) + bprintf(bp, ",%d", s->sa.sin6_port); + } + break; + + case O_LOG: /* O_LOG is printed last */ + logptr = (ipfw_insn_log *)cmd; + break; + + case O_ALTQ: /* O_ALTQ is printed after O_LOG */ + altqptr = (ipfw_insn_altq *)cmd; + break; + + case O_TAG: + tagptr = cmd; + break; + + case O_NAT: + if (cmd->arg1 != 0) + bprint_uint_arg(bp, "nat ", cmd->arg1); + else + bprintf(bp, "nat global"); + break; + + case O_SETFIB: + bprint_uint_arg(bp, "setfib ", cmd->arg1 & 0x7FFF); + break; + + case O_SETDSCP: + { + const char *code; + + if (cmd->arg1 == IP_FW_TARG) { + bprint_uint_arg(bp, "setdscp ", cmd->arg1); + break; + } + uval = cmd->arg1 & 0x3F; + if ((code = match_value(f_ipdscp, uval)) != NULL) + bprintf(bp, "setdscp %s", code); + else + bprint_uint_arg(bp, "setdscp ", uval); + } + break; + + case O_REASS: + bprintf(bp, "reass"); + break; + + case O_CALLRETURN: + if (cmd->len & F_NOT) + bprintf(bp, "return"); + else + bprint_uint_arg(bp, "call ", cmd->arg1); + break; + + default: + bprintf(bp, "** unrecognized action %d len %d ", + cmd->opcode, cmd->len); + } + } + if (logptr) { + if (logptr->max_log > 0) + bprintf(bp, " log logamount %d", logptr->max_log); + else + bprintf(bp, " log"); + } +#ifndef NO_ALTQ + if (altqptr) { + print_altq_cmd(bp, altqptr); + } +#endif + if (tagptr) { + if (tagptr->len & F_NOT) + bprint_uint_arg(bp, " untag ", tagptr->arg1); + else + bprint_uint_arg(bp, " tag ", tagptr->arg1); + } + + /* + * then print the body. 
+ */ + for (l = rule->act_ofs, cmd = rule->cmd; + l > 0 ; l -= F_LEN(cmd) , cmd += F_LEN(cmd)) { + if ((cmd->len & F_OR) || (cmd->len & F_NOT)) + continue; + if (cmd->opcode == O_IP4) { + flags |= HAVE_PROTO4; + break; + } else if (cmd->opcode == O_IP6) { + flags |= HAVE_PROTO6; + break; + } + } + if (rule->flags & IPFW_RULE_NOOPT) { /* empty rules before options */ + if (!co->do_compact) { + show_prerequisites(bp, &flags, HAVE_PROTO, 0); + bprintf(bp, " from any to any"); + } + flags |= HAVE_IP | HAVE_OPTIONS | HAVE_PROTO | + HAVE_SRCIP | HAVE_DSTIP; + } + + if (co->comment_only) + comment = "..."; + + for (l = rule->act_ofs, cmd = rule->cmd; + l > 0 ; l -= F_LEN(cmd) , cmd += F_LEN(cmd)) { + /* useful alias */ + ipfw_insn_u32 *cmd32 = (ipfw_insn_u32 *)cmd; + + if (co->comment_only) { + if (cmd->opcode != O_NOP) + continue; + bprintf(bp, " // %s\n", (char *)(cmd + 1)); + return; + } + + show_prerequisites(bp, &flags, 0, cmd->opcode); + + switch(cmd->opcode) { + case O_PROB: + break; /* done already */ + + case O_PROBE_STATE: + break; /* no need to print anything here */ + + case O_IP_SRC: + case O_IP_SRC_LOOKUP: + case O_IP_SRC_MASK: + case O_IP_SRC_ME: + case O_IP_SRC_SET: + show_prerequisites(bp, &flags, HAVE_PROTO, 0); + if (!(flags & HAVE_SRCIP)) + bprintf(bp, " from"); + if ((cmd->len & F_OR) && !or_block) + bprintf(bp, " {"); + print_ip(bp, fo, (ipfw_insn_ip *)cmd, + (flags & HAVE_OPTIONS) ? " src-ip" : ""); + flags |= HAVE_SRCIP; + break; + + case O_IP_DST: + case O_IP_DST_LOOKUP: + case O_IP_DST_MASK: + case O_IP_DST_ME: + case O_IP_DST_SET: + show_prerequisites(bp, &flags, HAVE_PROTO|HAVE_SRCIP, 0); + if (!(flags & HAVE_DSTIP)) + bprintf(bp, " to"); + if ((cmd->len & F_OR) && !or_block) + bprintf(bp, " {"); + print_ip(bp, fo, (ipfw_insn_ip *)cmd, + (flags & HAVE_OPTIONS) ? " dst-ip" : ""); + flags |= HAVE_DSTIP; + break; + + case O_IP6_SRC: + case O_IP6_SRC_MASK: + case O_IP6_SRC_ME: + show_prerequisites(bp, &flags, HAVE_PROTO, 0); + if (!(flags & HAVE_SRCIP)) + bprintf(bp, " from"); + if ((cmd->len & F_OR) && !or_block) + bprintf(bp, " {"); + print_ip6(bp, (ipfw_insn_ip6 *)cmd, + (flags & HAVE_OPTIONS) ? " src-ip6" : ""); + flags |= HAVE_SRCIP | HAVE_PROTO; + break; + + case O_IP6_DST: + case O_IP6_DST_MASK: + case O_IP6_DST_ME: + show_prerequisites(bp, &flags, HAVE_PROTO|HAVE_SRCIP, 0); + if (!(flags & HAVE_DSTIP)) + bprintf(bp, " to"); + if ((cmd->len & F_OR) && !or_block) + bprintf(bp, " {"); + print_ip6(bp, (ipfw_insn_ip6 *)cmd, + (flags & HAVE_OPTIONS) ? " dst-ip6" : ""); + flags |= HAVE_DSTIP; + break; + + case O_FLOW6ID: + print_flow6id(bp, (ipfw_insn_u32 *) cmd ); + flags |= HAVE_OPTIONS; + break; + + case O_IP_DSTPORT: + show_prerequisites(bp, &flags, + HAVE_PROTO | HAVE_SRCIP | + HAVE_DSTIP | HAVE_IP, 0); + case O_IP_SRCPORT: + if (flags & HAVE_DSTIP) + flags |= HAVE_IP; + show_prerequisites(bp, &flags, + HAVE_PROTO | HAVE_SRCIP, 0); + if ((cmd->len & F_OR) && !or_block) + bprintf(bp, " {"); + if (cmd->len & F_NOT) + bprintf(bp, " not"); + print_newports(bp, (ipfw_insn_u16 *)cmd, proto, + (flags & HAVE_OPTIONS) ? 
cmd->opcode : 0); + break; + + case O_PROTO: { + struct protoent *pe = NULL; + + if ((cmd->len & F_OR) && !or_block) + bprintf(bp, " {"); + if (cmd->len & F_NOT) + bprintf(bp, " not"); + proto = cmd->arg1; + pe = getprotobynumber(cmd->arg1); + if ((flags & (HAVE_PROTO4 | HAVE_PROTO6)) && + !(flags & HAVE_PROTO)) + show_prerequisites(bp, &flags, + HAVE_PROTO | HAVE_IP | HAVE_SRCIP | + HAVE_DSTIP | HAVE_OPTIONS, 0); + if (flags & HAVE_OPTIONS) + bprintf(bp, " proto"); + if (pe) + bprintf(bp, " %s", pe->p_name); + else + bprintf(bp, " %u", cmd->arg1); + } + flags |= HAVE_PROTO; + break; + + default: /*options ... */ + if (!(cmd->len & (F_OR|F_NOT))) + if (((cmd->opcode == O_IP6) && + (flags & HAVE_PROTO6)) || + ((cmd->opcode == O_IP4) && + (flags & HAVE_PROTO4))) + break; + show_prerequisites(bp, &flags, HAVE_PROTO | HAVE_SRCIP | + HAVE_DSTIP | HAVE_IP | HAVE_OPTIONS, 0); + if ((cmd->len & F_OR) && !or_block) + bprintf(bp, " {"); + if (cmd->len & F_NOT && cmd->opcode != O_IN) + bprintf(bp, " not"); + switch(cmd->opcode) { + case O_MACADDR2: { + ipfw_insn_mac *m = (ipfw_insn_mac *)cmd; + + bprintf(bp, " MAC"); + print_mac(bp, m->addr, m->mask); + print_mac(bp, m->addr + 6, m->mask + 6); + } + break; + + case O_MAC_TYPE: + print_newports(bp, (ipfw_insn_u16 *)cmd, + IPPROTO_ETHERTYPE, cmd->opcode); + break; + + + case O_FRAG: + bprintf(bp, " frag"); + break; + + case O_FIB: + bprintf(bp, " fib %u", cmd->arg1 ); + break; + case O_SOCKARG: + bprintf(bp, " sockarg"); + break; + + case O_IN: + bprintf(bp, cmd->len & F_NOT ? " out" : " in"); + break; + + case O_DIVERTED: + switch (cmd->arg1) { + case 3: + bprintf(bp, " diverted"); + break; + case 1: + bprintf(bp, " diverted-loopback"); + break; + case 2: + bprintf(bp, " diverted-output"); + break; + default: + bprintf(bp, " diverted-?<%u>", cmd->arg1); + break; + } + break; + + case O_LAYER2: + bprintf(bp, " layer2"); + break; + case O_XMIT: + case O_RECV: + case O_VIA: + { + char const *s, *t; + ipfw_insn_if *cmdif = (ipfw_insn_if *)cmd; + + if (cmd->opcode == O_XMIT) + s = "xmit"; + else if (cmd->opcode == O_RECV) + s = "recv"; + else /* if (cmd->opcode == O_VIA) */ + s = "via"; + if (cmdif->name[0] == '\0') + bprintf(bp, " %s %s", s, + inet_ntoa(cmdif->p.ip)); + else if (cmdif->name[0] == '\1') { + /* interface table */ + t = table_search_ctlv(fo->tstate, + cmdif->p.kidx); + bprintf(bp, " %s table(%s)", s, t); + } else + bprintf(bp, " %s %s", s, cmdif->name); + + break; + } + case O_IP_FLOW_LOOKUP: + { + char *t; + + t = table_search_ctlv(fo->tstate, cmd->arg1); + bprintf(bp, " flow table(%s", t); + if (F_LEN(cmd) == F_INSN_SIZE(ipfw_insn_u32)) + bprintf(bp, ",%u", + ((ipfw_insn_u32 *)cmd)->d[0]); + bprintf(bp, ")"); + break; + } + case O_IPID: + if (F_LEN(cmd) == 1) + bprintf(bp, " ipid %u", cmd->arg1 ); + else + print_newports(bp, (ipfw_insn_u16 *)cmd, 0, + O_IPID); + break; + + case O_IPTTL: + if (F_LEN(cmd) == 1) + bprintf(bp, " ipttl %u", cmd->arg1 ); + else + print_newports(bp, (ipfw_insn_u16 *)cmd, 0, + O_IPTTL); + break; + + case O_IPVER: + bprintf(bp, " ipver %u", cmd->arg1 ); + break; + + case O_IPPRECEDENCE: + bprintf(bp, " ipprecedence %u", cmd->arg1 >> 5); + break; + + case O_DSCP: + print_dscp(bp, (ipfw_insn_u32 *)cmd); + break; + + case O_IPLEN: + if (F_LEN(cmd) == 1) + bprintf(bp, " iplen %u", cmd->arg1 ); + else + print_newports(bp, (ipfw_insn_u16 *)cmd, 0, + O_IPLEN); + break; + + case O_IPOPT: + print_flags(bp, "ipoptions", cmd, f_ipopts); + break; + + case O_IPTOS: + print_flags(bp, "iptos", cmd, f_iptos); + break; + + case 
O_ICMPTYPE: + print_icmptypes(bp, (ipfw_insn_u32 *)cmd); + break; + + case O_ESTAB: + bprintf(bp, " established"); + break; + + case O_TCPDATALEN: + if (F_LEN(cmd) == 1) + bprintf(bp, " tcpdatalen %u", cmd->arg1 ); + else + print_newports(bp, (ipfw_insn_u16 *)cmd, 0, + O_TCPDATALEN); + break; + + case O_TCPFLAGS: + print_flags(bp, "tcpflags", cmd, f_tcpflags); + break; + + case O_TCPOPTS: + print_flags(bp, "tcpoptions", cmd, f_tcpopts); + break; + + case O_TCPWIN: + if (F_LEN(cmd) == 1) + bprintf(bp, " tcpwin %u", cmd->arg1); + else + print_newports(bp, (ipfw_insn_u16 *)cmd, 0, + O_TCPWIN); + break; + + case O_TCPACK: + bprintf(bp, " tcpack %d", ntohl(cmd32->d[0])); + break; + + case O_TCPSEQ: + bprintf(bp, " tcpseq %d", ntohl(cmd32->d[0])); + break; + + case O_UID: + { + struct passwd *pwd = getpwuid(cmd32->d[0]); + + if (pwd) + bprintf(bp, " uid %s", pwd->pw_name); + else + bprintf(bp, " uid %u", cmd32->d[0]); + } + break; + + case O_GID: + { + struct group *grp = getgrgid(cmd32->d[0]); + + if (grp) + bprintf(bp, " gid %s", grp->gr_name); + else + bprintf(bp, " gid %u", cmd32->d[0]); + } + break; + + case O_JAIL: + bprintf(bp, " jail %d", cmd32->d[0]); + break; + + case O_VERREVPATH: + bprintf(bp, " verrevpath"); + break; + + case O_VERSRCREACH: + bprintf(bp, " versrcreach"); + break; + + case O_ANTISPOOF: + bprintf(bp, " antispoof"); + break; + + case O_IPSEC: + bprintf(bp, " ipsec"); + break; + + case O_NOP: + comment = (char *)(cmd + 1); + break; + + case O_KEEP_STATE: + bprintf(bp, " keep-state"); + break; + + case O_LIMIT: { + struct _s_x *p = limit_masks; + ipfw_insn_limit *c = (ipfw_insn_limit *)cmd; + uint8_t x = c->limit_mask; + char const *comma = " "; + + bprintf(bp, " limit"); + for (; p->x != 0 ; p++) + if ((x & p->x) == p->x) { + x &= ~p->x; + bprintf(bp, "%s%s", comma,p->s); + comma = ","; + } + bprint_uint_arg(bp, " ", c->conn_limit); + break; + } + + case O_IP6: + bprintf(bp, " ip6"); + break; + + case O_IP4: + bprintf(bp, " ip4"); + break; + + case O_ICMP6TYPE: + print_icmp6types(bp, (ipfw_insn_u32 *)cmd); + break; + + case O_EXT_HDR: + print_ext6hdr(bp, (ipfw_insn *)cmd); + break; + + case O_TAGGED: + if (F_LEN(cmd) == 1) + bprint_uint_arg(bp, " tagged ", + cmd->arg1); + else + print_newports(bp, (ipfw_insn_u16 *)cmd, + 0, O_TAGGED); + break; + + default: + bprintf(bp, " [opcode %d len %d]", + cmd->opcode, cmd->len); + } + } + if (cmd->len & F_OR) { + bprintf(bp, " or"); + or_block = 1; + } else if (or_block) { + bprintf(bp, " }"); + or_block = 0; + } + } + show_prerequisites(bp, &flags, HAVE_PROTO | HAVE_SRCIP | HAVE_DSTIP + | HAVE_IP, 0); + if (comment) + bprintf(bp, " // %s", comment); + bprintf(bp, "\n"); +} + +static void +show_dyn_state(struct cmdline_opts *co, struct format_opts *fo, + struct buf_pr *bp, ipfw_dyn_rule *d) +{ + struct protoent *pe; + struct in_addr a; + uint16_t rulenum; + char buf[INET6_ADDRSTRLEN]; + + if (!co->do_expired) { + if (!d->expire && !(d->dyn_type == O_LIMIT_PARENT)) + return; + } + bcopy(&d->rule, &rulenum, sizeof(rulenum)); + bprintf(bp, "%05d", rulenum); + if (fo->pcwidth > 0 || fo->bcwidth > 0) { + bprintf(bp, " "); + pr_u64(bp, &d->pcnt, fo->pcwidth); + pr_u64(bp, &d->bcnt, fo->bcwidth); + bprintf(bp, "(%ds)", d->expire); + } + switch (d->dyn_type) { + case O_LIMIT_PARENT: + bprintf(bp, " PARENT %d", d->count); + break; + case O_LIMIT: + bprintf(bp, " LIMIT"); + break; + case O_KEEP_STATE: /* bidir, no mask */ + bprintf(bp, " STATE"); + break; + } + + if ((pe = getprotobynumber(d->id.proto)) != NULL) + bprintf(bp, " %s", 
pe->p_name);
+	else
+		bprintf(bp, " proto %u", d->id.proto);
+
+	if (d->id.addr_type == 4) {
+		a.s_addr = htonl(d->id.src_ip);
+		bprintf(bp, " %s %d", inet_ntoa(a), d->id.src_port);
+
+		a.s_addr = htonl(d->id.dst_ip);
+		bprintf(bp, " <-> %s %d", inet_ntoa(a), d->id.dst_port);
+	} else if (d->id.addr_type == 6) {
+		bprintf(bp, " %s %d", inet_ntop(AF_INET6, &d->id.src_ip6, buf,
+		    sizeof(buf)), d->id.src_port);
+		bprintf(bp, " <-> %s %d", inet_ntop(AF_INET6, &d->id.dst_ip6,
+		    buf, sizeof(buf)), d->id.dst_port);
+	} else
+		bprintf(bp, " UNKNOWN <-> UNKNOWN\n");
+}
+
+static int
+do_range_cmd(int cmd, ipfw_range_tlv *rt)
+{
+	ipfw_range_header rh;
+	size_t sz;
+
+	memset(&rh, 0, sizeof(rh));
+	memcpy(&rh.range, rt, sizeof(*rt));
+	rh.range.head.length = sizeof(*rt);
+	rh.range.head.type = IPFW_TLV_RANGE;
+	sz = sizeof(rh);
+
+	if (do_get3(cmd, &rh.opheader, &sz) != 0)
+		return (-1);
+	/* Save number of matched objects */
+	rt->new_set = rh.range.new_set;
+	return (0);
+}
+
+/*
+ * This one handles all set-related commands
+ * 	ipfw set { show | enable | disable }
+ * 	ipfw set swap X Y
+ * 	ipfw set move X to Y
+ * 	ipfw set move rule X to Y
+ */
+void
+ipfw_sets_handler(char *av[])
+{
+	uint32_t masks[2];
+	int i;
+	uint8_t cmd, rulenum;
+	ipfw_range_tlv rt;
+	char *msg;
+	size_t size;
+
+	av++;
+	memset(&rt, 0, sizeof(rt));
+
+	if (av[0] == NULL)
+		errx(EX_USAGE, "set needs command");
+	if (_substrcmp(*av, "show") == 0) {
+		struct format_opts fo;
+		ipfw_cfg_lheader *cfg;
+
+		memset(&fo, 0, sizeof(fo));
+		if (ipfw_get_config(&co, &fo, &cfg, &size) != 0)
+			err(EX_OSERR, "requesting config failed");
+
+		for (i = 0, msg = "disable"; i < RESVD_SET; i++)
+			if ((cfg->set_mask & (1<<i)) == 0) {
+				printf("%s %d", msg, i);
+				msg = "";
+			}
+		msg = (cfg->set_mask != (uint32_t)-1) ? " enable" : "enable";
+		for (i = 0; i < RESVD_SET; i++)
+			if ((cfg->set_mask & (1<<i)) != 0) {
+				printf("%s %d", msg, i);
+				msg = "";
+			}
+		printf("\n");
+		free(cfg);
+	} else if (_substrcmp(*av, "swap") == 0) {
+		av++;
+		if ( av[0] == NULL || av[1] == NULL )
+			errx(EX_USAGE, "set swap needs 2 set numbers\n");
+		rt.set = atoi(av[0]);
+		rt.new_set = atoi(av[1]);
+		if (!isdigit(*(av[0])) || rt.set > RESVD_SET)
+			errx(EX_DATAERR, "invalid set number %s\n", av[0]);
+		if (!isdigit(*(av[1])) || rt.new_set > RESVD_SET)
+			errx(EX_DATAERR, "invalid set number %s\n", av[1]);
+		i = do_range_cmd(IP_FW_SET_SWAP, &rt);
+	} else if (_substrcmp(*av, "move") == 0) {
+		av++;
+		if (av[0] && _substrcmp(*av, "rule") == 0) {
+			rt.flags = IPFW_RCFLAG_RANGE; /* move rules to new set */
+			cmd = IP_FW_XMOVE;
+			av++;
+		} else
+			cmd = IP_FW_SET_MOVE; /* Move set to new one */
+		if (av[0] == NULL || av[1] == NULL || av[2] == NULL ||
+		    av[3] != NULL || _substrcmp(av[1], "to") != 0)
+			errx(EX_USAGE, "syntax: set move [rule] X to Y\n");
+		rulenum = atoi(av[0]);
+		rt.new_set = atoi(av[2]);
+		if (cmd == IP_FW_XMOVE) {
+			rt.start_rule = rulenum;
+			rt.end_rule = rulenum;
+		} else
+			rt.set = rulenum;
+		rt.new_set = atoi(av[2]);
+		if (!isdigit(*(av[0])) || (cmd == 3 && rt.set > RESVD_SET) ||
+		    (cmd == 2 && rt.start_rule == IPFW_DEFAULT_RULE) )
+			errx(EX_DATAERR, "invalid source number %s\n", av[0]);
+		if (!isdigit(*(av[2])) || rt.new_set > RESVD_SET)
+			errx(EX_DATAERR, "invalid dest. set %s\n", av[1]);
+		i = do_range_cmd(cmd, &rt);
+	} else if (_substrcmp(*av, "disable") == 0 ||
+	    _substrcmp(*av, "enable") == 0 ) {
+		int which = _substrcmp(*av, "enable") == 0 ?
1 : 0;
+
+		av++;
+		masks[0] = masks[1] = 0;
+
+		while (av[0]) {
+			if (isdigit(**av)) {
+				i = atoi(*av);
+				if (i < 0 || i > RESVD_SET)
+					errx(EX_DATAERR,
+					    "invalid set number %d\n", i);
+				masks[which] |= (1<<i);
+			} else if (_substrcmp(*av, "disable") == 0)
+				which = 0;
+			else if (_substrcmp(*av, "enable") == 0)
+				which = 1;
+			else
+				errx(EX_DATAERR,
+				    "invalid set command %s\n", *av);
+			av++;
+		}
+
+		rt.set = masks[0];
+		rt.new_set = masks[1];
+		i = do_range_cmd(IP_FW_SET_ENABLE, &rt);
+
+		if (i)
+			warn("set enable/disable: setsockopt(IP_FW_SET_ENABLE)");
+	} else
+		errx(EX_USAGE, "invalid set command %s\n", *av);
+}
+
+static void
+prepare_format_dyn(struct cmdline_opts *co, struct format_opts *fo,
+    void *arg, void *_state)
+{
+	ipfw_dyn_rule *d;
+	int width;
+	uint8_t set;
+
+	d = (ipfw_dyn_rule *)_state;
+	/* Count all states */
+	fo->dcnt++;
+
+	if (fo->show_counters == 0)
+		return;
+
+	if (co->use_set) {
+		/* skip states from another set */
+		bcopy((char *)&d->rule + sizeof(uint16_t), &set,
+		    sizeof(uint8_t));
+		if (set != co->use_set - 1)
+			return;
+	}
+
+	width = pr_u64(NULL, &d->pcnt, 0);
+	if (width > fo->pcwidth)
+		fo->pcwidth = width;
+
+	width = pr_u64(NULL, &d->bcnt, 0);
+	if (width > fo->bcwidth)
+		fo->bcwidth = width;
+}
+
+static int
+foreach_state(struct cmdline_opts *co, struct format_opts *fo,
+    caddr_t base, size_t sz, state_cb dyn_bc, void *dyn_arg)
+{
+	int ttype;
+	state_cb *fptr;
+	void *farg;
+	ipfw_obj_tlv *tlv;
+	ipfw_obj_ctlv *ctlv;
+
+	fptr = NULL;
+	ttype = 0;
+
+	while (sz > 0) {
+		ctlv = (ipfw_obj_ctlv *)base;
+		switch (ctlv->head.type) {
+		case IPFW_TLV_DYNSTATE_LIST:
+			base += sizeof(*ctlv);
+			sz -= sizeof(*ctlv);
+			ttype = IPFW_TLV_DYN_ENT;
+			fptr = dyn_bc;
+			farg = dyn_arg;
+			break;
+		default:
+			return (sz);
+		}
+
+		while (sz > 0) {
+			tlv = (ipfw_obj_tlv *)base;
+			if (tlv->type != ttype)
+				break;
+
+			fptr(co, fo, farg, tlv + 1);
+			sz -= tlv->length;
+			base += tlv->length;
+		}
+	}
+
+	return (sz);
+}
+
+static void
+prepare_format_opts(struct cmdline_opts *co, struct format_opts *fo,
+    ipfw_obj_tlv *rtlv, int rcnt, caddr_t dynbase, size_t dynsz)
+{
+	int bcwidth, pcwidth, width;
+	int n;
+	struct ip_fw_bcounter *cntr;
+	struct ip_fw_rule *r;
+
+	bcwidth = 0;
+	pcwidth = 0;
+	if (fo->show_counters != 0) {
+		for (n = 0; n < rcnt; n++,
+		    rtlv = (ipfw_obj_tlv *)((caddr_t)rtlv + rtlv->length)) {
+			cntr = (struct ip_fw_bcounter *)(rtlv + 1);
+			r = (struct ip_fw_rule *)((caddr_t)cntr + cntr->size);
+			/* skip rules from another set */
+			if (co->use_set && r->set != co->use_set - 1)
+				continue;
+
+			/* packet counter */
+			width = pr_u64(NULL, &cntr->pcnt, 0);
+			if (width > pcwidth)
+				pcwidth = width;
+
+			/* byte counter */
+			width = pr_u64(NULL, &cntr->bcnt, 0);
+			if (width > bcwidth)
+				bcwidth = width;
+		}
+	}
+	fo->bcwidth = bcwidth;
+	fo->pcwidth = pcwidth;
+
+	fo->dcnt = 0;
+	if (co->do_dynamic && dynsz > 0)
+		foreach_state(co, fo, dynbase, dynsz, prepare_format_dyn, NULL);
+}
+
+static int
+list_static_range(struct cmdline_opts *co, struct format_opts *fo,
+    struct buf_pr *bp, ipfw_obj_tlv *rtlv, int rcnt)
+{
+	int n, seen;
+	struct ip_fw_rule *r;
+	struct ip_fw_bcounter *cntr;
+	int c = 0;
+
+	for (n = seen = 0; n < rcnt; n++,
+	    rtlv = (ipfw_obj_tlv *)((caddr_t)rtlv + rtlv->length)) {
+
+		if (fo->show_counters != 0) {
+			cntr = (struct ip_fw_bcounter *)(rtlv + 1);
+			r = (struct ip_fw_rule *)((caddr_t)cntr + cntr->size);
+		} else {
+			cntr = NULL;
+			r = (struct ip_fw_rule *)(rtlv + 1);
+		}
+		if (r->rulenum > fo->last)
+			break;
+		if (co->use_set && r->set != co->use_set - 1)
+			continue;
+		if (r->rulenum >= fo->first && r->rulenum <= fo->last) {
+			show_static_rule(co, fo, bp, r, cntr);
+			printf("%s", bp->buf);
+			c += rtlv->length;
+			bp_flush(bp);
+			seen++;
+		}
+	}
+
+	return (seen);
+}
+
+static void
+list_dyn_state(struct cmdline_opts *co, struct format_opts *fo,
+    void *_arg, void *_state)
+{
+	uint16_t rulenum;
+	uint8_t set;
+	ipfw_dyn_rule *d;
+	struct buf_pr *bp;
+
+	d = (ipfw_dyn_rule *)_state;
+	bp = (struct buf_pr *)_arg;
+
+	bcopy(&d->rule, &rulenum, sizeof(rulenum));
+	if (rulenum > fo->last)
+		return;
+	if (co->use_set) {
+		bcopy((char *)&d->rule + sizeof(uint16_t),
+		    &set,
sizeof(uint8_t));
+		if (set != co->use_set - 1)
+			return;
+	}
+	if (rulenum >= fo->first) {
+		show_dyn_state(co, fo, bp, d);
+		printf("%s\n", bp->buf);
+		bp_flush(bp);
+	}
+}
+
+static int
+list_dyn_range(struct cmdline_opts *co, struct format_opts *fo,
+    struct buf_pr *bp, caddr_t base, size_t sz)
+{
+
+	sz = foreach_state(co, fo, base, sz, list_dyn_state, bp);
+	return (sz);
+}
+
+void
+ipfw_list(int ac, char *av[], int show_counters)
+{
+	ipfw_cfg_lheader *cfg;
+	struct format_opts sfo;
+	size_t sz;
+	int error;
+	int lac;
+	char **lav;
+	uint32_t rnum;
+	char *endptr;
+
+	if (co.test_only) {
+		fprintf(stderr, "Testing only, list disabled\n");
+		return;
+	}
+	if (co.do_pipe) {
+		dummynet_list(ac, av, show_counters);
+		return;
+	}
+
+	ac--;
+	av++;
+	memset(&sfo, 0, sizeof(sfo));
+
+	/* Determine rule range to request */
+	if (ac > 0) {
+		for (lac = ac, lav = av; lac != 0; lac--) {
+			rnum = strtoul(*lav++, &endptr, 10);
+			if (sfo.first == 0 || rnum < sfo.first)
+				sfo.first = rnum;
+
+			if (*endptr == '-')
+				rnum = strtoul(endptr + 1, &endptr, 10);
+			if (sfo.last == 0 || rnum > sfo.last)
+				sfo.last = rnum;
+		}
+	}
+
+	/* get configuration from kernel */
+	cfg = NULL;
+	sfo.show_counters = show_counters;
+	sfo.flags = IPFW_CFG_GET_STATIC;
+	if (co.do_dynamic != 0)
+		sfo.flags |= IPFW_CFG_GET_STATES;
+	if (sfo.show_counters != 0)
+		sfo.flags |= IPFW_CFG_GET_COUNTERS;
+	if (ipfw_get_config(&co, &sfo, &cfg, &sz) != 0)
+		err(EX_OSERR, "retrieving config failed");
+
+	error = ipfw_show_config(&co, &sfo, cfg, sz, ac, av);
+
+	free(cfg);
+
+	if (error != EX_OK)
+		exit(error);
+}
+
+static int
+ipfw_show_config(struct cmdline_opts *co, struct format_opts *fo,
+    ipfw_cfg_lheader *cfg, size_t sz, int ac, char *av[])
+{
+	caddr_t dynbase;
+	size_t dynsz;
+	int rcnt;
+	int exitval = EX_OK;
+	int lac;
+	char **lav;
+	char *endptr;
+	size_t readsz;
+	struct buf_pr bp;
+	ipfw_obj_ctlv *ctlv, *tstate;
+	ipfw_obj_tlv *rbase;
+
+	/*
+	 * Handle tablenames TLV first, if any
+	 */
+	tstate = NULL;
+	rbase = NULL;
+	dynbase = NULL;
+	dynsz = 0;
+	readsz = sizeof(*cfg);
+	rcnt = 0;
+
+	fo->set_mask = cfg->set_mask;
+
+	ctlv = (ipfw_obj_ctlv *)(cfg + 1);
+
+	if (cfg->flags & IPFW_CFG_GET_STATIC) {
+		/* We've requested static rules */
+		if (ctlv->head.type == IPFW_TLV_TBLNAME_LIST) {
+			fo->tstate = ctlv;
+			readsz += ctlv->head.length;
+			ctlv = (ipfw_obj_ctlv *)((caddr_t)ctlv +
+			    ctlv->head.length);
+		}
+
+		if (ctlv->head.type == IPFW_TLV_RULE_LIST) {
+			rbase = (ipfw_obj_tlv *)(ctlv + 1);
+			rcnt = ctlv->count;
+			readsz += ctlv->head.length;
+			ctlv = (ipfw_obj_ctlv *)((caddr_t)ctlv +
+			    ctlv->head.length);
+		}
+	}
+
+	if ((cfg->flags & IPFW_CFG_GET_STATES) && (readsz != sz)) {
+		/* We may have some dynamic states */
+		dynsz = sz - readsz;
+		/* Skip empty header */
+		if (dynsz != sizeof(ipfw_obj_ctlv))
+			dynbase = (caddr_t)ctlv;
+		else
+			dynsz = 0;
+	}
+
+	prepare_format_opts(co, fo, rbase, rcnt, dynbase, dynsz);
+	bp_alloc(&bp, 4096);
+
+	/* if no rule numbers were specified, list all rules */
+	if (ac == 0) {
+		fo->first = 0;
+		fo->last = IPFW_DEFAULT_RULE;
+		list_static_range(co, fo, &bp, rbase, rcnt);
+
+		if (co->do_dynamic && dynsz > 0) {
+			printf("## Dynamic rules (%d %zu):\n", fo->dcnt, dynsz);
+			list_dyn_range(co, fo, &bp, dynbase, dynsz);
+		}
+
+		bp_free(&bp);
+		return (EX_OK);
+	}
+
+	/* display specific rules requested on command line */
+	for (lac = ac, lav = av; lac != 0; lac--) {
+		/* convert command line rule # */
+		fo->last = fo->first = strtoul(*lav++, &endptr, 10);
+		if (*endptr == '-')
+			fo->last =
strtoul(endptr + 1, &endptr, 10); + if (*endptr) { + exitval = EX_USAGE; + warnx("invalid rule number: %s", *(lav - 1)); + continue; + } + + if (list_static_range(co, fo, &bp, rbase, rcnt) == 0) { + /* give precedence to other error(s) */ + if (exitval == EX_OK) + exitval = EX_UNAVAILABLE; + if (fo->first == fo->last) + warnx("rule %u does not exist", fo->first); + else + warnx("no rules in range %u-%u", + fo->first, fo->last); + } + } + + if (co->do_dynamic && dynsz > 0) { + printf("## Dynamic rules:\n"); + for (lac = ac, lav = av; lac != 0; lac--) { + fo->last = fo->first = strtoul(*lav++, &endptr, 10); + if (*endptr == '-') + fo->last = strtoul(endptr+1, &endptr, 10); + if (*endptr) + /* already warned */ + continue; + list_dyn_range(co, fo, &bp, dynbase, dynsz); + } + } + + bp_free(&bp); + return (exitval); +} + + +/* + * Retrieves current ipfw configuration of given type + * and stores its pointer to @pcfg. + * + * Caller is responsible for freeing @pcfg. + * + * Returns 0 on success. + */ + +static int +ipfw_get_config(struct cmdline_opts *co, struct format_opts *fo, + ipfw_cfg_lheader **pcfg, size_t *psize) +{ + ipfw_cfg_lheader *cfg; + size_t sz; + int i; + + + if (co->test_only != 0) { + fprintf(stderr, "Testing only, list disabled\n"); + return (0); + } + + /* Start with some data size */ + sz = 4096; + cfg = NULL; + + for (i = 0; i < 16; i++) { + if (cfg != NULL) + free(cfg); + if ((cfg = calloc(1, sz)) == NULL) + return (ENOMEM); + + cfg->flags = fo->flags; + cfg->start_rule = fo->first; + cfg->end_rule = fo->last; + + if (do_get3(IP_FW_XGET, &cfg->opheader, &sz) != 0) { + if (errno != ENOMEM) { + free(cfg); + return (errno); + } + + /* Buffer size is not enough. Try to increase */ + sz = sz * 2; + if (sz < cfg->size) + sz = cfg->size; + continue; + } + + *pcfg = cfg; + *psize = sz; + return (0); + } + + free(cfg); + return (ENOMEM); +} + +static int +lookup_host (char *host, struct in_addr *ipaddr) +{ + struct hostent *he; + + if (!inet_aton(host, ipaddr)) { + if ((he = gethostbyname(host)) == NULL) + return(-1); + *ipaddr = *(struct in_addr *)he->h_addr_list[0]; + } + return(0); +} + +struct tidx { + ipfw_obj_ntlv *idx; + uint32_t count; + uint32_t size; + uint16_t counter; + uint8_t set; +}; + +static uint16_t +pack_table(struct tidx *tstate, char *name) +{ + int i; + ipfw_obj_ntlv *ntlv; + + if (table_check_name(name) != 0) + return (0); + + for (i = 0; i < tstate->count; i++) { + if (strcmp(tstate->idx[i].name, name) != 0) + continue; + if (tstate->idx[i].set != tstate->set) + continue; + + return (tstate->idx[i].idx); + } + + if (tstate->count + 1 > tstate->size) { + tstate->size += 4; + tstate->idx = realloc(tstate->idx, tstate->size * + sizeof(ipfw_obj_ntlv)); + if (tstate->idx == NULL) + return (0); + } + + ntlv = &tstate->idx[i]; + memset(ntlv, 0, sizeof(ipfw_obj_ntlv)); + strlcpy(ntlv->name, name, sizeof(ntlv->name)); + ntlv->head.type = IPFW_TLV_TBL_NAME; + ntlv->head.length = sizeof(ipfw_obj_ntlv); + ntlv->set = tstate->set; + ntlv->idx = ++tstate->counter; + tstate->count++; + + return (ntlv->idx); +} + +static void +fill_table(ipfw_insn *cmd, char *av, uint8_t opcode, struct tidx *tstate) +{ + uint32_t *d = ((ipfw_insn_u32 *)cmd)->d; + uint16_t uidx; + char *p; + + if ((p = strchr(av + 6, ')')) == NULL) + errx(EX_DATAERR, "forgotten parenthesis: '%s'", av); + *p = '\0'; + p = strchr(av + 6, ','); + if (p) + *p++ = '\0'; + + if ((uidx = pack_table(tstate, av + 6)) == 0) + errx(EX_DATAERR, "Invalid table name: %s", av + 6); + + cmd->opcode = opcode; + cmd->arg1 = 
uidx; + if (p) { + cmd->len |= F_INSN_SIZE(ipfw_insn_u32); + d[0] = strtoul(p, NULL, 0); + } else + cmd->len |= F_INSN_SIZE(ipfw_insn); +} + + +/* + * fills the addr and mask fields in the instruction as appropriate from av. + * Update length as appropriate. + * The following formats are allowed: + * me returns O_IP_*_ME + * 1.2.3.4 single IP address + * 1.2.3.4:5.6.7.8 address:mask + * 1.2.3.4/24 address/mask + * 1.2.3.4/26{1,6,5,4,23} set of addresses in a subnet + * We can have multiple comma-separated address/mask entries. + */ +static void +fill_ip(ipfw_insn_ip *cmd, char *av, int cblen, struct tidx *tstate) +{ + int len = 0; + uint32_t *d = ((ipfw_insn_u32 *)cmd)->d; + + cmd->o.len &= ~F_LEN_MASK; /* zero len */ + + if (_substrcmp(av, "any") == 0) + return; + + if (_substrcmp(av, "me") == 0) { + cmd->o.len |= F_INSN_SIZE(ipfw_insn); + return; + } + + if (strncmp(av, "table(", 6) == 0) { + fill_table(&cmd->o, av, O_IP_DST_LOOKUP, tstate); + return; + } + + while (av) { + /* + * After the address we can have '/' or ':' indicating a mask, + * ',' indicating another address follows, '{' indicating a + * set of addresses of unspecified size. + */ + char *t = NULL, *p = strpbrk(av, "/:,{"); + int masklen; + char md, nd = '\0'; + + CHECK_LENGTH(cblen, F_INSN_SIZE(ipfw_insn) + 2 + len); + + if (p) { + md = *p; + *p++ = '\0'; + if ((t = strpbrk(p, ",{")) != NULL) { + nd = *t; + *t = '\0'; + } + } else + md = '\0'; + + if (lookup_host(av, (struct in_addr *)&d[0]) != 0) + errx(EX_NOHOST, "hostname ``%s'' unknown", av); + switch (md) { + case ':': + if (!inet_aton(p, (struct in_addr *)&d[1])) + errx(EX_DATAERR, "bad netmask ``%s''", p); + break; + case '/': + masklen = atoi(p); + if (masklen == 0) + d[1] = htonl(0); /* mask */ + else if (masklen > 32) + errx(EX_DATAERR, "bad width ``%s''", p); + else + d[1] = htonl(~0 << (32 - masklen)); + break; + case '{': /* no mask, assume /24 and put back the '{' */ + d[1] = htonl(~0 << (32 - 24)); + *(--p) = md; + break; + + case ',': /* single address plus continuation */ + *(--p) = md; + /* FALLTHROUGH */ + case 0: /* initialization value */ + default: + d[1] = htonl(~0); /* force /32 */ + break; + } + d[0] &= d[1]; /* mask base address with mask */ + if (t) + *t = nd; + /* find next separator */ + if (p) + p = strpbrk(p, ",{"); + if (p && *p == '{') { + /* + * We have a set of addresses. They are stored as follows: + * arg1 is the set size (powers of 2, 2..256) + * addr is the base address IN HOST FORMAT + * mask.. is an array of arg1 bits (rounded up to + * the next multiple of 32) with bits set + * for each host in the map. + */ + uint32_t *map = (uint32_t *)&cmd->mask; + int low, high; + int i = contigmask((uint8_t *)&(d[1]), 32); + + if (len > 0) + errx(EX_DATAERR, "address set cannot be in a list"); + if (i < 24 || i > 31) + errx(EX_DATAERR, "invalid set with mask %d\n", i); + cmd->o.arg1 = 1<<(32-i); /* map length */ + d[0] = ntohl(d[0]); /* base addr in host format */ + cmd->o.opcode = O_IP_DST_SET; /* default */ + cmd->o.len |= F_INSN_SIZE(ipfw_insn_u32) + (cmd->o.arg1+31)/32; + for (i = 0; i < (cmd->o.arg1+31)/32 ; i++) + map[i] = 0; /* clear map */ + + av = p + 1; + low = d[0] & 0xff; + high = low + cmd->o.arg1 - 1; + /* + * Here, i stores the previous value when we specify a range + * of addresses within a mask, e.g. 45-63. i = -1 means we + * have no previous value. 
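+			 * For example, "1.2.3.4/26{5,9-12}" selects hosts 5
+			 * and 9 through 12 of the /26 containing 1.2.3.4;
+			 * each listed value is checked against [low, high]
+			 * and stored base-relative in map[] after low is
+			 * subtracted.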
+ */ + i = -1; /* previous value in a range */ + while (isdigit(*av)) { + char *s; + int a = strtol(av, &s, 0); + + if (s == av) { /* no parameter */ + if (*av != '}') + errx(EX_DATAERR, "set not closed\n"); + if (i != -1) + errx(EX_DATAERR, "incomplete range %d-", i); + break; + } + if (a < low || a > high) + errx(EX_DATAERR, "addr %d out of range [%d-%d]\n", + a, low, high); + a -= low; + if (i == -1) /* no previous in range */ + i = a; + else { /* check that range is valid */ + if (i > a) + errx(EX_DATAERR, "invalid range %d-%d", + i+low, a+low); + if (*s == '-') + errx(EX_DATAERR, "double '-' in range"); + } + for (; i <= a; i++) + map[i/32] |= 1<<(i & 31); + i = -1; + if (*s == '-') + i = a; + else if (*s == '}') + break; + av = s+1; + } + return; + } + av = p; + if (av) /* then *av must be a ',' */ + av++; + + /* Check this entry */ + if (d[1] == 0) { /* "any", specified as x.x.x.x/0 */ + /* + * 'any' turns the entire list into a NOP. + * 'not any' never matches, so it is removed from the + * list unless it is the only item, in which case we + * report an error. + */ + if (cmd->o.len & F_NOT) { /* "not any" never matches */ + if (av == NULL && len == 0) /* only this entry */ + errx(EX_DATAERR, "not any never matches"); + } + /* else do nothing and skip this entry */ + return; + } + /* A single IP can be stored in an optimized format */ + if (d[1] == (uint32_t)~0 && av == NULL && len == 0) { + cmd->o.len |= F_INSN_SIZE(ipfw_insn_u32); + return; + } + len += 2; /* two words... */ + d += 2; + } /* end while */ + if (len + 1 > F_LEN_MASK) + errx(EX_DATAERR, "address list too long"); + cmd->o.len |= len+1; +} + + +/* n2mask sets n bits of the mask */ +void +n2mask(struct in6_addr *mask, int n) +{ + static int minimask[9] = + { 0x00, 0x80, 0xc0, 0xe0, 0xf0, 0xf8, 0xfc, 0xfe, 0xff }; + u_char *p; + + memset(mask, 0, sizeof(struct in6_addr)); + p = (u_char *) mask; + for (; n > 0; p++, n -= 8) { + if (n >= 8) + *p = 0xff; + else + *p = minimask[n]; + } + return; +} + +static void +fill_flags_cmd(ipfw_insn *cmd, enum ipfw_opcodes opcode, + struct _s_x *flags, char *p) +{ + char *e; + uint32_t set = 0, clear = 0; + + if (fill_flags(flags, p, &e, &set, &clear) != 0) + errx(EX_DATAERR, "invalid flag %s", e); + + cmd->opcode = opcode; + cmd->len = (cmd->len & (F_NOT | F_OR)) | 1; + cmd->arg1 = (set & 0xff) | ( (clear & 0xff) << 8); +} + + +void +ipfw_delete(char *av[]) +{ + int i; + int exitval = EX_OK; + int do_set = 0; + ipfw_range_tlv rt; + + av++; + NEED1("missing rule specification"); + memset(&rt, 0, sizeof(rt)); + if ( *av && _substrcmp(*av, "set") == 0) { + /* Do not allow using the following syntax: + * ipfw set N delete set M + */ + if (co.use_set) + errx(EX_DATAERR, "invalid syntax"); + do_set = 1; /* delete set */ + av++; + } + + /* Rule number */ + while (*av && isdigit(**av)) { + i = atoi(*av); av++; + if (co.do_nat) { + exitval = do_cmd(IP_FW_NAT_DEL, &i, sizeof i); + if (exitval) { + exitval = EX_UNAVAILABLE; + warn("rule %u not available", i); + } + } else if (co.do_pipe) { + exitval = ipfw_delete_pipe(co.do_pipe, i); + } else { + if (do_set != 0) { + rt.set = i & 31; + rt.flags = IPFW_RCFLAG_SET; + } else { + rt.start_rule = i & 0xffff; + rt.end_rule = i & 0xffff; + if (rt.start_rule == 0 && rt.end_rule == 0) + rt.flags |= IPFW_RCFLAG_ALL; + else + rt.flags |= IPFW_RCFLAG_RANGE; + if (co.use_set != 0) { + rt.set = co.use_set - 1; + rt.flags |= IPFW_RCFLAG_SET; + } + } + i = do_range_cmd(IP_FW_XDEL, &rt); + if (i != 0) { + exitval = EX_UNAVAILABLE; + warn("rule %u: 
setsockopt(IP_FW_XDEL)", + rt.start_rule); + } else if (rt.new_set == 0) { + exitval = EX_UNAVAILABLE; + if (rt.start_rule != rt.end_rule) + warnx("no rules rules in %u-%u range", + rt.start_rule, rt.end_rule); + else + warnx("rule %u not found", + rt.start_rule); + } + } + } + if (exitval != EX_OK) + exit(exitval); +} + + +/* + * fill the interface structure. We do not check the name as we can + * create interfaces dynamically, so checking them at insert time + * makes relatively little sense. + * Interface names containing '*', '?', or '[' are assumed to be shell + * patterns which match interfaces. + */ +static void +fill_iface(ipfw_insn_if *cmd, char *arg, int cblen, struct tidx *tstate) +{ + char *p; + uint16_t uidx; + + cmd->name[0] = '\0'; + cmd->o.len |= F_INSN_SIZE(ipfw_insn_if); + + CHECK_CMDLEN; + + /* Parse the interface or address */ + if (strcmp(arg, "any") == 0) + cmd->o.len = 0; /* effectively ignore this command */ + else if (strncmp(arg, "table(", 6) == 0) { + if ((p = strchr(arg + 6, ')')) == NULL) + errx(EX_DATAERR, "forgotten parenthesis: '%s'", arg); + *p = '\0'; + p = strchr(arg + 6, ','); + if (p) + *p++ = '\0'; + if ((uidx = pack_table(tstate, arg + 6)) == 0) + errx(EX_DATAERR, "Invalid table name: %s", arg + 6); + + cmd->name[0] = '\1'; /* Special value indicating table */ + cmd->p.kidx = uidx; + } else if (!isdigit(*arg)) { + strlcpy(cmd->name, arg, sizeof(cmd->name)); + cmd->p.glob = strpbrk(arg, "*?[") != NULL ? 1 : 0; + } else if (!inet_aton(arg, &cmd->p.ip)) + errx(EX_DATAERR, "bad ip address ``%s''", arg); +} + +static void +get_mac_addr_mask(const char *p, uint8_t *addr, uint8_t *mask) +{ + int i; + size_t l; + char *ap, *ptr, *optr; + struct ether_addr *mac; + const char *macset = "0123456789abcdefABCDEF:"; + + if (strcmp(p, "any") == 0) { + for (i = 0; i < ETHER_ADDR_LEN; i++) + addr[i] = mask[i] = 0; + return; + } + + optr = ptr = strdup(p); + if ((ap = strsep(&ptr, "&/")) != NULL && *ap != 0) { + l = strlen(ap); + if (strspn(ap, macset) != l || (mac = ether_aton(ap)) == NULL) + errx(EX_DATAERR, "Incorrect MAC address"); + bcopy(mac, addr, ETHER_ADDR_LEN); + } else + errx(EX_DATAERR, "Incorrect MAC address"); + + if (ptr != NULL) { /* we have mask? */ + if (p[ptr - optr - 1] == '/') { /* mask len */ + long ml = strtol(ptr, &ap, 10); + if (*ap != 0 || ml > ETHER_ADDR_LEN * 8 || ml < 0) + errx(EX_DATAERR, "Incorrect mask length"); + for (i = 0; ml > 0 && i < ETHER_ADDR_LEN; ml -= 8, i++) + mask[i] = (ml >= 8) ? 0xff: (~0) << (8 - ml); + } else { /* mask */ + l = strlen(ptr); + if (strspn(ptr, macset) != l || + (mac = ether_aton(ptr)) == NULL) + errx(EX_DATAERR, "Incorrect mask"); + bcopy(mac, mask, ETHER_ADDR_LEN); + } + } else { /* default mask: ff:ff:ff:ff:ff:ff */ + for (i = 0; i < ETHER_ADDR_LEN; i++) + mask[i] = 0xff; + } + for (i = 0; i < ETHER_ADDR_LEN; i++) + addr[i] &= mask[i]; + + free(optr); +} + +/* + * helper function, updates the pointer to cmd with the length + * of the current command, and also cleans up the first word of + * the new command in case it has been clobbered before. + */ +static ipfw_insn * +next_cmd(ipfw_insn *cmd, int *len) +{ + *len -= F_LEN(cmd); + CHECK_LENGTH(*len, 0); + cmd += F_LEN(cmd); + bzero(cmd, sizeof(*cmd)); + return cmd; +} + +/* + * Takes arguments and copies them into a comment + */ +static void +fill_comment(ipfw_insn *cmd, char **av, int cblen) +{ + int i, l; + char *p = (char *)(cmd + 1); + + cmd->opcode = O_NOP; + cmd->len = (cmd->len & (F_NOT | F_OR)); + + /* Compute length of comment string. 
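A worked instance of the sizing that follows, as a stand-alone demo; the rule comment "allow dns" is illustrative.

#include <stdio.h>
#include <string.h>

/* Demo only (not in the patch): fill_comment() pads the text to
 * 32-bit words and adds one word for the O_NOP opcode itself. */
int
main(void)
{
	const char *av[] = { "allow", "dns", NULL };
	int i, l = 0;

	for (i = 0; av[i] != NULL; i++)
		l += strlen(av[i]) + 1;	/* word + space (or final NUL) */
	printf("payload %d bytes -> insn len %d words\n", l, 1 + (l + 3) / 4);
	return (0);	/* prints: payload 10 bytes -> insn len 4 words */
}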
*/ + for (i = 0, l = 0; av[i] != NULL; i++) + l += strlen(av[i]) + 1; + if (l == 0) + return; + if (l > 84) + errx(EX_DATAERR, + "comment too long (max 80 chars)"); + l = 1 + (l+3)/4; + cmd->len = (cmd->len & (F_NOT | F_OR)) | l; + CHECK_CMDLEN; + + for (i = 0; av[i] != NULL; i++) { + strcpy(p, av[i]); + p += strlen(av[i]); + *p++ = ' '; + } + *(--p) = '\0'; +} + +/* + * A function to fill simple commands of size 1. + * Existing flags are preserved. + */ +static void +fill_cmd(ipfw_insn *cmd, enum ipfw_opcodes opcode, int flags, uint16_t arg) +{ + cmd->opcode = opcode; + cmd->len = ((cmd->len | flags) & (F_NOT | F_OR)) | 1; + cmd->arg1 = arg; +} + +/* + * Fetch and add the MAC address and type, with masks. This generates one or + * two microinstructions, and returns the pointer to the last one. + */ +static ipfw_insn * +add_mac(ipfw_insn *cmd, char *av[], int cblen) +{ + ipfw_insn_mac *mac; + + if ( ( av[0] == NULL ) || ( av[1] == NULL ) ) + errx(EX_DATAERR, "MAC dst src"); + + cmd->opcode = O_MACADDR2; + cmd->len = (cmd->len & (F_NOT | F_OR)) | F_INSN_SIZE(ipfw_insn_mac); + CHECK_CMDLEN; + + mac = (ipfw_insn_mac *)cmd; + get_mac_addr_mask(av[0], mac->addr, mac->mask); /* dst */ + get_mac_addr_mask(av[1], &(mac->addr[ETHER_ADDR_LEN]), + &(mac->mask[ETHER_ADDR_LEN])); /* src */ + return cmd; +} + +static ipfw_insn * +add_mactype(ipfw_insn *cmd, char *av, int cblen) +{ + if (!av) + errx(EX_DATAERR, "missing MAC type"); + if (strcmp(av, "any") != 0) { /* we have a non-null type */ + fill_newports((ipfw_insn_u16 *)cmd, av, IPPROTO_ETHERTYPE, + cblen); + cmd->opcode = O_MAC_TYPE; + return cmd; + } else + return NULL; +} + +static ipfw_insn * +add_proto0(ipfw_insn *cmd, char *av, u_char *protop) +{ + struct protoent *pe; + char *ep; + int proto; + + proto = strtol(av, &ep, 10); + if (*ep != '\0' || proto <= 0) { + if ((pe = getprotobyname(av)) == NULL) + return NULL; + proto = pe->p_proto; + } + + fill_cmd(cmd, O_PROTO, 0, proto); + *protop = proto; + return cmd; +} + +static ipfw_insn * +add_proto(ipfw_insn *cmd, char *av, u_char *protop) +{ + u_char proto = IPPROTO_IP; + + if (_substrcmp(av, "all") == 0 || strcmp(av, "ip") == 0) + ; /* do not set O_IP4 nor O_IP6 */ + else if (strcmp(av, "ip4") == 0) + /* explicit "just IPv4" rule */ + fill_cmd(cmd, O_IP4, 0, 0); + else if (strcmp(av, "ip6") == 0) { + /* explicit "just IPv6" rule */ + proto = IPPROTO_IPV6; + fill_cmd(cmd, O_IP6, 0, 0); + } else + return add_proto0(cmd, av, protop); + + *protop = proto; + return cmd; +} + +static ipfw_insn * +add_proto_compat(ipfw_insn *cmd, char *av, u_char *protop) +{ + u_char proto = IPPROTO_IP; + + if (_substrcmp(av, "all") == 0 || strcmp(av, "ip") == 0) + ; /* do not set O_IP4 nor O_IP6 */ + else if (strcmp(av, "ipv4") == 0 || strcmp(av, "ip4") == 0) + /* explicit "just IPv4" rule */ + fill_cmd(cmd, O_IP4, 0, 0); + else if (strcmp(av, "ipv6") == 0 || strcmp(av, "ip6") == 0) { + /* explicit "just IPv6" rule */ + proto = IPPROTO_IPV6; + fill_cmd(cmd, O_IP6, 0, 0); + } else + return add_proto0(cmd, av, protop); + + *protop = proto; + return cmd; +} + +static ipfw_insn * +add_srcip(ipfw_insn *cmd, char *av, int cblen, struct tidx *tstate) +{ + fill_ip((ipfw_insn_ip *)cmd, av, cblen, tstate); + if (cmd->opcode == O_IP_DST_SET) /* set */ + cmd->opcode = O_IP_SRC_SET; + else if (cmd->opcode == O_IP_DST_LOOKUP) /* table */ + cmd->opcode = O_IP_SRC_LOOKUP; + else if (F_LEN(cmd) == F_INSN_SIZE(ipfw_insn)) /* me */ + cmd->opcode = O_IP_SRC_ME; + else if (F_LEN(cmd) == F_INSN_SIZE(ipfw_insn_u32)) /* one IP */ + 
cmd->opcode = O_IP_SRC; + else /* addr/mask */ + cmd->opcode = O_IP_SRC_MASK; + return cmd; +} + +static ipfw_insn * +add_dstip(ipfw_insn *cmd, char *av, int cblen, struct tidx *tstate) +{ + fill_ip((ipfw_insn_ip *)cmd, av, cblen, tstate); + if (cmd->opcode == O_IP_DST_SET) /* set */ + ; + else if (cmd->opcode == O_IP_DST_LOOKUP) /* table */ + ; + else if (F_LEN(cmd) == F_INSN_SIZE(ipfw_insn)) /* me */ + cmd->opcode = O_IP_DST_ME; + else if (F_LEN(cmd) == F_INSN_SIZE(ipfw_insn_u32)) /* one IP */ + cmd->opcode = O_IP_DST; + else /* addr/mask */ + cmd->opcode = O_IP_DST_MASK; + return cmd; +} + +static struct _s_x f_reserved_keywords[] = { + { "altq", TOK_OR }, + { "//", TOK_OR }, + { "diverted", TOK_OR }, + { "dst-port", TOK_OR }, + { "src-port", TOK_OR }, + { "established", TOK_OR }, + { "keep-state", TOK_OR }, + { "frag", TOK_OR }, + { "icmptypes", TOK_OR }, + { "in", TOK_OR }, + { "out", TOK_OR }, + { "ip6", TOK_OR }, + { "any", TOK_OR }, + { "to", TOK_OR }, + { "via", TOK_OR }, + { "{", TOK_OR }, + { NULL, 0 } /* terminator */ +}; + +static ipfw_insn * +add_ports(ipfw_insn *cmd, char *av, u_char proto, int opcode, int cblen) +{ + + if (match_token(f_reserved_keywords, av) != -1) + return (NULL); + + if (fill_newports((ipfw_insn_u16 *)cmd, av, proto, cblen)) { + /* XXX todo: check that we have a protocol with ports */ + cmd->opcode = opcode; + return cmd; + } + return NULL; +} + +static ipfw_insn * +add_src(ipfw_insn *cmd, char *av, u_char proto, int cblen, struct tidx *tstate) +{ + struct in6_addr a; + char *host, *ch, buf[INET6_ADDRSTRLEN]; + ipfw_insn *ret = NULL; + int len; + + /* Copy first address in set if needed */ + if ((ch = strpbrk(av, "/,")) != NULL) { + len = ch - av; + strlcpy(buf, av, sizeof(buf)); + if (len < sizeof(buf)) + buf[len] = '\0'; + host = buf; + } else + host = av; + + if (proto == IPPROTO_IPV6 || strcmp(av, "me6") == 0 || + inet_pton(AF_INET6, host, &a) == 1) + ret = add_srcip6(cmd, av, cblen); + /* XXX: should check for IPv4, not !IPv6 */ + if (ret == NULL && (proto == IPPROTO_IP || strcmp(av, "me") == 0 || + inet_pton(AF_INET6, host, &a) != 1)) + ret = add_srcip(cmd, av, cblen, tstate); + if (ret == NULL && strcmp(av, "any") != 0) + ret = cmd; + + return ret; +} + +static ipfw_insn * +add_dst(ipfw_insn *cmd, char *av, u_char proto, int cblen, struct tidx *tstate) +{ + struct in6_addr a; + char *host, *ch, buf[INET6_ADDRSTRLEN]; + ipfw_insn *ret = NULL; + int len; + + /* Copy first address in set if needed */ + if ((ch = strpbrk(av, "/,")) != NULL) { + len = ch - av; + strlcpy(buf, av, sizeof(buf)); + if (len < sizeof(buf)) + buf[len] = '\0'; + host = buf; + } else + host = av; + + if (proto == IPPROTO_IPV6 || strcmp(av, "me6") == 0 || + inet_pton(AF_INET6, host, &a) == 1) + ret = add_dstip6(cmd, av, cblen); + /* XXX: should check for IPv4, not !IPv6 */ + if (ret == NULL && (proto == IPPROTO_IP || strcmp(av, "me") == 0 || + inet_pton(AF_INET6, host, &a) != 1)) + ret = add_dstip(cmd, av, cblen, tstate); + if (ret == NULL && strcmp(av, "any") != 0) + ret = cmd; + + return ret; +} + +/* + * Parse arguments and assemble the microinstructions which make up a rule. + * Rules are added into the 'rulebuf' and then copied in the correct order + * into the actual rule. + * + * The syntax for a rule starts with the action, followed by + * optional action parameters, and the various match patterns. 
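A caller's-eye sketch of this entry point before the details; the rule text and buffer size are assumptions, patterned on ipfw_add() later in this file.

/* Hypothetical driver for compile_rule(), not part of the patch: */
char *av[] = { "add", "100", "allow", "tcp",
	       "from", "any", "to", "any", "80", NULL };
uint32_t rulebuf[255];
int rbufsize = sizeof(rulebuf);
struct tidx ts;

memset(&ts, 0, sizeof(ts));
compile_rule(av, rulebuf, &rbufsize, &ts);
/* on return, rbufsize is the byte size of the assembled ip_fw_rule */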
+ * In the assembled microcode, the first opcode must be an O_PROBE_STATE + * (generated if the rule includes a keep-state option), then the + * various match patterns, log/altq actions, and the actual action. + * + */ +void +compile_rule(char *av[], uint32_t *rbuf, int *rbufsize, struct tidx *tstate) +{ + /* + * rules are added into the 'rulebuf' and then copied in + * the correct order into the actual rule. + * Some things that need to go out of order (prob, action etc.) + * go into actbuf[]. + */ + static uint32_t actbuf[255], cmdbuf[255]; + int rblen, ablen, cblen; + + ipfw_insn *src, *dst, *cmd, *action, *prev=NULL; + ipfw_insn *first_cmd; /* first match pattern */ + + struct ip_fw_rule *rule; + + /* + * various flags used to record that we entered some fields. + */ + ipfw_insn *have_state = NULL; /* check-state or keep-state */ + ipfw_insn *have_log = NULL, *have_altq = NULL, *have_tag = NULL; + size_t len; + + int i; + + int open_par = 0; /* open parenthesis ( */ + + /* proto is here because it is used to fetch ports */ + u_char proto = IPPROTO_IP; /* default protocol */ + + double match_prob = 1; /* match probability, default is always match */ + + bzero(actbuf, sizeof(actbuf)); /* actions go here */ + bzero(cmdbuf, sizeof(cmdbuf)); + bzero(rbuf, *rbufsize); + + rule = (struct ip_fw_rule *)rbuf; + cmd = (ipfw_insn *)cmdbuf; + action = (ipfw_insn *)actbuf; + + rblen = *rbufsize / sizeof(uint32_t); + rblen -= sizeof(struct ip_fw_rule) / sizeof(uint32_t); + ablen = sizeof(actbuf) / sizeof(actbuf[0]); + cblen = sizeof(cmdbuf) / sizeof(cmdbuf[0]); + cblen -= F_INSN_SIZE(ipfw_insn_u32) + 1; + +#define CHECK_RBUFLEN(len) { CHECK_LENGTH(rblen, len); rblen -= len; } +#define CHECK_ACTLEN CHECK_LENGTH(ablen, action->len) + + av++; + + /* [rule N] -- Rule number optional */ + if (av[0] && isdigit(**av)) { + rule->rulenum = atoi(*av); + av++; + } + + /* [set N] -- set number (0..RESVD_SET), optional */ + if (av[0] && av[1] && _substrcmp(*av, "set") == 0) { + int set = strtoul(av[1], NULL, 10); + if (set < 0 || set > RESVD_SET) + errx(EX_DATAERR, "illegal set %s", av[1]); + rule->set = set; + tstate->set = set; + av += 2; + } + + /* [prob D] -- match probability, optional */ + if (av[0] && av[1] && _substrcmp(*av, "prob") == 0) { + match_prob = strtod(av[1], NULL); + + if (match_prob <= 0 || match_prob > 1) + errx(EX_DATAERR, "illegal match prob. 
%s", av[1]); + av += 2; + } + + /* action -- mandatory */ + NEED1("missing action"); + i = match_token(rule_actions, *av); + av++; + action->len = 1; /* default */ + CHECK_ACTLEN; + switch(i) { + case TOK_CHECKSTATE: + have_state = action; + action->opcode = O_CHECK_STATE; + break; + + case TOK_ACCEPT: + action->opcode = O_ACCEPT; + break; + + case TOK_DENY: + action->opcode = O_DENY; + action->arg1 = 0; + break; + + case TOK_REJECT: + action->opcode = O_REJECT; + action->arg1 = ICMP_UNREACH_HOST; + break; + + case TOK_RESET: + action->opcode = O_REJECT; + action->arg1 = ICMP_REJECT_RST; + break; + + case TOK_RESET6: + action->opcode = O_UNREACH6; + action->arg1 = ICMP6_UNREACH_RST; + break; + + case TOK_UNREACH: + action->opcode = O_REJECT; + NEED1("missing reject code"); + fill_reject_code(&action->arg1, *av); + av++; + break; + + case TOK_UNREACH6: + action->opcode = O_UNREACH6; + NEED1("missing unreach code"); + fill_unreach6_code(&action->arg1, *av); + av++; + break; + + case TOK_COUNT: + action->opcode = O_COUNT; + break; + + case TOK_NAT: + action->opcode = O_NAT; + action->len = F_INSN_SIZE(ipfw_insn_nat); + CHECK_ACTLEN; + if (_substrcmp(*av, "global") == 0) { + action->arg1 = 0; + av++; + break; + } else + goto chkarg; + + case TOK_QUEUE: + action->opcode = O_QUEUE; + goto chkarg; + case TOK_PIPE: + action->opcode = O_PIPE; + goto chkarg; + case TOK_SKIPTO: + action->opcode = O_SKIPTO; + goto chkarg; + case TOK_NETGRAPH: + action->opcode = O_NETGRAPH; + goto chkarg; + case TOK_NGTEE: + action->opcode = O_NGTEE; + goto chkarg; + case TOK_DIVERT: + action->opcode = O_DIVERT; + goto chkarg; + case TOK_TEE: + action->opcode = O_TEE; + goto chkarg; + case TOK_CALL: + action->opcode = O_CALLRETURN; +chkarg: + if (!av[0]) + errx(EX_USAGE, "missing argument for %s", *(av - 1)); + if (isdigit(**av)) { + action->arg1 = strtoul(*av, NULL, 10); + if (action->arg1 <= 0 || action->arg1 >= IP_FW_TABLEARG) + errx(EX_DATAERR, "illegal argument for %s", + *(av - 1)); + } else if (_substrcmp(*av, "tablearg") == 0) { + action->arg1 = IP_FW_TARG; + } else if (i == TOK_DIVERT || i == TOK_TEE) { + struct servent *s; + setservent(1); + s = getservbyname(av[0], "divert"); + if (s != NULL) + action->arg1 = ntohs(s->s_port); + else + errx(EX_DATAERR, "illegal divert/tee port"); + } else + errx(EX_DATAERR, "illegal argument for %s", *(av - 1)); + av++; + break; + + case TOK_FORWARD: { + /* + * Locate the address-port separator (':' or ','). + * Could be one of the following: + * hostname:port + * IPv4 a.b.c.d,port + * IPv4 a.b.c.d:port + * IPv6 w:x:y::z,port + * The ':' can only be used with hostname and IPv4 address. + * XXX-BZ Should we also support [w:x:y::z]:port? + */ + struct sockaddr_storage result; + struct addrinfo *res; + char *s, *end; + int family; + u_short port_number; + + NEED1("missing forward address[:port]"); + + /* + * locate the address-port separator (':' or ',') + */ + s = strchr(*av, ','); + if (s == NULL) { + /* Distinguish between IPv4:port and IPv6 cases. */ + s = strchr(*av, ':'); + if (s && strchr(s+1, ':')) + s = NULL; /* no port */ + } + + port_number = 0; + if (s != NULL) { + /* Terminate host portion and set s to start of port. 
*/ + *(s++) = '\0'; + i = strtoport(s, &end, 0 /* base */, 0 /* proto */); + if (s == end) + errx(EX_DATAERR, + "illegal forwarding port ``%s''", s); + port_number = (u_short)i; + } + + if (_substrcmp(*av, "tablearg") == 0) { + family = PF_INET; + ((struct sockaddr_in*)&result)->sin_addr.s_addr = + INADDR_ANY; + } else { + /* + * Resolve the host name or address to a family and a + * network representation of the address. + */ + if (getaddrinfo(*av, NULL, NULL, &res)) + errx(EX_DATAERR, NULL); + /* Just use the first host in the answer. */ + family = res->ai_family; + memcpy(&result, res->ai_addr, res->ai_addrlen); + freeaddrinfo(res); + } + + if (family == PF_INET) { + ipfw_insn_sa *p = (ipfw_insn_sa *)action; + + action->opcode = O_FORWARD_IP; + action->len = F_INSN_SIZE(ipfw_insn_sa); + CHECK_ACTLEN; + + /* + * In the kernel we assume AF_INET and use only + * sin_port and sin_addr. Remember to set sin_len as + * the routing code seems to use it too. + */ + p->sa.sin_len = sizeof(struct sockaddr_in); + p->sa.sin_family = AF_INET; + p->sa.sin_port = port_number; + p->sa.sin_addr.s_addr = + ((struct sockaddr_in *)&result)->sin_addr.s_addr; + } else if (family == PF_INET6) { + ipfw_insn_sa6 *p = (ipfw_insn_sa6 *)action; + + action->opcode = O_FORWARD_IP6; + action->len = F_INSN_SIZE(ipfw_insn_sa6); + CHECK_ACTLEN; + + p->sa.sin6_len = sizeof(struct sockaddr_in6); + p->sa.sin6_family = AF_INET6; + p->sa.sin6_port = port_number; + p->sa.sin6_flowinfo = 0; + p->sa.sin6_scope_id = 0; + /* No table support for v6 yet. */ + bcopy(&((struct sockaddr_in6*)&result)->sin6_addr, + &p->sa.sin6_addr, sizeof(p->sa.sin6_addr)); + } else { + errx(EX_DATAERR, "Invalid address family in forward action"); + } + av++; + break; + } + case TOK_COMMENT: + /* pretend it is a 'count' rule followed by the comment */ + action->opcode = O_COUNT; + av--; /* go back... */ + break; + + case TOK_SETFIB: + { + int numfibs; + size_t intsize = sizeof(int); + + action->opcode = O_SETFIB; + NEED1("missing fib number"); + if (_substrcmp(*av, "tablearg") == 0) { + action->arg1 = IP_FW_TARG; + } else { + action->arg1 = strtoul(*av, NULL, 10); + if (sysctlbyname("net.fibs", &numfibs, &intsize, + NULL, 0) == -1) + errx(EX_DATAERR, "fibs not suported.\n"); + if (action->arg1 >= numfibs) /* Temporary */ + errx(EX_DATAERR, "fib too large.\n"); + /* Add high-order bit to fib to make room for tablearg*/ + action->arg1 |= 0x8000; + } + av++; + break; + } + + case TOK_SETDSCP: + { + int code; + + action->opcode = O_SETDSCP; + NEED1("missing DSCP code"); + if (_substrcmp(*av, "tablearg") == 0) { + action->arg1 = IP_FW_TARG; + } else if (isalpha(*av[0])) { + if ((code = match_token(f_ipdscp, *av)) == -1) + errx(EX_DATAERR, "Unknown DSCP code"); + action->arg1 = code; + } else + action->arg1 = strtoul(*av, NULL, 10); + /* Add high-order bit to DSCP to make room for tablearg */ + if (action->arg1 != IP_FW_TARG) + action->arg1 |= 0x8000; + av++; + break; + } + + case TOK_REASS: + action->opcode = O_REASS; + break; + + case TOK_RETURN: + fill_cmd(action, O_CALLRETURN, F_NOT, 0); + break; + + default: + errx(EX_DATAERR, "invalid action %s\n", av[-1]); + } + action = next_cmd(action, &ablen); + + /* + * [altq queuename] -- altq tag, optional + * [log [logamount N]] -- log, optional + * + * If they exist, it go first in the cmdbuf, but then it is + * skipped in the copy section to the end of the buffer. 
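The reordering is easier to follow as a picture of the finished command stream; a sketch of what the copy loops in the done: section produce, with every element optional unless the parser emitted it.

/*
 * rule->cmd:  [O_PROB] [O_PROBE_STATE] [match opcodes ...]
 *             [O_KEEP_STATE | O_LIMIT]
 * act_ofs ->  [O_LOG] [O_ALTQ] [O_TAG] [action, e.g. O_ACCEPT]
 */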
+ */ + while (av[0] != NULL && (i = match_token(rule_action_params, *av)) != -1) { + av++; + switch (i) { + case TOK_LOG: + { + ipfw_insn_log *c = (ipfw_insn_log *)cmd; + int l; + + if (have_log) + errx(EX_DATAERR, + "log cannot be specified more than once"); + have_log = (ipfw_insn *)c; + cmd->len = F_INSN_SIZE(ipfw_insn_log); + CHECK_CMDLEN; + cmd->opcode = O_LOG; + if (av[0] && _substrcmp(*av, "logamount") == 0) { + av++; + NEED1("logamount requires argument"); + l = atoi(*av); + if (l < 0) + errx(EX_DATAERR, + "logamount must be positive"); + c->max_log = l; + av++; + } else { + len = sizeof(c->max_log); + if (sysctlbyname("net.inet.ip.fw.verbose_limit", + &c->max_log, &len, NULL, 0) == -1) { + if (co.test_only) { + c->max_log = 0; + break; + } + errx(1, "sysctlbyname(\"%s\")", + "net.inet.ip.fw.verbose_limit"); + } + } + } + break; + +#ifndef NO_ALTQ + case TOK_ALTQ: + { + ipfw_insn_altq *a = (ipfw_insn_altq *)cmd; + + NEED1("missing altq queue name"); + if (have_altq) + errx(EX_DATAERR, + "altq cannot be specified more than once"); + have_altq = (ipfw_insn *)a; + cmd->len = F_INSN_SIZE(ipfw_insn_altq); + CHECK_CMDLEN; + cmd->opcode = O_ALTQ; + a->qid = altq_name_to_qid(*av); + av++; + } + break; +#endif + + case TOK_TAG: + case TOK_UNTAG: { + uint16_t tag; + + if (have_tag) + errx(EX_USAGE, "tag and untag cannot be " + "specified more than once"); + GET_UINT_ARG(tag, IPFW_ARG_MIN, IPFW_ARG_MAX, i, + rule_action_params); + have_tag = cmd; + fill_cmd(cmd, O_TAG, (i == TOK_TAG) ? 0: F_NOT, tag); + av++; + break; + } + + default: + abort(); + } + cmd = next_cmd(cmd, &cblen); + } + + if (have_state) /* must be a check-state, we are done */ + goto done; + +#define OR_START(target) \ + if (av[0] && (*av[0] == '(' || *av[0] == '{')) { \ + if (open_par) \ + errx(EX_USAGE, "nested \"(\" not allowed\n"); \ + prev = NULL; \ + open_par = 1; \ + if ( (av[0])[1] == '\0') { \ + av++; \ + } else \ + (*av)++; \ + } \ + target: \ + + +#define CLOSE_PAR \ + if (open_par) { \ + if (av[0] && ( \ + strcmp(*av, ")") == 0 || \ + strcmp(*av, "}") == 0)) { \ + prev = NULL; \ + open_par = 0; \ + av++; \ + } else \ + errx(EX_USAGE, "missing \")\"\n"); \ + } + +#define NOT_BLOCK \ + if (av[0] && _substrcmp(*av, "not") == 0) { \ + if (cmd->len & F_NOT) \ + errx(EX_USAGE, "double \"not\" not allowed\n"); \ + cmd->len |= F_NOT; \ + av++; \ + } + +#define OR_BLOCK(target) \ + if (av[0] && _substrcmp(*av, "or") == 0) { \ + if (prev == NULL || open_par == 0) \ + errx(EX_DATAERR, "invalid OR block"); \ + prev->len |= F_OR; \ + av++; \ + goto target; \ + } \ + CLOSE_PAR; + + first_cmd = cmd; + +#if 0 + /* + * MAC addresses, optional. + * If we have this, we skip the part "proto from src to dst" + * and jump straight to the option parsing. 
+ */ + NOT_BLOCK; + NEED1("missing protocol"); + if (_substrcmp(*av, "MAC") == 0 || + _substrcmp(*av, "mac") == 0) { + av++; /* the "MAC" keyword */ + add_mac(cmd, av); /* exits in case of errors */ + cmd = next_cmd(cmd); + av += 2; /* dst-mac and src-mac */ + NOT_BLOCK; + NEED1("missing mac type"); + if (add_mactype(cmd, av[0])) + cmd = next_cmd(cmd); + av++; /* any or mac-type */ + goto read_options; + } +#endif + + /* + * protocol, mandatory + */ + OR_START(get_proto); + NOT_BLOCK; + NEED1("missing protocol"); + if (add_proto_compat(cmd, *av, &proto)) { + av++; + if (F_LEN(cmd) != 0) { + prev = cmd; + cmd = next_cmd(cmd, &cblen); + } + } else if (first_cmd != cmd) { + errx(EX_DATAERR, "invalid protocol ``%s''", *av); + } else + goto read_options; + OR_BLOCK(get_proto); + + /* + * "from", mandatory + */ + if ((av[0] == NULL) || _substrcmp(*av, "from") != 0) + errx(EX_USAGE, "missing ``from''"); + av++; + + /* + * source IP, mandatory + */ + OR_START(source_ip); + NOT_BLOCK; /* optional "not" */ + NEED1("missing source address"); + if (add_src(cmd, *av, proto, cblen, tstate)) { + av++; + if (F_LEN(cmd) != 0) { /* ! any */ + prev = cmd; + cmd = next_cmd(cmd, &cblen); + } + } else + errx(EX_USAGE, "bad source address %s", *av); + OR_BLOCK(source_ip); + + /* + * source ports, optional + */ + NOT_BLOCK; /* optional "not" */ + if ( av[0] != NULL ) { + if (_substrcmp(*av, "any") == 0 || + add_ports(cmd, *av, proto, O_IP_SRCPORT, cblen)) { + av++; + if (F_LEN(cmd) != 0) + cmd = next_cmd(cmd, &cblen); + } + } + + /* + * "to", mandatory + */ + if ( (av[0] == NULL) || _substrcmp(*av, "to") != 0 ) + errx(EX_USAGE, "missing ``to''"); + av++; + + /* + * destination, mandatory + */ + OR_START(dest_ip); + NOT_BLOCK; /* optional "not" */ + NEED1("missing dst address"); + if (add_dst(cmd, *av, proto, cblen, tstate)) { + av++; + if (F_LEN(cmd) != 0) { /* ! any */ + prev = cmd; + cmd = next_cmd(cmd, &cblen); + } + } else + errx( EX_USAGE, "bad destination address %s", *av); + OR_BLOCK(dest_ip); + + /* + * dest. ports, optional + */ + NOT_BLOCK; /* optional "not" */ + if (av[0]) { + if (_substrcmp(*av, "any") == 0 || + add_ports(cmd, *av, proto, O_IP_DSTPORT, cblen)) { + av++; + if (F_LEN(cmd) != 0) + cmd = next_cmd(cmd, &cblen); + } + } + +read_options: + if (av[0] && first_cmd == cmd) { + /* + * nothing specified so far, store in the rule to ease + * printout later. 
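One detail of the option loop that follows is easy to miss: "not" (or a leading '!') only arms F_NOT on the single opcode parsed next. A sketch, with the rule text hypothetical.

/* "... not in" compiles to O_IN with F_NOT set: */
ipfw_insn scratch = { 0 };		/* stand-in for the current cmd */
scratch.len |= F_NOT;			/* set by "not" / leading '!' */
fill_cmd(&scratch, O_IN, 0, 0);		/* fill_cmd() preserves F_NOT */
/* result matches packets that are not inbound */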
+ */ + rule->flags |= IPFW_RULE_NOOPT; + } + prev = NULL; + while ( av[0] != NULL ) { + char *s; + ipfw_insn_u32 *cmd32; /* alias for cmd */ + + s = *av; + cmd32 = (ipfw_insn_u32 *)cmd; + + if (*s == '!') { /* alternate syntax for NOT */ + if (cmd->len & F_NOT) + errx(EX_USAGE, "double \"not\" not allowed\n"); + cmd->len = F_NOT; + s++; + } + i = match_token(rule_options, s); + av++; + switch(i) { + case TOK_NOT: + if (cmd->len & F_NOT) + errx(EX_USAGE, "double \"not\" not allowed\n"); + cmd->len = F_NOT; + break; + + case TOK_OR: + if (open_par == 0 || prev == NULL) + errx(EX_USAGE, "invalid \"or\" block\n"); + prev->len |= F_OR; + break; + + case TOK_STARTBRACE: + if (open_par) + errx(EX_USAGE, "+nested \"(\" not allowed\n"); + open_par = 1; + break; + + case TOK_ENDBRACE: + if (!open_par) + errx(EX_USAGE, "+missing \")\"\n"); + open_par = 0; + prev = NULL; + break; + + case TOK_IN: + fill_cmd(cmd, O_IN, 0, 0); + break; + + case TOK_OUT: + cmd->len ^= F_NOT; /* toggle F_NOT */ + fill_cmd(cmd, O_IN, 0, 0); + break; + + case TOK_DIVERTED: + fill_cmd(cmd, O_DIVERTED, 0, 3); + break; + + case TOK_DIVERTEDLOOPBACK: + fill_cmd(cmd, O_DIVERTED, 0, 1); + break; + + case TOK_DIVERTEDOUTPUT: + fill_cmd(cmd, O_DIVERTED, 0, 2); + break; + + case TOK_FRAG: + fill_cmd(cmd, O_FRAG, 0, 0); + break; + + case TOK_LAYER2: + fill_cmd(cmd, O_LAYER2, 0, 0); + break; + + case TOK_XMIT: + case TOK_RECV: + case TOK_VIA: + NEED1("recv, xmit, via require interface name" + " or address"); + fill_iface((ipfw_insn_if *)cmd, av[0], cblen, tstate); + av++; + if (F_LEN(cmd) == 0) /* not a valid address */ + break; + if (i == TOK_XMIT) + cmd->opcode = O_XMIT; + else if (i == TOK_RECV) + cmd->opcode = O_RECV; + else if (i == TOK_VIA) + cmd->opcode = O_VIA; + break; + + case TOK_ICMPTYPES: + NEED1("icmptypes requires list of types"); + fill_icmptypes((ipfw_insn_u32 *)cmd, *av); + av++; + break; + + case TOK_ICMP6TYPES: + NEED1("icmptypes requires list of types"); + fill_icmp6types((ipfw_insn_icmp6 *)cmd, *av, cblen); + av++; + break; + + case TOK_IPTTL: + NEED1("ipttl requires TTL"); + if (strpbrk(*av, "-,")) { + if (!add_ports(cmd, *av, 0, O_IPTTL, cblen)) + errx(EX_DATAERR, "invalid ipttl %s", *av); + } else + fill_cmd(cmd, O_IPTTL, 0, strtoul(*av, NULL, 0)); + av++; + break; + + case TOK_IPID: + NEED1("ipid requires id"); + if (strpbrk(*av, "-,")) { + if (!add_ports(cmd, *av, 0, O_IPID, cblen)) + errx(EX_DATAERR, "invalid ipid %s", *av); + } else + fill_cmd(cmd, O_IPID, 0, strtoul(*av, NULL, 0)); + av++; + break; + + case TOK_IPLEN: + NEED1("iplen requires length"); + if (strpbrk(*av, "-,")) { + if (!add_ports(cmd, *av, 0, O_IPLEN, cblen)) + errx(EX_DATAERR, "invalid ip len %s", *av); + } else + fill_cmd(cmd, O_IPLEN, 0, strtoul(*av, NULL, 0)); + av++; + break; + + case TOK_IPVER: + NEED1("ipver requires version"); + fill_cmd(cmd, O_IPVER, 0, strtoul(*av, NULL, 0)); + av++; + break; + + case TOK_IPPRECEDENCE: + NEED1("ipprecedence requires value"); + fill_cmd(cmd, O_IPPRECEDENCE, 0, + (strtoul(*av, NULL, 0) & 7) << 5); + av++; + break; + + case TOK_DSCP: + NEED1("missing DSCP code"); + fill_dscp(cmd, *av, cblen); + av++; + break; + + case TOK_IPOPTS: + NEED1("missing argument for ipoptions"); + fill_flags_cmd(cmd, O_IPOPT, f_ipopts, *av); + av++; + break; + + case TOK_IPTOS: + NEED1("missing argument for iptos"); + fill_flags_cmd(cmd, O_IPTOS, f_iptos, *av); + av++; + break; + + case TOK_UID: + NEED1("uid requires argument"); + { + char *end; + uid_t uid; + struct passwd *pwd; + + cmd->opcode = O_UID; + uid = 
strtoul(*av, &end, 0); + pwd = (*end == '\0') ? getpwuid(uid) : getpwnam(*av); + if (pwd == NULL) + errx(EX_DATAERR, "uid \"%s\" nonexistent", *av); + cmd32->d[0] = pwd->pw_uid; + cmd->len |= F_INSN_SIZE(ipfw_insn_u32); + av++; + } + break; + + case TOK_GID: + NEED1("gid requires argument"); + { + char *end; + gid_t gid; + struct group *grp; + + cmd->opcode = O_GID; + gid = strtoul(*av, &end, 0); + grp = (*end == '\0') ? getgrgid(gid) : getgrnam(*av); + if (grp == NULL) + errx(EX_DATAERR, "gid \"%s\" nonexistent", *av); + cmd32->d[0] = grp->gr_gid; + cmd->len |= F_INSN_SIZE(ipfw_insn_u32); + av++; + } + break; + + case TOK_JAIL: + NEED1("jail requires argument"); + { + char *end; + int jid; + + cmd->opcode = O_JAIL; + jid = (int)strtol(*av, &end, 0); + if (jid < 0 || *end != '\0') + errx(EX_DATAERR, "jail requires prison ID"); + cmd32->d[0] = (uint32_t)jid; + cmd->len |= F_INSN_SIZE(ipfw_insn_u32); + av++; + } + break; + + case TOK_ESTAB: + fill_cmd(cmd, O_ESTAB, 0, 0); + break; + + case TOK_SETUP: + fill_cmd(cmd, O_TCPFLAGS, 0, + (TH_SYN) | ( (TH_ACK) & 0xff) <<8 ); + break; + + case TOK_TCPDATALEN: + NEED1("tcpdatalen requires length"); + if (strpbrk(*av, "-,")) { + if (!add_ports(cmd, *av, 0, O_TCPDATALEN, cblen)) + errx(EX_DATAERR, "invalid tcpdata len %s", *av); + } else + fill_cmd(cmd, O_TCPDATALEN, 0, + strtoul(*av, NULL, 0)); + av++; + break; + + case TOK_TCPOPTS: + NEED1("missing argument for tcpoptions"); + fill_flags_cmd(cmd, O_TCPOPTS, f_tcpopts, *av); + av++; + break; + + case TOK_TCPSEQ: + case TOK_TCPACK: + NEED1("tcpseq/tcpack requires argument"); + cmd->len = F_INSN_SIZE(ipfw_insn_u32); + cmd->opcode = (i == TOK_TCPSEQ) ? O_TCPSEQ : O_TCPACK; + cmd32->d[0] = htonl(strtoul(*av, NULL, 0)); + av++; + break; + + case TOK_TCPWIN: + NEED1("tcpwin requires length"); + if (strpbrk(*av, "-,")) { + if (!add_ports(cmd, *av, 0, O_TCPWIN, cblen)) + errx(EX_DATAERR, "invalid tcpwin len %s", *av); + } else + fill_cmd(cmd, O_TCPWIN, 0, + strtoul(*av, NULL, 0)); + av++; + break; + + case TOK_TCPFLAGS: + NEED1("missing argument for tcpflags"); + cmd->opcode = O_TCPFLAGS; + fill_flags_cmd(cmd, O_TCPFLAGS, f_tcpflags, *av); + av++; + break; + + case TOK_KEEPSTATE: + if (open_par) + errx(EX_USAGE, "keep-state cannot be part " + "of an or block"); + if (have_state) + errx(EX_USAGE, "only one of keep-state " + "and limit is allowed"); + have_state = cmd; + fill_cmd(cmd, O_KEEP_STATE, 0, 0); + break; + + case TOK_LIMIT: { + ipfw_insn_limit *c = (ipfw_insn_limit *)cmd; + int val; + + if (open_par) + errx(EX_USAGE, + "limit cannot be part of an or block"); + if (have_state) + errx(EX_USAGE, "only one of keep-state and " + "limit is allowed"); + have_state = cmd; + + cmd->len = F_INSN_SIZE(ipfw_insn_limit); + CHECK_CMDLEN; + cmd->opcode = O_LIMIT; + c->limit_mask = c->conn_limit = 0; + + while ( av[0] != NULL ) { + if ((val = match_token(limit_masks, *av)) <= 0) + break; + c->limit_mask |= val; + av++; + } + + if (c->limit_mask == 0) + errx(EX_USAGE, "limit: missing limit mask"); + + GET_UINT_ARG(c->conn_limit, IPFW_ARG_MIN, IPFW_ARG_MAX, + TOK_LIMIT, rule_options); + + av++; + break; + } + + case TOK_PROTO: + NEED1("missing protocol"); + if (add_proto(cmd, *av, &proto)) { + av++; + } else + errx(EX_DATAERR, "invalid protocol ``%s''", + *av); + break; + + case TOK_SRCIP: + NEED1("missing source IP"); + if (add_srcip(cmd, *av, cblen, tstate)) { + av++; + } + break; + + case TOK_DSTIP: + NEED1("missing destination IP"); + if (add_dstip(cmd, *av, cblen, tstate)) { + av++; + } + break; + + case 
TOK_SRCIP6: + NEED1("missing source IP6"); + if (add_srcip6(cmd, *av, cblen)) { + av++; + } + break; + + case TOK_DSTIP6: + NEED1("missing destination IP6"); + if (add_dstip6(cmd, *av, cblen)) { + av++; + } + break; + + case TOK_SRCPORT: + NEED1("missing source port"); + if (_substrcmp(*av, "any") == 0 || + add_ports(cmd, *av, proto, O_IP_SRCPORT, cblen)) { + av++; + } else + errx(EX_DATAERR, "invalid source port %s", *av); + break; + + case TOK_DSTPORT: + NEED1("missing destination port"); + if (_substrcmp(*av, "any") == 0 || + add_ports(cmd, *av, proto, O_IP_DSTPORT, cblen)) { + av++; + } else + errx(EX_DATAERR, "invalid destination port %s", + *av); + break; + + case TOK_MAC: + if (add_mac(cmd, av, cblen)) + av += 2; + break; + + case TOK_MACTYPE: + NEED1("missing mac type"); + if (!add_mactype(cmd, *av, cblen)) + errx(EX_DATAERR, "invalid mac type %s", *av); + av++; + break; + + case TOK_VERREVPATH: + fill_cmd(cmd, O_VERREVPATH, 0, 0); + break; + + case TOK_VERSRCREACH: + fill_cmd(cmd, O_VERSRCREACH, 0, 0); + break; + + case TOK_ANTISPOOF: + fill_cmd(cmd, O_ANTISPOOF, 0, 0); + break; + + case TOK_IPSEC: + fill_cmd(cmd, O_IPSEC, 0, 0); + break; + + case TOK_IPV6: + fill_cmd(cmd, O_IP6, 0, 0); + break; + + case TOK_IPV4: + fill_cmd(cmd, O_IP4, 0, 0); + break; + + case TOK_EXT6HDR: + fill_ext6hdr( cmd, *av ); + av++; + break; + + case TOK_FLOWID: + if (proto != IPPROTO_IPV6 ) + errx( EX_USAGE, "flow-id filter is active " + "only for ipv6 protocol\n"); + fill_flow6( (ipfw_insn_u32 *) cmd, *av, cblen); + av++; + break; + + case TOK_COMMENT: + fill_comment(cmd, av, cblen); + av[0]=NULL; + break; + + case TOK_TAGGED: + if (av[0] && strpbrk(*av, "-,")) { + if (!add_ports(cmd, *av, 0, O_TAGGED, cblen)) + errx(EX_DATAERR, "tagged: invalid tag" + " list: %s", *av); + } + else { + uint16_t tag; + + GET_UINT_ARG(tag, IPFW_ARG_MIN, IPFW_ARG_MAX, + TOK_TAGGED, rule_options); + fill_cmd(cmd, O_TAGGED, 0, tag); + } + av++; + break; + + case TOK_FIB: + NEED1("fib requires fib number"); + fill_cmd(cmd, O_FIB, 0, strtoul(*av, NULL, 0)); + av++; + break; + case TOK_SOCKARG: + fill_cmd(cmd, O_SOCKARG, 0, 0); + break; + + case TOK_LOOKUP: { + ipfw_insn_u32 *c = (ipfw_insn_u32 *)cmd; + int j; + + if (!av[0] || !av[1]) + errx(EX_USAGE, "format: lookup argument tablenum"); + cmd->opcode = O_IP_DST_LOOKUP; + cmd->len |= F_INSN_SIZE(ipfw_insn) + 2; + i = match_token(rule_options, *av); + for (j = 0; lookup_key[j] >= 0 ; j++) { + if (i == lookup_key[j]) + break; + } + if (lookup_key[j] <= 0) + errx(EX_USAGE, "format: cannot lookup on %s", *av); + __PAST_END(c->d, 1) = j; // i converted to option + av++; + + if ((j = pack_table(tstate, *av)) == 0) + errx(EX_DATAERR, "Invalid table name: %s", *av); + + cmd->arg1 = j; + av++; + } + break; + case TOK_FLOW: + NEED1("missing table name"); + if (strncmp(*av, "table(", 6) != 0) + errx(EX_DATAERR, + "enclose table name into \"table()\""); + fill_table(cmd, *av, O_IP_FLOW_LOOKUP, tstate); + av++; + break; + + default: + errx(EX_USAGE, "unrecognised option [%d] %s\n", i, s); + } + if (F_LEN(cmd) > 0) { /* prepare to advance */ + prev = cmd; + cmd = next_cmd(cmd, &cblen); + } + } + +done: + /* + * Now copy stuff into the rule. + * If we have a keep-state option, the first instruction + * must be a PROBE_STATE (which is generated here). + * If we have a LOG option, it was stored as the first command, + * and now must be moved to the top of the action part. + */ + dst = (ipfw_insn *)rule->cmd; + + /* + * First thing to write into the command stream is the match probability. 
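The probability is stored as a fixed-point fraction of 0x7fffffff; a quick worked value, with the kernel-side comparison paraphrased.

/* "prob 0.25" is encoded as */
int32_t p = (int32_t)(0.25 * 0x7fffffff);	/* == 0x1fffffff */
/* and matched in the kernel, conceptually, as: random() < p */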
+ */ + if (match_prob != 1) { /* 1 means always match */ + dst->opcode = O_PROB; + dst->len = 2; + *((int32_t *)(dst+1)) = (int32_t)(match_prob * 0x7fffffff); + dst += dst->len; + } + + /* + * generate O_PROBE_STATE if necessary + */ + if (have_state && have_state->opcode != O_CHECK_STATE) { + fill_cmd(dst, O_PROBE_STATE, 0, 0); + dst = next_cmd(dst, &rblen); + } + + /* copy all commands but O_LOG, O_KEEP_STATE, O_LIMIT, O_ALTQ, O_TAG */ + for (src = (ipfw_insn *)cmdbuf; src != cmd; src += i) { + i = F_LEN(src); + CHECK_RBUFLEN(i); + + switch (src->opcode) { + case O_LOG: + case O_KEEP_STATE: + case O_LIMIT: + case O_ALTQ: + case O_TAG: + break; + default: + bcopy(src, dst, i * sizeof(uint32_t)); + dst += i; + } + } + + /* + * put back the have_state command as last opcode + */ + if (have_state && have_state->opcode != O_CHECK_STATE) { + i = F_LEN(have_state); + CHECK_RBUFLEN(i); + bcopy(have_state, dst, i * sizeof(uint32_t)); + dst += i; + } + /* + * start action section + */ + rule->act_ofs = dst - rule->cmd; + + /* put back O_LOG, O_ALTQ, O_TAG if necessary */ + if (have_log) { + i = F_LEN(have_log); + CHECK_RBUFLEN(i); + bcopy(have_log, dst, i * sizeof(uint32_t)); + dst += i; + } + if (have_altq) { + i = F_LEN(have_altq); + CHECK_RBUFLEN(i); + bcopy(have_altq, dst, i * sizeof(uint32_t)); + dst += i; + } + if (have_tag) { + i = F_LEN(have_tag); + CHECK_RBUFLEN(i); + bcopy(have_tag, dst, i * sizeof(uint32_t)); + dst += i; + } + + /* + * copy all other actions + */ + for (src = (ipfw_insn *)actbuf; src != action; src += i) { + i = F_LEN(src); + CHECK_RBUFLEN(i); + bcopy(src, dst, i * sizeof(uint32_t)); + dst += i; + } + + rule->cmd_len = (uint32_t *)dst - (uint32_t *)(rule->cmd); + *rbufsize = (char *)dst - (char *)rule; +} + +/* + * Adds one or more rules to ipfw chain. + * Data layout: + * Request: + * [ + * ip_fw3_opheader + * [ ipfw_obj_ctlv(IPFW_TLV_TBL_LIST) ipfw_obj_ntlv x N ] (optional *1) + * [ ipfw_obj_ctlv(IPFW_TLV_RULE_LIST) [ ip_fw_rule ip_fw_insn ] x N ] (*2) (*3) + * ] + * Reply: + * [ + * ip_fw3_opheader + * [ ipfw_obj_ctlv(IPFW_TLV_TBL_LIST) ipfw_obj_ntlv x N ] (optional) + * [ ipfw_obj_ctlv(IPFW_TLV_RULE_LIST) [ ip_fw_rule ip_fw_insn ] x N ] + * ] + * + * Rules in reply are modified to store their actual ruleset number. + * + * (*1) TLVs inside IPFW_TLV_TBL_LIST needs to be sorted ascending + * accoring to their idx field and there has to be no duplicates. + * (*2) Numbered rules inside IPFW_TLV_RULE_LIST needs to be sorted ascending. + * (*3) Each ip_fw structure needs to be aligned to u64 boundary. + */ +void +ipfw_add(char *av[]) +{ + uint32_t rulebuf[1024]; + int rbufsize, default_off, tlen, rlen; + size_t sz; + struct tidx ts; + struct ip_fw_rule *rule; + caddr_t tbuf; + ip_fw3_opheader *op3; + ipfw_obj_ctlv *ctlv, *tstate; + + rbufsize = sizeof(rulebuf); + memset(rulebuf, 0, rbufsize); + memset(&ts, 0, sizeof(ts)); + + /* Optimize case with no tables */ + default_off = sizeof(ipfw_obj_ctlv) + sizeof(ip_fw3_opheader); + op3 = (ip_fw3_opheader *)rulebuf; + ctlv = (ipfw_obj_ctlv *)(op3 + 1); + rule = (struct ip_fw_rule *)(ctlv + 1); + rbufsize -= default_off; + + compile_rule(av, (uint32_t *)rule, &rbufsize, &ts); + /* Align rule size to u64 boundary */ + rlen = roundup2(rbufsize, sizeof(uint64_t)); + + tbuf = NULL; + sz = 0; + tstate = NULL; + if (ts.count != 0) { + /* Some tables. 
We have to alloc more data */ + tlen = ts.count * sizeof(ipfw_obj_ntlv); + sz = default_off + sizeof(ipfw_obj_ctlv) + tlen + rlen; + + if ((tbuf = calloc(1, sz)) == NULL) + err(EX_UNAVAILABLE, "malloc() failed for IP_FW_ADD"); + op3 = (ip_fw3_opheader *)tbuf; + /* Tables first */ + ctlv = (ipfw_obj_ctlv *)(op3 + 1); + ctlv->head.type = IPFW_TLV_TBLNAME_LIST; + ctlv->head.length = sizeof(ipfw_obj_ctlv) + tlen; + ctlv->count = ts.count; + ctlv->objsize = sizeof(ipfw_obj_ntlv); + memcpy(ctlv + 1, ts.idx, tlen); + table_sort_ctlv(ctlv); + tstate = ctlv; + /* Rule next */ + ctlv = (ipfw_obj_ctlv *)((caddr_t)ctlv + ctlv->head.length); + ctlv->head.type = IPFW_TLV_RULE_LIST; + ctlv->head.length = sizeof(ipfw_obj_ctlv) + rlen; + ctlv->count = 1; + memcpy(ctlv + 1, rule, rbufsize); + } else { + /* Simply add header */ + sz = rlen + default_off; + memset(ctlv, 0, sizeof(*ctlv)); + ctlv->head.type = IPFW_TLV_RULE_LIST; + ctlv->head.length = sizeof(ipfw_obj_ctlv) + rlen; + ctlv->count = 1; + } + + if (do_get3(IP_FW_XADD, op3, &sz) != 0) + err(EX_UNAVAILABLE, "getsockopt(%s)", "IP_FW_XADD"); + + if (!co.do_quiet) { + struct format_opts sfo; + struct buf_pr bp; + memset(&sfo, 0, sizeof(sfo)); + sfo.tstate = tstate; + sfo.set_mask = (uint32_t)(-1); + bp_alloc(&bp, 4096); + show_static_rule(&co, &sfo, &bp, rule, NULL); + printf("%s", bp.buf); + bp_free(&bp); + } + + if (tbuf != NULL) + free(tbuf); + + if (ts.idx != NULL) + free(ts.idx); +} + +/* + * clear the counters or the log counters. + * optname has the following values: + * 0 (zero both counters and logging) + * 1 (zero logging only) + */ +void +ipfw_zero(int ac, char *av[], int optname) +{ + ipfw_range_tlv rt; + uint32_t arg; + int failed = EX_OK; + char const *errstr; + char const *name = optname ? "RESETLOG" : "ZERO"; + + optname = optname ? IP_FW_XRESETLOG : IP_FW_XZERO; + memset(&rt, 0, sizeof(rt)); + + av++; ac--; + + if (ac == 0) { + /* clear all entries */ + rt.flags = IPFW_RCFLAG_ALL; + if (do_range_cmd(optname, &rt) < 0) + err(EX_UNAVAILABLE, "setsockopt(IP_FW_X%s)", name); + if (!co.do_quiet) + printf("%s.\n", optname == IP_FW_XZERO ? + "Accounting cleared":"Logging counts reset"); + + return; + } + + while (ac) { + /* Rule number */ + if (isdigit(**av)) { + arg = strtonum(*av, 0, 0xffff, &errstr); + if (errstr) + errx(EX_DATAERR, + "invalid rule number %s\n", *av); + rt.start_rule = arg; + rt.end_rule = arg; + rt.flags |= IPFW_RCFLAG_RANGE; + if (co.use_set != 0) { + rt.set = co.use_set - 1; + rt.flags |= IPFW_RCFLAG_SET; + } + if (do_range_cmd(optname, &rt) != 0) { + warn("rule %u: setsockopt(IP_FW_X%s)", + arg, name); + failed = EX_UNAVAILABLE; + } else if (rt.new_set == 0) { + printf("Entry %d not found\n", arg); + failed = EX_UNAVAILABLE; + } else if (!co.do_quiet) + printf("Entry %d %s.\n", arg, + optname == IP_FW_XZERO ? + "cleared" : "logging count reset"); + } else { + errx(EX_USAGE, "invalid rule number ``%s''", *av); + } + av++; ac--; + } + if (failed != EX_OK) + exit(failed); +} + +void +ipfw_flush(int force) +{ + ipfw_range_tlv rt; + + if (!force && !co.do_quiet) { /* need to ask user */ + int c; + + printf("Are you sure? 
[yn] "); + fflush(stdout); + do { + c = toupper(getc(stdin)); + while (c != '\n' && getc(stdin) != '\n') + if (feof(stdin)) + return; /* and do not flush */ + } while (c != 'Y' && c != 'N'); + printf("\n"); + if (c == 'N') /* user said no */ + return; + } + if (co.do_pipe) { + dummynet_flush(); + return; + } + /* `ipfw set N flush` - is the same that `ipfw delete set N` */ + memset(&rt, 0, sizeof(rt)); + if (co.use_set != 0) { + rt.set = co.use_set - 1; + rt.flags = IPFW_RCFLAG_SET; + } else + rt.flags = IPFW_RCFLAG_ALL; + if (do_range_cmd(IP_FW_XDEL, &rt) != 0) + err(EX_UNAVAILABLE, "setsockopt(IP_FW_XDEL)"); + if (!co.do_quiet) + printf("Flushed all %s.\n", co.do_pipe ? "pipes" : "rules"); +} + +static struct _s_x intcmds[] = { + { "talist", TOK_TALIST }, + { "iflist", TOK_IFLIST }, + { "vlist", TOK_VLIST }, + { NULL, 0 } +}; + +void +ipfw_internal_handler(int ac, char *av[]) +{ + int tcmd; + + ac--; av++; + NEED1("internal cmd required"); + + if ((tcmd = match_token(intcmds, *av)) == -1) + errx(EX_USAGE, "invalid internal sub-cmd: %s", *av); + + switch (tcmd) { + case TOK_IFLIST: + ipfw_list_tifaces(); + break; + case TOK_TALIST: + ipfw_list_ta(ac, av); + break; + case TOK_VLIST: + ipfw_list_values(ac, av); + break; + } +} + +static int +ipfw_get_tracked_ifaces(ipfw_obj_lheader **polh) +{ + ipfw_obj_lheader req, *olh; + size_t sz; + + memset(&req, 0, sizeof(req)); + sz = sizeof(req); + + if (do_get3(IP_FW_XIFLIST, &req.opheader, &sz) != 0) { + if (errno != ENOMEM) + return (errno); + } + + sz = req.size; + if ((olh = calloc(1, sz)) == NULL) + return (ENOMEM); + + olh->size = sz; + if (do_get3(IP_FW_XIFLIST, &olh->opheader, &sz) != 0) { + free(olh); + return (errno); + } + + *polh = olh; + return (0); +} + +static int +ifinfo_cmp(const void *a, const void *b) +{ + ipfw_iface_info *ia, *ib; + + ia = (ipfw_iface_info *)a; + ib = (ipfw_iface_info *)b; + + return (stringnum_cmp(ia->ifname, ib->ifname)); +} + +/* + * Retrieves table list from kernel, + * optionally sorts it and calls requested function for each table. + * Returns 0 on success. + */ +static void +ipfw_list_tifaces() +{ + ipfw_obj_lheader *olh = NULL; + ipfw_iface_info *info; + int i, error; + + if ((error = ipfw_get_tracked_ifaces(&olh)) != 0) + err(EX_OSERR, "Unable to request ipfw tracked interface list"); + + + qsort(olh + 1, olh->count, olh->objsize, ifinfo_cmp); + + info = (ipfw_iface_info *)(olh + 1); + for (i = 0; i < olh->count; i++) { + if (info->flags & IPFW_IFFLAG_RESOLVED) + printf("%s ifindex: %d refcount: %u changes: %u\n", + info->ifname, info->ifindex, info->refcnt, + info->gencnt); + else + printf("%s ifindex: unresolved refcount: %u changes: %u\n", + info->ifname, info->refcnt, info->gencnt); + info = (ipfw_iface_info *)((caddr_t)info + olh->objsize); + } + + free(olh); +} + + + + diff --git a/example/ipfw/ipfw/ipfw2.h b/example/ipfw/ipfw/ipfw2.h new file mode 100644 index 0000000..8770534 --- /dev/null +++ b/example/ipfw/ipfw/ipfw2.h @@ -0,0 +1,352 @@ +/* + * Copyright (c) 2002-2003 Luigi Rizzo + * Copyright (c) 1996 Alex Nash, Paul Traina, Poul-Henning Kamp + * Copyright (c) 1994 Ugen J.S.Antsilevich + * + * Idea and grammar partially left from: + * Copyright (c) 1993 Daniel Boulet + * + * Redistribution and use in source forms, with and without modification, + * are permitted provided that this entire comment appears intact. + * + * Redistribution in binary form may occur without any restrictions. 
+ * Obviously, it would be nice if you gave credit where credit is due + * but requiring it would be too onerous. + * + * This software is provided ``AS IS'' without any warranties of any kind. + * + * NEW command line interface for IP firewall facility + * + * $FreeBSD: head/sbin/ipfw/ipfw2.h 272840 2014-10-09 19:32:35Z melifaro $ + */ + +/* + * Options that can be set on the command line. + * When reading commands from a file, a subset of the options can also + * be applied globally by specifying them before the file name. + * After that, each line can contain its own option that changes + * the global value. + * XXX The context is not restored after each line. + */ + +struct cmdline_opts { + /* boolean options: */ + int do_value_as_ip; /* show table value as IP */ + int do_resolv; /* try to resolve all ip to names */ + int do_time; /* Show time stamps */ + int do_quiet; /* Be quiet in add and flush */ + int do_pipe; /* this cmd refers to a pipe/queue/sched */ + int do_nat; /* this cmd refers to a nat config */ + int do_dynamic; /* display dynamic rules */ + int do_expired; /* display expired dynamic rules */ + int do_compact; /* show rules in compact mode */ + int do_force; /* do not ask for confirmation */ + int show_sets; /* display the set each rule belongs to */ + int test_only; /* only check syntax */ + int comment_only; /* only print action and comment */ + int verbose; /* be verbose on some commands */ + + /* The options below can have multiple values. */ + + int do_sort; /* field to sort results (0 = no) */ + /* valid fields are 1 and above */ + + int use_set; /* work with specified set number */ + /* 0 means all sets, otherwise apply to set use_set - 1 */ + +}; + +extern struct cmdline_opts co; + +/* + * _s_x is a structure that stores a string <-> token pairs, used in + * various places in the parser. Entries are stored in arrays, + * with an entry with s=NULL as terminator. + * The search routines are match_token() and match_value(). + * Often, an element with x=0 contains an error string. 
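A minimal usage sketch for these tables; the entries are illustrative, the NULL terminator is mandatory.

static struct _s_x demo_cmds[] = {
	{ "accept",	TOK_ACCEPT },
	{ "deny",	TOK_DENY },
	{ NULL, 0 }		/* terminator expected by match_token() */
};

char arg[] = "deny";
int tok = match_token(demo_cmds, arg);	/* TOK_DENY, or -1 if unknown */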
+ * + */ +struct _s_x { + char const *s; + int x; +}; + +extern struct _s_x f_ipdscp[]; + +enum tokens { + TOK_NULL=0, + + TOK_OR, + TOK_NOT, + TOK_STARTBRACE, + TOK_ENDBRACE, + + TOK_ACCEPT, + TOK_COUNT, + TOK_PIPE, + TOK_LINK, + TOK_QUEUE, + TOK_FLOWSET, + TOK_SCHED, + TOK_DIVERT, + TOK_TEE, + TOK_NETGRAPH, + TOK_NGTEE, + TOK_FORWARD, + TOK_SKIPTO, + TOK_DENY, + TOK_REJECT, + TOK_RESET, + TOK_UNREACH, + TOK_CHECKSTATE, + TOK_NAT, + TOK_REASS, + TOK_CALL, + TOK_RETURN, + + TOK_ALTQ, + TOK_LOG, + TOK_TAG, + TOK_UNTAG, + + TOK_TAGGED, + TOK_UID, + TOK_GID, + TOK_JAIL, + TOK_IN, + TOK_LIMIT, + TOK_KEEPSTATE, + TOK_LAYER2, + TOK_OUT, + TOK_DIVERTED, + TOK_DIVERTEDLOOPBACK, + TOK_DIVERTEDOUTPUT, + TOK_XMIT, + TOK_RECV, + TOK_VIA, + TOK_FRAG, + TOK_IPOPTS, + TOK_IPLEN, + TOK_IPID, + TOK_IPPRECEDENCE, + TOK_DSCP, + TOK_IPTOS, + TOK_IPTTL, + TOK_IPVER, + TOK_ESTAB, + TOK_SETUP, + TOK_TCPDATALEN, + TOK_TCPFLAGS, + TOK_TCPOPTS, + TOK_TCPSEQ, + TOK_TCPACK, + TOK_TCPWIN, + TOK_ICMPTYPES, + TOK_MAC, + TOK_MACTYPE, + TOK_VERREVPATH, + TOK_VERSRCREACH, + TOK_ANTISPOOF, + TOK_IPSEC, + TOK_COMMENT, + + TOK_PLR, + TOK_NOERROR, + TOK_BUCKETS, + TOK_DSTIP, + TOK_SRCIP, + TOK_DSTPORT, + TOK_SRCPORT, + TOK_ALL, + TOK_MASK, + TOK_FLOW_MASK, + TOK_SCHED_MASK, + TOK_BW, + TOK_DELAY, + TOK_PROFILE, + TOK_BURST, + TOK_RED, + TOK_GRED, + TOK_ECN, + TOK_DROPTAIL, + TOK_PROTO, + /* dummynet tokens */ + TOK_WEIGHT, + TOK_LMAX, + TOK_PRI, + TOK_TYPE, + TOK_SLOTSIZE, + + TOK_IP, + TOK_IF, + TOK_ALOG, + TOK_DENY_INC, + TOK_SAME_PORTS, + TOK_UNREG_ONLY, + TOK_SKIP_GLOBAL, + TOK_RESET_ADDR, + TOK_ALIAS_REV, + TOK_PROXY_ONLY, + TOK_REDIR_ADDR, + TOK_REDIR_PORT, + TOK_REDIR_PROTO, + + TOK_IPV6, + TOK_FLOWID, + TOK_ICMP6TYPES, + TOK_EXT6HDR, + TOK_DSTIP6, + TOK_SRCIP6, + + TOK_IPV4, + TOK_UNREACH6, + TOK_RESET6, + + TOK_FIB, + TOK_SETFIB, + TOK_LOOKUP, + TOK_SOCKARG, + TOK_SETDSCP, + TOK_FLOW, + TOK_IFLIST, + /* Table tokens */ + TOK_CREATE, + TOK_DESTROY, + TOK_LIST, + TOK_INFO, + TOK_DETAIL, + TOK_MODIFY, + TOK_FLUSH, + TOK_SWAP, + TOK_ADD, + TOK_DEL, + TOK_VALTYPE, + TOK_ALGO, + TOK_TALIST, + TOK_ATOMIC, + TOK_LOCK, + TOK_UNLOCK, + TOK_VLIST, +}; + +/* + * the following macro returns an error message if we run out of + * arguments. 
+ */ +#define NEED(_p, msg) {if (!_p) errx(EX_USAGE, msg);} +#define NEED1(msg) {if (!(*av)) errx(EX_USAGE, msg);} + +struct buf_pr { + char *buf; /* allocated buffer */ + char *ptr; /* current pointer */ + size_t size; /* total buffer size */ + size_t avail; /* available storage */ + size_t needed; /* length needed */ +}; + +int pr_u64(struct buf_pr *bp, uint64_t *pd, int width); +int bp_alloc(struct buf_pr *b, size_t size); +void bp_free(struct buf_pr *b); +int bprintf(struct buf_pr *b, char *format, ...); + + +/* memory allocation support */ +void *safe_calloc(size_t number, size_t size); +void *safe_realloc(void *ptr, size_t size); + +/* string comparison functions used for historical compatibility */ +int _substrcmp(const char *str1, const char* str2); +int _substrcmp2(const char *str1, const char* str2, const char* str3); +int stringnum_cmp(const char *a, const char *b); + +/* utility functions */ +int match_token(struct _s_x *table, char *string); +int match_token_relaxed(struct _s_x *table, char *string); +char const *match_value(struct _s_x *p, int value); +size_t concat_tokens(char *buf, size_t bufsize, struct _s_x *table, + char *delimiter); +int fill_flags(struct _s_x *flags, char *p, char **e, uint32_t *set, + uint32_t *clear); +void print_flags_buffer(char *buf, size_t sz, struct _s_x *list, uint32_t set); + +struct _ip_fw3_opheader; +int do_cmd(int optname, void *optval, uintptr_t optlen); +int do_set3(int optname, struct _ip_fw3_opheader *op3, uintptr_t optlen); +int do_get3(int optname, struct _ip_fw3_opheader *op3, size_t *optlen); + +struct in6_addr; +void n2mask(struct in6_addr *mask, int n); +int contigmask(uint8_t *p, int len); + +/* + * Forward declarations to avoid include way too many headers. + * C does not allow duplicated typedefs, so we use the base struct + * that the typedef points to. + * Should the typedefs use a different type, the compiler will + * still detect the change when compiling the body of the + * functions involved, so we do not lose error checking. + */ +struct _ipfw_insn; +struct _ipfw_insn_altq; +struct _ipfw_insn_u32; +struct _ipfw_insn_ip6; +struct _ipfw_insn_icmp6; + +/* + * The reserved set numer. This is a constant in ip_fw.h + * but we store it in a variable so other files do not depend + * in that header just for one constant. 
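A short sketch of the buf_pr helpers declared above; the 4096-byte size is an assumption, matching the use in ipfw_add() earlier.

struct buf_pr bp;

bp_alloc(&bp, 4096);			/* allocate the scratch buffer */
bprintf(&bp, "set %d", resvd_set_number);
printf("%s\n", bp.buf);
bp_free(&bp);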
+ */ +extern int resvd_set_number; + +/* first-level command handlers */ +void ipfw_add(char *av[]); +void ipfw_show_nat(int ac, char **av); +void ipfw_config_pipe(int ac, char **av); +void ipfw_config_nat(int ac, char **av); +void ipfw_sets_handler(char *av[]); +void ipfw_table_handler(int ac, char *av[]); +void ipfw_sysctl_handler(char *av[], int which); +void ipfw_delete(char *av[]); +void ipfw_flush(int force); +void ipfw_zero(int ac, char *av[], int optname); +void ipfw_list(int ac, char *av[], int show_counters); +void ipfw_internal_handler(int ac, char *av[]); + +#ifdef PF +/* altq.c */ +void altq_set_enabled(int enabled); +u_int32_t altq_name_to_qid(const char *name); +void print_altq_cmd(struct buf_pr *bp, struct _ipfw_insn_altq *altqptr); +#else +#define NO_ALTQ +#endif + +/* dummynet.c */ +void dummynet_list(int ac, char *av[], int show_counters); +void dummynet_flush(void); +int ipfw_delete_pipe(int pipe_or_queue, int n); + +/* ipv6.c */ +void print_unreach6_code(uint16_t code); +void print_ip6(struct buf_pr *bp, struct _ipfw_insn_ip6 *cmd, char const *s); +void print_flow6id(struct buf_pr *bp, struct _ipfw_insn_u32 *cmd); +void print_icmp6types(struct buf_pr *bp, struct _ipfw_insn_u32 *cmd); +void print_ext6hdr(struct buf_pr *bp, struct _ipfw_insn *cmd ); + +struct _ipfw_insn *add_srcip6(struct _ipfw_insn *cmd, char *av, int cblen); +struct _ipfw_insn *add_dstip6(struct _ipfw_insn *cmd, char *av, int cblen); + +void fill_flow6(struct _ipfw_insn_u32 *cmd, char *av, int cblen); +void fill_unreach6_code(u_short *codep, char *str); +void fill_icmp6types(struct _ipfw_insn_icmp6 *cmd, char *av, int cblen); +int fill_ext6hdr(struct _ipfw_insn *cmd, char *av); + +/* tables.c */ +struct _ipfw_obj_ctlv; +char *table_search_ctlv(struct _ipfw_obj_ctlv *ctlv, uint16_t idx); +void table_sort_ctlv(struct _ipfw_obj_ctlv *ctlv); +int table_check_name(char *tablename); +void ipfw_list_ta(int ac, char *av[]); +void ipfw_list_values(int ac, char *av[]); + diff --git a/example/ipfw/ipfw/ipv6.c b/example/ipfw/ipfw/ipv6.c new file mode 100644 index 0000000..0871a88 --- /dev/null +++ b/example/ipfw/ipfw/ipv6.c @@ -0,0 +1,536 @@ +/* + * Copyright (c) 2002-2003 Luigi Rizzo + * Copyright (c) 1996 Alex Nash, Paul Traina, Poul-Henning Kamp + * Copyright (c) 1994 Ugen J.S.Antsilevich + * + * Idea and grammar partially left from: + * Copyright (c) 1993 Daniel Boulet + * + * Redistribution and use in source forms, with and without modification, + * are permitted provided that this entire comment appears intact. + * + * Redistribution in binary form may occur without any restrictions. + * Obviously, it would be nice if you gave credit where credit is due + * but requiring it would be too onerous. + * + * This software is provided ``AS IS'' without any warranties of any kind. 
+ * + * NEW command line interface for IP firewall facility + * + * $FreeBSD: head/sbin/ipfw/ipv6.c 270424 2014-08-23 17:37:18Z melifaro $ + * + * ipv6 support + */ + +#include +#include + +#include "ipfw2.h" + +#include +#include +#include +#include +#include +#include + +#include +#include +#include +#include +#include +#include +#include + +#define CHECK_LENGTH(v, len) do { \ + if ((v) < (len)) \ + errx(EX_DATAERR, "Rule too long"); \ + } while (0) + +static struct _s_x icmp6codes[] = { + { "no-route", ICMP6_DST_UNREACH_NOROUTE }, + { "admin-prohib", ICMP6_DST_UNREACH_ADMIN }, + { "address", ICMP6_DST_UNREACH_ADDR }, + { "port", ICMP6_DST_UNREACH_NOPORT }, + { NULL, 0 } +}; + +void +fill_unreach6_code(u_short *codep, char *str) +{ + int val; + char *s; + + val = strtoul(str, &s, 0); + if (s == str || *s != '\0' || val >= 0x100) + val = match_token(icmp6codes, str); + if (val < 0) + errx(EX_DATAERR, "unknown ICMPv6 unreachable code ``%s''", str); + *codep = val; + return; +} + +void +print_unreach6_code(uint16_t code) +{ + char const *s = match_value(icmp6codes, code); + + if (s != NULL) + printf("unreach6 %s", s); + else + printf("unreach6 %u", code); +} + +/* + * Print the ip address contained in a command. + */ +void +print_ip6(struct buf_pr *bp, ipfw_insn_ip6 *cmd, char const *s) +{ + struct hostent *he = NULL; + int len = F_LEN((ipfw_insn *) cmd) - 1; + struct in6_addr *a = &(cmd->addr6); + char trad[255]; + + bprintf(bp, "%s%s ", cmd->o.len & F_NOT ? " not": "", s); + + if (cmd->o.opcode == O_IP6_SRC_ME || cmd->o.opcode == O_IP6_DST_ME) { + bprintf(bp, "me6"); + return; + } + if (cmd->o.opcode == O_IP6) { + bprintf(bp, " ip6"); + return; + } + + /* + * len == 4 indicates a single IP, whereas lists of 1 or more + * addr/mask pairs have len = (2n+1). We convert len to n so we + * use that to count the number of entries. + */ + + for (len = len / 4; len > 0; len -= 2, a += 2) { + int mb = /* mask length */ + (cmd->o.opcode == O_IP6_SRC || cmd->o.opcode == O_IP6_DST) ? + 128 : contigmask((uint8_t *)&(a[1]), 128); + + if (mb == 128 && co.do_resolv) + he = gethostbyaddr((char *)a, sizeof(*a), AF_INET6); + if (he != NULL) /* resolved to name */ + bprintf(bp, "%s", he->h_name); + else if (mb == 0) /* any */ + bprintf(bp, "any"); + else { /* numeric IP followed by some kind of mask */ + if (inet_ntop(AF_INET6, a, trad, sizeof( trad ) ) == NULL) + bprintf(bp, "Error ntop in print_ip6\n"); + bprintf(bp, "%s", trad ); + if (mb < 0) /* XXX not really legal... */ + bprintf(bp, ":%s", + inet_ntop(AF_INET6, &a[1], trad, sizeof(trad))); + else if (mb < 128) + bprintf(bp, "/%d", mb); + } + if (len > 2) + bprintf(bp, ","); + } +} + +void +fill_icmp6types(ipfw_insn_icmp6 *cmd, char *av, int cblen) +{ + uint8_t type; + + CHECK_LENGTH(cblen, F_INSN_SIZE(ipfw_insn_icmp6)); + + bzero(cmd, sizeof(*cmd)); + while (*av) { + if (*av == ',') + av++; + type = strtoul(av, &av, 0); + if (*av != ',' && *av != '\0') + errx(EX_DATAERR, "invalid ICMP6 type"); + /* + * XXX: shouldn't this be 0xFF? I can't see any reason why + * we shouldn't be able to filter all possiable values + * regardless of the ability of the rest of the kernel to do + * anything useful with them. 
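A concrete instance of the bitmap arithmetic below, with the type value illustrative.

/* "icmp6types 135" (neighbor solicitation) sets exactly one bit: */
cmd->d[135 / 32] |= 1 << (135 % 32);	/* d[4], bit 7 */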
+ */ + if (type > ICMP6_MAXTYPE) + errx(EX_DATAERR, "ICMP6 type out of range"); + cmd->d[type / 32] |= ( 1 << (type % 32)); + } + cmd->o.opcode = O_ICMP6TYPE; + cmd->o.len |= F_INSN_SIZE(ipfw_insn_icmp6); +} + + +void +print_icmp6types(struct buf_pr *bp, ipfw_insn_u32 *cmd) +{ + int i, j; + char sep= ' '; + + bprintf(bp, " ip6 icmp6types"); + for (i = 0; i < 7; i++) + for (j=0; j < 32; ++j) { + if ( (cmd->d[i] & (1 << (j))) == 0) + continue; + bprintf(bp, "%c%d", sep, (i*32 + j)); + sep = ','; + } +} + +void +print_flow6id(struct buf_pr *bp, ipfw_insn_u32 *cmd) +{ + uint16_t i, limit = cmd->o.arg1; + char sep = ','; + + bprintf(bp, " flow-id "); + for( i=0; i < limit; ++i) { + if (i == limit - 1) + sep = ' '; + bprintf(bp, "%d%c", cmd->d[i], sep); + } +} + +/* structure and define for the extension header in ipv6 */ +static struct _s_x ext6hdrcodes[] = { + { "frag", EXT_FRAGMENT }, + { "hopopt", EXT_HOPOPTS }, + { "route", EXT_ROUTING }, + { "dstopt", EXT_DSTOPTS }, + { "ah", EXT_AH }, + { "esp", EXT_ESP }, + { "rthdr0", EXT_RTHDR0 }, + { "rthdr2", EXT_RTHDR2 }, + { NULL, 0 } +}; + +/* fills command for the extension header filtering */ +int +fill_ext6hdr( ipfw_insn *cmd, char *av) +{ + int tok; + char *s = av; + + cmd->arg1 = 0; + + while(s) { + av = strsep( &s, ",") ; + tok = match_token(ext6hdrcodes, av); + switch (tok) { + case EXT_FRAGMENT: + cmd->arg1 |= EXT_FRAGMENT; + break; + + case EXT_HOPOPTS: + cmd->arg1 |= EXT_HOPOPTS; + break; + + case EXT_ROUTING: + cmd->arg1 |= EXT_ROUTING; + break; + + case EXT_DSTOPTS: + cmd->arg1 |= EXT_DSTOPTS; + break; + + case EXT_AH: + cmd->arg1 |= EXT_AH; + break; + + case EXT_ESP: + cmd->arg1 |= EXT_ESP; + break; + + case EXT_RTHDR0: + cmd->arg1 |= EXT_RTHDR0; + break; + + case EXT_RTHDR2: + cmd->arg1 |= EXT_RTHDR2; + break; + + default: + errx( EX_DATAERR, "invalid option for ipv6 exten header" ); + break; + } + } + if (cmd->arg1 == 0 ) + return 0; + cmd->opcode = O_EXT_HDR; + cmd->len |= F_INSN_SIZE( ipfw_insn ); + return 1; +} + +void +print_ext6hdr(struct buf_pr *bp, ipfw_insn *cmd ) +{ + char sep = ' '; + + bprintf(bp, " extension header:"); + if (cmd->arg1 & EXT_FRAGMENT ) { + bprintf(bp, "%cfragmentation", sep); + sep = ','; + } + if (cmd->arg1 & EXT_HOPOPTS ) { + bprintf(bp, "%chop options", sep); + sep = ','; + } + if (cmd->arg1 & EXT_ROUTING ) { + bprintf(bp, "%crouting options", sep); + sep = ','; + } + if (cmd->arg1 & EXT_RTHDR0 ) { + bprintf(bp, "%crthdr0", sep); + sep = ','; + } + if (cmd->arg1 & EXT_RTHDR2 ) { + bprintf(bp, "%crthdr2", sep); + sep = ','; + } + if (cmd->arg1 & EXT_DSTOPTS ) { + bprintf(bp, "%cdestination options", sep); + sep = ','; + } + if (cmd->arg1 & EXT_AH ) { + bprintf(bp, "%cauthentication header", sep); + sep = ','; + } + if (cmd->arg1 & EXT_ESP ) { + bprintf(bp, "%cencapsulated security payload", sep); + } +} + +/* Try to find ipv6 address by hostname */ +static int +lookup_host6 (char *host, struct in6_addr *ip6addr) +{ + struct hostent *he; + + if (!inet_pton(AF_INET6, host, ip6addr)) { + if ((he = gethostbyname2(host, AF_INET6)) == NULL) + return(-1); + memcpy(ip6addr, he->h_addr_list[0], sizeof( struct in6_addr)); + } + return(0); +} + + +/* + * fill the addr and mask fields in the instruction as appropriate from av. + * Update length as appropriate. + * The following formats are allowed: + * any matches any IP6. Actually returns an empty instruction. 
+ * me returns O_IP6_*_ME + * + * 03f1::234:123:0342 single IP6 addres + * 03f1::234:123:0342/24 address/mask + * 03f1::234:123:0342/24,03f1::234:123:0343/ List of address + * + * Set of address (as in ipv6) not supported because ipv6 address + * are typically random past the initial prefix. + * Return 1 on success, 0 on failure. + */ +static int +fill_ip6(ipfw_insn_ip6 *cmd, char *av, int cblen) +{ + int len = 0; + struct in6_addr *d = &(cmd->addr6); + /* + * Needed for multiple address. + * Note d[1] points to struct in6_add r mask6 of cmd + */ + + cmd->o.len &= ~F_LEN_MASK; /* zero len */ + + if (strcmp(av, "any") == 0) + return (1); + + + if (strcmp(av, "me") == 0) { /* Set the data for "me" opt*/ + cmd->o.len |= F_INSN_SIZE(ipfw_insn); + return (1); + } + + if (strcmp(av, "me6") == 0) { /* Set the data for "me" opt*/ + cmd->o.len |= F_INSN_SIZE(ipfw_insn); + return (1); + } + + if (strncmp(av, "table(", 6) == 0) { + char *p = strchr(av + 6, ','); + uint32_t *dm = ((ipfw_insn_u32 *)cmd)->d; + + if (p) + *p++ = '\0'; + cmd->o.opcode = O_IP_DST_LOOKUP; + cmd->o.arg1 = strtoul(av + 6, NULL, 0); + if (p) { + cmd->o.len |= F_INSN_SIZE(ipfw_insn_u32); + dm[0] = strtoul(p, NULL, 0); + } else + cmd->o.len |= F_INSN_SIZE(ipfw_insn); + return (1); + } + + av = strdup(av); + while (av) { + /* + * After the address we can have '/' indicating a mask, + * or ',' indicating another address follows. + */ + + char *p; + int masklen; + char md = '\0'; + + CHECK_LENGTH(cblen, 1 + len + 2 * F_INSN_SIZE(struct in6_addr)); + + if ((p = strpbrk(av, "/,")) ) { + md = *p; /* save the separator */ + *p = '\0'; /* terminate address string */ + p++; /* and skip past it */ + } + /* now p points to NULL, mask or next entry */ + + /* lookup stores address in *d as a side effect */ + if (lookup_host6(av, d) != 0) { + /* XXX: failed. Free memory and go */ + errx(EX_DATAERR, "bad address \"%s\"", av); + } + /* next, look at the mask, if any */ + masklen = (md == '/') ? atoi(p) : 128; + if (masklen > 128 || masklen < 0) + errx(EX_DATAERR, "bad width \"%s\''", p); + else + n2mask(&d[1], masklen); + + APPLY_MASK(d, &d[1]) /* mask base address with mask */ + + /* find next separator */ + + if (md == '/') { /* find separator past the mask */ + p = strpbrk(p, ","); + if (p != NULL) + p++; + } + av = p; + + /* Check this entry */ + if (masklen == 0) { + /* + * 'any' turns the entire list into a NOP. + * 'not any' never matches, so it is removed from the + * list unless it is the only item, in which case we + * report an error. + */ + if (cmd->o.len & F_NOT && av == NULL && len == 0) + errx(EX_DATAERR, "not any never matches"); + continue; + } + + /* + * A single IP can be stored alone + */ + if (masklen == 128 && av == NULL && len == 0) { + len = F_INSN_SIZE(struct in6_addr); + break; + } + + /* Update length and pointer to arguments */ + len += F_INSN_SIZE(struct in6_addr)*2; + d += 2; + } /* end while */ + + /* + * Total length of the command, remember that 1 is the size of + * the base command. 
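+ * For example, a lone bare address stores F_INSN_SIZE(struct in6_addr) = 4
+ * words (so o.len becomes 5), while every addr/mask pair stores 8.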
+ */ + if (len + 1 > F_LEN_MASK) + errx(EX_DATAERR, "address list too long"); + cmd->o.len |= len+1; + free(av); + return (1); +} + +/* + * fills command for ipv6 flow-id filtering + * note that the 20 bit flow number is stored in a array of u_int32_t + * it's supported lists of flow-id, so in the o.arg1 we store how many + * additional flow-id we want to filter, the basic is 1 + */ +void +fill_flow6( ipfw_insn_u32 *cmd, char *av, int cblen) +{ + u_int32_t type; /* Current flow number */ + u_int16_t nflow = 0; /* Current flow index */ + char *s = av; + cmd->d[0] = 0; /* Initializing the base number*/ + + while (s) { + CHECK_LENGTH(cblen, F_INSN_SIZE(ipfw_insn_u32) + nflow + 1); + + av = strsep( &s, ",") ; + type = strtoul(av, &av, 0); + if (*av != ',' && *av != '\0') + errx(EX_DATAERR, "invalid ipv6 flow number %s", av); + if (type > 0xfffff) + errx(EX_DATAERR, "flow number out of range %s", av); + cmd->d[nflow] |= type; + nflow++; + } + if( nflow > 0 ) { + cmd->o.opcode = O_FLOW6ID; + cmd->o.len |= F_INSN_SIZE(ipfw_insn_u32) + nflow; + cmd->o.arg1 = nflow; + } + else { + errx(EX_DATAERR, "invalid ipv6 flow number %s", av); + } +} + +ipfw_insn * +add_srcip6(ipfw_insn *cmd, char *av, int cblen) +{ + + fill_ip6((ipfw_insn_ip6 *)cmd, av, cblen); + if (cmd->opcode == O_IP_DST_SET) /* set */ + cmd->opcode = O_IP_SRC_SET; + else if (cmd->opcode == O_IP_DST_LOOKUP) /* table */ + cmd->opcode = O_IP_SRC_LOOKUP; + else if (F_LEN(cmd) == 0) { /* any */ + } else if (F_LEN(cmd) == F_INSN_SIZE(ipfw_insn)) { /* "me" */ + cmd->opcode = O_IP6_SRC_ME; + } else if (F_LEN(cmd) == + (F_INSN_SIZE(struct in6_addr) + F_INSN_SIZE(ipfw_insn))) { + /* single IP, no mask*/ + cmd->opcode = O_IP6_SRC; + } else { /* addr/mask opt */ + cmd->opcode = O_IP6_SRC_MASK; + } + return cmd; +} + +ipfw_insn * +add_dstip6(ipfw_insn *cmd, char *av, int cblen) +{ + + fill_ip6((ipfw_insn_ip6 *)cmd, av, cblen); + if (cmd->opcode == O_IP_DST_SET) /* set */ + ; + else if (cmd->opcode == O_IP_DST_LOOKUP) /* table */ + ; + else if (F_LEN(cmd) == 0) { /* any */ + } else if (F_LEN(cmd) == F_INSN_SIZE(ipfw_insn)) { /* "me" */ + cmd->opcode = O_IP6_DST_ME; + } else if (F_LEN(cmd) == + (F_INSN_SIZE(struct in6_addr) + F_INSN_SIZE(ipfw_insn))) { + /* single IP, no mask*/ + cmd->opcode = O_IP6_DST; + } else { /* addr/mask opt */ + cmd->opcode = O_IP6_DST_MASK; + } + return cmd; +} diff --git a/example/ipfw/ipfw/main.c b/example/ipfw/ipfw/main.c new file mode 100644 index 0000000..a8f5fed --- /dev/null +++ b/example/ipfw/ipfw/main.c @@ -0,0 +1,628 @@ +/* + * Copyright (c) 2002-2003,2010 Luigi Rizzo + * Copyright (c) 1996 Alex Nash, Paul Traina, Poul-Henning Kamp + * Copyright (c) 1994 Ugen J.S.Antsilevich + * + * Idea and grammar partially left from: + * Copyright (c) 1993 Daniel Boulet + * + * Redistribution and use in source forms, with and without modification, + * are permitted provided that this entire comment appears intact. + * + * Redistribution in binary form may occur without any restrictions. + * Obviously, it would be nice if you gave credit where credit is due + * but requiring it would be too onerous. + * + * This software is provided ``AS IS'' without any warranties of any kind. 
+ * + * Command line interface for IP firewall facility + * + * $FreeBSD: head/sbin/ipfw/main.c 272840 2014-10-09 19:32:35Z melifaro $ + */ + +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include + +#include "ipfw2.h" + +static void +help(void) +{ + fprintf(stderr, +"ipfw syntax summary (but please do read the ipfw(8) manpage):\n\n" +"\tipfw [-abcdefhnNqStTv] \n\n" +"where is one of the following:\n\n" +"add [num] [set N] [prob x] RULE-BODY\n" +"{pipe|queue} N config PIPE-BODY\n" +"[pipe|queue] {zero|delete|show} [N{,N}]\n" +"nat N config {ip IPADDR|if IFNAME|log|deny_in|same_ports|unreg_only|reset|\n" +" reverse|proxy_only|redirect_addr linkspec|\n" +" redirect_port linkspec|redirect_proto linkspec}\n" +"set [disable N... enable N...] | move [rule] X to Y | swap X Y | show\n" +"set N {show|list|zero|resetlog|delete} [N{,N}] | flush\n" +"table N {add ip[/bits] [value] | delete ip[/bits] | flush | list}\n" +"table all {flush | list}\n" +"\n" +"RULE-BODY: check-state [PARAMS] | ACTION [PARAMS] ADDR [OPTION_LIST]\n" +"ACTION: check-state | allow | count | deny | unreach{,6} CODE |\n" +" skipto N | {divert|tee} PORT | forward ADDR |\n" +" pipe N | queue N | nat N | setfib FIB | reass\n" +"PARAMS: [log [logamount LOGLIMIT]] [altq QUEUE_NAME]\n" +"ADDR: [ MAC dst src ether_type ] \n" +" [ ip from IPADDR [ PORT ] to IPADDR [ PORTLIST ] ]\n" +" [ ipv6|ip6 from IP6ADDR [ PORT ] to IP6ADDR [ PORTLIST ] ]\n" +"IPADDR: [not] { any | me | ip/bits{x,y,z} | table(t[,v]) | IPLIST }\n" +"IP6ADDR: [not] { any | me | me6 | ip6/bits | IP6LIST }\n" +"IP6LIST: { ip6 | ip6/bits }[,IP6LIST]\n" +"IPLIST: { ip | ip/bits | ip:mask }[,IPLIST]\n" +"OPTION_LIST: OPTION [OPTION_LIST]\n" +"OPTION: bridged | diverted | diverted-loopback | diverted-output |\n" +" {dst-ip|src-ip} IPADDR | {dst-ip6|src-ip6|dst-ipv6|src-ipv6} IP6ADDR |\n" +" {dst-port|src-port} LIST |\n" +" estab | frag | {gid|uid} N | icmptypes LIST | in | out | ipid LIST |\n" +" iplen LIST | ipoptions SPEC | ipprecedence | ipsec | iptos SPEC |\n" +" ipttl LIST | ipversion VER | keep-state | layer2 | limit ... |\n" +" icmp6types LIST | ext6hdr LIST | flow-id N[,N] | fib FIB |\n" +" mac ... | mac-type LIST | proto LIST | {recv|xmit|via} {IF|IPADDR} |\n" +" setup | {tcpack|tcpseq|tcpwin} NN | tcpflags SPEC | tcpoptions SPEC |\n" +" tcpdatalen LIST | verrevpath | versrcreach | antispoof\n" +); + + exit(0); +} + +/* + * Called with the arguments, including program name because getopt + * wants it to be present. + * Returns 0 if successful, 1 if empty command, errx() in case of errors. + * First thing we do is process parameters creating an argv[] array + * which includes the program name and a NULL entry at the end. + * If we are called with a single string, we split it on whitespace. + * Also, arguments with a trailing ',' are joined to the next one. + * The pointers (av[]) and data are in a single chunk of memory. + * av[0] points to the original program name, all other entries + * point into the allocated chunk. 
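+ * For example, the single string "add 100 allow ip from any to any"
+ * is split into ac = 9 with av[] = { progname, "add", "100", "allow",
+ * "ip", "from", "any", "to", "any", NULL }.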
+ */ +static int +ipfw_main(int oldac, char **oldav) +{ + int ch, ac; + const char *errstr; + char **av, **save_av; + int do_acct = 0; /* Show packet/byte count */ + int try_next = 0; /* set if pipe cmd not found */ + int av_size; /* compute the av size */ + char *av_p; /* used to build the av list */ + +#define WHITESP " \t\f\v\n\r" + if (oldac < 2) + return 1; /* need at least one argument */ + + if (oldac == 2) { + /* + * If we are called with one argument, try to split it into + * words for subsequent parsing. Spaces after a ',' are + * removed by copying the string in-place. + */ + char *arg = oldav[1]; /* The string is the first arg. */ + int l = strlen(arg); + int copy = 0; /* 1 if we need to copy, 0 otherwise */ + int i, j; + + for (i = j = 0; i < l; i++) { + if (arg[i] == '#') /* comment marker */ + break; + if (copy) { + arg[j++] = arg[i]; + copy = !strchr("," WHITESP, arg[i]); + } else { + copy = !strchr(WHITESP, arg[i]); + if (copy) + arg[j++] = arg[i]; + } + } + if (!copy && j > 0) /* last char was a 'blank', remove it */ + j--; + l = j; /* the new argument length */ + arg[j++] = '\0'; + if (l == 0) /* empty string! */ + return 1; + + /* + * First, count number of arguments. Because of the previous + * processing, this is just the number of blanks plus 1. + */ + for (i = 0, ac = 1; i < l; i++) + if (strchr(WHITESP, arg[i]) != NULL) + ac++; + + /* + * Allocate the argument list structure as a single block + * of memory, containing pointers and the argument + * strings. We include one entry for the program name + * because getopt expects it, and a NULL at the end + * to simplify further parsing. + */ + ac++; /* add 1 for the program name */ + av_size = (ac+1) * sizeof(char *) + l + 1; + av = safe_calloc(av_size, 1); + + /* + * Init the argument pointer to the end of the array + * and copy arguments from arg[] to av[]. For each one, + * j is the initial character, i is the one past the end. + */ + av_p = (char *)&av[ac+1]; + for (ac = 1, i = j = 0; i < l; i++) { + if (strchr(WHITESP, arg[i]) != NULL || i == l-1) { + if (i == l-1) + i++; + bcopy(arg+j, av_p, i-j); + av[ac] = av_p; + av_p += i-j; /* the length of the string */ + *av_p++ = '\0'; + ac++; + j = i + 1; + } + } + } else { + /* + * If an argument ends with ',' join with the next one. + */ + int first, i, l=0; + + /* + * Allocate the argument list structure as a single block + * of memory, containing both pointers and the argument + * strings. We include some space for the program name + * because getopt expects it. + * We add an extra pointer to the end of the array, + * to make simpler further parsing. + */ + for (i=0; i= 2 && !strcmp(av[1], "sysctl")) { + char *s; + int i; + + if (ac != 3) { + printf( "sysctl emulation usage:\n" + " ipfw sysctl name[=value]\n" + " ipfw sysctl -a\n"); + return 0; + } + s = strchr(av[2], '='); + if (s == NULL) { + s = !strcmp(av[2], "-a") ? NULL : av[2]; + sysctlbyname(s, NULL, NULL, NULL, 0); + } else { /* ipfw sysctl x.y.z=value */ + /* assume an INT value, will extend later */ + if (s[1] == '\0') { + printf("ipfw sysctl: missing value\n\n"); + return 0; + } + *s = '\0'; + i = strtol(s+1, NULL, 0); + sysctlbyname(av[2], NULL, NULL, &i, sizeof(int)); + } + return 0; + } +#endif + + /* Save arguments for final freeing of memory. 
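+ * (av itself is advanced past the parsed options below, so the final
+ * free() must use this saved pointer.)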
*/ + save_av = av; + + optind = optreset = 1; /* restart getopt() */ + while ((ch = getopt(ac, av, "abcdefhinNp:qs:STtv")) != -1) + switch (ch) { + case 'a': + do_acct = 1; + break; + + case 'b': + co.comment_only = 1; + co.do_compact = 1; + break; + + case 'c': + co.do_compact = 1; + break; + + case 'd': + co.do_dynamic = 1; + break; + + case 'e': + co.do_expired = 1; + break; + + case 'f': + co.do_force = 1; + break; + + case 'h': /* help */ + free(save_av); + help(); + break; /* NOTREACHED */ + + case 'i': + co.do_value_as_ip = 1; + break; + + case 'n': + co.test_only = 1; + break; + + case 'N': + co.do_resolv = 1; + break; + + case 'p': + errx(EX_USAGE, "An absolute pathname must be used " + "with -p option."); + /* NOTREACHED */ + + case 'q': + co.do_quiet = 1; + break; + + case 's': /* sort */ + co.do_sort = atoi(optarg); + break; + + case 'S': + co.show_sets = 1; + break; + + case 't': + co.do_time = 1; + break; + + case 'T': + co.do_time = 2; /* numeric timestamp */ + break; + + case 'v': /* verbose */ + co.verbose = 1; + break; + + default: + free(save_av); + return 1; + } + + ac -= optind; + av += optind; + NEED1("bad arguments, for usage summary ``ipfw''"); + + /* + * An undocumented behaviour of ipfw1 was to allow rule numbers first, + * e.g. "100 add allow ..." instead of "add 100 allow ...". + * In case, swap first and second argument to get the normal form. + */ + if (ac > 1 && isdigit(*av[0])) { + char *p = av[0]; + + av[0] = av[1]; + av[1] = p; + } + + /* + * Optional: pipe, queue or nat. + */ + co.do_nat = 0; + co.do_pipe = 0; + co.use_set = 0; + if (!strncmp(*av, "nat", strlen(*av))) + co.do_nat = 1; + else if (!strncmp(*av, "pipe", strlen(*av))) + co.do_pipe = 1; + else if (_substrcmp(*av, "queue") == 0) + co.do_pipe = 2; + else if (_substrcmp(*av, "flowset") == 0) + co.do_pipe = 2; + else if (_substrcmp(*av, "sched") == 0) + co.do_pipe = 3; + else if (!strncmp(*av, "set", strlen(*av))) { + if (ac > 1 && isdigit(av[1][0])) { + co.use_set = strtonum(av[1], 0, resvd_set_number, + &errstr); + if (errstr) + errx(EX_DATAERR, + "invalid set number %s\n", av[1]); + ac -= 2; av += 2; co.use_set++; + } + } + + if (co.do_pipe || co.do_nat) { + ac--; + av++; + } + NEED1("missing command"); + + /* + * For pipes, queues and nats we normally say 'nat|pipe NN config' + * but the code is easier to parse as 'nat|pipe config NN' + * so we swap the two arguments. 
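+ * For example, 'pipe 10 config' is dispatched below as if it had been
+ * written 'pipe config 10'.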
+ */ + if ((co.do_pipe || co.do_nat) && ac > 1 && isdigit(*av[0])) { + char *p = av[0]; + + av[0] = av[1]; + av[1] = p; + } + + if (co.use_set == 0) { + if (_substrcmp(*av, "add") == 0) + ipfw_add(av); + else if (co.do_nat && _substrcmp(*av, "show") == 0) + ipfw_show_nat(ac, av); + else if (co.do_pipe && _substrcmp(*av, "config") == 0) + ipfw_config_pipe(ac, av); + else if (co.do_nat && _substrcmp(*av, "config") == 0) + ipfw_config_nat(ac, av); + else if (_substrcmp(*av, "set") == 0) + ipfw_sets_handler(av); + else if (_substrcmp(*av, "table") == 0) + ipfw_table_handler(ac, av); + else if (_substrcmp(*av, "enable") == 0) + ipfw_sysctl_handler(av, 1); + else if (_substrcmp(*av, "disable") == 0) + ipfw_sysctl_handler(av, 0); + else + try_next = 1; + } + + if (co.use_set || try_next) { + if (_substrcmp(*av, "delete") == 0) + ipfw_delete(av); + else if (_substrcmp(*av, "flush") == 0) + ipfw_flush(co.do_force); + else if (_substrcmp(*av, "zero") == 0) + ipfw_zero(ac, av, 0 /* IP_FW_ZERO */); + else if (_substrcmp(*av, "resetlog") == 0) + ipfw_zero(ac, av, 1 /* IP_FW_RESETLOG */); + else if (_substrcmp(*av, "print") == 0 || + _substrcmp(*av, "list") == 0) + ipfw_list(ac, av, do_acct); + else if (_substrcmp(*av, "show") == 0) + ipfw_list(ac, av, 1 /* show counters */); + else if (_substrcmp(*av, "table") == 0) + ipfw_table_handler(ac, av); + else if (_substrcmp(*av, "internal") == 0) + ipfw_internal_handler(ac, av); + else + errx(EX_USAGE, "bad command `%s'", *av); + } + + /* Free memory allocated in the argument parsing. */ + free(save_av); + return 0; +} + + +static void +ipfw_readfile(int ac, char *av[]) +{ +#define MAX_ARGS 32 + char buf[4096]; + char *progname = av[0]; /* original program name */ + const char *cmd = NULL; /* preprocessor name, if any */ + const char *filename = av[ac-1]; /* file to read */ + int c, lineno=0; + FILE *f = NULL; + pid_t preproc = 0; + + while ((c = getopt(ac, av, "cfNnp:qS")) != -1) { + switch(c) { + case 'c': + co.do_compact = 1; + break; + + case 'f': + co.do_force = 1; + break; + + case 'N': + co.do_resolv = 1; + break; + + case 'n': + co.test_only = 1; + break; + + case 'p': + /* + * ipfw -p cmd [args] filename + * + * We are done with getopt(). All arguments + * except the filename go to the preprocessor, + * so we need to do the following: + * - check that a filename is actually present; + * - advance av by optind-1 to skip arguments + * already processed; + * - decrease ac by optind, to remove the args + * already processed and the final filename; + * - set the last entry in av[] to NULL so + * popen() can detect the end of the array; + * - set optind=ac to let getopt() terminate. + */ + if (optind == ac) + errx(EX_USAGE, "no filename argument"); + cmd = optarg; + av[ac-1] = NULL; + av += optind - 1; + ac -= optind; + optind = ac; + break; + + case 'q': + co.do_quiet = 1; + break; + + case 'S': + co.show_sets = 1; + break; + + default: + errx(EX_USAGE, "bad arguments, for usage" + " summary ``ipfw''"); + } + + } + + if (cmd == NULL && ac != optind + 1) + errx(EX_USAGE, "extraneous filename arguments %s", av[ac-1]); + + if ((f = fopen(filename, "r")) == NULL) + err(EX_UNAVAILABLE, "fopen: %s", filename); + + if (cmd != NULL) { /* pipe through preprocessor */ + int pipedes[2]; + + if (pipe(pipedes) == -1) + err(EX_OSERR, "cannot create pipe"); + + preproc = fork(); + if (preproc == -1) + err(EX_OSERR, "cannot fork"); + + if (preproc == 0) { + /* + * Child, will run the preprocessor with the + * file on stdin and the pipe on stdout. 
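+ * For example, with 'ipfw -p /usr/bin/m4 /etc/ipfw.rules' the child
+ * execs m4 reading the rule file on stdin, and the parent parses the
+ * expanded rules from the pipe.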
+ */ + if (dup2(fileno(f), 0) == -1 + || dup2(pipedes[1], 1) == -1) + err(EX_OSERR, "dup2()"); + fclose(f); + close(pipedes[1]); + close(pipedes[0]); + execvp(cmd, av); + err(EX_OSERR, "execvp(%s) failed", cmd); + } else { /* parent, will reopen f as the pipe */ + fclose(f); + close(pipedes[1]); + if ((f = fdopen(pipedes[0], "r")) == NULL) { + int savederrno = errno; + + (void)kill(preproc, SIGTERM); + errno = savederrno; + err(EX_OSERR, "fdopen()"); + } + } + } + + while (fgets(buf, sizeof(buf), f)) { /* read commands */ + char linename[20]; + char *args[2]; + + lineno++; + snprintf(linename, sizeof(linename), "Line %d", lineno); + setprogname(linename); /* XXX */ + args[0] = progname; + args[1] = buf; + ipfw_main(2, args); + } + fclose(f); + if (cmd != NULL) { + int status; + + if (waitpid(preproc, &status, 0) == -1) + errx(EX_OSERR, "waitpid()"); + if (WIFEXITED(status) && WEXITSTATUS(status) != EX_OK) + errx(EX_UNAVAILABLE, + "preprocessor exited with status %d", + WEXITSTATUS(status)); + else if (WIFSIGNALED(status)) + errx(EX_UNAVAILABLE, + "preprocessor exited with signal %d", + WTERMSIG(status)); + } +} + +int +main(int ac, char *av[]) +{ +#if defined(_WIN32) && defined(TCC) + { + WSADATA wsaData; + int ret=0; + unsigned short wVersionRequested = MAKEWORD(2, 2); + ret = WSAStartup(wVersionRequested, &wsaData); + if (ret != 0) { + /* Tell the user that we could not find a usable */ + /* Winsock DLL. */ + printf("WSAStartup failed with error: %d\n", ret); + return 1; + } + } +#endif + /* + * If the last argument is an absolute pathname, interpret it + * as a file to be preprocessed. + */ + + if (ac > 1 && av[ac - 1][0] == '/') { + if (access(av[ac - 1], R_OK) == 0) + ipfw_readfile(ac, av); + else + err(EX_USAGE, "pathname: %s", av[ac - 1]); + } else { + if (ipfw_main(ac, av)) { + errx(EX_USAGE, + "usage: ipfw [options]\n" + "do \"ipfw -h\" or \"man ipfw\" for details"); + } + } + return EX_OK; +} diff --git a/example/ipfw/ipfw/nat.c b/example/ipfw/ipfw/nat.c new file mode 100644 index 0000000..dc2364f --- /dev/null +++ b/example/ipfw/ipfw/nat.c @@ -0,0 +1,1115 @@ +/* + * Copyright (c) 2002-2003 Luigi Rizzo + * Copyright (c) 1996 Alex Nash, Paul Traina, Poul-Henning Kamp + * Copyright (c) 1994 Ugen J.S.Antsilevich + * + * Idea and grammar partially left from: + * Copyright (c) 1993 Daniel Boulet + * + * Redistribution and use in source forms, with and without modification, + * are permitted provided that this entire comment appears intact. + * + * Redistribution in binary form may occur without any restrictions. + * Obviously, it would be nice if you gave credit where credit is due + * but requiring it would be too onerous. + * + * This software is provided ``AS IS'' without any warranties of any kind. + * + * NEW command line interface for IP firewall facility + * + * $FreeBSD: head/sbin/ipfw/nat.c 272840 2014-10-09 19:32:35Z melifaro $ + * + * In-kernel nat support + */ + +#include +#include +#include + +#include "ipfw2.h" + +#include +#include +#include +#include +#include +#include +#include +#include + +#include +#include +#include /* def. 
of struct route */ +#include +#include +#include +#include + +typedef int (nat_cb_t)(struct nat44_cfg_nat *cfg, void *arg); +static void nat_show_cfg(struct nat44_cfg_nat *n, void *arg); +static void nat_show_log(struct nat44_cfg_nat *n, void *arg); +static int nat_show_data(struct nat44_cfg_nat *cfg, void *arg); +static int natname_cmp(const void *a, const void *b); +static int nat_foreach(nat_cb_t *f, void *arg, int sort); +static int nat_get_cmd(char *name, uint16_t cmd, ipfw_obj_header **ooh); + +static struct _s_x nat_params[] = { + { "ip", TOK_IP }, + { "if", TOK_IF }, + { "log", TOK_ALOG }, + { "deny_in", TOK_DENY_INC }, + { "same_ports", TOK_SAME_PORTS }, + { "unreg_only", TOK_UNREG_ONLY }, + { "skip_global", TOK_SKIP_GLOBAL }, + { "reset", TOK_RESET_ADDR }, + { "reverse", TOK_ALIAS_REV }, + { "proxy_only", TOK_PROXY_ONLY }, + { "redirect_addr", TOK_REDIR_ADDR }, + { "redirect_port", TOK_REDIR_PORT }, + { "redirect_proto", TOK_REDIR_PROTO }, + { NULL, 0 } /* terminator */ +}; + + +/* + * Search for interface with name "ifn", and fill n accordingly: + * + * n->ip ip address of interface "ifn" + * n->if_name copy of interface name "ifn" + */ +static void +set_addr_dynamic(const char *ifn, struct nat44_cfg_nat *n) +{ + size_t needed; + int mib[6]; + char *buf, *lim, *next; + struct if_msghdr *ifm; + struct ifa_msghdr *ifam; + struct sockaddr_dl *sdl; + struct sockaddr_in *sin; + int ifIndex, ifMTU; + + mib[0] = CTL_NET; + mib[1] = PF_ROUTE; + mib[2] = 0; + mib[3] = AF_INET; + mib[4] = NET_RT_IFLIST; + mib[5] = 0; +/* + * Get interface data. + */ + if (sysctl(mib, 6, NULL, &needed, NULL, 0) == -1) + err(1, "iflist-sysctl-estimate"); + buf = safe_calloc(1, needed); + if (sysctl(mib, 6, buf, &needed, NULL, 0) == -1) + err(1, "iflist-sysctl-get"); + lim = buf + needed; +/* + * Loop through interfaces until one with + * given name is found. This is done to + * find correct interface index for routing + * message processing. + */ + ifIndex = 0; + next = buf; + while (next < lim) { + ifm = (struct if_msghdr *)next; + next += ifm->ifm_msglen; + if (ifm->ifm_version != RTM_VERSION) { + if (co.verbose) + warnx("routing message version %d " + "not understood", ifm->ifm_version); + continue; + } + if (ifm->ifm_type == RTM_IFINFO) { + sdl = (struct sockaddr_dl *)(ifm + 1); + if (strlen(ifn) == sdl->sdl_nlen && + strncmp(ifn, sdl->sdl_data, sdl->sdl_nlen) == 0) { + ifIndex = ifm->ifm_index; + ifMTU = ifm->ifm_data.ifi_mtu; + break; + } + } + } + if (!ifIndex) + errx(1, "unknown interface name %s", ifn); +/* + * Get interface address. + */ + sin = NULL; + while (next < lim) { + ifam = (struct ifa_msghdr *)next; + next += ifam->ifam_msglen; + if (ifam->ifam_version != RTM_VERSION) { + if (co.verbose) + warnx("routing message version %d " + "not understood", ifam->ifam_version); + continue; + } + if (ifam->ifam_type != RTM_NEWADDR) + break; + if (ifam->ifam_addrs & RTA_IFA) { + int i; + char *cp = (char *)(ifam + 1); + + for (i = 1; i < RTA_IFA; i <<= 1) { + if (ifam->ifam_addrs & i) + cp += SA_SIZE((struct sockaddr *)cp); + } + if (((struct sockaddr *)cp)->sa_family == AF_INET) { + sin = (struct sockaddr_in *)cp; + break; + } + } + } + if (sin == NULL) + errx(1, "%s: cannot get interface address", ifn); + + n->ip = sin->sin_addr; + strncpy(n->if_name, ifn, IF_NAMESIZE); + + free(buf); +} + +/* + * XXX - The following functions, macros and definitions come from natd.c: + * it would be better to move them outside natd.c, in a file + * (redirect_support.[ch]?) 
shared by ipfw and natd, but for now i can live + * with it. + */ + +/* + * Definition of a port range, and macros to deal with values. + * FORMAT: HI 16-bits == first port in range, 0 == all ports. + * LO 16-bits == number of ports in range + * NOTES: - Port values are not stored in network byte order. + */ + +#define port_range u_long + +#define GETLOPORT(x) ((x) >> 0x10) +#define GETNUMPORTS(x) ((x) & 0x0000ffff) +#define GETHIPORT(x) (GETLOPORT((x)) + GETNUMPORTS((x))) + +/* Set y to be the low-port value in port_range variable x. */ +#define SETLOPORT(x,y) ((x) = ((x) & 0x0000ffff) | ((y) << 0x10)) + +/* Set y to be the number of ports in port_range variable x. */ +#define SETNUMPORTS(x,y) ((x) = ((x) & 0xffff0000) | (y)) + +static void +StrToAddr (const char* str, struct in_addr* addr) +{ + struct hostent* hp; + + if (inet_aton (str, addr)) + return; + + hp = gethostbyname (str); + if (!hp) + errx (1, "unknown host %s", str); + + memcpy (addr, hp->h_addr, sizeof (struct in_addr)); +} + +static int +StrToPortRange (const char* str, const char* proto, port_range *portRange) +{ + char* sep; + struct servent* sp; + char* end; + u_short loPort; + u_short hiPort; + + /* First see if this is a service, return corresponding port if so. */ + sp = getservbyname (str,proto); + if (sp) { + SETLOPORT(*portRange, ntohs(sp->s_port)); + SETNUMPORTS(*portRange, 1); + return 0; + } + + /* Not a service, see if it's a single port or port range. */ + sep = strchr (str, '-'); + if (sep == NULL) { + SETLOPORT(*portRange, strtol(str, &end, 10)); + if (end != str) { + /* Single port. */ + SETNUMPORTS(*portRange, 1); + return 0; + } + + /* Error in port range field. */ + errx (EX_DATAERR, "%s/%s: unknown service", str, proto); + } + + /* Port range, get the values and sanity check. */ + sscanf (str, "%hu-%hu", &loPort, &hiPort); + SETLOPORT(*portRange, loPort); + SETNUMPORTS(*portRange, 0); /* Error by default */ + if (loPort <= hiPort) + SETNUMPORTS(*portRange, hiPort - loPort + 1); + + if (GETNUMPORTS(*portRange) == 0) + errx (EX_DATAERR, "invalid port range %s", str); + + return 0; +} + +static int +StrToProto (const char* str) +{ + if (!strcmp (str, "tcp")) + return IPPROTO_TCP; + + if (!strcmp (str, "udp")) + return IPPROTO_UDP; + + if (!strcmp (str, "sctp")) + return IPPROTO_SCTP; + errx (EX_DATAERR, "unknown protocol %s. Expected sctp, tcp or udp", str); +} + +static int +StrToAddrAndPortRange (const char* str, struct in_addr* addr, char* proto, + port_range *portRange) +{ + char* ptr; + + ptr = strchr (str, ':'); + if (!ptr) + errx (EX_DATAERR, "%s is missing port number", str); + + *ptr = '\0'; + ++ptr; + + StrToAddr (str, addr); + return StrToPortRange (ptr, proto, portRange); +} + +/* End of stuff taken from natd.c. */ + +/* + * The next 3 functions add support for the addr, port and proto redirect and + * their logic is loosely based on SetupAddressRedirect(), SetupPortRedirect() + * and SetupProtoRedirect() from natd.c. + * + * Every setup_* function fills at least one redirect entry + * (struct nat44_cfg_redir) and zero or more server pool entry + * (struct nat44_cfg_spool) in buf. + * + * The format of data in buf is: + * + * nat44_cfg_nat nat44_cfg_redir nat44_cfg_spool ...... nat44_cfg_spool + * + * ------------------------------------- ------------ + * | | .....X ..... | | | | ..... + * ------------------------------------- ...... ------------ + * ^ + * spool_cnt n=0 ...... 
n=(X-1) + * + * len points to the amount of available space in buf + * space counts the memory consumed by every function + * + * XXX - Every function get all the argv params so it + * has to check, in optional parameters, that the next + * args is a valid option for the redir entry and not + * another token. Only redir_port and redir_proto are + * affected by this. + */ + +static int +estimate_redir_addr(int *ac, char ***av) +{ + size_t space = sizeof(struct nat44_cfg_redir); + char *sep = **av; + u_int c = 0; + + (void)ac; /* UNUSED */ + while ((sep = strchr(sep, ',')) != NULL) { + c++; + sep++; + } + + if (c > 0) + c++; + + space += c * sizeof(struct nat44_cfg_spool); + + return (space); +} + +static int +setup_redir_addr(char *buf, int *ac, char ***av) +{ + struct nat44_cfg_redir *r; + char *sep; + size_t space; + + r = (struct nat44_cfg_redir *)buf; + r->mode = REDIR_ADDR; + /* Skip nat44_cfg_redir at beginning of buf. */ + buf = &buf[sizeof(struct nat44_cfg_redir)]; + space = sizeof(struct nat44_cfg_redir); + + /* Extract local address. */ + if (strchr(**av, ',') != NULL) { + struct nat44_cfg_spool *spool; + + /* Setup LSNAT server pool. */ + r->laddr.s_addr = INADDR_NONE; + sep = strtok(**av, ","); + while (sep != NULL) { + spool = (struct nat44_cfg_spool *)buf; + space += sizeof(struct nat44_cfg_spool); + StrToAddr(sep, &spool->addr); + spool->port = ~0; + r->spool_cnt++; + /* Point to the next possible nat44_cfg_spool. */ + buf = &buf[sizeof(struct nat44_cfg_spool)]; + sep = strtok(NULL, ","); + } + } else + StrToAddr(**av, &r->laddr); + (*av)++; (*ac)--; + + /* Extract public address. */ + StrToAddr(**av, &r->paddr); + (*av)++; (*ac)--; + + return (space); +} + +static int +estimate_redir_port(int *ac, char ***av) +{ + size_t space = sizeof(struct nat44_cfg_redir); + char *sep = **av; + u_int c = 0; + + (void)ac; /* UNUSED */ + while ((sep = strchr(sep, ',')) != NULL) { + c++; + sep++; + } + + if (c > 0) + c++; + + space += c * sizeof(struct nat44_cfg_spool); + + return (space); +} + +static int +setup_redir_port(char *buf, int *ac, char ***av) +{ + struct nat44_cfg_redir *r; + char *sep, *protoName, *lsnat = NULL; + size_t space; + u_short numLocalPorts; + port_range portRange; + + numLocalPorts = 0; + + r = (struct nat44_cfg_redir *)buf; + r->mode = REDIR_PORT; + /* Skip nat44_cfg_redir at beginning of buf. */ + buf = &buf[sizeof(struct nat44_cfg_redir)]; + space = sizeof(struct nat44_cfg_redir); + + /* + * Extract protocol. + */ + r->proto = StrToProto(**av); + protoName = **av; + (*av)++; (*ac)--; + + /* + * Extract local address. + */ + if (strchr(**av, ',') != NULL) { + r->laddr.s_addr = INADDR_NONE; + r->lport = ~0; + numLocalPorts = 1; + lsnat = **av; + } else { + /* + * The sctp nat does not allow the port numbers to be mapped to + * new port numbers. Therefore, no ports are to be specified + * in the target port field. + */ + if (r->proto == IPPROTO_SCTP) { + if (strchr(**av, ':')) + errx(EX_DATAERR, "redirect_port:" + "port numbers do not change in sctp, so do " + "not specify them as part of the target"); + else + StrToAddr(**av, &r->laddr); + } else { + if (StrToAddrAndPortRange(**av, &r->laddr, protoName, + &portRange) != 0) + errx(EX_DATAERR, "redirect_port: " + "invalid local port range"); + + r->lport = GETLOPORT(portRange); + numLocalPorts = GETNUMPORTS(portRange); + } + } + (*av)++; (*ac)--; + + /* + * Extract public port and optionally address. 
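+ * That is, the argument is either 'addr:port[-port]' or a bare port
+ * range; in the latter case the public address defaults to INADDR_ANY.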
+ */ + if (strchr(**av, ':') != NULL) { + if (StrToAddrAndPortRange(**av, &r->paddr, protoName, + &portRange) != 0) + errx(EX_DATAERR, "redirect_port: " + "invalid public port range"); + } else { + r->paddr.s_addr = INADDR_ANY; + if (StrToPortRange(**av, protoName, &portRange) != 0) + errx(EX_DATAERR, "redirect_port: " + "invalid public port range"); + } + + r->pport = GETLOPORT(portRange); + if (r->proto == IPPROTO_SCTP) { /* so the logic below still works */ + numLocalPorts = GETNUMPORTS(portRange); + r->lport = r->pport; + } + r->pport_cnt = GETNUMPORTS(portRange); + (*av)++; (*ac)--; + + /* + * Extract remote address and optionally port. + */ + /* + * NB: isdigit(**av) => we've to check that next parameter is really an + * option for this redirect entry, else stop here processing arg[cv]. + */ + if (*ac != 0 && isdigit(***av)) { + if (strchr(**av, ':') != NULL) { + if (StrToAddrAndPortRange(**av, &r->raddr, protoName, + &portRange) != 0) + errx(EX_DATAERR, "redirect_port: " + "invalid remote port range"); + } else { + SETLOPORT(portRange, 0); + SETNUMPORTS(portRange, 1); + StrToAddr(**av, &r->raddr); + } + (*av)++; (*ac)--; + } else { + SETLOPORT(portRange, 0); + SETNUMPORTS(portRange, 1); + r->raddr.s_addr = INADDR_ANY; + } + r->rport = GETLOPORT(portRange); + r->rport_cnt = GETNUMPORTS(portRange); + + /* + * Make sure port ranges match up, then add the redirect ports. + */ + if (numLocalPorts != r->pport_cnt) + errx(EX_DATAERR, "redirect_port: " + "port ranges must be equal in size"); + + /* Remote port range is allowed to be '0' which means all ports. */ + if (r->rport_cnt != numLocalPorts && + (r->rport_cnt != 1 || r->rport != 0)) + errx(EX_DATAERR, "redirect_port: remote port must" + "be 0 or equal to local port range in size"); + + /* Setup LSNAT server pool. */ + if (lsnat != NULL) { + struct nat44_cfg_spool *spool; + + sep = strtok(lsnat, ","); + while (sep != NULL) { + spool = (struct nat44_cfg_spool *)buf; + space += sizeof(struct nat44_cfg_spool); + /* + * The sctp nat does not allow the port numbers to + * be mapped to new port numbers. Therefore, no ports + * are to be specified in the target port field. + */ + if (r->proto == IPPROTO_SCTP) { + if (strchr (sep, ':')) { + errx(EX_DATAERR, "redirect_port:" + "port numbers do not change in " + "sctp, so do not specify them as " + "part of the target"); + } else { + StrToAddr(sep, &spool->addr); + spool->port = r->pport; + } + } else { + if (StrToAddrAndPortRange(sep, &spool->addr, + protoName, &portRange) != 0) + errx(EX_DATAERR, "redirect_port:" + "invalid local port range"); + if (GETNUMPORTS(portRange) != 1) + errx(EX_DATAERR, "redirect_port: " + "local port must be single in " + "this context"); + spool->port = GETLOPORT(portRange); + } + r->spool_cnt++; + /* Point to the next possible nat44_cfg_spool. */ + buf = &buf[sizeof(struct nat44_cfg_spool)]; + sep = strtok(NULL, ","); + } + } + + return (space); +} + +static int +setup_redir_proto(char *buf, int *ac, char ***av) +{ + struct nat44_cfg_redir *r; + struct protoent *protoent; + size_t space; + + r = (struct nat44_cfg_redir *)buf; + r->mode = REDIR_PROTO; + /* Skip nat44_cfg_redir at beginning of buf. */ + buf = &buf[sizeof(struct nat44_cfg_redir)]; + space = sizeof(struct nat44_cfg_redir); + + /* + * Extract protocol. + */ + protoent = getprotobyname(**av); + if (protoent == NULL) + errx(EX_DATAERR, "redirect_proto: unknown protocol %s", **av); + else + r->proto = protoent->p_proto; + + (*av)++; (*ac)--; + + /* + * Extract local address. 
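+ * (A bare address: redirect_proto entries carry no port numbers.)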
+ */ + StrToAddr(**av, &r->laddr); + + (*av)++; (*ac)--; + + /* + * Extract optional public address. + */ + if (*ac == 0) { + r->paddr.s_addr = INADDR_ANY; + r->raddr.s_addr = INADDR_ANY; + } else { + /* see above in setup_redir_port() */ + if (isdigit(***av)) { + StrToAddr(**av, &r->paddr); + (*av)++; (*ac)--; + + /* + * Extract optional remote address. + */ + /* see above in setup_redir_port() */ + if (*ac != 0 && isdigit(***av)) { + StrToAddr(**av, &r->raddr); + (*av)++; (*ac)--; + } + } + } + + return (space); +} + +static void +nat_show_log(struct nat44_cfg_nat *n, void *arg) +{ + char *buf; + + buf = (char *)(n + 1); + if (buf[0] != '\0') + printf("nat %s: %s\n", n->name, buf); +} + +static void +nat_show_cfg(struct nat44_cfg_nat *n, void *arg) +{ + int i, cnt, flag, off; + struct nat44_cfg_redir *t; + struct nat44_cfg_spool *s; + caddr_t buf; + struct protoent *p; + + buf = (caddr_t)n; + flag = 1; + off = sizeof(*n); + printf("ipfw nat %s config", n->name); + if (strlen(n->if_name) != 0) + printf(" if %s", n->if_name); + else if (n->ip.s_addr != 0) + printf(" ip %s", inet_ntoa(n->ip)); + while (n->mode != 0) { + if (n->mode & PKT_ALIAS_LOG) { + printf(" log"); + n->mode &= ~PKT_ALIAS_LOG; + } else if (n->mode & PKT_ALIAS_DENY_INCOMING) { + printf(" deny_in"); + n->mode &= ~PKT_ALIAS_DENY_INCOMING; + } else if (n->mode & PKT_ALIAS_SAME_PORTS) { + printf(" same_ports"); + n->mode &= ~PKT_ALIAS_SAME_PORTS; + } else if (n->mode & PKT_ALIAS_SKIP_GLOBAL) { + printf(" skip_global"); + n->mode &= ~PKT_ALIAS_SKIP_GLOBAL; + } else if (n->mode & PKT_ALIAS_UNREGISTERED_ONLY) { + printf(" unreg_only"); + n->mode &= ~PKT_ALIAS_UNREGISTERED_ONLY; + } else if (n->mode & PKT_ALIAS_RESET_ON_ADDR_CHANGE) { + printf(" reset"); + n->mode &= ~PKT_ALIAS_RESET_ON_ADDR_CHANGE; + } else if (n->mode & PKT_ALIAS_REVERSE) { + printf(" reverse"); + n->mode &= ~PKT_ALIAS_REVERSE; + } else if (n->mode & PKT_ALIAS_PROXY_ONLY) { + printf(" proxy_only"); + n->mode &= ~PKT_ALIAS_PROXY_ONLY; + } + } + /* Print all the redirect's data configuration. 
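+ * Each nat44_cfg_redir sits at buf[off] and is immediately followed
+ * by its spool_cnt nat44_cfg_spool entries; off advances past both.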
*/ + for (cnt = 0; cnt < n->redir_cnt; cnt++) { + t = (struct nat44_cfg_redir *)&buf[off]; + off += sizeof(struct nat44_cfg_redir); + switch (t->mode) { + case REDIR_ADDR: + printf(" redirect_addr"); + if (t->spool_cnt == 0) + printf(" %s", inet_ntoa(t->laddr)); + else + for (i = 0; i < t->spool_cnt; i++) { + s = (struct nat44_cfg_spool *)&buf[off]; + if (i) + printf(","); + else + printf(" "); + printf("%s", inet_ntoa(s->addr)); + off += sizeof(struct nat44_cfg_spool); + } + printf(" %s", inet_ntoa(t->paddr)); + break; + case REDIR_PORT: + p = getprotobynumber(t->proto); + printf(" redirect_port %s ", p->p_name); + if (!t->spool_cnt) { + printf("%s:%u", inet_ntoa(t->laddr), t->lport); + if (t->pport_cnt > 1) + printf("-%u", t->lport + + t->pport_cnt - 1); + } else + for (i=0; i < t->spool_cnt; i++) { + s = (struct nat44_cfg_spool *)&buf[off]; + if (i) + printf(","); + printf("%s:%u", inet_ntoa(s->addr), + s->port); + off += sizeof(struct nat44_cfg_spool); + } + + printf(" "); + if (t->paddr.s_addr) + printf("%s:", inet_ntoa(t->paddr)); + printf("%u", t->pport); + if (!t->spool_cnt && t->pport_cnt > 1) + printf("-%u", t->pport + t->pport_cnt - 1); + + if (t->raddr.s_addr) { + printf(" %s", inet_ntoa(t->raddr)); + if (t->rport) { + printf(":%u", t->rport); + if (!t->spool_cnt && t->rport_cnt > 1) + printf("-%u", t->rport + + t->rport_cnt - 1); + } + } + break; + case REDIR_PROTO: + p = getprotobynumber(t->proto); + printf(" redirect_proto %s %s", p->p_name, + inet_ntoa(t->laddr)); + if (t->paddr.s_addr != 0) { + printf(" %s", inet_ntoa(t->paddr)); + if (t->raddr.s_addr) + printf(" %s", inet_ntoa(t->raddr)); + } + break; + default: + errx(EX_DATAERR, "unknown redir mode"); + break; + } + } + printf("\n"); +} + +void +ipfw_config_nat(int ac, char **av) +{ + ipfw_obj_header *oh; + struct nat44_cfg_nat *n; /* Nat instance configuration. */ + int i, off, tok, ac1; + char *id, *buf, **av1, *end; + size_t len; + + av++; + ac--; + /* Nat id. 
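+ * It must parse as a positive integer, e.g. 'ipfw nat 123 config
+ * ip 192.0.2.1'.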
*/ + if (ac == 0) + errx(EX_DATAERR, "missing nat id"); + id = *av; + i = (int)strtol(id, &end, 0); + if (i <= 0 || *end != '\0') + errx(EX_DATAERR, "illegal nat id: %s", id); + av++; + ac--; + if (ac == 0) + errx(EX_DATAERR, "missing option"); + + len = sizeof(*oh) + sizeof(*n); + ac1 = ac; + av1 = av; + while (ac1 > 0) { + tok = match_token(nat_params, *av1); + ac1--; + av1++; + switch (tok) { + case TOK_IP: + case TOK_IF: + ac1--; + av1++; + break; + case TOK_ALOG: + case TOK_DENY_INC: + case TOK_SAME_PORTS: + case TOK_SKIP_GLOBAL: + case TOK_UNREG_ONLY: + case TOK_RESET_ADDR: + case TOK_ALIAS_REV: + case TOK_PROXY_ONLY: + break; + case TOK_REDIR_ADDR: + if (ac1 < 2) + errx(EX_DATAERR, "redirect_addr: " + "not enough arguments"); + len += estimate_redir_addr(&ac1, &av1); + av1 += 2; + ac1 -= 2; + break; + case TOK_REDIR_PORT: + if (ac1 < 3) + errx(EX_DATAERR, "redirect_port: " + "not enough arguments"); + av1++; + ac1--; + len += estimate_redir_port(&ac1, &av1); + av1 += 2; + ac1 -= 2; + /* Skip optional remoteIP/port */ + if (ac1 != 0 && isdigit(**av1)) { + av1++; + ac1--; + } + break; + case TOK_REDIR_PROTO: + if (ac1 < 2) + errx(EX_DATAERR, "redirect_proto: " + "not enough arguments"); + len += sizeof(struct nat44_cfg_redir); + av1 += 2; + ac1 -= 2; + /* Skip optional remoteIP/port */ + if (ac1 != 0 && isdigit(**av1)) { + av1++; + ac1--; + } + if (ac1 != 0 && isdigit(**av1)) { + av1++; + ac1--; + } + break; + default: + errx(EX_DATAERR, "unrecognised option ``%s''", av1[-1]); + } + } + + if ((buf = malloc(len)) == NULL) + errx(EX_OSERR, "malloc failed"); + + /* Offset in buf: save space for header at the beginning. */ + off = sizeof(*oh) + sizeof(*n); + memset(buf, 0, len); + oh = (ipfw_obj_header *)buf; + n = (struct nat44_cfg_nat *)(oh + 1); + oh->ntlv.head.length = sizeof(oh->ntlv); + snprintf(oh->ntlv.name, sizeof(oh->ntlv.name), "%d", i); + snprintf(n->name, sizeof(n->name), "%d", i); + + while (ac > 0) { + tok = match_token(nat_params, *av); + ac--; + av++; + switch (tok) { + case TOK_IP: + if (ac == 0) + errx(EX_DATAERR, "missing option"); + if (!inet_aton(av[0], &(n->ip))) + errx(EX_DATAERR, "bad ip address ``%s''", + av[0]); + ac--; + av++; + break; + case TOK_IF: + if (ac == 0) + errx(EX_DATAERR, "missing option"); + set_addr_dynamic(av[0], n); + ac--; + av++; + break; + case TOK_ALOG: + n->mode |= PKT_ALIAS_LOG; + break; + case TOK_DENY_INC: + n->mode |= PKT_ALIAS_DENY_INCOMING; + break; + case TOK_SAME_PORTS: + n->mode |= PKT_ALIAS_SAME_PORTS; + break; + case TOK_UNREG_ONLY: + n->mode |= PKT_ALIAS_UNREGISTERED_ONLY; + break; + case TOK_SKIP_GLOBAL: + n->mode |= PKT_ALIAS_SKIP_GLOBAL; + break; + case TOK_RESET_ADDR: + n->mode |= PKT_ALIAS_RESET_ON_ADDR_CHANGE; + break; + case TOK_ALIAS_REV: + n->mode |= PKT_ALIAS_REVERSE; + break; + case TOK_PROXY_ONLY: + n->mode |= PKT_ALIAS_PROXY_ONLY; + break; + /* + * All the setup_redir_* functions work directly in + * the final buffer, see above for details. + */ + case TOK_REDIR_ADDR: + case TOK_REDIR_PORT: + case TOK_REDIR_PROTO: + switch (tok) { + case TOK_REDIR_ADDR: + i = setup_redir_addr(&buf[off], &ac, &av); + break; + case TOK_REDIR_PORT: + i = setup_redir_port(&buf[off], &ac, &av); + break; + case TOK_REDIR_PROTO: + i = setup_redir_proto(&buf[off], &ac, &av); + break; + } + n->redir_cnt++; + off += i; + break; + } + } + + i = do_set3(IP_FW_NAT44_XCONFIG, &oh->opheader, len); + if (i != 0) + err(1, "setsockopt(%s)", "IP_FW_NAT44_XCONFIG"); + + if (!co.do_quiet) { + /* After every modification, we show the resultant rule. 
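+ * (i.e. the effect of 'ipfw nat <id> show config'; suppressed by -q.)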
*/ + int _ac = 3; + const char *_av[] = {"show", "config", id}; + ipfw_show_nat(_ac, (char **)(void *)_av); + } +} + +struct nat_list_arg { + uint16_t cmd; + int is_all; +}; + +static int +nat_show_data(struct nat44_cfg_nat *cfg, void *arg) +{ + struct nat_list_arg *nla; + ipfw_obj_header *oh; + + nla = (struct nat_list_arg *)arg; + + switch (nla->cmd) { + case IP_FW_NAT44_XGETCONFIG: + if (nat_get_cmd(cfg->name, nla->cmd, &oh) != 0) { + warnx("Error getting nat instance %s info", cfg->name); + break; + } + nat_show_cfg((struct nat44_cfg_nat *)(oh + 1), NULL); + free(oh); + break; + case IP_FW_NAT44_XGETLOG: + if (nat_get_cmd(cfg->name, nla->cmd, &oh) == 0) { + nat_show_log((struct nat44_cfg_nat *)(oh + 1), NULL); + free(oh); + break; + } + /* Handle error */ + if (nla->is_all != 0 && errno == ENOENT) + break; + warn("Error getting nat instance %s info", cfg->name); + break; + } + + return (0); +} + +/* + * Compare nat names. + * Honor number comparison. + */ +static int +natname_cmp(const void *a, const void *b) +{ + struct nat44_cfg_nat *ia, *ib; + + ia = (struct nat44_cfg_nat *)a; + ib = (struct nat44_cfg_nat *)b; + + return (stringnum_cmp(ia->name, ib->name)); +} + +/* + * Retrieves nat list from kernel, + * optionally sorts it and calls requested function for each table. + * Returns 0 on success. + */ +static int +nat_foreach(nat_cb_t *f, void *arg, int sort) +{ + ipfw_obj_lheader *olh; + struct nat44_cfg_nat *cfg; + size_t sz; + int i, error; + + /* Start with reasonable default */ + sz = sizeof(*olh) + 16 * sizeof(struct nat44_cfg_nat); + + for (;;) { + if ((olh = calloc(1, sz)) == NULL) + return (ENOMEM); + + olh->size = sz; + if (do_get3(IP_FW_NAT44_LIST_NAT, &olh->opheader, &sz) != 0) { + free(olh); + if (errno == ENOMEM) { + sz = olh->size; + continue; + } + return (errno); + } + + if (sort != 0) + qsort(olh + 1, olh->count, olh->objsize, natname_cmp); + + cfg = (struct nat44_cfg_nat*)(olh + 1); + for (i = 0; i < olh->count; i++) { + error = f(cfg, arg); /* Ignore errors for now */ + cfg = (struct nat44_cfg_nat *)((caddr_t)cfg + + olh->objsize); + } + + free(olh); + break; + } + + return (0); +} + +static int +nat_get_cmd(char *name, uint16_t cmd, ipfw_obj_header **ooh) +{ + ipfw_obj_header *oh; + struct nat44_cfg_nat *cfg; + size_t sz; + + /* Start with reasonable default */ + sz = sizeof(*oh) + sizeof(*cfg) + 128; + + for (;;) { + if ((oh = calloc(1, sz)) == NULL) + return (ENOMEM); + cfg = (struct nat44_cfg_nat *)(oh + 1); + oh->ntlv.head.length = sizeof(oh->ntlv); + strlcpy(oh->ntlv.name, name, sizeof(oh->ntlv.name)); + strlcpy(cfg->name, name, sizeof(cfg->name)); + + if (do_get3(cmd, &oh->opheader, &sz) != 0) { + sz = cfg->size; + free(oh); + if (errno == ENOMEM) + continue; + return (errno); + } + + *ooh = oh; + break; + } + + return (0); +} + +void +ipfw_show_nat(int ac, char **av) +{ + ipfw_obj_header *oh; + char *name; + int cmd; + struct nat_list_arg nla; + + ac--; + av++; + + if (co.test_only) + return; + + /* Parse parameters. */ + cmd = 0; /* XXX: Change to IP_FW_NAT44_XGETLOG @ MFC */ + name = NULL; + for ( ; ac != 0; ac--, av++) { + if (!strncmp(av[0], "config", strlen(av[0]))) { + cmd = IP_FW_NAT44_XGETCONFIG; + continue; + } + if (strcmp(av[0], "log") == 0) { + cmd = IP_FW_NAT44_XGETLOG; + continue; + } + if (name != NULL) + err(EX_USAGE,"only one instance name may be specified"); + name = av[0]; + } + + if (cmd == 0) + errx(EX_USAGE, "Please specify action. 
Available: config,log"); + + if (name == NULL) { + memset(&nla, 0, sizeof(nla)); + nla.cmd = cmd; + nla.is_all = 1; + nat_foreach(nat_show_data, &nla, 1); + } else { + if (nat_get_cmd(name, cmd, &oh) != 0) + err(EX_OSERR, "Error getting nat %s instance info", name); + nat_show_cfg((struct nat44_cfg_nat *)(oh + 1), NULL); + free(oh); + } +} + diff --git a/example/ipfw/ipfw/tables.c b/example/ipfw/ipfw/tables.c new file mode 100644 index 0000000..e75b59a --- /dev/null +++ b/example/ipfw/ipfw/tables.c @@ -0,0 +1,2013 @@ +/* + * Copyright (c) 2014 Yandex LLC + * Copyright (c) 2014 Alexander V. Chernikov + * + * Redistribution and use in source forms, with and without modification, + * are permitted provided that this entire comment appears intact. + * + * Redistribution in binary form may occur without any restrictions. + * Obviously, it would be nice if you gave credit where credit is due + * but requiring it would be too onerous. + * + * This software is provided ``AS IS'' without any warranties of any kind. + * + * in-kernel ipfw tables support. + * + * $FreeBSD: head/sbin/ipfw/tables.c 273241 2014-10-17 20:47:55Z melifaro $ + */ + + +#include +#include +#include +#include + +#include +#include +#include +#include +#include +#include +#include +#include + +#include +#include +#include +#include + +#include "ipfw2.h" + +static void table_modify_record(ipfw_obj_header *oh, int ac, char *av[], + int add, int quiet, int update, int atomic); +static int table_flush(ipfw_obj_header *oh); +static int table_destroy(ipfw_obj_header *oh); +static int table_do_create(ipfw_obj_header *oh, ipfw_xtable_info *i); +static int table_do_modify(ipfw_obj_header *oh, ipfw_xtable_info *i); +static int table_do_swap(ipfw_obj_header *oh, char *second); +static void table_create(ipfw_obj_header *oh, int ac, char *av[]); +static void table_modify(ipfw_obj_header *oh, int ac, char *av[]); +static void table_lookup(ipfw_obj_header *oh, int ac, char *av[]); +static void table_lock(ipfw_obj_header *oh, int lock); +static int table_swap(ipfw_obj_header *oh, char *second); +static int table_get_info(ipfw_obj_header *oh, ipfw_xtable_info *i); +static int table_show_info(ipfw_xtable_info *i, void *arg); +static void table_fill_ntlv(ipfw_obj_ntlv *ntlv, char *name, uint32_t set, + uint16_t uidx); + +static int table_flush_one(ipfw_xtable_info *i, void *arg); +static int table_show_one(ipfw_xtable_info *i, void *arg); +static int table_do_get_list(ipfw_xtable_info *i, ipfw_obj_header **poh); +static void table_show_list(ipfw_obj_header *oh, int need_header); +static void table_show_entry(ipfw_xtable_info *i, ipfw_obj_tentry *tent); + +static void tentry_fill_key(ipfw_obj_header *oh, ipfw_obj_tentry *tent, + char *key, int add, uint8_t *ptype, uint32_t *pvmask, ipfw_xtable_info *xi); +static void tentry_fill_value(ipfw_obj_header *oh, ipfw_obj_tentry *tent, + char *arg, uint8_t type, uint32_t vmask); +static void table_show_value(char *buf, size_t bufsize, ipfw_table_value *v, + uint32_t vmask, int print_ip); + +typedef int (table_cb_t)(ipfw_xtable_info *i, void *arg); +static int tables_foreach(table_cb_t *f, void *arg, int sort); + +#ifndef s6_addr32 +#define s6_addr32 __u6_addr.__u6_addr32 +#endif + +static struct _s_x tabletypes[] = { + { "addr", IPFW_TABLE_ADDR }, + { "iface", IPFW_TABLE_INTERFACE }, + { "number", IPFW_TABLE_NUMBER }, + { "flow", IPFW_TABLE_FLOW }, + { NULL, 0 } +}; + +static struct _s_x tablevaltypes[] = { + { "skipto", IPFW_VTYPE_SKIPTO }, + { "pipe", IPFW_VTYPE_PIPE }, + { "fib", IPFW_VTYPE_FIB }, + { 
"nat", IPFW_VTYPE_NAT }, + { "dscp", IPFW_VTYPE_DSCP }, + { "tag", IPFW_VTYPE_TAG }, + { "divert", IPFW_VTYPE_DIVERT }, + { "netgraph", IPFW_VTYPE_NETGRAPH }, + { "limit", IPFW_VTYPE_LIMIT }, + { "ipv4", IPFW_VTYPE_NH4 }, + { "ipv6", IPFW_VTYPE_NH6 }, + { NULL, 0 } +}; + +static struct _s_x tablecmds[] = { + { "add", TOK_ADD }, + { "delete", TOK_DEL }, + { "create", TOK_CREATE }, + { "destroy", TOK_DESTROY }, + { "flush", TOK_FLUSH }, + { "modify", TOK_MODIFY }, + { "swap", TOK_SWAP }, + { "info", TOK_INFO }, + { "detail", TOK_DETAIL }, + { "list", TOK_LIST }, + { "lookup", TOK_LOOKUP }, + { "atomic", TOK_ATOMIC }, + { "lock", TOK_LOCK }, + { "unlock", TOK_UNLOCK }, + { NULL, 0 } +}; + +static int +lookup_host (char *host, struct in_addr *ipaddr) +{ + struct hostent *he; + + if (!inet_aton(host, ipaddr)) { + if ((he = gethostbyname(host)) == NULL) + return(-1); + *ipaddr = *(struct in_addr *)he->h_addr_list[0]; + } + return(0); +} + +static int +get_token(struct _s_x *table, char *string, char *errbase) +{ + int tcmd; + + if ((tcmd = match_token_relaxed(table, string)) < 0) + errx(EX_USAGE, "%s %s %s", + (tcmd == 0) ? "invalid" : "ambiguous", errbase, string); + + return (tcmd); +} + +/* + * This one handles all table-related commands + * ipfw table NAME create ... + * ipfw table NAME modify ... + * ipfw table NAME destroy + * ipfw table NAME swap NAME + * ipfw table NAME lock + * ipfw table NAME unlock + * ipfw table NAME add addr[/masklen] [value] + * ipfw table NAME add [addr[/masklen] value] [addr[/masklen] value] .. + * ipfw table NAME delete addr[/masklen] [addr[/masklen]] .. + * ipfw table NAME lookup addr + * ipfw table {NAME | all} flush + * ipfw table {NAME | all} list + * ipfw table {NAME | all} info + * ipfw table {NAME | all} detail + */ +void +ipfw_table_handler(int ac, char *av[]) +{ + int do_add, is_all; + int atomic, error, tcmd; + ipfw_xtable_info i; + ipfw_obj_header oh; + char *tablename; + uint32_t set; + void *arg; + + memset(&oh, 0, sizeof(oh)); + is_all = 0; + if (co.use_set != 0) + set = co.use_set - 1; + else + set = 0; + + ac--; av++; + NEED1("table needs name"); + tablename = *av; + + if (table_check_name(tablename) == 0) { + table_fill_ntlv(&oh.ntlv, *av, set, 1); + oh.idx = 1; + } else { + if (strcmp(tablename, "all") == 0) + is_all = 1; + else + errx(EX_USAGE, "table name %s is invalid", tablename); + } + ac--; av++; + NEED1("table needs command"); + + tcmd = get_token(tablecmds, *av, "table command"); + /* Check if atomic operation was requested */ + atomic = 0; + if (tcmd == TOK_ATOMIC) { + ac--; av++; + NEED1("atomic needs command"); + tcmd = get_token(tablecmds, *av, "table command"); + switch (tcmd) { + case TOK_ADD: + break; + default: + errx(EX_USAGE, "atomic is not compatible with %s", *av); + } + atomic = 1; + } + + switch (tcmd) { + case TOK_LIST: + case TOK_INFO: + case TOK_DETAIL: + case TOK_FLUSH: + break; + default: + if (is_all != 0) + errx(EX_USAGE, "table name required"); + } + + switch (tcmd) { + case TOK_ADD: + case TOK_DEL: + do_add = **av == 'a'; + ac--; av++; + table_modify_record(&oh, ac, av, do_add, co.do_quiet, + co.do_quiet, atomic); + break; + case TOK_CREATE: + ac--; av++; + table_create(&oh, ac, av); + break; + case TOK_MODIFY: + ac--; av++; + table_modify(&oh, ac, av); + break; + case TOK_DESTROY: + if (table_destroy(&oh) != 0) + err(EX_OSERR, "failed to destroy table %s", tablename); + break; + case TOK_FLUSH: + if (is_all == 0) { + if ((error = table_flush(&oh)) != 0) + err(EX_OSERR, "failed to flush table %s info", + 
tablename); + } else { + error = tables_foreach(table_flush_one, &oh, 1); + if (error != 0) + err(EX_OSERR, "failed to flush tables list"); + } + break; + case TOK_SWAP: + ac--; av++; + NEED1("second table name required"); + table_swap(&oh, *av); + break; + case TOK_LOCK: + case TOK_UNLOCK: + table_lock(&oh, (tcmd == TOK_LOCK)); + break; + case TOK_DETAIL: + case TOK_INFO: + arg = (tcmd == TOK_DETAIL) ? (void *)1 : NULL; + if (is_all == 0) { + if ((error = table_get_info(&oh, &i)) != 0) + err(EX_OSERR, "failed to request table info"); + table_show_info(&i, arg); + } else { + error = tables_foreach(table_show_info, arg, 1); + if (error != 0) + err(EX_OSERR, "failed to request tables list"); + } + break; + case TOK_LIST: + if (is_all == 0) { + ipfw_xtable_info i; + if ((error = table_get_info(&oh, &i)) != 0) + err(EX_OSERR, "failed to request table info"); + table_show_one(&i, NULL); + } else { + error = tables_foreach(table_show_one, NULL, 1); + if (error != 0) + err(EX_OSERR, "failed to request tables list"); + } + break; + case TOK_LOOKUP: + ac--; av++; + table_lookup(&oh, ac, av); + break; + } +} + +static void +table_fill_ntlv(ipfw_obj_ntlv *ntlv, char *name, uint32_t set, uint16_t uidx) +{ + + ntlv->head.type = IPFW_TLV_TBL_NAME; + ntlv->head.length = sizeof(ipfw_obj_ntlv); + ntlv->idx = uidx; + ntlv->set = set; + strlcpy(ntlv->name, name, sizeof(ntlv->name)); +} + +static void +table_fill_objheader(ipfw_obj_header *oh, ipfw_xtable_info *i) +{ + + oh->idx = 1; + table_fill_ntlv(&oh->ntlv, i->tablename, i->set, 1); +} + +static struct _s_x tablenewcmds[] = { + { "type", TOK_TYPE }, + { "valtype", TOK_VALTYPE }, + { "algo", TOK_ALGO }, + { "limit", TOK_LIMIT }, + { "locked", TOK_LOCK }, + { NULL, 0 } +}; + +static struct _s_x flowtypecmds[] = { + { "src-ip", IPFW_TFFLAG_SRCIP }, + { "proto", IPFW_TFFLAG_PROTO }, + { "src-port", IPFW_TFFLAG_SRCPORT }, + { "dst-ip", IPFW_TFFLAG_DSTIP }, + { "dst-port", IPFW_TFFLAG_DSTPORT }, + { NULL, 0 } +}; + +int +table_parse_type(uint8_t ttype, char *p, uint8_t *tflags) +{ + uint32_t fset, fclear; + char *e; + + /* Parse type options */ + switch(ttype) { + case IPFW_TABLE_FLOW: + fset = fclear = 0; + if (fill_flags(flowtypecmds, p, &e, &fset, &fclear) != 0) + errx(EX_USAGE, + "unable to parse flow option %s", e); + *tflags = fset; + break; + default: + return (EX_USAGE); + } + + return (0); +} + +void +table_print_type(char *tbuf, size_t size, uint8_t type, uint8_t tflags) +{ + const char *tname; + int l; + + if ((tname = match_value(tabletypes, type)) == NULL) + tname = "unknown"; + + l = snprintf(tbuf, size, "%s", tname); + tbuf += l; + size -= l; + + switch(type) { + case IPFW_TABLE_FLOW: + if (tflags != 0) { + *tbuf++ = ':'; + l--; + print_flags_buffer(tbuf, size, flowtypecmds, tflags); + } + break; + } +} + +/* + * Creates new table + * + * ipfw table NAME create [ type { addr | iface | number | flow } ] + * [ algo algoname ] + */ +static void +table_create(ipfw_obj_header *oh, int ac, char *av[]) +{ + ipfw_xtable_info xi; + int error, tcmd, val; + uint32_t fset, fclear; + size_t sz; + char *e, *p; + char tbuf[128]; + + sz = sizeof(tbuf); + memset(&xi, 0, sizeof(xi)); + + while (ac > 0) { + tcmd = get_token(tablenewcmds, *av, "option"); + ac--; av++; + + switch (tcmd) { + case TOK_LIMIT: + NEED1("limit value required"); + xi.limit = strtol(*av, NULL, 10); + ac--; av++; + break; + case TOK_TYPE: + NEED1("table type required"); + /* Type may have suboptions after ':' */ + if ((p = strchr(*av, ':')) != NULL) + *p++ = '\0'; + val = 
match_token(tabletypes, *av); + if (val == -1) { + concat_tokens(tbuf, sizeof(tbuf), tabletypes, + ", "); + errx(EX_USAGE, + "Unknown tabletype: %s. Supported: %s", + *av, tbuf); + } + xi.type = val; + if (p != NULL) { + error = table_parse_type(val, p, &xi.tflags); + if (error != 0) + errx(EX_USAGE, + "Unsupported suboptions: %s", p); + } + ac--; av++; + break; + case TOK_VALTYPE: + NEED1("table value type required"); + fset = fclear = 0; + val = fill_flags(tablevaltypes, *av, &e, &fset, &fclear); + if (val != -1) { + xi.vmask = fset; + ac--; av++; + break; + } + concat_tokens(tbuf, sizeof(tbuf), tablevaltypes, ", "); + errx(EX_USAGE, "Unknown value type: %s. Supported: %s", + e, tbuf); + break; + case TOK_ALGO: + NEED1("table algorithm name required"); + if (strlen(*av) > sizeof(xi.algoname)) + errx(EX_USAGE, "algorithm name too long"); + strlcpy(xi.algoname, *av, sizeof(xi.algoname)); + ac--; av++; + break; + case TOK_LOCK: + xi.flags |= IPFW_TGFLAGS_LOCKED; + break; + } + } + + /* Set some defaults to preserve compatibility */ + if (xi.algoname[0] == '\0' && xi.type == 0) + xi.type = IPFW_TABLE_ADDR; + if (xi.vmask == 0) + xi.vmask = IPFW_VTYPE_LEGACY; + + if ((error = table_do_create(oh, &xi)) != 0) + err(EX_OSERR, "Table creation failed"); +} + +/* + * Creates new table + * + * Request: [ ipfw_obj_header ipfw_xtable_info ] + * + * Returns 0 on success. + */ +static int +table_do_create(ipfw_obj_header *oh, ipfw_xtable_info *i) +{ + char tbuf[sizeof(ipfw_obj_header) + sizeof(ipfw_xtable_info)]; + int error; + + memcpy(tbuf, oh, sizeof(*oh)); + memcpy(tbuf + sizeof(*oh), i, sizeof(*i)); + oh = (ipfw_obj_header *)tbuf; + + error = do_set3(IP_FW_TABLE_XCREATE, &oh->opheader, sizeof(tbuf)); + + return (error); +} + +/* + * Modifies existing table + * + * ipfw table NAME modify [ limit number ] + */ +static void +table_modify(ipfw_obj_header *oh, int ac, char *av[]) +{ + ipfw_xtable_info xi; + int tcmd; + size_t sz; + char tbuf[128]; + + sz = sizeof(tbuf); + memset(&xi, 0, sizeof(xi)); + + while (ac > 0) { + tcmd = get_token(tablenewcmds, *av, "option"); + ac--; av++; + + switch (tcmd) { + case TOK_LIMIT: + NEED1("limit value required"); + xi.limit = strtol(*av, NULL, 10); + xi.mflags |= IPFW_TMFLAGS_LIMIT; + ac--; av++; + break; + default: + errx(EX_USAGE, "cmd is not supported for modification"); + } + } + + if (table_do_modify(oh, &xi) != 0) + err(EX_OSERR, "Table modification failed"); +} + +/* + * Modifies existing table. + * + * Request: [ ipfw_obj_header ipfw_xtable_info ] + * + * Returns 0 on success. + */ +static int +table_do_modify(ipfw_obj_header *oh, ipfw_xtable_info *i) +{ + char tbuf[sizeof(ipfw_obj_header) + sizeof(ipfw_xtable_info)]; + int error; + + memcpy(tbuf, oh, sizeof(*oh)); + memcpy(tbuf + sizeof(*oh), i, sizeof(*i)); + oh = (ipfw_obj_header *)tbuf; + + error = do_set3(IP_FW_TABLE_XMODIFY, &oh->opheader, sizeof(tbuf)); + + return (error); +} + +/* + * Locks or unlocks given table + */ +static void +table_lock(ipfw_obj_header *oh, int lock) +{ + ipfw_xtable_info xi; + + memset(&xi, 0, sizeof(xi)); + + xi.mflags |= IPFW_TMFLAGS_LOCK; + xi.flags |= (lock != 0) ? IPFW_TGFLAGS_LOCKED : 0; + + if (table_do_modify(oh, &xi) != 0) + err(EX_OSERR, "Table %s failed", lock != 0 ? "lock" : "unlock"); +} + +/* + * Destroys given table specified by @oh->ntlv. + * Returns 0 on success.
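+ *
+ * Editor's illustrative sketch, not part of the imported netmap-ipfw
+ * sources (the table name is made up): destroying a table is a single
+ * do_set3() call on a plain object header, mirroring what
+ * ipfw_table_handler() prepares above:
+ *
+ *	ipfw_obj_header oh;
+ *
+ *	memset(&oh, 0, sizeof(oh));
+ *	table_fill_ntlv(&oh.ntlv, "badhosts", 0, 1);
+ *	oh.idx = 1;
+ *	error = do_set3(IP_FW_TABLE_XDESTROY, &oh.opheader, sizeof(oh));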
+ */ +static int +table_destroy(ipfw_obj_header *oh) +{ + + if (do_set3(IP_FW_TABLE_XDESTROY, &oh->opheader, sizeof(*oh)) != 0) + return (-1); + + return (0); +} + +/* + * Flushes given table specified by @oh->ntlv. + * Returns 0 on success. + */ +static int +table_flush(ipfw_obj_header *oh) +{ + + if (do_set3(IP_FW_TABLE_XFLUSH, &oh->opheader, sizeof(*oh)) != 0) + return (-1); + + return (0); +} + +static int +table_do_swap(ipfw_obj_header *oh, char *second) +{ + char tbuf[sizeof(ipfw_obj_header) + sizeof(ipfw_obj_ntlv)]; + int error; + + memset(tbuf, 0, sizeof(tbuf)); + memcpy(tbuf, oh, sizeof(*oh)); + oh = (ipfw_obj_header *)tbuf; + table_fill_ntlv((ipfw_obj_ntlv *)(oh + 1), second, oh->ntlv.set, 1); + + error = do_set3(IP_FW_TABLE_XSWAP, &oh->opheader, sizeof(tbuf)); + + return (error); +} + +/* + * Swaps given table with @second one. + */ +static int +table_swap(ipfw_obj_header *oh, char *second) +{ + int error; + + if (table_check_name(second) != 0) + errx(EX_USAGE, "table name %s is invalid", second); + + error = table_do_swap(oh, second); + + switch (error) { + case EINVAL: + errx(EX_USAGE, "Unable to swap table: check types"); + case EFBIG: + errx(EX_USAGE, "Unable to swap table: check limits"); + } + + return (0); +} + + +/* + * Retrieves info for the table specified by @oh->ntlv and stores + * it inside @i. + * Returns 0 on success. + */ +static int +table_get_info(ipfw_obj_header *oh, ipfw_xtable_info *i) +{ + char tbuf[sizeof(ipfw_obj_header) + sizeof(ipfw_xtable_info)]; + size_t sz; + + sz = sizeof(tbuf); + memset(tbuf, 0, sizeof(tbuf)); + memcpy(tbuf, oh, sizeof(*oh)); + oh = (ipfw_obj_header *)tbuf; + + if (do_get3(IP_FW_TABLE_XINFO, &oh->opheader, &sz) != 0) + return (errno); + + if (sz < sizeof(tbuf)) + return (EINVAL); + + *i = *(ipfw_xtable_info *)(oh + 1); + + return (0); +} + +static struct _s_x tablealgoclass[] = { + { "hash", IPFW_TACLASS_HASH }, + { "array", IPFW_TACLASS_ARRAY }, + { "radix", IPFW_TACLASS_RADIX }, + { NULL, 0 } +}; + +struct ta_cldata { + uint8_t taclass; + uint8_t spare4; + uint16_t itemsize; + uint16_t itemsize6; + uint32_t size; + uint32_t count; +}; + +/* + * Print global/per-AF table @i algorithm info. + */ +static void +table_show_tainfo(ipfw_xtable_info *i, struct ta_cldata *d, + const char *af, const char *taclass) +{ + + switch (d->taclass) { + case IPFW_TACLASS_HASH: + case IPFW_TACLASS_ARRAY: + printf(" %salgorithm %s info\n", af, taclass); + if (d->itemsize == d->itemsize6) + printf(" size: %u items: %u itemsize: %u\n", + d->size, d->count, d->itemsize); + else + printf(" size: %u items: %u " + "itemsize4: %u itemsize6: %u\n", + d->size, d->count, + d->itemsize, d->itemsize6); + break; + case IPFW_TACLASS_RADIX: + printf(" %salgorithm %s info\n", af, taclass); + if (d->itemsize == d->itemsize6) + printf(" items: %u itemsize: %u\n", + d->count, d->itemsize); + else + printf(" items: %u " + "itemsize4: %u itemsize6: %u\n", + d->count, d->itemsize, d->itemsize6); + break; + default: + printf(" algo class: %s\n", taclass); + } +} + +static void +table_print_valheader(char *buf, size_t bufsize, uint32_t vmask) +{ + + if (vmask == IPFW_VTYPE_LEGACY) { + snprintf(buf, bufsize, "legacy"); + return; + } + + print_flags_buffer(buf, bufsize, tablevaltypes, vmask); +} + +/* + * Prints table info struct @i in human-readable form.
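+ *
+ * Editor's illustration, not part of the imported sources; the values
+ * (and the algorithm name) are invented, but the format strings are
+ * exactly the ones used in the function body below:
+ *
+ *	--- table(ex0), set(0) ---
+ *	 kindex: 4, type: addr
+ *	 references: 1, valtype: legacy
+ *	 algorithm: addr:radix
+ *	 items: 2, size: 296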
+ */ +static int +table_show_info(ipfw_xtable_info *i, void *arg) +{ + const char *vtype; + ipfw_ta_tinfo *tainfo; + int afdata, afitem; + struct ta_cldata d; + char ttype[64], tvtype[64]; + + table_print_type(ttype, sizeof(ttype), i->type, i->tflags); + table_print_valheader(tvtype, sizeof(tvtype), i->vmask); + + printf("--- table(%s), set(%u) ---\n", i->tablename, i->set); + if ((i->flags & IPFW_TGFLAGS_LOCKED) != 0) + printf(" kindex: %d, type: %s, locked\n", i->kidx, ttype); + else + printf(" kindex: %d, type: %s\n", i->kidx, ttype); + printf(" references: %u, valtype: %s\n", i->refcnt, tvtype); + printf(" algorithm: %s\n", i->algoname); + printf(" items: %u, size: %u\n", i->count, i->size); + if (i->limit > 0) + printf(" limit: %u\n", i->limit); + + /* Print algo-specific info if requested & set */ + if (arg == NULL) + return (0); + + if ((i->ta_info.flags & IPFW_TATFLAGS_DATA) == 0) + return (0); + tainfo = &i->ta_info; + + afdata = 0; + afitem = 0; + if (tainfo->flags & IPFW_TATFLAGS_AFDATA) + afdata = 1; + if (tainfo->flags & IPFW_TATFLAGS_AFITEM) + afitem = 1; + + memset(&d, 0, sizeof(d)); + d.taclass = tainfo->taclass4; + d.size = tainfo->size4; + d.count = tainfo->count4; + d.itemsize = tainfo->itemsize4; + if (afdata == 0 && afitem != 0) + d.itemsize6 = tainfo->itemsize6; + else + d.itemsize6 = d.itemsize; + if ((vtype = match_value(tablealgoclass, d.taclass)) == NULL) + vtype = "unknown"; + + if (afdata == 0) { + table_show_tainfo(i, &d, "", vtype); + } else { + table_show_tainfo(i, &d, "IPv4 ", vtype); + memset(&d, 0, sizeof(d)); + d.taclass = tainfo->taclass6; + if ((vtype = match_value(tablealgoclass, d.taclass)) == NULL) + vtype = "unknown"; + d.size = tainfo->size6; + d.count = tainfo->count6; + d.itemsize = tainfo->itemsize6; + d.itemsize6 = d.itemsize; + table_show_tainfo(i, &d, "IPv6 ", vtype); + } + + return (0); +} + + +/* + * Function wrappers which can be used either + * as is or as foreach function parameter. 
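+ *
+ * (Editor's note: both wrappers match the table_cb_t callback type,
+ * which is how "ipfw table all list" and "ipfw table all flush" are
+ * implemented above, e.g. tables_foreach(table_show_one, NULL, 1) and
+ * tables_foreach(table_flush_one, &oh, 1).)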
+ */ + +static int +table_show_one(ipfw_xtable_info *i, void *arg) +{ + ipfw_obj_header *oh = NULL; // XXX uninitialized + int error; + + if ((error = table_do_get_list(i, &oh)) != 0) { + err(EX_OSERR, "Error requesting table %s list", i->tablename); + return (error); + } + + table_show_list(oh, 1); + + free(oh); + return (0); +} + +static int +table_flush_one(ipfw_xtable_info *i, void *arg) +{ + ipfw_obj_header *oh; + + oh = (ipfw_obj_header *)arg; + + table_fill_ntlv(&oh->ntlv, i->tablename, i->set, 1); + + return (table_flush(oh)); +} + +static int +table_do_modify_record(int cmd, ipfw_obj_header *oh, + ipfw_obj_tentry *tent, int count, int atomic) +{ + ipfw_obj_ctlv *ctlv; + ipfw_obj_tentry *tent_base; + caddr_t pbuf; + char xbuf[sizeof(*oh) + sizeof(ipfw_obj_ctlv) + sizeof(*tent)]; + int error, i; + size_t sz; + + sz = sizeof(*ctlv) + sizeof(*tent) * count; + if (count == 1) { + memset(xbuf, 0, sizeof(xbuf)); + pbuf = xbuf; + } else { + if ((pbuf = calloc(1, sizeof(*oh) + sz)) == NULL) + return (ENOMEM); + } + + memcpy(pbuf, oh, sizeof(*oh)); + oh = (ipfw_obj_header *)pbuf; + oh->opheader.version = 1; + + ctlv = (ipfw_obj_ctlv *)(oh + 1); + ctlv->count = count; + ctlv->head.length = sz; + if (atomic != 0) + ctlv->flags |= IPFW_CTF_ATOMIC; + + tent_base = tent; + memcpy(ctlv + 1, tent, sizeof(*tent) * count); + tent = (ipfw_obj_tentry *)(ctlv + 1); + for (i = 0; i < count; i++, tent++) { + tent->head.length = sizeof(ipfw_obj_tentry); + tent->idx = oh->idx; + } + + sz += sizeof(*oh); + error = do_get3(cmd, &oh->opheader, &sz); + tent = (ipfw_obj_tentry *)(ctlv + 1); + /* Copy result back to provided buffer */ + memcpy(tent_base, ctlv + 1, sizeof(*tent) * count); + + if (pbuf != xbuf) + free(pbuf); + + return (error); +} + +static void +table_modify_record(ipfw_obj_header *oh, int ac, char *av[], int add, + int quiet, int update, int atomic) +{ + ipfw_obj_tentry *ptent, tent, *tent_buf; + ipfw_xtable_info xi; + uint8_t type; + uint32_t vmask; + int cmd, count, error, i, ignored; + char *texterr, *etxt, *px; + + if (ac == 0) + errx(EX_USAGE, "address required"); + + if (add != 0) { + cmd = IP_FW_TABLE_XADD; + texterr = "Adding record failed"; + } else { + cmd = IP_FW_TABLE_XDEL; + texterr = "Deleting record failed"; + } + + /* + * Calculate number of entries: + * Assume [key val] x N for add + * and + * key x N for delete + */ + count = (add != 0) ? ac / 2 + 1 : ac; + + if (count <= 1) { + /* Adding single entry with/without value */ + memset(&tent, 0, sizeof(tent)); + tent_buf = &tent; + } else { + + if ((tent_buf = calloc(count, sizeof(tent))) == NULL) + errx(EX_OSERR, + "Unable to allocate memory for all entries"); + } + ptent = tent_buf; + + memset(&xi, 0, sizeof(xi)); + count = 0; + while (ac > 0) { + tentry_fill_key(oh, ptent, *av, add, &type, &vmask, &xi); + + /* + * compatibility layer: auto-create table if it does not exist + */ + if (xi.tablename[0] == '\0') { + xi.type = type; + xi.vmask = vmask; + strlcpy(xi.tablename, oh->ntlv.name, + sizeof(xi.tablename)); + fprintf(stderr, "DEPRECATED: inserting data into " + "non-existent table %s. 
(auto-created)\n", + xi.tablename); + table_do_create(oh, &xi); + } + + oh->ntlv.type = type; + ac--; av++; + + if (add != 0 && ac > 0) { + tentry_fill_value(oh, ptent, *av, type, vmask); + ac--; av++; + } + + if (update != 0) + ptent->head.flags |= IPFW_TF_UPDATE; + + count++; + ptent++; + } + + error = table_do_modify_record(cmd, oh, tent_buf, count, atomic); + + quiet = 0; + + /* + * Compatibility stuff: do not yell on duplicate keys or + * failed deletions. + */ + if (error == 0 || (error == EEXIST && add != 0) || + (error == ENOENT && add == 0)) { + if (quiet != 0) { + if (tent_buf != &tent) + free(tent_buf); + return; + } + } + + /* Report results back */ + ptent = tent_buf; + for (i = 0; i < count; ptent++, i++) { + ignored = 0; + switch (ptent->result) { + case IPFW_TR_ADDED: + px = "added"; + break; + case IPFW_TR_DELETED: + px = "deleted"; + break; + case IPFW_TR_UPDATED: + px = "updated"; + break; + case IPFW_TR_LIMIT: + px = "limit"; + ignored = 1; + break; + case IPFW_TR_ERROR: + px = "error"; + ignored = 1; + break; + case IPFW_TR_NOTFOUND: + px = "notfound"; + ignored = 1; + break; + case IPFW_TR_EXISTS: + px = "exists"; + ignored = 1; + break; + case IPFW_TR_IGNORED: + px = "ignored"; + ignored = 1; + break; + default: + px = "unknown"; + ignored = 1; + } + + if (error != 0 && atomic != 0 && ignored == 0) + printf("%s(reverted): ", px); + else + printf("%s: ", px); + + table_show_entry(&xi, ptent); + } + + if (tent_buf != &tent) + free(tent_buf); + + if (error == 0) + return; + /* Get real OS error */ + error = errno; + + /* Try to provide more human-readable error */ + switch (error) { + case EEXIST: + etxt = "record already exists"; + break; + case EFBIG: + etxt = "limit hit"; + break; + case ESRCH: + etxt = "table not found"; + break; + case ENOENT: + etxt = "record not found"; + break; + case EACCES: + etxt = "table is locked"; + break; + default: + etxt = strerror(error); + } + + errx(EX_OSERR, "%s: %s", texterr, etxt); +} + +static int +table_do_lookup(ipfw_obj_header *oh, char *key, ipfw_xtable_info *xi, + ipfw_obj_tentry *xtent) +{ + char xbuf[sizeof(ipfw_obj_header) + sizeof(ipfw_obj_tentry)]; + ipfw_obj_tentry *tent; + uint8_t type; + uint32_t vmask; + size_t sz; + + memcpy(xbuf, oh, sizeof(*oh)); + oh = (ipfw_obj_header *)xbuf; + tent = (ipfw_obj_tentry *)(oh + 1); + + memset(tent, 0, sizeof(*tent)); + tent->head.length = sizeof(*tent); + tent->idx = 1; + + tentry_fill_key(oh, tent, key, 0, &type, &vmask, xi); + oh->ntlv.type = type; + + sz = sizeof(xbuf); + if (do_get3(IP_FW_TABLE_XFIND, &oh->opheader, &sz) != 0) + return (errno); + + if (sz < sizeof(xbuf)) + return (EINVAL); + + *xtent = *tent; + + return (0); +} + +static void +table_lookup(ipfw_obj_header *oh, int ac, char *av[]) +{ + ipfw_obj_tentry xtent; + ipfw_xtable_info xi; + char key[64]; + int error; + + if (ac == 0) + errx(EX_USAGE, "address required"); + + strlcpy(key, *av, sizeof(key)); + + memset(&xi, 0, sizeof(xi)); + error = table_do_lookup(oh, key, &xi, &xtent); + + switch (error) { + case 0: + break; + case ESRCH: + errx(EX_UNAVAILABLE, "Table %s not found", oh->ntlv.name); + case ENOENT: + errx(EX_UNAVAILABLE, "Entry %s not found", *av); + case ENOTSUP: + errx(EX_UNAVAILABLE, "Table %s algo does not support " + "\"lookup\" method", oh->ntlv.name); + default: + err(EX_OSERR, "getsockopt(IP_FW_TABLE_XFIND)"); + } + + table_show_entry(&xi, &xtent); +} + +static void +tentry_fill_key_type(char *arg, ipfw_obj_tentry *tentry, uint8_t type, + uint8_t tflags) +{ + char *p, *pp; + int mask, af; + 
struct in6_addr *paddr, tmp; + struct tflow_entry *tfe; + uint32_t key, *pkey; + uint16_t port; + struct protoent *pent; + struct servent *sent; + int masklen; + + mask = 0; // XXX uninitialized ? + masklen = 0; + af = 0; + paddr = (struct in6_addr *)&tentry->k; + + switch (type) { + case IPFW_TABLE_ADDR: + /* Remove / if exists */ + if ((p = strchr(arg, '/')) != NULL) { + *p = '\0'; + mask = atoi(p + 1); + } + + if (inet_pton(AF_INET, arg, paddr) == 1) { + if (p != NULL && mask > 32) + errx(EX_DATAERR, "bad IPv4 mask width: %s", + p + 1); + + masklen = p ? mask : 32; + af = AF_INET; + } else if (inet_pton(AF_INET6, arg, paddr) == 1) { + if (IN6_IS_ADDR_V4COMPAT(paddr)) + errx(EX_DATAERR, + "Use IPv4 instead of v4-compatible"); + if (p != NULL && mask > 128) + errx(EX_DATAERR, "bad IPv6 mask width: %s", + p + 1); + + masklen = p ? mask : 128; + af = AF_INET6; + } else { + /* Assume FQDN */ + if (lookup_host(arg, (struct in_addr *)paddr) != 0) + errx(EX_NOHOST, "hostname ``%s'' unknown", arg); + + masklen = 32; + type = IPFW_TABLE_ADDR; + af = AF_INET; + } + break; + case IPFW_TABLE_INTERFACE: + /* Assume interface name. Copy significant data only */ + mask = MIN(strlen(arg), IF_NAMESIZE - 1); + memcpy(paddr, arg, mask); + /* Set mask to exact match */ + masklen = 8 * IF_NAMESIZE; + break; + case IPFW_TABLE_NUMBER: + /* Port or any other key */ + key = strtol(arg, &p, 10); + if (*p != '\0') + errx(EX_DATAERR, "Invalid number: %s", arg); + + pkey = (uint32_t *)paddr; + *pkey = key; + masklen = 32; + break; + case IPFW_TABLE_FLOW: + /* Assume [src-ip][,proto][,src-port][,dst-ip][,dst-port] */ + tfe = &tentry->k.flow; + af = 0; + + /* Handle <src-ip> */ + if ((tflags & IPFW_TFFLAG_SRCIP) != 0) { + if ((p = strchr(arg, ',')) != NULL) + *p++ = '\0'; + /* Determine family using temporary storage */ + if (inet_pton(AF_INET, arg, &tmp) == 1) { + if (af != 0 && af != AF_INET) + errx(EX_DATAERR, + "Inconsistent address family\n"); + af = AF_INET; + memcpy(&tfe->a.a4.sip, &tmp, 4); + } else if (inet_pton(AF_INET6, arg, &tmp) == 1) { + if (af != 0 && af != AF_INET6) + errx(EX_DATAERR, + "Inconsistent address family\n"); + af = AF_INET6; + memcpy(&tfe->a.a6.sip6, &tmp, 16); + } + + arg = p; + } + + /* Handle <proto> */ + if ((tflags & IPFW_TFFLAG_PROTO) != 0) { + if (arg == NULL) + errx(EX_DATAERR, "invalid key: proto missing"); + if ((p = strchr(arg, ',')) != NULL) + *p++ = '\0'; + + key = strtol(arg, &pp, 10); + if (*pp != '\0') { + if ((pent = getprotobyname(arg)) == NULL) + errx(EX_DATAERR, "Unknown proto: %s", + arg); + else + key = pent->p_proto; + } + + if (key > 255) + errx(EX_DATAERR, "Bad protocol number: %u",key); + + tfe->proto = key; + + arg = p; + } + + /* Handle <src-port> */ + if ((tflags & IPFW_TFFLAG_SRCPORT) != 0) { + if (arg == NULL) + errx(EX_DATAERR, "invalid key: src port missing"); + if ((p = strchr(arg, ',')) != NULL) + *p++ = '\0'; + + if ((port = htons(strtol(arg, NULL, 10))) == 0) { + if ((sent = getservbyname(arg, NULL)) == NULL) + errx(EX_DATAERR, "Unknown service: %s", + arg); + else + key = sent->s_port; + } + + tfe->sport = port; + + arg = p; + } + + /* Handle <dst-ip> */ + if ((tflags & IPFW_TFFLAG_DSTIP) != 0) { + if (arg == NULL) + errx(EX_DATAERR, "invalid key: dst ip missing"); + if ((p = strchr(arg, ',')) != NULL) + *p++ = '\0'; + /* Determine family using temporary storage */ + if (inet_pton(AF_INET, arg, &tmp) == 1) { + if (af != 0 && af != AF_INET) + errx(EX_DATAERR, + "Inconsistent address family"); + af = AF_INET; + memcpy(&tfe->a.a4.dip, &tmp, 4); + } else if (inet_pton(AF_INET6, arg, &tmp) == 
1) { + if (af != 0 && af != AF_INET6) + errx(EX_DATAERR, + "Inconsistent address family"); + af = AF_INET6; + memcpy(&tfe->a.a6.dip6, &tmp, 16); + } + + arg = p; + } + + /* Handle <dst-port> */ + if ((tflags & IPFW_TFFLAG_DSTPORT) != 0) { + if (arg == NULL) + errx(EX_DATAERR, "invalid key: dst port missing"); + if ((p = strchr(arg, ',')) != NULL) + *p++ = '\0'; + + if ((port = htons(strtol(arg, NULL, 10))) == 0) { + if ((sent = getservbyname(arg, NULL)) == NULL) + errx(EX_DATAERR, "Unknown service: %s", + arg); + else + key = sent->s_port; + } + + tfe->dport = port; + + arg = p; + } + + tfe->af = af; + + break; + + default: + errx(EX_DATAERR, "Unsupported table type: %d", type); + } + + tentry->subtype = af; + tentry->masklen = masklen; +} + +static void +tentry_fill_key(ipfw_obj_header *oh, ipfw_obj_tentry *tent, char *key, + int add, uint8_t *ptype, uint32_t *pvmask, ipfw_xtable_info *xi) +{ + uint8_t type, tflags; + uint32_t vmask; + int error; + char *del; + + type = 0; + tflags = 0; + vmask = 0; + + if (xi->tablename[0] == '\0') + error = table_get_info(oh, xi); + else + error = 0; + + if (error == 0) { + /* Table found. */ + type = xi->type; + tflags = xi->tflags; + vmask = xi->vmask; + } else { + if (error != ESRCH) + errx(EX_OSERR, "Error requesting table %s info", + oh->ntlv.name); + if (add == 0) + errx(EX_DATAERR, "Table %s does not exist", + oh->ntlv.name); + /* + * Table does not exist. + * Compatibility layer: try to interpret data as ADDR + * before failing. + */ + if ((del = strchr(key, '/')) != NULL) + *del = '\0'; + if (inet_pton(AF_INET, key, &tent->k.addr6) == 1 || + inet_pton(AF_INET6, key, &tent->k.addr6) == 1) { + /* OK Prepare and send */ + type = IPFW_TABLE_ADDR; + vmask = IPFW_VTYPE_LEGACY; + } else { + /* Unknown key */ + errx(EX_USAGE, "Table %s does not exist, cannot guess " + "key '%s' type", oh->ntlv.name, key); + } + if (del != NULL) + *del = '/'; + } + + tentry_fill_key_type(key, tent, type, tflags); + + *ptype = type; + *pvmask = vmask; +} + +static void +set_legacy_value(uint32_t val, ipfw_table_value *v) +{ + v->tag = val; + v->pipe = val; + v->divert = val; + v->skipto = val; + v->netgraph = val; + v->fib = val; + v->nat = val; + v->nh4 = val; + v->dscp = (uint8_t)val; + v->limit = val; +} + +static void +tentry_fill_value(ipfw_obj_header *oh, ipfw_obj_tentry *tent, char *arg, + uint8_t type, uint32_t vmask) +{ + uint32_t a4, flag, val, vm; + ipfw_table_value *v; + uint32_t i; + int dval; + char *comma, *e, *etype, *n, *p; + + v = &tent->v.value; + vm = vmask; + + /* Compat layer: keep old behavior for legacy value types */ + if (vmask == IPFW_VTYPE_LEGACY) { + /* Try to interpret as number first */ + val = strtoul(arg, &p, 0); + if (*p == '\0') { + set_legacy_value(val, v); + return; + } + if (inet_pton(AF_INET, arg, &val) == 1) { + set_legacy_value(ntohl(val), v); + return; + } + /* Try hostname */ + if (lookup_host(arg, (struct in_addr *)&val) == 0) { + set_legacy_value(val, v); + return; + } + errx(EX_OSERR, "Unable to parse value %s", arg); + } + + /* + * Shorthands: handle single value if vmask consists + * of numbers only. 
e.g.: + * vmask = "fib,skipto" -> treat input "1" as "1,1" + */ + + n = arg; + etype = NULL; + for (i = 1; i < (1 << 31); i *= 2) { + if ((flag = (vmask & i)) == 0) + continue; + vmask &= ~flag; + + if ((comma = strchr(n, ',')) != NULL) + *comma = '\0'; + + switch (flag) { + case IPFW_VTYPE_TAG: + v->tag = strtol(n, &e, 10); + if (*e != '\0') + etype = "tag"; + break; + case IPFW_VTYPE_PIPE: + v->pipe = strtol(n, &e, 10); + if (*e != '\0') + etype = "pipe"; + break; + case IPFW_VTYPE_DIVERT: + v->divert = strtol(n, &e, 10); + if (*e != '\0') + etype = "divert"; + break; + case IPFW_VTYPE_SKIPTO: + v->skipto = strtol(n, &e, 10); + if (*e != '\0') + etype = "skipto"; + break; + case IPFW_VTYPE_NETGRAPH: + v->netgraph = strtol(n, &e, 10); + if (*e != '\0') + etype = "netgraph"; + break; + case IPFW_VTYPE_FIB: + v->fib = strtol(n, &e, 10); + if (*e != '\0') + etype = "fib"; + break; + case IPFW_VTYPE_NAT: + v->nat = strtol(n, &e, 10); + if (*e != '\0') + etype = "nat"; + break; + case IPFW_VTYPE_LIMIT: + v->limit = strtol(n, &e, 10); + if (*e != '\0') + etype = "limit"; + break; + case IPFW_VTYPE_NH4: + if (strchr(n, '.') != NULL && + inet_pton(AF_INET, n, &a4) == 1) { + v->nh4 = ntohl(a4); + break; + } + if (lookup_host(n, (struct in_addr *)&v->nh4) == 0) + break; + etype = "ipv4"; + break; + case IPFW_VTYPE_DSCP: + if (isalpha(*n)) { + if ((dval = match_token(f_ipdscp, n)) != -1) { + v->dscp = dval; + break; + } else + etype = "DSCP code"; + } else { + v->dscp = strtol(n, &e, 10); + if (v->dscp > 63 || *e != '\0') + etype = "DSCP value"; + } + break; + case IPFW_VTYPE_NH6: + if (strchr(n, ':') != NULL && + inet_pton(AF_INET6, n, &v->nh6) == 1) + break; + etype = "ipv6"; + break; + } + + if (etype != NULL) + errx(EX_USAGE, "Unable to parse %s as %s", n, etype); + + if (comma != NULL) + *comma++ = ','; + + if ((n = comma) != NULL) + continue; + + /* End of input. */ + if (vmask != 0) + errx(EX_USAGE, "Not enough fields inside value"); + } +} + +/* + * Compare table names. + * Honor number comparison. + */ +static int +tablename_cmp(const void *a, const void *b) +{ + ipfw_xtable_info *ia, *ib; + + ia = (ipfw_xtable_info *)a; + ib = (ipfw_xtable_info *)b; + + return (stringnum_cmp(ia->tablename, ib->tablename)); +} + +/* + * Retrieves table list from kernel, + * optionally sorts it and calls requested function for each table. + * Returns 0 on success. + */ +static int +tables_foreach(table_cb_t *f, void *arg, int sort) +{ + ipfw_obj_lheader *olh; + ipfw_xtable_info *info; + size_t sz; + int i, error; + + /* Start with reasonable default */ + sz = sizeof(*olh) + 16 * sizeof(ipfw_xtable_info); + + for (;;) { + if ((olh = calloc(1, sz)) == NULL) + return (ENOMEM); + + olh->size = sz; + if (do_get3(IP_FW_TABLES_XLIST, &olh->opheader, &sz) != 0) { + sz = olh->size; + free(olh); + if (errno != ENOMEM) + return (errno); + continue; + } + + if (sort != 0) + qsort(olh + 1, olh->count, olh->objsize, tablename_cmp); + + info = (ipfw_xtable_info *)(olh + 1); + for (i = 0; i < olh->count; i++) { + error = f(info, arg); /* Ignore errors for now */ + info = (ipfw_xtable_info *)((caddr_t)info + olh->objsize); + } + + free(olh); + break; + } + + return (0); +} + + +/* + * Retrieves all entries for given table @i in + * eXtended format. Allocate buffer large enough + * to store result. Caller needs to free it later. + * + * Returns 0 on success. 
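+ *
+ * Editor's note, not part of the imported sources: as table_show_list()
+ * below unpacks it, the reply buffer is laid out as
+ *
+ *	[ ipfw_obj_header | ipfw_xtable_info | ipfw_obj_tentry x count ]
+ *
+ * which is why the buffer is sized from i->size plus a little slack
+ * and the request is retried while the kernel keeps returning ENOMEM.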
+ */ +static int +table_do_get_list(ipfw_xtable_info *i, ipfw_obj_header **poh) +{ + ipfw_obj_header *oh; + size_t sz; + int c; + + sz = 0; + oh = NULL; + for (c = 0; c < 8; c++) { + if (sz < i->size) + sz = i->size + 44; + if (oh != NULL) + free(oh); + if ((oh = calloc(1, sz)) == NULL) + continue; + table_fill_objheader(oh, i); + oh->opheader.version = 1; /* Current version */ + if (do_get3(IP_FW_TABLE_XLIST, &oh->opheader, &sz) == 0) { + *poh = oh; + return (0); + } + + if (errno != ENOMEM) + break; + } + free(oh); + + return (errno); +} + +/* + * Shows all entries from @oh in human-readable format + */ +static void +table_show_list(ipfw_obj_header *oh, int need_header) +{ + ipfw_obj_tentry *tent; + uint32_t count; + ipfw_xtable_info *i; + + i = (ipfw_xtable_info *)(oh + 1); + tent = (ipfw_obj_tentry *)(i + 1); + + if (need_header) + printf("--- table(%s), set(%u) ---\n", i->tablename, i->set); + + count = i->count; + while (count > 0) { + table_show_entry(i, tent); + tent = (ipfw_obj_tentry *)((caddr_t)tent + tent->head.length); + count--; + } +} + +static void +table_show_value(char *buf, size_t bufsize, ipfw_table_value *v, + uint32_t vmask, int print_ip) +{ + uint32_t flag, i, l; + size_t sz; + struct in_addr a4; + char abuf[INET6_ADDRSTRLEN]; + + sz = bufsize; + + /* + * Some shorthands for printing values: + * legacy assumes all values are equal, so keep the first one. + */ + if (vmask == IPFW_VTYPE_LEGACY) { + if (print_ip != 0) { + flag = htonl(v->tag); + inet_ntop(AF_INET, &flag, buf, sz); + } else + snprintf(buf, sz, "%u", v->tag); + return; + } + + for (i = 1; i < (1 << 31); i *= 2) { + if ((flag = (vmask & i)) == 0) + continue; + l = 0; + + switch (flag) { + case IPFW_VTYPE_TAG: + l = snprintf(buf, sz, "%u,", v->tag); + break; + case IPFW_VTYPE_PIPE: + l = snprintf(buf, sz, "%u,", v->pipe); + break; + case IPFW_VTYPE_DIVERT: + l = snprintf(buf, sz, "%d,", v->divert); + break; + case IPFW_VTYPE_SKIPTO: + l = snprintf(buf, sz, "%d,", v->skipto); + break; + case IPFW_VTYPE_NETGRAPH: + l = snprintf(buf, sz, "%u,", v->netgraph); + break; + case IPFW_VTYPE_FIB: + l = snprintf(buf, sz, "%u,", v->fib); + break; + case IPFW_VTYPE_NAT: + l = snprintf(buf, sz, "%u,", v->nat); + break; + case IPFW_VTYPE_LIMIT: + l = snprintf(buf, sz, "%u,", v->limit); + break; + case IPFW_VTYPE_NH4: + a4.s_addr = htonl(v->nh4); + inet_ntop(AF_INET, &a4, abuf, sizeof(abuf)); + l = snprintf(buf, sz, "%s,", abuf); + break; + case IPFW_VTYPE_DSCP: + l = snprintf(buf, sz, "%d,", v->dscp); + break; + case IPFW_VTYPE_NH6: + inet_ntop(AF_INET6, &v->nh6, abuf, sizeof(abuf)); + l = snprintf(buf, sz, "%s,", abuf); + break; + } + + buf += l; + sz -= l; + } + + if (sz != bufsize) + *(buf - 1) = '\0'; +} + +static void +table_show_entry(ipfw_xtable_info *i, ipfw_obj_tentry *tent) +{ + char *comma, tbuf[128], pval[128]; + void *paddr; + struct tflow_entry *tfe; + + table_show_value(pval, sizeof(pval), &tent->v.value, i->vmask, + co.do_value_as_ip); + + switch (i->type) { + case IPFW_TABLE_ADDR: + /* IPv4 or IPv6 prefixes */ + inet_ntop(tent->subtype, &tent->k, tbuf, sizeof(tbuf)); + printf("%s/%u %s\n", tbuf, tent->masklen, pval); + break; + case IPFW_TABLE_INTERFACE: + /* Interface names */ + printf("%s %s\n", tent->k.iface, pval); + break; + case IPFW_TABLE_NUMBER: + /* numbers */ + printf("%u %s\n", tent->k.key, pval); + break; + case IPFW_TABLE_FLOW: + /* flows */ + tfe = &tent->k.flow; + comma = ""; + + if ((i->tflags & IPFW_TFFLAG_SRCIP) != 0) { + if (tfe->af == AF_INET) + paddr = &tfe->a.a4.sip; + else + paddr = 
&tfe->a.a6.sip6; + + inet_ntop(tfe->af, paddr, tbuf, sizeof(tbuf)); + printf("%s%s", comma, tbuf); + comma = ","; + } + + if ((i->tflags & IPFW_TFFLAG_PROTO) != 0) { + printf("%s%d", comma, tfe->proto); + comma = ","; + } + + if ((i->tflags & IPFW_TFFLAG_SRCPORT) != 0) { + printf("%s%d", comma, ntohs(tfe->sport)); + comma = ","; + } + if ((i->tflags & IPFW_TFFLAG_DSTIP) != 0) { + if (tfe->af == AF_INET) + paddr = &tfe->a.a4.dip; + else + paddr = &tfe->a.a6.dip6; + + inet_ntop(tfe->af, paddr, tbuf, sizeof(tbuf)); + printf("%s%s", comma, tbuf); + comma = ","; + } + + if ((i->tflags & IPFW_TFFLAG_DSTPORT) != 0) { + printf("%s%d", comma, ntohs(tfe->dport)); + comma = ","; + } + + printf(" %s\n", pval); + } +} + +static int +table_do_get_stdlist(uint16_t opcode, ipfw_obj_lheader **polh) +{ + ipfw_obj_lheader req, *olh; + size_t sz; + + memset(&req, 0, sizeof(req)); + sz = sizeof(req); + + if (do_get3(opcode, &req.opheader, &sz) != 0) + if (errno != ENOMEM) + return (errno); + + sz = req.size; + if ((olh = calloc(1, sz)) == NULL) + return (ENOMEM); + + olh->size = sz; + if (do_get3(opcode, &olh->opheader, &sz) != 0) { + free(olh); + return (errno); + } + + *polh = olh; + return (0); +} + +static int +table_do_get_algolist(ipfw_obj_lheader **polh) +{ + + return (table_do_get_stdlist(IP_FW_TABLES_ALIST, polh)); +} + +static int +table_do_get_vlist(ipfw_obj_lheader **polh) +{ + + return (table_do_get_stdlist(IP_FW_TABLE_VLIST, polh)); +} + +void +ipfw_list_ta(int ac, char *av[]) +{ + ipfw_obj_lheader *olh; + ipfw_ta_info *info; + int error, i; + const char *atype; + + error = table_do_get_algolist(&olh); + if (error != 0) + err(EX_OSERR, "Unable to request algorithm list"); + + info = (ipfw_ta_info *)(olh + 1); + for (i = 0; i < olh->count; i++) { + if ((atype = match_value(tabletypes, info->type)) == NULL) + atype = "unknown"; + printf("--- %s ---\n", info->algoname); + printf(" type: %s\n refcount: %u\n", atype, info->refcnt); + + info = (ipfw_ta_info *)((caddr_t)info + olh->objsize); + } + + free(olh); +} + + +/* Copy of current kernel table_value structure */ +struct _table_value { + uint32_t tag; /* O_TAG/O_TAGGED */ + uint32_t pipe; /* O_PIPE/O_QUEUE */ + uint16_t divert; /* O_DIVERT/O_TEE */ + uint16_t skipto; /* skipto, CALLRET */ + uint32_t netgraph; /* O_NETGRAPH/O_NGTEE */ + uint32_t fib; /* O_SETFIB */ + uint32_t nat; /* O_NAT */ + uint32_t nh4; + uint8_t dscp; + uint8_t spare0[3]; + /* -- 32 bytes -- */ + struct in6_addr nh6; + uint32_t limit; /* O_LIMIT */ + uint32_t spare1; + uint64_t refcnt; /* Number of references */ +}; + +int +compare_values(const void *_a, const void *_b) +{ + struct _table_value *a, *b; + + a = (struct _table_value *)_a; + b = (struct _table_value *)_b; + + if (a->spare1 < b->spare1) + return (-1); + else if (a->spare1 > b->spare1) + return (1); + + return (0); +} + +void +ipfw_list_values(int ac, char *av[]) +{ + ipfw_obj_lheader *olh; + struct _table_value *v; + int error, i; + uint32_t vmask; + char buf[128]; + + error = table_do_get_vlist(&olh); + if (error != 0) + err(EX_OSERR, "Unable to request value list"); + + vmask = 0x7FFFFFFF; /* Similar to IPFW_VTYPE_LEGACY */ + + table_print_valheader(buf, sizeof(buf), vmask); + printf("HEADER: %s\n", buf); + v = (struct _table_value *)(olh + 1); + qsort(v, olh->count, olh->objsize, compare_values); + for (i = 0; i < olh->count; i++) { + table_show_value(buf, sizeof(buf), (ipfw_table_value *)v, + vmask, 0); + printf("[%u] refs=%lu %s\n", v->spare1, (u_long)v->refcnt, buf); + v = (struct _table_value *)((caddr_t)v 
+ olh->objsize); + } + + free(olh); +} + +int +compare_ntlv(const void *_a, const void *_b) +{ + ipfw_obj_ntlv *a, *b; + + a = (ipfw_obj_ntlv *)_a; + b = (ipfw_obj_ntlv *)_b; + + if (a->set < b->set) + return (-1); + else if (a->set > b->set) + return (1); + + if (a->idx < b->idx) + return (-1); + else if (a->idx > b->idx) + return (1); + + return (0); +} + +int +compare_kntlv(const void *k, const void *v) +{ + ipfw_obj_ntlv *ntlv; + uint16_t key; + + key = *((uint16_t *)k); + ntlv = (ipfw_obj_ntlv *)v; + + if (key < ntlv->idx) + return (-1); + else if (key > ntlv->idx) + return (1); + + return (0); +} + +/* + * Finds table name in @ctlv by @idx. + * Uses the following facts: + * 1) All TLVs are the same size + * 2) Kernel implementation provides already sorted list. + * + * Returns table name or NULL. + */ +char * +table_search_ctlv(ipfw_obj_ctlv *ctlv, uint16_t idx) +{ + ipfw_obj_ntlv *ntlv; + + ntlv = bsearch(&idx, (ctlv + 1), ctlv->count, ctlv->objsize, + compare_kntlv); + + if (ntlv != 0) + return (ntlv->name); + + return (NULL); +} + +void +table_sort_ctlv(ipfw_obj_ctlv *ctlv) +{ + + qsort(ctlv + 1, ctlv->count, ctlv->objsize, compare_ntlv); +} + +int +table_check_name(char *tablename) +{ + int c, i, l; + + /* + * Check if tablename is null-terminated and contains + * valid symbols only. Valid mask is: + * [a-zA-Z0-9\-_\.]{1,63} + */ + l = strlen(tablename); + if (l == 0 || l >= 64) + return (EINVAL); + for (i = 0; i < l; i++) { + c = tablename[i]; + if (isalpha(c) || isdigit(c) || c == '_' || + c == '-' || c == '.') + continue; + return (EINVAL); + } + + /* Restrict some 'special' names */ + if (strcmp(tablename, "all") == 0) + return (EINVAL); + + return (0); +} + diff --git a/example/ipfw/sys/net/pfil.h b/example/ipfw/sys/net/pfil.h new file mode 100644 index 0000000..6fa0c25 --- /dev/null +++ b/example/ipfw/sys/net/pfil.h @@ -0,0 +1,148 @@ +/* $FreeBSD: head/sys/net/pfil.h 254777 2013-08-24 12:03:24Z andre $ */ +/* $NetBSD: pfil.h,v 1.22 2003/06/23 12:57:08 martin Exp $ */ + +/*- + * Copyright (c) 1996 Matthew R. Green + * All rights reserved. + * + * Redistribution and use in source and binary forms, with or without + * modification, are permitted provided that the following conditions + * are met: + * 1. Redistributions of source code must retain the above copyright + * notice, this list of conditions and the following disclaimer. + * 2. Redistributions in binary form must reproduce the above copyright + * notice, this list of conditions and the following disclaimer in the + * documentation and/or other materials provided with the distribution. + * 3. The name of the author may not be used to endorse or promote products + * derived from this software without specific prior written permission. + * + * THIS SOFTWARE IS PROVIDED BY THE AUTHOR ``AS IS'' AND ANY EXPRESS OR + * IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES + * OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. + * IN NO EVENT SHALL THE AUTHOR BE LIABLE FOR ANY DIRECT, INDIRECT, + * INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, + * BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; + * LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED + * AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, + * OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY + * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF + * SUCH DAMAGE. 
+ */ + +#ifndef _NET_PFIL_H_ +#define _NET_PFIL_H_ + +#include +#include +#include +#include +#include +#include + +struct mbuf; +struct ifnet; +struct inpcb; + +typedef int (*pfil_func_t)(void *, struct mbuf **, struct ifnet *, int, + struct inpcb *); + +/* + * The packet filter hooks are designed for anything to call them to + * possibly intercept the packet. Multiple filter hooks are chained + * together and after each other in the specified order. + */ +struct packet_filter_hook { + TAILQ_ENTRY(packet_filter_hook) pfil_chain; + pfil_func_t pfil_func; + void *pfil_arg; +}; + +#define PFIL_IN 0x00000001 +#define PFIL_OUT 0x00000002 +#define PFIL_WAITOK 0x00000004 +#define PFIL_ALL (PFIL_IN|PFIL_OUT) + +typedef TAILQ_HEAD(pfil_chain, packet_filter_hook) pfil_chain_t; + +#define PFIL_TYPE_AF 1 /* key is AF_* type */ +#define PFIL_TYPE_IFNET 2 /* key is ifnet pointer */ + +#define PFIL_FLAG_PRIVATE_LOCK 0x01 /* Personal lock instead of global */ + +/* + * A pfil head is created by each protocol or packet intercept point. + * For packet is then run through the hook chain for inspection. + */ +struct pfil_head { + pfil_chain_t ph_in; + pfil_chain_t ph_out; + int ph_type; + int ph_nhooks; +#if defined( __linux__ ) || defined( _WIN32 ) + rwlock_t ph_mtx; +#else + struct rmlock *ph_plock; /* Pointer to the used lock */ + struct rmlock ph_lock; /* Private lock storage */ + int flags; +#endif + union { + u_long phu_val; + void *phu_ptr; + } ph_un; +#define ph_af ph_un.phu_val +#define ph_ifnet ph_un.phu_ptr + LIST_ENTRY(pfil_head) ph_list; +}; + +/* Public functions for pfil hook management by packet filters. */ +struct pfil_head *pfil_head_get(int, u_long); +int pfil_add_hook(pfil_func_t, void *, int, struct pfil_head *); +int pfil_remove_hook(pfil_func_t, void *, int, struct pfil_head *); +#define PFIL_HOOKED(p) ((p)->ph_nhooks > 0) + +/* Public functions to run the packet inspection by protocols. */ +int pfil_run_hooks(struct pfil_head *, struct mbuf **, struct ifnet *, + int, struct inpcb *inp); + +/* Public functions for pfil head management by protocols. */ +int pfil_head_register(struct pfil_head *); +int pfil_head_unregister(struct pfil_head *); + +/* Public pfil locking functions for self managed locks by packet filters. */ +struct rm_priotracker; /* Do not require including rmlock header */ +int pfil_try_rlock(struct pfil_head *, struct rm_priotracker *); +void pfil_rlock(struct pfil_head *, struct rm_priotracker *); +void pfil_runlock(struct pfil_head *, struct rm_priotracker *); +void pfil_wlock(struct pfil_head *); +void pfil_wunlock(struct pfil_head *); +int pfil_wowned(struct pfil_head *ph); + +/* Internal pfil locking functions. 
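+ *
+ * Editor's aside, not part of the imported header; my_hook stands for
+ * any pfil_func_t. Registration through the public API above looks
+ * roughly like:
+ *
+ *	struct pfil_head *ph = pfil_head_get(PFIL_TYPE_AF, AF_INET);
+ *
+ *	if (ph != NULL)
+ *		error = pfil_add_hook(my_hook, NULL,
+ *		    PFIL_IN | PFIL_OUT | PFIL_WAITOK, ph);
+ *
+ * The macros below then pick the head's private rmlock when
+ * PFIL_FLAG_PRIVATE_LOCK is set and the shared V_pfil_lock otherwise.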
*/ +#define PFIL_LOCK_INIT_REAL(l, t) \ + rm_init_flags(l, "PFil " t " rmlock", RM_RECURSE) +#define PFIL_LOCK_DESTROY_REAL(l) \ + rm_destroy(l) +#define PFIL_LOCK_INIT(p) do { \ + if ((p)->flags & PFIL_FLAG_PRIVATE_LOCK) { \ + PFIL_LOCK_INIT_REAL(&(p)->ph_lock, "private"); \ + (p)->ph_plock = &(p)->ph_lock; \ + } else \ + (p)->ph_plock = &V_pfil_lock; \ +} while (0) +#define PFIL_LOCK_DESTROY(p) do { \ + if ((p)->flags & PFIL_FLAG_PRIVATE_LOCK) \ + PFIL_LOCK_DESTROY_REAL((p)->ph_plock); \ +} while (0) + +#define PFIL_TRY_RLOCK(p, t) rm_try_rlock((p)->ph_plock, (t)) +#define PFIL_RLOCK(p, t) rm_rlock((p)->ph_plock, (t)) +#define PFIL_WLOCK(p) rm_wlock((p)->ph_plock) +#define PFIL_RUNLOCK(p, t) rm_runlock((p)->ph_plock, (t)) +#define PFIL_WUNLOCK(p) rm_wunlock((p)->ph_plock) +#define PFIL_WOWNED(p) rm_wowned((p)->ph_plock) + +/* Internal locking macros for global/vnet pfil_head_list. */ +#define PFIL_HEADLIST_LOCK() mtx_lock(&pfil_global_lock) +#define PFIL_HEADLIST_UNLOCK() mtx_unlock(&pfil_global_lock) + +#endif /* _NET_PFIL_H_ */ diff --git a/example/ipfw/sys/net/radix.c b/example/ipfw/sys/net/radix.c new file mode 100644 index 0000000..b423662 --- /dev/null +++ b/example/ipfw/sys/net/radix.c @@ -0,0 +1,1208 @@ +/*- + * Copyright (c) 1988, 1989, 1993 + * The Regents of the University of California. All rights reserved. + * + * Redistribution and use in source and binary forms, with or without + * modification, are permitted provided that the following conditions + * are met: + * 1. Redistributions of source code must retain the above copyright + * notice, this list of conditions and the following disclaimer. + * 2. Redistributions in binary form must reproduce the above copyright + * notice, this list of conditions and the following disclaimer in the + * documentation and/or other materials provided with the distribution. + * 4. Neither the name of the University nor the names of its contributors + * may be used to endorse or promote products derived from this software + * without specific prior written permission. + * + * THIS SOFTWARE IS PROVIDED BY THE REGENTS AND CONTRIBUTORS ``AS IS'' AND + * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE + * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE + * ARE DISCLAIMED. IN NO EVENT SHALL THE REGENTS OR CONTRIBUTORS BE LIABLE + * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL + * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS + * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) + * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT + * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY + * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF + * SUCH DAMAGE. + * + * @(#)radix.c 8.5 (Berkeley) 5/19/95 + * $FreeBSD: head/sys/net/radix.c 272385 2014-10-01 21:24:58Z melifaro $ + */ + +/* + * Routines to build and maintain radix trees for routing lookups. + */ +#include +#ifdef _KERNEL +#include +#include +#include +#include +#include +#include +#include +#include "opt_mpath.h" +#ifdef RADIX_MPATH +#include +#endif +#else /* !_KERNEL */ +#include +#include +#include +#define log(x, arg...) fprintf(stderr, ## arg) +#define panic(x) fprintf(stderr, "PANIC: %s", x), exit(1) +#define min(a, b) ((a) < (b) ? 
(a) : (b) ) +#include +#endif /* !_KERNEL */ + +static int rn_walktree_from(struct radix_node_head *h, void *a, void *m, + walktree_f_t *f, void *w); +static int rn_walktree(struct radix_node_head *, walktree_f_t *, void *); +static struct radix_node + *rn_insert(void *, struct radix_node_head *, int *, + struct radix_node [2]), + *rn_newpair(void *, int, struct radix_node[2]), + *rn_search(void *, struct radix_node *), + *rn_search_m(void *, struct radix_node *, void *); + +static void rn_detachhead_internal(void **head); +static int rn_inithead_internal(void **head, int off); + +#define RADIX_MAX_KEY_LEN 32 + +static char rn_zeros[RADIX_MAX_KEY_LEN]; +static char rn_ones[RADIX_MAX_KEY_LEN] = { + -1, -1, -1, -1, -1, -1, -1, -1, + -1, -1, -1, -1, -1, -1, -1, -1, + -1, -1, -1, -1, -1, -1, -1, -1, + -1, -1, -1, -1, -1, -1, -1, -1, +}; + + +static int rn_lexobetter(void *m_arg, void *n_arg); +static struct radix_mask * + rn_new_radix_mask(struct radix_node *tt, + struct radix_mask *next); +static int rn_satisfies_leaf(char *trial, struct radix_node *leaf, + int skip); + +/* + * The data structure for the keys is a radix tree with one way + * branching removed. The index rn_bit at an internal node n represents a bit + * position to be tested. The tree is arranged so that all descendants + * of a node n have keys whose bits all agree up to position rn_bit - 1. + * (We say the index of n is rn_bit.) + * + * There is at least one descendant which has a one bit at position rn_bit, + * and at least one with a zero there. + * + * A route is determined by a pair of key and mask. We require that the + * bit-wise logical and of the key and mask to be the key. + * We define the index of a route to associated with the mask to be + * the first bit number in the mask where 0 occurs (with bit number 0 + * representing the highest order bit). + * + * We say a mask is normal if every bit is 0, past the index of the mask. + * If a node n has a descendant (k, m) with index(m) == index(n) == rn_bit, + * and m is a normal mask, then the route applies to every descendant of n. + * If the index(m) < rn_bit, this implies the trailing last few bits of k + * before bit b are all 0, (and hence consequently true of every descendant + * of n), so the route applies to all descendants of the node as well. + * + * Similar logic shows that a non-normal mask m such that + * index(m) <= index(n) could potentially apply to many children of n. + * Thus, for each non-host route, we attach its mask to a list at an internal + * node as high in the tree as we can go. + * + * The present version of the code makes use of normal routes in short- + * circuiting an explict mask and compare operation when testing whether + * a key satisfies a normal route, and also in remembering the unique leaf + * that governs a subtree. + */ + +/* + * Most of the functions in this code assume that the key/mask arguments + * are sockaddr-like structures, where the first byte is an u_char + * indicating the size of the entire structure. + * + * To make the assumption more explicit, we use the LEN() macro to access + * this field. It is safe to pass an expression with side effects + * to LEN() as the argument is evaluated only once. + * We cast the result to int as this is the dominant usage. 
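+ *
+ * (Editor's example, not in the original comment: for an IPv4 routing
+ * key stored in a struct sockaddr_in, the first byte is sin_len, so
+ * LEN(key) evaluates to 16.)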
+ */ +#define LEN(x) ( (int) (*(const u_char *)(x)) ) + +/* + * XXX THIS NEEDS TO BE FIXED + * In the code, pointers to keys and masks are passed as either + * 'void *' (because callers use to pass pointers of various kinds), or + * 'caddr_t' (which is fine for pointer arithmetics, but not very + * clean when you dereference it to access data). Furthermore, caddr_t + * is really 'char *', while the natural type to operate on keys and + * masks would be 'u_char'. This mismatch require a lot of casts and + * intermediate variables to adapt types that clutter the code. + */ + +/* + * Search a node in the tree matching the key. + */ +static struct radix_node * +rn_search(void *v_arg, struct radix_node *head) +{ + struct radix_node *x; + caddr_t v; + + for (x = head, v = v_arg; x->rn_bit >= 0;) { + if (x->rn_bmask & v[x->rn_offset]) + x = x->rn_right; + else + x = x->rn_left; + } + return (x); +} + +/* + * Same as above, but with an additional mask. + * XXX note this function is used only once. + */ +static struct radix_node * +rn_search_m(void *v_arg, struct radix_node *head, void *m_arg) +{ + struct radix_node *x; + caddr_t v = v_arg, m = m_arg; + + for (x = head; x->rn_bit >= 0;) { + if ((x->rn_bmask & m[x->rn_offset]) && + (x->rn_bmask & v[x->rn_offset])) + x = x->rn_right; + else + x = x->rn_left; + } + return (x); +} + +int +rn_refines(void *m_arg, void *n_arg) +{ + caddr_t m = m_arg, n = n_arg; + caddr_t lim, lim2 = lim = n + LEN(n); + int longer = LEN(n++) - LEN(m++); + int masks_are_equal = 1; + + if (longer > 0) + lim -= longer; + while (n < lim) { + if (*n & ~(*m)) + return (0); + if (*n++ != *m++) + masks_are_equal = 0; + } + while (n < lim2) + if (*n++) + return (0); + if (masks_are_equal && (longer < 0)) + for (lim2 = m - longer; m < lim2; ) + if (*m++) + return (1); + return (!masks_are_equal); +} + +/* + * Search for exact match in given @head. + * Assume host bits are cleared in @v_arg if @m_arg is not NULL + * Note that prefixes with /32 or /128 masks are treated differently + * from host routes. + */ +struct radix_node * +rn_lookup(void *v_arg, void *m_arg, struct radix_node_head *head) +{ + struct radix_node *x; + caddr_t netmask; + + if (m_arg != NULL) { + /* + * Most common case: search exact prefix/mask + */ + x = rn_addmask(m_arg, head->rnh_masks, 1, + head->rnh_treetop->rn_offset); + if (x == NULL) + return (NULL); + netmask = x->rn_key; + + x = rn_match(v_arg, head); + + while (x != NULL && x->rn_mask != netmask) + x = x->rn_dupedkey; + + return (x); + } + + /* + * Search for host address. 
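+ *
+ * (Editor's note: with m_arg == NULL the code below demands an exact
+ * key match and rn_mask == NULL, so a /32 or /128 prefix entry does
+ * not satisfy a host lookup, matching the comment at the top of this
+ * function.)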
+ */ + if ((x = rn_match(v_arg, head)) == NULL) + return (NULL); + + /* Check if found key is the same */ + if (LEN(x->rn_key) != LEN(v_arg) || bcmp(x->rn_key, v_arg, LEN(v_arg))) + return (NULL); + + /* Check if this is not host route */ + if (x->rn_mask != NULL) + return (NULL); + + return (x); +} + +static int +rn_satisfies_leaf(char *trial, struct radix_node *leaf, int skip) +{ + char *cp = trial, *cp2 = leaf->rn_key, *cp3 = leaf->rn_mask; + char *cplim; + int length = min(LEN(cp), LEN(cp2)); + + if (cp3 == NULL) + cp3 = rn_ones; + else + length = min(length, LEN(cp3)); + cplim = cp + length; cp3 += skip; cp2 += skip; + for (cp += skip; cp < cplim; cp++, cp2++, cp3++) + if ((*cp ^ *cp2) & *cp3) + return (0); + return (1); +} + +/* + * Search for longest-prefix match in given @head + */ +struct radix_node * +rn_match(void *v_arg, struct radix_node_head *head) +{ + caddr_t v = v_arg; + struct radix_node *t = head->rnh_treetop, *x; + caddr_t cp = v, cp2; + caddr_t cplim; + struct radix_node *saved_t, *top = t; + int off = t->rn_offset, vlen = LEN(cp), matched_off; + int test, b, rn_bit; + + /* + * Open code rn_search(v, top) to avoid overhead of extra + * subroutine call. + */ + for (; t->rn_bit >= 0; ) { + if (t->rn_bmask & cp[t->rn_offset]) + t = t->rn_right; + else + t = t->rn_left; + } + /* + * See if we match exactly as a host destination + * or at least learn how many bits match, for normal mask finesse. + * + * It doesn't hurt us to limit how many bytes to check + * to the length of the mask, since if it matches we had a genuine + * match and the leaf we have is the most specific one anyway; + * if it didn't match with a shorter length it would fail + * with a long one. This wins big for class B&C netmasks which + * are probably the most common case... + */ + if (t->rn_mask) + vlen = *(u_char *)t->rn_mask; + cp += off; cp2 = t->rn_key + off; cplim = v + vlen; + for (; cp < cplim; cp++, cp2++) + if (*cp != *cp2) + goto on1; + /* + * This extra grot is in case we are explicitly asked + * to look up the default. Ugh! + * + * Never return the root node itself, it seems to cause a + * lot of confusion. + */ + if (t->rn_flags & RNF_ROOT) + t = t->rn_dupedkey; + return (t); +on1: + test = (*cp ^ *cp2) & 0xff; /* find first bit that differs */ + for (b = 7; (test >>= 1) > 0;) + b--; + matched_off = cp - v; + b += matched_off << 3; + rn_bit = -1 - b; + /* + * If there is a host route in a duped-key chain, it will be first. + */ + if ((saved_t = t)->rn_mask == 0) + t = t->rn_dupedkey; + for (; t; t = t->rn_dupedkey) + /* + * Even if we don't match exactly as a host, + * we may match if the leaf we wound up at is + * a route to a net. + */ + if (t->rn_flags & RNF_NORMAL) { + if (rn_bit <= t->rn_bit) + return (t); + } else if (rn_satisfies_leaf(v, t, matched_off)) + return (t); + t = saved_t; + /* start searching up the tree */ + do { + struct radix_mask *m; + t = t->rn_parent; + m = t->rn_mklist; + /* + * If non-contiguous masks ever become important + * we can restore the masking and open coding of + * the search and satisfaction test and put the + * calculation of "off" back before the "do". 
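+ *
+ * (Editor's note on the loop below: masks flagged RNF_NORMAL are
+ * decided by a single comparison of rm_bit against the bit index
+ * where the search key diverged, while non-normal masks fall back to
+ * rn_search_m() plus an explicit rn_satisfies_leaf() check.)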
+ */ + while (m) { + if (m->rm_flags & RNF_NORMAL) { + if (rn_bit <= m->rm_bit) + return (m->rm_leaf); + } else { + off = min(t->rn_offset, matched_off); + x = rn_search_m(v, t, m->rm_mask); + while (x && x->rn_mask != m->rm_mask) + x = x->rn_dupedkey; + if (x && rn_satisfies_leaf(v, x, off)) + return (x); + } + m = m->rm_mklist; + } + } while (t != top); + return (0); +} + +#ifdef RN_DEBUG +int rn_nodenum; +struct radix_node *rn_clist; +int rn_saveinfo; +int rn_debug = 1; +#endif + +/* + * Whenever we add a new leaf to the tree, we also add a parent node, + * so we allocate them as an array of two elements: the first one must be + * the leaf (see RNTORT() in route.c), the second one is the parent. + * This routine initializes the relevant fields of the nodes, so that + * the leaf is the left child of the parent node, and both nodes have + * (almost) all all fields filled as appropriate. + * (XXX some fields are left unset, see the '#if 0' section). + * The function returns a pointer to the parent node. + */ + +static struct radix_node * +rn_newpair(void *v, int b, struct radix_node nodes[2]) +{ + struct radix_node *tt = nodes, *t = tt + 1; + t->rn_bit = b; + t->rn_bmask = 0x80 >> (b & 7); + t->rn_left = tt; + t->rn_offset = b >> 3; + +#if 0 /* XXX perhaps we should fill these fields as well. */ + t->rn_parent = t->rn_right = NULL; + + tt->rn_mask = NULL; + tt->rn_dupedkey = NULL; + tt->rn_bmask = 0; +#endif + tt->rn_bit = -1; + tt->rn_key = (caddr_t)v; + tt->rn_parent = t; + tt->rn_flags = t->rn_flags = RNF_ACTIVE; + tt->rn_mklist = t->rn_mklist = 0; +#ifdef RN_DEBUG + tt->rn_info = rn_nodenum++; t->rn_info = rn_nodenum++; + tt->rn_twin = t; + tt->rn_ybro = rn_clist; + rn_clist = tt; +#endif + return (t); +} + +static struct radix_node * +rn_insert(void *v_arg, struct radix_node_head *head, int *dupentry, + struct radix_node nodes[2]) +{ + caddr_t v = v_arg; + struct radix_node *top = head->rnh_treetop; + int head_off = top->rn_offset, vlen = LEN(v); + struct radix_node *t = rn_search(v_arg, top); + caddr_t cp = v + head_off; + int b; + struct radix_node *p, *tt, *x; + /* + * Find first bit at which v and t->rn_key differ + */ + caddr_t cp2 = t->rn_key + head_off; + int cmp_res; + caddr_t cplim = v + vlen; + + while (cp < cplim) + if (*cp2++ != *cp++) + goto on1; + *dupentry = 1; + return (t); +on1: + *dupentry = 0; + cmp_res = (cp[-1] ^ cp2[-1]) & 0xff; + for (b = (cp - v) << 3; cmp_res; b--) + cmp_res >>= 1; + + x = top; + cp = v; + do { + p = x; + if (cp[x->rn_offset] & x->rn_bmask) + x = x->rn_right; + else + x = x->rn_left; + } while (b > (unsigned) x->rn_bit); + /* x->rn_bit < b && x->rn_bit >= 0 */ +#ifdef RN_DEBUG + if (rn_debug) + log(LOG_DEBUG, "rn_insert: Going In:\n"), traverse(p); +#endif + t = rn_newpair(v_arg, b, nodes); + tt = t->rn_left; + if ((cp[p->rn_offset] & p->rn_bmask) == 0) + p->rn_left = t; + else + p->rn_right = t; + x->rn_parent = t; + t->rn_parent = p; /* frees x, p as temp vars below */ + if ((cp[t->rn_offset] & t->rn_bmask) == 0) { + t->rn_right = x; + } else { + t->rn_right = tt; + t->rn_left = x; + } +#ifdef RN_DEBUG + if (rn_debug) + log(LOG_DEBUG, "rn_insert: Coming Out:\n"), traverse(p); +#endif + return (tt); +} + +struct radix_node * +rn_addmask(void *n_arg, struct radix_node_head *maskhead, int search, int skip) +{ + unsigned char *netmask = n_arg; + unsigned char *cp, *cplim; + struct radix_node *x; + int b = 0, mlen, j; + int maskduplicated, isnormal; + struct radix_node *saved_x; + unsigned char addmask_key[RADIX_MAX_KEY_LEN]; + + if ((mlen = 
LEN(netmask)) > RADIX_MAX_KEY_LEN) + mlen = RADIX_MAX_KEY_LEN; + if (skip == 0) + skip = 1; + if (mlen <= skip) + return (maskhead->rnh_nodes); + + bzero(addmask_key, RADIX_MAX_KEY_LEN); + if (skip > 1) + bcopy(rn_ones + 1, addmask_key + 1, skip - 1); + bcopy(netmask + skip, addmask_key + skip, mlen - skip); + /* + * Trim trailing zeroes. + */ + for (cp = addmask_key + mlen; (cp > addmask_key) && cp[-1] == 0;) + cp--; + mlen = cp - addmask_key; + if (mlen <= skip) + return (maskhead->rnh_nodes); + *addmask_key = mlen; + x = rn_search(addmask_key, maskhead->rnh_treetop); + if (bcmp(addmask_key, x->rn_key, mlen) != 0) + x = 0; + if (x || search) + return (x); + R_Zalloc(x, struct radix_node *, RADIX_MAX_KEY_LEN + 2 * sizeof (*x)); + if ((saved_x = x) == 0) + return (0); + netmask = cp = (unsigned char *)(x + 2); + bcopy(addmask_key, cp, mlen); + x = rn_insert(cp, maskhead, &maskduplicated, x); + if (maskduplicated) { + log(LOG_ERR, "rn_addmask: mask impossibly already in tree"); + Free(saved_x); + return (x); + } + /* + * Calculate index of mask, and check for normalcy. + * First find the first byte with a 0 bit, then if there are + * more bits left (remember we already trimmed the trailing 0's), + * the bits should be contiguous, otherwise we have got + * a non-contiguous mask. + */ +#define CONTIG(_c) (((~(_c) + 1) & (_c)) == (unsigned char)(~(_c) + 1)) + cplim = netmask + mlen; + isnormal = 1; + for (cp = netmask + skip; (cp < cplim) && *(u_char *)cp == 0xff;) + cp++; + if (cp != cplim) { + for (j = 0x80; (j & *cp) != 0; j >>= 1) + b++; + if (!CONTIG(*cp) || cp != (cplim - 1)) + isnormal = 0; + } + b += (cp - netmask) << 3; + x->rn_bit = -1 - b; + if (isnormal) + x->rn_flags |= RNF_NORMAL; + return (x); +} + +static int /* XXX: arbitrary ordering for non-contiguous masks */ +rn_lexobetter(void *m_arg, void *n_arg) +{ + u_char *mp = m_arg, *np = n_arg, *lim; + + if (LEN(mp) > LEN(np)) + return (1); /* not really, but need to check longer one first */ + if (LEN(mp) == LEN(np)) + for (lim = mp + LEN(mp); mp < lim;) + if (*mp++ > *np++) + return (1); + return (0); +} + +static struct radix_mask * +rn_new_radix_mask(struct radix_node *tt, struct radix_mask *next) +{ + struct radix_mask *m; + + R_Malloc(m, struct radix_mask *, sizeof (struct radix_mask)); + if (m == NULL) { + log(LOG_ERR, "Failed to allocate route mask\n"); + return (0); + } + bzero(m, sizeof(*m)); + m->rm_bit = tt->rn_bit; + m->rm_flags = tt->rn_flags; + if (tt->rn_flags & RNF_NORMAL) + m->rm_leaf = tt; + else + m->rm_mask = tt->rn_mask; + m->rm_mklist = next; + tt->rn_mklist = m; + return (m); +} + +struct radix_node * +rn_addroute(void *v_arg, void *n_arg, struct radix_node_head *head, + struct radix_node treenodes[2]) +{ + caddr_t v = (caddr_t)v_arg, netmask = (caddr_t)n_arg; + struct radix_node *t, *x = 0, *tt; + struct radix_node *saved_tt, *top = head->rnh_treetop; + short b = 0, b_leaf = 0; + int keyduplicated; + caddr_t mmask; + struct radix_mask *m, **mp; + + /* + * In dealing with non-contiguous masks, there may be + * many different routes which have the same mask. + * We will find it useful to have a unique pointer to + * the mask to speed avoiding duplicate references at + * nodes and possibly save time in calculating indices. 
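+ *
+ * Concretely: once rn_addmask() has interned a mask, equal masks
+ * share a single key, so the duplicated-key loop below can test
+ * tt->rn_mask == netmask by pointer instead of running bcmp()
+ * over the mask bytes.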
+ */ + if (netmask) { + x = rn_addmask(netmask, head->rnh_masks, 0, top->rn_offset); + if (x == NULL) + return (0); + b_leaf = x->rn_bit; + b = -1 - x->rn_bit; + netmask = x->rn_key; + } + /* + * Deal with duplicated keys: attach node to previous instance + */ + saved_tt = tt = rn_insert(v, head, &keyduplicated, treenodes); + if (keyduplicated) { + for (t = tt; tt; t = tt, tt = tt->rn_dupedkey) { +#ifdef RADIX_MPATH + /* permit multipath, if enabled for the family */ + if (rn_mpath_capable(head) && netmask == tt->rn_mask) { + /* + * go down to the end of multipaths, so that + * new entry goes into the end of rn_dupedkey + * chain. + */ + do { + t = tt; + tt = tt->rn_dupedkey; + } while (tt && t->rn_mask == tt->rn_mask); + break; + } +#endif + if (tt->rn_mask == netmask) + return (0); + if (netmask == 0 || + (tt->rn_mask && + ((b_leaf < tt->rn_bit) /* index(netmask) > node */ + || rn_refines(netmask, tt->rn_mask) + || rn_lexobetter(netmask, tt->rn_mask)))) + break; + } + /* + * If the mask is not duplicated, we wouldn't + * find it among possible duplicate key entries + * anyway, so the above test doesn't hurt. + * + * We sort the masks for a duplicated key the same way as + * in a masklist -- most specific to least specific. + * This may require the unfortunate nuisance of relocating + * the head of the list. + * + * We also reverse, or doubly link the list through the + * parent pointer. + */ + if (tt == saved_tt) { + struct radix_node *xx = x; + /* link in at head of list */ + (tt = treenodes)->rn_dupedkey = t; + tt->rn_flags = t->rn_flags; + tt->rn_parent = x = t->rn_parent; + t->rn_parent = tt; /* parent */ + if (x->rn_left == t) + x->rn_left = tt; + else + x->rn_right = tt; + saved_tt = tt; x = xx; + } else { + (tt = treenodes)->rn_dupedkey = t->rn_dupedkey; + t->rn_dupedkey = tt; + tt->rn_parent = t; /* parent */ + if (tt->rn_dupedkey) /* parent */ + tt->rn_dupedkey->rn_parent = tt; /* parent */ + } +#ifdef RN_DEBUG + t=tt+1; tt->rn_info = rn_nodenum++; t->rn_info = rn_nodenum++; + tt->rn_twin = t; tt->rn_ybro = rn_clist; rn_clist = tt; +#endif + tt->rn_key = (caddr_t) v; + tt->rn_bit = -1; + tt->rn_flags = RNF_ACTIVE; + } + /* + * Put mask in tree. + */ + if (netmask) { + tt->rn_mask = netmask; + tt->rn_bit = x->rn_bit; + tt->rn_flags |= x->rn_flags & RNF_NORMAL; + } + t = saved_tt->rn_parent; + if (keyduplicated) + goto on2; + b_leaf = -1 - t->rn_bit; + if (t->rn_right == saved_tt) + x = t->rn_left; + else + x = t->rn_right; + /* Promote general routes from below */ + if (x->rn_bit < 0) { + for (mp = &t->rn_mklist; x; x = x->rn_dupedkey) + if (x->rn_mask && (x->rn_bit >= b_leaf) && x->rn_mklist == 0) { + *mp = m = rn_new_radix_mask(x, 0); + if (m) + mp = &m->rm_mklist; + } + } else if (x->rn_mklist) { + /* + * Skip over masks whose index is > that of new node + */ + for (mp = &x->rn_mklist; (m = *mp); mp = &m->rm_mklist) + if (m->rm_bit >= b_leaf) + break; + t->rn_mklist = m; *mp = 0; + } +on2: + /* Add new route to highest possible ancestor's list */ + if ((netmask == 0) || (b > t->rn_bit )) + return (tt); /* can't lift at all */ + b_leaf = tt->rn_bit; + do { + x = t; + t = t->rn_parent; + } while (b <= t->rn_bit && x != top); + /* + * Search through routes associated with node to + * insert new route according to index. + * Need same criteria as when sorting dupedkeys to avoid + * double loop on deletion. 
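+ *
+ * The shared criteria are rn_refines() for comparable masks and
+ * rn_lexobetter() as an arbitrary but stable tie-break for
+ * non-comparable ones, giving both lists the same
+ * most-specific-first order.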
+ */ + for (mp = &x->rn_mklist; (m = *mp); mp = &m->rm_mklist) { + if (m->rm_bit < b_leaf) + continue; + if (m->rm_bit > b_leaf) + break; + if (m->rm_flags & RNF_NORMAL) { + mmask = m->rm_leaf->rn_mask; + if (tt->rn_flags & RNF_NORMAL) { +#if !defined(RADIX_MPATH) + log(LOG_ERR, + "Non-unique normal route, mask not entered\n"); +#endif + return (tt); + } + } else + mmask = m->rm_mask; + if (mmask == netmask) { + m->rm_refs++; + tt->rn_mklist = m; + return (tt); + } + if (rn_refines(netmask, mmask) + || rn_lexobetter(netmask, mmask)) + break; + } + *mp = rn_new_radix_mask(tt, *mp); + return (tt); +} + +struct radix_node * +rn_delete(void *v_arg, void *netmask_arg, struct radix_node_head *head) +{ + struct radix_node *t, *p, *x, *tt; + struct radix_mask *m, *saved_m, **mp; + struct radix_node *dupedkey, *saved_tt, *top; + caddr_t v, netmask; + int b, head_off, vlen; + + v = v_arg; + netmask = netmask_arg; + x = head->rnh_treetop; + tt = rn_search(v, x); + head_off = x->rn_offset; + vlen = LEN(v); + saved_tt = tt; + top = x; + if (tt == 0 || + bcmp(v + head_off, tt->rn_key + head_off, vlen - head_off)) + return (0); + /* + * Delete our route from mask lists. + */ + if (netmask) { + x = rn_addmask(netmask, head->rnh_masks, 1, head_off); + if (x == NULL) + return (0); + netmask = x->rn_key; + while (tt->rn_mask != netmask) + if ((tt = tt->rn_dupedkey) == 0) + return (0); + } + if (tt->rn_mask == 0 || (saved_m = m = tt->rn_mklist) == 0) + goto on1; + if (tt->rn_flags & RNF_NORMAL) { + if (m->rm_leaf != tt || m->rm_refs > 0) { + log(LOG_ERR, "rn_delete: inconsistent annotation\n"); + return (0); /* dangling ref could cause disaster */ + } + } else { + if (m->rm_mask != tt->rn_mask) { + log(LOG_ERR, "rn_delete: inconsistent annotation\n"); + goto on1; + } + if (--m->rm_refs >= 0) + goto on1; + } + b = -1 - tt->rn_bit; + t = saved_tt->rn_parent; + if (b > t->rn_bit) + goto on1; /* Wasn't lifted at all */ + do { + x = t; + t = t->rn_parent; + } while (b <= t->rn_bit && x != top); + for (mp = &x->rn_mklist; (m = *mp); mp = &m->rm_mklist) + if (m == saved_m) { + *mp = m->rm_mklist; + Free(m); + break; + } + if (m == 0) { + log(LOG_ERR, "rn_delete: couldn't find our annotation\n"); + if (tt->rn_flags & RNF_NORMAL) + return (0); /* Dangling ref to us */ + } +on1: + /* + * Eliminate us from tree + */ + if (tt->rn_flags & RNF_ROOT) + return (0); +#ifdef RN_DEBUG + /* Get us out of the creation list */ + for (t = rn_clist; t && t->rn_ybro != tt; t = t->rn_ybro) {} + if (t) t->rn_ybro = tt->rn_ybro; +#endif + t = tt->rn_parent; + dupedkey = saved_tt->rn_dupedkey; + if (dupedkey) { + /* + * Here, tt is the deletion target and + * saved_tt is the head of the dupekey chain. 
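+	 * E.g. if 10.0.0.0/16 (chain head) and 10.0.0.0/8 share one
+	 * key: deleting the /8 just patches rn_dupedkey links inside
+	 * the chain, while deleting the /16 splices its successor
+	 * into the tree in place of the head.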
+ */ + if (tt == saved_tt) { + /* remove from head of chain */ + x = dupedkey; x->rn_parent = t; + if (t->rn_left == tt) + t->rn_left = x; + else + t->rn_right = x; + } else { + /* find node in front of tt on the chain */ + for (x = p = saved_tt; p && p->rn_dupedkey != tt;) + p = p->rn_dupedkey; + if (p) { + p->rn_dupedkey = tt->rn_dupedkey; + if (tt->rn_dupedkey) /* parent */ + tt->rn_dupedkey->rn_parent = p; + /* parent */ + } else log(LOG_ERR, "rn_delete: couldn't find us\n"); + } + t = tt + 1; + if (t->rn_flags & RNF_ACTIVE) { +#ifndef RN_DEBUG + *++x = *t; + p = t->rn_parent; +#else + b = t->rn_info; + *++x = *t; + t->rn_info = b; + p = t->rn_parent; +#endif + if (p->rn_left == t) + p->rn_left = x; + else + p->rn_right = x; + x->rn_left->rn_parent = x; + x->rn_right->rn_parent = x; + } + goto out; + } + if (t->rn_left == tt) + x = t->rn_right; + else + x = t->rn_left; + p = t->rn_parent; + if (p->rn_right == t) + p->rn_right = x; + else + p->rn_left = x; + x->rn_parent = p; + /* + * Demote routes attached to us. + */ + if (t->rn_mklist) { + if (x->rn_bit >= 0) { + for (mp = &x->rn_mklist; (m = *mp);) + mp = &m->rm_mklist; + *mp = t->rn_mklist; + } else { + /* If there are any key,mask pairs in a sibling + duped-key chain, some subset will appear sorted + in the same order attached to our mklist */ + for (m = t->rn_mklist; m && x; x = x->rn_dupedkey) + if (m == x->rn_mklist) { + struct radix_mask *mm = m->rm_mklist; + x->rn_mklist = 0; + if (--(m->rm_refs) < 0) + Free(m); + m = mm; + } + if (m) + log(LOG_ERR, + "rn_delete: Orphaned Mask %p at %p\n", + m, x); + } + } + /* + * We may be holding an active internal node in the tree. + */ + x = tt + 1; + if (t != x) { +#ifndef RN_DEBUG + *t = *x; +#else + b = t->rn_info; + *t = *x; + t->rn_info = b; +#endif + t->rn_left->rn_parent = t; + t->rn_right->rn_parent = t; + p = x->rn_parent; + if (p->rn_left == x) + p->rn_left = t; + else + p->rn_right = t; + } +out: + tt->rn_flags &= ~RNF_ACTIVE; + tt[1].rn_flags &= ~RNF_ACTIVE; + return (tt); +} + +/* + * This is the same as rn_walktree() except for the parameters and the + * exit. + */ +static int +rn_walktree_from(struct radix_node_head *h, void *a, void *m, + walktree_f_t *f, void *w) +{ + int error; + struct radix_node *base, *next; + u_char *xa = (u_char *)a; + u_char *xm = (u_char *)m; + struct radix_node *rn, *last = NULL; /* shut up gcc */ + int stopping = 0; + int lastb; + + KASSERT(m != NULL, ("%s: mask needs to be specified", __func__)); + + /* + * rn_search_m is sort-of-open-coded here. We cannot use the + * function because we need to keep track of the last node seen. + */ + /* printf("about to search\n"); */ + for (rn = h->rnh_treetop; rn->rn_bit >= 0; ) { + last = rn; + /* printf("rn_bit %d, rn_bmask %x, xm[rn_offset] %x\n", + rn->rn_bit, rn->rn_bmask, xm[rn->rn_offset]); */ + if (!(rn->rn_bmask & xm[rn->rn_offset])) { + break; + } + if (rn->rn_bmask & xa[rn->rn_offset]) { + rn = rn->rn_right; + } else { + rn = rn->rn_left; + } + } + /* printf("done searching\n"); */ + + /* + * Two cases: either we stepped off the end of our mask, + * in which case last == rn, or we reached a leaf, in which + * case we want to start from the leaf. + */ + if (rn->rn_bit >= 0) + rn = last; + lastb = last->rn_bit; + + /* printf("rn %p, lastb %d\n", rn, lastb);*/ + + /* + * This gets complicated because we may delete the node + * while applying the function f to it, so we need to calculate + * the successor node in advance. 
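+	 *
+	 * rn_walktree() below uses the same precomputed-successor
+	 * trick; rn_detachhead() relies on it when it deletes every
+	 * mask leaf via rn_freeentry() during teardown.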
+ */ + while (rn->rn_bit >= 0) + rn = rn->rn_left; + + while (!stopping) { + /* printf("node %p (%d)\n", rn, rn->rn_bit); */ + base = rn; + /* If at right child go back up, otherwise, go right */ + while (rn->rn_parent->rn_right == rn + && !(rn->rn_flags & RNF_ROOT)) { + rn = rn->rn_parent; + + /* if went up beyond last, stop */ + if (rn->rn_bit <= lastb) { + stopping = 1; + /* printf("up too far\n"); */ + /* + * XXX we should jump to the 'Process leaves' + * part, because the values of 'rn' and 'next' + * we compute will not be used. Not a big deal + * because this loop will terminate, but it is + * inefficient and hard to understand! + */ + } + } + + /* + * At the top of the tree, no need to traverse the right + * half, prevent the traversal of the entire tree in the + * case of default route. + */ + if (rn->rn_parent->rn_flags & RNF_ROOT) + stopping = 1; + + /* Find the next *leaf* since next node might vanish, too */ + for (rn = rn->rn_parent->rn_right; rn->rn_bit >= 0;) + rn = rn->rn_left; + next = rn; + /* Process leaves */ + while ((rn = base) != 0) { + base = rn->rn_dupedkey; + /* printf("leaf %p\n", rn); */ + if (!(rn->rn_flags & RNF_ROOT) + && (error = (*f)(rn, w))) + return (error); + } + rn = next; + + if (rn->rn_flags & RNF_ROOT) { + /* printf("root, stopping"); */ + stopping = 1; + } + + } + return (0); +} + +static int +rn_walktree(struct radix_node_head *h, walktree_f_t *f, void *w) +{ + int error; + struct radix_node *base, *next; + struct radix_node *rn = h->rnh_treetop; + /* + * This gets complicated because we may delete the node + * while applying the function f to it, so we need to calculate + * the successor node in advance. + */ + + /* First time through node, go left */ + while (rn->rn_bit >= 0) + rn = rn->rn_left; + for (;;) { + base = rn; + /* If at right child go back up, otherwise, go right */ + while (rn->rn_parent->rn_right == rn + && (rn->rn_flags & RNF_ROOT) == 0) + rn = rn->rn_parent; + /* Find the next *leaf* since next node might vanish, too */ + for (rn = rn->rn_parent->rn_right; rn->rn_bit >= 0;) + rn = rn->rn_left; + next = rn; + /* Process leaves */ + while ((rn = base)) { + base = rn->rn_dupedkey; + if (!(rn->rn_flags & RNF_ROOT) + && (error = (*f)(rn, w))) + return (error); + } + rn = next; + if (rn->rn_flags & RNF_ROOT) + return (0); + } + /* NOTREACHED */ +} + +/* + * Allocate and initialize an empty tree. This has 3 nodes, which are + * part of the radix_node_head (in the order <left,root,right>) and are + * marked RNF_ROOT so they cannot be freed. + * The leaves have all-zero and all-one keys, with significant + * bits starting at 'off'. + * Return 1 on success, 0 on error. + */ +static int +rn_inithead_internal(void **head, int off) +{ + struct radix_node_head *rnh; + struct radix_node *t, *tt, *ttt; + if (*head) + return (1); + R_Zalloc(rnh, struct radix_node_head *, sizeof (*rnh)); + if (rnh == 0) + return (0); + *head = rnh; + t = rn_newpair(rn_zeros, off, rnh->rnh_nodes); + ttt = rnh->rnh_nodes + 2; + t->rn_right = ttt; + t->rn_parent = t; + tt = t->rn_left; /* ...
which in turn is rnh->rnh_nodes */ + tt->rn_flags = t->rn_flags = RNF_ROOT | RNF_ACTIVE; + tt->rn_bit = -1 - off; + *ttt = *tt; + ttt->rn_key = rn_ones; + rnh->rnh_addaddr = rn_addroute; + rnh->rnh_deladdr = rn_delete; + rnh->rnh_matchaddr = rn_match; + rnh->rnh_lookup = rn_lookup; + rnh->rnh_walktree = rn_walktree; + rnh->rnh_walktree_from = rn_walktree_from; + rnh->rnh_treetop = t; + return (1); +} + +static void +rn_detachhead_internal(void **head) +{ + struct radix_node_head *rnh; + + KASSERT((head != NULL && *head != NULL), + ("%s: head already freed", __func__)); + rnh = *head; + + /* Free nodes. */ + Free(rnh); + + *head = NULL; +} + +int +rn_inithead(void **head, int off) +{ + struct radix_node_head *rnh; + + if (*head != NULL) + return (1); + + if (rn_inithead_internal(head, off) == 0) + return (0); + + rnh = (struct radix_node_head *)(*head); + + if (rn_inithead_internal((void **)&rnh->rnh_masks, 0) == 0) { + rn_detachhead_internal(head); + return (0); + } + + return (1); +} + +static int +rn_freeentry(struct radix_node *rn, void *arg) +{ + struct radix_node_head * const rnh = arg; + struct radix_node *x; + + x = (struct radix_node *)rn_delete(rn + 2, NULL, rnh); + if (x != NULL) + Free(x); + return (0); +} + +int +rn_detachhead(void **head) +{ + struct radix_node_head *rnh; + + KASSERT((head != NULL && *head != NULL), + ("%s: head already freed", __func__)); + + rnh = *head; + + rn_walktree(rnh->rnh_masks, rn_freeentry, rnh->rnh_masks); + rn_detachhead_internal((void **)&rnh->rnh_masks); + rn_detachhead_internal(head); + return (1); +} + diff --git a/example/ipfw/sys/net/radix.h b/example/ipfw/sys/net/radix.h new file mode 100644 index 0000000..43742fa --- /dev/null +++ b/example/ipfw/sys/net/radix.h @@ -0,0 +1,168 @@ +/*- + * Copyright (c) 1988, 1989, 1993 + * The Regents of the University of California. All rights reserved. + * + * Redistribution and use in source and binary forms, with or without + * modification, are permitted provided that the following conditions + * are met: + * 1. Redistributions of source code must retain the above copyright + * notice, this list of conditions and the following disclaimer. + * 2. Redistributions in binary form must reproduce the above copyright + * notice, this list of conditions and the following disclaimer in the + * documentation and/or other materials provided with the distribution. + * 4. Neither the name of the University nor the names of its contributors + * may be used to endorse or promote products derived from this software + * without specific prior written permission. + * + * THIS SOFTWARE IS PROVIDED BY THE REGENTS AND CONTRIBUTORS ``AS IS'' AND + * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE + * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE + * ARE DISCLAIMED. IN NO EVENT SHALL THE REGENTS OR CONTRIBUTORS BE LIABLE + * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL + * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS + * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) + * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT + * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY + * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF + * SUCH DAMAGE. 
+ * + * @(#)radix.h 8.2 (Berkeley) 10/31/94 + * $FreeBSD: head/sys/net/radix.h 262758 2014-03-04 23:55:04Z gnn $ + */ + +#ifndef _RADIX_H_ +#define _RADIX_H_ + +#ifdef _KERNEL +#include <sys/_lock.h> +#include <sys/_mutex.h> +#include <sys/_rwlock.h> +#endif + +#ifdef MALLOC_DECLARE +MALLOC_DECLARE(M_RTABLE); +#endif + +/* + * Radix search tree node layout. + */ + +struct radix_node { + struct radix_mask *rn_mklist; /* list of masks contained in subtree */ + struct radix_node *rn_parent; /* parent */ + short rn_bit; /* bit offset; -1-index(netmask) */ + char rn_bmask; /* node: mask for bit test*/ + u_char rn_flags; /* enumerated next */ +#define RNF_NORMAL 1 /* leaf contains normal route */ +#define RNF_ROOT 2 /* leaf is root leaf for tree */ +#define RNF_ACTIVE 4 /* This node is alive (for rtfree) */ + union { + struct { /* leaf only data: */ + caddr_t rn_Key; /* object of search */ + caddr_t rn_Mask; /* netmask, if present */ + struct radix_node *rn_Dupedkey; + } rn_leaf; + struct { /* node only data: */ + int rn_Off; /* where to start compare */ + struct radix_node *rn_L;/* progeny */ + struct radix_node *rn_R;/* progeny */ + } rn_node; + } rn_u; +#ifdef RN_DEBUG + int rn_info; + struct radix_node *rn_twin; + struct radix_node *rn_ybro; +#endif +}; + +#define rn_dupedkey rn_u.rn_leaf.rn_Dupedkey +#define rn_key rn_u.rn_leaf.rn_Key +#define rn_mask rn_u.rn_leaf.rn_Mask +#define rn_offset rn_u.rn_node.rn_Off +#define rn_left rn_u.rn_node.rn_L +#define rn_right rn_u.rn_node.rn_R + +/* + * Annotations to tree concerning potential routes applying to subtrees. + */ + +struct radix_mask { + short rm_bit; /* bit offset; -1-index(netmask) */ + char rm_unused; /* cf. rn_bmask */ + u_char rm_flags; /* cf. rn_flags */ + struct radix_mask *rm_mklist; /* more masks to try */ + union { + caddr_t rmu_mask; /* the mask */ + struct radix_node *rmu_leaf; /* for normal routes */ + } rm_rmu; + int rm_refs; /* # of references to this struct */ +}; + +#define rm_mask rm_rmu.rmu_mask +#define rm_leaf rm_rmu.rmu_leaf /* extra field would make 32 bytes */ + +typedef int walktree_f_t(struct radix_node *, void *); + +struct radix_node_head { + struct radix_node *rnh_treetop; + u_int rnh_gen; /* generation counter */ + int rnh_multipath; /* multipath capable ?
*/ + struct radix_node *(*rnh_addaddr) /* add based on sockaddr */ + (void *v, void *mask, + struct radix_node_head *head, struct radix_node nodes[]); + struct radix_node *(*rnh_deladdr) /* remove based on sockaddr */ + (void *v, void *mask, struct radix_node_head *head); + struct radix_node *(*rnh_matchaddr) /* longest match for sockaddr */ + (void *v, struct radix_node_head *head); + struct radix_node *(*rnh_lookup) /*exact match for sockaddr*/ + (void *v, void *mask, struct radix_node_head *head); + int (*rnh_walktree) /* traverse tree */ + (struct radix_node_head *head, walktree_f_t *f, void *w); + int (*rnh_walktree_from) /* traverse tree below a */ + (struct radix_node_head *head, void *a, void *m, + walktree_f_t *f, void *w); + void (*rnh_close) /* do something when the last ref drops */ + (struct radix_node *rn, struct radix_node_head *head); + struct radix_node rnh_nodes[3]; /* empty tree for common case */ + struct radix_node_head *rnh_masks; /* Storage for our masks */ +#ifdef _KERNEL + struct rwlock rnh_lock; /* locks entire radix tree */ +#endif +}; + +#ifndef _KERNEL +#define R_Malloc(p, t, n) (p = (t) malloc((unsigned int)(n))) +#define R_Zalloc(p, t, n) (p = (t) calloc(1,(unsigned int)(n))) +#define R_Free(p) free((char *)p); +#else +#define R_Malloc(p, t, n) (p = (t) malloc((unsigned long)(n), M_RTABLE, M_NOWAIT)) +#define R_Zalloc(p, t, n) (p = (t) malloc((unsigned long)(n), M_RTABLE, M_NOWAIT | M_ZERO)) +#define Free(p) free((caddr_t)p, M_RTABLE); + +#define RADIX_NODE_HEAD_LOCK_INIT(rnh) \ + rw_init_flags(&(rnh)->rnh_lock, "radix node head", 0) +#define RADIX_NODE_HEAD_LOCK(rnh) rw_wlock(&(rnh)->rnh_lock) +#define RADIX_NODE_HEAD_UNLOCK(rnh) rw_wunlock(&(rnh)->rnh_lock) +#define RADIX_NODE_HEAD_RLOCK(rnh) rw_rlock(&(rnh)->rnh_lock) +#define RADIX_NODE_HEAD_RUNLOCK(rnh) rw_runlock(&(rnh)->rnh_lock) +#define RADIX_NODE_HEAD_LOCK_TRY_UPGRADE(rnh) rw_try_upgrade(&(rnh)->rnh_lock) + + +#define RADIX_NODE_HEAD_DESTROY(rnh) rw_destroy(&(rnh)->rnh_lock) +#define RADIX_NODE_HEAD_LOCK_ASSERT(rnh) rw_assert(&(rnh)->rnh_lock, RA_LOCKED) +#define RADIX_NODE_HEAD_WLOCK_ASSERT(rnh) rw_assert(&(rnh)->rnh_lock, RA_WLOCKED) +#endif /* _KERNEL */ + +int rn_inithead(void **, int); +int rn_detachhead(void **); +int rn_refines(void *, void *); +struct radix_node + *rn_addmask(void *, struct radix_node_head *, int, int), + *rn_addroute (void *, void *, struct radix_node_head *, + struct radix_node [2]), + *rn_delete(void *, void *, struct radix_node_head *), + *rn_lookup (void *v_arg, void *m_arg, + struct radix_node_head *head), + *rn_match(void *, struct radix_node_head *); + +#endif /* _RADIX_H_ */ diff --git a/example/ipfw/sys/netgraph/ng_ipfw.h b/example/ipfw/sys/netgraph/ng_ipfw.h new file mode 100644 index 0000000..c60426e --- /dev/null +++ b/example/ipfw/sys/netgraph/ng_ipfw.h @@ -0,0 +1,33 @@ +/*- + * Copyright 2005, Gleb Smirnoff + * All rights reserved. + * + * Redistribution and use in source and binary forms, with or without + * modification, are permitted provided that the following conditions + * are met: + * 1. Redistributions of source code must retain the above copyright + * notice, this list of conditions and the following disclaimer. + * 2. Redistributions in binary form must reproduce the above copyright + * notice, this list of conditions and the following disclaimer in the + * documentation and/or other materials provided with the distribution. 
+ * + * THIS SOFTWARE IS PROVIDED BY THE AUTHOR AND CONTRIBUTORS ``AS IS'' AND + * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE + * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE + * ARE DISCLAIMED. IN NO EVENT SHALL THE AUTHOR OR CONTRIBUTORS BE LIABLE + * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL + * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS + * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) + * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT + * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY + * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF + * SUCH DAMAGE. + * + * $FreeBSD: head/sys/netgraph/ng_ipfw.h 201124 2009-12-28 12:29:13Z luigi $ + */ + +#ifndef _NG_IPFW_H +#define _NG_IPFW_H +#define NG_IPFW_NODE_TYPE "ipfw" +#define NGM_IPFW_COOKIE 1105988990 +#endif /* _NG_IPFW_H */ diff --git a/example/ipfw/sys/netinet/in_cksum.c b/example/ipfw/sys/netinet/in_cksum.c new file mode 100644 index 0000000..8d95ce5 --- /dev/null +++ b/example/ipfw/sys/netinet/in_cksum.c @@ -0,0 +1,146 @@ +/*- + * Copyright (c) 1988, 1992, 1993 + * The Regents of the University of California. All rights reserved. + * + * Redistribution and use in source and binary forms, with or without + * modification, are permitted provided that the following conditions + * are met: + * 1. Redistributions of source code must retain the above copyright + * notice, this list of conditions and the following disclaimer. + * 2. Redistributions in binary form must reproduce the above copyright + * notice, this list of conditions and the following disclaimer in the + * documentation and/or other materials provided with the distribution. + * 4. Neither the name of the University nor the names of its contributors + * may be used to endorse or promote products derived from this software + * without specific prior written permission. + * + * THIS SOFTWARE IS PROVIDED BY THE REGENTS AND CONTRIBUTORS ``AS IS'' AND + * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE + * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE + * ARE DISCLAIMED. IN NO EVENT SHALL THE REGENTS OR CONTRIBUTORS BE LIABLE + * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL + * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS + * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) + * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT + * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY + * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF + * SUCH DAMAGE. + * + * @(#)in_cksum.c 8.1 (Berkeley) 6/10/93 + */ + +#include <sys/cdefs.h> +__FBSDID("$FreeBSD: head/sys/netinet/in_cksum.c 238941 2012-07-31 08:04:49Z luigi $"); + +#include <sys/param.h> +#include <sys/mbuf.h> + +/* + * Checksum routine for Internet Protocol family headers (Portable Version). + * + * This routine is very heavily used in the network + * code and should be modified for each CPU to be as fast as possible. + */ + +#define ADDCARRY(x) (x > 65535 ?
x -= 65535 : x) +#define REDUCE {l_util.l = sum; sum = l_util.s[0] + l_util.s[1]; ADDCARRY(sum);} + +int +in_cksum(struct mbuf *m, int len) +{ + register u_short *w; + register int sum = 0; + register int mlen = 0; + int byte_swapped = 0; + + union { + char c[2]; + u_short s; + } s_util; + union { + u_short s[2]; + long l; + } l_util; + + for (;m && len; m = m->m_next) { + if (m->m_len == 0) + continue; + w = mtod(m, u_short *); + if (mlen == -1) { + /* + * The first byte of this mbuf is the continuation + * of a word spanning between this mbuf and the + * last mbuf. + * + * s_util.c[0] is already saved when scanning previous + * mbuf. + */ + s_util.c[1] = *(char *)w; + sum += s_util.s; + w = (u_short *)((char *)w + 1); + mlen = m->m_len - 1; + len--; + } else + mlen = m->m_len; + if (len < mlen) + mlen = len; + len -= mlen; + /* + * Force to even boundary. + */ + if ((1 & (uintptr_t) w) && (mlen > 0)) { + REDUCE; + sum <<= 8; + s_util.c[0] = *(u_char *)w; + w = (u_short *)((char *)w + 1); + mlen--; + byte_swapped = 1; + } + /* + * Unroll the loop to make overhead from + * branches &c small. + */ + while ((mlen -= 32) >= 0) { + sum += w[0]; sum += w[1]; sum += w[2]; sum += w[3]; + sum += w[4]; sum += w[5]; sum += w[6]; sum += w[7]; + sum += w[8]; sum += w[9]; sum += w[10]; sum += w[11]; + sum += w[12]; sum += w[13]; sum += w[14]; sum += w[15]; + w += 16; + } + mlen += 32; + while ((mlen -= 8) >= 0) { + sum += w[0]; sum += w[1]; sum += w[2]; sum += w[3]; + w += 4; + } + mlen += 8; + if (mlen == 0 && byte_swapped == 0) + continue; + REDUCE; + while ((mlen -= 2) >= 0) { + sum += *w++; + } + if (byte_swapped) { + REDUCE; + sum <<= 8; + byte_swapped = 0; + if (mlen == -1) { + s_util.c[1] = *(char *)w; + sum += s_util.s; + mlen = 0; + } else + mlen = -1; + } else if (mlen == -1) + s_util.c[0] = *(char *)w; + } + if (len) + printf("cksum: out of data\n"); + if (mlen == -1) { + /* The last mbuf has odd # of bytes. Follow the + standard (the odd byte may be shifted left by 8 bits + or not as determined by endian-ness of the machine) */ + s_util.c[1] = 0; + sum += s_util.s; + } + REDUCE; + return (~sum & 0xffff); +} diff --git a/example/ipfw/sys/netinet/ip_dummynet.h b/example/ipfw/sys/netinet/ip_dummynet.h new file mode 100644 index 0000000..3378e82 --- /dev/null +++ b/example/ipfw/sys/netinet/ip_dummynet.h @@ -0,0 +1,264 @@ +/*- + * Copyright (c) 1998-2010 Luigi Rizzo, Universita` di Pisa + * Portions Copyright (c) 2000 Akamba Corp. + * All rights reserved + * + * Redistribution and use in source and binary forms, with or without + * modification, are permitted provided that the following conditions + * are met: + * 1. Redistributions of source code must retain the above copyright + * notice, this list of conditions and the following disclaimer. + * 2. Redistributions in binary form must reproduce the above copyright + * notice, this list of conditions and the following disclaimer in the + * documentation and/or other materials provided with the distribution. + * + * THIS SOFTWARE IS PROVIDED BY THE AUTHOR AND CONTRIBUTORS ``AS IS'' AND + * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE + * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE + * ARE DISCLAIMED. 
IN NO EVENT SHALL THE AUTHOR OR CONTRIBUTORS BE LIABLE + * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL + * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS + * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) + * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT + * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY + * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF + * SUCH DAMAGE. + * + * $FreeBSD: head/sys/netinet/ip_dummynet.h 266941 2014-06-01 07:28:24Z hiren $ + */ + +#ifndef _IP_DUMMYNET_H +#define _IP_DUMMYNET_H + +/* + * Definition of the kernel-userland API for dummynet. + * + * Setsockopt() and getsockopt() pass a batch of objects, each + * of them starting with a "struct dn_id" which should fully identify + * the object and its relation with others in the sequence. + * The first object in each request should have + * type= DN_CMD_*, id = DN_API_VERSION. + * For other objects, type and subtype specify the object, len indicates + * the total length including the header, and 'id' identifies the specific + * object. + * + * Most objects are numbered with an identifier in the range 1..65535. + * DN_MAX_ID indicates the first value outside the range. + */ + +#define DN_API_VERSION 12500000 +#define DN_MAX_ID 0x10000 + +struct dn_id { + uint16_t len; /* total obj len including this header */ + uint8_t type; + uint8_t subtype; + uint32_t id; /* generic id */ +}; + +/* + * These values are in the type field of struct dn_id. + * To preserve the ABI, never rearrange the list or delete + * entries with the exception of DN_LAST + */ +enum { + DN_NONE = 0, + DN_LINK = 1, + DN_FS, + DN_SCH, + DN_SCH_I, + DN_QUEUE, + DN_DELAY_LINE, + DN_PROFILE, + DN_FLOW, /* struct dn_flow */ + DN_TEXT, /* opaque text is the object */ + + DN_CMD_CONFIG = 0x80, /* objects follow */ + DN_CMD_DELETE, /* subtype + list of entries */ + DN_CMD_GET, /* subtype + list of entries */ + DN_CMD_FLUSH, + /* for compatibility with FreeBSD 7.2/8 */ + DN_COMPAT_PIPE, + DN_COMPAT_QUEUE, + DN_GET_COMPAT, + + /* special commands for emulation of sysctl variables */ + DN_SYSCTL_GET, + DN_SYSCTL_SET, + + DN_LAST, +}; + +enum { /* subtype for schedulers, flowset and the like */ + DN_SCHED_UNKNOWN = 0, + DN_SCHED_FIFO = 1, + DN_SCHED_WF2QP = 2, + /* others are in individual modules */ +}; + +enum { /* user flags */ + DN_HAVE_MASK = 0x0001, /* fs or sched has a mask */ + DN_NOERROR = 0x0002, /* do not report errors */ + DN_QHT_HASH = 0x0004, /* qht is a hash table */ + DN_QSIZE_BYTES = 0x0008, /* queue size is in bytes */ + DN_HAS_PROFILE = 0x0010, /* a link has a profile */ + DN_IS_RED = 0x0020, + DN_IS_GENTLE_RED= 0x0040, + DN_IS_ECN = 0x0080, + DN_PIPE_CMD = 0x1000, /* pipe config... */ +}; + +/* + * link template. + */ +struct dn_link { + struct dn_id oid; + + /* + * Userland sets bw and delay in bits/s and milliseconds. + * The kernel converts this back and forth to bits/tick and ticks. + * XXX what about burst ? + */ + int32_t link_nr; + int bandwidth; /* bit/s or bits/tick. */ + int delay; /* ms and ticks */ + uint64_t burst; /* scaled. bits*Hz XXX */ +}; + +/* + * A flowset, which is a template for flows. Contains parameters + * from the command line: id, target scheduler, queue sizes, plr, + * flow masks, buckets for the flow hash, and possibly scheduler- + * specific parameters (weight, quantum and so on). 
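+ *
+ * As a rough illustration (the command syntax is shown for
+ * reference only), something like "ipfw queue 5 config sched 10
+ * weight 30 mask src-ip 0xff" would arrive here as a dn_fs with
+ * fs_nr = 5, sched_nr = 10, par[0] = 30 and a flow_mask that
+ * keeps the low byte of the source address.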
+ */ +struct dn_fs { + struct dn_id oid; + uint32_t fs_nr; /* the flowset number */ + uint32_t flags; /* userland flags */ + int qsize; /* queue size in slots or bytes */ + int32_t plr; /* PLR, pkt loss rate (2^31-1 means 100%) */ + uint32_t buckets; /* buckets used for the queue hash table */ + + struct ipfw_flow_id flow_mask; + uint32_t sched_nr; /* the scheduler we attach to */ + /* generic scheduler parameters. Leave them at -1 if unset. + * Now we use 0: weight, 1: lmax, 2: priority + */ + int par[4]; + + /* RED/GRED parameters. + * weight and probabilities are in the range 0..1 represented + * in fixed point arithmetic with SCALE_RED decimal bits. + */ +#define SCALE_RED 16 +#define SCALE(x) ( (x) << SCALE_RED ) +#define SCALE_VAL(x) ( (x) >> SCALE_RED ) +#define SCALE_MUL(x,y) ( ( (x) * (y) ) >> SCALE_RED ) + int w_q ; /* queue weight (scaled) */ + int max_th ; /* maximum threshold for queue (scaled) */ + int min_th ; /* minimum threshold for queue (scaled) */ + int max_p ; /* maximum value for p_b (scaled) */ + +}; + +/* + * dn_flow collects flow_id and stats for queues and scheduler + * instances, and is used to pass this information to userland. + * oid.type/oid.subtype describe the object, oid.id is number + * of the parent object. + */ +struct dn_flow { + struct dn_id oid; + struct ipfw_flow_id fid; + uint64_t tot_pkts; /* statistics counters */ + uint64_t tot_bytes; + uint32_t length; /* Queue length, in packets */ + uint32_t len_bytes; /* Queue length, in bytes */ + uint32_t drops; +}; + + +/* + * Scheduler template, mostly indicating the name, number, + * sched_mask and buckets. + */ +struct dn_sch { + struct dn_id oid; + uint32_t sched_nr; /* N, scheduler number */ + uint32_t buckets; /* number of buckets for the instances */ + uint32_t flags; /* have_mask, ... */ + + char name[16]; /* null terminated */ + /* mask to select the appropriate scheduler instance */ + struct ipfw_flow_id sched_mask; /* M */ +}; + + +/* A delay profile is attached to a link. + * Note that a profile, as any other object, cannot be longer than 2^16 + */ +#define ED_MAX_SAMPLES_NO 1024 +struct dn_profile { + struct dn_id oid; + /* fields to simulate a delay profile */ +#define ED_MAX_NAME_LEN 32 + char name[ED_MAX_NAME_LEN]; + int link_nr; + int loss_level; + int _bandwidth; // XXX use link bandwidth? unused ? + int samples_no; /* actual len of samples[] */ + int samples[ED_MAX_SAMPLES_NO]; /* may be shorter */ +}; + + + +/* + * Overall structure of dummynet + +In dummynet, packets are selected with the firewall rules, and passed +to two different objects: PIPE or QUEUE (bad name). + +A QUEUE defines a classifier, which groups packets into flows +according to a 'mask', puts them into independent queues (one +per flow) with configurable size and queue management policy, +and passes flows to a scheduler: + + (flow_mask|sched_mask) sched_mask + +---------+ weight Wx +-------------+ + | |->-[flow]-->--| |-+ + -->--| QUEUE x | ... | | | + | |->-[flow]-->--| SCHEDuler N | | + +---------+ | | | + ... | +--[LINK N]-->-- + +---------+ weight Wy | | +--[LINK N]-->-- + | |->-[flow]-->--| | | + -->--| QUEUE y | ... | | | + | |->-[flow]-->--| | | + +---------+ +-------------+ | + +-------------+ + +Many QUEUE objects can connect to the same scheduler, each +QUEUE object can have its own set of parameters. + +In turn, the SCHEDuler 'forks' multiple instances according +to a 'sched_mask', each instance manages its own set of queues +and transmits on a private instance of a configurable LINK.
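+
+(For instance, if sched_mask selects the source address, every source
+gets a private scheduler instance, and with it a private instance of
+the LINK and its bandwidth.)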
+ +A PIPE is a simplified version of the above, where there +is no flow_mask, and each scheduler instance handles a single queue. + +The following data structures (visible from userland) describe +the objects used by dummynet: + + + dn_link, contains the main configuration parameters related + to delay and bandwidth; + + dn_profile describes a delay profile; + + dn_flow describes the flow status (flow id, statistics) + + + dn_sch describes a scheduler + + dn_fs describes a flowset (msk, weight, queue parameters) + + * + */ + +#endif /* _IP_DUMMYNET_H */ diff --git a/example/ipfw/sys/netinet/ip_fw.h b/example/ipfw/sys/netinet/ip_fw.h new file mode 100644 index 0000000..7ec6f87 --- /dev/null +++ b/example/ipfw/sys/netinet/ip_fw.h @@ -0,0 +1,1009 @@ +/*- + * Copyright (c) 2002-2009 Luigi Rizzo, Universita` di Pisa + * + * Redistribution and use in source and binary forms, with or without + * modification, are permitted provided that the following conditions + * are met: + * 1. Redistributions of source code must retain the above copyright + * notice, this list of conditions and the following disclaimer. + * 2. Redistributions in binary form must reproduce the above copyright + * notice, this list of conditions and the following disclaimer in the + * documentation and/or other materials provided with the distribution. + * + * THIS SOFTWARE IS PROVIDED BY THE AUTHOR AND CONTRIBUTORS ``AS IS'' AND + * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE + * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE + * ARE DISCLAIMED. IN NO EVENT SHALL THE AUTHOR OR CONTRIBUTORS BE LIABLE + * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL + * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS + * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) + * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT + * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY + * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF + * SUCH DAMAGE. + * + * $FreeBSD: head/sys/netinet/ip_fw.h 273035 2014-10-13 13:49:28Z melifaro $ + */ + +#ifndef _IPFW2_H +#define _IPFW2_H + +/* + * The default rule number. By the design of ip_fw, the default rule + * is the last one, so its number can also serve as the highest number + * allowed for a rule. The ip_fw code relies on both meanings of this + * constant. + */ +#define IPFW_DEFAULT_RULE 65535 + +#define RESVD_SET 31 /*set for default and persistent rules*/ +#define IPFW_MAX_SETS 32 /* Number of sets supported by ipfw*/ + +/* + * Default number of ipfw tables. + */ +#define IPFW_TABLES_MAX 65535 +#define IPFW_TABLES_DEFAULT 128 + +/* + * Most commands (queue, pipe, tag, untag, limit...) can have a 16-bit + * argument between 1 and 65534. The value 0 (IP_FW_TARG) is used + * to represent the 'tablearg' value, i.e. to indicate the use of the + * result of the most recent table() lookup. + * Note that 16bit is only a historical limit, resulting from + * the use of a 16-bit field for that value. In reality, we can have + * 2^32 pipes, queues, tag values and so on. + */ +#define IPFW_ARG_MIN 1 +#define IPFW_ARG_MAX 65534 +#define IP_FW_TABLEARG 65535 /* Compat value for old clients */ +#define IP_FW_TARG 0 /* Current tablearg value */ + +/* + * Number of entries in the call stack of the call/return commands. + * Call stack currently is a uint16_t array with rule numbers.
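+ * In other words, at most IPFW_CALLSTACK_SIZE "call" actions can
+ * be outstanding for a packet before the corresponding "return"
+ * rules unwind them.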
+ */ +#define IPFW_CALLSTACK_SIZE 16 + +/* IP_FW3 header/opcodes */ +typedef struct _ip_fw3_opheader { + uint16_t opcode; /* Operation opcode */ + uint16_t version; /* Opcode version */ + uint16_t reserved[2]; /* Align to 64-bit boundary */ +} ip_fw3_opheader; + +/* IP_FW3 opcodes */ +#define IP_FW_TABLE_XADD 86 /* add entry */ +#define IP_FW_TABLE_XDEL 87 /* delete entry */ +#define IP_FW_TABLE_XGETSIZE 88 /* get table size (deprecated) */ +#define IP_FW_TABLE_XLIST 89 /* list table contents */ +#define IP_FW_TABLE_XDESTROY 90 /* destroy table */ +#define IP_FW_TABLES_XLIST 92 /* list all tables */ +#define IP_FW_TABLE_XINFO 93 /* request info for one table */ +#define IP_FW_TABLE_XFLUSH 94 /* flush table data */ +#define IP_FW_TABLE_XCREATE 95 /* create new table */ +#define IP_FW_TABLE_XMODIFY 96 /* modify existing table */ +#define IP_FW_XGET 97 /* Retrieve configuration */ +#define IP_FW_XADD 98 /* add rule */ +#define IP_FW_XDEL 99 /* del rule */ +#define IP_FW_XMOVE 100 /* move rules to different set */ +#define IP_FW_XZERO 101 /* clear accounting */ +#define IP_FW_XRESETLOG 102 /* zero rules logs */ +#define IP_FW_SET_SWAP 103 /* Swap between 2 sets */ +#define IP_FW_SET_MOVE 104 /* Move one set to another one */ +#define IP_FW_SET_ENABLE 105 /* Enable/disable sets */ +#define IP_FW_TABLE_XFIND 106 /* finds an entry */ +#define IP_FW_XIFLIST 107 /* list tracked interfaces */ +#define IP_FW_TABLES_ALIST 108 /* list table algorithms */ +#define IP_FW_TABLE_XSWAP 109 /* swap two tables */ +#define IP_FW_TABLE_VLIST 110 /* dump table value hash */ + +#define IP_FW_NAT44_XCONFIG 111 /* Create/modify NAT44 instance */ +#define IP_FW_NAT44_DESTROY 112 /* Destroys NAT44 instance */ +#define IP_FW_NAT44_XGETCONFIG 113 /* Get NAT44 instance config */ +#define IP_FW_NAT44_LIST_NAT 114 /* List all NAT44 instances */ +#define IP_FW_NAT44_XGETLOG 115 /* Get log from NAT44 instance */ + +#define IP_FW_DUMP_SOPTCODES 116 /* Dump available sopts/versions */ + +/* + * The kernel representation of ipfw rules is made of a list of + * 'instructions' (for all practical purposes equivalent to BPF + * instructions), which specify which fields of the packet + * (or its metadata) should be analysed. + * + * Each instruction is stored in a structure which begins with + * "ipfw_insn", and can contain extra fields depending on the + * instruction type (listed below). + * Note that the code is written so that individual instructions + * have a size which is a multiple of 32 bits. This means that, if + * such structures contain pointers or other 64-bit entities, + * (there is just one instance now) they may end up unaligned on + * 64-bit architectures, so they must be handled with care. + * + * "enum ipfw_opcodes" are the opcodes supported. We can have up + * to 256 different opcodes. When adding new opcodes, they should + * be appended to the end of the opcode list before O_LAST_OPCODE; + * this will prevent the ABI from being broken, otherwise users + * will have to recompile ipfw(8) when they update the kernel.
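+ *
+ * As a rough sketch of this encoding, a rule such as
+ * "deny tcp from 10.0.0.1 to any" becomes three instructions:
+ * an O_PROTO insn with arg1 = IPPROTO_TCP, an O_IP_SRC insn
+ * (an ipfw_insn_u32 carrying 10.0.0.1) and an O_DENY action,
+ * each occupying a multiple of 32 bits.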
+ */ + +enum ipfw_opcodes { /* arguments (4 byte each) */ + O_NOP, + + O_IP_SRC, /* u32 = IP */ + O_IP_SRC_MASK, /* ip = IP/mask */ + O_IP_SRC_ME, /* none */ + O_IP_SRC_SET, /* u32=base, arg1=len, bitmap */ + + O_IP_DST, /* u32 = IP */ + O_IP_DST_MASK, /* ip = IP/mask */ + O_IP_DST_ME, /* none */ + O_IP_DST_SET, /* u32=base, arg1=len, bitmap */ + + O_IP_SRCPORT, /* (n)port list:mask 4 byte ea */ + O_IP_DSTPORT, /* (n)port list:mask 4 byte ea */ + O_PROTO, /* arg1=protocol */ + + O_MACADDR2, /* 2 mac addr:mask */ + O_MAC_TYPE, /* same as srcport */ + + O_LAYER2, /* none */ + O_IN, /* none */ + O_FRAG, /* none */ + + O_RECV, /* none */ + O_XMIT, /* none */ + O_VIA, /* none */ + + O_IPOPT, /* arg1 = 2*u8 bitmap */ + O_IPLEN, /* arg1 = len */ + O_IPID, /* arg1 = id */ + + O_IPTOS, /* arg1 = id */ + O_IPPRECEDENCE, /* arg1 = precedence << 5 */ + O_IPTTL, /* arg1 = TTL */ + + O_IPVER, /* arg1 = version */ + O_UID, /* u32 = id */ + O_GID, /* u32 = id */ + O_ESTAB, /* none (tcp established) */ + O_TCPFLAGS, /* arg1 = 2*u8 bitmap */ + O_TCPWIN, /* arg1 = desired win */ + O_TCPSEQ, /* u32 = desired seq. */ + O_TCPACK, /* u32 = desired seq. */ + O_ICMPTYPE, /* u32 = icmp bitmap */ + O_TCPOPTS, /* arg1 = 2*u8 bitmap */ + + O_VERREVPATH, /* none */ + O_VERSRCREACH, /* none */ + + O_PROBE_STATE, /* none */ + O_KEEP_STATE, /* none */ + O_LIMIT, /* ipfw_insn_limit */ + O_LIMIT_PARENT, /* dyn_type, not an opcode. */ + + /* + * These are really 'actions'. + */ + + O_LOG, /* ipfw_insn_log */ + O_PROB, /* u32 = match probability */ + + O_CHECK_STATE, /* none */ + O_ACCEPT, /* none */ + O_DENY, /* none */ + O_REJECT, /* arg1=icmp arg (same as deny) */ + O_COUNT, /* none */ + O_SKIPTO, /* arg1=next rule number */ + O_PIPE, /* arg1=pipe number */ + O_QUEUE, /* arg1=queue number */ + O_DIVERT, /* arg1=port number */ + O_TEE, /* arg1=port number */ + O_FORWARD_IP, /* fwd sockaddr */ + O_FORWARD_MAC, /* fwd mac */ + O_NAT, /* nope */ + O_REASS, /* none */ + + /* + * More opcodes. + */ + O_IPSEC, /* has ipsec history */ + O_IP_SRC_LOOKUP, /* arg1=table number, u32=value */ + O_IP_DST_LOOKUP, /* arg1=table number, u32=value */ + O_ANTISPOOF, /* none */ + O_JAIL, /* u32 = id */ + O_ALTQ, /* u32 = altq classif. qid */ + O_DIVERTED, /* arg1=bitmap (1:loop, 2:out) */ + O_TCPDATALEN, /* arg1 = tcp data len */ + O_IP6_SRC, /* address without mask */ + O_IP6_SRC_ME, /* my addresses */ + O_IP6_SRC_MASK, /* address with the mask */ + O_IP6_DST, + O_IP6_DST_ME, + O_IP6_DST_MASK, + O_FLOW6ID, /* for flow id tag in the ipv6 pkt */ + O_ICMP6TYPE, /* icmp6 packet type filtering */ + O_EXT_HDR, /* filtering for ipv6 extension header */ + O_IP6, + + /* + * actions for ng_ipfw + */ + O_NETGRAPH, /* send to ng_ipfw */ + O_NGTEE, /* copy to ng_ipfw */ + + O_IP4, + + O_UNREACH6, /* arg1=icmpv6 code arg (deny) */ + + O_TAG, /* arg1=tag number */ + O_TAGGED, /* arg1=tag number */ + + O_SETFIB, /* arg1=FIB number */ + O_FIB, /* arg1=FIB desired fib number */ + + O_SOCKARG, /* socket argument */ + + O_CALLRETURN, /* arg1=called rule number */ + + O_FORWARD_IP6, /* fwd sockaddr_in6 */ + + O_DSCP, /* 2 u32 = DSCP mask */ + O_SETDSCP, /* arg1=DSCP value */ + O_IP_FLOW_LOOKUP, /* arg1=table number, u32=value */ + + O_LAST_OPCODE /* not an opcode! */ +}; + + +/* + * The extension headers are filtered only for presence using a bit + * vector with a flag for each header.
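+ * E.g. a match on packets carrying both a fragment and an AH
+ * header tests the value (EXT_FRAGMENT|EXT_AH), i.e. 0x09.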
+ */ +#define EXT_FRAGMENT 0x1 +#define EXT_HOPOPTS 0x2 +#define EXT_ROUTING 0x4 +#define EXT_AH 0x8 +#define EXT_ESP 0x10 +#define EXT_DSTOPTS 0x20 +#define EXT_RTHDR0 0x40 +#define EXT_RTHDR2 0x80 + +/* + * Template for instructions. + * + * ipfw_insn is used for all instructions which require no operands, + * a single 16-bit value (arg1), or a couple of 8-bit values. + * + * For other instructions which require different/larger arguments + * we have derived structures, ipfw_insn_*. + * + * The size of the instruction (in 32-bit words) is in the low + * 6 bits of "len". The 2 remaining bits are used to implement + * NOT and OR on individual instructions. Given a type, you can + * compute the length to be put in "len" using F_INSN_SIZE(t) + * + * F_NOT negates the match result of the instruction. + * + * F_OR is used to build or blocks. By default, instructions + * are evaluated as part of a logical AND. An "or" block + * { X or Y or Z } contains F_OR set in all but the last + * instruction of the block. A match will cause the code + * to skip past the last instruction of the block. + * + * NOTA BENE: in a couple of places we assume that + * sizeof(ipfw_insn) == sizeof(u_int32_t) + * this needs to be fixed. + * + */ +typedef struct _ipfw_insn { /* template for instructions */ + u_int8_t opcode; + u_int8_t len; /* number of 32-bit words */ +#define F_NOT 0x80 +#define F_OR 0x40 +#define F_LEN_MASK 0x3f +#define F_LEN(cmd) ((cmd)->len & F_LEN_MASK) + + u_int16_t arg1; +} ipfw_insn; + +/* + * The F_INSN_SIZE(type) computes the size, in 4-byte words, of + * a given type. + */ +#define F_INSN_SIZE(t) ((sizeof (t))/sizeof(u_int32_t)) + +/* + * This is used to store an array of 16-bit entries (ports etc.) + */ +typedef struct _ipfw_insn_u16 { + ipfw_insn o; + u_int16_t ports[2]; /* there may be more */ +} ipfw_insn_u16; + +/* + * This is used to store an array of 32-bit entries + * (uid, single IPv4 addresses etc.) + */ +typedef struct _ipfw_insn_u32 { + ipfw_insn o; + u_int32_t d[1]; /* one or more */ +} ipfw_insn_u32; + +/* + * This is used to store IP addr-mask pairs. + */ +typedef struct _ipfw_insn_ip { + ipfw_insn o; + struct in_addr addr; + struct in_addr mask; +} ipfw_insn_ip; + +/* + * This is used to forward to a given address (ip). + */ +typedef struct _ipfw_insn_sa { + ipfw_insn o; + struct sockaddr_in sa; +} ipfw_insn_sa; + +/* + * This is used to forward to a given address (ipv6). + */ +typedef struct _ipfw_insn_sa6 { + ipfw_insn o; + struct sockaddr_in6 sa; +} ipfw_insn_sa6; + +/* + * This is used for MAC addr-mask pairs. + */ +typedef struct _ipfw_insn_mac { + ipfw_insn o; + u_char addr[12]; /* dst[6] + src[6] */ + u_char mask[12]; /* dst[6] + src[6] */ +} ipfw_insn_mac; + +/* + * This is used for interface match rules (recv xx, xmit xx). + */ +typedef struct _ipfw_insn_if { + ipfw_insn o; + union { + struct in_addr ip; + int glob; + uint16_t kidx; + } p; + char name[IFNAMSIZ]; +} ipfw_insn_if; + +/* + * This is used for storing an altq queue id number. + */ +typedef struct _ipfw_insn_altq { + ipfw_insn o; + u_int32_t qid; +} ipfw_insn_altq; + +/* + * This is used for limit rules. + */ +typedef struct _ipfw_insn_limit { + ipfw_insn o; + u_int8_t _pad; + u_int8_t limit_mask; /* combination of DYN_* below */ +#define DYN_SRC_ADDR 0x1 +#define DYN_SRC_PORT 0x2 +#define DYN_DST_ADDR 0x4 +#define DYN_DST_PORT 0x8 + + u_int16_t conn_limit; +} ipfw_insn_limit; + +/* + * This is used for log instructions. 
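+ * max_log is the configured limit (0 = log every match) and
+ * log_left counts down as packets are logged, so logging stops
+ * once log_left reaches zero.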
+ */ +typedef struct _ipfw_insn_log { + ipfw_insn o; + u_int32_t max_log; /* how many do we log -- 0 = all */ + u_int32_t log_left; /* how many left to log */ +} ipfw_insn_log; + +/* Legacy NAT structures, compat only */ +#ifndef _KERNEL +/* + * Data structures required by both ipfw(8) and ipfw(4) but not part of the + * management API are protected by IPFW_INTERNAL. + */ +#ifdef IPFW_INTERNAL +/* Server pool support (LSNAT). */ +struct cfg_spool { + LIST_ENTRY(cfg_spool) _next; /* chain of spool instances */ + struct in_addr addr; + u_short port; +}; +#endif + +/* Redirect modes id. */ +#define REDIR_ADDR 0x01 +#define REDIR_PORT 0x02 +#define REDIR_PROTO 0x04 + +#ifdef IPFW_INTERNAL +/* Nat redirect configuration. */ +struct cfg_redir { + LIST_ENTRY(cfg_redir) _next; /* chain of redir instances */ + u_int16_t mode; /* type of redirect mode */ + struct in_addr laddr; /* local ip address */ + struct in_addr paddr; /* public ip address */ + struct in_addr raddr; /* remote ip address */ + u_short lport; /* local port */ + u_short pport; /* public port */ + u_short rport; /* remote port */ + u_short pport_cnt; /* number of public ports */ + u_short rport_cnt; /* number of remote ports */ + int proto; /* protocol: tcp/udp */ + struct alias_link **alink; + /* num of entry in spool chain */ + u_int16_t spool_cnt; + /* chain of spool instances */ + LIST_HEAD(spool_chain, cfg_spool) spool_chain; +}; +#endif + +#ifdef IPFW_INTERNAL +/* Nat configuration data struct. */ +struct cfg_nat { + /* chain of nat instances */ + LIST_ENTRY(cfg_nat) _next; + int id; /* nat id */ + struct in_addr ip; /* nat ip address */ + char if_name[IF_NAMESIZE]; /* interface name */ + int mode; /* aliasing mode */ + struct libalias *lib; /* libalias instance */ + /* number of entry in spool chain */ + int redir_cnt; + /* chain of redir instances */ + LIST_HEAD(redir_chain, cfg_redir) redir_chain; +}; +#endif + +#define SOF_NAT sizeof(struct cfg_nat) +#define SOF_REDIR sizeof(struct cfg_redir) +#define SOF_SPOOL sizeof(struct cfg_spool) + +#endif /* ifndef _KERNEL */ + + +struct nat44_cfg_spool { + struct in_addr addr; + uint16_t port; + uint16_t spare; +}; +#define NAT44_REDIR_ADDR 0x01 +#define NAT44_REDIR_PORT 0x02 +#define NAT44_REDIR_PROTO 0x04 + +/* Nat redirect configuration. */ +struct nat44_cfg_redir { + struct in_addr laddr; /* local ip address */ + struct in_addr paddr; /* public ip address */ + struct in_addr raddr; /* remote ip address */ + uint16_t lport; /* local port */ + uint16_t pport; /* public port */ + uint16_t rport; /* remote port */ + uint16_t pport_cnt; /* number of public ports */ + uint16_t rport_cnt; /* number of remote ports */ + uint16_t mode; /* type of redirect mode */ + uint16_t spool_cnt; /* num of entry in spool chain */ + uint16_t spare; + uint32_t proto; /* protocol: tcp/udp */ +}; + +/* Nat configuration data struct. */ +struct nat44_cfg_nat { + char name[64]; /* nat name */ + char if_name[64]; /* interface name */ + uint32_t size; /* structure size incl. redirs */ + struct in_addr ip; /* nat IPv4 address */ + uint32_t mode; /* aliasing mode */ + uint32_t redir_cnt; /* number of entry in spool chain */ +}; + +/* Nat command. 
*/ +typedef struct _ipfw_insn_nat { + ipfw_insn o; + struct cfg_nat *nat; +} ipfw_insn_nat; + +/* Apply ipv6 mask on ipv6 addr */ +#define APPLY_MASK(addr,mask) \ + (addr)->__u6_addr.__u6_addr32[0] &= (mask)->__u6_addr.__u6_addr32[0]; \ + (addr)->__u6_addr.__u6_addr32[1] &= (mask)->__u6_addr.__u6_addr32[1]; \ + (addr)->__u6_addr.__u6_addr32[2] &= (mask)->__u6_addr.__u6_addr32[2]; \ + (addr)->__u6_addr.__u6_addr32[3] &= (mask)->__u6_addr.__u6_addr32[3]; + +/* Structure for ipv6 */ +typedef struct _ipfw_insn_ip6 { + ipfw_insn o; + struct in6_addr addr6; + struct in6_addr mask6; +} ipfw_insn_ip6; + +/* Used to support icmp6 types */ +typedef struct _ipfw_insn_icmp6 { + ipfw_insn o; + uint32_t d[7]; /* XXX This number is related to the netinet/icmp6.h + * define ICMP6_MAXTYPE + * as follows: n = ICMP6_MAXTYPE/32 + 1 + * Actually is 203 + */ +} ipfw_insn_icmp6; + +/* + * Here we have the structure representing an ipfw rule. + * + * Layout: + * struct ip_fw_rule + * [ counter block, size = rule->cntr_len ] + * [ one or more instructions, size = rule->cmd_len * 4 ] + * + * It starts with a general area (with link fields). + * Counter block may be next (if rule->cntr_len > 0), + * followed by an array of one or more instructions, which the code + * accesses as an array of 32-bit values. rule->cmd_len represents + * the total instruction length in u32 words, while act_ofs represents + * the rule action offset in u32 words. + * + * When assembling instructions, remember the following: + * + * + if a rule has a "keep-state" (or "limit") option, then the + * first instruction (at r->cmd) MUST BE an O_PROBE_STATE + * + if a rule has a "log" option, then the first action + * (at ACTION_PTR(r)) MUST be O_LOG + * + if a rule has an "altq" option, it comes after "log" + * + if a rule has an O_TAG option, it comes after "log" and "altq" + * + * + * All structures (excluding instructions) are u64-aligned. + * Please keep this. + */ + +struct ip_fw_rule { + uint16_t act_ofs; /* offset of action in 32-bit units */ + uint16_t cmd_len; /* # of 32-bit words in cmd */ + uint16_t spare; + uint8_t set; /* rule set (0..31) */ + uint8_t flags; /* rule flags */ + uint32_t rulenum; /* rule number */ + uint32_t id; /* rule id */ + + ipfw_insn cmd[1]; /* storage for commands */ +}; +#define IPFW_RULE_NOOPT 0x01 /* Has no options in body */ + +/* Unaligned version */ + +/* Base ipfw rule counter block. */ +struct ip_fw_bcounter { + uint16_t size; /* Size of counter block, bytes */ + uint8_t flags; /* flags for given block */ + uint8_t spare; + uint32_t timestamp; /* tv_sec of last match */ + uint64_t pcnt; /* Packet counter */ + uint64_t bcnt; /* Byte counter */ +}; + + +#ifndef _KERNEL +/* + * Legacy rule format + */ +struct ip_fw { + struct ip_fw *x_next; /* linked list of rules */ + struct ip_fw *next_rule; /* ptr to next [skipto] rule */ + /* 'next_rule' is used to pass up 'set_disable' status */ + + uint16_t act_ofs; /* offset of action in 32-bit units */ + uint16_t cmd_len; /* # of 32-bit words in cmd */ + uint16_t rulenum; /* rule number */ + uint8_t set; /* rule set (0..31) */ + uint8_t _pad; /* padding */ + uint32_t id; /* rule id */ + + /* These fields are present in all rules.
*/ + uint64_t pcnt; /* Packet counter */ + uint64_t bcnt; /* Byte counter */ + uint32_t timestamp; /* tv_sec of last match */ + + ipfw_insn cmd[1]; /* storage for commands */ +}; +#endif + +#define ACTION_PTR(rule) \ + (ipfw_insn *)( (u_int32_t *)((rule)->cmd) + ((rule)->act_ofs) ) + +#define RULESIZE(rule) (sizeof(*(rule)) + (rule)->cmd_len * 4 - 4) + + +#if 1 // should be moved to in.h +/* + * This structure is used as a flow mask and a flow id for various + * parts of the code. + * addr_type is used in userland and kernel to mark the address type. + * fib is used in the kernel to record the fib in use. + * _flags is used in the kernel to store tcp flags for dynamic rules. + */ +struct ipfw_flow_id { + uint32_t dst_ip; + uint32_t src_ip; + uint16_t dst_port; + uint16_t src_port; + uint8_t fib; + uint8_t proto; + uint8_t _flags; /* protocol-specific flags */ + uint8_t addr_type; /* 4=ip4, 6=ip6, 1=ether ? */ + struct in6_addr dst_ip6; + struct in6_addr src_ip6; + uint32_t flow_id6; + uint32_t extra; /* queue/pipe or frag_id */ +}; +#endif + +#define IS_IP6_FLOW_ID(id) ((id)->addr_type == 6) + +/* + * Dynamic ipfw rule. + */ +typedef struct _ipfw_dyn_rule ipfw_dyn_rule; + +struct _ipfw_dyn_rule { + ipfw_dyn_rule *next; /* linked list of rules. */ + struct ip_fw *rule; /* pointer to rule */ + /* 'rule' is used to pass up the rule number (from the parent) */ + + ipfw_dyn_rule *parent; /* pointer to parent rule */ + u_int64_t pcnt; /* packet match counter */ + u_int64_t bcnt; /* byte match counter */ + struct ipfw_flow_id id; /* (masked) flow id */ + u_int32_t expire; /* expire time */ + u_int32_t bucket; /* which bucket in hash table */ + u_int32_t state; /* state of this rule (typically a + * combination of TCP flags) + */ + u_int32_t ack_fwd; /* most recent ACKs in forward */ + u_int32_t ack_rev; /* and reverse directions (used */ + /* to generate keepalives) */ + u_int16_t dyn_type; /* rule type */ + u_int16_t count; /* refcount */ +}; + +/* + * Definitions for IP option names. + */ +#define IP_FW_IPOPT_LSRR 0x01 +#define IP_FW_IPOPT_SSRR 0x02 +#define IP_FW_IPOPT_RR 0x04 +#define IP_FW_IPOPT_TS 0x08 + +/* + * Definitions for TCP option names. + */ +#define IP_FW_TCPOPT_MSS 0x01 +#define IP_FW_TCPOPT_WINDOW 0x02 +#define IP_FW_TCPOPT_SACK 0x04 +#define IP_FW_TCPOPT_TS 0x08 +#define IP_FW_TCPOPT_CC 0x10 + +#define ICMP_REJECT_RST 0x100 /* fake ICMP code (send a TCP RST) */ +#define ICMP6_UNREACH_RST 0x100 /* fake ICMPv6 code (send a TCP RST) */ + +/* + * These are used for lookup tables. 
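+ * (Illustrative editorial note: a table of type IPFW_TABLE_ADDR below
+ * holds IPv4/IPv6 prefixes and is what userland manipulates with,
+ * e.g., "ipfw table T create type addr" followed by
+ * "ipfw table T add 10.0.0.0/8 100"; how the value 100 may be used is
+ * described by the IPFW_VTYPE_* bits further down.)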
+ */ + +#define IPFW_TABLE_ADDR 1 /* Table for holding IPv4/IPv6 prefixes */ +#define IPFW_TABLE_INTERFACE 2 /* Table for holding interface names */ +#define IPFW_TABLE_NUMBER 3 /* Table for holding ports/uid/gid/etc */ +#define IPFW_TABLE_FLOW 4 /* Table for holding flow data */ +#define IPFW_TABLE_MAXTYPE 4 /* Maximum valid number */ + +#define IPFW_TABLE_CIDR IPFW_TABLE_ADDR /* compat */ + +/* Value types */ +#define IPFW_VTYPE_LEGACY 0xFFFFFFFF /* All data is filled in */ +#define IPFW_VTYPE_SKIPTO 0x00000001 /* skipto/call/callreturn */ +#define IPFW_VTYPE_PIPE 0x00000002 /* pipe/queue */ +#define IPFW_VTYPE_FIB 0x00000004 /* setfib */ +#define IPFW_VTYPE_NAT 0x00000008 /* nat */ +#define IPFW_VTYPE_DSCP 0x00000010 /* dscp */ +#define IPFW_VTYPE_TAG 0x00000020 /* tag/untag */ +#define IPFW_VTYPE_DIVERT 0x00000040 /* divert/tee */ +#define IPFW_VTYPE_NETGRAPH 0x00000080 /* netgraph/ngtee */ +#define IPFW_VTYPE_LIMIT 0x00000100 /* limit */ +#define IPFW_VTYPE_NH4 0x00000200 /* IPv4 nexthop */ +#define IPFW_VTYPE_NH6 0x00000400 /* IPv6 nexthop */ + +typedef struct _ipfw_table_entry { + in_addr_t addr; /* network address */ + u_int32_t value; /* value */ + u_int16_t tbl; /* table number */ + u_int8_t masklen; /* mask length */ +} ipfw_table_entry; + +typedef struct _ipfw_table_xentry { + uint16_t len; /* Total entry length */ + uint8_t type; /* entry type */ + uint8_t masklen; /* mask length */ + uint16_t tbl; /* table number */ + uint16_t flags; /* record flags */ + uint32_t value; /* value */ + union { + /* Longest field needs to be aligned by 4-byte boundary */ + struct in6_addr addr6; /* IPv6 address */ + char iface[IF_NAMESIZE]; /* interface name */ + } k; +} ipfw_table_xentry; +#define IPFW_TCF_INET 0x01 /* CIDR flags: IPv4 record */ + +typedef struct _ipfw_table { + u_int32_t size; /* size of entries in bytes */ + u_int32_t cnt; /* # of entries */ + u_int16_t tbl; /* table number */ + ipfw_table_entry ent[0]; /* entries */ +} ipfw_table; + +typedef struct _ipfw_xtable { + ip_fw3_opheader opheader; /* IP_FW3 opcode */ + uint32_t size; /* size of entries in bytes */ + uint32_t cnt; /* # of entries */ + uint16_t tbl; /* table number */ + uint8_t type; /* table type */ + ipfw_table_xentry xent[0]; /* entries */ +} ipfw_xtable; + +typedef struct _ipfw_obj_tlv { + uint16_t type; /* TLV type */ + uint16_t flags; /* TLV-specific flags */ + uint32_t length; /* Total length, aligned to u64 */ +} ipfw_obj_tlv; +#define IPFW_TLV_TBL_NAME 1 +#define IPFW_TLV_TBLNAME_LIST 2 +#define IPFW_TLV_RULE_LIST 3 +#define IPFW_TLV_DYNSTATE_LIST 4 +#define IPFW_TLV_TBL_ENT 5 +#define IPFW_TLV_DYN_ENT 6 +#define IPFW_TLV_RULE_ENT 7 +#define IPFW_TLV_TBLENT_LIST 8 +#define IPFW_TLV_RANGE 9 + +/* Object name TLV */ +typedef struct _ipfw_obj_ntlv { + ipfw_obj_tlv head; /* TLV header */ + uint16_t idx; /* Name index */ + uint8_t spare; /* unused */ + uint8_t type; /* object type, if applicable */ + uint32_t set; /* set, if applicable */ + char name[64]; /* Null-terminated name */ +} ipfw_obj_ntlv; + +/* IPv4/IPv6 L4 flow description */ +struct tflow_entry { + uint8_t af; + uint8_t proto; + uint16_t spare; + uint16_t sport; + uint16_t dport; + union { + struct { + struct in_addr sip; + struct in_addr dip; + } a4; + struct { + struct in6_addr sip6; + struct in6_addr dip6; + } a6; + } a; +}; + +typedef struct _ipfw_table_value { + uint32_t tag; /* O_TAG/O_TAGGED */ + uint32_t pipe; /* O_PIPE/O_QUEUE */ + uint16_t divert; /* O_DIVERT/O_TEE */ + uint16_t skipto; /* skipto, CALLRET */ + uint32_t netgraph; /* 
O_NETGRAPH/O_NGTEE */ + uint32_t fib; /* O_SETFIB */ + uint32_t nat; /* O_NAT */ + uint32_t nh4; + uint8_t dscp; + uint8_t spare0[3]; + struct in6_addr nh6; + uint32_t limit; /* O_LIMIT */ + uint32_t spare1; + uint64_t reserved; +} ipfw_table_value; + +/* Table entry TLV */ +typedef struct _ipfw_obj_tentry { + ipfw_obj_tlv head; /* TLV header */ + uint8_t subtype; /* subtype (IPv4,IPv6) */ + uint8_t masklen; /* mask length */ + uint8_t result; /* request result */ + uint8_t spare0; + uint16_t idx; /* Table name index */ + uint16_t spare1; + union { + /* Longest field needs to be aligned by 8-byte boundary */ + struct in_addr addr; /* IPv4 address */ + uint32_t key; /* uid/gid/port */ + struct in6_addr addr6; /* IPv6 address */ + char iface[IF_NAMESIZE]; /* interface name */ + struct tflow_entry flow; + } k; + union { + ipfw_table_value value; /* value data */ + uint32_t kidx; /* value kernel index */ + } v; +} ipfw_obj_tentry; +#define IPFW_TF_UPDATE 0x01 /* Update record if exists */ +/* Container TLV */ +#define IPFW_CTF_ATOMIC 0x01 /* Perform atomic operation */ +/* Operation results */ +#define IPFW_TR_IGNORED 0 /* Entry was ignored (rollback) */ +#define IPFW_TR_ADDED 1 /* Entry was succesfully added */ +#define IPFW_TR_UPDATED 2 /* Entry was succesfully updated*/ +#define IPFW_TR_DELETED 3 /* Entry was succesfully deleted*/ +#define IPFW_TR_LIMIT 4 /* Entry was ignored (limit) */ +#define IPFW_TR_NOTFOUND 5 /* Entry was not found */ +#define IPFW_TR_EXISTS 6 /* Entry already exists */ +#define IPFW_TR_ERROR 7 /* Request has failed (unknown) */ + +typedef struct _ipfw_obj_dyntlv { + ipfw_obj_tlv head; + ipfw_dyn_rule state; +} ipfw_obj_dyntlv; +#define IPFW_DF_LAST 0x01 /* Last state in chain */ + +/* Containter TLVs */ +typedef struct _ipfw_obj_ctlv { + ipfw_obj_tlv head; /* TLV header */ + uint32_t count; /* Number of sub-TLVs */ + uint16_t objsize; /* Single object size */ + uint8_t version; /* TLV version */ + uint8_t flags; /* TLV-specific flags */ +} ipfw_obj_ctlv; + +/* Range TLV */ +typedef struct _ipfw_range_tlv { + ipfw_obj_tlv head; /* TLV header */ + uint32_t flags; /* Range flags */ + uint16_t start_rule; /* Range start */ + uint16_t end_rule; /* Range end */ + uint32_t set; /* Range set to match */ + uint32_t new_set; /* New set to move/swap to */ +} ipfw_range_tlv; +#define IPFW_RCFLAG_RANGE 0x01 /* rule range is set */ +#define IPFW_RCFLAG_ALL 0x02 /* match ALL rules */ +#define IPFW_RCFLAG_SET 0x04 /* match rules in given set */ +/* User-settable flags */ +#define IPFW_RCFLAG_USER (IPFW_RCFLAG_RANGE | IPFW_RCFLAG_ALL | \ + IPFW_RCFLAG_SET) +/* Internally used flags */ +#define IPFW_RCFLAG_DEFAULT 0x0100 /* Do not skip defaul rule */ + +typedef struct _ipfw_ta_tinfo { + uint32_t flags; /* Format flags */ + uint32_t spare; + uint8_t taclass4; /* algorithm class */ + uint8_t spare4; + uint16_t itemsize4; /* item size in runtime */ + uint32_t size4; /* runtime structure size */ + uint32_t count4; /* number of items in runtime */ + uint8_t taclass6; /* algorithm class */ + uint8_t spare6; + uint16_t itemsize6; /* item size in runtime */ + uint32_t size6; /* runtime structure size */ + uint32_t count6; /* number of items in runtime */ +} ipfw_ta_tinfo; +#define IPFW_TACLASS_HASH 1 /* algo is based on hash */ +#define IPFW_TACLASS_ARRAY 2 /* algo is based on array */ +#define IPFW_TACLASS_RADIX 3 /* algo is based on radix tree */ + +#define IPFW_TATFLAGS_DATA 0x0001 /* Has data filled in */ +#define IPFW_TATFLAGS_AFDATA 0x0002 /* Separate data per AF */ +#define 
IPFW_TATFLAGS_AFITEM 0x0004 /* diff. items per AF */ + +typedef struct _ipfw_xtable_info { + uint8_t type; /* table type (addr,iface,..) */ + uint8_t tflags; /* type flags */ + uint16_t mflags; /* modification flags */ + uint16_t flags; /* generic table flags */ + uint16_t spare[3]; + uint32_t vmask; /* bitmask with value types */ + uint32_t set; /* set table is in */ + uint32_t kidx; /* kernel index */ + uint32_t refcnt; /* number of references */ + uint32_t count; /* Number of records */ + uint32_t size; /* Total size of records(export)*/ + uint32_t limit; /* Max number of records */ + char tablename[64]; /* table name */ + char algoname[64]; /* algorithm name */ + ipfw_ta_tinfo ta_info; /* additional algo stats */ +} ipfw_xtable_info; +/* Generic table flags */ +#define IPFW_TGFLAGS_LOCKED 0x01 /* Tables is locked from changes*/ +/* Table type-specific flags */ +#define IPFW_TFFLAG_SRCIP 0x01 +#define IPFW_TFFLAG_DSTIP 0x02 +#define IPFW_TFFLAG_SRCPORT 0x04 +#define IPFW_TFFLAG_DSTPORT 0x08 +#define IPFW_TFFLAG_PROTO 0x10 +/* Table modification flags */ +#define IPFW_TMFLAGS_LIMIT 0x0002 /* Change limit value */ +#define IPFW_TMFLAGS_LOCK 0x0004 /* Change table lock state */ + +typedef struct _ipfw_iface_info { + char ifname[64]; /* interface name */ + uint32_t ifindex; /* interface index */ + uint32_t flags; /* flags */ + uint32_t refcnt; /* number of references */ + uint32_t gencnt; /* number of changes */ + uint64_t spare; +} ipfw_iface_info; +#define IPFW_IFFLAG_RESOLVED 0x01 /* Interface exists */ + +typedef struct _ipfw_ta_info { + char algoname[64]; /* algorithm name */ + uint32_t type; /* lookup type */ + uint32_t flags; + uint32_t refcnt; + uint32_t spare0; + uint64_t spare1; +} ipfw_ta_info; + +#define IPFW_OBJTYPE_TABLE 1 +typedef struct _ipfw_obj_header { + ip_fw3_opheader opheader; /* IP_FW3 opcode */ + uint32_t spare; + uint16_t idx; /* object name index */ + uint8_t objtype; /* object type */ + uint8_t objsubtype; /* object subtype */ + ipfw_obj_ntlv ntlv; /* object name tlv */ +} ipfw_obj_header; + +typedef struct _ipfw_obj_lheader { + ip_fw3_opheader opheader; /* IP_FW3 opcode */ + uint32_t set_mask; /* disabled set mask */ + uint32_t count; /* Total objects count */ + uint32_t size; /* Total size (incl. header) */ + uint32_t objsize; /* Size of one object */ +} ipfw_obj_lheader; + +#define IPFW_CFG_GET_STATIC 0x01 +#define IPFW_CFG_GET_STATES 0x02 +#define IPFW_CFG_GET_COUNTERS 0x04 +typedef struct _ipfw_cfg_lheader { + ip_fw3_opheader opheader; /* IP_FW3 opcode */ + uint32_t set_mask; /* enabled set mask */ + uint32_t spare; + uint32_t flags; /* Request flags */ + uint32_t size; /* neded buffer size */ + uint32_t start_rule; + uint32_t end_rule; +} ipfw_cfg_lheader; + +typedef struct _ipfw_range_header { + ip_fw3_opheader opheader; /* IP_FW3 opcode */ + ipfw_range_tlv range; +} ipfw_range_header; + +typedef struct _ipfw_sopt_info { + uint16_t opcode; + uint8_t version; + uint8_t dir; + uint8_t spare; + uint64_t refcnt; +} ipfw_sopt_info; + +#endif /* _IPFW2_H */ diff --git a/example/ipfw/sys/netinet/tcp.h b/example/ipfw/sys/netinet/tcp.h new file mode 100644 index 0000000..29c4313 --- /dev/null +++ b/example/ipfw/sys/netinet/tcp.h @@ -0,0 +1,247 @@ +/*- + * Copyright (c) 1982, 1986, 1993 + * The Regents of the University of California. All rights reserved. + * + * Redistribution and use in source and binary forms, with or without + * modification, are permitted provided that the following conditions + * are met: + * 1. 
Redistributions of source code must retain the above copyright + * notice, this list of conditions and the following disclaimer. + * 2. Redistributions in binary form must reproduce the above copyright + * notice, this list of conditions and the following disclaimer in the + * documentation and/or other materials provided with the distribution. + * 4. Neither the name of the University nor the names of its contributors + * may be used to endorse or promote products derived from this software + * without specific prior written permission. + * + * THIS SOFTWARE IS PROVIDED BY THE REGENTS AND CONTRIBUTORS ``AS IS'' AND + * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE + * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE + * ARE DISCLAIMED. IN NO EVENT SHALL THE REGENTS OR CONTRIBUTORS BE LIABLE + * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL + * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS + * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) + * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT + * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY + * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF + * SUCH DAMAGE. + * + * @(#)tcp.h 8.1 (Berkeley) 6/10/93 + * $FreeBSD: head/sys/netinet/tcp.h 246210 2013-02-01 15:32:20Z jhb $ + */ + +#ifndef _NETINET_TCP_H_ +#define _NETINET_TCP_H_ + +#include +#include + +#if __BSD_VISIBLE + +typedef u_int32_t tcp_seq; + +#define tcp6_seq tcp_seq /* for KAME src sync over BSD*'s */ +#define tcp6hdr tcphdr /* for KAME src sync over BSD*'s */ + +/* + * TCP header. + * Per RFC 793, September, 1981. + */ +struct tcphdr { + u_short th_sport; /* source port */ + u_short th_dport; /* destination port */ + tcp_seq th_seq; /* sequence number */ + tcp_seq th_ack; /* acknowledgement number */ +#if BYTE_ORDER == LITTLE_ENDIAN + u_char th_x2:4, /* (unused) */ + th_off:4; /* data offset */ +#endif +#if BYTE_ORDER == BIG_ENDIAN + u_char th_off:4, /* data offset */ + th_x2:4; /* (unused) */ +#endif + u_char th_flags; +#define TH_FIN 0x01 +#define TH_SYN 0x02 +#define TH_RST 0x04 +#define TH_PUSH 0x08 +#define TH_ACK 0x10 +#define TH_URG 0x20 +#define TH_ECE 0x40 +#define TH_CWR 0x80 +#define TH_FLAGS (TH_FIN|TH_SYN|TH_RST|TH_PUSH|TH_ACK|TH_URG|TH_ECE|TH_CWR) +#define PRINT_TH_FLAGS "\20\1FIN\2SYN\3RST\4PUSH\5ACK\6URG\7ECE\10CWR" + + u_short th_win; /* window */ + u_short th_sum; /* checksum */ + u_short th_urp; /* urgent pointer */ +}; + +#define TCPOPT_EOL 0 +#define TCPOLEN_EOL 1 +#define TCPOPT_PAD 0 /* padding after EOL */ +#define TCPOLEN_PAD 1 +#define TCPOPT_NOP 1 +#define TCPOLEN_NOP 1 +#define TCPOPT_MAXSEG 2 +#define TCPOLEN_MAXSEG 4 +#define TCPOPT_WINDOW 3 +#define TCPOLEN_WINDOW 3 +#define TCPOPT_SACK_PERMITTED 4 +#define TCPOLEN_SACK_PERMITTED 2 +#define TCPOPT_SACK 5 +#define TCPOLEN_SACKHDR 2 +#define TCPOLEN_SACK 8 /* 2*sizeof(tcp_seq) */ +#define TCPOPT_TIMESTAMP 8 +#define TCPOLEN_TIMESTAMP 10 +#define TCPOLEN_TSTAMP_APPA (TCPOLEN_TIMESTAMP+2) /* appendix A */ +#define TCPOPT_SIGNATURE 19 /* Keyed MD5: RFC 2385 */ +#define TCPOLEN_SIGNATURE 18 + +/* Miscellaneous constants */ +#define MAX_SACK_BLKS 6 /* Max # SACK blocks stored at receiver side */ +#define TCP_MAX_SACK 4 /* MAX # SACKs sent in any segment */ + + +/* + * The default maximum segment size (MSS) to be used for new TCP connections + * when path MTU discovery is not enabled. 
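+ * (Editorial worked numbers for the derivation that follows:
+ * 576 - 40 = 536 for IPv4, and 1280 - 40 - 20 = 1220 for IPv6,
+ * matching the TCP_MSS and TCP6_MSS constants below.)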
+ * + * RFC879 derives the default MSS from the largest datagram size hosts are + * minimally required to handle directly or through IP reassembly minus the + * size of the IP and TCP header. With IPv6 the minimum MTU is specified + * in RFC2460. + * + * For IPv4 the MSS is 576 - sizeof(struct tcpiphdr) + * For IPv6 the MSS is IPV6_MMTU - sizeof(struct ip6_hdr) - sizeof(struct tcphdr) + * + * We use explicit numerical definition here to avoid header pollution. + */ +#define TCP_MSS 536 +#define TCP6_MSS 1220 + +/* + * Limit the lowest MSS we accept for path MTU discovery and the TCP SYN MSS + * option. Allowing low values of MSS can consume significant resources and + * be used to mount a resource exhaustion attack. + * Connections requesting lower MSS values will be rounded up to this value + * and the IP_DF flag will be cleared to allow fragmentation along the path. + * + * See tcp_subr.c tcp_minmss SYSCTL declaration for more comments. Setting + * it to "0" disables the minmss check. + * + * The default value is fine for TCP across the Internet's smallest official + * link MTU (256 bytes for AX.25 packet radio). However, a connection is very + * unlikely to come across such low MTU interfaces these days (anno domini 2003). + */ +#define TCP_MINMSS 216 + +#define TCP_MAXWIN 65535 /* largest value for (unscaled) window */ +#define TTCP_CLIENT_SND_WND 4096 /* dflt send window for T/TCP client */ + +#define TCP_MAX_WINSHIFT 14 /* maximum window shift */ + +#define TCP_MAXBURST 4 /* maximum segments in a burst */ + +#define TCP_MAXHLEN (0xf<<2) /* max length of header in bytes */ +#define TCP_MAXOLEN (TCP_MAXHLEN - sizeof(struct tcphdr)) + /* max space left for options */ +#endif /* __BSD_VISIBLE */ + +/* + * User-settable options (used with setsockopt). These are discrete + * values and are not masked together. Some values appear to be + * bitmasks for historical reasons. + */ +#define TCP_NODELAY 1 /* don't delay send to coalesce packets */ +#if __BSD_VISIBLE +#define TCP_MAXSEG 2 /* set maximum segment size */ +#define TCP_NOPUSH 4 /* don't push last block of write */ +#define TCP_NOOPT 8 /* don't use TCP options */ +#define TCP_MD5SIG 16 /* use MD5 digests (RFC2385) */ +#define TCP_INFO 32 /* retrieve tcp_info structure */ +#define TCP_CONGESTION 64 /* get/set congestion control algorithm */ +#define TCP_KEEPINIT 128 /* N, time to establish connection */ +#define TCP_KEEPIDLE 256 /* L,N,X start keeplives after this period */ +#define TCP_KEEPINTVL 512 /* L,N interval between keepalives */ +#define TCP_KEEPCNT 1024 /* L,N number of keepalives before close */ + +/* Start of reserved space for third-party user-settable options. */ +#define TCP_VENDOR SO_VENDOR + +#define TCP_CA_NAME_MAX 16 /* max congestion control name length */ + +#define TCPI_OPT_TIMESTAMPS 0x01 +#define TCPI_OPT_SACK 0x02 +#define TCPI_OPT_WSCALE 0x04 +#define TCPI_OPT_ECN 0x08 +#define TCPI_OPT_TOE 0x10 + +/* + * The TCP_INFO socket option comes from the Linux 2.6 TCP API, and permits + * the caller to query certain information about the state of a TCP + * connection. We provide an overlapping set of fields with the Linux + * implementation, but since this is a fixed size structure, room has been + * left for growth. In order to maximize potential future compatibility with + * the Linux API, the same variable names and order have been adopted, and + * padding left to make room for omitted fields in case they are added later. + * + * XXX: This is currently an unstable ABI/API, in that it is expected to + * change. 
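+ * (Illustrative editorial sketch of the intended use, assuming a
+ * connected socket "fd":
+ *
+ *     struct tcp_info ti;
+ *     socklen_t len = sizeof(ti);
+ *     if (getsockopt(fd, IPPROTO_TCP, TCP_INFO, &ti, &len) == 0)
+ *         printf("rtt %u us, cwnd %u\n", ti.tcpi_rtt, ti.tcpi_snd_cwnd);
+ * )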
+ */ +struct tcp_info { + u_int8_t tcpi_state; /* TCP FSM state. */ + u_int8_t __tcpi_ca_state; + u_int8_t __tcpi_retransmits; + u_int8_t __tcpi_probes; + u_int8_t __tcpi_backoff; + u_int8_t tcpi_options; /* Options enabled on conn. */ + u_int8_t tcpi_snd_wscale:4, /* RFC1323 send shift value. */ + tcpi_rcv_wscale:4; /* RFC1323 recv shift value. */ + + u_int32_t tcpi_rto; /* Retransmission timeout (usec). */ + u_int32_t __tcpi_ato; + u_int32_t tcpi_snd_mss; /* Max segment size for send. */ + u_int32_t tcpi_rcv_mss; /* Max segment size for receive. */ + + u_int32_t __tcpi_unacked; + u_int32_t __tcpi_sacked; + u_int32_t __tcpi_lost; + u_int32_t __tcpi_retrans; + u_int32_t __tcpi_fackets; + + /* Times; measurements in usecs. */ + u_int32_t __tcpi_last_data_sent; + u_int32_t __tcpi_last_ack_sent; /* Also unimpl. on Linux? */ + u_int32_t tcpi_last_data_recv; /* Time since last recv data. */ + u_int32_t __tcpi_last_ack_recv; + + /* Metrics; variable units. */ + u_int32_t __tcpi_pmtu; + u_int32_t __tcpi_rcv_ssthresh; + u_int32_t tcpi_rtt; /* Smoothed RTT in usecs. */ + u_int32_t tcpi_rttvar; /* RTT variance in usecs. */ + u_int32_t tcpi_snd_ssthresh; /* Slow start threshold. */ + u_int32_t tcpi_snd_cwnd; /* Send congestion window. */ + u_int32_t __tcpi_advmss; + u_int32_t __tcpi_reordering; + + u_int32_t __tcpi_rcv_rtt; + u_int32_t tcpi_rcv_space; /* Advertised recv window. */ + + /* FreeBSD extensions to tcp_info. */ + u_int32_t tcpi_snd_wnd; /* Advertised send window. */ + u_int32_t tcpi_snd_bwnd; /* No longer used. */ + u_int32_t tcpi_snd_nxt; /* Next egress seqno */ + u_int32_t tcpi_rcv_nxt; /* Next ingress seqno */ + u_int32_t tcpi_toe_tid; /* HWTID for TOE endpoints */ + u_int32_t tcpi_snd_rexmitpack; /* Retransmitted packets */ + u_int32_t tcpi_rcv_ooopack; /* Out-of-order packets */ + u_int32_t tcpi_snd_zerowin; /* Zero-sized windows sent */ + + /* Padding to grow without breaking ABI. */ + u_int32_t __tcpi_pad[26]; /* Padding. */ +}; +#endif + +#endif /* !_NETINET_TCP_H_ */ diff --git a/example/ipfw/sys/netinet/udp.h b/example/ipfw/sys/netinet/udp.h new file mode 100644 index 0000000..c4e6e08 --- /dev/null +++ b/example/ipfw/sys/netinet/udp.h @@ -0,0 +1,69 @@ +/*- + * Copyright (c) 1982, 1986, 1993 + * The Regents of the University of California. + * All rights reserved. + * + * Redistribution and use in source and binary forms, with or without + * modification, are permitted provided that the following conditions + * are met: + * 1. Redistributions of source code must retain the above copyright + * notice, this list of conditions and the following disclaimer. + * 2. Redistributions in binary form must reproduce the above copyright + * notice, this list of conditions and the following disclaimer in the + * documentation and/or other materials provided with the distribution. + * 4. Neither the name of the University nor the names of its contributors + * may be used to endorse or promote products derived from this software + * without specific prior written permission. + * + * THIS SOFTWARE IS PROVIDED BY THE REGENTS AND CONTRIBUTORS ``AS IS'' AND + * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE + * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE + * ARE DISCLAIMED. 
IN NO EVENT SHALL THE REGENTS OR CONTRIBUTORS BE LIABLE + * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL + * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS + * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) + * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT + * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY + * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF + * SUCH DAMAGE. + * + * @(#)udp.h 8.1 (Berkeley) 6/10/93 + * $FreeBSD: head/sys/netinet/udp.h 246210 2013-02-01 15:32:20Z jhb $ + */ + +#ifndef _NETINET_UDP_H_ +#define _NETINET_UDP_H_ + +/* + * UDP protocol header. + * Per RFC 768, September, 1981. + */ +struct udphdr { + u_short uh_sport; /* source port */ + u_short uh_dport; /* destination port */ + u_short uh_ulen; /* udp length */ + u_short uh_sum; /* udp checksum */ +}; + +/* + * User-settable options (used with setsockopt). + */ +#define UDP_ENCAP 1 + +/* Start of reserved space for third-party user-settable options. */ +#define UDP_VENDOR SO_VENDOR + +/* + * UDP Encapsulation of IPsec Packets options. + */ +/* Encapsulation types. */ +#define UDP_ENCAP_ESPINUDP_NON_IKE 1 /* draft-ietf-ipsec-nat-t-ike-00/01 */ +#define UDP_ENCAP_ESPINUDP 2 /* draft-ietf-ipsec-udp-encaps-02+ */ + +/* Default ESP in UDP encapsulation port. */ +#define UDP_ENCAP_ESPINUDP_PORT 500 + +/* Maximum UDP fragment size for ESP over UDP. */ +#define UDP_ENCAP_ESPINUDP_MAXFRAGLEN 552 + +#endif diff --git a/example/ipfw/sys/netpfil/ipfw/dn_heap.c b/example/ipfw/sys/netpfil/ipfw/dn_heap.c new file mode 100644 index 0000000..b47bd28 --- /dev/null +++ b/example/ipfw/sys/netpfil/ipfw/dn_heap.c @@ -0,0 +1,552 @@ +/*- + * Copyright (c) 1998-2002,2010 Luigi Rizzo, Universita` di Pisa + * All rights reserved + * + * Redistribution and use in source and binary forms, with or without + * modification, are permitted provided that the following conditions + * are met: + * 1. Redistributions of source code must retain the above copyright + * notice, this list of conditions and the following disclaimer. + * 2. Redistributions in binary form must reproduce the above copyright + * notice, this list of conditions and the following disclaimer in the + * documentation and/or other materials provided with the distribution. + * + * THIS SOFTWARE IS PROVIDED BY THE AUTHOR AND CONTRIBUTORS ``AS IS'' AND + * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE + * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE + * ARE DISCLAIMED. IN NO EVENT SHALL THE AUTHOR OR CONTRIBUTORS BE LIABLE + * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL + * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS + * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) + * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT + * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY + * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF + * SUCH DAMAGE. + */ + +/* + * Binary heap and hash tables, used in dummynet + * + * $FreeBSD: head/sys/netpfil/ipfw/dn_heap.c 240494 2012-09-14 11:51:49Z glebius $ + */ + +#include +#include +#ifdef _KERNEL +__FBSDID("$FreeBSD: head/sys/netpfil/ipfw/dn_heap.c 240494 2012-09-14 11:51:49Z glebius $"); +#include +#include +#include +#include +#ifndef log +#define log(x, arg...) 
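+/*
+ * (Editorial note, not in the upstream file: in the _KERNEL build the
+ * log() stub above expands to nothing, while the userspace branch
+ * below maps log() onto fprintf(stderr, ...) and panic() onto
+ * fprintf()+exit(), so the same heap sources build unchanged in both
+ * environments.)
+ */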
+#endif + +#else /* !_KERNEL */ + +#include +#include +#include +#include + +#include "dn_heap.h" +#define log(x, arg...) fprintf(stderr, ## arg) +#define panic(x...) fprintf(stderr, ## x), exit(1) +#define MALLOC_DEFINE(a, b, c) +static void *my_malloc(int s) { return malloc(s); } +static void my_free(void *p) { free(p); } +#define malloc(s, t, w) my_malloc(s) +#define free(p, t) my_free(p) +#endif /* !_KERNEL */ + +static MALLOC_DEFINE(M_DN_HEAP, "dummynet", "dummynet heap"); + +/* + * Heap management functions. + * + * In the heap, first node is element 0. Children of i are 2i+1 and 2i+2. + * Some macros help finding parent/children so we can optimize them. + * + * heap_init() is called to expand the heap when needed. + * Increment size in blocks of 16 entries. + * Returns 1 on error, 0 on success + */ +#define HEAP_FATHER(x) ( ( (x) - 1 ) / 2 ) +#define HEAP_LEFT(x) ( (x)+(x) + 1 ) +#define HEAP_SWAP(a, b, buffer) { buffer = a ; a = b ; b = buffer ; } +#define HEAP_INCREMENT 15 + +static int +heap_resize(struct dn_heap *h, unsigned int new_size) +{ + struct dn_heap_entry *p; + + if (h->size >= new_size ) /* have enough room */ + return 0; +#if 1 /* round to the next power of 2 */ + new_size |= new_size >> 1; + new_size |= new_size >> 2; + new_size |= new_size >> 4; + new_size |= new_size >> 8; + new_size |= new_size >> 16; +#else + new_size = (new_size + HEAP_INCREMENT ) & ~HEAP_INCREMENT; +#endif + p = malloc(new_size * sizeof(*p), M_DN_HEAP, M_NOWAIT); + if (p == NULL) { + printf("--- %s, resize %d failed\n", __func__, new_size ); + return 1; /* error */ + } + if (h->size > 0) { + bcopy(h->p, p, h->size * sizeof(*p) ); + free(h->p, M_DN_HEAP); + } + h->p = p; + h->size = new_size; + return 0; +} + +int +heap_init(struct dn_heap *h, int size, int ofs) +{ + if (heap_resize(h, size)) + return 1; + h->elements = 0; + h->ofs = ofs; + return 0; +} + +/* + * Insert element in heap. Normally, p != NULL, we insert p in + * a new position and bubble up. If p == NULL, then the element is + * already in place, and key is the position where to start the + * bubble-up. + * Returns 1 on failure (cannot allocate new heap entry) + * + * If ofs > 0 the position (index, int) of the element in the heap is + * also stored in the element itself at the given offset in bytes. + */ +#define SET_OFFSET(h, i) do { \ + if (h->ofs > 0) \ + *((int32_t *)((char *)(h->p[i].object) + h->ofs)) = i; \ + } while (0) +/* + * RESET_OFFSET is used for sanity checks. It sets ofs + * to an invalid value. + */ +#define RESET_OFFSET(h, i) do { \ + if (h->ofs > 0) \ + *((int32_t *)((char *)(h->p[i].object) + h->ofs)) = -16; \ + } while (0) + +int +heap_insert(struct dn_heap *h, uint64_t key1, void *p) +{ + int son = h->elements; + + //log("%s key %llu p %p\n", __FUNCTION__, key1, p); + if (p == NULL) { /* data already there, set starting point */ + son = key1; + } else { /* insert new element at the end, possibly resize */ + son = h->elements; + if (son == h->size) /* need resize... */ + // XXX expand by 16 or so + if (heap_resize(h, h->elements+16) ) + return 1; /* failure... 
*/ + h->p[son].object = p; + h->p[son].key = key1; + h->elements++; + } + /* make sure that son >= father along the path */ + while (son > 0) { + int father = HEAP_FATHER(son); + struct dn_heap_entry tmp; + + if (DN_KEY_LT( h->p[father].key, h->p[son].key ) ) + break; /* found right position */ + /* son smaller than father, swap and repeat */ + HEAP_SWAP(h->p[son], h->p[father], tmp); + SET_OFFSET(h, son); + son = father; + } + SET_OFFSET(h, son); + return 0; +} + +/* + * remove top element from heap, or obj if obj != NULL + */ +void +heap_extract(struct dn_heap *h, void *obj) +{ + int child, father, max = h->elements - 1; + + if (max < 0) { + printf("--- %s: empty heap 0x%p\n", __FUNCTION__, h); + return; + } + if (obj == NULL) + father = 0; /* default: move up smallest child */ + else { /* extract specific element, index is at offset */ + if (h->ofs <= 0) + panic("%s: extract from middle not set on %p\n", + __FUNCTION__, h); + father = *((int *)((char *)obj + h->ofs)); + if (father < 0 || father >= h->elements) { + panic("%s: father %d out of bound 0..%d\n", + __FUNCTION__, father, h->elements); + } + } + /* + * below, father is the index of the empty element, which + * we replace at each step with the smallest child until we + * reach the bottom level. + */ + // XXX why removing RESET_OFFSET increases runtime by 10% ? + RESET_OFFSET(h, father); + while ( (child = HEAP_LEFT(father)) <= max ) { + if (child != max && + DN_KEY_LT(h->p[child+1].key, h->p[child].key) ) + child++; /* take right child, otherwise left */ + h->p[father] = h->p[child]; + SET_OFFSET(h, father); + father = child; + } + h->elements--; + if (father != max) { + /* + * Fill hole with last entry and bubble up, + * reusing the insert code + */ + h->p[father] = h->p[max]; + heap_insert(h, father, NULL); + } +} + +#if 0 +/* + * change object position and update references + * XXX this one is never used! + */ +static void +heap_move(struct dn_heap *h, uint64_t new_key, void *object) +{ + int temp, i, max = h->elements-1; + struct dn_heap_entry *p, buf; + + if (h->ofs <= 0) + panic("cannot move items on this heap"); + p = h->p; /* shortcut */ + + i = *((int *)((char *)object + h->ofs)); + if (DN_KEY_LT(new_key, p[i].key) ) { /* must move up */ + p[i].key = new_key; + for (; i>0 && + DN_KEY_LT(new_key, p[(temp = HEAP_FATHER(i))].key); + i = temp ) { /* bubble up */ + HEAP_SWAP(p[i], p[temp], buf); + SET_OFFSET(h, i); + } + } else { /* must move down */ + p[i].key = new_key; + while ( (temp = HEAP_LEFT(i)) <= max ) { + /* found left child */ + if (temp != max && + DN_KEY_LT(p[temp+1].key, p[temp].key)) + temp++; /* select child with min key */ + if (DN_KEY_LT(>p[temp].key, new_key)) { + /* go down */ + HEAP_SWAP(p[i], p[temp], buf); + SET_OFFSET(h, i); + } else + break; + i = temp; + } + } + SET_OFFSET(h, i); +} +#endif /* heap_move, unused */ + +/* + * heapify() will reorganize data inside an array to maintain the + * heap property. It is needed when we delete a bunch of entries. 
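+ * (Editorial note: heapify() simply re-inserts every element in
+ * place -- heap_insert(h, i, NULL) bubbles entry i up from position
+ * i -- so a bulk delete followed by heapify() costs O(n log n),
+ * which is acceptable for the occasional heap_scan() removals.)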
+ */ +static void +heapify(struct dn_heap *h) +{ + int i; + + for (i = 0; i < h->elements; i++ ) + heap_insert(h, i , NULL); +} + +int +heap_scan(struct dn_heap *h, int (*fn)(void *, uintptr_t), + uintptr_t arg) +{ + int i, ret, found; + + for (i = found = 0 ; i < h->elements ;) { + ret = fn(h->p[i].object, arg); + if (ret & HEAP_SCAN_DEL) { + h->elements-- ; + h->p[i] = h->p[h->elements] ; + found++ ; + } else + i++ ; + if (ret & HEAP_SCAN_END) + break; + } + if (found) + heapify(h); + return found; +} + +/* + * cleanup the heap and free data structure + */ +void +heap_free(struct dn_heap *h) +{ + if (h->size >0 ) + free(h->p, M_DN_HEAP); + bzero(h, sizeof(*h) ); +} + +/* + * hash table support. + */ + +struct dn_ht { + int buckets; /* how many buckets, really buckets - 1*/ + int entries; /* how many entries */ + int ofs; /* offset of link field */ + uint32_t (*hash)(uintptr_t, int, void *arg); + int (*match)(void *_el, uintptr_t key, int, void *); + void *(*newh)(uintptr_t, int, void *); + void **ht; /* bucket heads */ +}; +/* + * Initialize, allocating bucket pointers inline. + * Recycle previous record if possible. + * If the 'newh' function is not supplied, we assume that the + * key passed to ht_find is the same object to be stored in. + */ +struct dn_ht * +dn_ht_init(struct dn_ht *ht, int buckets, int ofs, + uint32_t (*h)(uintptr_t, int, void *), + int (*match)(void *, uintptr_t, int, void *), + void *(*newh)(uintptr_t, int, void *)) +{ + int l; + + /* + * Notes about rounding bucket size to a power of two. + * Given the original bucket size, we compute the nearest lower and + * higher power of two, minus 1 (respectively b_min and b_max) because + * this value will be used to do an AND with the index returned + * by hash function. + * To choice between these two values, the original bucket size is + * compared with b_min. If the original size is greater than 4/3 b_min, + * we round the bucket size to b_max, else to b_min. + * This ratio try to round to the nearest power of two, advantaging + * the greater size if the different between two power is relatively + * big. + * Rounding the bucket size to a power of two avoid the use of + * module when calculating the correct bucket. + * The ht->buckets variable store the bucket size - 1 to simply + * do an AND between the index returned by hash function and ht->bucket + * instead of a module. 
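+ * Worked example (editorial): for an original size of 100 the next
+ * power of two minus one is 127 and the previous one is 63; since
+ * 63 * 4 / 3 = 84 is below 100, the size is rounded up to 127, and
+ * buckets + 1 = 128 head pointers are allocated below.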
+ */ + int b_min; /* min buckets */ + int b_max; /* max buckets */ + int b_ori; /* original buckets */ + + if (h == NULL || match == NULL) { + printf("--- missing hash or match function"); + return NULL; + } + if (buckets < 1 || buckets > 65536) + return NULL; + + b_ori = buckets; + /* calculate next power of 2, - 1*/ + buckets |= buckets >> 1; + buckets |= buckets >> 2; + buckets |= buckets >> 4; + buckets |= buckets >> 8; + buckets |= buckets >> 16; + + b_max = buckets; /* Next power */ + b_min = buckets >> 1; /* Previous power */ + + /* Calculate the 'nearest' bucket size */ + if (b_min * 4000 / 3000 < b_ori) + buckets = b_max; + else + buckets = b_min; + + if (ht) { /* see if we can reuse */ + if (buckets <= ht->buckets) { + ht->buckets = buckets; + } else { + /* free pointers if not allocated inline */ + if (ht->ht != (void *)(ht + 1)) + free(ht->ht, M_DN_HEAP); + free(ht, M_DN_HEAP); + ht = NULL; + } + } + if (ht == NULL) { + /* Allocate buckets + 1 entries because buckets is use to + * do the AND with the index returned by hash function + */ + l = sizeof(*ht) + (buckets + 1) * sizeof(void **); + ht = malloc(l, M_DN_HEAP, M_NOWAIT | M_ZERO); + } + if (ht) { + ht->ht = (void **)(ht + 1); + ht->buckets = buckets; + ht->ofs = ofs; + ht->hash = h; + ht->match = match; + ht->newh = newh; + } + return ht; +} + +/* dummy callback for dn_ht_free to unlink all */ +static int +do_del(void *obj, void *arg) +{ + return DNHT_SCAN_DEL; +} + +void +dn_ht_free(struct dn_ht *ht, int flags) +{ + if (ht == NULL) + return; + if (flags & DNHT_REMOVE) { + (void)dn_ht_scan(ht, do_del, NULL); + } else { + if (ht->ht && ht->ht != (void *)(ht + 1)) + free(ht->ht, M_DN_HEAP); + free(ht, M_DN_HEAP); + } +} + +int +dn_ht_entries(struct dn_ht *ht) +{ + return ht ? ht->entries : 0; +} + +/* lookup and optionally create or delete element */ +void * +dn_ht_find(struct dn_ht *ht, uintptr_t key, int flags, void *arg) +{ + int i; + void **pp, *p; + + if (ht == NULL) /* easy on an empty hash */ + return NULL; + i = (ht->buckets == 1) ? 0 : + (ht->hash(key, flags, arg) & ht->buckets); + + for (pp = &ht->ht[i]; (p = *pp); pp = (void **)((char *)p + ht->ofs)) { + if (flags & DNHT_MATCH_PTR) { + if (key == (uintptr_t)p) + break; + } else if (ht->match(p, key, flags, arg)) /* found match */ + break; + } + if (p) { + if (flags & DNHT_REMOVE) { + /* link in the next element */ + *pp = *(void **)((char *)p + ht->ofs); + *(void **)((char *)p + ht->ofs) = NULL; + ht->entries--; + } + } else if (flags & DNHT_INSERT) { + // printf("%s before calling new, bucket %d ofs %d\n", + // __FUNCTION__, i, ht->ofs); + p = ht->newh ? ht->newh(key, flags, arg) : (void *)key; + // printf("%s newh returns %p\n", __FUNCTION__, p); + if (p) { + ht->entries++; + *(void **)((char *)p + ht->ofs) = ht->ht[i]; + ht->ht[i] = p; + } + } + return p; +} + +/* + * do a scan with the option to delete the object. Extract next before + * running the callback because the element may be destroyed there. + */ +int +dn_ht_scan(struct dn_ht *ht, int (*fn)(void *, void *), void *arg) +{ + int i, ret, found = 0; + void **curp, *cur, *next; + + if (ht == NULL || fn == NULL) + return 0; + for (i = 0; i <= ht->buckets; i++) { + curp = &ht->ht[i]; + while ( (cur = *curp) != NULL) { + next = *(void **)((char *)cur + ht->ofs); + ret = fn(cur, arg); + if (ret & DNHT_SCAN_DEL) { + found++; + ht->entries--; + *curp = next; + } else { + curp = (void **)((char *)cur + ht->ofs); + } + if (ret & DNHT_SCAN_END) + return (ret & DNHT_COPY_ERR) ? 
-1 : found; + } + } + return found; +} + +/* + * Similar to dn_ht_scan(), except that the scan is performed only + * in the bucket 'bucket'. The function returns a correct bucket number if + * the original is invalid. + * If the callback returns DNHT_SCAN_END, the function move the ht->ht[i] + * pointer to the last entry processed. Moreover, the bucket number passed + * by caller is decremented, because usually the caller increment it. + */ +int +dn_ht_scan_bucket(struct dn_ht *ht, int *bucket, int (*fn)(void *, void *), + void *arg) +{ + int i, ret, found = 0; + void **curp, *cur, *next; + + if (ht == NULL || fn == NULL) + return 0; + if (*bucket > ht->buckets) + *bucket = 0; + i = *bucket; + + curp = &ht->ht[i]; + while ( (cur = *curp) != NULL) { + next = *(void **)((char *)cur + ht->ofs); + ret = fn(cur, arg); + if (ret & DNHT_SCAN_DEL) { + found++; + ht->entries--; + *curp = next; + } else { + curp = (void **)((char *)cur + ht->ofs); + } + if (ret & DNHT_SCAN_END) + return found; + } + return found; +} diff --git a/example/ipfw/sys/netpfil/ipfw/dn_heap.h b/example/ipfw/sys/netpfil/ipfw/dn_heap.h new file mode 100644 index 0000000..2b44d8e --- /dev/null +++ b/example/ipfw/sys/netpfil/ipfw/dn_heap.h @@ -0,0 +1,192 @@ +/*- + * Copyright (c) 1998-2010 Luigi Rizzo, Universita` di Pisa + * All rights reserved + * + * Redistribution and use in source and binary forms, with or without + * modification, are permitted provided that the following conditions + * are met: + * 1. Redistributions of source code must retain the above copyright + * notice, this list of conditions and the following disclaimer. + * 2. Redistributions in binary form must reproduce the above copyright + * notice, this list of conditions and the following disclaimer in the + * documentation and/or other materials provided with the distribution. + * + * THIS SOFTWARE IS PROVIDED BY THE AUTHOR AND CONTRIBUTORS ``AS IS'' AND + * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE + * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE + * ARE DISCLAIMED. IN NO EVENT SHALL THE AUTHOR OR CONTRIBUTORS BE LIABLE + * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL + * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS + * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) + * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT + * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY + * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF + * SUCH DAMAGE. + */ + +/* + * Binary heap and hash tables, header file + * + * $FreeBSD: head/sys/netpfil/ipfw/dn_heap.h 204865 2010-03-08 11:27:08Z luigi $ + */ + +#ifndef _IP_DN_HEAP_H +#define _IP_DN_HEAP_H + +#define DN_KEY_LT(a,b) ((int64_t)((a)-(b)) < 0) +#define DN_KEY_LEQ(a,b) ((int64_t)((a)-(b)) <= 0) + +/* + * This module implements a binary heap supporting random extraction. + * + * A heap entry contains an uint64_t key and a pointer to object. + * DN_KEY_LT(a,b) returns true if key 'a' is smaller than 'b' + * + * The heap is a struct dn_heap plus a dynamically allocated + * array of dn_heap_entry entries. 'size' represents the size of + * the array, 'elements' count entries in use. The topmost + * element has the smallest key. + * The heap supports ordered insert, and extract from the top. 
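+ * (Illustrative editorial sketch of typical use; "obj_a"/"obj_b" are
+ * hypothetical objects and the heap is assumed zeroed before first use:
+ *
+ *     struct dn_heap h = { 0 };
+ *     heap_init(&h, 16, -1);                [no intrusive index field]
+ *     heap_insert(&h, 30, obj_a);
+ *     heap_insert(&h, 10, obj_b);
+ *     void *first = HEAP_TOP(&h)->object;   [obj_b: smallest key wins]
+ *     heap_extract(&h, NULL);               [removes the top entry]
+ *     heap_free(&h);
+ * )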
+ * To extract an object from the middle of the heap, we the object + * must reserve an 'int32_t' to store the position of the object + * in the heap itself, and the location of this field must be + * passed as an argument to heap_init() -- use -1 if the feature + * is not used. + */ +struct dn_heap_entry { + uint64_t key; /* sorting key, smallest comes first */ + void *object; /* object pointer */ +}; + +struct dn_heap { + int size; /* the size of the array */ + int elements; /* elements in use */ + int ofs; /* offset in the object of heap index */ + struct dn_heap_entry *p; /* array of "size" entries */ +}; + +enum { + HEAP_SCAN_DEL = 1, + HEAP_SCAN_END = 2, +}; + +/* + * heap_init() reinitializes the heap setting the size and the offset + * of the index for random extraction (use -1 if not used). + * The 'elements' counter is set to 0. + * + * SET_HEAP_OFS() indicates where, in the object, is stored the index + * for random extractions from the heap. + * + * heap_free() frees the memory associated to a heap. + * + * heap_insert() adds a key-pointer pair to the heap + * + * HEAP_TOP() returns a pointer to the top element of the heap, + * but makes no checks on its existance (XXX should we change ?) + * + * heap_extract() removes the entry at the top, returing the pointer. + * (the key should have been read before). + * + * heap_scan() invokes a callback on each entry of the heap. + * The callback can return a combination of HEAP_SCAN_DEL and + * HEAP_SCAN_END. HEAP_SCAN_DEL means the current element must + * be removed, and HEAP_SCAN_END means to terminate the scan. + * heap_scan() returns the number of elements removed. + * Because the order is not guaranteed, we should use heap_scan() + * only as a last resort mechanism. + */ +#define HEAP_TOP(h) ((h)->p) +#define SET_HEAP_OFS(h, n) do { (h)->ofs = n; } while (0) +int heap_init(struct dn_heap *h, int size, int ofs); +int heap_insert(struct dn_heap *h, uint64_t key1, void *p); +void heap_extract(struct dn_heap *h, void *obj); +void heap_free(struct dn_heap *h); +int heap_scan(struct dn_heap *, int (*)(void *, uintptr_t), uintptr_t); + +/*------------------------------------------------------ + * This module implements a generic hash table with support for + * running callbacks on the entire table. To avoid allocating + * memory during hash table operations, objects must reserve + * space for a link field. XXX if the heap is moderately full, + * an SLIST suffices, and we can tolerate the cost of a hash + * computation on each removal. + * + * dn_ht_init() initializes the table, setting the number of + * buckets, the offset of the link field, the main callbacks. + * Callbacks are: + * + * hash(key, flags, arg) called to return a bucket index. + * match(obj, key, flags, arg) called to determine if key + * matches the current 'obj' in the heap + * newh(key, flags, arg) optional, used to allocate a new + * object during insertions. + * + * dn_ht_free() frees the heap or unlink elements. + * DNHT_REMOVE unlink elements, 0 frees the heap. + * You need two calls to do both. + * + * dn_ht_find() is the main lookup function, which can also be + * used to insert or delete elements in the hash table. + * The final 'arg' is passed to all callbacks. + * + * dn_ht_scan() is used to invoke a callback on all entries of + * the heap, or possibly on just one bucket. 
The callback + * is invoked with a pointer to the object, and must return + * one of DNHT_SCAN_DEL or DNHT_SCAN_END to request the + * removal of the object from the heap and the end of the + * scan, respectively. + * + * dn_ht_scan_bucket() is similar to dn_ht_scan(), except that it scans + * only the specific bucket of the table. The bucket is a in-out + * parameter and return a valid bucket number if the original + * is invalid. + * + * A combination of flags can be used to modify the operation + * of the dn_ht_find(), and of the callbacks: + * + * DNHT_KEY_IS_OBJ means the key is the object pointer. + * It is usally of interest for the hash and match functions. + * + * DNHT_MATCH_PTR during a lookup, match pointers instead + * of calling match(). Normally used when removing specific + * entries. Does not imply KEY_IS_OBJ as the latter _is_ used + * by the match function. + * + * DNHT_INSERT insert the element if not found. + * Calls new() to allocates a new object unless + * DNHT_KEY_IS_OBJ is set. + * + * DNHT_UNIQUE only insert if object not found. + * XXX should it imply DNHT_INSERT ? + * + * DNHT_REMOVE remove objects if we find them. + */ +struct dn_ht; /* should be opaque */ + +struct dn_ht *dn_ht_init(struct dn_ht *, int buckets, int ofs, + uint32_t (*hash)(uintptr_t, int, void *), + int (*match)(void *, uintptr_t, int, void *), + void *(*newh)(uintptr_t, int, void *)); +void dn_ht_free(struct dn_ht *, int flags); + +void *dn_ht_find(struct dn_ht *, uintptr_t, int, void *); +int dn_ht_scan(struct dn_ht *, int (*)(void *, void *), void *); +int dn_ht_scan_bucket(struct dn_ht *, int * , int (*)(void *, void *), void *); +int dn_ht_entries(struct dn_ht *); + +enum { /* flags values. + * first two are returned by the scan callback to indicate + * to delete the matching element or to end the scan + */ + DNHT_SCAN_DEL = 0x0001, + DNHT_SCAN_END = 0x0002, + DNHT_KEY_IS_OBJ = 0x0004, /* key is the obj pointer */ + DNHT_MATCH_PTR = 0x0008, /* match by pointer, not match() */ + DNHT_INSERT = 0x0010, /* insert if not found */ + DNHT_UNIQUE = 0x0020, /* report error if already there */ + DNHT_REMOVE = 0x0040, /* remove on find or dn_ht_free */ + DNHT_COPY_ERR = 0x0080, /* error during a copy */ +}; + +#endif /* _IP_DN_HEAP_H */ diff --git a/example/ipfw/sys/netpfil/ipfw/dn_sched.h b/example/ipfw/sys/netpfil/ipfw/dn_sched.h new file mode 100644 index 0000000..a81a9c0 --- /dev/null +++ b/example/ipfw/sys/netpfil/ipfw/dn_sched.h @@ -0,0 +1,192 @@ +/* + * Copyright (c) 2010 Riccardo Panicucci, Luigi Rizzo, Universita` di Pisa + * All rights reserved + * + * Redistribution and use in source and binary forms, with or without + * modification, are permitted provided that the following conditions + * are met: + * 1. Redistributions of source code must retain the above copyright + * notice, this list of conditions and the following disclaimer. + * 2. Redistributions in binary form must reproduce the above copyright + * notice, this list of conditions and the following disclaimer in the + * documentation and/or other materials provided with the distribution. + * + * THIS SOFTWARE IS PROVIDED BY THE AUTHOR AND CONTRIBUTORS ``AS IS'' AND + * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE + * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE + * ARE DISCLAIMED. 
IN NO EVENT SHALL THE AUTHOR OR CONTRIBUTORS BE LIABLE + * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL + * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS + * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) + * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT + * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY + * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF + * SUCH DAMAGE. + */ + +/* + * The API to write a packet scheduling algorithm for dummynet. + * + * $FreeBSD: head/sys/netpfil/ipfw/dn_sched.h 258467 2013-11-22 05:02:37Z luigi $ + */ + +#ifndef _DN_SCHED_H +#define _DN_SCHED_H + +#define DN_MULTIQUEUE 0x01 +/* + * Descriptor for a scheduling algorithm. + * Contains all function pointers for a given scheduler + * This is typically created when a module is loaded, and stored + * in a global list of schedulers. + */ +struct dn_alg { + uint32_t type; /* the scheduler type */ + const char *name; /* scheduler name */ + uint32_t flags; /* DN_MULTIQUEUE if supports multiple queues */ + + /* + * The following define the size of 3 optional data structures + * that may need to be allocated at runtime, and are appended + * to each of the base data structures: scheduler, sched.inst, + * and queue. We don't have a per-flowset structure. + */ + /* + parameters attached to the template, e.g. + * default queue sizes, weights, quantum size, and so on; + */ + size_t schk_datalen; + + /* + per-instance parameters, such as timestamps, + * containers for queues, etc; + */ + size_t si_datalen; + + size_t q_datalen; /* per-queue parameters (e.g. S,F) */ + + /* + * Methods implemented by the scheduler: + * enqueue enqueue packet 'm' on scheduler 's', queue 'q'. + * q is NULL for !MULTIQUEUE. + * Return 0 on success, 1 on drop (packet consumed anyways). + * Note that q should be interpreted only as a hint + * on the flow that the mbuf belongs to: while a + * scheduler will normally enqueue m into q, it is ok + * to leave q alone and put the mbuf elsewhere. + * This function is called in two cases: + * - when a new packet arrives to the scheduler; + * - when a scheduler is reconfigured. In this case the + * call is issued by the new_queue callback, with a + * non empty queue (q) and m pointing to the first + * mbuf in the queue. For this reason, the function + * should internally check for (m != q->mq.head) + * before calling dn_enqueue(). + * + * dequeue Called when scheduler instance 's' can + * dequeue a packet. Return NULL if none are available. + * XXX what about non work-conserving ? + * + * config called on 'sched X config ...', normally writes + * in the area of size sch_arg + * + * destroy called on 'sched delete', frees everything + * in sch_arg (other parts are handled by more specific + * functions) + * + * new_sched called when a new instance is created, e.g. + * to create the local queue for !MULTIQUEUE, set V or + * copy parameters for WFQ, and so on. + * + * free_sched called when deleting an instance, cleans + * extra data in the per-instance area. + * + * new_fsk called when a flowset is linked to a scheduler, + * e.g. to validate parameters such as weights etc. + * free_fsk when a flowset is unlinked from a scheduler. + * (probably unnecessary) + * + * new_queue called to set the per-queue parameters, + * e.g. S and F, adjust sum of weights in the parent, etc. 
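+ * (Editorial cross-reference: dn_sched_fifo.c and dn_sched_prio.c,
+ * both later in this patch, are the smallest single-queue and
+ * DN_MULTIQUEUE implementations of these hooks, and are a good
+ * template when writing a new scheduler.)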
+ * + * The new_queue callback is normally called from when + * creating a new queue. In some cases (such as a + * scheduler change or reconfiguration) it can be called + * with a non empty queue. In this case, the queue + * In case of non empty queue, the new_queue callback could + * need to call the enqueue function. In this case, + * the callback should eventually call enqueue() passing + * as m the first element in the queue. + * + * free_queue actions related to a queue removal, e.g. undo + * all the above. If the queue has data in it, also remove + * from the scheduler. This can e.g. happen during a reconfigure. + */ + int (*enqueue)(struct dn_sch_inst *, struct dn_queue *, + struct mbuf *); + struct mbuf * (*dequeue)(struct dn_sch_inst *); + + int (*config)(struct dn_schk *); + int (*destroy)(struct dn_schk*); + int (*new_sched)(struct dn_sch_inst *); + int (*free_sched)(struct dn_sch_inst *); + int (*new_fsk)(struct dn_fsk *f); + int (*free_fsk)(struct dn_fsk *f); + int (*new_queue)(struct dn_queue *q); + int (*free_queue)(struct dn_queue *q); + + /* run-time fields */ + int ref_count; /* XXX number of instances in the system */ + SLIST_ENTRY(dn_alg) next; /* Next scheduler in the list */ +}; + +/* MSVC does not support initializers so we need this ugly macro */ +#ifdef _WIN32 +#define _SI(fld) +#else +#define _SI(fld) fld +#endif + +/* + * Additionally, dummynet exports some functions and macros + * to be used by schedulers: + */ + +void dn_free_pkts(struct mbuf *mnext); +int dn_enqueue(struct dn_queue *q, struct mbuf* m, int drop); +/* bound a variable between min and max */ +int ipdn_bound_var(int *v, int dflt, int lo, int hi, const char *msg); + +/* + * Extract the head of a queue, update stats. Must be the very last + * thing done on a dequeue as the queue itself may go away. + */ +static __inline struct mbuf* +dn_dequeue(struct dn_queue *q) +{ + struct mbuf *m = q->mq.head; + if (m == NULL) + return NULL; + q->mq.head = m->m_nextpkt; + q->mq.count--; + + /* Update stats for the queue */ + q->ni.length--; + q->ni.len_bytes -= m->m_pkthdr.len; + if (q->_si) { + q->_si->ni.length--; + q->_si->ni.len_bytes -= m->m_pkthdr.len; + } + if (q->ni.length == 0) /* queue is now idle */ + q->q_time = dn_cfg.curr_time; + return m; +} + +int dn_sched_modevent(module_t mod, int cmd, void *arg); + +#define DECLARE_DNSCHED_MODULE(name, dnsched) \ + static moduledata_t name##_mod = { \ + #name, dn_sched_modevent, dnsched \ + }; \ + DECLARE_MODULE(name, name##_mod, \ + SI_SUB_PROTO_IFATTACHDOMAIN, SI_ORDER_ANY); \ + MODULE_DEPEND(name, dummynet, 3, 3, 3) +#endif /* _DN_SCHED_H */ diff --git a/example/ipfw/sys/netpfil/ipfw/dn_sched_fifo.c b/example/ipfw/sys/netpfil/ipfw/dn_sched_fifo.c new file mode 100644 index 0000000..1119221 --- /dev/null +++ b/example/ipfw/sys/netpfil/ipfw/dn_sched_fifo.c @@ -0,0 +1,120 @@ +/* + * Copyright (c) 2010 Riccardo Panicucci, Universita` di Pisa + * All rights reserved + * + * Redistribution and use in source and binary forms, with or without + * modification, are permitted provided that the following conditions + * are met: + * 1. Redistributions of source code must retain the above copyright + * notice, this list of conditions and the following disclaimer. + * 2. Redistributions in binary form must reproduce the above copyright + * notice, this list of conditions and the following disclaimer in the + * documentation and/or other materials provided with the distribution. 
+ * + * THIS SOFTWARE IS PROVIDED BY THE AUTHOR AND CONTRIBUTORS ``AS IS'' AND + * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE + * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE + * ARE DISCLAIMED. IN NO EVENT SHALL THE AUTHOR OR CONTRIBUTORS BE LIABLE + * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL + * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS + * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) + * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT + * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY + * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF + * SUCH DAMAGE. + */ + +/* + * $FreeBSD: head/sys/netpfil/ipfw/dn_sched_fifo.c 240494 2012-09-14 11:51:49Z glebius $ + */ + +#ifdef _KERNEL +#include +#include +#include +#include +#include +#include +#include /* IFNAMSIZ */ +#include +#include /* ipfw_rule_ref */ +#include /* flow_id */ +#include +#include +#include +#include +#else +#include +#endif + +/* + * This file implements a FIFO scheduler for a single queue. + * The queue is allocated as part of the scheduler instance, + * and there is a single flowset is in the template which stores + * queue size and policy. + * Enqueue and dequeue use the default library functions. + */ +static int +fifo_enqueue(struct dn_sch_inst *si, struct dn_queue *q, struct mbuf *m) +{ + /* XXX if called with q != NULL and m=NULL, this is a + * re-enqueue from an existing scheduler, which we should + * handle. + */ + return dn_enqueue((struct dn_queue *)(si+1), m, 0); +} + +static struct mbuf * +fifo_dequeue(struct dn_sch_inst *si) +{ + return dn_dequeue((struct dn_queue *)(si + 1)); +} + +static int +fifo_new_sched(struct dn_sch_inst *si) +{ + /* This scheduler instance contains the queue */ + struct dn_queue *q = (struct dn_queue *)(si + 1); + + set_oid(&q->ni.oid, DN_QUEUE, sizeof(*q)); + q->_si = si; + q->fs = si->sched->fs; + return 0; +} + +static int +fifo_free_sched(struct dn_sch_inst *si) +{ + struct dn_queue *q = (struct dn_queue *)(si + 1); + dn_free_pkts(q->mq.head); + bzero(q, sizeof(*q)); + return 0; +} + +/* + * FIFO scheduler descriptor + * contains the type of the scheduler, the name, the size of extra + * data structures, and function pointers. + */ +static struct dn_alg fifo_desc = { + _SI( .type = ) DN_SCHED_FIFO, + _SI( .name = ) "FIFO", + _SI( .flags = ) 0, + + _SI( .schk_datalen = ) 0, + _SI( .si_datalen = ) sizeof(struct dn_queue), + _SI( .q_datalen = ) 0, + + _SI( .enqueue = ) fifo_enqueue, + _SI( .dequeue = ) fifo_dequeue, + _SI( .config = ) NULL, + _SI( .destroy = ) NULL, + _SI( .new_sched = ) fifo_new_sched, + _SI( .free_sched = ) fifo_free_sched, + _SI( .new_fsk = ) NULL, + _SI( .free_fsk = ) NULL, + _SI( .new_queue = ) NULL, + _SI( .free_queue = ) NULL, +}; + +DECLARE_DNSCHED_MODULE(dn_fifo, &fifo_desc); diff --git a/example/ipfw/sys/netpfil/ipfw/dn_sched_prio.c b/example/ipfw/sys/netpfil/ipfw/dn_sched_prio.c new file mode 100644 index 0000000..f0ca44e --- /dev/null +++ b/example/ipfw/sys/netpfil/ipfw/dn_sched_prio.c @@ -0,0 +1,229 @@ +/* + * Copyright (c) 2010 Riccardo Panicucci, Universita` di Pisa + * All rights reserved + * + * Redistribution and use in source and binary forms, with or without + * modification, are permitted provided that the following conditions + * are met: + * 1. 
diff --git a/example/ipfw/sys/netpfil/ipfw/dn_sched_prio.c b/example/ipfw/sys/netpfil/ipfw/dn_sched_prio.c
new file mode 100644
index 0000000..f0ca44e
--- /dev/null
+++ b/example/ipfw/sys/netpfil/ipfw/dn_sched_prio.c
@@ -0,0 +1,229 @@
+/*
+ * Copyright (c) 2010 Riccardo Panicucci, Universita` di Pisa
+ * All rights reserved
+ *
+ * Redistribution and use in source and binary forms, with or without
+ * modification, are permitted provided that the following conditions
+ * are met:
+ * 1. Redistributions of source code must retain the above copyright
+ *    notice, this list of conditions and the following disclaimer.
+ * 2. Redistributions in binary form must reproduce the above copyright
+ *    notice, this list of conditions and the following disclaimer in the
+ *    documentation and/or other materials provided with the distribution.
+ *
+ * THIS SOFTWARE IS PROVIDED BY THE AUTHOR AND CONTRIBUTORS ``AS IS'' AND
+ * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
+ * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
+ * ARE DISCLAIMED. IN NO EVENT SHALL THE AUTHOR OR CONTRIBUTORS BE LIABLE
+ * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
+ * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
+ * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
+ * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
+ * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
+ * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
+ * SUCH DAMAGE.
+ */
+
+/*
+ * $FreeBSD: head/sys/netpfil/ipfw/dn_sched_prio.c 240494 2012-09-14 11:51:49Z glebius $
+ */
+#ifdef _KERNEL
+#include <sys/malloc.h>
+#include <sys/socket.h>
+#include <sys/socketvar.h>
+#include <sys/kernel.h>
+#include <sys/mbuf.h>
+#include <sys/module.h>
+#include <net/if.h>	/* IFNAMSIZ */
+#include <netinet/in.h>
+#include <netinet/ip_var.h>	/* ipfw_rule_ref */
+#include <netinet/ip_fw.h>	/* flow_id */
+#include <netinet/ip_dummynet.h>
+#include <netpfil/ipfw/dn_heap.h>
+#include <netpfil/ipfw/ip_dn_private.h>
+#include <netpfil/ipfw/dn_sched.h>
+#else
+#include <dn_test.h>
+#endif
+
+#define DN_SCHED_PRIO	5 //XXX
+
+#if !defined(_KERNEL) || !defined(__linux__)
+#define test_bit(ix, pData)	((*pData) & (1<<(ix)))
+#define __set_bit(ix, pData)	(*pData) |= (1<<(ix))
+#define __clear_bit(ix, pData)	(*pData) &= ~(1<<(ix))
+#endif
+
+#ifdef __MIPSEL__
+#define __clear_bit(ix, pData)	(*pData) &= ~(1<<(ix))
+#endif
+
+/* Size of the array of queue pointers. */
+#define BITMAP_T	unsigned long
+#define MAXPRIO		(sizeof(BITMAP_T) * 8)
+
+/*
+ * The scheduler instance contains an array of pointers to queues,
+ * one for each priority, and a bitmap listing backlogged queues.
+ */
+struct prio_si {
+	BITMAP_T bitmap;			/* array bitmap */
+	struct dn_queue *q_array[MAXPRIO];	/* Array of queue pointers */
+};
+
+/*
+ * If a queue with the same priority is already backlogged, use
+ * that one instead of the queue passed as argument.
+ */
+static int
+prio_enqueue(struct dn_sch_inst *_si, struct dn_queue *q, struct mbuf *m)
+{
+	struct prio_si *si = (struct prio_si *)(_si + 1);
+	int prio = q->fs->fs.par[0];
+
+	if (test_bit(prio, &si->bitmap) == 0) {
+		/* No queue with this priority, insert */
+		__set_bit(prio, &si->bitmap);
+		si->q_array[prio] = q;
+	} else { /* use the existing queue */
+		q = si->q_array[prio];
+	}
+	if (dn_enqueue(q, m, 0))
+		return 1;
+	return 0;
+}
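The PRIO code keeps one bit per backlogged priority level and serves the lowest set bit first. A standalone illustration of the ffs()-based selection used below (values invented for the example):

    #include <stdio.h>
    #include <strings.h>        /* ffs() */

    int main(void)
    {
        unsigned long bitmap = 0;

        bitmap |= 1UL << 3;             /* priority 3 backlogged */
        bitmap |= 1UL << 5;             /* priority 5 backlogged */

        /* ffs() is 1-based, so subtract 1: picks 3, the best priority */
        int prio = ffs((int)bitmap) - 1;
        printf("serve priority %d\n", prio);

        bitmap &= ~(1UL << prio);       /* queue drained: clear its bit */
        return 0;
    }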
+/*
+ * Packets are dequeued only from the highest priority queue.
+ * The function ffs() returns the first set bit in the bitmap;
+ * subtracting one gives the array index holding the pointer to
+ * the highest priority queue. After the dequeue, if this queue
+ * becomes empty, its index is removed from the bitmap.
+ * The scheduler is idle if the bitmap is empty.
+ *
+ * NOTE: highest priority is 0, lowest is sched->max_prio_q
+ */
+static struct mbuf *
+prio_dequeue(struct dn_sch_inst *_si)
+{
+	struct prio_si *si = (struct prio_si *)(_si + 1);
+	struct mbuf *m;
+	struct dn_queue *q;
+	int prio;
+
+	if (si->bitmap == 0) /* scheduler idle */
+		return NULL;
+
+	prio = ffs(si->bitmap) - 1;
+
+	/* Take the highest priority queue in the scheduler */
+	q = si->q_array[prio];
+	// assert(q)
+
+	m = dn_dequeue(q);
+	if (q->mq.head == NULL) {
+		/* Queue is now empty, remove from scheduler
+		 * and mark it
+		 */
+		si->q_array[prio] = NULL;
+		__clear_bit(prio, &si->bitmap);
+	}
+	return m;
+}
+
+static int
+prio_new_sched(struct dn_sch_inst *_si)
+{
+	struct prio_si *si = (struct prio_si *)(_si + 1);
+
+	bzero(si->q_array, sizeof(si->q_array));
+	si->bitmap = 0;
+
+	return 0;
+}
+
+static int
+prio_new_fsk(struct dn_fsk *fs)
+{
+	/* Check if the priority is between 0 and MAXPRIO-1 */
+	ipdn_bound_var(&fs->fs.par[0], 0, 0, MAXPRIO - 1, "PRIO priority");
+	return 0;
+}
+
+static int
+prio_new_queue(struct dn_queue *q)
+{
+	struct prio_si *si = (struct prio_si *)(q->_si + 1);
+	int prio = q->fs->fs.par[0];
+	struct dn_queue *oldq;
+
+	q->ni.oid.subtype = DN_SCHED_PRIO;
+
+	if (q->mq.head == NULL)
+		return 0;
+
+	/* The queue is not empty, so we must insert it in the scheduler
+	 * or append its mbufs to the existing queue. This partly
+	 * duplicates prio_enqueue().
+	 */
+	if (test_bit(prio, &si->bitmap) == 0) {
+		/* No queue with this priority, insert */
+		__set_bit(prio, &si->bitmap);
+		si->q_array[prio] = q;
+	} else if ( (oldq = si->q_array[prio]) != q) {
+		/* must append to the existing queue.
+		 * can simply append q->mq.head to oldq->...
+		 * and add the counters to those of oldq
+		 */
+		oldq->mq.tail->m_nextpkt = q->mq.head;
+		oldq->mq.tail = q->mq.tail;
+		oldq->ni.length += q->ni.length;
+		q->ni.length = 0;
+		oldq->ni.len_bytes += q->ni.len_bytes;
+		q->ni.len_bytes = 0;
+		q->mq.tail = q->mq.head = NULL;
+	}
+	return 0;
+}
+
+static int
+prio_free_queue(struct dn_queue *q)
+{
+	int prio = q->fs->fs.par[0];
+	struct prio_si *si = (struct prio_si *)(q->_si + 1);
+
+	if (si->q_array[prio] == q) {
+		si->q_array[prio] = NULL;
+		__clear_bit(prio, &si->bitmap);
+	}
+	return 0;
+}
+
+
+static struct dn_alg prio_desc = {
+	_SI( .type = ) DN_SCHED_PRIO,
+	_SI( .name = ) "PRIO",
+	_SI( .flags = ) DN_MULTIQUEUE,
+
+	/* we need extra space in the si and the queue */
+	_SI( .schk_datalen = ) 0,
+	_SI( .si_datalen = ) sizeof(struct prio_si),
+	_SI( .q_datalen = ) 0,
+
+	_SI( .enqueue = ) prio_enqueue,
+	_SI( .dequeue = ) prio_dequeue,
+
+	_SI( .config = ) NULL,
+	_SI( .destroy = ) NULL,
+	_SI( .new_sched = ) prio_new_sched,
+	_SI( .free_sched = ) NULL,
+
+	_SI( .new_fsk = ) prio_new_fsk,
+	_SI( .free_fsk = ) NULL,
+
+	_SI( .new_queue = ) prio_new_queue,
+	_SI( .free_queue = ) prio_free_queue,
+};
+
+
+DECLARE_DNSCHED_MODULE(dn_prio, &prio_desc);
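The QFQ scheduler added next does all timestamp math in fixed point: instead of dividing by the weight, it stores inv_w = ONE_FP/weight once and computes a flow's finish time as F = S + len * inv_w. A small worked example of that arithmetic (the precision constant here is chosen for the example, not QFQ's actual value):

    #include <stdint.h>
    #include <stdio.h>

    #define EX_FRAC_BITS 16                  /* illustrative precision */
    #define EX_ONE_FP    (1ULL << EX_FRAC_BITS)

    int main(void)
    {
        uint64_t weight = 4, len = 1500, S = 0;
        uint64_t inv_w = EX_ONE_FP / weight;   /* 1/w in fixed point */

        uint64_t F = S + len * inv_w;          /* finish timestamp */
        /* prints 375.0, i.e. 1500/4, recovered from fixed point */
        printf("F = %.1f\n", (double)F / EX_ONE_FP);
        return 0;
    }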
diff --git a/example/ipfw/sys/netpfil/ipfw/dn_sched_qfq.c b/example/ipfw/sys/netpfil/ipfw/dn_sched_qfq.c
new file mode 100644
index 0000000..5cc5901
--- /dev/null
+++ b/example/ipfw/sys/netpfil/ipfw/dn_sched_qfq.c
@@ -0,0 +1,864 @@
+/*
+ * Copyright (c) 2010 Fabio Checconi, Luigi Rizzo, Paolo Valente
+ * All rights reserved
+ *
+ * Redistribution and use in source and binary forms, with or without
+ * modification, are permitted provided that the following conditions
+ * are met:
+ * 1. Redistributions of source code must retain the above copyright
+ *    notice, this list of conditions and the following disclaimer.
+ * 2. Redistributions in binary form must reproduce the above copyright
+ *    notice, this list of conditions and the following disclaimer in the
+ *    documentation and/or other materials provided with the distribution.
+ *
+ * THIS SOFTWARE IS PROVIDED BY THE AUTHOR AND CONTRIBUTORS ``AS IS'' AND
+ * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
+ * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
+ * ARE DISCLAIMED. IN NO EVENT SHALL THE AUTHOR OR CONTRIBUTORS BE LIABLE
+ * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
+ * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
+ * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
+ * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
+ * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
+ * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
+ * SUCH DAMAGE.
+ */
+
+/*
+ * $FreeBSD: head/sys/netpfil/ipfw/dn_sched_qfq.c 240494 2012-09-14 11:51:49Z glebius $
+ */
+
+#ifdef _KERNEL
+#include <sys/malloc.h>
+#include <sys/socket.h>
+#include <sys/socketvar.h>
+#include <sys/kernel.h>
+#include <sys/mbuf.h>
+#include <sys/module.h>
+#include <net/if.h>	/* IFNAMSIZ */
+#include <netinet/in.h>
+#include <netinet/ip_var.h>	/* ipfw_rule_ref */
+#include <netinet/ip_fw.h>	/* flow_id */
+#include <netinet/ip_dummynet.h>
+#include <netpfil/ipfw/dn_heap.h>
+#include <netpfil/ipfw/ip_dn_private.h>
+#include <netpfil/ipfw/dn_sched.h>
+#else
+#include <dn_test.h>
+#endif
+
+#ifdef QFQ_DEBUG
+struct qfq_sched;
+static void dump_sched(struct qfq_sched *q, const char *msg);
+#define	NO(x)	x
+#else
+#define NO(x)
+#endif
+#define DN_SCHED_QFQ	4 // XXX Where?
+typedef	unsigned long	bitmap;
+
+/*
+ * bitmaps ops are critical. Some linux versions have __fls
+ * and the bitmap ops. Some machines have ffs
+ */
+#if defined(_WIN32) || (defined(__MIPSEL__) && defined(LINUX_24))
+int fls(unsigned int n)
+{
+	int i = 0;
+	for (i = 0; n > 0; n >>= 1, i++)
+		;
+	return i;
+}
+#endif
+
+#if !defined(_KERNEL) || defined( __FreeBSD__ ) || defined(_WIN32) || (defined(__MIPSEL__) && defined(LINUX_24))
+static inline unsigned long __fls(unsigned long word)
+{
+	return fls(word) - 1;
+}
+#endif
+
+#if !defined(_KERNEL) || !defined(__linux__)
+#ifdef QFQ_DEBUG
+int test_bit(int ix, bitmap *p)
+{
+	if (ix < 0 || ix > 31)
+		D("bad index %d", ix);
+	return *p & (1<<ix);
+}
+void __set_bit(int ix, bitmap *p)
+{
+	if (ix < 0 || ix > 31)
+		D("bad index %d", ix);
+	*p |= (1<<ix);
+}
+void __clear_bit(int ix, bitmap *p)
+{
+	if (ix < 0 || ix > 31)
+		D("bad index %d", ix);
+	*p &= ~(1<<ix);
+}
+#else /* !QFQ_DEBUG */
+/* XXX do we have a fast version, or leave it to the compiler ? */
+#define test_bit(ix, pData)	((*pData) & (1<<(ix)))
+#define __set_bit(ix, pData)	(*pData) |= (1<<(ix))
+#define __clear_bit(ix, pData)	(*pData) &= ~(1<<(ix))
+#endif /* !QFQ_DEBUG */
+#endif /* !__linux__ */
+
+#ifdef __MIPSEL__
+#define __clear_bit(ix, pData)	(*pData) &= ~(1<<(ix))
+#endif
+
+/*-------------------------------------------*/
+/*
+
+Virtual time computations.
+
+S, F and V are all computed in fixed point arithmetic with
+FRAC_BITS decimal bits.
+
+   QFQ_MAX_INDEX is the maximum index allowed for a group. We need
+	one bit per index.
+   QFQ_MAX_WSHIFT is the maximum power of two supported as a weight.
+   The layout of the bits is as below:
+
+                   [ MTU_SHIFT ][      FRAC_BITS    ]
+                   [ MAX_INDEX    ][ MIN_SLOT_SHIFT ]
+				 ^.__grp->index = 0
+				 *.__grp->slot_shift
+
+   where MIN_SLOT_SHIFT is derived by difference from the others.
+
+The max group index corresponds to Lmax/w_min, where
+Lmax=1<<MTU_SHIFT, w_min = 1 .
+From this, and knowing how many groups (MAX_INDEX) we want,
+we can derive the shift corresponding to each group.
+
+Because we often need to compute
+	F = S + len/w_i  and  V = V + len/wsum
+instead of storing w_i we store the value
+	inv_w = (1<<FRAC_BITS)/w_i
+so we can do F = S + len * inv_w * wsum.
+We use W_TOT in the formulas so we can easily move between
+static and adaptive weight sum.
+
+The per-scheduler-instance data contain all the data structures
+for the scheduler: bitmaps and bucket lists.
+
+ */
+
+/*
+ * Maximum number of consecutive slots occupied by backlogged classes
+ * inside a group. This is approx lmax/lmin + 5.
+ * XXX check because it poses constraints on MAX_INDEX
+ */
+#define QFQ_MAX_SLOTS	32
+
+/*
+ * Shifts used for class<->group mapping. Class weights are
+ * in the range [1, QFQ_MAX_WEIGHT], and we try to map each class i
+ * to the group with the smallest index that can support the L_i / r_i
+ * configured for the class.
+ *
+ * grp->index is the index of the group; and grp->slot_shift
+ * is the shift for the corresponding (scaled) sigma_i.
+ *
+ * When computing the group index, we do (len<<FRAC_BITS)/weight.
+ */
+#define QFQ_MAX_INDEX		19
+#define QFQ_MAX_WSHIFT		16	/* log2(QFQ_MAX_WEIGHT) */
+
+#define	QFQ_MAX_WEIGHT		(1<<QFQ_MAX_WSHIFT)
+#define QFQ_MAX_WSUM		(2*QFQ_MAX_WEIGHT)
+//#define IWSUM	(q->i_wsum)
+#define IWSUM	((1<<FRAC_BITS)/QFQ_MAX_WSUM)
+
+#define FRAC_BITS		30	/* fixed point arithmetic */
+#define ONE_FP			(1UL << FRAC_BITS)
+#define QFQ_MTU_SHIFT		11	/* log2(max_len) */
+#define QFQ_MIN_SLOT_SHIFT	(FRAC_BITS + QFQ_MTU_SHIFT - QFQ_MAX_INDEX)
+
+/*
+ * Possible group states, also indexes for the bitmaps array in
+ * struct qfq_queue. We rely on ER, IR, EB, IB being numbered 0..3
+ */
+enum qfq_state { ER = 0, IR = 1, EB = 2, IB = 3, QFQ_MAX_STATE };
+
+struct qfq_group;
+/*
+ * additional queue info. Some of this info should come from
+ * the flowset, we copy them here for faster processing.
+ * This is an overlay of the struct dn_queue
+ */
+struct qfq_class {
+	struct dn_queue _q;
+	uint64_t S, F;		/* flow timestamps (exact) */
+	struct qfq_class *next;	/* Link for the slot list. */
+
+	/* group we belong to. In principle we would need the index,
+	 * which is log_2(lmax/weight), but we never reference it
+	 * directly, only the group.
+	 */
+	struct qfq_group *grp;
+
+	/* these are copied from the flowset. */
+	uint32_t	inv_w;	/* ONE_FP/weight */
+	uint32_t	lmax;	/* Max packet size for this flow. */
+};
+
+/* Group descriptor, see the paper for details.
+ * Basically this contains the bucket lists
+ */
+struct qfq_group {
+	uint64_t S, F;			/* group timestamps (approx). */
+	unsigned int slot_shift;	/* Slot shift. */
+	unsigned int index;		/* Group index. */
+	unsigned int front;		/* Index of the front slot. */
+	bitmap full_slots;		/* non-empty slots */
+
+	/* Array of RR lists of active classes. */
+	struct qfq_class *slots[QFQ_MAX_SLOTS];
+};
+
+/* scheduler instance descriptor. */
+struct qfq_sched {
+	uint64_t	V;		/* Precise virtual time. */
+	uint32_t	wsum;		/* weight sum */
+	NO(uint32_t	i_wsum;)	/* ONE_FP/w_sum */
+	NO(uint32_t	queued;)	/* debugging */
+	NO(uint32_t	loops;)		/* debugging */
+	bitmap bitmaps[QFQ_MAX_STATE];	/* Group bitmaps. */
+	struct qfq_group groups[QFQ_MAX_INDEX + 1]; /* The groups. */
+};
+
+/*---- support functions ----------------------------*/
+
+/* Generic comparison function, handling wraparound */
+static inline int qfq_gt(uint64_t a, uint64_t b)
+{
+	return (int64_t)(a - b) > 0;
+}
+
+/* Round a precise timestamp to its slotted value. */
+static inline uint64_t qfq_round_down(uint64_t ts, unsigned int shift)
+{
+	return ts & ~((1ULL << shift) - 1);
+}
+
+/* return the pointer to the group with lowest index in the bitmap */
+static inline struct qfq_group *qfq_ffs(struct qfq_sched *q,
+	unsigned long bitmap)
+{
+	int index = ffs(bitmap) - 1; // zero-based
+	return &q->groups[index];
+}
+
+/*
+ * Calculate a flow index, given its weight and maximum packet length.
+ * index = log_2(maxlen/weight) but we need to apply the scaling.
+ * This is used only once at flow creation.
+ */ +static int qfq_calc_index(uint32_t inv_w, unsigned int maxlen) +{ + uint64_t slot_size = (uint64_t)maxlen *inv_w; + unsigned long size_map; + int index = 0; + + size_map = (unsigned long)(slot_size >> QFQ_MIN_SLOT_SHIFT); + if (!size_map) + goto out; + + index = __fls(size_map) + 1; // basically a log_2() + index -= !(slot_size - (1ULL << (index + QFQ_MIN_SLOT_SHIFT - 1))); + + if (index < 0) + index = 0; + +out: + ND("W = %d, L = %d, I = %d\n", ONE_FP/inv_w, maxlen, index); + return index; +} +/*---- end support functions ----*/ + +/*-------- API calls --------------------------------*/ +/* + * Validate and copy parameters from flowset. + */ +static int +qfq_new_queue(struct dn_queue *_q) +{ + struct qfq_sched *q = (struct qfq_sched *)(_q->_si + 1); + struct qfq_class *cl = (struct qfq_class *)_q; + int i; + uint32_t w; /* approximated weight */ + + /* import parameters from the flowset. They should be correct + * already. + */ + w = _q->fs->fs.par[0]; + cl->lmax = _q->fs->fs.par[1]; + if (!w || w > QFQ_MAX_WEIGHT) { + w = 1; + D("rounding weight to 1"); + } + cl->inv_w = ONE_FP/w; + w = ONE_FP/cl->inv_w; + if (q->wsum + w > QFQ_MAX_WSUM) + return EINVAL; + + i = qfq_calc_index(cl->inv_w, cl->lmax); + cl->grp = &q->groups[i]; + q->wsum += w; + // XXX cl->S = q->V; ? + // XXX compute q->i_wsum + return 0; +} + +/* remove an empty queue */ +static int +qfq_free_queue(struct dn_queue *_q) +{ + struct qfq_sched *q = (struct qfq_sched *)(_q->_si + 1); + struct qfq_class *cl = (struct qfq_class *)_q; + if (cl->inv_w) { + q->wsum -= ONE_FP/cl->inv_w; + cl->inv_w = 0; /* reset weight to avoid run twice */ + } + return 0; +} + +/* Calculate a mask to mimic what would be ffs_from(). */ +static inline unsigned long +mask_from(unsigned long bitmap, int from) +{ + return bitmap & ~((1UL << from) - 1); +} + +/* + * The state computation relies on ER=0, IR=1, EB=2, IB=3 + * First compute eligibility comparing grp->S, q->V, + * then check if someone is blocking us and possibly add EB + */ +static inline unsigned int +qfq_calc_state(struct qfq_sched *q, struct qfq_group *grp) +{ + /* if S > V we are not eligible */ + unsigned int state = qfq_gt(grp->S, q->V); + unsigned long mask = mask_from(q->bitmaps[ER], grp->index); + struct qfq_group *next; + + if (mask) { + next = qfq_ffs(q, mask); + if (qfq_gt(grp->F, next->F)) + state |= EB; + } + + return state; +} + +/* + * In principle + * q->bitmaps[dst] |= q->bitmaps[src] & mask; + * q->bitmaps[src] &= ~mask; + * but we should make sure that src != dst + */ +static inline void +qfq_move_groups(struct qfq_sched *q, unsigned long mask, int src, int dst) +{ + q->bitmaps[dst] |= q->bitmaps[src] & mask; + q->bitmaps[src] &= ~mask; +} + +static inline void +qfq_unblock_groups(struct qfq_sched *q, int index, uint64_t old_finish) +{ + unsigned long mask = mask_from(q->bitmaps[ER], index + 1); + struct qfq_group *next; + + if (mask) { + next = qfq_ffs(q, mask); + if (!qfq_gt(next->F, old_finish)) + return; + } + + mask = (1UL << index) - 1; + qfq_move_groups(q, mask, EB, ER); + qfq_move_groups(q, mask, IB, IR); +} + +/* + * perhaps + * + old_V ^= q->V; + old_V >>= QFQ_MIN_SLOT_SHIFT; + if (old_V) { + ... 
+ } + * + */ +static inline void +qfq_make_eligible(struct qfq_sched *q, uint64_t old_V) +{ + unsigned long mask, vslot, old_vslot; + + vslot = q->V >> QFQ_MIN_SLOT_SHIFT; + old_vslot = old_V >> QFQ_MIN_SLOT_SHIFT; + + if (vslot != old_vslot) { + mask = (2UL << (__fls(vslot ^ old_vslot))) - 1; + qfq_move_groups(q, mask, IR, ER); + qfq_move_groups(q, mask, IB, EB); + } +} + +/* + * XXX we should make sure that slot becomes less than 32. + * This is guaranteed by the input values. + * roundedS is always cl->S rounded on grp->slot_shift bits. + */ +static inline void +qfq_slot_insert(struct qfq_group *grp, struct qfq_class *cl, uint64_t roundedS) +{ + uint64_t slot = (roundedS - grp->S) >> grp->slot_shift; + unsigned int i = (grp->front + slot) % QFQ_MAX_SLOTS; + + cl->next = grp->slots[i]; + grp->slots[i] = cl; + __set_bit(slot, &grp->full_slots); +} + +/* + * remove the entry from the slot + */ +static inline void +qfq_front_slot_remove(struct qfq_group *grp) +{ + struct qfq_class **h = &grp->slots[grp->front]; + + *h = (*h)->next; + if (!*h) + __clear_bit(0, &grp->full_slots); +} + +/* + * Returns the first full queue in a group. As a side effect, + * adjust the bucket list so the first non-empty bucket is at + * position 0 in full_slots. + */ +static inline struct qfq_class * +qfq_slot_scan(struct qfq_group *grp) +{ + int i; + + ND("grp %d full %x", grp->index, grp->full_slots); + if (!grp->full_slots) + return NULL; + + i = ffs(grp->full_slots) - 1; // zero-based + if (i > 0) { + grp->front = (grp->front + i) % QFQ_MAX_SLOTS; + grp->full_slots >>= i; + } + + return grp->slots[grp->front]; +} + +/* + * adjust the bucket list. When the start time of a group decreases, + * we move the index down (modulo QFQ_MAX_SLOTS) so we don't need to + * move the objects. The mask of occupied slots must be shifted + * because we use ffs() to find the first non-empty slot. + * This covers decreases in the group's start time, but what about + * increases of the start time ? + * Here too we should make sure that i is less than 32 + */ +static inline void +qfq_slot_rotate(struct qfq_sched *q, struct qfq_group *grp, uint64_t roundedS) +{ + unsigned int i = (grp->S - roundedS) >> grp->slot_shift; + + grp->full_slots <<= i; + grp->front = (grp->front - i) % QFQ_MAX_SLOTS; +} + + +static inline void +qfq_update_eligible(struct qfq_sched *q, uint64_t old_V) +{ + bitmap ineligible; + + ineligible = q->bitmaps[IR] | q->bitmaps[IB]; + if (ineligible) { + if (!q->bitmaps[ER]) { + struct qfq_group *grp; + grp = qfq_ffs(q, ineligible); + if (qfq_gt(grp->S, q->V)) + q->V = grp->S; + } + qfq_make_eligible(q, old_V); + } +} + +/* + * Updates the class, returns true if also the group needs to be updated. 
+ */ +static inline int +qfq_update_class(struct qfq_sched *q, struct qfq_group *grp, + struct qfq_class *cl) +{ + + cl->S = cl->F; + if (cl->_q.mq.head == NULL) { + qfq_front_slot_remove(grp); + } else { + unsigned int len; + uint64_t roundedS; + + len = cl->_q.mq.head->m_pkthdr.len; + cl->F = cl->S + (uint64_t)len * cl->inv_w; + roundedS = qfq_round_down(cl->S, grp->slot_shift); + if (roundedS == grp->S) + return 0; + + qfq_front_slot_remove(grp); + qfq_slot_insert(grp, cl, roundedS); + } + return 1; +} + +static struct mbuf * +qfq_dequeue(struct dn_sch_inst *si) +{ + struct qfq_sched *q = (struct qfq_sched *)(si + 1); + struct qfq_group *grp; + struct qfq_class *cl; + struct mbuf *m; + uint64_t old_V; + + NO(q->loops++;) + if (!q->bitmaps[ER]) { + NO(if (q->queued) + dump_sched(q, "start dequeue");) + return NULL; + } + + grp = qfq_ffs(q, q->bitmaps[ER]); + + cl = grp->slots[grp->front]; + /* extract from the first bucket in the bucket list */ + m = dn_dequeue(&cl->_q); + + if (!m) { + D("BUG/* non-workconserving leaf */"); + return NULL; + } + NO(q->queued--;) + old_V = q->V; + q->V += (uint64_t)m->m_pkthdr.len * IWSUM; + ND("m is %p F 0x%llx V now 0x%llx", m, cl->F, q->V); + + if (qfq_update_class(q, grp, cl)) { + uint64_t old_F = grp->F; + cl = qfq_slot_scan(grp); + if (!cl) { /* group gone, remove from ER */ + __clear_bit(grp->index, &q->bitmaps[ER]); + // grp->S = grp->F + 1; // XXX debugging only + } else { + uint64_t roundedS = qfq_round_down(cl->S, grp->slot_shift); + unsigned int s; + + if (grp->S == roundedS) + goto skip_unblock; + grp->S = roundedS; + grp->F = roundedS + (2ULL << grp->slot_shift); + /* remove from ER and put in the new set */ + __clear_bit(grp->index, &q->bitmaps[ER]); + s = qfq_calc_state(q, grp); + __set_bit(grp->index, &q->bitmaps[s]); + } + /* we need to unblock even if the group has gone away */ + qfq_unblock_groups(q, grp->index, old_F); + } + +skip_unblock: + qfq_update_eligible(q, old_V); + NO(if (!q->bitmaps[ER] && q->queued) + dump_sched(q, "end dequeue");) + + return m; +} + +/* + * Assign a reasonable start time for a new flow k in group i. + * Admissible values for \hat(F) are multiples of \sigma_i + * no greater than V+\sigma_i . Larger values mean that + * we had a wraparound so we consider the timestamp to be stale. + * + * If F is not stale and F >= V then we set S = F. + * Otherwise we should assign S = V, but this may violate + * the ordering in ER. So, if we have groups in ER, set S to + * the F_j of the first group j which would be blocking us. + * We are guaranteed not to move S backward because + * otherwise our group i would still be blocked. 
+ */
+static inline void
+qfq_update_start(struct qfq_sched *q, struct qfq_class *cl)
+{
+	unsigned long mask;
+	uint64_t limit, roundedF;
+	int slot_shift = cl->grp->slot_shift;
+
+	roundedF = qfq_round_down(cl->F, slot_shift);
+	limit = qfq_round_down(q->V, slot_shift) + (1UL << slot_shift);
+
+	if (!qfq_gt(cl->F, q->V) || qfq_gt(roundedF, limit)) {
+		/* timestamp was stale */
+		mask = mask_from(q->bitmaps[ER], cl->grp->index);
+		if (mask) {
+			struct qfq_group *next = qfq_ffs(q, mask);
+			if (qfq_gt(roundedF, next->F)) {
+				cl->S = next->F;
+				return;
+			}
+		}
+		cl->S = q->V;
+	} else { /* timestamp is not stale */
+		cl->S = cl->F;
+	}
+}
+
+static int
+qfq_enqueue(struct dn_sch_inst *si, struct dn_queue *_q, struct mbuf *m)
+{
+	struct qfq_sched *q = (struct qfq_sched *)(si + 1);
+	struct qfq_group *grp;
+	struct qfq_class *cl = (struct qfq_class *)_q;
+	uint64_t roundedS;
+	int s;
+
+	NO(q->loops++;)
+	DX(4, "len %d flow %p inv_w 0x%x grp %d", m->m_pkthdr.len,
+		_q, cl->inv_w, cl->grp->index);
+	/* XXX verify that the packet obeys the parameters */
+	if (m != _q->mq.head) {
+		if (dn_enqueue(_q, m, 0)) /* packet was dropped */
+			return 1;
+		NO(q->queued++;)
+		if (m != _q->mq.head)
+			return 0;
+	}
+	/* If we reach this point, queue q was idle */
+	grp = cl->grp;
+	qfq_update_start(q, cl); /* adjust start time */
+	/* compute new finish time and rounded start. */
+	cl->F = cl->S + (uint64_t)(m->m_pkthdr.len) * cl->inv_w;
+	roundedS = qfq_round_down(cl->S, grp->slot_shift);
+
+	/*
+	 * insert cl in the correct bucket.
+	 * If cl->S >= grp->S we don't need to adjust the
+	 * bucket list and simply go to the insertion phase.
+	 * Otherwise grp->S is decreasing, we must make room
+	 * in the bucket list, and also recompute the group state.
+	 * Finally, if there were no flows in this group and nobody
+	 * was in ER make sure to adjust V.
+	 */
+	if (grp->full_slots) {
+		if (!qfq_gt(grp->S, cl->S))
+			goto skip_update;
+		/* create a slot for this cl->S */
+		qfq_slot_rotate(q, grp, roundedS);
+		/* group was surely ineligible, remove */
+		__clear_bit(grp->index, &q->bitmaps[IR]);
+		__clear_bit(grp->index, &q->bitmaps[IB]);
+	} else if (!q->bitmaps[ER] && qfq_gt(roundedS, q->V))
+		q->V = roundedS;
+
+	grp->S = roundedS;
+	grp->F = roundedS + (2ULL << grp->slot_shift); // i.e. 2\sigma_i
+	s = qfq_calc_state(q, grp);
+	__set_bit(grp->index, &q->bitmaps[s]);
+	ND("new state %d 0x%x", s, q->bitmaps[s]);
+	ND("S %llx F %llx V %llx", cl->S, cl->F, q->V);
+skip_update:
+	qfq_slot_insert(grp, cl, roundedS);
+
+	return 0;
+}
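qfq_calc_state(), used above, classifies a group into one of the four sets with two independent bits: eligibility (is S > V?) and blocking (is our F behind the best F in ER?). A compact standalone mirror of that truth table (types simplified, not from the patch):

    #include <stdint.h>
    #include <stdio.h>

    enum { ER = 0, IR = 1, EB = 2, IB = 3 };    /* same 0..3 encoding */

    static int calc_state(uint64_t S, uint64_t V, int blocked)
    {
        int state = (int64_t)(S - V) > 0;       /* IR bit: S > V */

        if (blocked)
            state |= EB;                        /* EB bit */
        return state;
    }

    int main(void)
    {
        printf("%d %d %d %d\n",
            calc_state(0, 10, 0),               /* 0 = ER */
            calc_state(20, 10, 0),              /* 1 = IR */
            calc_state(0, 10, 1),               /* 2 = EB */
            calc_state(20, 10, 1));             /* 3 = IB */
        return 0;
    }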
+
+#if 0
+static inline void
+qfq_slot_remove(struct qfq_sched *q, struct qfq_group *grp,
+	struct qfq_class *cl, struct qfq_class **pprev)
+{
+	unsigned int i, offset;
+	uint64_t roundedS;
+
+	roundedS = qfq_round_down(cl->S, grp->slot_shift);
+	offset = (roundedS - grp->S) >> grp->slot_shift;
+	i = (grp->front + offset) % QFQ_MAX_SLOTS;
+
+#ifdef notyet
+	if (!pprev) {
+		pprev = &grp->slots[i];
+		while (*pprev && *pprev != cl)
+			pprev = &(*pprev)->next;
+	}
+#endif
+
+	*pprev = cl->next;
+	if (!grp->slots[i])
+		__clear_bit(offset, &grp->full_slots);
+}
+
+/*
+ * called to forcibly destroy a queue.
+ * If the queue is not in the front bucket, or if it has
+ * other queues in the front bucket, we can simply remove
+ * the queue with no other side effects.
+ * Otherwise we must propagate the event up.
+ * XXX description to be completed.
+ */
+static void
+qfq_deactivate_class(struct qfq_sched *q, struct qfq_class *cl,
+	struct qfq_class **pprev)
+{
+	struct qfq_group *grp = &q->groups[cl->index];
+	unsigned long mask;
+	uint64_t roundedS;
+	int s;
+
+	cl->F = cl->S;	// not needed if the class goes away.
+	qfq_slot_remove(q, grp, cl, pprev);
+
+	if (!grp->full_slots) {
+		/* nothing left in the group, remove from all sets.
+		 * Do ER last because if we were blocking other groups
+		 * we must unblock them.
+		 */
+		__clear_bit(grp->index, &q->bitmaps[IR]);
+		__clear_bit(grp->index, &q->bitmaps[EB]);
+		__clear_bit(grp->index, &q->bitmaps[IB]);
+
+		if (test_bit(grp->index, &q->bitmaps[ER]) &&
+		    !(q->bitmaps[ER] & ~((1UL << grp->index) - 1))) {
+			mask = q->bitmaps[ER] & ((1UL << grp->index) - 1);
+			if (mask)
+				mask = ~((1UL << __fls(mask)) - 1);
+			else
+				mask = ~0UL;
+			qfq_move_groups(q, mask, EB, ER);
+			qfq_move_groups(q, mask, IB, IR);
+		}
+		__clear_bit(grp->index, &q->bitmaps[ER]);
+	} else if (!grp->slots[grp->front]) {
+		cl = qfq_slot_scan(grp);
+		roundedS = qfq_round_down(cl->S, grp->slot_shift);
+		if (grp->S != roundedS) {
+			__clear_bit(grp->index, &q->bitmaps[ER]);
+			__clear_bit(grp->index, &q->bitmaps[IR]);
+			__clear_bit(grp->index, &q->bitmaps[EB]);
+			__clear_bit(grp->index, &q->bitmaps[IB]);
+			grp->S = roundedS;
+			grp->F = roundedS + (2ULL << grp->slot_shift);
+			s = qfq_calc_state(q, grp);
+			__set_bit(grp->index, &q->bitmaps[s]);
+		}
+	}
+	qfq_update_eligible(q, q->V);
+}
+#endif
+
+static int
+qfq_new_fsk(struct dn_fsk *f)
+{
+	ipdn_bound_var(&f->fs.par[0], 1, 1, QFQ_MAX_WEIGHT, "qfq weight");
+	ipdn_bound_var(&f->fs.par[1], 1500, 1, 2000, "qfq maxlen");
+	ND("weight %d len %d\n", f->fs.par[0], f->fs.par[1]);
+	return 0;
+}
+
+/*
+ * initialize a new scheduler instance
+ */
+static int
+qfq_new_sched(struct dn_sch_inst *si)
+{
+	struct qfq_sched *q = (struct qfq_sched *)(si + 1);
+	struct qfq_group *grp;
+	int i;
+
+	for (i = 0; i <= QFQ_MAX_INDEX; i++) {
+		grp = &q->groups[i];
+		grp->index = i;
+		grp->slot_shift = QFQ_MTU_SHIFT + FRAC_BITS -
+			(QFQ_MAX_INDEX - i);
+	}
+	return 0;
+}
+
+/*
+ * QFQ scheduler descriptor
+ */
+static struct dn_alg qfq_desc = {
+	_SI( .type = ) DN_SCHED_QFQ,
+	_SI( .name = ) "QFQ",
+	_SI( .flags = ) DN_MULTIQUEUE,
+
+	_SI( .schk_datalen = ) 0,
+	_SI( .si_datalen = ) sizeof(struct qfq_sched),
+	_SI( .q_datalen = ) sizeof(struct qfq_class) - sizeof(struct dn_queue),
+
+	_SI( .enqueue = ) qfq_enqueue,
+	_SI( .dequeue = ) qfq_dequeue,
+
+	_SI( .config = ) NULL,
+	_SI( .destroy = ) NULL,
+	_SI( .new_sched = ) qfq_new_sched,
+	_SI( .free_sched = ) NULL,
+	_SI( .new_fsk = ) qfq_new_fsk,
+	_SI( .free_fsk = ) NULL,
+	_SI( .new_queue = ) qfq_new_queue,
+	_SI( .free_queue = ) qfq_free_queue,
+};
+
+DECLARE_DNSCHED_MODULE(dn_qfq, &qfq_desc);
+
+#ifdef QFQ_DEBUG
+static void
+dump_groups(struct qfq_sched *q, uint32_t mask)
+{
+	int i, j;
+
+	for (i = 0; i < QFQ_MAX_INDEX + 1; i++) {
+		struct qfq_group *g = &q->groups[i];
+
+		if (0 == (mask & (1<<i)))
+			continue;
+		for (j = 0; j < QFQ_MAX_SLOTS; j++) {
+			if (g->slots[j])
+				D("    bucket %d %p", j, g->slots[j]);
+		}
+		D("full_slots 0x%x", g->full_slots);
+		D("        %2d S 0x%20llx F 0x%llx %c", i,
+			g->S, g->F,
+			mask & (1<<i) ? '1' : '0');
+	}
+}
+
+static void
+dump_sched(struct qfq_sched *q, const char *msg)
+{
+	D("--- in %s: ---", msg);
+	ND("loops %d queued %d V 0x%llx", q->loops, q->queued, q->V);
+	D("    ER 0x%08x", q->bitmaps[ER]);
+	D("    EB 0x%08x", q->bitmaps[EB]);
+	D("    IR 0x%08x", q->bitmaps[IR]);
+	D("    IB 0x%08x", q->bitmaps[IB]);
+	dump_groups(q, 0xffffffff);
+};
+#endif /* QFQ_DEBUG */
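The next file implements round robin with a classic deficit round robin (DRR) credit scheme: each queue earns `quantum` bytes of credit per visit and transmits while its head packet fits. A toy model with two flows, one given twice the quantum (all numbers invented; note the in-tree rr_dequeue() tops up credit only when the head packet does not fit):

    #include <stdio.h>

    int main(void)
    {
        int credit[2]   = { 0, 0 };
        int quantum[2]  = { 1500, 3000 };   /* flow 1 gets a 2x share */
        int head_len[2] = { 1200, 1200 };   /* next packet length */

        for (int round = 0; round < 2; round++) {
            for (int f = 0; f < 2; f++) {
                credit[f] += quantum[f];
                while (head_len[f] <= credit[f]) {  /* send while it fits */
                    credit[f] -= head_len[f];
                    printf("round %d: flow %d sends %d bytes\n",
                        round, f, head_len[f]);
                }
            }
        }
        return 0;
    }

In these two rounds flow 0 sends 2 packets and flow 1 sends 5; over time the byte shares converge to the 1:2 quantum ratio.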
diff --git a/example/ipfw/sys/netpfil/ipfw/dn_sched_rr.c b/example/ipfw/sys/netpfil/ipfw/dn_sched_rr.c
new file mode 100644
index 0000000..28edb29
--- /dev/null
+++ b/example/ipfw/sys/netpfil/ipfw/dn_sched_rr.c
@@ -0,0 +1,307 @@
+/*
+ * Copyright (c) 2010 Riccardo Panicucci, Universita` di Pisa
+ * All rights reserved
+ *
+ * Redistribution and use in source and binary forms, with or without
+ * modification, are permitted provided that the following conditions
+ * are met:
+ * 1. Redistributions of source code must retain the above copyright
+ *    notice, this list of conditions and the following disclaimer.
+ * 2. Redistributions in binary form must reproduce the above copyright
+ *    notice, this list of conditions and the following disclaimer in the
+ *    documentation and/or other materials provided with the distribution.
+ *
+ * THIS SOFTWARE IS PROVIDED BY THE AUTHOR AND CONTRIBUTORS ``AS IS'' AND
+ * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
+ * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
+ * ARE DISCLAIMED. IN NO EVENT SHALL THE AUTHOR OR CONTRIBUTORS BE LIABLE
+ * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
+ * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
+ * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
+ * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
+ * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
+ * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
+ * SUCH DAMAGE.
+ */
+
+/*
+ * $FreeBSD: head/sys/netpfil/ipfw/dn_sched_rr.c 240494 2012-09-14 11:51:49Z glebius $
+ */
+
+#ifdef _KERNEL
+#include <sys/malloc.h>
+#include <sys/socket.h>
+#include <sys/socketvar.h>
+#include <sys/kernel.h>
+#include <sys/mbuf.h>
+#include <sys/module.h>
+#include <net/if.h>	/* IFNAMSIZ */
+#include <netinet/in.h>
+#include <netinet/ip_var.h>	/* ipfw_rule_ref */
+#include <netinet/ip_fw.h>	/* flow_id */
+#include <netinet/ip_dummynet.h>
+#include <netpfil/ipfw/dn_heap.h>
+#include <netpfil/ipfw/ip_dn_private.h>
+#include <netpfil/ipfw/dn_sched.h>
+#else
+#include <dn_test.h>
+#endif
+
+#define DN_SCHED_RR	3 // XXX Where?
+
+struct rr_queue {
+	struct dn_queue q;	/* Standard queue */
+	int status;		/* 1: queue is in the list */
+	int credit;		/* Number of bytes to transmit */
+	int quantum;		/* quantum * C */
+	struct rr_queue *qnext;	/* */
+};
+
+/* struct rr_schk contains global config parameters
+ * and is right after dn_schk
+ */
+struct rr_schk {
+	int min_q;		/* Min quantum */
+	int max_q;		/* Max quantum */
+	int q_bytes;		/* Bytes per quantum */
+};
+
+/* per-instance round robin list, right after dn_sch_inst */
+struct rr_si {
+	struct rr_queue *head, *tail;	/* Pointer to current queue */
+};
+
+/* Append a queue to the rr list */
+static inline void
+rr_append(struct rr_queue *q, struct rr_si *si)
+{
+	q->status = 1;		/* mark as in-rr_list */
+	q->credit = q->quantum;	/* initialize credit */
+
+	/* append to the tail */
+	if (si->head == NULL)
+		si->head = q;
+	else
+		si->tail->qnext = q;
+	si->tail = q;		/* advance the tail pointer */
+	q->qnext = si->head;	/* make it circular */
+}
+
+/* Remove the head queue from circular list. */
+static inline void
+rr_remove_head(struct rr_si *si)
+{
+	if (si->head == NULL)
+		return;	/* empty queue */
+	si->head->status = 0;
+
+	if (si->head == si->tail) {
+		si->head = si->tail = NULL;
+		return;
+	}
+
+	si->head = si->head->qnext;
+	si->tail->qnext = si->head;
+}
+/* Remove a queue from circular list.
+ * XXX see if it can be merged with remove_queue()
+ */
+static inline void
+remove_queue_q(struct rr_queue *q, struct rr_si *si)
+{
+	struct rr_queue *prev;
+
+	if (q->status != 1)
+		return;
+	if (q == si->head) {
+		rr_remove_head(si);
+		return;
+	}
+
+	for (prev = si->head; prev; prev = prev->qnext) {
+		if (prev->qnext != q)
+			continue;
+		prev->qnext = q->qnext;
+		if (q == si->tail)
+			si->tail = prev;
+		q->status = 0;
+		break;
+	}
+}
+
+
+static inline void
+next_pointer(struct rr_si *si)
+{
+	if (si->head == NULL)
+		return;	/* empty queue */
+
+	si->head = si->head->qnext;
+	si->tail = si->tail->qnext;
+}
+
+static int
+rr_enqueue(struct dn_sch_inst *_si, struct dn_queue *q, struct mbuf *m)
+{
+	struct rr_si *si;
+	struct rr_queue *rrq;
+
+	if (m != q->mq.head) {
+		if (dn_enqueue(q, m, 0)) /* packet was dropped */
+			return 1;
+		if (m != q->mq.head)
+			return 0;
+	}
+
+	/* If we reach this point, queue q was idle */
+	si = (struct rr_si *)(_si + 1);
+	rrq = (struct rr_queue *)q;
+
+	if (rrq->status == 1) /* Queue is already in the queue list */
+		return 0;
+
+	/* Insert the queue in the queue list */
+	rr_append(rrq, si);
+
+	return 0;
+}
+
+static struct mbuf *
+rr_dequeue(struct dn_sch_inst *_si)
+{
+	/* Access scheduler instance private data */
+	struct rr_si *si = (struct rr_si *)(_si + 1);
+	struct rr_queue *rrq;
+	uint64_t len;
+
+	while ( (rrq = si->head) ) {
+		struct mbuf *m = rrq->q.mq.head;
+		if ( m == NULL) {
+			/* empty queue, remove from list */
+			rr_remove_head(si);
+			continue;
+		}
+		len = m->m_pkthdr.len;
+
+		if (len > rrq->credit) {
+			/* Packet too big */
+			rrq->credit += rrq->quantum;
+			/* Try next queue */
+			next_pointer(si);
+		} else {
+			rrq->credit -= len;
+			return dn_dequeue(&rrq->q);
+		}
+	}
+
+	/* no packet to dequeue */
+	return NULL;
+}
+
+static int
+rr_config(struct dn_schk *_schk)
+{
+	struct rr_schk *schk = (struct rr_schk *)(_schk + 1);
+	ND("called");
+
+	/* use reasonable quantums (64..2k bytes, default 1500) */
+	schk->min_q = 64;
+	schk->max_q = 2048;
+	schk->q_bytes = 1500;	/* quantum */
+
+	return 0;
+}
+
+static int
+rr_new_sched(struct dn_sch_inst *_si)
+{
+	struct rr_si *si = (struct rr_si *)(_si + 1);
+
+	ND("called");
+	si->head = si->tail = NULL;
+
+	return 0;
+}
+
+static int
+rr_free_sched(struct dn_sch_inst *_si)
+{
+	ND("called");
+	/* Nothing to do? */
+	return 0;
+}
+
+static int
+rr_new_fsk(struct dn_fsk *fs)
+{
+	struct rr_schk *schk = (struct rr_schk *)(fs->sched + 1);
+	/* par[0] is the weight, par[1] is the quantum step */
+	ipdn_bound_var(&fs->fs.par[0], 1,
+		1, 65536, "RR weight");
+	ipdn_bound_var(&fs->fs.par[1], schk->q_bytes,
+		schk->min_q, schk->max_q, "RR quantum");
+	return 0;
+}
+
+static int
+rr_new_queue(struct dn_queue *_q)
+{
+	struct rr_queue *q = (struct rr_queue *)_q;
+
+	_q->ni.oid.subtype = DN_SCHED_RR;
+
+	q->quantum = _q->fs->fs.par[0] * _q->fs->fs.par[1];
+	ND("called, q->quantum %d", q->quantum);
+	q->credit = q->quantum;
+	q->status = 0;
+
+	if (_q->mq.head != NULL) {
+		/* Queue NOT empty, insert in the queue list */
+		rr_append(q, (struct rr_si *)(_q->_si + 1));
+	}
+	return 0;
+}
+
+static int
+rr_free_queue(struct dn_queue *_q)
+{
+	struct rr_queue *q = (struct rr_queue *)_q;
+
+	ND("called");
+	if (q->status == 1) {
+		struct rr_si *si = (struct rr_si *)(_q->_si + 1);
+		remove_queue_q(q, si);
+	}
+	return 0;
+}
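rr_new_fsk() above funnels the user-supplied weight and quantum through ipdn_bound_var(), declared in dn_sched.h, which forces a value into [lo, hi]. A simplified model of its contract as the callers here rely on it (the real function also logs via the msg argument; treat the exact out-of-range policy as an assumption of this sketch):

    #include <stdio.h>

    /* force *v into [lo, hi]: too small -> default, too large -> hi */
    static int bound_var(int *v, int dflt, int lo, int hi)
    {
        if (*v < lo)
            *v = dflt;
        else if (*v > hi)
            *v = hi;
        return *v;
    }

    int main(void)
    {
        int weight = 500000;                 /* outside [1, 65536] */

        bound_var(&weight, 1, 1, 65536);     /* clamped to 65536 */
        printf("weight = %d\n", weight);
        return 0;
    }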
+/*
+ * RR scheduler descriptor
+ * contains the type of the scheduler, the name, the size of the
+ * structures and function pointers.
+ */
+static struct dn_alg rr_desc = {
+	_SI( .type = ) DN_SCHED_RR,
+	_SI( .name = ) "RR",
+	_SI( .flags = ) DN_MULTIQUEUE,
+
+	_SI( .schk_datalen = ) 0,
+	_SI( .si_datalen = ) sizeof(struct rr_si),
+	_SI( .q_datalen = ) sizeof(struct rr_queue) - sizeof(struct dn_queue),
+
+	_SI( .enqueue = ) rr_enqueue,
+	_SI( .dequeue = ) rr_dequeue,
+
+	_SI( .config = ) rr_config,
+	_SI( .destroy = ) NULL,
+	_SI( .new_sched = ) rr_new_sched,
+	_SI( .free_sched = ) rr_free_sched,
+	_SI( .new_fsk = ) rr_new_fsk,
+	_SI( .free_fsk = ) NULL,
+	_SI( .new_queue = ) rr_new_queue,
+	_SI( .free_queue = ) rr_free_queue,
+};
+
+
+DECLARE_DNSCHED_MODULE(dn_rr, &rr_desc);
diff --git a/example/ipfw/sys/netpfil/ipfw/dn_sched_wf2q.c b/example/ipfw/sys/netpfil/ipfw/dn_sched_wf2q.c
new file mode 100644
index 0000000..c07f4c7
--- /dev/null
+++ b/example/ipfw/sys/netpfil/ipfw/dn_sched_wf2q.c
@@ -0,0 +1,373 @@
+/*
+ * Copyright (c) 2010 Riccardo Panicucci, Universita` di Pisa
+ * Copyright (c) 2000-2002 Luigi Rizzo, Universita` di Pisa
+ * All rights reserved
+ *
+ * Redistribution and use in source and binary forms, with or without
+ * modification, are permitted provided that the following conditions
+ * are met:
+ * 1. Redistributions of source code must retain the above copyright
+ *    notice, this list of conditions and the following disclaimer.
+ * 2. Redistributions in binary form must reproduce the above copyright
+ *    notice, this list of conditions and the following disclaimer in the
+ *    documentation and/or other materials provided with the distribution.
+ *
+ * THIS SOFTWARE IS PROVIDED BY THE AUTHOR AND CONTRIBUTORS ``AS IS'' AND
+ * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
+ * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
+ * ARE DISCLAIMED. IN NO EVENT SHALL THE AUTHOR OR CONTRIBUTORS BE LIABLE
+ * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
+ * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
+ * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
+ * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
+ * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
+ * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
+ * SUCH DAMAGE.
+ */
+
+/*
+ * $FreeBSD: head/sys/netpfil/ipfw/dn_sched_wf2q.c 240494 2012-09-14 11:51:49Z glebius $
+ */
+
+#ifdef _KERNEL
+#include <sys/malloc.h>
+#include <sys/socket.h>
+#include <sys/socketvar.h>
+#include <sys/kernel.h>
+#include <sys/mbuf.h>
+#include <sys/module.h>
+#include <net/if.h>	/* IFNAMSIZ */
+#include <netinet/in.h>
+#include <netinet/ip_var.h>	/* ipfw_rule_ref */
+#include <netinet/ip_fw.h>	/* flow_id */
+#include <netinet/ip_dummynet.h>
+#include <netpfil/ipfw/dn_heap.h>
+#include <netpfil/ipfw/ip_dn_private.h>
+#include <netpfil/ipfw/dn_sched.h>
+#else
+#include <dn_test.h>
+#endif
+
+#ifndef MAX64
+#define MAX64(x,y)  (( (int64_t) ( (y)-(x) )) > 0 ) ? (y) : (x)
+#endif
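MAX64() compares two timestamps through a signed 64-bit difference, so it keeps working when the virtual clock wraps around; qfq_gt() in the QFQ file is the same trick. A tiny demonstration with invented values:

    #include <stdint.h>
    #include <stdio.h>

    /* wraparound-safe "a newer than b" for 64-bit timestamps */
    static int key_gt(uint64_t a, uint64_t b)
    {
        return (int64_t)(a - b) > 0;
    }

    int main(void)
    {
        uint64_t before = UINT64_MAX - 5;   /* just before the wrap */
        uint64_t after  = 10;               /* just after the wrap */

        printf("%d\n", key_gt(after, before)); /* 1: correct ordering */
        printf("%d\n", after > before);        /* 0: naive compare fails */
        return 0;
    }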
+/*
+ * timestamps are computed on 64 bit using fixed point arithmetic.
+ * LMAX_BITS, WMAX_BITS are the max number of bits for the packet len
+ * and sum of weights, respectively. FRAC_BITS is the number of
+ * fractional bits. We want FRAC_BITS >> WMAX_BITS to avoid too large
+ * errors when computing the inverse, FRAC_BITS < 32 so we can do 1/w
+ * using an unsigned 32-bit division, and to avoid wraparounds we need
+ * LMAX_BITS + WMAX_BITS + FRAC_BITS << 64
+ * As an example
+ * FRAC_BITS = 26, LMAX_BITS = 14, WMAX_BITS = 19
+ */
+#ifndef FRAC_BITS
+#define FRAC_BITS	28	/* shift for fixed point arithmetic */
+#define	ONE_FP		(1UL << FRAC_BITS)
+#endif
+
+/*
+ * Private information for the scheduler instance:
+ * sch_heap (key is Finish time) returns the next queue to serve
+ * ne_heap (key is Start time) stores not-eligible queues
+ * idle_heap (key=start/finish time) stores idle flows. It must
+ *	support extract-from-middle.
+ * A flow is only in 1 of the three heaps.
+ * XXX todo: use a more efficient data structure, e.g. a tree sorted
+ * by F with min_subtree(S) in each node
+ */
+struct wf2qp_si {
+	struct dn_heap sch_heap;	/* top extract - key Finish time */
+	struct dn_heap ne_heap;		/* top extract - key Start time */
+	struct dn_heap idle_heap;	/* random extract - key Start=Finish time */
+	uint64_t V;			/* virtual time */
+	uint32_t inv_wsum;		/* inverse of sum of weights */
+	uint32_t wsum;			/* sum of weights */
+};
+
+struct wf2qp_queue {
+	struct dn_queue _q;
+	uint64_t S, F;		/* start time, finish time */
+	uint32_t inv_w;		/* ONE_FP / weight */
+	int32_t heap_pos;	/* position (index) of struct in heap */
+};
+
+/*
+ * This file implements a WF2Q+ scheduler as it has been in dummynet
+ * since 2000.
+ * The scheduler supports per-flow queues and has O(log N) complexity.
+ *
+ * WF2Q+ needs to drain entries from the idle heap so that we
+ * can keep the sum of weights up to date. We can do it whenever
+ * we get a chance, or periodically, or following some other
+ * strategy. The function idle_check() drains at most N elements
+ * from the idle heap.
+ */
+static void
+idle_check(struct wf2qp_si *si, int n, int force)
+{
+	struct dn_heap *h = &si->idle_heap;
+	while (n-- > 0 && h->elements > 0 &&
+	    (force || DN_KEY_LT(HEAP_TOP(h)->key, si->V))) {
+		struct dn_queue *q = HEAP_TOP(h)->object;
+		struct wf2qp_queue *alg_fq = (struct wf2qp_queue *)q;
+
+		heap_extract(h, NULL);
+		/* XXX to let the flowset delete the queue we should
+		 * mark it as 'unused' by the scheduler.
+		 */
+		alg_fq->S = alg_fq->F + 1; /* Mark timestamp as invalid. */
+		si->wsum -= q->fs->fs.par[0];	/* adjust sum of weights */
+		if (si->wsum > 0)
+			si->inv_wsum = ONE_FP/si->wsum;
+	}
+}
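wf2qp_enqueue() below ends by deciding which heap a freshly backlogged flow belongs to: eligible flows (S <= V) go into the scheduling heap keyed by finish time F, ineligible ones (S > V) into the not-eligible heap keyed by start time S. A minimal sketch of just that decision, assuming DN_KEY_LT() is the same signed-difference comparison used elsewhere in these files:

    #include <stdint.h>
    #include <stdio.h>

    static const char *pick_heap(uint64_t S, uint64_t V)
    {
        /* S > V (wraparound-safe): not eligible yet */
        return (int64_t)(S - V) > 0 ? "ne_heap, key S"
                                    : "sch_heap, key F";
    }

    int main(void)
    {
        uint64_t V = 1000;

        printf("S=900:  %s\n", pick_heap(900, V));
        printf("S=1200: %s\n", pick_heap(1200, V));
        return 0;
    }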
+static int
+wf2qp_enqueue(struct dn_sch_inst *_si, struct dn_queue *q, struct mbuf *m)
+{
+	struct dn_fsk *fs = q->fs;
+	struct wf2qp_si *si = (struct wf2qp_si *)(_si + 1);
+	struct wf2qp_queue *alg_fq;
+	uint64_t len = m->m_pkthdr.len;
+
+	if (m != q->mq.head) {
+		if (dn_enqueue(q, m, 0)) /* packet was dropped */
+			return 1;
+		if (m != q->mq.head)	/* queue was already busy */
+			return 0;
+	}
+
+	/* If we reach this point, queue q was idle */
+	alg_fq = (struct wf2qp_queue *)q;
+
+	if (DN_KEY_LT(alg_fq->F, alg_fq->S)) {
+		/* F < S means timestamps are invalid -> brand new queue. */
+		alg_fq->S = si->V;		/* init start time */
+		si->wsum += fs->fs.par[0];	/* add weight of new queue. */
+		si->inv_wsum = ONE_FP/si->wsum;
+	} else { /* if it was idle then it was in the idle heap */
+		heap_extract(&si->idle_heap, q);
+		alg_fq->S = MAX64(alg_fq->F, si->V);	/* compute new S */
+	}
+	alg_fq->F = alg_fq->S + len * alg_fq->inv_w;
+
+	/* if nothing is backlogged, make sure this flow is eligible */
+	if (si->ne_heap.elements == 0 && si->sch_heap.elements == 0)
+		si->V = MAX64(alg_fq->S, si->V);
+
+	/*
+	 * Look at eligibility. A flow is not eligible if S > V (when
+	 * this happens, it means that there is some other flow already
+	 * scheduled for the same pipe, so the sch_heap cannot be
+	 * empty). If the flow is not eligible we just store it in the
+	 * ne_heap. Otherwise, we store in the sch_heap.
+	 * Note that for all flows in sch_heap (SCH), S_i <= V,
+	 * and for all flows in ne_heap (NEH), S_i > V.
+	 * So when we need to compute max(V, min(S_i)) forall i in
+	 * SCH+NEH, we only need to look into NEH.
+	 */
+	if (DN_KEY_LT(si->V, alg_fq->S)) {
+		/* S>V means flow Not eligible. */
+		if (si->sch_heap.elements == 0)
+			D("++ ouch! not eligible but empty scheduler!");
+		heap_insert(&si->ne_heap, alg_fq->S, q);
+	} else {
+		heap_insert(&si->sch_heap, alg_fq->F, q);
+	}
+	return 0;
+}
+
+/* XXX invariant: sch > 0 || V >= min(S in neh) */
+static struct mbuf *
+wf2qp_dequeue(struct dn_sch_inst *_si)
+{
+	/* Access scheduler instance private data */
+	struct wf2qp_si *si = (struct wf2qp_si *)(_si + 1);
+	struct mbuf *m;
+	struct dn_queue *q;
+	struct dn_heap *sch = &si->sch_heap;
+	struct dn_heap *neh = &si->ne_heap;
+	struct wf2qp_queue *alg_fq;
+
+	if (sch->elements == 0 && neh->elements == 0) {
+		/* we have nothing to do. We could kill the idle heap
+		 * altogether and reset V
+		 */
+		idle_check(si, 0x7fffffff, 1);
+		si->V = 0;
+		si->wsum = 0;	/* should be set already */
+		return NULL;	/* quick return if nothing to do */
+	}
+	idle_check(si, 1, 0);	/* drain something from the idle heap */
+
+	/* make sure at least one element is eligible, bumping V
+	 * and moving entries that have become eligible.
+	 * We need to repeat the first part twice, before and
+	 * after extracting the candidate, or enqueue() will
+	 * find the data structure in a wrong state.
+	 */
+	m = NULL;
+	for(;;) {
+		/*
+		 * Compute V = max(V, min(S_i)). Remember that all elements
+		 * in sch have by definition S_i <= V so if sch is not empty,
+		 * V is surely the max and we must not update it. Conversely,
+		 * if sch is empty we only need to look at neh.
+		 * We don't need to move the queues, as it will be done at the
+		 * next enqueue
+		 */
+		if (sch->elements == 0 && neh->elements > 0) {
+			si->V = MAX64(si->V, HEAP_TOP(neh)->key);
+		}
+		while (neh->elements > 0 &&
+		    DN_KEY_LEQ(HEAP_TOP(neh)->key, si->V)) {
+			q = HEAP_TOP(neh)->object;
+			alg_fq = (struct wf2qp_queue *)q;
+			heap_extract(neh, NULL);
+			heap_insert(sch, alg_fq->F, q);
+		}
+		if (m) /* pkt found in previous iteration */
+			break;
+		/* ok we have at least one eligible pkt */
+		q = HEAP_TOP(sch)->object;
+		alg_fq = (struct wf2qp_queue *)q;
+		m = dn_dequeue(q);
+		heap_extract(sch, NULL); /* Remove queue from heap. */
+		si->V += (uint64_t)(m->m_pkthdr.len) * si->inv_wsum;
+		alg_fq->S = alg_fq->F;	/* Update start time. */
+		if (q->mq.head == 0) {	/* not backlogged any more. */
+			heap_insert(&si->idle_heap, alg_fq->F, q);
+		} else {		/* Still backlogged.
*/ + /* Update F, store in neh or sch */ + uint64_t len = q->mq.head->m_pkthdr.len; + alg_fq->F += len * alg_fq->inv_w; + if (DN_KEY_LEQ(alg_fq->S, si->V)) { + heap_insert(sch, alg_fq->F, q); + } else { + heap_insert(neh, alg_fq->S, q); + } + } + } + return m; +} + +static int +wf2qp_new_sched(struct dn_sch_inst *_si) +{ + struct wf2qp_si *si = (struct wf2qp_si *)(_si + 1); + int ofs = offsetof(struct wf2qp_queue, heap_pos); + + /* all heaps support extract from middle */ + if (heap_init(&si->idle_heap, 16, ofs) || + heap_init(&si->sch_heap, 16, ofs) || + heap_init(&si->ne_heap, 16, ofs)) { + heap_free(&si->ne_heap); + heap_free(&si->sch_heap); + heap_free(&si->idle_heap); + return ENOMEM; + } + return 0; +} + +static int +wf2qp_free_sched(struct dn_sch_inst *_si) +{ + struct wf2qp_si *si = (struct wf2qp_si *)(_si + 1); + + heap_free(&si->sch_heap); + heap_free(&si->ne_heap); + heap_free(&si->idle_heap); + + return 0; +} + +static int +wf2qp_new_fsk(struct dn_fsk *fs) +{ + ipdn_bound_var(&fs->fs.par[0], 1, + 1, 100, "WF2Q+ weight"); + return 0; +} + +static int +wf2qp_new_queue(struct dn_queue *_q) +{ + struct wf2qp_queue *q = (struct wf2qp_queue *)_q; + + _q->ni.oid.subtype = DN_SCHED_WF2QP; + q->F = 0; /* not strictly necessary */ + q->S = q->F + 1; /* mark timestamp as invalid. */ + q->inv_w = ONE_FP / _q->fs->fs.par[0]; + if (_q->mq.head != NULL) { + wf2qp_enqueue(_q->_si, _q, _q->mq.head); + } + return 0; +} + +/* + * Called when the infrastructure removes a queue (e.g. flowset + * is reconfigured). Nothing to do if we did not 'own' the queue, + * otherwise remove it from the right heap and adjust the sum + * of weights. + */ +static int +wf2qp_free_queue(struct dn_queue *q) +{ + struct wf2qp_queue *alg_fq = (struct wf2qp_queue *)q; + struct wf2qp_si *si = (struct wf2qp_si *)(q->_si + 1); + + if (alg_fq->S >= alg_fq->F + 1) + return 0; /* nothing to do, not in any heap */ + si->wsum -= q->fs->fs.par[0]; + if (si->wsum > 0) + si->inv_wsum = ONE_FP/si->wsum; + + /* extract from the heap. XXX TODO we may need to adjust V + * to make sure the invariants hold. + */ + if (q->mq.head == NULL) { + heap_extract(&si->idle_heap, q); + } else if (DN_KEY_LT(si->V, alg_fq->S)) { + heap_extract(&si->ne_heap, q); + } else { + heap_extract(&si->sch_heap, q); + } + return 0; +} + +/* + * WF2Q+ scheduler descriptor + * contains the type of the scheduler, the name, the size of the + * structures and function pointers. 
+ */
+static struct dn_alg wf2qp_desc = {
+	_SI( .type = ) DN_SCHED_WF2QP,
+	_SI( .name = ) "WF2Q+",
+	_SI( .flags = ) DN_MULTIQUEUE,
+
+	/* we need extra space in the si and the queue */
+	_SI( .schk_datalen = ) 0,
+	_SI( .si_datalen = ) sizeof(struct wf2qp_si),
+	_SI( .q_datalen = ) sizeof(struct wf2qp_queue) -
+				sizeof(struct dn_queue),
+
+	_SI( .enqueue = ) wf2qp_enqueue,
+	_SI( .dequeue = ) wf2qp_dequeue,
+
+	_SI( .config = ) NULL,
+	_SI( .destroy = ) NULL,
+	_SI( .new_sched = ) wf2qp_new_sched,
+	_SI( .free_sched = ) wf2qp_free_sched,
+
+	_SI( .new_fsk = ) wf2qp_new_fsk,
+	_SI( .free_fsk = ) NULL,
+
+	_SI( .new_queue = ) wf2qp_new_queue,
+	_SI( .free_queue = ) wf2qp_free_queue,
+};
+
+
+DECLARE_DNSCHED_MODULE(dn_wf2qp, &wf2qp_desc);
diff --git a/example/ipfw/sys/netpfil/ipfw/ip_dn_glue.c b/example/ipfw/sys/netpfil/ipfw/ip_dn_glue.c
new file mode 100644
index 0000000..753331f
--- /dev/null
+++ b/example/ipfw/sys/netpfil/ipfw/ip_dn_glue.c
@@ -0,0 +1,846 @@
+/*-
+ * Copyright (c) 2010 Riccardo Panicucci, Universita` di Pisa
+ * All rights reserved
+ *
+ * Redistribution and use in source and binary forms, with or without
+ * modification, are permitted provided that the following conditions
+ * are met:
+ * 1. Redistributions of source code must retain the above copyright
+ *    notice, this list of conditions and the following disclaimer.
+ * 2. Redistributions in binary form must reproduce the above copyright
+ *    notice, this list of conditions and the following disclaimer in the
+ *    documentation and/or other materials provided with the distribution.
+ *
+ * THIS SOFTWARE IS PROVIDED BY THE AUTHOR AND CONTRIBUTORS ``AS IS'' AND
+ * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
+ * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
+ * ARE DISCLAIMED. IN NO EVENT SHALL THE AUTHOR OR CONTRIBUTORS BE LIABLE
+ * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
+ * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
+ * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
+ * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
+ * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
+ * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
+ * SUCH DAMAGE.
+ */
+
+/*
+ * $FreeBSD: head/sys/netpfil/ipfw/ip_dn_glue.c 266955 2014-06-01 20:19:17Z hiren $
+ *
+ * Binary compatibility support for /sbin/ipfw RELENG_7 and RELENG_8
+ */
+
+#include "opt_inet6.h"
+
+#include <sys/param.h>
+#include <sys/systm.h>
+#include <sys/malloc.h>
+#include <sys/mbuf.h>
+#include <sys/kernel.h>
+#include <sys/lock.h>
+#include <sys/module.h>
+#include <sys/priv.h>
+#include <sys/proc.h>
+#include <sys/rwlock.h>
+#include <sys/socket.h>
+#include <sys/socketvar.h>
+#include <sys/time.h>
+#include <sys/sysctl.h>
+#include <net/if.h>	/* IFNAMSIZ, struct ifaddr, ifq head, lock.h mutex.h */
+#include <netinet/in.h>
+#include <netinet/ip_var.h>	/* ip_output(), IP_FORWARDING */
+#include <netinet/ip_fw.h>
+#include <netinet/ip_dummynet.h>
+
+#include <netpfil/ipfw/ip_fw_private.h>
+#include <netpfil/ipfw/dn_heap.h>
+#include <netpfil/ipfw/ip_dn_private.h>
+#include <netpfil/ipfw/dn_sched.h>
+
+/* FREEBSD7.2 ip_dummynet.h r191715*/
+
+struct dn_heap_entry7 {
+	int64_t key;		/* sorting key.
Topmost element is smallest one */ + void *object; /* object pointer */ +}; + +struct dn_heap7 { + int size; + int elements; + int offset; /* XXX if > 0 this is the offset of direct ptr to obj */ + struct dn_heap_entry7 *p; /* really an array of "size" entries */ +}; + +/* Common to 7.2 and 8 */ +struct dn_flow_set { + SLIST_ENTRY(dn_flow_set) next; /* linked list in a hash slot */ + + u_short fs_nr ; /* flow_set number */ + u_short flags_fs; +#define DNOLD_HAVE_FLOW_MASK 0x0001 +#define DNOLD_IS_RED 0x0002 +#define DNOLD_IS_GENTLE_RED 0x0004 +#define DNOLD_QSIZE_IS_BYTES 0x0008 /* queue size is measured in bytes */ +#define DNOLD_NOERROR 0x0010 /* do not report ENOBUFS on drops */ +#define DNOLD_HAS_PROFILE 0x0020 /* the pipe has a delay profile. */ +#define DNOLD_IS_PIPE 0x4000 +#define DNOLD_IS_QUEUE 0x8000 + + struct dn_pipe7 *pipe ; /* pointer to parent pipe */ + u_short parent_nr ; /* parent pipe#, 0 if local to a pipe */ + + int weight ; /* WFQ queue weight */ + int qsize ; /* queue size in slots or bytes */ + int plr ; /* pkt loss rate (2^31-1 means 100%) */ + + struct ipfw_flow_id flow_mask ; + + /* hash table of queues onto this flow_set */ + int rq_size ; /* number of slots */ + int rq_elements ; /* active elements */ + struct dn_flow_queue7 **rq; /* array of rq_size entries */ + + u_int32_t last_expired ; /* do not expire too frequently */ + int backlogged ; /* #active queues for this flowset */ + + /* RED parameters */ +#define SCALE_RED 16 +#define SCALE(x) ( (x) << SCALE_RED ) +#define SCALE_VAL(x) ( (x) >> SCALE_RED ) +#define SCALE_MUL(x,y) ( ( (x) * (y) ) >> SCALE_RED ) + int w_q ; /* queue weight (scaled) */ + int max_th ; /* maximum threshold for queue (scaled) */ + int min_th ; /* minimum threshold for queue (scaled) */ + int max_p ; /* maximum value for p_b (scaled) */ + u_int c_1 ; /* max_p/(max_th-min_th) (scaled) */ + u_int c_2 ; /* max_p*min_th/(max_th-min_th) (scaled) */ + u_int c_3 ; /* for GRED, (1-max_p)/max_th (scaled) */ + u_int c_4 ; /* for GRED, 1 - 2*max_p (scaled) */ + u_int * w_q_lookup ; /* lookup table for computing (1-w_q)^t */ + u_int lookup_depth ; /* depth of lookup table */ + int lookup_step ; /* granularity inside the lookup table */ + int lookup_weight ; /* equal to (1-w_q)^t / (1-w_q)^(t+1) */ + int avg_pkt_size ; /* medium packet size */ + int max_pkt_size ; /* max packet size */ +}; +SLIST_HEAD(dn_flow_set_head, dn_flow_set); + +#define DN_IS_PIPE 0x4000 +#define DN_IS_QUEUE 0x8000 +struct dn_flow_queue7 { + struct dn_flow_queue7 *next ; + struct ipfw_flow_id id ; + + struct mbuf *head, *tail ; /* queue of packets */ + u_int len ; + u_int len_bytes ; + + u_long numbytes; + + u_int64_t tot_pkts ; /* statistics counters */ + u_int64_t tot_bytes ; + u_int32_t drops ; + + int hash_slot ; /* debugging/diagnostic */ + + /* RED parameters */ + int avg ; /* average queue length est. (scaled) */ + int count ; /* arrivals since last RED drop */ + int random ; /* random value (scaled) */ + u_int32_t q_time; /* start of queue idle time */ + + /* WF2Q+ support */ + struct dn_flow_set *fs ; /* parent flow set */ + int heap_pos ; /* position (index) of struct in heap */ + int64_t sched_time ; /* current time when queue enters ready_heap */ + + int64_t S,F ; /* start time, finish time */ +}; + +struct dn_pipe7 { /* a pipe */ + SLIST_ENTRY(dn_pipe7) next; /* linked list in a hash slot */ + + int pipe_nr ; /* number */ + int bandwidth; /* really, bytes/tick. 
*/ + int delay ; /* really, ticks */ + + struct mbuf *head, *tail ; /* packets in delay line */ + + /* WF2Q+ */ + struct dn_heap7 scheduler_heap ; /* top extract - key Finish time*/ + struct dn_heap7 not_eligible_heap; /* top extract- key Start time */ + struct dn_heap7 idle_heap ; /* random extract - key Start=Finish time */ + + int64_t V ; /* virtual time */ + int sum; /* sum of weights of all active sessions */ + + int numbytes; + + int64_t sched_time ; /* time pipe was scheduled in ready_heap */ + + /* + * When the tx clock come from an interface (if_name[0] != '\0'), its name + * is stored below, whereas the ifp is filled when the rule is configured. + */ + char if_name[IFNAMSIZ]; + struct ifnet *ifp ; + int ready ; /* set if ifp != NULL and we got a signal from it */ + + struct dn_flow_set fs ; /* used with fixed-rate flows */ +}; +SLIST_HEAD(dn_pipe_head7, dn_pipe7); + + +/* FREEBSD8 ip_dummynet.h r196045 */ +struct dn_flow_queue8 { + struct dn_flow_queue8 *next ; + struct ipfw_flow_id id ; + + struct mbuf *head, *tail ; /* queue of packets */ + u_int len ; + u_int len_bytes ; + + uint64_t numbytes ; /* credit for transmission (dynamic queues) */ + int64_t extra_bits; /* extra bits simulating unavailable channel */ + + u_int64_t tot_pkts ; /* statistics counters */ + u_int64_t tot_bytes ; + u_int32_t drops ; + + int hash_slot ; /* debugging/diagnostic */ + + /* RED parameters */ + int avg ; /* average queue length est. (scaled) */ + int count ; /* arrivals since last RED drop */ + int random ; /* random value (scaled) */ + int64_t idle_time; /* start of queue idle time */ + + /* WF2Q+ support */ + struct dn_flow_set *fs ; /* parent flow set */ + int heap_pos ; /* position (index) of struct in heap */ + int64_t sched_time ; /* current time when queue enters ready_heap */ + + int64_t S,F ; /* start time, finish time */ +}; + +struct dn_pipe8 { /* a pipe */ + SLIST_ENTRY(dn_pipe8) next; /* linked list in a hash slot */ + + int pipe_nr ; /* number */ + int bandwidth; /* really, bytes/tick. */ + int delay ; /* really, ticks */ + + struct mbuf *head, *tail ; /* packets in delay line */ + + /* WF2Q+ */ + struct dn_heap7 scheduler_heap ; /* top extract - key Finish time*/ + struct dn_heap7 not_eligible_heap; /* top extract- key Start time */ + struct dn_heap7 idle_heap ; /* random extract - key Start=Finish time */ + + int64_t V ; /* virtual time */ + int sum; /* sum of weights of all active sessions */ + + /* Same as in dn_flow_queue, numbytes can become large */ + int64_t numbytes; /* bits I can transmit (more or less). 
*/
+	uint64_t burst;	/* burst size, scaled: bits * hz */
+
+	int64_t sched_time ;	/* time pipe was scheduled in ready_heap */
+	int64_t idle_time;	/* start of pipe idle time */
+
+	char if_name[IFNAMSIZ];
+	struct ifnet *ifp ;
+	int ready ;	/* set if ifp != NULL and we got a signal from it */
+
+	struct dn_flow_set fs ;	/* used with fixed-rate flows */
+
+	/* fields to simulate a delay profile */
+#define ED_MAX_NAME_LEN	32
+	char name[ED_MAX_NAME_LEN];
+	int loss_level;
+	int samples_no;
+	int *samples;
+};
+
+#define ED_MAX_SAMPLES_NO	1024
+struct dn_pipe_max8 {
+	struct dn_pipe8 pipe;
+	int samples[ED_MAX_SAMPLES_NO];
+};
+SLIST_HEAD(dn_pipe_head8, dn_pipe8);
+
+/*
+ * Changes from 7.2 to 8:
+ * dn_pipe:
+ *	numbytes from int to int64_t
+ *	add burst (int64_t)
+ *	add idle_time (int64_t)
+ *	add profile
+ *	add struct dn_pipe_max
+ *	add flag DN_HAS_PROFILE
+ *
+ * dn_flow_queue
+ *	numbytes from u_long to int64_t
+ *	add extra_bits (int64_t)
+ *	q_time from u_int32_t to int64_t and name idle_time
+ *
+ * dn_flow_set unchanged
+ *
+ */
+
+/* NOTE:XXX copied from dummynet.c */
+#define O_NEXT(p, len) ((void *)((char *)p + len))
+static void
+oid_fill(struct dn_id *oid, int len, int type, uintptr_t id)
+{
+	oid->len = len;
+	oid->type = type;
+	oid->subtype = 0;
+	oid->id = id;
+}
+/* make room in the buffer and move the pointer forward */
+static void *
+o_next(struct dn_id **o, int len, int type)
+{
+	struct dn_id *ret = *o;
+	oid_fill(ret, len, type, 0);
+	*o = O_NEXT(*o, len);
+	return ret;
+}
+
+
+static size_t pipesize7 = sizeof(struct dn_pipe7);
+static size_t pipesize8 = sizeof(struct dn_pipe8);
+static size_t pipesizemax8 = sizeof(struct dn_pipe_max8);
+
+/* Indicate the 'ipfw' version:
+ *  1: from FreeBSD 7.2
+ *  0: from FreeBSD 8
+ * -1: unknown (for now it is unused)
+ *
+ * It is updated when an IP_DUMMYNET_DEL or IP_DUMMYNET_CONFIGURE request
+ * arrives.
+ * NOTE: if an IP_DUMMYNET_GET arrives and the 'ipfw' version is unknown,
+ * it is assumed to be the FreeBSD 8 version.
+ */
+static int is7 = 0;
+
+static int
+convertflags2new(int src)
+{
+	int dst = 0;
+
+	if (src & DNOLD_HAVE_FLOW_MASK)
+		dst |= DN_HAVE_MASK;
+	if (src & DNOLD_QSIZE_IS_BYTES)
+		dst |= DN_QSIZE_BYTES;
+	if (src & DNOLD_NOERROR)
+		dst |= DN_NOERROR;
+	if (src & DNOLD_IS_RED)
+		dst |= DN_IS_RED;
+	if (src & DNOLD_IS_GENTLE_RED)
+		dst |= DN_IS_GENTLE_RED;
+	if (src & DNOLD_HAS_PROFILE)
+		dst |= DN_HAS_PROFILE;
+
+	return dst;
+}
+
+static int
+convertflags2old(int src)
+{
+	int dst = 0;
+
+	if (src & DN_HAVE_MASK)
+		dst |= DNOLD_HAVE_FLOW_MASK;
+	if (src & DN_IS_RED)
+		dst |= DNOLD_IS_RED;
+	if (src & DN_IS_GENTLE_RED)
+		dst |= DNOLD_IS_GENTLE_RED;
+	if (src & DN_NOERROR)
+		dst |= DNOLD_NOERROR;
+	if (src & DN_HAS_PROFILE)
+		dst |= DNOLD_HAS_PROFILE;
+	if (src & DN_QSIZE_BYTES)
+		dst |= DNOLD_QSIZE_IS_BYTES;
+
+	return dst;
+}
+
+static int
+dn_compat_del(void *v)
+{
+	struct dn_pipe7 *p = (struct dn_pipe7 *) v;
+	struct dn_pipe8 *p8 = (struct dn_pipe8 *) v;
+	struct {
+		struct dn_id oid;
+		uintptr_t a[1];	/* add more if we want a list */
+	} cmd;
+
+	/* XXX DN_API_VERSION ???
*/ + oid_fill((void *)&cmd, sizeof(cmd), DN_CMD_DELETE, DN_API_VERSION); + + if (is7) { + if (p->pipe_nr == 0 && p->fs.fs_nr == 0) + return EINVAL; + if (p->pipe_nr != 0 && p->fs.fs_nr != 0) + return EINVAL; + } else { + if (p8->pipe_nr == 0 && p8->fs.fs_nr == 0) + return EINVAL; + if (p8->pipe_nr != 0 && p8->fs.fs_nr != 0) + return EINVAL; + } + + if (p->pipe_nr != 0) { /* pipe x delete */ + cmd.a[0] = p->pipe_nr; + cmd.oid.subtype = DN_LINK; + } else { /* queue x delete */ + cmd.oid.subtype = DN_FS; + cmd.a[0] = (is7) ? p->fs.fs_nr : p8->fs.fs_nr; + } + + return do_config(&cmd, cmd.oid.len); +} + +static int +dn_compat_config_queue(struct dn_fs *fs, void* v) +{ + struct dn_pipe7 *p7 = (struct dn_pipe7 *)v; + struct dn_pipe8 *p8 = (struct dn_pipe8 *)v; + struct dn_flow_set *f; + + if (is7) + f = &p7->fs; + else + f = &p8->fs; + + fs->fs_nr = f->fs_nr; + fs->sched_nr = f->parent_nr; + fs->flow_mask = f->flow_mask; + fs->buckets = f->rq_size; + fs->qsize = f->qsize; + fs->plr = f->plr; + fs->par[0] = f->weight; + fs->flags = convertflags2new(f->flags_fs); + if (fs->flags & DN_IS_GENTLE_RED || fs->flags & DN_IS_RED) { + fs->w_q = f->w_q; + fs->max_th = f->max_th; + fs->min_th = f->min_th; + fs->max_p = f->max_p; + } + + return 0; +} + +static int +dn_compat_config_pipe(struct dn_sch *sch, struct dn_link *p, + struct dn_fs *fs, void* v) +{ + struct dn_pipe7 *p7 = (struct dn_pipe7 *)v; + struct dn_pipe8 *p8 = (struct dn_pipe8 *)v; + int i = p7->pipe_nr; + + sch->sched_nr = i; + sch->oid.subtype = 0; + p->link_nr = i; + fs->fs_nr = i + 2*DN_MAX_ID; + fs->sched_nr = i + DN_MAX_ID; + + /* Common to 7 and 8 */ + p->bandwidth = p7->bandwidth; + p->delay = p7->delay; + if (!is7) { + /* FreeBSD 8 has burst */ + p->burst = p8->burst; + } + + /* fill the fifo flowset */ + dn_compat_config_queue(fs, v); + fs->fs_nr = i + 2*DN_MAX_ID; + fs->sched_nr = i + DN_MAX_ID; + + /* Move scheduler related parameter from fs to sch */ + sch->buckets = fs->buckets; /*XXX*/ + fs->buckets = 0; + if (fs->flags & DN_HAVE_MASK) { + sch->flags |= DN_HAVE_MASK; + fs->flags &= ~DN_HAVE_MASK; + sch->sched_mask = fs->flow_mask; + bzero(&fs->flow_mask, sizeof(struct ipfw_flow_id)); + } + + return 0; +} + +static int +dn_compat_config_profile(struct dn_profile *pf, struct dn_link *p, + void *v) +{ + struct dn_pipe8 *p8 = (struct dn_pipe8 *)v; + + p8->samples = &(((struct dn_pipe_max8 *)p8)->samples[0]); + + pf->link_nr = p->link_nr; + pf->loss_level = p8->loss_level; +// pf->bandwidth = p->bandwidth; //XXX bandwidth redundant? 
+ pf->samples_no = p8->samples_no; + strncpy(pf->name, p8->name,sizeof(pf->name)); + bcopy(p8->samples, pf->samples, sizeof(pf->samples)); + + return 0; +} + +/* + * If p->pipe_nr != 0 the command is 'pipe x config', so need to create + * the three main struct, else only a flowset is created + */ +static int +dn_compat_configure(void *v) +{ + struct dn_id *buf = NULL, *base; + struct dn_sch *sch = NULL; + struct dn_link *p = NULL; + struct dn_fs *fs = NULL; + struct dn_profile *pf = NULL; + int lmax; + int error; + + struct dn_pipe7 *p7 = (struct dn_pipe7 *)v; + struct dn_pipe8 *p8 = (struct dn_pipe8 *)v; + + int i; /* number of object to configure */ + + lmax = sizeof(struct dn_id); /* command header */ + lmax += sizeof(struct dn_sch) + sizeof(struct dn_link) + + sizeof(struct dn_fs) + sizeof(struct dn_profile); + + base = buf = malloc(lmax, M_DUMMYNET, M_WAITOK|M_ZERO); + o_next(&buf, sizeof(struct dn_id), DN_CMD_CONFIG); + base->id = DN_API_VERSION; + + /* pipe_nr is the same in p7 and p8 */ + i = p7->pipe_nr; + if (i != 0) { /* pipe config */ + sch = o_next(&buf, sizeof(*sch), DN_SCH); + p = o_next(&buf, sizeof(*p), DN_LINK); + fs = o_next(&buf, sizeof(*fs), DN_FS); + + error = dn_compat_config_pipe(sch, p, fs, v); + if (error) { + free(buf, M_DUMMYNET); + return error; + } + if (!is7 && p8->samples_no > 0) { + /* Add profiles*/ + pf = o_next(&buf, sizeof(*pf), DN_PROFILE); + error = dn_compat_config_profile(pf, p, v); + if (error) { + free(buf, M_DUMMYNET); + return error; + } + } + } else { /* queue config */ + fs = o_next(&buf, sizeof(*fs), DN_FS); + error = dn_compat_config_queue(fs, v); + if (error) { + free(buf, M_DUMMYNET); + return error; + } + } + error = do_config(base, (char *)buf - (char *)base); + + if (buf) + free(buf, M_DUMMYNET); + return error; +} + +int +dn_compat_calc_size(void) +{ + int need = 0; + /* XXX use FreeBSD 8 struct size */ + /* NOTE: + * - half scheduler: schk_count/2 + * - all flowset: fsk_count + * - all flowset queues: queue_count + * - all pipe queue: si_count + */ + need += dn_cfg.schk_count * sizeof(struct dn_pipe8) / 2; + need += dn_cfg.fsk_count * sizeof(struct dn_flow_set); + need += dn_cfg.si_count * sizeof(struct dn_flow_queue8); + need += dn_cfg.queue_count * sizeof(struct dn_flow_queue8); + + return need; +} + +int +dn_c_copy_q (void *_ni, void *arg) +{ + struct copy_args *a = arg; + struct dn_flow_queue7 *fq7 = (struct dn_flow_queue7 *)*a->start; + struct dn_flow_queue8 *fq8 = (struct dn_flow_queue8 *)*a->start; + struct dn_flow *ni = (struct dn_flow *)_ni; + int size = 0; + + /* XXX hash slot not set */ + /* No difference between 7.2/8 */ + fq7->len = ni->length; + fq7->len_bytes = ni->len_bytes; + fq7->id = ni->fid; + + if (is7) { + size = sizeof(struct dn_flow_queue7); + fq7->tot_pkts = ni->tot_pkts; + fq7->tot_bytes = ni->tot_bytes; + fq7->drops = ni->drops; + } else { + size = sizeof(struct dn_flow_queue8); + fq8->tot_pkts = ni->tot_pkts; + fq8->tot_bytes = ni->tot_bytes; + fq8->drops = ni->drops; + } + + *a->start += size; + return 0; +} + +int +dn_c_copy_pipe(struct dn_schk *s, struct copy_args *a, int nq) +{ + struct dn_link *l = &s->link; + struct dn_fsk *f = s->fs; + + struct dn_pipe7 *pipe7 = (struct dn_pipe7 *)*a->start; + struct dn_pipe8 *pipe8 = (struct dn_pipe8 *)*a->start; + struct dn_flow_set *fs; + int size = 0; + + if (is7) { + fs = &pipe7->fs; + size = sizeof(struct dn_pipe7); + } else { + fs = &pipe8->fs; + size = sizeof(struct dn_pipe8); + } + + /* These 4 field are the same in pipe7 and pipe8 */ + pipe7->next.sle_next 
= (struct dn_pipe7 *)DN_IS_PIPE; + pipe7->bandwidth = l->bandwidth; + pipe7->delay = l->delay * 1000 / hz; + pipe7->pipe_nr = l->link_nr - DN_MAX_ID; + + if (!is7) { + if (s->profile) { + struct dn_profile *pf = s->profile; + strncpy(pipe8->name, pf->name, sizeof(pf->name)); + pipe8->loss_level = pf->loss_level; + pipe8->samples_no = pf->samples_no; + } + pipe8->burst = div64(l->burst , 8 * hz); + } + + fs->flow_mask = s->sch.sched_mask; + fs->rq_size = s->sch.buckets ? s->sch.buckets : 1; + + fs->parent_nr = l->link_nr - DN_MAX_ID; + fs->qsize = f->fs.qsize; + fs->plr = f->fs.plr; + fs->w_q = f->fs.w_q; + fs->max_th = f->max_th; + fs->min_th = f->min_th; + fs->max_p = f->fs.max_p; + fs->rq_elements = nq; + + fs->flags_fs = convertflags2old(f->fs.flags); + + *a->start += size; + return 0; +} + + +int +dn_compat_copy_pipe(struct copy_args *a, void *_o) +{ + int have = a->end - *a->start; + int need = 0; + int pipe_size = sizeof(struct dn_pipe8); + int queue_size = sizeof(struct dn_flow_queue8); + int n_queue = 0; /* number of queues */ + + struct dn_schk *s = (struct dn_schk *)_o; + /* calculate needed space: + * - struct dn_pipe + * - if there are instances, dn_queue * n_instances + */ + n_queue = (s->sch.flags & DN_HAVE_MASK ? dn_ht_entries(s->siht) : + (s->siht ? 1 : 0)); + need = pipe_size + queue_size * n_queue; + if (have < need) { + D("have %d < need %d", have, need); + return 1; + } + /* copy pipe */ + dn_c_copy_pipe(s, a, n_queue); + + /* copy queues */ + if (s->sch.flags & DN_HAVE_MASK) + dn_ht_scan(s->siht, dn_c_copy_q, a); + else if (s->siht) + dn_c_copy_q(s->siht, a); + return 0; +} + +int +dn_c_copy_fs(struct dn_fsk *f, struct copy_args *a, int nq) +{ + struct dn_flow_set *fs = (struct dn_flow_set *)*a->start; + + fs->next.sle_next = (struct dn_flow_set *)DN_IS_QUEUE; + fs->fs_nr = f->fs.fs_nr; + fs->qsize = f->fs.qsize; + fs->plr = f->fs.plr; + fs->w_q = f->fs.w_q; + fs->max_th = f->max_th; + fs->min_th = f->min_th; + fs->max_p = f->fs.max_p; + fs->flow_mask = f->fs.flow_mask; + fs->rq_elements = nq; + fs->rq_size = (f->fs.buckets ? f->fs.buckets : 1); + fs->parent_nr = f->fs.sched_nr; + fs->weight = f->fs.par[0]; + + fs->flags_fs = convertflags2old(f->fs.flags); + *a->start += sizeof(struct dn_flow_set); + return 0; +} + +int +dn_compat_copy_queue(struct copy_args *a, void *_o) +{ + int have = a->end - *a->start; + int need = 0; + int fs_size = sizeof(struct dn_flow_set); + int queue_size = sizeof(struct dn_flow_queue8); + + struct dn_fsk *fs = (struct dn_fsk *)_o; + int n_queue = 0; /* number of queues */ + + n_queue = (fs->fs.flags & DN_HAVE_MASK ? dn_ht_entries(fs->qht) : + (fs->qht ? 1 : 0)); + + need = fs_size + queue_size * n_queue; + if (have < need) { + D("have < need"); + return 1; + } + + /* copy flowset */ + dn_c_copy_fs(fs, a, n_queue); + + /* copy queues */ + if (fs->fs.flags & DN_HAVE_MASK) + dn_ht_scan(fs->qht, dn_c_copy_q, a); + else if (fs->qht) + dn_c_copy_q(fs->qht, a); + + return 0; +} + +int +copy_data_helper_compat(void *_o, void *_arg) +{ + struct copy_args *a = _arg; + + if (a->type == DN_COMPAT_PIPE) { + struct dn_schk *s = _o; + if (s->sch.oid.subtype != 1 || s->sch.sched_nr <= DN_MAX_ID) { + return 0; /* not old type */ + } + /* copy pipe parameters, and if instance exists, copy + * other parameters and eventually queues. 
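+		 * A nonzero return from dn_compat_copy_pipe() means the
+		 * output buffer is full, so the scan stops with
+		 * DNHT_SCAN_END.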
+		 */
+		if (dn_compat_copy_pipe(a, _o))
+			return DNHT_SCAN_END;
+	} else if (a->type == DN_COMPAT_QUEUE) {
+		struct dn_fsk *fs = _o;
+		if (fs->fs.fs_nr >= DN_MAX_ID)
+			return 0;
+		if (dn_compat_copy_queue(a, _o))
+			return DNHT_SCAN_END;
+	}
+	return 0;
+}
+
+/* Main function to manage old requests */
+int
+ip_dummynet_compat(struct sockopt *sopt)
+{
+	int error = 0;
+	void *v = NULL;
+	struct dn_id oid;
+
+	/* Length of the data, used to find the ipfw version... */
+	int len = sopt->sopt_valsize;
+
+	/* len can be 0 if command was dummynet_flush */
+	if (len == pipesize7) {
+		D("setting compatibility with FreeBSD 7.2");
+		is7 = 1;
+	}
+	else if (len == pipesize8 || len == pipesizemax8) {
+		D("setting compatibility with FreeBSD 8");
+		is7 = 0;
+	}
+
+	switch (sopt->sopt_name) {
+	default:
+		printf("dummynet: -- unknown option %d", sopt->sopt_name);
+		error = EINVAL;
+		break;
+
+	case IP_DUMMYNET_FLUSH:
+		oid_fill(&oid, sizeof(oid), DN_CMD_FLUSH, DN_API_VERSION);
+		do_config(&oid, oid.len);
+		break;
+
+	case IP_DUMMYNET_DEL:
+		v = malloc(len, M_TEMP, M_WAITOK);
+		error = sooptcopyin(sopt, v, len, len);
+		if (error)
+			break;
+		error = dn_compat_del(v);
+		free(v, M_TEMP);
+		break;
+
+	case IP_DUMMYNET_CONFIGURE:
+		v = malloc(len, M_TEMP, M_WAITOK);
+		error = sooptcopyin(sopt, v, len, len);
+		if (error)
+			break;
+		error = dn_compat_configure(v);
+		free(v, M_TEMP);
+		break;
+
+	case IP_DUMMYNET_GET: {
+		void *buf;
+		int ret;
+		int original_size = sopt->sopt_valsize;
+		int size;
+
+		ret = dummynet_get(sopt, &buf);
+		if (ret)
+			return 0;	/* XXX ? */
+		size = sopt->sopt_valsize;
+		sopt->sopt_valsize = original_size;
+		D("size=%d, buf=%p", size, buf);
+		ret = sooptcopyout(sopt, buf, size);
+		if (ret)
+			printf(" %s ERROR sooptcopyout\n", __FUNCTION__);
+		if (buf)
+			free(buf, M_DUMMYNET);
+	}
+	}
+
+	return error;
+}
+
+
diff --git a/example/ipfw/sys/netpfil/ipfw/ip_dn_io.c b/example/ipfw/sys/netpfil/ipfw/ip_dn_io.c
new file mode 100644
index 0000000..6211221
--- /dev/null
+++ b/example/ipfw/sys/netpfil/ipfw/ip_dn_io.c
@@ -0,0 +1,960 @@
+/*-
+ * Copyright (c) 2010 Luigi Rizzo, Riccardo Panicucci, Universita` di Pisa
+ * All rights reserved
+ *
+ * Redistribution and use in source and binary forms, with or without
+ * modification, are permitted provided that the following conditions
+ * are met:
+ * 1. Redistributions of source code must retain the above copyright
+ *    notice, this list of conditions and the following disclaimer.
+ * 2. Redistributions in binary form must reproduce the above copyright
+ *    notice, this list of conditions and the following disclaimer in the
+ *    documentation and/or other materials provided with the distribution.
+ *
+ * THIS SOFTWARE IS PROVIDED BY THE AUTHOR AND CONTRIBUTORS ``AS IS'' AND
+ * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
+ * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
+ * ARE DISCLAIMED. IN NO EVENT SHALL THE AUTHOR OR CONTRIBUTORS BE LIABLE
+ * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
+ * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
+ * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
+ * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
+ * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
+ * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
+ * SUCH DAMAGE.
+ */
+
+/*
+ * Dummynet portions related to packet handling.
+ */
+#include <sys/cdefs.h>
+__FBSDID("$FreeBSD: head/sys/netpfil/ipfw/ip_dn_io.c 272089 2014-09-25 02:26:05Z sbruno $");
+
+#include "opt_inet6.h"
+
+#include <sys/param.h>
+#include <sys/systm.h>
+#include <sys/malloc.h>
+#include <sys/mbuf.h>
+#include <sys/kernel.h>
+#include <sys/lock.h>
+#include <sys/module.h>
+#include <sys/mutex.h>
+#include <sys/priv.h>
+#include <sys/proc.h>
+#include <sys/rwlock.h>
+#include <sys/socket.h>
+#include <sys/time.h>
+#include <sys/sysctl.h>
+
+#include <net/if.h>	/* IFNAMSIZ, struct ifaddr, ifq head, lock.h mutex.h */
+#include <net/netisr.h>
+#include <net/vnet.h>
+
+#include <netinet/in.h>
+#include <netinet/ip.h>		/* ip_len, ip_off */
+#include <netinet/ip_var.h>	/* ip_output(), IP_FORWARDING */
+#include <netinet/ip_fw.h>
+#include <netinet/ip_dummynet.h>
+#include <netinet/if_ether.h>	/* various ether_* routines */
+#include <netinet/ip6.h>	/* for ip6_input, ip6_output prototypes */
+#include <netinet6/ip6_var.h>
+
+#include <netpfil/ipfw/ip_fw_private.h>
+#include <netpfil/ipfw/dn_heap.h>
+#include <netpfil/ipfw/ip_dn_private.h>
+#include <netpfil/ipfw/dn_sched.h>
+
+/*
+ * We keep a private variable for the simulation time, but we could
+ * probably use an existing one ("softticks" in sys/kern/kern_timeout.c)
+ * instead of dn_cfg.curr_time
+ */
+
+struct dn_parms dn_cfg;
+//VNET_DEFINE(struct dn_parms, _base_dn_cfg);
+
+static long tick_last;		/* Last tick duration (usec). */
+static long tick_delta;		/* Last vs standard tick diff (usec). */
+static long tick_delta_sum;	/* Accumulated tick difference (usec).*/
+static long tick_adjustment;	/* Tick adjustments done. */
+static long tick_lost;		/* Lost(coalesced) ticks number. */
+/* Adjusted vs non-adjusted curr_time difference (ticks). */
+static long tick_diff;
+
+static unsigned long io_pkt;
+static unsigned long io_pkt_fast;
+static unsigned long io_pkt_drop;
+
+/*
+ * We use a heap to store entities for which we have pending timer events.
+ * The heap is checked at every tick and all entities with expired events
+ * are extracted.
+ */
+
+MALLOC_DEFINE(M_DUMMYNET, "dummynet", "dummynet heap");
+
+extern void (*bridge_dn_p)(struct mbuf *, struct ifnet *);
+
+#ifdef SYSCTL_NODE
+
+/*
+ * Because of the way the SYSBEGIN/SYSEND macros work on other
+ * platforms, there should not be functions between them.
+ * So keep the handlers outside the block.
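+ * (Added note: on FreeBSD the macros compile away; in the userland
+ * builds of this code they bracket the sysctl definitions so that the
+ * glue code can emulate them.)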
+ */ +static int +sysctl_hash_size(SYSCTL_HANDLER_ARGS) +{ + int error, value; + + value = dn_cfg.hash_size; + error = sysctl_handle_int(oidp, &value, 0, req); + if (error != 0 || req->newptr == NULL) + return (error); + if (value < 16 || value > 65536) + return (EINVAL); + dn_cfg.hash_size = value; + return (0); +} + +static int +sysctl_limits(SYSCTL_HANDLER_ARGS) +{ + int error; + long value; + + if (arg2 != 0) + value = dn_cfg.slot_limit; + else + value = dn_cfg.byte_limit; + error = sysctl_handle_long(oidp, &value, 0, req); + + if (error != 0 || req->newptr == NULL) + return (error); + if (arg2 != 0) { + if (value < 1) + return (EINVAL); + dn_cfg.slot_limit = value; + } else { + if (value < 1500) + return (EINVAL); + dn_cfg.byte_limit = value; + } + return (0); +} + +SYSBEGIN(f4) + +SYSCTL_DECL(_net_inet); +SYSCTL_DECL(_net_inet_ip); +static SYSCTL_NODE(_net_inet_ip, OID_AUTO, dummynet, CTLFLAG_RW, 0, "Dummynet"); + +/* wrapper to pass dn_cfg fields to SYSCTL_* */ +//#define DC(x) (&(VNET_NAME(_base_dn_cfg).x)) +#define DC(x) (&(dn_cfg.x)) +/* parameters */ + + +SYSCTL_PROC(_net_inet_ip_dummynet, OID_AUTO, hash_size, + CTLTYPE_INT | CTLFLAG_RW, 0, 0, sysctl_hash_size, + "I", "Default hash table size"); + + +SYSCTL_PROC(_net_inet_ip_dummynet, OID_AUTO, pipe_slot_limit, + CTLTYPE_LONG | CTLFLAG_RW, 0, 1, sysctl_limits, + "L", "Upper limit in slots for pipe queue."); +SYSCTL_PROC(_net_inet_ip_dummynet, OID_AUTO, pipe_byte_limit, + CTLTYPE_LONG | CTLFLAG_RW, 0, 0, sysctl_limits, + "L", "Upper limit in bytes for pipe queue."); +SYSCTL_INT(_net_inet_ip_dummynet, OID_AUTO, io_fast, + CTLFLAG_RW, DC(io_fast), 0, "Enable fast dummynet io."); +SYSCTL_INT(_net_inet_ip_dummynet, OID_AUTO, debug, + CTLFLAG_RW, DC(debug), 0, "Dummynet debug level"); + +/* RED parameters */ +SYSCTL_INT(_net_inet_ip_dummynet, OID_AUTO, red_lookup_depth, + CTLFLAG_RD, DC(red_lookup_depth), 0, "Depth of RED lookup table"); +SYSCTL_INT(_net_inet_ip_dummynet, OID_AUTO, red_avg_pkt_size, + CTLFLAG_RD, DC(red_avg_pkt_size), 0, "RED Medium packet size"); +SYSCTL_INT(_net_inet_ip_dummynet, OID_AUTO, red_max_pkt_size, + CTLFLAG_RD, DC(red_max_pkt_size), 0, "RED Max packet size"); + +/* time adjustment */ +SYSCTL_LONG(_net_inet_ip_dummynet, OID_AUTO, tick_delta, + CTLFLAG_RD, &tick_delta, 0, "Last vs standard tick difference (usec)."); +SYSCTL_LONG(_net_inet_ip_dummynet, OID_AUTO, tick_delta_sum, + CTLFLAG_RD, &tick_delta_sum, 0, "Accumulated tick difference (usec)."); +SYSCTL_LONG(_net_inet_ip_dummynet, OID_AUTO, tick_adjustment, + CTLFLAG_RD, &tick_adjustment, 0, "Tick adjustments done."); +SYSCTL_LONG(_net_inet_ip_dummynet, OID_AUTO, tick_diff, + CTLFLAG_RD, &tick_diff, 0, + "Adjusted vs non-adjusted curr_time difference (ticks)."); +SYSCTL_LONG(_net_inet_ip_dummynet, OID_AUTO, tick_lost, + CTLFLAG_RD, &tick_lost, 0, + "Number of ticks coalesced by dummynet taskqueue."); + +/* Drain parameters */ +SYSCTL_UINT(_net_inet_ip_dummynet, OID_AUTO, expire, + CTLFLAG_RW, DC(expire), 0, "Expire empty queues/pipes"); +SYSCTL_UINT(_net_inet_ip_dummynet, OID_AUTO, expire_cycle, + CTLFLAG_RD, DC(expire_cycle), 0, "Expire cycle for queues/pipes"); + +/* statistics */ +SYSCTL_INT(_net_inet_ip_dummynet, OID_AUTO, schk_count, + CTLFLAG_RD, DC(schk_count), 0, "Number of schedulers"); +SYSCTL_INT(_net_inet_ip_dummynet, OID_AUTO, si_count, + CTLFLAG_RD, DC(si_count), 0, "Number of scheduler instances"); +SYSCTL_INT(_net_inet_ip_dummynet, OID_AUTO, fsk_count, + CTLFLAG_RD, DC(fsk_count), 0, "Number of flowsets"); 
+SYSCTL_INT(_net_inet_ip_dummynet, OID_AUTO, queue_count,
+    CTLFLAG_RD, DC(queue_count), 0, "Number of queues");
+SYSCTL_ULONG(_net_inet_ip_dummynet, OID_AUTO, io_pkt,
+    CTLFLAG_RD, &io_pkt, 0,
+    "Number of packets passed to dummynet.");
+SYSCTL_ULONG(_net_inet_ip_dummynet, OID_AUTO, io_pkt_fast,
+    CTLFLAG_RD, &io_pkt_fast, 0,
+    "Number of packets that bypassed the dummynet scheduler.");
+SYSCTL_ULONG(_net_inet_ip_dummynet, OID_AUTO, io_pkt_drop,
+    CTLFLAG_RD, &io_pkt_drop, 0,
+    "Number of packets dropped by dummynet.");
+#undef DC
+SYSEND
+
+#endif
+
+static void	dummynet_send(struct mbuf *);
+
+/*
+ * Packets processed by dummynet have an mbuf tag associated with
+ * them that carries their dummynet state.
+ * Outside dummynet, only the 'rule' field is relevant, and it must
+ * be at the beginning of the structure.
+ */
+struct dn_pkt_tag {
+	struct ipfw_rule_ref rule;	/* matching rule	*/
+
+	/* second part, dummynet specific */
+	int dn_dir;		/* action when packet comes out.*/
+				/* see ip_fw_private.h		*/
+	uint64_t output_time;	/* when the pkt is due for delivery*/
+	struct ifnet *ifp;	/* interface, for ip_output	*/
+//	struct _ip6dn_args ip6opt;	/* XXX ipv6 options, 192 bytes */
+};
+
+/*
+ * Return the mbuf tag holding the dummynet state (it should
+ * be the first one on the list).
+ */
+static struct dn_pkt_tag *
+dn_tag_get(struct mbuf *m)
+{
+	struct m_tag *mtag = m_tag_first(m);
+	KASSERT(mtag != NULL &&
+	    mtag->m_tag_cookie == MTAG_ABI_COMPAT &&
+	    mtag->m_tag_id == PACKET_TAG_DUMMYNET,
+	    ("packet on dummynet queue w/o dummynet tag!"));
+	return (struct dn_pkt_tag *)(mtag+1);
+}
+
+static inline void
+mq_append(struct mq *q, struct mbuf *m)
+{
+#ifdef USERSPACE
+	// buffers from netmap need to be copied
+	// XXX note that the routine is not expected to fail
+	ND("append %p to %p", m, q);
+	if (m->m_flags & M_STACK) {
+		struct mbuf *m_new;
+		void *p;
+		int l, ofs;
+
+		ofs = m->m_data - m->__m_extbuf;
+		// XXX allocate
+		MGETHDR(m_new, M_NOWAIT, MT_DATA);
+		ND("*** WARNING, volatile buf %p ext %p %d dofs %d m_new %p",
+			m, m->__m_extbuf, m->__m_extlen, ofs, m_new);
+		p = m_new->__m_extbuf;	/* new pointer */
+		l = m_new->__m_extlen;	/* new len */
+		if (l <= m->__m_extlen) {
+			panic("extlen too large");
+		}
+
+		*m_new = *m;	// copy
+		m_new->m_flags &= ~M_STACK;
+		m_new->__m_extbuf = p; // point to new buffer
+		_pkt_copy(m->__m_extbuf, p, m->__m_extlen);
+		m_new->m_data = p + ofs;
+		m = m_new;
+	}
+#endif /* USERSPACE */
+	if (q->head == NULL)
+		q->head = m;
+	else
+		q->tail->m_nextpkt = m;
+	q->count++;
+	q->tail = m;
+	m->m_nextpkt = NULL;
+}
+
+/*
+ * Dispose of a list of packets. Use a function so that, if we need to
+ * do more work, there is a central point to do it.
+ */
+void dn_free_pkts(struct mbuf *mnext)
+{
+	struct mbuf *m;
+
+	while ((m = mnext) != NULL) {
+		mnext = m->m_nextpkt;
+		FREE_PKT(m);
+	}
+}
+
+static int
+red_drops (struct dn_queue *q, int len)
+{
+	/*
+	 * RED algorithm
+	 *
+	 * RED calculates the average queue size (avg) using a low-pass filter
+	 * with an exponential weighted (w_q) moving average:
+	 *	avg  <-  (1-w_q) * avg + w_q * q_size
+	 * where q_size is the queue length (measured in bytes or packets).
+	 *
+	 * If q_size == 0, we compute the idle time for the link, and set
+	 *	avg = (1 - w_q)^(idle/s)
+	 * where s is the time needed for transmitting a medium-sized packet.
+	 *
+	 * Now, if avg < min_th the packet is enqueued.
+	 * If avg > max_th the packet is dropped. Otherwise, the packet is
+	 * dropped with probability P function of avg.
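+	 * Example (illustrative numbers): with min_th = 10, max_th = 30
+	 * and max_p = 0.1, a packet arriving while avg = 20 is dropped
+	 * with p_b = max_p * (avg - min_th) / (max_th - min_th) = 0.05,
+	 * before the q->count correction applied further below.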
+ */ + + struct dn_fsk *fs = q->fs; + int64_t p_b = 0; + + /* Queue in bytes or packets? */ + uint32_t q_size = (fs->fs.flags & DN_QSIZE_BYTES) ? + q->ni.len_bytes : q->ni.length; + + /* Average queue size estimation. */ + if (q_size != 0) { + /* Queue is not empty, avg <- avg + (q_size - avg) * w_q */ + int diff = SCALE(q_size) - q->avg; + int64_t v = SCALE_MUL((int64_t)diff, (int64_t)fs->w_q); + + q->avg += (int)v; + } else { + /* + * Queue is empty, find for how long the queue has been + * empty and use a lookup table for computing + * (1 - * w_q)^(idle_time/s) where s is the time to send a + * (small) packet. + * XXX check wraps... + */ + if (q->avg) { + u_int t = div64((dn_cfg.curr_time - q->q_time), fs->lookup_step); + + q->avg = (t < fs->lookup_depth) ? + SCALE_MUL(q->avg, fs->w_q_lookup[t]) : 0; + } + } + + /* Should i drop? */ + if (q->avg < fs->min_th) { + q->count = -1; + return (0); /* accept packet */ + } + if (q->avg >= fs->max_th) { /* average queue >= max threshold */ + if (fs->fs.flags & DN_IS_ECN) + return (1); + if (fs->fs.flags & DN_IS_GENTLE_RED) { + /* + * According to Gentle-RED, if avg is greater than + * max_th the packet is dropped with a probability + * p_b = c_3 * avg - c_4 + * where c_3 = (1 - max_p) / max_th + * c_4 = 1 - 2 * max_p + */ + p_b = SCALE_MUL((int64_t)fs->c_3, (int64_t)q->avg) - + fs->c_4; + } else { + q->count = -1; + return (1); + } + } else if (q->avg > fs->min_th) { + if (fs->fs.flags & DN_IS_ECN) + return (1); + /* + * We compute p_b using the linear dropping function + * p_b = c_1 * avg - c_2 + * where c_1 = max_p / (max_th - min_th) + * c_2 = max_p * min_th / (max_th - min_th) + */ + p_b = SCALE_MUL((int64_t)fs->c_1, (int64_t)q->avg) - fs->c_2; + } + + if (fs->fs.flags & DN_QSIZE_BYTES) + p_b = div64((p_b * len) , fs->max_pkt_size); + if (++q->count == 0) + q->random = random() & 0xffff; + else { + /* + * q->count counts packets arrived since last drop, so a greater + * value of q->count means a greater packet drop probability. + */ + if (SCALE_MUL(p_b, SCALE((int64_t)q->count)) > q->random) { + q->count = 0; + /* After a drop we calculate a new random value. */ + q->random = random() & 0xffff; + return (1); /* drop */ + } + } + /* End of RED algorithm. */ + + return (0); /* accept */ + +} + +/* + * ECN/ECT Processing (partially adopted from altq) + */ +static int +ecn_mark(struct mbuf* m) +{ + struct ip *ip; + ip = mtod(m, struct ip *); + + switch (ip->ip_v) { + case IPVERSION: + { + u_int8_t otos; + int sum; + + if ((ip->ip_tos & IPTOS_ECN_MASK) == IPTOS_ECN_NOTECT) + return (0); /* not-ECT */ + if ((ip->ip_tos & IPTOS_ECN_MASK) == IPTOS_ECN_CE) + return (1); /* already marked */ + + /* + * ecn-capable but not marked, + * mark CE and update checksum + */ + otos = ip->ip_tos; + ip->ip_tos |= IPTOS_ECN_CE; + /* + * update checksum (from RFC1624) + * HC' = ~(~HC + ~m + m') + */ + sum = ~ntohs(ip->ip_sum) & 0xffff; + sum += (~otos & 0xffff) + ip->ip_tos; + sum = (sum >> 16) + (sum & 0xffff); + sum += (sum >> 16); /* add carry */ + ip->ip_sum = htons(~sum & 0xffff); + return (1); + } +#ifdef INET6 + case (IPV6_VERSION >> 4): + { + struct ip6_hdr *ip6 = mtod(m, struct ip6_hdr *); + u_int32_t flowlabel; + + flowlabel = ntohl(ip6->ip6_flow); + if ((flowlabel >> 28) != 6) + return (0); /* version mismatch! 
*/ + if ((flowlabel & (IPTOS_ECN_MASK << 20)) == + (IPTOS_ECN_NOTECT << 20)) + return (0); /* not-ECT */ + if ((flowlabel & (IPTOS_ECN_MASK << 20)) == + (IPTOS_ECN_CE << 20)) + return (1); /* already marked */ + /* + * ecn-capable but not marked, mark CE + */ + flowlabel |= (IPTOS_ECN_CE << 20); + ip6->ip6_flow = htonl(flowlabel); + return (1); + } +#endif + } + return (0); +} + +/* + * Enqueue a packet in q, subject to space and queue management policy + * (whose parameters are in q->fs). + * Update stats for the queue and the scheduler. + * Return 0 on success, 1 on drop. The packet is consumed anyways. + */ +int +dn_enqueue(struct dn_queue *q, struct mbuf* m, int drop) +{ + struct dn_fs *f; + struct dn_flow *ni; /* stats for scheduler instance */ + uint64_t len; + + if (q->fs == NULL || q->_si == NULL) { + printf("%s fs %p si %p, dropping\n", + __FUNCTION__, q->fs, q->_si); + FREE_PKT(m); + return 1; + } + f = &(q->fs->fs); + ni = &q->_si->ni; + len = m->m_pkthdr.len; + /* Update statistics, then check reasons to drop pkt. */ + q->ni.tot_bytes += len; + q->ni.tot_pkts++; + ni->tot_bytes += len; + ni->tot_pkts++; + if (drop) + goto drop; + if (f->plr && random() < f->plr) + goto drop; + if (f->flags & DN_IS_RED && red_drops(q, m->m_pkthdr.len)) { + if (!(f->flags & DN_IS_ECN) || !ecn_mark(m)) + goto drop; + } + if (f->flags & DN_QSIZE_BYTES) { + if (q->ni.len_bytes > f->qsize) + goto drop; + } else if (q->ni.length >= f->qsize) { + goto drop; + } + mq_append(&q->mq, m); + q->ni.length++; + q->ni.len_bytes += len; + ni->length++; + ni->len_bytes += len; + return (0); + +drop: + io_pkt_drop++; + q->ni.drops++; + ni->drops++; + FREE_PKT(m); + return (1); +} + +/* + * Fetch packets from the delay line which are due now. If there are + * leftover packets, reinsert the delay line in the heap. + * Runs under scheduler lock. + */ +static void +transmit_event(struct mq *q, struct delay_line *dline, uint64_t now) +{ + struct mbuf *m; + struct dn_pkt_tag *pkt = NULL; + + dline->oid.subtype = 0; /* not in heap */ + while ((m = dline->mq.head) != NULL) { + pkt = dn_tag_get(m); + if (!DN_KEY_LEQ(pkt->output_time, now)) + break; + dline->mq.head = m->m_nextpkt; + dline->mq.count--; + mq_append(q, m); + } + if (m != NULL) { + dline->oid.subtype = 1; /* in heap */ + heap_insert(&dn_cfg.evheap, pkt->output_time, dline); + } +} + +/* + * Convert the additional MAC overheads/delays into an equivalent + * number of bits for the given data rate. The samples are + * in milliseconds so we need to divide by 1000. + */ +static uint64_t +extra_bits(struct mbuf *m, struct dn_schk *s) +{ + int index; + uint64_t bits; + struct dn_profile *pf = s->profile; + + if (!pf || pf->samples_no == 0) + return 0; + index = random() % pf->samples_no; + bits = div64((uint64_t)pf->samples[index] * s->link.bandwidth, 1000); + if (index >= pf->loss_level) { + struct dn_pkt_tag *dt = dn_tag_get(m); + if (dt) + dt->dn_dir = DIR_DROP; + } + return bits; +} + +/* + * Send traffic from a scheduler instance due by 'now'. + * Return a pointer to the head of the queue. 
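+ * Example (illustrative): at bandwidth = 1 Mbit/s and hz = 1000, one
+ * tick of idle time earns 10^6 units of credit, while a 1500-byte
+ * packet costs len_scaled = hz * 1500 * 8 = 12 * 10^6, i.e. roughly
+ * 12 ticks of link time.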
+ */
+static struct mbuf *
+serve_sched(struct mq *q, struct dn_sch_inst *si, uint64_t now)
+{
+	struct mq def_q;
+	struct dn_schk *s = si->sched;
+	struct mbuf *m = NULL;
+	int delay_line_idle = (si->dline.mq.head == NULL);
+	int done, bw;
+
+	if (q == NULL) {
+		q = &def_q;
+		q->head = NULL;
+	}
+
+	bw = s->link.bandwidth;
+	si->kflags &= ~DN_ACTIVE;
+
+	if (bw > 0)
+		si->credit += (now - si->sched_time) * bw;
+	else
+		si->credit = 0;
+	si->sched_time = now;
+	done = 0;
+	while (si->credit >= 0 && (m = s->fp->dequeue(si)) != NULL) {
+		uint64_t len_scaled;
+
+		done++;
+		len_scaled = (bw == 0) ? 0 : hz *
+			(m->m_pkthdr.len * 8 + extra_bits(m, s));
+		si->credit -= len_scaled;
+		/* Move packet in the delay line */
+		dn_tag_get(m)->output_time = dn_cfg.curr_time + s->link.delay ;
+		mq_append(&si->dline.mq, m);
+	}
+
+	/*
+	 * If credit >= 0 the instance is idle, mark time.
+	 * Otherwise put back in the heap, and adjust the output
+	 * time of the last inserted packet, m, which was too early.
+	 */
+	if (si->credit >= 0) {
+		si->idle_time = now;
+	} else {
+		uint64_t t;
+		KASSERT (bw > 0, ("bw=0 and credit<0 ?"));
+		t = div64(bw - 1 - si->credit, bw);
+		if (m)
+			dn_tag_get(m)->output_time += t;
+		si->kflags |= DN_ACTIVE;
+		heap_insert(&dn_cfg.evheap, now + t, si);
+	}
+	if (delay_line_idle && done)
+		transmit_event(q, &si->dline, now);
+	return q->head;
+}
+
+/*
+ * The timer handler for dummynet. Time is computed in ticks, but
+ * the code is tolerant to the actual rate at which this is called.
+ * Once complete, the function reschedules itself for the next tick.
+ */
+void
+dummynet_task(void *context, int pending)
+{
+	struct timeval t;
+	struct mq q = { NULL, NULL }; /* queue to accumulate results */
+
+	CURVNET_SET((struct vnet *)context);
+
+	DN_BH_WLOCK();
+
+	/* Update number of lost(coalesced) ticks. */
+	tick_lost += pending - 1;
+
+	getmicrouptime(&t);
+	/* Last tick duration (usec). */
+	tick_last = (t.tv_sec - dn_cfg.prev_t.tv_sec) * 1000000 +
+	    (t.tv_usec - dn_cfg.prev_t.tv_usec);
+	/* Last tick vs standard tick difference (usec). */
+	tick_delta = (tick_last * hz - 1000000) / hz;
+	/* Accumulated tick difference (usec). */
+	tick_delta_sum += tick_delta;
+
+	dn_cfg.prev_t = t;
+
+	/*
+	 * Adjust curr_time if the accumulated tick difference is
+	 * greater than the 'standard' tick. Since curr_time should
+	 * be monotonically increasing, we do positive adjustments
+	 * as required, and throttle curr_time in case of negative
+	 * adjustment.
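+	 * Example (illustrative): with hz = 1000 the standard tick is
+	 * 1000 usec; if every tick actually lasts 1001 usec,
+	 * tick_delta_sum grows by 1 usec per tick and curr_time gains
+	 * one extra tick roughly every 1000 real ticks.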
+ */ + dn_cfg.curr_time++; + if (tick_delta_sum - tick >= 0) { + int diff = tick_delta_sum / tick; + + dn_cfg.curr_time += diff; + tick_diff += diff; + tick_delta_sum %= tick; + tick_adjustment++; + } else if (tick_delta_sum + tick <= 0) { + dn_cfg.curr_time--; + tick_diff--; + tick_delta_sum += tick; + tick_adjustment++; + } + + /* serve pending events, accumulate in q */ + for (;;) { + struct dn_id *p; /* generic parameter to handler */ + + if (dn_cfg.evheap.elements == 0 || + DN_KEY_LT(dn_cfg.curr_time, HEAP_TOP(&dn_cfg.evheap)->key)) + break; + p = HEAP_TOP(&dn_cfg.evheap)->object; + heap_extract(&dn_cfg.evheap, NULL); + + if (p->type == DN_SCH_I) { + serve_sched(&q, (struct dn_sch_inst *)p, dn_cfg.curr_time); + } else { /* extracted a delay line */ + transmit_event(&q, (struct delay_line *)p, dn_cfg.curr_time); + } + } + if (dn_cfg.expire && ++dn_cfg.expire_cycle >= dn_cfg.expire) { + dn_cfg.expire_cycle = 0; + dn_drain_scheduler(); + dn_drain_queue(); + } + + DN_BH_WUNLOCK(); + dn_reschedule(); + if (q.head != NULL) + dummynet_send(q.head); + CURVNET_RESTORE(); +} + +/* + * forward a chain of packets to the proper destination. + * This runs outside the dummynet lock. + */ +static void +dummynet_send(struct mbuf *m) +{ + struct mbuf *n; + + for (; m != NULL; m = n) { + struct ifnet *ifp = NULL; /* gcc 3.4.6 complains */ + struct m_tag *tag; + int dst; + + n = m->m_nextpkt; + m->m_nextpkt = NULL; + tag = m_tag_first(m); + if (tag == NULL) { /* should not happen */ + dst = DIR_DROP; + } else { + struct dn_pkt_tag *pkt = dn_tag_get(m); + /* extract the dummynet info, rename the tag + * to carry reinject info. + */ + if (pkt->dn_dir == (DIR_OUT | PROTO_LAYER2) && + pkt->ifp == NULL) { + dst = DIR_DROP; + } else { + dst = pkt->dn_dir; + ifp = pkt->ifp; + tag->m_tag_cookie = MTAG_IPFW_RULE; + tag->m_tag_id = 0; + } + } + + switch (dst) { + case DIR_OUT: + ip_output(m, NULL, NULL, IP_FORWARDING, NULL, NULL); + break ; + + case DIR_IN : + netisr_dispatch(NETISR_IP, m); + break; + +#ifdef INET6 + case DIR_IN | PROTO_IPV6: + netisr_dispatch(NETISR_IPV6, m); + break; + + case DIR_OUT | PROTO_IPV6: + ip6_output(m, NULL, NULL, IPV6_FORWARDING, NULL, NULL, NULL); + break; +#endif + + case DIR_FWD | PROTO_IFB: /* DN_TO_IFB_FWD: */ + if (bridge_dn_p != NULL) + ((*bridge_dn_p)(m, ifp)); + else + printf("dummynet: if_bridge not loaded\n"); + + break; + + case DIR_IN | PROTO_LAYER2: /* DN_TO_ETH_DEMUX: */ + /* + * The Ethernet code assumes the Ethernet header is + * contiguous in the first mbuf header. + * Insure this is true. + */ + if (m->m_len < ETHER_HDR_LEN && + (m = m_pullup(m, ETHER_HDR_LEN)) == NULL) { + printf("dummynet/ether: pullup failed, " + "dropping packet\n"); + break; + } + ether_demux(m->m_pkthdr.rcvif, m); + break; + + case DIR_OUT | PROTO_LAYER2: /* N_TO_ETH_OUT: */ + ether_output_frame(ifp, m); + break; + + case DIR_DROP: + /* drop the packet after some time */ + FREE_PKT(m); + break; + + default: + printf("dummynet: bad switch %d!\n", dst); + FREE_PKT(m); + break; + } + } +} + +static inline int +tag_mbuf(struct mbuf *m, int dir, struct ip_fw_args *fwa) +{ + struct dn_pkt_tag *dt; + struct m_tag *mtag; + + mtag = m_tag_get(PACKET_TAG_DUMMYNET, + sizeof(*dt), M_NOWAIT | M_ZERO); + if (mtag == NULL) + return 1; /* Cannot allocate packet header. */ + m_tag_prepend(m, mtag); /* Attach to mbuf chain. 
 */
+	dt = (struct dn_pkt_tag *)(mtag + 1);
+	dt->rule = fwa->rule;
+	dt->rule.info &= IPFW_ONEPASS;	/* only keep this info */
+	dt->dn_dir = dir;
+	dt->ifp = fwa->oif;
+	/* dt->output_time is updated as we move through */
+	dt->output_time = dn_cfg.curr_time;
+	return 0;
+}
+
+
+/*
+ * dummynet hook for packets.
+ * We use the argument to locate the flowset fs and the sched_set sch
+ * associated to it. Then we apply flow_mask and sched_mask to
+ * determine the queue and scheduler instances.
+ *
+ * dir		where shall we send the packet after dummynet.
+ * *m0		the mbuf with the packet
+ * ifp		the 'ifp' parameter from the caller.
+ *		NULL in ip_input, destination interface in ip_output,
+ */
+int
+dummynet_io(struct mbuf **m0, int dir, struct ip_fw_args *fwa)
+{
+	struct mbuf *m = *m0;
+	struct dn_fsk *fs = NULL;
+	struct dn_sch_inst *si;
+	struct dn_queue *q = NULL;	/* default */
+
+	int fs_id = (fwa->rule.info & IPFW_INFO_MASK) +
+		((fwa->rule.info & IPFW_IS_PIPE) ? 2*DN_MAX_ID : 0);
+	DN_BH_WLOCK();
+	io_pkt++;
+	/* we could actually tag outside the lock, but who cares... */
+	if (tag_mbuf(m, dir, fwa))
+		goto dropit;
+	if (dn_cfg.busy) {
+		/* if the upper half is busy doing something expensive,
+		 * lets queue the packet and move forward
+		 */
+		mq_append(&dn_cfg.pending, m);
+		m = *m0 = NULL; /* consumed */
+		goto done; /* already active, nothing to do */
+	}
+	/* XXX locate_flowset could be optimised with a direct ref. */
+	fs = dn_ht_find(dn_cfg.fshash, fs_id, 0, NULL);
+	if (fs == NULL)
+		goto dropit;	/* This queue/pipe does not exist! */
+	if (fs->sched == NULL)	/* should not happen */
+		goto dropit;
+	/* find scheduler instance, possibly applying sched_mask */
+	si = ipdn_si_find(fs->sched, &(fwa->f_id));
+	if (si == NULL)
+		goto dropit;
+	/*
+	 * If the scheduler supports multiple queues, find the right one
+	 * (otherwise it will be ignored by enqueue).
+	 */
+	if (fs->sched->fp->flags & DN_MULTIQUEUE) {
+		q = ipdn_q_find(fs, si, &(fwa->f_id));
+		if (q == NULL)
+			goto dropit;
+	}
+	if (fs->sched->fp->enqueue(si, q, m)) {
+		/* packet was dropped by enqueue() */
+		m = *m0 = NULL;
+		goto dropit;
+	}
+
+	if (si->kflags & DN_ACTIVE) {
+		m = *m0 = NULL; /* consumed */
+		goto done; /* already active, nothing to do */
+	}
+
+	/* compute the initial allowance */
+	if (si->idle_time < dn_cfg.curr_time) {
+		/* Do this only on the first packet on an idle pipe */
+		struct dn_link *p = &fs->sched->link;
+
+		si->sched_time = dn_cfg.curr_time;
+		si->credit = dn_cfg.io_fast ? p->bandwidth : 0;
+		if (p->burst) {
+			uint64_t burst = (dn_cfg.curr_time - si->idle_time) * p->bandwidth;
+			if (burst > p->burst)
+				burst = p->burst;
+			si->credit += burst;
+		}
+	}
+	/* pass through scheduler and delay line */
+	m = serve_sched(NULL, si, dn_cfg.curr_time);
+
+	/* optimization -- pass it back to ipfw for immediate send */
+	/* XXX Don't call dummynet_send() if the scheduler returns the
+	 * packet just enqueued. This avoids a lock order reversal.
+	 *
+	 */
+	if (/*dn_cfg.io_fast &&*/ m == *m0 && (dir & PROTO_LAYER2) == 0 ) {
+		/* fast io, rename the tag
+		 * to carry reinject info.
*/ + struct m_tag *tag = m_tag_first(m); + + tag->m_tag_cookie = MTAG_IPFW_RULE; + tag->m_tag_id = 0; + io_pkt_fast++; + if (m->m_nextpkt != NULL) { + printf("dummynet: fast io: pkt chain detected!\n"); + m->m_nextpkt = NULL; + } + m = NULL; + } else { + *m0 = NULL; + } +done: + DN_BH_WUNLOCK(); + if (m) + dummynet_send(m); + return 0; + +dropit: + io_pkt_drop++; + DN_BH_WUNLOCK(); + if (m) + FREE_PKT(m); + *m0 = NULL; + return (fs && (fs->fs.flags & DN_NOERROR)) ? 0 : ENOBUFS; +} diff --git a/example/ipfw/sys/netpfil/ipfw/ip_dn_private.h b/example/ipfw/sys/netpfil/ipfw/ip_dn_private.h new file mode 100644 index 0000000..fdd7448 --- /dev/null +++ b/example/ipfw/sys/netpfil/ipfw/ip_dn_private.h @@ -0,0 +1,404 @@ +/*- + * Copyright (c) 2010 Luigi Rizzo, Riccardo Panicucci, Universita` di Pisa + * All rights reserved + * + * Redistribution and use in source and binary forms, with or without + * modification, are permitted provided that the following conditions + * are met: + * 1. Redistributions of source code must retain the above copyright + * notice, this list of conditions and the following disclaimer. + * 2. Redistributions in binary form must reproduce the above copyright + * notice, this list of conditions and the following disclaimer in the + * documentation and/or other materials provided with the distribution. + * + * THIS SOFTWARE IS PROVIDED BY THE AUTHOR AND CONTRIBUTORS ``AS IS'' AND + * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE + * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE + * ARE DISCLAIMED. IN NO EVENT SHALL THE AUTHOR OR CONTRIBUTORS BE LIABLE + * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL + * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS + * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) + * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT + * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY + * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF + * SUCH DAMAGE. + */ + +/* + * internal dummynet APIs. + * + * $FreeBSD: head/sys/netpfil/ipfw/ip_dn_private.h 258467 2013-11-22 05:02:37Z luigi $ + */ + +#ifndef _IP_DN_PRIVATE_H +#define _IP_DN_PRIVATE_H + +/* debugging support + * use ND() to remove debugging, D() to print a line, + * DX(level, ...) to print above a certain level + * If you redefine D() you are expected to redefine all. + */ +#ifndef D +#define ND(fmt, ...) do {} while (0) +#define D1(fmt, ...) do {} while (0) +#define D(fmt, ...) printf("%-10s " fmt "\n", \ + __FUNCTION__, ## __VA_ARGS__) +#define DX(lev, fmt, ...) 
do { \ + if (dn_cfg.debug > lev) D(fmt, ## __VA_ARGS__); } while (0) +#endif + +MALLOC_DECLARE(M_DUMMYNET); + +#ifndef __linux__ +#define div64(a, b) ((int64_t)(a) / (int64_t)(b)) +#endif + +#define DN_LOCK_INIT() do { \ + mtx_init(&dn_cfg.uh_mtx, "dn_uh", NULL, MTX_DEF); \ + mtx_init(&dn_cfg.bh_mtx, "dn_bh", NULL, MTX_DEF); \ + } while (0) +#define DN_LOCK_DESTROY() do { \ + mtx_destroy(&dn_cfg.uh_mtx); \ + mtx_destroy(&dn_cfg.bh_mtx); \ + } while (0) +#if 0 /* not used yet */ +#define DN_UH_RLOCK() mtx_lock(&dn_cfg.uh_mtx) +#define DN_UH_RUNLOCK() mtx_unlock(&dn_cfg.uh_mtx) +#define DN_UH_WLOCK() mtx_lock(&dn_cfg.uh_mtx) +#define DN_UH_WUNLOCK() mtx_unlock(&dn_cfg.uh_mtx) +#define DN_UH_LOCK_ASSERT() mtx_assert(&dn_cfg.uh_mtx, MA_OWNED) +#endif + +#define DN_BH_RLOCK() mtx_lock(&dn_cfg.uh_mtx) +#define DN_BH_RUNLOCK() mtx_unlock(&dn_cfg.uh_mtx) +#define DN_BH_WLOCK() mtx_lock(&dn_cfg.uh_mtx) +#define DN_BH_WUNLOCK() mtx_unlock(&dn_cfg.uh_mtx) +#define DN_BH_LOCK_ASSERT() mtx_assert(&dn_cfg.uh_mtx, MA_OWNED) + +SLIST_HEAD(dn_schk_head, dn_schk); +SLIST_HEAD(dn_sch_inst_head, dn_sch_inst); +SLIST_HEAD(dn_fsk_head, dn_fsk); +SLIST_HEAD(dn_queue_head, dn_queue); +SLIST_HEAD(dn_alg_head, dn_alg); + +struct mq { /* a basic queue of packets*/ + struct mbuf *head, *tail; + int count; +}; + +static inline void +set_oid(struct dn_id *o, int type, int len) +{ + o->type = type; + o->len = len; + o->subtype = 0; +} + +/* + * configuration and global data for a dummynet instance + * + * When a configuration is modified from userland, 'id' is incremented + * so we can use the value to check for stale pointers. + */ +struct dn_parms { + uint32_t id; /* configuration version */ + + /* defaults (sysctl-accessible) */ + int red_lookup_depth; + int red_avg_pkt_size; + int red_max_pkt_size; + int hash_size; + int max_hash_size; + long byte_limit; /* max queue sizes */ + long slot_limit; + + int io_fast; + int debug; + + /* timekeeping */ + struct timeval prev_t; /* last time dummynet_tick ran */ + struct dn_heap evheap; /* scheduled events */ + + /* counters of objects -- used for reporting space */ + int schk_count; + int si_count; + int fsk_count; + int queue_count; + + /* ticks and other stuff */ + uint64_t curr_time; + /* flowsets and schedulers are in hash tables, with 'hash_size' + * buckets. fshash is looked up at every packet arrival + * so better be generous if we expect many entries. + */ + struct dn_ht *fshash; + struct dn_ht *schedhash; + /* list of flowsets without a scheduler -- use sch_chain */ + struct dn_fsk_head fsu; /* list of unlinked flowsets */ + struct dn_alg_head schedlist; /* list of algorithms */ + + /* Store the fs/sch to scan when draining. The value is the + * bucket number of the hash table. Expire can be disabled + * with net.inet.ip.dummynet.expire=0, or it happens every + * expire ticks. + **/ + int drain_fs; + int drain_sch; + uint32_t expire; + uint32_t expire_cycle; /* tick count */ + + int init_done; + + /* if the upper half is busy doing something long, + * can set the busy flag and we will enqueue packets in + * a queue for later processing. + */ + int busy; + struct mq pending; + +#ifdef _KERNEL + /* + * This file is normally used in the kernel, unless we do + * some userland tests, in which case we do not need a mtx. + * uh_mtx arbitrates between system calls and also + * protects fshash, schedhash and fsunlinked. + * These structures are readonly for the lower half. 
+ * bh_mtx protects all other structures which may be
+ * modified upon packet arrivals
+ */
+#if defined( __linux__ ) || defined( _WIN32 )
+	spinlock_t uh_mtx;
+	spinlock_t bh_mtx;
+#else
+	struct mtx uh_mtx;
+	struct mtx bh_mtx;
+#endif
+
+#endif /* _KERNEL */
+};
+
+/*
+ * Delay line, contains all packets on output from a link.
+ * Every scheduler instance has one.
+ */
+struct delay_line {
+	struct dn_id oid;
+	struct dn_sch_inst *si;
+	struct mq mq;
+};
+
+/*
+ * The kernel side of a flowset. It is linked in a hash table
+ * of flowsets, and in a list of children of their parent scheduler.
+ * qht is either the queue or (if HAVE_MASK) a hash table of queues.
+ * Note that the mask to use is the (flow_mask|sched_mask), which
+ * changes as we attach/detach schedulers. So we store it here.
+ *
+ * XXX If we want to add scheduler-specific parameters, we need to
+ * put them in external storage because the scheduler may not be
+ * available when the fsk is created.
+ */
+struct dn_fsk { /* kernel side of a flowset */
+	struct dn_fs fs;
+	SLIST_ENTRY(dn_fsk) fsk_next;	/* hash chain for fshash */
+
+	struct ipfw_flow_id fsk_mask;
+
+	/* qht is a hash table of queues, or just a single queue
+	 * a bit in fs.flags tells us which one
+	 */
+	struct dn_ht	*qht;
+	struct dn_schk *sched;		/* Sched we are linked to */
+	SLIST_ENTRY(dn_fsk) sch_chain;	/* list of fsk attached to sched */
+
+	/* bucket index used by drain routine to drain queues for this
+	 * flowset
+	 */
+	int drain_bucket;
+	/* Parameters related to RED / GRED */
+	/* original values are in dn_fs */
+	int w_q ;		/* queue weight (scaled) */
+	int max_th ;		/* maximum threshold for queue (scaled) */
+	int min_th ;		/* minimum threshold for queue (scaled) */
+	int max_p ;		/* maximum value for p_b (scaled) */
+
+	u_int c_1 ;		/* max_p/(max_th-min_th) (scaled) */
+	u_int c_2 ;		/* max_p*min_th/(max_th-min_th) (scaled) */
+	u_int c_3 ;		/* for GRED, (1-max_p)/max_th (scaled) */
+	u_int c_4 ;		/* for GRED, 1 - 2*max_p (scaled) */
+	u_int * w_q_lookup ;	/* lookup table for computing (1-w_q)^t */
+	u_int lookup_depth ;	/* depth of lookup table */
+	int lookup_step ;	/* granularity inside the lookup table */
+	int lookup_weight ;	/* equal to (1-w_q)^t / (1-w_q)^(t+1) */
+	int avg_pkt_size ;	/* medium packet size */
+	int max_pkt_size ;	/* max packet size */
+};
+
+/*
+ * A queue is created as a child of a flowset unless it belongs to
+ * a !MULTIQUEUE scheduler. It is normally in a hash table in the
+ * flowset. fs always points to the parent flowset.
+ * si normally points to the sch_inst, unless the flowset has been
+ * detached from the scheduler -- in this case si == NULL and we
+ * should not enqueue.
+ */
+struct dn_queue {
+	struct dn_flow ni;	/* oid, flow_id, stats */
+	struct mq mq;		/* packets queue */
+	struct dn_sch_inst *_si;	/* owner scheduler instance */
+	SLIST_ENTRY(dn_queue) q_next;	/* hash chain list for qht */
+	struct dn_fsk *fs;		/* parent flowset. */
+
+	/* RED parameters */
+	int avg;		/* average queue length est. (scaled) */
+	int count;		/* arrivals since last RED drop */
+	int random;		/* random value (scaled) */
+	uint64_t q_time;	/* start of queue idle time */
+
+};
+
+/*
+ * The kernel side of a scheduler. Contains the userland config,
+ * a link, pointer to extra config arguments from command line,
+ * kernel flags, and a pointer to the scheduler methods.
+ * It is stored in a hash table, and holds a list of all
+ * flowsets and scheduler instances.
+ * XXX sch must be at the beginning, see schk_hash().
+ */
+struct dn_schk {
+	struct dn_sch sch;
+	struct dn_alg *fp;	/* Pointer to scheduler functions */
+	struct dn_link link;	/* The link, embedded */
+	struct dn_profile *profile; /* delay profile, if any */
+	struct dn_id *cfg;	/* extra config arguments */
+
+	SLIST_ENTRY(dn_schk) schk_next;	/* hash chain for schedhash */
+
+	struct dn_fsk_head fsk_list;	/* all fsk linked to me */
+	struct dn_fsk *fs;	/* Flowset for !MULTIQUEUE */
+
+	/* bucket index used by the drain routine to drain the scheduler
+	 * instance for this flowset.
+	 */
+	int drain_bucket;
+
+	/* Hash table of all instances (through sch.sched_mask)
+	 * or single instance if no mask. Always valid.
+	 */
+	struct dn_ht	*siht;
+};
+
+
+/*
+ * Scheduler instance.
+ * Contains variables and all queues relative to this instance.
+ * This struct is created at runtime.
+ */
+struct dn_sch_inst {
+	struct dn_flow ni;	/* oid, flowid and stats */
+	SLIST_ENTRY(dn_sch_inst) si_next; /* hash chain for siht */
+	struct delay_line dline;
+	struct dn_schk *sched;	/* the template */
+	int kflags;		/* DN_ACTIVE */
+
+	int64_t	credit;		/* bits I can transmit (more or less). */
+	uint64_t sched_time;	/* time link was scheduled in ready_heap */
+	uint64_t idle_time;	/* start of scheduler instance idle time */
+
+	/* q_count is the number of queues that this instance is using.
+	 * The counter is incremented or decremented when
+	 * a reference from the queue is created or deleted.
+	 * It is used to make sure that a scheduler instance can be safely
+	 * deleted by the drain routine. See notes below.
+	 */
+	int q_count;
+
+};
+
+/*
+ * NOTE about object drain.
+ * The system will automatically (XXX check when) drain queues and
+ * scheduler instances when they are idle.
+ * A queue is idle when it has no packets; an instance is idle when
+ * it is not in the evheap heap, and the corresponding delay line is empty.
+ * A queue can be safely deleted when it is idle, because the scheduler
+ * function xxx_free_queue() will remove any references to it.
+ * An instance can only be deleted when no queues reference it. To be sure
+ * of that, a counter (q_count) stores the number of queues that are pointing
+ * to the instance.
+ *
+ * XXX
+ * Order of scan:
+ * - take all flowsets in a bucket for the flowset hash table
+ * - take all queues in a bucket for the flowset
+ * - increment the queue bucket
+ * - scan next flowset bucket
+ * Nothing is done if a bucket contains no entries.
+ *
+ * The same schema is used for scheduler instances
+ */
+
+
+/* kernel-side flags. Linux has DN_DELETE in fcntl.h
+ */
+enum {
+	/* 1 and 2 are reserved for the SCAN flags */
+	DN_DESTROY	= 0x0004, /* destroy */
+	DN_DELETE_FS	= 0x0008, /* destroy flowset */
+	DN_DETACH	= 0x0010,
+	DN_ACTIVE	= 0x0020, /* object is in evheap */
+	DN_F_DLINE	= 0x0040, /* object is a delay line */
+	DN_DEL_SAFE	= 0x0080, /* delete a queue only if no longer needed
+				   * by scheduler */
+	DN_QHT_IS_Q	= 0x0100, /* in flowset, qht is a single queue */
+};
+
+extern struct dn_parms dn_cfg;
+//VNET_DECLARE(struct dn_parms, _base_dn_cfg);
+//#define dn_cfg	VNET(_base_dn_cfg)
+
+int dummynet_io(struct mbuf **, int, struct ip_fw_args *);
+void dummynet_task(void *context, int pending);
+void dn_reschedule(void);
+
+struct dn_queue *ipdn_q_find(struct dn_fsk *, struct dn_sch_inst *,
+        struct ipfw_flow_id *);
+struct dn_sch_inst *ipdn_si_find(struct dn_schk *, struct ipfw_flow_id *);
+
+/*
+ * copy_range is a template for requests for ranges of pipes/queues/scheds.
+ * The number of ranges is variable and can be derived by o.len.
+ * As a default, we use a small number of entries so that the struct
+ * fits easily on the stack and is sufficient for most common requests.
+ */
+#define DEFAULT_RANGES	5
+struct copy_range {
+	struct dn_id o;
+	uint32_t	r[ 2 * DEFAULT_RANGES ];
+};
+
+struct copy_args {
+	char **start;
+	char *end;
+	int flags;
+	int type;
+	struct copy_range *extra;	/* extra filtering */
+};
+
+struct sockopt;
+int ip_dummynet_compat(struct sockopt *sopt);
+int dummynet_get(struct sockopt *sopt, void **compat);
+int dn_c_copy_q (void *_ni, void *arg);
+int dn_c_copy_pipe(struct dn_schk *s, struct copy_args *a, int nq);
+int dn_c_copy_fs(struct dn_fsk *f, struct copy_args *a, int nq);
+int dn_compat_copy_queue(struct copy_args *a, void *_o);
+int dn_compat_copy_pipe(struct copy_args *a, void *_o);
+int copy_data_helper_compat(void *_o, void *_arg);
+int dn_compat_calc_size(void);
+int do_config(void *p, int l);
+
+/* functions to drain idle objects */
+void dn_drain_scheduler(void);
+void dn_drain_queue(void);
+
+#endif /* _IP_DN_PRIVATE_H */
diff --git a/example/ipfw/sys/netpfil/ipfw/ip_dummynet.c b/example/ipfw/sys/netpfil/ipfw/ip_dummynet.c
new file mode 100644
index 0000000..a4dcb4f
--- /dev/null
+++ b/example/ipfw/sys/netpfil/ipfw/ip_dummynet.c
@@ -0,0 +1,2334 @@
+/*-
+ * Copyright (c) 1998-2002,2010 Luigi Rizzo, Universita` di Pisa
+ * Portions Copyright (c) 2000 Akamba Corp.
+ * All rights reserved
+ *
+ * Redistribution and use in source and binary forms, with or without
+ * modification, are permitted provided that the following conditions
+ * are met:
+ * 1. Redistributions of source code must retain the above copyright
+ *    notice, this list of conditions and the following disclaimer.
+ * 2. Redistributions in binary form must reproduce the above copyright
+ *    notice, this list of conditions and the following disclaimer in the
+ *    documentation and/or other materials provided with the distribution.
+ *
+ * THIS SOFTWARE IS PROVIDED BY THE AUTHOR AND CONTRIBUTORS ``AS IS'' AND
+ * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
+ * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
+ * ARE DISCLAIMED. IN NO EVENT SHALL THE AUTHOR OR CONTRIBUTORS BE LIABLE
+ * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
+ * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
+ * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
+ * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
+ * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
+ * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
+ * SUCH DAMAGE.
+ */
+
+#include <sys/cdefs.h>
+__FBSDID("$FreeBSD: head/sys/netpfil/ipfw/ip_dummynet.c 272840 2014-10-09 19:32:35Z melifaro $");
+
+/*
+ * Configuration and internal object management for dummynet.
+ */
+
+#include "opt_inet6.h"
+
+#include <sys/param.h>
+#include <sys/systm.h>
+#include <sys/malloc.h>
+#include <sys/mbuf.h>
+#include <sys/kernel.h>
+#include <sys/lock.h>
+#include <sys/module.h>
+#include <sys/mutex.h>
+#include <sys/priv.h>
+#include <sys/proc.h>
+#include <sys/rwlock.h>
+#include <sys/socket.h>
+#include <sys/time.h>
+#include <sys/sysctl.h>
+#include <sys/taskqueue.h>
+#include <net/if.h>	/* IFNAMSIZ, struct ifaddr, ifq head, lock.h mutex.h */
+#include <netinet/in.h>
+#include <netinet/ip_var.h>	/* ip_output(), IP_FORWARDING */
+#include <netinet/ip_fw.h>
+#include <netinet/ip_dummynet.h>
+
+#include <netpfil/ipfw/ip_fw_private.h>
+#include <netpfil/ipfw/dn_heap.h>
+#include <netpfil/ipfw/ip_dn_private.h>
+#include <netpfil/ipfw/dn_sched.h>
+
+/* which objects to copy */
+#define DN_C_LINK	0x01
+#define DN_C_SCH	0x02
+#define DN_C_FLOW	0x04
+#define DN_C_FS		0x08
+#define DN_C_QUEUE	0x10
+
+/* we use this argument in case of a schk_new */
+struct schk_new_arg {
+	struct dn_alg *fp;
+	struct dn_sch *sch;
+};
+
+/*---- callout hooks. ----*/
+static struct callout dn_timeout;
+static struct task	dn_task;
+static struct taskqueue	*dn_tq = NULL;
+
+static void
+dummynet(void *arg)
+{
+
+	(void)arg;	/* UNUSED */
+	taskqueue_enqueue_fast(dn_tq, &dn_task);
+}
+
+void
+dn_reschedule(void)
+{
+	callout_reset(&dn_timeout, 1, dummynet, NULL);
+}
+/*----- end of callout hooks -----*/
+
+/* Return a scheduler descriptor given the type or name. */
+static struct dn_alg *
+find_sched_type(int type, char *name)
+{
+	struct dn_alg *d;
+
+	SLIST_FOREACH(d, &dn_cfg.schedlist, next) {
+		if (d->type == type || (name && !strcasecmp(d->name, name)))
+			return d;
+	}
+	return NULL; /* not found */
+}
+
+int
+ipdn_bound_var(int *v, int dflt, int lo, int hi, const char *msg)
+{
+	int oldv = *v;
+	const char *op = NULL;
+	if (dflt < lo)
+		dflt = lo;
+	if (dflt > hi)
+		dflt = hi;
+	if (oldv < lo) {
+		*v = dflt;
+		op = "Bump";
+	} else if (oldv > hi) {
+		*v = hi;
+		op = "Clamp";
+	} else
+		return *v;
+	if (op && msg)
+		printf("%s %s to %d (was %d)\n", op, msg, *v, oldv);
+	return *v;
+}
+
+/*---- flow_id mask, hash and compare functions ---*/
+/*
+ * The flow_id includes the 5-tuple, the queue/pipe number
+ * which we store in the extra area in host order,
+ * and for ipv6 also the flow_id6.
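+ * Example (illustrative): a sched_mask with src_ip = 0xffffff00
+ * maps all flows from the same /24 source prefix to one scheduler
+ * instance, while an all-ones mask gives one instance per flow.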
+ * XXX see if we want the tos byte (can store in 'flags') + */ +static struct ipfw_flow_id * +flow_id_mask(struct ipfw_flow_id *mask, struct ipfw_flow_id *id) +{ + int is_v6 = IS_IP6_FLOW_ID(id); + + id->dst_port &= mask->dst_port; + id->src_port &= mask->src_port; + id->proto &= mask->proto; + id->extra &= mask->extra; + if (is_v6) { + APPLY_MASK(&id->dst_ip6, &mask->dst_ip6); + APPLY_MASK(&id->src_ip6, &mask->src_ip6); + id->flow_id6 &= mask->flow_id6; + } else { + id->dst_ip &= mask->dst_ip; + id->src_ip &= mask->src_ip; + } + return id; +} + +/* computes an OR of two masks, result in dst and also returned */ +static struct ipfw_flow_id * +flow_id_or(struct ipfw_flow_id *src, struct ipfw_flow_id *dst) +{ + int is_v6 = IS_IP6_FLOW_ID(dst); + + dst->dst_port |= src->dst_port; + dst->src_port |= src->src_port; + dst->proto |= src->proto; + dst->extra |= src->extra; + if (is_v6) { +#define OR_MASK(_d, _s) \ + (_d)->__u6_addr.__u6_addr32[0] |= (_s)->__u6_addr.__u6_addr32[0]; \ + (_d)->__u6_addr.__u6_addr32[1] |= (_s)->__u6_addr.__u6_addr32[1]; \ + (_d)->__u6_addr.__u6_addr32[2] |= (_s)->__u6_addr.__u6_addr32[2]; \ + (_d)->__u6_addr.__u6_addr32[3] |= (_s)->__u6_addr.__u6_addr32[3]; + OR_MASK(&dst->dst_ip6, &src->dst_ip6); + OR_MASK(&dst->src_ip6, &src->src_ip6); +#undef OR_MASK + dst->flow_id6 |= src->flow_id6; + } else { + dst->dst_ip |= src->dst_ip; + dst->src_ip |= src->src_ip; + } + return dst; +} + +static int +nonzero_mask(struct ipfw_flow_id *m) +{ + if (m->dst_port || m->src_port || m->proto || m->extra) + return 1; + if (IS_IP6_FLOW_ID(m)) { + return + m->dst_ip6.__u6_addr.__u6_addr32[0] || + m->dst_ip6.__u6_addr.__u6_addr32[1] || + m->dst_ip6.__u6_addr.__u6_addr32[2] || + m->dst_ip6.__u6_addr.__u6_addr32[3] || + m->src_ip6.__u6_addr.__u6_addr32[0] || + m->src_ip6.__u6_addr.__u6_addr32[1] || + m->src_ip6.__u6_addr.__u6_addr32[2] || + m->src_ip6.__u6_addr.__u6_addr32[3] || + m->flow_id6; + } else { + return m->dst_ip || m->src_ip; + } +} + +/* XXX we may want a better hash function */ +static uint32_t +flow_id_hash(struct ipfw_flow_id *id) +{ + uint32_t i; + + if (IS_IP6_FLOW_ID(id)) { + uint32_t *d = (uint32_t *)&id->dst_ip6; + uint32_t *s = (uint32_t *)&id->src_ip6; + i = (d[0] ) ^ (d[1]) ^ + (d[2] ) ^ (d[3]) ^ + (d[0] >> 15) ^ (d[1] >> 15) ^ + (d[2] >> 15) ^ (d[3] >> 15) ^ + (s[0] << 1) ^ (s[1] << 1) ^ + (s[2] << 1) ^ (s[3] << 1) ^ + (s[0] << 16) ^ (s[1] << 16) ^ + (s[2] << 16) ^ (s[3] << 16) ^ + (id->dst_port << 1) ^ (id->src_port) ^ + (id->extra) ^ + (id->proto ) ^ (id->flow_id6); + } else { + i = (id->dst_ip) ^ (id->dst_ip >> 15) ^ + (id->src_ip << 1) ^ (id->src_ip >> 16) ^ + (id->extra) ^ + (id->dst_port << 1) ^ (id->src_port) ^ (id->proto); + } + return i; +} + +/* Like bcmp, returns 0 if ids match, 1 otherwise. */ +static int +flow_id_cmp(struct ipfw_flow_id *id1, struct ipfw_flow_id *id2) +{ + int is_v6 = IS_IP6_FLOW_ID(id1); + + if (!is_v6) { + if (IS_IP6_FLOW_ID(id2)) + return 1; /* different address families */ + + return (id1->dst_ip == id2->dst_ip && + id1->src_ip == id2->src_ip && + id1->dst_port == id2->dst_port && + id1->src_port == id2->src_port && + id1->proto == id2->proto && + id1->extra == id2->extra) ? 0 : 1; + } + /* the ipv6 case */ + return ( + !bcmp(&id1->dst_ip6,&id2->dst_ip6, sizeof(id1->dst_ip6)) && + !bcmp(&id1->src_ip6,&id2->src_ip6, sizeof(id1->src_ip6)) && + id1->dst_port == id2->dst_port && + id1->src_port == id2->src_port && + id1->proto == id2->proto && + id1->extra == id2->extra && + id1->flow_id6 == id2->flow_id6) ? 
0 : 1; +} +/*--------- end of flow-id mask, hash and compare ---------*/ + +/*--- support functions for the qht hashtable ---- + * Entries are hashed by flow-id + */ +static uint32_t +q_hash(uintptr_t key, int flags, void *arg) +{ + /* compute the hash slot from the flow id */ + struct ipfw_flow_id *id = (flags & DNHT_KEY_IS_OBJ) ? + &((struct dn_queue *)key)->ni.fid : + (struct ipfw_flow_id *)key; + + return flow_id_hash(id); +} + +static int +q_match(void *obj, uintptr_t key, int flags, void *arg) +{ + struct dn_queue *o = (struct dn_queue *)obj; + struct ipfw_flow_id *id2; + + if (flags & DNHT_KEY_IS_OBJ) { + /* compare pointers */ + id2 = &((struct dn_queue *)key)->ni.fid; + } else { + id2 = (struct ipfw_flow_id *)key; + } + return (0 == flow_id_cmp(&o->ni.fid, id2)); +} + +/* + * create a new queue instance for the given 'key'. + */ +static void * +q_new(uintptr_t key, int flags, void *arg) +{ + struct dn_queue *q, *template = arg; + struct dn_fsk *fs = template->fs; + int size = sizeof(*q) + fs->sched->fp->q_datalen; + + q = malloc(size, M_DUMMYNET, M_NOWAIT | M_ZERO); + if (q == NULL) { + D("no memory for new queue"); + return NULL; + } + + set_oid(&q->ni.oid, DN_QUEUE, size); + if (fs->fs.flags & DN_QHT_HASH) + q->ni.fid = *(struct ipfw_flow_id *)key; + q->fs = fs; + q->_si = template->_si; + q->_si->q_count++; + + if (fs->sched->fp->new_queue) + fs->sched->fp->new_queue(q); + dn_cfg.queue_count++; + return q; +} + +/* + * Notify schedulers that a queue is going away. + * If (flags & DN_DESTROY), also free the packets. + * The version for callbacks is called q_delete_cb(). + */ +static void +dn_delete_queue(struct dn_queue *q, int flags) +{ + struct dn_fsk *fs = q->fs; + + // D("fs %p si %p\n", fs, q->_si); + /* notify the parent scheduler that the queue is going away */ + if (fs && fs->sched->fp->free_queue) + fs->sched->fp->free_queue(q); + q->_si->q_count--; + q->_si = NULL; + if (flags & DN_DESTROY) { + if (q->mq.head) + dn_free_pkts(q->mq.head); + bzero(q, sizeof(*q)); // safety + free(q, M_DUMMYNET); + dn_cfg.queue_count--; + } +} + +static int +q_delete_cb(void *q, void *arg) +{ + int flags = (int)(uintptr_t)arg; + dn_delete_queue(q, flags); + return (flags & DN_DESTROY) ? DNHT_SCAN_DEL : 0; +} + +/* + * calls dn_delete_queue/q_delete_cb on all queues, + * which notifies the parent scheduler and possibly drains packets. + * flags & DN_DESTROY: drains queues and destroy qht; + */ +static void +qht_delete(struct dn_fsk *fs, int flags) +{ + ND("fs %d start flags %d qht %p", + fs->fs.fs_nr, flags, fs->qht); + if (!fs->qht) + return; + if (fs->fs.flags & DN_QHT_HASH) { + dn_ht_scan(fs->qht, q_delete_cb, (void *)(uintptr_t)flags); + if (flags & DN_DESTROY) { + dn_ht_free(fs->qht, 0); + fs->qht = NULL; + } + } else { + dn_delete_queue((struct dn_queue *)(fs->qht), flags); + if (flags & DN_DESTROY) + fs->qht = NULL; + } +} + +/* + * Find and possibly create the queue for a MULTIQUEUE scheduler. + * We never call it for !MULTIQUEUE (the queue is in the sch_inst). 
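+ * (The lookup key is the packet's flow_id with fsk_mask applied, so
+ * packets differing only in masked-out fields share the same queue.)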
+ */ +struct dn_queue * +ipdn_q_find(struct dn_fsk *fs, struct dn_sch_inst *si, + struct ipfw_flow_id *id) +{ + struct dn_queue template; + + template._si = si; + template.fs = fs; + + if (fs->fs.flags & DN_QHT_HASH) { + struct ipfw_flow_id masked_id; + if (fs->qht == NULL) { + fs->qht = dn_ht_init(NULL, fs->fs.buckets, + offsetof(struct dn_queue, q_next), + q_hash, q_match, q_new); + if (fs->qht == NULL) + return NULL; + } + masked_id = *id; + flow_id_mask(&fs->fsk_mask, &masked_id); + return dn_ht_find(fs->qht, (uintptr_t)&masked_id, + DNHT_INSERT, &template); + } else { + if (fs->qht == NULL) + fs->qht = q_new(0, 0, &template); + return (struct dn_queue *)fs->qht; + } +} +/*--- end of queue hash table ---*/ + +/*--- support functions for the sch_inst hashtable ---- + * + * These are hashed by flow-id + */ +static uint32_t +si_hash(uintptr_t key, int flags, void *arg) +{ + /* compute the hash slot from the flow id */ + struct ipfw_flow_id *id = (flags & DNHT_KEY_IS_OBJ) ? + &((struct dn_sch_inst *)key)->ni.fid : + (struct ipfw_flow_id *)key; + + return flow_id_hash(id); +} + +static int +si_match(void *obj, uintptr_t key, int flags, void *arg) +{ + struct dn_sch_inst *o = obj; + struct ipfw_flow_id *id2; + + id2 = (flags & DNHT_KEY_IS_OBJ) ? + &((struct dn_sch_inst *)key)->ni.fid : + (struct ipfw_flow_id *)key; + return flow_id_cmp(&o->ni.fid, id2) == 0; +} + +/* + * create a new instance for the given 'key' + * Allocate memory for instance, delay line and scheduler private data. + */ +static void * +si_new(uintptr_t key, int flags, void *arg) +{ + struct dn_schk *s = arg; + struct dn_sch_inst *si; + int l = sizeof(*si) + s->fp->si_datalen; + + si = malloc(l, M_DUMMYNET, M_NOWAIT | M_ZERO); + if (si == NULL) + goto error; + + /* Set length only for the part passed up to userland. */ + set_oid(&si->ni.oid, DN_SCH_I, sizeof(struct dn_flow)); + set_oid(&(si->dline.oid), DN_DELAY_LINE, + sizeof(struct delay_line)); + /* mark si and dline as outside the event queue */ + si->ni.oid.id = si->dline.oid.id = -1; + + si->sched = s; + si->dline.si = si; + + if (s->fp->new_sched && s->fp->new_sched(si)) { + D("new_sched error"); + goto error; + } + if (s->sch.flags & DN_HAVE_MASK) + si->ni.fid = *(struct ipfw_flow_id *)key; + + dn_cfg.si_count++; + return si; + +error: + if (si) { + bzero(si, sizeof(*si)); // safety + free(si, M_DUMMYNET); + } + return NULL; +} + +/* + * Callback from siht to delete all scheduler instances. Remove + * si and delay line from the system heap, destroy all queues. + * We assume that all flowset have been notified and do not + * point to us anymore. + */ +static int +si_destroy(void *_si, void *arg) +{ + struct dn_sch_inst *si = _si; + struct dn_schk *s = si->sched; + struct delay_line *dl = &si->dline; + + if (dl->oid.subtype) /* remove delay line from event heap */ + heap_extract(&dn_cfg.evheap, dl); + dn_free_pkts(dl->mq.head); /* drain delay line */ + if (si->kflags & DN_ACTIVE) /* remove si from event heap */ + heap_extract(&dn_cfg.evheap, si); + if (s->fp->free_sched) + s->fp->free_sched(si); + bzero(si, sizeof(*si)); /* safety */ + free(si, M_DUMMYNET); + dn_cfg.si_count--; + return DNHT_SCAN_DEL; +} + +/* + * Find the scheduler instance for this packet. If we need to apply + * a mask, do on a local copy of the flow_id to preserve the original. + * Assume siht is always initialized if we have a mask. 
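+ * Without DN_HAVE_MASK there is a single instance and s->siht is + * (ab)used as a direct pointer to it; with a mask, s->siht is a real + * hash table holding one instance per masked flow id.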
+ */ +struct dn_sch_inst * +ipdn_si_find(struct dn_schk *s, struct ipfw_flow_id *id) +{ + + if (s->sch.flags & DN_HAVE_MASK) { + struct ipfw_flow_id id_t = *id; + flow_id_mask(&s->sch.sched_mask, &id_t); + return dn_ht_find(s->siht, (uintptr_t)&id_t, + DNHT_INSERT, s); + } + if (!s->siht) + s->siht = si_new(0, 0, s); + return (struct dn_sch_inst *)s->siht; +} + +/* callback to flush credit for the scheduler instance */ +static int +si_reset_credit(void *_si, void *arg) +{ + struct dn_sch_inst *si = _si; + struct dn_link *p = &si->sched->link; + + si->credit = p->burst + (dn_cfg.io_fast ? p->bandwidth : 0); + return 0; +} + +static void +schk_reset_credit(struct dn_schk *s) +{ + if (s->sch.flags & DN_HAVE_MASK) + dn_ht_scan(s->siht, si_reset_credit, NULL); + else if (s->siht) + si_reset_credit(s->siht, NULL); +} +/*---- end of sch_inst hashtable ---------------------*/ + +/*------------------------------------------------------- + * flowset hash (fshash) support. Entries are hashed by fs_nr. + * New allocations are put in the fsunlinked list, from which + * they are removed when they point to a specific scheduler. + */ +static uint32_t +fsk_hash(uintptr_t key, int flags, void *arg) +{ + uint32_t i = !(flags & DNHT_KEY_IS_OBJ) ? key : + ((struct dn_fsk *)key)->fs.fs_nr; + + return ( (i>>8)^(i>>4)^i ); +} + +static int +fsk_match(void *obj, uintptr_t key, int flags, void *arg) +{ + struct dn_fsk *fs = obj; + int i = !(flags & DNHT_KEY_IS_OBJ) ? key : + ((struct dn_fsk *)key)->fs.fs_nr; + + return (fs->fs.fs_nr == i); +} + +static void * +fsk_new(uintptr_t key, int flags, void *arg) +{ + struct dn_fsk *fs; + + fs = malloc(sizeof(*fs), M_DUMMYNET, M_NOWAIT | M_ZERO); + if (fs) { + set_oid(&fs->fs.oid, DN_FS, sizeof(fs->fs)); + dn_cfg.fsk_count++; + fs->drain_bucket = 0; + SLIST_INSERT_HEAD(&dn_cfg.fsu, fs, sch_chain); + } + return fs; +} + +/* + * detach flowset from its current scheduler. Flags as follows: + * DN_DETACH removes from the fsk_list + * DN_DESTROY deletes individual queues + * DN_DELETE_FS destroys the flowset (otherwise goes in unlinked). + */ +static void +fsk_detach(struct dn_fsk *fs, int flags) +{ + if (flags & DN_DELETE_FS) + flags |= DN_DESTROY; + ND("fs %d from sched %d flags %s %s %s", + fs->fs.fs_nr, fs->fs.sched_nr, + (flags & DN_DELETE_FS) ? "DEL_FS":"", + (flags & DN_DESTROY) ? "DEL":"", + (flags & DN_DETACH) ? "DET":""); + if (flags & DN_DETACH) { /* detach from the list */ + struct dn_fsk_head *h; + h = fs->sched ? &fs->sched->fsk_list : &dn_cfg.fsu; + SLIST_REMOVE(h, fs, dn_fsk, sch_chain); + } + /* Free the RED parameters, they will be recomputed on + * subsequent attach if needed. + */ + if (fs->w_q_lookup) + free(fs->w_q_lookup, M_DUMMYNET); + fs->w_q_lookup = NULL; + qht_delete(fs, flags); + if (fs->sched && fs->sched->fp->free_fsk) + fs->sched->fp->free_fsk(fs); + fs->sched = NULL; + if (flags & DN_DELETE_FS) { + bzero(fs, sizeof(*fs)); /* safety */ + free(fs, M_DUMMYNET); + dn_cfg.fsk_count--; + } else { + SLIST_INSERT_HEAD(&dn_cfg.fsu, fs, sch_chain); + } +} + +/* + * Detach or destroy all flowsets in a list. 
+ * flags specifies what to do: + * DN_DESTROY: flush all queues + * DN_DELETE_FS: DN_DESTROY + destroy flowset + * DN_DELETE_FS implies DN_DESTROY + */ +static void +fsk_detach_list(struct dn_fsk_head *h, int flags) +{ + struct dn_fsk *fs; + int n = 0; /* only for stats */ + + ND("head %p flags %x", h, flags); + while ((fs = SLIST_FIRST(h))) { + SLIST_REMOVE_HEAD(h, sch_chain); + n++; + fsk_detach(fs, flags); + } + ND("done %d flowsets", n); +} + +/* + * called on 'queue X delete' -- removes the flowset from fshash, + * deletes all queues for the flowset, and removes the flowset. + */ +static int +delete_fs(int i, int locked) +{ + struct dn_fsk *fs; + int err = 0; + + if (!locked) + DN_BH_WLOCK(); + fs = dn_ht_find(dn_cfg.fshash, i, DNHT_REMOVE, NULL); + ND("fs %d found %p", i, fs); + if (fs) { + fsk_detach(fs, DN_DETACH | DN_DELETE_FS); + err = 0; + } else + err = EINVAL; + if (!locked) + DN_BH_WUNLOCK(); + return err; +} + +/*----- end of flowset hashtable support -------------*/ + +/*------------------------------------------------------------ + * Scheduler hash. When searching by index we pass sched_nr, + * otherwise we pass struct dn_sch * which is the first field in + * struct dn_schk so we can cast between the two. We use this trick + * in the create phase (but it should be fixed). + */ +static uint32_t +schk_hash(uintptr_t key, int flags, void *_arg) +{ + uint32_t i = !(flags & DNHT_KEY_IS_OBJ) ? key : + ((struct dn_schk *)key)->sch.sched_nr; + return ( (i>>8)^(i>>4)^i ); +} + +static int +schk_match(void *obj, uintptr_t key, int flags, void *_arg) +{ + struct dn_schk *s = (struct dn_schk *)obj; + int i = !(flags & DNHT_KEY_IS_OBJ) ? key : + ((struct dn_schk *)key)->sch.sched_nr; + return (s->sch.sched_nr == i); +} + +/* + * Create the entry and initialize with the sched hash if needed. + * Leave s->fp unset so we can tell whether a dn_ht_find() returns + * a new object or a previously existing one. + */ +static void * +schk_new(uintptr_t key, int flags, void *arg) +{ + struct schk_new_arg *a = arg; + struct dn_schk *s; + int l = sizeof(*s) +a->fp->schk_datalen; + + s = malloc(l, M_DUMMYNET, M_NOWAIT | M_ZERO); + if (s == NULL) + return NULL; + set_oid(&s->link.oid, DN_LINK, sizeof(s->link)); + s->sch = *a->sch; // copy initial values + s->link.link_nr = s->sch.sched_nr; + SLIST_INIT(&s->fsk_list); + /* initialize the hash table or create the single instance */ + s->fp = a->fp; /* si_new needs this */ + s->drain_bucket = 0; + if (s->sch.flags & DN_HAVE_MASK) { + s->siht = dn_ht_init(NULL, s->sch.buckets, + offsetof(struct dn_sch_inst, si_next), + si_hash, si_match, si_new); + if (s->siht == NULL) { + free(s, M_DUMMYNET); + return NULL; + } + } + s->fp = NULL; /* mark as a new scheduler */ + dn_cfg.schk_count++; + return s; +} + +/* + * Callback for sched delete. Notify all attached flowsets to + * detach from the scheduler, destroy the internal flowset, and + * all instances. The scheduler goes away too. + * arg is 0 (only detach flowsets and destroy instances) + * DN_DESTROY (detach & delete queues, delete schk) + * or DN_DELETE_FS (delete queues and flowsets, delete schk) + */ +static int +schk_delete_cb(void *obj, void *arg) +{ + struct dn_schk *s = obj; + struct dn_profile **p = &s->profile; + int i, lim = 1 /* how many profiles */; + +#if 0 + int a = (int)arg; + ND("sched %d arg %s%s", + s->sch.sched_nr, + a&DN_DESTROY ? "DEL ":"", + a&DN_DELETE_FS ? "DEL_FS":""); +#endif + fsk_detach_list(&s->fsk_list, arg ?
DN_DESTROY : 0); + /* no more flowset pointing to us now */ + if (s->sch.flags & DN_HAVE_MASK) { + dn_ht_scan(s->siht, si_destroy, NULL); + dn_ht_free(s->siht, 0); + } else if (s->siht) + si_destroy(s->siht, NULL); + + for (i = 0; i < lim; i++) { + if (p[i]) { + free(p[i], M_DUMMYNET); + p[i] = NULL; + } + } + s->siht = NULL; + if (s->fp->destroy) + s->fp->destroy(s); + bzero(s, sizeof(*s)); // safety + free(obj, M_DUMMYNET); + dn_cfg.schk_count--; + return DNHT_SCAN_DEL; +} + +/* + * called on a 'sched X delete' command. Deletes a single scheduler. + * This is done by removing from the schedhash, unlinking all + * flowsets and deleting their traffic. + */ +static int +delete_schk(int i) +{ + struct dn_schk *s; + + s = dn_ht_find(dn_cfg.schedhash, i, DNHT_REMOVE, NULL); + ND("%d %p", i, s); + if (!s) + return EINVAL; + delete_fs(i + DN_MAX_ID, 1); /* first delete internal fs */ + /* then detach flowsets, delete traffic */ + schk_delete_cb(s, (void*)(uintptr_t)DN_DESTROY); + return 0; +} +/*--- end of schk hashtable support ---*/ + +static int +copy_obj(char **start, char *end, void *_o, const char *msg, int i) +{ + struct dn_id *o = _o; + int have = end - *start; + + if (have < o->len || o->len == 0 || o->type == 0) { + D("(WARN) type %d %s %d have %d need %d", + o->type, msg, i, have, o->len); + return 1; + } + ND("type %d %s %d len %d", o->type, msg, i, o->len); + bcopy(_o, *start, o->len); + if (o->type == DN_LINK) { + /* Adjust burst parameter for link */ + struct dn_link *l = (struct dn_link *)*start; + l->burst = div64(l->burst, 8 * hz); + /* convert back to milliseconds */ + l->delay = l->delay * 1000 / hz; + } else if (o->type == DN_SCH) { + /* Set id->id to the number of instances */ + struct dn_schk *s = _o; + struct dn_id *id = (struct dn_id *)(*start); + id->id = (s->sch.flags & DN_HAVE_MASK) ? + dn_ht_entries(s->siht) : (s->siht ? 1 : 0); + } + *start += o->len; + return 0; +} + +/* Specific function to copy a queue. + * Copies only the user-visible part of a queue (which is in + * a struct dn_flow), and sets len accordingly. + */ +static int +copy_obj_q(char **start, char *end, void *_o, const char *msg, int i) +{ + struct dn_id *o = _o; + int have = end - *start; + int len = sizeof(struct dn_flow); /* see above comment */ + + if (have < len || o->len == 0 || o->type != DN_QUEUE) { + D("ERROR type %d %s %d have %d need %d", + o->type, msg, i, have, len); + return 1; + } + ND("type %d %s %d len %d", o->type, msg, i, len); + bcopy(_o, *start, len); + ((struct dn_id*)(*start))->len = len; + *start += len; + return 0; +} + +static int +copy_q_cb(void *obj, void *arg) +{ + struct dn_queue *q = obj; + struct copy_args *a = arg; + struct dn_flow *ni = (struct dn_flow *)(*a->start); + if (copy_obj_q(a->start, a->end, &q->ni, "queue", -1)) + return DNHT_SCAN_END; + ni->oid.type = DN_FLOW; /* override the DN_QUEUE */ + ni->oid.id = si_hash((uintptr_t)&ni->fid, 0, NULL); + return 0; +} + +static int +copy_q(struct copy_args *a, struct dn_fsk *fs, int flags) +{ + if (!fs->qht) + return 0; + if (fs->fs.flags & DN_QHT_HASH) + dn_ht_scan(fs->qht, copy_q_cb, a); + else + copy_q_cb(fs->qht, a); + return 0; +} + +/* + * This routine only copies the initial part of a profile ? 
XXX + */ +static int +copy_profile(struct copy_args *a, struct dn_profile *p) +{ + int have = a->end - *a->start; + /* XXX start with base length */ + int profile_len = sizeof(struct dn_profile) - + ED_MAX_SAMPLES_NO*sizeof(int); + + if (p == NULL) + return 0; + profile_len += p->samples_no * sizeof(int); /* add actual samples */ + if (have < profile_len) { + D("error have %d need %d", have, profile_len); + return 1; + } + bcopy(p, *a->start, profile_len); + ((struct dn_id *)(*a->start))->len = profile_len; + *a->start += profile_len; + return 0; +} + +static int +copy_flowset(struct copy_args *a, struct dn_fsk *fs, int flags) +{ + struct dn_fs *ufs = (struct dn_fs *)(*a->start); + if (!fs) + return 0; + ND("flowset %d", fs->fs.fs_nr); + if (copy_obj(a->start, a->end, &fs->fs, "flowset", fs->fs.fs_nr)) + return DNHT_SCAN_END; + ufs->oid.id = (fs->fs.flags & DN_QHT_HASH) ? + dn_ht_entries(fs->qht) : (fs->qht ? 1 : 0); + if (flags) { /* copy queues */ + copy_q(a, fs, 0); + } + return 0; +} + +static int +copy_si_cb(void *obj, void *arg) +{ + struct dn_sch_inst *si = obj; + struct copy_args *a = arg; + struct dn_flow *ni = (struct dn_flow *)(*a->start); + if (copy_obj(a->start, a->end, &si->ni, "inst", + si->sched->sch.sched_nr)) + return DNHT_SCAN_END; + ni->oid.type = DN_FLOW; /* override the DN_SCH_I */ + ni->oid.id = si_hash((uintptr_t)si, DNHT_KEY_IS_OBJ, NULL); + return 0; +} + +static int +copy_si(struct copy_args *a, struct dn_schk *s, int flags) +{ + if (s->sch.flags & DN_HAVE_MASK) + dn_ht_scan(s->siht, copy_si_cb, a); + else if (s->siht) + copy_si_cb(s->siht, a); + return 0; +} + +/* + * compute a list of children of a scheduler and copy up + */ +static int +copy_fsk_list(struct copy_args *a, struct dn_schk *s, int flags) +{ + struct dn_fsk *fs; + struct dn_id *o; + uint32_t *p; + + int n = 0, space = sizeof(*o); + SLIST_FOREACH(fs, &s->fsk_list, sch_chain) { + if (fs->fs.fs_nr < DN_MAX_ID) + n++; + } + space += n * sizeof(uint32_t); + DX(3, "sched %d has %d flowsets", s->sch.sched_nr, n); + if (a->end - *(a->start) < space) + return DNHT_SCAN_END; + o = (struct dn_id *)(*(a->start)); + o->len = space; + *a->start += o->len; + o->type = DN_TEXT; + p = (uint32_t *)(o+1); + SLIST_FOREACH(fs, &s->fsk_list, sch_chain) + if (fs->fs.fs_nr < DN_MAX_ID) + *p++ = fs->fs.fs_nr; + return 0; +} + +static int +copy_data_helper(void *_o, void *_arg) +{ + struct copy_args *a = _arg; + uint32_t *r = a->extra->r; /* start of first range */ + uint32_t *lim; /* first invalid pointer */ + int n; + + lim = (uint32_t *)((char *)(a->extra) + a->extra->o.len); + + if (a->type == DN_LINK || a->type == DN_SCH) { + /* pipe|sched show, we receive a dn_schk */ + struct dn_schk *s = _o; + + n = s->sch.sched_nr; + if (a->type == DN_SCH && n >= DN_MAX_ID) + return 0; /* not a scheduler */ + if (a->type == DN_LINK && n <= DN_MAX_ID) + return 0; /* not a pipe */ + + /* see if the object is within one of our ranges */ + for (;r < lim; r += 2) { + if (n < r[0] || n > r[1]) + continue; + /* Found a valid entry, copy and we are done */ + if (a->flags & DN_C_LINK) { + if (copy_obj(a->start, a->end, + &s->link, "link", n)) + return DNHT_SCAN_END; + if (copy_profile(a, s->profile)) + return DNHT_SCAN_END | DNHT_COPY_ERR; + if (copy_flowset(a, s->fs, 0)) + return DNHT_SCAN_END; + } + if (a->flags & DN_C_SCH) { + if (copy_obj(a->start, a->end, + &s->sch, "sched", n)) + return DNHT_SCAN_END | DNHT_COPY_ERR; + /* list all attached flowsets */ + if (copy_fsk_list(a, s, 0)) + return DNHT_SCAN_END | DNHT_COPY_ERR; + } + if 
(a->flags & DN_C_FLOW) + copy_si(a, s, 0); + break; + } + } else if (a->type == DN_FS) { + /* queue show, skip internal flowsets */ + struct dn_fsk *fs = _o; + + n = fs->fs.fs_nr; + if (n >= DN_MAX_ID) + return 0; + /* see if the object is within one of our ranges */ + for (;r < lim; r += 2) { + if (n < r[0] || n > r[1]) + continue; + if (copy_flowset(a, fs, 0)) + return DNHT_SCAN_END | DNHT_COPY_ERR; + copy_q(a, fs, 0); + break; /* we are done */ + } + } + return 0; +} + +static inline struct dn_schk * +locate_scheduler(int i) +{ + return dn_ht_find(dn_cfg.schedhash, i, 0, NULL); +} + +/* + * red parameters are in fixed point arithmetic. + */ +static int +config_red(struct dn_fsk *fs) +{ + int64_t s, idle, weight, w0; + int t, i; + + fs->w_q = fs->fs.w_q; + fs->max_p = fs->fs.max_p; + ND("called"); + /* Doing stuff that was in userland */ + i = fs->sched->link.bandwidth; + s = (i <= 0) ? 0 : + hz * dn_cfg.red_avg_pkt_size * 8 * SCALE(1) / i; + + idle = div64((s * 3) , fs->w_q); /* s, fs->w_q scaled; idle not scaled */ + fs->lookup_step = div64(idle , dn_cfg.red_lookup_depth); + /* fs->lookup_step not scaled, */ + if (!fs->lookup_step) + fs->lookup_step = 1; + w0 = weight = SCALE(1) - fs->w_q; //fs->w_q scaled + + for (t = fs->lookup_step; t > 1; --t) + weight = SCALE_MUL(weight, w0); + fs->lookup_weight = (int)(weight); // scaled + + /* Now doing stuff that was in kerneland */ + fs->min_th = SCALE(fs->fs.min_th); + fs->max_th = SCALE(fs->fs.max_th); + + if (fs->fs.max_th == fs->fs.min_th) + fs->c_1 = fs->max_p; + else + fs->c_1 = SCALE((int64_t)(fs->max_p)) / (fs->fs.max_th - fs->fs.min_th); + fs->c_2 = SCALE_MUL(fs->c_1, SCALE(fs->fs.min_th)); + + if (fs->fs.flags & DN_IS_GENTLE_RED) { + fs->c_3 = (SCALE(1) - fs->max_p) / fs->fs.max_th; + fs->c_4 = SCALE(1) - 2 * fs->max_p; + } + + /* If the lookup table already exist, free and create it again. 
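+ * The table caches (1 - w_q)^i in fixed point (see the fill loop + * below), which RED uses to decay the average queue estimate across + * an idle period with one lookup instead of recomputing powers.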
*/ + if (fs->w_q_lookup) { + free(fs->w_q_lookup, M_DUMMYNET); + fs->w_q_lookup = NULL; + } + if (dn_cfg.red_lookup_depth == 0) { + printf("\ndummynet: net.inet.ip.dummynet.red_lookup_depth" + "must be > 0\n"); + fs->fs.flags &= ~DN_IS_RED; + fs->fs.flags &= ~DN_IS_GENTLE_RED; + return (EINVAL); + } + fs->lookup_depth = dn_cfg.red_lookup_depth; + fs->w_q_lookup = (u_int *)malloc(fs->lookup_depth * sizeof(int), + M_DUMMYNET, M_NOWAIT); + if (fs->w_q_lookup == NULL) { + printf("dummynet: sorry, cannot allocate red lookup table\n"); + fs->fs.flags &= ~DN_IS_RED; + fs->fs.flags &= ~DN_IS_GENTLE_RED; + return(ENOSPC); + } + + /* Fill the lookup table with (1 - w_q)^x */ + fs->w_q_lookup[0] = SCALE(1) - fs->w_q; + + for (i = 1; i < fs->lookup_depth; i++) + fs->w_q_lookup[i] = + SCALE_MUL(fs->w_q_lookup[i - 1], fs->lookup_weight); + + if (dn_cfg.red_avg_pkt_size < 1) + dn_cfg.red_avg_pkt_size = 512; + fs->avg_pkt_size = dn_cfg.red_avg_pkt_size; + if (dn_cfg.red_max_pkt_size < 1) + dn_cfg.red_max_pkt_size = 1500; + fs->max_pkt_size = dn_cfg.red_max_pkt_size; + ND("exit"); + return 0; +} + +/* Scan all flowset attached to this scheduler and update red */ +static void +update_red(struct dn_schk *s) +{ + struct dn_fsk *fs; + SLIST_FOREACH(fs, &s->fsk_list, sch_chain) { + if (fs && (fs->fs.flags & DN_IS_RED)) + config_red(fs); + } +} + +/* attach flowset to scheduler s, possibly requeue */ +static void +fsk_attach(struct dn_fsk *fs, struct dn_schk *s) +{ + ND("remove fs %d from fsunlinked, link to sched %d", + fs->fs.fs_nr, s->sch.sched_nr); + SLIST_REMOVE(&dn_cfg.fsu, fs, dn_fsk, sch_chain); + fs->sched = s; + SLIST_INSERT_HEAD(&s->fsk_list, fs, sch_chain); + if (s->fp->new_fsk) + s->fp->new_fsk(fs); + /* XXX compute fsk_mask */ + fs->fsk_mask = fs->fs.flow_mask; + if (fs->sched->sch.flags & DN_HAVE_MASK) + flow_id_or(&fs->sched->sch.sched_mask, &fs->fsk_mask); + if (fs->qht) { + /* + * we must drain qht according to the old + * type, and reinsert according to the new one. + * The requeue is complex -- in general we need to + * reclassify every single packet. + * For the time being, let's hope qht is never set + * when we reach this point. + */ + D("XXX TODO requeue from fs %d to sch %d", + fs->fs.fs_nr, s->sch.sched_nr); + fs->qht = NULL; + } + /* set the new type for qht */ + if (nonzero_mask(&fs->fsk_mask)) + fs->fs.flags |= DN_QHT_HASH; + else + fs->fs.flags &= ~DN_QHT_HASH; + + /* XXX config_red() can fail... */ + if (fs->fs.flags & DN_IS_RED) + config_red(fs); +} + +/* update all flowsets which may refer to this scheduler */ +static void +update_fs(struct dn_schk *s) +{ + struct dn_fsk *fs, *tmp; + + SLIST_FOREACH_SAFE(fs, &dn_cfg.fsu, sch_chain, tmp) { + if (s->sch.sched_nr != fs->fs.sched_nr) { + D("fs %d for sch %d not %d still unlinked", + fs->fs.fs_nr, fs->fs.sched_nr, + s->sch.sched_nr); + continue; + } + fsk_attach(fs, s); + } +} + +/* + * Configuration -- to preserve backward compatibility we use + * the following scheme (N is 65536) + * NUMBER SCHED LINK FLOWSET + * 1 .. N-1 (1)WFQ (2)WFQ (3)queue + * N+1 .. 2N-1 (4)FIFO (5)FIFO (6)FIFO for sched 1..N-1 + * 2N+1 .. 
3N-1 -- -- (7)FIFO for sched N+1..2N-1 + * + * "pipe i config" configures #1, #2 and #3 + * "sched i config" configures #1 and possibly #6 + * "queue i config" configures #3 + * #1 is configured with 'pipe i config' or 'sched i config' + * #2 is configured with 'pipe i config', and created if not + * existing with 'sched i config' + * #3 is configured with 'queue i config' + * #4 is automatically configured after #1, can only be FIFO + * #5 is automatically configured after #2 + * #6 is automatically created when #1 is !MULTIQUEUE, + * and can be updated. + * #7 is automatically configured after #2 + */ + +/* + * configure a link (and its FIFO instance) + */ +static int +config_link(struct dn_link *p, struct dn_id *arg) +{ + int i; + + if (p->oid.len != sizeof(*p)) { + D("invalid pipe len %d", p->oid.len); + return EINVAL; + } + i = p->link_nr; + if (i <= 0 || i >= DN_MAX_ID) + return EINVAL; + /* + * The config program passes parameters as follows: + * bw = bits/second (0 means no limits), + * delay = ms, must be translated into ticks. + * qsize = slots/bytes + * burst ??? + */ + p->delay = (p->delay * hz) / 1000; + /* Scale burst size: bytes -> bits * hz */ + p->burst *= 8 * hz; + + DN_BH_WLOCK(); + /* do it twice, base link and FIFO link */ + for (; i < 2*DN_MAX_ID; i += DN_MAX_ID) { + struct dn_schk *s = locate_scheduler(i); + if (s == NULL) { + DN_BH_WUNLOCK(); + D("sched %d not found", i); + return EINVAL; + } + /* remove profile if exists */ + if (s->profile) { + free(s->profile, M_DUMMYNET); + s->profile = NULL; + } + /* copy all parameters */ + s->link.oid = p->oid; + s->link.link_nr = i; + s->link.delay = p->delay; + if (s->link.bandwidth != p->bandwidth) { + /* XXX bandwidth changes, need to update red params */ + s->link.bandwidth = p->bandwidth; + update_red(s); + } + s->link.burst = p->burst; + schk_reset_credit(s); + } + dn_cfg.id++; + DN_BH_WUNLOCK(); + return 0; +} + +/* + * configure a flowset. Can be called from inside with locked=1, + */ +static struct dn_fsk * +config_fs(struct dn_fs *nfs, struct dn_id *arg, int locked) +{ + int i; + struct dn_fsk *fs; + + if (nfs->oid.len != sizeof(*nfs)) { + D("invalid flowset len %d", nfs->oid.len); + return NULL; + } + i = nfs->fs_nr; + if (i <= 0 || i >= 3*DN_MAX_ID) + return NULL; + ND("flowset %d", i); + /* XXX other sanity checks */ + if (nfs->flags & DN_QSIZE_BYTES) { + ipdn_bound_var(&nfs->qsize, 16384, + 1500, dn_cfg.byte_limit, NULL); // "queue byte size"); + } else { + ipdn_bound_var(&nfs->qsize, 50, + 1, dn_cfg.slot_limit, NULL); // "queue slot size"); + } + if (nfs->flags & DN_HAVE_MASK) { + /* make sure we have some buckets */ + ipdn_bound_var((int *)&nfs->buckets, dn_cfg.hash_size, + 1, dn_cfg.max_hash_size, "flowset buckets"); + } else { + nfs->buckets = 1; /* we only need 1 */ + } + if (!locked) + DN_BH_WLOCK(); + do { /* exit with break when done */ + struct dn_schk *s; + int flags = nfs->sched_nr ? 
DNHT_INSERT : 0; + int j; + int oldc = dn_cfg.fsk_count; + fs = dn_ht_find(dn_cfg.fshash, i, flags, NULL); + if (fs == NULL) { + D("missing sched for flowset %d", i); + break; + } + /* grab some defaults from the existing one */ + if (nfs->sched_nr == 0) /* reuse */ + nfs->sched_nr = fs->fs.sched_nr; + for (j = 0; j < sizeof(nfs->par)/sizeof(nfs->par[0]); j++) { + if (nfs->par[j] == -1) /* reuse */ + nfs->par[j] = fs->fs.par[j]; + } + if (bcmp(&fs->fs, nfs, sizeof(*nfs)) == 0) { + ND("flowset %d unchanged", i); + break; /* no change, nothing to do */ + } + if (oldc != dn_cfg.fsk_count) /* new item */ + dn_cfg.id++; + s = locate_scheduler(nfs->sched_nr); + /* detach from old scheduler if needed, preserving + * queues if we need to reattach. Then update the + * configuration, and possibly attach to the new sched. + */ + DX(2, "fs %d changed sched %d@%p to %d@%p", + fs->fs.fs_nr, + fs->fs.sched_nr, fs->sched, nfs->sched_nr, s); + if (fs->sched) { + int flags = s ? DN_DETACH : (DN_DETACH | DN_DESTROY); + flags |= DN_DESTROY; /* XXX temporary */ + fsk_detach(fs, flags); + } + fs->fs = *nfs; /* copy configuration */ + if (s != NULL) + fsk_attach(fs, s); + } while (0); + if (!locked) + DN_BH_WUNLOCK(); + return fs; +} + +/* + * config/reconfig a scheduler and its FIFO variant. + * For !MULTIQUEUE schedulers, also set up the flowset. + * + * On reconfigurations (detected because s->fp is set), + * detach existing flowsets preserving traffic, preserve link, + * and delete the old scheduler creating a new one. + */ +static int +config_sched(struct dn_sch *_nsch, struct dn_id *arg) +{ + struct dn_schk *s; + struct schk_new_arg a; /* argument for schk_new */ + int i; + struct dn_link p; /* copy of oldlink */ + struct dn_profile *pf = NULL; /* copy of old link profile */ + /* Used to preserv mask parameter */ + struct ipfw_flow_id new_mask; + int new_buckets = 0; + int new_flags = 0; + int pipe_cmd; + int err = ENOMEM; + + a.sch = _nsch; + if (a.sch->oid.len != sizeof(*a.sch)) { + D("bad sched len %d", a.sch->oid.len); + return EINVAL; + } + i = a.sch->sched_nr; + if (i <= 0 || i >= DN_MAX_ID) + return EINVAL; + /* make sure we have some buckets */ + if (a.sch->flags & DN_HAVE_MASK) + ipdn_bound_var((int *)&a.sch->buckets, dn_cfg.hash_size, + 1, dn_cfg.max_hash_size, "sched buckets"); + /* XXX other sanity checks */ + bzero(&p, sizeof(p)); + + pipe_cmd = a.sch->flags & DN_PIPE_CMD; + a.sch->flags &= ~DN_PIPE_CMD; //XXX do it even if is not set? + if (pipe_cmd) { + /* Copy mask parameter */ + new_mask = a.sch->sched_mask; + new_buckets = a.sch->buckets; + new_flags = a.sch->flags; + } + DN_BH_WLOCK(); +again: /* run twice, for wfq and fifo */ + /* + * lookup the type. If not supplied, use the previous one + * or default to WF2Q+. Otherwise, return an error. + */ + dn_cfg.id++; + a.fp = find_sched_type(a.sch->oid.subtype, a.sch->name); + if (a.fp != NULL) { + /* found. Lookup or create entry */ + s = dn_ht_find(dn_cfg.schedhash, i, DNHT_INSERT, &a); + } else if (a.sch->oid.subtype == 0 && !a.sch->name[0]) { + /* No type. search existing s* or retry with WF2Q+ */ + s = dn_ht_find(dn_cfg.schedhash, i, 0, &a); + if (s != NULL) { + a.fp = s->fp; + /* Scheduler exists, skip to FIFO scheduler + * if command was pipe config... + */ + if (pipe_cmd) + goto next; + } else { + /* New scheduler, create a wf2q+ with no mask + * if command was pipe config... 
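+ * (the mask saved in new_mask is restored for the companion FIFO + * at the 'next:' label below, so 'pipe N config mask' keeps the + * classic one-pipe-instance-per-masked-flow behaviour)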
+ */ + if (pipe_cmd) { + /* clear mask parameter */ + bzero(&a.sch->sched_mask, sizeof(new_mask)); + a.sch->buckets = 0; + a.sch->flags &= ~DN_HAVE_MASK; + } + a.sch->oid.subtype = DN_SCHED_WF2QP; + goto again; + } + } else { + D("invalid scheduler type %d %s", + a.sch->oid.subtype, a.sch->name); + err = EINVAL; + goto error; + } + /* normalize name and subtype */ + a.sch->oid.subtype = a.fp->type; + bzero(a.sch->name, sizeof(a.sch->name)); + strlcpy(a.sch->name, a.fp->name, sizeof(a.sch->name)); + if (s == NULL) { + D("cannot allocate scheduler %d", i); + goto error; + } + /* restore existing link if any */ + if (p.link_nr) { + s->link = p; + if (!pf || pf->link_nr != p.link_nr) { /* no saved value */ + s->profile = NULL; /* XXX maybe not needed */ + } else { + s->profile = malloc(sizeof(struct dn_profile), + M_DUMMYNET, M_NOWAIT | M_ZERO); + if (s->profile == NULL) { + D("cannot allocate profile"); + goto error; //XXX + } + bcopy(pf, s->profile, sizeof(*pf)); + } + } + p.link_nr = 0; + if (s->fp == NULL) { + DX(2, "sched %d new type %s", i, a.fp->name); + } else if (s->fp != a.fp || + bcmp(a.sch, &s->sch, sizeof(*a.sch)) ) { + /* already existing. */ + DX(2, "sched %d type changed from %s to %s", + i, s->fp->name, a.fp->name); + DX(4, " type/sub %d/%d -> %d/%d", + s->sch.oid.type, s->sch.oid.subtype, + a.sch->oid.type, a.sch->oid.subtype); + if (s->link.link_nr == 0) + D("XXX WARNING link 0 for sched %d", i); + p = s->link; /* preserve link */ + if (s->profile) {/* preserve profile */ + if (!pf) + pf = malloc(sizeof(*pf), + M_DUMMYNET, M_NOWAIT | M_ZERO); + if (pf) /* XXX should issue a warning otherwise */ + bcopy(s->profile, pf, sizeof(*pf)); + } + /* remove from the hash */ + dn_ht_find(dn_cfg.schedhash, i, DNHT_REMOVE, NULL); + /* Detach flowsets, preserve queues. 
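+ * (in practice the DN_DESTROY call below kills the queues too; + * preserving them would mean reclassifying every queued packet, + * which is not implemented, see fsk_attach())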
*/ + // schk_delete_cb(s, NULL); + // XXX temporarily, kill queues + schk_delete_cb(s, (void *)DN_DESTROY); + goto again; + } else { + DX(4, "sched %d unchanged type %s", i, a.fp->name); + } + /* complete initialization */ + s->sch = *a.sch; + s->fp = a.fp; + s->cfg = arg; + // XXX schk_reset_credit(s); + /* create the internal flowset if needed, + * trying to reuse existing ones if available + */ + if (!(s->fp->flags & DN_MULTIQUEUE) && !s->fs) { + s->fs = dn_ht_find(dn_cfg.fshash, i, 0, NULL); + if (!s->fs) { + struct dn_fs fs; + bzero(&fs, sizeof(fs)); + set_oid(&fs.oid, DN_FS, sizeof(fs)); + fs.fs_nr = i + DN_MAX_ID; + fs.sched_nr = i; + s->fs = config_fs(&fs, NULL, 1 /* locked */); + } + if (!s->fs) { + schk_delete_cb(s, (void *)DN_DESTROY); + D("error creating internal fs for %d", i); + goto error; + } + } + /* call init function after the flowset is created */ + if (s->fp->config) + s->fp->config(s); + update_fs(s); +next: + if (i < DN_MAX_ID) { /* now configure the FIFO instance */ + i += DN_MAX_ID; + if (pipe_cmd) { + /* Restore mask parameter for FIFO */ + a.sch->sched_mask = new_mask; + a.sch->buckets = new_buckets; + a.sch->flags = new_flags; + } else { + /* sched config shouldn't modify the FIFO scheduler */ + if (dn_ht_find(dn_cfg.schedhash, i, 0, &a) != NULL) { + /* FIFO already exist, don't touch it */ + err = 0; /* and this is not an error */ + goto error; + } + } + a.sch->sched_nr = i; + a.sch->oid.subtype = DN_SCHED_FIFO; + bzero(a.sch->name, sizeof(a.sch->name)); + goto again; + } + err = 0; +error: + DN_BH_WUNLOCK(); + if (pf) + free(pf, M_DUMMYNET); + return err; +} + +/* + * attach a profile to a link + */ +static int +config_profile(struct dn_profile *pf, struct dn_id *arg) +{ + struct dn_schk *s; + int i, olen, err = 0; + + if (pf->oid.len < sizeof(*pf)) { + D("short profile len %d", pf->oid.len); + return EINVAL; + } + i = pf->link_nr; + if (i <= 0 || i >= DN_MAX_ID) + return EINVAL; + /* XXX other sanity checks */ + DN_BH_WLOCK(); + for (; i < 2*DN_MAX_ID; i += DN_MAX_ID) { + struct dn_profile **pkpf, *kpf; + + s = locate_scheduler(i); + + if (s == NULL) { + err = EINVAL; + break; + } + dn_cfg.id++; + pkpf = &s->profile; /* prepare to handle multiple profiles */ + kpf = *pkpf; + + /* + * If we had a profile and the new one does not fit, + * or it is deleted, then we need to free memory. + */ + if (kpf && (pf->samples_no == 0 || + kpf->oid.len < pf->oid.len)) { + free(kpf, M_DUMMYNET); + *pkpf = NULL; + } + if (pf->samples_no == 0) + continue; + /* + * new profile, possibly allocate memory + * and copy data. + */ + if (kpf == NULL) + *pkpf = kpf = malloc(pf->oid.len, + M_DUMMYNET, M_NOWAIT | M_ZERO); + if (kpf == NULL) { + D("no memory for profile %d", i); + err = ENOMEM; + break; + } + /* preserve larger length XXX double check */ + olen = kpf->oid.len; + if (olen < pf->oid.len) + olen = pf->oid.len; + bcopy(pf, kpf, pf->oid.len); + kpf->oid.len = olen; + } + DN_BH_WUNLOCK(); + return err; +} + +/* + * Delete all objects: + */ +static void +dummynet_flush(void) +{ + + /* delete all schedulers and related links/queues/flowsets */ + dn_ht_scan(dn_cfg.schedhash, schk_delete_cb, + (void *)(uintptr_t)DN_DELETE_FS); + /* delete all remaining (unlinked) flowsets */ + DX(4, "still %d unlinked fs", dn_cfg.fsk_count); + dn_ht_free(dn_cfg.fshash, DNHT_REMOVE); + fsk_detach_list(&dn_cfg.fsu, DN_DELETE_FS); + /* Reinitialize system heap... */ + heap_init(&dn_cfg.evheap, 16, offsetof(struct dn_id, id)); +} + +/* + * Main handler for configuration. 
We are guaranteed to be called + * with an oid which is at least a dn_id. + * - the first object is the command (config, delete, flush, ...) + * - config_link must be issued after the corresponding config_sched + * - parameters (DN_TXT) for an object must preceed the object + * processed on a config_sched. + */ +int +do_config(void *p, int l) +{ + struct dn_id *next, *o; + int err = 0, err2 = 0; + struct dn_id *arg = NULL; + uintptr_t *a; + + o = p; + if (o->id != DN_API_VERSION) { + D("invalid api version got %d need %d", + o->id, DN_API_VERSION); + return EINVAL; + } + for (; l >= sizeof(*o); o = next) { + struct dn_id *prev = arg; + if (o->len < sizeof(*o) || l < o->len) { + D("bad len o->len %d len %d", o->len, l); + err = EINVAL; + break; + } + l -= o->len; + next = (struct dn_id *)((char *)o + o->len); + err = 0; + switch (o->type) { + default: + D("cmd %d not implemented", o->type); + break; + +#ifdef EMULATE_SYSCTL + /* sysctl emulation. + * if we recognize the command, jump to the correct + * handler and return + */ + case DN_SYSCTL_SET: + err = kesysctl_emu_set(p, l); + return err; +#endif + + case DN_CMD_CONFIG: /* simply a header */ + break; + + case DN_CMD_DELETE: + /* the argument is in the first uintptr_t after o */ + a = (uintptr_t *)(o+1); + if (o->len < sizeof(*o) + sizeof(*a)) { + err = EINVAL; + break; + } + switch (o->subtype) { + case DN_LINK: + /* delete base and derived schedulers */ + DN_BH_WLOCK(); + err = delete_schk(*a); + err2 = delete_schk(*a + DN_MAX_ID); + DN_BH_WUNLOCK(); + if (!err) + err = err2; + break; + + default: + D("invalid delete type %d", + o->subtype); + err = EINVAL; + break; + + case DN_FS: + err = (*a <1 || *a >= DN_MAX_ID) ? + EINVAL : delete_fs(*a, 0) ; + break; + } + break; + + case DN_CMD_FLUSH: + DN_BH_WLOCK(); + dummynet_flush(); + DN_BH_WUNLOCK(); + break; + case DN_TEXT: /* store argument the next block */ + prev = NULL; + arg = o; + break; + case DN_LINK: + err = config_link((struct dn_link *)o, arg); + break; + case DN_PROFILE: + err = config_profile((struct dn_profile *)o, arg); + break; + case DN_SCH: + err = config_sched((struct dn_sch *)o, arg); + break; + case DN_FS: + err = (NULL==config_fs((struct dn_fs *)o, arg, 0)); + break; + } + if (prev) + arg = NULL; + if (err != 0) + break; + } + return err; +} + +static int +compute_space(struct dn_id *cmd, struct copy_args *a) +{ + int x = 0, need = 0; + int profile_size = sizeof(struct dn_profile) - + ED_MAX_SAMPLES_NO*sizeof(int); + /* note, this may be short */ + + /* NOTE about compute space: + * NP = dn_cfg.schk_count + * NSI = dn_cfg.si_count + * NF = dn_cfg.fsk_count + * NQ = dn_cfg.queue_count + * - ipfw pipe show + * (NP/2)*(dn_link + dn_sch + dn_id + dn_fs) only half scheduler + * link, scheduler template, flowset + * integrated in scheduler and header + * for flowset list + * (NSI)*(dn_flow) all scheduler instance (includes + * the queue instance) + * - ipfw sched show + * (NP/2)*(dn_link + dn_sch + dn_id + dn_fs) only half scheduler + * link, scheduler template, flowset + * integrated in scheduler and header + * for flowset list + * (NSI * dn_flow) all scheduler instances + * (NF * sizeof(uint_32)) space for flowset list linked to scheduler + * (NQ * dn_queue) all queue [XXXfor now not listed] + * - ipfw queue show + * (NF * dn_fs) all flowset + * (NQ * dn_queue) all queues + */ + switch (cmd->subtype) { + default: + return -1; + /* XXX where do LINK and SCH differ ? */ + /* 'ipfw sched show' could list all queues associated to + * a scheduler. 
This feature for now is disabled + */ + case DN_LINK: /* pipe show */ + x = DN_C_LINK | DN_C_SCH | DN_C_FLOW; + need += dn_cfg.schk_count * + (sizeof(struct dn_fs) + profile_size) / 2; + need += dn_cfg.fsk_count * sizeof(uint32_t); + break; + case DN_SCH: /* sched show */ + need += dn_cfg.schk_count * + (sizeof(struct dn_fs) + profile_size) / 2; + need += dn_cfg.fsk_count * sizeof(uint32_t); + x = DN_C_SCH | DN_C_LINK | DN_C_FLOW; + break; + case DN_FS: /* queue show */ + x = DN_C_FS | DN_C_QUEUE; + break; + case DN_GET_COMPAT: /* compatibility mode */ + need = dn_compat_calc_size(); + break; + } + a->flags = x; + if (x & DN_C_SCH) { + need += dn_cfg.schk_count * sizeof(struct dn_sch) / 2; + /* NOT also, each fs might be attached to a sched */ + need += dn_cfg.schk_count * sizeof(struct dn_id) / 2; + } + if (x & DN_C_FS) + need += dn_cfg.fsk_count * sizeof(struct dn_fs); + if (x & DN_C_LINK) { + need += dn_cfg.schk_count * sizeof(struct dn_link) / 2; + } + /* + * When exporting a queue to userland, only pass up the + * struct dn_flow, which is the only visible part. + */ + + if (x & DN_C_QUEUE) + need += dn_cfg.queue_count * sizeof(struct dn_flow); + if (x & DN_C_FLOW) + need += dn_cfg.si_count * (sizeof(struct dn_flow)); + return need; +} + +/* + * If compat != NULL dummynet_get is called in compatibility mode. + * *compat will be the pointer to the buffer to pass to ipfw + */ +int +dummynet_get(struct sockopt *sopt, void **compat) +{ + int have, i, need, error; + char *start = NULL, *buf; + size_t sopt_valsize; + struct dn_id *cmd; + struct copy_args a; + struct copy_range r; + int l = sizeof(struct dn_id); + + bzero(&a, sizeof(a)); + bzero(&r, sizeof(r)); + + /* save and restore original sopt_valsize around copyin */ + sopt_valsize = sopt->sopt_valsize; + + cmd = &r.o; + + if (!compat) { + /* copy at least an oid, and possibly a full object */ + error = sooptcopyin(sopt, cmd, sizeof(r), sizeof(*cmd)); + sopt->sopt_valsize = sopt_valsize; + if (error) + goto done; + l = cmd->len; +#ifdef EMULATE_SYSCTL + /* sysctl emulation. */ + if (cmd->type == DN_SYSCTL_GET) + return kesysctl_emu_get(sopt); +#endif + if (l > sizeof(r)) { + /* request larger than default, allocate buffer */ + cmd = malloc(l, M_DUMMYNET, M_WAITOK); + error = sooptcopyin(sopt, cmd, l, l); + sopt->sopt_valsize = sopt_valsize; + if (error) + goto done; + } + } else { /* compatibility */ + error = 0; + cmd->type = DN_CMD_GET; + cmd->len = sizeof(struct dn_id); + cmd->subtype = DN_GET_COMPAT; + // cmd->id = sopt_valsize; + D("compatibility mode"); + } + a.extra = (struct copy_range *)cmd; + if (cmd->len == sizeof(*cmd)) { /* no range, create a default */ + uint32_t *rp = (uint32_t *)(cmd + 1); + cmd->len += 2* sizeof(uint32_t); + rp[0] = 1; + rp[1] = DN_MAX_ID - 1; + if (cmd->subtype == DN_LINK) { + rp[0] += DN_MAX_ID; + rp[1] += DN_MAX_ID; + } + } + /* Count space (under lock) and allocate (outside lock). + * Exit with lock held if we manage to get enough buffer. + * Try a few times then give up. 
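+ * The object counts can change while the lock is dropped around + * malloc(), so 'need' is recomputed on every pass; the loop below + * bounds the retries to 10 before giving up.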
+ */ + for (have = 0, i = 0; i < 10; i++) { + DN_BH_WLOCK(); + need = compute_space(cmd, &a); + + /* if there is a range, ignore value from compute_space() */ + if (l > sizeof(*cmd)) + need = sopt_valsize - sizeof(*cmd); + + if (need < 0) { + DN_BH_WUNLOCK(); + error = EINVAL; + goto done; + } + need += sizeof(*cmd); + cmd->id = need; + if (have >= need) + break; + + DN_BH_WUNLOCK(); + if (start) + free(start, M_DUMMYNET); + start = NULL; + if (need > sopt_valsize) + break; + + have = need; + start = malloc(have, M_DUMMYNET, M_WAITOK | M_ZERO); + } + + if (start == NULL) { + if (compat) { + *compat = NULL; + error = 1; // XXX + } else { + error = sooptcopyout(sopt, cmd, sizeof(*cmd)); + } + goto done; + } + ND("have %d:%lu sched %d, %d:%lu links %d, %d:%lu flowsets %d, " + "%d:%lu si %d, %d:%lu queues %d", + dn_cfg.schk_count, sizeof(struct dn_sch), DN_SCH, + dn_cfg.schk_count, sizeof(struct dn_link), DN_LINK, + dn_cfg.fsk_count, sizeof(struct dn_fs), DN_FS, + dn_cfg.si_count, sizeof(struct dn_flow), DN_SCH_I, + dn_cfg.queue_count, sizeof(struct dn_queue), DN_QUEUE); + sopt->sopt_valsize = sopt_valsize; + a.type = cmd->subtype; + + if (compat == NULL) { + bcopy(cmd, start, sizeof(*cmd)); + ((struct dn_id*)(start))->len = sizeof(struct dn_id); + buf = start + sizeof(*cmd); + } else + buf = start; + a.start = &buf; + a.end = start + have; + /* start copying other objects */ + /* XXX set error in case of no space */ + if (compat) { + a.type = DN_COMPAT_PIPE; + error = dn_ht_scan(dn_cfg.schedhash, copy_data_helper_compat, &a); + a.type = DN_COMPAT_QUEUE; + dn_ht_scan(dn_cfg.fshash, copy_data_helper_compat, &a); + } else if (a.type == DN_FS) { + error = dn_ht_scan(dn_cfg.fshash, copy_data_helper, &a); + } else { + error = dn_ht_scan(dn_cfg.schedhash, copy_data_helper, &a); + } + DN_BH_WUNLOCK(); + if (error < 0) { + error = 0; /* we skip the sooptcopyout so we fail, hopefully */ + goto done; + } else { + error = 0; /* all fine */ + } + + if (compat) { + *compat = start; + sopt->sopt_valsize = buf - start; + /* free() is done by ip_dummynet_compat() */ + start = NULL; //XXX hack + } else { + error = sooptcopyout(sopt, start, buf - start); + } +done: + if (cmd && cmd != &r.o) + free(cmd, M_DUMMYNET); + if (start) + free(start, M_DUMMYNET); + return error; +} + +/* Callback called on scheduler instance to delete it if idle */ +static int +drain_scheduler_cb(void *_si, void *arg) +{ + struct dn_sch_inst *si = _si; + + if ((si->kflags & DN_ACTIVE) || si->dline.mq.head != NULL) + return 0; + + if (si->sched->fp->flags & DN_MULTIQUEUE) { + if (si->q_count == 0) + return si_destroy(si, NULL); + else + return 0; + } else { /* !DN_MULTIQUEUE */ + if ((si+1)->ni.length == 0) + return si_destroy(si, NULL); + else + return 0; + } + return 0; /* unreachable */ +} + +/* Callback called on scheduler to check if it has instances */ +static int +drain_scheduler_sch_cb(void *_s, void *arg) +{ + struct dn_schk *s = _s; + + if (s->sch.flags & DN_HAVE_MASK) { + dn_ht_scan_bucket(s->siht, &s->drain_bucket, + drain_scheduler_cb, NULL); + s->drain_bucket++; + } else { + if (s->siht) { + if (drain_scheduler_cb(s->siht, NULL) == DNHT_SCAN_DEL) + s->siht = NULL; + } + } + return 0; +} + +/* Called every tick, try to delete a 'bucket' of scheduler */ +void +dn_drain_scheduler(void) +{ + dn_ht_scan_bucket(dn_cfg.schedhash, &dn_cfg.drain_sch, + drain_scheduler_sch_cb, NULL); + dn_cfg.drain_sch++; +} + +/* Callback called on queue to delete if it is idle */ +static int +drain_queue_cb(void *_q, void *arg) +{ + struct dn_queue 
*q = _q; + + if (q->ni.length == 0) { + dn_delete_queue(q, DN_DESTROY); + return DNHT_SCAN_DEL; /* queue is deleted */ + } + + return 0; /* queue isn't deleted */ +} + +/* Callback called on flowset used to check if it has queues */ +static int +drain_queue_fs_cb(void *_fs, void *arg) +{ + struct dn_fsk *fs = _fs; + + if (fs->fs.flags & DN_QHT_HASH) { + /* Flowset has a hash table for queues */ + dn_ht_scan_bucket(fs->qht, &fs->drain_bucket, + drain_queue_cb, NULL); + fs->drain_bucket++; + } else { + /* No hash table for this flowset, null the pointer + * if the queue is deleted + */ + if (fs->qht) { + if (drain_queue_cb(fs->qht, NULL) == DNHT_SCAN_DEL) + fs->qht = NULL; + } + } + return 0; +} + +/* Called every tick, try to delete a 'bucket' of queue */ +void +dn_drain_queue(void) +{ + /* scan a bucket of flowset */ + dn_ht_scan_bucket(dn_cfg.fshash, &dn_cfg.drain_fs, + drain_queue_fs_cb, NULL); + dn_cfg.drain_fs++; +} + +/* + * Handler for the various dummynet socket options + */ +static int +ip_dn_ctl(struct sockopt *sopt) +{ + void *p = NULL; + int error, l; + + error = priv_check(sopt->sopt_td, PRIV_NETINET_DUMMYNET); + if (error) + return (error); + + /* Disallow sets in really-really secure mode. */ + if (sopt->sopt_dir == SOPT_SET) { + error = securelevel_ge(sopt->sopt_td->td_ucred, 3); + if (error) + return (error); + } + + switch (sopt->sopt_name) { + default : + D("dummynet: unknown option %d", sopt->sopt_name); + error = EINVAL; + break; + + case IP_DUMMYNET_FLUSH: + case IP_DUMMYNET_CONFIGURE: + case IP_DUMMYNET_DEL: /* remove a pipe or queue */ + case IP_DUMMYNET_GET: + D("dummynet: compat option %d", sopt->sopt_name); + error = ip_dummynet_compat(sopt); + break; + + case IP_DUMMYNET3 : + if (sopt->sopt_dir == SOPT_GET) { + error = dummynet_get(sopt, NULL); + break; + } + l = sopt->sopt_valsize; + /* XXX bumped size to 16000 for 3 profiles */ + if (l < sizeof(struct dn_id) || l > 16000) { + D("argument len %d invalid", l); + break; + } + p = malloc(l, M_TEMP, M_WAITOK); // XXX can it fail ? + error = sooptcopyin(sopt, p, l, l); + if (error) + break ; + error = do_config(p, l); + break; + } + + if (p != NULL) + free(p, M_TEMP); + + return error ; +} + + +static void +ip_dn_init(void) +{ + if (dn_cfg.init_done) + return; + printf("DUMMYNET %p with IPv6 initialized (100409)\n", curvnet); + dn_cfg.init_done = 1; + /* Set defaults here. MSVC does not accept initializers, + * and this is also useful for vimages + */ + /* queue limits */ + dn_cfg.slot_limit = 100; /* Foot shooting limit for queues. */ + dn_cfg.byte_limit = 1024 * 1024; + dn_cfg.expire = 1; + + /* RED parameters */ + dn_cfg.red_lookup_depth = 256; /* default lookup table depth */ + dn_cfg.red_avg_pkt_size = 512; /* default medium packet size */ + dn_cfg.red_max_pkt_size = 1500; /* default max packet size */ + + /* hash tables */ + dn_cfg.max_hash_size = 65536; /* max in the hash tables */ + dn_cfg.hash_size = 64; /* default hash size */ + + /* create hash tables for schedulers and flowsets. + * In both we search by key and by pointer. 
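+ * (the DNHT_KEY_IS_OBJ flag tells the hash/match callbacks whether + * 'key' is the bare number or a pointer to an object embedding it, + * see schk_hash()/fsk_hash() above)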
+ */ + dn_cfg.schedhash = dn_ht_init(NULL, dn_cfg.hash_size, + offsetof(struct dn_schk, schk_next), + schk_hash, schk_match, schk_new); + dn_cfg.fshash = dn_ht_init(NULL, dn_cfg.hash_size, + offsetof(struct dn_fsk, fsk_next), + fsk_hash, fsk_match, fsk_new); + + /* bucket index to drain object */ + dn_cfg.drain_fs = 0; + dn_cfg.drain_sch = 0; + + heap_init(&dn_cfg.evheap, 16, offsetof(struct dn_id, id)); + SLIST_INIT(&dn_cfg.fsu); + SLIST_INIT(&dn_cfg.schedlist); + + DN_LOCK_INIT(); + + TASK_INIT(&dn_task, 0, dummynet_task, curvnet); + dn_tq = taskqueue_create_fast("dummynet", M_WAITOK, + taskqueue_thread_enqueue, &dn_tq); + taskqueue_start_threads(&dn_tq, 1, PI_NET, "dummynet"); + + callout_init(&dn_timeout, CALLOUT_MPSAFE); + dn_reschedule(); + + /* Initialize curr_time adjustment mechanics. */ + getmicrouptime(&dn_cfg.prev_t); +} + +static void +ip_dn_destroy(int last) +{ + callout_drain(&dn_timeout); + + DN_BH_WLOCK(); + if (last) { + ND("removing last instance\n"); + ip_dn_ctl_ptr = NULL; + ip_dn_io_ptr = NULL; + } + + dummynet_flush(); + DN_BH_WUNLOCK(); + taskqueue_drain(dn_tq, &dn_task); + taskqueue_free(dn_tq); + + dn_ht_free(dn_cfg.schedhash, 0); + dn_ht_free(dn_cfg.fshash, 0); + heap_free(&dn_cfg.evheap); + + DN_LOCK_DESTROY(); +} + +static int +dummynet_modevent(module_t mod, int type, void *data) +{ + + if (type == MOD_LOAD) { + if (ip_dn_io_ptr) { + printf("DUMMYNET already loaded\n"); + return EEXIST ; + } + ip_dn_init(); + ip_dn_ctl_ptr = ip_dn_ctl; + ip_dn_io_ptr = dummynet_io; + return 0; + } else if (type == MOD_UNLOAD) { + ip_dn_destroy(1 /* last */); + return 0; + } else + return EOPNOTSUPP; +} + +/* modevent helpers for the modules */ +static int +load_dn_sched(struct dn_alg *d) +{ + struct dn_alg *s; + + if (d == NULL) + return 1; /* error */ + ip_dn_init(); /* just in case, we need the lock */ + + /* Check that mandatory funcs exists */ + if (d->enqueue == NULL || d->dequeue == NULL) { + D("missing enqueue or dequeue for %s", d->name); + return 1; + } + + /* Search if scheduler already exists */ + DN_BH_WLOCK(); + SLIST_FOREACH(s, &dn_cfg.schedlist, next) { + if (strcmp(s->name, d->name) == 0) { + D("%s already loaded", d->name); + break; /* scheduler already exists */ + } + } + if (s == NULL) + SLIST_INSERT_HEAD(&dn_cfg.schedlist, d, next); + DN_BH_WUNLOCK(); + D("dn_sched %s %sloaded", d->name, s ? "not ":""); + return s ? 1 : 0; +} + +static int +unload_dn_sched(struct dn_alg *s) +{ + struct dn_alg *tmp, *r; + int err = EINVAL; + + ND("called for %s", s->name); + + DN_BH_WLOCK(); + SLIST_FOREACH_SAFE(r, &dn_cfg.schedlist, next, tmp) { + if (strcmp(s->name, r->name) != 0) + continue; + ND("ref_count = %d", r->ref_count); + err = (r->ref_count != 0) ? EBUSY : 0; + if (err == 0) + SLIST_REMOVE(&dn_cfg.schedlist, r, dn_alg, next); + break; + } + DN_BH_WUNLOCK(); + D("dn_sched %s %sunloaded", s->name, err ? "not ":""); + return err; +} + +int +dn_sched_modevent(module_t mod, int cmd, void *arg) +{ + struct dn_alg *sch = arg; + + if (cmd == MOD_LOAD) + return load_dn_sched(sch); + else if (cmd == MOD_UNLOAD) + return unload_dn_sched(sch); + else + return EINVAL; +} + +static moduledata_t dummynet_mod = { + "dummynet", dummynet_modevent, NULL +}; + +#define DN_SI_SUB SI_SUB_PROTO_IFATTACHDOMAIN +#define DN_MODEV_ORD (SI_ORDER_ANY - 128) /* after ipfw */ +DECLARE_MODULE(dummynet, dummynet_mod, DN_SI_SUB, DN_MODEV_ORD); +MODULE_DEPEND(dummynet, ipfw, 3, 3, 3); +MODULE_VERSION(dummynet, 3); + +/* + * Starting up. Done in order after dummynet_modevent() has been called. 
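+ * In this tree the VNET_SYSINIT/VNET_SYSUNINIT hooks below are left + * commented out and ip_dn_init() is called directly from + * dummynet_modevent() and load_dn_sched() instead.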
+ * VNET_SYSINIT is also called for each existing vnet and each new vnet. + */ +//VNET_SYSINIT(vnet_dn_init, DN_SI_SUB, DN_MODEV_ORD+2, ip_dn_init, NULL); + +/* + * Shutdown handlers up shop. These are done in REVERSE ORDER, but still + * after dummynet_modevent() has been called. Not called on reboot. + * VNET_SYSUNINIT is also called for each exiting vnet as it exits. + * or when the module is unloaded. + */ +//VNET_SYSUNINIT(vnet_dn_uninit, DN_SI_SUB, DN_MODEV_ORD+2, ip_dn_destroy, NULL); + +/* end of file */ diff --git a/example/ipfw/sys/netpfil/ipfw/ip_fw2.c b/example/ipfw/sys/netpfil/ipfw/ip_fw2.c new file mode 100644 index 0000000..7e94502 --- /dev/null +++ b/example/ipfw/sys/netpfil/ipfw/ip_fw2.c @@ -0,0 +1,2905 @@ +/*- + * Copyright (c) 2002-2009 Luigi Rizzo, Universita` di Pisa + * + * Redistribution and use in source and binary forms, with or without + * modification, are permitted provided that the following conditions + * are met: + * 1. Redistributions of source code must retain the above copyright + * notice, this list of conditions and the following disclaimer. + * 2. Redistributions in binary form must reproduce the above copyright + * notice, this list of conditions and the following disclaimer in the + * documentation and/or other materials provided with the distribution. + * + * THIS SOFTWARE IS PROVIDED BY THE AUTHOR AND CONTRIBUTORS ``AS IS'' AND + * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE + * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE + * ARE DISCLAIMED. IN NO EVENT SHALL THE AUTHOR OR CONTRIBUTORS BE LIABLE + * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL + * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS + * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) + * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT + * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY + * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF + * SUCH DAMAGE. + */ + +#include +__FBSDID("$FreeBSD: head/sys/netpfil/ipfw/ip_fw2.c 272840 2014-10-09 19:32:35Z melifaro $"); + +/* + * The FreeBSD IP packet firewall, main file + */ + +#include "opt_ipfw.h" +#include "opt_ipdivert.h" +#include "opt_inet.h" +#ifndef INET +#error "IPFIREWALL requires INET" +#endif /* INET */ +#include "opt_inet6.h" +#include "opt_ipsec.h" + +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include /* for ETHERTYPE_IP */ +#include +#include +#include +#include +#include + +#include + +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include + +#include +#include +#ifdef INET6 +#include +#include +#include +#endif + +#include + +#include /* XXX for in_cksum */ + +#ifdef MAC +#include +#endif + +/* + * static variables followed by global ones. + * All ipfw global variables are here. 
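+ * The VNET_DEFINE()/VNET() macros give each virtual network stack + * its own copy in the FreeBSD kernel; in this userspace build the + * glue headers presumably reduce them to plain globals.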
+ */ + +static VNET_DEFINE(int, fw_deny_unknown_exthdrs); +#define V_fw_deny_unknown_exthdrs VNET(fw_deny_unknown_exthdrs) + +static VNET_DEFINE(int, fw_permit_single_frag6) = 1; +#define V_fw_permit_single_frag6 VNET(fw_permit_single_frag6) + +#ifdef IPFIREWALL_DEFAULT_TO_ACCEPT +static int default_to_accept = 1; +#else +static int default_to_accept; +#endif + +VNET_DEFINE(int, autoinc_step); +VNET_DEFINE(int, fw_one_pass) = 1; + +VNET_DEFINE(unsigned int, fw_tables_max); +VNET_DEFINE(unsigned int, fw_tables_sets) = 0; /* Don't use set-aware tables */ +/* Use 128 tables by default */ +static unsigned int default_fw_tables = IPFW_TABLES_DEFAULT; + +#ifndef LINEAR_SKIPTO +static int jump_fast(struct ip_fw_chain *chain, struct ip_fw *f, int num, + int tablearg, int jump_backwards); +#define JUMP(ch, f, num, targ, back) jump_fast(ch, f, num, targ, back) +#else +static int jump_linear(struct ip_fw_chain *chain, struct ip_fw *f, int num, + int tablearg, int jump_backwards); +#define JUMP(ch, f, num, targ, back) jump_linear(ch, f, num, targ, back) +#endif + +/* + * Each rule belongs to one of 32 different sets (0..31). + * The variable set_disable contains one bit per set. + * If the bit is set, all rules in the corresponding set + * are disabled. Set RESVD_SET(31) is reserved for the default rule + * and rules that are not deleted by the flush command, + * and CANNOT be disabled. + * Rules in set RESVD_SET can only be deleted individually. + */ +VNET_DEFINE(u_int32_t, set_disable); +#define V_set_disable VNET(set_disable) + +VNET_DEFINE(int, fw_verbose); +/* counter for ipfw_log(NULL...) */ +VNET_DEFINE(u_int64_t, norule_counter); +VNET_DEFINE(int, verbose_limit); + +/* layer3_chain contains the list of rules for layer 3 */ +VNET_DEFINE(struct ip_fw_chain, layer3_chain); + +/* ipfw_vnet_ready controls when we are open for business */ +VNET_DEFINE(int, ipfw_vnet_ready) = 0; + +VNET_DEFINE(int, ipfw_nat_ready) = 0; + +ipfw_nat_t *ipfw_nat_ptr = NULL; +struct cfg_nat *(*lookup_nat_ptr)(struct nat_list *, int); +ipfw_nat_cfg_t *ipfw_nat_cfg_ptr; +ipfw_nat_cfg_t *ipfw_nat_del_ptr; +ipfw_nat_cfg_t *ipfw_nat_get_cfg_ptr; +ipfw_nat_cfg_t *ipfw_nat_get_log_ptr; + +#ifdef SYSCTL_NODE +uint32_t dummy_def = IPFW_DEFAULT_RULE; +static int sysctl_ipfw_table_num(SYSCTL_HANDLER_ARGS); +static int sysctl_ipfw_tables_sets(SYSCTL_HANDLER_ARGS); + +SYSBEGIN(f3) + +SYSCTL_NODE(_net_inet_ip, OID_AUTO, fw, CTLFLAG_RW, 0, "Firewall"); +SYSCTL_VNET_INT(_net_inet_ip_fw, OID_AUTO, one_pass, + CTLFLAG_RW | CTLFLAG_SECURE3, &VNET_NAME(fw_one_pass), 0, + "Only do a single pass through ipfw when using dummynet(4)"); +SYSCTL_VNET_INT(_net_inet_ip_fw, OID_AUTO, autoinc_step, + CTLFLAG_RW, &VNET_NAME(autoinc_step), 0, + "Rule number auto-increment step"); +SYSCTL_VNET_INT(_net_inet_ip_fw, OID_AUTO, verbose, + CTLFLAG_RW | CTLFLAG_SECURE3, &VNET_NAME(fw_verbose), 0, + "Log matches to ipfw rules"); +SYSCTL_VNET_INT(_net_inet_ip_fw, OID_AUTO, verbose_limit, + CTLFLAG_RW, &VNET_NAME(verbose_limit), 0, + "Set upper limit of matches of ipfw rules logged"); +SYSCTL_UINT(_net_inet_ip_fw, OID_AUTO, default_rule, CTLFLAG_RD, + &dummy_def, 0, + "The default/max possible rule number."); +SYSCTL_VNET_PROC(_net_inet_ip_fw, OID_AUTO, tables_max, + CTLTYPE_UINT|CTLFLAG_RW, 0, 0, sysctl_ipfw_table_num, "IU", + "Maximum number of concurrently used tables"); +SYSCTL_VNET_PROC(_net_inet_ip_fw, OID_AUTO, tables_sets, + CTLTYPE_UINT|CTLFLAG_RW, 0, 0, sysctl_ipfw_tables_sets, "IU", + "Use per-set namespace for tables"); +SYSCTL_INT(_net_inet_ip_fw, 
OID_AUTO, default_to_accept, CTLFLAG_RDTUN, + &default_to_accept, 0, + "Make the default rule accept all packets."); +TUNABLE_INT("net.inet.ip.fw.tables_max", (int *)&default_fw_tables); +SYSCTL_VNET_INT(_net_inet_ip_fw, OID_AUTO, static_count, + CTLFLAG_RD, &VNET_NAME(layer3_chain.n_rules), 0, + "Number of static rules"); + +#ifdef INET6 +SYSCTL_DECL(_net_inet6_ip6); +SYSCTL_NODE(_net_inet6_ip6, OID_AUTO, fw, CTLFLAG_RW, 0, "Firewall"); +SYSCTL_VNET_INT(_net_inet6_ip6_fw, OID_AUTO, deny_unknown_exthdrs, + CTLFLAG_RW | CTLFLAG_SECURE, &VNET_NAME(fw_deny_unknown_exthdrs), 0, + "Deny packets with unknown IPv6 Extension Headers"); +SYSCTL_VNET_INT(_net_inet6_ip6_fw, OID_AUTO, permit_single_frag6, + CTLFLAG_RW | CTLFLAG_SECURE, &VNET_NAME(fw_permit_single_frag6), 0, + "Permit single packet IPv6 fragments"); +#endif /* INET6 */ + +SYSEND + +#endif /* SYSCTL_NODE */ + + +/* + * Some macros used in the various matching options. + * L3HDR maps an ipv4 pointer into a layer3 header pointer of type T + * Other macros just cast void * into the appropriate type + */ +#define L3HDR(T, ip) ((T *)((u_int32_t *)(ip) + (ip)->ip_hl)) +#define TCP(p) ((struct tcphdr *)(p)) +#define SCTP(p) ((struct sctphdr *)(p)) +#define UDP(p) ((struct udphdr *)(p)) +#define ICMP(p) ((struct icmphdr *)(p)) +#define ICMP6(p) ((struct icmp6_hdr *)(p)) + +static __inline int +icmptype_match(struct icmphdr *icmp, ipfw_insn_u32 *cmd) +{ + int type = icmp->icmp_type; + + return (type <= ICMP_MAXTYPE && (cmd->d[0] & (1<<type)) ); +} + +#define TT ( (1 << ICMP_ECHO) | (1 << ICMP_ROUTERSOLICIT) | \ + (1 << ICMP_TSTAMP) | (1 << ICMP_IREQ) | (1 << ICMP_MASKREQ) ) + +static int +is_icmp_query(struct icmphdr *icmp) +{ + int type = icmp->icmp_type; + + return (type <= ICMP_MAXTYPE && (TT & (1<<type)) ); +} +#undef TT + +/* + * The following checks use two arrays of 8 or 16 bits to store the + * bits that we want set or clear, respectively. They are in the + * low and high half of cmd->arg1 or cmd->d[0]. + * + * We scan options and store the bits we find set. We succeed if + * + * (want_set & ~bits) == 0 && (want_clear & ~bits) == want_clear + * + * The code is sometimes optimized not to store additional variables.
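+ * + * Example (made-up rule): 'tcpflags syn,!ack' stores SYN in the low + * byte of arg1 (want_set) and ACK in the high byte (want_clear); a + * segment with only SYN set gives bits == SYN, both conditions hold + * and flags_match() below returns 1.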
+ */ + +static int +flags_match(ipfw_insn *cmd, u_int8_t bits) +{ + u_char want_clear; + bits = ~bits; + + if ( ((cmd->arg1 & 0xff) & bits) != 0) + return 0; /* some bits we want set were clear */ + want_clear = (cmd->arg1 >> 8) & 0xff; + if ( (want_clear & bits) != want_clear) + return 0; /* some bits we want clear were set */ + return 1; +} + +static int +ipopts_match(struct ip *ip, ipfw_insn *cmd) +{ + int optlen, bits = 0; + u_char *cp = (u_char *)(ip + 1); + int x = (ip->ip_hl << 2) - sizeof (struct ip); + + for (; x > 0; x -= optlen, cp += optlen) { + int opt = cp[IPOPT_OPTVAL]; + + if (opt == IPOPT_EOL) + break; + if (opt == IPOPT_NOP) + optlen = 1; + else { + optlen = cp[IPOPT_OLEN]; + if (optlen <= 0 || optlen > x) + return 0; /* invalid or truncated */ + } + switch (opt) { + + default: + break; + + case IPOPT_LSRR: + bits |= IP_FW_IPOPT_LSRR; + break; + + case IPOPT_SSRR: + bits |= IP_FW_IPOPT_SSRR; + break; + + case IPOPT_RR: + bits |= IP_FW_IPOPT_RR; + break; + + case IPOPT_TS: + bits |= IP_FW_IPOPT_TS; + break; + } + } + return (flags_match(cmd, bits)); +} + +static int +tcpopts_match(struct tcphdr *tcp, ipfw_insn *cmd) +{ + int optlen, bits = 0; + u_char *cp = (u_char *)(tcp + 1); + int x = (tcp->th_off << 2) - sizeof(struct tcphdr); + + for (; x > 0; x -= optlen, cp += optlen) { + int opt = cp[0]; + if (opt == TCPOPT_EOL) + break; + if (opt == TCPOPT_NOP) + optlen = 1; + else { + optlen = cp[1]; + if (optlen <= 0) + break; + } + + switch (opt) { + + default: + break; + + case TCPOPT_MAXSEG: + bits |= IP_FW_TCPOPT_MSS; + break; + + case TCPOPT_WINDOW: + bits |= IP_FW_TCPOPT_WINDOW; + break; + + case TCPOPT_SACK_PERMITTED: + case TCPOPT_SACK: + bits |= IP_FW_TCPOPT_SACK; + break; + + case TCPOPT_TIMESTAMP: + bits |= IP_FW_TCPOPT_TS; + break; + + } + } + return (flags_match(cmd, bits)); +} + +static int +iface_match(struct ifnet *ifp, ipfw_insn_if *cmd, struct ip_fw_chain *chain, + uint32_t *tablearg) +{ + + if (ifp == NULL) /* no iface with this packet, match fails */ + return (0); + + /* Check by name or by IP address */ + if (cmd->name[0] != '\0') { /* match by name */ + if (cmd->name[0] == '\1') /* use tablearg to match */ + return ipfw_lookup_table_extended(chain, cmd->p.kidx, 0, + &ifp->if_index, tablearg); + /* Check name */ + if (cmd->p.glob) { + if (fnmatch(cmd->name, ifp->if_xname, 0) == 0) + return(1); + } else { + if (strncmp(ifp->if_xname, cmd->name, IFNAMSIZ) == 0) + return(1); + } + } else { +#if !defined(USERSPACE) && defined(__FreeBSD__) /* and OSX too ? */ + struct ifaddr *ia; + + if_addr_rlock(ifp); + TAILQ_FOREACH(ia, &ifp->if_addrhead, ifa_link) { + if (ia->ifa_addr->sa_family != AF_INET) + continue; + if (cmd->p.ip.s_addr == ((struct sockaddr_in *) + (ia->ifa_addr))->sin_addr.s_addr) { + if_addr_runlock(ifp); + return(1); /* match */ + } + } + if_addr_runlock(ifp); +#endif /* __FreeBSD__ */ + } + return(0); /* no match, fail ... */ +} + +/* + * The verify_path function checks if a route to the src exists and + * if it is reachable via ifp (when provided). + * + * The 'verrevpath' option checks that the interface that an IP packet + * arrives on is the same interface that traffic destined for the + * packet's source address would be routed out of. + * The 'versrcreach' option just checks that the source address is + * reachable via any route (except default) in the routing table. + * These two are a measure to block forged packets. This is also + * commonly known as "anti-spoofing" or Unicast Reverse Path + * Forwarding (Unicast RPF) in Cisco-ese.
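+ *
+ * Example (a sketch of the standard ipfw(8) idiom, not text from the
+ * imported source):
+ *
+ *     ipfw add deny ip from any to any not verrevpath in
+ *
+ * drops incoming packets whose claimed source address would not be
+ * routed back out through the interface they arrived on.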
The names of the knobs + * are purposely reminiscent of the Cisco IOS commands, + * + * ip verify unicast reverse-path + * ip verify unicast source reachable-via any + * + * which implement the same functionality. But note that the syntax + * is misleading, and the check may be performed on all IP packets + * whether unicast, multicast, or broadcast. + */ +static int +verify_path(struct in_addr src, struct ifnet *ifp, u_int fib) +{ +#if defined(USERSPACE) || !defined(__FreeBSD__) + return 0; +#else + struct route ro; + struct sockaddr_in *dst; + + bzero(&ro, sizeof(ro)); + + dst = (struct sockaddr_in *)&(ro.ro_dst); + dst->sin_family = AF_INET; + dst->sin_len = sizeof(*dst); + dst->sin_addr = src; + in_rtalloc_ign(&ro, 0, fib); + + if (ro.ro_rt == NULL) + return 0; + + /* + * If ifp is provided, check for equality with rtentry. + * We should use rt->rt_ifa->ifa_ifp, instead of rt->rt_ifp, + * in order to pass packets injected back by if_simloop(): + * routing entry (via lo0) for our own address + * may exist, so we need to handle routing asymmetry. + */ + if (ifp != NULL && ro.ro_rt->rt_ifa->ifa_ifp != ifp) { + RTFREE(ro.ro_rt); + return 0; + } + + /* if no ifp provided, check if rtentry is not default route */ + if (ifp == NULL && + satosin(rt_key(ro.ro_rt))->sin_addr.s_addr == INADDR_ANY) { + RTFREE(ro.ro_rt); + return 0; + } + + /* or if this is a blackhole/reject route */ + if (ifp == NULL && ro.ro_rt->rt_flags & (RTF_REJECT|RTF_BLACKHOLE)) { + RTFREE(ro.ro_rt); + return 0; + } + + /* found valid route */ + RTFREE(ro.ro_rt); + return 1; +#endif /* __FreeBSD__ */ +} + +#ifdef INET6 +/* + * ipv6 specific rules here... + */ +static __inline int +icmp6type_match (int type, ipfw_insn_u32 *cmd) +{ + return (type <= ICMP6_MAXTYPE && (cmd->d[type/32] & (1<<(type%32)) ) ); +} + +static int +flow6id_match( int curr_flow, ipfw_insn_u32 *cmd ) +{ + int i; + for (i=0; i <= cmd->o.arg1; ++i ) + if (curr_flow == cmd->d[i] ) + return 1; + return 0; +} + +/* support for IP6_*_ME opcodes */ +static int +search_ip6_addr_net (struct in6_addr * ip6_addr) +{ + struct ifnet *mdc; + struct ifaddr *mdc2; + struct in6_ifaddr *fdm; + struct in6_addr copia; + + TAILQ_FOREACH(mdc, &V_ifnet, if_link) { + if_addr_rlock(mdc); + TAILQ_FOREACH(mdc2, &mdc->if_addrhead, ifa_link) { + if (mdc2->ifa_addr->sa_family == AF_INET6) { + fdm = (struct in6_ifaddr *)mdc2; + copia = fdm->ia_addr.sin6_addr; + /* need for leaving scope_id in the sock_addr */ + in6_clearscope(&copia); + if (IN6_ARE_ADDR_EQUAL(ip6_addr, &copia)) { + if_addr_runlock(mdc); + return 1; + } + } + } + if_addr_runlock(mdc); + } + return 0; +} + +static int +verify_path6(struct in6_addr *src, struct ifnet *ifp, u_int fib) +{ + struct route_in6 ro; + struct sockaddr_in6 *dst; + + bzero(&ro, sizeof(ro)); + + dst = (struct sockaddr_in6 * )&(ro.ro_dst); + dst->sin6_family = AF_INET6; + dst->sin6_len = sizeof(*dst); + dst->sin6_addr = *src; + + in6_rtalloc_ign(&ro, 0, fib); + if (ro.ro_rt == NULL) + return 0; + + /* + * if ifp is provided, check for equality with rtentry + * We should use rt->rt_ifa->ifa_ifp, instead of rt->rt_ifp, + * to support the case of sending packets to an address of our own.
+ * (where the former interface is the first argument of if_simloop() + * (=ifp), the latter is lo0) + */ + if (ifp != NULL && ro.ro_rt->rt_ifa->ifa_ifp != ifp) { + RTFREE(ro.ro_rt); + return 0; + } + + /* if no ifp provided, check if rtentry is not default route */ + if (ifp == NULL && + IN6_IS_ADDR_UNSPECIFIED(&satosin6(rt_key(ro.ro_rt))->sin6_addr)) { + RTFREE(ro.ro_rt); + return 0; + } + + /* or if this is a blackhole/reject route */ + if (ifp == NULL && ro.ro_rt->rt_flags & (RTF_REJECT|RTF_BLACKHOLE)) { + RTFREE(ro.ro_rt); + return 0; + } + + /* found valid route */ + RTFREE(ro.ro_rt); + return 1; + +} + +static int +is_icmp6_query(int icmp6_type) +{ + if ((icmp6_type <= ICMP6_MAXTYPE) && + (icmp6_type == ICMP6_ECHO_REQUEST || + icmp6_type == ICMP6_MEMBERSHIP_QUERY || + icmp6_type == ICMP6_WRUREQUEST || + icmp6_type == ICMP6_FQDN_QUERY || + icmp6_type == ICMP6_NI_QUERY)) + return (1); + + return (0); +} + +static void +send_reject6(struct ip_fw_args *args, int code, u_int hlen, struct ip6_hdr *ip6) +{ + struct mbuf *m; + + m = args->m; + if (code == ICMP6_UNREACH_RST && args->f_id.proto == IPPROTO_TCP) { + struct tcphdr *tcp; + tcp = (struct tcphdr *)((char *)ip6 + hlen); + + if ((tcp->th_flags & TH_RST) == 0) { + struct mbuf *m0; + m0 = ipfw_send_pkt(args->m, &(args->f_id), + ntohl(tcp->th_seq), ntohl(tcp->th_ack), + tcp->th_flags | TH_RST); + if (m0 != NULL) + ip6_output(m0, NULL, NULL, 0, NULL, NULL, + NULL); + } + FREE_PKT(m); + } else if (code != ICMP6_UNREACH_RST) { /* Send an ICMPv6 unreach. */ +#if 0 + /* + * Unlike above, the mbufs need to line up with the ip6 hdr, + * as the contents are read. We need to m_adj() the + * needed amount. + * The mbuf will however be thrown away so we can adjust it. + * Remember we did an m_pullup on it already so we + * can make some assumptions about contiguousness. + */ + if (args->L3offset) + m_adj(m, args->L3offset); +#endif + icmp6_error(m, ICMP6_DST_UNREACH, code, 0); + } else + FREE_PKT(m); + + args->m = NULL; +} + +#endif /* INET6 */ + + +/* + * sends a reject message, consuming the mbuf passed as an argument. + */ +static void +send_reject(struct ip_fw_args *args, int code, int iplen, struct ip *ip) +{ + +#if 0 + /* XXX When ip is not guaranteed to be at mtod() we will + * need to account for this. + * The mbuf will however be thrown away so we can adjust it. + * Remember we did an m_pullup on it already so we + * can make some assumptions about contiguousness. + */ + if (args->L3offset) + m_adj(m, args->L3offset); +#endif + if (code != ICMP_REJECT_RST) { /* Send an ICMP unreach */ + icmp_error(args->m, ICMP_UNREACH, code, 0L, 0); + } else if (args->f_id.proto == IPPROTO_TCP) { + struct tcphdr *const tcp = + L3HDR(struct tcphdr, mtod(args->m, struct ip *)); + if ( (tcp->th_flags & TH_RST) == 0) { + struct mbuf *m; + m = ipfw_send_pkt(args->m, &(args->f_id), + ntohl(tcp->th_seq), ntohl(tcp->th_ack), + tcp->th_flags | TH_RST); + if (m != NULL) + ip_output(m, NULL, NULL, 0, NULL, NULL); + } + FREE_PKT(args->m); + } else + FREE_PKT(args->m); + args->m = NULL; +} + +/* + * Support for uid/gid/jail lookup. These tests are expensive + * (because we may need to look into the list of active sockets) + * so we cache the results. ugid_lookupp is 0 if we have not + * yet done a lookup, 1 if we succeeded, and -1 if we tried + * and failed. The function always returns the match value.
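+ *
+ * A usage sketch (illustration only; insn_uid/insn_gid are hypothetical
+ * instructions): with two such opcodes in one rule, only the first call
+ * pays for the PCB lookup:
+ *
+ *     struct ucred *uc = NULL;
+ *     int lookup = 0;  /* not tried yet */
+ *     m1 = check_uidgid(insn_uid, args, &lookup, &uc); /* does lookup */
+ *     m2 = check_uidgid(insn_gid, args, &lookup, &uc); /* reuses uc */
+ *     ...
+ *     if (uc != NULL)  /* as ipfw_chk() does before returning */
+ *             crfree(uc);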
+ * We could actually spare the variable and use *uc, setting + * it to (void *)check_uidgid if we have no info, NULL if + * we tried and failed, or any other value if successful. + */ +static int +check_uidgid(ipfw_insn_u32 *insn, struct ip_fw_args *args, int *ugid_lookupp, + struct ucred **uc) +{ +#if defined(USERSPACE) + return 0; // not supported in userspace +#else +#ifndef __FreeBSD__ + /* XXX */ + return cred_check(insn, proto, oif, + dst_ip, dst_port, src_ip, src_port, + (struct bsd_ucred *)uc, ugid_lookupp, ((struct mbuf *)inp)->m_skb); +#else /* FreeBSD */ + struct in_addr src_ip, dst_ip; + struct inpcbinfo *pi; + struct ipfw_flow_id *id; + struct inpcb *pcb, *inp; + struct ifnet *oif; + int lookupflags; + int match; + + id = &args->f_id; + inp = args->inp; + oif = args->oif; + + /* + * Check to see if the UDP or TCP stack supplied us with + * the PCB. If so, rather than holding a lock and looking + * up the PCB, we can use the one that was supplied. + */ + if (inp && *ugid_lookupp == 0) { + INP_LOCK_ASSERT(inp); + if (inp->inp_socket != NULL) { + *uc = crhold(inp->inp_cred); + *ugid_lookupp = 1; + } else + *ugid_lookupp = -1; + } + /* + * If we have already been here and the packet has no + * PCB entry associated with it, then we can safely + * assume that this is not a match. + */ + if (*ugid_lookupp == -1) + return (0); + if (id->proto == IPPROTO_TCP) { + lookupflags = 0; + pi = &V_tcbinfo; + } else if (id->proto == IPPROTO_UDP) { + lookupflags = INPLOOKUP_WILDCARD; + pi = &V_udbinfo; + } else + return 0; + lookupflags |= INPLOOKUP_RLOCKPCB; + match = 0; + if (*ugid_lookupp == 0) { + if (id->addr_type == 6) { +#ifdef INET6 + if (oif == NULL) + pcb = in6_pcblookup_mbuf(pi, + &id->src_ip6, htons(id->src_port), + &id->dst_ip6, htons(id->dst_port), + lookupflags, oif, args->m); + else + pcb = in6_pcblookup_mbuf(pi, + &id->dst_ip6, htons(id->dst_port), + &id->src_ip6, htons(id->src_port), + lookupflags, oif, args->m); +#else + *ugid_lookupp = -1; + return (0); +#endif + } else { + src_ip.s_addr = htonl(id->src_ip); + dst_ip.s_addr = htonl(id->dst_ip); + if (oif == NULL) + pcb = in_pcblookup_mbuf(pi, + src_ip, htons(id->src_port), + dst_ip, htons(id->dst_port), + lookupflags, oif, args->m); + else + pcb = in_pcblookup_mbuf(pi, + dst_ip, htons(id->dst_port), + src_ip, htons(id->src_port), + lookupflags, oif, args->m); + } + if (pcb != NULL) { + INP_RLOCK_ASSERT(pcb); + *uc = crhold(pcb->inp_cred); + *ugid_lookupp = 1; + INP_RUNLOCK(pcb); + } + if (*ugid_lookupp == 0) { + /* + * We tried and failed, set the variable to -1 + * so we will not try again on this packet. + */ + *ugid_lookupp = -1; + return (0); + } + } + if (insn->o.opcode == O_UID) + match = ((*uc)->cr_uid == (uid_t)insn->d[0]); + else if (insn->o.opcode == O_GID) + match = groupmember((gid_t)insn->d[0], *uc); + else if (insn->o.opcode == O_JAIL) + match = ((*uc)->cr_prison->pr_id == (int)insn->d[0]); + return (match); +#endif /* __FreeBSD__ */ +#endif /* not supported in userspace */ +} + +/* + * Helper function to set args with info on the rule after the matching + * one. slot is precise, whereas we guess rule_id as they are + * assigned sequentially.
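+ *
+ * A sketch of the marker convention (illustration only): slot is
+ * stored off by one so that zero can mean "no saved position", and a
+ * later pass resumes with
+ *
+ *     if (args->rule.slot)
+ *             f_pos = (args->rule.chain_id == chain->id) ?
+ *                 args->rule.slot :   /* still valid: next rule */
+ *                 ipfw_find_rule(chain, args->rule.rulenum,
+ *                     args->rule.rule_id);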
+ */ +static inline void +set_match(struct ip_fw_args *args, int slot, + struct ip_fw_chain *chain) +{ + args->rule.chain_id = chain->id; + args->rule.slot = slot + 1; /* we use 0 as a marker */ + args->rule.rule_id = 1 + chain->map[slot]->id; + args->rule.rulenum = chain->map[slot]->rulenum; +} + +#ifndef LINEAR_SKIPTO +/* + * Helper function to enable cached rule lookups using + * cached_id and cached_pos fields in ipfw rule. + */ +static int +jump_fast(struct ip_fw_chain *chain, struct ip_fw *f, int num, + int tablearg, int jump_backwards) +{ + int f_pos; + + /* If possible use cached f_pos (in f->cached_pos), + * whose version is written in f->cached_id + * (horrible hacks to avoid changing the ABI). + */ + if (num != IP_FW_TARG && f->cached_id == chain->id) + f_pos = f->cached_pos; + else { + int i = IP_FW_ARG_TABLEARG(chain, num, skipto); + /* make sure we do not jump backward */ + if (jump_backwards == 0 && i <= f->rulenum) + i = f->rulenum + 1; + if (chain->idxmap != NULL) + f_pos = chain->idxmap[i]; + else + f_pos = ipfw_find_rule(chain, i, 0); + /* update the cache */ + if (num != IP_FW_TARG) { + f->cached_id = chain->id; + f->cached_pos = f_pos; + } + } + + return (f_pos); +} +#else +/* + * Helper function to enable real fast rule lookups. + */ +static int +jump_linear(struct ip_fw_chain *chain, struct ip_fw *f, int num, + int tablearg, int jump_backwards) +{ + int f_pos; + + num = IP_FW_ARG_TABLEARG(chain, num, skipto); + /* make sure we do not jump backward */ + if (jump_backwards == 0 && num <= f->rulenum) + num = f->rulenum + 1; + f_pos = chain->idxmap[num]; + + return (f_pos); +} +#endif + +#define TARG(k, f) IP_FW_ARG_TABLEARG(chain, k, f) +/* + * The main check routine for the firewall. + * + * All arguments are in args so we can modify them and return them + * back to the caller. + * + * Parameters: + * + * args->m (in/out) The packet; we set to NULL when/if we nuke it. + * Starts with the IP header. + * args->eh (in) MAC header if present, NULL for layer3 packet. + * args->L3offset Number of bytes bypassed if we came from L2. + * e.g. often sizeof(eh) ** NOTYET ** + * args->oif Outgoing interface, NULL if packet is incoming. + * The incoming interface is in the mbuf. (in) + * args->divert_rule (in/out) + * Skip up to the first rule past this rule number; + * upon return, non-zero port number for divert or tee. + * + * args->rule Pointer to the last matching rule (in/out) + * args->next_hop Socket we are forwarding to (out). + * args->next_hop6 IPv6 next hop we are forwarding to (out). + * args->f_id Addresses grabbed from the packet (out) + * args->rule.info a cookie depending on rule action + * + * Return value: + * + * IP_FW_PASS the packet must be accepted + * IP_FW_DENY the packet must be dropped + * IP_FW_DIVERT divert packet, port in m_tag + * IP_FW_TEE tee packet, port in m_tag + * IP_FW_DUMMYNET to dummynet, pipe in args->cookie + * IP_FW_NETGRAPH into netgraph, cookie args->cookie + * args->rule contains the matching rule, + * args->rule.info has additional information. + * + */ +int +ipfw_chk(struct ip_fw_args *args) +{ + + /* + * Local variables holding state while processing a packet: + * + * IMPORTANT NOTE: to speed up the processing of rules, there + * are some assumptions about the values of the variables, which + * are documented here. Should you change them, please check + * the implementation of the various instructions to make sure + * that they still work. + * + * args->eh The MAC header.
It is non-null for a layer2 + * packet, it is NULL for a layer-3 packet. + * **notyet** + * args->L3offset Offset in the packet to the L3 (IP or equiv.) header. + * + * m | args->m Pointer to the mbuf, as received from the caller. + * It may change if ipfw_chk() does an m_pullup, or if it + * consumes the packet because it calls send_reject(). + * XXX This has to change, so that ipfw_chk() never modifies + * or consumes the buffer. + * ip is the beginning of the ip(4 or 6) header. + * Calculated by adding the L3offset to the start of data. + * (Until we start using L3offset, the packet is + * supposed to start with the ip header). + */ + struct mbuf *m = args->m; + struct ip *ip = mtod(m, struct ip *); + + /* + * For rules which contain uid/gid or jail constraints, cache + * a copy of the user's credentials after the pcb lookup has been + * executed. This will speed up the processing of rules with + * these types of constraints, as well as decrease contention + * on pcb related locks. + */ +#ifndef __FreeBSD__ + struct bsd_ucred ucred_cache; +#else + struct ucred *ucred_cache = NULL; +#endif + int ucred_lookup = 0; + + /* + * oif | args->oif If NULL, ipfw_chk has been called on the + * inbound path (ether_input, ip_input). + * If non-NULL, ipfw_chk has been called on the outbound path + * (ether_output, ip_output). + */ + struct ifnet *oif = args->oif; + + int f_pos = 0; /* index of current rule in the array */ + int retval = 0; + + /* + * hlen The length of the IP header. + */ + u_int hlen = 0; /* hlen >0 means we have an IP pkt */ + + /* + * offset The offset of a fragment. offset != 0 means that + * we have a fragment at this offset of an IPv4 packet. + * offset == 0 means that (if this is an IPv4 packet) + * this is the first or only fragment. + * For IPv6 offset|ip6f_mf == 0 means there is no Fragment Header + * or there is a single packet fragment (fragment header added + * without being needed). We will treat a single packet fragment as if + * there was no fragment header (or log/block depending on the + * V_fw_permit_single_frag6 sysctl setting). + */ + u_short offset = 0; + u_short ip6f_mf = 0; + + /* + * Local copies of addresses. They are only valid if we have + * an IP packet. + * + * proto The protocol. Set to 0 for non-ip packets, + * or to the protocol read from the packet otherwise. + * proto != 0 means that we have an IPv4 packet. + * + * src_port, dst_port port numbers, in HOST format. Only + * valid for TCP and UDP packets. + * + * src_ip, dst_ip ip addresses, in NETWORK format. + * Only valid for IPv4 packets. + */ + uint8_t proto; + uint16_t src_port = 0, dst_port = 0; /* NOTE: host format */ + struct in_addr src_ip, dst_ip; /* NOTE: network format */ + uint16_t iplen=0; + int pktlen; + uint16_t etype = 0; /* Host order stored ether type */ + + /* + * dyn_dir = MATCH_UNKNOWN when rules unchecked, + * MATCH_NONE when checked and not matched (q = NULL), + * MATCH_FORWARD or MATCH_REVERSE otherwise (q != NULL) + */ + int dyn_dir = MATCH_UNKNOWN; + ipfw_dyn_rule *q = NULL; + struct ip_fw_chain *chain = &V_layer3_chain; + + /* + * We store in ulp a pointer to the upper layer protocol header. + * In the ipv4 case this is easy to determine from the header, + * but for ipv6 we might have some additional headers in the middle. + * ulp is NULL if not found. + */ + void *ulp = NULL; /* upper layer protocol pointer.
*/ + + /* XXX ipv6 variables */ + int is_ipv6 = 0; + uint8_t icmp6_type = 0; + uint16_t ext_hd = 0; /* bits vector for extension header filtering */ + /* end of ipv6 variables */ + + int is_ipv4 = 0; + + int done = 0; /* flag to exit the outer loop */ + IPFW_RLOCK_TRACKER; + + if (m->m_flags & M_SKIP_FIREWALL || (! V_ipfw_vnet_ready)) + return (IP_FW_PASS); /* accept */ + + dst_ip.s_addr = 0; /* make sure it is initialized */ + src_ip.s_addr = 0; /* make sure it is initialized */ + pktlen = m->m_pkthdr.len; + args->f_id.fib = M_GETFIB(m); /* note: mbuf not altered */ + proto = args->f_id.proto = 0; /* mark f_id invalid */ + /* XXX 0 is a valid proto: IP/IPv6 Hop-by-Hop Option */ + +/* + * PULLUP_TO(len, p, T) makes sure that len + sizeof(T) is contiguous, + * then it sets p to point at the offset "len" in the mbuf. WARNING: the + * pointer might become stale after other pullups (but we never use it + * this way). + */ +#define PULLUP_TO(_len, p, T) PULLUP_LEN(_len, p, sizeof(T)) +#define PULLUP_LEN(_len, p, T) \ +do { \ + int x = (_len) + T; \ + if ((m)->m_len < x) { \ + args->m = m = m_pullup(m, x); \ + if (m == NULL) \ + goto pullup_failed; \ + } \ + p = (mtod(m, char *) + (_len)); \ +} while (0) + + /* + * if we have an ether header, + */ + if (args->eh) + etype = ntohs(args->eh->ether_type); + + /* Identify IP packets and fill up variables. */ + if (pktlen >= sizeof(struct ip6_hdr) && + (args->eh == NULL || etype == ETHERTYPE_IPV6) && ip->ip_v == 6) { + struct ip6_hdr *ip6 = (struct ip6_hdr *)ip; + is_ipv6 = 1; + args->f_id.addr_type = 6; + hlen = sizeof(struct ip6_hdr); + proto = ip6->ip6_nxt; + + /* Search extension headers to find upper layer protocols */ + while (ulp == NULL && offset == 0) { + switch (proto) { + case IPPROTO_ICMPV6: + PULLUP_TO(hlen, ulp, struct icmp6_hdr); + icmp6_type = ICMP6(ulp)->icmp6_type; + break; + + case IPPROTO_TCP: + PULLUP_TO(hlen, ulp, struct tcphdr); + dst_port = TCP(ulp)->th_dport; + src_port = TCP(ulp)->th_sport; + /* save flags for dynamic rules */ + args->f_id._flags = TCP(ulp)->th_flags; + break; + + case IPPROTO_SCTP: + PULLUP_TO(hlen, ulp, struct sctphdr); + src_port = SCTP(ulp)->src_port; + dst_port = SCTP(ulp)->dest_port; + break; + + case IPPROTO_UDP: + PULLUP_TO(hlen, ulp, struct udphdr); + dst_port = UDP(ulp)->uh_dport; + src_port = UDP(ulp)->uh_sport; + break; + + case IPPROTO_HOPOPTS: /* RFC 2460 */ + PULLUP_TO(hlen, ulp, struct ip6_hbh); + ext_hd |= EXT_HOPOPTS; + hlen += (((struct ip6_hbh *)ulp)->ip6h_len + 1) << 3; + proto = ((struct ip6_hbh *)ulp)->ip6h_nxt; + ulp = NULL; + break; + + case IPPROTO_ROUTING: /* RFC 2460 */ + PULLUP_TO(hlen, ulp, struct ip6_rthdr); + switch (((struct ip6_rthdr *)ulp)->ip6r_type) { + case 0: + ext_hd |= EXT_RTHDR0; + break; + case 2: + ext_hd |= EXT_RTHDR2; + break; + default: + if (V_fw_verbose) + printf("IPFW2: IPV6 - Unknown " + "Routing Header type(%d)\n", + ((struct ip6_rthdr *) + ulp)->ip6r_type); + if (V_fw_deny_unknown_exthdrs) + return (IP_FW_DENY); + break; + } + ext_hd |= EXT_ROUTING; + hlen += (((struct ip6_rthdr *)ulp)->ip6r_len + 1) << 3; + proto = ((struct ip6_rthdr *)ulp)->ip6r_nxt; + ulp = NULL; + break; + + case IPPROTO_FRAGMENT: /* RFC 2460 */ + PULLUP_TO(hlen, ulp, struct ip6_frag); + ext_hd |= EXT_FRAGMENT; + hlen += sizeof (struct ip6_frag); + proto = ((struct ip6_frag *)ulp)->ip6f_nxt; + offset = ((struct ip6_frag *)ulp)->ip6f_offlg & + IP6F_OFF_MASK; + ip6f_mf = ((struct ip6_frag *)ulp)->ip6f_offlg & + IP6F_MORE_FRAG; + if (V_fw_permit_single_frag6 == 0 && + offset == 0 && 
ip6f_mf == 0) { + if (V_fw_verbose) + printf("IPFW2: IPV6 - Invalid " + "Fragment Header\n"); + if (V_fw_deny_unknown_exthdrs) + return (IP_FW_DENY); + break; + } + args->f_id.extra = + ntohl(((struct ip6_frag *)ulp)->ip6f_ident); + ulp = NULL; + break; + + case IPPROTO_DSTOPTS: /* RFC 2460 */ + PULLUP_TO(hlen, ulp, struct ip6_hbh); + ext_hd |= EXT_DSTOPTS; + hlen += (((struct ip6_hbh *)ulp)->ip6h_len + 1) << 3; + proto = ((struct ip6_hbh *)ulp)->ip6h_nxt; + ulp = NULL; + break; + + case IPPROTO_AH: /* RFC 2402 */ + PULLUP_TO(hlen, ulp, struct ip6_ext); + ext_hd |= EXT_AH; + hlen += (((struct ip6_ext *)ulp)->ip6e_len + 2) << 2; + proto = ((struct ip6_ext *)ulp)->ip6e_nxt; + ulp = NULL; + break; + + case IPPROTO_ESP: /* RFC 2406 */ + PULLUP_TO(hlen, ulp, uint32_t); /* SPI, Seq# */ + /* Anything past Seq# is variable length and + * data past this ext. header is encrypted. */ + ext_hd |= EXT_ESP; + break; + + case IPPROTO_NONE: /* RFC 2460 */ + /* + * Packet ends here, and IPv6 header has + * already been pulled up. If ip6e_len!=0 + * then octets must be ignored. + */ + ulp = ip; /* non-NULL to get out of loop. */ + break; + + case IPPROTO_OSPFIGP: + /* XXX OSPF header check? */ + PULLUP_TO(hlen, ulp, struct ip6_ext); + break; + + case IPPROTO_PIM: + /* XXX PIM header check? */ + PULLUP_TO(hlen, ulp, struct pim); + break; + + case IPPROTO_CARP: + PULLUP_TO(hlen, ulp, struct carp_header); + if (((struct carp_header *)ulp)->carp_version != + CARP_VERSION) + return (IP_FW_DENY); + if (((struct carp_header *)ulp)->carp_type != + CARP_ADVERTISEMENT) + return (IP_FW_DENY); + break; + + case IPPROTO_IPV6: /* RFC 2893 */ + PULLUP_TO(hlen, ulp, struct ip6_hdr); + break; + + case IPPROTO_IPV4: /* RFC 2893 */ + PULLUP_TO(hlen, ulp, struct ip); + break; + + default: + if (V_fw_verbose) + printf("IPFW2: IPV6 - Unknown " + "Extension Header(%d), ext_hd=%x\n", + proto, ext_hd); + if (V_fw_deny_unknown_exthdrs) + return (IP_FW_DENY); + PULLUP_TO(hlen, ulp, struct ip6_ext); + break; + } /*switch */ + } + ip = mtod(m, struct ip *); + ip6 = (struct ip6_hdr *)ip; + args->f_id.src_ip6 = ip6->ip6_src; + args->f_id.dst_ip6 = ip6->ip6_dst; + args->f_id.src_ip = 0; + args->f_id.dst_ip = 0; + args->f_id.flow_id6 = ntohl(ip6->ip6_flow); + } else if (pktlen >= sizeof(struct ip) && + (args->eh == NULL || etype == ETHERTYPE_IP) && ip->ip_v == 4) { + is_ipv4 = 1; + hlen = ip->ip_hl << 2; + args->f_id.addr_type = 4; + + /* + * Collect parameters into local variables for faster matching. + */ + proto = ip->ip_p; + src_ip = ip->ip_src; + dst_ip = ip->ip_dst; + offset = ntohs(ip->ip_off) & IP_OFFMASK; + iplen = ntohs(ip->ip_len); + pktlen = iplen < pktlen ? 
iplen : pktlen; + + if (offset == 0) { + switch (proto) { + case IPPROTO_TCP: + PULLUP_TO(hlen, ulp, struct tcphdr); + dst_port = TCP(ulp)->th_dport; + src_port = TCP(ulp)->th_sport; + /* save flags for dynamic rules */ + args->f_id._flags = TCP(ulp)->th_flags; + break; + + case IPPROTO_SCTP: + PULLUP_TO(hlen, ulp, struct sctphdr); + src_port = SCTP(ulp)->src_port; + dst_port = SCTP(ulp)->dest_port; + break; + + case IPPROTO_UDP: + PULLUP_TO(hlen, ulp, struct udphdr); + dst_port = UDP(ulp)->uh_dport; + src_port = UDP(ulp)->uh_sport; + break; + + case IPPROTO_ICMP: + PULLUP_TO(hlen, ulp, struct icmphdr); + //args->f_id.flags = ICMP(ulp)->icmp_type; + break; + + default: + break; + } + } + + ip = mtod(m, struct ip *); + args->f_id.src_ip = ntohl(src_ip.s_addr); + args->f_id.dst_ip = ntohl(dst_ip.s_addr); + } +#undef PULLUP_TO + if (proto) { /* we may have port numbers, store them */ + args->f_id.proto = proto; + args->f_id.src_port = src_port = ntohs(src_port); + args->f_id.dst_port = dst_port = ntohs(dst_port); + } + + IPFW_PF_RLOCK(chain); + if (! V_ipfw_vnet_ready) { /* shutting down, leave NOW. */ + IPFW_PF_RUNLOCK(chain); + return (IP_FW_PASS); /* accept */ + } + if (args->rule.slot) { + /* + * Packet has already been tagged as a result of a previous + * match on rule args->rule aka args->rule_id (PIPE, QUEUE, + * REASS, NETGRAPH, DIVERT/TEE...) + * Validate the slot and continue from the next one + * if still present, otherwise do a lookup. + */ + f_pos = (args->rule.chain_id == chain->id) ? + args->rule.slot : + ipfw_find_rule(chain, args->rule.rulenum, + args->rule.rule_id); + } else { + f_pos = 0; + } + + /* + * Now scan the rules, and parse microinstructions for each rule. + * We have two nested loops and an inner switch. Sometimes we + * need to break out of one or both loops, or re-enter one of + * the loops with updated variables. Loop variables are: + * + * f_pos (outer loop) points to the current rule. + * On output it points to the matching rule. + * done (outer loop) is used as a flag to break the loop. + * l (inner loop) residual length of current rule. + * cmd points to the current microinstruction. + * + * We break the inner loop by setting l=0 and possibly + * cmdlen=0 if we don't want to advance cmd. + * We break the outer loop by setting done=1 + * We can restart the inner loop by setting l>0 and f_pos, f, cmd + * as needed. + */ + for (; f_pos < chain->n_rules; f_pos++) { + ipfw_insn *cmd; + uint32_t tablearg = 0; + int l, cmdlen, skip_or; /* skip rest of OR block */ + struct ip_fw *f; + + f = chain->map[f_pos]; + if (V_set_disable & (1 << f->set) ) + continue; + + skip_or = 0; + for (l = f->cmd_len, cmd = f->cmd ; l > 0 ; + l -= cmdlen, cmd += cmdlen) { + int match; + + /* + * check_body is a jump target used when we find a + * CHECK_STATE, and need to jump to the body of + * the target rule. + */ + +/* check_body: */ + cmdlen = F_LEN(cmd); + /* + * An OR block (insn_1 || .. || insn_n) has the + * F_OR bit set in all but the last instruction. + * The first match will set "skip_or", and cause + * the following instructions to be skipped until + * past the one with the F_OR bit clear. + */ + if (skip_or) { /* skip this instruction */ + if ((cmd->len & F_OR) == 0) + skip_or = 0; /* next one is good */ + continue; + } + match = 0; /* set to 1 if we succeed */ + + switch (cmd->opcode) { + /* + * The first set of opcodes compares the packet's + * fields with some pattern, setting 'match' if a + * match is found. 
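+ *
+ * For example (a sketch, not from the imported source): the rule
+ * "deny ip from { 1.2.3.4 or 5.6.7.8 } to any" compiles to two
+ * O_IP_SRC instructions with F_OR set on the first; the first hit
+ * sets skip_or and the rest of the block is skipped, while a miss
+ * on the last instruction (F_OR clear) falls through to the next
+ * rule.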
At the end of the loop there is + * logic to deal with F_NOT and F_OR flags associated + * with the opcode. + */ + case O_NOP: + match = 1; + break; + + case O_FORWARD_MAC: + printf("ipfw: opcode %d unimplemented\n", + cmd->opcode); + break; + + case O_GID: + case O_UID: + case O_JAIL: + /* + * We only check offset == 0 && proto != 0, + * as this ensures that we have a + * packet with the ports info. + */ + if (offset != 0) + break; + if (proto == IPPROTO_TCP || + proto == IPPROTO_UDP) + match = check_uidgid( + (ipfw_insn_u32 *)cmd, + args, &ucred_lookup, +#ifdef __FreeBSD__ + &ucred_cache); +#else + (void *)&ucred_cache); +#endif + break; + + case O_RECV: + match = iface_match(m->m_pkthdr.rcvif, + (ipfw_insn_if *)cmd, chain, &tablearg); + break; + + case O_XMIT: + match = iface_match(oif, (ipfw_insn_if *)cmd, + chain, &tablearg); + break; + + case O_VIA: + match = iface_match(oif ? oif : + m->m_pkthdr.rcvif, (ipfw_insn_if *)cmd, + chain, &tablearg); + break; + + case O_MACADDR2: + if (args->eh != NULL) { /* have MAC header */ + u_int32_t *want = (u_int32_t *) + ((ipfw_insn_mac *)cmd)->addr; + u_int32_t *mask = (u_int32_t *) + ((ipfw_insn_mac *)cmd)->mask; + u_int32_t *hdr = (u_int32_t *)args->eh; + + match = + ( want[0] == (hdr[0] & mask[0]) && + want[1] == (hdr[1] & mask[1]) && + want[2] == (hdr[2] & mask[2]) ); + } + break; + + case O_MAC_TYPE: + if (args->eh != NULL) { + u_int16_t *p = + ((ipfw_insn_u16 *)cmd)->ports; + int i; + + for (i = cmdlen - 1; !match && i>0; + i--, p += 2) + match = (etype >= p[0] && + etype <= p[1]); + } + break; + + case O_FRAG: + match = (offset != 0); + break; + + case O_IN: /* "out" is "not in" */ + match = (oif == NULL); + break; + + case O_LAYER2: + match = (args->eh != NULL); + break; + + case O_DIVERTED: + { + /* For diverted packets, args->rule.info + * contains the divert port (in host format) + * reason and direction. + */ + uint32_t i = args->rule.info; + match = (i&IPFW_IS_MASK) == IPFW_IS_DIVERT && + cmd->arg1 & ((i & IPFW_INFO_IN) ? 1 : 2); + } + break; + + case O_PROTO: + /* + * We do not allow an arg of 0 so the + * check of "proto" only suffices. + */ + match = (proto == cmd->arg1); + break; + + case O_IP_SRC: + match = is_ipv4 && + (((ipfw_insn_ip *)cmd)->addr.s_addr == + src_ip.s_addr); + break; + + case O_IP_SRC_LOOKUP: + case O_IP_DST_LOOKUP: + if (is_ipv4) { + uint32_t key = + (cmd->opcode == O_IP_DST_LOOKUP) ? + dst_ip.s_addr : src_ip.s_addr; + uint32_t v = 0; + + if (cmdlen > F_INSN_SIZE(ipfw_insn_u32)) { + /* generic lookup. The key must be + * in 32bit big-endian format. 
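+ * The selector in d[1] (read below) picks the key: 0 dst-ip,
+ * 1 src-ip, 2 dst-port, 3 src-port, 4 uid, 5 jail, 6 dscp. At the
+ * ipfw(8) level this is, e.g. (sketch), "lookup dst-port 13" to
+ * consult table 13 keyed by the destination port.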
+ */ + v = ((ipfw_insn_u32 *)cmd)->d[1]; + if (v == 0) + key = dst_ip.s_addr; + else if (v == 1) + key = src_ip.s_addr; + else if (v == 6) /* dscp */ + key = (ip->ip_tos >> 2) & 0x3f; + else if (offset != 0) + break; + else if (proto != IPPROTO_TCP && + proto != IPPROTO_UDP) + break; + else if (v == 2) + key = dst_port; + else if (v == 3) + key = src_port; +#ifndef USERSPACE + else if (v == 4 || v == 5) { + check_uidgid( + (ipfw_insn_u32 *)cmd, + args, &ucred_lookup, +#ifdef __FreeBSD__ + &ucred_cache); + if (v == 4 /* O_UID */) + key = ucred_cache->cr_uid; + else if (v == 5 /* O_JAIL */) + key = ucred_cache->cr_prison->pr_id; +#else /* !__FreeBSD__ */ + (void *)&ucred_cache); + if (v ==4 /* O_UID */) + key = ucred_cache.uid; + else if (v == 5 /* O_JAIL */) + key = ucred_cache.xid; +#endif /* !__FreeBSD__ */ + } +#endif /* !USERSPACE */ + else + break; + } + match = ipfw_lookup_table(chain, + cmd->arg1, key, &v); + if (!match) + break; + if (cmdlen == F_INSN_SIZE(ipfw_insn_u32)) + match = + ((ipfw_insn_u32 *)cmd)->d[0] == v; + else + tablearg = v; + } else if (is_ipv6) { + uint32_t v = 0; + void *pkey = (cmd->opcode == O_IP_DST_LOOKUP) ? + &args->f_id.dst_ip6: &args->f_id.src_ip6; + match = ipfw_lookup_table_extended(chain, + cmd->arg1, + sizeof(struct in6_addr), + pkey, &v); + if (cmdlen == F_INSN_SIZE(ipfw_insn_u32)) + match = ((ipfw_insn_u32 *)cmd)->d[0] == v; + if (match) + tablearg = v; + } + break; + + case O_IP_FLOW_LOOKUP: + { + uint32_t v = 0; + match = ipfw_lookup_table_extended(chain, + cmd->arg1, 0, &args->f_id, &v); + if (cmdlen == F_INSN_SIZE(ipfw_insn_u32)) + match = ((ipfw_insn_u32 *)cmd)->d[0] == v; + if (match) + tablearg = v; + } + break; + case O_IP_SRC_MASK: + case O_IP_DST_MASK: + if (is_ipv4) { + uint32_t a = + (cmd->opcode == O_IP_DST_MASK) ? + dst_ip.s_addr : src_ip.s_addr; + uint32_t *p = ((ipfw_insn_u32 *)cmd)->d; + int i = cmdlen-1; + + for (; !match && i>0; i-= 2, p+= 2) + match = (p[0] == (a & p[1])); + } + break; + + case O_IP_SRC_ME: + if (is_ipv4) { + struct ifnet *tif; + + INADDR_TO_IFP(src_ip, tif); + match = (tif != NULL); + break; + } +#ifdef INET6 + /* FALLTHROUGH */ + case O_IP6_SRC_ME: + match= is_ipv6 && search_ip6_addr_net(&args->f_id.src_ip6); +#endif + break; + + case O_IP_DST_SET: + case O_IP_SRC_SET: + if (is_ipv4) { + u_int32_t *d = (u_int32_t *)(cmd+1); + u_int32_t addr = + cmd->opcode == O_IP_DST_SET ? + args->f_id.dst_ip : + args->f_id.src_ip; + + if (addr < d[0]) + break; + addr -= d[0]; /* subtract base */ + match = (addr < cmd->arg1) && + ( d[ 1 + (addr>>5)] & + (1<<(addr & 0x1f)) ); + } + break; + + case O_IP_DST: + match = is_ipv4 && + (((ipfw_insn_ip *)cmd)->addr.s_addr == + dst_ip.s_addr); + break; + + case O_IP_DST_ME: + if (is_ipv4) { + struct ifnet *tif; + + INADDR_TO_IFP(dst_ip, tif); + match = (tif != NULL); + break; + } +#ifdef INET6 + /* FALLTHROUGH */ + case O_IP6_DST_ME: + match= is_ipv6 && search_ip6_addr_net(&args->f_id.dst_ip6); +#endif + break; + + + case O_IP_SRCPORT: + case O_IP_DSTPORT: + /* + * offset == 0 && proto != 0 is enough + * to guarantee that we have a + * packet with port info. + */ + if ((proto==IPPROTO_UDP || proto==IPPROTO_TCP) + && offset == 0) { + u_int16_t x = + (cmd->opcode == O_IP_SRCPORT) ? 
+ src_port : dst_port ; + u_int16_t *p = + ((ipfw_insn_u16 *)cmd)->ports; + int i; + + for (i = cmdlen - 1; !match && i>0; + i--, p += 2) + match = (x>=p[0] && x<=p[1]); + } + break; + + case O_ICMPTYPE: + match = (offset == 0 && proto==IPPROTO_ICMP && + icmptype_match(ICMP(ulp), (ipfw_insn_u32 *)cmd) ); + break; + +#ifdef INET6 + case O_ICMP6TYPE: + match = is_ipv6 && offset == 0 && + proto==IPPROTO_ICMPV6 && + icmp6type_match( + ICMP6(ulp)->icmp6_type, + (ipfw_insn_u32 *)cmd); + break; +#endif /* INET6 */ + + case O_IPOPT: + match = (is_ipv4 && + ipopts_match(ip, cmd) ); + break; + + case O_IPVER: + match = (is_ipv4 && + cmd->arg1 == ip->ip_v); + break; + + case O_IPID: + case O_IPLEN: + case O_IPTTL: + if (is_ipv4) { /* only for IP packets */ + uint16_t x; + uint16_t *p; + int i; + + if (cmd->opcode == O_IPLEN) + x = iplen; + else if (cmd->opcode == O_IPTTL) + x = ip->ip_ttl; + else /* must be IPID */ + x = ntohs(ip->ip_id); + if (cmdlen == 1) { + match = (cmd->arg1 == x); + break; + } + /* otherwise we have ranges */ + p = ((ipfw_insn_u16 *)cmd)->ports; + i = cmdlen - 1; + for (; !match && i>0; i--, p += 2) + match = (x >= p[0] && x <= p[1]); + } + break; + + case O_IPPRECEDENCE: + match = (is_ipv4 && + (cmd->arg1 == (ip->ip_tos & 0xe0)) ); + break; + + case O_IPTOS: + match = (is_ipv4 && + flags_match(cmd, ip->ip_tos)); + break; + + case O_DSCP: + { + uint32_t *p; + uint16_t x; + + p = ((ipfw_insn_u32 *)cmd)->d; + + if (is_ipv4) + x = ip->ip_tos >> 2; + else if (is_ipv6) { + uint8_t *v; + v = &((struct ip6_hdr *)ip)->ip6_vfc; + x = (*v & 0x0F) << 2; + v++; + x |= *v >> 6; + } else + break; + + /* DSCP bitmask is stored as low_u32 high_u32 */ + if (x >= 32) + match = *(p + 1) & (1 << (x - 32)); + else + match = *p & (1 << x); + } + break; + + case O_TCPDATALEN: + if (proto == IPPROTO_TCP && offset == 0) { + struct tcphdr *tcp; + uint16_t x; + uint16_t *p; + int i; + + tcp = TCP(ulp); + x = iplen - + ((ip->ip_hl + tcp->th_off) << 2); + if (cmdlen == 1) { + match = (cmd->arg1 == x); + break; + } + /* otherwise we have ranges */ + p = ((ipfw_insn_u16 *)cmd)->ports; + i = cmdlen - 1; + for (; !match && i>0; i--, p += 2) + match = (x >= p[0] && x <= p[1]); + } + break; + + case O_TCPFLAGS: + match = (proto == IPPROTO_TCP && offset == 0 && + flags_match(cmd, TCP(ulp)->th_flags)); + break; + + case O_TCPOPTS: + if (proto == IPPROTO_TCP && offset == 0 && ulp){ + PULLUP_LEN(hlen, ulp, + (TCP(ulp)->th_off << 2)); + match = tcpopts_match(TCP(ulp), cmd); + } + break; + + case O_TCPSEQ: + match = (proto == IPPROTO_TCP && offset == 0 && + ((ipfw_insn_u32 *)cmd)->d[0] == + TCP(ulp)->th_seq); + break; + + case O_TCPACK: + match = (proto == IPPROTO_TCP && offset == 0 && + ((ipfw_insn_u32 *)cmd)->d[0] == + TCP(ulp)->th_ack); + break; + + case O_TCPWIN: + if (proto == IPPROTO_TCP && offset == 0) { + uint16_t x; + uint16_t *p; + int i; + + x = ntohs(TCP(ulp)->th_win); + if (cmdlen == 1) { + match = (cmd->arg1 == x); + break; + } + /* Otherwise we have ranges. */ + p = ((ipfw_insn_u16 *)cmd)->ports; + i = cmdlen - 1; + for (; !match && i > 0; i--, p += 2) + match = (x >= p[0] && x <= p[1]); + } + break; + + case O_ESTAB: + /* reject packets which have SYN only */ + /* XXX should I also check for TH_ACK ? 
*/ + match = (proto == IPPROTO_TCP && offset == 0 && + (TCP(ulp)->th_flags & + (TH_RST | TH_ACK | TH_SYN)) != TH_SYN); + break; + + case O_ALTQ: { + struct pf_mtag *at; + struct m_tag *mtag; + ipfw_insn_altq *altq = (ipfw_insn_altq *)cmd; + + /* + * ALTQ uses mbuf tags from another + * packet filtering system - pf(4). + * We allocate a tag in its format + * and fill it in, pretending to be pf(4). + */ + match = 1; + at = pf_find_mtag(m); + if (at != NULL && at->qid != 0) + break; + mtag = m_tag_get(PACKET_TAG_PF, + sizeof(struct pf_mtag), M_NOWAIT | M_ZERO); + if (mtag == NULL) { + /* + * Let the packet fall back to the + * default ALTQ. + */ + break; + } + m_tag_prepend(m, mtag); + at = (struct pf_mtag *)(mtag + 1); + at->qid = altq->qid; + at->hdr = ip; + break; + } + + case O_LOG: + ipfw_log(chain, f, hlen, args, m, + oif, offset | ip6f_mf, tablearg, ip); + match = 1; + break; + + case O_PROB: + match = (random()<((ipfw_insn_u32 *)cmd)->d[0]); + break; + + case O_VERREVPATH: + /* Outgoing packets automatically pass/match */ + match = ((oif != NULL) || + (m->m_pkthdr.rcvif == NULL) || + ( +#ifdef INET6 + is_ipv6 ? + verify_path6(&(args->f_id.src_ip6), + m->m_pkthdr.rcvif, args->f_id.fib) : +#endif + verify_path(src_ip, m->m_pkthdr.rcvif, + args->f_id.fib))); + break; + + case O_VERSRCREACH: + /* Outgoing packets automatically pass/match */ + match = (hlen > 0 && ((oif != NULL) || +#ifdef INET6 + is_ipv6 ? + verify_path6(&(args->f_id.src_ip6), + NULL, args->f_id.fib) : +#endif + verify_path(src_ip, NULL, args->f_id.fib))); + break; + + case O_ANTISPOOF: + /* Outgoing packets automatically pass/match */ + if (oif == NULL && hlen > 0 && + ( (is_ipv4 && in_localaddr(src_ip)) +#ifdef INET6 + || (is_ipv6 && + in6_localaddr(&(args->f_id.src_ip6))) +#endif + )) + match = +#ifdef INET6 + is_ipv6 ? verify_path6( + &(args->f_id.src_ip6), + m->m_pkthdr.rcvif, + args->f_id.fib) : +#endif + verify_path(src_ip, + m->m_pkthdr.rcvif, + args->f_id.fib); + else + match = 1; + break; + + case O_IPSEC: +#ifdef IPSEC + match = (m_tag_find(m, + PACKET_TAG_IPSEC_IN_DONE, NULL) != NULL); +#endif + /* otherwise no match */ + break; + +#ifdef INET6 + case O_IP6_SRC: + match = is_ipv6 && + IN6_ARE_ADDR_EQUAL(&args->f_id.src_ip6, + &((ipfw_insn_ip6 *)cmd)->addr6); + break; + + case O_IP6_DST: + match = is_ipv6 && + IN6_ARE_ADDR_EQUAL(&args->f_id.dst_ip6, + &((ipfw_insn_ip6 *)cmd)->addr6); + break; + case O_IP6_SRC_MASK: + case O_IP6_DST_MASK: + if (is_ipv6) { + int i = cmdlen - 1; + struct in6_addr p; + struct in6_addr *d = + &((ipfw_insn_ip6 *)cmd)->addr6; + + for (; !match && i > 0; d += 2, + i -= F_INSN_SIZE(struct in6_addr) + * 2) { + p = (cmd->opcode == + O_IP6_SRC_MASK) ? + args->f_id.src_ip6: + args->f_id.dst_ip6; + APPLY_MASK(&p, &d[1]); + match = + IN6_ARE_ADDR_EQUAL(&d[0], + &p); + } + } + break; + + case O_FLOW6ID: + match = is_ipv6 && + flow6id_match(args->f_id.flow_id6, + (ipfw_insn_u32 *) cmd); + break; + + case O_EXT_HDR: + match = is_ipv6 && + (ext_hd & ((ipfw_insn *) cmd)->arg1); + break; + + case O_IP6: + match = is_ipv6; + break; +#endif + + case O_IP4: + match = is_ipv4; + break; + + case O_TAG: { + struct m_tag *mtag; + uint32_t tag = TARG(cmd->arg1, tag); + + /* Packet is already tagged with this tag? */ + mtag = m_tag_locate(m, MTAG_IPFW, tag, NULL); + + /* We have `untag' action when F_NOT flag is + * present. And we must remove this mtag from + * mbuf and reset `match' to zero (`match' will + * be inverted later). + * Otherwise we should allocate new mtag and + * push it into mbuf.
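+ * Example (sketch): "count tag 42 ip from any to any in" attaches
+ * tag 42 to matching packets, a later "tagged 42" option matches it,
+ * and "untag 42" is this same opcode with F_NOT set.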
+ */ + if (cmd->len & F_NOT) { /* `untag' action */ + if (mtag != NULL) + m_tag_delete(m, mtag); + match = 0; + } else { + if (mtag == NULL) { + mtag = m_tag_alloc( MTAG_IPFW, + tag, 0, M_NOWAIT); + if (mtag != NULL) + m_tag_prepend(m, mtag); + } + match = 1; + } + break; + } + + case O_FIB: /* try match the specified fib */ + if (args->f_id.fib == cmd->arg1) + match = 1; + break; + + case O_SOCKARG: { +#ifndef USERSPACE /* not supported in userspace */ + struct inpcb *inp = args->inp; + struct inpcbinfo *pi; + + if (is_ipv6) /* XXX can we remove this ? */ + break; + + if (proto == IPPROTO_TCP) + pi = &V_tcbinfo; + else if (proto == IPPROTO_UDP) + pi = &V_udbinfo; + else + break; + + /* + * XXXRW: so_user_cookie should almost + * certainly be inp_user_cookie? + */ + + /* For an incoming packet, look up the + inpcb using the src/dest ip/port tuple */ + if (inp == NULL) { + inp = in_pcblookup(pi, + src_ip, htons(src_port), + dst_ip, htons(dst_port), + INPLOOKUP_RLOCKPCB, NULL); + if (inp != NULL) { + tablearg = + inp->inp_socket->so_user_cookie; + if (tablearg) + match = 1; + INP_RUNLOCK(inp); + } + } else { + if (inp->inp_socket) { + tablearg = + inp->inp_socket->so_user_cookie; + if (tablearg) + match = 1; + } + } +#endif /* !USERSPACE */ + break; + } + + case O_TAGGED: { + struct m_tag *mtag; + uint32_t tag = TARG(cmd->arg1, tag); + + if (cmdlen == 1) { + match = m_tag_locate(m, MTAG_IPFW, + tag, NULL) != NULL; + break; + } + + /* we have ranges */ + for (mtag = m_tag_first(m); + mtag != NULL && !match; + mtag = m_tag_next(m, mtag)) { + uint16_t *p; + int i; + + if (mtag->m_tag_cookie != MTAG_IPFW) + continue; + + p = ((ipfw_insn_u16 *)cmd)->ports; + i = cmdlen - 1; + for(; !match && i > 0; i--, p += 2) + match = + mtag->m_tag_id >= p[0] && + mtag->m_tag_id <= p[1]; + } + break; + } + + /* + * The second set of opcodes represents 'actions', + * i.e. the terminal part of a rule once the packet + * matches all previous patterns. + * Typically there is only one action for each rule, + * and the opcode is stored at the end of the rule + * (but there are exceptions -- see below). + * + * In general, here we set retval and terminate the + * outer loop (would be a 'break 3' in some language, + * but we need to set l=0, done=1) + * + * Exceptions: + * O_COUNT and O_SKIPTO actions: + * instead of terminating, we jump to the next rule + * (setting l=0), or to the SKIPTO target (setting + * f/f_len, cmd and l as needed), respectively. + * + * O_TAG, O_LOG and O_ALTQ action parameters: + * perform some action and set match = 1; + * + * O_LIMIT and O_KEEP_STATE: these opcodes are + * not real 'actions', and are stored right + * before the 'action' part of the rule. + * These opcodes try to install an entry in the + * state tables; if successful, we continue with + * the next opcode (match=1; break;), otherwise + * the packet must be dropped (set retval, + * break loops with l=0, done=1) + * + * O_PROBE_STATE and O_CHECK_STATE: these opcodes + * cause a lookup of the state table, and a jump + * to the 'action' part of the parent rule + * if an entry is found, or + * (CHECK_STATE only) a jump to the next rule if + * the entry is not found. + * The result of the lookup is cached so that + * further instances of these opcodes become NOPs. + * The jump to the next rule is done by setting + * l=0, cmdlen=0.
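+ *
+ * In compact form, a sketch of the conventions used below:
+ *
+ *     O_COUNT:  count; l = 0;                 fall to the next rule
+ *     O_ACCEPT: retval = 0; l = 0; done = 1;  stop, accept packet
+ *     O_SKIPTO: f_pos = JUMP(...); l = 0;
+ *               cmdlen = 0;                   re-enter the loops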
+ */ + case O_LIMIT: + case O_KEEP_STATE: + if (ipfw_install_state(chain, f, + (ipfw_insn_limit *)cmd, args, tablearg)) { + /* error or limit violation */ + retval = IP_FW_DENY; + l = 0; /* exit inner loop */ + done = 1; /* exit outer loop */ + } + match = 1; + break; + + case O_PROBE_STATE: + case O_CHECK_STATE: + /* + * dynamic rules are checked at the first + * keep-state or check-state occurrence, + * with the result being stored in dyn_dir. + * The compiler introduces a PROBE_STATE + * instruction for us when we have a + * KEEP_STATE (because PROBE_STATE needs + * to be run first). + */ + if (dyn_dir == MATCH_UNKNOWN && + (q = ipfw_lookup_dyn_rule(&args->f_id, + &dyn_dir, proto == IPPROTO_TCP ? + TCP(ulp) : NULL)) + != NULL) { + /* + * Found dynamic entry, update stats + * and jump to the 'action' part of + * the parent rule by setting + * f, cmd, l and clearing cmdlen. + */ + IPFW_INC_DYN_COUNTER(q, pktlen); + /* XXX we would like to have f_pos + * readily accessible in the dynamic + * rule, instead of having to + * lookup q->rule. + */ + f = q->rule; + f_pos = ipfw_find_rule(chain, + f->rulenum, f->id); + cmd = ACTION_PTR(f); + l = f->cmd_len - f->act_ofs; + ipfw_dyn_unlock(q); + cmdlen = 0; + match = 1; + break; + } + /* + * Dynamic entry not found. If CHECK_STATE, + * skip to next rule, if PROBE_STATE just + * ignore and continue with next opcode. + */ + if (cmd->opcode == O_CHECK_STATE) + l = 0; /* exit inner loop */ + match = 1; + break; + + case O_ACCEPT: + retval = 0; /* accept */ + l = 0; /* exit inner loop */ + done = 1; /* exit outer loop */ + break; + + case O_PIPE: + case O_QUEUE: + set_match(args, f_pos, chain); + args->rule.info = TARG(cmd->arg1, pipe); + if (cmd->opcode == O_PIPE) + args->rule.info |= IPFW_IS_PIPE; + if (V_fw_one_pass) + args->rule.info |= IPFW_ONEPASS; + retval = IP_FW_DUMMYNET; + l = 0; /* exit inner loop */ + done = 1; /* exit outer loop */ + break; + + case O_DIVERT: + case O_TEE: + if (args->eh) /* not on layer 2 */ + break; + /* otherwise this is terminal */ + l = 0; /* exit inner loop */ + done = 1; /* exit outer loop */ + retval = (cmd->opcode == O_DIVERT) ? + IP_FW_DIVERT : IP_FW_TEE; + set_match(args, f_pos, chain); + args->rule.info = TARG(cmd->arg1, divert); + break; + + case O_COUNT: + IPFW_INC_RULE_COUNTER(f, pktlen); + l = 0; /* exit inner loop */ + break; + + case O_SKIPTO: + IPFW_INC_RULE_COUNTER(f, pktlen); + f_pos = JUMP(chain, f, cmd->arg1, tablearg, 0); + /* + * Skip disabled rules, and re-enter + * the inner loop with the correct + * f_pos, f, l and cmd. + * Also clear cmdlen and skip_or + */ + for (; f_pos < chain->n_rules - 1 && + (V_set_disable & + (1 << chain->map[f_pos]->set)); + f_pos++) + ; + /* Re-enter the inner loop at the skipto rule. */ + f = chain->map[f_pos]; + l = f->cmd_len; + cmd = f->cmd; + match = 1; + cmdlen = 0; + skip_or = 0; + continue; + break; /* not reached */ + + case O_CALLRETURN: { + /* + * Implementation of `subroutine' call/return, + * in the stack carried in an mbuf tag. This + * is different from `skipto' in that any call + * address is possible (`skipto' must prevent + * backward jumps to avoid endless loops). + * We have `return' action when F_NOT flag is + * present. The `m_tag_id' field is used as + * stack pointer. + */ + struct m_tag *mtag; + uint16_t jmpto, *stack; + +#define IS_CALL ((cmd->len & F_NOT) == 0) +#define IS_RETURN ((cmd->len & F_NOT) != 0) + /* + * Hand-rolled version of m_tag_locate() with + * wildcard `type'. + * If not already tagged, allocate new tag. 
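+ *
+ * Layout sketch of that tag: its payload is IPFW_CALLSTACK_SIZE
+ * uint16_t slots and mtag->m_tag_id doubles as the stack pointer,
+ * so, as coded below,
+ *
+ *     call:   stack[sp++] = f->rulenum;  jump to the target rule
+ *     return: jmpto = stack[--sp] + 1;   jump past the calling rule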
+ */ + mtag = m_tag_first(m); + while (mtag != NULL) { + if (mtag->m_tag_cookie == + MTAG_IPFW_CALL) + break; + mtag = m_tag_next(m, mtag); + } + if (mtag == NULL && IS_CALL) { + mtag = m_tag_alloc(MTAG_IPFW_CALL, 0, + IPFW_CALLSTACK_SIZE * + sizeof(uint16_t), M_NOWAIT); + if (mtag != NULL) + m_tag_prepend(m, mtag); + } + + /* + * On error both `call' and `return' just + * continue with next rule. + */ + if (IS_RETURN && (mtag == NULL || + mtag->m_tag_id == 0)) { + l = 0; /* exit inner loop */ + break; + } + if (IS_CALL && (mtag == NULL || + mtag->m_tag_id >= IPFW_CALLSTACK_SIZE)) { + printf("ipfw: call stack error, " + "go to next rule\n"); + l = 0; /* exit inner loop */ + break; + } + + IPFW_INC_RULE_COUNTER(f, pktlen); + stack = (uint16_t *)(mtag + 1); + + /* + * The `call' action may use cached f_pos + * (in f->cached_pos), whose version is written + * in f->cached_id. + * The `return' action, however, doesn't have + * fixed jump address in cmd->arg1 and can't use + * cache. + */ + if (IS_CALL) { + stack[mtag->m_tag_id] = f->rulenum; + mtag->m_tag_id++; + f_pos = JUMP(chain, f, cmd->arg1, + tablearg, 1); + } else { /* `return' action */ + mtag->m_tag_id--; + jmpto = stack[mtag->m_tag_id] + 1; + f_pos = ipfw_find_rule(chain, jmpto, 0); + } + + /* + * Skip disabled rules, and re-enter + * the inner loop with the correct + * f_pos, f, l and cmd. + * Also clear cmdlen and skip_or + */ + for (; f_pos < chain->n_rules - 1 && + (V_set_disable & + (1 << chain->map[f_pos]->set)); f_pos++) + ; + /* Re-enter the inner loop at the dest rule. */ + f = chain->map[f_pos]; + l = f->cmd_len; + cmd = f->cmd; + cmdlen = 0; + skip_or = 0; + continue; + break; /* NOTREACHED */ + } +#undef IS_CALL +#undef IS_RETURN + + case O_REJECT: + /* + * Drop the packet and send a reject notice + * if the packet is not ICMP (or is an ICMP + * query), and it is not multicast/broadcast.
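+ *
+ * Example (sketch): a "reject" rule answers with an ICMP
+ * unreachable via send_reject(), while "reset" (arg1 ==
+ * ICMP_REJECT_RST) answers a TCP segment with a RST instead;
+ * both then fall through to O_DENY.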
+ */ + if (hlen > 0 && is_ipv4 && offset == 0 && + (proto != IPPROTO_ICMP || + is_icmp_query(ICMP(ulp))) && + !(m->m_flags & (M_BCAST|M_MCAST)) && + !IN_MULTICAST(ntohl(dst_ip.s_addr))) { + send_reject(args, cmd->arg1, iplen, ip); + m = args->m; + } + /* FALLTHROUGH */ +#ifdef INET6 + case O_UNREACH6: + if (hlen > 0 && is_ipv6 && + ((offset & IP6F_OFF_MASK) == 0) && + (proto != IPPROTO_ICMPV6 || + (is_icmp6_query(icmp6_type) == 1)) && + !(m->m_flags & (M_BCAST|M_MCAST)) && + !IN6_IS_ADDR_MULTICAST(&args->f_id.dst_ip6)) { + send_reject6( + args, cmd->arg1, hlen, + (struct ip6_hdr *)ip); + m = args->m; + } + /* FALLTHROUGH */ +#endif + case O_DENY: + retval = IP_FW_DENY; + l = 0; /* exit inner loop */ + done = 1; /* exit outer loop */ + break; + + case O_FORWARD_IP: +#ifndef USERSPACE /* allow forward in userspace */ + if (args->eh) /* not valid on layer2 pkts */ + break; +#endif /* !USERSPACE */ + if (q == NULL || q->rule != f || + dyn_dir == MATCH_FORWARD) { + struct sockaddr_in *sa; + sa = &(((ipfw_insn_sa *)cmd)->sa); + if (sa->sin_addr.s_addr == INADDR_ANY) { + bcopy(sa, &args->hopstore, + sizeof(*sa)); + args->hopstore.sin_addr.s_addr = + htonl(tablearg); + args->next_hop = &args->hopstore; + } else { + args->next_hop = sa; + } + } + retval = IP_FW_PASS; + l = 0; /* exit inner loop */ + done = 1; /* exit outer loop */ + break; + +#ifdef INET6 + case O_FORWARD_IP6: + if (args->eh) /* not valid on layer2 pkts */ + break; + if (q == NULL || q->rule != f || + dyn_dir == MATCH_FORWARD) { + struct sockaddr_in6 *sin6; + + sin6 = &(((ipfw_insn_sa6 *)cmd)->sa); + args->next_hop6 = sin6; + } + retval = IP_FW_PASS; + l = 0; /* exit inner loop */ + done = 1; /* exit outer loop */ + break; +#endif + + case O_NETGRAPH: + case O_NGTEE: + set_match(args, f_pos, chain); + args->rule.info = TARG(cmd->arg1, netgraph); + if (V_fw_one_pass) + args->rule.info |= IPFW_ONEPASS; + retval = (cmd->opcode == O_NETGRAPH) ? 
+ IP_FW_NETGRAPH : IP_FW_NGTEE; + l = 0; /* exit inner loop */ + done = 1; /* exit outer loop */ + break; + + case O_SETFIB: { + uint32_t fib; + + IPFW_INC_RULE_COUNTER(f, pktlen); + fib = TARG(cmd->arg1, fib) & 0x7FFFF; + if (fib >= rt_numfibs) + fib = 0; + M_SETFIB(m, fib); + args->f_id.fib = fib; + l = 0; /* exit inner loop */ + break; + } + + case O_SETDSCP: { + uint16_t code; + + code = TARG(cmd->arg1, dscp) & 0x3F; + l = 0; /* exit inner loop */ + if (is_ipv4) { + uint16_t a; + + a = ip->ip_tos; + ip->ip_tos = (code << 2) | (ip->ip_tos & 0x03); + a += ntohs(ip->ip_sum) - ip->ip_tos; + ip->ip_sum = htons(a); + } else if (is_ipv6) { + uint8_t *v; + + v = &((struct ip6_hdr *)ip)->ip6_vfc; + *v = (*v & 0xF0) | (code >> 2); + v++; + *v = (*v & 0x3F) | ((code & 0x03) << 6); + } else + break; + + IPFW_INC_RULE_COUNTER(f, pktlen); + break; + } + + case O_NAT: + l = 0; /* exit inner loop */ + done = 1; /* exit outer loop */ + if (!IPFW_NAT_LOADED) { + retval = IP_FW_DENY; + break; + } + + struct cfg_nat *t; + int nat_id; + + set_match(args, f_pos, chain); + /* Check if this is 'global' nat rule */ + if (cmd->arg1 == 0) { + retval = ipfw_nat_ptr(args, NULL, m); + break; + } + t = ((ipfw_insn_nat *)cmd)->nat; + if (t == NULL) { + nat_id = TARG(cmd->arg1, nat); + t = (*lookup_nat_ptr)(&chain->nat, nat_id); + + if (t == NULL) { + retval = IP_FW_DENY; + break; + } + if (cmd->arg1 != IP_FW_TARG) + ((ipfw_insn_nat *)cmd)->nat = t; + } + retval = ipfw_nat_ptr(args, t, m); + break; + + case O_REASS: { + int ip_off; + + IPFW_INC_RULE_COUNTER(f, pktlen); + l = 0; /* in any case exit inner loop */ + ip_off = ntohs(ip->ip_off); + + /* if not fragmented, go to next rule */ + if ((ip_off & (IP_MF | IP_OFFMASK)) == 0) + break; + + args->m = m = ip_reass(m); + + /* + * do IP header checksum fixup. + */ + if (m == NULL) { /* fragment got swallowed */ + retval = IP_FW_DENY; + } else { /* good, packet complete */ + int hlen; + + ip = mtod(m, struct ip *); + hlen = ip->ip_hl << 2; + ip->ip_sum = 0; + if (hlen == sizeof(struct ip)) + ip->ip_sum = in_cksum_hdr(ip); + else + ip->ip_sum = in_cksum(m, hlen); + retval = IP_FW_REASS; + set_match(args, f_pos, chain); + } + done = 1; /* exit outer loop */ + break; + } + + default: + panic("-- unknown opcode %d\n", cmd->opcode); + } /* end of switch() on opcodes */ + /* + * if we get here with l=0, then match is irrelevant. + */ + + if (cmd->len & F_NOT) + match = !match; + + if (match) { + if (cmd->len & F_OR) + skip_or = 1; + } else { + if (!(cmd->len & F_OR)) /* not an OR block, */ + break; /* try next rule */ + } + + } /* end of inner loop, scan opcodes */ +#undef PULLUP_LEN + + if (done) + break; + +/* next_rule:; */ /* try next rule */ + + } /* end of outer for, scan rules */ + + if (done) { + struct ip_fw *rule = chain->map[f_pos]; + /* Update statistics */ + IPFW_INC_RULE_COUNTER(rule, pktlen); + } else { + retval = IP_FW_DENY; + printf("ipfw: ouch!, skip past end of rules, denying packet\n"); + } + IPFW_PF_RUNLOCK(chain); +#ifdef __FreeBSD__ + if (ucred_cache != NULL) + crfree(ucred_cache); +#endif + return (retval); + +pullup_failed: + if (V_fw_verbose) + printf("ipfw: pullup failed\n"); + return (IP_FW_DENY); +} + +/* + * Set maximum number of tables that can be used in given VNET ipfw instance. 
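+ *
+ * From userland the limit surfaces as an oid, e.g. (sketch)
+ *
+ *     sysctl net.inet.ip.fw.tables_max=256
+ *
+ * which ends up in ipfw_resize_tables() below.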
+ */ +#ifdef SYSCTL_NODE +static int +sysctl_ipfw_table_num(SYSCTL_HANDLER_ARGS) +{ + int error; + unsigned int ntables; + + ntables = V_fw_tables_max; + + error = sysctl_handle_int(oidp, &ntables, 0, req); + /* Read operation or some error */ + if ((error != 0) || (req->newptr == NULL)) + return (error); + + return (ipfw_resize_tables(&V_layer3_chain, ntables)); +} + +/* + * Switches table namespace between global and per-set. + */ +static int +sysctl_ipfw_tables_sets(SYSCTL_HANDLER_ARGS) +{ + int error; + unsigned int sets; + + sets = V_fw_tables_sets; + + error = sysctl_handle_int(oidp, &sets, 0, req); + /* Read operation or some error */ + if ((error != 0) || (req->newptr == NULL)) + return (error); + + return (ipfw_switch_tables_namespace(&V_layer3_chain, sets)); +} +#endif + +/* + * Module and VNET glue + */ + +/* + * Stuff that must be initialised only on boot or module load + */ +static int +ipfw_init(void) +{ + int error = 0; + + /* + * Only print out this stuff the first time around, + * when called from the sysinit code. + */ + printf("ipfw2 " +#ifdef INET6 + "(+ipv6) " +#endif + "initialized, divert %s, nat %s, " + "default to %s, logging ", +#ifdef IPDIVERT + "enabled", +#else + "loadable", +#endif +#ifdef IPFIREWALL_NAT + "enabled", +#else + "loadable", +#endif + default_to_accept ? "accept" : "deny"); + + /* + * Note: V_xxx variables can be accessed here but the vnet specific + * initializer may not have been called yet for the VIMAGE case. + * Tuneables will have been processed. We will print out values for + * the default vnet. + * XXX This should all be rationalized AFTER 8.0 + */ + if (V_fw_verbose == 0) + printf("disabled\n"); + else if (V_verbose_limit == 0) + printf("unlimited\n"); + else + printf("limited to %d packets/entry by default\n", + V_verbose_limit); + + /* Check user-supplied table count for validity */ + if (default_fw_tables > IPFW_TABLES_MAX) + default_fw_tables = IPFW_TABLES_MAX; + + ipfw_init_sopt_handler(); + ipfw_log_bpf(1); /* init */ + ipfw_iface_init(); + return (error); +} + +/* + * Called for the removal of the last instance only on module unload. + */ +static void +ipfw_destroy(void) +{ + + ipfw_iface_destroy(); + ipfw_log_bpf(0); /* uninit */ + ipfw_destroy_sopt_handler(); + printf("IP firewall unloaded\n"); +} + +/* + * Stuff that must be initialized for every instance + * (including the first of course). + */ +static int +vnet_ipfw_init(const void *unused) +{ + int error, first; + struct ip_fw *rule = NULL; + struct ip_fw_chain *chain; + + chain = &V_layer3_chain; + + first = IS_DEFAULT_VNET(curvnet) ? 
1 : 0; + + /* First set up some values that are compile time options */ + V_autoinc_step = 100; /* bounded to 1..1000 in add_rule() */ + V_fw_deny_unknown_exthdrs = 1; +#ifdef IPFIREWALL_VERBOSE + V_fw_verbose = 1; +#endif +#ifdef IPFIREWALL_VERBOSE_LIMIT + V_verbose_limit = IPFIREWALL_VERBOSE_LIMIT; +#endif +#ifdef IPFIREWALL_NAT + LIST_INIT(&chain->nat); +#endif + + ipfw_init_counters(); + /* insert the default rule and create the initial map */ + chain->n_rules = 1; + chain->map = malloc(sizeof(struct ip_fw *), M_IPFW, M_WAITOK | M_ZERO); + rule = ipfw_alloc_rule(chain, sizeof(struct ip_fw)); + + /* Set initial number of tables */ + V_fw_tables_max = default_fw_tables; + error = ipfw_init_tables(chain, first); + if (error) { + printf("ipfw2: setting up tables failed\n"); + free(chain->map, M_IPFW); + free(rule, M_IPFW); + return (ENOSPC); + } + + /* fill and insert the default rule */ + rule->act_ofs = 0; + rule->rulenum = IPFW_DEFAULT_RULE; + rule->cmd_len = 1; + rule->set = RESVD_SET; + rule->cmd[0].len = 1; + rule->cmd[0].opcode = default_to_accept ? O_ACCEPT : O_DENY; + chain->default_rule = chain->map[0] = rule; + chain->id = rule->id = 1; + /* Pre-calculate rules length for legacy dump format */ + chain->static_len = sizeof(struct ip_fw_rule0); + + IPFW_LOCK_INIT(chain); + ipfw_dyn_init(chain); +#ifdef LINEAR_SKIPTO + ipfw_init_skipto_cache(chain); +#endif + + /* First set up some values that are compile time options */ + V_ipfw_vnet_ready = 1; /* Open for business */ + + /* + * Hook the sockopt handler and pfil hooks for ipv4 and ipv6. + * Even if the latter two fail we still keep the module alive + * because the sockopt and layer2 paths are still useful. + * ipfw[6]_hook return 0 on success, ENOENT on failure, + * so we can ignore the exact return value and just set a flag. + * + * Note that V_fw[6]_enable are manipulated by a SYSCTL_PROC so + * changes in the underlying (per-vnet) variables trigger + * immediate hook()/unhook() calls. + * In layer2 we have the same behaviour, except that V_ether_ipfw + * is checked on each packet because there are no pfil hooks. + */ + V_ip_fw_ctl_ptr = ipfw_ctl3; + error = ipfw_attach_hooks(1); + return (error); +} + +/* + * Called for the removal of each instance. + */ +static int +vnet_ipfw_uninit(const void *unused) +{ + struct ip_fw *reap; + struct ip_fw_chain *chain = &V_layer3_chain; + int i, last; + + V_ipfw_vnet_ready = 0; /* tell new callers to go away */ + /* + * disconnect from ipv4, ipv6, layer2 and sockopt. + * Then grab, release and grab again the WLOCK so we make + * sure the update is propagated and nobody will be in. + */ + (void)ipfw_attach_hooks(0 /* detach */); + V_ip_fw_ctl_ptr = NULL; + + last = IS_DEFAULT_VNET(curvnet) ? 1 : 0; + + IPFW_UH_WLOCK(chain); + IPFW_UH_WUNLOCK(chain); + IPFW_UH_WLOCK(chain); + + IPFW_WLOCK(chain); + ipfw_dyn_uninit(0); /* run the callout_drain */ + IPFW_WUNLOCK(chain); + + reap = NULL; + IPFW_WLOCK(chain); + for (i = 0; i < chain->n_rules; i++) + ipfw_reap_add(chain, &reap, chain->map[i]); + free(chain->map, M_IPFW); +#ifdef LINEAR_SKIPTO + ipfw_destroy_skipto_cache(chain); +#endif + IPFW_WUNLOCK(chain); + IPFW_UH_WUNLOCK(chain); + ipfw_destroy_tables(chain, last); + if (reap != NULL) + ipfw_reap_rules(reap); + vnet_ipfw_iface_destroy(chain); + IPFW_LOCK_DESTROY(chain); + ipfw_dyn_uninit(1); /* free the remaining parts */ + ipfw_destroy_counters(); + return (0); +} + +/* + * Module event handler. 
+ * In general we have the choice of handling most of these events by the + * event handler or by the (VNET_)SYS(UN)INIT handlers. I have chosen to + * use the SYSINIT handlers as they are more capable of expressing the + * flow of control during module and vnet operations, so this is just + * a skeleton. Note there is no SYSINIT equivalent of the module + * SHUTDOWN handler, but we don't have anything to do in that case anyhow. + */ +static int +ipfw_modevent(module_t mod, int type, void *unused) +{ + int err = 0; + + switch (type) { + case MOD_LOAD: + /* Called once at module load or + * system boot if compiled in. */ + break; + case MOD_QUIESCE: + /* Called before unload. May veto unloading. */ + break; + case MOD_UNLOAD: + /* Called during unload. */ + break; + case MOD_SHUTDOWN: + /* Called during system shutdown. */ + break; + default: + err = EOPNOTSUPP; + break; + } + return err; +} + +static moduledata_t ipfwmod = { + "ipfw", + ipfw_modevent, + 0 +}; + +/* Define startup order. */ +#define IPFW_SI_SUB_FIREWALL SI_SUB_PROTO_IFATTACHDOMAIN +#define IPFW_MODEVENT_ORDER (SI_ORDER_ANY - 255) /* On boot slot in here. */ +#define IPFW_MODULE_ORDER (IPFW_MODEVENT_ORDER + 1) /* A little later. */ +#define IPFW_VNET_ORDER (IPFW_MODEVENT_ORDER + 2) /* Later still. */ + +DECLARE_MODULE(ipfw, ipfwmod, IPFW_SI_SUB_FIREWALL, IPFW_MODEVENT_ORDER); +FEATURE(ipfw_ctl3, "ipfw new sockopt calls"); +MODULE_VERSION(ipfw, 3); +/* should declare some dependencies here */ + +/* + * Starting up. Done in order after ipfwmod() has been called. + * VNET_SYSINIT is also called for each existing vnet and each new vnet. + */ +SYSINIT(ipfw_init, IPFW_SI_SUB_FIREWALL, IPFW_MODULE_ORDER, + ipfw_init, NULL); +VNET_SYSINIT(vnet_ipfw_init, IPFW_SI_SUB_FIREWALL, IPFW_VNET_ORDER, + vnet_ipfw_init, NULL); + +/* + * Closing up shop. These are done in REVERSE ORDER, but still + * after ipfwmod() has been called. Not called on reboot. + * VNET_SYSUNINIT is also called for each exiting vnet as it exits. + * or when the module is unloaded. + */ +SYSUNINIT(ipfw_destroy, IPFW_SI_SUB_FIREWALL, IPFW_MODULE_ORDER, + ipfw_destroy, NULL); +VNET_SYSUNINIT(vnet_ipfw_uninit, IPFW_SI_SUB_FIREWALL, IPFW_VNET_ORDER, + vnet_ipfw_uninit, NULL); +/* end of file */ diff --git a/example/ipfw/sys/netpfil/ipfw/ip_fw_dynamic.c b/example/ipfw/sys/netpfil/ipfw/ip_fw_dynamic.c new file mode 100644 index 0000000..ba6f579 --- /dev/null +++ b/example/ipfw/sys/netpfil/ipfw/ip_fw_dynamic.c @@ -0,0 +1,1604 @@ +/*- + * Copyright (c) 2002 Luigi Rizzo, Universita` di Pisa + * + * Redistribution and use in source and binary forms, with or without + * modification, are permitted provided that the following conditions + * are met: + * 1. Redistributions of source code must retain the above copyright + * notice, this list of conditions and the following disclaimer. + * 2. Redistributions in binary form must reproduce the above copyright + * notice, this list of conditions and the following disclaimer in the + * documentation and/or other materials provided with the distribution. + * + * THIS SOFTWARE IS PROVIDED BY THE AUTHOR AND CONTRIBUTORS ``AS IS'' AND + * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE + * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE + * ARE DISCLAIMED. 
IN NO EVENT SHALL THE AUTHOR OR CONTRIBUTORS BE LIABLE + * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL + * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS + * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) + * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT + * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY + * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF + * SUCH DAMAGE. + */ + +#include +__FBSDID("$FreeBSD: head/sys/netpfil/ipfw/ip_fw_dynamic.c 272840 2014-10-09 19:32:35Z melifaro $"); + +#define DEB(x) +#define DDB(x) x + +/* + * Dynamic rule support for ipfw + */ + +#include "opt_ipfw.h" +#include "opt_inet.h" +#ifndef INET +#error IPFIREWALL requires INET. +#endif /* INET */ +#include "opt_inet6.h" + +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include /* for ETHERTYPE_IP */ +#include +#include +#include + +#include +#include +#include /* ip_defttl */ +#include +#include +#include + +#include /* IN6_ARE_ADDR_EQUAL */ +#ifdef INET6 +#include +#include +#endif + +#include + +#include /* XXX for in_cksum */ + +#ifdef MAC +#include +#endif + +/* + * Description of dynamic rules. + * + * Dynamic rules are stored in lists accessed through a hash table + * (ipfw_dyn_v) whose size is curr_dyn_buckets. This value can + * be modified through the sysctl variable dyn_buckets which is + * updated when the table becomes empty. + * + * XXX currently there is only one list, ipfw_dyn. + * + * When a packet is received, its address fields are first masked + * with the mask defined for the rule, then hashed, then matched + * against the entries in the corresponding list. + * Dynamic rules can be used for different purposes: + * + stateful rules; + * + enforcing limits on the number of sessions; + * + in-kernel NAT (not implemented yet) + * + * The lifetime of dynamic rules is regulated by dyn_*_lifetime, + * measured in seconds and depending on the flags. + * + * The total number of dynamic rules is equal to UMA zone items count. + * The max number of dynamic rules is dyn_max. When we reach + * the maximum number of rules we do not create anymore. This is + * done to avoid consuming too much memory, but also too much + * time when searching on each packet (ideally, we should try instead + * to put a limit on the length of the list on each bucket...). + * + * Each dynamic rule holds a pointer to the parent ipfw rule so + * we know what action to perform. Dynamic rules are removed when + * the parent rule is deleted. This can be changed by dyn_keep_states + * sysctl. + * + * There are some limitations with dynamic rules -- we do not + * obey the 'randomized match', and we do not do multiple + * passes through the firewall. XXX check the latter!!! 
+ */ + +struct ipfw_dyn_bucket { + struct mtx mtx; /* Bucket protecting lock */ + ipfw_dyn_rule *head; /* Pointer to first rule */ +}; + +/* + * Static variables followed by global ones + */ +static VNET_DEFINE(struct ipfw_dyn_bucket *, ipfw_dyn_v); +static VNET_DEFINE(u_int32_t, dyn_buckets_max); +static VNET_DEFINE(u_int32_t, curr_dyn_buckets); +static VNET_DEFINE(struct callout, ipfw_timeout); +#define V_ipfw_dyn_v VNET(ipfw_dyn_v) +#define V_dyn_buckets_max VNET(dyn_buckets_max) +#define V_curr_dyn_buckets VNET(curr_dyn_buckets) +#define V_ipfw_timeout VNET(ipfw_timeout) + +static VNET_DEFINE(uma_zone_t, ipfw_dyn_rule_zone); +#define V_ipfw_dyn_rule_zone VNET(ipfw_dyn_rule_zone) + +#define IPFW_BUCK_LOCK_INIT(b) \ + mtx_init(&(b)->mtx, "IPFW dynamic bucket", NULL, MTX_DEF) +#define IPFW_BUCK_LOCK_DESTROY(b) \ + mtx_destroy(&(b)->mtx) +#define IPFW_BUCK_LOCK(i) mtx_lock(&V_ipfw_dyn_v[(i)].mtx) +#define IPFW_BUCK_UNLOCK(i) mtx_unlock(&V_ipfw_dyn_v[(i)].mtx) +#define IPFW_BUCK_ASSERT(i) mtx_assert(&V_ipfw_dyn_v[(i)].mtx, MA_OWNED) + + +static VNET_DEFINE(int, dyn_keep_states); +#define V_dyn_keep_states VNET(dyn_keep_states) + +/* + * Timeouts for various events in handing dynamic rules. + */ +static VNET_DEFINE(u_int32_t, dyn_ack_lifetime); +static VNET_DEFINE(u_int32_t, dyn_syn_lifetime); +static VNET_DEFINE(u_int32_t, dyn_fin_lifetime); +static VNET_DEFINE(u_int32_t, dyn_rst_lifetime); +static VNET_DEFINE(u_int32_t, dyn_udp_lifetime); +static VNET_DEFINE(u_int32_t, dyn_short_lifetime); + +#define V_dyn_ack_lifetime VNET(dyn_ack_lifetime) +#define V_dyn_syn_lifetime VNET(dyn_syn_lifetime) +#define V_dyn_fin_lifetime VNET(dyn_fin_lifetime) +#define V_dyn_rst_lifetime VNET(dyn_rst_lifetime) +#define V_dyn_udp_lifetime VNET(dyn_udp_lifetime) +#define V_dyn_short_lifetime VNET(dyn_short_lifetime) + +/* + * Keepalives are sent if dyn_keepalive is set. They are sent every + * dyn_keepalive_period seconds, in the last dyn_keepalive_interval + * seconds of lifetime of a rule. + * dyn_rst_lifetime and dyn_fin_lifetime should be strictly lower + * than dyn_keepalive_period. + */ + +static VNET_DEFINE(u_int32_t, dyn_keepalive_interval); +static VNET_DEFINE(u_int32_t, dyn_keepalive_period); +static VNET_DEFINE(u_int32_t, dyn_keepalive); +static VNET_DEFINE(time_t, dyn_keepalive_last); + +#define V_dyn_keepalive_interval VNET(dyn_keepalive_interval) +#define V_dyn_keepalive_period VNET(dyn_keepalive_period) +#define V_dyn_keepalive VNET(dyn_keepalive) +#define V_dyn_keepalive_last VNET(dyn_keepalive_last) + +static VNET_DEFINE(u_int32_t, dyn_max); /* max # of dynamic rules */ + +#define DYN_COUNT uma_zone_get_cur(V_ipfw_dyn_rule_zone) +#define V_dyn_max VNET(dyn_max) + +/* for userspace, we emulate the uma_zone_counter with ipfw_dyn_count */ +static int ipfw_dyn_count; /* number of objects */ + +#ifdef USERSPACE /* emulation of UMA object counters for userspace */ +#define uma_zone_get_cur(x) ipfw_dyn_count +#endif /* USERSPACE */ + +static int last_log; /* Log ratelimiting */ + +static void ipfw_dyn_tick(void *vnetx); +static void check_dyn_rules(struct ip_fw_chain *, ipfw_range_tlv *, int, int); +#ifdef SYSCTL_NODE + +static int sysctl_ipfw_dyn_count(SYSCTL_HANDLER_ARGS); +static int sysctl_ipfw_dyn_max(SYSCTL_HANDLER_ARGS); + +SYSBEGIN(f2) + +SYSCTL_DECL(_net_inet_ip_fw); +SYSCTL_VNET_UINT(_net_inet_ip_fw, OID_AUTO, dyn_buckets, + CTLFLAG_RW, &VNET_NAME(dyn_buckets_max), 0, + "Max number of dyn. 
buckets"); +SYSCTL_VNET_UINT(_net_inet_ip_fw, OID_AUTO, curr_dyn_buckets, + CTLFLAG_RD, &VNET_NAME(curr_dyn_buckets), 0, + "Current Number of dyn. buckets"); +SYSCTL_VNET_PROC(_net_inet_ip_fw, OID_AUTO, dyn_count, + CTLTYPE_UINT|CTLFLAG_RD, 0, 0, sysctl_ipfw_dyn_count, "IU", + "Number of dyn. rules"); +SYSCTL_VNET_PROC(_net_inet_ip_fw, OID_AUTO, dyn_max, + CTLTYPE_UINT|CTLFLAG_RW, 0, 0, sysctl_ipfw_dyn_max, "IU", + "Max number of dyn. rules"); +SYSCTL_VNET_UINT(_net_inet_ip_fw, OID_AUTO, dyn_ack_lifetime, + CTLFLAG_RW, &VNET_NAME(dyn_ack_lifetime), 0, + "Lifetime of dyn. rules for acks"); +SYSCTL_VNET_UINT(_net_inet_ip_fw, OID_AUTO, dyn_syn_lifetime, + CTLFLAG_RW, &VNET_NAME(dyn_syn_lifetime), 0, + "Lifetime of dyn. rules for syn"); +SYSCTL_VNET_UINT(_net_inet_ip_fw, OID_AUTO, dyn_fin_lifetime, + CTLFLAG_RW, &VNET_NAME(dyn_fin_lifetime), 0, + "Lifetime of dyn. rules for fin"); +SYSCTL_VNET_UINT(_net_inet_ip_fw, OID_AUTO, dyn_rst_lifetime, + CTLFLAG_RW, &VNET_NAME(dyn_rst_lifetime), 0, + "Lifetime of dyn. rules for rst"); +SYSCTL_VNET_UINT(_net_inet_ip_fw, OID_AUTO, dyn_udp_lifetime, + CTLFLAG_RW, &VNET_NAME(dyn_udp_lifetime), 0, + "Lifetime of dyn. rules for UDP"); +SYSCTL_VNET_UINT(_net_inet_ip_fw, OID_AUTO, dyn_short_lifetime, + CTLFLAG_RW, &VNET_NAME(dyn_short_lifetime), 0, + "Lifetime of dyn. rules for other situations"); +SYSCTL_VNET_UINT(_net_inet_ip_fw, OID_AUTO, dyn_keepalive, + CTLFLAG_RW, &VNET_NAME(dyn_keepalive), 0, + "Enable keepalives for dyn. rules"); +SYSCTL_VNET_UINT(_net_inet_ip_fw, OID_AUTO, dyn_keep_states, + CTLFLAG_RW, &VNET_NAME(dyn_keep_states), 0, + "Do not flush dynamic states on rule deletion"); + +SYSEND + +#endif /* SYSCTL_NODE */ + + +#ifdef INET6 +static __inline int +hash_packet6(struct ipfw_flow_id *id) +{ + u_int32_t i; + i = (id->dst_ip6.__u6_addr.__u6_addr32[2]) ^ + (id->dst_ip6.__u6_addr.__u6_addr32[3]) ^ + (id->src_ip6.__u6_addr.__u6_addr32[2]) ^ + (id->src_ip6.__u6_addr.__u6_addr32[3]) ^ + (id->dst_port) ^ (id->src_port); + return i; +} +#endif + +/* + * IMPORTANT: the hash function for dynamic rules must be commutative + * in source and destination (ip,port), because rules are bidirectional + * and we want to find both in the same bucket. + */ +static __inline int +hash_packet(struct ipfw_flow_id *id, int buckets) +{ + u_int32_t i; + +#ifdef INET6 + if (IS_IP6_FLOW_ID(id)) + i = hash_packet6(id); + else +#endif /* INET6 */ + i = (id->dst_ip) ^ (id->src_ip) ^ (id->dst_port) ^ (id->src_port); + i &= (buckets - 1); + return i; +} + +/** + * Print customizable flow id description via log(9) facility. 
+ */ +static void +print_dyn_rule_flags(struct ipfw_flow_id *id, int dyn_type, int log_flags, + char *prefix, char *postfix) +{ + struct in_addr da; +#ifdef INET6 + char src[INET6_ADDRSTRLEN], dst[INET6_ADDRSTRLEN]; +#else + char src[INET_ADDRSTRLEN], dst[INET_ADDRSTRLEN]; +#endif + +#ifdef INET6 + if (IS_IP6_FLOW_ID(id)) { + ip6_sprintf(src, &id->src_ip6); + ip6_sprintf(dst, &id->dst_ip6); + } else +#endif + { + da.s_addr = htonl(id->src_ip); + inet_ntop(AF_INET, &da, src, sizeof(src)); + da.s_addr = htonl(id->dst_ip); + inet_ntop(AF_INET, &da, dst, sizeof(dst)); + } + log(log_flags, "ipfw: %s type %d %s %d -> %s %d, %d %s\n", + prefix, dyn_type, src, id->src_port, dst, + id->dst_port, DYN_COUNT, postfix); +} + +#define print_dyn_rule(id, dtype, prefix, postfix) \ + print_dyn_rule_flags(id, dtype, LOG_DEBUG, prefix, postfix) + +#define TIME_LEQ(a,b) ((int)((a)-(b)) <= 0) +#define TIME_LE(a,b) ((int)((a)-(b)) < 0) + +/* + * Lookup a dynamic rule, locked version. + */ +static ipfw_dyn_rule * +lookup_dyn_rule_locked(struct ipfw_flow_id *pkt, int i, int *match_direction, + struct tcphdr *tcp) +{ + /* + * Stateful ipfw extensions. + * Lookup into dynamic session queue. + */ +#define MATCH_REVERSE 0 +#define MATCH_FORWARD 1 +#define MATCH_NONE 2 +#define MATCH_UNKNOWN 3 + int dir = MATCH_NONE; + ipfw_dyn_rule *prev, *q = NULL; + + IPFW_BUCK_ASSERT(i); + + for (prev = NULL, q = V_ipfw_dyn_v[i].head; q; prev = q, q = q->next) { + if (q->dyn_type == O_LIMIT_PARENT && q->count) + continue; + + if (pkt->proto != q->id.proto || q->dyn_type == O_LIMIT_PARENT) + continue; + + if (IS_IP6_FLOW_ID(pkt)) { + if (IN6_ARE_ADDR_EQUAL(&pkt->src_ip6, &q->id.src_ip6) && + IN6_ARE_ADDR_EQUAL(&pkt->dst_ip6, &q->id.dst_ip6) && + pkt->src_port == q->id.src_port && + pkt->dst_port == q->id.dst_port) { + dir = MATCH_FORWARD; + break; + } + if (IN6_ARE_ADDR_EQUAL(&pkt->src_ip6, &q->id.dst_ip6) && + IN6_ARE_ADDR_EQUAL(&pkt->dst_ip6, &q->id.src_ip6) && + pkt->src_port == q->id.dst_port && + pkt->dst_port == q->id.src_port) { + dir = MATCH_REVERSE; + break; + } + } else { + if (pkt->src_ip == q->id.src_ip && + pkt->dst_ip == q->id.dst_ip && + pkt->src_port == q->id.src_port && + pkt->dst_port == q->id.dst_port) { + dir = MATCH_FORWARD; + break; + } + if (pkt->src_ip == q->id.dst_ip && + pkt->dst_ip == q->id.src_ip && + pkt->src_port == q->id.dst_port && + pkt->dst_port == q->id.src_port) { + dir = MATCH_REVERSE; + break; + } + } + } + if (q == NULL) + goto done; /* q = NULL, not found */ + + if (prev != NULL) { /* found and not in front */ + prev->next = q->next; + q->next = V_ipfw_dyn_v[i].head; + V_ipfw_dyn_v[i].head = q; + } + if (pkt->proto == IPPROTO_TCP) { /* update state according to flags */ + uint32_t ack; + u_char flags = pkt->_flags & (TH_FIN | TH_SYN | TH_RST); + +#define BOTH_SYN (TH_SYN | (TH_SYN << 8)) +#define BOTH_FIN (TH_FIN | (TH_FIN << 8)) +#define TCP_FLAGS (TH_FLAGS | (TH_FLAGS << 8)) +#define ACK_FWD 0x10000 /* fwd ack seen */ +#define ACK_REV 0x20000 /* rev ack seen */ + + q->state |= (dir == MATCH_FORWARD) ? 
flags : (flags << 8); + switch (q->state & TCP_FLAGS) { + case TH_SYN: /* opening */ + q->expire = time_uptime + V_dyn_syn_lifetime; + break; + + case BOTH_SYN: /* move to established */ + case BOTH_SYN | TH_FIN: /* one side tries to close */ + case BOTH_SYN | (TH_FIN << 8): +#define _SEQ_GE(a,b) ((int)(a) - (int)(b) >= 0) + if (tcp == NULL) + break; + + ack = ntohl(tcp->th_ack); + if (dir == MATCH_FORWARD) { + if (q->ack_fwd == 0 || + _SEQ_GE(ack, q->ack_fwd)) { + q->ack_fwd = ack; + q->state |= ACK_FWD; + } + } else { + if (q->ack_rev == 0 || + _SEQ_GE(ack, q->ack_rev)) { + q->ack_rev = ack; + q->state |= ACK_REV; + } + } + if ((q->state & (ACK_FWD | ACK_REV)) == + (ACK_FWD | ACK_REV)) { + q->expire = time_uptime + V_dyn_ack_lifetime; + q->state &= ~(ACK_FWD | ACK_REV); + } + break; + + case BOTH_SYN | BOTH_FIN: /* both sides closed */ + if (V_dyn_fin_lifetime >= V_dyn_keepalive_period) + V_dyn_fin_lifetime = V_dyn_keepalive_period - 1; + q->expire = time_uptime + V_dyn_fin_lifetime; + break; + + default: +#if 0 + /* + * reset or some invalid combination, but can also + * occur if we use keep-state the wrong way. + */ + if ( (q->state & ((TH_RST << 8)|TH_RST)) == 0) + printf("invalid state: 0x%x\n", q->state); +#endif + if (V_dyn_rst_lifetime >= V_dyn_keepalive_period) + V_dyn_rst_lifetime = V_dyn_keepalive_period - 1; + q->expire = time_uptime + V_dyn_rst_lifetime; + break; + } + } else if (pkt->proto == IPPROTO_UDP) { + q->expire = time_uptime + V_dyn_udp_lifetime; + } else { + /* other protocols */ + q->expire = time_uptime + V_dyn_short_lifetime; + } +done: + if (match_direction != NULL) + *match_direction = dir; + return (q); +} + +ipfw_dyn_rule * +ipfw_lookup_dyn_rule(struct ipfw_flow_id *pkt, int *match_direction, + struct tcphdr *tcp) +{ + ipfw_dyn_rule *q; + int i; + + i = hash_packet(pkt, V_curr_dyn_buckets); + + IPFW_BUCK_LOCK(i); + q = lookup_dyn_rule_locked(pkt, i, match_direction, tcp); + if (q == NULL) + IPFW_BUCK_UNLOCK(i); + /* NB: return table locked when q is not NULL */ + return q; +} + +/* + * Unlock bucket mtx + * @p - pointer to dynamic rule + */ +void +ipfw_dyn_unlock(ipfw_dyn_rule *q) +{ + + IPFW_BUCK_UNLOCK(q->bucket); +} + +static int +resize_dynamic_table(struct ip_fw_chain *chain, int nbuckets) +{ + int i, k, nbuckets_old; + ipfw_dyn_rule *q; + struct ipfw_dyn_bucket *dyn_v, *dyn_v_old; + + /* Check if given number is power of 2 and less than 64k */ + if ((nbuckets > 65536) || (!powerof2(nbuckets))) + return 1; + + CTR3(KTR_NET, "%s: resize dynamic hash: %d -> %d", __func__, + V_curr_dyn_buckets, nbuckets); + + /* Allocate and initialize new hash */ + dyn_v = malloc(nbuckets * sizeof(ipfw_dyn_rule), M_IPFW, + M_WAITOK | M_ZERO); + + for (i = 0 ; i < nbuckets; i++) + IPFW_BUCK_LOCK_INIT(&dyn_v[i]); + + /* + * Call upper half lock, as get_map() do to ease + * read-only access to dynamic rules hash from sysctl + */ + IPFW_UH_WLOCK(chain); + + /* + * Acquire chain write lock to permit hash access + * for main traffic path without additional locks + */ + IPFW_WLOCK(chain); + + /* Save old values */ + nbuckets_old = V_curr_dyn_buckets; + dyn_v_old = V_ipfw_dyn_v; + + /* Skip relinking if array is not set up */ + if (V_ipfw_dyn_v == NULL) + V_curr_dyn_buckets = 0; + + /* Re-link all dynamic states */ + for (i = 0 ; i < V_curr_dyn_buckets ; i++) { + while (V_ipfw_dyn_v[i].head != NULL) { + /* Remove from current chain */ + q = V_ipfw_dyn_v[i].head; + V_ipfw_dyn_v[i].head = q->next; + + /* Get new hash value */ + k = hash_packet(&q->id, nbuckets); + q->bucket = k; + 
/* Add to the new head */ + q->next = dyn_v[k].head; + dyn_v[k].head = q; + } + } + + /* Update current pointers/buckets values */ + V_curr_dyn_buckets = nbuckets; + V_ipfw_dyn_v = dyn_v; + + IPFW_WUNLOCK(chain); + + IPFW_UH_WUNLOCK(chain); + + /* Start periodic callout on initial creation */ + if (dyn_v_old == NULL) { + callout_reset_on(&V_ipfw_timeout, hz, ipfw_dyn_tick, curvnet, 0); + return (0); + } + + /* Destroy all mutexes */ + for (i = 0 ; i < nbuckets_old ; i++) + IPFW_BUCK_LOCK_DESTROY(&dyn_v_old[i]); + + /* Free old hash */ + free(dyn_v_old, M_IPFW); + + return 0; +} + +/** + * Install state of type 'type' for a dynamic session. + * The hash table contains two type of rules: + * - regular rules (O_KEEP_STATE) + * - rules for sessions with limited number of sess per user + * (O_LIMIT). When they are created, the parent is + * increased by 1, and decreased on delete. In this case, + * the third parameter is the parent rule and not the chain. + * - "parent" rules for the above (O_LIMIT_PARENT). + */ +static ipfw_dyn_rule * +add_dyn_rule(struct ipfw_flow_id *id, int i, u_int8_t dyn_type, struct ip_fw *rule) +{ + ipfw_dyn_rule *r; + + IPFW_BUCK_ASSERT(i); + + r = uma_zalloc(V_ipfw_dyn_rule_zone, M_NOWAIT | M_ZERO); + if (r == NULL) { + if (last_log != time_uptime) { + last_log = time_uptime; + log(LOG_DEBUG, "ipfw: %s: Cannot allocate rule\n", + __func__); + } + return NULL; + } + ipfw_dyn_count++; + + /* + * refcount on parent is already incremented, so + * it is safe to use parent unlocked. + */ + if (dyn_type == O_LIMIT) { + ipfw_dyn_rule *parent = (ipfw_dyn_rule *)rule; + if ( parent->dyn_type != O_LIMIT_PARENT) + panic("invalid parent"); + r->parent = parent; + rule = parent->rule; + } + + r->id = *id; + r->expire = time_uptime + V_dyn_syn_lifetime; + r->rule = rule; + r->dyn_type = dyn_type; + IPFW_ZERO_DYN_COUNTER(r); + r->count = 0; + + r->bucket = i; + r->next = V_ipfw_dyn_v[i].head; + V_ipfw_dyn_v[i].head = r; + DEB(print_dyn_rule(id, dyn_type, "add dyn entry", "total");) + return r; +} + +/** + * lookup dynamic parent rule using pkt and rule as search keys. + * If the lookup fails, then install one. + */ +static ipfw_dyn_rule * +lookup_dyn_parent(struct ipfw_flow_id *pkt, int *pindex, struct ip_fw *rule) +{ + ipfw_dyn_rule *q; + int i, is_v6; + + is_v6 = IS_IP6_FLOW_ID(pkt); + i = hash_packet( pkt, V_curr_dyn_buckets ); + *pindex = i; + IPFW_BUCK_LOCK(i); + for (q = V_ipfw_dyn_v[i].head ; q != NULL ; q=q->next) + if (q->dyn_type == O_LIMIT_PARENT && + rule== q->rule && + pkt->proto == q->id.proto && + pkt->src_port == q->id.src_port && + pkt->dst_port == q->id.dst_port && + ( + (is_v6 && + IN6_ARE_ADDR_EQUAL(&(pkt->src_ip6), + &(q->id.src_ip6)) && + IN6_ARE_ADDR_EQUAL(&(pkt->dst_ip6), + &(q->id.dst_ip6))) || + (!is_v6 && + pkt->src_ip == q->id.src_ip && + pkt->dst_ip == q->id.dst_ip) + ) + ) { + q->expire = time_uptime + V_dyn_short_lifetime; + DEB(print_dyn_rule(pkt, q->dyn_type, + "lookup_dyn_parent found", "");) + return q; + } + + /* Add virtual limiting rule */ + return add_dyn_rule(pkt, i, O_LIMIT_PARENT, rule); +} + +/** + * Install dynamic state for rule type cmd->o.opcode + * + * Returns 1 (failure) if state is not installed because of errors or because + * session limitations are enforced. 
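resize_dynamic_table() above migrates states by popping each entry off its old list and pushing it onto the head of its new bucket, so the whole rehash is pointer surgery with no per-entry allocation or copying. A stripped-down model of that relink step (simplified types; the kernel version additionally performs this under the chain locks shown above):

    #include <stdint.h>

    struct ent {
            uint32_t key;
            struct ent *next;
    };

    /* Move every entry from the old power-of-2 table to the new one. */
    static void
    relink_all(struct ent **oldv, uint32_t oldn,
        struct ent **newv, uint32_t newn)
    {
            uint32_t i, k;
            struct ent *q;

            for (i = 0; i < oldn; i++) {
                    while (oldv[i] != NULL) {
                            q = oldv[i];
                            oldv[i] = q->next;      /* unlink from old bucket */
                            k = q->key & (newn - 1);
                            q->next = newv[k];      /* push on new head */
                            newv[k] = q;
                    }
            }
    }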
+ */ +int +ipfw_install_state(struct ip_fw_chain *chain, struct ip_fw *rule, + ipfw_insn_limit *cmd, struct ip_fw_args *args, uint32_t tablearg) +{ + ipfw_dyn_rule *q; + int i; + + DEB(print_dyn_rule(&args->f_id, cmd->o.opcode, "install_state", "");) + + i = hash_packet(&args->f_id, V_curr_dyn_buckets); + + IPFW_BUCK_LOCK(i); + + q = lookup_dyn_rule_locked(&args->f_id, i, NULL, NULL); + + if (q != NULL) { /* should never occur */ + DEB( + if (last_log != time_uptime) { + last_log = time_uptime; + printf("ipfw: %s: entry already present, done\n", + __func__); + }) + IPFW_BUCK_UNLOCK(i); + return (0); + } + + /* + * State limiting is done via uma(9) zone limiting. + * Save pointer to newly-installed rule and reject + * packet if add_dyn_rule() returned NULL. + * Note q is currently set to NULL. + */ + + switch (cmd->o.opcode) { + case O_KEEP_STATE: /* bidir rule */ + q = add_dyn_rule(&args->f_id, i, O_KEEP_STATE, rule); + break; + + case O_LIMIT: { /* limit number of sessions */ + struct ipfw_flow_id id; + ipfw_dyn_rule *parent; + uint32_t conn_limit; + uint16_t limit_mask = cmd->limit_mask; + int pindex; + + conn_limit = IP_FW_ARG_TABLEARG(chain, cmd->conn_limit, limit); + + DEB( + if (cmd->conn_limit == IP_FW_TARG) + printf("ipfw: %s: O_LIMIT rule, conn_limit: %u " + "(tablearg)\n", __func__, conn_limit); + else + printf("ipfw: %s: O_LIMIT rule, conn_limit: %u\n", + __func__, conn_limit); + ) + + id.dst_ip = id.src_ip = id.dst_port = id.src_port = 0; + id.proto = args->f_id.proto; + id.addr_type = args->f_id.addr_type; + id.fib = M_GETFIB(args->m); + + if (IS_IP6_FLOW_ID (&(args->f_id))) { + if (limit_mask & DYN_SRC_ADDR) + id.src_ip6 = args->f_id.src_ip6; + if (limit_mask & DYN_DST_ADDR) + id.dst_ip6 = args->f_id.dst_ip6; + } else { + if (limit_mask & DYN_SRC_ADDR) + id.src_ip = args->f_id.src_ip; + if (limit_mask & DYN_DST_ADDR) + id.dst_ip = args->f_id.dst_ip; + } + if (limit_mask & DYN_SRC_PORT) + id.src_port = args->f_id.src_port; + if (limit_mask & DYN_DST_PORT) + id.dst_port = args->f_id.dst_port; + + /* + * We have to release lock for previous bucket to + * avoid possible deadlock + */ + IPFW_BUCK_UNLOCK(i); + + if ((parent = lookup_dyn_parent(&id, &pindex, rule)) == NULL) { + printf("ipfw: %s: add parent failed\n", __func__); + IPFW_BUCK_UNLOCK(pindex); + return (1); + } + + if (parent->count >= conn_limit) { + if (V_fw_verbose && last_log != time_uptime) { + last_log = time_uptime; + char sbuf[24]; + last_log = time_uptime; + snprintf(sbuf, sizeof(sbuf), + "%d drop session", + parent->rule->rulenum); + print_dyn_rule_flags(&args->f_id, + cmd->o.opcode, + LOG_SECURITY | LOG_DEBUG, + sbuf, "too many entries"); + } + IPFW_BUCK_UNLOCK(pindex); + return (1); + } + /* Increment counter on parent */ + parent->count++; + IPFW_BUCK_UNLOCK(pindex); + + IPFW_BUCK_LOCK(i); + q = add_dyn_rule(&args->f_id, i, O_LIMIT, (struct ip_fw *)parent); + if (q == NULL) { + /* Decrement index and notify caller */ + IPFW_BUCK_UNLOCK(i); + IPFW_BUCK_LOCK(pindex); + parent->count--; + IPFW_BUCK_UNLOCK(pindex); + return (1); + } + break; + } + default: + printf("ipfw: %s: unknown dynamic rule type %u\n", + __func__, cmd->o.opcode); + } + + if (q == NULL) { + IPFW_BUCK_UNLOCK(i); + return (1); /* Notify caller about failure */ + } + + /* XXX just set lifetime */ + lookup_dyn_rule_locked(&args->f_id, i, NULL, NULL); + + IPFW_BUCK_UNLOCK(i); + return (0); +} + +/* + * Generate a TCP packet, containing either a RST or a keepalive. 
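ipfw_install_state()'s O_LIMIT branch above is careful never to hold two bucket locks at once: it drops the flow's bucket i, charges (or refuses) the session under the parent's bucket lock, and only then re-takes bucket i. A userspace sketch of that ordering discipline, with pthread mutexes standing in for the per-bucket mtx; every name here is hypothetical:

    #include <pthread.h>

    #define NBUCKETS 256
    static pthread_mutex_t bucket_mtx[NBUCKETS];    /* assume initialized */

    /*
     * Called with bucket_mtx[i] held; returns with it held again.
     * Returns 1 if the parent is already at its session limit.
     */
    static int
    limit_charge_parent(int i, int pindex, int *parent_count, int limit)
    {
            pthread_mutex_unlock(&bucket_mtx[i]);   /* never hold i and p */
            pthread_mutex_lock(&bucket_mtx[pindex]);
            if (*parent_count >= limit) {
                    pthread_mutex_unlock(&bucket_mtx[pindex]);
                    pthread_mutex_lock(&bucket_mtx[i]);
                    return (1);                     /* refuse: over limit */
            }
            (*parent_count)++;
            pthread_mutex_unlock(&bucket_mtx[pindex]);
            pthread_mutex_lock(&bucket_mtx[i]);     /* re-take flow bucket */
            return (0);
    }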
+ * When flags & TH_RST, we are sending a RST packet, because of a + * "reset" action matched the packet. + * Otherwise we are sending a keepalive, and flags & TH_ + * The 'replyto' mbuf is the mbuf being replied to, if any, and is required + * so that MAC can label the reply appropriately. + */ +struct mbuf * +ipfw_send_pkt(struct mbuf *replyto, struct ipfw_flow_id *id, u_int32_t seq, + u_int32_t ack, int flags) +{ + struct mbuf *m = NULL; /* stupid compiler */ + int len, dir; + struct ip *h = NULL; /* stupid compiler */ +#ifdef INET6 + struct ip6_hdr *h6 = NULL; +#endif + struct tcphdr *th = NULL; + + MGETHDR(m, M_NOWAIT, MT_DATA); + if (m == NULL) + return (NULL); + + M_SETFIB(m, id->fib); +#ifdef MAC + if (replyto != NULL) + mac_netinet_firewall_reply(replyto, m); + else + mac_netinet_firewall_send(m); +#else + (void)replyto; /* don't warn about unused arg */ +#endif + + switch (id->addr_type) { + case 4: + len = sizeof(struct ip) + sizeof(struct tcphdr); + break; +#ifdef INET6 + case 6: + len = sizeof(struct ip6_hdr) + sizeof(struct tcphdr); + break; +#endif + default: + /* XXX: log me?!? */ + FREE_PKT(m); + return (NULL); + } + dir = ((flags & (TH_SYN | TH_RST)) == TH_SYN); + + m->m_data += max_linkhdr; + m->m_flags |= M_SKIP_FIREWALL; + m->m_pkthdr.len = m->m_len = len; + m->m_pkthdr.rcvif = NULL; + bzero(m->m_data, len); + + switch (id->addr_type) { + case 4: + h = mtod(m, struct ip *); + + /* prepare for checksum */ + h->ip_p = IPPROTO_TCP; + h->ip_len = htons(sizeof(struct tcphdr)); + if (dir) { + h->ip_src.s_addr = htonl(id->src_ip); + h->ip_dst.s_addr = htonl(id->dst_ip); + } else { + h->ip_src.s_addr = htonl(id->dst_ip); + h->ip_dst.s_addr = htonl(id->src_ip); + } + + th = (struct tcphdr *)(h + 1); + break; +#ifdef INET6 + case 6: + h6 = mtod(m, struct ip6_hdr *); + + /* prepare for checksum */ + h6->ip6_nxt = IPPROTO_TCP; + h6->ip6_plen = htons(sizeof(struct tcphdr)); + if (dir) { + h6->ip6_src = id->src_ip6; + h6->ip6_dst = id->dst_ip6; + } else { + h6->ip6_src = id->dst_ip6; + h6->ip6_dst = id->src_ip6; + } + + th = (struct tcphdr *)(h6 + 1); + break; +#endif + } + + if (dir) { + th->th_sport = htons(id->src_port); + th->th_dport = htons(id->dst_port); + } else { + th->th_sport = htons(id->dst_port); + th->th_dport = htons(id->src_port); + } + th->th_off = sizeof(struct tcphdr) >> 2; + + if (flags & TH_RST) { + if (flags & TH_ACK) { + th->th_seq = htonl(ack); + th->th_flags = TH_RST; + } else { + if (flags & TH_SYN) + seq++; + th->th_ack = htonl(seq); + th->th_flags = TH_RST | TH_ACK; + } + } else { + /* + * Keepalive - use caller provided sequence numbers + */ + th->th_seq = htonl(seq); + th->th_ack = htonl(ack); + th->th_flags = TH_ACK; + } + + switch (id->addr_type) { + case 4: + th->th_sum = in_cksum(m, len); + + /* finish the ip header */ + h->ip_v = 4; + h->ip_hl = sizeof(*h) >> 2; + h->ip_tos = IPTOS_LOWDELAY; + h->ip_off = htons(0); + h->ip_len = htons(len); + h->ip_ttl = V_ip_defttl; + h->ip_sum = 0; + break; +#ifdef INET6 + case 6: + th->th_sum = in6_cksum(m, IPPROTO_TCP, sizeof(*h6), + sizeof(struct tcphdr)); + + /* finish the ip6 header */ + h6->ip6_vfc |= IPV6_VERSION; + h6->ip6_hlim = IPV6_DEFHLIM; + break; +#endif + } + + return (m); +} + +/* + * Queue keepalive packets for given dynamic rule + */ +static struct mbuf ** +ipfw_dyn_send_ka(struct mbuf **mtailp, ipfw_dyn_rule *q) +{ + struct mbuf *m_rev, *m_fwd; + + m_rev = (q->state & ACK_REV) ? NULL : + ipfw_send_pkt(NULL, &(q->id), q->ack_rev - 1, q->ack_fwd, TH_SYN); + m_fwd = (q->state & ACK_FWD) ? 
NULL :
+		    ipfw_send_pkt(NULL, &(q->id), q->ack_fwd - 1, q->ack_rev, 0);
+
+	if (m_rev != NULL) {
+		*mtailp = m_rev;
+		mtailp = &(*mtailp)->m_nextpkt;
+	}
+	if (m_fwd != NULL) {
+		*mtailp = m_fwd;
+		mtailp = &(*mtailp)->m_nextpkt;
+	}
+
+	return (mtailp);
+}
+
+/*
+ * This procedure is used to perform various maintenance
+ * on the dynamic hash list. Currently it is called every second.
+ */
+static void
+ipfw_dyn_tick(void * vnetx)
+{
+	struct ip_fw_chain *chain;
+	int check_ka = 0;
+#ifdef VIMAGE
+	struct vnet *vp = vnetx;
+#endif
+
+	CURVNET_SET(vp);
+
+	chain = &V_layer3_chain;
+
+	/* Run keepalive checks every keepalive_period iff ka is enabled */
+	if ((V_dyn_keepalive_last + V_dyn_keepalive_period <= time_uptime) &&
+	    (V_dyn_keepalive != 0)) {
+		V_dyn_keepalive_last = time_uptime;
+		check_ka = 1;
+	}
+
+	check_dyn_rules(chain, NULL, check_ka, 1);
+
+	callout_reset_on(&V_ipfw_timeout, hz, ipfw_dyn_tick, vnetx, 0);
+
+	CURVNET_RESTORE();
+}
+
+
+/*
+ * Walk through all dynamic states doing generic maintenance:
+ * 1) free expired states
+ * 2) free all states based on deleted rule / set
+ * 3) send keepalives for states if needed
+ *
+ * @chain - pointer to current ipfw rules chain
+ * @rt - delete all states originated by rules in matched range if != NULL
+ * @check_ka - perform checking/sending keepalives
+ * @timer - indicate call from timer routine.
+ *
+ * The timer routine must call this function unlocked to permit
+ * sending keepalives/resizing the table.
+ *
+ * Others have to call the function with IPFW_UH_WLOCK held.
+ * Additionally, the function assumes that the dynamic rule/set is
+ * ALREADY deleted so no new states can be generated by
+ * 'deleted' rules.
+ *
+ * The write lock is needed to ensure that unused parent rules
+ * are not freed by another instance (see stage 2, 3)
+ */
+static void
+check_dyn_rules(struct ip_fw_chain *chain, ipfw_range_tlv *rt,
+    int check_ka, int timer)
+{
+	struct mbuf *m0, *m, *mnext, **mtailp;
+	struct ip *h;
+	int i, dyn_count, new_buckets = 0, max_buckets;
+	int expired = 0, expired_limits = 0, parents = 0, total = 0;
+	ipfw_dyn_rule *q, *q_prev, *q_next;
+	ipfw_dyn_rule *exp_head, **exptailp;
+	ipfw_dyn_rule *exp_lhead, **expltailp;
+
+	KASSERT(V_ipfw_dyn_v != NULL, ("%s: dynamic table not allocated",
+	    __func__));
+
+	/* Avoid possible LOR */
+	KASSERT(!check_ka || timer, ("%s: keepalive check with lock held",
+	    __func__));
+
+	/*
+	 * Do not perform any checks if we currently have no dynamic states
+	 */
+	if (DYN_COUNT == 0)
+		return;
+
+	/* Expired states */
+	exp_head = NULL;
+	exptailp = &exp_head;
+
+	/* Expired limit states */
+	exp_lhead = NULL;
+	expltailp = &exp_lhead;
+
+	/*
+	 * We make a chain of packets to go out here -- not deferring
+	 * until after we drop the IPFW dynamic rule lock would result
+	 * in a lock order reversal with the normal packet input -> ipfw
+	 * call stack.
+	 */
+	m0 = NULL;
+	mtailp = &m0;
+
+	/* Protect from hash resizing */
+	if (timer != 0)
+		IPFW_UH_WLOCK(chain);
+	else
+		IPFW_UH_WLOCK_ASSERT(chain);
+
+#define	NEXT_RULE()	{ q_prev = q; q = q->next ; continue; }
+
+	/* Stage 1: perform requested deletion */
+	for (i = 0 ; i < V_curr_dyn_buckets ; i++) {
+		IPFW_BUCK_LOCK(i);
+		for (q = V_ipfw_dyn_v[i].head, q_prev = q; q ; ) {
+			/* account every rule */
+			total++;
+
+			/* Skip parent rules entirely */
+			if (q->dyn_type == O_LIMIT_PARENT) {
+				parents++;
+				NEXT_RULE();
+			}
+
+			/*
+			 * Remove rules which are:
+			 * 1) expired
+			 * 2) matching the deletion range
+			 */
+			if ((TIME_LEQ(q->expire, time_uptime)) ||
+			    (rt != NULL && ipfw_match_range(q->rule, rt))) {
+				if (TIME_LE(time_uptime, q->expire) &&
+				    q->dyn_type == O_KEEP_STATE &&
+				    V_dyn_keep_states != 0) {
+					/*
+					 * Do not delete state if
+					 * it is not expired and
+					 * dyn_keep_states is ON.
+					 * However we need to re-link it
+					 * to any other stable rule
+					 */
+					q->rule = chain->default_rule;
+					NEXT_RULE();
+				}
+
+				/* Unlink q from current list */
+				q_next = q->next;
+				if (q == V_ipfw_dyn_v[i].head)
+					V_ipfw_dyn_v[i].head = q_next;
+				else
+					q_prev->next = q_next;
+
+				q->next = NULL;
+
+				/* queue q to expire list */
+				if (q->dyn_type != O_LIMIT) {
+					*exptailp = q;
+					exptailp = &(*exptailp)->next;
+					DEB(print_dyn_rule(&q->id, q->dyn_type,
+					    "unlink entry", "left");
+					)
+				} else {
+					/* Separate list for limit rules */
+					*expltailp = q;
+					expltailp = &(*expltailp)->next;
+					expired_limits++;
+					DEB(print_dyn_rule(&q->id, q->dyn_type,
+					    "unlink limit entry", "left");
+					)
+				}
+
+				q = q_next;
+				expired++;
+				continue;
+			}
+
+			/*
+			 * Check if we need to send a keepalive:
+			 * we need to ensure it is time to do KA,
+			 * this is an established TCP session, and
+			 * the expire time is within the keepalive interval
+			 */
+			if ((check_ka != 0) && (q->id.proto == IPPROTO_TCP) &&
+			    ((q->state & BOTH_SYN) == BOTH_SYN) &&
+			    (TIME_LEQ(q->expire, time_uptime +
+			    V_dyn_keepalive_interval)))
+				mtailp = ipfw_dyn_send_ka(mtailp, q);
+
+			NEXT_RULE();
+		}
+		IPFW_BUCK_UNLOCK(i);
+	}
+
+	/* Stage 2: decrement counters from O_LIMIT parents */
+	if (expired_limits != 0) {
+		/*
+		 * XXX: Note that deleting a set with more than one
+		 * heavily-used LIMIT rule can result in overwhelming
+		 * locking due to lack of per-hash value sorting
+		 *
+		 * We should probably think about:
+		 * 1) pre-allocating a hash of size, say,
+		 * MAX(16, V_curr_dyn_buckets / 1024)
+		 * 2) checking if expired_limits is large enough
+		 * 3) If yes, init the hash (or its part), re-link
+		 * the current list and start the decrementing procedure in
+		 * each bucket separately
+		 */
+
+		/*
+		 * Small optimization: do not unlock a bucket until
+		 * we see that the next item resides in a different bucket
+		 */
+		if (exp_lhead != NULL) {
+			i = exp_lhead->parent->bucket;
+			IPFW_BUCK_LOCK(i);
+		}
+		for (q = exp_lhead; q != NULL; q = q->next) {
+			if (i != q->parent->bucket) {
+				IPFW_BUCK_UNLOCK(i);
+				i = q->parent->bucket;
+				IPFW_BUCK_LOCK(i);
+			}
+
+			/* Decrease parent refcount */
+			q->parent->count--;
+		}
+		if (exp_lhead != NULL)
+			IPFW_BUCK_UNLOCK(i);
+	}
+
+	/*
+	 * We protected ourselves from unused parent deletion
+	 * (from the timer function) by holding the UH write lock.
+	 */
+
+	/* Stage 3: remove unused parent rules */
+	if ((parents != 0) && (expired != 0)) {
+		for (i = 0 ; i < V_curr_dyn_buckets ; i++) {
+			IPFW_BUCK_LOCK(i);
+			for (q = V_ipfw_dyn_v[i].head, q_prev = q ; q ; ) {
+				if (q->dyn_type != O_LIMIT_PARENT)
+					NEXT_RULE();
+
+				if (q->count != 0)
+					NEXT_RULE();
+
+				/* Parent rule without consumers */
+
+				/* Unlink q from current list */
+				q_next = q->next;
+				if (q == V_ipfw_dyn_v[i].head)
+					V_ipfw_dyn_v[i].head = q_next;
+				else
+					q_prev->next = q_next;
+
+				q->next = NULL;
+
+				/* Add to expired list */
+				*exptailp = q;
+				exptailp = &(*exptailp)->next;
+
+				DEB(print_dyn_rule(&q->id, q->dyn_type,
+				    "unlink parent entry", "left");
+				)
+
+				expired++;
+
+				q = q_next;
+			}
+			IPFW_BUCK_UNLOCK(i);
+		}
+	}
+
+#undef NEXT_RULE
+
+	if (timer != 0) {
+		/*
+		 * Check if we need to resize the hash:
+		 * if the current number of states exceeds the number of
+		 * buckets in the hash, grow the hash size to the minimum
+		 * power of 2 which is bigger than the current states count.
+		 * Limit the hash size to 64k.
+		 */
+		max_buckets = (V_dyn_buckets_max > 65536) ?
+		    65536 : V_dyn_buckets_max;
+
+		dyn_count = DYN_COUNT;
+
+		if ((dyn_count > V_curr_dyn_buckets * 2) &&
+		    (dyn_count < max_buckets)) {
+			new_buckets = V_curr_dyn_buckets;
+			while (new_buckets < dyn_count) {
+				new_buckets *= 2;
+
+				if (new_buckets >= max_buckets)
+					break;
+			}
+		}
+
+		IPFW_UH_WUNLOCK(chain);
+	}
+
+	/* Finally delete old states and limits if any */
+	for (q = exp_head; q != NULL; q = q_next) {
+		q_next = q->next;
+		uma_zfree(V_ipfw_dyn_rule_zone, q);
+		ipfw_dyn_count--;
+	}
+
+	for (q = exp_lhead; q != NULL; q = q_next) {
+		q_next = q->next;
+		uma_zfree(V_ipfw_dyn_rule_zone, q);
+		ipfw_dyn_count--;
+	}
+
+	/*
+	 * The rest of the code MUST be called from the timer routine only
+	 * without holding any locks
+	 */
+	if (timer == 0)
+		return;
+
+	/* Send keepalive packets if any */
+	for (m = m0; m != NULL; m = mnext) {
+		mnext = m->m_nextpkt;
+		m->m_nextpkt = NULL;
+		h = mtod(m, struct ip *);
+		if (h->ip_v == 4)
+			ip_output(m, NULL, NULL, 0, NULL, NULL);
+#ifdef INET6
+		else
+			ip6_output(m, NULL, NULL, 0, NULL, NULL, NULL);
+#endif
+	}
+
+	/* Run the table resize without holding any locks */
+	if (new_buckets != 0)
+		resize_dynamic_table(chain, new_buckets);
+}
+
+/*
+ * Deletes all dynamic rules originated by a given rule or all rules in
+ * a given set. Specify RESVD_SET to indicate the set should not be used.
+ * @chain - pointer to current ipfw rules chain
+ * @rt - delete all states originated by rules in the matched range.
+ *
+ * The function has to be called with IPFW_UH_WLOCK held.
+ * Additionally, the function assumes that the dynamic rule/set is
+ * ALREADY deleted so no new states can be generated by
+ * 'deleted' rules.
+ */
+void
+ipfw_expire_dyn_rules(struct ip_fw_chain *chain, ipfw_range_tlv *rt)
+{
+
+	check_dyn_rules(chain, rt, 0, 0);
+}
+
+/*
+ * Check if a rule contains at least one dynamic opcode.
+ *
+ * Returns 1 if such opcode is found, 0 otherwise.
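The resize policy in the timer path above grows the table only once states outnumber buckets two to one, picking the smallest power of two that covers the state count and capping it at 64k. The same decision isolated as a sketch (our naming, mirroring the logic rather than quoting it):

    /* Return a new bucket count, or 0 if no resize is warranted. */
    static int
    pick_buckets(int cur, int count, int max)
    {
            int nb = cur;

            if (count <= cur * 2 || count >= max)
                    return (0);
            while (nb < count && nb < max)
                    nb *= 2;        /* smallest power of 2 >= count */
            return (nb < max ? nb : max);
    }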
+ */ +int +ipfw_is_dyn_rule(struct ip_fw *rule) +{ + int cmdlen, l; + ipfw_insn *cmd; + + l = rule->cmd_len; + cmd = rule->cmd; + cmdlen = 0; + for ( ; l > 0 ; l -= cmdlen, cmd += cmdlen) { + cmdlen = F_LEN(cmd); + + switch (cmd->opcode) { + case O_LIMIT: + case O_KEEP_STATE: + case O_PROBE_STATE: + case O_CHECK_STATE: + return (1); + } + } + + return (0); +} + +void +ipfw_dyn_init(struct ip_fw_chain *chain) +{ + + V_ipfw_dyn_v = NULL; + V_dyn_buckets_max = 256; /* must be power of 2 */ + V_curr_dyn_buckets = 256; /* must be power of 2 */ + + V_dyn_ack_lifetime = 300; + V_dyn_syn_lifetime = 20; + V_dyn_fin_lifetime = 1; + V_dyn_rst_lifetime = 1; + V_dyn_udp_lifetime = 10; + V_dyn_short_lifetime = 5; + + V_dyn_keepalive_interval = 20; + V_dyn_keepalive_period = 5; + V_dyn_keepalive = 1; /* do send keepalives */ + V_dyn_keepalive_last = time_uptime; + + V_dyn_max = 4096; /* max # of dynamic rules */ + + V_ipfw_dyn_rule_zone = uma_zcreate("IPFW dynamic rule", + sizeof(ipfw_dyn_rule), NULL, NULL, NULL, NULL, + UMA_ALIGN_PTR, 0); + + /* Enforce limit on dynamic rules */ + uma_zone_set_max(V_ipfw_dyn_rule_zone, V_dyn_max); + + callout_init(&V_ipfw_timeout, CALLOUT_MPSAFE); + + /* + * This can potentially be done on first dynamic rule + * being added to chain. + */ + resize_dynamic_table(chain, V_curr_dyn_buckets); +} + +void +ipfw_dyn_uninit(int pass) +{ + int i; + + if (pass == 0) { + callout_drain(&V_ipfw_timeout); + return; + } + + if (V_ipfw_dyn_v != NULL) { + /* + * Skip deleting all dynamic states - + * uma_zdestroy() does this more efficiently; + */ + + /* Destroy all mutexes */ + for (i = 0 ; i < V_curr_dyn_buckets ; i++) + IPFW_BUCK_LOCK_DESTROY(&V_ipfw_dyn_v[i]); + free(V_ipfw_dyn_v, M_IPFW); + V_ipfw_dyn_v = NULL; + } + + uma_zdestroy(V_ipfw_dyn_rule_zone); +} + +#ifdef SYSCTL_NODE +/* + * Get/set maximum number of dynamic states in given VNET instance. + */ +static int +sysctl_ipfw_dyn_max(SYSCTL_HANDLER_ARGS) +{ + int error; + unsigned int nstates; + + nstates = V_dyn_max; + + error = sysctl_handle_int(oidp, &nstates, 0, req); + /* Read operation or some error */ + if ((error != 0) || (req->newptr == NULL)) + return (error); + + V_dyn_max = nstates; + uma_zone_set_max(V_ipfw_dyn_rule_zone, V_dyn_max); + + return (0); +} + +/* + * Get current number of dynamic states in given VNET instance. + */ +static int +sysctl_ipfw_dyn_count(SYSCTL_HANDLER_ARGS) +{ + int error; + unsigned int nstates; + + nstates = DYN_COUNT; + + error = sysctl_handle_int(oidp, &nstates, 0, req); + + return (error); +} +#endif + +/* + * Returns size of dynamic states in legacy format + */ +int +ipfw_dyn_len(void) +{ + + return (V_ipfw_dyn_v == NULL) ? 0 : + (DYN_COUNT * sizeof(ipfw_dyn_rule)); +} + +/* + * Returns number of dynamic states. + * Used by dump format v1 (current). + */ +int +ipfw_dyn_get_count(void) +{ + + return (V_ipfw_dyn_v == NULL) ? 0 : DYN_COUNT; +} + +static void +export_dyn_rule(ipfw_dyn_rule *src, ipfw_dyn_rule *dst) +{ + + memcpy(dst, src, sizeof(*src)); + memcpy(&(dst->rule), &(src->rule->rulenum), sizeof(src->rule->rulenum)); + /* + * store set number into high word of + * dst->rule pointer. + */ + memcpy((char *)&dst->rule + sizeof(src->rule->rulenum), + &(src->rule->set), sizeof(src->rule->set)); + /* + * store a non-null value in "next". + * The userland code will interpret a + * NULL here as a marker + * for the last dynamic rule. + */ + memcpy(&dst->next, &dst, sizeof(dst)); + dst->expire = + TIME_LEQ(dst->expire, time_uptime) ? 
0 : dst->expire - time_uptime; +} + +/* + * Fills int buffer given by @sd with dynamic states. + * Used by dump format v1 (current). + * + * Returns 0 on success. + */ +int +ipfw_dump_states(struct ip_fw_chain *chain, struct sockopt_data *sd) +{ + ipfw_dyn_rule *p; + ipfw_obj_dyntlv *dst, *last; + ipfw_obj_ctlv *ctlv; + int i; + size_t sz; + + if (V_ipfw_dyn_v == NULL) + return (0); + + IPFW_UH_RLOCK_ASSERT(chain); + + ctlv = (ipfw_obj_ctlv *)ipfw_get_sopt_space(sd, sizeof(*ctlv)); + if (ctlv == NULL) + return (ENOMEM); + sz = sizeof(ipfw_obj_dyntlv); + ctlv->head.type = IPFW_TLV_DYNSTATE_LIST; + ctlv->objsize = sz; + last = NULL; + + for (i = 0 ; i < V_curr_dyn_buckets; i++) { + IPFW_BUCK_LOCK(i); + for (p = V_ipfw_dyn_v[i].head ; p != NULL; p = p->next) { + dst = (ipfw_obj_dyntlv *)ipfw_get_sopt_space(sd, sz); + if (dst == NULL) { + IPFW_BUCK_UNLOCK(i); + return (ENOMEM); + } + + export_dyn_rule(p, &dst->state); + dst->head.length = sz; + dst->head.type = IPFW_TLV_DYN_ENT; + last = dst; + } + IPFW_BUCK_UNLOCK(i); + } + + if (last != NULL) /* mark last dynamic rule */ + last->head.flags = IPFW_DF_LAST; + + return (0); +} + +/* + * Fill given buffer with dynamic states (legacy format). + * IPFW_UH_RLOCK has to be held while calling. + */ +void +ipfw_get_dynamic(struct ip_fw_chain *chain, char **pbp, const char *ep) +{ + ipfw_dyn_rule *p, *last = NULL; + char *bp; + int i; + + if (V_ipfw_dyn_v == NULL) + return; + bp = *pbp; + + IPFW_UH_RLOCK_ASSERT(chain); + + for (i = 0 ; i < V_curr_dyn_buckets; i++) { + IPFW_BUCK_LOCK(i); + for (p = V_ipfw_dyn_v[i].head ; p != NULL; p = p->next) { + if (bp + sizeof *p <= ep) { + ipfw_dyn_rule *dst = + (ipfw_dyn_rule *)bp; + + export_dyn_rule(p, dst); + last = dst; + bp += sizeof(ipfw_dyn_rule); + } + } + IPFW_BUCK_UNLOCK(i); + } + + if (last != NULL) /* mark last dynamic rule */ + bzero(&last->next, sizeof(last)); + *pbp = bp; +} +/* end of file */ diff --git a/example/ipfw/sys/netpfil/ipfw/ip_fw_iface.c b/example/ipfw/sys/netpfil/ipfw/ip_fw_iface.c new file mode 100644 index 0000000..7e9c992 --- /dev/null +++ b/example/ipfw/sys/netpfil/ipfw/ip_fw_iface.c @@ -0,0 +1,537 @@ +/*- + * Copyright (c) 2014 Yandex LLC. + * + * Redistribution and use in source and binary forms, with or without + * modification, are permitted provided that the following conditions + * are met: + * 1. Redistributions of source code must retain the above copyright + * notice, this list of conditions and the following disclaimer. + * 2. Redistributions in binary form must reproduce the above copyright + * notice, this list of conditions and the following disclaimer in the + * documentation and/or other materials provided with the distribution. + * + * THIS SOFTWARE IS PROVIDED BY THE AUTHOR AND CONTRIBUTORS ``AS IS'' AND + * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE + * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE + * ARE DISCLAIMED. IN NO EVENT SHALL THE AUTHOR OR CONTRIBUTORS BE LIABLE + * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL + * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS + * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) + * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT + * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY + * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF + * SUCH DAMAGE. 
+ */ + +#include +__FBSDID("$FreeBSD: projects/ipfw/sys/netpfil/ipfw/ip_fw_iface.c 267384 2014-06-12 09:59:11Z melifaro $"); + +/* + * Kernel interface tracking API. + * + */ + +#include "opt_ipfw.h" +#include "opt_inet.h" +#ifndef INET +#error IPFIREWALL requires INET. +#endif /* INET */ +#include "opt_inet6.h" + +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include + +#include +#include /* struct ipfw_rule_ref */ +#include + +#include + +#define CHAIN_TO_II(ch) ((struct namedobj_instance *)ch->ifcfg) + +#define DEFAULT_IFACES 128 + +static void handle_ifdetach(struct ip_fw_chain *ch, struct ipfw_iface *iif, + uint16_t ifindex); +static void handle_ifattach(struct ip_fw_chain *ch, struct ipfw_iface *iif, + uint16_t ifindex); +static int list_ifaces(struct ip_fw_chain *ch, ip_fw3_opheader *op3, + struct sockopt_data *sd); + +static struct ipfw_sopt_handler scodes[] = { + { IP_FW_XIFLIST, 0, HDIR_GET, list_ifaces }, +}; + +/* + * FreeBSD Kernel interface. + */ +static void ipfw_kifhandler(void *arg, struct ifnet *ifp); +static int ipfw_kiflookup(char *name); +static void iface_khandler_register(void); +static void iface_khandler_deregister(void); + +static eventhandler_tag ipfw_ifdetach_event, ipfw_ifattach_event; +static int num_vnets = 0; +static struct mtx vnet_mtx; + +/* + * Checks if kernel interface is contained in our tracked + * interface list and calls attach/detach handler. + */ +static void +ipfw_kifhandler(void *arg, struct ifnet *ifp) +{ + struct ip_fw_chain *ch; + struct ipfw_iface *iif; + struct namedobj_instance *ii; + uintptr_t htype; + + if (V_ipfw_vnet_ready == 0) + return; + + ch = &V_layer3_chain; + htype = (uintptr_t)arg; + + IPFW_UH_WLOCK(ch); + ii = CHAIN_TO_II(ch); + if (ii == NULL) { + IPFW_UH_WUNLOCK(ch); + return; + } + iif = (struct ipfw_iface*)ipfw_objhash_lookup_name(ii, 0, + if_name(ifp)); + if (iif != NULL) { + if (htype == 1) + handle_ifattach(ch, iif, ifp->if_index); + else + handle_ifdetach(ch, iif, ifp->if_index); + } + IPFW_UH_WUNLOCK(ch); +} + +/* + * Reference current VNET as iface tracking API user. + * Registers interface tracking handlers for first VNET. + */ +static void +iface_khandler_register() +{ + int create; + + create = 0; + + mtx_lock(&vnet_mtx); + if (num_vnets == 0) + create = 1; + num_vnets++; + mtx_unlock(&vnet_mtx); + + if (create == 0) + return; + + printf("IPFW: starting up interface tracker\n"); + + ipfw_ifdetach_event = EVENTHANDLER_REGISTER( + ifnet_departure_event, ipfw_kifhandler, NULL, + EVENTHANDLER_PRI_ANY); + ipfw_ifattach_event = EVENTHANDLER_REGISTER( + ifnet_arrival_event, ipfw_kifhandler, (void*)((uintptr_t)1), + EVENTHANDLER_PRI_ANY); +} + +/* + * + * Detach interface event handlers on last VNET instance + * detach. + */ +static void +iface_khandler_deregister() +{ + int destroy; + + destroy = 0; + mtx_lock(&vnet_mtx); + if (num_vnets == 1) + destroy = 1; + num_vnets--; + mtx_unlock(&vnet_mtx); + + if (destroy == 0) + return; + + EVENTHANDLER_DEREGISTER(ifnet_arrival_event, + ipfw_ifattach_event); + EVENTHANDLER_DEREGISTER(ifnet_departure_event, + ipfw_ifdetach_event); +} + +/* + * Retrieves ifindex for given @name. + * + * Returns ifindex or 0. + */ +static int +ipfw_kiflookup(char *name) +{ + struct ifnet *ifp; + int ifindex; + + ifindex = 0; + + if ((ifp = ifunit_ref(name)) != NULL) { + ifindex = ifp->if_index; + if_rele(ifp); + } + + return (ifindex); +} + +/* + * Global ipfw startup hook. 
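iface_khandler_register() and iface_khandler_deregister() above arm the global ifnet event handlers on the first VNET reference and tear them down on the last, keeping only a counter under a private mutex and doing the (de)registration itself outside it. The pattern in miniature as a userspace sketch; register_handlers() and unregister_handlers() are hypothetical stand-ins, not kernel calls:

    #include <pthread.h>

    static int num_users;
    static pthread_mutex_t users_mtx = PTHREAD_MUTEX_INITIALIZER;

    static void register_handlers(void) { /* hook up events here */ }
    static void unregister_handlers(void) { /* and tear them down */ }

    static void
    tracker_ref(void)
    {
            int first;

            pthread_mutex_lock(&users_mtx);
            first = (num_users++ == 0);
            pthread_mutex_unlock(&users_mtx);
            if (first)
                    register_handlers();    /* outside the mutex, as above */
    }

    static void
    tracker_unref(void)
    {
            int last;

            pthread_mutex_lock(&users_mtx);
            last = (--num_users == 0);
            pthread_mutex_unlock(&users_mtx);
            if (last)
                    unregister_handlers();
    }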
+ * Since we perform lazy initialization, do nothing except + * mutex init. + */ +int +ipfw_iface_init() +{ + + mtx_init(&vnet_mtx, "IPFW ifhandler mtx", NULL, MTX_DEF); + IPFW_ADD_SOPT_HANDLER(1, scodes); + return (0); +} + +/* + * Global ipfw destroy hook. + * Unregister khandlers iff init has been done. + */ +void +ipfw_iface_destroy() +{ + + IPFW_DEL_SOPT_HANDLER(1, scodes); + mtx_destroy(&vnet_mtx); +} + +/* + * Perform actual init on internal request. + * Inits both namehash and global khandler. + */ +static void +vnet_ipfw_iface_init(struct ip_fw_chain *ch) +{ + struct namedobj_instance *ii; + + ii = ipfw_objhash_create(DEFAULT_IFACES); + IPFW_UH_WLOCK(ch); + if (ch->ifcfg == NULL) { + ch->ifcfg = ii; + ii = NULL; + } + IPFW_UH_WUNLOCK(ch); + + if (ii != NULL) { + /* Already initialized. Free namehash. */ + ipfw_objhash_destroy(ii); + } else { + /* We're the first ones. Init kernel hooks. */ + iface_khandler_register(); + } +} + +static void +destroy_iface(struct namedobj_instance *ii, struct named_object *no, + void *arg) +{ + + /* Assume all consumers have been already detached */ + free(no, M_IPFW); +} + +/* + * Per-VNET ipfw detach hook. + * + */ +void +vnet_ipfw_iface_destroy(struct ip_fw_chain *ch) +{ + struct namedobj_instance *ii; + + IPFW_UH_WLOCK(ch); + ii = CHAIN_TO_II(ch); + ch->ifcfg = NULL; + IPFW_UH_WUNLOCK(ch); + + if (ii != NULL) { + ipfw_objhash_foreach(ii, destroy_iface, ch); + ipfw_objhash_destroy(ii); + iface_khandler_deregister(); + } +} + +/* + * Notify the subsystem that we are interested in tracking + * interface @name. This function has to be called without + * holding any locks to permit allocating the necessary states + * for proper interface tracking. + * + * Returns 0 on success. + */ +int +ipfw_iface_ref(struct ip_fw_chain *ch, char *name, + struct ipfw_ifc *ic) +{ + struct namedobj_instance *ii; + struct ipfw_iface *iif, *tmp; + + if (strlen(name) >= sizeof(iif->ifname)) + return (EINVAL); + + IPFW_UH_WLOCK(ch); + + ii = CHAIN_TO_II(ch); + if (ii == NULL) { + + /* + * First request to subsystem. + * Let's perform init. + */ + IPFW_UH_WUNLOCK(ch); + vnet_ipfw_iface_init(ch); + IPFW_UH_WLOCK(ch); + ii = CHAIN_TO_II(ch); + } + + iif = (struct ipfw_iface *)ipfw_objhash_lookup_name(ii, 0, name); + + if (iif != NULL) { + iif->no.refcnt++; + ic->iface = iif; + IPFW_UH_WUNLOCK(ch); + return (0); + } + + IPFW_UH_WUNLOCK(ch); + + /* Not found. Let's create one */ + iif = malloc(sizeof(struct ipfw_iface), M_IPFW, M_WAITOK | M_ZERO); + TAILQ_INIT(&iif->consumers); + iif->no.name = iif->ifname; + strlcpy(iif->ifname, name, sizeof(iif->ifname)); + + /* + * Ref & link to the list. + * + * We assume ifnet_arrival_event / ifnet_departure_event + * are not holding any locks. + */ + iif->no.refcnt = 1; + IPFW_UH_WLOCK(ch); + + tmp = (struct ipfw_iface *)ipfw_objhash_lookup_name(ii, 0, name); + if (tmp != NULL) { + /* Interface has been created since unlock. Ref and return */ + tmp->no.refcnt++; + ic->iface = tmp; + IPFW_UH_WUNLOCK(ch); + free(iif, M_IPFW); + return (0); + } + + iif->ifindex = ipfw_kiflookup(name); + if (iif->ifindex != 0) + iif->resolved = 1; + + ipfw_objhash_add(ii, &iif->no); + ic->iface = iif; + + IPFW_UH_WUNLOCK(ch); + + return (0); +} + +/* + * Adds @ic to the list of iif interface consumers. + * Must be called with holding both UH+WLOCK. + * Callback may be immediately called (if interface exists). 
+ */ +void +ipfw_iface_add_notify(struct ip_fw_chain *ch, struct ipfw_ifc *ic) +{ + struct ipfw_iface *iif; + + IPFW_UH_WLOCK_ASSERT(ch); + IPFW_WLOCK_ASSERT(ch); + + iif = ic->iface; + + TAILQ_INSERT_TAIL(&iif->consumers, ic, next); + if (iif->resolved != 0) + ic->cb(ch, ic->cbdata, iif->ifindex); +} + +/* + * Unlinks interface tracker object @ic from interface. + * Must be called while holding UH lock. + */ +void +ipfw_iface_del_notify(struct ip_fw_chain *ch, struct ipfw_ifc *ic) +{ + struct ipfw_iface *iif; + + IPFW_UH_WLOCK_ASSERT(ch); + + iif = ic->iface; + TAILQ_REMOVE(&iif->consumers, ic, next); +} + +/* + * Unreference interface specified by @ic. + * Must be called without holding any locks. + */ +void +ipfw_iface_unref(struct ip_fw_chain *ch, struct ipfw_ifc *ic) +{ + struct ipfw_iface *iif; + + iif = ic->iface; + ic->iface = NULL; + + IPFW_UH_WLOCK(ch); + iif->no.refcnt--; + /* TODO: check for references & delete */ + IPFW_UH_WUNLOCK(ch); +} + +/* + * Interface arrival handler. + */ +static void +handle_ifattach(struct ip_fw_chain *ch, struct ipfw_iface *iif, + uint16_t ifindex) +{ + struct ipfw_ifc *ic; + + IPFW_UH_WLOCK_ASSERT(ch); + + iif->gencnt++; + iif->resolved = 1; + iif->ifindex = ifindex; + + IPFW_WLOCK(ch); + TAILQ_FOREACH(ic, &iif->consumers, next) + ic->cb(ch, ic->cbdata, iif->ifindex); + IPFW_WUNLOCK(ch); +} + +/* + * Interface departure handler. + */ +static void +handle_ifdetach(struct ip_fw_chain *ch, struct ipfw_iface *iif, + uint16_t ifindex) +{ + struct ipfw_ifc *ic; + + IPFW_UH_WLOCK_ASSERT(ch); + + IPFW_WLOCK(ch); + TAILQ_FOREACH(ic, &iif->consumers, next) + ic->cb(ch, ic->cbdata, 0); + IPFW_WUNLOCK(ch); + + iif->gencnt++; + iif->resolved = 0; + iif->ifindex = 0; +} + +struct dump_iface_args { + struct ip_fw_chain *ch; + struct sockopt_data *sd; +}; + +static void +export_iface_internal(struct namedobj_instance *ii, struct named_object *no, + void *arg) +{ + ipfw_iface_info *i; + struct dump_iface_args *da; + struct ipfw_iface *iif; + + da = (struct dump_iface_args *)arg; + + i = (ipfw_iface_info *)ipfw_get_sopt_space(da->sd, sizeof(*i)); + KASSERT(i != 0, ("previously checked buffer is not enough")); + + iif = (struct ipfw_iface *)no; + + strlcpy(i->ifname, iif->ifname, sizeof(i->ifname)); + if (iif->resolved) + i->flags |= IPFW_IFFLAG_RESOLVED; + i->ifindex = iif->ifindex; + i->refcnt = iif->no.refcnt; + i->gencnt = iif->gencnt; +} + +/* + * Lists all interface currently tracked by ipfw. 
+ * Data layout (v0)(current): + * Request: [ ipfw_obj_lheader ], size = ipfw_obj_lheader.size + * Reply: [ ipfw_obj_lheader ipfw_iface_info x N ] + * + * Returns 0 on success. + */ +static int +list_ifaces(struct ip_fw_chain *ch, ip_fw3_opheader *op3, + struct sockopt_data *sd) +{ + struct namedobj_instance *ii; + struct _ipfw_obj_lheader *olh; + struct dump_iface_args da; + uint32_t count, size; + + olh = (struct _ipfw_obj_lheader *)ipfw_get_sopt_header(sd,sizeof(*olh)); + if (olh == NULL) + return (EINVAL); + if (sd->valsize < olh->size) + return (EINVAL); + + IPFW_UH_RLOCK(ch); + ii = CHAIN_TO_II(ch); + if (ii != NULL) + count = ipfw_objhash_count(ii); + else + count = 0; + size = count * sizeof(ipfw_iface_info) + sizeof(ipfw_obj_lheader); + + /* Fill in header regardless of buffer size */ + olh->count = count; + olh->objsize = sizeof(ipfw_iface_info); + + if (size > olh->size) { + olh->size = size; + IPFW_UH_RUNLOCK(ch); + return (ENOMEM); + } + olh->size = size; + + da.ch = ch; + da.sd = sd; + + if (ii != NULL) + ipfw_objhash_foreach(ii, export_iface_internal, &da); + IPFW_UH_RUNLOCK(ch); + + return (0); +} + diff --git a/example/ipfw/sys/netpfil/ipfw/ip_fw_log.c b/example/ipfw/sys/netpfil/ipfw/ip_fw_log.c new file mode 100644 index 0000000..cbbd875 --- /dev/null +++ b/example/ipfw/sys/netpfil/ipfw/ip_fw_log.c @@ -0,0 +1,567 @@ +/*- + * Copyright (c) 2002-2009 Luigi Rizzo, Universita` di Pisa + * + * Redistribution and use in source and binary forms, with or without + * modification, are permitted provided that the following conditions + * are met: + * 1. Redistributions of source code must retain the above copyright + * notice, this list of conditions and the following disclaimer. + * 2. Redistributions in binary form must reproduce the above copyright + * notice, this list of conditions and the following disclaimer in the + * documentation and/or other materials provided with the distribution. + * + * THIS SOFTWARE IS PROVIDED BY THE AUTHOR AND CONTRIBUTORS ``AS IS'' AND + * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE + * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE + * ARE DISCLAIMED. IN NO EVENT SHALL THE AUTHOR OR CONTRIBUTORS BE LIABLE + * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL + * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS + * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) + * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT + * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY + * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF + * SUCH DAMAGE. + */ + +#include +__FBSDID("$FreeBSD: head/sys/netpfil/ipfw/ip_fw_log.c 272840 2014-10-09 19:32:35Z melifaro $"); + +/* + * Logging support for ipfw + */ + +#include "opt_ipfw.h" +#include "opt_inet.h" +#ifndef INET +#error IPFIREWALL requires INET.
+#endif /* INET */ +#include "opt_inet6.h" + +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include /* for ETHERTYPE_IP */ +#include +#include +#include +#include +#include /* for IFT_PFLOG */ +#include /* for BPF */ + +#include +#include +#include +#include +#include +#include +#include + +#include +#include +#ifdef INET6 +#include /* ip6_sprintf() */ +#endif + +#include + +#ifdef MAC +#include +#endif + +/* + * L3HDR maps an ipv4 pointer into a layer3 header pointer of type T + * Other macros just cast void * into the appropriate type + */ +#define L3HDR(T, ip) ((T *)((u_int32_t *)(ip) + (ip)->ip_hl)) +#define TCP(p) ((struct tcphdr *)(p)) +#define SCTP(p) ((struct sctphdr *)(p)) +#define UDP(p) ((struct udphdr *)(p)) +#define ICMP(p) ((struct icmphdr *)(p)) +#define ICMP6(p) ((struct icmp6_hdr *)(p)) + +#ifdef __APPLE__ +#undef snprintf +#define snprintf sprintf +#define SNPARGS(buf, len) buf + len +#define SNP(buf) buf +#else /* !__APPLE__ */ +#define SNPARGS(buf, len) buf + len, sizeof(buf) > len ? sizeof(buf) - len : 0 +#define SNP(buf) buf, sizeof(buf) +#endif /* !__APPLE__ */ + +#ifdef WITHOUT_BPF +void +ipfw_log_bpf(int onoff) +{ +} +#else /* !WITHOUT_BPF */ +static struct ifnet *log_if; /* hook to attach to bpf */ +static struct rwlock log_if_lock; +#define LOGIF_LOCK_INIT(x) rw_init(&log_if_lock, "ipfw log_if lock") +#define LOGIF_LOCK_DESTROY(x) rw_destroy(&log_if_lock) +#define LOGIF_RLOCK(x) rw_rlock(&log_if_lock) +#define LOGIF_RUNLOCK(x) rw_runlock(&log_if_lock) +#define LOGIF_WLOCK(x) rw_wlock(&log_if_lock) +#define LOGIF_WUNLOCK(x) rw_wunlock(&log_if_lock) + +static const char ipfwname[] = "ipfw"; + +/* we use this dummy function for all ifnet callbacks */ +static int +log_dummy(struct ifnet *ifp, u_long cmd, caddr_t addr) +{ + return EINVAL; +} + +static int +ipfw_log_output(struct ifnet *ifp, struct mbuf *m, + const struct sockaddr *dst, struct route *ro) +{ + if (m != NULL) + FREE_PKT(m); + return EINVAL; +} + +static void +ipfw_log_start(struct ifnet* ifp) +{ + panic("ipfw_log_start() must not be called"); +} + +static const u_char ipfwbroadcastaddr[6] = + { 0xff, 0xff, 0xff, 0xff, 0xff, 0xff }; + +static int +ipfw_log_clone_match(struct if_clone *ifc, const char *name) +{ + + return (strncmp(name, ipfwname, sizeof(ipfwname) - 1) == 0); +} + +static int +ipfw_log_clone_create(struct if_clone *ifc, char *name, size_t len, + caddr_t params) +{ + int error; + int unit; + struct ifnet *ifp; + + error = ifc_name2unit(name, &unit); + if (error) + return (error); + + error = ifc_alloc_unit(ifc, &unit); + if (error) + return (error); + + ifp = if_alloc(IFT_PFLOG); + if (ifp == NULL) { + ifc_free_unit(ifc, unit); + return (ENOSPC); + } + ifp->if_dname = ipfwname; + ifp->if_dunit = unit; + snprintf(ifp->if_xname, IFNAMSIZ, "%s%d", ipfwname, unit); + strlcpy(name, ifp->if_xname, len); + ifp->if_mtu = 65536; + ifp->if_flags = IFF_UP | IFF_SIMPLEX | IFF_MULTICAST; + ifp->if_init = (void *)log_dummy; + ifp->if_ioctl = log_dummy; + ifp->if_start = ipfw_log_start; + ifp->if_output = ipfw_log_output; + ifp->if_addrlen = 6; + ifp->if_hdrlen = 14; + ifp->if_broadcastaddr = ipfwbroadcastaddr; + ifp->if_baudrate = IF_Mbps(10); + + LOGIF_WLOCK(); + if (log_if == NULL) + log_if = ifp; + else { + LOGIF_WUNLOCK(); + if_free(ifp); + ifc_free_unit(ifc, unit); + return (EEXIST); + } + LOGIF_WUNLOCK(); + if_attach(ifp); + bpfattach(ifp, DLT_EN10MB, 14); + + return (0); +} + +static int +ipfw_log_clone_destroy(struct if_clone *ifc, struct ifnet *ifp) +{ 
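+ /*
+ * Teardown mirrors ipfw_log_clone_create() above: clear the
+ * log_if hook under the write lock first, then detach the
+ * interface from BPF and the ifnet layer, and finally return
+ * the unit number to the cloner.
+ */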
+ int unit; + + if (ifp == NULL) + return (0); + + LOGIF_WLOCK(); + if (log_if != NULL && ifp == log_if) + log_if = NULL; + else { + LOGIF_WUNLOCK(); + return (EINVAL); + } + LOGIF_WUNLOCK(); + + unit = ifp->if_dunit; + bpfdetach(ifp); + if_detach(ifp); + if_free(ifp); + ifc_free_unit(ifc, unit); + + return (0); +} + +static struct if_clone *ipfw_log_cloner; + +void +ipfw_log_bpf(int onoff) +{ + + if (onoff) { + LOGIF_LOCK_INIT(); + ipfw_log_cloner = if_clone_advanced(ipfwname, 0, + ipfw_log_clone_match, ipfw_log_clone_create, + ipfw_log_clone_destroy); + } else { + if_clone_detach(ipfw_log_cloner); + LOGIF_LOCK_DESTROY(); + } +} +#endif /* !WITHOUT_BPF */ + +#define TARG(k, f) IP_FW_ARG_TABLEARG(chain, k, f) +/* + * We enter here when we have a rule with O_LOG. + * XXX this function alone takes about 2Kbytes of code! + */ +void +ipfw_log(struct ip_fw_chain *chain, struct ip_fw *f, u_int hlen, + struct ip_fw_args *args, struct mbuf *m, struct ifnet *oif, + u_short offset, uint32_t tablearg, struct ip *ip) +{ + char *action; + int limit_reached = 0; + char action2[92], proto[128], fragment[32]; + + if (V_fw_verbose == 0) { +#ifndef WITHOUT_BPF + LOGIF_RLOCK(); + if (log_if == NULL || log_if->if_bpf == NULL) { + LOGIF_RUNLOCK(); + return; + } + + if (args->eh) /* layer2, use orig hdr */ + BPF_MTAP2(log_if, args->eh, ETHER_HDR_LEN, m); + else { + /* Add fake header. Later we will store + * more info in the header. + */ + if (ip->ip_v == 4) + BPF_MTAP2(log_if, "DDDDDDSSSSSS\x08\x00", ETHER_HDR_LEN, m); + else if (ip->ip_v == 6) + BPF_MTAP2(log_if, "DDDDDDSSSSSS\x86\xdd", ETHER_HDR_LEN, m); + else + /* Obviously bogus EtherType. */ + BPF_MTAP2(log_if, "DDDDDDSSSSSS\xff\xff", ETHER_HDR_LEN, m); + } + LOGIF_RUNLOCK(); +#endif /* !WITHOUT_BPF */ + return; + } + /* the old 'log' function */ + fragment[0] = '\0'; + proto[0] = '\0'; + + if (f == NULL) { /* bogus pkt */ + if (V_verbose_limit != 0 && V_norule_counter >= V_verbose_limit) + return; + V_norule_counter++; + if (V_norule_counter == V_verbose_limit) + limit_reached = V_verbose_limit; + action = "Refuse"; + } else { /* O_LOG is the first action, find the real one */ + ipfw_insn *cmd = ACTION_PTR(f); + ipfw_insn_log *l = (ipfw_insn_log *)cmd; + + if (l->max_log != 0 && l->log_left == 0) + return; + l->log_left--; + if (l->log_left == 0) + limit_reached = l->max_log; + cmd += F_LEN(cmd); /* point to first action */ + if (cmd->opcode == O_ALTQ) { + ipfw_insn_altq *altq = (ipfw_insn_altq *)cmd; + + snprintf(SNPARGS(action2, 0), "Altq %d", + altq->qid); + cmd += F_LEN(cmd); + } + if (cmd->opcode == O_PROB || cmd->opcode == O_TAG || + cmd->opcode == O_SETDSCP) + cmd += F_LEN(cmd); + + action = action2; + switch (cmd->opcode) { + case O_DENY: + action = "Deny"; + break; + + case O_REJECT: + if (cmd->arg1==ICMP_REJECT_RST) + action = "Reset"; + else if (cmd->arg1==ICMP_UNREACH_HOST) + action = "Reject"; + else + snprintf(SNPARGS(action2, 0), "Unreach %d", + cmd->arg1); + break; + + case O_UNREACH6: + if (cmd->arg1==ICMP6_UNREACH_RST) + action = "Reset"; + else + snprintf(SNPARGS(action2, 0), "Unreach %d", + cmd->arg1); + break; + + case O_ACCEPT: + action = "Accept"; + break; + case O_COUNT: + action = "Count"; + break; + case O_DIVERT: + snprintf(SNPARGS(action2, 0), "Divert %d", + TARG(cmd->arg1, divert)); + break; + case O_TEE: + snprintf(SNPARGS(action2, 0), "Tee %d", + TARG(cmd->arg1, divert)); + break; + case O_SETFIB: + snprintf(SNPARGS(action2, 0), "SetFib %d", + TARG(cmd->arg1, fib)); + break; + case O_SKIPTO: + snprintf(SNPARGS(action2, 0), 
"SkipTo %d", + TARG(cmd->arg1, skipto)); + break; + case O_PIPE: + snprintf(SNPARGS(action2, 0), "Pipe %d", + TARG(cmd->arg1, pipe)); + break; + case O_QUEUE: + snprintf(SNPARGS(action2, 0), "Queue %d", + TARG(cmd->arg1, pipe)); + break; + case O_FORWARD_IP: { + ipfw_insn_sa *sa = (ipfw_insn_sa *)cmd; + int len; + struct in_addr dummyaddr; + if (sa->sa.sin_addr.s_addr == INADDR_ANY) + dummyaddr.s_addr = htonl(tablearg); + else + dummyaddr.s_addr = sa->sa.sin_addr.s_addr; + + len = snprintf(SNPARGS(action2, 0), "Forward to %s", + inet_ntoa(dummyaddr)); + + if (sa->sa.sin_port) + snprintf(SNPARGS(action2, len), ":%d", + sa->sa.sin_port); + } + break; +#ifdef INET6 + case O_FORWARD_IP6: { + char buf[INET6_ADDRSTRLEN]; + ipfw_insn_sa6 *sa = (ipfw_insn_sa6 *)cmd; + int len; + + len = snprintf(SNPARGS(action2, 0), "Forward to [%s]", + ip6_sprintf(buf, &sa->sa.sin6_addr)); + + if (sa->sa.sin6_port) + snprintf(SNPARGS(action2, len), ":%u", + sa->sa.sin6_port); + } + break; +#endif + case O_NETGRAPH: + snprintf(SNPARGS(action2, 0), "Netgraph %d", + cmd->arg1); + break; + case O_NGTEE: + snprintf(SNPARGS(action2, 0), "Ngtee %d", + cmd->arg1); + break; + case O_NAT: + action = "Nat"; + break; + case O_REASS: + action = "Reass"; + break; + case O_CALLRETURN: + if (cmd->len & F_NOT) + action = "Return"; + else + snprintf(SNPARGS(action2, 0), "Call %d", + cmd->arg1); + break; + default: + action = "UNKNOWN"; + break; + } + } + + if (hlen == 0) { /* non-ip */ + snprintf(SNPARGS(proto, 0), "MAC"); + + } else { + int len; +#ifdef INET6 + char src[INET6_ADDRSTRLEN + 2], dst[INET6_ADDRSTRLEN + 2]; +#else + char src[INET_ADDRSTRLEN], dst[INET_ADDRSTRLEN]; +#endif + struct icmphdr *icmp; + struct tcphdr *tcp; + struct udphdr *udp; +#ifdef INET6 + struct ip6_hdr *ip6 = NULL; + struct icmp6_hdr *icmp6; + u_short ip6f_mf; +#endif + src[0] = '\0'; + dst[0] = '\0'; +#ifdef INET6 + ip6f_mf = offset & IP6F_MORE_FRAG; + offset &= IP6F_OFF_MASK; + + if (IS_IP6_FLOW_ID(&(args->f_id))) { + char ip6buf[INET6_ADDRSTRLEN]; + snprintf(src, sizeof(src), "[%s]", + ip6_sprintf(ip6buf, &args->f_id.src_ip6)); + snprintf(dst, sizeof(dst), "[%s]", + ip6_sprintf(ip6buf, &args->f_id.dst_ip6)); + + ip6 = (struct ip6_hdr *)ip; + tcp = (struct tcphdr *)(((char *)ip) + hlen); + udp = (struct udphdr *)(((char *)ip) + hlen); + } else +#endif + { + tcp = L3HDR(struct tcphdr, ip); + udp = L3HDR(struct udphdr, ip); + + inet_ntop(AF_INET, &ip->ip_src, src, sizeof(src)); + inet_ntop(AF_INET, &ip->ip_dst, dst, sizeof(dst)); + } + + switch (args->f_id.proto) { + case IPPROTO_TCP: + len = snprintf(SNPARGS(proto, 0), "TCP %s", src); + if (offset == 0) + snprintf(SNPARGS(proto, len), ":%d %s:%d", + ntohs(tcp->th_sport), + dst, + ntohs(tcp->th_dport)); + else + snprintf(SNPARGS(proto, len), " %s", dst); + break; + + case IPPROTO_UDP: + len = snprintf(SNPARGS(proto, 0), "UDP %s", src); + if (offset == 0) + snprintf(SNPARGS(proto, len), ":%d %s:%d", + ntohs(udp->uh_sport), + dst, + ntohs(udp->uh_dport)); + else + snprintf(SNPARGS(proto, len), " %s", dst); + break; + + case IPPROTO_ICMP: + icmp = L3HDR(struct icmphdr, ip); + if (offset == 0) + len = snprintf(SNPARGS(proto, 0), + "ICMP:%u.%u ", + icmp->icmp_type, icmp->icmp_code); + else + len = snprintf(SNPARGS(proto, 0), "ICMP "); + len += snprintf(SNPARGS(proto, len), "%s", src); + snprintf(SNPARGS(proto, len), " %s", dst); + break; +#ifdef INET6 + case IPPROTO_ICMPV6: + icmp6 = (struct icmp6_hdr *)(((char *)ip) + hlen); + if (offset == 0) + len = snprintf(SNPARGS(proto, 0), + "ICMPv6:%u.%u ", + 
icmp6->icmp6_type, icmp6->icmp6_code); + else + len = snprintf(SNPARGS(proto, 0), "ICMPv6 "); + len += snprintf(SNPARGS(proto, len), "%s", src); + snprintf(SNPARGS(proto, len), " %s", dst); + break; +#endif + default: + len = snprintf(SNPARGS(proto, 0), "P:%d %s", + args->f_id.proto, src); + snprintf(SNPARGS(proto, len), " %s", dst); + break; + } + +#ifdef INET6 + if (IS_IP6_FLOW_ID(&(args->f_id))) { + if (offset & (IP6F_OFF_MASK | IP6F_MORE_FRAG)) + snprintf(SNPARGS(fragment, 0), + " (frag %08x:%d@%d%s)", + args->f_id.extra, + ntohs(ip6->ip6_plen) - hlen, + ntohs(offset) << 3, ip6f_mf ? "+" : ""); + } else +#endif + { + int ipoff, iplen; + ipoff = ntohs(ip->ip_off); + iplen = ntohs(ip->ip_len); + if (ipoff & (IP_MF | IP_OFFMASK)) + snprintf(SNPARGS(fragment, 0), + " (frag %d:%d@%d%s)", + ntohs(ip->ip_id), iplen - (ip->ip_hl << 2), + offset << 3, + (ipoff & IP_MF) ? "+" : ""); + } + } +#ifdef __FreeBSD__ + if (oif || m->m_pkthdr.rcvif) + log(LOG_SECURITY | LOG_INFO, + "ipfw: %d %s %s %s via %s%s\n", + f ? f->rulenum : -1, + action, proto, oif ? "out" : "in", + oif ? oif->if_xname : m->m_pkthdr.rcvif->if_xname, + fragment); + else +#endif + log(LOG_SECURITY | LOG_INFO, + "ipfw: %d %s %s [no if info]%s\n", + f ? f->rulenum : -1, + action, proto, fragment); + if (limit_reached) + log(LOG_SECURITY | LOG_NOTICE, + "ipfw: limit %d reached on entry %d\n", + limit_reached, f ? f->rulenum : -1); +} +/* end of file */ diff --git a/example/ipfw/sys/netpfil/ipfw/ip_fw_pfil.c b/example/ipfw/sys/netpfil/ipfw/ip_fw_pfil.c new file mode 100644 index 0000000..f41b607 --- /dev/null +++ b/example/ipfw/sys/netpfil/ipfw/ip_fw_pfil.c @@ -0,0 +1,587 @@ +/*- + * Copyright (c) 2004 Andre Oppermann, Internet Business Solutions AG + * All rights reserved. + * + * Redistribution and use in source and binary forms, with or without + * modification, are permitted provided that the following conditions + * are met: + * 1. Redistributions of source code must retain the above copyright + * notice, this list of conditions and the following disclaimer. + * 2. Redistributions in binary form must reproduce the above copyright + * notice, this list of conditions and the following disclaimer in the + * documentation and/or other materials provided with the distribution. + * + * THIS SOFTWARE IS PROVIDED BY THE AUTHOR AND CONTRIBUTORS ``AS IS'' AND + * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE + * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE + * ARE DISCLAIMED. IN NO EVENT SHALL THE AUTHOR OR CONTRIBUTORS BE LIABLE + * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL + * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS + * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) + * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT + * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY + * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF + * SUCH DAMAGE. + */ + +#include +__FBSDID("$FreeBSD: head/sys/netpfil/ipfw/ip_fw_pfil.c 264540 2014-04-16 14:37:11Z ae $"); + +#include "opt_ipfw.h" +#include "opt_inet.h" +#include "opt_inet6.h" +#ifndef INET +#error IPFIREWALL requires INET. 
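+
+/*
+ * This file provides the two pfil(9) entry points: ipfw_check_packet()
+ * for layer-3 traffic and ipfw_check_frame() for layer-2 frames, plus
+ * the sysctl handlers that attach or detach them per address family.
+ */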
+#endif /* INET */ + +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include + +#include +#include +#include +#include +#include + +#include +#include +#include +#include +#include +#ifdef INET6 +#include +#include +#endif + +#include + +#include + +#include + +static VNET_DEFINE(int, fw_enable) = 1; +#define V_fw_enable VNET(fw_enable) + +#ifdef INET6 +static VNET_DEFINE(int, fw6_enable) = 1; +#define V_fw6_enable VNET(fw6_enable) +#endif + +static VNET_DEFINE(int, fwlink_enable) = 0; +#define V_fwlink_enable VNET(fwlink_enable) + +int ipfw_chg_hook(SYSCTL_HANDLER_ARGS); + +/* Forward declarations. */ +static int ipfw_divert(struct mbuf **, int, struct ipfw_rule_ref *, int); +int ipfw_check_packet(void *, struct mbuf **, struct ifnet *, int, + struct inpcb *); +int ipfw_check_frame(void *, struct mbuf **, struct ifnet *, int, + struct inpcb *); + +#ifdef SYSCTL_NODE + +SYSBEGIN(f1) + +SYSCTL_DECL(_net_inet_ip_fw); +SYSCTL_VNET_PROC(_net_inet_ip_fw, OID_AUTO, enable, + CTLTYPE_INT | CTLFLAG_RW | CTLFLAG_SECURE3, &VNET_NAME(fw_enable), 0, + ipfw_chg_hook, "I", "Enable ipfw"); +#ifdef INET6 +SYSCTL_DECL(_net_inet6_ip6_fw); +SYSCTL_VNET_PROC(_net_inet6_ip6_fw, OID_AUTO, enable, + CTLTYPE_INT | CTLFLAG_RW | CTLFLAG_SECURE3, &VNET_NAME(fw6_enable), 0, + ipfw_chg_hook, "I", "Enable ipfw+6"); +#endif /* INET6 */ + +SYSCTL_DECL(_net_link_ether); +SYSCTL_VNET_PROC(_net_link_ether, OID_AUTO, ipfw, + CTLTYPE_INT | CTLFLAG_RW | CTLFLAG_SECURE3, &VNET_NAME(fwlink_enable), 0, + ipfw_chg_hook, "I", "Pass ether pkts through firewall"); + +SYSEND + +#endif /* SYSCTL_NODE */ + +/* + * The pfilter hook to pass packets to ipfw_chk and then to + * dummynet, divert, netgraph or other modules. + * The packet may be consumed. + */ +int +ipfw_check_packet(void *arg, struct mbuf **m0, struct ifnet *ifp, int dir, + struct inpcb *inp) +{ + struct ip_fw_args args; + struct m_tag *tag; + int ipfw; + int ret; + + /* convert dir to IPFW values */ + dir = (dir == PFIL_IN) ? DIR_IN : DIR_OUT; + bzero(&args, sizeof(args)); + +again: + /* + * extract and remove the tag if present. If we are left + * with onepass, optimize the outgoing path. + */ + tag = m_tag_locate(*m0, MTAG_IPFW_RULE, 0, NULL); + if (tag != NULL) { + args.rule = *((struct ipfw_rule_ref *)(tag+1)); + m_tag_delete(*m0, tag); + if (args.rule.info & IPFW_ONEPASS) + return (0); + } + + args.m = *m0; + args.oif = dir == DIR_OUT ? ifp : NULL; + args.inp = inp; + + ipfw = ipfw_chk(&args); + *m0 = args.m; + + KASSERT(*m0 != NULL || ipfw == IP_FW_DENY, ("%s: m0 is NULL", + __func__)); + + /* breaking out of the switch means drop */ + ret = 0; /* default return value for pass */ + switch (ipfw) { + case IP_FW_PASS: + /* next_hop may be set by ipfw_chk */ + if (args.next_hop == NULL && args.next_hop6 == NULL) + break; /* pass */ +#if (!defined(INET6) && !defined(INET)) + ret = EACCES; +#else + { + struct m_tag *fwd_tag; + size_t len; + + KASSERT(args.next_hop == NULL || args.next_hop6 == NULL, + ("%s: both next_hop=%p and next_hop6=%p not NULL", __func__, + args.next_hop, args.next_hop6)); +#ifdef INET6 + if (args.next_hop6 != NULL) + len = sizeof(struct sockaddr_in6); +#endif +#ifdef INET + if (args.next_hop != NULL) + len = sizeof(struct sockaddr_in); +#endif + + /* Incoming packets should not be tagged so we do not + * m_tag_find. Outgoing packets may be tagged, so we + * reuse the tag if present. + */ + fwd_tag = (dir == DIR_IN) ? 
NULL : + m_tag_find(*m0, PACKET_TAG_IPFORWARD, NULL); + if (fwd_tag != NULL) { + m_tag_unlink(*m0, fwd_tag); + } else { + fwd_tag = m_tag_get(PACKET_TAG_IPFORWARD, len, + M_NOWAIT); + if (fwd_tag == NULL) { + ret = EACCES; + break; /* i.e. drop */ + } + } +#ifdef INET6 + if (args.next_hop6 != NULL) { + bcopy(args.next_hop6, (fwd_tag+1), len); + if (in6_localip(&args.next_hop6->sin6_addr)) + (*m0)->m_flags |= M_FASTFWD_OURS; + (*m0)->m_flags |= M_IP6_NEXTHOP; + } +#endif +#ifdef INET + if (args.next_hop != NULL) { + bcopy(args.next_hop, (fwd_tag+1), len); + if (in_localip(args.next_hop->sin_addr)) + (*m0)->m_flags |= M_FASTFWD_OURS; + (*m0)->m_flags |= M_IP_NEXTHOP; + } +#endif + m_tag_prepend(*m0, fwd_tag); + } +#endif /* INET || INET6 */ + break; + + case IP_FW_DENY: + ret = EACCES; + break; /* i.e. drop */ + + case IP_FW_DUMMYNET: + ret = EACCES; + if (ip_dn_io_ptr == NULL) + break; /* i.e. drop */ + if (mtod(*m0, struct ip *)->ip_v == 4) + ret = ip_dn_io_ptr(m0, dir, &args); + else if (mtod(*m0, struct ip *)->ip_v == 6) + ret = ip_dn_io_ptr(m0, dir | PROTO_IPV6, &args); + else + break; /* drop it */ + /* + * XXX should read the return value. + * dummynet normally eats the packet and sets *m0=NULL + * unless the packet can be sent immediately. In this + * case args is updated and we should re-run the + * check without clearing args. + */ + if (*m0 != NULL) + goto again; + break; + + case IP_FW_TEE: + case IP_FW_DIVERT: + if (ip_divert_ptr == NULL) { + ret = EACCES; + break; /* i.e. drop */ + } + ret = ipfw_divert(m0, dir, &args.rule, + (ipfw == IP_FW_TEE) ? 1 : 0); + /* continue processing for the original packet (tee). */ + if (*m0) + goto again; + break; + + case IP_FW_NGTEE: + case IP_FW_NETGRAPH: + if (ng_ipfw_input_p == NULL) { + ret = EACCES; + break; /* i.e. drop */ + } + ret = ng_ipfw_input_p(m0, dir, &args, + (ipfw == IP_FW_NGTEE) ? 1 : 0); + if (ipfw == IP_FW_NGTEE) /* ignore errors for NGTEE */ + goto again; /* continue with packet */ + break; + + case IP_FW_NAT: + /* honor one-pass in case of successful nat */ + if (V_fw_one_pass) + break; /* ret is already 0 */ + goto again; + + case IP_FW_REASS: + goto again; /* continue with packet */ + + default: + KASSERT(0, ("%s: unknown retval", __func__)); + } + + if (ret != 0) { + if (*m0) + FREE_PKT(*m0); + *m0 = NULL; + } + + return ret; +} + +/* + * ipfw processing for ethernet packets (in and out). + * Interface is NULL from ether_demux, and ifp from + * ether_output_frame. + */ +int +ipfw_check_frame(void *arg, struct mbuf **m0, struct ifnet *dst, int dir, + struct inpcb *inp) +{ + struct ether_header *eh; + struct ether_header save_eh; + struct mbuf *m; + int i, ret; + struct ip_fw_args args; + struct m_tag *mtag; + + /* fetch start point from rule, if any */ + mtag = m_tag_locate(*m0, MTAG_IPFW_RULE, 0, NULL); + if (mtag == NULL) { + args.rule.slot = 0; + } else { + /* dummynet packet, already partially processed */ + struct ipfw_rule_ref *r; + + /* XXX can we free it after use ?
*/ + mtag->m_tag_id = PACKET_TAG_NONE; + r = (struct ipfw_rule_ref *)(mtag + 1); + if (r->info & IPFW_ONEPASS) + return (0); + args.rule = *r; + } + + /* I need some amt of data to be contiguous */ + m = *m0; + i = min(m->m_pkthdr.len, max_protohdr); + if (m->m_len < i) { + m = m_pullup(m, i); + if (m == NULL) { + *m0 = m; + return (0); + } + } + eh = mtod(m, struct ether_header *); +#if defined(USERSPACE) + args.eh = eh; +#else + save_eh = *eh; /* save copy for restore below */ + args.eh = &save_eh; /* MAC header for bridged/MAC packets */ +#endif + m_adj(m, ETHER_HDR_LEN); /* strip ethernet header */ + + args.m = m; /* the packet we are looking at */ + args.oif = dir == PFIL_OUT ? dst: NULL; /* destination, if any */ + args.next_hop = NULL; /* we do not support forward yet */ + args.next_hop6 = NULL; /* we do not support forward yet */ + args.inp = NULL; /* used by ipfw uid/gid/jail rules */ + i = ipfw_chk(&args); + m = args.m; + if (m != NULL) { + /* + * Restore Ethernet header, as needed, in case the + * mbuf chain was replaced by ipfw. + */ + M_PREPEND(m, ETHER_HDR_LEN, M_NOWAIT); + if (m == NULL) { + *m0 = NULL; + return (0); + } + if (eh != mtod(m, struct ether_header *)) + bcopy(&save_eh, mtod(m, struct ether_header *), + ETHER_HDR_LEN); + } + *m0 = m; + + ret = 0; + /* Check result of ipfw_chk() */ +#if defined(USERSPACE) + /* fwd 1.1.1.1 causes the packet to be bounced back. + * This is signalled by setting the low bit of the peer + */ + if (args.next_hop) { + uintptr_t *p = (void *)&(m->__m_peer); + *p |= 1; + } +#endif + switch (i) { + case IP_FW_PASS: + break; + + case IP_FW_DENY: + ret = EACCES; + break; /* i.e. drop */ + + case IP_FW_DUMMYNET: + ret = EACCES; + int dir; + + if (ip_dn_io_ptr == NULL) + break; /* i.e. drop */ + + *m0 = NULL; + dir = PROTO_LAYER2 | (dst ? DIR_OUT : DIR_IN); + ip_dn_io_ptr(&m, dir, &args); + return 0; + + default: + KASSERT(0, ("%s: unknown retval", __func__)); + } + + if (ret != 0) { + if (*m0) + FREE_PKT(*m0); + *m0 = NULL; + } + + return ret; +} + +/* do the divert, return 1 on error 0 on success */ +static int +ipfw_divert(struct mbuf **m0, int incoming, struct ipfw_rule_ref *rule, + int tee) +{ + /* + * ipfw_chk() has already tagged the packet with the divert tag. + * If tee is set, copy packet and return original. + * If not tee, consume packet and send it to divert socket. + */ + struct mbuf *clone; + struct ip *ip = mtod(*m0, struct ip *); + struct m_tag *tag; + + /* Cloning needed for tee? */ + if (tee == 0) { + clone = *m0; /* use the original mbuf */ + *m0 = NULL; + } else { + clone = m_dup(*m0, M_NOWAIT); + /* If we cannot duplicate the mbuf, we sacrifice the divert + * chain and continue with the tee-ed packet. + */ + if (clone == NULL) + return 1; + } + + /* + * Divert listeners can normally handle non-fragmented packets, + * but we can only reass in the non-tee case. + * This means that listeners on a tee rule may get fragments, + * and have to live with that. + * Note that we now have the 'reass' ipfw option so if we care + * we can do it before a 'tee'. + */ + if (!tee) switch (ip->ip_v) { + case IPVERSION: + if (ntohs(ip->ip_off) & (IP_MF | IP_OFFMASK)) { + int hlen; + struct mbuf *reass; + + reass = ip_reass(clone); /* Reassemble packet. */ + if (reass == NULL) + return 0; /* not an error */ + /* if reass = NULL then it was consumed by ip_reass */ + /* + * IP header checksum fixup after reassembly and leave header + * in network byte order. 
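+ * (in_cksum_hdr() is the fast path for the common option-less
+ * 20-byte header; in_cksum() recomputes over the full hlen when
+ * IP options are present.)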
+ */ + ip = mtod(reass, struct ip *); + hlen = ip->ip_hl << 2; + ip->ip_sum = 0; + if (hlen == sizeof(struct ip)) + ip->ip_sum = in_cksum_hdr(ip); + else + ip->ip_sum = in_cksum(reass, hlen); + clone = reass; + } + break; +#ifdef INET6 + case IPV6_VERSION >> 4: + { + struct ip6_hdr *const ip6 = mtod(clone, struct ip6_hdr *); + + if (ip6->ip6_nxt == IPPROTO_FRAGMENT) { + int nxt, off; + + off = sizeof(struct ip6_hdr); + nxt = frag6_input(&clone, &off, 0); + if (nxt == IPPROTO_DONE) + return (0); + } + break; + } +#endif + } + + /* attach a tag to the packet with the reinject info */ + tag = m_tag_alloc(MTAG_IPFW_RULE, 0, + sizeof(struct ipfw_rule_ref), M_NOWAIT); + if (tag == NULL) { + FREE_PKT(clone); + return 1; + } + *((struct ipfw_rule_ref *)(tag+1)) = *rule; + m_tag_prepend(clone, tag); + + /* Do the dirty job... */ + ip_divert_ptr(clone, incoming); + return 0; +} + +/* + * attach or detach hooks for a given protocol family + */ +static int +ipfw_hook(int onoff, int pf) +{ + struct pfil_head *pfh; + pfil_func_t hook_func; + + pfh = pfil_head_get(PFIL_TYPE_AF, pf); + if (pfh == NULL) + return ENOENT; + + hook_func = (pf == AF_LINK) ? ipfw_check_frame : ipfw_check_packet; + + (void) (onoff ? pfil_add_hook : pfil_remove_hook) + (hook_func, NULL, PFIL_IN | PFIL_OUT | PFIL_WAITOK, pfh); + + return 0; +} + +int +ipfw_attach_hooks(int arg) +{ + int error = 0; + + if (arg == 0) /* detach */ + ipfw_hook(0, AF_INET); + else if (V_fw_enable && ipfw_hook(1, AF_INET) != 0) { + error = ENOENT; /* see ip_fw_pfil.c::ipfw_hook() */ + printf("ipfw_hook() error\n"); + } +#ifdef INET6 + if (arg == 0) /* detach */ + ipfw_hook(0, AF_INET6); + else if (V_fw6_enable && ipfw_hook(1, AF_INET6) != 0) { + error = ENOENT; + printf("ipfw6_hook() error\n"); + } +#endif + if (arg == 0) /* detach */ + ipfw_hook(0, AF_LINK); + else if (V_fwlink_enable && ipfw_hook(1, AF_LINK) != 0) { + error = ENOENT; + printf("ipfw_link_hook() error\n"); + } + return error; +} + +int +ipfw_chg_hook(SYSCTL_HANDLER_ARGS) +{ + int newval; + int error; + int af; + + if (arg1 == &V_fw_enable) + af = AF_INET; +#ifdef INET6 + else if (arg1 == &V_fw6_enable) + af = AF_INET6; +#endif + else if (arg1 == &V_fwlink_enable) + af = AF_LINK; + else + return (EINVAL); + + newval = *(int *)arg1; + /* Handle sysctl change */ + error = sysctl_handle_int(oidp, &newval, 0, req); + + if (error) + return (error); + + /* Formalize new value */ + newval = (newval) ? 1 : 0; + + if (*(int *)arg1 == newval) + return (0); + + error = ipfw_hook(newval, af); + if (error) + return (error); + *(int *)arg1 = newval; + + return (0); +} +/* end of file */ diff --git a/example/ipfw/sys/netpfil/ipfw/ip_fw_private.h b/example/ipfw/sys/netpfil/ipfw/ip_fw_private.h new file mode 100644 index 0000000..e7ad538 --- /dev/null +++ b/example/ipfw/sys/netpfil/ipfw/ip_fw_private.h @@ -0,0 +1,625 @@ +/*- + * Copyright (c) 2002-2009 Luigi Rizzo, Universita` di Pisa + * + * Redistribution and use in source and binary forms, with or without + * modification, are permitted provided that the following conditions + * are met: + * 1. Redistributions of source code must retain the above copyright + * notice, this list of conditions and the following disclaimer. + * 2. Redistributions in binary form must reproduce the above copyright + * notice, this list of conditions and the following disclaimer in the + * documentation and/or other materials provided with the distribution. 
+ * + * THIS SOFTWARE IS PROVIDED BY THE AUTHOR AND CONTRIBUTORS ``AS IS'' AND + * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE + * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE + * ARE DISCLAIMED. IN NO EVENT SHALL THE AUTHOR OR CONTRIBUTORS BE LIABLE + * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL + * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS + * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) + * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT + * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY + * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF + * SUCH DAMAGE. + * + * $FreeBSD: head/sys/netpfil/ipfw/ip_fw_private.h 272840 2014-10-09 19:32:35Z melifaro $ + */ + +#ifndef _IPFW2_PRIVATE_H +#define _IPFW2_PRIVATE_H + +/* + * Internal constants and data structures used by ipfw components + * and not meant to be exported outside the kernel. + */ + +#ifdef _KERNEL + +/* + * For platforms that do not have SYSCTL support, we wrap the + * SYSCTL_* into a function (one per file) to collect the values + * into an array at module initialization. The wrapping macros, + * SYSBEGIN() and SYSEND, are empty in the default case. + */ +#ifndef SYSBEGIN +#define SYSBEGIN(x) +#endif +#ifndef SYSEND +#define SYSEND +#endif + +/* Return values from ipfw_chk() */ +enum { + IP_FW_PASS = 0, + IP_FW_DENY, + IP_FW_DIVERT, + IP_FW_TEE, + IP_FW_DUMMYNET, + IP_FW_NETGRAPH, + IP_FW_NGTEE, + IP_FW_NAT, + IP_FW_REASS, +}; + +/* + * Structure for collecting parameters to dummynet for ip6_output forwarding + */ +struct _ip6dn_args { + struct ip6_pktopts *opt_or; + struct route_in6 ro_or; + int flags_or; + struct ip6_moptions *im6o_or; + struct ifnet *origifp_or; + struct ifnet *ifp_or; + struct sockaddr_in6 dst_or; + u_long mtu_or; + struct route_in6 ro_pmtu_or; +}; + + +/* + * Arguments for calling ipfw_chk() and dummynet_io(). We put them + * all into a structure because this way it is easier and more + * efficient to pass variables around and extend the interface. + */ +struct ip_fw_args { + struct mbuf *m; /* the mbuf chain */ + struct ifnet *oif; /* output interface */ + struct sockaddr_in *next_hop; /* forward address */ + struct sockaddr_in6 *next_hop6; /* ipv6 forward address */ + + /* + * On return, it points to the matching rule. + * On entry, rule.slot > 0 means the info is valid and + * contains the starting rule for an ipfw search. + * If chain_id == chain->id && slot >0 then jump to that slot. + * Otherwise, we locate the first rule >= rulenum:rule_id + */ + struct ipfw_rule_ref rule; /* match/restart info */ + + struct ether_header *eh; /* for bridged packets */ + + struct ipfw_flow_id f_id; /* grabbed from IP header */ + //uint32_t cookie; /* a cookie depending on rule action */ + struct inpcb *inp; + + struct _ip6dn_args dummypar; /* dummynet->ip6_output */ + struct sockaddr_in hopstore; /* store here if cannot use a pointer */ +}; + +MALLOC_DECLARE(M_IPFW); + +/* + * Hooks sometimes need to know the direction of the packet + * (divert, dummynet, netgraph, ...)
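+ * For example, with the values defined just below, an outgoing
+ * layer-2 frame travels as (PROTO_LAYER2 | DIR_OUT) == 0x4, while an
+ * incoming IPv6 packet is marked (PROTO_IPV6 | DIR_IN) == 0x11.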
+ * We use a generic definition here, with bit0-1 indicating the + * direction, bit 2 indicating layer2 or 3, and bits 3-4 indicating + * the specific protocol (if necessary) + */ +enum { + DIR_MASK = 0x3, + DIR_OUT = 0, + DIR_IN = 1, + DIR_FWD = 2, + DIR_DROP = 3, + PROTO_LAYER2 = 0x4, /* set for layer 2 */ + /* PROTO_DEFAULT = 0, */ + PROTO_IPV4 = 0x08, + PROTO_IPV6 = 0x10, + PROTO_IFB = 0x0c, /* layer2 + ifbridge */ + /* PROTO_OLDBDG = 0x14, unused, old bridge */ +}; + +/* wrapper for freeing a packet, in case we need to do more work */ +#ifndef FREE_PKT +#if defined(__linux__) || defined(_WIN32) +#define FREE_PKT(m) netisr_dispatch(-1, m) +#else +#define FREE_PKT(m) m_freem(m) +#endif +#endif /* !FREE_PKT */ + +/* + * Function definitions. + */ + +/* attach (arg = 1) or detach (arg = 0) hooks */ +int ipfw_attach_hooks(int); +#ifdef NOTYET +void ipfw_nat_destroy(void); +#endif + +/* In ip_fw_log.c */ +struct ip; +struct ip_fw_chain; +void ipfw_log_bpf(int); +void ipfw_log(struct ip_fw_chain *chain, struct ip_fw *f, u_int hlen, + struct ip_fw_args *args, struct mbuf *m, struct ifnet *oif, + u_short offset, uint32_t tablearg, struct ip *ip); +VNET_DECLARE(u_int64_t, norule_counter); +#define V_norule_counter VNET(norule_counter) +VNET_DECLARE(int, verbose_limit); +#define V_verbose_limit VNET(verbose_limit) + +/* In ip_fw_dynamic.c */ + +enum { /* result for matching dynamic rules */ + MATCH_REVERSE = 0, + MATCH_FORWARD, + MATCH_NONE, + MATCH_UNKNOWN, +}; + +/* + * The lock for dynamic rules is only used once outside the file, + * and only to release the result of lookup_dyn_rule(). + * Eventually we may implement it with a callback on the function. + */ +struct ip_fw_chain; +struct sockopt_data; +int ipfw_is_dyn_rule(struct ip_fw *rule); +void ipfw_expire_dyn_rules(struct ip_fw_chain *, ipfw_range_tlv *); +void ipfw_dyn_unlock(ipfw_dyn_rule *q); + +struct tcphdr; +struct mbuf *ipfw_send_pkt(struct mbuf *, struct ipfw_flow_id *, + u_int32_t, u_int32_t, int); +int ipfw_install_state(struct ip_fw_chain *chain, struct ip_fw *rule, + ipfw_insn_limit *cmd, struct ip_fw_args *args, uint32_t tablearg); +ipfw_dyn_rule *ipfw_lookup_dyn_rule(struct ipfw_flow_id *pkt, + int *match_direction, struct tcphdr *tcp); +void ipfw_remove_dyn_children(struct ip_fw *rule); +void ipfw_get_dynamic(struct ip_fw_chain *chain, char **bp, const char *ep); +int ipfw_dump_states(struct ip_fw_chain *chain, struct sockopt_data *sd); + +void ipfw_dyn_init(struct ip_fw_chain *); /* per-vnet initialization */ +void ipfw_dyn_uninit(int); /* per-vnet deinitialization */ +int ipfw_dyn_len(void); +int ipfw_dyn_get_count(void); + +/* common variables */ +VNET_DECLARE(int, fw_one_pass); +#define V_fw_one_pass VNET(fw_one_pass) + +VNET_DECLARE(int, fw_verbose); +#define V_fw_verbose VNET(fw_verbose) + +VNET_DECLARE(struct ip_fw_chain, layer3_chain); +#define V_layer3_chain VNET(layer3_chain) + +VNET_DECLARE(int, ipfw_vnet_ready); +#define V_ipfw_vnet_ready VNET(ipfw_vnet_ready) + +VNET_DECLARE(u_int32_t, set_disable); +#define V_set_disable VNET(set_disable) + +VNET_DECLARE(int, autoinc_step); +#define V_autoinc_step VNET(autoinc_step) + +VNET_DECLARE(unsigned int, fw_tables_max); +#define V_fw_tables_max VNET(fw_tables_max) + +VNET_DECLARE(unsigned int, fw_tables_sets); +#define V_fw_tables_sets VNET(fw_tables_sets) + +struct tables_config; + +#ifdef _KERNEL +/* + * Here we have the structure representing an ipfw rule.
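+ * (For instance, a rule like "ipfw add 100 deny tcp from any to any"
+ * ends up as one ip_fw with rulenum 100, match microinstructions such
+ * as O_PROTO in cmd[], and O_DENY as its action; the exact opcode
+ * layout is produced by the userland compiler in ipfw2.c.)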
+ * + * It starts with a general area + * followed by an array of one or more instructions, which the code + * accesses as an array of 32-bit values. + * + * Given a rule pointer r: + * + * r->cmd is the start of the first instruction. + * ACTION_PTR(r) is the start of the first action (things to do + * once a rule matched). + */ + +struct ip_fw { + uint16_t act_ofs; /* offset of action in 32-bit units */ + uint16_t cmd_len; /* # of 32-bit words in cmd */ + uint16_t rulenum; /* rule number */ + uint8_t set; /* rule set (0..31) */ + uint8_t flags; /* currently unused */ + counter_u64_t cntr; /* Pointer to rule counters */ + uint32_t timestamp; /* tv_sec of last match */ + uint32_t id; /* rule id */ + uint32_t cached_id; /* used by jump_fast */ + uint32_t cached_pos; /* used by jump_fast */ + + ipfw_insn cmd[1]; /* storage for commands */ +}; + +#define IPFW_RULE_CNTR_SIZE (2 * sizeof(counter_u64_t)) + +#endif + +struct ip_fw_chain { + struct ip_fw **map; /* array of rule ptrs to ease lookup */ + uint32_t id; /* ruleset id */ + int n_rules; /* number of static rules */ + LIST_HEAD(nat_list, cfg_nat) nat; /* list of nat entries */ + void *tablestate; /* runtime table info */ + void *valuestate; /* runtime table value info */ + int *idxmap; /* skipto array of rules */ +#if defined( __linux__ ) || defined( _WIN32 ) + spinlock_t rwmtx; +#else + struct rmlock rwmtx; +#endif + int static_len; /* total len of static rules (v0) */ + uint32_t gencnt; /* NAT generation count */ + struct ip_fw *default_rule; + struct tables_config *tblcfg; /* tables module data */ + void *ifcfg; /* interface module data */ + int *idxmap_back; /* standby skipto array of rules */ +#if defined( __linux__ ) || defined( _WIN32 ) + spinlock_t uh_lock; +#else + struct rwlock uh_lock; /* lock for upper half */ +#endif +}; + +/* 64-byte structure representing multi-field table value */ +struct table_value { + uint32_t tag; /* O_TAG/O_TAGGED */ + uint32_t pipe; /* O_PIPE/O_QUEUE */ + uint16_t divert; /* O_DIVERT/O_TEE */ + uint16_t skipto; /* skipto, CALLRET */ + uint32_t netgraph; /* O_NETGRAPH/O_NGTEE */ + uint32_t fib; /* O_SETFIB */ + uint32_t nat; /* O_NAT */ + uint32_t nh4; + uint8_t dscp; + uint8_t spare0[3]; + /* -- 32 bytes -- */ + struct in6_addr nh6; + uint32_t limit; /* O_LIMIT */ + uint32_t spare1; + uint64_t refcnt; /* Number of references */ +}; + +struct namedobj_instance; + +struct named_object { + TAILQ_ENTRY(named_object) nn_next; /* namehash */ + TAILQ_ENTRY(named_object) nv_next; /* valuehash */ + char *name; /* object name */ + uint8_t type; /* object type */ + uint8_t compat; /* Object name is number */ + uint16_t kidx; /* object kernel index */ + uint16_t uidx; /* userland idx for compat records */ + uint32_t set; /* set object belongs to */ + uint32_t refcnt; /* number of references */ +}; +TAILQ_HEAD(namedobjects_head, named_object); + +struct sockopt; /* used by tcp_var.h */ +struct sockopt_data { + caddr_t kbuf; /* allocated buffer */ + size_t ksize; /* given buffer size */ + size_t koff; /* data already used */ + size_t kavail; /* number of bytes available */ + size_t ktotal; /* total bytes pushed */ + struct sockopt *sopt; /* socket data */ + caddr_t sopt_val; /* sopt user buffer */ + size_t valsize; /* original data size */ +}; + +struct ipfw_ifc; + +typedef void (ipfw_ifc_cb)(struct ip_fw_chain *ch, void *cbdata, + uint16_t ifindex); + +struct ipfw_iface { + struct named_object no; + char ifname[64]; + int resolved; + uint16_t ifindex; + uint16_t spare; + uint64_t gencnt; + TAILQ_HEAD(, ipfw_ifc) 
consumers; +}; + +struct ipfw_ifc { + TAILQ_ENTRY(ipfw_ifc) next; + struct ipfw_iface *iface; + ipfw_ifc_cb *cb; + void *cbdata; +}; + +/* Macro for working with various counters */ +#define IPFW_INC_RULE_COUNTER(_cntr, _bytes) do { \ + counter_u64_add((_cntr)->cntr, 1); \ + counter_u64_add((_cntr)->cntr + 1, _bytes); \ + if ((_cntr)->timestamp != time_uptime) \ + (_cntr)->timestamp = time_uptime; \ + } while (0) + +#define IPFW_INC_DYN_COUNTER(_cntr, _bytes) do { \ + (_cntr)->pcnt++; \ + (_cntr)->bcnt += _bytes; \ + } while (0) + +#define IPFW_ZERO_RULE_COUNTER(_cntr) do { \ + counter_u64_zero((_cntr)->cntr); \ + counter_u64_zero((_cntr)->cntr + 1); \ + (_cntr)->timestamp = 0; \ + } while (0) + +#define IPFW_ZERO_DYN_COUNTER(_cntr) do { \ + (_cntr)->pcnt = 0; \ + (_cntr)->bcnt = 0; \ + } while (0) + +#define TARG_VAL(ch, k, f) ((struct table_value *)((ch)->valuestate))[k].f +#define IP_FW_ARG_TABLEARG(ch, a, f) \ + (((a) == IP_FW_TARG) ? TARG_VAL(ch, tablearg, f) : (a)) +/* + * The lock is heavily used by ip_fw2.c (the main file) and ip_fw_nat.c + * so the variable and the macros must be here. + */ + +#if defined( __linux__ ) || defined( _WIN32 ) +#define IPFW_LOCK_INIT(_chain) do { \ + rw_init(&(_chain)->rwmtx, "IPFW static rules"); \ + rw_init(&(_chain)->uh_lock, "IPFW UH lock"); \ + } while (0) + +#define IPFW_LOCK_DESTROY(_chain) do { \ + rw_destroy(&(_chain)->rwmtx); \ + rw_destroy(&(_chain)->uh_lock); \ + } while (0) + +#define IPFW_RLOCK_ASSERT(_chain) rw_assert(&(_chain)->rwmtx, RA_RLOCKED) +#define IPFW_WLOCK_ASSERT(_chain) rw_assert(&(_chain)->rwmtx, RA_WLOCKED) + +#define IPFW_RLOCK_TRACKER +#define IPFW_RLOCK(p) rw_rlock(&(p)->rwmtx) +#define IPFW_RUNLOCK(p) rw_runlock(&(p)->rwmtx) +#define IPFW_WLOCK(p) rw_wlock(&(p)->rwmtx) +#define IPFW_WUNLOCK(p) rw_wunlock(&(p)->rwmtx) +#define IPFW_PF_RLOCK(p) IPFW_RLOCK(p) +#define IPFW_PF_RUNLOCK(p) IPFW_RUNLOCK(p) +#else /* FreeBSD */ +#define IPFW_LOCK_INIT(_chain) do { \ + rm_init(&(_chain)->rwmtx, "IPFW static rules"); \ + rw_init(&(_chain)->uh_lock, "IPFW UH lock"); \ + } while (0) + +#define IPFW_LOCK_DESTROY(_chain) do { \ + rm_destroy(&(_chain)->rwmtx); \ + rw_destroy(&(_chain)->uh_lock); \ + } while (0) + +#define IPFW_RLOCK_ASSERT(_chain) rm_assert(&(_chain)->rwmtx, RA_RLOCKED) +#define IPFW_WLOCK_ASSERT(_chain) rm_assert(&(_chain)->rwmtx, RA_WLOCKED) + +#define IPFW_RLOCK_TRACKER struct rm_priotracker _tracker +#define IPFW_RLOCK(p) rm_rlock(&(p)->rwmtx, &_tracker) +#define IPFW_RUNLOCK(p) rm_runlock(&(p)->rwmtx, &_tracker) +#define IPFW_WLOCK(p) rm_wlock(&(p)->rwmtx) +#define IPFW_WUNLOCK(p) rm_wunlock(&(p)->rwmtx) +#define IPFW_PF_RLOCK(p) IPFW_RLOCK(p) +#define IPFW_PF_RUNLOCK(p) IPFW_RUNLOCK(p) +#endif + +#define IPFW_UH_RLOCK_ASSERT(_chain) rw_assert(&(_chain)->uh_lock, RA_RLOCKED) +#define IPFW_UH_WLOCK_ASSERT(_chain) rw_assert(&(_chain)->uh_lock, RA_WLOCKED) + +#define IPFW_UH_RLOCK(p) rw_rlock(&(p)->uh_lock) +#define IPFW_UH_RUNLOCK(p) rw_runlock(&(p)->uh_lock) +#define IPFW_UH_WLOCK(p) rw_wlock(&(p)->uh_lock) +#define IPFW_UH_WUNLOCK(p) rw_wunlock(&(p)->uh_lock) + +struct obj_idx { + uint16_t uidx; /* internal index supplied by userland */ + uint16_t kidx; /* kernel object index */ + uint16_t off; /* tlv offset from rule end in 4-byte words */ + uint8_t spare; + uint8_t type; /* object type within its category */ +}; + +struct rule_check_info { + uint16_t flags; /* rule-specific check flags */ + uint16_t table_opcodes; /* count of opcodes referencing table */ + uint16_t urule_numoff; /* offset of rulenum in bytes */ 
+ uint8_t version; /* rule version */ + uint8_t spare; + ipfw_obj_ctlv *ctlv; /* name TLV container */ + struct ip_fw *krule; /* resulting rule pointer */ + caddr_t urule; /* original rule pointer */ + struct obj_idx obuf[8]; /* table references storage */ +}; + +/* Legacy interface support */ +/* + * FreeBSD 8 export rule format + */ +struct ip_fw_rule0 { + struct ip_fw *x_next; /* linked list of rules */ + struct ip_fw *next_rule; /* ptr to next [skipto] rule */ + /* 'next_rule' is used to pass up 'set_disable' status */ + + uint16_t act_ofs; /* offset of action in 32-bit units */ + uint16_t cmd_len; /* # of 32-bit words in cmd */ + uint16_t rulenum; /* rule number */ + uint8_t set; /* rule set (0..31) */ + uint8_t _pad; /* padding */ + uint32_t id; /* rule id */ + + /* These fields are present in all rules. */ + uint64_t pcnt; /* Packet counter */ + uint64_t bcnt; /* Byte counter */ + uint32_t timestamp; /* tv_sec of last match */ + + ipfw_insn cmd[1]; /* storage for commands */ +}; + +struct ip_fw_bcounter0 { + uint64_t pcnt; /* Packet counter */ + uint64_t bcnt; /* Byte counter */ + uint32_t timestamp; /* tv_sec of last match */ +}; + +/* Kernel rule length */ +/* + * RULE _K_ SIZE _V_ -> + * get kernel size from userland rule version _V_. + * RULE _U_ SIZE _V_ -> + * get user size version _V_ from kernel rule + * RULESIZE _V_ -> + * get user size rule length + */ +/* FreeBSD8 <> current kernel format */ +#define RULEUSIZE0(r) (sizeof(struct ip_fw_rule0) + (r)->cmd_len * 4 - 4) +#define RULEKSIZE0(r) roundup2((sizeof(struct ip_fw) + (r)->cmd_len*4 - 4), 8) +/* FreeBSD11 <> current kernel format */ +#define RULEUSIZE1(r) (roundup2(sizeof(struct ip_fw_rule) + \ + (r)->cmd_len * 4 - 4, 8)) +#define RULEKSIZE1(r) roundup2((sizeof(struct ip_fw) + (r)->cmd_len*4 - 4), 8) + + +/* In ip_fw_iface.c */ +int ipfw_iface_init(void); +void ipfw_iface_destroy(void); +void vnet_ipfw_iface_destroy(struct ip_fw_chain *ch); +int ipfw_iface_ref(struct ip_fw_chain *ch, char *name, + struct ipfw_ifc *ic); +void ipfw_iface_unref(struct ip_fw_chain *ch, struct ipfw_ifc *ic); +void ipfw_iface_add_notify(struct ip_fw_chain *ch, struct ipfw_ifc *ic); +void ipfw_iface_del_notify(struct ip_fw_chain *ch, struct ipfw_ifc *ic); + +/* In ip_fw_sockopt.c */ +void ipfw_init_skipto_cache(struct ip_fw_chain *chain); +void ipfw_destroy_skipto_cache(struct ip_fw_chain *chain); +int ipfw_find_rule(struct ip_fw_chain *chain, uint32_t key, uint32_t id); +int ipfw_ctl3(struct sockopt *sopt); +int ipfw_chk(struct ip_fw_args *args); +void ipfw_reap_add(struct ip_fw_chain *chain, struct ip_fw **head, + struct ip_fw *rule); +void ipfw_reap_rules(struct ip_fw *head); +void ipfw_init_counters(void); +void ipfw_destroy_counters(void); +struct ip_fw *ipfw_alloc_rule(struct ip_fw_chain *chain, size_t rulesize); +int ipfw_match_range(struct ip_fw *rule, ipfw_range_tlv *rt); + +typedef int (sopt_handler_f)(struct ip_fw_chain *ch, + ip_fw3_opheader *op3, struct sockopt_data *sd); +struct ipfw_sopt_handler { + uint16_t opcode; + uint8_t version; + uint8_t dir; + sopt_handler_f *handler; + uint64_t refcnt; +}; +#define HDIR_SET 0x01 /* Handler is used to set some data */ +#define HDIR_GET 0x02 /* Handler is used to retrieve data */ +#define HDIR_BOTH HDIR_GET|HDIR_SET + +void ipfw_init_sopt_handler(void); +void ipfw_destroy_sopt_handler(void); +void ipfw_add_sopt_handler(struct ipfw_sopt_handler *sh, size_t count); +int ipfw_del_sopt_handler(struct ipfw_sopt_handler *sh, size_t count); +caddr_t ipfw_get_sopt_space(struct sockopt_data *sd,
size_t needed); +caddr_t ipfw_get_sopt_header(struct sockopt_data *sd, size_t needed); +#define IPFW_ADD_SOPT_HANDLER(f, c) do { \ + if ((f) != 0) \ + ipfw_add_sopt_handler(c, \ + sizeof(c) / sizeof(c[0])); \ + } while(0) +#define IPFW_DEL_SOPT_HANDLER(l, c) do { \ + if ((l) != 0) \ + ipfw_del_sopt_handler(c, \ + sizeof(c) / sizeof(c[0])); \ + } while(0) + +typedef void (objhash_cb_t)(struct namedobj_instance *ni, struct named_object *, + void *arg); +typedef uint32_t (objhash_hash_f)(struct namedobj_instance *ni, void *key, + uint32_t kopt); +typedef int (objhash_cmp_f)(struct named_object *no, void *key, uint32_t kopt); +struct namedobj_instance *ipfw_objhash_create(uint32_t items); +void ipfw_objhash_destroy(struct namedobj_instance *); +void ipfw_objhash_bitmap_alloc(uint32_t items, void **idx, int *pblocks); +void ipfw_objhash_bitmap_merge(struct namedobj_instance *ni, + void **idx, int *blocks); +void ipfw_objhash_bitmap_swap(struct namedobj_instance *ni, + void **idx, int *blocks); +void ipfw_objhash_bitmap_free(void *idx, int blocks); +void ipfw_objhash_set_hashf(struct namedobj_instance *ni, objhash_hash_f *f); +struct named_object *ipfw_objhash_lookup_name(struct namedobj_instance *ni, + uint32_t set, char *name); +struct named_object *ipfw_objhash_lookup_kidx(struct namedobj_instance *ni, + uint16_t idx); +int ipfw_objhash_same_name(struct namedobj_instance *ni, struct named_object *a, + struct named_object *b); +void ipfw_objhash_add(struct namedobj_instance *ni, struct named_object *no); +void ipfw_objhash_del(struct namedobj_instance *ni, struct named_object *no); +uint32_t ipfw_objhash_count(struct namedobj_instance *ni); +void ipfw_objhash_foreach(struct namedobj_instance *ni, objhash_cb_t *f, + void *arg); +int ipfw_objhash_free_idx(struct namedobj_instance *ni, uint16_t idx); +int ipfw_objhash_alloc_idx(void *n, uint16_t *pidx); +void ipfw_objhash_set_funcs(struct namedobj_instance *ni, + objhash_hash_f *hash_f, objhash_cmp_f *cmp_f); + +/* In ip_fw_table.c */ +struct table_info; + +typedef int (table_lookup_t)(struct table_info *ti, void *key, uint32_t keylen, + uint32_t *val); + +int ipfw_lookup_table(struct ip_fw_chain *ch, uint16_t tbl, in_addr_t addr, + uint32_t *val); +int ipfw_lookup_table_extended(struct ip_fw_chain *ch, uint16_t tbl, uint16_t plen, + void *paddr, uint32_t *val); +int ipfw_init_tables(struct ip_fw_chain *ch, int first); +int ipfw_resize_tables(struct ip_fw_chain *ch, unsigned int ntables); +int ipfw_switch_tables_namespace(struct ip_fw_chain *ch, unsigned int nsets); +void ipfw_destroy_tables(struct ip_fw_chain *ch, int last); + +/* In ip_fw_nat.c -- XXX to be moved to ip_var.h */ + +extern struct cfg_nat *(*lookup_nat_ptr)(struct nat_list *, int); + +typedef int ipfw_nat_t(struct ip_fw_args *, struct cfg_nat *, struct mbuf *); +typedef int ipfw_nat_cfg_t(struct sockopt *); + +VNET_DECLARE(int, ipfw_nat_ready); +#define V_ipfw_nat_ready VNET(ipfw_nat_ready) +#define IPFW_NAT_LOADED (V_ipfw_nat_ready) + +extern ipfw_nat_t *ipfw_nat_ptr; +extern ipfw_nat_cfg_t *ipfw_nat_cfg_ptr; +extern ipfw_nat_cfg_t *ipfw_nat_del_ptr; +extern ipfw_nat_cfg_t *ipfw_nat_get_cfg_ptr; +extern ipfw_nat_cfg_t *ipfw_nat_get_log_ptr; + +#endif /* _KERNEL */ +#endif /* _IPFW2_PRIVATE_H */ diff --git a/example/ipfw/sys/netpfil/ipfw/ip_fw_sockopt.c b/example/ipfw/sys/netpfil/ipfw/ip_fw_sockopt.c new file mode 100644 index 0000000..bdf2692 --- /dev/null +++ b/example/ipfw/sys/netpfil/ipfw/ip_fw_sockopt.c @@ -0,0 +1,3469 @@ +/*- + * Copyright (c) 2002-2009 Luigi Rizzo, 
Universita` di Pisa + * Copyright (c) 2014 Yandex LLC + * Copyright (c) 2014 Alexander V. Chernikov + * + * Supported by: Valeria Paoli + * + * Redistribution and use in source and binary forms, with or without + * modification, are permitted provided that the following conditions + * are met: + * 1. Redistributions of source code must retain the above copyright + * notice, this list of conditions and the following disclaimer. + * 2. Redistributions in binary form must reproduce the above copyright + * notice, this list of conditions and the following disclaimer in the + * documentation and/or other materials provided with the distribution. + * + * THIS SOFTWARE IS PROVIDED BY THE AUTHOR AND CONTRIBUTORS ``AS IS'' AND + * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE + * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE + * ARE DISCLAIMED. IN NO EVENT SHALL THE AUTHOR OR CONTRIBUTORS BE LIABLE + * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL + * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS + * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) + * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT + * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY + * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF + * SUCH DAMAGE. + */ + +#include +__FBSDID("$FreeBSD: head/sys/netpfil/ipfw/ip_fw_sockopt.c 273035 2014-10-13 13:49:28Z melifaro $"); + +/* + * Control socket and rule management routines for ipfw. + * Control is currently implemented via IP_FW3 setsockopt() code. + */ + +#include "opt_ipfw.h" +#include "opt_inet.h" +#ifndef INET +#error IPFIREWALL requires INET. 
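+
+/*
+ * Handler registration is table-driven: the scodes[] array further
+ * down maps each supported IP_FW3 opcode to its handler function and
+ * direction (HDIR_GET/HDIR_SET), and ipfw_ctl3() dispatches on it.
+ */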
+#endif /* INET */ +#include "opt_inet6.h" + +#include +#include +#include +#include /* struct m_tag used by nested headers */ +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include + +#include +#include /* hooks */ +#include + +#include +#include + +#ifdef MAC +#include +#endif + +static int ipfw_ctl(struct sockopt *sopt); +static int check_ipfw_rule_body(ipfw_insn *cmd, int cmd_len, + struct rule_check_info *ci); +static int check_ipfw_rule1(struct ip_fw_rule *rule, int size, + struct rule_check_info *ci); +static int check_ipfw_rule0(struct ip_fw_rule0 *rule, int size, + struct rule_check_info *ci); + +#define NAMEDOBJ_HASH_SIZE 32 + +struct namedobj_instance { + struct namedobjects_head *names; + struct namedobjects_head *values; + uint32_t nn_size; /* names hash size */ + uint32_t nv_size; /* number hash size */ + u_long *idx_mask; /* used items bitmask */ + uint32_t max_blocks; /* number of "long" blocks in bitmask */ + uint32_t count; /* number of items */ + uint16_t free_off[IPFW_MAX_SETS]; /* first possible free offset */ + objhash_hash_f *hash_f; + objhash_cmp_f *cmp_f; +}; +#define BLOCK_ITEMS (8 * sizeof(u_long)) /* Number of items for ffsl() */ + +static uint32_t objhash_hash_name(struct namedobj_instance *ni, void *key, + uint32_t kopt); +static uint32_t objhash_hash_idx(struct namedobj_instance *ni, uint32_t val); +static int objhash_cmp_name(struct named_object *no, void *name, uint32_t set); + +MALLOC_DEFINE(M_IPFW, "IpFw/IpAcct", "IpFw/IpAcct chain's"); + +static int dump_config(struct ip_fw_chain *chain, ip_fw3_opheader *op3, + struct sockopt_data *sd); +static int add_rules(struct ip_fw_chain *chain, ip_fw3_opheader *op3, + struct sockopt_data *sd); +static int del_rules(struct ip_fw_chain *chain, ip_fw3_opheader *op3, + struct sockopt_data *sd); +static int clear_rules(struct ip_fw_chain *chain, ip_fw3_opheader *op3, + struct sockopt_data *sd); +static int move_rules(struct ip_fw_chain *chain, ip_fw3_opheader *op3, + struct sockopt_data *sd); +static int manage_sets(struct ip_fw_chain *chain, ip_fw3_opheader *op3, + struct sockopt_data *sd); +static int dump_soptcodes(struct ip_fw_chain *chain, ip_fw3_opheader *op3, + struct sockopt_data *sd); + +/* ctl3 handler data */ +struct mtx ctl3_lock; +#define CTL3_LOCK_INIT() mtx_init(&ctl3_lock, "ctl3_lock", NULL, MTX_DEF) +#define CTL3_LOCK_DESTROY() mtx_destroy(&ctl3_lock) +#define CTL3_LOCK() mtx_lock(&ctl3_lock) +#define CTL3_UNLOCK() mtx_unlock(&ctl3_lock) + +static struct ipfw_sopt_handler *ctl3_handlers; +static size_t ctl3_hsize; +static uint64_t ctl3_refct, ctl3_gencnt; +#define CTL3_SMALLBUF 4096 /* small page-size write buffer */ +#define CTL3_LARGEBUF 16 * 1024 * 1024 /* handle large rulesets */ + +static int ipfw_flush_sopt_data(struct sockopt_data *sd); + +static struct ipfw_sopt_handler scodes[] = { + { IP_FW_XGET, 0, HDIR_GET, dump_config }, + { IP_FW_XADD, 0, HDIR_BOTH, add_rules }, + { IP_FW_XDEL, 0, HDIR_BOTH, del_rules }, + { IP_FW_XZERO, 0, HDIR_SET, clear_rules }, + { IP_FW_XRESETLOG, 0, HDIR_SET, clear_rules }, + { IP_FW_XMOVE, 0, HDIR_SET, move_rules }, + { IP_FW_SET_SWAP, 0, HDIR_SET, manage_sets }, + { IP_FW_SET_MOVE, 0, HDIR_SET, manage_sets }, + { IP_FW_SET_ENABLE, 0, HDIR_SET, manage_sets }, + { IP_FW_DUMP_SOPTCODES, 0, HDIR_GET, dump_soptcodes }, +}; + +/* + * static variables followed by global ones + */ + +static VNET_DEFINE(uma_zone_t, ipfw_cntr_zone); +#define V_ipfw_cntr_zone 
VNET(ipfw_cntr_zone) + +void +ipfw_init_counters() +{ + + V_ipfw_cntr_zone = uma_zcreate("IPFW counters", + IPFW_RULE_CNTR_SIZE, NULL, NULL, NULL, NULL, + UMA_ALIGN_PTR, UMA_ZONE_PCPU); +} + +void +ipfw_destroy_counters() +{ + + uma_zdestroy(V_ipfw_cntr_zone); +} + +struct ip_fw * +ipfw_alloc_rule(struct ip_fw_chain *chain, size_t rulesize) +{ + struct ip_fw *rule; + + rule = malloc(rulesize, M_IPFW, M_WAITOK | M_ZERO); + rule->cntr = uma_zalloc(V_ipfw_cntr_zone, M_WAITOK | M_ZERO); + + return (rule); +} + +static void +free_rule(struct ip_fw *rule) +{ + + uma_zfree(V_ipfw_cntr_zone, rule->cntr); + free(rule, M_IPFW); +} + + +/* + * Find the smallest rule >= key, id. + * We could use bsearch but it is so simple that we code it directly + */ +int +ipfw_find_rule(struct ip_fw_chain *chain, uint32_t key, uint32_t id) +{ + int i, lo, hi; + struct ip_fw *r; + + for (lo = 0, hi = chain->n_rules - 1; lo < hi;) { + i = (lo + hi) / 2; + r = chain->map[i]; + if (r->rulenum < key) + lo = i + 1; /* continue from the next one */ + else if (r->rulenum > key) + hi = i; /* this might be good */ + else if (r->id < id) + lo = i + 1; /* continue from the next one */ + else /* r->id >= id */ + hi = i; /* this might be good */ + } + return hi; +} + +/* + * Builds skipto cache on rule set @map. + */ +static void +update_skipto_cache(struct ip_fw_chain *chain, struct ip_fw **map) +{ + int *smap, rulenum; + int i, mi; + + IPFW_UH_WLOCK_ASSERT(chain); + + mi = 0; + rulenum = map[mi]->rulenum; + smap = chain->idxmap_back; + + if (smap == NULL) + return; + + for (i = 0; i < 65536; i++) { + smap[i] = mi; + /* Use the same rule index until i < rulenum */ + if (i != rulenum || i == 65535) + continue; + /* Find next rule with num > i */ + rulenum = map[++mi]->rulenum; + while (rulenum == i) + rulenum = map[++mi]->rulenum; + } +} + +/* + * Swaps prepared (backup) index with current one. + */ +static void +swap_skipto_cache(struct ip_fw_chain *chain) +{ + int *map; + + IPFW_UH_WLOCK_ASSERT(chain); + IPFW_WLOCK_ASSERT(chain); + + map = chain->idxmap; + chain->idxmap = chain->idxmap_back; + chain->idxmap_back = map; +} + +/* + * Allocate and initialize skipto cache. + */ +void +ipfw_init_skipto_cache(struct ip_fw_chain *chain) +{ + int *idxmap, *idxmap_back; + + idxmap = malloc(65536 * sizeof(uint32_t *), M_IPFW, + M_WAITOK | M_ZERO); + idxmap_back = malloc(65536 * sizeof(uint32_t *), M_IPFW, + M_WAITOK | M_ZERO); + + /* + * Note we may be called at any time after initialization, + * for example, on first skipto rule, so we need to + * provide valid chain->idxmap on return + */ + + IPFW_UH_WLOCK(chain); + if (chain->idxmap != NULL) { + IPFW_UH_WUNLOCK(chain); + free(idxmap, M_IPFW); + free(idxmap_back, M_IPFW); + return; + } + + /* Set backup pointer first to permit building cache */ + chain->idxmap_back = idxmap_back; + update_skipto_cache(chain, chain->map); + IPFW_WLOCK(chain); + /* It is now safe to set chain->idxmap ptr */ + chain->idxmap = idxmap; + swap_skipto_cache(chain); + IPFW_WUNLOCK(chain); + IPFW_UH_WUNLOCK(chain); +} + +/* + * Destroys skipto cache. + */ +void +ipfw_destroy_skipto_cache(struct ip_fw_chain *chain) +{ + + if (chain->idxmap != NULL) + free(chain->idxmap, M_IPFW); + if (chain->idxmap_back != NULL) + free(chain->idxmap_back, M_IPFW); +} + + +/* + * allocate a new map, returns the chain locked. extra is the number + * of entries to add or delete.
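+ * The malloc/lock/recheck loop is the usual pattern for sizing under
+ * a sleepable allocation: malloc() may sleep, so n_rules can change
+ * before the lock is taken; if the new map turns out too small we
+ * unlock, free it and retry with the updated count.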
+ */ +static struct ip_fw ** +get_map(struct ip_fw_chain *chain, int extra, int locked) +{ + + for (;;) { + struct ip_fw **map; + int i, mflags; + + mflags = M_ZERO | ((locked != 0) ? M_NOWAIT : M_WAITOK); + + i = chain->n_rules + extra; + map = malloc(i * sizeof(struct ip_fw *), M_IPFW, mflags); + if (map == NULL) { + printf("%s: cannot allocate map\n", __FUNCTION__); + return NULL; + } + if (!locked) + IPFW_UH_WLOCK(chain); + if (i >= chain->n_rules + extra) /* good */ + return map; + /* otherwise we lost the race, free and retry */ + if (!locked) + IPFW_UH_WUNLOCK(chain); + free(map, M_IPFW); + } +} + +/* + * swap the maps. It is supposed to be called with IPFW_UH_WLOCK + */ +static struct ip_fw ** +swap_map(struct ip_fw_chain *chain, struct ip_fw **new_map, int new_len) +{ + struct ip_fw **old_map; + + IPFW_WLOCK(chain); + chain->id++; + chain->n_rules = new_len; + old_map = chain->map; + chain->map = new_map; + swap_skipto_cache(chain); + IPFW_WUNLOCK(chain); + return old_map; +} + + +static void +export_cntr1_base(struct ip_fw *krule, struct ip_fw_bcounter *cntr) +{ + + cntr->size = sizeof(*cntr); + + if (krule->cntr != NULL) { + cntr->pcnt = counter_u64_fetch(krule->cntr); + cntr->bcnt = counter_u64_fetch(krule->cntr + 1); + cntr->timestamp = krule->timestamp; + } + if (cntr->timestamp > 0) + cntr->timestamp += boottime.tv_sec; +} + +static void +export_cntr0_base(struct ip_fw *krule, struct ip_fw_bcounter0 *cntr) +{ + + if (krule->cntr != NULL) { + cntr->pcnt = counter_u64_fetch(krule->cntr); + cntr->bcnt = counter_u64_fetch(krule->cntr + 1); + cntr->timestamp = krule->timestamp; + } + if (cntr->timestamp > 0) + cntr->timestamp += boottime.tv_sec; +} + +/* + * Copies rule @urule from v1 userland format (current). + * to kernel @krule. + * Assume @krule is zeroed. + */ +static void +import_rule1(struct rule_check_info *ci) +{ + struct ip_fw_rule *urule; + struct ip_fw *krule; + + urule = (struct ip_fw_rule *)ci->urule; + krule = (struct ip_fw *)ci->krule; + + /* copy header */ + krule->act_ofs = urule->act_ofs; + krule->cmd_len = urule->cmd_len; + krule->rulenum = urule->rulenum; + krule->set = urule->set; + krule->flags = urule->flags; + + /* Save rulenum offset */ + ci->urule_numoff = offsetof(struct ip_fw_rule, rulenum); + + /* Copy opcodes */ + memcpy(krule->cmd, urule->cmd, krule->cmd_len * sizeof(uint32_t)); +} + +/* + * Export rule into v1 format (Current). + * Layout: + * [ ipfw_obj_tlv(IPFW_TLV_RULE_ENT) + * [ ip_fw_rule ] OR + * [ ip_fw_bcounter ip_fw_rule] (depends on rcntrs). + * ] + * Assume @data is zeroed. + */ +static void +export_rule1(struct ip_fw *krule, caddr_t data, int len, int rcntrs) +{ + struct ip_fw_bcounter *cntr; + struct ip_fw_rule *urule; + ipfw_obj_tlv *tlv; + + /* Fill in TLV header */ + tlv = (ipfw_obj_tlv *)data; + tlv->type = IPFW_TLV_RULE_ENT; + tlv->length = len; + + if (rcntrs != 0) { + /* Copy counters */ + cntr = (struct ip_fw_bcounter *)(tlv + 1); + urule = (struct ip_fw_rule *)(cntr + 1); + export_cntr1_base(krule, cntr); + } else + urule = (struct ip_fw_rule *)(tlv + 1); + + /* copy header */ + urule->act_ofs = krule->act_ofs; + urule->cmd_len = krule->cmd_len; + urule->rulenum = krule->rulenum; + urule->set = krule->set; + urule->flags = krule->flags; + urule->id = krule->id; + + /* Copy opcodes */ + memcpy(urule->cmd, krule->cmd, krule->cmd_len * sizeof(uint32_t)); +} + + +/* + * Copies rule @urule from FreeBSD8 userland format (v0) + * to kernel @krule. + * Assume @krule is zeroed. 
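+ *
+ * Worked example (editor's addition): a v0 "skipto tablearg" opcode
+ * arrives with arg1 == 65535; the conversion below rewrites it to the
+ * kernel-internal IP_FW_TARG so that 65535 stays usable as a plain
+ * argument value.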
+ */
+static void
+import_rule0(struct rule_check_info *ci)
+{
+	struct ip_fw_rule0 *urule;
+	struct ip_fw *krule;
+	int cmdlen, l;
+	ipfw_insn *cmd;
+	ipfw_insn_limit *lcmd;
+	ipfw_insn_if *cmdif;
+
+	urule = (struct ip_fw_rule0 *)ci->urule;
+	krule = (struct ip_fw *)ci->krule;
+
+	/* copy header */
+	krule->act_ofs = urule->act_ofs;
+	krule->cmd_len = urule->cmd_len;
+	krule->rulenum = urule->rulenum;
+	krule->set = urule->set;
+	if ((urule->_pad & 1) != 0)
+		krule->flags |= IPFW_RULE_NOOPT;
+
+	/* Save rulenum offset */
+	ci->urule_numoff = offsetof(struct ip_fw_rule0, rulenum);
+
+	/* Copy opcodes */
+	memcpy(krule->cmd, urule->cmd, krule->cmd_len * sizeof(uint32_t));
+
+	/*
+	 * Alter opcodes:
+	 * 1) convert tablearg value from 65535 to 0
+	 * 2) Add high bit to O_SETFIB/O_SETDSCP values (to make room for targ).
+	 * 3) convert table number in iface opcodes to u16
+	 */
+	l = krule->cmd_len;
+	cmd = krule->cmd;
+	cmdlen = 0;
+
+	for ( ;	l > 0 ; l -= cmdlen, cmd += cmdlen) {
+		cmdlen = F_LEN(cmd);
+
+		switch (cmd->opcode) {
+		/* Opcodes supporting tablearg */
+		case O_TAG:
+		case O_TAGGED:
+		case O_PIPE:
+		case O_QUEUE:
+		case O_DIVERT:
+		case O_TEE:
+		case O_SKIPTO:
+		case O_CALLRETURN:
+		case O_NETGRAPH:
+		case O_NGTEE:
+		case O_NAT:
+			if (cmd->arg1 == 65535)
+				cmd->arg1 = IP_FW_TARG;
+			break;
+		case O_SETFIB:
+		case O_SETDSCP:
+			if (cmd->arg1 == 65535)
+				cmd->arg1 = IP_FW_TARG;
+			else
+				cmd->arg1 |= 0x8000;
+			break;
+		case O_LIMIT:
+			lcmd = (ipfw_insn_limit *)cmd;
+			if (lcmd->conn_limit == 65535)
+				lcmd->conn_limit = IP_FW_TARG;
+			break;
+		/* Interface tables */
+		case O_XMIT:
+		case O_RECV:
+		case O_VIA:
+			/* Interface table, possibly */
+			cmdif = (ipfw_insn_if *)cmd;
+			if (cmdif->name[0] != '\1')
+				break;
+
+			cmdif->p.kidx = (uint16_t)cmdif->p.glob;
+			break;
+		}
+	}
+}
+
+/*
+ * Copies rule @krule from kernel to FreeBSD8 userland format (v0).
+ */
+static void
+export_rule0(struct ip_fw *krule, struct ip_fw_rule0 *urule, int len)
+{
+	int cmdlen, l;
+	ipfw_insn *cmd;
+	ipfw_insn_limit *lcmd;
+	ipfw_insn_if *cmdif;
+
+	/* copy header */
+	memset(urule, 0, len);
+	urule->act_ofs = krule->act_ofs;
+	urule->cmd_len = krule->cmd_len;
+	urule->rulenum = krule->rulenum;
+	urule->set = krule->set;
+	if ((krule->flags & IPFW_RULE_NOOPT) != 0)
+		urule->_pad |= 1;
+
+	/* Copy opcodes */
+	memcpy(urule->cmd, krule->cmd, krule->cmd_len * sizeof(uint32_t));
+
+	/* Export counters */
+	export_cntr0_base(krule, (struct ip_fw_bcounter0 *)&urule->pcnt);
+
+	/*
+	 * Alter opcodes:
+	 * 1) convert tablearg value from 0 to 65535
+	 * 2) Remove highest bit from O_SETFIB/O_SETDSCP values.
+ * 3) convert table number in iface opcodes to int + */ + l = urule->cmd_len; + cmd = urule->cmd; + cmdlen = 0; + + for ( ; l > 0 ; l -= cmdlen, cmd += cmdlen) { + cmdlen = F_LEN(cmd); + + switch (cmd->opcode) { + /* Opcodes supporting tablearg */ + case O_TAG: + case O_TAGGED: + case O_PIPE: + case O_QUEUE: + case O_DIVERT: + case O_TEE: + case O_SKIPTO: + case O_CALLRETURN: + case O_NETGRAPH: + case O_NGTEE: + case O_NAT: + if (cmd->arg1 == IP_FW_TARG) + cmd->arg1 = 65535; + break; + case O_SETFIB: + case O_SETDSCP: + if (cmd->arg1 == IP_FW_TARG) + cmd->arg1 = 65535; + else + cmd->arg1 &= ~0x8000; + break; + case O_LIMIT: + lcmd = (ipfw_insn_limit *)cmd; + if (lcmd->conn_limit == IP_FW_TARG) + lcmd->conn_limit = 65535; + break; + /* Interface tables */ + case O_XMIT: + case O_RECV: + case O_VIA: + /* Interface table, possibly */ + cmdif = (ipfw_insn_if *)cmd; + if (cmdif->name[0] != '\1') + break; + + cmdif->p.glob = cmdif->p.kidx; + break; + } + } +} + +/* + * Add new rule(s) to the list possibly creating rule number for each. + * Update the rule_number in the input struct so the caller knows it as well. + * Must be called without IPFW_UH held + */ +static int +commit_rules(struct ip_fw_chain *chain, struct rule_check_info *rci, int count) +{ + int error, i, insert_before, tcount; + uint16_t rulenum, *pnum; + struct rule_check_info *ci; + struct ip_fw *krule; + struct ip_fw **map; /* the new array of pointers */ + + /* Check if we need to do table remap */ + tcount = 0; + for (ci = rci, i = 0; i < count; ci++, i++) { + if (ci->table_opcodes == 0) + continue; + + /* + * Rule has some table opcodes. + * Reference & allocate needed tables/ + */ + error = ipfw_rewrite_table_uidx(chain, ci); + if (error != 0) { + + /* + * rewrite failed, state for current rule + * has been reverted. Check if we need to + * revert more. + */ + if (tcount > 0) { + + /* + * We have some more table rules + * we need to rollback. + */ + + IPFW_UH_WLOCK(chain); + while (ci != rci) { + ci--; + if (ci->table_opcodes == 0) + continue; + ipfw_unref_rule_tables(chain,ci->krule); + + } + IPFW_UH_WUNLOCK(chain); + + } + + return (error); + } + + tcount++; + } + + /* get_map returns with IPFW_UH_WLOCK if successful */ + map = get_map(chain, count, 0 /* not locked */); + if (map == NULL) { + if (tcount > 0) { + /* Unbind tables */ + IPFW_UH_WLOCK(chain); + for (ci = rci, i = 0; i < count; ci++, i++) { + if (ci->table_opcodes == 0) + continue; + + ipfw_unref_rule_tables(chain, ci->krule); + } + IPFW_UH_WUNLOCK(chain); + } + + return (ENOSPC); + } + + if (V_autoinc_step < 1) + V_autoinc_step = 1; + else if (V_autoinc_step > 1000) + V_autoinc_step = 1000; + + /* FIXME: Handle count > 1 */ + ci = rci; + krule = ci->krule; + rulenum = krule->rulenum; + + /* find the insertion point, we will insert before */ + insert_before = rulenum ? rulenum + 1 : IPFW_DEFAULT_RULE; + i = ipfw_find_rule(chain, insert_before, 0); + /* duplicate first part */ + if (i > 0) + bcopy(chain->map, map, i * sizeof(struct ip_fw *)); + map[i] = krule; + /* duplicate remaining part, we always have the default rule */ + bcopy(chain->map + i, map + i + 1, + sizeof(struct ip_fw *) *(chain->n_rules - i)); + if (rulenum == 0) { + /* Compute rule number and write it back */ + rulenum = i > 0 ? 
map[i-1]->rulenum : 0; + if (rulenum < IPFW_DEFAULT_RULE - V_autoinc_step) + rulenum += V_autoinc_step; + krule->rulenum = rulenum; + /* Save number to userland rule */ + pnum = (uint16_t *)((caddr_t)ci->urule + ci->urule_numoff); + *pnum = rulenum; + } + + krule->id = chain->id + 1; + update_skipto_cache(chain, map); + map = swap_map(chain, map, chain->n_rules + 1); + chain->static_len += RULEUSIZE0(krule); + IPFW_UH_WUNLOCK(chain); + if (map) + free(map, M_IPFW); + return (0); +} + +/* + * Adds @rule to the list of rules to reap + */ +void +ipfw_reap_add(struct ip_fw_chain *chain, struct ip_fw **head, + struct ip_fw *rule) +{ + + IPFW_UH_WLOCK_ASSERT(chain); + + /* Unlink rule from everywhere */ + ipfw_unref_rule_tables(chain, rule); + + *((struct ip_fw **)rule) = *head; + *head = rule; +} + +/* + * Reclaim storage associated with a list of rules. This is + * typically the list created using remove_rule. + * A NULL pointer on input is handled correctly. + */ +void +ipfw_reap_rules(struct ip_fw *head) +{ + struct ip_fw *rule; + + while ((rule = head) != NULL) { + head = *((struct ip_fw **)head); + free_rule(rule); + } +} + +/* + * Rules to keep are + * (default || reserved || !match_set || !match_number) + * where + * default ::= (rule->rulenum == IPFW_DEFAULT_RULE) + * // the default rule is always protected + * + * reserved ::= (cmd == 0 && n == 0 && rule->set == RESVD_SET) + * // RESVD_SET is protected only if cmd == 0 and n == 0 ("ipfw flush") + * + * match_set ::= (cmd == 0 || rule->set == set) + * // set number is ignored for cmd == 0 + * + * match_number ::= (cmd == 1 || n == 0 || n == rule->rulenum) + * // number is ignored for cmd == 1 or n == 0 + * + */ +int +ipfw_match_range(struct ip_fw *rule, ipfw_range_tlv *rt) +{ + + /* Don't match default rule for modification queries */ + if (rule->rulenum == IPFW_DEFAULT_RULE && + (rt->flags & IPFW_RCFLAG_DEFAULT) == 0) + return (0); + + /* Don't match rules in reserved set for flush requests */ + if ((rt->flags & IPFW_RCFLAG_ALL) != 0 && rule->set == RESVD_SET) + return (0); + + /* If we're filtering by set, don't match other sets */ + if ((rt->flags & IPFW_RCFLAG_SET) != 0 && rule->set != rt->set) + return (0); + + if ((rt->flags & IPFW_RCFLAG_RANGE) != 0 && + (rule->rulenum < rt->start_rule || rule->rulenum > rt->end_rule)) + return (0); + + return (1); +} + +/* + * Delete rules matching range @rt. + * Saves number of deleted rules in @ndel. + * + * Returns 0 on success. + */ +static int +delete_range(struct ip_fw_chain *chain, ipfw_range_tlv *rt, int *ndel) +{ + struct ip_fw *reap, *rule, **map; + int end, start; + int i, n, ndyn, ofs; + + reap = NULL; + IPFW_UH_WLOCK(chain); /* arbitrate writers */ + + /* + * Stage 1: Determine range to inspect. + * Range is half-inclusive, e.g [start, end). + */ + start = 0; + end = chain->n_rules - 1; + + if ((rt->flags & IPFW_RCFLAG_RANGE) != 0) { + start = ipfw_find_rule(chain, rt->start_rule, 0); + + end = ipfw_find_rule(chain, rt->end_rule, 0); + if (rt->end_rule != IPFW_DEFAULT_RULE) + while (chain->map[end]->rulenum == rt->end_rule) + end++; + } + + /* Allocate new map of the same size */ + map = get_map(chain, 0, 1 /* locked */); + if (map == NULL) { + IPFW_UH_WUNLOCK(chain); + return (ENOMEM); + } + + n = 0; + ndyn = 0; + ofs = start; + /* 1. bcopy the initial part of the map */ + if (start > 0) + bcopy(chain->map, map, start * sizeof(struct ip_fw *)); + /* 2. 
copy active rules between start and end */
+	for (i = start; i < end; i++) {
+		rule = chain->map[i];
+		if (ipfw_match_range(rule, rt) == 0) {
+			map[ofs++] = rule;
+			continue;
+		}
+
+		n++;
+		if (ipfw_is_dyn_rule(rule) != 0)
+			ndyn++;
+	}
+	/* 3. copy the final part of the map */
+	bcopy(chain->map + end, map + ofs,
+	    (chain->n_rules - end) * sizeof(struct ip_fw *));
+	/* 4. recalculate skipto cache */
+	update_skipto_cache(chain, map);
+	/* 5. swap the maps (under UH_WLOCK + WLOCK) */
+	map = swap_map(chain, map, chain->n_rules - n);
+	/* 6. Remove all dynamic states originated by deleted rules */
+	if (ndyn > 0)
+		ipfw_expire_dyn_rules(chain, rt);
+	/* 7. now remove the rules deleted from the old map */
+	for (i = start; i < end; i++) {
+		rule = map[i];
+		if (ipfw_match_range(rule, rt) == 0)
+			continue;
+		chain->static_len -= RULEUSIZE0(rule);
+		ipfw_reap_add(chain, &reap, rule);
+	}
+	IPFW_UH_WUNLOCK(chain);
+
+	ipfw_reap_rules(reap);
+	if (map != NULL)
+		free(map, M_IPFW);
+	*ndel = n;
+	return (0);
+}
+
+/*
+ * Moves all rules in range @rt to a new set.
+ *
+ * Returns 0 on success.
+ */
+static int
+move_range(struct ip_fw_chain *chain, ipfw_range_tlv *rt)
+{
+	struct ip_fw *rule;
+	int i;
+
+	IPFW_UH_WLOCK(chain);
+
+	/*
+	 * Move rules with matching parameters to a new set.
+	 * This one is much more complex. We have to ensure
+	 * that all referenced tables (if any) are referenced
+	 * by given rule subset only. Otherwise, we can't move
+	 * them to new set and have to return error.
+	 */
+	if (V_fw_tables_sets != 0) {
+		if (ipfw_move_tables_sets(chain, rt, rt->new_set) != 0) {
+			IPFW_UH_WUNLOCK(chain);
+			return (EBUSY);
+		}
+	}
+
+	/* XXX: We have to do swap holding WLOCK */
+	for (i = 0; i < chain->n_rules; i++) {
+		rule = chain->map[i];
+		if (ipfw_match_range(rule, rt) == 0)
+			continue;
+		rule->set = rt->new_set;
+	}
+
+	IPFW_UH_WUNLOCK(chain);
+
+	return (0);
+}
+
+/*
+ * Clear counters for a specific rule.
+ * Normally run under IPFW_UH_RLOCK, but these are idempotent ops
+ * so we only care that rules do not disappear.
+ */
+static void
+clear_counters(struct ip_fw *rule, int log_only)
+{
+	ipfw_insn_log *l = (ipfw_insn_log *)ACTION_PTR(rule);
+
+	if (log_only == 0)
+		IPFW_ZERO_RULE_COUNTER(rule);
+	if (l->o.opcode == O_LOG)
+		l->log_left = l->max_log;
+}
+
+/*
+ * Flushes rules counters and/or log values on matching range.
+ *
+ * Returns number of items cleared.
+ */
+static int
+clear_range(struct ip_fw_chain *chain, ipfw_range_tlv *rt, int log_only)
+{
+	struct ip_fw *rule;
+	int num;
+	int i;
+
+	num = 0;
+	rt->flags |= IPFW_RCFLAG_DEFAULT;
+
+	IPFW_UH_WLOCK(chain);	/* arbitrate writers */
+	for (i = 0; i < chain->n_rules; i++) {
+		rule = chain->map[i];
+		if (ipfw_match_range(rule, rt) == 0)
+			continue;
+		clear_counters(rule, log_only);
+		num++;
+	}
+	IPFW_UH_WUNLOCK(chain);
+
+	return (num);
+}
+
+static int
+check_range_tlv(ipfw_range_tlv *rt)
+{
+
+	if (rt->head.length != sizeof(*rt))
+		return (1);
+	if (rt->start_rule > rt->end_rule)
+		return (1);
+	if (rt->set >= IPFW_MAX_SETS || rt->new_set >= IPFW_MAX_SETS)
+		return (1);
+
+	if ((rt->flags & IPFW_RCFLAG_USER) != rt->flags)
+		return (1);
+
+	return (0);
+}
+
+/*
+ * Delete rules matching specified parameters.
+ * Data layout (v0)(current):
+ * Request: [ ipfw_obj_header ipfw_range_tlv ]
+ * Reply: [ ipfw_obj_header ipfw_range_tlv ]
+ *
+ * Saves number of deleted rules in ipfw_range_tlv->new_set.
+ *
+ * Returns 0 on success.
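+ *
+ * Worked example (editor's addition): to delete rules 100..199 the
+ * client sends an ipfw_range_tlv with flags = IPFW_RCFLAG_RANGE,
+ * start_rule = 100 and end_rule = 199; on return, new_set carries the
+ * number of rules actually removed.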
+ */ +static int +del_rules(struct ip_fw_chain *chain, ip_fw3_opheader *op3, + struct sockopt_data *sd) +{ + ipfw_range_header *rh; + int error, ndel; + + if (sd->valsize != sizeof(*rh)) + return (EINVAL); + + rh = (ipfw_range_header *)ipfw_get_sopt_space(sd, sd->valsize); + + if (check_range_tlv(&rh->range) != 0) + return (EINVAL); + + ndel = 0; + if ((error = delete_range(chain, &rh->range, &ndel)) != 0) + return (error); + + /* Save number of rules deleted */ + rh->range.new_set = ndel; + return (0); +} + +/* + * Move rules/sets matching specified parameters + * Data layout (v0)(current): + * Request: [ ipfw_obj_header ipfw_range_tlv ] + * + * Returns 0 on success. + */ +static int +move_rules(struct ip_fw_chain *chain, ip_fw3_opheader *op3, + struct sockopt_data *sd) +{ + ipfw_range_header *rh; + + if (sd->valsize != sizeof(*rh)) + return (EINVAL); + + rh = (ipfw_range_header *)ipfw_get_sopt_space(sd, sd->valsize); + + if (check_range_tlv(&rh->range) != 0) + return (EINVAL); + + return (move_range(chain, &rh->range)); +} + +/* + * Clear rule accounting data matching specified parameters + * Data layout (v0)(current): + * Request: [ ipfw_obj_header ipfw_range_tlv ] + * Reply: [ ipfw_obj_header ipfw_range_tlv ] + * + * Saves number of cleared rules in ipfw_range_tlv->new_set. + * + * Returns 0 on success. + */ +static int +clear_rules(struct ip_fw_chain *chain, ip_fw3_opheader *op3, + struct sockopt_data *sd) +{ + ipfw_range_header *rh; + int log_only, num; + char *msg; + + if (sd->valsize != sizeof(*rh)) + return (EINVAL); + + rh = (ipfw_range_header *)ipfw_get_sopt_space(sd, sd->valsize); + + if (check_range_tlv(&rh->range) != 0) + return (EINVAL); + + log_only = (op3->opcode == IP_FW_XRESETLOG); + + num = clear_range(chain, &rh->range, log_only); + + if (rh->range.flags & IPFW_RCFLAG_ALL) + msg = log_only ? "All logging counts reset" : + "Accounting cleared"; + else + msg = log_only ? "logging count reset" : "cleared"; + + if (V_fw_verbose) { + int lev = LOG_SECURITY | LOG_NOTICE; + log(lev, "ipfw: %s.\n", msg); + } + + /* Save number of rules cleared */ + rh->range.new_set = num; + return (0); +} + +static void +enable_sets(struct ip_fw_chain *chain, ipfw_range_tlv *rt) +{ + uint32_t v_set; + + IPFW_UH_WLOCK_ASSERT(chain); + + /* Change enabled/disabled sets mask */ + v_set = (V_set_disable | rt->set) & ~rt->new_set; + v_set &= ~(1 << RESVD_SET); /* set RESVD_SET always enabled */ + IPFW_WLOCK(chain); + V_set_disable = v_set; + IPFW_WUNLOCK(chain); +} + +static void +swap_sets(struct ip_fw_chain *chain, ipfw_range_tlv *rt, int mv) +{ + struct ip_fw *rule; + int i; + + IPFW_UH_WLOCK_ASSERT(chain); + + /* Swap or move two sets */ + for (i = 0; i < chain->n_rules - 1; i++) { + rule = chain->map[i]; + if (rule->set == rt->set) + rule->set = rt->new_set; + else if (rule->set == rt->new_set && mv == 0) + rule->set = rt->set; + } + if (V_fw_tables_sets != 0) + ipfw_swap_tables_sets(chain, rt->set, rt->new_set, mv); +} + +/* + * Swaps or moves set + * Data layout (v0)(current): + * Request: [ ipfw_obj_header ipfw_range_tlv ] + * + * Returns 0 on success. 
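+ *
+ * Worked example (editor's addition): IP_FW_SET_SWAP with set = 1 and
+ * new_set = 2 exchanges the two sets, while IP_FW_SET_MOVE with the
+ * same values relabels every rule of set 1 as set 2 one-way.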
+ */ +static int +manage_sets(struct ip_fw_chain *chain, ip_fw3_opheader *op3, + struct sockopt_data *sd) +{ + ipfw_range_header *rh; + + if (sd->valsize != sizeof(*rh)) + return (EINVAL); + + rh = (ipfw_range_header *)ipfw_get_sopt_space(sd, sd->valsize); + + if (rh->range.head.length != sizeof(ipfw_range_tlv)) + return (1); + + IPFW_UH_WLOCK(chain); + switch (op3->opcode) { + case IP_FW_SET_SWAP: + case IP_FW_SET_MOVE: + swap_sets(chain, &rh->range, op3->opcode == IP_FW_SET_MOVE); + break; + case IP_FW_SET_ENABLE: + enable_sets(chain, &rh->range); + break; + } + IPFW_UH_WUNLOCK(chain); + + return (0); +} + +/** + * Remove all rules with given number, or do set manipulation. + * Assumes chain != NULL && *chain != NULL. + * + * The argument is an uint32_t. The low 16 bit are the rule or set number; + * the next 8 bits are the new set; the top 8 bits indicate the command: + * + * 0 delete rules numbered "rulenum" + * 1 delete rules in set "rulenum" + * 2 move rules "rulenum" to set "new_set" + * 3 move rules from set "rulenum" to set "new_set" + * 4 swap sets "rulenum" and "new_set" + * 5 delete rules "rulenum" and set "new_set" + */ +static int +del_entry(struct ip_fw_chain *chain, uint32_t arg) +{ + uint32_t num; /* rule number or old_set */ + uint8_t cmd, new_set; + int do_del, ndel; + int error = 0; + ipfw_range_tlv rt; + + num = arg & 0xffff; + cmd = (arg >> 24) & 0xff; + new_set = (arg >> 16) & 0xff; + + if (cmd > 5 || new_set > RESVD_SET) + return EINVAL; + if (cmd == 0 || cmd == 2 || cmd == 5) { + if (num >= IPFW_DEFAULT_RULE) + return EINVAL; + } else { + if (num > RESVD_SET) /* old_set */ + return EINVAL; + } + + /* Convert old requests into new representation */ + memset(&rt, 0, sizeof(rt)); + rt.start_rule = num; + rt.end_rule = num; + rt.set = num; + rt.new_set = new_set; + do_del = 0; + + switch (cmd) { + case 0: /* delete rules numbered "rulenum" */ + if (num == 0) + rt.flags |= IPFW_RCFLAG_ALL; + else + rt.flags |= IPFW_RCFLAG_RANGE; + do_del = 1; + break; + case 1: /* delete rules in set "rulenum" */ + rt.flags |= IPFW_RCFLAG_SET; + do_del = 1; + break; + case 5: /* delete rules "rulenum" and set "new_set" */ + rt.flags |= IPFW_RCFLAG_RANGE | IPFW_RCFLAG_SET; + rt.set = new_set; + rt.new_set = 0; + do_del = 1; + break; + case 2: /* move rules "rulenum" to set "new_set" */ + rt.flags |= IPFW_RCFLAG_RANGE; + break; + case 3: /* move rules from set "rulenum" to set "new_set" */ + IPFW_UH_WLOCK(chain); + swap_sets(chain, &rt, 1); + IPFW_UH_WUNLOCK(chain); + return (0); + case 4: /* swap sets "rulenum" and "new_set" */ + IPFW_UH_WLOCK(chain); + swap_sets(chain, &rt, 0); + IPFW_UH_WUNLOCK(chain); + return (0); + default: + return (ENOTSUP); + } + + if (do_del != 0) { + if ((error = delete_range(chain, &rt, &ndel)) != 0) + return (error); + + if (ndel == 0 && (cmd != 1 && num != 0)) + return (EINVAL); + + return (0); + } + + return (move_range(chain, &rt)); +} + +/** + * Reset some or all counters on firewall rules. + * The argument `arg' is an u_int32_t. The low 16 bit are the rule number, + * the next 8 bits are the set number, the top 8 bits are the command: + * 0 work with rules from all set's; + * 1 work with rules only from specified set. + * Specified rule number is zero if we want to clear all entries. + * log_only is 1 if we only want to reset logs, zero otherwise. 
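+ *
+ * Worked example (editor's addition): arg = (1 << 24) | (5 << 16)
+ * clears the counters of every rule in set 5, while arg = 100 clears
+ * only the rules numbered 100, whatever set they are in.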
+ */ +static int +zero_entry(struct ip_fw_chain *chain, u_int32_t arg, int log_only) +{ + struct ip_fw *rule; + char *msg; + int i; + + uint16_t rulenum = arg & 0xffff; + uint8_t set = (arg >> 16) & 0xff; + uint8_t cmd = (arg >> 24) & 0xff; + + if (cmd > 1) + return (EINVAL); + if (cmd == 1 && set > RESVD_SET) + return (EINVAL); + + IPFW_UH_RLOCK(chain); + if (rulenum == 0) { + V_norule_counter = 0; + for (i = 0; i < chain->n_rules; i++) { + rule = chain->map[i]; + /* Skip rules not in our set. */ + if (cmd == 1 && rule->set != set) + continue; + clear_counters(rule, log_only); + } + msg = log_only ? "All logging counts reset" : + "Accounting cleared"; + } else { + int cleared = 0; + for (i = 0; i < chain->n_rules; i++) { + rule = chain->map[i]; + if (rule->rulenum == rulenum) { + if (cmd == 0 || rule->set == set) + clear_counters(rule, log_only); + cleared = 1; + } + if (rule->rulenum > rulenum) + break; + } + if (!cleared) { /* we did not find any matching rules */ + IPFW_UH_RUNLOCK(chain); + return (EINVAL); + } + msg = log_only ? "logging count reset" : "cleared"; + } + IPFW_UH_RUNLOCK(chain); + + if (V_fw_verbose) { + int lev = LOG_SECURITY | LOG_NOTICE; + + if (rulenum) + log(lev, "ipfw: Entry %d %s.\n", rulenum, msg); + else + log(lev, "ipfw: %s.\n", msg); + } + return (0); +} + + +/* + * Check rule head in FreeBSD11 format + * + */ +static int +check_ipfw_rule1(struct ip_fw_rule *rule, int size, + struct rule_check_info *ci) +{ + int l; + + if (size < sizeof(*rule)) { + printf("ipfw: rule too short\n"); + return (EINVAL); + } + + /* Check for valid cmd_len */ + l = roundup2(RULESIZE(rule), sizeof(uint64_t)); + if (l != size) { + printf("ipfw: size mismatch (have %d want %d)\n", size, l); + return (EINVAL); + } + if (rule->act_ofs >= rule->cmd_len) { + printf("ipfw: bogus action offset (%u > %u)\n", + rule->act_ofs, rule->cmd_len - 1); + return (EINVAL); + } + + if (rule->rulenum > IPFW_DEFAULT_RULE - 1) + return (EINVAL); + + return (check_ipfw_rule_body(rule->cmd, rule->cmd_len, ci)); +} + +/* + * Check rule head in FreeBSD8 format + * + */ +static int +check_ipfw_rule0(struct ip_fw_rule0 *rule, int size, + struct rule_check_info *ci) +{ + int l; + + if (size < sizeof(*rule)) { + printf("ipfw: rule too short\n"); + return (EINVAL); + } + + /* Check for valid cmd_len */ + l = sizeof(*rule) + rule->cmd_len * 4 - 4; + if (l != size) { + printf("ipfw: size mismatch (have %d want %d)\n", size, l); + return (EINVAL); + } + if (rule->act_ofs >= rule->cmd_len) { + printf("ipfw: bogus action offset (%u > %u)\n", + rule->act_ofs, rule->cmd_len - 1); + return (EINVAL); + } + + if (rule->rulenum > IPFW_DEFAULT_RULE - 1) + return (EINVAL); + + return (check_ipfw_rule_body(rule->cmd, rule->cmd_len, ci)); +} + +static int +check_ipfw_rule_body(ipfw_insn *cmd, int cmd_len, struct rule_check_info *ci) +{ + int cmdlen, l; + int have_action; + + have_action = 0; + + /* + * Now go for the individual checks. Very simple ones, basically only + * instruction sizes. 
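+	 *
+	 * Editor's note: F_LEN() and F_INSN_SIZE() both count 32-bit words;
+	 * e.g. an O_IP_SRC instruction is an ipfw_insn_u32 and must have
+	 * cmdlen == F_INSN_SIZE(ipfw_insn_u32) == 2.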
+ */ + for (l = cmd_len; l > 0 ; l -= cmdlen, cmd += cmdlen) { + cmdlen = F_LEN(cmd); + if (cmdlen > l) { + printf("ipfw: opcode %d size truncated\n", + cmd->opcode); + return EINVAL; + } + switch (cmd->opcode) { + case O_PROBE_STATE: + case O_KEEP_STATE: + case O_PROTO: + case O_IP_SRC_ME: + case O_IP_DST_ME: + case O_LAYER2: + case O_IN: + case O_FRAG: + case O_DIVERTED: + case O_IPOPT: + case O_IPTOS: + case O_IPPRECEDENCE: + case O_IPVER: + case O_SOCKARG: + case O_TCPFLAGS: + case O_TCPOPTS: + case O_ESTAB: + case O_VERREVPATH: + case O_VERSRCREACH: + case O_ANTISPOOF: + case O_IPSEC: +#ifdef INET6 + case O_IP6_SRC_ME: + case O_IP6_DST_ME: + case O_EXT_HDR: + case O_IP6: +#endif + case O_IP4: + case O_TAG: + if (cmdlen != F_INSN_SIZE(ipfw_insn)) + goto bad_size; + break; + + case O_FIB: + if (cmdlen != F_INSN_SIZE(ipfw_insn)) + goto bad_size; + if (cmd->arg1 >= rt_numfibs) { + printf("ipfw: invalid fib number %d\n", + cmd->arg1); + return EINVAL; + } + break; + + case O_SETFIB: + if (cmdlen != F_INSN_SIZE(ipfw_insn)) + goto bad_size; + if ((cmd->arg1 != IP_FW_TARG) && + ((cmd->arg1 & 0x7FFFF) >= rt_numfibs)) { + printf("ipfw: invalid fib number %d\n", + cmd->arg1 & 0x7FFFF); + return EINVAL; + } + goto check_action; + + case O_UID: + case O_GID: + case O_JAIL: + case O_IP_SRC: + case O_IP_DST: + case O_TCPSEQ: + case O_TCPACK: + case O_PROB: + case O_ICMPTYPE: + if (cmdlen != F_INSN_SIZE(ipfw_insn_u32)) + goto bad_size; + break; + + case O_LIMIT: + if (cmdlen != F_INSN_SIZE(ipfw_insn_limit)) + goto bad_size; + break; + + case O_LOG: + if (cmdlen != F_INSN_SIZE(ipfw_insn_log)) + goto bad_size; + + ((ipfw_insn_log *)cmd)->log_left = + ((ipfw_insn_log *)cmd)->max_log; + + break; + + case O_IP_SRC_MASK: + case O_IP_DST_MASK: + /* only odd command lengths */ + if ( !(cmdlen & 1) || cmdlen > 31) + goto bad_size; + break; + + case O_IP_SRC_SET: + case O_IP_DST_SET: + if (cmd->arg1 == 0 || cmd->arg1 > 256) { + printf("ipfw: invalid set size %d\n", + cmd->arg1); + return EINVAL; + } + if (cmdlen != F_INSN_SIZE(ipfw_insn_u32) + + (cmd->arg1+31)/32 ) + goto bad_size; + break; + + case O_IP_SRC_LOOKUP: + case O_IP_DST_LOOKUP: + if (cmd->arg1 >= V_fw_tables_max) { + printf("ipfw: invalid table number %d\n", + cmd->arg1); + return (EINVAL); + } + if (cmdlen != F_INSN_SIZE(ipfw_insn) && + cmdlen != F_INSN_SIZE(ipfw_insn_u32) + 1 && + cmdlen != F_INSN_SIZE(ipfw_insn_u32)) + goto bad_size; + ci->table_opcodes++; + break; + case O_IP_FLOW_LOOKUP: + if (cmd->arg1 >= V_fw_tables_max) { + printf("ipfw: invalid table number %d\n", + cmd->arg1); + return (EINVAL); + } + if (cmdlen != F_INSN_SIZE(ipfw_insn) && + cmdlen != F_INSN_SIZE(ipfw_insn_u32)) + goto bad_size; + ci->table_opcodes++; + break; + case O_MACADDR2: + if (cmdlen != F_INSN_SIZE(ipfw_insn_mac)) + goto bad_size; + break; + + case O_NOP: + case O_IPID: + case O_IPTTL: + case O_IPLEN: + case O_TCPDATALEN: + case O_TCPWIN: + case O_TAGGED: + if (cmdlen < 1 || cmdlen > 31) + goto bad_size; + break; + + case O_DSCP: + if (cmdlen != F_INSN_SIZE(ipfw_insn_u32) + 1) + goto bad_size; + break; + + case O_MAC_TYPE: + case O_IP_SRCPORT: + case O_IP_DSTPORT: /* XXX artificial limit, 30 port pairs */ + if (cmdlen < 2 || cmdlen > 31) + goto bad_size; + break; + + case O_RECV: + case O_XMIT: + case O_VIA: + if (((ipfw_insn_if *)cmd)->name[0] == '\1') + ci->table_opcodes++; + if (cmdlen != F_INSN_SIZE(ipfw_insn_if)) + goto bad_size; + break; + + case O_ALTQ: + if (cmdlen != F_INSN_SIZE(ipfw_insn_altq)) + goto bad_size; + break; + + case O_PIPE: + case O_QUEUE: 
+ if (cmdlen != F_INSN_SIZE(ipfw_insn)) + goto bad_size; + goto check_action; + + case O_FORWARD_IP: + if (cmdlen != F_INSN_SIZE(ipfw_insn_sa)) + goto bad_size; + goto check_action; +#ifdef INET6 + case O_FORWARD_IP6: + if (cmdlen != F_INSN_SIZE(ipfw_insn_sa6)) + goto bad_size; + goto check_action; +#endif /* INET6 */ + + case O_DIVERT: + case O_TEE: + if (ip_divert_ptr == NULL) + return EINVAL; + else + goto check_size; + case O_NETGRAPH: + case O_NGTEE: + if (ng_ipfw_input_p == NULL) + return EINVAL; + else + goto check_size; + case O_NAT: + if (!IPFW_NAT_LOADED) + return EINVAL; + if (cmdlen != F_INSN_SIZE(ipfw_insn_nat)) + goto bad_size; + goto check_action; + case O_FORWARD_MAC: /* XXX not implemented yet */ + case O_CHECK_STATE: + case O_COUNT: + case O_ACCEPT: + case O_DENY: + case O_REJECT: + case O_SETDSCP: +#ifdef INET6 + case O_UNREACH6: +#endif + case O_SKIPTO: + case O_REASS: + case O_CALLRETURN: +check_size: + if (cmdlen != F_INSN_SIZE(ipfw_insn)) + goto bad_size; +check_action: + if (have_action) { + printf("ipfw: opcode %d, multiple actions" + " not allowed\n", + cmd->opcode); + return (EINVAL); + } + have_action = 1; + if (l != cmdlen) { + printf("ipfw: opcode %d, action must be" + " last opcode\n", + cmd->opcode); + return (EINVAL); + } + break; +#ifdef INET6 + case O_IP6_SRC: + case O_IP6_DST: + if (cmdlen != F_INSN_SIZE(struct in6_addr) + + F_INSN_SIZE(ipfw_insn)) + goto bad_size; + break; + + case O_FLOW6ID: + if (cmdlen != F_INSN_SIZE(ipfw_insn_u32) + + ((ipfw_insn_u32 *)cmd)->o.arg1) + goto bad_size; + break; + + case O_IP6_SRC_MASK: + case O_IP6_DST_MASK: + if ( !(cmdlen & 1) || cmdlen > 127) + goto bad_size; + break; + case O_ICMP6TYPE: + if( cmdlen != F_INSN_SIZE( ipfw_insn_icmp6 ) ) + goto bad_size; + break; +#endif + + default: + switch (cmd->opcode) { +#ifndef INET6 + case O_IP6_SRC_ME: + case O_IP6_DST_ME: + case O_EXT_HDR: + case O_IP6: + case O_UNREACH6: + case O_IP6_SRC: + case O_IP6_DST: + case O_FLOW6ID: + case O_IP6_SRC_MASK: + case O_IP6_DST_MASK: + case O_ICMP6TYPE: + printf("ipfw: no IPv6 support in kernel\n"); + return (EPROTONOSUPPORT); +#endif + default: + printf("ipfw: opcode %d, unknown opcode\n", + cmd->opcode); + return (EINVAL); + } + } + } + if (have_action == 0) { + printf("ipfw: missing action\n"); + return (EINVAL); + } + return 0; + +bad_size: + printf("ipfw: opcode %d size %d wrong\n", + cmd->opcode, cmdlen); + return (EINVAL); +} + + +/* + * Translation of requests for compatibility with FreeBSD 7.2/8. + * a static variable tells us if we have an old client from userland, + * and if necessary we translate requests and responses between the + * two formats. + */ +static int is7 = 0; + +struct ip_fw7 { + struct ip_fw7 *next; /* linked list of rules */ + struct ip_fw7 *next_rule; /* ptr to next [skipto] rule */ + /* 'next_rule' is used to pass up 'set_disable' status */ + + uint16_t act_ofs; /* offset of action in 32-bit units */ + uint16_t cmd_len; /* # of 32-bit words in cmd */ + uint16_t rulenum; /* rule number */ + uint8_t set; /* rule set (0..31) */ + // #define RESVD_SET 31 /* set for default and persistent rules */ + uint8_t _pad; /* padding */ + // uint32_t id; /* rule id, only in v.8 */ + /* These fields are present in all rules. 
*/ + uint64_t pcnt; /* Packet counter */ + uint64_t bcnt; /* Byte counter */ + uint32_t timestamp; /* tv_sec of last match */ + + ipfw_insn cmd[1]; /* storage for commands */ +}; + +static int convert_rule_to_7(struct ip_fw_rule0 *rule); +static int convert_rule_to_8(struct ip_fw_rule0 *rule); + +#ifndef RULESIZE7 +#define RULESIZE7(rule) (sizeof(struct ip_fw7) + \ + ((struct ip_fw7 *)(rule))->cmd_len * 4 - 4) +#endif + + +/* + * Copy the static and dynamic rules to the supplied buffer + * and return the amount of space actually used. + * Must be run under IPFW_UH_RLOCK + */ +static size_t +ipfw_getrules(struct ip_fw_chain *chain, void *buf, size_t space) +{ + char *bp = buf; + char *ep = bp + space; + struct ip_fw *rule; + struct ip_fw_rule0 *dst; + int error, i, l, warnflag; + time_t boot_seconds; + + warnflag = 0; + + boot_seconds = boottime.tv_sec; + for (i = 0; i < chain->n_rules; i++) { + rule = chain->map[i]; + + if (is7) { + /* Convert rule to FreeBSd 7.2 format */ + l = RULESIZE7(rule); + if (bp + l + sizeof(uint32_t) <= ep) { + bcopy(rule, bp, l + sizeof(uint32_t)); + error = ipfw_rewrite_table_kidx(chain, + (struct ip_fw_rule0 *)bp); + if (error != 0) + return (0); + error = convert_rule_to_7((struct ip_fw_rule0 *) bp); + if (error) + return 0; /*XXX correct? */ + /* + * XXX HACK. Store the disable mask in the "next" + * pointer in a wild attempt to keep the ABI the same. + * Why do we do this on EVERY rule? + */ + bcopy(&V_set_disable, + &(((struct ip_fw7 *)bp)->next_rule), + sizeof(V_set_disable)); + if (((struct ip_fw7 *)bp)->timestamp) + ((struct ip_fw7 *)bp)->timestamp += boot_seconds; + bp += l; + } + continue; /* go to next rule */ + } + + l = RULEUSIZE0(rule); + if (bp + l > ep) { /* should not happen */ + printf("overflow dumping static rules\n"); + break; + } + dst = (struct ip_fw_rule0 *)bp; + export_rule0(rule, dst, l); + error = ipfw_rewrite_table_kidx(chain, dst); + + /* + * XXX HACK. Store the disable mask in the "next" + * pointer in a wild attempt to keep the ABI the same. + * Why do we do this on EVERY rule? + * + * XXX: "ipfw set show" (ab)uses IP_FW_GET to read disabled mask + * so we need to fail _after_ saving at least one mask. + */ + bcopy(&V_set_disable, &dst->next_rule, sizeof(V_set_disable)); + if (dst->timestamp) + dst->timestamp += boot_seconds; + bp += l; + + if (error != 0) { + if (error == 2) { + /* Non-fatal table rewrite error. */ + warnflag = 1; + continue; + } + printf("Stop on rule %d. Fail to convert table\n", + rule->rulenum); + break; + } + } + if (warnflag != 0) + printf("ipfw: process %s is using legacy interfaces," + " consider rebuilding\n", ""); + ipfw_get_dynamic(chain, &bp, ep); /* protected by the dynamic lock */ + return (bp - (char *)buf); +} + + +struct dump_args { + uint32_t b; /* start rule */ + uint32_t e; /* end rule */ + uint32_t rcount; /* number of rules */ + uint32_t rsize; /* rules size */ + uint32_t tcount; /* number of tables */ + int rcounters; /* counters */ +}; + +/* + * Dumps static rules with table TLVs in buffer @sd. + * + * Returns 0 on success. 
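+ *
+ * Editor's note: each exported record below is laid out as
+ * [ ipfw_obj_tlv | optional ip_fw_bcounter | ip_fw_rule ], i.e. one
+ * rule takes RULEUSIZE1(krule) + sizeof(ipfw_obj_tlv) bytes, plus
+ * sizeof(struct ip_fw_bcounter) when counters were requested.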
+ */ +static int +dump_static_rules(struct ip_fw_chain *chain, struct dump_args *da, + uint32_t *bmask, struct sockopt_data *sd) +{ + int error; + int i, l; + uint32_t tcount; + ipfw_obj_ctlv *ctlv; + struct ip_fw *krule; + caddr_t dst; + + /* Dump table names first (if any) */ + if (da->tcount > 0) { + /* Header first */ + ctlv = (ipfw_obj_ctlv *)ipfw_get_sopt_space(sd, sizeof(*ctlv)); + if (ctlv == NULL) + return (ENOMEM); + ctlv->head.type = IPFW_TLV_TBLNAME_LIST; + ctlv->head.length = da->tcount * sizeof(ipfw_obj_ntlv) + + sizeof(*ctlv); + ctlv->count = da->tcount; + ctlv->objsize = sizeof(ipfw_obj_ntlv); + } + + i = 0; + tcount = da->tcount; + while (tcount > 0) { + if ((bmask[i / 32] & (1 << (i % 32))) == 0) { + i++; + continue; + } + + if ((error = ipfw_export_table_ntlv(chain, i, sd)) != 0) + return (error); + + i++; + tcount--; + } + + /* Dump rules */ + ctlv = (ipfw_obj_ctlv *)ipfw_get_sopt_space(sd, sizeof(*ctlv)); + if (ctlv == NULL) + return (ENOMEM); + ctlv->head.type = IPFW_TLV_RULE_LIST; + ctlv->head.length = da->rsize + sizeof(*ctlv); + ctlv->count = da->rcount; + + for (i = da->b; i < da->e; i++) { + krule = chain->map[i]; + + l = RULEUSIZE1(krule) + sizeof(ipfw_obj_tlv); + if (da->rcounters != 0) + l += sizeof(struct ip_fw_bcounter); + dst = (caddr_t)ipfw_get_sopt_space(sd, l); + if (dst == NULL) + return (ENOMEM); + + export_rule1(krule, dst, l, da->rcounters); + } + + return (0); +} + +/* + * Dumps requested objects data + * Data layout (version 0)(current): + * Request: [ ipfw_cfg_lheader ] + IPFW_CFG_GET_* flags + * size = ipfw_cfg_lheader.size + * Reply: [ ipfw_cfg_lheader + * [ ipfw_obj_ctlv(IPFW_TLV_TBL_LIST) ipfw_obj_ntlv x N ] (optional) + * [ ipfw_obj_ctlv(IPFW_TLV_RULE_LIST) + * ipfw_obj_tlv(IPFW_TLV_RULE_ENT) [ ip_fw_bcounter (optional) ip_fw_rule ] + * ] (optional) + * [ ipfw_obj_ctlv(IPFW_TLV_STATE_LIST) ipfw_obj_dyntlv x N ] (optional) + * ] + * * NOTE IPFW_TLV_STATE_LIST has the single valid field: objsize. + * The rest (size, count) are set to zero and needs to be ignored. + * + * Returns 0 on success. + */ +static int +dump_config(struct ip_fw_chain *chain, ip_fw3_opheader *op3, + struct sockopt_data *sd) +{ + ipfw_cfg_lheader *hdr; + struct ip_fw *rule; + size_t sz, rnum; + uint32_t hdr_flags; + int error, i; + struct dump_args da; + uint32_t *bmask; + + hdr = (ipfw_cfg_lheader *)ipfw_get_sopt_header(sd, sizeof(*hdr)); + if (hdr == NULL) + return (EINVAL); + + error = 0; + bmask = NULL; + /* Allocate needed state */ + if (hdr->flags & IPFW_CFG_GET_STATIC) + bmask = malloc(IPFW_TABLES_MAX / 8, M_TEMP, M_WAITOK | M_ZERO); + + IPFW_UH_RLOCK(chain); + + /* + * STAGE 1: Determine size/count for objects in range. + * Prepare used tables bitmask. + */ + sz = sizeof(ipfw_cfg_lheader); + memset(&da, 0, sizeof(da)); + + da.b = 0; + da.e = chain->n_rules; + + if (hdr->end_rule != 0) { + /* Handle custom range */ + if ((rnum = hdr->start_rule) > IPFW_DEFAULT_RULE) + rnum = IPFW_DEFAULT_RULE; + da.b = ipfw_find_rule(chain, rnum, 0); + rnum = hdr->end_rule; + rnum = (rnum < IPFW_DEFAULT_RULE) ? 
rnum+1 : IPFW_DEFAULT_RULE;
+		da.e = ipfw_find_rule(chain, rnum, 0) + 1;
+	}
+
+	if (hdr->flags & IPFW_CFG_GET_STATIC) {
+		for (i = da.b; i < da.e; i++) {
+			rule = chain->map[i];
+			da.rsize += RULEUSIZE1(rule) + sizeof(ipfw_obj_tlv);
+			da.rcount++;
+			da.tcount += ipfw_mark_table_kidx(chain, rule, bmask);
+		}
+		/* Add counters if requested */
+		if (hdr->flags & IPFW_CFG_GET_COUNTERS) {
+			da.rsize += sizeof(struct ip_fw_bcounter) * da.rcount;
+			da.rcounters = 1;
+		}
+
+		if (da.tcount > 0)
+			sz += da.tcount * sizeof(ipfw_obj_ntlv) +
+			    sizeof(ipfw_obj_ctlv);
+		sz += da.rsize + sizeof(ipfw_obj_ctlv);
+	}
+
+	if (hdr->flags & IPFW_CFG_GET_STATES)
+		sz += ipfw_dyn_get_count() * sizeof(ipfw_obj_dyntlv) +
+		    sizeof(ipfw_obj_ctlv);
+
+	/*
+	 * Fill header anyway.
+	 * Note we have to save header fields to stable storage,
+	 * since the buffer inside @sd can be flushed after dumping rules.
+	 */
+	hdr->size = sz;
+	hdr->set_mask = ~V_set_disable;
+	hdr_flags = hdr->flags;
+	hdr = NULL;
+
+	if (sd->valsize < sz) {
+		error = ENOMEM;
+		goto cleanup;
+	}
+
+	/* STAGE2: Store actual data */
+	if (hdr_flags & IPFW_CFG_GET_STATIC) {
+		error = dump_static_rules(chain, &da, bmask, sd);
+		if (error != 0)
+			goto cleanup;
+	}
+
+	if (hdr_flags & IPFW_CFG_GET_STATES)
+		error = ipfw_dump_states(chain, sd);
+
+cleanup:
+	IPFW_UH_RUNLOCK(chain);
+
+	if (bmask != NULL)
+		free(bmask, M_TEMP);
+
+	return (error);
+}
+
+static int
+check_object_name(ipfw_obj_ntlv *ntlv)
+{
+	int error;
+
+	switch (ntlv->head.type) {
+	case IPFW_TLV_TBL_NAME:
+		error = ipfw_check_table_name(ntlv->name);
+		break;
+	default:
+		error = ENOTSUP;
+	}
+
+	return (error);
+}
+
+/*
+ * Adds one or more rules to ipfw @chain.
+ * Data layout (version 0)(current):
+ * Request:
+ * [
+ *   ip_fw3_opheader
+ *   [ ipfw_obj_ctlv(IPFW_TLV_TBL_LIST) ipfw_obj_ntlv x N ] (optional *1)
+ *   [ ipfw_obj_ctlv(IPFW_TLV_RULE_LIST) ip_fw x N ] (*2) (*3)
+ * ]
+ * Reply:
+ * [
+ *   ip_fw3_opheader
+ *   [ ipfw_obj_ctlv(IPFW_TLV_TBL_LIST) ipfw_obj_ntlv x N ] (optional)
+ *   [ ipfw_obj_ctlv(IPFW_TLV_RULE_LIST) ip_fw x N ]
+ * ]
+ *
+ * Rules in reply are modified to store their actual ruleset number.
+ *
+ * (*1) TLVs inside IPFW_TLV_TBL_LIST need to be sorted ascending
+ * according to their idx field and there must be no duplicates.
+ * (*2) Numbered rules inside IPFW_TLV_RULE_LIST need to be sorted ascending.
+ * (*3) Each ip_fw structure needs to be aligned to u64 boundary.
+ *
+ * Returns 0 on success.
+ */
+static int
+add_rules(struct ip_fw_chain *chain, ip_fw3_opheader *op3,
+    struct sockopt_data *sd)
+{
+	ipfw_obj_ctlv *ctlv, *rtlv, *tstate;
+	ipfw_obj_ntlv *ntlv;
+	int clen, error, idx;
+	uint32_t count, read;
+	struct ip_fw_rule *r;
+	struct rule_check_info rci, *ci, *cbuf;
+	int i, rsize;
+
+	op3 = (ip_fw3_opheader *)ipfw_get_sopt_space(sd, sd->valsize);
+	ctlv = (ipfw_obj_ctlv *)(op3 + 1);
+
+	read = sizeof(ip_fw3_opheader);
+	rtlv = NULL;
+	tstate = NULL;
+	cbuf = NULL;
+	memset(&rci, 0, sizeof(struct rule_check_info));
+
+	if (read + sizeof(*ctlv) > sd->valsize)
+		return (EINVAL);
+
+	if (ctlv->head.type == IPFW_TLV_TBLNAME_LIST) {
+		clen = ctlv->head.length;
+		/* Check size and alignment */
+		if (clen > sd->valsize || clen < sizeof(*ctlv))
+			return (EINVAL);
+		if ((clen % sizeof(uint64_t)) != 0)
+			return (EINVAL);
+
+		/*
+		 * Some table names or other named objects.
+		 * Check them for validity.
+		 */
+		count = (ctlv->head.length - sizeof(*ctlv)) / sizeof(*ntlv);
+		if (ctlv->count != count || ctlv->objsize != sizeof(*ntlv))
+			return (EINVAL);
+
+		/*
+		 * Check each TLV.
+		 * Ensure TLVs are sorted ascending and
+		 * there are no duplicates.
+		 */
+		idx = -1;
+		ntlv = (ipfw_obj_ntlv *)(ctlv + 1);
+		while (count > 0) {
+			if (ntlv->head.length != sizeof(ipfw_obj_ntlv))
+				return (EINVAL);
+
+			error = check_object_name(ntlv);
+			if (error != 0)
+				return (error);
+
+			if (ntlv->idx <= idx)
+				return (EINVAL);
+
+			idx = ntlv->idx;
+			count--;
+			ntlv++;
+		}
+
+		tstate = ctlv;
+		read += ctlv->head.length;
+		ctlv = (ipfw_obj_ctlv *)((caddr_t)ctlv + ctlv->head.length);
+	}
+
+	if (read + sizeof(*ctlv) > sd->valsize)
+		return (EINVAL);
+
+	if (ctlv->head.type == IPFW_TLV_RULE_LIST) {
+		clen = ctlv->head.length;
+		if (clen + read > sd->valsize || clen < sizeof(*ctlv))
+			return (EINVAL);
+		if ((clen % sizeof(uint64_t)) != 0)
+			return (EINVAL);
+
+		/*
+		 * TODO: Permit adding multiple rules at once
+		 */
+		if (ctlv->count != 1)
+			return (ENOTSUP);
+
+		clen -= sizeof(*ctlv);
+
+		if (ctlv->count > clen / sizeof(struct ip_fw_rule))
+			return (EINVAL);
+
+		/* Allocate state for each rule or use stack */
+		if (ctlv->count == 1) {
+			memset(&rci, 0, sizeof(struct rule_check_info));
+			cbuf = &rci;
+		} else
+			cbuf = malloc(ctlv->count * sizeof(*ci), M_TEMP,
+			    M_WAITOK | M_ZERO);
+		ci = cbuf;
+
+		/*
+		 * Check each rule for validity.
+		 * Ensure numbered rules are sorted ascending
+		 * and properly aligned.
+		 */
+		idx = 0;
+		r = (struct ip_fw_rule *)(ctlv + 1);
+		count = 0;
+		error = 0;
+		while (clen > 0) {
+			rsize = roundup2(RULESIZE(r), sizeof(uint64_t));
+			if (rsize > clen || ctlv->count <= count) {
+				error = EINVAL;
+				break;
+			}
+
+			ci->ctlv = tstate;
+			error = check_ipfw_rule1(r, rsize, ci);
+			if (error != 0)
+				break;
+
+			/* Check sorting */
+			if (r->rulenum != 0 && r->rulenum < idx) {
+				printf("rulenum %d idx %d\n", r->rulenum, idx);
+				error = EINVAL;
+				break;
+			}
+			idx = r->rulenum;
+
+			ci->urule = (caddr_t)r;
+
+			rsize = roundup2(rsize, sizeof(uint64_t));
+			clen -= rsize;
+			r = (struct ip_fw_rule *)((caddr_t)r + rsize);
+			count++;
+			ci++;
+		}
+
+		if (ctlv->count != count || error != 0) {
+			if (cbuf != &rci)
+				free(cbuf, M_TEMP);
+			return (EINVAL);
+		}
+
+		rtlv = ctlv;
+		read += ctlv->head.length;
+		ctlv = (ipfw_obj_ctlv *)((caddr_t)ctlv + ctlv->head.length);
+	}
+
+	if (read != sd->valsize || rtlv == NULL || rtlv->count == 0) {
+		if (cbuf != NULL && cbuf != &rci)
+			free(cbuf, M_TEMP);
+		return (EINVAL);
+	}
+
+	/*
+	 * Passed rules seem to be valid.
+	 * Allocate storage and try to add them to chain.
+	 */
+	for (i = 0, ci = cbuf; i < rtlv->count; i++, ci++) {
+		clen = RULEKSIZE1((struct ip_fw_rule *)ci->urule);
+		ci->krule = ipfw_alloc_rule(chain, clen);
+		import_rule1(ci);
+	}
+
+	if ((error = commit_rules(chain, cbuf, rtlv->count)) != 0) {
+		/* Free allocated krules (and their counters) */
+		for (i = 0, ci = cbuf; i < rtlv->count; i++, ci++)
+			free_rule(ci->krule);
+	}
+
+	if (cbuf != NULL && cbuf != &rci)
+		free(cbuf, M_TEMP);
+
+	return (error);
+}
+
+/*
+ * Lists all sopts currently registered.
+ * Data layout (v0)(current):
+ * Request: [ ipfw_obj_lheader ], size = ipfw_obj_lheader.size
+ * Reply: [ ipfw_obj_lheader ipfw_sopt_info x N ]
+ *
+ * Returns 0 on success
+ */
+static int
+dump_soptcodes(struct ip_fw_chain *chain, ip_fw3_opheader *op3,
+    struct sockopt_data *sd)
+{
+	struct _ipfw_obj_lheader *olh;
+	ipfw_sopt_info *i;
+	struct ipfw_sopt_handler *sh;
+	uint32_t count, n, size;
+
+	olh = (struct _ipfw_obj_lheader *)ipfw_get_sopt_header(sd,
+	    sizeof(*olh));
+	if (olh == NULL)
+		return (EINVAL);
+	if (sd->valsize < olh->size)
+		return (EINVAL);
+
+	CTL3_LOCK();
+	count = ctl3_hsize;
+	size = count * sizeof(ipfw_sopt_info) + sizeof(ipfw_obj_lheader);
+
+	/* Fill in header regardless of buffer size */
+	olh->count = count;
+	olh->objsize = sizeof(ipfw_sopt_info);
+
+	if (size > olh->size) {
+		olh->size = size;
+		CTL3_UNLOCK();
+		return (ENOMEM);
+	}
+	olh->size = size;
+
+	for (n = 0; n < count; n++) {
+		i = (ipfw_sopt_info *)ipfw_get_sopt_space(sd, sizeof(*i));
+		KASSERT(i != NULL, ("previously checked buffer is not enough"));
+		sh = &ctl3_handlers[n];
+		i->opcode = sh->opcode;
+		i->version = sh->version;
+		i->refcnt = sh->refcnt;
+	}
+	CTL3_UNLOCK();
+
+	return (0);
+}
+
+/*
+ * Compares two sopt handlers (code, version and handler ptr).
+ * Used by both qsort() and bsearch().
+ * Does not compare the handler pointers in the latter case, since
+ * the bsearch key passes handler == NULL.
+ *
+ * Returns 0 if match is found.
+ */
+static int
+compare_sh(const void *_a, const void *_b)
+{
+	const struct ipfw_sopt_handler *a, *b;
+
+	a = (const struct ipfw_sopt_handler *)_a;
+	b = (const struct ipfw_sopt_handler *)_b;
+
+	if (a->opcode < b->opcode)
+		return (-1);
+	else if (a->opcode > b->opcode)
+		return (1);
+
+	if (a->version < b->version)
+		return (-1);
+	else if (a->version > b->version)
+		return (1);
+
+	/* bsearch helper */
+	if (a->handler == NULL)
+		return (0);
+
+	if ((uintptr_t)a->handler < (uintptr_t)b->handler)
+		return (-1);
+	else if ((uintptr_t)a->handler > (uintptr_t)b->handler)
+		return (1);
+
+	return (0);
+}
+
+/*
+ * Finds sopt handler based on @code and @version.
+ *
+ * Returns pointer to handler or NULL.
+ */
+static struct ipfw_sopt_handler *
+find_sh(uint16_t code, uint8_t version, sopt_handler_f *handler)
+{
+	struct ipfw_sopt_handler *sh, h;
+
+	memset(&h, 0, sizeof(h));
+	h.opcode = code;
+	h.version = version;
+	h.handler = handler;
+
+	sh = (struct ipfw_sopt_handler *)bsearch(&h, ctl3_handlers,
+	    ctl3_hsize, sizeof(h), compare_sh);
+
+	return (sh);
+}
+
+static int
+find_ref_sh(uint16_t opcode, uint8_t version, struct ipfw_sopt_handler *psh)
+{
+	struct ipfw_sopt_handler *sh;
+
+	CTL3_LOCK();
+	if ((sh = find_sh(opcode, version, NULL)) == NULL) {
+		CTL3_UNLOCK();
+		printf("ipfw: ipfw_ctl3 invalid option %dv%d\n",
+		    opcode, version);
+		return (EINVAL);
+	}
+	sh->refcnt++;
+	ctl3_refct++;
+	/* Copy handler data to requested buffer */
+	*psh = *sh;
+	CTL3_UNLOCK();
+
+	return (0);
+}
+
+static void
+find_unref_sh(struct ipfw_sopt_handler *psh)
+{
+	struct ipfw_sopt_handler *sh;
+
+	CTL3_LOCK();
+	sh = find_sh(psh->opcode, psh->version, NULL);
+	KASSERT(sh != NULL, ("ctl3 handler disappeared"));
+	sh->refcnt--;
+	ctl3_refct--;
+	CTL3_UNLOCK();
+}
+
+void
+ipfw_init_sopt_handler()
+{
+
+	CTL3_LOCK_INIT();
+	IPFW_ADD_SOPT_HANDLER(1, scodes);
+}
+
+void
+ipfw_destroy_sopt_handler()
+{
+
+	IPFW_DEL_SOPT_HANDLER(1, scodes);
+	CTL3_LOCK_DESTROY();
+}
+
+/*
+ * Adds one or more sockopt handlers to the global array.
+ * Function may sleep.
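+ *
+ * Usage sketch (editor's addition): a subsystem declares its own
+ * static struct ipfw_sopt_handler table and registers it from its
+ * init hook via IPFW_ADD_SOPT_HANDLER(1, table), exactly as
+ * ipfw_init_sopt_handler() does with scodes[] above.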
+ */
+void
+ipfw_add_sopt_handler(struct ipfw_sopt_handler *sh, size_t count)
+{
+	size_t sz;
+	struct ipfw_sopt_handler *tmp;
+
+	CTL3_LOCK();
+
+	for (;;) {
+		sz = ctl3_hsize + count;
+		CTL3_UNLOCK();
+		tmp = malloc(sizeof(*sh) * sz, M_IPFW, M_WAITOK | M_ZERO);
+		CTL3_LOCK();
+		if (ctl3_hsize + count <= sz)
+			break;
+
+		/* Retry */
+		free(tmp, M_IPFW);
+	}
+
+	/* Merge old & new arrays */
+	sz = ctl3_hsize + count;
+	memcpy(tmp, ctl3_handlers, ctl3_hsize * sizeof(*sh));
+	memcpy(&tmp[ctl3_hsize], sh, count * sizeof(*sh));
+	qsort(tmp, sz, sizeof(*sh), compare_sh);
+	/* Switch new and free old */
+	if (ctl3_handlers != NULL)
+		free(ctl3_handlers, M_IPFW);
+	ctl3_handlers = tmp;
+	ctl3_hsize = sz;
+	ctl3_gencnt++;
+
+	CTL3_UNLOCK();
+}
+
+/*
+ * Removes one or more sockopt handlers from the global array.
+ */
+int
+ipfw_del_sopt_handler(struct ipfw_sopt_handler *sh, size_t count)
+{
+	size_t sz;
+	struct ipfw_sopt_handler *tmp, *h;
+	int i;
+
+	CTL3_LOCK();
+
+	for (i = 0; i < count; i++) {
+		tmp = &sh[i];
+		h = find_sh(tmp->opcode, tmp->version, tmp->handler);
+		if (h == NULL)
+			continue;
+
+		sz = (ctl3_handlers + ctl3_hsize - (h + 1)) * sizeof(*h);
+		memmove(h, h + 1, sz);
+		ctl3_hsize--;
+	}
+
+	if (ctl3_hsize == 0) {
+		if (ctl3_handlers != NULL)
+			free(ctl3_handlers, M_IPFW);
+		ctl3_handlers = NULL;
+	}
+
+	ctl3_gencnt++;
+
+	CTL3_UNLOCK();
+
+	return (0);
+}
+
+/*
+ * Writes data accumulated in @sd to sockopt buffer.
+ * Zeroes internal @sd buffer.
+ */
+static int
+ipfw_flush_sopt_data(struct sockopt_data *sd)
+{
+	struct sockopt *sopt;
+	int error;
+	size_t sz;
+
+	sz = sd->koff;
+	if (sz == 0)
+		return (0);
+
+	sopt = sd->sopt;
+
+	if (sopt->sopt_dir == SOPT_GET) {
+		error = copyout(sd->kbuf, sopt->sopt_val, sz);
+		if (error != 0)
+			return (error);
+	}
+
+	memset(sd->kbuf, 0, sd->ksize);
+	sd->ktotal += sz;
+	sd->koff = 0;
+	if (sd->ktotal + sd->ksize < sd->valsize)
+		sd->kavail = sd->ksize;
+	else
+		sd->kavail = sd->valsize - sd->ktotal;
+
+	/* Update sopt buffer data */
+	sopt->sopt_valsize = sd->ktotal;
+	sopt->sopt_val = sd->sopt_val + sd->ktotal;
+
+	return (0);
+}
+
+/*
+ * Ensures that @sd buffer has a contiguous @needed number of
+ * bytes.
+ *
+ * Returns pointer to requested space or NULL.
+ */
+caddr_t
+ipfw_get_sopt_space(struct sockopt_data *sd, size_t needed)
+{
+	int error;
+	caddr_t addr;
+
+	if (sd->kavail < needed) {
+		/*
+		 * Flush data and try again.
+		 */
+		error = ipfw_flush_sopt_data(sd);
+
+		if (sd->kavail < needed || error != 0)
+			return (NULL);
+	}
+
+	addr = sd->kbuf + sd->koff;
+	sd->koff += needed;
+	sd->kavail -= needed;
+	return (addr);
+}
+
+/*
+ * Requests @needed contiguous bytes from @sd buffer.
+ * Function is used to notify the subsystem that we are
+ * interested in the first @needed bytes (request header),
+ * so the rest of the buffer can be safely zeroed.
+ *
+ * Returns pointer to requested space or NULL.
+ */
+caddr_t
+ipfw_get_sopt_header(struct sockopt_data *sd, size_t needed)
+{
+	caddr_t addr;
+
+	if ((addr = ipfw_get_sopt_space(sd, needed)) == NULL)
+		return (NULL);
+
+	if (sd->kavail > 0)
+		memset(sd->kbuf + sd->koff, 0, sd->kavail);
+
+	return (addr);
+}
+
+/*
+ * New sockopt handler.
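+ *
+ * Editor's note on the buffer strategy below: requests up to 256 bytes
+ * run from the on-stack xbuf; larger SET requests get one contiguous
+ * buffer (capped at CTL3_LARGEBUF), while larger GET requests use a
+ * CTL3_SMALLBUF sliding window that is flushed into the wired user
+ * buffer as it fills.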
+ */
+int
+ipfw_ctl3(struct sockopt *sopt)
+{
+	int error, locked;
+	size_t size, valsize;
+	struct ip_fw_chain *chain;
+	char xbuf[256];
+	struct sockopt_data sdata;
+	struct ipfw_sopt_handler h;
+	ip_fw3_opheader *op3 = NULL;
+
+	error = priv_check(sopt->sopt_td, PRIV_NETINET_IPFW);
+	if (error != 0)
+		return (error);
+
+	if (sopt->sopt_name != IP_FW3)
+		return (ipfw_ctl(sopt));
+
+	chain = &V_layer3_chain;
+	error = 0;
+
+	/* Save original valsize before it is altered via sooptcopyin() */
+	valsize = sopt->sopt_valsize;
+	memset(&sdata, 0, sizeof(sdata));
+	/* Read op3 header first to determine actual operation */
+	op3 = (ip_fw3_opheader *)xbuf;
+	error = sooptcopyin(sopt, op3, sizeof(*op3), sizeof(*op3));
+	if (error != 0)
+		return (error);
+	sopt->sopt_valsize = valsize;
+
+	/*
+	 * Find and reference command.
+	 */
+	error = find_ref_sh(op3->opcode, op3->version, &h);
+	if (error != 0)
+		return (error);
+
+	/*
+	 * Disallow modifications in really-really secure mode, but still allow
+	 * the logging counters to be reset.
+	 */
+	if ((h.dir & HDIR_SET) != 0 && h.opcode != IP_FW_XRESETLOG) {
+		error = securelevel_ge(sopt->sopt_td->td_ucred, 3);
+		if (error != 0) {
+			find_unref_sh(&h);
+			return (error);
+		}
+	}
+
+	/*
+	 * Fill in sockopt_data structure that may be useful for
+	 * IP_FW3 get requests.
+	 */
+	locked = 0;
+	if (valsize <= sizeof(xbuf)) {
+		/* use on-stack buffer */
+		sdata.kbuf = xbuf;
+		sdata.ksize = sizeof(xbuf);
+		sdata.kavail = valsize;
+	} else {
+
+		/*
+		 * Determine opcode type/buffer size:
+		 * allocate sliding-window buf for data export or
+		 * contiguous buffer for special ops.
+		 */
+		if ((h.dir & HDIR_SET) != 0) {
+			/* Set request. Allocate contiguous buffer. */
+			if (valsize > CTL3_LARGEBUF) {
+				find_unref_sh(&h);
+				return (EFBIG);
+			}
+
+			size = valsize;
+		} else {
+			/* Get request. Allocate sliding window buffer */
+			size = (valsize < CTL3_SMALLBUF) ?
+			    valsize : CTL3_SMALLBUF;
+
+			if (size < valsize) {
+				/* We have to wire user buffer */
+				error = vslock(sopt->sopt_val, valsize);
+				if (error != 0)
+					return (error);
+				locked = 1;
+			}
+		}
+
+		sdata.kbuf = malloc(size, M_TEMP, M_WAITOK | M_ZERO);
+		sdata.ksize = size;
+		sdata.kavail = size;
+	}
+
+	sdata.sopt = sopt;
+	sdata.sopt_val = sopt->sopt_val;
+	sdata.valsize = valsize;
+
+	/*
+	 * Copy either the whole request (if valsize < bsize_max), or
+	 * the first bsize_max bytes, to guarantee to most consumers
+	 * that all necessary data has been copied.
+	 * Anyway, copy not less than sizeof(ip_fw3_opheader).
+	 */
+	if ((error = sooptcopyin(sopt, sdata.kbuf, sdata.ksize,
+	    sizeof(ip_fw3_opheader))) != 0)
+		return (error);
+	op3 = (ip_fw3_opheader *)sdata.kbuf;
+
+	/* Finally, run handler */
+	error = h.handler(chain, op3, &sdata);
+	find_unref_sh(&h);
+
+	/* Flush state and free buffers */
+	if (error == 0)
+		error = ipfw_flush_sopt_data(&sdata);
+	else
+		ipfw_flush_sopt_data(&sdata);
+
+	if (locked != 0)
+		vsunlock(sdata.sopt_val, valsize);
+
+	/* Restore original pointer and set number of bytes written */
+	sopt->sopt_val = sdata.sopt_val;
+	sopt->sopt_valsize = sdata.ktotal;
+	if (sdata.kbuf != xbuf)
+		free(sdata.kbuf, M_TEMP);
+
+	return (error);
+}
+
+/**
+ * {set|get}sockopt parser.
+ */
+int
+ipfw_ctl(struct sockopt *sopt)
+{
+#define	RULE_MAXSIZE	(512 * sizeof(u_int32_t))
+	int error;
+	size_t size, valsize;
+	struct ip_fw *buf;
+	struct ip_fw_rule0 *rule;
+	struct ip_fw_chain *chain;
+	u_int32_t rulenum[2];
+	uint32_t opt;
+	struct rule_check_info ci;
+	IPFW_RLOCK_TRACKER;
+
+	chain = &V_layer3_chain;
+	error = 0;
+
+	/* Save original valsize before it is altered via sooptcopyin() */
+	valsize = sopt->sopt_valsize;
+	opt = sopt->sopt_name;
+
+	/*
+	 * Disallow modifications in really-really secure mode, but still allow
+	 * the logging counters to be reset.
+	 */
+	if (opt == IP_FW_ADD ||
+	    (sopt->sopt_dir == SOPT_SET && opt != IP_FW_RESETLOG)) {
+		error = securelevel_ge(sopt->sopt_td->td_ucred, 3);
+		if (error != 0)
+			return (error);
+	}
+
+	switch (opt) {
+	case IP_FW_GET:
+		/*
+		 * pass up a copy of the current rules. Static rules
+		 * come first (the last of which has number IPFW_DEFAULT_RULE),
+		 * followed by a possibly empty list of dynamic rules.
+		 * The last dynamic rule has NULL in the "next" field.
+		 *
+		 * Note that the calculated size is used to bound the
+		 * amount of data returned to the user. The rule set may
+		 * change between calculating the size and returning the
+		 * data in which case we'll just return what fits.
+		 */
+		for (;;) {
+			int len = 0, want;
+
+			size = chain->static_len;
+			size += ipfw_dyn_len();
+			if (size >= sopt->sopt_valsize)
+				break;
+			buf = malloc(size, M_TEMP, M_WAITOK | M_ZERO);
+			IPFW_UH_RLOCK(chain);
+			/* check again how much space we need */
+			want = chain->static_len + ipfw_dyn_len();
+			if (size >= want)
+				len = ipfw_getrules(chain, buf, size);
+			IPFW_UH_RUNLOCK(chain);
+			if (size >= want)
+				error = sooptcopyout(sopt, buf, len);
+			free(buf, M_TEMP);
+			if (size >= want)
+				break;
+		}
+		break;
+
+	case IP_FW_FLUSH:
+		/* locking is done within del_entry() */
+		error = del_entry(chain, 0); /* special case, rule=0, cmd=0 means all */
+		break;
+
+	case IP_FW_ADD:
+		rule = malloc(RULE_MAXSIZE, M_TEMP, M_WAITOK);
+		error = sooptcopyin(sopt, rule, RULE_MAXSIZE,
+		    sizeof(struct ip_fw7));
+
+		memset(&ci, 0, sizeof(struct rule_check_info));
+
+		/*
+		 * If the size of commands equals RULESIZE7 then we assume
+		 * a FreeBSD7.2 binary is talking to us (set is7=1).
+		 * is7 is persistent so the next 'ipfw list' command
+		 * will use this format.
+		 * NOTE: If the wrong version is guessed (this can happen if
+		 *	 the first ipfw command is 'ipfw [pipe] list')
+		 *	 the ipfw binary may crash or loop infinitely...
+		 */
+		size = sopt->sopt_valsize;
+		if (size == RULESIZE7(rule)) {
+			is7 = 1;
+			error = convert_rule_to_8(rule);
+			if (error) {
+				free(rule, M_TEMP);
+				return error;
+			}
+			size = RULESIZE(rule);
+		} else
+			is7 = 0;
+		if (error == 0)
+			error = check_ipfw_rule0(rule, size, &ci);
+		if (error == 0) {
+			/* locking is done within add_rule() */
+			struct ip_fw *krule;
+			krule = ipfw_alloc_rule(chain, RULEKSIZE0(rule));
+			ci.urule = (caddr_t)rule;
+			ci.krule = krule;
+			import_rule0(&ci);
+			error = commit_rules(chain, &ci, 1);
+			if (!error && sopt->sopt_dir == SOPT_GET) {
+				if (is7) {
+					error = convert_rule_to_7(rule);
+					size = RULESIZE7(rule);
+					if (error) {
+						free(rule, M_TEMP);
+						return error;
+					}
+				}
+				error = sooptcopyout(sopt, rule, size);
+			}
+		}
+		free(rule, M_TEMP);
+		break;
+
+	case IP_FW_DEL:
+		/*
+		 * IP_FW_DEL is used for deleting single rules or sets,
+		 * and (ab)used to atomically manipulate sets.  Argument size
+		 * is used to distinguish between the two:
+		 *    sizeof(u_int32_t)
+		 *	delete single rule or set of rules,
+		 *	or reassign rules (or sets) to a different set.
+		 *    2*sizeof(u_int32_t)
+		 *	atomic disable/enable sets.
+		 *	first u_int32_t contains sets to be disabled,
+		 *	second u_int32_t contains sets to be enabled.
+		 */
+		error = sooptcopyin(sopt, rulenum,
+		    2*sizeof(u_int32_t), sizeof(u_int32_t));
+		if (error)
+			break;
+		size = sopt->sopt_valsize;
+		if (size == sizeof(u_int32_t) && rulenum[0] != 0) {
+			/* delete or reassign, locking done in del_entry() */
+			error = del_entry(chain, rulenum[0]);
+		} else if (size == 2*sizeof(u_int32_t)) { /* set enable/disable */
+			IPFW_UH_WLOCK(chain);
+			V_set_disable =
+			    (V_set_disable | rulenum[0]) & ~rulenum[1] &
+			    ~(1<<RESVD_SET); /* set RESVD_SET always enabled */
+			IPFW_UH_WUNLOCK(chain);
+		} else
+			error = EINVAL;
+		break;
+
+	case IP_FW_ZERO:
+	case IP_FW_RESETLOG: /* argument is a u_int32_t, the rule number */
+		rulenum[0] = 0;
+		if (sopt->sopt_val != 0) {
+			error = sooptcopyin(sopt, rulenum,
+			    sizeof(u_int32_t), sizeof(u_int32_t));
+			if (error)
+				break;
+		}
+		error = zero_entry(chain, rulenum[0],
+		    sopt->sopt_name == IP_FW_RESETLOG);
+		break;
+
+	/*--- TABLE opcodes ---*/
+	case IP_FW_TABLE_ADD:
+	case IP_FW_TABLE_DEL:
+		{
+			ipfw_table_entry ent;
+			struct tentry_info tei;
+			struct tid_info ti;
+			struct table_value v;
+
+			error = sooptcopyin(sopt, &ent,
+			    sizeof(ent), sizeof(ent));
+			if (error)
+				break;
+
+			memset(&tei, 0, sizeof(tei));
+			tei.paddr = &ent.addr;
+			tei.subtype = AF_INET;
+			tei.masklen = ent.masklen;
+			ipfw_import_table_value_legacy(ent.value, &v);
+			tei.pvalue = &v;
+			memset(&ti, 0, sizeof(ti));
+			ti.uidx = ent.tbl;
+			ti.type = IPFW_TABLE_CIDR;
+
+			error = (opt == IP_FW_TABLE_ADD) ?
+			    add_table_entry(chain, &ti, &tei, 0, 1) :
+			    del_table_entry(chain, &ti, &tei, 0, 1);
+		}
+		break;
+
+
+	case IP_FW_TABLE_FLUSH:
+		{
+			u_int16_t tbl;
+			struct tid_info ti;
+
+			error = sooptcopyin(sopt, &tbl,
+			    sizeof(tbl), sizeof(tbl));
+			if (error)
+				break;
+			memset(&ti, 0, sizeof(ti));
+			ti.uidx = tbl;
+			error = flush_table(chain, &ti);
+		}
+		break;
+
+	case IP_FW_TABLE_GETSIZE:
+		{
+			u_int32_t tbl, cnt;
+			struct tid_info ti;
+
+			if ((error = sooptcopyin(sopt, &tbl, sizeof(tbl),
+			    sizeof(tbl))))
+				break;
+			memset(&ti, 0, sizeof(ti));
+			ti.uidx = tbl;
+			IPFW_RLOCK(chain);
+			error = ipfw_count_table(chain, &ti, &cnt);
+			IPFW_RUNLOCK(chain);
+			if (error)
+				break;
+			error = sooptcopyout(sopt, &cnt, sizeof(cnt));
+		}
+		break;
+
+	case IP_FW_TABLE_LIST:
+		{
+			ipfw_table *tbl;
+			struct tid_info ti;
+
+			if (sopt->sopt_valsize < sizeof(*tbl)) {
+				error = EINVAL;
+				break;
+			}
+			size = sopt->sopt_valsize;
+			tbl = malloc(size, M_TEMP, M_WAITOK);
+			error = sooptcopyin(sopt, tbl, size, sizeof(*tbl));
+			if (error) {
+				free(tbl, M_TEMP);
+				break;
+			}
+			tbl->size = (size - sizeof(*tbl)) /
+			    sizeof(ipfw_table_entry);
+			memset(&ti, 0, sizeof(ti));
+			ti.uidx = tbl->tbl;
+			IPFW_RLOCK(chain);
+			error = ipfw_dump_table_legacy(chain, &ti, tbl);
+			IPFW_RUNLOCK(chain);
+			if (error) {
+				free(tbl, M_TEMP);
+				break;
+			}
+			error = sooptcopyout(sopt, tbl, size);
+			free(tbl, M_TEMP);
+		}
+		break;
+
+	/*--- NAT operations are protected by the IPFW_LOCK ---*/
+	case IP_FW_NAT_CFG:
+		if (IPFW_NAT_LOADED)
+			error = ipfw_nat_cfg_ptr(sopt);
+		else {
+			printf("IP_FW_NAT_CFG: %s\n",
+			    "ipfw_nat not present, please load it");
+			error = EINVAL;
+		}
+		break;
+
+	case IP_FW_NAT_DEL:
+		if (IPFW_NAT_LOADED)
+			error = ipfw_nat_del_ptr(sopt);
+		else {
+			printf("IP_FW_NAT_DEL: %s\n",
+			    "ipfw_nat not present, please load it");
+			error = EINVAL;
+		}
+		break;
+
+	case IP_FW_NAT_GET_CONFIG:
+		if (IPFW_NAT_LOADED)
+			error = ipfw_nat_get_cfg_ptr(sopt);
+		else {
+			printf("IP_FW_NAT_GET_CFG: %s\n",
+			    "ipfw_nat not
present, please load it"); + error = EINVAL; + } + break; + + case IP_FW_NAT_GET_LOG: + if (IPFW_NAT_LOADED) + error = ipfw_nat_get_log_ptr(sopt); + else { + printf("IP_FW_NAT_GET_LOG: %s\n", + "ipfw_nat not present, please load it"); + error = EINVAL; + } + break; + + default: + printf("ipfw: ipfw_ctl invalid option %d\n", sopt->sopt_name); + error = EINVAL; + } + + return (error); +#undef RULE_MAXSIZE +} +#define RULE_MAXSIZE (256*sizeof(u_int32_t)) + +/* Functions to convert rules 7.2 <==> 8.0 */ +static int +convert_rule_to_7(struct ip_fw_rule0 *rule) +{ + /* Used to modify original rule */ + struct ip_fw7 *rule7 = (struct ip_fw7 *)rule; + /* copy of original rule, version 8 */ + struct ip_fw_rule0 *tmp; + + /* Used to copy commands */ + ipfw_insn *ccmd, *dst; + int ll = 0, ccmdlen = 0; + + tmp = malloc(RULE_MAXSIZE, M_TEMP, M_NOWAIT | M_ZERO); + if (tmp == NULL) { + return 1; //XXX error + } + bcopy(rule, tmp, RULE_MAXSIZE); + + /* Copy fields */ + //rule7->_pad = tmp->_pad; + rule7->set = tmp->set; + rule7->rulenum = tmp->rulenum; + rule7->cmd_len = tmp->cmd_len; + rule7->act_ofs = tmp->act_ofs; + rule7->next_rule = (struct ip_fw7 *)tmp->next_rule; + rule7->cmd_len = tmp->cmd_len; + rule7->pcnt = tmp->pcnt; + rule7->bcnt = tmp->bcnt; + rule7->timestamp = tmp->timestamp; + + /* Copy commands */ + for (ll = tmp->cmd_len, ccmd = tmp->cmd, dst = rule7->cmd ; + ll > 0 ; ll -= ccmdlen, ccmd += ccmdlen, dst += ccmdlen) { + ccmdlen = F_LEN(ccmd); + + bcopy(ccmd, dst, F_LEN(ccmd)*sizeof(uint32_t)); + + if (dst->opcode > O_NAT) + /* O_REASS doesn't exists in 7.2 version, so + * decrement opcode if it is after O_REASS + */ + dst->opcode--; + + if (ccmdlen > ll) { + printf("ipfw: opcode %d size truncated\n", + ccmd->opcode); + return EINVAL; + } + } + free(tmp, M_TEMP); + + return 0; +} + +static int +convert_rule_to_8(struct ip_fw_rule0 *rule) +{ + /* Used to modify original rule */ + struct ip_fw7 *rule7 = (struct ip_fw7 *) rule; + + /* Used to copy commands */ + ipfw_insn *ccmd, *dst; + int ll = 0, ccmdlen = 0; + + /* Copy of original rule */ + struct ip_fw7 *tmp = malloc(RULE_MAXSIZE, M_TEMP, M_NOWAIT | M_ZERO); + if (tmp == NULL) { + return 1; //XXX error + } + + bcopy(rule7, tmp, RULE_MAXSIZE); + + for (ll = tmp->cmd_len, ccmd = tmp->cmd, dst = rule->cmd ; + ll > 0 ; ll -= ccmdlen, ccmd += ccmdlen, dst += ccmdlen) { + ccmdlen = F_LEN(ccmd); + + bcopy(ccmd, dst, F_LEN(ccmd)*sizeof(uint32_t)); + + if (dst->opcode > O_NAT) + /* O_REASS doesn't exists in 7.2 version, so + * increment opcode if it is after O_REASS + */ + dst->opcode++; + + if (ccmdlen > ll) { + printf("ipfw: opcode %d size truncated\n", + ccmd->opcode); + return EINVAL; + } + } + + rule->_pad = tmp->_pad; + rule->set = tmp->set; + rule->rulenum = tmp->rulenum; + rule->cmd_len = tmp->cmd_len; + rule->act_ofs = tmp->act_ofs; + rule->next_rule = (struct ip_fw *)tmp->next_rule; + rule->cmd_len = tmp->cmd_len; + rule->id = 0; /* XXX see if is ok = 0 */ + rule->pcnt = tmp->pcnt; + rule->bcnt = tmp->bcnt; + rule->timestamp = tmp->timestamp; + + free (tmp, M_TEMP); + return 0; +} + +/* + * Named object api + * + */ + +/* + * Allocate new bitmask which can be used to enlarge/shrink + * named instance index. 
+ */ +void +ipfw_objhash_bitmap_alloc(uint32_t items, void **idx, int *pblocks) +{ + size_t size; + int max_blocks; + u_long *idx_mask; + + KASSERT((items % BLOCK_ITEMS) == 0, + ("bitmask size needs to power of 2 and greater or equal to %zu", + BLOCK_ITEMS)); + + max_blocks = items / BLOCK_ITEMS; + size = items / 8; + idx_mask = malloc(size * IPFW_MAX_SETS, M_IPFW, M_WAITOK); + /* Mark all as free */ + memset(idx_mask, 0xFF, size * IPFW_MAX_SETS); + *idx_mask &= ~(u_long)1; /* Skip index 0 */ + + *idx = idx_mask; + *pblocks = max_blocks; +} + +/* + * Copy current bitmask index to new one. + */ +void +ipfw_objhash_bitmap_merge(struct namedobj_instance *ni, void **idx, int *blocks) +{ + int old_blocks, new_blocks; + u_long *old_idx, *new_idx; + int i; + + old_idx = ni->idx_mask; + old_blocks = ni->max_blocks; + new_idx = *idx; + new_blocks = *blocks; + + for (i = 0; i < IPFW_MAX_SETS; i++) { + memcpy(&new_idx[new_blocks * i], &old_idx[old_blocks * i], + old_blocks * sizeof(u_long)); + } +} + +/* + * Swaps current @ni index with new one. + */ +void +ipfw_objhash_bitmap_swap(struct namedobj_instance *ni, void **idx, int *blocks) +{ + int old_blocks; + u_long *old_idx; + + old_idx = ni->idx_mask; + old_blocks = ni->max_blocks; + + ni->idx_mask = *idx; + ni->max_blocks = *blocks; + + /* Save old values */ + *idx = old_idx; + *blocks = old_blocks; +} + +void +ipfw_objhash_bitmap_free(void *idx, int blocks) +{ + + free(idx, M_IPFW); +} + +/* + * Creates named hash instance. + * Must be called without holding any locks. + * Return pointer to new instance. + */ +struct namedobj_instance * +ipfw_objhash_create(uint32_t items) +{ + struct namedobj_instance *ni; + int i; + size_t size; + + size = sizeof(struct namedobj_instance) + + sizeof(struct namedobjects_head) * NAMEDOBJ_HASH_SIZE + + sizeof(struct namedobjects_head) * NAMEDOBJ_HASH_SIZE; + + ni = malloc(size, M_IPFW, M_WAITOK | M_ZERO); + ni->nn_size = NAMEDOBJ_HASH_SIZE; + ni->nv_size = NAMEDOBJ_HASH_SIZE; + + ni->names = (struct namedobjects_head *)(ni +1); + ni->values = &ni->names[ni->nn_size]; + + for (i = 0; i < ni->nn_size; i++) + TAILQ_INIT(&ni->names[i]); + + for (i = 0; i < ni->nv_size; i++) + TAILQ_INIT(&ni->values[i]); + + /* Set default hashing/comparison functions */ + ni->hash_f = objhash_hash_name; + ni->cmp_f = objhash_cmp_name; + + /* Allocate bitmask separately due to possible resize */ + ipfw_objhash_bitmap_alloc(items, (void*)&ni->idx_mask, (int *)&ni->max_blocks); + + return (ni); +} + +void +ipfw_objhash_destroy(struct namedobj_instance *ni) +{ + + free(ni->idx_mask, M_IPFW); + free(ni, M_IPFW); +} + +void +ipfw_objhash_set_funcs(struct namedobj_instance *ni, objhash_hash_f *hash_f, + objhash_cmp_f *cmp_f) +{ + + ni->hash_f = hash_f; + ni->cmp_f = cmp_f; +} + +static uint32_t +objhash_hash_name(struct namedobj_instance *ni, void *name, uint32_t set) +{ + + return (fnv_32_str((char *)name, FNV1_32_INIT)); +} + +static int +objhash_cmp_name(struct named_object *no, void *name, uint32_t set) +{ + + if ((strcmp(no->name, (char *)name) == 0) && (no->set == set)) + return (0); + + return (1); +} + +static uint32_t +objhash_hash_idx(struct namedobj_instance *ni, uint32_t val) +{ + uint32_t v; + + v = val % (ni->nv_size - 1); + + return (v); +} + +struct named_object * +ipfw_objhash_lookup_name(struct namedobj_instance *ni, uint32_t set, char *name) +{ + struct named_object *no; + uint32_t hash; + + hash = ni->hash_f(ni, name, set) % ni->nn_size; + + TAILQ_FOREACH(no, &ni->names[hash], nn_next) { + if (ni->cmp_f(no, name, set) == 
0) + return (no); + } + + return (NULL); +} + +struct named_object * +ipfw_objhash_lookup_kidx(struct namedobj_instance *ni, uint16_t kidx) +{ + struct named_object *no; + uint32_t hash; + + hash = objhash_hash_idx(ni, kidx); + + TAILQ_FOREACH(no, &ni->values[hash], nv_next) { + if (no->kidx == kidx) + return (no); + } + + return (NULL); +} + +int +ipfw_objhash_same_name(struct namedobj_instance *ni, struct named_object *a, + struct named_object *b) +{ + + if ((strcmp(a->name, b->name) == 0) && a->set == b->set) + return (1); + + return (0); +} + +void +ipfw_objhash_add(struct namedobj_instance *ni, struct named_object *no) +{ + uint32_t hash; + + hash = ni->hash_f(ni, no->name, no->set) % ni->nn_size; + TAILQ_INSERT_HEAD(&ni->names[hash], no, nn_next); + + hash = objhash_hash_idx(ni, no->kidx); + TAILQ_INSERT_HEAD(&ni->values[hash], no, nv_next); + + ni->count++; +} + +void +ipfw_objhash_del(struct namedobj_instance *ni, struct named_object *no) +{ + uint32_t hash; + + hash = ni->hash_f(ni, no->name, no->set) % ni->nn_size; + TAILQ_REMOVE(&ni->names[hash], no, nn_next); + + hash = objhash_hash_idx(ni, no->kidx); + TAILQ_REMOVE(&ni->values[hash], no, nv_next); + + ni->count--; +} + +uint32_t +ipfw_objhash_count(struct namedobj_instance *ni) +{ + + return (ni->count); +} + +/* + * Runs @func for each found named object. + * It is safe to delete objects from callback + */ +void +ipfw_objhash_foreach(struct namedobj_instance *ni, objhash_cb_t *f, void *arg) +{ + struct named_object *no, *no_tmp; + int i; + + for (i = 0; i < ni->nn_size; i++) { + TAILQ_FOREACH_SAFE(no, &ni->names[i], nn_next, no_tmp) + f(ni, no, arg); + } +} + +/* + * Removes index from given set. + * Returns 0 on success. + */ +int +ipfw_objhash_free_idx(struct namedobj_instance *ni, uint16_t idx) +{ + u_long *mask; + int i, v; + + i = idx / BLOCK_ITEMS; + v = idx % BLOCK_ITEMS; + + if (i >= ni->max_blocks) + return (1); + + mask = &ni->idx_mask[i]; + + if ((*mask & ((u_long)1 << v)) != 0) + return (1); + + /* Mark as free */ + *mask |= (u_long)1 << v; + + /* Update free offset */ + if (ni->free_off[0] > i) + ni->free_off[0] = i; + + return (0); +} + +/* + * Allocate new index in given instance and stores in in @pidx. + * Returns 0 on success. + */ +int +ipfw_objhash_alloc_idx(void *n, uint16_t *pidx) +{ + struct namedobj_instance *ni; + u_long *mask; + int i, off, v; + + ni = (struct namedobj_instance *)n; + + off = ni->free_off[0]; + mask = &ni->idx_mask[off]; + + for (i = off; i < ni->max_blocks; i++, mask++) { + if ((v = ffsl(*mask)) == 0) + continue; + + /* Mark as busy */ + *mask &= ~ ((u_long)1 << (v - 1)); + + ni->free_off[0] = i; + + v = BLOCK_ITEMS * i + v - 1; + + *pidx = v; + return (0); + } + + return (1); +} + +/* end of file */ diff --git a/example/ipfw/sys/netpfil/ipfw/ip_fw_table.c b/example/ipfw/sys/netpfil/ipfw/ip_fw_table.c new file mode 100644 index 0000000..2994528 --- /dev/null +++ b/example/ipfw/sys/netpfil/ipfw/ip_fw_table.c @@ -0,0 +1,3674 @@ +/*- + * Copyright (c) 2004 Ruslan Ermilov and Vsevolod Lobko. + * Copyright (c) 2014 Yandex LLC + * Copyright (c) 2014 Alexander V. Chernikov + * + * Redistribution and use in source and binary forms, with or without + * modification, are permitted provided that the following conditions + * are met: + * 1. Redistributions of source code must retain the above copyright + * notice, this list of conditions and the following disclaimer. + * 2. 
Redistributions in binary form must reproduce the above copyright
+ *    notice, this list of conditions and the following disclaimer in the
+ *    documentation and/or other materials provided with the distribution.
+ *
+ * THIS SOFTWARE IS PROVIDED BY THE AUTHOR AND CONTRIBUTORS ``AS IS'' AND
+ * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
+ * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
+ * ARE DISCLAIMED. IN NO EVENT SHALL THE AUTHOR OR CONTRIBUTORS BE LIABLE
+ * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
+ * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
+ * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
+ * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
+ * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
+ * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
+ * SUCH DAMAGE.
+ */
+
+#include <sys/cdefs.h>
+__FBSDID("$FreeBSD: head/sys/netpfil/ipfw/ip_fw_table.c 272840 2014-10-09 19:32:35Z melifaro $");
+
+/*
+ * Lookup table support for ipfw.
+ *
+ * This file contains handlers for all generic tables' operations:
+ * add/del/flush entries, list/dump tables etc..
+ *
+ * Table data modification is protected by both UH and runtime lock
+ * while reading configuration/data is protected by UH lock.
+ *
+ * Lookup algorithms for all table types are located in ip_fw_table_algo.c
+ */
+
+#include "opt_ipfw.h"
+
+#include <sys/param.h>
+#include <sys/systm.h>
+#include <sys/malloc.h>
+#include <sys/kernel.h>
+#include <sys/lock.h>
+#include <sys/rwlock.h>
+#include <sys/rmlock.h>
+#include <sys/socket.h>
+#include <sys/socketvar.h>
+#include <sys/queue.h>
+#include <net/if.h>	/* ip_fw.h requires IFNAMSIZ */
+
+#include <netinet/in.h>
+#include <netinet/ip_var.h>	/* struct ipfw_rule_ref */
+#include <netinet/ip_fw.h>
+
+#include <netpfil/ipfw/ip_fw_private.h>
+#include <netpfil/ipfw/ip_fw_table.h>
+
+ /*
+ * Table has the following `type` concepts:
+ *
+ * `no.type` represents lookup key type (addr, ifp, uid, etc..)
+ * vmask represents bitmask of table values which are present at the moment.
+ * Special IPFW_VTYPE_LEGACY ( (uint32_t)-1 ) represents old
+ * single-value-for-all approach.
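+ *
+ * For example (illustrative; assumes the IPFW_VTYPE_* flags from
+ * ip_fw.h), a table whose values carry both a skipto target and a
+ * fib number would have vmask = IPFW_VTYPE_SKIPTO | IPFW_VTYPE_FIB,
+ * while tables created via the legacy interface get the all-ones
+ * IPFW_VTYPE_LEGACY mask.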
+ */ +struct table_config { + struct named_object no; + uint8_t tflags; /* type flags */ + uint8_t locked; /* 1 if locked from changes */ + uint8_t linked; /* 1 if already linked */ + uint8_t ochanged; /* used by set swapping */ + uint8_t vshared; /* 1 if using shared value array */ + uint8_t spare[3]; + uint32_t count; /* Number of records */ + uint32_t limit; /* Max number of records */ + uint32_t vmask; /* bitmask with supported values */ + uint32_t ocount; /* used by set swapping */ + uint64_t gencnt; /* generation count */ + char tablename[64]; /* table name */ + struct table_algo *ta; /* Callbacks for given algo */ + void *astate; /* algorithm state */ + struct table_info ti_copy; /* data to put to table_info */ + struct namedobj_instance *vi; +}; + +static struct table_config *find_table(struct namedobj_instance *ni, + struct tid_info *ti); +static struct table_config *alloc_table_config(struct ip_fw_chain *ch, + struct tid_info *ti, struct table_algo *ta, char *adata, uint8_t tflags); +static void free_table_config(struct namedobj_instance *ni, + struct table_config *tc); +static int create_table_internal(struct ip_fw_chain *ch, struct tid_info *ti, + char *aname, ipfw_xtable_info *i, uint16_t *pkidx, int ref); +static void link_table(struct ip_fw_chain *ch, struct table_config *tc); +static void unlink_table(struct ip_fw_chain *ch, struct table_config *tc); +static int find_ref_table(struct ip_fw_chain *ch, struct tid_info *ti, + struct tentry_info *tei, uint32_t count, int op, struct table_config **ptc); +#define OP_ADD 1 +#define OP_DEL 0 +static int export_tables(struct ip_fw_chain *ch, ipfw_obj_lheader *olh, + struct sockopt_data *sd); +static void export_table_info(struct ip_fw_chain *ch, struct table_config *tc, + ipfw_xtable_info *i); +static int dump_table_tentry(void *e, void *arg); +static int dump_table_xentry(void *e, void *arg); + +static int swap_tables(struct ip_fw_chain *ch, struct tid_info *a, + struct tid_info *b); + +static int check_table_space(struct ip_fw_chain *ch, struct tableop_state *ts, + struct table_config *tc, struct table_info *ti, uint32_t count); +static int destroy_table(struct ip_fw_chain *ch, struct tid_info *ti); + +static struct table_algo *find_table_algo(struct tables_config *tableconf, + struct tid_info *ti, char *name); + +static void objheader_to_ti(struct _ipfw_obj_header *oh, struct tid_info *ti); +static void ntlv_to_ti(struct _ipfw_obj_ntlv *ntlv, struct tid_info *ti); +static int classify_table_opcode(ipfw_insn *cmd, uint16_t *puidx, uint8_t *ptype); + +#define CHAIN_TO_NI(chain) (CHAIN_TO_TCFG(chain)->namehash) +#define KIDX_TO_TI(ch, k) (&(((struct table_info *)(ch)->tablestate)[k])) + +#define TA_BUF_SZ 128 /* On-stack buffer for add/delete state */ + +void +rollback_toperation_state(struct ip_fw_chain *ch, void *object) +{ + struct tables_config *tcfg; + struct op_state *os; + + tcfg = CHAIN_TO_TCFG(ch); + TAILQ_FOREACH(os, &tcfg->state_list, next) + os->func(object, os); +} + +void +add_toperation_state(struct ip_fw_chain *ch, struct tableop_state *ts) +{ + struct tables_config *tcfg; + + tcfg = CHAIN_TO_TCFG(ch); + TAILQ_INSERT_HEAD(&tcfg->state_list, &ts->opstate, next); +} + +void +del_toperation_state(struct ip_fw_chain *ch, struct tableop_state *ts) +{ + struct tables_config *tcfg; + + tcfg = CHAIN_TO_TCFG(ch); + TAILQ_REMOVE(&tcfg->state_list, &ts->opstate, next); +} + +void +tc_ref(struct table_config *tc) +{ + + tc->no.refcnt++; +} + +void +tc_unref(struct table_config *tc) +{ + + tc->no.refcnt--; +} + +static struct 
table_value * +get_table_value(struct ip_fw_chain *ch, struct table_config *tc, uint32_t kidx) +{ + struct table_value *pval; + + pval = (struct table_value *)ch->valuestate; + + return (&pval[kidx]); +} + + +/* + * Checks if we're able to insert/update entry @tei into table + * w.r.t @tc limits. + * May alter @tei to indicate insertion error / insert + * options. + * + * Returns 0 if operation can be performed/ + */ +static int +check_table_limit(struct table_config *tc, struct tentry_info *tei) +{ + + if (tc->limit == 0 || tc->count < tc->limit) + return (0); + + if ((tei->flags & TEI_FLAGS_UPDATE) == 0) { + /* Notify userland on error cause */ + tei->flags |= TEI_FLAGS_LIMIT; + return (EFBIG); + } + + /* + * We have UPDATE flag set. + * Permit updating record (if found), + * but restrict adding new one since we've + * already hit the limit. + */ + tei->flags |= TEI_FLAGS_DONTADD; + + return (0); +} + +/* + * Convert algorithm callback return code into + * one of pre-defined states known by userland. + */ +static void +store_tei_result(struct tentry_info *tei, int op, int error, uint32_t num) +{ + int flag; + + flag = 0; + + switch (error) { + case 0: + if (op == OP_ADD && num != 0) + flag = TEI_FLAGS_ADDED; + if (op == OP_DEL) + flag = TEI_FLAGS_DELETED; + break; + case ENOENT: + flag = TEI_FLAGS_NOTFOUND; + break; + case EEXIST: + flag = TEI_FLAGS_EXISTS; + break; + default: + flag = TEI_FLAGS_ERROR; + } + + tei->flags |= flag; +} + +/* + * Creates and references table with default parameters. + * Saves table config, algo and allocated kidx info @ptc, @pta and + * @pkidx if non-zero. + * Used for table auto-creation to support old binaries. + * + * Returns 0 on success. + */ +static int +create_table_compat(struct ip_fw_chain *ch, struct tid_info *ti, + uint16_t *pkidx) +{ + ipfw_xtable_info xi; + int error; + + memset(&xi, 0, sizeof(xi)); + /* Set default value mask for legacy clients */ + xi.vmask = IPFW_VTYPE_LEGACY; + + error = create_table_internal(ch, ti, NULL, &xi, pkidx, 1); + if (error != 0) + return (error); + + return (0); +} + +/* + * Find and reference existing table optionally + * creating new one. + * + * Saves found table config into @ptc. + * Note function may drop/acquire UH_WLOCK. + * Returns 0 if table was found/created and referenced + * or non-zero return code. + */ +static int +find_ref_table(struct ip_fw_chain *ch, struct tid_info *ti, + struct tentry_info *tei, uint32_t count, int op, + struct table_config **ptc) +{ + struct namedobj_instance *ni; + struct table_config *tc; + uint16_t kidx; + int error; + + IPFW_UH_WLOCK_ASSERT(ch); + + ni = CHAIN_TO_NI(ch); + tc = NULL; + if ((tc = find_table(ni, ti)) != NULL) { + /* check table type */ + if (tc->no.type != ti->type) + return (EINVAL); + + if (tc->locked != 0) + return (EACCES); + + /* Try to exit early on limit hit */ + if (op == OP_ADD && count == 1 && + check_table_limit(tc, tei) != 0) + return (EFBIG); + + /* Reference and return */ + tc->no.refcnt++; + *ptc = tc; + return (0); + } + + if (op == OP_DEL) + return (ESRCH); + + /* Compability mode: create new table for old clients */ + if ((tei->flags & TEI_FLAGS_COMPAT) == 0) + return (ESRCH); + + IPFW_UH_WUNLOCK(ch); + error = create_table_compat(ch, ti, &kidx); + IPFW_UH_WLOCK(ch); + + if (error != 0) + return (error); + + tc = (struct table_config *)ipfw_objhash_lookup_kidx(ni, kidx); + KASSERT(tc != NULL, ("create_table_compat returned bad idx %d", kidx)); + + /* OK, now we've got referenced table. 
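+	 * The kidx-based lookup above is deliberate: the UH lock was
+	 * dropped around create_table_compat(), so the earlier lookup
+	 * by name may be stale, while the returned kernel index still
+	 * uniquely identifies the newly created table.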
*/ + *ptc = tc; + return (0); +} + +/* + * Rolls back already @added to @tc entries using state array @ta_buf_m. + * Assume the following layout: + * 1) ADD state (ta_buf_m[0] ... t_buf_m[added - 1]) for handling update cases + * 2) DEL state (ta_buf_m[count[ ... t_buf_m[count + added - 1]) + * for storing deleted state + */ +static void +rollback_added_entries(struct ip_fw_chain *ch, struct table_config *tc, + struct table_info *tinfo, struct tentry_info *tei, caddr_t ta_buf_m, + uint32_t count, uint32_t added) +{ + struct table_algo *ta; + struct tentry_info *ptei; + caddr_t v, vv; + size_t ta_buf_sz; + int error, i; + uint32_t num; + + IPFW_UH_WLOCK_ASSERT(ch); + + ta = tc->ta; + ta_buf_sz = ta->ta_buf_size; + v = ta_buf_m; + vv = v + count * ta_buf_sz; + for (i = 0; i < added; i++, v += ta_buf_sz, vv += ta_buf_sz) { + ptei = &tei[i]; + if ((ptei->flags & TEI_FLAGS_UPDATED) != 0) { + + /* + * We have old value stored by previous + * call in @ptei->value. Do add once again + * to restore it. + */ + error = ta->add(tc->astate, tinfo, ptei, v, &num); + KASSERT(error == 0, ("rollback UPDATE fail")); + KASSERT(num == 0, ("rollback UPDATE fail2")); + continue; + } + + error = ta->prepare_del(ch, ptei, vv); + KASSERT(error == 0, ("pre-rollback INSERT failed")); + error = ta->del(tc->astate, tinfo, ptei, vv, &num); + KASSERT(error == 0, ("rollback INSERT failed")); + tc->count -= num; + } +} + +/* + * Prepares add/del state for all @count entries in @tei. + * Uses either stack buffer (@ta_buf) or allocates a new one. + * Stores pointer to allocated buffer back to @ta_buf. + * + * Returns 0 on success. + */ +static int +prepare_batch_buffer(struct ip_fw_chain *ch, struct table_algo *ta, + struct tentry_info *tei, uint32_t count, int op, caddr_t *ta_buf) +{ + caddr_t ta_buf_m, v; + size_t ta_buf_sz, sz; + struct tentry_info *ptei; + int error, i; + + error = 0; + ta_buf_sz = ta->ta_buf_size; + if (count == 1) { + /* Sigle add/delete, use on-stack buffer */ + memset(*ta_buf, 0, TA_BUF_SZ); + ta_buf_m = *ta_buf; + } else { + + /* + * Multiple adds/deletes, allocate larger buffer + * + * Note we need 2xcount buffer for add case: + * we have hold both ADD state + * and DELETE state (this may be needed + * if we need to rollback all changes) + */ + sz = count * ta_buf_sz; + ta_buf_m = malloc((op == OP_ADD) ? sz * 2 : sz, M_TEMP, + M_WAITOK | M_ZERO); + } + + v = ta_buf_m; + for (i = 0; i < count; i++, v += ta_buf_sz) { + ptei = &tei[i]; + error = (op == OP_ADD) ? + ta->prepare_add(ch, ptei, v) : ta->prepare_del(ch, ptei, v); + + /* + * Some syntax error (incorrect mask, or address, or + * anything). Return error regardless of atomicity + * settings. + */ + if (error != 0) + break; + } + + *ta_buf = ta_buf_m; + return (error); +} + +/* + * Flushes allocated state for each @count entries in @tei. + * Frees @ta_buf_m if differs from stack buffer @ta_buf. 
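+ *
+ * State layout used by the add path (illustrative picture of what
+ * prepare_batch_buffer() and rollback_added_entries() agree on):
+ *
+ *	ta_buf_m: [ ADD state 0 .. ADD state count-1 |
+ *	            DEL state 0 .. DEL state count-1 ]
+ *
+ * which is why the rollback pass below starts reading at offset
+ * count * ta_buf_sz.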
+ */ +static void +flush_batch_buffer(struct ip_fw_chain *ch, struct table_algo *ta, + struct tentry_info *tei, uint32_t count, int rollback, + caddr_t ta_buf_m, caddr_t ta_buf) +{ + caddr_t v; + struct tentry_info *ptei; + size_t ta_buf_sz; + int i; + + ta_buf_sz = ta->ta_buf_size; + + /* Run cleaning callback anyway */ + v = ta_buf_m; + for (i = 0; i < count; i++, v += ta_buf_sz) { + ptei = &tei[i]; + ta->flush_entry(ch, ptei, v); + if (ptei->ptv != NULL) { + free(ptei->ptv, M_IPFW); + ptei->ptv = NULL; + } + } + + /* Clean up "deleted" state in case of rollback */ + if (rollback != 0) { + v = ta_buf_m + count * ta_buf_sz; + for (i = 0; i < count; i++, v += ta_buf_sz) + ta->flush_entry(ch, &tei[i], v); + } + + if (ta_buf_m != ta_buf) + free(ta_buf_m, M_TEMP); +} + + +static void +rollback_add_entry(void *object, struct op_state *_state) +{ + struct ip_fw_chain *ch; + struct tableop_state *ts; + + ts = (struct tableop_state *)_state; + + if (ts->tc != object && ts->ch != object) + return; + + ch = ts->ch; + + IPFW_UH_WLOCK_ASSERT(ch); + + /* Call specifid unlockers */ + rollback_table_values(ts); + + /* Indicate we've called */ + ts->modified = 1; +} + +/* + * Adds/updates one or more entries in table @ti. + * + * Function may drop/reacquire UH wlock multiple times due to + * items alloc, algorithm callbacks (check_space), value linkage + * (new values, value storage realloc), etc.. + * Other processes like other adds (which may involve storage resize), + * table swaps (which changes table data and may change algo type), + * table modify (which may change value mask) may be executed + * simultaneously so we need to deal with it. + * + * The following approach was implemented: + * we have per-chain linked list, protected with UH lock. + * add_table_entry prepares special on-stack structure wthich is passed + * to its descendants. Users add this structure to this list before unlock. + * After performing needed operations and acquiring UH lock back, each user + * checks if structure has changed. If true, it rolls local state back and + * returns without error to the caller. + * add_table_entry() on its own checks if structure has changed and restarts + * its operation from the beginning (goto restart). + * + * Functions which are modifying fields of interest (currently + * resize_shared_value_storage() and swap_tables() ) + * traverses given list while holding UH lock immediately before + * performing their operations calling function provided be list entry + * ( currently rollback_add_entry ) which performs rollback for all necessary + * state and sets appropriate values in structure indicating rollback + * has happened. + * + * Algo interaction: + * Function references @ti first to ensure table won't + * disappear or change its type. + * After that, prepare_add callback is called for each @tei entry. + * Next, we try to add each entry under UH+WHLOCK + * using add() callback. + * Finally, we free all state by calling flush_entry callback + * for each @tei. + * + * Returns 0 on success. + */ +int +add_table_entry(struct ip_fw_chain *ch, struct tid_info *ti, + struct tentry_info *tei, uint8_t flags, uint32_t count) +{ + struct table_config *tc; + struct table_algo *ta; + uint16_t kidx; + int error, first_error, i, rollback; + uint32_t num, numadd; + struct tentry_info *ptei; + struct tableop_state ts; + char ta_buf[TA_BUF_SZ]; + caddr_t ta_buf_m = NULL, v; + + memset(&ts, 0, sizeof(ts)); + ta = NULL; + IPFW_UH_WLOCK(ch); + + /* + * Find and reference existing table. 
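+	 * If the on-stack tableop state got marked as modified while the
+	 * UH lock was dropped, all prepared state is flushed and the
+	 * whole operation restarts. Condensed, the pattern used below
+	 * is (illustrative):
+	 *
+	 *	add_toperation_state(ch, &ts);
+	 *	IPFW_UH_WUNLOCK(ch);
+	 *	... sleepable work: buffers, algo callbacks ...
+	 *	IPFW_UH_WLOCK(ch);
+	 *	del_toperation_state(ch, &ts);
+	 *	if (ts.modified != 0)
+	 *		goto restart;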
+ */ +restart: + if (ts.modified != 0) { + IPFW_UH_WUNLOCK(ch); + flush_batch_buffer(ch, ta, tei, count, rollback, + ta_buf_m, ta_buf); + memset(&ts, 0, sizeof(ts)); + ta = NULL; + IPFW_UH_WLOCK(ch); + } + + error = find_ref_table(ch, ti, tei, count, OP_ADD, &tc); + if (error != 0) { + IPFW_UH_WUNLOCK(ch); + return (error); + } + ta = tc->ta; + + /* Fill in tablestate */ + ts.ch = ch; + ts.opstate.func = rollback_add_entry; + ts.tc = tc; + ts.vshared = tc->vshared; + ts.vmask = tc->vmask; + ts.ta = ta; + ts.tei = tei; + ts.count = count; + rollback = 0; + add_toperation_state(ch, &ts); + IPFW_UH_WUNLOCK(ch); + + /* Allocate memory and prepare record(s) */ + /* Pass stack buffer by default */ + ta_buf_m = ta_buf; + error = prepare_batch_buffer(ch, ta, tei, count, OP_ADD, &ta_buf_m); + if (error != 0) + goto cleanup; + + IPFW_UH_WLOCK(ch); + /* Drop reference we've used in first search */ + tc->no.refcnt--; + + /* + * Check if table swap has happened. + * (so table algo might be changed). + * Restart operation to achieve consistent behavior. + */ + del_toperation_state(ch, &ts); + if (ts.modified != 0) + goto restart; + + /* + * Link all values values to shared/per-table value array. + * + * May release/reacquire UH_WLOCK. + */ + error = ipfw_link_table_values(ch, &ts); + if (error != 0) + goto cleanup; + if (ts.modified != 0) + goto restart; + + /* + * Ensure we are able to add all entries without additional + * memory allocations. May release/reacquire UH_WLOCK. + */ + kidx = tc->no.kidx; + error = check_table_space(ch, &ts, tc, KIDX_TO_TI(ch, kidx), count); + if (error != 0) + goto cleanup; + if (ts.modified != 0) + goto restart; + + /* We've got valid table in @tc. Let's try to add data */ + kidx = tc->no.kidx; + ta = tc->ta; + numadd = 0; + first_error = 0; + + IPFW_WLOCK(ch); + + v = ta_buf_m; + for (i = 0; i < count; i++, v += ta->ta_buf_size) { + ptei = &tei[i]; + num = 0; + /* check limit before adding */ + if ((error = check_table_limit(tc, ptei)) == 0) { + error = ta->add(tc->astate, KIDX_TO_TI(ch, kidx), + ptei, v, &num); + /* Set status flag to inform userland */ + store_tei_result(ptei, OP_ADD, error, num); + } + if (error == 0) { + /* Update number of records to ease limit checking */ + tc->count += num; + numadd += num; + continue; + } + + if (first_error == 0) + first_error = error; + + /* + * Some error have happened. Check our atomicity + * settings: continue if atomicity is not required, + * rollback changes otherwise. + */ + if ((flags & IPFW_CTF_ATOMIC) == 0) + continue; + + rollback_added_entries(ch, tc, KIDX_TO_TI(ch, kidx), + tei, ta_buf_m, count, i); + + rollback = 1; + break; + } + + IPFW_WUNLOCK(ch); + + ipfw_garbage_table_values(ch, tc, tei, count, rollback); + + /* Permit post-add algorithm grow/rehash. */ + if (numadd != 0) + check_table_space(ch, NULL, tc, KIDX_TO_TI(ch, kidx), 0); + + /* Return first error to user, if any */ + error = first_error; + +cleanup: + IPFW_UH_WUNLOCK(ch); + + flush_batch_buffer(ch, ta, tei, count, rollback, ta_buf_m, ta_buf); + + return (error); +} + +/* + * Deletes one or more entries in table @ti. + * + * Returns 0 on success. + */ +int +del_table_entry(struct ip_fw_chain *ch, struct tid_info *ti, + struct tentry_info *tei, uint8_t flags, uint32_t count) +{ + struct table_config *tc; + struct table_algo *ta; + struct tentry_info *ptei; + uint16_t kidx; + int error, first_error, i; + uint32_t num, numdel; + char ta_buf[TA_BUF_SZ]; + caddr_t ta_buf_m, v; + + /* + * Find and reference existing table. 
+ */ + IPFW_UH_WLOCK(ch); + error = find_ref_table(ch, ti, tei, count, OP_DEL, &tc); + if (error != 0) { + IPFW_UH_WUNLOCK(ch); + return (error); + } + ta = tc->ta; + IPFW_UH_WUNLOCK(ch); + + /* Allocate memory and prepare record(s) */ + /* Pass stack buffer by default */ + ta_buf_m = ta_buf; + error = prepare_batch_buffer(ch, ta, tei, count, OP_DEL, &ta_buf_m); + if (error != 0) + goto cleanup; + + IPFW_UH_WLOCK(ch); + + /* Drop reference we've used in first search */ + tc->no.refcnt--; + + /* + * Check if table algo is still the same. + * (changed ta may be the result of table swap). + */ + if (ta != tc->ta) { + IPFW_UH_WUNLOCK(ch); + error = EINVAL; + goto cleanup; + } + + kidx = tc->no.kidx; + numdel = 0; + first_error = 0; + + IPFW_WLOCK(ch); + v = ta_buf_m; + for (i = 0; i < count; i++, v += ta->ta_buf_size) { + ptei = &tei[i]; + num = 0; + error = ta->del(tc->astate, KIDX_TO_TI(ch, kidx), ptei, v, + &num); + /* Save state for userland */ + store_tei_result(ptei, OP_DEL, error, num); + if (error != 0 && first_error == 0) + first_error = error; + tc->count -= num; + numdel += num; + } + IPFW_WUNLOCK(ch); + + /* Unlink non-used values */ + ipfw_garbage_table_values(ch, tc, tei, count, 0); + + if (numdel != 0) { + /* Run post-del hook to permit shrinking */ + check_table_space(ch, NULL, tc, KIDX_TO_TI(ch, kidx), 0); + } + + IPFW_UH_WUNLOCK(ch); + + /* Return first error to user, if any */ + error = first_error; + +cleanup: + flush_batch_buffer(ch, ta, tei, count, 0, ta_buf_m, ta_buf); + + return (error); +} + +/* + * Ensure that table @tc has enough space to add @count entries without + * need for reallocation. + * + * Callbacks order: + * 0) need_modify() (UH_WLOCK) - checks if @count items can be added w/o resize. + * + * 1) alloc_modify (no locks, M_WAITOK) - alloc new state based on @pflags. + * 2) prepare_modifyt (UH_WLOCK) - copy old data into new storage + * 3) modify (UH_WLOCK + WLOCK) - switch pointers + * 4) flush_modify (UH_WLOCK) - free state, if needed + * + * Returns 0 on success. + */ +static int +check_table_space(struct ip_fw_chain *ch, struct tableop_state *ts, + struct table_config *tc, struct table_info *ti, uint32_t count) +{ + struct table_algo *ta; + uint64_t pflags; + char ta_buf[TA_BUF_SZ]; + int error; + + IPFW_UH_WLOCK_ASSERT(ch); + + error = 0; + ta = tc->ta; + if (ta->need_modify == NULL) + return (0); + + /* Acquire reference not to loose @tc between locks/unlocks */ + tc->no.refcnt++; + + /* + * TODO: think about avoiding race between large add/large delete + * operation on algorithm which implements shrinking along with + * growing. + */ + while (true) { + pflags = 0; + if (ta->need_modify(tc->astate, ti, count, &pflags) == 0) { + error = 0; + break; + } + + /* We have to shrink/grow table */ + if (ts != NULL) + add_toperation_state(ch, ts); + IPFW_UH_WUNLOCK(ch); + + memset(&ta_buf, 0, sizeof(ta_buf)); + error = ta->prepare_mod(ta_buf, &pflags); + + IPFW_UH_WLOCK(ch); + if (ts != NULL) + del_toperation_state(ch, ts); + + if (error != 0) + break; + + if (ts != NULL && ts->modified != 0) { + + /* + * Swap operation has happened + * so we're currently operating on other + * table data. Stop doing this. + */ + ta->flush_mod(ta_buf); + break; + } + + /* Check if we still need to alter table */ + ti = KIDX_TO_TI(ch, tc->no.kidx); + if (ta->need_modify(tc->astate, ti, count, &pflags) == 0) { + IPFW_UH_WUNLOCK(ch); + + /* + * Other thread has already performed resize. + * Flush our state and return. 
+ */ + ta->flush_mod(ta_buf); + break; + } + + error = ta->fill_mod(tc->astate, ti, ta_buf, &pflags); + if (error == 0) { + /* Do actual modification */ + IPFW_WLOCK(ch); + ta->modify(tc->astate, ti, ta_buf, pflags); + IPFW_WUNLOCK(ch); + } + + /* Anyway, flush data and retry */ + ta->flush_mod(ta_buf); + } + + tc->no.refcnt--; + return (error); +} + +/* + * Adds or deletes record in table. + * Data layout (v0): + * Request: [ ip_fw3_opheader ipfw_table_xentry ] + * + * Returns 0 on success + */ +static int +manage_table_ent_v0(struct ip_fw_chain *ch, ip_fw3_opheader *op3, + struct sockopt_data *sd) +{ + ipfw_table_xentry *xent; + struct tentry_info tei; + struct tid_info ti; + struct table_value v; + int error, hdrlen, read; + + hdrlen = offsetof(ipfw_table_xentry, k); + + /* Check minimum header size */ + if (sd->valsize < (sizeof(*op3) + hdrlen)) + return (EINVAL); + + read = sizeof(ip_fw3_opheader); + + /* Check if xentry len field is valid */ + xent = (ipfw_table_xentry *)(op3 + 1); + if (xent->len < hdrlen || xent->len + read > sd->valsize) + return (EINVAL); + + memset(&tei, 0, sizeof(tei)); + tei.paddr = &xent->k; + tei.masklen = xent->masklen; + ipfw_import_table_value_legacy(xent->value, &v); + tei.pvalue = &v; + /* Old requests compability */ + tei.flags = TEI_FLAGS_COMPAT; + if (xent->type == IPFW_TABLE_ADDR) { + if (xent->len - hdrlen == sizeof(in_addr_t)) + tei.subtype = AF_INET; + else + tei.subtype = AF_INET6; + } + + memset(&ti, 0, sizeof(ti)); + ti.uidx = xent->tbl; + ti.type = xent->type; + + error = (op3->opcode == IP_FW_TABLE_XADD) ? + add_table_entry(ch, &ti, &tei, 0, 1) : + del_table_entry(ch, &ti, &tei, 0, 1); + + return (error); +} + +/* + * Adds or deletes record in table. + * Data layout (v1)(current): + * Request: [ ipfw_obj_header + * ipfw_obj_ctlv(IPFW_TLV_TBLENT_LIST) [ ipfw_obj_tentry x N ] + * ] + * + * Returns 0 on success + */ +static int +manage_table_ent_v1(struct ip_fw_chain *ch, ip_fw3_opheader *op3, + struct sockopt_data *sd) +{ + ipfw_obj_tentry *tent, *ptent; + ipfw_obj_ctlv *ctlv; + ipfw_obj_header *oh; + struct tentry_info *ptei, tei, *tei_buf; + struct tid_info ti; + int error, i, kidx, read; + + /* Check minimum header size */ + if (sd->valsize < (sizeof(*oh) + sizeof(*ctlv))) + return (EINVAL); + + /* Check if passed data is too long */ + if (sd->valsize != sd->kavail) + return (EINVAL); + + oh = (ipfw_obj_header *)sd->kbuf; + + /* Basic length checks for TLVs */ + if (oh->ntlv.head.length != sizeof(oh->ntlv)) + return (EINVAL); + + read = sizeof(*oh); + + ctlv = (ipfw_obj_ctlv *)(oh + 1); + if (ctlv->head.length + read != sd->valsize) + return (EINVAL); + + read += sizeof(*ctlv); + tent = (ipfw_obj_tentry *)(ctlv + 1); + if (ctlv->count * sizeof(*tent) + read != sd->valsize) + return (EINVAL); + + if (ctlv->count == 0) + return (0); + + /* + * Mark entire buffer as "read". + * This instructs sopt api write it back + * after function return. 
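+	 *
+	 * Size bookkeeping example (illustrative): the checks above
+	 * accept a request iff
+	 *
+	 *	sd->valsize == sizeof(ipfw_obj_header) +
+	 *	    sizeof(ipfw_obj_ctlv) + N * sizeof(ipfw_obj_tentry)
+	 *
+	 * for N = ctlv->count, so a single-entry add/delete is exactly
+	 * sizeof(*oh) + sizeof(*ctlv) + sizeof(*tent) bytes and any
+	 * mismatch is rejected with EINVAL before any table state is
+	 * touched.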
+ */ + ipfw_get_sopt_header(sd, sd->valsize); + + /* Perform basic checks for each entry */ + ptent = tent; + kidx = tent->idx; + for (i = 0; i < ctlv->count; i++, ptent++) { + if (ptent->head.length != sizeof(*ptent)) + return (EINVAL); + if (ptent->idx != kidx) + return (ENOTSUP); + } + + /* Convert data into kernel request objects */ + objheader_to_ti(oh, &ti); + ti.type = oh->ntlv.type; + ti.uidx = kidx; + + /* Use on-stack buffer for single add/del */ + if (ctlv->count == 1) { + memset(&tei, 0, sizeof(tei)); + tei_buf = &tei; + } else + tei_buf = malloc(ctlv->count * sizeof(tei), M_TEMP, + M_WAITOK | M_ZERO); + + ptei = tei_buf; + ptent = tent; + for (i = 0; i < ctlv->count; i++, ptent++, ptei++) { + ptei->paddr = &ptent->k; + ptei->subtype = ptent->subtype; + ptei->masklen = ptent->masklen; + if (ptent->head.flags & IPFW_TF_UPDATE) + ptei->flags |= TEI_FLAGS_UPDATE; + + ipfw_import_table_value_v1(&ptent->v.value); + ptei->pvalue = (struct table_value *)&ptent->v.value; + } + + error = (oh->opheader.opcode == IP_FW_TABLE_XADD) ? + add_table_entry(ch, &ti, tei_buf, ctlv->flags, ctlv->count) : + del_table_entry(ch, &ti, tei_buf, ctlv->flags, ctlv->count); + + /* Translate result back to userland */ + ptei = tei_buf; + ptent = tent; + for (i = 0; i < ctlv->count; i++, ptent++, ptei++) { + if (ptei->flags & TEI_FLAGS_ADDED) + ptent->result = IPFW_TR_ADDED; + else if (ptei->flags & TEI_FLAGS_DELETED) + ptent->result = IPFW_TR_DELETED; + else if (ptei->flags & TEI_FLAGS_UPDATED) + ptent->result = IPFW_TR_UPDATED; + else if (ptei->flags & TEI_FLAGS_LIMIT) + ptent->result = IPFW_TR_LIMIT; + else if (ptei->flags & TEI_FLAGS_ERROR) + ptent->result = IPFW_TR_ERROR; + else if (ptei->flags & TEI_FLAGS_NOTFOUND) + ptent->result = IPFW_TR_NOTFOUND; + else if (ptei->flags & TEI_FLAGS_EXISTS) + ptent->result = IPFW_TR_EXISTS; + ipfw_export_table_value_v1(ptei->pvalue, &ptent->v.value); + } + + if (tei_buf != &tei) + free(tei_buf, M_TEMP); + + return (error); +} + +/* + * Looks up an entry in given table. + * Data layout (v0)(current): + * Request: [ ipfw_obj_header ipfw_obj_tentry ] + * Reply: [ ipfw_obj_header ipfw_obj_tentry ] + * + * Returns 0 on success + */ +static int +find_table_entry(struct ip_fw_chain *ch, ip_fw3_opheader *op3, + struct sockopt_data *sd) +{ + ipfw_obj_tentry *tent; + ipfw_obj_header *oh; + struct tid_info ti; + struct table_config *tc; + struct table_algo *ta; + struct table_info *kti; + struct namedobj_instance *ni; + int error; + size_t sz; + + /* Check minimum header size */ + sz = sizeof(*oh) + sizeof(*tent); + if (sd->valsize != sz) + return (EINVAL); + + oh = (struct _ipfw_obj_header *)ipfw_get_sopt_header(sd, sz); + tent = (ipfw_obj_tentry *)(oh + 1); + + /* Basic length checks for TLVs */ + if (oh->ntlv.head.length != sizeof(oh->ntlv)) + return (EINVAL); + + objheader_to_ti(oh, &ti); + ti.type = oh->ntlv.type; + ti.uidx = tent->idx; + + IPFW_UH_RLOCK(ch); + ni = CHAIN_TO_NI(ch); + + /* + * Find existing table and check its type . + */ + ta = NULL; + if ((tc = find_table(ni, &ti)) == NULL) { + IPFW_UH_RUNLOCK(ch); + return (ESRCH); + } + + /* check table type */ + if (tc->no.type != ti.type) { + IPFW_UH_RUNLOCK(ch); + return (EINVAL); + } + + kti = KIDX_TO_TI(ch, tc->no.kidx); + ta = tc->ta; + + if (ta->find_tentry == NULL) + return (ENOTSUP); + + error = ta->find_tentry(tc->astate, kti, tent); + + IPFW_UH_RUNLOCK(ch); + + return (error); +} + +/* + * Flushes all entries or destroys given table. 
+ * Data layout (v0)(current): + * Request: [ ipfw_obj_header ] + * + * Returns 0 on success + */ +static int +flush_table_v0(struct ip_fw_chain *ch, ip_fw3_opheader *op3, + struct sockopt_data *sd) +{ + int error; + struct _ipfw_obj_header *oh; + struct tid_info ti; + + if (sd->valsize != sizeof(*oh)) + return (EINVAL); + + oh = (struct _ipfw_obj_header *)op3; + objheader_to_ti(oh, &ti); + + if (op3->opcode == IP_FW_TABLE_XDESTROY) + error = destroy_table(ch, &ti); + else if (op3->opcode == IP_FW_TABLE_XFLUSH) + error = flush_table(ch, &ti); + else + return (ENOTSUP); + + return (error); +} + +static void +restart_flush(void *object, struct op_state *_state) +{ + struct tableop_state *ts; + + ts = (struct tableop_state *)_state; + + if (ts->tc != object) + return; + + /* Indicate we've called */ + ts->modified = 1; +} + +/* + * Flushes given table. + * + * Function create new table instance with the same + * parameters, swaps it with old one and + * flushes state without holding runtime WLOCK. + * + * Returns 0 on success. + */ +int +flush_table(struct ip_fw_chain *ch, struct tid_info *ti) +{ + struct namedobj_instance *ni; + struct table_config *tc; + struct table_algo *ta; + struct table_info ti_old, ti_new, *tablestate; + void *astate_old, *astate_new; + char algostate[64], *pstate; + struct tableop_state ts; + int error; + uint16_t kidx; + uint8_t tflags; + + /* + * Stage 1: save table algoritm. + * Reference found table to ensure it won't disappear. + */ + IPFW_UH_WLOCK(ch); + ni = CHAIN_TO_NI(ch); + if ((tc = find_table(ni, ti)) == NULL) { + IPFW_UH_WUNLOCK(ch); + return (ESRCH); + } +restart: + /* Set up swap handler */ + memset(&ts, 0, sizeof(ts)); + ts.opstate.func = restart_flush; + ts.tc = tc; + + ta = tc->ta; + /* Do not flush readonly tables */ + if ((ta->flags & TA_FLAG_READONLY) != 0) { + IPFW_UH_WUNLOCK(ch); + return (EACCES); + } + /* Save startup algo parameters */ + if (ta->print_config != NULL) { + ta->print_config(tc->astate, KIDX_TO_TI(ch, tc->no.kidx), + algostate, sizeof(algostate)); + pstate = algostate; + } else + pstate = NULL; + tflags = tc->tflags; + tc->no.refcnt++; + add_toperation_state(ch, &ts); + IPFW_UH_WUNLOCK(ch); + + /* + * Stage 2: allocate new table instance using same algo. + */ + memset(&ti_new, 0, sizeof(struct table_info)); + error = ta->init(ch, &astate_new, &ti_new, pstate, tflags); + + /* + * Stage 3: swap old state pointers with newly-allocated ones. + * Decrease refcount. + */ + IPFW_UH_WLOCK(ch); + tc->no.refcnt--; + del_toperation_state(ch, &ts); + + if (error != 0) { + IPFW_UH_WUNLOCK(ch); + return (error); + } + + /* + * Restart operation if table swap has happened: + * even if algo may be the same, algo init parameters + * may change. Restart operation instead of doing + * complex checks. + */ + if (ts.modified != 0) { + ta->destroy(astate_new, &ti_new); + goto restart; + } + + ni = CHAIN_TO_NI(ch); + kidx = tc->no.kidx; + tablestate = (struct table_info *)ch->tablestate; + + IPFW_WLOCK(ch); + ti_old = tablestate[kidx]; + tablestate[kidx] = ti_new; + IPFW_WUNLOCK(ch); + + astate_old = tc->astate; + tc->astate = astate_new; + tc->ti_copy = ti_new; + tc->count = 0; + + /* Notify algo on real @ti address */ + if (ta->change_ti != NULL) + ta->change_ti(tc->astate, &tablestate[kidx]); + + /* + * Stage 4: unref values. + */ + ipfw_unref_table_values(ch, tc, ta, astate_old, &ti_old); + IPFW_UH_WUNLOCK(ch); + + /* + * Stage 5: perform real flush/destroy. + */ + ta->destroy(astate_old, &ti_old); + + return (0); +} + +/* + * Swaps two tables. 
+ * Data layout (v0)(current): + * Request: [ ipfw_obj_header ipfw_obj_ntlv ] + * + * Returns 0 on success + */ +static int +swap_table(struct ip_fw_chain *ch, ip_fw3_opheader *op3, + struct sockopt_data *sd) +{ + int error; + struct _ipfw_obj_header *oh; + struct tid_info ti_a, ti_b; + + if (sd->valsize != sizeof(*oh) + sizeof(ipfw_obj_ntlv)) + return (EINVAL); + + oh = (struct _ipfw_obj_header *)op3; + ntlv_to_ti(&oh->ntlv, &ti_a); + ntlv_to_ti((ipfw_obj_ntlv *)(oh + 1), &ti_b); + + error = swap_tables(ch, &ti_a, &ti_b); + + return (error); +} + +/* + * Swaps two tables of the same type/valtype. + * + * Checks if tables are compatible and limits + * permits swap, than actually perform swap. + * + * Each table consists of 2 different parts: + * config: + * @tc (with name, set, kidx) and rule bindings, which is "stable". + * number of items + * table algo + * runtime: + * runtime data @ti (ch->tablestate) + * runtime cache in @tc + * algo-specific data (@tc->astate) + * + * So we switch: + * all runtime data + * number of items + * table algo + * + * After that we call @ti change handler for each table. + * + * Note that referencing @tc won't protect tc->ta from change. + * XXX: Do we need to restrict swap between locked tables? + * XXX: Do we need to exchange ftype? + * + * Returns 0 on success. + */ +static int +swap_tables(struct ip_fw_chain *ch, struct tid_info *a, + struct tid_info *b) +{ + struct namedobj_instance *ni; + struct table_config *tc_a, *tc_b; + struct table_algo *ta; + struct table_info ti, *tablestate; + void *astate; + uint32_t count; + + /* + * Stage 1: find both tables and ensure they are of + * the same type. + */ + IPFW_UH_WLOCK(ch); + ni = CHAIN_TO_NI(ch); + if ((tc_a = find_table(ni, a)) == NULL) { + IPFW_UH_WUNLOCK(ch); + return (ESRCH); + } + if ((tc_b = find_table(ni, b)) == NULL) { + IPFW_UH_WUNLOCK(ch); + return (ESRCH); + } + + /* It is very easy to swap between the same table */ + if (tc_a == tc_b) { + IPFW_UH_WUNLOCK(ch); + return (0); + } + + /* Check type and value are the same */ + if (tc_a->no.type != tc_b->no.type || tc_a->tflags != tc_b->tflags) { + IPFW_UH_WUNLOCK(ch); + return (EINVAL); + } + + /* Check limits before swap */ + if ((tc_a->limit != 0 && tc_b->count > tc_a->limit) || + (tc_b->limit != 0 && tc_a->count > tc_b->limit)) { + IPFW_UH_WUNLOCK(ch); + return (EFBIG); + } + + /* Check if one of the tables is readonly */ + if (((tc_a->ta->flags | tc_b->ta->flags) & TA_FLAG_READONLY) != 0) { + IPFW_UH_WUNLOCK(ch); + return (EACCES); + } + + /* Notify we're going to swap */ + rollback_toperation_state(ch, tc_a); + rollback_toperation_state(ch, tc_b); + + /* Everything is fine, prepare to swap */ + tablestate = (struct table_info *)ch->tablestate; + ti = tablestate[tc_a->no.kidx]; + ta = tc_a->ta; + astate = tc_a->astate; + count = tc_a->count; + + IPFW_WLOCK(ch); + /* a <- b */ + tablestate[tc_a->no.kidx] = tablestate[tc_b->no.kidx]; + tc_a->ta = tc_b->ta; + tc_a->astate = tc_b->astate; + tc_a->count = tc_b->count; + /* b <- a */ + tablestate[tc_b->no.kidx] = ti; + tc_b->ta = ta; + tc_b->astate = astate; + tc_b->count = count; + IPFW_WUNLOCK(ch); + + /* Ensure tc.ti copies are in sync */ + tc_a->ti_copy = tablestate[tc_a->no.kidx]; + tc_b->ti_copy = tablestate[tc_b->no.kidx]; + + /* Notify both tables on @ti change */ + if (tc_a->ta->change_ti != NULL) + tc_a->ta->change_ti(tc_a->astate, &tablestate[tc_a->no.kidx]); + if (tc_b->ta->change_ti != NULL) + tc_b->ta->change_ti(tc_b->astate, &tablestate[tc_b->no.kidx]); + + IPFW_UH_WUNLOCK(ch); + + 
return (0); +} + +/* + * Destroys table specified by @ti. + * Data layout (v0)(current): + * Request: [ ip_fw3_opheader ] + * + * Returns 0 on success + */ +static int +destroy_table(struct ip_fw_chain *ch, struct tid_info *ti) +{ + struct namedobj_instance *ni; + struct table_config *tc; + + IPFW_UH_WLOCK(ch); + + ni = CHAIN_TO_NI(ch); + if ((tc = find_table(ni, ti)) == NULL) { + IPFW_UH_WUNLOCK(ch); + return (ESRCH); + } + + /* Do not permit destroying referenced tables */ + if (tc->no.refcnt > 0) { + IPFW_UH_WUNLOCK(ch); + return (EBUSY); + } + + IPFW_WLOCK(ch); + unlink_table(ch, tc); + IPFW_WUNLOCK(ch); + + /* Free obj index */ + if (ipfw_objhash_free_idx(ni, tc->no.kidx) != 0) + printf("Error unlinking kidx %d from table %s\n", + tc->no.kidx, tc->tablename); + + /* Unref values used in tables while holding UH lock */ + ipfw_unref_table_values(ch, tc, tc->ta, tc->astate, &tc->ti_copy); + IPFW_UH_WUNLOCK(ch); + + free_table_config(ni, tc); + + return (0); +} + +static uint32_t +roundup2p(uint32_t v) +{ + + v--; + v |= v >> 1; + v |= v >> 2; + v |= v >> 4; + v |= v >> 8; + v |= v >> 16; + v++; + + return (v); +} + +/* + * Grow tables index. + * + * Returns 0 on success. + */ +int +ipfw_resize_tables(struct ip_fw_chain *ch, unsigned int ntables) +{ + unsigned int ntables_old, tbl; + struct namedobj_instance *ni; + void *new_idx, *old_tablestate, *tablestate; + struct table_info *ti; + struct table_config *tc; + int i, new_blocks; + + /* Check new value for validity */ + if (ntables == 0) + return (EINVAL); + if (ntables > IPFW_TABLES_MAX) + ntables = IPFW_TABLES_MAX; + /* Alight to nearest power of 2 */ + ntables = (unsigned int)roundup2p(ntables); + + /* Allocate new pointers */ + tablestate = malloc(ntables * sizeof(struct table_info), + M_IPFW, M_WAITOK | M_ZERO); + + ipfw_objhash_bitmap_alloc(ntables, (void *)&new_idx, &new_blocks); + + IPFW_UH_WLOCK(ch); + + tbl = (ntables >= V_fw_tables_max) ? V_fw_tables_max : ntables; + ni = CHAIN_TO_NI(ch); + + /* Temporary restrict decreasing max_tables */ + if (ntables < V_fw_tables_max) { + + /* + * FIXME: Check if we really can shrink + */ + IPFW_UH_WUNLOCK(ch); + return (EINVAL); + } + + /* Copy table info/indices */ + memcpy(tablestate, ch->tablestate, sizeof(struct table_info) * tbl); + ipfw_objhash_bitmap_merge(ni, &new_idx, &new_blocks); + + IPFW_WLOCK(ch); + + /* Change pointers */ + old_tablestate = ch->tablestate; + ch->tablestate = tablestate; + ipfw_objhash_bitmap_swap(ni, &new_idx, &new_blocks); + + ntables_old = V_fw_tables_max; + V_fw_tables_max = ntables; + + IPFW_WUNLOCK(ch); + + /* Notify all consumers that their @ti pointer has changed */ + ti = (struct table_info *)ch->tablestate; + for (i = 0; i < tbl; i++, ti++) { + if (ti->lookup == NULL) + continue; + tc = (struct table_config *)ipfw_objhash_lookup_kidx(ni, i); + if (tc == NULL || tc->ta->change_ti == NULL) + continue; + + tc->ta->change_ti(tc->astate, ti); + } + + IPFW_UH_WUNLOCK(ch); + + /* Free old pointers */ + free(old_tablestate, M_IPFW); + ipfw_objhash_bitmap_free(new_idx, new_blocks); + + return (0); +} + +/* + * Switch between "set 0" and "rule's set" table binding, + * Check all ruleset bindings and permits changing + * IFF each binding has both rule AND table in default set (set 0). + * + * Returns 0 on success. 
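+ *
+ * (In stock FreeBSD the switch is driven through the
+ * net.inet.ip.fw.tables_sets sysctl that backs V_fw_tables_sets;
+ * the sysctl name is an assumption here, as it is defined outside
+ * this file.) A switch attempt fails with EBUSY while any rule
+ * outside set 0 still references a table, as checked below.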
+ */ +int +ipfw_switch_tables_namespace(struct ip_fw_chain *ch, unsigned int sets) +{ + struct namedobj_instance *ni; + struct named_object *no; + struct ip_fw *rule; + ipfw_insn *cmd; + int cmdlen, i, l; + uint16_t kidx; + uint8_t type; + + IPFW_UH_WLOCK(ch); + + if (V_fw_tables_sets == sets) { + IPFW_UH_WUNLOCK(ch); + return (0); + } + + ni = CHAIN_TO_NI(ch); + + /* + * Scan all rules and examine tables opcodes. + */ + for (i = 0; i < ch->n_rules; i++) { + rule = ch->map[i]; + + l = rule->cmd_len; + cmd = rule->cmd; + cmdlen = 0; + for ( ; l > 0 ; l -= cmdlen, cmd += cmdlen) { + cmdlen = F_LEN(cmd); + + if (classify_table_opcode(cmd, &kidx, &type) != 0) + continue; + + no = ipfw_objhash_lookup_kidx(ni, kidx); + + /* Check if both table object and rule has the set 0 */ + if (no->set != 0 || rule->set != 0) { + IPFW_UH_WUNLOCK(ch); + return (EBUSY); + } + + } + } + V_fw_tables_sets = sets; + + IPFW_UH_WUNLOCK(ch); + + return (0); +} + +/* + * Lookup an IP @addr in table @tbl. + * Stores found value in @val. + * + * Returns 1 if @addr was found. + */ +int +ipfw_lookup_table(struct ip_fw_chain *ch, uint16_t tbl, in_addr_t addr, + uint32_t *val) +{ + struct table_info *ti; + + ti = KIDX_TO_TI(ch, tbl); + + return (ti->lookup(ti, &addr, sizeof(in_addr_t), val)); +} + +/* + * Lookup an arbtrary key @paddr of legth @plen in table @tbl. + * Stores found value in @val. + * + * Returns 1 if key was found. + */ +int +ipfw_lookup_table_extended(struct ip_fw_chain *ch, uint16_t tbl, uint16_t plen, + void *paddr, uint32_t *val) +{ + struct table_info *ti; + + ti = KIDX_TO_TI(ch, tbl); + + return (ti->lookup(ti, paddr, plen, val)); +} + +/* + * Info/List/dump support for tables. + * + */ + +/* + * High-level 'get' cmds sysctl handlers + */ + +/* + * Lists all tables currently available in kernel. + * Data layout (v0)(current): + * Request: [ ipfw_obj_lheader ], size = ipfw_obj_lheader.size + * Reply: [ ipfw_obj_lheader ipfw_xtable_info x N ] + * + * Returns 0 on success + */ +static int +list_tables(struct ip_fw_chain *ch, ip_fw3_opheader *op3, + struct sockopt_data *sd) +{ + struct _ipfw_obj_lheader *olh; + int error; + + olh = (struct _ipfw_obj_lheader *)ipfw_get_sopt_header(sd,sizeof(*olh)); + if (olh == NULL) + return (EINVAL); + if (sd->valsize < olh->size) + return (EINVAL); + + IPFW_UH_RLOCK(ch); + error = export_tables(ch, olh, sd); + IPFW_UH_RUNLOCK(ch); + + return (error); +} + +/* + * Store table info to buffer provided by @sd. + * Data layout (v0)(current): + * Request: [ ipfw_obj_header ipfw_xtable_info(empty)] + * Reply: [ ipfw_obj_header ipfw_xtable_info ] + * + * Returns 0 on success. + */ +static int +describe_table(struct ip_fw_chain *ch, ip_fw3_opheader *op3, + struct sockopt_data *sd) +{ + struct _ipfw_obj_header *oh; + struct table_config *tc; + struct tid_info ti; + size_t sz; + + sz = sizeof(*oh) + sizeof(ipfw_xtable_info); + oh = (struct _ipfw_obj_header *)ipfw_get_sopt_header(sd, sz); + if (oh == NULL) + return (EINVAL); + + objheader_to_ti(oh, &ti); + + IPFW_UH_RLOCK(ch); + if ((tc = find_table(CHAIN_TO_NI(ch), &ti)) == NULL) { + IPFW_UH_RUNLOCK(ch); + return (ESRCH); + } + + export_table_info(ch, tc, (ipfw_xtable_info *)(oh + 1)); + IPFW_UH_RUNLOCK(ch); + + return (0); +} + +/* + * Modifies existing table. 
+ * Data layout (v0)(current): + * Request: [ ipfw_obj_header ipfw_xtable_info ] + * + * Returns 0 on success + */ +static int +modify_table(struct ip_fw_chain *ch, ip_fw3_opheader *op3, + struct sockopt_data *sd) +{ + struct _ipfw_obj_header *oh; + ipfw_xtable_info *i; + char *tname; + struct tid_info ti; + struct namedobj_instance *ni; + struct table_config *tc; + + if (sd->valsize != sizeof(*oh) + sizeof(ipfw_xtable_info)) + return (EINVAL); + + oh = (struct _ipfw_obj_header *)sd->kbuf; + i = (ipfw_xtable_info *)(oh + 1); + + /* + * Verify user-supplied strings. + * Check for null-terminated/zero-length strings/ + */ + tname = oh->ntlv.name; + if (ipfw_check_table_name(tname) != 0) + return (EINVAL); + + objheader_to_ti(oh, &ti); + ti.type = i->type; + + IPFW_UH_WLOCK(ch); + ni = CHAIN_TO_NI(ch); + if ((tc = find_table(ni, &ti)) == NULL) { + IPFW_UH_WUNLOCK(ch); + return (ESRCH); + } + + /* Do not support any modifications for readonly tables */ + if ((tc->ta->flags & TA_FLAG_READONLY) != 0) { + IPFW_UH_WUNLOCK(ch); + return (EACCES); + } + + if ((i->mflags & IPFW_TMFLAGS_LIMIT) != 0) + tc->limit = i->limit; + if ((i->mflags & IPFW_TMFLAGS_LOCK) != 0) + tc->locked = ((i->flags & IPFW_TGFLAGS_LOCKED) != 0); + IPFW_UH_WUNLOCK(ch); + + return (0); +} + +/* + * Creates new table. + * Data layout (v0)(current): + * Request: [ ipfw_obj_header ipfw_xtable_info ] + * + * Returns 0 on success + */ +static int +create_table(struct ip_fw_chain *ch, ip_fw3_opheader *op3, + struct sockopt_data *sd) +{ + struct _ipfw_obj_header *oh; + ipfw_xtable_info *i; + char *tname, *aname; + struct tid_info ti; + struct namedobj_instance *ni; + struct table_config *tc; + + if (sd->valsize != sizeof(*oh) + sizeof(ipfw_xtable_info)) + return (EINVAL); + + oh = (struct _ipfw_obj_header *)sd->kbuf; + i = (ipfw_xtable_info *)(oh + 1); + + /* + * Verify user-supplied strings. + * Check for null-terminated/zero-length strings/ + */ + tname = oh->ntlv.name; + aname = i->algoname; + if (ipfw_check_table_name(tname) != 0 || + strnlen(aname, sizeof(i->algoname)) == sizeof(i->algoname)) + return (EINVAL); + + if (aname[0] == '\0') { + /* Use default algorithm */ + aname = NULL; + } + + objheader_to_ti(oh, &ti); + ti.type = i->type; + + ni = CHAIN_TO_NI(ch); + + IPFW_UH_RLOCK(ch); + if ((tc = find_table(ni, &ti)) != NULL) { + IPFW_UH_RUNLOCK(ch); + return (EEXIST); + } + IPFW_UH_RUNLOCK(ch); + + return (create_table_internal(ch, &ti, aname, i, NULL, 0)); +} + +/* + * Creates new table based on @ti and @aname. + * + * Relies on table name checking inside find_name_tlv() + * Assume @aname to be checked and valid. + * Stores allocated table kidx inside @pkidx (if non-NULL). + * Reference created table if @compat is non-zero. + * + * Returns 0 on success. 
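+ * If kidx allocation fails, the EBUSY path below points the operator
+ * at the net.inet.ip.fw.tables_max sysctl, whose value bounds how
+ * many table indices ipfw_objhash_alloc_idx() may hand out (see
+ * ipfw_resize_tables() above).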
+ */ +static int +create_table_internal(struct ip_fw_chain *ch, struct tid_info *ti, + char *aname, ipfw_xtable_info *i, uint16_t *pkidx, int compat) +{ + struct namedobj_instance *ni; + struct table_config *tc, *tc_new, *tmp; + struct table_algo *ta; + uint16_t kidx; + + ni = CHAIN_TO_NI(ch); + + ta = find_table_algo(CHAIN_TO_TCFG(ch), ti, aname); + if (ta == NULL) + return (ENOTSUP); + + tc = alloc_table_config(ch, ti, ta, aname, i->tflags); + if (tc == NULL) + return (ENOMEM); + + tc->vmask = i->vmask; + tc->limit = i->limit; + if (ta->flags & TA_FLAG_READONLY) + tc->locked = 1; + else + tc->locked = (i->flags & IPFW_TGFLAGS_LOCKED) != 0; + + IPFW_UH_WLOCK(ch); + + /* Check if table has been already created */ + tc_new = find_table(ni, ti); + if (tc_new != NULL) { + + /* + * Compat: do not fail if we're + * requesting to create existing table + * which has the same type + */ + if (compat == 0 || tc_new->no.type != tc->no.type) { + IPFW_UH_WUNLOCK(ch); + free_table_config(ni, tc); + return (EEXIST); + } + + /* Exchange tc and tc_new for proper refcounting & freeing */ + tmp = tc; + tc = tc_new; + tc_new = tmp; + } else { + /* New table */ + if (ipfw_objhash_alloc_idx(ni, &kidx) != 0) { + IPFW_UH_WUNLOCK(ch); + printf("Unable to allocate table index." + " Consider increasing net.inet.ip.fw.tables_max"); + free_table_config(ni, tc); + return (EBUSY); + } + tc->no.kidx = kidx; + + IPFW_WLOCK(ch); + link_table(ch, tc); + IPFW_WUNLOCK(ch); + } + + if (compat != 0) + tc->no.refcnt++; + if (pkidx != NULL) + *pkidx = tc->no.kidx; + + IPFW_UH_WUNLOCK(ch); + + if (tc_new != NULL) + free_table_config(ni, tc_new); + + return (0); +} + +static void +ntlv_to_ti(ipfw_obj_ntlv *ntlv, struct tid_info *ti) +{ + + memset(ti, 0, sizeof(struct tid_info)); + ti->set = ntlv->set; + ti->uidx = ntlv->idx; + ti->tlvs = ntlv; + ti->tlen = ntlv->head.length; +} + +static void +objheader_to_ti(struct _ipfw_obj_header *oh, struct tid_info *ti) +{ + + ntlv_to_ti(&oh->ntlv, ti); +} + +/* + * Exports basic table info as name TLV. + * Used inside dump_static_rules() to provide info + * about all tables referenced by current ruleset. + * + * Returns 0 on success. + */ +int +ipfw_export_table_ntlv(struct ip_fw_chain *ch, uint16_t kidx, + struct sockopt_data *sd) +{ + struct namedobj_instance *ni; + struct named_object *no; + ipfw_obj_ntlv *ntlv; + + ni = CHAIN_TO_NI(ch); + + no = ipfw_objhash_lookup_kidx(ni, kidx); + KASSERT(no != NULL, ("invalid table kidx passed")); + + ntlv = (ipfw_obj_ntlv *)ipfw_get_sopt_space(sd, sizeof(*ntlv)); + if (ntlv == NULL) + return (ENOMEM); + + ntlv->head.type = IPFW_TLV_TBL_NAME; + ntlv->head.length = sizeof(*ntlv); + ntlv->idx = no->kidx; + strlcpy(ntlv->name, no->name, sizeof(ntlv->name)); + + return (0); +} + +/* + * Marks every table kidx used in @rule with bit in @bmask. + * Used to generate bitmask of referenced tables for given ruleset. + * + * Returns number of newly-referenced tables. 
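The function below relies on the usual one-bit-per-kidx bitmap. Isolated
as a standalone sketch, the marking step it repeats per opcode is:

/* Set kidx's bit in @bmask; return 1 only when it was newly set. */
static int
mark_kidx_once(uint32_t *bmask, uint16_t kidx)
{
	uint32_t bit = 1u << (kidx % 32);

	if ((bmask[kidx / 32] & bit) != 0)
		return (0);		/* table already counted */
	bmask[kidx / 32] |= bit;
	return (1);
}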
+ */ +int +ipfw_mark_table_kidx(struct ip_fw_chain *chain, struct ip_fw *rule, + uint32_t *bmask) +{ + int cmdlen, l, count; + ipfw_insn *cmd; + uint16_t kidx; + uint8_t type; + + l = rule->cmd_len; + cmd = rule->cmd; + cmdlen = 0; + count = 0; + for ( ; l > 0 ; l -= cmdlen, cmd += cmdlen) { + cmdlen = F_LEN(cmd); + + if (classify_table_opcode(cmd, &kidx, &type) != 0) + continue; + + if ((bmask[kidx / 32] & (1 << (kidx % 32))) == 0) + count++; + + bmask[kidx / 32] |= 1 << (kidx % 32); + } + + return (count); +} + +struct dump_args { + struct ip_fw_chain *ch; + struct table_info *ti; + struct table_config *tc; + struct sockopt_data *sd; + uint32_t cnt; + uint16_t uidx; + int error; + uint32_t size; + ipfw_table_entry *ent; + ta_foreach_f *f; + void *farg; + ipfw_obj_tentry tent; +}; + +static int +count_ext_entries(void *e, void *arg) +{ + struct dump_args *da; + + da = (struct dump_args *)arg; + da->cnt++; + + return (0); +} + +/* + * Gets number of items from table either using + * internal counter or calling algo callback for + * externally-managed tables. + * + * Returns number of records. + */ +static uint32_t +table_get_count(struct ip_fw_chain *ch, struct table_config *tc) +{ + struct table_info *ti; + struct table_algo *ta; + struct dump_args da; + + ti = KIDX_TO_TI(ch, tc->no.kidx); + ta = tc->ta; + + /* Use internal counter for self-managed tables */ + if ((ta->flags & TA_FLAG_READONLY) == 0) + return (tc->count); + + /* Use callback to quickly get number of items */ + if ((ta->flags & TA_FLAG_EXTCOUNTER) != 0) + return (ta->get_count(tc->astate, ti)); + + /* Count number of iterms ourselves */ + memset(&da, 0, sizeof(da)); + ta->foreach(tc->astate, ti, count_ext_entries, &da); + + return (da.cnt); +} + +/* + * Exports table @tc info into standard ipfw_xtable_info format. + */ +static void +export_table_info(struct ip_fw_chain *ch, struct table_config *tc, + ipfw_xtable_info *i) +{ + struct table_info *ti; + struct table_algo *ta; + + i->type = tc->no.type; + i->tflags = tc->tflags; + i->vmask = tc->vmask; + i->set = tc->no.set; + i->kidx = tc->no.kidx; + i->refcnt = tc->no.refcnt; + i->count = table_get_count(ch, tc); + i->limit = tc->limit; + i->flags |= (tc->locked != 0) ? IPFW_TGFLAGS_LOCKED : 0; + i->size = tc->count * sizeof(ipfw_obj_tentry); + i->size += sizeof(ipfw_obj_header) + sizeof(ipfw_xtable_info); + strlcpy(i->tablename, tc->tablename, sizeof(i->tablename)); + ti = KIDX_TO_TI(ch, tc->no.kidx); + ta = tc->ta; + if (ta->print_config != NULL) { + /* Use algo function to print table config to string */ + ta->print_config(tc->astate, ti, i->algoname, + sizeof(i->algoname)); + } else + strlcpy(i->algoname, ta->name, sizeof(i->algoname)); + /* Dump algo-specific data, if possible */ + if (ta->dump_tinfo != NULL) { + ta->dump_tinfo(tc->astate, ti, &i->ta_info); + i->ta_info.flags |= IPFW_TATFLAGS_DATA; + } +} + +struct dump_table_args { + struct ip_fw_chain *ch; + struct sockopt_data *sd; +}; + +static void +export_table_internal(struct namedobj_instance *ni, struct named_object *no, + void *arg) +{ + ipfw_xtable_info *i; + struct dump_table_args *dta; + + dta = (struct dump_table_args *)arg; + + i = (ipfw_xtable_info *)ipfw_get_sopt_space(dta->sd, sizeof(*i)); + KASSERT(i != 0, ("previously checked buffer is not enough")); + + export_table_info(dta->ch, (struct table_config *)no, i); +} + +/* + * Export all tables as ipfw_xtable_info structures to + * storage provided by @sd. + * + * If supplied buffer is too small, fills in required size + * and returns ENOMEM. 
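That ENOMEM contract gives userland a simple size-then-fetch protocol. A
sketch of the caller side, where get_sopt() is a hypothetical
getsockopt-style wrapper returning an errno value; a real caller would
also loop (the table set can grow between the two calls) and handle the
case where the first call already succeeds:

#include <errno.h>
#include <stdlib.h>
#include <string.h>

static ipfw_obj_lheader *
fetch_tables_list(void)
{
	ipfw_obj_lheader req, *olh;

	memset(&req, 0, sizeof(req));
	req.size = sizeof(req);	/* deliberately undersized */
	if (get_sopt(IP_FW_TABLES_XLIST, &req, sizeof(req)) != ENOMEM)
		return (NULL);
	if ((olh = calloc(1, req.size)) == NULL)
		return (NULL);
	olh->size = req.size;	/* kernel stored the required size */
	if (get_sopt(IP_FW_TABLES_XLIST, olh, olh->size) != 0) {
		free(olh);
		return (NULL);
	}
	return (olh);
}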
+ * Returns 0 on success. + */ +static int +export_tables(struct ip_fw_chain *ch, ipfw_obj_lheader *olh, + struct sockopt_data *sd) +{ + uint32_t size; + uint32_t count; + struct dump_table_args dta; + + count = ipfw_objhash_count(CHAIN_TO_NI(ch)); + size = count * sizeof(ipfw_xtable_info) + sizeof(ipfw_obj_lheader); + + /* Fill in header regadless of buffer size */ + olh->count = count; + olh->objsize = sizeof(ipfw_xtable_info); + + if (size > olh->size) { + olh->size = size; + return (ENOMEM); + } + + olh->size = size; + + dta.ch = ch; + dta.sd = sd; + + ipfw_objhash_foreach(CHAIN_TO_NI(ch), export_table_internal, &dta); + + return (0); +} + +/* + * Dumps all table data + * Data layout (v1)(current): + * Request: [ ipfw_obj_header ], size = ipfw_xtable_info.size + * Reply: [ ipfw_obj_header ipfw_xtable_info ipfw_obj_tentry x N ] + * + * Returns 0 on success + */ +static int +dump_table_v1(struct ip_fw_chain *ch, ip_fw3_opheader *op3, + struct sockopt_data *sd) +{ + struct _ipfw_obj_header *oh; + ipfw_xtable_info *i; + struct tid_info ti; + struct table_config *tc; + struct table_algo *ta; + struct dump_args da; + uint32_t sz; + + sz = sizeof(ipfw_obj_header) + sizeof(ipfw_xtable_info); + oh = (struct _ipfw_obj_header *)ipfw_get_sopt_header(sd, sz); + if (oh == NULL) + return (EINVAL); + + i = (ipfw_xtable_info *)(oh + 1); + objheader_to_ti(oh, &ti); + + IPFW_UH_RLOCK(ch); + if ((tc = find_table(CHAIN_TO_NI(ch), &ti)) == NULL) { + IPFW_UH_RUNLOCK(ch); + return (ESRCH); + } + export_table_info(ch, tc, i); + + if (sd->valsize < i->size) { + + /* + * Submitted buffer size is not enough. + * WE've already filled in @i structure with + * relevant table info including size, so we + * can return. Buffer will be flushed automatically. + */ + IPFW_UH_RUNLOCK(ch); + return (ENOMEM); + } + + /* + * Do the actual dump in eXtended format + */ + memset(&da, 0, sizeof(da)); + da.ch = ch; + da.ti = KIDX_TO_TI(ch, tc->no.kidx); + da.tc = tc; + da.sd = sd; + + ta = tc->ta; + + ta->foreach(tc->astate, da.ti, dump_table_tentry, &da); + IPFW_UH_RUNLOCK(ch); + + return (da.error); +} + +/* + * Dumps all table data + * Data layout (version 0)(legacy): + * Request: [ ipfw_xtable ], size = IP_FW_TABLE_XGETSIZE() + * Reply: [ ipfw_xtable ipfw_table_xentry x N ] + * + * Returns 0 on success + */ +static int +dump_table_v0(struct ip_fw_chain *ch, ip_fw3_opheader *op3, + struct sockopt_data *sd) +{ + ipfw_xtable *xtbl; + struct tid_info ti; + struct table_config *tc; + struct table_algo *ta; + struct dump_args da; + size_t sz, count; + + xtbl = (ipfw_xtable *)ipfw_get_sopt_header(sd, sizeof(ipfw_xtable)); + if (xtbl == NULL) + return (EINVAL); + + memset(&ti, 0, sizeof(ti)); + ti.uidx = xtbl->tbl; + + IPFW_UH_RLOCK(ch); + if ((tc = find_table(CHAIN_TO_NI(ch), &ti)) == NULL) { + IPFW_UH_RUNLOCK(ch); + return (0); + } + count = table_get_count(ch, tc); + sz = count * sizeof(ipfw_table_xentry) + sizeof(ipfw_xtable); + + xtbl->cnt = count; + xtbl->size = sz; + xtbl->type = tc->no.type; + xtbl->tbl = ti.uidx; + + if (sd->valsize < sz) { + + /* + * Submitted buffer size is not enough. + * WE've already filled in @i structure with + * relevant table info including size, so we + * can return. Buffer will be flushed automatically. 
+ */ + IPFW_UH_RUNLOCK(ch); + return (ENOMEM); + } + + /* Do the actual dump in eXtended format */ + memset(&da, 0, sizeof(da)); + da.ch = ch; + da.ti = KIDX_TO_TI(ch, tc->no.kidx); + da.tc = tc; + da.sd = sd; + + ta = tc->ta; + + ta->foreach(tc->astate, da.ti, dump_table_xentry, &da); + IPFW_UH_RUNLOCK(ch); + + return (0); +} + +/* + * Legacy function to retrieve number of items in table. + */ +static int +get_table_size(struct ip_fw_chain *ch, ip_fw3_opheader *op3, + struct sockopt_data *sd) +{ + uint32_t *tbl; + struct tid_info ti; + size_t sz; + int error; + + sz = sizeof(*op3) + sizeof(uint32_t); + op3 = (ip_fw3_opheader *)ipfw_get_sopt_header(sd, sz); + if (op3 == NULL) + return (EINVAL); + + tbl = (uint32_t *)(op3 + 1); + memset(&ti, 0, sizeof(ti)); + ti.uidx = *tbl; + IPFW_UH_RLOCK(ch); + error = ipfw_count_xtable(ch, &ti, tbl); + IPFW_UH_RUNLOCK(ch); + return (error); +} + +/* + * Legacy IP_FW_TABLE_GETSIZE handler + */ +int +ipfw_count_table(struct ip_fw_chain *ch, struct tid_info *ti, uint32_t *cnt) +{ + struct table_config *tc; + + if ((tc = find_table(CHAIN_TO_NI(ch), ti)) == NULL) + return (ESRCH); + *cnt = table_get_count(ch, tc); + return (0); +} + +/* + * Legacy IP_FW_TABLE_XGETSIZE handler + */ +int +ipfw_count_xtable(struct ip_fw_chain *ch, struct tid_info *ti, uint32_t *cnt) +{ + struct table_config *tc; + uint32_t count; + + if ((tc = find_table(CHAIN_TO_NI(ch), ti)) == NULL) { + *cnt = 0; + return (0); /* 'table all list' requires success */ + } + + count = table_get_count(ch, tc); + *cnt = count * sizeof(ipfw_table_xentry); + if (count > 0) + *cnt += sizeof(ipfw_xtable); + return (0); +} + +static int +dump_table_entry(void *e, void *arg) +{ + struct dump_args *da; + struct table_config *tc; + struct table_algo *ta; + ipfw_table_entry *ent; + struct table_value *pval; + int error; + + da = (struct dump_args *)arg; + + tc = da->tc; + ta = tc->ta; + + /* Out of memory, returning */ + if (da->cnt == da->size) + return (1); + ent = da->ent++; + ent->tbl = da->uidx; + da->cnt++; + + error = ta->dump_tentry(tc->astate, da->ti, e, &da->tent); + if (error != 0) + return (error); + + ent->addr = da->tent.k.addr.s_addr; + ent->masklen = da->tent.masklen; + pval = get_table_value(da->ch, da->tc, da->tent.v.kidx); + ent->value = ipfw_export_table_value_legacy(pval); + + return (0); +} + +/* + * Dumps table in pre-8.1 legacy format. + */ +int +ipfw_dump_table_legacy(struct ip_fw_chain *ch, struct tid_info *ti, + ipfw_table *tbl) +{ + struct table_config *tc; + struct table_algo *ta; + struct dump_args da; + + tbl->cnt = 0; + + if ((tc = find_table(CHAIN_TO_NI(ch), ti)) == NULL) + return (0); /* XXX: We should return ESRCH */ + + ta = tc->ta; + + /* This dump format supports IPv4 only */ + if (tc->no.type != IPFW_TABLE_ADDR) + return (0); + + memset(&da, 0, sizeof(da)); + da.ch = ch; + da.ti = KIDX_TO_TI(ch, tc->no.kidx); + da.tc = tc; + da.ent = &tbl->ent[0]; + da.size = tbl->size; + + tbl->cnt = 0; + ta->foreach(tc->astate, da.ti, dump_table_entry, &da); + tbl->cnt = da.cnt; + + return (0); +} + +/* + * Dumps table entry in eXtended format (v1)(current). 
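On the receiving side, the v1 dump reply is the three layers from the
layout comment above, packed back to back. A sketch of walking it,
assuming @buf holds a successful IP_FW_TABLE_XLIST version 1 reply:

#include <stdio.h>

static void
walk_v1_reply(char *buf)
{
	ipfw_obj_header *oh = (ipfw_obj_header *)buf;
	ipfw_xtable_info *i = (ipfw_xtable_info *)(oh + 1);
	ipfw_obj_tentry *tent = (ipfw_obj_tentry *)(i + 1);
	uint32_t n;

	/* dump_table_tentry() sets head.length to sizeof(*tent) */
	for (n = 0; n < i->count; n++, tent++)
		printf("entry %u: masklen %u\n", n, tent->masklen);
}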
+ */ +static int +dump_table_tentry(void *e, void *arg) +{ + struct dump_args *da; + struct table_config *tc; + struct table_algo *ta; + struct table_value *pval; + ipfw_obj_tentry *tent; + int error; + + da = (struct dump_args *)arg; + + tc = da->tc; + ta = tc->ta; + + tent = (ipfw_obj_tentry *)ipfw_get_sopt_space(da->sd, sizeof(*tent)); + /* Out of memory, returning */ + if (tent == NULL) { + da->error = ENOMEM; + return (1); + } + tent->head.length = sizeof(ipfw_obj_tentry); + tent->idx = da->uidx; + + error = ta->dump_tentry(tc->astate, da->ti, e, tent); + if (error != 0) + return (error); + + pval = get_table_value(da->ch, da->tc, tent->v.kidx); + ipfw_export_table_value_v1(pval, &tent->v.value); + + return (0); +} + +/* + * Dumps table entry in eXtended format (v0). + */ +static int +dump_table_xentry(void *e, void *arg) +{ + struct dump_args *da; + struct table_config *tc; + struct table_algo *ta; + ipfw_table_xentry *xent; + ipfw_obj_tentry *tent; + struct table_value *pval; + int error; + + da = (struct dump_args *)arg; + + tc = da->tc; + ta = tc->ta; + + xent = (ipfw_table_xentry *)ipfw_get_sopt_space(da->sd, sizeof(*xent)); + /* Out of memory, returning */ + if (xent == NULL) + return (1); + xent->len = sizeof(ipfw_table_xentry); + xent->tbl = da->uidx; + + memset(&da->tent, 0, sizeof(da->tent)); + tent = &da->tent; + error = ta->dump_tentry(tc->astate, da->ti, e, tent); + if (error != 0) + return (error); + + /* Convert current format to previous one */ + xent->masklen = tent->masklen; + pval = get_table_value(da->ch, da->tc, da->tent.v.kidx); + xent->value = ipfw_export_table_value_legacy(pval); + /* Apply some hacks */ + if (tc->no.type == IPFW_TABLE_ADDR && tent->subtype == AF_INET) { + xent->k.addr6.s6_addr32[3] = tent->k.addr.s_addr; + xent->flags = IPFW_TCF_INET; + } else + memcpy(&xent->k, &tent->k, sizeof(xent->k)); + + return (0); +} + +/* + * Helper function to export table algo data + * to tentry format before calling user function. + * + * Returns 0 on success. + */ +static int +prepare_table_tentry(void *e, void *arg) +{ + struct dump_args *da; + struct table_config *tc; + struct table_algo *ta; + int error; + + da = (struct dump_args *)arg; + + tc = da->tc; + ta = tc->ta; + + error = ta->dump_tentry(tc->astate, da->ti, e, &da->tent); + if (error != 0) + return (error); + + da->f(&da->tent, da->farg); + + return (0); +} + +/* + * Allow external consumers to read table entries in standard format. + */ +int +ipfw_foreach_table_tentry(struct ip_fw_chain *ch, uint16_t kidx, + ta_foreach_f *f, void *arg) +{ + struct namedobj_instance *ni; + struct table_config *tc; + struct table_algo *ta; + struct dump_args da; + + ni = CHAIN_TO_NI(ch); + + tc = (struct table_config *)ipfw_objhash_lookup_kidx(ni, kidx); + if (tc == NULL) + return (ESRCH); + + ta = tc->ta; + + memset(&da, 0, sizeof(da)); + da.ch = ch; + da.ti = KIDX_TO_TI(ch, tc->no.kidx); + da.tc = tc; + da.f = f; + da.farg = arg; + + ta->foreach(tc->astate, da.ti, prepare_table_tentry, &da); + + return (0); +} + +/* + * Table algorithms + */ + +/* + * Finds algoritm by index, table type or supplied name. + * + * Returns pointer to algo or NULL. 
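ipfw_foreach_table_tentry() above is the intended hook for other kernel
consumers: entries arrive already converted to the standard
ipfw_obj_tentry format. A minimal sketch, kernel context assumed:

static int
count_one_tentry(void *tentry, void *arg)	/* ta_foreach_f */
{

	(*(uint32_t *)arg)++;
	return (0);
}

static uint32_t
count_table_tentries(struct ip_fw_chain *ch, uint16_t kidx)
{
	uint32_t n = 0;

	if (ipfw_foreach_table_tentry(ch, kidx, count_one_tentry, &n) != 0)
		return (0);	/* ESRCH: no such table */
	return (n);
}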
+ */ +static struct table_algo * +find_table_algo(struct tables_config *tcfg, struct tid_info *ti, char *name) +{ + int i, l; + struct table_algo *ta; + + if (ti->type > IPFW_TABLE_MAXTYPE) + return (NULL); + + /* Search by index */ + if (ti->atype != 0) { + if (ti->atype > tcfg->algo_count) + return (NULL); + return (tcfg->algo[ti->atype]); + } + + if (name == NULL) { + /* Return default algorithm for given type if set */ + return (tcfg->def_algo[ti->type]); + } + + /* Search by name */ + /* TODO: better search */ + for (i = 1; i <= tcfg->algo_count; i++) { + ta = tcfg->algo[i]; + + /* + * One can supply additional algorithm + * parameters so we compare only the first word + * of supplied name: + * 'addr:chash hsize=32' + * '^^^^^^^^^' + * + */ + l = strlen(ta->name); + if (strncmp(name, ta->name, l) != 0) + continue; + if (name[l] != '\0' && name[l] != ' ') + continue; + /* Check if we're requesting proper table type */ + if (ti->type != 0 && ti->type != ta->type) + return (NULL); + return (ta); + } + + return (NULL); +} + +/* + * Register new table algo @ta. + * Stores algo id inside @idx. + * + * Returns 0 on success. + */ +int +ipfw_add_table_algo(struct ip_fw_chain *ch, struct table_algo *ta, size_t size, + uint32_t *idx) +{ + struct tables_config *tcfg; + struct table_algo *ta_new; + size_t sz; + + if (size > sizeof(struct table_algo)) + return (EINVAL); + + /* Check for the required on-stack size for add/del */ + sz = roundup2(ta->ta_buf_size, sizeof(void *)); + if (sz > TA_BUF_SZ) + return (EINVAL); + + KASSERT(ta->type <= IPFW_TABLE_MAXTYPE,("Increase IPFW_TABLE_MAXTYPE")); + + /* Copy algorithm data to stable storage. */ + ta_new = malloc(sizeof(struct table_algo), M_IPFW, M_WAITOK | M_ZERO); + memcpy(ta_new, ta, size); + + tcfg = CHAIN_TO_TCFG(ch); + + KASSERT(tcfg->algo_count < 255, ("Increase algo array size")); + + tcfg->algo[++tcfg->algo_count] = ta_new; + ta_new->idx = tcfg->algo_count; + + /* Set algorithm as default one for given type */ + if ((ta_new->flags & TA_FLAG_DEFAULT) != 0 && + tcfg->def_algo[ta_new->type] == NULL) + tcfg->def_algo[ta_new->type] = ta_new; + + *idx = ta_new->idx; + + return (0); +} + +/* + * Unregisters table algo using @idx as id. + * XXX: It is NOT safe to call this function in any place + * other than ipfw instance destroy handler. + */ +void +ipfw_del_table_algo(struct ip_fw_chain *ch, int idx) +{ + struct tables_config *tcfg; + struct table_algo *ta; + + tcfg = CHAIN_TO_TCFG(ch); + + KASSERT(idx <= tcfg->algo_count, ("algo idx %d out of range 1..%d", + idx, tcfg->algo_count)); + + ta = tcfg->algo[idx]; + KASSERT(ta != NULL, ("algo idx %d is NULL", idx)); + + if (tcfg->def_algo[ta->type] == ta) + tcfg->def_algo[ta->type] = NULL; + + free(ta, M_IPFW); +} + +/* + * Lists all table algorithms currently available. 
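For context, an algorithm module plugs into ipfw_add_table_algo() above
with a filled-in struct table_algo. A sketch with hypothetical *_example
names; only the shape matters here, and the mandatory callbacks are the
ones documented in ip_fw_table_algo.c:

struct ta_buf_example {
	void *ent_ptr;		/* hypothetical add/del state */
};

static struct table_algo addr_example = {
	.name		= "addr:example",
	.type		= IPFW_TABLE_ADDR,
	.ta_buf_size	= sizeof(struct ta_buf_example),
	.init		= ta_init_example,
	.destroy	= ta_destroy_example,
	.prepare_add	= ta_prepare_add_example,
	.prepare_del	= ta_prepare_del_example,
	.add		= ta_add_example,
	.del		= ta_del_example,
	.flush_entry	= ta_flush_entry_example,
	.foreach	= ta_foreach_example,
	.dump_tentry	= ta_dump_tentry_example,
};
static uint32_t example_algo_idx;

static void
example_algo_init(struct ip_fw_chain *ch)
{

	ipfw_add_table_algo(ch, &addr_example, sizeof(addr_example),
	    &example_algo_idx);
}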
+ * Data layout (v0)(current): + * Request: [ ipfw_obj_lheader ], size = ipfw_obj_lheader.size + * Reply: [ ipfw_obj_lheader ipfw_ta_info x N ] + * + * Returns 0 on success + */ +static int +list_table_algo(struct ip_fw_chain *ch, ip_fw3_opheader *op3, + struct sockopt_data *sd) +{ + struct _ipfw_obj_lheader *olh; + struct tables_config *tcfg; + ipfw_ta_info *i; + struct table_algo *ta; + uint32_t count, n, size; + + olh = (struct _ipfw_obj_lheader *)ipfw_get_sopt_header(sd,sizeof(*olh)); + if (olh == NULL) + return (EINVAL); + if (sd->valsize < olh->size) + return (EINVAL); + + IPFW_UH_RLOCK(ch); + tcfg = CHAIN_TO_TCFG(ch); + count = tcfg->algo_count; + size = count * sizeof(ipfw_ta_info) + sizeof(ipfw_obj_lheader); + + /* Fill in header regadless of buffer size */ + olh->count = count; + olh->objsize = sizeof(ipfw_ta_info); + + if (size > olh->size) { + olh->size = size; + IPFW_UH_RUNLOCK(ch); + return (ENOMEM); + } + olh->size = size; + + for (n = 1; n <= count; n++) { + i = (ipfw_ta_info *)ipfw_get_sopt_space(sd, sizeof(*i)); + KASSERT(i != 0, ("previously checked buffer is not enough")); + ta = tcfg->algo[n]; + strlcpy(i->algoname, ta->name, sizeof(i->algoname)); + i->type = ta->type; + i->refcnt = ta->refcnt; + } + + IPFW_UH_RUNLOCK(ch); + + return (0); +} + +/* + * Tables rewriting code + */ + +/* + * Determine table number and lookup type for @cmd. + * Fill @tbl and @type with appropriate values. + * Returns 0 for relevant opcodes, 1 otherwise. + */ +static int +classify_table_opcode(ipfw_insn *cmd, uint16_t *puidx, uint8_t *ptype) +{ + ipfw_insn_if *cmdif; + int skip; + uint16_t v; + + skip = 1; + + switch (cmd->opcode) { + case O_IP_SRC_LOOKUP: + case O_IP_DST_LOOKUP: + /* Basic IPv4/IPv6 or u32 lookups */ + *puidx = cmd->arg1; + /* Assume ADDR by default */ + *ptype = IPFW_TABLE_ADDR; + skip = 0; + + if (F_LEN(cmd) > F_INSN_SIZE(ipfw_insn_u32)) { + /* + * generic lookup. The key must be + * in 32bit big-endian format. + */ + v = ((ipfw_insn_u32 *)cmd)->d[1]; + switch (v) { + case 0: + case 1: + /* IPv4 src/dst */ + break; + case 2: + case 3: + /* src/dst port */ + *ptype = IPFW_TABLE_NUMBER; + break; + case 4: + /* uid/gid */ + *ptype = IPFW_TABLE_NUMBER; + break; + case 5: + /* jid */ + *ptype = IPFW_TABLE_NUMBER; + break; + case 6: + /* dscp */ + *ptype = IPFW_TABLE_NUMBER; + break; + } + } + break; + case O_XMIT: + case O_RECV: + case O_VIA: + /* Interface table, possibly */ + cmdif = (ipfw_insn_if *)cmd; + if (cmdif->name[0] != '\1') + break; + + *ptype = IPFW_TABLE_INTERFACE; + *puidx = cmdif->p.kidx; + skip = 0; + break; + case O_IP_FLOW_LOOKUP: + *puidx = cmd->arg1; + *ptype = IPFW_TABLE_FLOW; + skip = 0; + break; + } + + return (skip); +} + +/* + * Sets new table value for given opcode. + * Assume the same opcodes as classify_table_opcode() + */ +static void +update_table_opcode(ipfw_insn *cmd, uint16_t idx) +{ + ipfw_insn_if *cmdif; + + switch (cmd->opcode) { + case O_IP_SRC_LOOKUP: + case O_IP_DST_LOOKUP: + /* Basic IPv4/IPv6 or u32 lookups */ + cmd->arg1 = idx; + break; + case O_XMIT: + case O_RECV: + case O_VIA: + /* Interface table, possibly */ + cmdif = (ipfw_insn_if *)cmd; + cmdif->p.kidx = idx; + break; + case O_IP_FLOW_LOOKUP: + cmd->arg1 = idx; + break; + } +} + +/* + * Checks table name for validity. + * Enforce basic length checks, the rest + * should be done in userland. + * + * Returns 0 if name is considered valid. 
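To make the classification above concrete: a plain "from table(5)" match
arrives as a one-slot O_IP_SRC_LOOKUP opcode, which classifies as an ADDR
table reference. A small sketch, kernel context assumed:

static void
classify_example(void)
{
	ipfw_insn cmd;
	uint16_t uidx;
	uint8_t type;

	memset(&cmd, 0, sizeof(cmd));
	cmd.opcode = O_IP_SRC_LOOKUP;
	cmd.len = F_INSN_SIZE(ipfw_insn);	/* basic, no u32 subtype */
	cmd.arg1 = 5;				/* userland table number */

	if (classify_table_opcode(&cmd, &uidx, &type) == 0)
		KASSERT(uidx == 5 && type == IPFW_TABLE_ADDR,
		    ("unexpected classification"));
}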
+ */ +int +ipfw_check_table_name(char *name) +{ + int nsize; + ipfw_obj_ntlv *ntlv = NULL; + + nsize = sizeof(ntlv->name); + + if (strnlen(name, nsize) == nsize) + return (EINVAL); + + if (name[0] == '\0') + return (EINVAL); + + /* + * TODO: do some more complicated checks + */ + + return (0); +} + +/* + * Find tablename TLV by @uid. + * Check @tlvs for valid data inside. + * + * Returns pointer to found TLV or NULL. + */ +static ipfw_obj_ntlv * +find_name_tlv(void *tlvs, int len, uint16_t uidx) +{ + ipfw_obj_ntlv *ntlv; + uintptr_t pa, pe; + int l; + + pa = (uintptr_t)tlvs; + pe = pa + len; + l = 0; + for (; pa < pe; pa += l) { + ntlv = (ipfw_obj_ntlv *)pa; + l = ntlv->head.length; + + if (l != sizeof(*ntlv)) + return (NULL); + + if (ntlv->head.type != IPFW_TLV_TBL_NAME) + continue; + + if (ntlv->idx != uidx) + continue; + + if (ipfw_check_table_name(ntlv->name) != 0) + return (NULL); + + return (ntlv); + } + + return (NULL); +} + +/* + * Finds table config based on either legacy index + * or name in ntlv. + * Note @ti structure contains unchecked data from userland. + * + * Returns pointer to table_config or NULL. + */ +static struct table_config * +find_table(struct namedobj_instance *ni, struct tid_info *ti) +{ + char *name, bname[16]; + struct named_object *no; + ipfw_obj_ntlv *ntlv; + uint32_t set; + + if (ti->tlvs != NULL) { + ntlv = find_name_tlv(ti->tlvs, ti->tlen, ti->uidx); + if (ntlv == NULL) + return (NULL); + name = ntlv->name; + + /* + * Use set provided by @ti instead of @ntlv one. + * This is needed due to different sets behavior + * controlled by V_fw_tables_sets. + */ + set = ti->set; + } else { + snprintf(bname, sizeof(bname), "%d", ti->uidx); + name = bname; + set = 0; + } + + no = ipfw_objhash_lookup_name(ni, set, name); + + return ((struct table_config *)no); +} + +/* + * Allocate new table config structure using + * specified @algo and @aname. + * + * Returns pointer to config or NULL. + */ +static struct table_config * +alloc_table_config(struct ip_fw_chain *ch, struct tid_info *ti, + struct table_algo *ta, char *aname, uint8_t tflags) +{ + char *name, bname[16]; + struct table_config *tc; + int error; + ipfw_obj_ntlv *ntlv; + uint32_t set; + + if (ti->tlvs != NULL) { + ntlv = find_name_tlv(ti->tlvs, ti->tlen, ti->uidx); + if (ntlv == NULL) + return (NULL); + name = ntlv->name; + set = ntlv->set; + } else { + snprintf(bname, sizeof(bname), "%d", ti->uidx); + name = bname; + set = 0; + } + + tc = malloc(sizeof(struct table_config), M_IPFW, M_WAITOK | M_ZERO); + tc->no.name = tc->tablename; + tc->no.type = ta->type; + tc->no.set = set; + tc->tflags = tflags; + tc->ta = ta; + strlcpy(tc->tablename, name, sizeof(tc->tablename)); + /* Set "shared" value type by default */ + tc->vshared = 1; + + if (ti->tlvs == NULL) { + tc->no.compat = 1; + tc->no.uidx = ti->uidx; + } + + /* Preallocate data structures for new tables */ + error = ta->init(ch, &tc->astate, &tc->ti_copy, aname, tflags); + if (error != 0) { + free(tc, M_IPFW); + return (NULL); + } + + return (tc); +} + +/* + * Destroys table state and config. + */ +static void +free_table_config(struct namedobj_instance *ni, struct table_config *tc) +{ + + KASSERT(tc->linked == 0, ("free() on linked config")); + + /* + * We're using ta without any locking/referencing. + * TODO: fix this if we're going to use unloadable algos. + */ + tc->ta->destroy(tc->astate, &tc->ti_copy); + free(tc, M_IPFW); +} + +/* + * Links @tc to @chain table named instance. + * Sets appropriate type/states in @chain table info. 
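The legacy branch of find_table() above means numeric tables are simply
named by their decimal index and live in set 0, so "table 123" and a
table explicitly named "123" resolve to the same object. The equivalent
lookup, spelled out:

static struct named_object *
find_compat_table(struct namedobj_instance *ni, uint16_t uidx)
{
	char bname[16];

	snprintf(bname, sizeof(bname), "%d", uidx);
	return (ipfw_objhash_lookup_name(ni, 0, bname));
}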
+ */ +static void +link_table(struct ip_fw_chain *ch, struct table_config *tc) +{ + struct namedobj_instance *ni; + struct table_info *ti; + uint16_t kidx; + + IPFW_UH_WLOCK_ASSERT(ch); + IPFW_WLOCK_ASSERT(ch); + + ni = CHAIN_TO_NI(ch); + kidx = tc->no.kidx; + + ipfw_objhash_add(ni, &tc->no); + + ti = KIDX_TO_TI(ch, kidx); + *ti = tc->ti_copy; + + /* Notify algo on real @ti address */ + if (tc->ta->change_ti != NULL) + tc->ta->change_ti(tc->astate, ti); + + tc->linked = 1; + tc->ta->refcnt++; +} + +/* + * Unlinks @tc from @chain table named instance. + * Zeroes states in @chain and stores them in @tc. + */ +static void +unlink_table(struct ip_fw_chain *ch, struct table_config *tc) +{ + struct namedobj_instance *ni; + struct table_info *ti; + uint16_t kidx; + + IPFW_UH_WLOCK_ASSERT(ch); + IPFW_WLOCK_ASSERT(ch); + + ni = CHAIN_TO_NI(ch); + kidx = tc->no.kidx; + + /* Clear state. @ti copy is already saved inside @tc */ + ipfw_objhash_del(ni, &tc->no); + ti = KIDX_TO_TI(ch, kidx); + memset(ti, 0, sizeof(struct table_info)); + tc->linked = 0; + tc->ta->refcnt--; + + /* Notify algo on real @ti address */ + if (tc->ta->change_ti != NULL) + tc->ta->change_ti(tc->astate, NULL); +} + +struct swap_table_args { + int set; + int new_set; + int mv; +}; + +/* + * Change set for each matching table. + * + * Ensure we dispatch each table once by setting/checking ochange + * fields. + */ +static void +swap_table_set(struct namedobj_instance *ni, struct named_object *no, + void *arg) +{ + struct table_config *tc; + struct swap_table_args *sta; + + tc = (struct table_config *)no; + sta = (struct swap_table_args *)arg; + + if (no->set != sta->set && (no->set != sta->new_set || sta->mv != 0)) + return; + + if (tc->ochanged != 0) + return; + + tc->ochanged = 1; + ipfw_objhash_del(ni, no); + if (no->set == sta->set) + no->set = sta->new_set; + else + no->set = sta->set; + ipfw_objhash_add(ni, no); +} + +/* + * Cleans up ochange field for all tables. + */ +static void +clean_table_set_data(struct namedobj_instance *ni, struct named_object *no, + void *arg) +{ + struct table_config *tc; + struct swap_table_args *sta; + + tc = (struct table_config *)no; + sta = (struct swap_table_args *)arg; + + tc->ochanged = 0; +} + +/* + * Swaps tables within two sets. + */ +void +ipfw_swap_tables_sets(struct ip_fw_chain *ch, uint32_t set, + uint32_t new_set, int mv) +{ + struct swap_table_args sta; + + IPFW_UH_WLOCK_ASSERT(ch); + + sta.set = set; + sta.new_set = new_set; + sta.mv = mv; + + ipfw_objhash_foreach(CHAIN_TO_NI(ch), swap_table_set, &sta); + ipfw_objhash_foreach(CHAIN_TO_NI(ch), clean_table_set_data, &sta); +} + +/* + * Move all tables which are reference by rules in @rr to set @new_set. + * Makes sure that all relevant tables are referenced ONLLY by given rules. 
+ * + * Retuns 0 on success, + */ +int +ipfw_move_tables_sets(struct ip_fw_chain *ch, ipfw_range_tlv *rt, + uint32_t new_set) +{ + struct ip_fw *rule; + struct table_config *tc; + struct named_object *no; + struct namedobj_instance *ni; + int bad, i, l, cmdlen; + uint16_t kidx; + uint8_t type; + ipfw_insn *cmd; + + IPFW_UH_WLOCK_ASSERT(ch); + + ni = CHAIN_TO_NI(ch); + + /* Stage 1: count number of references by given rules */ + for (i = 0; i < ch->n_rules - 1; i++) { + rule = ch->map[i]; + if (ipfw_match_range(rule, rt) == 0) + continue; + + l = rule->cmd_len; + cmd = rule->cmd; + cmdlen = 0; + for ( ; l > 0 ; l -= cmdlen, cmd += cmdlen) { + cmdlen = F_LEN(cmd); + if (classify_table_opcode(cmd, &kidx, &type) != 0) + continue; + no = ipfw_objhash_lookup_kidx(ni, kidx); + KASSERT(no != NULL, + ("objhash lookup failed on index %d", kidx)); + tc = (struct table_config *)no; + tc->ocount++; + } + + } + + /* Stage 2: verify "ownership" */ + bad = 0; + for (i = 0; i < ch->n_rules - 1; i++) { + rule = ch->map[i]; + if (ipfw_match_range(rule, rt) == 0) + continue; + + l = rule->cmd_len; + cmd = rule->cmd; + cmdlen = 0; + for ( ; l > 0 ; l -= cmdlen, cmd += cmdlen) { + cmdlen = F_LEN(cmd); + if (classify_table_opcode(cmd, &kidx, &type) != 0) + continue; + no = ipfw_objhash_lookup_kidx(ni, kidx); + KASSERT(no != NULL, + ("objhash lookup failed on index %d", kidx)); + tc = (struct table_config *)no; + if (tc->no.refcnt != tc->ocount) { + + /* + * Number of references differ: + * Other rule(s) are holding reference to given + * table, so it is not possible to change its set. + * + * Note that refcnt may account + * references to some going-to-be-added rules. + * Since we don't know their numbers (and event + * if they will be added) it is perfectly OK + * to return error here. + */ + bad = 1; + break; + } + } + + if (bad != 0) + break; + } + + /* Stage 3: change set or cleanup */ + for (i = 0; i < ch->n_rules - 1; i++) { + rule = ch->map[i]; + if (ipfw_match_range(rule, rt) == 0) + continue; + + l = rule->cmd_len; + cmd = rule->cmd; + cmdlen = 0; + for ( ; l > 0 ; l -= cmdlen, cmd += cmdlen) { + cmdlen = F_LEN(cmd); + if (classify_table_opcode(cmd, &kidx, &type) != 0) + continue; + no = ipfw_objhash_lookup_kidx(ni, kidx); + KASSERT(no != NULL, + ("objhash lookup failed on index %d", kidx)); + tc = (struct table_config *)no; + + tc->ocount = 0; + if (bad != 0) + continue; + + /* Actually change set. */ + ipfw_objhash_del(ni, no); + no->set = new_set; + ipfw_objhash_add(ni, no); + } + } + + return (bad); +} + +/* + * Finds and bumps refcount for tables referenced by given @rule. + * Auto-creates non-existing tables. + * Fills in @oib array with userland/kernel indexes. + * First free oidx pointer is saved back in @oib. + * + * Returns 0 on success. + */ +static int +find_ref_rule_tables(struct ip_fw_chain *ch, struct ip_fw *rule, + struct rule_check_info *ci, struct obj_idx **oib, struct tid_info *ti) +{ + struct table_config *tc; + struct namedobj_instance *ni; + struct named_object *no; + int cmdlen, error, l, numnew; + uint16_t kidx; + ipfw_insn *cmd; + struct obj_idx *pidx, *pidx_first, *p; + + pidx_first = *oib; + pidx = pidx_first; + l = rule->cmd_len; + cmd = rule->cmd; + cmdlen = 0; + error = 0; + numnew = 0; + + IPFW_UH_WLOCK(ch); + ni = CHAIN_TO_NI(ch); + + /* Increase refcount on each existing referenced table. 
*/ + for ( ; l > 0 ; l -= cmdlen, cmd += cmdlen) { + cmdlen = F_LEN(cmd); + + if (classify_table_opcode(cmd, &ti->uidx, &ti->type) != 0) + continue; + + pidx->uidx = ti->uidx; + pidx->type = ti->type; + + if ((tc = find_table(ni, ti)) != NULL) { + if (tc->no.type != ti->type) { + /* Incompatible types */ + error = EINVAL; + break; + } + + /* Reference found table and save kidx */ + tc->no.refcnt++; + pidx->kidx = tc->no.kidx; + pidx++; + continue; + } + + /* + * Compability stuff for old clients: + * prepare to manually create non-existing tables. + */ + pidx++; + numnew++; + } + + if (error != 0) { + /* Unref everything we have already done */ + for (p = *oib; p < pidx; p++) { + if (p->kidx == 0) + continue; + + /* Find & unref by existing idx */ + no = ipfw_objhash_lookup_kidx(ni, p->kidx); + KASSERT(no != NULL, ("Ref'd table %d disappeared", + p->kidx)); + + no->refcnt--; + } + } + + IPFW_UH_WUNLOCK(ch); + + if (numnew == 0) { + *oib = pidx; + return (error); + } + + /* + * Compatibility stuff: do actual creation for non-existing, + * but referenced tables. + */ + for (p = pidx_first; p < pidx; p++) { + if (p->kidx != 0) + continue; + + ti->uidx = p->uidx; + ti->type = p->type; + ti->atype = 0; + + error = create_table_compat(ch, ti, &kidx); + if (error == 0) { + p->kidx = kidx; + continue; + } + + /* Error. We have to drop references */ + IPFW_UH_WLOCK(ch); + for (p = pidx_first; p < pidx; p++) { + if (p->kidx == 0) + continue; + + /* Find & unref by existing idx */ + no = ipfw_objhash_lookup_kidx(ni, p->kidx); + KASSERT(no != NULL, ("Ref'd table %d disappeared", + p->kidx)); + + no->refcnt--; + } + IPFW_UH_WUNLOCK(ch); + + return (error); + } + + *oib = pidx; + + return (error); +} + +/* + * Remove references from every table used in @rule. + */ +void +ipfw_unref_rule_tables(struct ip_fw_chain *chain, struct ip_fw *rule) +{ + int cmdlen, l; + ipfw_insn *cmd; + struct namedobj_instance *ni; + struct named_object *no; + uint16_t kidx; + uint8_t type; + + IPFW_UH_WLOCK_ASSERT(chain); + ni = CHAIN_TO_NI(chain); + + l = rule->cmd_len; + cmd = rule->cmd; + cmdlen = 0; + for ( ; l > 0 ; l -= cmdlen, cmd += cmdlen) { + cmdlen = F_LEN(cmd); + + if (classify_table_opcode(cmd, &kidx, &type) != 0) + continue; + + no = ipfw_objhash_lookup_kidx(ni, kidx); + + KASSERT(no != NULL, ("table id %d not found", kidx)); + KASSERT(no->type == type, ("wrong type %d (%d) for table id %d", + no->type, type, kidx)); + KASSERT(no->refcnt > 0, ("refcount for table %d is %d", + kidx, no->refcnt)); + + no->refcnt--; + } +} + +/* + * Compatibility function for old ipfw(8) binaries. + * Rewrites table kernel indices with userland ones. + * Convert tables matching '/^\d+$/' to their atoi() value. + * Use number 65535 for other tables. + * + * Returns 0 on success. + */ +int +ipfw_rewrite_table_kidx(struct ip_fw_chain *chain, struct ip_fw_rule0 *rule) +{ + int cmdlen, error, l; + ipfw_insn *cmd; + uint16_t kidx, uidx; + uint8_t type; + struct named_object *no; + struct namedobj_instance *ni; + + ni = CHAIN_TO_NI(chain); + error = 0; + + l = rule->cmd_len; + cmd = rule->cmd; + cmdlen = 0; + for ( ; l > 0 ; l -= cmdlen, cmd += cmdlen) { + cmdlen = F_LEN(cmd); + + if (classify_table_opcode(cmd, &kidx, &type) != 0) + continue; + + if ((no = ipfw_objhash_lookup_kidx(ni, kidx)) == NULL) + return (1); + + uidx = no->uidx; + if (no->compat == 0) { + + /* + * We are called via legacy opcode. + * Save error and show table as fake number + * not to make ipfw(8) hang. 
+ */ + uidx = 65535; + error = 2; + } + + update_table_opcode(cmd, uidx); + } + + return (error); +} + +/* + * Checks is opcode is referencing table of appropriate type. + * Adds reference count for found table if true. + * Rewrites user-supplied opcode values with kernel ones. + * + * Returns 0 on success and appropriate error code otherwise. + */ +int +ipfw_rewrite_table_uidx(struct ip_fw_chain *chain, + struct rule_check_info *ci) +{ + int cmdlen, error, l; + ipfw_insn *cmd; + uint16_t uidx; + uint8_t type; + struct namedobj_instance *ni; + struct obj_idx *p, *pidx_first, *pidx_last; + struct tid_info ti; + + ni = CHAIN_TO_NI(chain); + + /* + * Prepare an array for storing opcode indices. + * Use stack allocation by default. + */ + if (ci->table_opcodes <= (sizeof(ci->obuf)/sizeof(ci->obuf[0]))) { + /* Stack */ + pidx_first = ci->obuf; + } else + pidx_first = malloc(ci->table_opcodes * sizeof(struct obj_idx), + M_IPFW, M_WAITOK | M_ZERO); + + pidx_last = pidx_first; + error = 0; + type = 0; + memset(&ti, 0, sizeof(ti)); + + /* + * Use default set for looking up tables (old way) or + * use set rule is assigned to (new way). + */ + ti.set = (V_fw_tables_sets != 0) ? ci->krule->set : 0; + if (ci->ctlv != NULL) { + ti.tlvs = (void *)(ci->ctlv + 1); + ti.tlen = ci->ctlv->head.length - sizeof(ipfw_obj_ctlv); + } + + /* Reference all used tables */ + error = find_ref_rule_tables(chain, ci->krule, ci, &pidx_last, &ti); + if (error != 0) + goto free; + + IPFW_UH_WLOCK(chain); + + /* Perform rule rewrite */ + l = ci->krule->cmd_len; + cmd = ci->krule->cmd; + cmdlen = 0; + p = pidx_first; + for ( ; l > 0 ; l -= cmdlen, cmd += cmdlen) { + cmdlen = F_LEN(cmd); + if (classify_table_opcode(cmd, &uidx, &type) != 0) + continue; + update_table_opcode(cmd, p->kidx); + p++; + } + + IPFW_UH_WUNLOCK(chain); + +free: + if (pidx_first != ci->obuf) + free(pidx_first, M_IPFW); + + return (error); +} + +static struct ipfw_sopt_handler scodes[] = { + { IP_FW_TABLE_XCREATE, 0, HDIR_SET, create_table }, + { IP_FW_TABLE_XDESTROY, 0, HDIR_SET, flush_table_v0 }, + { IP_FW_TABLE_XFLUSH, 0, HDIR_SET, flush_table_v0 }, + { IP_FW_TABLE_XMODIFY, 0, HDIR_BOTH, modify_table }, + { IP_FW_TABLE_XINFO, 0, HDIR_GET, describe_table }, + { IP_FW_TABLES_XLIST, 0, HDIR_GET, list_tables }, + { IP_FW_TABLE_XLIST, 0, HDIR_GET, dump_table_v0 }, + { IP_FW_TABLE_XLIST, 1, HDIR_GET, dump_table_v1 }, + { IP_FW_TABLE_XADD, 0, HDIR_BOTH, manage_table_ent_v0 }, + { IP_FW_TABLE_XADD, 1, HDIR_BOTH, manage_table_ent_v1 }, + { IP_FW_TABLE_XDEL, 0, HDIR_BOTH, manage_table_ent_v0 }, + { IP_FW_TABLE_XDEL, 1, HDIR_BOTH, manage_table_ent_v1 }, + { IP_FW_TABLE_XFIND, 0, HDIR_GET, find_table_entry }, + { IP_FW_TABLE_XSWAP, 0, HDIR_SET, swap_table }, + { IP_FW_TABLES_ALIST, 0, HDIR_GET, list_table_algo }, + { IP_FW_TABLE_XGETSIZE, 0, HDIR_GET, get_table_size }, +}; + +static void +destroy_table_locked(struct namedobj_instance *ni, struct named_object *no, + void *arg) +{ + + unlink_table((struct ip_fw_chain *)arg, (struct table_config *)no); + if (ipfw_objhash_free_idx(ni, no->kidx) != 0) + printf("Error unlinking kidx %d from table %s\n", + no->kidx, no->name); + free_table_config(ni, (struct table_config *)no); +} + +/* + * Shuts tables module down. 
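The scodes[] array above is the whole sockopt dispatch surface of this
file: each row binds an (opcode, version, direction) triple to a handler,
so a new revision of an existing opcode is just another row. A sketch
with a hypothetical handler name:

/* Hypothetical: a second XFIND revision would be one more row. */
static struct ipfw_sopt_handler scodes_example[] = {
	{ IP_FW_TABLE_XFIND, 1, HDIR_GET, find_table_entry_v1 },
};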
+ */ +void +ipfw_destroy_tables(struct ip_fw_chain *ch, int last) +{ + + IPFW_DEL_SOPT_HANDLER(last, scodes); + + /* Remove all tables from working set */ + IPFW_UH_WLOCK(ch); + IPFW_WLOCK(ch); + ipfw_objhash_foreach(CHAIN_TO_NI(ch), destroy_table_locked, ch); + IPFW_WUNLOCK(ch); + IPFW_UH_WUNLOCK(ch); + + /* Free pointers itself */ + free(ch->tablestate, M_IPFW); + + ipfw_table_value_destroy(ch, last); + ipfw_table_algo_destroy(ch); + + ipfw_objhash_destroy(CHAIN_TO_NI(ch)); + free(CHAIN_TO_TCFG(ch), M_IPFW); +} + +/* + * Starts tables module. + */ +int +ipfw_init_tables(struct ip_fw_chain *ch, int first) +{ + struct tables_config *tcfg; + + /* Allocate pointers */ + ch->tablestate = malloc(V_fw_tables_max * sizeof(struct table_info), + M_IPFW, M_WAITOK | M_ZERO); + + tcfg = malloc(sizeof(struct tables_config), M_IPFW, M_WAITOK | M_ZERO); + tcfg->namehash = ipfw_objhash_create(V_fw_tables_max); + ch->tblcfg = tcfg; + + ipfw_table_value_init(ch, first); + ipfw_table_algo_init(ch); + + IPFW_ADD_SOPT_HANDLER(first, scodes); + return (0); +} + + + diff --git a/example/ipfw/sys/netpfil/ipfw/ip_fw_table.h b/example/ipfw/sys/netpfil/ipfw/ip_fw_table.h new file mode 100644 index 0000000..216d713 --- /dev/null +++ b/example/ipfw/sys/netpfil/ipfw/ip_fw_table.h @@ -0,0 +1,246 @@ +/*- + * Copyright (c) 2002-2009 Luigi Rizzo, Universita` di Pisa + * + * Redistribution and use in source and binary forms, with or without + * modification, are permitted provided that the following conditions + * are met: + * 1. Redistributions of source code must retain the above copyright + * notice, this list of conditions and the following disclaimer. + * 2. Redistributions in binary form must reproduce the above copyright + * notice, this list of conditions and the following disclaimer in the + * documentation and/or other materials provided with the distribution. + * + * THIS SOFTWARE IS PROVIDED BY THE AUTHOR AND CONTRIBUTORS ``AS IS'' AND + * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE + * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE + * ARE DISCLAIMED. IN NO EVENT SHALL THE AUTHOR OR CONTRIBUTORS BE LIABLE + * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL + * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS + * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) + * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT + * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY + * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF + * SUCH DAMAGE. + * + * $FreeBSD: head/sys/netpfil/ipfw/ip_fw_table.h 272840 2014-10-09 19:32:35Z melifaro $ + */ + +#ifndef _IPFW2_TABLE_H +#define _IPFW2_TABLE_H + +/* + * Internal constants and data structures used by ipfw tables + * not meant to be exported outside the kernel. 
+ */ +#ifdef _KERNEL + +struct table_algo; +struct tables_config { + struct namedobj_instance *namehash; + struct namedobj_instance *valhash; + uint32_t val_size; + uint32_t algo_count; + struct table_algo *algo[256]; + struct table_algo *def_algo[IPFW_TABLE_MAXTYPE + 1]; + TAILQ_HEAD(op_state_l,op_state) state_list; +}; +#define CHAIN_TO_TCFG(chain) ((struct tables_config *)(chain)->tblcfg) + +struct table_info { + table_lookup_t *lookup; /* Lookup function */ + void *state; /* Lookup radix/other structure */ + void *xstate; /* eXtended state */ + u_long data; /* Hints for given func */ +}; + +/* Internal structures for handling sockopt data */ +struct tid_info { + uint32_t set; /* table set */ + uint16_t uidx; /* table index */ + uint8_t type; /* table type */ + uint8_t atype; + void *tlvs; /* Pointer to first TLV */ + int tlen; /* Total TLV size block */ +}; + +struct table_value; +struct tentry_info { + void *paddr; + struct table_value *pvalue; + void *ptv; /* Temporary field to hold obj */ + uint8_t masklen; /* mask length */ + uint8_t subtype; + uint16_t flags; /* record flags */ + uint32_t value; /* value index */ +}; +#define TEI_FLAGS_UPDATE 0x0001 /* Add or update rec if exists */ +#define TEI_FLAGS_UPDATED 0x0002 /* Entry has been updated */ +#define TEI_FLAGS_COMPAT 0x0004 /* Called from old ABI */ +#define TEI_FLAGS_DONTADD 0x0008 /* Do not create new rec */ +#define TEI_FLAGS_ADDED 0x0010 /* Entry was added */ +#define TEI_FLAGS_DELETED 0x0020 /* Entry was deleted */ +#define TEI_FLAGS_LIMIT 0x0040 /* Limit was hit */ +#define TEI_FLAGS_ERROR 0x0080 /* Unknown request error */ +#define TEI_FLAGS_NOTFOUND 0x0100 /* Entry was not found */ +#define TEI_FLAGS_EXISTS 0x0200 /* Entry already exists */ + +typedef int (ta_init)(struct ip_fw_chain *ch, void **ta_state, + struct table_info *ti, char *data, uint8_t tflags); +typedef void (ta_destroy)(void *ta_state, struct table_info *ti); +typedef int (ta_prepare_add)(struct ip_fw_chain *ch, struct tentry_info *tei, + void *ta_buf); +typedef int (ta_prepare_del)(struct ip_fw_chain *ch, struct tentry_info *tei, + void *ta_buf); +typedef int (ta_add)(void *ta_state, struct table_info *ti, + struct tentry_info *tei, void *ta_buf, uint32_t *pnum); +typedef int (ta_del)(void *ta_state, struct table_info *ti, + struct tentry_info *tei, void *ta_buf, uint32_t *pnum); +typedef void (ta_flush_entry)(struct ip_fw_chain *ch, struct tentry_info *tei, + void *ta_buf); + +typedef int (ta_need_modify)(void *ta_state, struct table_info *ti, + uint32_t count, uint64_t *pflags); +typedef int (ta_prepare_mod)(void *ta_buf, uint64_t *pflags); +typedef int (ta_fill_mod)(void *ta_state, struct table_info *ti, + void *ta_buf, uint64_t *pflags); +typedef void (ta_modify)(void *ta_state, struct table_info *ti, + void *ta_buf, uint64_t pflags); +typedef void (ta_flush_mod)(void *ta_buf); + +typedef void (ta_change_ti)(void *ta_state, struct table_info *ti); +typedef void (ta_print_config)(void *ta_state, struct table_info *ti, char *buf, + size_t bufsize); + +typedef int ta_foreach_f(void *node, void *arg); +typedef void ta_foreach(void *ta_state, struct table_info *ti, ta_foreach_f *f, + void *arg); +typedef int ta_dump_tentry(void *ta_state, struct table_info *ti, void *e, + ipfw_obj_tentry *tent); +typedef int ta_find_tentry(void *ta_state, struct table_info *ti, + ipfw_obj_tentry *tent); +typedef void ta_dump_tinfo(void *ta_state, struct table_info *ti, + ipfw_ta_tinfo *tinfo); +typedef uint32_t ta_get_count(void *ta_state, struct table_info *ti); + 
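Putting the add-side typedefs and TEI flags above together, a compliant
ta_add reports its outcome through the return value, @pnum and the flags
in @tei. A sketch; every *_example name is hypothetical:

static int
ta_add_example(void *ta_state, struct table_info *ti,
    struct tentry_info *tei, void *ta_buf, uint32_t *pnum)
{
	struct ent_example *ent = ta_buf;	/* built in prepare_add */
	struct ent_example *old;

	old = example_find(ta_state, ent);	/* hypothetical lookup */
	if (old == NULL) {
		if ((tei->flags & TEI_FLAGS_DONTADD) != 0)
			return (ENOENT);
		example_insert(ta_state, ent);	/* hypothetical insert */
		*pnum = 1;			/* one entry added */
		return (0);
	}
	if ((tei->flags & TEI_FLAGS_UPDATE) == 0)
		return (EEXIST);
	tei->value = old->value;		/* hand old value back */
	old->value = ent->value;
	tei->flags |= TEI_FLAGS_UPDATED;
	*pnum = 0;				/* updated, not added */
	return (0);
}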
+struct table_algo { + char name[16]; + uint32_t idx; + uint32_t type; + uint32_t refcnt; + uint32_t flags; + uint32_t vlimit; + size_t ta_buf_size; + ta_init *init; + ta_destroy *destroy; + ta_prepare_add *prepare_add; + ta_prepare_del *prepare_del; + ta_add *add; + ta_del *del; + ta_flush_entry *flush_entry; + ta_find_tentry *find_tentry; + ta_need_modify *need_modify; + ta_prepare_mod *prepare_mod; + ta_fill_mod *fill_mod; + ta_modify *modify; + ta_flush_mod *flush_mod; + ta_change_ti *change_ti; + ta_foreach *foreach; + ta_dump_tentry *dump_tentry; + ta_print_config *print_config; + ta_dump_tinfo *dump_tinfo; + ta_get_count *get_count; +}; +#define TA_FLAG_DEFAULT 0x01 /* Algo is default for given type */ +#define TA_FLAG_READONLY 0x02 /* Algo does not support modifications*/ +#define TA_FLAG_EXTCOUNTER 0x04 /* Algo has external counter available*/ + +int ipfw_add_table_algo(struct ip_fw_chain *ch, struct table_algo *ta, + size_t size, uint32_t *idx); +void ipfw_del_table_algo(struct ip_fw_chain *ch, int idx); + +void ipfw_table_algo_init(struct ip_fw_chain *chain); +void ipfw_table_algo_destroy(struct ip_fw_chain *chain); + +MALLOC_DECLARE(M_IPFW_TBL); +/* Exported to support legacy opcodes */ +int add_table_entry(struct ip_fw_chain *ch, struct tid_info *ti, + struct tentry_info *tei, uint8_t flags, uint32_t count); +int del_table_entry(struct ip_fw_chain *ch, struct tid_info *ti, + struct tentry_info *tei, uint8_t flags, uint32_t count); +int flush_table(struct ip_fw_chain *ch, struct tid_info *ti); +void ipfw_import_table_value_legacy(uint32_t value, struct table_value *v); +uint32_t ipfw_export_table_value_legacy(struct table_value *v); +int ipfw_get_table_size(struct ip_fw_chain *ch, ip_fw3_opheader *op3, + struct sockopt_data *sd); + +/* ipfw_table_value.c functions */ +struct table_config; +struct tableop_state; +void ipfw_table_value_init(struct ip_fw_chain *ch, int first); +void ipfw_table_value_destroy(struct ip_fw_chain *ch, int last); +int ipfw_link_table_values(struct ip_fw_chain *ch, struct tableop_state *ts); +void ipfw_garbage_table_values(struct ip_fw_chain *ch, struct table_config *tc, + struct tentry_info *tei, uint32_t count, int rollback); +void ipfw_import_table_value_v1(ipfw_table_value *iv); +void ipfw_export_table_value_v1(struct table_value *v, ipfw_table_value *iv); +void ipfw_unref_table_values(struct ip_fw_chain *ch, struct table_config *tc, + struct table_algo *ta, void *astate, struct table_info *ti); +void rollback_table_values(struct tableop_state *ts); + +int ipfw_rewrite_table_uidx(struct ip_fw_chain *chain, + struct rule_check_info *ci); +int ipfw_rewrite_table_kidx(struct ip_fw_chain *chain, + struct ip_fw_rule0 *rule); +int ipfw_mark_table_kidx(struct ip_fw_chain *chain, struct ip_fw *rule, + uint32_t *bmask); +int ipfw_export_table_ntlv(struct ip_fw_chain *ch, uint16_t kidx, + struct sockopt_data *sd); +void ipfw_unref_rule_tables(struct ip_fw_chain *chain, struct ip_fw *rule); + +/* utility functions */ +int ipfw_check_table_name(char *name); +int ipfw_move_tables_sets(struct ip_fw_chain *ch, ipfw_range_tlv *rt, + uint32_t new_set); +void ipfw_swap_tables_sets(struct ip_fw_chain *ch, uint32_t old_set, + uint32_t new_set, int mv); +int ipfw_foreach_table_tentry(struct ip_fw_chain *ch, uint16_t kidx, + ta_foreach_f f, void *arg); + +/* internal functions */ +void tc_ref(struct table_config *tc); +void tc_unref(struct table_config *tc); + +struct op_state; +typedef void (op_rollback_f)(void *object, struct op_state *state); +struct op_state { + 
TAILQ_ENTRY(op_state) next; /* chain link */ + op_rollback_f *func; +}; + +struct tableop_state { + struct op_state opstate; + struct ip_fw_chain *ch; + struct table_config *tc; + struct table_algo *ta; + struct tentry_info *tei; + uint32_t count; + uint32_t vmask; + int vshared; + int modified; +}; + +void add_toperation_state(struct ip_fw_chain *ch, struct tableop_state *ts); +void del_toperation_state(struct ip_fw_chain *ch, struct tableop_state *ts); +void rollback_toperation_state(struct ip_fw_chain *ch, void *object); + +/* Legacy interfaces */ +int ipfw_count_table(struct ip_fw_chain *ch, struct tid_info *ti, + uint32_t *cnt); +int ipfw_count_xtable(struct ip_fw_chain *ch, struct tid_info *ti, + uint32_t *cnt); +int ipfw_dump_table_legacy(struct ip_fw_chain *ch, struct tid_info *ti, + ipfw_table *tbl); + + +#endif /* _KERNEL */ +#endif /* _IPFW2_TABLE_H */ diff --git a/example/ipfw/sys/netpfil/ipfw/ip_fw_table_algo.c b/example/ipfw/sys/netpfil/ipfw/ip_fw_table_algo.c new file mode 100644 index 0000000..d9d0547 --- /dev/null +++ b/example/ipfw/sys/netpfil/ipfw/ip_fw_table_algo.c @@ -0,0 +1,4081 @@ +/*- + * Copyright (c) 2014 Yandex LLC + * Copyright (c) 2014 Alexander V. Chernikov + * + * Redistribution and use in source and binary forms, with or without + * modification, are permitted provided that the following conditions + * are met: + * 1. Redistributions of source code must retain the above copyright + * notice, this list of conditions and the following disclaimer. + * 2. Redistributions in binary form must reproduce the above copyright + * notice, this list of conditions and the following disclaimer in the + * documentation and/or other materials provided with the distribution. + * + * THIS SOFTWARE IS PROVIDED BY THE AUTHOR AND CONTRIBUTORS ``AS IS'' AND + * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE + * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE + * ARE DISCLAIMED. IN NO EVENT SHALL THE AUTHOR OR CONTRIBUTORS BE LIABLE + * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL + * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS + * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) + * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT + * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY + * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF + * SUCH DAMAGE. + */ + +#include +__FBSDID("$FreeBSD: head/sys/netpfil/ipfw/ip_fw_table_algo.c 272912 2014-10-10 20:37:06Z melifaro $"); + +/* + * Lookup table algorithms. + * + */ + +#include "opt_ipfw.h" +#include "opt_inet.h" +#ifndef INET +#error IPFIREWALL requires INET. +#endif /* INET */ +#include "opt_inet6.h" + +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include /* ip_fw.h requires IFNAMSIZ */ +#include +#include + +#include +#include /* struct ipfw_rule_ref */ +#include + +#include +#include + + +/* + * IPFW table lookup algorithms. + * + * What is needed to add another table algo? + * + * Algo init: + * * struct table_algo has to be filled with: + * name: "type:algoname" format, e.g. "addr:radix". Currently + * there are the following types: "addr", "iface", "number" and "flow". + * type: one of IPFW_TABLE_* types + * flags: one or more TA_FLAGS_* + * ta_buf_size: size of structure used to store add/del item state. + * Needs to be less than TA_BUF_SZ. + * callbacks: see below for description. 
+ * * ipfw_add_table_algo / ipfw_del_table_algo have to be called
+ *
+ * Callbacks description:
+ *
+ * -init: request to initialize new table instance.
+ * typedef int (ta_init)(struct ip_fw_chain *ch, void **ta_state,
+ *     struct table_info *ti, char *data, uint8_t tflags);
+ * MANDATORY, unlocked. (M_WAITOK). Returns 0 on success.
+ *
+ * Allocate all structures needed for normal operations.
+ * * Caller may want to parse @data for some algo-specific
+ *   options provided by userland.
+ * * Caller may want to save configuration state pointer to @ta_state.
+ * * Caller needs to save desired runtime structure pointer(s)
+ *   inside @ti fields. Note that it is not correct to save
+ *   the @ti pointer at this moment. Use the -change_ti hook for that.
+ * * Caller has to fill in ti->lookup with an appropriate function
+ *   pointer.
+ *
+ *
+ *
+ * -destroy: request to destroy table instance.
+ * typedef void (ta_destroy)(void *ta_state, struct table_info *ti);
+ * MANDATORY, may be locked (UH+WLOCK). (M_NOWAIT).
+ *
+ * Frees all table entries and all table structures allocated by -init.
+ *
+ *
+ *
+ * -prepare_add: request to allocate state for adding new entry.
+ * typedef int (ta_prepare_add)(struct ip_fw_chain *ch, struct tentry_info *tei,
+ *     void *ta_buf);
+ * MANDATORY, unlocked. (M_WAITOK). Returns 0 on success.
+ *
+ * Allocates state and fills it in with all necessary data (EXCEPT value)
+ * from @tei to minimize the work that has to be done under WLOCK.
+ * The "value" field has to be copied to the new entry in the @add callback.
+ * Buffer ta_buf of size ta->ta_buf_size may be used to store
+ * allocated state.
+ *
+ *
+ *
+ * -prepare_del: request to set state for deleting existing entry.
+ * typedef int (ta_prepare_del)(struct ip_fw_chain *ch, struct tentry_info *tei,
+ *     void *ta_buf);
+ * MANDATORY, locked, UH. (M_NOWAIT). Returns 0 on success.
+ *
+ * Buffer ta_buf of size ta->ta_buf_size may be used to store
+ * allocated state. Caller should use on-stack ta_buf allocation
+ * instead of doing malloc().
+ *
+ *
+ *
+ * -add: request to insert new entry into runtime/config structures.
+ * typedef int (ta_add)(void *ta_state, struct table_info *ti,
+ *     struct tentry_info *tei, void *ta_buf, uint32_t *pnum);
+ * MANDATORY, UH+WLOCK. (M_NOWAIT). Returns 0 on success.
+ *
+ * Insert new entry using previously-allocated state in @ta_buf.
+ * * @tei may have the following flags:
+ *   TEI_FLAGS_UPDATE: request to add or update entry.
+ *   TEI_FLAGS_DONTADD: request to update (but not add) entry.
+ * * Caller is required to do the following:
+ *   copy real entry value from @tei
+ *   entry added: return 0, store 1 in @pnum
+ *   entry updated: return 0, store 0 in @pnum, store old value in @tei,
+ *     add the TEI_FLAGS_UPDATED flag to @tei.
+ *   entry exists: return EEXIST
+ *   entry not found: return ENOENT
+ *   other error: return non-zero error code.
+ *
+ *
+ *
+ * -del: request to delete existing entry from runtime/config structures.
+ * typedef int (ta_del)(void *ta_state, struct table_info *ti,
+ *     struct tentry_info *tei, void *ta_buf, uint32_t *pnum);
+ * MANDATORY, UH+WLOCK. (M_NOWAIT). Returns 0 on success.
+ *
+ * Delete entry using state previously set up in @ta_buf.
+ * * Caller is required to do the following:
+ *   entry deleted: return 0, store 1 in @pnum, store old value in @tei.
+ *   entry not found: return ENOENT
+ *   other error: return non-zero error code.
+ *
+ *
+ *
+ * -flush_entry: flush entry state created by -prepare_add / -del / others.
+ * typedef void (ta_flush_entry)(struct ip_fw_chain *ch,
+ *     struct tentry_info *tei, void *ta_buf);
+ * MANDATORY, may be locked. (M_NOWAIT).
+ *
+ * Delete state allocated by:
+ * -prepare_add (-add returned EEXIST|UPDATED)
+ * -prepare_del (if any)
+ * -del
+ * * Caller is required to handle empty @ta_buf correctly.
+ *
+ *
+ * -find_tentry: finds entry specified by key in @tent.
+ * typedef int ta_find_tentry(void *ta_state, struct table_info *ti,
+ *     ipfw_obj_tentry *tent);
+ * OPTIONAL, locked (UH). (M_NOWAIT). Returns 0 on success.
+ *
+ * Finds entry specified by given key.
+ * * Caller is required to do the following:
+ *   entry found: return 0, export entry to @tent
+ *   entry not found: return ENOENT
+ *
+ *
+ * -need_modify: checks if @ti has enough space to hold another @count items.
+ * typedef int (ta_need_modify)(void *ta_state, struct table_info *ti,
+ *     uint32_t count, uint64_t *pflags);
+ * OPTIONAL, locked (UH). (M_NOWAIT). Returns 0 if it has.
+ *
+ * Checks if given table has enough space to add @count items without
+ * resizing. Caller may use @pflags to store desired modification data.
+ *
+ *
+ *
+ * -prepare_mod: allocate structures for table modification.
+ * typedef int (ta_prepare_mod)(void *ta_buf, uint64_t *pflags);
+ * OPTIONAL(need_modify), unlocked. (M_WAITOK). Returns 0 on success.
+ *
+ * Allocate all needed state for table modification. Caller
+ * should use `struct mod_item` to store new state in @ta_buf.
+ * Up to TA_BUF_SZ (128 bytes) can be stored in @ta_buf.
+ *
+ *
+ *
+ * -fill_mod: copy some data to the new state.
+ * typedef int (ta_fill_mod)(void *ta_state, struct table_info *ti,
+ *     void *ta_buf, uint64_t *pflags);
+ * OPTIONAL(need_modify), locked (UH). (M_NOWAIT). Returns 0 on success.
+ *
+ * Copy as much data as we can to minimize changes under WLOCK.
+ * For example, arrays can be merged inside this callback.
+ *
+ *
+ *
+ * -modify: perform final modification.
+ * typedef void (ta_modify)(void *ta_state, struct table_info *ti,
+ *     void *ta_buf, uint64_t pflags);
+ * OPTIONAL(need_modify), locked (UH+WLOCK). (M_NOWAIT).
+ *
+ * Performs all changes necessary to switch to new structures.
+ * * Caller should save old pointers to @ta_buf storage.
+ *
+ *
+ *
+ * -flush_mod: flush table modification state.
+ * typedef void (ta_flush_mod)(void *ta_buf);
+ * OPTIONAL(need_modify), unlocked. (M_WAITOK).
+ *
+ * Performs flush for the following:
+ * - prepare_mod (modification was not necessary)
+ * - modify (for the old state)
+ *
+ *
+ *
+ * -change_ti: monitor table info pointer changes.
+ * typedef void (ta_change_ti)(void *ta_state, struct table_info *ti);
+ * OPTIONAL, locked (UH). (M_NOWAIT).
+ *
+ * Called when the @ti pointer changes. Called immediately after -init
+ * to set initial state.
+ *
+ *
+ *
+ * -foreach: calls @f for each table entry.
+ * typedef void ta_foreach(void *ta_state, struct table_info *ti,
+ *     ta_foreach_f *f, void *arg);
+ * MANDATORY, locked(UH). (M_NOWAIT).
+ *
+ * Runs callback with specified argument for each table entry.
+ * Typically used for dumping table entries.
+ *
+ *
+ *
+ * -dump_tentry: dump table entry in current @tentry format.
+ * typedef int ta_dump_tentry(void *ta_state, struct table_info *ti, void *e,
+ *     ipfw_obj_tentry *tent);
+ * MANDATORY, locked(UH). (M_NOWAIT). Returns 0 on success.
+ *
+ * Dumps entry @e to @tent.
+ *
+ *
+ * -print_config: prints custom algorithm options into buffer.
+ * typedef void (ta_print_config)(void *ta_state, struct table_info *ti, + * char *buf, size_t bufsize); + * OPTIONAL. locked(UH). (M_NOWAIT). + * + * Prints custom algorithm options in the format suitable to pass + * back to -init callback. + * + * + * + * -dump_tinfo: dumps algo-specific info. + * typedef void ta_dump_tinfo(void *ta_state, struct table_info *ti, + * ipfw_ta_tinfo *tinfo); + * OPTIONAL. locked(UH). (M_NOWAIT). + * + * Dumps options like items size/hash size, etc. + */ + +MALLOC_DEFINE(M_IPFW_TBL, "ipfw_tbl", "IpFw tables"); + +/* + * Utility structures/functions common to more than one algo + */ + +struct mod_item { + void *main_ptr; + size_t size; + void *main_ptr6; + size_t size6; +}; + +static int badd(const void *key, void *item, void *base, size_t nmemb, + size_t size, int (*compar) (const void *, const void *)); +static int bdel(const void *key, void *base, size_t nmemb, size_t size, + int (*compar) (const void *, const void *)); + + +/* + * ADDR implementation using radix + * + */ + +/* + * The radix code expects addr and mask to be array of bytes, + * with the first byte being the length of the array. rn_inithead + * is called with the offset in bits of the lookup key within the + * array. If we use a sockaddr_in as the underlying type, + * sin_len is conveniently located at offset 0, sin_addr is at + * offset 4 and normally aligned. + * But for portability, let's avoid assumption and make the code explicit + */ +#define KEY_LEN(v) *((uint8_t *)&(v)) +/* + * Do not require radix to compare more than actual IPv4/IPv6 address + */ +#define KEY_LEN_INET (offsetof(struct sockaddr_in, sin_addr) + sizeof(in_addr_t)) +#define KEY_LEN_INET6 (offsetof(struct sa_in6, sin6_addr) + sizeof(struct in6_addr)) + +#define OFF_LEN_INET (8 * offsetof(struct sockaddr_in, sin_addr)) +#define OFF_LEN_INET6 (8 * offsetof(struct sa_in6, sin6_addr)) + +struct radix_addr_entry { + struct radix_node rn[2]; + struct sockaddr_in addr; + uint32_t value; + uint8_t masklen; +}; + +struct sa_in6 { + uint8_t sin6_len; + uint8_t sin6_family; + uint8_t pad[2]; + struct in6_addr sin6_addr; +}; + +struct radix_addr_xentry { + struct radix_node rn[2]; + struct sa_in6 addr6; + uint32_t value; + uint8_t masklen; +}; + +struct radix_cfg { + struct radix_node_head *head4; + struct radix_node_head *head6; + size_t count4; + size_t count6; +}; + +struct ta_buf_radix +{ + void *ent_ptr; + struct sockaddr *addr_ptr; + struct sockaddr *mask_ptr; + union { + struct { + struct sockaddr_in sa; + struct sockaddr_in ma; + } a4; + struct { + struct sa_in6 sa; + struct sa_in6 ma; + } a6; + } addr; +}; + +static int ta_lookup_radix(struct table_info *ti, void *key, uint32_t keylen, + uint32_t *val); +static int ta_init_radix(struct ip_fw_chain *ch, void **ta_state, + struct table_info *ti, char *data, uint8_t tflags); +static int flush_radix_entry(struct radix_node *rn, void *arg); +static void ta_destroy_radix(void *ta_state, struct table_info *ti); +static void ta_dump_radix_tinfo(void *ta_state, struct table_info *ti, + ipfw_ta_tinfo *tinfo); +static int ta_dump_radix_tentry(void *ta_state, struct table_info *ti, + void *e, ipfw_obj_tentry *tent); +static int ta_find_radix_tentry(void *ta_state, struct table_info *ti, + ipfw_obj_tentry *tent); +static void ta_foreach_radix(void *ta_state, struct table_info *ti, + ta_foreach_f *f, void *arg); +static void tei_to_sockaddr_ent(struct tentry_info *tei, struct sockaddr *sa, + struct sockaddr *ma, int *set_mask); +static int ta_prepare_add_radix(struct ip_fw_chain *ch, 
struct tentry_info *tei, + void *ta_buf); +static int ta_add_radix(void *ta_state, struct table_info *ti, + struct tentry_info *tei, void *ta_buf, uint32_t *pnum); +static int ta_prepare_del_radix(struct ip_fw_chain *ch, struct tentry_info *tei, + void *ta_buf); +static int ta_del_radix(void *ta_state, struct table_info *ti, + struct tentry_info *tei, void *ta_buf, uint32_t *pnum); +static void ta_flush_radix_entry(struct ip_fw_chain *ch, struct tentry_info *tei, + void *ta_buf); +static int ta_need_modify_radix(void *ta_state, struct table_info *ti, + uint32_t count, uint64_t *pflags); + +static int +ta_lookup_radix(struct table_info *ti, void *key, uint32_t keylen, + uint32_t *val) +{ + struct radix_node_head *rnh; + + if (keylen == sizeof(in_addr_t)) { + struct radix_addr_entry *ent; + struct sockaddr_in sa; + KEY_LEN(sa) = KEY_LEN_INET; + sa.sin_addr.s_addr = *((in_addr_t *)key); + rnh = (struct radix_node_head *)ti->state; + ent = (struct radix_addr_entry *)(rnh->rnh_matchaddr(&sa, rnh)); + if (ent != NULL) { + *val = ent->value; + return (1); + } + } else { + struct radix_addr_xentry *xent; + struct sa_in6 sa6; + KEY_LEN(sa6) = KEY_LEN_INET6; + memcpy(&sa6.sin6_addr, key, sizeof(struct in6_addr)); + rnh = (struct radix_node_head *)ti->xstate; + xent = (struct radix_addr_xentry *)(rnh->rnh_matchaddr(&sa6, rnh)); + if (xent != NULL) { + *val = xent->value; + return (1); + } + } + + return (0); +} + +/* + * New table + */ +static int +ta_init_radix(struct ip_fw_chain *ch, void **ta_state, struct table_info *ti, + char *data, uint8_t tflags) +{ + struct radix_cfg *cfg; + + if (!rn_inithead(&ti->state, OFF_LEN_INET)) + return (ENOMEM); + if (!rn_inithead(&ti->xstate, OFF_LEN_INET6)) { + rn_detachhead(&ti->state); + return (ENOMEM); + } + + cfg = malloc(sizeof(struct radix_cfg), M_IPFW, M_WAITOK | M_ZERO); + + *ta_state = cfg; + ti->lookup = ta_lookup_radix; + + return (0); +} + +static int +flush_radix_entry(struct radix_node *rn, void *arg) +{ + struct radix_node_head * const rnh = arg; + struct radix_addr_entry *ent; + + ent = (struct radix_addr_entry *) + rnh->rnh_deladdr(rn->rn_key, rn->rn_mask, rnh); + if (ent != NULL) + free(ent, M_IPFW_TBL); + return (0); +} + +static void +ta_destroy_radix(void *ta_state, struct table_info *ti) +{ + struct radix_cfg *cfg; + struct radix_node_head *rnh; + + cfg = (struct radix_cfg *)ta_state; + + rnh = (struct radix_node_head *)(ti->state); + rnh->rnh_walktree(rnh, flush_radix_entry, rnh); + rn_detachhead(&ti->state); + + rnh = (struct radix_node_head *)(ti->xstate); + rnh->rnh_walktree(rnh, flush_radix_entry, rnh); + rn_detachhead(&ti->xstate); + + free(cfg, M_IPFW); +} + +/* + * Provide algo-specific table info + */ +static void +ta_dump_radix_tinfo(void *ta_state, struct table_info *ti, ipfw_ta_tinfo *tinfo) +{ + struct radix_cfg *cfg; + + cfg = (struct radix_cfg *)ta_state; + + tinfo->flags = IPFW_TATFLAGS_AFDATA | IPFW_TATFLAGS_AFITEM; + tinfo->taclass4 = IPFW_TACLASS_RADIX; + tinfo->count4 = cfg->count4; + tinfo->itemsize4 = sizeof(struct radix_addr_entry); + tinfo->taclass6 = IPFW_TACLASS_RADIX; + tinfo->count6 = cfg->count6; + tinfo->itemsize6 = sizeof(struct radix_addr_xentry); +} + +static int +ta_dump_radix_tentry(void *ta_state, struct table_info *ti, void *e, + ipfw_obj_tentry *tent) +{ + struct radix_addr_entry *n; +#ifdef INET6 + struct radix_addr_xentry *xn; +#endif + + n = (struct radix_addr_entry *)e; + + /* Guess IPv4/IPv6 radix by sockaddr family */ + if (n->addr.sin_family == AF_INET) { + tent->k.addr.s_addr = 
n->addr.sin_addr.s_addr; + tent->masklen = n->masklen; + tent->subtype = AF_INET; + tent->v.kidx = n->value; +#ifdef INET6 + } else { + xn = (struct radix_addr_xentry *)e; + memcpy(&tent->k, &xn->addr6.sin6_addr, sizeof(struct in6_addr)); + tent->masklen = xn->masklen; + tent->subtype = AF_INET6; + tent->v.kidx = xn->value; +#endif + } + + return (0); +} + +static int +ta_find_radix_tentry(void *ta_state, struct table_info *ti, + ipfw_obj_tentry *tent) +{ + struct radix_node_head *rnh; + void *e; + + e = NULL; + if (tent->subtype == AF_INET) { + struct sockaddr_in sa; + KEY_LEN(sa) = KEY_LEN_INET; + sa.sin_addr.s_addr = tent->k.addr.s_addr; + rnh = (struct radix_node_head *)ti->state; + e = rnh->rnh_matchaddr(&sa, rnh); + } else { + struct sa_in6 sa6; + KEY_LEN(sa6) = KEY_LEN_INET6; + memcpy(&sa6.sin6_addr, &tent->k.addr6, sizeof(struct in6_addr)); + rnh = (struct radix_node_head *)ti->xstate; + e = rnh->rnh_matchaddr(&sa6, rnh); + } + + if (e != NULL) { + ta_dump_radix_tentry(ta_state, ti, e, tent); + return (0); + } + + return (ENOENT); +} + +static void +ta_foreach_radix(void *ta_state, struct table_info *ti, ta_foreach_f *f, + void *arg) +{ + struct radix_node_head *rnh; + + rnh = (struct radix_node_head *)(ti->state); + rnh->rnh_walktree(rnh, (walktree_f_t *)f, arg); + + rnh = (struct radix_node_head *)(ti->xstate); + rnh->rnh_walktree(rnh, (walktree_f_t *)f, arg); +} + + +#ifdef INET6 +static inline void ipv6_writemask(struct in6_addr *addr6, uint8_t mask); + +static inline void +ipv6_writemask(struct in6_addr *addr6, uint8_t mask) +{ + uint32_t *cp; + + for (cp = (uint32_t *)addr6; mask >= 32; mask -= 32) + *cp++ = 0xFFFFFFFF; + *cp = htonl(mask ? ~((1 << (32 - mask)) - 1) : 0); +} +#endif + +static void +tei_to_sockaddr_ent(struct tentry_info *tei, struct sockaddr *sa, + struct sockaddr *ma, int *set_mask) +{ + int mlen; +#ifdef INET + struct sockaddr_in *addr, *mask; +#endif +#ifdef INET6 + struct sa_in6 *addr6, *mask6; +#endif + in_addr_t a4; + + mlen = tei->masklen; + + if (tei->subtype == AF_INET) { +#ifdef INET + addr = (struct sockaddr_in *)sa; + mask = (struct sockaddr_in *)ma; + /* Set 'total' structure length */ + KEY_LEN(*addr) = KEY_LEN_INET; + KEY_LEN(*mask) = KEY_LEN_INET; + addr->sin_family = AF_INET; + mask->sin_addr.s_addr = + htonl(mlen ? 
~((1 << (32 - mlen)) - 1) : 0); + a4 = *((in_addr_t *)tei->paddr); + addr->sin_addr.s_addr = a4 & mask->sin_addr.s_addr; + if (mlen != 32) + *set_mask = 1; + else + *set_mask = 0; +#endif +#ifdef INET6 + } else if (tei->subtype == AF_INET6) { + /* IPv6 case */ + addr6 = (struct sa_in6 *)sa; + mask6 = (struct sa_in6 *)ma; + /* Set 'total' structure length */ + KEY_LEN(*addr6) = KEY_LEN_INET6; + KEY_LEN(*mask6) = KEY_LEN_INET6; + addr6->sin6_family = AF_INET6; + ipv6_writemask(&mask6->sin6_addr, mlen); + memcpy(&addr6->sin6_addr, tei->paddr, sizeof(struct in6_addr)); + APPLY_MASK(&addr6->sin6_addr, &mask6->sin6_addr); + if (mlen != 128) + *set_mask = 1; + else + *set_mask = 0; +#endif + } +} + +static int +ta_prepare_add_radix(struct ip_fw_chain *ch, struct tentry_info *tei, + void *ta_buf) +{ + struct ta_buf_radix *tb; + struct radix_addr_entry *ent; +#ifdef INET6 + struct radix_addr_xentry *xent; +#endif + struct sockaddr *addr, *mask; + int mlen, set_mask; + + tb = (struct ta_buf_radix *)ta_buf; + + mlen = tei->masklen; + set_mask = 0; + + if (tei->subtype == AF_INET) { +#ifdef INET + if (mlen > 32) + return (EINVAL); + ent = malloc(sizeof(*ent), M_IPFW_TBL, M_WAITOK | M_ZERO); + ent->masklen = mlen; + + addr = (struct sockaddr *)&ent->addr; + mask = (struct sockaddr *)&tb->addr.a4.ma; + tb->ent_ptr = ent; +#endif +#ifdef INET6 + } else if (tei->subtype == AF_INET6) { + /* IPv6 case */ + if (mlen > 128) + return (EINVAL); + xent = malloc(sizeof(*xent), M_IPFW_TBL, M_WAITOK | M_ZERO); + xent->masklen = mlen; + + addr = (struct sockaddr *)&xent->addr6; + mask = (struct sockaddr *)&tb->addr.a6.ma; + tb->ent_ptr = xent; +#endif + } else { + /* Unknown CIDR type */ + return (EINVAL); + } + + tei_to_sockaddr_ent(tei, addr, mask, &set_mask); + /* Set pointers */ + tb->addr_ptr = addr; + if (set_mask != 0) + tb->mask_ptr = mask; + + return (0); +} + +static int +ta_add_radix(void *ta_state, struct table_info *ti, struct tentry_info *tei, + void *ta_buf, uint32_t *pnum) +{ + struct radix_cfg *cfg; + struct radix_node_head *rnh; + struct radix_node *rn; + struct ta_buf_radix *tb; + uint32_t *old_value, value; + + cfg = (struct radix_cfg *)ta_state; + tb = (struct ta_buf_radix *)ta_buf; + + /* Save current entry value from @tei */ + if (tei->subtype == AF_INET) { + rnh = ti->state; + ((struct radix_addr_entry *)tb->ent_ptr)->value = tei->value; + } else { + rnh = ti->xstate; + ((struct radix_addr_xentry *)tb->ent_ptr)->value = tei->value; + } + + /* Search for an entry first */ + rn = rnh->rnh_lookup(tb->addr_ptr, tb->mask_ptr, rnh); + if (rn != NULL) { + if ((tei->flags & TEI_FLAGS_UPDATE) == 0) + return (EEXIST); + /* Record already exists. 
Update value if we're asked to */
+		if (tei->subtype == AF_INET)
+			old_value = &((struct radix_addr_entry *)rn)->value;
+		else
+			old_value = &((struct radix_addr_xentry *)rn)->value;
+
+		value = *old_value;
+		*old_value = tei->value;
+		tei->value = value;
+
+		/* Indicate that update has happened instead of addition */
+		tei->flags |= TEI_FLAGS_UPDATED;
+		*pnum = 0;
+
+		return (0);
+	}
+
+	if ((tei->flags & TEI_FLAGS_DONTADD) != 0)
+		return (EFBIG);
+
+	rn = rnh->rnh_addaddr(tb->addr_ptr, tb->mask_ptr, rnh, tb->ent_ptr);
+	if (rn == NULL) {
+		/* Unknown error */
+		return (EINVAL);
+	}
+
+	if (tei->subtype == AF_INET)
+		cfg->count4++;
+	else
+		cfg->count6++;
+	tb->ent_ptr = NULL;
+	*pnum = 1;
+
+	return (0);
+}
+
+static int
+ta_prepare_del_radix(struct ip_fw_chain *ch, struct tentry_info *tei,
+    void *ta_buf)
+{
+	struct ta_buf_radix *tb;
+	struct sockaddr *addr, *mask;
+	int mlen, set_mask;
+
+	tb = (struct ta_buf_radix *)ta_buf;
+
+	mlen = tei->masklen;
+	set_mask = 0;
+
+	if (tei->subtype == AF_INET) {
+		if (mlen > 32)
+			return (EINVAL);
+
+		addr = (struct sockaddr *)&tb->addr.a4.sa;
+		mask = (struct sockaddr *)&tb->addr.a4.ma;
+#ifdef INET6
+	} else if (tei->subtype == AF_INET6) {
+		if (mlen > 128)
+			return (EINVAL);
+
+		addr = (struct sockaddr *)&tb->addr.a6.sa;
+		mask = (struct sockaddr *)&tb->addr.a6.ma;
+#endif
+	} else
+		return (EINVAL);
+
+	tei_to_sockaddr_ent(tei, addr, mask, &set_mask);
+	tb->addr_ptr = addr;
+	if (set_mask != 0)
+		tb->mask_ptr = mask;
+
+	return (0);
+}
+
+static int
+ta_del_radix(void *ta_state, struct table_info *ti, struct tentry_info *tei,
+    void *ta_buf, uint32_t *pnum)
+{
+	struct radix_cfg *cfg;
+	struct radix_node_head *rnh;
+	struct radix_node *rn;
+	struct ta_buf_radix *tb;
+
+	cfg = (struct radix_cfg *)ta_state;
+	tb = (struct ta_buf_radix *)ta_buf;
+
+	if (tei->subtype == AF_INET)
+		rnh = ti->state;
+	else
+		rnh = ti->xstate;
+
+	rn = rnh->rnh_deladdr(tb->addr_ptr, tb->mask_ptr, rnh);
+
+	if (rn == NULL)
+		return (ENOENT);
+
+	/* Save entry value to @tei */
+	if (tei->subtype == AF_INET)
+		tei->value = ((struct radix_addr_entry *)rn)->value;
+	else
+		tei->value = ((struct radix_addr_xentry *)rn)->value;
+
+	tb->ent_ptr = rn;
+
+	if (tei->subtype == AF_INET)
+		cfg->count4--;
+	else
+		cfg->count6--;
+	*pnum = 1;
+
+	return (0);
+}
+
+static void
+ta_flush_radix_entry(struct ip_fw_chain *ch, struct tentry_info *tei,
+    void *ta_buf)
+{
+	struct ta_buf_radix *tb;
+
+	tb = (struct ta_buf_radix *)ta_buf;
+
+	if (tb->ent_ptr != NULL)
+		free(tb->ent_ptr, M_IPFW_TBL);
+}
+
+static int
+ta_need_modify_radix(void *ta_state, struct table_info *ti, uint32_t count,
+    uint64_t *pflags)
+{
+
+	/*
+	 * The radix implementation does not require any additional memory
+	 * allocations other than the nodes themselves. Adding new masks to
+	 * the tree does, but there is no API to call for that (and we don't
+	 * know which sizes we would need).
+	 */
+	return (0);
+}
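+
+/*
+ * Illustrative (non-upstream) sketch of how a consumer uses the runtime
+ * lookup hook installed by ta_init_radix(): ti->lookup returns non-zero
+ * on a hit and stores the matched entry value in @val.
+ */
+#if 0	/* example only, not compiled */
+static uint32_t
+example_radix_lookup_v4(struct table_info *ti, in_addr_t a, uint32_t dflt)
+{
+	uint32_t v;
+
+	if (ti->lookup(ti, &a, sizeof(a), &v) != 0)
+		return (v);	/* matched: value stored with the prefix */
+	return (dflt);		/* no match */
+}
+#endif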
+
+struct table_algo addr_radix = {
+	.name = "addr:radix",
+	.type = IPFW_TABLE_ADDR,
+	.flags = TA_FLAG_DEFAULT,
+	.ta_buf_size = sizeof(struct ta_buf_radix),
+	.init = ta_init_radix,
+	.destroy = ta_destroy_radix,
+	.prepare_add = ta_prepare_add_radix,
+	.prepare_del = ta_prepare_del_radix,
+	.add = ta_add_radix,
+	.del = ta_del_radix,
+	.flush_entry = ta_flush_radix_entry,
+	.foreach = ta_foreach_radix,
+	.dump_tentry = ta_dump_radix_tentry,
+	.find_tentry = ta_find_radix_tentry,
+	.dump_tinfo = ta_dump_radix_tinfo,
+	.need_modify = ta_need_modify_radix,
+};
+
+
+/*
+ * addr:hash cmds
+ *
+ *
+ * ti->data:
+ * [inv.mask4][inv.mask6][log2hsize4][log2hsize6]
+ * [        8][        8][         8][         8]
+ *
+ * inv.mask4: 32 - mask
+ * inv.mask6:
+ * 1) _slow lookup: mask
+ * 2) _aligned: (128 - mask) / 8
+ * 3) _64: 8
+ *
+ *
+ * pflags:
+ * [v4=1/v6=0][hsize]
+ * [       32][   32]
+ */
+
+struct chashentry;
+
+SLIST_HEAD(chashbhead, chashentry);
+
+struct chash_cfg {
+	struct chashbhead *head4;
+	struct chashbhead *head6;
+	size_t size4;
+	size_t size6;
+	size_t items4;
+	size_t items6;
+	uint8_t mask4;
+	uint8_t mask6;
+};
+
+struct chashentry {
+	SLIST_ENTRY(chashentry) next;
+	uint32_t value;
+	uint32_t type;
+	union {
+		uint32_t a4;		/* Host format */
+		struct in6_addr a6;	/* Network format */
+	} a;
+};
+
+struct ta_buf_chash
+{
+	void *ent_ptr;
+	struct chashentry ent;
+};
+
+#ifdef INET
+static __inline uint32_t hash_ip(uint32_t addr, int hsize);
+#endif
+#ifdef INET6
+static __inline uint32_t hash_ip6(struct in6_addr *addr6, int hsize);
+static __inline uint16_t hash_ip64(struct in6_addr *addr6, int hsize);
+static __inline uint32_t hash_ip6_slow(struct in6_addr *addr6, void *key,
+    int mask, int hsize);
+static __inline uint32_t hash_ip6_al(struct in6_addr *addr6, void *key, int mask,
+    int hsize);
+#endif
+static int ta_lookup_chash_slow(struct table_info *ti, void *key, uint32_t keylen,
+    uint32_t *val);
+static int ta_lookup_chash_aligned(struct table_info *ti, void *key,
+    uint32_t keylen, uint32_t *val);
+static int ta_lookup_chash_64(struct table_info *ti, void *key, uint32_t keylen,
+    uint32_t *val);
+static int chash_parse_opts(struct chash_cfg *cfg, char *data);
+static void ta_print_chash_config(void *ta_state, struct table_info *ti,
+    char *buf, size_t bufsize);
+static int ta_log2(uint32_t v);
+static int ta_init_chash(struct ip_fw_chain *ch, void **ta_state,
+    struct table_info *ti, char *data, uint8_t tflags);
+static void ta_destroy_chash(void *ta_state, struct table_info *ti);
+static void ta_dump_chash_tinfo(void *ta_state, struct table_info *ti,
+    ipfw_ta_tinfo *tinfo);
+static int ta_dump_chash_tentry(void *ta_state, struct table_info *ti,
+    void *e, ipfw_obj_tentry *tent);
+static uint32_t hash_ent(struct chashentry *ent, int af, int mlen,
+    uint32_t size);
+static int tei_to_chash_ent(struct tentry_info *tei, struct chashentry *ent);
+static int ta_find_chash_tentry(void *ta_state, struct table_info *ti,
+    ipfw_obj_tentry *tent);
+static void ta_foreach_chash(void *ta_state, struct table_info *ti,
+    ta_foreach_f *f, void *arg);
+static int ta_prepare_add_chash(struct ip_fw_chain *ch, struct tentry_info *tei,
+    void *ta_buf);
+static int ta_add_chash(void *ta_state, struct table_info *ti,
+    struct tentry_info *tei, void *ta_buf, uint32_t *pnum);
+static int ta_prepare_del_chash(struct ip_fw_chain *ch, struct tentry_info *tei,
+    void *ta_buf);
+static int ta_del_chash(void *ta_state, struct table_info *ti,
+    struct tentry_info *tei, void *ta_buf, uint32_t *pnum);
+static void ta_flush_chash_entry(struct ip_fw_chain *ch, struct tentry_info *tei, + void *ta_buf); +static int ta_need_modify_chash(void *ta_state, struct table_info *ti, + uint32_t count, uint64_t *pflags); +static int ta_prepare_mod_chash(void *ta_buf, uint64_t *pflags); +static int ta_fill_mod_chash(void *ta_state, struct table_info *ti, void *ta_buf, + uint64_t *pflags); +static void ta_modify_chash(void *ta_state, struct table_info *ti, void *ta_buf, + uint64_t pflags); +static void ta_flush_mod_chash(void *ta_buf); + + +#ifdef INET +static __inline uint32_t +hash_ip(uint32_t addr, int hsize) +{ + + return (addr % (hsize - 1)); +} +#endif + +#ifdef INET6 +static __inline uint32_t +hash_ip6(struct in6_addr *addr6, int hsize) +{ + uint32_t i; + + i = addr6->s6_addr32[0] ^ addr6->s6_addr32[1] ^ + addr6->s6_addr32[2] ^ addr6->s6_addr32[3]; + + return (i % (hsize - 1)); +} + + +static __inline uint16_t +hash_ip64(struct in6_addr *addr6, int hsize) +{ + uint32_t i; + + i = addr6->s6_addr32[0] ^ addr6->s6_addr32[1]; + + return (i % (hsize - 1)); +} + + +static __inline uint32_t +hash_ip6_slow(struct in6_addr *addr6, void *key, int mask, int hsize) +{ + struct in6_addr mask6; + + ipv6_writemask(&mask6, mask); + memcpy(addr6, key, sizeof(struct in6_addr)); + APPLY_MASK(addr6, &mask6); + return (hash_ip6(addr6, hsize)); +} + +static __inline uint32_t +hash_ip6_al(struct in6_addr *addr6, void *key, int mask, int hsize) +{ + uint64_t *paddr; + + paddr = (uint64_t *)addr6; + *paddr = 0; + *(paddr + 1) = 0; + memcpy(addr6, key, mask); + return (hash_ip6(addr6, hsize)); +} +#endif + +static int +ta_lookup_chash_slow(struct table_info *ti, void *key, uint32_t keylen, + uint32_t *val) +{ + struct chashbhead *head; + struct chashentry *ent; + uint16_t hash, hsize; + uint8_t imask; + + if (keylen == sizeof(in_addr_t)) { +#ifdef INET + head = (struct chashbhead *)ti->state; + imask = ti->data >> 24; + hsize = 1 << ((ti->data & 0xFFFF) >> 8); + uint32_t a; + a = ntohl(*((in_addr_t *)key)); + a = a >> imask; + hash = hash_ip(a, hsize); + SLIST_FOREACH(ent, &head[hash], next) { + if (ent->a.a4 == a) { + *val = ent->value; + return (1); + } + } +#endif + } else { +#ifdef INET6 + /* IPv6: worst scenario: non-round mask */ + struct in6_addr addr6; + head = (struct chashbhead *)ti->xstate; + imask = (ti->data & 0xFF0000) >> 16; + hsize = 1 << (ti->data & 0xFF); + hash = hash_ip6_slow(&addr6, key, imask, hsize); + SLIST_FOREACH(ent, &head[hash], next) { + if (memcmp(&ent->a.a6, &addr6, 16) == 0) { + *val = ent->value; + return (1); + } + } +#endif + } + + return (0); +} + +static int +ta_lookup_chash_aligned(struct table_info *ti, void *key, uint32_t keylen, + uint32_t *val) +{ + struct chashbhead *head; + struct chashentry *ent; + uint16_t hash, hsize; + uint8_t imask; + + if (keylen == sizeof(in_addr_t)) { +#ifdef INET + head = (struct chashbhead *)ti->state; + imask = ti->data >> 24; + hsize = 1 << ((ti->data & 0xFFFF) >> 8); + uint32_t a; + a = ntohl(*((in_addr_t *)key)); + a = a >> imask; + hash = hash_ip(a, hsize); + SLIST_FOREACH(ent, &head[hash], next) { + if (ent->a.a4 == a) { + *val = ent->value; + return (1); + } + } +#endif + } else { +#ifdef INET6 + /* IPv6: aligned to 8bit mask */ + struct in6_addr addr6; + uint64_t *paddr, *ptmp; + head = (struct chashbhead *)ti->xstate; + imask = (ti->data & 0xFF0000) >> 16; + hsize = 1 << (ti->data & 0xFF); + + hash = hash_ip6_al(&addr6, key, imask, hsize); + paddr = (uint64_t *)&addr6; + SLIST_FOREACH(ent, &head[hash], next) { + ptmp = (uint64_t *)&ent->a.a6; 
+ if (paddr[0] == ptmp[0] && paddr[1] == ptmp[1]) { + *val = ent->value; + return (1); + } + } +#endif + } + + return (0); +} + +static int +ta_lookup_chash_64(struct table_info *ti, void *key, uint32_t keylen, + uint32_t *val) +{ + struct chashbhead *head; + struct chashentry *ent; + uint16_t hash, hsize; + uint8_t imask; + + if (keylen == sizeof(in_addr_t)) { +#ifdef INET + head = (struct chashbhead *)ti->state; + imask = ti->data >> 24; + hsize = 1 << ((ti->data & 0xFFFF) >> 8); + uint32_t a; + a = ntohl(*((in_addr_t *)key)); + a = a >> imask; + hash = hash_ip(a, hsize); + SLIST_FOREACH(ent, &head[hash], next) { + if (ent->a.a4 == a) { + *val = ent->value; + return (1); + } + } +#endif + } else { +#ifdef INET6 + /* IPv6: /64 */ + uint64_t a6, *paddr; + head = (struct chashbhead *)ti->xstate; + paddr = (uint64_t *)key; + hsize = 1 << (ti->data & 0xFF); + a6 = *paddr; + hash = hash_ip64((struct in6_addr *)key, hsize); + SLIST_FOREACH(ent, &head[hash], next) { + paddr = (uint64_t *)&ent->a.a6; + if (a6 == *paddr) { + *val = ent->value; + return (1); + } + } +#endif + } + + return (0); +} + +static int +chash_parse_opts(struct chash_cfg *cfg, char *data) +{ + char *pdel, *pend, *s; + int mask4, mask6; + + mask4 = cfg->mask4; + mask6 = cfg->mask6; + + if (data == NULL) + return (0); + if ((pdel = strchr(data, ' ')) == NULL) + return (0); + while (*pdel == ' ') + pdel++; + if (strncmp(pdel, "masks=", 6) != 0) + return (EINVAL); + if ((s = strchr(pdel, ' ')) != NULL) + *s++ = '\0'; + + pdel += 6; + /* Need /XX[,/YY] */ + if (*pdel++ != '/') + return (EINVAL); + mask4 = strtol(pdel, &pend, 10); + if (*pend == ',') { + /* ,/YY */ + pdel = pend + 1; + if (*pdel++ != '/') + return (EINVAL); + mask6 = strtol(pdel, &pend, 10); + if (*pend != '\0') + return (EINVAL); + } else if (*pend != '\0') + return (EINVAL); + + if (mask4 < 0 || mask4 > 32 || mask6 < 0 || mask6 > 128) + return (EINVAL); + + cfg->mask4 = mask4; + cfg->mask6 = mask6; + + return (0); +} + +static void +ta_print_chash_config(void *ta_state, struct table_info *ti, char *buf, + size_t bufsize) +{ + struct chash_cfg *cfg; + + cfg = (struct chash_cfg *)ta_state; + + if (cfg->mask4 != 32 || cfg->mask6 != 128) + snprintf(buf, bufsize, "%s masks=/%d,/%d", "addr:hash", + cfg->mask4, cfg->mask6); + else + snprintf(buf, bufsize, "%s", "addr:hash"); +} + +static int +ta_log2(uint32_t v) +{ + uint32_t r; + + r = 0; + while (v >>= 1) + r++; + + return (r); +} + +/* + * New table. 
+ * We assume 'data' to be either NULL or the following format: + * 'addr:hash [masks=/32[,/128]]' + */ +static int +ta_init_chash(struct ip_fw_chain *ch, void **ta_state, struct table_info *ti, + char *data, uint8_t tflags) +{ + int error, i; + uint32_t hsize; + struct chash_cfg *cfg; + + cfg = malloc(sizeof(struct chash_cfg), M_IPFW, M_WAITOK | M_ZERO); + + cfg->mask4 = 32; + cfg->mask6 = 128; + + if ((error = chash_parse_opts(cfg, data)) != 0) { + free(cfg, M_IPFW); + return (error); + } + + cfg->size4 = 128; + cfg->size6 = 128; + + cfg->head4 = malloc(sizeof(struct chashbhead) * cfg->size4, M_IPFW, + M_WAITOK | M_ZERO); + cfg->head6 = malloc(sizeof(struct chashbhead) * cfg->size6, M_IPFW, + M_WAITOK | M_ZERO); + for (i = 0; i < cfg->size4; i++) + SLIST_INIT(&cfg->head4[i]); + for (i = 0; i < cfg->size6; i++) + SLIST_INIT(&cfg->head6[i]); + + + *ta_state = cfg; + ti->state = cfg->head4; + ti->xstate = cfg->head6; + + /* Store data depending on v6 mask length */ + hsize = ta_log2(cfg->size4) << 8 | ta_log2(cfg->size6); + if (cfg->mask6 == 64) { + ti->data = (32 - cfg->mask4) << 24 | (128 - cfg->mask6) << 16| + hsize; + ti->lookup = ta_lookup_chash_64; + } else if ((cfg->mask6 % 8) == 0) { + ti->data = (32 - cfg->mask4) << 24 | + cfg->mask6 << 13 | hsize; + ti->lookup = ta_lookup_chash_aligned; + } else { + /* don't do that! */ + ti->data = (32 - cfg->mask4) << 24 | + cfg->mask6 << 16 | hsize; + ti->lookup = ta_lookup_chash_slow; + } + + return (0); +} + +static void +ta_destroy_chash(void *ta_state, struct table_info *ti) +{ + struct chash_cfg *cfg; + struct chashentry *ent, *ent_next; + int i; + + cfg = (struct chash_cfg *)ta_state; + + for (i = 0; i < cfg->size4; i++) + SLIST_FOREACH_SAFE(ent, &cfg->head4[i], next, ent_next) + free(ent, M_IPFW_TBL); + + for (i = 0; i < cfg->size6; i++) + SLIST_FOREACH_SAFE(ent, &cfg->head6[i], next, ent_next) + free(ent, M_IPFW_TBL); + + free(cfg->head4, M_IPFW); + free(cfg->head6, M_IPFW); + + free(cfg, M_IPFW); +} + +static void +ta_dump_chash_tinfo(void *ta_state, struct table_info *ti, ipfw_ta_tinfo *tinfo) +{ + struct chash_cfg *cfg; + + cfg = (struct chash_cfg *)ta_state; + + tinfo->flags = IPFW_TATFLAGS_AFDATA | IPFW_TATFLAGS_AFITEM; + tinfo->taclass4 = IPFW_TACLASS_HASH; + tinfo->size4 = cfg->size4; + tinfo->count4 = cfg->items4; + tinfo->itemsize4 = sizeof(struct chashentry); + tinfo->taclass6 = IPFW_TACLASS_HASH; + tinfo->size6 = cfg->size6; + tinfo->count6 = cfg->items6; + tinfo->itemsize6 = sizeof(struct chashentry); +} + +static int +ta_dump_chash_tentry(void *ta_state, struct table_info *ti, void *e, + ipfw_obj_tentry *tent) +{ + struct chash_cfg *cfg; + struct chashentry *ent; + + cfg = (struct chash_cfg *)ta_state; + ent = (struct chashentry *)e; + + if (ent->type == AF_INET) { + tent->k.addr.s_addr = htonl(ent->a.a4 << (32 - cfg->mask4)); + tent->masklen = cfg->mask4; + tent->subtype = AF_INET; + tent->v.kidx = ent->value; +#ifdef INET6 + } else { + memcpy(&tent->k, &ent->a.a6, sizeof(struct in6_addr)); + tent->masklen = cfg->mask6; + tent->subtype = AF_INET6; + tent->v.kidx = ent->value; +#endif + } + + return (0); +} + +static uint32_t +hash_ent(struct chashentry *ent, int af, int mlen, uint32_t size) +{ + uint32_t hash; + + hash = 0; + + if (af == AF_INET) { +#ifdef INET + hash = hash_ip(ent->a.a4, size); +#endif + } else { +#ifdef INET6 + if (mlen == 64) + hash = hash_ip64(&ent->a.a6, size); + else + hash = hash_ip6(&ent->a.a6, size); +#endif + } + + return (hash); +} + +static int +tei_to_chash_ent(struct tentry_info *tei, struct 
chashentry *ent)
+{
+	int mlen;
+#ifdef INET6
+	struct in6_addr mask6;
+#endif
+
+	mlen = tei->masklen;
+
+	if (tei->subtype == AF_INET) {
+#ifdef INET
+		if (mlen > 32)
+			return (EINVAL);
+		ent->type = AF_INET;
+
+		/* Calculate masked address */
+		ent->a.a4 = ntohl(*((in_addr_t *)tei->paddr)) >> (32 - mlen);
+#endif
+#ifdef INET6
+	} else if (tei->subtype == AF_INET6) {
+		/* IPv6 case */
+		if (mlen > 128)
+			return (EINVAL);
+		ent->type = AF_INET6;
+
+		ipv6_writemask(&mask6, mlen);
+		memcpy(&ent->a.a6, tei->paddr, sizeof(struct in6_addr));
+		APPLY_MASK(&ent->a.a6, &mask6);
+#endif
+	} else {
+		/* Unknown CIDR type */
+		return (EINVAL);
+	}
+
+	return (0);
+}
+
+static int
+ta_find_chash_tentry(void *ta_state, struct table_info *ti,
+    ipfw_obj_tentry *tent)
+{
+	struct chash_cfg *cfg;
+	struct chashbhead *head;
+	struct chashentry ent, *tmp;
+	struct tentry_info tei;
+	int error;
+	uint32_t hash;
+
+	cfg = (struct chash_cfg *)ta_state;
+
+	memset(&ent, 0, sizeof(ent));
+	memset(&tei, 0, sizeof(tei));
+
+	if (tent->subtype == AF_INET) {
+		tei.paddr = &tent->k.addr;
+		tei.masklen = cfg->mask4;
+		tei.subtype = AF_INET;
+
+		if ((error = tei_to_chash_ent(&tei, &ent)) != 0)
+			return (error);
+
+		head = cfg->head4;
+		hash = hash_ent(&ent, AF_INET, cfg->mask4, cfg->size4);
+		/* Check for existence */
+		SLIST_FOREACH(tmp, &head[hash], next) {
+			if (tmp->a.a4 != ent.a.a4)
+				continue;
+
+			ta_dump_chash_tentry(ta_state, ti, tmp, tent);
+			return (0);
+		}
+	} else {
+		tei.paddr = &tent->k.addr6;
+		tei.masklen = cfg->mask6;
+		tei.subtype = AF_INET6;
+
+		if ((error = tei_to_chash_ent(&tei, &ent)) != 0)
+			return (error);
+
+		head = cfg->head6;
+		hash = hash_ent(&ent, AF_INET6, cfg->mask6, cfg->size6);
+		/* Check for existence */
+		SLIST_FOREACH(tmp, &head[hash], next) {
+			if (memcmp(&tmp->a.a6, &ent.a.a6, 16) != 0)
+				continue;
+			ta_dump_chash_tentry(ta_state, ti, tmp, tent);
+			return (0);
+		}
+	}
+
+	return (ENOENT);
+}
+
+static void
+ta_foreach_chash(void *ta_state, struct table_info *ti, ta_foreach_f *f,
+    void *arg)
+{
+	struct chash_cfg *cfg;
+	struct chashentry *ent, *ent_next;
+	int i;
+
+	cfg = (struct chash_cfg *)ta_state;
+
+	for (i = 0; i < cfg->size4; i++)
+		SLIST_FOREACH_SAFE(ent, &cfg->head4[i], next, ent_next)
+			f(ent, arg);
+
+	for (i = 0; i < cfg->size6; i++)
+		SLIST_FOREACH_SAFE(ent, &cfg->head6[i], next, ent_next)
+			f(ent, arg);
+}
+
+static int
+ta_prepare_add_chash(struct ip_fw_chain *ch, struct tentry_info *tei,
+    void *ta_buf)
+{
+	struct ta_buf_chash *tb;
+	struct chashentry *ent;
+	int error;
+
+	tb = (struct ta_buf_chash *)ta_buf;
+
+	ent = malloc(sizeof(*ent), M_IPFW_TBL, M_WAITOK | M_ZERO);
+
+	error = tei_to_chash_ent(tei, ent);
+	if (error != 0) {
+		free(ent, M_IPFW_TBL);
+		return (error);
+	}
+	tb->ent_ptr = ent;
+
+	return (0);
+}
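+
+/*
+ * Worked example (sketch): for an IPv4 entry 10.0.0.0/24 with
+ * cfg->mask4 == 24, tei_to_chash_ent() above stores the host-order
+ * prefix ntohl(0x0a000000) >> (32 - 24) == 0x000a0000 in ent->a.a4;
+ * the lookup paths shift the packet address by the same inverted mask,
+ * so both sides hash exactly the same quantity.
+ */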
+
+static int
+ta_add_chash(void *ta_state, struct table_info *ti, struct tentry_info *tei,
+    void *ta_buf, uint32_t *pnum)
+{
+	struct chash_cfg *cfg;
+	struct chashbhead *head;
+	struct chashentry *ent, *tmp;
+	struct ta_buf_chash *tb;
+	int exists;
+	uint32_t hash, value;
+
+	cfg = (struct chash_cfg *)ta_state;
+	tb = (struct ta_buf_chash *)ta_buf;
+	ent = (struct chashentry *)tb->ent_ptr;
+	hash = 0;
+	exists = 0;
+
+	/* Read current value from @tei */
+	ent->value = tei->value;
+
+	/* Determine hash bucket and check for an existing entry */
+	if (tei->subtype == AF_INET) {
+		if (tei->masklen != cfg->mask4)
+			return (EINVAL);
+		head = cfg->head4;
+		hash = hash_ent(ent, AF_INET, cfg->mask4, cfg->size4);
+
+		/* Check for existence */
+		SLIST_FOREACH(tmp, &head[hash], next) {
+			if (tmp->a.a4 == ent->a.a4) {
+				exists = 1;
+				break;
+			}
+		}
+	} else {
+		if (tei->masklen != cfg->mask6)
+			return (EINVAL);
+		head = cfg->head6;
+		hash = hash_ent(ent, AF_INET6, cfg->mask6, cfg->size6);
+		/* Check for existence */
+		SLIST_FOREACH(tmp, &head[hash], next) {
+			if (memcmp(&tmp->a.a6, &ent->a.a6, 16) == 0) {
+				exists = 1;
+				break;
+			}
+		}
+	}
+
+	if (exists == 1) {
+		if ((tei->flags & TEI_FLAGS_UPDATE) == 0)
+			return (EEXIST);
+		/* Record already exists. Update value if we're asked to */
+		value = tmp->value;
+		tmp->value = tei->value;
+		tei->value = value;
+		/* Indicate that update has happened instead of addition */
+		tei->flags |= TEI_FLAGS_UPDATED;
+		*pnum = 0;
+	} else {
+		if ((tei->flags & TEI_FLAGS_DONTADD) != 0)
+			return (EFBIG);
+		SLIST_INSERT_HEAD(&head[hash], ent, next);
+		tb->ent_ptr = NULL;
+		*pnum = 1;
+
+		/* Update counters */
+		if (tei->subtype == AF_INET)
+			cfg->items4++;
+		else
+			cfg->items6++;
+	}
+
+	return (0);
+}
+
+static int
+ta_prepare_del_chash(struct ip_fw_chain *ch, struct tentry_info *tei,
+    void *ta_buf)
+{
+	struct ta_buf_chash *tb;
+
+	tb = (struct ta_buf_chash *)ta_buf;
+
+	return (tei_to_chash_ent(tei, &tb->ent));
+}
+
+static int
+ta_del_chash(void *ta_state, struct table_info *ti, struct tentry_info *tei,
+    void *ta_buf, uint32_t *pnum)
+{
+	struct chash_cfg *cfg;
+	struct chashbhead *head;
+	struct chashentry *tmp, *tmp_next, *ent;
+	struct ta_buf_chash *tb;
+	uint32_t hash;
+
+	cfg = (struct chash_cfg *)ta_state;
+	tb = (struct ta_buf_chash *)ta_buf;
+	ent = &tb->ent;
+
+	if (tei->subtype == AF_INET) {
+		if (tei->masklen != cfg->mask4)
+			return (EINVAL);
+		head = cfg->head4;
+		hash = hash_ent(ent, AF_INET, cfg->mask4, cfg->size4);
+
+		SLIST_FOREACH_SAFE(tmp, &head[hash], next, tmp_next) {
+			if (tmp->a.a4 != ent->a.a4)
+				continue;
+
+			SLIST_REMOVE(&head[hash], tmp, chashentry, next);
+			cfg->items4--;
+			tb->ent_ptr = tmp;
+			tei->value = tmp->value;
+			*pnum = 1;
+			return (0);
+		}
+	} else {
+		if (tei->masklen != cfg->mask6)
+			return (EINVAL);
+		head = cfg->head6;
+		hash = hash_ent(ent, AF_INET6, cfg->mask6, cfg->size6);
+		SLIST_FOREACH_SAFE(tmp, &head[hash], next, tmp_next) {
+			if (memcmp(&tmp->a.a6, &ent->a.a6, 16) != 0)
+				continue;
+
+			SLIST_REMOVE(&head[hash], tmp, chashentry, next);
+			cfg->items6--;
+			tb->ent_ptr = tmp;
+			tei->value = tmp->value;
+			*pnum = 1;
+			return (0);
+		}
+	}
+
+	return (ENOENT);
+}
+
+static void
+ta_flush_chash_entry(struct ip_fw_chain *ch, struct tentry_info *tei,
+    void *ta_buf)
+{
+	struct ta_buf_chash *tb;
+
+	tb = (struct ta_buf_chash *)ta_buf;
+
+	if (tb->ent_ptr != NULL)
+		free(tb->ent_ptr, M_IPFW_TBL);
+}
+
+/*
+ * Hash growing callbacks.
+ */
+
+static int
+ta_need_modify_chash(void *ta_state, struct table_info *ti, uint32_t count,
+    uint64_t *pflags)
+{
+	struct chash_cfg *cfg;
+	uint64_t data;
+
+	/*
+	 * Since we don't know the exact number of IPv4/IPv6 records in
+	 * @count, ignore the @count value entirely. Check the current
+	 * hash sizes instead and return the appropriate grow data.
+	 */
+
+	cfg = (struct chash_cfg *)ta_state;
+
+	data = 0;
+	if (cfg->items4 > cfg->size4 && cfg->size4 < 65536)
+		data |= (cfg->size4 * 2) << 16;
+	if (cfg->items6 > cfg->size6 && cfg->size6 < 65536)
+		data |= cfg->size6 * 2;
+
+	if (data != 0) {
+		*pflags = data;
+		return (1);
+	}
+
+	return (0);
+}
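+
+/*
+ * Example (sketch): with size4 == 128 and items4 == 200, the check
+ * above requests a grow to 256 IPv4 buckets by returning 1 with
+ * *pflags == (256 << 16); the low 16 bits of @pflags stay zero unless
+ * the IPv6 hash is over-subscribed as well.
+ */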
+
+/*
+ * Allocate new, larger chash.
+ */
+static int
+ta_prepare_mod_chash(void *ta_buf, uint64_t *pflags)
+{
+	struct mod_item *mi;
+	struct chashbhead *head;
+	int i;
+
+	mi = (struct mod_item *)ta_buf;
+
+	memset(mi, 0, sizeof(struct mod_item));
+	mi->size = (*pflags >> 16) & 0xFFFF;
+	mi->size6 = *pflags & 0xFFFF;
+	if (mi->size > 0) {
+		head = malloc(sizeof(struct chashbhead) * mi->size,
+		    M_IPFW, M_WAITOK | M_ZERO);
+		for (i = 0; i < mi->size; i++)
+			SLIST_INIT(&head[i]);
+		mi->main_ptr = head;
+	}
+
+	if (mi->size6 > 0) {
+		head = malloc(sizeof(struct chashbhead) * mi->size6,
+		    M_IPFW, M_WAITOK | M_ZERO);
+		for (i = 0; i < mi->size6; i++)
+			SLIST_INIT(&head[i]);
+		mi->main_ptr6 = head;
+	}
+
+	return (0);
+}
+
+/*
+ * Copy data from old runtime array to new one.
+ */
+static int
+ta_fill_mod_chash(void *ta_state, struct table_info *ti, void *ta_buf,
+    uint64_t *pflags)
+{
+
+	/* It is not possible to rehash if we're not holding WLOCK. */
+	return (0);
+}
+
+/*
+ * Switch old & new arrays.
+ */
+static void
+ta_modify_chash(void *ta_state, struct table_info *ti, void *ta_buf,
+    uint64_t pflags)
+{
+	struct mod_item *mi;
+	struct chash_cfg *cfg;
+	struct chashbhead *old_head, *new_head;
+	struct chashentry *ent, *ent_next;
+	int af, i, mlen;
+	uint32_t nhash;
+	size_t old_size, new_size;
+
+	mi = (struct mod_item *)ta_buf;
+	cfg = (struct chash_cfg *)ta_state;
+
+	/* Check which hash we need to grow, and whether we still need to */
+	if (mi->size > 0 && cfg->size4 < mi->size) {
+		new_head = (struct chashbhead *)mi->main_ptr;
+		new_size = mi->size;
+		old_size = cfg->size4;
+		old_head = ti->state;
+		mlen = cfg->mask4;
+		af = AF_INET;
+
+		for (i = 0; i < old_size; i++) {
+			SLIST_FOREACH_SAFE(ent, &old_head[i], next, ent_next) {
+				nhash = hash_ent(ent, af, mlen, new_size);
+				SLIST_INSERT_HEAD(&new_head[nhash], ent, next);
+			}
+		}
+
+		ti->state = new_head;
+		cfg->head4 = new_head;
+		cfg->size4 = mi->size;
+		mi->main_ptr = old_head;
+	}
+
+	if (mi->size6 > 0 && cfg->size6 < mi->size6) {
+		new_head = (struct chashbhead *)mi->main_ptr6;
+		new_size = mi->size6;
+		old_size = cfg->size6;
+		old_head = ti->xstate;
+		mlen = cfg->mask6;
+		af = AF_INET6;
+
+		for (i = 0; i < old_size; i++) {
+			SLIST_FOREACH_SAFE(ent, &old_head[i], next, ent_next) {
+				nhash = hash_ent(ent, af, mlen, new_size);
+				SLIST_INSERT_HEAD(&new_head[nhash], ent, next);
+			}
+		}
+
+		ti->xstate = new_head;
+		cfg->head6 = new_head;
+		cfg->size6 = mi->size6;
+		mi->main_ptr6 = old_head;
+	}
+
+	/* Update lower 32 bits with new values */
+	ti->data &= 0xFFFFFFFF00000000;
+	ti->data |= ta_log2(cfg->size4) << 8 | ta_log2(cfg->size6);
+}
+
+/*
+ * Free unneeded array.
+ */ +static void +ta_flush_mod_chash(void *ta_buf) +{ + struct mod_item *mi; + + mi = (struct mod_item *)ta_buf; + if (mi->main_ptr != NULL) + free(mi->main_ptr, M_IPFW); + if (mi->main_ptr6 != NULL) + free(mi->main_ptr6, M_IPFW); +} + +struct table_algo addr_hash = { + .name = "addr:hash", + .type = IPFW_TABLE_ADDR, + .ta_buf_size = sizeof(struct ta_buf_chash), + .init = ta_init_chash, + .destroy = ta_destroy_chash, + .prepare_add = ta_prepare_add_chash, + .prepare_del = ta_prepare_del_chash, + .add = ta_add_chash, + .del = ta_del_chash, + .flush_entry = ta_flush_chash_entry, + .foreach = ta_foreach_chash, + .dump_tentry = ta_dump_chash_tentry, + .find_tentry = ta_find_chash_tentry, + .print_config = ta_print_chash_config, + .dump_tinfo = ta_dump_chash_tinfo, + .need_modify = ta_need_modify_chash, + .prepare_mod = ta_prepare_mod_chash, + .fill_mod = ta_fill_mod_chash, + .modify = ta_modify_chash, + .flush_mod = ta_flush_mod_chash, +}; + + +/* + * Iface table cmds. + * + * Implementation: + * + * Runtime part: + * - sorted array of "struct ifidx" pointed by ti->state. + * Array is allocated with rounding up to IFIDX_CHUNK. Only existing + * interfaces are stored in array, however its allocated size is + * sufficient to hold all table records if needed. + * - current array size is stored in ti->data + * + * Table data: + * - "struct iftable_cfg" is allocated to store table state (ta_state). + * - All table records are stored inside namedobj instance. + * + */ + +struct ifidx { + uint16_t kidx; + uint16_t spare; + uint32_t value; +}; +#define DEFAULT_IFIDX_SIZE 64 + +struct iftable_cfg; + +struct ifentry { + struct named_object no; + struct ipfw_ifc ic; + struct iftable_cfg *icfg; + uint32_t value; + int linked; +}; + +struct iftable_cfg { + struct namedobj_instance *ii; + struct ip_fw_chain *ch; + struct table_info *ti; + void *main_ptr; + size_t size; /* Number of items allocated in array */ + size_t count; /* Number of all items */ + size_t used; /* Number of items _active_ now */ +}; + +struct ta_buf_ifidx +{ + struct ifentry *ife; + uint32_t value; +}; + +int compare_ifidx(const void *k, const void *v); +static struct ifidx * ifidx_find(struct table_info *ti, void *key); +static int ta_lookup_ifidx(struct table_info *ti, void *key, uint32_t keylen, + uint32_t *val); +static int ta_init_ifidx(struct ip_fw_chain *ch, void **ta_state, + struct table_info *ti, char *data, uint8_t tflags); +static void ta_change_ti_ifidx(void *ta_state, struct table_info *ti); +static void destroy_ifidx_locked(struct namedobj_instance *ii, + struct named_object *no, void *arg); +static void ta_destroy_ifidx(void *ta_state, struct table_info *ti); +static void ta_dump_ifidx_tinfo(void *ta_state, struct table_info *ti, + ipfw_ta_tinfo *tinfo); +static int ta_prepare_add_ifidx(struct ip_fw_chain *ch, struct tentry_info *tei, + void *ta_buf); +static int ta_add_ifidx(void *ta_state, struct table_info *ti, + struct tentry_info *tei, void *ta_buf, uint32_t *pnum); +static int ta_prepare_del_ifidx(struct ip_fw_chain *ch, struct tentry_info *tei, + void *ta_buf); +static int ta_del_ifidx(void *ta_state, struct table_info *ti, + struct tentry_info *tei, void *ta_buf, uint32_t *pnum); +static void ta_flush_ifidx_entry(struct ip_fw_chain *ch, + struct tentry_info *tei, void *ta_buf); +static void if_notifier(struct ip_fw_chain *ch, void *cbdata, uint16_t ifindex); +static int ta_need_modify_ifidx(void *ta_state, struct table_info *ti, + uint32_t count, uint64_t *pflags); +static int ta_prepare_mod_ifidx(void *ta_buf, 
uint64_t *pflags); +static int ta_fill_mod_ifidx(void *ta_state, struct table_info *ti, + void *ta_buf, uint64_t *pflags); +static void ta_modify_ifidx(void *ta_state, struct table_info *ti, void *ta_buf, + uint64_t pflags); +static void ta_flush_mod_ifidx(void *ta_buf); +static int ta_dump_ifidx_tentry(void *ta_state, struct table_info *ti, void *e, + ipfw_obj_tentry *tent); +static int ta_find_ifidx_tentry(void *ta_state, struct table_info *ti, + ipfw_obj_tentry *tent); +static void foreach_ifidx(struct namedobj_instance *ii, struct named_object *no, + void *arg); +static void ta_foreach_ifidx(void *ta_state, struct table_info *ti, + ta_foreach_f *f, void *arg); + +int +compare_ifidx(const void *k, const void *v) +{ + const struct ifidx *ifidx; + uint16_t key; + + key = *((const uint16_t *)k); + ifidx = (const struct ifidx *)v; + + if (key < ifidx->kidx) + return (-1); + else if (key > ifidx->kidx) + return (1); + + return (0); +} + +/* + * Adds item @item with key @key into ascending-sorted array @base. + * Assumes @base has enough additional storage. + * + * Returns 1 on success, 0 on duplicate key. + */ +static int +badd(const void *key, void *item, void *base, size_t nmemb, + size_t size, int (*compar) (const void *, const void *)) +{ + int min, max, mid, shift, res; + caddr_t paddr; + + if (nmemb == 0) { + memcpy(base, item, size); + return (1); + } + + /* Binary search */ + min = 0; + max = nmemb - 1; + mid = 0; + while (min <= max) { + mid = (min + max) / 2; + res = compar(key, (const void *)((caddr_t)base + mid * size)); + if (res == 0) + return (0); + + if (res > 0) + min = mid + 1; + else + max = mid - 1; + } + + /* Item not found. */ + res = compar(key, (const void *)((caddr_t)base + mid * size)); + if (res > 0) + shift = mid + 1; + else + shift = mid; + + paddr = (caddr_t)base + shift * size; + if (nmemb > shift) + memmove(paddr + size, paddr, (nmemb - shift) * size); + + memcpy(paddr, item, size); + + return (1); +} + +/* + * Deletes item with key @key from ascending-sorted array @base. + * + * Returns 1 on success, 0 for non-existent key. + */ +static int +bdel(const void *key, void *base, size_t nmemb, size_t size, + int (*compar) (const void *, const void *)) +{ + caddr_t item; + size_t sz; + + item = (caddr_t)bsearch(key, base, nmemb, size, compar); + + if (item == NULL) + return (0); + + sz = (caddr_t)base + nmemb * size - item; + + if (sz > 0) + memmove(item, item + size, sz); + + return (1); +} + +static struct ifidx * +ifidx_find(struct table_info *ti, void *key) +{ + struct ifidx *ifi; + + ifi = bsearch(key, ti->state, ti->data, sizeof(struct ifidx), + compare_ifidx); + + return (ifi); +} + +static int +ta_lookup_ifidx(struct table_info *ti, void *key, uint32_t keylen, + uint32_t *val) +{ + struct ifidx *ifi; + + ifi = ifidx_find(ti, key); + + if (ifi != NULL) { + *val = ifi->value; + return (1); + } + + return (0); +} + +static int +ta_init_ifidx(struct ip_fw_chain *ch, void **ta_state, struct table_info *ti, + char *data, uint8_t tflags) +{ + struct iftable_cfg *icfg; + + icfg = malloc(sizeof(struct iftable_cfg), M_IPFW, M_WAITOK | M_ZERO); + + icfg->ii = ipfw_objhash_create(DEFAULT_IFIDX_SIZE); + icfg->size = DEFAULT_IFIDX_SIZE; + icfg->main_ptr = malloc(sizeof(struct ifidx) * icfg->size, M_IPFW, + M_WAITOK | M_ZERO); + icfg->ch = ch; + + *ta_state = icfg; + ti->state = icfg->main_ptr; + ti->lookup = ta_lookup_ifidx; + + return (0); +} + +/* + * Handle tableinfo @ti pointer change (on table array resize). 
+ */ +static void +ta_change_ti_ifidx(void *ta_state, struct table_info *ti) +{ + struct iftable_cfg *icfg; + + icfg = (struct iftable_cfg *)ta_state; + icfg->ti = ti; +} + +static void +destroy_ifidx_locked(struct namedobj_instance *ii, struct named_object *no, + void *arg) +{ + struct ifentry *ife; + struct ip_fw_chain *ch; + + ch = (struct ip_fw_chain *)arg; + ife = (struct ifentry *)no; + + ipfw_iface_del_notify(ch, &ife->ic); + free(ife, M_IPFW_TBL); +} + + +/* + * Destroys table @ti + */ +static void +ta_destroy_ifidx(void *ta_state, struct table_info *ti) +{ + struct iftable_cfg *icfg; + struct ip_fw_chain *ch; + + icfg = (struct iftable_cfg *)ta_state; + ch = icfg->ch; + + if (icfg->main_ptr != NULL) + free(icfg->main_ptr, M_IPFW); + + ipfw_objhash_foreach(icfg->ii, destroy_ifidx_locked, ch); + + ipfw_objhash_destroy(icfg->ii); + + free(icfg, M_IPFW); +} + +/* + * Provide algo-specific table info + */ +static void +ta_dump_ifidx_tinfo(void *ta_state, struct table_info *ti, ipfw_ta_tinfo *tinfo) +{ + struct iftable_cfg *cfg; + + cfg = (struct iftable_cfg *)ta_state; + + tinfo->taclass4 = IPFW_TACLASS_ARRAY; + tinfo->size4 = cfg->size; + tinfo->count4 = cfg->used; + tinfo->itemsize4 = sizeof(struct ifidx); +} + +/* + * Prepare state to add to the table: + * allocate ifentry and reference needed interface. + */ +static int +ta_prepare_add_ifidx(struct ip_fw_chain *ch, struct tentry_info *tei, + void *ta_buf) +{ + struct ta_buf_ifidx *tb; + char *ifname; + struct ifentry *ife; + + tb = (struct ta_buf_ifidx *)ta_buf; + + /* Check if string is terminated */ + ifname = (char *)tei->paddr; + if (strnlen(ifname, IF_NAMESIZE) == IF_NAMESIZE) + return (EINVAL); + + ife = malloc(sizeof(struct ifentry), M_IPFW_TBL, M_WAITOK | M_ZERO); + ife->ic.cb = if_notifier; + ife->ic.cbdata = ife; + + if (ipfw_iface_ref(ch, ifname, &ife->ic) != 0) { + free(ife, M_IPFW_TBL); + return (EINVAL); + } + + /* Use ipfw_iface 'ifname' field as stable storage */ + ife->no.name = ife->ic.iface->ifname; + + tb->ife = ife; + + return (0); +} + +static int +ta_add_ifidx(void *ta_state, struct table_info *ti, struct tentry_info *tei, + void *ta_buf, uint32_t *pnum) +{ + struct iftable_cfg *icfg; + struct ifentry *ife, *tmp; + struct ta_buf_ifidx *tb; + struct ipfw_iface *iif; + struct ifidx *ifi; + char *ifname; + uint32_t value; + + tb = (struct ta_buf_ifidx *)ta_buf; + ifname = (char *)tei->paddr; + icfg = (struct iftable_cfg *)ta_state; + ife = tb->ife; + + ife->icfg = icfg; + ife->value = tei->value; + + tmp = (struct ifentry *)ipfw_objhash_lookup_name(icfg->ii, 0, ifname); + + if (tmp != NULL) { + if ((tei->flags & TEI_FLAGS_UPDATE) == 0) + return (EEXIST); + + /* Exchange values in @tmp and @tei */ + value = tmp->value; + tmp->value = tei->value; + tei->value = value; + + iif = tmp->ic.iface; + if (iif->resolved != 0) { + /* We have to update runtime value, too */ + ifi = ifidx_find(ti, &iif->ifindex); + ifi->value = ife->value; + } + + /* Indicate that update has happened instead of addition */ + tei->flags |= TEI_FLAGS_UPDATED; + *pnum = 0; + return (0); + } + + if ((tei->flags & TEI_FLAGS_DONTADD) != 0) + return (EFBIG); + + /* Link to internal list */ + ipfw_objhash_add(icfg->ii, &ife->no); + + /* Link notifier (possible running its callback) */ + ipfw_iface_add_notify(icfg->ch, &ife->ic); + icfg->count++; + + tb->ife = NULL; + *pnum = 1; + + return (0); +} + +/* + * Prepare to delete key from table. + * Do basic interface name checks. 
+ */
+static int
+ta_prepare_del_ifidx(struct ip_fw_chain *ch, struct tentry_info *tei,
+    void *ta_buf)
+{
+	struct ta_buf_ifidx *tb;
+	char *ifname;
+
+	tb = (struct ta_buf_ifidx *)ta_buf;
+
+	/* Check if string is terminated */
+	ifname = (char *)tei->paddr;
+	if (strnlen(ifname, IF_NAMESIZE) == IF_NAMESIZE)
+		return (EINVAL);
+
+	return (0);
+}
+
+/*
+ * Remove key from both the configuration list and the
+ * runtime array. Unlink the interface notifier.
+ */
+static int
+ta_del_ifidx(void *ta_state, struct table_info *ti, struct tentry_info *tei,
+    void *ta_buf, uint32_t *pnum)
+{
+	struct iftable_cfg *icfg;
+	struct ifentry *ife;
+	struct ta_buf_ifidx *tb;
+	char *ifname;
+	uint16_t ifindex;
+	int res;
+
+	tb = (struct ta_buf_ifidx *)ta_buf;
+	ifname = (char *)tei->paddr;
+	icfg = (struct iftable_cfg *)ta_state;
+	ife = tb->ife;
+
+	ife = (struct ifentry *)ipfw_objhash_lookup_name(icfg->ii, 0, ifname);
+
+	if (ife == NULL)
+		return (ENOENT);
+
+	if (ife->linked != 0) {
+		/* We have to remove item from runtime */
+		ifindex = ife->ic.iface->ifindex;
+
+		res = bdel(&ifindex, icfg->main_ptr, icfg->used,
+		    sizeof(struct ifidx), compare_ifidx);
+
+		KASSERT(res == 1, ("index %d does not exist", ifindex));
+		icfg->used--;
+		ti->data = icfg->used;
+		ife->linked = 0;
+	}
+
+	/* Unlink from local list */
+	ipfw_objhash_del(icfg->ii, &ife->no);
+	/* Unlink notifier */
+	ipfw_iface_del_notify(icfg->ch, &ife->ic);
+
+	icfg->count--;
+	tei->value = ife->value;
+
+	tb->ife = ife;
+	*pnum = 1;
+
+	return (0);
+}
+
+/*
+ * Flush deleted entry.
+ * Drops interface reference and frees entry.
+ */
+static void
+ta_flush_ifidx_entry(struct ip_fw_chain *ch, struct tentry_info *tei,
+    void *ta_buf)
+{
+	struct ta_buf_ifidx *tb;
+
+	tb = (struct ta_buf_ifidx *)ta_buf;
+
+	if (tb->ife != NULL) {
+		/* Unlink first */
+		ipfw_iface_unref(ch, &tb->ife->ic);
+		free(tb->ife, M_IPFW_TBL);
+	}
+}
+
+
+/*
+ * Handle interface announce/withdrawal for particular table.
+ * Every real runtime array modification happens here.
+ */
+static void
+if_notifier(struct ip_fw_chain *ch, void *cbdata, uint16_t ifindex)
+{
+	struct ifentry *ife;
+	struct ifidx ifi;
+	struct iftable_cfg *icfg;
+	struct table_info *ti;
+	int res;
+
+	ife = (struct ifentry *)cbdata;
+	icfg = ife->icfg;
+	ti = icfg->ti;
+
+	KASSERT(ti != NULL, ("ti=NULL, check change_ti handler"));
+
+	if (ife->linked == 0 && ifindex != 0) {
+		/* Interface announce */
+		ifi.kidx = ifindex;
+		ifi.spare = 0;
+		ifi.value = ife->value;
+		res = badd(&ifindex, &ifi, icfg->main_ptr, icfg->used,
+		    sizeof(struct ifidx), compare_ifidx);
+		KASSERT(res == 1, ("index %d already exists", ifindex));
+		icfg->used++;
+		ti->data = icfg->used;
+		ife->linked = 1;
+	} else if (ife->linked != 0 && ifindex == 0) {
+		/* Interface withdrawal */
+		ifindex = ife->ic.iface->ifindex;
+
+		res = bdel(&ifindex, icfg->main_ptr, icfg->used,
+		    sizeof(struct ifidx), compare_ifidx);
+
+		KASSERT(res == 1, ("index %d does not exist", ifindex));
+		icfg->used--;
+		ti->data = icfg->used;
+		ife->linked = 0;
+	}
+}
+
+
+/*
+ * Table growing callbacks.
+ */
+
+static int
+ta_need_modify_ifidx(void *ta_state, struct table_info *ti, uint32_t count,
+    uint64_t *pflags)
+{
+	struct iftable_cfg *cfg;
+	uint32_t size;
+
+	cfg = (struct iftable_cfg *)ta_state;
+
+	size = cfg->size;
+	while (size < cfg->count + count)
+		size *= 2;
+
+	if (size != cfg->size) {
+		*pflags = size;
+		return (1);
+	}
+
+	return (0);
+}
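+
+/*
+ * Example (sketch): with size == 64 (DEFAULT_IFIDX_SIZE), count == 60
+ * entries configured and a request to add 10 more, the loop above
+ * doubles size to 128, stores it in *pflags and returns 1, which sends
+ * the table through the prepare_mod/fill_mod/modify/flush_mod pipeline.
+ */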
+
+/*
+ * Allocate new, larger runtime ifidx array.
+ */
+static int
+ta_prepare_mod_ifidx(void *ta_buf, uint64_t *pflags)
+{
+	struct mod_item *mi;
+
+	mi = (struct mod_item *)ta_buf;
+
+	memset(mi, 0, sizeof(struct mod_item));
+	mi->size = *pflags;
+	mi->main_ptr = malloc(sizeof(struct ifidx) * mi->size, M_IPFW,
+	    M_WAITOK | M_ZERO);
+
+	return (0);
+}
+
+/*
+ * Copy data from old runtime array to new one.
+ */
+static int
+ta_fill_mod_ifidx(void *ta_state, struct table_info *ti, void *ta_buf,
+    uint64_t *pflags)
+{
+	struct mod_item *mi;
+	struct iftable_cfg *icfg;
+
+	mi = (struct mod_item *)ta_buf;
+	icfg = (struct iftable_cfg *)ta_state;
+
+	/* Check if we still need to grow array */
+	if (icfg->size >= mi->size) {
+		*pflags = 0;
+		return (0);
+	}
+
+	memcpy(mi->main_ptr, icfg->main_ptr, icfg->used * sizeof(struct ifidx));
+
+	return (0);
+}
+
+/*
+ * Switch old & new arrays.
+ */
+static void
+ta_modify_ifidx(void *ta_state, struct table_info *ti, void *ta_buf,
+    uint64_t pflags)
+{
+	struct mod_item *mi;
+	struct iftable_cfg *icfg;
+	void *old_ptr;
+
+	mi = (struct mod_item *)ta_buf;
+	icfg = (struct iftable_cfg *)ta_state;
+
+	old_ptr = icfg->main_ptr;
+	icfg->main_ptr = mi->main_ptr;
+	icfg->size = mi->size;
+	ti->state = icfg->main_ptr;
+
+	mi->main_ptr = old_ptr;
+}
+
+/*
+ * Free unneeded array.
+ */
+static void
+ta_flush_mod_ifidx(void *ta_buf)
+{
+	struct mod_item *mi;
+
+	mi = (struct mod_item *)ta_buf;
+	if (mi->main_ptr != NULL)
+		free(mi->main_ptr, M_IPFW);
+}
+
+static int
+ta_dump_ifidx_tentry(void *ta_state, struct table_info *ti, void *e,
+    ipfw_obj_tentry *tent)
+{
+	struct ifentry *ife;
+
+	ife = (struct ifentry *)e;
+
+	tent->masklen = 8 * IF_NAMESIZE;
+	memcpy(&tent->k, ife->no.name, IF_NAMESIZE);
+	tent->v.kidx = ife->value;
+
+	return (0);
+}
+
+static int
+ta_find_ifidx_tentry(void *ta_state, struct table_info *ti,
+    ipfw_obj_tentry *tent)
+{
+	struct iftable_cfg *icfg;
+	struct ifentry *ife;
+	char *ifname;
+
+	icfg = (struct iftable_cfg *)ta_state;
+	ifname = tent->k.iface;
+
+	if (strnlen(ifname, IF_NAMESIZE) == IF_NAMESIZE)
+		return (EINVAL);
+
+	ife = (struct ifentry *)ipfw_objhash_lookup_name(icfg->ii, 0, ifname);
+
+	if (ife != NULL) {
+		ta_dump_ifidx_tentry(ta_state, ti, ife, tent);
+		return (0);
+	}
+
+	return (ENOENT);
+}
+
+struct wa_ifidx {
+	ta_foreach_f *f;
+	void *arg;
+};
+
+static void
+foreach_ifidx(struct namedobj_instance *ii, struct named_object *no,
+    void *arg)
+{
+	struct ifentry *ife;
+	struct wa_ifidx *wa;
+
+	ife = (struct ifentry *)no;
+	wa = (struct wa_ifidx *)arg;
+
+	wa->f(ife, wa->arg);
+}
+
+static void
+ta_foreach_ifidx(void *ta_state, struct table_info *ti, ta_foreach_f *f,
+    void *arg)
+{
+	struct iftable_cfg *icfg;
+	struct wa_ifidx wa;
+
+	icfg = (struct iftable_cfg *)ta_state;
+
+	wa.f = f;
+	wa.arg = arg;
+
+	ipfw_objhash_foreach(icfg->ii, foreach_ifidx, &wa);
+}
+
+struct table_algo iface_idx = {
+	.name = "iface:array",
+	.type = IPFW_TABLE_INTERFACE,
+	.flags = TA_FLAG_DEFAULT,
+	.ta_buf_size = sizeof(struct ta_buf_ifidx),
+	.init = ta_init_ifidx,
+	.destroy = ta_destroy_ifidx,
+	.prepare_add = ta_prepare_add_ifidx,
+	.prepare_del = ta_prepare_del_ifidx,
+	.add = ta_add_ifidx,
+	.del = ta_del_ifidx,
+	.flush_entry = ta_flush_ifidx_entry,
+	.foreach = ta_foreach_ifidx,
+	.dump_tentry = ta_dump_ifidx_tentry,
+	.find_tentry = ta_find_ifidx_tentry,
+	.dump_tinfo = ta_dump_ifidx_tinfo,
+	.need_modify = ta_need_modify_ifidx,
+	.prepare_mod = ta_prepare_mod_ifidx,
+	.fill_mod = ta_fill_mod_ifidx,
+	.modify = ta_modify_ifidx,
+	.flush_mod = ta_flush_mod_ifidx,
+	.change_ti = ta_change_ti_ifidx,
+};
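+
+/*
+ * Illustrative (non-upstream) sketch: at runtime iface:array is keyed
+ * on the 16-bit kernel interface index, so a lookup is just a binary
+ * search of the sorted ifidx array maintained by if_notifier().
+ */
+#if 0	/* example only, not compiled */
+static uint32_t
+example_iface_lookup(struct table_info *ti, uint16_t ifindex, uint32_t dflt)
+{
+	uint32_t v;
+
+	if (ta_lookup_ifidx(ti, &ifindex, sizeof(ifindex), &v) != 0)
+		return (v);
+	return (dflt);
+}
+#endif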
ta_change_ti_ifidx, +}; + +/* + * Number array cmds. + * + * Implementation: + * + * Runtime part: + * - sorted array of "struct numarray" pointed by ti->state. + * Array is allocated with rounding up to NUMARRAY_CHUNK. + * - current array size is stored in ti->data + * + */ + +struct numarray { + uint32_t number; + uint32_t value; +}; + +struct numarray_cfg { + void *main_ptr; + size_t size; /* Number of items allocated in array */ + size_t used; /* Number of items _active_ now */ +}; + +struct ta_buf_numarray +{ + struct numarray na; +}; + +int compare_numarray(const void *k, const void *v); +static struct numarray *numarray_find(struct table_info *ti, void *key); +static int ta_lookup_numarray(struct table_info *ti, void *key, + uint32_t keylen, uint32_t *val); +static int ta_init_numarray(struct ip_fw_chain *ch, void **ta_state, + struct table_info *ti, char *data, uint8_t tflags); +static void ta_destroy_numarray(void *ta_state, struct table_info *ti); +static void ta_dump_numarray_tinfo(void *ta_state, struct table_info *ti, + ipfw_ta_tinfo *tinfo); +static int ta_prepare_add_numarray(struct ip_fw_chain *ch, + struct tentry_info *tei, void *ta_buf); +static int ta_add_numarray(void *ta_state, struct table_info *ti, + struct tentry_info *tei, void *ta_buf, uint32_t *pnum); +static int ta_del_numarray(void *ta_state, struct table_info *ti, + struct tentry_info *tei, void *ta_buf, uint32_t *pnum); +static void ta_flush_numarray_entry(struct ip_fw_chain *ch, + struct tentry_info *tei, void *ta_buf); +static int ta_need_modify_numarray(void *ta_state, struct table_info *ti, + uint32_t count, uint64_t *pflags); +static int ta_prepare_mod_numarray(void *ta_buf, uint64_t *pflags); +static int ta_fill_mod_numarray(void *ta_state, struct table_info *ti, + void *ta_buf, uint64_t *pflags); +static void ta_modify_numarray(void *ta_state, struct table_info *ti, + void *ta_buf, uint64_t pflags); +static void ta_flush_mod_numarray(void *ta_buf); +static int ta_dump_numarray_tentry(void *ta_state, struct table_info *ti, + void *e, ipfw_obj_tentry *tent); +static int ta_find_numarray_tentry(void *ta_state, struct table_info *ti, + ipfw_obj_tentry *tent); +static void ta_foreach_numarray(void *ta_state, struct table_info *ti, + ta_foreach_f *f, void *arg); + +int +compare_numarray(const void *k, const void *v) +{ + const struct numarray *na; + uint32_t key; + + key = *((const uint32_t *)k); + na = (const struct numarray *)v; + + if (key < na->number) + return (-1); + else if (key > na->number) + return (1); + + return (0); +} + +static struct numarray * +numarray_find(struct table_info *ti, void *key) +{ + struct numarray *ri; + + ri = bsearch(key, ti->state, ti->data, sizeof(struct numarray), + compare_ifidx); + + return (ri); +} + +static int +ta_lookup_numarray(struct table_info *ti, void *key, uint32_t keylen, + uint32_t *val) +{ + struct numarray *ri; + + ri = numarray_find(ti, key); + + if (ri != NULL) { + *val = ri->value; + return (1); + } + + return (0); +} + +static int +ta_init_numarray(struct ip_fw_chain *ch, void **ta_state, struct table_info *ti, + char *data, uint8_t tflags) +{ + struct numarray_cfg *cfg; + + cfg = malloc(sizeof(*cfg), M_IPFW, M_WAITOK | M_ZERO); + + cfg->size = 16; + cfg->main_ptr = malloc(sizeof(struct numarray) * cfg->size, M_IPFW, + M_WAITOK | M_ZERO); + + *ta_state = cfg; + ti->state = cfg->main_ptr; + ti->lookup = ta_lookup_numarray; + + return (0); +} + +/* + * Destroys table @ti + */ +static void +ta_destroy_numarray(void *ta_state, struct table_info *ti) +{ + 
+
+/*
+ * Destroys table @ti
+ */
+static void
+ta_destroy_numarray(void *ta_state, struct table_info *ti)
+{
+	struct numarray_cfg *cfg;
+
+	cfg = (struct numarray_cfg *)ta_state;
+
+	if (cfg->main_ptr != NULL)
+		free(cfg->main_ptr, M_IPFW);
+
+	free(cfg, M_IPFW);
+}
+
+/*
+ * Provide algo-specific table info
+ */
+static void
+ta_dump_numarray_tinfo(void *ta_state, struct table_info *ti, ipfw_ta_tinfo *tinfo)
+{
+	struct numarray_cfg *cfg;
+
+	cfg = (struct numarray_cfg *)ta_state;
+
+	tinfo->taclass4 = IPFW_TACLASS_ARRAY;
+	tinfo->size4 = cfg->size;
+	tinfo->count4 = cfg->used;
+	tinfo->itemsize4 = sizeof(struct numarray);
+}
+
+/*
+ * Prepare for addition/deletion to an array.
+ */
+static int
+ta_prepare_add_numarray(struct ip_fw_chain *ch, struct tentry_info *tei,
+    void *ta_buf)
+{
+	struct ta_buf_numarray *tb;
+
+	tb = (struct ta_buf_numarray *)ta_buf;
+
+	tb->na.number = *((uint32_t *)tei->paddr);
+
+	return (0);
+}
+
+static int
+ta_add_numarray(void *ta_state, struct table_info *ti, struct tentry_info *tei,
+    void *ta_buf, uint32_t *pnum)
+{
+	struct numarray_cfg *cfg;
+	struct ta_buf_numarray *tb;
+	struct numarray *ri;
+	int res;
+	uint32_t value;
+
+	tb = (struct ta_buf_numarray *)ta_buf;
+	cfg = (struct numarray_cfg *)ta_state;
+
+	/* Read current value from @tei */
+	tb->na.value = tei->value;
+
+	ri = numarray_find(ti, &tb->na.number);
+
+	if (ri != NULL) {
+		if ((tei->flags & TEI_FLAGS_UPDATE) == 0)
+			return (EEXIST);
+
+		/* Exchange values between ri and @tei */
+		value = ri->value;
+		ri->value = tei->value;
+		tei->value = value;
+		/* Indicate that update has happened instead of addition */
+		tei->flags |= TEI_FLAGS_UPDATED;
+		*pnum = 0;
+		return (0);
+	}
+
+	if ((tei->flags & TEI_FLAGS_DONTADD) != 0)
+		return (EFBIG);
+
+	res = badd(&tb->na.number, &tb->na, cfg->main_ptr, cfg->used,
+	    sizeof(struct numarray), compare_numarray);
+
+	KASSERT(res == 1, ("number %u already exists", tb->na.number));
+	cfg->used++;
+	ti->data = cfg->used;
+	*pnum = 1;
+
+	return (0);
+}
+
+/*
+ * Remove key from the runtime array.
+ */
+static int
+ta_del_numarray(void *ta_state, struct table_info *ti, struct tentry_info *tei,
+    void *ta_buf, uint32_t *pnum)
+{
+	struct numarray_cfg *cfg;
+	struct ta_buf_numarray *tb;
+	struct numarray *ri;
+	int res;
+
+	tb = (struct ta_buf_numarray *)ta_buf;
+	cfg = (struct numarray_cfg *)ta_state;
+
+	ri = numarray_find(ti, &tb->na.number);
+	if (ri == NULL)
+		return (ENOENT);
+
+	tei->value = ri->value;
+
+	res = bdel(&tb->na.number, cfg->main_ptr, cfg->used,
+	    sizeof(struct numarray), compare_numarray);
+
+	KASSERT(res == 1, ("number %u does not exist", tb->na.number));
+	cfg->used--;
+	ti->data = cfg->used;
+	*pnum = 1;
+
+	return (0);
+}
+
+static void
+ta_flush_numarray_entry(struct ip_fw_chain *ch, struct tentry_info *tei,
+    void *ta_buf)
+{
+
+	/* We don't have any state, do nothing */
+}
+
+
+/*
+ * Table growing callbacks.
+ */
+
+static int
+ta_need_modify_numarray(void *ta_state, struct table_info *ti, uint32_t count,
+    uint64_t *pflags)
+{
+	struct numarray_cfg *cfg;
+	size_t size;
+
+	cfg = (struct numarray_cfg *)ta_state;
+
+	size = cfg->size;
+	while (size < cfg->used + count)
+		size *= 2;
+
+	if (size != cfg->size) {
+		*pflags = size;
+		return (1);
+	}
+
+	return (0);
+}
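+
+/*
+ * Editor's illustration (not part of the imported sources): the table
+ * core grows the runtime array in four steps -- need_modify() asks
+ * whether a resize is required, prepare_mod() allocates the doubled
+ * copy without locks held, fill_mod()/modify() copy and swap under
+ * lock, and flush_mod() frees whichever array lost the swap. The real
+ * core passes a ta_buf_size-sized buffer instead of a bare mod_item.
+ */
+#if 0	/* call-sequence sketch only, kept out of the build */
+static void
+numarray_grow_sketch(void *ta_state, struct table_info *ti, uint32_t count)
+{
+	struct mod_item mi;
+	uint64_t pflags;
+
+	if (ta_need_modify_numarray(ta_state, ti, count, &pflags) == 0)
+		return;		/* current array is already large enough */
+	ta_prepare_mod_numarray(&mi, &pflags);
+	ta_fill_mod_numarray(ta_state, ti, &mi, &pflags);
+	if (pflags != 0)	/* fill_mod zeroes pflags if growth raced away */
+		ta_modify_numarray(ta_state, ti, &mi, pflags);
+	ta_flush_mod_numarray(&mi);	/* frees the superseded array */
+}
+#endif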
+
+/*
+ * Allocate new, larger runtime array.
+ */
+static int
+ta_prepare_mod_numarray(void *ta_buf, uint64_t *pflags)
+{
+	struct mod_item *mi;
+
+	mi = (struct mod_item *)ta_buf;
+
+	memset(mi, 0, sizeof(struct mod_item));
+	mi->size = *pflags;
+	mi->main_ptr = malloc(sizeof(struct numarray) * mi->size, M_IPFW,
+	    M_WAITOK | M_ZERO);
+
+	return (0);
+}
+
+/*
+ * Copy data from old runtime array to new one.
+ */
+static int
+ta_fill_mod_numarray(void *ta_state, struct table_info *ti, void *ta_buf,
+    uint64_t *pflags)
+{
+	struct mod_item *mi;
+	struct numarray_cfg *cfg;
+
+	mi = (struct mod_item *)ta_buf;
+	cfg = (struct numarray_cfg *)ta_state;
+
+	/* Check if we still need to grow array */
+	if (cfg->size >= mi->size) {
+		*pflags = 0;
+		return (0);
+	}
+
+	memcpy(mi->main_ptr, cfg->main_ptr, cfg->used * sizeof(struct numarray));
+
+	return (0);
+}
+
+/*
+ * Switch old & new arrays.
+ */
+static void
+ta_modify_numarray(void *ta_state, struct table_info *ti, void *ta_buf,
+    uint64_t pflags)
+{
+	struct mod_item *mi;
+	struct numarray_cfg *cfg;
+	void *old_ptr;
+
+	mi = (struct mod_item *)ta_buf;
+	cfg = (struct numarray_cfg *)ta_state;
+
+	old_ptr = cfg->main_ptr;
+	cfg->main_ptr = mi->main_ptr;
+	cfg->size = mi->size;
+	ti->state = cfg->main_ptr;
+
+	mi->main_ptr = old_ptr;
+}
+
+/*
+ * Free the unneeded array.
+ */
+static void
+ta_flush_mod_numarray(void *ta_buf)
+{
+	struct mod_item *mi;
+
+	mi = (struct mod_item *)ta_buf;
+	if (mi->main_ptr != NULL)
+		free(mi->main_ptr, M_IPFW);
+}
+
+static int
+ta_dump_numarray_tentry(void *ta_state, struct table_info *ti, void *e,
+    ipfw_obj_tentry *tent)
+{
+	struct numarray *na;
+
+	na = (struct numarray *)e;
+
+	tent->k.key = na->number;
+	tent->v.kidx = na->value;
+
+	return (0);
+}
+
+static int
+ta_find_numarray_tentry(void *ta_state, struct table_info *ti,
+    ipfw_obj_tentry *tent)
+{
+	struct numarray_cfg *cfg;
+	struct numarray *ri;
+
+	cfg = (struct numarray_cfg *)ta_state;
+
+	ri = numarray_find(ti, &tent->k.key);
+
+	if (ri != NULL) {
+		ta_dump_numarray_tentry(ta_state, ti, ri, tent);
+		return (0);
+	}
+
+	return (ENOENT);
+}
+
+static void
+ta_foreach_numarray(void *ta_state, struct table_info *ti, ta_foreach_f *f,
+    void *arg)
+{
+	struct numarray_cfg *cfg;
+	struct numarray *array;
+	int i;
+
+	cfg = (struct numarray_cfg *)ta_state;
+	array = cfg->main_ptr;
+
+	for (i = 0; i < cfg->used; i++)
+		f(&array[i], arg);
+}
+
+struct table_algo number_array = {
+	.name = "number:array",
+	.type = IPFW_TABLE_NUMBER,
+	.ta_buf_size = sizeof(struct ta_buf_numarray),
+	.init = ta_init_numarray,
+	.destroy = ta_destroy_numarray,
+	.prepare_add = ta_prepare_add_numarray,
+	.prepare_del = ta_prepare_add_numarray,
+	.add = ta_add_numarray,
+	.del = ta_del_numarray,
+	.flush_entry = ta_flush_numarray_entry,
+	.foreach = ta_foreach_numarray,
+	.dump_tentry = ta_dump_numarray_tentry,
+	.find_tentry = ta_find_numarray_tentry,
+	.dump_tinfo = ta_dump_numarray_tinfo,
+	.need_modify = ta_need_modify_numarray,
+	.prepare_mod = ta_prepare_mod_numarray,
+	.fill_mod = ta_fill_mod_numarray,
+	.modify = ta_modify_numarray,
+	.flush_mod = ta_flush_mod_numarray,
+};
+
+/*
+ * flow:hash cmds
+ *
+ *
+ * ti->data:
+ * [inv.mask4][inv.mask6][log2hsize4][log2hsize6]
+ * [        8][        8][         8][         8]
+ *
+ * inv.mask4: 32 - mask
+ * inv.mask6:
+ * 1) _slow lookup: mask
+ * 2) _aligned: (128 - mask) / 8
+ * 3) _64: 8
+ *
+ *
+ * pflags:
+ * [hsize4][hsize6]
+ * [    16][    16]
+ */
+
+struct fhashentry;
+
+SLIST_HEAD(fhashbhead, fhashentry);
+
+struct fhashentry {
+	SLIST_ENTRY(fhashentry) next;
+	uint8_t af;
+	uint8_t proto;
+	uint16_t 
spare0; + uint16_t dport; + uint16_t sport; + uint32_t value; + uint32_t spare1; +}; + +struct fhashentry4 { + struct fhashentry e; + struct in_addr dip; + struct in_addr sip; +}; + +struct fhashentry6 { + struct fhashentry e; + struct in6_addr dip6; + struct in6_addr sip6; +}; + +struct fhash_cfg { + struct fhashbhead *head; + size_t size; + size_t items; + struct fhashentry4 fe4; + struct fhashentry6 fe6; +}; + +struct ta_buf_fhash { + void *ent_ptr; + struct fhashentry6 fe6; +}; + +static __inline int cmp_flow_ent(struct fhashentry *a, + struct fhashentry *b, size_t sz); +static __inline uint32_t hash_flow4(struct fhashentry4 *f, int hsize); +static __inline uint32_t hash_flow6(struct fhashentry6 *f, int hsize); +static uint32_t hash_flow_ent(struct fhashentry *ent, uint32_t size); +static int ta_lookup_fhash(struct table_info *ti, void *key, uint32_t keylen, + uint32_t *val); +static int ta_init_fhash(struct ip_fw_chain *ch, void **ta_state, +struct table_info *ti, char *data, uint8_t tflags); +static void ta_destroy_fhash(void *ta_state, struct table_info *ti); +static void ta_dump_fhash_tinfo(void *ta_state, struct table_info *ti, + ipfw_ta_tinfo *tinfo); +static int ta_dump_fhash_tentry(void *ta_state, struct table_info *ti, + void *e, ipfw_obj_tentry *tent); +static int tei_to_fhash_ent(struct tentry_info *tei, struct fhashentry *ent); +static int ta_find_fhash_tentry(void *ta_state, struct table_info *ti, + ipfw_obj_tentry *tent); +static void ta_foreach_fhash(void *ta_state, struct table_info *ti, + ta_foreach_f *f, void *arg); +static int ta_prepare_add_fhash(struct ip_fw_chain *ch, + struct tentry_info *tei, void *ta_buf); +static int ta_add_fhash(void *ta_state, struct table_info *ti, + struct tentry_info *tei, void *ta_buf, uint32_t *pnum); +static int ta_prepare_del_fhash(struct ip_fw_chain *ch, struct tentry_info *tei, + void *ta_buf); +static int ta_del_fhash(void *ta_state, struct table_info *ti, + struct tentry_info *tei, void *ta_buf, uint32_t *pnum); +static void ta_flush_fhash_entry(struct ip_fw_chain *ch, struct tentry_info *tei, + void *ta_buf); +static int ta_need_modify_fhash(void *ta_state, struct table_info *ti, + uint32_t count, uint64_t *pflags); +static int ta_prepare_mod_fhash(void *ta_buf, uint64_t *pflags); +static int ta_fill_mod_fhash(void *ta_state, struct table_info *ti, + void *ta_buf, uint64_t *pflags); +static void ta_modify_fhash(void *ta_state, struct table_info *ti, void *ta_buf, + uint64_t pflags); +static void ta_flush_mod_fhash(void *ta_buf); + +static __inline int +cmp_flow_ent(struct fhashentry *a, struct fhashentry *b, size_t sz) +{ + uint64_t *ka, *kb; + + ka = (uint64_t *)(&a->next + 1); + kb = (uint64_t *)(&b->next + 1); + + if (*ka == *kb && (memcmp(a + 1, b + 1, sz) == 0)) + return (1); + + return (0); +} + +static __inline uint32_t +hash_flow4(struct fhashentry4 *f, int hsize) +{ + uint32_t i; + + i = (f->dip.s_addr) ^ (f->sip.s_addr) ^ (f->e.dport) ^ (f->e.sport); + + return (i % (hsize - 1)); +} + +static __inline uint32_t +hash_flow6(struct fhashentry6 *f, int hsize) +{ + uint32_t i; + + i = (f->dip6.__u6_addr.__u6_addr32[2]) ^ + (f->dip6.__u6_addr.__u6_addr32[3]) ^ + (f->sip6.__u6_addr.__u6_addr32[2]) ^ + (f->sip6.__u6_addr.__u6_addr32[3]) ^ + (f->e.dport) ^ (f->e.sport); + + return (i % (hsize - 1)); +} + +static uint32_t +hash_flow_ent(struct fhashentry *ent, uint32_t size) +{ + uint32_t hash; + + if (ent->af == AF_INET) { + hash = hash_flow4((struct fhashentry4 *)ent, size); + } else { + hash = hash_flow6((struct fhashentry6 
*)ent, size); + } + + return (hash); +} + +static int +ta_lookup_fhash(struct table_info *ti, void *key, uint32_t keylen, + uint32_t *val) +{ + struct fhashbhead *head; + struct fhashentry *ent; + struct fhashentry4 *m4; + struct ipfw_flow_id *id; + uint16_t hash, hsize; + + id = (struct ipfw_flow_id *)key; + head = (struct fhashbhead *)ti->state; + hsize = ti->data; + m4 = (struct fhashentry4 *)ti->xstate; + + if (id->addr_type == 4) { + struct fhashentry4 f; + + /* Copy hash mask */ + f = *m4; + + f.dip.s_addr &= id->dst_ip; + f.sip.s_addr &= id->src_ip; + f.e.dport &= id->dst_port; + f.e.sport &= id->src_port; + f.e.proto &= id->proto; + hash = hash_flow4(&f, hsize); + SLIST_FOREACH(ent, &head[hash], next) { + if (cmp_flow_ent(ent, &f.e, 2 * 4) != 0) { + *val = ent->value; + return (1); + } + } + } else if (id->addr_type == 6) { + struct fhashentry6 f; + uint64_t *fp, *idp; + + /* Copy hash mask */ + f = *((struct fhashentry6 *)(m4 + 1)); + + /* Handle lack of __u6_addr.__u6_addr64 */ + fp = (uint64_t *)&f.dip6; + idp = (uint64_t *)&id->dst_ip6; + /* src IPv6 is stored after dst IPv6 */ + *fp++ &= *idp++; + *fp++ &= *idp++; + *fp++ &= *idp++; + *fp &= *idp; + f.e.dport &= id->dst_port; + f.e.sport &= id->src_port; + f.e.proto &= id->proto; + hash = hash_flow6(&f, hsize); + SLIST_FOREACH(ent, &head[hash], next) { + if (cmp_flow_ent(ent, &f.e, 2 * 16) != 0) { + *val = ent->value; + return (1); + } + } + } + + return (0); +} + +/* + * New table. + */ +static int +ta_init_fhash(struct ip_fw_chain *ch, void **ta_state, struct table_info *ti, + char *data, uint8_t tflags) +{ + int i; + struct fhash_cfg *cfg; + struct fhashentry4 *fe4; + struct fhashentry6 *fe6; + + cfg = malloc(sizeof(struct fhash_cfg), M_IPFW, M_WAITOK | M_ZERO); + + cfg->size = 512; + + cfg->head = malloc(sizeof(struct fhashbhead) * cfg->size, M_IPFW, + M_WAITOK | M_ZERO); + for (i = 0; i < cfg->size; i++) + SLIST_INIT(&cfg->head[i]); + + /* Fill in fe masks based on @tflags */ + fe4 = &cfg->fe4; + fe6 = &cfg->fe6; + if (tflags & IPFW_TFFLAG_SRCIP) { + memset(&fe4->sip, 0xFF, sizeof(fe4->sip)); + memset(&fe6->sip6, 0xFF, sizeof(fe6->sip6)); + } + if (tflags & IPFW_TFFLAG_DSTIP) { + memset(&fe4->dip, 0xFF, sizeof(fe4->dip)); + memset(&fe6->dip6, 0xFF, sizeof(fe6->dip6)); + } + if (tflags & IPFW_TFFLAG_SRCPORT) { + memset(&fe4->e.sport, 0xFF, sizeof(fe4->e.sport)); + memset(&fe6->e.sport, 0xFF, sizeof(fe6->e.sport)); + } + if (tflags & IPFW_TFFLAG_DSTPORT) { + memset(&fe4->e.dport, 0xFF, sizeof(fe4->e.dport)); + memset(&fe6->e.dport, 0xFF, sizeof(fe6->e.dport)); + } + if (tflags & IPFW_TFFLAG_PROTO) { + memset(&fe4->e.proto, 0xFF, sizeof(fe4->e.proto)); + memset(&fe6->e.proto, 0xFF, sizeof(fe6->e.proto)); + } + + fe4->e.af = AF_INET; + fe6->e.af = AF_INET6; + + *ta_state = cfg; + ti->state = cfg->head; + ti->xstate = &cfg->fe4; + ti->data = cfg->size; + ti->lookup = ta_lookup_fhash; + + return (0); +} + +static void +ta_destroy_fhash(void *ta_state, struct table_info *ti) +{ + struct fhash_cfg *cfg; + struct fhashentry *ent, *ent_next; + int i; + + cfg = (struct fhash_cfg *)ta_state; + + for (i = 0; i < cfg->size; i++) + SLIST_FOREACH_SAFE(ent, &cfg->head[i], next, ent_next) + free(ent, M_IPFW_TBL); + + free(cfg->head, M_IPFW); + free(cfg, M_IPFW); +} + +/* + * Provide algo-specific table info + */ +static void +ta_dump_fhash_tinfo(void *ta_state, struct table_info *ti, ipfw_ta_tinfo *tinfo) +{ + struct fhash_cfg *cfg; + + cfg = (struct fhash_cfg *)ta_state; + + tinfo->flags = IPFW_TATFLAGS_AFITEM; + tinfo->taclass4 = 
IPFW_TACLASS_HASH; + tinfo->size4 = cfg->size; + tinfo->count4 = cfg->items; + tinfo->itemsize4 = sizeof(struct fhashentry4); + tinfo->itemsize6 = sizeof(struct fhashentry6); +} + +static int +ta_dump_fhash_tentry(void *ta_state, struct table_info *ti, void *e, + ipfw_obj_tentry *tent) +{ + struct fhash_cfg *cfg; + struct fhashentry *ent; + struct fhashentry4 *fe4; +#ifdef INET6 + struct fhashentry6 *fe6; +#endif + struct tflow_entry *tfe; + + cfg = (struct fhash_cfg *)ta_state; + ent = (struct fhashentry *)e; + tfe = &tent->k.flow; + + tfe->af = ent->af; + tfe->proto = ent->proto; + tfe->dport = htons(ent->dport); + tfe->sport = htons(ent->sport); + tent->v.kidx = ent->value; + tent->subtype = ent->af; + + if (ent->af == AF_INET) { + fe4 = (struct fhashentry4 *)ent; + tfe->a.a4.sip.s_addr = htonl(fe4->sip.s_addr); + tfe->a.a4.dip.s_addr = htonl(fe4->dip.s_addr); + tent->masklen = 32; +#ifdef INET6 + } else { + fe6 = (struct fhashentry6 *)ent; + tfe->a.a6.sip6 = fe6->sip6; + tfe->a.a6.dip6 = fe6->dip6; + tent->masklen = 128; +#endif + } + + return (0); +} + +static int +tei_to_fhash_ent(struct tentry_info *tei, struct fhashentry *ent) +{ +#ifdef INET + struct fhashentry4 *fe4; +#endif +#ifdef INET6 + struct fhashentry6 *fe6; +#endif + struct tflow_entry *tfe; + + tfe = (struct tflow_entry *)tei->paddr; + + ent->af = tei->subtype; + ent->proto = tfe->proto; + ent->dport = ntohs(tfe->dport); + ent->sport = ntohs(tfe->sport); + + if (tei->subtype == AF_INET) { +#ifdef INET + fe4 = (struct fhashentry4 *)ent; + fe4->sip.s_addr = ntohl(tfe->a.a4.sip.s_addr); + fe4->dip.s_addr = ntohl(tfe->a.a4.dip.s_addr); +#endif +#ifdef INET6 + } else if (tei->subtype == AF_INET6) { + fe6 = (struct fhashentry6 *)ent; + fe6->sip6 = tfe->a.a6.sip6; + fe6->dip6 = tfe->a.a6.dip6; +#endif + } else { + /* Unknown CIDR type */ + return (EINVAL); + } + + return (0); +} + + +static int +ta_find_fhash_tentry(void *ta_state, struct table_info *ti, + ipfw_obj_tentry *tent) +{ + struct fhash_cfg *cfg; + struct fhashbhead *head; + struct fhashentry *ent, *tmp; + struct fhashentry6 fe6; + struct tentry_info tei; + int error; + uint32_t hash; + size_t sz; + + cfg = (struct fhash_cfg *)ta_state; + + ent = &fe6.e; + + memset(&fe6, 0, sizeof(fe6)); + memset(&tei, 0, sizeof(tei)); + + tei.paddr = &tent->k.flow; + tei.subtype = tent->subtype; + + if ((error = tei_to_fhash_ent(&tei, ent)) != 0) + return (error); + + head = cfg->head; + hash = hash_flow_ent(ent, cfg->size); + + if (tei.subtype == AF_INET) + sz = 2 * sizeof(struct in_addr); + else + sz = 2 * sizeof(struct in6_addr); + + /* Check for existence */ + SLIST_FOREACH(tmp, &head[hash], next) { + if (cmp_flow_ent(tmp, ent, sz) != 0) { + ta_dump_fhash_tentry(ta_state, ti, tmp, tent); + return (0); + } + } + + return (ENOENT); +} + +static void +ta_foreach_fhash(void *ta_state, struct table_info *ti, ta_foreach_f *f, + void *arg) +{ + struct fhash_cfg *cfg; + struct fhashentry *ent, *ent_next; + int i; + + cfg = (struct fhash_cfg *)ta_state; + + for (i = 0; i < cfg->size; i++) + SLIST_FOREACH_SAFE(ent, &cfg->head[i], next, ent_next) + f(ent, arg); +} + +static int +ta_prepare_add_fhash(struct ip_fw_chain *ch, struct tentry_info *tei, + void *ta_buf) +{ + struct ta_buf_fhash *tb; + struct fhashentry *ent; + size_t sz; + int error; + + tb = (struct ta_buf_fhash *)ta_buf; + + if (tei->subtype == AF_INET) + sz = sizeof(struct fhashentry4); + else if (tei->subtype == AF_INET6) + sz = sizeof(struct fhashentry6); + else + return (EINVAL); + + ent = malloc(sz, M_IPFW_TBL, M_WAITOK | 
M_ZERO); + + error = tei_to_fhash_ent(tei, ent); + if (error != 0) { + free(ent, M_IPFW_TBL); + return (error); + } + tb->ent_ptr = ent; + + return (0); +} + +static int +ta_add_fhash(void *ta_state, struct table_info *ti, struct tentry_info *tei, + void *ta_buf, uint32_t *pnum) +{ + struct fhash_cfg *cfg; + struct fhashbhead *head; + struct fhashentry *ent, *tmp; + struct ta_buf_fhash *tb; + int exists; + uint32_t hash, value; + size_t sz; + + cfg = (struct fhash_cfg *)ta_state; + tb = (struct ta_buf_fhash *)ta_buf; + ent = (struct fhashentry *)tb->ent_ptr; + exists = 0; + + /* Read current value from @tei */ + ent->value = tei->value; + + head = cfg->head; + hash = hash_flow_ent(ent, cfg->size); + + if (tei->subtype == AF_INET) + sz = 2 * sizeof(struct in_addr); + else + sz = 2 * sizeof(struct in6_addr); + + /* Check for existence */ + SLIST_FOREACH(tmp, &head[hash], next) { + if (cmp_flow_ent(tmp, ent, sz) != 0) { + exists = 1; + break; + } + } + + if (exists == 1) { + if ((tei->flags & TEI_FLAGS_UPDATE) == 0) + return (EEXIST); + /* Record already exists. Update value if we're asked to */ + /* Exchange values between tmp and @tei */ + value = tmp->value; + tmp->value = tei->value; + tei->value = value; + /* Indicate that update has happened instead of addition */ + tei->flags |= TEI_FLAGS_UPDATED; + *pnum = 0; + } else { + if ((tei->flags & TEI_FLAGS_DONTADD) != 0) + return (EFBIG); + + SLIST_INSERT_HEAD(&head[hash], ent, next); + tb->ent_ptr = NULL; + *pnum = 1; + + /* Update counters and check if we need to grow hash */ + cfg->items++; + } + + return (0); +} + +static int +ta_prepare_del_fhash(struct ip_fw_chain *ch, struct tentry_info *tei, + void *ta_buf) +{ + struct ta_buf_fhash *tb; + + tb = (struct ta_buf_fhash *)ta_buf; + + return (tei_to_fhash_ent(tei, &tb->fe6.e)); +} + +static int +ta_del_fhash(void *ta_state, struct table_info *ti, struct tentry_info *tei, + void *ta_buf, uint32_t *pnum) +{ + struct fhash_cfg *cfg; + struct fhashbhead *head; + struct fhashentry *ent, *tmp; + struct ta_buf_fhash *tb; + uint32_t hash; + size_t sz; + + cfg = (struct fhash_cfg *)ta_state; + tb = (struct ta_buf_fhash *)ta_buf; + ent = &tb->fe6.e; + + head = cfg->head; + hash = hash_flow_ent(ent, cfg->size); + + if (tei->subtype == AF_INET) + sz = 2 * sizeof(struct in_addr); + else + sz = 2 * sizeof(struct in6_addr); + + /* Check for existence */ + SLIST_FOREACH(tmp, &head[hash], next) { + if (cmp_flow_ent(tmp, ent, sz) == 0) + continue; + + SLIST_REMOVE(&head[hash], tmp, fhashentry, next); + tei->value = tmp->value; + *pnum = 1; + cfg->items--; + tb->ent_ptr = tmp; + return (0); + } + + return (ENOENT); +} + +static void +ta_flush_fhash_entry(struct ip_fw_chain *ch, struct tentry_info *tei, + void *ta_buf) +{ + struct ta_buf_fhash *tb; + + tb = (struct ta_buf_fhash *)ta_buf; + + if (tb->ent_ptr != NULL) + free(tb->ent_ptr, M_IPFW_TBL); +} + +/* + * Hash growing callbacks. + */ + +static int +ta_need_modify_fhash(void *ta_state, struct table_info *ti, uint32_t count, + uint64_t *pflags) +{ + struct fhash_cfg *cfg; + + cfg = (struct fhash_cfg *)ta_state; + + if (cfg->items > cfg->size && cfg->size < 65536) { + *pflags = cfg->size * 2; + return (1); + } + + return (0); +} + +/* + * Allocate new, larger fhash. 
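+ * (An illustrative trigger sketch follows; it is an editor's addition,
+ * not part of the imported netmap-ipfw sources.)
+ */
+#if 0	/* sketch only, kept out of the build */
+static void
+fhash_grow_check_sketch(void *ta_state, struct table_info *ti)
+{
+	uint64_t pflags;
+
+	/* Doubles the bucket count (capped at 64k) once items outgrow it */
+	if (ta_need_modify_fhash(ta_state, ti, 0, &pflags))
+		printf("grow flow:hash to %ju buckets\n", (uintmax_t)pflags);
+}
+#endif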
+static int
+ta_prepare_mod_fhash(void *ta_buf, uint64_t *pflags)
+{
+	struct mod_item *mi;
+	struct fhashbhead *head;
+	int i;
+
+	mi = (struct mod_item *)ta_buf;
+
+	memset(mi, 0, sizeof(struct mod_item));
+	mi->size = *pflags;
+	head = malloc(sizeof(struct fhashbhead) * mi->size, M_IPFW,
+	    M_WAITOK | M_ZERO);
+	for (i = 0; i < mi->size; i++)
+		SLIST_INIT(&head[i]);
+
+	mi->main_ptr = head;
+
+	return (0);
+}
+
+/*
+ * Copy data from old runtime array to new one.
+ */
+static int
+ta_fill_mod_fhash(void *ta_state, struct table_info *ti, void *ta_buf,
+    uint64_t *pflags)
+{
+
+	/* It is not possible to rehash if we're not holding the WLOCK. */
+	return (0);
+}
+
+/*
+ * Switch old & new arrays.
+ */
+static void
+ta_modify_fhash(void *ta_state, struct table_info *ti, void *ta_buf,
+    uint64_t pflags)
+{
+	struct mod_item *mi;
+	struct fhash_cfg *cfg;
+	struct fhashbhead *old_head, *new_head;
+	struct fhashentry *ent, *ent_next;
+	int i;
+	uint32_t nhash;
+	size_t old_size;
+
+	mi = (struct mod_item *)ta_buf;
+	cfg = (struct fhash_cfg *)ta_state;
+
+	old_size = cfg->size;
+	old_head = ti->state;
+
+	new_head = (struct fhashbhead *)mi->main_ptr;
+	for (i = 0; i < old_size; i++) {
+		SLIST_FOREACH_SAFE(ent, &old_head[i], next, ent_next) {
+			nhash = hash_flow_ent(ent, mi->size);
+			SLIST_INSERT_HEAD(&new_head[nhash], ent, next);
+		}
+	}
+
+	ti->state = new_head;
+	ti->data = mi->size;
+	cfg->head = new_head;
+	cfg->size = mi->size;
+
+	mi->main_ptr = old_head;
+}
+
+/*
+ * Free the unneeded array.
+ */
+static void
+ta_flush_mod_fhash(void *ta_buf)
+{
+	struct mod_item *mi;
+
+	mi = (struct mod_item *)ta_buf;
+	if (mi->main_ptr != NULL)
+		free(mi->main_ptr, M_IPFW);
+}
+
+struct table_algo flow_hash = {
+	.name = "flow:hash",
+	.type = IPFW_TABLE_FLOW,
+	.flags = TA_FLAG_DEFAULT,
+	.ta_buf_size = sizeof(struct ta_buf_fhash),
+	.init = ta_init_fhash,
+	.destroy = ta_destroy_fhash,
+	.prepare_add = ta_prepare_add_fhash,
+	.prepare_del = ta_prepare_del_fhash,
+	.add = ta_add_fhash,
+	.del = ta_del_fhash,
+	.flush_entry = ta_flush_fhash_entry,
+	.foreach = ta_foreach_fhash,
+	.dump_tentry = ta_dump_fhash_tentry,
+	.find_tentry = ta_find_fhash_tentry,
+	.dump_tinfo = ta_dump_fhash_tinfo,
+	.need_modify = ta_need_modify_fhash,
+	.prepare_mod = ta_prepare_mod_fhash,
+	.fill_mod = ta_fill_mod_fhash,
+	.modify = ta_modify_fhash,
+	.flush_mod = ta_flush_mod_fhash,
+};
+
+/*
+ * Kernel fibs bindings.
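+ *
+ * (An illustrative sketch of the "fib=N" option parser follows; it is
+ * an editor's addition, not part of the imported netmap-ipfw sources.)
+ */
+#if 0	/* usage sketch only, kept out of the build */
+static void
+kfib_opts_sketch(void)
+{
+	char data[] = "addr:kfib fib=2";
+	int fib = 0;
+
+	/* kfib_parse_opts() accepts an optional trailing "fib=N" */
+	if (kfib_parse_opts(&fib, data) == 0)
+		printf("using fib %d\n", fib);
+}
+#endif
+/*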
+ * + * Implementation: + * + * Runtime part: + * - fully relies on route API + * - fib number is stored in ti->data + * + */ + +static struct rtentry *lookup_kfib(void *key, int keylen, int fib); +static int ta_lookup_kfib(struct table_info *ti, void *key, uint32_t keylen, + uint32_t *val); +static int kfib_parse_opts(int *pfib, char *data); +static void ta_print_kfib_config(void *ta_state, struct table_info *ti, + char *buf, size_t bufsize); +static int ta_init_kfib(struct ip_fw_chain *ch, void **ta_state, + struct table_info *ti, char *data, uint8_t tflags); +static void ta_destroy_kfib(void *ta_state, struct table_info *ti); +static void ta_dump_kfib_tinfo(void *ta_state, struct table_info *ti, + ipfw_ta_tinfo *tinfo); +static int contigmask(uint8_t *p, int len); +static int ta_dump_kfib_tentry(void *ta_state, struct table_info *ti, void *e, + ipfw_obj_tentry *tent); +static int ta_find_kfib_tentry(void *ta_state, struct table_info *ti, + ipfw_obj_tentry *tent); +static void ta_foreach_kfib(void *ta_state, struct table_info *ti, + ta_foreach_f *f, void *arg); + +static struct rtentry * +lookup_kfib(void *key, int keylen, int fib) +{ + struct sockaddr *s; + + if (keylen == 4) { + struct sockaddr_in sin; + bzero(&sin, sizeof(sin)); + sin.sin_len = sizeof(struct sockaddr_in); + sin.sin_family = AF_INET; + sin.sin_addr.s_addr = *(in_addr_t *)key; + s = (struct sockaddr *)&sin; + } else { + struct sockaddr_in6 sin6; + bzero(&sin6, sizeof(sin6)); + sin6.sin6_len = sizeof(struct sockaddr_in6); + sin6.sin6_family = AF_INET6; + sin6.sin6_addr = *(struct in6_addr *)key; + s = (struct sockaddr *)&sin6; + } + + return (rtalloc1_fib(s, 0, 0, fib)); +} + +static int +ta_lookup_kfib(struct table_info *ti, void *key, uint32_t keylen, + uint32_t *val) +{ + struct rtentry *rte; + + if ((rte = lookup_kfib(key, keylen, ti->data)) == NULL) + return (0); + + *val = 0; + RTFREE_LOCKED(rte); + + return (1); +} + +/* Parse 'fib=%d' */ +static int +kfib_parse_opts(int *pfib, char *data) +{ + char *pdel, *pend, *s; + int fibnum; + + if (data == NULL) + return (0); + if ((pdel = strchr(data, ' ')) == NULL) + return (0); + while (*pdel == ' ') + pdel++; + if (strncmp(pdel, "fib=", 4) != 0) + return (EINVAL); + if ((s = strchr(pdel, ' ')) != NULL) + *s++ = '\0'; + + pdel += 4; + /* Need \d+ */ + fibnum = strtol(pdel, &pend, 10); + if (*pend != '\0') + return (EINVAL); + + *pfib = fibnum; + + return (0); +} + +static void +ta_print_kfib_config(void *ta_state, struct table_info *ti, char *buf, + size_t bufsize) +{ + + if (ti->data != 0) + snprintf(buf, bufsize, "%s fib=%lu", "addr:kfib", ti->data); + else + snprintf(buf, bufsize, "%s", "addr:kfib"); +} + +static int +ta_init_kfib(struct ip_fw_chain *ch, void **ta_state, struct table_info *ti, + char *data, uint8_t tflags) +{ + int error, fibnum; + + fibnum = 0; + if ((error = kfib_parse_opts(&fibnum, data)) != 0) + return (error); + + if (fibnum >= rt_numfibs) + return (E2BIG); + + ti->data = fibnum; + ti->lookup = ta_lookup_kfib; + + return (0); +} + +/* + * Destroys table @ti + */ +static void +ta_destroy_kfib(void *ta_state, struct table_info *ti) +{ + +} + +/* + * Provide algo-specific table info + */ +static void +ta_dump_kfib_tinfo(void *ta_state, struct table_info *ti, ipfw_ta_tinfo *tinfo) +{ + + tinfo->flags = IPFW_TATFLAGS_AFDATA; + tinfo->taclass4 = IPFW_TACLASS_RADIX; + tinfo->count4 = 0; + tinfo->itemsize4 = sizeof(struct rtentry); + tinfo->taclass6 = IPFW_TACLASS_RADIX; + tinfo->count6 = 0; + tinfo->itemsize6 = sizeof(struct rtentry); +} + +static int 
+contigmask(uint8_t *p, int len) +{ + int i, n; + + for (i = 0; i < len ; i++) + if ( (p[i/8] & (1 << (7 - (i%8)))) == 0) /* first bit unset */ + break; + for (n= i + 1; n < len; n++) + if ( (p[n/8] & (1 << (7 - (n % 8)))) != 0) + return (-1); /* mask not contiguous */ + return (i); +} + + +static int +ta_dump_kfib_tentry(void *ta_state, struct table_info *ti, void *e, + ipfw_obj_tentry *tent) +{ + struct rtentry *rte; +#ifdef INET + struct sockaddr_in *addr, *mask; +#endif +#ifdef INET6 + struct sockaddr_in6 *addr6, *mask6; +#endif + int len; + + rte = (struct rtentry *)e; + addr = (struct sockaddr_in *)rt_key(rte); + mask = (struct sockaddr_in *)rt_mask(rte); + len = 0; + + /* Guess IPv4/IPv6 radix by sockaddr family */ +#ifdef INET + if (addr->sin_family == AF_INET) { + tent->k.addr.s_addr = addr->sin_addr.s_addr; + len = 32; + if (mask != NULL) + len = contigmask((uint8_t *)&mask->sin_addr, 32); + if (len == -1) + len = 0; + tent->masklen = len; + tent->subtype = AF_INET; + tent->v.kidx = 0; /* Do we need to put GW here? */ + } +#endif +#ifdef INET6 + if (addr->sin_family == AF_INET6) { + addr6 = (struct sockaddr_in6 *)addr; + mask6 = (struct sockaddr_in6 *)mask; + memcpy(&tent->k, &addr6->sin6_addr, sizeof(struct in6_addr)); + len = 128; + if (mask6 != NULL) + len = contigmask((uint8_t *)&mask6->sin6_addr, 128); + if (len == -1) + len = 0; + tent->masklen = len; + tent->subtype = AF_INET6; + tent->v.kidx = 0; + } +#endif + + return (0); +} + +static int +ta_find_kfib_tentry(void *ta_state, struct table_info *ti, + ipfw_obj_tentry *tent) +{ + struct rtentry *rte; + void *key; + int keylen; + + if (tent->subtype == AF_INET) { + key = &tent->k.addr; + keylen = sizeof(struct in_addr); + } else { + key = &tent->k.addr6; + keylen = sizeof(struct in6_addr); + } + + if ((rte = lookup_kfib(key, keylen, ti->data)) == NULL) + return (0); + + if (rte != NULL) { + ta_dump_kfib_tentry(ta_state, ti, rte, tent); + RTFREE_LOCKED(rte); + return (0); + } + + return (ENOENT); +} + +static void +ta_foreach_kfib(void *ta_state, struct table_info *ti, ta_foreach_f *f, + void *arg) +{ + struct radix_node_head *rnh; + int error; + + rnh = rt_tables_get_rnh(ti->data, AF_INET); + if (rnh != NULL) { + RADIX_NODE_HEAD_RLOCK(rnh); + error = rnh->rnh_walktree(rnh, (walktree_f_t *)f, arg); + RADIX_NODE_HEAD_RUNLOCK(rnh); + } + + rnh = rt_tables_get_rnh(ti->data, AF_INET6); + if (rnh != NULL) { + RADIX_NODE_HEAD_RLOCK(rnh); + error = rnh->rnh_walktree(rnh, (walktree_f_t *)f, arg); + RADIX_NODE_HEAD_RUNLOCK(rnh); + } +} + +struct table_algo addr_kfib = { + .name = "addr:kfib", + .type = IPFW_TABLE_ADDR, + .flags = TA_FLAG_READONLY, + .ta_buf_size = 0, + .init = ta_init_kfib, + .destroy = ta_destroy_kfib, + .foreach = ta_foreach_kfib, + .dump_tentry = ta_dump_kfib_tentry, + .find_tentry = ta_find_kfib_tentry, + .dump_tinfo = ta_dump_kfib_tinfo, + .print_config = ta_print_kfib_config, +}; + +void +ipfw_table_algo_init(struct ip_fw_chain *ch) +{ + size_t sz; + + /* + * Register all algorithms presented here. 
+ */
+	sz = sizeof(struct table_algo);
+	ipfw_add_table_algo(ch, &addr_radix, sz, &addr_radix.idx);
+	ipfw_add_table_algo(ch, &addr_hash, sz, &addr_hash.idx);
+	ipfw_add_table_algo(ch, &iface_idx, sz, &iface_idx.idx);
+	ipfw_add_table_algo(ch, &number_array, sz, &number_array.idx);
+	ipfw_add_table_algo(ch, &flow_hash, sz, &flow_hash.idx);
+	ipfw_add_table_algo(ch, &addr_kfib, sz, &addr_kfib.idx);
+}
+
+void
+ipfw_table_algo_destroy(struct ip_fw_chain *ch)
+{
+
+	ipfw_del_table_algo(ch, addr_radix.idx);
+	ipfw_del_table_algo(ch, addr_hash.idx);
+	ipfw_del_table_algo(ch, iface_idx.idx);
+	ipfw_del_table_algo(ch, number_array.idx);
+	ipfw_del_table_algo(ch, flow_hash.idx);
+	ipfw_del_table_algo(ch, addr_kfib.idx);
+}
+
diff --git a/example/ipfw/sys/netpfil/ipfw/ip_fw_table_value.c b/example/ipfw/sys/netpfil/ipfw/ip_fw_table_value.c
new file mode 100644
index 0000000..4b7b193
--- /dev/null
+++ b/example/ipfw/sys/netpfil/ipfw/ip_fw_table_value.c
@@ -0,0 +1,812 @@
+/*-
+ * Copyright (c) 2014 Yandex LLC
+ * Copyright (c) 2014 Alexander V. Chernikov
+ *
+ * Redistribution and use in source and binary forms, with or without
+ * modification, are permitted provided that the following conditions
+ * are met:
+ * 1. Redistributions of source code must retain the above copyright
+ *    notice, this list of conditions and the following disclaimer.
+ * 2. Redistributions in binary form must reproduce the above copyright
+ *    notice, this list of conditions and the following disclaimer in the
+ *    documentation and/or other materials provided with the distribution.
+ *
+ * THIS SOFTWARE IS PROVIDED BY THE AUTHOR AND CONTRIBUTORS ``AS IS'' AND
+ * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
+ * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
+ * ARE DISCLAIMED. IN NO EVENT SHALL THE AUTHOR OR CONTRIBUTORS BE LIABLE
+ * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
+ * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
+ * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
+ * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
+ * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
+ * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
+ * SUCH DAMAGE.
+ */
+
+#include <sys/cdefs.h>
+__FBSDID("$FreeBSD: head/sys/netpfil/ipfw/ip_fw_table_value.c 272940 2014-10-11 15:04:50Z melifaro $");
+
+/*
+ * Multi-field value support for ipfw tables.
+ *
+ * This file contains necessary functions to convert
+ * large multi-field values into u32 indices suitable to be fed
+ * to various table algorithms. Other machinery like proper refcounting,
+ * internal structures resizing are also kept here.
+ */
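+
+/*
+ * Editor's illustration (not part of the imported sources): distinct
+ * (masked) table values are stored once and refcounted; entries carry
+ * only a u32 index. The legacy opcodes carried a single u32 value,
+ * which the import helper below fans out to every field.
+ */
+#if 0	/* usage sketch only, kept out of the build */
+static uint32_t
+legacy_value_roundtrip_sketch(uint32_t raw)
+{
+	struct table_value v;
+
+	ipfw_import_table_value_legacy(raw, &v);	/* raw -> all fields */
+	return (ipfw_export_table_value_legacy(&v));	/* back to raw (tag) */
+}
+#endif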
+
+#include "opt_ipfw.h"
+
+#include <sys/param.h>
+#include <sys/systm.h>
+#include <sys/malloc.h>
+#include <sys/kernel.h>
+#include <sys/hash.h>
+#include <sys/lock.h>
+#include <sys/rwlock.h>
+#include <sys/rmlock.h>
+#include <sys/socket.h>
+#include <sys/socketvar.h>
+#include <sys/queue.h>
+#include <net/if.h>	/* ip_fw.h requires IFNAMSIZ */
+
+#include <netinet/in.h>
+#include <netinet/ip_var.h>	/* struct ipfw_rule_ref */
+#include <netinet/ip_fw.h>
+
+#include <netpfil/ipfw/ip_fw_private.h>
+#include <netpfil/ipfw/ip_fw_table.h>
+
+static uint32_t hash_table_value(struct namedobj_instance *ni, void *key,
+    uint32_t kopt);
+static int cmp_table_value(struct named_object *no, void *key, uint32_t kopt);
+
+static int list_table_values(struct ip_fw_chain *ch, ip_fw3_opheader *op3,
+    struct sockopt_data *sd);
+
+static struct ipfw_sopt_handler scodes[] = {
+	{ IP_FW_TABLE_VLIST, 0, HDIR_GET, list_table_values },
+};
+
+#define	CHAIN_TO_VI(chain)	(CHAIN_TO_TCFG(chain)->valhash)
+
+struct table_val_link
+{
+	struct named_object no;
+	struct table_value *pval;	/* Pointer to real table value */
+};
+#define	VALDATA_START_SIZE	64	/* Allocate 64-items array by default */
+
+struct vdump_args {
+	struct ip_fw_chain *ch;
+	struct sockopt_data *sd;
+	struct table_value *pval;
+	int error;
+};
+
+
+static uint32_t
+hash_table_value(struct namedobj_instance *ni, void *key, uint32_t kopt)
+{
+
+	return (hash32_buf(key, 56, 0));
+}
+
+static int
+cmp_table_value(struct named_object *no, void *key, uint32_t kopt)
+{
+
+	return (memcmp(((struct table_val_link *)no)->pval, key, 56));
+}
+
+static void
+mask_table_value(struct table_value *src, struct table_value *dst,
+    uint32_t mask)
+{
+#define	_MCPY(f, b)	if ((mask & (b)) != 0) { dst->f = src->f; }
+
+	memset(dst, 0, sizeof(*dst));
+	_MCPY(tag, IPFW_VTYPE_TAG);
+	_MCPY(pipe, IPFW_VTYPE_PIPE);
+	_MCPY(divert, IPFW_VTYPE_DIVERT);
+	_MCPY(skipto, IPFW_VTYPE_SKIPTO);
+	_MCPY(netgraph, IPFW_VTYPE_NETGRAPH);
+	_MCPY(fib, IPFW_VTYPE_FIB);
+	_MCPY(nat, IPFW_VTYPE_NAT);
+	_MCPY(dscp, IPFW_VTYPE_DSCP);
+	_MCPY(nh4, IPFW_VTYPE_NH4);
+	_MCPY(nh6, IPFW_VTYPE_NH6);
+#undef	_MCPY
+}
+
+static void
+get_value_ptrs(struct ip_fw_chain *ch, struct table_config *tc, int vshared,
+    struct table_value **ptv, struct namedobj_instance **pvi)
+{
+	struct table_value *pval;
+	struct namedobj_instance *vi;
+
+	if (vshared != 0) {
+		pval = (struct table_value *)ch->valuestate;
+		vi = CHAIN_TO_VI(ch);
+	} else {
+		pval = NULL;
+		vi = NULL;
+		//pval = (struct table_value *)&tc->ti.data;
+	}
+
+	if (ptv != NULL)
+		*ptv = pval;
+	if (pvi != NULL)
+		*pvi = vi;
+}
+
+/*
+ * Update pointers to real values after @pval change.
+ */
+static void
+update_tvalue(struct namedobj_instance *ni, struct named_object *no, void *arg)
+{
+	struct vdump_args *da;
+	struct table_val_link *ptv;
+	struct table_value *pval;
+
+	da = (struct vdump_args *)arg;
+	ptv = (struct table_val_link *)no;
+
+	pval = da->pval;
+	ptv->pval = &pval[ptv->no.kidx];
+
+}
+
+/*
+ * Grows value storage shared among all tables.
+ * Drops/reacquires UH locks.
+ * Notifies other running adds on @ch shared storage resize.
+ * Note the function does not guarantee that free space
+ * will be available after invocation, so the caller needs
+ * to retry the cycle itself.
+ *
+ * Returns 0 in case of no errors.
+ */ +static int +resize_shared_value_storage(struct ip_fw_chain *ch) +{ + struct tables_config *tcfg; + struct namedobj_instance *vi; + struct table_value *pval, *valuestate, *old_valuestate; + void *new_idx; + struct vdump_args da; + int new_blocks; + int val_size, val_size_old; + + IPFW_UH_WLOCK_ASSERT(ch); + + valuestate = NULL; + new_idx = NULL; + + pval = (struct table_value *)ch->valuestate; + vi = CHAIN_TO_VI(ch); + tcfg = CHAIN_TO_TCFG(ch); + + val_size = tcfg->val_size * 2; + + if (val_size == (1 << 30)) + return (ENOSPC); + + IPFW_UH_WUNLOCK(ch); + + valuestate = malloc(sizeof(struct table_value) * val_size, M_IPFW, + M_WAITOK | M_ZERO); + ipfw_objhash_bitmap_alloc(val_size, (void *)&new_idx, + &new_blocks); + + IPFW_UH_WLOCK(ch); + + /* + * Check if we still need to resize + */ + if (tcfg->val_size >= val_size) + goto done; + + /* Update pointers and notify everyone we're changing @ch */ + pval = (struct table_value *)ch->valuestate; + rollback_toperation_state(ch, ch); + + /* Good. Let's merge */ + memcpy(valuestate, pval, sizeof(struct table_value) * tcfg->val_size); + ipfw_objhash_bitmap_merge(CHAIN_TO_VI(ch), &new_idx, &new_blocks); + + IPFW_WLOCK(ch); + /* Change pointers */ + old_valuestate = ch->valuestate; + ch->valuestate = valuestate; + valuestate = old_valuestate; + ipfw_objhash_bitmap_swap(CHAIN_TO_VI(ch), &new_idx, &new_blocks); + + val_size_old = tcfg->val_size; + tcfg->val_size = val_size; + val_size = val_size_old; + IPFW_WUNLOCK(ch); + /* Update pointers to reflect resize */ + memset(&da, 0, sizeof(da)); + da.pval = (struct table_value *)ch->valuestate; + ipfw_objhash_foreach(vi, update_tvalue, &da); + +done: + free(valuestate, M_IPFW); + ipfw_objhash_bitmap_free(new_idx, new_blocks); + + return (0); +} + +/* + * Drops reference for table value with index @kidx, stored in @pval and + * @vi. Frees value if it has no references. + */ +static void +unref_table_value(struct namedobj_instance *vi, struct table_value *pval, + uint32_t kidx) +{ + struct table_val_link *ptvl; + + KASSERT(pval[kidx].refcnt > 0, ("Refcount is 0 on kidx %d", kidx)); + if (--pval[kidx].refcnt > 0) + return; + + /* Last reference, delete item */ + ptvl = (struct table_val_link *)ipfw_objhash_lookup_kidx(vi, kidx); + KASSERT(ptvl != NULL, ("lookup on value kidx %d failed", kidx)); + ipfw_objhash_del(vi, &ptvl->no); + ipfw_objhash_free_idx(vi, kidx); + free(ptvl, M_IPFW); +} + +struct flush_args { + struct ip_fw_chain *ch; + struct table_algo *ta; + struct table_info *ti; + void *astate; + ipfw_obj_tentry tent; +}; + +static int +unref_table_value_cb(void *e, void *arg) +{ + struct flush_args *fa; + struct ip_fw_chain *ch; + struct table_algo *ta; + ipfw_obj_tentry *tent; + int error; + + fa = (struct flush_args *)arg; + + ta = fa->ta; + memset(&fa->tent, 0, sizeof(fa->tent)); + tent = &fa->tent; + error = ta->dump_tentry(fa->astate, fa->ti, e, tent); + if (error != 0) + return (error); + + ch = fa->ch; + + unref_table_value(CHAIN_TO_VI(ch), + (struct table_value *)ch->valuestate, tent->v.kidx); + + return (0); +} + +/* + * Drop references for each value used in @tc. + */ +void +ipfw_unref_table_values(struct ip_fw_chain *ch, struct table_config *tc, + struct table_algo *ta, void *astate, struct table_info *ti) +{ + struct flush_args fa; + + IPFW_UH_WLOCK_ASSERT(ch); + + memset(&fa, 0, sizeof(fa)); + fa.ch = ch; + fa.ta = ta; + fa.astate = astate; + fa.ti = ti; + + ta->foreach(astate, ti, unref_table_value_cb, &fa); +} + +/* + * Table operation state handler. 
+ * Called when we are going to change something in @tc which + * may lead to inconsistencies in on-going table data addition. + * + * Here we rollback all already committed state (table values, currently) + * and set "modified" field to non-zero value to indicate + * that we need to restart original operation. + */ +void +rollback_table_values(struct tableop_state *ts) +{ + struct ip_fw_chain *ch; + struct table_value *pval; + struct tentry_info *ptei; + struct namedobj_instance *vi; + int i; + + ch = ts->ch; + + IPFW_UH_WLOCK_ASSERT(ch); + + /* Get current table value pointer */ + get_value_ptrs(ch, ts->tc, ts->vshared, &pval, &vi); + + for (i = 0; i < ts->count; i++) { + ptei = &ts->tei[i]; + + if (ptei->value == 0) + continue; + + unref_table_value(vi, pval, ptei->value); + } +} + +/* + * Allocate new value index in either shared or per-table array. + * Function may drop/reacquire UH lock. + * + * Returns 0 on success. + */ +static int +alloc_table_vidx(struct ip_fw_chain *ch, struct tableop_state *ts, + struct namedobj_instance *vi, uint16_t *pvidx) +{ + int error, vlimit; + uint16_t vidx; + + IPFW_UH_WLOCK_ASSERT(ch); + + error = ipfw_objhash_alloc_idx(vi, &vidx); + if (error != 0) { + + /* + * We need to resize array. This involves + * lock/unlock, so we need to check "modified" + * state. + */ + ts->opstate.func(ts->tc, &ts->opstate); + error = resize_shared_value_storage(ch); + return (error); /* ts->modified should be set, we will restart */ + } + + vlimit = ts->ta->vlimit; + if (vlimit != 0 && vidx >= vlimit) { + + /* + * Algorithm is not able to store given index. + * We have to rollback state, start using + * per-table value array or return error + * if we're already using it. + * + * TODO: do not rollback state if + * atomicity is not required. + */ + if (ts->vshared != 0) { + /* shared -> per-table */ + return (ENOSPC); /* TODO: proper error */ + } + + /* per-table. Fail for now. */ + return (ENOSPC); /* TODO: proper error */ + } + + *pvidx = vidx; + return (0); +} + +/* + * Drops value reference for unused values (updates, deletes, partially + * successful adds or rollbacks). + */ +void +ipfw_garbage_table_values(struct ip_fw_chain *ch, struct table_config *tc, + struct tentry_info *tei, uint32_t count, int rollback) +{ + int i; + struct tentry_info *ptei; + struct table_value *pval; + struct namedobj_instance *vi; + + /* + * We have two slightly different ADD cases here: + * either (1) we are successful / partially successful, + * in that case we need + * * to ignore ADDED entries values + * * rollback every other values (either UPDATED since + * old value has been stored there, or some failure like + * EXISTS or LIMIT or simply "ignored" case. + * + * (2): atomic rollback of partially successful operation + * in that case we simply need to unref all entries. + * + * DELETE case is simpler: no atomic support there, so + * we simply unref all non-zero values. + */ + + /* + * Get current table value pointers. + * XXX: Properly read vshared + */ + get_value_ptrs(ch, tc, 1, &pval, &vi); + + for (i = 0; i < count; i++) { + ptei = &tei[i]; + + if (ptei->value == 0) { + + /* + * We may be deleting non-existing record. + * Skip. + */ + continue; + } + + if ((ptei->flags & TEI_FLAGS_ADDED) != 0 && rollback == 0) { + ptei->value = 0; + continue; + } + + unref_table_value(vi, pval, ptei->value); + ptei->value = 0; + } +} + +/* + * Main function used to link values of entries going to be added, + * to the index. 
Since we may perform many UH lock drops/acquires,
+ * handle changes by checking the tablestate "modified" field.
+ *
+ * Success: return 0.
+ */
+int
+ipfw_link_table_values(struct ip_fw_chain *ch, struct tableop_state *ts)
+{
+	int error, i, found;
+	struct namedobj_instance *vi;
+	struct table_config *tc;
+	struct tentry_info *tei, *ptei;
+	uint32_t count, vlimit;
+	uint16_t vidx;
+	struct table_val_link *ptv;
+	struct table_value tval, *pval;
+
+	/*
+	 * Stage 1: reference all existing values and
+	 * save their indices.
+	 */
+	IPFW_UH_WLOCK_ASSERT(ch);
+	get_value_ptrs(ch, ts->tc, ts->vshared, &pval, &vi);
+
+	error = 0;
+	found = 0;
+	vlimit = ts->ta->vlimit;
+	vidx = 0;
+	tc = ts->tc;
+	tei = ts->tei;
+	count = ts->count;
+	for (i = 0; i < count; i++) {
+		ptei = &tei[i];
+		ptei->value = 0;	/* Ensure value is always 0 in the beginning */
+		mask_table_value(ptei->pvalue, &tval, ts->vmask);
+		ptv = (struct table_val_link *)ipfw_objhash_lookup_name(vi, 0,
+		    (char *)&tval);
+		if (ptv == NULL)
+			continue;
+		/* Deal with vlimit later */
+		if (vlimit > 0 && vlimit <= ptv->no.kidx)
+			continue;
+
+		/* Value found. Bump refcount */
+		ptv->pval->refcnt++;
+		ptei->value = ptv->no.kidx;
+		found++;
+	}
+
+	if (ts->count == found) {
+		/* We've found all values, no need to create new ones */
+		return (0);
+	}
+
+	/*
+	 * We have added some state here, let's attach operation
+	 * state to the list to be able to rollback if necessary.
+	 */
+	add_toperation_state(ch, ts);
+	/* Ensure table won't disappear */
+	tc_ref(tc);
+	IPFW_UH_WUNLOCK(ch);
+
+	/*
+	 * Stage 2: allocate objects for non-existing values.
+	 */
+	for (i = 0; i < count; i++) {
+		ptei = &tei[i];
+		if (ptei->value != 0)
+			continue;
+		if (ptei->ptv != NULL)
+			continue;
+		ptei->ptv = malloc(sizeof(struct table_val_link), M_IPFW,
+		    M_WAITOK | M_ZERO);
+	}
+
+	/*
+	 * Stage 3: allocate index numbers for new values
+	 * and link them to index.
+	 */
+	IPFW_UH_WLOCK(ch);
+	tc_unref(tc);
+	del_toperation_state(ch, ts);
+	if (ts->modified != 0) {
+
+		/*
+		 * In general, we should free all state/indexes here
+		 * and return. However, we keep allocated state instead
+		 * to ensure we achieve some progress on each restart.
+		 */
+		return (0);
+	}
+
+	KASSERT(pval == ch->valuestate, ("resize_storage() notify failure"));
+
+	/* Let's try to link values */
+	for (i = 0; i < count; i++) {
+		ptei = &tei[i];
+		if (ptei->value != 0) {
+
+			/*
+			 * We may be here after several process restarts,
+			 * so we need to update all fields that might
+			 * have changed.
+			 */
+			ptv = (struct table_val_link *)ptei->ptv;
+			ptv->pval = &pval[i];
+			continue;
+		}
+
+		/* Check if record has appeared */
+		mask_table_value(ptei->pvalue, &tval, ts->vmask);
+		ptv = (struct table_val_link *)ipfw_objhash_lookup_name(vi, 0,
+		    (char *)&tval);
+		if (ptv != NULL) {
+			ptv->pval->refcnt++;
+			ptei->value = ptv->no.kidx;
+			continue;
+		}
+
+		/* May perform UH unlock/lock */
+		error = alloc_table_vidx(ch, ts, vi, &vidx);
+		if (error != 0) {
+			ts->opstate.func(ts->tc, &ts->opstate);
+			return (error);
+		}
+		/* value storage resize has happened, return */
+		if (ts->modified != 0)
+			return (0);
+
+		/* Finally, we have allocated a valid index, let's add the entry */
+		ptei->value = vidx;
+		ptv = (struct table_val_link *)ptei->ptv;
+		ptei->ptv = NULL;
+
+		ptv->no.kidx = vidx;
+		ptv->no.name = (char *)&pval[vidx];
+		ptv->pval = &pval[vidx];
+		memcpy(ptv->pval, &tval, sizeof(struct table_value));
+		pval[vidx].refcnt = 1;
+		ipfw_objhash_add(vi, &ptv->no);
+	}
+
+	return (0);
+}
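+
+/*
+ * Editor's illustration (not part of the imported sources): @vmask decides
+ * which fields of a userland-supplied value survive into the stored copy;
+ * mask_table_value() zeroes everything else, so two values differing only
+ * in unmasked fields dedup to the same index.
+ */
+#if 0	/* usage sketch only, kept out of the build */
+static void
+vmask_sketch(struct table_value *src)
+{
+	struct table_value dst;
+
+	/* Keep only the fib and nat fields; all others read back as 0 */
+	mask_table_value(src, &dst, IPFW_VTYPE_FIB | IPFW_VTYPE_NAT);
+}
+#endif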
+
+/*
+ * Compatibility function used to import data from old
+ * IP_FW_TABLE_ADD / IP_FW_TABLE_XADD opcodes.
+ */
+void
+ipfw_import_table_value_legacy(uint32_t value, struct table_value *v)
+{
+
+	memset(v, 0, sizeof(*v));
+	v->tag = value;
+	v->pipe = value;
+	v->divert = value;
+	v->skipto = value;
+	v->netgraph = value;
+	v->fib = value;
+	v->nat = value;
+	v->nh4 = value;	/* host format */
+	v->dscp = value;
+	v->limit = value;
+}
+
+/*
+ * Export data to legacy table dumps opcodes.
+ */
+uint32_t
+ipfw_export_table_value_legacy(struct table_value *v)
+{
+
+	/*
+	 * TODO: provide more compatibility depending on
+	 * vmask value.
+	 */
+	return (v->tag);
+}
+
+/*
+ * Imports table value from current userland format.
+ * Saves value in kernel format to the same place.
+ */
+void
+ipfw_import_table_value_v1(ipfw_table_value *iv)
+{
+	struct table_value v;
+
+	memset(&v, 0, sizeof(v));
+	v.tag = iv->tag;
+	v.pipe = iv->pipe;
+	v.divert = iv->divert;
+	v.skipto = iv->skipto;
+	v.netgraph = iv->netgraph;
+	v.fib = iv->fib;
+	v.nat = iv->nat;
+	v.dscp = iv->dscp;
+	v.nh4 = iv->nh4;
+	v.nh6 = iv->nh6;
+	v.limit = iv->limit;
+
+	memcpy(iv, &v, sizeof(ipfw_table_value));
+}
+
+/*
+ * Export real table value @v to current userland format.
+ * Note that @v and @piv may point to the same memory.
+ */
+void
+ipfw_export_table_value_v1(struct table_value *v, ipfw_table_value *piv)
+{
+	ipfw_table_value iv;
+
+	memset(&iv, 0, sizeof(iv));
+	iv.tag = v->tag;
+	iv.pipe = v->pipe;
+	iv.divert = v->divert;
+	iv.skipto = v->skipto;
+	iv.netgraph = v->netgraph;
+	iv.fib = v->fib;
+	iv.nat = v->nat;
+	iv.dscp = v->dscp;
+	iv.limit = v->limit;
+	iv.nh4 = v->nh4;
+	iv.nh6 = v->nh6;
+
+	memcpy(piv, &iv, sizeof(iv));
+}
+
+/*
+ * Exports real value data into ipfw_table_value structure.
+ * Utilizes the "spare1" field to store the kernel index.
+ */
+static void
+dump_tvalue(struct namedobj_instance *ni, struct named_object *no, void *arg)
+{
+	struct vdump_args *da;
+	struct table_val_link *ptv;
+	struct table_value *v;
+
+	da = (struct vdump_args *)arg;
+	ptv = (struct table_val_link *)no;
+
+	v = (struct table_value *)ipfw_get_sopt_space(da->sd, sizeof(*v));
+	/* Out of memory, returning */
+	if (v == NULL) {
+		da->error = ENOMEM;
+		return;
+	}
+
+	memcpy(v, ptv->pval, sizeof(*v));
+	v->spare1 = ptv->no.kidx;
+}
+
+/*
+ * Dumps all shared/table value data
+ * Data layout (v1)(current):
+ * Request: [ ipfw_obj_lheader ], size = ipfw_obj_lheader.size
+ * Reply: [ ipfw_obj_lheader ipfw_table_value x N ]
+ *
+ * Returns 0 on success
+ */
+static int
+list_table_values(struct ip_fw_chain *ch, ip_fw3_opheader *op3,
+    struct sockopt_data *sd)
+{
+	struct _ipfw_obj_lheader *olh;
+	struct namedobj_instance *vi;
+	struct vdump_args da;
+	uint32_t count, size;
+
+	olh = (struct _ipfw_obj_lheader *)ipfw_get_sopt_header(sd, sizeof(*olh));
+	if (olh == NULL)
+		return (EINVAL);
+	if (sd->valsize < olh->size)
+		return (EINVAL);
+
+	IPFW_UH_RLOCK(ch);
+	vi = CHAIN_TO_VI(ch);
+
+	count = ipfw_objhash_count(vi);
+	size = count * sizeof(ipfw_table_value) + sizeof(ipfw_obj_lheader);
+
+	/* Fill in header regardless of buffer size */
+	olh->count = count;
+	olh->objsize = sizeof(ipfw_table_value);
+
+	if (size > olh->size) {
+		olh->size = size;
+		IPFW_UH_RUNLOCK(ch);
+		return (ENOMEM);
+	}
+	olh->size = size;
+
+	/*
+	 * Do the actual value dump
+	 */
+	memset(&da, 0, sizeof(da));
+	da.ch = ch;
+	da.sd = sd;
+	ipfw_objhash_foreach(vi, dump_tvalue, &da);
+
+	IPFW_UH_RUNLOCK(ch);
+
+	return (0);
+}
+
+void
+ipfw_table_value_init(struct ip_fw_chain *ch, int first)
+{
+	struct tables_config *tcfg;
+
+	ch->valuestate = malloc(VALDATA_START_SIZE * sizeof(struct table_value),
+	    M_IPFW, M_WAITOK | M_ZERO);
+
+	tcfg = ch->tblcfg;
+
+	tcfg->val_size = VALDATA_START_SIZE;
+	tcfg->valhash = ipfw_objhash_create(tcfg->val_size);
+	ipfw_objhash_set_funcs(tcfg->valhash, hash_table_value,
+	    cmp_table_value);
+
+	IPFW_ADD_SOPT_HANDLER(first, scodes);
+}
+
+static void
+destroy_value(struct namedobj_instance *ni, struct named_object *no,
+    void *arg)
+{
+
+	free(no, M_IPFW);
+}
+
+void
+ipfw_table_value_destroy(struct ip_fw_chain *ch, int last)
+{
+
+	IPFW_DEL_SOPT_HANDLER(last, scodes);
+
+	free(ch->valuestate, M_IPFW);
+	ipfw_objhash_foreach(CHAIN_TO_VI(ch), destroy_value, ch);
+	ipfw_objhash_destroy(CHAIN_TO_VI(ch));
+}
+
diff --git a/example/ipfw/sys/sys/fnv_hash.h b/example/ipfw/sys/sys/fnv_hash.h
new file mode 100644
index 0000000..f070e6e
--- /dev/null
+++ b/example/ipfw/sys/sys/fnv_hash.h
@@ -0,0 +1,71 @@
+/*-
+ * Fowler / Noll / Vo Hash (FNV Hash)
+ * http://www.isthe.com/chongo/tech/comp/fnv/
+ *
+ * This is an implementation of the algorithms posted above.
+ * This file is placed in the public domain by Peter Wemm.
+ *
+ * $FreeBSD: head/sys/sys/fnv_hash.h 268351 2014-07-07 00:27:09Z marcel $
+ */
+#ifndef _SYS_FNV_HASH_H_
+#define	_SYS_FNV_HASH_H_
+
+typedef u_int32_t Fnv32_t;
+typedef u_int64_t Fnv64_t;
+
+#define	FNV1_32_INIT	((Fnv32_t) 33554467UL)
+#define	FNV1_64_INIT	((Fnv64_t) 0xcbf29ce484222325ULL)
+
+#define	FNV_32_PRIME	((Fnv32_t) 0x01000193UL)
+#define	FNV_64_PRIME	((Fnv64_t) 0x100000001b3ULL)
+
+static __inline Fnv32_t
+fnv_32_buf(const void *buf, size_t len, Fnv32_t hval)
+{
+	const u_int8_t *s = (const u_int8_t *)buf;
+
+	while (len-- != 0) {
+		hval *= FNV_32_PRIME;
+		hval ^= *s++;
+	}
+	return hval;
+}
+
+static __inline Fnv32_t
+fnv_32_str(const char *str, Fnv32_t hval)
+{
+	const u_int8_t *s = (const u_int8_t *)str;
+	Fnv32_t c;
+
+	while ((c = *s++) != 0) {
+		hval *= FNV_32_PRIME;
+		hval ^= c;
+	}
+	return hval;
+}
+
+static __inline Fnv64_t
+fnv_64_buf(const void *buf, size_t len, Fnv64_t hval)
+{
+	const u_int8_t *s = (const u_int8_t *)buf;
+
+	while (len-- != 0) {
+		hval *= FNV_64_PRIME;
+		hval ^= *s++;
+	}
+	return hval;
+}
+
+static __inline Fnv64_t
+fnv_64_str(const char *str, Fnv64_t hval)
+{
+	const u_int8_t *s = (const u_int8_t *)str;
+	u_register_t c;	/* 32 bit on i386, 64 bit on alpha */
+
+	while ((c = *s++) != 0) {
+		hval *= FNV_64_PRIME;
+		hval ^= c;
+	}
+	return hval;
+}
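+
+#if 0	/* usage sketch only, kept out of the build (editor's addition) */
+/*
+ * Illustrative: hashing a string with the 32-bit FNV-1 routine above.
+ * Chaining works by feeding the previous hval back in.
+ */
+static __inline Fnv32_t
+fnv_name_hash_sketch(const char *name)
+{
+	return (fnv_32_str(name, FNV1_32_INIT));
+}
+#endif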
+#endif /* _SYS_FNV_HASH_H_ */
diff --git a/example/ipfw/sys/sys/hash.h b/example/ipfw/sys/sys/hash.h
new file mode 100644
index 0000000..bd8fa69
--- /dev/null
+++ b/example/ipfw/sys/sys/hash.h
@@ -0,0 +1,133 @@
+/*-
+ * Copyright (c) 2001 Tobias Weingartner
+ * All rights reserved.
+ *
+ * Redistribution and use in source and binary forms, with or without
+ * modification, are permitted provided that the following conditions
+ * are met:
+ * 1. Redistributions of source code must retain the above copyright
+ *    notice, this list of conditions and the following disclaimer.
+ * 2. Redistributions in binary form must reproduce the above copyright
+ *    notice, this list of conditions and the following disclaimer in the
+ *    documentation and/or other materials provided with the distribution.
+ *
+ * THIS SOFTWARE IS PROVIDED BY THE AUTHOR ``AS IS'' AND ANY EXPRESS OR
+ * IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES
+ * OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED.
+ * IN NO EVENT SHALL THE AUTHOR BE LIABLE FOR ANY DIRECT, INDIRECT,
+ * INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT
+ * NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
+ * DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
+ * THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+ * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF
+ * THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+ *
+ * $OpenBSD: hash.h,v 1.4 2004/05/25 18:37:23 jmc Exp $
+ * $FreeBSD: head/sys/sys/hash.h 272906 2014-10-10 19:26:26Z gnn $
+ */
+
+#ifndef _SYS_HASH_H_
+#define	_SYS_HASH_H_
+#include <sys/types.h>
+
+/* Convenience */
+#ifndef	HASHINIT
+#define	HASHINIT	5381
+#define	HASHSTEP(x,c)	(((x << 5) + x) + (c))
+#endif
+
+/*
+ * Return a 32-bit hash of the given buffer. The init
+ * value should be 0, or the previous hash value to extend
+ * the previous hash.
+ */
+static __inline uint32_t
+hash32_buf(const void *buf, size_t len, uint32_t hash)
+{
+	const unsigned char *p = buf;
+
+	while (len--)
+		hash = HASHSTEP(hash, *p++);
+
+	return hash;
+}
+
+/*
+ * Return a 32-bit hash of the given string.
+ */ +static __inline uint32_t +hash32_str(const void *buf, uint32_t hash) +{ + const unsigned char *p = buf; + + while (*p) + hash = HASHSTEP(hash, *p++); + + return hash; +} + +/* + * Return a 32-bit hash of the given string, limited by N. + */ +static __inline uint32_t +hash32_strn(const void *buf, size_t len, uint32_t hash) +{ + const unsigned char *p = buf; + + while (*p && len--) + hash = HASHSTEP(hash, *p++); + + return hash; +} + +/* + * Return a 32-bit hash of the given string terminated by C, + * (as well as 0). This is mainly here as a helper for the + * namei() hashing of path name parts. + */ +static __inline uint32_t +hash32_stre(const void *buf, int end, const char **ep, uint32_t hash) +{ + const unsigned char *p = buf; + + while (*p && (*p != end)) + hash = HASHSTEP(hash, *p++); + + if (ep) + *ep = (const char *)p; + + return hash; +} + +/* + * Return a 32-bit hash of the given string, limited by N, + * and terminated by C (as well as 0). This is mainly here + * as a helper for the namei() hashing of path name parts. + */ +static __inline uint32_t +hash32_strne(const void *buf, size_t len, int end, const char **ep, + uint32_t hash) +{ + const unsigned char *p = buf; + + while (*p && (*p != end) && len--) + hash = HASHSTEP(hash, *p++); + + if (ep) + *ep = (const char *)p; + + return hash; +} + +#ifdef _KERNEL +/* + * Hashing function from Bob Jenkins. Implementation in libkern/jenkins_hash.c. + */ +uint32_t jenkins_hash(const void *, size_t, uint32_t); +uint32_t jenkins_hash32(const uint32_t *, size_t, uint32_t); + +uint32_t murmur3_aligned_32(const void *data, size_t len, uint32_t seed); + +#endif /* _KERNEL */ + +#endif /* !_SYS_HASH_H_ */