buildEntryData() unconditionally sends IPSET_ATTR_PROTO (defaulting
to TCP) whenever Port is set. For bitmap:port ipsets, this protocol
attribute is not expected and newer kernels reject it with "invalid
protocol".
Decouple Protocol and Port attributes: only send IPSET_ATTR_PROTO
when entry.Protocol is explicitly set. This matches the behavior of
the ipset CLI tool, which does not send protocol for bitmap:port.
Note: callers using hash:ip,port or hash:net,port,net must now
explicitly set entry.Protocol, as the implicit TCP default is removed.
Also add a bitmap:port test case to TestIpsetCreateListAddDelDestroy
and fix the port/protocol assertion to handle nil Protocol.
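The decoupling described above can be sketched roughly as follows. This is an illustration only, not the library's buildEntryData: the `entry` type and the `attrsFor` helper are hypothetical, and attributes are represented by name rather than serialized.

```go
package main

import "fmt"

// entry is a hypothetical, trimmed-down stand-in for an ipset entry.
// Pointer fields distinguish "unset" (nil) from an explicit value.
type entry struct {
	Port     *uint16
	Protocol *uint8
}

// attrsFor returns the names of the attributes that would be sent for e.
// The protocol attribute is emitted only when Protocol is explicitly set,
// independently of whether Port is set.
func attrsFor(e entry) []string {
	var attrs []string
	if e.Protocol != nil {
		attrs = append(attrs, "IPSET_ATTR_PROTO")
	}
	if e.Port != nil {
		attrs = append(attrs, "IPSET_ATTR_PORT")
	}
	return attrs
}

func main() {
	port := uint16(80)
	tcp := uint8(6) // IPPROTO_TCP

	// bitmap:port style entry: port only, no protocol attribute.
	fmt.Println(attrsFor(entry{Port: &port}))
	// hash:ip,port style entry: the caller sets the protocol explicitly.
	fmt.Println(attrsFor(entry{Port: &port, Protocol: &tcp}))
}
```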
Fixes #1054
The addition of ErrDumpInterrupted forced all callers to deal explicitly with
interrupted dumps, while typically the caller should simply retry.
This commit adds a retry system that makes the library retry up to 10 times
before finally giving up with ErrDumpInterrupted. This should lead to fewer
surprised users.
It can be enabled by calling Handle.RetryInterrupted.
Signed-off-by: Timo Beckers <timo@incline.eu>
When a rule has Src or Dst set to 0.0.0.0/0 (or ::/0), the kernel
omits the FRA_SRC/FRA_DST attribute since the prefix length is 0,
leaving rule.Src/rule.Dst as nil. The filter logic treated nil as
"not matching", so these rules were always incorrectly filtered out.
Fixes #1080
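One way to treat an omitted (nil) rule prefix as a zero-length prefix when filtering is sketched below. `isDefault` and `prefixMatches` are hypothetical helpers, and the real filter code may be structured differently.

```go
package main

import (
	"fmt"
	"net"
)

// mustCIDR parses s and panics on error; it keeps the example short.
func mustCIDR(s string) *net.IPNet {
	_, n, err := net.ParseCIDR(s)
	if err != nil {
		panic(err)
	}
	return n
}

// isDefault reports whether n is a zero-length prefix such as
// 0.0.0.0/0 or ::/0.
func isDefault(n *net.IPNet) bool {
	if n == nil {
		return false
	}
	ones, bits := n.Mask.Size()
	return bits != 0 && ones == 0
}

// prefixMatches reports whether a rule prefix satisfies a filter prefix.
// rule is nil when the kernel omitted the attribute, which is treated here
// as equivalent to a zero-length prefix.
func prefixMatches(filter, rule *net.IPNet) bool {
	if rule == nil {
		return filter == nil || isDefault(filter)
	}
	if filter == nil {
		return false
	}
	return filter.String() == rule.String()
}

func main() {
	fmt.Println(prefixMatches(mustCIDR("0.0.0.0/0"), nil))
	fmt.Println(prefixMatches(mustCIDR("10.0.0.0/8"), nil))
}
```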
When a link is provided, AddrList was doing a full NLM_F_DUMP of all
addresses and filtering on the client side. On systems with many interfaces
this is wasteful and prone to NLM_F_DUMP_INTR errors.
Set ifa_index in the request message so the kernel can do the filtering.
Kernel-side filtering requires NETLINK_GET_STRICT_CHK to be enabled
on the handle (Handle.SetStrictCheck). Without it, the kernel ignores
the index field and the existing userspace filter serves as fallback.
Signed-off-by: Patryk Diak <pdiak@redhat.com>
SocketGet uses NLM_F_DUMP and already documents that it may return
ErrDumpInterrupted, but the implementation treats it as a hard failure,
discarding any partial results. When the dump is interrupted but a
matching socket was found in the partial results, the function should
still return it alongside the ErrDumpInterrupted error.
When no messages are returned and the error is ErrDumpInterrupted,
propagate the original error instead of the generic "no message nor
error from netlink" message, giving callers a proper signal to retry.
Follow-up to 084abd93d3 ("Add support for NLM_F_DUMP_INTR")
Signed-off-by: Ihar Hrachyshka <ihrachyshka@nvidia.com>
Assisted-By: Claude Opus 4.6 <noreply@anthropic.com>
RdmaResourceList uses NLM_F_DUMP but does not handle ErrDumpInterrupted,
unlike other dump functions in the library. When a netlink dump is
interrupted by concurrent operations, the kernel sets NLM_F_DUMP_INTR
and the Execute call returns ErrDumpInterrupted. Without this fix,
RdmaResourceList treats the interrupted dump as a hard failure and
discards any partial results.
Apply the same pattern used by other dump functions: filter out
ErrDumpInterrupted from the fatal error path and propagate it alongside
partial results so callers can decide whether to retry.
When the dump is interrupted and no messages are returned, propagate
ErrDumpInterrupted instead of the generic "No valid response from
kernel" error, giving callers a proper signal to retry.
Follow-up to 084abd93d3 ("Add support for NLM_F_DUMP_INTR")
Signed-off-by: Ihar Hrachyshka <ihrachyshka@nvidia.com>
Assisted-By: Claude Opus 4.6 <noreply@anthropic.com>
FouList and GTPPDPList check `err` instead of `executeErr` in the
ErrDumpInterrupted guard. In both functions, `err` refers to an earlier
call (FouFamilyId / GenlFamilyGet) that already succeeded, so it is
always nil at this point. This means `!errors.Is(err,
ErrDumpInterrupted)` is always true, and any executeErr — including
ErrDumpInterrupted — causes the function to return nil results.
This is the same class of bug as the RouteListFiltered issue: results
collected from the dump are discarded on ErrDumpInterrupted instead of
being returned to the caller.
Fix by checking the correct variable.
Signed-off-by: Ihar Hrachyshka <ihrachyshka@nvidia.com>
Assisted-By: Claude Opus 4.6 <noreply@anthropic.com>
Commit 084abd93d3 ("Add ErrDumpInterrupted") updated dump functions
to return partial results alongside ErrDumpInterrupted, giving callers
the option to use them. However, Handle.RouteListFiltered was missed:
it collects results via RouteListFilteredIter callback but discards
them when the iterator returns any error, including ErrDumpInterrupted.
This is inconsistent with other dump functions like AddrList, NeighList,
and LinkList which return partial results alongside the error. The doc
comment on RouteListFiltered already states "If the returned error is
[ErrDumpInterrupted], results may be inconsistent or incomplete",
implying results should be returned. Fix the function to match this
contract.
Since RouteList and the package-level RouteListFiltered both delegate
to Handle.RouteListFiltered, they were also affected.
Signed-off-by: Ihar Hrachyshka <ihrachyshka@nvidia.com>
Assisted-By: Claude Opus 4.6 <noreply@anthropic.com>
Use binary.NativeEndian, which the Go standard library's encoding/binary
package has provided since Go 1.21.
Comparing binary.NativeEndian to binary.{Big,Little}Endian doesn't work,
though, so use golang.org/x/sys/cpu.IsBigEndian instead, as suggested by
https://github.com/golang/go/issues/57237.
Since Linux v6.13, or commit 83dfce38c49f ("xfrm: Restrict percpu SA attribute to
specific netlink message types"), SA_PCPU is only allowed in MSG_NEWSA, UPDSA,
ALLOCSPI, and ACQUIRE. For all other commands, it must be unset.
Also fix the dereference of *sa.Pcpunum in XfrmState.String(), which would panic
when the field is nil.
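The nil guard for an optional pointer field in a String() method might look like the sketch below. The `state` type is a hypothetical, much-reduced stand-in for XfrmState, and the output format is an assumption.

```go
package main

import "fmt"

// state is a hypothetical stand-in with one optional (pointer) field.
type state struct {
	Spi     int
	Pcpunum *uint32
}

// String formats s without dereferencing Pcpunum when it is nil.
func (s state) String() string {
	pcpu := "<nil>"
	if s.Pcpunum != nil {
		pcpu = fmt.Sprintf("%d", *s.Pcpunum)
	}
	return fmt.Sprintf("Spi: 0x%x, Pcpunum: %s", s.Spi, pcpu)
}

func main() {
	n := uint32(3)
	fmt.Println(state{Spi: 1})
	fmt.Println(state{Spi: 1, Pcpunum: &n})
}
```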
Signed-off-by: Timo Beckers <timo@incline.eu>
Add a new ConntrackDelete() function that operates directly on flows,
like the ConntrackCreate() and ConntrackUpdate() functions.
We already have ConntrackDeleteFilters(), which is very useful for batching
operations and expressing intent through filter matches, but a function
that operates on flows allows much more complex filtering without
requiring an additional filter abstraction.
Signed-off-by: Antonio Ojea <aojea@google.com>
The decode function was using swapped constants for SkipSw and SkipHw:
- skipSw was using TCA_CLS_FLAGS_SKIP_HW (wrong)
- skipHw was using TCA_CLS_FLAGS_SKIP_SW (wrong)
This caused SkipSw and SkipHw to have incorrect values when reading
flower filters from the kernel, breaking hardware offload detection.
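The corrected mapping can be sketched as below. The bit values are taken from the kernel's pkt_cls.h as I understand it (SKIP_HW is bit 0, SKIP_SW is bit 1). The Go constant names and `decodeSkipFlags` are hypothetical.

```go
package main

import "fmt"

const (
	tcaClsFlagsSkipHW uint32 = 1 << 0 // TCA_CLS_FLAGS_SKIP_HW
	tcaClsFlagsSkipSW uint32 = 1 << 1 // TCA_CLS_FLAGS_SKIP_SW
)

// decodeSkipFlags maps each flag bit to the boolean of the same name.
func decodeSkipFlags(flags uint32) (skipHw, skipSw bool) {
	skipHw = flags&tcaClsFlagsSkipHW != 0
	skipSw = flags&tcaClsFlagsSkipSW != 0
	return skipHw, skipSw
}

func main() {
	// A filter installed with skip_sw only.
	hw, sw := decodeSkipFlags(tcaClsFlagsSkipSW)
	fmt.Println("skipHw:", hw, "skipSw:", sw)
}
```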
TestRdmaGetRdmaLink, TestRdmaSetRdmaLinkName, and TestRdmaLinkSetNsFd
hardcode the RDMA device name "foo". When ib_core is loaded but no
device named "foo" exists, these tests fail instead of skipping.
Replace the hardcoded name with a helper that lists available RDMA
devices and uses the first one, or skips the test if none are found.
Also skip TestRdmaLinkSetNsFd when switching to exclusive netns mode
fails with "device or resource busy", which happens when RDMA devices
are actively in use on the CI runner.
Signed-off-by: Ihar Hrachyshka <ihrachyshka@nvidia.com>
Assisted-By: Claude Opus 4.6 <noreply@anthropic.com>
Add BridgeVlanTunnelShowDev(link) to retrieve VLAN-to-tunnel-ID mappings
for a specific device, equivalent to `bridge vlan tunnelshow dev DEV`.
The existing BridgeVlanTunnelShow() returns a flat []TunnelInfo for all
bridge ports. Since TunnelInfo only contains {TunId, Vid} without an
ifindex, callers cannot determine which device each mapping belongs to.
This forces users to shell out to `bridge -j vlan tunnelshow dev <name>`
and parse JSON instead.
The internal bridgeVlanTunnelShowBy(ifindex) helper
performs a full dump and filters client-side by ifindex (matching how
iproute2 implements `bridge vlan tunnelshow dev DEV`). When ifindex is 0,
all devices are returned.
BridgeVlanTunnelShow() is refactored to use the same helper with no
behavior change (ifindex=0 dumps all devices).
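The client-side filtering step can be pictured as follows. `tunnelEntry` and `filterByIfindex` are hypothetical. In particular, the real TunnelInfo carries no ifindex, so the pairing shown here is an assumption about how the dump could be represented while filtering.

```go
package main

import "fmt"

// tunnelEntry is a hypothetical record pairing a VLAN-to-tunnel mapping
// with the ifindex of the device it was reported for.
type tunnelEntry struct {
	Ifindex int32
	TunId   uint32
	Vid     uint16
}

// filterByIfindex keeps the entries reported for ifindex.
// An ifindex of 0 selects every device.
func filterByIfindex(all []tunnelEntry, ifindex int32) []tunnelEntry {
	if ifindex == 0 {
		return all
	}
	var out []tunnelEntry
	for _, e := range all {
		if e.Ifindex == ifindex {
			out = append(out, e)
		}
	}
	return out
}

func main() {
	dump := []tunnelEntry{
		{Ifindex: 3, TunId: 1000, Vid: 10},
		{Ifindex: 4, TunId: 2000, Vid: 20},
		{Ifindex: 3, TunId: 1001, Vid: 11},
	}
	fmt.Println(len(filterByIfindex(dump, 3)))
	fmt.Println(len(filterByIfindex(dump, 0)))
}
```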
Signed-off-by: Matteo Dallaglio <mdallagl@redhat.com>
Implement bridge VNI filtering operations (add/delete/list) for VXLAN
devices with VniFilter enabled. This allows managing VNI filters via
RTM_NEWTUNNEL/RTM_DELTUNNEL/RTM_GETTUNNEL netlink messages.
Equivalent to iproute2 bridge commands:
- bridge vni add dev DEV vni VNI[-VNIEND]
- bridge vni del dev DEV vni VNI[-VNIEND]
- bridge vni show
Reference implementation:
https://github.com/iproute2/iproute2/blob/main/bridge/vni.c
Signed-off-by: Patryk Diak <pdiak@redhat.com>
TCA_FLOWER_FLAGS was incorrectly encoded using htonl() (big-endian),
but netlink NLA_U32 attributes expect native endian. On little-endian
systems (x86), this caused skip_hw/skip_sw flags to be byte-swapped,
resulting in EINVAL or the flags being silently ignored.
Use nl.Uint32Attr() which correctly encodes in native endian, matching
the behavior of the tc CLI tool.
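The effect of the byte swap can be demonstrated without a kernel. The sketch below simulates a little-endian host explicitly so the result is the same on any machine. `encodeBE`, `encodeLE`, and `decodeLE` are hypothetical helpers, not the library's nl.Uint32Attr.

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// encodeBE mimics htonl(): it always writes v in big-endian order.
func encodeBE(v uint32) []byte {
	b := make([]byte, 4)
	binary.BigEndian.PutUint32(b, v)
	return b
}

// encodeLE writes v the way a little-endian host stores a native uint32.
func encodeLE(v uint32) []byte {
	b := make([]byte, 4)
	binary.LittleEndian.PutUint32(b, v)
	return b
}

// decodeLE reads a 4-byte attribute payload as a little-endian host would.
func decodeLE(b []byte) uint32 {
	return binary.LittleEndian.Uint32(b)
}

func main() {
	const skipSw = uint32(1 << 1) // TCA_CLS_FLAGS_SKIP_SW

	// htonl-style encoding, as seen by a little-endian kernel.
	fmt.Printf("0x%08x\n", decodeLE(encodeBE(skipSw))) // prints 0x02000000
	// Native encoding round-trips unchanged.
	fmt.Printf("0x%08x\n", decodeLE(encodeLE(skipSw))) // prints 0x00000002
}
```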
Signed-off-by: likiosliu <likiosliu@gamil.com>
Add support for the IFA_PROTO netlink attribute (kernel 5.18+) which
indicates the protocol or origin of an IP address. This is similar to
RTPROT_* for routes.
Signed-off-by: Carlos Goncalves <cgoncalves@redhat.com>
s390x is a big-endian architecture, so VLAN protocol values are already
in network byte order. The existing unconditional byte swap performed
when setting IFLA_VF_VLAN_INFO corrupts the VLAN protocol field on s390x,
causing drivers to reject the request as an unsupported protocol.
Signed-off-by: Ashok Pariya <ashok.pariya@ibm.com>
The structs and methods for managing devlink all live in the
`devlink_linux.go` file, so when writing code on a
non-Linux system (e.g. macOS),
the editor doesn't recognize any of them.
This commit moves the struct definitions out of the
devlink_linux.go file and adds a
devlink_unspecified.go containing the devlink funcs.
This is similar to what we have in netlink_unspecified.go.
Signed-off-by: Bernardo Soares <20172413+bersoare@users.noreply.github.com>
Add basic support for an equivalent of Linux's ip nexthop. In this PR, I
focused specifically on implementing the minimal feature set needed for an
IPv4 prefix with an IPv6 (link-local) nexthop, which various
implementations such as FRR use to support a technique called BGP Unnumbered.
A summary of the new features:
- Introduce a low-level primitive for nexthop in the nl package
- Introduce NexthopAdd/Del/List/Replace APIs (supports
NHA_ID/BLACKHOLE/GATEWAY, and the protocol field)
- Introduce an NHID field on the Route object, which allows attaching
a nexthop to routes.
Signed-off-by: Yutaro Hayakawa <yutaro.hayakawa@isovalent.com>
=== Squashed Commits ===
nl: Fix some wrong error and done message handling
The current logic for parsing ERROR and DONE messages first reads the
error field and, when NLM_F_ACK_TLVS is set, tries to read the original
request header, the payload of the request, and the extended ACK.
We have three issues here:
1. The existence of the original request header is not indicated by
NLM_F_ACK_TLVS flag. At least the original request header always
exists.
2. We are missing the check for the NLM_F_CAPPED flag. When the flag is set,
the payload of the request doesn't exist. In that case, we shouldn't
try to skip the payload. Otherwise, we may end up with an
out-of-range read.
3. NLMSG_DONE doesn't contain the original request, so we shouldn't
apply the original request parsing logic to it.
This commit fixes these issues as follows:
1. First check for NLM_F_CAPPED. When it is set,
only skip the original request header. Otherwise, skip the payload as
well. Don't apply this logic to the DONE message.
2. After that, check for NLM_F_ACK_TLVS. When it is
set, try to read the extended ACK for both DONE and ERROR messages.
Otherwise, don't.
Ref: https://docs.kernel.org/userspace-api/netlink/intro.html#netlink-message-types
Signed-off-by: Yutaro Hayakawa <yutaro.hayakawa@isovalent.com>
nexthop: Add a low-level API for the nexthop
Preparation for the support of the nexthop object.
Signed-off-by: Yutaro Hayakawa <yutaro.hayakawa@isovalent.com>
Add basic operation for nexthop
Add basic support for Linux's nexthop object (ip nexthop XXX). This
commit aims to introduce basic operations (add, list, del) with
minimal attributes. Further features can be added incrementally later.
Signed-off-by: Yutaro Hayakawa <yutaro.hayakawa@isovalent.com>
nexthop: Support NHA_OIF
It can be used to express a direct nexthop on a specific link.
Signed-off-by: Yutaro Hayakawa <yutaro.hayakawa@isovalent.com>
nexthop: Support NHA_GATEWAY
It can express an IP nexthop. A unique use case this enables is
attaching an IPv6 nexthop to routes with an IPv4 prefix, which
we cannot do with the existing `ip route` equivalents.
Signed-off-by: Yutaro Hayakawa <yutaro.hayakawa@isovalent.com>
nexthop: Support protocol
Allow setting protocol for nexthop.
Signed-off-by: Yutaro Hayakawa <yutaro.hayakawa@isovalent.com>
route: Support RTA_NHID
Support attaching a nexthop object to a route object via the NHID field.
Signed-off-by: Yutaro Hayakawa <yutaro.hayakawa@isovalent.com>
nexthop: Add Replace operation support
Add `ip nexthop replace` equivalent
Signed-off-by: Yutaro Hayakawa <yutaro.hayakawa@isovalent.com>
The logic for updating conntrack labels was missing.
The conntrack labels are a slice of bytes; if it is not nil, we send
its current value.
An all-zeros slice has a special meaning: it wipes out the existing
labels.
There is also some unexpected behavior: the conntrack table does not
reserve space for the labels if no rule sets a label,
causing the netlink calls to fail with ENOSPC.
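The three cases for the labels slice can be summarized in a small classifier. `labelAction` and `classifyLabels` are hypothetical names. Treating an empty but non-nil slice as "cleared" is an assumption of this sketch, as is the 16-byte label size used in the example.

```go
package main

import "fmt"

// labelAction describes what an update does to the existing labels.
type labelAction int

const (
	labelsUntouched labelAction = iota // nil slice: attribute is not sent
	labelsCleared                      // all-zero slice: existing labels are wiped
	labelsSet                          // any other value: labels are replaced
)

// classifyLabels reports the effect of sending labels in an update.
func classifyLabels(labels []byte) labelAction {
	if labels == nil {
		return labelsUntouched
	}
	for _, b := range labels {
		if b != 0 {
			return labelsSet
		}
	}
	return labelsCleared
}

func main() {
	set := make([]byte, 16)
	set[0] = 0x01

	fmt.Println(classifyLabels(nil) == labelsUntouched)
	fmt.Println(classifyLabels(make([]byte, 16)) == labelsCleared)
	fmt.Println(classifyLabels(set) == labelsSet)
}
```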
Signed-off-by: Antonio Ojea <aojea@google.com>
Introduce generic IFLA_HEADROOM and IFLA_TAILROOM values that are appropriately
deserialised when querying link details via RTNL.
Netkit-specific variables that allow setting of driver-specific attributes have
been renamed to avoid ambiguity.
Signed-off-by: Alasdair McWilliam <alasdair.mcwilliam@isovalent.com>
Add the following devlink port attributes:
PortNumber: the physical port number
PfNumber: the PF number
VfNumber: the VF number (index)
SfNumber: the SF number (index)
ControllerNumber: the controller number
External: if set, indicates external controller
Signed-off-by: adrianc <adrianc@nvidia.com>
The SEG6 route test was failing on newer kernels:
* Using loopback (`lo`) for SRv6 inline routes no longer preserves the
`encap seg6` attribute. Switched to a dummy device (`dummy0`) to ensure
SRv6 routes are accepted.
* Removed the artificial `::` SID from the inline segment list, which
caused the kernel to drop the SRH.
* Added cleanup to delete the dummy device after the test.
* Updated route checks to use `RouteListFiltered` with `RT_FILTER_OIF|RT_FILTER_DST`
since modern kernels add extra auto-routes (e.g., fe80::/64), which broke
the naive `len(routes) == 1` assertion.
These changes make `TestSEG6RouteAddDel` stable across recent kernels while
still validating both INLINE (IPv6) and ENCAP (IPv4) SRv6 routes.
This change ensures that the necessary netfilter hooks are in place for
conntrack tests to run reliably. Previously, the tests would fail in
environments where the host's firewall was not configured to accept
conntrack traffic.
This change introduces a new function, `ensureCtHooksInThisNS`, that
uses `iptables` or `nftables` to install the necessary hooks. This
function is called from `nsCreateAndEnter`, so all tests that use this
function will have a properly configured netns.
This change also removes the `CI` environment variable check from the
tests, as they are now expected to pass in CI environments.
The TestLinkXdp test was observed to be flaky in CI environments.
The failure occurred because the test expects the second call to
LinkSetXdpFdWithFlags with the XDP_FLAGS_UPDATE_IF_NOEXIST flag
to fail with EBUSY, indicating that an XDP program is already
attached.
Due to a race condition, the kernel had not always completed the
attachment from the first LinkSetXdpFd call before the second call
was made. This caused the second call to succeed (returning a nil
error) when it should have failed. The test logic correctly
interpreted this unexpected success as a failure, but logged the
nil error, leading to confusing output.
This commit resolves the flakiness by introducing a retry loop
with a short delay. The test now attempts the second call multiple
times, giving the kernel sufficient time to update the link's XDP
state. This ensures that the test reliably checks for the EBUSY
error as intended.
- Added IP6tnlEncap processing
- Added proper IP6tnlEncap.ID assignment
- Enabled handling of IP6tnlEncap.TC and IP6tnlEncap.FLAGS for serialization support
IFLA_NETKIT_HEADROOM and IFLA_NETKIT_TAILROOM attributes were added to
upstream kernel 6.14, allowing Netkit devices to be configured with
appropriate headroom/tailroom space to accommodate additional headers,
packet length changes and so forth.
Signed-off-by: Alasdair McWilliam <alasdair.mcwilliam@isovalent.com>
Refactors test setup and teardown logic to use `t.Cleanup` instead
of `defer`. This ensures that cleanup functions are correctly scoped
to each subtest's lifecycle, improving test isolation and reliability.
The `setUpNetlinkTest` helper function is also improved to correctly
save and restore the original network namespace, ensuring that tests
do not leak state.
To support this, a `Close()` method that returns an error is added to
the `Handle` struct, allowing for proper cleanup of underlying netlink
sockets. The test helpers are updated to use this new method,
preventing resource leaks between tests.
Additionally, a bug in the `netns` tests is fixed where a large
namespace ID could overflow a 32-bit integer, causing spurious
failures on some systems.
Refactors the TestRuleListFiltered test to run each subtest in its
own network namespace. This prevents state leakage between subtests,
resolving intermittent test failures caused by rules created in one
test interfering with subsequent tests.
The test logic for verifying all rules is also simplified by moving
the rule listing into the setup phase for that specific test case,
removing conditional logic from the main test loop.
Fixes: #1108
Introduces a configuration flag to disable Virtual Function polling,
providing users with control over polling behavior for performance
optimization scenarios.
Fixes: #1097
Signed-off-by: pasteley <ceasebeing@gmail.com>
Skip TestSocketXDPGetInfo rather than failing it when the user lacks
permissions to create AF_XDP sockets.
This allows running the full test suite as an unprivileged user again.