Ricardo Ribalda [Thu, 18 Mar 2021 20:22:22 +0000 (21:22 +0100)]
bpf: Fix typo 'accesible' into 'accessible'
Trivial fix.
Signed-off-by: Ricardo Ribalda <ribalda@chromium.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20210318202223.164873-8-ribalda@chromium.org
Daniel Borkmann [Thu, 25 Mar 2021 23:41:34 +0000 (00:41 +0100)]
bpf: Undo ptr_to_map_key alu sanitation for now
Remove PTR_TO_MAP_KEY for the time being from being sanitized on pointer ALU
through sanitize_ptr_alu() mainly for 3 reasons:
1) It's currently unused and not available from unprivileged. However that by
itself is not yet a strong reason to drop the code.
2) Commit
69c087ba6225 ("bpf: Add bpf_for_each_map_elem() helper") implemented
the sanitation not fully correct in that unlike stack or map_value pointer
it doesn't probe whether the access to the map key /after/ the simulated ALU
operation is still in bounds. This means that the generated mask can truncate
the offset in the non-speculative domain whereas it should only truncate in
the speculative domain. The verifier should instead reject such program as
we do for other types.
3) Given the recent fixes from
f232326f6966 ("bpf: Prohibit alu ops for pointer
types not defining ptr_limit"),
10d2bb2e6b1d ("bpf: Fix off-by-one for area
size in creating mask to left"),
b5871dca250c ("bpf: Simplify alu_limit masking
for pointer arithmetic") as well as
1b1597e64e1a ("bpf: Add sanity check for
upper ptr_limit") the code changed quite a bit and the merge in
efd13b71a3fa
broke the PTR_TO_MAP_KEY case due to an incorrect merge conflict.
Remove the relevant pieces for the time being and we can rework the PTR_TO_MAP_KEY
case once everything settles.
Fixes:
efd13b71a3fa ("Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net")
Fixes:
69c087ba6225 ("bpf: Add bpf_for_each_map_elem() helper")
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
David S. Miller [Thu, 25 Mar 2021 23:30:46 +0000 (16:30 -0700)]
Merge https://git./linux/kernel/git/bpf/bpf-next
Alexei Starovoitov says:
====================
pull-request: bpf-next 2021-03-24
The following pull-request contains BPF updates for your *net-next* tree.
We've added 37 non-merge commits during the last 15 day(s) which contain
a total of 65 files changed, 3200 insertions(+), 738 deletions(-).
The main changes are:
1) Static linking of multiple BPF ELF files, from Andrii.
2) Move drop error path to devmap for XDP_REDIRECT, from Lorenzo.
3) Spelling fixes from various folks.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Thu, 25 Mar 2021 22:31:22 +0000 (15:31 -0700)]
Merge git://git./linux/kernel/git/netdev/net
Signed-off-by: David S. Miller <davem@davemloft.net>
Linus Torvalds [Thu, 25 Mar 2021 18:43:43 +0000 (11:43 -0700)]
Merge branch 'akpm' (patches from Andrew)
Merge misc fixes from Andrew Morton:
"14 patches.
Subsystems affected by this patch series: mm (hugetlb, kasan, gup,
selftests, z3fold, kfence, memblock, and highmem), squashfs, ia64,
gcov, and mailmap"
* emailed patches from Andrew Morton <akpm@linux-foundation.org>:
mailmap: update Andrey Konovalov's email address
mm/highmem: fix CONFIG_DEBUG_KMAP_LOCAL_FORCE_MAP
mm: memblock: fix section mismatch warning again
kfence: make compatible with kmemleak
gcov: fix clang-11+ support
ia64: fix format strings for err_inject
ia64: mca: allocate early mca with GFP_ATOMIC
squashfs: fix xattr id and id lookup sanity checks
squashfs: fix inode lookup sanity checks
z3fold: prevent reclaim/free race for headless pages
selftests/vm: fix out-of-tree build
mm/mmu_notifiers: ensure range_end() is paired with range_start()
kasan: fix per-page tags for non-page_alloc pages
hugetlb_cgroup: fix imbalanced css_get and css_put pair for shared mappings
Linus Torvalds [Thu, 25 Mar 2021 18:23:35 +0000 (11:23 -0700)]
Merge tag 'for-linus' of git://git./linux/kernel/git/rdma/rdma
Pull rdma fixes from Jason Gunthorpe:
"Not much going on, just some small bug fixes:
- Typo causing a regression in mlx5 devx
- Regression in the recent hns rework causing the HW to get out of
sync
- Long-standing cxgb4 adaptor crash when destroying cm ids"
* tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma:
RDMA/cxgb4: Fix adapter LE hash errors while destroying ipv6 listening server
RDMA/hns: Fix bug during CMDQ initialization
RDMA/mlx5: Fix typo in destroy_mkey inbox
Linus Torvalds [Thu, 25 Mar 2021 18:11:45 +0000 (11:11 -0700)]
Merge tag 'mfd-fixes-5.12' of git://git./linux/kernel/git/lee/mfd
Pull mfs fix from Lee Jones:
"Unconstify editable placeholder structures"
* tag 'mfd-fixes-5.12' of git://git.kernel.org/pub/scm/linux/kernel/git/lee/mfd:
mfd: intel_quark_i2c_gpio: Revert "Constify static struct resources"
Linus Torvalds [Thu, 25 Mar 2021 18:07:40 +0000 (11:07 -0700)]
Merge tag 'arm64-fixes' of git://git./linux/kernel/git/arm64/linux
Pull arm64 fixes from Will Deacon:
"Minor fixes all over, ranging from typos to tests to errata
workarounds:
- Fix possible memory hotplug failure with KASLR
- Fix FFR value in SVE kselftest
- Fix backtraces reported in /proc/$pid/stack
- Disable broken CnP implementation on NVIDIA Carmel
- Typo fixes and ACPI documentation clarification
- Fix some W=1 warnings"
* tag 'arm64-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux:
arm64: kernel: disable CNP on Carmel
arm64/process.c: fix Wmissing-prototypes build warnings
kselftest/arm64: sve: Do not use non-canonical FFR register value
arm64: mm: correct the inside linear map range during hotplug check
arm64: kdump: update ppos when reading elfcorehdr
arm64: cpuinfo: Fix a typo
Documentation: arm64/acpi : clarify arm64 support of IBFT
arm64: stacktrace: don't trace arch_stack_walk()
arm64: csum: cast to the proper type
Chris Chiu [Thu, 25 Mar 2021 14:04:19 +0000 (22:04 +0800)]
mailmap: update the email address for Chris Chiu
Redirect my older email addresses in the git logs.
Signed-off-by: Chris Chiu <chris.chiu@canonical.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Andrey Konovalov [Thu, 25 Mar 2021 04:37:56 +0000 (21:37 -0700)]
mailmap: update Andrey Konovalov's email address
Use my personal email, the @google.com one will stop functioning soon.
Link: https://lkml.kernel.org/r/ead0e9c32a2f70e0bde6f63b3b9470e0ef13d2ee.1616107969.git.andreyknvl@google.com
Signed-off-by: Andrey Konovalov <andreyknvl@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Ira Weiny [Thu, 25 Mar 2021 04:37:53 +0000 (21:37 -0700)]
mm/highmem: fix CONFIG_DEBUG_KMAP_LOCAL_FORCE_MAP
The kernel test robot found that __kmap_local_sched_out() was not
correctly skipping the guard pages when DEBUG_KMAP_LOCAL_FORCE_MAP was
set.[1] This was due to DEBUG_HIGHMEM check being used.
Change the configuration check to be correct.
[1] https://lore.kernel.org/lkml/
20210304083825.GB17830@xsang-OptiPlex-9020/
Link: https://lkml.kernel.org/r/20210318230657.1497881-1-ira.weiny@intel.com
Fixes:
0e91a0c6984c ("mm/highmem: Provide CONFIG_DEBUG_KMAP_LOCAL_FORCE_MAP")
Signed-off-by: Ira Weiny <ira.weiny@intel.com>
Reported-by: kernel test robot <oliver.sang@intel.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Oliver Sang <oliver.sang@intel.com>
Cc: Chaitanya Kulkarni <Chaitanya.Kulkarni@wdc.com>
Cc: David Sterba <dsterba@suse.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Mike Rapoport [Thu, 25 Mar 2021 04:37:50 +0000 (21:37 -0700)]
mm: memblock: fix section mismatch warning again
Commit
34dc2efb39a2 ("memblock: fix section mismatch warning") marked
memblock_bottom_up() and memblock_set_bottom_up() as __init, but they
could be referenced from non-init functions like
memblock_find_in_range_node() on architectures that enable
CONFIG_ARCH_KEEP_MEMBLOCK.
For such builds kernel test robot reports:
WARNING: modpost: vmlinux.o(.text+0x74fea4): Section mismatch in reference from the function memblock_find_in_range_node() to the function .init.text:memblock_bottom_up()
The function memblock_find_in_range_node() references the function __init memblock_bottom_up().
This is often because memblock_find_in_range_node lacks a __init annotation or the annotation of memblock_bottom_up is wrong.
Replace __init annotations with __init_memblock annotations so that the
appropriate section will be selected depending on
CONFIG_ARCH_KEEP_MEMBLOCK.
Link: https://lore.kernel.org/lkml/202103160133.UzhgY0wt-lkp@intel.com
Link: https://lkml.kernel.org/r/20210316171347.14084-1-rppt@kernel.org
Fixes:
34dc2efb39a2 ("memblock: fix section mismatch warning")
Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
Reviewed-by: Arnd Bergmann <arnd@arndb.de>
Reported-by: kernel test robot <lkp@intel.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Acked-by: Nick Desaulniers <ndesaulniers@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Marco Elver [Thu, 25 Mar 2021 04:37:47 +0000 (21:37 -0700)]
kfence: make compatible with kmemleak
Because memblock allocations are registered with kmemleak, the KFENCE
pool was seen by kmemleak as one large object. Later allocations
through kfence_alloc() that were registered with kmemleak via
slab_post_alloc_hook() would then overlap and trigger a warning.
Therefore, once the pool is initialized, we can remove (free) it from
kmemleak again, since it should be treated as allocator-internal and be
seen as "free memory".
The second problem is that kmemleak is passed the rounded size, and not
the originally requested size, which is also the size of KFENCE objects.
To avoid kmemleak scanning past the end of an object and trigger a
KFENCE out-of-bounds error, fix the size if it is a KFENCE object.
For simplicity, to avoid a call to kfence_ksize() in
slab_post_alloc_hook() (and avoid new IS_ENABLED(CONFIG_DEBUG_KMEMLEAK)
guard), just call kfence_ksize() in mm/kmemleak.c:create_object().
Link: https://lkml.kernel.org/r/20210317084740.3099921-1-elver@google.com
Signed-off-by: Marco Elver <elver@google.com>
Reported-by: Luis Henriques <lhenriques@suse.de>
Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>
Tested-by: Luis Henriques <lhenriques@suse.de>
Cc: Alexander Potapenko <glider@google.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Andrey Konovalov <andreyknvl@google.com>
Cc: Jann Horn <jannh@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Nick Desaulniers [Thu, 25 Mar 2021 04:37:44 +0000 (21:37 -0700)]
gcov: fix clang-11+ support
LLVM changed the expected function signatures for llvm_gcda_start_file()
and llvm_gcda_emit_function() in the clang-11 release. Users of
clang-11 or newer may have noticed their kernels failing to boot due to
a panic when enabling CONFIG_GCOV_KERNEL=y +CONFIG_GCOV_PROFILE_ALL=y.
Fix up the function signatures so calling these functions doesn't panic
the kernel.
Link: https://reviews.llvm.org/rGcdd683b516d147925212724b09ec6fb792a40041
Link: https://reviews.llvm.org/rG13a633b438b6500ecad9e4f936ebadf3411d0f44
Link: https://lkml.kernel.org/r/20210312224132.3413602-2-ndesaulniers@google.com
Signed-off-by: Nick Desaulniers <ndesaulniers@google.com>
Reported-by: Prasad Sodagudi <psodagud@quicinc.com>
Suggested-by: Nathan Chancellor <nathan@kernel.org>
Reviewed-by: Fangrui Song <maskray@google.com>
Tested-by: Nathan Chancellor <nathan@kernel.org>
Acked-by: Peter Oberparleiter <oberpar@linux.ibm.com>
Reviewed-by: Nathan Chancellor <nathan@kernel.org>
Cc: <stable@vger.kernel.org> [5.4+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Sergei Trofimovich [Thu, 25 Mar 2021 04:37:41 +0000 (21:37 -0700)]
ia64: fix format strings for err_inject
Fix warning with %lx / u64 mismatch:
arch/ia64/kernel/err_inject.c: In function 'show_resources':
arch/ia64/kernel/err_inject.c:62:22: warning:
format '%lx' expects argument of type 'long unsigned int',
but argument 3 has type 'u64' {aka 'long long unsigned int'}
62 | return sprintf(buf, "%lx", name[cpu]); \
| ^~~~~~~
Link: https://lkml.kernel.org/r/20210313104312.1548232-1-slyfox@gentoo.org
Signed-off-by: Sergei Trofimovich <slyfox@gentoo.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Sergei Trofimovich [Thu, 25 Mar 2021 04:37:38 +0000 (21:37 -0700)]
ia64: mca: allocate early mca with GFP_ATOMIC
The sleep warning happens at early boot right at secondary CPU
activation bootup:
smp: Bringing up secondary CPUs ...
BUG: sleeping function called from invalid context at mm/page_alloc.c:4942
in_atomic(): 0, irqs_disabled(): 1, non_block: 0, pid: 0, name: swapper/1
CPU: 1 PID: 0 Comm: swapper/1 Not tainted
5.12.0-rc2-00007-g79e228d0b611-dirty #99
..
Call Trace:
show_stack+0x90/0xc0
dump_stack+0x150/0x1c0
___might_sleep+0x1c0/0x2a0
__might_sleep+0xa0/0x160
__alloc_pages_nodemask+0x1a0/0x600
alloc_page_interleave+0x30/0x1c0
alloc_pages_current+0x2c0/0x340
__get_free_pages+0x30/0xa0
ia64_mca_cpu_init+0x2d0/0x3a0
cpu_init+0x8b0/0x1440
start_secondary+0x60/0x700
start_ap+0x750/0x780
Fixed BSP b0 value from CPU 1
As I understand interrupts are not enabled yet and system has a lot of
memory. There is little chance to sleep and switch to GFP_ATOMIC should
be a no-op.
Link: https://lkml.kernel.org/r/20210315085045.204414-1-slyfox@gentoo.org
Signed-off-by: Sergei Trofimovich <slyfox@gentoo.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Phillip Lougher [Thu, 25 Mar 2021 04:37:35 +0000 (21:37 -0700)]
squashfs: fix xattr id and id lookup sanity checks
The checks for maximum metadata block size is missing
SQUASHFS_BLOCK_OFFSET (the two byte length count).
Link: https://lkml.kernel.org/r/2069685113.2081245.1614583677427@webmail.123-reg.co.uk
Fixes:
f37aa4c7366e23f ("squashfs: add more sanity checks in id lookup")
Signed-off-by: Phillip Lougher <phillip@squashfs.org.uk>
Cc: Sean Nyekjaer <sean@geanix.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Sean Nyekjaer [Thu, 25 Mar 2021 04:37:32 +0000 (21:37 -0700)]
squashfs: fix inode lookup sanity checks
When mouting a squashfs image created without inode compression it fails
with: "unable to read inode lookup table"
It turns out that the BLOCK_OFFSET is missing when checking the
SQUASHFS_METADATA_SIZE agaist the actual size.
Link: https://lkml.kernel.org/r/20210226092903.1473545-1-sean@geanix.com
Fixes:
eabac19e40c0 ("squashfs: add more sanity checks in inode lookup")
Signed-off-by: Sean Nyekjaer <sean@geanix.com>
Acked-by: Phillip Lougher <phillip@squashfs.org.uk>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Thomas Hebb [Thu, 25 Mar 2021 04:37:29 +0000 (21:37 -0700)]
z3fold: prevent reclaim/free race for headless pages
Commit
ca0246bb97c2 ("z3fold: fix possible reclaim races") introduced
the PAGE_CLAIMED flag "to avoid racing on a z3fold 'headless' page
release." By atomically testing and setting the bit in each of
z3fold_free() and z3fold_reclaim_page(), a double-free was avoided.
However, commit
dcf5aedb24f8 ("z3fold: stricter locking and more careful
reclaim") appears to have unintentionally broken this behavior by moving
the PAGE_CLAIMED check in z3fold_reclaim_page() to after the page lock
gets taken, which only happens for non-headless pages. For headless
pages, the check is now skipped entirely and races can occur again.
I have observed such a race on my system:
page:
00000000ffbd76b7 refcount:0 mapcount:0 mapping:
0000000000000000 index:0x0 pfn:0x165316
flags: 0x2ffff0000000000()
raw:
02ffff0000000000 ffffea0004535f48 ffff8881d553a170 0000000000000000
raw:
0000000000000000 0000000000000011 00000000ffffffff 0000000000000000
page dumped because: VM_BUG_ON_PAGE(page_ref_count(page) == 0)
------------[ cut here ]------------
kernel BUG at include/linux/mm.h:707!
invalid opcode: 0000 [#1] PREEMPT SMP KASAN PTI
CPU: 2 PID: 291928 Comm: kworker/2:0 Tainted: G B 5.10.7-arch1-1-kasan #1
Hardware name: Gigabyte Technology Co., Ltd. H97N-WIFI/H97N-WIFI, BIOS F9b 03/03/2016
Workqueue: zswap-shrink shrink_worker
RIP: 0010:__free_pages+0x10a/0x130
Code: c1 e7 06 48 01 ef 45 85 e4 74 d1 44 89 e6 31 d2 41 83 ec 01 e8 e7 b0 ff ff eb da 48 c7 c6 e0 32 91 88 48 89 ef e8 a6 89 f8 ff <0f> 0b 4c 89 e7 e8 fc 79 07 00 e9 33 ff ff ff 48 89 ef e8 ff 79 07
RSP: 0000:
ffff88819a2ffb98 EFLAGS:
00010296
RAX:
0000000000000000 RBX:
ffffea000594c5a8 RCX:
0000000000000000
RDX:
1ffffd4000b298b7 RSI:
0000000000000000 RDI:
ffffea000594c5b8
RBP:
ffffea000594c580 R08:
000000000000003e R09:
ffff8881d5520bbb
R10:
ffffed103aaa4177 R11:
0000000000000001 R12:
ffffea000594c5b4
R13:
0000000000000000 R14:
ffff888165316000 R15:
ffffea000594c588
FS:
0000000000000000(0000) GS:
ffff8881d5500000(0000) knlGS:
0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0:
0000000080050033
CR2:
00007f7c8c3654d8 CR3:
0000000103f42004 CR4:
00000000001706e0
Call Trace:
z3fold_zpool_shrink+0x9b6/0x1240
shrink_worker+0x35/0x90
process_one_work+0x70c/0x1210
worker_thread+0x539/0x1200
kthread+0x330/0x400
ret_from_fork+0x22/0x30
Modules linked in: rfcomm ebtable_filter ebtables ip6table_filter ip6_tables iptable_filter ccm algif_aead des_generic libdes ecb algif_skcipher cmac bnep md4 algif_hash af_alg vfat fat intel_rapl_msr intel_rapl_common x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel iwlmvm hid_logitech_hidpp kvm at24 mac80211 snd_hda_codec_realtek iTCO_wdt snd_hda_codec_generic intel_pmc_bxt snd_hda_codec_hdmi ledtrig_audio iTCO_vendor_support mei_wdt mei_hdcp snd_hda_intel snd_intel_dspcfg libarc4 soundwire_intel irqbypass iwlwifi soundwire_generic_allocation rapl soundwire_cadence intel_cstate snd_hda_codec intel_uncore btusb joydev mousedev snd_usb_audio pcspkr btrtl uvcvideo nouveau btbcm i2c_i801 btintel snd_hda_core videobuf2_vmalloc i2c_smbus snd_usbmidi_lib videobuf2_memops bluetooth snd_hwdep soundwire_bus snd_soc_rt5640 videobuf2_v4l2 cfg80211 snd_soc_rl6231 videobuf2_common snd_rawmidi lpc_ich alx videodev mdio snd_seq_device snd_soc_core mc ecdh_generic mxm_wmi mei_me
hid_logitech_dj wmi snd_compress e1000e ac97_bus mei ttm rfkill snd_pcm_dmaengine ecc snd_pcm snd_timer snd soundcore mac_hid acpi_pad pkcs8_key_parser it87 hwmon_vid crypto_user fuse ip_tables x_tables ext4 crc32c_generic crc16 mbcache jbd2 dm_crypt cbc encrypted_keys trusted tpm rng_core usbhid dm_mod crct10dif_pclmul crc32_pclmul crc32c_intel ghash_clmulni_intel aesni_intel crypto_simd cryptd glue_helper xhci_pci xhci_pci_renesas i915 video intel_gtt i2c_algo_bit drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops cec drm agpgart
---[ end trace
126d646fc3dc0ad8 ]---
To fix the issue, re-add the earlier test and set in the case where we
have a headless page.
Link: https://lkml.kernel.org/r/c8106dbe6d8390b290cd1d7f873a2942e805349e.1615452048.git.tommyhebb@gmail.com
Fixes:
dcf5aedb24f8 ("z3fold: stricter locking and more careful reclaim")
Signed-off-by: Thomas Hebb <tommyhebb@gmail.com>
Reviewed-by: Vitaly Wool <vitaly.wool@konsulko.com>
Cc: Jongseok Kim <ks77sj@gmail.com>
Cc: Snild Dolkow <snild@sony.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Rong Chen [Thu, 25 Mar 2021 04:37:26 +0000 (21:37 -0700)]
selftests/vm: fix out-of-tree build
When building out-of-tree, attempting to make target from $(OUTPUT) directory:
make[1]: *** No rule to make target '$(OUTPUT)/protection_keys.c', needed by '$(OUTPUT)/protection_keys_32'.
Link: https://lkml.kernel.org/r/20210315094700.522753-1-rong.a.chen@intel.com
Signed-off-by: Rong Chen <rong.a.chen@intel.com>
Reported-by: kernel test robot <lkp@intel.com>
Cc: Shuah Khan <shuah@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Sean Christopherson [Thu, 25 Mar 2021 04:37:23 +0000 (21:37 -0700)]
mm/mmu_notifiers: ensure range_end() is paired with range_start()
If one or more notifiers fails .invalidate_range_start(), invoke
.invalidate_range_end() for "all" notifiers. If there are multiple
notifiers, those that did not fail are expecting _start() and _end() to
be paired, e.g. KVM's mmu_notifier_count would become imbalanced.
Disallow notifiers that can fail _start() from implementing _end() so
that it's unnecessary to either track which notifiers rejected _start(),
or had already succeeded prior to a failed _start().
Note, the existing behavior of calling _start() on all notifiers even
after a previous notifier failed _start() was an unintented "feature".
Make it canon now that the behavior is depended on for correctness.
As of today, the bug is likely benign:
1. The only caller of the non-blocking notifier is OOM kill.
2. The only notifiers that can fail _start() are the i915 and Nouveau
drivers.
3. The only notifiers that utilize _end() are the SGI UV GRU driver
and KVM.
4. The GRU driver will never coincide with the i195/Nouveau drivers.
5. An imbalanced kvm->mmu_notifier_count only causes soft lockup in the
_guest_, and the guest is already doomed due to being an OOM victim.
Fix the bug now to play nice with future usage, e.g. KVM has a
potential use case for blocking memslot updates in KVM while an
invalidation is in-progress, and failure to unblock would result in said
updates being blocked indefinitely and hanging.
Found by inspection. Verified by adding a second notifier in KVM that
periodically returns -EAGAIN on non-blockable ranges, triggering OOM,
and observing that KVM exits with an elevated notifier count.
Link: https://lkml.kernel.org/r/20210311180057.1582638-1-seanjc@google.com
Fixes:
93065ac753e4 ("mm, oom: distinguish blockable mode for mmu notifiers")
Signed-off-by: Sean Christopherson <seanjc@google.com>
Suggested-by: Jason Gunthorpe <jgg@ziepe.ca>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Ben Gardon <bgardon@google.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: "Jérôme Glisse" <jglisse@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Dimitri Sivanich <dimitri.sivanich@hpe.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Andrey Konovalov [Thu, 25 Mar 2021 04:37:20 +0000 (21:37 -0700)]
kasan: fix per-page tags for non-page_alloc pages
To allow performing tag checks on page_alloc addresses obtained via
page_address(), tag-based KASAN modes store tags for page_alloc
allocations in page->flags.
Currently, the default tag value stored in page->flags is 0x00.
Therefore, page_address() returns a 0x00ffff... address for pages that
were not allocated via page_alloc.
This might cause problems. A particular case we encountered is a
conflict with KFENCE. If a KFENCE-allocated slab object is being freed
via kfree(page_address(page) + offset), the address passed to kfree()
will get tagged with 0x00 (as slab pages keep the default per-page
tags). This leads to is_kfence_address() check failing, and a KFENCE
object ending up in normal slab freelist, which causes memory
corruptions.
This patch changes the way KASAN stores tag in page-flags: they are now
stored xor'ed with 0xff. This way, KASAN doesn't need to initialize
per-page flags for every created page, which might be slow.
With this change, page_address() returns natively-tagged (with 0xff)
pointers for pages that didn't have tags set explicitly.
This patch fixes the encountered conflict with KFENCE and prevents more
similar issues that can occur in the future.
Link: https://lkml.kernel.org/r/1a41abb11c51b264511d9e71c303bb16d5cb367b.1615475452.git.andreyknvl@google.com
Fixes:
2813b9c02962 ("kasan, mm, arm64: tag non slab memory allocated via pagealloc")
Signed-off-by: Andrey Konovalov <andreyknvl@google.com>
Reviewed-by: Marco Elver <elver@google.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Will Deacon <will.deacon@arm.com>
Cc: Vincenzo Frascino <vincenzo.frascino@arm.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
Cc: Alexander Potapenko <glider@google.com>
Cc: Peter Collingbourne <pcc@google.com>
Cc: Evgenii Stepanov <eugenis@google.com>
Cc: Branislav Rankov <Branislav.Rankov@arm.com>
Cc: Kevin Brodsky <kevin.brodsky@arm.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Miaohe Lin [Thu, 25 Mar 2021 04:37:17 +0000 (21:37 -0700)]
hugetlb_cgroup: fix imbalanced css_get and css_put pair for shared mappings
The current implementation of hugetlb_cgroup for shared mappings could
have different behavior. Consider the following two scenarios:
1.Assume initial css reference count of hugetlb_cgroup is 1:
1.1 Call hugetlb_reserve_pages with from = 1, to = 2. So css reference
count is 2 associated with 1 file_region.
1.2 Call hugetlb_reserve_pages with from = 2, to = 3. So css reference
count is 3 associated with 2 file_region.
1.3 coalesce_file_region will coalesce these two file_regions into
one. So css reference count is 3 associated with 1 file_region
now.
2.Assume initial css reference count of hugetlb_cgroup is 1 again:
2.1 Call hugetlb_reserve_pages with from = 1, to = 3. So css reference
count is 2 associated with 1 file_region.
Therefore, we might have one file_region while holding one or more css
reference counts. This inconsistency could lead to imbalanced css_get()
and css_put() pair. If we do css_put one by one (i.g. hole punch case),
scenario 2 would put one more css reference. If we do css_put all
together (i.g. truncate case), scenario 1 will leak one css reference.
The imbalanced css_get() and css_put() pair would result in a non-zero
reference when we try to destroy the hugetlb cgroup. The hugetlb cgroup
directory is removed __but__ associated resource is not freed. This
might result in OOM or can not create a new hugetlb cgroup in a busy
workload ultimately.
In order to fix this, we have to make sure that one file_region must
hold exactly one css reference. So in coalesce_file_region case, we
should release one css reference before coalescence. Also only put css
reference when the entire file_region is removed.
The last thing to note is that the caller of region_add() will only hold
one reference to h_cg->css for the whole contiguous reservation region.
But this area might be scattered when there are already some
file_regions reside in it. As a result, many file_regions may share only
one h_cg->css reference. In order to ensure that one file_region must
hold exactly one css reference, we should do css_get() for each
file_region and release the reference held by caller when they are done.
[linmiaohe@huawei.com: fix imbalanced css_get and css_put pair for shared mappings]
Link: https://lkml.kernel.org/r/20210316023002.53921-1-linmiaohe@huawei.com
Link: https://lkml.kernel.org/r/20210301120540.37076-1-linmiaohe@huawei.com
Fixes:
075a61d07a8e ("hugetlb_cgroup: add accounting for shared mappings")
Reported-by: kernel test robot <lkp@intel.com> (auto build test ERROR)
Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Cc: Wanpeng Li <liwp.linux@gmail.com>
Cc: Mina Almasry <almasrymina@google.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Potnuri Bharat Teja [Wed, 24 Mar 2021 19:04:53 +0000 (00:34 +0530)]
RDMA/cxgb4: Fix adapter LE hash errors while destroying ipv6 listening server
Not setting the ipv6 bit while destroying ipv6 listening servers may
result in potential fatal adapter errors due to lookup engine memory hash
errors. Therefore always set ipv6 field while destroying ipv6 listening
servers.
Fixes:
830662f6f032 ("RDMA/cxgb4: Add support for active and passive open connection with IPv6 address")
Link: https://lore.kernel.org/r/20210324190453.8171-1-bharat@chelsio.com
Signed-off-by: Potnuri Bharat Teja <bharat@chelsio.com>
Reviewed-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Rich Wiley [Wed, 24 Mar 2021 00:28:09 +0000 (17:28 -0700)]
arm64: kernel: disable CNP on Carmel
On NVIDIA Carmel cores, CNP behaves differently than it does on standard
ARM cores. On Carmel, if two cores have CNP enabled and share an L2 TLB
entry created by core0 for a specific ASID, a non-shareable TLBI from
core1 may still see the shared entry. On standard ARM cores, that TLBI
will invalidate the shared entry as well.
This causes issues with patchsets that attempt to do local TLBIs based
on cpumasks instead of broadcast TLBIs. Avoid these issues by disabling
CNP support for NVIDIA Carmel cores.
Signed-off-by: Rich Wiley <rwiley@nvidia.com>
Link: https://lore.kernel.org/r/20210324002809.30271-1-rwiley@nvidia.com
[will: Fix pre-existing whitespace issue]
Signed-off-by: Will Deacon <will@kernel.org>
Maninder Singh [Wed, 24 Mar 2021 06:54:58 +0000 (12:24 +0530)]
arm64/process.c: fix Wmissing-prototypes build warnings
Fix GCC warnings reported when building with "-Wmissing-prototypes":
arch/arm64/kernel/process.c:261:6: warning: no previous prototype for '__show_regs' [-Wmissing-prototypes]
261 | void __show_regs(struct pt_regs *regs)
| ^~~~~~~~~~~
arch/arm64/kernel/process.c:307:6: warning: no previous prototype for '__show_regs_alloc_free' [-Wmissing-prototypes]
307 | void __show_regs_alloc_free(struct pt_regs *regs)
| ^~~~~~~~~~~~~~~~~~~~~~
arch/arm64/kernel/process.c:365:5: warning: no previous prototype for 'arch_dup_task_struct' [-Wmissing-prototypes]
365 | int arch_dup_task_struct(struct task_struct *dst, struct task_struct *src)
| ^~~~~~~~~~~~~~~~~~~~
arch/arm64/kernel/process.c:546:41: warning: no previous prototype for '__switch_to' [-Wmissing-prototypes]
546 | __notrace_funcgraph struct task_struct *__switch_to(struct task_struct *prev,
| ^~~~~~~~~~~
arch/arm64/kernel/process.c:710:25: warning: no previous prototype for 'arm64_preempt_schedule_irq' [-Wmissing-prototypes]
710 | asmlinkage void __sched arm64_preempt_schedule_irq(void)
| ^~~~~~~~~~~~~~~~~~~~~~~~~~
Link: https://lore.kernel.org/lkml/202103192250.AennsfXM-lkp@intel.com
Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: Maninder Singh <maninder1.s@samsung.com>
Link: https://lore.kernel.org/r/1616568899-986-1-git-send-email-maninder1.s@samsung.com
Signed-off-by: Will Deacon <will@kernel.org>
Linus Torvalds [Thu, 25 Mar 2021 01:16:04 +0000 (18:16 -0700)]
Merge git://git./linux/kernel/git/netdev/net
Pull networking fixes from David Miller:
"Various fixes, all over:
1) Fix overflow in ptp_qoriq_adjfine(), from Yangbo Lu.
2) Always store the rx queue mapping in veth, from Maciej
Fijalkowski.
3) Don't allow vmlinux btf in map_create, from Alexei Starovoitov.
4) Fix memory leak in octeontx2-af from Colin Ian King.
5) Use kvalloc in bpf x86 JIT for storing jit'd addresses, from
Yonghong Song.
6) Fix tx ptp stats in mlx5, from Aya Levin.
7) Check correct ip version in tun decap, fropm Roi Dayan.
8) Fix rate calculation in mlx5 E-Switch code, from arav Pandit.
9) Work item memork leak in mlx5, from Shay Drory.
10) Fix ip6ip6 tunnel crash with bpf, from Daniel Borkmann.
11) Lack of preemptrion awareness in macvlan, from Eric Dumazet.
12) Fix data race in pxa168_eth, from Pavel Andrianov.
13) Range validate stab in red_check_params(), from Eric Dumazet.
14) Inherit vlan filtering setting properly in b53 driver, from
Florian Fainelli.
15) Fix rtnl locking in igc driver, from Sasha Neftin.
16) Pause handling fixes in igc driver, from Muhammad Husaini
Zulkifli.
17) Missing rtnl locking in e1000_reset_task, from Vitaly Lifshits.
18) Use after free in qlcnic, from Lv Yunlong.
19) fix crash in fritzpci mISDN, from Tong Zhang.
20) Premature rx buffer reuse in igb, from Li RongQing.
21) Missing termination of ip[a driver message handler arrays, from
Alex Elder.
22) Fix race between "x25_close" and "x25_xmit"/"x25_rx" in hdlc_x25
driver, from Xie He.
23) Use after free in c_can_pci_remove(), from Tong Zhang.
24) Uninitialized variable use in nl80211, from Jarod Wilson.
25) Off by one size calc in bpf verifier, from Piotr Krysiuk.
26) Use delayed work instead of deferrable for flowtable GC, from
Yinjun Zhang.
27) Fix infinite loop in NPC unmap of octeontx2 driver, from
Hariprasad Kelam.
28) Fix being unable to change MTU of dwmac-sun8i devices due to lack
of fifo sizes, from Corentin Labbe.
29) DMA use after free in r8169 with WoL, fom Heiner Kallweit.
30) Mismatched prototypes in isdn-capi, from Arnd Bergmann.
31) Fix psample UAPI breakage, from Ido Schimmel"
* git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (171 commits)
psample: Fix user API breakage
math: Export mul_u64_u64_div_u64
ch_ktls: fix enum-conversion warning
octeontx2-af: Fix memory leak of object buf
ptp_qoriq: fix overflow in ptp_qoriq_adjfine() u64 calcalation
net: bridge: don't notify switchdev for local FDB addresses
net/sched: act_ct: clear post_ct if doing ct_clear
net: dsa: don't assign an error value to tag_ops
isdn: capi: fix mismatched prototypes
net/mlx5: SF, do not use ecpu bit for vhca state processing
net/mlx5e: Fix division by 0 in mlx5e_select_queue
net/mlx5e: Fix error path for ethtool set-priv-flag
net/mlx5e: Offload tuple rewrite for non-CT flows
net/mlx5e: Allow to match on MPLS parameters only for MPLS over UDP
net/mlx5: Add back multicast stats for uplink representor
net: ipconfig: ic_dev can be NULL in ic_close_devs
MAINTAINERS: Combine "QLOGIC QLGE 10Gb ETHERNET DRIVER" sections into one
docs: networking: Fix a typo
r8169: fix DMA being used after buffer free if WoL is enabled
net: ipa: fix init header command validation
...
Arnd Bergmann [Wed, 24 Mar 2021 13:07:22 +0000 (14:07 +0100)]
hinic: avoid gcc -Wrestrict warning
With extra warnings enabled, gcc complains that snprintf should not
take the same buffer as source and destination:
drivers/net/ethernet/huawei/hinic/hinic_ethtool.c: In function 'hinic_set_settings_to_hw':
drivers/net/ethernet/huawei/hinic/hinic_ethtool.c:480:9: error: 'snprintf' argument 4 overlaps destination object 'set_link_str' [-Werror=restrict]
480 | err = snprintf(set_link_str, SET_LINK_STR_MAX_LEN,
| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
481 | "%sspeed %d ", set_link_str, speed);
| ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
drivers/net/ethernet/huawei/hinic/hinic_ethtool.c:464:7: note: destination object referenced by 'restrict'-qualified argument 1 was declared here
464 | char set_link_str[SET_LINK_STR_MAX_LEN] = {0};
Rewrite this to avoid the nested sprintf and instead use separate
buffers, which is simpler.
Cc: Rasmus Villemoes <linux@rasmusvillemoes.dk>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Reviewed-by: Rasmus Villemoes <linux@rasmusvillemoes.dk>
Signed-off-by: David S. Miller <davem@davemloft.net>
Ong Boon Leong [Wed, 24 Mar 2021 09:07:42 +0000 (17:07 +0800)]
net: stmmac: support FPE link partner hand-shaking procedure
In order to discover whether remote station supports frame preemption,
local station sends verify mPacket and expects response mPacket in
return from the remote station.
So, we add the functions to send and handle event when verify mPacket
and response mPacket are exchanged between the networked stations.
The mechanism to handle different FPE states between local and remote
station (link partner) is implemented using workqueue which starts a
task each time there is some sign of verify & response mPacket exchange
as check in FPE IRQ event. The task retries couple of times to try to
spot the states that both stations are ready to enter FPE ON. This allows
different end points to enable FPE at different time and verify-response
mPacket can happen asynchronously. Ultimately, the task will only turn
FPE ON when local station have both exchange response in both directions.
Thanks to Voon Weifeng for implementing the core functions for detecting
FPE events and send mPacket and phylink related change.
Signed-off-by: Ong Boon Leong <boon.leong.ong@intel.com>
Co-developed-by: Voon Weifeng <weifeng.voon@intel.com>
Signed-off-by: Voon Weifeng <weifeng.voon@intel.com>
Co-developed-by: Tan Tee Min <tee.min.tan@intel.com>
Signed-off-by: Tan Tee Min <tee.min.tan@intel.com>
Co-developed-by: Mohammad Athari Bin Ismail <mohammad.athari.ismail@intel.com>
Signed-off-by: Mohammad Athari Bin Ismail <mohammad.athari.ismail@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Wang Hai [Wed, 24 Mar 2021 06:22:24 +0000 (14:22 +0800)]
6lowpan: Fix some typos in nhc_udp.c
s/Orignal/Original/
s/infered/inferred/
Reported-by: Hulk Robot <hulkci@huawei.com>
Signed-off-by: Wang Hai <wanghai38@huawei.com>
Acked-by: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Wang Hai [Wed, 24 Mar 2021 06:19:31 +0000 (14:19 +0800)]
net/packet: Fix a typo in af_packet.c
s/sequencially/sequentially/
Reported-by: Hulk Robot <hulkci@huawei.com>
Signed-off-by: Wang Hai <wanghai38@huawei.com>
Acked-by: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Wang Hai [Wed, 24 Mar 2021 06:16:22 +0000 (14:16 +0800)]
net/tls: Fix a typo in tls_device.c
s/beggining/beginning/
Reported-by: Hulk Robot <hulkci@huawei.com>
Signed-off-by: Wang Hai <wanghai38@huawei.com>
Acked-by: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Zhichao Cai [Wed, 24 Mar 2021 02:30:47 +0000 (10:30 +0800)]
Simplify the code by using module_platform_driver macro
for ftmac100
Signed-off-by: Zhichao Cai <caizhichao@yulong.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Wed, 24 Mar 2021 23:52:48 +0000 (16:52 -0700)]
Merge branch 'ipa-versions-and-registers'
Alex Elder says:
====================
net: ipa: versions and registers
Version 2 of this series adds kernel-doc descriptions for all
members of the ipa_version enumerated type in patch 2.
The original description of the series is below.
-Alex
This series is sort of a mix of things, generally related to
updating IPA versions and register definitions.
The first patch fixes some version-related tests throughout the code
so the conditions are valid for IPA versions other than the two that
are currently supported. Support for additional versions is
forthcoming, and this is a preparatory step.
The second patch adds to the set of defined IPA versions, to include
all versions between 3.0 and 4.11.
The next defines an endpoint initialization register that was
previously not being configured. We now initialize that register
(so that NAT is explicitly disabled) on all AP endpoints.
The fourth adds support for an extra bit in a field in a register,
which is present starting at IPA v4.5.
The last two are sort of standalone. One just moves a function
definition and makes it private. The other increases the number of
GSI channels and events supported by the driver, sufficient for IPA
v4.5.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Alex Elder [Wed, 24 Mar 2021 13:15:28 +0000 (08:15 -0500)]
net: ipa: increase channels and events
Increase the maximum number of channels and event rings supported by
the driver, to allow the maximum available on the SDX55.
Signed-off-by: Alex Elder <elder@linaro.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Alex Elder [Wed, 24 Mar 2021 13:15:27 +0000 (08:15 -0500)]
net: ipa: move ipa_aggr_granularity_val()
We only use ipa_aggr_granularity_val() inside "ipa_main.c", so it
doesn't really need to be defined in a header file. It makes some
sense to be grouped with the register definitions, but it is unlike
the other inline functions now defined in "ipa_reg.h". So move it
into "ipa_main.c" where it's used. TIMER_FREQUENCY is used only
by that function, so move that definition as well.
Signed-off-by: Alex Elder <elder@linaro.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Alex Elder [Wed, 24 Mar 2021 13:15:26 +0000 (08:15 -0500)]
net: ipa: limit local processing context address
Not all of the bits of the LOCAL_PKT_PROC_CNTXT register are valid.
Until IPA v4.5, there are 17 valid bits (though the bottom three
must be zero). Starting with IPA v4.5, 18 bits are valid.
Introduce proc_cntxt_base_addr_encoded() to encode the base address
for use in the register using only the valid bits.
Shorten the name of the register (omit "_BASE") to avoid the need to
wrap the line in the one place it's used.
Signed-off-by: Alex Elder <elder@linaro.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Alex Elder [Wed, 24 Mar 2021 13:15:25 +0000 (08:15 -0500)]
net: ipa: define the ENDP_INIT_NAT register
Define the ENDP_INIT_NAT register for setting up the NAT
configuration for an endpoint. We aren't using NAT at this
time, so explicitly set the type to BYPASS for all endpoints.
Signed-off-by: Alex Elder <elder@linaro.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Alex Elder [Wed, 24 Mar 2021 13:15:24 +0000 (08:15 -0500)]
net: ipa: update version definitions
Add IPA version definitions for all IPA v3.x and v4.x. Fix the GSI
version associated with IPA version 4.1.
Signed-off-by: Alex Elder <elder@linaro.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Alex Elder [Wed, 24 Mar 2021 13:15:23 +0000 (08:15 -0500)]
net: ipa: reduce IPA version assumptions
Modify conditional tests throughout the IPA code so they do not
assume that IPA v3.5.1 is the oldest version supported. Also remove
assumptions that IPA v4.5 is the newest version of IPA supported.
Augment versions in comments with "+", to be clearer that the
comment applies to a version and subsequent versions. (E.g.,
"present for IPA v4.2+" instead of just "present for v4.2".)
Signed-off-by: Alex Elder <elder@linaro.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Eric Dumazet [Wed, 24 Mar 2021 22:01:38 +0000 (15:01 -0700)]
tcp_metrics: tcpm_hash_bucket is strictly local
After commit
098a697b497e ("tcp_metrics: Use a single hash table
for all network namespaces."), tcpm_hash_bucket is local to
net/ipv4/tcp_metrics.c
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Eric Dumazet [Wed, 24 Mar 2021 21:53:37 +0000 (14:53 -0700)]
inet: use bigger hash table for IP ID generation
In commit
73f156a6e8c1 ("inetpeer: get rid of ip_id_count")
I used a very small hash table that could be abused
by patient attackers to reveal sensitive information.
Switch to a dynamic sizing, depending on RAM size.
Typical big hosts will now use 128x more storage (2 MB)
to get a similar increase in security and reduction
of hash collisions.
As a bonus, use of alloc_large_system_hash() spreads
allocated memory among all NUMA nodes.
Fixes:
73f156a6e8c1 ("inetpeer: get rid of ip_id_count")
Reported-by: Amit Klein <aksecurity@gmail.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Willy Tarreau <w@1wt.eu>
Signed-off-by: David S. Miller <davem@davemloft.net>
Ido Schimmel [Wed, 24 Mar 2021 19:43:32 +0000 (21:43 +0200)]
psample: Fix user API breakage
Cited commit added a new attribute before the existing group reference
count attribute, thereby changing its value and breaking existing
applications on new kernels.
Before:
# psample -l
libpsample ERROR psample_group_foreach: failed to recv message: Operation not supported
After:
# psample -l
Group Num Refcount Group Seq
1 1 0
Fix by restoring the value of the old attribute and remove the
misleading comments from the enumerator to avoid future bugs.
Cc: stable@vger.kernel.org
Fixes:
d8bed686ab96 ("net: psample: Add tunnel support")
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Reported-by: Adiel Bidani <adielb@nvidia.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Reviewed-by: Petr Machata <petrm@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Wed, 24 Mar 2021 23:42:54 +0000 (16:42 -0700)]
math: Export mul_u64_u64_div_u64
Fixes:
f51d7bf1dbe5 ("ptp_qoriq: fix overflow in ptp_qoriq_adjfine() u64 calcalation")
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Wed, 24 Mar 2021 23:34:58 +0000 (16:34 -0700)]
Merge branch 'mlxsw-resilient-nh-groups'
Ido Schimmel says:
====================
mlxsw: Add support for resilient nexthop groups
This patchset adds support for resilient nexthop groups in mlxsw. As far
as the hardware is concerned, resilient groups are the same as regular
groups. The differences lie in how mlxsw manages the individual
adjacency entries (nexthop buckets) that make up the group.
The first difference is that unlike regular groups the driver needs to
periodically update the kernel about activity of nexthop buckets so that
the kernel will not treat the buckets as idle, given traffic is
offloaded from the CPU to the ASIC. This is similar to what mlxsw is
already doing with respect to neighbour entries. The update interval is
set to 1 second to allow for short idle timers.
The second difference is that nexthop buckets that correspond to an
unresolved neighbour must be programmed to the device, as the size of
the group must remain fixed. This is achieved by programming such
entries with trap action, in order to trigger neighbour resolution by
the kernel.
The third difference is atomic replacement of individual nexthop
buckets. While the driver periodically updates the kernel about activity
of nexthop buckets, it is possible for a bucket to become active just
before the kernel decides to replace it with a different nexthop. To
avoid such situations and connections being reset, the driver instructs
the device to only replace an adjacency entry if it is inactive.
Failures are propagated back to the nexthop code.
Patchset overview:
Patches #1-#7 gradually add support for resilient nexthop groups
Patch #8 finally enables such groups to be programmed to the device
Patches #9-#10 add mlxsw-specific selftests
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Ido Schimmel [Wed, 24 Mar 2021 20:14:24 +0000 (22:14 +0200)]
selftests: mlxsw: Add resilient nexthop groups configuration tests
Test that unsupported resilient nexthop group configurations are
rejected and that offload / trap indication is correctly set on nexthop
buckets in a resilient group.
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: Petr Machata <petrm@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Ido Schimmel [Wed, 24 Mar 2021 20:14:23 +0000 (22:14 +0200)]
selftests: mlxsw: Test unresolved neigh trap with resilient nexthop groups
The number of nexthop buckets in a resilient nexthop group never
changes, so when the gateway address of a nexthop cannot be resolved,
the nexthop buckets are programmed to trap packets to the CPU in order
to trigger resolution. For example:
# ip nexthop add id 1 via 198.51.100.1 dev swp3
# ip nexthop add id 10 group 1 type resilient buckets 32
# ip nexthop bucket get id 10 index 0
id 10 index 0 idle_time 1.44 nhid 1 trap
Where 198.51.100.1 is a made-up IP.
Test that in this case packets are indeed trapped to the CPU via the
unresolved neigh trap.
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: Petr Machata <petrm@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Ido Schimmel [Wed, 24 Mar 2021 20:14:22 +0000 (22:14 +0200)]
mlxsw: spectrum_router: Enable resilient nexthop groups to be programmed
Now that mlxsw supports resilient nexthop groups, allow them to be
programmed after validating that their configuration conforms to the
device's limitations (e.g., number of buckets is within predefined
range).
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: Petr Machata <petrm@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Ido Schimmel [Wed, 24 Mar 2021 20:14:21 +0000 (22:14 +0200)]
mlxsw: spectrum_router: Periodically update activity of nexthop buckets
The kernel periodically checks the idle time of nexthop buckets to
determine if they are idle and can be re-populated with a new nexthop.
When the resilient nexthop group is offloaded to hardware, the kernel
will not see activity on nexthop buckets unless it is reported from
hardware.
Therefore, periodically (every 1 second) query the hardware for activity
of adjacency entries used as part of a resilient nexthop group and
report it to the nexthop code.
The activity is only queried if resilient nexthop groups are in use. The
delayed work is canceled otherwise.
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: Petr Machata <petrm@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Ido Schimmel [Wed, 24 Mar 2021 20:14:20 +0000 (22:14 +0200)]
mlxsw: reg: Add Router Adjacency Table Activity Dump Register
The RATRAD register is used to dump and optionally clear activity bits
of router adjacency table entries. Will be used by the next patch to
query and clear the activity of nexthop buckets in a resilient nexthop
group.
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: Petr Machata <petrm@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Ido Schimmel [Wed, 24 Mar 2021 20:14:19 +0000 (22:14 +0200)]
mlxsw: spectrum_router: Update hardware flags on nexthop buckets
So far, mlxsw only updated hardware flags ('offload' / 'trap') on
nexthop objects. For resilient nexthop groups, these flags need to be
updated on individual nexthop buckets as well.
Update these flags whenever updating the flags of the encapsulating
nexthop object and whenever a nexthop bucket is replaced.
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: Petr Machata <petrm@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Ido Schimmel [Wed, 24 Mar 2021 20:14:18 +0000 (22:14 +0200)]
mlxsw: spectrum_router: Add nexthop bucket replacement support
Replace a single nexthop bucket upon receiving a
'NEXTHOP_EVENT_BUCKET_REPLACE' notification.
When the 'force' parameter is not set, instruct the device to only
overwrite an adjacency entry if its activity is cleared, so as not to
break existing flows using the adjacency entry. The device does not
provide feedback if the replacement was successful in this case, so the
contents of the adjacency entry after the replacement are compared with
the replacement request.
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: Petr Machata <petrm@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Ido Schimmel [Wed, 24 Mar 2021 20:14:17 +0000 (22:14 +0200)]
mlxsw: spectrum_router: Pass payload pointer to nexthop update function
Have the caller pass a pointer to the payload of the RATR register to
the function updating a single nexthop / adjacency entry.
In a subsequent patch, this will allow the caller to make sure
replacement was successful by querying the state of the adjacency entry
after replacement and comparing with the initial request.
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: Petr Machata <petrm@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Ido Schimmel [Wed, 24 Mar 2021 20:14:16 +0000 (22:14 +0200)]
mlxsw: spectrum_router: Add ability to overwrite adjacency entry only when inactive
Allow the driver to instruct the device to only overwrite an adjacency
entry if its activity is cleared. Currently, adjacency entry is always
overwritten, regardless of activity.
This will be used by subsequent patches to prevent replacement of active
nexthop buckets.
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: Petr Machata <petrm@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Ido Schimmel [Wed, 24 Mar 2021 20:14:15 +0000 (22:14 +0200)]
mlxsw: spectrum_router: Add support for resilient nexthop groups
Parse the configuration of resilient nexthop groups to existing mlxsw
data structures. Unlike non-resilient groups, nexthops without a valid
MAC or router interface (RIF) are programmed with a trap action instead
of not being programmed at all.
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: Petr Machata <petrm@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Cooper Lees [Wed, 24 Mar 2021 03:47:38 +0000 (20:47 -0700)]
Add Open Routing Protocol ID to `rtnetlink.h`
- The Open Routing (Open/R) network protocol netlink handler uses ID 99
- Will also add to `/etc/iproute2/rt_protos` once this is accepted
- For more information: https://github.com/facebook/openr
Signed-off-by: From: Cooper Lees <me@cooperlees.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Vladimir Oltean [Wed, 24 Mar 2021 15:44:55 +0000 (17:44 +0200)]
net: enetc: don't depend on system endianness in enetc_set_mac_ht_flt
When enetc runs out of exact match entries for unicast address
filtering, it switches to an approach based on hash tables, where
multiple MAC addresses might end up in the same bucket.
However, the enetc_set_mac_ht_flt function currently depends on the
system endianness, because it interprets the 64-bit hash value as an
array of two u32 elements. Modify this to use lower_32_bits and
upper_32_bits.
Tested by forcing enetc to go into hash table mode by creating two
macvlan upper interfaces:
ip link add link eno0 address 00:01:02:03:00:00 eno0.0 type macvlan && ip link set eno0.0 up
ip link add link eno0 address 00:01:02:03:00:01 eno0.1 type macvlan && ip link set eno0.1 up
and verified that the same bit values are written to the registers
before and after:
enetc_sync_mac_filters: addr 00:00:80:00:40:10 exact match 0
enetc_sync_mac_filters: addr 00:00:00:00:80:00 exact match 0
enetc_set_mac_ht_flt: hash 0x80008000000000 UMHFR0 0x0 UMHFR1 0x800080
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Claudiu Manoil <claudiu.manoil@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Vladimir Oltean [Wed, 24 Mar 2021 15:44:54 +0000 (17:44 +0200)]
net: enetc: don't depend on system endianness in enetc_set_vlan_ht_filter
ENETC has a 64-entry hash table for VLAN RX filtering per Station
Interface, which is accessed through two 32-bit registers: VHFR0 holding
the low portion, and VHFR1 holding the high portion.
The enetc_set_vlan_ht_filter function looks at the pf->vlan_ht_filter
bitmap, which is fundamentally an unsigned long variable, and casts it
to a u32 array of two elements. It puts the first u32 element into VHFR0
and the second u32 element into VHFR1.
It is easy to imagine that this will not work on big endian systems
(although, yes, we have bigger problems, because currently enetc assumes
that the CPU endianness is equal to the controller endianness, aka
little endian - but let's assume that we could add a cpu_to_le32 in
enetc_wd_reg and a le32_to_cpu in enetc_rd_reg).
Let's use lower_32_bits and upper_32_bits which are designed to work
regardless of endianness.
Tested that both the old and the new method produce the same results:
$ ethtool -K eth1 rx-vlan-filter on
$ ip link add link eth1 name eth1.100 type vlan id 100
enetc_set_vlan_ht_filter: method 1: si_idx 0 VHFR0 0x0 VHFR1 0x20
enetc_set_vlan_ht_filter: method 2: si_idx 0 VHFR0 0x0 VHFR1 0x20
$ ip link add link eth1 name eth1.101 type vlan id 101
enetc_set_vlan_ht_filter: method 1: si_idx 0 VHFR0 0x0 VHFR1 0x30
enetc_set_vlan_ht_filter: method 2: si_idx 0 VHFR0 0x0 VHFR1 0x30
$ ip link add link eth1 name eth1.34 type vlan id 34
enetc_set_vlan_ht_filter: method 1: si_idx 0 VHFR0 0x0 VHFR1 0x34
enetc_set_vlan_ht_filter: method 2: si_idx 0 VHFR0 0x0 VHFR1 0x34
$ ip link add link eth1 name eth1.1024 type vlan id 1024
enetc_set_vlan_ht_filter: method 1: si_idx 0 VHFR0 0x1 VHFR1 0x34
enetc_set_vlan_ht_filter: method 2: si_idx 0 VHFR0 0x1 VHFR1 0x34
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Claudiu Manoil <claudiu.manoil@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Wei Yongjun [Wed, 24 Mar 2021 14:42:20 +0000 (14:42 +0000)]
netdevsim: switch to memdup_user_nul()
Use memdup_user_nul() helper instead of open-coding to
simplify the code.
Reported-by: Hulk Robot <hulkci@huawei.com>
Signed-off-by: Wei Yongjun <weiyongjun1@huawei.com>
Reviewed-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Sai Kalyaan Palla [Tue, 23 Mar 2021 17:44:19 +0000 (23:14 +0530)]
net: decnet: Fixed multiple Coding Style issues
Made changes to coding style as suggested by checkpatch.pl
changes are of the type:
space required before the open parenthesis '('
space required after that ','
Signed-off-by: Sai Kalyaan Palla <saikalyaan63@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Wed, 24 Mar 2021 22:20:08 +0000 (15:20 -0700)]
Merge branch 'phy-c45-loopback'
Wong Vee Khee says:
====================
Add support for Clause-45 PHY Loopback
This patch series add support for Clause-45 PHY loopback.
It involves adding a generic API in the PHY framework, which can be
accessed by all C45 PHY drivers using the .set_loopback callback.
Also, enable PHY loopback for the Marvell 88x3310/88x2110 driver.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Wong Vee Khee [Tue, 23 Mar 2021 16:46:41 +0000 (00:46 +0800)]
net: phy: marvell10g: Add PHY loopback support
Add support for PHY loopback for Marvell 88x2110 and Marvell 88x3310.
This allow user to perform PHY loopback test using ethtool selftest.
Signed-off-by: Wong Vee Khee <vee.khee.wong@linux.intel.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
Wong Vee Khee [Tue, 23 Mar 2021 16:46:40 +0000 (00:46 +0800)]
net: phy: add genphy_c45_loopback
Add generic code to enable C45 PHY loopback into the common phy-c45.c
file. This will allow C45 PHY drivers aceess this by setting
.set_loopback.
Suggested-by: Heiner Kallweit <hkallweit1@gmail.com>
Signed-off-by: Wong Vee Khee <vee.khee.wong@linux.intel.com>
Reviewed-by: Heiner Kallweit <hkallweit1@gmail.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
Arnd Bergmann [Tue, 23 Mar 2021 13:03:32 +0000 (14:03 +0100)]
rhashtable: avoid -Wrestrict warning on overlapping sprintf output
sprintf() is declared with a restrict keyword to not allow input and
output to point to the same buffer:
lib/test_rhashtable.c: In function 'print_ht':
lib/test_rhashtable.c:504:4: error: 'sprintf' argument 3 overlaps destination object 'buff' [-Werror=restrict]
504 | sprintf(buff, "%s\nbucket[%d] -> ", buff, i);
| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
lib/test_rhashtable.c:489:7: note: destination object referenced by 'restrict'-qualified argument 1 was declared here
489 | char buff[512] = "";
| ^~~~
Rework this function to remember the last offset instead to
avoid the warning.
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
Arnd Bergmann [Tue, 23 Mar 2021 12:53:29 +0000 (13:53 +0100)]
octeontx2: fix -Wnonnull warning
When compile testing this driver on a platform on which probe() is
known to fail at compile time, gcc warns about the cgx_lmactype_string[]
array being uninitialized:
In function 'strncpy',
inlined from 'link_status_user_format' at /git/arm-soc/drivers/net/ethernet/marvell/octeontx2/af/cgx.c:838:2,
inlined from 'cgx_link_change_handler' at /git/arm-soc/drivers/net/ethernet/marvell/octeontx2/af/cgx.c:853:2:
include/linux/fortify-string.h:27:30: error: argument 2 null where non-null expected [-Werror=nonnull]
27 | #define __underlying_strncpy __builtin_strncpy
Address this by turning the runtime initialization into a fixed array,
which should also produce better code.
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Acked-by: Sunil Goutham <sgoutham@marvell.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Tan Tee Min [Tue, 23 Mar 2021 11:07:34 +0000 (19:07 +0800)]
net: stmmac: Add hardware supported cross-timestamp
Cross timestamping is supported on Integrated Ethernet Controller in
Intel SoC such as EHL and TGL with Always Running Timer.
The hardware cross-timestamp result is made available to
applications through the PTP_SYS_OFFSET_PRECISE ioctl which calls
stmmac_getcrosststamp().
Device time is stored in the MAC Auxiliary register. The 64-bit System
time (ART timestamp) is stored in registers that are only addressable
by using MDIO space.
Signed-off-by: Tan Tee Min <tee.min.tan@intel.com>
Co-developed-by: Wong Vee Khee <vee.khee.wong@linux.intel.com>
Signed-off-by: Wong Vee Khee <vee.khee.wong@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Bhaskar Chowdhury [Wed, 24 Mar 2021 07:52:04 +0000 (13:22 +0530)]
sfc-falcon: Fix a typo
s/maintaning/maintaining/
Signed-off-by: Bhaskar Chowdhury <unixbhaskar@gmail.com>
Acked-by: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Bhaskar Chowdhury [Wed, 24 Mar 2021 07:46:37 +0000 (13:16 +0530)]
net: sched: Mundane typo fixes
s/procdure/procedure/
s/maintanance/maintenance/
Signed-off-by: Bhaskar Chowdhury <unixbhaskar@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Bhaskar Chowdhury [Tue, 23 Mar 2021 09:43:27 +0000 (15:13 +0530)]
octeontx2-af: Few mundane typos fixed
s/preceeds/precedes/ .....two different places
s/rsponse/response/
s/cetain/certain/
s/precison/precision/
Fix a sentence construction as per suggestion.
Signed-off-by: Bhaskar Chowdhury <unixbhaskar@gmail.com>
Acked-by: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Wed, 24 Mar 2021 19:48:40 +0000 (12:48 -0700)]
Merge branch 'netfilter-flowtable'
Pablo Neira Ayuso says:
====================
netfilter: flowtable enhancements
[ This is v2 that includes documentation enhancements, including
existing limitations. This is a rebase on top on net-next. ]
The following patchset augments the Netfilter flowtable fastpath to
support for network topologies that combine IP forwarding, bridge,
classic VLAN devices, bridge VLAN filtering, DSA and PPPoE. This
includes support for the flowtable software and hardware datapaths.
The following pictures provides an example scenario:
fast path!
.------------------------.
/ \
| IP forwarding |
| / \ \/
| br0 wan ..... eth0
. / \ host C
-> veth1 veth2
. switch/router
.
.
eth0
host A
The bridge master device 'br0' has an IP address and a DHCP server is
also assumed to be running to provide connectivity to host A which
reaches the Internet through 'br0' as default gateway. Then, packet
enters the IP forwarding path and Netfilter is used to NAT the packets
before they leave through the wan device.
The general idea is to accelerate forwarding by building a fast path
that takes packets from the ingress path of the bridge port and place
them in the egress path of the wan device (and vice versa). Hence,
skipping the classic bridge and IP stack paths.
** Patch from #1 to #6 add the infrastructure which describes the list of
netdevice hops to reach a given destination MAC address in the local
network topology.
Patch #1 adds dev_fill_forward_path() and .ndo_fill_forward_path() to
netdev_ops.
Patch #2 adds .ndo_fill_forward_path for vlan devices, which provides
the next device hop via vlan->real_dev, the vlan ID and the
protocol.
Patch #3 adds .ndo_fill_forward_path for bridge devices, which allows to make
lookups to the FDB to locate the next device hop (bridge port) in the
forwarding path.
Patch #4 extends bridge .ndo_fill_forward_path to support for bridge VLAN
filtering.
Patch #5 adds .ndo_fill_forward_path for PPPoE devices.
Patch #6 adds .ndo_fill_forward_path for DSA.
Patches from #7 to #14 update the flowtable software datapath:
Patch #7 adds the transmit path type field to the flow tuple. Two transmit
paths are supported so far: the neighbour and the xfrm transmit
paths.
Patch #8 and #9 update the flowtable datapath to use dev_fill_forward_path()
to obtain the real ingress/egress device for the flowtable datapath.
This adds the new ethernet xmit direct path to the flowtable.
Patch #10 adds native flowtable VLAN support (up to 2 VLAN tags) through
dev_fill_forward_path(). The flowtable stores the VLAN id and
protocol in the flow tuple.
Patch #11 adds native flowtable bridge VLAN filter support through
dev_fill_forward_path().
Patch #12 adds native flowtable bridge PPPoE through dev_fill_forward_path().
Patch #13 adds DSA support through dev_fill_forward_path().
Patch #14 extends flowtable selftests to cover for flowtable software
datapath enhancements.
** Patches from #15 to #20 update the flowtable hardware offload datapath:
Patch #15 extends the flowtable hardware offload to support for the
direct ethernet xmit path. This also includes VLAN support.
Patch #16 stores the egress real device in the flow tuple. The software
flowtable datapath uses dev_hard_header() to transmit packets,
hence it might refer to VLAN/DSA/PPPoE software device, not
the real ethernet device.
Patch #17 deals with switchdev PVID hardware offload to skip it on
egress.
Patch #18 adds FLOW_ACTION_PPPOE_PUSH to the flow_offload action API.
Patch #19 extends the flowtable hardware offload to support for PPPoE
Patch #20 adds TC_SETUP_FT support for DSA.
** Patches from #20 to #23: Felix Fietkau adds a new driver which support
hardware offload for the mtk PPE engine through the existing flow
offload API which supports for the flowtable enhancements coming in
this batch.
Patch #24 extends the documentation and describe existing limitations.
Please, apply, thanks.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Pablo Neira Ayuso [Wed, 24 Mar 2021 01:30:55 +0000 (02:30 +0100)]
docs: nf_flowtable: update documentation with enhancements
This patch updates the flowtable documentation to describe recent
enhancements:
- Offload action is available after the first packets go through the
classic forwarding path.
- IPv4 and IPv6 are supported. Only TCP and UDP layer 4 are supported at
this stage.
- Tuple has been augmented to track VLAN id and PPPoE session id.
- Bridge and IP forwarding integration, including bridge VLAN filtering
support.
- Hardware offload support.
- Describe the [OFFLOAD] and [HW_OFFLOAD] tags in the conntrack table
listing.
- Replace 'flow offload' by 'flow add' in example rulesets (preferred
syntax).
- Describe existing cache limitations.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Felix Fietkau [Wed, 24 Mar 2021 01:30:54 +0000 (02:30 +0100)]
net: ethernet: mtk_eth_soc: add flow offloading support
This adds support for offloading IPv4 routed flows, including SNAT/DNAT,
one VLAN, PPPoE and DSA.
Signed-off-by: Felix Fietkau <nbd@nbd.name>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Felix Fietkau [Wed, 24 Mar 2021 01:30:53 +0000 (02:30 +0100)]
net: ethernet: mtk_eth_soc: add support for initializing the PPE
The PPE (packet processing engine) is used to offload NAT/routed or even
bridged flows. This patch brings up the PPE and uses it to get a packet
hash. It also contains some functionality that will be used to bring up
flow offloading.
Signed-off-by: Felix Fietkau <nbd@nbd.name>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Felix Fietkau [Wed, 24 Mar 2021 01:30:52 +0000 (02:30 +0100)]
net: ethernet: mtk_eth_soc: fix parsing packets in GDM
When using DSA, set the special tag in GDM ingress control to allow the MAC
to parse packets properly earlier. This affects rx DMA source port reporting.
Signed-off-by: Felix Fietkau <nbd@nbd.name>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pablo Neira Ayuso [Wed, 24 Mar 2021 01:30:51 +0000 (02:30 +0100)]
dsa: slave: add support for TC_SETUP_FT
The dsa infrastructure provides a well-defined hierarchy of devices,
pass up the call to set up the flow block to the master device. From the
software dataplane, the netfilter infrastructure uses the dsa slave
devices to refer to the input and output device for the given skbuff.
Similarly, the flowtable definition in the ruleset refers to the dsa
slave port devices.
This patch adds the glue code to call ndo_setup_tc with TC_SETUP_FT
with the master device via the dsa slave devices.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pablo Neira Ayuso [Wed, 24 Mar 2021 01:30:50 +0000 (02:30 +0100)]
netfilter: flowtable: support for FLOW_ACTION_PPPOE_PUSH
Add a PPPoE push action if layer 2 protocol is ETH_P_PPP_SES to add
PPPoE flowtable hardware offload support.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pablo Neira Ayuso [Wed, 24 Mar 2021 01:30:49 +0000 (02:30 +0100)]
net: flow_offload: add FLOW_ACTION_PPPOE_PUSH
Add an action to represent the PPPoE hardware offload support that
includes the session ID.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Felix Fietkau [Wed, 24 Mar 2021 01:30:48 +0000 (02:30 +0100)]
netfilter: flowtable: bridge vlan hardware offload and switchdev
The switch might have already added the VLAN tag through PVID hardware
offload. Keep this extra VLAN in the flowtable but skip it on egress.
Signed-off-by: Felix Fietkau <nbd@nbd.name>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pablo Neira Ayuso [Wed, 24 Mar 2021 01:30:47 +0000 (02:30 +0100)]
netfilter: nft_flow_offload: use direct xmit if hardware offload is enabled
If there is a forward path to reach an ethernet device and hardware
offload is enabled, then use the direct xmit path.
Moreover, store the real device in the direct xmit path info since
software datapath uses dev_hard_header() to push the layer encapsulation
headers while hardware offload refers to the real device.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pablo Neira Ayuso [Wed, 24 Mar 2021 01:30:46 +0000 (02:30 +0100)]
netfilter: flowtable: add offload support for xmit path types
When the flow tuple xmit_type is set to FLOW_OFFLOAD_XMIT_DIRECT, the
dst_cache pointer is not valid, and the h_source/h_dest/ifidx out fields
need to be used.
This patch also adds the FLOW_ACTION_VLAN_PUSH action to pass the VLAN
tag to the driver.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pablo Neira Ayuso [Wed, 24 Mar 2021 01:30:45 +0000 (02:30 +0100)]
selftests: netfilter: flowtable bridge and vlan support
This patch adds two new tests to cover bridge and vlan support:
- Add a bridge device to the Router1 (nsr1) container and attach the
veth0 device to the bridge. Set the IP address to the bridge device
to exercise the bridge forwarding path.
- Add vlan encapsulation between to the bridge device in the Router1 and
one of the sender containers (ns1).
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pablo Neira Ayuso [Wed, 24 Mar 2021 01:30:44 +0000 (02:30 +0100)]
netfilter: flowtable: add dsa support
Replace the master ethernet device by the dsa slave port. Packets coming
in from the software ingress path use the dsa slave port as input
device.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pablo Neira Ayuso [Wed, 24 Mar 2021 01:30:43 +0000 (02:30 +0100)]
netfilter: flowtable: add pppoe support
Add the PPPoE protocol and session id to the flow tuple using the encap
fields to uniquely identify flows from the receive path. For the
transmit path, dev_hard_header() on the vlan device push the headers.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pablo Neira Ayuso [Wed, 24 Mar 2021 01:30:42 +0000 (02:30 +0100)]
netfilter: flowtable: add bridge vlan filtering support
Add the vlan tag based when PVID is set on.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pablo Neira Ayuso [Wed, 24 Mar 2021 01:30:41 +0000 (02:30 +0100)]
netfilter: flowtable: add vlan support
Add the vlan id and protocol to the flow tuple to uniquely identify
flows from the receive path. For the transmit path, dev_hard_header() on
the vlan device push the headers. This patch includes support for two
vlan headers (QinQ) from the ingress path.
Add a generic encap field to the flowtable entry which stores the
protocol and the tag id. This allows to reuse these fields in the PPPoE
support coming in a later patch.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pablo Neira Ayuso [Wed, 24 Mar 2021 01:30:40 +0000 (02:30 +0100)]
netfilter: flowtable: use dev_fill_forward_path() to obtain egress device
The egress device in the tuple is obtained from route. Use
dev_fill_forward_path() instead to provide the real egress device for
this flow whenever this is available.
The new FLOW_OFFLOAD_XMIT_DIRECT type uses dev_queue_xmit() to transmit
ethernet frames. Cache the source and destination hardware address to
use dev_queue_xmit() to transfer packets.
The FLOW_OFFLOAD_XMIT_DIRECT replaces FLOW_OFFLOAD_XMIT_NEIGH if
dev_fill_forward_path() finds a direct transmit path.
In case of topology updates, if peer is moved to different bridge port,
the connection will time out, reconnect will result in a new entry with
the correct path. Snooping fdb updates would allow for cleaning up stale
flowtable entries.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pablo Neira Ayuso [Wed, 24 Mar 2021 01:30:39 +0000 (02:30 +0100)]
netfilter: flowtable: use dev_fill_forward_path() to obtain ingress device
Obtain the ingress device in the tuple from the route in the reply
direction. Use dev_fill_forward_path() instead to get the real ingress
device for this flow.
Fall back to use the ingress device that the IP forwarding route
provides if:
- dev_fill_forward_path() finds no real ingress device.
- the ingress device that is obtained is not part of the flowtable
devices.
- this route has a xfrm policy.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pablo Neira Ayuso [Wed, 24 Mar 2021 01:30:38 +0000 (02:30 +0100)]
netfilter: flowtable: add xmit path types
Add the xmit_type field that defines the two supported xmit paths in the
flowtable data plane, which are the neighbour and the xfrm xmit paths.
This patch prepares for new flowtable xmit path types to come.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Felix Fietkau [Wed, 24 Mar 2021 01:30:37 +0000 (02:30 +0100)]
net: dsa: resolve forwarding path for dsa slave ports
Add .ndo_fill_forward_path for dsa slave port devices
Signed-off-by: Felix Fietkau <nbd@nbd.name>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Felix Fietkau [Wed, 24 Mar 2021 01:30:36 +0000 (02:30 +0100)]
net: ppp: resolve forwarding path for bridge pppoe devices
Pass on the PPPoE session ID, destination hardware address and the real
device.
Signed-off-by: Felix Fietkau <nbd@nbd.name>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Felix Fietkau [Wed, 24 Mar 2021 01:30:35 +0000 (02:30 +0100)]
net: bridge: resolve forwarding path for VLAN tag actions in bridge devices
Depending on the VLAN settings of the bridge and the port, the bridge can
either add or remove a tag. When vlan filtering is enabled, the fdb lookup
also needs to know the VLAN tag/proto for the destination address
To provide this, keep track of the stack of VLAN tags for the path in the
lookup context
Signed-off-by: Felix Fietkau <nbd@nbd.name>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pablo Neira Ayuso [Wed, 24 Mar 2021 01:30:34 +0000 (02:30 +0100)]
net: bridge: resolve forwarding path for bridge devices
Add .ndo_fill_forward_path for bridge devices.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pablo Neira Ayuso [Wed, 24 Mar 2021 01:30:33 +0000 (02:30 +0100)]
net: 8021q: resolve forwarding path for vlan devices
Add .ndo_fill_forward_path for vlan devices.
For instance, assuming the following topology:
IP forwarding
/ \
eth0.100 eth0
|
eth0
.
.
.
ethX
ab:cd:ef:ab:cd:ef
For packets going through IP forwarding to eth0.100 whose destination
MAC address is ab:cd:ef:ab:cd:ef, dev_fill_forward_path() provides the
following path:
eth0.100 -> eth0
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pablo Neira Ayuso [Wed, 24 Mar 2021 01:30:32 +0000 (02:30 +0100)]
net: resolve forwarding path from virtual netdevice and HW destination address
This patch adds dev_fill_forward_path() which resolves the path to reach
the real netdevice from the IP forwarding side. This function takes as
input the netdevice and the destination hardware address and it walks
down the devices calling .ndo_fill_forward_path() for each device until
the real device is found.
For instance, assuming the following topology:
IP forwarding
/ \
br0 eth0
/ \
eth1 eth2
.
.
.
ethX
ab:cd:ef:ab:cd:ef
where eth1 and eth2 are bridge ports and eth0 provides WAN connectivity.
ethX is the interface in another box which is connected to the eth1
bridge port.
For packets going through IP forwarding to br0 whose destination MAC
address is ab:cd:ef:ab:cd:ef, dev_fill_forward_path() provides the
following path:
br0 -> eth1
.ndo_fill_forward_path for br0 looks up at the FDB for the bridge port
from the destination MAC address to get the bridge port eth1.
This information allows to create a fast path that bypasses the classic
bridge and IP forwarding paths, so packets go directly from the bridge
port eth1 to eth0 (wan interface) and vice versa.
fast path
.------------------------.
/ \
| IP forwarding |
| / \ \/
| br0 eth0
. / \
-> eth1 eth2
.
.
.
ethX
ab:cd:ef:ab:cd:ef
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Colin Ian King [Wed, 24 Mar 2021 15:09:50 +0000 (15:09 +0000)]
net: bridge: Fix missing return assignment from br_vlan_replay_one call
The call to br_vlan_replay_one is returning an error return value but
this is not being assigned to err and the following check on err is
currently always false because err was initialized to zero. Fix this
by assigning err.
Addresses-Coverity: ("'Constant' variable guards dead code")
Fixes:
22f67cdfae6a ("net: bridge: add helper to replay VLANs installed on port")
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Reviewed-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Acked-by: Nikolay Aleksandrov <nikolay@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Arnd Bergmann [Tue, 23 Mar 2021 21:52:50 +0000 (22:52 +0100)]
ch_ktls: fix enum-conversion warning
gcc points out an incorrect enum assignment:
drivers/net/ethernet/chelsio/inline_crypto/ch_ktls/chcr_ktls.c: In function 'chcr_ktls_cpl_set_tcb_rpl':
drivers/net/ethernet/chelsio/inline_crypto/ch_ktls/chcr_ktls.c:684:22: warning: implicit conversion from 'enum <anonymous>' to 'enum ch_ktls_open_state' [-Wenum-conversion]
This appears harmless, and should apparently use 'CH_KTLS_OPEN_SUCCESS'
instead of 'false', with the same value '0'.
Fixes:
efca3878a5fb ("ch_ktls: Issue if connection offload fails")
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
Colin Ian King [Tue, 23 Mar 2021 12:32:45 +0000 (12:32 +0000)]
octeontx2-af: Fix memory leak of object buf
Currently the error return path when lfs fails to allocate is not free'ing
the memory allocated to buf. Fix this by adding the missing kfree.
Addresses-Coverity: ("Resource leak")
Fixes:
f7884097141b ("octeontx2-af: Formatting debugfs entry rsrc_alloc.")
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Acked-by: Sunil Goutham <sgoutham@marvell.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Wed, 24 Mar 2021 19:14:08 +0000 (12:14 -0700)]
Merge branch 'bridge-mrp-next'
Horatiu Vultur says:
====================
bridge: mrp: Disable roles before deleting
The first patch in this series make sures that the driver is notified
that the role is disabled before the MRP instance is deleted. The
second patch uses this so it can simplify the driver.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Horatiu Vultur [Tue, 23 Mar 2021 08:33:47 +0000 (09:33 +0100)]
net: ocelot: Simplify MRP deletion
Now that the driver will always be notified that the role is deleted
before the ring is deleted, then we don't need to duplicate the logic of
cleaning the resources also in the delete function.
Signed-off-by: Horatiu Vultur <horatiu.vultur@microchip.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Horatiu Vultur [Tue, 23 Mar 2021 08:33:46 +0000 (09:33 +0100)]
bridge: mrp: Disable roles before deleting the MRP instance
When an MRP instance was created, the driver was notified that the
instance is created and then in a different callback about role of the
instance. But when the instance was deleted the driver was notified only
that the MRP instance is deleted and not also that the role is disabled.
This patch make sure that the driver is notified that the role is
changed to disabled before the MRP instance is deleted to have similar
callbacks with the creating of the instance. In this way it would
simplify the logic in the drivers.
Signed-off-by: Horatiu Vultur <horatiu.vultur@microchip.com>
Signed-off-by: David S. Miller <davem@davemloft.net>