Merge branch 'mm-rst' into docs-next

author Jonathan Corbet <corbet@lwn.net>

Mon, 16 Apr 2018 20:25:08 +0000 (14:25 -0600)

committer Jonathan Corbet <corbet@lwn.net>

Mon, 16 Apr 2018 20:25:08 +0000 (14:25 -0600)
diff --cc Documentation/admin-guide/kernel-parameters.txt
Simple merge
diff --cc Documentation/sysctl/vm.txt
Simple merge
diff --cc Documentation/vm/hmm.rst

index 0000000,3fafa33..cdf3911

mode 000000,100644..100644
--- /dev/null
--- 2/Documentation/vm/hmm.rst
+++ b/Documentation/vm/hmm.rst
@@@ -1,0 -1,374 +1,386 @@@
- -Transparently allow any component of a program to use any memory region of said
- -program with a device without using device specific memory allocator. This is
- -becoming a requirement to simplify the use of advance heterogeneous computing
- -where GPU, DSP or FPGA are use to perform various computations.
- -
- -This document is divided as follow, in the first section i expose the problems
- -related to the use of a device specific allocator. The second section i expose
- -the hardware limitations that are inherent to many platforms. The third section
- -gives an overview of HMM designs. The fourth section explains how CPU page-
- -table mirroring works and what is HMM purpose in this context. Fifth section
- -deals with how device memory is represented inside the kernel. Finaly the last
- -section present the new migration helper that allow to leverage the device DMA
- -engine.
+ .. _hmm:
+ 
+ =====================================
+ Heterogeneous Memory Management (HMM)
+ =====================================
+ 
- -Problems of using device specific memory allocator
- -==================================================
- -
- -Device with large amount of on board memory (several giga bytes) like GPU have
- -historically manage their memory through dedicated driver specific API. This
- -creates a disconnect between memory allocated and managed by device driver and
- -regular application memory (private anonymous, share memory or regular file
- -back memory). From here on i will refer to this aspect as split address space.
- -I use share address space to refer to the opposite situation ie one in which
- -any memory region can be use by device transparently.
- -
- -Split address space because device can only access memory allocated through the
- -device specific API. This imply that all memory object in a program are not
- -equal from device point of view which complicate large program that rely on a
- -wide set of libraries.
- -
- -Concretly this means that code that wants to leverage device like GPU need to
- -copy object between genericly allocated memory (malloc, mmap private/share/)
- -and memory allocated through the device driver API (this still end up with an
- -mmap but of the device file).
- -
- -For flat dataset (array, grid, image, ...) this isn't too hard to achieve but
- -complex data-set (list, tree, ...) are hard to get right. Duplicating a complex
- -data-set need to re-map all the pointer relations between each of its elements.
- -This is error prone and program gets harder to debug because of the duplicate
- -data-set.
- -
- -Split address space also means that library can not transparently use data they
- -are getting from core program or other library and thus each library might have
- -to duplicate its input data-set using specific memory allocator. Large project
- -suffer from this and waste resources because of the various memory copy.
- -
- -Duplicating each library API to accept as input or output memory allocted by
++Provide infrastructure and helpers to integrate non-conventional memory (device
++memory like GPU on board memory) into regular kernel path, with the cornerstone
++of this being specialized struct page for such memory (see sections 5 to 7 of
++this document).
++
++HMM also provides optional helpers for SVM (Shared Virtual Memory), i.e.,
++allowing a device to transparently access program address coherently with
++the CPU meaning that any valid pointer on the CPU is also a valid pointer
++for the device. This is becoming mandatory to simplify the use of advanced
++heterogeneous computing where GPU, DSP, or FPGA are used to perform various
++computations on behalf of a process.
++
++This document is divided as follows: in the first section I expose the problems
++related to using device specific memory allocators. In the second section, I
++expose the hardware limitations that are inherent to many platforms. The third
++section gives an overview of the HMM design. The fourth section explains how
++CPU page-table mirroring works and the purpose of HMM in this context. The
++fifth section deals with how device memory is represented inside the kernel.
++Finally, the last section presents a new migration helper that allows
++leveraging the device DMA engine.
+ 
+ .. contents:: :local:
+ 
- -combinatorial explosions in the library entry points.
++Problems of using a device specific memory allocator
++====================================================
++
++Devices with a large amount of on board memory (several gigabytes) like GPUs
++have historically managed their memory through dedicated driver specific APIs.
++This creates a disconnect between memory allocated and managed by a device
++driver and regular application memory (private anonymous, shared memory, or
++regular file backed memory). From here on I will refer to this aspect as split
++address space. I use shared address space to refer to the opposite situation:
++i.e., one in which any application memory region can be used by a device
++transparently.
++
++Split address space happens because a device can only access memory allocated
++through a device specific API. This implies that all memory objects in a program
++are not equal from the device point of view which complicates large programs
++that rely on a wide set of libraries.
++
++Concretely this means that code that wants to leverage devices like GPUs needs
++to copy objects between generically allocated memory (malloc, mmap private,
++mmap shared) and memory allocated through the device driver API (this still
++ends up with an mmap but of the device file).
++
++For flat data sets (array, grid, image, ...) this isn't too hard to achieve but
++complex data sets (list, tree, ...) are hard to get right. Duplicating a
++complex data set needs to re-map all the pointer relations between each of its
++elements. This is error prone and program gets harder to debug because of the
++duplicate data set and addresses.
++
++Split address space also means that libraries cannot transparently use data
++they are getting from the core program or another library and thus each library
++might have to duplicate its input data set using the device specific memory
++allocator. Large projects suffer from this and waste resources because of the
++various memory copies.
++
++Duplicating each library API to accept as input or output memory allocated by
+ each device specific allocator is not a viable option. It would lead to a
- -Finaly with the advance of high level language constructs (in C++ but in other
- -language too) it is now possible for compiler to leverage GPU or other devices
- -without even the programmer knowledge. Some of compiler identified patterns are
- -only do-able with a share address. It is as well more reasonable to use a share
- -address space for all the other patterns.
++combinatorial explosion in the library entry points.
+ 
- -System bus, device memory characteristics
- -=========================================
++Finally, with the advance of high level language constructs (in C++ but in
++other languages too) it is now possible for the compiler to leverage GPUs and
++other devices without programmer knowledge. Some compiler identified patterns
++are only do-able with a shared address space. It is also more reasonable to use
++a shared address space for all other patterns.
+ 
+ 
- -System bus cripple share address due to few limitations. Most system bus only
- -allow basic memory access from device to main memory, even cache coherency is
- -often optional. Access to device memory from CPU is even more limited, most
- -often than not it is not cache coherent.
++I/O bus, device memory characteristics
++======================================
+ 
- -If we only consider the PCIE bus than device can access main memory (often
- -through an IOMMU) and be cache coherent with the CPUs. However it only allows
- -a limited set of atomic operation from device on main memory. This is worse
- -in the other direction the CPUs can only access a limited range of the device
- -memory and can not perform atomic operations on it. Thus device memory can not
- -be consider like regular memory from kernel point of view.
++I/O buses cripple shared address spaces due to a few limitations. Most I/O
++buses only allow basic memory access from device to main memory; even cache
++coherency is often optional. Access to device memory from CPU is even more
++limited. More often than not, it is not cache coherent.
+ 
- -and 16 lanes). This is 33 times less that fastest GPU memory (1 TBytes/s).
- -The final limitation is latency, access to main memory from the device has an
- -order of magnitude higher latency than when the device access its own memory.
++If we only consider the PCIE bus, then a device can access main memory (often
++through an IOMMU) and be cache coherent with the CPUs. However, it only allows
++a limited set of atomic operations from device on main memory. This is worse
++in the other direction: the CPU can only access a limited range of the device
++memory and cannot perform atomic operations on it. Thus device memory cannot
++be considered the same as regular memory from the kernel point of view.
+ 
+ Another crippling factor is the limited bandwidth (~32GBytes/s with PCIE 4.0
- -Some platform are developing new system bus or additions/modifications to PCIE
- -to address some of those limitations (OpenCAPI, CCIX). They mainly allow two
++and 16 lanes). This is 33 times less than the fastest GPU memory (1 TBytes/s).
++The final limitation is latency. Access to main memory from the device has an
++order of magnitude higher latency than when the device accesses its own memory.
+ 
- -architecture supports. Saddly not all platform are following this trends and
- -some major architecture are left without hardware solutions to those problems.
++Some platforms are developing new I/O buses or additions/modifications to PCIE
++to address some of these limitations (OpenCAPI, CCIX). They mainly allow two-
+ way cache coherency between CPU and device and allow all atomic operations the
- -So for share address space to make sense not only we must allow device to
- -access any memory memory but we must also permit any memory to be migrated to
- -device memory while device is using it (blocking CPU access while it happens).
++architecture supports. Sadly, not all platforms are following this trend and
++some major architectures are left without hardware solutions to these problems.
+ 
- -Share address space and migration
- -=================================
++So for shared address space to make sense, not only must we allow devices to
++access any memory but we must also permit any memory to be migrated to device
++memory while device is using it (blocking CPU access while it happens).
+ 
+ 
- -space by duplication the CPU page table into the device page table so same
- -address point to same memory and this for any valid main memory address in
++Shared address space and migration
++==================================
+ 
+ HMM intends to provide two main features. First one is to share the address
- -To achieve this, HMM offer a set of helpers to populate the device page table
++space by duplicating the CPU page table in the device page table so the same
++address points to the same physical memory for any valid main memory address in
+ the process address space.
+ 
- -not as easy as CPU page table updates. To update the device page table you must
- -allow a buffer (or use a pool of pre-allocated buffer) and write GPU specifics
- -commands in it to perform the update (unmap, cache invalidations and flush,
- -...). This can not be done through common code for all device. Hence why HMM
- -provides helpers to factor out everything that can be while leaving the gory
- -details to the device driver.
- -
- -The second mechanism HMM provide is a new kind of ZONE_DEVICE memory that does
- -allow to allocate a struct page for each page of the device memory. Those page
- -are special because the CPU can not map them. They however allow to migrate
- -main memory to device memory using exhisting migration mechanism and everything
- -looks like if page was swap out to disk from CPU point of view. Using a struct
- -page gives the easiest and cleanest integration with existing mm mechanisms.
- -Again here HMM only provide helpers, first to hotplug new ZONE_DEVICE memory
- -for the device memory and second to perform migration. Policy decision of what
- -and when to migrate things is left to the device driver.
- -
- -Note that any CPU access to a device page trigger a page fault and a migration
- -back to main memory ie when a page backing an given address A is migrated from
- -a main memory page to a device page then any CPU access to address A trigger a
- -page fault and initiate a migration back to main memory.
- -
- -
- -With this two features, HMM not only allow a device to mirror a process address
- -space and keeps both CPU and device page table synchronize, but also allow to
- -leverage device memory by migrating part of data-set that is actively use by a
- -device.
++To achieve this, HMM offers a set of helpers to populate the device page table
+ while keeping track of CPU page table updates. Device page table updates are
- -Address space mirroring main objective is to allow to duplicate range of CPU
- -page table into a device page table and HMM helps keeping both synchronize. A
- -device driver that want to mirror a process address space must start with the
++not as easy as CPU page table updates. To update the device page table, you must
++allocate a buffer (or use a pool of pre-allocated buffers) and write GPU
++specific commands in it to perform the update (unmap, cache invalidations, and
++flush, ...). This cannot be done through common code for all devices. Hence
++why HMM provides helpers to factor out everything that can be while leaving the
++hardware specific details to the device driver.
++
++The second mechanism HMM provides is a new kind of ZONE_DEVICE memory that
++allows allocating a struct page for each page of the device memory. Those pages
++are special because the CPU cannot map them. However, they allow migrating
++main memory to device memory using existing migration mechanisms and everything
++looks like a page is swapped out to disk from the CPU point of view. Using a
++struct page gives the easiest and cleanest integration with existing mm
++mechanisms. Here again, HMM only provides helpers, first to hotplug new
++ZONE_DEVICE memory for the device memory and second to perform migration.
++Policy decisions of what and when to migrate are left to the device driver.
++
++Note that any CPU access to a device page triggers a page fault and a migration
++back to main memory. For example, when a page backing a given CPU address A is
++migrated from a main memory page to a device page, then any CPU access to
++address A triggers a page fault and initiates a migration back to main memory.
++
++With these two features, HMM not only allows a device to mirror a process
++address space and keep both CPU and device page tables synchronized, but also
++leverages device memory by migrating the part of the data set that is actively
++being used by the device.
+ 
+ 
+ Address space mirroring implementation and API
+ ==============================================
+ 
- -The locked variant is to be use when the driver is already holding the mmap_sem
- -of the mm in write mode. The mirror struct has a set of callback that are use
- -to propagate CPU page table::
++Address space mirroring's main objective is to allow duplication of a range of
++CPU page table into a device page table; HMM helps keep both synchronized. A
++device driver that wants to mirror a process address space must start with the
+ registration of an hmm_mirror struct::
+ 
+  int hmm_mirror_register(struct hmm_mirror *mirror,
+                          struct mm_struct *mm);
+  int hmm_mirror_register_locked(struct hmm_mirror *mirror,
+                                 struct mm_struct *mm);
+ 
- -Device driver must perform update to the range following action (turn range
- -read only, or fully unmap, ...). Once driver callback returns the device must
- -be done with the update.
- -
++
++The locked variant is to be used when the driver is already holding mmap_sem
++of the mm in write mode. The mirror struct has a set of callbacks that are used
++to propagate CPU page tables::
+ 
+  struct hmm_mirror_ops {
+      /* sync_cpu_device_pagetables() - synchronize page tables
+       *
+       * @mirror: pointer to struct hmm_mirror
+       * @update_type: type of update that occurred to the CPU page table
+       * @start: virtual start address of the range to update
+       * @end: virtual end address of the range to update
+       *
+       * This callback ultimately originates from mmu_notifiers when the CPU
+       * page table is updated. The device driver must update its page table
+       * in response to this callback. The update argument tells what action
+       * to perform.
+       *
+       * The device driver must not return from this callback until the device
+       * page tables are completely updated (TLBs flushed, etc); this is a
+       * synchronous call.
+       */
+       void (*update)(struct hmm_mirror *mirror,
+                      enum hmm_update action,
+                      unsigned long start,
+                      unsigned long end);
+  };
+ 
- -When device driver wants to populate a range of virtual address it can use
- -either::
++The device driver must perform the update action to the range (mark range
++read only, or fully unmap, ...). The device must be done with the update before
++the driver callback returns.
+ 
- - int hmm_vma_get_pfns(struct vm_area_struct *vma,
++When the device driver wants to populate a range of virtual addresses, it can
++use either::
+ 
- -First one (hmm_vma_get_pfns()) will only fetch present CPU page table entry and
- -will not trigger a page fault on missing or non present entry. The second one
- -do trigger page fault on missing or read only entry if write parameter is true.
- -Page fault use the generic mm page fault code path just like a CPU page fault.
++ int hmm_vma_get_pfns(struct vm_area_struct *vma,
+                       struct hmm_range *range,
+                       unsigned long start,
+                       unsigned long end,
+                       hmm_pfn_t *pfns);
+  int hmm_vma_fault(struct vm_area_struct *vma,
+                    struct hmm_range *range,
+                    unsigned long start,
+                    unsigned long end,
+                    hmm_pfn_t *pfns,
+                    bool write,
+                    bool block);
+ 
- -Both function copy CPU page table into their pfns array argument. Each entry in
- -that array correspond to an address in the virtual range. HMM provide a set of
- -flags to help driver identify special CPU page table entries.
++The first one (hmm_vma_get_pfns()) will only fetch present CPU page table
++entries and will not trigger a page fault on missing or non-present entries.
++The second one does trigger a page fault on missing or read-only entry if the
++write parameter is true. Page faults use the generic mm page fault code path
++just like a CPU page fault.
+ 
- -respect in order to keep things properly synchronize. The usage pattern is::
++Both functions copy CPU page table entries into their pfns array argument. Each
++entry in that array corresponds to an address in the virtual range. HMM
++provides a set of flags to help the driver identify special CPU page table
++entries.
+ 
+ Locking with the update() callback is the most important aspect the driver must
- -The driver->update lock is the same lock that driver takes inside its update()
- -callback. That lock must be call before hmm_vma_range_done() to avoid any race
- -with a concurrent CPU page table update.
++respect in order to keep things properly synchronized. The usage pattern is::
+ 
+  int driver_populate_range(...)
+  {
+       struct hmm_range range;
+       ...
+  again:
+       ret = hmm_vma_get_pfns(vma, &range, start, end, pfns);
+       if (ret)
+           return ret;
+       take_lock(driver->update);
+       if (!hmm_vma_range_done(vma, &range)) {
+           release_lock(driver->update);
+           goto again;
+       }
+ 
+       // Use pfns array content to update device page table
+ 
+       release_lock(driver->update);
+       return 0;
+  }
+ 
- -HMM implements all this on top of the mmu_notifier API because we wanted to a
- -simpler API and also to be able to perform optimization latter own like doing
- -concurrent device update in multi-devices scenario.
++The driver->update lock is the same lock that the driver takes inside its
++update() callback. That lock must be held before hmm_vma_range_done() to avoid
++any race with a concurrent CPU page table update.
+ 
- -HMM also serve as an impedence missmatch between how CPU page table update are
- -done (by CPU write to the page table and TLB flushes) from how device update
- -their own page table. Device update is a multi-step process, first appropriate
- -commands are write to a buffer, then this buffer is schedule for execution on
- -the device. It is only once the device has executed commands in the buffer that
- -the update is done. Creating and scheduling update command buffer can happen
- -concurrently for multiple devices. Waiting for each device to report commands
- -as executed is serialize (there is no point in doing this concurrently).
++HMM implements all this on top of the mmu_notifier API because we wanted a
++simpler API and also to be able to perform optimizations later on, such as doing
++concurrent device updates in multi-device scenarios.
+ 
- -Several differents design were try to support device memory. First one use
- -device specific data structure to keep information about migrated memory and
- -HMM hooked itself in various place of mm code to handle any access to address
- -that were back by device memory. It turns out that this ended up replicating
- -most of the fields of struct page and also needed many kernel code path to be
- -updated to understand this new kind of memory.
++HMM also serves as an impedance mismatch between how CPU page table updates
++are done (by CPU write to the page table and TLB flushes) and how devices
++update their own page table. Device updates are a multi-step process. First,
++appropriate commands are written to a buffer, then this buffer is scheduled for
++execution on the device. It is only once the device has executed commands in
++the buffer that the update is done. Creating and scheduling the update command
++buffer can happen concurrently for multiple devices. Waiting for each device to
++report commands as executed is serialized (there is no point in doing this
++concurrently).
+ 
+ 
+ Represent and manage device memory from core kernel point of view
+ =================================================================
+ 
- -Thing is most kernel code path never try to access the memory behind a page
- -but only care about struct page contents. Because of this HMM switchted to
- -directly using struct page for device memory which left most kernel code path
- -un-aware of the difference. We only need to make sure that no one ever try to
- -map those page from the CPU side.
++Several different designs were tried to support device memory. First one used
++a device specific data structure to keep information about migrated memory and
++HMM hooked itself in various places of mm code to handle any access to
++addresses that were backed by device memory. It turns out that this ended up
++replicating most of the fields of struct page and also needed many kernel code
++paths to be updated to understand this new kind of memory.
+ 
- -HMM provide a set of helpers to register and hotplug device memory as a new
- -region needing struct page. This is offer through a very simple API::
++Most kernel code paths never try to access the memory behind a page
++but only care about struct page contents. Because of this, HMM switched to
++directly using struct page for device memory which left most kernel code paths
++unaware of the difference. We only need to make sure that no one ever tries to
++map those pages from the CPU side.
+ 
- -drop. This means the device page is now free and no longer use by anyone. The
- -second callback happens whenever CPU try to access a device page which it can
- -not do. This second callback must trigger a migration back to system memory.
++HMM provides a set of helpers to register and hotplug device memory as a new
++region needing a struct page. This is offered through a very simple API::
+ 
+  struct hmm_devmem *hmm_devmem_add(const struct hmm_devmem_ops *ops,
+                                    struct device *device,
+                                    unsigned long size);
+  void hmm_devmem_remove(struct hmm_devmem *devmem);
+ 
+ The hmm_devmem_ops is where most of the important things are::
+ 
+  struct hmm_devmem_ops {
+      void (*free)(struct hmm_devmem *devmem, struct page *page);
+      int (*fault)(struct hmm_devmem *devmem,
+                   struct vm_area_struct *vma,
+                   unsigned long addr,
+                   struct page *page,
+                   unsigned flags,
+                   pmd_t *pmdp);
+  };
+ 
+ The first callback (free()) happens when the last reference on a device page is
- -Migrate to and from device memory
- -=================================
++dropped. This means the device page is now free and no longer used by anyone.
++The second callback happens whenever the CPU tries to access a device page
++which it cannot do. This second callback must trigger a migration back to
++system memory.
+ 
+ 
- -Because CPU can not access device memory, migration must use device DMA engine
- -to perform copy from and to device memory. For this we need a new migration
- -helper::
++Migration to and from device memory
++===================================
+ 
- -Unlike other migration function it works on a range of virtual address, there
- -is two reasons for that. First device DMA copy has a high setup overhead cost
++Because the CPU cannot access device memory, migration must use the device DMA
++engine to perform copy from and to device memory. For this we need a new
++migration helper::
+ 
+  int migrate_vma(const struct migrate_vma_ops *ops,
+                  struct vm_area_struct *vma,
+                  unsigned long mentries,
+                  unsigned long start,
+                  unsigned long end,
+                  unsigned long *src,
+                  unsigned long *dst,
+                  void *private);
+ 
- -make the whole excersie pointless. The second reason is because driver trigger
- -such migration base on range of address the device is actively accessing.
++Unlike other migration functions, it works on a range of virtual addresses, for
++two reasons. First, device DMA copy has a high setup overhead cost
+ and thus batching multiple pages is needed as otherwise the migration overhead
- -The migrate_vma_ops struct define two callbacks. First one (alloc_and_copy())
- -control destination memory allocation and copy operation. Second one is there
- -to allow device driver to perform cleanup operation after migration::
++makes the whole exercise pointless. The second reason is because the
++migration might be for a range of addresses the device is actively accessing.
+ 
- -It is important to stress that this migration helpers allow for hole in the
++The migrate_vma_ops struct defines two callbacks. First one (alloc_and_copy())
++controls destination memory allocation and copy operation. Second one is there
++to allow the device driver to perform cleanup operations after migration::
+ 
+  struct migrate_vma_ops {
+      void (*alloc_and_copy)(struct vm_area_struct *vma,
+                             const unsigned long *src,
+                             unsigned long *dst,
+                             unsigned long start,
+                             unsigned long end,
+                             void *private);
+      void (*finalize_and_map)(struct vm_area_struct *vma,
+                               const unsigned long *src,
+                               const unsigned long *dst,
+                               unsigned long start,
+                               unsigned long end,
+                               void *private);
+  };
+ 
- -the usual reasons (page is pin, page is lock, ...). This helper does not fail
- -but just skip over those pages.
++It is important to stress that these migration helpers allow for holes in the
+ virtual address range. Some pages in the range might not be migrated for all
- -The alloc_and_copy() might as well decide to not migrate all pages in the
- -range (for reasons under the callback control). For those the callback just
- -have to leave the corresponding dst entry empty.
++the usual reasons (page is pinned, page is locked, ...). This helper does not
++fail but just skips over those pages.
+ 
- -Finaly the migration of the struct page might fails (for file back page) for
++The alloc_and_copy() might decide to not migrate all pages in the
++range (for reasons under the callback control). For those, the callback just
++has to leave the corresponding dst entry empty.
+ 
- -that happens then the finalize_and_map() can catch any pages that was not
- -migrated. Note those page were still copied to new page and thus we wasted
++Finally, the migration of the struct page might fail (for file backed page) for
+ various reasons (failure to freeze reference, or update page cache, ...). If
- -anonymous if device page is use for anonymous, file if device page is use for
- -file back page or shmem if device page is use for share memory). This is a
- -deliberate choice to keep existing application that might start using device
- -memory without knowing about it to keep runing unimpacted.
- -
- -Drawbacks is that OOM killer might kill an application using a lot of device
- -memory and not a lot of regular system memory and thus not freeing much system
- -memory. We want to gather more real world experience on how application and
- -system react under memory pressure in the presence of device memory before
++that happens, then the finalize_and_map() can catch any pages that were not
++migrated. Note those pages were still copied to a new page and thus we wasted
+ bandwidth but this is considered as a rare event and a price that we are
+ willing to pay to keep all the code simpler.
+ 
+ 
+ Memory cgroup (memcg) and rss accounting
+ ========================================
+ 
+ For now device memory is accounted as any regular page in rss counters (either
- -Same decision was made for memory cgroup. Device memory page are accounted
++anonymous if device page is used for anonymous, file if device page is used for
++file backed page or shmem if device page is used for shared memory). This is a
++deliberate choice to keep existing applications, that might start using device
++memory without knowing about it, running unimpacted.
++
++A drawback is that the OOM killer might kill an application using a lot of
++device memory and not a lot of regular system memory and thus not freeing much
++system memory. We want to gather more real world experience on how applications
++and system react under memory pressure in the presence of device memory before
+ deciding to account device memory differently.
+ 
+ 
- -back from device memory to regular memory can not fail because it would
++The same decision was made for memory cgroup. Device memory pages are accounted
+ against the same memory cgroup a regular page would be accounted to. This does
+ simplify migration to and from device memory. This also means that migration
- -get more experience in how device memory is use and its impact on memory
++back from device memory to regular memory cannot fail because it would
+ go above memory cgroup limit. We might revisit this choice later on once we
- -Note that device memory can never be pin nor by device driver nor through GUP
++get more experience in how device memory is used and its impact on memory
+ resource control.
+ 
+ 
- -is drop in case of share memory or file back memory.
++Note that device memory can never be pinned by a device driver or through GUP
+ and thus such memory is always freed upon process exit, or when the last
++reference is dropped in the case of shared memory or file backed memory.
diff --cc Documentation/vm/page_migration.rst

index 0000000,07b67a8..f68d613

mode 000000,100644..100644
--- /dev/null
--- 2/Documentation/vm/page_migration.rst
+++ b/Documentation/vm/page_migration.rst
@@@ -1,0 -1,257 +1,257 @@@
- -2. Insure that writeback is complete.
+ .. _page_migration:
+ 
+ ==============
+ Page migration
+ ==============
+ 
+ Page migration allows the moving of the physical location of pages between
+ nodes in a NUMA system while the process is running. This means that the
+ virtual addresses that the process sees do not change. However, the
+ system rearranges the physical location of those pages.
+ 
+ The main intent of page migration is to reduce the latency of memory access
+ by moving pages near to the processor where the process accessing that memory
+ is running.
+ 
+ Page migration allows a process to manually relocate the node on which its
+ pages are located through the MPOL_MF_MOVE and MPOL_MF_MOVE_ALL options while
+ setting a new memory policy via mbind(). The pages of a process can also be
+ relocated from another process using the sys_migrate_pages() function call. The
+ migrate_pages() function call takes two sets of nodes and moves pages of a
+ process that are located on the from nodes to the destination nodes.
+ Page migration functions are provided by the numactl package by Andi Kleen
+ (a version later than 0.9.3 is required. Get it from
+ ftp://oss.sgi.com/www/projects/libnuma/download/). numactl provides libnuma
+ which provides an interface similar to other NUMA functionality for page
+ migration.  cat ``/proc/<pid>/numa_maps`` allows an easy review of where the
+ pages of a process are located. See also the numa_maps documentation in the
+ proc(5) man page.
+ 
+ Manual migration is useful if for example the scheduler has relocated
+ a process to a processor on a distant node. A batch scheduler or an
+ administrator may detect the situation and move the pages of the process
+ nearer to the new processor. The kernel itself only provides
+ manual page migration support. Automatic page migration may be implemented
+ through user space processes that move pages. A special function call
+ "move_pages" allows the moving of individual pages within a process.
+ A NUMA profiler may, for example, obtain a log showing frequent off-node
+ accesses and may use the result to move pages to more advantageous
+ locations.
+ 
+ Larger installations usually partition the system using cpusets into
+ sections of nodes. Paul Jackson has equipped cpusets with the ability to
+ move pages when a task is moved to another cpuset (See
+ Documentation/cgroup-v1/cpusets.txt).
+ Cpusets allow the automation of process locality. If a task is moved to
+ a new cpuset then all its pages are moved with it so that the
+ performance of the process does not degrade dramatically. The pages
+ of processes in a cpuset are also moved if the allowed memory nodes of a
+ cpuset are changed.
+ 
+ Page migration allows the preservation of the relative location of pages
+ within a group of nodes for all migration techniques which will preserve a
+ particular memory allocation pattern generated even after migrating a
+ process. This is necessary in order to preserve the memory latencies.
+ Processes will run with similar performance after migration.
+ 
+ Page migration occurs in several steps. First a high level
+ description for those trying to use migrate_pages() from the kernel
+ (for userspace usage see Andi Kleen's numactl package mentioned above)
+ and then a low level description of how the low level details work.
+ 
+ In kernel use of migrate_pages()
+ ================================
+ 
+ 1. Remove pages from the LRU.
+ 
+    Lists of pages to be migrated are generated by scanning over
+    pages and moving them into lists. This is done by
+    calling isolate_lru_page().
+    Calling isolate_lru_page() increases the references to the page
+    so that it cannot vanish while the page migration occurs.
+    It also prevents the swapper or other scans from encountering
+    the page.
+ 
+ 2. We need to have a function of type new_page_t that can be
+    passed to migrate_pages(). This function should figure out
+    how to allocate the correct new page given the old page.
+ 
+ 3. The migrate_pages() function is called which attempts
+    to do the migration. It will call the function to allocate
+    the new page for each page that is considered for
+    moving.
+ 
+ How migrate_pages() works
+ =========================
+ 
+ migrate_pages() does several passes over its list of pages. A page is moved
+ if all references to a page are removable at the time. The page has
+ already been removed from the LRU via isolate_lru_page() and the refcount
+ is increased so that the page cannot be freed while page migration occurs.
+ 
+ Steps:
+ 
+ 1. Lock the page to be migrated
+ 
- -5. The radix tree lock is taken. This will cause all processes trying
- -   to access the page via the mapping to block on the radix tree spinlock.
++2. Ensure that writeback is complete.
+ 
+ 3. Lock the new page that we want to move to. It is locked so that accesses to
+    this (not yet uptodate) page immediately block while the move is in progress.
+ 
+ 4. All the page table references to the page are converted to migration
+    entries. This decreases the mapcount of a page. If the resulting
+    mapcount is not zero then we do not migrate the page. All user space
+    processes that attempt to access the page will now wait on the page lock.
+ 
- -10. The reference count of the old page is dropped because the radix tree
++5. The i_pages lock is taken. This will cause all processes trying
++   to access the page via the mapping to block on the spinlock.
+ 
+ 6. The refcount of the page is examined and we back out if references remain
+    otherwise we know that we are the only one referencing this page.
+ 
+ 7. The radix tree is checked and if it does not contain the pointer to this
+    page then we back out because someone else modified the radix tree.
+ 
+ 8. The new page is prepped with some settings from the old page so that
+    accesses to the new page will discover a page with the correct settings.
+ 
+ 9. The radix tree is changed to point to the new page.
+ 
- -    the new page is referenced to by the radix tree.
++10. The reference count of the old page is dropped because the address space
+     reference is gone. A reference to the new page is established because
- -11. The radix tree lock is dropped. With that lookups in the mapping
- -    become possible again. Processes will move from spinning on the tree_lock
++    the new page is referenced by the address space.
+ 
++11. The i_pages lock is dropped. With that lookups in the mapping
++    become possible again. Processes will move from spinning on the lock
+     to sleeping on the locked new page.
+ 
+ 12. The page contents are copied to the new page.
+ 
+ 13. The remaining page flags are copied to the new page.
+ 
+ 14. The old page flags are cleared to indicate that the page does
+     not provide any information anymore.
+ 
+ 15. Queued up writeback on the new page is triggered.
+ 
+ 16. If migration entries were inserted into the page table, then replace them
+     with real ptes. Doing so will enable access for user space processes not
+     already waiting for the page lock.
+ 
+ 17. The page locks are dropped from the old and new page.
+     Processes waiting on the page lock will redo their page faults
+     and will reach the new page.
+ 
+ 18. The new page is moved to the LRU and can be scanned by the swapper
+     etc again.
+ 
+ Non-LRU page migration
+ ======================
+ 
+ Although migration originally aimed at reducing the latency of memory accesses
+ for NUMA, compaction also uses migration to create high-order pages.
+ 
+ The current problem of the implementation is that it is designed to migrate only
+ *LRU* pages. However, there are potential non-LRU pages which can be migrated
+ in drivers, for example, zsmalloc and virtio-balloon pages.
+ 
+ For virtio-balloon pages, some parts of the migration code path have been hooked
+ up and virtio-balloon specific functions added to intercept the migration logic.
+ This is too specific to one driver, so other drivers that want to make their
+ pages movable would have to add their own specific hooks in the migration path.
+ 
+ To overcome the problem, the VM supports non-LRU page migration, which provides
+ generic functions for non-LRU movable pages without driver specific hooks in
+ the migration path.
+ 
+ If a driver wants to make its own pages movable, it should define three
+ functions which are function pointers of struct address_space_operations.
+ 
+ 1. ``bool (*isolate_page) (struct page *page, isolate_mode_t mode);``
+ 
+    What the VM expects from the driver's isolate_page function is to return
+    *true* if the driver isolates the page successfully. On returning true, the VM
+    marks the page as PG_isolated so concurrent isolation on several CPUs skips the
+    page. If a driver cannot isolate the page, it should return *false*.
+ 
+    Once the page is successfully isolated, the VM uses the page.lru fields, so
+    the driver shouldn't expect the values in those fields to be preserved.
+ 
+ 2. ``int (*migratepage) (struct address_space *mapping,``
+ |     ``struct page *newpage, struct page *oldpage, enum migrate_mode);``
+ 
+    After isolation, the VM calls the driver's migratepage with the isolated page.
+    The job of migratepage is to move the content of the old page to the new page
+    and set up the fields of struct page newpage. Keep in mind that you should
+    indicate to the VM that the oldpage is no longer movable via
+    __ClearPageMovable() under the page lock if you migrated the oldpage
+    successfully and return MIGRATEPAGE_SUCCESS. If the driver cannot migrate the
+    page at the moment, it can return -EAGAIN. On -EAGAIN, the VM will retry page
+    migration after a short time because it interprets -EAGAIN as a "temporary
+    migration failure". On returning any error except -EAGAIN, the VM will give up
+    the page migration without retrying this time.
+ 
+    The driver shouldn't touch the page.lru field the VM is using in this function.
+ 
+ 3. ``void (*putback_page)(struct page *);``
+ 
+    If migration fails on an isolated page, the VM should return the isolated page
+    to the driver, so the VM calls the driver's putback_page with the page that
+    failed to migrate. In this function, the driver should put the isolated page
+    back into its own data structure.
+ 
+ 4. non-lru movable page flags
+ 
+    There are two page flags for supporting non-LRU movable pages.
+ 
+    * PG_movable
+ 
+      The driver should use the function below to make a page movable under the page lock::
+ 
+       void __SetPageMovable(struct page *page, struct address_space *mapping)
+ 
+      It needs the address_space argument for registering the family of migration
+      functions which will be called by the VM. Strictly speaking,
+      PG_movable is not a real flag of struct page. Rather, the VM
+      reuses the lower bits of page->mapping to represent it.
+ 
+ ::
+       #define PAGE_MAPPING_MOVABLE 0x2
+       page->mapping = page->mapping | PAGE_MAPPING_MOVABLE;
+ 
+      so the driver shouldn't access page->mapping directly. Instead, the driver
+      should use page_mapping(), which masks off the low two bits of page->mapping
+      under the page lock, so it can get the right struct address_space.
+ 
+      For testing for a non-LRU movable page, the VM supports the __PageMovable
+      function. However, it doesn't guarantee to identify a non-LRU movable page
+      because the page->mapping field is unified with other variables in struct
+      page. Also, if the driver releases the page after isolation by the VM,
+      page->mapping doesn't have a stable value although it has PAGE_MAPPING_MOVABLE
+      set (look at __ClearPageMovable). But __PageMovable is a cheap way to tell
+      whether a page is LRU or non-LRU movable once the page has been isolated,
+      because LRU pages can never have PAGE_MAPPING_MOVABLE set in page->mapping. It
+      is also good for just peeking to test for non-LRU movable pages before the more
+      expensive check with lock_page in pfn scanning to select a victim.
+ 
+      For guaranteeing a non-LRU movable page, the VM provides the PageMovable
+      function. Unlike __PageMovable, PageMovable validates page->mapping and
+      mapping->a_ops->isolate_page under lock_page. The lock_page prevents
+      page->mapping from being suddenly destroyed.
+ 
+      A driver using __SetPageMovable should clear the flag via __ClearPageMovable
+      under the page lock before releasing the page.
+ 
+    * PG_isolated
+ 
+      To prevent concurrent isolation among several CPUs, the VM marks an isolated
+      page as PG_isolated under lock_page. So if a CPU encounters a PG_isolated
+      non-LRU movable page, it can skip it. The driver doesn't need to manipulate the
+      flag because the VM will set/clear it automatically. Keep in mind that if the
+      driver sees a PG_isolated page, it means the page has been isolated by the VM,
+      so it shouldn't touch the page.lru field.
+      PG_isolated is an alias of the PG_reclaim flag, so the driver shouldn't use
+      that flag for its own purposes.
+ 
+ Christoph Lameter, May 8, 2006.
+ Minchan Kim, Mar 28, 2016.
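
Note on the PG_movable text in page_migration.rst above (an illustration, not part of the commit): the idea of storing a flag in the low bits of page->mapping can be tried out in plain userspace C. The sketch below is only a model under stated assumptions. `struct fake_page`, `struct fake_mapping`, `set_movable()`, `is_movable()`, `get_mapping()` and the `PAGE_MAPPING_FLAGS` mask value are hypothetical stand-ins chosen for this example; only the `PAGE_MAPPING_MOVABLE` value of 0x2 and the "mask off the low two bits" rule come from the text.

```c
#include <assert.h>
#include <stdint.h>

/* Value taken from the documentation text above. */
#define PAGE_MAPPING_MOVABLE 0x2UL
/* Assumed mask for "the low two bits" mentioned in the text. */
#define PAGE_MAPPING_FLAGS   0x3UL

/* Hypothetical stand-in for struct address_space. */
struct fake_mapping { int id; };

/* Hypothetical stand-in for struct page: the pointer and the flag share one word. */
struct fake_page { uintptr_t mapping; };

/* Model of marking a page movable: OR the flag into the pointer word. */
static void set_movable(struct fake_page *page, struct fake_mapping *m)
{
    /* The trick only works if the pointer's low two bits are free. */
    assert(((uintptr_t)m & PAGE_MAPPING_FLAGS) == 0);
    page->mapping = (uintptr_t)m | PAGE_MAPPING_MOVABLE;
}

/* Model of the cheap test: look only at the low bits of the word. */
static int is_movable(const struct fake_page *page)
{
    return (page->mapping & PAGE_MAPPING_FLAGS) == PAGE_MAPPING_MOVABLE;
}

/* Model of the accessor: mask off the low two bits to recover the pointer. */
static struct fake_mapping *get_mapping(const struct fake_page *page)
{
    return (struct fake_mapping *)(page->mapping & ~PAGE_MAPPING_FLAGS);
}
```

Reading `page->mapping` directly after `set_movable()` would yield a pointer that is off by two, which is why the text tells drivers to go through an accessor that masks the bits first.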
diff --cc MAINTAINERS
Simple merge
diff --cc arch/alpha/Kconfig
Simple merge
diff --cc arch/mips/Kconfig
Simple merge
diff --cc arch/powerpc/Kconfig
Simple merge
diff --cc fs/dax.c
Simple merge
diff --cc fs/proc/task_mmu.c
Simple merge
diff --cc include/linux/hmm.h
Simple merge
diff --cc include/linux/sched/mm.h
Simple merge
diff --cc include/linux/swap.h
Simple merge
diff --cc mm/Kconfig
Simple merge
diff --cc mm/hmm.c
Simple merge
diff --cc mm/huge_memory.c
Simple merge
diff --cc mm/hugetlb.c
Simple merge
diff --cc mm/ksm.c
Simple merge
diff --cc mm/mmap.c
Simple merge
diff --cc mm/rmap.c
Simple merge
diff --cc mm/util.c
Simple merge