Jeff Li

Be another Jeff

Dump PTE with SystemTap

In my last blog post, we explored the pagemap interface in Linux, which exposes some kernel information about memory management. As I mentioned in that post, the pagemap interface does not expose the raw PTE content. Since PTEs live in kernel space, user space applications cannot read them directly. The traditional way to access kernel memory is to write a kernel module, which I find rather tedious.

In this article, we’ll see how to leverage the SystemTap infrastructure to dump the PTE of the page that holds a given virtual address of a process. SystemTap scripts are ultimately translated into kernel modules as well, but writing a SystemTap script is more convenient than writing a kernel module by hand. This article is intended to show how to use SystemTap to explore the Linux kernel, so it will not cover topics such as memory management on x86. I do plan to write an article about the evolution of memory management from i386 to x86_64, but I am not sure when I will start it.

SystemTap Script

function find_pte:long (mm, addr) %{
  pgd_t *pgd;
  pte_t *pte;
  pud_t *pud;
  pmd_t *pmd;

  long address = STAP_ARG_addr;

  struct page* page = NULL;
  struct mm_struct *mm = (struct mm_struct*)STAP_ARG_mm;

  pgd = pgd_offset(mm, address);
  if (pgd_none(*pgd) || pgd_bad(*pgd)) {
    _stp_printf("bad pgd\n");
    STAP_RETURN(0);
  }

  pud = pud_offset(pgd, address);
  if (pud_none(*pud) || pud_bad(*pud)) {
    _stp_printf("bad pud\n");
    STAP_RETURN(0);
  }

  pmd = pmd_offset(pud, address);
  if (pmd_none(*pmd) || pmd_bad(*pmd)) {
    _stp_printf("bad pmd\n");
    STAP_RETURN(0);
  }

  pte = pte_offset_map(pmd, address);
  if (!pte) {
    _stp_printf("bad pte\n");
    STAP_RETURN(0);
  }

  STAP_RETURN(pte->pte);
%}

probe begin {
  process_pid = target()
  pt = pid2task(process_pid)
  mm = @cast(pt, "task_struct", "kernel<linux/sched.h>")->mm

  pte = find_pte(mm, $1)
  printf("pte: %p\n", pte)
  exit()
}

Save the code as dump_pte.stp and run it with the following command: sudo stap -g dump_pte.stp -x PID VIRTUAL_ADDRESS. The -g option (guru mode) is required because the script embeds C code.

Note that this script only works on x86-64 and does not work on 32-bit x86. There is a fix, but I don’t want to post it here because I have not figured out how it works on the 32-bit platform.

Code Comments

This section gives a short explanation of how this SystemTap script works.

  • Line 41 (pt = pid2task(process_pid)) gets the task struct of the target process. pid2task is a built-in function that returns the task struct for a PID.
  • Line 42 gets the mm field of the task struct. You should be familiar with the data structures the kernel uses to manage processes.
  • Lines 1-37 define the find_pte function that reads the PTE. Note that although it is declared as a SystemTap function, the whole function body is C code. SystemTap allows embedding C code in a script.
  • Line 7 and Line 10 show how to convert SystemTap function arguments to C variables with the STAP_ARG_ macros (STAP_ARG_addr and STAP_ARG_mm).
  • Line 14 shows how to print to the console with _stp_printf. Note that using this function is not recommended.
  • Line 15 shows how to return a value from embedded C code. You can’t use return; use the STAP_RETURN macro instead.
  • Line 12, Line 18, Line 24 and Line 30 (the pgd_offset, pud_offset, pmd_offset and pte_offset_map calls) perform the page table walk that translates a linear address to a physical address. Linux uses a 4-level page table on the x86_64 architecture, and the code reflects that organisation.
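
To make the page table walk more concrete, here is a minimal userspace C sketch of how a 48-bit x86_64 virtual address is split into the four table indices and the page offset. It assumes 4 KiB pages and 4-level paging. The names table_index and page_offset are made up for this illustration; they are not kernel functions, and the kernel's own pgd_offset family does more than this arithmetic.

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_SHIFT     12  /* assumption: 4 KiB pages, 12 offset bits */
#define BITS_PER_LEVEL 9   /* each table has 512 entries */

/*
 * Hypothetical helper: index into the table at the given level.
 * level 3 = PGD, 2 = PUD, 1 = PMD, 0 = PTE.
 */
static unsigned table_index(uint64_t vaddr, int level)
{
    assert(level >= 0 && level <= 3);
    return (unsigned)((vaddr >> (PAGE_SHIFT + BITS_PER_LEVEL * level)) & 0x1ff);
}

/* Hypothetical helper: byte offset inside the 4 KiB page. */
static unsigned page_offset(uint64_t vaddr)
{
    return (unsigned)(vaddr & ((1u << PAGE_SHIFT) - 1));
}
```

Calling table_index(addr, 3) down to table_index(addr, 0) gives the slot used at each of the four steps in find_pte, and page_offset(addr) gives the position of the byte inside the page that the final PTE points to.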

Conclusion

SystemTap is not designed for inspecting kernel memory; its primary use is monitoring events that happen in the kernel. Even so, we can use the tool to explore the Linux kernel and learn its design. For example, we can observe PTEs to verify the behaviour of the COW mechanism. Another post will show how to observe COW in Linux using the script discussed in this article.

Allowing embedded C code makes SystemTap scripts very flexible. However, it also requires users to know some Linux kernel data structures well. Sometimes even that is insufficient and more knowledge is required. For example, in the dump_pte script, users must know how to walk the page table.
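
As a rough illustration of what observing a PTE could look like, the value printed by the script can be decoded by hand. The following is a userspace C sketch, assuming the standard x86_64 4 KiB PTE layout and a PTE whose present bit is set (a non-present entry is used differently by the kernel and these fields do not apply to it). The macro and function names are invented for this example and are not the kernel's own definitions.

```c
#include <stdbool.h>
#include <stdint.h>

/* Flag bits of a present x86_64 PTE (names are illustrative only). */
#define PTE_PRESENT  (1ULL << 0)   /* page is mapped */
#define PTE_RW       (1ULL << 1)   /* writable; clear on a COW-protected page */
#define PTE_USER     (1ULL << 2)   /* accessible from user mode */
#define PTE_ACCESSED (1ULL << 5)   /* page has been read or written */
#define PTE_DIRTY    (1ULL << 6)   /* page has been written */
#define PTE_NX       (1ULL << 63)  /* no-execute */
#define PTE_PFN_MASK 0x000ffffffffff000ULL  /* bits 12..51: frame number */

/* Hypothetical helper: test a single flag in a raw PTE value. */
static bool pte_flag(uint64_t pte, uint64_t flag)
{
    return (pte & flag) != 0;
}

/* Hypothetical helper: extract the physical frame number. */
static uint64_t pte_frame(uint64_t pte)
{
    return (pte & PTE_PFN_MASK) >> 12;
}
```

For COW, the interesting parts are the R/W bit and the frame number: two processes sharing a page after fork would be expected to show the same frame with the R/W bit clear, and a write should then give the writer a different frame.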

Further Reading

  • x86-64: The Wikipedia article on x86-64, a very good introduction to x86-64 memory.
  • x86 developer manual: The definitive guide to the x86 architecture. Section 4.4.2 gives the details of address translation, including the PTE structure.
