This repository has been archived by the owner on Jan 28, 2023. It is now read-only.

[Draft] GPA Protection Implementation #35

Open · wants to merge 2 commits into base: master
6 changes: 5 additions & 1 deletion core/ept2.c
@@ -70,7 +70,8 @@ void ept_handle_mapping_changed(hax_gpa_space_listener *listener,
}

int ept_handle_access_violation(hax_gpa_space *gpa_space, hax_ept_tree *tree,
                                exit_qualification_t qual, uint64 gpa)
                                exit_qualification_t qual, uint64 gpa,
                                uint64 *fault_gfn)
Contributor:
So this is the intrusive change you were talking about :) It didn't look very intrusive to me until I peeked at the second patch. I need to spend more time tomorrow to understand the rationale better and think about possible alternatives. But maybe I can ask you this for now: if we don't have to worry about chunks (see my other long comment below), do we still need this extra parameter?

Contributor (Author):
Ah, yes. I also thought about something like:

    __try {
        hax_vcpu_run();
        --------> call RaiseException when we failed to pass the protection check
    } __except (xxxx) {
        // fill related htun and exit
    }

But I have two concerns, and I do not know the answers:
1. It seems too Windows-specific; is something similar possible on Mac? (I have not worked on Mac so far.) I believe it is OK not to implement this on Mac from Google's point of view.
2. I am not sure which exception I should raise when the protection-bitmap check fails. My concern is that if I choose an existing exception code (page fault, for example) and a real page fault later happens due to a bug in the code, it could be misinterpreted as our RAM protection fault. And I do not know whether a user-defined exception code is acceptable.
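(A minimal sketch of this idea, assuming a Windows build; HAX_STATUS_GPA_PROTECTED, hax_vcpu_run() and the htun handling are placeholders, not part of this patch. Regarding concern 2, Windows reserves bit 29 of an exception code for customer-defined codes, so such a code cannot collide with a system exception like a real page fault.)

    #define HAX_STATUS_GPA_PROTECTED  0xE0000001  // severity = error, customer bit (29) set

    __try {
        hax_vcpu_run();  // raises HAX_STATUS_GPA_PROTECTED when the protection check fails
    } __except (GetExceptionCode() == HAX_STATUS_GPA_PROTECTED
                    ? EXCEPTION_EXECUTE_HANDLER : EXCEPTION_CONTINUE_SEARCH) {
        // fill the related htun fields (e.g. HAX_EXIT_PAGEFAULT) and exit to QEMU
    }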

Contributor:
My question was probably not clear enough, so you must have mistaken it for something else - but thanks to that, I get to see this very interesting idea in your answer :)

I totally understand that Google's priority is to bring on-demand snapshot loading to Windows, so it's fine to leave out Mac for the QEMU (Android Emulator) side patch. But I still hope the new API we add this time works on both Windows and Mac, with the same behavior, as all the existing HAXM APIs do. So I agree with you that any Windows-specific API design should be avoided.

Back to my original question: I know this new fault_gfn parameter is a result of the call to gpa_space_chunk_protected(). My point is, if we don't worry about chunks and just stick to processing (filling) one page per exit to QEMU (via HAX_EXIT_PAGEFAULT), the pagefault GPA we return to QEMU will be the same as gpa, so there's no need to call gpa_space_chunk_protected() just because it might return a different GPA.
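(Sketch of the simplification being suggested here, reusing the patch's own gpa_space_test_prot_bitmap(); illustrative only:)

    // Inside ept_handle_access_violation(): test only the faulting page, so the
    // GFN reported to QEMU is exactly the one hardware gave us and the extra
    // fault_gfn out-parameter becomes unnecessary.
    if (gpa_space_test_prot_bitmap(gpa_space, gfn))
        return -EPERM;  // the exit path then fills htun with HAX_EXIT_PAGEFAULT for gpa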

Contributor:
By the way, I've thought about handling HAXM's own accesses to guest RAM, and your "intrusive" approach makes more sense to me. But I also came to realize how complicated the problem is - for example, since it's possible for HAXM to access a protected GPA in the middle of its MMIO or PIO handler (instruction emulation) code, we have to make sure the code before the GPA access does nothing "irreversible", because as soon as we realize the GPA is protected, we'll abort the handler and return to QEMU with the new exit reason (instead of HAX_EXIT_FAST_MMIO or HAX_EXIT_IO), and must be able to emulate the same MMIO/PIO instruction again when QEMU re-executes it.
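(One way to read this constraint, sketched against the patched gpa_space_read_data(); the surrounding emulation flow is hypothetical:)

    // Perform the protected-GPA check (implicitly, via the guest-RAM read) before
    // any guest-visible side effect, so the same MMIO/PIO instruction can be
    // re-emulated after QEMU handles the new exit reason.
    ret = gpa_space_read_data(gpa_space, gpa, len, data, &fault_gfn);
    if (ret == -EPERM)
        return ret;  // nothing irreversible has happened yet; report fault_gfn
    // ...only now commit side effects (update registers, advance RIP, ...)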

{
    uint combined_perm;
    uint64 gfn;
@@ -102,6 +103,9 @@ int ept_handle_access_violation(hax_gpa_space *gpa_space, hax_ept_tree *tree,
        return 0;
    }

    if (gpa_space_chunk_protected(gpa_space, gfn, fault_gfn))
        return -EPERM;

    // The faulting GPA maps to RAM/ROM
    is_rom = slot->flags & HAX_MEMSLOT_READONLY;
    offset_within_slot = gpa - (slot->base_gfn << PG_ORDER_4K);
177 changes: 158 additions & 19 deletions core/gpa_space.c
@@ -33,6 +33,7 @@
#include "../include/hax.h"
#include "include/paging.h"
#include "../include/hax_host_mem.h"
#include "ept2.h"

int gpa_space_init(hax_gpa_space *gpa_space)
{
@@ -59,6 +60,13 @@ int gpa_space_init(hax_gpa_space *gpa_space)
    return ret;
}

static uint64 gpa_space_prot_bitmap_size(uint64 npages)
{
    uint64 bitmap_size = (npages + 7) / 8;
    bitmap_size += 8;
    return bitmap_size;
}

void gpa_space_free(hax_gpa_space *gpa_space)
{
    hax_gpa_space_listener *listener, *tmp;
@@ -75,6 +83,9 @@ void gpa_space_free(hax_gpa_space *gpa_space)
                                 hax_gpa_space_listener, entry) {
        hax_list_del(&listener->entry);
    }
    if (gpa_space->prot_bitmap.bitmap)
        hax_vfree(gpa_space->prot_bitmap.bitmap,
                  gpa_space_prot_bitmap_size(gpa_space->prot_bitmap.max_gpfn));
}

void gpa_space_add_listener(hax_gpa_space *gpa_space,
@@ -109,9 +120,9 @@ void gpa_space_remove_listener(hax_gpa_space *gpa_space,
// hax_unmap_user_pages().
static int gpa_space_map_range(hax_gpa_space *gpa_space, uint64 start_gpa,
                               int len, uint8 **buf, hax_kmap_user *kmap,
                               bool *writable)
                               bool *writable, uint64 *fault_gfn)
{
    uint64 gfn;
    uint64 gfn, i;
    uint delta, size, npages;
    hax_memslot *slot;
    hax_ramblock *block;
@@ -133,6 +144,14 @@ static int gpa_space_map_range(hax_gpa_space *gpa_space, uint64 start_gpa,
    delta = (uint) (start_gpa - (gfn << PG_ORDER_4K));
    size = (uint) len + delta;
    npages = (size + PAGE_SIZE_4K - 1) >> PG_ORDER_4K;

    // Check gpa protection bitmap
    for (i = gfn; i < gfn + npages;)
        if (gpa_space_chunk_protected(gpa_space, i, fault_gfn))
            return -EPERM;
        else
            i = (i / HAX_CHUNK_NR_PAGES + 1) * HAX_CHUNK_NR_PAGES;

    slot = memslot_find(gpa_space, gfn);
    if (!slot) {
        hax_error("%s: start_gpa=0x%llx is reserved for MMIO\n", __func__,
@@ -183,7 +202,7 @@ static int gpa_space_map_range(hax_gpa_space *gpa_space, uint64 start_gpa,
}

int gpa_space_read_data(hax_gpa_space *gpa_space, uint64 start_gpa, int len,
                        uint8 *data)
                        uint8 *data, uint64 *fault_gfn)
{
    uint8 *buf;
    hax_kmap_user kmap;
@@ -194,10 +213,12 @@ int gpa_space_read_data(hax_gpa_space *gpa_space, uint64 start_gpa, int len,
        return -EINVAL;
    }

    ret = gpa_space_map_range(gpa_space, start_gpa, len, &buf, &kmap, NULL);
    ret = gpa_space_map_range(gpa_space, start_gpa, len,
                              &buf, &kmap, NULL, fault_gfn);
    if (ret < 0) {
        hax_error("%s: gpa_space_map_range() failed: start_gpa=0x%llx,"
                  " len=%d\n", __func__, start_gpa, len);
        if (ret != -EPERM)
            hax_error("%s: gpa_space_map_range() failed: start_gpa=0x%llx,"
                      " len=%d\n", __func__, start_gpa, len);
        return ret;
    }

@@ -221,7 +242,7 @@ int gpa_space_read_data(hax_gpa_space *gpa_space, uint64 start_gpa, int len,
}

int gpa_space_write_data(hax_gpa_space *gpa_space, uint64 start_gpa, int len,
                         uint8 *data)
                         uint8 *data, uint64 *fault_gfn)
{
    uint8 *buf;
    hax_kmap_user kmap;
@@ -234,10 +255,11 @@ int gpa_space_write_data(hax_gpa_space *gpa_space, uint64 start_gpa, int len,
    }

    ret = gpa_space_map_range(gpa_space, start_gpa, len, &buf, &kmap,
                              &writable);
                              &writable, fault_gfn);
    if (ret < 0) {
        hax_error("%s: gpa_space_map_range() failed: start_gpa=0x%llx,"
                  " len=%d\n", __func__, start_gpa, len);
        if (ret != -EPERM)
            hax_error("%s: gpa_space_map_range() failed: start_gpa=0x%llx,"
                      " len=%d\n", __func__, start_gpa, len);
        return ret;
    }
    if (!writable) {
@@ -265,24 +287,26 @@ int gpa_space_write_data(hax_gpa_space *gpa_space, uint64 start_gpa, int len,
    return nbytes;
}

void * gpa_space_map_page(hax_gpa_space *gpa_space, uint64 gfn,
                          hax_kmap_user *kmap, bool *writable)
int gpa_space_map_page(hax_gpa_space *gpa_space, uint64 gfn,
                       hax_kmap_user *kmap, bool *writable,
                       void **kva, uint64 *fault_gfn)
{
    uint8 *buf;
    int ret;
    void *kva;

    assert(gpa_space != NULL);
    assert(kmap != NULL);
    ret = gpa_space_map_range(gpa_space, gfn << PG_ORDER_4K, PAGE_SIZE_4K, &buf,
                              kmap, writable);
                              kmap, writable, fault_gfn);
    if (ret < PAGE_SIZE_4K) {
        hax_error("%s: gpa_space_map_range() returned %d\n", __func__, ret);
        return NULL;
        if (ret != -EPERM)
            hax_error("%s: gpa_space_map_range() returned %d\n", __func__, ret);
        *kva = NULL;
        return ret;
    }
    kva = (void *) buf;
    assert(kva != NULL);
    return kva;
    *kva = (void *) buf;
    assert(*kva != NULL);
    return 0;
}

void gpa_space_unmap_page(hax_gpa_space *gpa_space, hax_kmap_user *kmap)
@@ -346,3 +370,118 @@ uint64 gpa_space_get_pfn(hax_gpa_space *gpa_space, uint64 gfn, uint8 *flags)

    return pfn;
}

int gpa_space_adjust_prot_bitmap(hax_gpa_space *gpa_space, uint64 max_gpfn)
{
    prot_bitmap *pb = &gpa_space->prot_bitmap;
    uint8 *bmold = pb->bitmap, *bmnew = NULL;

    /* Bitmap size only grows until it is destroyed */
    if (max_gpfn <= pb->max_gpfn)
        return 0;

    bmnew = hax_vmalloc(gpa_space_prot_bitmap_size(max_gpfn), HAX_MEM_NONPAGE);
    if (!bmnew) {
        hax_error("%s: Not enough memory for new protection bitmap\n",
                  __func__);
        return -ENOMEM;
    }
    pb->bitmap = bmnew;
    if (bmold) {
        memcpy(bmnew, bmold, gpa_space_prot_bitmap_size(pb->max_gpfn));
        hax_vfree(bmold, gpa_space_prot_bitmap_size(pb->max_gpfn));
    }
    pb->max_gpfn = max_gpfn;
    return 0;
}

static void gpa_space_set_prot_bitmap(uint64 start, uint64 nbits,
                                      uint8 *bitmap, bool set)
{
    uint64 i = 0;
    uint64 start_index = start / 8;
    uint64 start_bit = start % 8;
    uint64 end_index = (start + nbits) / 8;
    uint64 end_bit = (start + nbits) % 8;

    if (start_index == end_index) {
        for (i = start; i < start + nbits; i++)
            if (set)
                hax_test_and_set_bit(i, (uint64 *)bitmap);
            else
                hax_test_and_clear_bit(i, (uint64 *)bitmap);
        return;
    }

    for (i = start; i < (start_index + 1) * 8; i++)
        if (set)
            hax_test_and_set_bit(i, (uint64 *)bitmap);
        else
            hax_test_and_clear_bit(i, (uint64 *)bitmap);

    for (i = end_index * 8; i < start + nbits; i++)
        if (set)
            hax_test_and_set_bit(i, (uint64 *)bitmap);
        else
            hax_test_and_clear_bit(i, (uint64 *)bitmap);

    for (i = start_index + 1; i < end_index; i++)
        if (set)
            bitmap[i] = 0xFF;
        else
            bitmap[i] = 0;
}

int gpa_space_test_prot_bitmap(struct hax_gpa_space *gpa_space, uint64 gfn)
{
    struct prot_bitmap *pbm = &gpa_space->prot_bitmap;

    if (!pbm->bitmap)
        return 0;

    if (gfn >= pbm->max_gpfn)
        return 0;

    return hax_test_bit(gfn, (uint64 *)pbm->bitmap);
}

int gpa_space_chunk_protected(struct hax_gpa_space *gpa_space, uint64 gfn,
Contributor:
I see the QEMU patch makes use of HAX_CHUNK_SIZE, which is currently set to 2MB. When we introduced chunks with EPT2, we considered it an internal implementation detail and didn't expect to expose it to user space. If it's absolutely necessary, we should add an API to allow user space to query the chunk size.
My question is, is making QEMU aware of HAXM chunks a necessity or an optimization? Admittedly, right now, at each non-MMIO EPT violation, we pin a whole chunk (instead of just the faulting page) in host RAM, and set up multiple EPT entries, so as to reduce the number of future EPT violations. But it's possible to make ept_tree_create_entries() skip the GFNs that are protected, so if the guest accesses one of them, we'll still get a VM exit. If we do that, is there any more reason for QEMU to know about chunks?
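(Illustrative sketch of the alternative described above; the loop shape and the per-GFN helper are hypothetical, not the actual ept_tree_create_entries() code:)

    // When creating entries for a chunk, skip any GFN marked in the protection
    // bitmap, so a later guest access to it still causes an EPT violation.
    for (gfn = first_gfn; gfn < first_gfn + HAX_CHUNK_NR_PAGES; gfn++) {
        if (gpa_space_test_prot_bitmap(gpa_space, gfn))
            continue;  // leave the EPT entry absent for protected pages
        create_ept_entry_for_gfn(tree, gfn);  // hypothetical helper
    }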

Contributor (Author):
In theory (I have not tested it), QEMU does not need to know the chunk size. It can resolve only one page at a time, or 1 MB, etc. It simply means that in that case, QEMU would receive more EPT_FAULT_EXIT exits at the same place, each with a different gpa reported. I did not design it in a way that requires user space to rely on the exact chunk size.
For the QEMU-side patch, I have to admit I wrote it that way because I have knowledge of HAXM internals.
I think as long as the driver does not make assumptions about the user-space implementation, it should be good.

Contributor:
So basically it's an optimization. I'd prefer that we process/fill one guest page per exit to QEMU, which is the same page that hardware (EPT violation) gave us, because

  1. Having user space depend on a driver implementation detail is not a good practice, since the interface is not stable, although it's not as bad as having the driver depend on user space.
  2. Chunks are created in HVA space (by dividing a RAM block), not GPA space. So technically, rounding a GPA down to the nearest 2MB boundary is not guaranteed to give you a GPA that maps to the same chunk.

Contributor:
  1. Some chunks are smaller than 2MB, e.g. in the case of a tiny RAM block, or the last chunk in a RAM block, so __gpa < gpa + 0x200000 doesn't always work.
  2. HVF also fills one page at a time and takes that many VM exits. If we fill too many pages at a time, we may be doing a kind of eager loading rather than on-demand loading (there's a trade-off).

If the performance is not good enough, we could add the chunk-based optimization in a later patch.

                              uint64 *fault_gfn)
{
    uint64 __gfn = gfn / HAX_CHUNK_NR_PAGES * HAX_CHUNK_NR_PAGES;
    for (gfn = __gfn; gfn < __gfn + HAX_CHUNK_NR_PAGES; gfn++)
        if (gpa_space_test_prot_bitmap(gpa_space, gfn)) {
            *fault_gfn = gfn;
            return 1;
        }

    return 0;
}

int gpa_space_protect_range(struct hax_gpa_space *gpa_space,
                            struct hax_ept_tree *ept_tree,
                            uint64 start_gpa, uint64 len, int8 flags)
{
    uint64 gfn;
    uint npages;
    hax_memslot *slot;

    if (len == 0) {
        hax_error("%s: len = 0\n", __func__);
        return -EINVAL;
    }

    /* Per-permission (r/w/x) protection is not supported yet */
    if (flags != 0 && (flags & HAX_GPA_PROT_MASK) != HAX_GPA_PROT_ALL)
        return -EINVAL;

    gfn = start_gpa >> PG_ORDER_4K;
    npages = (len + PAGE_SIZE_4K - 1) >> PG_ORDER_4K;

    gpa_space_set_prot_bitmap(gfn, npages, gpa_space->prot_bitmap.bitmap,
                              !flags);

    if (!flags)
        ept_tree_invalidate_entries(ept_tree, gfn, npages);

    return 0;
}
1 change: 1 addition & 0 deletions core/hax.c
@@ -347,6 +347,7 @@ int hax_get_capability(void *buf, int bufLeng, int *outLength)
    cap->winfo |= HAX_CAP_64BIT_SETRAM;
#endif
    cap->winfo |= HAX_CAP_TUNNEL_PAGE;
    cap->winfo |= HAX_CAP_GPA_PROTECTION;
    if (cpu_data->vmx_info._ept_cap) {
        cap->winfo |= HAX_CAP_EPT;
    }
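(For context, a user-space consumer would likely gate the new ioctl on this bit. A sketch modeled on the existing HAX_IOCTL_CAPABILITY usage in QEMU, error handling omitted; treat the exact call as illustrative:)

    struct hax_capabilityinfo cap = {0};

    if (ioctl(hax_fd, HAX_IOCTL_CAPABILITY, &cap) == 0 &&
        (cap.winfo & HAX_CAP_GPA_PROTECTION)) {
        // driver supports GPA protection; safe to use the new protection API
    }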
3 changes: 2 additions & 1 deletion core/include/ept2.h
@@ -249,7 +249,8 @@ void ept_handle_mapping_changed(hax_gpa_space_listener *listener,
// present, but the access violates the permissions it allows.
// -ENOMEM: Memory allocation/mapping error.
int ept_handle_access_violation(hax_gpa_space *gpa_space, hax_ept_tree *tree,
                                exit_qualification_t qual, uint64 gpa);
                                exit_qualification_t qual, uint64 gpa,
                                uint64 *fault_gfn);

// Handles an EPT misconfiguration caught by hardware while it tries to
// translate a GPA.
1 change: 1 addition & 0 deletions core/include/hax_core_interface.h
@@ -59,6 +59,7 @@ int hax_vm_set_ram2(struct vm_t *vm, struct hax_set_ram_info2 *info);
int hax_vm_free_all_ram(struct vm_t *vm);
int in_pmem_range(struct hax_vcpu_mem *pmem, uint64_t va);
int hax_vm_add_ramblock(struct vm_t *vm, uint64_t start_uva, uint64_t size);
int hax_vm_gpa_prot(struct vm_t *vm, struct hax_gpa_prot_info *info);
Contributor:
How about hax_vm_protect_ram? Other similar function names begin with a verb.


void * get_vm_host(struct vm_t *vm);
int set_vm_host(struct vm_t *vm, void *vm_host);