This sounds like it's just the cost to traverse the page table, right? ~300 cycles per raw memory lookup, and 3 of them because you'll typically need to go three levels deep?
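To put a number on that guess, here is a back-of-the-envelope sketch. The 300-cycle and three-level figures are just the ones from the comment above, not measurements, and `page_walk_cycles` is a made-up helper name. It models the worst case where every level of the walk goes all the way to DRAM.

```python
def page_walk_cycles(levels: int, cycles_per_lookup: int) -> int:
    """Worst-case page walk cost: one uncached memory lookup per level."""
    return levels * cycles_per_lookup


# Figures taken from the comment above (assumptions, not measurements).
LEVELS = 3
CYCLES_PER_LOOKUP = 300

total = page_walk_cycles(LEVELS, CYCLES_PER_LOOKUP)
print(f"worst-case walk: {total} cycles")  # 3 * 300 = 900
```

If some of the page table entries are already cached, the per-level cost drops and the total falls well below this bound.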
The TLB is tiny these days, and so are 4 KB pages.
I'm super hopeful that Linus is going to force through some big improvements to HugePages, because Linux HugePages support is super painful at the moment. 2MB pages alone could be a massive gain.
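A rough way to see why 2MB pages could help is "TLB reach": entries times page size, i.e. how much memory the TLB can map without a miss. The sketch below assumes a hypothetical 1536-entry TLB (real sizes vary by CPU, and some TLBs hold fewer entries for large pages), so treat the output as illustrative only.

```python
KIB = 1024
MIB = 1024 * KIB


def tlb_reach_bytes(entries: int, page_size: int) -> int:
    """Memory covered by a fully populated TLB, in bytes."""
    return entries * page_size


# Hypothetical entry count, chosen only for illustration.
ENTRIES = 1536

for label, page_size in (("4 KiB", 4 * KIB), ("2 MiB", 2 * MIB)):
    reach = tlb_reach_bytes(ENTRIES, page_size)
    print(f"{label} pages: reach = {reach // MIB} MiB")
```

Under that assumption the same number of entries covers 512 times more memory with 2MB pages than with 4 KB pages, since the reach scales linearly with page size.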
512K page size? Wouldn't that add a ton of IO in a lot of scenarios? Like every 1-byte file would now require a 512K event? Large pages (2MB/1GB) are for specialised use where you know you're not going to be paging things in/out too often, right?
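The concern can be sketched with simple rounding arithmetic, assuming file data is always brought in as whole pages (a simplification, and `bytes_touched` is a hypothetical helper, not a kernel interface):

```python
def bytes_touched(file_size: int, page_size: int) -> int:
    """Bytes moved if a file is read in whole pages (size rounded up)."""
    if file_size <= 0:
        return 0
    pages = -(-file_size // page_size)  # ceiling division
    return pages * page_size


KIB = 1024

small = bytes_touched(1, 4 * KIB)    # 1-byte file, 4 KiB pages
large = bytes_touched(1, 512 * KIB)  # 1-byte file, 512 KiB pages
print(small, large, large // small)
```

Under this model a 1-byte file costs one full page either way, so the overhead grows in direct proportion to the page size.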
IIRC Linus was quite dismissive about having larger-than-4k page sizes as the default.
Another common one is actually in the kernel where filesystem block sizes are limited to page sizes, so from this point of view large page sizes are better:
The thing is, all the relevant page table data should already be in L1 caches when the page fault handler returns. The TLB miss on iret should not require raw memory lookups and should be much faster than 300 cycles.
Probably it's more that iret has simply always been slow (which is why the various syscall/sysenter extensions were created).