(Sorry if this was more than you asked for, but some time ago I found that I need to research all
of this to understand the topic; ‘**’ is the power operator)
I want to make sure that I understood %topic% properly.
To really grok this whole paging thing, you need to understand why it was created. The very-first time-sharing computer, MIT's Compatible Time-Sharing System (CTSS), faced the problem of protecting the supervisor (kernel) from the rest of system user processes. They did it by statically
reserving the low 5-Kbytes of system memory to the kernel, and the rest 27-Kbytes to user applications. Afterwards, they wanted to avoid memory allocation clashes between users, but due to hardware limitations of the time, they solved that problem by allowing only one
program in core memory at any point in time! 
As a solution to the “protect users from each other” problem, the ATLAS computer pioneered abstracting the concept of an “address” from the low-level detail of a physical memory location. Thus, from a memory perspective, each process will have its own world view (and no need for historical tricks like memory overlays
: demand paging automatically does a similar effect by only loading parts of the executable that are currently needed; a process image can thus be bigger than the amount of contiguous free physical memory available). Now, If we are going to map every byte
of the process address space, we'll need a mapping table with the size of physical memory or larger! To solve this, we group physical memory addresses into “blocks”, and map these “blocks” instead of singular bytes in the mapping table. 
(NOTE-1: the whole difference between segmentation and paging is that the former divides the address space into blocks of arbitrary size, while the latter organizes such space into uniform block sizes.)
Now, imagine having a 32-bit address space while mapping “blocks” of 4-Kbytes in size. Each mapping table entry will equal 4-bytes to be able to point to the whole 32-bit space. Since each entry maps 4-Kbytes (2**12) of address space, we'll need (2**32/2**12 = 2**20) entries to map the entire 4-Gbyte/32-bit space. So, the mapping table size will equal ((number of needed entries * entry size) = 2**20 * 2**2 = 2**22 bytes = 4-Mbytes) to function. i.e. using that simple scheme, we'll need a 4-MByte mapping table for each process in the system! I'm sure you won't like to use any kernel like that
(NOTE-2: Hardware designers intentionally choose power-of-2 page sizes to quickly find a virtual address page number and offset using right and left shifts. Otherwise, they'll have to use the too-slow division operator; check the cited Denning paper for further details.)
If we are going to analyze why each process needed a 4-Mbyte mapping table above, we'll find that the key problem was mapping unused parts of the address space. And to further complicate matters, most processes use their memory space in a very sparse
manner; e.g. using the very top for the stack, and the very-bottom for the executable code and data. Thus, the paging scheme has to support virtual addresses moving up and down.
Hardware architecture books describe several solutions to the above problem. For the contemporary x86 CPUs, the solution chosen was to page the page-table itself. Intel and AMD64 documents cover the details.
 F.J. Corbato et al., “An Experimental Time-Sharing System”, AIEE-IRE Proceedings of the 1962 Spring Joint Computer Conference, MIT
 P.J. Denning, “Virtual Memory”, 1970, ACM Computing Surveys
If we want to pre-create a 2mb page / PML4 structure:
4k (for the PML4) + (4k (PDPT)* 4k (PD) ) = 0x1001000, right?
(I'll just call the PDP and PD as ‘PML3’ and ‘PML2’ respectively, it makes more sense that way)
Because the AMD designers wanted to limit requiring contiguous physical memory as far as possible, each page table was designed not to exceed 4-Kbytes of contiguous RAM. Since each entry here points to a full 64-bit address, such page table entry sizes will equal 8-bytes. Thus, max number of possible entries in a table = (page table size / entry size = 2**12 / 2**3 = 2**9) entries. That's in fact why the PML4, PML3, and PML2 offsets are 9-bits each and why all of these tables also have 512 entries each. This leads us to:
- A PML2 entry covers a 2-Mbyte region since it directly points to a 2MB page
- A PML3 entry covers a 1-Gbyte region since it points to 512 PML2 entries
- A PML4 entry covers a 512-Gbyte region since it points to 512 PML3 entries
Another explanation using page offsets:
- A PML2 entry covers a 2-Mbyte region since 2**21 = 2**20 * 2 = 2 Mbytes
- A PML3 entry covers a 1-Gbyte region since 2**30 = 2**20 * 2**10 = 1024-Mbytes = 1 Gbyte
- A PML4 entry covers a 512-Gbyte region since 2**39 = 2**30 * 2**9 = 512 Gbytes
(Edit: Add a small note on memory overlays and demand paging.)