4707 Chapter 8 – Flashcards
Unlock all answers in this set
Unlock answersquestion
File Organization [8]
answer
- A method of arranging the records in a file when the file is stored on disk - Each file organization makes certain operations efficient but other operations expensive
question
File [8]
answer
- Basic abstraction of data in a DBMS is a collection of records or a file - Each file consists of one or more pages
question
File and Access Methods [8]
answer
- Software layer organizes data carefully to support fast access to desired subsets of records - Often referred to as just the file layer
question
Index [8]
answer
- A data structure that organizes data records on disk to optimize certain kinds of retrieval operations - An index allows us to efficiently retrieve all records that satisfy search conditions on the search key fields of the index - We can also create additional indexes on a given collection of data records, each with a different search key, to speed up search operations that are not efficiently supported by the file organization used to store the data records - Indexing is a technique that can help when we have to access a collection of records in multiple ways, in addition to efficiently supporting various kinds of seletion
question
RID [8]
answer
Record Id - A unique identifier each file has - Has the property that we can identify the disk address of the page containing the record by using the rid
question
File of records [8]
answer
- An important abstraction in a DBMS and is implemented by the files and access methods layer of the code - A file can be created, destroyed, and have records inserted into and deleted from it - A relation is typically stored as a file of records - File layer stores the records in a file as a collection of disk pages
question
Scan [8]
answer
- Operation allows us to step through all records in the file one at a time
question
Heap File [8] [9]
answer
[8] - An unordered file - The simplest file structure - Records in a heap file are stored in random order across the pages of the file - A heap file organziation supports retrieval of all records, or retrieval of a particular record specified by its rid [9] - Data in the pages of a heap file is not ordered in any way and the only guarantee is that one can retrieve all records in the file by repeated requests for the next record Supported Operations: - Create files - Destroy files - Insert a record - Delete a record - Get a record - Scan all records in file
question
Data Entry [8]
answer
- Refers to the record stored in an index file - Data entry with search key value k, denoted k*, contains enough information to locate (one or more) data records with search key value k Three main alternatives for what to store as a data entry in an index: 1. A data entry k* is an actual data record (with search key value k) 2. A data entry is a pair where rid is the record id of a data record with a search key value k 3. A data entry is a pair where rid-list is a list of record ids of data records with search key value k
question
Indexed File Organization [8]
answer
- Can be used instead of a sorted file or an unordered file of records
question
Memory Hierarchy [9]
answer
Memory in a computer system is arranged in a hierarchy: 1. Primary storage 2. Secondary storage 3. Tertiary storage - Cost of a given amount of main memory is about 100 times the cost of the same amount of disk space, and tapes are even less expensive than disks
question
Primary storage [9]
answer
- Consists of cache and main memory and provides very fast access to data - On systems with 32-bit addressing only 2^32 bytes can be directly referenced in main memory: the number of data objects may exceed this number - Volatile, although it is possible to make it nonvolatile by adding a battery backup feature
question
Secondary storage [9]
answer
- Consists of slower devices such as magnetic disks - Play an important role in database system because the amount of data is typically very large - Nonvolatile
question
Tertiary storage [9]
answer
- Slowest class of storage devices - Optical disks and tapes - Play an important role in database system because the amount of data is typically very large - Nonvolatile
question
Nonvolatile [9]
answer
- Storage that retain information when the computer is restarted (after a shutdown or crash)
question
Tapes [8] [9]
answer
[8] - Sequential access devices that force us to read data one page after the other - Mostly used to archive data that is not needed on a regular basis [9] - Form of tertiary storage - Relatively inexpensive and can store very large amounts of data - Good choice for archival storage - when we need to maintain data for a long period but do not expect to access it often - Records data data on 128 tape tracks which can be thought of as a linear sequence of adjacent bytes - Main drawback is that they are sequential access devices - Are unsuitable for storing operation data, or data that is frequently accessed
question
Magnetic Disks [8] [9]
answer
[8] - Most important external storage devices - If we can read several pages in the order they are stored physically the cost can be must less than cost of reading same pages in random order [9] - Form of secondary storage - Support direct access to a desired location and are widely used by database applications Components: - Data is stored on disk in units called disk blocks - Blocks are arranged in concentric rings called tracks on one or more platters - The set of all tracks with the same diameter is called a cylinder - Each track is divided into arcs called sectors - An array of disk heads, one per recorded surface, is moved as a unit - A disk controller interfaces a disk drive to the computer - A checksum is computed for when data is written to a sector and stored with the sector, and again for when the data on the sector is read back Access time (of a disk block): 1. Seek time 2. Rotational delay 3. Transfer time
question
Disk Blocks [9]
answer
- Contiguous sequence of bytes that is the unit in which data is written to a disk and read from a disk
question
Tracks [9]
answer
- Tracks can be recorded on one or both surfaces of a platter
question
Platters [9]
answer
- We refer to platters as single-sided or double-sided according to whether the track is recorded on one or both surfaces of the platter - Typical diameters are 3.5 and 5.25 inches
question
Cylinder [9]
answer
- Contains one track per platter surface
question
Sectors [9]
answer
- Size is a characteristic of the disk and cannot be changed - Size of a disk block can be set when the disk is initialized as a multiple of the sector size
question
Disk Heads [9]
answer
- When one head is positioned over a block the other heads are in identical positions with respect to their platters - To read or write a block a disk head must be positioned on top of the block - Current systems typically allow at most one disk head to read or write at any one time since it is difficult to ensure that all heads are perfectly aligned
question
Disk Controller [9]
answer
- Implements commands to read or write a sector by moving the arm assembly and transferring data to and from the disk surfaces
question
Checksum [9]
answer
- If the sector is corrupted or the read is faulty for some reason it is very unlikely that the checksum computed when the sector is read matches the one computed when the sector was written - The controller computes checksums and if it detects an error it tries to read the sector again
question
Seek Time [9]
answer
- Time taken to move the disk heads to the track on which a desired block is located - As the size of platter decreases seek times also decrease since we move the disk head a shorter distance
question
Rotational Delay [9]
answer
- The waiting time for the desired block to rotate under the disk head - Time required for half a rotation on average and is usually less than seek time
question
Transfer Time [9]
answer
- Time to actually read or write the data in the block once the head is positioned - Time for the disk to rotate over the block
question
Performance Implications of Disk Structure [9]
answer
1. Data must be in memory for a DBMS to operate on it 2. Unit for data transfer between disk and main memory is a block; if a single item on a block is needed the entire block is transferred 3. The time to read or write a block varies depending on the location of the data - Time taken for database operations is affected significantly by how data is stored on disks - Time moving blocks to or from disk usually dominate time taken for database operations - Necessary to locate data records strategically on disk - Closest that two records can be on a disk is to be on the same block - In current design, all the data on a track can be read or written in one revolution - We have a natural notion of closeness for blocks which we can extend to a notion of next and previous blocks
question
I/O [9]
answer
Input/Output - Operation for reading or writing a disk block - Exploiting the notion of next by arranging records so they are read or written sequentially is very important in reducing the time spent in disk I/Os - I/O can typically be done concurrently with CPU computation
question
Disk Space Manager [8] [9]
answer
[8] - Manages the space on a disk - Keeps track of pages in use by the file layer, if a page is freed by the file layer the space manager tracks this and reuses the space if the file layer requests a new page later on - When the files and access methods layer needs additional space to hold new records in a file it asks the disk space manager to allocate an additional disk page for the file - File and access methods layer also informs the disk space manager if it no longer needs one of its disk pages [9] - Lowest level of software in the DBMS architecture - Abstractly the disk space manager supports the concept of a page and provides commands to allocate or deallocate a page and read or write a page - The capability of exploiting the advantages of sequentially accessing disk blocks must be provided by the disk space manager to higher-level layers of the DBMS - Hides details of the underlying hardware (and possibly OS) and allows higher levels of software to think of data as a collection of pages - Keeps tracks of which disk blocks are in use in addition to keeping track of which pages are on which disk blocks using a list of free blocks or bitmap
question
Page [8] [9]
answer
[8] - Unit of information read from or written to disk - Size of a page is a DBMS parameter and typical values are 4KB or 8KB - The cost of page I/O (input from disk to main memory and output from memory to disk) dominates the cost of typical database operations [9] - Unit of data - Size of a page is chosen to be the size of a disk block and pages are stored as disk blocks so that reading or writing can be done in one disk I/O - Pages are used to store records and organized into logical collections or files - Higher levels of DBMS code treat a page as effectively being a collection of records ignoring representation and storage details
question
Free Blocks [9]
answer
List of Free Blocks - One way of keeping track of block usage - As blocks are deallocated we can add them to the free list for future use - A pointer to the first block on the free block list is stored in a known location on disk
question
Bitmap [9]
answer
- One way of keeping track of block usage - One bit for each disk block which indicates whether a block is in use or not - Allows very fast identification and allocation of contiguous areas on disk which is difficult using linked list approach
question
Disk Space Manager & OS File Systems [9]
answer
OS: - Typically an operating system supports the abstraction of a file as a sequence of bytes - Manages space on disk and translates requests into corresponding low-level instructions - Database disk space manager could be built using OS file which it is then responsible for managing the space in these files - Many database systems do not rely on OS file system and do own disk management either from scratch or extending OS facilities - DBMS vendor wishes to support several OS platforms cannot assume features specific to any OS - On a 32-bit system the largest file size is 4GB whereas a DBMS may want to access a single file larger than that - Typical OS files cannot span disk devices
question
Buffer Manager [8] [9]
answer
[8] - Layer of software that reads data into memory for processing and writes to disk for persistent storage - When the files and access methods layer needs to process a page it asks the buffer manager to fetch the page, specifying the page's rid [9] - Software layer responsible for bringing pages from disk to main memory as needed - All data cannot be brought into main memory at once so DBMS must bring pages into main memory as they are needed and in the process decide what existing pages in main memory to replace to make space for a new page - The buffer manager simply assumes that the appropriate lock has been obtained before a page is requested Higher Level Code: - Asks buffer manager for the page and it is brought into a frame in the buffer pool if it is not already there - That requests a page must also release the page when it is no longer needed by informing the buffer manager - Inform buffer manager if it modifies the requested page which makes sure that the change is propagated to the copy of the page on disk Buffer manager maintains some book-keeping information and two variables for each frame in the pool: 1. Pin_count 2. Dirty When a a page is requested: 1. Check buffer pool to see if some frame contains requested page and if so increment the pin_count of that frame. If page is not in pool bring it in as follows: 1a. Choose a frame for replacement using replacement policy and increment its pin_count 1b. If dirty bit for replacement frame is on, write page it contains to disk 1c. Read requested page into replacement frame 2. Return the main memory address of the frame containing the requested page to the requestor - If no page in the buffer pool has pin_count 0 and a page that is not in the pool is requested the buffer manager must wait until some page is released before responding to the page request
question
Buffer Pool [9]
answer
- A collection of pages used by buffer manager to manage available main memory by partitioning it
question
Frames [9]
answer
- Main memory pages in the buffer pool - Convenient to think of them as slots that can hold a page
question
Pin [9]
answer
Pin_count - Number of times that the page currently in a given frame has been requested but not released - The number of current users of the page - Initially for every frame is set to 0 - Buffer manager will not read another page into a frame until its pin_count becomes 0 i.e. all requestors of page have unpinned it Pinning - Incrementing pin_count of requested page in the frame Unpinning - Decrementing pin_count of frame containing released page
question
Dirty [9]
answer
- Indicates whether the page has been modified since it was brought into the buffer pool from disk - Initially for every frame is turned off - If the requestor has modified the page it informs the buffer manager of this at the time that it unpins the page and the dirty bit for the frame is set - When a page is chosen for replacement if the dirty block is not set then the page has not been modified
question
Replacement Policy [9]
answer
- Policy used to decide which page to replace - Policy used to choose an unpinned page for replacement can affect the time taken for database operations considerably - If a requested page is not in the buffer pool and a free frame is not available in the buffer pool a frame with pin_count 0 is chosen for replacement using the buffer manager's replacement policy Policies: 1. LRU: Least Recently Used 2. Clock 3. FIFO: First In First Out 4. MRU: Most Recently Used 5. Random
question
LRU [9]
answer
Least Recently Used - Best known replacement policy - Can be implemented in the buffer manager using a queue of pointers to frames with pin_count 0 - Frame is added to the end of the queue when it becomes a candidate for replacement (when pin_count goes to 0) - Page chosen for replacement is the one in frame at head of queue - Not always the best replacement strategies for a database system if many user requests require sequential scans of data
question
Clock [9]
answer
- Variant of LRU and has similar behavior but less overhead - Choose a page for replacement using a current variable that takes on values 1 through N where N is the number of buffer frames in circular order - To approximate LRU behavior each frame has an associated referenced bit which is turned on when the pin_count goes to 0 - Current frame is considered for replacement, if the frame is not chosen current is incremented and next frame is considered - If current frame has the referenced bit turned on the clock algorithm turns the referenced bit off and increments current this way a recently referenced page is less likely to be replaced - If the current page has pin_count 0 and referenced bit is off then the page is chosen for replacement - Not always the best replacement strategies for a database system if many user requests require sequential scans of data
question
Sequential Flooding [9]
answer
- If the file has 1 more page than the number of available pages in the buffer pool with LRU every scan of the file will result in reading every possible page of the file - In this case LRU is the worst possible replacement strategy
question
Buffer Manager vs OS File Systems [9]
answer
- Similarities exist between virtual memory in OS and buffer management in database management system Why can't build DBMS using the virtual memory capability of an OS? - DBMS can predict order in which pages will be accessed, or page reference patterns, much more accurately than is typical in an OS environment and it is desirable to utilize this property to allow for a better choice of pages to replace - Since DBMS can predict reference patterns it can use an effective strategy called prefetching of pages - DBMS requires ability to explicitly force a page to disk and ensure that the copy of page on disk is updated with the copy in memory - DBMS must be able to ensure certain pages in buffer pool are written to disk before certain other pages to implement the WAL protocol for crash recovery - Virtual memory implementations in OS cannot be relied on to provide needed control over when pages are written to disk
question
Page Reference Patterns [9]
answer
- Order in which pages will be accessed - DBMS can often predict reference patterns because most page references are generated by higher-level operations (e.g. sequential scans) with a known pattern of page accesses
question
Prefetching of Pages [9]
answer
- Buffer manager can anticipate the next several page requests and fetch the corresponding pages into memory before the pages are requested - Once the prefetch request is issued to disk the disk is responsible for reading the requested pages into memory pages and the CPU can continue to do other work Two benefits: 1. Pages are available in buffer pool when they are requested 2. Reading in a contiguous block of pages is much faster than reading the same pages at different times in response to distinct requests
question
Implementing Heap Files [9]
answer
- Must keep track of pages in each heap file to support scans - Must keep track of pages that contain free space to implement insertion efficiently - Pages must hold two pointers (which are page ids) for file-level bookkepping in addition to the data Two Alternatives: 1. Linked List of Pages 2. Directory of Pages
question
Directory of Pages [9]
answer
- One possibility to maintain heap files - DBMS must remember where the first directory page of each heap file is located - Directory is itself a collection of pages and is shown as a linked list - Multiple organizations are possible for the directory itself - Each directory entry identifies a page (or a sequence of pages) in the heap file - Each directory entry is quite small in comparison to a typical page - The size of the directory is likely to be very small in comparison to the size of the heap file - Free space can be managed by maintaining a bit per entry indicating whether the corresponding page has any free space, or a count per entry, indicating the amount of free space on the page - Since several entries fit on a directory page we can efficiently search for a data page with enough space to hold a record to be inserted
question
Linked List of Pages [9]
answer
- One possibility to maintain heap files as a doubly linked list of pages - DBMS can remember where first page is located by maintaining a table containing pairs of in a known location on disk - We call the first page of the file the header page - Each pointer is really a page id Important task is to maintain information about empty slots created by deleting a record from the heap file, two distinct parts: 1. How to keep track of free space within a page 2. How to keep track of pages that have some free space: can be addressed by maintaining a doubly linked list of pages with free space and a doubly linked list of full pages; together these lists contain all pages in the heap file - Scheme can easily be generalized to allocate or deallocate a sequence of several pages and maintain a doubly linked list of these page sequence - Disadvantage of this scheme is that virtually all pages in a file will be on the free list if records are of variable length because it is likely that every page has at least a few free bytes
question
Page Formats [9]
answer
- Page abstraction is appropriate when dealing with I/O issues but higher levels of the DBMS see a collection of records - We can think of a page as a collection of slots each of which contains a record - A record is identified by using the pair ; this is the record id (rid) - An alternative way to identify records is to assign each record a unique integer as its rid and maintain a table that lists the page and slot of the corresponding record for each rid but due to the overhead of this it is not as common of an approach Some alternative approaches to managing slots: 1. Fixed-Length Records 2. Variable-Length Records - Main considerations are how these approaches support operations such as searching, inserting, or deleting records of pages
question
Record Formats [9]
answer
- While choosing a way to organize the fields of a record we must take into account whether the fields of the record are fixed or variable length and consider the cost of various operations on the record - In addition to storing individual records information common to all records of a given record type (such as the number of fields and field types) is stored in the system catalog
question
Fixed-Length Records [9]
answer
- If all records on page are guaranteed to be of the same length, record slots are uniform and can be arranged consecutively within a page - Main issue are how we keep track of empty slots and how we locate all records on a page Alternatives on how we handle deletion of a record: 1. Packed 2. Unpacked, Bitmap - Slotted page organization for variable-length records can also be used for fixed-length records Record Format: - Each field has a fixed length (that is the value in this field is of the same length in all records) and the number of fields is also fixed - The field of such as record can be stored consecutively
question
Packed Fixed-Length Records [1]
answer
- Store records in the first N slots (where N is the number of records on the page) whenever a record is deleted we move the last record on the page into the vacated slot - This approach does not work if there are external references to the record that is moved
question
Un-packed, Bitmap Fixed-Length Records [1]
answer
- Handle deletions by using an array of bits, one per slot to keep track of free slot information - In addition to the information about records on the page a page usually contains additional file-level information (e.g. id of the next page in the file)
question
Variable-Length Records [9]
answer
- If records are of variable length then we cannot divide the page into a fixed collection of slots - When a new record is to be inserted we have to find an empty slot of just the right length - Ability to move records on a page becomes very important - Most flexible organization is to maintain a directory of slots for each page Record Format: - In the relational model every record in a relation contains the same number of fields - If the number of fields is fixed a record is of variable length only because some of its fields are of variable length - One possible organization is to store fields consecutively separated by delimiters - A second possible (superior) organization is to reserve some space at the beginning of a record for use as an array of integer offsets - the ith integer in this array is the starting addressing of the ith field value relative to the start of the record - Second approach gives a clean way to deal with null values - Variable-length record formats can obviously be used to store fixed-length records as well; sometimes the overhead is justified by added flexibility Variable-Length Field Issues: - Modifying a field may cause it to grow - Modified record may no longer fit into the space remaining on its page - We may have to leave a 'forwarding address' on page identifying the new location of the record - Record may grow so large that it no longer fits on any one page (smaller records can be changed together)
question
Directory of Slots [9]
answer
- pair per slot - First component (record offset) is a pointer to the record and is the offset in bytes from start of the data area on the page to the start of the record - Deletion is readily accomplished by setting the record offset to -1 - One way to manage free space is to maintain a pointer (offset from start of the data area on the page) that indicates the start of the free space area - After reorganization all records appear in contiguous order followed by available free space - Slot for a deleted record cannot always be removed from the slot directory since slot numbers are used to identify records - by deleting a slot we change (decrements) the slot number of subsequent slots in the slot directory and thereby change the rid of records pointed to by subsequent slots - Only way to remove slots from the slot directory is to remove the last slot if the record that it points to is deleted - A new slot is added to the slot directory only if all existing slots point to records Fixed-Length Records: - Useful for fixed-length records if we need to move them around frequently - Becomes attractive if we need to move records around on a page for reasons other than keeping track of space feed by deletion - Typical example is that we want to keep records on a page sorted (according to the value in some field) B+ Tree: - We may not care about changing the rid of the record - The slot directory can be compacted after every record deletion - If we do not care about modifying rids we can also sort records on a page in an efficient manner by simply moving slot entries rather than actual records Variation: - A variation on slotted organization is to maintain only record offsets in the slots - This variation makes slot directory structure for pages with fixed-length records the same as for pages with variable-length records
question
System Catalog [9]
answer
- Can be thought of as a description of the contents of a database maintained by the DBMS - Avoids repeated storage of the same information with each record of a given type
question
Null [9]
answer
- Special value used to denote that the value for a field is unavailable or inapplicable - If a field contains a null value the pointer to the end of the field is set to be the same as the pointer to the beginning of the field