Unit 3: Data Storage
cheatsheet1999 opened this issue · 0 comments
Lesson Introduction: Major Data Storage Layouts
Much of the success of computer technology is due to the tremendous progress in data storage. When dealing with big data, the amount of data directly influences the performance of the system.
In this unit, we will learn more about the memory hierarchy and both internal and external data storage.
We will be able to relate the need for external data storage to large internet-based applications that require new and more scalable storage solutions. An example is cloud-based storage, such as the storage systems on AWS.
Topic: Introduction to Data Storage
How is data stored?
Normally, data is stored on a secondary storage device, typically the hard disk. To process the data, the computer first needs to load it into main memory.
Processing Speed
From fastest to slowest: CPU register => CPU cache => Main memory => Hard disk (secondary storage)
Internal Data Storage
A hard disk is a mechanical device, which is why it is very slow.
In the past, people used tape because it was very cheap.
Today, people use SSDs, which are much faster than hard disks.
Data on External Storage
- File
  - A logical collection of data, physically stored as a set of pages
- File Organization
  - A method of arranging a file of records on external storage; each record is identified by a record ID (rid)
- Architecture
  - The buffer manager stages pages from external storage into the main memory buffer pool.
  - The file and index layers make calls to the buffer manager.
Why do we have a buffer manager?
Memory is smaller than disk, so we cannot load all the data into main memory at once; we can only load it page by page.
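The idea of staging pages into a limited buffer pool can be sketched in a few lines of Python. This is a toy illustration, not how any particular database implements it: the `BufferPool` class, the dictionary standing in for the disk, and the least-recently-used (LRU) eviction policy are all assumptions made for the example. Pinning and writing back modified pages are left out.

```python
from collections import OrderedDict


class BufferPool:
    """Toy buffer manager: holds at most `capacity` pages in memory (LRU eviction)."""

    def __init__(self, disk, capacity):
        self.disk = disk              # hypothetical "disk": dict of page_id -> contents
        self.capacity = capacity      # number of frames in the buffer pool
        self.frames = OrderedDict()   # page_id -> contents, least recently used first
        self.disk_reads = 0           # number of pages fetched from the disk

    def get_page(self, page_id):
        if page_id in self.frames:
            # Hit: the page is already in memory, just mark it as recently used.
            self.frames.move_to_end(page_id)
            return self.frames[page_id]
        if len(self.frames) >= self.capacity:
            # Pool is full: evict the least recently used page (assumed policy).
            self.frames.popitem(last=False)
        # Miss: load the page from the disk into a free frame.
        self.disk_reads += 1
        self.frames[page_id] = self.disk[page_id]
        return self.frames[page_id]


disk = {i: f"contents of page {i}" for i in range(10)}
pool = BufferPool(disk, capacity=3)
for page_id in [0, 1, 2, 0, 3, 0, 1]:
    pool.get_page(page_id)

print(pool.disk_reads)      # 5 of the 7 requests had to go to the disk
print(list(pool.frames))    # [3, 0, 1]
```

The disk holds 10 pages but the pool has only 3 frames, so repeated requests for page 0 are served from memory while less recently used pages are evicted to make room.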
Lesson Introduction: Alternative File Organizations
In addition to traditional data storage, there are alternative file organizations. Many alternatives exist, each ideal for some situations, and not so good for others. We will explore more about Heap (random order) files, Sorted files, and Indexes in this topic.
The Cost Model
The number of page accesses is a cost measure.
Reasoning
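Page access cost is usually the dominant cost of database operations. A fully accurate model would be too complex for analyzing algorithms, so counting page accesses is used as an approximation.
The model is a simplification: reading 3 consecutive pages actually takes less time than reading 3 pages at random locations, yet both count as 3 page accesses.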
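As a rough illustration of this cost model, the sketch below counts page accesses for a few operations on a file of `B` pages. The formulas are common textbook approximations (a full scan reads every page, an equality search on a unique key in a heap file reads half the pages on average, and a sorted file allows binary search over the pages). They are not taken from this lesson, and the function names are made up for the example.

```python
import math


def heap_scan_cost(num_pages):
    """Full scan of a heap file: every page is read once."""
    return num_pages


def heap_equality_search_cost(num_pages):
    """Equality search on a unique key in a heap file: half the pages on average."""
    return num_pages / 2


def sorted_equality_search_cost(num_pages):
    """Equality search in a sorted file: binary search over the pages."""
    return max(1, math.ceil(math.log2(num_pages)))


B = 1024  # assumed file size in pages
print(heap_scan_cost(B))               # 1024
print(heap_equality_search_cost(B))    # 512.0
print(sorted_equality_search_cost(B))  # 10
```

Even this crude count shows why the file organization matters: for the same 1024-page file, a selective search touches about 512 pages in a heap file but about 10 in a sorted file.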
Heap File Advantages / Disadvantages
Advantages:
- Efficient for bulk loading data (the order does not matter, so records are simply appended)
- Efficient for relatively small relations, as indexing overheads are avoided
- Efficient when queries need to fetch a large proportion of the stored records
Disadvantages:
- Not efficient for selective queries
- Sorting is time-consuming
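The trade-off can be seen in a small sketch of a heap file. This is a toy model under stated assumptions: the `HeapFile` class, the fixed number of records per page, and the rid layout `(page number, slot number)` are hypothetical choices for illustration only.

```python
class HeapFile:
    """Toy heap file: records are appended to pages in arrival order."""

    def __init__(self, records_per_page):
        self.records_per_page = records_per_page
        self.pages = [[]]  # each page is a list of records

    def insert(self, record):
        # Inserting is cheap: append to the last page, adding a new page if it is full.
        if len(self.pages[-1]) == self.records_per_page:
            self.pages.append([])
        self.pages[-1].append(record)
        # Assumed rid layout: (page number, slot number within the page).
        return (len(self.pages) - 1, len(self.pages[-1]) - 1)

    def select(self, predicate):
        # No ordering to exploit, so a selective query must read every page.
        pages_read = 0
        matches = []
        for page in self.pages:
            pages_read += 1
            matches.extend(r for r in page if predicate(r))
        return matches, pages_read


heap = HeapFile(records_per_page=4)
rids = [heap.insert({"id": i, "name": f"user{i}"}) for i in range(10)]

matches, pages_read = heap.select(lambda r: r["id"] == 7)
print(rids[7])      # (1, 3)
print(matches)      # [{'id': 7, 'name': 'user7'}]
print(pages_read)   # 3
```

Loading the 10 records never looks at more than the last page, but finding a single record reads all 3 pages, which is the cost an index is meant to avoid.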
Indexes
- File Index
  - Speeds up selections on the search key fields
  - Any subset of the fields of a relation can be the search key for an index on the relation
  - An index contains a collection of data entries and supports efficient retrieval of all data entries k* with a given value k
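A minimal sketch of this idea, assuming each data entry k* is simply the rid of a matching record: the index maps a search key value k to the list of rids whose records have that value. The sample records, the `build_index` helper, and the use of an in-memory dictionary are assumptions for illustration; a real index is stored in pages on disk.

```python
from collections import defaultdict

# Hypothetical relation: rid (page number, slot number) -> record
records = {
    (0, 0): {"name": "Ann", "dept": "sales"},
    (0, 1): {"name": "Bob", "dept": "eng"},
    (1, 0): {"name": "Cy", "dept": "sales"},
}


def build_index(records, search_key):
    """Map each search key value k to the rids of the records with that value."""
    index = defaultdict(list)
    for rid, record in records.items():
        index[record[search_key]].append(rid)
    return index


# Any field (or subset of fields) could serve as the search key; here it is "dept".
dept_index = build_index(records, "dept")

sales_rids = dept_index["sales"]
print(sales_rids)                                  # [(0, 0), (1, 0)]
print([records[rid]["name"] for rid in sales_rids])  # ['Ann', 'Cy']
```

With the index in place, a selection on `dept` goes straight to the matching rids instead of scanning every record.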
B+ Tree Indexes
The most popular index structure in database systems.
Non-leaf pages contain index entries, which are used only to direct searches.
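How non-leaf pages direct a search can be sketched as follows. This is a simplified, search-only illustration with a hand-built two-level tree; the `Node` class, the `search` function, and the placeholder entries such as `"rid-20"` are hypothetical. Insertion, page splitting, and links between leaves are omitted.

```python
from bisect import bisect_right


class Node:
    """One page of the tree. Non-leaf pages have children; leaf pages have entries."""

    def __init__(self, keys, children=None, entries=None):
        self.keys = keys          # sorted key values on this page
        self.children = children  # child pages (non-leaf pages only)
        self.entries = entries    # data entries, one per key (leaf pages only)

    @property
    def is_leaf(self):
        return self.children is None


def search(root, key):
    """Return (data entry or None, number of pages visited)."""
    node = root
    pages_visited = 1
    while not node.is_leaf:
        # keys[i] separates children[i] (keys < keys[i]) from children[i + 1] (keys >= keys[i]).
        node = node.children[bisect_right(node.keys, key)]
        pages_visited += 1
    if key in node.keys:
        return node.entries[node.keys.index(key)], pages_visited
    return None, pages_visited


leaves = [
    Node(keys=[5, 10], entries=["rid-5", "rid-10"]),
    Node(keys=[20, 30], entries=["rid-20", "rid-30"]),
    Node(keys=[40, 50], entries=["rid-40", "rid-50"]),
]
root = Node(keys=[20, 40], children=leaves)

print(search(root, 20))   # ('rid-20', 2)
print(search(root, 45))   # (None, 2)
```

The root holds only separator keys and pointers, so each lookup visits one page per level (2 pages here) rather than scanning all the leaves.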
Knowledge Check: Data Storage
- Where is the database stored in a computer?
  - Central Processing Unit
  - [Correct] Hard disk (A database is stored on the hard disk of a computer.)
  - Memory
  - Cache
- What is the correct order of processing speed of major units in a computer from the fastest to slowest?
  - CPU, cache, memory, hard disk
- Why is the processing speed of a traditional computer hard disk lower than that of a modern solid-state drive (SSD)?
  - [Correct] Because a hard disk is a mechanical device. (Unlike a solid-state drive, a hard disk has to spin and spends more time finding a requested data byte.)
  - Because a solid-state drive is a mechanical device.
  - Because the size of a solid-state drive is bigger than that of a hard disk.
  - Because a hard disk can only read pages in sequence.
- What is the name of the software component in a computer that loads pages from the hard disk into memory?
  - Memory Manager
  - [Correct] Buffer Manager (The buffer manager loads pages from the hard disk into memory.)
  - Load Manager
  - Index Manager