Basics of System Design
1. What happens when you hit a URL in a browser
When you hit a URL in a browser, several processes occur to retrieve and display the webpage. Here’s a simplified block diagram representing these steps:
- URL Entry: The user enters a URL in the browser’s address bar.
- DNS Lookup: The browser contacts a DNS server to resolve the domain name to an IP address.
- TCP Connection: The browser establishes a TCP connection with the web server using the IP address.
- HTTP Request: The browser sends an HTTP request to the web server for the desired resource.
- Server Processing: The web server processes the request and prepares an HTTP response.
- HTTP Response: The server sends the HTTP response back to the browser.
- Rendering: The browser renders the received content and displays it to the user.
Here’s a simple block diagram:
+---------+ +-----------+ +---------+ +--------+ +---------+ +---------+ +---------+ +---------+
| User |--->---| Browser |--->---| DNS |--->---| IP |--->---| TCP |--->---| HTTP |--->---| Server |--->---| Browser |
| Action | | | | Lookup | | Address| | Conn. | | Request | | Process | | Rendering |
+---------+ +-----------+ +---------+ +--------+ +---------+ +---------+ +---------+ +---------+
This diagram highlights the main components and interactions involved when a URL is hit in a browser.
2. Distributed Systems
Characteristics of Distributed Systems
- Multiple independent components interact to achieve a common goal.
- Scalability, fault tolerance, and transparency.
- Examples include distributed databases, cloud computing, and microservices.
CAP Theorem
- Consistency: Every read receives the most recent write.
- Availability: Every request receives a response, without guarantee of the most recent write.
- Partition Tolerance: The system continues to operate despite network partitions.
- A distributed system can only provide two out of the three guarantees simultaneously.
Consensus Algorithms (Paxos, Raft)
- Ensure all nodes in a distributed system agree on a single value or state.
- Paxos: A protocol for achieving consensus despite failures.
- Raft: Easier to understand and implement compared to Paxos, used for leader election and log replication.
Fault Tolerance and Replication
- Fault tolerance: The ability to continue operating despite failures.
- Replication: Storing copies of data on multiple nodes to ensure availability and reliability.
- Techniques include active-passive and active-active replication.
3. Microservices
Microservice Architecture for E-commerce Application
+--------+ +--------------+ +------------------+
| Client |<----->| API Gateway |<----->| Aggregation Svc |
+--------+ +--------------+ +------------------+
/ | \
+---------------+ | +---------------+
| User Service | | | Product Service |
| User Database | | | Product Database|
+---------------+ | +---------------+
| Order Service | | | Payment Service |
| Order Database | | | Payment Database|
+---------------+ | +---------------+
| Inventory Svc | |
| Inventory DB | |
+---------------+ |
| Message Broker |
+-----------------+
Explanation
- Client: Sends requests to the API Gateway.
- API Gateway: Routes requests to the Aggregation Service for requests that require data from multiple microservices.
- Aggregation Service:
- User Service: Fetches user-related data and interacts with the User Database.
- Product Service: Retrieves product information from the Product Database.
- Order Service: Gathers order details from the Order Database.
- Payment Service: Obtains payment status from the Payment Database.
- Inventory Service: Checks inventory levels in the Inventory Database.
- Aggregation Service: Combines data from the above services to provide a unified response.
- Message Broker: Ensures asynchronous communication between services as needed.
Use Cases for Aggregation Service
- User Dashboard: Combines user profile, recent orders, and payment status.
- Product Details Page: Aggregates product information, inventory status, and user reviews.
- Order Summary: Gathers order details, payment status, and shipment tracking.
By using an Aggregation Service, the system simplifies complex data retrieval processes for the client, reduces the number of client-server interactions, and improves overall system performance.
4. Data Partitioning
Horizontal vs Vertical Partitioning
- Horizontal Partitioning: Dividing data across different nodes (sharding).
- Vertical Partitioning: Splitting a table into columns and storing them in separate databases.
Sharding
- Distributing data across multiple databases to improve performance and scalability.
- Each shard contains a subset of the data.
Partitioning Strategies (Range, Hash, List, Composite)
- Range Partitioning: Dividing data based on value ranges.
- Hash Partitioning: Using a hash function to determine the partition.
- List Partitioning: Dividing data based on predefined lists.
- Composite Partitioning: Combining multiple partitioning methods.
5. Caching
Caching Strategies (Write-through, Write-back, Write-around)
- Write-through: Data is written to both the cache and the underlying storage.
- Write-back: Data is written to the cache and later synchronized with the storage.
- Write-around: Data is written directly to the storage, bypassing the cache.
Cache Eviction Policies (LRU, LFU, FIFO)
- LRU (Least Recently Used): Evicts the least recently accessed items.
- LFU (Least Frequently Used): Evicts the least frequently accessed items.
- FIFO (First In, First Out): Evicts the oldest items first.
Distributed Caches (Redis, Memcached)
- Redis: In-memory data structure store, supports complex data types, persistence, and replication.
- Memcached: High-performance distributed memory caching system, simple key-value store.
6. SQL vs NoSQL
Relational vs Non-relational Databases
- SQL (Relational): Structured data, predefined schema, ACID properties.
- NoSQL (Non-relational): Flexible schema, horizontal scalability, BASE properties.
ACID vs BASE Properties
- ACID: Atomicity, Consistency, Isolation, Durability.
- BASE: Basically Available, Soft state, Eventual consistency.
Use Cases and Examples
- SQL: Financial systems, ERP, CRM.
- NoSQL: Social media, real-time analytics, IoT applications.
7. Consistent Hashing
How Consistent Hashing Works
- Distributes data across nodes in a way that minimizes reorganization when nodes are added or removed.
- Uses a hash function to map data to a circular space and assigns nodes to points on the circle.
Use Cases in Distributed Systems
- Load balancing, distributed caching, sharding.
Virtual Nodes
- Nodes are assigned multiple positions on the hash ring to improve load distribution and fault tolerance.
8. Sync vs Async
Synchronous and Asynchronous Processing
- Synchronous: Operations are performed sequentially, blocking until completion.
- Asynchronous: Operations can be performed concurrently, allowing for non-blocking execution.
Blocking vs Non-blocking I/O
- Blocking I/O: Waits for the operation to complete before continuing.
- Non-blocking I/O: Allows other operations to proceed while waiting.
Use Cases and Trade-offs
- Synchronous: Simplicity, predictable behavior.
- Asynchronous: Higher performance, better resource utilization.
9. Resilience and Fallback
Circuit Breaker Pattern
- Prevents repeated failures by stopping requests to a failing service.
- Opens the circuit after a threshold of failures, and retries after a timeout.
Retries and Exponential Backoff
- Retries: Automatically retry failed operations.
- Exponential Backoff: Gradually increase the wait time between retries.
Bulkhead Pattern
- Isolates different parts of a system to prevent failures from spreading.
- Ensures that failures in one component do not affect others.
System Design Questions
10. Design URL Shortening Service (Tiny URL)
Requirements
Functional Requirements:
- Users can generate short links from the given URL.
- Redirect to the original URL when clicking on the short link.
- Users can generate custom short links.
- Expiration of links after some duration.
Non-Functional Requirements:
- The system should be highly available because if the service is down, all links become dead.
- Generated links should not be easily guessable.
- Redirection should take less time (low latency).
Capacity Estimation and Constraints
Traffic Estimate:
- Assume a 100:1 read-write ratio for a read-heavy system.
- 500 million new URLs added per month.
- 500 million * 100 = 50 billion redirections per month.
Queries Per Second (QPS):
- Write: 500 million / (30 days * 24 hours * 3600 seconds) ≈ 200 writes per second.
- Read: 200 * 100 = 20,000 reads per second.
Storage (Assume for 5 years):
- 200 writes per second * 1KB per URL * 5 years * 365 days * 24 hours * 3600 seconds ≈ 30 TB.
Bandwidth:
- Downlink: 200 writes per second * 1KB = 200KB per second.
- Uplink: 200KB * 100 = 20MB per second.
Memory Estimate:
- Follow the 80-20 rule: 20% of URLs generate 80% of traffic.
- 500 million writes per month.
- 20,000 requests per second * 24 hours * 3600 seconds = 1.7 billion requests per day.
- 20% of 1.7 billion requests need ≈ 170GB of memory.
Algorithm
Short URL Generation:
- Generate a unique short URL using a hashing algorithm.
- Check for collisions:
- If a collision occurs, append additional characters or rehash.
- Store the original URL and short URL in the database.
Redirection:
- When a short URL is accessed, query the database for the original URL.
- Increment the access count.
- Redirect to the original URL.
Expiration Handling:
- Store the creation timestamp of each short URL.
- Periodically check for expired URLs and remove them from the database.
Handling Custom URLs:
- Allow users to specify a custom short URL.
- Check for availability of the custom URL before storing.
11. Design Instagram Timeline
- Requirements and constraints
- Data modeling
- Feed generation strategies
- Handling real-time updates
12. Design Dropbox
- Requirements and constraints
- File storage and synchronization
- Data deduplication
- Minimizing data transfer
13. Design Flash Sale System
- Requirements and constraints
- Load balancing
- Inventory management
- Handling sudden load increases
14. Design YouTube/Netflix
- Requirements and constraints
- Video storage and encoding
- Chunking and streaming
- Content Delivery Network (CDN) usage
15. Design API Rate Limiter
- Requirements and constraints
- Rate limiting algorithms (token bucket, leaky bucket)
- Distributed rate limiting
16. Design Typeahead Suggestion
- Requirements and constraints
- Data structures (Trie, Ternary Search Tree)
- Handling large datasets
- Ranking and relevance
17. Design Web Crawler
- Requirements and constraints
- URL frontier management
- Politeness policies
- Distributed crawling
18. Design Nearby Search
- Requirements and constraints
- Geospatial indexing (R-tree, Quad-tree)
- Efficient querying
- Handling large datasets
19. Design Uber-like System
- Requirements and constraints
- Real-time location tracking
- Matching algorithms
- Surge pricing
20. Design Payment Service
- Requirements and constraints
- Payment gateways
- Security and fraud detection
- Handling concurrency and idempotency
21. Design WhatsApp
- Requirements and constraints
- Real-time messaging
- Data storage and synchronization
- End-to-end encryption
22. Design Scalable Notification System
- Requirements and constraints
- Notification delivery mechanisms (push, SMS, email)
- Handling millions of notifications within a short time
- Ensuring reliability and scalability
23. Design Event Pipeline
- Requirements and constraints
- Event sourcing
- Stream processing frameworks (Kafka, Flink)
- Ensuring data integrity and consistency
24. Design Scalable Logging Framework
- Requirements and constraints
- Log collection and aggregation
- Storage and retrieval
- Real-time log analysis
25. Design Swiggy-like Hyperlocal Search
- Requirements and constraints
- Efficient search algorithms
- Handling dynamic data
- Real-time updates