Design Pastebin
Feature Extraction - Functional requirements (5 min)
- Use Cases - What are we trying to give the users that they don't already have?
- User enters a block of text and gets a randomly generated link.
- Expiration: default setting does not expire. Can optionally set a timed expiration.
- User enters a paste's url and views the contents
- User is anonymous.
- Service tracks analytics of pages: monthly visit stats
- Service deletes expired pastes
- Service has high availability
- User enters a block of text and gets a randomly generated link.
- Scenarios that will not be covered
- User registers for an account. User verifies email. User logs into a registered account.
- User edits the document.
- User can set visibility.
- User can set the shortlink.
Estimation - Nonfunctional Requirements (5 minutes)
- Who will use
- 10 million users
- How many will use
- 10 million paste writes per month
- 100 million paste reads per month
- Read/Write ratio
- 10:1 read to write ratio
- Traffic estimates (QPS for read and write queries)
- 3.8 ~= 4 write requests per second on average
- 40 read requests per second on average
- Storage estimates
- Size per paste: 1.27KB total
- 1KB content per paste
- short link: 7 bytes
- expiration_length_in_minutes: 4 bytes
- created_at: 8 bytes
- pate_path: 255 bytes
- 1.27K * 4 ~= 5k bytes paste per second ~= 150GB paste per year -> assume most are new pastes instead of updating the existing pastes
- 120 million new short links per year
- Memory estimates
- If we are using a cache, what is the kind of data we want to store in cache
- How much RAM and how many machines do we need for us to achieve this ?
- Amount of data you want to store in disk/ssd
Design Goals (5 minutes)
- Latency and Throughput requirements
- P90 1 second for read: 1 second
- P90 2 seconds for write: 2 seconds
- Consistency vs Availability [Weak/strong/eventual => consistency | Failover/replication => availability]
- Don't give a presentation. Give examples of consistency vs. availability and tell your assumption and ask if the assumption is correct.
In a distributed computer system, you can only support two of the following guarantees:
- Consistency - Every read receives the most recent write or an error
- Availability - Every request receives a response, without guarantee that it contains the most recent version of the information
- Partition Tolerance - The system continues to operate despite arbitrary partitioning due to network failures
Networks aren't reliable, so you'll need to support partition tolerance. You'll need to make a software tradeoff between consistency and availability.
CP - consistency and partition tolerance Waiting for a response from the partitioned node might result in a timeout error. CP is a good choice if your business needs require atomic reads and writes.
AP - availability and partition tolerance Responses return the most readily available version of the data available on any node, which might not be the latest. Writes might take some time to propagate when the partition is resolved. AP is a good choice if the business needs to allow for eventual consistency or when the system needs to continue working despite external errors.
High Level Design (5 - 10 minutes)
- what to store/where to store
- Database schema
shortlink char(7) NOT NULL,
expiration_length_in_minutes int NOT NULL,
created_at datetime NOT NULL,
paste_path varchar(255) NOT NULL,
PRIMARY KEY(shortlink)
- Mention the NoSql type of the db first before mentioning dynamodb as an example
- Relational DB as a large hash table mapping the generated url to a file server
- Instead of managing a file server, we are using a managed object store such as Amazon S3 (strongly consistent noSQL) or NoSQL document store.
- APIs for Read/Write scenarios for crucial components
writePaste(content, expiration_length_in_minutes, created_at)
The write API server does the following a. generates a unique url
- checks if the url is uniquee by looking at the SQL database for a duplicate
- If the url is not unique, it generates another url
b. Saves to the SQL database pastes table c. Saves the paste data to the object store d. Returns the url
getPaste(short_link, paste_path=optional)
- The client sends a get pate request to the web server.
- The webserver forwards the request to the READ API server.
- The read API server does the following
a. checks if the short link exists in the db.
- if the url exists in the db, fetch the paste contents from the object store. else return an error message to the user.
- High level design for Read/Write scenario
DEEP DIVE [15-20 min]
- Reading 1MB sequentially from memory takes about 250 microseconds, while reading from SSD takes 4X and from disk takes 80x longer.
Vertical vs. Horizontal Scaling
--- | Pros | Cons |
---|---|---|
Vertical | Nothing changes about your db infrastructure. The difference is you have more CPUS, storage, memory. | 1. Instances with more processing power can be a lot more expensive. 2. The available configuration options from cloud providers might be limited. You may need only 10% more resources, but your option may be 50% more. 3. If scaling vertically requires migration between hardwares, it could result in downtime or service disruption. |
Horizontal | 1. More predictable increase in pricing. Adding additional machine is a predictable cost. 2. Infinitely scalable as you can always add another machine. | More below |
Sharding (Horizontal scaling)
e.g. Dividing the data into subsets based on the alphabetical order of shortlinks.
Pros | Cons |
---|---|
1. Improves system's fault tolerance and availability. No single point of failure. | 1. Needs to make sure data is evenly distributed across the shards. 2. There might be duplicate or loss of data. |
Data replication (Horizontal scaling)
- Master-slave replication: The master serves reads and writes, replicating writes to one or more slaves, which serve only reads. Slaves can also replicate to additional slaves like a tree structure. If the master goes offline, the system can continue to operate in read-only mode until a slave is promoted to a master or a new master is provisioned.
Pros | Cons |
---|---|
1. system availability and fault tolerance is greatly improved. all machines have the same database with the same data stored, the system can continue to operate without interruption. 2. data requests can be distributed across multiple machines, so it leads to improved performance. | Additional logic is needed to promote a slave to a master. |
- Master-master replication: Both masters serve reads and writes and coordinate with each other on writes. If either master goes down, the system can continue to operate with both reads and writes.
Cons |
---|
1. you need a load balancer to determine where to write. |
2. Most master-master systems are either loosely consistent (violating ACID) or have increased write latency due to synchronization. |
Disadvantages of Data replication
- There is a potential for loss of data if the master fails before any newly written data can be replicated to other nodes.
- Writes are copied to the read replicas. If there are a lot of writes, the read replicas can get bogged down with copying writes and can't do as many reads.
- The more read slaves, the more you have to replicate, which leads to greater replication lag.