1. Why do we need URL shortening? (1 min)
Save space and typing.
Less likely to have typos.
0. Threat Modeling (18 min)
1. What are we doing?
We are building a service that helps users map between a long URL and a short URL (domain name + 7 alphanumeric characters) [62^7 ≈ 3.5T options]
2. What could go wrong?
Users might abuse the service and repeatedly use up our storage as a DoS attack.
Users might submit duplicate URLs.
Attackers can spoof the DNS server and cause a URL to be mapped to a malicious web page.
XSS is possible in both persistent and reflected forms: persistent XSS can store malicious URLs and scripts in our DB, while reflected XSS can use the shortened link to trick other users into clicking it.
CSRF (cross-site request forgery) is also possible with the shortened links.
3. What can we do with it?
For injection, XSS, and CSRF, the service can check for and reject any JS or CSS script content embedded in the URL.
We should also use the built-in input validation and prepared-statement facilities of the language or DB to escape every potentially malicious URL before we process it.
Whitelist a carefully selected set of characters, and reject any character not in the whitelist (see the sketch after this list).
For DNS spoofing, we need strong checksums and cryptographic protection on the DNS service, and we must ensure that record updates can only be performed by validated admins.
For DoS attacks, we should first make sure we have IPv6 and IPsec in place, so that when attackers try to spoof an IP we can immediately detect the manipulation, since IPsec provides payload encryption and packet authentication.
Then we can monitor the system and detect abnormal user behavior.
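A minimal sketch of the whitelist validation mentioned above, in Python (the allowed character set and the scheme check are assumptions, not a complete sanitizer):

```python
import re
from urllib.parse import urlparse

# Characters we choose to allow in submitted URLs (an assumed whitelist;
# anything outside this set is rejected before further processing).
ALLOWED_URL_CHARS = re.compile(r"^[A-Za-z0-9\-._~:/?#\[\]@!$&'()*+,;=%]+$")

def is_acceptable_url(url: str) -> bool:
    """Reject URLs containing non-whitelisted characters or dangerous schemes."""
    if not ALLOWED_URL_CHARS.match(url):
        return False                    # character outside the whitelist
    scheme = urlparse(url).scheme.lower()
    return scheme in ("http", "https")  # blocks javascript:, data:, etc.
```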
4. How do we know if we did a good job?
List all the threats we have identified.
Have our developers or a dedicated team test all of them.
If we find bugs, give feedback to the development team with detailed logs and suggested fixes, and collaborate closely with them to iterate on testing quickly.
2. Requirements and Goals of the System (6 min)
🚩 Ask first about the scope the interviewer has in mind
Requirements
Functional
Non-functional
✅ We should mitigate potential XSS or CSRF attacks in which attackers shorten their URLs to disguise them from users.
Users can pass in a long URL (up to 512 characters per the schema below), and we will return a new shortened URL: our domain name plus 7 alphanumeric characters.
Given the shortened URL, users should be able to access the original page efficiently without delay.
The service should keep all links safe and periodically prune expired URLs based on their expiration dates.
A URL should be renewable to extend its expiration date.
3. Capacity Estimation and Constraints (17 min)
Users
Total users
❌ 50 Million users
❌ Active users
10 Million
Requests
Read
Each user reads 5 links per day, so 10 million users generate 50 million read requests per day.
Write
❌ Suppose the read:write ratio is 1000:1; then there will be 50K writes per day.
Storage for DB
❌ Size of one shortened URL entry: base URL (8 bytes) + unique ID (7 bytes) ≈ 16 bytes
❌ Say about 50K URLs are created per day; then the total size to be stored will be 50K/day * 16 bytes = 800K bytes/day ≈ 800 KB/day
❌ Since our uniqueID is 7 characters long, we can use 7 bytes to save it.
❌ Unique ID for URL
If we assume 50K new URLs are created/shortened each day, we can find the time required to use up all the options:
2^(7*8) / 50K / 365 ≈ 4 billion years to exhaust all the options
❌ After 10 years the DB size would be 800 KB/day * 365 days * 10 years = 2,920,000 KB ≈ 2.9 GB,
which can easily fit into a modern server.
4. System API (4 min)
We decided to use a REST API, since it is more flexible, easy to learn, and easy to use with JSON data.
shortenURL(apiDevKey, oldURL, validDate, customAlias, userName)
apiDevKey: the API key of a registered developer, used to control access and quotas.
oldURL: the original long URL string obtained from the user.
validDate: how long the URL will live (its expiration date).
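As a rough sketch (not the final implementation), the endpoint could look like the Python function below; the in-memory dict, the random-key fallback, and the short.ly domain are placeholders:

```python
import secrets
import string
from typing import Optional

_ALPHABET = string.ascii_letters + string.digits   # 62 alphanumeric characters
_DB: dict = {}                                     # in-memory stand-in for the real store

def shorten_url(api_dev_key: str, old_url: str, valid_date: Optional[str] = None,
                custom_alias: Optional[str] = None, user_name: Optional[str] = None) -> str:
    """Sketch of shortenURL(apiDevKey, oldURL, validDate, customAlias, userName)."""
    key = custom_alias or "".join(secrets.choice(_ALPHABET) for _ in range(7))
    _DB[key] = {"url": old_url, "expires": valid_date, "owner": user_name}
    return "https://short.ly/" + key               # placeholder domain + 7-char key
```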
5. Database Design (6 min)
❌ Solution 1
Storage
Since the structure of our data is relatively fixed and most of the relationships are simple key-value pairs, we can use a relational database for storage.
Performance Optimization
We can have one or several cache servers in front of the DB (behind the application servers) to handle high read throughput.
We can use load balancers (1) between clients and application servers, and (2) between application servers and the caches.
We can use replication to handle most of the fault tolerance, with redundancy at the application server, cache, and load balancer layers (master nodes handle writes and then propagate them to the slave nodes; slave nodes only handle read requests and take over the master's position when it goes down).
6. Basic System Design and Algorithm (12 m) ⭐
Encoding the actually URL
Database Schema
We save the shortened URL id (7 bytes) as the key and the original URL (up to 512 bytes per the schema below) as the value in a table.
We just need some hash function to map the URL into our desired key range: we can use SHA-256 to get the hash output, then take the top 7 (or 6) characters to map into our table.
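A small sketch of this hashing step (the base62 alphabet and 7-character length follow the notes; collision handling is only noted in a comment):

```python
import hashlib
import string

ALPHABET = string.digits + string.ascii_uppercase + string.ascii_lowercase  # 62 characters

def short_key(url: str, length: int = 7) -> str:
    """SHA-256 the URL, base62-encode the digest, keep the first `length` characters."""
    digest = int.from_bytes(hashlib.sha256(url.encode()).digest(), "big")
    chars = []
    while digest and len(chars) < length:
        digest, rem = divmod(digest, 62)
        chars.append(ALPHABET[rem])
    return "".join(chars)

# Usage: short_key("https://example.com/a/very/long/path") -> a 7-character key.
# Collisions remain possible, so the DB insert must check for an existing key
# and re-hash (e.g. with a salt) or pick a different key on conflict.
```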
Optimizations
❤ [With Cache] We can use consistent hashing to allocate servers and "virtual servers" to serve URLs whose IDs fall in different ranges. This prevents service loss when a server goes down and spreads the load more evenly.
We can also generate all the keys up front and save the unused keys in the DB, with some of them kept in a global cache. This gives us faster writes, because the key-generation step is done offline (how? where do we save them? we would still need to run the hash again, though)
✅
Size estimation
With 6 characters per unique key in a base-64 encoding, we have 64^6 (unique keys) * 6 (bytes or chars per key) ≈ 412 billion bytes = 412 GB
Using replica to be fault tolerant
Since we don't keep too many keys in the cache, even if the cache goes down and loses some keys we don't worry too much (because the key space is huge).
7. Data Partitioning and Replication
Since all the keys are already hashed values, we can just use key-based partitioning and store the partitions on different servers. When we need to read them, an aggregator server and load balancer handle selecting the right partition (see the sketch below). ✅
❌ Replication works too: since the key-value size is not large, we can have multiple servers store exactly the same data, with the master node handling writes and passing them down to the slave nodes with little latency (a few milliseconds).
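A minimal consistent-hash-ring sketch for the key-based partitioning above (the number of virtual nodes and the shard names are assumptions):

```python
import bisect
import hashlib

class ConsistentHashRing:
    """Map short-URL keys to partitions; adding/removing a shard only remaps nearby keys."""

    def __init__(self, shards, vnodes: int = 100):
        # Each shard gets `vnodes` virtual positions on the ring for a more even spread.
        self._ring = sorted((self._hash(f"{s}#{i}"), s) for s in shards for i in range(vnodes))
        self._points = [point for point, _ in self._ring]

    @staticmethod
    def _hash(value: str) -> int:
        return int.from_bytes(hashlib.md5(value.encode()).digest()[:8], "big")

    def shard_for(self, short_key: str) -> str:
        idx = bisect.bisect(self._points, self._hash(short_key)) % len(self._ring)
        return self._ring[idx][1]

ring = ConsistentHashRing(["db-shard-1", "db-shard-2", "db-shard-3"])
print(ring.shard_for("aZ3kP9q"))   # -> one of the three shards
```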
9. Load Balancer (LB) (done)
10. Purging or DB cleanup
We don't really need to do this too often; we can let expired links stay a little longer (we can also ask users whether they want to extend them). Then every day, when read traffic is low, we can iterate through the DB, remove the expired links, and update the master and slave nodes.
11. Telemetry (Monitoring) (9 min)
Record data such as: the country of access, platform of access, date and time of access, web pages that refer to the link, and how often a link is visited (useful for hotspot load balancing)
12. Security and Permissions
Can users create private URLs? Or allow a particular set of users to access a URL?
The permission level can be stored together with the URL key in a new column holding a binary value (public/private). Return HTTP 401 if the link is not public (see the sketch below).
If we use a wide-column NoSQL DB, we can store the permitted users as columns, or use another data structure to map the relationship.
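A tiny sketch of that permission check (the is_public flag and allowed_users set are assumed field names):

```python
from typing import NamedTuple, Optional, Set, Tuple

class UrlRecord(NamedTuple):
    original_url: str
    is_public: bool          # the binary public/private column described above
    allowed_users: Set[int]  # users permitted to open a private link

def resolve(record: UrlRecord, user_id: Optional[int]) -> Tuple[int, Optional[str]]:
    """Return (302, target URL) if access is allowed, otherwise (401, None)."""
    if record.is_public or (user_id is not None and user_id in record.allowed_users):
        return 302, record.original_url
    return 401, None
```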
8. Cache (done)
🚩 Users should optionally be able to pick a custom short link for their URL.
🚩 The system should be highly available, and it should redirect in real-time with minimal latency.
🚩 Shortened links should not be guessable (not predictable)
🚩 Extended requirements
Analytics: e.g. how many times a redirection happened?
Our service should also be accessible through REST APIs by other services.
✅ 100:1 ratio between read and write
✅ 500 Million users
50 Billion read, and 500 Million write per month
🚩 Query Per Second (QPS)
✅ New URL shortening per second
500 million / (30 days * 24 hours/day * 3600 sec/hour) ≈ 200 URLs/sec
✅ URL query per second
50 billion / (30 days * 86,400 sec/day) ≈ 20K URLs/sec = 100 * the write QPS
✅ After 5 years with 500M new shortened URLs per month
5 years * 12 months/year * 500 million/month = 30 billion URLs
Files/objects for storage
Assuming 500 bytes per stored object:
30 billion * 500 bytes = 15 trillion bytes = 15 TB to store all the objects
✅ Bandwidth estimation
Incoming
We know we will have 200 new URLs/sec, so the total incoming data will be:
200 urlObj/sec * 500 bytes/urlObj = 100 KB/sec
Outgoing
Since every second there will be about 20K read requests, the outgoing response bandwidth will be:
20K urlObj/sec * 500 bytes/urlObj = 10 MB/sec
🚩 ✅ Memory estimation (Cache)
Rule of thumb: 80/20, i.e. 20% of the URLs are hotspots that take about 80% of the traffic.
Thus, based on the 80/20 rule:
Hotspot incoming bandwidth = 100 KB/sec * 80% = 80 KB/sec, and hotspot outgoing bandwidth = 10 MB/sec * 80% = 8 MB/sec.
The cache only needs to hold one day of read-only traffic, and thus:
Requests per day = 20K/sec * 3600 sec/hour * 24 hours ≈ 1.7 billion
The cache requirement will be 20% * 1.7 billion * 500 bytes/urlObj ≈ 170 GB.
And since many requests are for duplicate URLs, we can actually use less than 170 GB for the cache.
✅ Summary: assuming 500 million new URLs per month and a 100:1 read:write ratio (reproduced in the sketch after this summary)
QPS
Writes (new URLs, incoming): 200/second
Reads (redirects, outgoing): 20,000/second
Storage after 5 years
15 TB (30 billion objects, 500 bytes each)
Cache usage (per day)
<= 170 GB
Bandwidth
Incoming
100 KB/second
Outgoing
10 MB/second
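The ✅ numbers in this summary can be reproduced with a few lines of arithmetic (assumptions as stated above: 500 M new URLs/month, 100:1 read:write, 500 bytes per object):

```python
SECONDS_PER_MONTH = 30 * 24 * 3600                           # 2,592,000 seconds

new_urls_per_month = 500_000_000
read_write_ratio = 100
object_size = 500                                            # bytes per stored URL object

write_qps   = new_urls_per_month / SECONDS_PER_MONTH         # ~193   -> ~200/sec
read_qps    = write_qps * read_write_ratio                   # ~19.3K -> ~20K/sec
storage_5y  = new_urls_per_month * 12 * 5 * object_size      # 1.5e13 bytes = 15 TB
incoming_bw = write_qps * object_size                        # ~100 KB/sec
outgoing_bw = read_qps * object_size                         # ~10 MB/sec
cache_daily = 0.2 * read_qps * 86_400 * object_size          # ~170 GB (80/20 rule)
```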
🚩 returns: (String) the shortened URL; returns an error code if the request failed.
🚩 deleteURL(apiDevKey, urlKey)
Returns "URL removed" on success; otherwise the request is ignored.
🚩 User abuse prevention: use the API dev key to prevent abuse: only a specific number of URLs can be created per time period (the limit can differ by user group).
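A hedged sketch of that throttling idea, as a fixed-window counter keyed by apiDevKey (the limit and window length are assumptions and could differ per user group):

```python
import time
from collections import defaultdict

class DevKeyLimiter:
    """Allow each apiDevKey at most `limit` URL creations per `window` seconds."""

    def __init__(self, limit: int = 100, window: int = 3600):
        self.limit, self.window = limit, window
        self._windows = defaultdict(lambda: [0.0, 0])   # key -> [window_start, count]

    def allow(self, api_dev_key: str) -> bool:
        now = time.time()
        start, count = self._windows[api_dev_key]
        if now - start >= self.window:          # current window expired, start a new one
            self._windows[api_dev_key] = [now, 1]
            return True
        if count < self.limit:
            self._windows[api_dev_key][1] += 1
            return True
        return False                            # over quota: reject the shortenURL call
```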
🚩 Observations
We need to store billions of records.
Each object we store is small (less than 1K)
There are no relationships between records -- other than storing which user created a URL.
Our service is read-heavy.
🚩 Database Schema
Table 1: for storing info about the URL mappings
  Key: Hash: varchar(16)
  Values:
    OriginalURL: varchar(512)
    CreationDate: datetime
    ExpirationDate: datetime
    UserID: int
Table 2: for the data of the user who created the short link
  Key: UserID: int
  Values:
    Name: varchar(20)
    Email: varchar(32)
    CreationDate: datetime
    LastLogin: datetime
🚩 Tools
NoSQL is the best fit, since we don't have many relationships between records: DynamoDB, Cassandra, or Riak.
Issues: if there are duplicate keys (hash collisions) we will waste time waiting on retries, so we can generate all keys offline and then load them into a cache for faster access.
Concurrency issues: multiple servers reading keys concurrently.
Use two tables to store keys: one for used keys and one for unused keys. We update the tables as soon as a key is used.
✅
As soon as keys are loaded into memory from the key DB, they are moved to the used table, so servers are guaranteed to hand out different keys (see the sketch below).
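A minimal in-memory sketch of that handover (a lock stands in for the DB-level coordination; the key values are made up):

```python
import threading

class KeyGenerationService:
    """Hand out pre-generated keys; a key is marked used before it is returned."""

    def __init__(self, pregenerated_keys):
        self._unused = list(pregenerated_keys)   # stand-in for the "unused keys" table
        self._used = set()                       # stand-in for the "used keys" table
        self._lock = threading.Lock()            # serializes concurrent app servers

    def next_key(self) -> str:
        with self._lock:
            key = self._unused.pop()
            self._used.add(key)                  # moved to the used table first
            return key

kgs = KeyGenerationService(["aB3xQ9z", "Zz81kLm", "Q0pR7tY"])
print(kgs.next_key())                            # each caller gets a distinct key
```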
Key look up
Look the key up in the cache or DB; if the key is found, send "HTTP 302" to redirect, and if not, send "HTTP 404" or redirect to the homepage (see the sketch below).
We will also need consistent hashing to distribute the partition workload evenly across servers.
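A sketch of that lookup path with an in-memory dict standing in for the cache/DB (the example key and the homepage URL are placeholders):

```python
from typing import Optional, Tuple

URL_TABLE = {"aB3xQ9z": "https://example.com/some/long/article"}   # cache/DB stand-in
HOMEPAGE = "https://short.ly/"                                      # placeholder domain

def redirect(short_key: str) -> Tuple[int, str]:
    """Return (302, original URL) on a hit, otherwise (404, homepage)."""
    original: Optional[str] = URL_TABLE.get(short_key)
    if original is not None:
        return 302, original
    return 404, HOMEPAGE
```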
🚩 How much cache memory should we have?
Just 20% of daily traffic
170 GB: 20K read QPS means about 1.7 billion queries per day; 20% of that is about 170 GB, assuming each request object takes 500 bytes.
A modern server can have 256 GB of memory, so this can fit on a single machine.
🚩 Cache eviction policy: LRU
How to update the cache?
Since we care more about read throughput here, we can sacrifice some write time by using the write-through method to propagate data changes (write-through updates both the cache and the DB when an update comes in, and waits until both finish).
We didn't use write-back here, since we want to minimize data loss (but it is still a good option).
We didn't use the write-around policy, because it would slow reads down significantly: it updates only the DB, removes the key from the cache, and lets the cache be repopulated at read time when a client cannot find the key in the cache (see the sketch below).
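A small sketch combining LRU eviction with write-through (the cache capacity and the dict-backed "DB" are assumptions):

```python
from collections import OrderedDict

class WriteThroughLRUCache:
    """LRU cache in front of a DB: writes go to both stores, reads refresh recency."""

    def __init__(self, db: dict, capacity: int = 1000):
        self.db = db
        self.capacity = capacity
        self._cache = OrderedDict()

    def put(self, key: str, value: str) -> None:
        self.db[key] = value                 # write-through: update the DB ...
        self._cache[key] = value             # ... and the cache before returning
        self._cache.move_to_end(key)
        if len(self._cache) > self.capacity:
            self._cache.popitem(last=False)  # evict the least recently used entry

    def get(self, key: str):
        if key in self._cache:
            self._cache.move_to_end(key)     # mark as recently used
            return self._cache[key]
        value = self.db.get(key)             # cache miss: fall back to the DB
        if value is not None:
            self._cache[key] = value         # repopulate the cache on a DB hit
            if len(self._cache) > self.capacity:
                self._cache.popitem(last=False)
        return value
```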
✅ Constant health checking + a Round Robin approach for the LB.
Between clients and servers, between servers and the DB, and between servers and the cache (3 places).
🚩 More techniques for lazily deleting expired links
Delete at runtime: when a user requests an expired link, we just delete it and return an error to the user.
Periodically delete when traffic is low (a cleanup sketch follows this list).
Default expiration of 2 years.
After removing an expired link, we can put its key back into the key DB for reuse.
We don't need to worry about inactive links, since storage is cheap.
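A sketch of that low-traffic cleanup pass (the record shape and the reusable key pool are assumptions):

```python
from datetime import datetime
from typing import List, Optional

def purge_expired(url_table: dict, unused_key_pool: List[str],
                  now: Optional[datetime] = None) -> int:
    """Delete expired mappings and return their keys to the pool for reuse."""
    now = now or datetime.utcnow()
    expired = [key for key, record in url_table.items() if record["expires"] < now]
    for key in expired:
        del url_table[key]
        unused_key_pool.append(key)          # the key can be handed out again later
    return len(expired)                      # number of links removed in this pass
```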