Mobile System Design
Library
External API
Configurability
Any library must be configurable. What can be configured:
- maxMemoryCacheSize
- maxMemoryDiskSize
- maxParallelDownloads
- maxImageSize
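A minimal sketch of how the options above could be grouped into one object. The field names come from the list; the units, default values, and the validation rules are assumptions made for illustration only.

```kotlin
// Hypothetical configuration object.
// Field names come from the list above; units and defaults are assumptions.
data class Config(
    val maxMemoryCacheSize: Long = 32L * 1024 * 1024,  // bytes kept in the in-memory cache
    val maxMemoryDiskSize: Long = 256L * 1024 * 1024,  // bytes kept in the on-disk cache
    val maxParallelDownloads: Int = 4,                 // simultaneous network transfers
    val maxImageSize: Int = 4096                       // longest image side, in pixels
) {
    init {
        // Fail fast on values that can never work instead of misbehaving later.
        require(maxMemoryCacheSize > 0) { "maxMemoryCacheSize must be positive" }
        require(maxMemoryDiskSize > 0) { "maxMemoryDiskSize must be positive" }
        require(maxParallelDownloads > 0) { "maxParallelDownloads must be positive" }
        require(maxImageSize > 0) { "maxImageSize must be positive" }
    }
}
```

Defaults let a user call `Config()` and override only what matters, for example `Config(maxParallelDownloads = 2)`.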
Quality requirements
The library is an external dependency, so it must meet very high quality requirements (crash-free, leak-free, etc.).
Remember the Facebook SDK failure.
Documentation + Support
External users need a well-documented API so that they can work with the library in a straightforward way, without hacks.
Importance
The API must be designed very carefully. Once released, an API is hard to change without annoying its users.
Examples of API:
Base functions:
- init(config: Config)
- enable()/disable()
- functional methods
See also:
🚩 API for simple functionality
🚩 Downloading/Uploading API
Privacy
Libraries that gather data and send it to servers (for example, FirebaseAnalytics) must provide a way to be disabled, to guarantee users' privacy.
Methods - enable/disable
Downloading/Uploading API
API
DownloadTask:
- id: Int - unique id of a task
- pause()
- resume()
- stop()
- addCallback(TaskCallback)
TaskCallback:
- onProgress(idTask: Int, progress: Int)
- onCompleted(idTask: Int)
- onCancelled(idTask: Int)
- onFailed(idTask: Int, exception: Throwable)
Example:
fun execute(): DownloadTask
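The outline above can be sketched as Kotlin interfaces. `FakeDownloadTask` and its `simulateProgress` method are hypothetical helpers that exist only to show how callbacks are driven; a real task would be fed by network I/O.

```kotlin
interface TaskCallback {
    fun onProgress(idTask: Int, progress: Int)
    fun onCompleted(idTask: Int)
    fun onCancelled(idTask: Int)
    fun onFailed(idTask: Int, exception: Throwable)
}

interface DownloadTask {
    val id: Int
    fun pause()
    fun resume()
    fun stop()
    fun addCallback(callback: TaskCallback)
}

// Hypothetical in-memory task: progress is pushed in by hand instead of by real network I/O.
class FakeDownloadTask(override val id: Int) : DownloadTask {
    private val callbacks = mutableListOf<TaskCallback>()
    private var paused = false
    private var finished = false

    override fun pause() { paused = true }
    override fun resume() { paused = false }

    override fun stop() {
        if (finished) return
        finished = true
        callbacks.forEach { it.onCancelled(id) }
    }

    override fun addCallback(callback: TaskCallback) { callbacks += callback }

    // Reports progress (0..100) unless the task is paused or already finished.
    fun simulateProgress(progress: Int) {
        if (paused || finished) return
        callbacks.forEach { it.onProgress(id, progress) }
        if (progress >= 100) {
            finished = true
            callbacks.forEach { it.onCompleted(id) }
        }
    }
}
```

Passing the task id into every callback lets one callback instance observe several tasks.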
API for simple functionality
What is Simple functionality?
Usually, it's a piece of your product's code that executes a single task. For example, Payments Module is responsible for only Payments.
Best Practices
We recommend the Async approach wherever a function does heavy work.
If you mix both approaches, agree on a naming convention. For example, all blocking methods may have the "Blocking" suffix in their names.
Blocking
fun executeSomething()
Under the hood, this function does a lot of work and blocks the current thread
Async
fun execute(callback: Callback) - callback approach
fun execute(): Single/Observable/Flow - reactive approach
suspend fun execute() - kotlin coroutines approach
Calling these methods doesn't block the current thread.
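A small sketch of the naming convention, using only a plain executor and the callback approach. `ReportGenerator` and its methods are hypothetical names; the "heavy work" is a stand-in computation.

```kotlin
import java.util.concurrent.Executors

// Hypothetical component that exposes the same operation in both styles.
class ReportGenerator {
    private val executor = Executors.newSingleThreadExecutor()

    // "Blocking" suffix: the work runs on the caller's thread.
    fun generateBlocking(input: List<Int>): Int = input.sumOf { it * it }

    // Async variant: returns immediately; the callback is invoked on a worker thread.
    fun generate(input: List<Int>, callback: (Int) -> Unit) {
        executor.execute { callback(generateBlocking(input)) }
    }

    // Releases the worker thread when the component is no longer needed.
    fun shutdown() {
        executor.shutdown()
    }
}
```

Note that the callback arrives on the worker thread, so the caller must switch threads itself if it needs to touch the UI.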
The problem
Working with big files may require a way to manage the process. That's why the API may return a special object - "DownloadTask".
Big files:
See 🚩 Big Files
ThreadPool Executor
The problem
Where and how to choose the threads on which a task will be executed.
Best practice
It's better to delegate such decisions to lower-level components (for example, NetworkClient, DbClient, etc.).
The first reason. A lower-level component knows better which thread pool should execute a task (IO, Default, a separate special pool).
The second reason. Higher-level components must be free of unnecessary knowledge about the details of lower-level components. They just call a method, and that's all.
IO vs Default
IO - for long-running blocking operations (network, db, file system). The main goal is to avoid waiting because the pool limit has been reached. For example, Dispatchers.IO (Kotlin Coroutines) is limited by default to 64 threads (or to the number of CPU cores, if that is larger).
Default - for short, CPU-intensive operations. The number of threads is derived from the number of CPU cores to reduce thread switching.
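A sketch of the delegation idea with plain JDK executors instead of coroutines. `AppExecutors`, `NetworkClient`, and `FeedRepository` are hypothetical names, and the pool sizes are illustrative assumptions.

```kotlin
import java.util.concurrent.CompletableFuture
import java.util.concurrent.ExecutorService
import java.util.concurrent.Executors

// Hypothetical application-wide pools; sizes are assumptions for illustration.
object AppExecutors {
    // Larger pool: threads here mostly wait on the network, DB, or file system.
    val io: ExecutorService = Executors.newFixedThreadPool(64)
    // CPU-bound pool sized by the number of cores.
    val default: ExecutorService =
        Executors.newFixedThreadPool(maxOf(2, Runtime.getRuntime().availableProcessors()))
}

// The lower-level component decides which pool its work runs on.
class NetworkClient(private val executor: ExecutorService = AppExecutors.io) {
    fun fetch(url: String): CompletableFuture<String> =
        CompletableFuture.supplyAsync({ "response for $url" }, executor) // stand-in for a real request
}

// The higher-level component just calls a method and knows nothing about threads.
class FeedRepository(private val client: NetworkClient) {
    fun loadFeedSize(): CompletableFuture<Int> = client.fetch("/feed").thenApply { it.length }
}
```

Passing the executor through the constructor keeps the default choice inside the lower component while still allowing tests to substitute another pool.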
Additional
🚩 ThreadPool Executor
"Infinite" list
Backend pagination
Frontend pagination
Offset
Provides limit and offset params for requests.
Example: GET /feed?offset=100&limit=20
+: ideal for non-changing content
+: simple
+: stateless on the server.
-: bad performance on large offset values (the database needs to skip offset rows before returning the paginated result)
Keyset
Uses the values from the last page to fetch the next set of items.
Example: GET /feed?after=2021-05-25T00:00:00&limit=20
+: translates easily into a SQL query
+: good performance with large datasets
+: stateless on the server.
-: "leaky abstraction" - the pagination mechanism becomes aware of the underlying database storage
-: only works on fields with a natural ordering (timestamps, etc)
Cursor/Seek
Operates with stable ids which are decoupled from the database SQL queries (usually, a selected field is encoded using base64 and encrypted on the backend side).
Example: GET /feed?after_id=t1234xzy&limit=20
+: decouples pagination from SQL database.
+: consistent ordering when new items are inserted.
-: more complex backend implementation
-: does not work well if items get deleted (ids might become invalid)
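A minimal in-memory sketch of cursor-style paging. The cursor is just the last seen id encoded with Base64 so that clients treat it as opaque; encryption, which the description above mentions, is deliberately left out. `FeedItem`, `Page`, and `fetchPage` are hypothetical names standing in for a backend handler.

```kotlin
import java.util.Base64

data class FeedItem(val id: Long, val title: String)
data class Page(val items: List<FeedItem>, val nextCursor: String?)

// Hides the underlying key (the last seen id) behind an opaque token.
fun encodeCursor(lastId: Long): String =
    Base64.getUrlEncoder().withoutPadding().encodeToString(lastId.toString().toByteArray())

fun decodeCursor(cursor: String): Long =
    String(Base64.getUrlDecoder().decode(cursor)).toLong()

// Stand-in for a backend handler; `all` is assumed to be sorted by ascending id.
fun fetchPage(all: List<FeedItem>, after: String?, limit: Int): Page {
    val lastId = after?.let(::decodeCursor) ?: Long.MIN_VALUE
    // Keyset-style selection: take items strictly after the last seen key.
    val items = all.asSequence().filter { it.id > lastId }.take(limit).toList()
    // A full page suggests there may be more; a short page means the end was reached.
    val next = if (items.size == limit) encodeCursor(items.last().id) else null
    return Page(items, next)
}
```

Unlike offset paging, inserting new items at the head of the list does not shift what the next page returns, because selection depends on the last seen key rather than on a row count.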
Online support
The simplest option.
- Detect that you have reached some threshold. For example, RecyclerView (LayoutManager) has callbacks that let you observe item visibility.
- Download the next portion using the item count or the last item's id/timestamp/etc.
Offline support
Dynamic item inserts in the middle support
Problem
An App receives info that new items have appeared in the middle of the current list.
First option
Show a notice that the list has changed and suggest reloading it
Second option
The backend must support an additional field like "sort_id". The App should use this field to figure out where to insert the element.
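A sketch of the second option, assuming "sort_id" is a number and the local list is kept sorted by it in ascending order. `Item` and `insertSorted` are hypothetical names.

```kotlin
data class Item(val id: String, val sortId: Long)

// Inserts the item at the position dictated by sortId, keeping the list sorted.
fun insertSorted(list: MutableList<Item>, item: Item) {
    val found = list.binarySearchBy(item.sortId) { it.sortId }
    // A negative result encodes the insertion point as -(insertionPoint + 1).
    val insertAt = if (found >= 0) found else -(found + 1)
    list.add(insertAt, item)
}
```

The returned position can then be passed to the list adapter so that only the inserted row is animated.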
Offline Read mode
What does it mean?
We make it possible to read content (news, open details, etc.) even when a user is offline
Options:
network only, network first, cache only, cache first
Recommended way
Two-level Caching
Patterns
Repository - pattern to hide the source of information
Single source of truth. In the Recommended way, it's Memory and then DB
LCE pattern (Loading, Content, Error):
data class Data<out T>(
val content: T? = null,
val error: Throwable? = null,
val loading: Boolean = false
)
How to avoid outdated info
- Memory and DB entries have an expiration time
- The request from VM/UseCase may require fresh data (field ⇒ refresh: Boolean)
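A sketch of two-level caching with an expiration time and a refresh flag. `CachedRepository` and its constructor parameters are hypothetical; the DB is represented by plain lambdas and is assumed to handle its own expiration by returning null. The class is not thread-safe.

```kotlin
// Hypothetical two-level cache: memory first, then DB, then network.
class CachedRepository<K, V : Any>(
    private val ttlMillis: Long,
    private val loadFromDb: (K) -> V?,        // assumed to return null when the row is missing or expired
    private val saveToDb: (K, V) -> Unit,
    private val loadFromNetwork: (K) -> V,
    private val clock: () -> Long = { System.currentTimeMillis() }
) {
    private class Entry<T>(val value: T, val storedAt: Long)

    private val memory = HashMap<K, Entry<V>>()

    fun get(key: K, refresh: Boolean = false): V {
        val now = clock()
        if (!refresh) {
            // Level 1: memory, valid only while younger than the TTL.
            memory[key]?.takeIf { now - it.storedAt < ttlMillis }?.let { return it.value }
            // Level 2: DB; a hit is promoted back into memory.
            loadFromDb(key)?.let { fromDb ->
                memory[key] = Entry(fromDb, now)
                return fromDb
            }
        }
        // Forced refresh or a miss on both levels: go to the network and update both caches.
        val fresh = loadFromNetwork(key)
        saveToDb(key, fresh)
        memory[key] = Entry(fresh, now)
        return fresh
    }
}
```

The injected clock makes expiration testable without sleeping.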
Offline Write mode
What does it mean?
We make it possible to write something (create/edit entities, dependent entities, etc.) when a user is offline
Single source of truth
DB
Syncing changes
Client-side
Download info from the backend and try to merge. This leads to an inconsistent state where each client may have its own truth.
Backend-side
Snapshot
A client pushes a snapshot of new data to the backend.
It's appropriate for simple scenarios where an app sends only new data without later editing/removing. For example, an analytics library can use this solution.
Backend merging may become very tricky for more complicated scenarios where an item can be edited or contain links to other such items.
Delta Table
A client creates a new Delta Table (besides the Data Table, where data is stored as is) that stores "deltas" of changes.
Example: user id=1 edited a node with id=100 where updated field "text" from "la-la" to "ha-ha".
When a user is offline, the App can apply local deltas to the Data Table. However, after syncing with the Backend, some rows in the Data Table can be overwritten by the Server.
To keep consistency, it's preferable to upload deltas at the start of syncing, before any other requests to the Backend.
Sync Failure Handling
Problem. Because of errors and the resulting retries, a client can send the same deltas to the backend more than once.
Solution for Snapshot. Add "status" and "last_modified_time" fields to the related entities in the DB.
Solution for Delta Table. Add a UUID to each delta. Then the backend can check whether it has already applied a particular delta. Also add "status" and "last_modified_time" fields to check that the data was sent.
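A sketch of the UUID check on the receiving side. `Delta` and `DeltaReceiver` are hypothetical names, and the in-memory set stands in for whatever storage the backend uses to remember applied deltas.

```kotlin
import java.util.UUID

data class Delta(val id: UUID, val nodeId: Int, val field: String, val newValue: String)

// Stand-in for the backend side: remembers which deltas were already applied.
class DeltaReceiver {
    private val appliedIds = HashSet<UUID>()
    val nodes = HashMap<Int, MutableMap<String, String>>()

    // Returns true if the delta was applied now, false if it was a duplicate retry.
    fun apply(delta: Delta): Boolean {
        if (!appliedIds.add(delta.id)) return false
        nodes.getOrPut(delta.nodeId) { mutableMapOf() }[delta.field] = delta.newValue
        return true
    }
}
```

Because a repeated delta is ignored, the client can retry an upload safely when it does not know whether the previous attempt reached the backend.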
Two ID problem
Problem. While an app is offline, all new entities are created with a local id. After the data is uploaded, the Backend assigns its own id to these entities.
Solution 1. Explicitly add "id" (local id obtained from the DB) and "serverId" fields to the data model. It's desirable to keep domain data models free of this knowledge.
Solution 2. Use a mapper between local and server ids when a created entity is used in different places and simply updating the id may cause consistency issues.
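Both solutions can be sketched in a few lines. `NoteEntity` and `IdMapper` are hypothetical names; a real mapper would be persisted rather than kept in memory.

```kotlin
// Solution 1 (hypothetical storage-level model): both ids live next to each other.
// serverId stays null until the entity has been uploaded.
data class NoteEntity(val id: Long, val serverId: Long?, val text: String)

// Solution 2 (hypothetical mapper): other code keeps using local ids and asks for the server id on demand.
class IdMapper {
    private val localToServer = HashMap<Long, Long>()

    // Called after the backend has confirmed the upload and returned its own id.
    fun bind(localId: Long, serverId: Long) {
        localToServer[localId] = serverId
    }

    // Returns null while the entity has not been synced yet.
    fun serverIdOf(localId: Long): Long? = localToServer[localId]
}
```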
Network Limit
What else?
🚩 Storage issues (because we store info somewhere on a device)
🚩 Privacy issues (because stored info can be accessible to malicious applications)
Background work
Why?
Various applications want to execute some work while the app is in the background.
Examples: MediaPlayer, Downloader, Service app (Antivirus).
Options
- WorkManager
Allows scheduling work for the future. The OS guarantees execution but not exact timing.
- Foreground service
Helps to keep an App active even in the background. Especially relevant for media players.
- Service
However, Android has introduced a lot of restrictions on Services running in the background.
- Broadcast receiver.
Allows waking up an App to react to external events (changes in the file system, etc.). However, Android has reduced the set of broadcasts available to such receivers.
Storage
Why?
Where to store data?
Options
- Key-Value Storage
SharedPreferences, DataStore, Secure SP from Jetpack Security Library
+: simple
+: ideal for very small data
-: no schema
-: no migration
-: no big data
- Database/ORM
SQLite, Room, Realm, etc.
+: relational db
+: performance
+: concurrency out-of-box
+: migration
-: more complicated, requires some DB knowledge
-: insecure by default
- On-Device Secure Storage
Keystore/KeyChain
Ideal for small sensitive data
Storage Location
- Internal
Sandbox of an App, inaccessible to other apps unless the device is rooted.
- External
Primary storage and secondary storage (SD card). Now it's available through the Scoped Storage policy.
- Media/Scoped
Server pushes
What is it and When?
When an App should receive real-time events from a Backend.
Examples: Chat app, Delivery app, Taxi app, etc.
Options
HTTP-polling
A client periodically sends requests to a Server.
Types:
- Short
+: Very simple
+: No persistent connection
-: No real-time
-: Connection establishment overhead
- Long
+: Realtime
-: Persistent connection
-: Backend modifications
Server-Sent Events
Allows a client to establish a stream over HTTP.
A small improvement: when connecting, the client can send the id (order) of the last received message so that the stream resumes from that message rather than from the first one.
+: realtime
+: optimized compared to HTTP Long Polling
-: possible losses and duplications due to unidirectional stream
Bidirectional protocols
WebSocket, gRPC.
More details in 🚩 Network protocols
Firebase Cloud Messaging
Allows sending data messages to an App.
+: able to wake up an App from any state of the App and OS
-: 3rd party
-: not exact realtime
Push notifications
Why?
Helps to notify users about various news.
For example, someone wrote in a chat, new album is published, breaking news, etc
Implementation
- Realtime
See 🚩 Server pushes and create notifications manually using the OS SDK API
- Non-Realtime
There is an option to delegate to 3rd-party solutions like Firebase Cloud Messaging, which offers a convenient API and Console for push notifications
Backend API Design
Protocols
GraphQL
A query language for working with API - allows clients to request data from several resources using a single endpoint (instead of making multiple requests in traditional RESTful apps).
+: schema-based typed queries - clients can verify data integrity and format.
+: highly customizable - clients can request specific data and reduce the amount of HTTP-traffic.
+: bi-directional communication with GraphQL Subscriptions (WebSocket based).
-: more complex backend implementation.
-: "leaky-abstraction" - clients become tightly coupled to the backend.
-: the performance of a query is bound to the performance of the slowest service on the backend (in case the response data is federated between multiple services).
WebSocket
Full-duplex communication over a single TCP connection.
+: real time bi-directional communication.
+: provides both text-based and binary traffic.
-: requires maintaining an active connection - might have poor performance on unstable cellular networks.
-: schemaless - it's hard to check data validity on the client.
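-: the number of active connections a single server can hold is limited in practice (file descriptors, memory); the often-quoted 65k figure is a per-client port limit, not a hard server-side cap.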
gRPC
Remote Procedure Call framework which runs on top of HTTP/2. Supports bi-directional streaming using a single physical connection.
+: lightweight binary messages (much smaller compared to text-based protocols).
+: schema-based - built-in code generation with Protobuf.
+: provides support of event-driven architecture: server-side streaming, client-side streaming, and bi-directional streaming
+: multiple parallel requests.
-: limited browser support.
-: non human-readable format.
-: steeper learning curve.
HTTP
A text-based stateless protocol - most popular choice for CRUD (Create, Read, Update, and Delete) operations.
+: easy to learn, understand, and implement
+: easy to cache using built-in HTTP caching mechanism
+: loose coupling between client and server
-: less efficient on mobile platforms, since without connection reuse (keep-alive, HTTP/2) every request requires a separate physical connection
-: schemaless - it's hard to check data validity on the client.
-: stateless - needs extra functionality to maintain a session.
-: additional overhead - every request contains contextual metadata and headers
Protocol choice example
- Simple Request-Response communication - choose HTTP
- Complex requests and Network limitations - try GraphQL
- Real-time events from a Server (new messages, order status, driver's car location, etc.) - see 🚩 Server pushes (WebSocket, gRPC)
- Strong Network limitation with a big number of requests - try something based on UDP
RESTful Service Example
API:
Paths:
- /v1/drivers
  - GET
    - request: <some based on location data>
    - response
      - type: array
      - definition: Driver
- /v1/drivers/{id}
  - GET
    - request
    - response
      - type: object
      - definition: Driver
- /v1/riders
  - GET
    - request: <some based on location data>
    - response
      - type: array
      - definition: Rider
- /v1/riders/{id}
  - GET
    - request
    - response
      - type: object
      - definition: Rider
- /v1/trips
  - POST
    - request
      - type: object
      - definition: CreateTrip
    - response
      - type: object
      - definition: Trip // or trip_id
- /v1/trips/{id}
  - GET
    - request
    - response
      - type: object
      - definition: Trip
  - PUT (for cancelling/editing)
    - request
      - type: object
      - definition: EditTrip
definitions:
Driver:
- id: Int
- name: String
- rating: String
- info..
Rider:
- id: Int
- name: String
- rating: String
- info..
CreateTrip:
- driver_id: Int
- rider_id: Int
- from: Geo
- to: Geo
- info..
EditTrip:
- trip_id
- operation:..
Trip:
- id: Int
- driver_id: Int
- rider_id: Int
- from: Geo
- to: Geo
- info..
Uploading/Downloading Big files
Big files:
See 🚩 Big Files
Problem
A TCP connection may break while the client is sending a big file to a server (or vice versa), and neither side knows how much data was transferred.
Solutions
- Uploads.
The file is divided into separate segments that are transferred one by one. Google, YouTube, and Photos provide their own custom HTTP extensions for uploads.
- Downloads.
Big files are downloaded using HTTP Range requests, or see 🚩 Streaming.
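A sketch of both ideas without any network code: splitting a file into segments for upload, and building the standard HTTP `Range` header value to resume a download. `Segment`, `splitIntoSegments`, and `resumeRangeHeader` are hypothetical names; how segments are actually sent depends on the upload protocol of the concrete backend.

```kotlin
data class Segment(val offset: Long, val length: Long)

// Splits a file of fileSize bytes into consecutive segments of at most segmentSize bytes.
fun splitIntoSegments(fileSize: Long, segmentSize: Long): List<Segment> {
    require(fileSize >= 0 && segmentSize > 0) { "sizes must be valid" }
    val segments = mutableListOf<Segment>()
    var offset = 0L
    while (offset < fileSize) {
        val length = minOf(segmentSize, fileSize - offset)
        segments += Segment(offset, length)
        offset += length
    }
    return segments
}

// Value of the HTTP "Range" request header asking for everything from the given byte onward.
fun resumeRangeHeader(alreadyDownloaded: Long): String {
    require(alreadyDownloaded >= 0) { "offset must not be negative" }
    return "bytes=$alreadyDownloaded-"
}
```

After a broken connection, the client only needs to know the last confirmed segment (upload) or the number of bytes already on disk (download) to continue.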
Streaming
What kind of data?
- video
- audio
What is Streaming?
Streaming is a real-time and more efficient way to deliver audio/video for playback than downloading the whole file.
Streaming can work on top of both UDP and TCP. Among TCP-based protocols, one of the most popular is HTTP Live Streaming (HLS).
Buffering is a technique used by players: data is loaded some way ahead of the playback position, so that playback remains smooth if the network stalls.
HLS
HTTP Live Streaming (HLS) is one of the most popular protocols for streaming.
The idea. HLS splits video/audio into small HTTP-downloadable files (numbered segments, each several seconds long), which are delivered over plain HTTP. The client downloads these files and plays them.
HLS also supports adaptive bitrate streaming, which lets the player adjust video/audio quality on the fly depending on network conditions. For this, the server prepares several copies of each segment at different quality levels.
Waiting list
Problem
The waiting list is a list of tasks that cannot be executed immediately due to a lack of resources (network, storage, etc.), limits, or priorities.
Example
File downloading. An app can't download all required files at once because there are network limits, especially on mobile internet, and we can't saturate the entire network channel, blocking other apps' requests. All of this leads to a waiting list for file downloads.
File downloading Scheme example
Description
- Job - data model describing what to execute.
Theoretically, the number of Jobs can be unlimited.
- Worker - executor of a Job. May implement the Strategy pattern to be able to execute different types of Jobs.
Also, several Jobs may point to the same Worker, which keeps the Worker running as long as at least one associated Job is active.
The number of Workers is limited by a pool size.
- Download Request - encapsulates a single file downloading request
- Download Task - see 🚩 Downloading/Uploading API
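A sketch of the queueing part of the scheme: an unlimited list of pending Jobs and a fixed number of Worker slots. `Job`, `WaitingList`, and the numeric priority are hypothetical; actual downloading is left out, and jobs with equal priority are not guaranteed to start in submission order.

```kotlin
import java.util.PriorityQueue

data class Job(val url: String, val priority: Int)

// Hypothetical waiting list: pending jobs wait until one of maxWorkers slots is free.
class WaitingList(private val maxWorkers: Int) {
    private val pending = PriorityQueue<Job>(compareByDescending<Job> { it.priority })
    private val running = mutableSetOf<Job>()

    // Queues the job and returns the jobs that were started as a result.
    fun submit(job: Job): List<Job> {
        pending.add(job)
        return startNext()
    }

    // Frees the worker slot of a finished job and returns the jobs started in its place.
    fun onFinished(job: Job): List<Job> {
        running.remove(job)
        return startNext()
    }

    private fun startNext(): List<Job> {
        val started = mutableListOf<Job>()
        while (running.size < maxWorkers) {
            val next = pending.poll() ?: break
            running += next
            started += next
        }
        return started
    }
}
```

The caller hands every returned Job to a Worker and reports back through `onFinished`, so the pool limit is enforced in one place.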
Cache Types
Question
Where to cache data?
Options
In-App cache (Heap)
Very fast, but lost after the app is closed
Storage (SharedPrefs, DB, FileSystem)
See 🚩 Storage
HTTP-Cache
Implementation details of a concrete HTTP client and Backend
Problem
Different countries and regions offer very different network conditions, especially when a device is connected to Mobile Data.
What can we do to handle poor connections and save a user device's battery?
Options
Batch requests for Mobile Data
The radio module is a very energy-consuming part of a device. See the article. It makes sense to batch requests when an App is in the background.
Reduce connections
Optimise requests
If gathering data requires a lot of "proxy" requests (for example, loading a list of contacts, a list of phones, and a list of emails to obtain a User's info), then it makes sense to try GraphQL or modify the backend.
Prefetch data
When a user is not active, an App can pre-download some piece of information, such as details of the first 20 news items.
Network priorities
There is an option to prioritize requests and execute them based on these priorities. See queue-based solutions like 🚩 Waiting list
What else?
- 🚩 Waiting list
- 🚩 Downloading/Uploading API
"State Intense" App
Problem
An App contains a lot of states that may affect each other.
A typical example is a Chat application where you have to track changes of users, messages, chats, etc. Taxi apps, Delivery apps, etc. may also be treated as "State Intense" apps.
Recommended patterns
Presentation Layer
MVI, because it introduces special objects that help to sort out the state mess
Domain/Data Layer
There is a big chance that the App will use 🚩 Server pushes. It means the App will listen for updates from a special channel (Stream, Flow, Observable).
Events are like "delta models" that show the diff.
CQRS
Command and Query Responsibility Segregation.
All actions like "SendMessage", "ChangeStatus", etc. are commands. UpdateChannel is like a Query.
Why? Because requests may affect different areas. For example, "SendMessage" may change the list of messages, the status of a particular message, the list of chats where the last-message field changes, etc. Dependencies can be very tricky. That's why it's simpler to listen for all updates from one or several queries.
Image thumbnails
Video and Audio Streaming
Network errors
Problem
How to handle network errors?
An error occurs. What next?
A request to the network (or DB) fails with an error (4xx, 5xx, IO exception, etc.). What to do next?
Options:
- throw an exception and catch it in the upper layers
- wrap a result into a special model
Recommendations:
Many developers prefer the second option because it's safer and more predictable, and the app will not crash at random moments.
Examples:
data class Result<T>(
val data: T?,
val error: Throwable?
)
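A sketch of how such a wrapper can be produced at the boundary of the network layer. `safeCall` is a hypothetical helper; the `Result` class is repeated from the example above so that the snippet is self-contained (it shadows Kotlin's built-in `kotlin.Result` inside this file).

```kotlin
// Same model as above, repeated to keep the snippet self-contained.
data class Result<T>(
    val data: T?,
    val error: Throwable?
)

// Hypothetical helper: runs the block and converts a thrown exception into a value.
fun <T> safeCall(block: () -> T): Result<T> =
    try {
        Result(block(), null)
    } catch (e: Exception) {
        Result(null, e)
    }
```

Upper layers then branch on `error` instead of wrapping every call in try/catch.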
Try to repeat (options)
Repeater
We can introduce a default Repeater in an App's network layer. But remember that mass retries can act like a DDoS on your own backend. That's why it makes sense to add exponential back-off + jitter (randomness in determining the delay).
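A sketch of a Repeater with exponential back-off and "full jitter" (a random delay between zero and the exponential cap). `backoffDelayMillis`, `retry`, and all default values are assumptions for illustration; only `IOException` is retried here, on the assumption that other errors are not transient.

```kotlin
import java.io.IOException
import kotlin.random.Random

// Delay before the next try: uniform in [0, min(maxMillis, baseMillis * 2^attempt)].
fun backoffDelayMillis(
    attempt: Int,
    baseMillis: Long = 500,
    maxMillis: Long = 30_000,
    random: Random = Random.Default
): Long {
    val exponential = baseMillis * (1L shl attempt.coerceIn(0, 20))
    val cap = minOf(maxMillis, exponential)
    return random.nextLong(cap + 1)
}

// Runs the block up to maxAttempts times, sleeping with back-off between failed tries.
fun <T> retry(maxAttempts: Int, sleep: (Long) -> Unit = { Thread.sleep(it) }, block: () -> T): T {
    require(maxAttempts >= 1) { "maxAttempts must be at least 1" }
    var lastError: IOException? = null
    repeat(maxAttempts) { attempt ->
        try {
            return block()
        } catch (e: IOException) {
            lastError = e
            // No point in sleeping after the final failed attempt.
            if (attempt < maxAttempts - 1) sleep(backoffDelayMillis(attempt))
        }
    }
    throw lastError ?: IllegalStateException("unreachable")
}
```

The jitter spreads retries from many clients over time, so that they do not all hit the backend at the same moment after an outage.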
Network availability
If an error occurred due to network problems then we can wait for network availability
Mobile Failover handling
Store a list of domains (primary and backup) and use it to switch to another domain (Uber's approach).
Principles:
- Maximize the usage of primary domains
- Effectively differentiate between network errors and host-level outages
- Reduce degraded experience during primary domain outages
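A rough sketch of how these principles might translate into code. `DomainSelector`, its methods, and the failure threshold are assumptions made for illustration and are not Uber's actual implementation.

```kotlin
// Hypothetical selector; thresholds and method names are assumptions.
class DomainSelector(
    private val domains: List<String>,   // primary first, then backups
    private val failureThreshold: Int = 3
) {
    private var index = 0
    private var consecutiveHostFailures = 0

    init {
        require(domains.isNotEmpty()) { "at least one domain is required" }
    }

    val current: String get() = domains[index]

    fun onSuccess() {
        consecutiveHostFailures = 0
    }

    // Device-side network errors (no connectivity) say nothing about the host, so they are ignored.
    fun onNetworkError() = Unit

    // Host-level failures (for example, repeated timeouts or 5xx) count toward switching domains.
    fun onHostFailure() {
        consecutiveHostFailures++
        if (consecutiveHostFailures >= failureThreshold && index < domains.lastIndex) {
            index++
            consecutiveHostFailures = 0
        }
    }

    // Called periodically (or after a health check) to go back to the primary domain.
    fun resetToPrimary() {
        index = 0
        consecutiveHostFailures = 0
    }
}
```

Requiring several consecutive host failures before switching keeps traffic on the primary domain during short glitches.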
Security
Android Backend Security. Main points
- Use HTTPS (TLS). HTTPS encrypts your communication with a server (using a mix of symmetric and asymmetric encryption).
- Use certificate (SSL) pinning to prevent Man-in-the-Middle attacks.
- In applications like Chat, use E2E encryption so that the server cannot read message contents.
- In applications that need Auth, use the OAuth/OpenID protocol. This means adding an extra header to your API requests: Authorization: Bearer <token>
Auth scheme (OAuth) in Android
What to clarify in requirements
- Size of files
- Thumbnails to optimize work with images
Big Files
What is it?
- different kinds of attachments
- file
- video
- audio
- image
- etc.
Problem
The size of big files affects how to work with them:
Storing, Downloading, Uploading, API Designing.
See details in
- 🚩 Uploading/Downloading Big files
- 🚩 Downloading/Uploading API
- 🚩 Offline Write mode
Storing => Files vs BLOB
There are different opinions here.
The general recommendations look like below:
- large blobs are less efficient than simple files
- a file can be removed from storage by a User, a blob cannot
Attachments. File Permissions.
The User/OS may revoke permissions for the file before the App comes back online. It makes sense to copy the file to the App's storage.
What else based on TCP
- XMPP (WhatsApp, Zoom, Google Talk)
- MQTT (IoT, Smart Home, etc).
What else based on UDP
- WebRTC (Discord, Google Hangouts, Facebook Messenger).
TCP vs UDP
- Unlike UDP, TCP uses
- Connection setup
- Acknowledgement of received segments
- A mechanism to retransmit lost packets
- TCP was designed for wired connections rather than wireless ones, which introduce additional challenges such as regular temporary connection losses
Privacy
Brief recommendations
- Keep as little of the user's data as possible
- Minimize the use of permissions and be ready to handle permission changes
- Assume that on-device and backend storage are not secure
- Assume that the target platform's (iOS/Android) Security & Privacy rules will change
- Applying "cloud vs on-device" storage/calculations depends on various things, and the final decision is made case-by-case.
More details are here.
Reverse-Engineering
Protect your products against reverse-engineering (more important for Android)
Performance/Stability
Main Points
- Metered data usage
- Bandwidth usage
- CPU usage
- Memory Usage
- Startup Time
- Crashes/ANRs
Geo-Location Usage
Main Points
- Don't compromise user privacy while using location services.
- Prefer the lowest possible location accuracy. Progressively ask for increased location accuracy if needed.
- Provide location access rationale before requesting permissions.
3rd-Party SDKs Usage
Main Points
- 3rd-Party SDKs might cause performance regressions and/or serious outages (example - Facebook SDK).
- Each SDK should be guarded by a feature flag.
- A new SDK integration should be introduced as an A/B test or a staged rollout.
- You need to have a long-term "support and upgrade" plan for 3rd-party SDKs.
❓ Suggestion
group by "Http + websockets and rest + graphql + grpc"
Target market/client
Questions
- number of users
- target market (region)
- typical usage (home/street/outdoor)
- time peak usage
- internet quality
- devices quality
What affects
- A high-load backend makes DDoS a real potential problem.
It makes sense to reduce the number of connections and introduce backoff in case of network errors.
See 🚩Network errors and 🚩Network Limit for all details
- Also, high load may force you to move some calculations to the device. For example, Badoo introduced face recognition on the device to reduce the load on the server.
- 🚩Network Limit generally
- Compliance with different local laws. Example - an analytics library that may handle personal data should include methods like enable/disable so that its work can be managed explicitly
Company resources limitations
Questions
- developers
- deadlines
- how should the work on the system be divided among multiple teams?
How can big files be represented in data models in network responses?
- "some_big_file": "id"
A developer can download the file using this id and a special backend endpoint like "GET /sound/file/{id}"
- "some_big_file": "url"
Where "url" points to a certain resource in a CDN. It's a very simple way, but insecure