Mobile System Design

Downloading/Uploading API

Library

External API

Configurability

Any library must be configurable. What can be configured:

maxMemoryCacheSize
maxMemoryDiskSize
maxParallelDownloads
maxImageSize
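
A minimal sketch of how these options could be grouped into one immutable config object. The `Config` shape, the units, and every default value are illustrative assumptions, not recommendations taken from any specific library.

```kotlin
// Hypothetical configuration object; names follow the list above,
// default values and units are illustrative assumptions only.
data class Config(
    val maxMemoryCacheSize: Long = 32L * 1024 * 1024,  // bytes
    val maxMemoryDiskSize: Long = 256L * 1024 * 1024,  // bytes
    val maxParallelDownloads: Int = 4,
    val maxImageSize: Int = 2048                       // pixels on the longest side
) {
    init {
        // Fail fast on values that can never work.
        require(maxMemoryCacheSize >= 0) { "maxMemoryCacheSize must be >= 0" }
        require(maxMemoryDiskSize >= 0) { "maxMemoryDiskSize must be >= 0" }
        require(maxParallelDownloads > 0) { "maxParallelDownloads must be > 0" }
        require(maxImageSize > 0) { "maxImageSize must be > 0" }
    }
}
```

Users override only what they need, for example Config(maxParallelDownloads = 2), and invalid values are rejected at construction time rather than deep inside the library.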

Quality requirements

The library is an external dependency, so it must meet very high quality requirements (crash-free, leak-free, etc.).
Remember the Facebook SDK failure.

Documentation + Support

External users need a well-documented API so that they can work with the Library in a straightforward way, without hacks

Importance
The API must be designed very carefully. Once released, an API is hard to change without annoying its users.

Examples of API:

Base functions:

init(config: Config)
enable()/disable()
functional methods

See also:
🚩 API for simple functionality
🚩 Downloading/Uploading API

Privacy

Libraries that gather various data and send it to servers (for example, FirebaseAnalytics) must provide a way to disable data collection in order to guarantee users' privacy

methods - enable/disable

Downloading/Uploading API

API

DownloadTask:

id: Int - unique id of a task
pause()
resume()
stop()
addCallback(TaskCallback)

TaskCallback:

onProgress(idTask: Int, progress: Int)
onCompleted(idTask: Int)
onCancelled(idTask: Int)
onFailed(idTask: Int, exception: Throwable)

Example:
fun execute(): DownloadTask
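
The API above can be sketched as two Kotlin interfaces. The `TaskState` enum and the in-memory `FakeDownloadTask` (which performs no real I/O) are assumptions added only to make the sketch runnable; a real implementation would drive the callbacks from the network layer.

```kotlin
// Sketch of the task API described above. The state machine and the
// in-memory FakeDownloadTask are assumptions added for illustration.
interface TaskCallback {
    fun onProgress(idTask: Int, progress: Int)
    fun onCompleted(idTask: Int)
    fun onCancelled(idTask: Int)
    fun onFailed(idTask: Int, exception: Throwable)
}

interface DownloadTask {
    val id: Int
    fun pause()
    fun resume()
    fun stop()
    fun addCallback(callback: TaskCallback)
}

enum class TaskState { RUNNING, PAUSED, CANCELLED, COMPLETED }

// Fake task: no real I/O, progress is pushed manually via advance().
class FakeDownloadTask(override val id: Int) : DownloadTask {
    var state = TaskState.RUNNING
        private set
    private val callbacks = mutableListOf<TaskCallback>()

    override fun addCallback(callback: TaskCallback) { callbacks += callback }
    override fun pause() { if (state == TaskState.RUNNING) state = TaskState.PAUSED }
    override fun resume() { if (state == TaskState.PAUSED) state = TaskState.RUNNING }
    override fun stop() {
        if (state == TaskState.RUNNING || state == TaskState.PAUSED) {
            state = TaskState.CANCELLED
            callbacks.forEach { it.onCancelled(id) }
        }
    }

    // Simulates a progress update (0..100); ignored unless RUNNING.
    fun advance(progress: Int) {
        if (state != TaskState.RUNNING) return
        callbacks.forEach { it.onProgress(id, progress) }
        if (progress >= 100) {
            state = TaskState.COMPLETED
            callbacks.forEach { it.onCompleted(id) }
        }
    }
}
```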

API for simple functionality

What is Simple functionality?

Usually, it's a piece of your product's code that executes a single task. For example, Payments Module is responsible for only Payments.

Best Practices

We recommend using the Async approach everywhere where a function executes something heavy.

If you mix both approaches then it's better to reach an agreement on how to name them. For example, all blocking methods may have the "Blocking" suffix in their names.

Blocking

fun executeSomething()

Under the hood, this function does a lot of work and blocks the current thread

Async

fun execute(callback: Callback) - callback approach
fun execute(): Single/Observable/Flow - reactive approach
suspend fun execute() - kotlin coroutines approach

Calling these methods doesn't block the current thread
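
A small sketch of the naming convention, using the callback approach and a plain executor. `PaymentsModule`, `loadBalance`, and the returned value are hypothetical names chosen for illustration.

```kotlin
import java.util.concurrent.ExecutorService
import java.util.concurrent.Executors

// Hypothetical module used only to illustrate the naming convention.
class PaymentsModule(
    private val executor: ExecutorService = Executors.newSingleThreadExecutor()
) {
    // Blocking variant: the suffix warns callers not to call it on the UI thread.
    fun loadBalanceBlocking(): Int {
        Thread.sleep(10) // stands in for heavy work (network, DB)
        return 42
    }

    // Async variant: returns immediately, the callback runs on the executor thread.
    fun loadBalance(callback: (Int) -> Unit) {
        executor.execute { callback(loadBalanceBlocking()) }
    }

    fun shutdown() = executor.shutdown()
}
```

Note that in this sketch the callback is invoked on the executor thread, so a caller that touches UI state would still have to switch back to the main thread.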

The problem

Working with big files may require a way to manage the process. That's why the API may return a special object - "DownloadTask".

Big files:

See 🚩 Big Files

ThreadPool Executor

The problem

Where and how to determine threads on which a task will be executed

Best practice

It's better to delegate such issues to lower components (for example, NetworkClient, DbClient, etc.).
The first reason. A lower component knows better on which thread pool to execute a task (IO, Default, or a separate special pool).
The second reason. Higher components must be free of unnecessary knowledge about the details of lower components. Higher components just call a method and that's all.

IO vs Default

IO - for long-running blocking operations (network, db, file system). The main goal is to avoid waiting because the pool limit has been reached. For example, Dispatchers.IO (Kotlin Coroutines) is limited by default to 64 threads (or to the number of CPU cores, if that is larger).

Default - for short, CPU-intensive operations. The number of threads is derived from the number of CPU cores to minimize thread switching.
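
A rough sketch of the same split using plain `java.util.concurrent` executors instead of coroutine dispatchers. The pool sizes and the `DbClient` component are assumptions; the point is only that the lower component picks the pool and the caller does not.

```kotlin
import java.util.concurrent.Callable
import java.util.concurrent.ExecutorService
import java.util.concurrent.Executors
import java.util.concurrent.Future

// Pool for blocking work. 64 mirrors the default Dispatchers.IO limit.
fun newIoPool(): ExecutorService = Executors.newFixedThreadPool(64)

// Pool for CPU-bound work, sized by the number of cores (at least 2).
fun newDefaultPool(): ExecutorService =
    Executors.newFixedThreadPool(maxOf(2, Runtime.getRuntime().availableProcessors()))

// Hypothetical lower-level component: it owns the decision about the pool,
// so callers just call a method and receive a Future.
class DbClient(private val executor: ExecutorService = newIoPool()) {
    fun readUserName(id: Int): Future<String> =
        executor.submit(Callable { "user-$id" }) // stands in for a blocking DB read
}
```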

Additional

🚩 ThreadPool Executor

"Infinite" list

Backend pagination

Frontend pagination

Offset
Provides limit and offset params for requests.
Example: GET /feed?offset=100&limit=20
+: ideal for non-changing content
+: simple
+: stateless on the server.
-: bad performance on large offset values (the database needs to skip offset rows before returning the paginated result)

Keyset
Uses the values from the last page to fetch the next set of items.
Example: GET /feed?after=2021-05-25T00:00:00&limit=20
+: translates easily into a SQL query
+: good performance with large datasets
+: stateless on the server.
-: "leaky abstraction" - the pagination mechanism becomes aware of the underlying database storage
-: only works on fields with a natural ordering (timestamps, etc)

Cursor/Seek
Operates with stable ids which are decoupled from the database SQL queries (usually, a selected field is encoded using base64 and encrypted on the backend side).
Example: GET /feed?after_id=t1234xzy&limit=20
+: decouples pagination from SQL database.
+: consistent ordering when new items are inserted.
-: more complex backend implementation
-: does not work well if items get deleted (ids might become invalid)
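
To make the difference concrete, here is a sketch of an opaque cursor built on top of a keyset-style lookup, run against an in-memory list instead of a database. `FeedItem`, `Page`, and the cursor layout ("timestamp:id" in Base64) are assumptions; Base64 only hides the format, so a real backend may additionally sign or encrypt the cursor.

```kotlin
import java.util.Base64

data class FeedItem(val id: Int, val createdAt: Long)

data class Page(val items: List<FeedItem>, val nextCursor: String?)

// Opaque cursor: the client never learns that it wraps a timestamp and an id.
fun encodeCursor(item: FeedItem): String =
    Base64.getUrlEncoder().withoutPadding()
        .encodeToString("${item.createdAt}:${item.id}".toByteArray())

fun decodeCursor(cursor: String): Pair<Long, Int> {
    val (ts, id) = Base64.getUrlDecoder().decode(cursor).decodeToString().split(":")
    return ts.toLong() to id.toInt()
}

// Keyset-style query over an in-memory list sorted by (createdAt, id).
fun fetchPage(sortedFeed: List<FeedItem>, after: String?, limit: Int): Page {
    require(limit > 0) { "limit must be positive" }
    val start = if (after == null) 0 else {
        val (ts, id) = decodeCursor(after)
        // First item that sorts strictly after the cursor position.
        sortedFeed.indexOfFirst { it.createdAt > ts || (it.createdAt == ts && it.id > id) }
            .let { if (it == -1) sortedFeed.size else it }
    }
    val items = sortedFeed.drop(start).take(limit)
    val hasMore = start + items.size < sortedFeed.size
    return Page(items, if (hasMore) encodeCursor(items.last()) else null)
}
```

Because the lookup compares values rather than positions, inserting new items before the cursor does not shift the next page, which is the "consistent ordering" advantage listed above.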

Online support

The simplest option.

Determine that you reached some threshold. For example, RecyclerView (LayoutManager) has different callbacks where you observe items' visibility.
Download a new portion using the count of items or last item id/timestamp/anything
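
The threshold check can be sketched as a pure function that a scroll listener would call. The parameter names and the default `prefetchThreshold` are assumptions; the function itself does not depend on RecyclerView.

```kotlin
// Returns true when the user has scrolled close enough to the end of the
// loaded items. `prefetchThreshold` is an assumed tuning parameter.
fun shouldLoadMore(
    lastVisiblePosition: Int,
    loadedCount: Int,
    isLoading: Boolean,
    hasMore: Boolean,
    prefetchThreshold: Int = 5
): Boolean =
    !isLoading && hasMore && lastVisiblePosition >= loadedCount - 1 - prefetchThreshold
```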

Offline support

Dynamic item inserts in the middle support

Problem
An App receives info that new items appeared in the middle of a current list.

First option
Show info that the list is changed and suggest reloading the list

Second option
The backend must support an additional field like "sort_id". The App should use this field to figure out where to insert the element.
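
A sketch of the second option, assuming the list is kept sorted by an ascending numeric "sort_id". `ListItem` and `insertBySortId` are hypothetical names.

```kotlin
data class ListItem(val id: String, val sortId: Long)

// Inserts `item` into a list already sorted by sortId (ascending) and
// returns the insertion index, e.g. to notify the list adapter.
fun insertBySortId(items: MutableList<ListItem>, item: ListItem): Int {
    val found = items.binarySearch { it.sortId.compareTo(item.sortId) }
    // binarySearch returns -(insertionPoint) - 1 when nothing matches.
    val index = if (found >= 0) found + 1 else -(found + 1)
    items.add(index, item)
    return index
}
```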

Offline Read mode

What does it mean?

We let users read content (news, item details, etc.) even when they are offline

Options:
network only, network first, cache only, cache first

Recommended way

Two-level Caching

Patterns

Repository - pattern to hide the source of information
Single source of truth. In the recommended approach it is Memory first and then the DB
LCE pattern (Loading, Content, Error):

data class Data<out T>(
    val content: T? = null,
    val error: Throwable? = null,
    val loading: Boolean = false
)

How to avoid outdated info

Entries in Memory and in the DB have an expiration time
The request from VM/UseCase may contain a requirement to update data (field ⇒ refresh: Boolean)
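
The pieces above (two-level cache, expiration time, refresh flag, LCE result) can be combined roughly as follows. This is a single-threaded sketch under stated assumptions: `CachedRepository` is a hypothetical name, the DB is replaced by a plain map, and `fetch` stands in for a network call.

```kotlin
// LCE wrapper as in the text above.
data class Data<out T>(
    val content: T? = null,
    val error: Throwable? = null,
    val loading: Boolean = false
)

// Memory -> DB -> network lookup with a shared time-to-live.
class CachedRepository<T>(
    private val ttlMillis: Long,
    private val clock: () -> Long = System::currentTimeMillis,
    private val fetch: (String) -> T
) {
    private class Entry<V>(val value: V, val savedAt: Long)

    private val memory = HashMap<String, Entry<T>>()
    private val db = HashMap<String, Entry<T>>() // stand-in for a real database

    fun get(key: String, refresh: Boolean = false): Data<T> {
        if (!refresh) {
            fresh(memory[key])?.let { return Data(content = it.value) }
            fresh(db[key])?.let {
                memory[key] = it               // promote to the faster level
                return Data(content = it.value)
            }
        }
        return try {
            val entry = Entry(fetch(key), clock())
            db[key] = entry
            memory[key] = entry
            Data(content = entry.value)
        } catch (e: Exception) {
            // Offline: fall back to stale DB data if there is any.
            Data(content = db[key]?.value, error = e)
        }
    }

    // An entry is usable only while it is younger than the TTL.
    private fun fresh(entry: Entry<T>?): Entry<T>? =
        entry?.takeIf { clock() - it.savedAt < ttlMillis }
}
```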

Offline Write mode

What does it mean?

We let users write something (create/edit entities, dependent entities, etc.) even when they are offline

Single source of truth

Syncing changes

Client-side
Download info from the backend and try to merge it on the client. This leads to an inconsistent state where each client may have its own truth.

Backend-side

Snapshot

A client pushes a snapshot of new data to the backend.
It's appropriate for simple scenarios where an app sends only new data without later editing/removing. For example, AnalyticsLibrary can use a described solution.
Backend merging may become very tricky for more complicated scenarios where an item can be edited or contain links to other such items.

Delta Table

A client creates a separate Delta Table (in addition to the Data Table, where the data is stored as is) that stores the "deltas" of changes.
Example: user id=1 edited a node with id=100 where updated field "text" from "la-la" to "ha-ha".

When a user is offline, the App can apply local deltas to the Data Table. However, after syncing with the Backend, some rows in the Data Table may be overwritten by the Server.

When syncing, preferably upload the deltas before making any other requests to the Backend, to keep the data consistent.

Sync Failure Handling

Problem. A client can send the same deltas to the backend in case of errors and related retries.

Solution for Snapshot. Add fields "status" and "last_modified_time" for related entities in DB.
Solution for Delta Table. Add UUID to each delta. Then the backend can check whether it applied a concrete delta or not. Plus fields "status" and "last_modified_time" to check that the data was sent.
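
A sketch of the UUID-based deduplication for the Delta Table solution. `Delta` and `DeltaSyncBackend` are hypothetical, and the backend is simulated in memory; a real one would persist the set of applied UUIDs.

```kotlin
import java.util.UUID

// One recorded change: "field of entity changed to newValue".
data class Delta(
    val uuid: String = UUID.randomUUID().toString(),
    val entityId: Int,
    val field: String,
    val newValue: String
)

// In-memory stand-in for the backend side of the sync endpoint.
class DeltaSyncBackend {
    private val appliedUuids = HashSet<String>()
    val entities = HashMap<Int, MutableMap<String, String>>()

    // Applies each delta at most once; returns how many were actually applied.
    fun apply(deltas: List<Delta>): Int {
        var applied = 0
        for (delta in deltas) {
            if (!appliedUuids.add(delta.uuid)) continue // duplicate from a retry
            entities.getOrPut(delta.entityId) { mutableMapOf() }[delta.field] = delta.newValue
            applied++
        }
        return applied
    }
}
```

With this in place the client can safely resend the same batch after a timeout, because repeated deltas are ignored.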

Two ID problem

Problem. While the app is offline, all new entities are created with a local id. After the data is uploaded, the Backend assigns its own id to these entities.

Solution 1. Explicitly add two fields to the data model: "id" (the local id obtained from the DB) and "serverId". It's desirable to keep domain data models free of this knowledge.

Solution 2. Use a mapper between local and server ids when a created entity is referenced in several places and simply updating the id may cause consistency issues.
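
A minimal sketch of such a mapper. `IdMapper` is a hypothetical name and the mapping is kept in memory; in practice it would live in the local DB.

```kotlin
// Keeps the local -> server id mapping in one place so that other code
// can keep referring to the local id created while offline.
class IdMapper {
    private val localToServer = HashMap<Long, Long>()

    // Called after a successful upload of the entity.
    fun onSynced(localId: Long, serverId: Long) { localToServer[localId] = serverId }

    // Returns null while the entity has not been uploaded yet.
    fun serverIdFor(localId: Long): Long? = localToServer[localId]
}
```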

Network Limit

What else?

🚩 Storage issues (because we store info somewhere on a device)
🚩 Privacy issues (because stored info can be available for malware applications)

Background work

Why?

Various applications want to execute some work while the app is in the background.
Examples: MediaPlayer, Downloader, Service app (Antivirus).

Options

WorkManager
Schedules deferrable work for the future. The OS guarantees that the work will be executed, but not the exact time.

Foreground service
Helps to keep an App active even in the background. Especially relevant for media players.

Service
A plain Service can do work in the background, but Android has introduced many restrictions on Services running in background mode.

Broadcast receiver.
Allows waking up an App to react to various external events (changes in the file system, etc.). However, Android has reduced the number of events such receivers can react to.

Storage

Why?

Where to store data?

Options

Key-Value Storage
SharedPreferences, DataStore, Secure SP from Jetpack Security Library
+: simple
+: ideal for very small data
-: no schema
-: no migration
-: no big data

Database/ORM
SQLite, Room, Realm, etc.
+: relational db
+: performance
+: concurrency out-of-box
+: migration
-: more complicated, demands some knowledge in DB
-: insecure by default

On-Device Secure Storage
Keystore/KeyChain
Ideal for small sensitive data

Storage Location

Internal
Sandbox of an App, inaccessible to other apps unless they have root permissions.

External
Primary storage and secondary storage (SD card). Nowadays it is accessed through the Scoped Storage policy.

Media/Scoped

Server pushes

What is it and When?

When an App should receive real-time events from a Backend.
Examples: Chat app, Delivery app, Taxi app, etc.

Options

HTTP-polling

A client periodically sends requests to a Server.

Types:

Short
+: Very simple
+: No persistent connection
-: No real-time
-: Connection establishment overhead

Long
+: Realtime
-: Persistent connection
-: Backend modifications

Server-Sent Events

Allows a client to establish a stream over HTTP

A small improvement: when connecting, the client can send the id (order) of the last received message, so that the stream resumes from that message rather than from the first one.
+: realtime
+: optimized compared to HTTP Long Polling
-: possible losses and duplications due to unidirectional stream

Bidirectional protocols

WebSocket, gRPC.
More details in 🚩 Network protocols

Firebase Cloud Messaging

Allows sending data messages to an App.
+: able to wake up an App from any state of the App and OS
-: 3rd party
-: not exact realtime

Push notifications

Why?

Helps to notify users about various news.
For example, someone wrote in a chat, new album is published, breaking news, etc

Implementation

Realtime
See 🚩 Server pushes and create the notifications yourself using the OS SDK API

Non-Realtime
There is an option to delegate this to 3rd-party solutions such as Firebase Cloud Messaging, which offers a convenient API and a Console for push notifications

Backend API Design

Protocols

GraphQL

A query language for working with API - allows clients to request data from several resources using a single endpoint (instead of making multiple requests in traditional RESTful apps).

+: schema-based typed queries - clients can verify data integrity and format.
+: highly customizable - clients can request specific data and reduce the amount of HTTP-traffic.
+: bi-directional communication with GraphQL Subscriptions (WebSocket based).

-: more complex backend implementation.
-: "leaky-abstraction" - clients become tightly coupled to the backend.
-: the performance of a query is bound to the performance of the slowest service on the backend (in case the response data is federated between multiple services).

WebSocket

Full-duplex communication over a single TCP connection.

+: real time bi-directional communication.
+: provides both text-based and binary traffic.

-: requires maintaining an active connection - might have poor performance on unstable cellular networks.
-: schemaless - it's hard to check data validity on the client.
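-: every open connection consumes server resources (file descriptors, memory), which limits the number of active connections per server (the frequently cited 65k figure is a port limit per client IP address, not a hard per-server limit).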

gRPC

Remote Procedure Call framework which runs on top of HTTP/2. Supports bi-directional streaming using a single physical connection.

+: lightweight binary messages (much smaller compared to text-based protocols).
+: schema-based - built-in code generation with Protobuf.
+: provides support of event-driven architecture: server-side streaming, client-side streaming, and bi-directional streaming
+: multiple parallel requests.

-: limited browser support.
-: non human-readable format.
-: steeper learning curve.

HTTP

A text-based stateless protocol - most popular choice for CRUD (Create, Read, Update, and Delete) operations.

+: easy to learn, understand, and implement
+: easy to cache using built-in HTTP caching mechanism
+: loose coupling between client and server

-: less efficient on mobile platforms with HTTP/1.x, where parallel requests require separate physical connections (HTTP/2 multiplexes requests over a single connection)
-: schemaless - it's hard to check data validity on the client.
-: stateless - needs extra functionality to maintain a session.
-: additional overhead - every request contains contextual metadata and headers

Protocol choice example

Simple Request-Response communication - choose HTTP
Complex requests and Network limitations - try GraphQL
Real-time events from a Server (new messages, order status, driver's car location, etc.) - see 🚩 Server pushes (WebSocket, gRPC)
Strong Network limitation with a big number of requests - try something based on UDP

RESTful Service Example

API:

Paths:
    - /v1/drivers
        - GET
            - request: <parameters based on location data>
            - response
                - type: array
                - definition: Driver
    - /v1/drivers/{id}
        - GET
            - request
            - response
                - type: object
                - definition: Driver
    - /v1/riders
        - GET
            - request: <parameters based on location data>
            - response
                - type: array
                - definition: Rider
    - /v1/riders/{id}
        - GET
            - request
            - response
                - type: object
                - definition: Rider
    - /v1/trips
        - POST
            - request
                - type: object
                - definition: CreateTrip
            - response
                - type: object
                - definition: Trip // or trip_id
    - /v1/trips/{id}
        - GET
            - request
            - response
                - type: object
                - definition: Trip
        - PUT (for cancelling/editing)
            - request
                - type: object
                - definition: EditTrip

definitions:
    Driver:
        - id: Int
        - name: String
        - rating: String
        - info..
    Rider:
        - id: Int
        - name: String
        - rating: String
        - info..
    CreateTrip:
        - driver_id: Int
        - rider_id: Int
        - from: Geo
        - to: Geo
        - info..
    EditTrip:
        - trip_id
        - operation:..
    Trip:
        - id: Int
        - driver_id: Int
        - rider_id: Int
        - from: Geo
        - to: Geo
        - info..

Uploading/Downloading Big files

Big files:

See 🚩 Big Files

Problem

A TCP connection may break while the client is sending a big file to a server (or vice versa), and neither side knows how much data was transferred.

Solutions

Uploads.
The file is divided into separate segments that are transferred one by one. Google services (YouTube, Photos) provide their own custom HTTP extensions for uploads.
Downloads.
Big files are downloaded using HTTP Range requests, or see 🚩Streaming.
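
Both ideas reduce to simple byte-range arithmetic, sketched below. `resumeRangeHeader` and `chunkRanges` are hypothetical helpers; the header value follows the standard HTTP `Range: bytes=<start>-` form, while the segment size for uploads is an assumption that depends on the backend.

```kotlin
// Value for the HTTP `Range` request header that resumes a download
// after `bytesAlreadyDownloaded` bytes have been received.
fun resumeRangeHeader(bytesAlreadyDownloaded: Long): String {
    require(bytesAlreadyDownloaded >= 0)
    return "bytes=$bytesAlreadyDownloaded-"
}

// Splits a file of `fileSize` bytes into inclusive byte ranges of at most
// `chunkSize` bytes, the unit in which a segmented upload can be retried.
fun chunkRanges(fileSize: Long, chunkSize: Long): List<LongRange> {
    require(fileSize >= 0 && chunkSize > 0)
    return (0 until fileSize step chunkSize).map { start ->
        start..minOf(start + chunkSize - 1, fileSize - 1)
    }
}
```

After a broken connection the client only needs to remember how many bytes (or which segments) were confirmed and continue from there.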

Streaming

What kind of data?

video
audio

What is Streaming?

Streaming is a real-time and more efficient way to transfer audio/video files for playback than Downloading.

Streaming can work on top of both UDP and TCP. Based on TCP, one of the popular protocols is HTTP live streaming (HLS).

Buffering is a technique used by players: data is loaded ahead of the playback position into a buffer, so that playback remains smooth during short network interruptions.

HLS

HTTP live streaming (HLS) is one of the most popular protocols for streaming.

The idea. HLS splits video/audio into small downloadable HTTP files (numbered segments several seconds long), which are delivered using the usual HTTP protocol. The client accepts these files and plays them.

HLS also supports adaptive bitrate streaming, which adjusts the video/audio quality on the fly depending on network conditions. To make this possible, the server prepares several copies of each segment at different quality levels.
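
On the client side, adaptive bitrate selection can be sketched as picking the best rendition that fits the measured bandwidth. This is a simplified illustration, not the algorithm of any particular player; `Variant`, `selectVariant`, and the `safetyFactor` default are assumptions.

```kotlin
// One rendition of the stream, as it could be listed in a master playlist.
data class Variant(val name: String, val bandwidthBitsPerSec: Long)

// Picks the highest-quality variant that fits into the measured bandwidth,
// keeping some headroom; falls back to the lowest variant if none fits.
fun selectVariant(
    variants: List<Variant>,
    measuredBitsPerSec: Long,
    safetyFactor: Double = 0.8
): Variant {
    require(variants.isNotEmpty())
    val budget = measuredBitsPerSec * safetyFactor
    val sorted = variants.sortedBy { it.bandwidthBitsPerSec }
    return sorted.lastOrNull { it.bandwidthBitsPerSec <= budget } ?: sorted.first()
}
```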

Waiting list

Problem

The waiting list is a list of something that can not be executed immediately due to lack of resources (network, storage, etc.), limits, or priorities.

Example
File downloading. An app can't download all required files at once because of network limits, especially on mobile internet, and because it must not fill the entire network channel and block other apps' requests. This leads to the creation of a waiting list for file downloads.

File downloading Scheme example

Description

Job - data model describing what to execute.
Theoretically, number of Jobs can be unlimited.
Worker - executor of a Job. May implement pattern Strategy to be able to execute different types of Job.
Several Jobs may also point to the same Worker, which lets the Worker keep running as long as at least one associated Job is active.
Number of Workers is limited by a pool size.

Download Request - encapsulates a single file downloading request
Download Task - see 🚩 Downloading/Uploading API
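
A single-threaded sketch of a waiting list with a bounded number of active jobs. It simplifies the scheme above: the `Job` fields, the priority rule, and the `start` hook that launches a Worker are assumptions, and real code would need synchronization.

```kotlin
import java.util.PriorityQueue

// Job: what to execute. Higher `priority` runs first; `seq` keeps FIFO order
// among jobs with equal priority.
data class Job(val url: String, val priority: Int, val seq: Long)

// Waiting list with a bounded number of simultaneously active jobs.
// Actual downloading is out of scope: `start` is whatever launches a Worker.
class WaitingList(private val maxActive: Int, private val start: (Job) -> Unit) {
    private val queue = PriorityQueue(
        compareByDescending<Job> { it.priority }.thenBy { it.seq }
    )
    private var active = 0
    private var nextSeq = 0L

    fun enqueue(url: String, priority: Int = 0) {
        queue.add(Job(url, priority, nextSeq++))
        drain()
    }

    // Must be called when a started job finishes (success or failure).
    fun onJobFinished() {
        active--
        drain()
    }

    // Starts queued jobs while there is a free slot in the pool.
    private fun drain() {
        while (active < maxActive && queue.isNotEmpty()) {
            active++
            start(queue.poll())
        }
    }
}
```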

Cache Types

Question

Where to cache data?

Options

In-App cache (Heap)
Very fast, but lost when the app is closed

Storage (SharedPrefs, DB, FileSystem)
See 🚩 Storage

HTTP-Cache
Implementation details of a concrete HTTP client and Backend

Problem

Different countries and regions provide absolutely different possibilities for Network connections especially when a device is connected to Mobile Data.
What can we do to handle poor connections and save a user device's battery?

Options

Batch requests for Mobile Data
The radio module is one of the most energy-consuming parts of a device. See the article. It makes sense to batch requests when the App is in the background.

Reduce connections

Optimise requests
If gathering data requires many "proxy" requests (for example, loading a list of contacts, a list of phones, and a list of emails to obtain a User's info), then it makes sense to try GraphQL or to modify the backend.

Prefetch data
When a user is not active, an App can pre-download some piece of information, such as the details of the first 20 news items.

Network priorities
There is an option to prioritize requests and execute them based on these priorities. See queue-based solutions such as 🚩 Waiting list

What else?

🚩 Waiting list
🚩 Downloading/Uploading API

"State Intense" App

Problem

An App contains a lot of states that may affect each other.
A typical example is a Chat application where you have to monitor changes of users, messages, chats, etc. Taxi apps, Delivery apps, etc. may also be treated as "State Intense" apps.

Recommended patterns

Presentation Layer

MVI, because it introduces special objects that help to sort out the state mess

Domain/Data Layer

There is a big chance that the App will use 🚩 Server pushes. It means the App will listen for updates from a special channel (Stream, Flow, Observable).

Events act as "delta models" that describe what has changed.

CQRS

Command and Query Responsibility Segregation.
All actions like "SendMessage", "ChangeStatus", etc. are commands. UpdateChannel acts as a Query.

Why? Because a request may affect different areas. For example, "SendMessage" may change the list of messages, the status of a concrete message, the list of chats where the last-message field changes, etc. Dependencies can be very tricky. That's why it's simpler to listen to all updates from one or several queries.

Image thumbnails

Video and Audio Streaming

Network errors

Problem

How to handle network errors?

An error occurs. What next?

A request to the network (or DB) really causes an error (4xx, 5xx, IO exception, etc.). What to do further?

Options:

throw an exception and catch it in the upper layers
wrap a result into a special model

Recommendations:
Many developers prefer the second option because it's safer and more predictable, and the app will not crash at random moments.

Examples:

data class Result<T>(
    val data: T?,
    val error: Throwable?
)
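
A sketch of how such a wrapper is typically produced at the boundary of the network layer. The model is named `NetworkResult` here only to avoid a clash with the standard `kotlin.Result`; `safeCall` is a hypothetical helper.

```kotlin
// Same shape as the model above, renamed to avoid clashing with kotlin.Result.
data class NetworkResult<T>(val data: T?, val error: Throwable?) {
    val isSuccess: Boolean get() = error == null
}

// Runs `block` and converts any exception into an error result, so the
// upper layers handle failures as values instead of using try/catch.
inline fun <T> safeCall(block: () -> T): NetworkResult<T> =
    try {
        NetworkResult(block(), null)
    } catch (e: Exception) {
        NetworkResult(null, e)
    }
```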

Try to repeat (options)

Repeater
We can introduce a default Repeater in the App's network layer. But keep in mind that mass retries from many clients can effectively DDoS your own backend. That's why it makes sense to add exponential back-off with jitter (randomness in the delay between retries).
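
One common variant of this idea is "full jitter": the delay is a random value between zero and an exponentially growing ceiling. The sketch below assumes millisecond units; `baseMillis` and `capMillis` are illustrative defaults that should be tuned per product.

```kotlin
import kotlin.random.Random

// "Full jitter" back-off: a random delay in [0, min(cap, base * 2^attempt)].
fun backoffDelayMillis(
    attempt: Int,
    baseMillis: Long = 500,
    capMillis: Long = 30_000,
    random: Random = Random.Default
): Long {
    require(attempt >= 0)
    // Clamp the shift so that the multiplication cannot overflow.
    val exponential = baseMillis * (1L shl minOf(attempt, 20))
    val ceiling = minOf(capMillis, exponential)
    return random.nextLong(ceiling + 1)
}
```

The randomness spreads retries from many clients over time, so that they do not hit the backend in synchronized waves after an outage.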

Network availability
If an error occurred due to network problems then we can wait for network availability

Mobile Failover handling
Store a list of domains (primary and backup) and use it to switch to another domain (Uber's approach).
Scheme:

Principles:

Maximize the usage of primary domains
Effectively differentiate between network errors and host-level outages
Reduce degraded experience during primary domain outages
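
A simplified sketch in the spirit of these principles; it is not Uber's actual algorithm. `DomainSelector`, the `failureThreshold` parameter, and the example domains are assumptions. Only host-level failures should be reported to it; errors caused by the device being offline should not count.

```kotlin
// Switches to a backup domain after several consecutive host-level failures.
class DomainSelector(
    private val domains: List<String>,          // primary first, then backups
    private val failureThreshold: Int = 3
) {
    private var index = 0
    private var consecutiveFailures = 0

    val current: String get() = domains[index]

    fun onSuccess() { consecutiveFailures = 0 }

    // Report host-level outages only (not "device has no network").
    fun onHostFailure() {
        if (++consecutiveFailures >= failureThreshold && index < domains.lastIndex) {
            index++
            consecutiveFailures = 0
        }
    }

    // Called periodically (for example after a health check) to return to primary.
    fun resetToPrimary() { index = 0; consecutiveFailures = 0 }
}
```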

Security

Android Backend Security. Main points

Use HTTPS (TLS). HTTPS provides encryption (mix of symmetric and asymmetric encryption) of your communication with a server.
Use certificate (SSL) pinning to prevent man-in-the-middle attacks.
In applications like Chat, use E2E encryption to deprive a server of knowledge about messages content.
In applications where you need Auth, use OAuth/OpenID protocol. It means that you need to add an additional header field in your API: Authorization: Bearer <token>

Auth scheme (OAuth) in Android

What to clarify in the requirements

Size of files
Thumbnails to optimize work with images

Big Files

What is it?

different kinds of attachments
file
video
audio
image
etc.

Problem

The size of big files affects how to work with them:
Storing, Downloading, Uploading, API Designing.

See details in

🚩 Uploading/Downloading Big files
🚩 Downloading/Uploading API
🚩 Offline Write mode

Storing => Files vs BLOB

There are different opinions here.
The general recommendations look like below:

large blobs are not so effective compared to simple files
a file can be removed from storage by the User, a blob cannot

Attachments. File Permissions.

The User/OS may revoke permissions on a file before the App comes back online. It makes sense to copy the file to the App's own storage.

What else based on TCP

XMPP (WhatsApp, Zoom, Google Talk)
MQTT (IoT, Smart Home, etc).

What else based on UDP

WebRTC (Discord, Google Hangouts, Facebook Messenger).

TCP vs UDP

Unlike UDP, TCP uses
- Connection setup
- Acknowledgement of received segments
- A mechanism to retransmit lost packets
TCP was designed for wired connections rather than for wireless ones, which introduce additional challenges such as regular temporary connection losses

Privacy

Brief recommendations

Keep as little of the user's data as possible
Minimize the use of permissions and be ready to handle permission changes
Assume that on-device and backend storage are not secure
Assume that the target platform's (iOS/Android) Security & Privacy rules will change
Applying "cloud vs on-device" storage/calculations depends on various things, and the final decision is made case-by-case.

More details are here.

Reverse-Engineering

Protect your products against reverse-engineering (more important for Android)

Performance/Stability

Main Points

Metered data usage
Bandwidth usage
CPU usage
Memory Usage
Startup Time
Crashes/ANRs

Geo-Location Usage

Main Points

Don't compromise user privacy while using location services.
Prefer the lowest possible location accuracy. Progressively ask for increased location accuracy if needed.
Provide location access rationale before requesting permissions.

3rd-Party SDKs Usage

Main Points

3rd-Party SDKs might cause performance regressions and/or serious outages (example - Facebook SDK).
Each SDK should be guarded by a feature flag.
A new SDK integration should be introduced as an A/B test or a staged rollout.
You need to have a "support and upgrade" plan for the 3rd-party SDKs in the long term.

❓ Suggestion
group by "Http + websockets and rest + graphql + grpc"

Target market/client

Questions

number of users
target market (region)
typical usage (home/street/outdoor)
time peak usage
internet quality
devices quality

What affects

A high-load backend makes DDoS a real potential problem.
It makes sense to reduce the number of connections and introduce back-off in case of network errors.
See 🚩Network errors and 🚩Network Limit for all details

Also, high load may force you to move some calculations to the device. For example, Badoo introduced face recognition on the device to reduce the load on the server.

🚩Network Limit generally

Compliance with different local laws. Example - an Analytics Library that may handle personal data should include methods like enable/disable so that its work can be managed explicitly

Company resources limitations

Questions

developers
deadlines
how should the work on the system be divided among multiple teams?

How big files can be presented in data models in network responses?

"some_big_file": "id"
A developer can download the file using this id and a special backend endpoint like "GET /sound/file/{id}"
"some_big_file": "url"
Where "url" points to a certain resource in a CDN. It's a very simple way, but insecure