Mobile System Design

Library

External API

Configurability


Any library must be configurable. What can be configured:

  • maxMemoryCacheSize
  • maxMemoryDiskSize
  • maxParallelDownloads
  • maxImageSize
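
As a rough illustration, the options above could be grouped into a single configuration object. This is only a sketch: the field names follow the list above, while the default values, the units, and the validation rules are assumptions made for the example.

```kotlin
// Hypothetical configuration object for the library.
// Defaults and units are illustrative assumptions, not recommendations.
data class Config(
    val maxMemoryCacheSize: Long = 32L * 1024 * 1024,  // bytes
    val maxMemoryDiskSize: Long = 256L * 1024 * 1024,  // bytes
    val maxParallelDownloads: Int = 4,
    val maxImageSize: Int = 4096                        // pixels on the longest side
) {
    init {
        // Fail fast on values that can never work.
        require(maxMemoryCacheSize >= 0) { "maxMemoryCacheSize must be non-negative" }
        require(maxMemoryDiskSize >= 0) { "maxMemoryDiskSize must be non-negative" }
        require(maxParallelDownloads > 0) { "maxParallelDownloads must be positive" }
        require(maxImageSize > 0) { "maxImageSize must be positive" }
    }
}
```

Default values let a user write Config() for the common case and override only what matters, for example Config(maxParallelDownloads = 2).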

Quality requirements


The library is an external dependency for its users, so it must meet very high quality requirements (crash-free, leak-free, etc.).
Remember the Facebook SDK failure.

Documentation + Support


External users need a well-documented API so that they can work with the Library in a straightforward way and without hacks.

Importance
The API must be designed very carefully. Once an API is released, it is hard to change without annoying its users.

Examples of API:


Base functions:

  • init(config: Config)
  • enable()/disable()
  • functional methods

See also:
🚩 API for simple functionality
🚩 Downloading/Uploading API

Privacy


Libraries that gather various data and send it to servers (for example, FirebaseAnalytics) must provide a way to be disabled in order to guarantee users' privacy.


Methods: enable()/disable()

Downloading/Uploading API

API


DownloadTask:

  • id: Int - unique id of a task
  • pause()
  • resume()
  • stop()
  • addCallback(TaskCallback)

TaskCallback:

  • onProgress(idTask: Int, progress: Int)
  • onCompleted(idTask: Int)
  • onCancelled(idTask: Int)
  • onFailed(idTask: Int, exception: Throwable)

Example:
fun execute(): DownloadTask
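
The API above can be sketched in Kotlin as two interfaces. The interface members come from the lists above; FakeDownloadTask and its advance() method are hypothetical and exist only to show the order in which callbacks fire. The sketch is not thread-safe.

```kotlin
interface TaskCallback {
    fun onProgress(idTask: Int, progress: Int)
    fun onCompleted(idTask: Int)
    fun onCancelled(idTask: Int)
    fun onFailed(idTask: Int, exception: Throwable)
}

interface DownloadTask {
    val id: Int
    fun pause()
    fun resume()
    fun stop()
    fun addCallback(callback: TaskCallback)
}

// Hypothetical in-memory task: "downloading" is simulated by calling advance().
class FakeDownloadTask(override val id: Int) : DownloadTask {
    private val callbacks = mutableListOf<TaskCallback>()
    private var paused = false
    private var finished = false
    private var progress = 0

    override fun pause() { paused = true }
    override fun resume() { paused = false }

    override fun stop() {
        if (finished) return
        finished = true
        callbacks.forEach { it.onCancelled(id) }
    }

    override fun addCallback(callback: TaskCallback) { callbacks += callback }

    // Simulates receiving `percent` more of the file; ignored while paused or finished.
    fun advance(percent: Int) {
        if (paused || finished) return
        progress = minOf(100, progress + percent)
        callbacks.forEach { it.onProgress(id, progress) }
        if (progress == 100) {
            finished = true
            callbacks.forEach { it.onCompleted(id) }
        }
    }
}
```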

API for simple functionality

What is Simple functionality?


Usually, it's a piece of your product's code that executes a single task. For example, Payments Module is responsible for only Payments.

Best Practices


We recommend using the Async approach wherever a function executes something heavy.


If you mix the blocking and async approaches, agree on a naming convention. For example, all blocking methods may have the "Blocking" suffix in their names.


Blocking


fun executeSomething()


Under the hood, this function does a lot of work and blocks the current thread.

Async


fun execute(callback: Callback) - callback approach
fun execute(): Single/Observable/Flow - reactive approach
suspend fun execute() - kotlin coroutines approach


Calling these methods doesn't block the current thread.
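
A minimal sketch of the callback approach built on top of a blocking method. ReportLoader, loadReportBlocking, and loadReport are hypothetical names; the blocking method carries the "Blocking" suffix as suggested above, and the heavy work is simulated with a short sleep.

```kotlin
import java.util.concurrent.ExecutorService
import java.util.concurrent.Executors

// Hypothetical component that exposes both a blocking and an async variant.
class ReportLoader(
    private val executor: ExecutorService = Executors.newSingleThreadExecutor()
) {
    // Blocking variant: the caller's thread waits until the work is done.
    fun loadReportBlocking(): String {
        Thread.sleep(50) // stands in for heavy work
        return "report"
    }

    // Async variant: returns immediately; the callback is invoked on the executor thread.
    fun loadReport(callback: (String) -> Unit) {
        executor.execute { callback(loadReportBlocking()) }
    }

    fun shutdown() = executor.shutdown()
}
```

Note that in this sketch the callback runs on the worker thread, so a UI caller would still need to switch back to the main thread before touching views.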

The problem


Working with big files may require a way to manage the process (pause, resume, stop). That's why the API may return a special object - "DownloadTask".

Big files:


See 🚩 Big Files

ThreadPool Executor

The problem


Where and how should we determine the threads on which a task will be executed?

Best practice


It's better to delegate such decisions to lower-level components (for example, NetworkClient, DbClient, etc.).
The first reason. A lower-level component knows better which thread pool to execute a task on (IO, Default, a separate special pool).
The second reason. Higher-level components should be free of unnecessary knowledge about the details of lower-level components. A higher-level component just calls a method, and that's all.

IO vs Default


IO - for long-running blocking operations (network, db, file system). The main goal is to avoid waiting caused by reaching the pool limit. For example, Dispatchers.IO (Kotlin Coroutines) is limited to 64 threads by default (or to the number of CPU cores, if that is larger).


Default - for short, CPU-intensive operations. The number of threads is derived from the number of CPU cores (but is at least two) to reduce thread context switching.
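
The sizing rules and the "delegate to lower components" advice can be sketched with plain java.util.concurrent executors. The function names and the NetworkClient class are hypothetical; the sizes only mirror the defaults described above and are not a tuning recommendation.

```kotlin
import java.util.concurrent.ExecutorService
import java.util.concurrent.Executors

// CPU-bound pool: one thread per core, but at least two.
fun defaultPoolSize(cores: Int = Runtime.getRuntime().availableProcessors()): Int =
    maxOf(2, cores)

// IO pool: large enough that blocked threads rarely exhaust it.
fun ioPoolSize(cores: Int = Runtime.getRuntime().availableProcessors()): Int =
    maxOf(64, cores)

// Hypothetical lower-level component that owns its thread pool,
// so callers never have to choose a thread themselves.
class NetworkClient(
    private val executor: ExecutorService = Executors.newFixedThreadPool(ioPoolSize())
) {
    fun get(url: String, onResult: (String) -> Unit) {
        // The string below stands in for a real network response.
        executor.execute { onResult("response for $url") }
    }

    fun shutdown() = executor.shutdown()
}
```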

Additional


🚩 ThreadPool Executor

"Infinite" list

Backend pagination

Frontend pagination

Offset
Provides limit and offset params for requests.
Example: GET /feed?offset=100&limit=20
+: ideal for non-changing content
+: simple
+: stateless on the server.
-: bad performance on large offset values (the database needs to skip offset rows before returning the paginated result)

Keyset
Uses the values from the last page to fetch the next set of items.
Example: GET /feed?after=2021-05-25T00:00:00&limit=20
+: translates easily into a SQL query
+: good performance with large datasets
+: stateless on the server.
-: "leaky abstraction" - the pagination mechanism becomes aware of the underlying database storage
-: only works on fields with a natural ordering (timestamps, etc)
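
The practical difference between Offset and Keyset pagination can be shown on an in-memory list. This is a sketch: FeedItem, pageByOffset, and pageByKeyset are hypothetical names, the feed is sorted newest first, and timestamps are assumed to be unique.

```kotlin
data class FeedItem(val id: Int, val createdAt: Long)

// Offset pagination: skip `offset` rows, then take `limit` rows.
fun pageByOffset(feed: List<FeedItem>, offset: Int, limit: Int): List<FeedItem> =
    feed.sortedByDescending { it.createdAt }.drop(offset).take(limit)

// Keyset pagination: take `limit` rows strictly older than the last item already seen.
fun pageByKeyset(feed: List<FeedItem>, after: Long?, limit: Int): List<FeedItem> =
    feed.sortedByDescending { it.createdAt }
        .filter { after == null || it.createdAt < after }
        .take(limit)
```

If a new item is added at the top of the feed between two requests, the offset-based second page repeats an item from the first page, while the keyset-based page continues exactly where the previous one ended.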

Cursor/Seek
Operates with stable ids which are decoupled from the database SQL queries (usually, a selected field is encoded using base64 and encrypted on the backend side).
Example: GET /feed?after_id=t1234xzy&limit=20
+: decouples pagination from SQL database.
+: consistent ordering when new items are inserted.
-: more complex backend implementation
-: does not work well if items get deleted (ids might become invalid)

Online support


The simplest option.

  • Determine that you reached some threshold. For example, RecyclerView (LayoutManager) has different callbacks where you observe items' visibility.
  • Download a new portion using the count of items or last item id/timestamp/anything

Offline support


image

Dynamic item inserts in the middle support


Problem
An App receives info that new items appeared in the middle of a current list.


First option
Show info that the list is changed and suggest reloading the list


Second option
The backend must support an additional field like "sort_id". The App should use this field to figure out where to insert the element.
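
A sketch of the second option, assuming the current list is already sorted by "sort_id" in ascending order. ListItem and insertBySortId are hypothetical names.

```kotlin
data class ListItem(val id: Int, val sortId: Long)

// Returns a new list with `newItem` inserted at the position given by its sortId.
// Assumes `items` is sorted by sortId in ascending order.
fun insertBySortId(items: List<ListItem>, newItem: ListItem): List<ListItem> {
    // binarySearch returns the index of a match, or -(insertionPoint + 1) if there is none.
    val index = items.binarySearch { it.sortId.compareTo(newItem.sortId) }
    val insertAt = if (index >= 0) index else -(index + 1)
    return items.toMutableList().apply { add(insertAt, newItem) }
}
```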

Offline Read mode

What does it mean?


We make it possible to read content (news, open details, etc.) even when a user is offline.

Options:
network only, network first, cache only, cache first

Recommended way


Two-level Caching
image

Patterns


Repository - pattern to hide the source of information
Single source of truth. In the recommended way, it's Memory first and then the DB
LCE pattern (Loading, Content, Error):

data class Data<out T>(
    val content: T? = null,
    val error: Throwable? = null,
    val loading: Boolean = false
)

How to avoid outdated info


  • Memory and DB entries have an expiration time
  • The request from VM/UseCase may contain a requirement to update data (field ⇒ refresh: Boolean)
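
Both points can be combined in a small cache sketch: entries expire after a time-to-live, and a caller may force an update with refresh = true. ExpiringCache and its parameters are hypothetical names; the clock is injected so that expiry can be tested, and the class is not thread-safe.

```kotlin
// Hypothetical memory-level cache with an expiration time and a forced refresh.
class ExpiringCache<K, V>(
    private val ttlMillis: Long,
    private val now: () -> Long = { System.currentTimeMillis() }
) {
    private data class Entry<V>(val value: V, val storedAt: Long)

    private val entries = mutableMapOf<K, Entry<V>>()

    // Returns the cached value if it is still fresh and no refresh was requested;
    // otherwise calls `load` (e.g., a DB or network read) and stores the result.
    fun get(key: K, refresh: Boolean = false, load: () -> V): V {
        val cached = entries[key]
        if (!refresh && cached != null && now() - cached.storedAt < ttlMillis) {
            return cached.value
        }
        val loaded = load()
        entries[key] = Entry(loaded, now())
        return loaded
    }
}
```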

Offline Write mode

What does it mean?


We make it possible to write something (create/edit some entities, dependent entities, etc.) when a user is offline.

Single source of truth


DB

Syncing changes

Client-side
Download info from a backend and try to merge. This leads to an inconsistent state where each client may have their own Truth.

Backend-side

Snapshot


A client pushes a snapshot of new data to the backend.
It's appropriate for simple scenarios where an app sends only new data without later editing/removing. For example, an AnalyticsLibrary can use the described solution.
Backend merging may become very tricky for more complicated scenarios where an item can be edited or contain links to other such items.

Delta Table


A client creates a new Delta Table (in addition to the Data Table, where we store data as is) in which it stores "deltas" of changes.
Example: the user with id=1 edited the node with id=100 and updated the field "text" from "la-la" to "ha-ha".


When a user is offline the App can apply local deltas on the Data Table. But, after syncing with the Backend, some rows in the Data Table can be overwritten by a Server.


To keep consistency, it is preferable during syncing to upload deltas before making any other requests to the Backend.

Sync Failure Handling


Problem. A client can send the same deltas to the backend in case of errors and related retries.


Solution for Snapshot. Add fields "status" and "last_modified_time" for related entities in DB.
Solution for Delta Table. Add UUID to each delta. Then the backend can check whether it applied a concrete delta or not. Plus fields "status" and "last_modified_time" to check that the data was sent.
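
A sketch of the UUID check for the Delta Table solution: the backend remembers the UUIDs of deltas it has already applied, so a retried upload is ignored. Delta, DeltaApplier, and their fields are hypothetical names, and the in-memory maps stand in for real backend storage.

```kotlin
import java.util.UUID

data class Delta(val uuid: UUID, val nodeId: Int, val field: String, val newValue: String)

// Hypothetical backend-side component that applies each delta at most once.
class DeltaApplier {
    private val appliedDeltas = mutableSetOf<UUID>()
    val nodes = mutableMapOf<Int, MutableMap<String, String>>()

    // Returns true if the delta was applied now, false if it is a duplicate.
    fun applyDelta(delta: Delta): Boolean {
        if (!appliedDeltas.add(delta.uuid)) return false
        nodes.getOrPut(delta.nodeId) { mutableMapOf() }[delta.field] = delta.newValue
        return true
    }
}
```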

Two ID problem


Problem. While an app is offline, all new entities are created with a local id. After the data is uploaded to the Backend, the Backend assigns its own id to these entities.


Solution 1. Explicitly add two fields to the data model: "id" (the local id obtained in the DB) and "serverId". It's desirable to keep domain data models free of such knowledge.


Solution 2. Use a mapper between local and server IDs when a created entity is used in different places and simply updating the id may cause consistency issues.

Network Limit

What else?


🚩 Storage issues (because we store info somewhere on a device)
🚩 Privacy issues (because stored info can be available for malware applications)

Background work

Why?


Various applications want to execute some work while the app is in the background.
Examples: MediaPlayer, Downloader, Service app (Antivirus).

Options


  1. WorkManager
    Allows scheduling some work for the future. The OS guarantees execution but not exact timing.

  2. Foreground service
    Helps to keep an App active even in the background. Especially relevant for various players.

  3. Service
    But Android has introduced a lot of restrictions for Services in background mode.

  4. Broadcast receiver
    Allows waking up an App to react to different external events (changes in the FileSystem, etc.). But Android has reduced the number of broadcasts that can be received this way.

Storage

Why?


Where to store data?

Options


  1. Key-Value Storage
    SharedPreferences, DataStore, Secure SP from Jetpack Security Library
    +: simple
    +: ideal for very small data
    -: no schema
    -: no migration
    -: not suitable for big data

  2. Database/ORM
    SQLite, Room, Realm, etc.
    +: relational db
    +: performance
    +: concurrency out-of-box
    +: migration
    -: more complicated, requires some DB knowledge
    -: insecure by default

  3. On-Device Secure Storage
    Keystore/KeyChain
    Ideal for small sensitive data

Storage Location


  1. Internal
    The App's sandbox, unavailable to other apps unless the device is rooted.

  2. External
    Primary storage and Secondary Storage (SD). Now, it's available through the Scoped Storage policy.

  3. Media/Scoped

Server pushes

What is it and When?


When an App should receive real-time events from a Backend.
Examples: Chat app, Delivery app, Taxi app, etc.

Options

HTTP-polling


A client periodically sends requests to a Server.


Types:

  1. Short
    +: Very simple
    +: No persistent connection
    -: No real-time
    -: Connection establishment overhead

  2. Long
    +: Realtime
    -: Persistent connection
    -: Backend modifications

Server-Sent Events


Allows a client to establish a stream over HTTP
image
A small improvement: when connecting, the client can send the id (order) of the last received message so that the stream continues from that message rather than from the first one.
+: realtime
+: optimized compared to HTTP Long Polling
-: possible losses and duplications due to unidirectional stream

Bidirectional protocols


WebSocket, gRPC.
More details in 🚩 Network protocols

Firebase Cloud Messaging


Allows sending data messages to an App.
+: able to wake up an App from any state of the App and OS
-: 3rd party
-: not exact realtime

Push notifications

Why?


Helps to notify users about various news.
For example, someone wrote in a chat, new album is published, breaking news, etc

Implementation


  1. Realtime
    See 🚩 Server pushes and create notifications manually using the OS SDK API

  2. Non-Realtime
    There is an option to delegate to 3rd party solutions like Firebase Cloud Messaging, which offers a convenient API and Console for push notifications

Backend API Design

Protocols

GraphQL


A query language for working with API - allows clients to request data from several resources using a single endpoint (instead of making multiple requests in traditional RESTful apps).


+: schema-based typed queries - clients can verify data integrity and format.
+: highly customizable - clients can request specific data and reduce the amount of HTTP-traffic.
+: bi-directional communication with GraphQL Subscriptions (WebSocket based).


-: more complex backend implementation.
-: "leaky-abstraction" - clients become tightly coupled to the backend.
-: the performance of a query is bound to the performance of the slowest service on the backend (in case the response data is federated between multiple services).

WebSocket


Full-duplex communication over a single TCP connection.


+: real time bi-directional communication.
+: provides both text-based and binary traffic.


-: requires maintaining an active connection - might have poor performance on unstable cellular networks.
-: schemaless - it's hard to check data validity on the client.
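-: the number of active connections on a single server is limited (about 65k is often cited, but the real limit depends on OS resources such as file descriptors and memory).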

gRPC


Remote Procedure Call framework which runs on top of HTTP/2. Supports bi-directional streaming using a single physical connection.


+: lightweight binary messages (much smaller compared to text-based protocols).
+: schema-based - built-in code generation with Protobuf.
+: provides support of event-driven architecture: server-side streaming, client-side streaming, and bi-directional streaming
+: multiple parallel requests.


-: limited browser support.
-: non human-readable format.
-: steeper learning curve.

HTTP


A text-based stateless protocol - most popular choice for CRUD (Create, Read, Update, and Delete) operations.


+: easy to learn, understand, and implement
+: easy to cache using built-in HTTP caching mechanism
+: loose coupling between client and server


-: less efficient on mobile platforms since every request requires a separate physical connection
-: schemaless - it's hard to check data validity on the client.
-: stateless - needs extra functionality to maintain a session.
-: additional overhead - every request contains contextual metadata and headers

Protocol choice example


  1. Simple Request-Response communication - choose HTTP
  2. Complex requests and Network limitations - try GraphQL
  3. Real-time events from a Server (new messages, order status, driver's car location, etc.) - see 🚩 Server pushes (WebSocket, gRPC)
  4. Strong Network limitation with a big number of requests - try something based on UDP

RESTful Service Example

API:

Paths:
    - /v1/drivers
        - GET
            - request: <some based on location data>
            - response
                - type: array
                - definition: Driver
    - /v1/drivers/{id}
        - GET
            - request
            - response
                - type: object
                - definition: Driver
    - /v1/riders
        - GET
            - request: <some based on location data>
            - response
                - type: array
                - definition: Rider
    - /v1/riders/{id}
        - GET
            - request
            - response
                - type: object
                - definition: Rider
    - /v1/trips
        - POST
            - request
                - type: object
                - definition: CreateTrip
            - response
                - type: object
                - definition: Trip // or trip_id
    - /v1/trips/{id}
        - GET
            - request
            - response
                - type: object
                - definition: Trip
        - PUT (for cancelling/editing)
            - request
                - type: object
                - definition: EditTrip

definitions:
    Driver:
        - id: Int
        - name: String
        - rating: String
        - info..
    Rider:
        - id: Int
        - name: String
        - rating: String
        - info..
    CreateTrip:
        - driver_id: Int
        - rider_id: Int
        - from: Geo
        - to: Geo
        - info..
    EditTrip:
        - trip_id
        - operation:..
    Trip:
        - id: Int
        - driver_id: Int
        - rider_id: Int
        - from: Geo
        - to: Geo
        - info..

Uploading/Downloading Big files

Big files:


See 🚩 Big Files

Problem


A TCP connection may break while the client is sending a big file to a server (or vice versa), and neither side knows how much data was actually transferred.

Solutions


  1. Uploads.
    The file is divided into separate segments that are transferred one by one. Google (YouTube, Photos) provides its own custom HTTP extensions for uploads.
  2. Downloads.
    Downloading of big files is handled with HTTP Range requests, or see 🚩Streaming.
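
Both ideas can be sketched with two small helpers. The function names are hypothetical: resumeRangeHeader builds the value of the HTTP Range header for continuing a download from a known byte position, and splitIntoSegments computes inclusive byte ranges for a segmented upload. The actual upload protocol is service-specific and is not shown.

```kotlin
// Value of the HTTP "Range" header that asks for everything from `bytesReceived` onward.
fun resumeRangeHeader(bytesReceived: Long): String {
    require(bytesReceived >= 0) { "bytesReceived must be non-negative" }
    return "bytes=$bytesReceived-"
}

// Splits a file of `totalSize` bytes into inclusive byte ranges of at most `segmentSize` bytes.
fun splitIntoSegments(totalSize: Long, segmentSize: Long): List<LongRange> {
    require(totalSize >= 0 && segmentSize > 0) { "sizes must be valid" }
    return (0L until totalSize step segmentSize).map { start ->
        start..(minOf(start + segmentSize, totalSize) - 1)
    }
}
```

If a transfer breaks, only the segment that was in flight has to be sent again, because the position of every completed segment is known.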

Streaming

What kind of data?


  • video
  • audio

What is Streaming?


Streaming is a real-time and more efficient way to transfer audio/video files for playback than Downloading.


Streaming can work on top of both UDP and TCP. Based on TCP, one of the popular protocols is HTTP live streaming (HLS).


Buffering is a technique used by players. The player loads data some time ahead of the current playback position, so that playback remains smooth even if the video/audio delivery is briefly interrupted.

HLS


HTTP live streaming (HLS) is one of the most popular protocols for streaming.


The idea. HLS splits video/audio into small downloadable HTTP files (numbered segments several seconds long), which are delivered using the usual HTTP protocol. The client accepts these files and plays them.


HLS also supports adaptive bitrate streaming. This allows the player to adjust the video/audio quality on the fly depending on network conditions and so on. For this purpose, the server creates several copies of the segments at different quality levels in advance.
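
A simplified sketch of how a player might choose among the quality variants. Variant and pickVariant are hypothetical names, and real players use more elaborate heuristics (buffer level, throughput history); here the choice depends only on the measured bandwidth.

```kotlin
data class Variant(val name: String, val bandwidthBps: Long)

// Picks the highest-quality variant that fits into the measured bandwidth;
// falls back to the lowest-quality variant if none fits.
fun pickVariant(variants: List<Variant>, measuredBps: Long): Variant {
    require(variants.isNotEmpty()) { "at least one variant is required" }
    val sorted = variants.sortedBy { it.bandwidthBps }
    return sorted.lastOrNull { it.bandwidthBps <= measuredBps } ?: sorted.first()
}
```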

Waiting list

Problem


The waiting list is a list of something that can not be executed immediately due to lack of resources (network, storage, etc.), limits, or priorities.


Example
File downloading. An app can't download all required files at once: bandwidth is limited, especially on mobile networks, and the app must not fill the entire network channel and block other apps' requests. All of this leads to the creation of a waiting list for file downloading.


File downloading Scheme example


image

Description


  1. Job - data model describing what to execute.
    Theoretically, the number of Jobs can be unlimited.
  2. Worker - executor of a Job. May implement the Strategy pattern to be able to execute different types of Jobs.
    Also, several Jobs may point to the same Worker, which lets the Worker keep running as long as at least one associated Job is active.
    The number of Workers is limited by a pool size.
  3. Download Request - encapsulates a single file downloading request
  4. Download Task - see 🚩 Downloading/Uploading API
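
A minimal sketch of the waiting list itself: Jobs wait in a priority queue, and at most maxWorkers of them run at the same time. The priority field and the WaitingList class are assumptions made for the example; the sketch only decides which Jobs may start and leaves the actual execution to Workers. It is not thread-safe.

```kotlin
import java.util.PriorityQueue

data class Job(val id: Int, val url: String, val priority: Int)

// Hypothetical scheduler: higher priority first, then lower id (submission order).
class WaitingList(private val maxWorkers: Int) {
    private val waiting = PriorityQueue<Job>(
        compareByDescending<Job> { it.priority }.thenBy { it.id }
    )
    private val running = mutableSetOf<Job>()

    // Adds a job and returns the jobs that should be started right now.
    fun submit(job: Job): List<Job> {
        waiting.add(job)
        return startNext()
    }

    // Frees a worker slot and returns the jobs that should be started right now.
    fun onFinished(job: Job): List<Job> {
        running.remove(job)
        return startNext()
    }

    private fun startNext(): List<Job> {
        val started = mutableListOf<Job>()
        while (running.size < maxWorkers) {
            val next = waiting.poll() ?: break
            running.add(next)
            started.add(next)
        }
        return started
    }
}
```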

Cache Types

Question


Where to cache data?

Options


In-App cache (Heap)
Very fast, but lost when the app is closed


Storage (SharedPrefs, DB, FileSystem)
See 🚩 Storage


HTTP-Cache
Implementation details of a concrete HTTP client and Backend

Problem


Different countries and regions provide absolutely different possibilities for Network connections especially when a device is connected to Mobile Data.
What can we do to handle poor connections and save a user device's battery?

Options

Batch requests for Mobile Data
The radio module is one of the most energy-consuming parts of a device. See the article. It makes sense to batch requests when an App is in the background.

Reduce connections

Optimise requests
If gathering data requires a lot of "proxy" requests (for example, loading a list of contacts, a list of phones, and a list of emails to obtain a User's info), then it makes sense to try GraphQL or modify the backend.

Prefetch data
When a user is not active, an App can predownload some piece of information, such as the details of the first 20 news items.

Network priorities
There is an option to prioritize requests and execute them based on these priorities. See Queue-based solutions like the one in 🚩 Waiting list

What else?

  1. 🚩 Waiting list
  2. 🚩 Downloading/Uploading API

"State Intense" App

Problem


An App contains a lot of states that may affect each other.
A typical example is a Chat application where you should monitor changes of users, messages, chats, etc. Taxi apps, Delivery apps, etc. may also be treated as "State Intense" apps.

Recommended patterns

Presentation Layer


MVI. Because MVI introduces special objects helping to sort out the state mess

Domain/Data Layer


There is a big chance that the App will use 🚩 Server pushes. It means the App will listen for updates from a special channel (Stream, Flow, Observable).
image
Events are like "delta-models" that show the diff.

CQRS


Command and Query Responsibility Segregation.
All actions like "SendMessage", "ChangeStatus", etc. are commands. UpdateChannel is like a Query.


Why? Because requests may affect different areas. For example, "SendMessage" may change the list of messages, the status of a concrete message, the list of chats where the last message field changes, etc. Dependencies can be very tricky. That's why it's simpler to listen for all updates from one or several queries.

Image thumbnails

Video and Audio Streaming

Network errors

Problem


How to handle network errors?

An error occurs. What next?


A request to the network (or DB) fails with an error (4xx, 5xx, IO exception, etc.). What to do next?


Options:

  • throw an exception and catch it in the upper layers
  • wrap a result into a special model

Recommendations:
Many developers prefer the second option because it's safer and more predictable, and the app will not crash at random moments.


Examples:

data class Result<T>(
    val data: T?,
    val error: Throwable?
)
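
A sketch of how such a wrapper might be produced at the boundary of the network layer. The model is named NetworkResult here (an assumption, chosen to avoid a clash with kotlin.Result), and safeCall is a hypothetical helper.

```kotlin
data class NetworkResult<T>(
    val data: T?,
    val error: Throwable?
)

// Runs `block` and converts a thrown exception into an error result,
// so the upper layers never have to catch exceptions themselves.
inline fun <T> safeCall(block: () -> T): NetworkResult<T> =
    try {
        NetworkResult(block(), null)
    } catch (e: Exception) {
        NetworkResult(null, e)
    }
```

The upper layers then branch on the error field instead of wrapping every call in try/catch.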

Try to repeat (options)


Repeater
We can introduce a default Repeater in an App’s network layer. But keep in mind that simultaneous retries from many clients can act like a DDoS attack on the backend. That's why it makes sense to add exponential back-off + Jitter (adding randomness when determining the delay).
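
A sketch of exponential back-off with jitter. The function names and default values are assumptions; the delay uses the "full jitter" variant, where the wait time is picked uniformly between zero and the capped exponential value.

```kotlin
import kotlin.random.Random

// Delay before retry number `attempt` (0-based): base * 2^attempt, capped at maxDelayMs,
// then randomized over [0, cap] so that clients do not retry in lockstep.
fun backoffDelayMs(
    attempt: Int,
    baseDelayMs: Long = 500,
    maxDelayMs: Long = 30_000,
    random: Random = Random.Default
): Long {
    require(attempt >= 0 && baseDelayMs > 0 && maxDelayMs > 0) { "invalid arguments" }
    // The shift is bounded to avoid overflow for large attempt numbers.
    val exponential = baseDelayMs * (1L shl minOf(attempt, 20))
    val capped = minOf(exponential, maxDelayMs)
    return random.nextLong(capped + 1)
}

// Calls `block` up to `maxAttempts` times, sleeping between failed attempts.
fun <T> retryWithBackoff(
    maxAttempts: Int,
    sleep: (Long) -> Unit = { Thread.sleep(it) },
    block: () -> T
): T {
    require(maxAttempts > 0) { "maxAttempts must be positive" }
    var lastError: Exception? = null
    for (attempt in 0 until maxAttempts) {
        try {
            return block()
        } catch (e: Exception) {
            lastError = e
            if (attempt < maxAttempts - 1) sleep(backoffDelayMs(attempt))
        }
    }
    throw lastError!!
}
```

In a real network layer only retryable errors (timeouts, 5xx) should be repeated; this sketch retries on any exception for brevity.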


Network availability
If an error occurred due to network problems then we can wait for network availability


Mobile Failover handling
Store a list of domains (primary and backup) and use it to switch to another domain (Uber's approach).
Scheme:
image
Principles:

  • Maximize the usage of primary domains
  • Effectively differentiate between network errors and host-level outages
  • Reduce degraded experience during primary domain outages
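
A heavily simplified sketch of these principles; it is not Uber's actual algorithm. DomainSelector, the failure threshold, and the domain names used in the example are assumptions. Only host-level failures count toward switching, and the selector can be reset to the primary domain at any time.

```kotlin
// Hypothetical selector that switches to a backup domain after repeated host-level failures.
class DomainSelector(
    private val domains: List<String>,       // primary first, then backups
    private val failureThreshold: Int = 3
) {
    private var current = 0
    private var consecutiveFailures = 0

    init {
        require(domains.isNotEmpty()) { "at least one domain is required" }
    }

    val currentDomain: String
        get() = domains[current]

    fun onSuccess() {
        consecutiveFailures = 0
    }

    // isHostLevel = false means the device itself has no network: do not switch domains.
    fun onFailure(isHostLevel: Boolean) {
        if (!isHostLevel) return
        consecutiveFailures++
        if (consecutiveFailures >= failureThreshold && current < domains.lastIndex) {
            current++
            consecutiveFailures = 0
        }
    }

    // Called periodically (or after a health check) to maximize usage of the primary domain.
    fun resetToPrimary() {
        current = 0
        consecutiveFailures = 0
    }
}
```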

Security

Android Backend Security. Main points


  • Use HTTPS (TLS). HTTPS provides encryption (mix of symmetric and asymmetric encryption) of your communication with a server.
  • Use Certificates and SSL Pinning to prevent Man-in-The-Middle attack.
  • In applications like Chat, use E2E encryption to deprive a server of knowledge about messages content.
  • In applications where you need Auth, use OAuth/OpenID protocol. It means that you need to add an additional header field in your API: Authorization: Bearer <token>

Auth scheme (OAuth) in Android


image

What to clarify in requirements


  • Size of files
  • Thumbnails to optimize work with images

Big Files

What is it?


  • different kind of attachments
  • file
  • video
  • audio
  • image
  • etc.

Problem


The size of big files affects how to work with them:
Storing, Downloading, Uploading, API Designing.


See details in

  • 🚩 Uploading/Downloading Big files
  • 🚩 Downloading/Uploading API
  • 🚩 Offline Write mode

Storing => Files vs BLOB


There are different opinions here.
The general recommendations look like below:

  • large BLOBs are less efficient than plain files
  • a file can be removed from storage by a User, a BLOB cannot

Attachments. File Permissions.


The User/OS may revoke permissions on the file before the App comes back online. It makes sense to copy the file to the App's storage.

What else based on TCP


  • XMPP (WhatsApp, Zoom, Google Talk)
  • MQTT (IoT, Smart Home, etc).

What else based on UDP


  • WebRTC (Discord, Google Hangouts, Facebook Messenger).

TCP vs UDP


  1. Unlike UDP, TCP uses
    • Connection setup
    • Acknowledgement of received segments
    • A mechanism to retransmit lost packets
  2. TCP is designed for wired connections rather than wireless ones, which introduce additional challenges such as regular temporary losses of connection

Privacy

Brief recommendations


  • Keep as little of the user's data as possible
  • Minimize the use of permissions and be ready to handle permission changes
  • Assume that on-device and backend storage are not secure
  • Assume that the target platform's (iOS/Android) Security & Privacy rules will change
  • Applying "cloud vs on-device" storage/calculations depends on various things, and the final decision is made case-by-case.

More details are here.

Reverse-Engineering


Protect your products against reverse-engineering (more important for Android)

Performance/Stability

Main Points


  • Metered data usage
  • Bandwidth usage
  • CPU usage
  • Memory Usage
  • Startup Time
  • Crashes/ANRs

Geo-Location Usage

Main Points


  • Don't compromise user privacy while using location services.
  • Prefer the lowest possible location accuracy. Progressively ask for increased location accuracy if needed.
  • Provide location access rationale before requesting permissions.

3rd-Party SDKs Usage

Main Points


  • 3rd-Party SDKs might cause performance regressions and/or serious outages (example - Facebook SDK).
  • Each SDK should be guarded by a feature flag.
  • A new SDK integration should be introduced as an A/B test or a staged rollout.
  • You need to have a "support and upgrade" plan for the 3rd-party SDKs in a long term.

❓ Suggestion
group by "Http + websockets and rest + graphql + grpc"

Target market/client

Questions


  • number of users
  • target market (region)
  • typical usage (home/street/outdoor)
  • time peak usage
  • internet quality
  • devices quality

What affects


  • A high-load backend makes DDoS a real potential problem.
    It makes sense to reduce the number of connections and introduce backoff in case of network errors.
    See 🚩Network errors and 🚩Network Limit for all details


  • Also, high load may force moving some calculations to the device. For example, Badoo introduced Face recognition on the Device to reduce the load on the server.


  • 🚩Network Limit generally


  • Compliance with different local laws. Example - an Analytics Library that could handle personal data should include methods like enable/disable so that the work of the library can be managed explicitly

Company resources limitations

Questions


  • developers
  • deadlines
  • how should the work on the system be divided among multiple teams?

How big files can be presented in data models in network responses?


  1. "some_big_file": "id"
    A developer can download the file using this id and a special backend endpoint like "GET /sound/file/{id}"
  2. "some_big_file": "url"
    Where "url" points to a certain resource in a CDN. It's a very simple way, but insecure