Distributed Infrastructure, Docker and Microservices, Notes:
Docker and Microservices
Microservices: A microservice architecture is a software architectural style where an application is developed as a collection of small, independent, and loosely coupled services. Each service represents a specific business capability, runs in its own process, communicates through well-defined APIs (commonly REST APIs), and can be developed and deployed independently.
Microservices for ML:
- Microservices break down the components of a machine learning system into independent services. Each service can be responsible for a specific aspect, such as data preprocessing, model training, inference, or model management.
- Microservices in machine learning allow for scalable processing of ML workloads. Different microservices can scale independently based on demand.
- Models can be deployed as separate services, allowing for independent updates, versioning, and scaling (using Docker)
- Different machine learning tasks might require different technologies, frameworks, or languages; with microservices, each service can use the best tool for its job
- Machine learning tasks often involve processing large datasets. Microservices can distribute the processing of these datasets across multiple services
- Microservices enable the creation of dynamic ML workflows by orchestrating the execution of different services based on the requirements of a specific ML task.
- Scalable Training Pipelines
- Cluster: a collection of computers (servers), called nodes, that work together
- Parallel vs. Distributed: In parallel computing, the focus is on breaking down a single, large task into smaller subtasks that can be solved simultaneously. Distributed computing, on the other hand, involves multiple computers (nodes) connected through a network working together on a task.
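The parallel idea — split one big task into subtasks solved simultaneously — can be sketched with a chunked sum. The chunking scheme and worker count here are arbitrary choices for illustration:

```python
from concurrent.futures import ThreadPoolExecutor

def partial_sum(chunk):
    # One subtask: sum a slice of the full dataset.
    return sum(chunk)

def parallel_sum(data, workers=4):
    # Break the single large task (sum everything) into smaller chunks.
    size = max(1, len(data) // workers)
    chunks = [data[i:i + size] for i in range(0, len(data), size)]
    # A thread pool keeps this sketch portable; for CPU-bound Python work,
    # ProcessPoolExecutor would give true parallelism across cores.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return sum(pool.map(partial_sum, chunks))
```

In a distributed setting the same decomposition holds, but the chunks would be shipped over the network to separate nodes instead of to local workers.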
- Concurrency: multiple tasks or processes making progress independently
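The "independent progress" in the concurrency bullet can be seen with two threads; the task names, step counts, and sleep durations are arbitrary stand-ins for real work:

```python
import threading
import time

progress = {"download": 0, "parse": 0}
lock = threading.Lock()

def worker(name, steps):
    # Each task advances on its own; the sleep simulates an I/O wait,
    # during which the other task keeps making progress.
    for _ in range(steps):
        time.sleep(0.001)
        with lock:
            progress[name] += 1

threads = [threading.Thread(target=worker, args=(name, 5)) for name in progress]
for t in threads:
    t.start()
for t in threads:
    t.join()
# Both tasks complete their 5 steps, interleaved rather than strictly sequential.
```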
- RPC: Remote Procedure Call: Imagine you have two computers, and you want one computer to ask the other to perform a specific task. RPC is like making a phone call to ask for help.
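A minimal RPC round trip can be shown with Python's standard-library XML-RPC modules (real systems more often use gRPC or similar); the `add` function and the localhost setup are just for illustration:

```python
import threading
from xmlrpc.server import SimpleXMLRPCServer
from xmlrpc.client import ServerProxy

# Server side: expose a function that remote callers can invoke.
def add(a, b):
    return a + b

server = SimpleXMLRPCServer(("127.0.0.1", 0), logRequests=False)
server.register_function(add, "add")
threading.Thread(target=server.serve_forever, daemon=True).start()

# Client side: proxy.add() looks like a local call, but the work runs on
# the server -- the "phone call to ask for help" from the note above.
port = server.server_address[1]
proxy = ServerProxy(f"http://127.0.0.1:{port}")
result = proxy.add(2, 3)
```

The client never sees the server's implementation, only the callable name and its arguments, which is exactly the contract RPC provides.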
- MapReduce: MapReduce is a programming model and processing technique designed for large-scale data processing. It was popularized by Google and is commonly used for distributed computing tasks, especially in the context of big data processing.
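The classic word-count example shows the three MapReduce phases in miniature. This single-process sketch only illustrates the model; in a real system each phase runs across many nodes:

```python
from collections import defaultdict

def map_phase(document):
    # Map: emit a (word, 1) pair for every word in one document.
    return [(word, 1) for word in document.split()]

def shuffle(pairs):
    # Shuffle: group all emitted values by their key.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Reduce: combine each key's values into a single result.
    return {key: sum(values) for key, values in groups.items()}

docs = ["big data big compute", "big data"]
pairs = [pair for doc in docs for pair in map_phase(doc)]
counts = reduce_phase(shuffle(pairs))
# counts == {"big": 3, "data": 2, "compute": 1}
```

Because map works per document and reduce works per key, both phases can be spread across nodes independently, which is what makes the model scale.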
- PySpark and Scala are commonly used to handle data in distributed computing
- Spark MLlib provides scalable implementations of common ML algorithms. TensorFlow, a popular deep learning framework, can also be used in a distributed setting.