Ray: A Distributed Framework for Emerging AI Applications

Authors: Philipp Moritz, Robert Nishihara, Stephanie Wang, Alexey Tumanov, Richard Liaw, Eric Liang, Melih Elibol, Zongheng Yang, William Paul, Michael I. Jordan, and Ion Stoica, UC Berkeley

Abstract. The next generation of AI applications will continuously interact with the environment and learn from these interactions. These applications impose new and demanding systems requirements, both in terms of performance and flexibility. In this paper, we consider these requirements and present Ray, a distributed system to address them. Ray implements a unified interface that can express both task-parallel and actor-based computations, backed by a distributed scheduler and a distributed, fault-tolerant store that manages the system's control state.

Over the past decade, the bulk synchronous processing (BSP) model has proven highly effective for processing large amounts of data. Today, however, we are witnessing the emergence of a new class of applications, namely AI workloads, that must operate in dynamic environments, react to changes in the environment, and take sequences of actions to accomplish a goal. These requirements are naturally framed within the paradigm of reinforcement learning (RL), and they are hard to meet with current distributed systems. Many existing cluster computing systems [18, 50, 32] use a centralized scheduler, and requiring a single global scheduler to handle every task limits scalability: given that clusters with tens of thousands of cores are common, a system needs to be able to schedule hundreds of thousands or even millions of tasks per second. Moreover, concurrent tasks can be heterogeneous in duration and in the resources they need. In the case of a game, for example, it could take just a few actions (moves) to lose the game, so some simulations finish almost immediately while others run much longer.

GCS and horizontal scalability. Ray stores all of the system's control state in a sharded Global Control Store (GCS) and adds global schedulers that balance load across nodes. Decoupling the control state from the components that use it keeps every other component effectively stateless: a failed component can simply restart and read the latest state from the GCS, and it also becomes easy to horizontally scale every component. Together, this architecture implements dynamic task graph execution, which in turn supports both a task-parallel and an actor programming model. Ray achieves fault tolerance by using a variety of techniques, including stateless components, replication, and lineage-based replay, summarized in Table 5. For scalability analysis, the average number of requests per second that each component should handle can be expressed as a function of system and workload parameters, where N is the number of nodes, s the number of GCS shards, g the number of global schedulers, w the average number of tasks per second generated by a node, and ϵ the percentage of tasks submitted to the global scheduler. In our evaluation, a driver on the first node submits 100K tasks that are rebalanced across the available nodes, and the rightmost datapoint of the throughput experiment shows that Ray can process 100 million tasks in less than a minute (54s).

Tasks write all of their outputs to their node's in-memory object store, which allows workers and actors to share data efficiently, and a remote function call immediately returns a future that can be passed as an argument into another remote function.
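To make this concrete, here is a minimal sketch of that pattern using Ray's Python API; the function names and the data they pass around are illustrative only.

    import ray

    ray.init()

    @ray.remote
    def create_data():
        # Executes asynchronously on some worker; the caller immediately
        # receives a future (an object reference) for the result.
        return list(range(1000))

    @ray.remote
    def process(data):
        # When a future is passed as an argument, Ray waits for the producing
        # task to finish and substitutes the actual value before running this.
        return sum(data)

    data_future = create_data.remote()
    result_future = process.remote(data_future)   # a future passed as an argument
    print(ray.get(result_future))                 # blocks until the result is ready

Because the second call only names the future, the two tasks can be placed on any nodes, and the intermediate data moves through the object store rather than through the driver.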
To support such dynamic computation graphs while meeting strict performance targets, Ray employs a new distributed architecture built around a sharded storage system and a novel bottom-up distributed scheduler. Two aspects of the design distinguish it from prior systems. First, Ray extends the task model with an actor abstraction. Second, Ray employs a fully distributed control plane, so that each component is horizontally scalable and fault tolerant: the system can transparently recover from worker node failures and elastically scale, adding more GCS shards whenever the GCS becomes a bottleneck.

The programming model is deliberately minimalist. Remote functions operate on immutable objects and are expected to be stateless and side-effect free: their outputs are determined solely by their inputs. This implies idempotence, which simplifies fault tolerance through function re-execution on failure. Invoking a remote function is a non-blocking call that immediately returns one or more futures, and remote functions can invoke other remote functions, producing nested tasks. Functions are automatically registered with the GCS upon initialization and distributed to every worker in the system. Like other systems that rely on replay, Ray assumes that tasks (and, as discussed later, actor methods) are deterministic; centralizing the system control information in the GCS allows us to detect non-deterministic computations, to easily reproduce most errors, and to build debugging, profiling, and visualization tools that inspect, among other things, the computation graph and the current locations of all objects. We are also actively developing garbage collection policies to bound storage costs in the GCS. The minimalist tools we have built so far have already proven useful to our early users, and parallelizing a serial implementation typically requires only minimal changes to the structure of the serial program.

Because simulations and other tasks may have widely varying durations, the API also provides a wait primitive: given a list of futures, ray.wait() returns the futures whose corresponding tasks have completed, either as soon as a requested number of them are ready or after a timeout expires. This makes it possible to process results in completion order rather than submission order and avoids blocking on stragglers.
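The sketch below uses ray.wait() to consume results as they finish; the simulate() function and its sleep-based workload are stand-ins for real tasks whose durations are not known in advance.

    import random
    import time

    import ray

    ray.init()

    @ray.remote
    def simulate(task_id):
        # Stand-in for a simulation whose running time is unpredictable.
        time.sleep(random.uniform(0.1, 1.0))
        return task_id

    futures = [simulate.remote(i) for i in range(10)]

    # Process results in completion order instead of waiting for stragglers.
    remaining = futures
    while remaining:
        ready, remaining = ray.wait(remaining, num_returns=1)
        print("finished:", ray.get(ready[0]))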
Here, we consider a simple RL workload to motivate the API. Three characteristics distinguish RL applications from more traditional supervised learning workloads. First, they often rely heavily on simulations to explore states and discover the consequences of actions; a simulation might range from a board game to a physics engine modeling the Newtonian dynamics of a physical system [11], and RL methods built on such simulations already target a wide range of real-world applications [27], including self-driving cars, UAVs [33], and robotic manipulation. Second, the computation is heterogeneous and dynamic: the time it takes to compute a trajectory can vary significantly, so completion times are not known in advance. Third, the application often must interact with the environment in real time: after computing an action by evaluating the policy, the application may need to perform more simulations before the next action is due.

An RL system (Figure 1(a)) consists of an agent that repeatedly interacts with an environment; the environment, agent, state, action, and reward are all application-specific (Table 1). The goal of the agent is to learn a policy that maximizes some cumulative reward. Interacting with the environment, the agent must infer the state of the environment and compute a new action by evaluating the policy on that state. Training alternates between two steps: the current policy is used to generate a set of rollouts, where a rollout is a trajectory of states and rewards obtained by interacting with the environment, and the collected trajectories are then used to improve the policy. In serial pseudocode, the training loop runs while the policy has not converged; each iteration initializes an empty list of trajectories, invokes rollout(environment, policy) k times to fill it, and then updates the policy with the result.
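The serial structure just described can be sketched as plain Python. The environment and policy interfaces (reset, step, compute, has_converged, update) are assumptions made for illustration, not an API defined by the paper.

    def rollout(environment, policy):
        # Generate one trajectory: repeatedly evaluate the policy, apply the
        # chosen action to the environment, and record states and rewards
        # until the episode terminates.
        trajectory = []
        state = environment.reset()
        done = False
        while not done:
            action = policy.compute(state)
            state, reward, done = environment.step(action)
            trajectory.append((state, action, reward))
        return trajectory

    def update_policy(policy, trajectories):
        # Placeholder: improve the policy using the collected trajectories.
        return policy.update(trajectories)

    def train_policy(environment, policy, k=10):
        # Serial training loop: generate k rollouts, then improve the policy.
        while not policy.has_converged():
            trajectories = []
            for _ in range(k):
                trajectories.append(rollout(environment, policy))
            policy = update_policy(policy, trajectories)
        return policy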
The serial program handles tasks with heterogeneous durations poorly: the rollouts are generated one after another even though they are independent. Parallelizing it with Ray requires minimal changes to the structure of the serial program. The rollout and policy-update functions are declared as remote functions and invoked with .remote(), so each call immediately returns a future; since the policy update needs every trajectory from the current round, the driver uses ray.get(), which, in contrast to ray.wait(), blocks until all requested futures are available. The result is the computation graph of executing rollouts shown in Figure 1(c); a sketch of the parallelized loop is given at the end of this passage. Ray's support for nested tasks is also important here, because a single training run may itself launch many simulations and multiple experiments may run concurrently.

Ray represents such an execution as a dynamic task graph with two kinds of nodes, data objects and tasks, and two types of edges: data edges and control edges. Data edges capture the dependencies between data objects and tasks: if task T produces object D, we add a data edge from T to D, and if D is an input to T, we add a data edge from D to T. Control edges capture the computation dependencies that result from nested remote functions. Actor method invocations are also represented as nodes in the graph; they are identical to tasks with one key difference. To capture the state dependency across subsequent method invocations on the same actor, we add a third type of edge, a stateful edge, so that all methods invoked on the same actor object form a chain connected by stateful edges (Figure 3(b)). Tasks write all outputs to the local object store, and we use Apache Arrow [1], an efficient memory layout that is becoming the de facto standard in data analytics; this allows zero-copy data sharing through shared memory between tasks running on the same node.
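A sketch of the parallelized loop follows. The function bodies are unchanged from the serial sketch above; only the @ray.remote decorators, the .remote() call sites, and the final ray.get() differ. The environment and policy interfaces remain illustrative assumptions.

    import ray

    ray.init()

    @ray.remote
    def rollout(environment, policy):
        # Same logic as the serial rollout(); the decorator is the only change.
        trajectory = []
        state = environment.reset()
        done = False
        while not done:
            action = policy.compute(state)
            state, reward, done = environment.step(action)
            trajectory.append((state, action, reward))
        return trajectory

    @ray.remote
    def update_policy(policy, *trajectories):
        # Placeholder: improve the policy using the collected trajectories.
        return policy.update(trajectories)

    def train_policy(environment, policy, k=10):
        while not policy.has_converged():
            # Launch k rollouts in parallel; each call returns a future at once.
            futures = [rollout.remote(environment, policy) for _ in range(k)]
            # Futures are passed straight into update_policy, so trajectories
            # need not be copied back to the driver.
            policy = ray.get(update_policy.remote(policy, *futures))
        return policy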
Ray's architecture comprises an application layer that implements the API and a system layer that provides high scalability and fault tolerance. The application layer consists of three types of processes. A driver is a process executing the user program. A worker is a stateless process that executes tasks (remote functions) invoked by a driver or another worker. An actor, in contrast, is a stateful process that executes only the methods it exposes; unlike a worker, an actor is explicitly instantiated by a driver or a worker, which creates it as a remote actor and receives a reference to it, and its methods depend on the state produced by previous method invocations.

The system layer consists of three components: a Global Control Store (GCS), a bottom-up distributed scheduler, and an in-memory distributed object store. This allows both the storage and computation layers to scale independently, and each component is horizontally scalable and fault tolerant; Table 4 summarizes the techniques used to scale each component and the associated overhead. Tasks are submitted bottom-up, from drivers and workers to a local scheduler first, not to a global scheduler. A local scheduler may choose to schedule a task locally unless the node is overloaded or it cannot satisfy the task's requirements (for example, it lacks a GPU), in which case it forwards the task to a global scheduler. The global scheduler looks up the locations of the task's inputs in the GCS and picks the node to assign the task to, aiming to maintain balanced load throughout the system; local schedulers report their load to the GCS through periodic heartbeats, and the GCS records this information and forwards it to the global schedulers. This decreases task latency by allowing local decisions and increases system throughput, which matters because Ray targets ms-level, not second-level, task scheduling. Ray's object store is implemented as a single-threaded event loop per node; it keeps objects in memory and evicts them as needed, and as object dependencies become available, tasks become ready for dispatch, with node availability triggering dispatch of as many tasks as possible under the node's capacity constraints. For simplicity, the object store does not build in support for distributed objects, so large matrices or trees can be implemented at a higher level as collections of futures. The GCS is sharded so that storage scales with the cluster. As an end-to-end example, upon ray.get(idc)'s invocation the driver checks its local object store and, if object c is missing, registers a callback with the Object Table to be triggered when c's entry has been created.

Flexibility. Ray extends the already general dynamic task model [32] in two ways: the ray.wait() primitive efficiently handles tasks whose completion times are not known in advance, and the actor abstraction handles third-party simulators and amortizes expensive setups. Remote functions alone are not sufficient because they are stateless: local state cannot be carried across tasks, so a third-party simulator or model weights held on a GPU would have to be re-initialized on every call. To address this limitation, Ray complements the task-parallel model with actors, which are commonly used to wrap third-party simulators and other stateful components.
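The sketch below shows what wrapping a stateful component in an actor looks like with Ray's Python API; the Simulator class and its step() method are stand-ins for a real third-party simulator.

    import ray

    ray.init()

    @ray.remote
    class Simulator:
        def __init__(self):
            # Expensive setup (loading a third-party engine, large assets, ...)
            # happens once and is amortized across many method calls.
            self.state = 0

        def step(self, action):
            # Methods run serially on the actor and may mutate its state.
            self.state += action
            return self.state

    sim = Simulator.remote()                     # explicitly instantiate the actor
    futures = [sim.step.remote(1) for _ in range(5)]
    print(ray.get(futures))                      # [1, 2, 3, 4, 5]

Each method call returns a future, so the driver can overlap the actor's work with other tasks.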
Beyond wrapping simulators, we have found actors useful for managing more general forms of state, for example holding model parameters, interfacing with third-party services, or pausing and resuming long-running experiments. Higher-level libraries build directly on these primitives; RLlib, a scalable reinforcement learning library, is implemented on top of Ray's tasks and actors.

Ray also supports heterogeneous resources, such as GPUs. Concurrent tasks may differ not only in duration but also in the resources they need, so the API allows developers to specify resource requirements for tasks and actors; the scheduler then places them only on nodes that can satisfy those requirements, rather than requiring all nodes in the cluster to have identical resources.
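As an illustrative sketch (the functions and the specific resource amounts are made up), resource requirements are attached where the remote function is declared:

    import ray

    ray.init()

    @ray.remote(num_gpus=1)
    def train_step(model_weights):
        # This task will only be scheduled on a node with a free GPU.
        return model_weights

    @ray.remote(num_cpus=2)
    def simulate(seed):
        # CPU-only task; it can run on machines without GPUs.
        return seed

    gpu_future = train_step.remote({"w": 0})
    cpu_futures = [simulate.remote(i) for i in range(4)]
    print(ray.get([gpu_future] + cpu_futures))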
Fault tolerance. Ray provides transparent fault tolerance through lineage re-execution, which obviates the need for users to handle faults explicitly. Object reconstruction. The GCS records the dependencies between data objects and tasks, so lineage survives failures. To reconstruct a lost object, we walk backward along data and stateful edges until we reach objects that are still available and re-execute the corresponding tasks; when a node fails, it is marked as lost and the objects stored only on that node are recovered this way. Note that for any object whose lineage includes stateful edges, reconstruction is more expensive: the lost actor's methods must be re-executed serially, in order, to rebuild its state. In one of our fault tolerance experiments, the lost actor's methods are replayed serially from t=210s to t=330s before new method calls can proceed.

The GCS itself is made fault tolerant through replication: we use a hot replica for each shard, and when a client writes to one of the shards of the GCS, it duplicates the writes to all replicas. Because all other components keep their shared state in the GCS, they are effectively stateless and can simply restart and resume from the latest GCS state.

To improve recovery time for actors, the actor's state can be checkpointed periodically; the majority of reconstruction is then done by restoring the actor from a checkpoint, and only the method calls issued since that checkpoint are replayed. If the length of the chain of stateful edges exceeds some threshold, taking a checkpoint bounds the reconstruction time. In the experiment above, with checkpointing only 500 tasks need to be re-executed and new method calls stall for 60s, versus 10K re-executions and 120s without checkpointing.
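Checkpointing can also be driven entirely at the application level. The sketch below is a user-level illustration of the pattern; the Counter class and the save/restore method names are made up, and this is not Ray's built-in checkpoint mechanism.

    import ray

    ray.init()

    @ray.remote
    class Counter:
        # A toy stateful actor; in practice this could wrap a simulator.
        def __init__(self):
            self.value = 0

        def step(self):
            self.value += 1
            return self.value

        def save_checkpoint(self):
            # Return a copy of the state so the caller can keep it around.
            return self.value

        def restore_checkpoint(self, saved_value):
            self.value = saved_value

    counter = Counter.remote()
    for _ in range(250):
        counter.step.remote()

    # Periodically take a checkpoint of the actor's state.
    checkpoint = ray.get(counter.save_checkpoint.remote())

    # After a failure, a newly created actor can be restored from the checkpoint
    # instead of replaying the entire chain of method calls.
    recovered = Counter.remote()
    ray.get(recovered.restore_checkpoint.remote(checkpoint))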
Evaluation. Our experiments ran on Amazon EC2 using m4.4xlarge and m4.16xlarge (high CPU) instances and p2.16xlarge (GPU) instances, and results are averaged over 5 runs. To measure scheduler throughput, we benchmark an embarrassingly parallel workload in which a driver continually submits and retrieves rounds of 10000 tasks while the cluster size grows; Ray scales past 1 million tasks per second, and the scheduling overhead it adds does not noticeably affect the performance of our applications. In a second benchmark, the driver submits rounds of tasks where each task is dependent on a task in the previous round, exercising the GCS and the data path. For the object store, as object size increases the write throughput of copying data from a worker into the object store's shared memory exceeds 15GB/s from a single client. At the application level, we implemented an Evolution Strategies (ES) algorithm, compared it with a reference implementation and with Asynchronous Advantage Actor Critic (A3C), and measured the time to reach a score of 6000 in the Humanoid-v1 task; the ES algorithm from Section 6.3.1 was easy to implement, and the Ray implementation was also substantially simpler to develop than the special-purpose reference. That reference exhibits the symmetric architecture typical of MPI applications, in which all processes run the same code and all nodes must have identical resources, in this case preventing the use of CPU-only machines. We also used Ray to control a simulated robot in real time, running the simulation faster than real time (using a 3 millisecond time step).

Discussion. We are sometimes asked whether fault tolerance is really needed for AI applications; after all, due to the statistical nature of many algorithms, an application could simply ignore lost results. Our experience suggests it is still worthwhile: transparent fault tolerance obviates the need for users to handle faults explicitly, makes programs much easier to write and reason about, and makes it practical to run the system on cheap infrastructures where nodes may be preempted (e.g., AWS spot instances). Centralizing control state in the GCS also simplified development: every other component could be made stateless, scaled by sharding or replication, and debugged by inspecting the GCS directly, whereas an earlier implementation that used a collection of Redis servers as message buses and relied on low-level multiprocessing libraries for sharing data was much harder to evolve.

Related work. Ray draws inspiration from pioneering dynamic task graph systems such as CIEL [32], but it uses an in-memory (instead of a file-based) object store, does not couple the task model to a scripting language (Skywriting), and adds the actor abstraction and a fully distributed, scalable control plane. Dryad [25] and Hadoop [49] implement a centralized scheduler; other systems reduce scheduling overhead by batching task submission (e.g., Drizzle [48]) or by hierarchical scheduling, where the top-level scheduler can still become a bottleneck, while in Sparrow every scheduler schedules independent jobs and Omega uses a distributed, shared-state architecture. Naiad [31] is a dataflow system that improves scalability for some workloads but only supports static task graphs. TensorFlow [5] and MXNet [17] target deep learning workloads, but they neither provide an actor abstraction nor implement a distributed, scalable control plane and scheduler. Orleans [14] provides an actor abstraction and, for scaling out, also allows multiple instances of an actor to run concurrently, but unlike Ray, the Orleans developer must explicitly checkpoint actor state and intermediate responses. Erlang and the C++ Actor Framework (CAF) [16], two other actor-based systems, also require the application to explicitly handle fault tolerance; Erlang's global state store is not suited to sharing large objects, while CAF does not support data sharing. OpenMPI [21] can achieve high performance, but it is relatively hard to program as it requires explicit coordination to handle heterogeneous and dynamic task graphs; systems such as MPI and Spark instead provide specialized tree-reduction operators (e.g., rdd.treeAggregate), whereas in Ray such patterns are expressed directly with nested tasks. Some of the GCS's storage and fault tolerance requirements could also be met by existing systems such as RAMCloud [35].

Adoption and applications. Ray is open source. Since we released Ray several months ago, over one hundred people have downloaded and used it, and early users have built a range of RL applications on top of it. One example is hyperparameter search: Ray enables developers to build a hyperparameter search algorithm [28] in roughly 30 lines of code. The actor abstraction allowed us to pause and resume stateful experiments based on the progress of other experiments (see Section B.3), and the ray.wait() primitive allowed us to process the results of experiments in completion order and to stop configurations that perform worse than the current best result. In the example below, assume we have an experiment class with the following interface: it can be advanced one epoch at a time and reports its current score. The implementation could be written in a more concise fashion, but we leave it in this form in favor of readability.
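A hedged sketch of this pattern is shown below; the Experiment actor, its run_epoch() method, the learning-rate configurations, and the early-stopping threshold are all illustrative inventions rather than the exact code from the paper.

    import random

    import ray

    ray.init()

    @ray.remote
    class Experiment:
        # Illustrative actor wrapping one hyperparameter configuration so that
        # training can be advanced a little at a time and stopped early.
        def __init__(self, learning_rate):
            self.learning_rate = learning_rate
            self.score = 0.0

        def run_epoch(self):
            # Placeholder for one epoch of training; returns a validation score.
            self.score += random.random() * self.learning_rate
            return self.score

    max_epochs = 20
    experiments = [Experiment.remote(lr) for lr in (0.001, 0.01, 0.1, 1.0)]
    active = {exp.run_epoch.remote(): (exp, 1) for exp in experiments}
    best = 0.0

    while active:
        # Handle whichever experiment finishes its epoch first.
        [ready], _ = ray.wait(list(active), num_returns=1)
        experiment, epoch = active.pop(ready)
        score = ray.get(ready)
        best = max(best, score)
        # Stop configurations that fall far behind the best result so far,
        # or that have already trained for the maximum number of epochs.
        if epoch >= max_epochs or score < 0.5 * best:
            continue
        active[experiment.run_epoch.remote()] = (experiment, epoch + 1)

    print("best score:", best)

Variations of the same structure, for example resuming a stopped configuration later from the actor's saved state, follow directly from the actor and ray.wait() primitives.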