[Discourse.ros.org] [ROS Projects] Slice and dice large ROS bag files on Hadoop and Spark
The industry produces sensor and robotic data in ever increasing amounts, be it from areas like mobility, perception, and smart factories, or from development tools for planning, modelling, and simulation.
Emerging robotics research topics like self-driving cars put pressure on the community to develop new tools and techniques for dealing with larger and more complex data sets. Several projects and industry players have publicly announced the adoption of ROS as part of their process.
On the other hand, the **Hadoop** and **Spark** ecosystems are seeing tremendous adoption for processing and analysing large data sets in parallel. (The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets.)
The ROS command **rosbag record** subscribes to topics and writes a bag file with the contents of all messages published on those topics. For performance reasons the messages are written interleaved, as they arrive over the wire, at different frequencies.
**Associative** operations can be applied in parallel; more precisely, parallelism requires associativity. (Although concurrency is technically not parallelism, it also requires associativity.) Spark provides a unified functional API for processing locally, concurrently, or on multiple machines.
**Now you do not need to convert ROS bag files to work with them in Spark**
The assumption was that ROS bag files had to be converted into a more suitable format before they could be processed in parallel with tools like Hadoop or Spark. It turns out that the format is well suited for a distributed file system like HDFS; it just happened that nobody had written a Hadoop InputFormat for it.
So we did it. We took the time and wrote a Hadoop RosbagInputFormat :grinning: published under the Apache 2.0 License.
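Conceptually, what any Hadoop InputFormat does is divide one large file into byte-range splits that workers can read independently. The sketch below is a hypothetical pure-Python illustration of that idea, not the actual RosbagInputFormat code; a real rosbag InputFormat must additionally align splits to record boundaries inside the bag.

```python
def compute_splits(file_size, split_size):
    """Return (offset, length) pairs covering the whole file,
    one pair per split a worker could process independently."""
    splits = []
    offset = 0
    while offset < file_size:
        length = min(split_size, file_size - offset)
        splits.append((offset, length))
        offset += length
    return splits

# A 1 GiB bag file with a 256 MiB split size yields four splits,
# so four workers can read the file in parallel.
splits = compute_splits(file_size=1 << 30, split_size=256 << 20)
assert len(splits) == 4
assert sum(length for _, length in splits) == 1 << 30
```

Because each split is just a byte range on HDFS, no conversion of the bag file is needed before Spark can schedule one task per split.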
Both are possible, data center and on-premise. There are no hard hardware requirements, but we recommend a three-node setup to see the benefits of parallelism. Start with 64-128 GB of memory, 4 disks, and a quad-core CPU per node.
Spark performance really scales out with multiple machines: with 3 splits, a job runs roughly 3 times faster on 3 workers, and so on.
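The scaling claim above can be sketched with a small, hypothetical stand-in for a Spark job: with as many workers as splits, each split is processed by its own worker and the partial results are merged at the end.

```python
from concurrent.futures import ThreadPoolExecutor

def process_split(split):
    # Stand-in for real work, e.g. deserialising the ROS
    # messages contained in one split of the bag file.
    return sum(split)

# Three splits, three workers: each worker handles one split.
splits = [range(0, 1000), range(1000, 2000), range(2000, 3000)]

with ThreadPoolExecutor(max_workers=3) as pool:
    partials = list(pool.map(process_split, splits))

total = sum(partials)
assert total == sum(range(3000))
```

In a real deployment, Spark performs the same scatter-gather across machines rather than threads, which is where the speedup over a single node comes from.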