[Discourse.ros.org] [ROS Projects] Slice and dice large ROS bag files on Hadoop and Spark


Dirk Thomas via ros-users


Large amounts of sensor and robotics data are produced by industry at an ever-increasing pace, whether from areas like mobility, perception, and smart factories, or from development tools for planning, modelling, and simulation.

New and fast-moving areas of robotics research, such as self-driving cars, put pressure on us to develop new tools and techniques for dealing with larger and more complex data sets. Several projects and industry players have publicly announced the adoption of ROS as part of their process.

On the other hand, the **Hadoop** and **Spark** ecosystems are seeing tremendous adoption for processing and analysing large data sets in parallel. (The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets.)

**Why process large ROS bag files in parallel?**

![08|690x380](/uploads/ros/original/1X/d04ed3baa759c4cef69d41fac1c97122959f42f2.png)

The ROS command **rosbag record** subscribes to topics and writes a bag file with the contents of all messages published on those topics. For performance reasons the messages are written interlaced, as they arrive over the wire, at different frequencies.
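To make the interlacing concrete, here is a minimal sketch (the record tuples and topic names are invented for illustration, not taken from the actual bag format): messages from several topics arrive mixed together in time order, and per-topic processing first has to demultiplex them.

```python
from collections import defaultdict

# Hypothetical interlaced record stream, ordered by arrival time, with
# topics at different frequencies mixed together -- roughly how
# `rosbag record` lays messages down in a bag file.
records = [
    (0.00, "/imu", "imu-0"),
    (0.01, "/imu", "imu-1"),
    (0.02, "/camera", "img-0"),
    (0.02, "/imu", "imu-2"),
    (0.03, "/imu", "imu-3"),
    (0.04, "/camera", "img-1"),
]

def demux(records):
    """Group an interlaced record stream into per-topic streams,
    preserving each topic's original ordering."""
    streams = defaultdict(list)
    for stamp, topic, payload in records:
        streams[topic].append((stamp, payload))
    return dict(streams)

streams = demux(records)
```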

**Associative** operations can be applied in parallel; more precisely, parallelism requires associativity. (Although concurrency is technically not parallelism, it also requires associativity.) Spark provides a unified functional API for processing locally, concurrently, or across multiple machines.

**Now you do not need to convert ROS bag files to work with them in Spark**

The assumption used to be that ROS bag files had to be converted into a more suitable format before they could be processed in parallel with tools like Hadoop or Spark. It turns out the format is well suited to a distributed file system like HDFS; it just happened that nobody had written a Hadoop InputFormat for it.

So we did it. We took the time and wrote a Hadoop RosbagInputFormat :grinning: published under the Apache 2.0 License.

[http://github.com/valtech/ros_hadoop](http://github.com/valtech/ros_hadoop)

RosbagInputFormat is an open source splittable Hadoop InputFormat for the rosbag file format.
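As a rough sketch of how such an InputFormat is consumed from PySpark: `SparkContext.newAPIHadoopFile` is the standard entry point for custom Hadoop InputFormats. The class and type names below are assumptions for illustration only; check the ros_hadoop repository for the exact values and for the jar that must be on the Spark classpath.

```python
# Sketch: loading a rosbag directly from HDFS in PySpark via the
# RosbagInputFormat. The names below are ASSUMED for illustration --
# consult the ros_hadoop README for the real class names and setup.

INPUT_FORMAT = "de.valtech.foss.RosbagMapInputFormat"   # assumed class name
KEY_CLASS = "org.apache.hadoop.io.LongWritable"         # assumed key type
VALUE_CLASS = "org.apache.hadoop.io.MapWritable"        # assumed value type

def load_rosbag(sc, path):
    """Return an RDD of rosbag records.

    Requires a running SparkContext (`sc`) with the ros_hadoop jar on
    the classpath; `path` is an HDFS path such as "hdfs:///data/some.bag".
    """
    return sc.newAPIHadoopFile(
        path,
        INPUT_FORMAT,
        keyClass=KEY_CLASS,
        valueClass=VALUE_CLASS,
    )
```

In a real session you would then filter or map the resulting RDD per topic, relying on the associativity argument above for correctness across splits.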

![16|653x482](/uploads/ros/original/1X/3e98bb3b9a14f0653b62244e6c03ba9906f9ea93.png)

We also prepared a Dockerfile and a step-by-step tutorial that you can use to try the concepts presented here:

[http://github.com/valtech/ros_hadoop](http://github.com/valtech/ros_hadoop)

We hope the RosbagInputFormat will be useful to you. It would be great if you gave us some feedback.

Thanks!
Adrian, Jan





---
[Visit Topic](https://discourse.ros.org/t/slice-and-dice-large-ros-bag-files-on-hadoop-and-spark/2314/1) or reply to this email to respond.


If you do not want to receive messages from ros-users please use the unsubscribe link below. If you use the one above, you will stop all of ros-users from receiving updates.
______________________________________________________________________________
ros-users mailing list
[hidden email]
http://lists.ros.org/mailman/listinfo/ros-users
Unsubscribe: <http://lists.ros.org/mailman//options/ros-users>

Dirk Thomas via ros-users


Jan, this seems great. Is it meant to run in a data center or as an on-premise solution? In other words, are there any special requirements on the computer hardware?

Did you do any profiling on how much faster you can process data using this splitter + Spark versus just doing it sequentially by playing a bag from an ext4-formatted disk?

D.





---
[Visit Topic](https://discourse.ros.org/t/slice-and-dice-large-ros-bag-files-on-hadoop-and-spark/2314/2) or reply to this email to respond.



Dirk Thomas via ros-users
In reply to this post by Dirk Thomas via ros-users


Both are possible, data center and on-premise. There are no special hardware requirements, but we recommend a three-node setup to see the benefits of parallelism. Start with 64-128 GB of memory, 4 disks, and a quad-core CPU per node.

Spark performance really scales out with multiple machines: if there are 3 splits, processing is roughly 3 times faster with 3 workers, and so on.
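The scaling claim above can be framed as a back-of-the-envelope model (an idealization, not a measured benchmark): workers process splits in parallel "waves", so the speedup over sequential processing is bounded by the number of splits and by how evenly they divide among workers.

```python
import math

def ideal_speedup(splits, workers, t_split=1.0):
    """Idealized speedup over sequential processing, ignoring
    scheduling, shuffle, and I/O overhead: workers consume splits
    in ceil(splits / workers) parallel waves."""
    sequential = splits * t_split
    parallel = math.ceil(splits / workers) * t_split
    return sequential / parallel
```

For example, 3 splits on 3 workers gives a 3x speedup in this model, while 4 splits on 3 workers gives only 2x, since one worker must take a second wave.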





---
[Visit Topic](https://discourse.ros.org/t/slice-and-dice-large-ros-bag-files-on-hadoop-and-spark/2314/3) or reply to this email to respond.

