In the O&G industry, advances in sensor technology make it possible to acquire larger and more complex datasets at rig sites. Efficient and effective processing of the acquired data, both in real time and in post-processing, is therefore more critical than ever. Traditional methods and workflows, such as hosting datasets on a centralized data server, processing them manually on a standalone machine, and manually transferring and transforming dataset formats, are inefficient, difficult to manage, and not horizontally scalable. With distributed clusters and cloud technology, data storage and processing can be scaled horizontally and optimized across multiple task nodes. However, it is not trivial to understand how a distributed cluster functions or how to integrate applications with the solutions of different cloud providers. In this paper, we introduce and illustrate a generic, holistic, high-performance distributed computing and storage system for large datasets. Domain applications across the O&G industry can use this system to process large datasets with near-instantaneous feedback, including well log interpretation, petrophysics processing, seismic processing, and machine learning and deep learning modeling. The system is composed of a generic distributed processing engine built on Apache Spark and the Advanced Message Queuing Protocol (AMQP), a generic File I/O service, a distributed data store, and a distributed computing cluster. The system architecture is highly flexible and cloud-provider agnostic: it can be implemented and deployed on an on-premise internal cluster or in a public cloud network, depending on the processing requirements.
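The core pattern behind the processing engine described above, in which clients publish task messages to an AMQP queue and a generic engine routes each task to the appropriate domain handler, can be sketched as follows. This is a minimal illustration, not the paper's implementation: a standard-library `queue.Queue` stands in for the AMQP broker, and the handler and message-field names (`interpret_well_log`, `type`, `payload`) are hypothetical placeholders.

```python
import json
import queue

def interpret_well_log(payload):
    # Placeholder for a domain task; in the real system this step would
    # submit a Spark job against data in the distributed store.
    return {"task": "well_log", "curves": len(payload.get("curves", []))}

# Registry mapping message types to handlers; the engine stays generic
# while new domain applications register their own entries.
HANDLERS = {"well_log": interpret_well_log}

def run_engine(broker):
    """Consume JSON task messages until a sentinel None is seen."""
    results = []
    while True:
        msg = broker.get()
        if msg is None:  # sentinel: shut the worker down
            break
        task = json.loads(msg)
        handler = HANDLERS[task["type"]]
        results.append(handler(task["payload"]))
    return results

broker = queue.Queue()  # stand-in for an AMQP queue
broker.put(json.dumps({"type": "well_log",
                       "payload": {"curves": ["GR", "RHOB", "NPHI"]}}))
broker.put(None)
print(run_engine(broker))  # one result dict per consumed message
```

Decoupling producers from consumers through the queue is what allows the engine to scale horizontally: additional worker processes can consume from the same queue without any change to the clients.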