Analyzing Oil and Gas Production Growth

Authors
  • Samad Ahmed

The oil and gas industry is a complex, dynamic sector where efficient data analysis can lead to valuable insights. I've worked across several domains in this industry, from customer-facing web and mobile apps to internal tools that help accounting and financial teams analyze their data, all of which involved heavy SQL and data engineering.

In this guide, we'll dive into using SQL to unlock insights from well production data and transform these insights into strategies for revenue growth.

(Image: a wellsite)

Situation At Hand

Currently, let's assume that "analytics" at the company is driven primarily by a shared CSV passed around in a weekly email to its intended recipients. The goal is for this data to become accessible within an internal reporting application, with role-based access granted to the right teams. Today, the data is assembled manually through a joint effort between the accounting and business teams. Our job is to automate this process and modernize the company's analytics workflow so it can move toward becoming a data-driven organization.

Understanding Our Data

Assume we have data resembling the following (the shared CSV being passed around), which provides detailed information about multiple oil wells: their unique identifiers, the basins they are located in, production for specific years and months, operating costs, water cut percentages, and gas-oil ratios.

Table: Multibasin Oil Well Production Data

| Well ID | Well Name | Well Description | Basin   | Year | Month | Production | Operating Cost | Water Cut (%) | Gas-Oil Ratio |
|---------|-----------|------------------|---------|------|-------|------------|----------------|---------------|---------------|
| 1001    | Vitalis   | Primary          | Permian | 2022 | Jan   | 750        | 1500           | 10            | 1500          |
| 1001    | Vitalis   | Primary          | Permian | 2022 | Feb   | 800        | 1600           | 11            | 1600          |
| 1002    | Phoenix   | Secondary        | Bakken  | 2022 | Jan   | 650        | 1400           | 12            | 1400          |
| 1002    | Phoenix   | Secondary        | Bakken  | 2022 | Feb   | 700        | 1450           | 10            | 1450          |
| 1003    | Discovery | Exploration      | Eagle   | 2022 | Jan   | 600        | 1550           | 13            | 1550          |
| 1003    | Discovery | Exploration      | Eagle   | 2022 | Feb   | 900        | 1650           | 9             | 1650          |
| 1004    | Triumph   | Primary          | Permian | 2022 | Jan   | 850        | 1700           | 10            | 1700          |
| 1004    | Triumph   | Primary          | Permian | 2022 | Feb   | 950        | 1750           | 11            | 1750          |

We're expecting to fully migrate multiple sources so that authorized users can self-serve their own analytics. In this case, we'll need a data warehousing solution that is both scalable and flexible. Snowflake is a reasonable candidate for a multitude of reasons, especially from a cost perspective, because compute and storage are priced separately. This means queries against smaller tables cost less, and cost scales with usage rather than being fixed.

Aside from cost, Snowflake offers quite an advantage from a security, governance, and compliance perspective. From a control standpoint, Snowflake combines both discretionary access control (DAC) and role-based access control (RBAC).

  • DAC: each object has an owner who can then grant access to that object

  • RBAC: access privileges are assigned to roles, which are then assigned to users

This suits us well because we can grant access not only to the data, but also to the reports themselves, which will be built within Snowflake to spare our end users from having to learn multiple tools.
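
As a minimal sketch of how the two models combine in practice (the role, database, schema, and user names here are made up for illustration):

-- RBAC: privileges are granted to a role, and the role is granted to users.
CREATE ROLE IF NOT EXISTS reporting_analyst;
GRANT USAGE ON DATABASE analytics TO ROLE reporting_analyst;
GRANT USAGE ON SCHEMA analytics.production TO ROLE reporting_analyst;
GRANT ROLE reporting_analyst TO USER jdoe;

-- DAC: every object has an owning role that controls who else gets access,
-- and ownership itself can be handed to another role.
GRANT OWNERSHIP ON SCHEMA analytics.production TO ROLE data_engineering COPY CURRENT GRANTS;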

Ingesting Your Data

Let's start by creating a table called oil_well_production. Since each well reports one row per month, the natural key is the combination of well_id, year, and month, so we'll declare a composite primary key on those columns to prevent duplicate entries for the same well in the same month. (Keep in mind that Snowflake treats PRIMARY KEY and UNIQUE constraints as informational rather than enforced.)

CREATE OR REPLACE TABLE oil_well_production (
    well_id INT,
    well_name STRING,
    well_description STRING,
    basin STRING,
    year INT,
    month STRING,
    production INT,
    operating_cost INT,
    water_cut INT,
    gas_oil_ratio INT,
    PRIMARY KEY (well_id, year, month) -- One row per well per month (not enforced by Snowflake)
);
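
Because that constraint is informational only, a quick duplicate check is worth keeping around; a minimal sketch:

-- Spot any well/month combinations that were loaded more than once.
SELECT
    well_id,
    year,
    month,
    COUNT(*) AS row_count
FROM
    oil_well_production
GROUP BY
    well_id, year, month
HAVING
    COUNT(*) > 1;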

Our intention is to use this table as our "clean" table, meaning our data should be parsed and transformed prior to load. The CSVs in use today are created by external processes that are independent of the engineering team. However, let's assume the team responsible for creating the CSVs has a load process set up to dump these files into a raw S3 bucket for auditing purposes.

We can load these files into Snowflake as they are dumped by creating an external stage that points at the S3 bucket and ingesting from it with Snowpipe.

CREATE OR REPLACE STAGE s3_well_production_bucket
URL = 's3://your-s3-bucket/path/to/csvs/'
CREDENTIALS = (
    AWS_KEY_ID = 'your-access-key-id'
    AWS_SECRET_KEY = 'your-secret-access-key'
);
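
Embedding AWS keys directly in the stage works for a quick start, but Snowflake also supports storage integrations, which delegate authentication to an IAM role instead of long-lived keys. A sketch of that alternative (the integration name and role ARN are placeholders):

CREATE OR REPLACE STORAGE INTEGRATION s3_well_production_int
  TYPE = EXTERNAL_STAGE
  STORAGE_PROVIDER = 'S3'
  ENABLED = TRUE
  STORAGE_AWS_ROLE_ARN = 'arn:aws:iam::<account-id>:role/snowflake-access-role' -- placeholder ARN
  STORAGE_ALLOWED_LOCATIONS = ('s3://your-s3-bucket/path/to/csvs/');

CREATE OR REPLACE STAGE s3_well_production_bucket
URL = 's3://your-s3-bucket/path/to/csvs/'
STORAGE_INTEGRATION = s3_well_production_int;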

Now you can set up a Snowpipe that monitors your external stage for new files and automatically loads them into the table, as long as the CSV columns line up with the table's columns.

CREATE OR REPLACE PIPE csv_ingestion
AUTO_INGEST = TRUE -- Enables automatic ingestion
AS
COPY INTO oil_well_production
FROM @s3_well_production_bucket
FILE_FORMAT = (TYPE = CSV SKIP_HEADER = 1) -- Assumes the CSVs include a header row
ON_ERROR = 'CONTINUE';
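
AUTO_INGEST relies on S3 event notifications, so after creating the pipe you'd typically grab its notification channel ARN, point the bucket's event notifications at it, and then confirm files are flowing:

-- The notification_channel column contains the SQS ARN that the S3 bucket's
-- event notifications should target.
DESC PIPE csv_ingestion;

-- Check that the pipe is running and whether any files are pending.
SELECT SYSTEM$PIPE_STATUS('csv_ingestion');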

Querying Your Data

You've started the process to continuously ingest your CSV data into Snowflake for analytics. Now we can actually run queries against it and start creating value for your teams. When building, always go for the low-hanging fruit and deliver value as fast as you can. It not only serves as a stepping stone, but gives your stakeholders confidence in your process.

Here is one example of an easily achievable query that brings value.

-- Calculate the total production and operating costs per basin for the year 2022
WITH BasinSummary AS (
    SELECT
        basin,
        SUM(production) AS total_production,
        SUM(operating_cost) AS total_operating_cost
    FROM
        oil_well_production
    WHERE
        year = 2022
    GROUP BY
        basin
),

-- Find the top-performing well in each basin
TopWells AS (
    SELECT
        basin,
        well_name,
        MAX(production) AS max_production
    FROM
        oil_well_production
    WHERE
        year = 2022
    GROUP BY
        basin, well_name
    QUALIFY
        ROW_NUMBER() OVER (PARTITION BY basin ORDER BY MAX(production) DESC) = 1 -- Keep only the top well per basin
)

-- Combine the results
SELECT
    bp.basin,
    bp.total_production,
    bp.total_operating_cost,
    tw.well_name,
    tw.max_production
FROM
    BasinSummary AS bp
JOIN
    TopWells AS tw
ON
    bp.basin = tw.basin
ORDER BY
    bp.basin;

In this query, we first calculate the total production and operating costs per basin for the year 2022 using a common table expression (CTE) named BasinSummary. Then, in a second CTE named TopWells, we compute each well's highest monthly production for the same year and use QUALIFY to keep only the top well per basin. Finally, we join the two CTEs on basin to pair each basin's totals with its top-performing well.

This would output the following:

Table: Basin Production Summary for 2022

| BASIN   | TOTAL_PRODUCTION | TOTAL_OPERATING_COST | WELL_NAME | MAX_PRODUCTION |
|---------|------------------|----------------------|-----------|----------------|
| Bakken  | 1350             | 2850                 | Phoenix   | 700            |
| Eagle   | 1500             | 3200                 | Discovery | 900            |
| Permian | 3350             | 6550                 | Triumph   | 950            |

Here are some examples of other interesting things we can do.

  • Production Trends Over Time: Analyze how production varies over different years and months to identify seasonal or long-term trends. This can help in forecasting production and planning maintenance. (There's a caveat on month ordering in the note after this list.)

    SELECT
        year,
        month,
        SUM(production) AS total_production
    FROM
        oil_well_production
    GROUP BY
        year, month
    ORDER BY
        year, month;
    
  • Operating Cost Efficiency: Calculate the cost efficiency of each well by comparing the operating cost to production. Identify wells that have a high production-to-cost ratio.

    SELECT
        well_name,
        SUM(production) AS total_production,
        SUM(operating_cost) AS total_operating_cost,
        SUM(production) / SUM(operating_cost) AS production_cost_ratio
    FROM
        oil_well_production
    GROUP BY
        well_name
    ORDER BY
        production_cost_ratio DESC;
    
  • Water Cut Impact: Study the relationship between the water cut percentage and production. Determine how the water cut affects the overall production and the economic viability of the well.

    SELECT
        well_name,
        AVG(water_cut) AS avg_water_cut,
        SUM(production) AS total_production
    FROM
        oil_well_production
    GROUP BY
        well_name
    ORDER BY
        avg_water_cut;
    
  • Gas-Oil Ratio and Quality: Examine how the gas-oil ratio (GOR) relates to production levels and the quality of the extracted oil. High GOR may affect oil quality and processing.

    SELECT
        well_name,
        AVG(gas_oil_ratio) AS avg_gas_oil_ratio,
        AVG(production) AS avg_production
    FROM
        oil_well_production
    GROUP BY
        well_name
    ORDER BY
        avg_gas_oil_ratio DESC;
    
  • Regional Performance: Analyze the production and cost variations between different basins or regions. Identify the most productive and cost-effective regions for further investment.

    SELECT
        basin,
        AVG(production) AS avg_production,
        AVG(operating_cost) AS avg_operating_cost
    FROM
        oil_well_production
    GROUP BY
        basin
    ORDER BY
        avg_production DESC;
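
One caveat on the trends query above: because month is stored as a name ('Jan', 'Feb', ...), ORDER BY month sorts alphabetically rather than chronologically. A sketch of one workaround, assuming three-letter month abbreviations:

SELECT
    year,
    month,
    SUM(production) AS total_production
FROM
    oil_well_production
GROUP BY
    year, month
ORDER BY
    year,
    -- Map the month name to its calendar position for chronological ordering
    DECODE(month, 'Jan', 1, 'Feb', 2, 'Mar', 3, 'Apr', 4, 'May', 5, 'Jun', 6,
                  'Jul', 7, 'Aug', 8, 'Sep', 9, 'Oct', 10, 'Nov', 11, 'Dec', 12);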
    

We're able to drive a lot of value with plain SELECT statements. Now let's get funky with some more complex querying.

Looking Deeper

Let's assume we also have another table in our data warehouse with truck driver activity, and we'd like to see how drivers perform per location. Assume the schema is as follows:

CREATE OR REPLACE TABLE driver_data (
    location_id INT,
    truck_driver_id INT,
    dumping_date DATE,
    sand_tonnage DECIMAL(10, 2),
    dumping_time TIME,
    dumping_location VARCHAR(100),
    weather_conditions VARCHAR(50),
    sand_source VARCHAR(100)
);

Then we can cross-analyze this table with our oil well production data.

  • Analyzing Sand Dumping Locations and Oil Wells by Truck Driver: This will provide an overview of the sand-dumping locations, the total sand tonnage dumped, and the oil wells serviced by each truck driver.

    WITH TruckDriverSummary AS (
        SELECT
            sdd.truck_driver_id,
            sdd.dumping_location,
            SUM(sdd.sand_tonnage) AS total_sand_tonnage,
            ARRAY_AGG(DISTINCT owp.well_name) AS serviced_wells
        FROM
            driver_data AS sdd
        LEFT JOIN
            oil_well_production AS owp
        ON
            sdd.location_id = owp.well_id
            AND EXTRACT(YEAR FROM sdd.dumping_date) = owp.year
        WHERE
            EXTRACT(YEAR FROM sdd.dumping_date) = 2022
        GROUP BY
            sdd.truck_driver_id, sdd.dumping_location
    )
    SELECT
        tds.truck_driver_id,
        tds.dumping_location,
        tds.total_sand_tonnage,
        tds.serviced_wells
    FROM
        TruckDriverSummary AS tds
    ORDER BY
        tds.truck_driver_id, tds.dumping_location;
    
  • Correlating Tonnage and Weather: We can also check the correlation between the tonnage of sand dumped and weather conditions during the dumping process.

    SELECT
        weather_conditions,
        AVG(sand_tonnage) AS avg_sand_tonnage
    FROM
        driver_data
    WHERE
        EXTRACT(YEAR FROM dumping_date) = 2022
    GROUP BY
        weather_conditions
    ORDER BY
        avg_sand_tonnage DESC;
    

Great! We delved into production trends and operating efficiency to answer a ton (pun intended) of critical questions, helping the company optimize performance and allocate resources effectively.

Safeguarding Data

Regulatory compliance and auditing are integral to financial operations. By granting authorized teams direct data access, organizations can maintain comprehensive records and demonstrate transparency and compliance with greater ease. Let's assume for our fictional company that the accounting team is responsible for this reporting. Ultimately, how much control you hand over depends on your ability to backfill, your pipeline retry strategy, and your organizational processes.

For example, we can allow them to retrieve data without modifying it.

GRANT SELECT ON oil_well_production TO ROLE accounting;
GRANT SELECT ON driver_data TO ROLE accounting;

We'll also need to grant USAGE on the schema containing these tables so the accounting role can see the objects within it.

GRANT USAGE ON SCHEMA your_schema TO ROLE accounting;

Depending on your specific use case, you may need to grant additional privileges such as INSERT, UPDATE, or DELETE on specific tables, or USAGE on specific stored procedures.

GRANT INSERT ON driver_data TO ROLE accounting;

And finally, you can grant access at the schema level so the role automatically picks up future tables as well.

GRANT SELECT ON FUTURE TABLES IN SCHEMA your_schema TO ROLE accounting;
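
For the auditing side, it also helps to be able to show exactly what the accounting role can touch and who holds it:

-- List every privilege currently granted to the accounting role.
SHOW GRANTS TO ROLE accounting;

-- List the users (and roles) the accounting role has been granted to.
SHOW GRANTS OF ROLE accounting;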

Closing Thoughts

We just harnessed the power of simple SQL and expanded datasets to uncover valuable insights and empower decision-makers with actionable knowledge. Our exploration has showcased the vital role that data plays in steering the industry toward optimized operations, cost-efficiency, and strategic growth.

Whether it's enabling accounting teams to make informed financial decisions or helping field operators fine-tune well productivity, the data we've dissected becomes a compass for smarter, more responsive decision-making in a data-rich landscape where every bit and byte holds the potential to fuel progress.

I'll be making a separate post going over how to document processes like these so that future changes are easily applied to new pipelines and datasets.