Analyzing Oil and Gas Production Growth

Authors
  • Samad Ahmed

The oil and gas industry is a complex, dynamic sector where efficient data analysis can lead to valuable insights. I've worked across several domains in this industry, from customer-facing web and mobile apps to internal tools that help accounting and financial teams analyze their data, all of which involved heavy SQL and data engineering.

In this guide, we'll dive into using SQL to unlock insights from well production data and transform these insights into strategies for revenue growth.

(Image: a wellsite)

Situation At Hand

Currently, let's assume that "analytics" at the company is driven primarily by a shared CSV passed around in a weekly email to its intended recipients. The goal is for this data to become accessible within an internal reporting application, with role-based access granted to the right teams. Today, the data is assembled manually through a joint effort between the accounting and business teams. Our job is to automate this process and modernize the company's analytics workflow so it can move toward becoming a data-driven organization.

Understanding Our Data

Assume we have data resembling the following (the shared CSV being passed around), which provides detailed information about multiple oil wells: their unique identifiers, the basins they are located in, production for specific years and months, operating costs, water cut percentages, and gas-oil ratios.

Table: Multibasin Oil Well Production Data

| Well ID | Well Name | Well Description | Basin   | Year | Month | Production | Operating Cost | Water Cut (%) | Gas-Oil Ratio |
|---------|-----------|------------------|---------|------|-------|------------|----------------|---------------|---------------|
| 1001    | Vitalis   | Primary          | Permian | 2022 | Jan   | 750        | 1500           | 10            | 1500          |
| 1001    | Vitalis   | Primary          | Permian | 2022 | Feb   | 800        | 1600           | 11            | 1600          |
| 1002    | Phoenix   | Secondary        | Bakken  | 2022 | Jan   | 650        | 1400           | 12            | 1400          |
| 1002    | Phoenix   | Secondary        | Bakken  | 2022 | Feb   | 700        | 1450           | 10            | 1450          |
| 1003    | Discovery | Exploration      | Eagle   | 2022 | Jan   | 600        | 1550           | 13            | 1550          |
| 1003    | Discovery | Exploration      | Eagle   | 2022 | Feb   | 900        | 1650           | 9             | 1650          |
| 1004    | Triumph   | Primary          | Permian | 2022 | Jan   | 850        | 1700           | 10            | 1700          |
| 1004    | Triumph   | Primary          | Permian | 2022 | Feb   | 950        | 1750           | 11            | 1750          |

We're expecting to fully migrate multiple sources so that authorized users can self-serve their own analytics. In this case, we'll need a data warehousing solution that is both scalable and flexible. Snowflake is a reasonable candidate for a multitude of reasons, especially from a cost perspective, because compute and storage are priced separately. This means queries against smaller tables cost less, and cost scales with usage rather than being fixed.

Aside from cost, Snowflake offers quite an advantage from a security, governance, and compliance perspective. From a control standpoint, Snowflake combines both discretionary access control (DAC) and role-based access control (RBAC).

  • DAC: each object has an owner who can then grant access to that object

  • RBAC: access privileges are assigned to roles, which are then assigned to users

This suits us well because we can grant access not only to the data, but also to the reports themselves, which will be built within Snowflake to spare our end users from having to learn multiple tools.
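
As a minimal sketch of how the two models combine in practice (the role, database, schema, and user names here are made up for illustration):

-- RBAC: privileges are granted to a role, and the role is granted to users.
CREATE ROLE IF NOT EXISTS reporting_analyst;
GRANT USAGE ON DATABASE analytics TO ROLE reporting_analyst;
GRANT USAGE ON SCHEMA analytics.production TO ROLE reporting_analyst;
GRANT ROLE reporting_analyst TO USER jdoe;

-- DAC: every object has an owning role that controls who else gets access,
-- and ownership itself can be handed to another role.
GRANT OWNERSHIP ON SCHEMA analytics.production TO ROLE data_engineering COPY CURRENT GRANTS;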

Ingesting Your Data

Let's start by creating a table called oil_well_production. Since each well reports one row per month, the natural key is the combination of well_id, year, and month, so we'll declare a composite primary key on those columns to prevent duplicate entries for the same well in the same month. (Keep in mind that Snowflake treats PRIMARY KEY and UNIQUE constraints as informational rather than enforced.)

CREATE OR REPLACE TABLE oil_well_production (
    well_id INT,
    well_name STRING,
    well_description STRING,
    basin STRING,
    year INT,
    month STRING,
    production INT,
    operating_cost INT,
    water_cut INT,
    gas_oil_ratio INT,
    PRIMARY KEY (well_id, year, month) -- One row per well per month (not enforced by Snowflake)
);
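
Because that constraint is informational only, a quick duplicate check is worth keeping around; a minimal sketch:

-- Spot any well/month combinations that were loaded more than once.
SELECT
    well_id,
    year,
    month,
    COUNT(*) AS row_count
FROM
    oil_well_production
GROUP BY
    well_id, year, month
HAVING
    COUNT(*) > 1;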

Our intention is to use this table as our "clean" table, meaning our data should be parsed and transformed prior to load. The CSVs in use today are created by external processes that are independent of the engineering team. However, let's assume the team responsible for creating the CSVs has a load process set up to dump these files into a raw S3 bucket for auditing purposes.

We can load these files into Snowflake as they are dumped by creating an external stage that points at the S3 bucket and ingesting from it with Snowpipe.

CREATE OR REPLACE STAGE s3_well_production_bucket
URL = 's3://your-s3-bucket/path/to/csvs/'
CREDENTIALS = (
    AWS_KEY_ID = 'your-access-key-id'
    AWS_SECRET_KEY = 'your-secret-access-key'
);
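
Embedding AWS keys directly in the stage works for a quick start, but Snowflake also supports storage integrations, which delegate authentication to an IAM role instead of long-lived keys. A sketch of that alternative (the integration name and role ARN are placeholders):

CREATE OR REPLACE STORAGE INTEGRATION s3_well_production_int
  TYPE = EXTERNAL_STAGE
  STORAGE_PROVIDER = 'S3'
  ENABLED = TRUE
  STORAGE_AWS_ROLE_ARN = 'arn:aws:iam::<account-id>:role/snowflake-access-role' -- placeholder ARN
  STORAGE_ALLOWED_LOCATIONS = ('s3://your-s3-bucket/path/to/csvs/');

CREATE OR REPLACE STAGE s3_well_production_bucket
URL = 's3://your-s3-bucket/path/to/csvs/'
STORAGE_INTEGRATION = s3_well_production_int;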

Now you can set up a Snowpipe that monitors your external stage for new files and automatically loads them into the table, as long as the CSV columns line up with the table's columns.

CREATE OR REPLACE PIPE csv_ingestion
AUTO_INGEST = TRUE -- Enables automatic ingestion
AS
COPY INTO oil_well_production
FROM @s3_well_production_bucket
FILE_FORMAT = (TYPE = CSV SKIP_HEADER = 1) -- Assumes the CSVs include a header row
ON_ERROR = 'CONTINUE';
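
AUTO_INGEST relies on S3 event notifications, so after creating the pipe you'd typically grab its notification channel ARN, point the bucket's event notifications at it, and then confirm files are flowing:

-- The notification_channel column contains the SQS ARN that the S3 bucket's
-- event notifications should target.
DESC PIPE csv_ingestion;

-- Check that the pipe is running and whether any files are pending.
SELECT SYSTEM$PIPE_STATUS('csv_ingestion');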

Querying Your Data

You've started the process to continuously ingest your CSV data into Snowflake for analytics. Now we can actually run queries against it and start creating value for your teams. When building, always go for the low-hanging fruit and deliver value as fast as you can. It not only serves as a stepping stone, but gives your stakeholders confidence in your process.

Here is one example of an easily achievable query that brings value.

-- Calculate the total production and operating costs per basin for the year 2022
WITH BasinSummary AS (
    SELECT
        basin,
        SUM(production) AS total_production,
        SUM(operating_cost) AS total_operating_cost
    FROM
        oil_well_production
    WHERE
        year = 2022
    GROUP BY
        basin
),

-- Find the top-performing well in each basin
TopWells AS (
    SELECT
        basin,
        well_name,
        MAX(production) AS max_production
    FROM
        oil_well_production
    WHERE
        year = 2022
    GROUP BY
        basin, well_name
    QUALIFY
        ROW_NUMBER() OVER (PARTITION BY basin ORDER BY MAX(production) DESC) = 1 -- Keep only the top well per basin
)

-- Combine the results
SELECT
    bp.basin,
    bp.total_production,
    bp.total_operating_cost,
    tw.well_name,
    tw.max_production
FROM
    BasinSummary AS bp
JOIN
    TopWells AS tw
ON
    bp.basin = tw.basin
ORDER BY
    bp.basin;

In this query, we first calculate the total production and operating costs per basin for the year 2022 using a common table expression (CTE) named BasinSummary. Then, in a second CTE named TopWells, we compute each well's highest monthly production for the same year and use QUALIFY to keep only the top well per basin. Finally, we join the two CTEs on basin to pair each basin's totals with its top-performing well.

This would output the following:

Table: Basin Production Summary for 2022

| BASIN   | TOTAL_PRODUCTION | TOTAL_OPERATING_COST | WELL_NAME | MAX_PRODUCTION |
|---------|------------------|----------------------|-----------|----------------|
| Bakken  | 1350             | 2850                 | Phoenix   | 700            |
| Eagle   | 1500             | 3200                 | Discovery | 900            |
| Permian | 3350             | 6550                 | Triumph   | 950            |

Here are some examples of other interesting things we can do.

  • Production Trends Over Time: Analyze how production varies over different years and months to identify seasonal or long-term trends. This can help in forecasting production and planning maintenance. (There's a caveat on month ordering in the note after this list.)

    SELECT
        year,
        month,
        SUM(production) AS total_production
    FROM
        oil_well_production
    GROUP BY
        year, month
    ORDER BY
        year, month;
    
  • Operating Cost Efficiency: Calculate the cost efficiency of each well by comparing the operating cost to production. Identify wells that have a high production-to-cost ratio.

    SELECT
        well_name,
        SUM(production) AS total_production,
        SUM(operating_cost) AS total_operating_cost,
        SUM(production) / SUM(operating_cost) AS production_cost_ratio
    FROM
        oil_well_production
    GROUP BY
        well_name
    ORDER BY
        production_cost_ratio DESC;
    
  • Water Cut Impact: Study the relationship between the water cut percentage and production. Determine how the water cut affects the overall production and the economic viability of the well.

    SELECT
        well_name,
        AVG(water_cut) AS avg_water_cut,
        SUM(production) AS total_production
    FROM
        oil_well_production
    GROUP BY
        well_name
    ORDER BY
        avg_water_cut;
    
  • Gas-Oil Ratio and Quality: Examine how the gas-oil ratio (GOR) relates to production levels and the quality of the extracted oil. High GOR may affect oil quality and processing.

    SELECT
        well_name,
        AVG(gas_oil_ratio) AS avg_gas_oil_ratio,
        AVG(production) AS avg_production
    FROM
        oil_well_production
    GROUP BY
        well_name
    ORDER BY
        avg_gas_oil_ratio DESC;
    
  • Regional Performance: Analyze the production and cost variations between different basins or regions. Identify the most productive and cost-effective regions for further investment.

    SELECT
        basin,
        AVG(production) AS avg_production,
        AVG(operating_cost) AS avg_operating_cost
    FROM
        oil_well_production
    GROUP BY
        basin
    ORDER BY
        avg_production DESC;
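
One caveat on the trends query above: because month is stored as a name ('Jan', 'Feb', ...), ORDER BY month sorts alphabetically rather than chronologically. A sketch of one workaround, assuming three-letter month abbreviations:

SELECT
    year,
    month,
    SUM(production) AS total_production
FROM
    oil_well_production
GROUP BY
    year, month
ORDER BY
    year,
    -- Map the month name to its calendar position for chronological ordering
    DECODE(month, 'Jan', 1, 'Feb', 2, 'Mar', 3, 'Apr', 4, 'May', 5, 'Jun', 6,
                  'Jul', 7, 'Aug', 8, 'Sep', 9, 'Oct', 10, 'Nov', 11, 'Dec', 12);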
    

We're able to drive a lot of value with plain SELECT statements. Now let's get funky with some more complex querying.

Looking Deeper

Let's assume we also have another table in our data warehouse with truck driver activity, and we'd like to see how drivers perform per location. Assume the schema is as follows:

CREATE OR REPLACE TABLE driver_data (
    location_id INT,
    truck_driver_id INT,
    dumping_date DATE,
    sand_tonnage DECIMAL(10, 2),
    dumping_time TIME,
    dumping_location VARCHAR(100),
    weather_conditions VARCHAR(50),
    sand_source VARCHAR(100)
);

Then we can cross-analyze this table with our oil well production data.

  • Analyzing Sand Dumping Locations and Oil Wells by Truck Driver: This will provide an overview of the sand-dumping locations, the total sand tonnage dumped, and the oil wells serviced by each truck driver.

    WITH TruckDriverSummary AS (
        SELECT
            sdd.truck_driver_id,
            sdd.dumping_location,
            SUM(sdd.sand_tonnage) AS total_sand_tonnage,
            ARRAY_AGG(DISTINCT owp.well_name) AS serviced_wells
        FROM
            driver_data AS sdd
        LEFT JOIN
            oil_well_production AS owp
        ON
            sdd.location_id = owp.well_id
            AND EXTRACT(YEAR FROM sdd.dumping_date) = owp.year
        WHERE
            EXTRACT(YEAR FROM sdd.dumping_date) = 2022
        GROUP BY
            sdd.truck_driver_id, sdd.dumping_location
    )
    SELECT
        tds.truck_driver_id,
        tds.dumping_location,
        tds.total_sand_tonnage,
        tds.serviced_wells
    FROM
        TruckDriverSummary AS tds
    ORDER BY
        tds.truck_driver_id, tds.dumping_location;
    
  • Correlating Tonnage and Weather: We can also check the correlation between the tonnage of sand dumped and weather conditions during the dumping process.

    SELECT
        weather_conditions,
        AVG(sand_tonnage) AS avg_sand_tonnage
    FROM
        driver_data
    WHERE
        EXTRACT(YEAR FROM dumping_date) = 2022
    GROUP BY
        weather_conditions
    ORDER BY
        avg_sand_tonnage DESC;
    

Great! We delved into production trends and operating efficiency to answer a ton (pun intended) of critical questions, helping the company optimize performance and allocate resources effectively.

Safeguarding Data

Regulatory compliance and auditing are integral to financial operations. By granting authorized teams direct data access, organizations can maintain comprehensive records and demonstrate transparency and compliance with greater ease. Let's assume for our fictional company that the accounting team is responsible for this reporting. Ultimately, how much control you hand over depends on your ability to backfill, your pipeline retry strategy, and your organizational processes.

For example, we can allow them to retrieve data without modifying it.

GRANT SELECT ON oil_well_production TO ROLE accounting;
GRANT SELECT ON driver_data TO ROLE accounting;

We'll also need to grant USAGE on the schema containing these tables so the accounting role can see the objects within it.

GRANT USAGE ON SCHEMA your_schema TO ROLE accounting;

Depending on your specific use case, you may need to grant additional privileges such as INSERT, UPDATE, or DELETE on specific tables, or USAGE on specific stored procedures.

GRANT INSERT ON driver_data TO ROLE accounting;

And finally, you can grant access at the schema level so the role automatically picks up future tables as well.

GRANT SELECT ON FUTURE TABLES IN SCHEMA your_schema TO ROLE accounting;
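
For the auditing side, it also helps to be able to show exactly what the accounting role can touch and who holds it:

-- List every privilege currently granted to the accounting role.
SHOW GRANTS TO ROLE accounting;

-- List the users (and roles) the accounting role has been granted to.
SHOW GRANTS OF ROLE accounting;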

Closing Thoughts

We just harnessed the power of simple SQL and expanded datasets to uncover valuable insights and empower decision-makers with actionable knowledge. Our exploration has showcased the vital role that data plays in steering the industry toward optimized operations, cost-efficiency, and strategic growth.

Whether it's enabling accounting teams to make informed financial decisions or helping field operators fine-tune well productivity, the data we've dissected becomes a compass for smarter, more responsive decision-making in a data-rich landscape where every bit and byte holds the potential to fuel progress.

I'll be making a separate post going over how to document processes like these so that future changes are easily applied to new pipelines and datasets.