Analyzing Oil and Gas Production Growth
By Samad Ahmed
The oil and gas industry is a complex, dynamic sector where efficient data analysis can lead to valuable insights. I've worked in a variety of domains surrounding this industry, from customer-facing web and mobile apps to internal tools that accounting and financial teams use to analyze their data, all of which involved heavy SQL and data engineering.
In this guide, we'll dive into using SQL to unlock insights from well production data and transform these insights into strategies for revenue growth.
Situation At Hand
Let's assume that "analytics" at the company is currently driven by a shared CSV that is passed around in a weekly email to its intended recipients. The requirement is that this data must now be accessible within an internal reporting application, with role-based access controls. The data is also assembled manually through a joint effort between the accounting and business teams. Our job is to automate this process and modernize the business analytics workflow so the company can move toward its goal of becoming a data-driven organization.
Understanding Our Data
Assume the shared CSV being passed around resembles the table below. It provides detailed information about multiple oil wells, including their unique identifiers, the basins they are located in, production figures for specific years and months, operating costs, water cut percentages, and gas-oil ratios.
Table: Multibasin Oil Well Production Data
Well ID | Well Name | Well Description | Basin | Year | Month | Production | Operating Cost | Water Cut (%) | Gas-Oil Ratio |
---|---|---|---|---|---|---|---|---|---|
1001 | Vitalis | Primary | Permian | 2022 | Jan | 750 | 1500 | 10 | 1500 |
1001 | Vitalis | Primary | Permian | 2022 | Feb | 800 | 1600 | 11 | 1600 |
1002 | Phoenix | Secondary | Bakken | 2022 | Jan | 650 | 1400 | 12 | 1400 |
1002 | Phoenix | Secondary | Bakken | 2022 | Feb | 700 | 1450 | 10 | 1450 |
1003 | Discovery | Exploration | Eagle | 2022 | Jan | 600 | 1550 | 13 | 1550 |
1003 | Discovery | Exploration | Eagle | 2022 | Feb | 900 | 1650 | 9 | 1650 |
1004 | Triumph | Primary | Permian | 2022 | Jan | 850 | 1700 | 10 | 1700 |
1004 | Triumph | Primary | Permian | 2022 | Feb | 950 | 1750 | 11 | 1750 |
We're expecting to fully migrate multiple sources so that authorized users can self-serve their own analytics. For that, we'll need a data warehousing solution that is both scalable and flexible. Snowflake is a reasonable candidate for a multitude of reasons, especially from a cost perspective, because compute and storage are priced separately. Costs scale with actual usage, so queries against smaller tables cost less and there is no large fixed cost to get started.
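Compute is billed only while a virtual warehouse is running, so keeping warehouses small and quick to auto-suspend is the main cost lever. As a minimal sketch (the warehouse name and sizing below are placeholders, not a recommendation):
-- Hypothetical warehouse for the reporting workload; suspends after 60 seconds of idle time
CREATE WAREHOUSE IF NOT EXISTS reporting_wh
WITH WAREHOUSE_SIZE = 'XSMALL'
AUTO_SUSPEND = 60
AUTO_RESUME = TRUE;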
Aside from cost, Snowflake offers quite an advantage from a security, governance, and compliance perspective. From a control standpoint, Snowflake combines both discretionary access control (DAC) and role-based access control (RBAC).
DAC: each object has an owner who can then grant access to that object
RBAC: access privileges are assigned to roles, which are then assigned to users
This combination works well for us: we can grant access not only to the data, but also to the reports themselves, which will be built within Snowflake to spare our end users from having to learn multiple tools.
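As a minimal sketch of the RBAC side (the role and user names here are placeholders), we could create a role for the accounting team and grant it to a user:
-- Create a role for the accounting team and assign it to an example user
CREATE ROLE IF NOT EXISTS accounting;
GRANT ROLE accounting TO USER jdoe;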
Ingesting Your Data
Let's start by creating a table called oil_well_production with a primary key on well_id and a unique constraint on the combination of well_id, year, and month to capture that there should be at most one row per well per month. Note that Snowflake treats PRIMARY KEY and UNIQUE constraints as informational metadata rather than enforcing them, so duplicate prevention ultimately has to happen during loading.
CREATE OR REPLACE TABLE oil_well_production (
well_id INT,
well_name STRING,
well_description STRING,
basin STRING,
year INT,
month STRING,
production INT,
operating_cost INT,
water_cut INT,
gas_oil_ratio INT,
PRIMARY KEY (well_id), -- Informational: well_id should be unique
UNIQUE (well_id, year, month) -- Informational: one row per well per month
);
Our intention is to use this table as our "clean" table, meaning the data should be parsed and transformed prior to load. The CSVs the team uses today are created by external processes that are independent of the engineering team. However, let's assume the team responsible for creating the CSVs already has a process set up to dump these files into a raw S3 bucket for auditing purposes.
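If duplicates do slip in, one pattern (sketched here under the assumption of a hypothetical staging table named oil_well_production_raw) is to land raw files in that staging table and upsert into the clean table with a MERGE keyed on well and month:
-- Upsert from a hypothetical raw/staging table, keyed on well + month
MERGE INTO oil_well_production AS tgt
USING oil_well_production_raw AS src
ON tgt.well_id = src.well_id
AND tgt.year = src.year
AND tgt.month = src.month
WHEN MATCHED THEN UPDATE SET
    tgt.production = src.production,
    tgt.operating_cost = src.operating_cost,
    tgt.water_cut = src.water_cut,
    tgt.gas_oil_ratio = src.gas_oil_ratio
WHEN NOT MATCHED THEN INSERT
    (well_id, well_name, well_description, basin, year, month, production, operating_cost, water_cut, gas_oil_ratio)
VALUES
    (src.well_id, src.well_name, src.well_description, src.basin, src.year, src.month, src.production, src.operating_cost, src.water_cut, src.gas_oil_ratio);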
We can load these files into Snowflake as they land by creating an external stage that points at the S3 bucket and ingesting from it with Snowpipe.
CREATE OR REPLACE STAGE s3_well_production_bucket
URL = 's3://your-s3-bucket/path/to/csvs/'
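-- Note: inline credentials are shown for brevity; in practice, a Snowflake storage integration is preferable to embedding keys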
CREDENTIALS = (
AWS_KEY_ID = 'your-access-key-id'
AWS_SECRET_KEY = 'your-secret-access-key'
);
Now you can set up a Snowpipe that monitors the external stage for new files and automatically loads them into the table, as long as the CSV columns line up with the table's columns.
CREATE OR REPLACE PIPE csv_ingestion
AUTO_INGEST = TRUE -- Enables automatic ingestion when new files arrive
AS
COPY INTO oil_well_production
FROM @s3_well_production_bucket
FILE_FORMAT = (TYPE = CSV SKIP_HEADER = 1) -- Skip the header row in each CSV
ON_ERROR = 'CONTINUE';
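Two operational details worth noting: AUTO_INGEST relies on S3 event notifications being wired to the pipe's notification channel (visible via SHOW PIPES), and Snowpipe only picks up files that land after the pipe exists. Files already sitting in the bucket can be backfilled manually:
-- Backfill files that were staged before the pipe was created
ALTER PIPE csv_ingestion REFRESH;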
Querying Your Data
You've now set up continuous ingestion of your CSV data into Snowflake for analytics. We can actually run queries against it and start creating value for your teams. When building, always go for the low-hanging fruit and deliver value as quickly as you can. It not only serves as a stepping stone, but gives your stakeholders confidence in your process.
Here is one example of an easily achievable query that brings value.
-- Calculate the total production and operating costs per basin for the year 2022
WITH BasinSummary AS (
    SELECT
        basin,
        SUM(production) AS total_production,
        SUM(operating_cost) AS total_operating_cost
    FROM
        oil_well_production
    WHERE
        year = 2022
    GROUP BY
        basin
),
-- Find the top-performing well in each basin (highest single-month production)
TopWells AS (
    SELECT
        basin,
        well_name,
        MAX(production) AS max_production
    FROM
        oil_well_production
    WHERE
        year = 2022
    GROUP BY
        basin, well_name
    QUALIFY
        ROW_NUMBER() OVER (PARTITION BY basin ORDER BY MAX(production) DESC) = 1
)
-- Combine the results
SELECT
    bs.basin,
    bs.total_production,
    bs.total_operating_cost,
    tw.well_name,
    tw.max_production
FROM
    BasinSummary AS bs
JOIN
    TopWells AS tw
ON
    bs.basin = tw.basin
ORDER BY
    bs.basin;
In this query, we first calculate the total production and operating costs per basin for the year 2022 using a common table expression (CTE) named BasinSummary. Then we find the top-performing well in each basin for the same year, keeping only the well with the highest single-month production per basin, in a CTE named TopWells. Finally, we join the two on basin to show each basin's top well alongside its totals for the year.
This would output the following:
Table: Basin Production Summary for 2022
BASIN | TOTAL_PRODUCTION | TOTAL_OPERATING_COST | WELL_NAME | MAX_PRODUCTION |
---|---|---|---|---|
Bakken | 1350 | 2850 | Phoenix | 700 |
Eagle | 1500 | 3200 | Discovery | 900 |
Permian | 3350 | 6550 | Triumph | 950 |
Here are some examples of other interesting things we can do.
Production Trends Over Time: Analyze how production varies over different years and months to identify seasonal or long-term trends. This can help in forecasting production and planning maintenance.
SELECT year, month, SUM(production) AS total_production FROM oil_well_production GROUP BY year, month ORDER BY year, month;
Operating Cost Efficiency: Calculate the cost efficiency of each well by comparing the operating cost to production. Identify wells that have a high production-to-cost ratio.
SELECT well_name, SUM(production) AS total_production, SUM(operating_cost) AS total_operating_cost, SUM(production) / SUM(operating_cost) AS production_cost_ratio FROM oil_well_production GROUP BY well_name ORDER BY production_cost_ratio DESC;
Water Cut Impact: Study the relationship between the water cut percentage and production. Determine how the water cut affects the overall production and the economic viability of the well.
SELECT well_name, AVG(water_cut) AS avg_water_cut, SUM(production) AS total_production FROM oil_well_production GROUP BY well_name ORDER BY avg_water_cut;
Gas-Oil Ratio and Quality: Examine how the gas-oil ratio (GOR) relates to production levels and the quality of the extracted oil. High GOR may affect oil quality and processing.
SELECT well_name, AVG(gas_oil_ratio) AS avg_gas_oil_ratio, AVG(production) AS avg_production FROM oil_well_production GROUP BY well_name ORDER BY avg_gas_oil_ratio DESC;
Regional Performance: Analyze the production and cost variations between different basins or regions. Identify the most productive and cost-effective regions for further investment.
SELECT basin, AVG(production) AS avg_production, AVG(operating_cost) AS avg_operating_cost FROM oil_well_production GROUP BY basin ORDER BY avg_production DESC;
We're able to drive a lot of value with simple SELECT statements alone. Now let's get funky with some more complex querying.
Looking Deeper
Let's assume we also have another table in our data warehouse covering truck drivers, and we'd like to see how well drivers perform per location. Assume the schema looks like this:
CREATE OR REPLACE TABLE driver_data (
location_id INT,
truck_driver_id INT,
dumping_date DATE,
sand_tonnage DECIMAL(10, 2),
dumping_time TIME,
dumping_location VARCHAR(100),
weather_conditions VARCHAR(50),
sand_source VARCHAR(100)
);
Then we can cross analyze this table with our oil production data.
Analyzing Sand Dumping Locations and Oil Wells by Truck Driver: This will provide an overview of the sand-dumping locations, the total sand tonnage dumped, and the oil wells serviced by each truck driver.
WITH TruckDriverSummary AS (
    SELECT
        sdd.truck_driver_id,
        sdd.dumping_location,
        SUM(sdd.sand_tonnage) AS total_sand_tonnage,
        ARRAY_AGG(DISTINCT owp.well_name) AS serviced_wells
    FROM
        driver_data AS sdd
    LEFT JOIN
        oil_well_production AS owp
    ON
        sdd.location_id = owp.well_id
        AND EXTRACT(YEAR FROM sdd.dumping_date) = owp.year
    WHERE
        EXTRACT(YEAR FROM sdd.dumping_date) = 2022
    GROUP BY
        sdd.truck_driver_id, sdd.dumping_location
)
SELECT
    tds.truck_driver_id,
    tds.dumping_location,
    tds.total_sand_tonnage,
    tds.serviced_wells
FROM
    TruckDriverSummary AS tds
ORDER BY
    tds.truck_driver_id, tds.dumping_location;
Correlating Tonnage and Weather: We can also check the correlation between the tonnage of sand dumped and weather conditions during the dumping process.
SELECT weather_conditions, AVG(sand_tonnage) AS avg_sand_tonnage FROM driver_data WHERE EXTRACT(YEAR FROM dumping_date) = 2022 GROUP BY weather_conditions ORDER BY avg_sand_tonnage DESC;
Great! We delved into production trends and operating efficiency to answer a ton (pun intended) of critical questions, helping the company optimize performance and allocate resources effectively.
Safeguarding Data
Regulatory compliance and auditing are integral to financial operations. By granting authorized teams direct data access, organizations can maintain comprehensive records and demonstrate transparency and compliance with greater ease. Let's assume that at our fictional company the accounting team is responsible for this reporting. Ultimately, how much control you hand over depends on your ability to backfill data, retry pipelines, and your broader organizational processes.
For example, we can allow them to retrieve data without modifying it.
GRANT SELECT ON oil_well_production TO ROLE accounting;
GRANT SELECT ON driver_data TO ROLE accounting;
We also need to grant USAGE on the schema so the accounting role can see the objects within it.
GRANT USAGE ON SCHEMA your_schema TO ROLE accounting;
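The role also needs USAGE on the parent database, plus a warehouse to actually run queries; the names below are placeholders (reporting_wh is the warehouse sketched earlier):
GRANT USAGE ON DATABASE your_database TO ROLE accounting;
GRANT USAGE ON WAREHOUSE reporting_wh TO ROLE accounting;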
Depending on your specific use case, you may need to grant additional privileges such as INSERT, UPDATE, or DELETE on tables, or USAGE on specific stored procedures.
GRANT INSERT ON driver_data TO ROLE accounting;
And finally, you can grant access at the schema level so that tables created in the future are covered automatically.
GRANT SELECT ON FUTURE TABLES IN SCHEMA your_schema TO ROLE accounting;
Closing Thoughts
We just harnessed simple SQL and an expanded set of data to uncover valuable insights and empower decision-makers with actionable knowledge. Our exploration has showcased the vital role that data plays in steering the industry toward optimized operations, cost-efficiency, and strategic growth.
Whether it's enabling accounting teams to make informed financial decisions or helping field operators fine-tune well productivity, the data we've dissected has become a compass for smarter, more responsive decision-making in a data-rich landscape where every bit and byte holds the potential to fuel progress.
I'll be making a separate post going over how to document processes like these so that future changes are easily applied to new pipelines and datasets.