Automating Tilray's weekly ETL using KNIME and SQL Server
Summary
Problem: The analytics team spent a lot of time cleaning the sales and inventory data that they received in Excel files every week.
Solution: I used KNIME to develop an ETL program to extract the data from the Excel files, transform the data, then load the transformed data into SQL Server.
Results: The analytics team can simply query data from SQL Server instead of wrangling data from thousands of Excel files.
KNIME
KNIME is a low-code data processing application where you can chain “nodes” to create programs. A node is like a function in a programming language.
Screenshot of KNIME.
Definitions
f
- An Excel file containing sales and inventory data for one week.
f.data
- The data table extracted from
f
. All columns inf.data
are the string data type. T
-
A sequence of data transformations. These include:
- Transforming column contents using regular expressions.
- Converting column data types.
T(f.data)
- The resulting table after applying
T
tof.data
. all_data
- A database table that will store
T(f.data)
for every Excel filef
. - match
- A table
A
matches a tableB
ifA
andB
have the same column names and data types.T(f.data)
can only be loaded intoall_data
ifT(f.data)
matchesall_data
. all_file_names
- A set of all the Excel file names.
loaded_file_names
- A set of the Excel file names
f.name
such thatT(f.data)
has been loaded intoall_data
.
Pseudocode of the ETL program (similar to Python)
Algorithm ETL_program(file_names): Input: file_names: A set of Excel file names. Output: A dictionary of (file name, error message) items. not_loaded_file_names = file_names − loaded_file_names. file_name_to_error_message = dict(). for each Excel file name f.name in not_loaded_file_names: f.data = Extract the data table from f. try: T(f.data) = Apply each transformation in T to f.data. Load T(f.data) into all_data. except Exception as error: file_name_to_error_message[f.name] = str(error). else: Add f.name to loaded_file_names. return file_name_to_error_message.
After executing file_name_to_error_message = ETL_program(all_file_names)
, I execute the following steps manually:
for f.name, error_message in file_name_to_error_message.items(): Use error_message to update the data transformations in T. Execute ETL_program(set(f.name)).
Example execution of the ETL program
T
:-
- Convert
date
to the date data type. - Convert
column_a
to the integer data type.
- Convert
1. Execute ETL_program(set(f_1.name))
:
T(f_1.data)
can't be loaded into all_data
because "missing"
in column_a
can't be converted to an integer.
2. Update T
:
T
:-
- Convert
date
to the date data type. - In
column_a
, replace all cells matching the regular expression“^missing$”
withnull
. - Convert
column_a
to the integer data type.
- Convert
3. Execute ETL_program(set(f_1.name))
:
After updating T
, T(f_1.data)
is loaded into all_data
.