Modules & functions¶
Cleaning Functions¶
Basic transformations for cleaning individual units of data
- etl_toolbox.cleaning_functions.FALSEY_INDICATORS = ['false', '0']¶
  A list of strings that are considered to be falsey. Used by clean_null().
- etl_toolbox.cleaning_functions.NULL_INDICATORS = ['', 'blank', 'blocked', 'empty', 'invalid', 'na', 'nan', 'nbsp', 'none', 'notavailable', 'notset', 'np', 'null', 'removed', 'unavailable', 'unidentified', 'unknown']¶
  A list of strings that are considered equivalent to None. Used by clean_null().
- etl_toolbox.cleaning_functions.clean_null(x, falsey_is_null=False, special_characters='')[source]¶
  Returns None if x is null-indicating, else returns x.

  x is considered null-indicating if any of the following are true:
  - x is None
  - len(x) == 0
  - fingerprint(x) in NULL_INDICATORS (see: NULL_INDICATORS)
  - falsey_is_null and not x
  - falsey_is_null and fingerprint(x) in FALSEY_INDICATORS (see: FALSEY_INDICATORS)
  - x is an iterable consisting of all null-indicating values
    - Ex: x == ['empty', None, {None}]
  - x evaluates as a Python literal that is null-indicating
    - Ex: x == '{None, None}'

  - Usage:
    >>> from etl_toolbox.cleaning_functions import clean_null
    >>> clean_null('Unknown') is None
    True
    >>> clean_null(['empty', None, {None}]) is None
    True
    >>> clean_null('false') is None
    False
    >>> clean_null('false', falsey_is_null=True) is None
    True
  Parameters:
  - x – The object to be evaluated. x can be any type, though this function is intended for use with strings, lists, and sets.

    Warning
    The behavior of clean_null() may not be intuitive for types other than string, list, and set. For example, a dict will be considered null-indicating if it is either empty or all of its keys are null-indicating. Dictionary values will be ignored.
  - falsey_is_null (boolean, optional) – Controls whether falsey objects are considered null-indicating.
    - If True, all falsey values will be considered null-indicating.
    - If False, falsey values will not be considered null-indicating (unless they are otherwise null-indicating regardless of falsiness).
    Default is False.

    Note
    A common use case for setting falsey_is_null to True is evaluating non-numeric data such as locations or phone numbers. In these data sets, 0 is not a meaningful value and is likely just functioning as a placeholder for null.
  - special_characters (string, optional) – A string of special characters to preserve while creating the fingerprint of x. See fingerprint() for more details.

  Returns: Returns None or x.
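The rules above can be sketched in a few lines of plain Python. This is an illustrative approximation only (hypothetical helper names, trimmed indicator sets, and no Python-literal-evaluation step), not the library's implementation:

```python
# Illustrative sketch of the null-indicating rules -- NOT the library's
# implementation. Indicator sets are trimmed, and the "evaluates as a
# Python literal" rule is omitted.
NULLISH = {'', 'empty', 'null', 'unknown', 'na', 'nan', 'none', 'notavailable'}
FALSEYISH = {'false', '0'}

def simple_fingerprint(x):
    # Lowercase, alphanumeric-only reduction (compare fingerprint()).
    return ''.join(c for c in str(x).lower() if c.isalnum())

def looks_null(x, falsey_is_null=False):
    if x is None:
        return True
    if falsey_is_null and not x:
        return True
    if isinstance(x, str):
        fp = simple_fingerprint(x)
        return fp in NULLISH or (falsey_is_null and fp in FALSEYISH)
    if isinstance(x, (list, tuple, set)):
        # Empty iterables and all-null-indicating iterables count as null.
        return all(looks_null(e, falsey_is_null) for e in x)
    return False
```

The recursive branch is what makes `['empty', None, {None}]` null-indicating: every element, at every depth, reduces to null.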
- etl_toolbox.cleaning_functions.clean_whitespace(x)[source]¶
  Returns x with whitespace characters trimmed and condensed.

  - Cleaning rules:
    - all whitespace characters replaced with standard ASCII space (32)
    - consecutive whitespace condensed
    - leading/trailing whitespace removed

  - Usage:
    >>> from etl_toolbox.cleaning_functions import clean_whitespace
    >>> clean_whitespace(''' 123  abc   456
    ...                      def   789   ''')
    '123 abc 456 def 789'

  Parameters: x – The string to be cleaned.

    Note
    If x is not a string, it will be returned unchanged.

  Returns: Returns the same type as x.

  Note
  clean_whitespace() does not remove unicode formatting characters without the White_Space character property:
  - U+180E
  - U+200B
  - U+200C
  - U+200D
  - U+2060
  - U+FEFF
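The three cleaning rules amount to splitting on Unicode whitespace and re-joining with single ASCII spaces. A one-line approximation (not the library's implementation, and it skips the non-string pass-through):

```python
def condense_whitespace(s):
    # str.split() with no argument splits on runs of any Unicode
    # White_Space character and drops leading/trailing runs, so
    # re-joining with ' ' applies all three cleaning rules at once.
    return ' '.join(s.split())
```

Zero-width characters such as U+200B lack the White_Space property, so they survive this treatment, matching the note above.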
- etl_toolbox.cleaning_functions.fingerprint(x, special_characters='')[source]¶
  Returns a lowercase, alphanumeric representation of x.

  - Usage:
    >>> from etl_toolbox.cleaning_functions import fingerprint
    >>> fingerprint('(Aa_Bb_Cc)')
    'aabbcc'

  Parameters:
  - x – The object to be fingerprinted. Will be cast to a string using str(x).
  - special_characters (string, optional) – A string of special characters to preserve while creating the fingerprint of x. Any special characters which are individually meaningful for the data should be included here.

    >>> fingerprint('(Aa_Bb_Cc)', special_characters='_')
    'aa_bb_cc'

    Note
    An example of where this might be useful is fingerprinting column labels on a data set. Some datasets use $ as a label for price/cost or # as a label for phone number or id number. In that case, special_characters should be set to '#$' so that those values are preserved.

  Returns: Returns a string.
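For intuition, the documented behavior can be approximated like this (an assumed sketch based on the examples above, not the actual implementation):

```python
def fingerprint_sketch(x, special_characters=''):
    # Lowercase str(x), then keep only alphanumeric characters plus
    # any explicitly preserved special characters.
    return ''.join(
        c for c in str(x).lower()
        if c.isalnum() or c in special_characters
    )
```

With `special_characters='#$'`, a label like `'Price ($)'` reduces to `'price$'`, preserving the meaningful symbol while discarding punctuation and case.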
DataFrame Functions¶
Functions for working with pandas.DataFrames
- etl_toolbox.dataframe_functions.dataframe_clean_null(df, empty_row_thresh=1, empty_column_thresh=1, falsey_is_null=False, special_characters='')[source]¶
  Cleans null values of a pandas.DataFrame and removes empty rows/columns.

  Warning
  This function is computationally intensive and might be slow on large DataFrames.

  - Usage:
    >>> import pandas as pd
    >>> from etl_toolbox.dataframe_functions import dataframe_clean_null
    >>> df = pd.DataFrame(
    ...     [
    ...         ["AAA", None, "111-111-1111", "empty"],
    ...         ["BAA", "baa@baa.com", "-", "-"],
    ...         ["CAA", "caa@caa.com", "notavailable", "..."],
    ...         ["DAA", "blocked", "444-444-4444", "null"]
    ...     ],
    ...     columns=["id", "email", "phone", "col4"]
    ... )
    >>> print(df)
        id        email         phone   col4
    0  AAA         None  111-111-1111  empty
    1  BAA  baa@baa.com             -      -
    2  CAA  caa@caa.com  notavailable    ...
    3  DAA      blocked  444-444-4444   null
    >>> dataframe_clean_null(df)
    >>> print(df)
        id        email         phone
    0  AAA          NaN  111-111-1111
    1  BAA  baa@baa.com           NaN
    2  CAA  caa@caa.com           NaN
    3  DAA          NaN  444-444-4444

  Parameters:
  - df – A pandas.DataFrame.
  - empty_row_thresh (int, optional) – The number of non-null values required for a row to be considered populated/non-empty. Default is 1. If set to 0, no rows will be removed.
  - empty_column_thresh (int, optional) – The number of non-null values required for a column to be considered populated/non-empty. Default is 1. If set to 0, no columns will be removed.

    Note
    Rows are dropped before columns, so this threshold is applied to the values that remain after rows are removed. If empty_column_thresh is greater than 1, the resulting df may contain rows with fewer populated cells than empty_row_thresh. However, it will never contain completely empty rows (unless empty_row_thresh is 0).
  - falsey_is_null (boolean, optional) – Controls whether falsey objects are considered null-indicating. See cleaning_functions.clean_null() for details. Default is False.
  - special_characters (string, optional) – A string of special characters to preserve while creating the fingerprints. See cleaning_functions.fingerprint() for details.

  Returns: Returns None. The df argument is mutated.
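The clean-then-drop sequence can be sketched with stock pandas operations. This is a hypothetical approximation (assumes pandas is available, uses a trimmed null-indicator set, and returns a cleaned copy instead of mutating df like the real function does):

```python
import numpy as np
import pandas as pd

# Trimmed-down stand-in for NULL_INDICATORS -- illustrative only.
NULLISH = {'', 'empty', 'null', 'notavailable', 'blocked', 'na', 'nan'}

def clean_null_sketch(df, empty_row_thresh=1, empty_column_thresh=1):
    # Sketch only: returns a cleaned copy rather than mutating df.
    def to_nan(v):
        if v is None:
            return np.nan
        fp = ''.join(c for c in str(v).lower() if c.isalnum())
        return np.nan if fp in NULLISH else v

    out = df.apply(lambda col: col.map(to_nan))
    if empty_row_thresh > 0:
        out = out.dropna(axis=0, thresh=empty_row_thresh)
    if empty_column_thresh > 0:
        out = out.dropna(axis=1, thresh=empty_column_thresh)
    return out

df = pd.DataFrame(
    [["AAA", None, "111-111-1111", "empty"],
     ["BAA", "baa@baa.com", "-", "-"],
     ["CAA", "caa@caa.com", "notavailable", "..."],
     ["DAA", "blocked", "444-444-4444", "null"]],
    columns=["id", "email", "phone", "col4"])

out = clean_null_sketch(df)  # col4 is entirely null-indicating and is dropped
```

`dropna(thresh=...)` counts non-null values per row/column, which maps directly onto the empty_row_thresh / empty_column_thresh semantics above.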
- etl_toolbox.dataframe_functions.find_column_labels(df, label_fingerprints, label_match_thresh=3, special_characters='')[source]¶
  Finds a row of column labels within a pandas.DataFrame based on a collection of expected label_fingerprints.

  df.columns is updated to contain the found column labels, and all rows before and including the column label row are removed. If the initial column labels of df fit the match criteria, df will be unchanged.

  Note
  All labels must be in the same row. This function will not attempt to correct translational errors.

  - Usage:
    >>> import pandas as pd
    >>> from etl_toolbox.dataframe_functions import find_column_labels
    >>> df = pd.DataFrame(
    ...     [
    ...         ["created by:", "etl-toolbox", "", "-"],
    ...         [ "2020-06-07", "---", "3 rows", "4 columns"],
    ...         [ "", "", "", "-"],
    ...         [ "email", "date", "phone", "id"],
    ...         ["aaa@aaa.com", "04mar14", "999-333-4444", "AAA"],
    ...         ["baa@baa.com", "05aug13", "111-222-3333", "BAA"],
    ...         ["caa@caa.com", "01jun15", "777-777-7777", "CAA"]
    ...     ]
    ... )
    >>> print(df)
                 0            1             2          3
    0  created by:  etl-toolbox                        -
    1   2020-06-07          ---        3 rows  4 columns
    2                                                  -
    3        email         date         phone         id
    4  aaa@aaa.com      04mar14  999-333-4444        AAA
    5  baa@baa.com      05aug13  111-222-3333        BAA
    6  caa@caa.com      01jun15  777-777-7777        CAA
    >>> label_fingerprints = {'email', 'date', 'phone'}
    >>> find_column_labels(df, label_fingerprints)
    >>> print(df)
             email     date         phone   id
    0  aaa@aaa.com  04mar14  999-333-4444  AAA
    1  baa@baa.com  05aug13  111-222-3333  BAA
    2  caa@caa.com  01jun15  777-777-7777  CAA

  Parameters:
  - df – A pandas.DataFrame containing column labels as a row, possibly preceded by misc. non-data rows.
  - label_fingerprints (set, list, or dict) – Fingerprinted label names that are expected in the column labels row. It can contain many variations on the expected label names, which makes this function useful for cleaning multiple files with varied column labels.
    - Example:
      >>> label_fingerprints = {'email', 'emailaddr', 'phone', 'phonenum'}

    Note
    If a dict is provided, the keys will be used as the fingerprints and the values will be ignored. A fingerprint_map dictionary created for use in the mapping_functions module can be passed as label_fingerprints.
  - label_match_thresh (int, optional) – The number of fingerprints that must be found in label_fingerprints for a row to be identified as the label row. Default is 3.

    This exists to prevent false matches caused by non-data header rows containing values that are also expected column labels. For example, in the following DataFrame, row 1 will be misidentified as the label row if label_match_thresh is set to 1:

      >>> import pandas as pd
      >>> from etl_toolbox.dataframe_functions import find_column_labels
      >>> df = pd.DataFrame(
      ...     [
      ...         [ "name:", "data report", "", "-"],
      ...         [ "date:", "2001-05-03", "3 rows", "4 columns"],
      ...         [ "", "", "", "-"],
      ...         [ "email", "date", "phone", "id"],
      ...         ["aaa@aaa.com", "04mar14", "999-333-4444", "AAA"],
      ...         ["baa@baa.com", "05aug13", "111-222-3333", "BAA"],
      ...         ["caa@caa.com", "01jun15", "777-777-7777", "CAA"]
      ...     ]
      ... )
      >>> print(df)
                   0            1             2          3
      0        name:  data report                        -
      1        date:   2001-05-03        3 rows  4 columns
      2                                                  -
      3        email         date         phone         id
      4  aaa@aaa.com      04mar14  999-333-4444        AAA
      5  baa@baa.com      05aug13  111-222-3333        BAA
      6  caa@caa.com      01jun15  777-777-7777        CAA
      >>> label_fingerprints = {'email', 'date', 'phone'}
      >>> find_column_labels(df, label_fingerprints, label_match_thresh=1)
      >>> print(df)
               date:   2001-05-03        3 rows  4 columns
      0                                                  -
      1        email         date         phone         id
      2  aaa@aaa.com      04mar14  999-333-4444        AAA
      3  baa@baa.com      05aug13  111-222-3333        BAA
      4  caa@caa.com      01jun15  777-777-7777        CAA

    3 is a good value for most datasets with a known set of label_fingerprints. It can be set lower if the incoming data has few columns and/or highly varied label names. If this argument is set to 0, the function will raise a ValueError.
  - special_characters (string, optional) – A string of special characters to preserve while creating the fingerprints for lookup in label_fingerprints. See cleaning_functions.fingerprint() for details.

    Note
    Any special characters that appear in the elements of label_fingerprints should be included here.

  Raises:
  - IndexError – Raised if a label row cannot be identified in the given DataFrame.
  - ValueError – Raised if label_match_thresh is set to 0.

  Returns: Returns None. The df argument is mutated.
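The row scan the function performs can be sketched as follows (hypothetical helper, assumes pandas; unlike the real function, it only returns the index of the matching row instead of restructuring df in place):

```python
import pandas as pd

def find_label_row_sketch(df, label_fingerprints, label_match_thresh=3):
    # Sketch of the scan only -- the real function also promotes the
    # matched row to df.columns and drops the rows above it.
    if label_match_thresh == 0:
        raise ValueError("label_match_thresh must be at least 1")

    def fp(v):
        return ''.join(c for c in str(v).lower() if c.isalnum())

    for i, row in df.iterrows():
        hits = sum(fp(v) in label_fingerprints for v in row)
        if hits >= label_match_thresh:
            return i
    raise IndexError("no label row found")

df = pd.DataFrame([
    ["created by:", "etl-toolbox", "", "-"],
    ["2020-06-07", "---", "3 rows", "4 columns"],
    ["", "", "", "-"],
    ["email", "date", "phone", "id"],
    ["aaa@aaa.com", "04mar14", "999-333-4444", "AAA"],
])
row = find_label_row_sketch(df, {'email', 'date', 'phone'})  # returns 3
```

Fingerprinting each cell before the membership test is what lets one `label_fingerprints` set match labels with inconsistent case and punctuation.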
- etl_toolbox.dataframe_functions.index_is_default(df)[source]¶
  Returns True if the provided DataFrame has the default pandas.RangeIndex, else returns False.
- etl_toolbox.dataframe_functions.merge_columns_by_label(df, deduplicate_values=False)[source]¶
  Merges columns of a pandas.DataFrame that have identical labels.

  For duplicate column labels in df, the first instance of each label will be turned into a column of lists containing the values from all of the instances. The other instances will then be dropped. None and np.nan values will not be included in the merged column.

  Note
  This function does not fingerprint the labels for comparison. Column labels should be cleaned and mapped before using this tool.

  - Usage:
    >>> import pandas as pd
    >>> from etl_toolbox.dataframe_functions import merge_columns_by_label
    >>> df = pd.DataFrame(
    ...     [
    ...         ["AAA", "aaa@aaa.com", "111-111-1111", "111@aaa.com"],
    ...         ["BAA", "baa@baa.com", "222-222-2222", "222@baa.com"],
    ...         ["CAA", "caa@caa.com", "333-333-3333", None],
    ...         ["DAA", "daa@daa.com", "444-444-4444", "444@daa.com"]
    ...     ],
    ...     columns=["id", "email", "phone", "email"]
    ... )
    >>> print(df)
        id        email         phone        email
    0  AAA  aaa@aaa.com  111-111-1111  111@aaa.com
    1  BAA  baa@baa.com  222-222-2222  222@baa.com
    2  CAA  caa@caa.com  333-333-3333         None
    3  DAA  daa@daa.com  444-444-4444  444@daa.com
    >>> merge_columns_by_label(df)
    >>> print(df)
        id                       email         phone
    0  AAA  [aaa@aaa.com, 111@aaa.com]  111-111-1111
    1  BAA  [baa@baa.com, 222@baa.com]  222-222-2222
    2  CAA               [caa@caa.com]  333-333-3333
    3  DAA  [daa@daa.com, 444@daa.com]  444-444-4444

  Parameters:
  - df – A pandas.DataFrame.
  - deduplicate_values (boolean, optional) – If True, the values of the combined columns will be deduplicated and stored in the modified DataFrame as a set instead of a list.

  Returns: Returns None. The df argument is mutated.
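A rough sketch of the merge (hypothetical, assumes pandas; it returns a new DataFrame rather than mutating df in place, and non-duplicate columns pass through untouched):

```python
import pandas as pd

def merge_columns_sketch(df, deduplicate_values=False):
    # Sketch only: builds and returns a new DataFrame.
    merged = {}
    for label in df.columns.unique():
        block = df.loc[:, df.columns == label]
        if block.shape[1] == 1:
            # Non-duplicate labels are left as-is.
            merged[label] = block.iloc[:, 0]
            continue
        # Collect non-null values from every instance of the label.
        rows = [
            [v for v in row if not pd.isna(v)]
            for row in block.itertuples(index=False)
        ]
        merged[label] = [set(r) for r in rows] if deduplicate_values else rows
    return pd.DataFrame(merged, index=df.index)

df = pd.DataFrame(
    [["AAA", "aaa@aaa.com", "111-111-1111", "111@aaa.com"],
     ["CAA", "caa@caa.com", "333-333-3333", None]],
    columns=["id", "email", "phone", "email"])
out = merge_columns_sketch(df)
```

Because `df.columns.unique()` preserves first-occurrence order, the merged column lands where the first duplicate appeared, matching the documented behavior.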
File Functions¶
Functions for working with files and directories
- etl_toolbox.file_functions.get_file_list_from_dir(dir_path, recursive=False, include_regex=None)[source]¶
  Returns a list of the files in a directory.

  - Usage:
    >>> get_file_list_from_dir('test_data/test_dir')  # doctest:+SKIP
    ['test_data/test_dir/1.csv',
     'test_data/test_dir/2.csv',
     'test_data/test_dir/3.json']

  Parameters:
  - dir_path – The path of the directory to list.
  - recursive (boolean, optional) – If set to True, the returned list will include files from dir_path and all of its subdirectories. Default is False.
    - Example:
      >>> get_file_list_from_dir('test_data/test_dir',
      ...                        recursive=True)  # doctest:+SKIP
      ['test_data/test_dir/1.csv',
       'test_data/test_dir/2.csv',
       'test_data/test_dir/3.json',
       'test_data/test_dir/a/1.csv',
       'test_data/test_dir/b/3.csv',
       'test_data/test_dir/b/c/2.txt']
  - include_regex (string, optional) – Only include files whose path matches this regex. Default is None (list is unfiltered).
    - Example:
      >>> get_file_list_from_dir('test_data/test_dir',
      ...                        include_regex=r'.*\.csv$')  # doctest:+SKIP
      ['test_data/test_dir/1.csv',
       'test_data/test_dir/2.csv']

  Returns: Returns a list of file paths.
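A stdlib-only sketch of how such a listing can be built with os.walk and re (illustrative; the real function's traversal order and regex-matching semantics may differ):

```python
import os
import re

def list_files_sketch(dir_path, recursive=False, include_regex=None):
    # Illustrative sketch, not the library's implementation.
    paths = []
    for root, dirs, files in os.walk(dir_path):
        for name in files:
            paths.append(os.path.join(root, name))
        if not recursive:
            break  # stop after the top-level directory
    if include_regex is not None:
        paths = [p for p in paths if re.match(include_regex, p)]
    return sorted(paths)
```

With a pattern like `r'.*\.csv$'`, `re.match` anchored at the start plus the explicit `$` effectively filters on the full path, mirroring the include_regex example above.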
Mapping Functions¶
Functions for mapping collections of values
- etl_toolbox.mapping_functions.append_count(x)[source]¶
  A generator function that yields x with a numbered suffix.
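No usage example is given here; based on the default-generator behavior described under rename_duplicate_labels() below (a count separated by an underscore), append_count presumably works like this sketch:

```python
def append_count_sketch(x):
    # Presumed behavior: yield x with an underscore-separated,
    # 1-based counter suffix ('x_1', 'x_2', ...).
    i = 0
    while True:
        i += 1
        yield '{}_{}'.format(x, i)
```

Each call to `next()` on the generator produces the next suffix, which is how rename_duplicate_labels() can number successive duplicates of the same label.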
- etl_toolbox.mapping_functions.map_labels(labels, fingerprint_map, special_characters='', return_unmapped=False)[source]¶
  Maps a list of labels to new values based on the provided fingerprint_map and returns that mapped list.

  This is useful for mapping column labels from a dataframe/file to a standard schema, particularly when the labels are inconsistent.

  The order of labels will be preserved in the return, and if a label isn't found in fingerprint_map, that label will be '-' in the returned list.

  - Usage:
    >>> from etl_toolbox.mapping_functions import map_labels
    >>> labels = [1, '2_A', '2b']
    >>> fingerprint_map = {'1': 'one', '2a': 'two_a', 'extrakey': 'value'}
    >>> map_labels(labels, fingerprint_map)
    ['one', 'two_a', '-']

  Parameters:
  - labels – The list of labels to map. These will be fingerprinted for their lookup in fingerprint_map.
  - fingerprint_map – A dictionary of all expected label fingerprints mapped to formatted outputs.
  - special_characters (string, optional) – A string of special characters to preserve while fingerprinting the labels. See cleaning_functions.fingerprint() for details.

    Note
    This should include any special characters that appear in the keys of fingerprint_map.
  - return_unmapped (boolean, optional) – If this is set to True, this function will return a tuple of the mapped labels and a set of unmapped labels (any value from labels whose fingerprint was not found in fingerprint_map). Default is False.

    Note
    This is useful for tracking unrecognized labels of incoming files in an automated ETL system.

  Returns: Returns a list or, if the return_unmapped option is True, a tuple with the first element being a list and the second being a set.
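The documented behavior, including return_unmapped, can be sketched as follows (hypothetical names; the fingerprinting step is simplified to a lowercase alphanumeric reduction):

```python
def map_labels_sketch(labels, fingerprint_map, return_unmapped=False):
    # Illustrative sketch, not the library's implementation.
    def fp(v):
        return ''.join(c for c in str(v).lower() if c.isalnum())

    mapped, unmapped = [], set()
    for label in labels:
        key = fp(label)
        if key in fingerprint_map:
            mapped.append(fingerprint_map[key])
        else:
            mapped.append('-')  # placeholder for unrecognized labels
            unmapped.add(label)
    return (mapped, unmapped) if return_unmapped else mapped
```

Note that the unmapped set holds the original label values, not their fingerprints, so an automated pipeline can log exactly what arrived in the file.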
- etl_toolbox.mapping_functions.rename_duplicate_labels(labels, rename_generator=<function append_count>)[source]¶
  Maps a list of labels such that duplicates are renamed according to the rename_generator.

  The order of labels is preserved in the return, and if a label isn't a duplicate, its value will be unchanged. Values will NOT be fingerprinted for comparison, so this function is best used after labels have been standardized.

  - Usage:
    >>> from etl_toolbox.mapping_functions import rename_duplicate_labels
    >>> labels = ['email', 'email', 'phone', 'name', 'email', 'phone']
    >>> rename_duplicate_labels(labels)
    ['email_1', 'email_2', 'phone_1', 'name', 'email_3', 'phone_2']

  Parameters:
  - labels – The list of labels to map.
  - rename_generator (generator, optional) – A generator function that specifies how to rename duplicate columns. It should take a label name as a positional argument and yield the renamed label. The default rename_generator appends a count, separated by an underscore.
    - Example:
      >>> def rename_generator(x):
      ...     i = 0
      ...     while True:
      ...         i += 1
      ...         yield x + '_' + str(i)
      >>> r = rename_generator('label')
      >>> next(r)
      'label_1'
      >>> next(r)
      'label_2'

  Returns: Returns a list.