Modules & functions¶


Cleaning Functions¶

Basic transformations for cleaning individual units of data
etl_toolbox.cleaning_functions.FALSEY_INDICATORS = ['false', '0']¶

A list of strings that are considered to be falsey. Used by clean_null().

etl_toolbox.cleaning_functions.NULL_INDICATORS = ['', 'blank', 'blocked', 'empty', 'invalid', 'na', 'nan', 'nbsp', 'none', 'notavailable', 'notset', 'np', 'null', 'removed', 'unavailable', 'unidentified', 'unknown']¶

A list of strings that are considered equivalent to None. Used by clean_null().

etl_toolbox.cleaning_functions.clean_null(x, falsey_is_null=False, special_characters='')[source]¶

Returns None if x is null-indicating, else returns x

x is considered null-indicating if any of the following are true:

  • x is None
  • len(x) == 0
  • fingerprint(x) in NULL_INDICATORS (see: NULL_INDICATORS)
  • falsey_is_null and not x
  • falsey_is_null and fingerprint(x) in FALSEY_INDICATORS (see: FALSEY_INDICATORS)
  • x is an iterable consisting of all null-indicating values
    • Ex: x == ['empty', None, {None}]
  • x evaluates as a Python literal that is null-indicating
    • Ex: x == '{None, None}'
Usage:
>>> from etl_toolbox.cleaning_functions import clean_null
>>> clean_null('Unknown') is None
True
>>> clean_null(['empty', None, {None}]) is None
True
>>> clean_null('false') is None
False
>>> clean_null('false', falsey_is_null=True) is None
True
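Two further illustrative cases, which follow from the rules listed above (these are assumptions about behavior, not part of the library's original examples):
>>> clean_null('{None, None}') is None
True
>>> clean_null({'', 'na'}) is None
True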
Parameters:
  • x –

    The object to be evaluated. x can be any type, though this function is intended for use with strings, lists, and sets.

    Warning

    The behavior of clean_null() may not be intuitive for types other than string, list, and set. For example, a dict will be considered null-indicating if it is either empty, or all of its keys are null-indicating. Dictionary values will be ignored.

  • falsey_is_null (boolean, optional) –

    Controls whether falsey objects are considered null-indicating.

    • If True, all falsey values will be considered null-indicating
    • If False, falsey values will not be considered null-indicating (unless they are otherwise null-indicating regardless of falsiness)

    Default is False.

    Note

    A common use case for setting falsey_is_null to True would be when evaluating non-numeric data such as locations or phone numbers. In these data sets, 0 is not a meaningful value and is likely just functioning as a placeholder for null.

  • special_characters (string, optional) – A string of special characters to preserve while creating the fingerprint of x. See fingerprint() for more details.
Returns:

Returns None or x.

etl_toolbox.cleaning_functions.clean_whitespace(x)[source]¶

Returns x with whitespace characters trimmed and condensed

Cleaning rules:
  • all whitespace characters replaced with standard ASCII space (32)
  • consecutive whitespace condensed
  • leading/trailing whitespace removed
Usage:
>>> from etl_toolbox.cleaning_functions import clean_whitespace
>>> clean_whitespace(''' 123   abc 456
...                               def              789    ''')
'123 abc 456 def 789'
Parameters:
  • x –

    The string to be cleaned.

    Note

    If x is not a string, it will be returned unchanged.

Returns:

Returns the same type as x.

Note

clean_whitespace() does not remove Unicode formatting characters that lack the White_Space character property, such as:

  • U+180E
  • U+200B
  • U+200C
  • U+200D
  • U+2060
  • U+FEFF
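For example, a zero-width space (U+200B) is preserved, while ordinary whitespace around it is still trimmed and condensed (an illustrative case based on the note above, not an original doctest):
>>> clean_whitespace(' a\u200bb   c ') == 'a\u200bb c'
True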
etl_toolbox.cleaning_functions.fingerprint(x, special_characters='')[source]¶

Returns a lowercase, alphanumeric representation of x

Usage:
>>> from etl_toolbox.cleaning_functions import fingerprint
>>> fingerprint('(Aa_Bb_Cc)')
'aabbcc'
Parameters:
  • x – The object to be fingerprinted. Will be cast to a string using str(x).
  • special_characters (string, optional) –

    A string of special characters to preserve while creating the fingerprint of x. Any special characters which are individually meaningful for the data should be included here.

    >>> fingerprint('(Aa_Bb_Cc)', special_characters='_')
    'aa_bb_cc'
    

    Note

    An example of where this might be useful is when fingerprinting column labels on a data set. Some datasets use $ as a label for price/cost or # as a label for phone number or id number. In that case, special_characters should be set to '#$' so that those values are preserved.
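    For instance, preserving '#' for a phone-number label might look like this (an illustrative case, not an original doctest):

    >>> fingerprint('Phone #', special_characters='#')
    'phone#'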

Returns:

Returns a string.


DataFrame Functions¶

Functions for working with pandas.DataFrames
etl_toolbox.dataframe_functions.dataframe_clean_null(df, empty_row_thresh=1, empty_column_thresh=1, falsey_is_null=False, special_characters='')[source]¶

Cleans null values of a pandas.DataFrame and removes empty rows/columns.

Warning

This function is computationally intensive and might be slow on large DataFrames.

Usage:
>>> import pandas as pd
>>> from etl_toolbox.dataframe_functions import dataframe_clean_null
>>> df = pd.DataFrame(
...     [
...         ["AAA",          None, "111-111-1111", "empty"],
...         ["BAA", "baa@baa.com",            "-",     "-"],
...         ["CAA", "caa@caa.com", "notavailable",   "..."],
...         ["DAA",     "blocked", "444-444-4444",  "null"]
...     ],
...     columns=["id", "email", "phone", "col4"]
... )
>>> print(df)
    id        email         phone   col4
0  AAA         None  111-111-1111  empty
1  BAA  baa@baa.com             -      -
2  CAA  caa@caa.com  notavailable    ...
3  DAA      blocked  444-444-4444   null
>>> dataframe_clean_null(df)
>>> print(df)
    id        email         phone
0  AAA          NaN  111-111-1111
1  BAA  baa@baa.com           NaN
2  CAA  caa@caa.com           NaN
3  DAA          NaN  444-444-4444
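The overall effect is conceptually similar to this rough sketch (an illustration only, not the library's actual implementation; it glosses over the in-place mutation, fingerprinting, and None/NaN handling details):
>>> from etl_toolbox.cleaning_functions import clean_null          # doctest:+SKIP
>>> cleaned = df.applymap(clean_null)                              # doctest:+SKIP
>>> cleaned = cleaned.dropna(axis=0, thresh=empty_row_thresh)      # doctest:+SKIP
>>> cleaned = cleaned.dropna(axis=1, thresh=empty_column_thresh)   # doctest:+SKIP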
Parameters:
  • df – A pandas.DataFrame.
  • empty_row_thresh (int, optional) –

    The number of non-null values required for a row to be considered populated/non-empty. Default is 1.

    If set to 0, no rows will be removed.

  • empty_column_thresh (int, optional) –

    The number of non-null values required for a column to be considered populated/non-empty. Default is 1.

    Note

    Rows are dropped before columns, so this threshold will be applied to the values that remain after rows are removed. If empty_column_thresh is greater than 1, the resulting df may contain rows with fewer populated cells than empty_row_thresh. However, it will never contain completely empty rows (unless empty_row_thresh is 0).

    If set to 0, no columns will be removed.

  • falsey_is_null (boolean, optional) – Controls whether falsey objects are considered null-indicating. See cleaning_functions.clean_null() for details. Default is False.
  • special_characters (string, optional) – A string of special characters to preserve while creating the fingerprints. See cleaning_functions.fingerprint() for details.
Returns:

Returns None. The df argument is mutated.

etl_toolbox.dataframe_functions.find_column_labels(df, label_fingerprints, label_match_thresh=3, special_characters='')[source]¶

Finds a row of column labels within a pandas.DataFrame based on a collection of expected label_fingerprints.

df.columns is updated to contain the found column labels, and all rows up to and including the column label row are removed. If the initial column labels of df fit the match criteria, df will be unchanged.

Note

All labels must be in the same row. This function will not attempt to correct translational errors.

Usage:
>>> import pandas as pd
>>> from etl_toolbox.dataframe_functions import find_column_labels
>>> df = pd.DataFrame(
...     [
...         ["created by:", "etl-toolbox",             "",         "-"],
...         [ "2020-06-07",         "---",       "3 rows", "4 columns"],
...         [           "",            "",             "",         "-"],
...         [      "email",        "date",        "phone",        "id"],
...         ["aaa@aaa.com",     "04mar14", "999-333-4444",       "AAA"],
...         ["baa@baa.com",     "05aug13", "111-222-3333",       "BAA"],
...         ["caa@caa.com",     "01jun15", "777-777-7777",       "CAA"]
...     ]
... )
>>> print(df)
             0            1             2          3
0  created by:  etl-toolbox                        -
1   2020-06-07          ---        3 rows  4 columns
2                                                  -
3        email         date         phone         id
4  aaa@aaa.com      04mar14  999-333-4444        AAA
5  baa@baa.com      05aug13  111-222-3333        BAA
6  caa@caa.com      01jun15  777-777-7777        CAA
>>> label_fingerprints = {'email', 'date', 'phone'}
>>> find_column_labels(df, label_fingerprints)
>>> print(df)
         email     date         phone   id
0  aaa@aaa.com  04mar14  999-333-4444  AAA
1  baa@baa.com  05aug13  111-222-3333  BAA
2  caa@caa.com  01jun15  777-777-7777  CAA
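Conceptually, the label row is located roughly as in the sketch below: each cell of a row is fingerprinted and counted against label_fingerprints, and the first row that meets label_match_thresh is taken as the label row (an assumption based on the description above, not the actual source):
>>> from etl_toolbox.cleaning_functions import fingerprint  # doctest:+SKIP
>>> for i, row in df.iterrows():  # doctest:+SKIP
...     matches = sum(fingerprint(cell) in label_fingerprints for cell in row)
...     if matches >= label_match_thresh:
...         break  # row i holds the column labels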
Parameters:
  • df – A pandas.DataFrame containing column labels as a row, possibly preceded by misc. non-data rows.
  • label_fingerprints (set, list, or dict) –

    Fingerprinted label names that are expected in the column labels row. It can contain many variations on the expected label names, which makes this function useful for cleaning multiple files with varied column labels.

    Example:
    >>> label_fingerprints = {'email', 'emailaddr', 'phone', 'phonenum'}
    

    Note

    If a dict is provided, the keys will be used as the fingerprints and the values will be ignored.

    A fingerprint_map dictionary created for use in the mapping_functions module can be passed as label_fingerprints.

  • label_match_thresh (int, optional) –

    The number of a row's cell fingerprints that must be found in label_fingerprints for that row to be identified as the label row. Default is 3.

    This exists to prevent false matches caused by non-data header rows containing values that are also expected column labels.

    For example, in the following DataFrame, row 1 will be misidentified as the label row if the label_match_thresh is set to 1:

    >>> import pandas as pd
    >>> from etl_toolbox.dataframe_functions import find_column_labels
    >>> df = pd.DataFrame(
    ...     [
    ...         [      "name:", "data report",             "",         "-"],
    ...         [      "date:",  "2001-05-03",       "3 rows", "4 columns"],
    ...         [           "",            "",             "",         "-"],
    ...         [      "email",        "date",        "phone",        "id"],
    ...         ["aaa@aaa.com",     "04mar14", "999-333-4444",       "AAA"],
    ...         ["baa@baa.com",     "05aug13", "111-222-3333",       "BAA"],
    ...         ["caa@caa.com",     "01jun15", "777-777-7777",       "CAA"]
    ...     ]
    ... )
    >>> print(df)
                 0            1             2          3
    0        name:  data report                        -
    1        date:   2001-05-03        3 rows  4 columns
    2                                                  -
    3        email         date         phone         id
    4  aaa@aaa.com      04mar14  999-333-4444        AAA
    5  baa@baa.com      05aug13  111-222-3333        BAA
    6  caa@caa.com      01jun15  777-777-7777        CAA
    >>> label_fingerprints = {'email', 'date', 'phone'}
    >>> find_column_labels(df, label_fingerprints, label_match_thresh=1)
    >>> print(df)
             date: 2001-05-03        3 rows 4 columns
    0                                               -
    1        email       date         phone        id
    2  aaa@aaa.com    04mar14  999-333-4444       AAA
    3  baa@baa.com    05aug13  111-222-3333       BAA
    4  caa@caa.com    01jun15  777-777-7777       CAA
    

    A value of 3 works well for most datasets with a known set of label_fingerprints. It can be set lower if the incoming data has few columns and/or highly varied label names.

    If this argument is set to 0, the function will raise a ValueError.

  • special_characters (string, optional) –

    A string of special characters to preserve while creating the fingerprints for lookup in label_fingerprints. See cleaning_functions.fingerprint() for details.

    Note

    Any special characters that appear in the elements of label_fingerprints should be included here.

Raises:
  • IndexError – Raised if a label row cannot be identified in the given DataFrame.
  • ValueError – Raised if the label_match_thresh is set to 0.
Returns:

Returns None. The df argument is mutated.

etl_toolbox.dataframe_functions.index_is_default(df)[source]¶

Returns True if the provided DataFrame has the default pandas.RangeIndex.

Otherwise, returns False.
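A minimal illustration (not part of the original docs; the outputs assume the behavior described above):
>>> import pandas as pd
>>> from etl_toolbox.dataframe_functions import index_is_default
>>> index_is_default(pd.DataFrame({'a': [1, 2]}))
True
>>> index_is_default(pd.DataFrame({'a': [1, 2]}, index=['x', 'y']))
False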

etl_toolbox.dataframe_functions.merge_columns_by_label(df, deduplicate_values=False)[source]¶

Merges columns of a pandas.DataFrame that have identical labels

For duplicate column labels in df, the first instance of each label will be turned into a column of lists containing the values from all of the instances. The other instances will then be dropped.

None and np.nan values will not be included in the merged column.

Note

This function does not fingerprint the labels for comparison. Column labels should be cleaned and mapped before using this tool.

Usage:
>>> import pandas as pd
>>> from etl_toolbox.dataframe_functions import merge_columns_by_label
>>> df = pd.DataFrame(
...     [
...         ["AAA", "aaa@aaa.com", "111-111-1111", "111@aaa.com"],
...         ["BAA", "baa@baa.com", "222-222-2222", "222@baa.com"],
...         ["CAA", "caa@caa.com", "333-333-3333", None],
...         ["DAA", "daa@daa.com", "444-444-4444", "444@daa.com"]
...     ],
...     columns=["id", "email", "phone", "email"]
... )
>>> print(df)
    id        email         phone        email
0  AAA  aaa@aaa.com  111-111-1111  111@aaa.com
1  BAA  baa@baa.com  222-222-2222  222@baa.com
2  CAA  caa@caa.com  333-333-3333         None
3  DAA  daa@daa.com  444-444-4444  444@daa.com
>>> merge_columns_by_label(df)
>>> print(df)
    id                       email         phone
0  AAA  [aaa@aaa.com, 111@aaa.com]  111-111-1111
1  BAA  [baa@baa.com, 222@baa.com]  222-222-2222
2  CAA               [caa@caa.com]  333-333-3333
3  DAA  [daa@daa.com, 444@daa.com]  444-444-4444
Parameters:
  • df – A pandas.DataFrame.
  • deduplicate_values (boolean, optional) – If True, the values of the combined columns will be deduplicated and stored in the modified DataFrame as a set instead of a list. Default is False. An illustrative call is shown below, after Returns.
Returns:

Returns None. The df argument is mutated.
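With deduplicate_values=True, the merged email cells from the example above would hold sets such as {'aaa@aaa.com', '111@aaa.com'} instead of lists (an illustrative call; the printed result is omitted because set ordering is not deterministic):
>>> merge_columns_by_label(df, deduplicate_values=True)  # doctest:+SKIP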


File Functions¶

Functions for working with files and directories
etl_toolbox.file_functions.get_file_list_from_dir(dir_path, recursive=False, include_regex=None)[source]¶

Returns a list of the files in a directory

Usage:
>>> get_file_list_from_dir('test_data/test_dir') # doctest:+SKIP
['test_data/test_dir/1.csv',
 'test_data/test_dir/2.csv',
 'test_data/test_dir/3.json']
Parameters:
  • dir_path – The path of the directory to list files from.
  • recursive (boolean, optional) –

    If set to True, the returned list will include files from dir_path and all of its subdirectories. Default is False.

    Example:
    >>> get_file_list_from_dir('test_data/test_dir',
    ...                        recursive=True)  # doctest:+SKIP
    ['test_data/test_dir/1.csv',
     'test_data/test_dir/2.csv',
     'test_data/test_dir/3.json',
     'test_data/test_dir/a/1.csv',
     'test_data/test_dir/b/3.csv',
     'test_data/test_dir/b/c/2.txt']
    
  • include_regex (string, optional) –

    Only include files whose path matches this regex. Default is None (list is unfiltered).

    Example:
    >>> get_file_list_from_dir('test_data/test_dir',
    ...                        include_regex=r'.*\.csv$') # doctest:+SKIP
    ['test_data/test_dir/1.csv',
     'test_data/test_dir/2.csv']
    
Returns:

Returns a list of file paths.
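For comparison, a recursive listing like the one above can be approximated with the standard library (a rough sketch, not the library's actual implementation):
>>> import glob, os  # doctest:+SKIP
>>> files = [p for p in glob.glob('test_data/test_dir/**', recursive=True)
...          if os.path.isfile(p)]  # doctest:+SKIP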


Mapping Functions¶

Functions for mapping collections of values
etl_toolbox.mapping_functions.append_count(x)[source]¶

A generator function that yields x with a numbered suffix.
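Based on its role as the default rename_generator for rename_duplicate_labels() (see below), its behavior is illustrated by the following (not part of the original docs):
>>> from etl_toolbox.mapping_functions import append_count
>>> counter = append_count('email')
>>> next(counter)
'email_1'
>>> next(counter)
'email_2'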

etl_toolbox.mapping_functions.map_labels(labels, fingerprint_map, special_characters='', return_unmapped=False)[source]¶

Maps a list of labels to new values based on the provided fingerprint_map and returns the mapped list.

This is useful for mapping column labels from a dataframe/file to a standard schema, particularly when the labels are inconsistent.

The order of labels is preserved in the return, and if a label's fingerprint isn't found in fingerprint_map, its position in the returned list will contain '-'.

Usage:
>>> from etl_toolbox.mapping_functions import map_labels
>>> labels = [1, '2_A', '2b']
>>> fingerprint_map = {'1': 'one', '2a': 'two_a', 'extrakey': 'value'}
>>> map_labels(labels, fingerprint_map)
['one', 'two_a', '-']
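With return_unmapped=True, the labels whose fingerprints were not found are also returned as a set (an illustrative continuation of the example above; the exact return format follows the description under Parameters below):
>>> map_labels(labels, fingerprint_map, return_unmapped=True)
(['one', 'two_a', '-'], {'2b'})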
Parameters:
  • labels – The list of labels to map. These will be fingerprinted for their lookup in fingerprint_map.
  • fingerprint_map – A dictionary of all expected label fingerprints mapped to formatted outputs.
  • special_characters (string, optional) –

    A string of special characters to preserve while fingerprinting the labels. See cleaning_functions.fingerprint() for details.

    Note

    This should include any special characters that appear in the keys of fingerprint_map.

  • return_unmapped (boolean, optional) –

    If set to True, the function returns a tuple of the mapped labels and a set of unmapped labels (any value from labels whose fingerprint was not found in fingerprint_map). Default is False.

    Note

    This is useful for tracking unrecognized labels of incoming files in an automated ETL system.

Returns:

Returns a list or, if return_unmapped is True, a tuple whose first element is the mapped list and whose second element is the set of unmapped labels.

etl_toolbox.mapping_functions.rename_duplicate_labels(labels, rename_generator=<function append_count>)[source]¶

Maps a list of labels such that duplicates are renamed according to the rename_generator.

The order of labels is preserved in the return, and if a label isn’t a duplicate, its value will be unchanged. Values will NOT be fingerprinted for comparison, so this function is best used after labels have been standardized.

Usage:
>>> from etl_toolbox.mapping_functions import rename_duplicate_labels
>>> labels = ['email', 'email', 'phone', 'name', 'email', 'phone']
>>> rename_duplicate_labels(labels)
['email_1', 'email_2', 'phone_1', 'name', 'email_3', 'phone_2']
Parameters:
  • labels – The list of labels to map.
  • rename_generator (generator, optional) –

    A generator function that specifies how duplicate labels are renamed. It should take a label name as a positional argument and yield successive renamed labels. The default rename_generator, append_count(), appends a count separated by an underscore. A custom generator can be passed in, as shown in the example at the end of this entry.

    Example:
    >>> def rename_generator(x):
    ...     i = 0
    ...     while True:
    ...         i += 1
    ...         yield x + '_' + str(i)
    >>> r = rename_generator('label')
    >>> next(r)
    'label_1'
    >>> next(r)
    'label_2'
    
Returns:

Returns a list.
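A custom rename_generator can be passed to rename_duplicate_labels() in the same way (a hedged illustration; the expected output assumes the renaming semantics shown in the Usage section above):
>>> def letter_suffix(x):
...     for letter in 'abcdefghijklmnopqrstuvwxyz':
...         yield x + '_' + letter
>>> rename_duplicate_labels(['email', 'email', 'phone'],
...                         rename_generator=letter_suffix)  # doctest:+SKIP
['email_a', 'email_b', 'phone']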