Modules & functions¶


Cleaning Functions¶

Basic transformations for cleaning individual units of data
etl_toolbox.cleaning_functions.FALSEY_INDICATORS = ['false', '0']¶

A list of strings that are considered to be falsey. Used by clean_null().

etl_toolbox.cleaning_functions.NULL_INDICATORS = ['', 'blank', 'blocked', 'empty', 'invalid', 'na', 'nan', 'nbsp', 'none', 'notavailable', 'notset', 'np', 'null', 'removed', 'unavailable', 'unidentified', 'unknown']¶

A list of strings that are considered equivalent to None. Used by clean_null().

etl_toolbox.cleaning_functions.clean_null(x, falsey_is_null=False, special_characters='')[source]¶

Returns None if x is null-indicating, else returns x

x is considered null-indicating if any of the following are true:

  • x is None
  • len(x) == 0
  • fingerprint(x) in NULL_INDICATORS (see: NULL_INDICATORS)
  • falsey_is_null and not x
  • falsey_is_null and fingerprint(x) in FALSEY_INDICATORS (see: FALSEY_INDICATORS)
  • x is an iterable consisting of all null-indicating values
    • Ex: x == ['empty', None, {None}]
  • x evaluates as a Python literal that is null-indicating
    • Ex: x == '{None, None}'
Usage:
>>> from etl_toolbox.cleaning_functions import clean_null
>>> clean_null('Unknown') is None
True
>>> clean_null(['empty', None, {None}]) is None
True
>>> clean_null('false') is None
False
>>> clean_null('false', falsey_is_null=True) is None
True
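Two further illustrative cases, which follow from the rules listed above (these are assumptions about behavior, not part of the library's original examples):
>>> clean_null('{None, None}') is None
True
>>> clean_null({'', 'na'}) is None
True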
Parameters:
  • x –

    The object to be evaluated. x can be any type, though this function is intended for use with strings, lists, and sets.

    Warning

    The behavior of clean_null() may not be intuitive for types other than string, list, and set. For example, a dict will be considered null-indicating if it is either empty, or all of its keys are null-indicating. Dictionary values will be ignored.

  • falsey_is_null (boolean, optional) –

    Controls whether falsey objects are considered null-indicating.

    • If True, all falsey values will be considered null-indicating
    • If False, falsey values will not be considered null-indicating (unless they are otherwise null-indicating regardless of falsiness)

    Default is False.

    Note

    A common use case for setting falsey_is_null to True would be when evaluating non-numeric data such as locations or phone numbers. In these data sets, 0 is not a meaningful value and is likely just functioning as a placeholder for null.

  • special_characters (string, optional) – A string of special characters to preserve while creating the fingerprint of x. See fingerprint() for more details.
Returns:

Returns None or x.

etl_toolbox.cleaning_functions.clean_whitespace(x)[source]¶

Returns x with whitespace characters trimmed and condensed

Cleaning rules:
  • all whitespace characters replaced with standard ASCII space (32)
  • consecutive whitespace condensed
  • leading/trailing whitespace removed
Usage:
>>> from etl_toolbox.cleaning_functions import clean_whitespace
>>> clean_whitespace(''' 123   abc 456
...                               def              789    ''')
'123 abc 456 def 789'
Parameters:
  • x –

    The string to be cleaned.

    Note

    If x is not a string, it will be returned unchanged.

Returns:

Returns the same type as x.

Note

clean_whitespace() does not remove Unicode formatting characters that lack the White_Space character property, such as:

  • U+180E
  • U+200B
  • U+200C
  • U+200D
  • U+2060
  • U+FEFF
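For example, a zero-width space (U+200B) is preserved, while ordinary whitespace around it is still trimmed and condensed (an illustrative case based on the note above, not an original doctest):
>>> clean_whitespace(' a\u200bb   c ') == 'a\u200bb c'
True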
etl_toolbox.cleaning_functions.fingerprint(x, special_characters='')[source]¶

Returns a lowercase, alphanumeric representation of x

Usage:
>>> from etl_toolbox.cleaning_functions import fingerprint
>>> fingerprint('(Aa_Bb_Cc)')
'aabbcc'
Parameters:
  • x – The object to be fingerprinted. Will be cast to a string using str(x).
  • special_characters (string, optional) –

    A string of special characters to preserve while creating the fingerprint of x. Any special characters which are individually meaningful for the data should be included here.

    >>> fingerprint('(Aa_Bb_Cc)', special_characters='_')
    'aa_bb_cc'
    

    Note

    An example of where this might be useful is when fingerprinting column labels on a data set. Some datasets use $ as a label for price/cost or # as a label for phone number or id number. In that case, special_characters should be set to '#$' so that those values are preserved.
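    For instance, preserving '#' for a phone-number label might look like this (an illustrative case, not an original doctest):

    >>> fingerprint('Phone #', special_characters='#')
    'phone#'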

Returns:

Returns a string.


DataFrame Functions¶

Functions for working with pandas.DataFrames
etl_toolbox.dataframe_functions.dataframe_clean_null(df, empty_row_thresh=1, empty_column_thresh=1, falsey_is_null=False, special_characters='')[source]¶

Cleans null values of a pandas.DataFrame and removes empty rows/columns.

Warning

This function is computationally intensive and might be slow on large DataFrames.

Usage:
>>> import pandas as pd
>>> from etl_toolbox.dataframe_functions import dataframe_clean_null
>>> df = pd.DataFrame(
...     [
...         ["AAA",          None, "111-111-1111", "empty"],
...         ["BAA", "baa@baa.com",            "-",     "-"],
...         ["CAA", "caa@caa.com", "notavailable",   "..."],
...         ["DAA",     "blocked", "444-444-4444",  "null"]
...     ],
...     columns=["id", "email", "phone", "col4"]
... )
>>> print(df)
    id        email         phone   col4
0  AAA         None  111-111-1111  empty
1  BAA  baa@baa.com             -      -
2  CAA  caa@caa.com  notavailable    ...
3  DAA      blocked  444-444-4444   null
>>> dataframe_clean_null(df)
>>> print(df)
    id        email         phone
0  AAA          NaN  111-111-1111
1  BAA  baa@baa.com           NaN
2  CAA  caa@caa.com           NaN
3  DAA          NaN  444-444-4444
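The overall effect is conceptually similar to this rough sketch (an illustration only, not the library's actual implementation; it glosses over the in-place mutation, fingerprinting, and None/NaN handling details):
>>> from etl_toolbox.cleaning_functions import clean_null          # doctest:+SKIP
>>> cleaned = df.applymap(clean_null)                              # doctest:+SKIP
>>> cleaned = cleaned.dropna(axis=0, thresh=empty_row_thresh)      # doctest:+SKIP
>>> cleaned = cleaned.dropna(axis=1, thresh=empty_column_thresh)   # doctest:+SKIP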
Parameters:
  • df – A pandas.DataFrame.
  • empty_row_thresh (int, optional) –

    The number of non-null values required for a row to be considered populated/non-empty. Default is 1.

    If set to 0, no rows will be removed.

  • empty_column_thresh (int, optional) –

    The number of non-null values required for a column to be considered populated/non-empty. Default is 1.

    Note

    Rows are dropped before columns, so this threshold will be applied to the values that remain after rows are removed. If empty_column_thresh is greater than 1, the resulting df may contain rows with fewer populated cells than empty_row_thresh. However, it will never contain completely empty rows (unless empty_row_thresh is 0).

    If set to 0, no columns will be removed.

  • falsey_is_null (boolean, optional) – Controls whether falsey objects are considered null-indicating. See cleaning_functions.clean_null() for details. Default is False.
  • special_characters (string, optional) – A string of special characters to preserve while creating the fingerprints. See cleaning_functions.fingerprint() for details.
Returns:

Returns None. The df argument is mutated.

etl_toolbox.dataframe_functions.find_column_labels(df, label_fingerprints, label_match_thresh=3, special_characters='')[source]¶

Finds a row of column labels within a pandas.DataFrame based on a collection of expected label_fingerprints.

df.columns is updated to contain the found column labels, and all rows up to and including the column label row are removed. If the initial column labels of df fit the match criteria, df will be unchanged.

Note

All labels must be in the same row. This function will not attempt to correct translational errors.

Usage:
>>> import pandas as pd
>>> from etl_toolbox.dataframe_functions import find_column_labels
>>> df = pd.DataFrame(
...     [
...         ["created by:", "etl-toolbox",             "",         "-"],
...         [ "2020-06-07",         "---",       "3 rows", "4 columns"],
...         [           "",            "",             "",         "-"],
...         [      "email",        "date",        "phone",        "id"],
...         ["aaa@aaa.com",     "04mar14", "999-333-4444",       "AAA"],
...         ["baa@baa.com",     "05aug13", "111-222-3333",       "BAA"],
...         ["caa@caa.com",     "01jun15", "777-777-7777",       "CAA"]
...     ]
... )
>>> print(df)
             0            1             2          3
0  created by:  etl-toolbox                        -
1   2020-06-07          ---        3 rows  4 columns
2                                                  -
3        email         date         phone         id
4  aaa@aaa.com      04mar14  999-333-4444        AAA
5  baa@baa.com      05aug13  111-222-3333        BAA
6  caa@caa.com      01jun15  777-777-7777        CAA
>>> label_fingerprints = {'email', 'date', 'phone'}
>>> find_column_labels(df, label_fingerprints)
>>> print(df)
         email     date         phone   id
0  aaa@aaa.com  04mar14  999-333-4444  AAA
1  baa@baa.com  05aug13  111-222-3333  BAA
2  caa@caa.com  01jun15  777-777-7777  CAA
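Conceptually, the label row is located roughly as in the sketch below: each cell of a row is fingerprinted and counted against label_fingerprints, and the first row that meets label_match_thresh is taken as the label row (an assumption based on the description above, not the actual source):
>>> from etl_toolbox.cleaning_functions import fingerprint  # doctest:+SKIP
>>> for i, row in df.iterrows():  # doctest:+SKIP
...     matches = sum(fingerprint(cell) in label_fingerprints for cell in row)
...     if matches >= label_match_thresh:
...         break  # row i holds the column labels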
Parameters:
  • df – A pandas.DataFrame containing column labels as a row, possibly preceded by misc. non-data rows.
  • label_fingerprints (set, list, or dict) –

    Fingerprinted label names that are expected in the column labels row. It can contain many variations on the expected label names, which makes this function useful for cleaning multiple files with varied column labels.

    Example:
    >>> label_fingerprints = {'email', 'emailaddr', 'phone', 'phonenum'}
    

    Note

    If a dict is provided, the keys will be used as the fingerprints and the values will be ignored.

    A fingerprint_map dictionary created for use in the mapping_functions module can be passed as label_fingerprints.

  • label_match_thresh (int, optional) –

    The number of a row's cell fingerprints that must be found in label_fingerprints for that row to be identified as the label row. Default is 3.

    This exists to prevent false matches caused by non-data header rows containing values that are also expected column labels.

    For example, in the following DataFrame, row 1 will be misidentified as the label row if the label_match_thresh is set to 1:

    >>> import pandas as pd
    >>> from etl_toolbox.dataframe_functions import find_column_labels
    >>> df = pd.DataFrame(
    ...     [
    ...         [      "name:", "data report",             "",         "-"],
    ...         [      "date:",  "2001-05-03",       "3 rows", "4 columns"],
    ...         [           "",            "",             "",         "-"],
    ...         [      "email",        "date",        "phone",        "id"],
    ...         ["aaa@aaa.com",     "04mar14", "999-333-4444",       "AAA"],
    ...         ["baa@baa.com",     "05aug13", "111-222-3333",       "BAA"],
    ...         ["caa@caa.com",     "01jun15", "777-777-7777",       "CAA"]
    ...     ]
    ... )
    >>> print(df)
                 0            1             2          3
    0        name:  data report                        -
    1        date:   2001-05-03        3 rows  4 columns
    2                                                  -
    3        email         date         phone         id
    4  aaa@aaa.com      04mar14  999-333-4444        AAA
    5  baa@baa.com      05aug13  111-222-3333        BAA
    6  caa@caa.com      01jun15  777-777-7777        CAA
    >>> label_fingerprints = {'email', 'date', 'phone'}
    >>> find_column_labels(df, label_fingerprints, label_match_thresh=1)
    >>> print(df)
             date: 2001-05-03        3 rows 4 columns
    0                                               -
    1        email       date         phone        id
    2  aaa@aaa.com    04mar14  999-333-4444       AAA
    3  baa@baa.com    05aug13  111-222-3333       BAA
    4  caa@caa.com    01jun15  777-777-7777       CAA
    

    A value of 3 works well for most datasets with a known set of label_fingerprints. It can be set lower if the incoming data has few columns and/or highly varied label names.

    If this argument is set to 0, the function will raise a ValueError.

  • special_characters (string, optional) –

    A string of special characters to preserve while creating the fingerprints for lookup in label_fingerprints. See cleaning_functions.fingerprint() for details.

    Note

    Any special characters that appear in the elements of label_fingerprints should be included here.

Raises:
  • IndexError – Raised if a label row cannot be identified in the given DataFrame.
  • ValueError – Raised if the label_match_thresh is set to 0.
Returns:

Returns None. The df argument is mutated.

etl_toolbox.dataframe_functions.index_is_default(df)[source]¶

Returns True if the provided DataFrame has the default pandas.RangeIndex.

Otherwise, returns False.
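A minimal illustration (not part of the original docs; the outputs assume the behavior described above):
>>> import pandas as pd
>>> from etl_toolbox.dataframe_functions import index_is_default
>>> index_is_default(pd.DataFrame({'a': [1, 2]}))
True
>>> index_is_default(pd.DataFrame({'a': [1, 2]}, index=['x', 'y']))
False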

etl_toolbox.dataframe_functions.merge_columns_by_label(df, deduplicate_values=False)[source]¶

Merges columns of a pandas.DataFrame that have identical labels

For duplicate column labels in df, the first instance of each label will be turned into a column of lists containing the values from all of the instances. The other instances will then be dropped.

None and np.nan values will not be included in the merged column.

Note

This function does not fingerprint the labels for comparison. Column labels should be cleaned and mapped before using this tool.

Usage:
>>> import pandas as pd
>>> from etl_toolbox.dataframe_functions import merge_columns_by_label
>>> df = pd.DataFrame(
...     [
...         ["AAA", "aaa@aaa.com", "111-111-1111", "111@aaa.com"],
...         ["BAA", "baa@baa.com", "222-222-2222", "222@baa.com"],
...         ["CAA", "caa@caa.com", "333-333-3333", None],
...         ["DAA", "daa@daa.com", "444-444-4444", "444@daa.com"]
...     ],
...     columns=["id", "email", "phone", "email"]
... )
>>> print(df)
    id        email         phone        email
0  AAA  aaa@aaa.com  111-111-1111  111@aaa.com
1  BAA  baa@baa.com  222-222-2222  222@baa.com
2  CAA  caa@caa.com  333-333-3333         None
3  DAA  daa@daa.com  444-444-4444  444@daa.com
>>> merge_columns_by_label(df)
>>> print(df)
    id                       email         phone
0  AAA  [aaa@aaa.com, 111@aaa.com]  111-111-1111
1  BAA  [baa@baa.com, 222@baa.com]  222-222-2222
2  CAA               [caa@caa.com]  333-333-3333
3  DAA  [daa@daa.com, 444@daa.com]  444-444-4444
Parameters:
  • df – A pandas.DataFrame.
  • deduplicate_values (boolean, optional) – If True, the values of the combined columns will be deduplicated and stored in the modified DataFrame as a set instead of a list. Default is False. An illustrative call is shown below, after Returns.
Returns:

Returns None. The df argument is mutated.
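With deduplicate_values=True, the merged email cells from the example above would hold sets such as {'aaa@aaa.com', '111@aaa.com'} instead of lists (an illustrative call; the printed result is omitted because set ordering is not deterministic):
>>> merge_columns_by_label(df, deduplicate_values=True)  # doctest:+SKIP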


File Functions¶

Functions for working with files and directories
etl_toolbox.file_functions.get_file_list_from_dir(dir_path, recursive=False, include_regex=None)[source]¶

Returns a list of the files in a directory

Usage:
>>> get_file_list_from_dir('test_data/test_dir') # doctest:+SKIP
['test_data/test_dir/1.csv',
 'test_data/test_dir/2.csv',
 'test_data/test_dir/3.json']
Parameters:
  • dir_path – The path of the directory to list files from.
  • recursive (boolean, optional) –

    If set to True, the returned list will include files from dir_path and all of its subdirectories. Default is False.

    Example:
    >>> get_file_list_from_dir('test_data/test_dir',
    ...                        recursive=True)  # doctest:+SKIP
    ['test_data/test_dir/1.csv',
     'test_data/test_dir/2.csv',
     'test_data/test_dir/3.json',
     'test_data/test_dir/a/1.csv',
     'test_data/test_dir/b/3.csv',
     'test_data/test_dir/b/c/2.txt']
    
  • include_regex (string, optional) –

    Only include files whose path matches this regex. Default is None (list is unfiltered).

    Example:
    >>> get_file_list_from_dir('test_data/test_dir',
    ...                        include_regex=r'.*\.csv$') # doctest:+SKIP
    ['test_data/test_dir/1.csv',
     'test_data/test_dir/2.csv']
    
Returns:

Returns a list of file paths.
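For comparison, a recursive listing like the one above can be approximated with the standard library (a rough sketch, not the library's actual implementation):
>>> import glob, os  # doctest:+SKIP
>>> files = [p for p in glob.glob('test_data/test_dir/**', recursive=True)
...          if os.path.isfile(p)]  # doctest:+SKIP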


Mapping Functions¶

Functions for mapping collections of values
etl_toolbox.mapping_functions.append_count(x)[source]¶

A generator function that yields x with a numbered suffix.
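Based on its role as the default rename_generator for rename_duplicate_labels() (see below), its behavior is illustrated by the following (not part of the original docs):
>>> from etl_toolbox.mapping_functions import append_count
>>> counter = append_count('email')
>>> next(counter)
'email_1'
>>> next(counter)
'email_2'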

etl_toolbox.mapping_functions.map_labels(labels, fingerprint_map, special_characters='', return_unmapped=False)[source]¶

Maps a list of labels to new values based on the provided fingerprint_map and returns the mapped list.

This is useful for mapping column labels from a dataframe/file to a standard schema, particularly when the labels are inconsistent.

The order of labels is preserved in the return, and if a label's fingerprint isn't found in fingerprint_map, its position in the returned list will contain '-'.

Usage:
>>> from etl_toolbox.mapping_functions import map_labels
>>> labels = [1, '2_A', '2b']
>>> fingerprint_map = {'1': 'one', '2a': 'two_a', 'extrakey': 'value'}
>>> map_labels(labels, fingerprint_map)
['one', 'two_a', '-']
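With return_unmapped=True, the labels whose fingerprints were not found are also returned as a set (an illustrative continuation of the example above; the exact return format follows the description under Parameters below):
>>> map_labels(labels, fingerprint_map, return_unmapped=True)
(['one', 'two_a', '-'], {'2b'})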
Parameters:
  • labels – The list of labels to map. These will be fingerprinted for their lookup in fingerprint_map.
  • fingerprint_map – A dictionary of all expected label fingerprints mapped to formatted outputs.
  • special_characters (string, optional) –

    A string of special characters to preserve while fingerprinting the labels. See cleaning_functions.fingerprint() for details.

    Note

    This should include any special characters that appear in the keys of fingerprint_map.

  • return_unmapped (boolean, optional) –

    If set to True, the function returns a tuple of the mapped labels and a set of unmapped labels (any value from labels whose fingerprint was not found in fingerprint_map). Default is False.

    Note

    This is useful for tracking unrecognized labels of incoming files in an automated ETL system.

Returns:

Returns a list or, if return_unmapped is True, a tuple whose first element is the mapped list and whose second element is the set of unmapped labels.

etl_toolbox.mapping_functions.rename_duplicate_labels(labels, rename_generator=<function append_count>)[source]¶

Maps a list of labels such that duplicates are renamed according to the rename_generator.

The order of labels is preserved in the return, and if a label isn’t a duplicate, its value will be unchanged. Values will NOT be fingerprinted for comparison, so this function is best used after labels have been standardized.

Usage:
>>> from etl_toolbox.mapping_functions import rename_duplicate_labels
>>> labels = ['email', 'email', 'phone', 'name', 'email', 'phone']
>>> rename_duplicate_labels(labels)
['email_1', 'email_2', 'phone_1', 'name', 'email_3', 'phone_2']
Parameters:
  • labels – The list of labels to map.
  • rename_generator (generator, optional) –

    A generator function that specifies how duplicate labels are renamed. It should take a label name as a positional argument and yield successive renamed labels. The default rename_generator, append_count(), appends a count separated by an underscore. A custom generator can be passed in, as shown in the example at the end of this entry.

    Example:
    >>> def rename_generator(x):
    ...     i = 0
    ...     while True:
    ...         i += 1
    ...         yield x + '_' + str(i)
    >>> r = rename_generator('label')
    >>> next(r)
    'label_1'
    >>> next(r)
    'label_2'
    
Returns:

Returns a list.
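A custom rename_generator can be passed to rename_duplicate_labels() in the same way (a hedged illustration; the expected output assumes the renaming semantics shown in the Usage section above):
>>> def letter_suffix(x):
...     for letter in 'abcdefghijklmnopqrstuvwxyz':
...         yield x + '_' + letter
>>> rename_duplicate_labels(['email', 'email', 'phone'],
...                         rename_generator=letter_suffix)  # doctest:+SKIP
['email_a', 'email_b', 'phone']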