etl-toolbox¶

Useful ETL functions for Python

etl-toolbox is a Python library of simple but powerful functions for ETL and data cleaning. It contains tools that are useful for nearly any ETL pipeline, with a specific focus on the data variety challenges that arise when compiling data from many sources.

Features¶

Standardize various null-indicating values ('blank', 'none', 'null', etc) to Python’s None
Trim, condense, and standardize whitespace with a single function
Locate and rename column labels in messy files

Quick Start¶

Installation¶

Install from PyPI using pip:

$ pip install etl_toolbox

Usage¶

>>> import pandas as pd
>>>
>>> df = pd.read_csv('./test_data/bad-data.csv')
>>> df  # doctest:+SKIP
         Unnamed: 0           Unnamed: 1            Unnamed: 2    Unnamed: 3 Unnamed: 4
0      created by:   Brookcub Industries  for testing purposes           NaN        NaN
1             date:           2020-06-07                3 rows  some columns        NaN
2               NaN                  NaN                   NaN           NaN        NaN
3             Cust.             EML-addr                On          phn-nmbr       col5
4     Golden jackal    c.aureus@mail.com              03/04/14      333-4444      blank
5  Pie, rufous tree                 none                   NaN      222-3333      empty
6   Vulture, bengal              blocked              06/01/15      777-7777       none
7       Arctic tern  s_paradise@mail.com              01/28/16           NaN        NaN
8   Eurasian badger  meles@othermail.net         notavailable            NaN        NaN
9   Grant's gazelle   grant@randmail.com                     -           NaN        NaN

Find and standardize column labels using a dictionary of the expected values:

>>> from etl_toolbox.dataframe_functions import find_column_labels
>>> from etl_toolbox.mapping_functions import map_labels
>>>
>>> fingerprint_map = {
...     'cust': 'Name',
...     'emladdr': 'Email',
...     'on': 'Date',
...     'phnnmbr': 'Phone'
... }
>>>
>>> find_column_labels(df, fingerprint_map)
>>> df
              Cust.             EML-addr         On      phn-nmbr   col5
0     Golden jackal    c.aureus@mail.com       03/04/14  333-4444  blank
1  Pie, rufous tree                 none            NaN  222-3333  empty
2   Vulture, bengal              blocked       06/01/15  777-7777   none
3       Arctic tern  s_paradise@mail.com       01/28/16       NaN    NaN
4   Eurasian badger  meles@othermail.net  notavailable        NaN    NaN
5   Grant's gazelle   grant@randmail.com              -       NaN    NaN
>>>
>>> df.columns = map_labels(df.columns, fingerprint_map)
>>> df
               Name                Email           Date     Phone      -
0     Golden jackal    c.aureus@mail.com       03/04/14  333-4444  blank
1  Pie, rufous tree                 none            NaN  222-3333  empty
2   Vulture, bengal              blocked       06/01/15  777-7777   none
3       Arctic tern  s_paradise@mail.com       01/28/16       NaN    NaN
4   Eurasian badger  meles@othermail.net  notavailable        NaN    NaN
5   Grant's gazelle   grant@randmail.com              -       NaN    NaN

Standardize null values and remove empty rows/columns:

>>> from etl_toolbox.dataframe_functions import dataframe_clean_null
>>>
>>> dataframe_clean_null(df)
>>> df
               Name                Email      Date     Phone
0     Golden jackal    c.aureus@mail.com  03/04/14  333-4444
1  Pie, rufous tree                  NaN       NaN  222-3333
2   Vulture, bengal                  NaN  06/01/15  777-7777
3       Arctic tern  s_paradise@mail.com  01/28/16       NaN
4   Eurasian badger  meles@othermail.net       NaN       NaN
5   Grant's gazelle   grant@randmail.com       NaN       NaN

Or clean individual data values:

>>> from etl_toolbox.cleaning_functions import clean_whitespace
>>>
>>> clean_whitespace(''' 123   abc 456
...                               def\t\t 789\t''')
'123 abc 456 def 789'

Contributing¶

Contributions are appreciated! There are multiple ways to contribute:

Bug Reports¶

Bug reports help make this library more robust. A good bug report should include:

A clear description of the problem (the expected behavior vs the actual behavior)
A minimal, reproducible example (see the Stack Overflow guide)
The platform and versions involved (operating system, Python version, etl-toolbox version, pandas/numpy version if applicable, etc)

Submit bug reports with the issue tracker on GitHub.

Feature Requests¶

Open an issue to discuss features you’d like to see added to etl-toolbox.

Pull Requests¶

Follow these steps for submitting pull requests:

Find an issue or feature on the issue tracker.
Fork this repository on GitHub and make changes in a branch.
Add tests to confirm that the bugfix/feature works as expected.
Run the entire test suite and coverage report with pytest --doctest-modules --doctest-glob=*.rst --cov=etl_toolbox --ignore=docs/conf.py. Make sure text coverage is 100% and all tests are passing.
Submit a pull request.

The code style for etl-toolbox mostly follows PEP8. A linter like Flake8 is recommended for double checking new contributions.

Release History¶

0.0.3
- Fix multiple bugs in merge_columns_by_label() that occurred with certain inputs
- Change merge_columns_by_label() to remove None and np.nan values from merged columns
- Change find_column_labels() to check whether the existing column labels fit the match criteria before searching rows
- Change map_labels() to return '-' instead of None for unmapped labels
- Change clean_whitespace() to return non-string inputs unaltered instead of raising an exception
0.0.2
- Add GitHub continuous integration
- Add project links and badges to readme and PyPI metadata
- Fix bug in merge_columns_by_label() that raises a ValueError if df has multiple columns labeled None
0.0.1
- Initial release