etl-toolbox¶
Useful ETL functions for Python
etl-toolbox is a Python library of simple but powerful functions for ETL and data cleaning. It contains tools that are useful for nearly any ETL pipeline, with a specific focus on the data variety challenges that arise when compiling data from many sources.
Features¶
- Standardize various null-indicating values (
'blank'
,'none'
,'null'
, etc) to Python’sNone
- Trim, condense, and standardize whitespace with a single function
- Locate and rename column labels in messy files
Quick Start¶
Usage¶
>>> import pandas as pd
>>>
>>> df = pd.read_csv('./test_data/bad-data.csv')
>>> df # doctest:+SKIP
Unnamed: 0 Unnamed: 1 Unnamed: 2 Unnamed: 3 Unnamed: 4
0 created by: Brookcub Industries for testing purposes NaN NaN
1 date: 2020-06-07 3 rows some columns NaN
2 NaN NaN NaN NaN NaN
3 Cust. EML-addr On phn-nmbr col5
4 Golden jackal c.aureus@mail.com 03/04/14 333-4444 blank
5 Pie, rufous tree none NaN 222-3333 empty
6 Vulture, bengal blocked 06/01/15 777-7777 none
7 Arctic tern s_paradise@mail.com 01/28/16 NaN NaN
8 Eurasian badger meles@othermail.net notavailable NaN NaN
9 Grant's gazelle grant@randmail.com - NaN NaN
Find and standardize column labels using a dictionary of the expected values:
>>> from etl_toolbox.dataframe_functions import find_column_labels
>>> from etl_toolbox.mapping_functions import map_labels
>>>
>>> fingerprint_map = {
... 'cust': 'Name',
... 'emladdr': 'Email',
... 'on': 'Date',
... 'phnnmbr': 'Phone'
... }
>>>
>>> find_column_labels(df, fingerprint_map)
>>> df
Cust. EML-addr On phn-nmbr col5
0 Golden jackal c.aureus@mail.com 03/04/14 333-4444 blank
1 Pie, rufous tree none NaN 222-3333 empty
2 Vulture, bengal blocked 06/01/15 777-7777 none
3 Arctic tern s_paradise@mail.com 01/28/16 NaN NaN
4 Eurasian badger meles@othermail.net notavailable NaN NaN
5 Grant's gazelle grant@randmail.com - NaN NaN
>>>
>>> df.columns = map_labels(df.columns, fingerprint_map)
>>> df
Name Email Date Phone -
0 Golden jackal c.aureus@mail.com 03/04/14 333-4444 blank
1 Pie, rufous tree none NaN 222-3333 empty
2 Vulture, bengal blocked 06/01/15 777-7777 none
3 Arctic tern s_paradise@mail.com 01/28/16 NaN NaN
4 Eurasian badger meles@othermail.net notavailable NaN NaN
5 Grant's gazelle grant@randmail.com - NaN NaN
Standardize null values and remove empty rows/columns:
>>> from etl_toolbox.dataframe_functions import dataframe_clean_null
>>>
>>> dataframe_clean_null(df)
>>> df
Name Email Date Phone
0 Golden jackal c.aureus@mail.com 03/04/14 333-4444
1 Pie, rufous tree NaN NaN 222-3333
2 Vulture, bengal NaN 06/01/15 777-7777
3 Arctic tern s_paradise@mail.com 01/28/16 NaN
4 Eurasian badger meles@othermail.net NaN NaN
5 Grant's gazelle grant@randmail.com NaN NaN
Or clean individual data values:
>>> from etl_toolbox.cleaning_functions import clean_whitespace
>>>
>>> clean_whitespace(''' 123 abc 456
... def\t\t 789\t''')
'123 abc 456 def 789'
Contributing¶
Contributions are appreciated! There are multiple ways to contribute:
Bug Reports¶
Bug reports help make this library more robust. A good bug report should include:
- A clear description of the problem (the expected behavior vs the actual behavior)
- A minimal, reproducible example (see the Stack Overflow guide)
- The platform and versions involved (operating system, Python version,
etl-toolbox
version,pandas
/numpy
version if applicable, etc)
Submit bug reports with the issue tracker on GitHub.
Feature Requests¶
Open an issue to discuss features you’d like to see added to etl-toolbox.
Pull Requests¶
Follow these steps for submitting pull requests:
- Find an issue or feature on the issue tracker.
- Fork this repository on GitHub and make changes in a branch.
- Add tests to confirm that the bugfix/feature works as expected.
- Run the entire test suite and coverage report with
pytest --doctest-modules --doctest-glob=*.rst --cov=etl_toolbox --ignore=docs/conf.py
. Make sure text coverage is 100% and all tests are passing. - Submit a pull request.
The code style for etl-toolbox mostly follows PEP8. A linter like Flake8 is recommended for double checking new contributions.
Release History¶
- 0.0.3
- Fix multiple bugs in
merge_columns_by_label()
that occurred with certain inputs - Change
merge_columns_by_label()
to removeNone
andnp.nan
values from merged columns - Change
find_column_labels()
to check whether the existing column labels fit the match criteria before searching rows - Change
map_labels()
to return'-'
instead ofNone
for unmapped labels - Change
clean_whitespace()
to return non-string inputs unaltered instead of raising an exception
- Fix multiple bugs in
- 0.0.2
- Add GitHub continuous integration
- Add project links and badges to readme and PyPI metadata
- Fix bug in
merge_columns_by_label()
that raises aValueError
ifdf
has multiple columns labeledNone
- 0.0.1
- Initial release