Once you have added your contribution to pyjanitor, please add your name using this markdown template:

`[@githubname](https://github.com/githubname) | [contributions](https://github.com/pyjanitor-devs/pyjanitor/issues?q=is%3Aclosed+mentions%3Agithubname)`

You can copy/paste the template and replace `githubname` with your username. Contributions that did not leave a commit trace are indicated in bullet points below each user's username.
- `git` issues at SciPy 2019.
- `git` issues at SciPy 2019.
- `join_apply`, as no alternative currently exists - Issue #1399 @lbeltrame
- `cartesian_product` function, as well as an `expand` method for pandas. - Issue #1293 @samukweku
- `pivot_longer` when `sort_by_appearance` is True. Added `pivot_longer_spec` for more control on how the dataframe should be unpivoted. - @samukweku #1361
- `convert_excel_date` and `convert_matlab_date` methods for polars - Issue #1352
- `complete` method for polars. - Issue #1352 @samukweku
- `pivot_longer` method, and a `pivot_longer_spec` function for polars - Issue #1352 @samukweku
- `row_to_names` method for polars. Issue #1352 @samukweku
- `read_commandline` function now supports polars - Issue #1352 @samukweku
- `xlsx_cells` function now supports polars - Issue #1352 @samukweku
- `xlsx_table` function now supports polars - Issue #1352 @samukweku
- `clean_names` method for polars - it can be used to clean the column names, or clean column values. Issue #1343 @samukweku
- `complete` method. - PR #1369 @samukweku
- `first/last` in `conditional_join`, when the join columns in the right dataframe are sorted. - PR #1382 @samukweku
- `complete` PR #1289 @samukweku
- `select` function now supports variable arguments - PR #1288 @samukweku
- `conditional_join` now supports timedelta dtype. - PR #1297 @samukweku
- `get_join_indices` function added - returns only join indices between two dataframes. Issue #1310 @samukweku
- `explode_index` function added. - Issue #1283
- `conditional_join` now supports timedelta dtype. - PR #1297
- `change_index_dtype` added. - @samukweku Issue #1314
- `glue` and `axis` parameters added to `collapse_levels`. - Issue #211 @samukweku
- `row_to_names` now supports multiple rows conversion to columns. - @samukweku Issue #1333
- `truncate_datetime` now uses a vectorized option. - @samukweku #1337
- `clean_names` can now be applied to column values. Issue #995 @samukweku
- `pytest.ini` file with `pyproject.toml` file. PR #1204 @Zeroto521
- `TypeError` when importing v0.24.0 (issue #1201 @xujiboy and @joranbeasley)
- `mkdocs` compatibility code. PR #1231 @thatlittleboy
- `mkdocstrings`. PR #1235 @thatlittleboy
- Prompts (`>>>`) and outputs in Example code blocks. PR #1237 @thatlittleboy
- `process_text`, `rename_column`, `rename_columns`, `filter_on`, `remove_columns`, `fill_direction`. Issue #1045 @samukweku
- `pivot_longer` now supports named groups where `names_pattern` is a regular expression. A dictionary can now be passed to `names_pattern`, and is internally evaluated as a list/tuple of regular expressions. Issue #1209 @samukweku
- `conditional_join`. Issue #1223 @samukweku
- `col` class for selecting columns within an expression. Currently limited to use within `conditional_join`. PR #1260 @samukweku
- `conditional_join`, when `use_numba = False`. Performance improvement for equi-join and a range join, when `use_numba = True`, for many-to-many joins with wide ranges. PR #1256, #1267 @samukweku
- `pivot_wider`. Issue #1045 @samukweku
- `deprecated_kwargs` for breaking API. #1103 @Zeroto521
- `min_max_scale` drops `old_min` and `old_max` to fit sklearn's method API. Issue #1068 @Zeroto521
- `jointly` option for `min_max_scale`, supporting transformation of each column's values or the entire set of values. The default transforms each column, similar behavior to `sklearn.preprocessing.MinMaxScaler`. (Issue #1067, PR #1112, PR #1123) @Zeroto521
- `expand_grid`. Issue #1121 @samukweku
- `names_expand` and `index_expand` parameters added to `pivot_wider` for exposing missing categoricals. Issue #1108 @samukweku
- `pivot_wider`. Issue #1134 @samukweku
- `dropna` parameter added to `pivot_longer`. Issue #1132 @samukweku
- `mkdocstrings` version bumped to fit its upcoming features. PR #1138 @Zeroto521
- `math.softmax` returning `Series`. PR #1139 @Zeroto521
- `encode_categorical` handles 2 (or more) dimensional arrays. PR #1153 @Zeroto521
- `concurrency`. PR #1161 @Zeroto521
- `sort_by_appearance` is False. Issue #1102 @samukweku
- `change_type` mutating the original `DataFrame`. PR #1162 @Zeroto521
- `column_name` of `change_type` fully supports multi-column input now. #1163 @Zeroto521
- `sort_by_appearance=True` combined with `dropna=True`. Issue #1168 @samukweku
- `case_when` function. Issue #1159 @samukweku
- `_MergeOperation` doesn't have the `copy` keyword anymore. Issue #1174 @Zeroto521
- `select_rows` function added for flexible row selection. Generic `select` function added as well. Added support for MultiIndex selection via dictionary. Issue #1124 @samukweku
- `FailedHealthCheck` Issue #1181 @Zeroto521
- `docs-preview.yml` and `docs.yml` merged to one, and a `documentation` pytest mark added. PR #1183 @Zeroto521
- `codecov.yml` (only works for the dev branch pushing event) merged into `tests.yml` (only works for the PR event). PR #1185 @Zeroto521
- `DataDescription` to fix: `AttributeError: 'DataFrame' object has no attribute 'data_description'`. PR #1191 @Zeroto521
- `fill.py` and `update_where.py` documentation updated with working examples.
- `num_bins` deprecated from `bin_numeric` in favour of `bins`, allowing generic `**kwargs` to be passed into `pd.cut`. Issue #969. @thatlittleboy
- `concatenate_columns` not working on category inputs @zbarry
- `filter_on`. Issue #988. @thatlittleboy
- `xlsx_table`, for reading tables from an Excel sheet. @samukweku
- `sort_column_value_order` no longer mutates the original dataframe.
- `fill_empty`'s `column_names` type range. Issue #998. @Zeroto521
- `row_to_names` (#1004) and `round_to_fraction` (#1005). @thatlittleboy
- `patterns` deprecated in favour of importing `re.compile`. #1007 @samukweku
- `encode_categorical`, where the values can either be a string or a 1D array. #1021 @samukweku
- `fill_value` and `explicit` parameters added to the `complete` function. #1019 @samukweku
- `expand_grid`. @samukweku
- `factorize_columns` (PR #1028) and `truncate_datetime_dataframe` (PR #1040) functions made non-mutating. @thatlittleboy
- `truncate_datetime_dataframe`, along with further performance improvements (PR #1040). @thatlittleboy
- `conditional_join`. @samukweku
- `.value` is now supported in `pivot_longer`. Multiple `values_to` is also supported, when `names_pattern` is a list or tuple. `names_transform` parameter added, for efficient dtype transformation of unpivoted columns. #1034, #1048, #1051 @samukweku
- `xlsx_cells` for reading a spreadsheet as a table of individual cells. #929 @samukweku
- `filter_string` suits the parameters of `Series.str.contains`. Issue #1003 and #1047. @Zeroto521
- `names_glue` in `pivot_wider` now takes a string form, using `str.format_map` under the hood. `levels_order` is also deprecated. @samukweku
- `transform_columns`, which ignored the `column_names` specification when a `new_column_names` dictionary was provided as an argument, issue #1063. @thatlittleboy
- `count_cumulative_unique` no longer modifies the column being counted in the output when the `case_sensitive` argument is set to False, issue #1065. @thatlittleboy
- `Remote Container` in VS Code. @ashenafiyb
- `expand_column` and `find_replace` code examples converted to doctests, issue #972. @gahjelle
- `expand_column` code examples converted to doctests, issue #972. @gahjelle
- `get_dupes` code examples converted to doctests, issue #972. @ethompsy
- `engineering` code examples converted to doctests, issue #972 @ashenafiyb
- `groupby_topk` code examples converted to doctests, issue #972. @ethompsy
- `math`, issue #972. @gahjelle
- `math` and `ml`, issue #972. @gahjelle
- `math`, `ml`, and `xarray`, issue #972. @gahjelle
- `case_when` extended to handle multiple conditionals and replacement values. Issue #736. @robertmitchellv
- `new_column_names` and `merge_frame` removed from `process_text`. Only existing columns are supported. @samukweku
- `complete` uses `pd.merge` internally, providing simpler logic, with some speed improvements in certain cases over `pd.reindex`. @samukweku
- `expand_grid` returns a MultiIndex DataFrame, allowing the user to decide how to manipulate the columns. @samukweku
- `pivot_longer` fixed for wrong output when `names_pattern` is a sequence with a single value. Issue #885 @samukweku
- `aggfunc` removed from `pivot_wider`; aggregation can be chained with pandas' `groupby`.
- `As_Categorical` deprecated from `encode_categorical`; a tuple of `(categories, order)` suffices for `**kwargs`. @samukweku
- `names_sort` removed from `pivot_wider`. @samukweku
- `softmax` added to the `math` module. Issue #902. @loganthomas
- `coalesce`, from `bfill`/`ffill`; `coalesce` now uses variable arguments. Issue #882 @samukweku
- `base.in`. Issue #895 @ericmjl
- `label_encode` updated to use pandas factorize instead of scikit-learn's LabelEncoder. @nvamsikrishna05
- `factorize_columns` method added, which will deprecate the `label_encode` method in a future release. @nvamsikrishna05
- `isort` automatic checks. Issue #845. @loganthomas
- `complete` function now uses variable args (`*args`) - @samukweku
- `expand_column`'s `sep` default is `"|"`, same as `pandas.Series.str.get_dummies`. Issue #876. @Zeroto521
- `limit` removed from `fill_direction`. `fill_direction` now uses kwargs. @samukweku
- `conditional_join` function added, supporting joins on non-equi operators. @samukweku
- `-n` (pytest-xdist) option. Issue #881. @Zeroto521
- `select_columns`'s examples given the same style. @Zeroto521
- `rename_columns` updated to take an optional function argument for mapping. @nvamsikrishna05
- `fill_value` parameter removed from `complete`. Users can use `fillna` instead. @samukweku
- `pivot_longer` with single-level columns. @samukweku
- `coalesce` refactored to return columns; also uses `bfill`, `ffill`, which is faster than `combine_first` @samukweku
- `eval` for string conditions in `update_where`. @samukweku
- `pivot_longer`. h/t to @tdhock for the observation. Issue #836 @samukweku
- `select_columns` now uses variable arguments (`*args`), to provide a simpler selection without the need for lists. - @samukweku
- `encode_categoricals` refactored to use generic functions via `functools.dispatch`. - @samukweku
- `dropna` parameter added to `groupby_agg`. @samukweku
- `complete` adds a `by` parameter to expose explicit missing values per group, via groupby. @samukweku
- `label_encode`. @zbarry
- `expand_grid`. @samukweku
- `multipledispatch` added to pip requirements. @ericmjl
- `darglint` package for docstring linting. Issue #745. @loganthomas
- `clean_names` function. Issue #753. @richardqiu
- `timeseries.flag_jumps()` function. Issue #711. @loganthomas
- `pivot_longer` can handle multiple values in paired columns, and can reshape using a list/tuple of regular expressions in `names_pattern`. @samukweku
- `dtypes` parameter, allowing the user to control the data types. - @samukweku
- `pivot_wider` function, which is the inverse of the `pivot_longer` function. @samukweku
- `openpyxl` added to `environment-dev.yml`. @samukweku
- `pivot_longer` function, with improved speed and cleaner code. The `dtypes` parameter was dropped; users can change dtypes with pandas' `astype` method, or pyjanitor's `change_type` method. @samukweku
- `encode_categorical` function, to create ordered categorical columns, or categorical columns with explicit categories. @samukweku
- `complete` method. Uses `pd.merge` to handle duplicates and null values. @samukweku
- `new_column_names` parameter added to `process_text`, allowing a user to create a new column name after processing a text column. Also added a `merge_frame` parameter, allowing dataframe merging, if the result of the text processing is a dataframe. @samukweku
- `aggfunc` parameter added to `pivot_wider`. @samukweku
- `check` function in utils to verify if a value is a callable. @samukweku
- `_select_column` function, using `functools.singledispatch`, to allow for flexible column selection. @samukweku
- `sort_timestamps_monotonically` added to timeseries functions @UGuntupalli
- `also` method for running functions in a chain with no return values.
- `timeseries` module section added to website docs. Issue #742. @loganthomas
- `pivot_longer` function, a wrapper around `pd.melt` and similar to tidyr's `pivot_longer` function. Also added an example notebook. @samukweku
- `fill_value` is not a dictionary. @samukweku
- `by` argument. @samukweku
- `groupby_topk` added to janitor functions @mphirke
- `update_where` function updated to use either the pandas query style, or boolean indexing via the `loc` method. Also updated the `find_replace` function to use the `loc` method directly, instead of routing it through the `update_where` function. @samukweku
- `pandas` minimum version bumped to 1.0.0. @hectormz
- `pandas` testing functions follow a uniform pattern. @hectormz
- `process_text` wrapper function for all Pandas string methods. @samukweku
- `fill_direction` function for forward/backward fills on missing values for selected columns in a dataframe. @samukweku
- `expand_grid` function by @samukweku
- `debug-statements`, `requirements-txt-fixer`, and `interrogate` added to `pre-commit`. @hectormz
- `pycodestyle` replaced with `flake8` in order to add the `pandas-vet` linter @hectormz
- `select_columns()` now raises `NameError` if a column label in `search_columns_labels` is missing from the `DataFrame` columns. @smu095
- `sort_naturally()` function. @ericmjl
- `apply()` dropped in favor of `pandas` functions in several functions. @hectormz
- `ecdf()` Series function by @ericmjl
- `find_replace` implementation updated to use keyword arguments to specify columns to perform find and replace on. @ericmjl
- `jitter()` dataframe function by @rahosbach
- `functions.py`. @bdice
- `AUTHORS.rst` and contributions by @hectormz
- `pre-commit` hooks added to the repository by @ericmjl
- `CONTRIBUTING.rst` by @hectormz
- `update_where` documentation updated to provide a bit more detail, and the bad_values example notebook expanded to demonstrate its use by @anzelpwj
- `win32` @Ram-N
- `pandas` API example. @ericmjl
- `get_features_targets()` moved to the new `ml.py` module by @hectormz
- `import_message` suggests python dist.-appropriate installs by @hectormz
- `update_where()` method added to the `janitor.spark.functions` submodule by @zjpoh
- `git` traces by @ericmjl
- `clean_names` by @ericmjl, h/t @jtaylor for sharing the original
- `null_flag` function which can mark null values in rows. Implemented by @anzelpwj

For changes that happened prior to v0.18.1, please consult the closed PRs, which can be found here.

We thank all contributors who have helped make pyjanitor the package that it is today.
Biology and bioinformatics-oriented data cleaning functions.

`join_fasta(df, filename, id_col, column_name)`

Convenience method to join in a FASTA file as a column.

This allows us to add the string sequence of a FASTA file as a new column of data in the dataframe.

This method only attaches the string representation of the SeqRecord.Seq object from Biopython. It does not attach the full SeqRecord. The Alphabet is also not stored, under the assumption that the data scientist has domain knowledge of what kind of sequence is being read in (nucleotide vs. amino acid).

This method mutates the original DataFrame.

For more advanced functions, please use phylopandas.

Examples:

```python
>>> import tempfile
>>> import pandas as pd
>>> import janitor.biology
>>> tf = tempfile.NamedTemporaryFile()
>>> tf.write('''>SEQUENCE_1
... MTEITAAMVKELRESTGAGMMDCK
... >SEQUENCE_2
... SATVSEINSETDFVAKN'''.encode('utf8'))
66
>>> tf.seek(0)
0
>>> df = pd.DataFrame({"sequence_accession":
...     ["SEQUENCE_1", "SEQUENCE_2", ]})
>>> df = df.join_fasta(
...     filename=tf.name,
...     id_col='sequence_accession',
...     column_name='sequence',
... )
>>> df.sequence
0    MTEITAAMVKELRESTGAGMMDCK
1           SATVSEINSETDFVAKN
Name: sequence, dtype: object
```

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `df` | `DataFrame` | A pandas DataFrame. | required |
| `filename` | `str` | Path to the FASTA file. | required |
| `id_col` | `str` | The column in the DataFrame that houses sequence IDs. | required |
| `column_name` | `str` | The name of the new column. | required |

Returns:

| Type | Description |
| --- | --- |
| `DataFrame` | A pandas DataFrame with a new FASTA string sequence column. |

Source code: `janitor/biology.py`
Chemistry and cheminformatics-oriented data cleaning functions.

`maccs_keys_fingerprint(df, mols_column_name)`

Convert a column of RDKIT mol objects into MACCS Keys Fingerprints.

Returns a new dataframe without any of the original data. This is intentional to leave the user with the data requested.

This method does not mutate the original DataFrame.

Examples:

Functional usage

```python
>>> import pandas as pd
>>> import janitor.chemistry
>>> df = pd.DataFrame({"smiles": ["O=C=O", "CCC(=O)O"]})
>>> maccs = janitor.chemistry.maccs_keys_fingerprint(
...     df=df.smiles2mol('smiles', 'mols'),
...     mols_column_name='mols'
... )
>>> len(maccs.columns)
167
```

Method chaining usage

```python
>>> import pandas as pd
>>> import janitor.chemistry
>>> df = pd.DataFrame({"smiles": ["O=C=O", "CCC(=O)O"]})
>>> maccs = (
...     df.smiles2mol('smiles', 'mols')
...     .maccs_keys_fingerprint(mols_column_name='mols')
... )
>>> len(maccs.columns)
167
```

If you wish to join the MACCS keys fingerprints back into the original dataframe, this can be accomplished by doing a `join`, because the indices are preserved:

```python
>>> joined = df.join(maccs)
>>> len(joined.columns)
169
```

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `df` | `DataFrame` | A pandas DataFrame. | required |
| `mols_column_name` | `Hashable` | The name of the column that has the RDKIT mol objects. | required |

Returns:

| Type | Description |
| --- | --- |
| `DataFrame` | A new pandas DataFrame of MACCS keys fingerprints. |

Source code: `janitor/chemistry.py`
`molecular_descriptors(df, mols_column_name)`

Convert a column of RDKIT mol objects into a Pandas DataFrame of molecular descriptors.

Returns a new dataframe without any of the original data. This is intentional to leave the user only with the data requested.

This method does not mutate the original DataFrame.

The molecular descriptors are from `rdkit.Chem.rdMolDescriptors`:

```
Chi0n, Chi0v, Chi1n, Chi1v, Chi2n, Chi2v, Chi3n, Chi3v,
Chi4n, Chi4v, ExactMolWt, FractionCSP3, HallKierAlpha, Kappa1,
Kappa2, Kappa3, LabuteASA, NumAliphaticCarbocycles,
NumAliphaticHeterocycles, NumAliphaticRings, NumAmideBonds,
NumAromaticCarbocycles, NumAromaticHeterocycles, NumAromaticRings,
NumAtomStereoCenters, NumBridgeheadAtoms, NumHBA, NumHBD,
NumHeteroatoms, NumHeterocycles, NumLipinskiHBA, NumLipinskiHBD,
NumRings, NumSaturatedCarbocycles, NumSaturatedHeterocycles,
NumSaturatedRings, NumSpiroAtoms, NumUnspecifiedAtomStereoCenters,
TPSA.
```

Examples:

Functional usage

```python
>>> import pandas as pd
>>> import janitor.chemistry
>>> df = pd.DataFrame({"smiles": ["O=C=O", "CCC(=O)O"]})
>>> mol_desc = (
...     janitor.chemistry.molecular_descriptors(
...         df=df.smiles2mol('smiles', 'mols'),
...         mols_column_name='mols'
...     )
... )
>>> mol_desc.TPSA
0    34.14
1    37.30
Name: TPSA, dtype: float64
```

Method chaining usage

```python
>>> import pandas as pd
>>> import janitor.chemistry
>>> df = pd.DataFrame({"smiles": ["O=C=O", "CCC(=O)O"]})
>>> mol_desc = (
...     df.smiles2mol('smiles', 'mols')
...     .molecular_descriptors(mols_column_name='mols')
... )
>>> mol_desc.TPSA
0    34.14
1    37.30
Name: TPSA, dtype: float64
```

If you wish to join the molecular descriptors back into the original dataframe, this can be accomplished by doing a `join`, because the indices are preserved:

```python
>>> joined = df.join(mol_desc)
>>> len(joined.columns)
41
```

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `df` | `DataFrame` | A pandas DataFrame. | required |
| `mols_column_name` | `Hashable` | The name of the column that has the RDKIT mol objects. | required |

Returns:

| Type | Description |
| --- | --- |
| `DataFrame` | A new pandas DataFrame of molecular descriptors. |

Source code: `janitor/chemistry.py`
`morgan_fingerprint(df, mols_column_name, radius=3, nbits=2048, kind='counts')`

Convert a column of RDKIT Mol objects into Morgan Fingerprints.

Returns a new dataframe without any of the original data. This is intentional, as Morgan fingerprints are usually high-dimensional features.

This method does not mutate the original DataFrame.

Examples:

Functional usage

```python
>>> import pandas as pd
>>> import janitor.chemistry
>>> df = pd.DataFrame({"smiles": ["O=C=O", "CCC(=O)O"]})

>>> morgans = janitor.chemistry.morgan_fingerprint(
...     df=df.smiles2mol('smiles', 'mols'),
...     mols_column_name='mols',
...     radius=3,      # Defaults to 3
...     nbits=2048,    # Defaults to 2048
...     kind='counts'  # Defaults to "counts"
... )
>>> set(morgans.iloc[0])
{0.0, 1.0, 2.0}

>>> morgans = janitor.chemistry.morgan_fingerprint(
...     df=df.smiles2mol('smiles', 'mols'),
...     mols_column_name='mols',
...     radius=3,      # Defaults to 3
...     nbits=2048,    # Defaults to 2048
...     kind='bits'    # Defaults to "counts"
... )
>>> set(morgans.iloc[0])
{0.0, 1.0}
```

Method chaining usage

```python
>>> import pandas as pd
>>> import janitor.chemistry
>>> df = pd.DataFrame({"smiles": ["O=C=O", "CCC(=O)O"]})

>>> morgans = (
...     df.smiles2mol('smiles', 'mols')
...     .morgan_fingerprint(
...         mols_column_name='mols',
...         radius=3,      # Defaults to 3
...         nbits=2048,    # Defaults to 2048
...         kind='counts'  # Defaults to "counts"
...     )
... )
>>> set(morgans.iloc[0])
{0.0, 1.0, 2.0}

>>> morgans = (
...     df
...     .smiles2mol('smiles', 'mols')
...     .morgan_fingerprint(
...         mols_column_name='mols',
...         radius=3,      # Defaults to 3
...         nbits=2048,    # Defaults to 2048
...         kind='bits'    # Defaults to "counts"
...     )
... )
>>> set(morgans.iloc[0])
{0.0, 1.0}
```

If you wish to join the Morgan fingerprints back into the original dataframe, this can be accomplished by doing a `join`, because the indices are preserved:

```python
>>> joined = df.join(morgans)
>>> len(joined.columns)
2050
```

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `df` | `DataFrame` | A pandas DataFrame. | required |
| `mols_column_name` | `str` | The name of the column that has the RDKIT mol objects. | required |
| `radius` | `int` | Radius of Morgan fingerprints. | `3` |
| `nbits` | `int` | The length of the fingerprints. | `2048` |
| `kind` | `Literal['counts', 'bits']` | Whether to return counts or bits. | `'counts'` |

Raises:

| Type | Description |
| --- | --- |
| `ValueError` | If `kind` is not one of `'counts'` or `'bits'`. |

Returns:

| Type | Description |
| --- | --- |
| `DataFrame` | A new pandas DataFrame of Morgan fingerprints. |

Source code: `janitor/chemistry.py`
`smiles2mol(df, smiles_column_name, mols_column_name, drop_nulls=True, progressbar=None)`

Convert a column of SMILES strings into RDKit Mol objects.

Automatically drops invalid SMILES, as determined by RDKIT.

This method mutates the original DataFrame.

Examples:

Functional usage

```python
>>> import pandas as pd
>>> import janitor.chemistry
>>> df = pd.DataFrame({"smiles": ["O=C=O", "CCC(=O)O"]})
>>> df = janitor.chemistry.smiles2mol(
...     df=df,
...     smiles_column_name='smiles',
...     mols_column_name='mols'
... )
>>> df.mols[0].GetNumAtoms(), df.mols[0].GetNumBonds()
(3, 2)
>>> df.mols[1].GetNumAtoms(), df.mols[1].GetNumBonds()
(5, 4)
```

Method chaining usage

```python
>>> import pandas as pd
>>> import janitor.chemistry
>>> df = df.smiles2mol(
...     smiles_column_name='smiles',
...     mols_column_name='rdkmol'
... )
>>> df.rdkmol[0].GetNumAtoms(), df.rdkmol[0].GetNumBonds()
(3, 2)
```

A progressbar can be optionally used.

- `tqdm` notebook progressbar. (`ipywidgets` must be enabled with your Jupyter installation.)
- `tqdm` progressbar. Better suited for use with scripts.
- `None` is the default value - the progress bar will not be shown.
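As a minimal sketch of requesting a progressbar (the option strings `'terminal'` and `'notebook'` are assumed here to match the two `tqdm` variants listed above; they are not spelled out in this extract):

```python
>>> import pandas as pd
>>> import janitor.chemistry
>>> df = pd.DataFrame({"smiles": ["O=C=O", "CCC(=O)O"]})
>>> df = df.smiles2mol(
...     smiles_column_name='smiles',
...     mols_column_name='mols',
...     progressbar='terminal',  # assumed option string for the script-friendly tqdm bar
... )
```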
Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `df` | `DataFrame` | pandas DataFrame. | required |
| `smiles_column_name` | `Hashable` | Name of column that holds the SMILES strings. | required |
| `mols_column_name` | `Hashable` | Name to be given to the new mols column. | required |
| `drop_nulls` | `bool` | Whether to drop rows whose mols failed to be constructed. | `True` |
| `progressbar` | `Optional[str]` | Whether to show a progressbar or not. | `None` |

Raises:

| Type | Description |
| --- | --- |
| `ValueError` | If an invalid option is provided for `progressbar`. |

Returns:

| Type | Description |
| --- | --- |
| `DataFrame` | A pandas DataFrame with a new RDKIT Mol objects column. |

Source code: `janitor/chemistry.py`
Engineering-specific data cleaning functions.

`convert_units(df, column_name=None, existing_units=None, to_units=None, dest_column_name=None)`

Converts a column of numeric values from one unit to another.

Unit conversion can only take place if the `existing_units` and `to_units` are of the same type (e.g., temperature or pressure). The provided unit types can be any unit name or alternate name provided in the `unyt` package's Listing of Units table.

Volume units are not provided natively in `unyt`. However, exponents are supported, and therefore some volume units can be converted. For example, a volume in cubic centimeters can be converted to cubic meters using `existing_units='cm**3'` and `to_units='m**3'`.

This method mutates the original DataFrame.

Examples:

```python
>>> import pandas as pd
>>> import janitor.engineering
>>> df = pd.DataFrame({"temp_F": [-40, 112]})
>>> df = df.convert_units(
...     column_name='temp_F',
...     existing_units='degF',
...     to_units='degC',
...     dest_column_name='temp_C'
... )
>>> df
   temp_F     temp_C
0     -40 -40.000000
1     112  44.444444
```
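The volume case described above can be sketched the same way (the converted values in the comment are the arithmetic expectations, assuming `unyt` parses these exponent unit strings as documented):

```python
>>> import pandas as pd
>>> import janitor.engineering
>>> df = pd.DataFrame({"vol_cm3": [1000.0, 250.0]})
>>> df = df.convert_units(
...     column_name='vol_cm3',
...     existing_units='cm**3',
...     to_units='m**3',
...     dest_column_name='vol_m3'
... )  # 1000 cm**3 -> 0.001 m**3, 250 cm**3 -> 0.00025 m**3
```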
Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `df` | `DataFrame` | A pandas DataFrame. | required |
| `column_name` | `str` | Name of the column containing numeric values that are to be converted from one set of units to another. | `None` |
| `existing_units` | `str` | The unit type to convert from. | `None` |
| `to_units` | `str` | The unit type to convert to. | `None` |
| `dest_column_name` | `str` | The name of the new column containing the converted values that will be created. | `None` |

Raises:

| Type | Description |
| --- | --- |
| `TypeError` | If the column is not numeric. |

Returns:

| Type | Description |
| --- | --- |
| `DataFrame` | A pandas DataFrame with a new column of unit-converted values. |

Source code: `janitor/engineering.py`
Finance-specific data cleaning functions.

`convert_currency(df, api_key, column_name=None, from_currency=None, to_currency=None, historical_date=None, make_new_column=False)`

Deprecated function.

Source code: `janitor/finance.py`
`convert_stock(stock_symbol)`

This function takes in a stock symbol as a parameter, queries an API for the company's full name, and returns it.

Examples:

```python
import janitor.finance
janitor.finance.convert_stock("aapl")
```

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `stock_symbol` | `str` | Stock ticker symbol. | required |

Raises:

| Type | Description |
| --- | --- |
| `ConnectionError` | Internet connection is not available. |

Returns:

| Type | Description |
| --- | --- |
| `str` | Full company name. |

Source code: `janitor/finance.py`
`get_symbol(symbol)`

This is a helper function to get a company's full name based on the stock symbol.

Examples:

```python
import janitor.finance
janitor.finance.get_symbol("aapl")
```

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `symbol` | `str` | The stock symbol used to query the API for the company's full name. | required |

Returns:

| Type | Description |
| --- | --- |
| `Optional[str]` | Full company name. |

Source code: `janitor/finance.py`
`inflate_currency(df, column_name=None, country=None, currency_year=None, to_year=None, make_new_column=False)`

Inflates a column of monetary values from one year to another, based on the currency's country.

The provided country can be any economy name or code from the World Bank list of economies.

Note: This method mutates the original DataFrame.

Examples:

```python
>>> import pandas as pd
>>> import janitor.finance
>>> df = pd.DataFrame({"profit": [100.10, 200.20, 300.30, 400.40, 500.50]})
>>> df
   profit
0   100.1
1   200.2
2   300.3
3   400.4
4   500.5
>>> df.inflate_currency(
...     column_name='profit',
...     country='USA',
...     currency_year=2015,
...     to_year=2018,
...     make_new_column=True
... )
   profit  profit_2018
0   100.1   106.050596
1   200.2   212.101191
2   300.3   318.151787
3   400.4   424.202382
4   500.5   530.252978
```
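With `make_new_column=False` (the default), the same call would overwrite `profit` in place instead, as described in the parameter table below; a minimal sketch:

```python
>>> df2 = df.inflate_currency(
...     column_name='profit',
...     country='USA',
...     currency_year=2015,
...     to_year=2018,
... )  # 'profit' now holds the inflated values; no new column is added
```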
Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `df` | `DataFrame` | A pandas DataFrame. | required |
| `column_name` | `str` | Name of the column containing monetary values to inflate. | `None` |
| `country` | `str` | The country associated with the currency being inflated. May be any economy or code from the World Bank List of economies. | `None` |
| `currency_year` | `int` | The currency year to inflate from. The year should be 1960 or later. | `None` |
| `to_year` | `int` | The currency year to inflate to. The year should be 1960 or later. | `None` |
| `make_new_column` | `bool` | Generates a new column for the inflated currency if True; otherwise, inflates the currency in place. | `False` |

Returns:

| Type | Description |
| --- | --- |
| `DataFrame` | The DataFrame with the inflated currency column. |

Source code: `janitor/finance.py`
pyjanitor's general-purpose data cleaning functions.

`add_columns`

`add_column(df, column_name, value, fill_remaining=False)`

Add a column to the dataframe.

Intended to be the method-chaining alternative to:

```python
df[column_name] = value
```

Note

This function will be deprecated in a 1.x release. Please use `pd.DataFrame.assign` instead.

Examples:

Add a column of constant values to the dataframe.

```python
>>> import pandas as pd
>>> import janitor
>>> df = pd.DataFrame({"a": list(range(3)), "b": list("abc")})
>>> df.add_column(column_name="c", value=1)
   a  b  c
0  0  a  1
1  1  b  1
2  2  c  1
```

Add a column of different values to the dataframe.

```python
>>> import pandas as pd
>>> import janitor
>>> df = pd.DataFrame({"a": list(range(3)), "b": list("abc")})
>>> df.add_column(column_name="c", value=list("efg"))
   a  b  c
0  0  a  e
1  1  b  f
2  2  c  g
```

Add a column using an iterable.

```python
>>> import pandas as pd
>>> import janitor
>>> df = pd.DataFrame({"a": list(range(3)), "b": list("abc")})
>>> df.add_column(column_name="c", value=range(4, 7))
   a  b  c
0  0  a  4
1  1  b  5
2  2  c  6
```
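`fill_remaining` recycles a shorter list R-style, as described in the parameter table below; a minimal sketch (the output follows from that recycling rule):

```python
>>> import pandas as pd
>>> import janitor
>>> df = pd.DataFrame({"a": list(range(3)), "b": list("abc")})
>>> df.add_column(column_name="c", value=[1, 2], fill_remaining=True)
   a  b  c
0  0  a  1
1  1  b  2
2  2  c  1
```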
Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `df` | `DataFrame` | A pandas DataFrame. | required |
| `column_name` | `str` | Name of the new column. Should be a string, in order for the column name to be compatible with the Feather binary format (this is a useful thing to have). | required |
| `value` | `Union[List[Any], Tuple[Any], Any]` | Either a single value, or a list/tuple of values. | required |
| `fill_remaining` | `bool` | If value is a tuple or list that is smaller than the number of rows in the DataFrame, repeat the list or tuple (R-style) to the end of the DataFrame. | `False` |

Raises:

| Type | Description |
| --- | --- |
| `ValueError` | If attempting to add a column that already exists. |
| `ValueError` | If `value` has more elements than the number of rows in the DataFrame. |
| `ValueError` | If attempting to add an iterable of values with a length not equal to the number of DataFrame rows. |
| `ValueError` | If `value` has length of `0`. |

Returns:

| Type | Description |
| --- | --- |
| `DataFrame` | A pandas DataFrame with an added column. |

Source code: `janitor/functions/add_columns.py`
`add_columns(df, fill_remaining=False, **kwargs)`

Add multiple columns to the dataframe.

This method does not mutate the original DataFrame.

Method to augment `add_column` with the ability to add multiple columns in one go. This replaces the need for multiple `add_column` calls.

Usage is through supplying kwargs where the key is the col name and the values correspond to the values of the new DataFrame column.

Values passed can be scalar or iterable (list, ndarray, etc.)

Note

This function will be deprecated in a 1.x release. Please use `pd.DataFrame.assign` instead.

Examples:

Inserting two more columns into a dataframe.

```python
>>> import pandas as pd
>>> import janitor
>>> df = pd.DataFrame({"a": list(range(3)), "b": list("abc")})
>>> df.add_columns(x=4, y=list("def"))
   a  b  x  y
0  0  a  4  d
1  1  b  4  e
2  2  c  4  f
```

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `df` | `DataFrame` | A pandas DataFrame. | required |
| `fill_remaining` | `bool` | If value is a tuple or list that is smaller than the number of rows in the DataFrame, repeat the list or tuple (R-style) to the end of the DataFrame. (Passed to `add_column`.) | `False` |
| `**kwargs` | `Any` | Column, value pairs which are looped through in `add_column` calls. | `{}` |

Returns:

| Type | Description |
| --- | --- |
| `DataFrame` | A pandas DataFrame with added columns. |

Source code: `janitor/functions/add_columns.py`
`also`

Implementation source for chainable function `also`.

`also(df, func, *args, **kwargs)`

Run a function with side effects.

This function allows you to run an arbitrary function in the `pyjanitor` method chain. Doing so will let you do things like save the dataframe to disk midway while continuing to modify the dataframe afterwards.

Examples:

```python
>>> import pandas as pd
>>> import janitor
>>> df = (
...     pd.DataFrame({"a": [1, 2, 3], "b": list("abc")})
...     .query("a > 1")
...     .also(lambda df: print(f"DataFrame shape is: {df.shape}"))
...     .rename_column(old_column_name="a", new_column_name="a_new")
...     .also(lambda df: df.to_csv("midpoint.csv"))
...     .also(
...         lambda df: print(f"Columns: {df.columns}")
...     )
... )
DataFrame shape is: (2, 2)
Columns: Index(['a_new', 'b'], dtype='object')
```

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `df` | `DataFrame` | A pandas DataFrame. | required |
| `func` | `Callable` | A function you would like to run in the method chain. It should take one DataFrame object as a parameter and have no return. If there is a return, it will be ignored. | required |
| `*args` | `Any` | Optional arguments for `func`. | `()` |
| `**kwargs` | `Any` | Optional keyword arguments for `func`. | `{}` |

Returns:

| Type | Description |
| --- | --- |
| `DataFrame` | The input pandas DataFrame, unmodified. |

Source code: `janitor/functions/also.py`
`bin_numeric`

Implementation source for `bin_numeric`.

`bin_numeric(df, from_column_name, to_column_name, bins=5, **kwargs)`

Generate a new column that labels bins for a specified numeric column.

This method does not mutate the original DataFrame.

A wrapper around the pandas `cut()` function to bin data of one column, generating a new column with the results.

Examples:

Binning a numeric column with specific bin edges.

```python
>>> import pandas as pd
>>> import janitor
>>> df = pd.DataFrame({"a": [3, 6, 9, 12, 15]})
>>> df.bin_numeric(
...     from_column_name="a", to_column_name="a_binned",
...     bins=[0, 5, 11, 15],
... )
    a  a_binned
0   3    (0, 5]
1   6   (5, 11]
2   9   (5, 11]
3  12  (11, 15]
4  15  (11, 15]
```
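Since extra kwargs are passed through to `pd.cut`, bin labels can be supplied as well; a minimal sketch (`labels` is a standard `pd.cut` keyword, not a `bin_numeric` parameter):

```python
>>> import pandas as pd
>>> import janitor
>>> df = pd.DataFrame({"a": [3, 6, 9, 12, 15]})
>>> df.bin_numeric(
...     from_column_name="a", to_column_name="a_binned",
...     bins=[0, 5, 11, 15], labels=["low", "mid", "high"],
... )["a_binned"].tolist()
['low', 'mid', 'mid', 'high', 'high']
```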
Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `df` | `DataFrame` | A pandas DataFrame. | required |
| `from_column_name` | `str` | The column whose data you want binned. | required |
| `to_column_name` | `str` | The new column to be created with the binned data. | required |
| `bins` | `Optional[Union[int, ScalarSequence, IntervalIndex]]` | The binning strategy to be utilized. Read the `pd.cut` documentation for more details. | `5` |
| `**kwargs` | `Any` | Additional kwargs to pass to `pd.cut`. | `{}` |

Raises:

| Type | Description |
| --- | --- |
| `ValueError` | If `bins` is not an `int`, a sequence of scalars, or an `IntervalIndex`. |

Returns:

| Type | Description |
| --- | --- |
| `DataFrame` | A pandas DataFrame. |

Source code: `janitor/functions/bin_numeric.py`
`case_when`

Implementation source for `case_when`.

`case_when(df, *args, default=None, column_name)`

Create a column based on a condition or multiple conditions.

Similar to SQL and dplyr's `case_when`, with inspiration from pydatatable's `if_else` function.

If your scenario requires direct replacement of values, pandas' `replace` method or `map` method should be better suited and more efficient; if the conditions check whether a value is within a range of values, pandas' `cut` or `qcut` should be more efficient; `np.where`/`np.select` are also performant options.

This function relies on the `pd.Series.mask` method.

When multiple conditions are satisfied, the first one is used. The variable `*args` parameter takes arguments of the form: `condition0`, `value0`, `condition1`, `value1`, ..., `default`. If `condition0` evaluates to `True`, then assign `value0` to `column_name`; if `condition1` evaluates to `True`, then assign `value1` to `column_name`; and so on. If none of the conditions evaluate to `True`, assign `default` to `column_name`.

This function can be likened to SQL's `CASE WHEN`:

```sql
CASE WHEN condition0 THEN value0
     WHEN condition1 THEN value1
     --- more conditions
     ELSE default
     END AS column_name
```

compared to python's `if-elif-else`:

```python
if condition0:
    value0
elif condition1:
    value1
# more elifs
else:
    default
```

Examples:

```python
>>> import pandas as pd
>>> import janitor
>>> df = pd.DataFrame(
...     {
...         "a": [0, 0, 1, 2, "hi"],
...         "b": [0, 3, 4, 5, "bye"],
...         "c": [6, 7, 8, 9, "wait"],
...     }
... )
>>> df
    a    b     c
0   0    0     6
1   0    3     7
2   1    4     8
3   2    5     9
4  hi  bye  wait
>>> df.case_when(
...     ((df.a == 0) & (df.b != 0)) | (df.c == "wait"), df.a,
...     (df.b == 0) & (df.a == 0), "x",
...     default = df.c,
...     column_name = "value",
... )
    a    b     c value
0   0    0     6     x
1   0    3     7     0
2   1    4     8     8
3   2    5     9     9
4  hi  bye  wait    hi
```
Version Changed

- Added the `default` parameter.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `df` | `DataFrame` | A pandas DataFrame. | required |
| `*args` | `Any` | Variable argument of conditions and expected values. Takes the form `condition0, value0, condition1, value1, ...`. | `()` |
| `default` | `Any` | This is the element inserted in the output when all conditions evaluate to False. Can be scalar, 1-D array or callable. If callable, it should evaluate to a 1-D array. The 1-D array should be the same length as the DataFrame. | `None` |
| `column_name` | `str` | Name of column to assign results to. A new column is created if it does not already exist in the DataFrame. | required |

Raises:

| Type | Description |
| --- | --- |
| `ValueError` | If a condition/value fails to evaluate. |

Returns:

| Type | Description |
| --- | --- |
| `DataFrame` | A pandas DataFrame. |

Source code: `janitor/functions/case_when.py`
`change_index_dtype`

Implementation of the `change_index_dtype` function.

`change_index_dtype(df, dtype, axis='index')`

Cast an index to a specified dtype `dtype`.

This method does not mutate the original DataFrame.

Examples:

```python
>>> import pandas as pd
>>> import numpy as np
>>> import janitor
>>> rng = np.random.default_rng(seed=0)
>>> np.random.seed(0)
>>> tuples = list(zip(*[['bar', 'bar', 'baz', 'baz',
...                      'foo', 'foo', 'qux', 'qux'],
...                     [1.0, 2.0, 1.0, 2.0,
...                      1.0, 2.0, 1.0, 2.0]]))
>>> idx = pd.MultiIndex.from_tuples(tuples, names=['first', 'second'])
>>> df = pd.DataFrame(np.random.randn(8, 2), index=idx, columns=['A', 'B'])
>>> df
                     A         B
first second
bar   1.0     1.764052  0.400157
      2.0     0.978738  2.240893
baz   1.0     1.867558 -0.977278
      2.0     0.950088 -0.151357
foo   1.0    -0.103219  0.410599
      2.0     0.144044  1.454274
qux   1.0     0.761038  0.121675
      2.0     0.443863  0.333674
>>> outcome = df.change_index_dtype(dtype=str)
>>> outcome
                     A         B
first second
bar   1.0     1.764052  0.400157
      2.0     0.978738  2.240893
baz   1.0     1.867558 -0.977278
      2.0     0.950088 -0.151357
foo   1.0    -0.103219  0.410599
      2.0     0.144044  1.454274
qux   1.0     0.761038  0.121675
      2.0     0.443863  0.333674
>>> outcome.index.dtypes
first     object
second    object
dtype: object
>>> outcome = df.change_index_dtype(dtype={'second': int})
>>> outcome
                     A         B
first second
bar   1       1.764052  0.400157
      2       0.978738  2.240893
baz   1       1.867558 -0.977278
      2       0.950088 -0.151357
foo   1      -0.103219  0.410599
      2       0.144044  1.454274
qux   1       0.761038  0.121675
      2       0.443863  0.333674
>>> outcome.index.dtypes
first     object
second     int64
dtype: object
>>> outcome = df.change_index_dtype(dtype={0: 'category', 1: int})
>>> outcome
                     A         B
first second
bar   1       1.764052  0.400157
      2       0.978738  2.240893
baz   1       1.867558 -0.977278
      2       0.950088 -0.151357
foo   1      -0.103219  0.410599
      2       0.144044  1.454274
qux   1       0.761038  0.121675
      2       0.443863  0.333674
>>> outcome.index.dtypes
first     category
second       int64
dtype: object
```

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `df` | `DataFrame` | A pandas DataFrame. | required |
| `dtype` |  | Use a str or dtype to cast the entire Index to the same type. Alternatively, use a dictionary to change the MultiIndex to new dtypes. | required |
| `axis` | `str` | Determines which axis to change the dtype(s). Should be either 'index' or 'columns'. | `'index'` |

Returns:

| Type | Description |
| --- | --- |
| `DataFrame` | A pandas DataFrame with a new Index. |

Source code: `janitor/functions/change_index_dtype.py`
`change_type`

`change_type(df, column_name, dtype, ignore_exception=False)`

Change the type of a column.

This method does not mutate the original DataFrame.

Exceptions that are raised can be ignored. For example, if one has a mixed dtype column that has non-integer strings and integers, and you want to coerce everything to integers, you can optionally ignore the non-integer strings and replace them with `NaN` or keep the original value.

Intended to be the method-chaining alternative to:

```python
df[col] = df[col].astype(dtype)
```

Note

This function will be deprecated in a 1.x release. Please use `pd.DataFrame.astype` instead.

Examples:

Change the type of a column.

```python
>>> import pandas as pd
>>> import janitor
>>> df = pd.DataFrame({"col1": range(3), "col2": ["m", 5, True]})
>>> df
   col1  col2
0     0     m
1     1     5
2     2  True
>>> df.change_type(
...     "col1", dtype=str,
... ).change_type(
...     "col2", dtype=float, ignore_exception="fillna",
... )
  col1  col2
0    0   NaN
1    1   5.0
2    2   1.0
```

Change the type of multiple columns. To change the type of all columns, please use `DataFrame.astype` instead.

```python
>>> import pandas as pd
>>> import janitor
>>> df = pd.DataFrame({"col1": range(3), "col2": ["m", 5, True]})
>>> df.change_type(['col1', 'col2'], str)
  col1  col2
0    0     m
1    1     5
2    2  True
```
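A `'keep_values'` option also appears among the `ignore_exception` choices in the parameter table below; a minimal sketch of the assumed behavior, where a failed conversion leaves the column untouched:

```python
>>> import pandas as pd
>>> import janitor
>>> df = pd.DataFrame({"col1": range(3), "col2": ["m", 5, True]})
>>> df.change_type("col2", dtype=float, ignore_exception="keep_values")
   col1  col2
0     0     m
1     1     5
2     2  True
```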
Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `df` | `DataFrame` | A pandas DataFrame. | required |
| `column_name` | `Hashable \| list[Hashable] \| Index` | The column(s) in the dataframe. | required |
| `dtype` | `type` | The datatype to convert to. Should be one of the standard Python types, or a numpy datatype. | required |
| `ignore_exception` | `bool` | One of `False`, `'fillna'`, or `'keep_values'`. | `False` |

Raises:

| Type | Description |
| --- | --- |
| `ValueError` | If an unknown option is provided for `ignore_exception`. |

Returns:

| Type | Description |
| --- | --- |
| `DataFrame` | A pandas DataFrame with changed column types. |

Source code: `janitor/functions/change_type.py`
`clean_names`

Functions for cleaning columns/index names and/or column values.

`clean_names(df, axis='columns', column_names=None, strip_underscores=None, case_type='lower', remove_special=False, strip_accents=True, preserve_original_labels=True, enforce_string=True, truncate_limit=None)`

Clean column/index names. It can also be applied to column values.

Takes all column names, converts them to lowercase, then replaces all spaces with underscores.

By default, column names are converted to string types. This can be switched off by passing in `enforce_string=False`.

This method does not mutate the original DataFrame.

Examples:

```python
>>> import pandas as pd
>>> import janitor
>>> df = pd.DataFrame(
...     {
...         "Aloha": range(3),
...         "Bell Chart": range(3),
...         "Animals@#$%^": range(3)
...     }
... )
>>> df
   Aloha  Bell Chart  Animals@#$%^
0      0           0             0
1      1           1             1
2      2           2             2
>>> df.clean_names()
   aloha  bell_chart  animals@#$%^
0      0           0             0
1      1           1             1
2      2           2             2
>>> df.clean_names(remove_special=True)
   aloha  bell_chart  animals
0      0           0        0
1      1           1        1
2      2           2        2
```
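For the snake case conversion described in the parameter table below, a minimal sketch (the resulting column name is assumed from the documented CamelCase-to-snake rule):

```python
>>> import pandas as pd
>>> import janitor
>>> df = pd.DataFrame({"CamelCaseColumn": range(3)})
>>> df.clean_names(case_type="snake").columns.tolist()
['camel_case_column']
```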
Version Changed

- Added the `axis` and `column_names` parameters.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `df` | `DataFrame` | The pandas DataFrame object. | required |
| `axis` | `str` | Whether to clean the labels on the index or columns. | `'columns'` |
| `column_names` | `str \| list` | Clean the values in a column. | `None` |
| `strip_underscores` | `str \| bool` | Removes the outer underscores from all column names/values. Default None keeps outer underscores. Values can be either 'left', 'right' or 'both' or the respective shorthand 'l', 'r' and True. | `None` |
| `case_type` | `str` | Whether to make columns lower or uppercase. Current case may be preserved with 'preserve', while snake case conversion (from CamelCase or camelCase only) can be turned on using "snake". Default 'lower' makes all characters lowercase. | `'lower'` |
| `remove_special` | `bool` | Remove special characters from columns. Only letters, numbers and underscores are preserved. | `False` |
| `strip_accents` | `bool` | Whether or not to remove accents from column names/values. | `True` |
| `preserve_original_labels` | `bool` | Preserve original names. | `True` |
| `enforce_string` | `bool` | Whether or not to convert all column names/values to string type. Defaults to True, but can be turned off. Columns with >1 levels will not be converted by default. | `True` |
| `truncate_limit` | `int` | Truncates formatted column names/values to the specified length. Default None does not truncate. | `None` |

Raises:

| Type | Description |
| --- | --- |
| `ValueError` | If an invalid option is provided. |

Returns:

| Type | Description |
| --- | --- |
| `DataFrame` | A pandas DataFrame. |

Source code: `janitor/functions/clean_names.py`
`coalesce`

Function for performing coalesce.

`coalesce(df, *column_names, target_column_name=None, default_value=None)`

Coalesce two or more columns of data in order of column names provided.

Given the variable arguments of column names, `coalesce` finds and returns the first non-missing value from these columns, for every row in the input dataframe. If all the column values are null for a particular row, then the `default_value` will be filled in.

If `target_column_name` is not provided, then the first column is coalesced.

This method does not mutate the original DataFrame.

The `select` syntax can be used in `column_names`.

Examples:

Use `coalesce` with 3 columns, "a", "b" and "c".

```python
>>> import pandas as pd
>>> import numpy as np
>>> import janitor
>>> df = pd.DataFrame({
...     "a": [np.nan, 1, np.nan],
...     "b": [2, 3, np.nan],
...     "c": [4, np.nan, np.nan],
... })
>>> df.coalesce("a", "b", "c")
     a    b    c
0  2.0  2.0  4.0
1  1.0  3.0  NaN
2  NaN  NaN  NaN
```

Provide a `target_column_name`.

```python
>>> df.coalesce("a", "b", "c", target_column_name="new_col")
     a    b    c  new_col
0  NaN  2.0  4.0      2.0
1  1.0  3.0  NaN      1.0
2  NaN  NaN  NaN      NaN
```

Provide a default value.

```python
>>> import pandas as pd
>>> import numpy as np
>>> import janitor
>>> df = pd.DataFrame({
...     "a": [1, np.nan, np.nan],
...     "b": [2, 3, np.nan],
... })
>>> df.coalesce(
...     "a", "b",
...     target_column_name="new_col",
...     default_value=-1,
... )
     a    b  new_col
0  1.0  2.0      1.0
1  NaN  3.0      3.0
2  NaN  NaN     -1.0
```

This is more syntactic diabetes! For R users, this should look familiar to `dplyr`'s `coalesce` function; for Python users, the interface should be more intuitive than the `pandas.Series.combine_first` method.
Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `df` | `DataFrame` | A pandas DataFrame. | required |
| `column_names` | `Any` | A list of column names. | `()` |
| `target_column_name` | `Optional[str]` | The new column name after combining. If `None`, the first column in `column_names` is updated with the coalesced values. | `None` |
| `default_value` | `Optional[Union[int, float, str]]` | A scalar to replace any remaining nulls after coalescing. | `None` |

Raises:

| Type | Description |
| --- | --- |
| `ValueError` | If the length of `column_names` is less than 2. |

Returns:

| Type | Description |
| --- | --- |
| `DataFrame` | A pandas DataFrame with coalesced columns. |

Source code: `janitor/functions/coalesce.py`
collapse_levels
+
+
+Implementation of the collapse_levels
function.
collapse_levels(df, sep=None, glue=None, axis='columns')
+
+Flatten multi-level index/column dataframe to a single level.
+This method does not mutate the original DataFrame.
+Given a DataFrame containing multi-level index/columns, flatten to single-level +by string-joining the labels in each level.
+After a groupby
/ aggregate
operation where .agg()
is passed a
+list of multiple aggregation functions, a multi-level DataFrame is
+returned with the name of the function applied in the second level.
It is sometimes convenient for later indexing to flatten out this +multi-level configuration back into a single level. This function does +this through a simple string-joining of all the names across different +levels in a single column.
+ + +Examples:
+>>> import pandas as pd
+>>> import janitor
+>>> df = pd.DataFrame({
+... "class": ["bird", "bird", "bird", "mammal", "mammal"],
+... "max_speed": [389, 389, 24, 80, 21],
+... "type": ["falcon", "falcon", "parrot", "Lion", "Monkey"],
+... })
+>>> df
+ class max_speed type
+0 bird 389 falcon
+1 bird 389 falcon
+2 bird 24 parrot
+3 mammal 80 Lion
+4 mammal 21 Monkey
+>>> grouped_df = df.groupby("class")[['max_speed']].agg(["mean", "median"])
+>>> grouped_df
+ max_speed
+ mean median
+class
+bird 267.333333 389.0
+mammal 50.500000 50.5
+>>> grouped_df.collapse_levels(sep="_")
+ max_speed_mean max_speed_median
+class
+bird 267.333333 389.0
+mammal 50.500000 50.5
+
Before applying .collapse_levels
, the .agg
operation returns a
+multi-level column DataFrame whose columns are (level 1, level 2)
:
[("max_speed", "mean"), ("max_speed", "median")]
+
.collapse_levels
then flattens the column MultiIndex into a single
+level index with names:
["max_speed_mean", "max_speed_median"]
+
For more control, a glue
specification can be passed,
+where the names of the levels are used to control the output of the
+flattened index:
>>> (grouped_df
+... .rename_axis(columns=['column_name', 'agg_name'])
+... .collapse_levels(glue="{agg_name}_{column_name}")
+... )
+ mean_max_speed median_max_speed
+class
+bird 267.333333 389.0
+mammal 50.500000 50.5
+
Note that for glue
to work, the keyword arguments
+in the glue specification
+should be the names of the levels in the MultiIndex.
Version Changed

- Added the glue and axis parameters.

Parameters:

Name | Type | Description | Default
---|---|---|---
df | DataFrame | A pandas DataFrame. | required
sep | str | String separator used to join the column level names. | None
glue | str | A specification on how the column levels should be combined. It allows for a more granular composition, and serves as an alternative to sep. | None
axis | str | Determines whether to collapse the levels on the index or columns. | 'columns'

Returns:

Type | Description
---|---
DataFrame | A pandas DataFrame with single-level column index.
janitor/functions/collapse_levels.py
complete
+
+
+complete(df, *columns, sort=False, by=None, fill_value=None, explicit=True)
+
+Complete a data frame with missing combinations of data.
+It is modeled after tidyr's complete
function.
+In a way, it is the inverse of pd.dropna
, as it exposes
+implicitly missing rows.
The variable columns
parameter can be a column name,
+a list of column names,
+or a pandas Index, Series, or DataFrame.
+If a pandas Index, Series, or DataFrame is passed, it should
+have a name or names that exist in df
.
A callable can also be passed - the callable should evaluate
+to a pandas Index, Series, or DataFrame,
+and the names of the pandas object should exist in df
.
A dictionary can also be passed -
+the values of the dictionary should be
+either a 1D array
+or a callable that evaluates to a
+1D array,
+while the keys of the dictionary
+should exist in df
.
The user should ensure that the pandas object is unique and/or sorted
+- no checks are done to enforce uniqueness or sortedness.
+If by
is present, the DataFrame is completed per group.
+by
should be a column name, or a list of column names.
Examples:
+>>> import pandas as pd
+>>> import janitor
+>>> import numpy as np
+>>> df = pd.DataFrame(
+... {
+... "Year": [1999, 2000, 2004, 1999, 2004],
+... "Taxon": [
+... "Saccharina",
+... "Saccharina",
+... "Saccharina",
+... "Agarum",
+... "Agarum",
+... ],
+... "Abundance": [4, 5, 2, 1, 8],
+... }
+... )
+>>> df
+ Year Taxon Abundance
+0 1999 Saccharina 4
+1 2000 Saccharina 5
+2 2004 Saccharina 2
+3 1999 Agarum 1
+4 2004 Agarum 8
+
Expose missing pairings of Year
and Taxon
:
>>> df.complete("Year", "Taxon", sort=True)
+ Year Taxon Abundance
+0 1999 Agarum 1.0
+1 1999 Saccharina 4.0
+2 2000 Agarum NaN
+3 2000 Saccharina 5.0
+4 2004 Agarum 8.0
+5 2004 Saccharina 2.0
+
Expose missing years from 1999 to 2004:
+>>> index = pd.Index(range(1999,2005),name='Year')
+>>> df.complete(index, "Taxon", sort=True)
+ Year Taxon Abundance
+0 1999 Agarum 1.0
+1 1999 Saccharina 4.0
+2 2000 Agarum NaN
+3 2000 Saccharina 5.0
+4 2001 Agarum NaN
+5 2001 Saccharina NaN
+6 2002 Agarum NaN
+7 2002 Saccharina NaN
+8 2003 Agarum NaN
+9 2003 Saccharina NaN
+10 2004 Agarum 8.0
+11 2004 Saccharina 2.0
+
A dictionary can be used as well:
+>>> dictionary = {'Year':range(1999,2005)}
+>>> df.complete(dictionary, "Taxon", sort=True)
+ Year Taxon Abundance
+0 1999 Agarum 1.0
+1 1999 Saccharina 4.0
+2 2000 Agarum NaN
+3 2000 Saccharina 5.0
+4 2001 Agarum NaN
+5 2001 Saccharina NaN
+6 2002 Agarum NaN
+7 2002 Saccharina NaN
+8 2003 Agarum NaN
+9 2003 Saccharina NaN
+10 2004 Agarum 8.0
+11 2004 Saccharina 2.0
+
Fill missing values:
+>>> df = pd.DataFrame(
+... dict(
+... group=(1, 2, 1, 2),
+... item_id=(1, 2, 2, 3),
+... item_name=("a", "a", "b", "b"),
+... value1=(1, np.nan, 3, 4),
+... value2=range(4, 8),
+... )
+... )
+>>> df
+ group item_id item_name value1 value2
+0 1 1 a 1.0 4
+1 2 2 a NaN 5
+2 1 2 b 3.0 6
+3 2 3 b 4.0 7
+
>>> df.complete(
+... "group",
+... ["item_id", "item_name"],
+... fill_value={"value1": 0, "value2": 99},
+... sort=True
+... )
+ group item_id item_name value1 value2
+0 1 1 a 1.0 4.0
+1 1 2 a 0.0 99.0
+2 1 2 b 3.0 6.0
+3 1 3 b 0.0 99.0
+4 2 1 a 0.0 99.0
+5 2 2 a 0.0 5.0
+6 2 2 b 0.0 99.0
+7 2 3 b 4.0 7.0
+
Limit the fill to only implicit missing values
+by setting explicit to False
:
>>> df.complete(
+... "group",
+... ["item_id", "item_name"],
+... fill_value={"value1": 0, "value2": 99},
+... explicit=False,
+... sort=True
+... )
+ group item_id item_name value1 value2
+0 1 1 a 1.0 4.0
+1 1 2 a 0.0 99.0
+2 1 2 b 3.0 6.0
+3 1 3 b 0.0 99.0
+4 2 1 a 0.0 99.0
+5 2 2 a NaN 5.0
+6 2 2 b 0.0 99.0
+7 2 3 b 4.0 7.0
+
Expose missing rows per group, using a callable:
+>>> df = pd.DataFrame(
+... {
+... "state": ["CA", "CA", "HI", "HI", "HI", "NY", "NY"],
+... "year": [2010, 2013, 2010, 2012, 2016, 2009, 2013],
+... "value": [1, 3, 1, 2, 3, 2, 5],
+... }
+... )
+>>> df
+ state year value
+0 CA 2010 1
+1 CA 2013 3
+2 HI 2010 1
+3 HI 2012 2
+4 HI 2016 3
+5 NY 2009 2
+6 NY 2013 5
+
>>> def new_year_values(df):
+... return pd.RangeIndex(start=df.year.min(), stop=df.year.max() + 1, name='year')
+>>> df.complete(new_year_values, by='state',sort=True)
+ state year value
+0 CA 2010 1.0
+1 CA 2011 NaN
+2 CA 2012 NaN
+3 CA 2013 3.0
+4 HI 2010 1.0
+5 HI 2011 NaN
+6 HI 2012 2.0
+7 HI 2013 NaN
+8 HI 2014 NaN
+9 HI 2015 NaN
+10 HI 2016 3.0
+11 NY 2009 2.0
+12 NY 2010 NaN
+13 NY 2011 NaN
+14 NY 2012 NaN
+15 NY 2013 5.0
+
Parameters:

Name | Type | Description | Default
---|---|---|---
df | DataFrame | A pandas DataFrame. | required
*columns | Any | This refers to the columns to be completed. It could be a column name, a list of column names, or a pandas Index, Series, or DataFrame. It can also be a callable that gets evaluated to a pandas Index, Series, or DataFrame. It can also be a dictionary, where the values are either a 1D array or a callable that evaluates to a 1D array, while the keys of the dictionary should exist in df. | ()
sort | bool | Sort DataFrame based on *columns. | False
by | str or list | Label or list of labels to group by. The explicit missing rows are returned per group. | None
fill_value | dict or Any | Scalar value to use instead of NaN for missing combinations. A dictionary, mapping column names to a scalar value, is also accepted. | None
explicit | bool | Determines if only implicitly missing values should be filled (False), or all nulls existing in the dataframe (True). | True

Returns:

Type | Description
---|---
DataFrame | A pandas DataFrame with explicit missing rows, if any.
janitor/functions/complete.py
concatenate_columns
+
+
+concatenate_columns(df, column_names, new_column_name, sep='-', ignore_empty=True)
+
+Concatenates the set of columns into a single column.
+Used to quickly generate an index based on a group of columns.
+This method mutates the original DataFrame.
+ + +Examples:
+Concatenate two columns row-wise.
+>>> import pandas as pd
+>>> import janitor
+>>> df = pd.DataFrame({"a": [1, 3, 5], "b": list("xyz")})
+>>> df
+ a b
+0 1 x
+1 3 y
+2 5 z
+>>> df.concatenate_columns(
+... column_names=["a", "b"], new_column_name="m",
+... )
+ a b m
+0 1 x 1-x
+1 3 y 3-y
+2 5 z 5-z
+
Parameters:

Name | Type | Description | Default
---|---|---|---
df | DataFrame | A pandas DataFrame. | required
column_names | List[Hashable] | A list of columns to concatenate together. | required
new_column_name | Hashable | The name of the new column. | required
sep | str | The separator between each column's data. | '-'
ignore_empty | bool | Ignore null values if they exist. | True

Raises:

Type | Description
---|---
JanitorError | If at least two columns are not provided within column_names.

Returns:

Type | Description
---|---
DataFrame | A pandas DataFrame with concatenated columns.
janitor/functions/concatenate_columns.py
conditional_join
+
+
+conditional_join(df, right, *conditions, how='inner', df_columns=slice(None), right_columns=slice(None), keep='all', use_numba=False, indicator=False, force=False)
+
+The conditional_join function operates similarly to pd.merge
,
+but supports joins on inequality operators,
+or a combination of equi and non-equi joins.
Joins solely on equality are not supported.
+If the join is solely on equality, pd.merge
function
+covers that; if you are interested in nearest joins, asof joins,
+or rolling joins, then pd.merge_asof
covers that.
+There is also pandas' IntervalIndex, which is efficient for range joins,
+especially if the intervals do not overlap.
Column selection in df_columns
and right_columns
is possible using the
+select
syntax.
Performance might be improved by setting use_numba
to True
-
+this can be handy for equi joins that have lots of duplicated keys.
+This can also be handy for non-equi joins, where there are more than
+two join conditions,
+or there is significant overlap in the range join columns.
+This assumes that numba
is installed.
A noticeable performance improvement can be observed for range joins,
+if both join columns from the right dataframe
+are monotonically increasing.
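+A sketch with the df1/df2 frames from the examples below (numba must be
+installed for use_numba=True to take effect):
+>>> out = df1.conditional_join(
+...     df2,
+...     ("value_1", "value_2A", ">"),
+...     ("value_1", "value_2B", "<"),
+...     use_numba=True,
+... )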
+This function returns rows, if any, where values from df
meet the
+condition(s) for values from right
. The conditions are passed in
+as a variable argument of tuples, where the tuple is of
+the form (left_on, right_on, op)
; left_on
is the column
+label from df
, right_on
is the column label from right
,
+while op
is the operator.
For multiple conditions, the and (&) operator
+is used to combine the results of the individual conditions.
In some scenarios there might be performance gains if the less than join,
+or the greater than join condition, or the range condition
+is executed before the equi join - pass force=True
to force this.
The operator can be any of ==
, !=
, <=
, <
, >=
, >
.
There is no optimisation for the !=
operator.
The join is done only on the columns.
+For non-equi joins, only numeric, timedelta and date columns are supported.
+inner
, left
, right
and outer
joins are supported.
If the columns from df
and right
have nothing in common,
+a single index column is returned; else, a MultiIndex column
+is returned.
Examples:
+>>> import pandas as pd
+>>> import janitor
+>>> df1 = pd.DataFrame({"value_1": [2, 5, 7, 1, 3, 4]})
+>>> df2 = pd.DataFrame({"value_2A": [0, 3, 7, 12, 0, 2, 3, 1],
+... "value_2B": [1, 5, 9, 15, 1, 4, 6, 3],
+... })
+>>> df1
+ value_1
+0 2
+1 5
+2 7
+3 1
+4 3
+5 4
+>>> df2
+ value_2A value_2B
+0 0 1
+1 3 5
+2 7 9
+3 12 15
+4 0 1
+5 2 4
+6 3 6
+7 1 3
+
>>> df1.conditional_join(
+... df2,
+... ("value_1", "value_2A", ">"),
+... ("value_1", "value_2B", "<")
+... )
+ value_1 value_2A value_2B
+0 2 1 3
+1 5 3 6
+2 3 2 4
+3 4 3 5
+4 4 3 6
+
Select specific columns, after the join:
+>>> df1.conditional_join(
+... df2,
+... ("value_1", "value_2A", ">"),
+... ("value_1", "value_2B", "<"),
+... right_columns='value_2B',
+... how='left'
+... )
+ value_1 value_2B
+0 2 3.0
+1 5 6.0
+2 3 4.0
+3 4 5.0
+4 4 6.0
+5 7 NaN
+6 1 NaN
+
Rename columns, before the join:
+>>> (df1
+... .rename(columns={'value_1':'left_column'})
+... .conditional_join(
+... df2,
+... ("left_column", "value_2A", ">"),
+... ("left_column", "value_2B", "<"),
+... right_columns='value_2B',
+... how='outer')
+... )
+ left_column value_2B
+0 2.0 3.0
+1 5.0 6.0
+2 3.0 4.0
+3 4.0 5.0
+4 4.0 6.0
+5 7.0 NaN
+6 1.0 NaN
+7 NaN 1.0
+8 NaN 9.0
+9 NaN 15.0
+10 NaN 1.0
+
Get the first match:
+>>> df1.conditional_join(
+... df2,
+... ("value_1", "value_2A", ">"),
+... ("value_1", "value_2B", "<"),
+... keep='first'
+... )
+ value_1 value_2A value_2B
+0 2 1 3
+1 5 3 6
+2 3 2 4
+3 4 3 5
+
Get the last match:
+>>> df1.conditional_join(
+... df2,
+... ("value_1", "value_2A", ">"),
+... ("value_1", "value_2B", "<"),
+... keep='last'
+... )
+ value_1 value_2A value_2B
+0 2 1 3
+1 5 3 6
+2 3 2 4
+3 4 3 6
+
Add an indicator column:
+>>> df1.conditional_join(
+... df2,
+... ("value_1", "value_2A", ">"),
+... ("value_1", "value_2B", "<"),
+... how='outer',
+... indicator=True
+... )
+ value_1 value_2A value_2B _merge
+0 2.0 1.0 3.0 both
+1 5.0 3.0 6.0 both
+2 3.0 2.0 4.0 both
+3 4.0 3.0 5.0 both
+4 4.0 3.0 6.0 both
+5 7.0 NaN NaN left_only
+6 1.0 NaN NaN left_only
+7 NaN 0.0 1.0 right_only
+8 NaN 7.0 9.0 right_only
+9 NaN 12.0 15.0 right_only
+10 NaN 0.0 1.0 right_only
+
Version Changed

- Added df_columns, right_columns, keep and use_numba parameters.
- Added indicator parameter.
- col class supported.
- sort_by_appearance deprecated.
- col class deprecated.

Parameters:

Name | Type | Description | Default
---|---|---|---
df | DataFrame | A pandas DataFrame. | required
right | Union[DataFrame, Series] | Named Series or DataFrame to join to. | required
conditions | Any | Variable argument of tuple(s) of the form (left_on, right_on, op), where left_on is the column label from df, right_on is the column label from right, while op is the operator. The operator can be any of ==, !=, <=, <, >=, >. | ()
how | Literal['inner', 'left', 'right', 'outer'] | Indicates the type of join to be performed. It can be one of inner, left, right or outer. | 'inner'
df_columns | Optional[Any] | Columns to select from df in the final output dataframe. Column selection is possible using the select syntax. | slice(None)
right_columns | Optional[Any] | Columns to select from right in the final output dataframe. Column selection is possible using the select syntax. | slice(None)
use_numba | bool | Use numba, if installed, to accelerate the computation. | False
keep | Literal['first', 'last', 'all'] | Choose whether to return the first match, last match or all matches. | 'all'
indicator | Optional[Union[bool, str]] | If True, adds a column to the output DataFrame called _merge with information on the source of each row. The column can be given a different name by providing a string argument. | False
force | bool | If True, force the non-equi join conditions to execute before the equi join. | False

Returns:

Type | Description
---|---
DataFrame | A pandas DataFrame of the two merged Pandas objects.
janitor/functions/conditional_join.py
get_join_indices(df, right, conditions, keep='all', use_numba=False, force=False, return_ragged_arrays=False)
+
+Convenience function to return the matching indices from an inner join.
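+A minimal sketch, assuming get_join_indices is importable from the top-level
+janitor namespace (df1/df2 mirror the conditional_join examples above, and
+the returned positions assume the default RangeIndex):
+>>> import pandas as pd
+>>> import janitor as jn
+>>> df1 = pd.DataFrame({"value_1": [2, 5, 7, 1, 3, 4]})
+>>> df2 = pd.DataFrame({"value_2A": [0, 3, 7, 12, 0, 2, 3, 1],
+...                     "value_2B": [1, 5, 9, 15, 1, 4, 6, 3]})
+>>> left_index, right_index = jn.get_join_indices(
+...     df1, df2,
+...     [("value_1", "value_2A", ">"), ("value_1", "value_2B", "<")],
+... )
+>>> matched = df1.loc[left_index].reset_index(drop=True).join(
+...     df2.loc[right_index].reset_index(drop=True))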
+New in version 0.27.0
+Version Changed
+Parameters:
+Name | +Type | +Description | +Default | +
---|---|---|---|
+ df
+ |
+
+ DataFrame
+ |
+
+
+
+ A pandas DataFrame. + |
+ + required + | +
+ right
+ |
+
+ Union[DataFrame, Series]
+ |
+
+
+
+ Named Series or DataFrame to join to. + |
+ + required + | +
+ conditions
+ |
+
+ list[tuple[str]]
+ |
+
+
+
+ List of arguments of tuple(s) of the form
+ |
+ + required + | +
+ use_numba
+ |
+
+ bool
+ |
+
+
+
+ Use numba, if installed, to accelerate the computation. + |
+
+ False
+ |
+
+ keep
+ |
+
+ Literal['first', 'last', 'all']
+ |
+
+
+
+ Choose whether to return the first match, last match or all matches. + |
+
+ 'all'
+ |
+
+ force
+ |
+
+ bool
+ |
+
+
+
+ If |
+
+ False
+ |
+
+ return_ragged_arrays
+ |
+
+ bool
+ |
+
+
+
+ If |
+
+ False
+ |
+
Returns:
+Type | +Description | +
---|---|
+ tuple[ndarray, ndarray]
+ |
+
+
+
+ A tuple of indices for the rows in the dataframes that match. + |
+
janitor/functions/conditional_join.py
convert_date
+
+
+convert_excel_date(df, column_names)
+
+Convert Excel's serial date format into Python datetime format.
+This method does not mutate the original DataFrame.
+Implementation is based on +Stack Overflow.
+ + +Examples:
+>>> import pandas as pd
+>>> import janitor
+>>> df = pd.DataFrame({"date": [39690, 39690, 37118]})
+>>> df
+ date
+0 39690
+1 39690
+2 37118
+>>> df.convert_excel_date('date')
+ date
+0 2008-08-30
+1 2008-08-30
+2 2001-08-15
+
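+The same conversion can be expressed with plain pandas (a sketch, not the
+library's internal code; Excel's 1900 date system puts serial day 0 at
+1899-12-30):
+>>> converted = pd.to_datetime(df['date'], unit='D', origin='1899-12-30')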
Parameters:

Name | Type | Description | Default
---|---|---|---
df | DataFrame | A pandas DataFrame. | required
column_names | Union[Hashable, list] | A column name, or a list of column names. | required

Returns:

Type | Description
---|---
DataFrame | A pandas DataFrame with corrected dates.
janitor/functions/convert_date.py
convert_matlab_date(df, column_names)
+
+Convert Matlab's serial date number into Python datetime format.
+Implementation is based on +Stack Overflow.
+This method does not mutate the original DataFrame.
+ + +Examples:
+>>> import pandas as pd
+>>> import janitor
+>>> df = pd.DataFrame({"date": [737125.0, 737124.815863, 737124.4985, 737124]})
+>>> df
+ date
+0 737125.000000
+1 737124.815863
+2 737124.498500
+3 737124.000000
+>>> df.convert_matlab_date('date')
+ date
+0 2018-03-06 00:00:00.000000000
+1 2018-03-05 19:34:50.563199671
+2 2018-03-05 11:57:50.399998876
+3 2018-03-05 00:00:00.000000000
+
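+Equivalent logic with plain pandas (a sketch; MATLAB's datenum 719529
+corresponds to the Unix epoch, 1970-01-01):
+>>> converted = pd.to_datetime(df['date'] - 719529, unit='D')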
Parameters:

Name | Type | Description | Default
---|---|---|---
df | DataFrame | A pandas DataFrame. | required
column_names | Union[Hashable, list] | A column name, or a list of column names. | required

Returns:

Type | Description
---|---
DataFrame | A pandas DataFrame with corrected dates.
janitor/functions/convert_date.py
convert_unix_date(df, column_name)
+
+Convert unix epoch time into Python datetime format.
+Note that this ignores the local timezone and converts all timestamps
+to naive datetime based on UTC!
+This method mutates the original DataFrame.
+Note
+This function will be deprecated in a 1.x release.
+Please use pd.to_datetime
instead.
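+The suggested replacement is direct (a sketch, using the df from the
+example below):
+>>> df['date'] = pd.to_datetime(df['date'], unit='s')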
Examples:
+>>> import pandas as pd
+>>> import janitor
+>>> df = pd.DataFrame({"date": [1651510462, 53394822, 1126233195]})
+>>> df
+ date
+0 1651510462
+1 53394822
+2 1126233195
+>>> df.convert_unix_date('date')
+ date
+0 2022-05-02 16:54:22
+1 1971-09-10 23:53:42
+2 2005-09-09 02:33:15
+
Parameters:

Name | Type | Description | Default
---|---|---|---
df | DataFrame | A pandas DataFrame. | required
column_name | Hashable | A column name. | required

Returns:

Type | Description
---|---
DataFrame | A pandas DataFrame with corrected dates.
janitor/functions/convert_date.py
count_cumulative_unique
+
+
+Implementation of count_cumulative_unique.
+count_cumulative_unique(df, column_name, dest_column_name, case_sensitive=True)
+
+Generates a running total of cumulative unique values in a given column.
+A new column will be created containing a running
+count of unique values in the specified column.
+If case_sensitive
is True
, then the case of
+any letters will matter (i.e., a != A
);
+otherwise, the case of any letters will not matter.
This method does not mutate the original DataFrame.
+ + +Examples:
+>>> import pandas as pd
+>>> import janitor
+>>> df = pd.DataFrame({
+... "letters": list("aabABb"),
+... "numbers": range(4, 10),
+... })
+>>> df
+ letters numbers
+0 a 4
+1 a 5
+2 b 6
+3 A 7
+4 B 8
+5 b 9
+>>> df.count_cumulative_unique(
+... column_name="letters",
+... dest_column_name="letters_unique_count",
+... )
+ letters numbers letters_unique_count
+0 a 4 1
+1 a 5 1
+2 b 6 2
+3 A 7 3
+4 B 8 4
+5 b 9 4
+
Cumulative counts, ignoring casing.
+>>> df.count_cumulative_unique(
+... column_name="letters",
+... dest_column_name="letters_unique_count",
+... case_sensitive=False,
+... )
+ letters numbers letters_unique_count
+0 a 4 1
+1 a 5 1
+2 b 6 2
+3 A 7 2
+4 B 8 2
+5 b 9 2
+
Parameters:

Name | Type | Description | Default
---|---|---|---
df | DataFrame | A pandas DataFrame. | required
column_name | Hashable | Name of the column containing values from which a running count of unique values will be created. | required
dest_column_name | str | The name of the new column containing the cumulative count of unique values that will be created. | required
case_sensitive | bool | Whether or not uppercase and lowercase letters will be considered equal. Only valid with string-like columns. | True

Raises:

Type | Description
---|---
TypeError | If case_sensitive is True and the values in column_name are not string-like.

Returns:

Type | Description
---|---
DataFrame | A pandas DataFrame with a new column containing a cumulative count of unique values from another column.
janitor/functions/count_cumulative_unique.py
currency_column_to_numeric
+
+
+currency_column_to_numeric(df, column_name, cleaning_style=None, cast_non_numeric=None, fill_all_non_numeric=None, remove_non_numeric=False)
+
+Convert currency column to numeric.
+This method does not mutate the original DataFrame.
+This method allows one to take a column containing currency values,
+inadvertently imported as a string, and cast it as a float. This is
+usually the case when reading CSV files that were modified in Excel.
+Empty strings (i.e. ''
) are retained as NaN
values.
Examples:
+>>> import pandas as pd
+>>> import janitor
+>>> df = pd.DataFrame({
+... "a_col": [" 24.56", "-", "(12.12)", "1,000,000"],
+... "d_col": ["", "foo", "1.23 dollars", "-1,000 yen"],
+... })
+>>> df
+ a_col d_col
+0 24.56
+1 - foo
+2 (12.12) 1.23 dollars
+3 1,000,000 -1,000 yen
+
The default cleaning style.
+>>> df.currency_column_to_numeric("d_col")
+ a_col d_col
+0 24.56 NaN
+1 - NaN
+2 (12.12) 1.23
+3 1,000,000 -1000.00
+
The accounting cleaning style.
+>>> df.currency_column_to_numeric("a_col", cleaning_style="accounting")
+ a_col d_col
+0 24.56
+1 0.00 foo
+2 -12.12 1.23 dollars
+3 1000000.00 -1,000 yen
+
Valid cleaning styles are:
+- None: Default cleaning is applied. Empty strings are always retained as
+  NaN. Numbers, -, . are extracted and the resulting string is cast to a float.
+- 'accounting': Replaces numbers in parentheses with negatives, removes commas.
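+A sketch of cast_non_numeric under the default cleaning style, using the
+df from the examples above (the mapping is illustrative):
+>>> out = df.currency_column_to_numeric(
+...     "d_col", cast_non_numeric={"foo": 999},
+... )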
+Name | +Type | +Description | +Default | +
---|---|---|---|
+ df
+ |
+
+ DataFrame
+ |
+
+
+
+ The pandas DataFrame. + |
+ + required + | +
+ column_name
+ |
+
+ str
+ |
+
+
+
+ The column containing currency values to modify. + |
+ + required + | +
+ cleaning_style
+ |
+
+ Optional[str]
+ |
+
+
+
+ What style of cleaning to perform. + |
+
+ None
+ |
+
+ cast_non_numeric
+ |
+
+ Optional[dict]
+ |
+
+
+
+ A dict of how to coerce certain strings to numeric
+type. For example, if there are values of 'REORDER' in the DataFrame,
+ |
+
+ None
+ |
+
+ fill_all_non_numeric
+ |
+
+ Optional[Union[float, int]]
+ |
+
+
+
+ Similar to |
+
+ None
+ |
+
+ remove_non_numeric
+ |
+
+ bool
+ |
+
+
+
+ If set to True, rows of |
+
+ False
+ |
+
Raises:
+Type | +Description | +
---|---|
+ ValueError
+ |
+
+
+
+ If |
+
Returns:
+Type | +Description | +
---|---|
+ DataFrame
+ |
+
+
+
+ A pandas DataFrame. + |
+
janitor/functions/currency_column_to_numeric.py
deconcatenate_column
+
+
+Implementation of deconcatenating columns.
+deconcatenate_column(df, column_name, sep=None, new_column_names=None, autoname=None, preserve_position=False)
+
+De-concatenates a single column into multiple columns.
+The column to de-concatenate can be either a collection (list, tuple, ...)
+which can be separated out with pd.Series.tolist()
,
+or a string to slice based on sep
.
To determine this behaviour automatically, +the first element in the column specified is inspected.
+If it is a string, then sep
must be specified.
+Else, the function assumes that it is an iterable type
+(e.g. list
or tuple
),
+and will attempt to deconcatenate by splitting the list.
Given a column with string values, this is the inverse of the
+concatenate_columns
+function.
Used to quickly split columns out of a single column.
+ + +Examples:
+>>> import pandas as pd
+>>> import janitor
+>>> df = pd.DataFrame({"m": ["1-x", "2-y", "3-z"]})
+>>> df
+ m
+0 1-x
+1 2-y
+2 3-z
+>>> df.deconcatenate_column("m", sep="-", autoname="col")
+ m col1 col2
+0 1-x 1 x
+1 2-y 2 y
+2 3-z 3 z
+
The keyword argument preserve_position
+takes True
or False
boolean
+that controls whether the new_column_names
+will take the original position
+of the to-be-deconcatenated column_name
:
preserve_position=False
(default), df.columns
change from
+ [..., column_name, ...]
to [..., column_name, ..., new_column_names]
.
+ In other words, the deconcatenated new columns are appended to the right
+ of the original dataframe and the original column_name
is NOT dropped.preserve_position=True
, df.column
change from
+ [..., column_name, ...]
to [..., new_column_names, ...]
.
+ In other words, the deconcatenated new column will REPLACE the original
+ column_name
at its original position, and column_name
itself
+ is dropped.The keyword argument autoname
accepts a base string
+and then automatically creates numbered column names
+based off the base string.
+For example, if col
is passed in as the argument to autoname
,
+and 4 columns are created, then the resulting columns will be named
+col1, col2, col3, col4
.
+Numbering is always 1-indexed, not 0-indexed,
+in order to make the column names human-friendly.
This method does not mutate the original DataFrame.
+ + +Parameters:
+Name | +Type | +Description | +Default | +
---|---|---|---|
+ df
+ |
+
+ DataFrame
+ |
+
+
+
+ A pandas DataFrame. + |
+ + required + | +
+ column_name
+ |
+
+ Hashable
+ |
+
+
+
+ The column to split. + |
+ + required + | +
+ sep
+ |
+
+ Optional[str]
+ |
+
+
+
+ The separator delimiting the column's data. + |
+
+ None
+ |
+
+ new_column_names
+ |
+
+ Optional[Union[List[str], Tuple[str]]]
+ |
+
+
+
+ A list of new column names post-splitting. + |
+
+ None
+ |
+
+ autoname
+ |
+
+ str
+ |
+
+
+
+ A base name for automatically naming the new columns.
+Takes precedence over |
+
+ None
+ |
+
+ preserve_position
+ |
+
+ bool
+ |
+
+
+
+ Boolean for whether or not to preserve original +position of the column upon de-concatenation. + |
+
+ False
+ |
+
Raises:
+Type | +Description | +
---|---|
+ ValueError
+ |
+
+
+
+ If |
+
+ ValueError
+ |
+
+
+
+ If |
+
+ ValueError
+ |
+
+
+
+ If either |
+
+ JanitorError
+ |
+
+
+
+ If incorrect number of names is provided
+within |
+
Returns:
+Type | +Description | +
---|---|
+ DataFrame
+ |
+
+
+
+ A pandas DataFrame with a deconcatenated column. + |
+
janitor/functions/deconcatenate_column.py
drop_constant_columns
+
+
+Implementation of drop_constant_columns.
+drop_constant_columns(df)
+
+Finds and drops the constant columns from a Pandas DataFrame.
+ + +Examples:
+>>> import pandas as pd
+>>> import janitor
+>>> data_dict = {
+... "a": [1, 1, 1],
+... "b": [1, 2, 3],
+... "c": [1, 1, 1],
+... "d": ["rabbit", "leopard", "lion"],
+... "e": ["Cambridge", "Shanghai", "Basel"]
+... }
+>>> df = pd.DataFrame(data_dict)
+>>> df
+ a b c d e
+0 1 1 1 rabbit Cambridge
+1 1 2 1 leopard Shanghai
+2 1 3 1 lion Basel
+>>> df.drop_constant_columns()
+ b d e
+0 1 rabbit Cambridge
+1 2 leopard Shanghai
+2 3 lion Basel
+
Parameters:

Name | Type | Description | Default
---|---|---|---
df | DataFrame | Input Pandas DataFrame. | required

Returns:

Type | Description
---|---
DataFrame | The Pandas DataFrame with the constant columns dropped.
janitor/functions/drop_constant_columns.py
drop_duplicate_columns
+
+
+Implementation for drop_duplicate_columns
.
drop_duplicate_columns(df, column_name, nth_index=0)
+
+Remove a duplicated column specified by column_name
.
Specifying nth_index=0
will remove the first column,
+nth_index=1
will remove the second column,
+and so on and so forth.
The corresponding tidyverse R's library is:
+select(-<column_name>_<nth_index + 1>)
Examples:
+>>> import pandas as pd
+>>> import janitor
+>>> df = pd.DataFrame({
+... "a": range(2, 5),
+... "b": range(3, 6),
+... "A": range(4, 7),
+... "a*": range(6, 9),
+... }).clean_names(remove_special=True)
+>>> df
+ a b a a
+0 2 3 4 6
+1 3 4 5 7
+2 4 5 6 8
+>>> df.drop_duplicate_columns(column_name="a", nth_index=1)
+ a b a
+0 2 3 6
+1 3 4 7
+2 4 5 8
+
Parameters:

Name | Type | Description | Default
---|---|---|---
df | DataFrame | A pandas DataFrame. | required
column_name | Hashable | Name of duplicated columns. | required
nth_index | int | Among the duplicated columns, select the nth column to drop. | 0

Returns:

Type | Description
---|---
DataFrame | A pandas DataFrame.
janitor/functions/drop_duplicate_columns.py
dropnotnull
+
+
+Implementation source for dropnotnull
.
dropnotnull(df, column_name)
+
+Drop rows that do not have null values in the given column.
+This method does not mutate the original DataFrame.
+ + +Examples:
+>>> import numpy as np
+>>> import pandas as pd
+>>> import janitor
+>>> df = pd.DataFrame({"a": [1., np.NaN, 3.], "b": [None, "y", "z"]})
+>>> df
+ a b
+0 1.0 None
+1 NaN y
+2 3.0 z
+>>> df.dropnotnull("a")
+ a b
+1 NaN y
+>>> df.dropnotnull("b")
+ a b
+0 1.0 None
+
Parameters:

Name | Type | Description | Default
---|---|---|---
df | DataFrame | A pandas DataFrame. | required
column_name | Hashable | The column name to drop rows from. | required

Returns:

Type | Description
---|---
DataFrame | A pandas DataFrame with dropped rows.
janitor/functions/dropnotnull.py
encode_categorical
+
+
+encode_categorical(df, column_names=None, **kwargs)
+
+Encode the specified columns with Pandas' category dtype.
+It is syntactic sugar around pd.Categorical
.
This method does not mutate the original DataFrame.
+Simply pass a string, or a sequence of column names to column_names
;
+alternatively, you can pass kwargs, where the keys are the column names
+and the values can either be None, sort
, appearance
+or a 1-D array-like object.
sort
: column is cast to an ordered categorical,
+ with the order defined by the sort-order of the categories.appearance
: column is cast to an ordered categorical,
+ with the order defined by the order of appearance
+ in the original column.column_names
and kwargs
parameters cannot be used at the same time.
Examples:
+Using column_names
>>> import pandas as pd
+>>> import janitor
+>>> df = pd.DataFrame({
+... "foo": ["b", "b", "a", "c", "b"],
+... "bar": range(4, 9),
+... })
+>>> df
+ foo bar
+0 b 4
+1 b 5
+2 a 6
+3 c 7
+4 b 8
+>>> df.dtypes
+foo object
+bar int64
+dtype: object
+>>> enc_df = df.encode_categorical(column_names="foo")
+>>> enc_df.dtypes
+foo category
+bar int64
+dtype: object
+>>> enc_df["foo"].cat.categories
+Index(['a', 'b', 'c'], dtype='object')
+>>> enc_df["foo"].cat.ordered
+False
+
Using kwargs
to specify an ordered categorical.
>>> import pandas as pd
+>>> import janitor
+>>> df = pd.DataFrame({
+... "foo": ["b", "b", "a", "c", "b"],
+... "bar": range(4, 9),
+... })
+>>> df.dtypes
+foo object
+bar int64
+dtype: object
+>>> enc_df = df.encode_categorical(foo="appearance")
+>>> enc_df.dtypes
+foo category
+bar int64
+dtype: object
+>>> enc_df["foo"].cat.categories
+Index(['b', 'a', 'c'], dtype='object')
+>>> enc_df["foo"].cat.ordered
+True
+
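+The kwargs route also accepts an explicit 1-D array of categories
+(a sketch; the array is assumed to cover the observed values):
+>>> enc_df = df.encode_categorical(foo=["c", "b", "a"])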
Parameters:

Name | Type | Description | Default
---|---|---|---
df | DataFrame | A pandas DataFrame object. | required
column_names | Union[str, Iterable[str], Hashable] | A column name or an iterable (list or tuple) of column names. | None
**kwargs | Any | A mapping from column name to either None, 'sort', 'appearance', or a 1-D array-like object. | {}

Raises:

Type | Description
---|---
ValueError | If both column_names and kwargs are provided.

Returns:

Type | Description
---|---
DataFrame | A pandas DataFrame.
janitor/functions/encode_categorical.py
expand_column
+
+
+Implementation for expand_column.
+expand_column(df, column_name, sep='|', concat=True)
+
+Expand a categorical column with multiple labels into dummy-coded columns.
+Super sugary syntax that wraps pandas.Series.str.get_dummies
.
This method does not mutate the original DataFrame.
+ + +Examples:
+Functional usage syntax:
+>>> import pandas as pd
+>>> from janitor.functions import expand_column
+>>> df = pd.DataFrame(
+... {
+... "col1": ["A, B", "B, C, D", "E, F", "A, E, F"],
+... "col2": [1, 2, 3, 4],
+... }
+... )
+>>> df = expand_column(
+... df,
+... column_name="col1",
+... sep=", " # note space in sep
+... )
+>>> df
+ col1 col2 A B C D E F
+0 A, B 1 1 1 0 0 0 0
+1 B, C, D 2 0 1 1 1 0 0
+2 E, F 3 0 0 0 0 1 1
+3 A, E, F 4 1 0 0 0 1 1
+
Method chaining syntax:
+>>> import pandas as pd
+>>> import janitor
+>>> df = (
+... pd.DataFrame(
+... {
+... "col1": ["A, B", "B, C, D", "E, F", "A, E, F"],
+... "col2": [1, 2, 3, 4],
+... }
+... )
+... .expand_column(
+... column_name='col1',
+... sep=', '
+... )
+... )
+>>> df
+ col1 col2 A B C D E F
+0 A, B 1 1 1 0 0 0 0
+1 B, C, D 2 0 1 1 1 0 0
+2 E, F 3 0 0 0 0 1 1
+3 A, E, F 4 1 0 0 0 1 1
+
Parameters:

Name | Type | Description | Default
---|---|---|---
df | DataFrame | A pandas DataFrame. | required
column_name | Hashable | Which column to expand. | required
sep | str | The delimiter, same as pandas.Series.str.get_dummies's sep. | '|'
concat | bool | Whether to return the expanded column concatenated to the original dataframe (True), or to return it standalone (False). | True

Returns:

Type | Description
---|---
DataFrame | A pandas DataFrame with an expanded column.
janitor/functions/expand_column.py
expand_grid
+
+
+Implementation source for expand_grid
.
cartesian_product(*inputs, sort=False)
+
+Creates a DataFrame from a cartesian combination of all inputs.
+Inspiration is from tidyr's expand_grid() function.
+The input argument should be a pandas Index/Series/DataFrame, +or a dictionary - the values of the dictionary should be +a 1D array.
+ + +Examples:
+>>> import pandas as pd
+>>> import janitor as jn
+>>> df = pd.DataFrame({"x": [1, 2], "y": [2, 1]})
+>>> data = pd.Series([1, 2, 3], name='z')
+>>> jn.cartesian_product(df, data)
+ x y z
+0 1 2 1
+1 1 2 2
+2 1 2 3
+3 2 1 1
+4 2 1 2
+5 2 1 3
+
cartesian_product
also works with non-pandas objects:
>>> data = {"x": [1, 2, 3], "y": [1, 2]}
+>>> jn.cartesian_product(data)
+ x y
+0 1 1
+1 1 2
+2 2 1
+3 2 2
+4 3 1
+5 3 2
+
Parameters:

Name | Type | Description | Default
---|---|---|---
*inputs | tuple | Variable arguments. The arguments should be a pandas Index/Series/DataFrame, or a dictionary, where the values in the dictionary are a 1D array. | ()
sort | bool | If True, sort the output DataFrame. | False

Returns:

Type | Description
---|---
DataFrame | A pandas DataFrame.
janitor/functions/expand_grid.py
expand(df, *columns, sort=False, by=None)
+
+Creates a DataFrame from a cartesian combination of all inputs.
+Inspiration is from tidyr's expand() function.
+expand() is often useful with
+pd.merge
+to convert implicit
+missing values to explicit missing values - similar to
+complete
.
It can also be used to figure out which combinations are missing +(e.g identify gaps in your DataFrame).
+The variable columns
parameter can be a column name,
+a list of column names, a pandas Index/Series/DataFrame,
+or a callable, which when applied to the DataFrame,
+evaluates to a pandas Index/Series/DataFrame.
A dictionary can also be passed
+to the variable columns
parameter -
+the values of the dictionary should be
+either be a 1D array
+or a callable that evaluates to a
+1D array. The array should be unique;
+no check is done to verify this.
If by
is present, the DataFrame is expanded per group.
+by
should be a column name, or a list of column names.
Examples:
+>>> import pandas as pd
+>>> import janitor
+>>> data = [{'type': 'apple', 'year': 2010, 'size': 'XS'},
+... {'type': 'orange', 'year': 2010, 'size': 'S'},
+... {'type': 'apple', 'year': 2012, 'size': 'M'},
+... {'type': 'orange', 'year': 2010, 'size': 'S'},
+... {'type': 'orange', 'year': 2011, 'size': 'S'},
+... {'type': 'orange', 'year': 2012, 'size': 'M'}]
+>>> df = pd.DataFrame(data)
+>>> df
+ type year size
+0 apple 2010 XS
+1 orange 2010 S
+2 apple 2012 M
+3 orange 2010 S
+4 orange 2011 S
+5 orange 2012 M
+
Get unique observations:
+>>> df.expand('type')
+ type
+0 apple
+1 orange
+>>> df.expand('size')
+ size
+0 XS
+1 S
+2 M
+>>> df.expand('type', 'size')
+ type size
+0 apple XS
+1 apple S
+2 apple M
+3 orange XS
+4 orange S
+5 orange M
+>>> df.expand('type','size','year')
+ type size year
+0 apple XS 2010
+1 apple XS 2012
+2 apple XS 2011
+3 apple S 2010
+4 apple S 2012
+5 apple S 2011
+6 apple M 2010
+7 apple M 2012
+8 apple M 2011
+9 orange XS 2010
+10 orange XS 2012
+11 orange XS 2011
+12 orange S 2010
+13 orange S 2012
+14 orange S 2011
+15 orange M 2010
+16 orange M 2012
+17 orange M 2011
+
Get observations that only occur in the data:
+>>> df.expand(['type','size'])
+ type size
+0 apple XS
+1 orange S
+2 apple M
+3 orange M
+>>> df.expand(['type','size','year'])
+ type size year
+0 apple XS 2010
+1 orange S 2010
+2 apple M 2012
+3 orange S 2011
+4 orange M 2012
+
Expand the DataFrame to include new observations:
+>>> df.expand('type','size',{'new_year':range(2010,2014)})
+ type size new_year
+0 apple XS 2010
+1 apple XS 2011
+2 apple XS 2012
+3 apple XS 2013
+4 apple S 2010
+5 apple S 2011
+6 apple S 2012
+7 apple S 2013
+8 apple M 2010
+9 apple M 2011
+10 apple M 2012
+11 apple M 2013
+12 orange XS 2010
+13 orange XS 2011
+14 orange XS 2012
+15 orange XS 2013
+16 orange S 2010
+17 orange S 2011
+18 orange S 2012
+19 orange S 2013
+20 orange M 2010
+21 orange M 2011
+22 orange M 2012
+23 orange M 2013
+
Filter for missing observations:
+>>> combo = df.expand('type','size','year')
+>>> anti_join = df.merge(combo, how='right', indicator=True)
+>>> anti_join.query("_merge=='right_only'").drop(columns="_merge")
+ type year size
+1 apple 2012 XS
+2 apple 2011 XS
+3 apple 2010 S
+4 apple 2012 S
+5 apple 2011 S
+6 apple 2010 M
+8 apple 2011 M
+9 orange 2010 XS
+10 orange 2012 XS
+11 orange 2011 XS
+14 orange 2012 S
+16 orange 2010 M
+18 orange 2011 M
+
Expand within each group, using by
:
>>> df.expand('year','size',by='type')
+ year size
+type
+apple 2010 XS
+apple 2010 M
+apple 2012 XS
+apple 2012 M
+orange 2010 S
+orange 2010 M
+orange 2011 S
+orange 2011 M
+orange 2012 S
+orange 2012 M
+
Parameters:

Name | Type | Description | Default
---|---|---|---
df | DataFrame | A pandas DataFrame. | required
columns | tuple | Specification of columns to expand. It could be column labels, a list/tuple of column labels, or a pandas Index/Series/DataFrame. It can also be a callable; the callable will be applied to the entire DataFrame, and should return a pandas Series/Index/DataFrame. It can also be a dictionary, where the values are either a 1D array or a callable that evaluates to a 1D array. The array should be unique; no check is done to verify this. | ()
sort | bool | If True, sort the DataFrame. | False
by | str or list | Label or list of labels to group by. | None

Returns:

Type | Description
---|---
DataFrame | A pandas DataFrame.
janitor/functions/expand_grid.py
expand_grid(df=None, df_key=None, *, others=None)
+
+Creates a DataFrame from a cartesian combination of all inputs.
+Note
+This function will be deprecated in a 1.x release;
+use cartesian_product
+instead.
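+A migration sketch, mirroring the expand_grid example below with
+cartesian_product (minus the df_key naming):
+>>> import pandas as pd
+>>> import janitor as jn
+>>> df = pd.DataFrame({"x": [1, 2], "y": [2, 1]})
+>>> out = jn.cartesian_product(df, pd.Series([1, 2, 3], name="z"))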
It is not restricted to a pandas DataFrame; +it can work with any list-like structure +that is 1 or 2 dimensional.
+If method-chaining to a DataFrame, a string argument
+to df_key
parameter must be provided.
Data types are preserved in this function, +including pandas' extension array dtypes.
+The output will always be a DataFrame, usually with a MultiIndex column,
+with the keys of the others
dictionary serving as the top level columns.
If a pandas Series/DataFrame is passed, and has a labeled index, or +a MultiIndex index, the index is discarded; the final DataFrame +will have a RangeIndex.
+The MultiIndexed DataFrame can be flattened using pyjanitor's
+collapse_levels
+method; the user can also decide to drop any of the levels, via pandas'
+droplevel
method.
Examples:
+>>> import pandas as pd
+>>> import janitor as jn
+>>> df = pd.DataFrame({"x": [1, 2], "y": [2, 1]})
+>>> data = {"z": [1, 2, 3]}
+>>> df.expand_grid(df_key="df", others=data)
+ df z
+ x y 0
+0 1 2 1
+1 1 2 2
+2 1 2 3
+3 2 1 1
+4 2 1 2
+5 2 1 3
+
expand_grid
works with non-pandas objects:
>>> data = {"x": [1, 2, 3], "y": [1, 2]}
+>>> jn.expand_grid(others=data)
+ x y
+ 0 0
+0 1 1
+1 1 2
+2 2 1
+3 2 2
+4 3 1
+5 3 2
+
Parameters:

Name | Type | Description | Default
---|---|---|---
df | Optional[DataFrame] | A pandas DataFrame. | None
df_key | Optional[str] | Name of key for the dataframe. It becomes part of the column names of the dataframe. | None
others | Optional[dict] | A dictionary that contains the data to be combined with the dataframe. If no dataframe exists, all inputs in others will be combined to create a DataFrame. | None

Raises:

Type | Description
---|---
KeyError | If there is a DataFrame and df_key is not provided.

Returns:

Type | Description
---|---
Union[DataFrame, None] | A pandas DataFrame of the cartesian product; if there is nothing to combine, None is returned.
janitor/functions/expand_grid.py
explode_index
+
+
+Implementation of the explode_index
function.
explode_index(df, names_sep=None, names_pattern=None, axis='columns', level_names=None)
+
+Explode a single index DataFrame into a MultiIndex DataFrame.
+This method does not mutate the original DataFrame.
+ + +Examples:
+>>> import pandas as pd
+>>> import janitor
+>>> df = pd.DataFrame(
+... {'max_speed_mean': [267.3333333333333, 50.5],
+... 'max_speed_median': [389.0, 50.5]})
+>>> df
+ max_speed_mean max_speed_median
+0 267.333333 389.0
+1 50.500000 50.5
+>>> df.explode_index(names_sep='_',axis='columns')
+ max
+ speed
+ mean median
+0 267.333333 389.0
+1 50.500000 50.5
+>>> df.explode_index(names_pattern=r"(.+speed)_(.+)",axis='columns')
+ max_speed
+ mean median
+0 267.333333 389.0
+1 50.500000 50.5
+>>> df.explode_index(
+... names_pattern=r"(?P<measurement>.+speed)_(?P<aggregation>.+)",
+... axis='columns'
+... )
+measurement max_speed
+aggregation mean median
+0 267.333333 389.0
+1 50.500000 50.5
+>>> df.explode_index(
+... names_sep='_',
+... axis='columns',
+... level_names = ['min or max', 'measurement','aggregation']
+... )
+min or max max
+measurement speed
+aggregation mean median
+0 267.333333 389.0
+1 50.500000 50.5
+
Parameters:

Name | Type | Description | Default
---|---|---|---
df | DataFrame | A pandas DataFrame. | required
names_sep | Union[str, None] | String or compiled regex used to split the column/index into levels. | None
names_pattern | Union[str, None] | Regex to extract new levels from the column/index. | None
axis | str | 'index/columns'. Determines which axis to explode. | 'columns'
level_names | list | Names of the levels in the MultiIndex. | None

Returns:

Type | Description
---|---
DataFrame | A pandas DataFrame with a MultiIndex.
janitor/functions/explode_index.py
factorize_columns
+
+
+Implementation of the factorize_columns
function
factorize_columns(df, column_names, suffix='_enc', **kwargs)
+
+Converts labels into numerical data.
+This method will create a new column with the string _enc
appended
+after the original column's name.
+This can be overridden with the suffix parameter.
Internally, this method uses pandas factorize
method.
+It also accepts an optional suffix and keyword arguments.
+An empty string as suffix will override the existing column.
This method does not mutate the original DataFrame.
+ + +Examples:
+>>> import pandas as pd
+>>> import janitor
+>>> df = pd.DataFrame({
+... "foo": ["b", "b", "a", "c", "b"],
+... "bar": range(4, 9),
+... })
+>>> df
+ foo bar
+0 b 4
+1 b 5
+2 a 6
+3 c 7
+4 b 8
+>>> df.factorize_columns(column_names="foo")
+ foo bar foo_enc
+0 b 4 0
+1 b 5 0
+2 a 6 1
+3 c 7 2
+4 b 8 0
+
Parameters:

Name | Type | Description | Default
---|---|---|---
df | DataFrame | The pandas DataFrame object. | required
column_names | Union[str, Iterable[str], Hashable] | A column name or an iterable (list or tuple) of column names. | required
suffix | str | Suffix to be used for the new column. An empty string suffix means it will override the existing column. | '_enc'
**kwargs | Any | Keyword arguments. It takes any of the keyword arguments which the pandas factorize method takes, like sort and size_hint. | {}

Returns:

Type | Description
---|---
DataFrame | A pandas DataFrame.
janitor/functions/factorize_columns.py
fill
+
+
+fill_direction(df, **kwargs)
+
+Provide a method-chainable function for filling missing values +in selected columns.
+It is a wrapper for pd.Series.ffill
and pd.Series.bfill
,
+and pairs the column name with one of up
, down
, updown
,
+and downup
.
Note
+This function will be deprecated in a 1.x release.
+Please use pd.DataFrame.assign
instead.
Examples:
+>>> import pandas as pd
+>>> import janitor as jn
+>>> df = pd.DataFrame(
+... {
+... 'col1': [1, 2, 3, 4],
+... 'col2': [None, 5, 6, 7],
+... 'col3': [8, 9, 10, None],
+... 'col4': [None, None, 11, None],
+... 'col5': [None, 12, 13, None]
+... }
+... )
+>>> df
+ col1 col2 col3 col4 col5
+0 1 NaN 8.0 NaN NaN
+1 2 5.0 9.0 NaN 12.0
+2 3 6.0 10.0 11.0 13.0
+3 4 7.0 NaN NaN NaN
+>>> df.fill_direction(
+... col2 = 'up',
+... col3 = 'down',
+... col4 = 'downup',
+... col5 = 'updown'
+... )
+ col1 col2 col3 col4 col5
+0 1 5.0 8.0 11.0 12.0
+1 2 5.0 9.0 11.0 12.0
+2 3 6.0 10.0 11.0 13.0
+3 4 7.0 10.0 11.0 13.0
+
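+As the deprecation note suggests, the same fill can be expressed with pd.DataFrame.assign; a minimal sketch for the call above, where 'up' maps to bfill and 'down' to ffill:
+df.assign(
+    col2=df["col2"].bfill(),          # 'up'
+    col3=df["col3"].ffill(),          # 'down'
+    col4=df["col4"].ffill().bfill(),  # 'downup'
+    col5=df["col5"].bfill().ffill(),  # 'updown'
+)
+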
Parameters:
+Name | +Type | +Description | +Default | +
---|---|---|---|
+ df
+ |
+
+ DataFrame
+ |
+
+
+
+ A pandas DataFrame. + |
+ + required + | +
+ **kwargs
+ |
+
+ Any
+ |
+
+
+
+ Key-value pairs of columns and directions.
+Directions can be either |
+
+ {}
+ |
+
Raises:
+Type | +Description | +
---|---|
+ ValueError
+ |
+
+
+
+ If direction supplied is not one of |
+
Returns:
+Type | +Description | +
---|---|
+ DataFrame
+ |
+
+
+
+ A pandas DataFrame with modified column(s). + |
+
janitor/functions/fill.py
fill_empty(df, column_names, value)
+
+Fill NaN
values in specified columns with a given value.
Super sugary syntax that wraps pandas.DataFrame.fillna
.
This method mutates the original DataFrame.
+Note
+This function will be deprecated in a 1.x release.
+Please use jn.impute
instead.
Examples:
+>>> import pandas as pd
+>>> import janitor
+>>> df = pd.DataFrame(
+... {
+... 'col1': [1, 2, 3],
+... 'col2': [None, 4, None ],
+... 'col3': [None, 5, 6]
+... }
+... )
+>>> df
+ col1 col2 col3
+0 1 NaN NaN
+1 2 4.0 5.0
+2 3 NaN 6.0
+>>> df.fill_empty(column_names = 'col2', value = 0)
+ col1 col2 col3
+0 1 0.0 NaN
+1 2 4.0 5.0
+2 3 0.0 6.0
+>>> df.fill_empty(column_names = ['col2', 'col3'], value = 0)
+ col1 col2 col3
+0 1 0.0 0.0
+1 2 4.0 5.0
+2 3 0.0 6.0
+
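+The equivalent call with the suggested replacement, jn.impute (documented further down this page), would be:
+df.impute(column_names=["col2", "col3"], value=0)
+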
Parameters:
+Name | +Type | +Description | +Default | +
---|---|---|---|
+ df
+ |
+
+ DataFrame
+ |
+
+
+
+ A pandas DataFrame. + |
+ + required + | +
+ column_names
+ |
+
+ Union[str, Iterable[str], Hashable]
+ |
+
+
+
+ A column name or an iterable (list +or tuple) of column names. If a single column name is passed in, +then only that column will be filled; if a list or tuple is passed +in, then those columns will all be filled with the same value. + |
+ + required + | +
+ value
+ |
+
+ Any
+ |
+
+
+
+ The value that replaces the |
+ + required + | +
Returns:
+Type | +Description | +
---|---|
+ DataFrame
+ |
+
+
+
+ A pandas DataFrame with |
+
janitor/functions/fill.py
filter
+
+
+filter_column_isin(df, column_name, iterable, complement=False)
+
+Filter a dataframe for values in a column that exist in the given iterable.
+This method does not mutate the original DataFrame.
+Assumes exact matching; fuzzy matching not implemented.
+ + +Examples:
+Filter the dataframe to retain rows for which names
+are exactly James
or John
.
>>> import pandas as pd
+>>> import janitor
+>>> df = pd.DataFrame({"names": ["Jane", "Jeremy", "John"], "foo": list("xyz")})
+>>> df
+ names foo
+0 Jane x
+1 Jeremy y
+2 John z
+>>> df.filter_column_isin(column_name="names", iterable=["James", "John"])
+ names foo
+2 John z
+
This is the method-chaining alternative to:
+df = df[df["names"].isin(["James", "John"])]
+
If complement=True
, then we will only get rows for which the names
+are neither James
nor John
.
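+For instance, reusing the dataframe above:
+>>> df.filter_column_isin(
+... column_name="names", iterable=["James", "John"], complement=True
+... )
+ names foo
+0 Jane x
+1 Jeremy y
+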
Parameters:
+Name | +Type | +Description | +Default | +
---|---|---|---|
+ df
+ |
+
+ DataFrame
+ |
+
+
+
+ A pandas DataFrame. + |
+ + required + | +
+ column_name
+ |
+
+ Hashable
+ |
+
+
+
+ The column on which to filter. + |
+ + required + | +
+ iterable
+ |
+
+ Iterable
+ |
+
+
+
+ An iterable. Could be a list, tuple, another pandas +Series. + |
+ + required + | +
+ complement
+ |
+
+ bool
+ |
+
+
+
+ Whether to return the complement of the selection or +not. + |
+
+ False
+ |
+
Raises:
+Type | +Description | +
---|---|
+ ValueError
+ |
+
+
+
+ If |
+
Returns:
+Type | +Description | +
---|---|
+ DataFrame
+ |
+
+
+
+ A filtered pandas DataFrame. + |
+
janitor/functions/filter.py
filter_date(df, column_name, start_date=None, end_date=None, years=None, months=None, days=None, column_date_options=None, format=None)
+
+Filter a date-based column based on certain criteria.
+This method does not mutate the original DataFrame.
+Dates can be finicky, so this function builds on top of
+the pandas to_datetime
 function, which parses dates robustly.
Additional options to parse the date type of your column may be found at +the official pandas documentation.
+ + +Examples:
+>>> import pandas as pd
+>>> import janitor
+>>> df = pd.DataFrame({
+... "a": range(5, 9),
+... "dt": ["2021-11-12", "2021-12-15", "2022-01-03", "2022-01-09"],
+... })
+>>> df
+ a dt
+0 5 2021-11-12
+1 6 2021-12-15
+2 7 2022-01-03
+3 8 2022-01-09
+>>> df.filter_date("dt", start_date="2021-12-01", end_date="2022-01-05")
+ a dt
+1 6 2021-12-15
+2 7 2022-01-03
+>>> df.filter_date("dt", years=[2021], months=[12])
+ a dt
+1 6 2021-12-15
+
Note
+This method will cast your column to a Timestamp!
+Note
+The format argument only affects the start_date
and end_date
+parameters. If there's an issue with the format of the DataFrame being
+parsed, you would pass {'format': your_format}
to column_date_options
.
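+For example, a parse format for the date column can be supplied through column_date_options (an illustrative call, reusing the df above):
+df.filter_date("dt", years=[2021], column_date_options={"format": "%Y-%m-%d"})
+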
Parameters:
+Name | +Type | +Description | +Default | +
---|---|---|---|
+ df
+ |
+
+ DataFrame
+ |
+
+
+
+ The dataframe to filter on. + |
+ + required + | +
+ column_name
+ |
+
+ Hashable
+ |
+
+
+
+ The date column on which to filter. + |
+ + required + | +
+ start_date
+ |
+
+ Optional[date]
+ |
+
+
+
+ The beginning date to use to filter the DataFrame. + |
+
+ None
+ |
+
+ end_date
+ |
+
+ Optional[date]
+ |
+
+
+
+ The end date to use to filter the DataFrame. + |
+
+ None
+ |
+
+ years
+ |
+
+ Optional[List]
+ |
+
+
+
+ The years to use to filter the DataFrame. + |
+
+ None
+ |
+
+ months
+ |
+
+ Optional[List]
+ |
+
+
+
+ The months to use to filter the DataFrame. + |
+
+ None
+ |
+
+ days
+ |
+
+ Optional[List]
+ |
+
+
+
+ The days to use to filter the DataFrame. + |
+
+ None
+ |
+
+ column_date_options
+ |
+
+ Optional[Dict]
+ |
+
+
+
+ Special options to use when parsing the date +column in the original DataFrame. The options may be found at the +official Pandas documentation. + |
+
+ None
+ |
+
+ format
+ |
+
+ Optional[str]
+ |
+
+
+
+ If you're using a format for |
+
+ None
+ |
+
Returns:
+Type | +Description | +
---|---|
+ DataFrame
+ |
+
+
+
+ A filtered pandas DataFrame. + |
+
janitor/functions/filter.py
filter_on(df, criteria, complement=False)
+
+Return a dataframe filtered on a particular criteria.
+This method does not mutate the original DataFrame.
+This is super-sugary syntax that wraps the pandas .query()
API, enabling
+users to use strings to quickly specify filters for filtering their
+dataframe. The intent is that filter_on
as a verb better matches the
+intent of a pandas user than the verb query
.
This is intended to be the method-chaining equivalent of the following:
+df = df[df["score"] < 3]
+
Note
+This function will be deprecated in a 1.x release.
+Please use pd.DataFrame.query
instead.
Examples:
+Filter students who failed an exam (scored less than 50).
+>>> import pandas as pd
+>>> import janitor
+>>> df = pd.DataFrame({
+... "student_id": ["S1", "S2", "S3"],
+... "score": [40, 60, 85],
+... })
+>>> df
+ student_id score
+0 S1 40
+1 S2 60
+2 S3 85
+>>> df.filter_on("score < 50", complement=False)
+ student_id score
+0 S1 40
+
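+For reference, the pd.DataFrame.query equivalents suggested by the deprecation note (a sketch, reusing the df above):
+df.query("score < 50")        # complement=False
+df.query("not (score < 50)")  # complement=True
+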
Credit to Brant Peterson for the name.
+ + +Parameters:
+Name | +Type | +Description | +Default | +
---|---|---|---|
+ df
+ |
+
+ DataFrame
+ |
+
+
+
+ A pandas DataFrame. + |
+ + required + | +
+ criteria
+ |
+
+ str
+ |
+
+
+
+ A filtering criterion that returns an array or Series of +booleans, on which pandas can filter. + |
+ + required + | +
+ complement
+ |
+
+ bool
+ |
+
+
+
+ Whether to return the complement of the filter or not. +If set to True, then the rows for which the criteria is False are +retained instead. + |
+
+ False
+ |
+
Returns:
+Type | +Description | +
---|---|
+ DataFrame
+ |
+
+
+
+ A filtered pandas DataFrame. + |
+
janitor/functions/filter.py
filter_string(df, column_name, search_string, complement=False, case=True, flags=0, na=None, regex=True)
+
+Filter a string-based column according to whether it contains a substring.
+This is super sugary syntax that builds on top of pandas.Series.str.contains
.
+It is meant to be the method-chaining equivalent of the following:
df = df[df[column_name].str.contains(search_string)]
+
This method does not mutate the original DataFrame.
+ + +Examples:
+Retain rows whose column values contain a particular substring.
+>>> import pandas as pd
+>>> import janitor
+>>> df = pd.DataFrame({"a": range(3, 6), "b": ["bear", "peeL", "sail"]})
+>>> df
+ a b
+0 3 bear
+1 4 peeL
+2 5 sail
+>>> df.filter_string(column_name="b", search_string="ee")
+ a b
+1 4 peeL
+>>> df.filter_string(column_name="b", search_string="L", case=False)
+ a b
+1 4 peeL
+2 5 sail
+
Filter names that do not contain '.'
(with regex mode disabled).
>>> import pandas as pd
+>>> import janitor
+>>> df = pd.Series(["JoseChen", "Brian.Salvi"], name="Name").to_frame()
+>>> df
+ Name
+0 JoseChen
+1 Brian.Salvi
+>>> df.filter_string(column_name="Name", search_string=".", regex=False, complement=True)
+ Name
+0 JoseChen
+
Parameters:
+Name | +Type | +Description | +Default | +
---|---|---|---|
+ df
+ |
+
+ DataFrame
+ |
+
+
+
+ A pandas DataFrame. + |
+ + required + | +
+ column_name
+ |
+
+ Hashable
+ |
+
+
+
+ The column to filter. The column should contain strings. + |
+ + required + | +
+ search_string
+ |
+
+ str
+ |
+
+
+
+ A regex pattern or a (sub-)string to search. + |
+ + required + | +
+ complement
+ |
+
+ bool
+ |
+
+
+
+ Whether to return the complement of the filter or not. If +set to True, then the rows for which the string search fails are retained +instead. + |
+
+ False
+ |
+
+ case
+ |
+
+ bool
+ |
+
+
+
+ If True, case sensitive. + |
+
+ True
+ |
+
+ flags
+ |
+
+ int
+ |
+
+
+
+ Flags to pass through to the re module, e.g. re.IGNORECASE. + |
+
+ 0
+ |
+
+ na
+ |
+
+ Any
+ |
+
+
+
+ Fill value for missing values. The default depends on dtype of
+the array. For object-dtype, |
+
+ None
+ |
+
+ regex
+ |
+
+ bool
+ |
+
+
+
+ If True, assumes |
+
+ True
+ |
+
Returns:
+Type | +Description | +
---|---|
+ DataFrame
+ |
+
+
+
+ A filtered pandas DataFrame. + |
+
janitor/functions/filter.py
find_replace
+
+
+Implementation for find_replace.
+ + + + + + + + +find_replace(df, match='exact', **mappings)
+
+Perform a find-and-replace action on provided columns.
+Note
+This function will be deprecated in a 1.x release.
+Please use pd.DataFrame.replace
instead.
Depending on the use case, users can choose either exact, full-value matching, +or regular-expression-based fuzzy matching +(the latter allowing substring matching). +For strings, the matching is always case sensitive.
+ + +Examples:
+For instance, given a DataFrame containing orders at a coffee shop:
+>>> import pandas as pd
+>>> import janitor
+>>> from janitor.functions import find_replace
+>>> df = pd.DataFrame({
+... "customer": ["Mary", "Tom", "Lila"],
+... "order": ["ice coffee", "lemonade", "regular coffee"]
+... })
+>>> df
+ customer order
+0 Mary ice coffee
+1 Tom lemonade
+2 Lila regular coffee
+
Our task is to replace values ice coffee
and regular coffee
+of the order
column into latte
.
Example 1 - exact matching (functional usage):
+>>> df = find_replace(
+... df,
+... match="exact",
+... order={"ice coffee": "latte", "regular coffee": "latte"},
+... )
+>>> df
+ customer order
+0 Mary latte
+1 Tom lemonade
+2 Lila latte
+
Example 1 - exact matching (method chaining):
+>>> df = df.find_replace(
+... match="exact",
+... order={"ice coffee": "latte", "regular coffee": "latte"},
+... )
+>>> df
+ customer order
+0 Mary latte
+1 Tom lemonade
+2 Lila latte
+
Example 2 - Regular-expression-based matching (functional usage):
+>>> df = find_replace(
+... df,
+... match='regex',
+... order={'coffee$': 'latte'},
+... )
+>>> df
+ customer order
+0 Mary latte
+1 Tom lemonade
+2 Lila latte
+
Example 2 - Regular-expression-based matching (method chaining usage):
+>>> df = df.find_replace(
+... match='regex',
+... order={'coffee$': 'latte'},
+... )
+>>> df
+ customer order
+0 Mary latte
+1 Tom lemonade
+2 Lila latte
+
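+For reference, minimal pd.DataFrame.replace equivalents of the two examples above, as the deprecation note suggests (a sketch, not pyjanitor internals):
+df.replace({"order": {"ice coffee": "latte", "regular coffee": "latte"}})  # exact
+df.replace({"order": {"coffee$": "latte"}}, regex=True)                    # regex
+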
To perform a find and replace on the entire DataFrame,
+pandas' df.replace()
function provides the appropriate functionality.
+You can find more detail on the replace docs.
This function only works with column names that have no spaces
+or punctuation in them.
+For example, a column name item_name
would work with find_replace
,
+because it is a contiguous string that can be parsed correctly,
+but item name
would not be parsed correctly by the Python interpreter.
If you have column names that might not be compatible,
+we recommend calling on clean_names()
+as the first method call. If, for whatever reason, that is not possible,
+then _find_replace
is available as a function
+that you can do a pandas pipe call on.
Parameters:
+Name | +Type | +Description | +Default | +
---|---|---|---|
+ df
+ |
+
+ DataFrame
+ |
+
+
+
+ A pandas DataFrame. + |
+ + required + | +
+ match
+ |
+
+ str
+ |
+
+
+
+ Whether to perform an exact or a regex match. +Valid values are "exact" or "regex". + |
+
+ 'exact'
+ |
+
+ **mappings
+ |
+
+ Any
+ |
+
+
+
+ keyword arguments corresponding to column names +that have dictionaries passed in indicating what to find (keys) +and what to replace with (values). + |
+
+ {}
+ |
+
Returns:
+Type | +Description | +
---|---|
+ DataFrame
+ |
+
+
+
+ A pandas DataFrame with replaced values. + |
+
janitor/functions/find_replace.py
flag_nulls
+
+
+Implementation source for flag_nulls
.
flag_nulls(df, column_name='null_flag', columns=None)
+
+Creates a new column to indicate whether you have null values in a given +row.
+If the columns parameter is not set, this function looks across the entire +DataFrame; otherwise, it looks only at the columns you set.
+This method does not mutate the original DataFrame.
+ + +Examples:
+>>> import pandas as pd
+>>> import janitor
+>>> df = pd.DataFrame({
+... "a": ["w", "x", None, "z"], "b": [5, None, 7, 8],
+... })
+>>> df.flag_nulls()
+ a b null_flag
+0 w 5.0 0
+1 x NaN 1
+2 None 7.0 1
+3 z 8.0 0
+>>> df.flag_nulls(columns="b")
+ a b null_flag
+0 w 5.0 0
+1 x NaN 1
+2 None 7.0 0
+3 z 8.0 0
+
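+A rough pandas-native equivalent of the default call above (a sketch):
+df.assign(null_flag=df.isna().any(axis=1).astype(int))
+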
Parameters:
+Name | +Type | +Description | +Default | +
---|---|---|---|
+ df
+ |
+
+ DataFrame
+ |
+
+
+
+ Input pandas DataFrame. + |
+ + required + | +
+ column_name
+ |
+
+ Optional[Hashable]
+ |
+
+
+
+ Name for the output column. + |
+
+ 'null_flag'
+ |
+
+ columns
+ |
+
+ Optional[Union[str, Iterable[str], Hashable]]
+ |
+
+
+
+ List of columns to look at for finding null values. If you +only want to look at one column, you can simply give its name. +If set to None (default), all DataFrame columns are used. + |
+
+ None
+ |
+
Raises:
+Type | +Description | +
---|---|
+ ValueError
+ |
+
+
+
+ If |
+
+ ValueError
+ |
+
+
+
+ If any column within |
+
Returns:
+Type | +Description | +
---|---|
+ DataFrame
+ |
+
+
+
+ Input dataframe with the null flag column. + |
+
janitor/functions/flag_nulls.py
get_dupes
+
+
+Implementation of the get_dupes
function
get_dupes(df, column_names=None)
+
+Return all duplicate rows.
+This method does not mutate the original DataFrame.
+ + +Examples:
+Method chaining syntax:
+>>> import pandas as pd
+>>> import janitor
+>>> df = pd.DataFrame({
+... "item": ["shoe", "shoe", "bag", "shoe", "bag"],
+... "quantity": [100, 100, 75, 200, 75],
+... })
+>>> df
+ item quantity
+0 shoe 100
+1 shoe 100
+2 bag 75
+3 shoe 200
+4 bag 75
+>>> df.get_dupes()
+ item quantity
+0 shoe 100
+1 shoe 100
+2 bag 75
+4 bag 75
+
Optional column_names
usage:
>>> import pandas as pd
+>>> import janitor
+>>> df = pd.DataFrame({
+... "item": ["shoe", "shoe", "bag", "shoe", "bag"],
+... "quantity": [100, 100, 75, 200, 75],
+... })
+>>> df
+ item quantity
+0 shoe 100
+1 shoe 100
+2 bag 75
+3 shoe 200
+4 bag 75
+>>> df.get_dupes(column_names=["item"])
+ item quantity
+0 shoe 100
+1 shoe 100
+2 bag 75
+3 shoe 200
+4 bag 75
+>>> df.get_dupes(column_names=["quantity"])
+ item quantity
+0 shoe 100
+1 shoe 100
+2 bag 75
+4 bag 75
+
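+A pandas-native sketch of the same selections:
+df[df.duplicated(keep=False)]                       # all columns
+df[df.duplicated(subset=["quantity"], keep=False)]  # column_names=["quantity"]
+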
Parameters:
+Name | +Type | +Description | +Default | +
---|---|---|---|
+ df
+ |
+
+ DataFrame
+ |
+
+
+
+ The pandas DataFrame object. + |
+ + required + | +
+ column_names
+ |
+
+ Optional[Union[str, Iterable[str], Hashable]]
+ |
+
+
+
+ A column name or an iterable +(list or tuple) of column names. Following pandas API, this only +considers certain columns for identifying duplicates. Defaults +to using all columns. + |
+
+ None
+ |
+
Returns:
+Type | +Description | +
---|---|
+ DataFrame
+ |
+
+
+
+ The duplicate rows, as a pandas DataFrame. + |
+
janitor/functions/get_dupes.py
groupby_agg
+
+
+Implementation source for groupby_agg
.
groupby_agg(df, by, new_column_name, agg_column_name, agg, dropna=True)
+
+Shortcut for assigning a groupby-transform to a new column.
+This method does not mutate the original DataFrame.
+Intended to be the method-chaining equivalent of:
+df = df.assign(...=df.groupby(...)[...].transform(...))
+
Note
+This function will be deprecated in a 1.x release.
+Please use
+jn.transform_column
+instead.
Examples:
+Basic usage.
+>>> import pandas as pd
+>>> import janitor
+>>> df = pd.DataFrame({
+... "item": ["shoe", "shoe", "bag", "shoe", "bag"],
+... "quantity": [100, 120, 75, 200, 25],
+... })
+>>> df.groupby_agg(
+... by="item",
+... agg="mean",
+... agg_column_name="quantity",
+... new_column_name="avg_quantity",
+... )
+ item quantity avg_quantity
+0 shoe 100 140.0
+1 shoe 120 140.0
+2 bag 75 50.0
+3 shoe 200 140.0
+4 bag 25 50.0
+
Set dropna=False
to compute the aggregation, treating the null
+values in the by
column as an isolated "group".
>>> import pandas as pd
+>>> import janitor
+>>> df = pd.DataFrame({
+... "x": ["a", "a", None, "b"], "y": [9, 9, 9, 9],
+... })
+>>> df.groupby_agg(
+... by="x",
+... agg="count",
+... agg_column_name="y",
+... new_column_name="y_count",
+... dropna=False,
+... )
+ x y y_count
+0 a 9 2
+1 a 9 2
+2 None 9 1
+3 b 9 1
+
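+Spelled out, the first example above corresponds to this pandas pattern:
+df.assign(avg_quantity=df.groupby("item")["quantity"].transform("mean"))
+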
Parameters:
+Name | +Type | +Description | +Default | +
---|---|---|---|
+ df
+ |
+
+ DataFrame
+ |
+
+
+
+ A pandas DataFrame. + |
+ + required + | +
+ by
+ |
+
+ Union[List, Callable, str]
+ |
+
+
+
+ Column(s) to groupby on, will be passed into |
+ + required + | +
+ new_column_name
+ |
+
+ str
+ |
+
+
+
+ Name of the aggregation output column. + |
+ + required + | +
+ agg_column_name
+ |
+
+ str
+ |
+
+
+
+ Name of the column to aggregate over. + |
+ + required + | +
+ agg
+ |
+
+ Union[Callable, str]
+ |
+
+
+
+ How to aggregate. + |
+ + required + | +
+ dropna
+ |
+
+ bool
+ |
+
+
+
+ Whether or not to include null values, if present in the
+ |
+
+ True
+ |
+
Returns:
+Type | +Description | +
---|---|
+ DataFrame
+ |
+
+
+
+ A pandas DataFrame. + |
+
janitor/functions/groupby_agg.py
groupby_topk
+
+
+Implementation of the groupby_topk
function
groupby_topk(df, by, column, k, dropna=True, ascending=True, ignore_index=True)
+
+Return top k
rows from a groupby of a set of columns.
Returns a DataFrame that has the top k
values per column
,
+grouped by by
. Under the hood it uses nlargest/nsmallest
 for numeric columns, which avoids sorting the entire dataframe
+and is usually more performant. For non-numeric columns, pd.sort_values
+is used.
+No sorting is done to the by
column(s); the order is maintained
+in the final output.
Examples:
+>>> import pandas as pd
+>>> import janitor
+>>> df = pd.DataFrame(
+... {
+... "age": [20, 23, 22, 43, 21],
+... "id": [1, 4, 6, 2, 5],
+... "result": ["pass", "pass", "fail", "pass", "fail"],
+... }
+... )
+>>> df
+ age id result
+0 20 1 pass
+1 23 4 pass
+2 22 6 fail
+3 43 2 pass
+4 21 5 fail
+
Ascending top 3:
+>>> df.groupby_topk(by="result", column="age", k=3)
+ age id result
+0 20 1 pass
+1 23 4 pass
+2 43 2 pass
+3 21 5 fail
+4 22 6 fail
+
Descending top 2:
+>>> df.groupby_topk(
+... by="result", column="age", k=2, ascending=False, ignore_index=False
+... )
+ age id result
+3 43 2 pass
+1 23 4 pass
+2 22 6 fail
+4 21 5 fail
+
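+A rough pandas-native sketch of the ascending top-3 example (note that the row order may differ, since groupby_topk preserves the order of the by column):
+df.sort_values("age").groupby("result", sort=False).head(3)
+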
Parameters:
+Name | +Type | +Description | +Default | +
---|---|---|---|
+ df
+ |
+
+ DataFrame
+ |
+
+
+
+ A pandas DataFrame. + |
+ + required + | +
+ by
+ |
+
+ Union[list, Hashable]
+ |
+
+
+
+ Column name(s) to group input DataFrame |
+ + required + | +
+ column
+ |
+
+ Hashable
+ |
+
+
+
+ Name of the column that determines |
+ + required + | +
+ k
+ |
+
+ int
+ |
+
+
+
+ Number of top rows to return for each group. + |
+ + required + | +
+ dropna
+ |
+
+ bool
+ |
+
+
+
+ If |
+
+ True
+ |
+
+ ascending
+ |
+
+ bool
+ |
+
+
+
+ If |
+
+ True
+ |
+
+ ignore_index
+ |
+
+ bool
+ |
+
+
+
+ If |
+
+ True
+ |
+
Raises:
+Type | +Description | +
---|---|
+ ValueError
+ |
+
+
+
+ If |
+
Returns:
+Type | +Description | +
---|---|
+ DataFrame
+ |
+
+
+
+ A pandas DataFrame with top |
+
janitor/functions/groupby_topk.py
impute
+
+
+Implementation of impute
function
impute(df, column_names, value=None, statistic_column_name=None)
+
+Method-chainable imputation of values in a column.
+This method does not mutate the original DataFrame.
+Under the hood, this function calls the .fillna()
method available
+to every pandas.Series
object.
Either one of value
or statistic_column_name
should be provided.
If value
is provided, then all null values in the selected column will
+take on the value provided.
If statistic_column_name
is provided, then all null values in the
+selected column(s) will take on the summary statistic value
+of other non-null values.
Column selection in column_names
is possible using the
+select
syntax.
Currently supported statistics include:
+mean
(also aliased by average
)median
mode
minimum
(also aliased by min
)maximum
(also aliased by max
)Examples:
+>>> import numpy as np
+>>> import pandas as pd
+>>> import janitor
+>>> df = pd.DataFrame({
+... "a": [1, 2, 3],
+... "sales": np.nan,
+... "score": [np.nan, 3, 2],
+... })
+>>> df
+ a sales score
+0 1 NaN NaN
+1 2 NaN 3.0
+2 3 NaN 2.0
+
Imputing null values with 0 (using the value
parameter):
>>> df.impute(column_names="sales", value=0.0)
+ a sales score
+0 1 0.0 NaN
+1 2 0.0 3.0
+2 3 0.0 2.0
+
Imputing null values with median (using the statistic_column_name
+parameter):
>>> df.impute(column_names="score", statistic_column_name="median")
+ a sales score
+0 1 NaN 2.5
+1 2 NaN 3.0
+2 3 NaN 2.0
+
Parameters:
+Name | +Type | +Description | +Default | +
---|---|---|---|
+ df
+ |
+
+ DataFrame
+ |
+
+
+
+ A pandas DataFrame. + |
+ + required + | +
+ column_names
+ |
+
+ Any
+ |
+
+
+
+ The name of the column(s) on which to impute values. + |
+ + required + | +
+ value
+ |
+
+ Optional[Any]
+ |
+
+
+
+ The value used for imputation, passed into |
+
+ None
+ |
+
+ statistic_column_name
+ |
+
+ Optional[str]
+ |
+
+
+
+ The column statistic to impute. + |
+
+ None
+ |
+
Raises:
+Type | +Description | +
---|---|
+ ValueError
+ |
+
+
+
+ If both |
+
+ KeyError
+ |
+
+
+
+ If |
+
Returns:
+Type | +Description | +
---|---|
+ DataFrame
+ |
+
+
+
+ An imputed pandas DataFrame. + |
+
janitor/functions/impute.py
jitter
+
+
+Implementation of the jitter
function.
jitter(df, column_name, dest_column_name, scale, clip=None, random_state=None)
+
+Adds Gaussian noise (jitter) to the values of a column.
+A new column will be created containing the values of the original column
+with Gaussian noise added.
+For each value in the column, a Gaussian distribution is created
+having a location (mean) equal to the value
+and a scale (standard deviation) equal to scale
.
+A random value is then sampled from this distribution,
+which is the jittered value.
+If a tuple is supplied for clip
,
+then any values of the new column less than clip[0]
+will be set to clip[0]
,
+and any values greater than clip[1]
will be set to clip[1]
.
+Additionally, if a numeric value is supplied for random_state
,
+this value will be used to set the random seed used for sampling.
+NaN values are ignored in this method.
This method mutates the original DataFrame.
+ + +Examples:
+>>> import numpy as np
+>>> import pandas as pd
+>>> import janitor
+>>> df = pd.DataFrame({"a": [3, 4, 5, np.nan]})
+>>> df
+ a
+0 3.0
+1 4.0
+2 5.0
+3 NaN
+>>> df.jitter("a", dest_column_name="a_jit", scale=1, random_state=42)
+ a a_jit
+0 3.0 3.496714
+1 4.0 3.861736
+2 5.0 5.647689
+3 NaN NaN
+
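+A minimal numpy sketch of the sampling described above (illustrative only; not pyjanitor's internal code, and the draws will not match the doctest output):
+rng = np.random.default_rng(42)
+jittered = df["a"] + rng.normal(loc=0.0, scale=1.0, size=len(df))  # scale=1; NaN stays NaN
+jittered = jittered.clip(lower=0, upper=10)  # only if clip=(0, 10) were supplied
+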
Parameters:
+Name | +Type | +Description | +Default | +
---|---|---|---|
+ df
+ |
+
+ DataFrame
+ |
+
+
+
+ A pandas DataFrame. + |
+ + required + | +
+ column_name
+ |
+
+ Hashable
+ |
+
+
+
+ Name of the column containing +values to add Gaussian jitter to. + |
+ + required + | +
+ dest_column_name
+ |
+
+ str
+ |
+
+
+
+ The name of the new column containing the +jittered values that will be created. + |
+ + required + | +
+ scale
+ |
+
+ number
+ |
+
+
+
+ A positive value multiplied by the original +column value to determine the scale (standard deviation) of the +Gaussian distribution to sample from. (A value of zero results in +no jittering.) + |
+ + required + | +
+ clip
+ |
+
+ Optional[Iterable[number]]
+ |
+
+
+
+ An iterable of two values (minimum and maximum) to clip +the jittered values to, default to None. + |
+
+ None
+ |
+
+ random_state
+ |
+
+ Optional[number]
+ |
+
+
+
+ An integer or 1-d array value used to set the random +seed, default to None. + |
+
+ None
+ |
+
Raises:
+Type | +Description | +
---|---|
+ TypeError
+ |
+
+
+
+ If |
+
+ ValueError
+ |
+
+
+
+ If |
+
+ ValueError
+ |
+
+
+
+ If |
+
+ ValueError
+ |
+
+
+
+ If |
+
Returns:
+Type | +Description | +
---|---|
+ DataFrame
+ |
+
+
+
+ A pandas DataFrame with a new column containing +Gaussian-jittered values from another column. + |
+
janitor/functions/jitter.py
join_apply
+
+
+Implementation of the join_apply
function
join_apply(df, func, new_column_name)
+
+Join the result of applying a function across dataframe rows.
+This method does not mutate the original DataFrame.
+This is a convenience function that allows us to apply arbitrary functions +that take any combination of information from any of the columns. The only +requirement is that the function signature takes in a row from the +DataFrame.
+ + +Examples:
+Sum the result of two columns into a new column.
+>>> import pandas as pd
+>>> import janitor
+>>> df = pd.DataFrame({"a":[1, 2, 3], "b": [2, 3, 4]})
+>>> df
+ a b
+0 1 2
+1 2 3
+2 3 4
+>>> df.join_apply(
+... func=lambda x: 2 * x["a"] + x["b"],
+... new_column_name="2a+b",
+... )
+ a b 2a+b
+0 1 2 4
+1 2 3 7
+2 3 4 10
+
Incorporating conditionals in func
.
>>> import pandas as pd
+>>> import janitor
+>>> df = pd.DataFrame({"a": [1, 2, 3], "b": [20, 30, 40]})
+>>> df
+ a b
+0 1 20
+1 2 30
+2 3 40
+>>> def take_a_if_even(x):
+... if x["a"] % 2 == 0:
+... return x["a"]
+... else:
+... return x["b"]
+>>> df.join_apply(take_a_if_even, "a_if_even")
+ a b a_if_even
+0 1 20 20
+1 2 30 2
+2 3 40 40
+
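+A pandas-native sketch of the same row-wise pattern, for the first example above:
+df.assign(**{"2a+b": df.apply(lambda row: 2 * row["a"] + row["b"], axis=1)})
+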
Parameters:
+Name | +Type | +Description | +Default | +
---|---|---|---|
+ df
+ |
+
+ DataFrame
+ |
+
+
+
+ A pandas DataFrame. + |
+ + required + | +
+ func
+ |
+
+ Callable
+ |
+
+
+
+ A function that is applied elementwise across all rows of the +DataFrame. + |
+ + required + | +
+ new_column_name
+ |
+
+ str
+ |
+
+
+
+ Name of the resulting column. + |
+ + required + | +
Returns:
+Type | +Description | +
---|---|
+ DataFrame
+ |
+
+
+
+ A pandas DataFrame with new column appended. + |
+
janitor/functions/join_apply.py
label_encode
+
+
+Implementation of label_encode
function
label_encode(df, column_names)
+
+Convert labels into numerical data.
+This method will create a new column with the string _enc
appended
+after the original column's name.
+Consider this to be syntactic sugar.
+This function uses the factorize
pandas function under the hood.
This method behaves differently from
+encode_categorical
.
+This method creates a new column of numeric data.
+encode_categorical
+replaces the dtype of the original column with a categorical dtype.
This method mutates the original DataFrame.
+Note
+This function will be deprecated in a 1.x release.
+Please use factorize_columns
+instead.
Examples:
+>>> import pandas as pd
+>>> import janitor
+>>> df = pd.DataFrame({
+... "foo": ["b", "b", "a", "c", "b"],
+... "bar": range(4, 9),
+... })
+>>> df
+ foo bar
+0 b 4
+1 b 5
+2 a 6
+3 c 7
+4 b 8
+>>> df.label_encode(column_names="foo")
+ foo bar foo_enc
+0 b 4 0
+1 b 5 0
+2 a 6 1
+3 c 7 2
+4 b 8 0
+
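+The equivalent call with the suggested replacement is:
+df.factorize_columns(column_names="foo")
+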
Parameters:
+Name | +Type | +Description | +Default | +
---|---|---|---|
+ df
+ |
+
+ DataFrame
+ |
+
+
+
+ The pandas DataFrame object. + |
+ + required + | +
+ column_names
+ |
+
+ Union[str, Iterable[str], Hashable]
+ |
+
+
+
+ A column name or an iterable (list +or tuple) of column names. + |
+ + required + | +
Returns:
+Type | +Description | +
---|---|
+ DataFrame
+ |
+
+
+
+ A pandas DataFrame. + |
+
janitor/functions/label_encode.py
limit_column_characters
+
+
+Implementation of limit_column_characters.
+ + + + + + + + +limit_column_characters(df, column_length, col_separator='_')
+
+Truncate column sizes to a specific length.
+This method mutates the original DataFrame.
+Method chaining will truncate all columns to a given length and append +a given separator character with the index of duplicate columns, except +for the first distinct column name.
+ + +Examples:
+>>> import pandas as pd
+>>> import janitor
+>>> data_dict = {
+... "really_long_name": [9, 8, 7],
+... "another_really_long_name": [2, 4, 6],
+... "another_really_longer_name": list("xyz"),
+... "this_is_getting_out_of_hand": list("pqr"),
+... }
+>>> df = pd.DataFrame(data_dict)
+>>> df
+ really_long_name another_really_long_name another_really_longer_name this_is_getting_out_of_hand
+0 9 2 x p
+1 8 4 y q
+2 7 6 z r
+>>> df.limit_column_characters(7)
+ really_ another another_1 this_is
+0 9 2 x p
+1 8 4 y q
+2 7 6 z r
+
Parameters:
+Name | +Type | +Description | +Default | +
---|---|---|---|
+ df
+ |
+
+ DataFrame
+ |
+
+
+
+ A pandas DataFrame. + |
+ + required + | +
+ column_length
+ |
+
+ int
+ |
+
+
+
+ Character length to which to truncate all columns. +The column separator and the index appended for duplicate column names do +not count toward this length. Therefore, if all columns are truncated to 10 +characters, the first distinct column will be 10 characters and the +remaining will be 12 characters (assuming a column separator of one +character). + |
+ + required + | +
+ col_separator
+ |
+
+ str
+ |
+
+
+
+ The separator to use for counting distinct column
+values, for example, |
+
+ '_'
+ |
+
Returns:
+Type | +Description | +
---|---|
+ DataFrame
+ |
+
+
+
+ A pandas DataFrame with truncated column lengths. + |
+
janitor/functions/limit_column_characters.py
min_max_scale
+
+
+min_max_scale(df, feature_range=(0, 1), column_name=None, jointly=False)
+
+Scales DataFrame to between a minimum and maximum value.
+One can optionally set a new target minimum and maximum value
+using the feature_range
keyword argument.
If column_name
 is specified, then only that column (or those columns) is scaled.
+Otherwise, the entire dataframe is scaled.
+If jointly
 is True
, the selected columns (or the entire dataframe) will
+be recognized as one set of values and scaled jointly. Otherwise, each column of data
+will be scaled separately.
Examples:
+>>> import pandas as pd
+>>> import janitor
+>>> df = pd.DataFrame({'a':[1, 2], 'b':[0, 1]})
+>>> df.min_max_scale()
+ a b
+0 0.0 0.0
+1 1.0 1.0
+>>> df.min_max_scale(jointly=True)
+ a b
+0 0.5 0.0
+1 1.0 0.5
+
Setting custom minimum and maximum.
+>>> import pandas as pd
+>>> import janitor
+>>> df = pd.DataFrame({'a':[1, 2], 'b':[0, 1]})
+>>> df.min_max_scale(feature_range=(0, 100))
+ a b
+0 0.0 0.0
+1 100.0 100.0
+>>> df.min_max_scale(feature_range=(0, 100), jointly=True)
+ a b
+0 50.0 0.0
+1 100.0 50.0
+
Apply min-max to the selected columns.
+>>> import pandas as pd
+>>> import janitor
+>>> df = pd.DataFrame({'a':[1, 2], 'b':[0, 1], 'c': [1, 0]})
+>>> df.min_max_scale(
+... feature_range=(0, 100),
+... column_name=["a", "c"],
+... )
+ a b c
+0 0.0 0 100.0
+1 100.0 1 0.0
+>>> df.min_max_scale(
+... feature_range=(0, 100),
+... column_name=["a", "c"],
+... jointly=True,
+... )
+ a b c
+0 50.0 0 50.0
+1 100.0 1 0.0
+>>> df.min_max_scale(feature_range=(0, 100), column_name='a')
+ a b c
+0 0.0 0 1
+1 100.0 1 0
+
The aforementioned example might be applied to something like scaling the +isoelectric points of amino acids. While technically they range from +approximately 3 to 10, we can also think of them on the pH scale, which ranges from +1 to 14. Hence, 3 gets scaled not to 0 but to approximately 0.15, while 10 +gets scaled to approximately 0.69.
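+The underlying transformation is the standard min-max formula, applied per column, or over all selected values at once when jointly=True (a sketch, with x a column and (new_min, new_max) the feature_range):
+x_std = (x - x.min()) / (x.max() - x.min())
+scaled = x_std * (new_max - new_min) + new_min
+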
+Version Changed
+old_min
, old_max
, new_min
, and new_max
options.feature_range
, and jointly
options.Parameters:
+Name | +Type | +Description | +Default | +
---|---|---|---|
+ df
+ |
+
+ DataFrame
+ |
+
+
+
+ A pandas DataFrame. + |
+ + required + | +
+ feature_range
+ |
+
+ tuple[int | float, int | float]
+ |
+
+
+
+ Desired range of transformed data. + |
+
+ (0, 1)
+ |
+
+ column_name
+ |
+
+ str | int | list[str | int] | Index
+ |
+
+
+
+ The column on which to perform scaling. + |
+
+ None
+ |
+
+ jointly
+ |
+
+ bool
+ |
+
+
+
+ Scale the entire data if True. + |
+
+ False
+ |
+
Raises:
+Type | +Description | +
---|---|
+ ValueError
+ |
+
+
+
+ If |
+
+ ValueError
+ |
+
+
+
+ If the length of |
+
+ ValueError
+ |
+
+
+
+ If the element of |
+
+ ValueError
+ |
+
+
+
+ If |
+
Returns:
+Type | +Description | +
---|---|
+ DataFrame
+ |
+
+
+
+ A pandas DataFrame with scaled data. + |
+
janitor/functions/min_max_scale.py
move
+
+
+Implementation of move.
+ + + + + + + + +move(df, source, target=None, position='before', axis=0)
+
+Changes the positions of rows or columns in the dataframe.
+It uses the
+select
syntax,
+making it easy to move blocks of rows or columns at once.
This operation does not reset the index of the dataframe; the user must +do so explicitly.
+The dataframe must have unique column names or indices.
+ + +Examples:
+Move a row:
+>>> import pandas as pd
+>>> import janitor
+>>> df = pd.DataFrame({"a": [2, 4, 6, 8], "b": list("wxyz")})
+>>> df
+ a b
+0 2 w
+1 4 x
+2 6 y
+3 8 z
+>>> df.move(source=0, target=3, position="before", axis=0)
+ a b
+1 4 x
+2 6 y
+0 2 w
+3 8 z
+
Move a column:
+>>> import pandas as pd
+>>> import janitor
+>>> data = [{"a": 1, "b": 1, "c": 1,
+... "d": "a", "e": "a","f": "a"}]
+>>> df = pd.DataFrame(data)
+>>> df
+ a b c d e f
+0 1 1 1 a a a
+>>> df.move(source="a", target="c", position="after", axis=1)
+ b c a d e f
+0 1 1 1 a a a
+>>> df.move(source="f", target="b", position="before", axis=1)
+ a f b c d e
+0 1 a 1 1 a a
+>>> df.move(source="a", target=None, position="after", axis=1)
+ b c d e f a
+0 1 1 a a a 1
+
Move columns:
+>>> from pandas.api.types import is_numeric_dtype, is_string_dtype
+>>> df.move(source=is_string_dtype, target=None, position="before", axis=1)
+ d e f a b c
+0 a a a 1 1 1
+>>> df.move(source=is_numeric_dtype, target=None, position="after", axis=1)
+ d e f a b c
+0 a a a 1 1 1
+>>> df.move(source = ["d", "f"], target=is_numeric_dtype, position="before", axis=1)
+ d f a b c e
+0 a a 1 1 1 a
+
Parameters:
+Name | +Type | +Description | +Default | +
---|---|---|---|
+ df
+ |
+
+ DataFrame
+ |
+
+
+
+ The pandas DataFrame object. + |
+ + required + | +
+ source
+ |
+
+ Any
+ |
+
+
+
+ Columns or rows to move. + |
+ + required + | +
+ target
+ |
+
+ Any
+ |
+
+
+
+ Columns or rows to move adjacent to.
+If |
+
+ None
+ |
+
+ position
+ |
+
+ str
+ |
+
+
+
+ Specifies the destination of the columns/rows.
+Values can be either |
+
+ 'before'
+ |
+
+ axis
+ |
+
+ int
+ |
+
+
+
+ Axis along which the function is applied. 0 to move along +the index, 1 to move along the columns. + |
+
+ 0
+ |
+
Raises:
+Type | +Description | +
---|---|
+ ValueError
+ |
+
+
+
+ If |
+
+ ValueError
+ |
+
+
+
+ If |
+
Returns:
+Type | +Description | +
---|---|
+ DataFrame
+ |
+
+
+
+ The dataframe with the Series moved. + |
+
janitor/functions/move.py
pivot
+
+
+pivot_longer(df, index=None, column_names=None, names_to=None, values_to='value', column_level=None, names_sep=None, names_pattern=None, names_transform=None, dropna=False, sort_by_appearance=False, ignore_index=True)
+
+Unpivots a DataFrame from wide to long format.
+This method does not mutate the original DataFrame.
+It is modeled after the pivot_longer
function in R's tidyr package,
+and also takes inspiration from R's data.table package.
This function is useful to massage a DataFrame into a format where +one or more columns are considered measured variables, and all other +columns are considered as identifier variables.
+All measured variables are unpivoted (and typically duplicated) along the +row axis.
+Column selection in index
and column_names
is possible using the
+select
syntax.
For more granular control on the unpivoting, have a look at
+pivot_longer_spec
.
Examples:
+>>> import pandas as pd
+>>> import janitor
+>>> df = pd.DataFrame(
+... {
+... "Sepal.Length": [5.1, 5.9],
+... "Sepal.Width": [3.5, 3.0],
+... "Petal.Length": [1.4, 5.1],
+... "Petal.Width": [0.2, 1.8],
+... "Species": ["setosa", "virginica"],
+... }
+... )
+>>> df
+ Sepal.Length Sepal.Width Petal.Length Petal.Width Species
+0 5.1 3.5 1.4 0.2 setosa
+1 5.9 3.0 5.1 1.8 virginica
+
Replicate pandas' melt:
+>>> df.pivot_longer(index = 'Species')
+ Species variable value
+0 setosa Sepal.Length 5.1
+1 virginica Sepal.Length 5.9
+2 setosa Sepal.Width 3.5
+3 virginica Sepal.Width 3.0
+4 setosa Petal.Length 1.4
+5 virginica Petal.Length 5.1
+6 setosa Petal.Width 0.2
+7 virginica Petal.Width 1.8
+
Convenient, flexible column selection in the index
via the
+select
syntax:
>>> from pandas.api.types import is_string_dtype
+>>> df.pivot_longer(index = is_string_dtype)
+ Species variable value
+0 setosa Sepal.Length 5.1
+1 virginica Sepal.Length 5.9
+2 setosa Sepal.Width 3.5
+3 virginica Sepal.Width 3.0
+4 setosa Petal.Length 1.4
+5 virginica Petal.Length 5.1
+6 setosa Petal.Width 0.2
+7 virginica Petal.Width 1.8
+
Split the column labels into individual columns:
+>>> df.pivot_longer(
+... index = 'Species',
+... names_to = ('part', 'dimension'),
+... names_sep = '.',
+... sort_by_appearance = True,
+... )
+ Species part dimension value
+0 setosa Sepal Length 5.1
+1 setosa Sepal Width 3.5
+2 setosa Petal Length 1.4
+3 setosa Petal Width 0.2
+4 virginica Sepal Length 5.9
+5 virginica Sepal Width 3.0
+6 virginica Petal Length 5.1
+7 virginica Petal Width 1.8
+
Retain parts of the column names as headers:
+>>> df.pivot_longer(
+... index = 'Species',
+... names_to = ('part', '.value'),
+... names_sep = '.',
+... sort_by_appearance = True,
+... )
+ Species part Length Width
+0 setosa Sepal 5.1 3.5
+1 setosa Petal 1.4 0.2
+2 virginica Sepal 5.9 3.0
+3 virginica Petal 5.1 1.8
+
Split the column labels based on regex:
+>>> df = pd.DataFrame({"id": [1], "new_sp_m5564": [2], "newrel_f65": [3]})
+>>> df
+ id new_sp_m5564 newrel_f65
+0 1 2 3
+>>> df.pivot_longer(
+... index = 'id',
+... names_to = ('diagnosis', 'gender', 'age'),
+... names_pattern = r"new_?(.+)_(.)(\d+)",
+... )
+ id diagnosis gender age value
+0 1 sp m 5564 2
+1 1 rel f 65 3
+
Split the column labels for the above dataframe using named groups in names_pattern
:
>>> df.pivot_longer(
+... index = 'id',
+... names_pattern = r"new_?(?P<diagnosis>.+)_(?P<gender>.)(?P<age>\d+)",
+... )
+ id diagnosis gender age value
+0 1 sp m 5564 2
+1 1 rel f 65 3
+
Convert the dtypes of specific columns with names_transform
:
>>> result = (df
+... .pivot_longer(
+... index = 'id',
+... names_to = ('diagnosis', 'gender', 'age'),
+... names_pattern = r"new_?(.+)_(.)(\d+)",
+... names_transform = {'gender': 'category', 'age':'int'})
+... )
+>>> result.dtypes
+id int64
+diagnosis object
+gender category
+age int64
+value int64
+dtype: object
+
Use multiple .value
to reshape the dataframe:
>>> df = pd.DataFrame(
+... [
+... {
+... "x_1_mean": 10,
+... "x_2_mean": 20,
+... "y_1_mean": 30,
+... "y_2_mean": 40,
+... "unit": 50,
+... }
+... ]
+... )
+>>> df
+ x_1_mean x_2_mean y_1_mean y_2_mean unit
+0 10 20 30 40 50
+>>> df.pivot_longer(
+... index="unit",
+... names_to=(".value", "time", ".value"),
+... names_pattern=r"(x|y)_([0-9])(_mean)",
+... )
+ unit time x_mean y_mean
+0 50 1 10 30
+1 50 2 20 40
+
Replicate the above with named groups in names_pattern
- use _
instead of .value
:
>>> df.pivot_longer(
+... index="unit",
+... names_pattern=r"(?P<_>x|y)_(?P<time>[0-9])(?P<__>_mean)",
+... )
+ unit time x_mean y_mean
+0 50 1 10 30
+1 50 2 20 40
+
Convenient, flexible column selection in the column_names
via
+the select
syntax:
>>> df.pivot_longer(
+... column_names="*mean",
+... names_to=(".value", "time", ".value"),
+... names_pattern=r"(x|y)_([0-9])(_mean)",
+... )
+ unit time x_mean y_mean
+0 50 1 10 30
+1 50 2 20 40
+
>>> df.pivot_longer(
+... column_names=slice("x_1_mean", "y_2_mean"),
+... names_to=(".value", "time", ".value"),
+... names_pattern=r"(x|y)_([0-9])(_mean)",
+... )
+ unit time x_mean y_mean
+0 50 1 10 30
+1 50 2 20 40
+
Reshape the dataframe by passing a sequence to names_pattern
:
>>> df = pd.DataFrame({'hr1': [514, 573],
+... 'hr2': [545, 526],
+... 'team': ['Red Sox', 'Yankees'],
+... 'year1': [2007, 2007],
+... 'year2': [2008, 2008]})
+>>> df
+ hr1 hr2 team year1 year2
+0 514 545 Red Sox 2007 2008
+1 573 526 Yankees 2007 2008
+>>> df.pivot_longer(
+... index = 'team',
+... names_to = ['year', 'hr'],
+... names_pattern = ['year', 'hr']
+... )
+ team hr year
+0 Red Sox 514 2007
+1 Yankees 573 2007
+2 Red Sox 545 2008
+3 Yankees 526 2008
+
Reshape the above dataframe by passing a dictionary to names_pattern
:
>>> df.pivot_longer(
+... index = 'team',
+... names_pattern = {"year":"year", "hr":"hr"}
+... )
+ team hr year
+0 Red Sox 514 2007
+1 Yankees 573 2007
+2 Red Sox 545 2008
+3 Yankees 526 2008
+
Multiple values_to:
+>>> df = pd.DataFrame(
+... {
+... "City": ["Houston", "Austin", "Hoover"],
+... "State": ["Texas", "Texas", "Alabama"],
+... "Name": ["Aria", "Penelope", "Niko"],
+... "Mango": [4, 10, 90],
+... "Orange": [10, 8, 14],
+... "Watermelon": [40, 99, 43],
+... "Gin": [16, 200, 34],
+... "Vodka": [20, 33, 18],
+... },
+... )
+>>> df
+ City State Name Mango Orange Watermelon Gin Vodka
+0 Houston Texas Aria 4 10 40 16 20
+1 Austin Texas Penelope 10 8 99 200 33
+2 Hoover Alabama Niko 90 14 43 34 18
+>>> df.pivot_longer(
+... index=["City", "State"],
+... column_names=slice("Mango", "Vodka"),
+... names_to=("Fruit", "Drink"),
+... values_to=("Pounds", "Ounces"),
+... names_pattern=["M|O|W", "G|V"],
+... )
+ City State Fruit Drink Pounds Ounces
+0 Houston Texas Mango Gin 4 16.0
+1 Austin Texas Mango Gin 10 200.0
+2 Hoover Alabama Mango Gin 90 34.0
+3 Houston Texas Orange Vodka 10 20.0
+4 Austin Texas Orange Vodka 8 33.0
+5 Hoover Alabama Orange Vodka 14 18.0
+6 Houston Texas Watermelon None 40 NaN
+7 Austin Texas Watermelon None 99 NaN
+8 Hoover Alabama Watermelon None 43 NaN
+
Replicate the above transformation with a nested dictionary passed to names_pattern
+- the outer keys in the names_pattern
dictionary are passed to names_to
,
+while the inner keys are passed to values_to
:
>>> df.pivot_longer(
+... index=["City", "State"],
+... column_names=slice("Mango", "Vodka"),
+... names_pattern={
+... "Fruit": {"Pounds": "M|O|W"},
+... "Drink": {"Ounces": "G|V"},
+... },
+... )
+ City State Fruit Drink Pounds Ounces
+0 Houston Texas Mango Gin 4 16.0
+1 Austin Texas Mango Gin 10 200.0
+2 Hoover Alabama Mango Gin 90 34.0
+3 Houston Texas Orange Vodka 10 20.0
+4 Austin Texas Orange Vodka 8 33.0
+5 Hoover Alabama Orange Vodka 14 18.0
+6 Houston Texas Watermelon None 40 NaN
+7 Austin Texas Watermelon None 99 NaN
+8 Hoover Alabama Watermelon None 43 NaN
+
Version Changed
+dropna
parameter.names_pattern
can accept a dictionary.names_pattern
.Parameters:
+Name | +Type | +Description | +Default | +
---|---|---|---|
+ df
+ |
+
+ DataFrame
+ |
+
+
+
+ A pandas DataFrame. + |
+ + required + | +
+ index
+ |
+
+ list | tuple | str | Pattern
+ |
+
+
+
+ Name(s) of columns to use as identifier variables.
+Should be either a single column name, or a list/tuple of
+column names.
+ |
+
+ None
+ |
+
+ column_names
+ |
+
+ list | tuple | str | Pattern
+ |
+
+
+
+ Name(s) of columns to unpivot. Should be either
+a single column name or a list/tuple of column names.
+ |
+
+ None
+ |
+
+ names_to
+ |
+
+ list | tuple | str
+ |
+
+
+
+ Name of new column as a string that will contain
+what were previously the column names in |
+
+ None
+ |
+
+ values_to
+ |
+
+ str
+ |
+
+
+
+ Name of new column as a string that will contain what
+were previously the values of the columns in |
+
+ 'value'
+ |
+
+ column_level
+ |
+
+ int | str
+ |
+
+
+
+ If columns are a MultiIndex, then use this level to
+unpivot the DataFrame. Provided for compatibility with pandas' melt,
+and applies only if neither |
+
+ None
+ |
+
+ names_sep
+ |
+
+ str | Pattern
+ |
+
+
+
+ Determines how the column name is broken up, if
+ |
+
+ None
+ |
+
+ names_pattern
+ |
+
+ list | tuple | str | Pattern
+ |
+
+
+
+ Determines how the column name is broken up.
+It can be a regular expression containing matching groups.
+Under the hood it is processed with pandas' |
+
+ None
+ |
+
+ names_transform
+ |
+
+ str | Callable | dict
+ |
+
+
+
+ Use this option to change the types of columns that
+have been transformed to rows. This does not apply to the values columns.
+Accepts any argument that is acceptable by |
+
+ None
+ |
+
+ dropna
+ |
+
+ bool
+ |
+
+
+
+ Determines whether or not to drop nulls
+from the values columns. Default is |
+
+ False
+ |
+
+ sort_by_appearance
+ |
+
+ bool
+ |
+
+
+
+ Boolean value that determines
+the final look of the DataFrame. If |
+
+ False
+ |
+
+ ignore_index
+ |
+
+ bool
+ |
+
+
+
+ If |
+
+ True
+ |
+
Returns:
+Type | +Description | +
---|---|
+ DataFrame
+ |
+
+
+
+ A pandas DataFrame that has been unpivoted from wide to long +format. + |
+
janitor/functions/pivot.py
pivot_longer_spec(df, spec, sort_by_appearance=False, ignore_index=True, dropna=False, df_columns_is_unique=True)
+
+A declarative interface to pivot a DataFrame from wide to long form,
+where you describe how the data will be unpivoted,
+using a DataFrame. This gives you, the user,
+more control over unpivoting, where you create a “spec”
+data frame that describes exactly how data stored
+in the column names becomes variables.
+It can come in handy for situations where
+pivot_longer
+seems inadequate for the transformation.
New in version 0.28.0
+Examples:
+>>> import pandas as pd
+>>> import janitor
+>>> df = pd.DataFrame(
+... {
+... "Sepal.Length": [5.1, 5.9],
+... "Sepal.Width": [3.5, 3.0],
+... "Petal.Length": [1.4, 5.1],
+... "Petal.Width": [0.2, 1.8],
+... "Species": ["setosa", "virginica"],
+... }
+... )
+>>> df
+ Sepal.Length Sepal.Width Petal.Length Petal.Width Species
+0 5.1 3.5 1.4 0.2 setosa
+1 5.9 3.0 5.1 1.8 virginica
+>>> spec = {'.name':['Sepal.Length','Petal.Length',
+... 'Sepal.Width','Petal.Width'],
+... '.value':['Length','Length','Width','Width'],
+... 'part':['Sepal','Petal','Sepal','Petal']}
+>>> spec = pd.DataFrame(spec)
+>>> spec
+ .name .value part
+0 Sepal.Length Length Sepal
+1 Petal.Length Length Petal
+2 Sepal.Width Width Sepal
+3 Petal.Width Width Petal
+>>> pivot_longer_spec(df=df,spec=spec)
+ Species part Length Width
+0 setosa Sepal 5.1 3.5
+1 virginica Sepal 5.9 3.0
+2 setosa Petal 1.4 0.2
+3 virginica Petal 5.1 1.8
+
Parameters:
+Name | +Type | +Description | +Default | +
---|---|---|---|
+ df
+ |
+
+ DataFrame
+ |
+
+
+
+ The source DataFrame to unpivot. + |
+ + required + | +
+ spec
+ |
+
+ DataFrame
+ |
+
+
+
+ A specification DataFrame. At a minimum, the spec DataFrame must have '.name' and '.value' columns. The '.name' column should contain the columns in the source DataFrame that will be transformed to long form. The '.value' column gives the name of the column(s) that the values in the source DataFrame will go into. Additional columns in spec should be named to match columns in the long format of the dataset and contain values corresponding to columns pivoted from the wide format. Note that these additional columns should not already exist in the source DataFrame.
+ + required + | +
+ sort_by_appearance
+ |
+
+ bool
+ |
+
+
+
+ Boolean value that determines
+the final look of the DataFrame. If |
+
+ False
+ |
+
+ ignore_index
+ |
+
+ bool
+ |
+
+
+
+ If |
+
+ True
+ |
+
+ dropna
+ |
+
+ bool
+ |
+
+
+
+ Determines whether or not to drop nulls
+from the values columns. Default is |
+
+ False
+ |
+
+ df_columns_is_unique
+ |
+
+ bool
+ |
+
+
+
+ Boolean value indicating whether the source
+DataFrame's columns are unique. Default is |
+
+ True
+ |
+
Raises:
+Type | +Description | +
---|---|
+ KeyError
+ |
+
+
+
+ If '.name' or '.value' is missing from the spec's columns. + |
+
+ ValueError
+ |
+
+
+
+ If the spec's columns are not unique, or the labels in spec['.name'] are not unique.
+
Returns:
+Type | +Description | +
---|---|
+ DataFrame
+ |
+
+
+
+ A pandas DataFrame. + |
+
janitor/functions/pivot.py
pivot_wider(df, index=None, names_from=None, values_from=None, flatten_levels=True, names_sep='_', names_glue=None, reset_index=True, names_expand=False, index_expand=False)
+
+Reshapes data from long to wide form.
+Note
+This function will be deprecated in a 1.x release.
+Please use pd.DataFrame.pivot
instead.
The number of columns is increased, while the number
+of rows is decreased. It is the inverse of the
+pivot_longer
+method, and is a wrapper around the pd.DataFrame.pivot
 method.
This method does not mutate the original DataFrame.
+Column selection in index
, names_from
and values_from
+is possible using the
+select
syntax.
A ValueError is raised if the combination
+of the index
and names_from
is not unique.
By default, values from values_from
are always
+at the top level if the columns are not flattened.
+If flattened, the values from values_from
are usually
+at the start of each label in the columns.
Examples:
+>>> import pandas as pd
+>>> import janitor
+>>> df = [{'dep': 5.5, 'step': 1, 'a': 20, 'b': 30},
+... {'dep': 5.5, 'step': 2, 'a': 25, 'b': 37},
+... {'dep': 6.1, 'step': 1, 'a': 22, 'b': 19},
+... {'dep': 6.1, 'step': 2, 'a': 18, 'b': 29}]
+>>> df = pd.DataFrame(df)
+>>> df
+ dep step a b
+0 5.5 1 20 30
+1 5.5 2 25 37
+2 6.1 1 22 19
+3 6.1 2 18 29
+
Pivot and flatten columns:
+>>> df.pivot_wider(
+... index = "dep",
+... names_from = "step",
+... )
+ dep a_1 a_2 b_1 b_2
+0 5.5 20 25 30 37
+1 6.1 22 18 19 29
+
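+Since pivot_wider wraps pd.DataFrame.pivot, a rough pandas-only
+equivalent of the example above (a sketch, assuming the default settings) is:
+>>> out = df.pivot(index="dep", columns="step")
+>>> out.columns = [f"{value}_{step}" for value, step in out.columns]
+>>> out.reset_index()
+ dep a_1 a_2 b_1 b_2
+0 5.5 20 25 30 37
+1 6.1 22 18 19 29
+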
Modify columns with names_sep
:
>>> df.pivot_wider(
+... index = "dep",
+... names_from = "step",
+... names_sep = "",
+... )
+ dep a1 a2 b1 b2
+0 5.5 20 25 30 37
+1 6.1 22 18 19 29
+
Modify columns with names_glue
:
>>> df.pivot_wider(
+... index = "dep",
+... names_from = "step",
+... names_glue = "{_value}_step{step}",
+... )
+ dep a_step1 a_step2 b_step1 b_step2
+0 5.5 20 25 30 37
+1 6.1 22 18 19 29
+
Expand columns to expose implicit missing values +- this applies only to categorical columns:
+>>> weekdays = ("Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun")
+>>> daily = pd.DataFrame(
+... {
+... "day": pd.Categorical(
+... values=("Tue", "Thu", "Fri", "Mon"), categories=weekdays
+... ),
+... "value": (2, 3, 1, 5),
+... },
+... index=[0, 0, 0, 0],
+... )
+>>> daily
+ day value
+0 Tue 2
+0 Thu 3
+0 Fri 1
+0 Mon 5
+>>> daily.pivot_wider(names_from='day', values_from='value')
+ Tue Thu Fri Mon
+0 2 3 1 5
+>>> (daily
+... .pivot_wider(
+... names_from='day',
+... values_from='value',
+... names_expand=True)
+... )
+ Mon Tue Wed Thu Fri Sat Sun
+0 5 2 NaN 3 1 NaN NaN
+
Expand the index to expose implicit missing values +- this applies only to categorical columns:
+>>> daily = daily.assign(letter = list('ABBA'))
+>>> daily
+ day value letter
+0 Tue 2 A
+0 Thu 3 B
+0 Fri 1 B
+0 Mon 5 A
+>>> daily.pivot_wider(index='day',names_from='letter',values_from='value')
+ day A B
+0 Tue 2.0 NaN
+1 Thu NaN 3.0
+2 Fri NaN 1.0
+3 Mon 5.0 NaN
+>>> (daily
+... .pivot_wider(
+... index='day',
+... names_from='letter',
+... values_from='value',
+... index_expand=True)
+... )
+ day A B
+0 Mon 5.0 NaN
+1 Tue 2.0 NaN
+2 Wed NaN NaN
+3 Thu NaN 3.0
+4 Fri NaN 1.0
+5 Sat NaN NaN
+6 Sun NaN NaN
+
Version Changed
+reset_index
, names_expand
and index_expand
parameters.Parameters:
+Name | +Type | +Description | +Default | +
---|---|---|---|
+ df
+ |
+
+ DataFrame
+ |
+
+
+
+ A pandas DataFrame. + |
+ + required + | +
+ index
+ |
+
+ list | str
+ |
+
+
+
+ Name(s) of columns to use as identifier variables.
+It should be either a single column name, or a list of column names.
+If |
+
+ None
+ |
+
+ names_from
+ |
+
+ list | str
+ |
+
+
+
+ Name(s) of column(s) to use to make the new +DataFrame's columns. Should be either a single column name, +or a list of column names. + |
+
+ None
+ |
+
+ values_from
+ |
+
+ list | str
+ |
+
+
+
+ Name(s) of column(s) that will be used for populating
+the new DataFrame's values.
+If |
+
+ None
+ |
+
+ flatten_levels
+ |
+
+ bool
+ |
+
+
+
+ If |
+
+ True
+ |
+
+ names_sep
+ |
+
+ str
+ |
+
+
+
+ If |
+
+ '_'
+ |
+
+ names_glue
+ |
+
+ str
+ |
+
+
+
+ A string to control the output of the flattened columns.
+It offers more flexibility in creating custom column names,
+and uses python's |
+
+ None
+ |
+
+ reset_index
+ |
+
+ bool
+ |
+
+
+
+ Determines whether to restore |
+
+ True
+ |
+
+ names_expand
+ |
+
+ bool
+ |
+
+
+
+ Expand columns to show all the categories.
+Applies only if |
+
+ False
+ |
+
+ index_expand
+ |
+
+ bool
+ |
+
+
+
+ Expand the index to show all the categories.
+Applies only if |
+
+ False
+ |
+
Returns:
+Type | +Description | +
---|---|
+ DataFrame
+ |
+
+
+
+ A pandas DataFrame that has been pivoted from long to wide form.
+
janitor/functions/pivot.py
process_text
+
+
+Implementation source for process_text
.
process_text(df, column_name, string_function, **kwargs)
+
+Apply a Pandas string method to an existing column.
This function aims to make string cleaning easy while method-chaining: simply pass the string method's name, along with any keyword arguments, to this function.
+This modifies an existing column; it does not create a new column.
+New columns can be created via pyjanitor's
+transform_columns
.
A list of all the string methods in Pandas can be found in the pandas string handling documentation.
+Note
+This function will be deprecated in a 1.x release.
+Please use jn.transform_column
+instead.
Examples:
+>>> import pandas as pd
+>>> import janitor
+>>> import re
+>>> df = pd.DataFrame({"text": ["Ragnar", "sammywemmy", "ginger"],
+... "code": [1, 2, 3]})
+>>> df
+ text code
+0 Ragnar 1
+1 sammywemmy 2
+2 ginger 3
+>>> df.process_text(column_name="text", string_function="lower")
+ text code
+0 ragnar 1
+1 sammywemmy 2
+2 ginger 3
+
For string methods with parameters, simply pass the keyword arguments:
+>>> df.process_text(
+... column_name="text",
+... string_function="extract",
+... pat=r"(ag)",
+... expand=False,
+... flags=re.IGNORECASE,
+... )
+ text code
+0 ag 1
+1 NaN 2
+2 NaN 3
+
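+Per the deprecation note above, the lowercasing example can be written with
+transform_column (a sketch; with elementwise=False the whole Series is passed
+to the function, so .str methods apply):
+>>> df.transform_column(
+...     column_name="text",
+...     function=lambda srs: srs.str.lower(),
+...     elementwise=False,
+... )
+ text code
+0 ragnar 1
+1 sammywemmy 2
+2 ginger 3
+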
Parameters:
+Name | +Type | +Description | +Default | +
---|---|---|---|
+ df
+ |
+
+ DataFrame
+ |
+
+
+
+ A pandas DataFrame. + |
+ + required + | +
+ column_name
+ |
+
+ str
+ |
+
+
+
+ String column to be operated on. + |
+ + required + | +
+ string_function
+ |
+
+ str
+ |
+
+
+
+ pandas string method to be applied. + |
+ + required + | +
+ **kwargs
+ |
+
+ Any
+ |
+
+
+
+ Keyword arguments for parameters of the |
+
+ {}
+ |
+
Raises:
+Type | +Description | +
---|---|
+ KeyError
+ |
+
+
+
+ If |
+
+ ValueError
+ |
+
+
+
+ If the text function returns a DataFrame, instead of a Series. + |
+
Returns:
+Type | +Description | +
---|---|
+ DataFrame
+ |
+
+
+
+ A pandas DataFrame with modified column. + |
+
janitor/functions/process_text.py
remove_columns
+
+
+Implementation of remove_columns.
+ + + + + + + + +remove_columns(df, column_names)
+
+Remove the set of columns specified in column_names
.
This method does not mutate the original DataFrame.
+Intended to be the method-chaining alternative to del df[col]
.
Note
+This function will be deprecated in a 1.x release.
+Kindly use pd.DataFrame.drop
instead.
Examples:
+>>> import pandas as pd
+>>> import janitor
+>>> df = pd.DataFrame({"a": [2, 4, 6], "b": [1, 3, 5], "c": [7, 8, 9]})
+>>> df
+ a b c
+0 2 1 7
+1 4 3 8
+2 6 5 9
+>>> df.remove_columns(column_names=['a', 'c'])
+ b
+0 1
+1 3
+2 5
+
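+As the note above suggests, the same result with pandas directly:
+>>> df.drop(columns=['a', 'c'])
+ b
+0 1
+1 3
+2 5
+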
Parameters:
+Name | +Type | +Description | +Default | +
---|---|---|---|
+ df
+ |
+
+ DataFrame
+ |
+
+
+
+ A pandas DataFrame. + |
+ + required + | +
+ column_names
+ |
+
+ Union[str, Iterable[str], Hashable]
+ |
+
+
+
+ The columns to remove. + |
+ + required + | +
Returns:
+Type | +Description | +
---|---|
+ DataFrame
+ |
+
+
+
+ A pandas DataFrame. + |
+
janitor/functions/remove_columns.py
remove_empty
+
+
+Implementation of remove_empty.
+ + + + + + + + +remove_empty(df, reset_index=True)
+
+Drop all rows and columns that are completely null.
+This method does not mutate the original DataFrame.
+Implementation is inspired from StackOverflow.
+ + +Examples:
+>>> import numpy as np
+>>> import pandas as pd
+>>> import janitor
+>>> df = pd.DataFrame({
+... "a": [1, np.nan, 2],
+... "b": [3, np.nan, 4],
+... "c": [np.nan, np.nan, np.nan],
+... })
+>>> df
+ a b c
+0 1.0 3.0 NaN
+1 NaN NaN NaN
+2 2.0 4.0 NaN
+>>> df.remove_empty()
+ a b
+0 1.0 3.0
+1 2.0 4.0
+
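+A roughly equivalent pandas-only chain (a sketch; remove_empty also resets
+the index by default):
+>>> df.dropna(how='all').dropna(axis='columns', how='all').reset_index(drop=True)
+ a b
+0 1.0 3.0
+1 2.0 4.0
+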
Parameters:
+Name | +Type | +Description | +Default | +
---|---|---|---|
+ df
+ |
+
+ DataFrame
+ |
+
+
+
+ The pandas DataFrame object. + |
+ + required + | +
+ reset_index
+ |
+
+ bool
+ |
+
+
+
+ Determines if the index is reset. + |
+
+ True
+ |
+
Returns:
+Type | +Description | +
---|---|
+ DataFrame
+ |
+
+
+
+ A pandas DataFrame. + |
+
janitor/functions/remove_empty.py
rename_columns
+
+
+rename_column(df, old_column_name, new_column_name)
+
+Rename a single column.
+This method does not mutate the original DataFrame.
+Note
+This function will be deprecated in a 1.x release.
+Please use pd.DataFrame.rename
instead.
This is just syntactic sugar/a convenience function for renaming one column at a time.
+If multiple columns need to be renamed,
+then use the pandas.DataFrame.rename
method.
Examples:
+Change the name of column 'a' to 'a_new'.
+>>> import pandas as pd
+>>> import janitor
+>>> df = pd.DataFrame({"a": list(range(3)), "b": list("abc")})
+>>> df.rename_column(old_column_name='a', new_column_name='a_new')
+ a_new b
+0 0 a
+1 1 b
+2 2 c
+
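+The plain pandas spelling recommended by the note above:
+>>> df.rename(columns={'a': 'a_new'})
+ a_new b
+0 0 a
+1 1 b
+2 2 c
+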
Parameters:
+Name | +Type | +Description | +Default | +
---|---|---|---|
+ df
+ |
+
+ DataFrame
+ |
+
+
+
+ The pandas DataFrame object. + |
+ + required + | +
+ old_column_name
+ |
+
+ str
+ |
+
+
+
+ The old column name. + |
+ + required + | +
+ new_column_name
+ |
+
+ str
+ |
+
+
+
+ The new column name. + |
+ + required + | +
Returns:
+Type | +Description | +
---|---|
+ DataFrame
+ |
+
+
+
+ A pandas DataFrame with renamed columns. + |
+
janitor/functions/rename_columns.py
rename_columns(df, new_column_names=None, function=None)
+
+Rename columns.
+This method does not mutate the original DataFrame.
+Note
+This function will be deprecated in a 1.x release.
+Please use pd.DataFrame.rename
instead.
One of new_column_names
 or function
 must be provided.
+If both are provided, then new_column_names
takes priority and function
+is never executed.
Examples:
+Rename columns using a dictionary which maps old names to new names.
+>>> import pandas as pd
+>>> import janitor
+>>> df = pd.DataFrame({"a": list(range(3)), "b": list("xyz")})
+>>> df
+ a b
+0 0 x
+1 1 y
+2 2 z
+>>> df.rename_columns(new_column_names={"a": "a_new", "b": "b_new"})
+ a_new b_new
+0 0 x
+1 1 y
+2 2 z
+
Rename columns using a generic callable.
+>>> import pandas as pd
+>>> import janitor
+>>> df = pd.DataFrame({"a": list(range(3)), "b": list("xyz")})
+>>> df.rename_columns(function=str.upper)
+ A B
+0 0 x
+1 1 y
+2 2 z
+
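+The pandas equivalent recommended by the note above also covers the callable case:
+>>> df.rename(columns=str.upper)
+ A B
+0 0 x
+1 1 y
+2 2 z
+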
Parameters:
+Name | +Type | +Description | +Default | +
---|---|---|---|
+ df
+ |
+
+ DataFrame
+ |
+
+
+
+ The pandas DataFrame object. + |
+ + required + | +
+ new_column_names
+ |
+
+ Union[Dict, None]
+ |
+
+
+
+ A dictionary of old and new column names. + |
+
+ None
+ |
+
+ function
+ |
+
+ Callable
+ |
+
+
+
+ A function which should be applied to all the columns. + |
+
+ None
+ |
+
Raises:
+Type | +Description | +
---|---|
+ ValueError
+ |
+
+
+
+ If both |
+
Returns:
+Type | +Description | +
---|---|
+ DataFrame
+ |
+
+
+
+ A pandas DataFrame with renamed columns. + |
+
janitor/functions/rename_columns.py
reorder_columns
+
+
+Implementation source for reorder_columns
.
reorder_columns(df, column_order)
+
+Reorder DataFrame columns by specifying desired order as list of col names.
+Columns not specified retain their order and follow after the columns specified
+in column_order
.
All columns specified within the column_order
list must be present within df
.
This method does not mutate the original DataFrame.
+ + +Examples:
+>>> import pandas as pd
+>>> import janitor
+>>> df = pd.DataFrame({"col1": [1, 1, 1], "col2": [2, 2, 2], "col3": [3, 3, 3]})
+>>> df
+ col1 col2 col3
+0 1 2 3
+1 1 2 3
+2 1 2 3
+>>> df.reorder_columns(['col3', 'col1'])
+ col3 col1 col2
+0 3 1 2
+1 3 1 2
+2 3 1 2
+
Notice that the column order of df
is now col3
, col1
, col2
.
Internally, this function uses DataFrame.reindex
with copy=False
+to avoid unnecessary data duplication.
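+A minimal sketch of that reindex call for the example above
+(remaining is a hypothetical helper variable, not part of the API):
+>>> remaining = [col for col in df.columns if col not in ('col3', 'col1')]
+>>> df.reindex(columns=['col3', 'col1'] + remaining, copy=False)
+ col3 col1 col2
+0 3 1 2
+1 3 1 2
+2 3 1 2
+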
Parameters:
+Name | +Type | +Description | +Default | +
---|---|---|---|
+ df
+ |
+
+ DataFrame
+ |
+
+
+
+
|
+ + required + | +
+ column_order
+ |
+
+ Union[Iterable[str], Index, Hashable]
+ |
+
+
+
+ A list of column names or Pandas |
+ + required + | +
Raises:
+Type | +Description | +
---|---|
+ IndexError
+ |
+
+
+
+ If a column within |
+
Returns:
+Type | +Description | +
---|---|
+ DataFrame
+ |
+
+
+
+ A pandas DataFrame with reordered columns. + |
+
janitor/functions/reorder_columns.py
round_to_fraction
+
+
+Implementation of round_to_fraction
round_to_fraction(df, column_name, denominator, digits=np.inf)
+
+Round all values in a column to a fraction.
+This method mutates the original DataFrame.
+Taken from the R janitor package.
+Also, optionally round to a specified number of digits.
+ + +Examples:
+Round numeric column to the nearest 1/4 value.
+>>> import numpy as np
+>>> import pandas as pd
+>>> import janitor
+>>> df = pd.DataFrame({
+... "a1": [1.263, 2.499, np.nan],
+... "a2": ["x", "y", "z"],
+... })
+>>> df
+ a1 a2
+0 1.263 x
+1 2.499 y
+2 NaN z
+>>> df.round_to_fraction("a1", denominator=4)
+ a1 a2
+0 1.25 x
+1 2.50 y
+2 NaN z
+
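+The underlying arithmetic is essentially round(value * denominator) / denominator
+(a sketch of the idea, not the exact implementation):
+>>> (df['a1'] * 4).round() / 4
+0 1.25
+1 2.50
+2 NaN
+Name: a1, dtype: float64
+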
Parameters:
+Name | +Type | +Description | +Default | +
---|---|---|---|
+ df
+ |
+
+ DataFrame
+ |
+
+
+
+ A pandas DataFrame. + |
+ + required + | +
+ column_name
+ |
+
+ Hashable
+ |
+
+
+
+ Name of column to round to fraction. + |
+ + required + | +
+ denominator
+ |
+
+ float
+ |
+
+
+
+ The denominator of the fraction for rounding. Must be +a positive number. + |
+ + required + | +
+ digits
+ |
+
+ float
+ |
+
+
+
+ The number of digits for rounding after rounding to the +fraction. Default is np.inf (i.e. no subsequent rounding). + |
+
+ inf
+ |
+
Raises:
+Type | +Description | +
---|---|
+ ValueError
+ |
+
+
+
+ If |
+
Returns:
+Type | +Description | +
---|---|
+ DataFrame
+ |
+
+
+
+ A pandas DataFrame with a column's values rounded. + |
+
janitor/functions/round_to_fraction.py
row_to_names
+
+
+Implementation of the row_to_names
function.
row_to_names(df, row_numbers=0, remove_rows=False, remove_rows_above=False, reset_index=False)
+
+Elevates a row, or rows, to be the column names of a DataFrame.
+This method does not mutate the original DataFrame.
+Contains options to remove the elevated row from the DataFrame along with +removing the rows above the selected row.
+ + +Examples:
+Replace column names with the first row and reset the index.
+>>> import pandas as pd
+>>> import janitor
+>>> df = pd.DataFrame({
+... "a": ["nums", 6, 9],
+... "b": ["chars", "x", "y"],
+... })
+>>> df
+ a b
+0 nums chars
+1 6 x
+2 9 y
+>>> df.row_to_names(0, remove_rows=True, reset_index=True)
+ nums chars
+0 6 x
+1 9 y
+>>> df.row_to_names([0,1], remove_rows=True, reset_index=True)
+ nums chars
+ 6 x
+0 9 y
+
Remove rows above the elevated row and the elevated row itself.
+>>> import pandas as pd
+>>> import janitor
+>>> df = pd.DataFrame({
+... "a": ["bla1", "nums", 6, 9],
+... "b": ["bla2", "chars", "x", "y"],
+... })
+>>> df
+ a b
+0 bla1 bla2
+1 nums chars
+2 6 x
+3 9 y
+>>> df.row_to_names(1, remove_rows=True, remove_rows_above=True, reset_index=True)
+ nums chars
+0 6 x
+1 9 y
+
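+For the single-row case, a rough pandas-only sketch of what happens
+(out is a hypothetical name):
+>>> out = df.copy()
+>>> out.columns = out.iloc[1]  # elevate row 1 to the header
+>>> out = out.iloc[2:].reset_index(drop=True)  # drop that row and the rows above it
+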
Parameters:
+Name | +Type | +Description | +Default | +
---|---|---|---|
+ df
+ |
+
+ DataFrame
+ |
+
+
+
+ A pandas DataFrame. + |
+ + required + | +
+ row_numbers
+ |
+
+ int | list | slice
+ |
+
+
+
+ Position of the row(s) containing the variable names. +It can be an integer, a list or a slice. +Defaults to 0 (first row). + |
+
+ 0
+ |
+
+ remove_rows
+ |
+
+ bool
+ |
+
+
+
+ Whether the row(s) should be removed from the DataFrame. + |
+
+ False
+ |
+
+ remove_rows_above
+ |
+
+ bool
+ |
+
+
+
+ Whether the row(s) above the selected row should +be removed from the DataFrame. + |
+
+ False
+ |
+
+ reset_index
+ |
+
+ bool
+ |
+
+
+
+ Whether the index should be reset on the returning DataFrame. + |
+
+ False
+ |
+
Returns:
+Type | +Description | +
---|---|
+ DataFrame
+ |
+
+
+
+ A pandas DataFrame with set column names. + |
+
janitor/functions/row_to_names.py
select
+
+
+DropLabel
+
+
+
+ dataclass
+
+
+Helper class for removing labels within the select
syntax.
label
can be any of the types supported in the select
,
+select_rows
and select_columns
functions.
+An array of integers not matching the labels is returned.
New in version 0.24.0
+Parameters:
+Name | +Type | +Description | +Default | +
---|---|---|---|
+ label
+ |
+
+ Any
+ |
+
+
+
+ Label(s) to be dropped from the index. + |
+ + required + | +
janitor/functions/select.py
get_columns(group, label)
+
+Helper function for selecting columns on a grouped object,
+using the
+select
syntax.
New in version 0.25.0
+Parameters:
+Name | +Type | +Description | +Default | +
---|---|---|---|
+ group
+ |
+
+ DataFrameGroupBy | SeriesGroupBy
+ |
+
+
+
+ A Pandas GroupBy object. + |
+ + required + | +
+ label
+ |
+
+ Any
+ |
+
+
+
+ column(s) to select. + |
+ + required + | +
Returns:
+Type | +Description | +
---|---|
+ DataFrameGroupBy | SeriesGroupBy
+ |
+
+
+
+ A pandas groupby object. + |
+
janitor/functions/select.py
get_index_labels(arg, df, axis)
+
+Convenience function to get actual labels from the columns/index.
+New in version 0.25.0
+Parameters:
+Name | +Type | +Description | +Default | +
---|---|---|---|
+ arg
+ |
+
+ Any
+ |
+
+
+
+ Valid inputs include: an exact column name to look for,
+a shell-style glob string (e.g. |
+ + required + | +
+ df
+ |
+
+ DataFrame
+ |
+
+
+
+ The pandas DataFrame object. + |
+ + required + | +
+ axis
+ |
+
+ Literal['index', 'columns']
+ |
+
+
+
+ Should be either |
+ + required + | +
Returns:
+Type | +Description | +
---|---|
+ Index
+ |
+
+
+
+ A pandas Index. + |
+
janitor/functions/select.py
select(df, *args, index=None, columns=None, axis='columns', invert=False)
+
+Method-chainable selection of rows and columns.
+It accepts a string, shell-like glob strings (*string*)
,
+regex, slice, array-like object, or a list of the previous options.
Selection on a MultiIndex on a level, or multiple levels, +is possible with a dictionary.
+This method does not mutate the original DataFrame.
+Selection can be inverted with the DropLabel
class.
Optional ability to invert selection of index/columns available as well.
+New in version 0.24.0
+Note
+The preferred option when selecting columns or rows in a Pandas DataFrame
+is with .loc
or .iloc
methods, as they are generally performant.
+select
is primarily for convenience.
Version Changed
+args
, invert
and axis
parameters.rows
keyword deprecated in favour of index
.Examples:
+>>> import pandas as pd
+>>> import janitor
+>>> df = pd.DataFrame([[1, 2], [4, 5], [7, 8]],
+... index=['cobra', 'viper', 'sidewinder'],
+... columns=['max_speed', 'shield'])
+>>> df
+ max_speed shield
+cobra 1 2
+viper 4 5
+sidewinder 7 8
+>>> df.select(index='cobra', columns='shield')
+ shield
+cobra 2
+
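+As the note above recommends, the same selection with plain pandas indexing:
+>>> df.loc[['cobra'], ['shield']]
+ shield
+cobra 2
+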
Labels can be dropped with the DropLabel
class:
>>> df.select(index=DropLabel('cobra'))
+ max_speed shield
+viper 4 5
+sidewinder 7 8
+
More examples can be found in the
+select_columns
section.
Parameters:
+Name | +Type | +Description | +Default | +
---|---|---|---|
+ df
+ |
+
+ DataFrame
+ |
+
+
+
+ A pandas DataFrame. + |
+ + required + | +
+ *args
+ |
+
+ tuple
+ |
+
+
+
+ Valid inputs include: an exact index name to look for,
+a shell-style glob string (e.g. |
+
+ ()
+ |
+
+ index
+ |
+
+ Any
+ |
+
+
+
+ Valid inputs include: an exact label to look for,
+a shell-style glob string (e.g. |
+
+ None
+ |
+
+ columns
+ |
+
+ Any
+ |
+
+
+
+ Valid inputs include: an exact label to look for,
+a shell-style glob string (e.g. |
+
+ None
+ |
+
+ invert
+ |
+
+ bool
+ |
+
+
+
+ Whether or not to invert the selection. +This will result in the selection +of the complement of the rows/columns provided. + |
+
+ False
+ |
+
+ axis
+ |
+
+ str
+ |
+
+
+
+ Whether the selection should be on the index('index'), +or columns('columns'). +Applicable only for the variable args parameter. + |
+
+ 'columns'
+ |
+
Raises:
+Type | +Description | +
---|---|
+ ValueError
+ |
+
+
+
+ If args and index/columns are provided. + |
+
Returns:
+Type | +Description | +
---|---|
+ DataFrame
+ |
+
+
+
+ A pandas DataFrame with the specified rows and/or columns selected. + |
+
janitor/functions/select.py
select_columns(df, *args, invert=False)
+
+Method-chainable selection of columns.
+It accepts a string, shell-like glob strings (*string*)
,
+regex, slice, array-like object, or a list of the previous options.
Selection on a MultiIndex on a level, or multiple levels, +is possible with a dictionary.
+This method does not mutate the original DataFrame.
+Optional ability to invert selection of columns available as well.
+Note
+The preferred option when selecting columns or rows in a Pandas DataFrame
+is with .loc
or .iloc
methods.
+select_columns
is primarily for convenience.
Note
+This function will be deprecated in a 1.x release.
+Please use jn.select
instead.
Examples:
+>>> import pandas as pd
+>>> import janitor
+>>> from numpy import nan
+>>> pd.set_option("display.max_columns", None)
+>>> pd.set_option("display.expand_frame_repr", False)
+>>> pd.set_option("max_colwidth", None)
+>>> data = {'name': ['Cheetah','Owl monkey','Mountain beaver',
+... 'Greater short-tailed shrew','Cow'],
+... 'genus': ['Acinonyx', 'Aotus', 'Aplodontia', 'Blarina', 'Bos'],
+... 'vore': ['carni', 'omni', 'herbi', 'omni', 'herbi'],
+... 'order': ['Carnivora','Primates','Rodentia','Soricomorpha','Artiodactyla'],
+... 'conservation': ['lc', nan, 'nt', 'lc', 'domesticated'],
+... 'sleep_total': [12.1, 17.0, 14.4, 14.9, 4.0],
+... 'sleep_rem': [nan, 1.8, 2.4, 2.3, 0.7],
+... 'sleep_cycle': [nan, nan, nan, 0.133333333, 0.666666667],
+... 'awake': [11.9, 7.0, 9.6, 9.1, 20.0],
+... 'brainwt': [nan, 0.0155, nan, 0.00029, 0.423],
+... 'bodywt': [50.0, 0.48, 1.35, 0.019, 600.0]}
+>>> df = pd.DataFrame(data)
+>>> df
+ name genus vore order conservation sleep_total sleep_rem sleep_cycle awake brainwt bodywt
+0 Cheetah Acinonyx carni Carnivora lc 12.1 NaN NaN 11.9 NaN 50.000
+1 Owl monkey Aotus omni Primates NaN 17.0 1.8 NaN 7.0 0.01550 0.480
+2 Mountain beaver Aplodontia herbi Rodentia nt 14.4 2.4 NaN 9.6 NaN 1.350
+3 Greater short-tailed shrew Blarina omni Soricomorpha lc 14.9 2.3 0.133333 9.1 0.00029 0.019
+4 Cow Bos herbi Artiodactyla domesticated 4.0 0.7 0.666667 20.0 0.42300 600.000
+
Explicit label selection:
+>>> df.select_columns('name', 'order')
+ name order
+0 Cheetah Carnivora
+1 Owl monkey Primates
+2 Mountain beaver Rodentia
+3 Greater short-tailed shrew Soricomorpha
+4 Cow Artiodactyla
+
Selection via globbing:
+>>> df.select_columns("sleep*", "*wt")
+ sleep_total sleep_rem sleep_cycle brainwt bodywt
+0 12.1 NaN NaN NaN 50.000
+1 17.0 1.8 NaN 0.01550 0.480
+2 14.4 2.4 NaN NaN 1.350
+3 14.9 2.3 0.133333 0.00029 0.019
+4 4.0 0.7 0.666667 0.42300 600.000
+
Selection via regex:
+>>> import re
+>>> df.select_columns(re.compile(r"o.+er"))
+ order conservation
+0 Carnivora lc
+1 Primates NaN
+2 Rodentia nt
+3 Soricomorpha lc
+4 Artiodactyla domesticated
+
Selection via slicing:
+>>> df.select_columns(slice('name','order'), slice('sleep_total','sleep_cycle'))
+ name genus vore order sleep_total sleep_rem sleep_cycle
+0 Cheetah Acinonyx carni Carnivora 12.1 NaN NaN
+1 Owl monkey Aotus omni Primates 17.0 1.8 NaN
+2 Mountain beaver Aplodontia herbi Rodentia 14.4 2.4 NaN
+3 Greater short-tailed shrew Blarina omni Soricomorpha 14.9 2.3 0.133333
+4 Cow Bos herbi Artiodactyla 4.0 0.7 0.666667
+
Selection via callable:
+>>> from pandas.api.types import is_numeric_dtype
+>>> df.select_columns(is_numeric_dtype)
+ sleep_total sleep_rem sleep_cycle awake brainwt bodywt
+0 12.1 NaN NaN 11.9 NaN 50.000
+1 17.0 1.8 NaN 7.0 0.01550 0.480
+2 14.4 2.4 NaN 9.6 NaN 1.350
+3 14.9 2.3 0.133333 9.1 0.00029 0.019
+4 4.0 0.7 0.666667 20.0 0.42300 600.000
+>>> df.select_columns(lambda f: f.isna().any())
+ conservation sleep_rem sleep_cycle brainwt
+0 lc NaN NaN NaN
+1 NaN 1.8 NaN 0.01550
+2 nt 2.4 NaN NaN
+3 lc 2.3 0.133333 0.00029
+4 domesticated 0.7 0.666667 0.42300
+
Exclude columns with the invert
parameter:
>>> df.select_columns(is_numeric_dtype, invert=True)
+ name genus vore order conservation
+0 Cheetah Acinonyx carni Carnivora lc
+1 Owl monkey Aotus omni Primates NaN
+2 Mountain beaver Aplodontia herbi Rodentia nt
+3 Greater short-tailed shrew Blarina omni Soricomorpha lc
+4 Cow Bos herbi Artiodactyla domesticated
+
Exclude columns with the DropLabel
class:
>>> from janitor import DropLabel
+>>> df.select_columns(DropLabel(slice("name", "awake")), "conservation")
+ brainwt bodywt conservation
+0 NaN 50.000 lc
+1 0.01550 0.480 NaN
+2 NaN 1.350 nt
+3 0.00029 0.019 lc
+4 0.42300 600.000 domesticated
+
Selection on MultiIndex columns:
+>>> d = {'num_legs': [4, 4, 2, 2],
+... 'num_wings': [0, 0, 2, 2],
+... 'class': ['mammal', 'mammal', 'mammal', 'bird'],
+... 'animal': ['cat', 'dog', 'bat', 'penguin'],
+... 'locomotion': ['walks', 'walks', 'flies', 'walks']}
+>>> df = pd.DataFrame(data=d)
+>>> df = df.set_index(['class', 'animal', 'locomotion']).T
+>>> df
+class mammal bird
+animal cat dog bat penguin
+locomotion walks walks flies walks
+num_legs 4 4 2 2
+num_wings 0 0 2 2
+
Selection with a scalar:
+>>> df.select_columns('mammal')
+class mammal
+animal cat dog bat
+locomotion walks walks flies
+num_legs 4 4 2
+num_wings 0 0 2
+
Selection with a tuple:
+>>> df.select_columns(('mammal','bat'))
+class mammal
+animal bat
+locomotion flies
+num_legs 2
+num_wings 2
+
Selection within a level is possible with a dictionary, +where the key is either a level name or number:
+>>> df.select_columns({'animal':'cat'})
+class mammal
+animal cat
+locomotion walks
+num_legs 4
+num_wings 0
+>>> df.select_columns({1:["bat", "cat"]})
+class mammal
+animal bat cat
+locomotion flies walks
+num_legs 2 4
+num_wings 2 0
+
Selection on multiple levels:
+>>> df.select_columns({"class":"mammal", "locomotion":"flies"})
+class mammal
+animal bat
+locomotion flies
+num_legs 2
+num_wings 2
+
Selection with a regex on a level:
+>>> df.select_columns({"animal":re.compile(".+t$")})
+class mammal
+animal cat bat
+locomotion walks flies
+num_legs 4 2
+num_wings 0 2
+
Selection with a callable on a level:
+>>> df.select_columns({"animal":lambda f: f.str.endswith('t')})
+class mammal
+animal cat bat
+locomotion walks flies
+num_legs 4 2
+num_wings 0 2
+
Parameters:
+Name | +Type | +Description | +Default | +
---|---|---|---|
+ df
+ |
+
+ DataFrame
+ |
+
+
+
+ A pandas DataFrame. + |
+ + required + | +
+ *args
+ |
+
+ Any
+ |
+
+
+
+ Valid inputs include: an exact column name to look for,
+a shell-style glob string (e.g. |
+
+ ()
+ |
+
+ invert
+ |
+
+ bool
+ |
+
+
+
+ Whether or not to invert the selection. +This will result in the selection +of the complement of the columns provided. + |
+
+ False
+ |
+
Returns:
+Type | +Description | +
---|---|
+ DataFrame
+ |
+
+
+
+ A pandas DataFrame with the specified columns selected. + |
+
janitor/functions/select.py
select_rows(df, *args, invert=False)
+
+Method-chainable selection of rows.
+It accepts a string, shell-like glob strings (*string*)
,
+regex, slice, array-like object, or a list of the previous options.
Selection on a MultiIndex on a level, or multiple levels, +is possible with a dictionary.
+This method does not mutate the original DataFrame.
+Optional ability to invert selection of rows available as well.
+New in version 0.24.0
+Note
+The preferred option when selecting columns or rows in a Pandas DataFrame
+is with .loc
or .iloc
methods, as they are generally performant.
+select_rows
is primarily for convenience.
Note
+This function will be deprecated in a 1.x release.
+Please use jn.select
instead.
Examples:
+>>> import pandas as pd
+>>> import janitor
+>>> df = {"col1": [1, 2], "foo": [3, 4], "col2": [5, 6]}
+>>> df = pd.DataFrame.from_dict(df, orient='index')
+>>> df
+ 0 1
+col1 1 2
+foo 3 4
+col2 5 6
+>>> df.select_rows("col*")
+ 0 1
+col1 1 2
+col2 5 6
+
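+A plain pandas spelling of the same glob-style selection (a sketch):
+>>> df.loc[df.index.str.startswith('col')]
+ 0 1
+col1 1 2
+col2 5 6
+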
More examples can be found in the
+select_columns
section.
Parameters:
+Name | +Type | +Description | +Default | +
---|---|---|---|
+ df
+ |
+
+ DataFrame
+ |
+
+
+
+ A pandas DataFrame. + |
+ + required + | +
+ *args
+ |
+
+ Any
+ |
+
+
+
+ Valid inputs include: an exact index name to look for,
+a shell-style glob string (e.g. |
+
+ ()
+ |
+
+ invert
+ |
+
+ bool
+ |
+
+
+
+ Whether or not to invert the selection. +This will result in the selection +of the complement of the rows provided. + |
+
+ False
+ |
+
Returns:
+Type | +Description | +
---|---|
+ DataFrame
+ |
+
+
+
+ A pandas DataFrame with the specified rows selected. + |
+
janitor/functions/select.py
shuffle
+
+
+Implementation of shuffle
functions.
shuffle(df, random_state=None, reset_index=True)
+
+Shuffle the rows of the DataFrame.
+This method does not mutate the original DataFrame.
+Super-sugary syntax! Underneath the hood, we use df.sample(frac=1)
,
+with the option to set the random state.
Examples:
+>>> import pandas as pd
+>>> import janitor
+>>> df = pd.DataFrame({
+... "col1": range(5),
+... "col2": list("abcde"),
+... })
+>>> df
+ col1 col2
+0 0 a
+1 1 b
+2 2 c
+3 3 d
+4 4 e
+>>> df.shuffle(random_state=42)
+ col1 col2
+0 1 b
+1 4 e
+2 2 c
+3 0 a
+4 3 d
+
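+As described above, this is roughly equivalent to:
+>>> df.sample(frac=1, random_state=42).reset_index(drop=True)
+ col1 col2
+0 1 b
+1 4 e
+2 2 c
+3 0 a
+4 3 d
+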
Parameters:
+Name | +Type | +Description | +Default | +
---|---|---|---|
+ df
+ |
+
+ DataFrame
+ |
+
+
+
+ A pandas DataFrame. + |
+ + required + | +
+ random_state
+ |
+
+ Any
+ |
+
+
+
+ If provided, set a seed for the random number
+generator. Passed to |
+
+ None
+ |
+
+ reset_index
+ |
+
+ bool
+ |
+
+
+
+ If True, reset the dataframe index to the default +RangeIndex. + |
+
+ True
+ |
+
Returns:
+Type | +Description | +
---|---|
+ DataFrame
+ |
+
+
+
+ A shuffled pandas DataFrame. + |
+
janitor/functions/shuffle.py
sort_column_value_order
+
+
+Implementation of the sort_column_value_order
function.
sort_column_value_order(df, column, column_value_order, columns=None)
+
+This function adds precedence to certain values in a specified column, +then sorts based on that column and any other specified columns.
+ + +Examples:
+>>> import pandas as pd
+>>> import janitor
+>>> import numpy as np
+>>> company_sales = {
+... "SalesMonth": ["Jan", "Feb", "Feb", "Mar", "April"],
+... "Company1": [150.0, 200.0, 200.0, 300.0, 400.0],
+... "Company2": [180.0, 250.0, 250.0, np.nan, 500.0],
+... "Company3": [400.0, 500.0, 500.0, 600.0, 675.0],
+... }
+>>> df = pd.DataFrame.from_dict(company_sales)
+>>> df
+ SalesMonth Company1 Company2 Company3
+0 Jan 150.0 180.0 400.0
+1 Feb 200.0 250.0 500.0
+2 Feb 200.0 250.0 500.0
+3 Mar 300.0 NaN 600.0
+4 April 400.0 500.0 675.0
+>>> df.sort_column_value_order(
+... "SalesMonth",
+... {"April": 1, "Mar": 2, "Feb": 3, "Jan": 4}
+... )
+ SalesMonth Company1 Company2 Company3
+4 April 400.0 500.0 675.0
+3 Mar 300.0 NaN 600.0
+1 Feb 200.0 250.0 500.0
+2 Feb 200.0 250.0 500.0
+0 Jan 150.0 180.0 400.0
+
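+A sketch of the underlying idea: map each value to its precedence, then do a
+stable sort on that ranking (_precedence is a hypothetical helper column):
+>>> order = {"April": 1, "Mar": 2, "Feb": 3, "Jan": 4}
+>>> (df.assign(_precedence=df["SalesMonth"].map(order))
+...    .sort_values("_precedence", kind="stable")
+...    .drop(columns="_precedence"))
+ SalesMonth Company1 Company2 Company3
+4 April 400.0 500.0 675.0
+3 Mar 300.0 NaN 600.0
+1 Feb 200.0 250.0 500.0
+2 Feb 200.0 250.0 500.0
+0 Jan 150.0 180.0 400.0
+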
Parameters:
+Name | +Type | +Description | +Default | +
---|---|---|---|
+ df
+ |
+
+ DataFrame
+ |
+
+
+
+ pandas DataFrame that we are manipulating + |
+ + required + | +
+ column
+ |
+
+ str
+ |
+
+
+
+ The name of the column to sort by, given as a string.
+ + required + | +
+ column_value_order
+ |
+
+ dict
+ |
+
+
+
+ Dictionary mapping values in the specified column to their sort precedence.
+ + required + | +
+ columns
+ |
+
+ str
+ |
+
+
+
+ A list of additional columns to sort by.
+
+ None
+ |
+
Raises:
+Type | +Description | +
---|---|
+ ValueError
+ |
+
+
+
+ If the chosen column name is not in
+the DataFrame, or if |
+
Returns:
+Type | +Description | +
---|---|
+ DataFrame
+ |
+
+
+
+ A sorted pandas DataFrame. + |
+
janitor/functions/sort_column_value_order.py
sort_naturally
+
+
+Implementation of the sort_naturally
function.
sort_naturally(df, column_name, **natsorted_kwargs)
+
+Sort a DataFrame by a column using natural sorting.
+Natural sorting is distinct from
+the default lexicographical sorting provided by
.
+For example, given the following list of items:
["A1", "A11", "A3", "A2", "A10"]
+
Lexicographical sorting would give us:
+["A1", "A10", "A11", "A2", "A3"]
+
By contrast, "natural" sorting would give us:
+["A1", "A2", "A3", "A10", "A11"]
+
This function thus provides natural sorting +on a single column of a dataframe.
+To accomplish this, we do a natural sort +on the unique values that are present in the dataframe. +Then, we reconstitute the entire dataframe +in the naturally sorted order.
+Natural sorting is provided by the Python package +natsort.
+All keyword arguments to natsort
should be provided
+after the column name to sort by is provided.
+They are passed through to the natsorted
function.
Examples:
+>>> import pandas as pd
+>>> import janitor
+>>> df = pd.DataFrame(
+... {
+... "Well": ["A21", "A3", "A21", "B2", "B51", "B12"],
+... "Value": [1, 2, 13, 3, 4, 7],
+... }
+... )
+>>> df
+ Well Value
+0 A21 1
+1 A3 2
+2 A21 13
+3 B2 3
+4 B51 4
+5 B12 7
+>>> df.sort_naturally("Well")
+ Well Value
+1 A3 2
+0 A21 1
+2 A21 13
+3 B2 3
+5 B12 7
+4 B51 4
+
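+The same ordering can be produced directly with natsort's index_natsorted
+(a sketch):
+>>> from natsort import index_natsorted
+>>> df.iloc[index_natsorted(df["Well"])]
+ Well Value
+1 A3 2
+0 A21 1
+2 A21 13
+3 B2 3
+5 B12 7
+4 B51 4
+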
Parameters:
+Name | +Type | +Description | +Default | +
---|---|---|---|
+ df
+ |
+
+ DataFrame
+ |
+
+
+
+ A pandas DataFrame. + |
+ + required + | +
+ column_name
+ |
+
+ str
+ |
+
+
+
+ The column on which natural sorting should take place. + |
+ + required + | +
+ **natsorted_kwargs
+ |
+
+ Any
+ |
+
+
+
+ Keyword arguments to be passed
+to natsort's |
+
+ {}
+ |
+
Returns:
+Type | +Description | +
---|---|
+ DataFrame
+ |
+
+
+
+ A sorted pandas DataFrame. + |
+
janitor/functions/sort_naturally.py
take_first
+
+
+Implementation of take_first function.
+ + + + + + + + +take_first(df, subset, by, ascending=True)
+
+Take the first row within each group specified by subset
.
Examples:
+>>> import pandas as pd
+>>> import janitor
+>>> df = pd.DataFrame({"a": ["x", "x", "y", "y"], "b": [0, 1, 2, 3]})
+>>> df
+ a b
+0 x 0
+1 x 1
+2 y 2
+3 y 3
+>>> df.take_first(subset="a", by="b")
+ a b
+0 x 0
+2 y 2
+
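+Conceptually this is a sort followed by dropping duplicates
+(a sketch of the idea, not necessarily the exact implementation):
+>>> df.sort_values(by="b", ascending=True).drop_duplicates(subset="a", keep="first")
+ a b
+0 x 0
+2 y 2
+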
Parameters:
+Name | +Type | +Description | +Default | +
---|---|---|---|
+ df
+ |
+
+ DataFrame
+ |
+
+
+
+ A pandas DataFrame. + |
+ + required + | +
+ subset
+ |
+
+ Union[Hashable, Iterable[Hashable]]
+ |
+
+
+
+ Column(s) defining the group. + |
+ + required + | +
+ by
+ |
+
+ Hashable
+ |
+
+
+
+ Column to sort by. + |
+ + required + | +
+ ascending
+ |
+
+ bool
+ |
+
+
+
+ Whether or not to sort in ascending order, |
+
+ True
+ |
+
Returns:
+Type | +Description | +
---|---|
+ DataFrame
+ |
+
+
+
+ A pandas DataFrame. + |
+
janitor/functions/take_first.py
then
+
+
+Implementation source for then
.
then(df, func)
+
+Add an arbitrary function to run in the pyjanitor
method chain.
This method does not mutate the original DataFrame.
+Note
+This function will be deprecated in a 1.x release.
+Please use pd.DataFrame.pipe
instead.
Examples:
+A trivial example using a lambda func
.
>>> import pandas as pd
+>>> import janitor
+>>> (pd.DataFrame({"a": [1, 2, 3], "b": [7, 8, 9]})
+... .then(lambda df: df * 2))
+ a b
+0 2 14
+1 4 16
+2 6 18
+
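+The pandas-native spelling recommended by the note above:
+>>> (pd.DataFrame({"a": [1, 2, 3], "b": [7, 8, 9]})
+... .pipe(lambda df: df * 2))
+ a b
+0 2 14
+1 4 16
+2 6 18
+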
Parameters:
+Name | +Type | +Description | +Default | +
---|---|---|---|
+ df
+ |
+
+ DataFrame
+ |
+
+
+
+ A pandas DataFrame. + |
+ + required + | +
+ func
+ |
+
+ Callable
+ |
+
+
+
+ A function you would like to run in the method chain. +It should take one parameter and return one parameter, each being +the DataFrame object. After that, do whatever you want in the +middle. Go crazy. + |
+ + required + | +
Returns:
+Type | +Description | +
---|---|
+ DataFrame
+ |
+
+
+
+ A pandas DataFrame. + |
+
janitor/functions/then.py
to_datetime
+
+
+Implementation source for to_datetime
.
to_datetime(df, column_name, **kwargs)
+
+Convert column to a datetime type, in-place.
+Intended to be the method-chaining equivalent of:
+df[column_name] = pd.to_datetime(df[column_name], **kwargs)
+
This method mutates the original DataFrame.
+Note
+This function will be deprecated in a 1.x release.
+Please use jn.transform_column
+instead.
Examples:
+Converting a string column to datetime type with custom format.
+>>> import pandas as pd
+>>> import janitor
+>>> df = pd.DataFrame({'date': ['20200101', '20200202', '20200303']})
+>>> df
+ date
+0 20200101
+1 20200202
+2 20200303
+>>> df.to_datetime('date', format='%Y%m%d')
+ date
+0 2020-01-01
+1 2020-02-02
+2 2020-03-03
+
Read the pandas documentation for to_datetime
for more information.
Parameters:
+Name | +Type | +Description | +Default | +
---|---|---|---|
+ df
+ |
+
+ DataFrame
+ |
+
+
+
+ A pandas DataFrame. + |
+ + required + | +
+ column_name
+ |
+
+ Hashable
+ |
+
+
+
+ Column name. + |
+ + required + | +
+ **kwargs
+ |
+
+ Any
+ |
+
+
+
+ Provide any kwargs that |
+
+ {}
+ |
+
Returns:
+Type | +Description | +
---|---|
+ DataFrame
+ |
+
+
+
+ A pandas DataFrame with updated datetime data. + |
+
janitor/functions/to_datetime.py
toset
+
+
+Implementation of the toset
function.
toset(series)
+
+Return a set of the values.
+Note
+This function will be deprecated in a 1.x release.
+Please use set(df[column])
instead.
The values in the returned set are each a scalar type: a Python scalar (for str, int, float) or a pandas scalar (for Timestamp/Timedelta/Interval/Period).
+ + +Examples:
+>>> import pandas as pd
+>>> import janitor
+>>> s = pd.Series([1, 2, 3, 5, 5], index=["a", "b", "c", "d", "e"])
+>>> s
+a 1
+b 2
+c 3
+d 5
+e 5
+dtype: int64
+>>> s.toset()
+{1, 2, 3, 5}
+
Parameters:
+Name | +Type | +Description | +Default | +
---|---|---|---|
+ series
+ |
+
+ Series
+ |
+
+
+
+ A pandas series. + |
+ + required + | +
Returns:
+Type | +Description | +
---|---|
+ Set
+ |
+
+
+
+ A set of values. + |
+
janitor/functions/toset.py
transform_columns
+
+
+transform_column(df, column_name, function, dest_column_name=None, elementwise=True)
+
+Transform the given column using the provided function.
+Meant to be the method-chaining equivalent of: +
df[dest_column_name] = df[column_name].apply(function)
+
Functions can be applied in one of two ways:
+elementwise=True
). Then, the individual
+column elements will be passed in as the first argument of function
.elementwise=False
). Then, function
is expected to
+take in a pandas Series and return a sequence that is of identical length
+to the original.If dest_column_name
is provided, then the transformation result is stored
+in that column. Otherwise, the transformed result is stored under the name
+of the original column.
This method does not mutate the original DataFrame.
+ + +Examples:
+Transform a column in-place with an element-wise function.
+>>> import pandas as pd
+>>> import janitor
+>>> df = pd.DataFrame({
+... "a": [2, 3, 4],
+... "b": ["area", "pyjanitor", "grapefruit"],
+... })
+>>> df
+ a b
+0 2 area
+1 3 pyjanitor
+2 4 grapefruit
+>>> df.transform_column(
+... column_name="a",
+... function=lambda x: x**2 - 1,
+... )
+ a b
+0 3 area
+1 8 pyjanitor
+2 15 grapefruit
+
+Transform a column in-place with a column-wise function.
+>>> df.transform_column(
+... column_name="b",
+... function=lambda srs: srs.str[:5],
+... elementwise=False,
+... )
+ a b
+0 2 area
+1 3 pyjan
+2 4 grape
+
Parameters:
+Name | +Type | +Description | +Default | +
---|---|---|---|
+ df
+ |
+
+ DataFrame
+ |
+
+
+
+ A pandas DataFrame. + |
+ + required + | +
+ column_name
+ |
+
+ Hashable
+ |
+
+
+
+ The column to transform. + |
+ + required + | +
+ function
+ |
+
+ Callable
+ |
+
+
+
+ A function to apply on the column. + |
+ + required + | +
+ dest_column_name
+ |
+
+ Optional[str]
+ |
+
+
+
+ The column name to store the transformation result +in. Defaults to None, which will result in the original column +name being overwritten. If a name is provided here, then a new +column with the transformed values will be created. + |
+
+ None
+ |
+
+ elementwise
+ |
+
+ bool
+ |
+
+
+
+ Whether to apply the function elementwise or not.
+If |
+
+ True
+ |
+
Returns:
+Type | +Description | +
---|---|
+ DataFrame
+ |
+
+
+
+ A pandas DataFrame with a transformed column. + |
+
janitor/functions/transform_columns.py
transform_columns(df, column_names, function, suffix=None, elementwise=True, new_column_names=None)
+
+Transform multiple columns through the same transformation.
+This method does not mutate the original DataFrame.
+Super syntactic sugar!
+Essentially wraps transform_column
+and calls it repeatedly over all column names provided.
The user can optionally supply either a suffix to create a new set of columns
+with the specified suffix, or provide a dictionary mapping each original
+column name in column_names
to its corresponding new column name.
+Note that all column names must be strings.
Examples:
+log10 transform a list of columns, replacing original columns.
+>>> import numpy as np
+>>> import pandas as pd
+>>> import janitor
+>>> df = pd.DataFrame({
+... "col1": [5, 10, 15],
+... "col2": [3, 6, 9],
+... "col3": [10, 100, 1_000],
+... })
+>>> df
+ col1 col2 col3
+0 5 3 10
+1 10 6 100
+2 15 9 1000
+>>> df.transform_columns(["col1", "col2", "col3"], np.log10)
+ col1 col2 col3
+0 0.698970 0.477121 1.0
+1 1.000000 0.778151 2.0
+2 1.176091 0.954243 3.0
+
Using the suffix
parameter to create new columns.
>>> df.transform_columns(["col1", "col3"], np.log10, suffix="_log")
+ col1 col2 col3 col1_log col3_log
+0 5 3 10 0.698970 1.0
+1 10 6 100 1.000000 2.0
+2 15 9 1000 1.176091 3.0
+
Using the new_column_names
parameter to create new columns.
>>> df.transform_columns(
+... ["col1", "col3"],
+... np.log10,
+... new_column_names={"col1": "transform1"},
+... )
+ col1 col2 col3 transform1
+0 5 3 1.0 0.698970
+1 10 6 2.0 1.000000
+2 15 9 3.0 1.176091
+
Parameters:
+Name | +Type | +Description | +Default | +
---|---|---|---|
+ df
+ |
+
+ DataFrame
+ |
+
+
+
+ A pandas DataFrame. + |
+ + required + | +
+ column_names
+ |
+
+ Union[List[str], Tuple[str]]
+ |
+
+
+
+ An iterable of columns to transform. + |
+ + required + | +
+ function
+ |
+
+ Callable
+ |
+
+
+
+ A function to apply on each column. + |
+ + required + | +
+ suffix
+ |
+
+ Optional[str]
+ |
+
+
+
+ Suffix to use when creating new columns to hold +the transformed values. + |
+
+ None
+ |
+
+ elementwise
+ |
+
+ bool
+ |
+
+
+
+ Passed on to |
+
+ True
+ |
+
+ new_column_names
+ |
+
+ Optional[Dict[str, str]]
+ |
+
+
+
+ An explicit mapping of old column names in
+ |
+
+ None
+ |
+
Raises:
+Type | +Description | +
---|---|
+ ValueError
+ |
+
+
+
+ If both |
+
Returns:
+Type | +Description | +
---|---|
+ DataFrame
+ |
+
+
+
+ A pandas DataFrame with transformed columns. + |
+
janitor/functions/transform_columns.py
truncate_datetime
+
+
+Implementation of the truncate_datetime
family of functions.
truncate_datetime_dataframe(df, datepart)
+
+Truncate times down to a user-specified precision of +year, month, day, hour, minute, or second.
+This method does not mutate the original DataFrame.
+ + +Examples:
+>>> import pandas as pd
+>>> import janitor
+>>> df = pd.DataFrame({
+... "foo": ["xxxx", "yyyy", "zzzz"],
+... "dt": pd.date_range("2020-03-11", periods=3, freq="15H"),
+... })
+>>> df
+ foo dt
+0 xxxx 2020-03-11 00:00:00
+1 yyyy 2020-03-11 15:00:00
+2 zzzz 2020-03-12 06:00:00
+>>> df.truncate_datetime_dataframe("day")
+ foo dt
+0 xxxx 2020-03-11
+1 yyyy 2020-03-11
+2 zzzz 2020-03-12
+
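+For the 'day' case above, plain pandas gives the same result (a sketch;
+dt.floor only covers fixed frequencies such as day, hour, minute and second):
+>>> df.assign(dt=df["dt"].dt.floor("D"))
+ foo dt
+0 xxxx 2020-03-11
+1 yyyy 2020-03-11
+2 zzzz 2020-03-12
+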
Parameters:
+Name | +Type | +Description | +Default | +
---|---|---|---|
+ df
+ |
+
+ DataFrame
+ |
+
+
+
+ The pandas DataFrame on which to truncate datetime. + |
+ + required + | +
+ datepart
+ |
+
+ str
+ |
+
+
+
+ Truncation precision, YEAR, MONTH, DAY, +HOUR, MINUTE, SECOND. (String is automagically +capitalized) + |
+ + required + | +
Raises:
+Type | +Description | +
---|---|
+ ValueError
+ |
+
+
+
+ If an invalid |
+
Returns:
+Type | +Description | +
---|---|
+ DataFrame
+ |
+
+
+
+ A pandas DataFrame with all valid datetimes truncated down +to the specified precision. + |
+
janitor/functions/truncate_datetime.py
update_where
+
+
+Function for updating values based on other column values.
+ + + + + + + + +update_where(df, conditions, target_column_name, target_val)
+
+Add multiple conditions to update a column in the dataframe.
+This method does not mutate the original DataFrame.
+ + +Examples:
+>>> import janitor
+>>> data = {
+... "a": [1, 2, 3, 4],
+... "b": [5, 6, 7, 8],
+... "c": [0, 0, 0, 0],
+... }
+>>> df = pd.DataFrame(data)
+>>> df
+ a b c
+0 1 5 0
+1 2 6 0
+2 3 7 0
+3 4 8 0
+>>> df.update_where(
+... conditions = (df.a > 2) & (df.b < 8),
+... target_column_name = 'c',
+... target_val = 10
+... )
+ a b c
+0 1 5 0
+1 2 6 0
+2 3 7 10
+3 4 8 0
+>>> df.update_where( # supports pandas *query* style string expressions
+... conditions = "a > 2 and b < 8",
+... target_column_name = 'c',
+... target_val = 10
+... )
+ a b c
+0 1 5 0
+1 2 6 0
+2 3 7 10
+3 4 8 0
+
Parameters:
+Name | +Type | +Description | +Default | +
---|---|---|---|
+ df
+ |
+
+ DataFrame
+ |
+
+
+
+ The pandas DataFrame object. + |
+ + required + | +
+ conditions
+ |
+
+ Any
+ |
+
+
+
+ Conditions used to update a target column +and target value. + |
+ + required + | +
+ target_column_name
+ |
+
+ Hashable
+ |
+
+
+
+ Column to be updated. If column does not exist +in DataFrame, a new column will be created; note that entries +that do not get set in the new column will be null. + |
+ + required + | +
+ target_val
+ |
+
+ Any
+ |
+
+
+
+ Value to be updated. + |
+ + required + | +
Raises:
+Type | +Description | +
---|---|
+ ValueError
+ |
+
+
+
+ If `conditions` does not return a boolean array-like data structure. |
+
Returns:
+Type | +Description | +
---|---|
+ DataFrame
+ |
+
+
+
+ A pandas DataFrame. + |
+
janitor/functions/update_where.py
12 +13 +14 +15 +16 +17 +18 +19 +20 +21 +22 +23 +24 +25 +26 +27 +28 +29 +30 +31 +32 +33 +34 +35 +36 +37 +38 +39 +40 +41 +42 +43 +44 +45 +46 +47 +48 +49 +50 +51 +52 +53 +54 +55 +56 +57 +58 +59 +60 +61 +62 +63 +64 +65 +66 +67 +68 +69 +70 +71 +72 +73 +74 +75 +76 +77 +78 +79 +80 +81 +82 +83 +84 +85 +86 +87 +88 +89 +90 +91 +92 |
|
utils
+
+
+Utility functions for all of the functions submodule.
+ + + + + + + + +patterns(regex_pattern)
+
+This function converts a string into a compiled regular expression.
+It can be used to select columns in the index or columns_names
+arguments of pivot_longer
function.
Warning
+This function is deprecated. Kindly use re.compile
instead.
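+A minimal migration sketch (the pattern itself is illustrative):
+```python
+import re
+
+# Deprecated:
+# index = patterns(r"^id_")
+# Drop-in replacement:
+index = re.compile(r"^id_")
+```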
Parameters:
+Name | +Type | +Description | +Default | +
---|---|---|---|
+ regex_pattern
+ |
+
+ Union[str, Pattern]
+ |
+
+
+
+ String to be converted to compiled regular +expression. + |
+ + required + | +
Returns:
+Type | +Description | +
---|---|
+ Pattern
+ |
+
+
+
+ A compiled regular expression from the provided `regex_pattern`. |
+
janitor/functions/utils.py
140 +141 +142 +143 +144 +145 +146 +147 +148 +149 +150 +151 +152 +153 +154 +155 +156 +157 +158 +159 +160 +161 +162 +163 +164 |
|
unionize_dataframe_categories(*dataframes, column_names=None)
+
+Given a group of dataframes which contain some categorical columns, for
+each categorical column present, find all the possible categories across
+all the dataframes which have that column.
+Update each dataframe's corresponding column with a new categorical object
+that contains the original data
+but has labels for all the possible categories from all dataframes.
+This is useful when concatenating a list of dataframes which all have the
+same categorical columns into one dataframe.
+If, for a given categorical column, all input dataframes do not have at
+least one instance of all the possible categories,
+pandas will change the output dtype of that column from category to
+object, losing out on the dramatic speed gains of the categorical format.
Examples:
+Usage example for concatenation of categorical column-containing +dataframes:
+Instead of:
+concatenated_df = pd.concat([df1, df2, df3], ignore_index=True)
+
which in your case has resulted in a category -> object conversion,
+use:
unionized_dataframes = unionize_dataframe_categories(df1, df2, df3)
+concatenated_df = pd.concat(unionized_dataframes, ignore_index=True)
+
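+A self-contained sketch of the effect (the frames and category values are made
+up for illustration):
+```python
+import pandas as pd
+from janitor import unionize_dataframe_categories
+
+df1 = pd.DataFrame({"fruit": pd.Categorical(["apple", "pear"])})
+df2 = pd.DataFrame({"fruit": pd.Categorical(["banana"])})
+
+# Concatenating directly falls back to object dtype, because the
+# two frames disagree on the category set:
+assert pd.concat([df1, df2], ignore_index=True)["fruit"].dtype == object
+
+# After unionizing, both frames carry the full category set,
+# so the concatenated column stays categorical:
+df1u, df2u = unionize_dataframe_categories(df1, df2)
+out = pd.concat([df1u, df2u], ignore_index=True)
+assert isinstance(out["fruit"].dtype, pd.CategoricalDtype)
+```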
Parameters:
+Name | +Type | +Description | +Default | +
---|---|---|---|
+ *dataframes
+ |
+
+ Any
+ |
+
+
+
+ The dataframes you wish to unionize the categorical +objects for. + |
+
+ ()
+ |
+
+ column_names
+ |
+
+ Optional[Iterable[CategoricalDtype]]
+ |
+
+
+
+ If supplied, only unionize this subset of columns. + |
+
+ None
+ |
+
Raises:
+Type | +Description | +
---|---|
+ TypeError
+ |
+
+
+
+ If any of the inputs are not pandas DataFrames. + |
+
Returns:
+Type | +Description | +
---|---|
+ List[DataFrame]
+ |
+
+
+
+ A list of the category-unioned dataframes in the same order they +were provided. + |
+
janitor/functions/utils.py
40 + 41 + 42 + 43 + 44 + 45 + 46 + 47 + 48 + 49 + 50 + 51 + 52 + 53 + 54 + 55 + 56 + 57 + 58 + 59 + 60 + 61 + 62 + 63 + 64 + 65 + 66 + 67 + 68 + 69 + 70 + 71 + 72 + 73 + 74 + 75 + 76 + 77 + 78 + 79 + 80 + 81 + 82 + 83 + 84 + 85 + 86 + 87 + 88 + 89 + 90 + 91 + 92 + 93 + 94 + 95 + 96 + 97 + 98 + 99 +100 +101 +102 +103 +104 +105 +106 +107 +108 +109 +110 +111 +112 +113 +114 +115 +116 +117 +118 +119 +120 +121 +122 +123 +124 +125 +126 +127 +128 +129 +130 +131 +132 +133 +134 +135 +136 +137 |
|
read_commandline(cmd, engine='pandas', **kwargs)
+
+Read a CSV file based on a command-line command.
+For example, you may wish to run the following command on sep-quarter.csv
+before reading it into a pandas DataFrame:
cat sep-quarter.csv | grep .SEA1AA
+
In this case, you can use the following Python code to load the dataframe:
+import janitor as jn
+df = jn.read_commandline("cat sep-quarter.csv | grep .SEA1AA")
+
This function assumes that your command line command will return
+an output that is parsable using the relevant engine and StringIO.
+This function defaults to using pd.read_csv
underneath the hood.
+Keyword arguments are passed through as-is.
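+The polars engine is selected the same way (a sketch; assumes the
+sep-quarter.csv file from the example above exists on disk):
+```python
+import janitor as jn
+
+# Returns a polars DataFrame; keyword arguments are passed through
+# to polars' CSV reader instead of pandas':
+df = jn.read_commandline("cat sep-quarter.csv | grep .SEA1AA", engine="polars")
+```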
Parameters:
+Name | +Type | +Description | +Default | +
---|---|---|---|
+ cmd
+ |
+
+ str
+ |
+
+
+
+ Shell command to preprocess a file on disk. + |
+ + required + | +
+ engine
+ |
+
+ str
+ |
+
+
+
+ DataFrame engine to process the output of the shell command. +Currently supports both pandas and polars. + |
+
+ 'pandas'
+ |
+
+ **kwargs
+ |
+
+ Any
+ |
+
+
+
+ Keyword arguments that are passed through to +the engine's csv reader. + |
+
+ {}
+ |
+
Returns:
+Type | +Description | +
---|---|
+ Mapping
+ |
+
+
+
+ A DataFrame parsed from the stdout of the underlying
+shell command. |
+
janitor/io.py
96 + 97 + 98 + 99 +100 +101 +102 +103 +104 +105 +106 +107 +108 +109 +110 +111 +112 +113 +114 +115 +116 +117 +118 +119 +120 +121 +122 +123 +124 +125 +126 +127 +128 +129 +130 +131 +132 +133 +134 +135 +136 +137 +138 +139 +140 +141 +142 +143 +144 +145 +146 +147 +148 +149 +150 +151 +152 |
|
read_csvs(files_path, separate_df=False, **kwargs)
+
+Read multiple CSV files and return a dictionary of DataFrames, or +one concatenated DataFrame.
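+No example block ships with this function, so here is a minimal sketch of both
+modes (the glob path is hypothetical):
+```python
+import janitor as jn
+
+# One DataFrame with the concatenated contents of every matching CSV:
+df = jn.read_csvs("data/part_*.csv")
+
+# Or one DataFrame per file, collected in a dictionary:
+dfs = jn.read_csvs("data/part_*.csv", separate_df=True)
+```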
+ + +Parameters:
+Name | +Type | +Description | +Default | +
---|---|---|---|
+ files_path
+ |
+
+ Union[str, Iterable[str]]
+ |
+
+
+
+ The filepath pattern matching the CSV files.
+Accepts regular expressions, with or without the `.csv` extension. |
+ + required + | +
+ separate_df
+ |
+
+ bool
+ |
+
+
+
+ If `True`, returns a dictionary of separate DataFrames, one per CSV file; if `False` (default), returns a single concatenated DataFrame. |
+
+ False
+ |
+
+ **kwargs
+ |
+
+ Any
+ |
+
+
+
+ Keyword arguments to pass into the
+original pandas `read_csv` function. |
+
+ {}
+ |
+
Raises:
+Type | +Description | +
---|---|
+ JanitorError
+ |
+
+
+
+ If `None` is provided for `files_path`. |
+
+ JanitorError
+ |
+
+
+
+ If the length of `files_path` is zero. |
+
+ ValueError
+ |
+
+
+
+ If no CSV files exist in `files_path`. |
+
+ ValueError
+ |
+
+
+
+ If columns in input CSV files do not match. + |
+
Returns:
+Type | +Description | +
---|---|
+ Union[DataFrame, dict]
+ |
+
+
+
+ DataFrame of concatenated DataFrames or dictionary of DataFrames. + |
+
janitor/io.py
27 +28 +29 +30 +31 +32 +33 +34 +35 +36 +37 +38 +39 +40 +41 +42 +43 +44 +45 +46 +47 +48 +49 +50 +51 +52 +53 +54 +55 +56 +57 +58 +59 +60 +61 +62 +63 +64 +65 +66 +67 +68 +69 +70 +71 +72 +73 +74 +75 +76 +77 +78 +79 +80 +81 +82 +83 +84 +85 +86 +87 +88 +89 +90 +91 +92 +93 |
|
xlsx_cells(path, sheetnames=None, start_point=None, end_point=None, read_only=True, include_blank_cells=True, fill=False, font=False, alignment=False, border=False, protection=False, comment=False, engine='pandas', **kwargs)
+
+Imports data from a spreadsheet without coercing it into a rectangle.
+Each cell is represented by a row in the dataframe, and includes the
+cell's coordinate, value, and row and column positions.
+The cell formatting (fill, font, border, etc) can also be accessed;
+usually this is returned as a dictionary in the cell, and the specific
+cell format attribute can be accessed using pd.Series.str.get
+or pl.struct.field
if it is a polars DataFrame.
Inspiration for this comes from R's tidyxl package.
+ + +Examples:
+>>> import pandas as pd
+>>> import polars as pl
+>>> from janitor import xlsx_cells
+>>> pd.set_option("display.max_columns", None)
+>>> pd.set_option("display.expand_frame_repr", False)
+>>> pd.set_option("max_colwidth", None)
+>>> filename = "../pyjanitor/tests/test_data/worked-examples.xlsx"
+
Each cell is returned as a row:
+>>> xlsx_cells(filename, sheetnames="highlights")
+ value internal_value coordinate row column data_type is_date number_format
+0 Age Age A1 1 1 s False General
+1 Height Height B1 1 2 s False General
+2 1 1 A2 2 1 n False General
+3 2 2 B2 2 2 n False General
+4 3 3 A3 3 1 n False General
+5 4 4 B3 3 2 n False General
+6 5 5 A4 4 1 n False General
+7 6 6 B4 4 2 n False General
+
Access cell formatting such as fill:
+>>> out=xlsx_cells(filename, sheetnames="highlights", fill=True).select("value", "fill", axis='columns')
+>>> out
+ value fill
+0 Age {'patternType': None, 'fgColor': {'rgb': '00000000', 'type': 'rgb', 'tint': 0.0}, 'bgColor': {'rgb': '00000000', 'type': 'rgb', 'tint': 0.0}}
+1 Height {'patternType': None, 'fgColor': {'rgb': '00000000', 'type': 'rgb', 'tint': 0.0}, 'bgColor': {'rgb': '00000000', 'type': 'rgb', 'tint': 0.0}}
+2 1 {'patternType': None, 'fgColor': {'rgb': '00000000', 'type': 'rgb', 'tint': 0.0}, 'bgColor': {'rgb': '00000000', 'type': 'rgb', 'tint': 0.0}}
+3 2 {'patternType': None, 'fgColor': {'rgb': '00000000', 'type': 'rgb', 'tint': 0.0}, 'bgColor': {'rgb': '00000000', 'type': 'rgb', 'tint': 0.0}}
+4 3 {'patternType': 'solid', 'fgColor': {'rgb': 'FFFFFF00', 'type': 'rgb', 'tint': 0.0}, 'bgColor': {'rgb': 'FFFFFF00', 'type': 'rgb', 'tint': 0.0}}
+5 4 {'patternType': 'solid', 'fgColor': {'rgb': 'FFFFFF00', 'type': 'rgb', 'tint': 0.0}, 'bgColor': {'rgb': 'FFFFFF00', 'type': 'rgb', 'tint': 0.0}}
+6 5 {'patternType': None, 'fgColor': {'rgb': '00000000', 'type': 'rgb', 'tint': 0.0}, 'bgColor': {'rgb': '00000000', 'type': 'rgb', 'tint': 0.0}}
+7 6 {'patternType': None, 'fgColor': {'rgb': '00000000', 'type': 'rgb', 'tint': 0.0}, 'bgColor': {'rgb': '00000000', 'type': 'rgb', 'tint': 0.0}}
+
Specific cell attributes can be accessed by using Pandas' series.str.get:
>>> out.fill.str.get("fgColor").str.get("rgb")
+0 00000000
+1 00000000
+2 00000000
+3 00000000
+4 FFFFFF00
+5 FFFFFF00
+6 00000000
+7 00000000
+Name: fill, dtype: object
+
Access cell formatting in a polars DataFrame:
+>>> out = xlsx_cells(filename, sheetnames="highlights", engine='polars', fill=True).get_column('fill')
+>>> out
+shape: (8,)
+Series: 'fill' [struct[3]]
+[
+ {null,{"00000000","rgb",0.0},{"00000000","rgb",0.0}}
+ {null,{"00000000","rgb",0.0},{"00000000","rgb",0.0}}
+ {null,{"00000000","rgb",0.0},{"00000000","rgb",0.0}}
+ {null,{"00000000","rgb",0.0},{"00000000","rgb",0.0}}
+ {"solid",{"FFFFFF00","rgb",0.0},{"FFFFFF00","rgb",0.0}}
+ {"solid",{"FFFFFF00","rgb",0.0},{"FFFFFF00","rgb",0.0}}
+ {null,{"00000000","rgb",0.0},{"00000000","rgb",0.0}}
+ {null,{"00000000","rgb",0.0},{"00000000","rgb",0.0}}
+]
+
+Specific cell attributes can be accessed via Polars' struct:
+>>> out.struct.field('fgColor').struct.field('rgb')
+shape: (8,)
+Series: 'rgb' [str]
+[
+ "00000000"
+ "00000000"
+ "00000000"
+ "00000000"
+ "FFFFFF00"
+ "FFFFFF00"
+ "00000000"
+ "00000000"
+]
+
Parameters:
+Name | +Type | +Description | +Default | +
---|---|---|---|
+ path
+ |
+
+ Union[str, Workbook]
+ |
+
+
+
+ Path to the Excel File. It can also be an openpyxl Workbook. + |
+ + required + | +
+ sheetnames
+ |
+
+ Union[str, list, tuple]
+ |
+
+
+
+ Names of the sheets from which the cells are to be extracted.
+If `None`, all the sheets in the file are extracted. |
+
+ None
+ |
+
+ start_point
+ |
+
+ Union[str, int]
+ |
+
+
+
+ Start coordinates of the Excel sheet. This is useful
+if the user is only interested in a subsection of the sheet.
+If `start_point` is provided, `end_point` must be provided as well. |
+
+ None
+ |
+
+ end_point
+ |
+
+ Union[str, int]
+ |
+
+
+
+ End coordinates of the Excel sheet. This is useful
+if the user is only interested in a subsection of the sheet.
+If `end_point` is provided, `start_point` must be provided as well. |
+
+ None
+ |
+
+ read_only
+ |
+
+ bool
+ |
+
+
+
+ Determines if the entire file is loaded in memory,
+or streamed. For memory efficiency, `read_only` should be set to `True`. |
+
+ True
+ |
+
+ include_blank_cells
+ |
+
+ bool
+ |
+
+
+
+ Determines if cells without a value should be included. + |
+
+ True
+ |
+
+ fill
+ |
+
+ bool
+ |
+
+
+
+ If `True`, return the fill properties of the cell. |
+
+ False
+ |
+
+ font
+ |
+
+ bool
+ |
+
+
+
+ If `True`, return the font properties of the cell. |
+
+ False
+ |
+
+ alignment
+ |
+
+ bool
+ |
+
+
+
+ If `True`, return the alignment properties of the cell. |
+
+ False
+ |
+
+ border
+ |
+
+ bool
+ |
+
+
+
+ If `True`, return the border properties of the cell. |
+
+ False
+ |
+
+ protection
+ |
+
+ bool
+ |
+
+
+
+ If `True`, return the protection properties of the cell. |
+
+ False
+ |
+
+ comment
+ |
+
+ bool
+ |
+
+
+
+ If `True`, return the comment attached to the cell. |
+
+ False
+ |
+
+ engine
+ |
+
+ str
+ |
+
+
+
+ DataFrame engine. Should be either pandas or polars. + |
+
+ 'pandas'
+ |
+
+ **kwargs
+ |
+
+ Any
+ |
+
+
+
+ Any other attributes of the cell, that can be accessed from openpyxl. + |
+
+ {}
+ |
+
Raises:
+Type | +Description | +
---|---|
+ ValueError
+ |
+
+
+
+ If kwargs is provided, and one of the keys is a default column. + |
+
+ AttributeError
+ |
+
+
+
+ If kwargs is provided and any of the keys +is not a openpyxl cell attribute. + |
+
Returns:
+Type | +Description | +
---|---|
+ Mapping
+ |
+
+
+
+ A DataFrame, or a dictionary of DataFrames. + |
+
janitor/io.py
345 +346 +347 +348 +349 +350 +351 +352 +353 +354 +355 +356 +357 +358 +359 +360 +361 +362 +363 +364 +365 +366 +367 +368 +369 +370 +371 +372 +373 +374 +375 +376 +377 +378 +379 +380 +381 +382 +383 +384 +385 +386 +387 +388 +389 +390 +391 +392 +393 +394 +395 +396 +397 +398 +399 +400 +401 +402 +403 +404 +405 +406 +407 +408 +409 +410 +411 +412 +413 +414 +415 +416 +417 +418 +419 +420 +421 +422 +423 +424 +425 +426 +427 +428 +429 +430 +431 +432 +433 +434 +435 +436 +437 +438 +439 +440 +441 +442 +443 +444 +445 +446 +447 +448 +449 +450 +451 +452 +453 +454 +455 +456 +457 +458 +459 +460 +461 +462 +463 +464 +465 +466 +467 +468 +469 +470 +471 +472 +473 +474 +475 +476 +477 +478 +479 +480 +481 +482 +483 +484 +485 +486 +487 +488 +489 +490 +491 +492 +493 +494 +495 +496 +497 +498 +499 +500 +501 +502 +503 +504 +505 +506 +507 +508 +509 +510 +511 +512 +513 +514 +515 +516 +517 +518 +519 +520 +521 +522 +523 +524 +525 +526 +527 +528 +529 +530 +531 +532 +533 +534 +535 +536 +537 +538 +539 +540 +541 +542 +543 +544 +545 +546 +547 +548 +549 +550 +551 +552 +553 +554 +555 +556 +557 +558 +559 +560 +561 +562 +563 +564 +565 +566 +567 +568 +569 +570 +571 +572 +573 +574 +575 +576 +577 +578 +579 +580 +581 +582 +583 +584 +585 +586 +587 +588 +589 +590 +591 +592 +593 +594 +595 +596 +597 +598 +599 +600 +601 +602 +603 +604 +605 +606 +607 +608 +609 +610 +611 +612 +613 +614 +615 +616 +617 |
|
xlsx_table(path, sheetname=None, table=None, engine='pandas')
+
+Returns a DataFrame of values in a table in the Excel file.
+This applies to an Excel file, where the data range is explicitly +specified as a Microsoft Excel table.
+If there is a single table in the sheet, or a string is provided
+as an argument to the table
parameter, a DataFrame is returned;
+if there is more than one table in the sheet,
+and the table
argument is None
, or a list/tuple of names,
+a dictionary of DataFrames is returned, where the keys of the dictionary
+are the table names.
Examples:
+>>> import pandas as pd
+>>> import polars as pl
+>>> from janitor import xlsx_table
+>>> filename="../pyjanitor/tests/test_data/016-MSPTDA-Excel.xlsx"
+
Single table:
+>>> xlsx_table(filename, table='dCategory')
+ CategoryID Category
+0 1 Beginner
+1 2 Advanced
+2 3 Freestyle
+3 4 Competition
+4 5 Long Distance
+
>>> xlsx_table(filename, table='dCategory', engine='polars')
+shape: (5, 2)
+┌────────────┬───────────────┐
+│ CategoryID ┆ Category │
+│ --- ┆ --- │
+│ i64 ┆ str │
+╞════════════╪═══════════════╡
+│ 1 ┆ Beginner │
+│ 2 ┆ Advanced │
+│ 3 ┆ Freestyle │
+│ 4 ┆ Competition │
+│ 5 ┆ Long Distance │
+└────────────┴───────────────┘
+
Multiple tables:
+>>> out=xlsx_table(filename, table=["dCategory", "dSalesReps"])
+>>> out["dCategory"]
+ CategoryID Category
+0 1 Beginner
+1 2 Advanced
+2 3 Freestyle
+3 4 Competition
+4 5 Long Distance
+>>> out["dSalesReps"].head(3)
+ SalesRepID SalesRep Region
+0 1 Sioux Radcoolinator NW
+1 2 Tyrone Smithe NE
+2 3 Chantel Zoya SW
+
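+With table=None, every table in the workbook is returned in one dictionary
+(a sketch reusing the file above):
+```python
+out = xlsx_table(filename)
+# out is a dict keyed by table name, e.g. out["dCategory"], out["dSalesReps"]
+```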
Parameters:
+Name | +Type | +Description | +Default | +
---|---|---|---|
+ path
+ |
+
+ Union[str, IO, Workbook]
+ |
+
+
+
+ Path to the Excel File. It can also be an openpyxl Workbook. + |
+ + required + | +
+ table
+ |
+
+ Union[str, list, tuple]
+ |
+
+
+
+ Name of a table, or list of tables in the sheet. + |
+
+ None
+ |
+
+ engine
+ |
+
+ str
+ |
+
+
+
+ DataFrame engine. Should be either pandas or polars. +Defaults to pandas + |
+
+ 'pandas'
+ |
+
Raises:
+Type | +Description | +
---|---|
+ AttributeError
+ |
+
+
+
+ If a workbook is provided, and is a ReadOnlyWorksheet. + |
+
+ ValueError
+ |
+
+
+
+ If there are no tables in the sheet. + |
+
+ KeyError
+ |
+
+
+
+ If the provided table does not exist in the sheet. + |
+
Returns:
+Type | +Description | +
---|---|
+ Mapping
+ |
+
+
+
+ A DataFrame, or a dictionary of DataFrames,
+if there are multiple arguments for the `table` parameter, or the argument is `None`. |
+
janitor/io.py
159 +160 +161 +162 +163 +164 +165 +166 +167 +168 +169 +170 +171 +172 +173 +174 +175 +176 +177 +178 +179 +180 +181 +182 +183 +184 +185 +186 +187 +188 +189 +190 +191 +192 +193 +194 +195 +196 +197 +198 +199 +200 +201 +202 +203 +204 +205 +206 +207 +208 +209 +210 +211 +212 +213 +214 +215 +216 +217 +218 +219 +220 +221 +222 +223 +224 +225 +226 +227 +228 +229 +230 +231 +232 +233 +234 +235 +236 +237 +238 +239 +240 +241 +242 +243 +244 +245 +246 +247 +248 +249 +250 +251 +252 +253 +254 +255 +256 +257 +258 +259 +260 +261 +262 +263 +264 +265 +266 +267 +268 +269 +270 +271 +272 +273 +274 +275 +276 +277 +278 +279 +280 +281 +282 +283 +284 +285 +286 +287 +288 +289 +290 +291 +292 +293 +294 +295 +296 +297 +298 +299 +300 +301 +302 +303 +304 +305 +306 +307 +308 +309 +310 +311 +312 +313 +314 +315 +316 +317 +318 +319 +320 +321 +322 +323 +324 +325 +326 +327 +328 +329 +330 +331 +332 +333 +334 +335 +336 +337 +338 +339 +340 +341 +342 |
|
Miscellaneous mathematical operators.
+ + + + + + + + +ecdf(s)
+
+Return the cumulative distribution of values in a series.
+Null values must be dropped from the series,
+otherwise a ValueError
is raised.
Also, if the dtype
of the series is not numeric,
+a TypeError
is raised.
Examples:
+>>> import pandas as pd
+>>> import janitor
+>>> s = pd.Series([0, 4, 0, 1, 2, 1, 1, 3])
+>>> x, y = s.ecdf()
+>>> x
+array([0, 0, 1, 1, 1, 2, 3, 4])
+>>> y
+array([0.125, 0.25 , 0.375, 0.5 , 0.625, 0.75 , 0.875, 1. ])
+
You can then plot the ECDF values, for example:
+>>> from matplotlib import pyplot as plt
+>>> plt.scatter(x, y)
+
Parameters:
+Name | +Type | +Description | +Default | +
---|---|---|---|
+ s
+ |
+
+ Series
+ |
+
+
+
+ A pandas series; its `dtype` should be numeric. |
+ + required + | +
Raises:
+Type | +Description | +
---|---|
+ TypeError
+ |
+
+
+
+ If series is not numeric. + |
+
+ ValueError
+ |
+
+
+
+ If series contains nulls. + |
+
Returns:
+Name | Type | +Description | +
---|---|---|
x |
+ ndarray
+ |
+
+
+
+ Sorted array of values. + |
+
y |
+ ndarray
+ |
+
+
+
+ Cumulative fraction of data points with value `x` or lower. |
+
janitor/math.py
329 +330 +331 +332 +333 +334 +335 +336 +337 +338 +339 +340 +341 +342 +343 +344 +345 +346 +347 +348 +349 +350 +351 +352 +353 +354 +355 +356 +357 +358 +359 +360 +361 +362 +363 +364 +365 +366 +367 +368 +369 +370 +371 +372 +373 +374 +375 +376 +377 |
|
exp(s)
+
+Take the exponential transform of the series.
+ + +Examples:
+>>> import pandas as pd
+>>> import janitor
+>>> s = pd.Series([0, 1, 3], name="numbers")
+>>> s.exp()
+0 1.000000
+1 2.718282
+2 20.085537
+Name: numbers, dtype: float64
+
Parameters:
+Name | +Type | +Description | +Default | +
---|---|---|---|
+ s
+ |
+
+ Series
+ |
+
+
+
+ Input Series. + |
+ + required + | +
Returns:
+Type | +Description | +
---|---|
+ Series
+ |
+
+
+
+ Transformed Series. + |
+
janitor/math.py
61 +62 +63 +64 +65 +66 +67 +68 +69 +70 +71 +72 +73 +74 +75 +76 +77 +78 +79 +80 +81 +82 +83 |
|
log(s, error='warn')
+
+Take the natural logarithm of the Series.
+Each value in the series should be positive. Use error
to control the
+behavior if there are nonpositive entries in the series.
Examples:
+>>> import pandas as pd
+>>> import janitor
+>>> s = pd.Series([0, 1, 3], name="numbers")
+>>> s.log(error="ignore")
+0 NaN
+1 0.000000
+2 1.098612
+Name: numbers, dtype: float64
+
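+With error="raise", the same series aborts instead of producing NaN
+(a sketch; the series above contains 0, a nonpositive entry):
+```python
+try:
+    s.log(error="raise")
+except RuntimeError:
+    pass  # raised because of the nonpositive entry
+```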
Parameters:
+Name | +Type | +Description | +Default | +
---|---|---|---|
+ s
+ |
+
+ Series
+ |
+
+
+
+ Input Series. + |
+ + required + | +
+ error
+ |
+
+ str
+ |
+
+
+
+ Determines behavior when taking the log of nonpositive
+entries. If `'warn'`, a `RuntimeWarning` is thrown; if `'raise'`, a `RuntimeError` is thrown; otherwise the log of nonpositive values is `np.nan`. |
+
+ 'warn'
+ |
+
Raises:
+Type | +Description | +
---|---|
+ RuntimeError
+ |
+
+
+
+ Raised when there are nonpositive values in the
+Series and `error='raise'`. |
+
Returns:
+Type | +Description | +
---|---|
+ Series
+ |
+
+
+
+ Transformed Series. + |
+
janitor/math.py
13 +14 +15 +16 +17 +18 +19 +20 +21 +22 +23 +24 +25 +26 +27 +28 +29 +30 +31 +32 +33 +34 +35 +36 +37 +38 +39 +40 +41 +42 +43 +44 +45 +46 +47 +48 +49 +50 +51 +52 +53 +54 +55 +56 +57 +58 |
|
logit(s, error='warn')
+
+Take the logit transform of the Series.
+The logit transform is defined:
+logit(p) = log(p/(1-p))
+
Each value in the series should be between 0 and 1. Use error
to
+control the behavior if any series entries are outside of (0, 1).
Examples:
+>>> import pandas as pd
+>>> import janitor
+>>> s = pd.Series([0.1, 0.5, 0.9], name="numbers")
+>>> s.logit()
+0 -2.197225
+1 0.000000
+2 2.197225
+Name: numbers, dtype: float64
+
Parameters:
+Name | +Type | +Description | +Default | +
---|---|---|---|
+ s
+ |
+
+ Series
+ |
+
+
+
+ Input Series. + |
+ + required + | +
+ error
+ |
+
+ str
+ |
+
+
+
+ Determines behavior when the series entries fall outside of `(0, 1)`. If `'warn'`, a `RuntimeWarning` is thrown; if `'raise'`, a `RuntimeError` is thrown; otherwise `np.nan` is returned for the problematic entries. |
+
+ 'warn'
+ |
+
Raises:
+Type | +Description | +
---|---|
+ RuntimeError
+ |
+
+
+
+ If `error='raise'` and the series contains values outside of `(0, 1)`. |
+
Returns:
+Type | +Description | +
---|---|
+ Series
+ |
+
+
+
+ Transformed Series. + |
+
janitor/math.py
153 +154 +155 +156 +157 +158 +159 +160 +161 +162 +163 +164 +165 +166 +167 +168 +169 +170 +171 +172 +173 +174 +175 +176 +177 +178 +179 +180 +181 +182 +183 +184 +185 +186 +187 +188 +189 +190 +191 +192 +193 +194 +195 +196 +197 +198 +199 +200 +201 +202 +203 |
|
normal_cdf(s)
+
+Transforms the Series via the CDF of the Normal distribution.
+ + +Examples:
+>>> import pandas as pd
+>>> import janitor
+>>> s = pd.Series([-1, 0, 3], name="numbers")
+>>> s.normal_cdf()
+0 0.158655
+1 0.500000
+2 0.998650
+dtype: float64
+
Parameters:
+Name | +Type | +Description | +Default | +
---|---|---|---|
+ s
+ |
+
+ Series
+ |
+
+
+
+ Input Series. + |
+ + required + | +
Returns:
+Type | +Description | +
---|---|
+ Series
+ |
+
+
+
+ Transformed Series. + |
+
janitor/math.py
206 +207 +208 +209 +210 +211 +212 +213 +214 +215 +216 +217 +218 +219 +220 +221 +222 +223 +224 +225 +226 +227 +228 +229 |
|
probit(s, error='warn')
+
+Transforms the Series via the inverse CDF of the Normal distribution.
+Each value in the series should be between 0 and 1. Use error
to
+control the behavior if any series entries are outside of (0, 1).
Examples:
+>>> import pandas as pd
+>>> import janitor
+>>> s = pd.Series([0.1, 0.5, 0.8], name="numbers")
+>>> s.probit()
+0 -1.281552
+1 0.000000
+2 0.841621
+dtype: float64
+
Parameters:
+Name | +Type | +Description | +Default | +
---|---|---|---|
+ s
+ |
+
+ Series
+ |
+
+
+
+ Input Series. + |
+ + required + | +
+ error
+ |
+
+ str
+ |
+
+
+
+ Determines behavior when the series entries fall outside of `(0, 1)`. If `'warn'`, a `RuntimeWarning` is thrown; if `'raise'`, a `RuntimeError` is thrown; otherwise `np.nan` is returned for the problematic entries. |
+
+ 'warn'
+ |
+
Raises:
+Type | +Description | +
---|---|
+ RuntimeError
+ |
+
+
+
+ When there are problematic values
+in the Series and `error='raise'`. |
+
Returns:
+Type | +Description | +
---|---|
+ Series
+ |
+
+
+
+ Transformed Series.
+ |
+
janitor/math.py
232 +233 +234 +235 +236 +237 +238 +239 +240 +241 +242 +243 +244 +245 +246 +247 +248 +249 +250 +251 +252 +253 +254 +255 +256 +257 +258 +259 +260 +261 +262 +263 +264 +265 +266 +267 +268 +269 +270 +271 +272 +273 +274 +275 +276 +277 +278 +279 +280 |
|
sigmoid(s)
+
+Take the sigmoid transform of the series.
+The sigmoid function is defined:
+sigmoid(x) = 1 / (1 + exp(-x))
+
Examples:
+>>> import pandas as pd
+>>> import janitor
+>>> s = pd.Series([-1, 0, 4], name="numbers")
+>>> s.sigmoid()
+0 0.268941
+1 0.500000
+2 0.982014
+Name: numbers, dtype: float64
+
Parameters:
+Name | +Type | +Description | +Default | +
---|---|---|---|
+ s
+ |
+
+ Series
+ |
+
+
+
+ Input Series. + |
+ + required + | +
Returns:
+Type | +Description | +
---|---|
+ Series
+ |
+
+
+
+ Transformed Series. + |
+
janitor/math.py
86 + 87 + 88 + 89 + 90 + 91 + 92 + 93 + 94 + 95 + 96 + 97 + 98 + 99 +100 +101 +102 +103 +104 +105 +106 +107 +108 +109 +110 +111 +112 +113 +114 |
|
softmax(s)
+
+Take the softmax transform of the series.
+The softmax function transforms each element of a collection by +computing the exponential of each element divided by the sum of the +exponentials of all the elements.
+That is, if x is a one-dimensional numpy array or pandas Series:
+softmax(x) = exp(x)/sum(exp(x))
+
Examples:
+>>> import pandas as pd
+>>> import janitor
+>>> s = pd.Series([0, 1, 3], name="numbers")
+>>> s.softmax()
+0 0.042010
+1 0.114195
+2 0.843795
+Name: numbers, dtype: float64
+
Parameters:
+Name | +Type | +Description | +Default | +
---|---|---|---|
+ s
+ |
+
+ Series
+ |
+
+
+
+ Input Series. + |
+ + required + | +
Returns:
+Type | +Description | +
---|---|
+ Series
+ |
+
+
+
+ Transformed Series. + |
+
janitor/math.py
117 +118 +119 +120 +121 +122 +123 +124 +125 +126 +127 +128 +129 +130 +131 +132 +133 +134 +135 +136 +137 +138 +139 +140 +141 +142 +143 +144 +145 +146 +147 +148 +149 +150 |
|
z_score(s, moments_dict=None, keys=('mean', 'std'))
+
+Transforms the Series into z-scores.
+The z-score is defined:
+z = (s - s.mean()) / s.std()
+
Examples:
+>>> import pandas as pd
+>>> import janitor
+>>> s = pd.Series([0, 1, 3], name="numbers")
+>>> s.z_score()
+0 -0.872872
+1 -0.218218
+2 1.091089
+Name: numbers, dtype: float64
+
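+The moments_dict parameter captures the statistics that were used, e.g. so the
+identical scaling can be reapplied elsewhere (a sketch based on the parameter
+description below):
+```python
+moments = {}
+scaled = s.z_score(moments_dict=moments)
+# moments now holds the fitted statistics under the default keys,
+# approximately {'mean': 1.33, 'std': 1.53} for the series above
+```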
Parameters:
+Name | +Type | +Description | +Default | +
---|---|---|---|
+ s
+ |
+
+ Series
+ |
+
+
+
+ Input Series. + |
+ + required + | +
+ moments_dict
+ |
+
+ dict
+ |
+
+
+
+ If not `None`, the mean and standard deviation used for the z-score transformation are saved as entries in `moments_dict`. |
+
+ None
+ |
+
+ keys
+ |
+
+ Tuple[str, str]
+ |
+
+
+
+ Determines the keys saved in `moments_dict`. |
+
+ ('mean', 'std')
+ |
+
Returns:
+Type | +Description | +
---|---|
+ Series
+ |
+
+
+
+ Transformed Series. + |
+
janitor/math.py
283 +284 +285 +286 +287 +288 +289 +290 +291 +292 +293 +294 +295 +296 +297 +298 +299 +300 +301 +302 +303 +304 +305 +306 +307 +308 +309 +310 +311 +312 +313 +314 +315 +316 +317 +318 +319 +320 +321 +322 +323 +324 +325 +326 |
|
Machine learning specific functions.
+ + + + + + + + +get_features_targets(df, target_column_names, feature_column_names=None)
+
+Get the features and targets as separate DataFrames/Series.
+This method does not mutate the original DataFrame.
+The behaviour is as such:
+target_column_names
is mandatory.feature_column_names
is present, then we will respect the column
+ names inside there.feature_column_names
is not passed in, then we will assume that
+the rest of the columns are feature columns, and return them.Examples:
+>>> import pandas as pd
+>>> import janitor.ml
+>>> df = pd.DataFrame(
+... {"a": [1, 2, 3], "b": [-2, 0, 4], "c": [1.23, 7.89, 4.56]}
+... )
+>>> X, Y = df.get_features_targets(target_column_names=["a", "c"])
+>>> X
+ b
+0 -2
+1 0
+2 4
+>>> Y
+ a c
+0 1 1.23
+1 2 7.89
+2 3 4.56
+
Parameters:
+Name | +Type | +Description | +Default | +
---|---|---|---|
+ df
+ |
+
+ DataFrame
+ |
+
+
+
+ The pandas DataFrame object. + |
+ + required + | +
+ target_column_names
+ |
+
+ Union[str, Union[List, Tuple], Hashable]
+ |
+
+
+
+ Either a column name or an +iterable (list or tuple) of column names that are the target(s) to +be predicted. + |
+ + required + | +
+ feature_column_names
+ |
+
+ Optional[Union[str, Iterable[str], Hashable]]
+ |
+
+
+
+ The column name or +iterable of column names that are the features (a.k.a. predictors) +used to predict the targets. + |
+
+ None
+ |
+
Returns:
+Type | +Description | +
---|---|
+ Tuple[DataFrame, DataFrame]
+ |
+
+
+
+ (X, Y): the features (X) and the targets (Y) as separate pandas DataFrames.
+ |
+
janitor/ml.py
11 +12 +13 +14 +15 +16 +17 +18 +19 +20 +21 +22 +23 +24 +25 +26 +27 +28 +29 +30 +31 +32 +33 +34 +35 +36 +37 +38 +39 +40 +41 +42 +43 +44 +45 +46 +47 +48 +49 +50 +51 +52 +53 +54 +55 +56 +57 +58 +59 +60 +61 +62 +63 +64 +65 +66 +67 +68 +69 +70 +71 +72 +73 +74 +75 |
|
clean_names
+
+
+clean_names implementation for polars.
+ + + + + + + + +clean_names(df, strip_underscores=None, case_type='lower', remove_special=False, strip_accents=False, truncate_limit=None)
+
+Clean the column names in a polars DataFrame.
+clean_names
can also be applied to a LazyFrame.
Examples:
+>>> import polars as pl
+>>> import janitor.polars
+>>> df = pl.DataFrame(
+... {
+... "Aloha": range(3),
+... "Bell Chart": range(3),
+... "Animals@#$%^": range(3)
+... }
+... )
+>>> df
+shape: (3, 3)
+┌───────┬────────────┬──────────────┐
+│ Aloha ┆ Bell Chart ┆ Animals@#$%^ │
+│ --- ┆ --- ┆ --- │
+│ i64 ┆ i64 ┆ i64 │
+╞═══════╪════════════╪══════════════╡
+│ 0 ┆ 0 ┆ 0 │
+│ 1 ┆ 1 ┆ 1 │
+│ 2 ┆ 2 ┆ 2 │
+└───────┴────────────┴──────────────┘
+>>> df.clean_names(remove_special=True)
+shape: (3, 3)
+┌───────┬────────────┬─────────┐
+│ aloha ┆ bell_chart ┆ animals │
+│ --- ┆ --- ┆ --- │
+│ i64 ┆ i64 ┆ i64 │
+╞═══════╪════════════╪═════════╡
+│ 0 ┆ 0 ┆ 0 │
+│ 1 ┆ 1 ┆ 1 │
+│ 2 ┆ 2 ┆ 2 │
+└───────┴────────────┴─────────┘
+
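+Other options compose in the same call; for instance, truncating the cleaned
+labels (a sketch reusing the df above, output elided):
+```python
+# Clean, drop special characters, and cap each label at 5 characters;
+# the columns become 'aloha', 'bell_', 'anima':
+df.clean_names(remove_special=True, truncate_limit=5)
+```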
New in version 0.28.0
+Parameters:
+Name | +Type | +Description | +Default | +
---|---|---|---|
+ strip_underscores
+ |
+
+ str | bool
+ |
+
+
+
+ Removes the outer underscores from all +column names. Default None keeps outer underscores. Values can be +either 'left', 'right' or 'both' or the respective shorthand 'l', +'r' and True. + |
+
+ None
+ |
+
+ case_type
+ |
+
+ str
+ |
+
+
+
+ Whether to make the column names lower or uppercase. +Current case may be preserved with 'preserve', +while snake case conversion (from CamelCase or camelCase only) +can be turned on using "snake". +Default 'lower' makes all characters lowercase. + |
+
+ 'lower'
+ |
+
+ remove_special
+ |
+
+ bool
+ |
+
+
+
+ Remove special characters from the column names. +Only letters, numbers and underscores are preserved. + |
+
+ False
+ |
+
+ strip_accents
+ |
+
+ bool
+ |
+
+
+
+ Whether or not to remove accents from +the labels. + |
+
+ False
+ |
+
+ truncate_limit
+ |
+
+ int
+ |
+
+
+
+ Truncates formatted column names to +the specified length. Default None does not truncate. + |
+
+ None
+ |
+
Returns:
+Type | +Description | +
---|---|
+ DataFrame | LazyFrame
+ |
+
+
+
+ A polars DataFrame/LazyFrame. + |
+
janitor/polars/clean_names.py
35 + 36 + 37 + 38 + 39 + 40 + 41 + 42 + 43 + 44 + 45 + 46 + 47 + 48 + 49 + 50 + 51 + 52 + 53 + 54 + 55 + 56 + 57 + 58 + 59 + 60 + 61 + 62 + 63 + 64 + 65 + 66 + 67 + 68 + 69 + 70 + 71 + 72 + 73 + 74 + 75 + 76 + 77 + 78 + 79 + 80 + 81 + 82 + 83 + 84 + 85 + 86 + 87 + 88 + 89 + 90 + 91 + 92 + 93 + 94 + 95 + 96 + 97 + 98 + 99 +100 +101 +102 +103 +104 +105 +106 +107 +108 +109 +110 +111 +112 +113 +114 |
|
make_clean_names(expression, strip_underscores=None, case_type='lower', remove_special=False, strip_accents=False, enforce_string=False, truncate_limit=None)
+
+Clean the labels in a polars Expression.
+ + +Examples:
+>>> import polars as pl
+>>> import janitor.polars
+>>> df = pl.DataFrame({"raw": ["Abçdê fgí j"]})
+>>> df
+shape: (1, 1)
+┌─────────────┐
+│ raw │
+│ --- │
+│ str │
+╞═════════════╡
+│ Abçdê fgí j │
+└─────────────┘
+
Clean the column values:
+>>> df.with_columns(pl.col("raw").make_clean_names(strip_accents=True))
+shape: (1, 1)
+┌─────────────┐
+│ raw │
+│ --- │
+│ str │
+╞═════════════╡
+│ abcde_fgi_j │
+└─────────────┘
+
New in version 0.28.0
+Parameters:
+Name | +Type | +Description | +Default | +
---|---|---|---|
+ strip_underscores
+ |
+
+ str | bool
+ |
+
+
+
+ Removes the outer underscores +from all labels in the expression. +Default None keeps outer underscores. +Values can be either 'left', 'right' +or 'both' or the respective shorthand 'l', +'r' and True. + |
+
+ None
+ |
+
+ case_type
+ |
+
+ str
+ |
+
+
+
+ Whether to make the labels in the expression lower or uppercase. +Current case may be preserved with 'preserve', +while snake case conversion (from CamelCase or camelCase only) +can be turned on using "snake". +Default 'lower' makes all characters lowercase. + |
+
+ 'lower'
+ |
+
+ remove_special
+ |
+
+ bool
+ |
+
+
+
+ Remove special characters from the values in the expression. +Only letters, numbers and underscores are preserved. + |
+
+ False
+ |
+
+ strip_accents
+ |
+
+ bool
+ |
+
+
+
+ Whether or not to remove accents from +the expression. + |
+
+ False
+ |
+
+ enforce_string
+ |
+
+ bool
+ |
+
+
+
+ Whether or not to cast the expression to a string type. + |
+
+ False
+ |
+
+ truncate_limit
+ |
+
+ int
+ |
+
+
+
+ Truncates formatted labels in the expression to +the specified length. Default None does not truncate. + |
+
+ None
+ |
+
Returns:
+Type | +Description | +
---|---|
+ Expr
+ |
+
+
+
+ A polars Expression. + |
+
janitor/polars/clean_names.py
117 +118 +119 +120 +121 +122 +123 +124 +125 +126 +127 +128 +129 +130 +131 +132 +133 +134 +135 +136 +137 +138 +139 +140 +141 +142 +143 +144 +145 +146 +147 +148 +149 +150 +151 +152 +153 +154 +155 +156 +157 +158 +159 +160 +161 +162 +163 +164 +165 +166 +167 +168 +169 +170 +171 +172 +173 +174 +175 +176 +177 +178 +179 +180 +181 +182 +183 +184 +185 +186 +187 +188 |
|
complete
+
+
+complete implementation for polars.
+ + + + + + + + +complete(df, *columns, fill_value=None, explicit=True, sort=False, by=None)
+
+Turns implicit missing values into explicit missing values.
+It is modeled after tidyr's complete
function.
+In a way, it is the inverse of pl.drop_nulls
,
+as it exposes implicitly missing rows.
If new values need to be introduced, a polars Expression +or a polars Series with the new values can be passed, +as long as the polars Expression/Series +has a name that already exists in the DataFrame.
+complete
can also be applied to a LazyFrame.
Examples:
+>>> import polars as pl
+>>> import janitor.polars
+>>> df = pl.DataFrame(
+... dict(
+... group=(1, 2, 1, 2),
+... item_id=(1, 2, 2, 3),
+... item_name=("a", "a", "b", "b"),
+... value1=(1, None, 3, 4),
+... value2=range(4, 8),
+... )
+... )
+>>> df
+shape: (4, 5)
+┌───────┬─────────┬───────────┬────────┬────────┐
+│ group ┆ item_id ┆ item_name ┆ value1 ┆ value2 │
+│ --- ┆ --- ┆ --- ┆ --- ┆ --- │
+│ i64 ┆ i64 ┆ str ┆ i64 ┆ i64 │
+╞═══════╪═════════╪═══════════╪════════╪════════╡
+│ 1 ┆ 1 ┆ a ┆ 1 ┆ 4 │
+│ 2 ┆ 2 ┆ a ┆ null ┆ 5 │
+│ 1 ┆ 2 ┆ b ┆ 3 ┆ 6 │
+│ 2 ┆ 3 ┆ b ┆ 4 ┆ 7 │
+└───────┴─────────┴───────────┴────────┴────────┘
+
Generate all possible combinations of
+group
, item_id
, and item_name
+(whether or not they appear in the data):
>>> with pl.Config(tbl_rows=-1):
+... df.complete("group", "item_id", "item_name", sort=True)
+shape: (12, 5)
+┌───────┬─────────┬───────────┬────────┬────────┐
+│ group ┆ item_id ┆ item_name ┆ value1 ┆ value2 │
+│ --- ┆ --- ┆ --- ┆ --- ┆ --- │
+│ i64 ┆ i64 ┆ str ┆ i64 ┆ i64 │
+╞═══════╪═════════╪═══════════╪════════╪════════╡
+│ 1 ┆ 1 ┆ a ┆ 1 ┆ 4 │
+│ 1 ┆ 1 ┆ b ┆ null ┆ null │
+│ 1 ┆ 2 ┆ a ┆ null ┆ null │
+│ 1 ┆ 2 ┆ b ┆ 3 ┆ 6 │
+│ 1 ┆ 3 ┆ a ┆ null ┆ null │
+│ 1 ┆ 3 ┆ b ┆ null ┆ null │
+│ 2 ┆ 1 ┆ a ┆ null ┆ null │
+│ 2 ┆ 1 ┆ b ┆ null ┆ null │
+│ 2 ┆ 2 ┆ a ┆ null ┆ 5 │
+│ 2 ┆ 2 ┆ b ┆ null ┆ null │
+│ 2 ┆ 3 ┆ a ┆ null ┆ null │
+│ 2 ┆ 3 ┆ b ┆ 4 ┆ 7 │
+└───────┴─────────┴───────────┴────────┴────────┘
+
Cross all possible group
values with the unique pairs of
+(item_id, item_name)
that already exist in the data.
>>> with pl.Config(tbl_rows=-1):
+... df.select(
+... "group", pl.struct("item_id", "item_name"), "value1", "value2"
+... ).complete("group", "item_id", sort=True).unnest("item_id")
+shape: (8, 5)
+┌───────┬─────────┬───────────┬────────┬────────┐
+│ group ┆ item_id ┆ item_name ┆ value1 ┆ value2 │
+│ --- ┆ --- ┆ --- ┆ --- ┆ --- │
+│ i64 ┆ i64 ┆ str ┆ i64 ┆ i64 │
+╞═══════╪═════════╪═══════════╪════════╪════════╡
+│ 1 ┆ 1 ┆ a ┆ 1 ┆ 4 │
+│ 1 ┆ 2 ┆ a ┆ null ┆ null │
+│ 1 ┆ 2 ┆ b ┆ 3 ┆ 6 │
+│ 1 ┆ 3 ┆ b ┆ null ┆ null │
+│ 2 ┆ 1 ┆ a ┆ null ┆ null │
+│ 2 ┆ 2 ┆ a ┆ null ┆ 5 │
+│ 2 ┆ 2 ┆ b ┆ null ┆ null │
+│ 2 ┆ 3 ┆ b ┆ 4 ┆ 7 │
+└───────┴─────────┴───────────┴────────┴────────┘
+
Fill in nulls:
+>>> with pl.Config(tbl_rows=-1):
+... df.select(
+... "group", pl.struct("item_id", "item_name"), "value1", "value2"
+... ).complete(
+... "group",
+... "item_id",
+... fill_value={"value1": 0, "value2": 99},
+... explicit=True,
+... sort=True,
+... ).unnest("item_id")
+shape: (8, 5)
+┌───────┬─────────┬───────────┬────────┬────────┐
+│ group ┆ item_id ┆ item_name ┆ value1 ┆ value2 │
+│ --- ┆ --- ┆ --- ┆ --- ┆ --- │
+│ i64 ┆ i64 ┆ str ┆ i64 ┆ i64 │
+╞═══════╪═════════╪═══════════╪════════╪════════╡
+│ 1 ┆ 1 ┆ a ┆ 1 ┆ 4 │
+│ 1 ┆ 2 ┆ a ┆ 0 ┆ 99 │
+│ 1 ┆ 2 ┆ b ┆ 3 ┆ 6 │
+│ 1 ┆ 3 ┆ b ┆ 0 ┆ 99 │
+│ 2 ┆ 1 ┆ a ┆ 0 ┆ 99 │
+│ 2 ┆ 2 ┆ a ┆ 0 ┆ 5 │
+│ 2 ┆ 2 ┆ b ┆ 0 ┆ 99 │
+│ 2 ┆ 3 ┆ b ┆ 4 ┆ 7 │
+└───────┴─────────┴───────────┴────────┴────────┘
+
Limit the fill to only the newly created
+missing values with explicit = False:
>>> with pl.Config(tbl_rows=-1):
+... df.select(
+... "group", pl.struct("item_id", "item_name"), "value1", "value2"
+... ).complete(
+... "group",
+... "item_id",
+... fill_value={"value1": 0, "value2": 99},
+... explicit=False,
+... sort=True,
+... ).unnest("item_id").sort(pl.all())
+shape: (8, 5)
+┌───────┬─────────┬───────────┬────────┬────────┐
+│ group ┆ item_id ┆ item_name ┆ value1 ┆ value2 │
+│ --- ┆ --- ┆ --- ┆ --- ┆ --- │
+│ i64 ┆ i64 ┆ str ┆ i64 ┆ i64 │
+╞═══════╪═════════╪═══════════╪════════╪════════╡
+│ 1 ┆ 1 ┆ a ┆ 1 ┆ 4 │
+│ 1 ┆ 2 ┆ a ┆ 0 ┆ 99 │
+│ 1 ┆ 2 ┆ b ┆ 3 ┆ 6 │
+│ 1 ┆ 3 ┆ b ┆ 0 ┆ 99 │
+│ 2 ┆ 1 ┆ a ┆ 0 ┆ 99 │
+│ 2 ┆ 2 ┆ a ┆ null ┆ 5 │
+│ 2 ┆ 2 ┆ b ┆ 0 ┆ 99 │
+│ 2 ┆ 3 ┆ b ┆ 4 ┆ 7 │
+└───────┴─────────┴───────────┴────────┴────────┘
+
>>> df = pl.DataFrame(
+... {
+... "Year": [1999, 2000, 2004, 1999, 2004],
+... "Taxon": [
+... "Saccharina",
+... "Saccharina",
+... "Saccharina",
+... "Agarum",
+... "Agarum",
+... ],
+... "Abundance": [4, 5, 2, 1, 8],
+... }
+... )
+>>> df
+shape: (5, 3)
+┌──────┬────────────┬───────────┐
+│ Year ┆ Taxon ┆ Abundance │
+│ --- ┆ --- ┆ --- │
+│ i64 ┆ str ┆ i64 │
+╞══════╪════════════╪═══════════╡
+│ 1999 ┆ Saccharina ┆ 4 │
+│ 2000 ┆ Saccharina ┆ 5 │
+│ 2004 ┆ Saccharina ┆ 2 │
+│ 1999 ┆ Agarum ┆ 1 │
+│ 2004 ┆ Agarum ┆ 8 │
+└──────┴────────────┴───────────┘
+
+Expose missing years from 1999 to 2004 -
+pass a polars expression with the new years,
+and ensure the expression's name already exists
+in the DataFrame:
+>>> expression = pl.int_range(1999,2005).alias('Year')
+>>> with pl.Config(tbl_rows=-1):
+... df.complete(expression,'Taxon',sort=True)
+shape: (12, 3)
+┌──────┬────────────┬───────────┐
+│ Year ┆ Taxon ┆ Abundance │
+│ --- ┆ --- ┆ --- │
+│ i64 ┆ str ┆ i64 │
+╞══════╪════════════╪═══════════╡
+│ 1999 ┆ Agarum ┆ 1 │
+│ 1999 ┆ Saccharina ┆ 4 │
+│ 2000 ┆ Agarum ┆ null │
+│ 2000 ┆ Saccharina ┆ 5 │
+│ 2001 ┆ Agarum ┆ null │
+│ 2001 ┆ Saccharina ┆ null │
+│ 2002 ┆ Agarum ┆ null │
+│ 2002 ┆ Saccharina ┆ null │
+│ 2003 ┆ Agarum ┆ null │
+│ 2003 ┆ Saccharina ┆ null │
+│ 2004 ┆ Agarum ┆ 8 │
+│ 2004 ┆ Saccharina ┆ 2 │
+└──────┴────────────┴───────────┘
+
Expose missing rows per group:
+>>> df = pl.DataFrame(
+... {
+... "state": ["CA", "CA", "HI", "HI", "HI", "NY", "NY"],
+... "year": [2010, 2013, 2010, 2012, 2016, 2009, 2013],
+... "value": [1, 3, 1, 2, 3, 2, 5],
+... }
+... )
+>>> df
+shape: (7, 3)
+┌───────┬──────┬───────┐
+│ state ┆ year ┆ value │
+│ --- ┆ --- ┆ --- │
+│ str ┆ i64 ┆ i64 │
+╞═══════╪══════╪═══════╡
+│ CA ┆ 2010 ┆ 1 │
+│ CA ┆ 2013 ┆ 3 │
+│ HI ┆ 2010 ┆ 1 │
+│ HI ┆ 2012 ┆ 2 │
+│ HI ┆ 2016 ┆ 3 │
+│ NY ┆ 2009 ┆ 2 │
+│ NY ┆ 2013 ┆ 5 │
+└───────┴──────┴───────┘
+>>> low = pl.col('year').min()
+>>> high = pl.col('year').max().add(1)
+>>> new_year_values=pl.int_range(low,high).alias('year')
+>>> with pl.Config(tbl_rows=-1):
+... df.complete(new_year_values,by='state',sort=True)
+shape: (16, 3)
+┌───────┬──────┬───────┐
+│ state ┆ year ┆ value │
+│ --- ┆ --- ┆ --- │
+│ str ┆ i64 ┆ i64 │
+╞═══════╪══════╪═══════╡
+│ CA ┆ 2010 ┆ 1 │
+│ CA ┆ 2011 ┆ null │
+│ CA ┆ 2012 ┆ null │
+│ CA ┆ 2013 ┆ 3 │
+│ HI ┆ 2010 ┆ 1 │
+│ HI ┆ 2011 ┆ null │
+│ HI ┆ 2012 ┆ 2 │
+│ HI ┆ 2013 ┆ null │
+│ HI ┆ 2014 ┆ null │
+│ HI ┆ 2015 ┆ null │
+│ HI ┆ 2016 ┆ 3 │
+│ NY ┆ 2009 ┆ 2 │
+│ NY ┆ 2010 ┆ null │
+│ NY ┆ 2011 ┆ null │
+│ NY ┆ 2012 ┆ null │
+│ NY ┆ 2013 ┆ 5 │
+└───────┴──────┴───────┘
+
New in version 0.28.0
+Parameters:
+Name | +Type | +Description | +Default | +
---|---|---|---|
+ *columns
+ |
+
+ ColumnNameOrSelector
+ |
+
+
+
+ This refers to the columns to be completed.
+It can be a string, a column selector, or a polars expression.
+A polars expression can be used to introduce new values,
+as long as the polars expression has a name that already exists
+in the DataFrame.
+
+ ()
+ |
+
+ fill_value
+ |
+
+ dict | Any | Expr
+ |
+
+
+
+ Scalar value or polars expression to use instead of nulls +for missing combinations. A dictionary, mapping columns names +to a scalar value is also accepted. + |
+
+ None
+ |
+
+ explicit
+ |
+
+ bool
+ |
+
+
+
+ Determines if only implicitly missing values
+should be filled (`False`), or all nulls existing in the DataFrame (`True`). |
+
+ True
+ |
+
+ sort
+ |
+
+ bool
+ |
+
+
+
+ Sort the DataFrame based on *columns. + |
+
+ False
+ |
+
+ by
+ |
+
+ ColumnNameOrSelector
+ |
+
+
+
+ Column(s) to group by. +The explicit missing rows are returned per group. + |
+
+ None
+ |
+
Returns:
+Type | +Description | +
---|---|
+ DataFrame | LazyFrame
+ |
+
+
+
+ A polars DataFrame/LazyFrame. + |
+
janitor/polars/complete.py
268 +269 +270 +271 +272 +273 +274 +275 +276 +277 +278 +279 +280 +281 +282 +283 +284 +285 +286 +287 +288 +289 +290 +291 +292 +293 +294 +295 +296 +297 +298 +299 +300 +301 +302 +303 +304 +305 +306 +307 +308 +309 +310 +311 +312 +313 +314 +315 +316 +317 +318 +319 +320 +321 +322 +323 +324 +325 +326 +327 +328 +329 +330 +331 +332 +333 +334 +335 +336 +337 +338 +339 +340 +341 +342 +343 +344 +345 +346 +347 +348 +349 +350 +351 +352 +353 +354 +355 +356 +357 +358 +359 +360 +361 +362 +363 +364 +365 +366 +367 +368 +369 +370 +371 +372 +373 +374 +375 +376 +377 +378 +379 +380 +381 +382 +383 +384 +385 +386 +387 +388 +389 +390 +391 +392 +393 +394 +395 +396 +397 +398 +399 +400 +401 +402 +403 +404 +405 +406 +407 +408 +409 +410 +411 +412 +413 +414 +415 +416 +417 +418 +419 +420 +421 +422 +423 +424 +425 +426 +427 +428 +429 +430 +431 +432 +433 +434 +435 +436 +437 +438 +439 +440 +441 +442 +443 +444 +445 +446 +447 +448 +449 +450 +451 +452 +453 +454 +455 +456 +457 +458 +459 +460 +461 +462 +463 +464 +465 +466 +467 +468 +469 +470 +471 +472 +473 +474 +475 +476 +477 +478 +479 +480 +481 +482 +483 +484 +485 +486 +487 +488 +489 +490 +491 +492 +493 +494 +495 +496 +497 +498 +499 +500 +501 +502 +503 +504 +505 +506 +507 +508 +509 +510 +511 +512 +513 +514 +515 +516 +517 +518 +519 +520 +521 +522 +523 +524 +525 +526 +527 +528 +529 +530 +531 +532 +533 +534 +535 +536 +537 +538 +539 +540 +541 +542 +543 +544 +545 +546 +547 +548 +549 +550 +551 +552 +553 +554 +555 +556 +557 |
|
expand(df, *columns, sort=False, by=None)
+
+Creates a DataFrame from a cartesian combination of all inputs.
+Inspiration is from tidyr's expand() function.
+expand() is often useful with
+pl.DataFrame.join
+to convert implicit
+missing values to explicit missing values - similar to
+complete
.
It can also be used to figure out which combinations are missing +(e.g identify gaps in your DataFrame).
+The variable columns
parameter can be a string,
+a ColumnSelector, a polars expression, or a polars Series.
expand
can also be applied to a LazyFrame.
Examples:
+>>> import polars as pl
+>>> import janitor.polars
+>>> data = [{'type': 'apple', 'year': 2010, 'size': 'XS'},
+... {'type': 'orange', 'year': 2010, 'size': 'S'},
+... {'type': 'apple', 'year': 2012, 'size': 'M'},
+... {'type': 'orange', 'year': 2010, 'size': 'S'},
+... {'type': 'orange', 'year': 2011, 'size': 'S'},
+... {'type': 'orange', 'year': 2012, 'size': 'M'}]
+>>> df = pl.DataFrame(data)
+>>> df
+shape: (6, 3)
+┌────────┬──────┬──────┐
+│ type ┆ year ┆ size │
+│ --- ┆ --- ┆ --- │
+│ str ┆ i64 ┆ str │
+╞════════╪══════╪══════╡
+│ apple ┆ 2010 ┆ XS │
+│ orange ┆ 2010 ┆ S │
+│ apple ┆ 2012 ┆ M │
+│ orange ┆ 2010 ┆ S │
+│ orange ┆ 2011 ┆ S │
+│ orange ┆ 2012 ┆ M │
+└────────┴──────┴──────┘
+
Get unique observations:
+>>> df.expand('type',sort=True)
+shape: (2, 1)
+┌────────┐
+│ type │
+│ --- │
+│ str │
+╞════════╡
+│ apple │
+│ orange │
+└────────┘
+>>> df.expand('size',sort=True)
+shape: (3, 1)
+┌──────┐
+│ size │
+│ --- │
+│ str │
+╞══════╡
+│ M │
+│ S │
+│ XS │
+└──────┘
+>>> df.expand('type', 'size',sort=True)
+shape: (6, 2)
+┌────────┬──────┐
+│ type ┆ size │
+│ --- ┆ --- │
+│ str ┆ str │
+╞════════╪══════╡
+│ apple ┆ M │
+│ apple ┆ S │
+│ apple ┆ XS │
+│ orange ┆ M │
+│ orange ┆ S │
+│ orange ┆ XS │
+└────────┴──────┘
+>>> with pl.Config(tbl_rows=-1):
+... df.expand('type','size','year',sort=True)
+shape: (18, 3)
+┌────────┬──────┬──────┐
+│ type ┆ size ┆ year │
+│ --- ┆ --- ┆ --- │
+│ str ┆ str ┆ i64 │
+╞════════╪══════╪══════╡
+│ apple ┆ M ┆ 2010 │
+│ apple ┆ M ┆ 2011 │
+│ apple ┆ M ┆ 2012 │
+│ apple ┆ S ┆ 2010 │
+│ apple ┆ S ┆ 2011 │
+│ apple ┆ S ┆ 2012 │
+│ apple ┆ XS ┆ 2010 │
+│ apple ┆ XS ┆ 2011 │
+│ apple ┆ XS ┆ 2012 │
+│ orange ┆ M ┆ 2010 │
+│ orange ┆ M ┆ 2011 │
+│ orange ┆ M ┆ 2012 │
+│ orange ┆ S ┆ 2010 │
+│ orange ┆ S ┆ 2011 │
+│ orange ┆ S ┆ 2012 │
+│ orange ┆ XS ┆ 2010 │
+│ orange ┆ XS ┆ 2011 │
+│ orange ┆ XS ┆ 2012 │
+└────────┴──────┴──────┘
+
Get observations that only occur in the data:
+>>> df.expand(pl.struct('type','size'),sort=True).unnest('type')
+shape: (4, 2)
+┌────────┬──────┐
+│ type ┆ size │
+│ --- ┆ --- │
+│ str ┆ str │
+╞════════╪══════╡
+│ apple ┆ M │
+│ apple ┆ XS │
+│ orange ┆ M │
+│ orange ┆ S │
+└────────┴──────┘
+>>> df.expand(pl.struct('type','size','year'),sort=True).unnest('type')
+shape: (5, 3)
+┌────────┬──────┬──────┐
+│ type ┆ size ┆ year │
+│ --- ┆ --- ┆ --- │
+│ str ┆ str ┆ i64 │
+╞════════╪══════╪══════╡
+│ apple ┆ M ┆ 2012 │
+│ apple ┆ XS ┆ 2010 │
+│ orange ┆ M ┆ 2012 │
+│ orange ┆ S ┆ 2010 │
+│ orange ┆ S ┆ 2011 │
+└────────┴──────┴──────┘
+
Expand the DataFrame to include new observations:
+>>> with pl.Config(tbl_rows=-1):
+... df.expand('type','size',pl.int_range(2010,2014).alias('new_year'),sort=True)
+shape: (24, 3)
+┌────────┬──────┬──────────┐
+│ type ┆ size ┆ new_year │
+│ --- ┆ --- ┆ --- │
+│ str ┆ str ┆ i64 │
+╞════════╪══════╪══════════╡
+│ apple ┆ M ┆ 2010 │
+│ apple ┆ M ┆ 2011 │
+│ apple ┆ M ┆ 2012 │
+│ apple ┆ M ┆ 2013 │
+│ apple ┆ S ┆ 2010 │
+│ apple ┆ S ┆ 2011 │
+│ apple ┆ S ┆ 2012 │
+│ apple ┆ S ┆ 2013 │
+│ apple ┆ XS ┆ 2010 │
+│ apple ┆ XS ┆ 2011 │
+│ apple ┆ XS ┆ 2012 │
+│ apple ┆ XS ┆ 2013 │
+│ orange ┆ M ┆ 2010 │
+│ orange ┆ M ┆ 2011 │
+│ orange ┆ M ┆ 2012 │
+│ orange ┆ M ┆ 2013 │
+│ orange ┆ S ┆ 2010 │
+│ orange ┆ S ┆ 2011 │
+│ orange ┆ S ┆ 2012 │
+│ orange ┆ S ┆ 2013 │
+│ orange ┆ XS ┆ 2010 │
+│ orange ┆ XS ┆ 2011 │
+│ orange ┆ XS ┆ 2012 │
+│ orange ┆ XS ┆ 2013 │
+└────────┴──────┴──────────┘
+
Filter for missing observations:
+>>> columns = ('type','size','year')
+>>> with pl.Config(tbl_rows=-1):
+... df.expand(*columns).join(df, how='anti', on=columns).sort(by=pl.all())
+shape: (13, 3)
+┌────────┬──────┬──────┐
+│ type ┆ size ┆ year │
+│ --- ┆ --- ┆ --- │
+│ str ┆ str ┆ i64 │
+╞════════╪══════╪══════╡
+│ apple ┆ M ┆ 2010 │
+│ apple ┆ M ┆ 2011 │
+│ apple ┆ S ┆ 2010 │
+│ apple ┆ S ┆ 2011 │
+│ apple ┆ S ┆ 2012 │
+│ apple ┆ XS ┆ 2011 │
+│ apple ┆ XS ┆ 2012 │
+│ orange ┆ M ┆ 2010 │
+│ orange ┆ M ┆ 2011 │
+│ orange ┆ S ┆ 2012 │
+│ orange ┆ XS ┆ 2010 │
+│ orange ┆ XS ┆ 2011 │
+│ orange ┆ XS ┆ 2012 │
+└────────┴──────┴──────┘
+
Expand within each group, using by:
>>> with pl.Config(tbl_rows=-1):
+... df.expand('year','size',by='type',sort=True)
+shape: (10, 3)
+┌────────┬──────┬──────┐
+│ type ┆ year ┆ size │
+│ --- ┆ --- ┆ --- │
+│ str ┆ i64 ┆ str │
+╞════════╪══════╪══════╡
+│ apple ┆ 2010 ┆ M │
+│ apple ┆ 2010 ┆ XS │
+│ apple ┆ 2012 ┆ M │
+│ apple ┆ 2012 ┆ XS │
+│ orange ┆ 2010 ┆ M │
+│ orange ┆ 2010 ┆ S │
+│ orange ┆ 2011 ┆ M │
+│ orange ┆ 2011 ┆ S │
+│ orange ┆ 2012 ┆ M │
+│ orange ┆ 2012 ┆ S │
+└────────┴──────┴──────┘
+
New in version 0.28.0
+Parameters:
+Name | +Type | +Description | +Default | +
---|---|---|---|
+ *columns
+ |
+
+ ColumnNameOrSelector
+ |
+
+
+
+ This refers to the columns to be expanded.
+It can be a string, a column selector, or a polars expression.
+A polars expression can be used to introduce new values,
+as long as the polars expression has a name that already exists
+in the DataFrame.
+
+ ()
+ |
+
+ sort
+ |
+
+ bool
+ |
+
+
+
+ Sort the DataFrame based on *columns. + |
+
+ False
+ |
+
+ by
+ |
+
+ ColumnNameOrSelector
+ |
+
+
+
+ Column(s) to group by. + |
+
+ None
+ |
+
Returns:
+Type | +Description | +
---|---|
+ DataFrame | LazyFrame
+ |
+
+
+
+ A polars DataFrame/LazyFrame. + |
+
janitor/polars/complete.py
24 + 25 + 26 + 27 + 28 + 29 + 30 + 31 + 32 + 33 + 34 + 35 + 36 + 37 + 38 + 39 + 40 + 41 + 42 + 43 + 44 + 45 + 46 + 47 + 48 + 49 + 50 + 51 + 52 + 53 + 54 + 55 + 56 + 57 + 58 + 59 + 60 + 61 + 62 + 63 + 64 + 65 + 66 + 67 + 68 + 69 + 70 + 71 + 72 + 73 + 74 + 75 + 76 + 77 + 78 + 79 + 80 + 81 + 82 + 83 + 84 + 85 + 86 + 87 + 88 + 89 + 90 + 91 + 92 + 93 + 94 + 95 + 96 + 97 + 98 + 99 +100 +101 +102 +103 +104 +105 +106 +107 +108 +109 +110 +111 +112 +113 +114 +115 +116 +117 +118 +119 +120 +121 +122 +123 +124 +125 +126 +127 +128 +129 +130 +131 +132 +133 +134 +135 +136 +137 +138 +139 +140 +141 +142 +143 +144 +145 +146 +147 +148 +149 +150 +151 +152 +153 +154 +155 +156 +157 +158 +159 +160 +161 +162 +163 +164 +165 +166 +167 +168 +169 +170 +171 +172 +173 +174 +175 +176 +177 +178 +179 +180 +181 +182 +183 +184 +185 +186 +187 +188 +189 +190 +191 +192 +193 +194 +195 +196 +197 +198 +199 +200 +201 +202 +203 +204 +205 +206 +207 +208 +209 +210 +211 +212 +213 +214 +215 +216 +217 +218 +219 +220 +221 +222 +223 +224 +225 +226 +227 +228 +229 +230 +231 +232 +233 +234 +235 +236 +237 +238 +239 +240 +241 +242 +243 +244 +245 +246 +247 +248 +249 +250 +251 +252 +253 +254 +255 +256 +257 +258 +259 +260 +261 +262 +263 +264 +265 |
|
pivot_longer
+
+
+pivot_longer implementation for polars.
+ + + + + + + + +pivot_longer(df, index=None, column_names=None, names_to='variable', values_to='value', names_sep=None, names_pattern=None, names_transform=None)
+
+Unpivots a DataFrame from wide to long format.
+It is modeled after the pivot_longer
function in R's tidyr package,
+and also takes inspiration from the melt
function in R's data.table package.
This function is useful to massage a DataFrame into a format where +one or more columns are considered measured variables, and all other +columns are considered as identifier variables.
+All measured variables are unpivoted (and typically duplicated) along the +row axis.
+If names_pattern
, use a valid regular expression pattern containing at least
+one capture group, compatible with the regex crate.
For more granular control on the unpivoting, have a look at
+pivot_longer_spec
.
pivot_longer
can also be applied to a LazyFrame.
Examples:
+>>> import polars as pl
+>>> import polars.selectors as cs
+>>> import janitor.polars
+>>> df = pl.DataFrame(
+... {
+... "Sepal.Length": [5.1, 5.9],
+... "Sepal.Width": [3.5, 3.0],
+... "Petal.Length": [1.4, 5.1],
+... "Petal.Width": [0.2, 1.8],
+... "Species": ["setosa", "virginica"],
+... }
+... )
+>>> df
+shape: (2, 5)
+┌──────────────┬─────────────┬──────────────┬─────────────┬───────────┐
+│ Sepal.Length ┆ Sepal.Width ┆ Petal.Length ┆ Petal.Width ┆ Species │
+│ --- ┆ --- ┆ --- ┆ --- ┆ --- │
+│ f64 ┆ f64 ┆ f64 ┆ f64 ┆ str │
+╞══════════════╪═════════════╪══════════════╪═════════════╪═══════════╡
+│ 5.1 ┆ 3.5 ┆ 1.4 ┆ 0.2 ┆ setosa │
+│ 5.9 ┆ 3.0 ┆ 5.1 ┆ 1.8 ┆ virginica │
+└──────────────┴─────────────┴──────────────┴─────────────┴───────────┘
+
Replicate polars' melt:
+>>> df.pivot_longer(index = 'Species').sort(by=pl.all())
+shape: (8, 3)
+┌───────────┬──────────────┬───────┐
+│ Species ┆ variable ┆ value │
+│ --- ┆ --- ┆ --- │
+│ str ┆ str ┆ f64 │
+╞═══════════╪══════════════╪═══════╡
+│ setosa ┆ Petal.Length ┆ 1.4 │
+│ setosa ┆ Petal.Width ┆ 0.2 │
+│ setosa ┆ Sepal.Length ┆ 5.1 │
+│ setosa ┆ Sepal.Width ┆ 3.5 │
+│ virginica ┆ Petal.Length ┆ 5.1 │
+│ virginica ┆ Petal.Width ┆ 1.8 │
+│ virginica ┆ Sepal.Length ┆ 5.9 │
+│ virginica ┆ Sepal.Width ┆ 3.0 │
+└───────────┴──────────────┴───────┘
+
Split the column labels into individual columns:
+>>> df.pivot_longer(
+... index = 'Species',
+... names_to = ('part', 'dimension'),
+... names_sep = '.',
+... ).select('Species','part','dimension','value').sort(by=pl.all())
+shape: (8, 4)
+┌───────────┬───────┬───────────┬───────┐
+│ Species ┆ part ┆ dimension ┆ value │
+│ --- ┆ --- ┆ --- ┆ --- │
+│ str ┆ str ┆ str ┆ f64 │
+╞═══════════╪═══════╪═══════════╪═══════╡
+│ setosa ┆ Petal ┆ Length ┆ 1.4 │
+│ setosa ┆ Petal ┆ Width ┆ 0.2 │
+│ setosa ┆ Sepal ┆ Length ┆ 5.1 │
+│ setosa ┆ Sepal ┆ Width ┆ 3.5 │
+│ virginica ┆ Petal ┆ Length ┆ 5.1 │
+│ virginica ┆ Petal ┆ Width ┆ 1.8 │
+│ virginica ┆ Sepal ┆ Length ┆ 5.9 │
+│ virginica ┆ Sepal ┆ Width ┆ 3.0 │
+└───────────┴───────┴───────────┴───────┘
+
Retain parts of the column names as headers:
+>>> df.pivot_longer(
+... index = 'Species',
+... names_to = ('part', '.value'),
+... names_sep = '.',
+... ).select('Species','part','Length','Width').sort(by=pl.all())
+shape: (4, 4)
+┌───────────┬───────┬────────┬───────┐
+│ Species ┆ part ┆ Length ┆ Width │
+│ --- ┆ --- ┆ --- ┆ --- │
+│ str ┆ str ┆ f64 ┆ f64 │
+╞═══════════╪═══════╪════════╪═══════╡
+│ setosa ┆ Petal ┆ 1.4 ┆ 0.2 │
+│ setosa ┆ Sepal ┆ 5.1 ┆ 3.5 │
+│ virginica ┆ Petal ┆ 5.1 ┆ 1.8 │
+│ virginica ┆ Sepal ┆ 5.9 ┆ 3.0 │
+└───────────┴───────┴────────┴───────┘
+
Split the column labels based on regex:
+>>> df = pl.DataFrame({"id": [1], "new_sp_m5564": [2], "newrel_f65": [3]})
+>>> df
+shape: (1, 3)
+┌─────┬──────────────┬────────────┐
+│ id ┆ new_sp_m5564 ┆ newrel_f65 │
+│ --- ┆ --- ┆ --- │
+│ i64 ┆ i64 ┆ i64 │
+╞═════╪══════════════╪════════════╡
+│ 1 ┆ 2 ┆ 3 │
+└─────┴──────────────┴────────────┘
+>>> df.pivot_longer(
+... index = 'id',
+... names_to = ('diagnosis', 'gender', 'age'),
+... names_pattern = r"new_?(.+)_(.)([0-9]+)",
+... ).select('id','diagnosis','gender','age','value').sort(by=pl.all())
+shape: (2, 5)
+┌─────┬───────────┬────────┬──────┬───────┐
+│ id ┆ diagnosis ┆ gender ┆ age ┆ value │
+│ --- ┆ --- ┆ --- ┆ --- ┆ --- │
+│ i64 ┆ str ┆ str ┆ str ┆ i64 │
+╞═════╪═══════════╪════════╪══════╪═══════╡
+│ 1 ┆ rel ┆ f ┆ 65 ┆ 3 │
+│ 1 ┆ sp ┆ m ┆ 5564 ┆ 2 │
+└─────┴───────────┴────────┴──────┴───────┘
+
Convert the dtypes of specific columns with names_transform:
>>> df.pivot_longer(
+... index = "id",
+... names_pattern=r"new_?(.+)_(.)([0-9]+)",
+... names_to=("diagnosis", "gender", "age"),
+... names_transform=pl.col('age').cast(pl.Int32),
+... ).select("id", "diagnosis", "gender", "age", "value").sort(by=pl.all())
+shape: (2, 5)
+┌─────┬───────────┬────────┬──────┬───────┐
+│ id ┆ diagnosis ┆ gender ┆ age ┆ value │
+│ --- ┆ --- ┆ --- ┆ --- ┆ --- │
+│ i64 ┆ str ┆ str ┆ i32 ┆ i64 │
+╞═════╪═══════════╪════════╪══════╪═══════╡
+│ 1 ┆ rel ┆ f ┆ 65 ┆ 3 │
+│ 1 ┆ sp ┆ m ┆ 5564 ┆ 2 │
+└─────┴───────────┴────────┴──────┴───────┘
+
Use multiple .value to reshape the dataframe:
>>> df = pl.DataFrame(
+... [
+... {
+... "x_1_mean": 10,
+... "x_2_mean": 20,
+... "y_1_mean": 30,
+... "y_2_mean": 40,
+... "unit": 50,
+... }
+... ]
+... )
+>>> df
+shape: (1, 5)
+┌──────────┬──────────┬──────────┬──────────┬──────┐
+│ x_1_mean ┆ x_2_mean ┆ y_1_mean ┆ y_2_mean ┆ unit │
+│ --- ┆ --- ┆ --- ┆ --- ┆ --- │
+│ i64 ┆ i64 ┆ i64 ┆ i64 ┆ i64 │
+╞══════════╪══════════╪══════════╪══════════╪══════╡
+│ 10 ┆ 20 ┆ 30 ┆ 40 ┆ 50 │
+└──────────┴──────────┴──────────┴──────────┴──────┘
+>>> df.pivot_longer(
+... index="unit",
+... names_to=(".value", "time", ".value"),
+... names_pattern=r"(x|y)_([0-9])(_mean)",
+... ).select('unit','time','x_mean','y_mean').sort(by=pl.all())
+shape: (2, 4)
+┌──────┬──────┬────────┬────────┐
+│ unit ┆ time ┆ x_mean ┆ y_mean │
+│ --- ┆ --- ┆ --- ┆ --- │
+│ i64 ┆ str ┆ i64 ┆ i64 │
+╞══════╪══════╪════════╪════════╡
+│ 50 ┆ 1 ┆ 10 ┆ 30 │
+│ 50 ┆ 2 ┆ 20 ┆ 40 │
+└──────┴──────┴────────┴────────┘
+
New in version 0.28.0
+Parameters:
+Name | +Type | +Description | +Default | +
---|---|---|---|
+ index
+ |
+
+ ColumnNameOrSelector
+ |
+
+
+
+ Column(s) or selector(s) to use as identifier variables. + |
+
+ None
+ |
+
+ column_names
+ |
+
+ ColumnNameOrSelector
+ |
+
+
+
+ Column(s) or selector(s) to unpivot. + |
+
+ None
+ |
+
+ names_to
+ |
+
+ list | tuple | str
+ |
+
+
+
+ Name of new column as a string that will contain
+what were previously the column names in `column_names`. |
+
+ 'variable'
+ |
+
+ values_to
+ |
+
+ str
+ |
+
+
+
+ Name of new column as a string that will contain what
+were previously the values of the columns in |
+
+ 'value'
+ |
+
+ names_sep
+ |
+
+ str
+ |
+
+
+
+ Determines how the column name is broken up, if
+ |
+
+ None
+ |
+
+ names_pattern
+ |
+
+ str
+ |
+
+
+
+ Determines how the column name is broken up.
+It can be a regular expression containing matching groups.
+It takes the same specification as
+polars' |
+
+ None
+ |
+
+ names_transform
+ |
+
+ Expr
+ |
+
+
+
+ Use this option to change the types of columns that +have been transformed to rows. +This does not applies to the values' columns. +Accepts a polars expression or a list of polars expressions. +Applicable only if one of names_sep +or names_pattern is provided. + |
+
+ None
+ |
+
Returns:

| Type | Description |
|---|---|
| `DataFrame \| LazyFrame` | A polars DataFrame/LazyFrame that has been unpivoted from wide to long format. |
janitor/polars/pivot_longer.py
pivot_longer_spec(df, spec)
+
+A declarative interface to pivot a Polars Frame +from wide to long form, +where you describe how the data will be unpivoted, +using a DataFrame.
It is modeled after tidyr's `pivot_longer_spec`.
This gives you, the user, +more control over the transformation to long form, +using a spec DataFrame that describes exactly +how data stored in the column names +becomes variables.
+It can come in handy for situations where
+pivot_longer
+seems inadequate for the transformation.
New in version 0.28.0
+Examples:
+>>> import polars as pl
+>>> from janitor.polars import pivot_longer_spec
+>>> df = pl.DataFrame(
+... {
+... "Sepal.Length": [5.1, 5.9],
+... "Sepal.Width": [3.5, 3.0],
+... "Petal.Length": [1.4, 5.1],
+... "Petal.Width": [0.2, 1.8],
+... "Species": ["setosa", "virginica"],
+... }
+... )
+>>> df
+shape: (2, 5)
+┌──────────────┬─────────────┬──────────────┬─────────────┬───────────┐
+│ Sepal.Length ┆ Sepal.Width ┆ Petal.Length ┆ Petal.Width ┆ Species │
+│ --- ┆ --- ┆ --- ┆ --- ┆ --- │
+│ f64 ┆ f64 ┆ f64 ┆ f64 ┆ str │
+╞══════════════╪═════════════╪══════════════╪═════════════╪═══════════╡
+│ 5.1 ┆ 3.5 ┆ 1.4 ┆ 0.2 ┆ setosa │
+│ 5.9 ┆ 3.0 ┆ 5.1 ┆ 1.8 ┆ virginica │
+└──────────────┴─────────────┴──────────────┴─────────────┴───────────┘
+>>> spec = {'.name':['Sepal.Length','Petal.Length',
+... 'Sepal.Width','Petal.Width'],
+... '.value':['Length','Length','Width','Width'],
+... 'part':['Sepal','Petal','Sepal','Petal']}
+>>> spec = pl.DataFrame(spec)
+>>> spec
+shape: (4, 3)
+┌──────────────┬────────┬───────┐
+│ .name ┆ .value ┆ part │
+│ --- ┆ --- ┆ --- │
+│ str ┆ str ┆ str │
+╞══════════════╪════════╪═══════╡
+│ Sepal.Length ┆ Length ┆ Sepal │
+│ Petal.Length ┆ Length ┆ Petal │
+│ Sepal.Width ┆ Width ┆ Sepal │
+│ Petal.Width ┆ Width ┆ Petal │
+└──────────────┴────────┴───────┘
+>>> df.pipe(pivot_longer_spec,spec=spec).sort(by=pl.all())
+shape: (4, 4)
+┌───────────┬───────┬────────┬───────┐
+│ Species ┆ part ┆ Length ┆ Width │
+│ --- ┆ --- ┆ --- ┆ --- │
+│ str ┆ str ┆ f64 ┆ f64 │
+╞═══════════╪═══════╪════════╪═══════╡
+│ setosa ┆ Petal ┆ 1.4 ┆ 0.2 │
+│ setosa ┆ Sepal ┆ 5.1 ┆ 3.5 │
+│ virginica ┆ Petal ┆ 5.1 ┆ 1.8 │
+│ virginica ┆ Sepal ┆ 5.9 ┆ 3.0 │
+└───────────┴───────┴────────┴───────┘
+
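For wider frames, the spec need not be written by hand; a minimal sketch deriving an equivalent spec from the column names of `df`:
+>>> names = [col for col in df.columns if col != "Species"]
+>>> spec = pl.DataFrame(
+...     {
+...         ".name": names,
+...         ".value": [name.split(".")[1] for name in names],
+...         "part": [name.split(".")[0] for name in names],
+...     }
+... )
+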
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `df` | `DataFrame \| LazyFrame` | The source DataFrame to unpivot. It can also be a LazyFrame. | required |
| `spec` | `DataFrame` | A specification DataFrame. At a minimum, the spec DataFrame must have a `.name` column and a `.value` column. | required |
Raises:

| Type | Description |
|---|---|
| `KeyError` | If `.name` or `.value` is missing from the spec's columns. |
| `ValueError` | If the labels in spec's `.name` column are not unique. |
Returns:

| Type | Description |
|---|---|
| `DataFrame \| LazyFrame` | A polars DataFrame/LazyFrame. |
janitor/polars/pivot_longer.py
row_to_names
+
+
+row_to_names implementation for polars.
+ + + + + + + + +row_to_names(df, row_numbers=0, remove_rows=False, remove_rows_above=False, separator='_')
+
+Elevates a row, or rows, to be the column names of a DataFrame.
+ + +Examples:
+Replace column names with the first row.
+>>> import polars as pl
+>>> import janitor.polars
+>>> df = pl.DataFrame({
+... "a": ["nums", '6', '9'],
+... "b": ["chars", "x", "y"],
+... })
+>>> df
+shape: (3, 2)
+┌──────┬───────┐
+│ a ┆ b │
+│ --- ┆ --- │
+│ str ┆ str │
+╞══════╪═══════╡
+│ nums ┆ chars │
+│ 6 ┆ x │
+│ 9 ┆ y │
+└──────┴───────┘
+>>> df.row_to_names(0, remove_rows=True)
+shape: (2, 2)
+┌──────┬───────┐
+│ nums ┆ chars │
+│ --- ┆ --- │
+│ str ┆ str │
+╞══════╪═══════╡
+│ 6 ┆ x │
+│ 9 ┆ y │
+└──────┴───────┘
+>>> df.row_to_names(row_numbers=[0,1], remove_rows=True)
+shape: (1, 2)
+┌────────┬─────────┐
+│ nums_6 ┆ chars_x │
+│ --- ┆ --- │
+│ str ┆ str │
+╞════════╪═════════╡
+│ 9 ┆ y │
+└────────┴─────────┘
+
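`row_numbers` also accepts a slice, per the parameter description below; a minimal sketch, assuming a slice selects the same rows as the equivalent list of integers (output omitted):
+>>> df.row_to_names(row_numbers=slice(0, 2), remove_rows=True)  # doctest: +SKIP
+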
Remove rows above the elevated row and the elevated row itself.
+>>> df = pl.DataFrame({
+... "a": ["bla1", "nums", '6', '9'],
+... "b": ["bla2", "chars", "x", "y"],
+... })
+>>> df
+shape: (4, 2)
+┌──────┬───────┐
+│ a ┆ b │
+│ --- ┆ --- │
+│ str ┆ str │
+╞══════╪═══════╡
+│ bla1 ┆ bla2 │
+│ nums ┆ chars │
+│ 6 ┆ x │
+│ 9 ┆ y │
+└──────┴───────┘
+>>> df.row_to_names(1, remove_rows=True, remove_rows_above=True)
+shape: (2, 2)
+┌──────┬───────┐
+│ nums ┆ chars │
+│ --- ┆ --- │
+│ str ┆ str │
+╞══════╪═══════╡
+│ 6 ┆ x │
+│ 9 ┆ y │
+└──────┴───────┘
+
New in version 0.28.0
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `row_numbers` | `int \| list \| slice` | Position of the row(s) containing the variable names. It can be an integer, a list, or a slice. | `0` |
| `remove_rows` | `bool` | Whether the row(s) should be removed from the DataFrame. | `False` |
| `remove_rows_above` | `bool` | Whether the row(s) above the selected row should be removed from the DataFrame. | `False` |
| `separator` | `str` | Combines the labels into a single string, if `row_numbers` is a list of integers. | `'_'` |
Returns:

| Type | Description |
|---|---|
| `DataFrame` | A polars DataFrame. |
janitor/polars/row_to_names.py
Time series-specific data cleaning functions.
+ + + + + + + + +fill_missing_timestamps(df, frequency, first_time_stamp=None, last_time_stamp=None)
+
+Fills a DataFrame with missing timestamps based on a defined frequency.
+If timestamps are missing, this function will re-index the DataFrame. +If timestamps are not missing, then the function will return the DataFrame +unmodified.
+ + +Examples:
+Functional usage
+>>> import pandas as pd
+>>> import janitor.timeseries
+>>> df = janitor.timeseries.fill_missing_timestamps(
+... df=pd.DataFrame(...),
+... frequency="1H",
+... )
+
Method chaining example:
+>>> import pandas as pd
+>>> import janitor.timeseries
+>>> df = (
+... pd.DataFrame(...)
+... .fill_missing_timestamps(frequency="1H")
+... )
+
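A concrete sketch with a hypothetical two-row frame, assuming the missing hourly stamp is reinstated with null values on re-indexing (output omitted):
+>>> import pandas as pd
+>>> import janitor.timeseries
+>>> idx = pd.to_datetime(["2023-01-01 00:00", "2023-01-01 02:00"])  # 01:00 missing
+>>> df = pd.DataFrame({"value": [1.0, 2.0]}, index=idx)
+>>> df.fill_missing_timestamps(frequency="1H")  # doctest: +SKIP
+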
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `df` | `DataFrame` | DataFrame which needs to be tested for missing timestamps. | required |
| `frequency` | `str` | Sampling frequency of the data. Acceptable frequency strings are listed under the offset aliases section of the pandas time series user guide. | required |
| `first_time_stamp` | `Timestamp` | Timestamp expected to start from; defaults to `None`. If not provided, the minimum timestamp in the DataFrame is used. | `None` |
| `last_time_stamp` | `Timestamp` | Timestamp expected to end with; defaults to `None`. If not provided, the maximum timestamp in the DataFrame is used. | `None` |
Returns:

| Type | Description |
|---|---|
| `DataFrame` | DataFrame that has a complete set of contiguous datetimes. |
janitor/timeseries.py
flag_jumps(df, scale='percentage', direction='any', threshold=0.0, strict=False)
+
+Create boolean column(s) that flag whether or not the change +between consecutive rows exceeds a provided threshold.
+Examples:
+Applies specified criteria across all columns of the DataFrame
+and appends a flag column for each column in the DataFrame
+
+>>> df = (
+... pd.DataFrame(...)
+... .flag_jumps(
+... scale="absolute",
+... direction="any",
+... threshold=2
+... )
+... ) # doctest: +SKIP
+
+Applies specific criteria to certain DataFrame columns,
+applies default criteria to columns *not* specifically listed and
+appends a flag column for each column in the DataFrame
+
+>>> df = (
+... pd.DataFrame(...)
+... .flag_jumps(
+... scale=dict(col1="absolute", col2="percentage"),
+... direction=dict(col1="increasing", col2="any"),
+... threshold=dict(col1=1, col2=0.5),
+... )
+... ) # doctest: +SKIP
+
+Applies specific criteria to certain DataFrame columns,
+applies default criteria to columns *not* specifically listed and
+appends a flag column for only those columns found in specified
+criteria
+
+>>> df = (
+... pd.DataFrame(...)
+... .flag_jumps(
+... scale=dict(col1="absolute"),
+... threshold=dict(col2=1),
+... strict=True,
+... )
+... ) # doctest: +SKIP
+
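A concrete sketch on a hypothetical small frame; with an absolute scale and a threshold of 2, only the change from 2 to 5 exceeds the threshold (flag column names follow the library's convention; output omitted):
+>>> import pandas as pd
+>>> import janitor.timeseries
+>>> df = pd.DataFrame(
+...     {"col1": [1, 2, 5]},
+...     index=pd.date_range("2023-01-01", periods=3, freq="1H"),
+... )
+>>> df.flag_jumps(scale="absolute", direction="any", threshold=2)  # doctest: +SKIP
+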
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `df` | `DataFrame` | DataFrame which needs to be flagged for changes between consecutive rows above a certain threshold. | required |
| `scale` | `Union[str, Dict[str, str]]` | Type of scaling approach to use. Acceptable arguments are `'absolute'` (consider the difference between rows) and `'percentage'` (consider the percentage change between rows). | `'percentage'` |
| `direction` | `Union[str, Dict[str, str]]` | Type of method used to handle the sign change when comparing consecutive rows. Acceptable arguments are `'increasing'` (only consider rows that are increasing in value), `'decreasing'` (only consider rows that are decreasing in value), and `'any'` (consider rows that are either increasing or decreasing). | `'any'` |
| `threshold` | `Union[int, float, Dict[str, Union[int, float]]]` | The value to check if consecutive row comparisons exceed. Always uses a greater-than comparison. Must be `>= 0.0`. | `0.0` |
| `strict` | `bool` | Flag to enable/disable appending of a flag column for each column in the provided DataFrame. If set to `True`, a flag column is appended only for the columns found in the specified criteria. | `False` |
Raises:

| Type | Description |
|---|---|
| `JanitorError` | If `scale` is not one of `'absolute'` or `'percentage'`. |
| `JanitorError` | If `direction` is not one of `'increasing'`, `'decreasing'`, or `'any'`. |
| `JanitorError` | If `threshold` is set to a negative value. |
| `JanitorError` | If `strict=True` and none of `scale`, `direction`, or `threshold` is provided as a column-specific dictionary. |
Returns:

| Type | Description |
|---|---|
| `DataFrame` | DataFrame that has the flag jump columns appended. |
janitor/timeseries.py
sort_timestamps_monotonically(df, direction='increasing', strict=False)
+
+Sort DataFrame such that index is monotonic.
+If timestamps are monotonic, this function will return +the DataFrame unmodified. If timestamps are not monotonic, +then the function will sort the DataFrame.
+ + +Examples:
+Functional usage
+>>> import pandas as pd
+>>> import janitor.timeseries
+>>> df = janitor.timeseries.sort_timestamps_monotonically(
+... df=pd.DataFrame(...),
+... direction="increasing",
+... )
+
Method chaining example:
+>>> import pandas as pd
+>>> import janitor.timeseries
+>>> df = (
+... pd.DataFrame(...)
+... .sort_timestamps_monotonically(direction="increasing")
+... )
+
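A concrete sketch with a hypothetical out-of-order index; the rows come back sorted by timestamp (output omitted):
+>>> idx = pd.to_datetime(["2023-01-02", "2023-01-01", "2023-01-03"])
+>>> df = pd.DataFrame({"value": [2, 1, 3]}, index=idx)
+>>> df.sort_timestamps_monotonically(direction="increasing")  # doctest: +SKIP
+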
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `df` | `DataFrame` | DataFrame which needs to be tested for monotonicity. | required |
| `direction` | `str` | Type of monotonicity desired. Acceptable arguments are `'increasing'` or `'decreasing'`. | `'increasing'` |
| `strict` | `bool` | Flag to enable/disable strict monotonicity. If set to `True`, duplicate entries in the index are removed, retaining the first occurrence. | `False` |
Returns:

| Type | Description |
|---|---|
| `DataFrame` | DataFrame that has monotonically increasing (or decreasing) timestamps. |
janitor/timeseries.py
Functions to augment XArray DataArrays and Datasets with additional +functionality.
+ + + + + + + + +clone_using(da, np_arr, use_coords=True, use_attrs=False, new_name=None)
+
Given a NumPy array, return an XArray `DataArray` which contains the same dimension names and (optionally) coordinates and other properties as the supplied `DataArray`.

This is similar to `xr.DataArray.copy()` with more specificity for the type of cloning you would like to perform - the different properties that you desire to mirror in the new `DataArray`.

If the coordinates from the source `DataArray` are not desired, the shapes of the source and new NumPy arrays don't need to match; the number of dimensions must still match, however.
Examples:
+Making a new DataArray
from a previous one, keeping the
+dimension names but dropping the coordinates (the input NumPy array
+is of a different size):
>>> import numpy as np
+>>> import xarray as xr
+>>> import janitor.xarray
+>>> da = xr.DataArray(
+... np.zeros((512, 1024)), dims=["ax_1", "ax_2"],
+... coords=dict(ax_1=np.linspace(0, 1, 512),
+... ax_2=np.logspace(-2, 2, 1024)),
+... name="original",
+... )
+>>> new_da = da.clone_using(
+... np.ones((4, 6)), new_name='new_and_improved', use_coords=False,
+... )
+>>> new_da
+<xarray.DataArray 'new_and_improved' (ax_1: 4, ax_2: 6)> Size: 192B
+array([[1., 1., 1., 1., 1., 1.],
+ [1., 1., 1., 1., 1., 1.],
+ [1., 1., 1., 1., 1., 1.],
+ [1., 1., 1., 1., 1., 1.]])
+Dimensions without coordinates: ax_1, ax_2
+
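Keeping the coordinates instead means the new array must match the source's shape; a minimal sketch reusing `da` from above (output omitted):
+>>> cloned = da.clone_using(np.full((512, 1024), 2.0), new_name="filled")  # doctest: +SKIP
+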
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `da` | `DataArray` | The source `DataArray`. | required |
| `np_arr` | `array` | The NumPy array which will be wrapped in a new `DataArray`, with the chosen properties copied over from the source `DataArray`. | required |
| `use_coords` | `bool` | If `True`, use the coordinates of the source `DataArray` for the new `DataArray`. | `True` |
| `use_attrs` | `bool` | If `True`, copy over the `attrs` of the source `DataArray`. | `False` |
| `new_name` | `str` | If set, use as the new name of the returned `DataArray`. | `None` |
Raises:

| Type | Description |
|---|---|
| `ValueError` | If the number of dimensions in `np_arr` does not match that of the source `DataArray`. |
| `ValueError` | If the shape of `np_arr` does not match that of the source `DataArray` when `use_coords=True`. |
Returns:

| Type | Description |
|---|---|
| `DataArray` | A new `DataArray` wrapping the supplied NumPy array. |
janitor/xarray/functions.py
convert_datetime_to_number(da_or_ds, time_units, dim='time')
+
+Convert the coordinates of a datetime axis to a human-readable float +representation.
+ + +Examples:
Convert a `DataArray`'s time dimension coordinates from minutes to seconds:
>>> import numpy as np
+>>> import xarray as xr
+>>> import janitor.xarray
+>>> da = xr.DataArray(
+... np.array([2, 8, 0, 1, 7, 7]),
+... dims="time",
+... coords=dict(time=np.arange(6) * np.timedelta64(1, "m"))
+... )
+>>> da_seconds = da.convert_datetime_to_number("s", dim="time")
+>>> da_seconds
+<xarray.DataArray (time: 6)> Size: 48B
+array([2, 8, 0, 1, 7, 7])
+Coordinates:
+ * time (time) float64 48B 0.0 60.0 120.0 180.0 240.0 300.0
+
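Since `da_or_ds` also accepts a Dataset, the same conversion can be applied dataset-wide; a minimal sketch, assuming the method is likewise registered on `Dataset` (output omitted):
+>>> ds = da.to_dataset(name="signal")
+>>> ds_seconds = ds.convert_datetime_to_number("s", dim="time")  # doctest: +SKIP
+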
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `da_or_ds` | `Union[DataArray, Dataset]` | XArray object. | required |
| `time_units` | `str` | NumPy timedelta string specification for the unit you would like to convert the coordinates to. | required |
| `dim` | `str` | The time dimension whose coordinates are datetime objects. | `'time'` |
Returns:

| Type | Description |
|---|---|
| `Union[DataArray, Dataset]` | The original XArray object with the time dimension reassigned. |
janitor/xarray/functions.py