# pandasklar
Toolbox / Ecosystem for data science. Easier handling of pandas, especially when thinking in SQL.
Focused on working with complex, ambiguous, erroneous, two-dimensional DataFrames containing one- or two-dimensional objects.
Focused on convenience when working with Jupyter notebooks, not on speed (exceptions: `fast_startswith` and `fast_endswith`).
Convenience means:
* more high-level functions
* functions that try to cope with sloppy data and avoid error messages when cells are run again
* basic functions that are easier to remember
Comes in the form of helper functions, i.e. without changes to pandas, just on top of it.
For full documentation, see the `jupyter` directory.
## Try out
The directory `jupyter` contains many notebooks with examples. They are very easy to try out interactively online, with Google Colab. Just click the link at the top of the page and then select Runtime/Run all from the menu in Colab.
## Install
`pip install pandasklar`
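
All sketches below assume the package is imported under a short alias; the alias is just a convention, not a requirement:

```python
import pandasklar as pak
```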
## Create Random Data for Testing
* `random_series`: Returns a Series of random data of several types, including names, random walks with Perlin noise, and error-prone series to test your functions.
* `decorate`: Decorates a Series with specials (e.g. NaNs).
* `people` and `random_numbers`: Random data for testing.
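
A minimal sketch of generating test data; the size argument and its position are assumptions about the signature, not verified ones:

```python
import pandasklar as pak

# a small DataFrame of random person records to experiment with
df = pak.people(100)
df.head()
```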
## Develop
* `check_mask`: Counts rows filtered by a binary mask. Raises an error if the number is unexpected.
* `specials`: Returns rows representing all special values per column.
* `sample`: Returns some sample rows: beginning + end + specials + random rows.
* `search_str`: Searches all str columns of a dataframe. Useful for development and debugging.
* `plot`: Easier plot.
* `memory_consumption`: Returns the memory consumption of Python objects.
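
A sketch of a typical development round trip with these helpers (argument names and order are assumptions):

```python
import pandasklar as pak

df = pak.people(100)        # random test data, see above

pak.sample(df)              # beginning + end + specials + some random rows
pak.search_str(df, 'anna')  # look for the string in all str columns
```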
## Analyse Datatypes
* `analyse_datatypes`: Returns info about the datatypes and the memory usage of the columns of a DataFrame.
* `analyse_values`: Returns statistical data for a DataFrame, a Series or an Index.
* `analyse_cols`: Describes the datatypes and the content of a DataFrame. Merged info from `analyse_datatypes` and `analyse_values`.
* `change_datatype`: Converts the datatypes of a DataFrame or a Series. Automatically, if you want.
* `copy_datatype`: Copies the dtypes from one dataframe to another, matching the column names.
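
A sketch of combining the analysis and conversion functions; whether `change_datatype` optimises automatically without further arguments is an assumption:

```python
import pandasklar as pak

df = pak.people(100)

pak.analyse_cols(df)           # dtypes, memory usage and content statistics per column
df = pak.change_datatype(df)   # let pandasklar choose more compact dtypes
```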
## Analyse Frequencies
* `analyse_freqs`: Frequency analysis that includes a subordinate frequency analysis. Provides e.g. the most important examples per case. Splits strings and lists.
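
A hedged sketch of a call; the argument form (a main column plus a subordinate column) and the column names of the test data are assumptions:

```python
import pandasklar as pak

df = pak.people(100)

# frequency of birthplaces, with the most important first names per birthplace
pak.analyse_freqs(df, ['birthplace', 'first_name'])
```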
## Analyse uniqueness, discrepancies and redundancy
* `analyse_groups`: Analyses a DataFrame for uniqueness and redundancy.
* `same_but_different`: Returns the rows of a DataFrame that are identical in the fields named in `same` but differ in the field named in `different`. This is useful for analysing whether fields correlate 100% with each other or are independent.
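
The keyword names below follow the description (`same`, `different`); the rest of the signature and the column names of the test data are assumptions:

```python
import pandasklar as pak

df = pak.people(1000)

# rows that share the same first name but differ in birthplace
pak.same_but_different(df, same=['first_name'], different='birthplace')
```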
## Compare Series and DataFrames
* `compare_series`: Compares the content of two Series. Returns several indicators of equality.
* `compare_dataframes`: Compares the content of two DataFrames column by column.<br>
Returns several indicators of equality.
* `check_equal`: Compares the content of two DataFrames column by column.
* `compare_col_dtype`: Returns the column names of two DataFrames whose dtype differs.
* `get_different_rows`: Returns the rows of two DataFrames that differ.
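
A sketch of a typical comparison (signatures assumed):

```python
import pandasklar as pak

df1 = pak.people(100)
df2 = df1.copy()

pak.check_equal(df1, df2)          # column-by-column check
pak.compare_dataframes(df1, df2)   # several indicators of equality
```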
## Manage columns
* `drop_cols`: Drops a column or a list of columns. Does not throw an error if the column does not exist.
* `move_cols`: Reorders the columns of a DataFrame. The specified columns are moved to a numerical position or behind a named column.
* `rename_col`: Renames a column of a DataFrame. If you try to rename a column again, no error is thrown (better for the workflow in jupyter notebooks).
* `col_names`: Selects column names based on analyse_cols. Useful to apply a method to specific columns of a DataFrame.
* `write_empty_col`: Writes empty iterables into a column.
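
A sketch of chaining the column helpers (argument orders and the column names of the test data are assumptions):

```python
import pandasklar as pak

df = pak.people(100)

df = pak.drop_cols(df, ['no_such_column'])           # no error if the column is missing
df = pak.rename_col(df, 'first_name', 'firstname')   # safe to re-run in a notebook
df = pak.move_cols(df, ['firstname'], 0)             # move to the front
```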
## Manage rows
* `drop_multiindex`: Converts any MultiIndex to normal columns and resets the index. Works with MultiIndex in Series or DataFrames, in rows and in columns.
* `reset_index`: Creates a new, unnamed index. If `keep_as` is given, the old index is preserved as a column with this name. Otherwise the old index is dropped.
* `rename_index`: Renames the index.
* `drop_rows`: Drops rows identified by a binary mask, verbose if wanted.
* `move_rows`: Moves rows identified by a binary mask from one dataframe to another (e.g. into a trash).<br>
The target dataframe gets an additional message column by default (to identify why the rows were moved). Verbose if wanted.
* `add_rows`: Like concat, with the additional features `only_new` and `verbose`.
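
A sketch of dropping and adding rows; `only_new` and `verbose` are taken from the descriptions above, the column name of the test data and the exact signatures are assumptions:

```python
import pandasklar as pak

df = pak.people(100)

mask = df['age'] < 18                        # example binary mask
df = pak.drop_rows(df, mask, verbose=True)   # drop the masked rows and report how many

more = pak.people(50)
df = pak.add_rows(df, more, only_new=True)   # concat, keeping only rows not already present
```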
## Let DataFrames Interact
* `isin`: isin over several columns. Returns a mask for df1: The rows of df1 that match the ones in df2 in the specified columns.
* `update_col`: Transfers one column of data from one dataframe to another dataframe.<br>
Unlike a simple merge, the index and the dtypes are retained. Handles dups and conditions. Verbose if wanted.
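
A sketch of `isin` over several columns (the column names of the test data are assumptions):

```python
import pandasklar as pak

df1 = pak.people(100)
df2 = pak.people(100)

# mask for df1: rows whose (first_name, birthplace) combination also occurs in df2
mask = pak.isin(df1, df2, ['first_name', 'birthplace'])
df1[mask]
```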
## Create DataFrames Easily
* `dataframe`: Converts multidimensional objects into DataFrames. Dictionaries and tuples are interpreted column-wise, lists and Counters row-wise.
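
A sketch following the rule above, column-wise for dicts and row-wise for lists:

```python
import pandasklar as pak

# dict: keys become column names
df = pak.dataframe({'name': ['Anna', 'Bert'], 'age': [30, 40]})

# list of lists: each inner list becomes a row
df2 = pak.dataframe([['Anna', 30], ['Bert', 40]])
```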
## Save and load data
* `dump_pickle`: Convenient function to save a DataFrame to a pickle file. Optional optimisation of datatypes. Verbose if wanted.
* `load_pickle`: Convenient function to load a DataFrame from a pickle file. Optional optimisation of datatypes. Verbose if wanted.
* `dump_excel`: Writes a dataframe into an xlsx file for Excel or Calc.<br>
The tabcol-feature groups by a specific column and creates a tab for every group.
* `load_excel`: Loads a dataframe from an xlsx file (Excel or Calc).<br>
The tabcol-feature loads all tabs and records each row's tab name in a column of the result.
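
A sketch of a round trip; the file names are placeholders, and whether `tabcol` is a keyword argument and `birthplace` a column of the test data are assumptions:

```python
import pandasklar as pak

df = pak.people(100)

pak.dump_pickle(df, 'people.pkl')
df = pak.load_pickle('people.pkl')

# one Excel tab per birthplace
pak.dump_excel(df, 'people.xlsx', tabcol='birthplace')
df = pak.load_excel('people.xlsx', tabcol='birthplace')
```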
## Work with NaN
* `nnan`: Count NaNs in Series or DataFrames.
* `any_nan`: Are there NaNs? Returns True or False.
* `nan_rows`: Returns the rows of a DataFrame that are NaN in the specified column.
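
A sketch (the column name of the test data is an assumption):

```python
import pandasklar as pak

df = pak.people(100)

pak.nnan(df)                    # number of NaNs
pak.any_nan(df)                 # True or False
pak.nan_rows(df, 'birthplace')  # rows where 'birthplace' is NaN
```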
## Scale Numbers
* `scale`: Scales all values of a numeric series to a defined value range.<br>
Available methods: max_abs, min_max, min_max_robust, rel, mean, median,
compare_median, rank and random.
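
A sketch; whether the method is passed as the second argument is an assumption:

```python
import pandasklar as pak
import pandas as pd

s = pd.Series([1, 5, 10, 100])

pak.scale(s, 'min_max')   # min-max scaling, typically into [0, 1]
```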
## Cleanup Strings
* `remove_str`: Removes a list of unwanted substrings from a Series of strings.
* `remove_words`: Removes a list of unwanted words from a Series of strings.
* `replace_str`: Replaces substrings from a Series of strings according to a dict.
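
A sketch of cleaning a Series of strings (argument forms assumed):

```python
import pandasklar as pak
import pandas as pd

s = pd.Series(['Dr. Anna Meier', 'Prof. Bert Huber'])

s = pak.remove_str(s, ['Dr. ', 'Prof. '])    # strip unwanted substrings
s = pak.replace_str(s, {'Meier': 'Meyer'})   # replace substrings according to a dict
```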
## Slice Strings Variably
* `slice_string`: Slices a column of strings based on indexes in other columns.
## Search Strings Fast
* `fast_startswith`: Searches string columns for matching beginnings.<br>
Like pandas str.startswith(), but much faster for large amounts of data, and it returns the matching fragment.
* `fast_endswith`: Searches string columns for matching endings.
## Work with Lists
* `find_in_list`: Searches a column containing lists of strings. Returns a binary mask for the rows whose list contains the search string.
* `apply_on_elements`: Applies a function to all elements of a Series of lists.
* `list_to_string`: Converts a Series of lists of strings into a Series of strings.
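
A sketch of the list helpers (argument orders assumed):

```python
import pandasklar as pak
import pandas as pd

s = pd.Series([['red', 'green'], ['blue'], []])

pak.apply_on_elements(s, str.upper)   # apply a function to every list element
pak.list_to_string(s)                 # join each list into one string
mask = pak.find_in_list(s, 'red')     # mask of rows whose list contains 'red'
```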
## Rank Rows
* `rank`: Select the max row per group. Or the min.<br>
Or mark the rows instead of selecting them.
## Aggregate Rows
* `group_and_agg`: Groups and aggregates. Provides a user interface similar to that of MS Access.
* `most_freq_elt`: Aggregates a Series to the most frequent scalar element.<br>
Like Series.mode, but always returns a scalar.
* `top_values`: Aggregates a Series to a list of the most frequent elements.<br>
Can also return the counts of the most frequent elements.
* `first_valid_value`: Returns the first not-NaN value of a Series.
* `last_valid_value`: Returns the last not-NaN value of a Series.
* `agg_words`: Aggregates a Series of strings to a long string.<br>
A space is always inserted between the elements.
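
A sketch of the scalar aggregations on a plain Series; the second argument of `top_values` is an assumption:

```python
import pandasklar as pak
import pandas as pd

s = pd.Series(['a', 'b', 'a', 'a', 'c'])

pak.most_freq_elt(s)    # 'a', always a scalar, unlike Series.mode
pak.top_values(s, 2)    # the two most frequent elements
pak.agg_words(s)        # one long, space-separated string
```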