Data Preprocessing
- seeq.addons.correlation._preprocessor.attach_summary(df, summary)[source]
Attaches the summary of a pre-processing operation as a property of the DataFrame that contains the data.
- Parameters:
df (pandas.DataFrame) – A DataFrame that contains a set of signals as columns and date-time as the index.
summary (pandas.DataFrame) – A DataFrame of exactly one column with the summary of the pre-processing step and signal names as index.
- Returns:
df – The input DataFrame with a preprocessing container attached as metadata.
- Return type:
pandas.DataFrame
Notes
This function modifies the input DataFrame by adding or updating the preprocessing.summary property.
- seeq.addons.correlation._preprocessor.default_preprocessing_wrapper(df, consecutivenans=0.04, percent_nan=0.0, remove_disparate_sampled=False, sampling_ratio_threshold=4, bypass_processing=False)[source]
- Parameters:
df (pandas.DataFrame) – A DataFrame that contains a set of signals as columns and date-time as the index.
consecutivenans (float, default 0.04) – Maximum length of a run of consecutive invalid values, as a percentage of the total number of samples, that the interpolator will fill.
percent_nan (float, default 0.0) – Maximum percentage of invalid values (from the total number of samples) allowed in order to keep the signal in the DataFrame.
remove_disparate_sampled (bool, default False) – Removes signals whose sampling period is too different (set by sampling_ratio_threshold) from the median of sampling rates.
sampling_ratio_threshold (float, default 4) – Signals with a sampling rate ratio (raw/gridded) above sampling_ratio_threshold or below 1/sampling_ratio_threshold will be removed from the DataFrame.
bypass_processing (bool, default False) – If True, the pre-processing routines in this wrapper are bypassed. However, _validate_df still runs to check that the DataFrame is valid.
- Returns:
dff – A DataFrame of the cleansed signals
- Return type:
pandas.DataFrame
Notes
A summary of how this function modified or tagged signals in the DataFrame can be accessed in the property dff.preprocessing.summary.
- seeq.addons.correlation._preprocessor.duplicated_column_names(df)[source]
Renames columns that share the same column name so that every column name is unique.
- Parameters:
df (pandas.DataFrame) – A DataFrame that contains a set of signals as columns and date-time as the index.
- Returns:
dff – A DataFrame of the signals with unique column names
- Return type:
pandas.DataFrame
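The renaming step can be illustrated with plain pandas. This is a minimal sketch, not the add-on's implementation: the helper name `dedupe_column_names` and the `_1`, `_2` suffix scheme are assumptions made for the example, since the naming scheme used by duplicated_column_names is not documented here.

```python
import pandas as pd


def dedupe_column_names(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df in which repeated column names get a numeric suffix.

    Hypothetical helper; the suffix scheme is an assumption for illustration.
    """
    seen = {}
    new_names = []
    for name in df.columns:
        count = seen.get(name, 0)
        # The first occurrence keeps its name; later ones become name_1, name_2, ...
        new_names.append(name if count == 0 else f"{name}_{count}")
        seen[name] = count + 1
    out = df.copy()
    out.columns = new_names
    return out


idx = pd.date_range("2024-01-01", periods=3, freq="D")
df = pd.DataFrame(
    [[1, 2, 3], [4, 5, 6], [7, 8, 9]],
    index=idx,
    columns=["temp", "temp", "flow"],
)
unique_df = dedupe_column_names(df)
print(list(unique_df.columns))
```

This simple scheme does not guard against a generated name colliding with a column that already exists (for example an original column already called temp_1).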
- seeq.addons.correlation._preprocessor.interpolate_nans(df, consecutivenans=0.05)[source]
Interpolates invalid values (linearly) per signal only if the percentage of consecutive invalid values with respect to the total number of samples is less than the specified threshold.
- Parameters:
df (pandas.DataFrame) – A DataFrame that contains a set of signals as columns and date-time as the index.
consecutivenans (float, default 0.05) – Maximum length of a run of consecutive invalid values, as a percentage of the total number of samples, that the interpolator will fill.
- Returns:
interpolated_df – A DataFrame with the interpolated values (if applicable)
- Return type:
pandas.DataFrame
Notes
A summary of how this function modified or tagged signals in the dataframe can be accessed in the property interpolated_df.preprocessing.summary.
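The thresholding rule described above can be sketched in plain pandas. This is one possible reading, not the add-on's code: it assumes the threshold is compared against the longest run of consecutive NaN values in each signal, divided by the total number of samples. The helper names `longest_nan_run` and `interpolate_short_gaps` are hypothetical.

```python
import numpy as np
import pandas as pd


def longest_nan_run(series: pd.Series) -> int:
    """Length of the longest run of consecutive NaN values in series."""
    is_nan = series.isna()
    # A new group id starts every time the NaN/non-NaN state changes.
    groups = (is_nan != is_nan.shift()).cumsum()
    runs = is_nan.groupby(groups).sum()
    return int(runs.max()) if len(runs) else 0


def interpolate_short_gaps(df: pd.DataFrame, consecutivenans: float = 0.05) -> pd.DataFrame:
    """Linearly interpolate only the signals whose longest gap is short enough."""
    out = df.copy()
    n = len(out)
    for col in out.columns:
        # Assumption: strict "less than" comparison, as in the description above.
        if n and longest_nan_run(out[col]) / n < consecutivenans:
            out[col] = out[col].interpolate(method="linear")
    return out


idx = pd.date_range("2024-01-01", periods=40, freq="D")
a = np.arange(40, dtype=float)
a[10] = np.nan          # 1 of 40 samples missing: 2.5 %, below the threshold
b = np.arange(40, dtype=float)
b[5:9] = np.nan         # 4 consecutive samples missing: 10 %, above the threshold
df = pd.DataFrame({"a": a, "b": b}, index=idx)

filled = interpolate_short_gaps(df, consecutivenans=0.05)
print(filled.isna().sum())
```

In this sketch signal a is filled and signal b is left untouched, so a later step (such as removing signals with missing data) can decide what to do with it.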
- seeq.addons.correlation._preprocessor.remove_flat_lines(df)[source]
Finds signals with zero variance (flat lines) and removes them from the input DataFrame.
- Parameters:
df (pandas.DataFrame) – A DataFrame that contains a set of signals as columns and date-time as the index.
- Returns:
df_out – A DataFrame without the signals that are flat lines
- Return type:
pandas.DataFrame
Notes
A summary of how this function modified or tagged signals in the DataFrame can be accessed in the property df_out.preprocessing.summary.
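Flat-line detection can be sketched with plain pandas. This is not the add-on's implementation: the helper name `drop_flat_lines` is hypothetical, and the sketch counts distinct valid values instead of comparing a floating-point variance to zero, which is an assumption chosen to avoid rounding issues.

```python
import pandas as pd


def drop_flat_lines(df: pd.DataFrame) -> pd.DataFrame:
    """Drop columns that hold at most one distinct valid value.

    A column with a single distinct value has zero variance; counting
    distinct values sidesteps floating-point noise in the variance itself.
    """
    distinct = df.nunique(dropna=True)
    flat = distinct[distinct <= 1].index
    return df.drop(columns=flat)


idx = pd.date_range("2024-01-01", periods=4, freq="D")
df = pd.DataFrame(
    {"const": [5.0, 5.0, 5.0, 5.0], "ramp": [1.0, 2.0, 3.0, 4.0]},
    index=idx,
)
df_out = drop_flat_lines(df)
print(list(df_out.columns))
```

Note that a column made up entirely of NaN values is also dropped by this sketch, because it has no distinct valid values.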
- seeq.addons.correlation._preprocessor.remove_non_numeric(df, auto_remove=True)[source]
Removes non-numeric signals from the input DataFrame.
- Parameters:
df (pandas.DataFrame) – A DataFrame that contains a set of signals as columns and date-time as the index.
auto_remove (bool, default True) – Removes the non-numeric signals from the DataFrame if set to True. Otherwise, it just tags the signals.
- Returns:
dff – A DataFrame of the signals that do not have non-numeric data
- Return type:
pandas.DataFrame
Notes
A summary of how this function modified or tagged signals in the dataframe can be accessed in the property dff.preprocessing.summary.
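The difference between removing and tagging non-numeric signals can be sketched as follows. This is a minimal illustration under assumptions: the helper `split_non_numeric` is hypothetical and returns the tagged names as a list, whereas the add-on records them in the preprocessing summary.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype


def split_non_numeric(df: pd.DataFrame, auto_remove: bool = True):
    """Return (DataFrame, names of non-numeric columns).

    With auto_remove=True the non-numeric columns are dropped; otherwise
    they are only reported and the data is returned unchanged.
    """
    non_numeric = [col for col in df.columns if not is_numeric_dtype(df[col])]
    out = df.drop(columns=non_numeric) if auto_remove else df.copy()
    return out, non_numeric


idx = pd.date_range("2024-01-01", periods=3, freq="D")
df = pd.DataFrame(
    {"temp": [20.1, 20.4, 20.9], "state": ["on", "off", "on"]},
    index=idx,
)

kept, tagged = split_non_numeric(df, auto_remove=True)
print(list(kept.columns), tagged)

untouched, tagged_only = split_non_numeric(df, auto_remove=False)
print(list(untouched.columns), tagged_only)
```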
- seeq.addons.correlation._preprocessor.remove_signals_with_missing_data(df, percent_nan=0.4)[source]
Removes columns from the dataframe that have a high percentage of missing data.
- Parameters:
df (pandas.DataFrame) – A DataFrame that contains a set of signals as columns and date-time as the index.
percent_nan (float, default 0.4) – Maximum percentage of invalid values (from the total number of samples) allowed in order to keep the signal in the DataFrame.
- Returns:
dff – A DataFrame of the signals that have less than percent_nan of missing data.
- Return type:
pandas.DataFrame
Notes
A summary of how this function modified or tagged signals in the dataframe can be accessed in the property dff.preprocessing.summary.
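The filtering rule can be sketched with plain pandas. This is an illustration, not the add-on's code: the helper name `drop_sparse_signals` is hypothetical, and the sketch assumes that percent_nan is a fraction between 0 and 1 and that a signal whose share of invalid values equals the limit is kept.

```python
import numpy as np
import pandas as pd


def drop_sparse_signals(df: pd.DataFrame, percent_nan: float = 0.4) -> pd.DataFrame:
    """Keep only the columns whose share of NaN values does not exceed percent_nan."""
    # isna().mean() gives, per column, the fraction of samples that are NaN.
    nan_fraction = df.isna().mean()
    keep = nan_fraction[nan_fraction <= percent_nan].index
    return df[keep]


idx = pd.date_range("2024-01-01", periods=10, freq="D")
full = np.arange(10, dtype=float)
few_gaps = full.copy()
few_gaps[:2] = np.nan       # 20 % missing
many_gaps = full.copy()
many_gaps[:6] = np.nan      # 60 % missing
df = pd.DataFrame({"full": full, "few_gaps": few_gaps, "many_gaps": many_gaps}, index=idx)

dff = drop_sparse_signals(df, percent_nan=0.4)
print(list(dff.columns))
```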
- seeq.addons.correlation._preprocessor.sampling_info(df, sampling_ratio_threshold, remove)[source]
Attaches sampling period information if available and removes or tags signals that have a high or low ratio of sampling period to gridded period.
- Parameters:
df (pandas.DataFrame) – A DataFrame that contains a set of signals as columns and date-time as the index.
sampling_ratio_threshold (float) – Signals with a sampling rate ratio (raw/gridded) above sampling_ratio_threshold or below 1/sampling_ratio_threshold will be removed from the dataframe.
remove (bool) – Removes signals whose sampling period is too different (set by sampling_ratio_threshold) from the median of sampling rates.
- Returns:
dff – A DataFrame of the signals, without the disparately sampled signals if remove is True.
- Return type:
pandas.DataFrame
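The ratio test described for sampling_ratio_threshold can be sketched as a boolean mask. This is only an illustration of the stated rule: the helper `disparate_mask` and the example ratios are hypothetical, and how the add-on obtains the raw and gridded sampling periods is not covered here.

```python
import pandas as pd


def disparate_mask(ratios: pd.Series, sampling_ratio_threshold: float) -> pd.Series:
    """Flag signals whose raw/gridded sampling ratio falls outside the allowed band.

    A signal is flagged when its ratio is above the threshold or below
    1 / threshold.
    """
    too_high = ratios > sampling_ratio_threshold
    too_low = ratios < 1 / sampling_ratio_threshold
    return too_high | too_low


# Hypothetical raw/gridded ratios, one per signal.
ratios = pd.Series({"a": 1.0, "b": 5.0, "c": 0.2, "d": 0.5})
flagged = disparate_mask(ratios, sampling_ratio_threshold=4)
print(flagged[flagged].index.tolist())
```

With a threshold of 4 the allowed band is 0.25 to 4, so only the signals outside that band are flagged. Whether flagged signals are dropped or merely tagged corresponds to the remove argument.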
- seeq.addons.correlation._preprocessor.standardization(df)[source]
Scales the data in the DataFrame to zero mean and unit variance.
- Parameters:
df (pandas.DataFrame) – A pickled DataFrame that contains a set of signals as columns and date-time as the index.
- Returns:
scaled_df – A DataFrame that contains the scaled signals as columns and date-time as the index.
- Return type:
pandas.DataFrame
Notes
A summary of how this function modified or tagged signals in the dataframe can be accessed in the property scaled_df.preprocessing.summary.
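Scaling to zero mean and unit variance can be sketched in plain pandas. This is not the add-on's implementation: the helper name `standardize` is hypothetical, and the use of the population standard deviation (ddof=0) is an assumption.

```python
import numpy as np
import pandas as pd


def standardize(df: pd.DataFrame) -> pd.DataFrame:
    """Scale each column to zero mean and unit variance.

    Assumption: the population standard deviation (ddof=0) is used.
    """
    return (df - df.mean()) / df.std(ddof=0)


idx = pd.date_range("2024-01-01", periods=5, freq="D")
df = pd.DataFrame(
    {"temp": [10.0, 12.0, 14.0, 16.0, 18.0], "flow": [1.0, 3.0, 2.0, 5.0, 4.0]},
    index=idx,
)
scaled_df = standardize(df)
print(scaled_df.mean().round(6).tolist())
print(scaled_df.std(ddof=0).round(6).tolist())
```

A flat-line signal has a standard deviation of zero, so this sketch would turn it into NaN values; removing flat lines before scaling avoids that.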