Data Preprocessing

seeq.addons.correlation._preprocessor.attach_summary(df, summary)[source]

This function adds the summary of a pre-processing operation as a property of the DataFrame that contains the data.

Parameters:
  • df (pandas.DataFrame) – A DataFrame that contains a set of signals as columns and date-time as the index.

  • summary (pandas.DataFrame) – A DataFrame of exactly one column with the summary of the pre-processing step and signal names as index.

Returns:

df – The input DataFrame with a preprocessing container attached as metadata.

Return type:

pandas.DataFrame

Notes

This function modifies the input DataFrame by adding or updating the preprocessing.summary property.

seeq.addons.correlation._preprocessor.default_preprocessing_wrapper(df, consecutivenans=0.04, percent_nan=0.0, remove_disparate_sampled=False, sampling_ratio_threshold=4, bypass_processing=False)[source]

Parameters:
  • df (pandas.DataFrame) – A DataFrame that contains a set of signals as columns and date-time as the index.

  • consecutivenans (float, default 0.04) – Maximum percentage of the total number of samples that a run of consecutive invalid values may span for the interpolator to fill it.

  • percent_nan (float, default 0.0) – Maximum percentage of invalid values (from the total number of samples) allowed in order to keep the signal in the DataFrame.

  • remove_disparate_sampled (bool, default False) – Removes signals whose sampling period differs too much (as set by sampling_ratio_threshold) from the median of sampling rates.

  • sampling_ratio_threshold (float, default 4) – Signals with a sampling rate ratio (raw/gridded) above sampling_ratio_threshold or below 1/sampling_ratio_threshold will be removed from the DataFrame.

  • bypass_processing (bool, default False) – If True, the pre-processing routines in this wrapper are bypassed. The _validate_df check still runs to confirm that the DataFrame is valid.

Returns:

dff – A DataFrame of the cleansed signals

Return type:

pandas.DataFrame

Notes

A summary of how this function modified or tagged signals in the DataFrame can be accessed in the property dff.preprocessing.summary.

seeq.addons.correlation._preprocessor.duplicated_column_names(df)[source]

Renames columns that share the same name so that every column name is unique.

Parameters:

df (pandas.DataFrame) – A DataFrame that contains a set of signals as columns and date-time as the index.

Returns:

dff – A DataFrame of the signals with unique column names

Return type:

pandas.DataFrame
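One way to make duplicated column names unique is sketched below. The naming scheme (appending a running counter) is an assumption; the scheme that duplicated_column_names actually uses is not documented here, and dedupe_columns_sketch is a hypothetical helper.

```python
import pandas as pd


def dedupe_columns_sketch(df):
    """Hypothetical helper: suffix repeated column names with _1, _2, ..."""
    seen = {}
    new_names = []
    for name in df.columns:
        count = seen.get(name, 0)
        # The first occurrence keeps its name; later ones get a counter suffix.
        new_names.append(name if count == 0 else f"{name}_{count}")
        seen[name] = count + 1
    out = df.copy()
    out.columns = new_names
    return out


duplicated = pd.DataFrame([[1, 2, 3], [4, 5, 6]], columns=["a", "a", "b"])
unique = dedupe_columns_sketch(duplicated)
```

This simple scheme does not guard against a generated name colliding with a column that already exists under that name.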

seeq.addons.correlation._preprocessor.interpolate_nans(df, consecutivenans=0.05)[source]

Interpolates invalid values (linearly) per signal only if the percentage of consecutive invalid values with respect to the total number of samples is less than the specified threshold.

Parameters:
  • df (pandas.DataFrame) – A DataFrame that contains a set of signals as columns and date-time as the index.

  • consecutivenans (float, default 0.05) – Maximum percentage of the total number of samples that a run of consecutive invalid values may span for the interpolator to fill it.

Returns:

interpolated_df – A DataFrame with the interpolated values (if applicable)

Return type:

pandas.DataFrame

Notes

A summary of how this function modified or tagged signals in the dataframe can be accessed in the property interpolated_df.preprocessing.summary.
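The thresholding rule described above can be sketched in plain pandas as follows. This is an illustration under assumptions, not the library's code: the helper names are hypothetical, and consecutivenans is treated as a fraction of the total number of samples (0.05 meaning 5%).

```python
import pandas as pd


def longest_nan_run(series):
    """Length of the longest run of consecutive NaN values in a Series."""
    is_nan = series.isna()
    # A new run starts wherever the NaN flag changes value.
    run_id = (is_nan != is_nan.shift()).cumsum()
    runs = is_nan.astype(int).groupby(run_id).sum()
    return int(runs.max()) if len(runs) else 0


def interpolate_nans_sketch(df, consecutivenans=0.05):
    """Hypothetical sketch: fill a signal only if its longest gap is short."""
    out = df.copy()
    n_samples = len(out)
    if n_samples == 0:
        return out
    for col in out.columns:
        gap_fraction = longest_nan_run(out[col]) / n_samples
        if 0 < gap_fraction < consecutivenans:
            # Linear interpolation between valid neighbors only.
            out[col] = out[col].interpolate(method="linear", limit_area="inside")
    return out


index = pd.date_range("2024-01-01", periods=100, freq="D")
values = [float(i) for i in range(100)]
data = pd.DataFrame({"a": values, "b": values}, index=index)
data.iloc[10, 0] = float("nan")     # gap of 1 sample (1%)
data.iloc[20:30, 1] = float("nan")  # gap of 10 samples (10%)
filled = interpolate_nans_sketch(data)
```

With the default threshold, signal "a" is filled and signal "b" is left untouched because its gap exceeds 5% of the samples.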

seeq.addons.correlation._preprocessor.remove_flat_lines(df)[source]

Finds signals with zero variance (flat lines) and removes them from the input DataFrame.

Parameters:

df (pandas.DataFrame) – A DataFrame that contains a set of signals as columns and date-time as the index.

Returns:

df_out – A DataFrame without the signals that are flat lines

Return type:

pandas.DataFrame

Notes

A summary of how this function modified or tagged signals in the DataFrame can be accessed in the property df_out.preprocessing.summary.
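A minimal sketch of flat-line removal, assuming that a flat line is a signal with at most one distinct valid value. The check uses the number of distinct values rather than a floating-point variance so that rounding error cannot hide a constant signal. remove_flat_lines_sketch is a hypothetical name, not the library function.

```python
import pandas as pd


def remove_flat_lines_sketch(df):
    """Hypothetical sketch: drop columns that hold a single distinct value."""
    # nunique ignores NaN by default, so all-NaN columns are also dropped.
    is_flat = df.nunique() <= 1
    return df.loc[:, ~is_flat]


index = pd.date_range("2024-01-01", periods=4, freq="D")
raw = pd.DataFrame(
    {"temp": [1.0, 2.0, 3.0, 4.0], "flow": [0.1, 0.1, 0.1, 0.1]}, index=index
)
df_out_sketch = remove_flat_lines_sketch(raw)
```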

seeq.addons.correlation._preprocessor.remove_non_numeric(df, auto_remove=True)[source]

Removes non-numeric signals from the input DataFrame.

Parameters:
  • df (pandas.DataFrame) – A DataFrame that contains a set of signals as columns and date-time as the index.

  • auto_remove (bool, default True) – Removes the non-numeric signals from the DataFrame if set to True. Otherwise, it just tags the signals.

Returns:

dff – A DataFrame of the signals that do not have non-numeric data

Return type:

pandas.DataFrame

Notes

A summary of how this function modified or tagged signals in the dataframe can be accessed in the property dff.preprocessing.summary.
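The difference between removing and tagging can be sketched with pandas dtype selection. This is an assumption about how such a step might work, not the library's code: remove_non_numeric_sketch is hypothetical, and it returns the list of tagged signals directly instead of storing it in preprocessing.summary.

```python
import pandas as pd


def remove_non_numeric_sketch(df, auto_remove=True):
    """Hypothetical sketch: return (DataFrame, names of non-numeric signals)."""
    numeric_cols = list(df.select_dtypes(include="number").columns)
    non_numeric = [col for col in df.columns if col not in numeric_cols]
    if auto_remove:
        return df[numeric_cols], non_numeric
    # Tag only: the data is returned unchanged.
    return df, non_numeric


index = pd.date_range("2024-01-01", periods=3, freq="D")
mixed = pd.DataFrame(
    {"temp": [1.0, 2.0, 3.0], "state": ["on", "off", "on"]}, index=index
)
numeric_only, tagged = remove_non_numeric_sketch(mixed)
untouched, tagged_only = remove_non_numeric_sketch(mixed, auto_remove=False)
```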

seeq.addons.correlation._preprocessor.remove_signals_with_missing_data(df, percent_nan=0.4)[source]

Removes columns from the dataframe that have a high percentage of missing data.

Parameters:
  • df (pandas.DataFrame) – A DataFrame that contains a set of signals as columns and date-time as the index.

  • percent_nan (float, default 0.4) – Maximum percentage of invalid values (from the total number of samples) allowed in order to keep the signal in the DataFrame.

Returns:

dff – A DataFrame of the signals that have less than percent_nan of missing data.

Return type:

pandas.DataFrame

Notes

A summary of how this function modified or tagged signals in the dataframe can be accessed in the property dff.preprocessing.summary.
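A minimal sketch of the missing-data filter, assuming percent_nan is a fraction (0.4 meaning 40%) and that a signal exactly at the threshold is kept. Both points are assumptions, and remove_missing_sketch is a hypothetical name.

```python
import pandas as pd


def remove_missing_sketch(df, percent_nan=0.4):
    """Hypothetical sketch: keep columns whose NaN fraction is within the limit."""
    nan_fraction = df.isna().mean()  # per-column share of invalid samples
    keep = nan_fraction <= percent_nan
    return df.loc[:, keep]


index = pd.date_range("2024-01-01", periods=10, freq="D")
nan = float("nan")
gappy = pd.DataFrame(
    {
        "a": [1.0, nan, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0],  # 10% missing
        "b": [1.0, nan, nan, nan, nan, nan, 7.0, 8.0, 9.0, 10.0],  # 50% missing
    },
    index=index,
)
kept = remove_missing_sketch(gappy)
```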

seeq.addons.correlation._preprocessor.sampling_info(df, sampling_ratio_threshold, remove)[source]

Attaches sampling period information if available and removes or tags signals that have a high or low ratio of sampling period to gridded period.

Parameters:
  • df (pandas.DataFrame) – A DataFrame that contains a set of signals as columns and date-time as the index.

  • sampling_ratio_threshold (float) – Signals with a sampling rate ratio (raw/gridded) above sampling_ratio_threshold or below 1/sampling_ratio_threshold will be removed from the dataframe.

  • remove (bool) – Removes signals whose sampling period is too different (set by sampling_ratio_threshold) from the median of sampling rates.

Returns:

dff – A DataFrame with sampling period information attached (if available), in which signals with a disparate sampling ratio have been removed or tagged.

Return type:

pandas.DataFrame
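The ratio test behind sampling_ratio_threshold can be sketched as below. How the raw and gridded sampling periods are obtained is not described here, so the sketch takes them as explicit inputs; flag_disparate_sketch and its arguments are hypothetical.

```python
import pandas as pd


def flag_disparate_sketch(raw_periods, gridded_period, sampling_ratio_threshold):
    """Hypothetical sketch: names of signals whose raw/gridded ratio is out of range."""
    flagged = []
    for name, raw_period in raw_periods.items():
        ratio = raw_period / gridded_period  # Timedelta / Timedelta gives a float
        too_slow = ratio > sampling_ratio_threshold
        too_fast = ratio < 1 / sampling_ratio_threshold
        if too_slow or too_fast:
            flagged.append(name)
    return flagged


raw_periods = {
    "a": pd.Timedelta(minutes=1),   # ratio 1.0, kept
    "b": pd.Timedelta(minutes=10),  # ratio 10.0, above the threshold
    "c": pd.Timedelta(seconds=10),  # ratio about 0.17, below 1/threshold
}
flagged = flag_disparate_sketch(raw_periods, pd.Timedelta(minutes=1), 4)
```

With remove set to True, the flagged signals would be dropped; otherwise they would only be tagged.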

seeq.addons.correlation._preprocessor.standardization(df)[source]

Scales the data in the DataFrame to zero mean and unit variance.

Parameters:

df (pandas.DataFrame) – A pickled DataFrame that contains a set of signals as columns and date-time as the index.

Returns:

scaled_df – A DataFrame that contains the scaled signals as columns and date-time as the index.

Return type:

pandas.DataFrame

Notes

A summary of how this function modified or tagged signals in the dataframe can be accessed in the property scaled_df.preprocessing.summary.
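Scaling to zero mean and unit variance can be sketched column by column in pandas. The use of the population standard deviation (ddof=0) is an assumption, and standardization_sketch is a hypothetical name rather than the library function.

```python
import pandas as pd


def standardization_sketch(df):
    """Hypothetical sketch: z-score each column independently."""
    # ddof=0 is an assumption; the library may normalize differently.
    return (df - df.mean()) / df.std(ddof=0)


index = pd.date_range("2024-01-01", periods=4, freq="D")
unscaled = pd.DataFrame(
    {"temp": [1.0, 2.0, 3.0, 4.0], "flow": [10.0, 20.0, 30.0, 40.0]}, index=index
)
scaled = standardization_sketch(unscaled)
```

A column with zero variance would produce undefined values in this sketch, which is one reason to remove flat lines before scaling.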