echofilter.raw package#

Echoview output file loading and generation, post-processing and shard generation.

Submodules#

echofilter.raw.loader module#

Input/Output handling for raw Echoview files.

echofilter.raw.loader.count_lines(filename)[source]#

Count the number of lines in a file.

Parameters

filename (str) – Path to file.

Returns

Number of lines in file.

Return type

int
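A minimal sketch of such a line counter (a hypothetical re-implementation for illustration, not necessarily the library's actual code):

```python
def count_lines(filename):
    # Stream the file line by line so large files are never fully loaded
    with open(filename) as f:
        return sum(1 for _ in f)
```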

echofilter.raw.loader.evdtstr2timestamp(datestr, timestr=None)[source]#

Convert an Echoview-compatible datetime string into a Unix epoch timestamp.

Parameters
  • datestr (str) – Datetime string in the Echoview-compatible format "CCYYMMDD HHmmSSssss", or (if timestr is also provided) just the date part, "CCYYMMDD".

  • timestr (str, optional) – Time string in the Echoview-compatible format "HHmmSSssss".

Returns

timestamp – Number of seconds since Unix epoch.

Return type

float
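The conversion can be sketched with only the standard library (the function name is illustrative, timestamps are assumed to be UTC, and the real loader may handle edge cases differently):

```python
from datetime import datetime, timezone

def ev_datetime_to_timestamp(datestr, timestr=None):
    # "ssss" is the sub-second part in units of 0.1 ms; since it is four
    # digits, %f parses it as the same fraction of a second.
    if timestr is not None:
        datestr = f"{datestr} {timestr}"
    dt = datetime.strptime(datestr, "%Y%m%d %H%M%S%f")
    return dt.replace(tzinfo=timezone.utc).timestamp()
```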

echofilter.raw.loader.evl_loader(fname, special_to_nan=True, return_status=False)[source]#

EVL file loader.

Parameters
  • fname (str) – Path to .evl file.

  • special_to_nan (bool, optional) – Whether to replace the special depth value used to indicate a missing measurement with NaN. Default is True.

  • return_status (bool, optional) – Whether to also return the status codes. Default is False.

Returns

  • numpy.ndarray of floats – Timestamps, in seconds.

  • numpy.ndarray of floats – Depth, in metres.

  • numpy.ndarray of ints, optional – Status codes. Only returned if return_status=True.

echofilter.raw.loader.evl_reader(fname)[source]#

EVL file reader.

Parameters

fname (str) – Path to .evl file.

Returns

A generator which yields the timestamp (in seconds), depth (in metres), and status (int) for each entry. Note that the timestamp is not corrected for timezone (so make sure your timezones are internally consistent).

Return type

generator

echofilter.raw.loader.evl_writer(fname, timestamps, depths, status=1, line_ending='\r\n', pad=False)[source]#

EVL file writer.

Parameters
  • fname (str) – Destination of output file.

  • timestamps (array_like) – Timestamps for each node of the line, in seconds since the Unix epoch.

  • depths (array_like) – Depths (in metres) for each node of the line.

  • status (int, optional) – Echoview line status code: 0 (none), 1 (unverified), 2 (bad), or 3 (good). Default is 1.

  • line_ending (str, optional) – Line ending. Default is "\r\n", the standard line ending on Windows/DOS, as per the specification for the file format. Set to "\n" to get Unix-style line endings instead.

  • pad (bool, optional) – Default is False.

Notes

For more details on the format specification, see https://support.echoview.com/WebHelp/Using_Echoview/Exporting/Exporting_data/Exporting_line_data.htm#Line_definition_file_format

echofilter.raw.loader.evr_reader(fname, parse_echofilter_regions=True)[source]#

Echoview region file (EVR) reader.

Parameters
  • fname (str) – Path to .evr file.

  • parse_echofilter_regions (bool, default=True) – Whether to separate out echofilter-generated regions (passive periods, removed vertical bands, and removed patches) from other regions.

Returns

  • regions_passive (list of tuples, optional) – Start and end timestamps for passive regions.

  • regions_removed (list of tuples, optional) – Start and end timestamps for removed vertical bands.

  • regions_patch (list of lists, optional) – Start and end timestamps for bad data patches.

  • regions_other (dict of lists) – Dictionary mapping each creation type to a list of the points defining each region.

echofilter.raw.loader.evr_writer(fname, rectangles=None, contours=None, common_notes='', default_region_type=0, line_ending='\r\n')[source]#

EVR file writer.

Writes regions to an Echoview region file.

Parameters
  • fname (str) – Destination of output file.

  • rectangles (list of dictionaries, optional) – Rectangle region definitions. Default is an empty list. Each rectangle region must implement fields "depths" and "timestamps", which indicate the extent of the rectangle. Optionally, "creation_type", "region_name", "region_type", and "notes" may be set. If these are not given, the default creation_type is 4 and region_type is set by default_region_type.

  • contours (list of dictionaries, optional) – Contour region definitions. Default is an empty list. Each contour region must implement a "points" field containing a numpy.ndarray shaped (n, 2) defining the co-ordinates of nodes along the (open) contour in units of timestamp and depth. Optionally, "creation_type", "region_name", "region_type", and "notes" may be set. If these are not given, the default creation_type is 2 and region_type is set by default_region_type.

  • common_notes (str, optional) – Notes to include for every region. Default is "", an empty string.

  • default_region_type (int, optional) –

    The region type to use for rectangles and contours which do not define a "region_type" field. Possible region types are

    • 0 : bad (no data)

    • 1 : analysis

    • 2 : marker

    • 3 : fishtracks

    • 4 : bad (empty water)

    Default is 0.

  • line_ending (str, optional) – Line ending. Default is "\r\n", the standard line ending on Windows/DOS, as per the specification for the file format (https://support.echoview.com/WebHelp/Using_Echoview/Exporting/Exporting_data/Exporting_line_data.htm). Set to "\n" to get Unix-style line endings instead.

Notes

For more details on the format specification, see: https://support.echoview.com/WebHelp/Reference/File_formats/Export_file_formats/2D_Region_definition_file_format.htm
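The region-type codes listed above can be captured in a small lookup table, e.g. for labelling regions when reading files back. (REGION_TYPE_NAMES and region_type_name are illustrative names, not part of the library.)

```python
# Region type codes for Echoview 2D region files, as listed above
REGION_TYPE_NAMES = {
    0: "bad (no data)",
    1: "analysis",
    2: "marker",
    3: "fishtracks",
    4: "bad (empty water)",
}

def region_type_name(code):
    # Fall back to "unknown" for codes outside the documented set
    return REGION_TYPE_NAMES.get(code, "unknown")
```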

echofilter.raw.loader.get_partition_data(partition, dataset='mobile', partitioning_version='firstpass', root_data_dir='/data/dsforce/surveyExports')[source]#

Load partition metadata.

Parameters
  • partition (str) – Name of the partition to load (e.g. "train", "validate", "test").

  • dataset (str, optional) – Name of dataset. Default is "mobile".

  • partitioning_version (str, optional) – Name of partitioning method.

  • root_data_dir (str) – Path to root directory where data is located.

Returns

Metadata for all transects in the partition. Each row is a single sample.

Return type

pandas.DataFrame

echofilter.raw.loader.get_partition_list(partition, dataset='mobile', full_path=False, partitioning_version='firstpass', root_data_dir='/data/dsforce/surveyExports', sharded=False)[source]#

Get a list of transects in a single partition.

Parameters
  • partition (str) – Name of the partition to load (e.g. "train", "validate", "test").

  • dataset (str, optional) – Name of dataset. Default is "mobile".

  • full_path (bool, optional) – Whether to return the full path to the sample. If False, only the relative path (from the dataset directory) is returned. Default is False.

  • partitioning_version (str, optional) – Name of partitioning method.

  • root_data_dir (str, optional) – Path to root directory where data is located.

  • sharded (bool, optional) – Whether to return path to sharded version of data. Default is False.

Returns

Path for each sample in the partition.

Return type

list

echofilter.raw.loader.list_from_file(fname)[source]#

Get a list from a file.

Parameters

fname (str) – Path to file.

Returns

Contents of the file, one line per entry in the list. Trailing whitespace is removed from each end of each line.

Return type

list
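The described behaviour amounts to the following (a hypothetical re-implementation for illustration):

```python
def list_from_file(fname):
    # One list entry per line, whitespace stripped from both ends
    with open(fname) as f:
        return [line.strip() for line in f]
```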

echofilter.raw.loader.load_transect_data(transect_pth, dataset='mobile', root_data_dir='/data/dsforce/surveyExports')[source]#

Load all data for one transect.

Parameters
  • transect_pth (str) – Relative path to transect, excluding "_Sv_raw.csv".

  • dataset (str, optional) – Name of dataset. Default is "mobile".

  • root_data_dir (str) – Path to root directory where data is located.

Returns

  • timestamps (numpy.ndarray) – Timestamps (in seconds since Unix epoch), with each entry corresponding to each row in the signals data.

  • depths (numpy.ndarray) – Depths from the surface (in metres), with each entry corresponding to each column in the signals data.

  • signals (numpy.ndarray) – Echogram Sv data, shaped (num_timestamps, num_depths).

  • turbulence (numpy.ndarray) – Depth of turbulence line, shaped (num_timestamps, ).

  • bottom (numpy.ndarray) – Depth of bottom line, shaped (num_timestamps, ).

echofilter.raw.loader.regions2mask(timestamps, depths, regions_passive=None, regions_removed=None, regions_patch=None, regions_other=None)[source]#

Convert regions to mask.

Takes the output from evr_reader() and returns a set of masks.

Parameters
  • timestamps (array_like) – Timestamps for each node in the line.

  • depths (array_like) – Depths (in metres) for each node in the line.

  • regions_passive (list of tuples, optional) – Start and end timestamps for passive regions.

  • regions_removed (list of tuples, optional) – Start and end timestamps for removed vertical bands.

  • regions_patch (list of lists, optional) – Start and end timestamps for bad data patches.

  • regions_other (dict of lists) – Dictionary mapping each creation type to a list of the points defining each region.

Returns

transect

A dictionary with keys:

  • "is_passive" : numpy.ndarray

    Logical array showing whether a timepoint is of passive data. Shaped (num_timestamps, ). All passive recording data should be excluded by the mask.

  • "is_removed" : numpy.ndarray

    Logical array showing whether a timepoint is entirely removed by the mask. Shaped (num_timestamps, ).

  • "mask_patches" : numpy.ndarray

    Logical array indicating which datapoints are inside a patch from regions_patch (True) and should be excluded by the mask. Shaped (num_timestamps, num_depths).

  • "mask" : numpy.ndarray

    Logical array indicating which datapoints should be kept (True) and which are marked as removed (False) by one of the other three outputs. Shaped (num_timestamps, num_depths).

Return type

dict

echofilter.raw.loader.remove_trailing_slash(s)[source]#

Remove trailing forward slashes from a string.

Parameters

s (str) – String representing a path, possibly with trailing slashes.

Returns

Same as s, but without trailing forward slashes.

Return type

str
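This is essentially a one-liner; note that str.rstrip removes every trailing slash, matching the plural "slashes" described above:

```python
def remove_trailing_slash(s):
    # Remove every trailing "/" (not just one)
    return s.rstrip("/")
```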

echofilter.raw.loader.timestamp2evdtstr(timestamp)[source]#

Convert a timestamp into an Echoview-compatible datetime string.

The output is in the format "CCYYMMDD HHmmSSssss", where:

  • CC : century

  • YY : year

  • MM : month

  • DD : day

  • HH : hour

  • mm : minute

  • SS : second

  • ssss : 0.1 milliseconds

Parameters

timestamp (float) – Number of seconds since Unix epoch.

Returns

datetimestring – Datetime string in the Echoview-compatible format "CCYYMMDD HHmmSSssss".

Return type

str
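A sketch of this conversion with the standard library (the function name is illustrative; it assumes the timestamp is UTC and truncates the sub-0.1 ms remainder, which may differ from the library's rounding):

```python
from datetime import datetime, timezone

def timestamp_to_ev_string(timestamp):
    dt = datetime.fromtimestamp(timestamp, tz=timezone.utc)
    # "ssss" is the sub-second part in units of 0.1 ms (truncated)
    tenths_of_ms = dt.microsecond // 100
    return dt.strftime("%Y%m%d %H%M%S") + f"{tenths_of_ms:04d}"
```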

echofilter.raw.loader.transect_loader(fname, skip_lines=0, warn_row_overflow=None, row_len_selector='mode')[source]#

Load an entire survey transect CSV.

Parameters
  • fname (str) – Path to survey CSV file.

  • skip_lines (int, optional) – Number of initial entries to skip. Default is 0.

  • warn_row_overflow (bool or int, optional) – Whether to print a warning message if the number of elements in a row exceeds the expected number. If this is an int, it is the number of times to display the warnings before they are suppressed. If this is True, the number of warnings is unlimited. If None, the maximum number of underflow and overflow warnings differ: if row_len_selector is "init" or "min", underflow always produces a message and overflow messages stop after 2; otherwise the values are reversed. Default is None.

  • row_len_selector ({"init", "min", "max", "median", "mode"}, optional) – The method used to determine which row length (number of depth samples) to use. Default is "mode", the most common row length across all the measurement timepoints.

Returns

  • numpy.ndarray – Timestamps for each row, in seconds. Note: not corrected for timezone (so make sure your timezones are internally consistent).

  • numpy.ndarray – Depth of each column, in metres.

  • numpy.ndarray – Survey signal (Sv, for instance). Units match that of the file.

echofilter.raw.loader.transect_reader(fname)[source]#

Create a generator which iterates through a survey CSV file.

Parameters

fname (str) – Path to survey CSV file.

Returns

Yields a tuple of (metadata, data), where metadata is a dict and data is a numpy.ndarray. Each yield corresponds to a single row in the data; every row except the header is yielded.

Return type

generator

echofilter.raw.loader.write_transect_regions(fname, transect, depth_range=None, passive_key='is_passive', removed_key='is_removed', patches_key='mask_patches', collate_passive_length=0, collate_removed_length=0, minimum_passive_length=0, minimum_removed_length=0, minimum_patch_area=0, name_suffix='', common_notes='', line_ending='\r\n', verbose=0, verbose_indent=0)[source]#

Convert a transect dictionary to a set of regions and write as an EVR file.

Parameters
  • fname (str) – Destination of output file.

  • transect (dict) – Transect dictionary.

  • depth_range (array_like or None, optional) – The minimum and maximum depth extents (in any order) of the passive and removed block regions. If this is None (default), the minimum and maximum of transect["depths"] is used.

  • passive_key (str, optional) – Field name to use for passive data identification. Default is "is_passive".

  • removed_key (str, optional) – Field name to use for removed blocks. Default is "is_removed".

  • patches_key (str, optional) – Field name to use for the mask of patch regions. Default is "mask_patches".

  • collate_passive_length (int, optional) – Maximum distance (in indices) over which passive regions should be merged together, closing small gaps between them. Default is 0.

  • collate_removed_length (int, optional) – Maximum distance (in indices) over which removed blocks should be merged together, closing small gaps between them. Default is 0.

  • minimum_passive_length (int, optional) – Minimum length (in indices) a passive region must have to be included in the output. Set to -1 to omit all passive regions from the output. Default is 0.

  • minimum_removed_length (int, optional) – Minimum length (in indices) a removed block must have to be included in the output. Set to -1 to omit all removed regions from the output. Default is 0.

  • minimum_patch_area (float, optional) – Minimum amount of area (in input pixel space) that a patch must occupy in order to be included in the output. Set to 0 to include all patches, no matter their area. Set to -1 to omit all patches. Default is 0.

  • name_suffix (str, optional) – Suffix to append to variable names. Default is "", an empty string.

  • common_notes (str, optional) – Notes to include for every region. Default is "", an empty string.

  • line_ending (str, optional) – Line ending. Default is "\r\n", the standard line ending on Windows/DOS, as per the specification for the file format (https://support.echoview.com/WebHelp/Using_Echoview/Exporting/Exporting_data/Exporting_line_data.htm). Set to "\n" to get Unix-style line endings instead.

  • verbose (int, optional) – Verbosity level. Default is 0.

  • verbose_indent (int, optional) – Level of indentation (number of preceding spaces) before verbosity messages. Default is 0.

echofilter.raw.manipulate module#

Manipulating lines and masks contained in Echoview files.

echofilter.raw.manipulate.find_nonzero_region_boundaries(v)[source]#

Find the start and end indices for nonzero regions of a vector.

Parameters

v (array_like) – A vector.

Returns

  • starts (numpy.ndarray) – Indices for start of regions of nonzero elements in vector v

  • ends (numpy.ndarray) – Indices for end of regions of nonzero elements in vector v (exclusive).

Notes

For i in range(len(starts)), the set of values v[starts[i]:ends[i]] are nonzero. Values in the range v[ends[i]:starts[i+1]] are zero.
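The note above can be realised with a short numpy sketch (a hypothetical equivalent of the function, for illustration):

```python
import numpy as np

def nonzero_region_boundaries(v):
    # Pad with zeros so regions touching either end of the vector are detected
    nz = np.concatenate(([0], np.not_equal(v, 0).astype(int), [0]))
    d = np.diff(nz)
    starts = np.nonzero(d == 1)[0]   # zero -> nonzero transitions
    ends = np.nonzero(d == -1)[0]    # nonzero -> zero transitions (exclusive)
    return starts, ends
```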

echofilter.raw.manipulate.find_passive_data(signals, n_depth_use=38, threshold=25.0, deviation=None)[source]#

Find segments of Sv recording which correspond to passive recording.

Parameters
  • signals (array_like) – Two-dimensional array of Sv values, shaped [timestamps, depths].

  • n_depth_use (int, optional) – How many Sv depths to use, starting with the first depths (closest to the sounder device). If None, all depths are used. Default is 38.

  • threshold (float, optional) – Threshold for start/end of passive regions. Default is 25.

  • deviation (float, optional) – Threshold for start/end of passive regions is deviation times the interquartile range of the difference between samples at neighbouring timestamps. Default is None. Only one of threshold and deviation should be set.

Returns

  • passive_start (numpy.ndarray) – Indices of rows of signals at which passive segments start.

  • passive_end (numpy.ndarray) – Indices of rows of signals at which passive segments end.

Notes

Works by looking at the difference between consecutive recordings and finding large deviations.
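The idea in the notes can be sketched as follows. This is a simplified illustration which assumes passive segments appear as an abrupt drop in Sv at onset and a rise at offset; the real implementation is more careful.

```python
import numpy as np

def passive_boundaries_sketch(signals, n_depth_use=38, threshold=25.0):
    # Median Sv over the shallowest depths, then its temporal derivative
    med = np.median(signals[:, :n_depth_use], axis=1)
    delta = np.diff(med)
    starts = np.nonzero(delta < -threshold)[0] + 1  # sharp drop: passive onset
    ends = np.nonzero(delta > threshold)[0] + 1     # sharp rise: passive offset
    return starts, ends
```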

echofilter.raw.manipulate.find_passive_data_v2(signals, n_depth_use=38, threshold_inner=None, threshold_init=None, deviation=None, sigma_depth=0, sigma_time=1)[source]#

Find segments of Sv recording which correspond to passive recording.

Parameters
  • signals (array_like) – Two-dimensional array of Sv values, shaped [timestamps, depths].

  • n_depth_use (int, optional) – How many Sv depths to use, starting with the first depths (closest to the sounder device). If None, all depths are used. Default is 38. The median is taken across the depths, after taking the temporal derivative.

  • threshold_inner (float, optional) – Threshold to apply to the temporal derivative of the signal when detecting the fine-tuned start/end of passive regions. Default behaviour is to use a threshold automatically determined using deviation if it is set, and otherwise use a threshold of 35.0.

  • threshold_init (float, optional) – Threshold to apply during the initial scan for the start/end of passive regions, which seeds the fine-tuning search. Default behaviour is to use a threshold automatically determined using deviation if it is set, and otherwise use a threshold of 12.0.

  • deviation (float, optional) – Set threshold_inner to be deviation times the standard deviation of the temporal derivative of the signal. The standard deviation is robustly estimated based on the interquartile range. If this is set, threshold_inner must be None. Default is None.

  • sigma_depth (float, optional) – Width of kernel for filtering signals across second dimension (depth). Default is 0 (no filter).

  • sigma_time (float, optional) – Width of kernel for filtering signals across the first dimension (time). Default is 1. Set to 0 to not filter.

Returns

  • passive_start (numpy.ndarray) – Indices of rows of signals at which passive segments start.

  • passive_end (numpy.ndarray) – Indices of rows of signals at which passive segments end.

Notes

Works by looking at the difference between consecutive recordings and finding large deviations.

echofilter.raw.manipulate.fix_surface_line(timestamps, d_surface, is_passive)[source]#

Fix anomalies in the surface line.

Parameters
  • timestamps (array_like sized (N, )) – Timestamps for each ping.

  • d_surface (array_like sized (N, )) – Surface line depths.

  • is_passive (array_like sized (N, )) – Indicator for passive data. Values for the surface line during passive data collection will not be used.

Returns

  • fixed_surface (numpy.ndarray) – Surface line depths, with anomalies replaced with median filtered values and passive data replaced with linear interpolation. Has the same size and dtype as d_surface.

  • is_replaced (boolean numpy.ndarray sized (N, )) – Indicates which datapoints were replaced. Note that passive data is always replaced and is marked as such.

echofilter.raw.manipulate.fixup_lines(timestamps, depths, mask, t_turbulence=None, d_turbulence=None, t_bottom=None, d_bottom=None)[source]#

Extend existing turbulence/bottom lines based on masked target Sv output.

Parameters
  • timestamps (array_like) – Shaped (num_timestamps, ).

  • depths (array_like) – Shaped (num_depths, ).

  • mask (array_like) – Boolean array, where True denotes kept entries. Shaped (num_timestamps, num_depths).

  • t_turbulence (array_like, optional) – Sampling times for existing turbulence line.

  • d_turbulence (array_like, optional) – Depth of existing turbulence line.

  • t_bottom (array_like, optional) – Sampling times for existing bottom line.

  • d_bottom (array_like, optional) – Depth of existing bottom line.

Returns

  • d_turbulence_new (numpy.ndarray) – Depth of new turbulence line.

  • d_bottom_new (numpy.ndarray) – Depth of new bottom line.

echofilter.raw.manipulate.join_transect(transects)[source]#

Join segmented transects together into a single dictionary.

Parameters

transects (iterable of dict) – Transect segments, each with the same fields and compatible shapes.

Yields

dict – Transect data.
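Joining amounts to concatenating each time-like field across the segments, roughly as below (a simplified sketch; the real function also handles scalar fields, depth vectors, and padding metadata):

```python
import numpy as np

def join_segments_sketch(transects):
    # Concatenate each array field across segments along the time axis
    joined = {}
    for key in transects[0]:
        joined[key] = np.concatenate([np.asarray(t[key]) for t in transects], axis=0)
    return joined
```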

echofilter.raw.manipulate.load_decomposed_transect_mask(sample_path)[source]#

Load a raw and masked transect and decompose the mask.

The mask is decomposed into turbulence and bottom lines, and passive and removed regions.

Parameters

sample_path (str) – Path to sample, without extension. The raw data should be located at sample_path + "_Sv_raw.csv".

Returns

A dictionary with keys:

  • "timestamps" : numpy.ndarray

    Timestamps (in seconds since Unix epoch), for each recording timepoint.

  • "depths" : numpy.ndarray

    Depths from the surface (in metres), with each entry corresponding to each column in the signals data.

  • "Sv" : numpy.ndarray

    Echogram Sv data, shaped (num_timestamps, num_depths).

  • "mask" : numpy.ndarray

    Logical array indicating which datapoints were kept (True) and which removed (False) for the masked Sv output. Shaped (num_timestamps, num_depths).

  • "turbulence" : numpy.ndarray

    For each timepoint, the depth of the shallowest datapoint which should be included for the mask. Shaped (num_timestamps, ).

  • "bottom" : numpy.ndarray

    For each timepoint, the depth of the deepest datapoint which should be included for the mask. Shaped (num_timestamps, ).

  • "is_passive" : numpy.ndarray

    Logical array showing whether a timepoint is of passive data. Shaped (num_timestamps, ). All passive recording data should be excluded by the mask.

  • "is_removed" : numpy.ndarray

    Logical array showing whether a timepoint is entirely removed by the mask. Shaped (num_timestamps, ). Does not include periods of passive recording.

  • "is_upward_facing" : bool

    Indicates whether the recording source is located at the deepest depth (i.e. the seabed), facing upwards. Otherwise, the recording source is at the shallowest depth (i.e. the surface), facing downwards.

Return type

dict

echofilter.raw.manipulate.make_lines_from_mask(mask, depths=None, max_gap_squash=1.0)[source]#

Determine turbulence and bottom lines for a mask array.

Parameters
  • mask (array_like) – A two-dimensional logical array where, along each row (dimension 1), the values are False for some unknown continuous stretch at the start and end, with True values between these two masked-out regions.

  • depths (array_like, optional) – Depth of each sample point along dim 1 of mask. Must be either monotonically increasing or monotonically decreasing. Default is the index of mask, arange(mask.shape[1]).

  • max_gap_squash (float, optional) – Maximum gap to merge together, in metres. Default is 1.0.

Returns

  • d_turbulence (numpy.ndarray) – Depth of turbulence line. This is the line of smaller depth which separates the False region of mask from the central region of True values. (If depths is monotonically increasing, this is for the start of the columns of mask, otherwise it is at the end.)

  • d_bottom (numpy.ndarray) – Depth of bottom line. As for d_turbulence, but for the other end of the array.

echofilter.raw.manipulate.make_lines_from_masked_csv(fname)[source]#

Load a masked csv file and convert its mask to lines.

Parameters

fname (str) – Path to file containing masked Echoview output data in csv format.

Returns

  • timestamps (numpy.ndarray) – Sample timestamps.

  • d_turbulence (numpy.ndarray) – Depth of turbulence line.

  • d_bottom (numpy.ndarray) – Depth of bottom line.

echofilter.raw.manipulate.pad_transect(transect, pad=32, pad_mode='reflect', previous_padding='diff')[source]#

Pad a transect in the timestamps dimension (axis 0).

Parameters
  • transect (dict) – A dictionary of transect data.

  • pad (int, default=32) – Amount of padding to add.

  • pad_mode (str, default="reflect") – Padding method for out-of-bounds inputs. Must be supported by numpy.pad(), such as "constant", "reflect", or "edge". If the mode is "constant", the array will be padded with zeros.

  • previous_padding ({"diff", "add", "noop"}, default="diff") –

    How to handle this padding if the transect has already been padded.

    • "diff" – Extend the padding up to the target pad value.

    • "add" – Add this padding irrespective of pre-existing padding.

    • "noop" – Don't add any new padding if previously padded.

Returns

transect – Like input transect, but with all time-like dimensions extended with padding and fields "_pad_start" and "_pad_end" changed to indicate the total padding (including any pre-existing padding).

Return type

dict
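The padding applied to each time-like field corresponds to numpy.pad along axis 0, e.g. with the default "reflect" mode:

```python
import numpy as np

# Pad a 1-d time-like field by 2 samples on each side, reflecting at the edges
ts = np.array([0.0, 1.0, 2.0, 3.0])
padded = np.pad(ts, (2, 2), mode="reflect")
```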

echofilter.raw.manipulate.remove_anomalies_1d(signal, thr=5, thr2=4, kernel=201, kernel2=31, return_filtered=False)[source]#

Remove anomalies from a temporal signal.

Applies a median filter to the data and replaces datapoints which deviate from the median-filtered signal by more than some threshold with the median-filtered values. This process is repeated until no datapoints deviate from the filtered line by more than the threshold.

Parameters
  • signal (array_like) – The signal to filter.

  • thr (float, optional) – The initial threshold will be thr times the standard deviation of the residuals. The standard deviation is robustly estimated from the interquartile range. Default is 5.

  • thr2 (float, optional) – The threshold for repeated iterations will be thr2 times the standard deviation of the remaining residuals. The standard deviation is robustly estimated from interdecile range. Default is 4.

  • kernel (int, optional) – The kernel size for the initial median filter. Default is 201.

  • kernel2 (int, optional) – The kernel size for subsequent median filters. Default is 31.

  • return_filtered (bool, optional) – If True, the median filtered signal is also returned. Default is False.

Returns

  • signal (numpy.ndarray like signal) – The input signal with anomalies replaced with median values.

  • is_replaced (bool numpy.ndarray shaped like signal) – Indicator for which datapoints were replaced.

  • filtered (numpy.ndarray like signal, optional) – The final median filtered signal. Returned if return_filtered=True.
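One iteration of the scheme described above can be sketched as follows (a simplified illustration with a small kernel; the real function iterates until convergence and switches to the interdecile range on later passes):

```python
import numpy as np

def median_filter_1d(x, kernel):
    # Simple odd-kernel median filter with edge padding
    half = kernel // 2
    xp = np.pad(x, half, mode="edge")
    return np.array([np.median(xp[i:i + kernel]) for i in range(len(x))])

def remove_anomalies_sketch(signal, thr=5.0, kernel=5):
    # Replace points deviating from the median-filtered signal by more
    # than thr robust standard deviations with the filtered values
    filt = median_filter_1d(signal, kernel)
    resid = signal - filt
    q1, q3 = np.percentile(resid, [25, 75])
    std = (q3 - q1) / 1.349  # robust std estimate from the IQR
    is_replaced = np.abs(resid) > thr * std
    out = np.where(is_replaced, filt, signal)
    return out, is_replaced
```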

echofilter.raw.manipulate.split_transect(timestamps=None, threshold=20, percentile=97.5, max_length=-1, pad_length=32, pad_on='max', **transect)[source]#

Split a transect into segments each containing contiguous recordings.

Parameters
  • timestamps (array_like) – A 1-d array containing the timestamp at which each recording was measured. The sampling is assumed to be high-frequency with occasional gaps.

  • threshold (int, optional) – Threshold for splitting timestamps into segments. Any timepoints further apart than threshold times the percentile percentile of the difference between timepoints will be split apart into new segments. Default is 20.

  • percentile (float, optional) – The percentile at which to sample the timestamp intervals to establish a baseline typical interval. Default is 97.5.

  • max_length (int, default=-1) – Maximum length of each segment. Set to 0 or -1 to disable (default).

  • pad_length (int, default=32) – Amount of overlap between the segments. Set to 0 to disable.

  • pad_on ({"max", "thr", "all", "none"}, default="max") – Apply overlap padding when the transect is split due to either the total length exceeding the maximum ("max"), the time delta exceeding the threshold ("thr"), or both ("all").

  • **transect – Arbitrary additional transect variables, which will be split into segments as appropriate in accordance with timestamps.

Yields

dict – Containing segmented data, with key/value pairs as given in **transect, in addition to timestamps.
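The splitting criterion can be sketched as follows (segment boundaries only; the real function also applies max_length and overlap padding):

```python
import numpy as np

def split_indices_sketch(timestamps, threshold=20, percentile=97.5):
    # Gaps larger than threshold times the baseline interval start new segments
    dt = np.diff(timestamps)
    baseline = np.percentile(dt, percentile)
    breaks = np.nonzero(dt > threshold * baseline)[0] + 1
    return np.split(np.arange(len(timestamps)), breaks)
```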

echofilter.raw.manipulate.write_lines_for_masked_csv(fname_mask, fname_turbulence=None, fname_bottom=None)[source]#

Write turbulence and bottom lines based on masked csv file.

Parameters
  • fname_mask (str) – Path to input file containing masked Echoview output data in csv format.

  • fname_turbulence (str, optional) – Destination of generated turbulence line, written in evl format. If None (default), the output name is <fname_base>_mask-turbulence.evl, where <fname_base> is fname_mask without extension and without any occurrence of the substrings _Sv_raw or _Sv in the base file name.

  • fname_bottom (str, optional) – Destination of generated bottom line, written in evl format. If None (default), the output name is <fname_base>_mask-bottom.evl.

echofilter.raw.metadata module#

Dataset metadata, relevant for loading correct data.

echofilter.raw.metadata.recall_passive_edges(sample_path, timestamps)[source]#

Define passive data edges for samples within known datasets.

Parameters
  • sample_path (str) – Path to sample.

  • timestamps (array_like vector) – Vector of timestamps in sample.

Returns

  • passive_starts (numpy.ndarray or None) – Indices indicating the onset of passive data collection periods, or None if passive metadata is unavailable for this sample.

  • passive_ends (numpy.ndarray or None) – Indices indicating the offset of passive data collection periods, or None if passive metadata is unavailable for this sample.

  • finder_version (absent or str) – If passive_starts and passive_ends are not None, this string may be present to indicate which passive-finder algorithm works best for this dataset.

echofilter.raw.shardloader module#

Converting raw data into shards, and loading data from shards.

echofilter.raw.shardloader.load_transect_from_shards(transect_rel_pth, i1=0, i2=None, dataset='mobile', segment=0, root_data_dir='/data/dsforce/surveyExports', **kwargs)#

Load transect data from shard files.

Parameters
  • transect_rel_pth (str) – Relative path to transect.

  • i1 (int, optional) – Index of first sample to retrieve. Default is 0, the first sample.

  • i2 (int, optional) – Index of last sample to retrieve. As per Python convention, the range i1 to i2 is inclusive on the left and exclusive on the right, so datapoint i2 - 1 is the right-most datapoint loaded. Default is None, which loads everything up to and including the last sample.

  • dataset (str, optional) – Name of dataset. Default is "mobile".

  • segment (int, optional) – Which segment to load. Default is 0.

  • root_data_dir (str) – Path to root directory where data is located.

  • **kwargs – As per load_transect_from_shards_abs().

Returns

See load_transect_from_shards_abs().

Return type

dict

echofilter.raw.shardloader.load_transect_from_shards_abs(transect_abs_pth, i1=0, i2=None, pad_mode='edge')[source]#

Load transect data from shard files.

Parameters
  • transect_abs_pth (str) – Absolute path to transect shard directory.

  • i1 (int, optional) – Index of first sample to retrieve. Default is 0, the first sample.

  • i2 (int, optional) – Index of last sample to retrieve. As per Python convention, the range i1 to i2 is inclusive on the left and exclusive on the right, so datapoint i2 - 1 is the right-most datapoint loaded. Default is None, which loads everything up to and including the last sample.

  • pad_mode (str, optional) – Padding method for out-of-bounds inputs. Must be supported by numpy.pad(), such as "constant", "reflect", or "edge". If the mode is "constant", the array will be padded with zeros. Default is "edge".

Returns

A dictionary with keys:

  • ”timestamps”numpy.ndarray

    Timestamps (in seconds since Unix epoch), for each recording timepoint. The number of entries, num_timestamps, is equal to i2 - i1.

  • ”depths”numpy.ndarray

    Depths from the surface (in metres), with each entry corresponding to each column in the signals data.

  • ”Sv”numpy.ndarray

    Echogram Sv data, shaped (num_timestamps, num_depths).

  • ”mask”numpy.ndarray

    Logical array indicating which datapoints were kept (True) and which removed (False) for the masked Sv output. Shaped (num_timestamps, num_depths).

  • ”turbulence”numpy.ndarray

    For each timepoint, the depth of the shallowest datapoint which should be included for the mask. Shaped (num_timestamps, ).

  • ”bottom”numpy.ndarray

    For each timepoint, the depth of the deepest datapoint which should be included for the mask. Shaped (num_timestamps, ).

  • ”is_passive”numpy.ndarray

    Logical array showing whether a timepoint is of passive data. Shaped (num_timestamps, ). All passive recording data should be excluded by the mask.

  • "is_removed" : numpy.ndarray

    Logical array showing whether a timepoint is entirely removed by the mask. Shaped (num_timestamps, ). Does not include periods of passive recording.

  • "is_upward_facing" : bool

    Indicates whether the recording source is located at the deepest depth (i.e. the seabed), facing upwards. Otherwise, the recording source is at the shallowest depth (i.e. the surface), facing downwards.

Return type

dict
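The shard layout described above can be illustrated with a minimal, self-contained sketch: toy data is written to a temporary directory in the numbered-shard structure, then a sub-range [i1, i2) is read back by concatenating the relevant shards. The exact format of "shard_size.txt" here is an assumption for illustration, and only the "timestamps" key is handled; the real loader also reads the other arrays listed above.

```python
import os
import tempfile

import numpy as np

# Build a toy sharded transect: 10 timestamps split into shards of 4,
# then load a sub-range [i1, i2) by concatenating the relevant shards.
shard_len = 4
timestamps = np.arange(10, dtype=float)

with tempfile.TemporaryDirectory() as root:
    # Metadata file: total number of samples and the shard size
    # (assumed comma-separated here)
    with open(os.path.join(root, "shard_size.txt"), "w") as f:
        f.write("{},{}".format(len(timestamps), shard_len))
    # One numbered directory per shard
    for k, start in enumerate(range(0, len(timestamps), shard_len)):
        shard_dir = os.path.join(root, str(k))
        os.makedirs(shard_dir)
        np.save(
            os.path.join(shard_dir, "timestamps.npy"),
            timestamps[start : start + shard_len],
        )

    # Load samples i1..i2 (inclusive-exclusive, as in the API above)
    i1, i2 = 3, 9
    shards = range(i1 // shard_len, (i2 - 1) // shard_len + 1)
    parts = [
        np.load(os.path.join(root, str(k), "timestamps.npy")) for k in shards
    ]
    joined = np.concatenate(parts)
    offset = (i1 // shard_len) * shard_len
    out = joined[i1 - offset : i2 - offset]

print(out)  # timestamps 3..8
```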

echofilter.raw.shardloader.load_transect_from_shards_rel(transect_rel_pth, i1=0, i2=None, dataset='mobile', segment=0, root_data_dir='/data/dsforce/surveyExports', **kwargs)[source]#

Load transect data from shard files.

Parameters
  • transect_rel_pth (str) – Relative path to transect.

  • i1 (int, optional) – Index of first sample to retrieve. Default is 0, the first sample.

  • i2 (int, optional) – Index of last sample to retrieve. As per Python convention, the range i1 to i2 is inclusive on the left and exclusive on the right, so datapoint i2 - 1 is the right-most datapoint loaded. Default is None, which loads everything up to and including the last sample.

  • dataset (str, optional) – Name of dataset. Default is "mobile".

  • segment (int, optional) – Which segment to load. Default is 0.

  • root_data_dir (str) – Path to root directory where data is located.

  • **kwargs – As per load_transect_from_shards_abs().

Returns

See load_transect_from_shards_abs().

Return type

dict

echofilter.raw.shardloader.load_transect_segments_from_shards_abs(transect_abs_pth, segments=None)[source]#

Load transect data from shard files.

Parameters
  • transect_abs_pth (str) – Absolute path to transect shard segments directory.

  • segments (iterable or None) – Which segments to load. If None (default), all segments are loaded.

Returns

See load_transect_from_shards_abs().

Return type

dict

echofilter.raw.shardloader.load_transect_segments_from_shards_rel(transect_rel_pth, dataset='mobile', segments=None, root_data_dir='/data/dsforce/surveyExports', **kwargs)[source]#

Load transect data from shard files.

Parameters
  • transect_rel_pth (str) – Relative path to transect.

  • dataset (str, optional) – Name of dataset. Default is "mobile".

  • segments (iterable or None) – Which segments to load. If None (default), all segments are loaded.

  • root_data_dir (str) – Path to root directory where data is located.

  • **kwargs – As per load_transect_from_shards_abs().

Returns

See load_transect_from_shards_abs().

Return type

dict

echofilter.raw.shardloader.segment_and_shard_transect(transect_pth, dataset='mobile', max_depth=None, shard_len=128, root_data_dir='/data/dsforce/surveyExports')[source]#

Create a sharded copy of a transect.

The transect is cut into segments based on recording starts/stops. Each segment is split across multiple files (shards) for efficient loading.

Parameters
  • transect_pth (str) – Relative path to transect, excluding "_Sv_raw.csv".

  • dataset (str, optional) – Name of dataset. Default is "mobile".

  • max_depth (float or None, optional) – The maximum depth to include in the saved shard. Data corresponding to deeper locations is omitted to save on load time and memory when the shard is loaded. If None, no cropping is applied. Default is None.

  • shard_len (int, optional) – Number of timestamp samples to include in each shard. Default is 128.

  • root_data_dir (str) – Path to root directory where data is located.

Notes

The segments will be written to the directories <root_data_dir>_sharded/<dataset>/transect_path/<segment>/. For the contents of each directory, see write_transect_shards().

echofilter.raw.shardloader.shard_transect(transect_pth, dataset='mobile', max_depth=None, shard_len=128, root_data_dir='/data/dsforce/surveyExports')#

Create a sharded copy of a transect.

The transect is cut into segments based on recording starts/stops. Each segment is split across multiple files (shards) for efficient loading.

Parameters
  • transect_pth (str) – Relative path to transect, excluding "_Sv_raw.csv".

  • dataset (str, optional) – Name of dataset. Default is "mobile".

  • max_depth (float or None, optional) – The maximum depth to include in the saved shard. Data corresponding to deeper locations is omitted to save on load time and memory when the shard is loaded. If None, no cropping is applied. Default is None.

  • shard_len (int, optional) – Number of timestamp samples to include in each shard. Default is 128.

  • root_data_dir (str) – Path to root directory where data is located.

Notes

The segments will be written to the directories <root_data_dir>_sharded/<dataset>/transect_path/<segment>/. For the contents of each directory, see write_transect_shards().

echofilter.raw.shardloader.write_transect_shards(dirname, transect, max_depth=None, shard_len=128)[source]#

Create a sharded copy of a transect.

The transect is cut by timestamp and split across multiple files.

Parameters
  • dirname (str) – Path to output directory.

  • transect (dict) – Observed values for the transect. Should already be segmented.

  • max_depth (float or None, optional) – The maximum depth to include in the saved shard. Data corresponding to deeper locations is omitted to save on load time and memory when the shard is loaded. If None, no cropping is applied. Default is None.

  • shard_len (int, optional) – Number of timestamp samples to include in each shard. Default is 128.

Notes

The output will be written to the directory dirname, and will contain:

  • a file named "shard_size.txt", which contains the sharding metadata: total number of samples, and shard size;

  • a directory for each shard, named 0, 1, … Each shard directory will contain files:

    • depths.npy

    • timestamps.npy

    • Sv.npy

    • mask.npy

    • turbulence.npy

    • bottom.npy

    • is_passive.npy

    • is_removed.npy

    • is_upward_facing.npy

    which contain pickled numpy dumps of the matrices for each shard.
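The number of shards and their boundaries follow directly from shard_len; a sketch of the arithmetic, assuming a plain ceiling-division split in which the final shard may be shorter than the others:

```python
import numpy as np

# For n_timestamps samples and a shard length, the transect is cut
# into ceil(n / shard_len) shards; the final shard may be shorter.
n_timestamps = 300
shard_len = 128

n_shards = int(np.ceil(n_timestamps / shard_len))
boundaries = [
    (k * shard_len, min((k + 1) * shard_len, n_timestamps))
    for k in range(n_shards)
]
print(n_shards, boundaries)  # 3 shards: (0, 128), (128, 256), (256, 300)
```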

echofilter.raw.utils module#

Loader utility functions.

echofilter.raw.utils.fillholes2d(arr, nan_thr=2, interp_method='linear', inplace=False)[source]#

Interpolate to replace NaN values in 2d gridded array data.

Parameters
  • arr (2d numpy.ndarray) – 2d array, which may contain NaNs.

  • nan_thr (int, default=2) – Minimum number of NaN values needed in a row/column for it to be included in the (rectangular) area where NaNs are fixed.

  • interp_method (str, default="linear") – Interpolation method.

  • inplace (bool, default=False) – Whether to update arr instead of a copy.

Returns

arr – Like input arr, but with NaN values replaced with interpolated values.

Return type

2d numpy.ndarray
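A minimal sketch of this style of 2d hole-filling, using scipy.interpolate.griddata. This illustrates the approach rather than the exact implementation, which restricts the interpolation to a rectangular area around the NaNs selected via nan_thr:

```python
import numpy as np
from scipy.interpolate import griddata

# Toy 2d grid with a single NaN hole
arr = np.array(
    [
        [1.0, 2.0, 3.0],
        [4.0, np.nan, 6.0],
        [7.0, 8.0, 9.0],
    ]
)

# Interpolate NaN entries from the surrounding valid values
ii, jj = np.meshgrid(
    np.arange(arr.shape[0]), np.arange(arr.shape[1]), indexing="ij"
)
valid = ~np.isnan(arr)
filled = arr.copy()
filled[~valid] = griddata(
    (ii[valid], jj[valid]),    # co-ordinates of known values
    arr[valid],                # known values
    (ii[~valid], jj[~valid]),  # co-ordinates to fill
    method="linear",
)
print(filled[1, 1])  # 5.0
```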

echofilter.raw.utils.integrate_area_of_contour(x, y, closed=None, preserve_sign=False)[source]#

Compute the area within a contour, using Green’s algorithm.

Parameters
  • x (array_like vector) – x co-ordinates of nodes along the contour.

  • y (array_like vector) – y co-ordinates of nodes along the contour.

  • closed (bool or None, optional) – Whether the contour is already closed. If False, it will be closed before determining the area. If None (default), it is automatically determined whether the contour is already closed, and it is closed if necessary.

  • preserve_sign (bool, optional) – Whether to preserve the sign of the area. If True, the area is positive if the contour is oriented anti-clockwise and negative if it is clockwise. Default is False, which always returns a positive area.

Returns

area – The integral of the area within the contour.

Return type

float

Notes

https://en.wikipedia.org/wiki/Green%27s_theorem#Area_calculation
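A standalone sketch of the Green's theorem (shoelace) area formula, including the sign convention and contour-closing behaviour described above; this is an illustration of the algorithm, not the library's exact implementation:

```python
import numpy as np

def contour_area(x, y, preserve_sign=False):
    """Area enclosed by a polygonal contour, via Green's theorem."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    # Close the contour if the first and last nodes differ
    if x[0] != x[-1] or y[0] != y[-1]:
        x = np.append(x, x[0])
        y = np.append(y, y[0])
    # A = 0.5 * sum(x_i * y_{i+1} - x_{i+1} * y_i)
    area = 0.5 * np.sum(x[:-1] * y[1:] - x[1:] * y[:-1])
    return area if preserve_sign else abs(area)

# Unit square traversed clockwise: signed area is -1, magnitude 1
xs = [0, 0, 1, 1]
ys = [0, 1, 1, 0]
print(contour_area(xs, ys))                      # 1.0
print(contour_area(xs, ys, preserve_sign=True))  # -1.0
```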

echofilter.raw.utils.interp1d_preserve_nan(x, y, x_samples, nan_threshold=0.0, bounds_error=False, **kwargs)[source]#

Interpolate a 1-D function, preserving NaNs.

Inputs x and y are arrays of values used to approximate some function f: y = f(x). We exclude NaNs for the interpolation and then mask out entries which are adjacent to (or close to) a NaN in the input.

Parameters
  • x ((N,) array_like) – A 1-D array of real values. Must not contain NaNs.

  • y ((...,N,...) array_like) – An N-D array of real values. The length of y along the interpolation axis must be equal to the length of x. May contain NaNs.

  • x_samples (array_like) – A 1-D array of real values at which the interpolation function will be sampled.

  • nan_threshold (float, optional) – Minimum amount of influence a NaN must have on an output sample for it to become a NaN. Default is 0, i.e. any influence.

  • bounds_error (bool, optional) – If True, a ValueError is raised any time interpolation is attempted on a value outside of the range of x (where extrapolation is necessary). If False (default), out of bounds values are assigned value fill_value (whose default is NaN).

  • **kwargs – Additional keyword arguments are as per scipy.interpolate.interp1d().

Returns

y_samples – The result of interpolating, with sample points close to NaNs in the input returned as NaN.

Return type

(…,N,…) np.ndarray
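The idea can be sketched with scipy.interpolate.interp1d: interpolate the valid points only, then re-insert NaN wherever a NaN input would have had influence. This hypothetical sketch measures NaN influence by interpolating a NaN-indicator signal with nan_threshold = 0; the library's exact masking may differ:

```python
import numpy as np
from scipy.interpolate import interp1d

# y = f(x) with a NaN; interpolate on the valid points only, then
# re-insert NaN wherever the NaN input would have had any influence.
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([0.0, 1.0, np.nan, 3.0, 4.0])
x_samples = np.array([0.5, 1.5, 2.5, 3.5])

valid = ~np.isnan(y)
f = interp1d(x[valid], y[valid], bounds_error=False)
y_samples = f(x_samples)

# Influence of NaNs: interpolate the NaN indicator; any sample with
# positive weight on a NaN input becomes NaN (nan_threshold = 0)
nan_influence = interp1d(x, np.isnan(y).astype(float), bounds_error=False)(
    x_samples
)
y_samples[nan_influence > 0.0] = np.nan
print(y_samples)  # [0.5, nan, nan, 3.5]
```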

echofilter.raw.utils.medfilt1d(signal, kernel_size, axis=-1, pad_mode='reflect')[source]#

Median filter in 1d, with support for selecting padding mode.

Parameters
  • signal (array_like) – The signal to filter.

  • kernel_size (int) – Size of the median kernel to use.

  • axis (int, optional) – Which axis to operate along. Default is -1.

  • pad_mode (str, optional) – Method with which to pad the vector at the edges. Must be supported by numpy.pad(). Default is "reflect".

Returns

filtered – The filtered signal.

Return type

array_like
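The padding behaviour can be emulated by combining numpy.pad with scipy.signal.medfilt: pad the signal so the filter window is fully populated at the edges, filter, then crop. A minimal 1d sketch (the library's implementation may differ in detail):

```python
import numpy as np
from scipy.signal import medfilt

def medfilt1d_sketch(signal, kernel_size, pad_mode="reflect"):
    """Median filter with explicit edge padding (1d, last axis)."""
    half = kernel_size // 2
    # Pad so the filter window is fully populated at the edges,
    # then crop the padding back off after filtering
    padded = np.pad(signal, half, mode=pad_mode)
    return medfilt(padded, kernel_size)[half:-half]

sig = np.array([5.0, 1.0, 2.0, 8.0, 2.0, 3.0])
print(medfilt1d_sketch(sig, 3))  # [1. 2. 2. 2. 3. 2.]
```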

echofilter.raw.utils.pad1d(array, pad_width, axis=0, **kwargs)[source]#

Pad an array along a single axis only.

Parameters
  • array (numpy.ndarray) – Array to be padded.

  • pad_width (int or tuple) – The amount to pad, either a length two tuple of values for each edge, or an int if the padding should be the same for each side.

  • axis (int, optional) – The axis to pad. Default is 0.

  • **kwargs – As per numpy.pad().

Returns

Padded array.

Return type

numpy.ndarray

See also

numpy.pad
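numpy.pad expects one (before, after) pair per axis, so restricting padding to a single axis amounts to a thin wrapper that zero-pads every other axis. A plausible sketch:

```python
import numpy as np

def pad1d_sketch(array, pad_width, axis=0, **kwargs):
    """Pad a single axis, leaving the others untouched."""
    if np.isscalar(pad_width):
        pad_width = (pad_width, pad_width)
    # numpy.pad takes one (before, after) pair per axis
    widths = [(0, 0)] * array.ndim
    widths[axis] = tuple(pad_width)
    return np.pad(array, widths, **kwargs)

a = np.zeros((2, 3))
print(pad1d_sketch(a, 1, axis=1, mode="constant").shape)  # (2, 5)
```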

echofilter.raw.utils.squash_gaps(mask, max_gap_squash, axis=-1, inplace=False)[source]#

Merge small gaps between zero values in a boolean array.

Parameters
  • mask (boolean array) – The input mask, in which small gaps between zero values will be filled with zeros.

  • max_gap_squash (int) – Maximum length of gap to squash.

  • axis (int, optional) – Axis on which to operate. Default is -1.

  • inplace (bool, optional) – Whether to operate on the original array. If False, a copy is created and returned.

Returns

merged_mask – Mask as per the input, but with small gaps squashed.

Return type

boolean array
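A 1d sketch of the gap-squashing idea: find runs of True values bounded by False on both sides, and zero out any such run no longer than max_gap_squash. This is an illustration under the assumption that runs touching the array boundary are not "between zero values" and are therefore kept; the library's edge handling may differ.

```python
import numpy as np

def squash_gaps_sketch(mask, max_gap_squash):
    """Zero-out short True runs bounded by False on both sides (1d)."""
    mask = np.asarray(mask, dtype=bool).copy()
    # Locate run boundaries of True values
    padded = np.concatenate(([False], mask, [False]))
    starts = np.flatnonzero(~padded[:-1] & padded[1:])
    ends = np.flatnonzero(padded[:-1] & ~padded[1:])
    for start, end in zip(starts, ends):
        is_interior = start > 0 and end < len(mask)  # bounded by real zeros
        if is_interior and end - start <= max_gap_squash:
            mask[start:end] = False
    return mask

m = np.array([1, 1, 0, 1, 1, 0, 1, 1, 1, 1], dtype=bool)
print(squash_gaps_sketch(m, 2).astype(int))  # [1 1 0 0 0 0 1 1 1 1]
```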