mtdata package

Subpackages

Submodules

mtdata.backward module

mtdata.backward.read_backward(file: BinaryIO) → Iterable[str]

Read the lines of a text file (opened in binary mode) in reverse order, from the last line to the first.

>>> from io import BytesIO
>>> list(read_backward(BytesIO(b'a\nb')))
['b', 'a\n']
>>> list(read_backward(BytesIO(b'a\nb\n')))
['b\n', 'a\n']
>>> list(read_backward(BytesIO(b'a\n\n')))
['\n', 'a\n']
>>> list(read_backward(BytesIO(b'\n\n')))
['\n', '\n']
>>> list(read_backward(BytesIO(b'\n')))
['\n']
>>> list(read_backward(BytesIO(b'')))
[]
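
The behaviour shown in the doctests above can be reproduced with a chunked backward scan. The following is only a minimal sketch, not the package's actual implementation: the name read_backward_sketch, the chunk_size parameter, and the UTF-8 decoding are all assumptions made for illustration.

```python
import os
from io import BytesIO
from typing import BinaryIO, Iterator, List


def _split_keep_newlines(data: bytes) -> List[bytes]:
    """Split on b'\\n', keeping the newline at the end of each line."""
    parts = data.split(b'\n')
    lines = [part + b'\n' for part in parts[:-1]]
    if parts[-1]:
        lines.append(parts[-1])  # final line without a trailing newline
    return lines


def read_backward_sketch(file: BinaryIO, chunk_size: int = 4096) -> Iterator[str]:
    """Yield the lines of a binary file from last to first (hypothetical)."""
    file.seek(0, os.SEEK_END)
    position = file.tell()
    pending = b''  # earliest line seen so far; its start may lie further back
    while position > 0:
        step = min(chunk_size, position)
        position -= step
        file.seek(position)
        lines = _split_keep_newlines(file.read(step) + pending)
        pending = lines[0]
        # Everything after the first piece is a complete line.
        for line in reversed(lines[1:]):
            yield line.decode('utf-8')  # assumption: the file is UTF-8
    if pending:
        yield pending.decode('utf-8')


# Small chunks force the scan to stitch lines across chunk boundaries.
result = list(read_backward_sketch(BytesIO(b'a\nb\n'), chunk_size=2))
```

Because only whole lines are decoded, a multi-byte character split across two chunks does not cause a decoding error in this sketch.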

mtdata.dataset module

class mtdata.dataset.Dataset

Bases: ABC

A single collection of data, represented as a table.

abstract property dedup_facets: Iterable[str]

Fields used to determine which rows should be compared for de-duplication.

For example, if the data are sensor readings for various locations, the field that indicates the location would be listed here. That way, we don’t drop a new reading from one location just because it occurred at the same time as a reading from another location.

Uses the transformed version of the field names.

abstract property dedup_fields: Iterable[str]

Fields used to compare rows for de-duplication. This is likely to be some kind of timestamp, but that depends on the kind of data.

For example, if the data are sensor readings and each row is timestamped based on when the reading occurred, then that field will be listed here because two or more fetches might retrieve the same reading instance.

Uses the transformed version of the field names.

abstract fetch() → FetchResult

Fetch new data from the source (generally the web).

abstract static name() → str

The dataset name, which is used in the UI and for things like file and table names.

abstract property transformer: Transformer

The transformer to be applied to each row that is fetched from the data source before it is stored.

class mtdata.dataset.FetchResult(success: bool, message: str, data: Iterable[Dict[str, Any]])

Bases: tuple

The result of a completed fetch operation. If the operation was successful (data were acquired, or there were no new data available), then success should be True; otherwise it should be False.

Upon success, it is reasonable for the message to be the empty string. However, if the operation failed, then the message field should contain some kind of explanation suitable for presentation to the user and inclusion in log files.

If the fetch was successful, then data should contain the rows that were acquired from the data source. It may be empty if there were no rows available (this will depend on the source). It should also be empty on failure.

property data

Alias for field number 2

property message

Alias for field number 1

property success

Alias for field number 0
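
To make the Dataset and FetchResult contract concrete, here is a minimal sketch of a dataset that reports both outcomes. It is self-contained, so FetchResult is a stand-in that mirrors the documented fields, and SensorReadings, its download callable, and the field names are hypothetical. A real dataset would subclass mtdata.dataset.Dataset and also provide the transformer property, which is omitted here for brevity.

```python
from typing import Any, Callable, Dict, Iterable, List, NamedTuple

Row = Dict[str, Any]


class FetchResult(NamedTuple):
    """Stand-in mirroring the documented fields of mtdata.dataset.FetchResult."""
    success: bool
    message: str
    data: Iterable[Row]


class SensorReadings:
    """Hypothetical dataset; not part of mtdata."""

    def __init__(self, download: Callable[[], List[Row]]):
        self._download = download  # assumption: a callable that returns rows

    @staticmethod
    def name() -> str:
        return 'sensor-readings'

    @property
    def dedup_facets(self) -> Iterable[str]:
        return ['location']  # rows are only compared within one location

    @property
    def dedup_fields(self) -> Iterable[str]:
        return ['timestamp']  # the same timestamp means the same reading

    def fetch(self) -> FetchResult:
        try:
            rows = self._download()
        except OSError as error:
            # Failure: explain what happened and return no data.
            return FetchResult(False, f'download failed: {error}', [])
        # Success: an empty message is acceptable.
        return FetchResult(True, '', rows)
```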

mtdata.fields module

mtdata.fields.prune_fields(row: Dict[str, Any], keys: Iterable[str]) → Dict[str, Any]

Remove fields from the row that are not included in the iterable of keys provided. The row is mutated, but also returned to the caller.

>>> prune_fields({'a': 1, 'b': 2}, ['b'])
{'b': 2}

mtdata.fields.rename_fields(row: Dict[str, Any], mapping: Dict[str, str]) → Dict[str, Any]

Rename the fields of the given row using the name mapping provided. The row is mutated, but also returned to the caller.

>>> rename_fields({'A': 1, 'b': 2}, {'A': 'a', 'b': 'b'})
{'a': 1, 'b': 2}
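
Both helpers mutate the row they are given and return that same object. A minimal sketch of that contract follows. The function names are hypothetical, and leaving unmapped keys unchanged during a rename is an assumption, since the documentation above does not say what happens to them.

```python
from typing import Any, Dict, Iterable


def prune_fields_sketch(row: Dict[str, Any], keys: Iterable[str]) -> Dict[str, Any]:
    """Delete every field whose key is not listed in keys (in place)."""
    keep = set(keys)
    for key in list(row):  # iterate over a copy because the dict is mutated
        if key not in keep:
            del row[key]
    return row


def rename_fields_sketch(row: Dict[str, Any], mapping: Dict[str, str]) -> Dict[str, Any]:
    """Rename fields in place using an old-name to new-name mapping."""
    # Assumption: keys missing from the mapping keep their current name.
    renamed = {mapping.get(key, key): value for key, value in row.items()}
    row.clear()
    row.update(renamed)
    return row


row = {'A': 1, 'b': 2, 'c': 3}
same = rename_fields_sketch(prune_fields_sketch(row, ['A', 'b']), {'A': 'a'})
# same is the very object that was passed in, now holding the fields a and b
```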

mtdata.manifest module

mtdata.manifest.get_dataset(name: str) → Optional[Type[Dataset]]

Get the dataset class with the given name, or None if there is no dataset implementation in the manifest with that name.

mtdata.manifest.get_store(name: str) → Optional[Type[Storage]]

Get the store class with the given name, or None if there is no store implementation in the manifest with that name.
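
One plausible way to implement such a lookup is to index the registered classes by their static name(). The sketch below only illustrates that idea. The DATASETS tuple, ExampleDataset, and get_dataset_sketch are hypothetical, and how the real manifest stores its entries is not described here.

```python
from typing import Optional, Tuple


class ExampleDataset:
    """Hypothetical dataset class that exposes a static name()."""

    @staticmethod
    def name() -> str:
        return 'example'


# Assumption: the manifest is a plain collection of classes.
DATASETS: Tuple[type, ...] = (ExampleDataset,)


def get_dataset_sketch(name: str) -> Optional[type]:
    """Return the class whose name() matches, or None if there is none."""
    by_name = {cls.name(): cls for cls in DATASETS}
    return by_name.get(name)
```

Returning None instead of raising lets the caller decide how to report an unknown name to the user.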

mtdata.parameters module

class mtdata.parameters.Parameters(datasets: Tuple[str], list_datasets: bool, list_stores: bool, namespace: str, stores: Tuple[str])

Bases: tuple

Parameters supported by the CLI.

property datasets

Alias for field number 0

property list_datasets

Alias for field number 1

property list_stores

Alias for field number 2

property namespace

Alias for field number 3

property stores

Alias for field number 4

mtdata.parameters.comma_tuple(arg: str) → Tuple[str, ...]

A “type” that can be used with ArgumentParser to split a comma-delimited list of values into a tuple of strings.

TODO: Add a “type” to this that converts the values

mtdata.parameters.parse_parameters(args: List[str]) → Parameters

Turn a list of command line arguments into a Parameters object.
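
The sketch below shows how a comma-splitting "type" plugs into argparse and how the parsed values could be packed into a Parameters tuple. The flag names (--datasets, --stores, --list-datasets, --list-stores, --namespace), the defaults, and the whitespace stripping are assumptions; only the field names of Parameters come from the documentation above.

```python
from argparse import ArgumentParser
from typing import List, NamedTuple, Tuple


class Parameters(NamedTuple):
    """Stand-in mirroring the documented fields of mtdata.parameters.Parameters."""
    datasets: Tuple[str, ...]
    list_datasets: bool
    list_stores: bool
    namespace: str
    stores: Tuple[str, ...]


def comma_tuple_sketch(arg: str) -> Tuple[str, ...]:
    """Split 'a,b,c' into ('a', 'b', 'c'), ignoring empty pieces."""
    return tuple(part.strip() for part in arg.split(',') if part.strip())


def parse_parameters_sketch(args: List[str]) -> Parameters:
    parser = ArgumentParser(prog='mtdata')
    # argparse calls the "type" callable on the raw string argument.
    parser.add_argument('--datasets', type=comma_tuple_sketch, default=())
    parser.add_argument('--stores', type=comma_tuple_sketch, default=())
    parser.add_argument('--list-datasets', action='store_true')
    parser.add_argument('--list-stores', action='store_true')
    parser.add_argument('--namespace', default='')
    parsed = parser.parse_args(args)
    return Parameters(parsed.datasets, parsed.list_datasets,
                      parsed.list_stores, parsed.namespace, parsed.stores)


params = parse_parameters_sketch(['--datasets', 'air,water', '--list-stores'])
```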

mtdata.registry module

mtdata.row module

mtdata.storage module

class mtdata.storage.CSVBasic(namespace: str)

Bases: Storage

A minimal CSV implementation that uses a DictWriter to write rows to the indicated file.
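
As an illustration of the DictWriter approach, the sketch below appends rows to a CSV file and writes the header only when the file is new. It is not the CSVBasic code: the helper names are hypothetical, and it assumes every row has the same keys as the first one.

```python
import csv
import os
import tempfile
from typing import Any, Dict, Iterable, List

Row = Dict[str, Any]


def append_rows_sketch(path: str, rows: Iterable[Row]) -> None:
    """Append rows to a CSV file, writing the header only for a new file."""
    rows = list(rows)
    if not rows:
        return
    is_new = not os.path.exists(path) or os.path.getsize(path) == 0
    with open(path, 'a', newline='') as handle:
        # Assumption: every row shares the keys of the first row.
        writer = csv.DictWriter(handle, fieldnames=list(rows[0]))
        if is_new:
            writer.writeheader()
        writer.writerows(rows)


def load_rows_sketch(path: str) -> List[Dict[str, str]]:
    """Read every row back as a dictionary keyed by the header names."""
    with open(path, newline='') as handle:
        return [dict(row) for row in csv.DictReader(handle)]


with tempfile.TemporaryDirectory() as directory:
    path = os.path.join(directory, 'readings.csv')
    append_rows_sketch(path, [{'location': 'x', 'value': 1}])
    append_rows_sketch(path, [{'location': 'y', 'value': 2}])
    loaded = load_rows_sketch(path)  # CSV values come back as strings
```

Note that CSV does not preserve types, so a caller that needs numbers back has to convert them after loading.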

append(name: str, data: Iterable[Dict[str, Any]], dedup_facets: Iterable[str], dedup_fields: Iterable[str]) → StoreResult

Append some number of rows to the data currently stored. The existing data should remain untouched and the new data should, where it makes sense, be stored “after” the existing data.

The name is the identifier associated with the dataset being stored and should be used to construct any files or tables required by the storage implementation.

If dedup_facets and dedup_fields are non-empty, then de-duplication must occur before the new data are stored. See the documentation for Dataset for an explanation of these fields.

load(name: str) → Iterable[Dict[str, Any]]

Read in all data and return it as an iterable of rows. The implementation may choose to read all rows into memory or stream them through an iterator.

The name is the identifier associated with the dataset being stored and should be used to construct any files or tables required by the storage implementation.

load_backward(name: str) → Iterable[Dict[str, Any]]

Load the data in reverse order. Used for de-duplication.

static name() → str

A human-readable name for the storage implementation. Intended for use in the UI.

By convention, this should be the class name, converted to kebab case. So a storage class called FancyDatabase would be named “fancy-database”.

name_to_path(name: str) → str

Convert a name to a file path, assuming this implementation’s file extension.

replace(name: str, data: Iterable[Dict[str, Any]]) → StoreResult

Delete all data currently stored and replace it with the given rows.

The name is the identifier associated with the dataset being stored and should be used to construct any files or tables required by the storage implementation.

class mtdata.storage.JsonLines(namespace: str)

Bases: Storage

A storage implementation that writes each row as a single, JSON-formatted line in a file. This allows the data to be “streamed” back without reading in the entire file. It also allows efficient append operations since the old data needn’t be loaded in order to add more.

Data are stored and retrieved in the order they are appended or replaced. Therefore, as long as data are always added in chronological order, they will remain in that order.
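
The properties described above follow directly from the one-row-per-line layout. The sketch below shows why: appending only opens the file in append mode, and loading can yield one row at a time. The helper names and the UTF-8 encoding are assumptions, and this is not the JsonLines code itself.

```python
import json
import os
import tempfile
from typing import Any, Dict, Iterable, Iterator

Row = Dict[str, Any]


def append_jsonl_sketch(path: str, rows: Iterable[Row]) -> None:
    """Append each row as one JSON document per line."""
    # Append mode means the existing rows are never read or rewritten.
    with open(path, 'a', encoding='utf-8') as handle:
        for row in rows:
            handle.write(json.dumps(row) + '\n')


def load_jsonl_sketch(path: str) -> Iterator[Row]:
    """Stream rows back in the order they were written."""
    with open(path, encoding='utf-8') as handle:
        for line in handle:
            if line.strip():  # tolerate blank lines
                yield json.loads(line)


with tempfile.TemporaryDirectory() as directory:
    path = os.path.join(directory, 'readings.jsonl')
    append_jsonl_sketch(path, [{'t': 1}, {'t': 2}])
    append_jsonl_sketch(path, [{'t': 3}])
    loaded = list(load_jsonl_sketch(path))  # rows keep their insertion order
```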

append(name: str, data: Iterable[Dict[str, Any]], dedup_facets: Iterable[str], dedup_fields: Iterable[str]) → StoreResult

Append some number of rows to the data currently stored. The existing data should remain untouched and the new data should, where it makes sense, be stored “after” the existing data.

The name is the identifier associated with the dataset being stored and should be used to construct any files or tables required by the storage implementation.

If dedup_facets and dedup_fields are non-empty, then de-duplication must occur before the new data are stored. See the documentation for Dataset for an explanation of these fields.

load(name: str) → Iterable[Dict[str, Any]]

Read in all data and return it as an iterable of rows. The implementation may choose to read all rows into memory or stream them through an iterator.

The name is the identifier associated with the dataset being stored and should be used to construct any files or tables required by the storage implementation.

load_backward(name: str) → Iterable[Dict[str, Any]]

Load data from the store in reverse order. In other words, the first row returned is the row that was most recently added to the store, and so on.

TODO: Consider making this abstract on the base class

static name() → str

A human-readable name for the storage implementation. Intended for use in the UI.

By convention, this should be the class name, converted to kebab case. So a storage class called FancyDatabase would be named “fancy-database”.

name_to_path(name: str) → str

Convert a name to a file path with the correct extension.

replace(name: str, data: Iterable[Dict[str, Any]]) → StoreResult

Delete all data currently stored and replace it with the given rows.

The name is the identifier associated with the dataset being stored and should be used to construct any files or tables required by the storage implementation.

class mtdata.storage.Storage(namespace: str)

Bases: ABC

A generic storage manager that can handle writing data to a file or other persistence mechanism.

abstract append(name: str, data: Iterable[Dict[str, Any]], dedup_facets: Iterable[str], dedup_fields: Iterable[str]) → StoreResult

Append some number of rows to the data currently stored. The existing data should remain untouched and the new data should, where it makes sense, be stored “after” the existing data.

The name is the identifier associated with the dataset being stored and should be used to construct any files or tables required by the storage implementation.

If dedup_facets and dedup_fields are non-empty, then de-duplication must occur before the new data are stored. See the documentation for Dataset for an explanation of these fields.

static dedup(existing_data: Iterable[Dict[str, Any]], new_data: Iterable[Dict[str, Any]], dedup_facets: Iterable[str] = (), dedup_fields: Iterable[str] = ()) → Iterable[Dict[str, Any]]

A helper function to de-duplicate data based on the given facets and fields. This algorithm won’t work for every possible case, but it ought to cover the most common situations nicely.

Important: the existing_data iterable MUST be in reverse-chronological order. In other words, the first element of this iterable must be the most recent row added to the store.

If dedup_facets are provided, then for each new row, search backward through the existing data to find the most recent row that matches on those facets, then compare based on the dedup_fields. If they match, then the row will not be included in the returned iterable.

If there are no dedup_facets, but there are dedup_fields, then grab the most recent row from the stored data and compare it against each row of new data. If any of the new rows match, then drop that row and all rows that occurred before it, and add the remaining rows to the returned iterable.

If both dedup parameters are empty, then the new data are passed through unfiltered.

>>> list(Storage.dedup(
...   reversed([{'a': 1}, {'a': 2}]),
...   [{'a': 3}], [], ['a']))
[{'a': 3}]
>>> list(Storage.dedup(
...   reversed([{'a': 1}, {'a': 2}]),
...   [{'a': 2}, {'a': 3}], [], ['a']))
[{'a': 3}]
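
The rules described for dedup can be sketched as follows. This is an illustration written from the description, not the package's code, and it makes several assumptions that the description leaves open: facet and field values are hashable, a missing key is treated as None, the first matching new row is the cut-off point when there are no facets, and facets without fields leave the data unfiltered. The name dedup_sketch is hypothetical.

```python
from typing import Any, Dict, Iterable, List, Tuple

Row = Dict[str, Any]


def _key(row: Row, names: List[str]) -> Tuple[Any, ...]:
    """Project a row onto the given field names (missing keys become None)."""
    return tuple(row.get(name) for name in names)


def dedup_sketch(existing_data: Iterable[Row], new_data: Iterable[Row],
                 dedup_facets: Iterable[str] = (),
                 dedup_fields: Iterable[str] = ()) -> List[Row]:
    facets, fields = list(dedup_facets), list(dedup_fields)
    new_rows = list(new_data)
    if not fields:
        return new_rows  # nothing to compare on, so pass everything through
    if facets:
        # existing_data is newest-first, so the first row seen for each
        # facet value is the most recent one for that facet.
        latest: Dict[Tuple[Any, ...], Row] = {}
        for row in existing_data:
            latest.setdefault(_key(row, facets), row)
        kept = []
        for row in new_rows:
            previous = latest.get(_key(row, facets))
            if previous is None or _key(previous, fields) != _key(row, fields):
                kept.append(row)
        return kept
    # No facets: compare only against the single most recent stored row.
    most_recent = next(iter(existing_data), None)
    if most_recent is None:
        return new_rows
    target = _key(most_recent, fields)
    for index, row in enumerate(new_rows):
        if _key(row, fields) == target:
            return new_rows[index + 1:]  # drop the match and all rows before it
    return new_rows


stored = [{'loc': 'x', 't': 1}, {'loc': 'y', 't': 1}]
fresh = [{'loc': 'x', 't': 1}, {'loc': 'y', 't': 2}]
kept = dedup_sketch(reversed(stored), fresh, ['loc'], ['t'])
```

In this sketch the facet branch reads all of the existing data once to build its index. An implementation that stops searching as soon as every facet in the new data has been found would avoid scanning a large store.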

get_path(name: str, extension: str) → str

A helper for implementations that use the filesystem. Returns a path to a file with the given name and extension, located in a directory determined by the namespace property.

abstract load(name: str) → Iterable[Dict[str, Any]]

Read in all data and return it as an iterable of rows. The implementation may choose to read all rows into memory or stream them through an iterator.

The name is the identifier associated with the dataset being stored and should be used to construct any files or tables required by the storage implementation.

abstract static name() → str

A human-readable name for the storage implementation. Intended for use in the UI.

By convention, this should be the class name, converted to kebab case. So a storage class called FancyDatabase would be named “fancy-database”.
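
The naming convention can be derived mechanically from a class name. The helper below is a hypothetical sketch of one way to do that with a regular expression. Whether the real storage classes compute their names or hard-code them, and how they treat acronyms, is not stated here.

```python
import re

# A hyphen goes between a lowercase letter or digit and an uppercase letter,
# and between an acronym and the capitalised word that follows it.
_BOUNDARY = re.compile(r'(?<=[a-z0-9])(?=[A-Z])|(?<=[A-Z])(?=[A-Z][a-z])')


def kebab_case_sketch(class_name: str) -> str:
    """Convert a CamelCase class name to lowercase, hyphen-separated form."""
    return _BOUNDARY.sub('-', class_name).lower()


label = kebab_case_sketch('FancyDatabase')  # 'fancy-database'
```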

property namespace: str

The namespace is tied to the instance of the running software and should be used to construct storage paths.

abstract replace(name: str, data: Iterable[Dict[str, Any]]) → StoreResult

Delete all data currently stored and replace it with the given rows.

The name is the identifier associated with the dataset being stored and should be used to construct any files or tables required by the storage implementation.

class mtdata.storage.StoreResult(success: bool, message: str)

Bases: tuple

The result of a write operation on a store.

TODO: We should wrap read operation results as well

property message

Alias for field number 1

property success

Alias for field number 0

mtdata.transformer module

class mtdata.transformer.Transformer(fields: Iterable[Union[Tuple[str, str, Callable[[Any], Any]], Tuple[str, str]]] = ())

Bases: object

A transformation that can be applied to a single dataset row represented as a dictionary. The transformation can update field names and make arbitrary changes to field data.

When the transformation is applied, the following will happen:

  1. Any fields not included in the transformation will be pruned

  2. Fields that have old names specified will be renamed

  3. Values will be updated for fields that have update functions

The row will be transformed in-place but also returned to the caller.

>>> t = Transformer()
>>> t.add_field('a', 'A')
>>> t.add_field('b', 'B', lambda x: x.lower())
>>> t({'A': 'XYZ', 'B': 'XYZ'})
{'a': 'XYZ', 'b': 'xyz'}

add_field(name: str, old_name: Optional[str] = None, updater: Callable[[Any], Any] = <function Transformer.<lambda>>) → None

Add a field to the transformation.

>>> t = Transformer()
>>> t.add_field('a', 'A', lambda x: x.lower())
>>> t._name_mapping
{'A': 'a'}
>>> t._update_functions['a']('ABC')
'abc'
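
The prune, rename, update sequence can be sketched as a small callable class. TransformerSketch is a hypothetical stand-in written from the description and doctests above. It reuses the _name_mapping and _update_functions attribute names seen in the doctest, but it assumes that constructor tuples follow the same (name, old_name, updater) order as add_field, which the signature alone does not confirm.

```python
from typing import Any, Callable, Dict, Iterable, Optional, Tuple


class TransformerSketch:
    """Hypothetical re-creation of the documented transformation steps."""

    def __init__(self, fields: Iterable[Tuple] = ()):
        self._name_mapping: Dict[str, str] = {}  # old name -> new name
        self._update_functions: Dict[str, Callable[[Any], Any]] = {}
        for field in fields:
            self.add_field(*field)  # assumption: same order as add_field

    def add_field(self, name: str, old_name: Optional[str] = None,
                  updater: Callable[[Any], Any] = lambda value: value) -> None:
        # A field with no old name keeps the name it already has.
        self._name_mapping[old_name or name] = name
        self._update_functions[name] = updater

    def __call__(self, row: Dict[str, Any]) -> Dict[str, Any]:
        # 1. Prune fields that are not part of the transformation.
        for key in list(row):
            if key not in self._name_mapping:
                del row[key]
        # 2. Rename the remaining fields.
        renamed = {self._name_mapping[key]: value for key, value in row.items()}
        # 3. Apply each field's update function, writing back in place.
        row.clear()
        for key, value in renamed.items():
            row[key] = self._update_functions[key](value)
        return row


transform = TransformerSketch([('a', 'A'), ('b', 'B', str.lower)])
row = {'A': 'XYZ', 'B': 'XYZ', 'C': 'dropped'}
result = transform(row)  # the same dict object, now {'a': 'XYZ', 'b': 'xyz'}
```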

Module contents

mtdata - a tool for extracting and curating public data