Mt. Data¶
Mt. Data (pronounced “mount data”, like “Mount Everest”) is a tool to help people collect and store (mountains of) public data from a variety of sources.
Developer Documentation¶
It is relatively straightforward to extend Mt. Data in various ways. The documentation below walks through several common tasks such as adding a new dataset.
mtdata¶
mtdata package¶
Subpackages¶
mtdata.datasets package¶
- class mtdata.datasets.air_quality.AirQuality¶
Bases: mtdata.dataset.Dataset
Air quality data for Montana.
- property dedup_facets: Iterable[str]¶
Fields used to determine which rows should be compared for de-duplication.
For example, if the data are sensor readings for various locations, the field that indicates the location would be listed here. That way, we don’t drop a new reading just because it occurred at the same time as a reading from a different location.
Uses the transformed version of the field names.
- property dedup_fields: Iterable[str]¶
Fields used to compare rows for de-duplication. This is likely to be some kind of timestamp, but that depends on the kind of data.
For example, if the data are sensor readings and each row is timestamped based on when the reading occurred, then that field will be listed here because two or more fetches might retrieve the same reading instance.
Uses the transformed version of the field names.
- fetch() → mtdata.dataset.FetchResult¶
Fetch new data from the source (generally the web).
- static name() → str¶
The dataset name, which is used in the UI and for things like file and table names.
- property transformer: mtdata.transformer.Transformer¶
The transformer to be applied to each row that is fetched from the data source before it is stored.
- class mtdata.datasets.missoula_911.Missoula911¶
Bases: mtdata.dataset.Dataset
Record of 911 events for Missoula city and county in Montana.
- property dedup_facets: Iterable[str]¶
Fields used to determine which rows should be compared for de-duplication.
For example, if the data are sensor readings for various locations, the field that indicates the location would be listed here. That way, we don’t drop a new reading just because it occurred at the same time as a reading from a different location.
Uses the transformed version of the field names.
- property dedup_fields: Iterable[str]¶
Fields used to compare rows for de-duplication. This is likely to be some kind of timestamp, but that depends on the kind of data.
For example, if the data are sensor readings and each row is timestamped based on when the reading occurred, then that field will be listed here because two or more fetches might retrieve the same reading instance.
Uses the transformed version of the field names.
- fetch() → mtdata.dataset.FetchResult¶
Fetch new data from the source (generally the web).
- static name() → str¶
The dataset name, which is used in the UI and for things like file and table names.
- property transformer: mtdata.transformer.Transformer¶
The transformer to be applied to each row that is fetched from the data source before it is stored.
- class mtdata.datasets.mt_covid_counts.CovidCounts¶
Bases: mtdata.dataset.Dataset
Covid-19 case data for Montana.
Source front-end display: https://www.arcgis.com/apps/MapSeries/index.html?appid=7c34f3412536439491adcc2103421d4b
- property dedup_facets: Iterable[str]¶
Fields used to determine which rows should be compared for de-duplication.
For example, if the data are sensor readings for various locations, the field that indicates the location would be listed here. That way, we don’t drop a new reading just because it occurred at the same time as a reading from a different location.
Uses the transformed version of the field names.
- property dedup_fields: Iterable[str]¶
Fields used to compare rows for de-duplication. This is likely to be some kind of timestamp, but that depends on the kind of data.
For example, if the data are sensor readings and each row is timestamped based on when the reading occurred, then that field will be listed here because two or more fetches might retrieve the same reading instance.
Uses the transformed version of the field names.
- fetch() → mtdata.dataset.FetchResult¶
Fetch new data from the source (generally the web).
- static name() → str¶
The dataset name, which is used in the UI and for things like file and table names.
- property transformer: mtdata.transformer.Transformer¶
The transformer to be applied to each row that is fetched from the data source before it is stored.
datasets - a collection of curated, built-in datasets
Submodules¶
mtdata.backward module¶
- mtdata.backward.read_backward(file: BinaryIO) → Iterable[str]¶
Read the lines of a text file (opened in binary mode), but backward, from the last line to the first line.
>>> from io import BytesIO
>>> list(read_backward(BytesIO(b'a\nb')))
['b', 'a\n']
>>> list(read_backward(BytesIO(b'a\nb\n')))
['b\n', 'a\n']
>>> list(read_backward(BytesIO(b'a\n\n')))
['\n', 'a\n']
>>> list(read_backward(BytesIO(b'\n\n')))
['\n', '\n']
>>> list(read_backward(BytesIO(b'\n')))
['\n']
>>> list(read_backward(BytesIO(b'')))
[]
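Beyond the in-memory cases above, a usage sketch (the file name here is hypothetical; any file opened in binary mode works):

from mtdata.backward import read_backward

# Stream the lines of a file newest-first without loading the whole file.
with open('example.jsonl', 'rb') as handle:
    for line in read_backward(handle):
        print(line, end='')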
mtdata.dataset module¶
- class mtdata.dataset.Dataset¶
Bases: abc.ABC
A single collection of data, represented as a table.
- abstract property dedup_facets: Iterable[str]¶
Fields used to determine which rows should be compared for de-duplication.
For example, if the data are sensor readings for various locations, the field that indicates the location would be listed here. That way, we don’t drop a new reading just because it occurred at the same time as a reading from a different location.
Uses the transformed version of the field names.
- abstract property dedup_fields: Iterable[str]¶
Fields used to compare rows for de-duplication. This is likely to be some kind of timestamp, but that depends on the kind of data.
For example, if the data are sensor readings and each row is timestamped based on when the reading occurred, then that field will be listed here because two or more fetches might retrieve the same reading instance.
Uses the transformed version of the field names.
- abstract fetch() → mtdata.dataset.FetchResult¶
Fetch new data from the source (generally the web).
- abstract static name() → str¶
The dataset name, which is used in the UI and for things like file and table names.
- abstract property transformer: mtdata.transformer.Transformer¶
The transformer to be applied to each row that is fetched from the data source before it is stored.
- class mtdata.dataset.FetchResult(success: bool, message: str, data: Iterable[Dict[str, Any]])¶
Bases: tuple
The result of a completed fetch operation. If the operation was successful (data were acquired, or there were no new data available) then success should be True, False otherwise.
Upon success, it is reasonable for the message to be the empty string. However, if the operation failed, then the message field should contain some kind of explanation suitable for presentation to the user and inclusion in log files.
If the fetch was successful, then data should contain the rows that were acquired from the data source. It may be empty if there were no rows available (this will depend on the source). It should also be empty on failure.
- property data¶
Alias for field number 2
- property message¶
Alias for field number 1
- property success¶
Alias for field number 0
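For illustration, here is how a fetch implementation might construct results for the two cases (the message text and row contents are hypothetical):

from mtdata.dataset import FetchResult

# Success: data were acquired, so an empty message is fine.
ok = FetchResult(success=True, message='', data=[{'id': 1}])

# Failure: explain what went wrong, and leave data empty.
failed = FetchResult(success=False, message='HTTP 503 from source', data=[])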
mtdata.fields module¶
- mtdata.fields.prune_fields(row: Dict[str, Any], keys: Iterable[str]) → Dict[str, Any]¶
Remove fields from the row that are not included in the iterable of keys provided. The row is mutated, but also returned to the caller.
>>> prune_fields({'a': 1, 'b': 2}, ['b'])
{'b': 2}
- mtdata.fields.rename_fields(row: Dict[str, Any], mapping: Dict[str, str]) → Dict[str, Any]¶
Rename the fields of the given row using the name mapping provided. The row is mutated, but also returned to the caller.
>>> rename_fields({'A': 1, 'b': 2}, {'A': 'a', 'b': 'b'}) 
{'a': 1, 'b': 2}
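Because both helpers mutate the row in place and return it, they can be applied in sequence to a freshly fetched row. A small sketch (the field names are hypothetical):

from mtdata.fields import prune_fields, rename_fields

row = {'Site Name': 'Missoula', 'PM2.5': '12.0', 'extra': None}
rename_fields(row, {'Site Name': 'site', 'PM2.5': 'pm25'})
prune_fields(row, ['site', 'pm25'])
# row is now {'site': 'Missoula', 'pm25': '12.0'}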
mtdata.manifest module¶
- mtdata.manifest.get_dataset(name: str) → Optional[Type[mtdata.dataset.Dataset]]¶
Get the dataset class with the given name, or None if there is no dataset implementation in the manifest with that name.
- mtdata.manifest.get_store(name: str) → Optional[Type[mtdata.storage.Storage]]¶
Get the store class with the given name, or None if there is no store implementation in the manifest with that name.
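A lookup sketch (the name string here is hypothetical; it must match what the class’s name() method returns):

from mtdata.manifest import get_dataset

dataset_cls = get_dataset('air-quality')  # hypothetical name string
if dataset_cls is None:
    raise SystemExit('no such dataset')
result = dataset_cls().fetch()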
mtdata.parameters module¶
- class mtdata.parameters.Parameters(datasets: Tuple[str], list_datasets: bool, list_stores: bool, namespace: str, stores: Tuple[str])¶
Bases: tuple
Parameters supported by the CLI.
- property datasets¶
Alias for field number 0
- property list_datasets¶
Alias for field number 1
- property list_stores¶
Alias for field number 2
- property namespace¶
Alias for field number 3
- property stores¶
Alias for field number 4
- mtdata.parameters.comma_tuple(arg: str) → Tuple[str, ...]¶
A “type” that can be used with ArgumentParser to split a comma-delimited list of values into an actual list.
TODO: Add a “type” to this that converts the values
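A sketch of wiring comma_tuple into an ArgumentParser (the flag and the values are hypothetical):

from argparse import ArgumentParser
from mtdata.parameters import comma_tuple

parser = ArgumentParser()
parser.add_argument('--datasets', type=comma_tuple, default=())
args = parser.parse_args(['--datasets', 'air-quality,missoula-911'])
# args.datasets == ('air-quality', 'missoula-911')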
- mtdata.parameters.parse_parameters(args: List[str]) → mtdata.parameters.Parameters¶
Turn a list of command line arguments into a Parameters object.
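As a usage sketch, assuming flag names that mirror the Parameters fields (the flag names are not confirmed by this documentation):

from mtdata.parameters import parse_parameters

params = parse_parameters(['--datasets', 'air-quality'])  # flag name is an assumption
print(params.datasets, params.namespace)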
mtdata.registry module¶
mtdata.row module¶
mtdata.storage module¶
- class mtdata.storage.CSVBasic(namespace: str)¶
Bases: mtdata.storage.Storage
A minimal CSV implementation that uses a DictWriter to write rows to the indicated file.
- append(name: str, data: Iterable[Dict[str, Any]], dedup_facets: Iterable[str], dedup_fields: Iterable[str]) → mtdata.storage.StoreResult¶
Append some number of rows to the data currently stored. The existing data should remain untouched and the new data should, where it makes sense, be stored “after” the existing data.
The name is the identifier associated with the dataset being stored and should be used to construct any files or tables required by the storage implementation.
If dedup_facets and dedup_fields are non-empty, then de-duplication must occur before the new data are stored. See the documentation for Dataset for an explanation of these fields.
- load(name: str) → Iterable[Dict[str, Any]]¶
Read in all data and return it as an iterable of rows. The implementation may choose to read all rows into memory or stream them through an iterator.
The name is the identifier associated with the dataset being stored and should be used to construct any files or tables required by the storage implementation.
- load_backward(name: str) → Iterable[Dict[str, Any]]¶
Load the data in reverse order. Used for de-duplication.
- static name() → str¶
A human-readable name for the storage implementation. Intended for use in the UI.
By convention, this should be the class name, converted to kebab case. So a storage class called FancyDatabase would be named “fancy-database”.
- name_to_path(name: str) → str¶
Name to path conversion that assumes the file extension.
- replace(name: str, data: Iterable[Dict[str, Any]]) → mtdata.storage.StoreResult¶
Delete all data currently stored and replace it with the given rows.
The name is the identifier associated with the dataset being stored and should be used to construct any files or tables required by the storage implementation.
- class mtdata.storage.JsonLines(namespace: str)¶
Bases: mtdata.storage.Storage
A storage implementation that writes each row as a single, JSON-formatted line in a file. This allows the data to be “streamed” back without reading in the entire file. It also allows efficient append operations since the old data needn’t be loaded in order to add more.
Data are stored and retrieved in the order they are appended or replaced. Therefore, as long as data are always added in chronological order, they will remain in that order.
- append(name: str, data: Iterable[Dict[str, Any]], dedup_facets: Iterable[str], dedup_fields: Iterable[str]) → mtdata.storage.StoreResult¶
Append some number of rows to the data currently stored. The existing data should remain untouched and the new data should, where it makes sense, be stored “after” the existing data.
The name is the identifier associated with the dataset being stored and should be used to construct any files or tables required by the storage implementation.
If dedup_facets and dedup_fields are non-empty, then de-duplication must occur before the new data are stored. See the documentation for Dataset for an explanation of these fields.
- load(name: str) → Iterable[Dict[str, Any]]¶
Read in all data and return it as an iterable of rows. The implementation may choose to read all rows into memory or stream them through an iterator.
The name is the identifier associated with the dataset being stored and should be used to construct any files or tables required by the storage implementation.
- load_backward(name: str) → Iterable[Dict[str, Any]]¶
Load data from the store in reverse order. In other words, the first row returned is the row that was most recently added to the store, and so on.
TODO: Consider making this abstract on the base class
- static name() → str¶
A human-readable name for the storage implementation. Intended for use in the UI.
By convention, this should be the class name, converted to kebab case. So a storage class called FancyDatabase would be named “fancy-database”.
- name_to_path(name: str) → str¶
Convert a name to a file path with the correct extension.
- replace(name: str, data: Iterable[Dict[str, Any]]) → mtdata.storage.StoreResult¶
Delete all data currently stored and replace it with the given rows.
The name is the identifier associated with the dataset being stored and should be used to construct any files or tables required by the storage implementation.
- class mtdata.storage.Storage(namespace: str)¶
Bases: abc.ABC
A generic storage manager that can handle writing data to a file or other persistence mechanism.
- abstract append(name: str, data: Iterable[Dict[str, Any]], dedup_facets: Iterable[str], dedup_fields: Iterable[str]) → mtdata.storage.StoreResult¶
Append some number of rows to the data currently stored. The existing data should remain untouched and the new data should, where it makes sense, be stored “after” the existing data.
The name is the identifier associated with the dataset being stored and should be used to construct any files or tables required by the storage implementation.
If dedup_facets and dedup_fields are non-empty, then de-duplication must occur before the new data are stored. See the documentation for Dataset for an explanation of these fields.
- static dedup(existing_data: Iterable[Dict[str, Any]], new_data: Iterable[Dict[str, Any]], dedup_facets: Iterable[str] = (), dedup_fields: Iterable[str] = ()) → Iterable[Dict[str, Any]]¶
A helper function to de-duplicate data based on the given facets and fields. This algorithm won’t work for every possible case, but it ought to cover the most common situations nicely.
Important: the existing_data iterable MUST be in reverse-chronological order. In other words, the first element of this iterable must be the most recent row added to the store.
If dedup_facets are provided, then for each new row, search backward through the existing data to find the most recent row that matches on those facets, then compare based on the dedup_fields. If they match, then the row will not be included in the returned iterable.
If there are no dedup_facets, but there are dedup_fields, then grab the most recent row from the stored data and compare it against each row of new data. If any of the new rows match, then drop that row and all rows that occurred before it, and add the remaining rows to the returned iterable.
If both dedup parameters are empty, then the new data are passed through unfiltered.
>>> list(Storage.dedup(
...     reversed([{'a': 1}, {'a': 2}]),
...     [{'a': 3}], [], ['a']))
[{'a': 3}]
>>> list(Storage.dedup(
...     reversed([{'a': 1}, {'a': 2}]),
...     [{'a': 2}, {'a': 3}], [], ['a']))
[{'a': 3}]
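The examples above exercise the fields-only path. A faceted sketch, following the documented semantics (the rows and field names are hypothetical):

existing = [
    {'site': 'A', 'ts': 1},
    {'site': 'B', 'ts': 1},
]
new = [
    {'site': 'A', 'ts': 1},  # matches site A's most recent reading: dropped
    {'site': 'A', 'ts': 2},  # new timestamp for site A: kept
]
kept = list(Storage.dedup(reversed(existing), new,
                          dedup_facets=['site'], dedup_fields=['ts']))
# kept == [{'site': 'A', 'ts': 2}]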
- get_path(name: str, extension: str) → str¶
A helper for implementations that use the filesystem. Returns a path to a file with the given name and extension, located in a directory determined by the namespace property.
- abstract load(name: str) → Iterable[Dict[str, Any]]¶
Read in all data and return it as an iterable of rows. The implementation may choose to read all rows into memory or stream them through an iterator.
The name is the identifier associated with the dataset being stored and should be used to construct any files or tables required by the storage implementation.
- abstract static name() → str¶
A human-readable name for the storage implementation. Intended for use in the UI.
By convention, this should be the class name, converted to kebab case. So a storage class called FancyDatabase would be named “fancy-database”.
- property namespace: str¶
The namespace is tied to the instance of the running software and should be used to construct storage paths.
- abstract replace(name: str, data: Iterable[Dict[str, Any]]) → mtdata.storage.StoreResult¶
Delete all data currently stored and replace it with the given rows.
The name is the identifier associated with the dataset being stored and should be used to construct any files or tables required by the storage implementation.
mtdata.transformer module¶
- class mtdata.transformer.Transformer(fields: Iterable[Union[Tuple[str, str, Callable[[Any], Any]], Tuple[str, str]]] = ())¶
Bases: object
A transformation that can be applied to a single dataset row represented as a dictionary. The transformation can update field names and make arbitrary changes to field data.
When the transformation is applied, the following will happen:
1. Any fields not included in the transformation will be pruned
2. Fields that have old names specified will be renamed
3. Values will be updated for fields that have update functions
The row will be transformed in-place but also returned to the caller.
>>> t = Transformer()
>>> t.add_field('a', 'A')
>>> t.add_field('b', 'B', lambda x: x.lower())
>>> t({'A': 'XYZ', 'B': 'XYZ'})
{'a': 'XYZ', 'b': 'xyz'}
- add_field(name: str, old_name: Optional[str] = None, updater: Callable[[Any], Any] = <function Transformer.<lambda>>) → None¶
Add a field to the transformation.
>>> t = Transformer()
>>> t.add_field('a', 'A', lambda x: x.lower())
>>> t._name_mapping
{'A': 'a'}
>>> t._update_functions['a']('ABC')
'abc'
Module contents¶
mtdata - a tool for extracting and curating public data
Adding a Dataset¶
A dataset is implemented as a sub-class of the Dataset abstract class, placed in the mtdata.datasets package. Once the dataset is implemented, add the class to the list in mtdata.manifest. Once this is done, the new dataset will run by default when the mtdata module is run.
It may be easiest to adapt an existing dataset, e.g. mtdata.datasets.air_quality. See mtdata.dataset for specific documentation on mtdata.dataset.Dataset methods. A sketch of a new dataset follows.
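The skeleton below is illustrative only: the dataset, its field names, and the canned row are hypothetical, and a real fetch() would download rows from the web.

from typing import Iterable

from mtdata.dataset import Dataset, FetchResult
from mtdata.transformer import Transformer


class RiverGauges(Dataset):
    """Hypothetical dataset of river gauge readings."""

    @staticmethod
    def name() -> str:
        # Used in the UI and for file/table names.
        return 'river-gauges'

    @property
    def transformer(self) -> Transformer:
        t = Transformer()
        t.add_field('station', 'Station Name')
        t.add_field('timestamp', 'Reading Time')
        t.add_field('height_ft', 'Gauge Height', lambda v: float(v))
        return t

    @property
    def dedup_facets(self) -> Iterable[str]:
        # Compare readings station-by-station (transformed field names).
        return ['station']

    @property
    def dedup_fields(self) -> Iterable[str]:
        # Two fetches of the same station/timestamp are duplicates.
        return ['timestamp']

    def fetch(self) -> FetchResult:
        # A real implementation downloads rows here; this sketch returns
        # a canned row in the source's (pre-transform) field names.
        rows = [{'Station Name': 'Clark Fork',
                 'Reading Time': '2021-06-01T12:00:00',
                 'Gauge Height': '3.2'}]
        return FetchResult(success=True, message='', data=rows)

Once the class exists, adding it to the list in mtdata.manifest makes it visible to get_dataset and includes it in the default run.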
Adding a Store¶
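Adding a store follows the same pattern as adding a dataset: implement a sub-class of the Storage abstract class, provide the abstract append, load, name, and replace members, and add the class to the list in mtdata.manifest so that get_store can find it. It may be easiest to adapt an existing store such as mtdata.storage.JsonLines or mtdata.storage.CSVBasic.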
Installation¶
Concepts¶
A dataset is a collection of related data, usually a single table. Each dataset knows how to download updates to the data, how to transform fields names and values, and which fields are useful for de-duplication.
A store is a means of persisting acquired data. For example, a particular store might know how to talk to a database to save data collected by datasets as database tables.
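The two concepts compose: a fetch produces rows, the dataset’s transformer cleans each row, and a store persists the result. An end-to-end sketch (the dataset name string is hypothetical; 'json-lines' follows the documented naming convention for the built-in JsonLines store):

from mtdata.manifest import get_dataset, get_store

dataset_cls = get_dataset('missoula-911')  # hypothetical name string
store_cls = get_store('json-lines')

dataset = dataset_cls()
store = store_cls(namespace='example')

result = dataset.fetch()
if result.success:
    # Transform each row before storage, as the Dataset docs describe.
    rows = [dataset.transformer(row) for row in result.data]
    store.append(dataset.name(), rows,
                 dataset.dedup_facets, dataset.dedup_fields)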