MDF Connect Client¶

class mdf_connect_client.MDFConnectClient(test=False, service_instance=None, authorizer=None)[source]¶

The MDF Connect Client is the Python client to easily submit datasets to MDF Connect.

__init__(test=False, service_instance=None, authorizer=None)[source]¶

Create an MDF Connect Client.

Parameters:
test (bool) – When `False`, the dataset will be processed normally. When `True`, the dataset will be processed, but submitted to test/sandbox/temporary resources instead of live resources. This includes the `mdf-test` Search index and test DOIs minted with MDF Publish. Default: `False`
service_instance (str) – The instance of the MDF Connect API to use. This value should not normally be changed from the default. Default: `None`, to use the default API instance.
authorizer (globus_sdk.GlobusAuthorizer) – The authorizer to use for authentication. This value should not normally be changed from the default. Default: `None`, to run the standard authentication flow.
Returns:	An initialized, authenticated MDF Connect Client.
Return type:	MDFConnectClient

accept_curation_submission(source_id, reason=None, prompt=True, raw=False)[source]¶

Complete a curation task by accepting the submission. You must have curation permissions on the selected submission.

Parameters:
source_id (str) – The `source_id` (`source_name` + version information) of the curation task. You can acquire this through `get_available_curation_tasks()`.
reason (str) – The reason for accepting this submission. Default: `None`, to use a generic acceptance reason.
prompt (bool) – When `True`, will prompt the user to confirm action selection, with a summary of the selected task. When `False`, will not require confirmation. Default: `True`
raw (bool) – When `False`, will print the result. When `True`, will return a dictionary of the full result. For direct human consumption, `False` is recommended. Default: `False`
Returns:	The full task results.
Return type:	if raw is `True`, dict

add_data_destination(data_destination)[source]¶

Add a data destination to your submission. Note that this method is cumulative, so calls do not overwrite previous ones.

Parameters:

data_destination (str or list of str) –

The destination for the data. Destinations must be Globus Endpoints, and formatted with protocol.

Example

"globus://endpoint123/path/data.out"

add_data_source(data_source)[source]¶

Add a data source to your submission. Note that this method is cumulative, so calls do not overwrite previous ones.

Parameters:

data_source (str or list of str) –

The location(s) of the data. These should be formatted with protocol.

Examples

"https://example.com/path/data.zip"
"https://www.globus.org/app/transfer?..."
"globus://endpoint123/path/data.out"
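Since locations should be formatted with a protocol, it can help to check strings locally before adding them. The following is a minimal sketch using only the standard library; the helper name `has_protocol` is hypothetical and is not part of the client, and the real service may apply stricter rules than this check.

```python
from urllib.parse import urlparse


def has_protocol(location):
    """Return True if the string has both a scheme and a host/endpoint part.

    Hypothetical local check; not part of mdf_connect_client.
    """
    parsed = urlparse(location)
    return bool(parsed.scheme and parsed.netloc)


sources = [
    "https://example.com/path/data.zip",
    "globus://endpoint123/path/data.out",
    "endpoint123/path/data.out",  # protocol missing
]

# Collect the locations that would need fixing before calling add_data_source().
missing_protocol = [s for s in sources if not has_protocol(s)]
print(missing_protocol)
```

The same check applies to data destinations, which must also be formatted with a protocol.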

add_index(data_type, mapping, delimiter=None, na_values=None)[source]¶

Add indexing instructions for your dataset. This method can be called multiple times for multiple data types, but multiple calls with the same data type will overwrite each other.

Parameters:

data_type (str) – The type of data to apply to. Supported types are: json, csv, yaml, xml, excel, and filename.

mapping (dict) –

The mapping of MDF fields to your data type’s fields. It is strongly recommended that you use “dot notation”, where nested JSON objects are represented with a period.

Examples:

{
    "material.composition": "my_json.data.stuff.comp",
    "dft.converged": "my_json.data.dft.convgd"
}
{
    "material.composition": "csv_header_1",
    "crystal_structure.space_group_number": "csv_header_2"
}

delimiter (str) – The character that delimits cells in a table. Only applicable to tabular data. Default: comma.
na_values (str or list of str) – Values to treat as N/A (not applicable/available). Applies to all values. Default: For tabular data, blank and space. For other data, None (no N/A values).
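To make the "dot notation" in the mapping concrete, here is a small sketch of how a dotted path can be read as a walk through nested JSON objects. It only illustrates the notation; the helpers `resolve_dotted` and `apply_mapping` and the sample record are assumptions made for this example and do not describe how MDF Connect performs extraction.

```python
def resolve_dotted(record, dotted_path):
    """Follow a dotted path through nested dicts; return None if any key is missing."""
    current = record
    for key in dotted_path.split("."):
        if not isinstance(current, dict) or key not in current:
            return None
        current = current[key]
    return current


def apply_mapping(record, mapping):
    """Build {mdf_field: value} for every mapped path that exists in the record."""
    result = {}
    for mdf_field, source_path in mapping.items():
        value = resolve_dotted(record, source_path)
        if value is not None:
            result[mdf_field] = value
    return result


# Hypothetical JSON content shaped to match the first mapping example.
record = {"my_json": {"data": {"stuff": {"comp": "Al2O3"}, "dft": {"convgd": True}}}}
mapping = {
    "material.composition": "my_json.data.stuff.comp",
    "dft.converged": "my_json.data.dft.convgd",
}
mapped = apply_mapping(record, mapping)
print(mapped)
```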

add_links(links)[source]¶

Add links to a dataset.

Parameters:	links (dict or list of dict) – The link(s) to add. Each link should be of the form {"type": str, "doi": str, "url": str, "description": str, "bibtex": str}

add_organization(organization)[source]¶

Add your dataset to an organization.

Parameters:	organization (str or list of str) – The organization(s) to add. If the organization is not registered with MDF, it will be discarded. Parent organizations will be added automatically.

add_service(service, parameters=None)[source]¶

Add a service for data submission.

Parameters:

service (str) –
The integrated service to submit your dataset to. Connected services include:
- mdf_publish (publication with DOI minting)
- citrine (industry-partnered machine-learning specialists)
- mrr (NIST Materials Resource Registry)
parameters (dict) –
Optional, service-specific parameters.
- For mdf_publish:
  - publication_location (str) - The Globus Endpoint and path on which to save the published files. It is recommended to leave this parameter unset, so that the dataset is published on MDF resources.
- For citrine:
  - public (bool) - When True, will make data public. Otherwise, it is inaccessible.

add_tag(tag)[source]¶

Add a tag or keyword to your dataset. Note that this method is cumulative, so calls do not overwrite previous ones.

Note

Setting tags here is equivalent to setting tags in create_dc_block(subjects=...). This method exists only for convenience.

Parameters:	tag (str or list of str) – The tag(s) to add.

check_all_submissions(verbose=False, active_only=False, include_tests=True, newer_than_date=None, older_than_date=None, raw=False, filters=None, _admin_code=None)[source]¶

Check the status of all of your submissions.

Parameters:

verbose (bool) – When False, will print a basic summary of your submissions. When True, will print the full status summary of each submission, as if you called check_status() on each. Has no effect if raw is True. Default: False
active_only (bool) – When True, will only print active submissions. Default: False
include_tests (bool) – When False, will only print non-test submissions. Default: True
newer_than_date (datetime or tuple of ints) – Exclude submissions made before this date. Accepts a datetime object or (year, month, day) as integers. Comparisons are made in UTC. Default: None, to set no maximum age.
older_than_date (datetime or tuple of ints) – Exclude submissions made after this date. Accepts a datetime object or (year, month, day) as integers. Comparisons are made in UTC. Default: None, to set no minimum age.
raw (bool) – When False, will print your submissions’ summaries. When True, will return the full status results. For direct human consumption, False is recommended. Default: False
filters (list of tuples) –

Advanced users only

Filters to apply to the status database scan. For a submission to be returned, all filters must match. Default: None. Format: (field, operator, value)

field: The status field to filter on.
operator: The relation of field to value. Valid operators:
- ^: Begins with
- *: Contains
- ==: Equal to (or field does not exist, if value is None)
- !=: Not equal to (or field exists, if value is None)
- >: Greater than
- >=: Greater than or equal to
- <: Less than
- <=: Less than or equal to
- []: Between, inclusive (requires a list of two values)
- in: Is one of the values (requires a list of values). This operator effectively allows OR-ing '=='.
value: The value of the field.
_admin_code (str) –
For MDF Connect administrators only, a special function code. Valid codes:
- all: All submission statuses
- active: All active submission statuses
Only MDF Connect administrators are allowed to use these codes. Default: None, the only valid value for non-admins.

Note about date filtering: Dates are compared in UTC, at exactly 0:00 (12:00am). This means that the two dates cannot be the same, as they would filter out all submissions not made at exactly 0:00:00 on the chosen date. To see submissions made on a specific date, set older_than_date to the day after the date in question. For example, to see submissions from Feb 11, 2020, use newer_than_date=(2020, 2, 11), older_than_date=(2020, 2, 12).
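The one-day window described in the note can be computed rather than typed by hand, which avoids mistakes at month and year boundaries. This is a sketch using the standard library; `single_day_window` is a hypothetical helper, not a client method.

```python
from datetime import date, timedelta


def single_day_window(year, month, day):
    """Return keyword arguments that select submissions made on one UTC date.

    Hypothetical helper: newer_than_date is the chosen day and
    older_than_date is the following day, both as (year, month, day).
    """
    start = date(year, month, day)
    end = start + timedelta(days=1)
    return {
        "newer_than_date": (start.year, start.month, start.day),
        "older_than_date": (end.year, end.month, end.day),
    }


window = single_day_window(2020, 2, 11)
print(window)
# With an authenticated client, this could be passed as:
# client.check_all_submissions(**window)
```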

Returns:	The full status results.
Return type:	if raw is `True`, dict
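For the advanced `filters` argument, a local check of the tuple shape and operator can catch mistakes before a request is made. The sketch below only encodes the format rules listed above; `validate_filters` is a hypothetical helper, and the field names `source_id` and `submission_time` are assumptions used for illustration, not a list of real status fields.

```python
# Operators as listed in the filters description above.
VALID_OPERATORS = {"^", "*", "==", "!=", ">", ">=", "<", "<=", "[]", "in"}


def validate_filters(filters):
    """Return a list of problems found in a list of (field, operator, value) tuples."""
    problems = []
    for index, item in enumerate(filters):
        if not isinstance(item, tuple) or len(item) != 3:
            problems.append(f"filter {index}: expected a (field, operator, value) tuple")
            continue
        _field, operator, value = item
        if operator not in VALID_OPERATORS:
            problems.append(f"filter {index}: unknown operator {operator!r}")
        elif operator == "[]" and not (isinstance(value, list) and len(value) == 2):
            problems.append(f"filter {index}: '[]' requires a list of two values")
        elif operator == "in" and not isinstance(value, list):
            problems.append(f"filter {index}: 'in' requires a list of values")
    return problems


# Field names here are assumed for the example.
filters = [
    ("source_id", "^", "my_dataset"),
    ("submission_time", "[]", ["2020-02-11", "2020-02-12"]),
]
print(validate_filters(filters))
```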

check_status(source_id=None, short=False, raw=False)[source]¶

Check the status of your submission. You may only check the status of your own submissions.

Parameters:
source_id (str) – The `source_id` (`source_name` + version information) of the submission to check. Default: `self.source_id`
short (bool) – When `False`, will print a status summary containing all of the status steps for the dataset. When `True`, will print a short finished/processing message, useful for checking many datasets' status at once. Default: `False`
raw (bool) – When `False`, will print a nicely-formatted status summary. When `True`, will return the full status result. For direct human consumption, `False` is recommended. Default: `False`
Returns:	The full status result.
Return type:	If `raw` is `True`, dict

clear_base_acl()[source]¶: Reset the base ACL of your dataset to the default value ["public"].

clear_data_destinations()[source]¶: Clear all data destinations added so far to your dataset.

clear_data_sources()[source]¶: Clear all data sources added so far to your dataset.

clear_dataset_acl()[source]¶: Remove all Globus UUIDs from the dataset ACL for your dataset.

clear_external_uri()[source]¶: Remove any set external URI from your submission.

clear_index()[source]¶: Clear all indexing instructions set so far.

clear_links()[source]¶: Clear all links added so far to your dataset.

clear_organizations()[source]¶: Clear all added organizations from the submission.

clear_services()[source]¶: Clear all services added so far.

clear_source_name()[source]¶: Remove a previously set source_name.

clear_tags()[source]¶: Clear all tags added so far to your dataset.

create_dc_block(title, authors, affiliations=None, publisher=None, publication_year=None, resource_type=None, description=None, dataset_doi=None, related_dois=None, subjects=None, **kwargs)[source]¶

Create your submission’s dc block. This block is the DataCite block. Additional information on DataCite fields is available from the official DataCite website: https://schema.datacite.org/meta/kernel-4.1/

Parameters:

title (str or list of str) – The title(s) of the dataset.
authors (str or list of str) – The author(s) of the dataset. The name will be automatically parsed into given name and family name.
publisher (str) – The publisher of the dataset (not an associated paper). Default: The Materials Data Facility.
publication_year (int or str) – The year of dataset publication. Default: The current year.
resource_type (str) – The type of resource. Except in unusual cases, this should be "Dataset". Default: "Dataset"

affiliations (str or list of str or list of list of str) –

The affiliations of the authors, in the same order. If a different number of affiliations are given, all affiliations will be applied to all authors. Multiple affiliations can be given as a list. Default: None for no affiliations for any author.

Examples:

authors = ["Fromnist, Alice", "Fromnist; Bob", "Cathy Multiples"]
# All authors are from NIST
affiliations = "NIST"
# All authors are from both NIST and UChicago
affiliations = ["NIST", "UChicago"]
# Alice and Bob are from NIST, Cathy is from NIST and UChicago
affiliations = ["NIST", "NIST", ["NIST", "UChicago"]]

# This is incorrect! If applying affiliations to all authors,
# lists must not be nested.
affiliations = ["NIST", ["NIST", "UChicago"], "Argonne", "Oak Ridge"]

description (str) – A description of the dataset. Default: None for no description.
dataset_doi (str) – The DOI for this dataset (not an associated paper). Default: None
related_dois (str or list of str) – DOIs related to this dataset, not including the dataset’s own DOI (for example, an associated paper’s DOI). Default: None
subjects (str or list of str) – Subjects (in DataCite terminology) or tags related to the dataset. Default: None

Any further keyword arguments will be added to the DataCite metadata (the dc block). These arguments should be valid DataCite, as listed in the MDF Connect documentation. This is completely optional.
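The affiliation-matching rule described above (match in order when the counts agree, otherwise apply every affiliation to every author) can be sketched as follows. This illustrates the documented behavior only; `assign_affiliations` is a hypothetical function and is not the library's implementation.

```python
def assign_affiliations(authors, affiliations):
    """Map each author to a list of affiliations, following the documented rule."""
    if affiliations is None:
        return {author: [] for author in authors}
    if isinstance(affiliations, str):
        affiliations = [affiliations]
    # Same count as authors: match one entry per author, in order.
    if len(affiliations) == len(authors):
        return {
            author: [aff] if isinstance(aff, str) else list(aff)
            for author, aff in zip(authors, affiliations)
        }
    # Different count: every affiliation applies to every author,
    # so nested lists are not allowed.
    if any(not isinstance(aff, str) for aff in affiliations):
        raise ValueError("Nested lists are only valid with one entry per author")
    return {author: list(affiliations) for author in authors}


authors = ["Fromnist, Alice", "Fromnist; Bob", "Cathy Multiples"]
per_author = assign_affiliations(authors, ["NIST", "NIST", ["NIST", "UChicago"]])
print(per_author)
```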

create_mrr_block(mrr_data)[source]¶

Create the mrr block for your dataset. This helper should be more helpful in the future.

Parameters:	mrr_data (dict) – The MRR schema-compliant metadata.

get_available_curation_tasks(summary=True, raw=False, _admin_code=None)[source]¶

Get all curation tasks available to you.

Parameters:
summary (bool) – When `False`, will print the entire curation task, including dataset entry and sample records. When `True`, will only print a summary of the task. Using the summary is recommended to find specific tasks to get full task information on using `get_curation_task()`. Default: `True`
raw (bool) – When `False`, will print out summaries of your available curation tasks. When `True`, will return a dictionary containing the results. For direct human consumption, `False` is recommended. Default: `False`
_admin_code (str) – For MDF Connect administrators only, a special function code. Valid codes:
- `all`: All waiting curation tasks.
Only MDF Connect administrators are allowed to use these codes. Default: `None`, the only valid value for non-admins.
Returns:	The full task results.
Return type:	if raw is `True`, dict

get_curation_task(source_id, summary=False, raw=False)[source]¶

Get the content of a curation task. You must have curation permissions on the selected submission.

Parameters:
source_id (str) – The `source_id` (`source_name` + version information) of the curation task. You can acquire this through `get_available_curation_tasks()`.
summary (bool) – When `False`, will print the entire curation task, including the verbose dataset entry and sample records. When `True`, will only print a summary of the task. Default: `False`
raw (bool) – When `False`, will print the curation task. When `True`, will return a dictionary of the full result. Overrides the value of `summary`. For direct human consumption, `False` is recommended. Default: `False`
Returns:	The full task results.
Return type:	if raw is `True`, dict

get_submission()[source]¶

Fetch the current state of your submission.

Returns:	Your submission.
Return type:	dict

logout()[source]¶: Log out by removing cached tokens and discarding the client’s authorizer. Also clear the current submission, as it cannot be interacted with.

reject_curation_submission(source_id, reason=None, prompt=True, raw=False)[source]¶

Complete a curation task by rejecting the submission. You must have curation permissions on the selected submission.

Parameters:
source_id (str) – The `source_id` (`source_name` + version information) of the curation task. You can acquire this through `get_available_curation_tasks()`.
reason (str) – The reason for rejecting this submission. Default: `None`, to use a generic rejection reason.
prompt (bool) – When `True`, will prompt the user to confirm action selection, with a summary of the selected task. When `False`, will not require confirmation. Default: `True`
raw (bool) – When `False`, will print the result. When `True`, will return a dictionary of the full result. For direct human consumption, `False` is recommended. Default: `False`
Returns:	The full task results.
Return type:	if raw is `True`, dict

reset_submission()[source]¶

Reset and clear metadata from your submission.

Warning

This action cannot be undone.

The last submission’s source_id will also be cleared. If you want to use check_status, you will be required to input the source_id manually.

Returns:	The variables that are NOT cleared, which include:
- test (bool) - Whether the submission is a test submission.
- service_location (str) - The URL of the MDF Connect server in use.
Return type:	dict

set_base_acl(acl)[source]¶

Set the Access Control List for your entire dataset.

Parameters:	acl (str or list of str) – The Globus UUIDs of users or groups that should be granted full read access to the dataset, including records and files. Default: The special keyword `"public"`, which makes the dataset visible to everyone.

Warning

The identities listed in the base_acl of your submission can always see your submission, including the dataset entry, even if they are not listed in the dataset_acl. Because base_acl defaults to "public", your entire dataset will be public unless you specify a base_acl. MDF encourages you to make your data public, but if you do not want it public you must set this value.

set_curation(curation)[source]¶

Set the curation flag for this submission.

Note

Normally, this flag is set automatically by an organization, and is not set manually by the dataset submitter.

Parameters:	curation (bool) – When `False`, the dataset will be processed normally. When `True`, the dataset must be approved in curation before it will be ingested to MDF Search or any other service. Default: `False`

set_custom_block(custom_fields)[source]¶

Set the custom block for your dataset.

Parameters:	custom_fields (dict) – Custom field-value pairs for your dataset. You may add descriptions of your fields by creating a new field called `[field]_desc` with the string description inside, or by calling `set_custom_descriptions()`.

set_custom_descriptions(custom_descriptions)[source]¶

Add descriptions to your custom block.

Parameters:	custom_descriptions (dict) – Custom field-description pairs for your dataset. Field names in this argument must match field names added by calling `set_custom_block()`.
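The `[field]_desc` naming convention can be applied programmatically when building a custom block by hand. The sketch below merges a descriptions dict into a fields dict using that convention; `with_descriptions` and the field names are hypothetical, and whether the result is treated identically to calling `set_custom_descriptions()` is an assumption.

```python
def with_descriptions(custom_fields, custom_descriptions):
    """Return a copy of custom_fields with '<field>_desc' entries added.

    Hypothetical helper. Raises KeyError if a description refers to a
    field that is not present in custom_fields.
    """
    unknown = sorted(set(custom_descriptions) - set(custom_fields))
    if unknown:
        raise KeyError(f"Descriptions given for unknown fields: {unknown}")
    merged = dict(custom_fields)
    for field, description in custom_descriptions.items():
        merged[f"{field}_desc"] = description
    return merged


# Field names and values are made up for the example.
custom_block = with_descriptions(
    {"anneal_temp": 450, "furnace": "tube"},
    {"anneal_temp": "Annealing temperature in degrees Celsius"},
)
print(custom_block)
# With an authenticated client: client.set_custom_block(custom_block)
```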

set_dataset_acl(acl)[source]¶

Set the Access Control List for just the dataset entry of your dataset.

Parameters:	acl (str or list of str) – The Globus UUIDs of users or groups that should be granted read access only to the dataset entry for your dataset in MDF Search (this includes the author list, title, etc. but does not include extracted metadata in records or files). Anyone listed in the base ACL already has this permission.
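The relationship between the base ACL and the dataset ACL can be summarized with a toy model: base ACL members (or everyone, if it contains "public") see everything, while dataset ACL members see only the dataset entry. This sketch restates the rules described here and is not the service's access check; the `visibility` function and the identity strings are hypothetical.

```python
def visibility(identity, base_acl, dataset_acl=()):
    """Return what an identity can see under the documented ACL rules."""
    base = {base_acl} if isinstance(base_acl, str) else set(base_acl)
    dataset = {dataset_acl} if isinstance(dataset_acl, str) else set(dataset_acl)
    # The base ACL grants full read access; "public" covers everyone.
    full_access = "public" in base or identity in base
    # The dataset ACL adds access to the dataset entry only.
    entry_access = full_access or identity in dataset
    return {"dataset_entry": entry_access, "records_and_files": full_access}


# Placeholder identities; real ACLs hold Globus UUIDs.
print(visibility("user-b", base_acl=["user-a"], dataset_acl=["user-b"]))
```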

set_external_uri(uri)[source]¶

Set an external URI for your dataset. This is used to point at a landing page outside of MDF that also hosts the dataset.

Parameters:	uri (str) – The external URI.

set_extraction_config(config)[source]¶

Set advanced configuration parameters for dataset extraction. These parameters are intended for advanced users and/or special-case datasets.

Parameters:	config (dict) – The extraction configuration parameters.

set_incremental_update(source_id)[source]¶

Make this submission an incremental update of a previous submission. Incremental updates use the same submission metadata, except for whatever you specify in the new submission. For example, if you submit an incremental update and only include a data_source, the submission will run as if you copied the DC block and other metadata into the submission, but with the new data_source.

Note

You must still set update=True when submitting an incremental update.

Parameters:	source_id (str) – The `source_id` of the previous submission to update.
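The incremental-update flow described above can be sketched as a small function that takes an already-authenticated client. Only methods documented on this page are called; the wrapper name `submit_incremental_update` is hypothetical.

```python
def submit_incremental_update(client, previous_source_id, new_data_source):
    """Submit an incremental update that changes only the data source.

    `client` is expected to be an MDFConnectClient (or an object with the
    same methods). Returns the dict produced by submit_dataset().
    """
    client.set_incremental_update(previous_source_id)
    client.add_data_source(new_data_source)
    # update=True is still required for incremental updates.
    return client.submit_dataset(update=True)


# Usage with a real client (requires authentication):
# from mdf_connect_client import MDFConnectClient
# result = submit_incremental_update(
#     MDFConnectClient(test=True), "<previous source_id>",
#     "https://example.com/path/data.zip")
```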

set_passthrough(passthrough)[source]¶

Set the dataset pass-through flag for your submission.

Caution

This flag will cause metadata from your dataset’s files to not be extracted by MDF Connect, so only high-level dataset metadata will be available in MDF Search. This flag is only intended for datasets that cannot be extracted.

Parameters:	passthrough (bool) – When `False`, the dataset will be processed normally. When `True`, the metadata in the files will not be extracted. Default: `False`

set_project_block(project, data)[source]¶

Set the project block for your dataset. Intended only for use by members of an approved project. To delete a project block, call this method with data=None.

Parameters:
project (str) – The name of the project block.
data (dict) – The data for the project block.

set_source_name(source_name)[source]¶

Set the source name for your dataset.

Parameters:	source_name (str) – The desired source name. Must be unique for new datasets. Please note that your source name will be cleaned when submitted to Connect, so the actual `source_name` may differ from this value. Additionally, the `source_id` (which is the `source_name` plus version information) is required to fetch the status of a submission. `check_status()` can handle this for you.

set_test(test)[source]¶

Set the test flag for this dataset.

Parameters:	test (bool) – When `False`, the dataset will be processed normally. When `True`, the dataset will be processed, but submitted to test/sandbox/temporary resources instead of live resources. This includes the `mdf-test` Search index and test DOIs minted with MDF Publish. Default: `False`

submit_dataset(update=False, submission=None, reset=False)[source]¶

Submit your dataset to MDF Connect for processing.

Parameters:

update (bool) – If you wish to submit this dataset again, set this to True. If this is the first submission, leave this False. Default: False
submission (dict) – If you have assembled the Connect metadata yourself, you can submit it here. This argument supersedes any data set through other methods. Default: None, to use method-assembled data.
reset (bool) – If True, will clear the old submission. The test flag will be preserved. IMPORTANT: The source_id of the submission will not be saved if this argument is True. check_status will require you to pass the source_id as an argument. If False, the submission will be preserved. Default: False

Returns:

The submission information.

success (bool) - Whether the submission was successful.
source_id (string) - The source_id of your dataset, which is also saved in self.source_id. The source_id is the source_name plus version information. In other words, the source_name is unique to your dataset, and the source_id is unique to your submission of the dataset.
error (string) - Error message, if applicable.

Return type:

dict
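Putting the pieces together, a minimal first submission might look like the sketch below. It uses only methods and return keys documented on this page; the wrapper `submit_minimal_dataset` is hypothetical, and raising an exception on failure is a design choice of this example rather than client behavior.

```python
def submit_minimal_dataset(client, title, authors, data_source):
    """Create the DataCite block, add one data source, and submit.

    `client` is expected to be an authenticated MDFConnectClient.
    Returns the source_id on success; raises RuntimeError otherwise.
    """
    client.create_dc_block(title=title, authors=authors)
    client.add_data_source(data_source)
    result = client.submit_dataset()
    if not result.get("success"):
        raise RuntimeError(f"Submission failed: {result.get('error')}")
    return result["source_id"]


# Usage with a real client (requires authentication):
# from mdf_connect_client import MDFConnectClient
# source_id = submit_minimal_dataset(
#     MDFConnectClient(test=True), "Example dataset", ["Fromnist, Alice"],
#     "https://example.com/path/data.zip")
# The returned source_id can then be passed to check_status().
```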

submit_dataset_metadata_update(source_id, metadata_update=None, reset=False)[source]¶

Submit an update to a dataset entry (and NOT the data or record entries).

Parameters:

source_id (str) – The source_id of the dataset you wish to update. You must be the owner of the dataset.
metadata_update (dict) – If you have assembled the dataset metadata yourself, you can submit it here. This argument supersedes any data set through other methods. Default: None, to use method-assembled data.
reset (bool) – If True, will clear the old metadata from the client. The test flag will be preserved. If False, the metadata will be preserved. Default: False