MDF Connect Client
-
class mdf_connect_client.MDFConnectClient(test=False, service_instance=None, authorizer=None)
The MDF Connect Client is the Python client to easily submit datasets to MDF Connect.
-
__init__(test=False, service_instance=None, authorizer=None)
Create an MDF Connect Client.
Parameters:
- test (bool) – When False, the dataset will be processed normally. When True, the dataset will be processed, but submitted to test/sandbox/temporary resources instead of live resources. This includes the mdf-test Search index and test DOIs minted with MDF Publish. Default: False
- service_instance (str) – The instance of the MDF Connect API to use. This value should not normally be changed from the default. Default: None, to use the default API instance.
- authorizer (globus_sdk.GlobusAuthorizer) – The authorizer to use for authentication. This value should not normally be changed from the default. Default: None, to run the standard authentication flow.
Returns: An initialized, authenticated MDF Connect Client.
Return type: MDFConnectClient
-
accept_curation_submission(source_id, reason=None, prompt=True, raw=False)
Complete a curation task by accepting the submission. You must have curation permissions on the selected submission.
Parameters:
- source_id (str) – The source_id (source_name + version information) of the curation task. You can acquire this through get_available_curation_tasks().
- reason (str) – The reason for accepting this submission. Default: None, to use a generic acceptance reason.
- prompt (bool) – When True, will prompt the user to confirm action selection, with a summary of the selected task. When False, will not require confirmation. Default: True.
- raw (bool) – When False, will print the result. When True, will return a dictionary of the full result. For direct human consumption, False is recommended. Default: False
Returns: The full task results.
Return type: if raw is True, dict
-
add_data_destination(data_destination)
Add a data destination to your submission. Note that this method is cumulative, so calls do not overwrite previous ones.
Parameters: data_destination (str or list of str) – The destination for the data. Destinations must be Globus Endpoints, and formatted with protocol.
Example
"globus://endpoint123/path/data.out"
-
add_data_source(data_source)
Add a data source to your submission. Note that this method is cumulative, so calls do not overwrite previous ones.
Parameters: data_source (str or list of str) – The location(s) of the data. These should be formatted with protocol.
Examples
"https://example.com/path/data.zip"
"https://www.globus.org/app/transfer?..."
"globus://endpoint123/path/data.out"
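Data sources and destinations are expected to carry a protocol prefix. The following is a minimal sketch of a local pre-check, using only the standard library. The helper name normalize_locations is hypothetical and not part of the client, and the check only confirms that some scheme is present; it cannot tell whether MDF Connect accepts that scheme.

```python
from urllib.parse import urlparse


def normalize_locations(locations):
    """Return a list of location strings, rejecting any without a protocol.

    Hypothetical helper: accepts a str or a list of str, mirroring the
    documented argument types of add_data_source() and add_data_destination().
    """
    if isinstance(locations, str):
        locations = [locations]
    checked = []
    for location in locations:
        # urlparse() leaves .scheme empty when there is no "scheme:" prefix.
        if not urlparse(location).scheme:
            raise ValueError(f"Missing protocol in location: {location!r}")
        checked.append(location)
    return checked


# Accumulate locations across calls, as the cumulative methods do.
sources = normalize_locations("https://example.com/path/data.zip")
sources += normalize_locations(["globus://endpoint123/path/data.out"])
```

Each checked string (or the whole list) could then be passed to add_data_source().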
-
add_index(data_type, mapping, delimiter=None, na_values=None)
Add indexing instructions for your dataset. This method can be called multiple times for multiple data types, but multiple calls with the same data type will overwrite each other.
Parameters:
- data_type (str) – The type of data to apply to. Supported types are: json, csv, yaml, xml, excel, and filename.
- mapping (dict) – The mapping of MDF fields to your data type’s fields. It is strongly recommended that you use “dot notation”, where nested JSON objects are represented with a period.
  Examples:
  {
      "material.composition": "my_json.data.stuff.comp",
      "dft.converged": "my_json.data.dft.convgd"
  }
  {
      "material.composition": "csv_header_1",
      "crystal_structure.space_group_number": "csv_header_2"
  }
- delimiter (str) – The character that delimits cells in a table. Only applicable to tabular data. Default: comma.
- na_values (str or list of str) – Values to treat as N/A (not applicable/available). Applies to all values. Default: For tabular data, blank and space. For other data, None (no N/A values).
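To make the "dot notation" in the mapping concrete, here is a small sketch of how such a path can be followed through nested JSON. It is illustrative only and is not the extraction code MDF Connect runs. The helper resolve_dotted, the sample document, and the assumption that the path starts at the top of the document are all hypothetical.

```python
def resolve_dotted(record, dotted_path):
    """Follow a dot-notation path through nested dicts and return the value."""
    value = record
    for key in dotted_path.split("."):
        value = value[key]  # raises KeyError if a level is missing
    return value


# Hypothetical JSON document whose top-level key is "my_json".
document = {
    "my_json": {
        "data": {
            "stuff": {"comp": "Al2O3"},
            "dft": {"convgd": True},
        }
    }
}

mapping = {
    "material.composition": "my_json.data.stuff.comp",
    "dft.converged": "my_json.data.dft.convgd",
}

# Pair each MDF field with the value its path points at.
extracted = {
    mdf_field: resolve_dotted(document, path)
    for mdf_field, path in mapping.items()
}
```

The same mapping dictionary is what would be passed as the mapping argument of add_index("json", mapping).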
-
add_links(links)
Add links to a dataset.
Parameters: links (dict or list of dict) – The link(s) to add. Each link should be of the form {"type":str, "doi":str, "url":str, "description":str, "bibtex":str}
-
add_organization(organization)
Add your dataset to an organization.
Parameters: organization (str or list of str) – The organization(s) to add. If the organization is not registered with MDF, it will be discarded. Parent organizations will be added automatically.
-
add_service(service, parameters=None)
Add a service for data submission.
Parameters:
- service (str) – The integrated service to submit your dataset to. Connected services include:
  - mdf_publish (publication with DOI minting)
  - citrine (industry-partnered machine-learning specialists)
  - mrr (NIST Materials Resource Registry)
- parameters (dict) – Optional, service-specific parameters.
  - For mdf_publish:
    - publication_location (str) - The Globus Endpoint and path on which to save the published files. It is recommended to not specify this parameter, which causes the dataset to be published on MDF resources.
  - For citrine:
    - public (bool) - When True, will make data public. Otherwise, it is inaccessible.
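As a sketch of how the service options above combine, the function below requests publication and, optionally, Citrine ingestion. The function name request_services and its arguments are hypothetical. It assumes client is an MDFConnectClient, or any object exposing the documented add_service(service, parameters=None) method.

```python
def request_services(client, publish=True, citrine_public=None):
    """Attach optional services to the submission held by `client`.

    `client` is assumed to be an MDFConnectClient instance.
    """
    if publish:
        # No publication_location is given, so the dataset is
        # published on MDF resources, as the documentation recommends.
        client.add_service("mdf_publish")
    if citrine_public is not None:
        # The "public" parameter controls visibility of the Citrine data.
        client.add_service("citrine", parameters={"public": citrine_public})
```

For example, request_services(client, citrine_public=True) would request both services with public Citrine data.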
-
add_tag(tag)
Add a tag or keyword to your dataset. Note that this method is cumulative, so calls do not overwrite previous ones.
Note
Setting tags here is equivalent to setting tags in create_dc_block(subjects=...). This method exists only for convenience.
Parameters: tag (str or list of str) – The tag(s) to add.
-
check_all_submissions(verbose=False, active_only=False, include_tests=True, newer_than_date=None, older_than_date=None, raw=False, filters=None, _admin_code=None)
Check the status of all of your submissions.
Parameters:
- verbose (bool) – When False, will print a basic summary of your submissions. When True, will print the full status summary of each submission, as if you called check_status() on each. Has no effect if raw is True. Default: False
- active_only (bool) – When True, will only print active submissions. Default: False
- include_tests (bool) – When False, will only print non-test submissions. Default: True
- newer_than_date (datetime or tuple of ints) – Exclude submissions made before this date. Accepts a datetime object or (year, month, day) as integers. Comparisons are made in UTC. Default: None, to set no maximum age.
- older_than_date (datetime or tuple of ints) – Exclude submissions made after this date. Accepts a datetime object or (year, month, day) as integers. Comparisons are made in UTC. Default: None, to set no minimum age.
- raw (bool) – When False, will print your submissions’ summaries. When True, will return the full status results. For direct human consumption, False is recommended. Default: False
- filters (list of tuples) – Advanced users only. Filters to apply to the status database scan. For a submission to be returned, all filters must match. Default: None.
  Format: (field, operator, value)
  - field: The status field to filter on.
  - operator: The relation of field to value. Valid operators:
    ^: Begins with
    *: Contains
    ==: Equal to (or field does not exist, if value is None)
    !=: Not equal to (or field exists, if value is None)
    >: Greater than
    >=: Greater than or equal to
    <: Less than
    <=: Less than or equal to
    []: Between, inclusive (requires a list of two values)
    in: Is one of the values (requires a list of values). This operator effectively allows OR-ing ‘==’.
  - value: The value of the field.
- _admin_code (str) – For MDF Connect administrators only, a special function code. Valid codes:
  - all: All submission statuses
  - active: All active submission statuses
  Only MDF Connect administrators are allowed to use these codes. Default: None, the only valid value for non-admins.
Note about date filtering: Days are compared in UTC, at exactly 0:00 (12:00am). This means that the two dates cannot be the same, as they would filter out all submissions not made at exactly 0:00:00 on the chosen date. To see submissions made on a specific date, set the older_than_date filter one day after the date in question. For example, to see submissions from Feb 11, 2020, use newer_than_date=(2020, 2, 11), older_than_date=(2020, 2, 12).
Returns: The full status results.
Return type: if raw is True, dict
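The date rule and the filter operators can be sketched locally. Both helpers below are hypothetical and only illustrate the documented semantics; the real filtering is applied by MDF Connect to its status database. Treating a missing field as a non-match for operators other than == and != is an assumption of this sketch.

```python
from datetime import date, timedelta


def single_day_window(year, month, day):
    """Return (newer_than_date, older_than_date) tuples covering one UTC day."""
    start = date(year, month, day)
    end = start + timedelta(days=1)
    return (start.year, start.month, start.day), (end.year, end.month, end.day)


def matches(status, field, operator, value):
    """Evaluate one (field, operator, value) filter against a status dict."""
    if operator == "==":
        # A missing field compares equal to None.
        return status.get(field) == value
    if operator == "!=":
        return status.get(field) != value
    if field not in status:
        return False  # assumption: other operators need the field to exist
    actual = status[field]
    if operator == "^":
        return actual.startswith(value)
    if operator == "*":
        return value in actual
    if operator == ">":
        return actual > value
    if operator == ">=":
        return actual >= value
    if operator == "<":
        return actual < value
    if operator == "<=":
        return actual <= value
    if operator == "[]":
        low, high = value
        return low <= actual <= high
    if operator == "in":
        return actual in value
    raise ValueError(f"Unknown operator: {operator!r}")


def matches_all(status, filters):
    """All filters must match for a submission to be returned."""
    return all(matches(status, *one_filter) for one_filter in filters)


newer, older = single_day_window(2020, 2, 11)
```

The two tuples can be passed as newer_than_date and older_than_date to list the submissions made on that one day.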
-
check_status(source_id=None, short=False, raw=False)
Check the status of your submission. You may only check the status of your own submissions.
Parameters:
- source_id (str) – The source_id (source_name + version information) of the submission to check. Default: self.source_id
- short (bool) – When False, will print a status summary containing all of the status steps for the dataset. When True, will print a short finished/processing message, useful for checking many datasets’ status at once. Default: False
- raw (bool) – When False, will print a nicely-formatted status summary. When True, will return the full status result. For direct human consumption, False is recommended. Default: False
Returns: The full status result.
Return type: If raw is True, dict
Clear all tags added so far to your dataset.
-
create_dc_block(title, authors, affiliations=None, publisher=None, publication_year=None, resource_type=None, description=None, dataset_doi=None, related_dois=None, subjects=None, **kwargs)
Create your submission’s dc block. This block is the DataCite block. Additional information on DataCite fields is available from the official DataCite website: https://schema.datacite.org/meta/kernel-4.1/
Parameters:
- title (str or list of str) – The title(s) of the dataset.
- authors (str or list of str) – The author(s) of the dataset. The name will be automatically parsed into given name and family name.
- publisher (str) – The publisher of the dataset (not an associated paper). Default: The Materials Data Facility.
- publication_year (int or str) – The year of dataset publication. Default: The current year.
- resource_type (str) – The type of resource. Except in unusual cases, this should be "Dataset". Default: "Dataset"
- affiliations (str or list of str or list of list of str) – The affiliations of the authors, in the same order. If a different number of affiliations are given, all affiliations will be applied to all authors. Multiple affiliations can be given as a list. Default: None for no affiliations for any author.
  Examples:
  authors = ["Fromnist, Alice", "Fromnist; Bob", "Cathy Multiples"]
  # All authors are from NIST
  affiliations = "NIST"
  # All authors are from both NIST and UChicago
  affiliations = ["NIST", "UChicago"]
  # Alice and Bob are from NIST, Cathy is from NIST and UChicago
  affiliations = ["NIST", "NIST", ["NIST", "UChicago"]]
  # This is incorrect! If applying affiliations to all authors,
  # lists must not be nested.
  affiliations = ["NIST", ["NIST", "UChicago"], "Argonne", "Oak Ridge"]
- description (str) – A description of the dataset. Default: None for no description.
- dataset_doi (str) – The DOI for this dataset (not an associated paper). Default: None
- related_dois (str or list of str) – DOIs related to this dataset, not including the dataset’s own DOI (for example, an associated paper’s DOI). Default: None
- subjects (str or list of str) – Subjects (in DataCite terminology) or tags related to the dataset. Default: None
Any further keyword arguments (**kwargs) will be added to the DataCite metadata (the dc block). These arguments should be valid DataCite, as listed in the MDF Connect documentation. This is completely optional.
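The affiliation rules can be sketched as a small standalone function. This is an illustration of the documented behavior, not the client's own code. The name assign_affiliations is hypothetical, and raising ValueError for the "incorrect" nested case is an assumption, since the documentation does not say how the client reacts.

```python
def assign_affiliations(authors, affiliations=None):
    """Return a list of (author, [affiliations]) pairs."""
    if isinstance(authors, str):
        authors = [authors]
    if affiliations is None:
        return [(author, []) for author in authors]
    if isinstance(affiliations, str):
        affiliations = [affiliations]

    if len(affiliations) == len(authors):
        # Same count: match affiliations to authors in order.
        # Each entry may be one affiliation or a list of several.
        return [
            (author, [entry] if isinstance(entry, str) else list(entry))
            for author, entry in zip(authors, affiliations)
        ]

    # Different count: every affiliation applies to every author,
    # so nested lists are not allowed.
    if any(not isinstance(entry, str) for entry in affiliations):
        raise ValueError("Nested lists require one entry per author")
    return [(author, list(affiliations)) for author in authors]


authors = ["Fromnist, Alice", "Fromnist; Bob", "Cathy Multiples"]
per_author = assign_affiliations(authors, ["NIST", "NIST", ["NIST", "UChicago"]])
```

Here per_author pairs Alice and Bob with NIST only, and Cathy with both NIST and UChicago.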
-
create_mrr_block(mrr_data)
Create the mrr block for your dataset. This helper should be more helpful in the future.
Parameters: mrr_data (dict) – The MRR schema-compliant metadata.
-
get_available_curation_tasks(summary=True, raw=False, _admin_code=None)
Get all curation tasks available to you.
Parameters:
- summary (bool) – When False, will print the entire curation task, including dataset entry and sample records. When True, will only print a summary of the task. Using the summary is recommended to find specific tasks to get full task information on using get_curation_task(). Default: True
- raw (bool) – When False, will print out summaries of your available curation tasks. When True, will return a dictionary containing the results. For direct human consumption, False is recommended. Default: False
- _admin_code (str) – For MDF Connect administrators only, a special function code. Valid codes:
  - all: All waiting curation tasks.
  Only MDF Connect administrators are allowed to use these codes. Default: None, the only valid value for non-admins.
Returns: The full task results.
Return type: if raw is True, dict
-
get_curation_task(source_id, summary=False, raw=False)
Get the content of a curation task. You must have curation permissions on the selected submission.
Parameters:
- source_id (str) – The source_id (source_name + version information) of the curation task. You can acquire this through get_available_curation_tasks().
- summary (bool) – When False, will print the entire curation task, including the verbose dataset entry and sample records. When True, will only print a summary of the task. Default: False
- raw (bool) – When False, will print the curation task. When True, will return a dictionary of the full result. Overrides the value of summary. For direct human consumption, False is recommended. Default: False
Returns: The full task results.
Return type: if raw is True, dict
-
get_submission()
Fetch the current state of your submission.
Returns: Your submission.
Return type: dict
-
logout()
Log out by removing cached tokens and discarding the client’s authorizer. Also clear the current submission, as it cannot be interacted with.
-
reject_curation_submission(source_id, reason=None, prompt=True, raw=False)
Complete a curation task by rejecting the submission. You must have curation permissions on the selected submission.
Parameters:
- source_id (str) – The source_id (source_name + version information) of the curation task. You can acquire this through get_available_curation_tasks().
- reason (str) – The reason for rejecting this submission. Default: None, to use a generic rejection reason.
- prompt (bool) – When True, will prompt the user to confirm action selection, with a summary of the selected task. When False, will not require confirmation. Default: True.
- raw (bool) – When False, will print the result. When True, will return a dictionary of the full result. For direct human consumption, False is recommended. Default: False
Returns: The full task results.
Return type: if raw is True, dict
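For scripted curation, the accept and reject calls can be wrapped in one function. This is a sketch under the assumption that client is an authenticated MDFConnectClient with curation permissions on the task. The wrapper name complete_curation is hypothetical; the keyword arguments are the documented ones.

```python
def complete_curation(client, source_id, approve, reason=None):
    """Accept or reject one curation task without an interactive prompt.

    Returns the full task results as a dictionary.
    """
    if approve:
        action = client.accept_curation_submission
    else:
        action = client.reject_curation_submission
    # prompt=False skips the confirmation step; raw=True returns the
    # result dictionary instead of printing it.
    return action(source_id, reason=reason, prompt=False, raw=True)
```

Because prompt=False removes the confirmation step, a script using this wrapper should be certain of the source_id before calling it.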
-
reset_submission()
Reset and clear metadata from your submission.
Warning
This action cannot be undone.
The last submission’s source_id will also be cleared. If you want to use check_status, you will be required to input the source_id manually.
Returns: The variables that are NOT cleared, which includes:
- test (bool) - If the submission is a test submission or not.
- service_location (str) - The URL of the MDF Connect server in use.
Return type: dict
-
set_base_acl(acl)
Set the Access Control List for your entire dataset.
Parameters: acl (str or list of str) – The Globus UUIDs of users or groups that should be granted full read access to the dataset, including records and files. Default: The special keyword "public", which makes the dataset visible to everyone.
Warning
The identities listed in the base_acl of your submission can always see your submission, including dataset entry, even if they are not listed in the dataset_acl. This means that if you do not specify a base_acl, because it defaults to "public", your entire dataset will be public. MDF encourages you to make your data public, but if you do not want it public you must specify this value.
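Since a mistyped ACL entry could leave a dataset more or less visible than intended, a local sanity check before calling set_base_acl() may be useful. The sketch below is hypothetical: normalize_acl is not part of the client, and it assumes that every entry is either the keyword "public" or a UUID string that Python's uuid module can parse.

```python
import uuid


def normalize_acl(acl):
    """Return the ACL as a list, checking each entry is "public" or a UUID."""
    if isinstance(acl, str):
        acl = [acl]
    for entry in acl:
        if entry != "public":
            # uuid.UUID() raises ValueError for a malformed UUID string.
            uuid.UUID(entry)
    return list(acl)


# A made-up UUID, used only to show the expected shape of an entry.
private_acl = normalize_acl("12345678-1234-5678-1234-567812345678")
```

The returned list could then be passed to set_base_acl() or set_dataset_acl().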
-
set_curation(curation)
Set the curation flag for this submission.
Note
Normally, this flag is set automatically by an organization, and is not set manually by the dataset submitter.
Parameters: curation (bool) – When False, the dataset will be processed normally. When True, the dataset must be approved in curation before it will be ingested to MDF Search or any other service. Default: False
-
set_custom_block(custom_fields)
Set the custom block for your dataset.
Parameters: custom_fields (dict) – Custom field-value pairs for your dataset. You may add descriptions of your fields by creating a new field called [field]_desc with the string description inside, or by calling set_custom_descriptions().
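The [field]_desc convention can be shown with a small helper that builds the dictionary by hand. The helper name with_descriptions and the sample field names are hypothetical. The sketch only illustrates the naming convention and makes no claim about how set_custom_descriptions() stores descriptions internally.

```python
def with_descriptions(custom_fields, descriptions):
    """Return a copy of custom_fields with "<field>_desc" entries added."""
    unknown = set(descriptions) - set(custom_fields)
    if unknown:
        # Descriptions must refer to fields that actually exist.
        raise KeyError(f"No such custom field(s): {sorted(unknown)}")
    merged = dict(custom_fields)
    for field, text in descriptions.items():
        merged[f"{field}_desc"] = text
    return merged


# Hypothetical custom fields for a dataset.
custom = with_descriptions(
    {"anneal_temp": 450, "furnace": "tube"},
    {"anneal_temp": "Annealing temperature in kelvin"},
)
```

The resulting dictionary could be passed to set_custom_block() in a single call.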
-
set_custom_descriptions(custom_descriptions)
Add descriptions to your custom block.
Parameters: custom_descriptions (dict) – Custom field-description pairs for your dataset. Field names in this argument must match field names added by calling set_custom_block().
-
set_dataset_acl(acl)
Set the Access Control List for just the dataset entry of your dataset.
Parameters: acl (str or list of str) – The Globus UUIDs of users or groups that should be granted read access only to the dataset entry for your dataset in MDF Search (this includes the author list, title, etc. but does not include extracted metadata in records or files). Anyone listed in the base ACL already has this permission.
-
set_external_uri(uri)
Set an external URI for your dataset. This is used to point at a landing page outside of MDF that also hosts the dataset.
Parameters: uri (str) – The external URI.
-
set_extraction_config(config)
Set advanced configuration parameters for dataset extraction. These parameters are intended for advanced users and/or special-case datasets.
Parameters: config (dict) – The extraction configuration parameters.
-
set_incremental_update(source_id)
Make this submission an incremental update of a previous submission. Incremental updates use the same submission metadata, except for whatever you specify in the new submission. For example, if you submit an incremental update and only include a data_source, the submission will run as if you copied the DC block and other metadata into the submission, but with the new data_source.
Note
You must still set update=True when submitting an incremental update.
Parameters: source_id (str) – The source_id of the previous submission to update.
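The incremental-update flow described above can be sketched as three calls. The wrapper name submit_incremental_update is hypothetical, and client is assumed to be an authenticated MDFConnectClient; the methods and the update=True requirement are the documented ones.

```python
def submit_incremental_update(client, previous_source_id, new_data_source):
    """Resubmit a dataset with a new data source, reusing its old metadata."""
    # Point the new submission at the one it updates.
    client.set_incremental_update(previous_source_id)
    # Only the data source changes; other metadata is carried over.
    client.add_data_source(new_data_source)
    # update=True is still required for incremental updates.
    return client.submit_dataset(update=True)
```

The return value is whatever submit_dataset() returns, so the caller can inspect the result in the usual way.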
-
set_passthrough(passthrough)
Set the dataset pass-through flag for your submission.
Caution
This flag will cause metadata from your dataset’s files to not be extracted by MDF Connect, so only high-level dataset metadata will be available in MDF Search. This flag is only intended for datasets that cannot be extracted.
Parameters: passthrough (bool) – When False, the dataset will be processed normally. When True, the metadata in the files will not be extracted. Default: False
-
set_project_block(project, data)
Set the project block for your dataset. Intended only for use by members of an approved project. To delete a project block, call this method with data=None.
Parameters:
- project (str) – The name of the project block.
- data (dict) – The data for the project block.
-
set_source_name(source_name)
Set the source name for your dataset.
Parameters: source_name (str) – The desired source name. Must be unique for new datasets. Please note that your source name will be cleaned when submitted to Connect, so the actual source_name may differ from this value. Additionally, the source_id (which is the source_name plus version information) is required to fetch the status of a submission. check_status() can handle this for you.
-
set_test(test)
Set the test flag for this dataset.
Parameters: test (bool) – When False, the dataset will be processed normally. When True, the dataset will be processed, but submitted to test/sandbox/temporary resources instead of live resources. This includes the mdf-test Search index and test DOIs minted with MDF Publish. Default: False
-
submit_dataset(update=False, submission=None, reset=False)
Submit your dataset to MDF Connect for processing.
Parameters:
- update (bool) – If you wish to submit this dataset again, set this to True. If this is the first submission, leave this False. Default: False
- submission (dict) – If you have assembled the Connect metadata yourself, you can submit it here. This argument supersedes any data set through other methods. Default: None, to use method-assembled data.
- reset (bool) – If True, will clear the old submission. The test flag will be preserved. IMPORTANT: The source_id of the submission will not be saved if this argument is True. check_status will require you to pass the source_id as an argument. If False, the submission will be preserved. Default: False
Returns: The submission information.
- success (bool) - Whether the submission was successful.
- source_id (string) - The source_id of your dataset, which is also saved in self.source_id. The source_id is the source_name plus version information. In other words, the source_name is unique to your dataset, and the source_id is unique to your submission of the dataset.
- error (string) - Error message, if applicable.
Return type: dict
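The returned dictionary lends itself to a simple success check. The sketch below assumes client is an authenticated MDFConnectClient whose metadata has already been set. The wrapper name submit_and_report is hypothetical; the keys success, source_id and error are the documented ones.

```python
def submit_and_report(client, update=False):
    """Submit the dataset and return its source_id, or raise on failure."""
    result = client.submit_dataset(update=update)
    if not result["success"]:
        # "error" holds the message when the submission failed.
        raise RuntimeError(f"Submission failed: {result.get('error')}")
    # Keep the source_id: it is needed later for check_status().
    return result["source_id"]
```

Storing the returned source_id is most important when reset=True is used, because the client does not save it in that case.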
-
submit_dataset_metadata_update(source_id, metadata_update=None, reset=False)
Submit an update to a dataset entry (and NOT the data or record entries).
Parameters:
- source_id (str) – The source_id of the dataset you wish to update. You must be the owner of the dataset.
- metadata_update (dict) – If you have assembled the dataset metadata yourself, you can submit it here. This argument supersedes any data set through other methods. Default: None, to use method-assembled data.
- reset (bool) – If True, will clear the old metadata from the client. The test flag will be preserved. If False, the metadata will be preserved. Default: False
-