{ "cells": [ { "cell_type": "markdown", "id": "dddb93fd-84e6-428c-ba98-175d2860d60c", "metadata": {}, "source": [ "# SoilPulse project management\n", "Work in SoilPulse is organized into so-called \"projects\". A user can establish and maintain any number of projects as long as the project names are unique. A project is a collection of containers that represent data elements and structures extracted from the files included in the project.\n", "\n", "## Establishing a new project\n", "First import the soilpulsecore package and get a storage connection through NullConnector. The project structure and all project-related data will be managed in a dedicated filesystem directory in the user's home directory, whose location depends on the operating system.\n", "\n", "### Storage connection\n", "A storage must be set for the SoilPulse package to function properly. The primary purpose of the storage is handling the files that are included in the project. Storing the provided files (at least temporarily) is necessary for full content analysis, including unpacking file archives to access the packed files. Additionally, the project, containers, datasets and translation dictionaries are saved to this storage if filesystem storage is used (i.e. the NullConnector).\n", "\n", "A dedicated directory for SoilPulse projects is created automatically on first execution in the user's home directory, as reported by the operating system.\n", "\n", "Another option is MySQLConnector, which requires additional prerequisites and is therefore covered by a separate tutorial.\n", "\n", "### New empty project\n", "Let's create a new project and name it \"SoilPulse testing\".\n", "\n", "_note: a project can be established without a name; in that case a unique name is assigned automatically by soilpulsecore and can be changed at any time_\n", "\n", "For a SoilPulse project stored in the filesystem (using NullConnector) the user_id parameter is not relevant and all projects are accessible to the local user." 
] }, { "cell_type": "code", "execution_count": 7, "id": "e9764c91-5a93-4f15-8cf4-840503e03c55", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "failed to load concept vocabulary 'AGROVOC' from 'vocabularies\\agrovoc.json'\n", "failed to load concept vocabulary 'TestConceptVocabulary' from 'vocabularies\\_concepts_vocabulary_1.json'\n", "failed to load method vocabulary 'TestMethodsVocabulary' from 'vocabularies\\_methods_vocabulary_1.json'\n", "loaded methods vocabularies: \n", "failed to load units vocabulary 'TestUnitsVocabulary' from 'vocabularies\\_units_vocabulary_1.json'\n", "loaded units vocabularies: \n", "doi: 'None'\n", "Empty DOI provided. DOI metadata were not retrieved.\n" ] } ], "source": [ "from soilpulsecore.db_access import NullConnector\n", "from soilpulsecore.project_management import *\n", "from soilpulsecore.resource_managers import filesystem, data_structures, json\n", "\n", "# get the filesystem storage connection\n", "dbcon = NullConnector()\n", "# create the ProjectManager instance and establish directories and files structure of the project\n", "project = ProjectManager(dbcon, user_id=1, name=\"SoilPulse testing\")" ] }, { "cell_type": "markdown", "id": "969f0b6c-a92a-4cd2-bdce-78991751e41d", "metadata": {}, "source": [ "The printed messages indicate the status of soilpulsecore modules and other components being loaded." ] }, { "cell_type": "markdown", "id": "75852880-8dca-4cec-b698-0e54b089f35c", "metadata": {}, "source": [ "To make sure the project was successfully created and now exists in memory, we can print out its ID and name. 
\n", "We can change the name of the project by setting the value of its \"name\" attribute directly.\n", "\n", "To check the project status we can also print the project itself.\n" ] }, { "cell_type": "code", "execution_count": 8, "id": "d0e6a22c-9406-489e-ba4e-5d16fce238c2", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "5 - 'SoilPulse testing'\n", "\n", "=== Project #5 ======================================================================\n", "name: SoilPulse show case\n", "local directory: C:\\Users\\jande\\SoilPulse\\project_files\\temp_5\n", "keep stored files: yes\n", "space occupied: 127.0 B\n", "no DOI assigned\n", "==========================================================================================\n", "\n" ] } ], "source": [ "print(f\"{project.id} - '{project.name}'\")\n", "project.name = \"SoilPulse show case\"\n", "print(project)" ] }, { "cell_type": "markdown", "id": "09bef619-f8ad-4d12-8a48-28ba764de0e5", "metadata": {}, "source": [ "## Loading data to project\n", "To process some data we must first load them into the project.\n", "Files can be added to the project in three basic ways:\n", "\n", "* upload of a file from user's computer by local path\n", "* download from internet resources by URL or list of URLs\n", "* download of a data-package available through DOI record (currently implemented for Datacite.org) and data-publisher record (currently implemented for Zenodo)\n", "\n", "To include some data in the project, let's upload a file from disk - you may download the example file from this link. \n", "To use the file on your soilpulsecore instance, adjust the source path accordingly." 
] }, { "cell_type": "code", "execution_count": 9, "id": "bd4163ae-3432-4357-90f9-4595e4f9334a", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Container 1 was already analyzed.\n" ] } ], "source": [ "project.uploadFilesFromSession(\"d:\\\\downloads\\\\runoffdb_excerpt.csv\")" ] }, { "cell_type": "markdown", "id": "e78d80fc-1b8c-4aed-b682-69970486232b", "metadata": {}, "source": [ "## Loaded data structure analysis\n", "The uploaded file is analyzed right after uploading to obtain its inner structure. Each data element loaded into the project structure is represented as a \"container\". Containers can have various types depending on the recognized file or data type. The containers are organized hierarchically: the top level element is the project and each container may contain more containers.\n", "\n", "To see the current structure of the project's containers we can use ProjectManager's method showContainerTree, which prints out all containers in the project's structure. Each container is represented by one line, and its ID (unique within the project) is shown in front of the container details." ] }, { "cell_type": "code", "execution_count": 10, "id": "b2be90b1-4955-449a-af4b-b5ac32f7d10b", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "================================================================================\n", "SoilPulse show case\n", "container tree:\n", "--------------------------------------------------------------------------------\n", "1 - runoffdb_excerpt.csv (file, 56.2 kB, 22.11.2024/22.11.2024) [1] >root\n", ". 2 - runoffdb_excerpt (table - table) [20] ^1\n", ". . 3 - locality (column - column) [0] ^2\n", ". . 4 - latitude (column - column) [0] ^2\n", ". . 5 - longitude (column - column) [0] ^2\n", ". . 6 - run ID (column - column) [0] ^2\n", ". . 7 - date (column - column) [0] ^2\n", ". . 8 - plot ID (column - column) [0] ^2\n", ". . 9 - simulator (column - column) [0] ^2\n", ". 
. 10 - crop (column - column) [0] ^2\n", ". . 11 - crop type (column - column) [0] ^2\n", ". . 12 - initial cond. (column - column) [0] ^2\n", ". . 13 - init. moisture (column - column) [0] ^2\n", ". . 14 - canopy cover (column - column) [0] ^2\n", ". . 15 - BBCH (column - column) [0] ^2\n", ". . 16 - rain intensity [mm.h^-1] (column - column) [0] ^2\n", ". . 17 - time to runoff (column - column) [0] ^2\n", ". . 18 - bulk density [g.cm^-3] (column - column) [0] ^2\n", ". . 19 - total time [s] (column - column) [0] ^2\n", ". . 20 - total rainfall [mm.h^-1] (column - column) [0] ^2\n", ". . 21 - total discharge [l] (column - column) [0] ^2\n", ". . 22 - total soil loss [g] (column - column) [0] ^2\n", "================================================================================\n", "\n", "\n" ] } ], "source": [ "project.showContainerTree()" ] }, { "cell_type": "markdown", "id": "d113b39c-1ba4-4524-b414-2b26f484be3c", "metadata": {}, "source": [ "## Using datasets\n", "Though it is possible to work with all data on the project level, it is useful to structure related resources in a dataset.\n", "A new dataset is created within the project by specifying its name:" ] }, { "cell_type": "code", "execution_count": 11, "id": "31220dad-d2e4-4d74-ad3b-f89b7b724587", "metadata": {}, "outputs": [], "source": [ "ds = project.createDataset(\"CTU RunoffDB excerpt\")" ] }, { "cell_type": "markdown", "id": "c66c5d79-79f2-4dae-a11a-25da4630ab6f", "metadata": {}, "source": [ "
A newly created dataset is empty and containers with resources must be added to it.\n", "A container can be obtained, for example, from the project by its ID.
\n", "\n", "note: it doesn't really matter whether the file container or the table container within it is added, as all procedures should work for both arrangements
" ] }, { "cell_type": "code", "execution_count": 12, "id": "330383c2-6518-4744-b782-c11d773ba95f", "metadata": {}, "outputs": [], "source": [ "table = project.getContainerByID(2)\n", "ds.addContainer(table)" ] }, { "cell_type": "markdown", "id": "97a18567-c81f-4bda-a254-da4aa3f7a906", "metadata": {}, "source": [ "The container added to dataset still contains all of its subcontainers.\n", "We may check that by printing contents of the dataset" ] }, { "cell_type": "code", "execution_count": 13, "id": "13c3ddac-ed31-4817-9767-b47a5589f0c8", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "==== CTU RunoffDB excerpt ============================================================ #1\n", "---- container tree: ----\n", "2 - runoffdb_excerpt (table - table) [20] ^1\n", ". 3 - locality (column - column) [0] ^2\n", ". 4 - latitude (column - column) [0] ^2\n", ". 5 - longitude (column - column) [0] ^2\n", ". 6 - run ID (column - column) [0] ^2\n", ". 7 - date (column - column) [0] ^2\n", ". 8 - plot ID (column - column) [0] ^2\n", ". 9 - simulator (column - column) [0] ^2\n", ". 10 - crop (column - column) [0] ^2\n", ". 11 - crop type (column - column) [0] ^2\n", ". 12 - initial cond. (column - column) [0] ^2\n", ". 13 - init. moisture (column - column) [0] ^2\n", ". 14 - canopy cover (column - column) [0] ^2\n", ". 15 - BBCH (column - column) [0] ^2\n", ". 16 - rain intensity [mm.h^-1] (column - column) [0] ^2\n", ". 17 - time to runoff (column - column) [0] ^2\n", ". 18 - bulk density [g.cm^-3] (column - column) [0] ^2\n", ". 19 - total time [s] (column - column) [0] ^2\n", ". 20 - total rainfall [mm.h^-1] (column - column) [0] ^2\n", ". 21 - total discharge [l] (column - column) [0] ^2\n", ". 
22 - total soil loss [g] (column - column) [0] ^2\n", "================================================================================\n", "\n" ] } ], "source": [ "ds.showContents()" ] }, { "cell_type": "markdown", "id": "1bc2956d-01b1-49f1-9cc1-f0c896140169", "metadata": {}, "source": [ "## Content analysis and metadata capturing\n", "\n", "The only metadata-from-content extraction procedure implemented so far is matching column header strings against vocabulary terms. Three descriptive entities are defined that can be assigned to table and column containers: concept, method and unit.\n", "Inside a container a string-translations dictionary is kept for each of these entities. \n", "The translations are obtained by matching the container's content (currently implemented for column headers only) against available __vocabularies__ or __dictionaries__.\n", "\n", "### Vocabularies\n", "Vocabularies are meant as more or less exhaustive lists of __terms__ that can be unambiguously referenced by a URI. The only example available right now is the AGROVOC controlled vocabulary, which is used for identifying concepts. Only an excerpt from AGROVOC is distributed with soilpulsecore.\n", "\n", "### Translation dictionaries\n", "Dictionaries are ad-hoc collections of __translations__, i.e. one or more terms assigned to a character string. In fact, a container's concepts, methods and units are held in translation dictionaries. Project or dataset translation dictionaries are generated from the containers that belong to them. Dictionaries can be imported into a project from disk, and project dictionaries can be exported to a file, so a translation collection from one project can be used within another project.\n", "\n", "To perform content analysis (\"crawling\") based on our own translation dictionaries we must first load them into the project.\n", "To verify that the translations were loaded properly we may print the content of the dictionaries in a structured manner using ProjectManager's showDictionaries method." 
] }, { "cell_type": "code", "execution_count": 14, "id": "0be956fc-6be7-450d-bb66-1871c349075b", "metadata": {}, "outputs": [ { "ename": "TypeError", "evalue": "string indices must be integers, not 'str'", "output_type": "error", "traceback": [ "\u001b[1;31m---------------------------------------------------------------------------\u001b[0m", "\u001b[1;31mTypeError\u001b[0m Traceback (most recent call last)", "Cell \u001b[1;32mIn[14], line 1\u001b[0m\n\u001b[1;32m----> 1\u001b[0m \u001b[43mproject\u001b[49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43mupdateConceptsTranslationsFromFile\u001b[49m\u001b[43m(\u001b[49m\u001b[38;5;124;43m\"\u001b[39;49m\u001b[38;5;124;43md:\u001b[39;49m\u001b[38;5;130;43;01m\\\\\u001b[39;49;00m\u001b[38;5;124;43mdownloads\u001b[39;49m\u001b[38;5;130;43;01m\\\\\u001b[39;49;00m\u001b[38;5;124;43mrunoffdb_concepts.json\u001b[39;49m\u001b[38;5;124;43m\"\u001b[39;49m\u001b[43m)\u001b[49m\n\u001b[0;32m 2\u001b[0m project\u001b[38;5;241m.\u001b[39mupdateMethodsTranslationsFromFile(\u001b[38;5;124m\"\u001b[39m\u001b[38;5;124md:\u001b[39m\u001b[38;5;130;01m\\\\\u001b[39;00m\u001b[38;5;124mdownloads\u001b[39m\u001b[38;5;130;01m\\\\\u001b[39;00m\u001b[38;5;124mrunoffdb_methods.json\u001b[39m\u001b[38;5;124m\"\u001b[39m)\n\u001b[0;32m 3\u001b[0m \u001b[38;5;66;03m# project.showDictionaries()\u001b[39;00m\n", "File \u001b[1;32mC:\\Python311\\Lib\\site-packages\\soilpulsecore\\project_management.py:783\u001b[0m, in \u001b[0;36mProjectManager.updateConceptsTranslationsFromFile\u001b[1;34m(self, input_file)\u001b[0m\n\u001b[0;32m 778\u001b[0m \u001b[38;5;250m\u001b[39m\u001b[38;5;124;03m\"\"\"\u001b[39;00m\n\u001b[0;32m 779\u001b[0m \u001b[38;5;124;03mAdds string-concepts translations to project's dictionary (if not already there) from specified file\u001b[39;00m\n\u001b[0;32m 780\u001b[0m \u001b[38;5;124;03m:param input_file: path of a file to load\u001b[39;00m\n\u001b[0;32m 781\u001b[0m \u001b[38;5;124;03m\"\"\"\u001b[39;00m\n\u001b[0;32m 782\u001b[0m 
\u001b[38;5;66;03m# load the input JSON file to dictionary\u001b[39;00m\n\u001b[1;32m--> 783\u001b[0m str_conc_dict \u001b[38;5;241m=\u001b[39m \u001b[38;5;28;43mself\u001b[39;49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43mloadTranslationsFromFile\u001b[49m\u001b[43m(\u001b[49m\u001b[43minput_file\u001b[49m\u001b[43m)\u001b[49m\n\u001b[0;32m 784\u001b[0m updateTranslationsDictionary(\u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39mconceptsTranslations, str_conc_dict)\n\u001b[0;32m 785\u001b[0m \u001b[38;5;28;01mreturn\u001b[39;00m\n", "File \u001b[1;32mC:\\Python311\\Lib\\site-packages\\soilpulsecore\\project_management.py:820\u001b[0m, in \u001b[0;36mProjectManager.loadTranslationsFromFile\u001b[1;34m(self, input_file)\u001b[0m\n\u001b[0;32m 818\u001b[0m \u001b[38;5;28;01mtry\u001b[39;00m:\n\u001b[0;32m 819\u001b[0m \u001b[38;5;28;01mfor\u001b[39;00m \u001b[38;5;28mstr\u001b[39m \u001b[38;5;129;01min\u001b[39;00m json\u001b[38;5;241m.\u001b[39mload(f):\n\u001b[1;32m--> 820\u001b[0m str_dict\u001b[38;5;241m.\u001b[39mupdate({\u001b[38;5;28;43mstr\u001b[39;49m\u001b[43m[\u001b[49m\u001b[38;5;124;43m'\u001b[39;49m\u001b[38;5;124;43mstring\u001b[39;49m\u001b[38;5;124;43m'\u001b[39;49m\u001b[43m]\u001b[49m: \u001b[38;5;28mstr\u001b[39m[\u001b[38;5;124m\"\u001b[39m\u001b[38;5;124mtranslation\u001b[39m\u001b[38;5;124m\"\u001b[39m]})\n\u001b[0;32m 821\u001b[0m \u001b[38;5;28;01mexcept\u001b[39;00m \u001b[38;5;167;01mKeyError\u001b[39;00m \u001b[38;5;28;01mas\u001b[39;00m e:\n\u001b[0;32m 822\u001b[0m \u001b[38;5;28mprint\u001b[39m(\u001b[38;5;124mf\u001b[39m\u001b[38;5;124m\"\u001b[39m\u001b[38;5;124mTranslations dictionary \u001b[39m\u001b[38;5;124m'\u001b[39m\u001b[38;5;132;01m{\u001b[39;00minput_file\u001b[38;5;132;01m}\u001b[39;00m\u001b[38;5;124m'\u001b[39m\u001b[38;5;124m failed to load.\u001b[39m\u001b[38;5;124m\"\u001b[39m)\n", "\u001b[1;31mTypeError\u001b[0m: string indices must be integers, not 'str'" ] } ], "source": [ 
"project.updateConceptsTranslationsFromFile(\"d:\\\\downloads\\\\runoffdb_concepts.json\")\n", "project.updateMethodsTranslationsFromFile(\"d:\\\\downloads\\\\runoffdb_methods.json\")\n", "# project.showDictionaries()" ] }, { "cell_type": "markdown", "id": "8323b342-9510-48f8-8fd1-12d90e502e73", "metadata": {}, "source": [ "To include an additional units vocabulary that will be used for finding matches, add it to the storage connector's units vocabularies:" ] }, { "cell_type": "code", "execution_count": null, "id": "7cc67091-18b5-43fa-8e44-65b4711d230a", "metadata": {}, "outputs": [], "source": [ "dbcon.units_vocabularies.update({\"units_in_brakets\": dbcon.loadVocabularyFromFile(\"d:\\\\downloads\\\\bracket_units_vocab.json\")})\n" ] }, { "cell_type": "markdown", "id": "0f287dc1-2dce-41ba-84a3-58136a2e2395", "metadata": {}, "source": [ "Now concepts/parameters, methods and units can be searched for in the dataset.\n", "The 'force' parameter makes the crawler process a container even if it was already crawled; previous values are then overwritten." ] }, { "cell_type": "code", "execution_count": null, "id": "82eef19c-2ba0-429c-990f-21fc44edb4fe", "metadata": {}, "outputs": [], "source": [ "ds.getCrawled(force=True)" ] }, { "cell_type": "markdown", "id": "9aa3d6a9-0ba0-4645-bee9-4afee4e4ed50", "metadata": {}, "source": [ "To check the results of the vocabulary and dictionary search we can again print out the dataset's contents." ] }, { "cell_type": "code", "execution_count": null, "id": "e1354230-e01b-440c-a2bb-6c9271de9359", "metadata": {}, "outputs": [], "source": [ "ds.showContents()" ] }, { "cell_type": "markdown", "id": "b47eb909-ab5a-4f29-9ea1-d8a67d9e4f03", "metadata": {}, "source": [ "## Saving project to storage\n", "\n", "The project lives only in memory until it's saved. To save the project's current state we call the ProjectManager's method to update its storage record. 
Without updating the project record all edits performed on the project structure, datasets and metadata capturing since the last save will be lost at the end of the session." ] }, { "cell_type": "code", "execution_count": null, "id": "369c9226-076c-4b1f-bb95-75684b7ec56e", "metadata": {}, "outputs": [], "source": [ "project.updateDBrecord()" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.11.9" } }, "nbformat": 4, "nbformat_minor": 5 }