SoilPulse project management

Work in SoilPulse is organized into so-called “projects”. A user can create and maintain any number of projects, provided the project names are unique. A project is a collection of containers that represent data elements and structures extracted from the files included in the project.

Establishing new project

First, import the soilpulsecore package and get a storage connection through NullConnector. The project structure and all project-related data will then be managed in a dedicated directory inside the user’s home directory, whose location depends on the operating system.

Storage connection

A storage must be set for the SoilPulse package to function properly. The primary purpose of the storage is handling the files that are included in the project. Storing the provided files (at least temporarily) is necessary for full content analysis, which includes unpacking file archives to access the packed files. Additionally, the project, containers, datasets and translation dictionaries are saved to this storage if filesystem storage is used (i.e. the NullConnector).

A dedicated directory for SoilPulse projects is created automatically on first execution in the user’s home directory, as reported by the operating system of the user’s computer.
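
As a rough illustration of where such a directory ends up, the sketch below derives a per-project path from the user's home directory with pathlib. The layout (SoilPulse/project_files/temp_<id>) is inferred from the project printout shown later in this tutorial. The helper project_directory is hypothetical and not part of soilpulsecore, which may build the path differently.

```python
from pathlib import Path


def project_directory(project_id, home=None):
    """Return the assumed local directory of a project (hypothetical helper)."""
    # Path.home() resolves the user's home directory on Windows, Linux and macOS
    base = Path(home) if home is not None else Path.home()
    # layout assumed from the tutorial output, not taken from soilpulsecore code
    return base / "SoilPulse" / "project_files" / f"temp_{project_id}"


print(project_directory(5))
```

Passing an explicit home makes the helper easy to test without touching the real home directory.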

Another option is MySQLConnector, which requires additional prerequisites and is therefore covered by a separate tutorial.

New empty project

Let’s create a new project and name it “SoilPulse testing”.

note: a project can be established without a name; in that case a unique name is assigned automatically by soilpulsecore and can be changed at any time

For a SoilPulse project stored in the filesystem (using NullConnector), the user_id parameter is not relevant and all projects are accessible to the local user.

[7]:
from soilpulsecore.db_access import NullConnector
from soilpulsecore.project_management import ProjectManager
from soilpulsecore.resource_managers import filesystem, data_structures, json

# get the filesystem storage connection
dbcon = NullConnector()
# create the ProjectManager instance and establish directories and files structure of the project
project = ProjectManager(dbcon, user_id=1, name="SoilPulse testing")
failed to load concept vocabulary 'AGROVOC' from 'vocabularies\agrovoc.json'
failed to load concept vocabulary 'TestConceptVocabulary' from 'vocabularies\_concepts_vocabulary_1.json'
failed to load method vocabulary 'TestMethodsVocabulary' from 'vocabularies\_methods_vocabulary_1.json'
loaded methods vocabularies:
failed to load units vocabulary 'TestUnitsVocabulary' from 'vocabularies\_units_vocabulary_1.json'
loaded units vocabularies:
doi: 'None'
Empty DOI provided. DOI metadata were not retrieved.

The printed messages report the status of the soilpulsecore modules and other components being loaded.

To make sure the project was successfully created and now exists in memory, we can print its ID and name.
The project name can be changed by assigning a new value to the “name” attribute directly.

To check the project status we can also print the project itself.

[8]:
print(f"{project.id} - '{project.name}'")
project.name = "SoilPulse show case"
print(project)
5 - 'SoilPulse testing'

=== Project #5 ======================================================================
name: SoilPulse show case
local directory: C:\Users\jande\SoilPulse\project_files\temp_5
keep stored files: yes
space occupied: 127.0 B
no DOI assigned
==========================================================================================

Loading data to project

To process data we must first load them into the project. Files can be added to a project in three basic ways:

  • upload of a file from user’s computer by local path

  • download from internet resources by URL or list of URLs

  • download of a data-package available through DOI record (currently implemented for Datacite.org) and data-publisher record (currently implemented for Zenodo)

To include some data in the project, let’s upload a file from disk; you may download the example file from this link.
To use the file with your soilpulsecore installation, adjust the source path accordingly.
[9]:
project.uploadFilesFromSession("d:\\downloads\\runoffdb_excerpt.csv")
Container 1 was already analyzed.

Loaded data structure analysis

The uploaded file is analyzed right after uploading to determine its inner structure. Each data element loaded into the project structure is represented as a “container”. Containers have various types, depending on the recognized file or data type. They are organized hierarchically: the top-level element is the project, and each container may contain further containers.
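
The hierarchy described above can be pictured as a simple tree. The following is a minimal sketch of such a structure, assuming each container knows its parent and its children. The Container class and its methods are hypothetical stand-ins for illustration and do not reproduce the soilpulsecore classes.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Container:
    """Hypothetical stand-in for a SoilPulse container (illustration only)."""
    id: int
    name: str
    kind: str
    # parent is excluded from repr/compare to avoid walking the tree in circles
    parent: Optional["Container"] = field(default=None, repr=False, compare=False)
    children: List["Container"] = field(default_factory=list)

    def add(self, child):
        # register the child and remember its parent
        child.parent = self
        self.children.append(child)
        return child

    def tree_lines(self, depth=0):
        # one line per container, indented by nesting depth
        lines = [". " * depth + f"{self.id} - {self.name} ({self.kind})"]
        for child in self.children:
            lines.extend(child.tree_lines(depth + 1))
        return lines


file_c = Container(1, "runoffdb_excerpt.csv", "file")
table = file_c.add(Container(2, "runoffdb_excerpt", "table"))
table.add(Container(3, "locality", "column"))
table.add(Container(4, "latitude", "column"))
print("\n".join(file_c.tree_lines()))
```

The recursive tree_lines method mirrors the one-line-per-container listing used in this tutorial, with the indentation showing the nesting level.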

To see the current structure of the project’s containers we can use ProjectManager’s method showContainerTree, which prints all containers in the project’s structure. Each container is represented by one line, and its unique ID (within the project scope) is shown in front of the container details.

[10]:
project.showContainerTree()

================================================================================
SoilPulse show case
container tree:
--------------------------------------------------------------------------------
1 - runoffdb_excerpt.csv (file, 56.2 kB, 22.11.2024/22.11.2024) [1]  >root
. 2 - runoffdb_excerpt (table - table) [20] ^1
. . 3 - locality (column - column) [0] ^2
. . 4 - latitude (column - column) [0] ^2
. . 5 - longitude (column - column) [0] ^2
. . 6 - run ID (column - column) [0] ^2
. . 7 - date (column - column) [0] ^2
. . 8 - plot ID (column - column) [0] ^2
. . 9 - simulator (column - column) [0] ^2
. . 10 - crop (column - column) [0] ^2
. . 11 - crop type (column - column) [0] ^2
. . 12 - initial cond. (column - column) [0] ^2
. . 13 - init. moisture (column - column) [0] ^2
. . 14 - canopy cover (column - column) [0] ^2
. . 15 - BBCH (column - column) [0] ^2
. . 16 - rain intensity [mm.h^-1] (column - column) [0] ^2
. . 17 - time to runoff (column - column) [0] ^2
. . 18 - bulk density [g.cm^-3] (column - column) [0] ^2
. . 19 - total time [s] (column - column) [0] ^2
. . 20 - total rainfall [mm.h^-1] (column - column) [0] ^2
. . 21 - total discharge [l] (column - column) [0] ^2
. . 22 - total soil loss [g] (column - column) [0] ^2
================================================================================


Using datasets

Though it is possible to work with all data on the project level, it is useful to structure related resources in a dataset. A new dataset is created within a project by specifying its name:

[11]:
ds = project.createDataset("CTU RunoffDB excerpt")

A newly created dataset is empty, and containers with resources must be added to it. A container can, for example, be obtained from the project by its ID.

note: it doesn’t really matter whether the file container or the table container within it is added, as all procedures should work for both arrangements

[12]:
table = project.getContainerByID(2)
ds.addContainer(table)

The container added to the dataset still contains all of its subcontainers. We may check that by printing the contents of the dataset:

[13]:
ds.showContents()

==== CTU RunoffDB excerpt ============================================================ #1
---- container tree: ----
2 - runoffdb_excerpt (table - table) [20] ^1
. 3 - locality (column - column) [0] ^2
. 4 - latitude (column - column) [0] ^2
. 5 - longitude (column - column) [0] ^2
. 6 - run ID (column - column) [0] ^2
. 7 - date (column - column) [0] ^2
. 8 - plot ID (column - column) [0] ^2
. 9 - simulator (column - column) [0] ^2
. 10 - crop (column - column) [0] ^2
. 11 - crop type (column - column) [0] ^2
. 12 - initial cond. (column - column) [0] ^2
. 13 - init. moisture (column - column) [0] ^2
. 14 - canopy cover (column - column) [0] ^2
. 15 - BBCH (column - column) [0] ^2
. 16 - rain intensity [mm.h^-1] (column - column) [0] ^2
. 17 - time to runoff (column - column) [0] ^2
. 18 - bulk density [g.cm^-3] (column - column) [0] ^2
. 19 - total time [s] (column - column) [0] ^2
. 20 - total rainfall [mm.h^-1] (column - column) [0] ^2
. 21 - total discharge [l] (column - column) [0] ^2
. 22 - total soil loss [g] (column - column) [0] ^2
================================================================================

Content analysis and metadata capturing

The only procedure implemented so far for extracting metadata from content is matching column header strings against vocabulary terms. Three descriptive entities can be assigned to table and column containers: concept, method and unit. Each container keeps a string-translations dictionary for each of these entities.
The translations are obtained by matching the container’s content (currently implemented for column headers only) against the available vocabularies or dictionaries.
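
To make the matching idea concrete, here is a minimal sketch that looks up column headers in a translation dictionary. It assumes exact matching after lowercasing and trimming, which may differ from what soilpulsecore actually does. The function match_headers, the term layout and the example.org URIs are all made up for illustration.

```python
def match_headers(headers, translations):
    """Map each header to the terms whose dictionary string it matches."""
    # normalise keys once so the comparison ignores case and outer whitespace
    normalised = {key.strip().lower(): terms for key, terms in translations.items()}
    return {h: normalised.get(h.strip().lower(), []) for h in headers}


# illustrative translations; the URIs are placeholders, not real vocabulary entries
concepts = {
    "Crop": [{"term": "crops", "uri": "http://example.org/concept/crops"}],
    "bulk density [g.cm^-3]": [
        {"term": "bulk density", "uri": "http://example.org/concept/bulk-density"}
    ],
}

headers = ["crop", "bulk density [g.cm^-3]", "BBCH"]
matches = match_headers(headers, concepts)
for header, terms in matches.items():
    print(header, "->", [t["term"] for t in terms])
```

Headers without a matching string simply get an empty list, so unmatched columns remain visible in the result.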

Vocabularies

Vocabularies are meant as more or less exhaustive lists of terms, each of which can be unambiguously referenced by a URI. The only example available right now is the AGROVOC controlled vocabulary, which is used for identifying concepts. Only an excerpt from AGROVOC is distributed with soilpulsecore.

Translation dictionaries

Dictionaries are used as ad-hoc collections of translations, i.e. one or more terms assigned to a character string. In fact, a container’s concepts, methods and units are held in translation dictionaries. Project and dataset translation dictionaries are generated from the containers that belong to them. Dictionaries can be imported into a project from disk, and project dictionaries can be exported to a file, so the translation collection from one project can be reused in another.
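
The export and import round trip can be sketched with the standard library alone. The example below assumes a file holding a JSON list of entries with "string" and "translation" keys (the two keys read by the loader code visible in the traceback further down) and a merge that only adds terms not already present. The function merge_translations and the sample data are hypothetical.

```python
import json
import tempfile
from pathlib import Path


def merge_translations(target, incoming):
    """Add incoming translations to target without duplicating terms."""
    for string, terms in incoming.items():
        known = target.setdefault(string, [])
        for term in terms:
            if term not in known:
                known.append(term)
    return target


project_a = {"crop": ["crops"]}
project_b = {"crop": ["plants"], "date": ["date"]}

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "concepts.json"
    # export: one entry per translated string (assumed file layout)
    entries = [{"string": s, "translation": t} for s, t in project_a.items()]
    path.write_text(json.dumps(entries), encoding="utf-8")
    # import: read the entries back and merge them into another project's dictionary
    loaded = json.loads(path.read_text(encoding="utf-8"))
    incoming = {e["string"]: e["translation"] for e in loaded}
    merge_translations(project_b, incoming)

print(project_b)
```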

To perform content analysis (“crawling”) based on our own translation dictionaries, we must first load them into the project. To verify that the translations were loaded properly, we can print the content of the dictionaries in a structured manner using ProjectManager’s showDictionaries method.

[14]:
project.updateConceptsTranslationsFromFile("d:\\downloads\\runoffdb_concepts.json")
project.updateMethodsTranslationsFromFile("d:\\downloads\\runoffdb_methods.json")
# project.showDictionaries()
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Cell In[14], line 1
----> 1 project.updateConceptsTranslationsFromFile("d:\\downloads\\runoffdb_concepts.json")
      2 project.updateMethodsTranslationsFromFile("d:\\downloads\\runoffdb_methods.json")
      3 # project.showDictionaries()

File C:\Python311\Lib\site-packages\soilpulsecore\project_management.py:783, in ProjectManager.updateConceptsTranslationsFromFile(self, input_file)
    778 """
    779 Adds string-concepts translations to project's dictionary (if not already there) from specified file
    780 :param input_file: path of a file to load
    781 """
    782 # load the input JSON file to dictionary
--> 783 str_conc_dict = self.loadTranslationsFromFile(input_file)
    784 updateTranslationsDictionary(self.conceptsTranslations, str_conc_dict)
    785 return

File C:\Python311\Lib\site-packages\soilpulsecore\project_management.py:820, in ProjectManager.loadTranslationsFromFile(self, input_file)
    818 try:
    819     for str in json.load(f):
--> 820         str_dict.update({str['string']: str["translation"]})
    821 except KeyError as e:
    822     print(f"Translations dictionary '{input_file}' failed to load.")

TypeError: string indices must be integers, not 'str'
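
The traceback hints at what went wrong: loadTranslationsFromFile iterates over the parsed JSON and reads the 'string' and 'translation' keys of every item, so it apparently expects a top-level list of objects. If the top level of the file is a JSON object instead, iteration yields its keys (plain strings), and indexing a string with 'string' raises exactly this TypeError. The sketch below reproduces both cases with the standard library only. The function load_translations and the sample entries are illustrative; they are neither the soilpulsecore implementation nor the real contents of runoffdb_concepts.json.

```python
import json


def load_translations(text):
    """Build {string: translation} from a JSON list of entries (illustrative)."""
    result = {}
    for entry in json.loads(text):
        # each entry is expected to be an object with 'string' and 'translation'
        result[entry["string"]] = entry["translation"]
    return result


# top-level list of objects: the shape the loop in the traceback can read
good = '[{"string": "crop", "translation": [{"term": "crops"}]}]'
print(load_translations(good))

# top-level object: iterating it yields its keys, which are plain strings
bad = '{"crop": [{"term": "crops"}]}'
try:
    load_translations(bad)
except TypeError as err:
    print("TypeError:", err)
```

Checking the top-level structure of the dictionary file is therefore a reasonable first step before re-running the cell.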

An additional vocabulary to be used for finding matches can be registered on the storage connection, here a units vocabulary loaded from a file:

[ ]:
dbcon.units_vocabularies.update({"units_in_brakets": dbcon.loadVocabularyFromFile("d:\\downloads\\bracket_units_vocab.json")})

Now concepts/parameters, methods and units can be searched for in the dataset. The ‘force’ parameter makes the crawler process a container even if it was already crawled; the previous values are then overwritten.
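
The effect of such a flag can be illustrated with a small sketch. It assumes that a container remembers whether it was crawled and skips repeated crawls unless forced. The class CrawlableContainer and its method are hypothetical and only mimic the behaviour described above, not the soilpulsecore code.

```python
class CrawlableContainer:
    """Hypothetical container that remembers whether it was crawled."""

    def __init__(self, header):
        self.header = header
        self.crawled = False
        self.concepts = []

    def get_crawled(self, vocabulary, force=False):
        # skip work already done unless the caller forces a re-crawl
        if self.crawled and not force:
            return False
        # overwrite previous results with the current vocabulary matches
        self.concepts = list(vocabulary.get(self.header, []))
        self.crawled = True
        return True


column = CrawlableContainer("crop")
column.get_crawled({"crop": ["crops"]})
column.get_crawled({"crop": ["plants"]})              # skipped, already crawled
print(column.concepts)
column.get_crawled({"crop": ["plants"]}, force=True)  # previous result overwritten
print(column.concepts)
```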

[ ]:
ds.getCrawled(force=True)

To check the results of the vocabulary and dictionary search we can again print out the dataset’s contents:

[ ]:
ds.showContents()

Saving project to storage

The project lives only in memory until it’s saved. To save the project’s current state we call the ProjectManager’s method to update its storage record. Without updating the project record all edits performed on the project structure, datasets and metadata capturing since the last save will be lost at the end of the session.

[ ]:
project.updateDBrecord()