SoilPulse project management
Work with SoilPulse is organized into so-called “projects”. One user can establish and maintain any number of projects, as long as the project names stay unique. A project is a collection of containers that represent data elements and structures extracted from the files included in the project.
Establishing new project
First import the soilpulsecore package and get a storage connection through NullConnector. The project structure and all project-related data will then be managed in a dedicated directory inside the user’s home directory, whose location depends on the operating system.
Storage connection
A storage must be set for the SoilPulse package to function properly. The primary purpose of the storage is handling the files included in the project. Storing the provided files (at least temporarily) is necessary for full content analysis, which includes unpacking file archives to access the packed files. Additionally, the project, containers, datasets and translation dictionaries are saved to this storage if filesystem storage (i.e. the NullConnector) is used.
A dedicated directory for SoilPulse projects is created automatically on first execution in the user’s home directory, as reported by the operating system.
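As a rough illustration of how such a per-project directory can be located independently of the operating system, here is a minimal sketch using only the Python standard library. The function name project_directory is hypothetical, and the layout (home directory, then SoilPulse, project_files and temp_<id>) is an assumption inferred from the local directory printed for the project later in this tutorial, not taken from the soilpulsecore source.

```python
from pathlib import Path


def project_directory(project_id: int, base_name: str = "SoilPulse") -> Path:
    """Build a per-project path under the current user's home directory.

    Assumed layout: <home>/SoilPulse/project_files/temp_<id>
    (inferred from this tutorial's printed output, not from soilpulsecore).
    """
    # Path.home() resolves the home directory on Windows, Linux and macOS
    return Path.home() / base_name / "project_files" / f"temp_{project_id}"


path = project_directory(5)
print(path)
```

The sketch only builds the path; it does not create any directory on disk.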
Another option is MySQLConnector, which requires additional prerequisites and is therefore covered by a separate tutorial.
New empty project
Let’s create a new project and name it “SoilPulse testing”.
note: a project can be established without a name; in that case soilpulsecore assigns a unique name automatically, and the name can be changed at any time
For projects stored in the filesystem (using NullConnector) the user_id parameter is not relevant and all projects are accessible to the local user.
[7]:
from soilpulsecore.db_access import NullConnector
from soilpulsecore.project_management import ProjectManager
from soilpulsecore.resource_managers import filesystem, data_structures, json
# get the filesystem storage connection
dbcon = NullConnector()
# create the ProjectManager instance and establish directories and files structure of the project
project = ProjectManager(dbcon, user_id=1, name="SoilPulse testing")
failed to load concept vocabulary 'AGROVOC' from 'vocabularies\agrovoc.json'
failed to load concept vocabulary 'TestConceptVocabulary' from 'vocabularies\_concepts_vocabulary_1.json'
failed to load method vocabulary 'TestMethodsVocabulary' from 'vocabularies\_methods_vocabulary_1.json'
loaded methods vocabularies:
failed to load units vocabulary 'TestUnitsVocabulary' from 'vocabularies\_units_vocabulary_1.json'
loaded units vocabularies:
doi: 'None'
Empty DOI provided. DOI metadata were not retrieved.
The printed messages report the loading status of soilpulsecore modules and other components, such as the vocabularies.
To check the project status we can also print the project itself.
[8]:
print(f"{project.id} - '{project.name}'")
project.name = "SoilPulse show case"
print(project)
5 - 'SoilPulse testing'
=== Project #5 ======================================================================
name: SoilPulse show case
local directory: C:\Users\jande\SoilPulse\project_files\temp_5
keep stored files: yes
space occupied: 127.0 B
no DOI assigned
==========================================================================================
Loading data to project
To process data we must first load them into the project. Files can be added to the project in three basic ways:
upload of a file from user’s computer by local path
download from internet resources by URL or list of URLs
download of a data-package available through DOI record (currently implemented for Datacite.org) and data-publisher record (currently implemented for Zenodo)
[9]:
project.uploadFilesFromSession("d:\\downloads\\runoffdb_excerpt.csv")
Container 1 was already analyzed.
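To make the distinction between the three kinds of sources listed above concrete, the following sketch classifies a source string as a DOI, a URL or a local path. It is an illustration only: classify_source and DOI_PATTERN are hypothetical names, the DOI pattern is deliberately loose, and soilpulsecore may tell the cases apart differently. The DOI suffix used in the example is a placeholder.

```python
import re
from urllib.parse import urlparse

# Loose DOI shape: "10.", a registrant code of 4-9 digits, "/", then a suffix
DOI_PATTERN = re.compile(r"^10\.\d{4,9}/\S+$")


def classify_source(source: str) -> str:
    """Return 'doi', 'url' or 'local path' for a file source string."""
    if DOI_PATTERN.match(source):
        return "doi"
    parsed = urlparse(source)
    # Require a known scheme and a host so that "d:\\..." is not taken for a URL
    if parsed.scheme in ("http", "https", "ftp") and parsed.netloc:
        return "url"
    return "local path"


for src in (
    "d:\\downloads\\runoffdb_excerpt.csv",
    "https://example.org/data/runoffdb_excerpt.csv",
    "10.5281/zenodo.0000000",  # placeholder DOI, not a real record
):
    print(f"{src} -> {classify_source(src)}")
```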
Loaded data structure analysis
The uploaded file is analyzed right after uploading to obtain its inner structure. Each data element loaded into the project structure is represented as a “container”. Containers can have various types depending on the recognized file or data type. The containers are organized hierarchically: the top-level element is the project and each container may contain more containers.
To see the current structure of the project’s containers we can use ProjectManager’s method showContainerTree, which prints all containers in the project’s structure. Each container is represented by one line, with its ID (unique within the project) shown in front of the container details.
[10]:
project.showContainerTree()
================================================================================
SoilPulse show case
container tree:
--------------------------------------------------------------------------------
1 - runoffdb_excerpt.csv (file, 56.2 kB, 22.11.2024/22.11.2024) [1] >root
. 2 - runoffdb_excerpt (table - table) [20] ^1
. . 3 - locality (column - column) [0] ^2
. . 4 - latitude (column - column) [0] ^2
. . 5 - longitude (column - column) [0] ^2
. . 6 - run ID (column - column) [0] ^2
. . 7 - date (column - column) [0] ^2
. . 8 - plot ID (column - column) [0] ^2
. . 9 - simulator (column - column) [0] ^2
. . 10 - crop (column - column) [0] ^2
. . 11 - crop type (column - column) [0] ^2
. . 12 - initial cond. (column - column) [0] ^2
. . 13 - init. moisture (column - column) [0] ^2
. . 14 - canopy cover (column - column) [0] ^2
. . 15 - BBCH (column - column) [0] ^2
. . 16 - rain intensity [mm.h^-1] (column - column) [0] ^2
. . 17 - time to runoff (column - column) [0] ^2
. . 18 - bulk density [g.cm^-3] (column - column) [0] ^2
. . 19 - total time [s] (column - column) [0] ^2
. . 20 - total rainfall [mm.h^-1] (column - column) [0] ^2
. . 21 - total discharge [l] (column - column) [0] ^2
. . 22 - total soil loss [g] (column - column) [0] ^2
================================================================================
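The hierarchy printed above can be modelled as a simple tree. The sketch below is not the soilpulsecore implementation; the Container class and its fields are hypothetical. It assumes that the number in square brackets is the count of direct subcontainers and that the ^N suffix refers to the parent container's ID, which is how the printed tree appears to read.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Container:
    """Hypothetical stand-in for a SoilPulse container."""
    id: int
    name: str
    kind: str
    # Exclude the parent from repr/eq to avoid infinite recursion
    parent: Optional["Container"] = field(default=None, repr=False, compare=False)
    children: List["Container"] = field(default_factory=list)

    def add(self, child: "Container") -> "Container":
        """Attach a subcontainer and return it for chaining."""
        child.parent = self
        self.children.append(child)
        return child

    def tree_lines(self, depth: int = 0) -> List[str]:
        """Return one line per container, indented by nesting depth."""
        parent_ref = f"^{self.parent.id}" if self.parent else ">root"
        lines = [
            f"{'. ' * depth}{self.id} - {self.name} ({self.kind}) "
            f"[{len(self.children)}] {parent_ref}"
        ]
        for child in self.children:
            lines.extend(child.tree_lines(depth + 1))
        return lines


file_container = Container(1, "runoffdb_excerpt.csv", "file")
table = file_container.add(Container(2, "runoffdb_excerpt", "table"))
table.add(Container(3, "locality", "column"))
table.add(Container(4, "latitude", "column"))
print("\n".join(file_container.tree_lines()))
```

Because every container keeps its own list of children, taking the table container out of the tree still gives access to all of its columns.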
Using datasets
Though it is possible to work with all data on the project level, it is useful to structure related resources in a dataset. A new dataset is created within a project by specifying its name:
[11]:
ds = project.createDataset("CTU RunoffDB excerpt")
A newly created dataset is empty, and containers with resources must be added to it. A container can, for example, be obtained from the project by its ID.
note: it doesn’t really matter whether the file container or the table container within it is added, as all procedures should work for both arrangements
[12]:
table = project.getContainerByID(2)
ds.addContainer(table)
The container added to the dataset still contains all of its subcontainers. We can check that by printing the contents of the dataset.
[13]:
ds.showContents()
==== CTU RunoffDB excerpt ============================================================ #1
---- container tree: ----
2 - runoffdb_excerpt (table - table) [20] ^1
. 3 - locality (column - column) [0] ^2
. 4 - latitude (column - column) [0] ^2
. 5 - longitude (column - column) [0] ^2
. 6 - run ID (column - column) [0] ^2
. 7 - date (column - column) [0] ^2
. 8 - plot ID (column - column) [0] ^2
. 9 - simulator (column - column) [0] ^2
. 10 - crop (column - column) [0] ^2
. 11 - crop type (column - column) [0] ^2
. 12 - initial cond. (column - column) [0] ^2
. 13 - init. moisture (column - column) [0] ^2
. 14 - canopy cover (column - column) [0] ^2
. 15 - BBCH (column - column) [0] ^2
. 16 - rain intensity [mm.h^-1] (column - column) [0] ^2
. 17 - time to runoff (column - column) [0] ^2
. 18 - bulk density [g.cm^-3] (column - column) [0] ^2
. 19 - total time [s] (column - column) [0] ^2
. 20 - total rainfall [mm.h^-1] (column - column) [0] ^2
. 21 - total discharge [l] (column - column) [0] ^2
. 22 - total soil loss [g] (column - column) [0] ^2
================================================================================
Content analysis and metadata capturing
Vocabularies
Vocabularies are meant as more or less exhaustive lists of terms that can be unambiguously referenced by a URI. As the only available example right now, the AGROVOC controlled vocabulary is used for identifying concepts. Only an excerpt from AGROVOC is distributed with soilpulsecore.
Translation dictionaries
Dictionaries are used as ad-hoc collections of translations, i.e. one or more terms assigned to a character string. In fact, a container’s concepts, methods and units are held in a translation dictionary. Project or dataset translation dictionaries are generated from the containers that belong to them. Dictionaries can be imported to a project from disk, and project dictionaries can be exported to a file, so a translation collection from one project can be used within another project.
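The idea of reusing translations between projects can be sketched with plain Python dictionaries. This is an illustration under assumptions, not the soilpulsecore data model: merge_translations is a hypothetical helper, each term is represented here as a small dict with vocabulary and uri keys, and the URIs are example.org placeholders.

```python
import json


def merge_translations(target: dict, incoming: dict) -> dict:
    """Add translations from `incoming` to `target`, skipping known terms."""
    for string, terms in incoming.items():
        known = target.setdefault(string, [])
        for term in terms:
            if term not in known:
                known.append(term)
    return target


crop_term = {"vocabulary": "ExampleVocab", "uri": "https://example.org/terms/crop"}
date_term = {"vocabulary": "ExampleVocab", "uri": "https://example.org/terms/date"}

# Translations collected in one project, exported as JSON text
project_a = {"crop": [crop_term], "date": [date_term]}
exported = json.dumps(project_a)

# A second project already knows "crop"; importing adds only what is missing
project_b = {"crop": [dict(crop_term)]}
merge_translations(project_b, json.loads(exported))
print(json.dumps(project_b, indent=2))
```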
To perform content analysis (“crawling”) based on our own translation dictionaries we must first load them into the project. To verify that the translations were loaded properly we may print the content of the dictionaries in a structured manner using ProjectManager’s showDictionaries method.
[14]:
project.updateConceptsTranslationsFromFile("d:\\downloads\\runoffdb_concepts.json")
project.updateMethodsTranslationsFromFile("d:\\downloads\\runoffdb_methods.json")
# project.showDictionaries()
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
Cell In[14], line 1
----> 1 project.updateConceptsTranslationsFromFile("d:\\downloads\\runoffdb_concepts.json")
2 project.updateMethodsTranslationsFromFile("d:\\downloads\\runoffdb_methods.json")
3 # project.showDictionaries()
File C:\Python311\Lib\site-packages\soilpulsecore\project_management.py:783, in ProjectManager.updateConceptsTranslationsFromFile(self, input_file)
778 """
779 Adds string-concepts translations to project's dictionary (if not already there) from specified file
780 :param input_file: path of a file to load
781 """
782 # load the input JSON file to dictionary
--> 783 str_conc_dict = self.loadTranslationsFromFile(input_file)
784 updateTranslationsDictionary(self.conceptsTranslations, str_conc_dict)
785 return
File C:\Python311\Lib\site-packages\soilpulsecore\project_management.py:820, in ProjectManager.loadTranslationsFromFile(self, input_file)
818 try:
819 for str in json.load(f):
--> 820 str_dict.update({str['string']: str["translation"]})
821 except KeyError as e:
822 print(f"Translations dictionary '{input_file}' failed to load.")
TypeError: string indices must be integers, not 'str'
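The loop shown in the traceback indexes every item of the parsed JSON with 'string' and 'translation', so it appears to expect a JSON array of objects carrying those two keys. A TypeError like the one above is what Python raises when the items are plain strings instead, for example when the top level of the file is an object (iterating it yields its keys). The sketch below reproduces that loop on in-memory JSON with an explicit shape check. The helper load_translations and the sample values are hypothetical, and the real content of the "translation" field is not shown in this tutorial.

```python
import json


def load_translations(text: str) -> dict:
    """Parse a JSON array of {"string": ..., "translation": ...} objects."""
    data = json.loads(text)
    is_list_of_objects = isinstance(data, list) and all(
        isinstance(item, dict) for item in data
    )
    if not is_list_of_objects:
        raise ValueError(
            "expected a JSON array of objects with 'string' and 'translation' keys"
        )
    return {item["string"]: item["translation"] for item in data}


# Shape the loader in the traceback can iterate over (values are made up)
good = '[{"string": "crop", "translation": ["example term"]}]'
# Top-level object: iterating it yields the key "crop", a plain string
bad = '{"crop": ["example term"]}'

print(load_translations(good))
try:
    load_translations(bad)
except ValueError as err:
    print(f"rejected: {err}")
```

Checking a dictionary file against this shape before calling updateConceptsTranslationsFromFile is one way to narrow down the cause of the error.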
To include an additional vocabulary that will be used for finding matching units, we add it to the storage connection’s units vocabularies:
[ ]:
dbcon.units_vocabularies.update({"units_in_brakets": dbcon.loadVocabularyFromFile("d:\\downloads\\bracket_units_vocab.json")})
Now concepts/parameters, methods and units can be searched for in the dataset. The force parameter makes the crawler process a container even if it was already crawled; previous values will be overwritten.
[ ]:
ds.getCrawled(force=True)
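The effect of a force flag can be illustrated with a small stand-alone class. This is only a sketch of the general pattern, assuming the crawler remembers whether a container was already processed; CrawlableContainer and get_crawled are hypothetical names, and the lookup functions stand in for the vocabulary and dictionary search.

```python
class CrawlableContainer:
    """Hypothetical container that remembers its crawling result."""

    def __init__(self, name):
        self.name = name
        self.crawled = False
        self.concepts = []

    def get_crawled(self, find_concepts, force=False):
        # Skip the search if a result exists and no re-crawl was requested
        if self.crawled and not force:
            return self.concepts
        # Otherwise search again and overwrite the previous values
        self.concepts = find_concepts(self.name)
        self.crawled = True
        return self.concepts


column = CrawlableContainer("crop")
first = column.get_crawled(lambda name: [f"{name} (old dictionary)"])
kept = column.get_crawled(lambda name: [f"{name} (new dictionary)"])
forced = column.get_crawled(lambda name: [f"{name} (new dictionary)"], force=True)
print(first, kept, forced)
```

Without force the second call returns the stored result, which is why a re-crawl after loading new dictionaries needs force=True.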
To check the results of the vocabulary and dictionary search we can print the dataset’s contents again.
[ ]:
ds.showContents()
Saving project to storage
The project lives only in memory until it’s saved. To save the project’s current state we call the ProjectManager’s method to update its storage record. Without updating the project record all edits performed on the project structure, datasets and metadata capturing since the last save will be lost at the end of the session.
[ ]:
project.updateDBrecord()