ADMiniSter: a Python package to administer large numbers of text files with numerical data

ADMiniSter stands for “Ascii Data Minimalist Management Suite.” It is an open-source Python package that provides a collection of simple yet powerful tools to manage plain-text files storing numerical data. The package grew out of the tools, code snippets, and approaches that I have needed or found helpful over the years when dealing with numerical data, mostly but not exclusively stored in plain-text files. I realized that bundling them together would result in a handy, self-contained package and would help me keep those tools clean, sharp, and organized. Since they could be useful to other people too, I decided to publish the package as open source. Given its nature, it is permanently under development: I expect to add new utilities every now and then.

Despite their drawbacks in reading speed and information density, plain-text files remain a widely used form of numerical data storage in many fields, such as academic research. The reason is not only historical. Whenever performance or storage capacity is not a concern, storing numerical data in plain-text files has many advantages: the files are human-readable, they can be read by most applications, and, since the data is encoded as characters rather than in binary form, they are system-independent. For this last reason, their compatibility with future applications and machines is also all but guaranteed, which is of the utmost importance in, e.g., scientific research, where some findings might become especially relevant even decades later. All these reasons make plain-text data storage a desirable solution whenever performance allows it.

Currently, the ADMiniSter package consists of two modules:

csv_with_metadata

Long-term storage of numerical data requires context to make sense of that data. Adding metadata to the files can partially solve this problem by making the files self-descriptive. While common plain-text data formats such as JSON and XML can naturally handle metadata, the CSV format, which is especially convenient for numerical data storage, does not. Thus, different applications or users resort to their own ways of including metadata in CSV files as a header, which makes the metadata format non-universal and potentially laborious to parse and load into an application.

This module defines a format to store data and metadata in plain-text CSV files and provides the tools to easily create and read both. The data is stored as CSV, and the metadata is stored as a well-structured header. The header can be composed of an arbitrary number of sections, and each section stores either text or an arbitrary number of key-value pairs. Once loaded, the metadata in the header can be conveniently handled through dictionary-like interfaces.
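To make the idea concrete, here is a minimal, self-contained sketch of a CSV file with a sectioned metadata header. It does not use ADMiniSter's actual API or reproduce its exact header syntax: the "# " comment marker, the "[section]" lines, the "key = value" layout, and the function names dump and load are all assumptions made for illustration.

```python
import csv
import io

COMMENT = "# "  # assumed marker for header lines; not taken from the package


def dump(metadata, columns, rows):
    """Serialize metadata sections followed by CSV data into one string."""
    out = io.StringIO()
    for section, content in metadata.items():
        out.write(f"{COMMENT}[{section}]\n")
        if isinstance(content, dict):
            # key-value section
            for key, value in content.items():
                out.write(f"{COMMENT}{key} = {value}\n")
        else:
            # free-text section, one header line per line of text
            for line in str(content).splitlines():
                out.write(f"{COMMENT}{line}\n")
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(columns)
    writer.writerows(rows)
    return out.getvalue()


def load(text):
    """Parse a string produced by dump() into (metadata, columns, rows)."""
    raw = {}          # section name -> list of header lines
    section = None
    data_lines = []
    for line in text.splitlines():
        if not line.startswith(COMMENT):
            data_lines.append(line)
            continue
        body = line[len(COMMENT):]
        if body.startswith("[") and body.endswith("]"):
            section = body[1:-1]
            raw[section] = []
        elif section is not None:
            raw[section].append(body)

    metadata = {}
    for name, lines in raw.items():
        if lines and all(" = " in item for item in lines):
            # every line looks like "key = value": expose it as a dict
            metadata[name] = dict(item.split(" = ", 1) for item in lines)
        else:
            metadata[name] = "\n".join(lines)

    reader = csv.reader(data_lines)
    columns = next(reader)
    rows = [[float(x) for x in row] for row in reader]
    return metadata, columns, rows


metadata = {
    "experiment": {"temperature": "300", "units": "K"},
    "notes": "first run\nsample A",
}
text = dump(metadata, ["t", "x"], [[0.0, 1.5], [1.0, 2.5]])
loaded_metadata, columns, rows = load(text)
print(text)
print(loaded_metadata["experiment"]["temperature"])
```

In this sketch the metadata values come back as strings, and a section is treated as key-value pairs only if every one of its lines contains " = ". A real implementation would need to decide how to handle types and ambiguous lines.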

file_index

This module aims to provide tools to manage, locate, and process large numbers of data files simply and efficiently. At the same time, it seeks to work out of the box on most systems. The module achieves those goals by implementing a set of functions that leverage the capabilities of a so-called file index. The file index is a table that relates the paths of many data files to attributes characteristic of each file. The tools in this module help create file indices based on user-defined attribute-loader functions, which define how the attributes are to be extracted from the data files. Moreover, tools are provided for locating data files with attribute-based queries and for easily launching parallel analyses using user-defined functions.

Since the file index contains paths to the data files but not the data itself, it is typically lightweight and fast. It can, in many situations, efficiently replace more complex, heavier data management systems.
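The workflow described above (build an index with an attribute loader, query it, and process the matching files in parallel) can be sketched with the standard library alone. This is not the file_index module's API: the names build_index, locate, run_parallel, attributes_from_name, and mean_of_file are hypothetical, the "T<temperature>_run<n>.csv" file-naming scheme is invented for the example, and a plain list of dictionaries stands in for the index table. Threads are used here only to keep the sketch short; a CPU-bound analysis would more likely use processes.

```python
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor


def build_index(paths, attributes_loader):
    """Return one row per file: its path plus the attributes extracted from it."""
    return [{"path": p, **attributes_loader(p)} for p in paths]


def locate(index, **conditions):
    """Return the paths of the rows whose attributes match all conditions."""
    return [
        row["path"]
        for row in index
        if all(row.get(key) == value for key, value in conditions.items())
    ]


def run_parallel(paths, func, workers=4):
    """Apply func to each path concurrently; results keep the input order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(func, paths))


def attributes_from_name(path):
    """Hypothetical loader: parse 'T<temp>_run<n>.csv' into attributes."""
    stem = os.path.splitext(os.path.basename(path))[0]
    temp, run = stem.split("_")
    return {"T": int(temp[1:]), "run": int(run[3:])}


def mean_of_file(path):
    """Hypothetical analysis: average the single column of numbers in a file."""
    with open(path) as f:
        values = [float(line) for line in f if line.strip()]
    return sum(values) / len(values)


def demo():
    """Create three small data files, index them, and analyze those with T=300."""
    samples = [(300, 1, [1.0, 2.0]), (300, 2, [3.0, 5.0]), (350, 1, [10.0])]
    with tempfile.TemporaryDirectory() as directory:
        paths = []
        for temp, run, values in samples:
            path = os.path.join(directory, f"T{temp}_run{run}.csv")
            with open(path, "w") as f:
                f.write("\n".join(str(v) for v in values) + "\n")
            paths.append(path)

        index = build_index(paths, attributes_from_name)
        selected = locate(index, T=300)
        names = [os.path.basename(p) for p in selected]
        return names, run_parallel(selected, mean_of_file)


names, means = demo()
print(names, means)
```

Note that the index holds only paths and attributes, so querying it never opens a data file. The files are read only when the analysis function runs on the selected paths.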

Link to the repository: https://github.com/dfcastellanos/ADMiniSter/

August 21, 2021 David