Parser Configuration
This guide explains how to configure and customize path parsers in StarryNight to work with your own data organization.
Understanding Path Parsers
StarryNight uses a grammar-based path parsing system to extract structured metadata from file paths.
flowchart LR
Files["Raw File Paths"] -->|Grammar Rules| Parser["Path Parser"]
Parser -->|Transformer| Metadata["Structured Metadata"]
Metadata --> Index["Index Generation"]
classDef default stroke:#333,stroke-width:1px;
How Path Parsing Works
The StarryNight parser consists of three components:
- Lexer - Breaks the file path into tokens using regular expressions
- Grammar Rules - Defines token combinations and organization (in a .lark file)
- Transformer - Converts the parsed structure into a Python dictionary
This architecture enables flexible, robust parsing without relying on brittle string splitting.
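As a rough illustration of how the three components hand work to one another, the sketch below mimics them with Python's standard re module instead of Lark. The toy layout (plate/well_channel.ext), the token names, and the functions tokenize, parse, and transform are assumptions made for this example; none of them are part of StarryNight.

```python
import re

# Stage 1 (lexer): named regular expressions that split a path into tokens.
TOKEN_RE = re.compile(
    r"(?P<SEP>/)|(?P<UNDERSCORE>_)|(?P<DOT>\.)|(?P<WORD>[A-Za-z0-9]+)"
)


def tokenize(path):
    """Return a list of (token_type, text) pairs for the whole path."""
    tokens = []
    pos = 0
    while pos < len(path):
        match = TOKEN_RE.match(path, pos)
        if match is None:
            raise ValueError(f"unexpected character {path[pos]!r} at position {pos}")
        tokens.append((match.lastgroup, match.group()))
        pos = match.end()
    return tokens


# Stage 2 (grammar): the token order expected for the hypothetical layout
# plate/well_channel.ext. A field name of None marks a purely structural token.
GRAMMAR = [
    ("WORD", "plate_id"),
    ("SEP", None),
    ("WORD", "well_id"),
    ("UNDERSCORE", None),
    ("WORD", "channel"),
    ("DOT", None),
    ("WORD", "extension"),
]


def parse(tokens):
    """Check the tokens against GRAMMAR and keep the named ones."""
    if len(tokens) != len(GRAMMAR):
        raise ValueError("path does not match the expected structure")
    tree = []
    for (kind, text), (expected, field) in zip(tokens, GRAMMAR):
        if kind != expected:
            raise ValueError(f"expected {expected}, got {kind} ({text!r})")
        if field is not None:
            tree.append((field, text))
    return tree


# Stage 3 (transformer): turn the parsed structure into a dictionary.
def transform(tree):
    return dict(tree)


metadata = transform(parse(tokenize("Plate1/A01_DAPI.tiff")))
print(metadata)
```

A real Lark grammar replaces the fixed GRAMMAR list with recursive rules, which is what lets one grammar cover several directory layouts.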
The Default "Vincent" Parser
StarryNight's default parser handles file paths with this structure, which is commonly produced by Phenix imaging systems:
[dataset]/[source_id]/[batch_id]/images/[plate_id]/[experiment_id]/Well[well_id]_Point[site_id]_[index]_Channel[channels]_Seq[sequence].ome.tiff
Example:
MyDataset/Source1/Batch1/images/Plate1/20X_CP_Plate1/WellA01_PointA01_0_ChannelDAPI,AF488,AF647_Seq0.ome.tiff
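To make the bracketed fields concrete, here is a minimal sketch that pulls them out of the example path with a single regular expression. It is a stand-in for illustration only: the real parser is grammar-based, covers more layouts than this one, and may name or type its output differently. The function extract_fields and the decision to split channels on commas are assumptions.

```python
import re

# Mirrors only the single image-path pattern shown above.
VINCENT_LIKE = re.compile(
    r"^(?P<dataset>[^/]+)/(?P<source_id>[^/]+)/(?P<batch_id>[^/]+)/images/"
    r"(?P<plate_id>[^/]+)/(?P<experiment_id>[^/]+)/"
    r"Well(?P<well_id>[^_]+)_Point(?P<site_id>[^_]+)_(?P<index>\d+)"
    r"_Channel(?P<channels>[^_]+)_Seq(?P<sequence>\d+)\.ome\.tiff$"
)


def extract_fields(path):
    """Return a dict of fields, or None when the path does not match."""
    match = VINCENT_LIKE.match(path)
    if match is None:
        return None
    fields = match.groupdict()
    # Assumption: channels are comma-separated within the file name.
    fields["channels"] = fields["channels"].split(",")
    return fields


example = (
    "MyDataset/Source1/Batch1/images/Plate1/20X_CP_Plate1/"
    "WellA01_PointA01_0_ChannelDAPI,AF488,AF647_Seq0.ome.tiff"
)
fields = extract_fields(example)
for name, value in fields.items():
    print(f"{name}: {value}")
```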
The parser handles variations like:
- Sequencing-by-synthesis (SBS) folders vs Cell Painting (CP) folders
- Aligned images vs raw images
- Metadata files vs image files
- Illumination files
Understanding the Grammar File
The default grammar file (path_parser_vincent.lark) defines rules for parsing:
// Top-level rule - starting point for parsing
start: sep? dataset_id sep source_id sep _root_dir
// Directory structure rules
_root_dir: batch_id sep (_images_root_dir | _illum_root_dir | _images_aligned_root_dir | _workspace_root_dir)
_images_root_dir: "images"i sep plate_id sep _plate_root_dir
...
Rules prefixed with underscore (e.g., _root_dir) are internal structural rules that don't map to output metadata fields. Rules without underscores become fields in the output.
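The effect of the underscore convention can be pictured with a toy model. The tree below is written by hand using rule names from the grammar excerpt. It is not real Lark output, and collect_fields is a hypothetical helper, not StarryNight's transformer.

```python
def collect_fields(node, out=None):
    """Walk a (rule_name, children) tree and record only non-underscore rules.

    Leaves are plain strings. Underscore-prefixed rules are traversed for
    their children but never produce a key of their own.
    """
    if out is None:
        out = {}
    name, children = node
    is_leaf_rule = all(isinstance(child, str) for child in children)
    if is_leaf_rule and not name.startswith("_"):
        out[name] = "".join(children)
    else:
        for child in children:
            if not isinstance(child, str):
                collect_fields(child, out)
    return out


# Hand-built tree for illustration only.
tree = (
    "start",
    [
        ("dataset_id", ["MyDataset"]),
        ("source_id", ["Source1"]),
        (
            "_root_dir",
            [
                ("batch_id", ["Batch1"]),
                ("_images_root_dir", [("plate_id", ["Plate1"])]),
            ],
        ),
    ],
)

fields = collect_fields(tree)
print(fields)
```

In this model, batch_id and plate_id reach the output even though they sit beneath _root_dir and _images_root_dir, while the two underscore rules leave no trace.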
Customizing the Parser
When to Create a Custom Parser
You'll need a custom parser when:
- Your file organization differs from the default pattern
- You need to extract different metadata fields
- You have a unique naming convention
Specifying a Custom Parser
Specify a custom parser with the CLI:
starrynight index gen \
-i ./workspace/inventory/inventory.parquet \
-o ./workspace/index/ \
--parser /path/to/custom/parser.lark
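When index generation is driven from a script, the same command can be assembled in Python. This is a minimal sketch: build_index_command is a hypothetical helper, and it uses only the flags shown above (-i, -o, --parser).

```python
import shlex


def build_index_command(inventory, out_dir, grammar):
    """Assemble the 'starrynight index gen' call as an argument list."""
    return [
        "starrynight", "index", "gen",
        "-i", str(inventory),
        "-o", str(out_dir),
        "--parser", str(grammar),
    ]


cmd = build_index_command(
    "./workspace/inventory/inventory.parquet",
    "./workspace/index/",
    "/path/to/custom/parser.lark",
)

# Print the command for review. To execute it where StarryNight is
# installed, pass the list to subprocess.run(cmd, check=True).
print(shlex.join(cmd))
```

Keeping the command as a list rather than a single string avoids shell-quoting problems when paths contain spaces.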
Creating a Custom Grammar File
To create a custom parser:
- Document your file patterns and identify metadata components to extract
- Write a .lark file that defines the path structure
- Test your grammar against sample paths
- Use it in your workflow with the --parser parameter
Example: Custom Grammar File
Example grammar for a different file organization:
// Custom grammar for example_lab file organization
start: sep? project_name sep experiment_name sep plate_id sep _image_file
_image_file: well_id "_" site_id "_" channel "_" cycle_id "." extension
project_name: stringwithdashcommaspace
experiment_name: stringwithdashcommaspace
plate_id: string
well_id: (LETTER | DIGIT)~2
site_id: DIGIT~1..4
channel: stringwithdash
cycle_id: DIGIT~1..2
extension: stringwithdots
sep: "/"
string: (LETTER | DIGIT)+
stringwithdash: (string | "-")+
stringwithdashcommaspace: ( string | "-" | "_" | "," | " " )+
stringwithdots: ( string | "." )+
DIGIT: "0".."9"
%import common.LETTER
This grammar is meant to parse paths like:
MyProject/Experiment-2023-05/Plate1/A1_01_DAPI_01.tiff
Testing Custom Parsers
- Use the Lark Parser IDE: Test your grammar in the Lark Parser IDE to visualize parse trees.
- Test with sample paths:
from lark import Lark
from lark.exceptions import LarkError

# Load the grammar, then try each sample path
parser = Lark.open('/path/to/grammar.lark', parser='lalr')
paths = ['MyProject/Experiment-2023-05/Plate1/A1_01_DAPI_01.tiff']
for path in paths:
    try:
        tree = parser.parse(path)
        print(f"✓ Parsed: {path}")
        print(tree.pretty())  # show the parse tree for inspection
    except LarkError as e:
        print(f"✗ Failed: {path} - {e}")
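It also helps to test paths that should be rejected, so that an overly permissive grammar is caught early. The sketch below is one way to do that. check_paths is a hypothetical helper that accepts any parse callable, and parse_path is a regex stand-in for the example_lab layout, used here only so the snippet runs without a grammar file.

```python
import re

# Stand-in for project/experiment/plate/well_site_channel_cycle.extension.
CUSTOM_LAYOUT = re.compile(
    r"^/?(?P<project_name>[A-Za-z0-9_, -]+)/(?P<experiment_name>[A-Za-z0-9_, -]+)/"
    r"(?P<plate_id>[A-Za-z0-9]+)/"
    r"(?P<well_id>[A-Za-z0-9]{2})_(?P<site_id>\d{1,4})_"
    r"(?P<channel>[A-Za-z0-9-]+)_(?P<cycle_id>\d{1,2})\.(?P<extension>[A-Za-z0-9.]+)$"
)


def parse_path(path):
    """Return the matched fields or raise ValueError."""
    match = CUSTOM_LAYOUT.match(path)
    if match is None:
        raise ValueError(f"no match for {path!r}")
    return match.groupdict()


def check_paths(parse, cases):
    """Return the paths whose outcome differs from the expectation.

    cases is a list of (path, should_parse) pairs.
    """
    mismatches = []
    for path, should_parse in cases:
        try:
            parse(path)
            parsed = True
        except Exception:
            parsed = False
        if parsed != should_parse:
            mismatches.append(path)
    return mismatches


cases = [
    ("MyProject/Experiment-2023-05/Plate1/A1_01_DAPI_01.tiff", True),
    ("MyProject/Experiment-2023-05/Plate1/A1_01_DAPI.tiff", False),  # no cycle
    ("MyProject/Plate1/A1_01_DAPI_01.tiff", False),  # no experiment directory
]

mismatches = check_paths(parse_path, cases)
print("unexpected results:", mismatches)
```

To exercise a real grammar, pass the parse method of a loaded Lark parser as the first argument in place of parse_path.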
Parser Architecture
The parser works through three layers:
- Lexer: Tokenizes paths using regexes (uppercase terminals like DIGIT)
- Parser: Builds a tree using grammar rules (lowercase rules like well_id)
- Transformer: Maps the parse tree to a metadata dictionary (handles special cases)
Best Practices
When creating parsers:
- Start simple - Begin with basic grammar and add complexity as needed
- Test thoroughly - Validate with diverse file paths
- Consider performance - Complex parsers can slow index generation
- Document your schema - Record the file organization pattern your grammar expects
- Separate concerns:
  - Lexer for basic pattern matching
  - Grammar for structural relationships
  - Transformer for conversion logic
Troubleshooting
Common issues:
- Parsing errors: Check the grammar rules, test in the Lark Parser IDE, and relax rules that are too strict for your paths
- Missing metadata: Ensure grammar extracts all needed fields with matching names
- Performance issues: Simplify complex rules, reduce nesting, move pattern matching to lexer
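For the missing-metadata case, a small check run on parsed output can show which fields a grammar fails to produce. This is a sketch under assumptions: missing_fields is a hypothetical helper, and the required field names are placeholders borrowed from the example grammar, not the list StarryNight's index generation expects.

```python
def missing_fields(metadata, required):
    """Return the required field names that are absent or empty."""
    present = {key for key, value in metadata.items() if value not in (None, "")}
    return sorted(set(required) - present)


# Placeholder field names; substitute the ones your workflow needs.
required = ["plate_id", "well_id", "site_id", "channel"]
parsed = {"plate_id": "Plate1", "well_id": "A1", "channel": ""}

missing = missing_fields(parsed, required)
print("missing:", missing)
```

Here site_id is reported because the key is absent, and channel because its value is empty.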
Using Your Custom Parser
Use your parser in the index generation step:
starrynight index gen \
-i ./workspace/inventory/inventory.parquet \
-o ./workspace/index/ \
--parser /path/to/custom/parser.lark
Validate results by examining the index.parquet file.
Custom Transformers
Creating custom transformers requires modifying source code. For most users, a custom grammar file provides sufficient flexibility.
For Document Contributors
Guidelines for maintaining this document:
Audience: Users adapting StarryNight to non-standard file organization with sufficient technical knowledge of grammar files, and developers extending functionality.
Organization Principles:
- Progressive disclosure (basics → advanced)
- Practical, functional examples
- Implementation details for extensibility
Style Guidelines:
- Consistent command formatting
- Define technical terms at first use
- Prioritize practical guidance over theory
- Use real-world examples
Related Docs: Builds on Getting Started, complements Complete Workflow Example, references Architecture Overview for advanced details.