
Software Modeling using Slicing and LLMs

This project explores the influence of model slicing on software model completion using Large Language Models (LLMs). We implement and evaluate several slicing techniques on two open-source datasets: ModelSet and Revision (RepairVision).

General Structure

  • supplementary_material.pdf contains the paper and additional appendix material for further details (appendix sections are located after the main paper — please scroll down)
  • data/ contains the ModelSet and Revision datasets, the predictions for all slicing approaches for LLaMA and GPT, and the final evaluation results
  • what_is_right_context_completion/ contains the implementation and scripts for context selection and completion

Implementation Details

Usage and Workflow

Step 1: Environment Configuration

Edit the .env file in the project root to configure the following key variables:

  • DATASET: Choose dataset type (MODELSET or REVISION).
  • RAW_MODELS_DIR: Specify the directory containing the raw, unflattened JSON models you downloaded. This directory is only required for preprocessing if your downloaded software models have not yet been flattened into a single folder.
  • MODELS_DIR: Specify the directory where the flattened software models will be stored. This directory also serves as parent directory for the incomplete models and ground truths, predictions, and evaluations. More detailed information about the structure of this directory can be found in the .env file.
  • LLM_NAME: Specify the LLM to use (e.g., llama3.1:8b)
  • LLM_PROVIDER: Specify the LLM provider (openai, open_web_ui, or huggingface)
  • API_KEY: Either your OpenAI API key, Open WebUI API key, or your HuggingFace authentication token (depending on the specified LLM provider)
  • SLICING_APPROACH: Specify the slicing strategy which should be evaluated (clustering, radius_based_slicing, louvain_based_slicing, random_slicing, complete_model, all)
  • RADIUS: Specify the radius in case radius-based slicing is used. For other slicing approaches, this variable has no effect.
  • CLUSTERING_LLM_NAME: Specify the LLM to use for generating clusters in case clustering is used. For other slicing approaches, this variable has no effect.
  • PARTITION_LEVEL: Specify the partitioning level in case louvain-based slicing is used (finest or coarsest). For other slicing approaches, this variable has no effect.

More detailed information about the environment variables can be found in the .env file.
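For illustration, a minimal .env for radius-based slicing might look like the following. All paths and the API key are placeholder values; the authoritative list of variables and their documentation lives in the project's .env file itself.

```
# Dataset and paths (placeholder values)
DATASET=MODELSET
RAW_MODELS_DIR=/path/to/raw_models
MODELS_DIR=/path/to/models

# LLM configuration
LLM_NAME=llama3.1:8b
LLM_PROVIDER=huggingface
API_KEY=<your-token>

# Slicing configuration
SLICING_APPROACH=radius_based_slicing
RADIUS=2
```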

Step 2: Preprocessing

Step 2.1: Flatten Downloaded Models (Optional)

If the downloaded software models in JSON format are nested in subfolders, flatten them into a single, non-nested directory using:

flatten-models-dir

Before running this command, make sure your .env file is correctly configured:

  • Set RAW_MODELS_DIR to the path of the nested input folder containing the downloaded models.
  • Set MODELS_DIR to the path where the flattened models should be saved.
  • Specify the dataset you are working with (MODELSET or REVISION) under DATASET.

Step 2.2: Create Incomplete Models and Ground Truths

Once the models are flattened, generate the incomplete models and the corresponding ground truths (i.e., the true missing parts) using:

preprocess

The resulting incomplete models and ground truth files will be stored in subfolders of the MODELS_DIR directory you specified in the .env file.
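Conceptually, preprocessing removes a part from each model and stores it separately as the ground truth. A simplified sketch of that idea, where the model is reduced to a list of node names (the real models are graphs in the datasets' JSON format, and the removed part may be a node or an edge):

```python
import random

def make_incomplete(model_nodes: list[str], seed: int = 0) -> tuple[list[str], str]:
    """Remove one node from the model; return (incomplete_model, ground_truth).

    Hypothetical sketch of the preprocessing idea, not the project's
    actual implementation.
    """
    rng = random.Random(seed)
    ground_truth = rng.choice(model_nodes)
    incomplete = [n for n in model_nodes if n != ground_truth]
    return incomplete, ground_truth
```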

Step 3: Slicing and Prediction

To generate predictions for the missing parts of the software models, run:

generate-predictions

Before executing the command, configure the .env file correctly:

  • Set SLICING_APPROACH to the slicing approach for which you want predictions to be generated. You can also set it to all to generate predictions for all implemented slicing approaches.
  • Predictions for each slicing approach will be saved in separate subfolders under the MODELS_DIR you specified in .env.
  • If you use radius-based slicing, specify the RADIUS value in .env.
  • If you use clustering, specify the CLUSTERING_LLM_NAME in .env to choose which LLM is used for clustering class names.

This will generate the predicted missing elements and store them alongside the incomplete models for later evaluation.

Resource usage statistics, such as token counts and input size reduction, are automatically collected during prediction generation and saved as JSON reports in the same directories as the corresponding predictions.

Step 4: Evaluation

To evaluate the predictions generated for the incomplete software models, run:

evaluate-predictions

Before executing this command, ensure that the SLICING_APPROACH in your .env file is set to the slicing approach you want to evaluate. You can also use all to evaluate predictions for all implemented slicing approaches.

For each prediction, several evaluation metrics are computed, including BLEU score, cosine similarity, structural correctness, length difference, and others. The results are saved as JSON reports in separate subfolders under the MODELS_DIR specified in .env, organized by slicing approach. Each JSON report also contains aggregated statistics such as the mean, standard deviation, and number of samples for each performance metric, which can be used for subsequent plotting.
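As an example of one of these metrics, cosine similarity between a prediction and its ground truth can be computed over bag-of-words token counts. This is a minimal sketch; the project's evaluation may tokenize or embed the texts differently.

```python
import math
from collections import Counter

def cosine_similarity(prediction: str, ground_truth: str) -> float:
    """Bag-of-words cosine similarity between two texts.

    Builds token-count vectors and returns their cosine; 1.0 means
    identical token distributions, 0.0 means no shared tokens.
    """
    a, b = Counter(prediction.split()), Counter(ground_truth.split())
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) \
        * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0
```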

Step 5: Plotting and Visualization

The results for RQ1 and RQ2 can be generated using whatisrightcontent/src/eval. This module provides scripts for visualizing prediction performance and analyzing trade-offs between context size and completion correctness. The scripts are organized according to their corresponding research questions. Additionally, scripts for the dataset statistics are provided in this directory.


Setup: Virtual Environment

  1. Install Python 3.12+ if you don't have it already. You can download it from python.org.

  2. Clone the repository and navigate to the project root.

  3. Create a virtual environment (recommended):

    python3 -m venv venv
    source venv/bin/activate
  4. Install the package locally:

    pip install .
  5. (Alternative) Install in development mode:

    pip install -e .

    This allows you to edit the source code and have changes take effect immediately.


Overview of Slicing Approaches

This section summarizes the slicing approaches implemented for extracting relevant parts of incomplete software models.

Clustering

In the clustering approach, all class names of the software model are provided to the LLM, which groups them into clusters based on domain or functionality. Using the metadata of the software model, we identify the class of the focused node (the predecessor of the missing node or edge). The resulting slice includes all classes that belong to the same cluster as the focused node, as determined by the LLM.
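Once the LLM has returned its clusters, selecting the slice reduces to finding the cluster that contains the focused node's class. A sketch of that selection step, assuming a hypothetical dict-of-lists cluster format (the actual LLM response format may differ):

```python
def cluster_slice(clusters: dict[str, list[str]], focused_class: str) -> list[str]:
    """Return all classes in the same LLM-generated cluster as the focused class."""
    for members in clusters.values():
        if focused_class in members:
            return members
    # Fall back to just the focused class if the LLM did not assign it
    # to any cluster.
    return [focused_class]
```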

Radius-Based Slicing

In the radius-based approach, the slice is constructed by including all nodes that are within a specified radius from the focused node, along with all edges connecting these nodes.
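This is essentially a breadth-first search bounded by the radius. A sketch on a plain adjacency-list graph (illustrative; the project operates on the datasets' model graphs):

```python
from collections import deque

def radius_slice(adjacency: dict[str, list[str]], focused: str, radius: int) -> set[str]:
    """Return all nodes within `radius` hops of the focused node (BFS)."""
    visited = {focused}
    frontier = deque([(focused, 0)])
    while frontier:
        node, dist = frontier.popleft()
        if dist == radius:
            continue  # do not expand beyond the radius
        for neighbor in adjacency.get(node, []):
            if neighbor not in visited:
                visited.add(neighbor)
                frontier.append((neighbor, dist + 1))
    return visited
```

The slice itself would then consist of these nodes plus all edges of the original model that connect two of them.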

Louvain Slicing

The Louvain method is a hierarchical community detection algorithm. Its goal is to find groups (communities) of nodes that maximize a quality function called modularity, which measures how dense the connections are within communities compared to connections between them. Among the subgraphs (partitions) created by the Louvain method, the slicing algorithm includes in the slice the subgraph containing the focused node.

Random Slicing (Baseline)

The random slicing approach serves as a baseline: it selects a random number of nodes from the software model and includes all edges that connect the chosen nodes in the slice.
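The baseline can be sketched as follows (illustrative only; the edge representation as node-name pairs is hypothetical):

```python
import random

def random_slice(nodes: list[str], edges: list[tuple[str, str]], seed: int = 0):
    """Pick a random subset of nodes and keep only edges between chosen nodes."""
    rng = random.Random(seed)
    k = rng.randint(1, len(nodes))       # random slice size
    chosen = set(rng.sample(nodes, k))   # random node subset
    kept_edges = [(u, v) for (u, v) in edges if u in chosen and v in chosen]
    return chosen, kept_edges
```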

Complete Model (Baseline)

As a baseline approach, complete model slicing includes the entire software model in the slice, without any reduction or filtering.
